
Illuminating the black box: Encoding of phonology in a recurrent neural model of grounded natural speech

Jeroen van der Weijden STUDENT NUMBER: 1261593

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE SCHOOL OF HUMANITIES AND DIGITAL SCIENCES TILBURG UNIVERSITY

Thesis committee:

Dr. Grzegorz Chrupala Dr. Henry Brighton

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
January 2020


Preface

This thesis is dedicated to my grandmother, Mien Siebers, who passed away during the construction of this work. I want to thank my family and friends for their love, and Jenske Vermeulen for her unwavering support and for believing in me. I would also like to thank my thesis advisor, Dr. Grzegorz Chrupala, for his help and patience, as well as for training the model and providing crucial code. This work proved both a challenge and a leap into the deep end, two things which often seem to coincide.


Contents

1. Introduction
2. Related work
   2.1 Deep multi-modal models
   2.2 Model analyses
   2.3 Natural language acquisition
3. Experimental setup
   3.1 Data
   3.2 Model
       Model architecture
       Model settings
   3.3 Experiments
       Experiment 1 – Phoneme discrimination
       Experiment 2 – Synonym discrimination
4. Results
   4.1 Experiment 1
   4.2 Experiment 2
5. Discussion
   5.1 Experiment 1
   5.2 Experiment 2
   5.3 Limitations
6. Conclusion
References
Appendix A


Illuminating the black box: encoding of phonology in a recurrent model of grounded natural speech

Jeroen van der Weijden

In this work we investigate how phonology is encoded in a recurrent neural model of grounded natural speech. In a weakly supervised automatic speech recognition learning task, this model takes images and spoken descriptions of said images, and projects them to a joint semantic space. Previous work has found that encoding of phonology is most prevalent in lower layers, whereas further up the hierarchy of layers in the model semantic aspects of the language input are more prevalent (Alishahi, Barking, & Chrupala, 2017). However, these analyses have only been conducted using a dataset of synthetically generated spoken captions. The present work aims to validate their findings using natural speech, specifically the MIT Places Audio Caption Corpus (Harwath et al., 2018) with additional stimuli generated by the Google WaveNet TTS (Oord et al., 2016). In a series of two experiments we are able to confirm the findings of previous research. Furthermore, we show that the noisiness of natural speech can benefit the encoding of phonology, and that longer sentences compel the model to attenuate encoding of form on the word level. These results extend the limited body of work which focuses on peeking inside the black box of deep-learning techniques, and open up new possible avenues for future research.

1. Introduction

Decoding and understanding natural language has been a key focus of machine learning. The rise of deep learning techniques has vastly accelerated progress in this area (Belinkov & Glass, 2019). The effects of this progress can readily be seen in the world today: at the time of writing, the four largest voice assistants (Google Assistant, Apple Siri, Microsoft Cortana, Amazon Alexa) are installed on a combined 2 billion devices, with at least 520 million active users (Kinsella, 2018; Bohn, 2019; Kinsella, 2019). These voice assistants rely on Automatic Speech Recognition (ASR) in order to function. While the performance of neural networks on speech recognition tasks like those used for voice assistants has been impressive, many of these networks require large amounts of transcribed speech signals, which makes them very time and resource intensive. While large companies such as Amazon or Apple are able to absorb these costs, this might not be the case for many other companies, nor is it a given that enough data is available to make ASR work for any particular language. Particularly for under-resourced languages, ASR can be of critical importance by improving documentation efforts and facilitating communication with speakers of less-prevalent languages, so that, for example, humanitarian workers are able to communicate with disaster-struck populations. The use of weakly supervised or unsupervised learning methods for neural networks is therefore preferred, especially in cases where no large datasets are available (Le et al., 2009; Besacier et al., 2014; Michel et al., 2016). Recent studies have


proposed such a weakly supervised learning method for training neural networks in natural language processing (NLP) tasks (Noda et al., 2015; Harwath et al., 2015; Harwath et al., 2016; Gelderloos & Chrupala, 2016). This technique relies on a network learning the semantic space between photographic images of everyday scenes and spoken descriptions of said images. This ‘visually grounded speech perception’ model allows a network to create a semantic representation of the spoken descriptions. One of the benefits of such a model is that it requires less supervision to train, making it possibly cheaper and easier to successfully implement when data is scarce. Since its inception, several network architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used for implementing the grounded speech model (Noda et al., 2015; Harwath et al., 2016; Gelderloos & Chrupala, 2016; Chaabouni et al., 2017). As of yet, one of the most successful implementations has been a multi-layer gated recurrent neural network (gRNN) constructed by Chrupala et al. (2017), which managed to show significant improvements over convolutional neural network implementations used by Harwath and Glass (2015). However, with the rise of deep learning techniques and their accompanying success, more and more questions are raised as to the inner workings of these deep learning models. Many works argue for an increase in accountability of machine learning systems, as well as more interpretability of the neural networks (Lipton, 2016; Doshi-Velez et al., 2017). Despite the recognized importance of these issues, the ability to interpret and explain NLP predictions of neural networks is still a work in progress (Belinkov & Glass, 2019). Furthermore, there is a growing field which uses analyses of these models to look at the similarities between neural networks and brain activity. At this intersection of neuroscience, psycholinguistics, and deep learning, research looks into whether, and how well, certain neural networks mimic the workings of the brain. In this emerging field, analysing the internal workings of these neural networks could offer non-invasive ways to investigate the brain, as well as give more insight into linguistic theories and the deep learning models themselves (Ter Schure et al., 2016; Chaabouni et al., 2017; Bashivan et al., 2019). Both the work of Chrupala et al. (2017) and subsequent research by Alishahi, Barking, and Chrupala (2017) have made efforts to analyse how the grounded speech perception gRNN model encodes the language inputs. Chrupala et al. (2017) found that semantic aspects of the language input were more prevalent higher up in the hierarchy of layers, whereas the encoding of form tended to initially increase, and then plateau or even decrease after the first few layers of the model. The finer-grained analyses conducted by Alishahi et al. (2017), which took the phoneme as unit of analysis, affirmed these findings, but also found similarities between natural language acquisition and the workings of the model. Unfortunately, these analyses were conducted on, and with a model trained on, largely synthetically generated speech, created using the Google Text-to-Speech API on


the original image descriptions of the COCO dataset (Lin et al., 2014; Havard, Besacier, & Rosec, 2017). While this method creates a relatively high-quality speech signal, it is no substitute for natural human speech signals, which are notably more noisy and messy. Given the nature of the model as a weakly supervised ASR system, as well as its potential for giving neuroscientific and psycholinguistic insight into how humans use and acquire language, it is crucial to do fine-grained analyses of the grounded speech perception gRNN model on natural human speech. The present research, therefore, aims to replicate the analyses of Alishahi et al. (2017), but on natural speech instead of purely synthetic speech, thereby achieving a finer understanding of the real-world performance and workings of this model. The research question of this thesis is as follows: How is information about individual phonemes encoded in the MFCC features extracted from the speech signal, and in the activations of the layers of a gated recurrent neural model of grounded natural speech? In order to achieve this, several experiments will be conducted which will be discussed in detail later on. First, however, the related work will be examined, followed by a comprehensive look at the gRNN model, as well as the data used to train this model and conduct the experiments.

2. Related work

2.1 Deep multi-modal models

The benefits of weakly supervised or unsupervised learning of automatic speech recognition tasks by neural networks are numerous, as mentioned before. One important approach in this field has been the use of multiple modalities in a neural network architecture. These so-called multi-modal networks use, for example, audio and images in tandem to learn semantic relations between the two modalities. Figure 1 shows how a multi-modal network can create a representational space of the semantic relation between an image and a description of said image (Harwath & Glass, 2015).


Figure 1. Example of the alignment between image and corresponding image description1.

Multi-modal approaches in machine learning have been around for some time (Bregler & Konig, 1994; Duchnowski et al., 1994; Meier et al., 1996; Barnard et al., 2003): in 1989, Yuhas et al. trained a neural network to predict a certain auditory signal given a specific visual input. However, with the rise of deep learning techniques, deep multimodal learning networks have become much more feasible (Socher & Fei-Fei, 2010; Ngiam et al., 2011; Matuszek et al., 2012; Frome et al., 2013; Kong et al., 2014; Lin et al., 2014; Karpathy & Fei-Fei, 2015). Srivastava and Salakhutdinov (2012) showed how two Deep Boltzmann Machines in a multimodal learning model could accurately generate tags describing images, as well as generate image features closely related to their respective input tags. The success of implementing such a model for a practical automatic visual speech recognition system (Sui et al., 2015) has given rise to a large body of research into ASR systems using deep multimodal neural models (e.g. Noda et al., 2015; Harwath et al., 2016; Chrupala et al., 2015; Gelderloos & Chrupala, 2016; Chaabouni et al., 2017; Yang et al., 2017). Notably, Harwath and Glass (2015) proposed a model which uses two convolutional neural networks (CNNs) in order to create a semantic embedding space between images and spoken captions of said images, as seen in Figure 1. The benefit of this model is that it models the relation between the speech and images directly at the audio signal level, thereby skipping the need to deal with text at an orthographic level. This allows the model to not be hindered by, for example, spelling mistakes in the text descriptions of the images. An example of this from Chrupala et al. (2017) can be found in Figure 2. This takes out another layer of pre-processing that text-based ASR systems have to deal with. Several works have aimed at improving this model, with the most successful implementation being that of Chrupala et al. (2017), who constructed a multi-layer gated recurrent neural network (gRNN) which saw significant improvements over the convolutional neural network implementation of Harwath and Glass (2015). In Section 3.2 this model will be reviewed in more detail.

1 Adapted from “Deep Multimodal Semantic Embeddings for Speech and Images”, by D. Harwath and J. Glass, 2015, in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, p. 242. Copyright 2015 IEEE.


Figure 2. Images returned for utterance ‘a yellow and white birtd is in flight’ by the text (left) and speech (right) models2.

2.2 Model analyses

With the rise of deep learning techniques and their accompanying success, more and more questions are raised as to the inner workings of these deep learning models. Many works argue for an increase in accountability of machine learning systems, as well as more interpretability of the neural networks (Lipton, 2016; Doshi-Velez et al., 2017). However, despite the recognized importance of these issues, the ability to interpret and explain NLP predictions of neural networks is still a work in progress (Belinkov & Glass, 2019). Analysing neural network models has been done using a variety of methods; see Table SM1 in Belinkov & Glass (2019) for an extensive overview of recent research. Elman (1989; 1990; 1991) analysed recurrent neural networks trained on synthetic sentences in a language prediction task. These works show how a network captures syntactic structures, acquires word representations that reflect syntactic and lexical categories, and discovers the notion of a word when predicting characters. Subsequent research applied similar analyses to different networks and tasks (Harris, 1990; Pollack, 1990; Niklasson & Linåker, 2000; Frank, Mathis, & Badecker, 2013; Köhn, 2015; Qian, Qiu, & Huang, 2016; Adi et al., 2016). These analyses, however, have primarily been done at the word or sentence level. The segmentation of phonemes is often a critical step in speech recognition systems (Michel et al., 2016), and the phoneme has subsequently frequently been used as a unit of analysis of ASR networks. Early research analysed how well deep belief networks could recognize phonemes using different acoustic features (Mohamed, Hinton, & Penn, 2012). Results show that the model’s learned representations are less prone to speaker variance than the acoustic features. Nagamine et al. (2015; 2016) analysed how deep neural networks (DNNs) form phonemic categories. They found that a DNN trained for phoneme recognition tasks becomes more nonlinear higher up the model’s architecture, making it better at discriminating acoustically similar phones, and suggesting a hierarchical language acquisition structure. Wang et al. (2017) analysed the Gate Activation Signals (GAS) for

2 Adapted from “Representations of language in a model of visually grounded speech signal”, by G. Chrupala, L. Gelderloos, and A. Alishahi, 2017, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, p. 618. Copyright 2017 by the Association for Computational Linguistics.


gRNNs and found that these signals contain temporal structures highly related to phoneme boundaries. Their analyses of GAS, as well as further phoneme segmentation experiments, shed light on why gRNNs perform well in weakly supervised and unsupervised ASR tasks. These works have primarily focused on single-modality models; however, in recent years multimodal neural network models have also become the subject of analysis. Harwath and Glass (2017) analysed an improved version of their previous grounded speech model (Harwath & Glass, 2015), and found, amongst other things, that the model groups similar semantic concepts together in the embedding space. The concepts “water”, “lake”, “river”, and “pond”, for example, were all closely connected. Chrupala et al. (2017), in addition to testing the performance of their gRNN model, also made efforts to analyse how this model encodes the language inputs. In a series of experiments on a synthetically generated speech model, as well as a human speech model, they found that semantic aspects of the language input tend to become richer higher up in the hierarchy of layers, whereas the encoding of form tended to initially increase, and then plateau or even decrease after the first few layers of the model. These findings affirm previous findings (Nagamine et al., 2015; Nagamine et al., 2016; Gelderloos & Chrupala, 2016), and are also repeated in later work where Blevins et al. (2018) found hierarchical representations of syntax in RNN models. Many of these analyses on joint audio-visual models are relatively coarse: they primarily use words as the unit of analysis. Alishahi et al. (2017) used the multi-layer gRNN model of Chrupala et al. (2017) to perform finer-grained analyses of the representation and encoding of phonemes in a grounded speech perception model. In a series of four experiments they analysed how information about individual phonemes is encoded in the activations of the model, as well as in the MFCC features which were extracted from the synthetic speech signal. They found that the representation of phonological knowledge was most accurate in the lower layers of the model, in line with previous findings on the model’s encoding hierarchy (Gelderloos & Chrupala, 2016; Chrupala et al., 2017). However, they also found that the top recurrent layer still contained a large amount of phonological information. Furthermore, after the top layer, an attention mechanism filters the encoding of phonology, which led to a significant increase in invariance to synonymy. This research will aim to verify these findings on human speech instead of only synthetically generated speech.

2.3 Natural language acquisition

The findings of Alishahi et al. (2017) also show how the workings of the gRNN grounded speech model seem to resemble natural language acquisition in humans. Recent research reveals how infants can discriminate between two sounds according to their respective accompanying visual objects (Ter Schure, Junge, & Boersma, 2016). This effectively means infants create phonetic categories based on both visual semantic cues and specific peaks found in the speech sounds, similar to the learning process of the grounded speech gRNN model. Recently there has been a surge in research


looking into whether neural networks can be used to study the brain. Bashivan, Kar, and DiCarlo (2019) have used a DNN to model how the brain processes and represents patterns of light. In their model each brain area corresponds to a layer in the model, and each brain neuron corresponds to a neuron in the neural network. The use of a DNN allows them to potentially research brain activity in a non-invasive way. Analyses of deep ASR models have already shown that their workings can resemble natural human language acquisition. Mohamed et al. (2012) show that the different workings of the layers fit with human speech recognition, which uses multiple layers with different event detectors and feature extractors (Allen, 1994). Nagamine et al. (2015) note that misclassification between two very similar phonemes resembles the way humans make mistakes when perceiving phonemes under noisy circumstances (Phatak & Allen, 2007). Furthermore, Chaabouni et al. (2017) used audio in tandem with videos of mouths uttering the speech, resulting in a learning task resembling lip reading. Their results show that the added visual aspects can improve the model’s ability to distinguish phonemes which are visually distinct from one another. Recent work analyses how a DNN sees depth using cues taken from the workings of human depth perception (van Dijk & de Croon, 2019). These works show that domains such as neuroscience can benefit from the analysis of neural networks, and vice versa. By doing fine-grained analyses of the gRNN model from Chrupala et al. (2017) using natural speech, this research aims to contribute to this growing body of work, and to improve on the real-world applicability of the previous results by Alishahi et al. (2017).

3. Experimental setup

This section contains a description of the data used for the experiments and a further elaboration on the gRNN model by Chrupala et al. (2017), and concludes with a descriptive account of the experiments conducted to analyze this model.

3.1 Data

Alishahi et al. (2017) proposed using human speech to verify the results of their study; however, they note that, in order to reliably train the model, a large amount of data is necessary. Large datasets of images annotated with human speech descriptions are hard to find; therefore this study uses the MIT Places Audio Caption Corpus (MIT PACC) dataset (Harwath et al., 2018) to train the model. This dataset contains 402,385 image and spoken caption pairs, with a validation set of 1,000 pairs. The image/caption pairs originate from recordings obtained via Amazon Mechanical Turk3 of people verbally describing natural images drawn from the Places 205 image dataset (Zhou et al., 2014). The captions were spoken by 2,683 unique speakers. For these verbal captions there are no ground truth transcriptions available; the

3 For more information see: Paolacci, Chandler, & Ipeirotis, 2010; Buhrmester, Kwang, & Gosling, 2011


transcriptions provided were derived afterwards by ASR, and are usable but erroneous, with an approximate 20% word error rate (Harwath et al., 2016). Given that the gRNN model does not use text transcriptions in training, it is not directly affected by this. The images described by the speakers were randomly drawn from a collection of over 2.5 million images from 205 scene categories in the Places 205 dataset (Zhou et al., 2014). In order to conduct the phoneme discrimination and synonym discrimination experiments, which will be explained in depth later, specific audio stimuli are required. Therefore, the Google WaveNet text-to-speech generator (Oord et al., 2016) is used for creating said stimuli. WaveNet is a state-of-the-art TTS model which performs very close to human speech: according to the Mean Opinion Scores reported by Oord et al. (2016), WaveNet scores 4.21 whereas human speech scores 4.55. This performance is reportedly a 50% increase over Google’s previous best TTS systems (Oord et al., 2016). Given that one of the main aims of this study is to validate the results of Alishahi et al. (2017) on human speech, the use of synthetically generated stimuli is far from ideal; however, the effect of the use of synthetic stimuli will hopefully be very limited given WaveNet’s state-of-the-art performance. Furthermore, many recent studies have used both natural human speech and synthetically generated speech to improve their models’ performance and fidelity (Tjandra et al., 2017; Tjandra et al., 2018; Hayashi et al., 2018; Jia et al., 2018; Li et al., 2018). The WaveNet TTS API4 was used to generate around 15,000 additional spoken captions, exclusively used in the experiments (see Section 3.3).
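To give a concrete impression of how such stimuli can be generated, the snippet below sketches a call to the WaveNet voices through the google-cloud-texttospeech Python client. The exact client interface differs between library versions, and the voice name and output settings shown here are illustrative assumptions rather than the exact configuration used in this work.

# Sketch: synthesizing one stimulus with a Google WaveNet voice via the
# google-cloud-texttospeech client (v2-style interface). The voice name and
# audio settings below are illustrative assumptions, not the exact ones used.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

def synthesize(text, out_path, voice_name="en-US-Wavenet-D"):
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", name=voice_name)
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,  # 16-bit PCM WAV
        sample_rate_hertz=16000)
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config)
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

synthesize("a yellow and white bird is in flight", "stimulus_0001.wav")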

3.2 Model

The history and general workings of the gRNN model by Chrupala et al. (2017) have already been described previously. In this section the aim is to give a brief overview of some of the technical details of the model, as well as the model settings. Given that the focus of this research is to analyze how the layers of the gRNN model encode phonemes, for a full in-depth description of the model we refer the reader to Chrupala et al. (2017).

Model architecture

As explained before, the gRNN model is a multi-modal network whose learning objective is to project spoken utterances and images to a joint semantic space. As can be seen in Figure 1, this can result in the utterance (u) ‘dog’ being close in the semantic space to the image (i) of a dog, while unrelated pairs are far away from each other. A loss function with margin α encourages this:

4 Available at https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/texttospeech/cloud-client.


∑_{u,i} ( ∑_{u′} max[0, α + d(u, i) − d(u′, i)] + ∑_{i′} max[0, α + d(u, i) − d(u, i′)] )    (1)

Here d(u, i) is the cosine distance between the encoded image i and the encoded utterance u, and u′ and i′ range over non-matching utterances and images. The encoders for u and i respectively will now be described. The two modalities of the model are expressed by an image encoder and an utterance encoder. The utterance encoder takes Mel-frequency Cepstral Coefficients (MFCC) of the speech as input. The utterance encoder’s architecture looks like this:

enc_u(u) = unit(Attn(gRNN_{k,L}(Conv_{s,d,z}(u))))    (2)

The first layer of the architecture is a convolutional layer Conv_{s,d,z}. This layer of size s takes the MFCC input, subsamples it with stride z, and then projects it to d dimensions. The next part in the architecture is a gRNN_{k,L} layer, where k is the number of residual recurrent layers and L the recurrence depth. The final part of the architecture is an attention operator Attn. This layer takes a weighted average of the activations across time steps, which, as suggested by Alishahi et al. (2017), can filter out form in order to focus on the semantic information of the input. The image encoder uses features which are extracted with an object classification model called VGG-16 (Simonyan & Zisserman, 2014) which is pre-trained on ImageNet (Deng et al., 2009). These are then projected to the joint space by a linear transformation:

enc_i(i) = unit(Ai + b)    (3)

Here unit(x) = x / (xᵀx)^0.5, and (A, b) are the learned parameters. Lastly, the projections of both the utterance encoder and the image encoder are L2-normalized.
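As a concrete illustration of equation (1), the following is a minimal PyTorch sketch of the margin loss, assuming the utterance and image encodings arrive as L2-normalized batches and taking the other items in the batch as the distractors u′ and i′. This is one common way to instantiate the sums, not necessarily the exact implementation used by Chrupala et al. (2017).

# Sketch of the margin-based contrastive loss of equation (1), assuming
# L2-normalized batches U (utterance encodings) and I (image encodings)
# of shape (batch, dim); alpha is the margin.
import torch

def contrastive_margin_loss(U, I, alpha=0.2):
    # Cosine distance d(u, i) = 1 - cosine similarity; with unit-norm
    # vectors the similarity is just the dot product.
    sim = U @ I.t()                      # (batch, batch) similarity matrix
    d = 1.0 - sim                        # pairwise cosine distances
    d_pos = d.diag().unsqueeze(1)        # d(u, i) for matching pairs
    # Distractor images i' (columns) and distractor utterances u' (rows).
    loss_i = torch.clamp(alpha + d_pos - d, min=0)      # d(u, i) vs d(u, i')
    loss_u = torch.clamp(alpha + d_pos.t() - d, min=0)  # d(u, i) vs d(u', i)
    # Remove the diagonal, where the "distractor" equals the true pair.
    mask = 1.0 - torch.eye(U.size(0), device=U.device)
    return ((loss_i + loss_u) * mask).sum()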

Model settings

The model is trained on the MIT PACC and implemented in PyTorch (Paszke et al., 2017). The set-up of this model is as follows: the convolutional layer has a size of 64, a length of 6, and a stride of 2, with full border padding. After this follow four recurrent layers with 1024 dimensions, followed finally by an attention layer consisting of a Multi-Layer Perceptron with 128 hidden units. The model is trained using an Adam optimizer with an initial learning rate of 0.0002. The image features are extracted from the final fully connected layer of VGG-16 (Simonyan & Zisserman, 2014) pre-trained on ImageNet (Deng et al., 2009) and are 4096-dimensional image feature vectors. Figure 3 shows a visual overview of the speech utterance architecture.


Figure 3: MIT PACC Speech utterance encoder architecture.
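For concreteness, a simplified PyTorch sketch of this utterance encoder is given below, using the hyperparameters listed above (a convolution with 64 features, length 6 and stride 2, four 1024-dimensional recurrent layers, and a 128-unit attention MLP). It is a rough reading of the architecture rather than the exact implementation of Chrupala et al. (2017); in particular, the residual connections between recurrent layers and the exact padding scheme are simplified, and the 13-dimensional MFCC input size is an assumption.

# Simplified sketch of the speech utterance encoder: Conv -> stacked GRU
# layers -> attention pooling -> L2 normalization. Hyperparameters follow
# the settings reported above; residual connections and padding details
# are simplified relative to the original implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceEncoder(nn.Module):
    def __init__(self, n_mfcc=13, conv_dim=64, rnn_dim=1024,
                 n_layers=4, attn_hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mfcc, conv_dim, kernel_size=6,
                              stride=2, padding=3)
        self.rnn = nn.GRU(conv_dim, rnn_dim, num_layers=n_layers,
                          batch_first=True)
        # Attention MLP scoring each time step.
        self.attn = nn.Sequential(nn.Linear(rnn_dim, attn_hidden),
                                  nn.Tanh(),
                                  nn.Linear(attn_hidden, 1))

    def forward(self, mfcc):                  # mfcc: (batch, time, n_mfcc)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)
        states, _ = self.rnn(x)               # (batch, time', rnn_dim)
        weights = F.softmax(self.attn(states), dim=1)
        pooled = (weights * states).sum(dim=1)
        return F.normalize(pooled, p=2, dim=1)   # L2-normalized embedding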

3.3 Experiments

This section is dedicated to detailed descriptions of the two experiments this research conducts in order to understand how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, as well as in the activations of the layers of the grounded speech model. Experiment 1 uses a Minimal-Pair ABX task (Schatz et al., 2013) to see how well the model can discriminate between phonemes. In Experiment 2 a synonym discrimination task is used to see how the model encodes representations of meaning versus phonological form. Lastly, the original outline of this research was to include a third experiment, namely the phoneme decoding experiment from Alishahi et al. (2017). This experiment uses the Gentle5 toolkit to align phonemes with speech utterances, and then uses this alignment to quantify how well phoneme identity can be recovered from the MFCC features and layer representations. However, the extremely poor results that this produced indicated that either this experiment had been incorrectly performed, or that the phoneme alignment was too inaccurate for the results to be of any scientific value. It was therefore elected to omit the phoneme decoding experiment from this research. Section 5 discusses this further. The experiments are all conducted using Python (version 3.6.7) in Jupyter Notebook (version 5.2.3) accessed with Anaconda (version 5.3.1). All the resources will be made available online6. Furthermore, the way phoneme representations are calculated for these experiments is as follows: the activations per layer are averaged over the duration of every phoneme occurrence in the input. To find phoneme identities in the MFCC features, the average input vector of each phoneme is calculated by averaging the MFCC vectors of the speech input over the articulation time of each phoneme. The MFCC vectors as well as the layer activation vectors are retrieved using the Talkie toolkit7. The specific phonemes used in these experiments are replicated from Alishahi et al. (2017), and can be found in Figure 4.
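As an illustration of this averaging step, the sketch below computes a representation for each aligned phoneme occurrence from a matrix of per-frame feature vectors (MFCCs or layer activations); the (start frame, end frame, label) alignment format is an assumption for illustration.

# Sketch: average per-frame feature vectors (MFCCs or layer activations)
# over the duration of each aligned phoneme occurrence. The alignment
# format (start_frame, end_frame, label) is assumed for illustration.
import numpy as np

def phoneme_representations(features, alignment):
    """features: (n_frames, dim) array; alignment: list of
    (start_frame, end_frame, phoneme_label) tuples."""
    reps, labels = [], []
    for start, end, label in alignment:
        if end > start:                           # skip degenerate spans
            reps.append(features[start:end].mean(axis=0))
            labels.append(label)
    return np.stack(reps), labels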

5 Available at https://github.com/lowerquality/gentle. 6 Available at https://github.com/jvdweijden/encoding_of_phonology. 7 Available at https://github.com/gchrupala/talkie.


Figure 4: Phonemes of General American English8.

Experiment 1 – Phoneme discrimination

In this experiment the aim is to evaluate how well the PACC model performs on the Phoneme across Context (PaC) task taken from the Minimal-Pair ABX tasks proposed by Schatz et al. (2013). These tasks have been created by Schatz et al. (2013) as a way to test the speech features which are learned by unsupervised models. They look at how well these features can distinguish between speakers, and how well they are able to discriminate phonemes across speakers as well as phonetic context. This is done by comparing two syllable pairs that only differ by a single phoneme in a variety of tasks. For this experiment, the Phoneme across Context (PaC) task (Schatz et al., 2013) is used in accordance with the phoneme discrimination task by Alishahi et al. (2017). The PaC task measures how invariant the model is to context in phoneme discrimination. Specifically, the task consists of taking three stimuli: A, B, and X (either a consonant or vowel), where stimuli A and B only differ by one phoneme, which makes them a minimal pair. B and X are also minimal pairs in this way, however A and X are not. For example, when considering (be /bi/, me /mi/, my /maI/) as (A, B, X): (be /bi/) is a minimal pair with (me /mi/), and (me /mi/, my /maI/) is also a minimal pair, however (be /bi/, my /maI/) is not. Thus the goal of the task is to see how often the model is able to recognize that X (my /maI/) is closer to B (me /mi/) than to A (be /bi/). This way the PaC task is able to determine how invariant to context the model is when discriminating between phonemes. To accomplish this, a list of syllables was compiled according to the criteria Alishahi et al. (2017) used in their phoneme discrimination experiment: a list of consonant-vowel (CV) syllables in American English, compiled using the syllabification method from Gorman (2013), with syllables excluded which could not be generated in English using their TTS system9. Of each of

8 Adapted from “Encoding of phonology in a recurrent neural model of grounded speech”, by Alishahi, A., Barking, M., & Chrupala, G., 2017, in Proceedings of the 21st Conference on Computational Natural Language Learning, p. 371. Copyright 2017 Association for Computational Linguistics. 9 Available at https://github.com/pndurette/gTTS.


these syllables, audio was then generated using the WaveNet TTS. For these syllables, all the possible tuples where (A, B) and (B, X) are minimal pairs but (A, X) is not were collected, resulting in 34,650 tuples. Then the PaC task was performed according to:

sign(dist(A, X) − dist(B, X))    (4)

Here dist(i, j) is the Euclidean distance between the representations (MFCC vectors or layer representation vectors) of syllables i and j. If this calculation results in a positive value, this means the model was able to correctly discriminate between phonemes across context. A more difficult variety of this task was also performed where the target (X) and the distractor (A) are of the same phoneme class.
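A minimal sketch of this decision rule is given below; each syllable representation is assumed to be a fixed-length vector (mean MFCCs or mean layer activations), and a tuple is counted as correct when X lies closer to B than to A.

# Sketch of the ABX decision rule of equation (4): a tuple is scored
# correct when the target X is closer to B than to the distractor A.
import numpy as np

def abx_accuracy(tuples):
    """tuples: iterable of (A, B, X) representation vectors."""
    correct = 0
    total = 0
    for A, B, X in tuples:
        d_ax = np.linalg.norm(A - X)      # Euclidean distance dist(A, X)
        d_bx = np.linalg.norm(B - X)      # Euclidean distance dist(B, X)
        correct += d_ax > d_bx            # positive sign(d_ax - d_bx)
        total += 1
    return correct / total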

Experiment 2 – Synonym discrimination

In this second experiment the aim is to determine the extent to which the model can distinguish between the two members of a synonym pair. Because synonyms have different forms but share the same meaning, they are useful for determining how the model extrapolates meaning from form. Layers of the model which contain mostly representations of phonological form will be able to distinguish between synonyms with relative ease. However, for layers lacking these phonological form representations, synonyms will have to be distinguished not on form, but on meaning, as seen in Alishahi et al. (2017). To see if the natural speech gRNN model shares these findings, the following synonym discrimination experiment will be conducted. Synonyms were selected according to the synonym criteria from Alishahi et al. (2017). Firstly, the ASR transcriptions of the MIT PACC speech utterances were POS-tagged using NLTK (Loper & Bird, 2002). Using the resulting POS-tags, all the verbs, nouns, and adjectives in the transcriptions were selected, which resulted in 2,363,486 words. In Alishahi et al. (2017) only the words which occur more than 20 times in the validation data are selected. However, given that the Synthetically Spoken COCO (Havard et al., 2017; Chrupala et al., 2017) validation data contains around 200,000 utterances, and the PACC dataset has around 400,000, for the present research only the words which are found more than 40 times (instead of more than 20), but in less than 95% of the total occurrences, are selected. For these words, synonyms are then generated using synset membership from WordNet (Miller, 1995) and selected using the same criteria used to select the words from which they were generated. In the synonym sentence list used for their synonym discrimination task, Alishahi et al. (2017) have around 17,000 sentences10. In order to keep this experiment comparable, 25 synonym word pairs with around 340 sentences each are selected. This means that when the spoken utterances are generated in which each original word is replaced by its respective synonym, the total amount of generated sentences will be roughly the same: (25 × 340) × 2 = 17,000. From the 84 word pairs left, 25 were finally selected if:

10 Available at https://github.com/gchrupala/encoding-of-phonology.


• the words clearly differed in form from each other (so words which only differ in spelling, such as humor/humour and theater/theatre, would be excluded),
• both forms were actual synonyms of one another, so that one word could replace the other in a sentence without changing the sentence’s meaning,
• the word pair was represented in the list of word pairs from Alishahi et al. (2017), in the same lexical category.

This resulted in 2 verb pairs and 23 noun pairs. As mentioned before, for every sentence in which the selected synonym words appear in their proper lexical category, a new sentence is generated in which the original word is replaced by its paired synonym word, which results in a total of 15,101 sentences. The average length of these sentences is around 133 characters. Speech utterances are then generated from these sentences using the WaveNet TTS. Finally, a 10-fold cross-validation using Logistic Regression from Scikit-learn (Pedregosa et al., 2011) attempts to predict which of the two words in each synonym pair is contained in the utterance. For this it uses as input both the MFCC features and the average unit activations per recurrent layer for each word pair.
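This classification step can be sketched as follows with scikit-learn, assuming a matrix X of word representations (mean MFCC frames or mean recurrent-layer activations for each occurrence) and binary labels y indicating which member of the synonym pair was spoken; the reported figure is the cross-validated error rate.

# Sketch: 10-fold cross-validated logistic regression predicting which
# synonym of a pair occurs in the utterance, given averaged MFCC features
# or averaged recurrent-layer activations for the word.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synonym_error_rate(X, y):
    """X: (n_occurrences, dim) word representations; y: 0/1 synonym labels."""
    clf = LogisticRegression(max_iter=1000)
    accuracies = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    return 1.0 - accuracies.mean()        # error rate = 1 - mean accuracy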

4. Results

This section reports the results of the two experiments conducted in this research in order to understand how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, as well as in the activations of the layers of the grounded speech gRNN model.

4.1 Experiment 1

The results of the Phoneme across Context task can be found in Table 1. It shows the discrimination accuracy for each feature input.

Table 1
Accuracy of choosing the correct target in an ABX task using different representations as input.

Representation    Accuracy
MFCC              0.79
Recurrent 1       0.89
Recurrent 2       0.85
Recurrent 3       0.79
Recurrent 4       0.74

Recurrent layer 4 had the lowest accuracy (0.74), and the highest accuracy was achieved by the first recurrent layer (0.89).


Furthermore, Figure 5 shows the discrimination accuracies per feature input broken down by phoneme class. In these results the target (X) was of the same phoneme class as the distractor (A).

Figure 5: Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class. Shaded area extends ± 1 standard error from the mean.

4.2 Experiment 2

The results of the synonym discrimination experiment can be found in Figure 6. It shows the error rates of the 10-fold cross-validation using Logistic Regression trained to predict which synonym out of a pair could be found in the input features.


Figure 6: Synonym discrimination error rates, per representation and synonym pair. In Section 5 these results will be discussed.

5. Discussion

The aim of this research was to investigate how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, and in the activations of the layers of a gRNN model of grounded natural speech. To achieve this, two experiments were conducted: phoneme discrimination, and synonym discrimination. This research is firmly based on the research conducted by Alishahi et al. (2017), which is identical in aim but differs in the dataset used to train and analyze the grounded speech gRNN model by Chrupala et al. (2017). Alishahi et al. (2017) used the Synthetically Spoken COCO dataset (Havard et al., 2017), which as the name suggests contains pairs of synthetically generated spoken captions and images. Given


that natural speech is notably more messy and noisy (Katsaggelos, Bahaadini, & Molina, 2015), the primary contribution of this research is to examine how well the findings on synthetic speech generalize to a model trained on, and in part evaluated with, a dataset of natural spoken captions and image pairs. In the next section, the findings of this research will be discussed and compared with the findings of Alishahi et al. (2017).

5.1 Experiment 1

The aim of Experiment 1 was to use the PaC task in order to evaluate how invariant to context the model’s phoneme discrimination is. Compared to the results of the phoneme discrimination task by Alishahi et al. (2017), the ABX task accuracies in the present work have proven to be higher: Table 1 shows that the highest accuracies were those of recurrent layers 1 and 2, with 0.89 and 0.85 respectively. The accuracy then drops for layers 3 and 4, indicating that the early layers are more focused on encoding phonological form, a result which confirms similar findings by Alishahi et al. (2017). Figure 5 shows the ABX CV task accuracy for each phoneme class. The MFCC feature accuracy is relatively low, as expected. Vowel discrimination accuracy was the highest by a large margin, similar to Alishahi et al. (2017). These results show a very consistent pattern of high accuracy at the start of the recurrent layers, and then a gradual diminishing of performance further along the model’s layers. Compared to Alishahi et al. (2017), the approximant class has a high accuracy, whereas the other classes are very similar in their accuracy levels, vowels excluded. Furthermore, even though the shaded areas indicate a comparable standard error, the erratic patterns of the nasal and affricate classes observed in Alishahi et al. (2017) are not found in the present results, which further solidifies these findings. The high accuracies of the model in the PaC task could indicate multiple things. For this experiment, the WaveNet TTS was used to generate the audio input, which results in higher-quality utterances than the TTS used by Alishahi et al. (2017). The increase of performance in this task could be related to this higher audio quality. However, perhaps to a greater extent, the higher accuracies could show that the noisier data used to train the model resulted in more generalizable performance, and consequently more accurate phonological representations when high-quality audio is used as input.

5.2 Experiment 2

The second experiment focused on synonym discrimination. The core idea was that if layers encode more phonological form, this would make synonym discrimination easier and lead to low error rates, whereas if the layers encode more representational meaning, discriminating between synonyms would be harder and consequently result in a higher error rate. Alishahi et al. (2017) observed that, as evidenced by the low error rates, the recurrent layers primarily encode phonological form. Furthermore, the high error rate of the sentence embeddings indicated the attention layer significantly attenuated phonological form in favour of meaning. Figure 6 shows the results of the synonym


discrimination experiment in this research. It seems clear these results are quite different from those found in Alishahi et al. (2017); there is a much higher spread of error rates across different synonym pairs. This would indicate that the model, while in general adhering to the same hierarchical structure of high encoding of form in early layers and high encoding of meaning in later layers found in previous research, does not encode phonological form to the same degree as the synthetic speech model analysed by Alishahi et al. (2017). One explanation for this has already been mentioned: the more noisy audio of natural speech might make it harder for the model to encode phonological form. However, another could be found in the data collection method of the PACC dataset. The PACC spoken utterances were retrieved using Amazon Mechanical Turk. Given that the ‘turkers’ gave descriptions of the images ad-lib, some of the utterances are quite abstract, bordering on rambling. This could result in the model naturally focussing less on the form of the speech, as for the more abstract utterances a large part of the sentence can prove to be irrelevant. Furthermore, the average sentence length used for this experiment was substantially higher than in Alishahi et al. (2017). The generated sentence list used by Alishahi et al. (2017) contains roughly 880,000 characters; with a total of ±17,000 sentences, this amounts to an average sentence length of around 52 characters, compared to the average sentence length of 133 characters in the present research. These long sentences might incentivise the model to focus less on form. In fact, Chrupala et al. (2017) found that one of the cases where speech outperformed their text model was for long sentences. They posited that the attention layer of the speech model was able to cherry-pick which parts of the long sentences were important, focusing on meaning rather than form. Future research could consciously vary sentence length to test the effects of this on encoding of form in the grounded speech gRNN model. Lastly, the majority of the synonym pairs did follow the pattern of high error rates in the MFCC features, with a sharp drop in the first recurrent layer, and a gradual increase further along the layers. These findings again confirm the general pattern of encoding of phonology in the lower layers of the model, whereas meaning is encoded in the later layers.

5.3 Limitations

Even though this research garners illuminating results, it is not without its limitations. The most glaring of these is that neither the convolutional layer nor the sentence embeddings were used in the experiments. The reasons for this are two-fold: firstly the time constraints placed upon this thesis, and secondly the significant technical challenges encountered along the way. While the results of this work give new insights, as well as confirm previously found ones, the lack of sentence embeddings in the experiments specifically makes it hard to elucidate whether, and how, the model transitions from primarily encoding phonology to encoding meaning. The data


and resources used in this work will be made freely available11, so that future work might include these layers. The second main limitation which will be discussed regards the quality of the ASR transcriptions in the PACC dataset. As briefly mentioned in Section 3.3, this research originally intended to include a third experiment: the phoneme decoding experiment performed by Alishahi et al. (2017). The main goal of this experiment is to quantify how well phoneme identity can be recovered from MFCC features and layer representations. Alishahi et al. (2017) found that phonemes were most easily recoverable from the first two recurrent layers of the model, whereas this phoneme decoding accuracy steadily decreased in later layers. This would indicate that the lower layers encode more phonological form compared to later layers, a finding which has been replicated in this work as well. In this research, due to the extremely poor results, it was decided to omit this experiment. The error rate of phoneme decoding in the recurrent layers was around 92%, which is the same as the error rate of a majority baseline (92%). For comparison, the highest error rate of the recurrent layers in Alishahi et al. (2017, p. 373) was around 29%. It is possible this experiment was incorrectly performed in this work; however, this seems unlikely given the reliance on the original resources12 provided by Alishahi et al. (2017) to implement these experiments. The largest contributor to these results, in all likelihood, stems from the high word error rates (±20%) in the ASR transcriptions of the spoken utterances in the PACC. The phoneme alignments created by the Gentle toolkit used these erroneous transcriptions, which resulted in a high alignment failure rate. Of around 250,000 sentences aligned, only 64,000 were free of failed alignments. In contrast, Alishahi et al. (2017) used ground-truth transcriptions with clean audio for their alignments, which consequently failed only in a few cases. This could mean that the inconsistency of the phoneme alignments contributed significantly to the high error rate. For further inspection of these results, see Appendix A. Even though the use of ground-truth transcriptions during training and evaluation phases can cause a significant discrepancy in real-world performance (Lakomkin et al., 2019), future work is highly advised to extract their own transcriptions using a better performing ASR system when using the PACC dataset for model analysis purposes. Another limitation concerns the size of the PACC dataset. Alishahi et al. (2017, p. 376) mention that “while small scale databases of speech and image are available, they are not large enough to reliably train a model such as ours”. As an example they mention the Flickr8k Audio Caption Corpus (Harwath & Glass, 2015). While the PACC dataset, with around 400,000 spoken utterances, is not as large as the Synthetically Spoken COCO dataset (more than 600,000 synthetic utterances), it is significantly larger than the Flickr8k ACC, which only contains around 40,000 naturally spoken utterances. Still, more data would most likely improve the accuracy of the natural speech

11 Available at https://github.com/jvdweijden/encoding_of_phonology. 12 Available at https://github.com/gchrupala/encoding-of-phonology.


model, and therefore future research should look to either extend the PACC dataset or find new sources of spoken caption and image pairs.

6. Conclusion

The success of deep-learning techniques in a large variety of domains will inevitably inspire more innovation. Therefore, it is ever more important to elucidate the black box of deep-learning networks. By analysing a grounded natural speech gRNN model using a variety of tasks, this work has been able to contribute to this effort. Specifically, developing an understanding of the workings of weakly supervised or unsupervised multi-modal ASR models can have a significant impact on the world, given that such models could be used to, for example, reduce ASR training costs or bring ASR to under-resourced languages. The main finding of the present work is the way the natural speech model encodes form versus meaning. The synonym discrimination task, and especially the phoneme discrimination task, revealed the hierarchical structure of the layers of the model, in which the first few layers contain high levels of encoding of phonological form, while higher up the layers more semantic information is encoded. While these results have been found in previous work (Nagamine et al., 2015; Nagamine et al., 2016; Chrupala et al., 2017; Alishahi et al., 2017), they had yet to be validated on a grounded speech model trained on human spoken utterances. Furthermore, the phoneme discrimination task showed that the grounded speech model trained on noisy natural speech utterances is able to achieve higher phoneme discrimination accuracy than a model trained on clean synthetic speech data. This indicates that noisy training circumstances can benefit the encoding of phonology in ASR models. Finally, the synonym discrimination task showed that sentence length could influence the degree to which form is encoded in the grounded natural speech model. While the encoding of form at the phonological level showed strong results, the model performed noticeably worse at synonym discrimination. This implies that the grounded natural speech gRNN model is able to attenuate encoding of form on the word level, according to sentence length. However, more research is needed to verify these findings. With regard to the overlap between multi-modal ASR models and natural language acquisition, the results of this research argue in favour of this comparison. The results show the ability of the model to distinguish between phonetic categories based on speech features, similar to findings of Ter Schure et al. (2016) on natural language acquisition by infants. The way the model learns to attenuate form for long sentences is similar to how humans divide their attention when listening to others (Shinn-Cunningham & Best, 2008). However, future research is strongly recommended to quantify the validity of this comparison. While this research has not been without its limitations, its results have both confirmed previous findings and uncovered new ones. Hopefully, the efforts of this work can contribute to, in time, illuminating the black box of deep learning, as well as paving a way for unsupervised ASR models.


References

Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., & Goldberg, Y. (2016). Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.
Alishahi, A., Barking, M., & Chrupala, G. (2017). Encoding of phonology in a recurrent neural model of grounded speech. Proceedings of the 21st Conference on Computational Natural Language Learning, 368-378.
Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2(4), 567-577.
Barnard, K., Duygulu, P., Forsyth, D., Freitas, N. D., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of Machine Learning Research, 3(Feb), 1107-1135.
Bashivan, P., Kar, K., & DiCarlo, J. J. (2019). Neural population control via deep image synthesis. Science, 364(6439), eaav9436.
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85-100.
Belinkov, Y., & Glass, J. (2018). Analysis methods in neural language processing: A survey. arXiv preprint arXiv:1812.08951.
Blevins, T., Levy, O., & Zettlemoyer, L. (2018). Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), 14-19.
Bohn, D. (2019). Google Assistant will soon be on a billion devices, and feature phones are next. Retrieved from https://www.theverge.com/2019/1/7/18169939/google-assistant-billion-devices-feature-phones-ces-2019
Bregler, C., & Konig, Y. (1994, April). "Eigenlips" for robust speech recognition. In Proceedings of ICASSP'94, IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 2, pp. II-669). IEEE.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3-5.
Chaabouni, R., Dunbar, E., Zeghidour, N., & Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. arXiv preprint arXiv:1704.06913.
Chrupala, G., Gelderloos, L., & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.


Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248-255). IEEE.
van Dijk, T., & de Croon, G. C. (2019). How do neural networks see depth in single images? arXiv preprint arXiv:1905.07005.
Duchnowski, P., Meier, U., & Waibel, A. (1994). See me, hear me: Integrating automatic speech recognition and lip-reading. In Third International Conference on Spoken Language Processing.
Elman, J. L. (1989). Representation and structure in connectionist models (No. N00014-85-K-0076). University of California San Diego, La Jolla, Center for Research in Language.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3), 195-225.
Frank, R., Mathis, D., & Badecker, W. (2013). The acquisition of anaphora by simple recurrent networks. Language Acquisition, 20(3), 181-227.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (pp. 2121-2129).
Gelderloos, L., & Chrupala, G. (2016). From phonemes to images: Levels of representation in a recurrent neural model of visually-grounded language learning. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers.
Gorman, K. (2013). Generative phonotactics. University of Pennsylvania.
Harwath, D., & Glass, J. (2015). Deep multi-modal semantic embeddings for speech and images. In IEEE Automatic Speech Recognition and Understanding Workshop. DOI: 10.1109/ASRU.2015.7404800
Harwath, D., Torralba, A., & Glass, J. (2016). Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems 2016, 1856-1866.
Harwath, D., & Glass, J. R. (2017). Learning word-like units from joint audio-visual analysis. arXiv preprint arXiv:1701.07481.
Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., & Glass, J. (2018). Jointly discovering visual objects and spoken words from raw sensory input. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 649-665).


Havard, W., Besacier, L., & Rosec, O. (2017). Speech-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. arXiv preprint arXiv:1707.08435.
Hayashi, T., Watanabe, S., Zhang, Y., Toda, T., Hori, T., Astudillo, R., & Takeda, K. (2018). Back-translation-style data augmentation for end-to-end ASR. arXiv preprint arXiv:1807.10893.
Jia, Y., Johnson, M., Macherey, W., Weiss, R. J., Cao, Y., Chiu, C. C., ... & Wu, Y. (2018). Leveraging weakly supervised data to improve end-to-end speech-to-text translation. arXiv preprint arXiv:1811.02050.
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
Katsaggelos, A. K., Bahaadini, S., & Molina, R. (2015). Audiovisual fusion: Challenges and new approaches. Proceedings of the IEEE, 103(9), 1635-1653.
Kinsella, B. (2018). Microsoft Cortana available on 400 million devices gets a new product leader. Retrieved from https://voicebot.ai/2018/03/02/microsoft-cortana-available-400-million-devices-gets-new-product-leader/
Kinsella, B. (2019). Google Assistant to be available on 1 billion devices this month – 10x more than Alexa. Retrieved from https://voicebot.ai/2019/01/07/google-assistant-to-be-available-on-1-billion-devices-this-month-10x-more-than-alexa/
Köhn, A. (2015). What’s in an embedding? Analyzing word embeddings through multilingual evaluation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2067-2073).
Lakomkin, E., Zamani, M. A., Weber, C., Magg, S., & Wermter, S. (2019). Incorporating end-to-end speech recognition models for sentiment analysis. arXiv preprint arXiv:1902.11245.
Le, V. B., & Besacier, L. (2009). Automatic speech recognition for under-resourced languages: Application to Vietnamese language. IEEE Transactions on Audio, Speech, and Language Processing, 17(8), 1471-1482.
Li, J., Gadde, R., Ginsburg, B., & Lavrukhin, V. (2018). Training neural speech recognition systems with synthetic speech augmentation. arXiv preprint arXiv:1811.00707.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, Springer, 740-755.


Lin, D., Fidler, S., Kong, C., & Urtasun, R. (2014). Visual semantic search: Retrieving videos via complex textual queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2657-2664).
Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028.
Matuszek, C., FitzGerald, N., Zettlemoyer, L., Bo, L., & Fox, D. (2012). A joint model of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423.
Meier, U., Hurst, W., & Duchnowski, P. (1996, May). Adaptive bimodal sensor fusion for automatic speechreading. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (Vol. 2, pp. 833-836). IEEE.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.
Mohamed, A. R., Hinton, G., & Penn, G. (2012). Understanding how deep belief networks perform acoustic modelling. Neural Networks, 6-9.
Nagamine, T., Seltzer, M. L., & Mesgarani, N. (2015). Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association.
Nagamine, T., Seltzer, M. L., & Mesgarani, N. (2016). On the role of nonlinear transformations in deep neural network acoustic models. In Interspeech (pp. 803-807).
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 689-696).
Niklasson, L., & Linåker, F. (2000). Distributed representations for extended syntactic transformation. Connection Science, 12(3-4), 299-314.
Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H. G., & Ogata, T. (2015). Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4), 722-737.
Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Zeming, L., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch.


Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.
Phatak, S. A., & Allen, J. B. (2007). Consonant and vowel confusions in speech-weighted noise. The Journal of the Acoustical Society of America, 121(4), 2312-2326.
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1-2), 77-105.
Qian, P., Qiu, X., & Huang, X. (2016). Investigating language universal and specific properties in word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1478-1488).
Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., & Dupoux, E. (2013, August). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. In INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association (pp. 1-5).
ter Schure, S. M. M., Junge, C. M. M., & Boersma, P. P. G. (2016). Semantics guide infants’ vowel learning: Computational and experimental evidence. Infant Behavior and Development, 43, 44-57.
Shinn-Cunningham, B. G., & Best, V. (2008). Selective attention in normal and impaired hearing. Trends in Amplification, 12(4), 283-299.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Socher, R., & Fei-Fei, L. (2010, June). Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 966-973). IEEE.
Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems (pp. 2222-2230).
Sui, C., Bennamoun, M., & Togneri, R. (2015). Listening with your eyes: Towards a practical visual speech recognition system using deep Boltzmann machines. In Proceedings of the IEEE International Conference on Computer Vision (pp. 154-162).
Tjandra, A., Sakti, S., & Nakamura, S. (2017, December). Listening while speaking: Speech chain by deep learning. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 301-308). IEEE.


Tjandra, A., Sakti, S., & Nakamura, S. (2018). Machine speech chain with one-shot speaker adaptation. arXiv preprint arXiv:1803.10525.
Wang, S., Qian, Y., & Yu, K. (2017, August). What does the speaker embedding encode? In Interspeech (pp. 1497-1501).
Yang, X., Ramesh, P., Chitta, R., Madhvanath, S., Bernal, E. A., & Luo, J. (2017). Deep multimodal representation learning from temporal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5447-5455).
Yuhas, B. P., Goldstein, M. H., & Sejnowski, T. J. (1989). Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, 27(11), 65-71.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems (pp. 487-495).


Appendix A: Omitted phoneme decoding results

Figure: Accuracy of phoneme decoding with input MFCC features and PACC model activations. The boxplot shows error rates bootstrapped with 1000 resamples.
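The bootstrap procedure referred to in the caption is a generic resampling technique rather than anything specific to the model. The sketch below is a minimal illustration, using hypothetical stand-in data and a logistic-regression decoder that is not necessarily the classifier used in this thesis, of how the accuracy of phoneme decoding and its spread could be estimated from 1000 resamples of a held-out set; in practice the features would be MFCC frames or PACC model activations and the labels would be phonemes.

# Minimal sketch, assuming stand-in data and a hypothetical
# logistic-regression phoneme decoder: bootstrapping decoding accuracy
# with 1000 resamples of the test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Stand-in data: 2000 frames, 13 MFCC-like features, 10 phoneme classes.
X = rng.normal(size=(2000, 13))
y = rng.integers(0, 10, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Diagnostic classifier: predict the phoneme label from the features.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# Bootstrap: resample test items with replacement and recompute accuracy.
n_test = len(y_test)
acc = np.empty(1000)
for b in range(1000):
    idx = rng.integers(0, n_test, size=n_test)
    acc[b] = np.mean(pred[idx] == y_test[idx])

low, high = np.percentile(acc, [2.5, 97.5])
print(f"accuracy {acc.mean():.3f}, 95% bootstrap interval [{low:.3f}, {high:.3f}]")

Subtracting each bootstrapped accuracy from one gives the corresponding error rate, which is the quantity a boxplot such as the one above summarizes across resamples.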