Illuminating the Black Box: Encoding of Phonology in a Recurrent Neural Model of Grounded Natural Speech

Jeroen van der Weijden
Student number: 1261593

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Cognitive Science & Artificial Intelligence
Department of Cognitive Science & Artificial Intelligence
School of Humanities and Digital Sciences
Tilburg University

Thesis committee:
Dr. Grzegorz Chrupala
Dr. Henry Brighton

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
January 2019

Preface

This thesis is dedicated to my grandmother, Mien Siebers, who passed away during the writing of this work. I want to thank my family and friends for their love, and Jenske Vermeulen for her unwavering support and for believing in me. I would also like to thank my thesis advisor, dr. Grzegorz Chrupala, for his help and patience, as well as for training the model and providing crucial code. This work proved both a challenge and a leap into the deep end, two things which often seem to coincide.

Contents

1. Introduction
2. Related work
   2.1 Deep multi-modal models
   2.2 Model analyses
   2.3 Natural language acquisition
3. Experimental setup
   3.1 Data
   3.2 Model
       Model architecture
       Model settings
   3.3 Experiments
       Experiment 1 – Phoneme discrimination
       Experiment 2 – Synonym discrimination
4. Results
   4.1 Experiment 1
   4.2 Experiment 2
5. Discussion
   5.1 Experiment 1
   5.2 Experiment 2
   5.3 Limitations
6. Conclusion
References
Appendix A

Illuminating the black box: encoding of phonology in a recurrent neural model of grounded natural speech

Jeroen van der Weijden

In this work we investigate how phonology is encoded in a recurrent neural model of grounded natural speech. In a weakly supervised automatic speech recognition learning task, this model takes images and spoken descriptions of those images, and projects them into a joint semantic space. Previous work has found that the encoding of phonology is most prevalent in the lower layers of the model, whereas semantic aspects of the language input become more prevalent further up the hierarchy of layers (Alishahi, Barking, & Chrupala, 2017). However, these analyses have only been conducted on a dataset of synthetically generated spoken captions. The present work aims to validate those findings using natural speech, specifically the MIT Places Audio Caption Corpus (Harwath et al., 2018) with additional stimuli generated by the Google WaveNet TTS (Oord et al., 2016). In a series of two experiments we confirm the findings of previous research. Furthermore, we show that the noisiness of natural speech can benefit the encoding of phonology, and that longer sentences compel the model to attenuate the encoding of form at the word level. These results extend the limited body of work that peeks inside the black box of deep-learning techniques, and open up new avenues for future research.

1. Introduction

Decoding and understanding natural language has been a key focus of machine learning, and the rise of deep learning techniques has vastly accelerated progress in this area (Belinkov & Glass, 2019). The effects of this progress can readily be seen in the world today: at the time of writing, the four largest voice assistants (Google Assistant, Apple Siri, Microsoft Cortana, Amazon Alexa) are installed on a combined two billion devices, with at least 520 million active users (Kinsella, 2018; Bohn, 2019; Kinsella, 2019). These voice assistants rely on Automatic Speech Recognition (ASR) in order to function.

While the performance of neural networks on speech recognition tasks like those used for voice assistants has been impressive, many of these networks require large amounts of transcribed speech, which makes them very time- and resource-intensive to train. Large companies such as Amazon or Apple are able to absorb these costs, but this may not be the case for many other companies, nor is it a given that enough data is available to make ASR work for any particular language. Particularly for under-resourced languages, ASR can be of critical importance: it can improve documentation efforts and facilitate communication with speakers of less-prevalent languages, so that, for example, humanitarian workers are able to communicate with disaster-struck populations. The use of weakly supervised or unsupervised learning methods for neural networks is therefore preferred, especially in cases where no large datasets are available (Le et al., 2009; Besacier et al., 2014; Michel et al., 2016).
Recent studies have proposed such a weakly supervised learning method for training neural networks on natural language processing (NLP) tasks (Noda et al., 2015; Harwath et al., 2015; Harwath et al., 2016; Gelderloos & Chrupala, 2016). This technique relies on a network learning a joint semantic space between photographic images of everyday scenes and spoken descriptions of those images. Such a 'visually grounded speech perception' model allows a network to create a semantic representation of the spoken descriptions. One of the benefits of this approach is that it requires less supervision to train, making it potentially cheaper and easier to implement when data is scarce. Since its inception, several network architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been used to implement the grounded speech model (Noda et al., 2015; Harwath et al., 2016; Gelderloos & Chrupala, 2016; Chaabouni et al., 2017). As of yet, one of the most successful implementations has been the multi-layer gated recurrent neural network (GRNN) constructed by Chrupala et al. (2017), which showed significant improvements over the convolutional neural network implementation used by Harwath and Glass (2015).

However, with the rise of deep learning techniques and their accompanying success, more and more questions are being raised about the inner workings of these models. Many works argue for greater accountability of machine learning systems, as well as more interpretable neural networks (Lipton, 2016; Doshi-Velez et al., 2017). Despite the recognized importance of these issues, the ability to interpret and explain the NLP predictions of neural networks is still a work in progress (Belinkov & Glass, 2019).

Furthermore, there is a growing field which uses analyses of these models to examine the similarities between neural networks and brain activity. At this intersection of neuroscience, psycholinguistics, and deep learning, research investigates whether, and how well, certain neural networks mimic the workings of the brain. In this emerging field, analysing the internal workings of neural networks could offer non-invasive ways to investigate the brain, as well as give more insight into linguistic theories and the deep learning models themselves (Ter Schure et al., 2016; Chaabouni et al., 2017; Bashivan et al., 2019).

Both the work of Chrupala et al. (2017) and subsequent research by Alishahi, Barking, and Chrupala (2017) have made
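To make the grounded speech setup concrete, the sketch below shows how such a joint semantic space can be learned in a PyTorch-style implementation. It is a minimal illustration, not the exact architecture of Chrupala et al. (2017): the layer sizes, the mean-pooling over GRU states, and the single-direction margin loss are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedSpeechSketch(nn.Module):
    """Minimal sketch of a visually grounded speech model: a multi-layer
    GRU encodes spoken captions (e.g., MFCC frames), and a linear layer
    projects precomputed CNN image features into the same joint space."""

    def __init__(self, n_mfcc=13, img_dim=4096, joint_dim=512, n_layers=4):
        super().__init__()
        self.speech_rnn = nn.GRU(n_mfcc, joint_dim,
                                 num_layers=n_layers, batch_first=True)
        self.img_proj = nn.Linear(img_dim, joint_dim)

    def encode_speech(self, mfcc):            # (batch, frames, n_mfcc)
        states, _ = self.speech_rnn(mfcc)
        return F.normalize(states.mean(dim=1), dim=1)  # pool over time

    def encode_image(self, img_feats):        # (batch, img_dim)
        return F.normalize(self.img_proj(img_feats), dim=1)

def contrastive_loss(speech_emb, img_emb, margin=0.2):
    """Margin-based ranking loss: a matching image-caption pair should
    score higher than mismatched pairs drawn from the same batch."""
    scores = speech_emb @ img_emb.t()             # cosine similarities
    positives = scores.diag().unsqueeze(1)
    cost = (margin + scores - positives).clamp(min=0)
    cost.fill_diagonal_(0)                        # ignore the true pairs
    return cost.mean()
```

Note that the loss is driven purely by which captions belong to which images; no transcriptions enter the training objective, which is what makes this form of learning weakly supervised.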
