On the Induction of Temporal Structure by Recurrent Neural Networks

MAHMUD SAAD SHERTIL

A thesis submitted in partial fulfilment of the requirements of Nottingham Trent University for the degree of Doctor of Philosophy

November 2014

Acknowledgment

I would like to express my full gratitude to my director of studies, Dr. Heather Powell, for her support, guidance and invaluable advice in meeting the requirements of this thesis. She was highly dedicated and provided the necessary critique of the research.

I would also like to extend my sincere thanks to my second supervisor, Dr. Jonathan Tepper. I could not have imagined having a better advisor and mentor for my PhD; his common sense, knowledge, perceptiveness, assistance and friendly support helped me with many issues during the research.

I dedicate this thesis to the soul of my mother, whom I never saw, and to the soul of my father. In addition, many thanks go to the beloved family who took care of me after I lost my mother.

Heartfelt thanks and deep respect go to my wife, who stood by me in both my sad and happy moments. I also thank her for her understanding and endless love throughout my studies. Even when I felt panicked or disturbed, she encouraged me and said: Go on! You have to continue in the doctoral task, which is similar to running a marathon.

My kind thanks also go to my lovely children, Salam, Salsabeel, Samaher and Mohammed; seeing them gives me the strength to achieve the best position I can, to make them proud of their father.

I would also like to acknowledge my brothers, sisters and friends for their constant encouragement in everything. I acknowledge all those who have prayed for me, guided me with wisdom, helped me with their kindness, and tolerated me out of their love.
Last but not least, my sincere thanks and appreciation go to my colleagues and friends in my native country, Libya, and to the PhD students in the UK, for their help and their wishes for the successful completion of this research.

Abstract

Language acquisition is one of the core problems in artificial intelligence (AI), and it is generally accepted that any successful AI account of the mind will stand or fall depending on its ability to model human language. Simple Recurrent Networks (SRNs) are a class of so-called artificial neural networks that have a long history in language modelling via learning to predict the next word in a sentence. However, SRNs have also been shown to suffer from catastrophic forgetting, a lack of syntactic systematicity and an inability to represent more than three levels of centre-embedding, due to the so-called 'vanishing gradients' problem. This problem is caused by the decay of past input information encoded within the error gradients, which vanish exponentially as additional input information is encountered and passed through the recurrent connections. That said, a number of architectural variations have been applied which may compensate for this issue, such as the nonlinear autoregressive network with exogenous inputs (NARX) and the multi-recurrent network (MRN). In addition, Echo State Networks (ESNs) are a relatively new class of recurrent neural network that do not suffer from the vanishing gradients problem and have been shown to exhibit state-of-the-art performance in tasks such as motor control, dynamic time series prediction and, more recently, language processing. This research re-explores the class of SRNs and evaluates them against the state-of-the-art ESN to identify which model class is best able to induce the underlying finite-state automaton of the target grammar implicitly through the next-word prediction task.
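As an illustrative aside, the exponential decay of error gradients described above can be sketched numerically. The following is a minimal sketch, not taken from the thesis: it assumes a single scalar tanh recurrent unit, and the weights `w` and `u`, the input sequence, and the function name `gradient_through_time` are all hypothetical choices made for illustration.

```python
import math


def gradient_through_time(w, u, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of a scalar tanh recurrent unit.

    The unit computes h_t = tanh(w * h_{t-1} + u * x_t). By the chain rule,
    d h_t / d h_0 is the product of the per-step factors w * (1 - h_t ** 2).
    """
    h = h0
    grad = 1.0
    magnitudes = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # d h_t / d h_{t-1}
        magnitudes.append(abs(grad))
    return magnitudes


# Assumed toy setting: recurrent weight below 1 and an alternating input.
inputs = [1.0, -1.0] * 10  # 20 time steps
mags = gradient_through_time(w=0.9, u=0.5, inputs=inputs)

for t in (1, 5, 10, 20):
    print(f"step {t:2d}: |dh_t/dh_0| = {mags[t - 1]:.3e}")
```

Because every per-step factor has magnitude at most `|w|`, the gradient with respect to the initial state is bounded by `|w|**t` and so shrinks at least geometrically whenever `|w| < 1`. A full SRN has the same structure, with a product of Jacobian matrices in place of the scalar factors.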
In order to meet its aim, the research analyses the internal representations formed by each of the different models and explores the conditions under which they are able to carry information about long-term sequential dependencies beyond what is found in the training data. The findings of the research are significant. They reveal that the traditional class of SRNs, trained with backpropagation through time, is superior to ESNs for the grammar prediction task. More specifically, the MRN, with its state-based memory of varying rigidity, is better able to learn the underlying grammar than any other model. An analysis of the MRN's internal state reveals that this is due to its ability to maintain a constant variance within its state-based representation of the embedded aspects (or finite state machines) of the target grammar. The investigations show that, in order to successfully induce complex context-free grammars directly from sentence examples, not only are hidden-layer and output-layer recurrency required, but so is self-recurrency on the context layer, which enables varying degrees of current and past state information to be integrated over time.

Contents

Acknowledgment
Abstract
Contents
List of Figures
List of Tables
Acronyms

Chapter 1
1. Introduction
  1.1 Summary
  1.2 Problem Statement
  1.3 Scope of Research
  1.4 Thesis Outline

Chapter 2
2. Literature Study
  2.1 The Nature and Complexity of Language
    2.1.1 Nativist vs. Empiricist Perspectives
    2.1.2 Language Complexity and Computation
  2.2 Connectionist and Statistical Models of Language Acquisition
    2.2.1 Supervised Connectionist Learning Algorithms
    2.2.2 Supervised Connectionist Models of Language Acquisition
  2.3 Limitations of Connectionism
    2.3.1 Argument against Biological Plausibility
    2.3.2 Argument against Connectionism for Developmental Cognitive Modelling
    2.3.3 Learning Deterministic Representations Using a Continuous State Space
  2.4 Discussion and Conclusion

Chapter 3
3. Neural Network Architectures
  3.1 Recurrent Neural Networks
    3.1.1 Jordan Network
    3.1.2 Time Delay Neural Recurrent Network (TDNN)
    3.1.3 Nonlinear Autoregressive Network with Exogenous Input (NARX)
    3.1.4 Simple Recurrent Networks (SRN)
    3.1.5 Multi Recurrent Networks (MRN)
    3.1.6 Long Short Term Memory (LSTM)
    3.1.7 Echo State Networks (ESNs)
  3.2 Summary

Chapter 4
4. Data and Methodology
  4.1 The Reber Grammar Datasets
    4.1.1 Symbol Representations
    4.1.2 The Regular Grammar: Simple Reber Grammar
      4.1.2.1 Reber Grammar Dataset
