Recurrent Neural Networks for Structured Data
by Trang Thi Minh Pham, BSc. (Honours)

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Deakin University
June 2018

Acknowledgements

I would first like to express my deepest appreciation to my principal supervisor A/Prof. Truyen Tran for his constant guidance, support and encouragement throughout my research. I have been extremely privileged to have an eminent supervisor with deep insight, critical thinking and wisdom, who has taught me valuable knowledge and research skills and instilled in me a professional work ethic.

It is my pleasure to acknowledge my co-supervisor, Prof. Svetha Venkatesh, for giving me the opportunity to undertake research at PRaDA and for her valuable suggestions and guidance. I deeply appreciate the discussions, writing classes and her valuable advice on my career path and life. I would also like to express my gratitude to Prof. Dinh Phung for spending time providing constructive feedback and making suggestions on my papers.

I would like to thank Michele Mooney, my husband Thanh Nguyen and my teammates Hung Le and Kien Do for proofreading this thesis. My thanks also go to my friends – Phuoc, Tung, Hung and Kien – for the times of leisure shared together that helped me to overcome some difficult moments; and to all the PRaDA members for providing an encouraging and friendly working atmosphere.

This thesis is dedicated to my family: my parents, my sister and my husband, for being there when I needed them most, and to my aunt – my father’s sister – who taught me the value of study.

Contents

Acknowledgements
Abstract
Relevant Publications
Notation

1 Introduction
  1.1 Motivations
  1.2 Aims and Scope
  1.3 Significance and Contribution
  1.4 Thesis Structure

2 Related Background
  2.1 Overview of Neural Networks
    2.1.1 Model description
    2.1.2 Activation functions
    2.1.3 Training neural networks
    2.1.4 Hyper-parameter tuning
  2.2 Regularisation
    2.2.1 Parameter norm penalties
    2.2.2 Early stopping
    2.2.3 Dropout
  2.3 Embedding
  2.4 Structured Data
  2.5 Closing Remarks

3 Recurrent Neural Networks
  3.1 RNN Family
    3.1.1 Vanilla RNNs
    3.1.2 RNNs for different settings
    3.1.3 Encoder-decoder RNNs
    3.1.4 Bidirectional RNNs
    3.1.5 Deep RNNs
    3.1.6 Recursive neural networks
    3.1.7 Challenges and solutions
  3.2 Gated RNNs
    3.2.1 Long Short-Term Memory
    3.2.2 Gated Recurrent Units
    3.2.3 Highway Networks
  3.3 p-norm Gating Mechanism
    3.3.1 p-norm gating
    3.3.2 Behaviour of the p-norm gates
  3.4 Attention Mechanism
    3.4.1 Attention model
    3.4.2 Hierarchical attention
    3.4.3 Attention in different learning tasks with RNNs
  3.5 Memory-Augmented Neural Networks
    3.5.1 End-to-End Memory Networks
    3.5.2 Neural Turing Machine
    3.5.3 Key-Value Memory Networks
  3.6 Applications
  3.7 Closing Remarks

4 RNNs for Episodic Intervening Data
  4.1 Introduction
  4.2 Related Background
    4.2.1 Electronic Medical Records
    4.2.2 Existing models
  4.3 DeepCare Model
    4.3.1 Model overview
    4.3.2 Representing variable-size admissions
    4.3.3 C-LSTM unit
    4.3.4 Trajectory prediction
    4.3.5 Model training
    4.3.6 Model complexity
    4.3.7 Pretraining and regularisation
  4.4 Case Studies on Chronic Diseases
    4.4.1 Data
    4.4.2 Experiments and Results
  4.5 Discussion
    4.5.1 DeepCare as a model of healthcare memory
    4.5.2 Limitations
  4.6 Closing Remarks

5 Networked RNNs for Relational Domain
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Collective classification in multi-relational setting
    5.2.2 Stacked learning
  5.3 Column Networks
    5.3.1 Architecture
    5.3.2 Highway Network as mini-column
    5.3.3 Parameter sharing for compactness
    5.3.4 Capturing long-range dependencies
    5.3.5 Training with mini-batch
  5.4 Applications
    5.4.1 Baselines
    5.4.2 Experiment settings
    5.4.3 Software delay prediction
    5.4.4 PubMed publication classification
    5.4.5 Film genre prediction
  5.5 Related Work
  5.6 Discussion
  5.7 Closing Remarks

6 RNNs for Multi-X Learning
  6.1 Introduction
  6.2 Background
    6.2.1 Multi-instance learning
    6.2.2 Multi-view learning
    6.2.3 Multi-label learning
    6.2.4 Multiple multi-X learning
  6.3 Multi-X Modular Networks
    6.3.1 Architectural overview
    6.3.2 Module structure and interaction
    6.3.3 MXM network for different multi-X settings
    6.3.4 Handling large-scale data
  6.4 Experiments and Results
    6.4.1 Model implementation
    6.4.2 Datasets
    6.4.3 Multi-inputs: multi-view and multi-instance learning
    6.4.4 Multi-outputs: multi-label learning
    6.4.5 Multi-inputs+Multi-outputs: MV-ML and MI-ML
  6.5 Discussion
  6.6 Closing Remarks

7 RNNs for Graphs
  7.1 Introduction
  7.2 Related Background
    7.2.1 Graph modelling
    7.2.2 Molecular activity and interaction prediction
  7.3 Virtual Column Networks
    7.3.1 Definition and notation: Multi-relational graphs
    7.3.2 The Virtual Column
  7.4 Graph Memory Networks
    7.4.1 The controller and attentive reading
    7.4.2 Graph-structured multi-relational memory
    7.4.3 Recurrent skip-connections
    7.4.4 GraphMN for multi-task learning
  7.5 Graph Memory Networks for Graph-Graph Interaction
    7.5.1 Multiple memories for multiple graphs
    7.5.2 Multiple attentions
  7.6 Applications
    7.6.1 VCNs for software vulnerability prediction
    7.6.2 GraphMN for molecular activity prediction
    7.6.3 GraphMN for chemical interaction prediction
  7.7 Discussion

8 Conclusions
  8.1 Summary
  8.2 Future Directions

A Supplementary
  A.1 Gradient computation
    A.1.1 Computing RNN gradients
    A.1.2 Computing LSTM gradients
    A.1.3 Computing DeepCare gradients

Bibliography

List of Figures

2.1 An FNN with two hidden layers.
2.2 An example of the learning curves of the training and validation sets over time.
2.3 Dropout randomly removes some units from the base network. (Left) The base network. (Right) Three examples of the base network after applying dropout. The dashed units and their connections are removed from the network.
3.1 A typical Recurrent Neural Network and (Right) an RNN unfolded in time that maps a sequence of input vectors to a sequence of labels. Each RNN unit at time step t reads an input x_t and the previous hidden state h_{t-1}, then generates an output a_t and predicts the label ỹ_t.
3.2 Recursive neural networks.
3.3 An LSTM unit that reads input x_t and the previous output state h_{t-1} and produces an output