Deep Learning for Classification of DNA Functional Sequences

Word count: 29363

Griet De Clercq
Student number: 01103351

Promoters: Prof. Dr. Ir. Joni Dambre, Prof. Dr. Wesley De Neve, Prof. Dr. Willem Waegeman
Supervisor: Jasper Zuallaert

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Bioinformatics: Bioscience Engineering

Academic Year: 2018 - 2019

Preface

This master thesis was a long time coming. Three years to be exact, due to suffering a depression at the end of 2016, which caused me to temporarily put my studies on hold. I managed to overcome my condition, and am immensely proud to finally present the finished result of these past three years.

I would first like to thank my promoters Prof. Dr. Wesley De Neve and Prof. Dr. Willem Waegeman for providing me with guidance during my thesis years. I especially want to thank my head promoter Prof. Dr. Ir. Joni Dambre, for giving me the opportunity to finish my chosen subject, her never-ending contribution of expert knowledge, her patience, and for taking on the role of both promoter and supervisor. I would also like to thank Prof. Dr. Yvan Saeys for supplying the subject for this thesis, and Lionel Pigou and Jasper Zuallaert for answering every possible question I ever shot their way.

A round of applause is in order for my loving parents, who gave me the opportunity of pursuing an education at Ghent University, who believed in me year after year, and who allowed me the freedom I needed to find myself again. Another series of thanks goes out to my friends and family: Melissa and Sofie for the consistent encouragement and the tummy-aching laugh sessions, Bram for always being his level-headed and calm self, Michiel and Francis for reaching out to me when times were difficult, and my brother and sister for repeatedly providing a loving and welcome home to visit whenever I wanted.

Above all, I would like to thank my boyfriend Mathieu for tolerating all of my shenanigans throughout the years, and for being the most handsome rubber duck I could ever think of during my hours-long coding sessions.

A very special thanks goes out to my therapists Nele Arbyn and Liselot Vangronsvelt. Even though I can already hear them say ‘in the end, you did all of this work yourself’, I cannot express my gratitude enough for all the support they provided. You guys truly worked miracles.

And although science hasn’t found a way to communicate with animals yet, I would still like to thank my cat Ellie, for being my ever-present fluffy companion during my deep learning endeavours, and my dog Mona, for always being her overly enthusiastic self when I needed hugs the most.

Griet De Clercq
Ghent, 26 August 2019

Table of Contents

List of Figures
List of Tables
List of Abbreviations
List of Symbols
Abstract
    English
    Nederlands
1 Introduction
    1.1 Problem statement
    1.2 Aims
2 Biology of the genome
    2.1 DNA and its structure
    2.2 The reference genome
    2.3 Gene expression and RNA
    2.4 Genes and their genomic elements
        2.4.1 Splice sites
        2.4.2 Promoter and enhancer regions
        2.4.3 Other regulatory elements
3 Deep learning
    3.1 Origins within machine learning
        3.1.1 Data fitting and splitting
    3.2 Neural networks
        3.2.1 Input data
        3.2.2 General layout
        3.2.3 Loss functions and their minimisation
            Stochastic gradient descent and backpropagation
            Vanishing and exploding gradients
        3.2.4 Activation functions
        3.2.5 Recognising and solving over- and underfitting
        3.2.6 Hyperparameters
    3.3 Convolutional neural networks
        3.3.1 Convolutional layer
        3.3.2 Pooling layer
        3.3.3 Following layers
4 State of the art
    4.1 Splice site prediction
    4.2 Promoter prediction
5 Materials and methods
    5.1 Datasets - Splice sites
        5.1.1 Arabidopsis thaliana
    5.2 Datasets - Promoters
        5.2.1 Arabidopsis thaliana positive sample set
        5.2.2 Homo sapiens positive sample set
        5.2.3 Negative sample construction
            Balanced set
            Conserved set
    5.3 Preprocessing of the data and its labels
        5.3.1 Data splitting
            Stratified split
            Grouped split
        5.3.2 Data augmentation
        5.3.3 Data encoding
            One-hot encoding
            K-mer encoding
        5.3.4 Label encoding
    5.4 Deep neural networks
        5.4.1 Splice site prediction
            One-hot encoded data
            K-mer encoded data
        5.4.2 Promoter prediction
            One-hot encoded data
            K-mer encoded data
        5.4.3 Performance measures
    5.5 Post-processing of the results
    5.6 Soft- and hardware
6 Results and discussion
    6.1 Splice site prediction
        6.1.1 One-hot encoding: establishing baseline results by Zuallaert et al.
        6.1.2 K-mer encoding
            K-mer histograms
            K-mer network
    6.2 Promoter prediction
        6.2.1 K-mer histograms
        6.2.2 Performance of the ARAprom- and HOMpromnet models
            ARApromnet
            HOMpromnet
7 Conclusion
    7.1 Future perspectives
References
Appendices
    A ARAsplice assembly as described by Degroeve et al.
    B ARAprom and HOMprom assembly pipelines by EPDnew
        B.1 ARAprom dataset
        B.2 HOMprom dataset
    C Python package versions
    D Sequence logos and how to interpret them
    E Effect of dropout on the loss during training
    F K-mer histograms promoter datasets
        F.1 ARAPROM: conserved
        F.2 HOMPROM: balanced
        F.3 HOMPROM: conserved
    G Negative promoter construction: unbalanced approach

List of Figures

2.1 Structure of a nucleotide and DNA helix
2.2 Schematic overview of the eukaryotic protein gene expression process
2.3 Structural blocks within DNA
3.1 Machine learning model fitting workflow using holdout validation
3.2 Example of k-fold CV with k = 5
3.3 Levels of abstraction in a deep learning algorithm
3.4 Visual representation of a single neuron in an NN
3.5 Fully connected feedforward DNN with three hidden layers
3.6 Loss plots for the MSE and cross-entropy loss functions
3.7 Activation functions typically used in NNs
3.8 Three different DNN architectures run on the same dataset, with the train and validation loss plotted after training for 70 epochs
3.9 High-level overview of a convolutional neural network (CNN) for use with one-hot encoded desoxyribonucleic acid (DNA) data
3.10 Inner workings of a convolution layer (Karpathy et al., 2016)
3.11 Visualisation of the application of a max pooling layer onto a single depth slice (Karpathy et al., 2016)
4.1 Two of the first NNs used for splice site prediction
4.2 Growth of added DNA sequences in both the GenBank and WGS databases
5.1 Class distribution within the A. thaliana splice dataset
