
Examining applications of Neural Networks in predicting polygenic traits

McMaster University

Mu Tian

Date

Abstract

Polygenic risk scores are used in precision medicine to assess an individual's risk of having a certain quantitative trait based on his or her genetics. Previous work has shown that machine learning, namely Gradient Boosted Regression Trees (GBRT), can be successfully applied to calibrate the weights of the risk score and improve its predictive power in a target population. Neural networks are a powerful class of machine learning algorithms that have demonstrated success in various fields of genetics, and in this work we examine the predictive power of a polygenic risk score that uses neural networks to perform the weight calibration. Using a single neural network, we obtained prediction $R^2$ of 0.234 for height and 0.074 for BMI. We further experimented with changing the dimension of the input features, ensembling models, and varying the number of splits used to train the models, obtaining a final prediction $R^2$ of 0.242 for height and 0.0804 for BMI, a relative improvement of 1.26% in prediction $R^2$ for height. Furthermore, we performed extensive analysis of the behaviour of the neural network-calibrated weights. In this analysis we highlight several potential drawbacks of using neural networks, and machine learning algorithms in general, for weight calibration, and offer several suggestions for improving the consistency and performance of machine learning-calibrated weights in future research.

Acknowledgements

First and foremost, I want to thank my amazing girlfriend Margaret for being so supportive throughout the entire research process. Secondly, I want to thank my supervisor, Dr. Canty, for all the support and insights offered over the course of the research, Dr. Paré and Shihong Mao for their assistance in setting up the experiments, as well as Dr. Bolker for being a part of the defense committee.

Contents

1 Introduction
1.1 Background and definitions
1.2 Genome Wide Association Study
1.3 Applications of GWAS

2 Polygenic Risk Score
2.1 Pruning and Thresholding
2.2 Applications of Polygenic Risk Scores
2.3 Penalized Regression Models
2.4 LDpred
2.5 Machine Learning in Predicting Polygenic Traits
2.6 A Neural Network approach to calibrating weights

3 Neural Networks - Background
3.1 Structure
3.2 Perceptron
3.3 Model Training
3.4 Optimizing neural networks
3.5 Batch Training
3.6 Adaptive Optimizers
3.7 Model Initialization
3.8 Neural Network Regularization
3.9 Learning Rate Scheduling
3.10 Gradient Clipping
3.11 Neural Networks in Genetics

4 Data and Experiments
4.1 Data
4.2 Model Fitting
4.3 Error Report
4.4 Further Model Improvements
4.5 Model Fitting with Additional Data
4.5.1 Analysis of Increased Variance in Neural Networks
4.6 Model Ensembling
4.6.1 Effects of Ensembled Models on Variability of $\hat{w}$
4.7 Summary of Results

5 Further Analysis of Fitted Models
5.1 Distinct Clusters of $\hat{w}$
5.2 Effect of $\hat{\beta}_{ext} = 0$ in the PRS
5.3 Difference in Variability in Height and BMI
5.4 Effect of Different Splits on the PRS
5.5 Using no Splits of the Data to train the NN
5.6 Interpretability of Neural Networks
5.7 Issues With Relying on an External GWAS
6 Conclusion and Future Directions for Research
6.1 Conclusion
6.2 Further Model Ensembling
6.3 Novel Model Architectures
6.4 Convolutional and Recurrent Neural Networks
6.5 Sequence to Sequence Networks

A Appendix
A.1 Common Activation Functions
A.2 Common Loss Functions
A.3 Optimization Algorithms
A.4 Batch Normalization Algorithm
A.5 Batch Normalization Backpropagation
A.6 How Model Averaging Reduces Variance

List of Figures

3.1 Simple Neural Network Diagram
3.2 Perceptron diagram
4.1 Plot of $\hat{\beta}_{ext}$ for height and BMI
4.2 Plot of $\hat{w}$ against $\hat{\beta}_{ext}$ for both height and BMI under the incorrect formula
4.3 Plot of $\hat{w}$ against $\hat{\beta}_{ext}$ for both height and BMI under the correct formula
4.4 Plot of training and validation loss for height
4.5 Plot of training and validation loss for BMI
4.6 Mean and standard deviation of $\hat{w}$ (stage 1 features)
4.7 Mean and standard deviation of $\hat{w}$ (univariate features)
4.8 Mean and standard deviation of $\hat{w}$ across 5 ensembled models
5.1 Plot of mean vs standard deviation of each set of $\hat{w}$ for height across 20 runs
5.2 Plot of mean vs standard deviation of each set of $\hat{w}$ for BMI across 20 runs
5.3 Box plot of each distinct set's mean across 20 runs of the experiment
5.4 Box plot of each distinct set's standard deviation across 20 runs of the experiment
5.5 Plot of mean vs standard deviation of each set of $\hat{w}$ for height across 20 runs after removing the weights where $\hat{\beta}_{ext} = 0$
5.6 Plot of mean vs standard deviation of each set of $\hat{w}$ for BMI across 20 runs after removing the weights where $\hat{\beta}_{ext} = 0$
5.7 Histogram of 0's in the LD adjustment window
5.8 Box plot of $\hat{\beta}_{ext}$ across splits
5.9 Plot of $\hat{\beta}_{ext}$ against standard deviation of $\hat{w}$ across 20 runs of the experiment
5.10 Plot of $d$ against $\hat{d}$ (Single Model)
5.11 Plot of $d$ against $\hat{d}$ (Ensemble)
5.12 Plot of $d$ against $\hat{d}$ (2 Splits)
5.13 Plot of $d$ against $\hat{d}$ (10 Splits)
5.14 Plot of $d$ against $\hat{d}$ using GBRT
5.15 Plot of $d$ against $\hat{d}$ (No Splits)
5.16 Plot of mean against sd of $\hat{w}$ across 20 runs (No Splits)

List of Tables

4.1 Initial test results of $R^2$ under different model architectures
4.2 $R^2$ Values of PRS using different NN architectures and correct $\hat{w}$ formula
4.3 $R^2$ Values of PRS under different training times
4.4 Training MSE for the 1-Layer NN
4.5 $R^2$ Values of PRS under different optimizers
4.6 $R^2$ Values of different PRS without LD adjustment
4.7 $R^2$ Values of different PRS with LD adjustment
4.8 $R^2$ Values of neural networks with extra features
4.9 $R^2$ Values of different PRS generated by Ensembles after LD adjustment
4.10 $R^2$ Values of different PRS compared to machine learning adjusted PRS
5.1 $R^2$ Values of PRS generated by the neural network trained without 0's
5.2 $R^2$ Values of different PRS generated by the neural network with different splits of the data
5.3 $R^2$ Values of different PRS generated by the neural network with no splits

Chapter 1

Introduction

Polygenic risk scores are used to assess an individual's risk of having a certain quantitative trait based on his or her genetics, and are widely used in precision medicine. Currently, a polygenic risk score is built from regression coefficients from an external genome wide association study (GWAS); often, some form of adjustment is then applied to calibrate the risk score and improve its predictive power. Machine learning, a widely popular set of computational methods aimed at solving big-data problems, has shown promising results in calibrating polygenic risk scores to improve their predictive potential. Paré et al. (2017) showed that by using Gradient Boosted Regression Trees (GBRT) to adjust the weights of the polygenic risk score, they were able to achieve an improvement in $R^2$ over currently used methods of computing the polygenic risk score.

While Paré et al. (2017) achieved improved prediction $R^2$ over other methods, we hypothesize that it may be possible to improve upon their results further. Gradient boosting is a powerful class of machine learning algorithms, but with recent advances in deep learning, neural networks (NN) are capable of matching, and sometimes exceeding, the predictive power of gradient boosted trees (De Brébisson et al., 2015; Guo and Berkhahn, 2016). In this work, we examine and compare the performance of a neural network calibrated polygenic risk score that incorporates recent advances and current best practices in the field.
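To make the construction concrete, the following minimal Python sketch computes a risk score from external GWAS coefficients and evaluates its prediction $R^2$. All inputs here (the genotype matrix G, the coefficient vector beta_ext, and the phenotype y) are simulated stand-ins rather than the data used in this work, and prediction $R^2$ is taken here as the squared correlation between the score and the trait. A calibration method such as GBRT or a neural network would replace beta_ext with calibrated weights $\hat{w}$ before the score is formed.

import numpy as np

# Simulated stand-ins: G holds allele counts (0, 1, or 2) for n individuals
# at m SNPs, beta_ext holds the external GWAS coefficients, and y is the
# observed quantitative trait in the target population.
rng = np.random.default_rng(42)
n, m = 1000, 500
G = rng.integers(0, 3, size=(n, m)).astype(float)
beta_ext = rng.normal(0.0, 0.05, size=m)
y = G @ beta_ext + rng.normal(0.0, 1.0, size=n)  # toy phenotype with noise

# The polygenic risk score for each individual is the sum of that
# individual's allele counts weighted by the external coefficients.
prs = G @ beta_ext

# Prediction R^2: squared correlation between the score and the trait.
r2 = np.corrcoef(prs, y)[0, 1] ** 2
print(f"prediction R^2 = {r2:.3f}")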
1.1 Background and definitions

The study of genetics begins with DNA. Deoxyribonucleic acid (DNA) is a molecule, typically found in the nucleus of a cell, that carries the genetic information of a living organism. The genetic information stored in DNA codes for cellular growth, function, and reproduction. DNA is made up of two strands of molecules called nucleotides, coiled together to form a double helix structure. Each nucleotide is made up of one of four types of nitrogen bases: adenine (A), thymine (T), guanine (G), and cytosine (C), combined with a sugar molecule and a phosphate molecule. Due to the Watson-Crick pairing rules, which state that adenine can pair only with thymine, and guanine only with cytosine, each strand of the DNA contains the exact same genetic information, with different coding. A sequence of DNA that produces a protein or other functional element is referred to as a gene.
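As a small illustration of the pairing rules, the following sketch derives the complementary strand for an arbitrary example sequence (the sequence itself is made up for illustration):

# Watson-Crick pairing: A pairs with T, and G pairs with C, so one strand
# fully determines the other.
PAIRING = str.maketrans("ATGC", "TACG")

def complement(strand: str) -> str:
    """Return the base-paired complement of a DNA strand."""
    return strand.translate(PAIRING)

print(complement("GATTACA"))  # prints CTAATGT

Because the mapping between bases is one-to-one, either strand recovers the full genetic information, which is the sense in which the two strands carry the same information with different coding.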