Deep Learning in Classifying Cancer Subtypes, Extracting

DEEP LEARNING IN CLASSIFYING CANCER SUBTYPES, EXTRACTING RELEVANT GENES AND IDENTIFYING NOVEL MUTATIONS A thesis submitted in fulfilment of the requirements for the degree of Master of Engineering Raktim Kumar Mondol (B.Sc. in Electrical and Electronic Engineering, BRAC University) School of Engineering College of Science, Engieering and Health RMIT University Melbourne, Australia February 2019 ii DECLARATION I certify that, except where due acknowledgment has been made, the work pre- sented in this thesis is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; any editorial work, paid or unpaid, carried out by a third party is acknowledged; and, ethics procedures and guidelines have been followed. I acknowledge the support I have received for my research through the provision of an Australian Government Research Training Pro- gram Scholarship. Raktim Kumar Mondol 10/01/2019 iii ACKNOWLEDGMENTS It is a pleasure to acknowledge my appreciation to the people who have made this thesis possible. I wish to express my sincere gratitude to my supervisors Dr. Omid Kavehei, Dr. Samuel Ippolito and Dr. Reza Bonyadi for their continued guidance, motivation and support during my research. I would like to thank them for their patience, trust and encouragement throughout my masters. Furthermore, I would like to thank Dr. Esmaeil Ebrahimie from University of Adelaide for collaborating in my research and providing me critical feedback. I would also like to extend my sincere thanks to my fellow PhD colleague Nhan Duy Truong for guidance in my project with his kind words. Finally, I would also like to thank my family for their support and understanding during this period. iv PREFACE This dissertation is original and unpublished work by the author, R.K. Mondol. v TABLE OF CONTENTS Page LIST OF TABLES :::::::::::::::::::::::::::::::::: viii LIST OF FIGURES ::::::::::::::::::::::::::::::::: ix ABBREVIATIONS :::::::::::::::::::::::::::::::::: xi NOMENCLATURE ::::::::::::::::::::::::::::::::: xiv ABSTRACT ::::::::::::::::::::::::::::::::::::: xvi 1 Introduction :::::::::::::::::::::::::::::::::::: 1 1.1 Background ::::::::::::::::::::::::::::::::: 1 1.2 Problem Statement ::::::::::::::::::::::::::::: 2 1.3 Scope and Rationale of the research :::::::::::::::::::: 3 1.4 Objectives of the Research ::::::::::::::::::::::::: 4 1.5 Research Questions ::::::::::::::::::::::::::::: 4 1.5.1 Research Question 1 :::::::::::::::::::::::: 4 1.5.2 Research Question 2 :::::::::::::::::::::::: 5 1.5.3 Research Question 3 :::::::::::::::::::::::: 6 1.6 Structure of the Thesis ::::::::::::::::::::::::::: 6 2 Literature Review ::::::::::::::::::::::::::::::::: 8 2.1 Overview ::::::::::::::::::::::::::::::::::: 8 2.2 Deep Learning Theoretical Background :::::::::::::::::: 8 2.2.1 Artificial Neural Network & Various types of Data Mining Methods 8 2.2.2 Types of Neural Networks ::::::::::::::::::::: 11 2.2.3 Training Neural Network using Backpropagation ::::::::: 12 2.3 Bioinformatics Theoretical Background :::::::::::::::::: 14 2.3.1 Background ::::::::::::::::::::::::::::: 14 2.3.2 Molecular biology :::::::::::::::::::::::::: 14 2.3.3 Various Pipelines in Bioinformatics :::::::::::::::: 16 vi Page 2.3.4 Gene Ontology and Pathway analysis ::::::::::::::: 19 3 Feature Extraction using Adversarial Autoencoder ::::::::::::::: 21 3.1 Overview ::::::::::::::::::::::::::::::::::: 21 3.2 Challenges :::::::::::::::::::::::::::::::::: 22 3.3 Data Collection ::::::::::::::::::::::::::::::: 23 3.4 Evaluating Performance of AAE using Classification ::::::::::: 24 3.4.1 Background ::::::::::::::::::::::::::::: 24 3.4.2 Methodology :::::::::::::::::::::::::::: 25 3.4.3 AAE model implementation :::::::::::::::::::: 26 3.4.4 Performance Metrics of Classification ::::::::::::::: 27 3.4.5 Results :::::::::::::::::::::::::::::::: 29 3.4.6 Discussion :::::::::::::::::::::::::::::: 32 3.5 Evaluating Performance of AAE by Analyzing Connectivity Matrices : 33 3.5.1 Overview :::::::::::::::::::::::::::::: 33 3.5.2 Methodology :::::::::::::::::::::::::::: 35 3.5.3 AAE model Implementation :::::::::::::::::::: 37 3.5.4 Results :::::::::::::::::::::::::::::::: 38 3.5.5 Discussion :::::::::::::::::::::::::::::: 40 3.6 Chapter Summary ::::::::::::::::::::::::::::: 41 4 Identification of Novel Mutation using Bioinformatics Pipelines :::::::: 42 4.1 Overview ::::::::::::::::::::::::::::::::::: 42 4.2 Data Collection ::::::::::::::::::::::::::::::: 43 4.3 Methodology :::::::::::::::::::::::::::::::: 44 4.4 Results & Discussion :::::::::::::::::::::::::::: 45 4.5 Gene Ontology (GO) & Protein-protein interaction (PPI) Analysis ::: 52 4.6 Chapter Summary ::::::::::::::::::::::::::::: 53 5 Conclusions and Future Work :::::::::::::::::::::::::: 54 5.1 Conclusions ::::::::::::::::::::::::::::::::: 54 vii Page 5.2 Recommendation for future Work ::::::::::::::::::::: 55 REFERENCES :::::::::::::::::::::::::::::::::::: 59 A Algorithms ::::::::::::::::::::::::::::::::::::: 69 B Tables ::::::::::::::::::::::::::::::::::::::: 71 C Figures ::::::::::::::::::::::::::::::::::::::: 77 D Codes ::::::::::::::::::::::::::::::::::::::: 79 D.1 AAE Architechture ::::::::::::::::::::::::::::: 79 D.2 Fine Tuning ::::::::::::::::::::::::::::::::: 81 D.3 Extract Weight form Latent Space :::::::::::::::::::: 83 D.4 Analyze and sort Weight Matrix :::::::::::::::::::::: 84 viii LIST OF TABLES Table Page 3.1 Summary of proposed AAE architecture. Each of the network in AAE contains one hidden layer with 1000 neurons. To train the model, adadelta optimizer is used with learning rate of 1 and binary cross entropy is used as a loss function. :::::::::::::::::::::::::::::::: 29 3.2 Results of Gene enrichment analysis using various feature extraction methods on BRCA cancer data-set. ::::::::::::::::::::::::: 39 4.1 Eight leukemia data samples with their corresponding sample types are described below. :::::::::::::::::::::::::::::::: 44 4.2 List of novel variants obtained from annotated data ::::::::::::: 50 B.1 Specification of computer used in the experiment :::::::::::::: 71 B.2 Comparison of various feature extraction techniques using BRCA dataset with twelve different classifiers. Five-fold cross validation were performed during evaluation. :::::::::::::::::::::::::::::::: 72 B.3 Benchmarking of various feature extractor while classifying various cancer subtypes using BRCA dataset. ::::::::::::::::::::::::: 73 B.4 Benchmarking of various feature extractor while extracting biologically relevant genes using BRCA dataset. :::::::::::::::::::::: 73 B.5 Results of pathway analysis using various feature extraction methods on BRCA data-set. ::::::::::::::::::::::::::::::::: 74 B.6 Results of gene enrichment analysis using various feature extraction methods further validated on UCEC data-set. ::::::::::::::::::: 74 B.7 The novel variants with associated cancer and their primary expression :: 75 B.8 Results of GeneOntology (GO) enrichment analysis using variants genes are shown. :::::::::::::::::::::::::::::::::::: 76 ix LIST OF FIGURES Figure Page 1.1 The cost of genome sequencing is decreasing since the last decade [1]. :: 1 1.2 Nucleotide sequence, mass spectrometry, and microarray data keep in- creasing [3]. :::::::::::::::::::::::::::::::::: 2 2.1 Biological neuron compared with the artificial neural network [7]. ::::: 9 2.2 Multilayer perceptron with input layer, one hidden layer, and output layer [15].11 2.3 Autoencoder reconstruct actual input to its output layer and encoded data stored in the hidden layer [16]. ::::::::::::::::::::::::: 13 2.4 Structure of a gene includes a promoter region, RNA-coding region and terminator sites [26]. :::::::::::::::::::::::::::::: 16 2.5 DNA transcribed into RNA, and this transcript may then be translated into protein [28]. :::::::::::::::::::::::::::::::: 16 2.6 Pipeline for differential gene analysis using RNA-Seq data [31]. ::::::: 17 2.7 Pipeline for variant analysis using DNA-Seq data [33]. ::::::::::: 18 3.1 Architecture of AAE for classifying cancer subtypes. :::::::::::: 26 3.2 Various classifiers such as DT, GB, KNN etc. are used to evaluate the performance of feature extraction methods such as AAE, PCA, AE, VAE and DAE. (a) Precision score is compared among feature extraction methods using twelve different classifiers. (b) Recall score is compared among feature extraction methods using twelve different classifiers. :::::::: 30 3.3 Performance of various feature extracting methods in terms of their com- putation time. :::::::::::::::::::::::::::::::::: 31 3.4 Proposed AAE is compared with other methods in terms of seven performance metrics such as accuracy, F1-score, recall, precision, AUC, MCC and Kappa. ::::::::::::::::::::::::::::::::::: 32 3.5 Block diagram for weight matrix analysis using AAE architecture. :::: 36 3.6 Histogram of AAE weights after applying TopGene algorithm. ::::::: 40 x Figure Page 4.1 In this block diagram, the whole pipeline for variant analysis are shown. First, in the pre-processing steps quality of the data are observed and trim the data to improve the quality. Then data is aligned with reference human genome. Next, PCR duplicates are removed before variant calling. Next, important variants are obtained through filtering. After that, fil- tered variants are annotated with reference variant annotation databases. Finally, data is further refined

Deep Learning in Classifying Cancer Subtypes, Extracting

Genome-Wide Identification and Analysis of Prognostic Features in Human Cancers

Characterization of a Monoclonal Antibody Panel Shows That the Myotonic Dystrophy Protein Kinase, DMPK, Is Expressed Almost Exclusively in Muscle and Heart

WO 2012/174282 A2 20 December 2012 (20.12.2012) P O P C T

In Silico Analysis of IDH3A Gene Revealed Novel Mutations Associated with Retinitis Pigmentosa

A High Throughput, Functional Screen of Human Body Mass Index GWAS Loci Using Tissue-Specific Rnai Drosophila Melanogaster Crosses Thomas J

Leveraging Tissue Specific Gene Expression Regulation to Identify Genes Associated with Complex Diseases

Searching for Candidate Genes for Male Infertility

Exome Sequencing of 457 Autism Families Recruited Online Provides Evidence for Autism Risk Genes

Content Based Search in Gene Expression Databases and a Meta-Analysis of Host Responses to Infection

Methylation of the G-C Rich Dmpk Gene: Causation For

Alsaai Udel 0060D 14351.Pdf

Supplementary Data