DEEP LEARNING IN CLASSIFYING CANCER SUBTYPES, EXTRACTING

RELEVANT GENES AND IDENTIFYING NOVEL MUTATIONS

A thesis submitted in fulfilment of the requirements for the degree

of

Master of Engineering

Raktim Kumar Mondol

(B.Sc. in Electrical and Electronic Engineering, BRAC University)

School of Engineering

College of Science, Engineering and Health

RMIT University

Melbourne, Australia

February 2019

DECLARATION

I certify that, except where due acknowledgment has been made, the work presented in this thesis is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; any editorial work, paid or unpaid, carried out by a third party is acknowledged; and, ethics procedures and guidelines have been followed. I acknowledge the support I have received for my research through the provision of an Australian Government Research Training Program Scholarship.

Raktim Kumar Mondol 10/01/2019

ACKNOWLEDGMENTS

It is a pleasure to express my appreciation to the people who have made this thesis possible. I wish to express my sincere gratitude to my supervisors Dr. Omid Kavehei, Dr. Samuel Ippolito and Dr. Reza Bonyadi for their continued guidance, motivation and support during my research. I would like to thank them for their patience, trust and encouragement throughout my masters. Furthermore, I would like to thank Dr. Esmaeil Ebrahimie from the University of Adelaide for collaborating in my research and providing me with critical feedback. I would also like to extend my sincere thanks to my fellow PhD colleague Nhan Duy Truong for his guidance and kind words during my project. Finally, I would also like to thank my family for their support and understanding during this period.

PREFACE

This dissertation is original and unpublished work by the author, R.K. Mondol.

TABLE OF CONTENTS

Page
LIST OF TABLES ...... viii
LIST OF FIGURES ...... ix
ABBREVIATIONS ...... xi
NOMENCLATURE ...... xiv
ABSTRACT ...... xvi
1 Introduction ...... 1
1.1 Background ...... 1
1.2 Problem Statement ...... 2
1.3 Scope and Rationale of the research ...... 3
1.4 Objectives of the Research ...... 4
1.5 Research Questions ...... 4
1.5.1 Research Question 1 ...... 4
1.5.2 Research Question 2 ...... 5
1.5.3 Research Question 3 ...... 6
1.6 Structure of the Thesis ...... 6
2 Literature Review ...... 8
2.1 Overview ...... 8
2.2 Deep Learning Theoretical Background ...... 8
2.2.1 Artificial Neural Network & Various types of Data Mining Methods ...... 8
2.2.2 Types of Neural Networks ...... 11
2.2.3 Training Neural Network using Backpropagation ...... 12
2.3 Bioinformatics Theoretical Background ...... 14
2.3.1 Background ...... 14
2.3.2 Molecular biology ...... 14
2.3.3 Various Pipelines in Bioinformatics ...... 16
2.3.4 Ontology and Pathway analysis ...... 19
3 Feature Extraction using Adversarial Autoencoder ...... 21
3.1 Overview ...... 21
3.2 Challenges ...... 22
3.3 Data Collection ...... 23
3.4 Evaluating Performance of AAE using Classification ...... 24
3.4.1 Background ...... 24
3.4.2 Methodology ...... 25
3.4.3 AAE model implementation ...... 26
3.4.4 Performance Metrics of Classification ...... 27
3.4.5 Results ...... 29
3.4.6 Discussion ...... 32
3.5 Evaluating Performance of AAE by Analyzing Connectivity Matrices ...... 33
3.5.1 Overview ...... 33
3.5.2 Methodology ...... 35
3.5.3 AAE model Implementation ...... 37
3.5.4 Results ...... 38
3.5.5 Discussion ...... 40
3.6 Chapter Summary ...... 41
4 Identification of Novel Mutation using Bioinformatics Pipelines ...... 42
4.1 Overview ...... 42
4.2 Data Collection ...... 43
4.3 Methodology ...... 44
4.4 Results & Discussion ...... 45
4.5 Gene Ontology (GO) & Protein-protein interaction (PPI) Analysis ...... 52
4.6 Chapter Summary ...... 53
5 Conclusions and Future Work ...... 54
5.1 Conclusions ...... 54
5.2 Recommendation for future Work ...... 55
REFERENCES ...... 59
A Algorithms ...... 69
B Tables ...... 71
C Figures ...... 77
D Codes ...... 79
D.1 AAE Architecture ...... 79
D.2 Fine Tuning ...... 81
D.3 Extract Weight from Latent Space ...... 83
D.4 Analyze and sort Weight Matrix ...... 84

LIST OF TABLES

Table Page
3.1 Summary of proposed AAE architecture. Each network in the AAE contains one hidden layer with 1000 neurons. To train the model, the Adadelta optimizer is used with a learning rate of 1, and binary cross-entropy is used as the loss function. ...... 29
3.2 Results of gene enrichment analysis using various feature extraction methods on the BRCA cancer data-set. ...... 39
4.1 Eight leukemia data samples with their corresponding sample types. ...... 44
4.2 List of novel variants obtained from annotated data ...... 50
B.1 Specification of the computer used in the experiment ...... 71
B.2 Comparison of various feature extraction techniques using the BRCA dataset with twelve different classifiers. Five-fold cross-validation was performed during evaluation. ...... 72
B.3 Benchmarking of various feature extractors while classifying various cancer subtypes using the BRCA dataset. ...... 73
B.4 Benchmarking of various feature extractors while extracting biologically relevant genes using the BRCA dataset. ...... 73
B.5 Results of pathway analysis using various feature extraction methods on the BRCA data-set. ...... 74
B.6 Results of gene enrichment analysis using various feature extraction methods, further validated on the UCEC data-set. ...... 74
B.7 The novel variants with associated cancers and their primary expression ...... 75
B.8 Results of Gene Ontology (GO) enrichment analysis using variant genes ...... 76

LIST OF FIGURES

Figure Page
1.1 The cost of genome sequencing has been decreasing over the last decade [1]. ...... 1
1.2 Nucleotide sequence, mass spectrometry, and microarray data keep increasing [3]. ...... 2
2.1 Biological neuron compared with the artificial neural network [7]. ...... 9
2.2 Multilayer perceptron with input layer, one hidden layer, and output layer [15]. ...... 11
2.3 The autoencoder reconstructs the actual input at its output layer; the encoded data is stored in the hidden layer [16]. ...... 13
2.4 Structure of a gene, including a promoter region, RNA-coding region and terminator sites [26]. ...... 16
2.5 DNA is transcribed into RNA, and this transcript may then be translated into protein [28]. ...... 16
2.6 Pipeline for differential gene analysis using RNA-Seq data [31]. ...... 17
2.7 Pipeline for variant analysis using DNA-Seq data [33]. ...... 18
3.1 Architecture of AAE for classifying cancer subtypes. ...... 26
3.2 Various classifiers such as DT, GB, KNN etc. are used to evaluate the performance of feature extraction methods such as AAE, PCA, AE, VAE and DAE. (a) Precision score compared among feature extraction methods using twelve different classifiers. (b) Recall score compared among feature extraction methods using twelve different classifiers. ...... 30
3.3 Performance of various feature extraction methods in terms of their computation time. ...... 31
3.4 Proposed AAE compared with other methods in terms of seven performance metrics: accuracy, F1-score, recall, precision, AUC, MCC and Kappa. ...... 32
3.5 Block diagram for weight matrix analysis using the AAE architecture. ...... 36
3.6 Histogram of AAE weights after applying the TopGene algorithm. ...... 40

4.1 Block diagram of the whole pipeline for variant analysis. First, in the pre-processing steps, the quality of the data is assessed and the data are trimmed to improve quality. Then the data are aligned to the reference human genome. Next, PCR duplicates are removed before variant calling, and important variants are obtained through filtering. After that, filtered variants are annotated with reference variant annotation databases. Finally, the data are further refined in order to find, validate and visualize novel variants. ...... 46
4.2 Venn diagram of three different variant calling tools for samples SRR40862 & SRR408630. ...... 48
4.3 Variant obtained from the SRR408623 sample. The novel variant is located at position X:150470758, where the nucleotide base changes from C to G. The MAMLD1 gene is found to be responsible for this mutation. ...... 52
C.1 The olfactory transduction pathway obtained from the AAE model. The GC-D expressing ORNs and Odorant (highlighted in red) are oncogenic activators responsible for triggering the olfactory transduction pathway ...... 77
C.2 Protein-protein interactions are shown: the PLK3 variant gene is connected to the SF1 variant gene via CDC5L ...... 78

ABBREVIATIONS

AAE Adversarial Autoencoder
AE Shallow Autoencoder
AML Acute Myeloid Leukemia
ANN Artificial Neural Network
AUC Area Under ROC Curve
BWA Burrows-Wheeler Aligner
CAD Computer Aided Diagnosis
CML Chronic Myelogenous Leukemia
CNH Copy Number High
CNL Copy Number Low
CNN Convolutional Neural Network
DAE Denoising Autoencoder
DBN Deep Belief Network
DNA Deoxyribonucleic Acid
DT Decision Tree
ENA European Nucleotide Archive
FISH Fluorescence in Situ Hybridization
GA Genetic Algorithm
GATK Genome Analysis Toolkit
GB Gaussian Naive Bayes
GO Gene Ontology
GPU Graphics Processing Unit
HPC High Performance Computing
HYM Hyper Mutated
Indel Insertion and Deletions
LR Logistic Regression
LumA Luminal A
LumB Luminal B
MLP Multilayer Perceptron
MNPs Multi-Nucleotide Polymorphisms
mtDNA Mitochondrial DNA
mRNA Messenger Ribonucleic Acid
NGS Next Generation Sequencing
NMF Non-negative Matrix Factorization
PCA Principal Component Analysis
PCR Polymerase Chain Reaction
QT Quantile Transformer
RBCs Red Blood Cells
RF Random Forest
EM Expectation Maximization
RNA Ribonucleic Acid
RNN Recurrent Neural Network
ROC Receiver Operating Characteristic
SDAE Stacked Denoising Autoencoder
SGD Stochastic Gradient Descent
SMOTE Synthetic Minority Oversampling Technique
SNPs Single Nucleotide Polymorphisms
SNV Single Nucleotide Variants
SVM Support Vector Machine
Tri-Neg Triple Negative
ULM Ultra Mutated
VAE Variational Autoencoder
VEP Variant Effect Predictor
WBCs White Blood Cells
WGS Whole Genome Sequencing
WES Whole Exome Sequencing
WHO World Health Organization

NOMENCLATURE

DIRAS1 DIRAS Family GTPase 1
DMWD DM1 Locus, WD Repeat Containing
DOK2 Docking Protein 2
EGR2 Early Growth Response 2
MAMLD1 Mastermind Like Domain Containing 1
MYCBPAP MYCBP Associated Protein
PLK3 Polo Like Kinase 3
POU4F2 POU Class 4 Homeobox 2
QSOX2 Quiescin Sulfhydryl Oxidase 2
SF1 Splicing Factor 1
SLC22A6 Solute Carrier Family 22 Member 6
SLC2A5 Solute Carrier Family 2 Member 5
TEX2 Testis Expressed 2
ZFHX3 Zinc Finger Homeobox 3
ALOX5 Arachidonate 5-Lipoxygenase
ATP13A2 ATPase Cation Transporting 13A2
BBOX1 Gamma-Butyrobetaine Hydroxylase 1
BMPR1B Bone Morphogenetic Protein Receptor Type 1B
BRSK2 BR Serine/Threonine Kinase 2
CACNA1A Calcium Voltage-Gated Channel Subunit Alpha1 A
CDK11A Cyclin Dependent Kinase 11A
CFAP69 Cilia And Flagella Associated Protein 69
CLTB Clathrin Light Chain B
CPD Carboxypeptidase D
EGR3 Early Growth Response 3
GPR182 G Protein-Coupled Receptor 182
GTF3C5 General Transcription Factor IIIC Subunit 5
HECTD4 HECT Domain E3 Ubiquitin Protein Ligase 4
OR2A25 Olfactory Receptor Family 2 Subfamily A Member 25
OR2B6 Olfactory Receptor Family 2 Subfamily B Member 6
OR2T27 Olfactory Receptor Family 2 Subfamily T Member 27
OR6V1 Olfactory Receptor Family 6 Subfamily V Member 1
OR8B8 Olfactory Receptor Family 8 Subfamily B Member 8
PHF2 PHD Finger Protein 2
PKHD1 PKHD1, Fibrocystin/Polyductin
RNF114 Ring Finger Protein 114
SLU7 SLU7 Homolog, Splicing Factor
SURF6 Surfeit 6
TKTL1 Transketolase Like 1
ZNF655 Zinc Finger Protein 655
ZNF677 Zinc Finger Protein 677
ZSWIM7 Zinc Finger SWIM-Type Containing 7

ABSTRACT

Technological advancement in high-throughput genomics such as deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) sequencing has significantly increased the size and complexity of data-sets. Moreover, a high number of features (genes) with a limited number of samples (patients) introduces a significant amount of noise into the data. Modern deep learning algorithms, in conjunction with high computational power, can handle these problems and are able to detect and diagnose diseases in a short time with a reduced chance of error. Thus, developing such methods and pipelines is of current research interest.

At the beginning of the research, the high dimensionality of the dataset was addressed by reducing the number of features using various feature extraction methods; for this purpose, an adversarial autoencoder (AAE) based feature extraction model is introduced. The performance of the proposed model is then evaluated using classification and weight matrix analysis. First, the AAE model was tested with twelve different classifiers, and the results show significant performance improvement in terms of precision and recall. Compared with all other methods, AAE with support vector machine (SVM), decision tree (DT), k-nearest neighbours (KNN), quadratic discriminant analysis (QDA) and xgboost (XGB) classifiers shows significant improvement in precision, with scores of 85.96%, 84.41%, 85.74%, 84.27% and 85.47% respectively. Most importantly, AAE provides consistent results in all performance metrics across the twelve classifiers, which makes this feature extraction model classifier independent. Then, by analysing the weight matrix of the AAE, important biomarkers such as OR2T27, OR2A25, OR8B8 and OR6V1 are identified, which belong to the molecular function of olfactory receptor activity. A recent study shows that olfactory receptor genes are highly expressed in breast carcinoma tissues, which validates our result.

Later, a pipeline was developed for mutation identification using a methylated DNA dataset. The raw data were analysed, and three different variant callers were used to validate the results. The common variants were then annotated to extract biological information. Next, low-quality genes were filtered out, and 22 mutated genes responsible for acute myeloid leukemia (AML) were identified. Most of these mutations are missense and non-synonymous, while some are stop-gain and non-frameshift. The mutation frequencies of the mutated genes were validated using the DriverDB and IntOGen databases. Three genes, ZFHX3, DOK2, and PKHD1, have a mutation frequency of 0.51% in AML, while the rest of the mutated genes are novel for leukemia. To summarise, the experiments show that the feature extraction method is not only useful for diagnosing diseases but can also help identify biological markers, which can further assist in developing personalised medicine and selecting therapeutic targets. In addition, the variant analysis pipeline provides novel variant genes which could enhance the understanding of leukemia and provide direction for further clinical research and drug development.

Chapter 1 Introduction

1.1 Background

The advent of next-generation sequencing (NGS) technologies has revolutionised genome sequencing applications through higher efficiency and relatively low cost, as illustrated in Fig. 1.1. Using whole genome sequencing (WGS) data, it is possible to identify regulatory elements such as transcription factors and histones, predict protein structure, and detect single nucleotide polymorphisms (SNPs). Since NGS produces millions of sequencing reads, it raises the complexity of storing and analysing the data. Robust deep learning and statistical methods, in combination with various bioinformatics tools, can be used to diagnose rare diseases and provide personalised treatment.

Fig. 1.1. The cost of genome sequencing has been decreasing over the last decade [1].

1.2 Problem Statement

Deep learning has a number of significant applications in the field of bioinformatics. Biological data is highly heterogeneous, often has many missing values, and requires sophisticated tools to analyse. This complex nature of the data can be handled using deep learning-based algorithms [2]. Some of the problems that arise when applying deep learning to biological data are described below:

• Unsupervised deep learning shows excellent success in dimension reduction and clustering techniques. However, such models need to be optimised to incorporate genetic data, as the datasets are growing very fast (see Fig. 1.2). Also, it is difficult to extract meaningful biological information as the data contain thousands of genes.

Fig. 1.2. Nucleotide sequence, mass spectrometry, and microarray data keep increasing [3].

• Biological omics data are heterogeneous as they contain many types of information, such as genetic sequences, protein-protein interactions or electronic health records. Besides, these data often have a relatively small sample size, which yields biased results when deep learning methodologies are applied.

1.3 Scope and Rationale of the research

As bioinformatics is a highly interdisciplinary field, research on next-generation sequencing techniques, feature extraction methods and classification approaches using high-performance computing has become an essential part of genomic data analysis. The primary purpose of this data-driven approach is to obtain biological meaning from genetic data using various mathematical and statistical techniques. Due to the complex nature of genetic data, deep learning methods are particularly well suited to this kind of problem. Furthermore, deep learning has revolutionised several areas of biology and medicine by addressing a variety of issues such as disease diagnosis, protein interactions and drug discovery, which may aid human investigation.

The complexity of biological data presents new challenges, but modern artificial intelligence methods such as deep learning have augmented the expertise of pathologists by improving diagnostic accuracy. This study provides evidence that deep learning based diagnosis can facilitate clinical decision making to detect cancer. Potential benefits of using computer-aided diagnosis (CAD) include reduced diagnostic turnaround time with increased accuracy. While deep learning achieves state-of-the-art results and even surpasses human accuracy in many challenging tasks, its adoption in bioinformatics using genomic data has been comparatively low. For high dimensional data, the deep neural network is suitable, and it outperforms classical machine learning techniques for cancer detection and relevant gene identification. After successfully implementing this project, it will be easier to assess the potential effectiveness of CAD in supporting clinical decision making. Besides improving accuracy, the system will be able to detect potential markers for cancer, identify genetically mutated genes, and recommend personalised medicine. Finally, cancer patients, cancer researchers, and clinicians including pathologists and oncologists will benefit from this research.

1.4 Objectives of the Research

The research project deals with two different types of data for two separate analyses. The first analysis deals with normalised RNA-seq breast cancer data, where machine learning techniques are used to classify cancer and identify its biomarkers. The second analysis deals with raw DNA-methylated leukemia samples to determine mutations. The main objectives are:

• To develop a deep learning algorithm which can classify breast cancer subtypes while extracting essential features about the disease.

• To obtain optimal hyper-parameters of the developed algorithm

• To create a pipeline for identifying novel mutations in leukemia samples

• To generate results and analyze the performance of proposed methods

1.5 Research Questions

Some of the questions which are answered in this work are:

1.5.1 Research Question 1

How to classify breast cancer subtypes while maintaining accuracy?

Significance

As cancer develops progressively, classification of cancer types at an early stage is likely to be more effective and carries the lowest chance of recurrence. Machine-intelligence based classification techniques can handle large amounts of data and are much more accurate than human-crafted rules. These classification algorithms can automatically find correlations in the data and can assist pathologists in understanding the extent of disease.

Key Findings

Handling genetic data is challenging, as the data have many features with a limited number of samples. Applying proper pre-processing techniques can make a significant improvement in classification. Hence, an AAE based feature extraction method is developed, which helps to improve performance with most classifiers.

1.5.2 Research Question 2

How to extract meaningful features to identify breast cancer biomarkers using normalized RNA-seq datasets?

Significance

Identifying biomarkers is essential to understanding the spectrum of any disease. By extracting meaningful features and analysing the corresponding weight matrix, relevant biomarkers can be identified, which will help researchers in developing personalised medicine and selecting therapeutic targets.

Key Findings

The AAE based feature extraction method not only assists in classifying cancer but also encodes some essential genes. Those genes were analysed and ranked according to their weights, and the top 50 genes were taken for gene enrichment analysis. Recent studies show that the molecular functions of those genes are associated with breast cancer. It is found that AAE can reveal important biomarkers compared with current state-of-the-art methods.

1.5.3 Research Question 3

What is the best pipeline for detecting novel variants in raw DNA-methylated samples?

Significance

Genetic variations can act as biomarkers and help locate genes associated with diseases. The location of these biomarkers can be enormously important in terms of predicting functional significance [4]. It also helps to identify individual responses to certain drugs and thus to determine targeted therapies.

Key Findings

The proposed pipeline was applied to eight different leukemia samples, and a total of 22 novel variant genes were obtained. Their corresponding chromosome locations are reported, and functional analysis is performed. Finally, the results reveal that those mutated genes are associated with AML.

1.6 Structure of the Thesis

The thesis is divided into five chapters, namely Introduction, Literature Review, Feature Extraction using Adversarial Autoencoder, Identification of Novel Mutations using Bioinformatics Pipelines, and Conclusions.
Chapter 1 presents the research background, problem statement, scope, rationale, objectives, and research questions.
Chapter 2 provides a complete literature review and helps to identify the gaps in the existing literature.
Chapter 3 outlines the methodology for breast cancer subtype classification and for extracting biologically relevant genes from normalised RNA-seq data. The experiments, which were conducted using deep learning libraries such as TensorFlow, Keras and scikit-learn and the Python programming language, are also discussed.
Chapter 4 outlines the methodology of the variant analysis of leukemia samples. All analysis was done using a web-based tool called Galaxy. A deep learning-based model is also applied for finding variants, and the results are compared with existing bioinformatics tools.
Finally, in Chapter 5, overall conclusions on the deep learning based feature extraction method and the variant analysis pipelines are summarised, with recommendations for future research.

Chapter 2 Literature Review

2.1 Overview

This chapter reviews the existing literature on various types of machine learning models and their applications. It also examines various bioinformatics pipelines, tools, and data types. First, deep learning paradigms such as supervised, unsupervised and semi-supervised learning are presented. Then, different types of deep learning models and their applications are discussed. Bioinformatics tools, data types, and various pipelines are also reviewed in this chapter. Next, the challenges and opportunities for machine learning in bioinformatics are discussed. Finally, conclusions are drawn based on the research gaps and critical findings.

2.2 Deep Learning Theoretical Background

2.2.1 Artificial Neural Network & Various types of Data Mining Methods

Deep learning is one of the significant areas of artificial intelligence; it gives machines the ability to learn representations of data with multiple levels of abstraction and to discover hidden structure in large datasets [5]. Fig. 2.1 illustrates the biological neuron, which has dendrites that receive signals from other neurons, sums them, and forwards them along the axon. From the axon, these signals are fired to the next neurons through junctions known as synapses. Correspondingly, in an artificial neural network the input nodes act as dendrites, the activation function acts as the axon, and the output layer acts as a synapse. A perceptron is an artificial neuron, analogous to a biological neuron, that has input nodes and a single output node connected to each input node. The machine can learn in a supervised, unsupervised, or semi-supervised manner. Deep learning has many applications in healthcare, for example cancer detection, diagnosis in medical imaging, drug discovery, robotic surgery, protein structure prediction, and personalised medicine and treatment [6].


Fig. 2.1. Biological neuron compared with the artificial neural network [7].

Supervised Learning

In supervised learning, the model takes a dataset that includes the output variable, and the learning algorithm learns to map the inputs to the outputs so that it can predict the fate of unknown instances [8]. Classification and regression are the main applications of supervised learning. For example, a supervised classification algorithm will learn to identify diseases after being trained on a dataset of correctly labelled genes. Various supervised machine learning algorithms, for example random forest, linear and logistic regression, and support vector machines, are commonly used for classification and regression. However, for excessively large datasets, deep neural network based classification and regression algorithms are generally required to deal with non-linearity in the data.
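As a concrete illustration of this input-to-output mapping, the sketch below trains a minimal nearest-centroid classifier on toy labelled data. The data values and helper names (fit_centroids, predict) are illustrative assumptions, not code or results from the thesis:

```python
import numpy as np

# Hypothetical labelled dataset: two samples per class, two features each.
X_train = np.array([[1.0, 0.9], [1.1, 1.0],   # class 0
                    [4.0, 4.2], [3.9, 4.1]])  # class 1
y_train = np.array([0, 0, 1, 1])

def fit_centroids(X, y):
    """Learn one centroid (mean vector) per class from labelled samples."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Assign the class whose centroid is nearest to x."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

model = fit_centroids(X_train, y_train)
print(predict(model, np.array([0.9, 1.1])))  # near the class-0 samples -> 0
```

A deep network replaces the centroid rule with a learned non-linear mapping, but the supervised recipe (labelled inputs, a fitted model, predictions on unseen instances) is the same.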

Unsupervised Learning

The goal of unsupervised learning is to map the underlying hidden structure of a dataset where the machine only receives inputs and does not receive any outputs [9]. The model is called unsupervised because the dataset contains only input variables and no labels during the learning process. Unsupervised learning discovers interesting structure in the data, and its common uses are clustering and association rule analysis [10] [11]. Some popular unsupervised algorithms are k-means, principal component analysis (PCA) and autoencoders for clustering problems, and the apriori algorithm for association rule learning. The autoencoder is a neural network based unsupervised learning algorithm with many applications, such as clustering, dimension reduction, and image denoising. Biological data often lack labels; therefore, unsupervised learning is beneficial for clustering and helps to reduce the dimensionality to facilitate further analysis [9].
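To make the dimension-reduction idea concrete, the following sketch applies PCA via a plain SVD to a synthetic matrix shaped like a small expression dataset (100 samples by 50 "genes", with only two latent signal directions). All sizes and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 samples x 50 features, driven by 2 hidden factors.
latent = rng.normal(size=(100, 2))
loadings = rng.normal(size=(2, 50))
X = latent @ loadings + 0.01 * rng.normal(size=(100, 50))

# PCA by SVD of the mean-centred matrix: rows of Vt are principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                       # project onto the first two components
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(Z.shape, round(explained, 3))     # Z is (100, 2); explained is near 1
```

Because the synthetic data is essentially rank two, two components retain almost all the variance; real expression data needs far more components, which is one motivation for the non-linear autoencoder approach discussed next.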

Semi-Supervised Learning

Semi-supervised learning algorithms require both labelled and unlabelled data to build complex machine learning models [12]. In real life, labelled data is quite expensive and labelling data manually is time-consuming; hence, including a lot of unlabelled data in combination with labelled data tends to improve accuracy and reduce bias in the model. Semi-supervised models shine in webpage classification, speech recognition, and even genetic engineering [13].

2.2.2 Types of Neural Networks

Multi-Layer Perceptron

The multilayer perceptron (MLP) is a supervised learning algorithm that can handle non-linearity and solve complex tasks, unlike the single perceptron [14]. With perceptrons, it is possible to build a larger and more effective structure, the MLP, which has at least three main layers: an input layer, a hidden layer, and an output layer. It is also called a feedforward network since it has direct forward connections between layers, as shown in Fig. 2.2.

Fig. 2.2. Multilayer perceptron with input layer, one hidden layer, and output layer [15].
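The forward pass of such a one-hidden-layer network can be sketched in a few lines of NumPy; the layer sizes and the ReLU activation below are arbitrary illustrative choices, not the configuration used in the thesis:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """Input layer -> hidden layer (ReLU) -> output layer (linear)."""
    h = relu(W1 @ x + b1)   # hidden layer activations
    return W2 @ h + b2      # output layer

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden -> 2 outputs
y = mlp_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(y.shape)  # (2,)
```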

Autoencoder

An autoencoder is a neural network based unsupervised learning algorithm (illustrated in Fig. 2.3) that applies backpropagation, setting the target values equal to the inputs, and can extract non-linearities within the data. It can learn automatically from data examples, which means an autoencoder does not require labelled data. Some practical applications of autoencoders are data denoising and dimensionality reduction of high dimensional data, which is useful for data visualisation. With appropriate dimensionality and sparsity constraints, autoencoders can learn data projections that are more interesting than PCA or other basic techniques. Autoencoders have some drawbacks; for example, they can only compress data similar to what they have been trained on, so the method is data-specific. Also, autoencoders are lossy, which means that the decompressed outputs will be degraded compared with the original inputs. An encoding function, a decoding function, and a distance function measuring the loss between the original and the compressed-then-decompressed representation of the data are required to build an autoencoder. The parameters of the encoding and decoding functions can be optimised using stochastic gradient descent (SGD) to minimise the reconstruction loss.
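As a minimal sketch of this encode/decode/loss structure, the toy linear autoencoder below (tied weights, plain gradient descent) learns to reconstruct its input; the data, layer sizes, and learning rate are illustrative assumptions and this is not the AAE model proposed later in the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))            # toy data: 200 samples, 8 features

# Tied-weight linear autoencoder: encode to 3 dims with W, decode with W.T.
W = rng.normal(scale=0.1, size=(8, 3))
lr = 0.05
for _ in range(500):
    Z = X @ W                 # encoding function: compress to 3 dims
    Xhat = Z @ W.T            # decoding function: reconstruct 8 dims
    err = Xhat - X            # reconstruction error (the distance function)
    # Gradient of the squared reconstruction loss w.r.t. the tied weights
    grad = (X.T @ err @ W + err.T @ X @ W) / len(X)
    W -= lr * grad

loss = float((err ** 2).mean())
print(loss < float((X ** 2).mean()))     # reconstruction beats predicting zeros
```

A linear autoencoder like this can do no better than PCA; it is the non-linear activations and constraints mentioned above that let deep autoencoders learn richer projections.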

2.2.3 Training Neural Network using Backpropagation

Training a deep neural network is done using the back-propagation learning procedure [17]. The back-propagation algorithm repeatedly computes gradients in order to minimise the cost function and update the weights of the network [18] [19]. Back-propagation requires a forward pass and a backward pass for every training example. After the forward pass, the cost is calculated using the cost function. During back-propagation, the partial derivatives of that cost function (with respect to the weights and biases) are computed, and the weights and biases are adjusted accordingly to reduce the loss.
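The forward-then-backward computation can be illustrated on a single sigmoid neuron with a squared-error cost; the chain-rule gradient is then checked against a finite-difference estimate. All values are toy assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, y):
    # Forward pass: compute the prediction and the cost.
    z = w * x + b
    a = sigmoid(z)
    L = 0.5 * (a - y) ** 2
    # Backward pass: apply the chain rule step by step.
    dL_da = a - y
    da_dz = a * (1 - a)                 # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return L, dL_dz * x, dL_dz          # loss, dL/dw, dL/db

w, b, x, y = 0.5, -0.2, 1.5, 1.0
L, dw, db = loss_and_grads(w, b, x, y)

# Verify dL/dw against a central finite-difference estimate.
eps = 1e-6
num = (loss_and_grads(w + eps, b, x, y)[0]
       - loss_and_grads(w - eps, b, x, y)[0]) / (2 * eps)
print(abs(dw - num) < 1e-8)   # True: analytic gradient matches
```

In a deep network the same chain rule is applied layer by layer from the output back to the input, reusing each layer's intermediate quantities from the forward pass.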

Fig. 2.3. The autoencoder reconstructs the actual input at its output layer; the encoded data is stored in the hidden layer [16].

Various Gradient Descent Algorithms

Batch gradient descent: Batch gradient descent calculates the gradient over the full training set and updates the model after all training samples have been evaluated [20]. This algorithm requires the entire dataset to be loaded into memory, which can be slow for big datasets.

Stochastic gradient descent (SGD): SGD updates the parameters for each training sample, and it shows excellent performance on large-scale problems [20] [21]. This approach overcomes the problem of datasets that are too big to fit in memory. Although SGD performs frequent updates with high fluctuations, it can lead to faster convergence.

Mini-batch gradient descent: Mini-batch gradient descent is a trade-off between stochastic gradient descent and batch gradient descent. It splits the training dataset into small batches, with sizes typically ranging between 50 and 256 [22], which are used to calculate the model error and update the model coefficients. Using small batches of data reduces the fluctuation of the parameters during model updates compared with the other two methods.
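The three variants above differ only in how many samples feed each update. A minimal mini-batch loop on a toy least-squares problem (synthetic data, batch size 64 from the 50-256 range quoted above; all values are illustrative assumptions) might look like:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = np.arange(1.0, d + 1)              # hidden target weights [1..5]
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
batch_size, lr = 64, 0.1
for epoch in range(30):
    idx = rng.permutation(n)                # shuffle each epoch
    for start in range(0, n, batch_size):
        b = idx[start:start + batch_size]
        # Gradient of the mean squared error over this mini-batch only
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
print(np.round(w))   # approximately [1. 2. 3. 4. 5.]
```

Setting batch_size = n recovers batch gradient descent, and batch_size = 1 recovers SGD; only the inner loop's slice changes.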

2.3 Bioinformatics Theoretical Background

2.3.1 Background

NGS has grown by leaps and bounds over the last decades. This technology has high-throughput capabilities and has revolutionised the landscape of genomic medicine [23]. The rapidly decreasing cost of sequencing per genome has sparked growing demand for data analysis, and for machine learning in particular. The rapid adoption of NGS technology is undoubtedly the gateway to a new era of personalised treatment and cancer prognosis; hence NGS bioinformatics can provide a solution for the next generation of molecular diagnostics. However, it is still a challenge for researchers to analyse NGS data and extract meaningful information from it because of the high complexity and sparsity of the data. Deep learning is attracting increasing attention after being shown to outperform commonly used bioinformatics tools, and deep learning algorithms applied to NGS data are leading to potential discoveries that can help with clinical diagnosis and disease prognosis. In this research, a deep learning based algorithm named DeepVariant is used to find variants in DNA-methylated data, in conjunction with traditional tools such as FreeBayes and the Genome Analysis Toolkit (GATK).

2.3.2 Molecular biology

DNA Nucleotides and Strands:

Deoxyribonucleic acid (DNA) is the genetic material that carries the genetic code in humans and almost all other organisms. All biological information is stored in the DNA. Most of the DNA is located in the cell nucleus, but a small amount of DNA can also be found in the mitochondria, where it is known as mitochondrial DNA (mtDNA) [24]. DNA is a long, chain-like, double-helix molecule that has two strands made up of simpler molecules called nucleotides. Nucleotides contain four nitrogen-containing nucleobases: cytosine (C), guanine (G), adenine (A) and thymine (T). Nucleobases pair up with each other, A with T and C with G, to form units called base pairs. An important property of DNA is that it can make copies of itself.

RNA (Ribonucleic Acid):

Ribonucleic acid (RNA) is a linear molecule responsible for the coding, decoding, regulation, and expression of genes. Unlike DNA, RNA has only a single strand that folds onto itself, and its sugar is called ribose. It is composed of four types of smaller molecules called ribonucleotide bases: adenine (A), cytosine (C), guanine (G), and uracil (U).

Gene Expression:

DNA contains genes, and those genes are used to produce RNA through the transcription process. The use of the information from a gene in the synthesis of a functional gene product is called gene expression [25]. A gene is a continuous string of nucleotides that begins with a promoter and ends with a terminator, as shown in Fig. 2.4. The process of transforming DNA into proteins is divided into two stages: transcription and translation. In the transcription process, RNA is produced using DNA templates, and during translation, proteins are synthesised using RNA templates, as shown in Fig. 2.5 [27]. The transcription process is carried out by the RNA polymerase enzyme and many transcription factors. The mRNA formed during transcription contains coding and non-coding sections, known as exons and introns respectively. Since

Fig. 2.4. Structure of a gene includes a promoter region, RNA-coding region and terminator sites [26].

only the exons contain information about the protein, the non-coding introns are then removed by splicing [25].

Fig. 2.5. DNA transcribed into RNA, and this transcript may then be translated into protein [28].

2.3.3 Various Pipelines in Bioinformatics

RNA-Seq Analysis

RNA sequencing (RNA-Seq) is a high-throughput next-generation sequencing technology that provides a comprehensive understanding of the transcriptome [29]. Powerful computational resources and good algorithms are required to perform RNA-Seq analysis [30]. The pipeline for differential gene analysis using RNA-Seq data is shown below in Fig. 2.6. Several RNA-Seq analysis pipelines have been proposed to date; however, no single analysis pipeline can capture the dynamics of the entire transcriptome. Here, a robust and commonly used analytical pipeline has been compiled and presented, which includes quality checks, alignment of reads, differential gene expression analysis, the discovery of cryptic splicing events, and visualisation. First, after the raw reads are obtained from the sequencing process, they are subjected to a quality control step. In this step, the FastQC tool is used to check the data and, if necessary, trim it to improve the performance of the subsequent stages. After the quality control step, the reads are aligned to a reference genome. Following the alignment step, quantification is performed: reads are aggregated to calculate gene expression values, which are then compared with the values of other samples in the differential expression step in order to determine which genes are most differentially expressed.

Fig. 2.6. Pipeline for differential gene analysis using RNA-Seq data [31].

DNA-Seq Analysis

NGS is becoming the most popular high-throughput technology for biomedical research, personalised treatment, and drug design. It is empowering genetic disease research while bringing significant challenges for sequencing data analysis. The DNA-Seq pipeline uses whole exome sequencing (WES) and WGS data to detect variants or mutations in disease samples, as shown in Fig. 2.7. DNA-Seq data analysis studies genomic variants by aligning raw NGS reads to a reference genome and then applying variant-calling software to identify genomic mutations, including SNPs [32]. DNA-Seq analysis requires deep knowledge of bioinformatics tools and high-performance computing (HPC) resources, and the analysis needs to be reproducible to facilitate comparisons as well as downstream analysis. The pipeline starts by checking the quality of the sequence using the FastQC tool

Fig. 2.7. Pipeline for variant analysis using DNA-Seq data [33].

and trimming adapters from the reads to improve their quality. Alignment is then performed against a reference genome using the Burrows-Wheeler Aligner (BWA), after which a FreeBayes- or GATK-based variant caller is used to identify the variants or mutations. Finally, functional annotation is carried out using existing variant datasets. Further downstream analyses, for example gene enrichment analysis, protein-protein interaction and gene pathway analysis, can then be done using a variety of web tools such as DAVID, KEGG and gene ontology (GO). Through this analysis, it is possible to determine the exact chromosomal location of each variant and the corresponding gene responsible for that variation.

2.3.4 Gene Ontology and Pathway analysis

Gene ontology analysis is useful for finding out whether a set of genes is associated with a specific biological process, cellular component or molecular function [34]. Performing this analysis is the core step in extracting meaning from the results. GO provides a structured way of annotating genes with GO terms, each with a name, a unique alphanumeric identifier, a definition with cited sources, and a namespace indicating the domain to which it belongs. Gene Ontology covers three domains: i) cellular component, the parts of a cell or its extracellular environment; ii) molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and iii) biological process, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms. A biological pathway, on the other hand, is a sequence of interactions between molecules in a cell that results in a certain product or change in the cell. Those interactions control the flow of information, energy and biochemical compounds in the cell, and the ability of the cell to change its behaviour in response to stimuli. There are several types of biological pathways, the most commonly known being metabolic, signalling and gene-regulation pathways. By comparing two pathways, researchers can find out what triggered a disease. This gene enrichment and pathway analysis can be performed using various tools, for instance DAVID, PANTHER, or g:Profiler.

In the literature review, artificial intelligence and its applications in the bioinformatics field are well studied in order to classify disease subtypes and to find novel biomarkers.

Chapter 3 Feature Extraction using Adversarial Autoencoder

3.1 Overview

Breast cancer is a common and debilitating disease affecting mainly women around the world. According to Cancer Australia, 17,730 new cases were diagnosed in 2017 and there were 3,114 deaths from breast cancer; however, the five-year relative survival rate increased from 72% to 90% between 1982 and 2012 [35]. There are several molecular subtypes of breast cancer, such as luminal A (LumA), luminal B (LumB), triple-negative (Tri-Neg), HER2-enriched and Claudin-low. Breast cancer can also be classified by its histological type, such as ductal carcinoma, lobular carcinoma or mixed carcinoma. Furthermore, there are multiple pathological stages of breast cancer: stage 0, I, II, III and IV. Genetic data is highly correlated with the molecular subtypes. It is important to note that inherited genetic mutations can greatly increase the risk of developing cancer; therefore analysing genetic subtypes can help us to detect cancer at an early stage and increase the five-year survival rate. Besides, the gene expression profile enables cancer diagnosis and tumour characterisation for treatment in order to minimise recurrence [36]. Although mammography remains the mainstay of breast cancer diagnosis with few false positives, gene expression data can reveal the root cause of cancer through the analysis of RNA expression, mutation and methylation patterns [37] [38] [39]. Machine intelligence methods automatically detect patterns in data and predict targets based on what they have learned. Vanishing gradients, overfitting and limited computing power have been among the limitations of artificial neural networks (ANNs) ever since they were introduced [40]. However, enhanced computing power with graphical processing units (GPUs) and advanced algorithms, for example the MLP, recurrent neural network (RNN), convolutional neural network (CNN) and autoencoder, have shown significant performance in overcoming those challenges.
Machine intelligence has brought a paradigm shift to health care, with computer-aided diagnosis (CAD) helping radiologists to reduce assessment time and increase detection accuracy [40]. Computational biology has already shown promising results in medical imaging; however, analysing genomic and transcriptional profiling data is more challenging because of its high dimensionality, small sample sizes, and missing values. Deep learning, which is a part of machine intelligence, can extract high-level features in an unsupervised manner and find complex patterns in unseen data.

3.2 Challenges

Gene expression data from RNA-Seq are quite large and need proper analysis tools to classify samples and extract biological information. Medical data has many issues, such as missing values, low-quality data, small sample sizes and high dimensionality, all of which create noise in the data. To remove this noise, proper pre-processing steps are required, which help the classifiers to classify diseases more efficiently. The first challenge comes with the high dimensionality of the available datasets. This high dimensionality and sparseness of the data often lead to overfitting. The problem can be handled by applying dimensionality reduction methods such as an autoencoder, which improves classification performance. Among thousands of genes, only a few are relevant and contain useful information, thus extracting biologically essential genes is extremely important. However, it is challenging to extract relevant genes from the encoded data. First, the weight matrix of the neural network must be analysed and the highly weighted genes sorted out. If the neural network has any hidden layers, it is necessary to compute the dot product of the weight matrices of each layer. Extracting highly weighted genes gives biologists key insights into how those genes are correlated with a specific type of cancer. Furthermore, the classes in a dataset are not equal in size, which gives biased results during model evaluation. Most machine learning and neural network based algorithms are not designed to deal with such imbalanced classes. It is also difficult to train a deep learning model with a small number of samples. Therefore, the sample size needs to be increased and the classes balanced using over-sampling techniques. Most importantly, high computational power with a GPU cluster is required to obtain results in an acceptable time. In short, sophisticated pre-processing steps and high-end computational resources are needed to analyse the genetic data.

3.3 Data Collection

Two public datasets from cBioportal1 were used in this work. The first dataset, breast invasive carcinoma (TCGA, Cell 2015), named BRCA for short, has five molecular subtypes: LumA, LumB, Basal, Tri-Neg and Her2. However, the Basal and Tri-Neg samples are merged together as they share a very similar biological pattern [41]. This dataset contains 20439 genes and 605 patients. A second, different cancer dataset was then chosen, uterine corpus endometrial carcinoma (TCGA, Nature 2013), named UCEC for short, whose samples are categorised into four molecular subtypes: copy number high (CNH), copy number low (CNL), hyper-mutated (HYM) and ultra-mutated (ULM). UCEC contains a total of 230 samples, and each sample has 15482 genes. The data was labelled with the corresponding clinical information. The main goal of this work is to present a new approach for dimensionality reduction using the AAE and to evaluate its performance in two ways. First, the extracted features are evaluated through supervised classification using twelve different classifiers. Then, to verify the AAE's capability, biologically relevant genes are extracted using the connectivity matrices of its latent space and compared with other state-of-the-art methods.

1cBioportal: http://www.cbioportal.org/

3.4 Evaluating Performance of AAE using Classification

3.4.1 Background

Several promising studies have attempted to classify and cluster gene expression data using various machine learning algorithms. Unsupervised deep feature extraction algorithms such as the stacked denoising autoencoder (SDAE) and the variational autoencoder (VAE) have been applied to extract features [37] [42] [43], and have been compared with traditional machine learning methods such as principal component analysis (PCA) with different kernels to measure performance [37] [43] [44]. These techniques can also be applied to predict protein-protein interactions and discover new drugs. Various unsupervised clustering algorithms, including partition-based clustering, K-means clustering, hierarchical clustering, expectation maximisation (EM) and non-negative matrix factorisation (NMF), have been applied to cluster genes [45]. In addition to extracting features, ANNs and deep belief networks (DBNs) have been employed to classify diseases [46] [47]. Moreover, various machine learning algorithms, such as multinomial logistic regression (LR), DT, SVM and random forest (RF), have been equally exploited in cancer research to classify diseases from microarray and RNA-Seq data [37] [42] [44] [46] [48].

As discussed previously, a high number of genes with a limited number of patients introduces a significant amount of noise, a problem known as the "curse of dimensionality". To overcome this challenge, a feature extraction method, the AAE, has been proposed [49]. This technique also facilitates the visualisation of high-dimensional data and prevents the manifold fracturing problem [49]. To evaluate the effectiveness of the features extracted through the AAE (see Algorithm 2), twelve different supervised classifiers are used: MLP, RF, SVM, KNN, DT, XGB, QDA, LR, gradient boosting (GB), Gaussian naive Bayes (NB), linear discriminant analysis (LDA) and the voting classifier (VC).

3.4.2 Methodology

Data Pre-processing

Class imbalance often leads to overfitting and low predictive scores [50]. Since the chosen data is highly imbalanced in nature, the synthetic minority oversampling technique (SMOTE) is adopted to balance the classes. Next, normalisation is used to eliminate data redundancy and noise. In this project, the quantile transformer (QT) normalisation technique, which transforms the features to follow a uniform or a normal distribution, was used [51]. This technique not only reduces the impact of outliers but also spreads out the most frequent values.
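To make the two steps concrete, the sketch below hand-rolls a simplified SMOTE-style oversampler and a rank-based quantile transform on toy data. The thesis itself relies on the standard SMOTE and scikit-learn QuantileTransformer implementations, which are more sophisticated (for instance, real SMOTE interpolates towards nearest neighbours rather than random neighbours as done here).

```python
import numpy as np

rng = np.random.default_rng(3)

def smote(X_min, n_new):
    """Create synthetic minority samples by interpolating between a sample
    and a randomly chosen other minority sample (the nearest-neighbour
    choice of real SMOTE is simplified to a random neighbour here)."""
    out = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        lam = rng.random()                       # interpolation factor in [0, 1]
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

def quantile_uniform(col):
    """Map a feature column to its empirical quantiles (uniform output),
    which bounds the influence of outliers and spreads frequent values."""
    ranks = np.argsort(np.argsort(col))
    return ranks / (len(col) - 1)

# Imbalanced toy data: 90 majority vs 10 minority samples, 5 features each.
X_maj = rng.normal(size=(90, 5))
X_min = rng.normal(loc=3.0, size=(10, 5))
X_min_balanced = np.vstack([X_min, smote(X_min, 80)])   # now 90 vs 90

X = np.vstack([X_maj, X_min_balanced])
X_norm = np.apply_along_axis(quantile_uniform, 0, X)    # per-feature transform
print(X_min_balanced.shape, X_norm.min(), X_norm.max())  # (90, 5) 0.0 1.0
```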

Dimension Reduction

After using a feature selection method, the unsupervised AAE depicted in Fig. 3.1 is used to reduce the dimensionality. This state-of-the-art autoencoder encodes the data into a given dimension without losing significant gene patterns. The training process of the AAE is unsupervised, thus labels are not necessary. However, the AAE needs a large amount of data to make full use of its feature extraction ability.

Evolutionary Approach for Classification

Applying a deep learning model to any data takes a significant amount of time and effort to obtain optimal hyperparameters. Therefore, the TPOT [52] library was used to find the optimal hyperparameters of the twelve classifiers. However, some parameters of the genetic algorithm (GA), such as the number of generations and the population size, must be selected to evolve the model. An advantage of the GA is that it has far fewer parameters than a deep learning method. In addition, it finds the best model parameters based on its fitness function; here, accuracy is chosen as the fitness function. The limitation of this tool is that it is only able to search over scikit-learn based classifier and regressor models and optimise their parameters. In the experiments, the TPOT-generated hyperparameters are used.


Fig. 3.1. Architecture of AAE for classifying cancer subtypes.

3.4.3 AAE model implementation

Four different autoencoders, namely the AE, AAE, VAE and denoising autoencoder (DAE), were used in this work. Unlike the other autoencoders, the proposed AAE architecture consists of three networks: an encoder, a generator, and a discriminator. The additional discriminator network compares the generated result with the actual input, and the weights of each network are updated accordingly. First, a single-layer AAE, that is, an autoencoder without a hidden layer in which the input is directly connected to the output, was used to reduce the features to 50, and classifiers were used to validate the result; however, the results were not promising. Therefore, an AAE architecture having one hidden layer in each network is used, as described in Table 3.1. The encoder network contains an input layer which takes all genes (20439) as input, one hidden layer with 1000 neurons, and an output layer with 50 neurons. Next, the generator network takes the encoded vector (50) as input and, through one hidden layer with 1000 neurons, generates fake genes (20439) as output. Finally, the discriminator network takes an actual gene vector and a generated gene vector as input and contains one hidden layer with 1000 neurons and one neuron in the output. During pre-training of this model, Adadelta is used as the optimiser with a learning rate of 1, and binary cross-entropy is used as the loss function. In addition, 100 epochs are used with a batch size of 128 to train the model in an unsupervised manner. After that, fine-tuning of the encoder network is performed using an additional layer of four neurons with a softmax activation function. The fine-tuning process is supervised and allows the output layer of the encoder (50 neurons) to optimise its weights. Validation of the model was performed using twelve different classifiers. The use of dropout or regularisers was avoided. When the number of hidden layers was further increased, the results showed that increasing the depth of the model only modestly improves the performance.
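The layer dimensions in Table 3.1 can be traced with a plain forward pass. The sketch below uses scaled-down stand-in dimensions (the actual model uses 20439 genes, 1000 hidden neurons and a 50-dimensional encoding) and random untrained weights, purely to show how the three networks connect:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in sizes; the thesis uses 20439 genes, 1000 hidden, 50 latent.
n_genes, n_hidden, n_latent, batch = 200, 32, 8, 16

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def layer(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_in, n_out))

# One hidden layer per network, activations as in Table 3.1.
W_e1, W_e2 = layer(n_genes, n_hidden), layer(n_hidden, n_latent)
W_g1, W_g2 = layer(n_latent, n_hidden), layer(n_hidden, n_genes)
W_d1, W_d2 = layer(n_genes, n_hidden), layer(n_hidden, 1)

encode = lambda x: np.tanh(np.tanh(x @ W_e1) @ W_e2)          # tanh / tanh
generate = lambda z: sigmoid(np.tanh(z @ W_g1) @ W_g2)        # tanh / sigmoid
discriminate = lambda x: sigmoid(np.tanh(x @ W_d1) @ W_d2)    # tanh / sigmoid

x_real = rng.random((batch, n_genes))          # a batch of "actual genes"
z = encode(x_real)                             # encoded vectors
x_fake = generate(z)                           # "generated genes"
d_real, d_fake = discriminate(x_real), discriminate(x_fake)   # real-vs-fake scores

print(z.shape, x_fake.shape, d_real.shape)  # (16, 8) (16, 200) (16, 1)
```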

3.4.4 Performance Metrics of Classification

The performance of a system is evaluated through a confusion matrix, which gives the correctly classified and misclassified rates on the test data. The classification performance is measured using various metrics: accuracy, sensitivity, specificity, precision, F1-score, the area under the ROC curve (AUC) and the Matthews correlation coefficient (MCC). Accuracy is defined as the ratio of correctly classified data to all classified data, as shown in Eq. 3.1.

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3.1}
\]

Sensitivity is measured as the proportion of correct positive (cancer) predictions over the total number of positives. For example, if the system is highly sensitive and the test result is positive, it is very likely that the patient does have the disease (or cancer). It can be calculated using Eq. 3.2.

\[
\mathrm{Sensitivity\ (true\ positive\ rate,\ or\ recall)} = \frac{TP}{TP + FN} \tag{3.2}
\]

Specificity is calculated as the ratio of correct negative predictions (not cancer) to the total number of negatives.

\[
\mathrm{Specificity\ (true\ negative\ rate)} = \frac{TN}{TN + FP} \tag{3.3}
\]

Precision is defined as the number of correct positive predictions divided by the sum of true positives and false positives.

\[
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3.4}
\]

The F1-score of a system is the harmonic mean of precision and recall, that is:

\[
\mathrm{F1\mbox{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3.5}
\]

MCC, also known as the phi coefficient, takes all confusion matrix entries into account and is robust to imbalanced classes in the data. Furthermore, it can be used in both binary and multi-class classification.

\[
\mathrm{MCC\ (multiclass)} = \frac{c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{\left(s^2 - \sum_{k}^{K} p_k^2\right) \times \left(s^2 - \sum_{k}^{K} t_k^2\right)}} \tag{3.6}
\]

The kappa score measures the level of agreement between two annotators on a classification problem, taking into account the possibility of the agreement occurring by chance. It is defined as follows:

\[
\kappa = \frac{p_o - p_e}{1 - p_e} \tag{3.7}
\]

where:

TP = true positives, TN = true negatives, FP = false positives, FN = false negatives,
$C$ = the confusion matrix,
$t_k = \sum_{i}^{K} C_{ik}$, the number of times class $k$ truly occurred,
$p_k = \sum_{i}^{K} C_{ki}$, the number of times class $k$ was predicted,
$c = \sum_{k}^{K} C_{kk}$, the total number of samples correctly predicted,
$s = \sum_{i}^{K} \sum_{j}^{K} C_{ij}$, the total number of samples,
$p_o$ = the observed agreement ratio, and
$p_e$ = the expected agreement by chance.
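The metrics of Eqs. 3.1-3.7 can be computed directly from a confusion matrix. The following sketch does so for a small, made-up 3-class matrix:

```python
import numpy as np

# Made-up 3-class confusion matrix: rows = true class, columns = predicted.
C = np.array([[50,  3,  2],
              [ 5, 40,  5],
              [ 2,  4, 44]])

t = C.sum(axis=1)      # t_k: how often class k truly occurred
p = C.sum(axis=0)      # p_k: how often class k was predicted
c = np.trace(C)        # samples predicted correctly
s = C.sum()            # total number of samples

accuracy = c / s                                        # Eq. 3.1
recall = np.diag(C) / t                                 # Eq. 3.2 (per class)
precision = np.diag(C) / p                              # Eq. 3.4 (per class)
f1 = 2 * precision * recall / (precision + recall)      # Eq. 3.5 (per class)

# Multiclass Matthews correlation coefficient, Eq. 3.6.
mcc = (c * s - p @ t) / np.sqrt((s**2 - p @ p) * (s**2 - t @ t))

# Kappa score, Eq. 3.7: observed agreement vs chance-expected agreement.
p_o = c / s
p_e = (p @ t) / s**2
kappa = (p_o - p_e) / (1 - p_e)

print(round(accuracy, 3), round(mcc, 3), round(kappa, 3))  # 0.865 0.797 0.796
```

Note that for binary problems Eq. 3.6 reduces to the familiar two-class MCC, and that MCC and kappa come out close to each other here because the classes are roughly balanced.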

Table 3.1. Summary of the proposed AAE architecture. Each network in the AAE contains one hidden layer with 1000 neurons. To train the model, the Adadelta optimiser is used with a learning rate of 1, and binary cross-entropy is used as the loss function.

Network        Input                            Output                     Layer   Activation   Optimizer   Loss
Encoder        All genes (actual)               Encoded vector             1st     Tanh         Adadelta    Binary cross-entropy
                                                                           2nd     Tanh
Generator      Encoded vector                   All genes (generated)      1st     Tanh         Adadelta    Binary cross-entropy
                                                                           2nd     Sigmoid
Discriminator  All genes (actual & generated)   Discriminator probability  1st     Tanh         Adadelta    Binary cross-entropy
                                                                           2nd     Sigmoid

3.4.5 Results

The experimental results presented in Table B.2 show that twelve classifiers (DT, GB, KNN, LDA, LR, MLP, NB, QDA, RF, SVC, VC and XGB) are used to evaluate the various feature extraction techniques (AAE, PCA, VAE, AE and DAE) in terms of accuracy, recall, precision, F1-score, AUC, MCC and kappa score. All experiments were performed using five-fold cross-validation, and the StratifiedKFold technique was adopted. In the BRCA dataset, four breast cancer subtypes (LumA, LumB, Tri-Neg & Basal, and Her2) are labelled from the clinical information of the patients. Predicting breast cancer subtypes is therefore a multiclass classification problem. However, the dataset contains 20439 genes, which need to be reduced in order to classify correctly.


Fig. 3.2. Various classifiers such as DT, GB, KNN etc. are used to evaluate the performance of feature extraction methods such as AAE, PCA, AE, VAE and DAE. (a) Precision score compared among feature extraction methods using twelve different classifiers. (b) Recall score compared among feature extraction methods using twelve different classifiers.


Fig. 3.3. Performance of various feature extracting methods in terms of their computation time.

As datasets grow fast, an efficient feature extraction technique is necessary which can handle large amounts of data and perform well with any classifier. Therefore, an AAE based feature extraction algorithm is developed which performs well with any classifier in terms of all the performance metrics. For comparison, the current state-of-the-art methods PCA, VAE, AE, and DAE are used. It is found that the AAE is the steadiest across all classifiers. In the experiments, the highest scores of 86.44% accuracy, 86.01% F1-score, 86.44% recall, 86.78% precision, 97.19% AUC, 0.79 MCC, and 0.78 kappa are obtained when using the MLP classifier with PCA based feature extraction. In addition, PCA with DT, KNN, NB, QDA, SVC and XGB shows precisions of 75.41%, 79.89%, 70.56%, 72.25%, 25.25% and 81.80% respectively. However, the AAE based method with the DT, KNN, NB, QDA, SVC, and XGB classifiers shows improved precision, with scores of 84.41%, 85.74%, 85.42%, 84.27%, 85.96% and 85.47% respectively. Fig. 3.2 shows the precision and recall scores of the twelve different classifiers, where the AAE is found to be the steadiest. To summarise, six of the twelve classifiers (DT, KNN, QDA, SVM, NB, and XGB) perform well in all performance metrics when using the AAE as the feature extraction method. The remaining six classifiers (GB, LDA, LR, RF, VC and MLP) with

the AAE perform similarly or slightly worse compared to the other feature extraction methods. Here, the AAE takes 2164 MB of memory and 10 min 54 sec for training, fine-tuning and classifying the various cancer subtypes, as shown in Table B.3.

Fig. 3.4. Proposed AAE is compared with other methods in terms of seven performance metrics such as accuracy, F1-score, recall, precision, AUC, MCC and Kappa.

3.4.6 Discussion

In this work, the AAE based feature extraction method proved essential in extracting relevant information from both microarray and RNA-Seq data. Based on the results, it is observed that the proposed model yields satisfactory classification results, outperforming traditional methods such as PCA, AE, VAE and DAE. The AAE is trained in an unsupervised manner to extract highly correlated genes, and various classifiers are then evaluated on how well they can predict cancer based on those features. In this study, the pre-processing steps, which include sampling, normalisation and feature extraction, are an essential part of classification. A comparison among the feature extraction methods is then made based on the performance of twelve different classifiers that are widely used for cancer prediction. The results on the BRCA dataset show that the AAE feature extraction method leads to better performance with most of the classifiers compared to other techniques such as PCA, DAE, AE and VAE. Most importantly, the AAE provides consistent results (see Fig. 3.4) in all performance metrics across the twelve different classifiers, which makes this feature extraction model classifier-independent. In terms of computation time and memory usage, PCA takes the least time and memory of all; however, the proposed AAE performs best among the existing neural network based autoencoders (see Fig. 3.3). The limitation of this architecture is that it requires a large amount of data to extract meaningful features. It is anticipated that larger datasets have the potential to take full advantage of the power of the deep neural network in extracting gene expression features.

3.5 Evaluating Performance of AAE by Analyzing Connectivity Matrices

3.5.1 Overview

High-throughput genomics in combination with machine learning gives researchers the opportunity to identify disease-associated genes from large datasets, which is essential for understanding the underlying biological mechanisms. The growth of transcriptome datasets provides further opportunities to identify novel biomarkers that are likely to facilitate drug development and targeted therapy, and to provide direction in clinical research. Gene selection is essential because focusing on marker genes can reduce the cost of biological experiments and decision-making. With the advent of NGS technologies, modern deep learning algorithms are tackling fundamental biological questions such as the prediction of single nucleotide polymorphisms (SNPs) [53], the identification of biologically relevant genes [54] [55] [56], protein design [57] and protein structure prediction [58] [59]. In addition, personalised treatment, such as targeted therapies and the prediction of drug response, could disrupt traditional treatment in the future [60]. Feature extraction, an essential part of data analysis, is used to reduce high-dimensional data to an amenable number of features, which helps in interpreting results. In addition, feature extraction methods in cancer genomics are able to identify small lists of biomarkers with potential clinical utility. The application of feature extraction methods is particularly relevant for transcriptomic datasets, which often have a very large number of features. Many researchers have applied unsupervised deep learning models such as the VAE [54] [61] and the DAE [55] to transcriptomic datasets to extract meaningful representations of gene modules. Moreover, analysing the hidden nodes within neural networks could potentially highlight important genetic behaviour [62]. Latent-space features learned through unsupervised pre-training and supervised fine-tuning can reveal underlying patterns in genomic and biomedical domains.
The DAE has been applied to RNA-Seq data from the cancer genome atlas (TCGA) database, and its latent space used to identify genes that are critical for the diagnosis of breast cancer. Similarly, the VAE has been applied to pan-cancer gene expression data and compared with other feature extraction algorithms, with the VAE proving more informative. Due to the low cost of sequencing, sample sizes keep increasing, and deep learning based models are thus becoming even more successful. In contrast, traditional analysis methods cannot cope with large datasets and often produce biased results. Furthermore, unsupervised feature extraction methods are currently less developed, more poorly understood and more difficult to train [63] than their supervised counterparts; however, they have huge potential for extracting genes that bear significant biological relevance. In this work, a neural network based feature extraction model using gene expression data is presented, which identifies biologically relevant genes and their functions. The examination of these genes could be useful in validating recent discoveries in cancer research. The proposed model improves the selection of relevant genes compared with current state-of-the-art methods. A list of highly weighted genes is first selected and then sorted in descending order. The studies include the identification of genes responsible for diseases, the functional annotation of genes, and the analysis of gene signalling pathways from biological databases. Experiments are conducted to compare the proposed method with current state-of-the-art methods in terms of biological significance, and show the efficacy and potential of the proposed approach with promising results.

3.5.2 Methodology

A neural network based AAE model is introduced to extract biologically relevant features from RNA-Seq data. A method named TopGene is also developed to find highly interactive genes in the latent space. The AAE in combination with the TopGene method finds important genes which could be useful for identifying cancer biomarkers (see Fig. 3.5).

Data Pre-processing

Class imbalance often leads to overfitting and low predictive scores [50]; therefore, the synthetic minority oversampling technique (SMOTE) is adopted to balance the classes. A quantile transformer normalisation technique is used to eliminate data redundancy and noise.
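The two pre-processing steps can be sketched as follows. The toy data, class sizes and the hand-rolled interpolation are illustrative only; the actual workflow would use a full SMOTE implementation (e.g. imblearn's `SMOTE`, which interpolates toward k nearest neighbours) on the TCGA expression matrix.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
X_major = rng.lognormal(size=(90, 5))   # majority-class expression vectors
X_minor = rng.lognormal(size=(10, 5))   # minority class

# Simplified SMOTE-style oversampling: synthesize new minority samples by
# interpolating between random pairs of existing minority samples.
def oversample(X, n_new):
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    lam = rng.random((n_new, 1))
    return X[i] + lam * (X[j] - X[i])

X_minor_bal = np.vstack([X_minor, oversample(X_minor, 80)])
X = np.vstack([X_major, X_minor_bal])   # balanced: 90 vs 90

# Quantile normalisation: map each feature to a uniform distribution,
# suppressing outliers and technical noise.
X_norm = QuantileTransformer(n_quantiles=50).fit_transform(X)
```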

Dimension Reduction

After data pre-processing, the AAE model is trained to reduce the data to a smaller dimension without losing significant gene patterns. Hyperparameters of the AAE model such as batch size, number of epochs, optimiser, learning rate and loss function are chosen using a grid search to obtain the lowest training and validation loss. The training process of this autoencoder is unsupervised, so labels are not necessary. Finally, the AAE-generated features are compared with existing state-of-the-art methods such as AE, DAE, VAE, ICA, NMF and PCA.
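The grid search over those hyperparameters can be sketched as below. The candidate values are illustrative, and the objective is a placeholder standing in for an actual AAE training run.

```python
from sklearn.model_selection import ParameterGrid

# Candidate hyperparameter values (illustrative, not the thesis grid).
grid = ParameterGrid({
    "batch_size": [64, 128],
    "epochs": [50, 100],
    "learning_rate": [0.1, 1.0],
})

def validation_loss(params):
    # Placeholder objective: in the real workflow this would train the AAE
    # with `params` and return the measured validation loss.
    return abs(params["learning_rate"] - 1.0) + params["batch_size"] / 1e4

best = min(grid, key=validation_loss)   # configuration with lowest loss
```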

Fig. 3.5. Block diagram for weight matrix analysis using the AAE architecture. [Figure: the encoder maps the pre-processed 966 × 20439 input RNA-Seq data to a 966 × 50 latent space z; the generator reconstructs 966 × 20439 data from z; the discriminator judges whether a latent sample is real or generated (weight optimization); the TopGene algorithm is then applied to the weight matrix (gene × weight), followed by functional analysis (gene ontology and pathway).]

Sorting Highly Weighted Genes

The trained AAE generates a low-dimensional latent space, which is then sorted using the TopGene method. When the AAE has hidden layers, the weight matrices of the different layers must be multiplied together (dot product) to obtain the final weight matrix. A single-layer AAE, on the other hand, has no hidden layer, so no matrix multiplication is needed. The TopGene method is then applied to sum the weights across all the nodes. The genes are sorted according to their weights, and the top 50 genes are taken for further analysis.

GO and Pathway Analysis

Functional enrichment of the top 50 highly weighted genes is examined through GO analysis using DAVID tools. Table 3.2 presents the statistically enriched GO terms under 'molecular function' with their corresponding p-values. Many of the most significant terms are related to olfactory receptor activity, and olfactory receptor activity has been reported to act as a tumour suppressor in breast cancer. Pathway analysis is also used to identify related proteins and genes; it describes the various interactions among molecules in a cell related to metabolism, gene regulation and signal transmission [64]. DAVID pathway tools [65] [66] were used to retrieve pathways associated with the genes identified in the present study. Olfactory transduction was also found as a pathway for breast cancer, as shown in Figure C.1.

3.5.3 AAE model Implementation

The AAE is trained in two phases with the stochastic gradient descent (SGD) algorithm [67]. First, the encoder encodes the data and the generator acts as a decoder, reproducing the data as closely as possible to the input and minimising the reconstruction error. Second, the discriminator network tries to tell the generated samples apart from the original samples (actual input), and the weights of the generator and encoder are updated accordingly [67]. Once training is done, the latent space z contains useful biological information, with noisy features omitted. The AAE is versatile and powerful, as it can extract both linear and nonlinear relationships from the input data. First, a single-layer AAE, that is, an autoencoder without a hidden layer in which the input is directly connected to the output, was used to reduce the features to a dimension of 50, and its weight matrix was analysed for biological context. In addition, an AAE with one hidden layer of 1000 neurons in each network (encoder, generator and discriminator) was tested and also revealed meaningful genes. During training, adadelta was chosen as the optimizer with a learning rate of 1, and binary cross-entropy was used as the loss function. The model was trained in an unsupervised manner for 100 epochs with a batch size of 128, after which the weight matrix was analysed to sort out the highly weighted genes. In this model, neither dropout nor a regularizer was used, as the validation loss and training loss were almost identical. The number of hidden layers was increased further, but greater depth only modestly improved the performance. The model was implemented with the scikit-learn library [51] [68] and the Keras API [69] with a TensorFlow [70] back-end, running on the computer specified in Table B.1. The AAE-encoded latent space contains important subsets of genes effective for cancer detection.
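The two-phase training described above can be given a structural sketch in tensorflow.keras with the stated settings (adadelta, learning rate 1, binary cross-entropy). The layer sizes are shrunk to toy dimensions, and the random batch is a stand-in for the pre-processed RNA-Seq matrix; this is an illustration of the training scheme, not the thesis implementation.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_genes, latent_dim = 64, 8          # toy sizes; the thesis uses 20,439 -> 50

# Encoder: expression vector -> latent code z
encoder = keras.Sequential([keras.Input((n_genes,)), layers.Dense(latent_dim)])
# Generator (decoder): z -> reconstructed expression vector
generator = keras.Sequential([keras.Input((latent_dim,)),
                              layers.Dense(n_genes, activation="sigmoid")])
# Discriminator: distinguishes prior samples from encoded samples
discriminator = keras.Sequential([keras.Input((latent_dim,)),
                                  layers.Dense(16, activation="relu"),
                                  layers.Dense(1, activation="sigmoid")])
discriminator.compile(optimizer=keras.optimizers.Adadelta(learning_rate=1.0),
                      loss="binary_crossentropy")

# Phase 1: reconstruction -- encoder and generator minimise reconstruction loss.
autoencoder = keras.Model(encoder.input, generator(encoder.output))
autoencoder.compile(optimizer=keras.optimizers.Adadelta(learning_rate=1.0),
                    loss="binary_crossentropy")

# Phase 2: adversarial -- the encoder is trained to fool the frozen discriminator.
discriminator.trainable = False
adversarial = keras.Model(encoder.input, discriminator(encoder.output))
adversarial.compile(optimizer=keras.optimizers.Adadelta(learning_rate=1.0),
                    loss="binary_crossentropy")

X = np.random.rand(128, n_genes).astype("float32")          # one toy batch
z_prior = np.random.normal(size=(128, latent_dim)).astype("float32")

rec_loss = autoencoder.train_on_batch(X, X)                  # phase 1
z_fake = encoder.predict(X, verbose=0)
d_loss = discriminator.train_on_batch(
    np.vstack([z_prior, z_fake]),
    np.vstack([np.ones((128, 1)), np.zeros((128, 1))]))      # phase 2a
g_loss = adversarial.train_on_batch(X, np.ones((128, 1)))    # phase 2b
```

Freezing the discriminator before compiling the combined model follows the standard Keras adversarial-training pattern: the discriminator still updates when trained directly, but stays fixed inside the adversarial model.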
For a single-layer AAE, the weight matrix is W = G × 50, where G is the number of genes and each gene connects to 50 latent nodes. Deeper models are also analysed, in which the dot product of the weight matrices is calculated across the layers of the AAE. For example, for a two-layer neural network with one hidden layer of 1000 neurons, the first-layer weight matrix is W1 = G × 1000 and the second-layer weight matrix is W2 = 1000 × 50; applying the dot product W = W1 · W2 again gives W = G × 50. Therefore, the weight matrix for an n-layer neural network can be generalised as

W = ∏(i=1 to n) Wi.

The weight matrix was found to be strongly normally distributed (Fig. 3.6). First, the weights across all the nodes of each gene are summed, then the genes are sorted by weight; the sorting procedure is explained in Algorithm 1. Finally, the sorted genes are analysed through GO and pathway analysis to gain biological insight.
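The weight-matrix composition and the TopGene ranking step can be sketched in a few lines of numpy. The dimensions are toy values, and summing absolute weights across the latent nodes is an assumption about how the per-gene weight is aggregated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 200                        # toy size; the thesis uses G = 20,439

# Hypothetical layer weights of a 2-layer encoder: genes -> hidden -> 50
W1 = rng.normal(size=(n_genes, 40))  # toy hidden width instead of 1000
W2 = rng.normal(size=(40, 50))

W = W1 @ W2                          # composed gene-to-latent matrix, G x 50

# TopGene: sum each gene's weight across all latent nodes, sort descending
gene_score = np.abs(W).sum(axis=1)
top50 = np.argsort(gene_score)[::-1][:50]   # indices of the top 50 genes
```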

3.5.4 Results

All experiments in this study were conducted on the University's server described in Table B.1. Experimental results using the BRCA dataset are presented in Table 3.2 and Table B.5, and validation results using the UCEC dataset are presented in Table B.6. In this experiment, the causes of the disease are explored by extracting highly weighted genes from the latent space; analysing those genes could help identify the causes of the disease. Here, the top 50 highly weighted genes are taken for further GO and pathway analysis. In the proposed model, four genes, OR2T27, OR2A25, OR8B8 and OR6V1, are found to be related to olfactory receptor activity; in addition, these four genes map to the olfactory transduction pathway. A recent study by Lea et al. (2018) shows that olfactory receptor genes such as OR2B6 are highly expressed in breast carcinoma tissues compared with the respective healthy tissues [71]. They considered OR2B6 a potential biomarker for breast carcinoma

Table 3.2. Results of gene enrichment analysis using various feature extraction methods on the BRCA cancer dataset.

Algorithm | Molecular Function | P-value
AAE (single layer) | Olfactory receptor activity | 5.92e-2
AAE (2 layer) | Small conjugating protein ligase activity | 7.34e-2
AAE (3 layer) | Single-stranded RNA binding | 6.25e-2
AE | General RNA polymerase II transcription factor activity | 7.97e-3
DAE | Not available | —
PCA | Calcium ion binding | 6.44e-2
VAE | General RNA polymerase II transcription factor activity | 9.55e-2
NMF | Transition metal ion binding | 2.8e-2
ICA | RNA helicase activity | 6.49e-2

tissues. Their result provides validation of the proposed model. The deeper AAE model (2 layers) identified protein ligase activity, which is closely linked with human breast cancer [72]. PCA-extracted genes have GO terms related to calcium ion binding, which is known to be associated with breast cancer [73] [74]. Choon et al. (2018) found that genes transporting calcium ions into milk also have altered expression in some breast cancers. ICA yields a molecular function related to RNA helicase activity, which is associated with breast cancer; RNA helicase plays an oncogenic role in breast cancer by altering the cell cycle [75]. The other feature extraction methods were also applied and their highly weighted genes analysed, but no molecular function or significant pathway was found to be associated with breast cancer. In terms of computational efficiency, the proposed single-layer AAE takes 1946 MB of memory and 1 min 13 s of computation time (wall clock), as described in Table B.4. Furthermore, the proposed methods are evaluated on a different dataset, UCEC, which contains four major

Fig. 3.6. Histogram of AAE weights after applying the TopGene algorithm (x-axis: weights; y-axis: frequencies).

subtypes of endometrial carcinoma. The single-layer AAE model identified metal ion binding genes, KLF6, EGR3, CLTB, BRSK2, ZSWIM7, ZNF655, TKTL1, BBOX1, ZNF677, RNF114, PHF2, SLU7, ALOX5, CPD, BMPR1B and CACNA1A, which are associated with endometrial cancer risk [76] [77]. Heavy metals such as cadmium are well-known carcinogens and can increase the incidence of endometrial cancer [77]. The genes extracted by the other methods, such as PCA, VAE, NMF and AE, were observed to have no relevance to the disease.

3.5.5 Discussion

In this work, an AAE based feature extraction method is presented. It is shown that feature extraction methods not only help to classify or visualise data, but are also essential for extracting relevant biological information from both microarray and RNA-Seq data. Based on the results, the proposed AAE model provides a unique benefit in identifying biomarkers on two independent datasets. On the BRCA dataset, only the AAE based model finds an important pathway for breast cancer identification, and the AAE with a 2-layer neural network also extracts useful genes for tumour identification. Single-cell RNA-Seq, long-read RNA-Seq and microbiome datasets contain massive amounts of data; therefore, the deep learning based AAE model has a great future. The AAE model could also improve visualisation of high-dimensional data, as it provides a non-linear mapping. In the future, it is anticipated that the unsupervised AAE can provide a new way to extract and integrate large genomic datasets effectively.

3.6 Chapter Summary

This chapter presents applications of AAE-extracted features for classification and for identifying biologically relevant genes. The proposed method exhibited high performance on multi-class classification irrespective of the classifier, and the single-layer and 2-layer AAEs performed best at extracting meaningful features. It was successfully shown that the AAE captured important information in the latent space and identified potential biomarkers through analysis of the weight matrix. The optimal hyper-parameters were determined for the AAE model, and the results were evaluated using five-fold cross-validation. As datasets grow exponentially, it will become possible in the future to develop higher-capacity models with multiple hidden layers to identify cross-cancer biomarkers using heterogeneous cancer data. To sum up, the AAE based feature extraction method reveals important biomarkers that could be used in forecasting disease progression and determining treatment strategy. In the next chapter, instead of normalised RNA-Seq data, raw methylated DNA data is used in order to find novel biomarkers, and a pipeline is proposed for this purpose.

Chapter 4 Identification of Novel Mutation using Bioinformatics Pipelines

4.1 Overview

When hematopoietic cells proliferate abnormally, leukemia develops. The World Health Organization (WHO) has incorporated genetic information into its diagnostic algorithms [78]: in its revised report, genetic testing was reinforced for AML classification, and emphasis was placed on genetic tests to define clinically relevant AML entities in conjunction with immunophenotype, clinicopathologic features and morphology [79]. Owing to continuing advances, the application of high-throughput genetic technologies such as NGS and gene expression profiling has revealed new potential therapeutic targets and mechanisms of tumorigenesis [80]. Leukemia is recognized as one of the major causes of morbidity and mortality among children and adults worldwide. It is estimated that leukemia accounts for 2.5% of cancers worldwide, with approximately 250,000 people diagnosed every year [81]. One type of leukemia is CML, which affects adults more than children. CML affects the myeloid cells, which produce red blood cells (RBC), white blood cells (WBC) and platelets; a change at this stage causes the formation of an abnormal gene responsible for CML. The myeloid cells then grow, divide and build up in the bone marrow before being transported to other parts of the body through the blood. Leukemia can be divided into acute and chronic forms; AML is one of the most common leukemias among adults, accounting for ∼80% of adult cases and ∼20% of cases in children [82]. It has been observed that the pathogenicity of AML in most patients is linked to oncogenic fusion proteins generated by chromosome translocation or inversion. When the gene ontology profiles of 1,100 known genes were studied, it was revealed that each translocation was responsible for altering a specific function such as regulation of transcription, cell cycle, protein synthesis, and apoptosis [83].
Traditionally the risk was calculated by cytogenetic studies, but molecular detection of gene mutations has also become important. As AML is heterogeneous, multiple genes are involved in the disease, which makes it harder to treat [84]. About 749 chromosomal aberrations have been observed in AML, including oncogenes as well as loss of tumour-suppressor genes. It has been noticed that abnormalities in certain genes, as well as the gene expression profile, confer prognostic significance in adult AML patients [85]. Somatic mutations have been found in several genes in AML, including JAK2, KRAS2, FLT3, NUP98, and KIT. Variant analysis is one of the most significant applications of NGS technology, providing insight into genetic abnormalities. Several NGS analytical approaches have been developed to identify somatic and germline variants; these focus on selectively enriching the region of interest from whole genome, transcriptome or exome data, allowing widespread application in current research, typically for molecular diagnostics. The current research focuses on one application of NGS technology, namely variant analysis, to find novel variants that may be potent carcinogens involved in leukemia.

4.2 Data Collection

Variant analysis was performed on methylated DNA samples: SRR408629 and SRR408630 were a leukemia cell line from an AML patient, while SRR408623, SRR408624, SRR408625, SRR408626, SRR408627 and SRR408628 were reprogrammed cell lines. The samples were collected from the European Nucleotide Archive (ENA) under study accession no. PRJNA156619; the library layout was single-end Illumina sequencing. The samples are listed in Table 4.1. Previously, the methylation pattern of the K562 cell line was analysed and epigenetic change was investigated during the reprogramming process; that study also showed that epigenetic change had functional consequences on the leukemia phenotype. In this study, however, methylated DNA from various leukemia-related cell lines, namely K562, hES, LiPS1 and LiPS3, is analysed to find mutations. In addition, mutation frequencies are validated against the DriverDB and IntOGen databases in order to identify novel mutations. Finally, the genes responsible for the mutations are reported with their chromosome locations (see Table 4.2).

Table 4.1. Eight leukemia data samples with their corresponding sample types are described below.

Sample ID | Sample Type
SRR408623 | LiPS1 Cell line (Reprogrammed)
SRR408624 | LiPS1 Cell line (Reprogrammed)
SRR408625 | LiPS3 Cell line (Reprogrammed)
SRR408626 | LiPS3 Cell line (Reprogrammed)
SRR408627 | hES Cell line (Reprogrammed)
SRR408628 | hES Cell line (Reprogrammed)
SRR408629 | K562 Cell line (Leukemia Cell)
SRR408630 | K562 Cell line (Leukemia Cell)

4.3 Methodology

The whole pipeline for variant analysis is depicted in Fig. 4.1. The analysis was carried out using Galaxy, a web-based server for NGS data analysis. The data can be uploaded directly from the European Nucleotide Archive (ENA), and the FASTQ Groomer tool converts it into FASTQ format. Data quality is checked with FastQC, which gives detailed information about the sample; the reported parameters include GC content, over-represented sequences, per-base sequence quality and sequence length distribution. If the data requires further processing, such as trimming [86], this is performed using either Trimmomatic or FASTQ Trimmer. Trimming adapter sequences is necessary because they contaminate the reads. Trimmomatic uses a sliding window method in which bases below the quality threshold are removed by scanning the entire sequence [87] [88]. In the following experiment, trimming was performed with Trim Galore, which is specifically suited to DNA-methylated samples; the stringency and Phred quality of base calls for adapter removal must be specified individually. Next, mapping was performed using Bowtie2, an alignment algorithm used to align sequence reads against a large reference genome. It automatically chooses between local and end-to-end alignments, supports paired-end reads, performs chimeric alignments, and can be used for a wide range of sequence lengths, from 70 bp to a few megabases [89]. Variant analysis is then performed using three different variant callers, FreeBayes, GATK and DeepVariant, and the variants common to all three are chosen, as illustrated in Fig. 4.2. FreeBayes is a variant detector used to find SNPs, insertion-deletion mutations (indels) and multi-nucleotide polymorphisms (MNPs) [90].
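The pre-processing, alignment and calling steps above can be assembled as command lines. The helper below is hypothetical: the tool names match the text, but the flags and file-naming scheme are illustrative rather than the exact Galaxy invocations.

```python
# Hypothetical helper assembling command-line equivalents of the pipeline
# steps: quality check, trimming, alignment, then variant calling.
def build_pipeline(sample, genome_index="hg19"):
    fq = f"{sample}.fastq"
    return [
        f"fastqc {fq}",                                         # quality report
        f"trim_galore --quality 20 --stringency 3 {fq}",        # adapter/quality trimming
        f"bowtie2 -x {genome_index} -U {sample}_trimmed.fq -S {sample}.sam",
        f"freebayes -f {genome_index}.fa {sample}.bam > {sample}.vcf",
    ]

cmds = build_pipeline("SRR408623")   # one command string per pipeline stage
```

In practice the same samples would also be run through GATK and DeepVariant so that only the variants common to all three callers are retained.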
DeepVariant is a deep neural network based tool used to call genetic variants from next-generation DNA sequencing data [53]. Next, variants were annotated using VEP [91], an annotation server that can annotate SNPs, indels and even MNPs. After annotation, low-quality and synonymous variants are filtered out, leaving the novel variants. Finally, GO and protein-protein interaction (PPI) analyses were performed on the variant genes to determine their biological functions and how those genes are connected.
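The post-annotation filtering step can be illustrated with a minimal sketch. The record layout and consequence labels are simplified stand-ins for the fields VEP actually emits, and the XYZ1 entry is a hypothetical gene used only to exercise the filter.

```python
# Consequence classes retained for novel-variant discovery; synonymous and
# unknown SNVs are dropped because they do not alter the protein product.
KEEP = {"non-synonymous", "frameshift", "stopgain", "stoploss"}

def filter_variants(records):
    return [r for r in records if r["consequence"] in KEEP]

annotated = [
    {"gene": "MAMLD1", "consequence": "non-synonymous"},
    {"gene": "ACTB",   "consequence": "synonymous"},
    {"gene": "SF1",    "consequence": "stopgain"},
    {"gene": "XYZ1",   "consequence": "unknown"},   # hypothetical gene
]
novel = filter_variants(annotated)   # keeps MAMLD1 and SF1
```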

4.4 Results & Discussion

Genes such as KIT, JAK2 and KRAS2 are well known to be involved in AML, but given the heterogeneous nature of the disease, many other genes that are not yet known may also be involved. After applying the pre-processing steps, the variant callers are used to find the variants. Through the process of variant calling, the VCF files were obtained and annotated using the VEP

Fig. 4.1. In this block diagram, the whole pipeline for variant analysis is shown. First, in the pre-processing steps, the quality of the data is assessed and the data are trimmed to improve quality. The data are then aligned with the reference genome. Next, PCR duplicates are removed before variant calling, and important variants are obtained through filtering. The filtered variants are then annotated against reference variant annotation databases. Finally, the data are further refined in order to find, validate and visualize novel variants.

tool. The annotated files contained the genes and their exonic functions. Synonymous SNVs and unknown SNVs were then filtered out, since these types of mutation do not have any significant effect on a gene's exonic function. The remaining exonic functions, i.e., non-synonymous SNVs, frameshift mutations, and stop-gain and stop-loss mutations, were retained to obtain novel SNPs. The variants obtained from the analysis are shown in Table 4.2. A total of 22 new variants were found by analyzing the methylated DNA datasets. The involvement of these variants in AML was checked against the DriverDB database [92], and their mutation frequencies in AML were found using the IntOGen database [93]. Three variants, ZFHX3, DOK2, and PKHD1, were found to have some involvement in AML as observed on DriverDB; the remaining genes were novel and rarely associated with AML. Welch et al. performed a study on the origin and evolution of genes involved in AML and concluded that ZFHX3 was mutated in patients [94]. ZFHX3 has previously been studied in acute lymphoblastic leukemia by Oshima et al. [95],

and the mutation frequency for this was 1.16%. In this study, the mutation frequency of ZFHX3 in AML was found to be 0.51%. It was further observed that ZFHX3 was involved in AML in a study conducted by Garg et al., who profiled all somatic mutations observed in AML; ZFHX3 was reported as one of the mutated genes in AML, but at a different chromosomal location [96]. Here, two non-synonymous mutations in the ZFHX3 gene are identified at two different positions on chromosome 16. The gene DOK2 is responsible for downstream of tyrosine kinase signalling and was found to have a mutation frequency of 0.51% for AML; its involvement in AML is still mostly unknown. A study performed by He et al. using RQ-PCR to detect the expression of DOK2 in AML patients confirmed that decreased expression of DOK2 and hypermethylation of its promoter could be adverse in AML [97]. In another study, Kim et al. found no sign of the DOK2 gene in AML and concluded that DOK2 might be mutated in other ways, such as loss or gain of function or deletion of the locus; research also shows that DOK genes might be responsible for AML [98]. In this study, a non-synonymous SNV of DOK2 is also found, on chromosome 8. The third gene showing some previous association with AML is PKHD1: a single SNV of the PKHD1 gene is found on chromosome 6, with a mutation rate of 0.51%. PKHD1 is also associated with autosomal-recessive polycystic kidney disease. Siraj et al. found that the PKHD1 gene had a role in pancreatic ductal adenocarcinoma, but there are no reports of its expression in AML [99]. The other variants found were non-carcinogenic or didn't have any known role in AML; their mutation frequency was 0%. Most of the variants are either little studied or have not been examined from the perspective of AML. Dillon et al. [100] conducted a study on recurrently mutated genes and their role in AML patients.
The patients were of African, African-Caribbean and Japanese origin. The study revealed mutations across several pathways and the genes involved in them. One of the affected pathways was NOTCH, which contains the MAMLD1 gene; the mutation was observed across all patients in the study. In the present study, a MAMLD1 mutation is also found, as illustrated in Fig. 4.3. Bullinger and Valk

Fig. 4.2. Venn diagrams of the variants called by the three tools for samples SRR408629 & SRR408630. [Figure: two Venn diagrams comparing GATK (83 and 73 calls), DeepVariant (173 and 124 calls) and FreeBayes (970 and 1037 calls) for the two samples.]

conducted a study profiling gene expression in AML, in which multiple genes were studied across multiple patients. Genes belonging to the HOX family and MYB were found to be expressed more highly in many patients, while one of the genes in the study, DMWD, was found to be expressed in only a few patients [101]. A study on ovarian cancer cells by Liu et al. included CDK11 gene silencing by synthetic siRNAs or lentiviral shRNA; CDK11 has been shown to be essential for cell migration and invasion. When a knockdown of CDK11 by CRISPR-RNAi technology was used, growth inhibition was observed in AML cells [102]. SLC, or solute carrier, proteins are responsible for the transport of a wide range of substrates. The SLC gene SLC2A5 was found to be expressed more highly in AML patients in a study conducted by Chen et al. Further, when a knockdown of the gene was introduced into AML cells it decreased cell proliferation, while overexpression of the same gene increased proliferation, cell growth, migration and invasiveness. They also observed that AML cells utilise much fructose under low-glucose conditions, an uptake increased by SLC2A5 [103]. In a similar study, Weng et al. studied the effects of the SLC2A5 gene on lung adenocarcinoma; they also ran tests on AML cells, found similar results, and concluded that AML cells largely take up fructose for cell proliferation and that the

SLC2A5 gene assists in the uptake; a knockdown of the gene showed less proliferation and decreased uptake of fructose [104]. The other SLC gene involved was SLC22A6, which acts as a transporter of anticancer agents into cancer cells and also helps in the uptake of various essential nutrients required for the growth of cancer cells. Li Shu studied this with various SLC genes, observing that specific genes were responsible for transporting anticancer drugs to AML cells and that mutations in such genes could be adverse for patients with cancer [105]. No other particular study was found on the role of SLC22A6 in AML, and its effects are still unknown. Among splicing factor (SF) genes, SF1 was observed in this study as well as in a study conducted by Larsson et al., who examined recurrent somatic mutations in various genes; they observed that mutations occurred rarely in the SF1 gene compared with other genes and disregarded any effect of SF1 on AML [106]. EGR2 was studied by Pospisil et al., who considered the inhibition of the oncogenic miRNA cluster 17∼92 by EGR2. It was observed that in most AML patients the level of EGR2 was significantly low compared with healthy counterparts. When EGR2 was introduced into AML cells, inhibition of the miRNA 17∼92 clusters, which are involved in AML and overexpressed in patients, was observed. Upregulation of EGR2 also led to induction of apoptosis by increasing the level of BIM mRNA [107]. Tian et al. also conducted studies on EGR genes but focused mainly on EGR1; they concluded that EGR1 was more probable in causing AML than other EGR genes, including EGR2, and stated that all members of the EGR family share a highly conserved DNA-binding domain [108].
Plks, or Polo-like kinases, belong to the serine-threonine kinase family, which is responsible for the regulation of multiple intracellular processes such as DNA replication, mitosis and stress responses. Liu et al. conducted a study on Plks as a therapeutic approach for cancer treatment, and He et al. observed that PLK3 could be a possible tumour suppressor gene because of its downregulation in many cancers. No specific relation

Table 4.2. List of novel variants obtained from annotated data

Sample ID | Novel variant | Type of mutation | Chromosome location
SRR408623 | MAMLD1 | non-synonymous | ChrX: 150470758
SRR408623 | SLC22A6 | non-synonymous | Chr11: 62983599
SRR408624 | DMWD | non-synonymous | Chr19: 45786321
SRR408624 | CDK11A | non-synonymous | Chr1: 1707539
SRR408625 | SLC2A5 | non-synonymous | Chr1: 9037714
SRR408625 | QSOX2 | non-synonymous | Chr9: 136226806
SRR408626 | No novel variants are observed | — | —
SRR408627 | EGR2 | non-frameshift substitution | Chr10: 62813491
SRR408627 | SF1 | stop-gain | Chr11: 64765966
SRR408627 | ZFHX3 | non-synonymous | Chr16: 72796453
SRR408627 | ZFHX3 | non-synonymous | Chr16: 72796480
SRR408627 | MYCBPAP | non-synonymous | Chr17: 50517594
SRR408627 | DIRAS1 | non-synonymous | Chr19: 2717683
SRR408627 | DOK2 | non-synonymous | Chr8: 21909724
SRR408628 | PLK3 | stop-gain | Chr1: 44801922
SRR408628 | GPR182 | non-synonymous | Chr12: 56996180
SRR408628 | POU4F2 | non-synonymous | Chr4: 146639979
SRR408628 | ATP13A2 | non-synonymous | Chr1: 16996279
SRR408628 | TEX2 | non-synonymous | Chr17: 64214100
SRR408628 | TEX2 | non-synonymous | Chr17: 64214124
SRR408629 | PKHD1 | non-synonymous | Chr6: 52053075
SRR408629 | GTF3C5 | non-frameshift | Chr9: 133043843
SRR408629 | SURF6 | non-frameshift | Chr9: 133332667
SRR408630 (after trimming) | HECTD4 | non-synonymous | Chr12: 112194954
SRR408630 | CFAP69 | stop-gain | Chr7: 90245542

was found between PLK3 and AML; however, He et al. did find a relation between PLK1 and AML [109]. In other studies conducted by Brandwein and by Renner et al., PLK1 was also found to be overexpressed in multiple tumor tissues, including AML. While

PLK1 was overexpressed, PLK3 was found to be down-regulated in many cancerous tissues [110] [111]. Berg et al. found that other PLK genes, such as PLK2 and PLK3, can act as tumor suppressor genes in vivo [112]. Strebhardt et al. concluded that PLK1 overexpression could result in oncogenic transformation, while ectopic expression of PLK3 induces apoptosis; the activity of PLK3 rapidly increases with DNA damage and cellular stress while its expression level remains unchanged. It was also suggested that PLK3 is involved in the DNA damage response and apoptosis through the p53 pathway. On inducing mice with PLK3, they found tumors in many organs, many of them large, suggesting angiogenesis; however, no relation was established between AML and PLK3 [113]. GTF3C5 was studied by Qian et al. in work aimed at observing the cytogenetic and genetic pathways involved in AML. One of the experiments involved identifying cooperating cancer genes and obtaining their retroviral integration sites; common insertion sites (CISs) were obtained from the Retroviral Tagged Cancer Gene Database [114]. From the database they obtained two CISs, one of which was found to be 23 kb upstream of the GTF3C5 locus, leading them to conclude that the gene plays an important role in AML. EGR2 was also found to play an important role in AML, as it promotes monocytic differentiation [115]. Another gene studied in relation to AML was GPR182, examined by Kechele et al., who studied the suppressor qualities of the GPR182 gene in stopping adenoma formation; GPR182 was among the genes severely altered in a zebrafish AML model [90]. The rest of the novel variants found, i.e., QSOX2, MYCBPAP, DIRAS1, POU4F2, ATP13A2, TEX2, SURF6, HECTD4, and CFAP69, have not been studied much with regard to their involvement in the formation of AML. For CFAP69, no data are available concerning its mutational frequency or its potential as an oncogene.

Fig. 4.3. In this figure, the variant obtained from the SRR408623 sample is shown. The novel variant is located at position Chromosome X:150470758, where the nucleotide base changes from C to G; the MAMLD1 gene is found to be responsible for this mutation.

4.5 Gene Ontology (GO) & Protein-protein interaction (PPI) Analysis

The functional enrichment of the variant genes is examined through the DAVID functional annotation tool. Table B.8 shows the statistically enriched GO terms under "cellular component", "molecular function", and "biological process" with their corresponding p-values. The most significant terms relate to the nucleoplasm part under cellular component, with a count of three genes and a p-value of 9.3e-2. In addition, among the 22 variant genes, five genes are involved in transcription regulator activity, with a p-value of 7.1e-2. The protein-protein interaction network built from the variant genes is shown in Fig. C.2; the network consists of a total of three nodes and four edges, in which the two variant genes PLK3 (threonine-protein kinase) and SF1 (splicing factor 1) are connected via CDC5L (cell division cycle 5 like).

4.6 Chapter Summary

Apart from AML, most of the variants obtained from this study have associations with other diseases as well (see Table B.7). To sum up, most of the variants obtained were non-synonymous SNVs; apart from these, EGR2, GTF3C5, and SURF6 are non-frameshift mutations, and PLK3, CFAP69, and SF1 are stop-gain mutations. The identified genes may be helpful in improving understanding of the underlying molecular mechanisms of AML. The research could be further extended to determine the role of these novel genes in metabolic pathways and to develop molecularly targeted therapies for AML. In the previous chapter, normalised RNA-Seq data was used to classify diseases and to identify biomarkers; in this chapter, raw methylated DNA data is used, which requires different pre-processing steps and algorithms, such as DeepVariant, in order to find novel variants and their chromosomal locations.

Chapter 5 Conclusions and Future Work

5.1 Conclusions

In this thesis, an AAE based feature extraction method is developed and its usefulness analysed; in addition, a pipeline is developed for mutation detection, through which novel mutations have been identified. As genetic data contain many features but limited samples, they carry a significant amount of noise. The AAE based feature reduction technique extracts meaningful features that enable classification of cancer cells. First, to verify the usefulness of the AAE, various classifiers were used and their performance compared with existing state-of-the-art methods; the AAE-extracted features performed excellently and consistently across the classifiers. Classification of cancer types at an early stage increases the chances of successful treatment. The AAE based feature extraction method automatically finds correlations in the data and extracts new features that help classifiers to distinguish cancer subtypes; this computational technique can help pathologists diagnose diseases more accurately. Besides, it was possible to obtain biological information by studying the weight matrix of the AAE. Encoded genes were ranked according to their weights and the top 50 genes selected; these genes were taken for gene ontology and pathway analysis, and their molecular functions proved to be associated with breast cancer. From these findings, it can be concluded that the AAE based feature extraction technique has much potential for cancer subtype classification and biomarker identification. Finding novel biomarkers from RNA-Seq data is valuable for predicting prognosis and for personalized medicine; moreover, biomarkers can be helpful in identifying therapeutic targets and monitoring treatment.

To find genetic variation in methylated DNA data, a bioinformatics pipeline is developed in which three different variant callers, FreeBayes, GATK and the neural-network-based DeepVariant, are used. The pipeline was applied to methylated DNA leukemia samples, and 22 novel variants were obtained. Furthermore, various downstream analyses, such as GO and PPI analysis, were performed to determine the biological functions of, and interactions between, the mutated genes. The chromosomal locations of the mutated genes were also identified; identifying the location of genetic variation is important for predicting functional significance and for functional annotation. Overall, genetic variation can assist in determining targeted therapies and personalised medicine. One limitation of deep-learning-based methods is that they require large amounts of data; however, as genetic data-sets keep growing, the AAE-based method has great potential in cancer research. The variants obtained from the leukemia samples could also provide direction for further biological evaluation.
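
As a concrete illustration, the three callers above are typically invoked as command-line tools. The following Python sketch assembles representative commands for each; the file names (ref.fa, sample.bam) and the output prefix are placeholders, and the exact options used in this thesis may differ from the defaults shown here.

```python
# Sketch of the variant-calling step of the pipeline described above.
# File names are placeholders, not the thesis's actual data files.

def variant_call_commands(ref, bam, out_prefix):
    """Build shell commands for the three variant callers compared here."""
    return {
        # FreeBayes: Bayesian haplotype-based caller, writes VCF to stdout
        "freebayes": f"freebayes -f {ref} {bam} > {out_prefix}.freebayes.vcf",
        # GATK HaplotypeCaller: local-reassembly-based caller
        "gatk": (f"gatk HaplotypeCaller -R {ref} -I {bam} "
                 f"-O {out_prefix}.gatk.vcf"),
        # DeepVariant: CNN-based caller, commonly run through Docker
        "deepvariant": ("docker run google/deepvariant "
                        "/opt/deepvariant/bin/run_deepvariant "
                        f"--model_type=WGS --ref={ref} --reads={bam} "
                        f"--output_vcf={out_prefix}.dv.vcf"),
    }

cmds = variant_call_commands("ref.fa", "sample.bam", "leukemia")
for name, cmd in cmds.items():
    print(name, "->", cmd)
```

Each resulting VCF can then be passed to the same downstream annotation and intersection steps, so that only variants supported by the callers of interest are retained.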

5.2 Recommendations for Future Work

The following recommendations are made for future research studies:
1) It is recommended to implement the AAE model on different, larger data-sets with more hidden layers, and to optimise the hyper-parameters accordingly.
2) Instead of the grid search technique, the AAE architecture in conjunction with a genetic algorithm could be used to optimise its hyper-parameters more efficiently.
3) Better hardware with a larger GPU cluster is recommended for training deeper AAEs.
4) Cross-cancer biomarkers could be identified if the AAE were applied to heterogeneous cancer data.

5) The mutation detection pipeline can be applied to different genomic data to identify mutations. Also, depending on the needs of researchers, a different biological analysis plan can be implemented on this pipeline using the methylated DNA dataset.

It is expected that this thesis provides a strong basis for ongoing research in deep learning and bioinformatics. In this thesis, a new architecture for feature extraction and a pipeline for mutation detection are developed. The proposed methods are then compared with traditional methods, and their effectiveness is established by comparing the results.
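
To make recommendation 2 concrete, a genetic algorithm for hyper-parameter search can be sketched as below. The search space, the selection/mutation scheme, and especially the fitness function are illustrative assumptions: in practice, fitness would train the AAE with the candidate configuration and return a validation score, rather than the toy scoring function used here.

```python
# Minimal genetic-algorithm sketch for AAE hyper-parameter tuning.
# The fitness function is a placeholder standing in for "train the AAE
# with this configuration and return its validation accuracy".
import random

SEARCH_SPACE = {
    "hidden_units": [50, 100, 200, 500],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
}

def fitness(cfg):
    # Toy surrogate: pretends 100 hidden units and lr=1e-3 are optimal.
    return -abs(cfg["hidden_units"] - 100) - 1000 * abs(cfg["learning_rate"] - 1e-3)

def random_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(cfg):
    # Resample one hyper-parameter at random
    key = random.choice(list(SEARCH_SPACE))
    child = dict(cfg)
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def crossover(a, b):
    # Inherit each hyper-parameter from either parent
    return {k: random.choice([a[k], b[k]]) for k in SEARCH_SPACE}

def genetic_search(pop_size=10, generations=20):
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection (elitist)
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

random.seed(0)
best = genetic_search()
print(best)
```

Unlike grid search, which evaluates every combination, this explores only a fraction of the space while concentrating trials near configurations that already score well, which matters when each evaluation requires training a deep model.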

Reproducibility

My work in this thesis is fully reproducible; please follow the links below.
https://bit.ly/2x3FHZt
https://github.com/Nano-Neuro-Research-Lab/latent-space-discovery
https://github.com/Nano-Neuro-Research-Lab/ngs-variant-analysis

Presentation

1. R.K. Mondol. Machine learning approach to characterize breast cancer subtypes using gene expression profile. Poster presented at AMSI BioinfoSummer 2017, Monash University.


Chapter A Algorithms

Algorithm 1 TopGene: obtaining the most highly weighted genes
Input: weight matrix (gene × node)
Output: gene_list ← top 50 most highly weighted genes

1: procedure TopGene
2:     gene_weight_list ← empty list
3:     for gene ← 0 to totalGene do
4:         weight_sum ← 0
5:         for node ← 0 to totalNode do
6:             weight_sum ← weight_sum + weight(gene, node)
7:         end for
8:         append (gene, weight_sum) to gene_weight_list
9:     end for
10:    sort gene_weight_list by weight_sum in descending order
11:    gene_list ← top 50 genes from gene_weight_list
12:    return gene_list
13: end procedure
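
Algorithm 1 above can be sketched compactly in NumPy. The weight-matrix orientation (genes × encoding nodes) follows the algorithm's input; the toy matrix, gene names, and the choice of summing raw weights (as in the pseudocode, rather than absolute values) are illustrative.

```python
# NumPy sketch of Algorithm 1 (TopGene): score each gene by the sum of
# its encoder weights across all nodes and keep the top k.
import numpy as np

def top_genes(weight_matrix, gene_names, k=50):
    """Return the k genes with the largest summed weight."""
    weight_sum = weight_matrix.sum(axis=1)   # one score per gene
    order = np.argsort(weight_sum)[::-1]     # indices sorted descending
    return [gene_names[i] for i in order[:k]]

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 50))               # toy 200-gene x 50-node weights
names = [f"gene{i}" for i in range(200)]
top = top_genes(W, names, k=5)
print(top)
```

The returned list can then be passed directly to gene ontology and pathway-analysis tools such as DAVID, as done in the thesis for the top 50 genes.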

Algorithm 2 Pseudocode of our methodological approach for classification

procedure CancerClassification
    dataset ⇐ {"BRCA"}
    over_sampler ⇐ {"SMOTE"}
    normalizer ⇐ {"QuantileTransformer"}
    feature_extractor ⇐ {"AAE", "VAE", "DAE", "AE"}
    classifier ⇐ {"MLP", "SVM", "RF", "LR", etc.}
    performance_metrics ⇐ {"Accuracy", "Precision", "Recall", etc.}
    X, Y ⇐ load_data(dataset)                        ▷ X holds data samples, Y holds labels
    for k = 1 → 5 do                                 ▷ five-fold cross-validation
        X_train, Y_train, X_test, Y_test ⇐ split_data(X, Y, k)
        fitted ⇐ over_sampler(X_train, Y_train)
        X_train_sampled ⇐ predict_sampler(X_train, fitted)       ▷ only training data is over-sampled
        fitted ⇐ normalizer(X_train_sampled)
        X_train_norm ⇐ predict_normalizer(X_train_sampled, fitted)
        X_test_norm ⇐ predict_normalizer(X_test, fitted)
        model_trained ⇐ feature_extractor(X_train_norm)          ▷ feature extraction is unsupervised
        fine_tuned ⇐ fine_tuning(X_train_norm, Y_train, model_trained)   ▷ fine-tuning is supervised
        X_train_feature ⇐ predict_feature(X_train_norm, fine_tuned)
        X_test_feature ⇐ predict_feature(X_test_norm, fine_tuned)
        fitted ⇐ classifier(X_train_feature, Y_train)
        Y_test_prediction ⇐ predict_classifier(X_test_feature, fitted)
        score[k] ⇐ performance_metrics(Y_test, Y_test_prediction)
    end for
    results ⇐ mean(score)
end procedure
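
Algorithm 2 maps naturally onto scikit-learn. The sketch below follows the same fold-wise order (over-sample training data only, fit the normaliser on training data, extract features, classify), but with stand-ins so it stays self-contained: synthetic data replaces BRCA, PCA replaces the AAE, naive random over-sampling replaces SMOTE, and a single logistic-regression classifier replaces the classifier ensemble.

```python
# Self-contained sketch of the Algorithm 2 cross-validation loop.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: 200 samples driven by one hidden factor (stand-in for BRCA)
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = z @ rng.normal(size=(1, 30)) + 0.3 * rng.normal(size=(200, 30))
Y = (z[:, 0] > 0).astype(int)

scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, Y):
    X_tr, Y_tr, X_te, Y_te = X[tr], Y[tr], X[te], Y[te]

    # Over-sample the minority class in the training fold only
    counts = np.bincount(Y_tr)
    minority = int(np.argmin(counts))
    extra = rng.choice(np.where(Y_tr == minority)[0], counts.max() - counts.min())
    X_tr = np.vstack([X_tr, X_tr[extra]])
    Y_tr = np.concatenate([Y_tr, Y_tr[extra]])

    # Fit the normaliser on training data, transform both folds
    norm = QuantileTransformer(n_quantiles=50).fit(X_tr)
    X_tr, X_te = norm.transform(X_tr), norm.transform(X_te)

    # Unsupervised feature extraction (PCA here; AAE in the thesis)
    feat = PCA(n_components=10).fit(X_tr)
    X_tr_f, X_te_f = feat.transform(X_tr), feat.transform(X_te)

    clf = LogisticRegression(max_iter=1000).fit(X_tr_f, Y_tr)
    scores.append(accuracy_score(Y_te, clf.predict(X_te_f)))

print("mean accuracy:", np.mean(scores))
```

The key design point mirrored from the pseudocode is leakage avoidance: the over-sampler, normaliser, and feature extractor are all fitted on the training fold only and merely applied to the test fold.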

Chapter B Tables

Table B.1. Specification of computer used in the experiment

OS: Ubuntu 14.04.4 LTS
CPU: 2× Intel Xeon E5-2630 v3 @ 2.4 GHz
GPU: 6× NVIDIA Tesla K80 (12 GB each)
Memory: 128 GB

Table B.2. Comparison of various feature extraction techniques on the BRCA dataset with twelve different classifiers. Five-fold cross-validation was performed during evaluation.

Feature  Classifier                     Accuracy     F1-Score     Recall       Precision    AUC          MCC          Kappa
AAE      DecisionTree                   0.839388483  0.836249942  0.839388483  0.844108326  0.847653599  0.750988818  0.749160982
AAE      GaussianNaiveBayes             0.849622936  0.847029478  0.849622936  0.85426443   0.954855911  0.769002453  0.766501187
AAE      GradientBoosting               0.839621327  0.839088704  0.839621327  0.849173344  0.935249701  0.756213173  0.754086964
AAE      K-nearestNeighbors             0.854514302  0.853827909  0.854514302  0.857487812  0.901797375  0.77617233   0.774851882
AAE      LinearDiscriminantAnalysis     0.846303812  0.847388265  0.846303812  0.856157931  0.942246196  0.768003317  0.765195144
AAE      LogisticRegression             0.851193839  0.85247222   0.851193839  0.857671964  0.967289315  0.77328451   0.772255133
AAE      MultiLayerPerceptron           0.849499173  0.850309202  0.849499173  0.854115081  0.9646908    0.770009528  0.769177168
AAE      QuadraticDiscriminantAnalysis  0.833009526  0.819977387  0.833009526  0.842746164  0.920429012  0.740315082  0.734175864
AAE      RandomForest                   0.849443833  0.848652248  0.849443833  0.851384158  0.963209116  0.768292222  0.767233376
AAE      SupportVectorMachine           0.854473431  0.855198005  0.854473431  0.85960089   0.966842232  0.776959628  0.775883708
AAE      VotingClassifier               0.851180522  0.852628473  0.851180522  0.859301306  0.964492001  0.772955155  0.771627221
AAE      XGBoost                        0.847858688  0.846767381  0.847858688  0.854791386  0.940914777  0.768282568  0.765991729
PCA      DecisionTree                   0.753736615  0.751592834  0.753736615  0.75418319   0.791098601  0.622354019  0.620818118
PCA      GaussianNaiveBayes             0.710472338  0.657722762  0.710472338  0.705645881  0.919806712  0.548915033  0.489031294
PCA      GradientBoosting               0.839539354  0.834799917  0.839539354  0.842904871  0.947004649  0.751577474  0.7488104
PCA      K-nearestNeighbors             0.800137319  0.771107037  0.800137319  0.798929207  0.841333945  0.68760381   0.669386738
PCA      LinearDiscriminantAnalysis     0.841274672  0.840933216  0.841274672  0.847812351  0.952886432  0.757780367  0.755420527
PCA      LogisticRegression             0.856231565  0.851939132  0.856231565  0.853748179  0.972522922  0.780780828  0.778275478
PCA      MultiLayerPerceptron           0.864484286  0.860142917  0.864484286  0.867868279  0.971902183  0.792014794  0.788583467
PCA      QuadraticDiscriminantAnalysis  0.763384783  0.714156607  0.763384783  0.722590756  0.822266525  0.633382512  0.594658536
PCA      RandomForest                   0.821544982  0.814433738  0.821544982  0.823637601  0.95407046   0.723236263  0.720677223
PCA      SupportVectorMachine           0.502520441  0.336148077  0.502520441  0.252544581  0.911284185  0            0
PCA      VotingClassifier               0.864457867  0.862053862  0.864457867  0.867286196  0.968348976  0.792500149  0.790122967
PCA      XGBoost                        0.790163958  0.794978228  0.790163958  0.818087056  0.927292824  0.693161313  0.685385267
VAE      DecisionTree                   0.774698501  0.773041542  0.774698501  0.77944041   0.794796607  0.652719602  0.649762969
VAE      GaussianNaiveBayes             0.849511829  0.84608678   0.849511829  0.857923589  0.966968445  0.768035279  0.765997897
VAE      GradientBoosting               0.826188391  0.817940653  0.826188391  0.827152685  0.954982207  0.730262282  0.723697307
VAE      K-nearestNeighbors             0.836298391  0.824661131  0.836298391  0.845570676  0.873404354  0.747587069  0.736835105
VAE      LinearDiscriminantAnalysis     0.846299758  0.842439558  0.846299758  0.847203539  0.95363347   0.762176099  0.760102351
VAE      LogisticRegression             0.849485398  0.844308487  0.849485398  0.856267011  0.965020919  0.768293372  0.765512002
VAE      MultiLayerPerceptron           0.844443382  0.838125267  0.844443382  0.849904151  0.96736337   0.760232579  0.757019396
VAE      QuadraticDiscriminantAnalysis  0.816545417  0.780158188  0.816545417  0.756123651  0.942402546  0.712892589  0.701145717
VAE      RandomForest                   0.842802908  0.834898783  0.842802908  0.837045414  0.959560767  0.756357916  0.752568376
VAE      SupportVectorMachine           0.741605395  0.696701387  0.741605395  0.748949679  0.932980503  0.598378311  0.547314634
VAE      VotingClassifier               0.849485398  0.844008731  0.849485398  0.856164611  0.96482982   0.768117807  0.765113451
VAE      XGBoost                        0.809727449  0.804951436  0.809727449  0.813582187  0.93342438   0.707593404  0.703730845
DAE      DecisionTree                   0.756745202  0.76136685   0.756745202  0.772239554  0.80156567   0.635143351  0.63293761
DAE      GaussianNaiveBayes             0.803209125  0.804638936  0.803209125  0.823718224  0.944553731  0.712128676  0.70478357
DAE      GradientBoosting               0.827940633  0.824491699  0.827940633  0.82558335   0.952923342  0.735924476  0.734120872
DAE      K-nearestNeighbors             0.810029836  0.801003462  0.810029836  0.804152958  0.870360912  0.704581917  0.701436887
DAE      LinearDiscriminantAnalysis     0.829633057  0.827492948  0.829633057  0.831083486  0.946805835  0.737863582  0.73599749
DAE      LogisticRegression             0.851316231  0.844226923  0.851316231  0.843058001  0.968190793  0.771852926  0.769832134
DAE      MultiLayerPerceptron           0.85132957   0.845358555  0.85132957   0.846808679  0.965392221  0.771412082  0.76978905
DAE      QuadraticDiscriminantAnalysis  0.659424406  0.660667302  0.659424406  0.741780811  0.87448019   0.52613188   0.499096967
DAE      RandomForest                   0.833091515  0.83153543   0.833091515  0.834235186  0.955596555  0.743636066  0.74263735
DAE      SupportVectorMachine           0.504159785  0.339680007  0.504159785  0.298859443  0.888868625  0.02021046   0.004147157
DAE      VotingClassifier               0.849608925  0.844833196  0.849608925  0.846621453  0.96489829   0.768181112  0.766579872
DAE      XGBoost                        0.798196022  0.798449186  0.798196022  0.805404252  0.935335233  0.694800308  0.692319846
AE       DecisionTree                   0.776870101  0.777640872  0.776870101  0.788093191  0.813045073  0.664381503  0.661059745
AE       GaussianNaiveBayes             0.824594203  0.824433767  0.824594203  0.836907819  0.953435819  0.735705809  0.731511428
AE       GradientBoosting               0.837817784  0.834262996  0.837817784  0.846374018  0.953101885  0.752539948  0.748924611
AE       K-nearestNeighbors             0.806641389  0.795330514  0.806641389  0.797187143  0.860574433  0.700721583  0.696349558
AE       LinearDiscriminantAnalysis     0.832762907  0.830316602  0.832762907  0.833601319  0.949007155  0.741437876  0.739739786
AE       LogisticRegression             0.839649344  0.834615387  0.839649344  0.837407037  0.96738091   0.754727258  0.752788666
AE       MultiLayerPerceptron           0.839717317  0.833825499  0.839717317  0.831991725  0.966478992  0.75464963   0.752927149
AE       QuadraticDiscriminantAnalysis  0.769696892  0.752339499  0.769696892  0.785987804  0.917625177  0.656824961  0.640123808
AE       RandomForest                   0.856207382  0.855261686  0.856207382  0.862356696  0.961018755  0.781087521  0.778914687
AE       SupportVectorMachine           0.502520441  0.336148077  0.502520441  0.252544581  0.898905424  0            0
AE       VotingClassifier               0.846178704  0.84271771   0.846178704  0.845300405  0.963799657  0.764311356  0.762747477
AE       XGBoost                        0.826135756  0.825636444  0.826135756  0.833472159  0.941867158  0.736588773  0.733891252
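The metrics reported in Table B.2 can be computed in outline with scikit-learn. The snippet below is a minimal sketch on synthetic data, not the thesis pipeline: the classifier, data generator, and weighted averaging are assumptions chosen for illustration.

```python
# Sketch of five-fold cross-validated metric computation (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             precision_score, matthews_corrcoef,
                             cohen_kappa_score)

X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)
clf = LogisticRegression(max_iter=1000)
# out-of-fold predictions from five-fold cross-validation
y_pred = cross_val_predict(clf, X, y, cv=5)

metrics = {
    "Accuracy":  accuracy_score(y, y_pred),
    "F1-Score":  f1_score(y, y_pred, average="weighted"),
    "Recall":    recall_score(y, y_pred, average="weighted"),
    "Precision": precision_score(y, y_pred, average="weighted"),
    "MCC":       matthews_corrcoef(y, y_pred),
    "Kappa":     cohen_kappa_score(y, y_pred),
}
for name, value in metrics.items():
    print("%s: %.3f" % (name, value))
```

Note that with weighted averaging, multi-class recall equals accuracy, which matches the identical Accuracy and Recall columns in Table B.2.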

Table B.3. Benchmarking of various feature extractors while classifying cancer subtypes using the BRCA dataset.

Feature Extractor      CPU Usage (%)  Memory (MB)  Running Time
AAE (2 layer)          24.1           2164         10 min 54 sec
AE (2 layer)           21.8           2335         20 min 38 sec
PCA (2 layer)          18.7           1402         2 min 28 sec
VAE (2 layer)          20.2           2254         19 min 30 sec
DenoisingAE (2 layer)  22.2           2546         20 min 43 sec

Table B.4. Benchmarking of various feature extractors while extracting biologically relevant genes using the BRCA dataset.

Feature Extractor           CPU Usage (%)  Memory (MB)  Running Time
AAE (Single layer)          23.5           1946         1 min 13 sec
AAE (2 layer)               21.6           2087         1 min 57 sec
AAE (3 layer)               29.5           2121         2 min 1 sec
VAE (Single layer)          24.1           1837         1 min 4 sec
AE (Single layer)           24.7           1840         1 min 12 sec
DenoisingAE (Single layer)  25.7           2036         1 min 14 sec
PCA                         26.2           1012         25 sec
NMF                         39.9           916          47 sec
FastICA                     30.3           1342         27 sec
FactorAnalysis              27.5           1138         26 sec
LDA                         61.2           884          16 min 20 sec
tSVD                        26.7           913          24 sec
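The running-time and memory figures in Tables B.3 and B.4 can be gathered with the standard library. The sketch below is illustrative only, not the exact benchmarking harness used in the thesis; the SVD-based PCA and the toy data dimensions are assumptions.

```python
# Sketch of benchmarking one feature extractor (stdlib timing + peak memory).
import time
import resource

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))  # toy stand-in for the expression matrix

start = time.time()
# PCA via SVD on mean-centred data, keeping 25 components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ Vt[:25].T
elapsed = time.time() - start

# ru_maxrss is in kilobytes on Linux (bytes on macOS)
peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print("Running time: %.3f s, peak memory: %.1f MB" % (elapsed, peak_mb))
```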

Table B.5. Results of pathway analysis using various feature extraction methods on the BRCA dataset.

Algorithm           Top Pathway                              P-value
AAE (Single layer)  Olfactory transduction                   5.23e-2
AAE (2 layer)       No significant pathway
AAE (3 layer)       Neuroactive ligand-receptor interaction  1.97e-2
AE                  No significant pathway
DAE                 No significant pathway
PCA                 No significant pathway
VAE                 No significant pathway
NMF                 No significant pathway
ICA                 No significant pathway

Table B.6. Results of gene enrichment analysis using various feature extraction methods, further validated on the UCEC dataset.

Algorithm           Molecular Function                               P-value
AAE (Single layer)  Metal ion binding                                5.44e-2
AAE (Single layer)  Transition metal ion binding                     6.38e-2
AAE (2 layer)       Small conjugating protein ligase activity        5.59e-2
AAE (3 layer)       Drug binding                                     1.02e-2
AE                  Hydrogen ion transmembrane transporter activity  1.82e-2
PCA                 Calcium ion binding                              3.42e-3
VAE                 Olfactory receptor activity                      4.25e-6
NMF                 snRNA binding                                    2.43e-2

Table B.7. The novel variants with their associated cancers and primary expression.

Gene Name  |  Expression  |  Associated Cancer (Mutation Frequency)

MAMLD1 Broad Expression, most in Testis (RPKM 8.2) Stomach Adenocarcinoma (3.73%)

SLC22A6 Restricted to Kidney (RPKM 150.9) Cutaneous Melanoma (3.79%)

DMWD Ubiquitous Expression in Testis (RPKM 11.5) Oesophageal Carcinoma (2.05%)

CDK11A Ubiquitous Expression in Bone Marrow (RPKM 22.4) Cutaneous Melanoma (1.08%)

SLC2A5 Biased Expression in Duodenum (RPKM 33.7) Cutaneous Melanoma (2.17%)

QSOX2 Ubiquitous Expression in Lymph Nodes (RPKM 8.6) Non-Small Cell Lung Carcinoma (3.23%)

EGR2 Biased Expression in Thyroid (RPKM 66.1) Lung Squamous Cell Carcinoma (1.72%)

SF1 Ubiquitous Expression in Bone Marrow (RPKM 62.1) Cutaneous Melanoma (2.17%)

ZFHX3 Ubiquitous Expression in Thyroid (RPKM 3.0) Uterine Corpus Endometrioid Carcinoma (12.16%)

MYCBPAP Restricted Expression toward Testis (RPKM 33.9) Cutaneous Melanoma (3.79%)

DIRAS1 Biased Expression in Brain (RPKM 26.6) Lung Squamous Cell Carcinoma (1.15%)

DOK2 Broad Expression, most in Spleen (RPKM 18.2) Oesophageal Carcinoma (1.37%)

PLK3 Ubiquitous Expression in Bone Marrow (RPKM 16.1) Bladder Carcinoma (2.04%)

GPR182 Biased Expression in Spleen (RPKM 5.1) Small Cell Lung Carcinoma (1.45%)

POU4F2 Low Expression Observed in Reference Dataset Stomach Adenocarcinoma (2.48%)

ATP13A2 Ubiquitous Expression in Brain (RPKM 19.6) Bladder Carcinoma (4.08%)

TEX2 Broad Expression in Testis (RPKM 15.4) Lung Squamous Cell Carcinoma (2.87%)

PKHD1 Biased Expression in Kidney (RPKM 4.0) Small Cell Lung Carcinoma (23.19%)

GTF3C5 Ubiquitous Expression in Ovary (RPKM 17.0) Stomach Adenocarcinoma (1.86%)

SURF6 Ubiquitous Expression in Colon (RPKM 12.0) Oesophageal Carcinoma (2.05%)

HECTD4 Ubiquitous Expression in Brain (RPKM 8.7) Bladder Carcinoma (9.18%)

CFAP69 Broad Expression in Prostate (RPKM 3.6) No Data Available

Table B.8. Results of Gene Ontology (GO) enrichment analysis of the variant genes.

Category            Term                              Gene Count  P-Value
Cellular Component  Nucleoplasm Part                  3           9.3E-2
Molecular Function  Transcription Regulator Activity  5           7.1E-2
Biological Process  Cell Cycle                        5           2.1E-2
Biological Process  Axonogenesis                      3           3.0E-2

Chapter C Figures

Fig. C.1. The olfactory transduction pathway obtained from the AAE model. The GC-D-expressing ORNs and odorant (highlighted in red) are oncogenic activators responsible for triggering the olfactory transduction pathway.

Fig. C.2. Interactions between proteins are shown. The PLK3 variant gene is connected to the SF1 variant gene via CDC5L.

Chapter D Codes

D.1 AAE Architecture

# Imports added for completeness; the AAE is assembled with the
# keras-adversarial package (AdversarialModel, fix_names).
import os

import numpy as np
import pandas as pd
from keras import backend as K
from keras.layers import Dense, Input
from keras.models import Model, Sequential
from keras.optimizers import Adadelta
from keras_adversarial import AdversarialModel, fix_names


def model_generator(latent_dim, input_shape, hidden_dim=1000):
    return Sequential([
        Dense(hidden_dim, name="generator_h1",
              input_dim=latent_dim, activation='tanh'),
        Dense(input_shape[0], name="generator_output",
              activation='sigmoid')],
        name="generator")


def model_encoder(latent_dim, input_shape, hidden_dim=1000):
    x = Input(input_shape, name="x")
    h = Dense(hidden_dim, name="encoder_h1", activation='tanh')(x)
    z = Dense(latent_dim, name="encoder_mu", activation='tanh')(h)
    return Model(x, z, name="encoder")


def model_discriminator(input_shape, output_dim=1, hidden_dim=1000):
    z = Input(input_shape)
    h = Dense(hidden_dim, name="discriminator_h1", activation='tanh')(z)
    y = Dense(output_dim, name="discriminator_y", activation="sigmoid")(h)
    return Model(z, y)


def aae_model(path, adversarial_optimizer, xtrain, ytrain, xtest, ytest,
              encoded_dim=100, img_dim=25, nb_epoch=2):
    latent_dim = encoded_dim
    input_shape = (img_dim,)
    generator = model_generator(latent_dim, input_shape)
    encoder = model_encoder(latent_dim, input_shape)
    autoencoder = Model(encoder.inputs,
                        generator(encoder(encoder.inputs)),
                        name="autoencoder")
    discriminator = model_discriminator(input_shape)
    # assemble AAE
    x = encoder.inputs[0]
    z = encoder(x)
    xpred = generator(z)
    yreal = discriminator(x)
    yfake = discriminator(xpred)
    aae = Model(x, fix_names([xpred, yfake, yreal],
                             ["xpred", "yfake", "yreal"]))
    # print summary of models
    encoder.summary()
    generator.summary()
    discriminator.summary()
    # build adversarial model: the encoder/generator pair and the
    # discriminator are trained as two competing players
    generative_params = (generator.trainable_weights +
                         encoder.trainable_weights)
    model = AdversarialModel(base_model=aae,
                             player_params=[generative_params,
                                            discriminator.trainable_weights],
                             player_names=["generator", "discriminator"])
    model.adversarial_compile(adversarial_optimizer=adversarial_optimizer,
                              player_optimizers=[Adadelta(), Adadelta()],
                              loss={"yfake": "binary_crossentropy",
                                    "yreal": "binary_crossentropy",
                                    "xpred": "binary_crossentropy"},
                              player_compile_kwargs=[{"loss_weights":
                                  {"yfake": 1e-4, "yreal": 1e-4,
                                   "xpred": 1e1}}] * 2)
    # adversarial targets: the first triple belongs to the generator
    # player, the second to the discriminator player
    n = xtrain.shape[0]
    y = [xtrain, np.ones((n, 1)), np.zeros((n, 1)),
         xtrain, np.zeros((n, 1)), np.ones((n, 1))]
    ntest = xtest.shape[0]
    ytest = [xtest, np.ones((ntest, 1)), np.zeros((ntest, 1)),
             xtest, np.zeros((ntest, 1)), np.ones((ntest, 1))]
    history = model.fit(x=xtrain, y=y,
                        validation_data=(xtest, ytest),
                        epochs=nb_epoch, batch_size=128,
                        shuffle=False)
    df = pd.DataFrame(history.history)
    df.to_csv(os.path.join(path, "aae_new_activation_history.csv"))
    encoder.save(os.path.join(path, "aae_encoder.h5"))
    generator.save(os.path.join(path, "aae_decoder.h5"))
    K.clear_session()
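The six-element target list `y` built inside `aae_model` pairs each model output (`xpred`, `yfake`, `yreal`) with a target per player: the first triple for the generator player, the second for the discriminator player. A numpy-only sketch of this construction, with a toy matrix standing in for the expression data:

```python
import numpy as np

# toy stand-in for the training expression matrix (8 samples, 25 features)
xtrain = np.random.rand(8, 25)
n = xtrain.shape[0]

# targets for [xpred, yfake, yreal]: generator player first,
# discriminator player second, mirroring aae_model above
y = [xtrain, np.ones((n, 1)), np.zeros((n, 1)),
     xtrain, np.zeros((n, 1)), np.ones((n, 1))]

print(len(y), y[1].shape)
```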

D.2 Fine Tuning

# Imports added for completeness
import os

import pandas as pd
from keras.layers import Dense, Dropout
from keras.models import Model, Sequential, load_model
from sklearn.preprocessing import label_binarize

# x_train, y_train, x_test and y_test come from the preprocessing pipeline
y_train_binarize = label_binarize(y_train, classes=[0, 1, 2, 3])
y_test_binarize = label_binarize(y_test, classes=[0, 1, 2, 3])

# load the trained AAE encoder
model = load_model('./saved_models/AAE/OneHidden/aae_encoder.h5')
model.summary()

# pull the last layer
transfer_layer = model.get_layer('encoder_mu')
aae_prev_model = Model(inputs=model.input, outputs=transfer_layer.output)
new_model = Sequential()
new_model.add(aae_prev_model)
# add dropout
new_model.add(Dropout(0.001))
# add an output layer with four neurons, since we have four classes
new_model.add(Dense(units=4, activation='softmax', name='new_layer_added'))


def print_layer_trainable():
    for layer in aae_prev_model.layers:
        print("{0}:\t{1}".format(layer.trainable, layer.name))


for layer in aae_prev_model.layers:
    layer.trainable = False

print_layer_trainable()
aae_prev_model.trainable = True

# now modify the layers: fine-tune the last encoder layer only
for layer in aae_prev_model.layers:
    trainable = ('encoder_mu' in layer.name)
    # trainable = ('encoder_h2' in layer.name or 'encoder_mu' in layer.name)
    layer.trainable = trainable
print_layer_trainable()

new_model.compile(optimizer='adadelta',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

history = new_model.fit(x_train, y_train_binarize,
                        batch_size=20, epochs=50,
                        validation_data=(x_test, y_test_binarize))

score = new_model.evaluate(x_test, y_test_binarize,
                           verbose=1, batch_size=20)
print("Test Accuracy: \n%s: %.2f%%"
      % (new_model.metrics_names[1], score[1] * 100))

path = './saved_models/AAE/OneHidden/'
df = pd.DataFrame(history.history)
df.to_csv(os.path.join(path, 'fine_tuned_history.csv'))
new_model.summary()
# delete the last two layers (dropout and softmax output),
# keeping only the fine-tuned encoder
new_model.layers.pop()
new_model.summary()
new_model.layers.pop()
new_model.summary()
new_model.save(os.path.join(path, 'encoder_fine_tuned.h5'))

model = load_model('./saved_models/AAE/OneHidden/encoder_fine_tuned.h5')
model.summary()
x_train = model.predict(x_train)
print('X Train Shape after AAE:', x_train.shape)
x_test = model.predict(x_test)
print('X Test Shape after AAE:', x_test.shape)
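The `label_binarize` call at the top of this listing one-hot encodes the four subtype labels. A minimal example of its behaviour:

```python
from sklearn.preprocessing import label_binarize

y = [0, 2, 1, 3]
yb = label_binarize(y, classes=[0, 1, 2, 3])
print(yb)
# each row is a one-hot vector: class 2 maps to [0, 0, 1, 0], etc.
```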

D.3 Extract Weights from Latent Space

# Imports added for completeness; X is the gene expression matrix
import pandas as pd
from keras.models import load_model

weight = load_model('./saved_models/MOLECULAR/Weight_Analysis/AAE/aae_encoder.h5')
encoded_aae = weight.predict(X)
weights = []

for layer in weight.layers:
    weights.append(layer.get_weights())
d = pd.DataFrame(weights)
print('Weight layer shape:', d.shape)

# for one hidden layer
intermediate_weight_df = pd.DataFrame(weights[1][0])
intermediate_weight_df2 = pd.DataFrame(weights[2][0])
# we need to apply the dot product of the two weight matrices
abstracted_weight_df = intermediate_weight_df.dot(intermediate_weight_df2)
weight_matrix = intermediate_weight_df
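The dot product above composes the genes-by-hidden and hidden-by-latent weight matrices into a single genes-by-latent matrix. A pandas sketch with toy dimensions (the real matrices come from the trained encoder):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
w1 = pd.DataFrame(rng.normal(size=(6, 4)))  # genes x hidden (toy)
w2 = pd.DataFrame(rng.normal(size=(4, 2)))  # hidden x latent (toy)
abstracted = w1.dot(w2)                     # genes x latent
print(abstracted.shape)  # (6, 2)
```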

D.4 Analyze and Sort Weight Matrix

# Imports added for completeness
import os

import pandas as pd
import seaborn as sns


def weight_analysis(weight_matrix, weight_file,
                    encoded_matrix, encoded_file,
                    gene_file, index, feature_distribution):

    en_file = os.path.join(encoded_file)
    encoded = pd.DataFrame(encoded_matrix)
    encoded.to_csv(en_file, sep='\t')

    wt_file = os.path.join(weight_file)
    wt_matrix = weight_matrix.set_index(index)  # it is already a dataframe
    wt_matrix.to_csv(wt_file, sep='\t')
    # rank genes by the sum of their latent-space weights
    gene_sorted = wt_matrix.sum(axis=1).sort_values(ascending=False)

    gn_file = os.path.join(gene_file)
    gene_matrix = pd.DataFrame(gene_sorted)
    gene_matrix.to_csv(gn_file, sep='\t')

    g = sns.FacetGrid(gene_matrix, sharex=False, sharey=False)
    g.map(sns.distplot, 0)
    g.savefig(feature_distribution, dpi=300)

    print('After Applying Dimension Reduction', encoded.shape)
    print('Weight Matrix Shape', wt_matrix.shape)
    # print('Sum of Encoded Node and Sorted', sum_node_activity)
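The ranking step in `weight_analysis` sums each gene's weights across latent nodes and sorts the genes by that sum. A toy example (the gene names and values are illustrative only):

```python
import pandas as pd

# toy weight matrix: rows are genes, columns are latent nodes
wt = pd.DataFrame({"node1": [0.2, 0.9, -0.1],
                   "node2": [0.3, 0.4, 0.05]},
                  index=["BRCA1", "TP53", "EGFR"])
ranked = wt.sum(axis=1).sort_values(ascending=False)
print(ranked.index.tolist())  # ['TP53', 'BRCA1', 'EGFR']
```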