Abdelazim et al (2020): Classification analysis for genomics Dec, 2020 Vol. 23 Issue 24

A Survey on Classification Analysis for Cancer Genomics: Limitations and Novel Opportunity in the Era of Cancer Classification and Target Therapies

Marwa Abouelkhir Abdelazim1*, Mona Mohamed Nasr2, and Waleed Mahmoud Ead1

1Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt

2Faculty of Computers and Artificial Intelligence, Helwan University, Egypt

*Corresponding author: [email protected] (Abdelazim)

Abstract: Advanced machine learning approaches are well suited to recognizing highly complex patterns in massive datasets. We provide a technical survey of machine learning (ML) and deep learning (DL) approaches for genome analysis, with a focus on their rapidly growing applications to cancer, such as cancer diagnosis and subtype classification from omics input data. The survey discusses effective approaches in regulatory genomics, pathogenicity prediction, and variant calling. It also presents the potential benefits of ML across the technological platforms involved in cancer diagnosis, prognosis, and treatment. We concentrate on the most up-to-date knowledge of cancer classification models and targeted therapy, describe how genetic mutations influence responsiveness to targeted therapy, and highlight the related issues in this era of precision medicine. Finally, we discuss the limitations of the different approaches and promising directions for future research in targeted therapy.

Keywords: Deep Learning, Genome Analysis, Precision Medicine, Cancer Classification Models, Classification Models, Omics

How to cite this article: Abdelazim MA, Nasr MM, Ead WM (2020): A survey on classification analysis for cancer genomics: Limitations and novel opportunity in the era of cancer classification and target therapies, Ann Trop Med & Public Health; 23(S24): SP232434. DOI: http://doi.org/10.36295/ASRO.2020.232434

1. Introduction

DNA is transcribed into mRNA, which is then translated to synthesize proteins. Proteins are the primary actors in most cellular processes. The process by which a fragment of DNA is read and converted into a protein is of great interest in many therapeutic and biological analyses. This process can involve several stages, but its basis is the organism's DNA, which is converted according to a specific set of rules. The complete DNA in an organism's cell is referred to as the genome. DNA consists of four types of bases: cytosine (C), guanine (G), thymine (T), and adenine (A), and the human genome is made up of more than 3 billion of these genomic letters. The genome can therefore be regarded as a sequence over the alphabet {A, C, G, T}. The sequence of a genome is obtained through DNA sequencing. In recent years, novel sequencing technologies, named next-generation and third-generation sequencing (NGS, TGS), have emerged. The principal cause of cancer is mutation in the genome, either inherited or acquired during an individual's life.
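Because the genome is a sequence over {A, C, G, T}, a common way to present it to a learning algorithm is a one-hot encoding. The following is a minimal sketch, not taken from any surveyed paper; the function name `one_hot_encode` and the choice to map unknown bases such as N to an all-zero row are assumptions made for illustration.

```python
# Illustrative sketch: one-hot encode a DNA string over the alphabet {A, C, G, T}.
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element row per base; unknown bases (e.g. 'N') become all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("GATTN", one_hot_encode("GATTN")):
        print(base, row)
```

Each row has a single 1 in the column of its base, so a sequence of length n becomes an n x 4 matrix that sequence models can consume directly.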

Annals of Tropical Medicine & Public Health http://doi.org/10.36295/ASRO.2020.232434

Cancer has been described as one of the most lethal genetic diseases of the human genome. It remains a focus of research for clinicians, pathologists, biologists, and other life science and health specialists. The World Health Organization (WHO) reported 14 million new cancer cases in 2012. Cancer is a major cause of morbidity and mortality and is the second leading cause of death worldwide, causing 8.8 million deaths in 2015. The World Cancer Report described cancer as a global problem and projected that incidence will grow to 20 million new cases by 2025. A major aim of cancer research is to identify the particular genes that cause normal cells to mutate into cancer cells (Joseph, M., Devaraj, M., & Vea, L. A., 2018). Numerous procedures have been developed to analyze the information gained from sequencing in order to detect variations (also called variants or mutations) relative to a known normal reference sequence. This allows an individual's (or a population's) genome sequence(s) to be represented in terms of variants. Variants can be of diverse types, from changes at single positions to differences in the arrangement of thousands of bases. Determining variants (variant calling) has applications in many areas of bioinformatics, including the study of human disease (Ramachandran, A., Li, H., & Chen, D., 2018).
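To make the idea of representing a genome in terms of variants concrete, the sketch below lists single-position substitutions between a sample and a reference. It is a toy illustration under strong assumptions: the two sequences are already aligned and of equal length, and the function name `call_substitutions` is hypothetical. Real variant callers work from read alignments and quality scores and also handle insertions, deletions, and structural variants.

```python
# Toy sketch: report single-base substitutions of a sample relative to a reference.
def call_substitutions(reference, sample):
    """List (position, ref_base, alt_base) for each mismatch; positions are 1-based."""
    if len(reference) != len(sample):
        raise ValueError("this toy example assumes pre-aligned, equal-length sequences")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if ref != alt
    ]


if __name__ == "__main__":
    for variant in call_substitutions("ACGTAC", "ACCTAA"):
        print(variant)
```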

1.1 Overview of microarray technology and gene expression

Compared with older methods of genomic study, which were limited to examining and collecting information on individual genes, microarray technologies have made it possible to measure the expression levels of tens of thousands of genes simultaneously. In molecular biology, the analysis of gene expression information can guide the early detection of cancer. Gene expression data capture gene activity and can reflect the biological state of the cell, for instance whether a cell is in a normal or a diseased state. Although there is a great volume of gene expression data, more than 95% of it is uninformative (Zhu, Q. X., Fan, Y., He, Y. L., & Xu, Y., 2018). High dimensionality, small sample size, and skewed distribution are all characteristics of gene expression data, and these characteristics also challenge the classification of cancer subtypes. Consequently, it remains vital to discover the specific genes correlated with cancer, and, given the enormous number of genes, feature selection is essential. At present (Zhu, Q. X., Fan, Y., He, Y. L., & Xu, Y., 2018), the key issue with gene microarray data is classification: thousands of genes are measured, and the growing velocity and heterogeneous structure of the data make biomedical data far harder to handle with traditional data-mining approaches, so the vast number of genes makes the classification task extremely challenging. In the current period of "precision medicine," cancer treatment can be tailored to a single patient based on the genomic profile of the tumor.
Despite the ever-growing wealth of cancer genomic data, connecting mutation profiles to treatment effectiveness remains difficult (Chang, Y., Park, H., Yang, H. J., & Shin, J. M., 2018). The goals of genomic medicine are to determine how variations in an individual's DNA affect the risk of different diseases and to find the underlying causes so that targeted treatments can be developed. Here, we emphasize how advanced machine learning can be used to address the main problems in genomic medicine. Genomics studies the function and information structure encoded in the DNA of living cells. Precision medicine, in turn, is the practice of tailoring treatment based on all relevant data about the patient, including the patient's genome (Leung, M. K., Delong, & Frey, B. J., 2015). Medical science and health informatics have made major advances recently, leading to a demand for in-depth analytics over the large volumes of data being generated and collected (Lan, K., Wang, D. T., & Dey, 2018). One point that cannot be overlooked is that artificial intelligence methods play an increasingly important part in bioinformatics research, and the connection between data analytics and bioinformatics is recognized in both industry and academia: it is put into practice by constructing models, making predictions, performing clustering and classification, discovering association rules, and identifying features of interest. Meanwhile, deep learning is a more recent theory and framework, with a much better capability for feature representation at an abstract level than general machine learning. A "one-size-fits-all" approach to medication may lead to side effects, since all individuals are different.

Predicting disease status for complex human diseases such as cancer from genomic data is essential, and studying the relationship between gene expression profiles and disease states plays a vital role in biological and clinical applications. Deep learning is also suitable for predicting drug-target interactions with little supervision. Recently, deep learning has produced promising results in drug development, drug repurposing (i.e., identification of potential new uses of approved or new drugs), and drug-target profiling, with better prediction accuracy than traditional machine learning approaches. With the rapid expansion of biotechnology in recent years, biomedical data have grown exponentially. These data are produced across many research and application areas and range from the molecular scale (gene functions, protein interactions, etc.) and the tissue scale (brain connectivity maps, magnetic resonance images, etc.) to the scale of the individual patient (intensive care unit, electronic medical record (EMR), etc.) and the scale of the whole population (medical message boards, social media, etc.) (Lan, K., Wang, D. T., Fong, S., Liu, L. S., Wong, K. K., & Dey, 2018).

The remainder of this survey is structured as follows: section two covers cancer taxonomy, section three bioinformatics input data, section four feature selection methods for informative genes, section five machine learning (ML) and deep learning (DL) in genomics, section six common omics dataset sources, section seven the effect of genetic mutation on susceptibility to targeted therapy and directions for future research, and section eight the discussion, followed by the conclusion. The contributions of this work are:

• We survey rapidly growing ML approaches related to cancer and present the potential benefits of ML across the technological platforms involved in cancer diagnosis, prognosis, and treatment.

• We identify and analyze the most relevant works on cancer classification models and feature selection methods using gene expression and genetic information.

• We explore platforms that support multimodal analysis of large amounts of tissue, joint analysis of gene expression and genomic data, and advanced genome sequencing.

2. Cancer taxonomy

Cancers can be classified either by histological type or by primary site (the location where the cancer originated). From a histological perspective, there are tens of different malignancies, which are grouped into six main categories (National Cancer Institute):

• Carcinoma: most often affects tissues or glands capable of secretion, for example the breasts, which produce milk, or the lungs, which secrete mucus, as well as the colon, bladder, or prostate.

• Sarcoma: refers to cancer that originates in supportive and connective tissues such as tendons and muscle.

• Myeloma: cancer that originates in the plasma cells of the bone marrow. Plasma cells produce some of the proteins found in the blood.

• Leukemia ("blood cancers" or "liquid cancers"): cancers of the bone marrow (the site of blood cell production). Examples of leukemia include:
  • Myelogenous or granulocytic leukemia (malignancy of the myeloid and granulocytic white blood cell series)
  • Lymphatic, lymphocytic, or lymphoblastic leukemia (malignancy of the lymphoid and lymphocytic blood cell series)
  • Erythremia (malignancy of various blood cell products, but with red cells predominating)

• Lymphoma: develops in the glands or nodes of the lymphatic system. Lymphomas may also occur in specific organs, for example the breast, brain, or stomach.

• Mixed types: the type components may be within one category or from different categories. Some examples are:
  • Teratocarcinoma
  • Adenosquamous carcinoma

3. Bioinformatics input data

We classified the subjects of interest in the area of bioinformatics into three clusters (“Omics,” “biomedical imaging,” “biomedical signal processing”).

Genetic data, such as genome, transcriptome, and proteome information, are applied to bioinformatics problems through omics research. The most common raw data in omics are raw biological sequences (i.e., DNA, RNA, and amino acid sequences). In addition, features extracted from sequences are often used as inputs to deep learning algorithms to ease the difficulties of complex biological data and improve results. Among the most studied problems are protein structure prediction, which aims to predict the secondary structure or contact map of a protein; gene expression regulation, including splice junctions and RNA-binding proteins; and protein classification, including superfamily or subcellular localization, which are also actively investigated. Additionally, anomaly classification methods have been used with omics data to detect cancer (Min, S., Lee, B., & Yoon, S., 2017).

4. Feature selection methods for informative genes

Gene selection or feature extraction, in a supervised or unsupervised manner, is a recent research challenge for DNA microarray expression data, which cover more than 25,000 genes at the same time. Apart from the curse of dimensionality, gene selection must deal with several other problems, such as redundant data, mislabeled data, and noisy or irrelevant data. Many gene selection procedures and algorithms are applied to reduce dimensionality by removing irrelevant, redundant, and noisy genes (Hira, Z. M., & Gillies, D. F, 2015). Effective feature selection has several advantages when thousands of features are involved. First, dimensionality reduction lowers the computational cost. Second, noise reduction improves classification accuracy. Finally, it yields more interpretable features that can help identify and monitor the target diseases. There are three common groups of feature selection methods: filter methods, wrapper methods, and embedded methods (Zhang, D., Zhou, X., & He, F, 2018), as shown in Fig. 1.

Fig. 1. Feature Selection Methods
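As an illustration of the filter group, the sketch below ranks genes by the mutual information between a binarized expression profile and the class label, independently of any classifier. It is a minimal example rather than the procedure of any surveyed paper: the median-based binarization, the function names, and the gene names in the usage note are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    p_x = Counter(feature)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi


def discretize(values):
    """Binarize expression values at their median (1 = strictly above the median)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return [1 if v > median else 0 for v in values]


def rank_genes(expression, labels, top_k):
    """Return the top_k gene names by mutual information with the labels.

    `expression` maps a gene name to a list of values, one per sample.
    """
    scores = {
        gene: mutual_information(discretize(values), labels)
        for gene, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, with the hypothetical input `{"geneA": [1.0, 1.2, 5.0, 5.5], "geneB": [3.0, 1.0, 3.1, 0.9]}` and labels `[0, 0, 1, 1]`, `rank_genes(..., top_k=1)` keeps `geneA`, whose binarized profile separates the two classes. A wrapper method would instead score candidate gene subsets by training a classifier on each, and an embedded method would select genes as part of model training.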

5. Machine learning (ML) and deep learning in genomics

Machine learning techniques provide automatic and scalable methods for gathering insights from huge, multi-dimensional data. Computer-based decision-support systems built on machine learning (ML) can modernize medicine by performing difficult tasks that are presently assigned to specialists, in order to improve diagnostic accuracy, streamline clinical workflow, increase throughput, improve treatment selection, and reduce human resource costs. There are four common groups of machine learning algorithms: supervised, unsupervised, semi-supervised, and reinforcement learning algorithms, as shown in Fig. 2. Table 2 lists the surveyed papers that used machine learning algorithms.

Table 2. Machine Learning Algorithms

Cancer type | Datasets | Feature selection methods | ML algorithms | Reference
Breast | Microarray gene expression data | Scoring method | K-nearest neighbor (KNN) and L-KMOD function; fuzzy SVM | (Zhou, R. & Hu, K., 2017)
Colon | Microarray gene expression data | Mutual information, genetic algorithm | Decision tree (C4.5) classifier | (Pavithra & Lakshmanan, 2017)
Ovarian | TCGA RNAseq gene expression data set | PGS statistic | Univariate Cox proportional-hazard model | (Ahn & Park, 2018)
Prostate | Gene expression data, protein-protein interaction network | - | Cluster score (CS) and predicting score (PS) functions | (Lee & Yang, 2018)
Leukemia: acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML) | Microarray gene expression | (MMI) | Extreme Learning Machine (ELM) | (Zhu & Xu, 2018)
Mixed cancer types | DNA methylation, gene expression | Joint sparse canonical correlation analysis (CCA) | Joint partial correlation detection | (Fang & Wang, 2017)
Mixed cancer types | DNA microarray gene expression data | Mutual information maximization and Relief, genetic algorithm | SVM; the (RoF) algorithm; (CSD-ELM) | (Yan & Lu, 2018)
Mixed cancer types | Capture sequencing, exome sequencing, genome sequencing | - | Logistic regression; random forest; deep learning | (Ainscough & Griffith, 2018)
Mixed cancer types | Gene expression, RNA sequencing, protein expression | Bayesian optimization | ANN with a Cox proportional hazards output layer | (Yousefi & Cooper, 2017)
Lung | Expression level, RNA-seq | z-scores | SVM | (Li & Zhang, 2018)
Lung | RNA-seq | Mutational spectrum information | SVM | (Xiang & Xing, 2019)
Colon | microRNA | - | SVM | (Lu & Lin, 2018)
Breast | Gene expression profile data | mRMR | DNN combined with an SVM classifier | (Sun & Li, 2017)
Colon | rRNA sequence | - | SVM; random forest; NLP methods such as k-mer frequency | (Kishk & El-Hadidi, 2018)
Mixed cancer types | Gene expression data | Differential variance | SVM | (Roberts & Kennedy, 2018)
Mixed cancer types | Microarray gene expression dataset | Recursive feature elimination | Multiple SVM Recursive Feature Elimination (MSVMRFE) | (Hasri & Kasim, 2017)
Mixed cancer types | Gene expression data | Fast Correlation-Based Filter | SVM; SVM-Recursive Feature Elimination | (Kavitha & Gopi, 2017)
Mixed cancer types | Copy number and gene expression data | - | Mining Synthetic Lethals algorithm | (Sinha & Bernards, 2017)
Pan-cancer | Copy number variation (CNV) data and gene expression | - | Rao's score test combined with supervised ML | (Han & Lu, 2019)

5.1 Supervised learning

Supervised learning comprises all machine learning algorithms that map input data to known class labels or target values. Many supervised algorithms, such as random forests, linear and logistic regression, and support vector machines, are commonly used for classification and regression. For extremely large datasets, however, deep neural network-based classification and regression algorithms are often needed to handle non-linearity in the data. Some advantages and disadvantages of classification algorithms are presented in Table 3.
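The mapping from inputs to known class labels can be illustrated with one of the simplest supervised classifiers, k-nearest neighbor. The sketch below is a minimal standard-library example, not an implementation from any surveyed paper; the function name `knn_predict`, the Euclidean distance, and the default k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training samples."""
    # Pair each training sample's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(sample, query), label) for sample, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-gene expression profiles for six labeled samples.
    train_x = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
    train_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
    print(knn_predict(train_x, train_y, (1.0, 1.0)))
    print(knn_predict(train_x, train_y, (5.0, 5.0)))
```

Every prediction scans and stores the full training set, which is the space cost noted for k-nearest neighbor in Table 3.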

Fig 2. Machine learning algorithms taxonomy

Table 3. Advantages and Disadvantages of Various Supervised Machine-Learning Algorithms

Technique | Advantages | Disadvantages
K-nearest neighbor | Simple and fast to implement | Sensitive to noise; high space complexity required
Naïve Bayes | Less time consumption and higher accuracy for huge datasets | Cannot give accurate results if there is dependency among variables
Decision tree | No domain knowledge required to construct the decision tree; can easily process high-dimensional data; can handle both numerical and categorical data | Restricted to one output attribute; classifier performance depends on the type of dataset
Support vector machine | Better accuracy compared with other classifiers; easily handles complex nonlinear data points | A different kernel function should be selected for each dataset; training takes more time than with other methods
Neural network | Can simulate almost any function for complex applications and problems; high accuracy; availability of multiple training algorithms | Black-box nature, hard to interpret the structure; high computational cost; may over-fit after repeated training
Ensemble | Improvement in predictive accuracy | Difficult to understand; using the wrong ensemble method gives bad results

5.2 Deep Learning Pathway In Genomics

Deep learning (DL) is a branch of ML that extends shallow neural networks with multiple hidden layers and more complex training components. The most typical deep learning models include the deep belief network (DBN), stacked auto-encoder (SAE), recurrent neural network (RNN), and convolutional neural network (CNN). Compared with conventional ML methods, DL allows high-level abstraction from large volumes of heterogeneous, high-dimensional raw data (Yu, K. H., & Kohane, I. S., 2018). Over the past seven years, deep neural networks have led to multiple performance breakthroughs in speech recognition, machine translation, and computer vision.

Table 4. Deep Learning Algorithms

Cancer type | Datasets | Feature selection methods | ML and DL methods | Reference
Breast | Gene expression datasets | PCA approach | Autoencoder neural network; AdaBoost algorithm to build an ensemble classifier for the final prediction task | (Zhang & He, 2018)
Lung | Gene expression data, protein-protein interaction network | Convolutional filtering | CNN method integrating spectral clustering data processing to classify lung cancer | (Matsubara & Nacher, 2018)
Mixed cancer types | Gene Expression Omnibus, TCGA, Genotype-Tissue Expression (microarray and RNA-seq) | - | DNN | (Ahn & Park, 2018)
Breast | Gene expression profile data | mRMR | DNN combined with an SVM classifier | (Sun & Li, 2017)
Mixed cancer types | Microarray gene expression dataset | Mutual information | Deep belief network for data classification | (Wisesty & Aditsania, 2017)
Acute myeloid leukemia | De novo AML data retrieved from the TCGA database, demographic information | - | Deep learning neural network with stacked (multi-layered) autoencoder, in which high-level features are compressed, organized, and extracted | (Lin & Nguyen, 2018)

Formative studies in 2015 established the applicability of deep neural networks to DNA sequence data (see Table 4). The workflow is presented in Fig. 3. (A) The raw dataset should be randomly split into training, validation, and test sets. The positive and negative sets should be balanced with respect to potential confounders (e.g., sequence content and location), so that the predictor learns salient features rather than confounders. (B) The appropriate architecture is selected and trained on the basis of domain knowledge. For example, CNNs capture translation invariance, whereas RNNs capture more flexible spatial interactions. (C) True positive (TP), false positive (FP), false negative (FN), and true negative (TN) rates are evaluated. When there are more negative than positive samples, precision and recall are often considered. (D) The learned model is interpreted by computing how changing each nucleotide in the input affects the prediction (Zou, J., Huss, M., Abid, A., & Telenti, A., 2018).

Fig 3. Deep learning pathway in genomics
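The evaluation step of this pathway can be sketched as follows. The code counts TP, FP, FN, and TN for a binary classifier and derives accuracy, precision, recall, and specificity from them. It is a minimal illustration using the standard definitions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and specificity from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

With many more negatives than positives, accuracy can stay high even when most positives are missed, which is why precision and recall are reported alongside it.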

6. Common omics dataset sources

Online databases provide gene expression data for several cancers, such as brain tumor, leukemia, and various others (Table 5). Microarray experiments create massive amounts of data, which are processed with machine learning approaches. These data are often complex, and the significance of single-nucleotide variations can be hard to predict from sequence alone. Consequently, what is the best and most effective way to analyze all these data and discover meaning in them? Deep learning not only provides a powerful technique for analyzing data from one particular type of analysis, but can also combine data from numerous complementary methods to identify genes and pathways that could be significant in understanding the development of disease [Navarro, C., 2019].

Table 5. Bioinformatics dataset sources

Source | Source URL
The European Genome-phenome Archive (EGA) | https://ega-archive.org/
The Cancer Genome Atlas (TCGA) | https://www.cancer.gov/
The National Center for Biotechnology Information (NCBI) | https://www.ncbi.nlm.nih.gov/
International Cancer Genome Consortium (ICGC) | https://dcc.icgc.org/
Genotype-Tissue Expression - GTEx Portal | https://gtexportal.org/
NIH Roadmap Epigenomics Project | http://www.roadmapepigenomics.org
The 1000 Genomes Project | http://www.internationalgenome.org

7. Effectiveness of genetic mutation on the susceptibility of targeted therapy and directions of upcoming research.

Genetic alterations in cancer arise from both hereditary and lifestyle factors, but many cancer-related mutations are due to random DNA replication errors. Notably, mutations in cancer therapy targets can significantly influence drug sensitivity. Therefore, the effectiveness of targeted therapy depends mainly on the mutation profile of the patient's cancer. The core challenges of targeted therapy today, which point to directions for future research, are to (1) identify and validate key mutated genes and proteins in cancers as novel targets; (2) classify the patients most likely and least likely to benefit from particular targeted therapies; and (3) evaluate the mechanisms of mutation-driven therapy resistance [Jin, J., Wu, X., Li, J., ... & Cho, C. H., 2019].

[Chang, Y., Park, H., ... & Shin, J. M., 2018] report Cancer Drug Response profile scan (CDRscan), a novel deep learning approach that predicts anticancer drug responsiveness based on large-scale drug screening assay data encompassing the genomic profiles of 787 human cancer cell lines and the structural profiles of 244 drugs. This accurate and robust drug response prediction model represents a significant step toward the realization of precision cancer medicine through its use in drug development procedures such as drug repurposing and screening a small biochemical reference library for novel anticancer drug candidates. In the clinical setting, CDRscan is expected to improve patient-tailored anticancer drug selection as a clinical decision support system (DSS), following further clinical validation studies.

In [Wen, M., Zhang, Z., Niu., & Lu, H.,2017], identifying interactions between known drugs and targets is described as a major challenge in drug repositioning. An effective DL approach (DBN) was applied to accurately predict novel drug-target interactions (DTIs) between FDA-approved drugs and targets without separating the targets into different classes. The resulting method is called DeepDTIs, and it achieves comparatively high prediction performance. The novel predicted DTIs were also explored further.

[Eraslan, G., Avsec., & Theis, F. J.,2019] review "big data" methods for detecting driver genes in cancer and how large-scale, multi-platform analysis of large cohorts of patients can be used to identify driver genes that are potential cancer drug targets. Through modeling with tools such as ML decision tree-based classifiers or other statistical approaches, these raw datasets can be analyzed together to identify key genes that have the potential to be cancer drug targets.

[Huang, C., Mezencev, R., & Vandenberg, F., 2017] present an open-source software platform that applies a highly versatile SVM approach, which employs standard recursive feature elimination (RFE), to predict cancer drug response. Using ovarian cancer (OC) patient gene expression data, it made predictions highly consistent with the observed responses to various treatments.

8. Discussion

(Zhou, R el at., 2017), introduced "Fuzzy Support Vector Machine" for breast tumor genetic factor classification, suggested a way to mine a great quantity of data distinguishing the genetic factor, choosing genes with maximum categorized data, also in the meantime eliminating the association appearances of them and considered a "geometric means" KNN membership functions and "L-KMOD" the main function "FSVM" to discriminate breast virulence genetic factor. The consequence error average is 3.89%; the uppermost precision percentage is 98.9%. The limitations of this paper are that the challenges of small sample sizes (Zhang, D el at., 2018), the technique suggested here shows “unsupervised learning” and “supervised learning.” To enhance cancer diagnosis, predicate, and progress, an additional, comprehensive consequence classifier. In “the unsupervised learning level,” “PCA” procedure and DNN are pragmatic to acquire abridged patterns through gene actions, however in “the supervised learning stage,” an “ensemble classifier” is prepared to afford to the patterns learned previously. The limitations of this paper are that firstly, the model constructed is not easy to analyze. Also, identifying which features are most important to the prediction task is difficult. Secondly, due to the deep learning model's complex structure, the amount of data is less prone to over-fitting. This is a common problem in neural networks. (Pavithra, D el at., 2017), focuses on a different variable selection technique based on filter, where a theoretical examination of “Mutual Information” was built on Feature Selection. To choose the best essential gene by applying “Mutual Information.” Then Wrapper founded that feature selection was constructed on the “GA algorithm.” In the “GA algorithm,” main genes are designated for tumor micro-array information; formerly, the designated features are inputs to (C4.5) classifier. 
With the MI-based feature selection method, 200 features are selected and the accuracy is 87.09677%. With the genetic algorithm, eight features are selected and the accuracy is 88.70968%. The limitations of this paper are the small sample size and the low accuracy; accuracy could be increased by combining other feature selection techniques with local search methods.

Matsubara et al. (2018) proposed a novel approach that combines spectral clustering with a convolutional neural network (CNN) to classify lung cancer, integrating a protein interaction network with gene expression profiles. The method combines transcriptome and proteome information and can produce efficient and accurate predictions by processing network information within a CNN framework. The results averaged over k-fold cross-validation were accuracy = 0.81, recall = 0.88, precision = 0.78, and specificity = 0.74. The limitation of this paper is the small sample size; prediction could be improved by combining datasets from different sources.

Ahn et al. (2018) analyzed RNA-seq data from ovarian cancer patients. The important genes were selected by the p-value obtained from a univariate Cox proportional hazards (PH) model; genes with a p-value below 0.01 were selected for the analysis. In the next step, the genes were categorized into "poor genes" and "good genes" according to the sign of the selected gene's coefficient. The authors then designed a score (PGS) using the mean and standard deviation of the corresponding gene's expression in the training set. The limitation of this paper is that the approach is not sufficient to predict the survival outcome and can even produce a completely opposite prediction.

Lee et al. (2018) identified PCa28 as candidate genes for prostate cancer biomarkers using omics data. PCa28 can distinguish between normal and tumor tissues and is specific to prostate cancer. Among PCa28, several novel prostate cancer-related genes, such as COL17A1 and HIST3H2A, were identified by applying "predicting score" (PS) and "cluster score" (CS) functions. The limitation of this paper is the challenge of small sample sizes.

To characterize the complex nature of cancer using large public datasets, Ahn et al. (2018) proposed a general DNN-based classifier trained on cancer data from 24 tissues of origin, with the aim of detecting genes that commonly contribute to classifying a tumor in an individual sample. The model classifies cancer and normal data with an accuracy of 0.997. The DNN proved to be a useful gene expression-based classifier, as it generally performed better than other methods in their experiments, including logistic regression, ridge, lasso, elastic net, and SVM. One of the main drawbacks of a DNN is the difficulty of interpreting the specific contribution of each individual input to the outcome.

Zhu et al. (2018) proposed a multidimensional mutual information (MMI) feature selection method to choose the most informative features for classification and to eliminate noisy data. After feature selection with the proposed MMI, an Extreme Learning Machine (ELM) is applied as an efficient classifier.
The proposed MMI-ELM algorithm achieves an accuracy of 98.06% on the leukemia dataset with 8 features. The limitations of this paper are the challenge of small sample sizes and the instability of the proposed algorithm.

Fang et al. (2017) proposed a new method to jointly identify complex associations between DNA methylation and gene expression levels from multiple cancers. The core idea is to apply joint sparse canonical correlation analysis (CCA) to identify a small set of methylated sites associated with another set of genes, either shared across cancers or specific to a particular group of cancers (group-specific). The limitations of this study are as follows. First, the authors assume that the standardized methylation level follows a multivariate Gaussian distribution, which is not always true in real data. Second, parameter selection for joint sparse CCA is still a challenging problem. Finally, more biological evidence is required to support the results.

Yan and Lu (2018) introduced a new hybrid feature selection framework for gene expression data classification. The framework integrates filter-based and wrapper-based approaches through a mid-stage evaluation by an ensemble classifier. They further propose an ensemble gene selection procedure based on an extended genetic algorithm (EGA), in which each of the classifiers CS-D-ELM, SVM, and RoF is applied as an evaluation function to choose the key genes. The limitation of this paper is the challenge of small sample sizes.

Ainscough et al. (2018) addressed the identification of somatic variants in sequencing data. Random forest and deep learning models achieved high classification performance (average AUC > 0.95) across all variant refinement classes (somatic, ambiguous, and fail), whereas logistic regression showed reduced performance (average AUC = 0.89). The limitation of this study is that the training data probably contain a substantial amount of noise that might affect model performance.
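Because the comparison above is reported as an AUC averaged over three classes, a small sketch of that computation may help. It assumes a one-vs-rest scheme with an unweighted (macro) average, which the survey does not specify; the scores and labels are made up for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(class_scores, true_classes):
    """Unweighted mean of per-class AUCs, treating each class as positive in turn."""
    aucs = []
    for cls, scores in class_scores.items():
        labels = [1 if t == cls else 0 for t in true_classes]
        aucs.append(roc_auc(scores, labels))
    return sum(aucs) / len(aucs)


# Hypothetical classifier scores for six variant calls.
true_classes = ["somatic", "somatic", "ambiguous", "ambiguous", "fail", "fail"]
class_scores = {
    "somatic": [0.9, 0.8, 0.3, 0.1, 0.2, 0.1],
    "ambiguous": [0.05, 0.1, 0.5, 0.3, 0.4, 0.1],
    "fail": [0.05, 0.1, 0.2, 0.6, 0.4, 0.8],
}
average_auc = one_vs_rest_auc(class_scores, true_classes)
print(round(average_auc, 3))
```

The pairwise definition used here is quadratic in the number of samples; it is chosen for readability rather than speed.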

Shen et al. (2018) created a novel classification method that combines the fused lasso and the elastic net as regularization for a linear support vector machine (SVM), which they named oriented feature selection SVM (OFSSVM). They demonstrated its effectiveness in binary classification (i.e., determining whether a sample is cancerous, or which cancer subtype it belongs to) and multi-class classification (i.e., distinguishing several cancer types together) in a comprehensive evaluation that considers not only classification accuracy but also interpretability. However, the method also brings a problem: although features are selected, they do not provide the same reliability as forward- and backward-stepwise selection.

Yousefi et al. (2017) show how deep learning and Bayesian optimization methods, which have been remarkably effective in general high-dimensional prediction tasks, can be adapted to the problem of predicting cancer outcomes. Deep survival models can successfully transfer information across diseases to improve prognostic accuracy. The approach draws on predictive information from multi-cancer datasets and identifies optimal hyperparameter settings, saving users substantial time and effort in selecting model parameters. A fundamental challenge in deep learning is determining the network design that provides the best prediction accuracy. This process involves choosing network hyperparameters, including the number of layers, transformation types, and training parameters.
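To make the regularization idea behind OFSSVM (Shen et al., 2018) more concrete, the sketch below evaluates a hinge loss plus a combined elastic-net and fused-lasso penalty. The exact OFSSVM objective is not given in this survey, so the form of the penalty, the weights lam_l1, lam_l2, and lam_fused, and the assumption that features have a meaningful ordering are illustrative only.

```python
def combined_penalty(w, lam_l1, lam_l2, lam_fused):
    """Elastic-net plus fused-lasso penalty for a weight vector with ordered features."""
    l1 = sum(abs(wj) for wj in w)  # lasso term: pushes weights to exactly zero
    l2 = sum(wj * wj for wj in w)  # ridge term: shrinks large weights
    # Fused term: penalizes differences between neighbouring weights.
    fused = sum(abs(w[j] - w[j - 1]) for j in range(1, len(w)))
    return lam_l1 * l1 + lam_l2 * l2 + lam_fused * fused


def hinge_loss(w, rows, labels):
    """Mean hinge loss of a bias-free linear SVM; labels are +1 or -1."""
    total = 0.0
    for x, y in zip(rows, labels):
        score = sum(wj * xj for wj, xj in zip(w, x))
        total += max(0.0, 1.0 - y * score)
    return total / len(rows)


def objective(w, rows, labels, lam_l1=0.1, lam_l2=0.1, lam_fused=0.1):
    """Regularized objective: data fit plus the combined penalty (assumed form)."""
    return hinge_loss(w, rows, labels) + combined_penalty(w, lam_l1, lam_l2, lam_fused)


# Two weight vectors with identical L1 and L2 norms but different smoothness.
w_smooth = [1.0, 1.0, 1.0, 1.0]
w_jagged = [1.0, -1.0, 1.0, -1.0]
print(combined_penalty(w_smooth, 1.0, 1.0, 1.0))
print(combined_penalty(w_jagged, 1.0, 1.0, 1.0))
```

The two example vectors differ only in the fused term, which shows how that term favours solutions in which neighbouring features receive similar weights.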
Li et al. (2018) predicted cancer prognosis in lung adenocarcinoma patients, specifically five-year survival, from gene expression levels. Patients were separated into two groups: the lower-risk group has a survival time longer than five years, and the higher-risk group has a survival time shorter than five years. A simple SVM model was applied to eight candidate genes chosen on the basis of prior molecular knowledge and feature selection. The accuracy of cancer prognosis in this paper is limited by the small patient sample size; increasing the sample size would substantially increase the accuracy of machine learning models. At the same time, the number of training samples needed is effectively unlimited, because each cancer patient is unique in terms of genetic background, health status, treatment plan, and other environmental factors.

Sun et al. (2017) proposed a new method called D-SVM, which integrates a DNN with a conventional SVM classifier; mRMR alone was applied to select a small number of genes and reduce the features. Here, a DNN was used as a feature extractor to efficiently capture gene expression features from the raw dataset. The extracted features were then used to train and test the machine learning method for human cancer prognosis prediction. The limitation of this paper is that breast cancer is a complicated disease, and several different factors affect cancer survival rates.
Using a single omics data type is not enough to achieve a robust and scalable algorithm for human breast cancer prognosis prediction, so integrating multi-omics datasets would enhance the prediction model.

Han et al. (2019) presented DriverML, which identifies cancer driver genes by combining a weighted score test with machine learning. DriverML was applied to 31 cancer-specific mutation datasets from TCGA. Rao's score statistic is mathematically suited to combining several components through a set of weight parameters to yield a weighted overall statistic; the weights in the score statistic quantify the functional impacts of different mutation types on the protein. A major challenge addressed in this paper is that cancer genome sequencing must identify cancer-associated genes with mutations that drive the cancer phenotype.

Wisesty et al. (2017) address the problem that, in the classification process, the number of variables in microarray data is enormous. They applied a Deep Belief Network for data classification and mutual information as a feature selection technique. The resulting system can classify the raw microarray dataset with an accuracy of 90.84% and an F1-score of 89.68%. The limitation of this paper is the small sample size.

Roberts et al. (2018) extended feature selection by differential variance to the more general problem of classifying cancer subtypes. They predict cancer development or diagnosis with an SVM, using features selected by increased gene expression variance in tumor tissue compared with normal tissue. They found that combining the two methods often yields better classification results than either feature selection technique alone. The limitation of this paper is that all datasets used are microarray datasets; RNA-seq data, which are more accurate and less noisy, could be used instead.
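A minimal sketch of variance-based feature selection, in the spirit of Roberts et al. (2018), is given below. Ranking genes by the ratio of tumor variance to normal variance is an assumption made for illustration; the original work may use a formal statistical test instead, and the gene names and values are hypothetical.

```python
from statistics import variance


def variance_ratio_ranking(tumor, normal, top_k):
    """Rank genes by tumor-to-normal variance ratio and return the top_k gene names."""
    ratios = {}
    for gene in tumor:
        var_normal = variance(normal[gene])  # sample variance in normal tissue
        if var_normal == 0:
            continue  # ratio undefined; skip genes that are constant in normal tissue
        ratios[gene] = variance(tumor[gene]) / var_normal
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return ranked[:top_k], ratios


# Hypothetical expression values (three samples per tissue type).
tumor = {
    "gene_a": [1.0, 5.0, 9.0],
    "gene_b": [4.0, 5.0, 6.0],
    "gene_c": [2.0, 2.0, 8.0],
}
normal = {
    "gene_a": [4.0, 5.0, 6.0],
    "gene_b": [3.0, 5.0, 7.0],
    "gene_c": [3.0, 4.0, 5.0],
}
top_genes, ratios = variance_ratio_ranking(tumor, normal, top_k=2)
print(top_genes)
```

The selected genes would then be passed to a classifier such as an SVM; that step is omitted here.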

Conclusion

This survey discusses machine learning and deep learning algorithms that are widely used for cancer classification from gene expression datasets. The diagnosis of cancer and its subtypes is a rapidly evolving field of precision medicine, and drug decisions increasingly depend on accurate molecular-level and genetic profiling of tumor cells. The survey also discusses effective applications in the areas of regulatory genomics, pathogenicity, and variant calling. In addition, simple and effective feature selection is useful for understanding the causes of cancer and provides a theoretical basis for pathologists to study cancer and select targeted therapies.

Appendix A. Key terms

Key term: Definition

Gene: A gene is a specific part of a DNA molecule that contains all the coding information required to direct a cell to manufacture a specific product, for example, an RNA molecule or a protein. Contained within the gene are parts that are active in the coding process (exons), in addition to noncoding parts (introns).

Gene expression profile: A laboratory method that identifies all of the genes in a cell or tissue that are producing messenger RNA. Messenger RNA molecules carry the genetic information that is required to make proteins from the DNA in the nucleus of the cell to the cytoplasm, where the proteins are synthesized (Stranger et al., 2017).

Gene mutation analysis: Cancer is a genomic disease with somatically acquired genetic abnormalities. Driver mutations are required for cancer to occur, whereas passenger mutations are unrelated to cancer growth and accumulate through DNA replication. Several major cancer sequencing projects, for example the International Cancer Genome Consortium (ICGC), the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program, and The Cancer Genome Atlas (TCGA), have produced a comprehensive catalog of somatic mutations across many cancer types (Han et al., 2019).

Biomarkers: According to the National Cancer Institute, a biomarker is "a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease" (NCI), for example cancer.

Personalized medicine: With rapid advances in genomics and molecular biology, novel markers can be discovered for the presence of or susceptibility to a disease, or for response to therapy, and these advances can help to find a specific treatment for every patient.

Artificial intelligence: A branch of computer science that tries both to understand and to build intelligent entities, often implemented as software programs.

Deep learning: A branch of the broader machine learning domain. Deep learning employs artificial neural networks with several layers of artificial neurons. The output of one layer is fed as input into the following layer to gain greater flexibility.

Supervised machine learning: Supervised machine-learning approaches work by identifying the input-output relationship in the "training" stage and by using the identified relationship to predict the correct output for new cases.

Unsupervised machine learning: A category of machine-learning task that aims at inferring underlying patterns in unlabeled data.

References

1. Joseph, M., Devaraj, M., & Vea, L. A. (2018, November). Cancer Classification of Gene Expression Data using Machine Learning Models. In 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM) (pp. 1-6). IEEE.
2. Ramachandran, A., Li, H., Klee, E., Lumetta, S. S., & Chen, D. (2018, January). Deep learning for better variant calling for cancer diagnosis and treatment. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference (pp. 16-21). IEEE Press.
3. Chang, Y., Park, H., Yang, H. J., Lee, S., Lee, K. Y., Kim, T. S., ... & Shin, J. M. (2018). Cancer Drug Response profile scan (CDRscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Scientific reports, 8(1), 8857.
4. Eraslan, G., Avsec, Ž., Gagneur, J., & Theis, F. J. (2019). Deep learning: new computational modeling techniques for genomics. Nature Reviews Genetics, 1.
5. Zhu, Q. X., Fan, Y., He, Y. L., & Xu, Y. (2018, May). Effective Cancer Classification Based on Gene Expression Data using Multidimensional Mutual Information and ELM. In 2018 IEEE 7th Data-Driven Control and Learning Systems Conference (DDCLS) (pp. 954-958). IEEE.
6. Arshak, Y., & Eesa, A. (2018, October). A New Dimensional Reduction Based on Cuttlefish Algorithm for Human Cancer Gene Expression. In 2018 International Conference on Advanced Science and Engineering (ICOASE) (pp. 48-53). IEEE.
7. Leung, M. K., Delong, A., Alipanahi, B., & Frey, B. J. (2015). Machine learning in genomic medicine: a review of computational problems and data sets. Proceedings of the IEEE, 104(1), 176-197.
8. Lan, K., Wang, D. T., Fong, S., Liu, L. S., Wong, K. K., & Dey, N. (2018). A survey of data mining and deep learning in bioinformatics. Journal of medical systems, 42(8), 139.
9. Min, S., Lee, B., & Yoon, S. (2017). Deep learning in bioinformatics. Briefings in bioinformatics, 18(5), 851-869.
10. Yu, K. H., Beam, A. L., & Kohane, I. S. (2018). Artificial intelligence in healthcare. Nature biomedical engineering, 2(10), 719.
11. National Cancer Institute. A to Z List of Cancer types. https://www.cancer.gov/types
12. Gómez-Verján, J. C., & Gutiérrez-Robledo, L. M. (2018). The Challenge of Big Data and Data Mining in Aging Research. In Aging Research-Methodological Issues (pp. 185-196). Springer, Cham.
13. Zhou, R., & Hu, K. (2017, June). Fuzzy Support Vector Machine for breast cancer gene classification. In 2017 2nd International Conference on Image, Vision, and Computing (ICIVC) (pp. 676-679). IEEE.
14. Zhang, D., Zou, L., Zhou, X., & He, F. (2018). Integrating feature selection and feature extraction methods with deep learning to predict the clinical outcome of breast cancer. IEEE Access, 6, 28936-28944.
15. Pavithra, D., & Lakshmanan, B. (2017, June). Feature selection and classification in gene expression cancer data. In 2017 International Conference on Computational Intelligence in Data Science (ICCIDS) (pp. 1-6). IEEE.
16. Matsubara, T., Ochiai, T., Hayashida, M., Akutsu, T., & Nacher, J. (2018, October). Convolutional Neural Network Approach to Lung Cancer Classification Integrating Protein Interaction Network and Gene Expression Profiles. In 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE) (pp. 151-154). IEEE.
17. Coudray, N., Ocampo, P. S., Sakellaropoulos, T., Narula, N., Snuderl, M., Fenyö, D., ... & Tsirigos, A. (2018). Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine, 24(10), 1559.
18. Ahn, T., Kang, N., Kim, Y., & Park, T. (2018, December). Gene expression-based prediction of prognostic outcome in ovarian cancer. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 1753-1757). IEEE.
19. Lee, J. Y., Lin, S. Y., Chuang, Y. H., Huang, S. H., Tseng, Y. Y., Lin, C. Y., ... & Yang, J. M. (2018, October). Identification of the PCa28 Gene Signature as a Predictor in . In 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE) (pp. 155-158). IEEE.
20. Ahn, T., Goo, T., Lee, C. H., Kim, S., Han, K., Park, S., & Park, T. (2018, December). Deep Learning-based Identification of Cancer or Normal Tissue using Gene Expression Data. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 1748-1752). IEEE.
21. Fang, J., Zhang, J. G., Deng, H. W., & Wang, Y. P. (2017). Joint Detection of Associations Between DNA Methylation and Gene Expression From Multiple Cancers. IEEE Journal of biomedical and health informatics, 22(6), 1960-1969.
22. Yan, K., & Lu, H. (2018, October). An Extended Genetic Algorithm Based Gene Selection Framework for Cancer Diagnosis. In 2018 9th International Conference on Information Technology in Medicine and Education (ITME) (pp. 43-47). IEEE.
23. Ainscough, B. J., Barnell, E. K., Ronning, P., Campbell, K. M., Wagner, A. H., Fehniger, T. A., ... & Griffith, M. (2018). A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data. Nature genetics, 50(12), 1735.
24. Shen, Y., Wu, C., Liu, C., Wu, Y., & Xiong, N. (2018). Oriented feature selection SVM applied to cancer prediction in precision medicine. IEEE Access, 6, 48510-48521.
25. Yousefi, S., Amrollahi, F., Amgad, M., Dong, C., Lewis, J. E., Song, C., ... & Cooper, L. A. (2017). Predicting clinical outcomes from large-scale cancer genomic profiles with deep survival models. Scientific reports, 7(1), 11707.
26. Li, T., Hu, M., & Zhang, L. (2018, October). Using the SVM Method for Lung Adenocarcinoma Prognosis Based on Expression Level. In Proceedings of the 2018 2nd International Conference on Computational Biology and Bioinformatics (pp. 63-66). ACM.
27. Xiang, K., Ye, J., & Xing, B. (2019, March). Applying Machine Learning to Facilitate in Lung Adenocarcinoma. In Proceedings of the 2019 3rd International Conference on Compute and Data Analysis (pp. 9-12). ACM.
28. Lu, Y., Guo, X., Pan, H., & Lin, H. (2018, November). Mutation Relation Extraction and Genes Network Analysis in Colon Cancer. In 2018 5th International Conference on Systems and Informatics (ICSAI) (pp. 1085-1092). IEEE.
29. Sun, D., Wang, M., Feng, H., & Li, A. (2017, October). Prognosis predicting human breast cancer by integrating deep neural network and support vector machine: Supervised feature extraction and classification for breast cancer prognosis prediction. In 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (pp. 1-5). IEEE.
30. Kishk, A., Elzizy, A., Galal, D., Razek, E. A., Fawzy, E., Ahmed, G., ... & El-Hadidi, M. (2018, December). A Hybrid Machine Learning Approach for the Phenotypic Classification of Metagenomic Colon Cancer Reads Based on Kmer Frequency and Biomarker Profiling. In 2018 9th Cairo International Biomedical Engineering Conference (CIBEC) (pp. 118-121). IEEE.
31. Roberts, A. G., Catchpoole, D. R., & Kennedy, P. J. (2018, July). Variance-based Feature Selection for Classification of Cancer Subtypes Using Gene Expression Data. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
32. Wisesty, U. N., Pratama, B. B., & Aditsania, A. (2017, November). Cancer Detection Based on Microarray Data Classification Using Deep Belief Network and Mutual Information. In 2017 5th International Conference on Instrumentation, Communications, Information Technology, and Biomedical Engineering (ICICI-BME) (pp. 157-162). IEEE.
33. Hasri, N. M., Wen, N. H., Howe, C. W., Mohamad, M. S., Deris, S., & Kasim, S. (2017). Improved support vector machine using multiple SVM-RFE for cancer classification. International Journal on Advanced Science, Engineering and Information Technology, 7(4-2), 1589-1594.
34. Kavitha, K. R., Gopinath, A., & Gopi, M. (2017, September). Applying improved SVM classifier for leukemia cancer classification using FCBF. In 2017 International Conference on Advances in Computing, Communications, and Informatics (ICACCI) (pp. 61-66). IEEE.
35. Sinha, S., Thomas, D., Chan, S., Gao, Y., Brunen, D., Torabi, D., ... & Bernards, R. (2017). Systematic discovery of mutation-specific synthetic lethal by mining pan-cancer human primary tumor data. Nature communications, 8, 15580.
36. Han, Y., Yang, J., Qian, X., Cheng, W. C., Liu, S. H., Hua, X., ... & Lu, Y. (2019). DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies. Nucleic acids research.
37. Lin, M., Jaitly, V., Wang, I., Hu, Z., Chen, L., Wahid, M., ... & Nguyen, A. N. (2018). Application of Deep Learning on Predicting Prognosis of with Cytogenetics, Age, and Mutations. arXiv preprint arXiv:1810.13247.
38. Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., & Telenti, A. (2018). A primer on deep learning in genomics. Nature genetics, 1.
39. Jin, J., Wu, X., Yin, J., Li, M., Shen, J., Li, J., ... & Cho, C. H. (2019). Identification of genetic mutations in cancer: challenge and opportunity in the new era of targeted therapy. Frontiers in , 9.
40. Wen, M., Zhang, Z., Niu, S., Sha, H., Yang, R., Yun, Y., & Lu, H. (2017). Deep-learning-based drug-target interaction prediction. Journal of proteome research, 16(4), 1401-1409.
41. Benstead-Hume, G., Wooller, S. K., & Pearl, F. M. (2017). ‘Big data’ approaches for novel anti-cancer drug discovery. Expert opinion on drug discovery, 12(6), 599-609.
42. Zhang, D., Zou, L., Zhou, X., & He, F. (2018). Integrating feature selection and feature extraction methods with deep learning to predict the clinical outcome of breast cancer. IEEE Access, 6, 28936-28944.
43. Chen, K. M., Cofer, E. M., Zhou, J., & Troyanskaya, O. G. (2019). Selene: a PyTorch-based deep learning library for sequence data. Nature Methods, 16(4), 315.
44. Yuan, Y., Shi, Y., Li, C., Kim, J., Cai, W., Han, Z., & Feng, D. D. (2016). DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations. BMC bioinformatics, 17(17), 476.
45. Hira, Z. M., & Gillies, D. F. (2015). A review of feature selection and feature extraction methods applied to microarray data. Advances in bioinformatics, 2015.
46. Stranger, B. E., Brigham, L. E., Hasz, R., Hunter, M., Johns, C., Johnson, M., ... & Mestichelli, B. (2017). Enhancing GTEx by bridging the gaps between genotype, gene expression, and disease. Nature genetics, 49(12), 1664.
47. Huang, C., Mezencev, R., McDonald, J. F., & Vannberg, F. (2017). Open source machine-learning algorithms for the prediction of optimal cancer drug therapies. PLoS One, 12(10), e0186906.
48. Navarro, C. (2019). How to Train Your Genome. Cell, 177(1), 3-4.
