The Pennsylvania State University The Graduate School
BIOMARKERS DISCOVERY USING NETWORK BASED
ANOMALY DETECTION
A Thesis in Computer Science and Engineering
by
Cheng-Kai Chen
© 2019 Cheng-Kai Chen
Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science
August 2019

The thesis of Cheng-Kai Chen was reviewed and approved* by the following:
Vasant Honavar Professor of Computer Science and Engineering Professor of Information Sciences and Technology Thesis Advisor
Kamesh Madduri Associate Professor of Computer Science and Engineering
Chitaranjan R. Das Distinguished Professor of Computer Science and Engineering Head of the Department of Computer Science and Engineering
*Signatures are on file in the Graduate School.

Abstract
Identifying biomarkers is an important step in translating research advances in genomics into clinical practice. From a machine learning perspective, computational biomarker identification can be implemented using a broad range of feature selection methods. In this thesis, we consider an alternative approach, the Network-Based Biomarker Discovery (NBBD) framework. As the name suggests, NBBD uses network representations of the input data to identify potential biomarkers (i.e., discriminative features for training machine learning classifiers). NBBD consists of two main customizable modules: a Network Inference Module and a Node Importance Scoring Module. The Network Inference Module creates ecological networks from a given dataset. The Node Importance Scoring Module computes a score for each node based on the difference between two ecological networks. However, most of the node scoring methods used in NBBD are based on nodes' local topological properties. To date, NBBD has been successfully applied to metagenomics data. In this thesis, we extend two aspects of the earlier work on NBBD: i) we propose two novel node importance scoring methods based on node anomaly scores and differences in nodes' global profiles; ii) we demonstrate the applicability of NBBD to neuroblastoma biomarker discovery from gene expression data. Our computational results show that our methods can outperform the local node importance scoring methods and are comparable to state-of-the-art feature selection methods, including Random Forest Feature Importance and Information Gain.
Table of Contents
List of Figures vii
List of Tables viii
Acknowledgments xii
Chapter 1 Introduction 1
Chapter 2 Similarity in Graphs 4
2.1 Vertex Similarity 5
2.1.1 Local Approaches 5
2.1.1.1 Common Neighbor (CN) 5
2.1.1.2 The Adamic-Adar Index (AA) 6
2.1.1.3 The Hub Promoted Index (HPI) 6
2.1.1.4 The Hub Depressed Index (HDI) 6
2.1.1.5 Jaccard Index (JA) 7
2.1.1.6 The Local Leicht-Holme-Newman Index (LLHN) 7
2.1.1.7 The Preferential Attachment Index (PA) 7
2.1.1.8 The Resource Allocation Index (RA) 8
2.1.1.9 The Salton Index (SA) 8
2.1.2 The Sørensen Index (SO) 8
2.1.3 Global Approaches 9
2.1.3.1 SimRank 9
2.1.3.2 Asymmetric Structure COntext Similarity (ASCOS) 15
2.2 Graph Similarity 19
2.2.1 Measuring Node Affinities: FaBP 21
2.2.2 Distance Measure Between Graphs 22
2.2.3 DeltaCon Node Attribution Function 23
Chapter 3 Anomaly Detection Methods 25
3.1 Auto-Encoder 26
3.2 Clustering-Based Local Outlier Factor (CBLOF) 26
3.3 Histogram-based Outlier Score (HBOS) 27
3.4 Isolation Forest (IForest) 28
3.5 Local Outlier Factor (LOF) 29
3.6 Minimum Covariance Determinant (MCD) 30
3.7 One-Class Support Vector Machines (OCSVM) 31
3.8 Principal Component Analysis (PCA) 32
Chapter 4 Graph Based Feature Selection Methods and Their Application in Biomarker Discovery 33
4.1 Methods 34
4.1.1 Datasets 34
4.1.1.1 Inflammatory Bowel Diseases (IBD) dataset 34
4.1.1.2 Neuroblastoma (NB) dataset 34
4.1.2 Network-Based Biomarkers Discovery (NBBD) framework 35
4.1.3 Proposed Node Importance Scoring Methods 36
4.1.3.1 Node Anomaly Scoring (NAS) 36
4.1.3.2 Node Attribution Profile Scoring (NAPS) 37
4.1.4 Experiments 37
4.2 Results and Discussion 40
4.2.1 Performance comparisons using IBD dataset 40
4.2.2 Performance comparisons using NB dataset 41
4.3 Conclusion 42
Chapter 5 Conclusion 44
Appendix A Performance Comparison On Inflammatory Bowel Disease (IBD) dataset Using NAS methods 46
Appendix B Performance Comparison On Inflammatory Bowel Disease (IBD) dataset Using NAPS methods 55
Appendix C Performance Comparison On Neuroblastoma (NB) dataset Using NAS methods 58
Appendix D Performance Comparison On Neuroblastoma (NB) dataset Using NAPS methods 66
Bibliography 69
List of Figures
2.1 Notations 4
2.2 A Sample Citation Graph (adopted from Hamedani et al. [1]) 11
2.3 A toy network (adopted from [2]) 16
2.4 A toy network with edge weights (adopted from [3]) 18
2.5 Symbols and Definitions for DeltaCon 20
2.6 Toy Networks 21
2.7 Algorithm: DeltaCon (adopted from [4]) 23
2.8 Algorithm: DeltaCon Node Attribution (adopted from [4]) 23
3.1 Select Outlier Detection Models in PyOD (adopted from [5]) 26
4.1 NBBD framework overview (adopted from [6]) 35
4.2 NBBD framework overview with two different node scoring methods 36
List of Tables
4.1 Performance comparisons on IBD dataset of RF classifiers trained using different feature selection methods: Information Gain (IG), RF Feature Importance (RFFI), and NBBD using three node topological properties. 40
4.2 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations. 41
4.3 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations. 41
4.4 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations. 42
4.5 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations. 42
4.6 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations. 43
A.1 IBD dataset using NAS: Auto-Encoder with adj Rep. 46
A.2 IBD dataset using NAS: Auto-Encoder with SimRank Rep. 46
A.3 IBD dataset using NAS: Auto-Encoder with ASCOS Rep. 47
A.4 IBD dataset using NAS: Auto-Encoder with FaBP Rep. 47
A.5 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep. 47
A.6 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep. 47
A.7 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep. 48
A.8 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep. 48
A.9 IBD dataset using NAS: Histogram-based Outlier Score with adj Rep. 48
A.10 IBD dataset using NAS: Histogram-based Outlier Score with SimRank Rep. 48
A.11 IBD dataset using NAS: Histogram-based Outlier Score with ASCOS Rep. 49
A.12 IBD dataset using NAS: Histogram-based Outlier Score with FaBP Rep. 49
A.13 IBD dataset using NAS: Isolation Forest (IForest) with adj Rep. 49
A.14 IBD dataset using NAS: Isolation Forest (IForest) with SimRank Rep. 49
A.15 IBD dataset using NAS: Isolation Forest (IForest) with ASCOS Rep. 50
A.16 IBD dataset using NAS: Isolation Forest (IForest) with FaBP Rep. 50
A.17 IBD dataset using NAS: Local Outlier Factor (LOF) with adj Rep. 50
A.18 IBD dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep. 50
A.19 IBD dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep. 50
A.20 IBD dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep. 51
A.21 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep. 51
A.22 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep. 51
A.23 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep. 51
A.24 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep. 52
A.25 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep. 52
A.26 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep. 52
A.27 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep. 52
A.28 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep. 53
A.29 IBD dataset using NAS: Principal Component Analysis (PCA) with adj Rep. 53
A.30 IBD dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep. 53
A.31 IBD dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep. 53
A.32 IBD dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep. 54
B.1 IBD dataset using NAPS: RoodED with SimRank Rep. 55
B.2 IBD dataset using NAPS: RoodED with ASCOS Rep. 55
B.3 IBD dataset using NAPS: RoodED with FaBP Rep. 56
B.4 IBD dataset using NAPS: Cosine Distance with SimRank Rep. 56
B.5 IBD dataset using NAPS: Cosine Distance with ASCOS Rep. 56
B.6 IBD dataset using NAPS: Cosine Distance with FaBP Rep. 56
B.7 IBD dataset using NAPS: Bray-Curtis Distance with SimRank Rep. 57
B.8 IBD dataset using NAPS: Bray-Curtis Distance with ASCOS Rep. 57
B.9 IBD dataset using NAPS: Bray-Curtis Distance with FaBP Rep. 57
C.1 NB dataset using NAS: Auto-Encoder with adj Rep. 58
C.2 NB dataset using NAS: Auto-Encoder with SimRank Rep. 58
C.3 NB dataset using NAS: Auto-Encoder with ASCOS Rep. 59
C.4 NB dataset using NAS: Auto-Encoder with FaBP Rep. 59
C.5 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep. 59
C.6 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep. 59
C.7 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep. 59
C.8 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep. 60
C.9 NB dataset using NAS: Histogram-based Outlier Score with adj Rep. 60
C.10 NB dataset using NAS: Histogram-based Outlier Score with SimRank Rep. 60
C.11 NB dataset using NAS: Histogram-based Outlier Score with ASCOS Rep. 60
C.12 NB dataset using NAS: Histogram-based Outlier Score with FaBP Rep. 60
C.13 NB dataset using NAS: Isolation Forest (IForest) with adj Rep. 60
C.14 NB dataset using NAS: Isolation Forest (IForest) with SimRank Rep. 61
C.15 NB dataset using NAS: Isolation Forest (IForest) with ASCOS Rep. 61
C.16 NB dataset using NAS: Isolation Forest (IForest) with FaBP Rep. 61
C.17 NB dataset using NAS: Local Outlier Factor (LOF) with adj Rep. 61
C.18 NB dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep. 61
C.19 NB dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep. 62
C.20 NB dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep. 62
C.21 NB dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep. 62
C.22 NB dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep. 62
C.23 NB dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep. 63
C.24 NB dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep. 63
C.25 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep. 63
C.26 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep. 63
C.27 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep. 63
C.28 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep. 64
C.29 NB dataset using NAS: Principal Component Analysis (PCA) with adj Rep. 64
C.30 NB dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep. 64
C.31 NB dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep. 64
C.32 NB dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep. 65
D.1 NB dataset using NAPS: RoodED with SimRank Rep. 66
D.2 NB dataset using NAPS: RoodED with ASCOS Rep. 66
D.3 NB dataset using NAPS: RoodED with FaBP Rep. 67
D.4 NB dataset using NAPS: Cosine Distance with SimRank Rep. 67
D.5 NB dataset using NAPS: Cosine Distance with ASCOS Rep. 67
D.6 NB dataset using NAPS: Cosine Distance with FaBP Rep. 67
D.7 NB dataset using NAPS: Bray-Curtis Distance with SimRank Rep. 68
D.8 NB dataset using NAPS: Bray-Curtis Distance with ASCOS Rep. 68
D.9 NB dataset using NAPS: Bray-Curtis Distance with FaBP Rep. 68
Acknowledgments
I would like to express my deep gratitude to the following individuals for their tremendous encouragement, support, and assistance throughout my graduate program.
Advisor Professor Vasant Honavar
Thesis committee Professor Kamesh Madduri
My Lab Mates
Yasser El-Manzalawy
Aria Khademi
David Foley
Junjie Liang
Tsung-Yu Hsieh
Yiwei Sun
Family and Friends
My father, Michael Chen
My mother, Sonia Chen
My sisters, Betty Chen and Jenny Chen
My brother, Jordan Chen
Lauren Morrison
Spannu
Kate Chen
Adam Lin
Tomas Wang
Hsiao-Ting Hung
Chien-Hua Chen
Die Zhu
This work was funded in part by the NIH NCATS through the grants UL1 TR000127 and TR002014, and by the NSF through the grant 1518732. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.
Chapter 1
Introduction
The focus of this thesis is to provide alternative scoring modules that address and resolve some of the limitations of the Network-Based Biomarker Discovery [7] framework used to identify biomarkers. Given the increasing reliance on biomarkers in an ever-expanding roster of scientific and research applications, effective alternative models would be a valuable addition to the biomarker landscape.

The term biomarker, a combined form of biological and marker, came into widespread use in the scientific community several decades ago, motivating the National Institutes of Health to convene a cross-disciplinary group of experts to define it. The definition of biomarker that emerged from the National Institutes of Health Biomarkers Definitions Working Group was a broad one that demonstrates its extensive research and clinical applicability in identification, diagnosis, treatment, and prognosis: a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention [8]. Other entities, like the FDA, have likewise urged identification and adoption of a consistent lexicon, specifically, a harmonizing terminology in biomarker science, including validation and qualification [9].

As the definition indicates, biomarkers are used to track normal biological processes as well as abnormal ones. Their enormous potential to predict disease and outcomes of treatment has encouraged researchers to study them. The application of biomarkers, like their definition, has therefore become broad, including such areas as specific disease diagnosis, cancer diagnosis and staging, drug development, and pharmacologic application and dosing.

The task of identifying biomarkers is an arduous one. It involves finding the pool of features that can be accurately identified from two or more groups of samples [10].
This task is difficult because the data usually come with a large number of features gathered from a small number of samples, with a high degree of sparsity in the data. Several statistical methods have been proposed to deal with these challenges [11], with machine learning based feature selection [12] being one widely used technique for identifying the most informative features in a dataset.

Abbas and Le et al. [6] proposed a novel Network-Based Biomarker Discovery (NBBD) framework for detecting disease biomarkers, which addresses the challenges mentioned above. This integrated NBBD framework contains two customizable modules: a network inference module and a node importance scoring module. The network inference module is used to infer ecological networks from the given data. The node importance scoring module is used to compute a score for each node in the network based on different scoring criteria. The nodes with the top scores are considered candidate biomarkers. Although Abbas and Le et al. [6] reported that NBBD is very competitive with some of the state-of-the-art feature selection methods [12] such as random forest [13], most of the node scoring methods used in NBBD are local methods.

In this work, we propose two alternative node importance scoring modules, Node Attribution Profile Scoring (NAPS) and Node Anomaly Scoring (NAS), to address the limitation pointed out above. Furthermore, since the NBBD framework has been tested only on a metagenomics dataset, it is not yet known whether it also performs well on other kinds of datasets, such as gene expression data.

The remainder of this thesis is organized as follows:
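The two-module design just described can be sketched in a few lines of Python. This is a simplified, hypothetical illustration, not the thesis's actual implementation: the adjacency-set network type, the degree_change score, and the node names are all assumptions introduced here for the example.

```python
from typing import Callable, Dict, List, Set

# Hypothetical network representation: node -> set of neighbor nodes.
Network = Dict[str, Set[str]]

def nbbd_rank(net_a: Network, net_b: Network,
              score: Callable[[str, Network, Network], float],
              top_k: int = 10) -> List[str]:
    """Rank the nodes shared by two inferred networks (e.g. case vs.
    control) using a pluggable node importance score, highest first."""
    shared = set(net_a) & set(net_b)
    ranked = sorted(shared, key=lambda n: score(n, net_a, net_b), reverse=True)
    return ranked[:top_k]

def degree_change(n: str, net_a: Network, net_b: Network) -> float:
    # A simple *local* score of the kind the thesis aims to improve on:
    # absolute change in node degree between the two networks.
    return abs(len(net_a[n]) - len(net_b[n]))

# Toy case/control networks over three hypothetical features.
case = {"g1": {"g2", "g3"}, "g2": {"g1"}, "g3": {"g1"}}
control = {"g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2"}}
print(nbbd_rank(case, control, degree_change, top_k=2))
```

The point of the sketch is the pluggable score argument: NAPS and NAS, introduced later, slot into the same position as degree_change.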
Chapter 2: In Chapter 2, we introduce different similarity methods in graphs. These methods allow us to capture similarity properties between nodes in a network. We also introduce the Node Attribution Function, which is part of DeltaCon; it ranks the nodes that are responsible for the difference between two networks. Our Node Attribution Profile Scoring methods are motivated by the Node Attribution Function.
Chapter 3: In Chapter 3, we describe anomaly detection and introduce different anomaly detection methods that can be used in our Node Anomaly Scoring modules.
Chapter 4: In Chapter 4, we explain the NBBD framework and introduce two node importance scoring modules: NAPS and NAS. We test NAPS and NAS using the Inflammatory Bowel Disease (metagenomics) and Neuroblastoma (gene expression) datasets and report their performance.
Our research results show that our methods can outperform the local scoring methods and are comparable to state-of-the-art feature selection methods, including Random Forest Feature Importance and Information Gain. This outcome holds significance for future biomarker research, offering as it does alternatives to the existing selection methods that appear to be more effective than the original NBBD framework. Further research that confirms this result would be of value.

Chapter 2
Similarity in Graphs
Graph similarity is a well-studied topic that has found applications in areas such as social networks, link prediction, anomaly detection, clustering, web search, and biological network comparison [14, 15, 16, 17, 18, 19]. In this section, we review the major types of similarity in graphs, vertex similarity and graph similarity, as well as some state-of-the-art algorithms for computing these similarities. We will discuss the benefits and drawbacks of each method.
Figure 2.1. Notations
2.1 Vertex Similarity
Vertex similarity is one type of similarity measurement in graphs. These methods are developed from the hypothesis that two nodes are similar if they are connected to similar neighbor nodes or are close to each other in the network under a given distance function. Vertex similarity approaches define a similarity function S(u, v) that returns a similarity score for a pair of nodes u and v, where u, v ∈ V. Because defining similarity is not a trivial task, the similarity function can vary between different networks. There are two types of vertex similarity, local and global. We will discuss both in the next sections.
2.1.1 Local Approaches
Local similarity approaches use node neighborhood information to compute the similarity score between a pair of nodes in the network. These approaches usually require much less computing time than global approaches. Their main disadvantage is that they consider only local information (usually 1-step or 2-step neighbors). Therefore, the resulting scores may not be very accurate, since the information is limited to the local neighborhood.
2.1.1.1 Common Neighbor (CN)
The Common Neighbors (CN) method [20] is the simplest similarity method; the similarity score is based on the number of common neighbors shared by node u and node v. This method is defined as:
S(u, v) = |Γ(u) ∩ Γ(v)|    (2.1)

where Γ(u) denotes the set of neighbors of node u. The idea is that two instances are more likely to be similar if they share more neighbors than others sharing fewer neighbors. Newman [21] supported this hypothesis by observing scientific collaboration networks and showing that the chance that two scientists collaborate increases with the number of other collaborators they have in common. Although this is a simple method, Martínez et al. [15] also reported that Common Neighbors performs surprisingly well on most real-world networks compared to other more complex methods.
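Equation (2.1) translates directly into code when the graph is stored as adjacency sets. The toy graph below is a hypothetical example, not one of the thesis's datasets:

```python
def common_neighbors(graph, u, v):
    """CN score of Eq. (2.1): |Γ(u) ∩ Γ(v)| for an adjacency-set graph."""
    return len(graph[u] & graph[v])

# Hypothetical undirected toy graph as adjacency sets.
g = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
print(common_neighbors(g, "b", "d"))  # b and d share neighbors a and c -> 2
```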
2.1.1.2 The Adamic-Adar Index (AA)
The Adamic-Adar Index (AA) can be calculated with the following equation:
S(u, v) = Σ_{w ∈ Γ(u) ∩ Γ(v)} 1 / log |Γ(w)|    (2.2)

This similarity method was introduced by Adamic and Adar [22]. The intuition behind their method is that two nodes are similar if they share the same features, and the score is penalized by the logarithm of each feature's appearance frequency. This idea makes sense because the more frequently a feature appears in general, the less it contributes to the score locally. For example, if a feature appears frequently across the graph, it should have less impact in determining whether two nodes are similar or not.
2.1.1.3 The Hub Promoted Index (HPI)
The Hub Promoted Index (HPI) was first introduced by Ravasz et al. [23], and it is suggested for metabolic networks. The idea is that a node with high topological overlap with its neighbors is more likely to belong to the same class. The similarity score is defined as: