The Pennsylvania State University
The Graduate School

BIOMARKER DISCOVERY USING NETWORK-BASED

ANOMALY DETECTION

A Thesis in
Computer Science and Engineering
by
Cheng-Kai Chen

© 2019 Cheng-Kai Chen

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

August 2019

The thesis of Cheng-Kai Chen was reviewed and approved* by the following:

Vasant Honavar
Professor of Computer Science and Engineering
Professor of Information Sciences and Technology
Thesis Advisor

Kamesh Madduri
Associate Professor of Computer Science and Engineering

Chitaranjan R. Das
Distinguished Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

*Signatures are on file in the Graduate School.

Abstract

Identifying biomarkers is an important step in translating research advances in genomics into clinical practice. From a computational perspective, biomarker identification can be implemented using a broad range of methods. In this thesis, we consider an alternative approach, the Network-Based Biomarker Discovery (NBBD) framework. As the name suggests, NBBD uses network representations of the input data to identify potential biomarkers (i.e., discriminative features for training machine learning classifiers). NBBD consists of two main customizable modules: a Network Inference Module and a Node Importance Scoring Module. The Network Inference Module creates ecological networks from a given dataset. The Node Importance Scoring Module computes a score for each node based on the difference between two ecological networks. However, most of the node scoring methods used in NBBD are based on nodes' local topological properties. To date, NBBD has been successfully applied to metagenomics data. In this thesis, we extend two aspects of the earlier work on NBBD: i) we propose two novel node importance scoring methods based on node anomaly scores and on differences in nodes' global profiles; ii) we demonstrate the applicability of NBBD to neuroblastoma biomarker discovery from gene expression data. Our computational results show that our methods can outperform the local node importance scoring methods and are comparable to state-of-the-art feature selection methods, including Random Forest Feature Importance and Information Gain.

Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1 Introduction

Chapter 2 Similarity in Graphs
2.1 Vertex Similarity
2.1.1 Local Approaches
2.1.1.1 Common Neighbor (CN)
2.1.1.2 The Adamic-Adar Index (AA)
2.1.1.3 The Hub Promoted Index (HPI)
2.1.1.4 The Hub Depressed Index (HDI)
2.1.1.5 Jaccard Index (JA)
2.1.1.6 The Local Leicht-Holme-Newman Index (LLHN)
2.1.1.7 The Preferential Attachment Index (PA)
2.1.1.8 The Resource Allocation Index (RA)
2.1.1.9 The Salton Index (SA)
2.1.2 The Sørensen Index (SO)
2.1.3 Global Approaches
2.1.3.1 SimRank
2.1.3.2 Asymmetric Structure COntext Similarity (ASCOS)
2.2 Graph Similarity
2.2.1 Measuring Node Affinities: FaBP
2.2.2 Distance Measure Between Graphs
2.2.3 DeltaCon Node Attribution Function

Chapter 3 Anomaly Detection Methods
3.1 Auto-Encoder
3.2 Clustering-Based Local Outlier Factor (CBLOF)
3.3 Histogram-based Outlier Score (HBOS)
3.4 Isolation Forest (IForest)
3.5 Local Outlier Factor (LOF)
3.6 Minimum Covariance Determinant (MCD)
3.7 One-Class Support Vector Machines (OCSVM)
3.8 Principal Component Analysis (PCA)

Chapter 4 Graph Based Feature Selection Methods and Their Application in Biomarker Discovery
4.1 Methods
4.1.1 Datasets
4.1.1.1 Inflammatory Bowel Diseases (IBD) dataset
4.1.1.2 Neuroblastoma (NB) dataset
4.1.2 Network-Based Biomarkers Discovery (NBBD) framework
4.1.3 Proposed Node Importance Scoring Methods
4.1.3.1 Node Anomaly Scoring (NAS)
4.1.3.2 Node Attribution Profile Scoring (NAPS)
4.1.4 Experiments
4.2 Results and Discussion
4.2.1 Performance comparisons using IBD dataset
4.2.2 Performance comparisons using NB dataset
4.3 Conclusion

Chapter 5 Conclusion

Appendix A Performance Comparison On Inflammatory Bowel Disease (IBD) dataset Using NAS methods

Appendix B Performance Comparison On Inflammatory Bowel Disease (IBD) dataset Using NAPS methods

Appendix C Performance Comparison On Neuroblastoma (NB) dataset Using NAS methods

Appendix D Performance Comparison On Neuroblastoma (NB) dataset Using NAPS methods

Bibliography

List of Figures

2.1 Notations
2.2 A Sample Citation Graph (adopted from Hamedani et al. [1])
2.3 A toy network (adopted from [2])
2.4 A toy network with edge weights (adopted from [3])
2.5 Symbols and Definitions for DeltaCon
2.6 Toy Networks
2.7 Algorithm: DeltaCon (adopted from [4])
2.8 Algorithm: DeltaCon Node Attribution (adopted from [4])

3.1 Select Outlier Detection Models in PyOD (adopted from [5])

4.1 NBBD framework overview (adopted from [6])
4.2 NBBD framework overview with two different node scoring methods

List of Tables

4.1 Performance comparisons on IBD dataset of RF classifiers trained using different feature selection methods: Information Gain (IG), RF Feature Importance (RFFI), and NBBD using three node topological properties.
4.2 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
4.3 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations.
4.4 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
4.5 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
4.6 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations.

A.1 IBD dataset using NAS: Auto-Encoder with adj Rep.
A.2 IBD dataset using NAS: Auto-Encoder with SimRank Rep.
A.3 IBD dataset using NAS: Auto-Encoder with ASCOS Rep.
A.4 IBD dataset using NAS: Auto-Encoder with FaBP Rep.
A.5 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep.
A.6 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep.
A.7 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep.
A.8 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep.
A.9 IBD dataset using NAS: Histogram-based Outlier Score with adj Rep.
A.10 IBD dataset using NAS: Histogram-based Outlier Score with SimRank Rep.
A.11 IBD dataset using NAS: Histogram-based Outlier Score with ASCOS Rep.
A.12 IBD dataset using NAS: Histogram-based Outlier Score with FaBP Rep.
A.13 IBD dataset using NAS: Isolation Forest (IForest) with adj Rep.
A.14 IBD dataset using NAS: Isolation Forest (IForest) with SimRank Rep.
A.15 IBD dataset using NAS: Isolation Forest (IForest) with ASCOS Rep.
A.16 IBD dataset using NAS: Isolation Forest (IForest) with FaBP Rep.
A.17 IBD dataset using NAS: Local Outlier Factor (LOF) with adj Rep.
A.18 IBD dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep.
A.19 IBD dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep.
A.20 IBD dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep.
A.21 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep.
A.22 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep.
A.23 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep.
A.24 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep.
A.25 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep.
A.26 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep.
A.27 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep.
A.28 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep.
A.29 IBD dataset using NAS: Principal Component Analysis (PCA) with adj Rep.
A.30 IBD dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep.
A.31 IBD dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep.
A.32 IBD dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep.

B.1 IBD dataset using NAPS: RootED with SimRank Rep.
B.2 IBD dataset using NAPS: RootED with ASCOS Rep.
B.3 IBD dataset using NAPS: RootED with FaBP Rep.
B.4 IBD dataset using NAPS: Cosine Distance with SimRank Rep.
B.5 IBD dataset using NAPS: Cosine Distance with ASCOS Rep.
B.6 IBD dataset using NAPS: Cosine Distance with FaBP Rep.
B.7 IBD dataset using NAPS: Bray-Curtis Distance with SimRank Rep.
B.8 IBD dataset using NAPS: Bray-Curtis Distance with ASCOS Rep.
B.9 IBD dataset using NAPS: Bray-Curtis Distance with FaBP Rep.

C.1 NB dataset using NAS: Auto-Encoder with adj Rep.
C.2 NB dataset using NAS: Auto-Encoder with SimRank Rep.
C.3 NB dataset using NAS: Auto-Encoder with ASCOS Rep.
C.4 NB dataset using NAS: Auto-Encoder with FaBP Rep.
C.5 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep.
C.6 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep.
C.7 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep.
C.8 NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep.
C.9 NB dataset using NAS: Histogram-based Outlier Score with adj Rep.
C.10 NB dataset using NAS: Histogram-based Outlier Score with SimRank Rep.
C.11 NB dataset using NAS: Histogram-based Outlier Score with ASCOS Rep.
C.12 NB dataset using NAS: Histogram-based Outlier Score with FaBP Rep.
C.13 NB dataset using NAS: Isolation Forest (IForest) with adj Rep.
C.14 NB dataset using NAS: Isolation Forest (IForest) with SimRank Rep.
C.15 NB dataset using NAS: Isolation Forest (IForest) with ASCOS Rep.
C.16 NB dataset using NAS: Isolation Forest (IForest) with FaBP Rep.
C.17 NB dataset using NAS: Local Outlier Factor (LOF) with adj Rep.
C.18 NB dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep.
C.19 NB dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep.
C.20 NB dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep.
C.21 NB dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep.
C.22 NB dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep.
C.23 NB dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep.
C.24 NB dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep.
C.25 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep.
C.26 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep.
C.27 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep.
C.28 NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep.
C.29 NB dataset using NAS: Principal Component Analysis (PCA) with adj Rep.
C.30 NB dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep.
C.31 NB dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep.
C.32 NB dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep.

D.1 NB dataset using NAPS: RootED with SimRank Rep.
D.2 NB dataset using NAPS: RootED with ASCOS Rep.
D.3 NB dataset using NAPS: RootED with FaBP Rep.
D.4 NB dataset using NAPS: Cosine Distance with SimRank Rep.
D.5 NB dataset using NAPS: Cosine Distance with ASCOS Rep.
D.6 NB dataset using NAPS: Cosine Distance with FaBP Rep.
D.7 NB dataset using NAPS: Bray-Curtis Distance with SimRank Rep.
D.8 NB dataset using NAPS: Bray-Curtis Distance with ASCOS Rep.
D.9 NB dataset using NAPS: Bray-Curtis Distance with FaBP Rep.

Acknowledgments

I would like to express my deep gratitude to the following individuals for their tremendous encouragement, support, and assistance throughout my graduate program.

Advisor
Professor Vasant Honavar

Thesis committee
Professor Kamesh Madduri

My Lab Mates
Yasser El-Manzalawy
Aria Khademi
David Foley
Junjie Liang
Tsung-Yu Hsieh
Yiwei Sun

Family and Friends
My father, Michael Chen
My mother, Sonia Chen
My sisters, Betty Chen and Jenny Chen
My brother, Jordan Chen
Lauren Morrison Spannu
Kate Chen
Adam Lin
Tomas Wang
Hsiao-Ting Hung
Chien-Hua Chen
Die Zhu

This work was funded in part by the NIH NCATS through the grants UL1 TR000127 and TR002014, and by the NSF through the grant 1518732. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.

Chapter 1 Introduction

The focus of this thesis is to provide alternative scoring modules that address and resolve some of the limitations of the Network-Based Biomarker Discovery [7] framework used to identify biomarkers. Given the increasing reliance on biomarkers in an ever-expanding roster of scientific and research applications, effective alternative models would be a valuable addition to the biomarker landscape.

The term biomarker, a combined form of biological and marker, came into widespread use in the scientific community several decades ago, motivating the National Institutes of Health to convene a cross-disciplinary group of experts to define it. The definition that emerged from the National Institutes of Health Biomarkers Definitions Working Group was a broad one that demonstrates its extensive research and clinical applicability in identification, diagnosis, treatment, and prognosis: a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention [8]. Other entities, like the FDA, have likewise urged the identification and adoption of a consistent lexicon, specifically a harmonizing terminology in biomarker science, including validation and qualification [9].

As the definition indicates, biomarkers are used to track normal biological processes as well as abnormal ones. Their enormous potential to predict disease and outcomes of treatment has encouraged researchers to study them. The application of biomarkers, like their definition, has therefore become broad, including such areas as specific disease diagnosis, cancer diagnosis and staging, drug development, and pharmacologic application and dosing.

The task of identifying biomarkers is an arduous one. It involves finding the pool of features that can accurately distinguish two or more groups of samples [10]. This task is difficult because the data usually come with a large number of features gathered from a small number of samples, with a high degree of sparsity in the data samples. Several statistical methods have been proposed to deal with these challenges [11], with machine learning based feature selection [12] being one widely used technique for identifying the most informative features in the dataset.

Abbas and Le et al. [6] proposed a novel Network-Based Biomarker Discovery (NBBD) framework for detecting disease biomarkers, which addresses the challenges mentioned above. This integrated NBBD framework contains two customizable modules: a network inference module and a node importance scoring module. The network inference module is used to infer ecological networks from the given data. The node importance scoring module is used to calculate a score for each node in the network based on different scoring methods. The nodes with the top scores are considered candidate biomarkers. Although Abbas and Le et al. [6] reported that NBBD is very competitive with some of the state-of-the-art feature selection methods [12], such as random forest [13], most of the node scoring methods used in NBBD are local methods.

In this work, we propose two alternative node importance scoring modules, Node Attribution Profile Scoring (NAPS) and Node Anomaly Scoring (NAS), to address the limitation pointed out above. Furthermore, since the NBBD framework has been tested only on a metagenomics dataset, it remains uncertain whether NBBD will also perform well on other kinds of datasets, such as gene expression datasets. The remainder of this thesis is organized as follows:

Chapter 2: In Chapter 2, we introduce different similarity methods in graphs. These methods allow us to capture similarity properties between nodes in a network. We also introduce the Node Attribution Function, which is part of DeltaCon; it ranks the nodes that are responsible for the difference between two networks. Our Node Attribution Profile Scoring methods are motivated by the Node Attribution Function.

Chapter 3: In Chapter 3, we describe anomaly detection and introduce different anomaly detection methods that can be used in our Node Anomaly Scoring modules.

Chapter 4: In Chapter 4, we explain the NBBD framework and introduce two node importance scoring modules: NAPS and NAS. We test NAPS and NAS using the Inflammatory Bowel Disease (metagenomics) and Neuroblastoma (gene expression) datasets and report the performance of NAPS and NAS.

Our research results show that our methods can outperform the local scoring methods and are comparable to state-of-the-art feature selection methods, including Random Forest Feature Importance and Information Gain. This outcome holds significance for future biomarker research, offering as it does alternatives to the existing selection methods that appear to be more effective than the original NBBD framework. Further research that confirms this result would be of value.

Chapter 2 Similarity in Graphs

Graph similarity is a well-studied topic that has found applications in areas such as social networks, link prediction, anomaly detection, clustering, web searching, and biological network comparison [14, 15, 16, 17, 18, 19]. In this chapter, we review the major types of similarity in graphs, vertex similarity and graph similarity, as well as some state-of-the-art algorithms for computing these similarities. We also discuss the benefits and drawbacks of each method.

Figure 2.1. Notations

2.1 Vertex Similarity

Vertex similarity is one type of similarity measurement in graphs. These methods are developed from the hypothesis that two nodes are similar if they are connected to similar neighbor nodes or are close to each other in the network according to a given distance function. These vertex similarity approaches define a similarity function $S(u, v)$ that returns a similarity score for a pair $u$ and $v$, where $u, v \in V$. Because defining similarity is not a trivial task, the similarity function can vary between different networks. There are two types of vertex similarity, local and global; we discuss both in the following sections.

2.1.1 Local Approaches

Local similarity approaches use node neighborhood information to compute the similarity score between node pairs in the network. Usually these approaches do not require much computing time compared to the global approaches. The main disadvantage is that they only consider local information (usually 1-step or 2-step neighbors). Therefore, the resulting scores may not be very accurate, since the information is limited to the local neighborhood.

2.1.1.1 Common Neighbor (CN)

The Common Neighbors method (CN) [20] is the simplest similarity method, with the similarity score based on the number of common neighbors shared by node $u$ and node $v$. This method is defined as:

$$S(u, v) = |\Gamma_u \cap \Gamma_v| \qquad (2.1)$$

where $\Gamma_u$ denotes the set of neighbors of node $u$. The idea is that two instances are more likely to be similar if they share more neighbors than others sharing fewer neighbors. Newman [21] supported this hypothesis by observing scientific collaboration networks and showing that the chance that two scientists collaborate increases with the number of other collaborators they have in common. Although this is a simple method, Martínez et al. [15] also reported that Common Neighbors performs surprisingly well on most real-world networks compared to other, more complex methods.

2.1.1.2 The Adamic-Adar Index (AA)

The Adamic-Adar Index (AA) can be calculated with the following equation:

$$S(u, v) = \sum_{w \in \Gamma_u \cap \Gamma_v} \frac{1}{\log |\Gamma_w|} \qquad (2.2)$$

This similarity method was introduced by Adamic and Adar [22]. The intuition behind their method is that two nodes are similar if they share the same features, and the score is penalized by the logarithm of each feature's appearance frequency. This idea makes sense because the more frequently a feature appears in general, the less it contributes locally. For example, if a feature appears very often across the graph, it should be less impactful in determining whether two nodes are similar.

2.1.1.3 The Hub Promoted Index (HPI)

The Hub Promoted Index (HPI) was first introduced by Ravasz et al. [23] and is suggested for metabolic networks. The idea is that a node with high topological overlap with its neighbors is more likely to belong to the same class. The similarity score is defined as:

$$S(u, v) = \frac{|\Gamma_u \cap \Gamma_v|}{\min(|\Gamma_u|, |\Gamma_v|)} \qquad (2.3)$$

Note that the closer a node is to the hubs, the higher the similarity score is. This is because the denominator is based on the minimum degree of nodes $u$ and $v$ [24].

2.1.1.4 The Hub Depressed Index (HDI)

The Hub Depressed Index (HDI) is based on HPI [23]. The only difference is that the closer a node is to the hubs, the lower the similarity score it receives. The similarity score is defined as:

$$S(u, v) = \frac{|\Gamma_u \cap \Gamma_v|}{\max(|\Gamma_u|, |\Gamma_v|)} \qquad (2.4)$$

2.1.1.5 Jaccard Index (JA)

The Jaccard Index [25] is similar to Common Neighbors, except that the similarity score is penalized by non-shared neighbors. The similarity score is defined as:

$$S(u, v) = \frac{|\Gamma_u \cap \Gamma_v|}{|\Gamma_u \cup \Gamma_v|} \qquad (2.5)$$

2.1.1.6 The Local Leicht-Holme-Newman Index (LLHN)

This index was introduced by Leicht et al. [26]. It is the ratio of the Common Neighbors of nodes $u$ and $v$ to the product of the degrees of $u$ and $v$. The similarity score is defined as:

$$S(u, v) = \frac{|\Gamma_u \cap \Gamma_v|}{|\Gamma_u|\,|\Gamma_v|} \qquad (2.6)$$

According to Leicht et al., this measure is more sensitive to structural equivalence than other Common Neighbors-like methods such as the Jaccard Index (2.5).

2.1.1.7 The Preferential Attachment Index (PA)

This method was introduced by Barabási and Albert [27], and it is the result of the Barabási-Albert (BA) network formation model. The BA model is an algorithm that uses the preferential attachment mechanism to generate random scale-free networks, where the probability that a new link connects to node $u$ is based on the degree of $u$, $|\Gamma_u|$. This idea can also be applied to whether an edge will be created between nodes $v$ and $u$, based on the product of both nodes' degrees, $|\Gamma_u|\,|\Gamma_v|$ [24]. The similarity method is defined as:

$$S(u, v) = |\Gamma_u|\,|\Gamma_v| \qquad (2.7)$$

This method can also be applied as a global measure, but the prediction accuracy of that application is usually poor [15].

2.1.1.8 The Resource Allocation Index (RA)

This method was introduced by Zhou et al. [28] and is motivated by the resource allocation process on complex networks. The similarity score between $u$ and $v$ is based on the amount of resources $v$ receives from $u$. The formula can be written as:

$$S(u, v) = \sum_{w \in \Gamma_u \cap \Gamma_v} \frac{1}{|\Gamma_w|} \qquad (2.8)$$

Note that the scores of RA and AA will be similar if the degree of $w$ is small, which means that if the average degree in the network is very small, RA and AA will behave similarly. RA can also be applied to undirected graphs in an asymmetric variant, in which case $S(u, v)$ differs from $S(v, u)$. This similarity score function can be written as:

$$S(u, v) = \frac{1}{|\Gamma_u|} \sum_{w \in \Gamma_u \cap \Gamma_v} \frac{1}{|\Gamma_w|} \qquad (2.9)$$

2.1.1.9 The Salton Index (SA)

The Salton Index, also called Cosine Similarity, was introduced by Salton and McGill [29]. Hamers et al. [30] have also shown that the similarity measure of the Salton Cosine Index in citation research is twice that of the Jaccard Index. The similarity function is defined as:

$$S(u, v) = \frac{|\Gamma_u \cap \Gamma_v|}{\sqrt{|\Gamma_u|\,|\Gamma_v|}} \qquad (2.10)$$

2.1.2 The Sørensen Index (SO)

This method was introduced by Sørensen [31] to compare the similarity between groups in plant sociology. It can be calculated as twice the number of Common Neighbors of $u$ and $v$ divided by the sum of the degrees of $u$ and $v$. The similarity function is defined as:

$$S(u, v) = \frac{2\,|\Gamma_u \cap \Gamma_v|}{|\Gamma_u| + |\Gamma_v|} \qquad (2.11)$$
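All of the local indices above reduce to set operations on the neighbor sets $\Gamma_u$, so they are easy to compute directly. The following is a minimal Python sketch (our own illustration, not code from the cited works), assuming the graph is stored as a dict mapping each node to its set of neighbors:

```python
# Minimal implementations of the local similarity indices above.
# Assumes G is a dict: node -> set of neighbor nodes (an undirected graph).
import math

def common_neighbors(G, u, v):                    # Eq. (2.1)
    return len(G[u] & G[v])

def adamic_adar(G, u, v):                         # Eq. (2.2); assumes |Gamma_w| > 1
    return sum(1.0 / math.log(len(G[w])) for w in G[u] & G[v])

def hub_promoted(G, u, v):                        # Eq. (2.3)
    return len(G[u] & G[v]) / min(len(G[u]), len(G[v]))

def hub_depressed(G, u, v):                       # Eq. (2.4)
    return len(G[u] & G[v]) / max(len(G[u]), len(G[v]))

def jaccard(G, u, v):                             # Eq. (2.5)
    return len(G[u] & G[v]) / len(G[u] | G[v])

def llhn(G, u, v):                                # Eq. (2.6)
    return len(G[u] & G[v]) / (len(G[u]) * len(G[v]))

def preferential_attachment(G, u, v):             # Eq. (2.7)
    return len(G[u]) * len(G[v])

def resource_allocation(G, u, v):                 # Eq. (2.8)
    return sum(1.0 / len(G[w]) for w in G[u] & G[v])

def salton(G, u, v):                              # Eq. (2.10)
    return len(G[u] & G[v]) / math.sqrt(len(G[u]) * len(G[v]))

def sorensen(G, u, v):                            # Eq. (2.11)
    return 2 * len(G[u] & G[v]) / (len(G[u]) + len(G[v]))

# Toy usage on a small 4-node graph.
G = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(common_neighbors(G, 1, 4), jaccard(G, 1, 4))   # -> 2 1.0
```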

2.1.3 Global Approaches

Unlike local approaches, global methods are not limited to measuring similarity using the distance between a pair of nodes. Instead, global similarity approaches use the whole network's topological information to score each link. Since global methods consider the topological properties of the entire network, their computational complexity can make them infeasible for large networks. Moreover, the similarity scores are stored in an $n \times n$ matrix $S$, so the space complexity is $O(n^2)$.

2.1.3.1 SimRank

SimRank is a similarity measure proposed by Jeh and Widom [32] that calculates the similarities between two nodes within a single graph. This method has been successfully applied in different applications such as World-Wide Web query systems [32], link prediction [20], and recommender systems [33]. SimRank is implemented by the following iterative formulas:

$$S_0(u, v) = \begin{cases} 1, & u = v \\ 0, & u \neq v \end{cases} \qquad (2.12)$$

$$S_{k+1}(u, v) = \begin{cases} 0, & I(u) = \emptyset \text{ or } I(v) = \emptyset \\ 1, & u = v \\ \dfrac{C}{|I(u)|\,|I(v)|} \displaystyle\sum_{u' \in I(u)} \sum_{v' \in I(v)} S_k(u', v'), & u \neq v \end{cases} \qquad (2.13)$$

Equation 2.12 denotes the initial step of SimRank, which encodes that an object is exactly similar to itself, that is, $S_0(u, u) = 1$. Equation 2.13 denotes the similarity score function between $u$ and $v$ after $k + 1$ steps, where $S_k(u, v) \in [0, 1]$ denotes the similarity score between nodes $u$ and $v$ after $k$ iterations. According to Equation 2.13, the similarity score

$S_{k+1}(u, v)$ is calculated as the sum of the similarity scores between all in-neighbor

pairs of $u$ and $v$, $\sum_{u' \in I(u)} \sum_{v' \in I(v)} S_k(u', v')$, normalized by the total number of in-neighbor pairs, $|I(u)|\,|I(v)|$. Note that $u$ and $v$ may not have any in-neighbors; since SimRank is based on the concept that "two objects are similar if they are referenced by similar objects," the similarity score between $u$ and $v$ then becomes 0. According to Jeh and Widom [32], the constant $C \in [0, 1]$ represents a confidence level or a decay factor. Consider a simple scenario where a paper $w$ references both papers $u$ and $v$. The similarity of $w$ to itself is one, and we do not want $S(w, w) = S(u, v) = 1$. Thus, we let $S(u, v) = C \cdot S(w, w)$ to indicate that we are less confident about the similarity between $u$ and $v$ than we are about the similarity between $w$ and itself. In Jeh and Widom's paper [32], the constant $C$ is set to 0.8. SimRank can also be written in matrix form [34] based on Equation 2.13. The matrix form is defined as:

$$S_0 = I_n \qquad (2.14)$$

$$S_{k+1} = C \cdot (Q^T \cdot S_k \cdot Q) \vee I_n \qquad (2.15)$$

where $S_k \in \mathbb{R}^{|V| \times |V|}$ is the similarity matrix after $k$ steps, $|V|$ denotes the total number of nodes, and $S_{u,v}$ denotes the similarity score between two nodes $u, v \in V$. $Q \in \mathbb{R}^{|V| \times |V|}$ is a column-normalized adjacency matrix with $Q[u, v] = \frac{1}{|I_v|}$ if $u$ cites $v$, and 0 otherwise. $\vee$ is an entry-wise maximum operator, i.e., $(A \vee B)_{u,v} = \max\{A_{uv}, B_{uv}\}$. Since Equation 2.15 is a non-linear recursive form, it is difficult to compute, because the entry-wise maximum needs to be computed at each iteration [35]. There is another, approximate matrix form for SimRank, proposed in [36] and [37], which is defined as:

$$S_{k+1} = C \cdot (Q^T \cdot S_k \cdot Q) + (1 - C) \cdot I \qquad (2.16)$$

The difference between Equations 2.15 and 2.16 is that Equation 2.16 is a linear recursive matrix form and is therefore easier to compute. The trade-off is that this similarity matrix no longer provides exact SimRank scores, since the diagonal entries of $S$ are not equal to 1. However, Kusumoto et al. [35] point out that the

SimRank scores and the approximated SimRank scores (2.16) of highly similar vertices lie on a straight line. This means that the term $(1 - C) \cdot I$ in Equation 2.16 only changes the scale of the SimRank scores, not the similarity ranking between nodes.

Figure 2.2. A Sample Citation Graph (adopted from Hamedani et al. [1])

Although SimRank has been successfully used in different applications, Hamedani et al. [1] point out three existing limitations of SimRank:

1. In-links Consideration Problem: According to Equation 2.13, the similarity score $S(u, v)$ is based on the in-neighbors of nodes $u$ and $v$. If one of the in-neighbor sets is empty, $I_u = \emptyset$ or $I_v = \emptyset$, or if there is no common paper directly citing both of them, the similarity score between papers $u$ and $v$ is 0. This means that if papers $u$ and $v$ both cite the same paper $w$, SimRank will not take this into consideration when determining whether $u$ and $v$ are similar. For example, consider the node pair $e$ and $h$ in Figure 2.2, generated by [1]. $S(e, h) = 0$ because $I_h = \emptyset$. However, papers $e$ and $h$ might be considered similar because they both cite papers $g$ and $i$.

2. Pairwise Normalization Problem: The SimRank score is normalized by the total number of in-neighbor pairs, $|I(u)|\,|I(v)|$. However, the SimRank similarity score between two well-known papers that are referenced by many other papers can be lower than that of other pairs of papers that are cited by a small number of papers [38]. Consider pairs $(b, c)$ and $(g, i)$ in Figure 2.2 [1]:

since pair $(g, i)$ is cited by two common papers, $e$ and $h$, while pair $(b, c)$ is cited by one common paper, the similarity score of pair $(g, i)$ should be greater than that of pair $(b, c)$. However, SimRank gives pair $(b, c)$ a higher similarity score than pair $(g, i)$. Another normalization problem [39] is that when a large number of papers cite both papers $u$ and $v$, the SimRank score $S(u, v)$ might converge to zero because of the term $\frac{1}{|I(u)|\,|I(v)|}$.

3. Level-Wise Computation Problem: The SimRank score only considers paths of equal length from common sources [40]. For example, consider pair $(c, e)$ in Figure 2.2. The SimRank score $S(c, e)$ is zero because they have no common in-neighbors. However, $c$ and $e$ might be considered similar because there is a path from $a$ to $e$ and a path from $a$ to $c$. This means SimRank exploits the graph topology level-by-level [1, 41]; the similarity between two papers is considered zero if they are not cited by any common direct citation.

Hamedani et al. [1] also mention several SimRank variants that can address the limitations mentioned above:

1. Penetrating Rank (P-Rank) [42]: The main idea of P-Rank is similar to SimRank's "two objects are similar if they are referenced by similar objects," but P-Rank additionally considers that "two objects are similar if they both reference similar objects." In other words, P-Rank considers not only in-neighbors but also out-neighbors as factors that affect the similarity score. P-Rank is defined as:

$$S_0(u, v) = \begin{cases} 1, & u = v \\ 0, & u \neq v \end{cases} \qquad (2.17)$$

$$S_{k+1}(u, v) = \begin{cases} 1, & u = v \\[4pt] \alpha \cdot \dfrac{C}{|I(u)|\,|I(v)|} \displaystyle\sum_{u' \in I(u)} \sum_{v' \in I(v)} S_k(u', v') \;+\; (1 - \alpha) \cdot \dfrac{C}{|O(u)|\,|O(v)|} \displaystyle\sum_{u'' \in O(u)} \sum_{v'' \in O(v)} S_k(u'', v''), & u \neq v \end{cases} \qquad (2.18)$$

where $O(u)$ denotes the out-neighbors (papers cited directly by paper $u$) of node $u$, and $\alpha \in [0, 1]$ is the weighting parameter for in- and out-links. Since

the first part of Equation 2.18, $\frac{C}{|I(u)|\,|I(v)|} \sum_{u' \in I(u)} \sum_{v' \in I(v)} S_k(u', v')$, is exactly the SimRank equation, the P-Rank score is the same as SimRank when $\alpha = 1$. When $\alpha = 0$, the P-Rank formula becomes reverse-SimRank (rvs-SimRank), which only considers out-neighbors. Note that SimRank and rvs-SimRank both suffer from the same links consideration problem, because each considers only in- or out-neighbors. Since P-Rank considers out-neighbors as well, it can handle the in-links consideration problem.

2. Connector-Based Similarity measure (C-Rank) [43]: C-Rank is a similarity measure that was first designed for scientific literature databases. It argues that the similarity between two papers is based on the following three cases:

(a) The number of papers referenced by both $u$ and $v$
(b) The number of papers which reference both $u$ and $v$
(c) The number of papers that reference either $u$ or $v$ and are referenced by the other paper

Note that SimRank only considers case (a), and P-Rank considers cases (a) and (b) by setting weighting parameters. Unlike SimRank and P-Rank, C-Rank considers all three cases simultaneously. It is defined as [43]:

$$S_0(u, v) = \begin{cases} 1, & u = v \\ 0, & u \neq v \end{cases} \qquad (2.19)$$

$$S_{k+1}(u, v) = \begin{cases} 1, & u = v \\[4pt] C \cdot \left( \dfrac{|I_u \cap I_v|}{|I_u \cup I_v|} + \dfrac{\sum_{u' \in I_u \setminus I_v} \sum_{v' \in I_v} S_k(u', v')}{|I_u \cup I_v|\,|I_v|} + \dfrac{\sum_{u' \in I_v \setminus I_u} \sum_{v' \in I_u} S_k(u', v')}{|I_v \cup I_u|\,|I_u|} \right), & u \neq v \end{cases} \qquad (2.20)$$

C-Rank handles the pairwise normalization problem by considering the

Jaccard coefficient score between $I_u$ and $I_v$, the sum of similarity scores among all possible pairs between $I_u \setminus I_v$ and $I_v$ normalized by $|I_u \cup I_v|\,|I_v|$, and the sum of similarity scores among all possible pairs between $I_v \setminus I_u$ and

$I_u$ normalized by $|I_v \cup I_u|\,|I_u|$. The idea is that a paper in $I_u \cap I_v$ should affect the similarity more than the papers that appear only in $I_u$ or only in $I_v$.

3. PSimRank [39]: PSimRank also solves the pairwise normalization problem, similarly to C-Rank. The difference between PSimRank and C-Rank is that C-Rank normalizes the sum of similarity scores among all possible pairs between $I_u \setminus I_v$ and $I_v$ by $|I_u \cup I_v|\,|I_v|$, while PSimRank normalizes the sum of similarity scores by $|I_u \setminus I_v|\,|I_v|$. PSimRank is defined as [39]:

$$S_0(u, v) = \begin{cases} 1, & u = v \\ 0, & u \neq v \end{cases} \qquad (2.21)$$

$$S_{k+1}(u, v) = \begin{cases} 1, & u = v \\[4pt] C \cdot \left( \dfrac{|I_u \cap I_v|}{|I_u \cup I_v|} + \dfrac{\sum_{u' \in I_u \setminus I_v} \sum_{v' \in I_v} S_k(u', v')}{|I_u \setminus I_v|\,|I_v|} + \dfrac{\sum_{u' \in I_v \setminus I_u} \sum_{v' \in I_u} S_k(u', v')}{|I_v \setminus I_u|\,|I_u|} \right), & u \neq v \end{cases} \qquad (2.22)$$

4. MatchSim [44]: Unlike the other methods, MatchSim defines the similarity score between $u$ and $v$ by averaging the similarity scores of the maximum matching among their neighbors. To calculate $S(u, v)$, MatchSim

first constructs a weighted bipartite graph $G_{u,v} = (I_u + I_v, E, w)$, where

$E = \{(i, j) \mid i \in I_u, j \in I_v\}$ and $w(i, j) = S_{k-1}(i, j)$. The similarity function of MatchSim is defined as:

$$S_0(u, v) = \begin{cases} 1, & u = v \\ 0, & u \neq v \end{cases} \qquad (2.23)$$

$$S_{k+1}(u, v) = \frac{\hat{W}(u, v)}{\max(|I_u|, |I_v|)} \qquad (2.24)$$

where $\hat{W}(u, v)$ denotes the weight of the maximum matching $m_{uv}$ between

$I_u$ and $I_v$ in $G_{u,v}$:

$$\hat{W}(u, v) = \sum_{(i, j) \in m_{uv}} S(i, j) \qquad (2.25)$$

and it can be calculated using algorithms for the assignment problem.

5. SimRank* (Geometric Series Form) [40]: The idea behind SimRank* is that it considers the paths that are ignored by the SimRank similarity computation. The recursive form of geometric SimRank* is defined as:

$$\hat{S} = \frac{C}{2} \cdot (Q \cdot \hat{S} + \hat{S} \cdot Q^T) + (1 - C) \cdot I_n \qquad (2.26)$$

with the initial step $\hat{S}_0 = I$. Although SimRank* considers more paths than SimRank during the computation step, it does not increase the computational cost. This is because Equation 2.26 only requires one matrix multiplication: since $\hat{S}$ is a symmetric matrix, $\hat{S} \cdot Q^T$ is simply the transpose of $Q \cdot \hat{S}$. More detail can be found in [40].
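Of the forms above, the linearized matrix form of basic SimRank (Equation 2.16) is the simplest to sketch in code. The following minimal illustration is ours (not code from the cited works); it assumes the graph is given as an adjacency matrix with $A[u, v] = 1$ when $u$ cites $v$, and the iteration cap and tolerance are arbitrary choices:

```python
# Linearized SimRank (Equation 2.16): S_{k+1} = C * Q^T S_k Q + (1 - C) * I,
# where Q is the column-normalized adjacency matrix (Q[u, v] = 1/|I_v| if u cites v).
import numpy as np

def simrank_linear(A, C=0.8, max_iter=100, tol=1e-6):
    n = A.shape[0]
    in_deg = A.sum(axis=0)
    Q = A / np.where(in_deg == 0, 1.0, in_deg)   # column-normalize; avoid /0
    S = np.eye(n)                                # S_0 = I_n
    for _ in range(max_iter):
        S_next = C * (Q.T @ S @ Q) + (1 - C) * np.eye(n)
        if np.abs(S_next - S).max() < tol:
            break
        S = S_next
    return S_next

# Toy citation graph: paper 2 cites both papers 0 and 1, so 0 and 1 share an
# in-neighbor and receive a nonzero score (approximately C * (1 - C) = 0.16).
A = np.array([[0, 0, 0],
              [0, 0, 0],
              [1, 1, 0]], dtype=float)
print(simrank_linear(A)[0, 1])
```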

2.1.3.2 Asymmetric Structure COntext Similarity (ASCOS)

Asymmetric Structure COntext Similarity (ASCOS) [2] is another similarity measure that captures the similarity scores between any pair of nodes in a network. Unlike SimRank, which defines the similarity score $S(u, v)$ between a pair $u$ and $v$ based on their in-neighbors, ASCOS defines the similarity score through the similarity scores between $v$ and all in-neighbor nodes of $u$. The ASCOS similarity score considers only the in-neighbors because Chen et al. [2] suggest that an object is defined by how others describe it, not by how it describes others. Since ASCOS only considers in-neighbors, the similarity score it produces is asymmetric, i.e., $S(u, v) \neq S(v, u)$. This differs from traditional similarity measures, whose measure functions are usually defined to be proportional to the inverse of some distance function, making the similarity scores symmetric, $S(u, v) = S(v, u)$. By definition, the statement "a is similar to b" should be equivalent to the statement "b is similar to a." However, studies have shown that this might not be true for some objects, like faces, countries, or personalities [45]. People tend to agree with "a is similar to b" rather than "b is similar to a" when b is more general or distinctive.

Figure 2.3. A toy network (adopted from [2])

Consider nodes 1 and 4 in Figure 2.3. People might think that node 4 is similar to node 1 rather than that node 1 is similar to node 4. Naïve ASCOS is defined as:

$$S(u, v) := \begin{cases} 1, & u = v \\ \dfrac{C}{|I(u)|} \displaystyle\sum_{w \in I(u)} S(w, v), & u \neq v \end{cases} \qquad (2.27)$$

where $I(u)$ is the set of in-neighbors of node $u$, and $C \in [0, 1]$ is the relative importance parameter, which controls the relative importance of direct neighbors versus indirect neighbors (i.e., neighbors' neighbors); the higher the value, the more important the indirect neighbors are. Naïve ASCOS has $O(n^2)$ complexity in both computation time and space, since it stores the entire similarity matrix $S$ during each iteration. This can become a significant problem because modern social networks usually have a large number of nodes. Equation 2.27 can also be written in non-recursive form. Given a graph $G$ and its adjacency matrix $A = [a_{ij}]$, we can calculate the column-normalized matrix $P = [p_{ij}]$, where $p_{ij} = \frac{a_{ij}}{\sum_k a_{kj}}$. After a sufficiently large number of iterations, Equation 2.27 can be re-written as:

$$S = (1 - C)(I - CP^T)^{-1} \qquad (2.28)$$

Equation 2.28 differs from Equation 2.27 in one respect: the diagonal entries of $S$ are not set to one, i.e., $s_{ii} \neq 1$, where $i$ ranges over all nodes in $G$.

However, this only affects the absolute similarity scores, not the relative relationship between similarity scores. For example, if Equation 2.27 yields similarity scores such that $s_{ij} > s_{kj}$, this relationship still holds when we calculate the similarity with Equation 2.28. Because of the matrix inverse operation, the run time of non-recursive ASCOS increases from $O(n^2)$ to $O(n^3)$, with $O(n^2)$ space complexity, compared to naïve ASCOS [2]. Thus, Chen et al. [2] introduce two efficient ASCOS calculations that decrease the runtime of Equation 2.28:

1. Low Rank Approximation: First, let $Q = I - CP^T$ from Equation 2.28. Using Singular Value Decomposition (SVD), $Q$ can be factorized into $U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix. $Q$ can be approximated by $\tilde{Q} = \tilde{U} \tilde{\Sigma} \tilde{V}^T$, where $\tilde{\Sigma}$ is a diagonal matrix that keeps the largest $\tau$ values in $\Sigma$, $\tilde{U}$ is the first $\tau$ columns of $U$, and $\tilde{V}^T$ is the first $\tau$ rows of $V^T$. The inverse of $Q$ can then be approximated by $\tilde{Q}^{-1} = \tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^T$. The Low Rank Approximation form can be re-written as:

$$S \approx (1 - C)\,\tilde{V} \tilde{\Sigma}^{-1} \tilde{U}^T \qquad (2.29)$$

Chen et al. [2] point out that the Low Rank Approximation rests on the assumption that there are notable linear correlations in $Q$; it becomes less accurate if there is no obvious linear correlation in $Q$.

2. The Gauss-Seidel (GS) Approach: Unlike the low rank approximation, which tries to calculate $Q^{-1}$ efficiently in order to get $S$, we can turn Equation 2.28 into a classic linear-algebraic system that does not require calculating $Q^{-1}$ at all. From Equation 2.28, we can split $S$ into $n$ column vectors

$S = [S_1, S_2, \ldots, S_n]$ and $I$ into $n$ column vectors $I = [I_1, I_2, \ldots, I_n]$. Equation 2.28 can now be written as:

$$(I - CP^T)\,S_i = (1 - C)\,I_i \qquad (2.30)$$

where $i = 1, 2, \ldots, n$. Now we can see that $I - CP^T$ is a coefficient matrix of dimension $n \times n$, $(1 - C)I_i$ is a constant column vector of size $n$, and $S_i$ is an unknown column vector. Chen et al. [2] applied the Gauss-Seidel method,

which is a recursive method for solving a linear system of equations by repeatedly improving the result until $S_i$ converges. The advantage of the GS method is that it only requires calculating $1/n$ of the network's nodes to obtain the similarity score between a single pair of nodes, whereas Equations 2.27 and 2.28 require calculating the entire network even when we only want the similarity score between a single pair of nodes. Moreover, since the GS method only needs $1/n$ of the network's nodes for a single node pair's similarity score, it becomes possible to store all the variables in main memory even for large networks. The time complexity for one pair of nodes is $O(n)$, and the run time for computing all pairs of nodes is $O(n^2/k)$, where $k$ is the number of machines used to calculate the similarity scores between all pairs of nodes.

Chen and Giles [3] introduced ASCOS++, which considers both the topology of the network and the edge weights in the network. The idea is that two nodes joined by a larger edge weight are more likely to be similar than two nodes joined by a smaller edge weight. Consider nodes $a$, $b$ and $c$ in Figure 2.4: $S(b, c)$ should have a larger similarity score than $S(a, b)$. On the other hand, if two pairs of nodes have the same edge weights, e.g., the weights of both edge $e(a, b)$ and edge $e(a', b')$ are one in Figure 2.4, then because the weight of $e(b, c)$ is 10 while the weight of $e(b', c')$ is one, the importance of $a'$ and $b'$ is larger than the importance of $a$ and $b$, since $b$ has a relatively more important neighbor $c$.

Figure 2.4. A toy network with edge weights (adopted from [3])

In order to handle the above scenarios, Chen et al. [3] adopted the Consistency Rules originally defined in Antonellis et al. [46]. Consider a weighted network

$G = (V, E)$, with four nodes $v_i, v_j, v_k, v_l \in V$ and two edges $e(v_i, v_j), e(v_k, v_l) \in E$. The weights of the two edges can be defined as $w_{ij}$ and $w_{kl}$, and the sum of all

edge weights connected to node $p$ can be defined as $w_{p*} = \sum_{r \in I(p)} w_{pr}$. Similarity scores $S(i, j)$ and $S(k, l)$ follow the Consistency Rules if the following statement is true:

$$\text{if } (1)\ w_{ij} > w_{kl} \ \text{ and } \ (2)\ \frac{w_{ij}}{w_{i*}} > \frac{w_{kl}}{w_{k*}}, \ \text{ then } S(i, j) > S(k, l) \qquad (2.31)$$

Based on the Consistency Rules, ASCOS++ is defined as follows:

$$S(u, v) := \begin{cases} 1, & u = v \\ C \displaystyle\sum_{k \in I(u)} \frac{w_{uk}}{w_{u*}} \left(1 - \exp(-w_{uk})\right) S(k, v), & u \neq v \end{cases} \qquad (2.32)$$

where $w_{uk}$ is the weight of edge $(u, k)$ and $w_{u*} = \sum_{k \in I(u)} w_{uk}$. The term $1 - \exp(-w_{uk})$ captures the first condition in the Consistency Rules, and the term $\frac{w_{uk}}{w_{u*}}$ captures the second condition. Since $C$, $\sum_{k \in I(u)} \frac{w_{uk}}{w_{u*}}(1 - \exp(-w_{uk}))$, and $S(k, v)$ all lie between 0 and 1, $S(u, v) \in [0, 1]$.
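For the unweighted case, the non-recursive form of plain ASCOS (Equation 2.28) can be sketched with a single matrix inverse, which also makes the $O(n^3)$ cost of that form explicit. A minimal illustration of ours, not code from [2]:

```python
# Non-recursive ASCOS (Equation 2.28): S = (1 - C) * (I - C * P^T)^{-1},
# where P is the column-normalized adjacency matrix.
import numpy as np

def ascos_closed_form(A, C=0.9):
    n = A.shape[0]
    col_sums = A.sum(axis=0)
    P = A / np.where(col_sums == 0, 1.0, col_sums)   # column-normalize; avoid /0
    return (1 - C) * np.linalg.inv(np.eye(n) - C * P.T)

# Toy network: node 0 is connected to nodes 1 and 2. The scores are
# asymmetric, so S[1, 0] (node 1's similarity to node 0) differs from S[0, 1].
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
S = ascos_closed_form(A)
print(S[1, 0], S[0, 1])
```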

2.2 Graph Similarity

Suppose we have two graphs, $G_1$ and $G_2$, both with known node correspondence (i.e.,

$G_1$ and $G_2$ are defined on the same set of nodes). How can we determine the difference between the two graphs, and which nodes and edges are mainly causing that difference? The operative questions can be: How different is the network compared to last month? How different are the connections in a left-handed male's brain and a right-handed female's brain, and what are the main differences between them? These kinds of questions are often asked when studying multiple networks. One way to address them is to use graph similarity. Graph similarity can also be used for sense-making: if we detect an abnormal change in a computer network, we might conclude that the network is under attack. In this section, we discuss two problems:

1. How to compare two networks and evaluate the degree of their similarity

2. How to identify the main nodes or edges that are causing the difference between two networks

Figure 2.5. Symbols and Definitions for DeltaCon

We also introduce DeltaCon [47], a scalable graph similarity method that solves the two aforementioned problems. In [48], the first problem is defined as:

Problem. Given

1. two graphs, $G_1(V, E_1)$ and $G_2(V, E_2)$, with the same node set $V$ but different

edge sets $E_1$ and $E_2$;

2. the node correspondence.

Find a similarity score $Sim(G_1, G_2) \in [0, 1]$ between the input graphs. A score of 0 means totally different graphs, while 1 means identical graphs.

The intuitive way to solve this problem is to measure the edge overlap of the two graphs [49]. However, this often does not work in practice. Consider the graph similarity scores $Sim_1 = S(T_1, T_2)$ and $Sim_2 = S(T_1, T_3)$ for the graphs in Figure 2.6:

Figure 2.6. Toy Networks

$Sim_1$ will be equal to $Sim_2$ if we use the edge overlap technique. However, from the information flow perspective, $Sim_1$ should be greater than $Sim_2$, since $T_3$ is missing an important "bridge" edge, $E(3, 4)$, while edge $E(5, 6)$ is not as important as edge $E(3, 4)$. DeltaCon solves this problem in two steps:

1. Compute the pairwise node affinity matrices for both $G_1$ and $G_2$, and store them in similarity matrices $S_1$ and $S_2$ respectively. $S_1$ and $S_2$ are both $n \times n$ matrices; each entry $s_{ij}$ of a similarity matrix represents how similar node $i$ is to node $j$.

2. Measure the difference (using some distance function) between $S_1$ and $S_2$ and report the similarity score $Sim(S_1, S_2) \in [0, 1]$.

2.2.1 Measuring Node Affinities: FaBP

The first step of DeltaCon is to calculate the affinity matrices for both graphs,

$G_1$ and $G_2$. There are a few methods that can compute node affinities: we can use a global similarity measure, and Koutra et al. [4] mention that Random Walks with Restarts (RWR) [50], lazy RWR [51], and the "electrical network analogy" technique [52] are just a few of the available options. DeltaCon chooses the most recent and principled method [53], Fast

Belief Propagation (FaBP), a fast approximation of the loopy Belief Propagation algorithm [54]:

$$[I + \epsilon^2 D - \epsilon A]\,\vec{s}_i = \vec{e}_i \qquad (2.33)$$

where $\vec{s}_i = [s_{i1}, \ldots, s_{in}]^T$ is the column vector of final similarity scores starting from the $i$th node, $\epsilon$ is a positive constant (less than 1) encoding the influence between neighbors, $I$ is the identity matrix, $A$ is the adjacency matrix, $D$ is the diagonal degree matrix, and $\vec{e}_i$ is the indicator vector for node $i$. Equation 2.33 can be written in matrix form if we stack all the $\vec{s}_i$ vectors ($i = 1, \ldots, n$) into the $n \times n$ matrix $S$:

$$S = [s_{ij}] = [I + \epsilon^2 D - \epsilon A]^{-1} \qquad (2.34)$$

There are a couple of advantages [4] to using FaBP as the node affinity method: (i) it is based on maximum likelihood estimation of the marginals; (ii) it is linear in the number of edges; (iii) it considers not only direct-neighbor information, but additionally 2-step, 3-step, and generally $k$-step-away neighbor information. If we consider Equation 2.34, ignore the term $\epsilon^2 D$, and only look at the term $[I - \epsilon A]^{-1}$, this is exactly the Maclaurin series $(1 - x)^{-1} = 1 + x + x^2 + \ldots$. Thus, we can write the term as:

$$S \approx [I - \epsilon A]^{-1} \approx I + \epsilon A + \epsilon^2 A^2 + \ldots \qquad (2.35)$$

where $\epsilon$ is a positive constant ($< 1$) encoding the influence between neighbors, and $A^k$ captures the $k$-step paths. This approach captures the difference information of 1-step, 2-step, ..., $k$-step neighborhoods in a weighted way: since $\epsilon$ is smaller than one, the longer the path, the smaller its effect on the similarity measure.
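Equation 2.34 amounts to a single matrix inversion. A minimal sketch of ours ($\epsilon = 0.1$ is an arbitrary small constant):

```python
# FaBP affinity matrix (Equation 2.34): S = [I + eps^2 * D - eps * A]^{-1}.
import numpy as np

def fabp_affinity(A, eps=0.1):
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))                   # diagonal degree matrix
    return np.linalg.inv(np.eye(n) + eps**2 * D - eps * A)

# Toy usage: a 3-node path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(fabp_affinity(A).round(3))
```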

2.2.2 Distance Measure Between Graphs

Figure 2.7. Algorithm: DeltaCon (adopted from [4])

Figure 2.8. Algorithm: DeltaCon Node Attribution (adopted from [4])

After calculating the $n \times n$ affinity matrices $S_1$ and $S_2$, DeltaCon computes the distance between the two corresponding graphs using the root Euclidean distance (RootED, also known as the Matusita distance):

$$d = \text{RootED}(S_1, S_2) = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left(\sqrt{S_{1,ij}} - \sqrt{S_{2,ij}}\right)^2} \qquad (2.36)$$

The advantage of using RootED as the distance function is that the node affinities lie within [0, 1]; taking their square roots magnifies small differences, which means RootED can detect even minimal changes in the graphs. Finally, the distance is converted to a similarity score using the formula $Sim = \frac{1}{1 + d}$, which guarantees that the graph similarity score is bounded to [0, 1].
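Combining the two DeltaCon steps, the following sketch (ours) computes the RootED distance of Equation 2.36 and converts it to a similarity score; it reuses fabp_affinity from the sketch above and assumes $\epsilon$ is small enough that all affinities are non-negative:

```python
# DeltaCon similarity: d = RootED(S1, S2) (Equation 2.36), Sim = 1 / (1 + d).
import numpy as np

def root_ed(S1, S2):
    return np.sqrt(((np.sqrt(S1) - np.sqrt(S2)) ** 2).sum())

def deltacon_similarity(A1, A2, eps=0.1):
    S1 = fabp_affinity(A1, eps)     # affinity matrices of the two graphs
    S2 = fabp_affinity(A2, eps)
    return 1.0 / (1.0 + root_ed(S1, S2))

# Toy usage: removing the bridge edge 1 - 2 lowers the similarity.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = A1.copy()
A2[1, 2] = A2[2, 1] = 0
print(deltacon_similarity(A1, A2))
```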

2.2.3 DeltaCon Node Attribution Function

Koutra et al. [4] also introduced another function, which they call the Node Attribution function. This function returns a $1 \times n$ matrix that indicates which nodes are the key nodes responsible for the changes in the graph. This is useful because we can apply it to anomaly detection tasks: given two graphs, healthy and unhealthy, where each node represents a potential biomarker, we can determine which biomarkers are associated with the disease. The steps to calculate the node attribution are:

First, let the affinity matrices $S_1$ and $S_2$ be pre-computed.

1. Calculate the score for each node using the corresponding row vectors of $S_1$ and $S_2$. This can be done with the same distance function, RootED, which calculates the distance between the two vectors. The equation form is written as:

$$w_v = \text{RootED}(S_{1,v}, S_{2,v}) = \sqrt{\sum_{j=1}^{n} \left(\sqrt{S_{1,vj}} - \sqrt{S_{2,vj}}\right)^2} \qquad (2.37)$$

2. Sort the scores into a $1 \times n$ matrix and report the important scores and their node ids.
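A minimal sketch (ours) of the attribution step in Equation 2.37, assuming the two affinity matrices have already been computed as above:

```python
# DeltaCon node attribution (Equation 2.37): per-node RootED between the
# corresponding rows of the two affinity matrices, then rank by score.
import numpy as np

def node_attribution(S1, S2):
    w = np.sqrt(((np.sqrt(S1) - np.sqrt(S2)) ** 2).sum(axis=1))
    ranked = np.argsort(-w)        # node ids, most responsible first
    return ranked, w
```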

Chapter 3 Anomaly Detection Methods

Anomaly detection, also called outlier detection, is a family of tasks that aims to discover rare instances in a dataset. The first definition of anomaly detection was given by Grubbs [55]: "An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs." Grubbs' definition can now be extended with two other important characteristics [56]: (i) anomalies are different from the norm from a feature perspective; (ii) anomalies are rare compared to the normal instances in the dataset. Anomaly detection has many successful applications across domains, including intrusion detection in networks [57, 58], credit card fraud detection [59], health insurance claim errors [60], and patient monitoring [61]. There also exist extensive surveys on anomaly detection: Chandola et al. [16] cover anomaly detection techniques, Schubert et al. [62] summarize local outlier detection techniques, Goldstein and Uchida [56] compare and evaluate different unsupervised anomaly detection algorithms for multivariate data, and Akoglu et al. [63] focus on comparing different graph-based anomaly detection methods. In this chapter, we introduce the anomaly detection methods, selected from the scalable outlier detection toolbox PyOD [5], that are used in Chapter 4, and provide further explanation of each algorithm.

Figure 3.1. Select Outlier Detection Models in PyOD (adopted from [5])
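All of the detectors described in this chapter are used through PyOD's common interface: construct a model, call fit on the data matrix, and read each sample's outlier score from decision_scores_. A minimal sketch, assuming PyOD is installed; the toy data and the particular detectors shown are our choices:

```python
# Common PyOD usage pattern shared by the detectors in this chapter.
import numpy as np
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.lof import LOF

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 5),               # 100 normal samples
               rng.randn(5, 5) * 0.5 + 8.0])    # 5 planted outliers

for model in (HBOS(), IForest(random_state=0), LOF(n_neighbors=20)):
    model.fit(X)
    scores = model.decision_scores_             # higher = more anomalous
    print(type(model).__name__, np.argsort(-scores)[:5])
```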

3.1 Auto-Encoder

The Auto-Encoder is a type of neural network that finds a middle layer such that the input data are transformed into this low-dimensional middle layer and can be reconstructed back into the input data. One of the early motivations of the Auto-Encoder was dimensionality reduction, so that the low-dimensional features represent the data better than the original features. This idea can also be applied to the anomaly detection task, similarly to the PCA approach discussed below, but with lower reconstruction error [64, 65].
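A minimal usage sketch via PyOD's AutoEncoder model, assuming PyOD and its optional deep-learning backend are installed; the default architecture is used, and X is the toy data matrix from the sketch above:

```python
# Auto-encoder outlier scores via PyOD: each sample's reconstruction error
# (distance between the input and its reconstruction) serves as its score.
from pyod.models.auto_encoder import AutoEncoder

ae = AutoEncoder()       # default hidden layers; tune layers/epochs as needed
ae.fit(X)                # X: the toy data matrix defined earlier
ae_scores = ae.decision_scores_
```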

3.2 Clustering-Based Local Outlier Factor (CBLOF)

The Cluster-Based Local Outlier Factor (CBLOF) method captures the density of the cluster that a data instance belongs to and the distance between the data instance and that cluster. It is calculated using the following steps [66]:

1. Get clusters $C_i$, $i = 1 \ldots k$, from the input data $G$ using any clustering algorithm, where $C = \{C_1, C_2, \ldots, C_k\}$ with $|C_1| \geq |C_2| \geq \ldots \geq |C_k|$, and the parameter $k$ is the total number of clusters.

2. Split $C$ into Large clusters and Small clusters based on two conditions:

(a) $|C_1| + |C_2| + \cdots + |C_b| \geq |G| \cdot \alpha$

(b) $|C_b| \,/\, |C_{b+1}| \geq \beta$

where the parameter $\alpha$ is the percentage of normal instances represented in the dataset $G$, $\beta$ is the ratio of the size of the small cluster to the size of the large cluster, and $b$ is the boundary between the Large and Small clusters.

3. For a node $v \in G$ belonging to cluster $C_i$, where $C_i$ belongs to the small clusters, $v$'s CBLOF score is equal to the size of $C_i$ times the minimum distance between $v$ and $C_j$, for $j = 1 \ldots b$. For a node $u \in G$ belonging to cluster $C_i$, where $C_i$ belongs to the large clusters, $u$'s CBLOF score is equal to the

size of $C_i$ times the distance between $u$ and $C_i$. It is important to know that the performance of CBLOF is highly dependent on the clustering method used; some clustering methods may not fit anomaly detection tasks well [67]. In our experiments, we use MiniBatchKMeans as the default clustering algorithm instead of the Squeezer algorithm suggested by He et al. [66]. He et al. [66] state that the local cluster density can be estimated by using the number of cluster members as a scaling factor. However, Amer et al. [68] mention that this assumption might be incorrect: the number of instances in a cluster does not necessarily represent its density. Further evaluation can be found in Goldstein and Uchida [56].
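A minimal CBLOF sketch via PyOD, passing MiniBatchKMeans as the clustering step as in our experiments; the values of the cluster count, $\alpha$, and $\beta$ shown here are illustrative:

```python
# CBLOF with MiniBatchKMeans as the clustering algorithm. alpha and beta
# control the split into large and small clusters, as described above.
from pyod.models.cblof import CBLOF
from sklearn.cluster import MiniBatchKMeans

clf = CBLOF(n_clusters=8, alpha=0.9, beta=5,
            clustering_estimator=MiniBatchKMeans(n_clusters=8, n_init=3))
clf.fit(X)                      # X: the toy data matrix defined earlier
cblof_scores = clf.decision_scores_
```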

3.3 Histogram-based Outlier Score (HBOS)

The Histogram-based Outlier Score (HBOS) [69] is an unsupervised anomaly detection method that calculates an anomaly score from each instance's features based on the density of the bins in each feature's histogram. The steps are as follows:

1. Create a histogram for each feature based on the parameter $k$, the number of bins in the histogram. There are two approaches to constructing histograms:

(a) Fixed bin-width histograms
(b) Dynamic bin-width histograms

Goldstein and Dengel [69] recommend the second approach, since outliers usually lie far away from normal instances, which means the value ranges in the histogram might have large gaps.

2. Calculate the HBOS score for each instance with the formula:

$$HBOS(v) = \sum_{i=0}^{d-1} \log\left(\frac{1}{hist_i(v)}\right)$$

where $hist_i(v)$ is the height of the bin of feature $i$ in which instance $v$ falls, and $v \in G$ is a node of the dataset. The default value for the number of bins $k$ is set to 10; evaluations with other values can be found in Goldstein and Dengel [69]. Note that HBOS assumes the features are independent, which might make the results less precise. However, this assumption makes the HBOS computation time linear, much faster than nearest-neighbor methods such as the Local Outlier Factor (LOF) discussed below. Moreover, the disadvantage of assuming independent features becomes less significant if the dataset is high-dimensional, due to greater sparsity.
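A from-scratch sketch (ours) of the fixed bin-width variant of the score; PyOD's HBOS implements the same idea, with the number of bins exposed as n_bins:

```python
# Histogram-based outlier score with fixed-width bins: for each feature,
# score a sample by log(1 / normalized bin height) and sum over features.
import numpy as np

def hbos_scores(X, n_bins=10):
    n, d = X.shape
    scores = np.zeros(n)
    for i in range(d):
        heights, edges = np.histogram(X[:, i], bins=n_bins)
        heights = heights / heights.max()               # normalize to [0, 1]
        bins = np.clip(np.digitize(X[:, i], edges[1:-1]), 0, n_bins - 1)
        scores += np.log(1.0 / np.maximum(heights[bins], 1e-12))
    return scores

print(np.argsort(-hbos_scores(X))[:5])    # X: the toy data matrix defined earlier
```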

3.4 Isolation Forest (IForest)

Unlike other anomaly detection methods, the Isolation Forest (IForest) does not construct a profile of normal nodes before it scores each node. Instead, it explicitly isolates anomalous instances by constructing isolation trees. The algorithm is defined as follows. Let $G$ be the input data, $G = \{v_1, v_2, \ldots, v_n\}$, where each node $v_i \in G$ has attributes

$Q = \{q_{1v_i}, q_{2v_i}, \ldots, q_{nv_i}\}$, and let $h$ be the height limit of the trees. The anomaly score can be calculated by the following steps [70]:

1. Build isolation trees by recursively dividing $G$: randomly select an attribute $q \in Q$ and a split value $p \in (\min(q), \max(q))$, until the tree height equals $h$.

2. Calculate the expected path length $E(h(x))$ over the IForest, where $h(x)$ is the path length of sample $x$ in a single isolation tree.

3. The anomaly score can be calculated as $s = 2^{-\frac{E(h(x))}{c(n)}}$, where $c(n)$ is the average path length of an unsuccessful search in a Binary Search Tree (BST).

The main idea is that outlier vertices are less frequent than normal vertices. Thus, they are more likely to be separated early in the partitioning and will have shorter-than-average path lengths. There are only two parameters that need to be defined in this algorithm:

1. Number of decision trees

2. Size of the tree

Liu et al. [70] suggest setting the number of trees to 100 and the size of the tree (the subsampling size) to 256. Further discussion can be found in Liu et al. [71].
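A minimal sketch using scikit-learn's IsolationForest (which PyOD's IForest wraps), with the settings suggested above:

```python
# Isolation Forest with the suggested defaults: 100 trees, subsample size 256.
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100,
                          max_samples=min(256, len(X)),   # cap at dataset size
                          random_state=0)
iforest.fit(X)                          # X: the toy data matrix defined earlier
if_scores = -iforest.score_samples(X)   # negate so higher = more anomalous
```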

3.5 Local Outlier Factor (LOF)

The Local Outlier Factor (LOF) [72] is a well-known local anomaly detection algorithm. Given a dataset $G$, the LOF score for each node $v \in G$ is calculated as follows:

1. Find the k-nearest neighbors of node $v$, i.e., the $k$ nodes that are closest to node $v$ under some distance function.

Denote the set of k-nearest neighbors of node $v$ as $N_k(v)$.

2. Denote the reachability distance of node $v \in G$ with respect to node $u \in G$ as $Reach_k(v, u)$. $Reach_k(v, u)$ is defined as the maximum of

the distance $Dist(v, u)$ between nodes $u$ and $v$ and the k-nearest-neighbor distance

$D_k(v)$ of node $v$:

$$Reach_k(v, u) = \max\{D_k(v),\ Dist(v, u)\}$$

The Local Reachability Density (LRD) of node $v$ is the inverse of the average reachability distance over its k-nearest neighbors:

$$LRD_k(v) = \frac{1}{\text{mean}_{u \in N_k(v)}\, Reach_k(v, u)}$$

3. The LOF score is calculated as the average ratio of the LRD values of

corresponding values of all points in Nk(u)

LRDk(v) LOFk(v)=meanu Nk(v) 2 LRDk(u)

Note that the lower $LRD_k(v)$ is relative to the densities of its neighbors, the higher $LOF_k(v)$ will be. Normal instances have densities similar to their neighbors, so their LOF scores are close to 1.0, whereas an anomalous instance has a lower density, which results in a higher LOF score. This algorithm is local because each object v's LOF score relies only on its direct k-nearest neighborhood. On the other hand, global anomalies can still be detected because they have low LRD compared to their neighbors. Goldstein and Uchida[56] report that this algorithm generates many false alarms when the anomaly detection task is mainly concerned with global anomalies. Breunig, Kriegel et al.[72] also discuss the upper and lower bounds of LOF under different choices of k; further evaluation can be found in Goldstein and Uchida[56]. In our experiments, we set k = 20 as the default.
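A minimal sketch of LOF scoring with the PyOD[5] implementation, using k = 20 neighbors as in our experiments; X is an assumed placeholder:

import numpy as np
from pyod.models.lof import LOF

X = np.random.rand(300, 8)          # assumed placeholder feature matrix

detector = LOF(n_neighbors=20)      # k = 20 as in our experiments
detector.fit(X)
scores = detector.decision_scores_  # values near 1.0 indicate inliers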

3.6 Minimum Covariance Determinant(MCD)

The Minimum Covariance Determinant (MCD) estimator[73] is a well-known method that calculates the mean and covariance matrix from the subset of h nodes of the input graph G that minimizes the determinant of the covariance matrix. This method can be used for outlier detection based on the assumption that an outlier node is a node whose Mahalanobis squared distance (MSD) from the center of the data is greater than a cutoff value[73]. The steps of this algorithm are[74]:

1. Apply a clustering algorithm to the original data to find clusters.

2. Calculate the mean and covariance matrix for each cluster.

3. Calculate the MSD of each point with respect to each cluster, based on the most recent mean and covariance.

4. Assign each point to the cluster with the smallest MSD.

5. Calculate a new mean and covariance for each cluster based on the minimum number of samples that must not be outliers.

6. Repeat steps 3-5 until no change is observed in the mean and covariance matrices, and return them.

7. A node is detected as an outlier if its MSD exceeds the cutoff value. The cutoff value can be calculated from distributional quantiles of the $\chi^2$ and F distributions.

More details and extensions can be found in Hubert, Debruyne et al.[75].
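A minimal sketch of the MSD-versus-cutoff idea using the MinCovDet estimator in scikit-learn; the data matrix X and the 0.975 chi-squared quantile are assumptions for illustration:

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

X = np.random.randn(200, 5)              # assumed placeholder data matrix

mcd = MinCovDet().fit(X)                 # robust mean and covariance
msd = mcd.mahalanobis(X)                 # squared Mahalanobis distances
cutoff = chi2.ppf(0.975, df=X.shape[1])  # assumed chi-squared cutoff value

outliers = msd > cutoff                  # boolean mask of flagged points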

3.7 One-Class Support Vector Machines(OCSVM)

The Support Vector Machine (SVM) is a model that separates a dataset into two categories. SVM can also be applied to one-class classification problems, yielding the One-Class Support Vector Machine (OCSVM). The steps of OCSVM are as follows:

1. Map the training data G into a subset $S \subseteq F$ using a Radial Basis Function kernel[76], where F is a feature space.

2. Find the decision boundary that separates the data from the origin.

3. Find a function f such that for any point $v \in S \subseteq F$, f(v) returns 1 or -1 based on the decision boundary. The decision boundary for f is set by some a priori specified value greater than zero.

Any testing data point v that lies outside S is marked as an outlier. Note that in our experiments we map the dataset into a higher-dimensional space using the Radial Basis Function (RBF) kernel[77], following Schölkopf, Platt et al.[78], but any kernel function can be applied.
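A minimal sketch of OCSVM with an RBF kernel via scikit-learn; X and the parameter nu = 0.1 (an upper bound on the fraction of training points treated as outliers) are assumptions:

import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.randn(200, 10)        # assumed placeholder data matrix

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X)
labels = ocsvm.predict(X)           # +1 inside the boundary, -1 outside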

3.8 Principal Component Analysis(PCA)

Principal Component Analysis (PCA) is a multivariate technique that uses an orthogonal transformation to project a set of correlated variables onto a lower-dimensional space. To apply this technique to outlier detection[79]:

1. Perform PCA on the correlation matrix of the original data. This yields eigenvectors and their associated eigenvalues.

2. Consider the sample principal components $y_1, y_2, \ldots, y_p$ of an observation x. Two principal component scores can be calculated from the $y_i$:

(a) Major components score: $\sum_{i=1}^{q} \frac{y_i^2}{\lambda_i}$

(b) Minor components score: $\sum_{i=q+1}^{p} \frac{y_i^2}{\lambda_i}$

where q is the number of components that explain about 50 percent of the total variation in the standardized features.

3. A node $v \in G$ is considered an outlier if its major or minor components score is greater than an outlier threshold c (see the sketch below).
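A minimal sketch of the scoring steps above; standardizing the data before PCA is equivalent to performing PCA on the correlation matrix, and the outlier threshold c is left unspecified:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.randn(200, 12)         # assumed placeholder data matrix

Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
Y = pca.transform(Z)                 # principal component scores y_i
lam = pca.explained_variance_        # eigenvalues lambda_i

# q = number of components explaining about 50% of the total variation
q = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.5)) + 1

major = (Y[:, :q] ** 2 / lam[:q]).sum(axis=1)   # major components score
minor = (Y[:, q:] ** 2 / lam[q:]).sum(axis=1)   # minor components score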

The main idea of PCA-based detection is that anomalous instances deviate from the subspace of the normal instances. The major components capture the global deviations of the majority of the data, and the minor components capture smaller deviations. Goldstein and Uchida[56] have evaluated different strategies, alternatively using all components, only major components, only minor components, and both kinds of components.

Chapter 4 Graph Based Feature Selection Methods and Their Application in Biomarker Discovery

The term biomarker is a contraction of "biological marker." Biomarkers are molecules or biological indicators that may reflect the presence or progression of a specific disease[80]. Different diseases can be detected by measuring different kinds of biomarkers. However, discovering disease biomarkers from disease data is usually challenging because: (i) many datasets are characterized by large numbers of features but small numbers of samples; (ii) the data samples exhibit a high degree of sparsity; and (iii) the underlying biology is complex and sequencing technology and taxonomy classification pipelines have limitations. In a recent work, Abbas and Le et al.[6] introduced a novel approach for detecting disease biomarkers from metagenomics data, the Network-Based Biomarker Discovery (NBBD) framework. The NBBD framework includes two major customizable modules: (i) a network inference module, which constructs a microbial ecology network from a given table of microbial operational taxonomic units (OTUs); and (ii) a node importance scoring module, which compares the networks constructed for the chosen phenotypes and scores each node based on its topological properties in the two graphs. Although the node importance scoring methods based on node topological properties suggested by Abbas and Le et al.[6] perform well for determining the most discriminative features from metagenomics data, the majority of these methods focus on local properties. We are interested in how well different kinds of global property methods perform within the NBBD framework. Furthermore, the NBBD framework has only been tested on metagenomics datasets, and we want to test whether it is compatible with other kinds of biomarker datasets. In this thesis, we explore two node importance scoring methods that consider global properties of the network: (i) Node Attribution Profile Scoring (NAPS) and (ii) Node Anomaly Scoring (NAS). We also examine the NBBD framework on a gene expression dataset and report its performance.

4.1 Methods

4.1.1 Datasets

We experiment with the following two datasets:

4.1.1.1 Inflammatory Bowel Diseases (IBD) dataset

The large-cohort Inflammatory Bowel Diseases (IBD)[81] dataset we use for this experiment is identical to the one used by Abbas and Le et al.[6], since we would like to compare our performance with the original NBBD framework. This dataset consists of 657 IBD and 316 healthy-control metagenomics biopsy samples; 200 IBD and 200 healthy samples were randomly selected to form the training dataset. Each sample has 786 OTUs at the genus level.

4.1.1.2 Neuroblastoma (NB) dataset

The Neuroblastoma (NB) dataset includes Agilent microarray data for 498 NB patients collected from GEO (series GSE49710 and GSE49711). We used limma, an R/Bioconductor package[82] for microarray data analysis, to normalize the Agilent microarray data using quantile normalization. The clinical variables of the samples were extracted from Zhang and Yu et al.[7], and the survival attribute, OS bin, was used to associate a binary label with each sample. The final dataset has 498 samples, of which 105 are positively labeled. We used the same partition of the

data into 249 training and 249 test samples introduced in Zhang and Yu et al.[7].

Figure 4.1. NBBD framework overview (adapted from [6])

4.1.2 Network-Based Biomarker Discovery (NBBD) framework

The NBBD[6] framework is designed for detecting disease biomarkers from metagenomics data. It consists of two main customizable modules: a network inference module and a node importance scoring module. Figure 4.1 shows an overview of the NBBD framework. The network inference module constructs a microbial network from a given dataset, and the node importance scoring module calculates node scores between two networks. In our experiments, we followed the two main steps of the NBBD framework: (i) Construct two microbial networks from the two groups of data (corresponding to healthy and unhealthy samples), where each node represents an OTU (or a gene for the NB dataset) and each edge represents a relationship between two nodes. We decided to use CoNet (see 4.1.4) as our network inference module in this experiment, since it performed best among the network construction methods compared in Abbas and Matta et al.[83]. (ii) Compare the two networks using node importance scoring methods. We hypothesize that nodes with higher scores provide useful features for training a classifier to discriminate between the two populations of metagenomics samples.

Figure 4.2. NBBD framework overview with two different node scoring methods

We now introduce the two node importance scoring methods used in our experiments.

4.1.3 Proposed Node Importance Scoring Methods

We present two approaches for scoring nodes (i.e., features) based on: (i) differences in node anomaly scores; and (ii) differences in node profiles determined from the two networks. Both approaches assume that a discriminating feature has different patterns of interactions with other features (nodes) in the two networks. Abbas et al.[6] captured these patterns using changes in node topological properties such as betweenness centrality. Here, we propose to capture these patterns using changes in node anomaly scores or changes in node profiles, represented as the proximity between the target node and all other nodes in the networks.

4.1.3.1 Node Anomaly Scoring (NAS)

Let $G_i(V_i, E_i)$ and $G_j(V_j, E_j)$ be the two graphs constructed from the two groups of training samples (e.g., for metagenomics biomarker discovery, the two groups correspond to the disease and control conditions). NAS scores each node $v \in V_i \cap V_j$ with respect to a node anomaly score AD as follows: $score_{AD}(v) = |f_{AD}(v, G_i) - f_{AD}(v, G_j)|$, where $f_{AD}(v, G)$ quantifies the abnormality of node v in G using an anomaly detection algorithm AD.

In our experiments, we considered a representative set of eight unsupervised anomaly detection algorithms implemented in PyOD. We also experimented with four representations of the network (as input to the anomaly detection algorithms): the adjacency matrix, and three affinity matrices computed using the FaBP[2], ASCOS[2], and SimRank[32] algorithms. The anomaly detection algorithms are summarized in Chapter 3, and the computation of the affinity matrices is described in Section 2.1.3.
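For concreteness, the following is a minimal sketch of NAS under two assumptions: HBOS as the AD algorithm and the adjacency-matrix representation. A_i and A_j are assumed adjacency matrices of the two networks with a shared node ordering:

import numpy as np
from pyod.models.hbos import HBOS

A_i = (np.random.rand(50, 50) > 0.8).astype(float)  # assumed network i
A_j = (np.random.rand(50, 50) > 0.8).astype(float)  # assumed network j

def node_anomaly_scores(A):
    # Each row of the matrix is treated as the feature vector of one node.
    return HBOS().fit(A).decision_scores_

nas = np.abs(node_anomaly_scores(A_i) - node_anomaly_scores(A_j))
ranking = np.argsort(nas)[::-1]     # candidate biomarkers, highest first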

4.1.3.2 Node Attribution Profile Scoring (NAPS)

Given $G_i(V_i, E_i)$ and $G_j(V_j, E_j)$ as the two graphs constructed from the two groups of training samples, we compute affinity matrices $S_i$ and $S_j$ from the two graphs. Each row in an affinity matrix represents the similarity between the corresponding node and all nodes in the network, and can therefore be considered a global profile of that node. Following Koutra and Shah et al.[4], we use $d(S_{i,k}, S_{j,k})$ to estimate the node attribution (see 2.2.3) of the kth node, where d computes the distance between two rows of the affinity matrices. Node attribution scores estimate changes in nodes' global profiles, and these scores can be used as node importance scores in the NBBD framework.
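A minimal sketch of NAPS, assuming precomputed affinity matrices S_i and S_j with a shared node ordering and the rootED distance of Koutra and Shah et al.[4], $d(s, t) = \sqrt{\sum_k (\sqrt{s_k} - \sqrt{t_k})^2}$, applied row by row:

import numpy as np

S_i = np.random.rand(50, 50)        # assumed affinity matrix of network i
S_j = np.random.rand(50, 50)        # assumed affinity matrix of network j

def root_ed(s, t):
    # Root Euclidean distance between two node profiles (matrix rows).
    return np.sqrt(((np.sqrt(s) - np.sqrt(t)) ** 2).sum())

naps = np.array([root_ed(S_i[k], S_j[k]) for k in range(S_i.shape[0])])
ranking = np.argsort(naps)[::-1]    # nodes with the most-changed profiles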

4.1.4 Experiments

In our experiments, we learned the graphs from the IBD and NB datasets using the network inference tool Co-occurrence Network Inference (CoNet)[84]. CoNet is an association network inference tool designed to construct networks of biological entities (e.g., microbial communities, genes). The method combines two complementary approaches: (i) an ensemble of similarity and dissimilarity measures, which reduces the number of false-positive edges in networks generated by different methods; and (ii) a novel permutation-renormalization and bootstrap (ReBoot) method, which is used to compute the significance of associations for evaluation[84]. We used the CoNet implementation in Cytoscape[85] and followed the procedure described in [86] to construct the networks. For feature selection, we considered a filter method using Information Gain (IG) and an embedded feature selection method based on RF feature importance

(RFFI)[13]. We also considered the NBBD framework using three node topological properties (betweenness centrality, closeness centrality, and average neighbor degree) as well as the two novel node importance scoring methods proposed in this work. For the node importance scoring methods based on node topological properties, we used the following three properties[87], as in Abbas and Le et al.[6]:

Betweenness Centrality (btw): The betweenness centrality of a node v is defined as:

$f_{btw}(v, G) = \sum_{u,w \in V} \frac{\sigma(u, w \mid v)}{\sigma(u, w)}$ (4.1)

where $\sigma(u, w)$ is the number of shortest paths between u and w, and $\sigma(u, w \mid v)$ is the number of shortest paths between u and w passing through v.

Closeness Centrality (cls): The closeness centrality of a node v is defined as:

$f_{cls}(v, G) = \frac{|N'| - 1}{\sum_{u \in N', u \neq v} d(u, v)}$ (4.2)

where d(u, v) is the length of the shortest path between u and v, and $N' \subseteq G$ is the set of nodes that can reach v.

Average Neighbor Degree (and): The average neighbor degree of a node v is defined as:

$f_{and}(v, G) = \frac{1}{|\Gamma_v|} \sum_{u \in \Gamma_v} |\Gamma_u|$ (4.3)

where $\Gamma_v$ is the set of neighbors of node v and $|\Gamma_u|$ is the degree of node u.

For the NAS scoring method, we used the following AD algorithms: PCA, MCD, OCSVM, LOF, CBLOF, HBOS, IForest, and Auto-Encoder. We also experimented with four representations of the input to these AD algorithms: the adjacency matrix and affinity matrices computed using FaBP, ASCOS, and SimRank.

For the NAPS scoring method, we experimented with three distance functions (rootED, Cosine, and Bray-Curtis) and three affinity matrices computed using FaBP, ASCOS, and SimRank.

For classifier training and performance evaluation, we used the top k selected features, $k \in \{10, 20, 30, 40, 50, 60\}$, to train Random Forest

(RF)[13] classifiers to discriminate between the positively labeled samples (IBD, or positive survival label) and the negatively labeled samples (healthy, or negative survival label). The RF algorithm is implemented in Scikit-learn[88], and the number of trees is set to 500 in our experiments. The resulting RF classifiers are then evaluated using a set of commonly used performance measures: Accuracy (ACC), Sensitivity (Sn, also called true positive rate), Specificity (Sp, also called true negative rate), Matthews Correlation Coefficient (MCC), and Area Under the ROC Curve (AUC)[89, 90].

$ACC = \frac{TP + TN}{TP + FP + TN + FN}$ (4.4)

$S_n = \frac{TP}{TP + FN}$ (4.5)

$S_p = \frac{TN}{TN + FP}$ (4.6)

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}$ (4.7)

where TP, FP, TN, and FN are the numbers of true positives (correctly identified), false positives (incorrectly identified), true negatives (correctly rejected), and false negatives (incorrectly rejected), respectively. For calculating TP, FP, TN, and FN, we used the default RF classifier threshold of 0.5.

On the other hand, the Receiver Operating Characteristic (ROC) curve[91] illustrates the performance of the classifier over all possible thresholds. The ROC curve is a two-dimensional plot where the X-axis is the false positive rate and the Y-axis is the true positive rate; each point on the curve reflects the performance of the classifier at a chosen threshold. The Area Under the ROC Curve (AUC) quantifies the performance of the RF classifier as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample. An ideal classifier has an AUC of 1, and any AUC above 0.5 is better than random guessing.
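A minimal sketch of computing these measures with scikit-learn; y_true and y_score are assumed placeholders for the test labels and the RF-predicted probabilities:

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])           # assumed labels
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)                 # default threshold 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)                  # Eq. 4.4
sn = tp / (tp + fn)                                   # Eq. 4.5, sensitivity
sp = tn / (tn + fp)                                   # Eq. 4.6, specificity
mcc = matthews_corrcoef(y_true, y_pred)               # Eq. 4.7
auc = roc_auc_score(y_true, y_score)                  # threshold-independent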

4.2 Results and Discussion

4.2.1 Performance comparisons using IBD dataset

Table 4.1 compares the performance of RF classifiers trained using the following feature selection methods: filter-based feature selection using Information Gain (IG) scores; embedded feature selection using RF feature importance[13]; and NBBD using the three node topological properties presented in Abbas and Le et al.[6] and summarized in 4.1.4. Without any feature selection, the RF classifier has an estimated AUC of 0.74. Interestingly, using any of the feature selection methods in Table 4.1 (except cls) leads to improvements in RF performance. The highest AUC is obtained with RFFI, which outperforms btw, the best NBBD feature selection method considered in our experiments.

Table 4.1. Performance comparisons on IBD dataset of RF classifiers trained using different feature selection methods: Information Gain (IG), RF Feature Importance (RFFI), and NBBD using three node topological properties.
Method  # Features  ACC   Sn    Sp    MCC   AUC
None    NA          0.66  0.64  0.75  0.31  0.74
IG      30          0.66  0.63  0.78  0.33  0.76
RFFI    20          0.68  0.65  0.79  0.35  0.79
btw     60          0.65  0.62  0.77  0.31  0.77
cls     60          0.62  0.58  0.77  0.28  0.73
and     40          0.65  0.62  0.76  0.30  0.75

Table 4.2 shows the top-performing RF classifiers trained using the AD methods for feature selection, for different choices of AD algorithm, number of selected features, and input data representation. For each AD algorithm, we report the best-performing model and its associated number of features and representation. We note that different AD algorithms have different preferences for the input data representation: three algorithms prefer the FaBP representation, while two each prefer ASCOS or SimRank. The highest AUC of 0.78 is observed with the HBOS algorithm and the ASCOS representation. Table 4.3 reports the top-performing RF classifiers trained using the NAPS method with three different distance functions for computing the distance between node profiles, where node profiles were determined using affinity matrices computed with FaBP, ASCOS, and SimRank. Interestingly, all top-performing classifiers

Table 4.2. Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
Method   Rep.     # Features  ACC   Sn    Sp    MCC   AUC
AE       FaBP     50          0.66  0.64  0.73  0.30  0.75
CBLOF    FaBP     60          0.67  0.65  0.76  0.33  0.77
HBOS     ASCOS    60          0.68  0.65  0.78  0.35  0.78
iForest  FaBP     60          0.66  0.62  0.78  0.33  0.77
LOF      SimRank  30          0.65  0.61  0.81  0.34  0.75
MCD      SimRank  20          0.68  0.67  0.72  0.32  0.77
OCSVM    adj      50          0.67  0.65  0.76  0.33  0.76
PCA      ASCOS    60          0.62  0.56  0.84  0.32  0.74

were obtained using the FaBP representation. Two of the distance functions reached the highest AUC score of 0.78 using the top 50 features. Overall, RF classifiers trained using our proposed node scoring functions based on AD or NAPS have performance comparable to RFFI and outperform the NBBD feature selection methods based on node topological properties.

Table 4.3. Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
BC        FaBP            50          0.68  0.65  0.79  0.36  0.78
Cosine    FaBP            60          0.66  0.63  0.78  0.33  0.77
rootED    FaBP            50          0.67  0.65  0.75  0.33  0.78

4.2.2 Performance comparisons using NB dataset

To the best of our knowledge, the utility of NBBD had previously been demonstrated only for metagenomics data. Here, we evaluate the applicability of the framework for biomarker discovery from gene expression microarray data. Table 4.4 shows the performance of RF classifiers trained using different traditional feature selection methods as well as NBBD methods. Surprisingly, training the RF classifier on all input features is as good as training it on the top selected features. In fact, three feature selection methods failed to suggest a subset of features that allows the RF to reach an AUC of 0.88 on the test dataset. Table 4.5 shows that with two AD algorithms, AE and PCA, the NBBD feature selection approach slightly improves the AUC to 0.89 using only 40 and 50 features, respectively. Moreover, using the CBLOF algorithm and the SimRank representation, the resulting RF classifier has an AUC of 0.88 using only the top 20 features. This is an improvement over all the methods reported in Table 4.4, where at least 40 features are needed to reach an AUC of 0.88. Finally, no improvement in performance is observed using the NAPS method (see Table 4.6).

Table 4.4. Performance comparison on NB dataset of RF classifiers trained using different feature selection methods.
Method  # Features  ACC   Sn    Sp    MCC   AUC
None    NA          0.79  0.26  0.93  0.26  0.88
IG      50          0.80  0.37  0.91  0.33  0.87
RFFI    40          0.81  0.41  0.92  0.37  0.88
btw     60          0.77  0.15  0.94  0.14  0.88
cls     40          0.80  0.15  0.97  0.23  0.86
and     40          0.80  0.20  0.96  0.25  0.87

Table 4.5. Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
Method   Rep.     # Features  ACC   Sn    Sp    MCC   AUC
AE       SimRank  40          0.80  0.30  0.94  0.31  0.89
CBLOF    SimRank  20          0.82  0.28  0.97  0.37  0.88
HBOS     FaBP     20          0.79  0.33  0.91  0.29  0.86
iForest  FaBP     40          0.81  0.28  0.95  0.32  0.87
LOF      SimRank  30          0.78  0.19  0.94  0.19  0.88
MCD      SimRank  50          0.80  0.26  0.95  0.29  0.87
OCSVM    ASCOS    40          0.80  0.28  0.95  0.31  0.88
PCA      ASCOS    50          0.82  0.31  0.95  0.36  0.89

4.3 Conclusion

In this chapter, we discussed the NBBD framework introduced in Abbas and Le et al.[6]. The framework has two main modules: network inference and node importance scoring. We pointed out a limitation of the node importance scores proposed in Abbas and Le et al.[6]: these scores are computed using local topological properties of the nodes. To address this limitation, we proposed two novel node importance scoring methods based on predicted node anomaly scores and on distances between node profiles determined using affinity matrices. Using two biological datasets, we demonstrated the viability of our proposed scoring methods in identifying top discriminative features, which could be potential candidate biomarkers for the two diseases considered in this study, Inflammatory Bowel Disease and Neuroblastoma.

Table 4.6. Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
BC        ASCOS           30          0.81  0.30  0.95  0.34  0.87
Cosine    SimRank         20          0.82  0.33  0.95  0.37  0.87
rootED    ASCOS           50          0.80  0.30  0.93  0.30  0.88

Chapter 5 Conclusion

Detection of biomarkers is an important task because it allows us to detect and measure different diseases, and thus facilitates timely intervention and treatment. In this thesis we discussed a novel framework that can be used to detect biomarkers: the NBBD framework[6]. The NBBD framework consists of two main customizable modules: a network inference module and a node importance scoring module. The function of the network inference module is to construct a network from the given dataset; the node importance scoring module is used to score each node in the network. After describing the NBBD framework, we pointed out its limitations: the importance scoring methods used in the NBBD framework[6] are local methods, and the framework had only been tested on a metagenomics dataset (IBD). We then proposed two novel scoring methods that can be applied with similar computational benefit: Node Attribution Profile Scoring and Node Anomaly Scoring. In Chapter 2, we introduced different vertex similarity and graph similarity methods that can be used in the Node Attribution Profile Scoring models. In Chapter 3, we described several anomaly detection methods that can be used in the Node Anomaly Scoring models. In Chapter 4, we evaluated our scoring models on two datasets: a metagenomics dataset (IBD) and a gene expression dataset (NB). We compared their performance with the local scoring methods used in Abbas and Le et al.[6]. For further validation, we also compared their performance with machine learning feature selection methods. The results show that our methods can outperform the local scoring methods and are comparable to some state-of-the-art feature selection methods, including Random

Forest Feature Importance and Information Gain. Several future research directions are suggested by our work: (i) Comparison of different node importance scoring methods using cross-validation or multiple evaluations with different random splits of the data into train and test sets; this would lead to more accurate estimates of classifier performance and would allow statistical tests to assess the significance of the classifiers' results. Unfortunately, conducting such experiments would require implementing an automatic script for inferring the graphs from the training data, as opposed to using the Cytoscape GUI to run the CoNet plugin. (ii) We experimented only with the CoNet network construction tool in the NBBD model; we would like to examine other network construction methods and compare their performance with our current work. (iii) Test our approach on more datasets, for example from electronic health records.

Appendix A Performance Comparison On Inflammatory Bowel Disease (IBD) dataset Using NAS methods

Table A.1. IBD dataset using NAS: Auto-Encoder with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
AE   adj   10          0.60  0.60  0.61  0.17  0.64
AE   adj   20          0.62  0.61  0.65  0.21  0.66
AE   adj   30          0.65  0.64  0.66  0.25  0.69
AE   adj   40          0.67  0.67  0.71  0.30  0.73
AE   adj   50          0.65  0.63  0.72  0.28  0.74
AE   adj   60          0.62  0.60  0.72  0.26  0.73

Table A.2. IBD dataset using NAS: Auto-Encoder with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
AE   SimRank  10          0.59  0.58  0.65  0.18  0.63
AE   SimRank  20          0.58  0.55  0.67  0.18  0.63
AE   SimRank  30          0.60  0.56  0.73  0.24  0.67
AE   SimRank  40          0.59  0.57  0.69  0.21  0.67
AE   SimRank  50          0.62  0.59  0.73  0.26  0.71

Table A.3. IBD dataset using NAS: Auto-Encoder with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
AE   ASCOS  10          0.49  0.46  0.64  0.08  0.59
AE   ASCOS  20          0.57  0.55  0.61  0.13  0.61
AE   ASCOS  30          0.59  0.58  0.64  0.18  0.66
AE   ASCOS  40          0.58  0.55  0.70  0.20  0.67
AE   ASCOS  50          0.55  0.51  0.72  0.18  0.69
AE   ASCOS  60          0.61  0.56  0.80  0.29  0.74

Table A.4. IBD dataset using NAS: Auto-Encoder with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
AE   FaBP  10          0.58  0.54  0.72  0.21  0.65
AE   FaBP  20          0.68  0.67  0.72  0.32  0.74
AE   FaBP  30          0.67  0.65  0.76  0.33  0.73
AE   FaBP  40          0.67  0.65  0.76  0.33  0.73
AE   FaBP  50          0.66  0.64  0.73  0.30  0.75
AE   FaBP  60          0.68  0.66  0.75  0.33  0.75

Table A.5. IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  adj   10          0.51  0.47  0.67  0.11  0.59
CBLOF  adj   20          0.53  0.49  0.67  0.13  0.61
CBLOF  adj   30          0.55  0.50  0.74  0.19  0.64
CBLOF  adj   40          0.63  0.61  0.72  0.26  0.70
CBLOF  adj   50          0.67  0.64  0.78  0.34  0.76
CBLOF  adj   60          0.66  0.63  0.78  0.33  0.76

Table A.6. IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep.
ADM    Rep.     # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  SimRank  10          0.50  0.47  0.62  0.07  0.57
CBLOF  SimRank  20          0.62  0.58  0.78  0.30  0.71
CBLOF  SimRank  30          0.62  0.59  0.74  0.27  0.72
CBLOF  SimRank  40          0.62  0.58  0.78  0.29  0.74
CBLOF  SimRank  50          0.62  0.58  0.76  0.28  0.74

Table A.7. IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep.
ADM    Rep.   # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  ASCOS  10          0.51  0.49  0.60  0.07  0.57
CBLOF  ASCOS  20          0.55  0.53  0.64  0.13  0.58
CBLOF  ASCOS  30          0.63  0.61  0.69  0.25  0.70
CBLOF  ASCOS  40          0.64  0.61  0.76  0.30  0.71
CBLOF  ASCOS  50          0.63  0.62  0.70  0.25  0.71

Table A.8. IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  FaBP  10          0.64  0.66  0.56  0.18  0.68
CBLOF  FaBP  20          0.68  0.67  0.72  0.32  0.75
CBLOF  FaBP  30          0.66  0.65  0.70  0.29  0.73
CBLOF  FaBP  40          0.68  0.66  0.76  0.34  0.75
CBLOF  FaBP  50          0.69  0.67  0.78  0.36  0.76
CBLOF  FaBP  60          0.67  0.65  0.76  0.33  0.77

Table A.9. IBD dataset using NAS: Histogram-based Outlier Score with adj Rep.
ADM   Rep.  # Features  ACC   Sn    Sp    MCC   AUC
HBOS  adj   10          0.58  0.61  0.44  0.04  0.53
HBOS  adj   20          0.59  0.60  0.53  0.11  0.59
HBOS  adj   30          0.65  0.67  0.61  0.23  0.69
HBOS  adj   40          0.67  0.65  0.73  0.31  0.75
HBOS  adj   50          0.66  0.67  0.65  0.26  0.72
HBOS  adj   60          0.64  0.64  0.66  0.24  0.72

Table A.10. IBD dataset using NAS: Histogram-based Outlier Score with SimRank Rep.
ADM   Rep.     # Features  ACC   Sn    Sp    MCC   AUC
HBOS  SimRank  10          0.58  0.57  0.61  0.15  0.64
HBOS  SimRank  20          0.56  0.54  0.64  0.15  0.65
HBOS  SimRank  30          0.59  0.56  0.72  0.22  0.67
HBOS  SimRank  40          0.60  0.56  0.78  0.27  0.72

Table A.11. IBD dataset using NAS: Histogram-based Outlier Score with ASCOS Rep.
ADM   Rep.   # Features  ACC   Sn    Sp    MCC   AUC
HBOS  ASCOS  10          0.64  0.65  0.64  0.23  0.67
HBOS  ASCOS  20          0.65  0.64  0.70  0.27  0.73
HBOS  ASCOS  30          0.63  0.61  0.73  0.27  0.73
HBOS  ASCOS  40          0.66  0.63  0.74  0.30  0.72
HBOS  ASCOS  50          0.67  0.64  0.80  0.36  0.76
HBOS  ASCOS  60          0.68  0.65  0.78  0.35  0.78

Table A.12. IBD dataset using NAS: Histogram-based Outlier Score with FaBP Rep.
ADM   Rep.  # Features  ACC   Sn    Sp    MCC   AUC
HBOS  FaBP  10          0.50  0.49  0.57  0.05  0.54
HBOS  FaBP  20          0.55  0.52  0.66  0.15  0.62
HBOS  FaBP  30          0.63  0.60  0.75  0.28  0.71
HBOS  FaBP  40          0.66  0.64  0.75  0.31  0.76
HBOS  FaBP  50          0.67  0.64  0.78  0.35  0.76
HBOS  FaBP  60          0.67  0.65  0.75  0.32  0.77

Table A.13. IBD dataset using NAS: Isolation Forest (IForest) with adj Rep.
ADM      Rep.  # Features  ACC   Sn    Sp    MCC   AUC
iforest  adj   10          0.56  0.58  0.47  0.04  0.56
iforest  adj   20          0.61  0.61  0.59  0.17  0.65
iforest  adj   30          0.66  0.65  0.68  0.27  0.72
iforest  adj   40          0.67  0.67  0.68  0.28  0.74
iforest  adj   50          0.67  0.66  0.73  0.32  0.74
iforest  adj   60          0.66  0.64  0.75  0.31  0.74

Table A.14. IBD dataset using NAS: Isolation Forest (IForest) with SimRank Rep.
ADM      Rep.     # Features  ACC   Sn    Sp    MCC   AUC
iforest  SimRank  10          0.57  0.57  0.57  0.11  0.62
iforest  SimRank  20          0.55  0.53  0.66  0.14  0.64
iforest  SimRank  30          0.59  0.55  0.74  0.23  0.67
iforest  SimRank  40          0.63  0.60  0.74  0.27  0.72
iforest  SimRank  50          0.63  0.61  0.73  0.27  0.73
iforest  SimRank  60          0.63  0.60  0.75  0.28  0.73

Table A.15. IBD dataset using NAS: Isolation Forest (IForest) with ASCOS Rep.
ADM      Rep.   # Features  ACC   Sn    Sp    MCC   AUC
iforest  ASCOS  10          0.56  0.55  0.62  0.14  0.65
iforest  ASCOS  20          0.60  0.57  0.73  0.24  0.69
iforest  ASCOS  30          0.61  0.59  0.71  0.24  0.69
iforest  ASCOS  40          0.65  0.62  0.76  0.31  0.74
iforest  ASCOS  50          0.66  0.62  0.80  0.34  0.75
iforest  ASCOS  60          0.65  0.61  0.81  0.34  0.76

Table A.16. IBD dataset using NAS: Isolation Forest (IForest) with FaBP Rep.
ADM      Rep.  # Features  ACC   Sn    Sp    MCC   AUC
iforest  FaBP  10          0.51  0.49  0.60  0.08  0.58
iforest  FaBP  20          0.59  0.56  0.72  0.22  0.70
iforest  FaBP  30          0.64  0.60  0.81  0.33  0.76
iforest  FaBP  40          0.64  0.60  0.78  0.30  0.76
iforest  FaBP  50          0.65  0.62  0.79  0.33  0.76
iforest  FaBP  60          0.66  0.62  0.78  0.33  0.77

Table A.17. IBD dataset using NAS: Local Outlier Factor (LOF) with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
LOF  adj   10          0.57  0.55  0.64  0.15  0.62
LOF  adj   20          0.55  0.53  0.66  0.15  0.64
LOF  adj   30          0.57  0.54  0.70  0.19  0.66
LOF  adj   40          0.63  0.61  0.71  0.26  0.71
LOF  adj   50          0.63  0.60  0.74  0.28  0.72
LOF  adj   60          0.64  0.61  0.74  0.29  0.73

Table A.18. IBD dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
LOF  SimRank  10          0.61  0.59  0.69  0.22  0.70
LOF  SimRank  20          0.61  0.56  0.81  0.30  0.73
LOF  SimRank  30          0.65  0.61  0.81  0.34  0.75

Table A.19. IBD dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
LOF  ASCOS  10          0.53  0.49  0.68  0.14  0.62
LOF  ASCOS  20          0.59  0.56  0.69  0.20  0.65
LOF  ASCOS  30          0.61  0.56  0.79  0.28  0.74
LOF  ASCOS  40          0.62  0.58  0.78  0.30  0.74

Table A.20. IBD dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
LOF  FaBP  10          0.49  0.46  0.62  0.06  0.53
LOF  FaBP  20          0.59  0.58  0.66  0.19  0.64
LOF  FaBP  30          0.58  0.56  0.70  0.20  0.66
LOF  FaBP  40          0.61  0.58  0.71  0.23  0.70
LOF  FaBP  50          0.60  0.57  0.71  0.22  0.70
LOF  FaBP  60          0.62  0.58  0.74  0.26  0.70

Table A.21. IBD dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
MCD  adj   10          0.56  0.55  0.60  0.12  0.59
MCD  adj   20          0.50  0.46  0.66  0.10  0.59
MCD  adj   30          0.63  0.59  0.78  0.30  0.72
MCD  adj   40          0.66  0.63  0.78  0.34  0.74
MCD  adj   50          0.66  0.65  0.73  0.31  0.75
MCD  adj   60          0.66  0.63  0.74  0.30  0.74

Table A.22. IBD dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
MCD  SimRank  10          0.60  0.58  0.67  0.20  0.68
MCD  SimRank  20          0.68  0.67  0.72  0.32  0.77
MCD  SimRank  30          0.66  0.63  0.75  0.31  0.76
MCD  SimRank  40          0.67  0.65  0.74  0.32  0.76
MCD  SimRank  50          0.66  0.64  0.74  0.31  0.75
MCD  SimRank  60          0.65  0.62  0.75  0.30  0.75

Table A.23. IBD dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
MCD  ASCOS  10          0.62  0.63  0.59  0.18  0.66
MCD  ASCOS  20          0.62  0.60  0.68  0.23  0.67
MCD  ASCOS  30          0.61  0.60  0.68  0.22  0.70
MCD  ASCOS  40          0.64  0.64  0.67  0.25  0.70
MCD  ASCOS  50          0.65  0.64  0.70  0.27  0.72
MCD  ASCOS  60          0.65  0.64  0.72  0.29  0.72

Table A.24. IBD dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
MCD  FaBP  10          0.55  0.52  0.66  0.15  0.63
MCD  FaBP  20          0.64  0.63  0.66  0.23  0.68
MCD  FaBP  30          0.61  0.58  0.72  0.24  0.72
MCD  FaBP  40          0.61  0.58  0.74  0.25  0.71
MCD  FaBP  50          0.62  0.59  0.77  0.29  0.74
MCD  FaBP  60          0.64  0.61  0.72  0.27  0.73

Table A.25. IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  adj   10          0.51  0.46  0.68  0.12  0.60
OCSVM  adj   20          0.53  0.50  0.66  0.13  0.63
OCSVM  adj   30          0.57  0.54  0.72  0.20  0.65
OCSVM  adj   40          0.60  0.58  0.69  0.21  0.68
OCSVM  adj   50          0.67  0.65  0.76  0.33  0.76
OCSVM  adj   60          0.67  0.65  0.75  0.32  0.76

Table A.26. IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep.
ADM    Rep.     # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  SimRank  10          0.50  0.47  0.62  0.07  0.57
CBLOF  SimRank  20          0.62  0.58  0.78  0.30  0.71
CBLOF  SimRank  30          0.62  0.59  0.74  0.27  0.72
CBLOF  SimRank  40          0.62  0.58  0.78  0.29  0.74
CBLOF  SimRank  50          0.62  0.58  0.76  0.28  0.74

Table A.27. IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep.
ADM    Rep.   # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  ASCOS  10          0.54  0.51  0.66  0.14  0.62
OCSVM  ASCOS  20          0.54  0.52  0.60  0.10  0.60
OCSVM  ASCOS  30          0.54  0.52  0.65  0.13  0.61
OCSVM  ASCOS  40          0.55  0.53  0.65  0.14  0.64

Table A.28. IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  FaBP  10          0.61  0.59  0.67  0.21  0.68
OCSVM  FaBP  20          0.60  0.57  0.71  0.22  0.68
OCSVM  FaBP  30          0.59  0.56  0.72  0.23  0.69
OCSVM  FaBP  40          0.60  0.56  0.75  0.25  0.71
OCSVM  FaBP  50          0.61  0.57  0.73  0.25  0.71
OCSVM  FaBP  60          0.63  0.60  0.75  0.28  0.72

Table A.29. IBD dataset using NAS: Principal Component Analysis (PCA) with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
PCA  adj   10          0.60  0.60  0.61  0.17  0.64
PCA  adj   20          0.62  0.61  0.65  0.21  0.66
PCA  adj   30          0.64  0.64  0.62  0.22  0.69
PCA  adj   40          0.65  0.65  0.69  0.27  0.71
PCA  adj   50          0.63  0.61  0.72  0.26  0.73
PCA  adj   60          0.63  0.60  0.73  0.27  0.73

Table A.30. IBD dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
PCA  SimRank  10          0.59  0.57  0.68  0.20  0.64
PCA  SimRank  20          0.61  0.58  0.72  0.24  0.66
PCA  SimRank  30          0.59  0.56  0.72  0.22  0.68
PCA  SimRank  40          0.59  0.57  0.68  0.20  0.66
PCA  SimRank  50          0.62  0.58  0.76  0.28  0.72

Table A.31. IBD dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
PCA  ASCOS  10          0.49  0.45  0.64  0.07  0.59
PCA  ASCOS  20          0.55  0.55  0.59  0.11  0.61
PCA  ASCOS  30          0.57  0.56  0.62  0.15  0.64
PCA  ASCOS  40          0.58  0.55  0.72  0.21  0.68
PCA  ASCOS  50          0.56  0.52  0.71  0.18  0.68
PCA  ASCOS  60          0.62  0.56  0.84  0.32  0.74

Table A.32. IBD dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
PCA  FaBP  10          0.48  0.46  0.58  0.03  0.52
PCA  FaBP  20          0.57  0.56  0.60  0.13  0.60
PCA  FaBP  30          0.62  0.60  0.71  0.25  0.68
PCA  FaBP  40          0.62  0.59  0.71  0.24  0.69
PCA  FaBP  50          0.59  0.56  0.68  0.20  0.69
PCA  FaBP  60          0.61  0.58  0.72  0.24  0.69

Appendix B Performance Comparison On Inflammatory Bowel Disease (IBD) dataset Using NAPS methods

Table B.1. IBD dataset using NAPS: RootED with SimRank Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
RootED    SimRank         10          0.58  0.56  0.66  0.18  0.64
RootED    SimRank         20          0.57  0.54  0.72  0.21  0.67
RootED    SimRank         30          0.60  0.58  0.68  0.21  0.71
RootED    SimRank         40          0.60  0.58  0.71  0.23  0.70
RootED    SimRank         50          0.61  0.59  0.70  0.23  0.69
RootED    SimRank         60          0.60  0.56  0.76  0.26  0.70

Table B.2. IBD dataset using NAPS: RootED with ASCOS Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
RootED    ASCOS           10          0.50  0.45  0.70  0.12  0.62
RootED    ASCOS           20          0.56  0.51  0.76  0.22  0.68
RootED    ASCOS           30          0.59  0.55  0.73  0.23  0.69
RootED    ASCOS           40          0.60  0.56  0.74  0.25  0.70
RootED    ASCOS           50          0.60  0.56  0.75  0.25  0.70
RootED    ASCOS           60          0.62  0.59  0.73  0.26  0.72

Table B.3. IBD dataset using NAPS: RootED with FaBP Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
RootED    FaBP            10          0.56  0.52  0.72  0.19  0.64
RootED    FaBP            20          0.61  0.59  0.70  0.23  0.70
RootED    FaBP            30          0.68  0.66  0.77  0.35  0.77
RootED    FaBP            40          0.66  0.65  0.74  0.31  0.77
RootED    FaBP            50          0.67  0.65  0.75  0.33  0.78
RootED    FaBP            60          0.69  0.67  0.78  0.36  0.78

Table B.4. IBD dataset using NAPS: Cosine Distance with SimRank Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
cosine    SimRank         10          0.55  0.54  0.63  0.13  0.59
cosine    SimRank         20          0.59  0.56  0.71  0.21  0.65
cosine    SimRank         30          0.61  0.59  0.69  0.22  0.67
cosine    SimRank         40          0.60  0.57  0.72  0.23  0.71
cosine    SimRank         50          0.62  0.58  0.74  0.26  0.71
cosine    SimRank         60          0.60  0.56  0.75  0.25  0.70

Table B.5. IBD dataset using NAPS: Cosine Distance with ASCOS Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
cosine    ASCOS           10          0.44  0.38  0.71  0.07  0.53
cosine    ASCOS           20          0.55  0.51  0.72  0.19  0.66
cosine    ASCOS           30          0.60  0.56  0.75  0.25  0.70
cosine    ASCOS           40          0.59  0.55  0.75  0.24  0.70
cosine    ASCOS           50          0.62  0.59  0.72  0.26  0.71
cosine    ASCOS           60          0.60  0.56  0.73  0.24  0.70

Table B.6. IBD dataset using NAPS: Cosine Distance with FaBP Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
cosine    FaBP            10          0.59  0.56  0.72  0.22  0.67
cosine    FaBP            20          0.65  0.63  0.74  0.30  0.74
cosine    FaBP            30          0.63  0.59  0.77  0.29  0.76
cosine    FaBP            40          0.63  0.59  0.78  0.30  0.76
cosine    FaBP            50          0.65  0.61  0.80  0.33  0.76
cosine    FaBP            60          0.66  0.63  0.78  0.33  0.77

Table B.7. IBD dataset using NAPS: Bray-Curtis Distance with SimRank Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
BC        SimRank         10          0.54  0.50  0.69  0.15  0.64
BC        SimRank         20          0.60  0.58  0.68  0.21  0.67
BC        SimRank         30          0.62  0.59  0.72  0.25  0.72
BC        SimRank         40          0.59  0.55  0.73  0.23  0.72
BC        SimRank         50          0.61  0.57  0.76  0.26  0.71
BC        SimRank         60          0.63  0.60  0.76  0.29  0.73

Table B.8. IBD dataset using NAPS: Bray-Curtis Distance with ASCOS Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
BC        ASCOS           10          0.50  0.45  0.70  0.12  0.62
BC        ASCOS           20          0.57  0.52  0.75  0.22  0.68
BC        ASCOS           30          0.60  0.57  0.74  0.25  0.71
BC        ASCOS           40          0.59  0.55  0.76  0.25  0.71
BC        ASCOS           50          0.59  0.56  0.74  0.24  0.71
BC        ASCOS           60          0.63  0.60  0.74  0.28  0.73

Table B.9. IBD dataset using NAPS: Bray-Curtis Distance with FaBP Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
BC        FaBP            10          0.57  0.53  0.72  0.20  0.64
BC        FaBP            20          0.62  0.59  0.71  0.24  0.70
BC        FaBP            30          0.65  0.63  0.75  0.30  0.76
BC        FaBP            40          0.66  0.63  0.78  0.33  0.77
BC        FaBP            50          0.68  0.65  0.79  0.36  0.78
BC        FaBP            60          0.69  0.67  0.78  0.37  0.77

Appendix C Performance Comparison On Neuroblastoma (NB) dataset Using NAS methods

Table C.1. NB dataset using NAS: Auto-Encoder with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
AE   adj   10          0.80  0.22  0.95  0.26  0.77
AE   adj   20          0.83  0.52  0.92  0.47  0.87
AE   adj   30          0.82  0.43  0.93  0.42  0.88
AE   adj   40          0.82  0.41  0.93  0.39  0.89
AE   adj   50          0.81  0.33  0.94  0.36  0.88
AE   adj   60          0.80  0.26  0.95  0.29  0.88

Table C.2. NB dataset using NAS: Auto-Encoder with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
AE   SimRank  10          0.81  0.31  0.94  0.34  0.83
AE   SimRank  20          0.80  0.35  0.92  0.33  0.86
AE   SimRank  30          0.81  0.28  0.96  0.34  0.88
AE   SimRank  40          0.80  0.30  0.94  0.31  0.89
AE   SimRank  50          0.81  0.28  0.96  0.34  0.89
AE   SimRank  60          0.81  0.30  0.95  0.33  0.89

Table C.3. NB dataset using NAS: Auto-Encoder with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
AE   ASCOS  10          0.80  0.35  0.92  0.32  0.84
AE   ASCOS  20          0.82  0.39  0.93  0.39  0.86
AE   ASCOS  30          0.80  0.26  0.95  0.29  0.86
AE   ASCOS  40          0.80  0.28  0.94  0.29  0.88
AE   ASCOS  50          0.80  0.30  0.94  0.32  0.88
AE   ASCOS  60          0.80  0.26  0.95  0.30  0.88

Table C.4. NB dataset using NAS: Auto-Encoder with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
AE   FaBP  10          0.78  0.11  0.96  0.14  0.72
AE   FaBP  20          0.79  0.19  0.95  0.22  0.84
AE   FaBP  30          0.78  0.20  0.94  0.21  0.86
AE   FaBP  40          0.80  0.20  0.96  0.25  0.87
AE   FaBP  50          0.81  0.30  0.95  0.34  0.88
AE   FaBP  60          0.80  0.24  0.96  0.30  0.89

Table C.5. NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  adj   10          0.80  0.54  0.87  0.41  0.81
CBLOF  adj   20          0.79  0.31  0.92  0.28  0.86
CBLOF  adj   30          0.80  0.35  0.92  0.33  0.86

Table C.6. NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep.
ADM    Rep.     # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  SimRank  10          0.77  0.13  0.95  0.13  0.81
CBLOF  SimRank  20          0.82  0.28  0.97  0.37  0.88
CBLOF  SimRank  30          0.80  0.28  0.95  0.31  0.88
CBLOF  SimRank  40          0.80  0.30  0.94  0.32  0.87
CBLOF  SimRank  50          0.82  0.35  0.95  0.40  0.88

Table C.7. NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep.
ADM    Rep.   # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  ASCOS  10          0.76  0.13  0.93  0.10  0.71
CBLOF  ASCOS  20          0.76  0.09  0.94  0.06  0.74
CBLOF  ASCOS  30          0.77  0.09  0.96  0.10  0.84
CBLOF  ASCOS  40          0.80  0.20  0.96  0.25  0.85
CBLOF  ASCOS  50          0.80  0.22  0.96  0.29  0.86

Table C.8. NB dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
CBLOF  FaBP  10          0.78  0.11  0.97  0.15  0.76
CBLOF  FaBP  20          0.80  0.09  0.99  0.21  0.81
CBLOF  FaBP  30          0.83  0.31  0.97  0.40  0.86
CBLOF  FaBP  40          0.82  0.28  0.97  0.37  0.87
CBLOF  FaBP  50          0.81  0.31  0.95  0.35  0.87

Table C.9. NB dataset using NAS: Histogram-based Outlier Score with adj Rep.
ADM   Rep.  # Features  ACC   Sn    Sp    MCC   AUC
HBOS  adj   10          0.80  0.54  0.87  0.41  0.83
HBOS  adj   20          0.78  0.31  0.91  0.27  0.84
HBOS  adj   30          0.81  0.33  0.94  0.34  0.86

Table C.10. NB dataset using NAS: Histogram-based Outlier Score with SimRank Rep.
ADM   Rep.     # Features  ACC   Sn    Sp    MCC   AUC
HBOS  SimRank  10          0.78  0.31  0.91  0.26  0.84
HBOS  SimRank  20          0.81  0.31  0.94  0.34  0.85
HBOS  SimRank  30          0.82  0.39  0.94  0.40  0.85
HBOS  SimRank  40          0.82  0.39  0.94  0.40  0.85

Table C.11. NB dataset using NAS: Histogram-based Outlier Score with ASCOS Rep.
ADM   Rep.   # Features  ACC   Sn    Sp    MCC   AUC
HBOS  ASCOS  10          0.81  0.43  0.92  0.39  0.85
HBOS  ASCOS  20          0.80  0.37  0.92  0.35  0.85
HBOS  ASCOS  30          0.79  0.30  0.93  0.28  0.85
HBOS  ASCOS  40          0.79  0.28  0.93  0.28  0.85

Table C.12. NB dataset using NAS: Histogram-based Outlier Score with FaBP Rep.
ADM   Rep.  # Features  ACC   Sn    Sp    MCC   AUC
HBOS  FaBP  10          0.78  0.43  0.87  0.31  0.81
HBOS  FaBP  20          0.79  0.33  0.91  0.29  0.86
HBOS  FaBP  30          0.78  0.30  0.92  0.26  0.86

Table C.13. NB dataset using NAS: Isolation Forest (IForest) with adj Rep.
ADM      Rep.  # Features  ACC   Sn    Sp    MCC   AUC
Iforest  adj   10          0.80  0.44  0.89  0.36  0.83
Iforest  adj   20          0.80  0.39  0.92  0.36  0.86
Iforest  adj   30          0.79  0.30  0.92  0.27  0.86

Table C.14. NB dataset using NAS: Isolation Forest (IForest) with SimRank Rep.
ADM      Rep.     # Features  ACC   Sn    Sp    MCC   AUC
Iforest  SimRank  10          0.80  0.31  0.93  0.31  0.83
Iforest  SimRank  20          0.81  0.37  0.93  0.36  0.84
Iforest  SimRank  30          0.79  0.30  0.93  0.28  0.84
Iforest  SimRank  40          0.80  0.30  0.93  0.30  0.86
Iforest  SimRank  50          0.80  0.28  0.95  0.31  0.87

Table C.15. NB dataset using NAS: Isolation Forest (IForest) with ASCOS Rep.
ADM      Rep.   # Features  ACC   Sn    Sp    MCC   AUC
Iforest  ASCOS  10          0.78  0.07  0.98  0.13  0.68
Iforest  ASCOS  20          0.78  0.07  0.98  0.13  0.79
Iforest  ASCOS  30          0.80  0.17  0.97  0.24  0.85
Iforest  ASCOS  40          0.79  0.17  0.96  0.22  0.84
Iforest  ASCOS  50          0.80  0.19  0.96  0.24  0.84
Iforest  ASCOS  60          0.79  0.17  0.96  0.22  0.85

Table C.16. NB dataset using NAS: Isolation Forest (IForest) with FaBP Rep.
ADM      Rep.  # Features  ACC   Sn    Sp    MCC   AUC
Iforest  FaBP  10          0.76  0.06  0.95  0.01  0.79
Iforest  FaBP  20          0.80  0.15  0.98  0.27  0.83
Iforest  FaBP  30          0.80  0.11  0.99  0.24  0.83
Iforest  FaBP  40          0.81  0.28  0.95  0.32  0.87
Iforest  FaBP  50          0.80  0.24  0.96  0.30  0.87

Table C.17. NB dataset using NAS: Local Outlier Factor (LOF) with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
LOF  adj   10          0.79  0.48  0.87  0.36  0.82
LOF  adj   20          0.79  0.35  0.91  0.31  0.86
LOF  adj   30          0.80  0.35  0.92  0.32  0.86

Table C.18. NB dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
LOF  SimRank  10          0.77  0.15  0.94  0.14  0.75
LOF  SimRank  20          0.78  0.06  0.97  0.07  0.80
LOF  SimRank  30          0.78  0.19  0.94  0.19  0.88
LOF  SimRank  40          0.80  0.26  0.94  0.28  0.88

Table C.19. NB dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
LOF  ASCOS  10          0.75  0.13  0.92  0.08  0.71
LOF  ASCOS  20          0.76  0.07  0.95  0.05  0.77
LOF  ASCOS  30          0.80  0.13  0.98  0.22  0.82
LOF  ASCOS  40          0.80  0.26  0.94  0.28  0.87
LOF  ASCOS  50          0.79  0.24  0.94  0.26  0.87
LOF  ASCOS  60          0.78  0.20  0.93  0.19  0.87

Table C.20. NB dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
LOF  FaBP  10          0.80  0.41  0.91  0.35  0.84
LOF  FaBP  20          0.78  0.31  0.90  0.25  0.86
LOF  FaBP  30          0.79  0.30  0.92  0.27  0.86
LOF  FaBP  40          0.79  0.30  0.93  0.28  0.87
LOF  FaBP  50          0.79  0.30  0.92  0.27  0.86
LOF  FaBP  60          0.79  0.24  0.94  0.26  0.87

Table C.21. NB dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
MCD  adj   10          0.77  0.13  0.94  0.12  0.70
MCD  adj   20          0.80  0.15  0.98  0.27  0.80
MCD  adj   30          0.82  0.20  0.98  0.34  0.82
MCD  adj   40          0.80  0.13  0.98  0.24  0.81

Table C.22. NB dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
MCD  SimRank  10          0.80  0.24  0.96  0.30  0.78
MCD  SimRank  20          0.81  0.20  0.98  0.32  0.75
MCD  SimRank  30          0.80  0.19  0.96  0.24  0.81
MCD  SimRank  40          0.80  0.20  0.97  0.28  0.86
MCD  SimRank  50          0.80  0.26  0.95  0.29  0.87
MCD  SimRank  60          0.80  0.24  0.95  0.28  0.86

Table C.23. NB dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
MCD  ASCOS  10          0.78  0.20  0.94  0.21  0.78
MCD  ASCOS  20          0.78  0.20  0.94  0.21  0.84
MCD  ASCOS  30          0.78  0.20  0.94  0.21  0.84
MCD  ASCOS  40          0.77  0.13  0.95  0.13  0.84
MCD  ASCOS  50          0.79  0.17  0.96  0.22  0.85

Table C.24. NB dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC    AUC
MCD  FaBP  10          0.77  0.02  0.98  -0.01  0.69
MCD  FaBP  20          0.76  0.06  0.95  0.01   0.71
MCD  FaBP  30          0.77  0.15  0.94  0.14   0.83
MCD  FaBP  40          0.79  0.22  0.94  0.24   0.84
MCD  FaBP  50          0.79  0.22  0.95  0.25   0.85
MCD  FaBP  60          0.77  0.19  0.93  0.16   0.86

Table C.25. NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  adj   10          0.80  0.48  0.89  0.39  0.84

Table C.26. NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep.
ADM    Rep.     # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  SimRank  10          0.78  0.09  0.97  0.14  0.75
OCSVM  SimRank  20          0.78  0.11  0.97  0.15  0.83
OCSVM  SimRank  30          0.78  0.07  0.97  0.11  0.83
OCSVM  SimRank  40          0.78  0.13  0.95  0.14  0.84
OCSVM  SimRank  50          0.79  0.17  0.96  0.21  0.85

Table C.27. NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep.
ADM    Rep.   # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  ASCOS  10          0.76  0.07  0.95  0.05  0.73
OCSVM  ASCOS  20          0.78  0.13  0.95  0.14  0.83
OCSVM  ASCOS  30          0.78  0.19  0.95  0.20  0.85
OCSVM  ASCOS  40          0.80  0.28  0.95  0.31  0.88
OCSVM  ASCOS  50          0.80  0.35  0.93  0.34  0.88
OCSVM  ASCOS  60          0.80  0.30  0.94  0.31  0.88

Table C.28. NB dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep.
ADM    Rep.  # Features  ACC   Sn    Sp    MCC   AUC
OCSVM  FaBP  10          0.80  0.19  0.96  0.24  0.77
OCSVM  FaBP  20          0.80  0.33  0.93  0.33  0.86
OCSVM  FaBP  30          0.81  0.30  0.95  0.34  0.87
OCSVM  FaBP  40          0.81  0.33  0.94  0.36  0.87
OCSVM  FaBP  50          0.82  0.35  0.94  0.37  0.88
OCSVM  FaBP  60          0.82  0.33  0.95  0.37  0.88

Table C.29. NB dataset using NAS: Principal Component Analysis (PCA) with adj Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
PCA  adj   10          0.80  0.20  0.96  0.25  0.78
PCA  adj   20          0.82  0.44  0.92  0.42  0.87
PCA  adj   30          0.81  0.41  0.92  0.38  0.87
PCA  adj   40          0.82  0.39  0.93  0.39  0.88
PCA  adj   50          0.80  0.33  0.93  0.33  0.88
PCA  adj   60          0.79  0.28  0.93  0.28  0.88

Table C.30. NB dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep.
ADM  Rep.     # Features  ACC   Sn    Sp    MCC   AUC
PCA  SimRank  10          0.80  0.33  0.93  0.32  0.83
PCA  SimRank  20          0.80  0.30  0.94  0.32  0.86
PCA  SimRank  30          0.80  0.31  0.93  0.30  0.87
PCA  SimRank  40          0.81  0.31  0.94  0.34  0.88
PCA  SimRank  50          0.81  0.30  0.95  0.33  0.88
PCA  SimRank  60          0.81  0.28  0.95  0.32  0.89

Table C.31. NB dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep.
ADM  Rep.   # Features  ACC   Sn    Sp    MCC   AUC
PCA  ASCOS  10          0.78  0.11  0.96  0.14  0.69
PCA  ASCOS  20          0.83  0.43  0.94  0.43  0.85
PCA  ASCOS  30          0.82  0.39  0.94  0.40  0.87
PCA  ASCOS  40          0.80  0.26  0.95  0.30  0.88
PCA  ASCOS  50          0.82  0.31  0.95  0.36  0.89
PCA  ASCOS  60          0.79  0.22  0.95  0.25  0.88

Table C.32. NB dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep.
ADM  Rep.  # Features  ACC   Sn    Sp    MCC   AUC
PCA  FaBP  10          0.79  0.39  0.90  0.33  0.83
PCA  FaBP  20          0.78  0.33  0.90  0.27  0.86
PCA  FaBP  30          0.79  0.33  0.91  0.29  0.86
PCA  FaBP  40          0.80  0.28  0.94  0.29  0.86
PCA  FaBP  50          0.80  0.28  0.94  0.30  0.87
PCA  FaBP  60          0.80  0.28  0.95  0.31  0.88

Appendix D Performance Comparison On Neuroblastoma (NB) dataset Using NAPS methods

Table D.1. NB dataset using NAPS: RootED with SimRank Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
RootED    SimRank         10          0.79  0.15  0.97  0.21  0.68
RootED    SimRank         20          0.80  0.15  0.98  0.25  0.79
RootED    SimRank         30          0.80  0.24  0.95  0.27  0.87
RootED    SimRank         40          0.80  0.28  0.94  0.30  0.87
RootED    SimRank         50          0.79  0.24  0.94  0.26  0.86
RootED    SimRank         60          0.80  0.30  0.93  0.30  0.87

Table D.2. NB dataset using NAPS: RootED with ASCOS Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
RootED    ASCOS           10          0.79  0.30  0.92  0.27  0.83
RootED    ASCOS           20          0.80  0.24  0.95  0.27  0.86
RootED    ASCOS           30          0.81  0.31  0.94  0.34  0.87
RootED    ASCOS           40          0.79  0.30  0.93  0.28  0.87
RootED    ASCOS           50          0.80  0.30  0.93  0.30  0.88
RootED    ASCOS           60          0.81  0.31  0.95  0.35  0.88

Table D.3. NB dataset using NAPS: RootED with FaBP Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
RootED    FaBP            10          0.80  0.39  0.91  0.35  0.81
RootED    FaBP            20          0.78  0.31  0.91  0.27  0.84
RootED    FaBP            30          0.79  0.30  0.93  0.28  0.86
RootED    FaBP            40          0.80  0.31  0.93  0.31  0.85
RootED    FaBP            50          0.78  0.24  0.93  0.23  0.86
RootED    FaBP            60          0.79  0.19  0.95  0.22  0.86

Table D.4. NB dataset using NAPS: Cosine Distance with SimRank Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
cosine    SimRank         10          0.81  0.19  0.98  0.29  0.72
cosine    SimRank         20          0.82  0.33  0.95  0.37  0.87
cosine    SimRank         30          0.79  0.26  0.93  0.26  0.87
cosine    SimRank         40          0.79  0.26  0.93  0.26  0.86
cosine    SimRank         50          0.78  0.20  0.94  0.20  0.86
cosine    SimRank         60          0.80  0.28  0.94  0.29  0.87

Table D.5. NB dataset using NAPS: Cosine Distance with ASCOS Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
cosine    ASCOS           10          0.80  0.28  0.94  0.29  0.84
cosine    ASCOS           20          0.82  0.37  0.94  0.39  0.86
cosine    ASCOS           30          0.80  0.30  0.93  0.30  0.87
cosine    ASCOS           40          0.80  0.31  0.93  0.30  0.87
cosine    ASCOS           50          0.80  0.33  0.93  0.33  0.87
cosine    ASCOS           60          0.79  0.24  0.94  0.26  0.87

Table D.6. NB dataset using NAPS: Cosine Distance with FaBP Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
cosine    FaBP            10          0.79  0.33  0.92  0.30  0.81
cosine    FaBP            20          0.78  0.35  0.90  0.29  0.85
cosine    FaBP            30          0.80  0.31  0.93  0.31  0.86
cosine    FaBP            40          0.80  0.30  0.94  0.31  0.87
cosine    FaBP            50          0.79  0.28  0.93  0.28  0.86
cosine    FaBP            60          0.78  0.22  0.94  0.22  0.88

Table D.7. NB dataset using NAPS: Bray-Curtis Distance with SimRank Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
bc        SimRank         10          0.79  0.15  0.97  0.21  0.74
bc        SimRank         20          0.78  0.17  0.95  0.19  0.81
bc        SimRank         30          0.81  0.30  0.95  0.34  0.87
bc        SimRank         40          0.80  0.26  0.94  0.28  0.86
bc        SimRank         50          0.78  0.24  0.93  0.23  0.86
bc        SimRank         60          0.79  0.26  0.94  0.27  0.87

Table D.8. NB dataset using NAPS: Bray-Curtis Distance with ASCOS Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
bc        ASCOS           10          0.80  0.30  0.93  0.30  0.84
bc        ASCOS           20          0.82  0.33  0.95  0.37  0.86
bc        ASCOS           30          0.82  0.33  0.95  0.37  0.87
bc        ASCOS           40          0.79  0.31  0.92  0.29  0.87
bc        ASCOS           50          0.80  0.28  0.94  0.29  0.87
bc        ASCOS           60          0.79  0.22  0.94  0.24  0.87

Table D.9. NB dataset using NAPS: Bray-Curtis Distance with FaBP Rep.
Distance  Representation  # Features  ACC   Sn    Sp    MCC   AUC
bc        FaBP            10          0.80  0.39  0.91  0.35  0.81
bc        FaBP            20          0.79  0.33  0.92  0.30  0.84
bc        FaBP            30          0.78  0.28  0.92  0.24  0.85
bc        FaBP            40          0.80  0.30  0.93  0.30  0.86
bc        FaBP            50          0.78  0.20  0.94  0.21  0.86
bc        FaBP            60          0.79  0.20  0.95  0.23  0.86

Bibliography

[1] Hamedani, M. R. and S.-W. Kim (2016) "SimRank and its variants in academic literature data: measures and evaluation," in Proceedings of the 31st Annual ACM Symposium on Applied Computing, ACM, pp. 1102–1107.

[2] Chen, H.-H. and C. L. Giles (2013) "ASCOS: an asymmetric network structure context similarity measure," in 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), IEEE, pp. 442–449.

[3] ——— (2015) "Ascos++: An asymmetric similarity measure for weighted networks to address the problem of simrank," ACM Transactions on Knowledge Discovery from Data (TKDD), 10(2), p. 15.

[4] Koutra, D., N. Shah, J. T. Vogelstein, B. Gallagher, and C. Faloutsos (2016) "DeltaCon: principled massive-graph similarity function with attribution," ACM Transactions on Knowledge Discovery from Data (TKDD), 10(3), p. 28.

[5] Zhao, Y., Z. Nasrullah, and Z. Li (2019) "PyOD: A python toolbox for scalable outlier detection," arXiv preprint arXiv:1901.01588.

[6] Abbas, M., T. Le, H. Bensmail, V. Honavar, and Y. El-Manzalawy (2018) "Microbiomarkers discovery in inflammatory bowel diseases using network-based feature selection," in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM, pp. 172–177.

[7] Zhang, W., Y. Yu, F. Hertwig, J. Thierry-Mieg, W. Zhang, D. Thierry-Mieg, J. Wang, C. Furlanello, V. Devanarayan, J. Cheng, et al. (2015) "Comparison of RNA-seq and microarray-based models for clinical endpoint prediction," Genome biology, 16(1), p. 133.

[8] Group, B. D. W., A. J. Atkinson Jr, W. A. Colburn, V. G. DeGruttola, D. L. DeMets, G. J. Downing, D. F. Hoth, J. A. Oates, C. C. Peck, R. T. Schooley, et al. (2001) "Biomarkers and surrogate endpoints: preferred definitions and conceptual framework," Clinical pharmacology & therapeutics, 69(3), pp. 89–95.

[9] Amur, S. (2019), "BIOMARKER TERMINOLOGY: SPEAKING THE SAME LANGUAGE," https://www.fda.gov/files/BIOMARKER-TERMINOLOGY--SPEAKING-THE-SAME-LANGUAGE.pdf, [Online; accessed 25-June-2019].

[10] Robotti, E., M. Manfredi, E. Marengo, et al. (2014) "Biomarkers discovery through multivariate statistical methods: a review of recently developed methods and applications in proteomics," J Proteomics Bioinform, 3, p. 20.

[11] Weiss, S., Z. Z. Xu, S. Peddada, A. Amir, K. Bittinger, A. Gonzalez, C. Lozupone, J. R. Zaneveld, Y. Vázquez-Baeza, A. Birmingham, et al. (2017) "Normalization and microbial differential abundance strategies depend upon data characteristics," Microbiome, 5(1), p. 27.

[12] Guyon, I. and A. Elisseeff (2003) “An introduction to variable and feature selection,” Journal of machine learning research, 3(Mar), pp. 1157–1182.

[13] Breiman, L. (2001) “Random forests,” Machine learning, 45(1), pp. 5–32.

[14] Rawashdeh, A. and A. L. Ralescu (2015) "Similarity Measure for Social Networks - A Brief Survey," in Maics, pp. 153–159.

[15] Martínez, V., F. Berzal, and J.-C. Cubero (2017) "A survey of link prediction in complex networks," ACM Computing Surveys (CSUR), 49(4), p. 69.

[16] Chandola, V., A. Banerjee, and V. Kumar (2009) "Anomaly detection: A survey," ACM computing surveys (CSUR), 41(3), p. 15.

[17] Zhou, Y., H. Cheng, and J. X. Yu (2009) "Graph clustering based on structural/attribute similarities," Proceedings of the VLDB Endowment, 2(1), pp. 718–729.

[18] Zhang, B., H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma (2005) "Improving web search results using affinity graph," in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp. 504–511.

[19] Heymans, M. and A. K. Singh (2003) “Deriving phylogenetic trees from the similarity analysis of metabolic pathways,” Bioinformatics, 19(suppl 1), pp. i138–i146.

[20] Liben-Nowell, D. and J. Kleinberg (2007) “The link-prediction problem for social networks,” Journal of the American society for information science and technology, 58(7), pp. 1019–1031.

[21] Newman, M. E. (2001) “Clustering and preferential attachment in growing networks,” Physical review E, 64(2), p. 025102.

[22] Adamic, L. A. and E. Adar (2003) “Friends and neighbors on the web,” Social networks, 25(3), pp. 211–230.

[23] Ravasz, E., A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A.-L. Barabási (2002) “Hierarchical organization of modularity in metabolic networks,” Science, 297(5586), pp. 1551–1555.

[24] Lü, L. and T. Zhou (2011) “Link prediction in complex networks: A survey,” Physica A: statistical mechanics and its applications, 390(6), pp. 1150–1170.

[25] Jaccard, P. (1901) “Étude comparative de la distribution florale dans une portion des Alpes et des Jura,” Bull Soc Vaudoise Sci Nat, 37, pp. 547–579.

[26] Leicht, E. A., P. Holme, and M. E. Newman (2006) “Vertex similarity in networks,” Physical Review E, 73(2), p. 026120.

[27] Barabási, A.-L. and R. Albert (1999) “Emergence of scaling in random networks,” Science, 286(5439), pp. 509–512.

[28] Zhou, T., L. Lü, and Y.-C. Zhang (2009) “Predicting missing links via local information,” The European Physical Journal B, 71(4), pp. 623–630.

[29] Salton, G. and M. J. McGill (1986) Introduction to modern information retrieval, McGraw-Hill.

[30] Hamers, L. et al. (1989) “Similarity measures in scientometric research: The Jaccard index versus Salton’s cosine formula,” Information Processing and Management, 25(3), pp. 315–318.

[31] Sørensen, T. (1948) “A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons,” Biol. Skr., 5, pp. 1–34.

[32] Jeh, G. and J. Widom (2002) “SimRank: a measure of structural-context similarity,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 538–543.

[33] Lü, L., M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou (2012) “Recommender systems,” Physics reports, 519(1), pp. 1–49.

[34] Yu, W., W. Zhang, X. Lin, Q. Zhang, and J. Le (2012) “A space and time efficient algorithm for SimRank computation,” World Wide Web, 15(3), pp. 327–353.

[35] Kusumoto, M., T. Maehara, and K.-i. Kawarabayashi (2014) “Scalable similarity search for SimRank,” in Proceedings of the 2014 ACM SIGMOD international conference on Management of data, ACM, pp. 325–336.

[36] He, G., H. Feng, C. Li, and H. Chen (2010) “Parallel SimRank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 543–552.

[37] Li, C., J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu (2010) “Fast computation of SimRank for static and dynamic information networks,” in Proceedings of the 13th International Conference on Extending Database Technology, ACM, pp. 465–476.

[38] Yoon, S.-H., S.-W. Kim, and S. Park (2016) “C-Rank: A link-based similarity measure for scientific literature databases,” Information Sciences, 326, pp. 25–40.

[39] Fogaras, D. and B. Rácz (2005) “Scaling link-based similarity search,” in Proceedings of the 14th international conference on World Wide Web, ACM, pp. 641–650.

[40] Yu, W., X. Lin, W. Zhang, L. Chang, and J. Pei (2013) “More is simpler: Effectively and efficiently assessing node-pair similarities based on hyperlinks,” Proceedings of the VLDB Endowment, 7(1), pp. 13–24.

[41] Yoon, S.-H., J.-S. Kim, J. Ha, S.-W. Kim, M. Ryu, and H.-J. Choi (2014) “Link-based similarity measures using reachability vectors,” The Scientific World Journal, 2014.

[42] Zhao, P., J. Han, and Y. Sun (2009) “P-Rank: a comprehensive structural similarity measure over information networks,” in Proceedings of the 18th ACM conference on Information and knowledge management, ACM, pp. 553–562.

[43] Yoon, S.-H., S.-W. Kim, and S. Park (2011) “C-Rank: A Link-based Similarity Measure for Scientific Literature Databases,” arXiv preprint arXiv:1109.1059.

[44] Lin, Z., M. R. Lyu, and I. King (2009) “MatchSim: a novel neighbor-based similarity measure with maximum neighborhood matching,” in Proceedings of the 18th ACM conference on Information and knowledge management, ACM, pp. 1613–1616.

[45] Tversky, A. (1977) “Features of similarity,” Psychological review, 84(4), p. 327.

[46] Antonellis, I., H. Garcia-Molina, and C. C. Chang (2008) “Simrank++: query rewriting through link analysis of the click graph,” Proceedings of the VLDB Endowment, 1(1), pp. 408–421.

[47] Koutra, D., J. T. Vogelstein, and C. Faloutsos (2013) “DeltaCon: A principled massive-graph similarity function,” in Proceedings of the 2013 SIAM International Conference on Data Mining, SIAM, pp. 162–170.

[48] Koutra, D. and C. Faloutsos (2017) “Individual and collective graph mining: principles, algorithms, and applications,” Synthesis Lectures on Data Mining and Knowledge Discovery, 9(2), pp. 1–206.

[49] Papadimitriou, P., A. Dasdan, and H. Garcia-Molina (2010) “Web graph similarity for anomaly detection,” Journal of Internet Services and Applications, 1(1), pp. 19–30.

[50] Haveliwala, T. H. (2003) “Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search,” IEEE transactions on knowledge and data engineering, 15(4), pp. 784–796.

[51] Aldous, D. and J. Fill (1995), “Reversible Markov chains and random walks on graphs.”

[52] Doyle, P. G. and J. L. Snell (2000) “Random walks and electric networks,” arXiv preprint math/0001057.

[53] Koutra, D., T.-Y. Ke, U. Kang, D. H. P. Chau, H.-K. K. Pao, and C. Faloutsos (2011) “Unifying guilt-by-association approaches: Theorems and fast algorithms,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp. 245–260.

[54] Gatterbauer, W., S. Günnemann, D. Koutra, and C. Faloutsos (2015) “Linearized and single-pass belief propagation,” Proceedings of the VLDB Endowment, 8(5), pp. 581–592.

[55] Grubbs, F. E. (1969) “Procedures for detecting outlying observations in samples,” Technometrics, 11(1), pp. 1–21.

[56] Goldstein, M. and S. Uchida (2016) “A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data,” PloS one, 11(4), p. e0152173.

[57] Ding, Q., N. Katenka, P. Barford, E. Kolaczyk, and M. Crovella (2012) “Intrusion as (anti) social communication: characterization and detection,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 886–894.

[58] Sun, J., Y. Xie, H. Zhang, and C. Faloutsos (2008) “Less is more: Sparse graph mining with compact matrix decomposition,” Statistical Analysis and Data Mining: The ASA Data Science Journal, 1(1), pp. 6–22.

[59] Bolton, R. J., D. J. Hand, et al. (2001) “Unsupervised profiling methods for fraud detection,” Credit Scoring and Credit Control VII, pp. 235–255.

[60] Kumar, M., R. Ghani, and Z.-S. Mei (2010) “Data mining to predict and prevent errors in health insurance claims processing,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 65–74.

[61] Lin, J., E. Keogh, A. Fu, and H. Van Herle (2005) “Approximations to magic: Finding unusual medical time series,” in 18th IEEE Symposium on Computer-Based Medical Systems (CBMS’05), Citeseer, pp. 329–334.

[62] Schubert, E., A. Zimek, and H.-P. Kriegel (2014) “Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection,” Data Mining and Knowledge Discovery, 28(1), pp. 190–237.

[63] Akoglu, L., H. Tong, and D. Koutra (2015) “Graph based anomaly detection and description: a survey,” Data mining and knowledge discovery, 29(3), pp. 626–688.

[64] Aggarwal, C. C. and S. Sathe (2017) Outlier ensembles: An introduction, Springer.

[65] Sakurada, M. and T. Yairi (2014) “Anomaly detection using autoencoders with nonlinear dimensionality reduction,” in Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, ACM, p. 4.

[66] He, Z., X. Xu, and S. Deng (2003) “Discovering cluster-based local outliers,” Pattern Recognition Letters, 24(9-10), pp. 1641–1650.

[67] Han, J., J. Pei, and M. Kamber (2011) Data mining: concepts and techniques, Elsevier.

[68] Amer, M. and M. Goldstein (2012) “Nearest-neighbor and clustering based anomaly detection algorithms for RapidMiner,” in Proc. of the 3rd RapidMiner Community Meeting and Conference (RCOMM 2012), pp. 1–12.

[69] Goldstein, M. and A. Dengel (2012) “Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm,” KI-2012: Poster and Demo Track, pp. 59–63.

[70] Liu, F. T., K. M. Ting, and Z.-H. Zhou (2008) “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, IEEE, pp. 413–422.

[71] ——— (2012) “Isolation-based anomaly detection,” ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), p. 3.

[72] Breunig, M. M., H.-P. Kriegel, R. T. Ng, and J. Sander (2000) “LOF: identifying density-based local outliers,” in ACM SIGMOD Record, vol. 29, ACM, pp. 93–104.

[73] Rousseeuw, P. J. and K. V. Driessen (1999) “A fast algorithm for the minimum covariance determinant estimator,” Technometrics, 41(3), pp. 212–223.

[74] Hardin, J. and D. M. Rocke (2004) “Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator,” Computational Statistics & Data Analysis, 44(4), pp. 625–638.

[75] Hubert, M., M. Debruyne, and P. J. Rousseeuw (2018) “Minimum covariance determinant and extensions,” Wiley Interdisciplinary Reviews: Computational Statistics, 10(3), p. e1421.

[76] Schölkopf, B., K.-K. Sung, C. J. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik (1997) “Comparing support vector machines with Gaussian kernels to radial basis function classifiers,” IEEE Transactions on Signal Processing, 45(11), pp. 2758–2765.

[77] Chang, Y.-W., C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin (2010) “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, 11(Apr), pp. 1471–1490.

[78] Schölkopf, B., J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) “Estimating the support of a high-dimensional distribution,” Neural computation, 13(7), pp. 1443–1471.

[79] Shyu, M.-L., S.-C. Chen, K. Sarinnapakorn, and L. Chang (2003) A novel anomaly detection scheme based on principal component classifier, Tech. rep., Miami Univ Coral Gables FL Dept of Electrical and Computer Engineering.

[80] Strimbu, K. and J. A. Tavel (2010) “What are biomarkers?” Current Opinion in HIV and AIDS, 5(6), p. 463.

[81] Gevers, D., S. Kugathasan, L. A. Denson, Y. Vázquez-Baeza, W. Van Treuren, B. Ren, E. Schwager, D. Knights, S. J. Song, M. Yassour, et al. (2014) “The treatment-naive microbiome in new-onset Crohn’s disease,” Cell host & microbe, 15(3), pp. 382–392.

[82] Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth (2015) “limma powers differential expression analyses for RNA-sequencing and microarray studies,” Nucleic acids research, 43(7), pp. e47–e47.

[83] Abbas, M., J. Matta, T. Le, H. Bensmail, T. Obafemi-Ajayi, V. Honavar, and E.-M. Yasser (2019) “Biomarker discovery in inflammatory bowel diseases using network-based feature selection,” bioRxiv, p. 662197.

[84] Faust, K., J. F. Sathirapongsasuti, J. Izard, N. Segata, D. Gevers, J. Raes, and C. Huttenhower (2012) “Microbial co-occurrence relationships in the human microbiome,” PLoS computational biology, 8(7), p. e1002606.

[85] Faust, K. and J. Raes (2016) “CoNet app: inference of biological association networks using Cytoscape,” F1000Research, 5.

[86] Faust, K., G. Lima-Mendez, J.-S. Lerat, J. F. Sathirapongsasuti, R. Knight, C. Huttenhower, T. Lenaerts, and J. Raes (2015) “Cross-biome comparison of microbial association networks,” Frontiers in microbiology, 6, p. 1200.

[87] Hagberg, A., P. Swart, and D. S. Chult (2008) Exploring network structure, dynamics, and function using NetworkX, Tech. rep., Los Alamos National Lab. (LANL), Los Alamos, NM (United States).

[88] Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) “Scikit-learn: Machine learning in Python,” Journal of machine learning research, 12(Oct), pp. 2825–2830.

[89] Baldi, P., S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen (2000) “Assessing the accuracy of prediction algorithms for classification: an overview,” Bioinformatics, 16(5), pp. 412–424.

[90] Fawcett, T. (2006) “An introduction to ROC analysis,” Pattern recognition letters, 27(8), pp. 861–874.

[91] Bradley, A. P. (1997) “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern recognition, 30(7), pp. 1145–1159.