The Pennsylvania State University
The Graduate School

BIOMARKERS DISCOVERY USING NETWORK BASED ANOMALY DETECTION

A Thesis in
Computer Science and Engineering
by
Cheng-Kai Chen

© 2019 Cheng-Kai Chen

Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science

August 2019

The thesis of Cheng-Kai Chen was reviewed and approved* by the following:

Vasant Honavar
Professor of Computer Science and Engineering
Professor of Information Sciences and Technology
Thesis Advisor

Kamesh Madduri
Associate Professor of Computer Science and Engineering

Chitaranjan R. Das
Distinguished Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

*Signatures are on file in the Graduate School.

Abstract

Identifying biomarkers is an important step in translating research advances in genomics into clinical practice. From a machine learning perspective, computational biomarker identification can be implemented using a broad range of feature selection methods. In this thesis, we consider an alternative approach, the Network-Based Biomarker Discovery (NBBD) framework. As the name suggests, NBBD uses network representations of the input data to identify potential biomarkers (i.e., discriminative features for training machine learning classifiers). NBBD consists of two main customizable modules: a Network Inference Module and a Node Importance Scoring Module. The Network Inference Module creates ecological networks from a given dataset. The Node Importance Scoring Module computes a score for each node based on the difference between two ecological networks. However, most of the node scoring methods used in NBBD are based on nodes' local topological properties. To date, NBBD has been successfully applied to metagenomics data.

In this thesis, we extend two aspects of the earlier work on NBBD: i) we propose two novel node importance scoring methods based on node anomaly scores and differences in nodes' global profiles; ii) we demonstrate the applicability of NBBD to neuroblastoma biomarker discovery from gene expression data. Our computational results show that our methods can outperform the local node importance scoring methods and are comparable to state-of-the-art feature selection methods, including Random Forest Feature Importance and Information Gain.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1  Introduction

Chapter 2  Similarity in Graphs
  2.1 Vertex Similarity
    2.1.1 Local Approaches
      2.1.1.1 Common Neighbor (CN)
      2.1.1.2 The Adamic-Adar Index (AA)
      2.1.1.3 The Hub Promoted Index (HPI)
      2.1.1.4 The Hub Depressed Index (HDI)
      2.1.1.5 Jaccard Index (JA)
      2.1.1.6 The Local Leicht-Holme-Newman Index (LLHN)
      2.1.1.7 The Preferential Attachment Index (PA)
      2.1.1.8 The Resource Allocation Index (RA)
      2.1.1.9 The Salton Index (SA)
    2.1.2 The Sørensen Index (SO)
    2.1.3 Global Approaches
      2.1.3.1 SimRank
      2.1.3.2 Asymmetric Structure COntext Similarity (ASCOS)
  2.2 Graph Similarity
    2.2.1 Measuring Node Affinities: FaBP
    2.2.2 Distance Measure Between Graphs
    2.2.3 DeltaCon Node Attribution Function

Chapter 3  Anomaly Detection Methods
  3.1 Auto-Encoder
  3.2 Clustering-Based Local Outlier Factor (CBLOF)
  3.3 Histogram-based Outlier Score (HBOS)
  3.4 Isolation Forest (IForest)
  3.5 Local Outlier Factor (LOF)
  3.6 Minimum Covariance Determinant (MCD)
  3.7 One-Class Support Vector Machines (OCSVM)
  3.8 Principal Component Analysis (PCA)

Chapter 4  Graph Based Feature Selection Methods and Their Application in Biomarker Discovery
  4.1 Methods
    4.1.1 Datasets
      4.1.1.1 Inflammatory Bowel Diseases (IBD) dataset
      4.1.1.2 Neuroblastoma (NB) dataset
    4.1.2 Network-Based Biomarkers Discovery (NBBD) framework
    4.1.3 Proposed Node Importance Scoring Methods
      4.1.3.1 Node Anomaly Scoring (NAS)
      4.1.3.2 Node Attribution Profile Scoring (NAPS)
    4.1.4 Experiments
  4.2 Results and Discussion
    4.2.1 Performance comparisons using IBD dataset
    4.2.2 Performance comparisons using NB dataset
  4.3 Conclusion

Chapter 5  Conclusion

Appendix A  Performance Comparison on Inflammatory Bowel Disease (IBD) dataset Using NAS methods
Appendix B  Performance Comparison on Inflammatory Bowel Disease (IBD) dataset Using NAPS methods
Appendix C  Performance Comparison on Neuroblastoma (NB) dataset Using NAS methods
Appendix D  Performance Comparison on Neuroblastoma (NB) dataset Using NAPS methods

Bibliography

List of Figures

2.1 Notations
2.2 A Sample Citation Graph (adopted from Hamedani et al. [1])
2.3 A toy network (adopted from [2])
2.4 A toy network with edge weights (adopted from [3])
2.5 Symbols and Definitions for DeltaCon
2.6 Toy Networks
2.7 Algorithm: DeltaCon (adopted from [4])
2.8 Algorithm: DeltaCon Node Attribution (adopted from [4])
3.1 Select Outlier Detection Models in PyOD (adopted from [5])
4.1 NBBD framework overview (adopted from [6])
4.2 NBBD framework overview with two different node scoring methods

List of Tables

4.1 Performance comparisons on IBD dataset of RF classifiers trained using different feature selection methods: Information Gain (IG), RF Feature Importance (RFFI), and NBBD with three node topological properties.
4.2 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
4.3 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations.
4.4 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
4.5 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations.
4.6 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations.

A.1 IBD dataset using NAS: Auto-Encoder with adj Rep.
A.2 IBD dataset using NAS: Auto-Encoder with SimRank Rep.
A.3 IBD dataset using NAS: Auto-Encoder with ASCOS Rep.
A.4 IBD dataset using NAS: Auto-Encoder with FaBP Rep.
A.5 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep.
A.6 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep.
A.7 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep.
A.8 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep.
A.9 IBD dataset using NAS: Histogram-based Outlier Score with adj Rep.
A.10 IBD dataset using NAS: Histogram-based Outlier Score with SimRank Rep.
A.11 IBD dataset using NAS: Histogram-based Outlier Score with ASCOS Rep.
A.12 IBD dataset using NAS: Histogram-based Outlier Score with FaBP Rep.
A.13 IBD dataset using NAS: Isolation Forest (IForest) with adj Rep.
A.14 IBD dataset using NAS: Isolation Forest (IForest) with SimRank Rep.
A.15 IBD dataset using NAS: Isolation Forest (IForest) with ASCOS Rep.
A.16 IBD dataset using NAS: Isolation Forest (IForest) with FaBP Rep.
A.17 IBD dataset using NAS: Local Outlier Factor (LOF) with adj Rep.
A.18 IBD dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep.
A.19 IBD dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep.
A.20 IBD dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep.
A.21 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep.
A.22 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep.
A.23 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep.
A.24 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep.
A.25 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep.
A.26 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep.
A.27 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep.
A.28 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep.
A.29 IBD dataset using NAS: Principal Component Analysis (PCA) with adj Rep.
A.30 IBD dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep.
A.31 IBD dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep.
A.32 IBD dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep.

B.1 IBD dataset using NAPS: RootED with SimRank Rep.
B.2 IBD dataset using NAPS: RootED with ASCOS Rep.
B.3 IBD dataset using NAPS: RootED with FaBP Rep.
B.4 IBD dataset using NAPS: Cosine Distance with SimRank Rep.
B.5 IBD dataset using NAPS: Cosine Distance with ASCOS Rep.
B.6 IBD dataset using NAPS: Cosine Distance with FaBP Rep.
B.7 IBD dataset using NAPS: Bray-Curtis Distance with SimRank Rep.
B.8 IBD dataset using NAPS: Bray-Curtis Distance with ASCOS Rep.
B.9 IBD dataset using NAPS: Bray-Curtis Distance with FaBP Rep.

C.1 NB dataset using NAS: Auto-Encoder with adj Rep.
C.2 NB dataset using NAS: Auto-Encoder with SimRank Rep.
C.3 NB dataset using NAS: Auto-Encoder with ASCOS Rep.
C.4 NB dataset using NAS: Auto-Encoder with FaBP Rep.
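The abstract above describes the two customizable NBBD modules: a Network Inference Module that builds one ecological network per group, and a Node Importance Scoring Module that scores each node by how much it differs between the two networks, with the proposed methods relying on node anomaly scores (NAS) and on differences in nodes' global profiles (NAPS). The short Python sketch below is only an illustration of that idea, not the thesis implementation: the correlation-threshold network inference, the Euclidean distance between node profiles, and the use of scikit-learn's IsolationForest as the anomaly detector are assumptions made here for concreteness.

    # Illustrative sketch of NBBD-style node importance scoring (assumed, not the thesis code).
    import numpy as np
    from sklearn.ensemble import IsolationForest

    def infer_network(X, threshold=0.5):
        """Toy network inference: threshold the absolute Pearson correlation
        between features (columns of X) to get an adjacency matrix."""
        corr = np.abs(np.corrcoef(X, rowvar=False))
        np.fill_diagonal(corr, 0.0)
        return (corr >= threshold).astype(float)

    def node_profile_scores(A_case, A_control):
        """NAPS-like scoring: distance between each node's row (its global
        profile) in the two networks; larger distance = more discriminative."""
        return np.linalg.norm(A_case - A_control, axis=1)

    def node_anomaly_scores(A_case, A_control):
        """NAS-like scoring: treat each node's profile difference as one point
        and let an anomaly detector assign an outlier score to every node."""
        diff = A_case - A_control
        iforest = IsolationForest(random_state=0).fit(diff)
        return -iforest.score_samples(diff)  # higher = more anomalous

    # Usage with synthetic data: X_case and X_control are (samples x features)
    # matrices for the two groups; the top-ranked nodes are candidate biomarkers.
    rng = np.random.default_rng(0)
    X_case, X_control = rng.normal(size=(30, 50)), rng.normal(size=(30, 50))
    A_case, A_control = infer_network(X_case), infer_network(X_control)
    top_features = np.argsort(node_anomaly_scores(A_case, A_control))[::-1][:10]

In the thesis itself, the node representations come from the graph similarity measures of Chapter 2 (adjacency, SimRank, ASCOS, and FaBP) and the anomaly detectors from Chapter 3, which is the grid of combinations enumerated in the appendix tables above.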