The Pennsylvania State University
The Graduate School

BIOMARKERS DISCOVERY USING NETWORK BASED ANOMALY DETECTION

A Thesis in Computer Science and Engineering
by Cheng-Kai Chen

© 2019 Cheng-Kai Chen

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science, August 2019

The thesis of Cheng-Kai Chen was reviewed and approved* by the following:

Vasant Honavar, Professor of Computer Science and Engineering, Professor of Information Sciences and Technology, Thesis Advisor
Kamesh Madduri, Associate Professor of Computer Science and Engineering
Chitaranjan R. Das, Distinguished Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering

*Signatures are on file in the Graduate School.

Abstract

Identifying biomarkers is an important step in translating research advances in genomics into clinical practice. From a machine learning perspective, computational biomarker identification can be implemented using a broad range of feature selection methods. In this thesis, we consider an alternative approach, the Network-Based Biomarker Discovery (NBBD) framework. As the name suggests, NBBD uses network representations of the input data to identify potential biomarkers (i.e., discriminative features for training machine learning classifiers). NBBD consists of two main customizable modules: a Network Inference Module and a Node Importance Scoring Module. The Network Inference Module creates ecological networks from a given dataset. The Node Importance Scoring Module computes a score for each node based on the difference between two ecological networks. However, most of the node scoring methods used in NBBD are based on nodes' local topological properties. To date, NBBD has been successfully applied to metagenomics data. In this thesis, we extend two aspects of the earlier work on NBBD: i) we propose two novel node importance scoring methods based on node anomaly scores and differences in nodes' global profiles; ii) we demonstrate the applicability of NBBD for Neuroblastoma biomarker discovery from gene expression data. Our computational results show that our methods can outperform the local node importance scoring methods and are comparable to state-of-the-art feature selection methods, including Random Forest Feature Importance and Information Gain.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1: Introduction

Chapter 2: Similarity in Graphs
  2.1 Vertex Similarity
    2.1.1 Local Approaches
      2.1.1.1 Common Neighbor (CN)
      2.1.1.2 The Adamic-Adar Index (AA)
      2.1.1.3 The Hub Promoted Index (HPI)
      2.1.1.4 The Hub Depressed Index (HDI)
      2.1.1.5 Jaccard Index (JA)
      2.1.1.6 The Local Leicht-Holme-Newman Index (LLHN)
      2.1.1.7 The Preferential Attachment Index (PA)
      2.1.1.8 The Resource Allocation Index (RA)
      2.1.1.9 The Salton Index (SA)
    2.1.2 The Sørensen Index (SO)
    2.1.3 Global Approaches
      2.1.3.1 SimRank
      2.1.3.2 Asymmetric Structure COntext Similarity (ASCOS)
  2.2 Graph Similarity
    2.2.1 Measuring Node Affinities: FaBP
    2.2.2 Distance Measure Between Graphs
    2.2.3 DeltaCon Node Attribution Function

Chapter 3: Anomaly Detection Methods
  3.1 Auto-Encoder
  3.2 Clustering-Based Local Outlier Factor (CBLOF)
  3.3 Histogram-based Outlier Score (HBOS)
  3.4 Isolation Forest (IForest)
  3.5 Local Outlier Factor (LOF)
  3.6 Minimum Covariance Determinant (MCD)
  3.7 One-Class Support Vector Machines (OCSVM)
  3.8 Principal Component Analysis (PCA)

Chapter 4: Graph Based Feature Selection Methods and Their Application in Biomarker Discovery
  4.1 Methods
    4.1.1 Datasets
      4.1.1.1 Inflammatory Bowel Diseases (IBD) dataset
      4.1.1.2 Neuroblastoma (NB) dataset
    4.1.2 Network-Based Biomarkers Discovery (NBBD) framework
    4.1.3 Proposed Node Importance Scoring Methods
      4.1.3.1 Node Anomaly Scoring (NAS)
      4.1.3.2 Node Attribution Profile Scoring (NAPS)
    4.1.4 Experiments
  4.2 Results and Discussion
    4.2.1 Performance comparisons using IBD dataset
    4.2.2 Performance comparisons using NB dataset
  4.3 Conclusion

Chapter 5: Conclusion

Appendix A: Performance Comparison on Inflammatory Bowel Disease (IBD) dataset Using NAS methods
Appendix B: Performance Comparison on Inflammatory Bowel Disease (IBD) dataset Using NAPS methods
Appendix C: Performance Comparison on Neuroblastoma (NB) dataset Using NAS methods
Appendix D: Performance Comparison on Neuroblastoma (NB) dataset Using NAPS methods

Bibliography

List of Figures

2.1 Notations
2.2 A sample citation graph (adapted from Hamedani et al. [1])
2.3 A toy network (adapted from [2])
2.4 A toy network with edge weights (adapted from [3])
2.5 Symbols and definitions for DeltaCon
2.6 Toy networks
2.7 Algorithm: DeltaCon (adapted from [4])
2.8 Algorithm: DeltaCon Node Attribution (adapted from [4])
3.1 Select outlier detection models in PyOD (adapted from [5])
4.1 NBBD framework overview (adapted from [6])
4.2 NBBD framework overview with two different node scoring methods

List of Tables

4.1 Performance comparisons on IBD dataset of RF classifiers trained using different feature selection methods: Information Gain (IG), RF Feature Importance (RFFI), and NBBD using three node topological properties
4.2 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations
4.3 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations
4.4 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations
4.5 Performance comparison on NB dataset of top performing (in terms of AUC) RF classifiers using different Anomaly Detection (AD) methods and input data representations
4.6 Performance comparison on IBD dataset of top performing (in terms of AUC) RF classifiers using three different distance functions and different input data representations

A.1 IBD dataset using NAS: Auto-Encoder with adj Rep.
A.2 IBD dataset using NAS: Auto-Encoder with SimRank Rep.
A.3 IBD dataset using NAS: Auto-Encoder with ASCOS Rep.
A.4 IBD dataset using NAS: Auto-Encoder with FaBP Rep.
A.5 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with adj Rep.
A.6 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with SimRank Rep.
A.7 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with ASCOS Rep.
A.8 IBD dataset using NAS: Clustering-Based Local Outlier Factor (CBLOF) with FaBP Rep.
A.9 IBD dataset using NAS: Histogram-based Outlier Score with adj Rep.
A.10 IBD dataset using NAS: Histogram-based Outlier Score with SimRank Rep.
A.11 IBD dataset using NAS: Histogram-based Outlier Score with ASCOS Rep.
A.12 IBD dataset using NAS: Histogram-based Outlier Score with FaBP Rep.
A.13 IBD dataset using NAS: Isolation Forest (IForest) with adj Rep.
A.14 IBD dataset using NAS: Isolation Forest (IForest) with SimRank Rep.
A.15 IBD dataset using NAS: Isolation Forest (IForest) with ASCOS Rep.
A.16 IBD dataset using NAS: Isolation Forest (IForest) with FaBP Rep.
A.17 IBD dataset using NAS: Local Outlier Factor (LOF) with adj Rep.
A.18 IBD dataset using NAS: Local Outlier Factor (LOF) with SimRank Rep.
A.19 IBD dataset using NAS: Local Outlier Factor (LOF) with ASCOS Rep.
A.20 IBD dataset using NAS: Local Outlier Factor (LOF) with FaBP Rep.
A.21 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with adj Rep.
A.22 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with SimRank Rep.
A.23 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with ASCOS Rep.
A.24 IBD dataset using NAS: Minimum Covariance Determinant (MCD) with FaBP Rep.
A.25 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with adj Rep.
A.26 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with SimRank Rep.
A.27 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with ASCOS Rep.
A.28 IBD dataset using NAS: One-Class Support Vector Machines (OCSVM) with FaBP Rep.
A.29 IBD dataset using NAS: Principal Component Analysis (PCA) with adj Rep.
A.30 IBD dataset using NAS: Principal Component Analysis (PCA) with SimRank Rep.
A.31 IBD dataset using NAS: Principal Component Analysis (PCA) with ASCOS Rep.
A.32 IBD dataset using NAS: Principal Component Analysis (PCA) with FaBP Rep.

B.1 IBD dataset using NAPS: RootED with SimRank Rep.
B.2 IBD dataset using NAPS: RootED with ASCOS Rep.
B.3 IBD dataset using NAPS: RootED with FaBP Rep.
B.4 IBD dataset using NAPS: Cosine Distance with SimRank Rep.
B.5 IBD dataset using NAPS: Cosine Distance with ASCOS Rep.
B.6 IBD dataset using NAPS: Cosine Distance with FaBP Rep.
B.7 IBD dataset using NAPS: Bray-Curtis Distance with SimRank Rep.
B.8 IBD dataset using NAPS: Bray-Curtis Distance with ASCOS Rep.
B.9 IBD dataset using NAPS: Bray-Curtis Distance with FaBP Rep.

C.1 NB dataset using NAS: Auto-Encoder with adj Rep.
C.2 NB dataset using NAS: Auto-Encoder with SimRank Rep.
C.3 NB dataset using NAS: Auto-Encoder with ASCOS Rep.
C.4 NB dataset using NAS: Auto-Encoder with FaBP Rep.
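The abstract's Node Anomaly Scoring (NAS) idea — score each network node with an off-the-shelf anomaly detector applied to a per-node representation — can be sketched with PyOD, the library the thesis cites for its detectors. A minimal sketch, assuming a node-by-feature matrix such as the rows of an adjacency or SimRank matrix; the function and parameter choices here are illustrative, not the thesis code:

```python
# Minimal sketch of node anomaly scoring (NAS): score each node of a
# network by running an off-the-shelf detector (here PyOD's LOF) on a
# per-node representation such as rows of the adjacency matrix.
import numpy as np
from pyod.models.lof import LOF

def node_anomaly_scores(node_repr: np.ndarray) -> np.ndarray:
    """node_repr: (n_nodes, n_features), e.g. adjacency or SimRank rows."""
    detector = LOF(n_neighbors=20)
    detector.fit(node_repr)
    return detector.decision_scores_      # higher = more anomalous

# Toy usage: score nodes of a random undirected graph by adjacency rows.
rng = np.random.default_rng(0)
adj = (rng.random((50, 50)) < 0.1).astype(float)
adj = np.maximum(adj, adj.T)              # symmetrize
scores = node_anomaly_scores(adj)
print(scores.argsort()[-5:])              # five most anomalous nodes
```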
Recommended publications
  • FastLOF: An Expectation-Maximization Based Local Outlier Detection Algorithm
    FastLOF: An Expectation-Maximization based Local Outlier Detection Algorithm. Markus Goldstein, German Research Center for Artificial Intelligence (DFKI), Kaiserslautern. www.dfki.de, [email protected]

    Introduction. Anomaly detection finds outliers in data sets which occur very rarely in the data and whose features significantly deviate from the normal data. Three different anomaly detection setups exist [4]: (1) supervised anomaly detection (labeled training and test set); (2) semi-supervised anomaly detection (training with normal data only and labeled test set); (3) unsupervised anomaly detection (one data set without any labels). In this work, we present an unsupervised algorithm which scores instances in a given data set according to their outlierness.

    Performance improvement attempts. Space partitioning algorithms (e.g. search trees) require time to build the tree structure and can be slow when having many dimensions. Locality Sensitive Hashing (LSH) approximates neighbors well in dense areas but performs poorly for outliers.

    FastLOF idea. Estimate the nearest neighbors approximately for dense areas and compute exact neighbors for sparse areas. Expectation step: find some (approximately correct) neighbors and estimate LRD/LOF based on them. Maximization step: for promising candidates (LOF > θ), find better neighbors.

    Algorithm 1 (the FastLOF algorithm). Input: D = d1,...,dn, a data set with N instances; c, chunk size (e.g. √N); θ, threshold for LOF; k, number of nearest neighbors. Output: LOF = lof1,...,lofn:
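For reference, the exact-LOF baseline that FastLOF approximates can be computed with scikit-learn; this sketch is the plain batch algorithm, not FastLOF's chunked expectation-maximization loop:

```python
# Baseline LOF scoring that FastLOF approximates; scikit-learn computes
# exact k-NN for every point, which is the cost FastLOF tries to avoid.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(1).normal(size=(1000, 8))
X[:5] += 6                                # plant a few obvious outliers

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_    # higher = more outlying
print(scores.argsort()[-5:])              # indices of the top-5 outliers
```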
  • Incremental Local Outlier Detection for Data Streams
    IEEE Symposium on Computational Intelligence and Data Mining (CIDM), April 2007. Incremental Local Outlier Detection for Data Streams. Dragoljub Pokrajac (CIS Dept. and AMRC, Delaware State University, Dover, DE 19901), Aleksandar Lazarevic (United Tech. Research Center, 411 Silver Lane, MS 129-15, East Hartford, CT 06108, USA), Longin Jan Latecki (CIS Department, Temple University, Philadelphia, PA 19122).

    Abstract. Outlier detection has recently become an important problem in many industrial and financial applications. This problem is further complicated by the fact that in many cases, outliers have to be detected from data streams that arrive at an enormous pace. In this paper, an incremental LOF (Local Outlier Factor) algorithm, appropriate for detecting outliers in data streams, is proposed. The proposed incremental LOF algorithm provides equivalent detection performance as the iterated static LOF algorithm (applied after insertion of each data record), while requiring significantly less computational time. In addition, the incremental LOF algorithm also dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time.

    … have labeled data, which can be extremely time consuming for real life applications, and (2) inability to detect new types of rare events. In contrast, unsupervised learning methods typically do not require labeled data and detect outliers as data points that are very different from the normal (majority) data based on some measure [3]. These methods are typically called outlier/anomaly detection techniques, and their success depends on the choice of similarity measures, feature selection and weighting, etc. They have the advantage of detecting new types of rare events as deviations from normal behavior, but on the other hand they suffer from a possible high rate of false …
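The paper's incremental update is not reproduced here, but a naive sliding-window baseline — refit LOF on recent history and score each arriving record — illustrates the streaming setting it improves on; the window size and k below are arbitrary:

```python
# Naive sliding-window baseline for LOF on a stream: refit on recent
# history and score each arriving record. This is NOT the paper's
# incremental LOF (which updates only the affected neighborhoods); it is
# the straightforward baseline such methods improve upon.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def windowed_lof_score(history: np.ndarray, x_new: np.ndarray, k: int = 10) -> float:
    lof = LocalOutlierFactor(n_neighbors=k, novelty=True).fit(history)
    return float(-lof.score_samples(x_new.reshape(1, -1))[0])  # higher = more outlying

rng = np.random.default_rng(2)
stream = rng.normal(size=(500, 4))
W, history, scores = 200, [], []
for x in stream:
    if len(history) > 50:                 # wait for enough history first
        scores.append(windowed_lof_score(np.array(history[-W:]), x))
    history.append(x)
```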
  • A Two-Level Approach Based on Integration of Bagging and Voting for Outlier Detection
    Research Paper. A Two-Level Approach based on Integration of Bagging and Voting for Outlier Detection. Alican Dogan (The Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir, Turkey), Derya Birant (Department of Computer Engineering, Dokuz Eylul University, Izmir, Turkey). Journal of Data and Information Science, vol. 5, no. 2, 2020, pp. 111–135. https://doi.org/10.2478/jdis-2020-0014

    Abstract. Purpose: The main aim of this study is to build a robust novel approach that is able to detect outliers in datasets accurately. To serve this purpose, a novel approach is introduced to determine the likelihood of an object being extremely different from the general behavior of the entire dataset. Design/methodology/approach: This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems. The proposed approach, named Bagged and Voted Local Outlier Detection (BV-LOF), benefits from the Local Outlier Factor (LOF) as the base algorithm and improves its detection rate by using ensemble methods. Findings: Several experiments have been performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method. According to the results, the BV-LOF approach significantly outperformed LOF on 9 of the 10 datasets on average. Research limitations: In the BV-LOF approach, the base algorithm is applied to each data subset multiple times with different neighborhood sizes (k) in each case and with different ensemble sizes (T).
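A rough sketch of the bagging-plus-voting idea: run LOF over several feature subsets and several neighborhood sizes, then average the scores. The subset scheme and vote rule below are simplifications for illustration, not necessarily the paper's exact BV-LOF design:

```python
# Rough sketch of bagging + voting around LOF: T bootstrap feature
# subsets, several neighborhood sizes k per subset, averaged ("voted")
# scores. Details here are simplified relative to the paper's BV-LOF.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def bv_lof_scores(X: np.ndarray, ks=(5, 10, 20, 35), T: int = 10, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n, d = X.shape
    votes = np.zeros(n)
    for _ in range(T):                    # T bagged feature subsets
        cols = rng.choice(d, size=max(1, d // 2), replace=False)
        for k in ks:                      # vote across neighborhood sizes
            lof = LocalOutlierFactor(n_neighbors=k)
            lof.fit(X[:, cols])
            votes += -lof.negative_outlier_factor_
    return votes / (T * len(ks))          # higher = more outlying
```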
  • Accelerating the Local Outlier Factor Algorithm on a GPU for Intrusion Detection Systems
    Accelerating the Local Outlier Factor Algorithm on a GPU for Intrusion Detection Systems. Malak Alshawabkeh, Byunghyun Jang, and David Kaeli (Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA). [email protected], [email protected], [email protected]

    Abstract. The Local Outlier Factor (LOF) is a very powerful anomaly detection method available in machine learning and classification. The algorithm defines the notion of local outlier, in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned an LOF which represents the likelihood of that object being an outlier. Although this concept of a local outlier is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor queries – this overhead can limit the use of LOF due to the computational overhead involved.

    1. Introduction. The Local Outlier Factor (LOF) [3] algorithm is a powerful outlier detection technique that has been widely applied to anomaly detection and intrusion detection systems. LOF has been applied in a number of practical applications such as credit card fraud detection [5], product marketing [16], and wireless sensor network security [6]. The LOF algorithm utilizes the concept of a local outlier that captures the degree to which an object is an outlier based on the density of its local neighborhood. Each object can be assigned an LOF value which represents the likelihood of that object being an outlier.
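To make the k-NN bottleneck explicit, here is LOF from first principles in a simplified form (ties in the k-distance are ignored); essentially all the runtime sits in the kneighbors() queries, which is the step a GPU implementation accelerates:

```python
# LOF from first principles: k-distance, reachability distance, local
# reachability density (lrd), then the LOF ratio. The kneighbors() call
# dominates the cost, which is what the GPU paper targets.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lof(X: np.ndarray, k: int = 10) -> np.ndarray:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)              # the expensive k-NN queries
    dist, idx = dist[:, 1:], idx[:, 1:]       # drop each point's self-match
    k_dist = dist[:, -1]                      # k-distance of every point
    reach = np.maximum(k_dist[idx], dist)     # reachability distances
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)  # local reachability density
    return lrd[idx].mean(axis=1) / lrd        # LOF: neighbors' lrd vs own
```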
  • Supervised Anomaly Detection Based on Deep Autoregressive Density Estimators (arXiv:1904.06034v1 [stat.ML], 12 Apr 2019)
    Supervised Anomaly Detection based on Deep Autoregressive Density Estimators. Tomoharu Iwata (NTT Communication Science Laboratories), Yuki Yamanaka (NTT Secure Platform Laboratories).

    Abstract. We propose a supervised anomaly detection method based on neural density estimators, where the negative log likelihood is used for the anomaly score. Density estimators have been widely used for unsupervised anomaly detection. By the recent advance of deep learning, the density estimation performance has been greatly improved. However, the neural density estimators cannot exploit anomaly label information, which would be valuable for improving the anomaly detection performance. The proposed method effectively utilizes the anomaly label information by training the neural density estimator so that the likelihood of normal instances is maximized and the likelihood of anomalous instances is lower than that of the normal instances. We employ an autoregressive model for the neural density estimator, which enables us to calculate the likelihood exactly.

    … autoencoders (VAE) (Kingma and Welling 2013), flow-based generative models (Dinh, Krueger, and Bengio 2014; Dinh, Sohl-Dickstein, and Bengio 2016; Kingma and Dhariwal 2018), and autoregressive models (Uria, Murray, and Larochelle 2013; Raiko et al. 2014; Germain et al. 2015; Uria et al. 2016). The VAE has been used for anomaly detection (An and Cho 2015; Suh et al. 2016; Xu et al. 2018). In some situations, the label information, which indicates whether each instance is anomalous or normal, is available (Görnitz et al. 2013). The label information is valuable for improving the anomaly detection performance. However, the existing neural network based density estimation methods cannot exploit the label information. To use the anomaly label information, supervised classifiers, such as …
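The scoring rule itself — negative log-likelihood under a density estimator — is easy to sketch. Here a kernel density estimator stands in for the paper's deep autoregressive model, and the supervised term on labeled anomalies is omitted:

```python
# Negative log-likelihood as an anomaly score, as in the paper, but with
# a kernel density estimator standing in for the deep autoregressive
# model (and without the paper's supervised objective on labeled anomalies).
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
X_train = rng.normal(size=(2000, 2))          # mostly normal data
X_test = np.vstack([rng.normal(size=(5, 2)),  # normal test points
                    rng.normal(6, 1, size=(5, 2))])  # anomalies

kde = KernelDensity(bandwidth=0.5).fit(X_train)
anomaly_score = -kde.score_samples(X_test)    # NLL: higher = more anomalous
print(anomaly_score.round(2))
```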
  • A Comparative Evaluation of Semi-Supervised Anomaly Detection Techniques
    Degree Project in Computer Engineering, First Cycle, 15 credits, Stockholm, Sweden 2020. KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science. A Comparative Evaluation of Semi-supervised Anomaly Detection Techniques. Rebwar Bajallan and Burhan Hashi. Date: June 9, 2020. Supervisor: Pawel Herman. Examiner: Pawel Herman. Swedish title: En jämförande utvärdering av semi-övervakade tekniker för identifiering av uteliggande datapunkter.

    Abstract. As we are entering the information age and the amount of data is rapidly increasing, the task of detecting anomalies has become a necessity in many organizations, as anomalies often reveal useful information which in many cases can be critical to save lives or to catch imposters. The semi-supervised approach to anomaly detection, which is based on the fact that the user has no information about anomalies, has become widely popular since it is easier to model the normal state of systems than to obtain information about every anomalous behavior. Therefore, in this study we choose to conduct a comparative evaluation of the semi-supervised anomaly detection techniques Autoencoder, Local Outlier Factor algorithm, and one-class support vector machine, to simplify the process of selecting the right technique when faced with similar anomaly detection problems of a semi-supervised nature. We found that the Local Outlier Factor algorithm was superior in performance given the electrocardiogram dataset (ECG5000), achieving high precision and perfect recall. The autoencoder achieved the best performance given the credit card fraud dataset, even though the remaining models also achieved a relatively high performance that did not differ much from that of the autoencoder.
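The semi-supervised protocol compared in this work fits on normal data only, then scores a labeled test set. A minimal sketch with a one-class SVM; LOF with novelty=True and an autoencoder plug into the same fit-on-normal, score-test pattern:

```python
# Semi-supervised anomaly detection pattern: fit on normal data only,
# score a labeled test set. One-class SVM shown; LOF(novelty=True) and
# an autoencoder follow the same protocol.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
X_normal = rng.normal(size=(1000, 6))            # training: normal only
X_test = np.vstack([rng.normal(size=(20, 6)),    # normal test points
                    rng.normal(5, 1, size=(5, 6))])  # planted anomalies

ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_normal)
scores = -ocsvm.decision_function(X_test)        # higher = more anomalous
print(scores.argsort()[-5:])                     # should flag the last five
```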
  • Isolation Forest and Local Outlier Factor for Credit Card Fraud Detection System
    International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-9, Issue-4, April 2020. Isolation Forest and Local Outlier Factor for Credit Card Fraud Detection System. V. Vijayakumar, Nallam Sri Divya, P. Sarojini, K. Sonika.

    Abstract. Fraud identification is a crucial issue facing large economic institutions, caused by the rise in credit card payments. This paper brings a new approach for the predictive identification of credit card payment frauds based on Isolation Forest and Local Outlier Factor. The suggested solution comprises the following phases: pre-processing of datasets, training and sorting, convergence of decisions, and analysis of tests. In this article, the behavior characteristics of correct and incorrect transactions are to be learned by two kinds of algorithms: Local Outlier Factor and Isolation Forest. To date, several researchers have identified different approaches for detecting such frauds. In this paper we present an analysis of the Isolation Forest and Local Outlier Factor algorithms using Python, and their comprehensive experimental results.

    The dataset includes credit card purchases made by consumers in Europe during September 2013. Credit card purchases are defined by tracking the conduct of purchases into two classifications: fraudulent and non-fraudulent. Depending on these two groups, correlations are generated and machine learning algorithms are used to identify suspicious transactions. The behavior of such anomalies can be evaluated using Isolation Forest and Local Outlier Factor, and their final results can be contrasted to verify which algorithm is better. The key problems involved in the identification of credit card fraud are: immense data is collected on a regular basis and the model constructed must be sufficiently quick …
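A minimal side-by-side scoring sketch of the two algorithms the paper compares; synthetic data stands in here for the September 2013 European credit card dataset, which is not bundled:

```python
# Side-by-side anomaly scoring with Isolation Forest and LOF, the two
# algorithms the paper compares; synthetic data replaces the real dataset.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(size=(995, 10)),       # "legitimate" records
               rng.normal(7, 1, size=(5, 10))])  # planted "frauds"

iforest = IsolationForest(random_state=0).fit(X)
if_scores = -iforest.score_samples(X)            # higher = more anomalous

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_

print(if_scores.argsort()[-5:], lof_scores.argsort()[-5:])
```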
  • Anomaly Detection Using Signal Segmentation and One-Class Classification in Diffusion Process of Semiconductor Manufacturing
    Sensors, Article. Anomaly Detection Using Signal Segmentation and One-Class Classification in Diffusion Process of Semiconductor Manufacturing. Kyuchang Chang 1, Youngji Yoo 2 and Jun-Geol Baek 1,*. 1 Department of Industrial and Management Engineering, Korea University, Seoul 02841, Korea; [email protected]. 2 Samsung Electronics Co., Ltd., Hwaseong-si 18448, Korea; [email protected]. * Correspondence: [email protected]; Tel.: +82-2-3290-3396.

    Abstract. This paper proposes a new diagnostic method for sensor signals collected during semiconductor manufacturing. These signals provide important information for predicting the quality and yield of the finished product. Much of the data gathered during this process is time series data for fault detection and classification (FDC) in real time. This means that time series classification (TSC) must be performed during fabrication. With advances in semiconductor manufacturing, the distinction between normal and abnormal data has become increasingly significant as new challenges arise in their identification. One challenge is that an extremely high FDC performance is required, which directly impacts productivity and yield. However, general classification algorithms can have difficulty separating normal and abnormal data because of subtle differences. Another challenge is that the frequency of abnormal data is remarkably low. Hence, engineers can use only normal data to develop their models. This study presents a method that overcomes these problems and improves the FDC performance; it consists of two phases. Phase I has three steps: signal segmentation, feature extraction based on local outlier factors (LOF), and one-class classification (OCC) modeling using the …
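A sketch of the segment-then-classify pipeline shape from Phase I: fixed-length windows, simple per-window statistics, and a one-class model fit on normal runs only. The window length and the statistics below are placeholders; the paper instead derives its features from LOF scores:

```python
# Shape of the Phase I pipeline: segment a 1-D sensor signal into windows,
# extract per-window features, fit a one-class model on normal data only.
# Window width and the summary statistics are placeholder choices.
import numpy as np
from sklearn.svm import OneClassSVM

def segment(signal: np.ndarray, width: int) -> np.ndarray:
    n = len(signal) // width
    return signal[: n * width].reshape(n, width)

def window_features(windows: np.ndarray) -> np.ndarray:
    return np.column_stack([windows.mean(1), windows.std(1),
                            windows.min(1), windows.max(1)])

rng = np.random.default_rng(6)
normal_signal = np.sin(np.linspace(0, 60, 6000)) + 0.05 * rng.normal(size=6000)
feats = window_features(segment(normal_signal, 100))
occ = OneClassSVM(nu=0.05).fit(feats)     # one-class model of normal behavior
```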
  • Anomaly Detection Using Dictionary Learning
    Anomaly Detection Using Dictionary Learning. Mark Eisen (University of Pennsylvania), Mengjie Pan (Bryn Mawr College), Zachary Siegel (Pomona College), and Sara Staszak (Macalester College). July 22, 2013. MAXIMA REU Summer 2013, Institute for Mathematics and its Applications, University of Minnesota. Faculty advisor: Alicia Johnson (Macalester College). Problem poser: Jarvis Haupt (University of Minnesota).

    Abstract. This report applies dictionary learning and sparse coding algorithms to data in the interest of developing a better method of anomaly detection without a priori information about the anomalies themselves. These methods aim to find a sparse representation of data Y with respect to a learned basis, or dictionary D. Specifically, iterative learning algorithms are used to solve the minimization problem $\min_{X,D} \|Y - DX\|_2^2 + \lambda \|X\|_0$, where X is a set of coefficients and λ controls the sparsity of X. Sparsity helps assign semantic meaning to individual dictionary elements based upon their use in reconstructing data, which in turn highlights natural groupings and relationships among the data points. Thus, though traditional applications of dictionary learning include image denoising, novel methods for identification of anomalous or salient data points can also be derived from such structural features. To this end, we develop sparsity-informed metrics for defining and identifying anomalies with broad applications. Our results are promising and competitive with previous methods for flagging anomalous data in both images and propagating wavefield video.

    Contents: 1 Introduction (1.1 Anomaly Detection; 1.2 Existing Methods; 1.3 Proposed Method); 2 Methodology (2.1 Sparse Coding; 2.2 Dictionary Learning …)
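The objective above can be approximated with scikit-learn's dictionary learner, with the caveat that it uses an ℓ1 sparsity penalty rather than the report's ℓ0 term; reconstruction error then serves as a simple anomaly score, a convex-relaxation stand-in for the report's sparsity-informed metrics:

```python
# Reconstruction-error anomaly scoring with a learned dictionary.
# Caveat: scikit-learn penalizes the l1 norm of the codes, not the l0
# norm in the report's objective min ||Y - DX||_2^2 + lambda*||X||_0.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(7)
Y = rng.normal(size=(500, 30))
Y[:3] += 5                                    # a few anomalous rows

dl = MiniBatchDictionaryLearning(n_components=15, alpha=1.0, random_state=0)
codes = dl.fit_transform(Y)                   # sparse coefficients X
recon = codes @ dl.components_                # reconstruction from D and X
score = np.linalg.norm(Y - recon, axis=1)     # residual per sample
print(score.argsort()[-3:])                   # most anomalous rows
```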
  • A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams
    Big Data and Cognitive Computing, Review. A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Omar Alghushairy 1,2,*, Raed Alsini 1,3, Terence Soule 1 and Xiaogang Ma 1,*. 1 Department of Computer Science, University of Idaho, Moscow, ID 83844, USA; [email protected] (R.A.); [email protected] (T.S.). 2 College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia. 3 Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia. * Correspondence: [email protected] (O.A.); [email protected] (X.M.)

    Abstract. Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient, and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers.
  • Unsupervised Anomaly Detection Approach for Time-Series in Multi-Domains Using Deep Reconstruction Error
    Symmetry, Article. Unsupervised Anomaly Detection Approach for Time-Series in Multi-Domains Using Deep Reconstruction Error. Tsatsral Amarbayasgalan 1, Van Huy Pham 2, Nipon Theera-Umpon 3,4 and Keun Ho Ryu 2,4,*. 1 Database and Bioinformatics Laboratory, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea; [email protected]. 2 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam; [email protected]. 3 Department of Electrical Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand; [email protected]. 4 Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand. * Correspondence: [email protected] or [email protected]. Received: 16 June 2020; Accepted: 28 July 2020; Published: 29 July 2020.

    Abstract. Automatic anomaly detection for time series is critical in a variety of real-world domains such as fraud detection, fault diagnosis, and patient monitoring. Current anomaly detection methods correctly detect only a remarkably low proportion of the actual abnormalities. Furthermore, most of the datasets do not provide data labels, and thus require unsupervised approaches. Focusing on these problems, we propose a novel deep learning-based unsupervised anomaly detection approach (RE-ADTS) for time-series data, which is applicable to both batch and real-time anomaly detection. RE-ADTS consists of two modules: a time-series reconstructor and an anomaly detector. The time-series reconstructor module uses an autoregressive (AR) model to find an optimal window width and prepares the subsequences for further analysis according to that width. Then, it uses a deep autoencoder (AE) model to learn the data distribution, which is then used to reconstruct a time series close to the normal.
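The detector's core — reconstruction error from an autoencoder over sliding windows — in a minimal Keras sketch; the AR-based window-width selection from the paper is skipped here, and a fixed width is assumed:

```python
# Reconstruction-error scoring for a time series with a small autoencoder,
# the core of the RE-ADTS detector module. The paper's AR-based window-width
# selection is skipped; the width W is fixed for this sketch.
import numpy as np
import tensorflow as tf

W = 32                                               # fixed window width
rng = np.random.default_rng(8)
series = np.sin(np.linspace(0, 100, 5000)) + 0.05 * rng.normal(size=5000)
windows = np.stack([series[i:i + W] for i in range(len(series) - W)])

ae = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(W,)),
    tf.keras.layers.Dense(W),                        # reconstruct the window
])
ae.compile(optimizer="adam", loss="mse")
ae.fit(windows, windows, epochs=5, batch_size=64, verbose=0)

recon_err = np.mean((ae.predict(windows, verbose=0) - windows) ** 2, axis=1)
print(recon_err.argsort()[-5:])                      # most anomalous windows
```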
  • Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection
    Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection. Zekun Xu, Deovrat Kakde, Arin Chaudhuri. February 5, 2019. arXiv:1902.00567v1 [stat.AP] 1 Feb 2019.

    Abstract. In recent years, there have been many practical applications of anomaly detection, such as in predictive maintenance, detection of credit fraud, network intrusion, and system failure. The goal of anomaly detection is to identify in the test data anomalous behaviors that are either rare or unseen in the training data. This is a common goal in predictive maintenance, which aims to forecast the imminent faults of an appliance given abundant samples of normal behaviors. Local outlier factor (LOF) is one of the state-of-the-art models used for anomaly detection, but the predictive performance of LOF depends greatly on the selection of hyperparameters. In this paper, we propose a novel, heuristic methodology to tune the hyperparameters in LOF. A tuned LOF model that uses the proposed method shows good predictive performance in both simulations and real data sets. Keywords: local outlier factor, anomaly detection, hyperparameter tuning.

    1 Introduction. Anomaly detection has practical importance in a variety of applications such as predictive maintenance, intrusion detection in electronic systems (Patcha and Park, 2007; Jyothsna et al., 2011), faults in industrial systems (Wise et al., 1999), and medical diagnosis (Tarassenko et al., 1995; Quinn and Williams, 2007; Clifton et al., 2011). Predictive maintenance setups usually assume that the normal class of data points is well sampled in the training data, whereas the anomaly class is rare and underrepresented. This assumption is relevant because large critical systems usually produce abundant data for normal activities, but it is the anomalous behaviors (which are scarce and evolving) that can be used to proactively forecast imminent failures. Thus, the challenge in anomaly detection is to be able to identify new types of anomalies in the test data that are rare or unseen in the available training data.
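The paper's heuristic itself is not reproduced here; for contrast, the naive supervised alternative tunes the neighborhood size k by grid search against labeled validation data, which is exactly the labeling requirement the paper's method avoids:

```python
# Naive supervised grid search over the LOF neighborhood size k, scored
# by ROC AUC on labeled data. The paper's contribution is a heuristic
# that avoids needing labels; this is only the baseline it contrasts with.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(size=(950, 5)), rng.normal(6, 1, size=(50, 5))])
y = np.r_[np.zeros(950), np.ones(50)]                # 1 = anomaly

best_k, best_auc = None, -np.inf
for k in (5, 10, 20, 40, 80):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    auc = roc_auc_score(y, -lof.negative_outlier_factor_)
    if auc > best_auc:
        best_k, best_auc = k, auc
print(best_k, round(best_auc, 3))
```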