Computational Complexity of Data Mining Algorithms Used in Fraud Detection

The Pennsylvania State University
The Graduate School
Harold and Inge Marcus Department of Industrial and Manufacturing Engineering

COMPUTATIONAL COMPLEXITY OF DATA MINING ALGORITHMS USED IN FRAUD DETECTION

A Thesis in Industrial Engineering
by Karthik Iyer

© 2015 Karthik Iyer

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

August 2015

The thesis of Karthik Iyer was reviewed and approved* by the following:

Vittal Prabhu
Professor of Industrial Engineering
Thesis Advisor

A. Ravindran
Professor of Industrial Engineering
Chair, Enterprise Integration Consortium

Harriet Black Nembhard
Professor of Industrial Engineering
Interim Head of the Department of Industrial Engineering

*Signatures are on file in the Graduate School

ABSTRACT

According to estimates by certain government agencies, 10% of total medical expenditure is lost to healthcare fraud. Similarly, the credit card industry loses billions of dollars every year to fraudulent transactions. The datasets used to identify these transactions have millions of rows, and each transaction is described by 15 to 40 attributes. A human cannot sift through datasets of this size to find the fraudulent transactions, so credit card companies and insurance companies use data mining algorithms instead. These algorithms need to identify fraudulent transactions efficiently while processing the dataset as quickly as possible. The time an algorithm takes to execute is a function of its computational complexity.

In this thesis, the theoretical running-time complexities of the data mining algorithms used in fraud detection are compared, with the complexity of each algorithm expressed as a function of the number of instances in the database. The algorithms were then run in statistical tools such as Weka and R, and their measured performance was compared with the theoretical computational complexity. All algorithms agreed with their big O complexity; support vector machines and decision trees performed better than their big O complexity.

TABLE OF CONTENTS

List of Figures
List of Tables
Acknowledgements

Chapter 1 Introduction
  1.1. Fraud
    1.1.1. Credit Card Fraud
    1.1.2. Healthcare Fraud
  1.2. Problem Statement
  1.3. Research Objectives
  1.4. Thesis Structure

Chapter 2 Literature Review
  2.1. Background
  2.2. Data Mining Algorithms for Detecting Credit Card Fraud
  2.3. Data Mining Algorithms for Detecting Health Care Fraud
  2.3. Feature Selection

Chapter 3 Theoretical Computational Complexity
  3.1. Big O Notation
  3.2. K-means Clustering
  3.3. Regression (Taylor, 2011)
  3.4. Association Rules (Tan & Kumar, 2005)
  3.5. Logistic Regression
    3.5.1. Logistic Model (Komarek, 2004)
    3.5.2. Maximum Likelihood Estimation (Komarek, 2004)
    3.5.3. Newton’s Method
  3.6. Decision Trees (Witten & Frank, 2005)
  3.7. Support Vector Machine (“Support vector machine,” 2015)

Chapter 4 Analysis and Results
  4.1. K-means Clustering Analysis
    4.1.1. Comparison with Theoretical Complexity of K-means Clustering Algorithm
  4.2. Regression Analysis
    4.2.1. Comparison with the Theoretical Regression Model Complexity
  4.3. Association Rules Analysis
    4.3.1. Comparison with Theoretical Complexity for Apriori Algorithm
  4.4. Logistic Regression Analysis
    4.4.1. Comparison with Theoretical Logistic Regression Complexity
  4.5. Decision Tree Analysis
    4.5.1. Comparison with Theoretical Decision Tree Complexity
  4.6. Support Vector Machine Analysis
    4.6.1. Comparison with the Theoretical Complexity of Support Vector Machines

Chapter 5 Conclusion
  5.1. Future Scope

Appendix A: K-means Clustering
Appendix B: Regression
Appendix C: Association Rules (Apriori Algorithm)
Appendix D: Logistic Regression
Appendix E: Decision Tree
Appendix F: Support Vector Machine
References

LIST OF FIGURES

Figure 1: Global Cost of Payment Card Fraud (Heggestuen, 2014)
Figure 2: U.S. Card Fraud by Type (Conroy, 2014)
Figure 3: Types of Healthcare Fraud and Abuse (“Healthcare fraud and abuse remains a costly challenge,” 2004)
Figure 4: Percentage of Papers on Detecting Three Types of Fraud (Li et al., 2008)
Figure 5: Percentage of Papers Using Specific Number of Features (Li et al., 2008)
Figure 6: Big-O Complexity (Steinberger, 2013)
Figure 7: 2-D Plot of Instances
Figure 8: Plot of Instances with Initial Seeds
Figure 9: Plot of Cluster Assignments
Figure 10: Plot of New Centroids
Figure 11: Plot of New Cluster Assignment
Figure 12: Hyperplane (“Support vector machine,” 2015)
Figure 13: Maximum Margin Hyperplane (“Support vector machine,” 2015)
Figure 14: Run time of k-means
