
San Jose State University
SJSU ScholarWorks
Master's Projects, Master's Theses and Graduate Research
Winter 1-23-2017

Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons.

Recommended Citation:
Gurnani, Bhavna, "Malware Detection using the Index of Coincidence" (2017). Master's Projects. 507.
DOI: https://doi.org/10.31979/etd.9mkp-kstb
https://scholarworks.sjsu.edu/etd_projects/507

Malware Detection using the Index of Coincidence

A Project Presented to
The Faculty of the Department of Computer Science
San Jose State University

In Partial Fulfillment of the Requirements for the Degree
Master of Science

by
Bhavna Gurnani
Dec 2016

© 2016 Bhavna Gurnani
ALL RIGHTS RESERVED

The Designated Project Committee Approves the Project Titled
Malware Detection using the Index of Coincidence
by Bhavna Gurnani

APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE
SAN JOSE STATE UNIVERSITY
Dec 2016

Dr. Mark Stamp, Department of Computer Science
Dr. Thomas Austin, Department of Computer Science
Mr. Fabio Di Troia, Department of Computer Science

ABSTRACT
Malware Detection using the Index of Coincidence
by Bhavna Gurnani

In this research, we apply the Index of Coincidence (IC) to problems in malware analysis. The IC, which is often used in cryptanalysis of classic ciphers, is a technique for measuring the repeat rate in a string of symbols. A score based on the IC is applied to a variety of challenging malware families.
We find that this relatively simple IC score performs surprisingly well, with results superior to various machine learning based scores, at least in some cases.

ACKNOWLEDGMENTS

I am grateful to Dr. Stamp for providing me with the opportunity to work on this project and for guiding me. I would like to thank my committee members for reviewing my work. Also, special thanks to Fabio Di Troia for helping me throughout the project. I would also like to thank my brother for constantly motivating me during the project.

TABLE OF CONTENTS

CHAPTER
1 Introduction ........................ 1
  1.1 Introduction to Malware ........................ 1
  1.2 Types of Malware ........................ 1
  1.3 Techniques to detect malware ........................ 2
  1.4 Technique used ........................ 2
2 Background ........................ 4
  2.1 Previous work and research ........................ 4
  2.2 Index of Coincidence ........................ 5
    2.2.1 Calculation ........................ 5
    2.2.2 Application ........................ 6
3 Implementation ........................ 7
  3.1 Choices Made in Calculation ........................ 7
  3.2 Process to calculate IC ........................ 8
4 Results ........................ 10
  4.1 With variation of normalizing coefficient ........................ 12
  4.2 With variation of Train dataset ........................ 17
  4.3 With variation of Test dataset ........................ 21
  4.4 Summary of results ........................ 24
  4.5 Comparison with other scores ........................ 25
5 Observations ........................ 27
  5.1 Observation 1 ........................ 27
  5.2 Observation 2 ........................ 28
  5.3 Observation 3 ........................ 29
6 Conclusion and Future Work ........................ 31
APPENDIX: Generated ROC Curves ........................ 34

LIST OF TABLES

1 Dataset ........................ 10
2 Parameters for c as 26 ........................ 13
3 Parameters for c as 79 ........................ 13
4 Results with different normalizing coefficients ........................ 14
5 Results with different c and training datasets ........................ 16
6 Parameters for NGVCK as training dataset ........................ 18
7 Parameters for Zeroaccess as training dataset ........................ 19
8 Results with different training datasets ........................ 20
9 Parameters for Cridex as test dataset ........................ 22
10 Parameters for Cleaman as test dataset ........................ 23
11 Results with different testing datasets ........................ 23
12 Comparison across different scores ........................ 25
13 Percentage of matching opcodes ........................ 27
14 Frequency percentage of 'other' opcodes ........................ 28
A.15 Parameters NGVCK-Benign-26 ........................ 34
A.16 Parameters NGVCK-Cleaman-26 ........................ 34
A.17 Parameters NGVCK-Cridex-26 ........................ 35
A.18 Parameters ZeroAccess-Benign-26 ........................ 35
A.19 Parameters Smart HDD-Benign-26 ........................ 36
A.20 Parameters Security Shield-Benign-26 ........................ 36
A.21 Parameters Winwebsec-Benign-26 ........................ 37
A.22 Parameters Zbot-Benign-26 ........................ 37
A.23 Parameters Cridex-Benign-26 ........................ 38
A.24 Parameters Cleaman-Benign-26 ........................ 38
A.25 Parameters Harebot-Benign-26 ........................ 39
A.26 Parameters NGVCK-Benign-79 ........................ 39
A.27 Parameters NGVCK-Zbot-79 ........................ 40
A.28 Parameters NGVCK-Benign-63 ........................ 40
A.29 Parameters NGVCK-Security Shield-63 ........................ 41
A.30 Parameters NGVCK-Benign-50 ........................ 41
A.31 Parameters NGVCK-Winwebsec-50 ........................ 42
A.32 Parameters NGVCK-Benign-45 ........................ 42
A.33 Parameters NGVCK-Smart HDD-45 ........................ 43
A.34 Parameters NGVCK-Benign-34 ........................ 43
A.35 Parameters NGVCK-Zeroaccess-34 ........................ 44
A.36 Parameters NGVCK-Benign-12 ........................ 44
A.37 Parameters NGVCK-Harebot-12 ........................ 45
A.38 Parameters NGVCK-Benign-6 ........................ 45

LIST OF FIGURES

1 NGVCK-Benign scatter plot, c is 26 ........................ 11
2 NGVCK-Benign scatter plot, c is 30 ........................ 11
3 ROC curve for c as 26 ........................ 12
4 ROC curve for c as 79 ........................ 13
5 AUC with different normalizing coefficients: line graph ........................ 14
6 AUC with different normalizing coefficients: bar graph ........................ 15
7 Different c and train dataset across benign ........................ 17
8 ROC curve for NGVCK as training dataset ........................ 18
9 ROC curve for Zeroaccess as training dataset ........................ 19
10 AUC by different training datasets ........................ 20
11 ROC curve for Cridex as test dataset ........................ 22
12 ROC curve for Cleaman as test dataset ........................ 23
13 AUC by different testing datasets ........................ 24
14 Comparing results with different scores ........................ 25
15 Percentage of matching opcodes ........................ 28
16 Frequency percentage of 'other' opcodes ........................ 29
17 Opcode Distribution across families ........................ 30
A.18 ROC Curve NGVCK-Benign-26 ........................ 34
A.19 ROC Curve NGVCK-Cleaman-26 ........................ 34
A.20 ROC Curve NGVCK-Cridex-26 ........................ 35
A.21 ROC Curve ZeroAccess-Benign-26 ........................ 35
A.22 ROC Curve Smart HDD-Benign-26 ........................ 36
A.23 ROC Curve Security Shield-Benign-26 ........................ 36
A.24 ROC Curve Winwebsec-Benign-26 ........................ 37
A.25 ROC Curve Zbot-Benign-26 ........................ 37
A.26 ROC Curve Cridex-Benign-26 ........................ 38
A.27 ROC Curve Cleaman-Benign-26 ........................ 38
A.28 ROC Curve Harebot-Benign-26 ........................ 39
A.29 ROC Curve NGVCK-Benign-79 ........................ 39
A.30 ROC Curve NGVCK-Zbot-79 ........................ 40
A.31 ROC Curve NGVCK-Benign-63 ........................ 40
A.32 ROC Curve NGVCK-Security Shield-63 ........................ 41
A.33 ROC Curve NGVCK-Benign-50 ........................ 41
A.34 ROC Curve NGVCK-Winwebsec-50 ........................ 42
A.35 ROC Curve NGVCK-Benign-45 ........................ 42
A.36 ROC Curve NGVCK-Smart HDD-45 ........................ 43
A.37 ROC Curve NGVCK-Benign-34 ........................ 43
A.38 ROC Curve NGVCK-Zeroaccess-34 ........................ 44
A.39 ROC Curve NGVCK-Benign-12 ........................ 44
A.40 ROC Curve NGVCK-Harebot-12 ........................ 45
A.41 ROC Curve NGVCK-Benign-6 ........................ 45

CHAPTER 1
Introduction

1.1 Introduction to Malware

Malware [2] is an umbrella term for various kinds of malicious software. Such programs are designed to harm computers in several ways, including stealing personal information, installing illegitimate software, reformatting hard drives, deleting files, and so on. Only six years after the launch of the first personal computer in 1975, the world was presented with its first computer virus [13]. Ever since, malware has seemed unstoppable. Malicious code and other unwanted programs are now released at a faster rate than legitimate software applications. In this paper, the terms virus and malware are used interchangeably.

1.2 Types of Malware

There has been an arms race between malware writers and anti-virus software publishers. The most common and widely used malware detection technique is signature detection [2]. In this technique, pattern matching is used to scan for particular byte patterns that are found in malware. If a matching pattern is found, the file is categorized as malware. This is a very simple and often effective technique, but malware writers have developed various techniques aimed at defeating signature detection.

Encrypting the body of a virus can effectively defeat signature detection. However, the decryption
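To make the central idea concrete: the Index of Coincidence named in the abstract measures the probability that two symbols drawn at random from a sequence are identical, often scaled by a normalizing coefficient c (the alphabet size, e.g. the values 26, 79, 63, ... explored in Chapter 4). Below is a minimal sketch of the textbook calculation; the function name and the example string are illustrative, and the exact variant applied to opcode sequences is the subject of Chapter 3.

```python
from collections import Counter

def index_of_coincidence(symbols, c=26):
    """Normalized Index of Coincidence of a sequence of symbols.

    Computes c * sum(n_i * (n_i - 1)) / (N * (N - 1)), where n_i is the
    count of symbol i and N is the sequence length. With c equal to the
    alphabet size, uniform random text scores near 1.0, while text with
    a skewed symbol distribution scores higher.
    """
    n = len(symbols)
    if n < 2:
        return 0.0  # IC is undefined for fewer than two symbols
    counts = Counter(symbols)
    coincidences = sum(k * (k - 1) for k in counts.values())
    return c * coincidences / (n * (n - 1))

# 'a' occurs 4 times, 't' 3 times, so the raw IC is 18/132; scaled by 26
print(index_of_coincidence("attackatdawn"))  # ≈ 3.55
```

The same calculation applies unchanged when the "symbols" are extracted opcodes rather than letters; only the alphabet size c changes.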