Security Posture Based Incident Forecasting
Total Page:16
File Type:pdf, Size:1020Kb
Security Posture Based Incident Forecasting A Thesis Submitted to the Faculty of Drexel University by Dagmawi Mulugeta in partial fulfillment of the requirements for the degree of Master of Science June 2019 c Copyright 2019 Dagmawi Mulugeta. All Rights Reserved. ii Acknowledgements It is through the support and counsel of numerous individuals that I have compiled the work below. Firstly, to my advisors, Dr. Steven Weber and Ben Goodman, I extend my deepest gratitude for their invaluable guidance. They constantly encouraged me to challenge myself in new and interesting ways. I will, without a doubt, continue to use the lessons from our biweekly meetings in the future. Secondly, I would like to thank Raymond Canzanese for being an outstanding mentor, both in research and life. His feedback has shaped me into the engineer that I am today. Thirdly, I would like to thank my parents (Mesfin and Hanna), my siblings (Nola and Eto), and rest of my family for being daring enough to send me to school overseas. Their support has made the time apart slightly more bearable. I would also like to thank Pushkin, for always demanding that I focus on the bigger picture, which I hope to someday pay forward to someone else; Yoan and Pranay, for being superb friends, challenging me, and believing in my potential when I wasn't a believer; Jordane and Sena, for consistently checking up on me and providing emotional support through what seemed to be a challenging year; Anishi, Saloni, and Mahshid, for helping solve many problems (both in and out of the lab) through the majority of this process; Hari and Abhinanda, for imparting many priceless tokens on graduate student life; Nick, Mark, Tyler, Rob, Marco, Liz, Meghan, Peter, Alex, Lydia, and Jordan, for keeping me sane all those late nights I was up fixing a gruelling issue; CyberDragons (Colbert, Ryan, Nahid, and many others) for aiding in the understanding of many cyber security concepts and affording me the pleasure of being a CyberDragon. Lastly, and most importantly, I would like to thank God, as it is through his grace, wisdom, and power that I have been able to successfully finish this thesis. I dedicate this thesis to my role model and loving uncle, Girma Deyaso. You were a sincere soul that was taken from us too soon. iii Contents List of Tables ........................................... v List of Figures .......................................... vi Abstract .............................................. viii 1. Introduction .......................................... 1 1.1 Motivation and background................................. 1 1.2 Problem definition and scope ................................ 3 1.3 Contributions......................................... 4 1.4 Terminology.......................................... 6 1.5 Outline ............................................ 8 2. Relevant works ........................................ 9 2.1 Internet scanning....................................... 9 2.2 Incident analysis and forecasting .............................. 10 2.3 Industry cyber risk analysis................................. 19 3. Data and design ........................................ 22 3.1 Development tools ...................................... 22 3.1.1 Data science libraries................................... 22 3.2 Data collection ........................................ 22 3.2.1 Cohort selection...................................... 23 3.2.2 Host attribution (Footprinting)............................. 34 3.2.3 Host collection ...................................... 40 3.2.4 Feature engineering.................................... 44 3.3 Design decisions ....................................... 48 4. Results and analysis ..................................... 57 4.1 Algorithms .......................................... 57 iv 4.1.1 Spearman correlation................................... 57 4.1.2 Cross-validation...................................... 58 4.1.3 Performance metrics ................................... 59 4.1.4 Receiver Operating Characteristic (ROC) Curve.................... 60 4.1.5 Recursive Feature Elimination (RFE).......................... 61 4.1.6 Random Forests...................................... 62 4.1.7 Outlier Detection and Isolation Forest ......................... 64 4.2 Experimental setup...................................... 67 4.2.1 Outlier vs. inlier analysis ................................ 69 4.2.2 Victim vs. non-victim host analysis........................... 70 4.2.3 Victim vs. non-victim organization analysis...................... 72 4.3 Analysis of results ...................................... 74 4.3.1 Feature importance.................................... 75 4.4 Discussion........................................... 78 5. Conclusion ........................................... 86 5.1 Performance comparison................................... 86 5.2 Future work.......................................... 87 Appendix A: ............................................ 90 A.1 Sample security incident notification letter......................... 90 A.2 Description of some subdomain enumeration techniques ................. 90 A.3 Exhaustive list of feature space............................... 91 A.4 Additional analysis charts.................................. 132 Bibliography ............................................ 141 v List of Tables 2.1 Reputation blacklists...................................... 12 3.1 Sample PRC incident entries ................................. 26 3.2 Sample HHS data breach entries ............................... 29 3.3 List of directly invoked subdomain enumeration data sources ............... 37 3.4 List of subdomain enumeration data sources with research access............. 37 3.5 Exhaustive list of footprinting data sources ......................... 37 3.6 Cohort size distribution .................................... 42 3.7 Sample list of Censys fields .................................. 46 3.8 Feature space distribution................................... 47 4.1 Outlier and inlier counts for cohort subsets ......................... 69 4.2 Outlier vs. inlier using all attributions............................ 72 4.3 Outlier vs. inlier using only certificate attributions..................... 72 4.4 Victim vs. non-victim inlier host using all attributions................... 73 4.5 Victim vs. non-victim outlier host using all attributions.................. 73 4.6 Victim vs. non-victim inlier host using only certificate attributions............ 73 4.7 Victim vs. non-victim outlier host using only certificate attributions........... 73 4.8 Victim vs. non-victim organization using all attributions.................. 74 4.9 Victim vs. non-victim organization using only certificate attributions........... 74 5.1 Performance comparison with contemporary methods ................... 86 vi List of Figures 1.1 Internal and external network segments of an organization................. 1 1.2 Contemporary approach in problem domain: collect victim and non-victim organizations, attribute their assets, and compare rules that discern victim configurations....... 2 1.3 Novel contributions to the contemporary approach..................... 5 2.1 Contemporary external network posture using misconfiguration and maliciousness [1,2] 14 3.1 Pipeline that collects large scale victim and non-victim data to map, collect, and extract features from their assets ................................... 23 3.2 Victim organization collection pipeline............................ 24 3.3 Non-victim organization collection pipeline ......................... 32 3.4 Asset attribution stage to collect a company's resources through two different methods . 34 3.5 Sample ARIN lookup [3].................................... 36 3.6 Host collection stage using Censys [4] ............................ 41 3.7 Feature engineering stage to extract features from 26 protocols.............. 45 3.8 Entire data space: 714,244 hosts x 1,386 features...................... 48 4.1 Random Forest architecure .................................. 63 4.2 Isolation Forest [5]....................................... 66 4.3 Challenge with experimental setup: features and target label at different levels . 67 4.4 Outlier vs. inlier classification using all attributions .................... 70 4.5 Outlier vs. inlier classification for different organization sizes using all attributions . 71 4.6 Host classification using outlier and inlier hosts....................... 72 4.7 Non-victim vs. victim host classification using all attributions............... 80 4.8 Organization classification using the probability distributions from outlier and inlier clas- sifications............................................ 80 4.9 Victim vs. non-victim organization classification using all attributions.......... 81 4.10 Outlier vs. inlier classification feature importance chart using all attributions . 81 4.11 Outlier vs. inlier classification feature importance chart using only certificate attributions 82 vii 4.12 Victim vs. non-victim outlier host classification using all attributions........... 83 4.13 Victim vs. non-victim outlier host classification using only certificate attributions . 83 4.14 Victim vs. non-victim inlier host classification using all attributions ........... 84 4.15 Victim vs. non-victim inlier host classification using only certificate attributions . 84 4.16 Victim vs. non-victim organization classification using all attributions.......... 85 4.17 Victim vs. non-victim organization classification using only certificate attributions