Quantifying Security: Methods, Challenges and Applications
Total Page:16
File Type:pdf, Size:1020Kb
Quantifying Security: Methods, Challenges and Applications by Armin Sarabi A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering: Systems) in the University of Michigan 2018 Doctoral Committee: Professor Mingyan Liu, Chair Assistant Professor Tudor Dumitras, , University of Maryland Professor J. Alex Halderman Associate Professor Vijay Subramanian Armin Sarabi [email protected] ORCID iD: 0000-0002-1431-7434 ©Armin Sarabi 2018 ACKNOWLEDGMENTS I am very fortunate to have had the chance to complete this dissertation under the advise- ment of Professor Mingyan Liu. I am grateful for your continued support and guidance; you have been a true mentor to me, both professionally and personally. I would like to express my gratitude to my colleagues Yang Liu, Parinaz Naghizadeh, Chaowei Xiao for their efforts and shared insights. It has been a privilege to meet and work with all of you. I am especially thankful to Manish Karir, for his advice and helpful discussions. I also thank my dissertation committe, Professor Tudor Dumitras, , Professor J. Alex Halderman, and Professor Vijay Subramanian for their time and valuable feedback. This disseration would not have been possible without the various data sets that provide the backbone of our experiments. Therefore, I am grateful to the VERIS community, and the Censys team, for maintaining and making available the comprehensive databases that have been extensively used in our studies. I also thank Professor Tudor Dumitras, for shar- ing the WINE data set with us, and Ziyun Zhu for his efforts in data curation and cleaning. I am lucky to have so many wonderful friends. We have shared numerous joyful expe- riences, and will continue to do for years to come. Last but not least, I wish to thank my parents and sister, who have provided constant encouragement and inspiration throughout my life. Even though we are apart, I feel right at home whenever I hear your voice. The work in this dissertation was supported by the National Science Foundation (NSF), and the Department of Homeland Security (DHS). ii TABLE OF CONTENTS Acknowledgments ................................... ii List of Figures ..................................... v List of Tables ...................................... vii Abstract ......................................... ix Chapter 1 Introduction ..................................... 1 1.1 Motivation and background.........................1 1.2 Where we stand...............................2 1.3 Main contributions.............................5 1.3.1 Fine-grained data breach prediction................5 1.3.2 Modeling end-user patching behavior...............7 1.3.3 Numerical fingerprinting of Internet hosts.............9 1.3.4 Overview and non-goals...................... 14 1.4 Organization................................. 15 2 Fine-grained Data Breach Prediction Using Business Profiles .......... 16 2.1 Introduction................................. 16 2.2 Data sets................................... 19 2.2.1 VERIS Community Database (VCDB).............. 19 2.2.2 Alexa Web Information Service (AWIS).............. 21 2.2.3 The Open Directory Project (ODP)................ 22 2.2.4 Neustar IP Intelligence....................... 23 2.2.5 Pre-processing........................... 23 2.3 Related work................................ 25 2.4 Methodology................................ 26 2.4.1 Feature set............................. 27 2.4.2 Construction of the classifiers................... 28 2.4.3 Forecasting overall risk....................... 31 2.4.4 Forecasting conditional risk.................... 32 2.5 Results.................................... 34 2.5.1 Overall risk............................. 35 2.5.2 Risk distributions.......................... 36 iii 2.6 Dealing with rare events.......................... 38 2.6.1 Interpreting the classifier output.................. 39 2.6.2 Evaluation of risk profiles..................... 40 2.7 Conclusion................................. 44 3 The Effects of Individual User Behavior on End-Host Vulnerability State ... 46 3.1 Introduction................................. 46 3.2 Vulnerabilities from the perspective of end-users.............. 48 3.3 Data sets and their processing........................ 51 3.3.1 Raw data.............................. 51 3.3.2 Curated data............................ 53 3.4 User behavior and its security implications................. 58 3.4.1 Modeling a user’s patching delay................. 58 3.4.2 Vulnerability state......................... 61 3.4.3 Susceptibility to vulnerability exploits............... 64 3.5 Discussion.................................. 65 3.6 Related work................................ 65 3.7 Conclusion................................. 67 4 Numerical Fingerprinting of Internet Hosts ................... 68 4.1 Introduction................................. 68 4.2 Data sets................................... 72 4.3 Feature extraction.............................. 73 4.3.1 Encoding tree-like documents into numerical vectors....... 73 4.3.2 Fine-tuning............................. 76 4.4 The latent variable model.......................... 78 4.4.1 Neural networks.......................... 79 4.4.2 Variational autoencoders...................... 80 4.4.3 Restricted Boltzmann machines.................. 83 4.4.4 Evaluation............................. 85 4.4.5 Visualization............................ 88 4.5 Applications................................. 90 4.5.1 Quantifying maliciousness..................... 90 4.5.2 Inferring masked attributes..................... 96 4.5.3 Network signatures......................... 100 4.6 Discussion.................................. 105 4.6.1 Privacy............................... 106 4.6.2 Fingerprinting IPv6 hosts..................... 106 4.7 Related work................................ 106 4.8 Conclusion................................. 107 5 Conclusion ...................................... 110 5.1 Summary of contributions.......................... 110 5.2 Future directions.............................. 112 Bibliography ...................................... 114 iv LIST OF FIGURES 1.1 The vulnerability life-cycle............................8 2.1 A sample risk assessment tree. The risk at each node is quantified by multi- plying its conditional risk, by the risk of its parent node............. 30 2.2 Distribution of all victim and non-victim samples (both training and testing) over Alexa’s top categories. Note that a website can belong to multiple or no categories..................................... 31 2.3 Receiver operating characteristic (ROC) curve for overall risk estimation (left), and cumulative distribution of risk (right).................... 35 2.4 ROC curves for action (left), actor (middle), and asset (right) classifiers..... 36 2.5 Detection rate vs. average number of high-risk incident types (top), and distri- bution of organizations over the number of types in their risk profiles (bottom). 40 3.1 Evolution of number of vulnerabilities in successive Firefox versions (left), and following a user’s update events (right). Each color represents a single version....................................... 49 3.2 Scatter plots of normalized vulnerability duration vs. average user delay in days (top), and the mean, and first and third quartiles for different user types (bottom). Each point in the scatter plots corresponds to a single user. In 3.2c the yellow/red dots are users active in 2010/only active starting 2011...... 61 3.3 Scatter plot (left) and mean, and first and third quartiles for exploited vulnera- bilities of Flash Player............................... 64 4.1 An example tree-like document describing a host................. 74 4.2 A simple MLP with a single hidden layer..................... 79 4.3 Graphical depiction of a restricted Boltzmann machine’s Markov random field. The bottom and top rows correspond to visible and hidden variable, respec- tively, and edges indicate direct interaction between two variables. Note that visible (hidden) states are independent from each other, given the hidden (vis- ible) vector..................................... 83 4.4 Bivariate and marginal distributions for two-dimensional VAE representations of Internet hosts, resembling an isotropic Gaussian distribution......... 89 4.5 2-D representations of IPs from different countries (left), different HTTP servers (center), and correctly configured and misconfigured HTTPS servers (right). Figure 4.5a uses our 2-D representations, and for Figures 4.5b and 4.5c we use PCA projections from 50-dimensional fingerprints......... 89 v 4.6 Density for two-dimensional representations of blacklisted IP addresses from PhishTank (left), and hpHosts (right). Hosts from the highlighted clusters cor- respond to the depicted landing pages, taken from HTTP content recorded by Censys....................................... 91 4.7 2-dimensional representations of IPs from different categories of autonomous systems. Each plot is drawn using 20 000 sample IP addresses from the de- picted category. The observed differences in the structure of these networks can be used by a supervised learning algorithm to distinguish between them.. 102 vi LIST OF TABLES 1.1 Categorization of security research, and the presented studies in this dissertation.3 2.1 Incident examples from the VERIS Community Database............ 20 2.2 Features and feature importances for all classifiers. Crosses indicate features that have not been used in training