RIVERACARRANZA-THESIS-2021.Pdf (941.8Kb)
Total Page:16
File Type:pdf, Size:1020Kb
Large-scale analysis of the security of cryptographic keys A Thesis Submitted to the College of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the degree of Master of Science in the Department of Computer Science University of Saskatchewan Saskatoon By Ronald Rivera c Ronald Rivera, November/2020. All rights reserved. Permission to Use In presenting this thesis in partial fulfillment of the requirements for a Postgraduate degree from the University of Saskatchewan, I agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis. Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to: Head of the Department of Computer Science 176 Thorvaldson Building 110 Science Place University of Saskatchewan Saskatoon, Saskatchewan Canada S7N 5C9 Or Dean College of Graduate and Postdoctoral Studies University of Saskatchewan 116 Thorvaldson Building, 110 Science Place Saskatoon, Saskatchewan S7N 5C9 Canada i Abstract Cryptographic algorithms are considered provably secure due to their strong mathematical foundation. Notwithstanding, real-life application of cryptographic algorithms and protocols continues to fail. These failures are frequently due to low entropy, faulty library implementation, and Application Programming Interface (API) misuse. Biases introduced during the generation process incorporate distinct bit patterns in RSA cryptographic keys allowing their attribution, thus endangering their advertised security. This thesis proposes a novel attribution approach to link cryptographic keys to their originating libraries based on moduli's characteristics. We analyze over 6.5 million generated keys and show that only a few of these characteristics are enough to achieve a 75% accuracy in the attribution of individual keys to their originating library. Also, depending on the library, our approach is sensitive enough to pinpoint the corresponding major, minor, and build release information for several libraries with accuracy levels between 81% and 98%. We further explore the attribution of SSH keys collected from publicly facing IPv4 addresses proving that our approach differentiates individual libraries of RSA keys with a 95% accuracy. ii Acknowledgements I wish to express my gratitude to my adviser, Natalia Stakhanova, for her guidance and uncountable suggestions to improve my work quality. Thank you for your never-ending support and for always being there when we needed you. Because of your patience and guidance, I learned so much more than I could ever have imagined. I also wish to thank our laboratory technician, Enrico Branca, for all the technical knowledge and cyber- security perspectives he shared with me while working on different projects. His support contributed a great deal to the simplification and understanding of complex concepts. I thank my friends for all those unforgettable moments watching movies, going out, eating pizza or chicken wings or merely texting or talking on the phone. Whether in Canada, the US or El Salvador, you all cheered me up and encouraged me to do my best all the time. I thank my family for their unconditional love and support throughout these years. They were always there when I needed any advice or just talking to someone when things did not result as expected. They always believed in me and gave me the strength to keep on doing the best I could, regardless of those difficulties that life throws at you. Even though we were apart, you made me feel as if I had never left. I love you all. To all of you, thank you. I would not have made it without you. iii To my mom, the loveliest rose of them all. iv Contents Permission to Use i Abstract ii Acknowledgements iii Contents v List of Tables vii List of Figures viii List of Abbreviations ix 1 Introduction 1 1.1 Contributions . 2 1.2 Thesis Structure . 3 2 Background 5 2.1 Public-Key Cryptography . 5 2.2 The RSA Cryptosystem . 6 2.3 Protocols and Libraries implementing RSA . 6 2.3.1 The Secure Shell Protocol (SSH) . 6 2.3.2 The Secure Sockets Layer Protocol (SSL) . 7 2.3.3 The Transport Layer Security Protocol (TLS) . 7 2.3.4 The OpenSSH Library . 7 2.3.5 The OpenSSL Library . 7 2.3.6 The GnuTLS Transport Layer Security Library . 7 2.3.7 The GNU Privacy Guard (GPG) Library . 8 3 Related Work 9 3.1 Human errors affecting software and devices . 9 3.2 Entropy Issues and Factorization . 10 3.3 Key Source Attribution . 11 4 Methodology 12 4.1 General overview of the approach . 12 4.2 Parsing . 12 4.3 Analysis . 14 4.4 Feature extraction . 15 4.5 Keys Classification . 18 5 Validation study 20 5.1 Generated RSA keys . 20 5.2 Validation results . 20 5.2.1 Comparative analysis . 24 6 Internet wide scan of Canadian IPv4 space 26 6.1 The collected SSH dataset . 26 6.2 Parsing . 27 6.3 Security analysis of the SSH collected keys . 27 v 6.3.1 Low public exponents . 28 6.4 Analysis of collected key moduli . 29 6.4.1 Moduli distribution . 29 6.5 Source Attribution . 31 7 Conclusion 34 7.1 Future work . 34 References 36 Appendix A The DieHarder testing suite 40 vi List of Tables 4.1 Public-Key and/or certificates formats for PUB files . 13 4.2 The features extracted for attribution of generated and collected keys . 17 4.3 Machine learning model parameters . 19 5.1 The summary of the generated keys . 21 5.2 The list of core features across for various libraries . 21 5.3 Attribution accuracy for library versions (Random Forest) . 23 5.4 Attribution accuracy of generated keys by minor build version groupings (Random Forest) . 25 5.5 Classification accuracy of Nemec et. al [47] approach on the generated keys . 25 6.1 General statistics of the retrieved keys . 27 6.2 The results of the Dieharder test on the collected SSH keys . 29 6.3 The list of top features (Random Forest) . 31 6.4 Attribution results for the collected SSH keys by the individual library and its version (Random Forest) . 33 A.1 The Dieharder tests [12] . 40 vii List of Figures 4.1 The flow of the analysis . 13 4.2 The decoding process. 14 5.1 The accuracy of attributing generated keys to the originating library and library versions . 22 6.1 The decoding process of the collected SSH keys. 28 6.2 Collected RSA moduli per size . 30 6.3 Distribution of all collected RSA keys . 30 6.4 Attribution accuracy of collected keys by individual library . 32 viii List of Abbreviations PKCS Public-Key Cryptography Standards SSH Secure Shell Protocol SSL Secure Sockets Layer Protocol TLS Transport Layer Security GnuTLS GNU Transport Layer Security GPG GNU Privacy Guard PGP Pretty Good Privacy DTLS Datagram Transport Layer Security RSA (cryptosystem) Rivest{Shamir{Adleman (cryptosystem) HTTPS Hypertext Transfer Protocol Secure API Application Programming Interface IP (address) Internet Protocol (address) NIST National Institute of Standards and Technology LZMA Lempel{Ziv{Markov chain algorithm TruRNG Hardware Random Number Generator NeuG True Random Number Generator FIPS Federal Information Processing Standards PEM Privacy-Enhanced Mail ix 1 Introduction Nowadays, above 50% of Internet traffic is encrypted, and the adoption of cryptographic protocols has been increasing over the past years [3]; therefore, secure communications over the Internet has become the norm. Theoretically, cryptography is provably secure due to its strong dependence on mathematics. In prac- tice, however, communications' security depends on how developers and applications implement encryption standards. Over the past decade, several studies identified alarming weaknesses affecting the security of widely used communication protocols that depend upon cryptographic implementations such as TLS/SSL [29], SSH [26] and HTTPS [21, 28, 4]; most of these studies reasoned that inadequate security levels result from weakly generated keys. Furthermore, some of these studies traced the problem back to weak random number generators and lack of entropy [29, 24, 16]. Additionally, others observed the improper implementation of cryptographic libraries [22, 41, 56, 45] and the pure misuse of cryptographic algorithms, for example, RSA private keys mistakenly left embedded in binary files [23]. These security issues lead to the development of techniques that allow the identification of insecure and vulnerable keys. From these, the vast majority focused on the analysis of binaries that contain vulnerable keys [42, 41, 52, 49], cryptographic libraries and the APIs that produce them [47, 56]. The improper implementation of the RSA algorithm and its related cryptographic operations cause soft- ware libraries to produce keys with distinguishable patterns that can identify the affected cryptographic keys' source. In this regard, Svenda et al. [56] were the first to observe that implementation decisions in the presence of specific hardware are sufficient to associate RSA keys with their corresponding software library; Nemec et al. [47] and Janovsky et al. [32] improved upon this work. The attribution performed by both approaches relied on rules based on the libraries' biases. Despite the high accuracy achieved by Nemec et al.