
Analyzing Use of Cryptographic Primitives by Machine Learning Approaches


Masaryk University Faculty of Informatics

Analyzing use of cryptographic primitives by machine learning approaches

Ph.D. Thesis Proposal

Mgr. Adam Janovský

Advisor: prof. RNDr. Václav Matyáš, M.Sc., Ph.D.

Brno, Spring 2020

Abstract

Systems that employ cryptographic primitives and protocols are being improved continuously on a multitude of layers. Modern cryptographic primitives have been introduced to replace the weak ones. Also, the implementations of the systems get frequently audited or even standardized. Still, the security community has failed many times at implementing secure systems, often falling into known pitfalls, or even repeating the same patterns of mistakes. Even if the systems prove to be well designed by security specialists, their implementation is no less important w.r.t. security. Often, developers untrained in security implement these systems, neglecting essential aspects of security. While individual failures give valuable lessons, the security community should strive to learn where the systems fail systematically.

In this thesis proposal, a systematic analysis of the usage of cryptographic primitives is outlined for three related domains: sources of RSA keys, malicious Android applications, and security devices certified under the Common Criteria framework. For each of the mentioned domains, the proposal reviews relevant literature and presents the research plan. In addition, some already achieved results are presented, together with a re-print of two selected publications. The objective of our research is to reveal individual vulnerabilities and to point to weak elements in the systems, but most importantly to raise the awareness of the security situation in the studied domains.

In the area of RSA keys, we built a model for the classification of private RSA keys and identified the possible (but also ruled out the improbable) sources of GCD-factorable keys found on the Internet. Concerning malicious Android applications, more than 250 thousand samples were automatically processed and nearly 1 million cryptographic API call sites were collected. Analysis of these call sites revealed several interesting trends in usage by malware authors. For instance, we discovered widespread use of weak hash functions, growth in the use of public-key cryptography, and a progressive decrease in the use of cryptographic API in malware. These insights may help to prevent some future threats.

Contents

1 Introduction

2 State of art
   2.1 Bias in RSA key generation
   2.2 Cryptographic API in Android malware
       2.2.1 Inferring rules of cryptographic API misuse
       2.2.2 Evaluation of cryptographic API misuse
       2.2.3 Attribution of cryptographic API misuse
       2.2.4 Automatic cryptographic API repairs
       2.2.5 Study of cryptography in Android malware
   2.3 Cryptographic primitives in certified devices

3 Aims of thesis
   3.1 Bias in RSA keys
   3.2 Cryptographic API in Android malware
   3.3 Cryptography in certified devices
   3.4 Limitations
   3.5 Publication venues
   3.6 Tentative schedule

4 Achieved results
   4.1 Biased RSA private keys: Origin attribution of GCD-factorable keys
   4.2 Cryptographic API in Android malware
   4.3 Bringing kleptography to real-world TLS
   4.4 List of publications

Appendices

A Attached publications of author
   A.1 Biased RSA private keys: Origin attribution of GCD-factorable keys
   A.2 Bringing kleptography to real-world TLS

1 Introduction

Systems that employ cryptographic primitives and protocols are being improved continuously on a multitude of layers. Modern cryptographic primitives have been introduced to replace the weak ones [1, 2, 3]. Also, the implementations of the systems get frequently audited or even standardized. Still, the security community has failed many times at implementing secure systems, often falling into known pitfalls, or even repeating the same patterns of mistakes [4, 5]. Even if the systems prove to be well designed by security specialists, their implementation is no less important w.r.t. security. Often, developers untrained in security implement these systems, neglecting essential aspects of security [6, 7]. While individual failures give valuable lessons, the security community should strive to learn where the systems fail systematically.

We offer three main perspectives that affect the security of a system w.r.t. cryptographic primitives: (i) strong, previously unattacked primitives are chosen; (ii) whatever is used, it is generated and leveraged in a proper way; (iii) these primitives get then composed into well-designed and implemented protocols. A plethora of individual failures across these levels was witnessed in the past. To mention a few of them, consider the efficient cryptanalysis of the A5/1 cipher used in GSM [8, 9], or the key-scheduling weaknesses discovered in the RC4 cipher by Fluhrer et al. [10], which was utilized in the Wired Equivalent Privacy (WEP) protocol. These two works represent the problem on level (i), where insecure primitives were chosen for the protocols. The ROCA [11] vulnerability constitutes a case when a potentially secure primitive, i.e., an RSA key, was generated improperly, resulting in practically factorable keys. Last but not least, multiple flaws in the Transport Layer Security (TLS) protocol were discovered along the way, either due to design errors – e.g., weak Diffie-Hellman (DH) keys in the Logjam attack [12] – or due to implementation mistakes – e.g., Lucky Thirteen [13].

We conjecture that whole domains of systems that employ cryptography would benefit from automatic scrutiny, especially on levels (ii) and (iii). Examples of such domains may be mobile applications or certified devices. Such automation can even provide insight beyond the discovery of individual vulnerabilities. For instance, knowing how malware authors use cryptography on mobile platforms can help us prevent future threats. Similarly, understanding the selection of cryptographic primitives for certified black-box devices could guide future inventors in their design process. Also, a focus on temporal trends may help to evaluate whether the level of bit security outpaces the growth of computational power. We claim that awareness of the landscape of cryptographic primitives provides valuable insight for the security community and contributes to the evaluation of the security of the studied systems.

Many domains of cryptography-driven systems exist that are worth examining. Due to the limited time frame for the dissertation thesis, we concentrate only on three specific ecosystems. In these domains, we plan to systematically analyze their usage of cryptographic primitives on multiple layers.

Firstly, we expand the work [14, 15] that revealed a bias in RSA public keys. We spotted that private keys had not been studied, creating a gap in knowledge that we aim to fill. Having a model capable of unveiling the probable source of private keys allowed us to analyze a dataset of GCD-factored keys in the wild. Simultaneously, the analysis of private keys provided much more precise results that can serve for internal audits of the keys.

Secondly, this thesis concentrates on malicious applications on mobile platforms. Since the Android platform is the most prevalent among mobile platforms [16] and is also continuously threatened by malware [17], we focus our attention on Android. Our research on the Android platform involves two main steps. Currently, we are mapping the popularity of cryptographic primitives among malware authors, describing the ecosystem as a whole. In the near future, we plan to systematically search for weaknesses in malware implementations w.r.t. cryptography that could potentially allow one to block some functions of the malware. Additionally, the patterns of cryptographic API misuse in Android malware can be directly compared to the results obtained from the landscape of benign applications [18, 19], and can be critical for securing the systems and preventing future threats.

Thirdly, we aim to systematically scrutinize the certification documents of cryptographic devices – certified via the Common Criteria (CC) framework [20] – and describe the trends in cryptography usage in those certificates.

The objective of this proposal is to introduce a long-term research plan that will eventually lead to a dissertation thesis. The rest of this document is structured as follows. Chapter 2 reviews the state of the art relevant to the studied problems. Then, in Chapter 3, we precisely formulate our research goals and dissect them into multiple stages. At the same time, the chapter provides a tentative schedule for the outlined work. Finally, in Chapter 4, we review our already achieved results accepted or published at conferences. As an appendix, two already published papers are included in the form of full re-prints.

2 State of art

In this chapter, we summarize the existing research and ongoing efforts of the security community w.r.t. the cryptographic security of selected ecosystems. In Section 2.1, we review the bias in RSA keys. We specifically concentrate on the impact of bias on key fingerprinting and vulnerability discovery. Next, in Section 2.2, we focus on work that maps the cryptographic API landscape in Android applications. Even though malicious applications are the main subject of our research, most of the existing research concerns only benign applications. Still, the methodology of the related research applies both to benign and malicious applications. Finally, in Section 2.3, we introduce the Common Criteria framework and revisit several papers that touch upon the cryptographic primitives in certified devices.

2.1 Bias in RSA key generation

This section is adopted from [21].

Fingerprinting of devices based on their physical characteristics, exposed interfaces, behaviour in non-standard or undefined situations, errors returned, and a wide range of other side-channels forms a well-researched area. Experience shows that finding a case of non-standard behaviour is usually possible, while making a group of devices indistinguishable from each other is very difficult due to an almost infinite number of observable characteristics, resulting in an arms race between device manufacturers and fingerprinting observers. Yet, having a device fingerprinted is helpful for better understanding complex ecosystems, e.g., for quantifying the presence of interception middle-boxes on the Internet [22], the types of connected clients, or the versions of operating systems. Differences may even help to point out subverted supply chains or counterfeit products.

When applied to the study of cryptographic keys and cryptographic libraries, researchers devised a range of techniques to analyze the fraction of encrypted connections, the prevalence of particular cryptographic algorithms, and the chosen key lengths or cipher suites [23, 24, 25, 26, 27, 28, 29]. In the rest of this section, we concentrate on the bias found specifically in RSA keys. Recall that the public RSA key is a large integer n that is in fact a product of two randomly generated prime numbers p, q. When we write about bias in an RSA key, we in fact mean that the primes p and q were not selected uniformly from the pool of possible primes. Instead, the developers of key generation algorithms have various motivations to modify their algorithms to choose primes differently, as discussed in the following paragraphs.

The first evidence of bias in RSA keys was detected in 2012 independently by two teams of researchers. Both Lenstra et al. [30] and Heninger et al. [31] demonstrated that a non-trivial fraction (0.5%) of RSA keys used on publicly reachable TLS servers is generated insecurely and is practically factorable. This is because the affected network devices were found to independently generate RSA keys that share a single prime or both primes. While an efficient factorization algorithm for RSA moduli is unknown, when two keys accidentally share one prime, efficient factorization is possible using the Euclidean algorithm to find their GCD¹. Still, the number of public keys obtained from crawling TLS servers was too high to allow for the investigation of all possible pairs. However, the distributed GCD algorithm made it possible to analyze hundreds of millions of keys efficiently. That led to two papers showing that even in 2016, almost 1% of RSA keys on the Internet remained weak and practically factorable [32, 25].

In reaction to Lenstra et al., Mironov [33] revealed that RSA keys from the OpenSSL library intentionally avoid small factors of p − 1, which in fact creates a bias in the distribution of primes. Further work by Švenda et al. [14] made this method generic.
In their work, the authors analyzed over 60 million public keys from 38 different sources, showing that many other libraries produce biased keys, which enables origin attribution. As a result, both separate keys and large datasets could be analyzed for their origin libraries. Measurements on large datasets were presented in [15], leading to an accurate estimation of the fraction of cryptographic libraries used in large datasets like IPv4-wide TLS.

1. Note that the keys sharing both primes are not susceptible to this attack but reveal their private keys to all other owners of the same RSA key pair.

Švenda et al. identified several sources of bias in RSA keys that were later confirmed by our work [34]:

1. Performance optimizations, e.g., the most significant bits of primes set to a fixed value to obtain RSA moduli of a defined length.

2. Type of primes: probable, strong, and provable primes:
   • For probable primes, whether candidate values for primes are chosen randomly or a single starting value is incremented until a prime is found.
   • When generating candidates for probable primes, small factors are avoided in the value of p − 1 by multiple implementations without explanation.
   • Blum integers are sometimes used for RSA moduli – both RSA primes are congruent to 3 modulo 4.
   • For strong primes, the size of the auxiliary prime factors of p − 1 and p + 1 is biased.
   • For provable primes, the recursive algorithm can create new primes of double to triple the binary length of a given prime; usually one version of the algorithm is chosen.

3. Ordering of primes: are the RSA primes in the private key ordered by size?

4. Proprietary algorithms, e.g., the well-documented case of the Infineon fast prime key generation algorithm [11].

5. Bias in the output of a PRNG: often observable only from a large number of keys from the same source.

6. Natural properties of primes that do not depend on the implementation.

Insights from [14] allowed Nemec et al. [11] to notice an especially strong fingerprint that enabled attribution of keys from the Infineon CJTOP 80K smartcard with 100% accuracy. The researchers noticed that the keys are of a very specific form that rapidly decreases the entropy of the key. With such information at hand, Nemec et al. were able to mount an adjusted version of Coppersmith's factorization method [35], showing that the respective keys of common lengths (1024 or 2048 bits) are practically factorable. The resulting vulnerability was named Return of the Coppersmith Attack (ROCA) [36]. The simple test for the ROCA vulnerability in public RSA keys made it possible to measure the fraction of citizens of Estonia who held an electronic ID powered by a vulnerable smartcard, by inspecting the public repository of eID certificates [11]. The fingerprinting of keys from smartcards was also used to detect that some private keys were generated outside of the card and injected later into the eIDs, despite the policy mandating that all keys be generated on-card [37].

After the detection of GCD-factorable keys, the question of their origin naturally followed. Previous research addressed it using two principal approaches: 1) an analysis of the information extractable from the certificates of GCD-factorable keys, and 2) matching specific properties of the factored primes with primes generated by a suspected library – OpenSSL. The first approach led to the detection of a range of network routers that seeded their PRNG shortly after boot without enough entropy, which caused them to occasionally generate a prime shared with another device. These routers contained a customized version of the OpenSSL library, which was confirmed with the second approach, since OpenSSL code intentionally avoids small factors of p − 1 as shown by [33].

While this suite of routers was clearly the primary source of the GCD-factorable keys, are they the sole source of insecure keys? The paper [32] identified 23 router/device vendors that used the code of OpenSSL (using a specific OpenSSL fingerprint based on the avoidance of small factors in p − 1 and on information extracted from the certificates). Eight other vendors (DrayTek, Fortinet, Huawei, Juniper, Kronos, Siemens, Xerox, and ZyXEL) produced keys without such an OpenSSL fingerprint, and the underlying libraries remained unidentified. We stress that the origin of weak factorable keys needs to be identified in order to notify the maintainers of the code to fix the underlying issues. This may be possible via direct analysis of the private keys obtained from the GCD factorization.
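The ROCA fingerprint discussed above can be illustrated with a simplified check: a vulnerable modulus is congruent to a power of 65537 modulo a product of small primes, so n mod r must fall into the multiplicative subgroup generated by 65537 for every small prime r. The prime set below is a toy choice for illustration, not the production detector:

```python
# Simplified ROCA-style fingerprint (illustrative; the real test uses a
# larger, carefully chosen set of small primes).
def in_subgroup(n: int, r: int) -> bool:
    """Is n (mod r) a power of 65537 modulo the small prime r?"""
    elem, seen = 1, set()
    while elem not in seen:        # enumerate the cyclic subgroup <65537> mod r
        seen.add(elem)
        elem = elem * 65537 % r
    return n % r in seen

def roca_fingerprint(n: int, small_primes=(11, 13, 17, 19, 37)) -> bool:
    return all(in_subgroup(n, r) for r in small_primes)

# A modulus constructed as a power of 65537 triggers the fingerprint,
# while a nearby integer almost surely does not:
assert roca_fingerprint(65537**3)
assert not roca_fingerprint(65537**3 + 2)
```

Because the subgroups are small, a random modulus passes all the checks only with tiny probability, which is what makes the fingerprint so reliable in practice.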

2.2 Cryptographic API in Android malware

This section is adopted from [38].

The Android operating system has spread consistently worldwide during the last decade, reaching over 3 billion users in 2020 [16]. At the same time, security threats against Android have dramatically multiplied. In particular, recent reports showed a continually increasing number of malicious samples and their variants, with more than 10 000 infection attempts per day [17]. Hence, malware detection has become particularly critical to ensure data protection on Android devices.

Various efforts have been made to provide reliable protection against Android malware. In particular, both industry and academia have been focusing on developing efficient static and dynamic analysis approaches that could also predict new attacks. A significant part of the research has been carried out to develop, among others, machine learning-based techniques, de-obfuscation tools, and permission and bytecode analyzers [39, 40, 41, 42, 43, 44, 45, 46]. Overall, considerable progress has been made in developing tools that can be easily used by end-users and that provide reliable protection against most threats.

However, the problem of detection is only one part of a much more complex ecosystem. While malware detection is undoubtedly a critical problem (especially from the perspective of the end-user), other useful elements can be extracted from malicious samples. Notably, many samples belong to organized campaigns and exploit kits, and may feature characteristics that can provide information about their creators. These aspects, related to threat intelligence, are crucial for properly preventing and analyzing future threats.

The use of cryptography is one of the critical aspects that can provide valuable information about the functionality of malware and its interaction with external resources (such as network connections and external websites). While cryptography is widely used to ensure confidentiality, integrity, and availability in benign applications, it can also be employed in a plethora of artful ways to serve an adversary's malevolent objectives. For instance, cryptography equips the attacker with the ability to fingerprint the parameters of an infected device, to encrypt users' media files, or to establish a secure connection with a command-and-control server.


The notorious difficulty of implementing cryptography securely [4, 5] can also shed light on the skills of the malware creators. It can be expected that advanced and organized adversaries will exhibit a solid understanding of cryptography and will prefer more secure primitives, while unskilled authors may fall into various security pitfalls.

Most of the research on the use of cryptographic API in Android has focused on the analysis of benign applications. In the ecosystem of benign Android applications, the ultimate goal of the research community is to mitigate cryptographic API misuse. Several steps are needed to achieve this goal, and the respective works usually treat one or two steps at a time. We can summarize these steps as follows: (i) inferring the rules of cryptographic API misuse; (ii) evaluation of cryptographic API misuse; (iii) attribution of cryptographic API misuse; (iv) automatic cryptographic API repairs. We discuss the related research for all these steps in the following subsections.

2.2.1 Inferring rules of cryptographic API misuse

In the area of inferring the rules of cryptographic API misuse, the goal is to create a list of specifications for developers and researchers that imply the insecure use of cryptography. Such rules can be crafted manually, as done in [18, 47, 48]. However, this approach does not scale well, leading to the works [49, 50] that attempt to infer these rules from git commits, based on the assumption that newly introduced commits typically eliminate security vulnerabilities from the code. Surprisingly, Paletov et al. [49] reported success with this approach, whereas the chronologically later work [50] cautions against relying on the initial assumption.

An inherent aspect of any set of rules for cryptographic API misuse is the presence of false positives. Indeed, each work presumes that cryptographic API is used to secure some form of communication. While using the MD5 hash function for integrity protection is surely insecure, it can represent a viable choice for non-cryptographic applications. Yet, static rules cannot tell those applications apart. Beyond the Android platform, the OWASP organization [51] maintains a comprehensive list of static analysis tools for security scrutiny.


2.2.2 Evaluation of cryptographic API misuse

After having obtained a set of rules that suggest security violations, it is vital to explore these violations in the Android application market. While the more powerful dynamic analysis is employed in [47, 48] to show that more than half of the examined applications violate the static set of rules, the application dataset is rather small (size < 100). In contrast, the static analysis approach used by Egele et al. in [18] allowed the authors to examine a large dataset of 145 thousand benign applications, revealing that 10.4% of them use some form of cryptography. Overall, 88% of applications that employ cryptography violate at least one rule of secure cryptographic API usage. The authors used six rules that imply insecure usage of cryptography. In fact, they claim that any application violating any of the following rules cannot be secure:

1. Do not use ECB mode for encryption [52].

2. Do not use constant initialization vectors (IVs) for CBC mode of encryption [52, 53].

3. Do not use constant encryption keys.

4. Do not use constant salts for password-based encryption (PBE) [54].

5. Do not use fewer than 1 000 iterations of the underlying key-derivation function for PBE [54].

6. Do not use static seeds to initialize SecureRandom(), a pseudorandom number generator in Java.
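To make the rules concrete, a toy static check over source-like strings might look as follows. This is an illustrative sketch only: tools such as the one by Egele et al. operate on Dalvik bytecode, and the regular expressions below are our simplifications, not their rules engine:

```python
import re

# Toy static checks inspired by the six rules above (illustrative only;
# real analyses work on Dalvik bytecode, not on source text).
RULES = {
    "ECB mode": re.compile(r'Cipher\.getInstance\("[^"]*?/ECB/'),
    "constant IV": re.compile(r'new\s+IvParameterSpec\(\s*new\s+byte\[\]'),
    "constant key": re.compile(r'new\s+SecretKeySpec\(\s*new\s+byte\[\]'),
    "static PRNG seed": re.compile(r'\.setSeed\(\s*(?:\d|new\s+byte\[\])'),
}

def violations(java_source: str) -> list[str]:
    """Return the names of rules the snippet appears to violate."""
    return [name for name, pattern in RULES.items()
            if pattern.search(java_source)]

snippet = 'Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");'
print(violations(snippet))   # only the ECB rule fires
```

A pattern-based checker like this also demonstrates the false-positive problem discussed earlier: it flags every ECB transformation string, regardless of whether the cipher protects anything security-relevant.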

The identical set of rules was later utilized by Muslukhov et al. [55]. The authors gathered a new dataset of 109 thousand APKs that contain at least one cryptographic API call. At the same time, they concentrated on how misuse evolved between 2012 (studied in [18]) and 2016. Overall, they reported a ratio of applications misusing cryptographic API similar to that of 2012. Furthermore, the authors also covered the attribution of the misuse (see Section 2.2.3).

These static rules were later superseded by a more sophisticated definition language in [56]. The authors analyzed 10 thousand Android applications and detected misuses in more than 95% of cases.


Concentrating on the TLS protocol, the work [57] analyzed 13 thousand Android applications to reveal inadequate TLS usage in 8% of cases. The authors also managed to launch 41 MitM attacks against selected applications. Outside the Android platform, Anderson et al. [58] discovered a bias in the parameters of TLS connections established by malicious applications. Such a bias can allow for malware fingerprinting just by observing the parameters of the respective TLS connections.

2.2.3 Attribution of cryptographic API misuse

When aiming at the attribution of the misuse, one must know how to detect the usage of third-party libraries before deciding whether they contain insecure code that is being called. This problem has been addressed in [59, 60, 19], where the authors proposed complex matching algorithms to reliably detect third-party libraries. Later, Muslukhov et al. [55] relied on the package names and showed that 90% of cryptographic misuse cases originate from approximately 600 third-party libraries.

2.2.4 Automatic cryptographic API repairs

More distant from our research are papers that concentrate on automatic repairs of cryptographic API misuse. Such research attempts not only to discover cryptographic vulnerabilities in the source code, but also to suggest or apply automatically generated patches for such problems. From this area of research, we refer the reader to [61, 62, 63].

2.2.5 Study of cryptography in Android malware

We point out that none of the aforementioned research on the Android platform concentrated on malicious applications. To the best of our knowledge, there is no research that studies the usage of cryptographic API specifically in malicious Android applications. Moreover, the extraction and analysis of cryptographic primitives embedded in Android applications has not been studied either. Our work aims to fill these gaps in knowledge.

2.3 Cryptographic primitives in certified devices

With the start of the ubiquitous use of cryptography, governments and related institutions demanded the use of products that underwent some standardized process of specification and evaluation of their security. The word product usually denotes any piece of hardware, firmware, or software. In the respective standards, the product or device may also be called a cryptographic module or, more generally, a target of evaluation. This proposal uses these terms interchangeably.

The era of our interest began in 1994 when the first version of Common Criteria for Information Technology Security Evaluation – abbreviated as Common Criteria or CC – was released. At the core, the CC standard is very generic and is defined by several concepts:

• Target of evaluation – the subject of standardization.

• Protection profile – a document submitted by a user or a community that specifies the threats, security requirements, and assurance requirements for the desired class of products. More generally, the protection profile rigorously describes the security context of a class of products. For instance, the CC website [20] lists 21 protection profiles for Products for Digital Signatures.

• The Evaluation Assurance Level (EAL) – a number between 1 and 7 that specifies which controls were checked when evaluating the corresponding device. For instance, in order to evaluate a device against a protection profile that is EAL7 compliant, the whole source code of the device must be investigated. The level to which a target may be validated is specified in the corresponding protection profile.

The evaluation of the devices in question is run by independent parties that must themselves comply with specific standards. The certificates are then recognized across many countries around the world. Also, note that the CC framework does not state how the evaluation should be conducted, which leaves a degree of freedom to the evaluators.


Nearly in parallel to CC, FIPS 140² was released with complementary objectives. In the rest of this proposal, we concentrate on CC. As of September 2020, the most recent version of CC is Release 5 of v3.1 [20], and 4356 certified products are documented in the CC database [64]. Individual case studies exist in which various parties qualifying for certification [65, 66, 67, 68] share their lessons from the process or comment on the CC framework in general. As stated in the cited works, the evaluation process is often costly and can take months or even years to complete.

The fact that a target of evaluation passed the certificate validation does not imply the desired level of security. It merely shows that some independent party validated its security against a standardized specification. On many occasions, vulnerabilities in certified products were discovered. As an example, we mention the recent Minerva vulnerability [69] affecting the Athena IDProtect smartcard, validated both under FIPS 140-2 and at the EAL4+ level in the CC framework.

To the best of our knowledge, no research exists that would methodically evaluate the content of the CC certification documents. Two studies, however, exist with a close connection to such an idea. In 2019, Vassilev [70] studied the possibilities of developing neural networks for sentiment analysis of documents that accompany the certification process of FIPS 140-2. While the paper reports positive results on security-unrelated datasets, the application to FIPS certificates is left for future work. Still, the author emphasizes the need for automatic processing of the certification documents. A methodologically very similar paper – although with a different application domain – was written in 2018 by Harkous et al. [71]. In their work, the authors trained a neural network to automatically process arbitrary privacy policy documents³. The authors learned a privacy-centric language model and built a bot capable of answering user queries about privacy policies, providing helpful answers in more than 80% of cases.
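As a first step toward such automatic processing, even simple pattern matching can surface which primitives a certification report mentions. The sketch below is illustrative only – the pattern set and the sample sentence are our toy choices, not a validated extractor:

```python
import re
from collections import Counter

# Toy patterns for primitive mentions in certification reports (illustrative).
PRIMITIVE_RE = re.compile(
    r"\b(AES|3DES|DES|RSA|ECDSA|ECDH|SHA-?(?:1|224|256|384|512))"
    r"(?:[ -](\d{3,4}))?\b"
)

def count_primitives(report_text: str) -> Counter:
    """Count primitive mentions, keeping key/digest sizes when present."""
    counts = Counter()
    for name, size in PRIMITIVE_RE.findall(report_text):
        counts[f"{name}-{size}" if size else name] += 1
    return counts

sample = "The TOE implements AES-256 and RSA 2048; SHA-1 is used for legacy."
print(count_primitives(sample))
```

Aggregating such counts per certification year would already expose the kind of temporal trends in primitive popularity that our research targets.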

2. Security requirements for cryptographic modules, a part of the Federal Information Processing Standards managed by the National Institute of Standards and Technology (NIST). The newest version, 140-3, is available from https://csrc.nist.gov/publications/detail/fips/140/3/final.
3. The results of their research are now offered as a service at https://pribot.org/.

3 Aims of thesis

Our research aims to examine the usage of cryptographic primitives in three selected ecosystems. More specifically, we plan to:

1. Expand the study of bias in RSA keys to the private keys. We use the acquired classification models to unveil possible sources of practically factorable keys from the Internet.

2. Map the usage patterns of cryptographic API in Android malware. We also plan to learn what ratio of malicious applications misuses the cryptographic API.

3. Shed light on the popularity of cryptographic primitives in certified devices.

Due to the volume of the analyzed data in the corresponding domains, our methodology is driven by data analysis and simple machine learning techniques. Yet, we by no means aim to advance the field of machine learning; we merely use it as a way to tackle problems in computer security. The following paragraphs explain our specific goals in each of the fields of focus.

3.1 Bias in RSA keys

The problem of attributing the RSA keys to their origin library got opened by [14]. Despite the follow-up works [15, 11], little is still known about fingerprinting possibilities with private RSA keys at hand. We aim to expand on previous research to answer the following questions:

1. How much can the origin attribution of RSA keys be improved when the primes p, q are studied instead of the modulus n? How many distinct groups of keys are recognizable?

2. What models can be built to address this problem, and what performance can they achieve when classifying a single key, or even a batch of keys from the same source at once?


3. What is the origin of the yet unclassified part of GCD-factorable keys that are periodically collected as a part of the Rapid7 dataset [72]?

The methodology of our research is planned as follows. The goal is to expand the dataset from [14] to at least 150 million RSA keys. Using such a large dataset, we aim to revisit the sources of bias in the primes and expand the representatives of the bias identified in [14]. With the help of the keys' biased attributes, we plan to build a classification model capable of attributing the keys to their origin library. The performance of the model is to be evaluated. We will also measure the performance when the domain of possible sources is limited, e.g., to smartcards or to keys found in wide TLS scans. Furthermore, we will utilize the Rapid7 dataset [72] to obtain the primes of GCD-factorable keys on the Internet, as of autumn 2019. Finally, those private keys will be classified and their origin revealed, possibly uncovering new sources of factorable keys. Our objective is to release our source code together with the dataset as an open-source repository.
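The intended classification approach can be illustrated on a toy scale. The sketch below trains a naive Bayes classifier on categorical features of the primes; the concrete features (top bits of p and q, residues modulo a small prime) only mimic the kinds of bias reported in [14], and the class and feature names are illustrative, not the actual model:

```python
import math
from collections import Counter, defaultdict

def key_features(p, q):
    # Toy features mimicking reported bias: the 4 most significant bits of
    # each prime and residues modulo 3. The real feature set is richer.
    return [p >> (p.bit_length() - 4), q >> (q.bit_length() - 4), p % 3, q % 3]

class KeySourceClassifier:
    """Naive Bayes over categorical key features, with add-one smoothing."""

    def fit(self, keys, labels):
        self.priors = Counter(labels)
        self.feat_counts = defaultdict(lambda: defaultdict(Counter))
        for (p, q), lib in zip(keys, labels):
            for i, value in enumerate(key_features(p, q)):
                self.feat_counts[lib][i][value] += 1
        return self

    def log_posterior(self, lib, p, q):
        total = sum(self.priors.values())
        logp = math.log(self.priors[lib] / total)
        for i, value in enumerate(key_features(p, q)):
            counts = self.feat_counts[lib][i]
            # Smoothed per-feature likelihood P(feature = value | lib).
            logp += math.log((counts[value] + 1) /
                             (self.priors[lib] + len(counts) + 1))
        return logp

    def predict(self, p, q):
        return max(self.priors, key=lambda lib: self.log_posterior(lib, p, q))
```

Trained on real keys, such a model attributes an unseen key to the library whose bias profile it matches best; libraries with identical distributions over the feature set remain indistinguishable.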

3.2 Cryptographic API in Android malware

Our research intentions on the Android platform are divided into two phases. Firstly, we aim to answer the following questions:

1. Are system cryptographic libraries preferred over third-party libraries or home-brewed cryptography by the malware authors?

2. What categories of cryptographic primitives are most utilized in Android malware (e.g., symmetric cryptography, hash functions, algorithms, etc.)?

3. How does the usage of cryptographic API differ between benign and malicious applications?

4. Are there any temporal trends w.r.t. cryptographic API usage in Android malware?

To conduct systematic research on Android malware, we developed an open-source tool for static analysis of Android binaries. Specifically, the tool can automatically download malicious samples from the Androzoo dataset (where over 1 million samples are available). Furthermore, the tool uses the jadx decompiler to obtain the Java source code of the application. On the source code, regular expressions are matched to answer our research questions. Possibly, a program-slicing method may enhance the tool's abilities. Our objective is to analyze at least 250 000 applications.

Answers to the questions above will provide a deep understanding of the cryptographic API landscape in Android malware. Once they are found, we would like to concentrate on the misuse of cryptographic API by malware developers. Specifically, we would like to study the following:

1. How prevalent is the misuse of cryptographic API in Android malware? How does it compare to the known misuse in benign applications (as discussed in Chapter 2)?

2. What is the distribution of hard-coded cryptographic primitives?

3. If RSA keys can be retrieved from the applications, how were they generated?

4. Can the misuse possibly be exploited to block some malware functionality?

We would like to highlight our aim to acquire RSA keys generated by the malware authors and use our previous research to unveil their origin. Naturally, a prerequisite is that static RSA keys are widely adopted by the malware authors.
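The regex-matching step of the tool can be sketched as follows; the three categories and the patterns below are illustrative placeholders, not the tool's actual rule set, which covers the Android cryptographic API much more broadly:

```python
import re

# Illustrative patterns for Java cryptographic call sites in decompiled
# source. The capture group extracts the requested algorithm string.
CRYPTO_CALLS = {
    "cipher": re.compile(r'Cipher\.getInstance\(\s*"([^"]+)"'),
    "digest": re.compile(r'MessageDigest\.getInstance\(\s*"([^"]+)"'),
    "keygen": re.compile(r'KeyPairGenerator\.getInstance\(\s*"([^"]+)"'),
}

def extract_call_sites(java_source):
    """Return (category, algorithm) pairs found in one decompiled .java file."""
    sites = []
    for category, pattern in CRYPTO_CALLS.items():
        for match in pattern.finditer(java_source):
            sites.append((category, match.group(1)))
    return sites
```

Running such an extractor over every decompiled file of an APK yields the list of call sites that is later aggregated across the whole dataset.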

3.3 Cryptography in certified devices

There are two research questions in the area of CC certified devices that we attempt to answer:

1. What is the landscape of cryptographic primitives and applied security mechanisms w.r.t. the achieved evaluation assurance level?

2. How prevalent are the dependencies between the targets of evaluation, and how do they affect the impact of CVEs?


Naturally, other questions may arise during the analysis of the processed certificates. We stress that our methodology can be easily adapted to other certification schemes, like FIPS 140. Yet, to fit the estimated thesis schedule, we limit ourselves to CC. We argue that many aspects of the ecosystem of certified devices may benefit from our work. At the core of the validation process are PDF files, which are non-trivial to process automatically and extract information from. This is partly why the transparency of validation is virtually non-existent and the amount of information readily available to a security analyst is severely limited. Without the ability to query an information base about certified devices (or products undergoing a validation), several challenges for the community arise:

• Dependencies between devices (which manifest as references in the certification documents) are hard to search. For instance, when a CVE directly impacts a certified device, it is difficult to search for other systems that may be indirectly affected (e.g., by employing the affected device as a component). This is especially a concern due to the arms race between those who fix bugs and those who attempt to exploit them.

• Guidance for those who apply for the certificates is missing. Based on the selection of primitives and defenses that previous authors utilized, the candidates could learn precious insights that would make the evaluation process quicker, cheaper, and the certified devices more secure.

• The validation process is destined to remain mostly human work. As such, it may take a lot of time, and the total number of certified products corresponds with the manpower involved in their validation1. Furthermore, a human-driven review may produce errors that computers would not make. Consider, for example, a case when a single device is being validated by two different teams of reviewers who decide inconsistently about the device.

1. Recent efforts from NIST attempt to automate some of the tasks during module evaluation w.r.t. FIPS 140-2. Some are demonstrated in the Automated Cryptographic Validation Testing (ACVT) programme [73] that strives to automatically validate implemented cryptographic algorithms. This is, however, only the tip of the iceberg.
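To illustrate the dependency-search challenge, a reference graph between certificates can be recovered from the extracted document texts and queried for transitively affected devices. The sketch below assumes a single simplified certificate-ID scheme (real IDs vary by issuing country) and is not part of any existing tool:

```python
import re
from collections import defaultdict

# Hypothetical ID pattern; real German CC IDs look like "BSI-DSZ-CC-1051-2019".
CERT_ID = re.compile(r"BSI-DSZ-CC-\d{4}-\d{4}")

def reference_graph(cert_texts):
    """cert_texts: {cert_id: text extracted from the certificate's PDFs}.
    Returns {cert_id: set of other certificate IDs referenced in it}."""
    graph = defaultdict(set)
    for cert_id, text in cert_texts.items():
        graph[cert_id] = set(CERT_ID.findall(text)) - {cert_id}
    return graph

def affected_by(graph, vulnerable_id):
    """All certificates that directly or transitively reference a vulnerable
    device, i.e., candidates for indirect impact of a CVE."""
    hit, stack = set(), [vulnerable_id]
    while stack:
        current = stack.pop()
        for cert, refs in graph.items():
            if current in refs and cert not in hit:
                hit.add(cert)
                stack.append(cert)
    return hit
```

An analyst could then start from the certificate of a device hit by a CVE and immediately enumerate every certified product built on top of it.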

3.4 Limitations

The presented research proposal naturally has some limitations that have to be discussed. In the area of biased RSA keys, the following are worth noting: our classifier will fail to reliably attribute keys from previously unseen sources (as they are not a part of the dataset that trains the model). In addition, it is not possible to distinguish between libraries that produce identically distributed keys w.r.t. our feature set. Last but not least, the private key must be available to run the analysis, which may not always be possible for the user, especially when tamper-resistant devices, from which the key cannot be exported, are considered.

Considering the research performed on Android malware, it is worth noting that the whole analysis is fully automated and static. Consequently, we do not thoroughly explore how cryptographic primitives are employed in the context of the applications (e.g., to send SMS, encrypt data, et cetera). Such an analysis is extremely complex due to the variety of application contexts, and it is hardly feasible with static analysis. Furthermore, static analysis inherently produces a small percentage of false results that must be manually verified, or it may miss some findings. Similarly, when operating in the landscape of obfuscated malware applications, some instances are nearly impossible to process with automated methods due to strong obfuscation techniques employed by the authors (such as those concerning application encryption and dynamic code loading [41]).

In the area of certified systems, similar issues as with Android malware arise. If a list of rules is used to process the PDF files, some important facts may be missed. On the contrary, relying on machine learning to process the documents introduces uncertainty and the problem of explaining the machine learning decisions. In addition, this proposal does not aim to fix the root cause of the problem. Indeed, adjusting the standard so that it mandates machine-readable documents would be a much more effective, yet at the moment unrealistic, solution.

3.5 Publication venues

Research in the area of computer security is mostly published at conferences, whereas journals are more suitable for long-term investigations. Consequently, even though our publications are closely related to each other, they are better suited for conferences. For publishing our results, we strive for the following venues:

• USENIX Security Symposium, CORE A* rank.

• Annual Computer Security Applications Conference (ACSAC), CORE A rank.

• International Symposium on Research in Attacks, Intrusions and Defenses (RAID), CORE A rank.

• European Symposium on Research in Computer Security (ESORICS), CORE A rank.

• International Conference on Applied Cryptography and Network Security (ACNS), CORE B rank.

• Financial Cryptography and Data Security (FC), CORE B rank.

3.6 Tentative schedule

The past and expected future timeline of the proposed research is:

• Autumn 2018. I worked on a publication about kleptography that was accepted and presented at the WISTP conference.

• Spring 2019. I initiated the analysis of bias in RSA private keys; I was mostly studying the related work.

• Autumn 2019. We finalized the analysis of bias in RSA private keys; we also classified the practically factorized keys. In parallel, I was analyzing cryptographic API in Android malware and formulating the proposed paper’s objectives.


• Spring 2020. I finished the work on the paper studying cryptographic API in Android malware. Consequently, I was able to formulate this thesis proposal.

• Autumn 2020. I plan to submit the paper on cryptographic API in Android malware. Apart from that, I plan to perform the data analysis on the dataset of documents that accompany the evaluation of certified devices. Also, the results of our work on RSA private keys have been accepted for the ESORICS conference and presented there.

• Spring 2021. I intend to finish writing the paper about certified devices. In parallel, I plan to perform the analysis of the misuse of cryptographic API in Android malware.

• Autumn 2021. I want to submit a paper about the misuse of cryptographic API in Android malware. I intend to polish the software artifacts of the paper that treats cryptography in certified devices.

• Winter 2021. I plan to prepare the software artifacts of the misuse of cryptographic API in malware and start writing a dissertation.

• Summer or Autumn 2022. I hope to finish and submit my dissertation thesis.

4 Achieved results

This chapter summarizes our research that we consider finished. That is, the respective publications are either published [34, 21], or are just entering the submission process [38]. In Section 4.1, we sum up our research on bias in RSA keys, published at ESORICS 2020 [21]. Next, our findings about the landscape of cryptographic primitives in Android malware are reviewed in Section 4.2. The respective work was finished as of summer 2020 and is currently entering the peer-review process. Finally, in Section 4.3, we discuss kleptographically subverted primitives in the TLS protocol, as published in [34]. The rest of this chapter is adopted directly from the respective publications [21, 38, 34] that are listed at the end of the chapter, together with a description of my personal contribution.

4.1 Biased RSA private keys: Origin attribution of GCD-factorable keys

In our publication [21] we provide what we believe is the first broad examination of the properties of RSA keys with the goal of attributing a private key to its origin library. The attribution applies in multiple scenarios, e.g., to the analysis of GCD-factorable keys in the TLS domain. We investigated the properties of keys generated by 70 cryptographic libraries, identified biased features in the primes produced, and compared three models based on Bayes classifiers for the private key attribution.

The information available in private keys significantly increases the classification performance compared to the results achieved on public keys [14]. Our work makes it possible to distinguish 26 groups of sources (compared to 13 with public keys only) while achieving accuracy more than twice that of random guessing. When 100 keys are available for the classification, the correct result is almost always provided (> 99%) for 19 out of 26 groups.

Finally, we designed a method usable also for a batch of keys (from the same source) where all keys share a single prime. Such primes are


found in GCD-factorable TLS keys, where one prime was generated with insufficient randomness and would introduce a high classification error in the unmodified method. As a result, we can identify libraries responsible for the production of these GCD-factorable keys, showing that only three groups are a relevant source of such keys. The accurate classification can be easily incorporated in forensic and audit tools. In total, our model can recognize keys coming from 9 distinct groups, as can be seen in Table 4.1.

Group name    Sources of keys
Group 1       OpenSSL
Group 2       OpenSSL (8-bit fingerprint)
Group 3       Sage Blum, Sage Provable
Group 4       Mocana
Group 5|13    mbedTLS, Nettle 2.0, Sage Default
Group 6       Bouncy Castle 1.53, SunRsaSign
Group 7|11    Bouncy Castle 1.54, Mocana, Thales
Group 8|9|10  Libgcrypt, Libgcrypt FIPS, OpenSSL FIPS, WolfSSL, SafeNet, cryptlib, Botan, LibTomCrypt, Nettle 3.2, Nettle 3.3
Group 12      Utimaco HSM

Table 4.1: Performance comparison of different models on the dataset with all libraries. Note that the precision of a random guess classifier is 3.8% when considering 26 groups.
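When a whole batch of keys is known to come from a single source, the per-key evidence can simply be combined before picking the most likely group. A minimal sketch of this batch-evaluation idea (the group names and probability values are made up for illustration; the published model additionally handles the shared-prime case):

```python
import math

def classify_batch(per_key_likelihoods, priors):
    """Attribute a batch of keys from one unknown source by summing
    per-key log-likelihoods and adding the log-prior of each group.

    per_key_likelihoods: list of {group: P(key | group)} dicts, one per key.
    priors: {group: prior probability}.
    """
    scores = {}
    for group, prior in priors.items():
        scores[group] = math.log(prior) + sum(
            math.log(lik[group]) for lik in per_key_likelihoods)
    return max(scores, key=scores.get)
```

Summing log-likelihoods is what makes batch classification markedly more accurate than single-key classification: weak per-key evidence accumulates across the batch.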

We applied the fastgcd tool [74] on the Rapid7 dataset [72] to obtain private keys with more than 82 thousand primes, divided into 2511 batches. While each batch has at least 10 keys, the median of the batch size is 15. Among the batches, 88.8% exhibit the OpenSSL fingerprint. This number well confirms the previous finding by [32] that also captured the OpenSSL-specific fingerprint in a similar fraction of keys. Furthermore, we attribute 3 batches as coming from OpenSSL (8-bit fingerprint), an OpenSSL library compiled to test and avoid divisors of p − 1 only up to 251. Importantly, slightly more than 11% of the batches were generated by some library from groups 8, 9, or 10, which are not mutually distinguishable when only a single prime is available. There are also negative results to report. With the accuracy over 80%


(for a batch size of 15) and no batches attributed to any of groups 3, 4, 6, 12, 5|13, or 7|11, it is very improbable that any GCD-factorable keys originate from the respective sources in these libraries.
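For context, the GCD-factorable keys themselves are obtained by computing GCDs across the collected moduli: two keys sharing a prime are factored instantly. The quadratic sketch below only illustrates the principle; the fastgcd tool [74] scales this to millions of keys using a quasilinear product-tree batch-GCD algorithm:

```python
from math import gcd

def shared_prime_factors(moduli):
    """Naive pairwise GCD over RSA moduli. Any two moduli that share a
    prime factor reveal it as their GCD, which immediately factors both.
    Returns {modulus: (p, q)} for every modulus factored this way."""
    factored = {}
    for i, n in enumerate(moduli):
        for m in moduli[i + 1:]:
            g = gcd(n, m)
            if 1 < g < n:  # shared prime found
                factored[n] = (g, n // g)
                factored[m] = (g, m // g)
    return factored
```

Once a modulus is factored this way, its primes can be fed to the classifier described above to attribute the weak key to its origin library.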

While the bias in the keys usually does not help with factorization, the cryptographic libraries should approach their key generation design with great care, as strong bias can lead to weak keys [11]. We recommend following a key generation process with as little bias as possible.

4.2 Cryptographic API in Android malware

In our study, we provide a large-scale analysis of how cryptographic API was typically employed in Android malware in 2012–2018. We analyzed more than 250 000 applications and extracted nearly 1 million call sites that we then extensively analyzed. Each of the applications was collected in the form of an Android application package (APK) that encompasses the complete artifacts of an Android application. We analyzed the APKs by means of static analysis. The APKs are processed in parallel by our system, and each APK traverses the following pipeline (depicted in Figure 4.1):

1. Pre-processor. This module decompiles the APKs to obtain information embedded in their codebase. Then, the third-party packages of the APKs are identified, and the whole Java source code of the APK is extracted.

2. Crypto-extractor. This module extracts and analyzes the cryp- tographic function call sites. Accordingly, it also detects both Java and native third-party cryptographic libraries.

3. Evaluator. This module stores and organizes the information retrieved from the analyzed APK into a JSON record.

The whole dataset of APKs accordingly produces a JSON report containing information about cryptographic API usage. Finally, we automatically processed the report. The attained results showed the following interesting trends:


[Figure 4.1 pipeline: INPUT (configuration file and APKs) → PRE-PROCESSOR (download from Androzoo or load APK; decompile; detect 3rd-party packages) → CRYPTO-EXTRACTOR (collect 3rd-party cryptographic libs; collect Android cryptographic API) → EVALUATOR (generate cryptography usage report) → OUTPUT (cryptography usage report)]

Figure 4.1: The system architecture diagram. The samples can be loaded from disk or downloaded from the Androzoo dataset [75]. They are processed in parallel and each APK traverses the depicted pipeline separately, contributing to the final report.

1. Very limited use of third-party cryptographic libraries. Our analysis showed that malware authors favor the use of system-based libraries to perform cryptographic operations.

2. Use of weak hash functions. The majority of malicious applications featuring crypto routines resorted to using the weak MD5 hash function. According to the results obtained for other hash functions, we speculate that such wide adoption is meant for operations that do not require strong integrity protection.

3. Progressive growth of public-key cryptography in malware. The attained results exhibit a constant prevalence of symmetric cryptography in malware between 2012 and 2018. In comparison, public-key cryptography, represented by the RSA scheme, shows a rapid rise in popularity in 2015, possibly caused by the rise of ransomware in Android.

4. Progressive decrease in the use of cryptography. Interestingly, the relative number of malicious applications that employ cryptographic API is decreasing over time. We speculate that this may be related to a switch of the adversary's goals from information theft to, for example, directly locking the device.

5. Contrast between malicious and benign usage of cryptography. Our study shows that, in general, cryptographic API is much more employed in malware than in benign samples (considering the proportions of released APKs).

6. Late migration from DES to AES. It is worth pointing out one aspect of the comparison with benign samples especially. In the category of symmetric encryption, malware authors exhibited a late migration from the insecure DES to the modern AES. While AES was the most popular cipher in benign samples already in 2012, it was only in 2015 that AES overtook DES in the malicious dataset.
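Trends such as these are computed by aggregating the per-APK JSON records produced by the evaluator. A toy aggregation over hash-function call sites might look like the sketch below; the record field names (`year`, `hash_calls`) are assumptions for illustration, not the tool's actual schema:

```python
import json
from collections import Counter, defaultdict

def hash_usage_per_year(report_lines):
    """Aggregate per-APK JSON records (one JSON object per line) into
    yearly counts of hash-algorithm call sites."""
    per_year = defaultdict(Counter)
    for line in report_lines:
        record = json.loads(line)
        for algo in record["hash_calls"]:
            per_year[record["year"]][algo] += 1
    return per_year
```

Plotting such per-year counters is how temporal observations like the MD5 prevalence or the DES-to-AES migration become visible.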

4.3 Kleptography in real-world TLS

Albeit distant to our thesis aims, yet not unrelated, is our study of kleptography in the TLS protocol. Kleptography, as defined by Young and Yung in 1996, is the art of stealing information securely and subliminally from cryptographic devices. The cornerstone of kleptography is the so-called asymmetric backdoor. Such a backdoor gets implanted into the desired protocol and weakens the protocol exclusively for the attacker. For other parties, the protocol remains as secure as without the backdoor. Additionally, it should not be possible to discover the backdoor without inspecting the internals of the infected device, e.g., by looking only at the outputs of the protocol. This property, however, cannot always be fulfilled. In our paper [34], we answered the question of whether such a backdoor could be practically implemented in a TLS library. As shown by multiple studies [76, 77, 78, 79], even RSA keys may be backdoored similarly. This fact underlines the need for a study of bias in

25 4. Achieved results

the RSA keys, as well as of the feasibility of implementing such backdoors into popular libraries.

Our efforts resulted in a design of an asymmetric backdoor for all versions of the TLS protocol. Such a backdoor can be used to exfiltrate session keys from a captured handshake by a passive eavesdropper, leading to a loss of confidentiality and authenticity of the whole session. We also demonstrated that it is fairly simple to implement the backdoor into an open-source TLS library while maintaining a reasonable performance of the library. Furthermore, our efforts show that, under the assumption that AES is a pseudorandom function, the backdoored cryptographic primitives cannot be distinguished from clean primitives in polynomial time. Consequently, it is unreasonable to attempt to detect such a backdoor based on biased primitives, as may be possible for RSA keys.

We stress that to install our backdoor, the adversary must have access to the target device. In such cases, other dangerous scenarios arise; we mention ransomware as an example. However, the important property of our backdoor is that it may stay unnoticed for a long time on the target device. Also, it may be embedded into particular hardware by its manufacturer or by organizations with sufficient resources. We also showed that timing analysis might prove an effective defense, depending on the powers of the defender.
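For intuition only, the sketch below shows a symmetric-key toy variant of subliminal exfiltration through a "random" protocol field. It is not the construction from our paper, which uses an asymmetric backdoor (so that extracting one infected device does not endanger others) embedded in TLS primitives; here a keyed PRF pad merely illustrates why the field looks uniformly random to everyone except the attacker:

```python
import hashlib
import hmac
import secrets

# Symmetric key embedded in the (hypothetical) infected device. A real
# kleptographic backdoor would embed the attacker's PUBLIC key instead.
ATTACKER_KEY = secrets.token_bytes(32)

def backdoored_nonce(session_secret, counter):
    """Emit a 32-byte 'nonce' that hides session_secret. Without
    ATTACKER_KEY the output is indistinguishable from random bytes
    (assuming HMAC-SHA256 is a secure PRF)."""
    pad = hmac.new(ATTACKER_KEY, counter.to_bytes(8, "big"), hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(pad, session_secret))

def attacker_recover(nonce, counter):
    """The attacker regenerates the pad and unmasks the session secret."""
    pad = hmac.new(ATTACKER_KEY, counter.to_bytes(8, "big"), hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(pad, nonce))
```

Note the weakness of this toy variant compared to the asymmetric design: anyone who extracts ATTACKER_KEY from one device can decrypt the exfiltrated secrets of all devices.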

4.4 List of publications

I am the main author of the following publications:

• A. Janovsky, M. Nemec, P. Svenda, P. Sekan, and V. Matyas, “Biased RSA Private Keys: Origin Attribution of GCD-Factorable Keys,” in European Symposium on Research in Computer Security (ESORICS). Springer, 2020. My contribution forms approximately 45% of the work. I designed, implemented, and evaluated the classification models and classified the dataset of GCD-factorized keys.

• A. Janovsky, J. Krhovjak, and V. Matyas, “Bringing Kleptography to Real-World TLS,” in IFIP International Conference on Information Security Theory and Practice. Springer, 2018, pp. 15–27. This paper is almost solely my own work. My contribution forms approximately 90% of the work, and I received help especially during the editing phase.

In addition, we are about to submit (September 2020) the following paper:

• A. Janovsky, D. Maiorca, G. Giacinto, and V. Matyas, “A Large-Scale Exploratory Study of Cryptographic API in Android Malware,” conference yet to be decided. My contribution forms approximately 60% of the work. I designed and implemented the scripts for the collection of cryptographic API from malware binaries. Furthermore, I helped to evaluate the results and shaped a large portion of the text.

If the reviewers happen to need this paper to complete a review of this proposal (or are just interested in reading it), please contact the author, who will promptly provide the text.

Bibliography

[1] National Institute of Standards and Technology (NIST), “Announcing the ADVANCED ENCRYPTION STANDARD (AES),” 2001, [cit. 2020-09-03]. Available from https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf.

[2] National Institute of Standards and Technology (NIST), “SHA-3 standard: Permutation-based hash and extendable-output functions,” 2015, [cit. 2020-09-03]. Available from https://doi.org/10.6028/NIST.FIPS.202.

[3] D. J. Bernstein, “Curve25519: New Diffie-Hellman speed records,” 2005, [cit. 2020-09-03]. Available from https://cr.yp.to/ecdh/curve25519-20060209.pdf.

[4] R. Anderson, “Why cryptosystems fail,” in Proceedings of the 1st ACM Conference on Computer and Communications Security, 1993, pp. 215–227.

[5] D. Lazar, H. Chen, X. Wang, and N. Zeldovich, “Why does cryptographic software fail? A case study and open problems,” in Proceedings of the 5th Asia-Pacific Workshop on Systems, 2014, pp. 1–7.

[6] S. Nadi, S. Krüger, M. Mezini, and E. Bodden, “Jumping through hoops: Why do Java developers struggle with cryptography APIs?” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 935–946.

[7] Y. Acar, M. Backes, S. Fahl, D. Kim, M. L. Mazurek, and C. Stransky, “You get where you’re looking for: The impact of information sources on code security,” in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 289–305.

[8] E. Biham and O. Dunkelman, “Cryptanalysis of the A5/1 GSM stream cipher,” in International Conference on Cryptology in India. Springer, 2000, pp. 43–51.


[9] A. Biryukov, A. Shamir, and D. Wagner, “Real time cryptanalysis of A5/1 on a PC,” in International Workshop on Fast Software Encryption. Springer, 2000, pp. 1–18.

[10] S. Fluhrer, I. Mantin, and A. Shamir, “Weaknesses in the key scheduling algorithm of RC4,” in International Workshop on Selected Areas in Cryptography. Springer, 2001, pp. 1–24.

[11] M. Nemec, M. Sys, P. Svenda, D. Klinec, and V. Matyas, “The return of Coppersmith’s attack: Practical factorization of widely used RSA moduli,” in 24th ACM Conference on Computer and Communications Security (CCS’2017). ACM, 2017, pp. 1631–1648.

[12] D. Adrian, K. Bhargavan, Z. Durumeric, P. Gaudry, M. Green, J. A. Halderman, N. Heninger, D. Springall, E. Thomé, L. Valenta et al., “Imperfect forward secrecy: How Diffie-Hellman fails in practice,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 5–17.

[13] N. J. Al Fardan and K. G. Paterson, “Lucky thirteen: Breaking the TLS and DTLS record protocols,” in 2013 IEEE Symposium on Security and Privacy. IEEE, 2013, pp. 526–540.

[14] P. Svenda, M. Nemec, P. Sekan, R. Kvasnovsky, D. Formanek, D. Komarek, and V. Matyas, “The million-key question — Investigating the origins of RSA public keys,” in Proceedings of the USENIX Security Symposium, 2016, pp. 893–910.

[15] M. Nemec, D. Klinec, P. Svenda, P. Sekan, and V. Matyas, “Measuring popularity of cryptographic libraries in Internet-wide scans,” in Proceedings of the 33rd Annual Computer Security Applications Conference. ACM, 2017, pp. 162–175.

[16] Statista, “Android - statistics & facts,” 2020, [cit. 2020-07-13]. Available from https://www.statista.com/topics/876/android/.

[17] McAfee Labs, “McAfee labs threats report, august 2019,” 2019, [cit. 2020-04-25]. Available from https://www.mcafee.com/enterprise/en-us/threat-center/mcafee-labs/reports.html.

[18] M. Egele, D. Brumley, Y. Fratantonio, and C. Kruegel, “An empirical study of cryptographic misuse in Android applications,” in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security - CCS ’13. NY, United States: ACM Press, 2013, pp. 73–84.

[19] M. Backes, S. Bugiel, and E. Derr, “Reliable third-party library detection in Android and its security applications,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. NY, United States: ACM Press, 2016, pp. 356–367.

“Common Criteria for Information Technology Security Evaluation,” 2020, version 3.1, revision 5. [cit. 2020-09-03]. Available from https://www.commoncriteriaportal.org/cc/.

[21] A. Janovsky, M. Nemec, P. Svenda, P. Sekan, and V. Matyas, “Biased RSA private keys: Origin attribution of GCD-factorable keys,” in European Symposium on Research in Computer Security (ESORICS). Springer, 2020.

[22] Z. Durumeric, Z. Ma, D. Springall, R. Barnes, N. Sullivan, E. Bursztein, M. Bailey, J. A. Halderman, and V. Paxson, “The security impact of HTTPS interception,” in Network and Distributed Systems Symposium. The Internet Society, 2017.

[23] Z. Durumeric, J. Kasten, M. Bailey, and J. A. Halderman, “Analysis of the HTTPS certificate ecosystem,” in Proceedings of the 2013 ACM Internet Measurement Conference. ACM, 2013, pp. 291–304.

[24] Electronic Frontier Foundation, “The EFF SSL Observatory,” 2010, [cit. 2020-07-13]. Available from https://www.eff.org/observatory.

[25] M. Barbulescu, A. Stratulat, V. Traista-Popescu, and E. Simion, “RSA weak public keys available on the Internet,” in International Conference for Information Technology and Communications. Springer-Verlag, 2016, pp. 92–102.

[26] M. R. Albrecht, J. P. Degabriele, T. B. Hansen, and K. G. Paterson, “A surfeit of SSH cipher suites,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1480–1491.


[27] J. Gustafsson, G. Overier, M. Arlitt, and N. Carlsson, “A first look at the CT landscape: Certificate Transparency logs in practice,” in Proceedings of the 18th Passive and Active Measurement Conference. Springer-Verlag, 2017, pp. 87–99.

[28] F. Cangialosi, T. Chung, D. Choffnes, D. Levin, B. M. Maggs, A. Mislove, and C. Wilson, “Measurement and analysis of private key sharing in the HTTPS ecosystem,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 628–640.

[29] B. VanderSloot, J. Amann, M. Bernhard, Z. Durumeric, M. Bailey, and J. A. Halderman, “Towards a complete view of the certificate ecosystem,” in Proceedings of the 2016 ACM on Internet Measurement Conference. ACM, 2016, pp. 543–549.

[30] A. K. Lenstra, J. P. Hughes, M. Augier, J. W. Bos, T. Kleinjung, and C. Wachter, “Ron Was Wrong, Whit Is Right,” Cryptology ePrint Archive, Report 2012/064, 2012, [cit. 2020-07-13]. Available from https://eprint.iacr.org/2012/064.

[31] N. Heninger, Z. Durumeric, E. Wustrow, and J. A. Halderman, “Mining your Ps and Qs: Detection of widespread weak keys in network devices,” in Proceedings of the USENIX Security Symposium. USENIX, 2012, pp. 205–220.

[32] M. Hastings, J. Fried, and N. Heninger, “Weak keys remain widespread in network devices,” in Proceedings of the 2016 ACM on Internet Measurement Conference. ACM, 2016, pp. 49–63.

[33] I. Mironov, “Factoring RSA Moduli II.” 2012, [cit. 2020-07-13]. Available from https://windowsontheory.org/2012/05/17/factoring-rsa-moduli-part-ii/.

[34] A. Janovsky, J. Krhovjak, and V. Matyas, “Bringing kleptography to real-world TLS,” in WISTP International Conference on Information Security Theory and Practice. Springer, 2018, pp. 15–27.

[35] D. Coppersmith, “Finding a small root of a bivariate integer equation; factoring with high bits known,” in International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 1996, pp. 178–189.

[36] “CVE-2017-15361,” NVD NIST, CVE-ID CVE-2017-15361, 2017, [cit. 2020-08-30]. Available from https://nvd.nist.gov/vuln/detail/CVE-2017-15361.

[37] A. Parsovs, “Estonian electronic identity card: Security flaws in key management,” in 29th USENIX Security Symposium. USENIX Association, 2020.

[38] A. Janovsky, D. Maiorca, G. Giacinto, and V. Matyas, “A large-scale exploratory study of cryptographic API in Android malware,” to be submitted at a security conference.

[39] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, “Drebin: Efficient and explainable detection of Android malware in your pocket,” in Proceedings of the 2014 Network and Distributed System Security Symposium. San Diego, CA: The Internet Society, 2014, pp. 23–26.

[40] N. Andronio, S. Zanero, and F. Maggi, “HelDroid: Dissecting and detecting mobile ransomware,” in Recent Advances in Intrusion Detection (RAID). Cham, Switzerland: Springer, 2015, pp. 382–404.

[41] D. Maiorca, D. Ariu, I. Corona, M. Aresu, and G. Giacinto, “Stealth attacks: An extended insight into the obfuscation effects on Android malware,” Computers & Security, vol. 51, no. C, pp. 16–31, Jun. 2015.

[42] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, “CopperDroid: Automatic reconstruction of Android malware behaviors,” in Proc. 22nd Annual Network & Distributed System Security Symposium (NDSS). San Diego, United States: The Internet Society, 2015, pp. 1–15.

[43] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu, “StormDroid: A streaminglized machine learning-based system for detecting Android malware,” in Proceedings of the 11th ACM on Asia Conference on

32 BIBLIOGRAPHY

Computer and Communications Security, ser. ASIA CCS ’16. New York, NY, USA: ACM, 2016, pp. 377–388.

[44] A. Demontis, M. Melis, B. Biggio, D. Maiorca, D. Arp, K. Rieck, I. Corona, G. Giacinto, and F. Roli, “Yes, machine learning can be more secure! A case study on Android malware detection,” IEEE Transactions on Dependable and Secure Computing, vol. 16, no. 4, pp. 711–724, 2017.

[45] J. Garcia, M. Hammad, and S. Malek, “Lightweight, obfuscation-resilient detection and family identification of Android malware,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 26, no. 3, pp. 11:1–11:29, Jan. 2018.

[46] M. Melis, D. Maiorca, B. Biggio, G. Giacinto, and F. Roli, “Explaining black-box Android malware detection,” in 26th European Signal Processing Conference, EUSIPCO 2018. Rome, Italy: IEEE, 2018, pp. 524–528.

[47] A. Chatzikonstantinou, C. Ntantogian, G. Karopoulos, and C. Xenakis, “Evaluation of cryptography usage in Android applications,” in Proceedings of the 9th EAI International Conference on Bio-Inspired Information and Communications Technologies (Formerly BIONETICS), S. Junichi, N. Tadashi, and H. Henry, Eds. NY, United States: ACM Press, 2016, pp. 83–90.

[48] S. Shuai, D. Guowei, G. Tao, Y. Tianchang, and S. Chenjie, “Modelling analysis and auto-detection of cryptographic misuse in Android applications,” in 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing. Dalian: IEEE, 2014, pp. 75–80.

[49] R. Paletov, P. Tsankov, V. Raychev, and M. Vechev, “Inferring crypto API rules from code changes,” in Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2018. Philadelphia, PA, USA: ACM Press, 2018, pp. 450–464.

[50] J. Gao, P. Kong, L. Li, T. F. Bissyande, and J. Klein, “Negative results on mining crypto-API usage rules in Android apps,” in 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). QC, Canada: IEEE, 2019, pp. 388–398.

[51] OWASP, “Source code analysis tools,” 2020, [cit. 2019-09-06]. Available from https://owasp.org/www-community/Source_Code_Analysis_Tools.

[52] A. J. Menezes, J. Katz, P. C. Van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.

[53] B. Moeller, “Security of CBC ciphersuites in SSL/TLS: Problems and countermeasures,” 2001, [cit. 2020-09-02]. Available from https://www.openssl.org/~bodo/tls-cbc.txt.

[54] M. Bellare, T. Ristenpart, and S. Tessaro, “Multi-instance security and its application to password-based cryptography,” in Annual Cryptology Conference. Springer, 2012, pp. 312–329.

[55] I. Muslukhov, Y. Boshmaf, and K. Beznosov, “Source attribution of cryptographic API misuse in Android applications,” in Proceedings of the 2018 on Asia Conference on Computer and Communications Security - ASIACCS ’18. Incheon, Republic of Korea: ACM Press, 2018, pp. 133–146.

[56] S. Krüger, J. Späth, K. Ali, E. Bodden, and M. Mezini, “CrySL: An extensible approach to validating the correct usage of cryptographic APIs,” in 32nd European Conference on Object-Oriented Programming (ECOOP 2018), ser. Leibniz International Proceedings in Informatics (LIPIcs), T. Millstein, Ed., vol. 109. Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018, pp. 10:1–10:27.

[57] S. Fahl, M. Harbach, T. Muders, M. Smith, L. Baumgärtner, and B. Freisleben, “Why Eve and Mallory love Android: An analysis of Android SSL (in)security,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security - CCS ’12. NY, United States: ACM Press, 2012, pp. 50–61.

[58] B. Anderson, S. Paul, and D. McGrew, “Deciphering malware’s use of TLS (without decryption),” Journal of Computer Virology and Hacking Techniques, vol. 14, no. 3, pp. 195–211, 2018.


[59] H. Wang, Y. Guo, Z. Ma, and X. Chen, “WuKong: A scalable and accurate two-phase approach to Android app clone detection,” in Proceedings of the 2015 International Symposium on Software Testing and Analysis - ISSTA 2015. Baltimore, MD, USA: ACM Press, 2015, pp. 71–82.

[60] Z. Ma, H. Wang, Y. Guo, and X. Chen, “LibRadar: Fast and accurate detection of third-party libraries in Android apps,” in Proceedings of the 38th International Conference on Software Engineering Companion - ICSE ’16. Austin, Texas: ACM Press, 2016, pp. 653–656.

[61] S. Ma, D. Lo, T. Li, and R. H. Deng, “CDRep: Automatic repair of cryptographic misuses in Android applications,” in Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security - ASIA CCS ’16. Xi’an, China: ACM Press, 2016, pp. 711–722.

[62] X. Zhang, Y. Zhang, J. Li, Y. Hu, H. Li, and D. Gu, “Embroidery: Patching vulnerable binary code of fragmentized Android devices,” in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). Shanghai: IEEE, 2017, pp. 47–57.

[63] L. Singleton, R. Zhao, M. Song, and H. Siy, “FireBugs: Finding and repairing bugs with security patterns,” in 2019 IEEE/ACM 6th International Conference on Mobile Software Engineering and Systems (MOBILESoft). Montreal, QC, Canada: IEEE, 2019, pp. 30–34.

[64] “Common Criteria: Certified products list - statistics,” 2020, [cit. 2020-09-03]. Available from https://www.commoncriteriaportal.org/products/stats/.

[65] F. Keblawi and D. Sullivan, “Applying the Common Criteria in systems engineering,” IEEE Security & Privacy, vol. 4, no. 2, pp. 50–55, 2006.

[66] M. S. Merkow and J. Breithaupt, Computer Security Assurance Using the Common Criteria. Cengage Learning, 2004.

[67] G. Bossert and F. Guihery, “Security evaluation of communication protocols in Common Criteria,” in Proceedings of the IEEE International Conference on Communications, Ottawa, ON, Canada, 2012, pp. 10–15.

[68] E. Venson, X. Guo, Z. Yan, and B. Boehm, “Costing secure software development: A systematic mapping study,” in Proceedings of the 14th International Conference on Availability, Reliability and Security, 2019, pp. 1–11.

[69] J. Jancar, V. Sedlacek, P. Svenda, and M. Sys, “Minerva: The curse of ECDSA nonces; systematic analysis of lattice attacks on noisy leakage of bit-length of ECDSA nonces,” in Conference on Cryptographic Hardware and Embedded Systems (CHES) 2020. Ruhr-University Bochum, Transactions on Cryptographic Hardware and Embedded Systems, 2020.

[70] A. Vassilev, “Bowtie - a deep learning feedforward neural network for sentiment analysis,” in International Conference on Machine Learning, Optimization, and Data Science. Springer, 2019, pp. 360–371.

[71] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. G. Shin, and K. Aberer, “Polisis: Automated analysis and presentation of privacy policies using deep learning,” in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 531–548.

[72] Rapid7, “Rapid 7 Sonar SSL Full IPv4 Scan,” 2019, [cit. 2020-07-13]. Available from https://opendata.rapid7.com/sonar.ssl/.

[73] National Institute of Standards and Technology (NIST), “Automated cryptographic validation testing (ACVT),” 2016, [cit. 2020-09-03]. Available from https://csrc.nist.gov/Projects/Automated-Cryptographic-Validation-Testing.

[74] N. Heninger and J. A. Halderman, “Fastgcd,” 2015, [cit. 2020-07-13]. Available from https://github.com/sagi/fastgcd.

[75] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon, “AndroZoo: Collecting millions of Android apps for the research community,” in Proceedings of the 13th International Workshop on Mining Software Repositories - MSR ’16. Texas, USA: ACM Press, 2016, pp. 468–471.


[76] A. Young and M. Yung, “An elliptic curve asymmetric backdoor in OpenSSL RSA key generation,” 2005.

[77] A. Young and M. Yung, “A space efficient backdoor in RSA and its applications,” in International Workshop on Selected Areas in Cryptography. Springer, 2005, pp. 128–143.

[78] A. Young and M. Yung, “A timing-resistant elliptic curve backdoor in RSA,” in International Conference on Information Security and Cryptology. Springer, 2007, pp. 427–441.

[79] A. V. Markelova, “Embedding asymmetric backdoors into the RSA key generator,” Journal of Computer Virology and Hacking Techniques, pp. 1–10, 2020.

Appendices

A Attached publications of the author

The full re-prints of the author’s publications are attached here. Specifically, the publication Biased RSA private keys: Origin attribution of GCD-factorable keys can be found in Appendix A.1, and the article Bringing kleptography to real-world TLS can be found in Appendix A.2.

A.1 Biased RSA private keys: Origin attribution of GCD-factorable keys

The paper starts at the next page.

Biased RSA private keys: Origin attribution of GCD-factorable keys

Adam Janovsky1,2, Matus Nemec3, Petr Svenda1, Peter Sekan1, and Vashek Matyas1

1 Masaryk University, Czech Republic [email protected]
2 Invasys, Czech Republic
3 Linköping University, Sweden

Abstract. In 2016, Švenda et al. (USENIX 2016, The Million-key Question) reported that the implementation choices in cryptographic libraries allow for qualified guessing about the origin of public RSA keys. We extend the technique to two new scenarios when not only public but also private keys are available for the origin attribution – analysis of a source of GCD-factorable keys in IPv4-wide TLS scans and forensic investigation of an unknown source. We learn several representatives of the bias from the private keys to train a model on more than 150 million keys collected from 70 cryptographic libraries, hardware security modules and cryptographic smartcards. Our model not only doubles the number of distinguishable groups of libraries (compared to public keys from Švenda et al.) but also improves more than twice in accuracy w.r.t. random guessing when a single key is classified. For a forensic scenario where at least 10 keys from the same source are available, the correct origin library is identified with an average accuracy of 89%, compared to the 4% accuracy of a random guess. The technique was also used to identify libraries producing GCD-factorable TLS keys, showing that only three groups are the probable suspects.

Keywords: Cryptographic library · RSA factorization · Measurement · RSA key classification · Statistical model.

1 Introduction

The ability to attribute a cryptographic key to the library it was generated with is a valuable asset providing direct insight into cryptographic practices. The slight bias found specifically in the primes of RSA private keys generated by the OpenSSL library [14] allowed to track down the devices responsible for keys found in TLS IPv4-wide scans that were in fact factorable by the distributed GCD algorithm. Further work [23] made the method generic and showed that many other libraries produce biased keys allowing for the origin attribution. As a result, both separate keys, as well as large datasets, could be analyzed for their origin libraries. The first-ever explicit measurement of cryptographic library popularity was introduced in [18], showing the increasing dominance of the OpenSSL library on the market. Furthermore, very uncommon characteristics of the library used by Infineon smartcards allowed for their entirely accurate classification. Importantly, this led to a discovery that the library is, in fact, producing practically factorable keys [19]. Consequently, more than 20 million eID certificates with vulnerable keys were revoked in Europe alone. The same method allowed to identify keys originating from unexpected sources in Estonian eIDs. Eventually, the unexpected keys were shown to be injected from outside instead of being generated on-chip as mandated by the institutional policy [20].

While properties of RSA primes were analyzed to understand the bias detected in public keys, no previous work addressed the origin attribution problem with the knowledge of private keys. The reason may sound understandable – while the public keys are readily available in most usage domains, the private keys shall be kept secret, therefore unavailable for such scrutiny. Yet there are at least two important scenarios for their analysis: 1) tracking sources of GCD-factorable keys from large TLS scans and 2) a forensic identification of black-box devices with the capability to export private keys (e.g., an unknown smartcard, a remote key generation service, or an in-house investigation of cryptographic services). The mentioned case of unexpected keys in Estonian eIDs [20] is a practical example of a forensic scenario, but with the use of public keys only.

⋆ Full details, datasets and paper supplementary material can be found at https://crocs.fi.muni.cz/papers/privrsa_esorics20
The analysis based on private keys can spot even a smaller deviance from the expected origin as the bias is observed closer to the place of its inception. This work aims to fill this gap in knowledge by a careful examination of both scenarios. We first provide a solid coverage of RSA key sources used in the wild by expanding upon the dataset first released in [23]. During our work, we more than doubled the number of keys in the dataset, gathered from over 70 distinct cryptographic software libraries, smartcards, and hardware security modules (HSMs). Benefiting from 158.8 million keys, we study the bias affecting the primes p and q. We transform known biased features of public keys to their private key analogues and evaluate how they cluster sources of RSA keys into groups. We use the features in multiple variants of Bayes classifier that are trained on 157 million keys. Subsequently, we evaluate the performance of our classifiers on further 1.8 million keys isolated from the whole dataset. By doing so, we establish the reliability results for the forensic case of use, when keys from a black-box system are under scrutiny. On average, when looking at just a single key, our best model is able to correctly classify 47% of cases when all libraries are considered and 64.6% of keys when the specific sub-domain of smartcards is considered. These results allow for much more precise classification compared to the scenario when only public keys are available. Finally, we use the best-performing classification method to analyze the dataset of GCD-factorable RSA keys from the IPv4-wide TLS scan collected by Rapid7 [21].
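GCD-factorable keys arise when two TLS moduli share a prime factor, so their greatest common divisor reveals that prime. A minimal stand-alone sketch of the idea (naive pairwise GCDs over toy primes; real internet-wide analyses use a Bernstein-style product-tree batch-GCD to scale to millions of moduli):

```python
from math import gcd

def pairwise_factor(moduli):
    """Return {modulus: shared prime} for moduli sharing a factor.

    Naive O(n^2) illustration of GCD-factorable keys; production
    analyses use a product-tree batch-GCD instead.
    """
    factored = {}
    for i, n in enumerate(moduli):
        for m in moduli[i + 1:]:
            g = gcd(n, m)
            if 1 < g < n:        # shared prime => both moduli factor
                factored[n] = g
                factored[m] = g
    return factored

# Toy primes standing in for 1024-bit ones; 101 is reused by two keys.
keys = [101 * 103, 101 * 107, 109 * 113]
print(pairwise_factor(keys))     # → {10403: 101, 10807: 101}
```

Once the shared prime g is known, the cofactor n // g completes the factorization, which is what makes the private primes of such keys available for classification.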

The main contributions of this paper are:
– A systematic mapping of biased features of RSA keys evaluated on a more exhaustive set of cryptographic libraries, described in Section 2. The dataset (made publicly available for other researchers) leads to 26 total groups of libraries distinguishable based on the features extracted from the value of RSA private key(s).
– Detailed evaluation of the dataset on Bayes classifiers in Section 3 with an average accuracy above 47% where only a single key is available, and almost 90% when ten keys are available.
– An analysis of the narrow domain of cryptographic smartcards and libraries used for TLS results in an even higher accuracy, as shown in Section 4.
– Practical analysis of real-world sources of GCD-factorable RSA keys from public TLS servers obtained from internet-wide scans in Section 5.

The paper roadmap has been partly outlined above; Section 7 then shows related work and Section 8 concludes our paper.

2 Bias in RSA keys

Various design and implementation decisions in the algorithms for generating RSA keys influence the distributions of produced RSA keys. A specific type of bias was used to identify OpenSSL as the origin of a group of private keys [17]. Systematic studies of a wide range of libraries [23,18] described more reasons for biases in RSA keys in a surprising number of libraries. In the majority of cases, the bias was not strong enough to help factor the keys more efficiently. Previous research [23] identified multiple sources of bias that our observations from a large dataset of private RSA keys confirm:

1. Performance optimizations, e.g., most significant bits of primes set to a fixed value to obtain RSA moduli of a defined length.
2. Type of primes: probable, strong, and provable primes:
   – For probable primes, whether candidate values for primes are chosen randomly or a single starting value is incremented until a prime is found.
   – When generating candidates for probable primes, small factors are avoided in the value of p − 1 by multiple implementations without explanation.
   – Blum integers are sometimes used for RSA moduli – both RSA primes are congruent to 3 modulo 4.
   – For strong primes, the size of the auxiliary prime factors of p − 1 and p + 1 is biased.
   – For provable primes, the recursive algorithm can create new primes of double to triple the binary length of a given prime; usually one version of the algorithm is chosen.
3. Ordering of primes: are the RSA primes in the private key ordered by size?
4. Proprietary algorithms, e.g., the well-documented case of the Infineon fast prime key generation algorithm [19].
5. Bias in the output of a PRNG: often observable only from a large number of keys from the same source.
6. Natural properties of primes that do not depend on the implementation.
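Several of the bias sources above can be made concrete in code. The sketch below (illustrative only, not any library’s actual implementation) generates a probable prime exhibiting three of the listed biases: top bits fixed for a defined modulus length, small odd factors of p − 1 avoided, and the Blum condition p ≡ 3 (mod 4):

```python
import random

def is_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test (stdlib-only sketch)."""
    if n < 2:
        return False
    for sp in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % sp == 0:
            return n == sp
    d, r = n - 1, 0
    while d % 2 == 0:
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def biased_prime(bits, blum=True, avoid_small=(3, 5)):
    """Prime with three biases from the list: two fixed top bits (#1),
    small odd factors of p-1 avoided (#2), and p = 3 mod 4 so the
    modulus becomes a Blum integer (#2)."""
    while True:
        p = random.getrandbits(bits) | (0b11 << (bits - 2)) | 1
        if blum and p % 4 != 3:
            continue
        if any((p - 1) % f == 0 for f in avoid_small):
            continue
        if is_prime(p):
            return p

print(biased_prime(48) % 4)   # → 3
```

Keys built from such primes carry exactly the fingerprints that the classification below exploits: their top bits, residues modulo 4, and divisors of p − 1 are all non-uniform.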

2.1 Dataset of RSA keys

We collected, analyzed, and published the largest dataset of RSA keys with a known origin, covering 70 sources (43 open-source libraries, 5 black-box libraries, 3 HSMs, 19 smartcards). We both expanded the datasets from previous work [23,18] and generated new keys from additional libraries for the sake of this study. We processed the keys into a unified format and made them publicly available. Where possible, we analyzed the source code of the cryptographic library to identify the basic properties of key generation according to the list above. We are primarily interested in 2048-bit keys, which is the most commonly used key length for RSA. As in previous studies [23,18], we also generate shorter keys (512 and 1024 bits) to speed up the process, while verifying that the chosen biased features are not influenced by the key length. This makes the keys of different sizes interchangeable for the sake of our study. We assume that repeatedly running the key generation locally approximates the distributed behaviour of many instances of the same library. This model is supported by the measurements taken in [18], where distributions of keys collected from the Internet exhibited the same biases as locally generated keys.

2.2 Choice of relevant biased features

We extended the features used in previous work on public keys to their equivalent properties of private keys:

Feature ‘5p and 5q’: Instead of the most significant bits of the modulus, we use the five most significant bits of the primes p and q. The modulus is defined by the primes, and the primes naturally provide more information. We chose 5 bits based on the observed bias of the high bits. Further bits are typically not biased, and reducing the size of this feature prevents an exponential growth of the feature space.

Feature ‘blum’: We replaced the feature of the second least significant bit of the modulus by the detection of Blum integers. Blum integers can be directly identified using the two prime factors. When only the modulus is available, we can rule out the usage of Blum integers, but not confirm it.

Feature ‘mod’: Previous work used the result of the modulus modulo 3. It was known that primes can be biased modulo small primes (due to avoiding small factors of p − 1 and q − 1). The authors only used the value 3, because it is possible to rule out that 3 is being avoided as a factor of p − 1 when the modulus equals 2 modulo 3 [23]. It is not possible to rule out higher factors from just a single modulus. With access to the primes, we can directly check for this bias for all factors. We detected four categories of such bias, each avoiding all small odd prime factors up to a threshold. We use these categories directly by looking at small odd divisors of p − 1 and q − 1 and note if none were detected: 1) up to 17863, 2) up to 251, 3) up to 5, 4) none – at least one value is divisible by 3.

Feature ‘roca’: We use a specific fingerprint of factorable Infineon keys published in [19].
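The features above can be computed mechanically from a private key. A sketch (the category boundaries 17863/251/5 follow the text; the vendor-specific ‘roca’ fingerprint test is omitted, and the function name is ours):

```python
def rsa_private_key_features(p, q, top_bits=5):
    """Extract the '5p', '5q', 'blum', and 'mod' features from the
    private primes p and q of an RSA key."""
    msb_p = p >> (p.bit_length() - top_bits)   # feature '5p'
    msb_q = q >> (q.bit_length() - top_bits)   # feature '5q'
    blum = (p % 4 == 3) and (q % 4 == 3)       # feature 'blum'

    def smallest_odd_divisor(x, limit=17863):
        # The first odd divisor of x-1 found is necessarily prime,
        # since any composite divisor has a smaller prime factor.
        for d in range(3, limit + 1, 2):
            if (x - 1) % d == 0:
                return d
        return None

    hits = [d for d in (smallest_odd_divisor(p), smallest_odd_divisor(q))
            if d is not None]
    m = min(hits) if hits else None
    if m is None:
        mod_cat = 1    # all odd prime factors up to 17863 avoided
    elif m > 251:
        mod_cat = 2    # factors up to 251 avoided
    elif m > 5:
        mod_cat = 3    # factors 3 and 5 avoided
    else:
        mod_cat = 4    # no avoidance detected
    return msb_p, msb_q, blum, mod_cat

# Tiny 11-bit primes for illustration (real keys use 512+ bit primes):
print(rsa_private_key_features(1567, 1667))   # → (24, 26, True, 4)
```

Each key thus maps to a small discrete feature vector, which is what the per-source distributions and the classifiers below operate on.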

[Figure 1: dendrogram for all groups (x-axis: Manhattan distance). Leaves: Group 1 (Infineon JTOP 80K); Group 2 (G&D SmartCafe 4.x, G&D SmartCafe 6.0); Group 3 (G&D SmartCafe 3.2); Group 4 (PGP SDK FIPS); Group 5 (Mocana); Group 6 (PGP SDK); Group 7 (Libgcrypt, Libgcrypt FIPS); Group 8 (PuTTY, mbedTLS); Group 9 (Bouncy Castle 1.53, SunRsaSign); Group 10 (Bouncy Castle 1.54, Mocana, Thales); Group 11 (Feitian JavaCOS A22, Feitian JavaCOS A40, Oberthur Cosmo 64, SafeNet, cryptlib); Group 12 (Taisys SIMoME VAULT); Group 13 (Gemalto GXP E64); Group 14 (Crypto++, Microsoft); Group 15 (Athena IDProtect, Botan, Gemalto GCX4 72K, LibTomCrypt, Nettle 3.2, Nettle 3.3, OpenSSL FIPS, WolfSSL); Group 16 (Utimaco); Group 17 (Cryptix, FlexiProvider, Nettle 2.0, Sage Default); Group 18 (Sage Blum, Sage Provable); Group 19 (GNU Crypto); Group 20 (Oberthur Cosmo Dual 72K); Group 21 (NXP J2A080, NXP J2A081, NXP J3A081, NXP JCOP 41 V2.2.1); Group 22 (NXP J2D081, NXP J2E145G (fingerprint 131)); Group 23 (NXP J2D081, NXP J2E145G (fingerprint 251)); Group 24 (OpenSSL); Group 25 (OpenSSL (8-bit fingerprint)); Group 26 (G&D StarSign).]

Fig. 1. How the keys from various libraries differ can be depicted by a dendrogram. It tells us, w.r.t. our feature set, how far from each other the probability distributions of the sources are. We can then hierarchically cluster the sources into groups that produce similar keys. The blue line at 0.085 highlights the threshold of differentiating between two sources/groups. This threshold yields 26 groups using our feature set.

2.3 Clustering of sources into groups

Since it is impossible to distinguish sources that produce identically distributed keys, we introduce a process of clustering to merge similar sources into groups. We cluster two sources together if they appear to be using identical algorithms based on the observation of the key distributions. We measure the difference in the distributions using the Manhattan distance.⁴ The absolute values of the distances depend on the actual distributions of the features. Large distances correlate with significant differences in the implementations. Note that very small observed distances may be only the result of noise in the distributions instead of a real difference, e.g., due to a smaller number of keys available.

We attempt to place the clustering threshold as low as possible, maximizing the number of meaningful groups. If we are not able to explain why two clusters are separated based on the study of the algorithms and distributions of the features, the threshold needs to be moved higher to join these clusters. We worked with distributions that assume all features correlated (as in [23]). The resulting classification groups and the dendrogram are shown in Figure 1. We placed the threshold value at 0.085. By moving it higher than 0.154, we would lose the ability to distinguish groups 11 and 12. It would be possible to further split group 14, as there is a slight difference in the prime selection intervals used by Crypto++ and Microsoft [23]. However, the difference manifests less than the level of noise in other sources, requiring the threshold to be put at 0.052, which would create several false groups. We use the same clustering throughout the paper, although the value of the threshold would change when the features change. Note that different versions of the same library may fall into different groups, mostly because of the algorithm changes between these versions. This, for instance, is the case of Bouncy Castle 1.53 and 1.54.

⁴ We experimented with Euclidean distance and fractional norms. While Euclidean distance is a proper metric, our experiments showed that it is more sensitive to the noise in the data, creating separable groups out of sources that share the same key generation algorithms. On the other hand, fractional norms did not highlight differences between sources that provably differ in the key generation process.
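The grouping step can be sketched as follows: compute Manhattan (L1) distances between per-source feature distributions and merge sources closer than the threshold. The distributions and library names below are hypothetical; the paper builds a full dendrogram over the complete feature set:

```python
from itertools import combinations

def l1(dist_a, dist_b):
    """Manhattan distance between two discrete feature distributions."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)

def cluster(sources, threshold=0.085):
    """Single-linkage agglomeration: merge any two groups containing
    sources whose distributions are closer than the threshold."""
    groups = [{name} for name in sources]
    merged = True
    while merged:
        merged = False
        for g1, g2 in combinations(groups, 2):
            if any(l1(sources[a], sources[b]) < threshold
                   for a in g1 for b in g2):
                groups.remove(g1)
                groups.remove(g2)
                groups.append(g1 | g2)
                merged = True
                break
    return sorted(sorted(g) for g in groups)

# Hypothetical distributions of one feature (the 'mod' category):
sources = {
    "lib_a": {1: 0.00, 4: 1.00},
    "lib_b": {1: 0.02, 4: 0.98},   # same algorithm, sampling noise
    "lib_c": {1: 0.97, 4: 0.03},   # avoids small factors of p-1
}
print(cluster(sources))   # → [['lib_a', 'lib_b'], ['lib_c']]
```

With the 0.085 threshold, the small L1 gap between lib_a and lib_b (0.04) is treated as noise and the two are merged, while lib_c stays separate.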

3 Model selection and evaluation

How accurately we can classify the keys depends on several factors, most notably on: the libraries included in the training set, the number of keys available for classification, the features extracted from the classified keys, and the classification model. In this section, we focus on the last factor.

3.1 Model selection

As generating the RSA keys is internally a stochastic process, we choose the family of probabilistic models to address the source attribution problem. Since there is no strong motivation for complex machine learning models, we utilize simple classifiers. More sophisticated classifiers could be built based on our findings when the goal is to reach higher accuracy or to more finely discriminate sources within a group. The rest of this subsection describes the chosen models.

Naïve Bayes classifier. The first investigated model is a naïve Bayes classifier, called naïve because it assumes that the underlying features are conditionally independent. Using this model, we apply the maximum-likelihood decision rule and predict the label as $\hat{y} = \arg\max_y P(X = x \mid y)$. Thanks to the naïve assumption, we may decompose this computation into $\hat{y} = \arg\max_y \prod_{i=1}^{n} P(x_i \mid y)$ for the feature vector $x = (x_1, \dots, x_n)$.

Bayes classifier. We continue to develop the approach originally used in [23] that used the Bayes classifier without the naïve assumption. Several reasons motivate this. First, it allows to evaluate how much the naïve Bayes model suffers from the violated independence assumption (on this specific dataset). Secondly, it enables us to access more precise probability estimates that are needed to classify real-world GCD-factorable keys. Additionally, we can directly compare the classification accuracy of private keys with the case of the public keys from [23]. However, one of the main drawbacks of the Bayes classifier is that it requires exponentially more data with the growing number of features. Therefore, when striving for high accuracy achievable by further feature engineering, one should consider the naïve Bayes instead.

Naïve Bayes classifier with cross-features.
The third investigated option is the naïve Bayes classifier, but with selected features that are known to be correlated merged into a single feature. In particular, we merged the features of the most significant bits (of p, q) into a single cross-feature. Subsequently, the naïve Bayes approach is used. This enables us to evaluate whether merging clearly interdependent features into one will affect the performance of the naïve Bayes classifier w.r.t. this specific dataset.
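The naïve Bayes decision rule reduces to an argmax over per-group log-likelihood sums once per-feature probability tables exist. A minimal sketch with hypothetical tables for two features (the ‘mod’ category and the ‘blum’ flag) and hypothetical group names; log-probabilities keep the products numerically stable:

```python
import math

def classify(key_features, model):
    """Naive Bayes: pick the group maximizing sum_i log P(x_i | group).

    `model` maps group -> list of per-feature probability tables.
    Unseen feature values get a tiny floor probability.
    """
    def loglik(group):
        tables = model[group]
        return sum(math.log(tables[i].get(x, 1e-9))
                   for i, x in enumerate(key_features))
    return max(model, key=loglik)

# Hypothetical tables; the paper estimates them from 157M training keys.
model = {
    "group_a": [{3: 0.9, 4: 0.1}, {True: 0.05, False: 0.95}],
    "group_b": [{3: 0.1, 4: 0.9}, {True: 0.99, False: 0.01}],
}
print(classify((3, False), model))   # → group_a
```

Dropping the independence assumption (the full Bayes classifier) means replacing the per-feature tables with one joint table over complete feature vectors, which is why its data requirements grow exponentially with the number of features.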

3.2 Model evaluation

Methodology of classification and metrics. Our training dataset contains 157 million keys and the test set contains 1.8 million keys. We derived the test set by discarding 10 thousand keys of each source from the complete dataset before clustering. This assures that each group has a test set with at least 10 thousand keys. Accordingly, since the groups differ in the number of sources involved, the resulting test dataset is imbalanced. For this reason, we employ the metrics of precision and recall when possible. However, we represent the model performance by the accuracy measure in the tables and in more complex classification scenarios. For group X, the precision can be understood as the fraction of correctly classified keys from group X divided by the number of keys that were marked as group X by our classifier. Similarly, the recall is the fraction of correctly classified keys from group X divided by the total number of keys from group X [11]. We also evaluate the performance of the models under the assumption that the user has a batch of several keys from the same source at hand. This scenario can arise, e.g., when a security audit is run in an organization and all keys are being tested. Furthermore, to react to some often misclassified groups, we additionally provide the answer “this key originates from group X or group Y” to the user (and we evaluate the confidence of these answers).
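The per-group metrics can be restated in code. A sketch of the definitions above, mapping an undefined precision (no keys predicted for the group) to 0, as the text does:

```python
def precision_recall(y_true, y_pred, group):
    """Per-group precision and recall over parallel label lists."""
    tp = sum(t == group and p == group for t, p in zip(y_true, y_pred))
    predicted = sum(p == group for p in y_pred)   # classifier said 'group'
    actual = sum(t == group for t in y_true)      # truly from 'group'
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = ["X", "X", "X", "Y", "Y", "Z"]
y_pred = ["X", "X", "Y", "Y", "Y", "X"]
print(precision_recall(y_true, y_pred, "X"))
```

For group X above, two of the three keys predicted as X and two of the three actual X keys are correct, so both metrics equal 2/3.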

Comparison of the models. The overall comparison of all three models can be seen in Table 1. If the precision for some group is undefined, i.e., no key is allegedly originating from this group, we say that the precision is 0. We evaluate the naïve Bayes classifier on the same features that were used for the Bayes classifier to measure how much classification performance is lost by introducing the feature independence assumption. A typical example of interdependent features is that the most significant bits of primes p and q are intentionally correlated to preserve the expected length of the resulting modulus n. Pleasantly, the observed precision (recall) decrease is only 2.3% (1.4%) when compared to the Bayes classifier. Accordingly, this suggests that a larger number of different features than usable with the Bayes classifier (due to exponential growth in complexity) can be considered when the naïve Bayes classifier is used. As a result, further improvement of the performance might be achieved, despite ignoring the dependencies among features. Overall, the Bayes classifier shows the best results. When a single key is classified, the average success rate for the 26 groups is captured by a precision of 43.2% and a recall of 47.6%. Still, there is a wide variance between the performance in specific groups. A detailed table of results together with a discussion is presented in Appendix A.

Model                        Avg. precision   Avg. recall
Bayes classifier             43.2%            47.6%
Naïve Bayes classifier       40.9%            46.2%
Cross-feature naïve Bayes    41.7%            47.6%

Table 1. Performance comparison of different models on the dataset with all libraries. Note that the precision of a random guess classifier is 3.8% when considering 26 groups.

4 Classification with prior information

Section 2 outlined the process of choosing a threshold value that determines the critical distance for distinguishing between distinct groups. Inevitably, the same threshold value directly influences the number of groups after the clustering task. As such, the threshold introduces a trade-off between the model performance and the number of discriminated groups. The smaller the difference between group distributions is, the more similar they are, and the model performance is lower as more misclassification errors occur. The objective of this section is to examine the classification scenario when some prior knowledge is available to the analyst, limiting the origin of keys to only a subset of all libraries or increasing the likelihood of some. Since Section 3 showed that the Bayes classifier provides the best performance, this section considers only this model. Prior knowledge can be introduced into the classification process in multiple ways, e.g., by using a prior probability vector that considers some groups more prevalent. We also note that the measurement method of [18] can be used to obtain such prior information, but a relatively large dataset (around 10^5 private keys) is required that may not be available. Our work, therefore, considers a different setting where some sources of the keys are ruled out before the classifier is constructed. Such a scenario arises, e.g., when the analyst knows that the scrutinized keys were generated in an unknown cryptographic smartcard. In such a case, HSMs and other sources of keys can be omitted from the model altogether, which will arguably increase the performance of the classification process. Another example is leaving out libraries that were released after the classified data sample was collected.
We present the classification performance results for three scenarios with a limited number of sources – 1) cryptographic smartcards (Section 4.1), 2) sources likely to be used in the TLS domain (Section 4.2), and 3) a specific case of GCD-factorable keys from the TLS domain, where only one of the two primes can be used for classification (see Section 4.3 for more details). The comparison of models for these scenarios can be seen in Table 2.

Dataset                    Avg. precision   Avg. recall   Random guess (baseline)
All libraries                       43.2%         47.6%                      3.8%
Smartcards domain                   61.9%         64.6%                      8.3%
TLS domain                          45.5%         42.2%                      7.7%
Single-prime TLS domain             28.8%         36.2%                     11.1%

Table 2. Bayes classifier performance on the analyzed partitionings of the dataset: the complete dataset with all libraries (All libraries), smartcards only (Smartcards domain), libraries and HSMs expected to be used for TLS (TLS domain), and the specific subset of the TLS domain where only a single prime is available due to the nature of the results obtained by the GCD factorization method (Single-prime TLS domain). A comparison with a random guess is provided as a baseline (for the random guess, accuracy equals precision and recall).

To compute these models, we first discard the sources that cannot be the origin of the examined keys according to the prior knowledge of the domain (e.g., smartcards are not expected in TLS). Next, we re-compute the clustering task to obtain fewer groups than on the dataset with all libraries. Finally, we compute the classification tables for the reduced domain and evaluate the performance.
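As an illustration of this reduced-domain workflow, the following sketch builds a discrete naïve Bayes table only over the sources permitted by prior knowledge. All source names, feature values, and counts are invented for the example; the paper's real features and datasets are far richer.

```python
from collections import Counter

# Hypothetical sketch of the reduced-domain workflow: source names, feature
# values, and counts below are invented; they are not the paper's data.
def build_classification_table(keys_by_source, allowed_sources):
    """Estimate P(feature vector | source) for the allowed sources only."""
    table = {}
    for source, feature_vectors in keys_by_source.items():
        if source not in allowed_sources:
            continue  # ruled out by prior knowledge of the domain
        counts = Counter(feature_vectors)
        total = sum(counts.values())
        table[source] = {fv: c / total for fv, c in counts.items()}
    return table

def classify(table, feature_vector, priors=None):
    """Return the source maximizing P(source) * P(feature vector | source)."""
    def score(source):
        prior = priors[source] if priors else 1.0 / len(table)
        return prior * table[source].get(feature_vector, 0.0)
    return max(table, key=score)

toy_data = {
    "OpenSSL": [("msb=10", "mod=avoided")] * 90 + [("msb=11", "mod=avoided")] * 10,
    "mbedTLS": [("msb=11", "mod=none")] * 80 + [("msb=10", "mod=none")] * 20,
    "CardX":   [("msb=10", "mod=none")] * 100,  # a smartcard, excluded for TLS
}
tls_table = build_classification_table(toy_data, allowed_sources={"OpenSSL", "mbedTLS"})
print(classify(tls_table, ("msb=10", "mod=avoided")))  # OpenSSL
```

Excluding a source from the table (rather than giving it a zero prior) also removes its groups from the re-clustering step, which is what allows the reduced domain to form fewer, better-separated groups.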

4.1 Performance in the smartcards domain

[Figure 2: dendrogram for the smartcard domain (Manhattan distance). The 12 groups are: Group 1 (Infineon JTOP 80K); Group 2 (G&D SmartCafe 4.x, G&D SmartCafe 6.0); Group 3 (G&D SmartCafe 3.2); Group 4 (NXP J2D081, NXP J2E145G (fingerprint 131)); Group 5 (NXP J2D081, NXP J2E145G (fingerprint 251)); Group 6 (Feitian JavaCOS A22, Feitian JavaCOS A40, Oberthur Cosmo 64); Group 7 (Taisys SIMoME VAULT); Group 8 (Gemalto GXP E64); Group 9 (Athena IDProtect, Gemalto GCX4 72K); Group 10 (Oberthur Cosmo Dual 72K); Group 11 (NXP J2A080, NXP J2A081, NXP J3A081, NXP JCOP 41 V2.2.1); Group 12 (G&D StarSign).]

Fig. 2. The clustering of smartcard sources yields 12 separate groups.

The clustering task in the smartcards domain yields 12 recognizable groups for 19 different smartcard models, as shown in Figure 2. The training set for this limited domain contains 20.6 million keys, whereas the test set contains 340 thousand keys. On average, 61.9% precision and 64.6% recall is achieved. Moreover, 8 out of 12 groups achieve > 50% precision. Additionally, the classifier exhibits 100% recall on 3 specific groups: a) Infineon smartcards (before 2017, with the ROCA vulnerability [19]), b) G&D Smartcafe 4.x and 6.0, and c) newer G&D Smartcafe 7.0. Figure 3 shows the so-called confusion matrix, where each row corresponds to the percentage of keys in an actual group, while each column represents the percentage of keys in a predicted group.

Janovsky et al.

[Figure 3: 12x12 confusion matrix of the single-key classifier in the smartcards domain; rows are true groups 1-12, columns are predicted groups 1-12, cell values are relative frequencies. For example, group 1 is classified correctly in 100% of cases, group 2 in 82%, and group 3 is misclassified as group 2 in 33% of cases.]

Fig. 3. The confusion matrix for the classifier of a single private key generated in the smartcards domain. A given row corresponds to a vector of observed relative frequencies with which keys generated by a specific group (True group) are misclassified as generated by other groups (Group predicted by our model). For example, group 1 and group 2 have no misclassifications (high accuracy), while keys of group 3 are misclassified as keys from group 2 in 33% of cases. On average, we achieve 64.6% accuracy. The darker the cell, the higher the number it contains; this holds for all figures in this paper.

As expected, the results represent an improvement compared to the dataset with all libraries. When one has ten keys from the same card at hand, the expected recall is over 90% for 10 out of 12 groups. The full table of results can be found in the project repository. Interestingly, 512- and 1024-bit keys generated by the same NXP J2E145G card (similarly also for NXP J2D081) fall into different groups.⁵ The main difference is in the modular fingerprint (avoidance of small factors in p−1 and q−1). We hypothesize that on-card key generation avoids more small factors for larger keys. Such behaviour was not observed for other libraries, but it highlights the necessity of collecting different key lengths in the training dataset when one analyzes black-box proprietary devices or closed-source software libraries.
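The gain from having several keys of the same (unknown) card can be sketched by combining per-key likelihoods: assuming the keys are independent, a Bayes classifier multiplies them, i.e., sums their logarithms. The per-key likelihood values below are toy numbers invented for the illustration, not measured data.

```python
import math

# Hypothetical sketch: combining evidence from several keys known to come
# from the same (unknown) source. The likelihood values are invented.
def classify_batch(likelihoods_per_key, sources):
    """likelihoods_per_key: list of dicts mapping source -> P(key | source)."""
    log_scores = {s: 0.0 for s in sources}
    for lik in likelihoods_per_key:
        for s in sources:
            log_scores[s] += math.log(lik.get(s, 1e-12))  # smooth zeros
    return max(log_scores, key=log_scores.get)

sources = ["GroupA", "GroupB"]
# A single ambiguous key slightly favours GroupB...
one_key = [{"GroupA": 0.45, "GroupB": 0.55}]
# ...but ten keys drawn mostly from GroupA's distribution settle it.
ten_keys = [{"GroupA": 0.6, "GroupB": 0.4}] * 8 + [{"GroupA": 0.3, "GroupB": 0.7}] * 2
print(classify_batch(one_key, sources))   # GroupB
print(classify_batch(ten_keys, sources))  # GroupA
```

This is why the expected recall climbs above 90% once ten keys from the same card are available: per-key ambiguities average out across the batch.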

⁵ This is an exception to the observation that the selected features behave independently of key length. Otherwise, keys of different lengths can be used interchangeably.

To summarize, the classification of private keys generated by smartcards is very accurate due to the significant differences among the proprietary, embedded implementations of the different vendors. The observed differences likely result from the requirement of a smaller footprint on low-resource devices.

4.2 Performance in the TLS domain

For the TLS domain, we excluded all libraries and devices unlikely to be used to generate keys for TLS servers. All smartcards are excluded, together with highly outdated or purpose-specific libraries like PGP SDK 4. All hardware security modules (HSMs) are kept, as they may be used as TLS accelerators or high-security key storage. In summary, we started with 17 separate cryptographic libraries and HSMs, inspected in a total of 134 versions. The clustering resulted in 13 recognizable groups, as shown in Figure 4. The domain training set contains 121.8 million keys and the test set contains 1.3 million keys. On average, the classifier achieves 45.5% precision and 42.2% recall. The decrease in average recall compared to the full domain may look surprising, but averaging is deceiving in this context. In fact, recall improved for 10 out of 13 groups present both in the full set and the TLS domain set, with precision improving for 9 groups. The mean values of the full dataset are uplifted by a generally better performance of the model outside the TLS domain. Five groups have > 50% precision. OpenSSL (by far the most popular library used by servers for TLS [18]) has 100% recall, making the classification of OpenSSL keys very reliable. Complete results can be found in the project repository. To summarize, we correctly classify more keys in the more specific TLS domain than with the full dataset classifier. Additionally, the user can be more confident about the decisions of the TLS-specific classifier.

[Figure 4: dendrogram for the TLS domain (Manhattan distance). The 13 groups are: Group 1 (OpenSSL); Group 2 (OpenSSL (8-bit fingerprint)); Group 3 (Sage Blum, Sage Provable); Group 4 (Mocana); Group 5 (mbedTLS); Group 6 (Bouncy Castle 1.53, SunRsaSign); Group 7 (Bouncy Castle 1.54, Mocana, Thales); Group 8 (SafeNet, cryptlib); Group 9 (Libgcrypt, Libgcrypt FIPS); Group 10 (Botan, LibTomCrypt, Nettle 3.2, Nettle 3.3, OpenSSL FIPS, WolfSSL); Group 11 (Crypto++, Microsoft); Group 12 (Utimaco); Group 13 (Nettle 2.0, Sage Default).]

Fig. 4. The clustering of the sources from the TLS domain yields 13 separate groups.

4.3 Performance in the single-prime TLS domain

The rest of this section is motivated by a setting where one wants to analyze a batch of correlated keys. Specifically, we assume a case of k ≥ 1 keys (p1, q1), ..., (pk, qk) generated by the same source, where p1 = p2 = ... = pk. This scenario emerges in Section 5 and cannot be addressed by the previously considered classifiers. If applied, the results would be drastically skewed, since the classifier would consider each of the pi separately, putting half of the weight on the shared prime. For that reason, we train a classifier that works on single primes rather than on complete private keys. Instead of feeding the classifier with a batch of k private keys, we supply it with a batch of k + 1 unique primes from those keys. The selected features were modified accordingly: we extract the 5 most significant bits of the unique prime and its second least significant bit, and compute the ROCA and modular fingerprints for the single prime. We trained the classifier on the learning set limited to the TLS domain, as in Section 4.2. On average, we achieve 28.8% precision and 36.2% recall when classifying a single prime. Table 3 shows the accuracy results in more detail. It should, however, be stressed that this classifier is meant to be used on batches of many keys at once. When considering a batch of k ≥ 10 primes, the accuracy is more than 77%. The decrease in accuracy compared to Section 4.2 can be explained by the loss of information from the second prime. The features mod and blum are much less reliable when using only one prime. Since we can compute the most significant bits only from a single prime at a time, we lose the information about the ordering of primes (the features 5p and 5q are correlated). These facts result in only nine separate groups of libraries being distinguishable. The following groups from the TLS domain are no longer mutually distinguishable: 5 and 13, 7 and 11, 8 and 9 and 10.
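A sketch of this single-prime feature extraction might look as follows. The bit positions follow the text above; the truncated list of small primes tested for the modular fingerprint and the helper name are our own choices for the illustration.

```python
# Illustrative sketch of single-prime feature extraction; the small-prime
# list is truncated and the exact fingerprint encoding is our own choice.
SMALL_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

def single_prime_features(p):
    top5 = p >> (p.bit_length() - 5)   # 5 most significant bits of the prime
    second_lsb = (p >> 1) & 1          # 0 iff p = 1 (mod 4), 1 iff p = 3 (mod 4)
    # modular fingerprint: small primes dividing p - 1; libraries that
    # deliberately avoid small factors of p - 1 leave this tuple empty
    mod_fp = tuple(q for q in SMALL_PRIMES if (p - 1) % q == 0)
    return top5, second_lsb, mod_fp

p = 2**61 - 1  # a Mersenne prime used purely as a demonstration value
top5, second_lsb, mod_fp = single_prime_features(p)
print(top5, second_lsb)  # 31 1 (all top bits set; p = 3 mod 4)
```

Note that the second least significant bit is exactly the blum-style feature mentioned above (it distinguishes p ≡ 3 mod 4 from p ≡ 1 mod 4), which is why its reliability drops when only one prime of the pair is available.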

4.4 Methodology limitations

The presented methodology has several limitations:

Number of primes in a batch       1        10        20        30       100
Group 1                      100.0%    100.0%    100.0%    100.0%    100.0%
Group 2                       42.8%     99.7%    100.0%    100.0%    100.0%
Group 3                       78.0%    100.0%    100.0%    100.0%    100.0%
Group 4                       47.5%     90.3%     95.8%     98.7%    100.0%
Group 5|13                     1.8%     30.8%     43.7%     51.8%     74.7%
Group 6                        5.2%     48.9%     61.0%     64.8%     76.7%
Group 7|11                     0.0%     67.3%     92.3%     97.4%    100.0%
Group 8|9|10                  37.9%     99.9%    100.0%    100.0%    100.0%
Group 12                      12.8%     61.8%     77.7%     83.9%     97.2%
Average                       36.2%     77.6%     85.6%     88.5%     94.3%

Table 3. Classification accuracy for single-prime features evaluated on the TLS domain.

Classification of an unseen source. Not all existing sources of RSA keys are present in our dataset for the clustering analysis and classification. This means that attempting to classify a key from a source not considered in our study will bring unpredictable results. The new source may either populate some existing group or have a unique implementation, thus creating a new group. In both cases, the behaviour of the classifier is unpredictable.

Granularity of the classifier. There are multiple libraries in a single group. The user is therefore not shown the exact source of the key, but the whole group instead. This limitation has two main reasons: 1) some sources share the same implementation and thus cannot be told apart; 2) the list of utilized features is narrow. There are, in principle, infinitely many possible features, and some may hide valuable information that could further improve the model performance. Nevertheless, the proposed methodology allows for an automatic evaluation of features using the naïve Bayes method, which shall be considered in future work.

Human factor. The clustering task in our study requires human knowledge. Specifically, the value of the threshold that splits the libraries into groups (for a particular feature) is established only semi-automatically. We manually confirmed the threshold when we could explain the difference between the libraries, or moved it otherwise. In summary, this complicates a fully automatic evaluation on a large number of potential features. Once solved, the relative importance of the individual features could be measured.

5 Real-world GCD-factorable keys origin investigation

Previous research [16,14,13,2] demonstrated that a non-trivial fraction of RSA keys used on publicly reachable TLS servers is generated insecurely and is practically factorable. This is because the affected network devices were found to independently generate RSA keys that share a single prime or both primes. While an efficient factorization algorithm for RSA moduli is unknown, when two keys accidentally share one prime, efficient factorization is possible using the Euclidean algorithm to find their GCD.⁶ Still, the current number of public keys obtained from crawling TLS servers is too high to allow for the investigation of all possible pairs. However, the distributed GCD algorithm [15] allows analyzing hundreds of millions of keys efficiently. Its performance was sufficient to analyze all keys collected from IPv4-wide TLS scans [21,5] and resulted in almost 1% of factorable keys in the scans collected at the beginning of the year 2016.

After the detection of GCD-factorable keys, the question of their origin naturally followed. Previous research addressed it using two principal approaches: 1) an analysis of the information extractable from the certificates of GCD-factorable keys, and 2) matching specific properties of the factored primes with primes generated by a suspected library, OpenSSL. The first approach allowed the detection of a range of network routers that seeded their PRNG shortly after boot without enough entropy, which caused them to occasionally generate a prime shared with another device. These routers contained a customized version of the OpenSSL library, which was confirmed by the second approach, since the OpenSSL code intentionally avoids small factors of p−1, as shown by [17].

⁶ Note that the keys sharing both primes are not susceptible to this attack but reveal their private keys to all other owners of the same RSA key pair.

While this suite of routers was clearly the primary source of the GCD-factorable keys, are they the sole source of insecure keys? The paper [13] identified 23 router/device vendors that used the code of OpenSSL (using the specific OpenSSL fingerprint based on the avoidance of small factors in p−1 and information extracted from the certificates). Eight other vendors (DrayTek, Fortinet, Huawei, Juniper, Kronos, Siemens, Xerox, and ZyXEL) produced keys without such an OpenSSL fingerprint, and the underlying libraries remained unidentified. In the rest of this section, we build upon the prior work to identify probable sources of the GCD-factorable keys that do not originate from the OpenSSL library.

Two assumptions must be met to employ the classifier studied in Section 4.3. First, we assume that when a batch of GCD-factored keys shares a prime, they were all generated by sources from a single classification group. This conjecture is suggested in [13,14] and supported by the fact that when distinct libraries differ in their prime generation algorithm, they will produce different primes even when initialized from the same seed. On the other hand, when they share the same generation algorithm, they inevitably fall into the same classification group. Second, we assume that if the malformed keys share only a single prime, the PRNG was reseeded with enough entropy before the second prime was generated. This is suggested by the failure model studied for OpenSSL in [14] and implies that the second prime is generated as it normally would be.

Leveraging these conjectures, the rest of this section tracks the libraries responsible for GCD-factorable keys while not relying on the information in the certificates. First, we describe the dataset gathering process, as well as the factorization of the RSA public keys. Later, the successfully factored keys are analyzed, followed by a discussion of the findings.
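The core of the attack on keys sharing a prime can be demonstrated in a few lines; the tiny primes below are illustrative only, not realistic key sizes.

```python
import math

# Why a shared prime is fatal: two RSA moduli that accidentally share p
# are both factored by a single gcd computation.
p = 61            # shared prime (insufficient boot-time entropy scenario)
q1, q2 = 53, 59   # second primes, generated after proper reseeding
n1, n2 = p * q1, p * q2

shared = math.gcd(n1, n2)
assert shared == p
print(n1 // shared, n2 // shared)  # recovers q1 and q2, i.e. both private keys
```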

6 Datasets of GCD-factorable TLS keys

The input dataset with public RSA keys (both secure and vulnerable ones) was obtained from the Rapid7 archive. All scans between October 2013 and July 2019 (mostly with a one- or two-week period) were downloaded and processed, resulting in slightly over 170 million certificates. Only public RSA keys were extracted and duplicates removed, resulting in 112 million unique moduli. On this dataset, the fastgcd [15] tool based on [3] was used to factorize the moduli into private keys. A detailed methodology of this procedure is discussed in Appendix B.
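Running a GCD for every pair would require on the order of n²/2 computations; fastgcd instead implements a batch GCD based on product and remainder trees [3,15]. A compact (and, as Appendix B illustrates, memory-hungry) sketch of the idea:

```python
import math

def product_tree(xs):
    """Bottom-up tree of partial products; the last level holds prod(xs)."""
    tree = [xs]
    while len(tree[-1]) > 1:
        level = tree[-1]
        tree.append([level[i] * level[i + 1] if i + 1 < len(level) else level[i]
                     for i in range(0, len(level), 2)])
    return tree

def batch_gcd(moduli):
    """For each modulus n, return gcd(n, product of all the other moduli)."""
    tree = product_tree(moduli)
    rems = tree.pop()  # the root: product of all moduli
    while tree:
        level = tree.pop()
        # push the remainder down the tree, reducing modulo n^2 at each node
        rems = [rems[i // 2] % (n * n) for i, n in enumerate(level)]
    return [math.gcd(n, (r // n) % n) for r, n in zip(rems, moduli)]

# two moduli sharing the prime 61, plus one unrelated modulus
print(batch_gcd([61 * 53, 61 * 59, 101 * 103]))  # [61, 61, 1]
```

A result of 1 means no shared factor, while a result equal to the modulus itself indicates a key whose both primes appear elsewhere (the case of footnote 6); only the intermediate results yield a proper factorization.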

6.1 Batching of GCD-factorable keys

If the precision and recall of our classifier were 100%, one could process the factored keys one by one, establish their origin library, and thus detect all sources of insecure keys. But since the classification accuracy of the single-prime TLS classifier⁷ with a single key is only 36%, we apply three adjustments: 1) we batch the GCD-factorable keys sharing the same prime (believed to be produced by the same library); 2) we analyze only the batches with at least 10 keys (therefore with high expected accuracy); 3) we limit the set of the libraries considered for classification to the single-prime TLS domain. Since the keys from the OpenSSL library were already extensively analyzed by [13], we use the mod feature to reliably mark and exclude them from further analysis. By doing so, we concentrate primarily on the non-OpenSSL keys that were not yet attributed. The exact process for the classification of factored keys in batches is as follows:

1. Factorize public keys from a target dataset (e.g., Rapid7) using the fastgcd tool.
2. Form batches of factored keys that share a prime and assume that they originate from the same classification group.
3. Select only the batches with at least k keys (e.g., 10).
4. Separate the batches of keys that all carry the OpenSSL fingerprint. As a control experiment, they should classify only to a group with the OpenSSL library.
5. Separate the batches without the OpenSSL fingerprint. This cluster contains the yet unidentified libraries.
6. Classify the non-OpenSSL cluster using the single-prime TLS classifier.
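Steps 2-5 amount to grouping the factored keys by their shared prime and splitting the batches on the fingerprint test. A minimal sketch, with `has_openssl_fingerprint` as a hypothetical stand-in for the mod-feature test:

```python
from collections import defaultdict

def form_batches(factored_keys, min_size=10):
    """factored_keys: iterable of (shared_prime, other_prime) pairs."""
    batches = defaultdict(list)
    for shared_prime, other_prime in factored_keys:
        batches[shared_prime].append(other_prime)
    # keep only batches large enough for a reliable classification
    return {p: qs for p, qs in batches.items() if len(qs) >= min_size}

def split_by_fingerprint(batches, has_openssl_fingerprint):
    """Separate batches whose every member carries the OpenSSL fingerprint."""
    openssl, unknown = {}, {}
    for p, qs in batches.items():
        (openssl if all(map(has_openssl_fingerprint, qs)) else unknown)[p] = qs
    return openssl, unknown

keys = [(61, 100 + i) for i in range(12)] + [(7, 11)]  # toy "factored keys"
batches = form_batches(keys)
print(sorted(batches))  # [61]  (the lone key sharing prime 7 is dropped)
```

The non-OpenSSL dictionary returned by `split_by_fingerprint` is then what step 6 hands to the single-prime TLS classifier.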

6.2 Source libraries detected in GCD-factorable TLS keys

Group(s)                                       # batches
1 (OpenSSL)                                         2230
2 (8-bit OpenSSL)                                      3
8|9|10 (various libraries, see Figure 4)             278
3; 4; 6; 12; 5|13; 7|11                     0 (improbable)

Table 4. Keys that share a prime factor belong to the same batch. Classification of most batches resulted in OpenSSL as the likely source. The rest of the batches were likely generated by libraries in the combined group 8|9|10.

In total, we analyzed more than 82 thousand primes divided into 2511 batches. While each batch has at least 10 keys in it, the median batch size is 15. Among the batches, 88.8% exhibit the OpenSSL fingerprint. This number well confirms the previous finding by [13], which also captured the OpenSSL-specific fingerprint in a similar fraction of keys. We attribute three other batches as coming from OpenSSL (8-bit fingerprint), an OpenSSL library compiled to test and avoid divisors of p−1 only up to 251. Importantly, slightly more than 11% of the batches were generated by some library from groups 8, 9, or 10, which

are not mutually distinguishable when only a single prime is available. There are also negative results to report. With accuracy over 80% (for a batch size of 15) and no batches attributed to any of groups 3, 4, 6, 12, 5|13, or 7|11, it is very improbable that any GCD-factorable keys originate from the respective sources in these libraries.

⁷ Note that without using the single-prime model, the results would be biased, as the shared prime would be considered multiple times in the classification process.

7 Related work

The fingerprinting of devices based on their physical characteristics, exposed interfaces, behaviour in non-standard or undefined situations, errors returned, and a wide range of other side-channels is a well-researched area. Experience shows that finding a case of non-standard behaviour is usually possible, while making a group of devices indistinguishable is very difficult due to an almost infinite number of observable characteristics, resulting in an arms race between device manufacturers and fingerprinting observers.

Having a device fingerprinted helps to better understand a complex ecosystem, e.g., by quantifying the presence of interception middle-boxes on the Internet [9], the types of connected clients, or the versions of operating systems. Differences may help point out subverted supply chains or counterfeit products. When applied to the study of cryptographic keys and cryptographic libraries, researchers devised a range of techniques to analyze the fraction of encrypted connections, the prevalence of particular cryptographic algorithms, and the chosen key lengths or cipher suites [8,10,2,1,12,4,24]. Information about a particular key is frequently obtained from the metadata of its certificate.

Periodic network scans allow assessing the impact of security flaws in practice. The population of OpenSSL servers with the Heartbleed vulnerability was measured and monitored by [7], and real attempts to exploit the bug were surveyed. If the necessary information is coincidentally collected and archived, even a backward introspection of a vulnerability in time might be possible. The simple test for the ROCA vulnerability in public RSA keys allowed measuring the fraction of citizens of Estonia who held an electronic ID supported by a vulnerable smartcard, by inspecting the public repository of eID certificates [19].
The fingerprinting of keys from smartcards was used to detect that private keys were generated outside of the card and injected later into the eIDs, despite the requirement to have all keys generated on-card [20]. The attribution of a public RSA key to its origin library was analyzed by [23]. Measurements on large datasets were presented in [18], leading to an accurate estimation of the fraction of cryptographic libraries used in large datasets like IPv4-wide TLS. While both [23] and [18] analyze public keys, private keys can also be obtained under certain conditions, e.g., from a faulty random number generator [16,6,13,14,22]. The origin of weak factorable keys needs to be identified in order to notify the maintainers of the code to fix the underlying issues; for this, a combination of key properties and values from certificates was used.

8 Conclusions

We provide what we believe is the first wide examination of the properties of RSA keys with the goal of attributing a private key to its origin library. The attribution is applicable in multiple scenarios, e.g., to the analysis of GCD-factorable keys in the TLS domain. We investigated the properties of keys as generated by 70 cryptographic libraries, identified biased features in the primes produced, and compared three models based on Bayes classifiers for the private key attribution. The information available in private keys significantly increases the classification performance compared to the results achieved on public keys [23]. Our work distinguishes 26 groups of sources (compared to 13 on public keys) while more than doubling the accuracy w.r.t. random guessing. When 100 keys are available for the classification, the correct result is almost always provided (> 99%) for 19 out of 26 groups.

Finally, we designed a method usable also for a dataset of keys where one prime is significantly correlated. Such primes are found in GCD-factorable TLS keys where one prime was generated with insufficient randomness and would introduce a high classification error in the unmodified method. As a result, we can identify the libraries responsible for the production of these GCD-factorable keys, showing that only three groups are a relevant source of such keys. The accurate classification can be easily incorporated into forensic and audit tools.

While the bias in the keys usually does not help with factorization, cryptographic libraries should approach their key generation design with great care, as strong bias can lead to weak keys [19]. We recommend following a key generation process with as little bias as possible.

Acknowledgements. The authors would like to thank the anonymous reviewers for their helpful comments. P. Svenda and V. Matyas were supported by Czech Science Foundation project GA20-03426S.
Some of the tools used and other people involved were supported by the CyberSec4Europe Competence Network. Computational resources were supplied by the project e-INFRA LM2018140.

References

1. Albrecht, M.R., Degabriele, J.P., Hansen, T.B., Paterson, K.G.: A surfeit of SSH cipher suites. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. pp. 1480–1491. ACM (2016)
2. Barbulescu, M., Stratulat, A., Traista-Popescu, V., Simion, E.: RSA weak public keys available on the Internet. In: International Conference for Information Technology and Communications. pp. 92–102. Springer-Verlag (2016)
3. Bernstein, D.J.: How to find smooth parts of integers (2004), [cit. 2020-07-13]. Available from http://cr.yp.to/papers.html#smoothpart
4. Cangialosi, F., Chung, T., Choffnes, D., Levin, D., Maggs, B.M., Mislove, A., Wilson, C.: Measurement and analysis of private key sharing in the HTTPS ecosystem. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. pp. 628–640. ACM (2016)
5. Censys: Censys TLS Full IPv4 443 Scan (2015), [cit. 2020-07-13]. Available from https://censys.io/data/443-https-tls-full_ipv4/historical

6. Batch-GCDing Github SSH Keys (2015), [cit. 2020-07-13]. Available from https://cryptosense.com/batch-gcding-github-ssh-keys/
7. Durumeric, Z., Kasten, J., Adrian, D., Halderman, J.A., Bailey, M., Li, F., Weaver, N., Amann, J., Beekman, J., Payer, M., et al.: The matter of Heartbleed. In: Proceedings of the 2014 Conference on Internet Measurement Conference. pp. 475–488. ACM (2014)
8. Durumeric, Z., Kasten, J., Bailey, M., Halderman, J.A.: Analysis of the HTTPS certificate ecosystem. In: Proceedings of the 2013 ACM Internet Measurement Conference. pp. 291–304. ACM (2013)
9. Durumeric, Z., Ma, Z., Springall, D., Barnes, R., Sullivan, N., Bursztein, E., Bailey, M., Halderman, J.A., Paxson, V.: The security impact of HTTPS interception. In: Network and Distributed Systems Symposium. The Internet Society (2017)
10. Electronic Frontier Foundation: The EFF SSL Observatory (2010), [cit. 2020-07-13]. Available from https://www.eff.org/observatory
11. Flach, P.: Machine Learning: The Art and Science of Algorithms that Make Sense of Data, chap. 2, pp. 57–58. Cambridge University Press (2012)
12. Gustafsson, J., Overier, G., Arlitt, M., Carlsson, N.: A first look at the CT landscape: Certificate Transparency logs in practice. In: Proceedings of the 18th Passive and Active Measurement Conference. pp. 87–99. Springer-Verlag (2017)
13. Hastings, M., Fried, J., Heninger, N.: Weak keys remain widespread in network devices. In: Proceedings of the 2016 ACM on Internet Measurement Conference. pp. 49–63. ACM (2016)
14. Heninger, N., Durumeric, Z., Wustrow, E., Halderman, J.A.: Mining your Ps and Qs: Detection of widespread weak keys in network devices. In: Proceedings of the USENIX Security Symposium. pp. 205–220. USENIX (2012)
15. Heninger, N., Halderman, J.A.: Fastgcd (2015), [cit. 2020-07-13]. Available from https://github.com/sagi/fastgcd
16. Lenstra, A.K., Hughes, J.P., Augier, M., Bos, J.W., Kleinjung, T., Wachter, C.: Ron was wrong, Whit is right. Cryptology ePrint Archive, Report 2012/064 (2012), [cit. 2020-07-13]. Available from https://eprint.iacr.org/2012/064
17. Mironov, I.: Factoring RSA Moduli II. (2012), [cit. 2020-07-13]. Available from https://windowsontheory.org/2012/05/17/factoring-rsa-moduli-part-ii/
18. Nemec, M., Klinec, D., Svenda, P., Sekan, P., Matyas, V.: Measuring popularity of cryptographic libraries in internet-wide scans. In: Proceedings of the 33rd Annual Computer Security Applications Conference. pp. 162–175. ACM (2017)
19. Nemec, M., Sys, M., Svenda, P., Klinec, D., Matyas, V.: The Return of Coppersmith's Attack: Practical Factorization of Widely Used RSA Moduli. In: 24th ACM Conference on Computer and Communications Security (CCS'2017). pp. 1631–1648. ACM (2017)
20. Parsovs, A.: Estonian electronic identity card: Security flaws in key management. In: 29th USENIX Security Symposium. USENIX Association (2020)
21. Rapid7: Rapid 7 Sonar SSL full IPv4 scan (2019), [cit. 2020-07-13]. Available from https://opendata.rapid7.com/sonar.ssl/
22. Software in the Public Interest: DSA-1571-1 openssl – predictable random number generator (2008), [cit. 2020-07-13]. Available from https://www.debian.org/security/2008/dsa-1571
23. Svenda, P., Nemec, M., Sekan, P., Kvasnovsky, R., Formanek, D., Komarek, D., Matyas, V.: The million-key question — Investigating the origins of RSA public keys. In: Proceedings of the USENIX Security Symposium. pp. 893–910 (2016)

24. VanderSloot, B., Amann, J., Bernhard, M., Durumeric, Z., Bailey, M., Halderman, J.A.: Towards a complete view of the certificate ecosystem. In: Proceedings of the 2016 ACM on Internet Measurement Conference. pp. 543–549. ACM (2016)

A Detailed discussion of classifier results

Some groups are accurately classified and rarely misclassified even with a single key available: namely, group 1 (Infineon prior to 2017, distinct because of the ROCA fingerprint), group 2 (Giesecke&Devrient SmartCafe 4.x and 6.0), group 24 (standard OpenSSL without the FIPS module enabled), and group 26 (Giesecke&Devrient SmartCafe 7.0) are all classified with more than 96% recall. Groups 1, 2, and 26 are rarely reported incorrectly as the origin library (false positives).

The keys from group 25 (OpenSSL avoiding only 8-bit small factors in p−1) are misclassified as group 24 (standard OpenSSL) in 31.6% of cases, which still identifies the origin library correctly and only misidentifies the OpenSSL compile-time configuration.

In contrast, keys from groups 7, 10, 11, 14, 15, and 17 are almost always misclassified (less than 8% recall, some even less than 1%). However, as discussed in the next section, if some additional information is available and can be considered, this misclassification can be largely remediated.

Keys from group 7 (Libgcrypt) are mostly misclassified as group 6 (PGP SDK 4, 64.5%) or group 13 (Gemalto GXP E64, 20.2%). As Libgcrypt is a commonly used library while groups 6 and 13 correspond to a very old library and card, this case demonstrates the possibility for further classifier improvement when some prior knowledge is available. E.g., for the TLS domain, groups corresponding to old smartcards or non-TLS libraries can be ruled out from the process.

Group 10 (Bouncy Castle since 1.54, Mocana 7.x, or HSM Thales nShieldF3) is misclassified as group 12 (smartcard Taisys SIMoME, 36.3%) or group 5 (Mocana 6.x, 21.0%). Additional information can improve the classification accuracy, as the Taisys smartcard is an unlikely source for most usage domains. If the Mocana library actually generated the key, only the identified version is incorrect.
Group 11 (cryptlib, Safenet HSM Luna SA-1700, and Feitian and Oberthur cards) is misclassified as group 12 (smartcard Taisys, 50.2%) or group 20 (Oberthur Cosmo Dual, 20.4%). This is a very similar case to that of group 10.

Group 14 (Microsoft and Crypto++, a prevalent group) is misclassified as group 6 (PGP SDK 4, 23.9%), group 12 (card Taisys, 20.1%), group 13 (card Gemalto GXP E64, 13.5%), or group 5 (Mocana 6.x, 10.7%). Again, for the TLS domain, the only real misclassification problem is with the Mocana 6.x library.

Group 15 (a large group with multiple frequently used libraries) is misclassified as group 12 (card Taisys, 27.2%), group 13 (card Gemalto GXP E64, 18.1%), group 20 (card Oberthur, 11.7%), or group 6 (PGP SDK 4, 32.3%). For the TLS domain, none of these misclassified groups is likely.

Group 17 (Nettle, Cryptix, FlexiProvider) is misclassified as multiple other groups, where only groups 5 (Mocana 6.x) and 9 (Bouncy Castle prior to 1.54 and SunRsaSign OpenJDK 1.8) cannot be ruled out as unlikely for the TLS domain.

(all values in %)
                         Top 1 match                          Top 2 match                          Top 3 match
#keys in batch     1     2     3     5    10  |     1     2     3     5    10  |     1     2     3     5    10
Group 1        100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0
Group 2        100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0
Group 3         86.3  98.1  99.8 100.0 100.0  |  98.2 100.0 100.0 100.0 100.0  |  98.2 100.0 100.0 100.0 100.0
Group 4         92.7  99.3  99.9 100.0 100.0  |  94.8  99.7 100.0 100.0 100.0  |  96.4  99.9 100.0 100.0 100.0
Group 5         60.8  76.3  79.8  90.7  96.6  |  71.5  90.1  93.6  98.7  99.9  |  73.0  91.3  97.6  98.8 100.0
Group 6         73.0  88.1  88.5  83.5  69.8  |  92.8  92.8  97.7  98.2  99.9  |  96.5  97.0  99.5  99.8 100.0
Group 7          7.6  18.9  30.0  47.9  73.6  |  77.3  95.5  98.8  99.9 100.0  |  92.7  99.3  99.9 100.0 100.0
Group 8         16.3  33.5  44.2  54.6  62.8  |  27.5  56.2  73.5  91.3  99.2  |  38.6  63.9  81.7  94.2  99.5
Group 9         12.8  28.3  38.9  50.9  61.1  |  37.7  65.7  79.1  90.4  99.0  |  48.3  75.9  87.8  96.8  99.8
Group 10         0.0  24.7  47.7  67.9  92.0  |  18.4  44.1  60.8  79.8  96.1  |  52.7  87.6  92.5  98.5 100.0
Group 11         6.9  21.8  34.2  51.6  63.1  |  56.7  87.2  95.9  99.4 100.0  |  73.2  95.2  99.2 100.0 100.0
Group 12        54.9  75.4  78.2  71.5  65.8  |  72.2  85.0  95.4  98.1 100.0  |  89.5  95.7  99.0  99.8 100.0
Group 13        47.2  57.0  69.6  84.8  96.3  |  52.9  68.6  80.9  93.8  99.5  |  66.4  82.9  91.4  98.0  99.8
Group 14         6.9  22.4  40.8  70.5  93.6  |   7.7  41.0  69.7  90.8  99.3  |  12.4  53.7  78.9  95.4  99.9
Group 15         0.2  28.0  52.7  80.0  96.5  |   2.5  43.4  65.4  90.2  99.4  |  28.2  64.6  81.0  94.4  99.7
Group 16        31.4  63.6  79.4  91.1  99.4  |  40.9  70.6  85.4  96.5 100.0  |  48.3  80.0  92.1  98.8 100.0
Group 17         5.1  28.6  50.2  78.0  97.6  |  18.3  51.2  71.9  92.0  99.7  |  37.7  73.0  89.0  98.1 100.0
Group 18        12.2  55.1  70.5  78.5  84.7  |  45.2  91.0  98.2 100.0 100.0  |  76.3  96.1  99.4 100.0 100.0
Group 19        44.0  54.4  59.7  67.3  78.5  |  54.5  88.3  97.3  99.9 100.0  |  62.1  93.8  99.1 100.0 100.0
Group 20        81.5  95.2  98.7  99.9 100.0  |  97.2 100.0 100.0 100.0 100.0  |  98.9 100.0 100.0 100.0 100.0
Group 21        53.0  77.9  88.4  97.0  99.9  |  95.2  99.7 100.0 100.0 100.0  |  97.6 100.0 100.0 100.0 100.0
Group 22        14.6  39.2  53.5  72.5  92.3  |  78.0  98.2  99.8 100.0 100.0  |  97.2  99.9 100.0 100.0 100.0
Group 23        77.4  98.0  99.9 100.0 100.0  |  96.8  99.9 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0
Group 24        96.8  99.9 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0
Group 25        58.3  86.7  96.1  99.7 100.0  |  87.6  97.9  99.6 100.0 100.0  |  93.9  99.7 100.0 100.0 100.0
Group 26       100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0  | 100.0 100.0 100.0 100.0 100.0
Average         47.7  64.2  73.1  82.2  89.4  |  66.3  83.3  90.9  96.9  99.7  |  76.1  90.4  95.7  98.9 100.0

Table 5. The average classification accuracy of the best performing Bayes classifier. In the i-th column group we consider the classifier successful if the true source of the key is among the i best guesses of our model. Similarly, for each of the three column groups we evaluate the success rate when 1, 2, 3, 5, or 10 keys from the same group are available.
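The top-i accuracy reported in Table 5 can be computed from a classifier's ranked guesses in a few lines. The sketch below is our own illustration with toy data (the function name and labels are not part of the evaluation pipeline); the batching over 2, 3, 5, or 10 keys, which would combine per-key posteriors before ranking, is omitted for brevity.

```python
def top_k_accuracy(ranked_guesses, true_labels, k):
    """Fraction of samples whose true label appears among the k best guesses.

    ranked_guesses: one list per sample, sorted from most to least likely group.
    true_labels: the actual source group of each sample.
    """
    hits = sum(truth in guesses[:k]
               for guesses, truth in zip(ranked_guesses, true_labels))
    return hits / len(true_labels)

# Toy example with three keys and three candidate groups:
ranked = [["g1", "g2", "g3"], ["g2", "g1", "g3"], ["g3", "g2", "g1"]]
truth = ["g1", "g1", "g2"]
top1 = top_k_accuracy(ranked, truth, 1)  # only the first sample is a top-1 hit
top2 = top_k_accuracy(ranked, truth, 2)  # all three samples are top-2 hits
```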

B Obtaining dataset of GCD-factorable keys

The fastgcd [15] tool based on [3] was used to perform the search for the GCD-factorable keys. Only valid RSA keys were considered⁸. Running the fastgcd tool for a high number of keys (around 112 million for the Rapid7 dataset) requires an extensive amount of RAM. Running the tool on a machine with 500 GB of RAM resulted in only a few factored keys, all sharing just tiny factors, while the tool did not produce any errors or warnings. The same computation on a subset of 10 million keys revealed a substantial number of large factors. Likely, the fastgcd tool requires even more RAM to function correctly with such a large number of keys. To solve the problem, we partitioned the time-ordered dataset into two subsets of 50 and 62 million keys, with an additional third subset of 50 million keys that partially overlapped both previous partitions. By doing so, we miss GCD-factorable keys that appeared in the dataset separated by a considerable time distance (2–3 years). We hypothesise that if a prevalent source starts producing GCD-factorable keys, we capture a sufficiently large batch of them within a single subset. In total, we have acquired 114 thousand unique factors from the whole dataset.

⁸ The factorization occasionally finds small prime factors up to 2^16, likely because the public key (certificate) was damaged, e.g., by a bit flip.

A. Attached publications of the author

A.2 Bringing kleptography to real-world TLS

The paper starts at the next page.

Bringing kleptography to real-world TLS

Adam Janovsky¹, Jan Krhovjak², and Vashek Matyas¹

¹ Masaryk University, [email protected]
² Invasys, a.s.

Abstract. Kleptography is the study of stealing information securely and subliminally from black-box cryptographic devices. The stolen information is exfiltrated from the device via a backdoored algorithm inside an asymmetrically encrypted subliminal channel. In this paper, the kleptography setting for the TLS protocol is addressed. While earlier proposals of asymmetric backdoors for TLS lacked the desired properties or were impractical, this work shows that a feasible asymmetric backdoor can be derived for TLS. First, the paper revisits the existing proposals of kleptographic backdoors for TLS of version 1.2 and lower. Next, advances of the proposal by Gołębiewski et al. are presented to achieve better security and indistinguishability. Then, the enhanced backdoor is translated both to TLS 1.2 and 1.3, achieving the first practical solution. Properties of the backdoor are proven and its feasibility is demonstrated by implementing it as a proof-of-concept into the OpenSSL library. Finally, the performance of the backdoor is evaluated and its detection via a side channel is studied.

Keywords: Asymmetric backdoor · Kleptography · TLS

1 Introduction

Tamper-proof devices were proposed as a remedy for many security-related problems. Their advantage is undeniable since they are protected from physical attacks and it is difficult to change the executed code. However, they inherently introduce trust in the manufacturer. It was shown that such devices are theoretically vulnerable to the presence of so-called subliminal channels [20]. Such channels can be used to exfiltrate private information from the underlying system covertly, inside cryptographic primitives. As a consequence, malware introduced by a manufacturer or a clever third-party adversary can utilise subliminal channels to break the security of black-box devices.

This paper concerns kleptography – the art of stealing information securely and subliminally – for the TLS protocol. The field of kleptography was established in the 1990s by Yung and Young [18]. Kleptographic backdoors for many protocols and primitives have been proposed ever since; for instance, we mention the RSA key generation protocol [18] and the Diffie-Hellman (DH) protocol [19]. Using kleptography, an attacker can subvert a target to deny confidentiality and authenticity of transferred data. Thus, it is important to explore the feasibility of kleptographic backdoors for various protocols, alongside methods for defeating such backdoors.

Several challenges arise when inventing a kleptographic backdoor. First, one must assure that such a backdoor cannot be detected by looking at the inputs and outputs of an infected device. Furthermore, the exploited channel is often narrow-band and the computing performance of the device should not be overly affected. Last but not least, one must prove the security of both the original cryptosystem and the encrypted subliminal channel.

Previous work on kleptography in the TLS protocol [9,21] showed that it is possible to utilise a single random nonce to exfiltrate session keys.
Both backdoors exploit a random field inside the ClientHello message, a 32-byte nonce that is sent to the server by the client. The proposal [9] was rather a sketch of an asymmetric backdoor and it lacked a few key properties. The work [21] was an important theoretical result and a proven asymmetric backdoor for the TLS protocol. Nonetheless, it remains impractical to implement. Also, neither of the papers addressed TLS of version 1.3. More insight into the related work follows in Section 5.

In this work we make the following contributions:
– We modify the backdoor [9] to achieve better security of the backdoor and also indistinguishability in the random oracle model.
– We prove that our proposal is an asymmetric backdoor for all versions of the TLS protocol, including TLS 1.3.
– We implement the backdoor as a proof-of-concept into the OpenSSL library, confirming its feasibility.
– We evaluate the performance of the backdoor and discuss its detectability.

The remainder of the paper is organized as follows. Section 2 gives basic background on kleptography. In Section 3 we show the design of our backdoor. Section 4 comments on how we implemented the backdoor and gives the exact results of our performance tests. It also shows how a timing channel can be used to detect our backdoor. Section 5 reviews related work and, finally, Section 6 concludes the paper.

2 Kleptography background

The work on kleptography utilises cryptology and virology and naturally extends the study of subliminal channels [20]; those are further encrypted and embedded into the devices, creating so-called asymmetric backdoors. As of 2018, the secretly embedded backdoor with universal protection (SETUP) is the supreme (and only) tool in the field of kleptography. One could therefore say that kleptography studies the development of asymmetric backdoors and possible defenses against them at the same time. Kleptography concerns the black-box environment exclusively, as in the white-box setting scrutiny allows one to detect such a channel. The aim of this section is to introduce the techniques involved in the design of an asymmetric backdoor. We begin with a formal description of an asymmetric backdoor, adopted from [20].

Definition 1. Assume that C is a black-box cryptosystem with a publicly known specification. A SETUP mechanism is an algorithmic modification made to C to obtain C′ such that:

1. The input of C′ agrees with the public specifications of the input of C.

2. C′ computes efficiently using the attacker's public encryption function E (and possibly other functions) contained within C′.

3. The attacker's private decryption function D is not contained within C′ and is known only to the attacker.

4. The output of C′ agrees with the public specifications of the output of C. At the same time, it contains published bits (of the user's secret key) which are easily derivable by the attacker (the output can be generated during key generation or during system operation like message sending).

5. Furthermore, the outputs of C and C′ are polynomially indistinguishable to everyone except the attacker.

6. After the discovery of the specifics of the SETUP algorithm and after discovering its presence in the implementation (e.g., reverse engineering of a hardware tamper-proof device), users (except the attacker) cannot determine past (or future) keys.

Consider that an asymmetric backdoor can itself be a subject of cryptanalysis. That is why the resulting subliminal channel must be encrypted according to good cryptographic practice. To keep the notation unambiguous, we call the person who attacks the backdoor an inquirer. We say that an asymmetric backdoor has an (m, n) leakage scheme if it leaks m keys/secret messages over n outputs of the cryptographic device. The desired leakage bandwidth that an asymmetric backdoor should achieve is (m, m), meaning that the whole private information is leaked within one execution of the protocol. Further, a publicly displayed value that also serves as an asymmetric backdoor is denoted a kleptogram.

2.1 Example of asymmetric backdoor

The paper continues with an example of the RSA key generation SETUP [18] to illustrate the concept of asymmetric backdoors. The backdoor allows for efficient factorization of the RSA modulus by evil Eve. First, Eve generates her public RSA key (N, E) and embeds it into the contaminated device of Alice, together with a subverted key-generation algorithm:

1. The device selects two distinct primes p, q, computes the product pq = n and Euler's function ϕ(n) = (p − 1)(q − 1).
2. The public exponent is derived as e = p^E (mod N). If e is not invertible modulo ϕ(n), a new p is generated.
3. The private exponent is computed as d = e^(−1) (mod ϕ(n)).
4. The public key is (n, e), the private key is (n, d).

After obtaining the kleptogram e contained in the public key (n, e), Eve can use her private key (N, D) to compute

e^D = p^(E·D) = p (mod N).

Thus, Eve can factorize the modulus n and compute the private exponent d just by eavesdropping on the public key. The reader may notice that the backdoor requires e to be uniformly distributed on the group (otherwise the backdoor can be detected), which leaves it unsuitable for real use. Yet, this toy example illustrates the concept of asymmetric backdoors beautifully. It is not difficult to prove that the backdoor fulfils all conditions of SETUP if e is to be picked uniformly in the clean system. Also, we note that this backdoor exhibits the ideal leakage bandwidth (m, m). However, this backdoor lacks perfect forward secrecy. Indeed, when the attacker's private key (N, D) is compromised, an inquirer can factorize all past and future keys from the particular key-generating device.
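The toy SETUP above fits in a few lines of code. The following sketch is our own illustration with deliberately tiny, insecure parameters (the helper names are ours): Eve's key (N, E) is embedded in the device, the device hides p inside the public exponent e, and Eve recovers p from the public key alone.

```python
import math
import random

def gen_prime(bits, rng):
    """Tiny trial-division prime generator -- adequate for toy-sized parameters."""
    while True:
        c = rng.randrange(1 << (bits - 1), 1 << bits) | 1
        if all(c % f for f in range(3, math.isqrt(c) + 1, 2)):
            return c

rng = random.Random(1)

# Eve's (attacker's) RSA key pair, hard-coded into the device.
pA, qA = gen_prime(12, rng), gen_prime(12, rng)
N = pA * qA                       # N exceeds any 16-bit device prime p
E = 65537
D = pow(E, -1, (pA - 1) * (qA - 1))

def subverted_keygen():
    """The contaminated key generation: e = p^E (mod N) leaks the prime p."""
    while True:
        p, q = gen_prime(16, rng), gen_prime(16, rng)
        if p == q:
            continue
        n, phi = p * q, (p - 1) * (q - 1)
        e = pow(p, E, N)          # the kleptogram
        if e < 2 or math.gcd(e, phi) != 1:
            continue              # step 2: pick a fresh p
        d = pow(e, -1, phi)
        return (n, e), (n, d), p

(public, private, true_p) = subverted_keygen()
n, e = public

# Eve factors n by eavesdropping on the public key only:
p_recovered = pow(e, D, N)        # e^D = p^(E*D) = p (mod N), since p < N
q_recovered = n // p_recovered
```

As the text notes, e produced this way is not uniformly distributed, so this variant is detectable; the sketch only demonstrates the (m, m) leakage.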

2.2 Kleptography in the wild

Recall that an asymmetric backdoor is a modification of an already established algorithm. Detection of such a modification therefore proves its malicious nature. In contrast, when an algorithm is designed to be kleptographic initially, its malevolence cannot be decided so easily. To illustrate this aspect, we briefly revisit the DUAL_EC_DRBG pseudorandom number generator invented by the NSA³. The generator was standardized in NIST SP 800-90A [2]. Later, a bit predictor with advantage 0.0011 was presented in [8]. Despite this serious flaw we concentrate on a different problem. In particular, the paper [2] shows potential kleptographic tampering. To be exact, the generator requires two constants on an elliptic curve,

i.e., P, Q ∈ E(F_p), and the security of the internal state relies on the intractability of the discrete logarithm problem for these constants. Consequently, if one is able to find a scalar k such that P = kQ, they are able to compute the inner state of the generator efficiently based on its output. This naturally breaks the security of the generator. Despite the fact that arbitrary P, Q can be used for the generator, the NIST standard forces the use of fixed constants of unknown origin. Naturally, the NSA is alleged to have provided the backdoored constants. The practical exploitability of backdoored constants was shown in [17]. However, no one except the NSA can prove or disprove that the constants in the standard are backdoored. This aspect suggests that not many SETUPs are likely to appear in the wild; rather, delicate modifications of otherwise secure algorithms are to be expected, such that their sensitivity to efficient cryptanalysis can be viewed as a coincidence.

³ National Security Agency.
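The trapdoor mechanism just described can be demonstrated on a toy curve. The sketch below is our own illustration, not the standardized generator: it uses the 28-point curve y² = x³ + x + 1 over F₂₃ (far too small for any security) and a simplified, untruncated variant of the state update (next state x(s·P), output x(s·Q)). Anyone who knows d with P = dQ can lift one output back to a curve point and jump to the generator's next state.

```python
p, a, b = 23, 1, 1          # toy curve y^2 = x^3 + x + 1 over F_23 (28 points)

def ec_add(P1, P2):
    """Affine point addition; None denotes the point at infinity."""
    if P1 is None: return P2
    if P2 is None: return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p)
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, P1):
    """Double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P1)
        P1 = ec_add(P1, P1)
        k >>= 1
    return R

Q = (0, 1)             # innocuous-looking public constant
d = 5                  # the designer's trapdoor scalar -- never published
P = ec_mul(d, Q)       # second public constant, P = d*Q

def prng_step(s):
    """Simplified Dual-EC-style step: next state x(s*P), output x(s*Q)."""
    return ec_mul(s, P)[0], ec_mul(s, Q)[0]

def attacker_next_state(output):
    """Lift the output to a curve point and apply the trapdoor d."""
    y = pow((output**3 + a * output + b) % p, (p + 1) // 4, p)  # sqrt, p = 3 mod 4
    return ec_mul(d, (output, y))[0]    # x(d*s*Q) = x(s*P) = next state

# Generate three outputs from a secret initial state (chosen so that no
# intermediate state hits zero on this toy curve).
s, outputs = 3, []
for _ in range(3):
    s, r = prng_step(s)
    outputs.append(r)

# From the FIRST output alone, the attacker recovers the following state
# and predicts every later output exactly.
s_rec = attacker_next_state(outputs[0])
predicted = []
for _ in range(2):
    s_rec, r = prng_step(s_rec)
    predicted.append(r)
```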

3 Attack design

Our proposal is based on [9] by Gołębiewski et al. However, several drawbacks are eliminated by our construction, and the properties of the backdoor are treated more rigorously. The needed improvements w.r.t. the proposal [9] were:
– to achieve indistinguishability of the kleptogram from a random bit string,
– to ensure that reverse engineering of the infected device will not compromise the security of any session,
– to allow the attacker to recover the master secret even if she fails to eavesdrop on some sessions.

An additional goal was to minimize the computational overhead introduced by the backdoor to avoid possible detection by timing analysis.

3.1 Backdoor description

During the TLS handshake, assuming no pre-shared key is involved and a new session is to be established, all traffic keys are derived from a pre-master secret and publicly available values. Thus, for an attacker, it suffices to obtain the pre-master secret to decrypt the whole session. Additionally, the pre-master secret can be derived via the DH method⁴, or via the RSA method in the case of TLS 1.2 and lower. (The removal of the RSA key exchange method from TLS 1.3 makes it more difficult for the industry to debug or inspect encrypted connections – for example in datacenters or in intrusion detection systems. This led to two RFC drafts [10, 11] that would mitigate this issue. The former allows for an opt-in mechanism that lets a TLS client and server explicitly grant access to the TLS session plaintext. The latter relies on introducing a static DH key exchange method to TLS 1.3. Naturally, both drafts inherently weaken the TLS 1.3 protocol and did not become part of the final TLS 1.3 RFC. A question arises whether the stated motivation behind introducing such drafts was honest, as debugging and inspection of traffic is possible even with ephemeral DH, only at the cost of adjusting the infrastructure.)

In the case of the DH method, a shared secret is established between the server and the client. In the case of the RSA method, the server's certificate is required and used to encrypt random bytes generated on the client device. Those bytes then serve as the pre-master secret. We exploit the 32-byte random nonce sent by the client in the ClientHello message to derive a secret only the attacker can obtain. We further sanitize that secret and use it as a seed in the function that creates the client's contribution to the pre-master secret.

⁴ By DH method we refer to DH modulo a prime. Nowadays, Diffie-Hellman over elliptic curves (ECDH) is mostly used in TLS connections. We stick to the modular case, even though our method can be translated to elliptic curves easily.

We begin with the presentation of the original kleptographic construction by Gołębiewski et al. The authors suggest to hardcode a DH public key Y = g^X on an infected device. During the first handshake on the device, a random value k is selected and g^k is published as the ClientHello random nonce. During subsequent executions, the ClientHello random nonce is not subverted, but the PMS is derived deterministically from H(Y^k, i), where H denotes a hash function and i is a counter ensuring that the secrets differ across sessions. Notice that when an attacker fails to eavesdrop on the first handshake, she will not be able to recover any of the subsequent sessions. At the same time, if an inquirer manages to capture the first handshake and either k or X, she can decrypt all previous and subsequent sessions. Also, the value Y^k must be stored in non-volatile memory on the infected device and is prone to reverse engineering, thus violating condition 6 of SETUP. Last but not least, the published nonce g^k can be distinguished from a random bit string since it is an element of a group, see [6] – this violates condition 5 of SETUP. Our improvements aim to eliminate all of the presented drawbacks.

The exact design of our proposal is as follows. Prior to deployment, a designer generates a DH key pair on the X25519 curve, denoted Y = g^X. The public key Y is then hard-coded into the infected device, together with a 128-bit AES key, denoted K, and a counter. The initial value of the counter is 1 and it is incremented by 2 after each execution. Suppose that the infected device connects to the server and the handshake is initiated. When construction of the ClientHello message is triggered, the infected device generates 32 random bytes denoted k and computes the public key g^k. The value g^k is then encrypted with AES-CTR into C = E_K(g^k) and published as a kleptogram inside the ClientHello random nonce. Meanwhile, the value S = Y^k is derived on the device as a shared secret between the attacker and the device. When the attacker eavesdrops the value C, she is able to derive g^k = D_K(C) and then S = g^(kX) using

Algorithm 1: Generate kleptogram and seed
Input: A public key Y, AES-CTR key K with counter
Output: The kleptogram C and seed S
  k ← 32 random bytes
  C ← E_K(g^k)
  S ← Y^k
  delete the value k securely
  return (C, S)

Algorithm 2: Generate pre-master secret
Input: Key exchange method, DH parameters and public key of the server if needed, seed S
Output: Pre-master secret PMS
  if key exchange method is RSA then
      PMS ← PRF(S, 46 bytes)
  else
      l ← length of the DH prime p in bits
      Z ← DH public key of the server
      x ← S
      do
          x ← PRF(x, l bits)
      while x = 0 or x = 1 or x ≥ p
      PMS ← Z^x
  end
  return PMS

her private key X. After obtaining S, the attacker can replicate the computation of the infected device. When the pre-master secret is to be derived, we differentiate two cases:

1. If the RSA method is used, the value S is stretched to 46 bytes by the TLS 1.2 pseudorandom function (PRF) and sent as the pre-master secret.

2. If the DH method is used, the server first sends the DH parameters to the client, including the prime p. The value S is then stretched by the TLS 1.2 PRF to a string of the same length as the prime p. This bit string is checked to fulfill the requirements for a DH private key (not being 0, 1, or ≥ p) and is used as the client's private key. If the requirements are not met, the output of the PRF is repeatedly fed back into the PRF until a proper key is generated.

Once the pre-master secret is generated, the handshake continues ordinarily.

The backdoor is described by Algorithms 1 and 2. Algorithm 1 generates the ClientHello random nonce and the seed S. The latter is further processed by Algorithm 2 to derive the pre-master secret.
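Algorithms 1 and 2 can be condensed into a short, self-contained sketch. The code below is our own illustration only: it substitutes DH modulo a toy (insecure) prime for X25519 and an HMAC-SHA256 keystream for both AES-CTR and the TLS PRF; all names are hypothetical, not OpenSSL's.

```python
import hashlib
import hmac
import secrets

# Toy group parameters -- a real deployment would use X25519 (Section 3.1).
P_DH = (1 << 61) - 1    # Mersenne prime; INSECURE, illustration only
G = 3

def stream(key, counter, n):
    """HMAC-SHA256 keystream; stands in for AES-CTR and for the TLS PRF."""
    out, block = b"", 0
    while len(out) < n:
        out += hmac.new(key, counter.to_bytes(8, "big") +
                        block.to_bytes(4, "big"), hashlib.sha256).digest()
        block += 1
    return out[:n]

def xor(x, y):
    return bytes(u ^ v for u, v in zip(x, y))

# Hard-coded into the infected device: attacker's public key Y and key K.
X = secrets.randbelow(P_DH - 2) + 1   # attacker's private key (kept offline)
Y = pow(G, X, P_DH)
K = secrets.token_bytes(16)

def device_client_hello(counter):
    """Algorithm 1: emit the kleptogram C, keep the seed S, discard k."""
    k = secrets.randbelow(P_DH - 2) + 1
    C = xor(pow(G, k, P_DH).to_bytes(32, "big"), stream(K, counter, 32))
    S = pow(Y, k, P_DH)               # shared secret with the attacker
    return C, S

def rsa_premaster(S):
    """Algorithm 2, RSA branch: stretch the seed to 46 pre-master bytes."""
    return stream(S.to_bytes(32, "big"), 0, 46)

# Attacker side: recover S (hence the pre-master secret) from C alone.
def attacker_seed(C, counter):
    gk = int.from_bytes(xor(C, stream(K, counter, 32)), "big")
    return pow(gk, X, P_DH)

C, S_device = device_client_hello(counter=1)
S_attacker = attacker_seed(C, counter=1)
```

The DH branch of Algorithm 2 (iterating the PRF until a valid private key appears) is omitted for brevity; it would reuse `stream` in the same way.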

The paper [14] proves that the counter mode (CTR) is polynomially indistinguishable from a random bit string under the assumption that the underlying cipher is a pseudorandom function (PRF). This holds when the value of the counter never repeats. We have selected AES in CTR mode with a 128-bit key to achieve indistinguishability. Some properties of this selection must be further discussed. First, NIST recommends [2] limiting the number of calls to a pseudorandom number generator (PRNG) keyed with a hard-coded value to 2^48 blocks. Since AES-CTR is essentially a PRNG, this recommendation should be respected. Consider that birthday collisions are likely to appear only after 2^64 bits of output, so they are trivially treated by the NIST recommendation. Since the backdoor uses two blocks of AES-CTR per handshake, this limits the functioning of the backdoor to 2^47 handshakes. It is emphasized that once the backdoor is reverse engineered and the symmetric key is obtained, the indistinguishability is broken.

3.2 Properties of SETUP proposal

Theorem 1. Under the following assumptions, our proposal is a SETUP:
– A random oracle is used to generate the values k and to sanitize the values C instead of a TLS PRF.
– AES is a random permutation.
– The computational DH assumption holds.

Proof. The reader can easily verify that properties 1–3 of SETUP hold for our backdoor. To prove property 4, we show how the attacker can obtain the seed S by eavesdropping on the handshake traffic. Notice that once the seed S is known, anyone can replicate the computation of the device that leads to the pre-master secret. When the attacker obtains the ClientHello nonce, she can decrypt it with the AES key K, obtaining the public key of the device g^k. The attacker can further utilise her private key X to compute the shared value S = g^(kX) = Y^k. Consider that when the DH key exchange method is used, the attacker must also eavesdrop the public key of the server and the parameters of the exchange; but those are sent in plaintext. To conclude, property 4 holds as well. We proceed with property 5. If AES is a random permutation, then AES-CTR produces output indistinguishable from a uniformly distributed bit string, as shown in [14]. Thus, one cannot distinguish between the kleptogram C and random bit strings (unless collisions occur, which was discussed earlier). Further, the resulting pre-master secrets (both for the RSA and the DH method) are uniformly distributed too, since we utilize the random oracle to sanitize C. Recall that the non-volatile memory of the infected device contains the AES key K with the counter and the public key Y. Only the key Y is relevant to the confidentiality of the shared secret S. Notice that obtaining the shared secret g^(kX) from g^k and g^X is equivalent to solving the ECDH problem. We therefore conclude that property 6 holds. ⊓⊔

To summarize, properties 1, 2, 3, 4 hold unconditionally for our proposal. Property 6 requires the computational DH assumption. Moreover, property 5 requires a random oracle and AES being a random permutation. This seems sufficient, as for practical deployment, speed is more pressing than provable indistinguishability.
Regarding the perfect forward secrecy, an inquirer is not able to recover past session secrets if she obtains the private key of the device, k. However, after obtaining the key X, the inquirer can break all past (and future) sessions.

4 Attack implementation

We have implemented the asymmetric backdoor into the OpenSSL library, version 1.1.1-pre2. Choosing this library allowed us to reveal whether the backdoor can pose as regular malware, without the requirements of a black-box environment. When one designs malware in the black-box setting, she is allowed to change both the implementation and the header files of the infected library. On the contrary, only the compiled binaries are infected in the case of regular malware. OpenSSL does not expose many low-level cryptographic functions to the high-level functions that are used in the TLS handshake. Our work shows that the proposal can be embedded into the compiled binary, leaving the header files untouched.

The pre-release version of the library was chosen because it provides certain functions for computations on the X25519 curve that are not available in previous releases. X25519 is implemented in a different way than other elliptic curves in OpenSSL, and older releases did not allow for the creation of specific keys on this curve; only random keys could be created. We decided to expose some low-level functions for direct use to achieve a simpler implementation. This resulted in a modification of the header files. Nevertheless, high-level interfaces could be used instead and the backdoor could be deployed as a compiled library. We faced no serious obstacles that would prevent the installation of the backdoor into the library.

4.1 Attack detection

We have also studied possible detection of the infected library via side channels. As we worked in the desktop environment, we limit our attention to the timing channel. Nonetheless, a power side channel could be a viable detection mechanism on different platforms. Several code snippets of both the infected and the clean version of OpenSSL were isolated, and their performance was evaluated and compared. This creates a possible detection mechanism for the backdoor, yet with certain limitations. As expected, the backdoor performs slower than the clean version. Nevertheless, this does not necessarily create a distinguisher. Suppose that all devices of a certain kind are infected. Then there exists no reference for how the uninfected version should perform. The inquirer must therefore somehow guess the expected performance and measure deviations based on this estimate. Also, the library could be used on various hardware and outperform clean versions when running on faster hardware.

Three code snippets were measured for the infected algorithm. In particular, those were the RSA pre-master secret generation snippet, the DH private key generation snippet, and the ClientHello nonce generation snippet. The last snippet was measured in two versions. The first version contained only one exponentiation (computation of the public key presented as the kleptogram). The second version was expanded by the shared-secret derivation between the attacker and the device. As the backdoor does not require any initialization except for loading the AES counter (other keys can be hard-coded in the binary), this aspect was ignored in the experiments.

Code snippet                       Average computation time in µs
Timer overhead                     0.312
g^k on X25519                      171.132
ClientHello clean                  4.726
ClientHello subverted              176.293
ClientHello subverted a)           246.843
RSA PMS clean                      4.887
RSA PMS subverted                  11.185
DH private key gen. clean          3178.587
DH private key gen. subverted      3218.496
TLS context builder                374.5

Table 1. Average execution times of code snippets. a) Version with the shared-secret precomputation.

4.2 Average execution times

Our measurements show that the whole subverted version runs 0.248 ms (0.282 ms) slower than the clean version when the RSA (DH) key exchange method is used. The subverted RSA key exchange method runs slower by a factor of 2.28 compared to the clean version. In contrast, the subverted DH key exchange method runs slower only by a factor of 1.01. This is because in the case of RSA, the newly introduced computations take relatively more time compared to the overall cost of the key exchange method. It can also be seen that such an increase in time cannot be spotted just by using the device. The interaction over the network creates an opportunity for obfuscating the computation times. The exponentiations, or even parts of them, could be precomputed once the handshake is initiated, stored, and only loaded from memory when needed. The more complex the underlying protocol is, the larger is the space for obfuscation. It is also questionable whether the inquirer would be able to isolate the corresponding snippets on a tamper-proof device to obtain precise measurements. To conclude, the timing-channel method is not reliable and most likely could be evaded by a skilled adversary. Nevertheless, the proposal can be detected when a clean version of OpenSSL is at hand and benchmarking is available on the same hardware on which the suspected version is running. As the highest increase in time is seen when the ClientHello nonce is generated, this could be the sweet spot for malware detection. Recall that the execution time of the ClientHello nonce function should correspond to the time in which the library generates 32 random bytes. If the function takes a substantially larger amount of time, the backdoor is likely to be present. We do not release the source code because it could easily be misused as malware.
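The timing distinguisher described above can be sketched with a simple benchmark harness. The snippet is our own illustration, not the paper's measurement code: the "subverted" nonce generation is modeled by adding one large modular exponentiation to the 32 random bytes, mirroring the extra public-key operation the backdoor introduces (all parameters are arbitrary).

```python
import os
import statistics
import time

def median_runtime(fn, runs=30):
    """Median wall-clock time of fn over several runs, in seconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def clean_nonce():
    """What ClientHello nonce generation should cost: 32 random bytes."""
    return os.urandom(32)

# Model of the subverted path: one extra ~2048-bit modular exponentiation
# (stands in for computing g^k and Y^k on the device; values are arbitrary).
MOD = (1 << 2048) - 1942289
EXP = (1 << 2047) + 9

def subverted_nonce():
    pow(3, EXP, MOD)
    return os.urandom(32)

t_clean = median_runtime(clean_nonce)
t_subverted = median_runtime(subverted_nonce)
ratio = t_subverted / t_clean   # a large ratio suggests a backdoor is present
```

As the section argues, this only works as a distinguisher when a trusted baseline (`t_clean` on the same hardware) is available.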

5 Related work

Our work is based on the concept of [9]. In contrast to the original backdoor, our solution fulfils conditions 5 and 6 of the SETUP mechanism. The paper [22] by Young and Yung presents how to generate a shared secret with ECDH that is polynomially indistinguishable from a random bit string. Furthermore, they also mention its applicability to the TLS protocol and provide the first asymmetric backdoor for TLS. However, the proposal is rather impractical, as more than 300 ECDH key exchanges are required to execute a single backdoored handshake. Injective mappings of strings to elliptic curve points [1,6] could have interesting applications for kleptography, as their inversion could map ECDH keys to strings that are polynomially indistinguishable from random strings. Regarding the detection of backdoors via side channels, the paper [12] presents a method that studies variance in execution times of functions, which might reveal exponentiations newly introduced into the protocol – a common facet of kleptography.

In recent years, major advances came in the field of defenses against kleptographic adversaries. Most of them were published in [16]. The work achieves a general technique for preserving semantic security of a cryptosystem put into the kleptographic setting. Also, the paper classifies the already proposed defenses into three categories:
– abandoning randomness in favour of deterministic computation [3–5],
– use of a trusted module that can re-randomize subverted primitives [7,13],
– hashing the subverted randomness [15].

6 Conclusions

TLS is an essential protocol for securing data on the transport layer. As such, TLS is omnipresent in the era of computer networks, having applications in HTTPS, VPNs, payment gateways and many others. The widespread use of TLS motivated us to study its vulnerability in the kleptographic setting. We aimed to answer whether a kleptographic backdoor can be practically implemented into TLS libraries.

Our efforts resulted in a design of an asymmetric backdoor for all versions of the TLS protocol. Such a backdoor can be used to exfiltrate session keys from a captured handshake by a passive eavesdropper, leading to a denial of confidentiality and authenticity of the whole session. We also demonstrated that it is fairly simple to implement the backdoor into an open-source TLS library while maintaining a reasonable performance of the library. We stress that to install our backdoor, an adversary must have access to the target device. In such cases, other dangerous scenarios arise – we mention ransomware as an example. However, the important property of our backdoor is that it may stay unnoticed for a long time on the target device. Also, it may be embedded into particular hardware by its manufacturer or by organizations with sufficient resources. We also showed that timing analysis may prove to be an effective defense, depending on the powers of an inquirer.

For future work, we suggest studying whether an effective defense could be derived for TLS at the protocol level, for instance, in the form of a protocol extension. Regarding offensive techniques, if the mappings [1,6] could be combined with a cryptographic key, they would allow for ECDH secrets indistinguishable from random noise.

References

1. Aranha, D.F., Fouque, P.A., Qian, C., Tibouchi, M., Zapalowicz, J.C.: Binary elligator squared. In: Selected Areas in Cryptography – SAC 2014. LNCS, vol. 8781, pp. 20–37. Springer, Cham (2014)
2. Barker, E.B., Kelsey, J.M.: Recommendation for Random Number Generation Using Deterministic Random Bit Generators. Tech. rep. (2015)
3. Bellare, M., Hoang, V.T.: Resisting randomness subversion: Fast deterministic and hedged public-key encryption in the standard model. In: Advances in Cryptology – EUROCRYPT 2015. LNCS, vol. 9056, pp. 627–656. Springer, Berlin, Heidelberg (2015)
4. Bellare, M., Jaeger, J., Kane, D.: Mass-surveillance without the state: Strongly undetectable algorithm-substitution attacks. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security – CCS '15. pp. 1431–1440. ACM, New York (2015)
5. Bellare, M., Paterson, K.G., Rogaway, P.: Security of symmetric encryption against mass surveillance. In: Advances in Cryptology – CRYPTO 2014. LNCS, vol. 8616, pp. 1–19. Springer-Verlag, Berlin, Heidelberg (2014)
6. Bernstein, D.J., Hamburg, M., Krasnova, A., Lange, T.: Elligator: Elliptic-curve points indistinguishable from uniform random strings. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security – CCS '13. pp. 967–980. ACM, New York (2013)
7. Dodis, Y., Mironov, I., Stephens-Davidowitz, N.: Message transmission with reverse firewalls – secure communication on corrupted machines. In: Advances in Cryptology – CRYPTO 2016. LNCS, vol. 9814, pp. 341–372. Springer-Verlag, Berlin, Heidelberg (2016)
8. Gjøsteen, K.: Comments on Dual-EC-DRBG/NIST SP 800-90, draft December 2005. Tech. rep.
9. Gołębiewski, Z., Kutyłowski, M., Zagórski, F.: Stealing secrets with SSL/TLS and SSH – kleptographic attacks. In: Cryptology and Network Security – CANS '06. LNCS, vol. 4301, pp. 191–202. Springer-Verlag, Berlin, Heidelberg (2006)
10. Green, M., Droms, R., Housley, R., Turner, P., Fenter, S.: Data Center use of Static Diffie-Hellman in TLS 1.3. RFC Draft (2017), https://tools.ietf.org/html/draft-green-tls-static-dh-in-tls13-01
11. Housley, R., Droms, R.: TLS 1.3 Option for Negotiation of Visibility in the Datacenter. RFC Draft (2018), https://tools.ietf.org/html/draft-rhrd-tls-tls13-visibility-01
12. Kucner, D., Kutyłowski, M.: Stochastic kleptography detection. In: Public-Key Cryptography and Computational Number Theory. pp. 137–149. De Gruyter (2001)
13. Mironov, I., Stephens-Davidowitz, N.: Cryptographic reverse firewalls. In: Advances in Cryptology – EUROCRYPT 2015. LNCS, vol. 9056, pp. 657–686. Springer, Berlin, Heidelberg (2015)
14. Rogaway, P.: Evaluation of some blockcipher modes of operation. Tech. rep., Cryptography Research and Evaluation Committees (CRYPTREC) for the Government of Japan (2011)
15. Russell, A., Tang, Q., Yung, M., Zhou, H.S.: Cliptography: Clipping the power of kleptographic attacks. In: Advances in Cryptology – ASIACRYPT 2016. LNCS, vol. 10032, pp. 34–64. Springer-Verlag, Berlin, Heidelberg (2016)
16. Russell, A., Tang, Q., Yung, M., Zhou, H.S.: Generic semantic security against a kleptographic adversary. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security – CCS '17. pp. 907–922. ACM, New York (2017)
17. Checkoway, S., et al.: On the practical exploitability of Dual EC in TLS implementations. In: Proceedings of the 23rd USENIX Security Symposium – SEC '14. pp. 319–335 (2014)
18. Young, A., Yung, M.: The dark side of "black-box" cryptography or: Should we trust Capstone? In: Advances in Cryptology – CRYPTO '96. LNCS, vol. 1109, pp. 89–103. Springer-Verlag, Berlin, Heidelberg (1996)
19. Young, A., Yung, M.: Kleptography: Using cryptography against cryptography. In: Advances in Cryptology – EUROCRYPT '97. LNCS, vol. 1233, pp. 62–74. Springer, Berlin, Heidelberg (1997)
20. Young, A., Yung, M.: Malicious Cryptography: Exposing Cryptovirology. Wiley, Hoboken, NJ (2004)
21. Young, A., Yung, M.: Space-efficient kleptography without random oracles. In: Information Hiding: 9th International Workshop, IH 2007. LNCS, vol. 4567, pp. 112–129. Springer-Verlag, Berlin, Heidelberg (2007)
22. Young, A., Yung, M.: Kleptography from standard assumptions and applications. In: Security and Cryptography for Networks. LNCS, vol. 9841, pp. 271–290. Springer International Publishing (2010)