Python Crypto Misuses in the Wild
Anna-Katharina Wickert, Technische Universität Darmstadt, Darmstadt, Germany
Lars Baumgärtner, Technische Universität Darmstadt, Darmstadt, Germany
Florian Breitfelder, Technische Universität Darmstadt, Darmstadt, Germany
Mira Mezini, Technische Universität Darmstadt, Darmstadt, Germany

arXiv:2109.01109v1 [cs.SE] 2 Sep 2021

ABSTRACT
Background: Previous studies have shown that up to 99.59 % of the Java apps using crypto APIs misuse the API at least once. However, these studies have been conducted on Java and C, while empirical studies for other languages are missing. For example, a controlled user study with crypto tasks in Python has shown that 68.5 % of the professional developers write a secure solution for a crypto task. Aims: To understand if this observation holds for real-world code, we conducted a study of crypto misuses in Python. Method: We developed a static analysis tool that covers common misuses of 5 different Python crypto APIs. With this analysis, we analyzed 895 popular Python projects from GitHub and 51 MicroPython projects for embedded devices. Further, we compared our results with the findings of previous studies. Results: Our analysis reveals that 52.26 % of the Python projects have at least one misuse. Further, the API design of some Python crypto libraries keeps developers from misusing crypto functions in ways that were much more common in studies conducted with Java and C code. Conclusion: We conclude that we can see a positive impact of the good API design on crypto misuses for Python applications. Further, our analysis of MicroPython projects reveals the importance of hybrid analyses.

CCS CONCEPTS
• Software and its engineering → Software libraries and repositories.

KEYWORDS
Crypto API, Security, Static Analysis.

ACM Reference Format:
Anna-Katharina Wickert, Lars Baumgärtner, Florian Breitfelder, and Mira Mezini. 2021. Python Crypto Misuses in the Wild. In ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (ESEM ’21), October 11–15, 2021, Bari, Italy. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3475716.3484195

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2021 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-8665-4/21/10.

1 INTRODUCTION
Cryptography, hereafter crypto, is widely used nowadays to protect our data and ensure confidentiality. For example, without crypto, we would not be able to securely use online banking or do online shopping. Unfortunately, previous research results show that crypto is often used in an insecure way [3, 4, 7, 9, 11]. One such problem is the choice of an insecure parameter, like an insecure block mode, for crypto primitives like encryption. Many static analysis tools exist to identify these misuses, such as CryptoREX [13], CryptoLint [4], CogniCryptSAST [8], and Cryptoguard [12].

While these tools and the respective in-the-wild studies concentrate on Java and C, user studies suggest that the existing Python APIs reduce the number of crypto misuses. Acar et al. [2] conducted an experiment with 307 GitHub users who had to solve 3 crypto-related development tasks. They observed that 68.5 % of the professional developers wrote a secure solution in Python for the given task. Within a controlled experiment with 256 Python developers who tried to solve simple crypto tasks, Acar et al. [1] identified that a simple API design, like that of the Python library cryptography, supports developers in writing secure code. However, no empirical in-the-wild study has yet confirmed that crypto misuses in Python occur less frequently than in Java or C.

To empirically evaluate crypto misuses in Python, we introduce LICMA, a multi-language analysis framework with support for 5 different Python crypto APIs and Java’s JCA API. We provide 5 different rules [4] for all Python APIs and 6 different rules [4] for JCA to detect the most common crypto misuses. With LICMA, we analyzed 895 popular Python apps from GitHub and 51 MicroPython projects to gain insights into misuses in Python. We identified that 52.26 % of the Python GitHub apps with crypto usages have at least one misuse, causing 1,501 misuses. In total, only 7 % of the misuses are within the application code itself, while the remaining misuses are introduced by dependencies. Further, our study of MicroPython projects reveals that developers in the embedded domain tend to use crypto via C code. This reveals the importance of hybrid static analyses, which can track program information, e.g., a call graph, across multiple languages [5, 10].

To further improve our understanding of whether Python APIs are less prone to crypto misuses, we make the following contributions:
• A novel, multi-language analysis tool to detect crypto misuses in Python and Java. For Python we cover crypto misuses for 5 common Python crypto APIs and for Java the standard API JCA.
• An empirical study of crypto misuses in the 895 most popular Python applications on GitHub revealing 1,501 misuses.
• A comparison of our findings in Python applications with previous studies about crypto misuses in-the-wild for Android apps and firmware images in C. We observed that most Python applications are more secure and that the distribution between the concrete types of misuses differs a lot.
• An empirical study of crypto misuses in MicroPython projects which reveals the importance of hybrid static analyses.
• A replication package including both data sets used for our study, the results of our analysis, and the code of LICMA¹.

2 BACKGROUND
Crypto libraries enable developers to use crypto primitives, like symmetric key encryption or password-based encryption (PBE), in their application. However, studies have shown that developers struggle to use those APIs correctly and securely. For example, Krüger et al. [8] show that 95 % of the Android applications using crypto have at least one crypto misuse. In the remainder of the paper, we define a crypto misuse, in short misuse, as a usage of a crypto library which is a formally correct use of the API, e.g., not causing a crash, but is unsafe. In this paper, we focus on six commonly discussed crypto misuses [4, 13] which we will introduce next. All rules including Python example code can be seen in Table 1.

Table 1: Six commonly discussed crypto misuses in Java and C [4, 13] with an example of a violation in Python.

| ID | Rule | Python: Violation Example |
|----|------|---------------------------|
| §1 | Do not use electronic code book (ECB) mode for encryption. | aes = AES.new(key, AES.MODE_ECB) |
| §2 | Do not use a non-random initialization vector (IV) for cipher block chaining (CBC) encryption. | aes = AES.new(key, AES.MODE_CBC, b'\0' * 16) |
| §3 | Do not use constant encryption keys. | aes = AES.new(b'\0' * 32, AES.MODE_CBC, iv) |
| §4 | Do not use constant salts for password-based encryption (PBE). | kdf = PBKDF2HMAC(hashes.SHA256(), 32, b'\0' * 32, 10000) |
| §5 | Do not use fewer than 1,000 iterations for PBE. | kdf = PBKDF2HMAC(hashes.SHA256(), 32, salt, 1) |
| §6 | Do not use static seeds to initialize a secure random generator. | Due to API design only possible in Java [4] and C/C++ [13] |

For AES encryption, a block mode is used to combine several blocks for encryption and decryption. However, some block modes are considered insecure by experts. Rule 1 (§1) and rule 2 (§2) cover two insecure usages of block modes. §1 prohibits the usage of ECB, which is deterministic and not secure. §2 focuses on a non-random initialization vector (IV) for the block mode CBC.

A seed can ensure that the output of a random number generator is non-deterministic. Rule 6 (§6) covers that static seeds lead to deterministic outputs. Common Java [4] and C/C++ [13] libraries expect a seed, while the analyzed Python APIs avoid this by design.

3 DESIGN AND IMPLEMENTATION OF LICMA
In this section, we describe the design of our static analysis tool LICMA, and discuss the implementation in more detail.

3.1 Design
A general overview of LICMA is given in Figure 1. First, we parse a source code file into the respective Abstract Syntax Tree (AST). More specifically, we use Babelfish² to create a Universal Abstract Syntax Tree (UAST) which combines language-independent AST elements with language-specific elements. For simplicity, we use the term AST in the remainder of the paper. Second, we apply the LICMA analysis upon the AST to identify potential misuses of our crypto rules.

Based upon the rule defining a misuse, the analysis checks for a violation of the rule within the source code and triggers a backward analysis for this task. The backward slice is created by filtering the AST with the help of XPath³ queries, and works as follows: First, the backward slicing algorithm (BSA) identifies all source code lines that are referred to within the respective rule. An example of the slicing criterion is a function call parameter like the key for a crypto function. Second, the BSA determines for all function calls if the parameter is either hard-coded, a local assignment, or a global assignment. If one of the three cases is fulfilled, the corresponding value is returned. This value is checked against a function defined