Poster: Towards Using Source Code Repositories to Identify Software Supply Chain Attacks
Total Page:16
File Type:pdf, Size:1020Kb
Poster: Towards Using Source Code Repositories to Identify Software Supply Chain Attacks Duc-Ly Vu Ivan Pashchenko Fabio Massacci [email protected] [email protected] [email protected] University of Trento, IT University of Trento, IT University of Trento, IT Vrije Universiteit Amsterdam, NL Henrik Plate Antonino Sabetta [email protected] [email protected] SAP Security Research, FR SAP Security Research, FR ABSTRACT 1 INTRODUCTION Increasing popularity of third-party package repositories, like Software supply chain attacks occur when an attacker hijacks NPM, PyPI, or RubyGems, makes them an attractive target the complex software development chain to insert malicious for software supply chain attacks. By injecting malicious code code [3]. Among the attack vectors, attackers primarily target into legitimate packages, attackers were known to gain more language based ecosystems such as NPM for Javascript or than 100 000 downloads of compromised packages. Current PyPI for Python by either stealing package owner credentials approaches for identifying malicious payloads are resource to subvert the package releases (e.g., attack on ssh-decorate demanding. Therefore, they might not be applicable for the package1) or publishing their own packages with names simi- on-the-fly detection of suspicious artifacts being uploaded to lar to popular ones (i.e., combosquatting and typosquatting the package repository. In this respect, we propose to use attacks on PyPI packages [11]). The examples of the supply source code repositories (e.g., those in Github) for detecting chain attacks were discovered several years ago. Since then, injections into the distributed artifacts of a package. Our pre- the number of cases has been steadily increasing [3, Figure 1]. liminary evaluation demonstrates that the proposed approach Given the popularity of packages coming from language captures known attacks when malicious code was injected based ecosystems, the number of victims affected by software into PyPI packages. The analysis of the 2666 software ar- supply chain attacks is significant. For example, in July 2019, tifacts (from all versions of the top ten most downloaded three malicious Ruby packages discovered by MalOSS [1] Python packages in PyPI) suggests that the technique is exceeded 100 000 downloads and were assigned the official suitable for lightweight analysis of real-world packages. CVEs. In February 2020, two accounts uploaded over 700 ty- posquatting malicious packages to the RubyGems repository KEYWORDS and achieved more than 100 000 overall downloads [6]. The situation becomes even more dramatic as the ongo- Software Supply Chain Attacks; Lightweight Analysis; Code ing and successful attacks have passed unnoticed: 20% of injection; package repositories; source code repositories the malicious packages persisted in the NPM, PyPI, and RubyGems ecosystems for over 400 days [1]. This problem ACM Reference Format: likely happens due to the lack of an efficient mechanism Duc-Ly Vu, Ivan Pashchenko, Fabio Massacci, Henrik Plate, and An- tonino Sabetta. 2020. Poster: Towards Using Source Code Repos- for checking malicious code injections in FOSS packages up- itories to Identify Software Supply Chain Attacks. In Proceed- loaded to the package repositories at a high pace (400 and 100 ings of the 2020 ACM SIGSAC Conference on Computer and new packages are uploaded to npm and PyPI, respectively Communications Security (CCS ’20), November 9–13, 2020, Vir- every day [10]). Current malware detection techniques in tual Event, USA. ACM, New York, NY, USA, 3 pages. https: language based ecosystems, on the other hand, are resource //doi.org/10.1145/3372297.3420015 demanding [1], require prior knowledge of previously benign releases [8], or unable to process packages that have a limited number of published releases [2]. Permission to make digital or hard copies of all or part of this work Our approach is motivated by an intuition behind the 2 for personal or classroom use is granted without fee provided that reproducible builds : it is suspicious if the code in the source copies are not made or distributed for profit or commercial advantage code repository differs from the code in the artifacts distributed and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than in the package repository. In this respect, we propose an ACM must be honored. Abstracting with credit is permitted. To copy approach to detect code injected into software packages by otherwise, or republish, to post on servers or to redistribute to lists, comparing their distributed artifacts (e.g., those in PyPI) requires prior specific permission and/or a fee. Request permissions from [email protected]. with the source code repository (e.g., those in Github). The CCS ’20, November 9–13, 2020, Virtual Event, USA 1 © 2020 Association for Computing Machinery. https://www.bleepingcomputer.com/news/security/backdoored- ACM ISBN 978-1-4503-7089-9/20/11. $15.00 python-library-caught-stealing-ssh-credentials/ https://doi.org/10.1145/3372297.3420015 2https://reproducible-builds.org CCS ’20, November 9–13, 2020, Virtual Event, USA Duc-Ly Vu, Ivan Pashchenko, Fabio Massacci, Henrik Plate, and Antonino Sabetta proposed approach can be used to detect injected code in Ohm et al. [8] proposed Buildwatch, a framework for dy- typosquatting and hijacked packages. namic analysis of software and its third-party dependencies. To illustrate our approach, consider a typosquatting Python The authors observed a high number of activities related to package jeIlyfish discovered by Lutoma [5], which was files (e.g., files written operations) in malicous verisons com- persistent in PyPI for nearly a year until its detection on pared to the benign versions that were previously released December 1, 2019. jeIlyfish mimicked the popular package of the analyzed packages. Although the approach provides jellyfish3 (the first L is an I) to steal SSH and GPG keys4. insight into malware behaviors, it requires knowing of previ- Our technique processes the suspected jeIlyfish artifacts to ously published benign releases that are difficult to obtain [1]. identify the corresponding source code repository5. Then we In summary, current approaches are limited to checking compare the file hashes and contents extracted from the arti- package names for providing early warnings for squatting facts with those obtained from the source code repository. Our attacks and are prone to false alerts. On the other hand, the tool detects two injected files: setup.py, and _jellyfish.py malicious code detection techniques are resource demanding and reports several lines in the latter to contain API calls for and might not be approach might not be suitable for real-time decoding and executing the malicious code. processing of packages uploaded to the package repositories. Overall, our tool requires 12 seconds (on a laptop with 4 CPU cores and 8 GB RAM) for processing the source code 3 IDENTIFICATION OF CODE repository of jellyfish consisting of 364 commits, 530 indi- INJECTIONS vidual files, and 28104 unique lines of code. The tool takes Our approach compares distributed artifacts in package repos- only 0.04 seconds for scanning the suspected artifact. Hence, itories (e.g., PyPI) and the source code repository (e.g., our approach is fast to be integrated into the package distri- Github) to detect the injected code by the following steps: bution pipeline for detecting suspicious injections introduced (1) For each package, identify the source code repository by packages being uploaded to the package repositories. by mining metadata properties (e.g., homepage) (2) Clone the repository and extract all the commits in the 2 BACKGROUND master branch. For each commit, check out involved Several studies (e.g., [10, 11]) based on the edit distance (e.g., file, calculate the file hash, and collect the file content. Levenshtein distance) between package names to flag suspi- The file hashes and contents are stored into a database cious packages. Although the approaches are fast, manual (3) Download each artifact of the package from the package analysis of the several generated candidates confirmed that repository, decompress it into files. For each file, we the packages are benign [11] and exist due to coincidence, calculate the hash and collect the file content. because they have similar functionality or one package was (4) Compare the file hashes and contents from step (3) derived from the other (e.g., a fork) for legitimate reasons [10]. with those extracted from step (4). This comparison Taylor et al. [9] used six package name patterns (e.g., results in files (and their lines) whose hashes arenot repeated characters) and the number of package downloads to recorded (differ from) in the source code repository identify typosquatting candidates of a given package in PyPI. (5) For the unknown lines, check the presence of API calls Although this approach provides early warnings about a (e.g., urlopen) and imports (e.g., import os) using regex potential typosquatting package, the lack of essential features During the packaging process, the packaging tools (e.g., (e.g., source code similarity of suspected packages) might setuptools6 in Python) create (benign) metadata files (e.g., cause the approach to generate a high number of false alerts. METADATA, WHEEL7). Hence, we ignore such files and focus on Duan et al. [1] proposed MalOSS, an extensible framework the differences in code files (e.g., .py, .js, .rb). for detecting malicious packages in language based ecosys- tems. The framework combines metadata, static, and dynamic 4 PRELIMINARY FINDINGS analysis to extract various features of packages. Although We evaluate the proposed approach on the two datasets of MalOSS can effectively detect malicious packages, the pro- known malicious and top ten most downloaded packages. posed system is computationally expensive, and only process packages of a specific environment (e.g., Linux Ubuntu). Known malicious examples. We used the malware dataset Garrett et al.