
Masaryk University Faculty of Informatics

Attacks on Package Managers

Bachelor’s Thesis

Martin Čarnogurský

Brno, Spring 2019


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Martin Čarnogurský

Advisor: Mgr. Vít Bukač, Ph.D.


Acknowledgements

I would like to thank Mgr. Vít Bukač, Ph.D., RNDr. Václav Lorenc, and Mgr. Patrik Hudák for their continuous support over the years. I would not have been able to go so far, both in my education and professionally, without them, and for this I forever owe them my gratitude.

Abstract

The primary focus of this thesis is to analyse the current state of various package managers regarding security mechanisms related to selected malicious attacks, such as typosquatting or the distribution of malicious packages. The analysis also provides insights into the differences between the security mechanisms used in OS-level managers, smartphone application marketplaces, and the primary focus of this thesis: community repositories of libraries used by developers. We then propose several monitoring mechanisms as a proof of concept to detect malicious intent, ongoing attacks or as-yet-unknown vulnerabilities. The implemented system computes a risk score using heuristics that are language independent where possible; it is evaluated against real data from the Python Package Index.

Keywords: typosquatting, attack, malware


Contents

Introduction

1 Overview of the ecosystem
  1.1 Package managers
    1.1.1 Debian package manager
  1.2 Package managers for developers
    1.2.1 The Python Package Index

2 Anatomy of a Python package
  2.1 The Setup Script
  2.2 Package Installer for Python (PIP)
  2.3 Source distributions
  2.4 Binary package format
  2.5 The Wheel Binary Package Format

3 Previous Incidents

4 Attack vectors and threat models
  4.1 Source code modifications
  4.2 Typosquatting
  4.3 Bait packages

5 Analyzing packages on a global scale
  5.1 Existing tools and frameworks
  5.2 Static Analysis
    5.2.1 Abstract Syntax Tree
    5.2.2 Tree transformation and analysis
  5.3 Aura framework
    5.3.1 apip
  5.4 Global PyPI scan findings

6 Conclusions and Future Work
  6.1 Future work
  6.2 Conclusions

Glossary

A Appendix
  A.1 Live analysis
    A.1.1 Comparison of static analysis vs. live analysis approaches
  A.2 setup.py from the talib package

B ssh-decorate incident evidence

C Built-in Aura analyzers
  C.1 Produced hits

List of Figures

4.1 An example of a typosquatting package on PyPI when searching for a package scikit
4.2 Screenshot of a package that is already included in Python 3.3 but available for download on PyPI
5.1 Detections found during the latest scan
5.2 Screenshot of a typosquatting package
B.1 A screenshot of the opened GitHub issue by user mowshon after he found the malicious code
B.2 A screenshot of the malicious code


Introduction

Package managers are widely used in various areas, ranging from OS-level installation of software frequently used in systems to development libraries and the installation of smartphone applications. In this thesis, we aim to analyze various attack vectors on package managers used by developers and demonstrate a proof-of-concept monitoring system, developed from scratch, to address these issues. A quick introduction to package managers, their usage and how they operate is given in Chapter 1.

A majority of the package managers in question are community-based. In this context, no central authority needs to confirm when a new version of a package is uploaded or when an entirely new package is created. On one hand, this provides a low-barrier opportunity to contribute to open source and enables faster release cycles; on the other hand, it means that anyone can upload anything, which presents various exciting scenarios for malicious attacks. To understand how these attacks are performed, we first need to understand the basics of how packages themselves are used; this is covered in Chapter 2.

Several different threat vectors have been identified. We have seen from past incidents that a typical type of attack is the so-called typosquatting attack [11], now seeing a reincarnation in the world of package managers. Other forms of attack include hijacking existing packages or creating bait packages with attractive names, trying to lure developers into installing them. We compiled a brief overview of the notable incidents in Chapter 3. Since packages are not isolated components but typically have dependencies on other packages, the compromise of a package can propagate much further and faster into other packages. We discuss this topic in more depth in Chapter 4.

In Chapter 5, we present a proof-of-concept system, called Aura, that we created from scratch after extensive research and that can scan terabytes of data and find anomalies in the packages published on the PyPI repository. This goal is accomplished by using a highly-optimized hybrid analysis engine that tracks the code execution flow and defeats a selected set of code obfuscations. We further discuss these techniques in the associated chapter, as well as the difficulties of creating this engine. At the end of the chapter, we present interesting findings that we extracted from the dataset gathered by scanning the whole PyPI repository. Thesis conclusions and steps that could be taken in the future are provided in Chapter 6.

1 Overview of the ecosystem

In this chapter, we discuss the roles of package managers and briefly how they operate. As a baseline, we look into the Debian package manager, which in the context of this thesis is an ideal and mature model for how the package operations and ecosystem should work from the security point of view. Afterwards, we look into the package manager for Python, called pip.

1.1 Package managers

Package managers 1 are currently present in various areas of computer systems where a user can install missing software for her needs, avoiding the complicated process of manual installation. Users usually select the application they wish to install from a list of available applications, and the package manager installs this application; in most cases, it also performs any default configuration that is needed. One of the main benefits of such systems is that they also handle dependencies, i.e., cases where the installed package requires other packages to be installed for its functionality. These systems are commonly referred to as package managers and are available in modern operating systems and smartphones. As they are often used by non-technical people, they usually have several security mechanisms that aim to prevent the compromise of the end-user system, block malicious intent or mitigate the spread of a potentially exploitable vulnerability. In most cases, this is achieved by a central authority that manages the repository of available software, requiring the approval of every published piece of software and its different versions, in combination with static analysis to flag packages that need a human review.

1. Sometimes also called Software Managers or Application Managers. The name often depends on context; for example, in programming languages, the preferred term is Package Manager, since the installed software is often just a set of libraries and not a directly executable application.


1.1.1 Debian package manager

A Debian package2 is a collection of files that allows applications or libraries to be distributed via the Debian package management system. There is also a Linux distribution with the same name, Debian [21], that uses the Debian package management system as its core software manager, hence the name of this distribution. Any given package consists of one source package and one or more binary components, with the structure defined by the policy3, although there are numerous techniques for creating these files.

A significant note here is that every officially published package needs to have associated source code, built by maintainers. Packages that are already compiled and contain binary components are not accepted4, in order to ensure that all published packages originate from the provided source code without any (sometimes malicious) additions. Such a mechanism also allows for independent audits to verify that distributed packages are unmodified before being published. This mechanism is called reproducible builds5. Although reproducible builds in Debian are not available for all packages at the time of writing, there is a great effort to increase the coverage and essentially provide them for all officially distributed software in Debian.

Accepting a new package (or a new version of an existing package) into the official repository usually goes through several steps, such as putting it in a testing or unstable area6 for a period, which mitigates several attack vectors discussed further in this thesis. Additional extensions, such as Debsigs7, allow extending the standard package model by providing support for digital signatures and verification using PGP.

2. https://wiki.debian.org/Packaging
3. https://www.debian.org/doc/debian-policy/
4. There are special repository areas, such as contrib or non-free, that allow such packages.
5. https://wiki.debian.org/ReproducibleBuilds
6. https://www.debian.org/releases/sid/
7. https://tracker.debian.org/pkg/debsigs

1.2 Package managers for developers

Special software managers exist for developers; they are used to install specific libraries that developers then use in their programs. They are commonly referred to simply as package managers. Their functionality is very similar to software application managers, including the resolution of dependencies and updating libraries to their latest versions. Most of the popular development languages use their own dedicated package systems/managers to help users distribute their libraries. Unfortunately, these usually do not have a central authority approving packages, or they have very loose security mechanisms without frequent audits of published packages. This principle goes hand-in-hand with the idea of enabling rapid release cycles and providing a low barrier to entry for anyone to contribute, as developers do not need to wait for approvals, audits or other workflows to complete.

1.2.1 The Python Package Index

The Python Package Index8 (PyPI), combined with a client-side application called pip9, is the most frequent way of installing libraries/packages for Python [3]. Its model is community-driven: anyone can sign up for an account and begin publishing packages immediately. Every package is published under its name, which also acts as a namespace10, and is assigned to one or more categories, such as Intended Audience :: Developers or Topic :: Software Development :: Build Tools. A given package can list required (and optional) dependencies on other packages, including their versions, which must be installed and satisfied by a package manager before attempting the installation of the package itself. These package relationships can essentially be projected as a directed graph-like structure. More in-depth anatomy and functionality is described in Chapter 2.

8. https://pypi.org/
9. https://pypi.org/project/pip/
10. https://www.python.org/dev/peps/pep-0541/
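To make the dependency resolution concrete, here is a minimal sketch (not pip's actual resolver) showing that a valid installation order is simply a topological ordering of the dependency graph; the dependency data below is hypothetical:

# A minimal sketch: package dependencies form a directed graph, and a
# valid installation order is a topological ordering of that graph.
def install_order(dependencies):
    """Return packages ordered so that dependencies come first."""
    order, visited = [], set()

    def visit(pkg):
        if pkg in visited:
            return
        visited.add(pkg)
        for dep in dependencies.get(pkg, []):
            visit(dep)  # install dependencies before the package itself
        order.append(pkg)

    for pkg in dependencies:
        visit(pkg)
    return order

deps = {"my-app": ["requests"], "requests": ["urllib3", "idna"]}
print(install_order(deps))  # ['urllib3', 'idna', 'requests', 'my-app']

Note that this sketch omits cycle detection and version constraints, both of which a real resolver has to handle.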


2 Anatomy of a Python package

In this chapter, we describe the format of a typical package as it is distributed by official channels to the target developers. First, though, we need to understand how the installation of a package is done manually, as the package manager and formats are just an abstraction over this process.

Over the years, several standards (regarding data structure as well as workflow) have been developed, and some of them are still maintained and considered official. These standards are governed by the Python Packaging Authority (PyPA), which is also the maintainer of the Python Package Index (PyPI), the global standard content delivery network (CDN) for distributing the packages. From our point of view, rather than describing each format in detail, we focus on the parts that could be leveraged by a threat actor to achieve malicious code execution. Referenced Python Enhancement Proposals (PEP) [3] and external documentation further in the text provide more technical details if needed, as they serve as the official standard.

2.1 The Setup Script

At the heart of most of the different package formats for distributing a package lies a script called setup.py. This script provides the main details about the distribution and additional metadata used by a package manager, as well as the PyPI repository. Here is a simple illustration of the content of the setup.py file:

#!/usr/bin/env python

from distutils.core import setup

setup(name='Distutils',
      version='1.0',
      description='Python Distribution Utilities',
      author='Greg Ward',
      author_email='[email protected]',
      url='https://www.python.org/sigs/distutils-sig/',
      packages=['distutils', 'my_package'],
      )

In the example, we declared a distribution with the name Distutils that provides the packages distutils and my_package. This specifies that there are directories with the names listed in the packages attribute, which should be installed under the given name (Distutils). It is important to note that this setup.py script is itself executable Python code, as can be seen from the *.py extension. This means that, theoretically, there is nothing preventing threat actors from inserting malicious executable code that would be executed upon installation. Achieving code execution can be done in multiple ways (the second technique is sketched after this list):

∙ The first obvious way is inserting the malicious code directly, just before the call to the setup function from distutils.

∙ The setup function also supports overriding the various setup commands (install, develop, tests) and providing one's own implementation of such commands.1

∙ The Python language itself supports modules written in the C/C++ language, called extensions. Upon installation, these extensions are compiled, and the package developer can define how the compiler is executed.
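To illustrate the second technique, the following is a minimal, deliberately harmless sketch (not taken from any real incident) of a setup.py that overrides the standard install command:

# Illustrative only: a setup.py can run arbitrary code during installation
# by overriding the standard "install" command via cmdclass. The payload
# here is a harmless print().
from setuptools import setup
from setuptools.command.install import install

class BackdooredInstall(install):
    def run(self):
        print("arbitrary code runs here, with the installing user's rights")
        install.run(self)  # continue with the normal installation

setup(
    name="example-package",
    version="1.0",
    cmdclass={"install": BackdooredInstall},  # hook the install command
)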

A standard method of installing a Python package from an unpacked archive/directory is by running the command python setup.py install. At this point, code execution is performed, since the script is directly executed.

2.2 Package Installer for Python (PIP)

Manual installation of a Python package can be very time consuming, as it involves obtaining the package (which can be platform dependent),

1. An example can be found in Appendix A.2; the cmdclass keyword there defines the override of the setup commands.


unpacking the archives containing the code, installation of dependencies and finally installation of the package itself by running python setup.py install, as described in the previous section. While this process is not problematic for the installation of a single package, and in fact is still frequently used these days, we can easily see that it does not scale to more complex projects.

The first attempt to automate the installation process was a helper script called easy_install that was part of the default Python installation. As easy_install was the first attempt at standardization, it also had several problems, such as corrupted partial installations (on error), no uninstallation of packages, no caching and many more. Over the years, a new project called pip was created that was meant to replace easy_install and address most of these problems. Eventually it did so; easy_install was deprecated long ago, and the official package manager as declared by PyPA is now pip. Because of that, we will not cover anything related to easy_install in this thesis, and all functionality and problems are discussed from the perspective of pip.

The installation of a Python package using pip is done by executing the command pip install package-name. The following simplified workflow is performed:

1. Requirements are gathered, including the package to be installed and its dependencies. Release and version constraints are applied (e.g., allowing only a specific version to be installed). This step can be viewed as dependency resolution.

2. Gathered dependencies are retrieved either from cache or by downloading them from the PyPI mirror.

3. The dependency chain is processed by unpacking the archives at a temporary location and running python setup.py install in the same manner as a manual installation.

4. Installation of packages is finished, and results/summary are reported to the user.


It is also possible to install multiple dependencies at once; those are usually listed in a file called requirements.txt that can be passed to pip using the "-r" switch. This is currently the most widespread method for installing Python software dependencies in a manual way.

2.3 Source distributions

The legacy format for distributing Python packages is called source distribution, or sdist 2 for short. It is designed to be completely platform independent. It is a very simple archive that contains the source code of a Python package (by default). The content of the package can be customized by using a MANIFEST.in template, which whitelists/blacklists paths and files to be included inside the archive for distribution. During the installation of an sdist package, the workflow is the same as for a manual installation.

2.4 Binary package format

During the installation of Python source code files related to a package, an optimization step can be executed that compiles the source code into Python bytecode (a *.pyc or *.pyo file). This has the benefit of decreasing the startup time, as such code is already compiled into a form suitable for the Python interpreter3. Sdist packages also needed to be redistributed as a regular archive, which posed a difficulty on some operating systems, such as Windows. Binary package formats (also called built distributions)4 are an extension to source distributions. Python packages can include C/C++ extensions that would need to be compiled from source code during the installation process. The binary package format allows a pre-compiled binary extension to be included in a package, hence the name. Apart from including binary blobs, the other major change is more platform-friendly package formats. It was difficult for a developer to install a "*.tar.gz" archive under

2. https://docs.python.org/3/distutils/sourcedist.html
3. Sometimes also called a virtual machine; however, in Python, the preferred term is interpreter.
4. https://docs.python.org/3/distutils/builtdist.html


Windows or inside Linux without breaking the OS's software manager. Binary packages include additional formats that are native to the target platform, e.g., "exe" or "msi" for Windows, "rpm" for certain Linux distributions, etc.

Using binary distributions can arguably be less secure, due to the usage of binary data that is not easily auditable. Creating binary blobs (exe, pyc files) is non-deterministic, making it relatively easy to include malicious code, as external verification would be hard to do.

2.5 The Wheel Binary Package Format

A new PEP 427 [7] was proposed and eventually accepted that described the Wheel Binary Package Format ("wheel" for short), which is now the de facto official and recommended way to distribute Python packages5. It is a ZIP archive that simplifies the interaction between the build system and the installer. This format is an evolution of the binary distributions described above, explicitly designed to fit better into the new ecosystem of Python packages that evolved over the years. In theory, the package can be installed just by unpacking it at the correct location.

The introduction of binary distributions did not fully solve the problem of compiling the dependencies. These distributions permitted including binary blobs (pre-compiled code) in the package itself, but this benefit, gained by pre-compiling the code, has decreased significantly. Several factors are causing this, such as the many different platforms (Linux, Windows, etc.), Python versions (2 & 3) and even Python implementations (CPython, PyPy, Jython, etc.).

5. There were also several other formats created before wheels were standardized; those formats are, however, not entirely official, as they do not have any kind of explicitly written specification, such as the Egg format: https://packaging.python.org/discussions/wheel-vs-egg/

The package itself would need to recompile the package sources6 in most cases anyway, due to not exactly matching the original environment.

This problem was solved by proposing file naming conventions as part of the PEP, where information about the platform, Python version/implementation and additional metadata is encoded directly in the name of the package. This means that the package manager itself can select a package for installation that matches the target environment. As a fallback option, the sdist is used if a wheel matching the constraints cannot be found.
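For illustration, the sketch below shows how the tags can be recovered purely from a wheel file name under the PEP 427 convention; the file name is an example, and the parsing is simplified (real tools also handle build tags and name normalization):

# A small sketch of how an installer can select a compatible wheel purely
# from its file name (PEP 427: name-version[-build]-python-abi-platform.whl).
def parse_wheel_name(filename):
    parts = filename[:-len(".whl")].split("-")
    # the build tag is optional, so a name has 5 or 6 dash-separated parts
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3], parts[-2], parts[-1]
    return {"name": name, "version": version, "python": python_tag,
            "abi": abi_tag, "platform": platform_tag}

print(parse_wheel_name("lxml-4.3.0-cp37-cp37m-manylinux1_x86_64.whl"))
# {'name': 'lxml', 'version': '4.3.0', 'python': 'cp37',
#  'abi': 'cp37m', 'platform': 'manylinux1_x86_64'}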

A very important new feature introduced as a part of this proposal was a file called METADATA in the dist-info directory of the wheel package. This file contains the metadata, such as package name, author, extensions, etc., which in previous formats were part of the setup.py file. When the package is created (on the developer's side), the package information is parsed and placed in this METADATA file, which the installer can then read without executing the setup.py file. As a consequence, it is now possible to install a wheel package without any code execution, an important milestone in terms of security.

Another interesting feature from the security point of view is a new special file called RECORD. This file is in CSV format and contains a list of (almost) all files within the package and their checksums. The PEP standard also includes support for a RECORD.jws file, which is a digital JSON web signature (or RECORD.p7s for the S/MIME signature type) that allows developers to digitally sign the content of a package before publishing it on PyPI. Unfortunately, it is very rare to find a package using this feature, as there is neither the infrastructure to manage the chain of digital signatures/certificates, nor the tools to validate them. The installer (pip) is also supposed to check the content of a wheel package against the RECORD file to at least validate the checksums. However, this is not the case7, and no such checks are performed

6. In the case of Python bytecode (*.pyc), at least not until Python 3.7, which introduced deterministic compilation; see https://www.python.org/dev/peps/pep-0552/
7. https://github.com/pypa/pip/issues/3513

currently by pip. There is an open issue on GitHub discussing this feature, as it is known that some packages have manipulated the content of a wheel package8.

8. https://github.com/pypa/wheel-builders/issues/1
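The check described above can be sketched as follows. Each RECORD row has the form path,algorithm=urlsafe-base64-digest,size; this simplified example (a sketch with error handling omitted, not pip's code) recomputes and compares the digests inside an unpacked wheel directory:

# Recompute each file's hash inside an unpacked wheel and compare it
# against the RECORD entry; a mismatch indicates a manipulated package.
import base64, csv, hashlib, os

def verify_record(unpacked_wheel_dir, record_path):
    with open(record_path, newline="") as fd:
        for path, digest, _size in csv.reader(fd):
            if not digest:           # the RECORD file itself has no hash
                continue
            algo, _, expected = digest.partition("=")
            with open(os.path.join(unpacked_wheel_dir, path), "rb") as f:
                raw = hashlib.new(algo, f.read()).digest()
            actual = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
            if actual != expected:
                print(f"checksum mismatch: {path}")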


3 Previous Incidents

In this chapter, we describe various security incidents from the past. As the number of incidents is rising very quickly, due to increased traction on the threat actor side, we selected only a few of them to document here. The selection of incidents is based mainly on the new attack vectors that were used or on their total impact. Most of these attacks were discovered by sheer luck.

I.1 ssh-decorate May 2018 [22]

Description: In May 2018, user mowshon created an issue on GitHub for a project called ssh-decorate, asking the author why private user credentials were being sent to a remote web server, hinting that it was part of malware. After the issue was brought to the author's attention, he claimed that the alleged backdoor was not intentional and that he was not aware of it. It is important to mention that the malicious code was not present in the GitHub repository but only inside the released packages.

Impact: Affected users had their server credentials sent to the attacker-controlled endpoint during each connection to the server, including username, password, and their SSH key.

Analysis: Unfortunately, the associated repository was removed after the news started appearing on popular websites, although several users who were able to access it before removal claimed that they found the ssh-decorate author's PyPI credentials in the commit history. We were able to obtain screenshots of the GitHub issue and a snippet of the malicious code; see Appendices B.1 and B.2. The most accepted conclusion is that the threat actor was able to find these credentials (which were still valid) and uploaded a modified version to the PyPI repository, which included a malicious backdoor.


I.2 getcookies May 2018 [19]

Description: In early May 2018, the npm security team received a report of a package that masqueraded itself as a cookie-parsing library, called getcookies.

Impact: The significance of this incident lies in the fact that this package was used as a dependency of another package, called mailparser, which had around 64,000 weekly downloads at the time of the incident's discovery. Although the number of installations reached several thousand, upon closer analysis the npm team concluded that luckily only a small percentage of those users were impacted, due to how this dependency was used.

Analysis: This backdoor was looking for specially crafted HTTP headers from a C&C server and executing the code provided within them. The HTTP headers were modified with simple obfuscations1.

I.3 event-stream November 2018 [18]

Description: On the 20th of November 2018, a GitHub issue was created for an npm library, event-stream, claiming it contained malicious code that was inserted by a user known as right9ctrl.

Impact: Precise numbers are unknown, but the number of impacted users is estimated to be very low, as stated by the npm security team in their incident report. This is because the malicious code was checking the environment for a set of specific conditions, aimed mostly at enterprise environments. This malware targeted cryptocurrency wallets by hooking into other (legitimate) functions; it captured the users' private keys with additional information and sent them to a remote server in Kuala Lumpur.

Analysis: The package in question was no longer actively maintained by the original author, who was seeking a new maintainer/volunteer to continue the development. This threat

1. https://snyk.io/vuln/npm:getcookies:20180502

actor offered to continue the development and shortly afterward injected a specially-created dependency, which contained the malicious code. Several obfuscation mechanisms were also included, such as encrypted & heavily-obfuscated payloads and the exclusion of source code versions that were intended for humans (non-minified source code).


4 Attack vectors and threat models

In this chapter, we discuss various attack scenarios that can be performed by a malicious threat actor. These scenarios leverage how the package ecosystem currently works, and most of them are based on real-life incidents. Our focus is to demonstrate several ways of achieving code execution, which we consider, in the context of this thesis, a successful compromise of a system.

4.1 Source code modifications

This type of threat assumes, in most cases, that the target package was developed as legitimate by its original authors; however, threat actors later modified the code to include malicious functionality. We call this type of attack package hijacking. There are several different ways to introduce malicious modifications to an already-existing package.

The first method leverages stolen credentials of the author. This can be accomplished by re-using credentials across systems that were previously compromised in a different breach. It is not uncommon for inexperienced developers to accidentally leak their private credentials or access tokens [23, 9] by committing them by accident to a Git repository (or another version control system). Often, they do not realize that, even when they delete them later, they are still visible in the commit history, and a change of credentials and revocation of all associated access tokens is needed to mitigate the potential breach. This was the probable cause of the incident where the legitimate package ssh-decorate was hijacked (see I.1).

The second method for hijacking a package abuses the trust of authors [20]. Many projects are abandoned over time, due to the original authors no longer being able to continue the development. For more popular or more prominent projects, authors usually publicly announce that they are seeking a new maintainer to keep the project alive and continue the development. Another reason might be that

the development team needs to be expanded, or an external developer is claiming to have implemented a desired feature. An example of this is the event-stream incident I.3.

4.2 Typosquatting

This well-known technique has been a common attack vector in past incidents [16, 15]. The most common installation methods are done via the command line by manually typing the name of the package to be installed (for example: pip install requests). By default, there are no further confirmations to proceed with the installation, as long as a package with that name exists. As package names often contain very technical terms, abbreviations or version numbers, it is easy to make a mistake in the package name, especially for a non-fluent English speaker.

Figure 4.1: An example of a typosquatting package on PyPI when searching for a package scikit


Typosquatting was first used on the internet to target websites. Over the years, browsers evolved to combat this threat by various means, such as SSL certificates, domain reputation scores, active threat monitoring and many more [24]. While it is still an ongoing battle in the case of phishing websites, the use of this technique to target developers by typosquatting package names is completely new, and defensive mechanisms, including active security scans, are yet to be developed.
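As a minimal illustration of how such a defensive check could work (a sketch, not a production mechanism; the list of popular names is a stand-in for a real reputation feed), a package name can be compared against known popular packages using a string-similarity threshold:

# Flag a package whose name is suspiciously similar to, but not identical
# with, a more popular package.
import difflib

POPULAR = ["requests", "urllib3", "scikit-learn", "numpy"]

def possible_typosquat(name, threshold=0.85):
    # get_close_matches uses a similarity ratio; near-identical names match
    hits = difflib.get_close_matches(name, POPULAR, n=1, cutoff=threshold)
    return hits[0] if hits and hits[0] != name else None

print(possible_typosquat("reqeusts"))   # 'requests'
print(possible_typosquat("numpy"))      # None (exact name, not a squat)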

4.3 Bait packages

A slightly different but related vector is creating bait packages. Python has several built-in modules that come by default when it is installed. Threat actors started creating malicious packages using the names of built-in modules. A developer might then install such a package, not knowing that it is already available on their system. An example of such a package is shown in Fig. 4.2.

Previous studies showed that there is a prevalent number of so-called trivial packages 1, which consist of only a few lines of code. At the same time, a survey revealed that trivial packages are used because they are perceived to be well-implemented and well-tested pieces of code. Code re-use is also often encouraged, and most developers believe it increases their productivity [12]. This has the effect of creating large chains of dependencies [14], meaning that the compromise of a single package can have a potentially disastrous effect by propagating into literally thousands of other packages.

We were able to observe the effect of these large dependency chains when the author of a package called left-pad decided to remove it from the NPM repository2. This package, which had 11 lines of code, caused disruptions at large corporations, such as Netflix, Airbnb, and Facebook, only because the left-pad package was included deep in the software dependency chain.

1. Also called micropackages.
2. https://www.davidhaney.io/npm-left-pad-have-we-forgotten-how-to-program/


Figure 4.2: Screenshot of a package that is already included in Python 3.3 but available for download on PyPI

5 Analyzing packages on a global scale

In this chapter, we discuss the various tools that we explored at the beginning of our research. The usability of these tools is tied to their ability to do execution flow analysis and to transform the AST tree to defeat simple forms of obfuscation. Due to the inability of existing tools to fulfill these requirements, we present here the Aura framework, which we created for this thesis. We used this framework to scan the whole PyPI repository to find anomalies that could indicate malicious behavior or lead to an incident. At the end of the chapter, we list remarks relevant to conducting future research in this area.

5.1 Existing tools and frameworks

During our research, we evaluated multiple tools and frameworks to assess their suitability for this research. The primary area that we were looking at was tools that do static analysis on top of the Abstract Syntax Tree (AST; explained in 5.2.1) that is parsed from the source code. Another important feature was the ability to apply transformations to such a tree, as that is the main limiting factor of usability in the domain of malicious code.

One of the first tools that we found was called Spoofax [4], which is a framework for researching abstract syntax trees (in terms of developing grammars, parsers, etc.). A part of this toolkit called Stratego is an AST transformation engine. Stratego is one of the most powerful transformation engines available these days but, unfortunately, it is designed to work only under the Java programming language, making it unsuitable for our use, as our framework is written in Python.

The next tool we evaluated was Bandit [25], a security linter for analyzing Python source code. When we started the development of our framework, we did so by extending this framework and creating analyzer plugins. The feature set of Bandit closely matched what we were looking for, including analyzers working on top of the AST tree,

23 5. Analyzing packages on a global scale built-in transformations of AST trees and a lot of examples that were already addressing some of the areas of our research. However, after several days of development, we ran into multiple blocking issues. The biggest issue was that it is not simple to conduct a global scan of the python repository. It is restricted to using a built-in AST parser that can parse only the python source code of the python version under which Bandit is installed. This means it would cover only a small percentage of all the packages in a repository. Secondly, there was no easy way to support scanning non-python (source code) files, unpacking archives (needed for scanning packages) or assigning metadata. Also, one of the last big blockers we found was the AST transformation. From the online documentation and discussions, it appeared that the Bandit has built-in AST transformations. However, during the evaluation, it was not able to recognize even the simplest form of obfuscations (such as string concatenation). Due to these issues and a few others, we concluded that it would be more beneficial to start from scratch, rather than modifying most of the Bandit source code. When designing our system for this research, we took much inspiration from Bandit.

During the development, we also found a project called Coala [26] and the related project CoAST1. Coala is a static code analyzer designed to suggest fixes for source code that would improve the overall quality of the code. It is not designed to do analysis/audits in terms of security; however, it contains an integration with the Bandit tool to do so. After evaluating Coala, we found the same issues as with the Bandit tool, and we also received recommendations on the official discussion channel to look into Bandit, which is better suited for our purpose than Coala.

One of the main reasons we looked into Coala was the discovery of CoAST, which claims to be a Universal Abstract Syntax Tree2, independent of the target programming language used by the Coala framework. This would have significant implications, namely the ability to audit source code regardless of the language and not be restricted

1. https://coast.netlify.com/
2. Later on, we also discovered Babelfish, which has similar goals: https://doc.bblf.sh/


to just Python, as is currently our case. After further inspection, we found that CoAST is not yet used by Coala but is instead proposed as a future replacement, and the current state of the project is highly experimental, far from being usable by other complex projects. In the future, if CoAST reaches a mature state, it would be an excellent base for this kind of research.

5.2 Static Analysis

In general, there are different methods for the detection of malicious code or malware itself. At present, it is common to use a live analysis (also called dynamic or sandbox analysis) approach: running a sample inside a sandboxed environment that observes its behavior. On the completely opposite side, we have the static analysis approach. This approach aims to deduce the behavior and characteristics of a given sample just by analyzing its code, without any execution, as is the case with sandboxes. Both approaches have weaknesses [2] and specific strengths [5].3 An overview of the steps done by Aura during the static analysis is listed below:

1. Find the correct Python interpreter for the input source code.

2. Parse the source code into the AST tree.

3. Transform the tree by applying the reduction rules (partial evaluation).

4. Perform execution flow analysis and matching on the AST tree.

5. Collect hits generated by analyzers and compute the security score.

Static analysis requires a completely different set of tools, most of which work on top of parsed code. This set of tools attempts to analyze the flow of the execution graph and its various functional components. Part of the functionality also usually works as a transformation mechanism, which transforms the original code into another

3. A brief description of the live analysis approach is located in Appendix A.1.

25 5. Analyzing packages on a global scale one (equivalent if executed), but with a better semantic value. These transformations are meant, for example, to defeat a simple/trivial ob- fuscation mechanism employed by malware, although they have more limitations and effectiveness then live analysis approach. This makes it one of the biggest differentiators when choosing which approach to use. A swift comparision to the live analysis approach is located in Appendix A.1.1.

For the implementation of the Aura framework, we chose a hybrid static analysis approach. Hybrid in this context means that we have primarily static methods, with support for partial execution to address some simple obfuscation mechanisms. This approach was chosen for the following reasons:

∙ There is no standard entry point for Python source code in the case of libraries/modules – as discussed in previous chapters – meaning we can't just "execute" the sample in a sandbox.

∙ Our system is designed to run on large-scale repositories. The live analysis approach is not easily scalable, since we would like to monitor the whole repository – meaning every package ever published, and not just on demand.

∙ Python by itself runs on a variety of environments and systems (Windows, Linux, embedded devices). There are also multiple versions of Python, notably 2 and 3, which are not compatible with one another. A live analysis would, in this case, have only one specific sandbox configuration, meaning very little coverage; running Python code on a different system or version would likely result in a failure. Solving this by having multiple sandbox configurations would significantly increase the resources needed to run this system.

5.2.1 Abstract Syntax Tree

To perform a static analysis of source code, we need a mechanism to understand its semantics in a machine-friendly way. We use the same method that compilers and interpreters use to translate the source code into executable instructions. This is achieved by constructing an


Abstract Syntax Tree (AST) from the input code. Each node in the tree represents a construct present in the original code, such as assignment expressions, conditions, function calls and so on. It also preserves the order of the instructions, which makes it suitable for conducting data/execution flow analysis or optimizations by transforming the tree. This syntax is "abstract" because it does not contain every detail from the source code, such as parenthesis symbols, semicolons, indentation, comments, new lines, and other inessential elements. Apart from being transformed, the AST tree can be enriched with additional information, which would not be possible with source code, as that would imply changing it.
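The standard Python ast module makes this structure visible: parsing a one-line assignment yields nested nodes for the statement, its target and the binary operation, with comments and formatting discarded (the output below is abbreviated; exact node names vary across Python versions):

# Parse a tiny snippet and print its abstract syntax tree.
import ast

tree = ast.parse("x = 2 + 2")
print(ast.dump(tree))
# Module(body=[Assign(targets=[Name(id='x', ctx=Store())],
#              value=BinOp(left=Constant(value=2), op=Add(),
#                          right=Constant(value=2)))], ...)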

There are several other approaches to consider. The simplest one is using regular expressions for extracting information; however, from automata theory, it is known that regular expressions have very limited power in recognizing languages, such as no ability to match brackets/parentheses, which is required for our purposes. Admittedly, we have integrated Yara support into the framework, which can be considered a more powerful extension of classic regular expressions; this integration is not intended to be used to understand Python source code but rather to apply signatures for detecting potentially interesting matches in other types of files, such as binary blobs. The last approach that we considered was using linters, which usually work as parsers using context-free grammars (the same principle as an AST parser). In their simplest form, they only tokenize, since their main intended use is syntax highlighting. The more advanced ones, such as ANTLR [1] or the Spoofax Workbench [4], are designed especially for language recognition and the study of such languages. These are some of the most powerful tools available for understanding source code, producing detailed parse trees that often include more details than those produced by a standard AST parser, such as comments, because these are not needed by interpreters when running/compiling the program. We have chosen the native AST parser, as we are currently not interested in these extra syntactic details, and the additional complexity of the trees would project into a much bigger complexity of our framework implementation. In the future, we would consider switching to ANTLR or a similar type

of framework, due to it being more powerful and also partially able to recover from some parsing errors, compared to the native AST parser.

For our goals, we use the built-in "ast" module of a default Python installation to construct the tree from given source code in the same way the interpreter itself would do it. Unfortunately, this module is designed to parse only source code for the target Python version under which it is running. This means that it might not be possible to parse Python 3 code under Python 2 (and vice versa), because of non-compatible changes in language syntax. To solve this problem, we created a standalone (i.e., no dependencies and no installation required) wrapper around the ast module that is designed only to construct the AST tree and serialize it to the JSON format for transferring it back to the main framework for further processing. This piece of the framework is also the only one that depends on specific Python versions, which is necessary for reliable parsing. The workflow for obtaining the AST tree is the following (a sketch of the wrapper follows the list):

1. Multiple target interpreters can be configured, each able to cover different non-compatible syntaxes, such as Python 2, Python 3, PyPy, etc.

2. Our standalone wrapper is executed using the target interpreter to parse the source code and serialize the tree into the JSON format, so it can be transferred back in a system-independent way to the framework, which performs the static analysis.

3. These interpreters are tried in the configured order until one that accepts the code is found. Otherwise, no compatible interpreter is configured, or the input code contains breaking syntax errors.
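A minimal sketch of such a wrapper is shown below (hypothetical code, not the thesis's actual implementation); it uses only the standard library, so any configured target interpreter can run it and return the tree as JSON:

# Read source code from stdin, parse it with this interpreter's own ast
# module, and emit the tree as JSON. Non-JSON-safe constants (e.g. bytes)
# would need extra handling in a real implementation.
import ast, json, sys

def node_to_dict(node):
    if isinstance(node, ast.AST):
        out = {"_type": type(node).__name__}
        for field, value in ast.iter_fields(node):
            out[field] = node_to_dict(value)
        return out
    if isinstance(node, list):
        return [node_to_dict(x) for x in node]
    return node  # plain values: numbers, strings, None

if __name__ == "__main__":
    source = sys.stdin.read()
    print(json.dumps(node_to_dict(ast.parse(source))))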

When performing a scan, our framework automatically detects the MIME type of each input file, and AST parsing is attempted on those identified as Python source code. It is also possible to define actions to be performed if the parsing fails, so we can also look for this kind of anomaly, if needed.


5.2.2 Tree transformation and analysis

After the parsing has been performed and the serialized AST tree retrieved, our framework runs it through multiple stages (implementations) of different visitors. A visitor is our wrapper implementation of a tree traversal algorithm, exposing core functionality that serves as a skeleton for several implementations. This core functionality includes:

∙ hooks and signals for events to trigger callbacks, such as the beginning of traversal, the end of traversal, node visits and node replacements

∙ multiple traversal iterations with additional passes for convergence, which becomes important when the tree is modified

∙ a FIFO queue for visited nodes, with support for queue invalidation

There are three main stages implemented as specializations of this visitor wrapper. The first and simplest one is the AST node converter. Since the input tree has been obtained from JSON, it contains only primitive structures (numbers, strings, arrays, dictionaries), due to the JSON language definition, and this lack of more advanced structures would form a bottleneck in our analysis. The converter transforms the tree by wrapping supported nodes into more advanced structures that expose additional functionality and attributes, such as the ability to add tags, pretty printing or shadowing attributes with their extended versions. Since one structure can wrap another structure, this is why we implemented the logic of multiple traversal iterations of a tree: if a tree is modified, it is marked as such, and a new traversal is performed, until the tree is no longer marked as modified. For safety reasons, we then do additional traversals that we call "converging", as it is sometimes very complicated to mark the tree as modified, due to backtracking complicated references.

29 5. Analyzing packages on a global scale

A second stage is a form of logical tree transformation that also performs a partial evaluation. To explain the partial evaluation, consider the following simple example:

x = 2 + 2
y = 5 * x

It is easy to notice that the statement x = 2 + 2 can be optimized to x = 4 by precomputing the value; this kind of optimization is called constant folding. Once we have the value of x, we can also optimize the second statement by computing the value of y, which would be y = 20. This optimization is called constant propagation. There are many more techniques, such as loop unrolling, inlining function calls or taint analysis – one of the most common methods of finding vulnerabilities [6, 13, 8]. These techniques are primarily used by compilers and interpreters to speed up program execution by optimizing instructions. We included a limited subset of this functionality in the second stage, which transforms the tree mainly by using the abovementioned constant propagation and constant folding techniques. This allows us to address some simple obfuscation mechanisms, and it is also the reason why our approach to statically analyzing the samples is hybrid, since it results in partial (safe) code evaluation. There are, of course, several limitations; as such, this stage is the main bottleneck in static detection capabilities [10]. Consider the following code:

url = "ht" + "tp://"

def func1(data):
    x = "example"
    return data + x

def func2(data):
    x = ".com/callback"
    return data + x

url = func1(url)
url = func2(url)

While this is still a very simple obfuscation mechanism and easy for humans to understand, it is non-trivial for an algorithm to determine the final value of the url variable. Such an algorithm would need to be able to simulate a frame stack during tree traversal to isolate variables and scopes, since the variable x inside func1 is completely different and unrelated to the variable x inside func2. An even more difficult example would be storing and manipulating data inside an object's attributes/properties, as that would require yet another data isolation mechanism beyond simulating frame stacks. Implementing these advanced analysis mechanisms would then be on par with developing an actual interpreter for the language. As in the previous stage, it is also important to perform this rewriting in multiple iterations, as folded constants often propagate themselves further into the code, until the tree converges with no more transformations/modifications.
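A toy constant-folding pass in the spirit of this second stage can be written as an AST transformer (a sketch, not Aura's implementation; ast.unparse requires Python 3.9+):

# Binary operations whose operands are already constants get replaced
# by their computed value, working bottom-up through the tree.
import ast

class FoldConstants(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.op, ast.Add)):
            return ast.copy_location(
                ast.Constant(node.left.value + node.right.value), node)
        return node

tree = FoldConstants().visit(ast.parse('url = "ht" + "tp://"'))
print(ast.unparse(tree))  # url = 'http://'

Running such a pass repeatedly until the tree stops changing is exactly the convergence loop described above.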

The final stage applied to the AST tree is the code execution flow analyzer, also implemented as a tree traversal visitor in the same manner as the previous stages. This analyzer does not perform any tree transformations and is intended only to interpret the defined semantic rules and match them against the final AST tree. When nodes match a semantic rule, a hit is produced, which is a simple structure containing the metadata of a match, such as position/line number, tags, severity score, file name and many more. These hits are collected by an analyzer, which uses them to compute the final risk score and several characteristics of a given input. Examples of various types of hits include:

∙ a module has been imported – this hit also tracks what kinds of functions from the module are used by the source code

∙ tracking specific function calls and their signatures, e.g., what kinds of parameters are passed to the function and their values

∙ finding embedded data blobs in a more intelligent way than just simple regex parsing (compared to the Yara integration)
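A simplified sketch of the first type of hit could look as follows (hypothetical structures, module watchlist and scores, not Aura's real API):

# Walk the tree and emit a "hit" whenever a watched module is imported.
import ast

WATCHED_MODULES = {"socket", "subprocess", "ctypes"}  # example watchlist

def module_import_hits(source, filename="<input>"):
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in WATCHED_MODULES:
                    hits.append({"type": "ModuleImport", "module": alias.name,
                                 "line": node.lineno, "file": filename,
                                 "score": 10})  # arbitrary severity weight
    return hits

print(module_import_hits("import subprocess\nimport json"))
# [{'type': 'ModuleImport', 'module': 'subprocess', 'line': 1, ...}]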

5.3 Aura framework

As part of this thesis, we created a custom framework from scratch that is designed to conduct large-scale scans of the whole PyPI repository, as well as of other datasets of source code.

The core component of the Aura framework is the source code analyzer, which, as described in the previous section, works on top of the AST tree by analyzing data flow. Analyzers are developed as plugins, and, as their input, they receive a path to the file to be analyzed, along with metadata. The framework uses a set of Uniform Resource Identifier (URI) handlers that provide these inputs for the analyzers. These handlers are identified by the protocol and the resource locator, which define how to produce a set of inputs for the analyzers. For example:

∙ the pypi://requests URI defines that the PyPI handler should look up the package "requests" online, download the latest release and pass the local path to the analyzers

∙ file://quarantine/ or ./quarantine is a local URI that enumerates the given location on the filesystem recursively

When files are enumerated, we also added support for automatically uncompressing archives. As mentioned in Chapter 2, Python packages are in fact archives (in different formats), so the framework automatically detects the MIME type of the file and, in case it is one of the supported archives, extracts it to a temporary folder and adds it to the set of inputs for the analyzers. Adding new types of URI handlers is trivial, as they are also developed in the form of plugins – for example, adding support for a git:// resource that would automatically clone the Git repository and pass it to the analyzers is just a matter of a few lines of code.
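The dispatch of URI handlers can be sketched as follows (a hypothetical, heavily simplified plugin registry, not Aura's real code):

# The URI scheme selects a handler plugin that turns the URI into local
# paths for the analyzers to consume.
import pathlib
from urllib.parse import urlparse

HANDLERS = {}

def handler(scheme):
    def register(cls):
        HANDLERS[scheme] = cls
        return cls
    return register

@handler("file")
class LocalFiles:
    def inputs(self, uri):
        root = pathlib.Path(urlparse(uri).path or ".")
        return [p for p in root.rglob("*") if p.is_file()]

def inputs_for(uri):
    scheme = urlparse(uri).scheme or "file"  # bare paths default to file://
    return HANDLERS[scheme]().inputs(uri)    # e.g. inputs_for("file:///tmp")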

The produced output is what we call hits. They provide information about an anomaly that was found by the analyzer – or, in some cases, just additional extracted data that is later used by another analyzer to influence the score. After the scan is finished, the produced hits are


gathered and used to compute the total score of the package (or another scan target). Each hit can define its own score that contributes to the overall score, which we call the security aura (hence the name of the framework). Individual hits can also define how their score is aggregated; for more details, we direct the reader to the framework documentation. A comprehensive list of all possible detection hits can also be found in Appendix C.1.

5.3.1 apip

The Aura framework itself is more suited to working on the server side, analyzing huge amounts of data. For demonstration purposes, we also created a small client-side executable script called apip. This small wrapper acts as a replacement for the pip package installer. It is designed to proxy its functionality to pip but intercepts any package installation. Once a package installation is detected, the package is sent to the Aura framework for analysis. Using the framework's capabilities and the output produced by the scan, the developer can then decide whether she should proceed with the installation or abort the process. Unfortunately, the pip development group has stated multiple times that they do not intend to provide a public API to enable such functionality.4 5 For this reason, we use a so-called "monkey patching" technique, which means that apip replaces pip functionality at runtime.
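The monkey-patching idea itself can be illustrated with a self-contained toy example (hypothetical names and a stand-in module; deliberately not pip's real internals, since pip exposes no public API for this):

# A function attribute is swapped for a guarding wrapper at runtime.
import types

installer = types.SimpleNamespace()          # stand-in for a pip module
installer.install = lambda pkg: print(f"installing {pkg}")

def scan_with_aura(pkg):                     # hypothetical analysis hook
    return pkg != "evil-package"

_original_install = installer.install

def guarded_install(pkg):
    if not scan_with_aura(pkg):
        raise SystemExit(f"aborted: {pkg} failed the security scan")
    return _original_install(pkg)

installer.install = guarded_install          # the actual monkey patch
installer.install("requests")                # prints: installing requests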

5.4 Global PyPI scan findings

We conducted several global repository scans using our custom-developed framework to find anomalies and to test the suitability of the tools. The prerequisite for such a scan is to have an offline mirror of the central PyPI repository, which can be created via an official tool called bandersnatch. Several customizations were also made that affected which packages were synchronized to our local mirror.

4. https://github.com/pypa/pip/issues/3999
5. https://github.com/pypa/pip/issues/4696


We observed a massive increase in the total mirror size between the time we first started the offline mirror synchronization (around 2.5 TB) and the time of writing this thesis (around 4.0 TB). After closer inspection, we discovered that a new set of packages had been created and published that accounted for this size, with very aggressive new releases occurring even several times a day. We concluded that this was a result of automated tools that re-published the packages after each change in a repository, and we thus configured the offline synchronization to exclude such packages. We also had only standard consumer external disk drives at our disposal (approximately 2.0 TB) and thus needed to enable several built-in plugins to also filter the versions of packages to be synchronized, to bring the total size down. Since this was long-term, ongoing research, we did the synchronization on a best-effort basis, as we were not able to run it in a continuous, real-time form.

During the latest scan, our local offline mirror contained 173675 Python packages that were all used as the scanner input. It is important to note here that the term "package" does not equal a single file/archive: a package can have several different archive formats/files published for a given release, and they are all passed to the analyzer as one set of files. Out of those, 5655 packages failed to be processed by Aura; the reasons are as follows:

∙ a corrupted archive that cannot be unpacked

∙ encoding issues, most notably more exotic codecs (neither UTF-8 nor plain ASCII) in which the source code is written, causing a failure to load the source code and parse it via AST correctly

∙ invalid checksums as synchronized from the offline mirror

∙ timeouts, as imposed by our framework to prevent bugs or failures from hanging the scan indefinitely

When investigating previous attacks and trying to obtain malware samples, we found that the malicious packages had been removed from the repository, and we were not able to obtain copies in most

34 5. Analyzing packages on a global scale

of the cases. 6 Based on this fact, we configured bandersnatch to not synchronize removals (e.g. keep deleted packages) from the upstream mirror.

A different dataset was also used for the typosquatting research, rather than our local offline mirror. During the research, we discovered a project called Libraries.io7, which synchronizes metadata for major package managers and periodically publishes these datasets. For the parts of this research that did not require access to the packages themselves, using this dataset was preferable to extracting the same information from the local mirror in a time-consuming and inefficient way. The dataset contained a total of 172412 metadata entries for PyPI packages8 9. It is important to note that one of the most significant differences in this dataset is that the Libraries.io data also contains packages that were already removed from PyPI. Because of this, many of our typosquatting findings had already been removed from the repository by the time we investigated them further.

As a result of these decisions, the coverage of published packages continually differed between individual scans of our local mirror, and also differed from the official live public repository. The dataset from Libraries.io did not precisely match the content of our local mirror either, for the same reasons. Replicating the results would therefore be difficult and, in further research, would require more controlled input data than what we had at our disposal.

During the global scan of the PyPI repository, we found several unusual anomalies that we shall now discuss. An overview of all the different hits that we found during the scan is located in Figure 5.1.

6. We found a GitHub repository that attempted to archive some of the malicious packages found on PyPI before they were removed. https://github.com/hannob/pypi-bad
7. https://libraries.io/data
8. By metadata entries we mean one entry per package containing data such as: author, title, description, dependencies, etc.
9. The last data dump that we used was produced on the 22nd of December, 2018


Detection        Count
ArchiveAnomaly       0
Wheel             1039
FunctionCall     68289
ModuleImport    294396
URL             290535
Base64Blob         165
SetupScript          0 (a)
SensitiveFile       82
SuspiciousFile  217498
YaraMatch         3029

Figure 5.1: Detections found during the latest scan
a. Not produced, as this is an informational-only detection that is filtered by default, unless the verbose output is explicitly enabled

It seems that, due to increasing typosquatting attacks, developers of more popular packages have also started deploying simple countermeasures. These developers pre-registered typosquatting names and uploaded there a stub package that informs the victim of their error when installed. These packages have a common signature, namely this literal description: A package to prevent exploit10. Upon installation, such a package fails with the following error: You probably meant to install and run, followed by the name of the legitimate package. Example code taken from one of the packages (talib) can be found in Appendix A.2. We found a total of 1141 published packages acting as placeholders11 using this technique. While this mechanism is very simple, a downside is that it pollutes the repository with stub packages that do not serve any other purpose. We also conducted a brief reverse search using the signature of the error message on Google and StackOverflow, where we found several users discussing this error (and asking for help with resolving it) during installation12.

10. Example: https://pypi.org/project/talib/
11. Similar mechanism to this: https://github.com/mattsb42/pypi-parker


It can be concluded from this that the typosquatting placeholders served their purpose by preventing installation of the possibly malicious package, and that this kind of mistake is very common.

We found a total of 21230 typosquatting pairs of packages. They were discovered by obtaining a list of the 10000 most popular packages, using the number of downloads over the last month as the metric. Afterward, we computed the Damerau-Levenshtein edit distance between these popular packages and the rest of the packages, producing the typosquatting pairs. During the computation, we also applied a filter to exclude pairs with an edit distance greater than 2 as an optimization step.
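The pairing step can be sketched as follows (illustrative code, not the scripts used in this research; it assumes the jellyfish library for the Damerau-Levenshtein metric, and the two input lists described above):

from jellyfish import damerau_levenshtein_distance

def typosquatting_pairs(popular, all_names, max_distance=2):
    # Compare every non-popular name against every popular name and keep
    # the pairs within the edit-distance threshold.
    candidates = set(all_names) - set(popular)
    pairs = []
    for target in popular:
        for name in candidates:
            if damerau_levenshtein_distance(target, name) <= max_distance:
                pairs.append((target, name))
    return pairs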

There are also several groups/individuals researching various attack vectors. The most prominent sample we found was a package published with the description "Checking out the typosquatting state of PyPI", linking to the website www.pytosquatting.org with the keywords "typosquatting" and "honeypot". From the linked GitHub repository[17], we determined it was part of the research for a conference talk in which the authors preregistered several packages and implemented a pingback beacon that would notify them whenever such a package was installed, to collect research data.13 We found 94 packages related to this experiment.

Our initial focus on finding malicious packages was mainly aimed at finding setup scripts that perform code execution, as this was expected to be a good indicator of possibly malicious code. Surprisingly, after running the first global scan, we were overwhelmed by an enormous number of false positives. After further research into these code execution hits, we found that in at least 128 instances, the code execution pattern was used to manage the version of the package. Each package has a version associated with it, which is expected to grow over time, indicating newer releases.

12. https://stackoverflow.com/questions/54692535/error-while-downloading-talib-for-final-year-project
13. https://github.com/benjaoming/pytosquatting/tree/master/misc/bornhack-talk


Figure 5.2: Screenshot of a typosquatting package

This version information is located in several places in a standard Python package; the primary location is inside the setup script metadata, while additional locations include the package root (for example, the __init__.py script with a __version__ attribute). As development progresses, this version number is increased and needs to be updated in all of these places. In good faith, developers created a central file (for example, version.py) to hold this information, and all other places that require it obtain the version number by executing or parsing this file. Although we understand the decision from the perspective of easier and less error-prone version management, in our opinion, it is a very bad practice.
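The pattern looks roughly like this (an illustrative sketch with hypothetical file and package names, not code from any specific package). A central version.py holds only a line such as __version__ = "1.2.3", and the setup script executes it:

from setuptools import setup

about = {}
with open("version.py") as f:
    # Executing the file is what triggers the scanner's code execution hit,
    # even though the intent is only to read the version string.
    exec(f.read(), about)

setup(
    name="example-package",
    version=about["__version__"],
)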

We also found two instances of packages that even stated in their description that they were created as typosquatting attacks. These packages are "tensoflow"14 and "djamgo"15. Both of these packages were analyzed, and they did not include malicious code,16 hinting that they were more of an experiment. During installation, the tensoflow package created a file in the user's home directory with the message "You have been hacked since", followed by a timestamp.

14. Typosquatting "tensorflow" - a leading machine learning library
15. Typosquatting "Django" - a popular framework for web development
16. At the time of writing.


The djamgo package was empty: it did not contain any code, nor did it perform any actions.

We found 9 packages that accidentally included a ".pypirc" file containing user credentials. One of the affected users is a high-profile target, as he has "maintainer"17 access to several popular Python packages, including Pillow, path.py, zc., etc. If a threat actor got hold of these credentials, the impact would be disastrous due to the popularity of these packages and of other packages depending on them. This incident is still in the remediation phase by the Python security team, and for these reasons, we cannot yet publish the exact details.
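For context, a .pypirc file typically has the following structure (placeholder values shown here); anyone holding such a file can publish releases under the owner's name:

[distutils]
index-servers =
    pypi

[pypi]
repository: https://upload.pypi.org/legacy/
username: example-maintainer
password: s3cret-placeholder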

17. Meaning, he can publish new versions of the packages or replace an existing one.


6 Conclusions and Future Work

In this chapter, we propose several points for future research and ways the results can be improved. In the second half of this chapter, we recapitulate what we learned during this research, as well as its achievements.

6.1 Future work

The development of the Aura framework laid the necessary groundwork for many future research topics. As the framework is in a proof-of-concept state, many areas can be improved, such as:

∙ Introduction of something like CoAST to support languages other than Python while avoiding duplication of code and detection mechanisms

∙ A better server/client architecture: a central server providing audit/scan capability exposed via an API, while clients have only a minimal wrapper that sends payloads to the server for processing. This would be beneficial in organizational deployments, as it would provide a single point of maintenance and control while also enabling us to collect valuable (anonymous) research data

∙ More anomaly analyzers. Beyond just looking at the source code for malicious code, we can introspect the code to also look for vulnerabilities, similar to what Bandit is doing (SQL injection, shell injection propagation, etc.), and even other data that is not in the source code, such as default JSON configuration files, executable files (exe/elf), and documents (PDFs, HTML, etc.). A good starting point for this would be to implement support for taint analysis.

∙ The system can already be extended easily to also scan repositories providing data other than package manager content. One such example is scanning a repository hosting Docker images; such repositories were reportedly hosting malicious images containing cryptocurrency miners


∙ Another valuable addition would be to also scan GitHub repositories (via the linked URL from package metadata), which are more likely to include artifacts that would allow hijacking the package (leaked credentials in commit history), or to find anomalies (differences between the published files and the source code hosted on GitHub)

One of the significant improvements for a future version of the framework would be a dependency resolution system. Currently, the Aura framework scans every file in an isolated manner, ignoring whether there are cross imports between the files. This leads to a loss of information and also opens the possibility for known obfuscation techniques. Consider the following files:

Listing 6.1: a.py

x = open

Listing 6.2: b.py

import a
import fnmatch

fnmatch.os.system('echo "Malware"')
print(a.x("/etc/passwd", "r").read())

When file b.py is executed, malicious code execution is achieved, as well as file access to the local file system. Both of these actions are currently undetectable during the scan from the point of view of the framework; it does not know the semantics behind functions called from different modules. The first obfuscation technique is accessing the target module/function indirectly via another module that imports it. The second technique is obfuscating the function call via a proxy module that renames it. These obfuscations can be solved by introducing the above-mentioned dependency resolution module, which would sort the files for scanning based on their imports (dependencies on other files) and re-use the data from already performed scans. It is also possible to pre-scan built-in modules to learn their cross-dependencies and how other modules are imported within the standard library, enriching the semantic signature engine to detect the usage of proxy modules.
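A minimal sketch of this ordering step could look as follows (illustrative only, not part of Aura; it assumes Python 3.9+ for the graphlib module and a mapping from locally defined module names to their file paths):

import ast
from graphlib import TopologicalSorter  # Python 3.9+

def local_imports(path, known_modules):
    # Collect imports of locally defined modules from a single file.
    with open(path) as f:
        tree = ast.parse(f.read())
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(a.name for a in node.names if a.name in known_modules)
        elif isinstance(node, ast.ImportFrom) and node.module in known_modules:
            found.add(node.module)
    return found

def scan_order(files):
    # files: mapping of module name -> file path. Returns module names
    # ordered so that a module is scanned before the modules importing it
    # (for Listings 6.1 and 6.2, a.py would be scanned before b.py).
    graph = {name: local_imports(path, files) for name, path in files.items()}
    return list(TopologicalSorter(graph).static_order())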

6.2 Conclusions

In this thesis, we dived into the world of package managers and their role in the development process. Although developers use them almost daily, with rapid publishing mechanisms, from a security point of view the whole model is not yet mature. There are several different causes for this. The most significant one is the lack of security monitoring or audits over the published content.

A new format was created in the past to address this - wheels. This format was created to get rid of code execution during the installation phase and to introduce digital signatures. Unfortunately, these mechanisms are not yet enforced, and even the official Python installer, pip, lacks the needed functionality.

As a reaction to this, we created the Aura framework, which is designed to look for anomalies inside the packages published on PyPI. This goal is achieved by analyzing the Abstract Syntax Tree. By applying various transformations to the AST, we can defeat simple obfuscation mechanisms and provide more context for the analysis. Apart from that, Aura is also able to analyze other artifacts, such as signs of package manipulation.
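To illustrate the kind of AST transformation involved, the following toy transformer (a sketch, not Aura's actual code; ast.unparse requires Python 3.9+) folds constant string concatenation so that an obfuscated call such as __import__("o" + "s") becomes visible to a signature engine as __import__("os"):

import ast

class FoldStringConcat(ast.NodeTransformer):
    def visit_BinOp(self, node):
        # Visit children first so nested concatenations fold bottom-up.
        self.generic_visit(node)
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.left.value, str)
                and isinstance(node.right.value, str)):
            return ast.copy_location(
                ast.Constant(node.left.value + node.right.value), node)
        return node

tree = ast.parse('__import__("o" + "s").system("id")')
folded = ast.fix_missing_locations(FoldStringConcat().visit(tree))
print(ast.unparse(folded))  # __import__('os').system('id')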

During our research, we conducted multiple scans of the PyPI repository to look for these anomalies. The scans were performed against an offline PyPI mirror and run on a standard consumer laptop. Completing this was a challenging task on its own: when we first launched the scan, the estimated time to completion was more than two months. Of course, this amount of time was not feasible for our case, so we started optimizing the Aura engine and brought the total time needed to complete the scan down to 14 days. After this optimization was completed, we were finally able to run a fully complete scan, which produced 0.5 GB of raw JSON data that we then needed to analyze manually.


Nine critical findings, such as the ability to hijack an existing package, were found and reported to the Python security team. This incident is currently still in the remediation phase, preventing us from publishing the details. Although we did not find a directly malicious package during our analysis, we found several different individuals and groups creating typosquatting packages. All these packages were analyzed manually, and we did not find signs of malicious intent, only at most harmless messages, such as "You have been hacked", reminding the victim of her mistake.

In May 2019, we established a partnership with the returntocorp (r2c)1 company, which specializes in large-scale analysis, such as scanning NPM2 packages or their associated GitHub repositories. We are currently working together to add support for scanning PyPI packages (and associated GitHub repositories) by customizing the Aura framework to support the r2c platform. This integration would significantly decrease the time it takes to scan the PyPI repository, from 14 days to a matter of a few hours, while also covering additional data sources3.

This partnership shows that we are dedicated to continuing work on this research subject in the upcoming months. The ongoing collaboration will bring the results of this research closer to companies and developers, educating them about the dangers of package managers. We identified several tasks for future work (as described above) that would enhance the produced findings. In fact, we have already started working on one of these future work points: the taint analysis feature, which would allow us to find unknown vulnerabilities by analyzing how untrusted input is propagated inside the application.

1. https://returntocorp.com/
2. NPM is a package repository and collection of tools for the Javascript language, equivalent to PyPI for Python.
3. The above-mentioned GitHub repositories.

Bibliography

[1] Terence J. Parr and Russell W. Quong. "ANTLR: A predicated-LL(k) parser generator". In: Software: Practice and Experience 25.7 (1995), pp. 789–810.
[2] Andreas Moser, Christopher Kruegel, and Engin Kirda. "Limits of static analysis for malware detection". In: Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007). IEEE. 2007, pp. 421–430.
[3] Guido Van Rossum et al. "Python Programming Language." In: USENIX Annual Technical Conference. Vol. 41. 2007, p. 36.
[4] Lennart C. L. Kats and Eelco Visser. "The Spoofax language workbench: rules for declarative specification of languages and IDEs". In: ACM SIGPLAN Notices 45.10 (2010), pp. 444–463.
[5] Jusuk Lee, Kyoochang Jeong, and Heejo Lee. "Detecting metamorphic malwares using code graphs". In: Proceedings of the 2010 ACM Symposium on Applied Computing. ACM. 2010, pp. 1970–1977.
[6] Edward J. Schwartz, Thanassis Avgerinos, and David Brumley. "All You Ever Wanted to Know About Dynamic Taint Analysis and Forward Symbolic Execution (but Might Have Been Afraid to Ask)". In: Proceedings of the 2010 IEEE Symposium on Security and Privacy. SP '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 317–331. isbn: 978-0-7695-4035-1. doi: 10.1109/SP.2010.26. url: http://dx.doi.org/10.1109/SP.2010.26.
[7] Daniel Holth. PEP 427 – The Wheel Binary Package Format 1.0. Sept. 20, 2012. url: https://www.python.org/dev/peps/pep-0427/.
[8] Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. "Generalized Vulnerability Extrapolation Using Abstract Syntax Trees". In: Proceedings of the 28th Annual Computer Security Applications Conference. ACSAC '12. Orlando, Florida, USA: ACM, 2012, pp. 359–368. isbn: 978-1-4503-1312-4. doi: 10.1145/2420950.2421003. url: http://doi.acm.org/10.1145/2420950.2421003.


[9] Michael Henriksen. Gitrob: Putting the Open Source in OSINT. Jan. 12, 2015. url: http://michenriksen.com/blog/gitrob-putting-the-open-source-in-osint/.
[10] Federico Scrinzi. "Behavioral Analysis of Obfuscated Code". 2015.
[11] Nikolai Philipp Tschacher. "Typosquatting in Programming Language Package Managers". Bachelor Thesis. University of Hamburg, Mar. 17, 2016. 74 pp. url: http://incolumitas.com/data/thesis.pdf.
[12] Rabe Abdalkareem et al. "Why do developers use trivial packages? An empirical case study on npm". In: ESEC/SIGSOFT FSE. 2017.
[13] Philippe Arteau. "Static-Analysis, Now you're playing with power!" Hackfest 2017. Apr. 11, 2017. url: https://gosecure.github.io/presentations/2017-11-04_hackfest_static_analysis/Hackfest2017-Static_Analysis.pdf.
[14] Alexandre Decan, Tom Mens, and Philippe Grosjean. "An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems". In: CoRR abs/1710.04936 (2017). arXiv: 1710.04936. url: http://arxiv.org/abs/1710.04936.
[15] npm, Inc. `crossenv` malware on the npm registry. The npm Blog. Aug. 2, 2017. url: http://blog.npmjs.org/post/163723642530/crossenv-malware-on-the-npm-registry.
[16] SK-CSIRT advisory: PyPI Malicious Code. Sept. 9, 2017. url: http://www.nbu.gov.sk/skcsirt-sa-20170909-pypi/index.html.
[17] Hanno Böck and Benjamin Bach. "Package mis-management". BornHack 2018 conference. Aug. 16, 2018. url: https://github.com/benjaoming/pytosquatting/tree/master/misc/bornhack-talk.
[18] npm, Inc. Details about the event-stream incident. The npm Blog. Nov. 27, 2018. url: https://blog.npmjs.org/post/173526807575/reported-malicious-module-getcookies.


[19] npm, Inc. Reported malicious module: getcookies. The npm Blog. May 2, 2018. url: https://blog.npmjs.org/post/173526807575/reported-malicious-module-getcookies.
[20] Markus Zimmermann et al. "Small World with High Risks: A Study of Security Threats in the npm Ecosystem". In: CoRR abs/1902.09217 (2019). arXiv: 1902.09217. url: http://arxiv.org/abs/1902.09217.
[21] Ian Ashley Murdock et al. Debian – The Universal Operating System. url: https://www.debian.org/index.en.html.
[22] Catalin Cimpanu. Backdoored Python Library Caught Stealing SSH Credentials. url: https://www.bleepingcomputer.com/news/security/backdoored-python-library-caught-stealing-ssh-credentials/.
[23] Michael Meli, Matthew R. McNiece, and Bradley Reaves. How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories. url: https://www.ndss-symposium.org/ndss-paper/how-bad-can-it-git-characterizing-secret-leakage-in-public-github-repositories/.
[24] Mohammad Taha Khan et al. "Every Second Counts: Quantifying the Negative Externalities of Cybercrime via Typosquatting". In: 2015 IEEE Symposium on Security and Privacy (SP). San Jose, CA, USA: IEEE. isbn: 978-1-4673-6949-7. doi: 10.1109/SP.2015.16. url: https://ieeexplore.ieee.org/abstract/document/7163023/.
[25] Author Unknown. Bandit. url: https://github.com/PyCQA/bandit.
[26] Author Unknown. coala – linting and fixing code for all languages. url: https://coala.io/#/home?lang=Python.


Glossary

A

Abstract Syntax Tree (AST)
A representation of a parsed computer program in a tree-like structure. Often used by compilers to perform optimizations and to transform the code into machine-executable instructions. 23

P

Package manager
A software abstraction to install, update, or remove packages using the command line. The primary functions also include resolving dependencies and compiling the source code, if any. 3

pip
Python Package Installer. The officially recommended package manager for Python. 5


A Appendix

A.1 Live analysis

A live analysis is excellent for observing changes being made by malware, which in turn helps to create (even automatically) signatures for quick detection in the wild. Running a live analysis requires many more resources than a static one. This high resource cost is often caused by the need for a live sandbox system, usually some kind of virtual machine or emulator that needs to imitate a usual live system (also in terms of memory and processing power). These environments often have one and the same configuration between runs (system version, imitated hardware, etc.), sometimes making it apparent to malware that it is being executed inside such a sandbox. On the contrary, a considerable advantage lies in dealing with encrypted payloads or obfuscations. At present, it is common for malware to encrypt its payload to avoid being detected easily. Upon execution, the malware automatically decrypts this payload and executes it. Sandboxes and emulators are very convenient in being able to take snapshots of memory after the decryption, making reverse engineering much easier. There is also a risk that, if such a sandbox is not properly isolated, malware could escape the sandbox and infect other systems, including the one where the sandbox is running.

A.1.1 Comparison of static analysis vs. live analysis approaches

Live Analysis

∙ Scalability - High resource cost needed to emulate the target device and environment, including CPU and memory. This is often accomplished by using virtual machines/emulators that do not scale well

∙ Obfuscations - Capturing memory snapshots is often a built-in functionality. This feature makes it easy to analyze highly obfuscated samples or those that utilize encryption.


∙ Risks - One of the biggest risks, also seen in the past, is that malware can escape a controlled sandbox environment by utilizing exploits against the virtualization software/emulator1. This is compounded by the fact that the sandbox should also have a functional internet/network connection, which malware often checks for before proceeding or downloading the next stage.

Static Analysis

∙ Scalability - Very scalable, since it often has a predictable resource usage and can easily be optimized in terms of resources. It also allows easily running multiple analyses in parallel on the same server.

∙ Obfuscation - Highly limited, but this can be partially addressed by analyzing the code execution flow. Being able to analyze highly obfuscated code or samples that use encryption mechanisms often requires implementing an almost fully functional language interpreter. In some cases, this might even be an impossible task without direct code execution.

∙ Risks - Very low, since there is no direct code execution. Of course, it is possible to target flaws in the parser or the analyzer itself2, but such an attack is much harder to achieve.

A.2 setup.py from the talib package

We reformatted the following code to better fit the page while preserving its functionality, as the original contained very long lines. The changes we made were to replace the common message with the msg variable to shorten these lines, and to move the comments from the end of each line to the line above.

1. https://www.vmware.com/security/advisories/VMSA-2019-0005.html
2. For example https://www.exploit-db.com/exploits/23524

from distutils.core import setup
from setuptools.command.develop import develop
from setuptools.command.install import install
from setuptools.command.egg_info import egg_info
from subprocess import check_call

msg = "You probably meant to install and run ta-lib"


class PostDevelopCommand(develop):
    """Post-installation for development mode."""

    def run(self):
        raise Exception(msg)
        develop.run(self)


class PostInstallCommand(install):
    """Post-installation for installation mode."""

    def run(self):
        raise Exception(msg)
        install.run(self)


class EggInfoCommand(egg_info):
    """Post-installation for installation mode."""

    def run(self):
        print(msg)
        egg_info.run(self)


setup(
    name='talib',  # this must be the same as the name above
    packages=['talib'],
    version='0.1.1',
    description='A package to prevent exploit',
    author='The Guardians',
    author_email='[email protected]',
    cmdclass={
        'develop': PostDevelopCommand,
        'install': PostInstallCommand,
        'egg_info': EggInfoCommand,
    },  # I'll explain this in a second
    # arbitrary keywords
    keywords=['testing', 'logging', 'example'],
    entry_points={
        'console_scripts': [
            'talib = talib.cli:cli',
        ],
    }
)

B ssh-decorate incident evidence

Figure B.1: A screenshot of the opened GitHub issue by user mowshon after he found the malicious code

2. http://archive.today/WUlRu


Figure B.2: A screenshot of the malicious code as posted on the Reddit forum2 after the news of the incident started to spread

C Built-in Aura analyzers

C.1 Produced hits

In this appendix, we include a list of all the possible hits that can be produced by the Aura framework and were developed as built-in functionality. These hits are also the primary output produced when we conducted the global PyPI scans, used for further (manual) analysis. It is also important to note that a hit can set an "informational" flag on itself and still be produced, as it can be acted upon by other analyzers, yet it is filtered out when the data is presented to the user or exported into the JSON format. This behavior can be disabled by using the verbose command line flag; we excluded the informational hits during the global scan, as they provide only little value when analyzing the data at scale. This informational flag is often set when the score is equal to zero, but this depends on the analyzer producing the output.

Here is a list of all possible hits that can be produced at the time of writing:

∙ ArchiveAnomaly - Produced by the Archive analyzer when an archive (Python package) contains a nonstandard path, such as an absolute location (starting with "/") or a parent specifier "..". Such paths should not be included in the archive, as they can overwrite system locations when unpacking, and they provide a hint that the archive was either manipulated or created using non-standard/outdated tools.

∙ Wheel - This hit is produced when an anomaly is found in the wheel package structure by the Wheel analyzer. Such anomalies include checksums that do not match when checking the entries against the RECORD file, or a missing entry. This analyzer was created when we found that pip does not validate the RECORD file entries, as there was evidence that some wheel packages had been manipulated/created by hand.


∙ FunctionCall - Produced by the Execution Flow analyzer when a function call is detected, as specified by the semantic rules/signatures.

∙ ModuleImport - Produced by the Execution Flow analyzer when a module import is detected, as specified by the semantic rules/signatures.

∙ URL - Produced by the Data Finder analyzer, which looks for strings in the AST and checks if they start with a URL locator (http:// or https://).

∙ Base64Blob - Produced by the Data Finder analyzer, which looks for strings in the AST that look like base64-encoded blobs. It attempts to decode them, and, if successful, this hit is produced with the relevant data (a minimal sketch of this check follows the list).

∙ SetupScript - Produced by the setup.py analyzer, which looks specifically for package setup scripts. It can be informational, in which case it contains only metadata from the parsed script (module name, author, homepage, and other keyword arguments of the setup function). This hit can also be produced when an anomaly is found, such as code execution or network communication functionality directly in the setup script.

∙ SensitiveFile - Produced by the Filesystem structure analyzer, which looks at the tree structure and file names rather than analyzing the content of files. It is produced when a filename matches a sensitive file pattern, as specified in the semantic rules.

∙ SuspiciousFile - Produced by the Filesystem structure analyzer when a filename matching the suspicious file pattern is found. This pattern currently includes "*.pyc" (compiled Python bytecode) and files starting with "." (dot, meaning the file is hidden when viewing the directory content).

∙ YaraMatch - Produced by the Yara analyzer when a Yara signature/rule matches the file input. This hit is populated with the metadata from the Yara signature (including the score) and also the patterns found.
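For illustration, a check in the spirit of the Base64Blob detection could look as follows (a minimal sketch, not the actual Aura implementation; the string constants it would inspect come from the AST walk described above):

import base64
import binascii

def looks_like_base64_blob(value, min_length=12):
    # Very short strings decode "successfully" too often, so skip them.
    if len(value) < min_length:
        return None
    try:
        # validate=True rejects strings containing non-alphabet characters.
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None
    return decoded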
