This is a self-archived version of an original article. This version may differ from the original in pagination and typographic details.

Author(s): Kairajärvi, Sami; Costin, Andrei; Hämäläinen, Timo

Title: ISAdetect : Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code

Year: 2020

Version: Accepted version (Final draft)

Copyright: © 2020 ACM

Rights: In Copyright

Rights url: http://rightsstatements.org/page/InC/1.0/?language=en

Please cite the original version: Kairajärvi, S., Costin, A., & Hämäläinen, T. (2020). ISAdetect: Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code. In CODASPY '20: Proceedings of the 10th ACM Conference on Data and Application Security and Privacy (pp. 376-380). ACM. https://doi.org/10.1145/3374664.3375742

ISAdetect: Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code

Sami Kairajärvi∗, Andrei Costin, Timo Hämäläinen
University of Jyväskylä, Jyväskylä, Finland
[email protected], [email protected], [email protected]

ABSTRACT

Static and dynamic binary analysis techniques are actively used to reverse engineer software's behavior and to detect its vulnerabilities, even when only the binary code is available for analysis. To avoid analysis errors due to misreading op-codes for a wrong CPU architecture, these analysis tools must precisely identify the Instruction Set Architecture (ISA) of the object code under analysis. The variety of CPU architectures that modern security and reverse engineering tools must support is ever increasing due to the massive proliferation of IoT devices and the diversity of firmware and malware targeting those devices. Recent studies concluded that falsely identifying the binary code's ISA alone caused about 10% of failures of IoT firmware analysis. The state-of-the-art approaches to detecting the ISA of executable object code look promising, and their results demonstrate effectiveness and high performance. However, they lack the support of publicly available datasets and toolsets, which makes the evaluation, comparison, and improvement of those techniques, datasets, and machine learning models quite challenging (if not impossible). This paper bridges multiple gaps in the field of automated and precise identification of the architecture and endianness of binary files and object code. We develop from scratch the toolset and datasets that are lacking in this research space. As such, we contribute a comprehensive collection of open data, open source, and open API web-services. We also attempt experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state-of-the-art methods. When training and testing classifiers using solely code sections from executable binary files, all our classifiers performed equally well, achieving over 98% accuracy. The results are consistent and comparable with the current state of the art, and hence support the general validity of the algorithms, features, and approaches suggested in those works.

CCS CONCEPTS

• Security and privacy → Software reverse engineering; • Computing methodologies → Machine learning; • Computer systems organization → Embedded and cyber-physical systems.

KEYWORDS

Binary code analysis, Firmware analysis, Instruction Set Architecture (ISA), Supervised machine learning, Reverse engineering, Malware analysis, Digital forensics

ACM Reference Format:
Sami Kairajärvi, Andrei Costin, and Timo Hämäläinen. 2020. ISAdetect: Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code. In Tenth ACM Conference on Data and Application Security and Privacy (CODASPY'20), March 16–18, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3374664.3375742

∗ This paper is based on the author's MSc thesis [21] and an extended pre-print version [22].

1 INTRODUCTION

Reverse engineering and analysis of binary code has a wide spectrum of applications [24, 35, 39], ranging from vulnerability research [4, 9, 10, 34, 43] to binary patching and translation [37], and from digital forensics [8] to anti-malware and Intrusion Detection Systems (IDS) [25, 41]. For such applications, various static and dynamic analysis techniques and tools are constantly being researched, developed, and improved [3, 6, 15, 24, 30, 38, 42].

Regardless of their end goal, one of the important steps in these techniques is to correctly identify the Instruction Set Architecture (ISA) of the op-codes within the binary code. Some techniques can perform the analysis using architecture-independent or cross-architecture methods [16, 17, 32]. However, many of those techniques still require exact knowledge of the binary code's ISA. For example, recent studies concluded that falsely identifying the binary code's ISA caused about 10% of failures of IoT/embedded firmware analysis [9, 10].
Sometimes the CPU architecture is available in the executable format's header sections, for example in the ELF file format [29]. However, this information is not guaranteed to be universally available for analysis. There are multiple reasons for this, and we will detail a few of them. The advances in Internet of Things (IoT) technologies bring to the game an extreme variety of hardware and software, in particular new CPU architectures and new OSs or OS-like systems [27]. Many of those devices are resource-constrained, and their binary code comes without sophisticated headers and OS abstraction layers. At the same time, digital forensics and IDSs sometimes may have access only to a fraction of the code, e.g., from an exploit (shell-code), a malware trace, or a memory dump. For example, many shell-codes are exactly this – a quite short, formatless and headerless sequence of CPU op-codes for a target system (i.e., a combination of hardware, operating system, and abstraction layers) that performs a presumably malicious action on behalf of the attacker [18, 33]. In such cases, though possible in theory, it is quite unlikely that the full code, including the headers specifying the CPU ISA, will be available for analysis. Finally, even in the traditional computing world (e.g., x86/x86_64), there are situations where the hosts contain object code for CPU architectures other than that of the host itself. Examples include firmware for network cards [2, 13, 14], various management co-processors [26], and device drivers [5, 20] for USB [28, 40] and other embedded devices that contain their own specialized processors and provide specific services (e.g., encryption, data codecs) implemented as peripheral firmware [11, 23]. Even worse, more often than not the object code for such peripheral firmware is stored using less traditional or non-standard file formats and headers (if any), or is embedded inside the device drivers themselves, resulting in mixed-architecture code streams.
Currently, several state-of-the-art works try to address the challenge of accurately identifying the CPU ISA of executable object code sequences [8, 12]. Their approaches look promising, as the results demonstrate effectiveness and high performance. However, they lack the support of publicly available datasets and toolsets, which makes the evaluation, comparison, and improvement of those techniques, datasets, and machine learning models quite challenging (if not impossible). With this paper, we bridge multiple gaps in the field of automated and precise identification of the architecture and endianness of binary files and object code. We develop from scratch the toolset and datasets that are lacking in this research space. To this end, we release a comprehensive collection of open data, open source, and open API web-services. We attempt experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state-of-the-art methods [8, 12], and we also propose and experimentally validate new approaches to the main classification challenge, where we obtain consistently comparable, and in some scenarios better, results. The results we obtain in our extensive set of experiments are consistent and comparable with prior art, and hence support the general validity and soundness of both the existing and the newly proposed algorithms, features, and approaches.

1.1 Contributions

In this paper, we present the following contributions:

• First and foremost, we implement and release as open source the code and toolset necessary to reconstruct and re-run the experiments from this paper as well as from the state-of-the-art works of Clemens [8] and De Nicolao et al. [12]. To our knowledge, it is the first such toolset to be publicly released.

• Second and equally important, we release as open data the machine learning models and data necessary both to validate our results and to further expand the datasets and the research field. To our knowledge, it is both the first and the largest such dataset to be publicly released.

• We release both the toolset and the dataset as open source and open data, accessible at https://github.com/kairis/isadetect and http://urn.fi/urn:nbn:fi:att:693a3e3a-976a-4eac-8c3d-a4a62619f8b1

1.2 Organization

The rest of this paper is organized as follows. We detail our methodology, experimental setups, and datasets in Section 2. We provide a detailed analysis of results in Section 3. Finally, we discuss future work and conclude with Section 4.

2 DATASETS AND EXPERIMENTAL SETUP

2.1 Datasets

We started with the dataset acquisition challenge. Despite the existence of several state-of-the-art works, unfortunately neither their datasets nor their toolsets are publicly available.¹ To overcome this limitation, we had to develop a complete toolset and pipeline able to optimally download a dataset that is both large and representative enough. We release our data acquisition toolset as open source. The pipeline used to acquire our dataset is depicted in Figure 1. We chose the Debian repositories for several reasons. First, Debian is a long-established and trusted project, and therefore a good source of software packages and related data. Second, it is bootstrapped from the same tools and sources to build the Linux kernel and userspace software packages for a very diverse set of CPU and ABI architectures. The downloaded and pre-processed dataset can be summarized as follows: about 1600 ISO/Jigdo files taking up around 1843 GB; approximately 79000 DEB package files taking up about 36 GB; around 96000 ELF files taking up about 27 GB; and about 96000 ELF code sections taking up approximately 17 GB (Table 1). A detailed breakdown of our dataset is in Table 2.

¹ A post-processed dataset from Clemens [8] was generously provided by the author privately on request – that dataset, however, is not yet publicly available.

Figure 1: ISAdetect general workflow of the toolset (crawl Debian repository; convert Jigdo files to ISO images where needed; unpack packages; extract selected packages; extract metadata, code sections, and features from the binaries).

Table 1: ISAdetect dataset summary.

Type | Approx. # of files in dataset | Approx. total size
.iso files | ~1600 | ~1843 GB
.deb files | ~79000 | ~36 GB
ELF files | ~96000 | ~27 GB
ELF code sections | ~96000 | ~17 GB
Our dataset covers 23 distinct architectures, which is in line with and comparable to Clemens [8] (20 architectures) and De Nicolao et al. [12] (21 architectures). Most of the CPU architectures overlap with existing works, but there are a few new ones (marked in Table 2). At the same time, the sample-sets used in our experiments have some significant differences compared to the state of the art. First, the total number of 66685 samples in our experiments is several times larger than those used by both Clemens [8] (16785 samples) and De Nicolao et al. [12] (15290 samples). Second, compared to existing works, our sample-set size per architecture is both larger and more balanced. Using more balanced sample-sets should give more accurate results when evaluating and comparing classifier performance, as class imbalance in the dataset can cause sub-optimal classification performance [19]. When creating our sample-sets for each architecture, we set forth several constraints. On the one hand, we decided that the minimum code section in an ELF file should be 4000 bytes (4K), as this is the code size at which all classifiers were shown to converge and provide high accuracy at the same time (see Fig. 2 in Clemens [8]). On the other hand, we wanted to have the sample-sets as balanced as possible between all the architectures. Given these parameters, from the initial 96395 ELF files in the downloaded dataset, our toolset filtered 66685 samples with an approximate average of 3000 samples per architecture sample-set (Table 2). Importantly, our toolset can be parametrized to download more files, and to filter the sample-sets based on different criteria as dictated by various use cases.
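To make the filtering step concrete, the following minimal sketch (a simplified illustration using the pyelftools library, not the released toolset itself) extracts the executable sections of an ELF file and applies the 4000-byte minimum:

```python
# Sketch: keep only ELF binaries whose executable code sections total
# at least 4000 bytes, and dump those sections for feature extraction.
# Illustrative only; the released ISAdetect toolset may differ.
import sys
from elftools.elf.elffile import ELFFile
from elftools.elf.constants import SH_FLAGS

MIN_CODE_BYTES = 4000  # convergence threshold observed by Clemens [8]

def code_sections(path):
    """Return the concatenated bytes of all executable sections."""
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        chunks = []
        for section in elf.iter_sections():
            # SHF_EXECINSTR marks sections that contain machine code.
            if section['sh_flags'] & SH_FLAGS.SHF_EXECINSTR:
                chunks.append(section.data())
        return b''.join(chunks)

if __name__ == '__main__':
    code = code_sections(sys.argv[1])
    if len(code) >= MIN_CODE_BYTES:
        sys.stdout.buffer.write(code)   # sample accepted
    else:
        sys.exit(1)                     # too little code; skip sample
```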
Table 2: ISAdetect dataset details (per architecture). The markers † and ‡ flag architectures not covered by one or both of the prior works [8, 12] (see Section 2.1).

Architecture | Binaries in sample-set | Wordsize | Endianness | # of ELF files in dataset | Size in dataset (GB)
alpha | 3004 | 64 | Little | 3971 | 1.56
amd64 | 3005 | 64 | Little | 4366 | 1.01
arm64 | 2779 | 64 | Little | 3636 | 0.76
armel | 3004 | 32 | Little | 4000 | 0.76
armhf | 2747 | 32 | Little | 3995 | 0.62
hppa | 3004 | 32 | Big | 4830 | 1.44
i386 | 3004 | 32 | Little | 5109 | 0.96
ia64 | 3005 | 64 | Little | 4984 | 2.64
m68k | 3004 | 32 | Big | 4400 | 1.15
mips † | 3003 | 32 | Big | 3565 | 0.90
mips64el ‡ | 3004 | 64 | Little | 4312 | 1.98
mipsel | 3004 | 32 | Little | 3775 | 0.91
powerpc | 2050 | 32 | Big | 3631 | 1.22
powerpcspe ‡ | 3004 | 32 | Big | 3961 | 1.59
ppc64 | 2480 | 64 | Big | 2833 | 1.68
ppc64el ‡ | 3005 | 64 | Little | 3523 | 0.91
riscv64 ‡ | 3005 | 64 | Little | 4440 | 1.15
s390 | 3003 | 32 | Big | 5164 | 0.58
s390x | 2953 | 64 | Big | 3538 | 0.93
sh4 | 3004 | 32 | Little | 5930 | 1.28
sparc | 3003 | 32 | Big | 4980 | 0.54
sparc64 | 2606 | 64 | Big | 3276 | 1.33
x32 ‡ | 3005 | 32 | Little | 4176 | 1.51
Total | 66685 | – | – | 96395 | 27.41

2.2 Machine Learning

We then continued with the experiment reconstruction and cross-validation of the state of the art. For training and testing our machine learning classifiers, we used the following complete feature-set consisting of 293 features. The first 256 features map to the Byte Frequency Distribution (BFD) (used by Clemens [8]). The next 4 features map to the "4 endianness signatures" (used by Clemens [8]). The following 31 features map to function epilog and function prolog signatures for amd64, arm, armel, mips32, powerpc, powerpc64, s390x, and x86 (developed by the angr framework [36] and also used by De Nicolao et al. [12]). The final 2 features map to "powerpcspe signatures" that were developed specifically for this paper.² To this end, we extract the mentioned features from the code sections of the ELF binaries in the sample-sets (column "Binaries in sample-set" in Table 2). We then save the extracted features into a CSV file ready for input to and processing by machine learning frameworks.

² The signatures are being contributed back to open-source projects such as angr and binwalk.

In order to replicate and validate the approach of Clemens [8], we used the Weka framework along with the exact list and settings of classifiers used by the author. We used non-default parameters (acquired by manual tuning) only for the Neural Net (NN) when training that classifier on our complete dataset, since the parameters used by Clemens [8] were specific to their dataset – we also used only the lists of architectures and features used by the author.

In order to replicate and validate the approach of De Nicolao et al. [12], we used scikit-learn [31]. The authors used only a logistic regression (LR) classifier, to which they added L1 regularization compared to Clemens [8]. In addition, we used Keras [7] (a high-level API for neural networks) in order to see whether the framework used has any effect on classification accuracy. Here, too, we used only the list of architectures and features used by the authors.
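As an illustration of the first 260 of these features, the sketch below computes the BFD and endianness-signature counts. The concrete bigrams follow our reading of Clemens [8] and should be treated as an assumption of this sketch:

```python
# Sketch: BFD part of the feature vector plus illustrative endianness
# signature counts. The exact bigrams below are an ASSUMPTION of this
# sketch (byte pairs that typically encode small immediates in
# opposite byte orders), not a verbatim copy of Clemens' features.
from collections import Counter

def bfd_features(code: bytes):
    """256 byte-frequency-distribution features, normalized by length."""
    counts = Counter(code)
    total = max(len(code), 1)
    return [counts.get(b, 0) / total for b in range(256)]

def endianness_features(code: bytes):
    """4 bigram counts acting as endianness signatures."""
    bigrams = [b'\x00\x01', b'\x01\x00', b'\xff\xfe', b'\xfe\xff']
    return [code.count(bg) for bg in bigrams]

def feature_vector(code: bytes):
    # 256 BFD + 4 endianness = 260 of the 293 features; the remaining
    # 33 (prolog/epilog and powerpcspe signatures) are pattern-match
    # counts and are omitted here.
    return bfd_features(code) + endianness_features(code)
```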
3 RESULTS AND ANALYSIS

3.1 Training/testing with code-only sections

First, we compare the performance of multiple classifiers trained on code-only sections, when classifying code-only input. For this we use 10-fold cross-validation and the features extracted from code-only sections of the test binaries. We also evaluate the effect of various feature-sets on classification performance by calculating performance measures with the "all features" set, the "BFD-only" feature-set, and the "BFD+endianness" feature-set. We then cross-validate the results by Clemens [8] and compare them to our results.

Using the Weka framework, we trained and tested multiple different classifiers using different feature-sets. BFD corresponds to using only the byte frequency distribution, while BFD+endianness adds the architecture endianness signatures introduced by Clemens [8]. The complete feature-set includes the new architectures as well as the new signatures for powerpcspe. The performance metrics are weighted averages, i.e., the sum of the metric over all the classes, weighted by the number of instances in each architecture class. The results, also compared against the results presented by Clemens [8], can be observed in Table 3. We marked with an asterisk (*) the results that we obtained using different parameters than those in [8].
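For readers who wish to reproduce this protocol outside Weka, a minimal scikit-learn equivalent could look as follows (illustrative classifier settings, not Clemens' exact Weka parameters):

```python
# Sketch: 10-fold cross-validation over several classifiers with
# weighted-average F1, mirroring the protocol above. The actual
# experiments used Weka with the classifier settings from Clemens [8].
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def evaluate(X, y):
    """X: (n_samples, 293) feature matrix; y: architecture labels."""
    classifiers = {
        '1-NN': KNeighborsClassifier(n_neighbors=1),
        '3-NN': KNeighborsClassifier(n_neighbors=3),
        'Decision tree': DecisionTreeClassifier(),
        'Random forest': RandomForestClassifier(),
        'Naive Bayes': GaussianNB(),
        'SVM': SVC(),
        'Logistic regression': LogisticRegression(max_iter=1000),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in classifiers.items():
        # Weighted F1 matches the per-class weighting described above.
        scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
        print(f'{name}: F1={np.mean(scores):.3f}')
```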

Table 3: ISAdetect classifiers' performance with different feature-sets (except *, same parameters as in Clemens [8]). Columns: weighted Precision/Recall/AUC/F1 (ISAdetect, all features), followed by accuracy for: all features (293) on ISAdetect data; our approach on the original Clemens dataset (§); BFD+endianness (ISAdetect and Clemens [8]); BFD only (ISAdetect and Clemens [8]).

Classifier (Weka all) | Precision | Recall | AUC | F1 | Acc. all feat. (ISAdetect) | Acc. § on Clemens dataset | Acc. BFD+endian (ISAdetect) | Acc. BFD+endian (Clemens [8]) | Acc. BFD (ISAdetect) | Acc. BFD (Clemens [8])
1-NN | 0.983 | 0.983 | 0.991 | 0.983 | 0.983 | 0.985 | 0.911 | 0.927 | 0.895 | 0.893
3-NN | 0.994 | 0.994 | 0.999 | 0.994 | 0.993 | 0.983 | 0.957 | 0.949 | 0.902 | 0.898
Decision tree | 0.992 | 0.992 | 0.998 | 0.992 | 0.992 | 0.980 | 0.993 | 0.980 | 0.936 | 0.932
Random tree | 0.966 | 0.966 | 0.982 | 0.966 | 0.965 | 0.906 | 0.953 | 0.929 | 0.899 | 0.878
Random forest | 0.996 | 0.996 | 1.000 | 0.996 | 0.996 | 0.997 | 0.992 | 0.964 | 0.904 | 0.904
Naive Bayes | 0.991 | 0.991 | 0.999 | 0.991 | 0.990 | 0.989 | 0.990 | 0.958 | 0.932 | 0.925
BayesNet | 0.992 | 0.992 | 1.000 | 0.992 | 0.991 | 0.950 | 0.994 | 0.922 | 0.917 | 0.895
SVM (SMO) | 0.997 | 0.997 | 1.000 | 0.997 | 0.997 | 0.996 | 0.997 | 0.983 | 0.931 | 0.927
Logistic regression | 0.989 | 0.988 | 0.998 | 0.989 | 0.988 | 0.990 | 0.997 | 0.979 | 0.939 | 0.930
Neural net | 0.995* | 0.994* | 1.000* | 0.994* | 0.994* | 0.992 | 0.919 | 0.979 | 0.940 | 0.940

As can be seen in Table 3, the results are in line with the ones presented by Clemens [8], even though we constructed and used our own datasets. We also ran our approach on the original Clemens dataset, where we exceeded Clemens' results on their own dataset by at least 1.4% (see column "§ on Clemens dataset"). In our experiments, the complete dataset (with added architectures and all features considered) increased the accuracy of all classifiers (in some cases by up to 7%) when compared to the results of Clemens [8]. This could be due to a combination of a larger dataset, more balanced sets for each CPU architecture class, and the use of only binaries with code sections larger than 4000 bytes.
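For completeness, the remaining signature-style features from Section 2.2 (function prolog/epilog and powerpcspe counts) can be computed as normalized pattern-match counts. The sketch below uses a single well-known x86 prolog byte sequence purely for illustration; it is not taken from angr's actual signature database or from our powerpcspe signatures:

```python
# Sketch: signature-style features as pattern-match counts over code
# bytes. The pattern below is ILLUSTRATIVE; the real features use
# curated per-ISA byte regexes from angr [36] / De Nicolao et al. [12].
import re

EXAMPLE_SIGNATURES = {
    # Classic x86 function prolog: push ebp; mov ebp, esp
    'x86_prolog_example': re.compile(rb'\x55\x89\xe5'),
}

def signature_features(code: bytes):
    """One count per signature, normalized by code length."""
    total = max(len(code), 1)
    return [len(sig.findall(code)) / total
            for sig in EXAMPLE_SIGNATURES.values()]
```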

3.1.1 Effect of test sample code size on classification performance.

Next, we study whether the sample size has an effect on the classification performance. For this, we test the classifiers against a test set of code sections of increasingly varying size, as also performed by both Clemens [8] and De Nicolao et al. [12]. If the performance of such classifiers is good enough with only small fragments of the binary code, those classifiers could be used in environments where only a part of the executable is present. For example, small (128 bytes or less) code fragments could be encountered in digital forensics when only a portion of malware or exploit code is successfully extracted from an exploited smartphone or IoT device. For this test, the code fragments were taken from code-only sections using random sampling, in order to avoid any bias that could come from using only code from the beginning of code sections [8]. We present the results of this test in Figure 2.

Figure 2: Impact of the test sample size on accuracy (accuracy vs. sample size from 8 bytes to 64 KB, one curve per classifier, including the Weka, scikit-learn, and Keras logistic regression implementations).

SVM performed the best, with almost 50% accuracy even at the smallest sample size of 8 bytes. Also, SVM along with 3-nearest neighbors achieved 90% accuracy at 128 bytes. Logistic regression implemented in scikit-learn and in Keras were very close performance-wise, both implementations achieving 90% accuracy at 256 bytes. Surprisingly, logistic regression implemented in Weka under-performed and required 2048 bytes to reach 90% accuracy. When cross-validating and comparing the results, the classifiers that performed the best in our varying-sample-size experiments also performed well in the experiments by Clemens [8]. However, in our experiment not all the classifiers achieved 90% accuracy at 4000 bytes as experienced by Clemens [8]: Decision Tree and Random Tree achieved accuracies of only 85% and 75%, respectively.
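The random-sampling construction of the variable-size test sets can be sketched as follows (assuming code sections are available as byte strings with their architecture labels):

```python
# Sketch: draw fixed-size fragments from random offsets of code-only
# sections, avoiding bias toward section beginnings (Section 3.1.1).
import random

SIZES = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

def random_fragment(code: bytes, size: int, rng: random.Random) -> bytes:
    """Return `size` bytes starting at a uniformly random offset."""
    if len(code) <= size:
        return code
    start = rng.randrange(len(code) - size)
    return code[start:start + size]

def fragment_test_set(sections, size, rng=random.Random(0)):
    # `sections` is an iterable of (code_bytes, architecture_label).
    return [(random_fragment(code, size, rng), label)
            for code, label in sections]
```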

3.1.2 Effect of different frameworks on performance of logistic regression.

From all the different classifiers available, De Nicolao et al. [12] used only logistic regression, with scikit-learn as their machine learning framework of choice. Logistic regression has a couple of parameters that affect the classification performance. The authors used grid search to identify the best value for C, which stands for the inverse of the regularization strength, and found the value of 10000 to give the best results in their case. Since the dataset itself affects the result, we ran a grid search for the scikit-learn model developed on our dataset. The C values of 10000, 1000, 100, 10, 1, and 0.1 were tested, and we found that for our case the value of 1000 gave the best results. Similarly, for the Keras model, we found the value of 0.0000001 for C to provide the best accuracy. Table 4 presents our results for logistic regression using 10-fold cross-validation on code-only sections when tested in all the different frameworks. With this experiment, for example, we found that for the same dataset, the logistic regression implemented in scikit-learn and Keras provided better results (F1=0.998) when compared to Weka (F1=0.989).
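A minimal sketch of this grid search in scikit-learn follows; the L1-regularized logistic regression setup mirrors De Nicolao et al. [12], while the solver choice is an assumption of the sketch:

```python
# Sketch: grid search over the inverse regularization strength C for
# an L1-regularized logistic regression (Section 3.1.2).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune_C(X, y):
    grid = {'C': [10000, 1000, 100, 10, 1, 0.1]}
    # liblinear supports the L1 penalty; solver choice is an assumption.
    lr = LogisticRegression(penalty='l1', solver='liblinear',
                            max_iter=1000)
    search = GridSearchCV(lr, grid, cv=10, scoring='f1_weighted')
    search.fit(X, y)
    return search.best_params_['C']   # 1000 on our dataset (Sec. 3.1.2)
```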

Table 4: Logistic Regression (LR) 10-fold cross-validation performance in different machine learning frameworks.

Classifier | Precision | Recall | AUC | F1 measure | Accuracy
Weka | 0.989 | 0.988 | 0.998 | 0.989 | 0.988
scikit-learn | 0.998 | 0.998 | 0.998 | 0.998 | 0.996
Keras | 0.998 | 0.998 | 0.998 | 0.998 | 0.997
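In Keras, multinomial logistic regression reduces to a single dense softmax layer with an L1 kernel penalty. The sketch below makes the framework comparison concrete; mapping the reported best value (0.0000001) directly to the L1 factor is an assumption of the sketch, not a statement about the original experiment code:

```python
# Sketch: logistic regression in Keras -- one dense softmax layer with
# an L1 kernel regularizer. Treating the reported value 1e-7 as the L1
# factor is an ASSUMPTION of this sketch.
from tensorflow import keras

def build_lr(n_features=293, n_classes=23, l1_factor=1e-7):
    model = keras.Sequential([
        keras.layers.Dense(
            n_classes,
            activation='softmax',
            kernel_regularizer=keras.regularizers.l1(l1_factor),
            input_shape=(n_features,),
        )
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```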

3.2 Training with code-only sections and testing with complete binaries

We also explored (Table 5) how well the classifiers perform when given the task of classifying a complete binary (i.e., containing headers, and code and data sections). In fact, De Nicolao et al. [12] tested their classifier's performance on complete binaries (i.e., full executables). Therefore, in this work we test all the different classifiers used by Clemens [8] and De Nicolao et al. [12] against complete binaries, using a separate test set consisting of 500 binaries for each architecture (about 1.5 times more than in [12]). The classifiers used in this test were previously trained using code-only sections. Our analysis shows that Random Forest performed the best, having the highest performance measures of 0.901 for accuracy and 0.995 for AUC, while Logistic Regression with scikit-learn did not perform as well as reported by De Nicolao et al. [12]. Time-wise, each algorithm took seconds to classify all binaries in this test set, except Nearest Neighbor, which took 15 minutes (being a lazy classifier).

Table 5: ISAdetect classifiers' performance – trained on code-only sections, tested on complete binaries.

Classifier (Weka default) | Precision | Recall | AUC | F1 measure | Accuracy
1-NN | 0.871 | 0.742 | 0.867 | 0.772 | 0.741
3-NN | 0.876 | 0.749 | 0.892 | 0.773 | 0.749
Decision Tree | 0.845 | 0.717 | 0.865 | 0.733 | 0.716
Random Tree | 0.679 | 0.613 | 0.798 | 0.619 | 0.613
Random Forest | 0.912 | 0.902 | 0.995 | 0.892 | 0.901
Naive Bayes | 0.807 | 0.420 | 0.727 | 0.419 | 0.420
Bayes Net | 0.886 | 0.844 | 0.987 | 0.840 | 0.844
SVM/SMO | 0.883 | 0.733 | 0.971 | 0.766 | 0.732
LR/Logistic Regression | 0.875 | 0.718 | 0.978 | 0.728 | 0.718
LR (scikit-learn) | 0.913 | 0.780 | 0.780 | 0.794 | 0.579
LR (Keras) | 0.921 | 0.831 | 0.831 | 0.839 | 0.676
Neural Net | 0.841 | 0.452 | 0.875 | 0.515 | 0.451
De Nicolao et al. [12] (avg.) | 0.996 | 0.996 | 0.998 | 0.996 | N/A
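Schematically, this experiment trains on code-section features and evaluates on features extracted from whole files. The sketch below reuses the hypothetical feature_vector() helper from the Section 2.2 sketch and is an illustration, not the exact experiment code:

```python
# Sketch: train on code-only sections, then classify complete binaries
# (headers + code + data) by extracting the same features from whole
# file bytes (Section 3.2). feature_vector() is the sketch from
# Section 2.2.
from sklearn.ensemble import RandomForestClassifier

def evaluate_on_whole_binaries(X_code, y_code, binary_paths, y_true):
    clf = RandomForestClassifier().fit(X_code, y_code)  # code-only training
    X_whole = []
    for path in binary_paths:
        with open(path, 'rb') as f:
            X_whole.append(feature_vector(f.read()))    # whole-file bytes
    return clf.score(X_whole, y_true)                   # mean accuracy
```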
4 CONCLUSION

In this paper, we bridge multiple gaps in the field of automated and precise identification of the architecture and endianness of binary files and object code. For this, we developed from scratch the toolset and datasets that are lacking in this research space. As a result, we contribute a comprehensive collection of open data, open source, and open API web-services. We performed experiment reconstruction and cross-validation of the effectiveness, efficiency, and results of the state-of-the-art methods by Clemens [8] and De Nicolao et al. [12]. When training and testing classifiers using solely code sections from compiled binary files (e.g., ELF, PE32, Mach-O), all our classifiers performed equally well, achieving over 98% accuracy. Our results follow and support the state of the art, and in some cases we outperformed previous works by up to 7%. In summary, our work provides an independent confirmation of the general validity and soundness of both the existing and the newly proposed algorithms, features, and approaches. In addition, we developed a toolset for dataset generation, feature extraction, and classification.

ACKNOWLEDGMENTS

The authors would like to acknowledge BINARE.IO [1] and APPIOTS (Business Finland project 1758/31/2018), as well as the grants of computer capacity from the Finnish Grid and Cloud Infrastructure (FGCI) (http://urn.fi/urn:nbn:fi:research-infras-2016072533).
REFERENCES

[1] [n. d.]. BINARE.IO – IoT Security Firmware Analysis and Monitoring. https://binare.io.
[2] Andrés Blanco and Matias Eissler. 2012. One firmware to monitor 'em all.
[3] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. 2011. BAP: A binary analysis platform. In International Conference on Computer Aided Verification. Springer, 463–469.
[4] Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing mayhem on binary code. In 2012 IEEE Symposium on Security and Privacy. IEEE, 380–394.
[5] Vitaly Chipounov and George Candea. 2010. Reverse engineering of binary device drivers with RevNIC. In Proceedings of the 5th European Conference on Computer Systems. ACM, 167–180.
[6] Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. 2011. S2E: A platform for in-vivo multi-path analysis of software systems. In ACM SIGARCH Computer Architecture News, Vol. 39. ACM, 265–278.
[7] François Chollet et al. 2015. Keras. https://keras.io.
[8] John Clemens. 2015. Automatic classification of object code using machine learning. Digital Investigation 14 (2015), S156–S162.
[9] Andrei Costin, Jonas Zaddach, Aurélien Francillon, and Davide Balzarotti. 2014. A Large-Scale Analysis of the Security of Embedded Firmwares. In USENIX Security.
[10] Andrei Costin, Apostolis Zarras, and Aurélien Francillon. 2016. Automated dynamic firmware analysis at scale: a case study on embedded web interfaces. In ASIACCS. ACM.
[11] Drew Davidson, Benjamin Moench, Thomas Ristenpart, and Somesh Jha. 2013. FIE on Firmware: Finding Vulnerabilities in Embedded Systems Using Symbolic Execution. In USENIX Security.
[12] Pietro De Nicolao, Marcello Pogliani, Mario Polino, Michele Carminati, Davide Quarta, and Stefano Zanero. 2018. ELISA: ELiciting ISA of Raw Binaries for Fine-Grained Code and Data Separation. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 351–371.
[13] Guillaume Delugré. 2010. Closer to metal: reverse-engineering the Broadcom NetExtreme's firmware. Hack.Lu 10 (2010).
[14] Loïc Duflot, Yves-Alexis Perez, and Benjamin Morin. 2011. What if you can't trust your network card? In Recent Advances in Intrusion Detection.
[15] Chris Eagle. 2011. The IDA Pro Book. No Starch Press.
[16] Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. 2016. discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code. In NDSS.
[17] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In ACM SIGSAC Conference on Computer and Communications Security. ACM.
[18] James C. Foster. 2005. Sockets, Shellcode, Porting, and Coding: Reverse Engineering Exploits and Tool Coding for Security Professionals. Elsevier.
[19] Nathalie Japkowicz. 2003. Class imbalances: are we focusing on the right issue? In Workshop on Learning from Imbalanced Data Sets II, Vol. 1723. 63.
[20] Asim Kadav and Michael M. Swift. 2012. Understanding modern device drivers. ACM SIGARCH Computer Architecture News 40, 1 (2012), 87–98.
[21] Sami Kairajärvi. 2019. Automatic identification of architecture and endianness using binary file contents. Master's thesis. University of Jyväskylä, Jyväskylä, Finland. http://urn.fi/URN:NBN:fi:jyu-201904182217
[22] Sami Kairajärvi, Andrei Costin, and Timo Hämäläinen. 2019. Towards usable automated detection of CPU architecture and endianness for arbitrary binary files and object code sequences. arXiv preprint arXiv:1908.05459 (2019).
[23] Yanlin Li, Jonathan M. McCune, and Adrian Perrig. 2011. VIPER: verifying the integrity of PERipherals' firmware. In Proceedings of the 18th ACM Conference on Computer and Communications Security. ACM, 3–16.
[24] Kaiping Liu, Hee Beng Kuan Tan, and Xu Chen. 2013. Binary code analysis. Computer 46, 8 (2013).
[25] Jayaraman Manni, Ashar Aziz, Fengmin Gong, Upendran Loganathan, and Muhammad Amin. 2014. Network-based binary file extraction and analysis for malware detection. US Patent 8,832,829.
[26] Charlie Miller. 2011. Battery firmware hacking. Black Hat USA (2011), 3–4.
[27] Marius Muench, Jan Stijohann, Frank Kargl, Aurélien Francillon, and Davide Balzarotti. 2018. What you corrupt is not what you crash: Challenges in fuzzing embedded devices. In NDSS.
[28] Karsten Nohl and Jakob Lell. 2014. BadUSB – On accessories that turn evil. Black Hat USA (2014).
[29] Mary Lou Nohr. 1993. System V: understanding ELF object files and debugging tools. Prentice Hall PTR.
[30] Sergi "pancake" Alvarez and core contributors. [n. d.]. radare2 – unix-like reverse engineering framework and commandline tools. https://www.radare.org/
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[32] Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten Holz. 2015. Cross-architecture bug search in binary executables. In IEEE Symposium on Security and Privacy.
[33] Michalis Polychronakis, Kostas G. Anagnostakis, and Evangelos P. Markatos. 2010. Comprehensive shellcode detection using runtime heuristics. In Proceedings of the 26th Annual Computer Security Applications Conference. ACM, 287–296.
[34] Yan Shoshitaishvili, Ruoyu Wang, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. 2015. Firmalice – Automatic Detection of Authentication Bypass Vulnerabilities in Binary Firmware. In NDSS.
[35] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, et al. 2016. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy. IEEE, 138–157.
[36] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Audrey Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. 2016. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy.
[37] Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson. 1993. Binary translation. Digital Technical Journal 4 (1993), 137–137.
[38] Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. 2008. BitBlaze: A new approach to computer security via binary analysis. In International Conference on Information Systems Security. Springer, 1–25.
[39] Iain Sutherland, George E. Kalb, Andrew Blyth, and Gaius Mulley. 2006. An empirical examination of the reverse engineering process for binary files. (2006).
[40] Dave Jing Tian, Adam Bates, and Kevin Butler. 2015. Defending against malicious USB firmware with GoodUSB. In Proceedings of the 31st Annual Computer Security Applications Conference. ACM, 261–270.
[41] Eric Van Den Berg and Ramkumar Chinchani. 2009. Detecting exploit code in network flows. US Patent App. 11/260,914.
[42] Fish Wang and Yan Shoshitaishvili. 2017. Angr – the next generation of binary analysis. In 2017 IEEE Cybersecurity Development (SecDev). IEEE, 8–9.
[43] Tielei Wang, Tao Wei, Zhiqiang Lin, and Wei Zou. 2009. IntScope: Automatically Detecting Integer Overflow Vulnerability in X86 Binary Using Symbolic Execution. In NDSS.