Automated Metadata Extraction

Masaryk University
Faculty of Informatics

Automated Metadata Extraction

Master’s Thesis

Bc. Martin Šmíd

Brno, Spring 2021

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Martin Šmíd

Advisor: RNDr. Lukáš Němec

Acknowledgements

I would like to thank my advisor, RNDr. Lukáš Němec, for his friendly approach, helpful advice, and guidance of my work. I would like to express my gratitude to my family for their support.

Abstract

This thesis aims to create a modular tool that can extract important patterns of interest from binary data of generally unknown origin. Binary data can come in many different forms. Therefore, several standard encodings used to represent binary data are introduced in the first part. It is followed by a section introducing selected data patterns of interest, whose detection allows more detailed identification of certain virtual or physical entities. The thesis also describes the implemented tool, which automates data decoding and data pattern extraction, and presents the results of processing a collected dataset.

Keywords

binary-to-text encoding, metadata extraction, Python, data decoding, pattern detection

Contents

1 Introduction
2 Binary-to-text Encoding
  2.1 UUEncoding
  2.2 Base encoding
    2.2.1 Base16
    2.2.2 Base32
    2.2.3 Base58
    2.2.4 Base64
    2.2.5 Base85
    2.2.6 Base91
  2.3 BinHex
  2.4 Percent-encoding
  2.5 Quoted-printable
  2.6 yEnc
  2.7 Bech32
  2.8 MIME
  2.9 PEM
3 Metadata Patterns
  3.1 Cryptographic Hash Functions
    3.1.1 Message-Digest Algorithm 5
    3.1.2 Secure Hash Algorithms
  3.2 Uniform Resource Identifier
    3.2.1 Uniform Resource Locator
    3.2.2 Uniform Resource Name
    3.2.3 Data URI
    3.2.4 Magnet URI
  3.3 International Standard Book Number
  3.4 International Standard Serial Number
  3.5 Digital Object Identifier
  3.6 IP Address
  3.7 E-mail
  3.8 MAC Address
  3.9 Cryptocurrency
    3.9.1 Bitcoin
    3.9.2 Ethereum
    3.9.3 Tether
    3.9.4 Polkadot
    3.9.5 XRP (Ripple)
    3.9.6 Cardano
    3.9.7 Litecoin
    3.9.8 Bitcoin Cash
    3.9.9 Chainlink
    3.9.10 Stellar
4 MetExt
  4.1 Analysis
    4.1.1 Balbuzard
    4.1.2 CyberChef
    4.1.3 Chepy
    4.1.4 Ciphey
  4.2 Design
  4.3 Plugins
    4.3.1 Decoders
    4.3.2 Extractors
    4.3.3 Printers
  4.4 API
  4.5 CLI
5 Evaluation
  5.1 Test Data Set
    5.1.1 Generated Data
    5.1.2 Collected Data
  5.2 Results
    5.2.1 Generated Data
    5.2.2 Collected Data
  5.3 Evaluation Discussion
6 Conclusion and Future Work
  6.1 Future Work
A Encoding Mapping Tables
B Magnet Links Parameters
C Pattern Extraction Results
D Code extracts
Bibliography

1 Introduction

The Internet has progressively become a very accessible global network for people. Web applications enable creating and sharing large amounts of data dedicated to individual participants in network communication. This data is often unstructured, so without knowledge of its content or nature it is not easy to explore or read it correctly.

Moreover, unawareness of the structure, or its complete absence, plays into the hands of the creator or sender of the data, who may apply methods to obfuscate the actual content of the data. In worse cases, the obfuscation may be the work of an actor seeking to spread malicious software or otherwise compromise the data.
Therefore, it is advisable to inspect unknown data after acquiring it and to ascertain its origin, characteristics, structure, or any other information that can provide more in-depth insight into the data. Such inspection can be done partly manually, often using software that displays the data as a hex dump, that is, as a sequence of bytes in both hexadecimal and simplified text form. This view greatly simplifies the manual inspection of binary data. Nevertheless, manual processing is time-consuming, depending on the size and form of the data being examined, and human error can quickly occur. It is therefore desirable to eliminate manual steps and automate data processing as much as possible.

This thesis focuses on creating an extensible tool designed to process binary data in different formats and find meaningful data patterns in these data. The goal is to meet the initial requirements: support for Python version 3.5, ease of use, and functional extensibility through modules. The tool is designed so that its output is easily machine-processable and so that additional information about the patterns found can be added to the output.

However, before it is possible to realistically analyse the data and determine whether it contains the information sought, it is necessary to process it and recognise its correct format.

Standard encodings used in sending and sharing binary data and selected data patterns to search for are described in the first half of the paper. The second half then focuses on the tool itself: its description, its implemented parts, and sample output from the tool.

2 Binary-to-text Encoding

It is necessary to ensure that the data recipients can receive and process the data without malformation.
In environments that were not, or still are not, designed to process 8-bit binary data, one must use an encoding that enables sending such data. Several binary-to-text encodings, which encode binary data into a sequence of printable characters, have therefore been created to mitigate this problem.

Although such encodings ensure seamless data transmission in a particular system, their use entails certain compromises, e.g. the size of the data sent is usually larger. Hence, data compression may need to be applied. In this section, we list commonly used binary-to-text encodings.

2.1 UUEncoding

UUEncoding (Unix-to-Unix encoding) originated in Unix in 1980 as a way to encode binary data for transmission by e-mail [1]. It was designed to transfer files from one Unix system to another through systems with different character sets. UUEncoding was also used to post files to Usenet newsgroups.

Historically, the uuencode program utilised ASCII (American Standard Code for Information Interchange) [2] characters with decimal codes from 32 to 95 to encode three octets of data into four printable characters. Each encoded line starts with a length character that encodes the number of data bytes on that line. Encoded data is encapsulated between a header line begin <mode> <filename><NL> and the trailer lines `<NL>end<NL>, where the character ` is used as the zero-data character and <NL> is a line break.

Later on, the uuencode tool added a Base64 (Section 2.2.4) variant conforming to MIME. The header line was changed to contain the information about the used Base64 encoding, and the trailer line was replaced with a sequence of four equals signs “====”, as it is a valid Base64-decodable sequence [3].

However, the MIME standard for e-mails (Section 2.8) and the yEnc encoding (Section 2.6), created for posting to newsgroups, replaced UUEncoding in later years.
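The mapping of three octets to four printable characters described in this section can be sketched in Python as follows. This is a minimal illustration, not the thesis tool's implementation: the function names uu_char and uuencode_line are hypothetical, the sketch uses the backtick as the zero-data character, and it assumes the customary limit of 45 data bytes per line.

```python
def uu_char(sextet: int) -> str:
    """Map a 6-bit value to its printable character (zero becomes a backtick)."""
    return "`" if sextet == 0 else chr(sextet + 32)


def uuencode_line(chunk: bytes) -> str:
    """Encode 1 to 45 bytes as one uuencoded line, without the line break."""
    if not 1 <= len(chunk) <= 45:
        raise ValueError("a uuencoded line carries between 1 and 45 bytes")
    chars = [uu_char(len(chunk))]  # leading length character
    for i in range(0, len(chunk), 3):
        # Pad the final group with zero bytes so it always holds three octets.
        group = chunk[i:i + 3].ljust(3, b"\0")
        value = int.from_bytes(group, "big")
        # Split the 24-bit value into four 6-bit groups, most significant first.
        chars.extend(uu_char((value >> shift) & 0x3F) for shift in (18, 12, 6, 0))
    return "".join(chars)


print(uuencode_line(b"Lor"))  # prints #3&]R
```

Python's standard library provides the same per-line conversion through binascii.b2a_uu and binascii.a2b_uu, which can be used to cross-check the sketch.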
Listing 2.1: An example of uuencoded data

    begin 0744 example.dat
    M3&]R96T@:7!S=6T@9&]L;W(@<VET(&%M970L(&-O;G-E8W1E='5E<B!A9&EP
    M:7-C:6YG(&5L:70N($5T:6%M(&)I8F5N9'5M(&5L:70@96=E="!E<F%T+B!.
    1=6QL86T@:G5S=&\@96YI;2X`
    `
    end

2.2 Base encoding

Base encodings utilise a specific alphabet to encode binary data and fundamentally use the modulus operation for encoding and decoding. In common encodings, characters from the ASCII set are usually used as the encoding alphabet to avoid data misinterpretation in text-based systems. Depending on the specification, different restrictions apply, such as line length constraints or the handling of characters outside the defined encoding alphabet.

Base encodings tailored for specific use cases and utilising UTF encodings have also been invented, e.g. Base122 [4] (an alternative to Base64 utilising properties of UTF-8 encoding), Base1024 [5] (encoding with emoji characters), Base2048 [6] (encoding optimised for Twitter), or Base32768 [7] (encoding optimised for UTF-16 text). Nevertheless, they have been neither standardised nor commonly used so far. Thus, this section only describes commonly used base encodings with alphabets that comprise printable ASCII characters.

2.2.1 Base16

Base16 (also referred to as hex) encoding is the standard case-insensitive hex encoding that uses a set of 16 characters: the digits and the letters from “A” to “F” (see Table A.1).
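As a minimal sketch of how Base16 relies on division and the modulus operation, the following Python code splits each byte into two 4-bit values and looks them up in the 16-character alphabet. The names ALPHABET, base16_encode, and base16_decode are illustrative assumptions, not part of the thesis tool; the alphabet is the standard one with digits followed by the letters “A” to “F”.

```python
import base64

ALPHABET = "0123456789ABCDEF"


def base16_encode(data: bytes) -> str:
    """Encode each byte as two characters from the Base16 alphabet."""
    out = []
    for byte in data:
        high, low = divmod(byte, 16)  # quotient and remainder of division by 16
        out.append(ALPHABET[high] + ALPHABET[low])
    return "".join(out)


def base16_decode(text: str) -> bytes:
    """Decode Base16 text; upper- and lower-case letters are both accepted."""
    text = text.upper()
    if len(text) % 2:
        raise ValueError("Base16 input must have an even number of characters")
    # str.index raises ValueError for characters outside the alphabet.
    return bytes(
        ALPHABET.index(text[i]) * 16 + ALPHABET.index(text[i + 1])
        for i in range(0, len(text), 2)
    )


print(base16_encode(b"Lor"))                 # prints 4C6F72
print(base64.b16encode(b"Lor").decode())     # standard library equivalent
```

In practice, the standard library functions base64.b16encode and base64.b16decode cover the same conversion; the manual version only makes the arithmetic visible.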
