Masaryk University Faculty of Informatics

Automated Metadata Extraction

Master’s Thesis

Bc. Martin Šmíd

Brno, Spring 2021


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Martin Šmíd

Advisor: RNDr. Lukáš Němec


Acknowledgements

I would like to thank my advisor, RNDr. Lukáš Němec, for his friendly approach, helpful advice, and guidance of my work. I would like to express my gratitude to my family for their support.

Abstract

This thesis aims to create a modular tool that can extract important patterns of interest from binary data of generally unknown origin. Binary data can come in many different forms. Therefore, several standard encodings used to represent binary data are introduced in the first part. A section introducing selected interesting data patterns whose detection allows for more detailed identification of certain virtual or physical entities follows. The thesis also includes a description of the implemented tool that automates data decoding and data pattern extraction, along with the results of processing a collected dataset.

Keywords

binary-to-text encoding, metadata extraction, Python, data decoding, pattern detection

Contents

1 Introduction

2 Binary-to-text Encoding
  2.1 UUEncoding
  2.2 Base encoding
    2.2.1 Base16
    2.2.2 Base32
    2.2.3 Base58
    2.2.4 Base64
    2.2.5 Base85
    2.2.6 Base91
  2.3 BinHex
  2.4 Percent-encoding
  2.5 Quoted-printable
  2.6 yEnc
  2.7 Bech32
  2.8 MIME
  2.9 PEM

3 Metadata Patterns
  3.1 Cryptographic Hash Functions
    3.1.1 Message-Digest Algorithm 5
    3.1.2 Secure Hash Algorithms
  3.2 Uniform Resource Identifier
    3.2.1 Uniform Resource Locator
    3.2.2 Uniform Resource Name
    3.2.3 Data URI
    3.2.4 Magnet URI
  3.3 International Standard Book Number
  3.4 International Standard Serial Number
  3.5 Digital Object Identifier
  3.6 IP Address
  3.7 E-mail Address
  3.8 MAC Address
  3.9 Cryptocurrency
    3.9.1 Bitcoin
    3.9.2 Ethereum
    3.9.3 Tether
    3.9.4 Polkadot
    3.9.5 XRP (Ripple)
    3.9.6 Cardano
    3.9.7 Litecoin
    3.9.8 Bitcoin Cash
    3.9.9 Chainlink
    3.9.10 Stellar

4 MetExt
  4.1 Analysis
    4.1.1 Balbuzard
    4.1.2 CyberChef
    4.1.3 Chepy
    4.1.4 Ciphey
  4.2 Design
  4.3 Plugins
    4.3.1 Decoders
    4.3.2 Extractors
    4.3.3 Printers
  4.4 API
  4.5 CLI

5 Evaluation
  5.1 Test Data Set
    5.1.1 Generated Data
    5.1.2 Collected Data
  5.2 Results
    5.2.1 Generated Data
    5.2.2 Collected Data
  5.3 Evaluation Discussion

6 Conclusion and Future Work
  6.1 Future Work

A Encoding Mapping Tables

B Magnet Links Parameters

C Pattern Extraction Results

D Code extracts

Bibliography


1 Introduction

The Internet has progressively become a very accessible global network. Web applications enable creating and sharing large amounts of data dedicated to individual participants in network communication. This data is often unstructured, and thus, without knowledge of the content or nature of the data, it is not easy to explore or read it correctly.

On the contrary, unawareness of the structure, or its complete absence, plays into the hands of the creator or sender of the data. They may apply methods to obfuscate the actual content of the data. In worse cases, it may be the misdeeds of an actor seeking to spread malicious software or otherwise compromise the data.

Therefore, it is advisable to inspect the unknown data acquired and ascertain its origin, characteristics, structure or any other information that can help provide more in-depth insight into the data. Such inspection can be done partly manually, often using software that displays the data in a form known as a hex dump, i.e. as a sequence of bytes in both hexadecimal and simplified text form. This view of the data dramatically simplifies the manual inspection of binary data. Nevertheless, since manual data processing is time-consuming with respect to the size and form of the data being examined and human error can quickly occur, it is desirable to eliminate manual steps and automate data processing as much as possible.

This thesis focuses on creating an extensible tool designed to process binary data in different formats and find meaningful data patterns in these data. The goal is to meet the initial requirements of supporting Python version 3.5, ease of use, and functional extensibility through modules. The tool is designed so that its output is easily machine-processable and additional information about the found patterns can be included in the output.
However, before it is possible to realistically analyse the data and determine whether the data contains the information sought, it is necessary to process it and recognise its correct format. Standard encodings used in sending and sharing binary data and selected search data patterns are described in the first half of the

paper. The second half then focuses on the tool itself, its description and implemented parts, and sample data output from the tool.

2 Binary-to-text Encoding

It is necessary to ensure that the data recipients can receive and process the data without malformation. In distinct environments that were not, or still are not, designed to process binary data in 8-bit encoding, one must utilise an encoding that enables sending such data. Therefore, several binary-to-text encodings that encode binary data into a sequence of printable characters have been created to mitigate this problem.

Although such encodings ensure seamless data transmission in a particular system, it is also necessary to keep in mind that the use of binary-to-text encodings entails certain compromises, e.g. the size of the data sent is usually larger. Hence, means of data compression may need to be applied.

In this section, we will list commonly used binary-to-text encodings.

2.1 UUEncoding

UUEncoding (Unix-to-Unix encoding) originated in 1980 in Unix for encoding binary data for transmission in e-mail [1], designed to transfer files from one Unix system to another through systems with different character sets. UUEncoding was also used to post files to newsgroups.

Historically, the uuencode program utilised ASCII1 [2] characters with codes from 32 to 95 to encode three octets of data into four printable characters. Each encoded line starts with a length character equal to the number of encoded bytes. Encoded data is encapsulated between a begin header line and an end trailer line, where the character "`" is used as a zero-data character and lines are terminated with a line break.

Later on, the uuencode tool added a Base64 (Section 2.2.4) variant conforming to MIME. The header lines were changed to contain the information about the used Base64 encoding and the trailer line

1. American Standard Code for Information Interchange


was replaced with a sequence of four equal symbols "====", as it is a valid Base64-decodable sequence [3]. However, the MIME standard for e-mails (Section 2.8) and the yEnc encoding (Section 2.6) created for newsgroup posting replaced UUEncoding in later years.

Listing 2.1: An example of uuencoded data

begin 0744 example.dat
M3&]R96T@:7!S=6T@9&];W(@

2.2 Base encoding

Base encodings utilise a specific alphabet to encode binary data and fundamentally use the modulus operation for the encoding and decoding. Characters from the ASCII set are usually used as the encoding alphabet in common encodings to avoid data misinterpretation in text-based systems. Depending on the specification, different restrictions such as line length constraints or handling of characters outside the defined encoding alphabet apply.

Base encodings tailored for specific use cases utilising UTF encodings have also been invented, e.g. Base122 [4] (an alternative to Base64 utilising properties of UTF-8 encoding), Base1024 [5] (encoding with emoji characters), Base2048 [6] (encoding optimised for Twitter), or Base32768 [7] (encoding optimised for UTF-16 text). Nevertheless, they have not been standardised or commonly used so far. Thus, this section only describes commonly used base encodings with alphabets that comprise printable ASCII characters.
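The modulus principle can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the implemented tool: it treats the whole input as one big-endian integer and repeatedly divides by the alphabet size, which is how Base58-style encodings work (block-based encodings such as Base32 or Base64 instead split the input into fixed-size bit groups).

```python
def base_encode(data: bytes, alphabet: str) -> str:
    """Encode bytes as a big-endian integer in the base given by the alphabet.

    A sketch only; real base encodings add padding and leading-zero
    rules on top of this core idea.
    """
    base = len(alphabet)
    num = int.from_bytes(data, "big")
    encoded = ""
    while num > 0:
        num, remainder = divmod(num, base)
        encoded = alphabet[remainder] + encoded
    return encoded or alphabet[0]
```

For example, `base_encode(b"\x01\x00", "0123456789")` treats the two input bytes as the integer 256 and renders it with a decimal alphabet.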

2.2.1 Base16

Base16 (also referred to as hex) encoding is the standard case-insensitive hex encoding that uses a set of 16 characters: the digits and the letters from "A" to "F" (see Table A.1). That enables us to represent an 8-bit group (octet) with two characters, one for each 4-bit group (nibble). Thus, the output data is double the size of the input.

Hexadecimal values are commonly prefixed to differentiate them in context. Unix shells and many programming languages use the prefix 0x for numeric constants and \x for byte representations in strings. URL encoding (see Section 2.4) prefixes pairs of hexadecimal values with a per cent symbol "%", while MIME quoted-printable encoding (see Section 2.5) prefixes the pairs with the equal sign "=".

Hexadecimal values are used for many different entities such as colour references in HTML or CSS, or character value representation in the Unicode standard, XML, and (X)HTML.
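Python's standard library exposes this encoding directly; a quick illustration of the doubling of size and the case-insensitive decoding:

```python
data = b"\xde\xad\xbe\xef"

# One octet always becomes exactly two hexadecimal characters.
encoded = data.hex()
assert encoded == "deadbeef"
assert len(encoded) == 2 * len(data)

# Decoding accepts both cases, matching Base16's case-insensitive definition.
assert bytes.fromhex("DEADBEEF") == bytes.fromhex("deadbeef") == data
```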

2.2.2 Base32

Base32 encoding uses 32 ASCII characters to represent data in a form that needs to be case-insensitive. The encoding process represents 40-bit groups treated as eight 5-bit groups and translated into eight characters. If fewer than 40 bits are available, zero-valued bits are added to form 40-bit groups.

RFC 4648 [8] recognises two kinds of Base32 encodings (base32 and base32hex) identified by the encoding alphabet. The encoding alphabets of the base32 and base32hex variants are listed in Tables A.2 and A.3. Both utilise a special character "=" for padding. The RFC 4648 variant is used in IPFS CID v1 (Content Identifier) [9], which adheres to the permitted character set of a (sub)domain name [10].

There are other alternative Base32 encodings, e.g. Crockford's Base32 [11], which uses digits and the Latin alphabet excluding the characters ILOU (see Table A.5). Crockford's version proposes the use of additional characters for a checksum and allows the encoded data to be delimited by a hyphen character.

The Geohash algorithm [12] uses another variant of Base32 encoding, conveniently enabling encoding of latitude and longitude to one positive integer represented by Base32 characters. The alphabet of the Geohash variant can be seen in Table A.4.

2.2.3 Base58

Satoshi Nakamoto invented Base58 encoding for Bitcoin (Section 3.9.1). He designed it to be comparable with Base64 (Section 2.2.4). However, the aim was to avoid errors in visual checks of the encoded data; thus, some of the characters that may resemble each other were omitted, as well as any non-alphanumeric characters, which could cause a line break in some systems. Compared to Base64, the characters 0OLI and the special characters +/ are not used. See Table A.6 for the listing of the full alphabet. To preserve the meaning of the leading zero bytes of the input, they are converted to the first character of the encoding alphabet during the encoding, i.e. "1" for the Bitcoin alphabet. There are other variants such as Ripple [13] (see Table A.7) and Flickr [14] (see Table A.8) that use different alphabets.

However, Bitcoin uses a modified version called Base58Check encoding, which includes a checksum in an attempt to detect errors in the encoded data. Instead of encoding the raw input data, the first four bytes of SHA-256(SHA-256(data)) are appended to the input data, and the Bitcoin Base58 encoding is applied to the concatenation [15]. Base58Check is used to encode the addresses and private keys.
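The Base58Check construction can be sketched as follows. This is an illustrative, encode-only sketch using the Bitcoin alphabet, not the thesis implementation:

```python
import hashlib

BITCOIN_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check_encode(payload: bytes) -> str:
    # Append the first four bytes of SHA-256(SHA-256(payload)) as a checksum.
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    # Encode the whole value as a big-endian integer in base 58.
    num = int.from_bytes(data, "big")
    encoded = ""
    while num > 0:
        num, remainder = divmod(num, 58)
        encoded = BITCOIN_ALPHABET[remainder] + encoded
    # Leading zero bytes are preserved as the first alphabet character "1".
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + encoded
```

For a version byte 0x00 followed by twenty zero hash bytes, the result begins with twenty-one "1" characters, illustrating the leading-zero rule.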

2.2.4 Base64

Base64 translates a sequence of octets into a sequence of ASCII characters, each representing six bits, utilising a set of 64 characters. That enables the encoding of three input octets into four printable characters.

The first standard version was defined in Privacy-enhanced Electronic Mail (PEM) in 1989 [16]. Nowadays, MIME is often referenced as the definition of Base64 encoding. Both variants use the same 64-character set of upper- and lower-cased Latin letters and digits with the special characters +/ (see Table A.9), a padding suffix "=", and the CRLF sequence for line-wrapping. They differ in the required line length: 64 characters (or less for the last line) in PEM [17] and 76 characters


or less in MIME [18]. MIME explicitly declares that characters outside the specified set must be ignored in the decoding process.

However, other variants may differ in the 62nd and 63rd characters, in the use of padding, or in line length restrictions, as defined in RFC 4648 [8]. We can name the base64url variant, recognised in RFC 4648, that uses the minus and underscore characters "-_" as the 62nd and 63rd characters without the padding suffix. This variant is suitable for use in URLs and file names.

Base64 is widespread in embedded binary data, e.g., we can find it inside assets in HTML and CSS via the data URI scheme [19] or as embedded binary data in XML and JSON API responses. It is also widely used for sending e-mail attachments in the MIME Base64 content transfer encoding and for cryptographic certificates and keys in the PEM format (Section 2.9). Moreover, HTTP Basic Authentication uses Base64 to encode the credentials [20].
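The difference between the standard and base64url alphabets can be observed directly with Python's standard library; the input bytes below are chosen so that the 62nd and 63rd alphabet characters appear in the output:

```python
import base64

data = b"\xfb\xef\xff"

# The standard alphabet uses "+" and "/" for the values 62 and 63.
assert base64.b64encode(data) == b"++//"

# The base64url variant replaces them with "-" and "_".
assert base64.urlsafe_b64encode(data) == b"--__"

# Decoding strips the "=" padding transparently.
assert base64.b64decode(b"aGVsbG8=") == b"hello"
```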

2.2.5 Base85

Base85 encoding utilises a set of 85 characters, originally created in the btoa implementation. Some of the commonly used variants of Base85 encoding are btoa [21], Ascii85 [22], base85 as defined in RFC 1924 [23], and Z85 [24].

Ascii85 encodes four octets into five printable characters. The Adobe variant uses ASCII characters from "!" to "u" (see Table A.11), and the encoded output is separated into lines of at most 80 characters. A special character "z" is used as a shortcut to denote an encoded group of four zero octets, and a pair "~>" is used to denote end-of-data, while the pair "<~" is used to mark the begin-of-data of an Ascii85-encoded string. The btoa utility program uses begin-data and end-data marker lines and a different convention of marking end-of-data. Ascii85 is commonly used in the Adobe PostScript language [22] and the Portable Document Format (PDF) [25].

RFC 1924 [23] defined the base85 variant to represent an IPv6 address compactly. It enables encoding of the whole range of IPv6 addresses in a sequence of 20 characters. It uses a carefully chosen alphabet (see Table A.12) to avoid a collision with common representations of IPv6 addresses. This variant is also used in Git for patches of binary data [26].


Another variant is known as Z85. Since Ascii85 is not considered string-safe and cannot be used cleanly in source code or data interchange formats such as JSON and XML, a derivation of Ascii85 was created by the messaging library ZeroMQ2 [24].

Z85 achieves string safety with a modified encoding alphabet (see Table A.13). Compared to Ascii85, the seven characters ` ' " \ _ , ; were removed, and the characters vwxyz{} were added. The order of the encoding characters is also different. The encoding requires chunks of four octets, which should be padded if necessary.3 Compared to Ascii85, no begin and end marking lines or special characters are used.

2.2.6 Base91

It had been argued that using more than 85 ASCII safe characters for encoding binary data to be sent via e-mail may not bring higher gains [21]. Despite that, Base91 was created to maximise data transmission efficiency whilst achieving e-mail 7-bit encoding compatibility. Thus, it efficiently utilises 91 safe characters out of 95 printable characters, excluding space, hyphen, backslash, and apostrophe characters [29] (see Table A.14 for the encoding alphabet). Base91 was designed to divide the input data sequence into 13-bit blocks. Each such block can then be encoded as two printable ASCII characters.

2.3 BinHex

BinHex stands for binary-to-hexadecimal and is the encoding used on the classic Mac OS (Macintosh) systems that preceded Mac OS X (macOS). It was a popular means of encoding Macintosh files for archiving on non-Macintosh file systems and transmission via Internet mail. It was designed to handle the Macintosh file format comprising two parts, called forks: a data fork and a resource fork [30].

It is associated with the file extensions .hex, .hcx and .hqx depending on the version used. The first version was created in 1984

2. It is also known as ØMQ, 0MQ, or zmq. 3. Implementations such as z85e [27] or CometD-Z85 [28] can encode data of length non-divisible by four.


and was based on the original BinHex encoding written for TRS-80 systems in 1981. It used a hexadecimal encoding and only supported the data fork.

The second version replaced the 8-to-4 hexadecimal encoding with an 8-to-6 encoding with a 64-character alphabet similar to the standard Base64 encoding (Section 2.2.4) to reduce the output size. The name stayed the same despite the change of the encoding used. It also supported both forks. However, the encoding alphabet included characters that different internationalised systems could alter in an attempt at localisation.

Therefore, the third version, called BinHex 4.0, was created with a modified encoding alphabet (see Table A.16). The encoded data is preceded with the line (This file must be converted with BinHex 4.0). The encoded data starts and ends with a colon symbol. Lines are separated with CR every 64 characters such that the last colon is not at the beginning of a line. The encoded data consists of three parts: the first is a header with file metadata (see Appendix A in [30]), the second is the data fork, and the third is the resource fork. Each part is followed by a 16-bit CRC checksum [31].

Listing 2.2: An example of BinHex 4.0 encoded data

(This file must be converted with BinHex 4.0)

:#hGPBR9dD@acAh"X!$mr2cmr2cmr!!!!!!!-!!!!!-M'5'9XE'mJ9fpbE'3K$0-
!!!!!:

2.4 Percent-encoding

Percent-encoding, also known as URL encoding, is a mechanism to encode 8-bit data in a URI using only the permitted ASCII characters, as defined in RFC 3986 [32]. As such, it is also used to prepare data for the media type application/x-www-form-urlencoded [33].

Percent-encoding defines reserved and unreserved characters. The reserved characters (any of :/?#[]@!$&'()*+,;=) are delimiting characters distinguishable from other data within a URI. The unreserved characters include uppercase and lowercase letters, decimal digits, a hyphen, a period, an underscore, and a tilde.


To percent-encode arbitrary data, it is divided into octets, and each octet is represented by its hexadecimal value prefixed with a per cent symbol "%". In general, unreserved characters need not be encoded. However, the per cent symbol must always be encoded; otherwise, it is not distinguishable from the start of an encoded octet.
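Python's urllib.parse implements this mechanism; a short illustration (note how the per cent symbol itself must be encoded):

```python
from urllib.parse import quote, unquote

# Space is not unreserved, so it becomes %20; "/" is kept by default
# because it is a reserved delimiter.
assert quote("a b/c") == "a%20b/c"

# The per cent symbol must itself be encoded as %25.
assert quote("100% sure") == "100%25%20sure"

# Decoding reverses the transformation.
assert unquote("100%25%20sure") == "100% sure"
```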

2.5 Quoted-printable

Quoted-printable encoding (also referred to as QP encoding) was specified in RFC 2045 [18] as one of the mechanisms for sending contents other than ASCII text in e-mails. The data encoding is set within the MIME Content-Transfer-Encoding header field. It is intended to minimise potential modifications by mail systems that work with ASCII text contents.

It encodes bytes into pairs of hexadecimal values (lowercase letters are not allowed by the specification) prefixed with an equal sign "=". The quoted-printable encoding requires that lines of encoded data are not longer than 76 characters. Therefore, it uses a concept of soft line breaks denoted with a single character "=" (followed by the CRLF sequence) at the end of a line. Printable ASCII characters and the TAB can stay in their literal representation. The exception is for space and TAB, which cannot occur unencoded at the end of a line. The equal sign must always be encoded.
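Python ships a quoted-printable codec in the quopri module; a small round-trip illustration (the example string is made up for demonstration):

```python
import quopri

raw = "naïve=".encode("utf-8")  # contains non-ASCII bytes and "="

encoded = quopri.encodestring(raw)
# The equal sign must always be encoded as the "=3D" pair.
assert b"=3D" in encoded

# The transformation is reversible.
assert quopri.decodestring(encoded) == raw
assert quopri.decodestring(b"na=C3=AFve=3D") == raw
```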

2.6 yEnc

E-mail and Usenet systems would originally rely on 7-bit data transmission using one of the encodings such as UUEncoding (Section 2.1), or (MIME) Base64 (Section 2.2.4) and quoted-printable (Section 2.5). However, these encodings come with a data overhead that is around 33-40% of the original data size. The first version of the yEnc encoding was published in 2001 by Jürgen Helbing to address the data overhead for posting binary data to Usenet newsgroups over NNTP, after many Usenet servers had become capable of handling 8-bit encoding [34].

yEnc is an 8-bit encoding which utilises almost all of the 256 characters. The critical characters that must be escaped with the equal symbol "=" are NULL, CR, and LF. The encoded value is computed


as encoded = (original + 42) % 256. If the result equals one of the critical characters, the value of the escaped character is computed as escaped = (encoded + 64) % 256 and preceded with the "=" sign. Escaping only the critical characters reduces the overhead to around 2% [35].

The encoding uses data wrapping lines similar to UUEncoding. Any line beginning with "=y" identifies a keyword line. The header line starts with =ybegin and includes the line length (typically 128 or 256), the original data size, and the data file name. The trailer line starts with =yend and contains the original data size (same as in the header line) and usually the CRC32 checksum to guarantee the data validity.

Listing 2.3: yEnc-encoded data structure

=ybegin line=128 size=200000 name=binary.dat
... data
=yend size=200000 crc32=678de146

The yEnc encoding also supports multi-part splitting. In that case, the header and trailer lines include the keyword part=, followed by a line with the data offset. The header and the trailer lines must contain the same part number, and there should be a computed CRC32 value for each part.

Listing 2.4: yEnc-encoded multi-part data structure

=ybegin part=1 line=128 size=200000 name=binary.dat
=ypart begin=1 end=100000
... data
=yend size=200000 pcrc32=678ab146

=ybegin part=2 line=128 size=200000 name=binary.dat
=ypart begin=100001 end=200000
... data
=yend size=200000 pcrc32=678cd146 crc32=678de146
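The two formulas translate directly into code; a minimal sketch of the per-byte transformation (in practice the equal symbol "=" itself must also be escaped, since it introduces escape sequences):

```python
# Critical characters after the +42 shift: NUL, LF, CR, and "=" itself.
CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}

def yenc_encode(data: bytes) -> bytes:
    out = bytearray()
    for byte in data:
        enc = (byte + 42) % 256              # encoded = (original + 42) % 256
        if enc in CRITICAL:
            out += b"="                      # escape marker
            enc = (enc + 64) % 256           # escaped = (encoded + 64) % 256
        out.append(enc)
    return bytes(out)
```

For example, a zero byte encodes to "*" (0 + 42), while byte value 214 shifts to NUL and is therefore escaped to the two-character sequence "=@".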

2.7 Bech32

Bech32 encoding is a Base32-derived encoding utilising a checksum. It was invented for the newer SegWit (segregated witness) type of Bitcoin addresses. The shift away from the Base58 encoding happened for reasons such as

inconvenience due to mixed case, no guarantees on error detection, and slower decoding. Base32 encoding also enables more efficient utilisation of QR codes [36]. See Table A.15 for the encoding alphabet.

The encoded string is specified to contain at most 90 lower-cased characters comprising the following parts:

1. The human-readable part, which is intended to describe the data type. This part must consist of 1 to 83 ASCII characters with decimal values from 33 to 126.

2. The separator, which is always the rightmost character "1".

3. The data part, which is at least six alphanumeric characters long, excluding "1", "b", "i", and "o". The last six characters of the data part form the checksum.
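These three structural rules can be captured with a regular expression for quick pattern detection. This is a sketch only: it checks the structure, not the checksum, and handles just the all-lowercase form (the specification also permits all-uppercase strings); the function name is illustrative.

```python
import re

# HRP: 1-83 characters with ASCII values 33-126; separator: the rightmost "1";
# data part: at least six characters from the Bech32 set, which excludes
# "1", "b", "i", and "o". Checksum verification is intentionally omitted.
BECH32_RE = re.compile(r"^[\x21-\x7e]{1,83}1[02-9ac-hj-np-z]{6,}$")

def looks_like_bech32(candidate: str) -> bool:
    return (
        len(candidate) <= 90
        and candidate == candidate.lower()
        and bool(BECH32_RE.match(candidate))
    )
```

Because the data-part character class excludes "1", the regular expression automatically anchors the separator at the rightmost "1" of the string.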

2.8 MIME

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of Internet messages (e-mail) defined in [37] (since then obsoleted by [38]) to support text and textual header information in character sets other than ASCII, as well as attachments of non-textual data such as audio, video, images, and application programs, and multi-part message bodies. It was specified in a series of RFCs: RFC 2045 [39], RFC 2046 [40], RFC 2047 [41], RFC 2048 [42], and RFC 2049 [43].

The standard defines multiple headers in addition to those defined in [37] to overcome the limitations of the Internet messages standard, such as that the text could be written only in the ASCII character set. The described header fields are MIME-Version, Content-Type, Content-Transfer-Encoding, Content-ID and Content-Description.

Although the values of header fields are required to be in ASCII, MIME allows non-ASCII texts to be written in the encoded-word syntax. The encoded-word format can be seen in Listing 2.5, where <charset> is a character set registered with IANA4 [44]. The <encoding> is the "Q" or "B" encoding, and <encoded-text> is the text in the specified

4. Internet Assigned Numbers Authority


encoding. The “B” encoding is the Base64 encoding (see Section 2.2.4) and the “Q” encoding is similar to the quoted-printable encoding (see Section 2.5) in a way that the characters can be encoded as pairs of hexadecimal values prefixed with the equal symbol.

Listing 2.5: Encoded-word syntax

"=?" <charset> "?" <encoding> "?" <encoded-text> "?="

The standard describes the MIME-Version header field with the value 1.0 (excluding any comment strings). However, the standard does not define what should be done in case of a future version. Hence, due to the significant adoption of the standard since its creation, it may not be easy to develop a newer version of the MIME standard.

The Content-Type header field is used to describe the data contained in the body such that the receiver can pick an appropriate mechanism to deal with the data. The initial top-level media types were given in RFC 2046 [40]. The complete list of registered media types is managed by IANA [45]. The Content-Type header may define a multi-part body with a specified boundary. Each subpart may then contain its own headers defining the type and encoding of the subpart's body.

The Content-Transfer-Encoding header provides two pieces of information: it specifies the encoding transformation applied to the body, and it specifies the domain of the result. The transformation is explicitly specified by one of the supported mechanisms, i.e. the quoted-printable and base64 encodings and the 7bit, 8bit, and binary domains. A custom mechanism may be specified, which must be prefixed with "x-" [39].

The Content-Type and Content-Transfer-Encoding headers separate text from binary data, making it much more unlikely that character-set translations will affect binary data.
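Python's email package decodes encoded-words out of the box; an illustration with both the "B" and "Q" encodings (the example strings are made up for demonstration):

```python
from email.header import decode_header, make_header

# "B" encoding: the payload is Base64.
b_word = "=?UTF-8?B?SGVsbG8sIHdvcmxkIQ==?="
assert str(make_header(decode_header(b_word))) == "Hello, world!"

# "Q" encoding: quoted-printable-like "=XX" hexadecimal pairs.
q_word = "=?ISO-8859-1?Q?Caf=E9?="
assert str(make_header(decode_header(q_word))) == "Café"
```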

2.9 PEM

Although the Privacy-Enhanced Mail (PEM) standard was finalised in RFCs in 1993, it has not been widely adopted in its original form. At present, it is a file format for storing and sending cryptographic keys, certificates, and other data, based on a set of IETF standards. IETF formalised the PEM format in RFC 7468 [46].


Many cryptography standards utilise ASN.1 (Abstract Syntax Notation One) to define their data structures and DER (Distinguished Encoding Rules) to serialise the structures. PEM enables such data to be transmitted through systems that only support ASCII by encoding the binary data using Base64. The textual encoding begins with a line "-----BEGIN <label>-----", followed by the Base64-encoded data, and ends with a line "-----END <label>-----".
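A PEM payload can therefore be recovered with a few lines of Python. This is a sketch assuming well-formed input; the regular expression and function name are illustrative, not part of any standard API:

```python
import base64
import re

# Match "-----BEGIN <label>-----" ... "-----END <label>-----" pairs and
# capture the Base64 body between them.
PEM_RE = re.compile(
    r"-----BEGIN (?P<label>[^-]+)-----\s*"
    r"(?P<body>[A-Za-z0-9+/=\s]+?)\s*"
    r"-----END (?P=label)-----"
)

def pem_payloads(text: str):
    """Yield (label, DER bytes) for each PEM block found in the text."""
    for match in PEM_RE.finditer(text):
        body = "".join(match.group("body").split())  # drop line breaks
        yield match.group("label"), base64.b64decode(body)
```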

3 Metadata Patterns

Data can come in different forms and from different sources. One can target specific patterns to verify the shape and characteristics of the binary data. An occurrence of targeted patterns of interest may help with further data processing.

This section presents descriptions of several metadata patterns that may help give a better overview of the input data. The following patterns are often used as entity or location identifiers. They may also be used to maintain data integrity, serve as pointers to data sources, or provide context to support the determination of other characteristics of the data being analysed.

3.1 Cryptographic Hash Functions

A cryptographic hash function is a many-to-one function that maps data of arbitrary size to a bit array of a fixed size. It is a one-way function (see Section 3.1 in [47]) which ideally has the following properties (see [48] for details): it is deterministic; it is quick to compute; it is infeasible to generate a message that yields a given hash value (pre-image resistance); it is infeasible to find two different messages with the same hash value (collision resistance, see Section 3.2 in [47]); and a small change of the message should make an extensive change of the hash value.

Cryptographic hash functions have many different applications, primarily to protect the authenticity of information. However, they have also been utilised in pass-phrase protection, construction of digital signature schemes (the generation and verification of digital signatures), construction of encryption algorithms, key derivation, pseudorandom bit generation, and checksums to detect accidental data corruption.

3.1.1 Message-Digest Algorithm 5

Message-Digest algorithm 5 (MD5) is a cryptographic hash function that takes an input of arbitrary length and produces a 128-bit message digest of the input. It was created by Ronald Rivest and published

in RFC 1321 in 1992 [49] as an extension and replacement of the MD4 hashing function, intended for digital signature applications. Unfortunately, it has been prone to collision attacks and found insecure for new cryptographic designs [50]. Despite that, MD5 message digests have been widely used as checksums to verify data integrity and are commonly represented as strings of 32 hexadecimal digits.

3.1.2 Secure Hash Algorithms

As with the previously mentioned MD5 hash function, the family of hash functions referred to as Secure Hash Algorithms (SHA), created by the National Security Agency (NSA), originated as an enhanced version of the earlier MD4 function.

The first heavily used hash function, which also uses the Merkle-Damgård construction, was SHA-1 [51], published in 1995 as a U.S. Federal Information Processing Standard (FIPS). The message is parsed into message blocks, padded, and the message length is appended at the end of the padded message. Compared to MD4 and MD5, the hash size increased to 160 bits.

In 2001, a new family SHA-2, consisting of the six functions SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256, was published in FIPS PUB 180-4 [51]. The functions differ in the shift amounts, additive constants, and the number of rounds. The numbers in the names give the length of the hash digest.

The latest member of the SHA family is SHA-3, which is based on the Keccak algorithm. It was published in 2015 as FIPS PUB 202 [52]. It is internally different from the SHA-1 and SHA-2 families since it does not use the Merkle-Damgård construction but a sponge construction. The standard functions are SHA3-224, SHA3-256, SHA3-384, and SHA3-512, plus two extendable-output functions (XOFs), SHAKE128 and SHAKE256.
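For pattern detection, the fixed digest sizes are what matters: an MD5 digest spans 32 hexadecimal characters, SHA-1 spans 40, and SHA-256 spans 64. A quick check with Python's hashlib:

```python
import hashlib

message = b"automated metadata extraction"

# Hex-digest lengths are fixed regardless of the input size.
assert len(hashlib.md5(message).hexdigest()) == 32       # 128 bits
assert len(hashlib.sha1(message).hexdigest()) == 40      # 160 bits
assert len(hashlib.sha256(message).hexdigest()) == 64    # 256 bits
assert len(hashlib.sha3_256(message).hexdigest()) == 64  # SHA-3 family

# Determinism: the same input always yields the same digest.
assert hashlib.md5(message).hexdigest() == hashlib.md5(message).hexdigest()
```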

3.2 Uniform Resource Identifier

A Uniform Resource Identifier (URI) is a compact sequence of characters that provides a simple and extensible means for identifying an abstract or a physical resource. Its syntax is organised hierarchically.


The components are listed in order of decreasing significance from left to right. Access to the resource is defined by the protocols that make use of the URI. Each URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme. The specification defines required elements and elements common to many of the schemes. Although the components' semantics are specific to each scheme, the generic component separators are "/", "?", and "#". URIs can be divided into two specific subgroups: the Uniform Resource Locator (URL) and the Uniform Resource Name (URN).

3.2.1 Uniform Resource Locator

For Internet access protocols, it was necessary to define an encoding of the access algorithm into something that could be termed an address. Thus, a URL identifies a resource's location in the context of a particular access protocol, such as HTTP(S) or FTP, describing its primary access mechanism. Although URL is a subset of URI, the names are often used interchangeably.

3.2.2 Uniform Resource Name

URN is a URI that is assigned under the “urn” URI scheme and a particular URN namespace. It is a globally unique, persistent resource identifier on the Internet [53]. Officially registered namespace identifiers pass through the registration procedure managed by the Internet Assigned Numbers Authority (IANA) [54]. All URNs have the form urn:<NID>:<NSS>, where <NID> is the unique, case-insensitive namespace identifier (only ASCII characters are allowed), and <NSS> is the namespace-specific string unique within the NID. RFC 8141 specifies that a URN is valid only if the NID is registered, but it is common to encounter URNs with unregistered NIDs. [53]

3.2.3 Data URI

The data URI scheme was defined in RFC 2397 [19], published in 1998. It allows the inclusion of (small) data items directly inline in Web pages, as if they had been included externally. A data URI comprises the Internet media (MIME) type and the data part, which can be either percent-encoded (see Section 2.4) or Base64-encoded (see Section 2.2.4). In the latter case, the data encoding is denoted by the [;base64] part of the URI.

Listing 3.1: Data URI format
data:[<mediatype>][;base64],<data>
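As a sketch, a data URI of this form can be parsed with a few lines of standard-library Python; the helper name parse_data_uri is introduced here for illustration only:

```python
import base64
from urllib.parse import unquote_to_bytes

def parse_data_uri(uri: str):
    """Split a data URI into (media type, payload bytes).

    Minimal sketch following RFC 2397: data:[<mediatype>][;base64],<data>
    """
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, _, data = uri[len("data:"):].partition(",")
    if header.endswith(";base64"):
        media_type = header[:-len(";base64")]
        payload = base64.b64decode(data)
    else:
        media_type = header
        payload = unquote_to_bytes(data)  # otherwise percent-encoded
    # RFC 2397 default when the media type is omitted:
    return media_type or "text/plain;charset=US-ASCII", payload

media, body = parse_data_uri("data:text/plain;base64,SGVsbG8=")
# media == "text/plain", body == b"Hello"
```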

3.2.4 Magnet URI

Magnet (“magnet:”) is a URI scheme that defines the format of magnet links. It identifies files by their content via a cryptographic hash value. It was created in 2002 as a neutral generalization of the “ed2k:” and “freenet:” URI schemes used by the eDonkey2000 and Freenet networks, conforming to the URI standard as much as possible. Magnet links are mainly used by P2P applications to refer to files located in a particular network, such as Gnutella, Gnutella2 (G2), BitTorrent (BT), Direct Connect (DC), or eDonkey2000 (eD2k). Magnet URIs consist of several parameters with URL-encoded values. The most significant and the only required one is the xt parameter, which contains the hash value of the file to look up in the network. Other supported parameters are listed in Table B.1. Parameters starting with “x.” are experimental and specific to a network. Different formats of the xt parameter are listed in Table B.2. Magnet links may be best known for their use in the BitTorrent network along with torrent files. In the BitTorrent Info Hash format, the content is identified via the SHA-1 hash of the torrent info metadata. Since version 2 of the protocol, released in 2020, a SHA-256 hash encoded in the multi-hash format has been used. It is possible to include both a v1 (btih) and a v2 (btmh) info-hash in a BitTorrent magnet link for backwards compatibility.

Listing 3.2: BitTorrent magnet link structure
magnet:?xt=urn:btih:<info-hash>&dn=<file-name>&tr=<tracker-url>&x.pe=<peer-address>
magnet:?xt=urn:btmh:<multihash-info-hash>&dn=<file-name>&tr=<tracker-url>&x.pe=<peer-address>
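Since magnet links are ordinary URIs with a query string, Python's urllib.parse can split them into their parameters. A minimal illustrative parser follows (parse_magnet is a hypothetical helper, and the info-hash below is an arbitrary example value):

```python
from urllib.parse import urlparse, parse_qs

def parse_magnet(link: str) -> dict:
    """Split a magnet link into its parameters.

    parse_qs performs the URL-decoding of the values; xt is the only
    required parameter, as described above.
    """
    parsed = urlparse(link)
    if parsed.scheme != "magnet":
        raise ValueError("not a magnet link")
    params = parse_qs(parsed.query)
    if "xt" not in params:
        raise ValueError("missing required xt parameter")
    return params

params = parse_magnet(
    "magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a"
    "&dn=example%20file&tr=udp%3A%2F%2Ftracker.example.org%3A1337"
)
# params["dn"] == ["example file"]  (values are URL-decoded lists)
```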

3.3 International Standard Book Number

ISBN has been internationally recognised as the unique identification for monographic publications, published as the standard ISO 2108 (with the latest revision ISO 2108:2017 from 2017) [55]. The first version (ISBN-10) consists of 10 digits dividing the identifier into four parts – a registration group element, a registrant element, a publication element, and a check digit. A modulo 11 operation is used to calculate the check digit on the sum of the digits' weighted products with weights from 10 to 1. The character “X” is used if the remainder equals 10. However, only the 13-digit identifiers (ISBN-13) have been assigned since 2007. ISBN-13 adds the GS1 prefix (978 or 979). Modulo 10 is used to calculate the check digit on the weighted products' sum with alternating weights 1 and 3. One can obtain ISBN-13 from the book's ISBN-10 by prefixing the first nine digits with 978 and calculating the check digit according to ISBN-13. ISBN-13 can be represented in the machine-readable form of a 13-digit EAN-13 bar code. It can be represented as a URN urn:isbn:<number>, where <number> comprises the identifier's digits, e.g. urn:isbn:9780132350884 (corresponding to the printed form ISBN 978-0-13-235088-4).
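The two check-digit rules can be expressed directly in Python. The helper names are illustrative, and the example reuses the ISBN from above:

```python
def isbn10_check_digit(first9: str) -> str:
    """ISBN-10: weights 10..2 over the first nine digits, check mod 11."""
    total = sum(w * int(d) for w, d in zip(range(10, 1, -1), first9))
    r = (11 - total % 11) % 11
    return "X" if r == 10 else str(r)

def isbn13_check_digit(first12: str) -> str:
    """ISBN-13: alternating weights 1 and 3, check digit mod 10."""
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)

def isbn10_to_isbn13(isbn10: str) -> str:
    """Prefix the first nine ISBN-10 digits with 978, recompute the check."""
    digits = "978" + isbn10.replace("-", "")[:9]
    return digits + isbn13_check_digit(digits)

isbn13 = isbn10_to_isbn13("0-13-235088-2")
# isbn13 == "9780132350884"
```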

3.4 International Standard Serial Number

The International Standard Serial Number (ISSN) is an 8-digit identifier used for serial publications, associated with the publication's title [56]. The identifier has been standardized in ISO 3297 (with the latest edition ISO 3297:2020 [57]) and consists of two 4-digit parts joined by a hyphen. The last digit is a check digit calculated modulo 11 on the sum of the digits' weighted products with weights from 8 to 2, in which the letter “X” replaces the remainder value of 10. An ISSN can be displayed as a URN of the form urn:issn:xxxx-xxxx, e.g. urn:issn:1793-6519.
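The same style of weighted checksum applies here; a short sketch (helper names are illustrative) validated against the example ISSN above:

```python
def issn_check_digit(first7: str) -> str:
    """ISSN: weights 8..2 over the first seven digits, check mod 11;
    "X" stands for a check value of 10."""
    total = sum(w * int(d) for w, d in zip(range(8, 1, -1), first7))
    r = (11 - total % 11) % 11
    return "X" if r == 10 else str(r)

def is_valid_issn(issn: str) -> bool:
    """Accepts the printed form with a hyphen, e.g. "1793-6519"."""
    digits = issn.replace("-", "").upper()
    return len(digits) == 8 and issn_check_digit(digits[:7]) == digits[7]
```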

3.5 Digital Object Identifier

The Digital Object Identifier (DOI) is a digital identifier of an object; it is a unique handle in the Handle System. Managed by the International DOI Foundation (IDF) since 1998, it has been standardised as ISO 26324:2012 [58]. It forms a string divided into two parts, “prefix/suffix”. The prefix identifies the identifier's registrant, and the suffix is the unique identification of the object. The prefix is of the form 10.NNNN, where NNNN is a number larger than 1000 that can be further subdivided with periods. The suffix may incorporate an identifier based on another system (e.g. ISBN, ISSN). The DOI aims to resolve the identifier to the resource's location [59]. It commonly identifies academic, professional, and government information, such as journal articles, research reports, and official publications. It is usually displayed in one of the following forms:
• DOI prefix/suffix
• doi:prefix/suffix
• https://doi.org/prefix/suffix

3.6 IP Address

An Internet Protocol (IP) address is a numeric label for a network hardware device that uses the Internet Protocol [60] to communicate. It identifies the host (more precisely, the network interface) and provides the host's location needed to make the device available for communication on the network. Internet Protocol version 4 (IPv4) defines an IP address as a 32-bit number, allowing up to 2^32 addresses. It is commonly represented in human-readable form as four decimal numbers separated by periods. Due to the rapid exhaustion of IPv4 addresses available for assignment, Internet Protocol version 6 (IPv6) has been created, defining IP addresses as 128-bit numbers and creating an address space of up to 2^128 addresses. These addresses are commonly displayed as hexadecimal values separated by a colon “:”. Both versions of the protocol support subnetworks with the Classless Inter-Domain Routing (CIDR) notation, which defines the number of most significant bits that describe the network prefix, identifying a whole network or a subnetwork. The least significant bits form the host identifier, which specifies the host interface on the network. IANA [61] manages the address space and each of the special-purpose address blocks [62].
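Python's ipaddress module covers these notions directly (both address versions, CIDR networks, and flags derived from the IANA special-purpose registries):

```python
import ipaddress

# Parse addresses of either version and inspect CIDR networks.
addr = ipaddress.ip_address("192.0.2.17")    # IPv4Address
net = ipaddress.ip_network("192.0.2.0/24")   # 24-bit network prefix
host_count = net.num_addresses               # 2 ** (32 - 24) == 256
member = addr in net                         # membership test against CIDR

# Compressed vs. fully exploded IPv6 textual forms.
v6 = ipaddress.ip_address("2001:db8::1")
full = v6.exploded  # "2001:0db8:0000:0000:0000:0000:0000:0001"

# Special-purpose blocks are exposed as boolean properties.
private = ipaddress.ip_address("10.0.0.1").is_private
```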

3.7 E-mail

An e-mail address is a character string that identifies the recipient (a user or a location) of the mail. A depository to which the mail is delivered is called a mailbox. E-mail addresses comprise a locally interpreted string, followed by the at-sign character “@”, followed by an Internet domain. The local part commonly comprises the Latin letters, the digits, and the characters “.”, “-”, and “+”. However, any of the printable characters “!#$%&'*+-/=?^_`|~” (listed within the quotes) is valid [38]. Unless the local part is a quoted string, no two consecutive dot characters may occur. Since the local part may be interpreted differently in different systems, we can find e-mail systems that treat local parts with and without the dots as equal. Similarly, an address containing a part delimited by a plus sign may be treated as the same address without the plus sign and the part following it. In such cases, the plus sign may be used for message categorisation or tagging. The domain part consists of labels conforming to hostnames, where each label must start with a letter and end with a letter or a digit. Each label can be at most 63 characters long. The allowed characters are the lower- and upper-case Latin letters, the digits, and the hyphen “-” character [10]. Labels are delimited with a dot “.” character. In rare cases, the domain part can be written as an IP address enclosed within square brackets.
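A minimal sketch of the split described above, checking only the basic label rules rather than the full RFC 5322 grammar (split_email is a name introduced here for illustration):

```python
import re

def split_email(address: str):
    """Split an e-mail address into (local part, domain).

    Checks only the hostname-label rules from the text: each label
    starts with a letter, ends with a letter or digit, is at most
    63 characters long, and uses letters, digits, and hyphens.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("missing local part or domain")
    label = re.compile(r"^[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")
    if not all(label.match(part) for part in domain.split(".")):
        raise ValueError("invalid domain label")
    return local, domain

local, domain = split_email("john.doe+tag@example.org")
# local == "john.doe+tag", domain == "example.org"
```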

3.8 MAC Address

Media Access Control Address (MAC address) is a 48-bit hardware identification number associated with a particular physical interface that uniquely identifies each device on a network. The ID is assigned to vendors by the IEEE and assigned to the network circuit at manufacture. Hence, it is often referred to as a hardware address or a physical address. MAC addresses are made up of twelve hexadecimal values. The first three octets contain the ID number of the adapter manufacturer, also known as the Organizationally Unique Identifier (OUI) [63]. The remaining bits of a MAC address represent the serial number assigned to the adapter by the manufacturer, also known as the Network Interface Controller (NIC) specific part. However, IEEE has started assigning identifiers of larger sizes, which leaves smaller NIC address blocks. They assign the MA-L [64] (24-bit identifiers), MA-M [65] (28-bit identifiers), and MA-S [66] (36-bit identifiers) types of identifiers that can be viewed in public listings [67]. MAC addresses are used in the data link layer of the Open Systems Interconnection (OSI) model. The addresses are commonly presented as six pairs of hexadecimal values delimited by a colon or a hyphen, or are displayed without a separator. The Cisco MAC address format is three quartets of hexadecimal values delimited by a period.
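For illustration, normalising the presentation formats mentioned above and extracting the OUI takes only a few lines (the helper names are hypothetical):

```python
import re

def normalize_mac(mac: str) -> str:
    """Normalize colon-, hyphen-, period-delimited, or bare MAC strings
    to the common lowercase aa:bb:cc:dd:ee:ff form."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac)
    if len(digits) != 12:
        raise ValueError("not a 48-bit MAC address")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2)).lower()

def oui(mac: str) -> str:
    """First three octets: the Organizationally Unique Identifier."""
    return normalize_mac(mac)[:8]

# All of these denote the same address (Cisco format included):
forms = ["00-1B-44-11-3A-B7", "001b.4411.3ab7", "001B44113AB7"]
normalized = {normalize_mac(f) for f in forms}
# normalized == {"00:1b:44:11:3a:b7"}
```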

3.9 Cryptocurrency

Cryptocurrencies have created an alternative way of maintaining and trading property between people. They are based on cryptographic principles and often take advantage of the decentralization of the network. People can easily create a cryptocurrency wallet and use it in much the same way that banks typically handle monetary transactions. The technology called blockchain acts as a ledger recording immutable information about all transactions, and serves as a safeguard against malicious changes. Each transaction is validated by the nodes involved in a process commonly known as mining. Ownership of a cryptocurrency is generally established through digital keys, crypto addresses, and digital signatures. A crypto wallet generates and stores pairs of public and private keys that are used to perform transactions. The private keys are used to sign transactions,

1. MAC Addresses – Large; it is also known as OUI
2. MAC Addresses – Medium
3. MAC Addresses – Small; it is also known as OUI-36


and the public keys derived from the private keys can be used to verify the transaction. Crypto addresses are hash strings generated from the public key and are used to ensure that sent assets are delivered to the correct recipient. The cryptocurrency market is snowballing. The website CoinMar- ketCap [68] currently lists over 5,000 different cryptocurrencies, and CoinGecko [69] lists around 7,000 digital assets. Changes in the crypto market are very dynamic. Hence, this section lists the top cryptocurren- cies by market capitalization listed on the CoinGecko cryptocurrency tracking website as of the end of November 2020 [70].

3.9.1 Bitcoin

The Bitcoin whitepaper was published in 2008 by Satoshi Nakamoto [71], and the blockchain was launched in early 2009. Bitcoin is radically changing how the payments industry is viewed and has become the impetus for rethinking how money is handled. Bitcoin embraces peer-to-peer payments over an online network, eliminating third parties and exchanging trust for verification. Transactions are irreversible on the blockchain and recorded in chronological order based on a computed distributed timestamp. The Bitcoin blockchain is secure as long as honest participants have more computing power than attackers. Bitcoin is based on the Proof-of-Work concept, i.e. the verification of transactions is based on the principle of solving a computationally difficult puzzle. If successfully solved, a block is created and stored in the blockchain. This process is called mining and is performed by the network nodes involved, called miners. Bitcoin is generally not anonymous. As a good practice, privacy-concerned participants are encouraged to use a unique address for individual transactions. Each participant in the Bitcoin blockchain must have a private key. A private key is theoretically any number between 0 and approximately 1.1578 × 10^77 [72, 73]. The upper bound is determined by secp256k1, which defines the parameters for the Elliptic Curve Digital Signature Algorithm (ECDSA) used in Bitcoin. By default, private keys are displayed as a hexadecimal string representing a 256-bit value or in Wallet Import Format (WIF).


A WIF key is an encoded form of a private key. Byte 0x80 is used as the prefix, and byte 0x01 is appended as the suffix if the private key corresponds to a compressed public key. From this string, a checksum is created as the first four bytes of the result of a double application of the SHA-256 hash. The extended private key with the appended checksum is converted to WIF by applying the Base58Check encoding (Section 2.2.3). The leading character of a Bitcoin WIF can be 5, K, or L [74]. Bitcoin currently supports three types of addresses – Pay-to-Pubkey-Hash (P2PKH), Pay-to-Script-Hash (P2SH) [75], and Bech32 addresses [36]. P2PKH and P2SH addresses use Base58 encoding and have the following format [75]:
[one-byte version][20-byte hash][4-byte checksum]

[20-byte hash] is the hash of the public key (P2PKH) or of the script (P2SH) used to spend the coins. When verifying the address, the checksum and the version byte must be checked. Valid values for the version byte are 0x00 and 0x05 for mainnet (the address prefix is 1 or 3, respectively) and 0x6f and 0xc4 for testnet. The Bech32 addresses are created according to the Bech32 encoding specification. The human-readable part is bc for mainnet and tb for testnet [36].
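The Base58Check verification described above can be sketched as follows. This is a simplified decoder for the Bitcoin alphabet only, not the implementation used in the tool; the example value is the well-known genesis-block address:

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check_decode(s: str) -> bytes:
    """Decode a Base58Check string and verify its 4-byte checksum.

    Returns the payload including the version byte; raises ValueError
    on a bad checksum.
    """
    n = 0
    for ch in s:
        n = n * 58 + B58_ALPHABET.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Each leading '1' in the string encodes a leading zero byte.
    raw = b"\x00" * (len(s) - len(s.lstrip("1"))) + raw
    payload, checksum = raw[:-4], raw[-4:]
    if hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4] != checksum:
        raise ValueError("invalid checksum")
    return payload

payload = base58check_decode("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa")
# payload[0] == 0x00 (mainnet P2PKH version byte), len(payload) == 21
```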

3.9.2 Ethereum

Ethereum is a decentralized blockchain created by Vitalik Buterin, who published its whitepaper in 2013 [76] (last updated in February 2021 [77]); the network was launched in 2015. Its native cryptocurrency is called Ether. The blockchain has smart contract functionality (a smart contract is a collection of functions and data residing at a specific address on the blockchain) and allows developers to build decentralized applications. The functionality is extended via Ethereum Improvement Proposals (EIPs), e.g. the creation of custom tokens specified in the EIP-20 [78] Token Standard and the EIP-721 [79] Non-Fungible Token Standard. Ethereum addresses comprise the prefix 0x and the rightmost 20 bytes of the Keccak-256 hash of the public key. The inclusion of a checksum in an address was proposed in EIP-55 [80] in 2016, which results in mixed letter casing.


3.9.3 Tether

Tether is a cryptocurrency with tokens issued by Tether Limited. It is called a stablecoin because it was initially designed to be worth 1 USD and to serve as a non-volatile trading currency. The initial version started on the Bitcoin blockchain via the Omni Layer protocol [81]. However, Tether has since expanded to other blockchains – Ethereum, Algorand, Bitcoin Cash, EOS, Liquid Network, Omni, Tron, and Solana [82]. Since it is not a native cryptocurrency on these blockchains, it utilises layers that enable trading the token. Hence, it uses the native address formats of the respective blockchains.

3.9.4 Polkadot

Polkadot is a heterogeneous multi-blockchain system that allows secure communication between different blockchains in a trust-free environment. The whitepaper was published in 2016 by Gavin Wood, a co-founder of Ethereum [83]. Polkadot addresses are built on the Substrate standard for defining addresses. This format is based on Bitcoin's Base58 encoding with additional modifications and is referred to as SS58. The use of this format also allows the transformation of the public key into addresses in various blockchains whose addresses are based on Substrate; they differ in the Substrate address type. Polkadot addresses start with the character 1 (one), reflecting the fact that the address type has a value of 0 (zero) in the SS58 encoding representation. SS58 utilises the Blake2b hash function to calculate the checksum [84].

3.9.5 XRP (Ripple)

XRP is a public, counterparty-free asset native to the XRP Ledger [85]. The XRP Ledger was created in 2012 by Ripple, a private enterprise software company. The Ripple protocol consensus algorithm was published in 2014 [86]. Unlike other cryptocurrency blockchains, the XRP Ledger is not decentralized. There are two types of addresses – a classic address and an X-address. The XRP Ledger natively supports only the classic format, but third-party tools support the X-address. The classic address starts with the character “r” and is between 25 and 35 characters long in the Base58 format, consisting of a 20-byte account ID (a concatenation of 0x00 with a hash of the account's public key) and a 4-byte checksum. The X-address starts with an upper-case “X” and contains the destination address and the destination tag.

3.9.6 Cardano

Cardano is a decentralised blockchain system employing the Proof-of-Stake consensus protocol Ouroboros, i.e. the validation nodes are selected according to the proportion of the stake in the relevant cryptocurrency. The project started in 2015 in an effort to change the way cryptocurrencies are designed and developed [87]. Cardano addresses depend on the current blockchain era. The Byron era (dedicated to building a foundational federated network that enabled the purchase and sale of ADA; it finished in July 2020) addresses can be divided into two types – Icarus-style addresses that start with Ae2 (they are between 57 and 64 characters long) and Daedalus-style addresses that begin with DdzFF (they are between 83 and 123 characters long). However, the Shelley era (focused on ensuring greater decentralisation) introduced a new type of wallet addresses based on the Bech32 encoding, and clients should switch to this newer address format. In this case, the addresses can be 103 characters long, which is longer than the Bech32 specification (Section 2.7) allows. The human-readable parts of the Shelley addresses are addr and stake.

3.9.7 Litecoin

Charlie Lee created Litecoin in 2011 as a fork of Bitcoin [88]. The differences lie in the hash functions used, the block size, and the address formats. Valid values for the version byte are 0x30, 0x05, and 0x32 for mainnet (address prefixes L, 3, M) and 0x6f, 0xc4, and 0x3a for testnet. Addresses prefixed with M were created to replace addresses with the prefix 3. Litecoin also supports Bech32 addresses with the prefixes ltc on mainnet and tltc on testnet.


3.9.8 Bitcoin Cash

Bitcoin Cash was created as a hard fork of Bitcoin in 2017. A new address format had to be created to avoid confusion with Bitcoin. However, Bitcoin Cash also supports legacy addresses corresponding to standard Bitcoin addresses. The new address format is called CashAddr and consists of the following parts:

• A prefix indicating the network on which the address is valid. Bitcoin Cash specifies the prefixes bitcoincash for mainnet, bchtest for testnet, and bchreg for regtest.

• A fixed-value separator “:” (a colon).

• A Base32-encoded payload indicating the destination of the address and containing a checksum. The encoding alphabet is the same as for Bech32.

The payload of the CashAddr address comprises three elements:
[1-byte version][hash 20-64 bytes][40-bit checksum]

The version byte contains information about the type of address and the size of the hash. The complete table with the defined values can be found in the specification [89]. The hash part represents a hash of the public key for P2PKH and a hash of the redeemScript for P2SH. The 40-bit checksum is the result of a BCH cyclic error-correcting code defined over GF(2^5) and takes the network prefix into account.

3.9.9 Chainlink

Chainlink is a decentralized oracle network that provides real-world data to blockchains. It was founded by Sergey Nazarov in 2017 and includes the cryptocurrency LINK. The original whitepaper was published in 2017 [90]; the whitepaper of version 2 was published in April 2021 [91]. The LINK token is an ERC-677 token based on the EIP-20 standard [92]. Hence, the address format is the same as for the Ethereum addresses described in Section 3.9.2.


3.9.10 Stellar

Stellar is a blockchain network for currencies and payments [93]. Unlike Bitcoin or Ethereum, it does not intend to replace the existing financial system but to enhance it. Like Ethereum, it provides the means to create custom assets to trade on the network. Its native currency, called lumens, is used to pay operation fees. Account IDs are represented by 32-byte public keys that begin with the character “G”. However, Stellar utilises a federation protocol that maps more human-readable user information to the account ID. These Stellar addresses are divided into two parts separated by an asterisk “*”, the username and the domain [94].

4 MetExt

This chapter introduces the tool MetExt (Metadata Extractor). We go through the analysis of the requirements, introduce examples of projects with similar aims and capabilities, describe the tool's design, present the implemented modules, and describe the API and the CLI tool that utilises the modules.

4.1 Analysis

In some instances, it is necessary to examine unknown binary data and determine their characteristics or the occurrence of specific patterns that confirm the data format and the data contained in it. However, manually checking for specific data becomes more error-prone with increasing data size. Therefore, it is advisable to automate this activity. Such a tool, which facilitates the necessary data analysis, should be easily extensible. Therefore, it is advisable to choose a modular architecture that allows extending isolated functionality by adding new modules or plugins. The tool should detect and decode input data and find specific patterns of interest in it. The tool should cover at least the common encoding types and formats. The functionality should be provided via an API and a CLI utility tool written in the Python programming language. In the following subsections, we therefore introduce several existing projects that offer similar functionality.

4.1.1 Balbuzard

Balbuzard is a malware analysis tool to extract patterns of interest from malicious files [95]. The tool is written in Python 2.x and is not supported in Python 3.x. It can search for strings or regular expression patterns such as IP addresses, URLs, EXE strings, and various malware strings. It also supports YARA [96] rules as search patterns. The tool contains a few helper tools for different kinds of analysis and data deobfuscation. The functionality can be extended with new patterns in Python scripts and YARA rules.


4.1.2 CyberChef

CyberChef is a client-side web app written in JavaScript for carrying out data analysis and transformation operations [97]. It is designed to let analysts manipulate data in complex ways. The functionality is divided into many pluggable and configurable operations, such as data decoding, decryption, and pattern extraction, composed into recipes in the user interface. Recipes are created by dragging and stacking boxes to apply operations in order. One can load a local file for analysis. CyberChef utilises several methods to detect encoded data automatically, such as pattern matching, checking magic numbers, or arithmetic logic brute-forcing. It also runs a “Magic” operation in the background to help recognise the set of operations needed to decode the data.

4.1.3 Chepy

Chepy is a Python utility tool in the form of an interactive CLI application and a library [98]. It was created as a direct counterpart to CyberChef written in Python and tries to cover all the functionality of CyberChef and extend it. Currently, at least Python 3.6 is required to run Chepy. Similarly to CyberChef, it is created with a modular architecture to enable a straightforward extension of the functionality, and it offers operation stacking. Moreover, the CLI supports command autocompletion.

4.1.4 Ciphey

Another CLI tool that provides automated decoding and decryption functionality is Ciphey [99]. It is written in Python, with the lowest supported version being Python 3.7. It utilises custom-built processing modules to guess the appropriate decoding and decryption with a certain probability of a correct guess. The creators see the tool as a direct alternative to CyberChef since Ciphey includes a larger set of encodings and can process larger files in a shorter time. Moreover, it can potentially reverse cryptographic hashes via external web services. Nevertheless, Ciphey is focused solely on the data decoding part.

4.2 Design

As outlined in Section 4.1, the goal is to create a tool that can process binary data and extract certain patterns or other interesting information. To make the tool easily extensible, it will be built on top of a modular architecture that isolates individual modules and allows adding new modules or using only a subset of them. Since the goal is to cover both the detection and decoding of binary data and the extraction of data patterns, the tool should contain at least two types of modules, which will differ in their use. Subsequently, to make these modules easy to use, a library with an easy-to-use API will be created. To cover the desired functionality, at least three high-level functions will be available, covering data decoding, data pattern extraction, and a general analysis that does not assume anything about the input data. In addition to the library itself, a CLI application will be created, which will also serve as an example of using the created library. The CLI application will process input files or the standard input and output information about the extracted patterns if any appear in the data. It will be possible to write the output both to the standard output and to a specified file, in an easy-to-read form as well as in a machine-processable form. Formats such as JSON or CSV are thus suitable for this purpose. The tool is required to be implemented in the Python language. That allows utilising the large standard library as well as many external packages. However, the aim is to keep supporting the older Python version 3.5.

4.3 Plugins

Four groups of modules cover all required functionality. The groups are decoders, extractors, validators and printers, which comprise the relevant modules in the metext.plugins directory. A base class has been created for each plugin type, i.e. BaseDecoder, BaseExtractor, BaseValidator, and BasePrinter. All of them inherit from the PluginBase class, which is designed to keep the information about registered plugins of individual types and retrieve them. The load_plugin_modules function in the metext.plugins module manages the registration of the plugins. It registers plugins by loading modules dynamically when the module is imported. The dynamic loading enables plugin modules to be imported into other modules and accessed via dedicated static methods. The realisation of plugins in such a modular architecture makes it simple to extend the list of plugins in the future. A new plugin can be added by creating a new class inheriting from the PluginBase class or any of the classes for a specific plugin type. A newly created class must set the plugin type and plugin name values and override the run method. The method takes one positional argument _input and any keyword arguments if needed. A plugin type, a plugin name, and a flag of an active state distinguish plugins from each other. Therefore, every plugin must have a unique pair of plugin type and plugin name. The data type of the returned values must be consistent within the plugins of the same type.

Although the validators have been isolated as individual plugin modules and are available for direct import, they are used only internally within extractors to ensure that the results are as accurate as possible, so they are not exposed via an API function. Thus, their description is not included in this text; they are available for further inspection in the source code. The other three plugin types are described in the following Sections 4.3.1 to 4.3.3.
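A minimal sketch of such a type/name plugin registry is shown below. The class and method names mirror the description but are simplified, and the registration mechanism (__init_subclass__, Python 3.6+) is an illustrative stand-in for MetExt's dynamic module loading:

```python
class PluginBase:
    """Keeps a registry of subclasses keyed by (plugin type, name)."""
    PLUGIN_TYPE = None
    PLUGIN_NAME = None
    _registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only concrete plugins (both type and name set) are registered,
        # enforcing the unique (type, name) pair described above.
        if cls.PLUGIN_TYPE and cls.PLUGIN_NAME:
            PluginBase._registry[(cls.PLUGIN_TYPE, cls.PLUGIN_NAME)] = cls

    @staticmethod
    def get_plugin(plugin_type, name):
        return PluginBase._registry[(plugin_type, name)]

    @classmethod
    def run(cls, _input, **kwargs):
        raise NotImplementedError

class BaseDecoder(PluginBase):
    PLUGIN_TYPE = "Decoder"

class HexDecoder(BaseDecoder):
    PLUGIN_NAME = "hex"

    @classmethod
    def run(cls, _input, **kwargs):
        text = _input.decode() if isinstance(_input, bytes) else _input
        try:
            return bytes.fromhex(text)
        except ValueError:
            return None  # decoders signal failure with None

decoder = PluginBase.get_plugin("Decoder", "hex")
# decoder.run("48656c6c6f") == b"Hello"
```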

4.3.1 Decoders

Decoding plugins inherit from the BaseDecoder class, which sets the plugin type to “Decoder”. A total of 21 different active decoders have been implemented, excluding the trivial case of a helper decoder returning the identity of the input data. See Table 4.1 for the full overview of the decoders. In addition, three other helper decoders have been implemented for the Bech32 encoding, the Bitcoin SegWit address type, and a common Base58 decoder. However, they are not enabled and thus can be used only internally within other modules.


Decoders can be called via the API function decode, which takes the data and the plugin name as positional arguments and arbitrary keyword arguments for the decoder. The list of active decoders can be retrieved with the list_decoders function. The function returns a dictionary with decoder names as keys and decoder classes as values. The function list_decoders_names returns the list of the names of the decoders in the active state. The plugins of the decoder type should return a value of the data type Optional[bytes]. A decoder should return the None value if the decoding fails. The currently implemented decoders attempt to decode the input as a whole. Thus, a minor divergence from the expected data format may cause the decoding to fail.

4.3.2 Extractors

Plugins of the type “extractor” inherit from the BaseExtractor class, which sets the plugin type to “Extractor”. A total of 34 distinct extractors have been implemented; they can be seen in Table 4.2. There are extractors for 29 unique patterns if we consider the ipv4 and ipv6 extractors as subtypes of a general IP address extractor, and the url, urn, data_uri and magnet extractors as subtypes of the uri extractor. The plugins of the extractor type should return an iterable collection of dictionaries, e.g. a list of dictionaries. It is expected that a dictionary item in the collection contains at least the property “value” with the value of the found pattern. The implemented extractors utilise regular expressions. Thus, they share a common helper extracting function _extract_with_regex, whose signature can be seen in Listing D.3. The common item dictionary with the data of the found pattern can be seen in Listing D.4. However, that output is generally not final, so each extractor can extend the common dictionary object with additional data. An example of such an extractor is the mac extractor, which adds optional information about the MAC address.
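The following sketch shows the general shape of such a regex-driven extractor. The function below is an illustrative analogue of _extract_with_regex, not its actual signature (which is given in Listing D.3):

```python
import re

def extract_with_regex(text: str, pattern: str, validator=None):
    """Every regex match becomes a dict with at least a "value" key;
    an optional validator callback filters out false positives (the
    role MetExt's validator plugins play internally)."""
    results = []
    for match in re.finditer(pattern, text):
        value = match.group(0)
        if validator is None or validator(value):
            results.append({"value": value, "position": match.start()})
    return results

# Hypothetical use: IPv4-looking strings, validated octet by octet.
ipv4_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

def valid_octets(s):
    return all(0 <= int(o) <= 255 for o in s.split("."))

found = extract_with_regex("hosts: 10.0.0.1, 999.1.1.1",
                           ipv4_pattern, valid_octets)
# "999.1.1.1" matches the regex but is rejected by the validator
```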


Table 4.1: The list of implemented decoders

Decoder name – Description
hex – Hexadecimal values, optionally with or prefixes
hexdump – Hexadecimal values from hex dump formats produced by tools such as hexdump and xxd
base32 – base32 encoding as per RFC 4648
base32hex – base32hex encoding as per RFC 4648
base32crockford – Crockford's Base32 encoding
base58btc – Base58 encoding with the Bitcoin alphabet
base58ripple – Base58 encoding with the Ripple alphabet
base64 – base64 encoding as per RFC 4648
base64url – base64url encoding as per RFC 4648
base85 – base85 encoding as per RFC 1924
ascii85 – Ascii85 encoding including the Adobe variant
z85 – Z85 encoding as defined in [24]
base91 – Base91 encoding
binhex – BinHex version 4 encoding
pem – PEM file format containing Base64-encoded data
gzip – GZIP compression format as defined in RFC 1952
raw – A helper performing no decoding
mime – Decoding of a message body compliant with the MIME standard
percent – Percent-encoding as defined in RFC 3986
quopri – Quoted-printable encoding as defined in RFC 2045
uu – UUEncoding as created with the uuencode utility tool
yenc – yEnc encoding as defined in [35]


Table 4.2: The list of implemented extractors

Extractor name   Description
base32           Base32-encoded data as defined in RFC 4648
base64           Base64-encoded data as defined in RFC 4648
btc              Bitcoin P2PKH, P2SH and Bech32 types of mainnet addresses
btc-wif          Bitcoin private keys in the wallet import format (WIF)
bip32-xkey       Extended private and public keys as defined in BIP32 deterministic wallet formats
eth, ltc, xrp, usdt, bch, chainlink, ada, dot
                 Ethereum / Litecoin / Ripple (classic address) / Tether (Omni, Ethereum) / Bitcoin Cash / Chainlink / Cardano / Polkadot mainnet addresses
email            E-mail addresses with no quoted strings or IP address, as per RFC 5322
uuid             Universally unique identifiers as per RFC 4122
md5, sha1, sha224, sha256, sha384, sha512
                 MD5 / SHA-1 / SHA-224 / SHA-256 / SHA-384 / SHA-512 message hex digests
ipv4, ipv6       IPv4 and IPv6 addresses
isbn, issn, doi  ISBN / ISSN / DOI publication identifiers
json             Non-empty JSON arrays and objects
mac              MAC addresses as defined in [63]
pem              Data in PEM format
uri, url, urn    URIs with registered schemes listed by IANA / URLs with HTTP and FTP schemes / URNs
data_uri         URIs with the data scheme
magnet           BitTorrent magnet links


4.3.3 Printers

A few options have been provided to save the analysis results to a file. Three data printers are implemented for the JSON, CSV and text-table formats, with the respective printer names “json”, “csv” and “text”. Every printer inherits from the BasePrinter class that defines the plugins’ type. The printers are simple helpers that output the result data without further knowledge of the data structure. Therefore, they may be used to print other kinds of data as well. A custom JSON encoder may be provided to the JSON printer to deal with otherwise unserialisable data. To support the serialisation of sets, a helper class metext.utils.CustomJsonEncoder has been created. Other custom JSON encoders should therefore inherit from this encoder so as not to lose the set serialisation support. The CSV and text-table printers need a data transformation preceding the printing. Thus, the helper functions convert_to_csv_format and convert_to_table_format have been created in the metext.utils module to perform the transformations. The package tabulate [100] is used to print the text-table output, so one can choose any of the table formats supported by the package. Each of the printers takes an optional argument “filename” as the destination of the output. If it is not provided, the result is printed to the standard output. Other arguments are specific to each printer. An example of the arguments that the json printer takes can be seen in Table 4.3.

Table 4.3: Overview of the json printer’s arguments

Argument      Description
_input        A positional argument; it holds the data to print
filename      Output file path; if not provided, the output is printed to the standard output
indent        An indentation number to pretty print the output. The default value is 2
sort_keys     A flag to sort keys in JSON. The default value is False
json_encoder  A custom JSON encoder class with the default method. The default value is metext.utils.CustomJsonEncoder
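The set-serialisation support mentioned above can be sketched as follows. The class name SetFriendlyEncoder is illustrative; the real class is metext.utils.CustomJsonEncoder, whose exact behaviour may differ.

```python
import json

class SetFriendlyEncoder(json.JSONEncoder):
    """Sketch of a custom encoder in the spirit of CustomJsonEncoder."""

    def default(self, o):
        if isinstance(o, set):
            # sets are not JSON-serialisable by default; emit a sorted list
            return sorted(o)
        return super().default(o)

json.dumps({"addresses": {"b", "a"}}, cls=SetFriendlyEncoder)
```

Subclassing this encoder (rather than json.JSONEncoder directly) is what preserves the set support for further custom encoders.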

4.4 API

To meet the tool’s requirements of detecting different data formats and extracting patterns, several functions have been created that utilise the implemented plugins. The most important ones are analyse, decode, extract_patterns, and print_analysis_output. Moreover, a function register_plugin_modules that allows registering plugins at custom paths is included too. Some of the functions take the name of a plugin as an argument, so the function list_decoders_names and its equivalents for extractors and printers are available as well. These functions return a list of active plugins’ names of the respective type. The register_plugin_modules function loads any plugin (a class inheriting from the PluginBase class) in the specified paths into the scope of the metext.plugins module and registers the plugin for further use. The extract_patterns function utilises the extractors to find patterns in input data. It takes the data, the extractor name and the extractor keyword arguments as its arguments. Generally, it expects string input data. It outputs a list of dictionaries with the data about the found patterns. An example of a pattern output dictionary can be seen in Listing D.4. The decode function uses the decoder plugins. Similarly to extract_patterns, its arguments are the data, the decoder’s name and the decoder’s keyword arguments. The output is exactly the output of the used decoder, which should be None in case of failure, otherwise the decoded bytes. Most likely, the most significant function is analyse. It takes the data, a list of tuples specifying the name and keyword arguments of a decoder, and a list of tuples specifying the name and keyword arguments of an extractor. The data can be of types string, bytes, list of strings, BytesIO, StringIO, or FileInputExtended. Data of types string, bytes and list of strings are transformed to StringIO, BytesIO and FileInputExtended, respectively.
If a list of strings is provided, it is expected to be a list of file paths to process. In general, any type with a read() method should work as long as its output is a string or bytes, or can be converted to bytes. The function’s signature can be seen in Listing D.1. It outputs a list of dictionaries where each list item should contain at least the property value. The skeleton of the output data can be seen in Listing D.2.

The lists of decoders and extractors can reduce the scope of interest. If decoders are not provided, all supported decoders in an active state can be used to detect/decode the data. The same holds for the extractors. Moreover, there is an attempt to decompress the data if they are compressed. The attempted decompressions include the Brotli [101], zip [102], gzip [103], bzip2 [104], xz [105], and LZMA [106] compression methods and formats. They are examples of HTTP content-encoding parameters [107] and common compression methods used on Unix-like systems. Furthermore, if the data is a (compressed) tar archive [108], it is also decompressed and the files are extracted from the archive. If any decompression or tar extraction is applied, this information is added to the output skeleton in Listing D.2, indicating whether it took place before or after the actual data decoding. Even though some means to reduce false positives and increase the chances of correct format detection and decoding are implemented, the output does not discard decoding attempts that passed. That means the data could have been decoded in a format in which no patterns were found. Thus, as complete an output as possible is given for further inspection by a person, who decides which of the output data is the most accurate. The print_analysis_output function exposes the printer plugins. However, the function is meant to be used only on the output of the analyse function, as a specific data structure is expected.
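The trial-decompression step can be sketched with standard-library methods only. This is a simplified stand-in, not the tool's implementation: the real tool additionally handles Brotli, zip and tar, which need third-party or archive handling.

```python
import bz2
import gzip
import lzma

def try_decompress(data: bytes):
    """Try each supported decompression in turn; report which one succeeded."""
    for name, decompress in (
        ("gzip", gzip.decompress),
        ("bzip2", bz2.decompress),
        ("xz/lzma", lzma.decompress),
    ):
        try:
            return name, decompress(data)
        except Exception:
            continue
    return None, data  # not compressed, or an unsupported format

try_decompress(gzip.compress(b"hello"))
```

Because each decompressor rejects foreign data with an exception, the loop doubles as a cheap format check before the decoded bytes are handed to the decoders.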

4.5 CLI

The MetExt tool is also provided as a CLI utility, whose list of supported arguments can be seen in Table 4.4. The arguments -i and -f specify the list of input files. They can be used together, i.e. the tool processes the union of the found files. The input can also be the standard input STDIN if no input files are specified, which enables piping output from other tools to MetExt. The tool internally uses the class FileInputExtended, an extension of the FileInput class supporting the read() method, as a wrapping helper to process the input files sequentially. Then the input is processed with the analyse function, with the active supported decoders and extractors called with the default arguments. Once the data are processed, the output is given in the format specified by the argument -F and written either to a file specified by the argument -o, or to the standard output STDOUT. The supported formats are a JSON array with objects containing the found patterns for each input file, the CSV format, and a text table. A minor data transformation is applied to create the CSV output. The common columns are source (the input file name), format (the name of any applied encoding), pattern_type (the name of the searched pattern), and pattern (the value of the found pattern). Other columns are created according to the existing keys in all found pattern objects, i.e. a union of all existing keys except those conforming to the common ones creates additional columns in the CSV output. Values of the pattern column and the additional columns are JSON-serialised to avoid data representation problems. The text table is a table representation of the CSV output format.
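The CSV transformation described above can be sketched as follows. The function name patterns_to_csv is illustrative; the tool's real helpers are convert_to_csv_format and convert_to_table_format, whose exact behaviour may differ.

```python
import csv
import io
import json

def patterns_to_csv(patterns):
    """Sketch: common columns first, then the union of all remaining keys."""
    common = ["source", "format", "pattern_type", "pattern"]
    # extra columns = union of all keys across pattern dicts, minus common ones
    extra = sorted({key for p in patterns for key in p} - set(common))
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(common + extra)
    for p in patterns:
        row = [p.get(c, "") for c in common[:3]]
        # the pattern value and the extra columns are JSON-serialised
        row.append(json.dumps(p.get("pattern", "")))
        row.extend(json.dumps(p.get(c, "")) for c in extra)
        writer.writerow(row)
    return buf.getvalue()
```

JSON-serialising the variable-shaped cells keeps nested values (lists, dictionaries) representable in a flat CSV row.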


Table 4.4: Overview of supported arguments in CLI

Argument                 Description
-h, --help               Display the help message
-i, --input INPUT        A path to files to analyze. Wildcards * and ** are supported.
-f, --file FILE          A path to a file containing the paths of files to analyze. Wildcards * and ** are supported.
-r, --recursive          A flag to switch the wildcard ** to include files recursively.
-o, --output OUTPUT      An output file path. If not specified, the output is written to STDOUT.
-d, --decode [...]       Select encoding formats to consider for processing. If not specified, all supported and active encodings are considered. Supported values are {hex, hexdump, base32, base32hex, base32crockford, base58btc, base58ripple, base64, base64url, base85, ascii85, z85, base91, binhex, pem, gzip, raw, mime, percent, quopri, uu, yenc}
-e, --extract [...]      Select patterns to search for. If not specified, all supported patterns are searched for. Supported values are {base32, base64, btc, btc-wif, bip32-xkey, eth, ltc, xrp, usdt, bch, chainlink, ada, dot, doi, email, uuid, md5, sha1, sha224, sha256, sha384, sha512, ipv4, ipv6, isbn, issn, json, mac, pem, uri, url, urn, data_uri, magnet}
-F, --out-format FORMAT  Select the output format. If not specified, the output is in JSON format. Supported values are {csv, json, text}.

5 Evaluation

Once the desired tool for finding specific data patterns has been created, it is desirable to know whether and how successfully it finds the patterns of interest in arbitrary binary data. Therefore, this chapter contains an evaluation of the tool based on the generated and collected data.

5.1 Test Data Set

The test data sources for evaluation can be grouped into artificial data sources and real data sources that have been only slightly modified. The custom dataset was created to represent the data in as many of the available and supported encodings as possible (see Table 4.1 for the list and Chapter 2 for their description). At the same time, the data should contain examples of as many types of patterns as possible. The supported patterns are listed in Table 4.2, with their respective descriptions in Chapter 3.

5.1.1 Generated Data

We created a basic set of test data with valid data patterns. The patterns were embedded in the context of pseudo-random data. In the simpler form, bytes that are certainly not part of the data patterns separate the patterns from the surrounding data. To demonstrate that the current implementation for finding interesting patterns can be strongly affected by embedding a pattern in a data context, we have created another set where the patterns are separated from the context by data that may very likely interfere with the pattern search. The counts of each pattern are listed in Table 5.1. The data can be found in the attached files. The generated files are partitioned both by individual pattern types and as aggregate files containing all generated patterns. The data were saved in the following formats: Ascii85, Base16, Base32, Base64, Base85, Base91, BinHex, raw binary data, quoted-printable, UUEncoding, Z85, Brotli, bzip2, gzip, zip, and tar archives compressed with the bzip2, gzip, and xz algorithms.
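The simpler generation scheme can be sketched as follows; the function embed is a hypothetical illustration, not the actual generator used for the dataset. The separator is chosen so that it cannot occur inside the embedded pattern.

```python
import os

def embed(pattern: bytes, separator: bytes = b"\n", noise_len: int = 16) -> bytes:
    """Wrap a pattern in separator bytes with pseudo-random noise around it."""
    return os.urandom(noise_len) + separator + pattern + separator + os.urandom(noise_len)

sample = embed(b"d41d8cd98f00b204e9800998ecf8427e")
```

The “ambiguous” set differs only in that the surrounding bytes are drawn from the pattern's own alphabet, so the separator no longer delimits the match cleanly.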


Table 5.1: Generated pattern types overview

Pattern Type     Count  Description
ADA address      12     Cardano Byron and Shelley addresses
Base32           4      Base32-encoded data strings
Base64           4      Base64-encoded data strings
BCH address      11     Bitcoin Cash legacy and Bech32 mainnet addresses
BIP32 xkey       8      Extended private and public keys according to BIP32
BTC address      9      Bitcoin mainnet P2PKH, P2SH, and Bech32 addresses
BTC WIF          5      Bitcoin private keys in the wallet import format
Data URI         5      Data URI strings
DOI              5      DOI publication identifiers
DOT address      3      Polkadot mainnet addresses
E-mail address   5      E-mail addresses
E-mail message   1      A simple e-mail message with MIME headers
ETH address      5      Ethereum addresses
IPv4 address     7      IP version 4 addresses
IPv6 address     13     IP version 6 addresses
ISBN10           9      ISBN-10 publication identifiers
ISBN13           5      ISBN-13 publication identifiers
ISSN             3      ISSN publication identifiers
LINK address     5      Chainlink Ethereum-based addresses
LTC address      9      Litecoin legacy and Bech32 addresses
MAC address      3      MAC addresses in delimited hexadecimal values
Magnet URI       4      BitTorrent version 1 and version 2 magnet links
MD5              4      MD5 hash hex digests
PEM              2      Certificates in the PEM file format
SHA-1            5      SHA-1 hash hex digests
SHA-224          5      SHA-224 hash hex digests
SHA-256          5      SHA-256 hash hex digests
SHA-384          5      SHA-384 hash hex digests
SHA-512          5      SHA-512 hash hex digests
URL              3      URL links with HTTP and FTP schemes
URN              7      URN identifiers
USDT address     3      Tether Ethereum-based addresses
UUID             5      Universally unique identifiers
XLM address      4      Stellar addresses
XRP address      5      Ripple classic addresses


5.1.2 Collected Data

The collected data comprise mainly network traffic captured using the Wireshark tool. The data are in the form of hexadecimal values of the extracted data, the raw data, and data bytes exported as a hex dump. In addition, a few test Bitcoin and Litecoin wallet files were collected as well.

5.2 Results

We used the tool to analyse binary data containing data patterns. Table C.2 summarises the numbers of unique patterns found for each type. The result data of the extracted patterns of interest can be found in the attached CSV and XLSX files. The CSV files are unmodified results retrieved from the input data analysis. The XLSX files contain a few additional tables to observe the statistics of the results.

Table 5.2: List of files with extracted results

Filename                         Description
generated.all.ambiguous.dp.xlsx  Results of artificially generated data with potential pattern distortion
generated.all.simple.dp.xlsx     Results of artificially generated data with unambiguous pattern embedding in the data surroundings
out.nemec.dp.xlsx                Collected network traffic data in hex stream format provided by RNDr. Lukáš Němec
outfile.split.nemec.dp.xlsx      Collected network traffic binary data provided by RNDr. Lukáš Němec. The input file was over 1.1 GB large, hence it was split into multiple parts.
wallet.cert.dp.xlsx              Collected test files from desktop Bitcoin and Litecoin wallets
smid.wireshark.dump.dp.xlsx      Collected network traffic data in hex dump format. The source was over 1 GB large, hence it was split into smaller files and processed separately.


5.2.1 Generated Data

The generated input files that were analysed are listed in Table C.1. A similar set of files named “ambiguous” instead of “simple” was processed to see the impact of the surroundings on the extraction. By simply comparing with Table 5.1, we can see that for multiple pattern types, more patterns were found than the number of patterns initially embedded. That is because different types of data patterns may overlap. This can be observed with cryptocurrency addresses for Bitcoin, Bitcoin Cash, Tether and Litecoin, i.e. currencies utilising Bitcoin legacy address formats. The same phenomenon can be observed for cryptocurrencies that are based on Ethereum and thus use the same address rules. Other instances of such overlap are IPv4 addresses contained in IPv6 addresses, or URI patterns, including URLs, URNs, data URIs and magnet links. However, since the extractors may process the patterns differently, a pattern recognised as the more general pattern type might not be recognised as the more specific pattern type. The results in Table C.2 are restricted to instances for which the applied decoding corresponds to the input file. The tool applies a few data checks that should help avoid unnecessary decoding, i.e. many of the encodings use different (though not necessarily disjoint) encoding alphabets or a data structure distinguishing them from other encodings. A successful decoding of one format may be a good indication that similar encodings need not be applied. On the other hand, such an assumption is generally not enough to run only the correct decoding algorithm and skip the others for small input data. We can see in Table C.1 that multiple “successful” decodings were applied to the files encoded in quoted-printable encoding and in UUEncoding. In those cases, the overview of the results can be found in the attached result files.
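The overlap effect can be demonstrated with two of the hash patterns: an unanchored MD5 regex (32 hex digits) also matches inside a 64-digit SHA-256 digest. The regex here is a simplified illustration, not the tool's actual expression.

```python
import re

# Unanchored 32-hex-digit pattern, as a naive MD5 matcher might use.
md5_re = re.compile(r"[0-9a-f]{32}")

# SHA-256 digest of the empty string: 64 hex digits.
sha256_digest = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# The first 32 hex digits of the SHA-256 digest match the MD5 pattern.
match = md5_re.search(sha256_digest)
```

Adding word boundaries or context checks suppresses such matches, but also risks missing digests embedded directly in other data, which is exactly the trade-off the ambiguous dataset exercises.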

5.2.2 Collected Data

The most significant sources of collected data are the results of network traffic capturing via the Wireshark tool. Two of the files were too large for the tool to process. Since some of the encodings follow a specific data format, the tool reads the input completely. Furthermore, the current pattern extractors are based on regular expressions, which can be resource-demanding for extensive data if the expression is not optimal. Hence, large input files were split into smaller parts and processed sequentially.

We could find the largest number of unique patterns in the file smid.wireshark.dump.txt (the individual parts can be found in the directory smid.wireshark.dump.txt-split). The file contains a capture of a short web session and data sending. A total of 43,846 unique patterns were found (excluding formats other than hexdump) with 92,379 total occurrences. JSON and URIs are the most common pattern types found in this file. A complete pattern type summary can be seen in Table C.6 and Table C.7 in Appendix C. The second-largest processed file, outfile.raw, gave 12,641 unique patterns with 338,463 total occurrences. The pattern summary is listed in Table C.4. IPv4 addresses and URIs (URLs) are the most frequent pattern types in this file. Further network-traffic data were captured in the file out.raw. A total of 793 unique patterns with 110,843 hits were found in the file. Most of the patterns are e-mail addresses and URLs. Not many unique patterns of those types occur in the data, but their occurrence is frequent. Only 12 different addresses are present, but their total count is 8,538. Similarly, only 29 different URLs were found, but their total count is 42,026. A deeper inspection of the extracted patterns hints that the source data are related to Apache Tomcat. The complete summary of the found patterns is listed in Table C.3. The last group of collected data comes from several test files created by desktop Bitcoin and Litecoin wallets. Most of the wallets were created unencrypted; otherwise, it might be impossible to find any interesting patterns without additional effort. However, one could still detect the presence of the wallet header bytes.
By inspecting the results, we can see that some of the wallets save data in JSON format. Some of them save the public keys, which can be used to derive information about addresses. Others save information about the created addresses directly. Wallets also contain information about private and public keys in different formats.

5.3 Evaluation Discussion

The MetExt tool successfully found many different patterns in the data sources. The results are provided in separate XLSX files with the extraction results and statistics of the analysis output. However, we should mention that many instances of the found patterns may be false positives. Although the instances may be syntactically correct, they are often not genuine instances of the identified patterns. For example, many of the Base64-encoded strings found in the data contain common English words, which is unlikely to happen by the standard encoding process. The case of Base32 may give even more inaccurate results since Base32-encoded strings comprise a smaller set of characters. Hence, we can witness instances of strings with a low variance of characters. Similarly, the extraction of URIs may not be accurate enough, since there is no selective extraction for each different URI scheme. In the case of selected cryptocurrencies, we observe that it is not easy to syntactically distinguish addresses of the same format used by different tokens. To give more precise results, one could check if there had been any transaction or any token possession on a blockchain. Although we can find and identify different types of data patterns, the results must be further verified by a person or other means to confirm a successful extraction of a pattern instance. The overview of the found pattern types and their frequency in the data sources is included in Appendix C.
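The Base64 false-positive effect can be demonstrated directly: any English word whose characters lie in the Base64 alphabet and whose length is divisible by four decodes without error, even though it was never Base64-encoded.

```python
import base64

# "research" uses only Base64-alphabet characters and is 8 characters
# long, so strict decoding succeeds -- a plausible false positive.
decoded = base64.b64decode("research", validate=True)
```

The decoded bytes are meaningless binary data, which is why syntactic validity alone cannot confirm that a string was genuinely Base64-encoded.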

6 Conclusion and Future Work

The thesis aimed to create an easily extensible, modular tool in the Python programming language to process data in different encodings and formats and to automate the detection and extraction of meaningful data patterns. Therefore, it was necessary to become familiar with existing data encodings and data formats. Standard data encodings commonly used for representing digital binary data in text form were selected for the tool. These encodings were chosen because of their widespread use, especially in e-mail correspondence and web applications. A more detailed description and the use of these encodings were given in Chapter 2. Alternative existing tools with similar goals and functionality were also listed and described within this thesis. However, the created tool aimed to maintain the initial requirement of supporting Python version 3.5, which proved to be a minor drawback. Some of the tools and libraries used no longer support the required version in their current form, so some of their needed functionalities were backported for the tool. A different set of search patterns was implemented compared to those tools. In addition to common patterns such as IP addresses, MAC addresses or e-mail addresses, publication identifiers in the form of DOI or ISBN, as well as addresses and private and public keys for selected cryptocurrencies, were also included. A description of these was given in Chapter 3. The work also includes the output of the processing and analysis of the generated and acquired test data. The output shows that the tool is functional in its form and can extract the required data patterns from binary data. However, we have shown that although the tool can find the relevant patterns that are syntactically correct, in some circumstances, such as searching for Base64 and Base32 strings or URIs, it may detect invalid patterns that are not correct in a given context or in general.
At the same time, while running the tool over a large amount of input data, the current implementation proved to be memory-intensive. Thus, there remains room for refining the tool to reduce false positives and for optimising its use of system resources.


Despite these points that remain for future improvement, the points of the thesis brief have been met.

6.1 Future Work

A simple modular architecture was used for the tool, focusing on (but not limited to) data decoding and data pattern extraction. Other similar projects with more complex modular architectures were presented in this thesis. Thus, more extensive use of internal state retention or chaining of operations, for example, seems to be one of the points for future improvement of the tool. Although a similar result can be achieved by proper sequencing of the called operations, chaining operations over an intermediate state seems to be a more friendly solution. The tool was developed with the simple assumption that it would not be used on multi-encoded data. Therefore, in cases where the input is encoded multiple times in one or more ways, the tool would not reach the desired result from which the searched patterns could be extracted. Thus, the possibility of modifying the tool to process data that has been encoded multiple times can be added. Based on running the tool over the data, it was apparent that not all results met expectations. Overlaps could be observed for some pattern types. For instance, subtypes of URIs or cryptocurrencies using the same address format were not sufficiently distinguished from each other. This step is left to the user for manual inspection. For cryptocurrencies, there is the possibility of using other tools that would be able to verify the existence of transactions or the ownership of an amount on the blockchain. In the case of URIs with different schemes, scheme-specific strategies can be developed further. The last point to mention is the processing of large data. The tool could not process files as large as 1 GB due to high memory consumption, and such files needed to be split into smaller parts. Hence, memory optimisation may be required to process large data.

A Encoding Mapping Tables

Table A.1: The Base16 alphabet

Value Encode   Value Encode   Value Encode   Value Encode
0     0        1     1        2     2        3     3
4     4        5     5        6     6        7     7
8     8        9     9        10    A        11    B
12    C        13    D        14    E        15    F

Table A.2: The Base32 alphabet

Value Encode   Value Encode   Value Encode   Value Encode
0     A        1     B        2     C        3     D
4     E        5     F        6     G        7     H
8     I        9     J        10    K        11    L
12    M        13    N        14    O        15    P
16    Q        17    R        18    S        19    T
20    U        21    V        22    W        23    X
24    Y        25    Z        26    2        27    3
28    4        29    5        30    6        31    7
pad   =

Table A.3: The “Extended Hex” Base32 alphabet

Value Encode   Value Encode   Value Encode   Value Encode
0     0        1     1        2     2        3     3
4     4        5     5        6     6        7     7
8     8        9     9        10    A        11    B
12    C        13    D        14    E        15    F
16    G        17    H        18    I        19    J
20    K        21    L        22    M        23    N
24    O        25    P        26    Q        27    R
28    S        29    T        30    U        31    V
pad   =


Table A.4: The “Geohash” Base32 alphabet

Value Encode   Value Encode   Value Encode   Value Encode
0     0        1     1        2     2        3     3
4     4        5     5        6     6        7     7
8     8        9     9        10    b        11    c
12    d        13    e        14    f        15    g
16    h        17    j        18    k        19    m
20    n        21    p        22    q        23    r
24    s        25    t        26    u        27    v
28    w        29    x        30    y        31    z

Table A.5: The “Crockford’s” Base32 alphabet

Value  Encode  Decode
0      0       0 o O
1      1       1 i I l L
2      2       2
3      3       3
4      4       4
5      5       5
6      6       6
7      7       7
8      8       8
9      9       9
10     A       a A
11     B       b B
12     C       c C
13     D       d D
14     E       e E
15     F       f F
16     G       g G
17     H       h H
18     J       j J
19     K       k K
20     M       m M
21     N       n N
22     P       p P
23     Q       q Q
24     R       r R
25     S       s S
26     T       t T
27     V       v V
28     W       w W
29     X       x X
30     Y       y Y
31     Z       z Z


Table A.6: The “Bitcoin” Base58 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     1          1     2          2     3          3     4
4     5          5     6          6     7          7     8
8     9          9     A          10    B          11    C
12    D          13    E          14    F          15    G
16    H          17    J          18    K          19    L
20    M          21    N          22    P          23    Q
24    R          25    S          26    T          27    U
28    V          29    W          30    X          31    Y
32    Z          33    a          34    b          35    c
36    d          37    e          38    f          39    g
40    h          41    i          42    j          43    k
44    m          45    n          46    o          47    p
48    q          49    r          50    s          51    t
52    u          53    v          54    w          55    x
56    y          57    z

Table A.7: The “Ripple” Base58 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     r          1     p          2     s          3     h
4     n          5     a          6     f          7     3
8     9          9     w          10    B          11    U
12    D          13    N          14    E          15    G
16    H          17    J          18    K          19    L
20    M          21    4          22    P          23    Q
24    R          25    S          26    T          27    7
28    V          29    W          30    X          31    Y
32    Z          33    2          34    b          35    c
36    d          37    e          38    C          39    g
40    6          41    5          42    j          43    k
44    m          45    8          46    o          47    F
48    q          49    i          50    1          51    t
52    u          53    v          54    A          55    x
56    y          57    z


Table A.8: The “Flickr” Base58 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     1          1     2          2     3          3     4
4     5          5     6          6     7          7     8
8     9          9     a          10    b          11    c
12    d          13    e          14    f          15    g
16    h          17    i          18    j          19    k
20    m          21    n          22    o          23    p
24    q          25    r          26    s          27    t
28    u          29    v          30    w          31    x
32    y          33    z          34    A          35    B
36    C          37    D          38    E          39    F
40    G          41    H          42    J          43    K
44    L          45    M          46    N          47    P
48    Q          49    R          50    S          51    T
52    U          53    V          54    W          55    X
56    Y          57    Z

Table A.9: The Base64 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     A          1     B          2     C          3     D
4     E          5     F          6     G          7     H
8     I          9     J          10    K          11    L
12    M          13    N          14    O          15    P
16    Q          17    R          18    S          19    T
20    U          21    V          22    W          23    X
24    Y          25    Z          26    a          27    b
28    c          29    d          30    e          31    f
32    g          33    h          34    i          35    j
36    k          37    l          38    m          39    n
40    o          41    p          42    q          43    r
44    s          45    t          46    u          47    v
48    w          49    x          50    y          51    z
52    0          53    1          54    2          55    3
56    4          57    5          58    6          59    7
60    8          61    9          62    +          63    /
pad   =


Table A.10: The “URL safe” Base64 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     A          1     B          2     C          3     D
4     E          5     F          6     G          7     H
8     I          9     J          10    K          11    L
12    M          13    N          14    O          15    P
16    Q          17    R          18    S          19    T
20    U          21    V          22    W          23    X
24    Y          25    Z          26    a          27    b
28    c          29    d          30    e          31    f
32    g          33    h          34    i          35    j
36    k          37    l          38    m          39    n
40    o          41    p          42    q          43    r
44    s          45    t          46    u          47    v
48    w          49    x          50    y          51    z
52    0          53    1          54    2          55    3
56    4          57    5          58    6          59    7
60    8          61    9          62    -          63    _
pad   =


Table A.11: The “Adobe” Ascii85 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     !          1     "          2     #          3     $
4     %          5     &          6     '          7     (
8     )          9     *          10    +          11    ,
12    -          13    .          14    /          15    0
16    1          17    2          18    3          19    4
20    5          21    6          22    7          23    8
24    9          25    :          26    ;          27    <
28    =          29    >          30    ?          31    @
32    A          33    B          34    C          35    D
36    E          37    F          38    G          39    H
40    I          41    J          42    K          43    L
44    M          45    N          46    O          47    P
48    Q          49    R          50    S          51    T
52    U          53    V          54    W          55    X
56    Y          57    Z          58    [          59    \
60    ]          61    ^          62    _          63    `
64    a          65    b          66    c          67    d
68    e          69    f          70    g          71    h
72    i          73    j          74    k          75    l
76    m          77    n          78    o          79    p
80    q          81    r          82    s          83    t
84    u
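The mapping in Table A.11 is arithmetic: the value v (0–84) encodes as chr(33 + v), i.e. the printable ASCII characters “!” (33) through “u” (117). A quick check:

```python
# Adobe Ascii85 alphabet: value v maps to the ASCII character 33 + v.
ascii85_alphabet = "".join(chr(33 + value) for value in range(85))
assert ascii85_alphabet[0] == "!" and ascii85_alphabet[84] == "u"
```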


Table A.12: The “RFC 1924” Base85 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     0          1     1          2     2          3     3
4     4          5     5          6     6          7     7
8     8          9     9          10    A          11    B
12    C          13    D          14    E          15    F
16    G          17    H          18    I          19    J
20    K          21    L          22    M          23    N
24    O          25    P          26    Q          27    R
28    S          29    T          30    U          31    V
32    W          33    X          34    Y          35    Z
36    a          37    b          38    c          39    d
40    e          41    f          42    g          43    h
44    i          45    j          46    k          47    l
48    m          49    n          50    o          51    p
52    q          53    r          54    s          55    t
56    u          57    v          58    w          59    x
60    y          61    z          62    !          63    #
64    $          65    %          66    &          67    (
68    )          69    *          70    +          71    -
72    ;          73    <          74    =          75    >
76    ?          77    @          78    ^          79    _
80    `          81    {          82    |          83    }
84    ~


Table A.13: The “Z85” Base85 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     0          1     1          2     2          3     3
4     4          5     5          6     6          7     7
8     8          9     9          10    a          11    b
12    c          13    d          14    e          15    f
16    g          17    h          18    i          19    j
20    k          21    l          22    m          23    n
24    o          25    p          26    q          27    r
28    s          29    t          30    u          31    v
32    w          33    x          34    y          35    z
36    A          37    B          38    C          39    D
40    E          41    F          42    G          43    H
44    I          45    J          46    K          47    L
48    M          49    N          50    O          51    P
52    Q          53    R          54    S          55    T
56    U          57    V          58    W          59    X
60    Y          61    Z          62    .          63    -
64    :          65    +          66    =          67    ^
68    !          69    /          70    *          71    ?
72    &          73    <          74    >          75    (
76    )          77    [          78    ]          79    {
80    }          81    @          82    %          83    $
84    #


Table A.14: The Base91 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     A          1     B          2     C          3     D
4     E          5     F          6     G          7     H
8     I          9     J          10    K          11    L
12    M          13    N          14    O          15    P
16    Q          17    R          18    S          19    T
20    U          21    V          22    W          23    X
24    Y          25    Z          26    a          27    b
28    c          29    d          30    e          31    f
32    g          33    h          34    i          35    j
36    k          37    l          38    m          39    n
40    o          41    p          42    q          43    r
44    s          45    t          46    u          47    v
48    w          49    x          50    y          51    z
52    0          53    1          54    2          55    3
56    4          57    5          58    6          59    7
60    8          61    9          62    !          63    #
64    $          65    %          66    &          67    (
68    )          69    *          70    +          71    ,
72    .          73    /          74    :          75    ;
76    <          77    =          78    >          79    ?
80    @          81    [          82    ]          83    ^
84    _          85    `          86    {          87    |
88    }          89    ~          90    "

Table A.15: The Bech32 encoding alphabet

Value Encode   Value Encode   Value Encode   Value Encode
0     q        1     p        2     z        3     r
4     y        5     9        6     x        7     8
8     g        9     f        10    2        11    t
12    v        13    d        14    w        15    0
16    s        17    3        18    j        19    n
20    5        21    4        22    k        23    h
24    c        25    e        26    6        27    m
28    u        29    a        30    7        31    l
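Table A.15 can be captured as a single lookup string, following the Bech32 character set from BIP 173, in which the character for value v is CHARSET[v]:

```python
# Bech32 character set (BIP 173): index = value, character = encoding.
BECH32_CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
assert len(BECH32_CHARSET) == 32
```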


Table A.16: The “BinHex 4.0” Base64 alphabet

Value Encoding   Value Encoding   Value Encoding   Value Encoding
0     !          1     "          2     #          3     $
4     %          5     &          6     '          7     (
8     )          9     *          10    +          11    ,
12    -          13    0          14    1          15    2
16    3          17    4          18    5          19    6
20    8          21    9          22    @          23    A
24    B          25    C          26    D          27    E
28    F          29    G          30    H          31    I
32    J          33    K          34    L          35    M
36    N          37    P          38    Q          39    R
40    S          41    T          42    U          43    V
44    X          45    Y          46    Z          47    [
48    `          49    a          50    b          51    c
52    d          53    e          54    f          55    h
56    i          57    j          58    k          59    l
60    m          61    p          62    q          63    r

B Magnet Links Parameters

Table B.1: Overview of supported Magnet URI parameters

Parameter  Name               Description
xt         Exact Topic        [Required] Network-specific hash value of the data content
dn         Display Name       A filename to conveniently display to the user
kt         Keyword Topic      Keywords to search for in a P2P network
mt         Manifest Topic     The URI to a metafile; it can be a URN
xl         Exact Length       The size in bytes
ws         Web Seed           The payload data served over HTTP(S)
as         Acceptable Source  A URL to a direct download from a web server, regarded as a fall-back source
xs         Exact Source       A URL to download the pointed source from
tr         Address Tracker    A URL of the tracker
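A magnet link is an ordinary URI query string built from these parameters. The sketch below assembles a minimal BitTorrent link; the info hash is a placeholder, not a real torrent.

```python
from urllib.parse import quote

# Placeholder 40-hex-digit BTIH info hash (not a real torrent).
info_hash = "0123456789abcdef0123456789abcdef01234567"

magnet = (
    "magnet:?xt=urn:btih:" + info_hash
    + "&dn=" + quote("example file.txt")           # display name, percent-encoded
    + "&tr=" + quote("http://tracker.example/announce", safe="")
)
```

Multiple tr parameters may be repeated to list several trackers.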


Table B.2: Examples of the xt parameter in P2P networks

Name                          xt Format                    Description
Tiger Tree Hash (TTH)         urn:tree:tiger:<hash>        Direct Connect and G2 networks
Secure Hash Algorithm (SHA)   urn:sha1:<hash>              Gnutella and G2 networks
BitPrint                      urn:bitprint:<sha1>.<tth>
ED2K                          urn:ed2k:<hash>              eDonkey2000 network
BitTorrent (BTIH)             urn:btih:<hash>              Hex-encoded SHA-1 hash of the torrent info metadata; clients should also support the Base32 representation
BitTorrent (BTMH)             urn:btmh:<hash>              Multihash-formatted SHA-256 hash of the torrent info metadata
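A pattern extractor can use these URN prefixes to classify an xt value before validating the hash itself. A minimal sketch; the regular expressions are my own illustrative approximations of the usual textual forms above, not the thesis tool's actual patterns:

```python
import re
from typing import Optional

# Illustrative patterns (assumptions): btih is a 40-char hex digest or its
# 32-char Base32 form, ed2k a 32-char hex digest, Gnutella's sha1 a 32-char
# Base32 string.
XT_PATTERNS = {
    "btih": re.compile(r"^urn:btih:(?:[0-9a-fA-F]{40}|[A-Z2-7]{32})$"),
    "ed2k": re.compile(r"^urn:ed2k:[0-9a-fA-F]{32}$"),
    "sha1": re.compile(r"^urn:sha1:[A-Z2-7]{32}$"),
}

def classify_xt(xt: str) -> Optional[str]:
    """Return the name of the first matching xt format, or None."""
    for kind, pattern in XT_PATTERNS.items():
        if pattern.match(xt):
            return kind
    return None
```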

C Pattern Extraction Results

Table C.1: Processed generated files with applied decoding types

Source                                              Decoding
...\input\generated\simple\all\data.ascii85         ascii85
...\input\generated\simple\all\data.base16          hex
...\input\generated\simple\all\data.base32          base32
...\input\generated\simple\all\data.base64          base64
...\input\generated\simple\all\data.base85          base85
...\input\generated\simple\all\data.base91          base91
...\input\generated\simple\all\data.binhex          binhex
...\input\generated\simple\all\data.dat             raw
...\input\generated\simple\all\data.dat.bz2         bzip2+raw
...\input\generated\simple\all\data.dat.gz          gzip+raw
...\input\generated\simple\all\data.dat.tar         tar+raw
...\input\generated\simple\all\data.dat.tar.bz2     bzip2+tar+raw
...\input\generated\simple\all\data.dat.tar.gz      gzip+tar+raw
...\input\generated\simple\all\data.dat.tar.xz      xz+tar+raw
...\input\generated\simple\all\data.dat.zip         zip+raw
...\input\generated\simple\all\data.quopri          percent, quopri, raw
...\input\generated\simple\all\data.uu              percent, quopri, raw, uu
...\input\generated\simple\all\data.z85             z85
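A chained decoding such as bzip2+tar+raw can be reproduced with the standard library: peel the compression layer, then extract the archive member as raw bytes. A sketch (the helper name is mine, not the thesis tool's API):

```python
import bz2
import io
import tarfile

def peel_bz2_tar(blob: bytes) -> bytes:
    """Undo a bzip2+tar chain: decompress, then return the first member's raw bytes."""
    tar_bytes = bz2.decompress(blob)
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        first = tar.getmembers()[0]
        return tar.extractfile(first).read()
```

The bytes returned here are what the raw decoding stage would then hand to the pattern extractors.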

Table C.2: Result summary of the “simple” and “ambiguous” datasets

Pattern type   Unique patterns (simple)   Unique patterns (ambiguous)
ada                    12                          2
base32                  8                          6
base64                 11                         16
bch                    18                          3
bip32-xkey              8                          0
btc                    14                          3
btc-wif                 5                          0
data_uri                5                          2
doi                     5                          1
dot                     3                          0
email                   6                          2
eth                    13                          1
chainlink              13                          1
ipv4                   10                          4
ipv6                   11                          6
isbn                    8                          3
issn                    4                          3
ltc                    14                          0
mac                     3                          3
magnet                  4                          2
md5                     6                          1
pem                     2                          2
sha1                    8                          2
sha224                  5                          0
sha256                  5                          1
sha384                  5                          1
sha512                  5                          1
uri                    27                         14
url                     4                          4
urn                    14                          9
usdt                   24                          2
uuid                    6                          0
xrp                     5                          0

Table C.3: Pattern type extraction summary from file out.nemec.dp.xlsx

Format: hex

Pattern type   Unique     Total
base32              3         4
base64             19       128
email              12     8,538
ipv4              118     4,467
json              105       628
mac                 5        15
md5               286     1,204
sha1               14       422
sha224             10       270
sha256             10       298
sha384              6       186
uri               104    52,386
url                29    42,026
uuid               72       271
Totals            793   110,843


Table C.4: Pattern type extraction total summary from file outfile.split.nemec.dp.xlsx

Format: raw

Pattern type   Unique     Total
base32            471     4,784
base64          1,901    27,697
email             240    11,320
ipv4              777   122,787
ipv6               19       263
isbn               18       225
json              181     1,437
mac                23        97
md5             1,455     4,292
pem                 5         7
sha1            1,486     2,264
sha224             91       560
sha256             92       585
sha384             71       371
sha512              4         4
uri             4,641    90,251
url               742    69,309
urn                49       429
uuid              375     1,781
Totals         12,641   338,463


Table C.5: Pattern type extraction summary from file wallet.cert.dp.xlsx

Counts are given per source and format as "pattern type unique/total".

...\btc_wallet_armory\armorylog.txt
    quopri: ipv4 1/6
    raw: ipv4 1/6
...\btc_wallet_armory\ArmorySettings.txt
    raw: md5 1/1, sha224 1/1
...\btc_wallet_armory\multipliers.txt
    raw: sha1 400/400, sha256 400/400
...\btc_wallet_bitcoin-core\wallet1\wallet.dat
    raw: bch 3/10, btc 6/21, ltc 3/10, usdt 3/10
...\btc_wallet_electrum\wallet1
    raw: btc 31/104, json 1/1
...\btc_wallet_wasabi\Wallet0.json
    raw: base64 1/1, bip32-xkey 1/1, json 1/1
...\btc_wallet_wasabi\Wallet1.json
    raw: base64 1/1, bip32-xkey 1/1, json 1/1
...\btc_wallet_wasabi\Wallet2.json
    raw: base64 1/1, bip32-xkey 1/1, json 1/1
...\certs_btc_electrum\bitcoin.aranguren.org
    raw: base64 1/1, pem 1/1
...\certs_btc_electrum\btc.cihar.com
    pem: uri 3/3, url 3/3
...\ltc_wallet_electrum\wallet1
    raw: bip32-xkey 2/2, json 1/1, ltc 30/60
...\ltc_wallet_electrum\wallet2
    raw: base64 2/2, bip32-xkey 1/1, json 1/1
...\ltc_wallet_litecoin-core\wallet.dat
    raw: ltc 3/11
...\wallet_coinomi\wallets.config
    raw: sha256 1/1
...\wallet_guarda\2021-05-07-13-38-guarda-backup.txt
    raw: base64 1/1


Table C.6: Pattern type extraction summary from file smid.wireshark.dump.dp.xlsx (Part 1)

Format: hexdump

Pattern type   Unique     Total
ada               168       215
base32              9         9
base64          2,485     3,632
bch               544     1,223
btc               597     1,358
data_uri          144       149
email              36        51
eth                34        81
chainlink          34        81
ipv4               98       335
isbn               31       175
json           13,154    20,021
ltc               241       540
mac                 5         5
magnet             36        38
md5             1,311     2,617
pem                 6         6
sha1              253       473
sha224              3         3
sha256          2,412     3,380
sha384              3         3
sha512              6         6
uri            15,891    29,068
url             4,633    25,165
usdt              576     1,302
uuid            1,136     2,443
Subtotals      43,846    92,379

Totals         49,767   104,620


Table C.7: Pattern type extraction summary from file smid.wireshark.dump.dp.xlsx (Part 2)

Format: percent

Pattern type   Unique     Total
email               9        13
ipv4               35        77
isbn               13        29
json              624     1,005
mac                 3         3
uri             1,164     2,704
url               128       283
Subtotals       1,976     4,114

Format: quopri

Pattern type   Unique     Total
email               9        13
ipv4               35        77
isbn               13        29
json              624     1,005
mac                 3         3
uri             1,160     2,653
url               128       283
Subtotals       1,972     4,063

Format: raw

Pattern type   Unique     Total
email               9        13
ipv4               35        77
isbn               13        29
json              624     1,005
mac                 3         3
uri             1,161     2,654
url               128       283
Subtotals       1,973     4,064

Totals         49,767   104,620


D Code extracts

Listing D.1: Analyse function signature

def analyse(
    _input: Union[FileInputExtended, BytesIO, StringIO, str, bytes, List[str]],
    decoders: List[Tuple[str, dict]] = None,
    extractors: List[Tuple[str, dict]] = None,
) -> List[dict]: ...

Listing D.2: Analysis output data skeleton

[
    {
        "name": "",
        "formats": {
            "": {
                "patterns": {
                    "": [
                        {
                            "value": "",
                            "": ""
                        }
                    ]
                }
            }
        }
    }
]


Listing D.3: Signature of the regex-based extractor

def _extract_with_regex(
    _input,
    regex,
    validator=None,         # callable validator
    per_line=True,          # if True, then search for pattern per line
    preprocess=None,        # callable to apply on each part before search
    postprocess=None,       # callable to apply on found pattern
    cached_values=None,     # collection of cached found values
    data_kind=None,         # the type of searched pattern
    include_original=True,
    include_contexts=True,  # include pattern surroundings
    context_length=30,      # fixed data surrounding length
): ...

Listing D.4: Regex-based extractor output item example

{
    "frequency": 1,       # number of occurrences
    "positions": [2685],  # positions in the decoded data
    "contexts": [
        "I:IlO6 ),n]:q'htzbt#F\tEa3+sW/ >>>>value<<<<\n =5?d?t}0d`Ks..4Q3vyx/LhT&9b(z"
    ],                    # occurrences within their unique surroundings
    "value": "1AGNa15ZQXAZUgFiqJ2i7Z2DPU2J6hW62i",     # pattern
    "value_kind": "btc",                               # pattern type
    "original": "1AGNa15ZQXAZUgFiqJ2i7Z2DPU2J6hW62i",  # included if postprocess changed the value
}
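A self-contained toy that mimics the shape of this output item may clarify the mechanism; the regular expression and helper below are illustrative assumptions of mine, not the thesis code:

```python
import re

# A simplistic pattern for legacy Base58 Bitcoin addresses (illustrative only).
BTC_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")

def extract_btc(text: str, context_length: int = 30) -> list:
    """Collect matches into items shaped like the extractor output above."""
    items = {}
    for m in BTC_RE.finditer(text):
        item = items.setdefault(m.group(), {
            "frequency": 0,
            "positions": [],
            "contexts": [],
            "value": m.group(),
            "value_kind": "btc",
        })
        item["frequency"] += 1                       # count every occurrence
        item["positions"].append(m.start())          # position in the data
        start = max(0, m.start() - context_length)
        item["contexts"].append(text[start:m.end() + context_length])
    return list(items.values())
```

Each distinct value yields one item, with repeated occurrences accumulated into frequency, positions, and contexts, mirroring the structure of Listing D.4.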

Bibliography

1. HORTON, Mark. uuencode [UNIX Programmer's Manual] [online]. 1980 [visited on 2021-02-25]. Available from: https://www.tuhs.org/cgi-bin/utree.pl?file=4BSD/usr/man/cat1/uuencode.1c.
2. CERF, Vint. ASCII format for network interchange [online] [visited on 2021-03-13]. Available from: https://tools.ietf.org/html/rfc20.
3. HORTON, Mark. uuencode [The Open Group Base Specifications Issue 7, 2018 edition] [online] [visited on 2021-03-19]. Available from: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html.
4. ALBERTSON, Kevin. Base-122 Encoding [online]. 2016-11-26 [visited on 2021-03-15]. Available from: http://blog.kevinalbs.com/base122.
5. TURNER, Keith. keith-turner/ecoji [online]. 2021 [visited on 2021-03-13]. Available from: https://github.com/keith-turner/ecoji.
6. QNTM. base2048 [online]. 2021 [visited on 2021-03-13]. Available from: https://github.com/qntm/base2048.
7. QNTM. base32768 [online]. 2021 [visited on 2021-03-13]. Available from: https://github.com/qntm/base32768.
8. JOSEFSSON, Simon. The Base16, Base32, and Base64 Data Encodings [online]. 2006 [visited on 2021-03-13]. Available from: https://tools.ietf.org/html/rfc4648.
9. BENET, Juan et al. multiformats/cid [online]. Multiformats, 2021 [visited on 2021-03-15]. Available from: https://github.com/multiformats/cid.
10. MOCKAPETRIS, Paul V. Domain names - implementation and specification [online]. 1987 [visited on 2021-03-15]. Available from: https://tools.ietf.org/html/rfc1035.


11. CROCKFORD, Douglas. Base 32 [online]. 2019 [visited on 2021-03-14]. Available from: https://www.crockford.com/base32.html.
12. VENESS, Chris. Geohash [online] [visited on 2021-03-14]. Available from: https://www.movable-type.co.uk/scripts/geohash.html.
13. base58 Encodings [online]. XRPL.org, 2020 [visited on 2021-04-10]. Available from: https://xrpl.org/base58-encodings.html.
14. ELLIOTT-MCCREA, Kellan. manufacturing flic.kr style photo URLs [Flickr] [online]. 2009 [visited on 2021-04-10]. Available from: https://www.flickr.com/groups/51035612836@N01/discuss/72157616713786392/.
15. Base58Check encoding [Bitcoin Wiki] [online]. 2017 [visited on 2021-04-20]. Available from: https://en.bitcoin.it/wiki/Base58Check_encoding.
16. LINN, John. Privacy enhancement for Internet electronic mail: Part I - message encipherment and authentication procedures [online]. 1989 [visited on 2021-03-19]. Available from: https://tools.ietf.org/html/rfc1113.
17. LINN, John. Privacy Enhancement for Internet Electronic Mail: Part I: Message Encryption and Authentication Procedures [online]. 1993 [visited on 2021-03-09]. Available from: https://tools.ietf.org/html/rfc1421.
18. BORENSTEIN, Nathaniel S.; FREED, Ned. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies [online]. 1996 [visited on 2021-03-13]. Available from: https://tools.ietf.org/html/rfc2045.
19. MASINTER, Larry. The "data" URL scheme [online]. 1998 [visited on 2021-03-15]. Available from: https://tools.ietf.org/html/rfc2397.
20. RESCHKE, Julian. The 'Basic' HTTP Authentication Scheme [online]. 2015 [visited on 2021-03-15]. Available from: https://tools.ietf.org/html/rfc7617.


21. OROST, Joe. Encoding of binary data into mailable ASCII [online]. 1993-03-06 [visited on 2021-03-15]. Available from: https://groups.google.com/g/comp.compression/c/Ve7k8XF-F5k/m/gBWfpyL-gfgJ.
22. ADOBE SYSTEMS. PostScript language reference. 3rd ed. Reading, Mass: Addison-Wesley, 1999. ISBN 978-0-201-37922-8.
23. ELZ, Robert. A Compact Representation of IPv6 Addresses [online]. 1996 [visited on 2021-02-25]. Available from: https://tools.ietf.org/html/rfc1924.
24. HINTJENS, Pieter. 32/Z85 [online] [visited on 2021-03-01]. Available from: http://rfc.zeromq.org/spec/32/.
25. ADOBE SYSTEMS. Document management - Portable document format - Part 1: PDF 1.7. 2008. Available also from: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf.
26. HAMANO, Junio. [PATCH] binary patch. [online]. 2006-05-05 [visited on 2021-03-15]. Available from: https://web.archive.org/web/20121128201119/http://www.gelato.unsw.edu.au/archives/git/0605/19975.html.
27. MUNCKHOF, Coen van den. Z85e [online]. 2020 [visited on 2021-03-01]. Available from: https://github.com/coenm/Z85e.
28. BORDET, Simone. CometD-Z85 [online]. 2020 [visited on 2021-03-01]. Available from: https://github.com/cometd/cometd-Z85.
29. HE, Dake; HE, Wei. A Method of Digital Data Transformation-Base91. In: CHEN, Kefei (ed.). Progress on Cryptography: 25 Years of Cryptography in China [online]. Boston, MA: Springer US, 2004, pp. 229-234 [visited on 2021-02-25]. The International Series in Engineering and Computer Science. ISBN 978-1-4020-7987-0. Available from DOI: 10.1007/1-4020-7987-7_31.
30. FAIR, Erik; CROCKER, Dave; FALTSTROM, Patrik. MIME Content Type for BinHex Encoded Files [online]. 1994 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc1741.


31. LEWIS, Peter N. BinHex 4.0 Definition [online]. 1991 [visited on 2021-04-20]. Available from: https://files.stairways.com/other/binhex-40-specs-info.txt.
32. MASINTER, Larry; BERNERS-LEE, Tim; FIELDING, Roy T. Uniform Resource Identifier (URI): Generic Syntax [online]. 2005 [visited on 2021-03-14]. Available from: https://tools.ietf.org/html/rfc3986.
33. Percent-encoding [MDN Web Docs] [online] [visited on 2021-03-14]. Available from: https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding.
34. HELBING, Jürgen. yEncode - A quick and dirty encoding for binaries [online]. 2001 [visited on 2021-04-04]. Available from: https://web.archive.org/web/20190106020039/http://www.yenc.org/yEnc-draft-1.txt.
35. HELBING, Jürgen. yEncode - A quick and dirty encoding for binaries [online]. 2002 [visited on 2021-05-13]. Available from: https://web.archive.org/web/20201112015538/http://www.yenc.org/yenc-draft.1.3.txt.
36. WUILLE, Pieter; MAXWELL, Greg. BIP 0173: Base32 address format for native v0-16 witness outputs [Bitcoin Wiki] [online]. 2017-03-20 [visited on 2021-04-10]. Available from: https://en.bitcoin.it/wiki/BIP_0173.
37. CROCKER, David H. Standard for the Format of ARPA Internet Text Messages [online]. 1982 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc822.
38. RESNICK, Peter W. Internet Message Format [online]. 2008 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc5322.
39. BORENSTEIN, Nathaniel S.; FREED, Ned. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies [online]. 1996 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc2045.


40. BORENSTEIN, Nathaniel S.; FREED, Ned. Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types [online]. 1996 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc2046.
41. MOORE, Keith. MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text [online]. 1996 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc2047.
42. FREED, Ned. Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures [online]. 1996 [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc2048.
43. BORENSTEIN, Nathaniel S.; FREED, Ned. Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples [online] [visited on 2021-04-20]. Available from: https://tools.ietf.org/html/rfc2049.
44. Character Sets [online]. IANA, 2021 [visited on 2021-04-20]. Available from: https://www.iana.org/assignments/character-sets/character-sets.xhtml.
45. Media Types [online]. 2021-03-10 [visited on 2021-03-15]. Available from: https://www.iana.org/assignments/media-types/media-types.xhtml.
46. JOSEFSSON, Simon. Textual Encodings of PKIX, PKCS, and CMS Structures [online] [visited on 2021-03-02]. Available from: https://tools.ietf.org/html/rfc7468.
47. PRENEEL, Bart. Cryptographic hash functions. European Transactions on Telecommunications [online]. 1994, vol. 5, no. 4, pp. 431-448 [visited on 2021-04-27]. ISSN 1541-8251. Available from DOI: 10.1002/ett.4460050406.
48. AL-KUWARI, Saif; DAVENPORT, James H.; BRADFORD, Russell J. Cryptographic Hash Functions: Recent Design Trends and Security Notions [online]. 2011 [visited on 2021-05-01]. Report 2011/565. Available from: http://eprint.iacr.org/2011/565.


49. RIVEST, Ronald. The MD5 Message-Digest Algorithm [online]. 1992 [visited on 2021-04-10]. Available from: https://tools.ietf.org/html/rfc1321.
50. TURNER, Sean; CHEN, Lily. Updated Security Considerations for the MD5 Message-Digest and the HMAC-MD5 Algorithms [online]. 2011 [visited on 2021-04-10]. Available from: https://tools.ietf.org/html/rfc6151.
51. Secure Hash Standard (SHS) [online]. 2015-07 [visited on 2021-04-18]. NIST FIPS 180-4. National Institute of Standards and Technology. Available from DOI: 10.6028/NIST.FIPS.180-4.
52. SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions [online]. 2015-07 [visited on 2021-04-18]. NIST FIPS 202. National Institute of Standards and Technology. Available from DOI: 10.6028/NIST.FIPS.202.
53. KLENSIN, John; SAINT-ANDRE, Peter. Uniform Resource Names (URNs) [online]. 2017 [visited on 2021-03-22]. Available from: https://tools.ietf.org/html/rfc8141.
54. Uniform Resource Names (URN) Namespaces [online]. 2021-03-01 [visited on 2021-03-22]. Available from: https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml.
55. ISO. Information and documentation — International Standard Book Number (ISBN) [ISO] [online] [visited on 2021-03-19]. Available from: https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/54/65483.html.
56. What is an ISSN? [ISSN] [online]. 2021 [visited on 2021-03-22]. Available from: https://www.issn.org/understanding-the-issn/what-is-an-issn/.
57. ISO. Information and documentation — International standard serial number (ISSN) [ISO] [online]. 2020 [visited on 2021-03-22]. Available from: https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/38/73846.html.
58. 1 Introduction [DOI® Handbook] [online]. 2015 [visited on 2021-03-22]. Available from: https://www.doi.org/doi_handbook/1_Introduction.html.


59. 2 Numbering [DOI® Handbook] [online]. 2015 [visited on 2021-03-22]. Available from: https://www.doi.org/doi_handbook/2_Numbering.html.
60. POSTEL, Jon. Internet Protocol [online]. 1981 [visited on 2021-05-01]. Available from: https://tools.ietf.org/html/rfc791.
61. IANA — Number Resources [online]. 2021 [visited on 2021-05-01]. Available from: https://www.iana.org/numbers.
62. BONICA, Ronald; VEGODA, Leo; COTTON, Michelle; HABERMAN, Brian. Special-Purpose IP Address Registries [online]. 2013 [visited on 2021-05-01]. Available from: https://tools.ietf.org/html/rfc6890.
63. Guidelines for Use of Extended Unique Identifier (EUI), Organizationally Unique Identifier (OUI), and Company ID (CID) [online]. IEEE Standards Association, 2017 [visited on 2021-04-25]. Available from: https://standards.ieee.org/content/dam/ieee-standards/standards/web/documents/tutorials/eui.pdf.
64. IEEE SA - MA-L [online] [visited on 2021-04-26]. Available from: https://standards.ieee.org/products-services/regauth/oui/index.html.
65. IEEE SA - MA-M [online] [visited on 2021-04-26]. Available from: https://standards.ieee.org/products-services/regauth/oui28/index.html.
66. IEEE SA - MA-S [online] [visited on 2021-04-26]. Available from: https://standards.ieee.org/products-services/regauth/oui36/index.html.
67. IEEE SA - Registration Authority [online]. 2021 [visited on 2021-04-25]. Available from: https://standards.ieee.org/products-services/regauth/index.html.
68. Cryptocurrency Prices, Charts And Market Capitalizations [online]. CoinMarketCap, 2021 [visited on 2021-05-10]. Available from: https://coinmarketcap.com/.
69. Cryptocurrency Prices and Market Capitalization [online]. CoinGecko, 2021 [visited on 2021-05-10]. Available from: https://www.coingecko.com/.


70. Cryptocurrency Prices and Market Capitalization [online]. CoinGecko, 2020 [visited on 2020-11-28]. Available from: https://www.coingecko.com/.
71. NAKAMOTO, Satoshi. Bitcoin: A Peer-to-Peer Electronic Cash System [online]. 2008. Available from: http://bitcoin.org/bitcoin.pdf.
72. ANTONOPOULOS, Andreas M. Chapter 4. Keys, Addresses. In: Mastering Bitcoin: programming the open blockchain. Second edition. Sebastopol, CA: O'Reilly, 2017. ISBN 978-1-4919-5438-6.
73. Secp256k1 [Bitcoin Wiki] [online]. 2019 [visited on 2021-05-01]. Available from: https://en.bitcoin.it/wiki/Secp256k1.
74. Wallet import format [Bitcoin Wiki] [online]. 2017 [visited on 2021-05-10]. Available from: https://en.bitcoin.it/wiki/Wallet_import_format.
75. ANDRESEN, Gavin. Address Format for pay-to-script-hash [BIP 0013] [online]. 2011 [visited on 2021-05-01]. Available from: https://en.bitcoin.it/wiki/BIP_0013.
76. BUTERIN, Vitalik. A Next-Generation Smart Contract and Decentralized Application Platform. 2013. Available also from: https://cryptorating.eu/whitepapers/Ethereum/Ethereum_white_paper.pdf.
77. BUTERIN, Vitalik et al. A Next-Generation Smart Contract and Decentralized Application Platform [ethereum.org] [online]. 2021-02-09 [visited on 2021-05-13]. Available from: https://ethereum.org/en/whitepaper/.
78. VOGELSTELLER, Fabian; BUTERIN, Vitalik. EIP-20: ERC-20 Token Standard [online]. 2015 [visited on 2021-05-01]. Available from: https://eips.ethereum.org/EIPS/eip-20.
79. ENTRIKEN, William; SHIRLEY, Dieter; EVANS, Jacob; SACHS, Nastassia. EIP-721: ERC-721 Non-Fungible Token Standard [online]. 2018 [visited on 2021-05-01]. Available from: https://eips.ethereum.org/EIPS/eip-721.
80. BUTERIN, Vitalik; VAN DE SANDE, Alex. EIP-55: Mixed-case checksum address encoding [online]. 2016 [visited on 2021-05-01]. Available from: https://eips.ethereum.org/EIPS/eip-55.


81. Tether: Fiat currencies on the Bitcoin blockchain [online]. tether.to, 2016 [visited on 2021-05-13]. Available from: https://tether.to/wp-content/uploads/2016/06/TetherWhitePaper.pdf.
82. CRAWLEY, Jamie. Tether Stablecoin Expands Its Reach to Another Blockchain [CoinDesk] [online]. 2021-03-11 [visited on 2021-05-02]. Available from: https://www.coindesk.com/tether-stablecoin-expands-its-reach-to-another-blockchain.
83. WOOD, Gavin. Polkadot: Vision for a Heterogeneous Multi-chain Framework [online]. 2016 [visited on 2021-05-13]. Available from: https://polkadot.network/PolkaDotPaper.pdf.
84. WOOD, Gavin. External Address Format (SS58) [online]. 2021 [visited on 2021-05-01]. Available from: https://github.com/paritytech/substrate/wiki/External-Address-Format-(SS58).
85. XRP Ledger [online] [visited on 2021-05-13]. Available from: https://xrpl.org/.
86. SCHWARTZ, David; YOUNGS, Noah; BRITTO, Arthur. The Ripple Protocol Consensus Algorithm [online]. 2014 [visited on 2021-05-13]. Available from arXiv: 1802.07242.
87. HOSKINSON, Charles. Why Are We Building Cardano [Why Cardano] [online]. 2017 [visited on 2021-05-13]. Available from: https://why.cardano.org/en/introduction/motivation/.
88. Litecoin Whitepaper [online]. litecoin.org, 2011 [visited on 2021-05-13]. Available from: https://whitepaper.io/document/683/litecoin-whitepaper.
89. SECHET, Amaury et al. Address format for Bitcoin Cash. 2017. Version 1.0. Available also from: https://github.com/bitcoincashorg/bitcoincash.org/blob/master/spec/cashaddr.md.
90. ELLIS, Steve; JUELS, Ari; NAZAROV, Sergey. ChainLink: A Decentralized Oracle Network [online]. 2017 [visited on 2021-05-13]. Available from: https://link.smartcontract.com/whitepaper.


91. BREIDENBACH, Lorenz; CACHIN, Christian; COVENTRY, Alex; ELLIS, Steve; JUELS, Ari; MILLER, Andrew; MAGAURAN, Brendan; NAZAROV, Sergey; TOPLICEANU, Alexandru; ZHANG, Fan; CHAN, Benedict; KOUSHANFAR, Farinaz; MOROZ, Daniel; TRAMER, Florian. Chainlink 2.0: Next Steps in the Evolution of Decentralized Oracle Networks. 2021. Available also from: https://research.chain.link/whitepaper-v2.pdf.
92. LINK Token Contracts [Chainlink Documentation] [online]. Chainlink [visited on 2021-05-01]. Available from: https://docs.chain.link/docs/link-token-contracts/.
93. MAZIÈRES, David. Stellar Consensus Protocol: A Federated Model for Internet-level Consensus [online]. Stellar Development Foundation, 2016 [visited on 2021-05-13]. Available from: https://stellar.org/papers/stellar-consensus-protocol?locale=en.
94. Federation protocol [SEP 0002] [online]. stellar.org, 2019. Version 1.1.0 [visited on 2021-05-01]. Available from: https://github.com/stellar/stellar-protocol/blob/master/ecosystem/sep-0002.md.
95. LAGADEC, Philippe. Balbuzard [online]. 2021 [visited on 2021-04-20]. Available from: https://github.com/decalage2/balbuzard.
96. ALVAREZ, Victor M. YARA [online]. VirusTotal, 2021 [visited on 2021-04-25]. Available from: https://github.com/VirusTotal/yara.
97. CyberChef [online]. GCHQ, 2021 [visited on 2021-04-25]. Available from: https://github.com/gchq/CyberChef.
98. Chepy [online]. securisec, 2021 [visited on 2021-04-25]. Available from: https://github.com/securisec/chepy.
99. Ciphey [online]. Ciphey, 2021 [visited on 2021-04-20]. Available from: https://github.com/Ciphey/Ciphey.
100. ASTANIN, Sergey. tabulate: Pretty-print tabular data [online]. 2021. Version 0.8.9 [visited on 2021-04-16]. Available from: https://github.com/astanin/python-tabulate.


101. ALAKUIJALA, Jyrki; SZABADKA, Zoltan. Brotli Compressed Data Format [online]. 2016 [visited on 2021-05-13]. ISSN 2070-1721. Available from: https://www.rfc-editor.org/info/rfc7932.
102. PKWARE INC. .ZIP File Format Specification [online]. 2020 [visited on 2021-05-14]. Available from: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT.
103. DEUTSCH, Peter. GZIP file format specification version 4.3 [online]. 1996 [visited on 2021-05-13]. Available from: https://datatracker.ietf.org/doc/html/rfc1952.
104. bzip2 [online] [visited on 2021-05-13]. Available from: http://sourceware.org/bzip2/index.html.
105. COLLIN, Lasse; PAVLOV, Igor. The .xz File Format [online]. 2009 [visited on 2021-05-13]. Available from: https://tukaani.org/xz/xz-file-format.txt.
106. PAVLOV, Igor. The .lzma File Format [online]. [N.d.] [visited on 2021-05-13]. Available from: https://svn.python.org/projects/external/xz-5.0.3/doc/lzma-file-format.txt.
107. Hypertext Transfer Protocol (HTTP) Parameters [online]. IANA [visited on 2021-05-13]. Available from: https://www.iana.org/assignments/http-parameters/http-parameters.xhtml.
108. Basic Tar Format [GNU tar 1.34: Basic Tar Format] [online]. 2021-03-24 [visited on 2021-05-13]. Available from: https://www.gnu.org/software/tar/manual/html_node/Standard.html.
