A Method for Automatic Identification of Signatures of Steganography Software 3
Total Page:16
File Type:pdf, Size:1020Kb
SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 1 A Method for Automatic Identification of Signatures of Steganography Software Graeme Bell* and Yeuan-Kuen Lee Abstract—A fully automated, blind, media-type agnostic approach to research attention in steganalysis has been directed towards until now. steganalysis is presented here. Steganography may sometimes be exposed Implementation-induced artefacts are however a valid and interesting by detecting automatically-characterised regularities in output media target for steganalytic research, as the primary goal of steganography caused by weak implementations of steganography algorithms. Fast and accurate detection of steganography is demonstrated experimentally here (covert communication) is defeated whenever steganographic and across a range of media types and a variety of steganography approaches. non-steganographic media can be easily distinguished by any means. As Westfeld states in [2], Index Terms—blind, media-type agnostic, steganalysis, steganography, “The goal is to modify the carrier in an imperceptible way WAT-STEG only, so that it reveals nothing - neither the embedding of a message nor the embedded message itself.” I. INTRODUCTION In the past, discovery of implementation artefacts during research TEGANOGRAPHY is the field of research that studies how has occurred infrequently and on an ad-hoc basis. This paper dis- secret data can be hidden in carrier media, without being cusses a new steganalytic method by which the attempted discovery detectableS either to normal human observation or programmatic of these implementation-induced peculiarities can be automated. The scrutiny. Steganalysis [1]–[5] is the field of research that (primarily) method is applied in two stages. First, some known-stego training seeks methods to detect steganographic media. Broadly speaking, examples produced by a steganography tool are presented to a training algorithm, along with a larger number of examples of non-stego steganalysis can be separated into targeted steganalysis, which aims 1 to break a particular known steganographic embedding scheme, files of the same media type . The training algorithm attempts to and blind steganalysis, which attempts to identify the existence of identify any characteristic regularities in the binary representation steganography without apriori knowledge of the specific algorithm of the known-stego files that are not present in the non-stego files. that was used. Universal steganalysis may be used to refer to those If regularities are found, then a signature is immediately produced methods that are potentially capable of identifying many forms of that may be used to help detect output from that steganography steganography through a single system. Here, we add the idea of tool, and by proxy, to help detect the existence of steganography. being media-type agnostic, whereby a method makes no assumptions Conversely, this means that this method will not operate at all against regarding steganography, steganography algorithms, media type (e.g. implementations of algorithms that do not produce characteristic pictures, music, text) or media data format (e.g. TIFF, JPG), making regularities in their output. Tool signatures, where they are found, it suitable in principle for use against any steganography approach. may be used by comparing the bits of a given file against the known The problem addressed here is that of blind, media-type agnostic bit values representing the regularities of the signature. This test steganalysis. It is well-established that it is very difficult but also very of binary data values against signature values may be carried out useful to develop techniques that work broadly and blindly against very quickly. Unfortunately, a match against a signature does not steganography, without knowledge of the algorithm being attacked, provide a guarantee that a particular tool has been used, nor does it the secret message or the original carrier media [1], [5], [6]. For guarantee that steganography is present, as sources of false positives example, Fridrich notes in [5] that: may exist. However, a match against a stego-signature can provide a useful indication that a particular tool may have been used and “... having a few candidate stego-objects without any addi- consequently an indication that the file may contain steganography. tional information (e.g. “JPEG in the wild”) is the hardest This property means that this method can effectively complement case for steganalysis.” existing steganalysis techniques and help to improve overall accuracy. In this paper, blind steganalysis is attempted by attacking the imple- The structure of this paper is as follows. Section II introduces the mentations of steganography algorithms rather than the steganography background and principles. Section III presents an algorithm. Sections algorithms themselves. Consequently, steganalysis may be achieved IV and V measure and evaluate the technique experimentally. Section by proxy, when the attack is successful. The principle of the technique VI proposes developer guidelines. Section VII presents conclusions, is as follows: and is followed by references. In order to make steganographic research applicable in the real world, it is necessary to implement theoretical steganography algo- II. BACKGROUND AND MOTIVATION rithms as executable programs. In the process of realising an abstract The approach described in this paper was inspired by an ob- algorithm as a concrete piece of code, mistakes may be made by the servation regarding the implementation of F5 [7]. Provos observed programmer that allow the presence of steganography to be exposed. [5] that the F5 implementation added a characteristic header field Where they exist, these implementation-induced artefacts are separate which was seldom seen in non-steganographic media. The presence to those produced by the nature of a steganographic embedding of this characteristic field allowed the presence of secret data to be method, such as the ‘pair of values’ artefact [2], which the majority of reasonably assumed. Another well-known example is provided by G. Bell is with the School of I.T. at Murdoch University, South the GIF steganography tool, STools [8]. It was observed that STools Street, Perth, WA 6150, Australia. E-mail: [email protected]. Web: accidentally outputs a characteristic palette which is not commonly http://graemebell.net. Phone: (+61)8-9360-6533. Fax (+61)8-9360-2941. Y.K. Lee is with the Dept. of CSIE, Ming Chuan University, 5 Teh-Ming 1Though it is blind, this technique still requires some output samples from Lu, Gueishan, Taiwan 333. a steganography tool in order to attempt an attack. 2 SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 1.2 1.2 1.2 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 Average value Average value 0.4 0.2 0.2 Characteristic values 0.2 0 0 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 0 Bit position Bit position Bit position (a) (b) (c) Fig. 1. (a) The average of the first 100 bits of 40 stego-GIFs produced with STools. (b) The average of the first 100 bits of 40 randomly selected non-stego GIFs. (c) The most ‘unusual’ characteristic bits of the stego-GIFs2. In (c), uninteresting bit positions are assigned the value 0.5 to distinguish them from positions that are characteristically 1 or 0 only in the stego-media. seen in GIF files [9]. Consequently, stego-media produced by this III. METHOD program can be very rapidly and reliably identified. Unfortunately, To determine the locations of candidate fixed-bits in stego-media this style of steganalysis is time consuming, and requires an expert files, it is necessary to take into consideration the typical value of with plenty of free time, patience and a measure of good luck. a bit in a particular location in the files, and how it varies from Nonetheless, these anecdotes provide a compelling motivation for natural, non-stego media. The typical value of a bit position in a non- the construction of a similar, but more systematic and automated stego media file is not 0.5, as might be expected (Fig. 1b). Rather, it ‘blind regularity detection’ attack against steganography tools (and depends on the kind of file that is being considered. For a JPEG file, thus steganography, by proxy). we might expect some fixed value bit-positions making up ‘0xFF’ or To understand the principle of this technique, imagine that ‘cow’, ‘JFIF’ to occur naturally within a normal file simply because it is a ‘dog’, ‘cat’, and ‘fox’ represent the output of a steganography pro- JPEG, without being indicative of stego-media. gram given different secret messages and different carrier media. The The method of this paper is therefore to first take the average output data clearly changes each time at the human-observable and (mean) of the bit value at each bit position, across several examples3 semantic level. Even at the byte-level, there is no obvious regularity of a particular tool’s outputted steganographic files. We record the that is always present in the output. However, if we look at the bit- bit locations which are ‘unusual’, i.e. vary dramatically from the level representation of these ASCII characters, we see that several bit expected average value of 0.5. In this implementation, we only record positions never change in value. Additionally, any regularity in the the bits that always have a value of ‘1’, or always have a value of ‘0’. byte-level representation will necessarily also appear as a regularity In other words: what is unusual about the bit values at each position in the bit-level representation. This means that the ‘obvious’ types of of the average stego-file produced by a particular steganography regularities noted by human researchers in the past can be discovered tool? (e.g.