DEGREE PROJECT, IN COMPUTER SCIENCE, FIRST LEVEL
STOCKHOLM, SWEDEN 2015

Delta encoding image archives

COMPARING DELTA ENCODING AND PNG AS COMPRESSION METHODS FOR IMAGE ARCHIVES

ISAK JÄGBERG, MATS STICHEL

KTH ROYAL INSTITUTE OF TECHNOLOGY

CSC SCHOOL

Delta encoding image archives

Comparing delta encoding and PNG as compression methods for image archives

ISAK JÄGBERG, MATS STICHEL

Degree project in Computer Science, DD143X
Supervisor: Jens Lagergren
Examiner: Örjan Ekeberg

2015-05-08

Abstract

This thesis studies the effect of using delta encoding to compress archives of images. The results are compared with two types of lossless PNG compression. The tests show that the most sophisticated PNG method tested compresses the archives around 2-12% more than the delta encoding methods, but at the cost of taking more than 10 times as long. The conclusion is therefore drawn that delta encoding could be a useful method for compressing image archives in environments where speed is more important than storage.

Referat

This report examines the effect of using delta encoding to compress image archives. The results are compared with two types of lossless PNG compression. The tests show that the most sophisticated PNG compression tested compresses the archives around 2-12% more than the delta encoding compression, but that it is at the same time more than 10 times slower. The conclusion is therefore that delta encoding can be a useful compression technique in computer systems where speed is more important than storage space.

Contents

1 Introduction
  1.1 Problem Definition
  1.2 Background
    1.2.1 Compression
    1.2.2 Delta encoding
    1.2.3 Data differencing and patching
    1.2.4 Image compression and image formats
    1.2.5 Reading/writing images
  1.3 State-of-the-art
  1.4 Scope
  1.5 Terminology

2 Method
  2.1 Test phases
  2.2 Data sets
  2.3 Compression methods
  2.4 Test environment

3 Result
  3.1 Result of test phase 1
  3.2 Result of test phase 2
  3.3 Analysis

4 Discussion
  4.1 Method limitations

5 Conclusion

Bibliography

Chapter 1

Introduction

Recording everything we see has become a natural part of our world, whether it is by photo or video. Because of this, the need for large storage servers has grown at a fast pace. In 2009, over 15 billion photos had been uploaded to Facebook and 220 million more were being uploaded weekly[6]. These photos can easily be compressed by Facebook even with a lossy algorithm because, most of the time, the users will not notice the loss. In other archives the data in the image could be of greater importance, for example images associated with a criminal investigation or a patient's medical records at a hospital. In these cases the photos have to be stored exactly as they are and cannot be compressed unless it is possible to decompress them without data loss. With photo archives of Facebook's size, the amount of storage needed makes such a compression method more relevant to investigate.

One common type of compression used on many data formats is delta encoding. Delta encoding is used in most video formats, such as MPEG, in order to compress frames by utilising the fact that each frame is similar to the one before it[5]. With delta encoding, you describe each frame as a list of differences from the previous frame. Since the two frames are similar, the differences are few, and the storage size needed is reduced.

1.1 Problem Definition

The purpose of this study is to find out if and how well delta encoding can be used to compress archives of images compared to PNG compression. To do this, the following problem statements have been chosen:

1. How much does the selection of source file affect the compression using delta encoding?

2. How well does delta encoding compress an archive of images compared to PNG compression?


3. How well does delta encoding compress an archive of similar images compared to an archive of dissimilar images?

1.2 Background

In the following sections, the relevant terminology and the information necessary to understand the latter portions of the report will be discussed. They cover the different kinds of compression and when they are used. Delta encoding will be introduced, and it will be explained how delta encoding can be used to compress data. To understand how images can be compressed, some image file formats will be explained, as well as Java's ImageIO library, which can be used to read and write images.

1.2.1 Compression

In computer science, compression refers to the practice of reducing the size of data. It is a central part of the modern world of computing. As the usage of technology grows, so does the amount of data that we need to store. While compressing a few files on your own hard drive may seem like an unnecessary waste of time since you have so much space anyway, compressing the billions of files stored globally has a much more drastic effect on the total storage space used. This in turn brings positive effects like less maintenance and lower costs[7].

There are two types of compression: lossy and lossless. A lossy compression will lose some of the data of the original file(s), and it will not be possible to reproduce all of the original data from the compressed data. This kind of compression is naturally not desired when compressing important data to minimise storage costs, but some scenarios where lossy compression is useful include compressing music files and video files. In these cases the lost data is acceptable because you either do not notice the missing data or the lower quality of the music or video file is a necessary cost for higher playback speed or lower bandwidth requirements when streaming[10]. Lossless compression will always prioritise conservation of data over reduction of data size, meaning it will try to reduce the size of a file, but it will only do so if it knows it can decompress the compressed data into an exact copy of the original data[13].

1.2.2 Delta encoding

One common type of compression used by many different industries today is delta encoding (sometimes called delta compression or delta coding). Delta encoding is used to compress sequences of data by describing every item in the sequence in terms of its differences to the previous item in the sequence[12]. These alternative descriptions are often referred to as patches, or deltas.


Consider this example: we have a large file with the source code of a program. We want to add a line of code to test something and want to keep both versions of the file in case the code does not work like we want it to. Using delta encoding we only need to store the original file along with a patch telling the system that we added a line of code. To achieve the same result without delta encoding we would have to store both files, where only one line of code differs. This might not seem so bad considering the usually small file size of source code files and the convenience of actually having the previous version of the file readable, but in larger projects, where you need to track many different versions of many different files, the storage space saved by delta encoding becomes much more significant.

The effectiveness of delta encoding depends largely on the similarity between the items in the sequence. Very similar items have very few differences, and as such the patches will be very small. Because of this characteristic, delta encoding lends itself very nicely to compressing data that is already sequential (audio and video, for example).

1.2.3 Data differencing and patching

In order to use delta encoding, we need a way to find the differences between two items in the sequence. This process is called data differencing. There are various tools that will perform various types of data differencing, but the core of the practice is the same: you have a source and a target, and you compare them in order to produce what is called a patch - a description of the changes you need to apply to the source in order to turn it into the target. A common data differencing tool is Unix diff, which differentiates text files on a line-by-line basis[4].

Figure 1.1. Two versions of a shopping list, and a patch

Differentiating the two shopping lists would generate a patch file describing how to turn the first version of the shopping list into the second version. The patch file is a list of instructions, telling the patch tool exactly what to do, and where: on line 3 of the original file, delete "Cheese"; on line 5 of the original file, add "Pasta".

As stated earlier, Unix diff is used mainly for differentiating text files. There are other tools designed for other specific types of data differencing, but there are also some tools that can differentiate any pair of files, regardless of file type. This is called binary data differencing, because it treats a file as a series of 1s and 0s, instead of a series of characters, pixels, et cetera. A popular pair of tools for binary data differencing and patching is bsdiff and bspatch[2]. Using bsdiff to differentiate two files, and bspatch to apply the patch to the source, one can turn any file into any other file, regardless of file extension.

Binary data differencing is most often used in creating software patches, for example when upgrading a program[9].

Delta encoding works because the patch file between two similar shopping lists is smaller than the target file. Instead of storing both the source and the target, the source and the patch can be stored.
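bsdiff and bspatch are command line tools. As a rough illustration of the diff/patch cycle they provide, the following minimal Java sketch calls both as external processes; it assumes the two executables are installed and available on the system path, and the file names are placeholders rather than files used in this study.

```java
import java.io.IOException;

// Minimal sketch of a diff/patch cycle using bsdiff and bspatch as external
// processes. Assumes both executables are installed and on the system path.
public class BinaryDiffExample {

    // Run an external command and wait for it to finish.
    private static void run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()          // show the tool's output in our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed: " + String.join(" ", command));
        }
    }

    public static void main(String[] args) throws Exception {
        // bsdiff <source> <target> <patch>: create a patch describing how to
        // turn source.bmp into target.bmp.
        run("bsdiff", "source.bmp", "target.bmp", "target.patch");

        // bspatch <source> <restored> <patch>: apply the patch to the source
        // to reconstruct an exact copy of the target.
        run("bspatch", "source.bmp", "restored.bmp", "target.patch");
    }
}
```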

Figure 1.2. Archive of four images

To maximize the compression of delta encoding, it is important to choose a good source file. Figure 1.2 shows an archive of four images. If image 4 is chosen as the source file, the compressed archive will contain image 4 and three large patch files (since image 4 is very different to the other images). If image 1 is chosen as the source file instead, the compressed archive will contain image 1, two small patch files (since image 1 is similar to images 2 and 3), and one large patch file (since image 4 is different).

1.2.4 Image compression and image formats

Most common image compression methods have some degree of loss. This is generally not a big problem, for a few different reasons. First of all, the human eye can only perceive a certain level of image detail (and as such, reducing image quality is sometimes almost unnoticeable). Second of all, the relation between data loss and image compression is generally such that a small data loss results in a major improvement in compression. For this reason the loss is usually of a small enough degree to not be problematic, even if the image has to be of high quality[13].

BMP (Bitmap) is a common format for storing images. It was initially developed for Microsoft Windows in particular, but it is supported by most image and graphics applications across all operating systems. BMP files are almost always stored uncompressed, meaning the files are often quite large and include redundant data. The format has a simple structure which is easy to understand[1].

PNG (Portable Network Graphics) is one of the most effective lossless image compression formats (although it also offers lossy compression)[11]. PNG uses many different compression techniques to reduce file size. One such technique is the inclusion of a 'palette', a list of the colours in the image.

This is done in order to reduce the number of bytes necessary to describe what colour a pixel is. Most image formats describe a pixel's colour as a combination of a red, green and blue colour value (RGB for short). In an uncompressed 24-bit image, each value would require 8 bits. In a large image with only four different colours, a palette would list these colours and index them accordingly. Then, instead of using 24 bits to describe what colour a pixel is, only 2 bits would be necessary (since 2² = 4).

Delta encoding is used in many different types of image compression, including PNG. PNG uses delta encoding within the image itself, describing pixels in terms of their differences to the pixels nearby[11]. This helps PNG compress losslessly, and is especially effective at compressing images where pixels near each other have similar colours.

Delta encoding is also heavily utilised in the compression of video. A video file is technically just a sequence of images, and most video compression uses some form of delta encoding to reduce file size. There are usually three different types of items in a compressed video file: intracoded frames (I-frames), forward predicted frames (P-frames), and bidirectional predicted frames (B-frames). One could think of I-frames as separate source files, while P- and B-frames are patch files used to create targets. P-frames and B-frames describe the movement of data in the image, and since the images in a video sequence usually only move a little bit every frame (a frame is only shown for 1/24th of a second), the patch files are much smaller than the raw image file[5].

1.2.5 Reading/writing images

When compressing an image file, you first need a way to read it. Most conventional programming languages have built-in support for reading files, and some programming languages have built-in libraries specifically designed for reading images. Java, for example, has a built-in library called ImageIO, which is a collection of objects and functions that allow a user to manipulate image files. ImageIO supports common image formats like JPEG, PNG and BMP, amongst others, and offers convenient functions that allow you to read pixel values individually, write pixels to new image files, read palettes in images and more[8].
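As a small illustration of the kind of access ImageIO provides, the following sketch reads an image and inspects the colour of a single pixel. The file name is only a placeholder; this is a minimal example, not code from the study itself.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Minimal sketch: read an image with ImageIO and inspect its pixels.
public class ReadPixels {
    public static void main(String[] args) throws IOException {
        BufferedImage image = ImageIO.read(new File("example.bmp"));  // placeholder file

        int width = image.getWidth();
        int height = image.getHeight();

        // getRGB packs the colour of one pixel into a single int (0xAARRGGBB).
        int rgb = image.getRGB(0, 0);
        int red   = (rgb >> 16) & 0xFF;
        int green = (rgb >> 8)  & 0xFF;
        int blue  = rgb         & 0xFF;

        System.out.printf("%dx%d image, top-left pixel: r=%d g=%d b=%d%n",
                width, height, red, green, blue);
    }
}
```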

1.3 State-of-the-art

While delta encoding is utilized in the compression of many different types of data, we have not found any previous research regarding archives of images. Methods of image compression often use delta encoding, but only within the image itself, not between two images. The state of the art is simply to compress each image individually, regardless of shared data between the images. This research is therefore considered relevant, since there could still be compressible data in the archive that individual image compression simply cannot detect.


1.4 Scope

There are many different types of PNG compression. In order to reduce the amount of time necessary for testing, only two different lossless PNG compression tools will be tested (pngout and OptiPNG). Additionally, the compression methods will only be tested on images in the BMP format. BMP images are often stored uncompressed, which means the comparison between the delta encoding and PNG compression methods will be clearer. If a format like JPEG were used instead, the images would already be compressed and the comparison would be between JPEG+delta encoding and JPEG+PNG. Additionally, using a format other than BMP would mean all images used in the tests would need to be from the same source, to ensure that they were all compressed with the same settings.

1.5 Terminology

Source: An item in a sequence from which other items are described
Target: An item in a sequence that is described from a source
Patch: A list of changes to be made on a source to transform it into a target
Pixel: The smallest addressable part of an image, containing only one colour
Lossy: Not conserving data
Lossless: Conserving data; possible to restore the original data
Archive: Multiple separate files stored together

Chapter 2

Method

The study will be conducted in two phases of testing. The purpose of test phase 1 is to study the effect of choosing different source files when compressing using delta encoding, in order to answer problem statement 1. The purpose of test phase 2 is to compare the effectiveness of delta encoding and PNG compression, and to study the difference between compressing an archive of dissimilar images and an archive of similar images, in order to answer problem statements 2 and 3.

To perform these tests, three data sets were formed, each consisting of 100 images. All the images in the data sets are BMP images. The BMP file type was chosen because it has a very simple structure, storing images as not much more than a pixel array. Additionally, BMP images are almost always uncompressed, which is important for the tests to produce genuine results. If one of the more widely used compressed file types had been used, for instance JPEG, the results would have shown the compression done by the combination of delta encoding and JPEG, which is not the desired result. The data sets are described in more detail in section 2.2.

2.1 Test phases

1. For each data set, 20 images will be chosen to create a smaller archive. This archive will then be compressed 20 times using one of the delta encoding compression methods described in section 2.3, each time with a different source file. This will result in 20 different compressed archives. Comparing the sizes of the archives, it will be possible to see which source file resulted in the best compression. There is a possibility that an effective method for choosing the best source file can be found using these results, in which case that method will be used in test phase 2. If no effective method is found, the source file will be chosen at random.

2. Test phase 2 will consist of compressing each data set with each compression method. The archive size of the data set before compression and after compression will be used to calculate the percentage of reduction, which will be used when comparing the methods. Additionally, compression time will be measured. For compression methods 1 and 2, time will be measured using Java's System.nanoTime() function, as sketched below. For methods 3 and 4, time will be measured by the software PNGGauntlet. It should be mentioned that the measurement of time is mainly used to allow for a deeper discussion of the results, and should only be considered an approximation of how long it takes for a method to compress a data set.
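For illustration, a minimal sketch of how the timing for methods 1 and 2 could be taken with System.nanoTime() is shown below; compressArchive is a hypothetical stand-in for the actual compression code.

```java
// Hedged sketch of how compression time could be measured with System.nanoTime().
// compressArchive(...) is a hypothetical stand-in for the actual compression code.
public class TimingExample {

    static void compressArchive(String path) {
        // placeholder for the real compression work
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        compressArchive("data_set_1");                    // placeholder archive name
        long elapsedNanos = System.nanoTime() - start;

        // Convert nanoseconds to minutes for reporting.
        System.out.printf("Compression took %.1f minutes%n",
                elapsedNanos / 1_000_000_000.0 / 60.0);
    }
}
```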

2.2 Data sets

1. Data set 1 is an archive of 100 dissimilar BMP images in varying sizes and resolutions. Compressing this data set will show how effective delta encoding of image archives is when the items in the sequence differ greatly, which is where delta encoding usually underperforms.

2. Data set 2 is an archive of 100 similar BMP images in varying sizes and resolutions. Every image in the archive is predominantly blue in colour. Compressing this data set will show how effective delta encoding of image archives can be when the items in the sequence have many similarities, which is where delta encoding usually performs well.

3. Data set 3 is an archive of 100 black and white BMP images in varying sizes and resolutions. The grayscale images have been generated from colour images by the average method: each pixel's RGB value is modified to equal (r+g+b)/3, as sketched below. Compressing this data set will show if delta encoding works better on image archives where each pixel can only be one of 2⁸ different colours instead of 2²⁴.
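A minimal sketch of the average-method conversion described above is shown below, using ImageIO. The file names are placeholders; this only illustrates how such a grayscale image could be produced.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hedged sketch: convert a colour image to grayscale with the average method,
// setting each pixel to (r + g + b) / 3. File names are placeholders.
public class AverageGrayscale {
    public static void main(String[] args) throws IOException {
        BufferedImage image = ImageIO.read(new File("colour.bmp"));

        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;

                int avg = (r + g + b) / 3;
                int gray = (avg << 16) | (avg << 8) | avg;  // same value in all channels
                image.setRGB(x, y, gray);
            }
        }

        ImageIO.write(image, "bmp", new File("grayscale.bmp"));
    }
}
```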


Figure 2.1. Two of the images in data set 1

Figure 2.2. Two of the images in data set 2

Figure 2.3. Two of the images in data set 3

2.3 Compression methods

1. Method 1 is pure binary delta encoding. This means each image file will be treated as a binary file of 0s and 1s.


Unless a method of choosing a source file is found during test phase 1, the first image in the archive will be chosen as the source. Then, using bsdiff, all other images in the archive are diffed against the source to generate patches. The source image and the patches will be stored in a separate folder. This folder will be the compressed version of the initial archive. To decompress, bspatch is used to recreate the desired target file from the source file and its patch. Since this method runs bsdiff on the actual image files (including headers), the patches will describe how to turn the source into the target without any loss at all, making this method lossless.

2. Method 2 is binary delta encoding of pixel arrays extracted using ImageIO. This method works in four steps, which are illustrated in Figure 2.4. Steps 1 and 2 are used to compress an image archive, and steps 3 and 4 are used to decompress the compressed version of the archive. A code sketch of step 1 is given after the figure.

• Step 1 is performed on an archive of BMP images. Using Java's ImageIO, the program goes through each image and extracts the width, the height and the raw pixel array data. It then stores these in a binary file in the form width,height,r,g,b,r,g,b... The binary files are stored in a folder called bin_images, next to the original image archive. For example, an image of a single blue pixel would be stored as the binary file 1,1,0,0,255.

• Step 2 is performed on the bin_images folder created in step 1. In this step, a source file is chosen the same way as in compression method 1, and then for each binary file in the archive (except the source file), bsdiff creates a patch describing how to turn the source file into the target. These patch files (along with the source file) are then stored in a folder called patches, next to the images and bin_images folders. This folder is the compressed version of the initial archive.

• Step 3 is performed on the patches folder created in step 2. For each patch file in the folder, bspatch is run together with the source file to create the target file, which will be a binary image in the form described in step 1. For clarity's sake, these files are stored (together with the source file) in a folder called restored_bin, but it is important to understand that they will be exactly the same as the binary images created in step 1.

• Step 4 is the final step, and is performed on the restored_bin folder described in step 3. ImageIO reads each binary image and, using a WritableRaster, sets the pixel values of a new BMP file to the values in the binary file. The new BMP file is then written to disk using ImageIO.write(...), which adds the correct header to the file. These images are then stored in a folder called restored_images. This is the decompressed version of the archive, and if the original archive used the BMP variant that ImageIO uses, the two archives will be identical.


3. Method 3 is PNG compression using a tool called pngout. pngout is a relatively unsophisticated compression method: it optimizes PNG files, but has more focus on speed than compression.

4. Method 4 is PNG compression using both pngout and another PNG optimizer called OptiPNG. OptiPNG is a more sophisticated compression tool using more parameters, and in theory should compress better than compression method 3.

Both methods of PNG compression will be executed using the software PNGGauntlet[3]. It was chosen due to its relative ease of use and because it allows for batch compression of multiple images at once. The two methods of PNG compression were chosen because comparing against both makes the comparison with delta encoding more informative, since they show how both unsophisticated and sophisticated PNG compression performs.

Figure 2.4. Steps taken in compression method 2
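To make step 1 of compression method 2 more concrete, the sketch below shows one way the extraction could look in Java, writing the width, height and raw RGB values of a single image to a binary file with ImageIO and DataOutputStream. The exact binary layout and file names used in the study may differ; this is only an illustration.

```java
import java.awt.image.BufferedImage;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hedged sketch of step 1 in compression method 2: extract width, height and
// raw RGB values into a binary file (width, height, r, g, b, r, g, b, ...).
// File names are placeholders and assume the bin_images folder already exists.
public class ExtractPixelArray {
    public static void main(String[] args) throws IOException {
        BufferedImage image = ImageIO.read(new File("images/photo.bmp"));

        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream("bin_images/photo.bin"))) {
            out.writeInt(image.getWidth());
            out.writeInt(image.getHeight());

            for (int y = 0; y < image.getHeight(); y++) {
                for (int x = 0; x < image.getWidth(); x++) {
                    int rgb = image.getRGB(x, y);
                    out.writeByte((rgb >> 16) & 0xFF);  // red
                    out.writeByte((rgb >> 8) & 0xFF);   // green
                    out.writeByte(rgb & 0xFF);          // blue
                }
            }
        }
    }
}
```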


2.4 Test environment

The tests were performed on a computer with the following components:

• Intel Core i5-2500k CPU

• Corsair 8 GB DDR3 1600MHz RAM

• Asus P8P67-M PRO REV B3 Motherboard

Chapter 3

Result

This section presents the results obtained during test phases 1 and 2.

3.1 Result of test phase 1

In test phase 1, 20 images from each data set were selected and compressed 20 times, each time with a different source file, and using both delta encoding compression methods. This section presents the worst and best compression for each data set.

Data set                 Worst compression   Best compression   Difference
1 - Dissimilar images    50.87%              54.78%             3.91%
2 - Similar images       60.83%              64.62%             3.79%
3 - Grayscale images     68.17%              75.83%             7.66%

Table 3.1. Difference in compression between the best and worst source file using compression method 1

Data set                 Worst compression   Best compression   Difference
1 - Dissimilar images    47.39%              50.87%             3.48%
2 - Similar images       58.35%              62.11%             3.76%
3 - Grayscale images     67.11%              74.38%             7.66%

Table 3.2. Difference in compression between the best and worst source file using compression method 2

In all cases, with both methods and all data sets, the smallest file in the archive gave one of the best results when used as the source. Similarly, the largest file in the archive gave one of the worst results.

As shown in Tables 3.1 and 3.2, the differences between the best and the worst compressions could be close to 8%, which means that the selection of source file can be important. Because of this, the method of choosing the smallest file as the source file was used in the later tests.
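For illustration, the source-selection rule could be expressed as a few lines of Java that pick the smallest file in the archive folder; the folder name is a placeholder and this sketch is not the exact code used in the tests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Hedged sketch of the source-selection rule used in the later tests:
// pick the smallest file in the archive folder as the delta encoding source.
public class SmallestSource {
    public static void main(String[] args) throws IOException {
        Path archive = Paths.get("data_set_1");   // placeholder folder name

        try (Stream<Path> files = Files.list(archive)) {
            Path source = files
                    .filter(Files::isRegularFile)
                    .min(Comparator.comparingLong((Path p) -> p.toFile().length()))
                    .orElseThrow(() -> new IOException("empty archive"));
            System.out.println("Chosen source file: " + source);
        }
    }
}
```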

3.2 Result of test phase 2

In test phase 2, all three data sets were compressed using each compression method. Time was measured between the start and the end of the compression. This section presents the percentage of reduction, and the time taken rounded to the nearest minute, for each data set and compression method.

Compression method         Reduction   Time
1 - Direct binary diff     54.21%      9 min
2 - ImageIO                54.21%      9 min
3 - pngout                 41.05%      14 min
4 - pngout + OptiPNG       56.70%      1 h 50 min

Table 3.3. Compression of data set 1 for each compression method

Compression method         Reduction   Time
1 - Direct binary diff     58.35%      6 min
2 - ImageIO                58.35%      7 min
3 - pngout                 53.21%      16 min
4 - pngout + OptiPNG       65.40%      1 h 58 min

Table 3.4. Compression of data set 2 for each compression method

Compression method         Reduction   Time
1 - Direct binary diff     74.04%      3 min
2 - ImageIO                74.13%      4 min
3 - pngout                 76.79%      22 min
4 - pngout + OptiPNG       86.30%      47 min

Table 3.5. Compression of data set 3 for each compression method


3.3 Analysis

Tables 3.1 and 3.2 show that the difference in terms of reduction between choosing the best and the worst source file is around 3-8%. The difference was highest in data set 3 (grayscale images) for both compression methods.

Method 4 consistently performs better in terms of reduction of archive size. When compressing the archive of grayscale images, it outperforms the other methods by 10-12%. It was however the slowest of all compression methods, taking almost two hours to compress data sets 1 and 2.

Both compression methods using delta encoding (1 and 2) had very similar results, matching both reduction and time taken in multiple tests. The patch files are the same size even without stripping the header first.

It can be seen from tables 3.3 and 3.4 that the reduction from methods 1 and 2 for data set 2 was about 4% more than for data set 1. This seems low, considering the fact that delta encoding should work much better on items that have many similarities between them. However, there is an explanation for why the difference is quite low.

Figure 3.1. Two 2x2 images of entirely different colour


Consider Figure 3.1. Image 1 is a 2 × 2 px red image, and Image 2 is a 2 × 2 px blue image. Their binary representations would look like the following:

Image 1: 2 2 255 0 0 255 0 0 255 0 0 255 0 0
Image 2: 2 2 0 0 255 0 0 255 0 0 255 0 0 255

The two images share no colours at all. However, if the images were differentiated using bsdiff, the only differences that would be found would be the two 0s at the end of Image 1, and the two 0s at the beginning of Image 2. The patch describing how to turn Image 1 into Image 2 would only contain the following instructions: 'delete 0 0 at the end, insert 0 0 at the beginning'. Though the two images look entirely different, they share a lot of binary data.

For two large images, there is a high chance that there will be common strings in both binary representations, meaning that even though the actual colours are different, the binary difference can be small. Two images with similar colours will have more common binary strings, and this characteristic can be seen in the extra 4% of compression between data sets 1 and 2. Another example of this characteristic is shown in the results for data set 3, where each image is in grayscale. In this case, the number of different possible colours for each pixel has been reduced from 2²⁴ to 2⁸. This means there will be many more common strings between two images, which is why the reduction is higher.

Chapter 4

Discussion

The result of test phase 1 showed the impact of choosing the best source file and suggested that choosing the smallest file in the archive most often gives the best result. To understand why this is the case, further research would need to be done regarding the structure of the patch files created by bsdiff. A simple explanation could be that commands of the kind "add pixel of colour X" are smaller than commands of the kind "exchange with pixel of colour Y".

There were some cases where the smallest image did not produce the best compression. In these cases the best compression came from the second or third smallest image. The method of always choosing the smallest file as the source is therefore only an approximation and does not guarantee optimal results. However, in the cases where the smallest file did not give the best compression, the differences between the best compression and the compression using the smallest file as the source were very small. This means that the method should give close to optimal results.

The two delta encoding compression methods perform almost equally in all tests, suggesting that extracting the pixel data before diffing offers no advantage over diffing the image files directly. However, it could also indicate that the images are stored as pixel data and not using a colour palette, which the BMP format supports as well. In that case, the conversion could still improve the compression, should the archive contain images of different formats.

The tables also show that the best method for compression was method 4, which was the method combining pngout and OptiPNG. This method was also the slowest by a large margin, which is an important factor to consider when discussing which method is better. The difference in reduction between methods 1, 2, and 4 was quite small in data sets 1 and 2, but methods 1 and 2 were faster by a significant amount. The only data set where method 3 gave better reduction than methods 1 and 2 was the archive of grayscale images. This is unsurprising, considering the fact that PNG utilizes a palette, meaning it is more effective at compressing images with few colours. This also explains the results of compressing this data set with method 4.

The delta encoding methods can be improved in many ways, some at the cost of runtime, some as a result of a more optimized and/or specialized algorithm. Some possible improvements are listed below.


• In test phase 1, it was shown that the difference between the best and the worst source file was around 3-8%. Generally, a smaller source file resulted in better compression. However, choosing the smallest file as the source file does not guarantee the best compression. Through further research, it might be possible to increase compression by finding a more reliable method of choosing a source file.

• In section 1.2.4 we describe how some video compression methods choose certain frames as I-frames, which are used to describe the following frames as patches instead of entire frames. This could potentially be implemented in methods 1 and 2 by dividing the archive into multiple smaller archives, where each archive contains images similar to each other. Compressing each smaller archive individually could lead to more reduction.

• As explained in section 1.2.4, palettes are a common technique used in image compression to reduce the number of bits required to describe a colour. Implementing this in the storage of the binary pixel arrays could potentially reduce their sizes, leading to smaller patch files.

• By modifying the actual data differencing software to compare pixel values between two images specifically, one could add a variable describing a tolerance level: two pixels would be considered to have equal colour if the difference between their colours was less than the tolerance (a sketch of such a comparison is given after this list). A very important consequence of this change is that the method would be lossy instead of lossless.

• Delta encoding could be used on archives that can take advantage of the fact that it compresses images relative to one another and not one image at a time. If an archive has many similar images with lots of colours and shapes in them, PNG will have a hard time compressing these images, because it only looks for ways to compress each image internally. Delta encoding, on the other hand, will be able to find the similarities between the images and use them to compress the archive more effectively.
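The tolerance-based comparison mentioned in the list above could, for example, look like the following sketch. It treats two pixels as having equal colour when every channel differs by less than the tolerance; it is only an illustration of the idea and was not part of the methods tested in this study.

```java
// Hedged sketch of the tolerance idea described above: two pixels are treated
// as having "equal" colour if every channel differs by less than the tolerance.
// Illustrative only; not part of the tested methods.
public class TolerantPixelCompare {

    static boolean similar(int rgb1, int rgb2, int tolerance) {
        int r1 = (rgb1 >> 16) & 0xFF, g1 = (rgb1 >> 8) & 0xFF, b1 = rgb1 & 0xFF;
        int r2 = (rgb2 >> 16) & 0xFF, g2 = (rgb2 >> 8) & 0xFF, b2 = rgb2 & 0xFF;

        return Math.abs(r1 - r2) < tolerance
                && Math.abs(g1 - g2) < tolerance
                && Math.abs(b1 - b2) < tolerance;
    }

    public static void main(String[] args) {
        int almostRed = 0xFA0205;   // close to pure red
        int pureRed   = 0xFF0000;

        // With a tolerance of 8 the two pixels count as the same colour,
        // which would make the resulting patch smaller but the method lossy.
        System.out.println(similar(almostRed, pureRed, 8));   // true
    }
}
```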

Despite these improvements, the delta encoding methods would still have a major weakness. When compressing images individually with PNG, the compressed files can be opened by an image viewer without decompressing them. To view an image compressed with delta encoding, the patch must first be applied to the source, and the decompressed image will have the same size as the original image. This could be an important factor depending on the environment the images are used in. In an environment like Facebook, images are viewed by users on a regular basis, and decompressing these images each time they are viewed would result in additional server load and costs.

Because of this weakness, delta encoding is most suitable in an environment where many images are stored, but few are viewed. For example, it might be useful to utilize delta encoding in cloud storage solutions.

In these systems, files often have to be downloaded via the Internet before they can be used. As such, there is no need for the images to be functional in storage. This would reduce bandwidth costs for the cloud service provider.

4.1 Method limitations

There are some comments that need to be made regarding the methodology chosen for the study. First of all, as for any study, the reliability of the results is largely related to the size of the test data. In our case, we had three data sets of 100 images each, which we consider to be enough to provide a set of results from which a conclusion can be drawn. However, there is of course the possibility that one of the data sets happened to be filled with images that compress exceptionally well with delta encoding. Performing additional tests would reduce this possibility, but we consider the chance that 100 images happen to be perfectly suited to a certain compression method to be small enough.

Only two types of PNG compression were tested. There are many other types of PNG compression that could be tested, some of which are optimized for certain types of images, and running more tests using these would give a clearer comparison to delta encoding.

Testing compression in general is a difficult task. There are many different methods, techniques, formats, and standards that work well on some types of data, and not so well on other types of data. BMP was chosen as the file format in this study due to both its simple structure and the fact that most BMP files are uncompressed, which means we get a better picture of how well the methods actually compress. Using a file format that might already be compressed would mean we were actually testing a combination of compression methods (JPEG + PNG, JPEG + delta encoding and so on), instead of the individual methods. More data sets containing more image formats could be added, but this would take much more time than we had.


Chapter 5

Conclusion

1. How much does the selection of source file affect the compression using delta encoding?

The difference in terms of reduction when compressing using the best and the worst source file is between 3-8%. The difference was greatest for the archive of grayscale images. In our tests, we found that a smaller source file generally resulted in better compression, which is why our method for choosing the source file was to choose the smallest image in the archive.

2. How well does delta encoding compress an archive of images compared to PNG compression?

The two delta encoding methods did not compress the archives as well as the more sophisticated PNG method, which compressed the archives an additional 2-12% in comparison. However, the delta encoding methods were more than 10 times faster.

3. How well does delta encoding compress an archive of similar images compared to an archive of dissimilar images?

Delta encoding compresses an archive of similar images around 4% more than an archive of dissimilar images. It compresses an archive of grayscale images around 17% more than an archive of dissimilar images.


Bibliography

[1] Bitmap file format summary. http://www.fileformat.info/format/os2bmp/egff.htm. [Accessed 17-March-2015].

[2] bsdiff. http://www.daemonology.net/bsdiff/. [Computer software].

[3] PNGGauntlet. http://pnggauntlet.com. [Computer software].

[4] Unix diff manual. http://unixhelp.ed.ac.uk/CGI/man-cgi?diff. [Accessed 17-March-2015].

[5] I. Agi and L. Gong. An empirical study of secure MPEG video transmissions. In Proceedings of the Symposium on Network and Distributed System Security, pages 137–144, Feb 1996.

[6] D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel, et al. Finding a needle in Haystack: Facebook's photo storage. In OSDI, volume 10, pages 1–8, 2010.

[7] T. Bell. Data compression. In Encyclopedia of Computer Science, pages 492–496. John Wiley and Sons Ltd., Chichester, UK.

[8] Oracle. Java ImageIO documentation. http://docs.oracle.com/javase/7/docs/api/javax/imageio/ImageIO.html. [Accessed 17-March-2015].

[9] C. Percival. Matching with mismatches and assorted applications. PhD thesis, University of Oxford, 2006.

[10] J. H. Pujar and L. M. Kadlaskar. A new lossless method of image compression and decompression using Huffman coding technique. Journal of Theoretical and Applied Information Technology, 15(1), 2010.

[11] G. Roelofs. PNG: The Definitive Guide. O'Reilly Media, 1999.

[12] T. Suel and N. Memon. Algorithms for delta compression and remote file synchronization. In Khalid Sayood, editor, Lossless Compression Handbook. Academic Press, 2002.

[13] D. Taubman and M. Marcellin. JPEG2000: standard for interactive imaging. Proceedings of the IEEE, 90(8):1336–1357, Aug 2002.
