Delta Encoding Image Archives
DEGREE PROJECT IN COMPUTER SCIENCE, FIRST LEVEL
STOCKHOLM, SWEDEN 2015

Delta encoding image archives
Comparing delta encoding and PNG as compression methods for image archives

ISAK JÄGBERG, MATS STICHEL

KTH ROYAL INSTITUTE OF TECHNOLOGY
CSC SCHOOL

Degree project in Computer Science DD143X
Supervisor: Jens Lagergren
Examiner: Örjan Ekeberg
2015-05-08

Abstract

This thesis studies the effect of using delta encoding to compress archives of images. The results are compared with two types of lossless PNG compression. The tests show that the most sophisticated PNG method tested compresses the archives around 2-12% more than the delta encoding methods, but takes more than 10 times as long. The conclusion is therefore drawn that delta encoding could be a useful method for compressing image archives in environments where speed is more important than storage.

Referat

This report examines the effect of using delta encoding to compress image archives. The results are compared with two types of lossless PNG compression. The tests show that the most sophisticated PNG compression tested compresses the archives around 2-12% more than the delta encoding compression, but that it is also more than 10 times slower. The conclusion is therefore that delta encoding can be a useful compression technique in computer systems where speed is more important than storage space.

Contents

1 Introduction
  1.1 Problem Definition
  1.2 Background
    1.2.1 Compression
    1.2.2 Delta encoding
    1.2.3 Data differencing and patching
    1.2.4 Image compression and image formats
    1.2.5 Reading/writing images
  1.3 State-of-the-art
  1.4 Scope
  1.5 Terminology
2 Method
  2.1 Test phases
  2.2 Data sets
  2.3 Compression methods
  2.4 Test environment
3 Result
  3.1 Result of test phase 1
  3.2 Result of test phase 2
  3.3 Analysis
4 Discussion
  4.1 Method limitations
5 Conclusion
Bibliography

Chapter 1
Introduction

Recording everything we see, whether by photo or video, has become a natural part of our world. Because of this, the need for large storage servers has grown rapidly. By 2009, over 15 billion photos had been uploaded to Facebook, and 220 million more were being uploaded every week[6]. Facebook can compress these photos even with a lossy compression algorithm because, most of the time, users will not notice the difference. In other archives the data in the image could be of greater importance, for example images associated with a criminal investigation or a patient's medical records at a hospital. In these cases the photos have to be stored exactly as they are and cannot be compressed unless it is possible to decompress them without data loss. For photo archives the size of Facebook's, the amount of storage needed makes such a lossless compression method all the more relevant to investigate.

One common type of compression used on many data formats is delta encoding. Delta encoding is used in most video formats, such as MPEG, to compress frames by utilising the fact that each frame is similar to the one before it[5]. With delta encoding, each frame is described as a list of differences from the previous frame. Since the two frames are similar, the differences are few, and the storage size needed is reduced.

1.1 Problem Definition

The purpose of this study is to find out if, and how well, delta encoding can be used to compress archives of images compared to PNG compression. To do this, the following questions have been chosen as the problem statement:

1. How much does the selection of source file affect the compression using delta encoding?
2. How well does delta encoding compress an archive of images compared to PNG compression?
3. How well does delta encoding compress an archive of similar images compared to an archive of dissimilar images?

1.2 Background

The following sections introduce the terminology and information needed to understand the later parts of the report. They cover the different kinds of compression and when each is used. Delta encoding is introduced, along with an explanation of how it can be used to compress data. To explain how images can be compressed, some image file formats are described, as well as Java's ImageIO library, which can be used to read and write images.

1.2.1 Compression

In computer science, compression refers to the practice of reducing the size of data. It is a central part of modern computing. As the usage of technology grows, so does the amount of data that we need to store. While compressing a few files on your own hard drive may seem like an unnecessary waste of time since you have so much space anyway, compressing the billions of files stored globally has a much more drastic effect on the total storage space used. This in turn brings positive effects such as less maintenance and lower costs[7].

There are two types of compression: lossy and lossless. A lossy compression loses some of the data of the original file(s), so it is not possible to reproduce all of the original data from the compressed data. This kind of compression is naturally not desired when important data is compressed to minimise storage costs, but it is useful in some scenarios, such as compressing music and video files. In these cases the loss is acceptable, either because you do not notice the missing data or because the lower quality of the music or video file is a necessary cost for higher playback speed or lower bandwidth requirements when streaming[10].
Lossless compression always prioritises preserving the data over reducing its size: it reduces the size of a file only in ways that allow the compressed data to be decompressed into an exact copy of the original data[13].

1.2.2 Delta encoding

One common type of compression used by many different industries today is delta encoding (sometimes called delta compression or delta coding). Delta encoding compresses a sequence of data by describing every item in the sequence in terms of its differences from the previous item in the sequence[12]. These alternative descriptions are often referred to as patches, or deltas.

Consider this example: we have a large file containing the source code of a program. We want to add a line of code to test something, and we want to keep both versions of the file in case the code does not work the way we want it to. Using delta encoding, we only need to store the original file along with a patch telling the version control system that we added a line of code. To achieve the same result without delta encoding, we would have to store two complete files that differ in only one line of code. This might not seem so bad, considering the usually small size of source code files and the convenience of having the previous version of the file directly readable, but in larger projects, where many different versions of many different files need to be tracked, the storage space saved by delta encoding becomes much more significant.

The effectiveness of delta encoding depends largely on the similarity between the items in the sequence. Very similar items have very few differences, so the patches will be very small. Because of this characteristic, delta encoding lends itself well to compressing data that is already sequential (audio and video, for example).

1.2.3 Data differencing and patching

In order to use delta encoding, we need a way to find the differences between two items in the sequence.
This process is called data differencing. Various tools perform various types of data differencing, but the core of the practice is the same: you have a source and a target, and you compare them to produce what is called a patch, a description of the changes you need to apply to the source in order to turn it into the target. A common data differencing tool is Unix diff, which compares text files on a line-by-line basis[4].

Figure 1.1. Two versions of a shopping list, and a patch

Running diff on the two shopping lists generates a patch file describing how to turn the first version of the shopping list into the second version. The patch file is a list of instructions telling the patch tool exactly what to do, and where: on line 3 of the original file, delete "Cheese"; on line 5 of the original file, add "Pasta".

As stated earlier, Unix diff is used mainly for comparing text files. There are other tools designed for other specific types of data differencing, but there are also tools that can compare any pair of files, regardless of file type. This is called binary data differencing, because it treats a file as a series of 1s and 0s instead of a series of characters, pixels, et cetera. A popular pair of tools for binary data differencing and patching is bsdiff and bspatch[2]. By using bsdiff to compute the patch between two files and bspatch to apply that patch to the source, one can turn any file into any other file, regardless of file type.
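The source, target and patch relationship described above, and the way delta encoding chains such patches over a sequence, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: it uses the standard library's difflib matcher rather than the algorithms of Unix diff or bsdiff, the shopping list items other than "Cheese" and "Pasta" are made up, and the function names make_patch, apply_patch, delta_encode and delta_decode are hypothetical helpers, not part of any tool mentioned in this report.

```python
from difflib import SequenceMatcher


def make_patch(source, target):
    """Return edit instructions that turn `source` into `target`.

    Each instruction (i1, i2, new_items) means: replace source[i1:i2]
    with new_items. An empty new_items is a deletion, i1 == i2 an insertion.
    """
    patch = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # unchanged runs do not need to be stored
            patch.append((i1, i2, target[j1:j2]))
    return patch


def apply_patch(source, patch):
    """Rebuild the target from `source` and a patch made by make_patch."""
    result = list(source)
    # Apply from the end so that earlier indices remain valid.
    for i1, i2, new_items in reversed(patch):
        result[i1:i2] = new_items
    return result


def delta_encode(items):
    """Keep the first item in full and each later item as a patch
    against its predecessor. Assumes `items` is not empty."""
    patches = [make_patch(prev, cur) for prev, cur in zip(items, items[1:])]
    return items[0], patches


def delta_decode(first, patches):
    """Reverse delta_encode by applying the patches one after another."""
    items = [list(first)]
    for patch in patches:
        items.append(apply_patch(items[-1], patch))
    return items


# Hypothetical shopping lists (illustrative data, not taken from Figure 1.1).
old_list = ["Milk", "Eggs", "Cheese", "Bread", "Apples"]
new_list = ["Milk", "Eggs", "Bread", "Apples", "Pasta"]

patch = make_patch(old_list, new_list)
rebuilt = apply_patch(old_list, patch)

# A three-version "archive": only the first list is stored in full.
versions = [old_list, new_list, new_list + ["Rice"]]
first, patches = delta_encode(versions)
restored = delta_decode(first, patches)
```

Here the patch for the two lists holds just two instructions, one deletion and one insertion, whereas storing the second list in full would repeat four unchanged items. Because decoding reproduces every version exactly, this kind of scheme is lossless.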