LZW COMPRESSION AND DECOMPRESSION

December 4, 2015

Contents

1 INTRODUCTION

2 CONCEPT

3 COMPRESSION

4 DECOMPRESSION:

5 ADVANTAGES OF LZW:

6 DISADVANTAGES OF LZW:

1 INTRODUCTION

LZW stands for Lempel-Ziv-Welch, after its three creators: Abraham Lempel, Jacob Ziv, and Terry Welch. The algorithm is very simple to implement. In 1977, Lempel and Ziv published a paper on “sliding-window” compression, followed by a paper on “dictionary”-based compression; the two methods were named LZ77 and LZ78, respectively. Later, in 1984, Welch published a refinement of the LZ78 algorithm, which became known as the LZW compression algorithm.

2 CONCEPT

Many real-world files, especially text files, contain certain strings that repeat very often, for example “ the ”, “of”, and “on”. With its surrounding spaces, the string “ the ” takes 5 bytes, or 40 bits, to encode. But what if we added the whole string to the list of characters, right after the last one, at index 256? Then every time we came across “ the ”, we could send the code 256 instead of 32, 116, 104, 101, 32. This would take 9 bits instead of 40 bits. This is the idea behind LZW compression. It starts with a “dictionary” of all the single characters, with indexes from 0 to 255. It then expands the dictionary as data passes through. Pretty soon, frequently repeated strings are each encoded as a single code, and compression has occurred.

LZW compression replaces strings of characters with single codes. It does not analyze the input text in advance. Instead, it adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.

The codes that the LZW algorithm outputs can be of any length, but they must have more bits than a single character. The first 256 codes (when using 8-bit characters) are by default allocated to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program uses 12-bit codes. This means codes 0 to 255 refer to individual bytes, while codes 256 to 4095 refer to substrings.
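The bit counts in the example above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, and registering the whole phrase directly under code 256 is a simplification of how the dictionary actually grows.

```python
# Initial dictionary: one entry per possible byte value, codes 0..255.
dictionary = {bytes([i]): i for i in range(256)}

phrase = b" the "

# Sending the phrase character by character costs one 8-bit code each.
byte_codes = [dictionary[bytes([b])] for b in phrase]
bits_uncompressed = 8 * len(phrase)

# Hypothetical step: register the whole phrase under the next free code (256).
dictionary[phrase] = len(dictionary)
bits_single_code = dictionary[phrase].bit_length()

print(byte_codes)         # [32, 116, 104, 101, 32]
print(bits_uncompressed)  # 40
print(bits_single_code)   # 9
```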

3 COMPRESSION

LZW starts with a dictionary of 256 characters (in the case of 8-bit characters) and uses those as the “standard” character set. It then reads data 8 bits at a time (e.g., ’a’, ’b’, etc.) and encodes the input data as numbers that represent indexes in the dictionary. Whenever it comes across a new substring (e.g., “ab”), it adds it to the dictionary. Whenever it comes across a substring it has already seen, it just reads in a new character and concatenates it with the current string to get a new substring. The next time LZW revisits a substring, it will be encoded using a single number. Usually a maximum number of entries (say, 4096) is defined for the dictionary, so that the process doesn’t run away with memory. Thus, the codes which take the place of the substrings in this example are 12 bits long (2^12 = 4096). It is necessary for the codes to be longer in bits than the characters (12 vs. 8 bits), but since many frequently occurring substrings will be replaced by a single code, in the long run, compression is achieved.
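The compression loop described above might be sketched in Python as follows. The function name lzw_compress and the parameter max_entries are assumptions made for this illustration. The sketch returns the codes as a list of integers and does not pack them into 12-bit fields.

```python
from typing import List


def lzw_compress(data: bytes, max_entries: int = 4096) -> List[int]:
    """Return the LZW codes for data (an illustrative sketch, not optimized)."""
    # Codes 0..255 stand for the single bytes.
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Substring already seen: keep extending the current string.
            current = candidate
        else:
            # Emit the code of the longest known string.
            codes.append(dictionary[current])
            # Record the new substring while the dictionary has room.
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)
            current = bytes([byte])
    if current:
        # Flush whatever is left at the end of the input.
        codes.append(dictionary[current])
    return codes


print(lzw_compress(b"abababab"))  # [97, 98, 256, 258, 98]
```

In this run, eight input bytes become five codes. The codes 256 and 258 stand for the substrings “ab” and “aba” that were added to the dictionary along the way.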

4 DECOMPRESSION:

The decompression process for LZW is also very simple. In addition, it has an edge over static compression methods because no dictionary or other pre-existing information is necessary for the decoding algorithm: a dictionary identical to the one created during compression is rebuilt during the process. Both the encoding and decoding programs must start with the same initial dictionary, in this scenario all 256 single-byte characters (extended ASCII).

Here’s how it works. The LZW decoder first reads in an index, looks up the index in the dictionary, and returns the substring associated with the index. The first character of this substring is appended to the current working string. This new concatenation is added to the dictionary. The decoded string then becomes the current working string (the current index, i.e. the substring, is remembered), and the process repeats.
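A minimal Python sketch of this decoding loop is given below. The name lzw_decompress and the parameter max_entries are assumptions made for this illustration. The sketch also handles the one case where a code arrives before the decoder has defined it. In that case the substring must be the current working string followed by its own first character.

```python
def lzw_decompress(codes, max_entries=4096):
    """Rebuild the original bytes from a list of LZW codes (illustrative sketch)."""
    # Same initial dictionary as the encoder: codes 0..255 are the single bytes.
    dictionary = {i: bytes([i]) for i in range(256)}
    if not codes:
        return b""
    current = dictionary[codes[0]]
    output = [current]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Code not defined yet: it can only be the current working string
            # followed by its own first character.
            entry = current + current[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        # New entry: current working string + first character of the new substring.
        if len(dictionary) < max_entries:
            dictionary[len(dictionary)] = current + entry[:1]
        current = entry
    return b"".join(output)


print(lzw_decompress([97, 98, 256, 258, 98]))  # b'abababab'
```

No dictionary is passed in: the decoder reconstructs entries 256, 257, and so on from the code stream alone.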

So, the encoded output starts out 0, 1, 2, 4, ... . When we start trying to decode, a problem arises (in the table below, keep in mind that the Current String is just the substring that was decoded in the last iteration of the loop. Also, the New Dictionary Entry is created by appending the first character of the new Dictionary Translation to the Current String):
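As an illustration of the columns just described, the following Python sketch produces one row per decoded code. It is not the original table. The input codes are an assumption chosen for this sketch (the codes an LZW encoder with a 256-entry initial dictionary would emit for “abababab”), and the function name decode_trace is hypothetical. For brevity the sketch places no limit on the dictionary size.

```python
def decode_trace(codes):
    """Return rows of (code, translation, current_string, new_entry)."""
    dictionary = {i: bytes([i]) for i in range(256)}
    rows = []
    current = b""  # substring decoded in the previous iteration
    for code in codes:
        if code in dictionary:
            translation = dictionary[code]
        elif current and code == len(dictionary):
            # The code refers to the entry that is about to be created.
            translation = current + current[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        new_entry = None
        if current:
            # Current String + first character of the new translation.
            new_entry = (len(dictionary), current + translation[:1])
            dictionary[new_entry[0]] = new_entry[1]
        rows.append((code, translation, current, new_entry))
        current = translation
    return rows


for row in decode_trace([97, 98, 256, 258, 98]):
    print(row)
```

In the fourth row of this run, code 258 arrives before the decoder has defined it. The translation is therefore built from the Current String “ab” plus its own first character, giving “aba”, which is also the entry stored under 258.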

5 ADVANTAGES OF LZW:

• LZW compression is very fast.
• It is a lossless compression technique.
• The algorithm is very simple to implement.
• There is no need to analyze the incoming text.
• The whole algorithm can be expressed in only a dozen lines.
• LZW excels on data streams that contain repeated strings. Because of this, it does extremely well at compressing English text, where a compression ratio of 50 percent or more can be expected.
• For any fixed stationary source, the LZW algorithm performs, in the long run, just as well as if it had been designed for that source.

6 DISADVANTAGES OF LZW:

• Although the algorithm itself is pretty simple, implementing it is complicated, mainly because of the management of the string table.
• Files that do not contain any repetitive data at all cannot be compressed much.
• The method is good at text files but not as good at other types of files.
• The amount of storage needed is indeterminate, as it depends on the total length of all the strings.
• Searching the strings is also a problem. Each time a new character is read in, the string table has to be searched for the new string formed by string + character. If a match is not found, then a new string has to be added to the string table. This causes two problems. First, the string table can get very large very fast. If string lengths average even as low as three or four characters each, the overhead of storing a variable-length string and its code could easily reach seven or eight bytes per code.
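To get a feel for how the string table grows, the following Python sketch counts the entries an encoder adds for a given input and the total length of the strings it stores. The function name string_table_stats is hypothetical. The sketch places no upper limit on the table and ignores per-entry bookkeeping overhead, so the numbers are only a lower bound on real storage.

```python
def string_table_stats(data: bytes):
    """Return (entries_added, total_bytes_of_added_strings) for an unbounded table."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    entries_added = 0
    total_bytes = 0
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            # One table search per input character, as noted in the list above.
            current = candidate
        else:
            table[candidate] = len(table)
            entries_added += 1
            total_bytes += len(candidate)
            current = bytes([byte])
    return entries_added, total_bytes


print(string_table_stats(b"abababab"))  # (4, 11)
```

Here eight input bytes add four strings (“ab”, “ba”, “aba”, “abab”) totaling 11 bytes, before counting the codes themselves.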
