<<

Master of Science in Computer Science September 2020

Compression Algorithm in Mobile Packet Core

Lakshmi Nishita Poranki

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden. This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Author(s): Lakshmi Nishita Poranki E-mail: [email protected]

University advisor: Siamak Khatibi Department of Telecommunications

Industrial advisor: Nils Ljungberg Ericsson, Gothenburg

Faculty of Computing, Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden
Internet: www.bth.se, Phone: +46 455 38 50 00, Fax: +46 455 38 50 57

Abstract

Context: Data compression is a technique used for fast transmission of data and for reducing the storage size of the transmitted data. Data compression is a massive and ubiquitous technology, and almost every communication company makes use of it. Data compression is categorized mainly into lossy and lossless compression. Ericsson is a telecommunication company that deals with millions of users' data, and all these data are compressed using the deflate compression algorithm. Due to its compression ratio and speed, the deflate algorithm is not optimal for Ericsson's present use case (compress twice and decompress once). This research is about finding the best alternative algorithm which suits the current use case, so that it can replace the deflate algorithm.

Objectives: The objective of this research is to replace the deflate algorithm with an algorithm that compresses the Serving GPRS Support Node-Mobility Management Entity (SGSN-MME) user data effectively. The main objectives to achieve this goal are: to investigate alternative algorithms for the deflate algorithm and identify one that fits the SGSN-MME compression patterns, to experiment on an SGSN-MME dataset with all selected algorithms, to compare the experiment results based on the compression factors, and, based on the performance, to replace the deflate algorithm with the most suitable algorithm.

Methods: In this research, a literature review was performed to investigate alternative algorithms to the deflate algorithm. After selecting the algorithms, an experiment was conducted on data provided by Ericsson AB, Gothenburg, and the performance of each algorithm was evaluated based on compression factors such as compression ratio and compression speed.

Results: By analyzing the results of the experiment, Z-standard is the best-performing algorithm, with optimal compressed sizes, compression ratio, and compression speed.

Conclusions: This research concludes by identifying an alternative algorithm that can replace the deflate algorithm and that is suitable for the present use case.

Keywords: Compression Algorithms, Algorithm, SGSN-MME node, Compression factors, performance of compression algorithm.

Acknowledgments

Firstly, I would like to express my prodigious gratitude to my university supervisor Prof. Siamak Khatibi, Department of Telecommunications, for his worthwhile supervision, patience, suggestions and incredible guidance throughout the entire period of the thesis.

I would also like to express my warmest gratitude to my industrial supervisors Erik Vargas and Nils Ljungberg for their guidance, support and insightful comments throughout my journey at Ericsson AB, Gothenburg.

I would like to express my profound gratitude to my parents Siva Rama Prasad Poranki and Vijaya Lakshmi Poranki, my sister Snehitha Poranki and my colleague Lakshmi Venkata Sai Sri Tikkireddy for their persistent and unparalleled love and continuous support.

Last but not least, I would like to thank all of my friends who stood beside me during my good and bad times and helped me a lot to complete my thesis and made my thesis journey so successful.

Thank you very much, everyone!!

Contents

Abstract
Acknowledgments

1 Introduction
   1.1 Problem Statement
   1.2 Outline

2 Background
   2.1 SGSN
   2.2 MME
   2.3 Ericsson SGSN components
   2.4 Performance factors
   2.5 Types of Lossless Compression Algorithms
      2.5.1 Huffman Algorithm
      2.5.2 Arithmetic Coding
      2.5.3 Shannon-Fano
      2.5.4 Lempel-Ziv
      2.5.5 LZ77-LZ78
      2.5.6 LZ4
      2.5.7 Brotli
      2.5.8 Zlib
      2.5.9 Z-standard

3 Methodology
   3.1 Literature review
   3.2 Experiment
      3.2.1 Hypothesis
      3.2.2 Experiment workspace
      3.2.3 Dataset Creation
      3.2.4 Dataset Pre-Processing
      3.2.5 Experiment
      3.2.6 Statistical Tests

4 Results and Analysis
   4.1 Results for Z-standard
   4.2 Results for LZ4
   4.3 Results for Brotli
   4.4 Results for Zlib (deflate)
   4.5 Comparison of Compression Ratio
   4.6 Comparison of Compression speed
   4.7 Comparison of Space-saving
   4.8 Result Analysis
   4.9 Performing Statistical Tests

5 Discussion
   5.1 Answering Research Questions
   5.2 Validity Threats

6 Conclusion and Future Work

References

List of Figures

1.1 Lossy compression
1.2 Lossy vs Lossless compression
2.1 Ericsson SGSN components
2.2 Types of Compression
2.3 Flowchart of LZ4
3.1 Dataset Creation
4.1 Z-standard compression
4.2 LZ4 compression
4.3 Brotli compression
4.4 Zlib compression
4.5 Comparison of compression ratios
4.6 Comparison of compression speed
4.7 Comparison of space saving

List of Tables

2.1 The data format of the LZ4 sequence
4.1 Ranks for comparison of compression ratios of algorithms for each test case
4.2 Average ranks of algorithms

List of Abbreviations

ANS Asymmetric Numeral System

AP Application Procedures

ASCII American Standard Code for Information Interchange

BSD Berkeley Software Distribution

DP Device Procedures

ETS Erlang Term Storage

FPGA Field Programmable Gate Array

FSE Finite State Entropy

GGSN Gateway GPRS Support node

GIF Graphics Interchange Format

GNU GNU's Not Unix

GPRS General Packet Radio Service

GSM Global System for Mobile Communication (2G)

HTML Hyper Text Markup Language

HTTP Hyper Text Transfer Protocol

IP Internet Protocol

IT Information Technology

JPEG Joint Photographic Experts Group

LTE Long Term Evolution(4G)

LZW Lempel-Ziv-Welch

MME Mobility Management Entity

MP3 MPEG Audio layer 3

MPEG Moving Picture Experts Group

OTP Open Telecom Platform

PIUs Plug-In Units

SGSN Serving GPRS Support Node

SLR Systematic Literature Review

tANS tabled variant of ANS

TIFF Tagged Image File Format

UE User Equipment

WCDMA Wideband Code Division Multiple Access(3G)

WEM Workspace Environment Management

Zstd Z standard

Chapter 1 Introduction

Digital communication is a field that deals with the transmission and reception of digital data. Here, the digital data to be transmitted may be in different formats such as text, audio, video, and images. Before transmission, the complete data needs to be converted into a binary format, i.e. 0s and 1s. Usually the resulting bitstream is long and may in most cases reach millions of data bits, so it is clear that large files can take minutes to transmit [33].

If a massive amount of data is transmitted, there is a chance of delay in reaching the destination; the data should therefore be compressed to avoid such situations [44]. An incomparably vast amount of data is received, processed and transmitted, which affects the achievable data transmission speed and also leads to a shortage of storage [57][66].

For a long time, compression was a domain occupied by a small group of engineers and scientists. Now, data compression is a ubiquitous, massive technology, and its applications grow day by day. The data compression field is an exciting and vital one [43].

Data compression is the art of reducing the data size by eliminating redundant characters or by re-encoding the bits in the data [63]. Data compression is used to represent information in a compact form [63]. Without compression, mobile phones might not provide clarity in communication, and the advent of digital television would not have been possible. Listening to music, watching videos, browsing the internet and many more activities are possible only because the data is compressed [57].

Advantages of compression [1]:

• Storage: Compressed data helps to store more files in the given storage space.

• Bandwidth and file transfer: Compared to an uncompressed file, a compressed file utilizes less bandwidth to download from the internet, which also speeds up the transfer.

• Cost: A compressed file utilizes less storage space, so there is no need to buy extra storage.


There are two types of data compression techniques: lossy compression and lossless compression.

• Lossy Compression: Lossy data compression is used for storage handling and for transmitting data content by reducing the size of the data [64]. This kind of technique is used by MP3, MPEG and JPEG, which are audio, video and image data formats [63]. Figure 1.1 shows how lossy compression works: a high-quality image loses some of its quality after compression. The first image in Figure 1.1 is the high-quality original picture, which is not compressed; the second image is the compressed picture, where it can be seen that the image is blurred compared to the first image.

Figure 1.1: Lossy compression

Lossy data compression involves a loss of data: the original file cannot be retrieved or recovered precisely after decompression [56]. Higher compression is possible using lossy compression techniques than with lossless compression techniques [57], so lossy compression achieves better compression effectiveness. However, lossy compression is limited to audio, video and images, where some loss of data is not a problem.

• Lossless Compression: Lossless data compression compresses the original data without any permanent loss of information. Lossless compression is also known as reversible compression because the original data can be reconstructed exactly after decompression [63]. Generally, digital data is transmitted using lossless compression. It is used in fields such as medical data, legal data, and computer-executable files [58].

There are many data compression techniques available; a few lossless compression techniques are the Huffman algorithm, the Lempel-Ziv algorithm, the arithmetic coding algorithm, the Shannon-Fano algorithm, and the LZ77 and LZ78 algorithms. Figure 1.2 illustrates that in lossless compression the original data can be retrieved after compression, whereas in lossy compression some data is lost and cannot be recovered.

Figure 1.2: Lossy vs Lossless compression

GPRS: GPRS is the advancement of the existing 2G network infrastructure; it provides the wireless packet data service. It allows subscribers to make use of a wide range of applications such as email and the internet.

SGSN: SGSN is one of the important components in the GPRS architecture, responsible for handling the packet-switched data within the network.

MME: MME is the main control node which is responsible for handling a few million users' data. MME is responsible for paging and tagging of the mobility procedures for all the users within the network.

1.1 Problem Statement:

Ericsson uses a node which integrates the functionalities of both SGSN and MME, and it is the world's most widely deployed SGSN/MME [14]. This node serves traffic from all generations of networks via the same node. The main target for Ericsson in the SGSN-MME field is to guarantee 99.9999% up-time. Currently, the Ericsson SGSN-MME compresses every User Equipment (UE) context independently using Zlib, which uses the deflate algorithm for compression. Due to its compression ratio and compression speed, the deflate algorithm is not optimal for the SGSN-MME's use case (each UE context is compressed twice and decompressed once). Hence the deflate compression algorithm needs to be replaced with an algorithm which is a good fit for the use case.

The main aim of this research is to replace the deflate compression algorithm with a compression algorithm that suits the present use case. In this research, the formulated research questions are as follows:

RQ1: Which algorithm other than the deflate algorithm is suitable for the present use case?

Motivation: The main reason for formulating this research question is that there are many compression algorithm techniques available, so there is a need to compare them and select the algorithm that suits the present use case.

RQ2: Which algorithm works more effectively when compared with the existing algorithm (deflate) for the use case?

Motivation: The main reason for selecting this research question is to compare and analyze the results of the candidate compression algorithms against the deflate compression algorithm, which needs to be replaced, based on factors like compression speed, compression ratio, and space saving. The compression factors are explained in detail in Section 2.4.

1.2 Outline:

In this section, the research work is divided into different chapters and a brief discussion of each is provided.

Chapter 1: Chapter 1 discusses the Introduction, the problem statement and also the research questions with motivation towards this research.

Chapter 2: This chapter discusses the related work and the different types of compression algorithms used in this research.

Chapter 3: This chapter explains the methodologies used for this research. It also explains the dataset creation, the data pre-processing and the statistical tests.

Chapter 4: This chapter shows the results and the analysis of the result from the research work.

Chapter 5: A brief discussion for answering the research questions and validity threats.

Chapter 6: Provides the conclusion and future work of this research.

Chapter 2 Background

In recent years, due to the fast development of communication networks and content services, there has been a rapid increase in the number of packet-switched service users. Nowadays, the internet has become a part of our daily life, and millions of people use the internet from their mobile phones or computers. This is all possible only because of the 2G, 3G and 4G network technologies. When the internet is accessed over a 2G or 3G network, all the related traffic passes through the General Packet Radio Service (GPRS) network. GPRS is the advancement of the existing Global System for Mobile Communication (GSM) network infrastructure, and it provides wireless packet data services. GPRS is wholly based on the Internet Protocol (IP), which allows subscribers to make use of a wide range of applications such as email, the internet, and other internet resources [11].

2.1 SGSN:

In this GPRS network, the Serving GPRS Support Node (SGSN) is the main component that handles all the packet-switched data within the network. At present, many mobile users access the core network through the SGSN, so a large amount of traffic is handled there [65]. Some of the critical functionalities of the GPRS Support Node (GSN) are mobility management, mobile station authentication, data encryption, data compression, and radio resource management. The SGSN also handles packet routing and transfer, address translation, logical link management, packet segmentation, re-assembly, and tunnelling [52].

2.2 MME:

For the Long Term Evolution (LTE) access network, the main control node is the MME. It is mainly responsible for paging and tagging mobility procedures, including re-transmission for idle-mode UEs. The MME is also responsible for handling and tracking a few million UEs at a time. It also enables more control and adds intelligence to the system. The MME supports the control-plane function for mobility between 2G/3G and LTE access networks [26]. The combination of SGSN and MME helps in handling a considerable number of subscribers, which allows meeting stringent up-time requirements [14]. It supports advanced mobile networks, which provide enhanced functionality for GSM, Wideband Code Division Multiple Access


(WCDMA), and LTE Access.

2.3 Ericsson SGSN components:

The SGSN component contains many Plug-In Units (PIUs). Among these Plug-In Units, the Application Procedures (APs) and Device Procedures (DPs) are responsible for sending and receiving the traffic through routers. The application procedures handle the signalling, which tracks all the attached devices and the cell towers to which those devices are connected. The device procedures are responsible for managing the traffic generated by the users while surfing the internet on their mobile phones. The user payload is redirected towards the GGSN, and it can be monitored to charge the user correctly. Presently, Ericsson has developed the SGSN node, which is widely in use. After the advent of LTE, Ericsson combined both SGSN and MME functionalities into the same node. This node is responsible for serving the traffic of all three network generations via the same node [14].

Figure 2.1: Ericsson SGSN components

Figure 2.1 is an image of Ericsson's SGSN-MME component; it contains a huge number of plug-in units and routers which are mainly responsible for sending and receiving the signals.

2.4 Performance factors

In general, it is difficult to measure the performance of a compression algorithm because the compression depends on the redundancy of the data in the input file. The performance of a compression algorithm depends entirely on the type and structure of the input file, and the criteria for measuring performance depend on the nature of the application. The main concerns when measuring the performance of a compression algorithm are space and time efficiency. The compression behaviour also depends on the type of compression algorithm used: with a lossy compression algorithm, the efficiency in compressing the input file is higher than with a lossless compression algorithm, and it is difficult to measure the general performance of a lossless compression algorithm. The following measurements are used to evaluate the performance of the compression algorithms.

• Compression ratio: It is the ratio of the size of the uncompressed file to the size of the compressed file. It helps to calculate to what extent the file has been compressed.

\[
\text{Compression ratio} = \frac{\text{Uncompressed size}}{\text{Compressed size}} \tag{2.1}
\]

• Compression factor: The inverse of the compression ratio is the compression factor.

• Compression time: It is the time taken by a compression algorithm to compress the file.

• Compression speed: It is the ratio of the size of the uncompressed file to the time taken to compress the file.

\[
\text{Compression speed} = \frac{\text{Uncompressed size}}{\text{Time taken to compress}} \tag{2.2}
\]

• Space saving: It helps to calculate by how much the file has been shrunk.

\[
\text{Space saving} = 1 - \frac{\text{Compressed size}}{\text{Uncompressed size}} \tag{2.3}
\]

To evaluate the performance of compression algorithms there are also other methods, such as computational complexity and probability distribution. The measurements above (2.1, 2.2, 2.3) evaluate the performance using the file sizes and the compression time.
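As a minimal illustration of these measurements, the sketch below computes the compression ratio, compression speed and space saving of Equations 2.1-2.3 in Python (the language used for the experiments in Chapter 3). The file sizes and timing are hypothetical placeholder values, not measured results.

```python
# Hypothetical sizes (bytes) and compression time (seconds); replace with measured values.
uncompressed_size = 1_000_000
compressed_size = 400_000
compression_time = 0.005

compression_ratio = uncompressed_size / compressed_size      # Equation 2.1
compression_factor = 1 / compression_ratio                   # inverse of the ratio
compression_speed = uncompressed_size / compression_time     # Equation 2.2, bytes per second
space_saving = 1 - compressed_size / uncompressed_size       # Equation 2.3

print(f"ratio={compression_ratio:.2f}  "
      f"speed={compression_speed / 1e6:.1f} MB/s  "
      f"space saving={space_saving:.1%}")
```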

2.5 Types of Lossless Compression Algorithms:

Lossless data compression algorithms are generally categorized into entropy-based and dictionary-based compression algorithms. Most of the compression algorithms that exist in practice are a combination of both entropy-based and dictionary-based algorithms.

Figure 2.2: Types of Compression

Figure 2.2 describes the different types of compression algorithm techniques. As discussed earlier, there are two types of compression techniques: lossy and lossless. A brief explanation of the different lossless compression techniques is provided below.

2.5.1 Huffman Algorithm

David Huffman developed this technique in 1951 [63]. Huffman coding is based on the frequency of occurrence of data items (pixels in images). Fewer bits are used to encode frequently occurring data, which is the basic principle of Huffman coding [46]. Huffman coding is an algorithm used for lossless data compression in the fields of computer science and information theory [46].

Huffman coding uses variable-length code tables that are derived from the estimated probability of occurrence of each possible value of the source symbol [56]. The technique is based on the American Standard Code for Information Interchange (ASCII) characters [64]. The main advantage of Huffman coding is that it is easy to implement and produces lossless compression, for example for JPEG image files. A disadvantage is that the results vary with different formats, although some of them give good results, i.e. 8:1 compression. Another disadvantage is that the codes of the encoded data are of different sizes (they are not of fixed length) [64].
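To make the principle concrete, the following is a minimal Python sketch (not the thesis implementation) that builds a Huffman code table with a heap: the two least frequent subtrees are merged repeatedly, so frequent symbols end up with the shortest codes. The input string is a hypothetical example.

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict:
    """Return {symbol: bit string}; more frequent symbols get shorter codes."""
    freq = Counter(data)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_code(b"abracadabra"))          # 'a' (the most frequent symbol) gets the shortest code
```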

2.5.2 Arithmetic Coding

The basic concept of arithmetic compression was developed by Elias in the early 1960s and further developed by Pasco [55], Rissanen [61][60] and Langdon [48]. Arithmetic coding is a form of entropy encoding used for lossless data compression. In this technique, code words are not used to represent the symbols of the text; instead, a fraction is used to represent the entire message [2]. The probability of occurrence and the cumulative probability of the set of symbols in the source message are considered [36]. Arithmetic compression is one of the most powerful compression techniques: it converts the whole input data into a single floating-point number [56].

Arithmetic coding is mostly useful for data such as small alphabets with highly skewed probabilities [51], and it is mainly used for the most frequently occurring sequences of pixels. The main advantage of arithmetic coding is that it can reduce the file size dramatically. A significant disadvantage is that the algorithm has been protected by patents [64].

2.5.3 Shannon-Fano

It is the compression technique named after Claude Shannon and Robert Fano [25]. This technique is almost the same as Huffman coding; the only difference is the code word creation [36]. It is also called an information-theoretic algorithm.

2.5.4 Lempel-Ziv

Lempel-Ziv-Welch (LZW) is a universal lossless compression technique created by Abraham Lempel, Jacob Ziv and Terry Welch in 1984 [21]. LZW is a universal algorithm for sequential data compression, and its performance has been investigated based on a non-probabilistic model of constrained sources [69]. Lempel-Ziv has different implementations which make up its family: LZ77, LZ78, LZW, LZ4, etc. all come under the LZ family.

This technique is a dictionary-based compression algorithm which reads character substrings [63]. It replaces strings of characters with single codes, adding new strings of characters to a string table without any analysis of the incoming text data [64]. It can compress data in the form of text, executable code and similar data files to about one-half of their original size. This compression technique is used in Tagged Image File Format (TIFF) and Graphics Interchange Format (GIF) files. The main advantages of the LZW algorithm are that it is easy to implement and, being a dictionary-based technique, it compresses very fast [5].
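The dictionary-building idea can be illustrated with a minimal LZW compressor in Python (an illustrative sketch, not the thesis implementation). The input string is hypothetical, and a real codec would also pack the integer codes into bits and provide the matching decoder.

```python
def lzw_compress(data: bytes) -> list:
    """Dictionary-based LZW compression: returns a list of integer codes."""
    dictionary = {bytes([i]): i for i in range(256)}   # all single-byte strings
    next_code = 256
    w = b""
    codes = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                           # keep extending the current match
        else:
            codes.append(dictionary[w])      # emit the code for the longest known prefix
            dictionary[wc] = next_code       # add the new string to the dictionary
            next_code += 1
            w = bytes([byte])
    if w:
        codes.append(dictionary[w])
    return codes

print(lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT"))       # repeated substrings collapse to single codes
```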

2.5.5 LZ77-LZ78

These are two lossless algorithms published by Abraham Lempel and Jacob Ziv in 1977 and 1978 [23]. They are also known as the LZ1 and LZ2 algorithms, respectively.

LZ77 was designed based on the fact that words and phrases are likely to repeat in a text file. When a repetition occurs, it can be encoded as a pointer to the earlier occurrence, accompanied by the number of matching characters. The algorithm is straightforward to adapt, does not require any prior knowledge of the source, and does not require assumptions about the characteristics of the source [62].

In 1978, Jacob Ziv and Abraham Lempel presented their dictionary-based model [70]. The LZ78 algorithm can identify patterns and hold them indefinitely. It is a dictionary-based algorithm which maintains the dictionary explicitly. The dictionary has to be built both at the encoding side and at the decoding side, and both need to follow the same rules to ensure they are using the same dictionary [70].

2.5.6 LZ4

LZ4 is a lossless data compression algorithm proposed by Collet in 2011 [68]. This algorithm is a variant of the LZ77 algorithm [50]. Among the LZ algorithms, LZ4 is known to have the fastest compression speed: it provides a compression speed greater than 500 MB/s per core, scalable with multi-core CPUs. The algorithm also features an extremely fast decoder, with speed in multiple GB/s per core, typically reaching RAM speed limits on multi-core systems [22]. The reference code of LZ4 was created in the C programming language, and this code has been ported to many different programming languages. To obtain maximum performance, the code of LZ4 contains many optimizations for different processor architectures [32]. Examples of where the LZ4 algorithm is used are:

• Fast (de)compression of the GNU/Linux kernel [49].

• (De)compression of Field Programmable Gate Array (FPGA) streams [49].

• IP packet compression [54].

In the LZ4 algorithm, decompression is faster and simpler than compression; the decompression process is entirely based on copying literals from the already decoded part.

Due to the vast expansion of digital communication, data storage needs keep increasing; because of this, hardware designs of compression algorithms are receiving significant attention. The LZ4 library is provided as open-source software under the Berkeley Software Distribution (BSD) 2-clause license. The design of the LZ4 algorithm is also compatible with and optimized for the x32 mode, which provides additional speed performance [22].

Initially, the LZ4 algorithm was defined in the form of a compressed data format. This compressed data consists of LZ4 sequences, each comprising a token, a literal length, the literals, an offset and a match length [38]. Figure 2.3 shows the flow of the LZ4 algorithm.

• From Table 2.1, the token represents the length of the matched and unmatched data.

Table 2.1: The data format of the LZ4 sequence

Token | Literal length | Literals  | Offset  | Match length
1     | 0-n            | 0-2 Bytes | 2 Bytes | 0-n Bytes

• The literal length represents the length of the uncompressed data; its value is equal to the length of the uncompressed data minus 15.

• The uncompressed data is stored as the literals in the LZ4 sequence; it is copied from the original data.

• When repeated data is found while searching the input file, that particular data gets compressed.

• The offset value represents the address of the current data minus the address of the primary data.

• Matching length is the length of the matching data.

The LZ4 algorithm operations are mainly divided into five steps.

1. Hash computation

2. Matching

3. Backward matching

4. Parameter calculation

5. Data output
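The experiment in Chapter 3 accesses LZ4 from Python through the LZ4 bindings; a minimal round-trip sketch of that usage is shown below. The file name is a hypothetical placeholder for one of the pre-processed segment files.

```python
import lz4.frame  # Python bindings for LZ4 ("pip install lz4")

with open("seg1.mbin", "rb") as f:         # hypothetical pre-processed segment file
    data = f.read()

compressed = lz4.frame.compress(data)      # default fast mode
restored = lz4.frame.decompress(compressed)
assert restored == data                    # lossless round trip

print(f"{len(data)} -> {len(compressed)} bytes")
```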

2.5.7 Brotli

Brotli is a lossless compression technique developed by the Google employees Jyrki Alakuijala and Zoltan Szabadka to reduce the size of web font transmissions [16]. It was initially launched in 2013 for the offline compression of web fonts [3], and in 2015 the Brotli specification was generalized for HTTP stream compression [3]. It is open-source software that is supported by most web browsers, such as Google Chrome [35] and Mozilla Firefox.

Internet data is commonly compressed using Brotli. It optimizes the resources used at decoding time, resulting in maximum compression density, and Brotli has a fast decompression speed [31].

The Brotli encoder has compression levels ranging from 0 to 11 [30]:

• Level-0 is for the fast compression with the low compression ratio.

• Level-11 is for the slowest compression with a high compression ratio.


Figure 2.3: Flowchart of LZ4

• In Brotli, the performance of compression varies from level to level. The speed of compression slows down from Level-1 to Level-10.

The Brotli algorithm makes use of a dynamically populated (sliding window) dictionary and a predefined dictionary of 120 KB. This predefined dictionary contains approximately 13,000 common words, phrases and sub-strings derived from a large corpus of text data and HTML documents. The sliding window is limited to 16 MB, which enables decoding on mobile phones with limited resources. The predefined dictionary helps to increase the compression level if the file contains many of the repeated words.

A Brotli compressed file contains a set of meta-blocks; each meta-block can hold up to 16 MB and is divided into two parts. The first part is the data part, which stores the LZ77 input, and the second part is the header part, which holds the data required to decode the data part. Brotli is widely used for very high compression where a static file is compressed once and served many times [30].
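As with LZ4, the experiment accesses Brotli from Python through the Brotli bindings; the short sketch below illustrates that usage, with a hypothetical segment file name and an arbitrarily chosen quality level.

```python
import brotli  # Python bindings for Brotli ("pip install brotli")

with open("seg1.mbin", "rb") as f:          # hypothetical pre-processed segment file
    data = f.read()

# quality ranges from 0 (fastest) to 11 (densest); 11 is the library default
compressed = brotli.compress(data, quality=5)
assert brotli.decompress(compressed) == data

print(f"{len(data)} -> {len(compressed)} bytes")
```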

2.5.8 Zlib

Zlib is one of the essential lossless compression libraries, written by Jean-loup Gailly and Mark Adler. The first version of Zlib was released in May 1995 [27]. It is developed in the C language. The Zlib implementation is efficient and pragmatic; hence Zlib is a crucial component in many software applications and libraries.

Generally, Zlib has ten compression levels (0-9). For each compression level, the compression performance varies in terms of compression ratio and compression speed [10].

• Level-0 indicates that there is no compression so that the output will be the original data.

• Level-1 indicates the fastest compression with a low compression ratio.

• Level-9 indicates the high compression ratio with the slowest compression speed.

• In Zlib, the performance of compression varies from level to level. The speed of compression slows down from Level-2 to Level-8, and the compression ratio increases from Level-2 to Level-8.

By default, Zlib uses compression level 6 [10].
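Zlib ships with the Python standard library, so the deflate-based compression currently used on the node can be reproduced for comparison with a few lines. The sketch below is illustrative; the segment file name is a hypothetical placeholder.

```python
import zlib  # deflate-based compression from the Python standard library

with open("seg1.mbin", "rb") as f:          # hypothetical pre-processed segment file
    data = f.read()

fast = zlib.compress(data, 1)       # level 1: fastest, lowest ratio
default = zlib.compress(data, 6)    # level 6: the default trade-off
best = zlib.compress(data, 9)       # level 9: highest ratio, slowest

assert zlib.decompress(best) == data
print(len(fast), len(default), len(best))
```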

Zlib uses the deflate compression algorithm, which was initially defined by Phil Katz [18]. Deflate is a combination of the LZ77 and Huffman encoding techniques [28]; the basic building blocks of the deflate algorithm are an LZ77 compressor and a standard Huffman encoding package [45].

Deflate compression is a two-step process. The first step involves matching strings and replacing repeated strings with pointers; this is achieved using the LZ77 algorithm. It makes use of a sliding dictionary which contains a list of recently used strings. The encoder goes through the input, and when it encounters a new string (one which does not exist in the dictionary), that string is emitted as a literal sequence of bytes. On the other hand, when the encoder finds a match with a string in the dictionary, that string is replaced with a pointer to the matching string in the dictionary together with the length of the match [59].

The second step is achieved using Huffman coding. In this step, all the literals and lengths from the output of the first step are gathered into a single alphabet and encoded, with the most frequently occurring symbols replaced by the fewest bits and the least frequently occurring symbols by the most bits. In the encoding phase, two types of Huffman coding can be used: static Huffman and dynamic Huffman. If no match is found in the first step, the output of LZ77 is a null pointer and a code is produced only for the literal [59].

While decoding, the decoder first figures out whether a code is for a literal or for a length. If the code is for a literal, the literal is decoded; otherwise, the decoder moves to the next code to retrieve the match distance, and the matched sequence is then output. The Huffman codes are prefix codes, which help the decoder to know where each code ends even though the codes have a variable number of bits [59].

2.5.9 Z-standard

The Z-standard algorithm is one of the latest lossless data compression algorithms, designed by Yann Collet at Facebook. It was released on 31 August 2016 as free, open-source software [29]. The implementation of Z-standard is written in the C language.

Z-standard is a combination of entropy coders, Huffman coding and Finite State Entropy (FSE), which helps modern CPUs to give better performance. It combines dictionary matching with a broad search window from LZ77 with entropy coding from FSE and Huffman. Generally, entropy encoding is the last stage of a compression algorithm; its main purpose is to reduce the set of symbols/flags to the optimal space based on the given probabilities [34]. The entropy stage in Z-standard is provided by FSE [29].

FSE is an entropy coding technique provided in a BSD-licensed package, and its implementation performs well even on modern top-end CPUs [39]. It is used as a tabled variant of the Asymmetric Numeral System (tANS). ANS is an entropy encoding process that combines the compression ratio of arithmetic coding with a processing cost closer to that of Huffman coding; it encodes the complete information into a single natural number x. To operate on large alphabets, the tANS variant stores the entire behaviour of this natural number x in a table, which yields a finite state machine [15].

Z-standard has the compression levels ranging from (1-22)

• Level-1 is the fastest compression with a low compression ratio.

• Level-22 is the slowest compression with the best compression ratio.

• In Z-standard, the performance of compression varies from level to level. The speed of compression slows down and the compression ratio increases from Level-2 to Level-21.

There is no inherent limit for Z-standard; it can handle and address terabytes of memory [29]. Z-standard was mainly developed to provide a better compression ratio compared with the deflate (Zlib) algorithm, and it mostly targets real-time compression scenarios [29].
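The experiment accesses Z-standard from Python through the zstandard bindings; a minimal round-trip sketch of that usage follows, again with a hypothetical segment file name and an arbitrarily chosen level.

```python
import zstandard  # Python bindings for Z-standard ("pip install zstandard")

with open("seg1.mbin", "rb") as f:            # hypothetical pre-processed segment file
    data = f.read()

cctx = zstandard.ZstdCompressor(level=3)      # levels range from 1 (fastest) to 22 (densest)
compressed = cctx.compress(data)

dctx = zstandard.ZstdDecompressor()
assert dctx.decompress(compressed) == data    # lossless round trip

print(f"{len(data)} -> {len(compressed)} bytes")
```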

From the literature review, the research question RQ1, to find suitable algorithms other than deflate for the present use case, can be answered. From the background on Lempel-Ziv, LZ4 is well known for its very fast compression speed; therefore LZ4 was selected. For high compression, Brotli and Z-standard at their higher settings are suitable; therefore Z-standard and Brotli were selected. The main motivation for selecting LZ4, Brotli and Z-standard is that all of these algorithms are meant for general-purpose compression and can be used on a variety of data types, thereby answering research question RQ1.

Chapter 3 Methodology

Generally, the research paradigm is classified into two types:

1. Qualitative Research

2. Quantitative Research

The main aim of qualitative research is discovering the causes noticed by the subjects. The main aim of quantitative research is finding and analyzing a cause-effect relationship or comparing two or more groups [68].

For this research, quantitative research is preferred because, in this study, different algorithms have to be compared to answer the formulated research questions. There are four methods to perform quantitative analysis [68]:

1. Survey

2. Literature Review

3. Experiment

4. Case Study

Selected research methods: Literature Review, Experiment.

The motivation for selecting Literature Review: To perform any research, a clear idea of the topic is mandatory, and the related work on that topic should be known. A literature review helps to gain prior knowledge of the concepts and to get a clear picture of the background, which allows the research to be conducted efficiently; therefore the literature review was chosen as part of this research. The literature review has to be performed on the topic, gathering the appropriate information and topic-related literature and explaining it in depth [71]. For this study, the literature review was conducted to gain adequate knowledge about the SGSN node, the MME node, compression techniques, lossless compression algorithms and more.

The motivation for selecting Experiment: The research method experiment examines the required variables and the results obtained under different environmental conditions. Experimentation is the method used for evaluating system models and for running simulations to

notice whether the model is affected by various variables [34]. In this research, the performance of the different compression algorithms is compared by experimentation, and the results are drawn based on compression factors such as compression speed and compression ratio. Therefore experimentation was selected for this research.

Excluded Research method: Survey, Case Study.

The motivation for excluding Survey: The primary purpose of conducting a survey is to get individual opinions from a specific group of people [68]. The survey is a descriptive process and is not a good option for this research, because opinions would not help to measure the performance of the algorithms. Hence no surveys are performed in this research.

The motivation for excluding Case Study: For performing an exploratory study, which is open-ended, a case study can be a better option. However, a case study is descriptive and takes a long time to complete [68]. As a case study is not used for comparing different kinds of methods, it is excluded from this research.

3.1 Literature review

The literature review is a good way to collect all the necessary information related to a particular concept from various sources [37]. The literature review (also known as a systematic review) is a rigorous strategy for identifying, evaluating and interpreting all available research related to the research questions or phenomenon of interest [47]. It is essential to conduct a literature review for any academic project. To gain complete knowledge of the various compression algorithms and of the SGSN-MME node, the literature review was adopted as one of the research methods. In this research, the literature review either gives final results or fills the research gap of this study; a full Systematic Literature Review (SLR) was not performed.

RQ1 can also be answered by performing a literature review. After obtaining the required data from the literature review, data analysis was performed [71]. The results obtained from the data analysis were documented, and those results help in performing the experimentation.

In this research, the literature review was conducted by following the steps below:

1. Identification of keywords.

2. Define research space to get data.

3. Set the criteria to Include/Exclude papers.

4. Literature extraction using the criteria.

5. Filtering of the papers by studying the title, abstract, etc.

6. Documented the results.

1. Identification of keywords: The first step of the literature review is to identify the keywords. For this research, the specified keywords are Compression Algorithms, Lossless Compression Algorithm, SGSN node, MME node, and Compression factors.

2. Define research space to get the data: In this step, the literature search is performed using the identified keywords. The databases used in this study to search the literature are IEEE and SCOPUS.

3. Set the criteria to Include/Exclude papers:

Inclusion Criteria:

(a) Papers which are available in full text (b) Papers in English (c) Papers which are related to the lossless compression algorithm

Exclusion criteria:

(a) Papers those are not available in full text (b) Papers that are not related to this research topic. (c) Articles that are not in English. (d) The articles which are not related to computer science, electronics.

4. Literature extraction using the criteria: The literature was extracted based on the criteria. All of the articles, journals and conference papers were screened against the criteria, which resulted in a few papers.

5. Filtering of the papers by studying the title, abstract, etc.: After obtaining the papers that passed the criteria, the papers not related to this study were excluded by studying the title, abstract and conclusion, and a few papers were filtered out for further study.

6. Documenting the results: By reading the selected papers, the useful information relating to this study was gathered and documented. The collected data was used to choose the algorithm that suits the present use case, which helps to answer RQ1.

From the literature review, the research question RQ1, to find suitable algorithms other than deflate for the present use case, can be answered. From the background on Lempel-Ziv, LZ4 is well known for its very fast compression speed; therefore LZ4 was selected. For high compression, Brotli and Z-standard at their higher settings are suitable; therefore Z-standard and Brotli were selected. The main motivation for selecting LZ4, Brotli and Z-standard is that all of these algorithms are meant for general-purpose compression and can be used on a variety of data types, thereby answering research question RQ1.

The following sections detail the methods used for the research and how the dataset creation and the experimentation with the selected compression algorithms were done.

3.2 Experiment

For experimenting, the dependent and independent variables need to be identified [9] which are as follows:

Independent variables: The different algorithms compared with the deflate algorithm are the independent variables; they are not affected by any other variable.

Dependent variables: The performance of an algorithm, measured using factors like compression ratio, compression time, compression speed and space saving, constitutes the dependent variables of this research.

For an in-depth explanation of the experiment, a hypothesis needs to be constructed [68], which also helps to answer the research questions.

3.2.1 Hypothesis

The main reason for this research is that the deflate algorithm is not optimal and does not suit the present use case, so the main aim is to replace the deflate algorithm with another, better-suited algorithm. From the literature review, it is clear that algorithms like Brotli, LZ4 and Z-standard are suitable for the present use case, so the performance of these algorithms needs to be compared with the deflate implementation.

• Null Hypothesis (H0): Performances of all the four compression algorithms (LZ4, Brotli, Z-standard, and Zlib) are equal.

• Alternate Hypothesis (H1): Performances of all the four compression algorithms (LZ4, Brotli, Z-standard, and Zlib) are different.

3.2.2 Experiment workspace

Before performing any experimentation, the environment has to be set up. The software environment used for this experiment is Linux in the Citrix environment.

Ericsson uses the Citrix environment as their work environment.

Citrix: The Citrix workspace environment platform enables IT administrators to manage all the applications, desktops and data from a single pane [17]. It provides various access controls to build a secure digital perimeter around the user when enterprise content is accessed from any device, hosted on any cloud or from any network, thereby ensuring high security for the administrators. For the IT sector, Citrix provides simple device onboarding, pre-built integrations with the most popular SaaS applications, and centralized security controls, delivering what employees want without any compromise in security level [13].

The Citrix workspace is one of the most complete workspaces; it unifies the following components in a single pane [4]:

• Access Control: SaaS applications, secure browser, web filtering.

• Endpoint Management: Physical Endpoint Management, Workspace Environment Management (WEM), Mobile Applications.

• Content Collaboration: Access to the files and workflow.

• Analysis: Citrix Analytics for access controls, EMU and content collaboration.

3.2.3 Dataset Creation

After the environment setup is done, the data required for the experiments needs to be collected. The main goal of the research is to find a suitable algorithm that performs better than the deflate algorithm.

The data used in this experiment is Ericsson's SGSN-MME user data. The SGSN-MME is responsible for functionalities such as authentication, communication, mobility from one place to another, management functions, and session set-up for the subscribers. It collects all the data from the subscribers/users [14], and this whole data is transmitted to and from the subscribers via the SGSN-MME. The SGSN-MME is responsible for handling millions of users' data [14]. Every single function of the user's mobile is recorded and stored. This data, the transmitted UE context, is stored in the form of ETS (Erlang Term Storage) tables in the Ericsson database.

Erlang is a functional and concurrent programming language which is regularly advertised as supporting "shared-nothing concurrency" [7]. In Erlang, the processes are lightweight and, instead of being mapped to threads, are implemented by the virtual machine; they do not share memory by default.

ETS is the library that provides a key-value store in memory; it is implemented in C for efficiency. ETS is heavily used in many Erlang applications and is a critical component of their implementation. In ETS, the data is organized as a set of dynamic tables [42]. An individual process creates and owns each table, and every table has access rights set at creation time; the table is destroyed automatically when the owning process terminates. A new ETS table can be created by calling the function ets:new(Name, Options) [42]. Here, the first argument of the function is the table name and the second argument is a list of options. The function returns a table identifier which can be used to access the table.

New entries can be inserted into the table by calling the function ets:insert(Tab, Data) [52]. The arguments of the function are the table identifier and the data, which is a tuple expression. Ericsson's Erlang is maintained and supported by the Open Telecom Platform (OTP) product unit [24].

The data can be retrieved from the ETS tables in many ways; the most common are ets:match(Table, Pattern) and ets:select(Table, MatchSpec) [6]. All of this is done using the Erlang programming language in the Erlang shell in the created GSN Workspace (GSNWS) in the Citrix workspace. The data is extracted from these ETS tables and stored in a segment file. All the data is stored in the form of tables, where each row is a record that represents a slice of the UE context. This data was taken as the sample dataset for this research. The data stored in the ETS tables is completely encrypted data containing mixed data formats, including special characters.

3.2.4 Dataset Pre-Processing

Completion of the dataset creation does not mean that everything is ready for experimentation. Since the raw data is collected from the ETS tables, it has to be transformed into a suitable format; data pre-processing must therefore be done before the algorithms can be run on the data. Data pre-processing is the technique that helps to transform raw data into an understandable format.

The data stored in the form of ETS tables is retrieved into segment files. The data is stored in 10 segment files (SegmentX), and each segment file consists of 512 tables (SegmentX_0 to SegmentX_511) of different sizes. For accuracy of the results, these ten segments were treated as 20 data sets. To use this data with the compression algorithms, the data of each segment has to be processed into a single table. All the tables are processed and converted into a single table using Erlang, and each converted table is stored in a (.mbin) file; thus the 512 tables of a segment are converted into a single table and saved as seg1.mbin, seg2.mbin, seg3.mbin, ..., seg10.mbin. Each table consists of records, where each record represents a slice of the UE context.

Figure 3.1 illustrates the architecture of the MME, where the MME consists of a number of e-nodes. Each e-node has its own database to store the user data. In this database, the user data is stored in the form of ETS tables known as segments. The data is extracted from these ETS tables and stored in a segment file. All the data is stored in the form of tables, where each row is a record that represents a slice of the UE context.

Figure 3.1: Dataset Creation

The data set is now ready for the experiment with the compression algorithms. Here, all the data gets compressed using the chosen algorithms (LZ4, Brotli, and Zstd).

3.2.5 Experiment

• The experiment design type: randomized complete block design, which belongs to one factor with more than two treatments.

• Subjects in the experiment: Segment Files.

• Factors in the experiment: Performance of the algorithm.

• Treatments in the experiment: Z-standard, Brotli, LZ4 algorithms.

After the pre-processing, the data is ready to be processed with the algorithms. Here the experiment is done using the Z-standard, Brotli and LZ4 algorithms, the results are compared, and the outcome is finalized based on the analysis.

Brotli version 1.0.7 is accessed from Python using the Brotli bindings [8].

The LZ4 algorithm version 1.9.0 is accessed from Python using the LZ4 compression library [67].

Z-standard version 1.3.4.4 is used for the experiment from Python using the Z-standard bindings [9]. The experimentation is done on Ericsson's SGSN-MME data using these algorithms, and all the results are compared and analyzed based on compression factors like compressed size, compression time and space saving.
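A condensed sketch of how such a comparison can be scripted in Python is shown below. The segment file name is a hypothetical placeholder, the chosen levels are illustrative rather than the exact thesis settings, and the metrics follow Equations 2.1-2.3.

```python
import time, zlib, brotli, lz4.frame, zstandard

def measure(name, compress, data):
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)          # Equation 2.1
    speed = len(data) / elapsed                  # Equation 2.2
    saving = 1 - len(compressed) / len(data)     # Equation 2.3
    print(f"{name:8s} ratio={ratio:.2f}  speed={speed / 1e6:.1f} MB/s  saving={saving:.1%}")

with open("seg1.mbin", "rb") as f:               # hypothetical pre-processed segment file
    data = f.read()

measure("zstd", zstandard.ZstdCompressor(level=3).compress, data)
measure("lz4", lz4.frame.compress, data)
measure("brotli", lambda d: brotli.compress(d, quality=5), data)
measure("zlib", lambda d: zlib.compress(d, 6), data)
```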

3.2.6 Statistical Tests

The main aim of the research is to select an effective algorithm for the current use case. The experiment was conducted, and the results of all the algorithms were first analyzed manually based on the compression factors. To draw a valid conclusion from the experiment results, statistical tests are performed [68]. Statistical tests are divided into parametric and non-parametric tests [68]:

• Parametric test: These tests are models that assume a particular distribution. Examples are the t-test and the F-test [68].

• Non-parametric test: These tests do not make the same kind of assumptions regarding the parameter distributions as parametric tests do; only general assumptions are made to derive a non-parametric test. Examples: Analysis of Variance (ANOVA), Wilcoxon, Sign test, Chi-2 [68].

The different types of tests can be sorted out based on the design type of the experiment. The design type of this experiment is a randomized complete block design; the statistical test used for this design type is the Friedman test.

Friedman Test: The Friedman test is used when the data arise from more than two matched samples. When the assumptions for an analysis of variance do not hold, this test can be used to test the differences between more than two treatments [53]. In the Friedman test, the algorithms are ranked on each data set using the experimental results, from best to worst: the best-performing algorithm is ranked 1, the second best is ranked 2, and so on [40]. If there is a tie between algorithm performances, the average rank is assigned to the tied algorithms. The ranking of the algorithms is shown in Section 4.10.

Motivation: The Friedman test is one of the most efficient statistical methods for testing the performance of multiple algorithms; it is used for comparing k algorithms over n data sets, where k ≥ 2 [40].

The formula used to calculate the Friedman statistic is [20]:

\[
FM = \frac{12}{nk(k+1)} \sum_{i=1}^{k} \left( R_i - \frac{n(k+1)}{2} \right)^{2} \tag{3.1}
\]

where R_i is the sum of the ranks of algorithm i, k is the total number of algorithms, and n is the number of data sets.

After calculating the FM value, the critical value for k and n at significance level α = 0.05 is looked up. If the critical value is greater than FM, the null hypothesis is considered true; otherwise the null hypothesis is false and is rejected [20].
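A small sketch of Equation 3.1 in Python is shown below (illustrative only: the compression-ratio matrix is randomly generated, not the thesis data); it also cross-checks the statistic against scipy.stats.friedmanchisquare.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
ratios = rng.random((20, 4))      # hypothetical: n = 20 data sets (rows), k = 4 algorithms (columns)

# Rank the algorithms within each data set (the ranking direction does not change the statistic).
ranks = ratios.argsort(axis=1).argsort(axis=1) + 1
n, k = ranks.shape
R = ranks.sum(axis=0)             # rank sum R_i for each algorithm

# Equation 3.1
FM = 12.0 / (n * k * (k + 1)) * ((R - n * (k + 1) / 2) ** 2).sum()

# Cross-check with SciPy (one sequence of measurements per algorithm).
stat, p = friedmanchisquare(*[ratios[:, j] for j in range(k)])
print(FM, stat, p)                # FM equals stat here (no ties); reject H0 when p < 0.05
```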

If the null hypothesis is rejected, a post-hoc test should be performed to determine which algorithms performed significantly differently. Here, the Nemenyi test is used because it is an effective post-hoc test for the case where a statistical test for multiple comparisons has rejected the null hypothesis that all algorithms perform similarly [12].

The Nemenyi Test: The Nemenyi test is used when all algorithms are compared to each other. The performance of two algorithms is significantly different if their average ranks differ by at least the critical difference [40]. The formula for calculating the critical difference is

\[
CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6n}} \tag{3.2}
\]

Here, q_α depends on the value of α and k. In this experiment, the four algorithms are tested on 20 data sets.
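As a numerical sketch of Equation 3.2, the critical difference for this setup can be computed as follows; the q value is an assumed, commonly tabulated value for α = 0.05 and k = 4 rather than a figure taken from the thesis.

```python
import math

k, n = 4, 20              # four algorithms compared on 20 data sets
q_alpha = 2.569           # assumed tabulated q value for alpha = 0.05 and k = 4

CD = q_alpha * math.sqrt(k * (k + 1) / (6 * n))   # Equation 3.2
print(round(CD, 2))       # about 1.05: average ranks differing by more than this are significant
```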

Chapter 4 Results and Analysis

To answer RQ2, an experiment is performed with the selected algorithms using Ericsson's SGSN-MME data. After conducting the literature review, three algorithms were selected based on their expected performance, and the results of all three selected algorithms were compared with the results of the existing algorithm. In this research, box plots are used for analyzing the results. A box plot is a visual representation of a statistical technique which analyzes the experiment data using a five-number summary (first quartile, third quartile, median, and minimum and maximum whiskers). Microsoft Excel 2016 is used to construct the box plots from the experiment results.

4.1 Results for Z-standard:

Figure 4.1: Z-standard compression

Figure 4.1 shows the comparison between the size of the uncompressed original segment files and the size of the Z-standard compressed files. The Y-axis of the graph represents the size of the segment file in bytes. According to the box plot illustration of Figure 4.1, the blue box plot represents the uncompressed data series of the segment files, with a median value of 711204, a first quartile value of 425368 and a third quartile value of 965130. The whiskers are at 215634 (minimum) and 1563038 (maximum). The x in the graph represents the mean value, which is at 781519.

According to the box plot illustration of Figure 4.1, the orange box plot represents the Z-standard compressed data series of the segment files, with a median value of 317050, a first quartile value of 202564 and a third quartile value of 407610. The whiskers are at 125368 (minimum) and 627726 (maximum). The x in the graph represents the mean value, which is at 341388.

4.2 Results for LZ4:

Figure 4.2: LZ4 compression

Figure 4.2 shows the comparison between the uncompressed original segment file size and the size of the LZ4 compressed file. The Y-axis of the graph represents the size of the segment file in bytes. According to the box plot illustration of Figure 4.2, the blue box plot represents the uncompressed data series of the segment file, with a median value of 711204, a first quartile of 425368 and a third quartile of 965130. The whiskers are at 215634 (minimum) and 1563038 (maximum). The x in the graph represents the mean value, which is at 781519.

According to the box plot illustration of Figure 4.2, the orange box plot represents the LZ4 compressed data series of the segment file, with a median value of 400865, a first quartile of 272671 and a third quartile of 477713. The whiskers are at 158554 (minimum) and 747865 (maximum). The x in the graph represents the mean value, which is at 413422.

4.3 Results for Brotli:

Figure 4.3: Brotli compression

Figure 4.3 shows the comparison between the uncompressed original segment file size and the size of the Brotli compressed file. The Y-axis of the graph represents the size of the segment file in bytes. According to the box plot illustration of Figure 4.3, the blue box plot represents the uncompressed data series of the segment file, with a median value of 711204, a first quartile of 425368 and a third quartile of 965130. The whiskers are at 215634 (minimum) and 1563038 (maximum). The x in the graph represents the mean value, which is at 781519.

According to the box plot illustration of Figure 4.3, the orange box plot represents the Brotli compressed data series of the segment file, with a median value of 377032, a first quartile of 257798 and a third quartile of 466750.5. The whiskers are at 154024 (minimum) and 733820 (maximum). The x in the graph represents the mean value, which is at 400207.

4.4 Results for Zlib (deflate):

Figure 4.4: Zlib compression

Figure 4.4 shows the comparison between the uncompressed original segment file size and the size of the Zlib compressed file. Zlib (deflate) is the algorithm currently in use, which is to be replaced by a more efficient algorithm. The Y-axis of the graph represents the size of the segment file in bytes. According to the box plot illustration of Figure 4.4, the blue box plot represents the uncompressed data series of the segment file, with a median value of 711204, a first quartile of 425368 and a third quartile of 965130. The whiskers are at 215634 (minimum) and 1563038 (maximum). The x in the graph represents the mean value, which is at 781519.

According to the box plot illustration of Figure 4.4, the orange box plot represents the Zlib (deflate) compressed data series of the segment file, with a median value of 328464, a first quartile of 205491 and a third quartile of 425585. The whiskers are at 135618 (minimum) and 653990 (maximum). The x in the graph represents the mean value, which is at 355573.
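For comparison with the Z-standard sketch in Section 4.1, the other three algorithms expose similar one-shot compression calls in Python (the lz4 bindings [67], the brotli package [8], and the standard-library zlib module); the file name is again a placeholder, and the levels shown are the libraries' defaults rather than the settings of the actual experiment.

import zlib
from pathlib import Path

import brotli
import lz4.frame

original = Path("segment_001.bin").read_bytes()   # placeholder segment file

sizes = {
    "LZ4":    len(lz4.frame.compress(original)),
    "Brotli": len(brotli.compress(original)),      # default quality 11 (slow, strong)
    "Zlib":   len(zlib.compress(original, 6)),     # level 6 is zlib's default (the deflate baseline)
}

for name, size in sizes.items():
    print(f"{name:6s}: {size} bytes, ratio {len(original) / size:.2f}")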

4.5 Comparison of Compression Ratio

Figure 4.5: Comparison of compression ratios

Figure 4.5 represents the comparison of the compression ratios of all four algorithms (Zstd, LZ4, Brotli, and Zlib). The Y-axis represents the compression ratio of each algorithm and the X-axis lists the algorithms. According to the box plot illustration of Figure 4.5, the blue box plot represents the compression ratio of the Z-standard series, with a median value of 2.27, a first quartile of 2.1 and a third quartile of 2.367. The whiskers are at 1.72 (minimum) and 2.49 (maximum). The x in the graph represents the mean value, which is at 2.2015.

According to the box plot illustration of Figure 4.5, the orange box plot represents the compression ratio of the Zlib (deflate) series, with a median value of 2.15, a first quartile of 2.07 and a third quartile of 2.267. The whiskers are at 1.59 (minimum) and 2.39 (maximum). The x in the graph represents the mean value, which is at 2.109.

According to the box plot illustration of Figure 4.5, the grey box plot represents the compression ratio of the Brotli series, with a median value of 1.975, a first quartile of 1.65 and a third quartile of 2.067. The whiskers are at 1.4 (minimum) and 2.13 (maximum). The x in the graph represents the mean value, which is at 1.865.

According to the box plot illustration of Figure 4.5, the yellow box plot represents the compression ratio of the LZ4 series, with a median value of 1.89, a first quartile of 1.56 and a third quartile of 2.02. The whiskers are at 1.36 (minimum) and 2.09 (maximum). The x in the graph represents the mean value, which is at 1.797.

From Figure 4.5, the median of the compression ratio box for the Z-standard series is the highest of all the medians. In terms of the dispersion of the data, the interquartile range (first and third quartiles) also takes superior values for the Z-standard box plot.

4.6 Comparison of Compression speed

Figure 4.6: Comparison of compression speed

Figure 4.6 represents the comparison of the compression speed of all four algorithms (Zstd, LZ4, Brotli, and Zlib). The Y-axis represents the average compression speed of each algorithm and the X-axis lists the algorithms. According to the box plot illustration of Figure 4.6, the blue box plot represents the compression speed of the Z-standard series, with a median value of 14.115, a first quartile of 14.03 and a third quartile of 14.2625. The whiskers are at 13.86 (minimum) and 14.49 (maximum). The x in the graph represents the mean value, which is at 14.145.

According to the box plot illustration of Figure 4.6, the orange box plot represents the compression speed of the Zlib (deflate) series, with a median value of 7.755, a first quartile of 7.56 and a third quartile of 7.98. The whiskers are at 7.41 (minimum) and 8.31 (maximum). The x in the graph represents the mean value, which is at 7.8035.

According to the box plot illustration of Figure 4.6, the grey box plot represents the compression speed of the Brotli series, with a median value of 13.355, a first quartile of 13.19 and a third quartile of 13.51. The whiskers are at 13.03 (minimum) and 13.67 (maximum). The x in the graph represents the mean value, which is at 13.355.

According to the box plot illustration of Figure 4.6, the yellow box plot represents the compression speed of the LZ4 series, with a median value of 17.115, a first quartile of 16.94 and a third quartile of 17.22. The whiskers are at 16.78 (minimum) and 17.43 (maximum). The x in the graph represents the mean value, which is at 17.1015.

From Figure 4.6, the median of the compression speed box for the LZ4 series is the highest of all the medians. In terms of the dispersion of the data, the interquartile range (first and third quartiles) also takes superior values for the LZ4 box plot.

4.7 Comparison of Space-saving


Figure 4.7: Comparison of space saving

Figure 4.7 shows the comparison between the space savings of all algorithms (Zstd, Brotli, LZ4, Zlib). The Y-axis represents the space saving of each algorithm and the X-axis lists the algorithms. According to the box plot illustration of Figure 4.7, the blue box plot represents the space saving of the Z-standard series, with a median value of 0.56, a first quartile of 0.524 and a third quartile of 0.578. The whiskers are at 0.419 (minimum) and 0.599 (maximum). The x in the graph represents the mean value, which is at 0.5408.

According to the box plot illustration of Figure 4.7, the orange box plot represents the space saving of the Zlib (deflate) series, with a median value of 0.541, a first quartile of 0.517 and a third quartile of 0.559. The whiskers are at 0.372 (minimum) and 0.582 (maximum). The x in the graph represents the mean value, which is at 0.5209.

According to the box plot illustration of Figure 4.7, the grey box plot represents the space saving of the Brotli series, with a median value of 0.492, a first quartile of 0.394 and a third quartile of 0.516. The whiskers are at 0.286 (minimum) and 0.531 (maximum). The x in the graph represents the mean value, which is at 0.4531.

According to the box plot illustration of Figure 4.7, the yellow box plot represents the space saving of the LZ4 series, with a median value of 0.417, a first quartile of 0.359 and a third quartile of 0.505. The whiskers are at 0.265 (minimum) and 0.522 (maximum). The x in the graph represents the mean value, which is at 0.4309.

From Figure 4.7, the median of the space saving box for the Z-standard series is the highest of all the medians. In terms of the dispersion of the data, the interquartile range (first and third quartiles) also takes superior values for the Z-standard box plot.
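Under the definitions assumed here, the three factors compared in Sections 4.5–4.7 follow directly from the original size, the compressed size and the compression time; the sketch below is illustrative only (the speed unit shown, uncompressed MB per second, is an assumption, since the unit of Figure 4.6 is not restated in this section).

import time

import zstandard

def compression_factors(original, compress):
    """Compression ratio, speed and space saving for one compression call."""
    start = time.perf_counter()
    compressed = compress(original)
    elapsed = time.perf_counter() - start

    ratio = len(original) / len(compressed)            # > 1 means the data shrank
    speed = len(original) / (1024 * 1024) / elapsed    # uncompressed MB per second (assumed unit)
    saving = 1 - len(compressed) / len(original)       # fraction of storage saved
    return ratio, speed, saving

# Example with Z-standard; the compress callable of any of the four algorithms fits here.
ratio, speed, saving = compression_factors(b"example payload" * 10000,
                                            zstandard.ZstdCompressor().compress)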

4.8 Result Analysis

From Section 4.5, the Z-standard algorithm is the most effective of all the algorithms in terms of compression ratio.

From Section 4.6, the LZ4 algorithm compresses the data files faster than all the other algorithms. However, Z-standard has the capacity to trade compression speed for stronger compression ratios [19].
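Z-standard exposes this trade-off directly through its compression level [19]; the sweep below is only an illustrative sketch with an arbitrary sample of levels and a placeholder file name.

import time
from pathlib import Path

import zstandard

data = Path("segment_001.bin").read_bytes()    # placeholder segment file

for level in (1, 3, 9, 19):                    # arbitrary sample of zstd's level range
    start = time.perf_counter()
    compressed = zstandard.ZstdCompressor(level=level).compress(data)
    elapsed = time.perf_counter() - start
    print(f"level {level:2d}: ratio {len(data) / len(compressed):.2f}, "
          f"time {elapsed * 1000:.1f} ms")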

From Section 4.7, the Z-standard algorithm saves 54% of the storage space, the LZ4 algorithm saves 43%, the Brotli algorithm saves 45%, and the Zlib (deflate) algorithm saves 52%. So the Z-standard algorithm saves more space than the other algorithms.

From the above analysis, the LZ4 algorithm compresses fastest but yields the lowest compression ratio, Zlib achieves a good compression ratio but the lowest compression speed, and Z-standard achieves the highest compression ratio. Compared with all the other algorithms, the Z-standard algorithm is the most effective, maintaining a good balance between compression ratio and compression speed.

4.9 Performing Statistical Tests

To obtain a valid conclusion from the experiment results, the Friedman and Nemenyi statistical tests were performed on the compression ratios of all algorithms.

• Null Hypothesis (H0): Performances of all the four compression algorithms (LZ4, Brotli, Z-standard, and Zlib) are equal.

• Alternate Hypothesis (H1): Performances of all the four compression algorithms (LZ4, Brotli, Z-standard, and Zlib) are different.

Test case    Zstd       Zlib       Brotli     LZ4
1            2.36 (1)   2.26 (2)   2.06 (3)   2.01 (4)
2            2.42 (1)   2.32 (2)   2.11 (3)   2.07 (4)
3            2.20 (1)   2.14 (2)   1.85 (3)   1.74 (4)
4            2.33 (1)   2.21 (2)   2.05 (3)   1.98 (4)
5            2.29 (1)   2.19 (2)   2.01 (3)   1.92 (4)
6            2.36 (1)   2.26 (2)   2.06 (3)   2.01 (4)
7            2.39 (1)   2.29 (2)   2.09 (3)   2.05 (4)
8            2.10 (1)   2.07 (2)   1.65 (3)   1.56 (4)
9            1.72 (1)   1.59 (2)   1.40 (3)   1.36 (4)
10           2.14 (1)   2.11 (2)   1.83 (3)   1.69 (4)
11           2.28 (1)   2.19 (2)   2.01 (3)   1.92 (4)
12           2.10 (1)   2.07 (2)   1.65 (3)   1.56 (4)
13           1.72 (1)   1.59 (2)   1.40 (3)   1.36 (4)
14           2.39 (1)   2.29 (2)   2.09 (3)   2.05 (4)
15           2.20 (1)   2.14 (2)   1.85 (3)   1.74 (4)
16           2.49 (1)   2.39 (2)   2.13 (3)   2.09 (4)
17           1.97 (1)   1.94 (2)   1.55 (3)   1.47 (4)
18           2.46 (1)   2.35 (2)   2.13 (3)   2.09 (4)
19           2.25 (1)   2.16 (2)   1.94 (3)   1.86 (4)
20           1.85 (1)   1.72 (2)   1.45 (3)   1.41 (4)
Total ranks  20         40         60         80

Table 4.1: Ranks for comparison of compression ratios of algorithms for each test case.

The compression ratio measures to what extent a file is reduced relative to its original size. The compression ratio remains constant when the same data is compressed repeatedly, so in this research the statistical tests were performed on these constant results.

The above table presents the ranks for the compression ratios obtained from compressing each segment file with the Z-standard, Brotli, LZ4 and Zlib algorithms. For each segment, ranks from 1 to 4 are assigned based on performance, with the best algorithm ranked 1. The rank totals are then used in the calculation below.
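A minimal sketch of how the ranks in Table 4.1 can be derived from the measured compression ratios (a higher ratio receives the better, i.e. lower, rank); the helper name is ours, and ties, which would receive average ranks, do not occur in Table 4.1. The example row is test case 1.

def rank_descending(ratios):
    """Rank algorithms 1..k by compression ratio, best (highest) ratio first."""
    order = sorted(ratios, key=ratios.get, reverse=True)
    return {name: position for position, name in enumerate(order, start=1)}

# Test case 1 from Table 4.1
print(rank_descending({"Zstd": 2.36, "Zlib": 2.26, "Brotli": 2.06, "LZ4": 2.01}))
# -> {'Zstd': 1, 'Zlib': 2, 'Brotli': 3, 'LZ4': 4}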

From the table, n = 20, k = 4, and R_i = [20, 40, 60, 80]. The Friedman statistic FM is calculated using the formula:

FM = \frac{12}{nk(k+1)} \sum_{i=1}^{k} \left( R_i - \frac{n(k+1)}{2} \right)^2    (4.1)

The resulting FM = 60. As discussed in Section 3.2.6, after calculating the FM value, the critical value for k algorithms and n datasets is looked up at significance level α = 0.05. If the critical value is greater than FM, the null hypothesis is considered true; otherwise it is rejected. From the statistical table, the critical value for α = 0.05 and k = 4 is 7.8. Here FM (60) > CV (7.8), hence the null hypothesis is rejected. Since the null hypothesis is rejected, a post-hoc test is performed to determine which algorithms performed significantly differently. The critical difference CD is calculated with the Nemenyi test using the formula:

CD = q_\alpha \sqrt{\frac{k(k+1)}{6n}}    (4.2)

We need qα, which depends on the α value and k. The value of qα at α = 0.05 and k = 4 is 2.569, giving a critical difference (CD) of 1.02. If the difference between the average ranks of the best-performing and worst-performing algorithms is greater than or equal to CD, the algorithms are said to have different performances; otherwise the performance of all algorithms is said to be the same.
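For completeness, substituting the values from Table 4.1 into Equations 4.1 and 4.2 takes only a few lines of Python (illustrative only; 2.569 is the qα value quoted above).

import math

n, k = 20, 4
rank_sums = [20, 40, 60, 80]                        # rank totals from Table 4.1
fm = 12 / (n * k * (k + 1)) * sum((r - n * (k + 1) / 2) ** 2 for r in rank_sums)
cd = 2.569 * math.sqrt(k * (k + 1) / (6 * n))

print(fm)   # 60.0
print(cd)   # about 1.05; the text above uses 1.02, and the rank difference of 3 exceeds either value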

The difference between the average ranks of the best-performing and worst-performing algorithms is 4 − 1 = 3. Since this difference is greater than the critical difference (1.02), it is clear that the algorithms have different performances. Comparing the average ranks in Table 4.2, Z-standard ranks first. We can therefore conclude that the Z-standard algorithm performs better than the remaining algorithms and can be used to compress the SGSN-MME data.

Algorithm     Average rank
Z-standard    1
Zlib          2
Brotli        3
LZ4           4

Table 4.2: Average ranks of algorithms.

Chapter 5 Discussion

In this chapter, the results obtained in Chapter 4 are used to answer the research questions. The results obtained from the qualitative and quantitative methods are generalized and presented along with the research questions. The validity threats are also discussed in this chapter.

5.1 Answering Research Questions

1. Which algorithm other than the deflate algorithm is suitable for the present Use-case?

Answer: The main purpose of this question is to review the existing algorithm, which is the deflate algorithm, and to find out whether any other compression algorithms suit our present use-case (compress twice and decompress once).

The literature review, discussed in Section 3.1, helped us to understand the Zlib algorithm and its properties in detail. We identified a few compression algorithms, namely Brotli (Section 2.4.7), LZ4 (Section 2.4.6) and Z-standard (Section 2.4.9), which have better properties than Zlib.

An experiment was conducted using all the selected algorithms, and the results were compared with each other and with the results of the existing Zlib algorithm. Based on this analysis, one of the selected algorithms becomes our proposed algorithm.

2. Which algorithm works more effectively when compared with the existing algorithm for the Use-case?

Answer: For answering RQ2, an experiment was conducted using the Z-standard, Brotli, and LZ4 algorithms on the SGSN-MME datasets. The results were analyzed based on compression factors such as compression ratio, compression speed, and space saving.

By analyzing the results and from a comparative study between the existing and the selected algorithms, it is concluded that the Z-standard algorithm compresses more effectively than the other algorithms.

The statistical tests in Section 4.9 were performed on the resulting data, and they confirm that the Z-standard algorithm works more effectively than the deflate algorithm.

5.2 Validity Threats

In an experiment, the basic question concerning the results is: "how valid are the results?" [41]. For different types of threats, there are different classification schemes. Campbell and Stanley defined two types of threats, internal and external validity threats. Later, Cook and Campbell extended them to four types: conclusion, internal, construct and external validity. Each type of validity is related to a methodological question in the experiment.

1. Conclusion Validity: This type of validity deals with the threats that affect the ability to draw correct conclusions about the relationship between the treatment and the outcomes of the experiment [68]. Conclusion validity is sometimes referred to as statistical conclusion validity. To mitigate this kind of threat, statistical tests at a given significance level need to be conducted to draw the results of the experiment.

2. Internal Validity: Threats to internal validity may suggest a causal relationship between the treatment and the results although there is none. This type of validity deals with how the dependent variable is affected by the independent variables [68]. In this experiment, the independent variables are the different compression algorithms and the dataset, and the dependent variable is the performance of the compression algorithm. To mitigate this kind of threat, all the data sets were carefully collected and preprocessed.

3. Construct Validity: This type of validity is concerned with the relation between theory and observation, in particular with generalizing the experiment results to the theory behind the experiment. To mitigate this kind of threat, it is necessary to define the hypotheses and to figure out the experiment design up front [68].

4. External Validity: This type of validity deals with generalization. External validity threats concern the generality of the experiment results outside the scope of the experiment. The experiment design and the chosen objects and subjects are the main factors that affect external validity. To mitigate external validity threats, it is necessary to make the environmental conditions as realistic as possible [68]. In this research, the whole experiment was done in a single industrial environment (Ericsson) using a fixed SGSN-MME dataset, so the results cannot be generalized.

Chapter 6 Conclusion and Future Work

The main aim of this research is to replace the deflate algorithm with a more suitable algorithm for better performance in the present use-case (compress twice and decompress once) of Ericsson's SGSN-MME. For this, a reliable background study of different compression algorithms was done and, based on their performance reviews in various research papers, the LZ4, Brotli and Z-standard compression algorithms were chosen for comparison with the deflate (Zlib) compression algorithm.

An experiment was held on these selected compression algorithms to compare their performance based on compression factors such as compression ratio, compression speed and space savings. Based on this comparison between the compression algorithms, this research concludes that the Z-standard algorithm performs better than the other algorithms.

From the analysis of the results, it is concluded that the performance of the Z-standard compression algorithm stands out from the other compression algorithms and is suitable for the present use-case of Ericsson's SGSN-MME in terms of the compression factors. Hence, deflate can be replaced with the Z-standard compression algorithm to better compress Ericsson's database with a good balance between compression ratio and compression speed.

Since Z-standard is a dictionary-based compression algorithm, as future work, a pre-defined dictionary can be created and given to the algorithm externally to compress the data even more effectively in terms of compression ratio, compression time and space savings.
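As a pointer for this future work, the python-zstandard bindings [9] already expose dictionary training; the sketch below, with placeholder sample files and an arbitrarily chosen dictionary size, only shows the general shape of such an approach.

from pathlib import Path

import zstandard

# Placeholder names for a set of representative SGSN-MME segment files.
samples = [Path(f"segment_{i:03d}.bin").read_bytes() for i in range(1, 21)]

# Train a shared dictionary (about 110 KiB here; the size is an arbitrary choice) ...
dictionary = zstandard.train_dictionary(112_640, samples)

# ... and hand it to the compressor so structures that repeat across segment files
# can be referenced from the dictionary instead of being re-encoded in every file.
compressor = zstandard.ZstdCompressor(dict_data=dictionary)
compressed = compressor.compress(samples[0])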

References

[1] The Advantages of File Compression | Techwalla.

[2] Arithmetic coding - Wikipedia.

[3] Brotli Compression - The What, The Why, The How.

[4] Citrix Workspace - Transform Employee Experience - Citrix India.

[5] Data Compression using Huffman based LZW Encoding Technique.

[6] Erlang – ets.

[7] Erlang Programming Language.

[8] google/brotli.

[9] indygreg - Overview.

[10] Understanding zlib.

[11] What is GPRS (General Packet Radio Services)? - Definition from WhatIs.com.

[12] Nemenyi test, October 2015. Page Version ID: 685831545.

[13] Citrix Workspace App – Everything You Need to Know, June 2018.

[14] SGSN-MME, November 2018. Last Modified: 2020-07-11T01:02:08+00:00.

[15] Asymmetric numeral systems, August 2020. Page Version ID: 973248208.

[16] Brotli, August 2020. Page Version ID: 974952707.

[17] Citrix Systems, July 2020. Page Version ID: 969040032.

[18] , July 2020. Page Version ID: 970306549.

[19] facebook/zstd, August 2020. original-date: 2015-01-24T00:22:38Z.

[20] Friedman test, July 2020. Page Version ID: 969936548.

[21] Lempel–Ziv–Welch, May 2020. Page Version ID: 958784425.

[22] lz4/lz4, August 2020. original-date: 2014-03-25T15:52:21Z.

[23] LZ77 and LZ78, February 2020. Page Version ID: 941933445.


[24] Open Telecom Platform, June 2020. Page Version ID: 960213440.

[25] Shannon–Fano coding, May 2020. Page Version ID: 959180025.

[26] System Architecture Evolution, August 2020. Page Version ID: 974889148.

[27] zlib, August 2020. Page Version ID: 973912850.

[28] zlib-ng/zlib-ng, August 2020. original-date: 2014-10-13T15:47:27Z.

[29] , August 2020. Page Version ID: 975032214.

[30] Jyrki Alakuijala, Andrea Farruggia, Paolo Ferragina, Eugene Kliuchnikov, Robert Obryk, Zoltan Szabadka, and Lode Vandevenne. Brotli: A General- Purpose Data Compressor. ACM Transactions on Information Systems, 37(1):1–30, January 2019.

[31] Jyrki Alakuijala, Evgenii Kliuchnikov, Zoltan Szabadka, and Lode Vandevenne. Comparison of Brotli, Deflate, Zopfli, LZMA, LZHAM and Compression Algorithms. page 6.

[32] Matěj Bartík, Sven Ubik, and Pavel Kubalik. LZ4 compression algorithm on FPGA. In 2015 IEEE International Conference on Electronics, Circuits, and Systems (ICECS), pages 179–182, December 2015.

[33] Rhen Anjerome Bedruz and Ana Riza F. Quiros. Comparison of Huffman Algo- rithm and Lempel-Ziv Algorithm for audio, image and text compression. In 2015 International Conference on Humanoid, Nanotechnology, Information Technol- ogy,Communication and Control, Environment and Management (HNICEM), pages 1–6, December 2015.

[34] Mikael Berndtsson, Jörgen Hansson, B. Olsson, and Björn Lundell. Thesis Projects: A Guide for Students in Computer Science and Information Sys- tems. Springer Science & Business Media, October 2007. Google-Books-ID: CoGcm3lU3FYC.

[35] Microsoft Edge Blog. Introducing Brotli compression in Microsoft Edge, De- cember 2016.

[36] J.B. Connell. A Huffman-Shannon-Fano code. Proceedings of the IEEE, 61(7):1046–1047, July 1973. Conference Name: Proceedings of the IEEE.

[37] Patricia Cronin, Frances Ryan, and Michael Coughlan. Undertaking a litera- ture review: a step-by-step approach. British Journal of Nursing, 17(1):38–43, January 2008. Publisher: Mark Allen Group.

[38] Cyan. RealTime Data Compression: LZ4 explained, May 2011.

[39] Cyan. RealTime Data Compression: Finite State Entropy - A new breed of entropy coder, December 2013.

[40] Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.

[41] Robert Feldt and Ana Magazinius. Validity Threats in Empirical Software Engineering Research - An Initial Survey. page 6.

[42] Scott Lystig Fritchie. A study of Erlang ETS table implementations and performance. In Proceedings of the 2003 ACM SIGPLAN workshop on Erlang - ERLANG '03, pages 43–55, Uppsala, Sweden, 2003. ACM Press.

[43] Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Rich Baker, and David Lindbergh. Digital Compression for Multimedia: Principles and Standards. Morgan Kaufmann, January 1998. Google-Books-ID: aqQ2Ry6spu0C.

[44] Zaid Haitham and Maher K. Mahmood Al-Azawi. Video Compression Based on and Contourlet Transform. In 2018 Third Scientific Conference of (SCEE), pages 90–94, December 2018.

[45] Danny Harnik, Ety Khaitzin, Dmitry Sotnikov, and Shai Taharlev. A Fast Implementation of Deflate. In 2014 Data Compression Conference, pages 223–232, March 2014. ISSN: 2375-0359.

[46] Dhiraj Kapila and Harpreet Arora. A REVIEW OF OF COMPRESSION BASED METHODS IN IMAGE SEGMENTATION. 3(9):4.

[47] Staffs Keele. Guidelines for performing systematic literature reviews in software engineering. Technical report, Ver. 2.3 EBSE Technical Report. EBSE, 2007.

[48] G. G. Langdon. An Introduction to Arithmetic Coding. IBM Journal of Research and Development, 28(2):135–149, March 1984.

[49] Kyungsik Lee and LG Electronics. LZ4 Compression and Improving Boot Time. page 21.

[50] Weiqiang Liu, Faqiang Mei, Chenghua Wang, Maire O'Neill, and Earl E. Swartzlander. Data Compression Device Based on Modified LZ4 Algorithm. IEEE Transactions on Consumer Electronics, 64(1):110–117, February 2018. Conference Name: IEEE Transactions on Consumer Electronics.

[51] Samar lofty and Abdel Nasser H. Zaied. Survey of Compression and Cryptography Techniques of Data Security in E-Commerce. International Journal of Innovative Research in Information Security, 4(8), August 2017.

[52] A. Mishra. Performance and architecture of SGSN and GGSN of general packet radio service (GPRS). GLOBECOM'01. IEEE Global Telecommunications Conference (Cat. No.01CH37270), 2001.

[53] Harvey Motulsky. GraphPad Statistics Guide. page 402.

[54] Dan Munteanu and Carey Williamson. An FPGA-based Network Processor for IP Packet Compression. page 10.

[55] R. Pasco. Source coding algorithms for fast data compression (Ph.D. Thesis abstr.). IEEE Transactions on Information Theory, 23(4):548–548, July 1977. Conference Name: IEEE Transactions on Information Theory.

[56] Shrusti Porwal, Yashi Chaudhary, Jitendra Joshi, and Manish Jain. Data Com- pression Methodologies for Lossless Data and Comparison between Algorithms. 2(2):7, 2013.

[57] Khalid Sayood Ph D. Professor. Introduction to Data Compression.

[58] Ida Mengyi Pu. Fundamental Data Compression. Butterworth-Heinemann, November 2005. Google-Books-ID: Nyt0HgC81I4C.

[59] Gonçalo César Mendes Ribeiro. Data Compression Algorithms in FPGAs. page 101.

[60] J. Rissanen and G. G. Langdon. Arithmetic Coding. IBM Journal of Research and Development, 23(2):149–162, March 1979. Conference Name: IBM Journal of Research and Development.

[61] J. J. Rissanen. Generalized Kraft Inequality and Arithmetic Coding. IBM Journal of Research and Development, 20(3):198–203, May 1976. Conference Name: IBM Journal of Research and Development.

[62] Senthil Shanmugasundaram and Robert Lourdusamy. A Comparative Study Of Text Compression Algorithms. 1:9, 2011.

[63] Komal Sharma and Kunal Gupta. Lossless data compression techniques and their performance. In 2017 International Conference on Computing, Communi- cation and Automation (ICCCA), pages 256–261, May 2017.

[64] Mamta Sharma. Compression Using Huffman Coding. page 9, 2010.

[65] Yu Su and Xian Feng. Analysis and Performance Evaluation of Multi-path Diversity Based SGSN Pool in Packet Switched Domain Core Networks. In 2014 10th International Conference on Mobile Ad-hoc and Sensor Networks, pages 294–298, December 2014.

[66] Heru Susanto, Fang-Yie Leu, Didi Rosiyadi, and Chin Kang Chen. Revealing Storage and Speed Transmission Emerging Technology of Big Data. In Leonard Barolli, Makoto Takizawa, Fatos Xhafa, and Tomoya Enokido, editors, Advanced Information Networking and Applications, Advances in Intelligent Systems and Computing, pages 571–583, Cham, 2020. Springer International Publishing.

[67] Jonathan Underwood. lz4: LZ4 Bindings for Python.

[68] Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. Experimentation in Software Engineering. Springer Science & Business Media, June 2012. Google-Books-ID: QPVsM1_U8nkC.

[69] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, May 1977. Confer- ence Name: IEEE Transactions on Information Theory.

[70] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, September 1978. Conference Name: IEEE Transactions on Information Theory.

[71] . A Guide to Writing the Thesis Literature Review for Graduate Studies. Ex- ecutive Journal, 34(1):11–22, 2014. Number: 1.
