CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

LOSSLESS COMPRESSION OF SATELLITE TELEMETRY DATA FOR A

NARROW-BAND DOWNLINK

A graduate project submitted in partial fulfillment of the requirements

For the degree of Master of Science in

Electrical Engineering

By

Gor Beglaryan

May 2014

Copyright

Copyright (c) 2014, Gor Beglaryan

Permission to use, copy, modify, and/or distribute the software developed for this project for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL

WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED

WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL

THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR

CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING

FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF

CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF

OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Copyright by Gor Beglaryan


Signature Page

The graduate project of Gor Beglaryan is approved:

______Prof. James A Flynn Date

______Dr. Deborah K Van Alphen Date

______Dr. Sharlene Katz, Chair Date

California State University, Northridge


Contents

Copyright
Signature Page
List of Figures
List of Tables
ABSTRACT
1 Introduction
2 Background: Information Theory and Coding
  2.1 Outline
  2.2 Formulas and Measures of Performance
  2.3 Lossy vs. Lossless Compression
3 Design Procedure
  3.1 Problem Definition
  3.2 Huffman Codes
    3.2.1 How Huffman Codes Work
    3.2.2 Huffman Algorithm Design
    3.2.3 Efficiency of Huffman Compression
    3.2.4 Sample Output of the Static Huffman Algorithm
  3.3 Adaptive Huffman Codes
    3.3.1 How Adaptive Huffman Codes Work
    3.3.2 Adaptive Huffman Algorithm Design
    3.3.3 Efficiency of Adaptive Huffman Codes
    3.3.4 Sample Output of the Adaptive Huffman Algorithm
  3.4 Arithmetic Coding
    3.4.1 How Arithmetic Coding Works
    3.4.2 Arithmetic Coding Algorithm Design
    3.4.3 Efficiency of Arithmetic Coding
    3.4.4 Sample Output of the Static Arithmetic Algorithm
  3.5 Adaptive Arithmetic Coding
    3.5.1 How Adaptive Arithmetic Coding Works
    3.5.2 Efficiency of Adaptive Arithmetic Coding
4 Performance Tests and Comparison
  4.1 Pic33 Analog-to-Digital Converter Output Format
  4.2 Benchmark Test
    4.2.1 Test Data
    4.2.2 Compression Ratio Test
    4.2.3 Timing Test
    4.2.4 Discussion of Test Results
  4.3 Delta Compression
5 Conclusions
Bibliography
Appendix A: MATLAB Code for Static Huffman Compression
Appendix B: MATLAB Code for Adaptive Huffman Compression
Appendix C: MATLAB Code for Static Arithmetic Coding
Appendix D: MATLAB Code for Adaptive Arithmetic Coding
Appendix E: MATLAB Code for 10 bit Conversion
Appendix F: MATLAB Code for 12 bit Conversion


List of Figures

Figure 2.1-Simplified source and channel coding system.
Figure 3.1-Static Huffman compression flowchart.
Figure 3.2-Huffman binary tree example.
Figure 3.3-Flowchart for traversing a binary tree.
Figure 3.4-Static Huffman compression output format.
Figure 3.5-Encoded binary tree example.
Figure 3.6-Output variable info of the Static Huffman program.
Figure 3.7-Output variable info.codewords of the Static Huffman program.
Figure 3.8-Histogram generated by the Static Huffman Program.
Figure 3.9-Adaptive Huffman encoding flowchart [15].
Figure 3.10-Adaptive Huffman tree example.
Figure 3.11-Adaptive Huffman tree node update example.
Figure 3.12-Adaptive Huffman compression output format.
Figure 3.13-Adaptive Huffman decoder flowchart [17].
Figure 3.14-Output variable info of the Adaptive Huffman program.
Figure 3.15-Final binary tree table of the Adaptive Huffman simulation.
Figure 3.16-Output variable info.codewords of the Adaptive Huffman program.
Figure 3.17-Generating unique tag for Static Arithmetic Coding.
Figure 3.18-Arithmetic Coding Case 0 and Case 1 rescaling [23].
Figure 3.19-Arithmetic Coding Case S rescaling [23].
Figure 3.20-Static Arithmetic Coding flowchart.
Figure 3.21-Static Arithmetic Coding output format.
Figure 3.22-Static Arithmetic Coding output example.
Figure 3.23-Static Arithmetic decoder flowchart.
Figure 3.24-Output of the Static Arithmetic program.
Figure 4.1-dsPic33 ADC Output Format.
Figure 4.2-Data conversion from input to output.
Figure 4.3-Symbol histogram of the book of Genesis.
Figure 4.4-Sample indoor temperature data.
Figure 4.5-Sample outdoor temperature data.
Figure 4.6-Sample wind speed data.
Figure 4.7-Sample wind gust data.
Figure 4.8-Sample magnetic field data.
Figure 4.9-Compression results for the book of Genesis.
Figure 4.10-Compression results for indoor temperature data.
Figure 4.11-Compression results for outdoor temperature data.
Figure 4.12-Symbol histogram for indoor and outdoor temperature data source alphabet.
Figure 4.13-Compression results for magnetic field data.
Figure 4.14-Compression results for wind gust data.
Figure 4.15-Compression results for wind speed data.
Figure 4.16-Compression time results (linear scale).
Figure 4.17-Compression time results (logarithmic scale).
Figure 4.18-Decompression time results (linear scale).
Figure 4.19-Total compression and decompression time (linear scale).
Figure 4.20-Total compression and decompression time (logarithmic scale).


List of Tables

Table 2.1-Information theory and coding: outline of topics.
Table 3.1-Table for storing Huffman Tree.
Table 3.2-Table for storing Adaptive Huffman tree.
Table 3.3-Source alphabet information of the sequence "abracadabra".

ABSTRACT

LOSSLESS COMPRESSION OF SATELLITE TELEMETRY DATA FOR A

NARROW-BAND DOWNLINK

By

Gor Beglaryan

Master of Science in Electrical Engineering

The objective of this project is to select a lossless compression technique to be

implemented on a CubeSat being developed by CSUN. The goal is to compress satellite

telemetry data in a timely and computationally efficient manner and achieve a reasonable compression ratio. There are two main parts in this project: Algorithm Development and

Benchmark Tests. In the former phase four lossless compression techniques, namely

Huffman, Adaptive Huffman, Arithmetic and Adaptive Arithmetic, were implemented in

MATLAB. Concise description and implementation details of each algorithm are given in

the text. Next, during the Benchmark Test phase, sample data is input to the algorithms and

performance metrics are collected. The collected performance measures include

compression ratio, compression time and decompression time. Based on the results, it is

recommended that Adaptive Arithmetic coding be selected for the CubeSat project.


1 Introduction

CubeSats are small satellites between 1000 and 3000 cm3 in volume. They are a low-cost method for conducting experiments in space. Due to the size, weight, and power limitations and the orbit of these small satellites, the downlink data rates are often limited.

Thus, it may be necessary to compress the data on the satellite before it is transmitted. This project is a study of some of the lossless compression schemes that might be used.

In order to select a lossless compression technique, four algorithms, each representing a different compression method, have been implemented in MATLAB. All algorithms perform necessary data manipulation, parsing, compression, decompression and original data recovery. These routines and subroutines enable modeling of an actual encoder and decoder, hence aiding in simulation and comparison of each compression method. As a result the most suitable and efficient algorithm can be selected for further analysis and possible implementation on the CubeSat.

Section 2 of this report presents some background information on Information Theory and Coding that was needed to complete this project. Section 3 presents the four coding techniques that were studied and the algorithms for their implementation. Section 4 presents the benchmark tests and compares the coding techniques. Section 5 includes the conclusions and suggestions for future work.


2 Background: Information Theory and Coding

With the introduction of the personal computer in the 1970s and the establishment of the Internet in the ‘90s, the Information Age began and gained momentum very rapidly. As is now known, this revolution resulted in an exponential growth of accessible information for the general public; this growth is still ongoing. This information is in the form of music, pictures, satellite telemetry data, medical imaging data, and a vast array of multimedia formats we encounter in our daily life. It is fair to say that this would not be possible without data compression.

Before the Information Age, data compression was mainly on the radar of a small group of engineers who had already developed modern compression techniques, such as Huffman

Compression [1]. However, as the number of data sources expanded, the need for larger storage expanded as well. New compression techniques were needed that could significantly reduce the number of bits required for storage.

2.1 Outline

Data compression, commonly referred to as “source coding”, is a sub-field of

Information Theory and Coding. To better understand the role of data compression in the field of information theory, consider Table 2.1 [2]. The table shows that data compression is applied to the source to reduce the size of the information to be transmitted. In contrast, error correction is performed to transfer information reliably over a non-ideal channel that is subject to noise and other forms of distortion depending on the channel. These different components can be visualized in the simplified source and channel coding system shown in Figure

2.1.


Furthermore, compression itself can be approached from two different angles, mathematical and algorithmic. The mathematical part is purely analytical and deals with theorems that help to measure performance and shed light on the limitations of compression. The algorithmic perspective is a hands-on approach trying to overcome the limitations by creating complex compression packages that can manage a variety of data formats with reasonable performance.

                       Compression/Source Coding        Error-Correction/Channel
                       (efficiency)                      Coding (reliability)
Information Theory     i.   Source coding theorem       i.   Noisy channel coding theorem
(math)                 ii.  Kraft-McMillan inequality   ii.  Channel capacity theorem
                       iii. Rate-distortion theorem     iii. Typicality & AEP
Coding methods         i.   Symbol codes, e.g.          i.   Hamming codes
(algorithms)                Huffman codes               ii.  Turbo codes
                       ii.  Stream codes, e.g.          iii. Gallager codes
                            Arithmetic coding,          iv.  BCH codes, e.g.
                            Lempel-Ziv coding                Reed-Solomon codes

Table 2.1-Information theory and coding: outline of topics.

Figure 2.1-Simplified source and channel coding system.


2.2 Formulas and Measures of Performance

Studying information theory and coding can quickly lead into depths that are outside

the scope of this project. However, some fundamental formulas are necessary to have a

basic understanding of how information is measured, how much the source data can be

compressed with a given compression technique and how reliably the data can be

recovered.

When speaking of information theory it is essential to know how information is

measured and what it represents. If P(A) is the probability that the event A will occur, then the self-information associated with A quantifies the uncertainty of the event and is given by [3]:

i(A) = log_b(1 / P(A)) = −log_b P(A)   (2.1)

Equation 2.1 shows that the self-information of an event is a positive quantity.

Moreover, the smaller the probability of the event, the higher the information. In other

words, if an event is not expected to happen, the information gained would be high if it

actually happened, or equivalently if an event has high probability of occurrence then there

is little uncertainty associated with the event. The probability of an event and its self-information are inversely related. Self-information is measured in bits when the base b of the logarithm is 2. The self-information in Equation 2.1 is defined for a single event; averaging the self-information over an experiment consisting of a number of independent events leads to a quantity called entropy [4], which is calculated by:

H(S) = −Σ_i P(s_i) · log_b P(s_i)   (2.2)

Entropy quantifies the uncertainty of an experiment.


The next formula gives the expected code-word length of a particular compression technique, which is the average code-word length of the coding scheme:

L = Σ_i P(s_i) · l_i   (2.3)

The expected code length is a performance measure that can never be smaller than the entropy, which is the lower bound of the average code length. The difference between H(S) and L is defined as the redundancy [5], also an important measure of performance. P(s_i) in Equation 2.3 is the probability of the i-th symbol and l_i is the length of the code-word corresponding to it.

The last, but probably the most important measure of performance is the compression

ratio that can be achieved with a particular compression scheme. Compression ratio

expresses the reduction in the amount of data required as a percentage of the original data

size [6].
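To make these measures concrete, the short MATLAB sketch below evaluates Equations 2.1 – 2.3 and the compression ratio for a small five-symbol source (the same probabilities reused in the examples of Section 3.2). The variable names are illustrative and the snippet is not part of the project code in the appendices.

% Minimal MATLAB sketch of the performance measures of Section 2.2.
% The probabilities and code-word lengths match the example source of
% Section 3.2; variable names (P, len, ...) are illustrative only.
P   = [0.2 0.35 0.2 0.1 0.15];    % source symbol probabilities
len = [2 2 2 3 3];                % code-word lengths of one valid Huffman code

selfInfo = -log2(P);              % self-information of each symbol, Eq. 2.1
H   = -sum(P .* log2(P));         % entropy in bits/symbol, Eq. 2.2
L   = sum(P .* len);              % expected code-word length, Eq. 2.3
red = L - H;                      % redundancy in bits/symbol

R = 1 - L/8;                      % compression ratio vs. a fixed 8-bit symbol
fprintf('H = %.4f, L = %.2f, redundancy = %.4f bits\n', H, L, red);
fprintf('compression ratio = %.2f%%\n', 100*R);

For this source the message-only ratio evaluates to 71.88%, the same figure reported for the dummy data set in Section 3.2.4.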

2.3 Lossy vs. Lossless Compression

Data compression can be done in two ways, lossy and lossless. As the name suggests,

data that has undergone lossy compression cannot be fully recovered during

decompression. In the era of analog technology lossy compression was inevitable. As soon

as an analog signal is quantized, information is lost that cannot be recovered. However, in some applications, such as radiology, small details must be preserved. This became relatively easier with digital technology and lossless compression techniques. The title of this document gives away that the focus here is lossless compression of satellite telemetry data. In the next chapter a few of these lossless compression techniques will be discussed.


The information mentioned so far is introductory only and barely scratches the surface of information theory in general and data compression in particular. For a deeper insight into the theory, please see the resources mentioned in [7],

[8], [9].


3 Design Procedure

3.1 Problem Definition

The basis for this project is to select a lossless compression algorithm that will be implemented on the CubeSat. Once launched and in orbit, the eleven sensors on board the

CubeSat will take one measurement per second for at least one hour a day. The measurements will be taken by voltage, current and temperature sensors.

The analog-to-digital converter (ADC) on the Microchip dsPic33 microcontroller used on the CubeSat supports 10 or 12 bit output formats [10]. Since the data logged by the sensors is not going to have a large range, 10 bit ADC resolution is sufficient for the output.

A simple calculation shows that, at a bare minimum, there will be approximately

11 sensors · 3600 measurements/hour · 10 bits ≈ 387 kb

of measurement data to be transmitted per day.

Taking all these factors into consideration, the compression algorithms have to be able to meet the following requirements:

i. Lossless compression,
ii. Compress at least 387 kb of data,
iii. Support 10 bit data size,
iv. Support 12 bit data size (if needed),
v. Time efficient,
vi. Memory efficient,
vii. Computationally not complex.


The last three items will be quantified later when the different algorithms are compared in

a benchmark test.

For this project four compression techniques have been considered: Static Huffman,

Adaptive Huffman, Static Arithmetic and Adaptive Arithmetic.

3.2 Huffman Codes

Huffman compression is one of the first modern compression techniques. Huffman

codes are probabilistic in nature, meaning that a source symbol probability distribution is necessary for generating the code-words.

3.2.1 How Huffman Codes Work

Figure 3.1-Static Huffman compression flowchart.

In a Huffman code each source symbol is mapped to a code-word, which will replace the source symbol during encoding. In essence, this is similar to Morse code; the higher the probability of a symbol, the shorter the corresponding code-word. Please note

that the source symbols constitute the source alphabet. The source symbols of this text are

the ASCII characters. The flowchart in Figure 3.1 shows the steps necessary to map source

symbols to code-words.

To demonstrate this with an example, an imaginary source can be assumed that has 5 symbols. For convenience, the source alphabet is denoted as {a1, a2, a3, a4, a5} and the corresponding probability distribution is given by {0.2, 0.35, 0.2, 0.1, 0.15}. Following the steps in the flowchart, the values have to be sorted in decreasing order, and the symbols with the smallest two probabilities have to be combined until one symbol is left. The result is known as a binary tree. The final step necessary to generate the code-words is to traverse the tree from the bottom to the top, assigning 0s and 1s on a left and right turn, respectively. The resulting codebook is {c1 = 10, c2 = 00, c3 = 11, c4 = 011, c5 = 010}, where ci is the code-word for symbol ai. The resulting binary tree is shown in Figure 3.2.

Figure 3.2-Huffman binary tree example.


3.2.2 Huffman Algorithm Design

In this section the actual design of the Huffman algorithm is described. The

flowchart in Figure 3.1 suggests that the first step in a Huffman algorithm is to sweep

through the source data and count the number of occurrences of each source symbol. From

this information the source symbol probabilities are easily calculated. For this purpose, a

MATLAB function has been written that sweeps through the data and outputs the unique

symbols and their frequencies in a two column matrix. The rows of this matrix represent

the symbols that are present in the source. This function is called get_freq_huff and the

code can be found in Appendix A. The get_freq_huff function returns the output variables

in sorted and tabulated order. As a result, separate processing steps are not necessary to

sort the probabilities in decreasing order.

All programing languages offer functions or tools to extract the minimum value from a data set. So, the second step of the algorithm is to recursively find the two smallest probability entries in the table, add them together, and assign this probability to a new

“symbol” in the table. The algorithm must also be able to trace the merger of symbols back to their original state, therefore every time there is a new table entry, two pointers are assigned that are going to point to the indices of the two symbols that have been merged to

create the new table entry. A suitable pseudo code that can create and manage such a table,

or multidimensional array can be found in [11].

Each entry in the table is called a root. The two roots with the smallest probabilities

are combined to create a new entry in the table. This new entry has probability equal to the

sum of the probabilities of the merged roots. Then the two roots are removed from the

table. However, the new entry has two pointers, one for each of the previous roots, which


are now called nodes. This continues until the top root has been reached and the sum of probabilities equals 1. The pseudo-code is shown below where T is the table. Each row of the table is enclosed in the square brackets and is separated from the other row by a comma.

The symbols for each row are designated by si, and the associated probability is P(si), where i is the symbol index. The left and right arrows are place holders for the left and right children respectively. In the actual algorithm the arrows will be replaced with the decimal numbers of the left and right children.

T = { [s1, P(s1), ←, →], [s2, P(s2), ←, →], …, [sn, P(sn), ←, →] }
while |T| > 1 {
    m1 = min(T);  T = T − m1
    m2 = min(T);  T = T − m2
    r  = [⊥, P(m1) + P(m2), m1, m2]
    T  = T ∪ r
}

First, the two minimum probability entries m1 and m2 are found in the table, and a new table entry r is created with probability equal to the sum of the probabilities of m1 and m2. The symbol for the new row is ⊥, which means that it is not a real source symbol. The new row is added to the table, whereas the rows corresponding to the two minimum probability entries will be removed from the table.

Symbol   Probability   Left node pointer   Right node pointer
s1       P(s1)         ↖                   ↗
s2       P(s2)         ↖                   ↗
⁞        ⁞             ⁞                   ⁞
sn       P(sn)         ↖                   ↗
Table 3.1-Table for storing Huffman Tree.

Clearly, this table is diminishing in size one row at a time, until there is a single row left.

The structure of the resulting table is shown in Table 3.1.
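The following simplified MATLAB sketch illustrates the merge-and-traverse procedure described above. It is not the get_codewords or huffman_zip code from Appendix A, and its tie-breaking may assign different (but equally long) code-words than Figure 3.2.

% Simplified sketch of the table-based Huffman construction described above.
% Illustrative only; not the project's implementation from Appendix A.
syms_ = [97 98 99 100 101];          % example source symbols (letters a-e)
prob  = [0.35 0.20 0.20 0.15 0.10];  % probabilities in decreasing order

n = numel(prob);
node_p     = prob(:);                % probability of every node
node_left  = zeros(n,1);             % 0 marks a leaf (no children)
node_right = zeros(n,1);
active     = 1:n;                    % roots still present in the table

while numel(active) > 1
    [~, order] = sort(node_p(active));            % two smallest probabilities
    i1 = active(order(1));  i2 = active(order(2));
    node_p(end+1)     = node_p(i1) + node_p(i2);  % new internal node
    node_left(end+1)  = i1;
    node_right(end+1) = i2;
    active = [setdiff(active, [i1 i2]), numel(node_p)];
end

% Traverse the tree from the root, assigning 0 on a left turn and 1 on a
% right turn, to recover the code-word of every leaf.
codes = cell(n,1);
stack = {numel(node_p), ''};                      % {node index, prefix}
while ~isempty(stack)
    idx = stack{1};  prefix = stack{2};  stack(1:2) = [];
    if node_left(idx) == 0
        codes{idx} = prefix;                      % leaf reached: store code-word
    else
        stack = [{node_left(idx), [prefix '0'], ...
                  node_right(idx), [prefix '1']}, stack];
    end
end
disp([cellstr(char(syms_')) codes])   % e.g. a->11, b->00, c->01, d->100, e->101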

The MATLAB implementation uses numbers to uniquely identify the table entries.

Given that the symbols can be any 8 bit ASCII character, they can take on index values of

0-255. All the other “symbols” that result from the merger of the two smallest probabilities are assigned numbers starting at 256 in incrementing order. The indices of the nodes to be merged are saved in the left and right pointers of the new entry. Therefore, as can be seen, all table entries have unique fields identifying its location in the binary tree represented by the table.

A few clarifications are necessary to help with traversing the tree. First, all unique symbols, represented by 8 bit ASCII characters, have a number from 0 to 255. Therefore, if a table index is less than 256, the entry in that field is a symbol. Please note that a symbol can only be a "leaf". If a "leaf" is found in the tree, then the end of that particular branch has been reached. These unique symbols are stored in the first column of the table. Second, if the index of a table entry is 256 or greater, then it is a "node", and will certainly have a "child" at least on one side. Third, the top of the tree, known as the root, will have the largest number assigned to it, since it results from the last summation operation. Keeping these observations in mind, a flowchart can be created describing the second step of the algorithm: traversing the binary tree.

The flowchart given in Figure 3.3 will traverse the left side of a binary tree. To traverse the right side of a tree the entry point in the flow chart will be the right side of the binary tree. The pointer to the right side of the tree is stored in column four of the table.


The MATLAB function that creates the table and traverses it to generate the code-words is

called get_codewords. The code for this function can be found in Appendix A.

This function outputs three 1-dimensional arrays. The first array contains the source

symbols in sorted order, the second array contains the source symbol probabilities, and the third array contains the corresponding code-words.

Figure 3.3-Flowchart for traversing a binary tree.

These three variables are fed into the

Huffman compression function called huffman_zip. The source code for huffman_zip can be found in Appendix A. The main purpose of the Huffman compression function is to go through all source symbols one at a time and replace them with their corresponding code- words. In addition to the compressed input the binary tree has to be transmitted as well.

This is necessary, otherwise the receiver will not be able to map the code-words to their symbols. To accomplish this as efficiently as possible the length of the longest code-word


is determined first. Then 3 bits are dedicated to store a number k such that 2^k > max code-word length. Since k can be at most 7, in this implementation the largest code-word cannot exceed 2^7 − 1 = 127 bits. Once the length of the next code-word is encoded, the actual code-word is appended, followed by the corresponding source symbol, which is represented by 8 bits.

This loop continues until all codewords and source symbols have been encoded and concatenated. The formatted result can be visualized in Figure 3.4. At the beginning of the compressed message an additional 13 bits are dedicated to represent the length of the binary tree. 13 bits have been chosen arbitrarily and they are enough to encode a binary tree length of 8191 bits. If the application requires a binary tree that takes up more bits, then the 13 bits can be changed to any number large enough to accommodate the tree. This is important for the receiver, so that it can separate the binary tree from the actual message and know when to stop parsing the code-words and symbols.

Figure 3.4-Static Huffman compression output format.

To have a sense of how a real example looks when converted to the format shown in Figure 3.4, the symbol and probability set given in Section 3.2.1 can be compressed using MATLAB. To simulate the probability distribution given in Figure 3.2, an input string consisting of thirty-five a, twenty b, twenty c, fifteen d and ten e letters was created. The compressed output of the algorithm is 303 bits long, therefore only the binary tree representation is shown, which occupies 78 bits. Due to space limitation, the details for the second and third symbols are not included. A subscript 10 means the number shown is decimal. The source symbol letters are shown in decimal values, where

97 is the decimal representation of letter a, and 101 is the decimal representation of e. The outcome is shown below.

Figure 3.5-Encoded binary tree example.

The output of the huffman_zip function is the final compressed data. The compressed output is self-contained, meaning that all the decompressor has to know is how to parse it. The receiver starts with parsing the input. As Figure 3.4 suggests, the Huffman binary tree is analyzed first. The next step is to read in bits from the compressed data and match them against the code-book; if there is a match, the code-word is replaced with the actual symbol, and if there is no match, an additional bit is read and the process repeats until a code-word is found. This has to be done for the entire length of the encoded stream. The decompressor function is named huffman_unzip and is given in Appendix A.
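A minimal sketch of the bit-by-bit matching loop just described is shown below, assuming the code-book and the compressed pay-load are already available as character strings (hypothetical values; this is not the huffman_unzip routine from Appendix A).

% Minimal sketch of the decoder's code-word matching loop. 'bits' is the
% compressed pay-load, 'codewords' and 'symbols' form a hypothetical code-book.
bits      = '1100100101';                 % example stream (hypothetical)
codewords = {'11','00','01','100','101'};
symbols   = 'abcde';

out = '';
buf = '';
for k = 1:numel(bits)
    buf = [buf bits(k)];                  % read in one more bit
    hit = find(strcmp(buf, codewords), 1);
    if ~isempty(hit)                      % match found: emit the source symbol
        out = [out symbols(hit)];
        buf = '';                         % start matching the next code-word
    end
end
disp(out)                                 % prints 'abde' for this stream

The repeated comparison against the whole code-book is exactly what makes long code-words expensive, as noted in Section 3.2.3.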

3.2.3 Efficiency of Huffman Compression

In the previous section the functionality of Static Huffman compression and decompression algorithms was discussed. In this section efficiency is considered. The


flowchart in Figure 3.3 may make the Huffman tree traversal seem to be a computationally

heavy process, which is not the case. The flowchart can be implemented very efficiently in

C or C++; even the MATLAB implementation proved efficient. However, note that there is

a large overhead with the Huffman compression, since the complete table, representing the

binary tree, has to be transmitted along with the actual message. Of course a binary tree

can be agreed upon before the compression. In this case the binary tree does not have to be

transmitted. The negative aspect to this method is that the code-words initially agreed upon

will not be very effective for all input messages. Therefore, a unique table for different

input messages is most certainly more efficient.

The second drawback that stands out with the Huffman method is the large memory

requirement. The table has to be maintained throughout the whole compression and

decompression process. For large input data this will be almost impossible using the limited

memory available on the Pic33 microcontroller.

The third drawback, although not as crucial, is the decompression. The

decompressor can read in a few bits in the beginning (corresponding to the shortest code-

word), and check the code-book for a match; if no match is found, then another bit is

read from the input. This loop is repeated until a code match is found. This can be very

inefficient for long code-words.

As a final observation note that compression and decompression can start only after

the complete input is available, hence the name Static Huffman

Compression/Decompression. The implication is that the source data is swept twice.

During the first sweep statistical data is collected, and only during the second sweep compression takes place. If n is the input alphabet size and m is the data length, the binary tree will have n leaves and a depth of log(n). The binary tree needs to be swept once for each source symbol, therefore using the big O algorithm complexity notation the time complexity of Static Huffman compression turns out to be O(m + m·log(n)), whereas the memory complexity (also known as space complexity) is O(n) [12].

3.2.4 Sample Output of the Static Huffman Algorithm

To conclude the discussion on Static Huffman Codes, the symbol and probability set given in Section 3.2.1 can be used to run a simulation. As previously mentioned, the probability distribution of the five symbols was used to create a dummy data set with the lower case letters a – e. For the simulation the decimal values of 97-101 were used, which correspond to the letters in the source alphabet. The letter assignment was done according to decreasing probability. Since symbol a2 has the highest probability in Figure 3.2, letter "a" in this experiment corresponds to symbol a2. The same logic is applied for the remaining letter – symbol combinations.

The compressor returns two output variables, "zipped" and "info". As the name of the first output variable suggests, this is the compressed output. The second output variable is a structure that contains basic information about the compressed data. The

Figure 3.6-Output variable info of the Static Huffman program.


structure for this particular experiment is shown in Figure 3.6. H, L were described in

Section 2.2 and represent the entropy and the expected code-word length. As the equations

predicted, the average code-word length is greater than the entropy, although they are very

close to each other. This implies that the achieved compression is almost as good as it can

get for the given source alphabet. The redundancy, measured by the difference between L

and H is 4.84%. The next notable measure in Figure 3.6 is the compression ratio R which

is equal to 0.6213 or 62.13%. This means the output is 62.13% smaller in size than the

input. This is very good performance for lossless compression, but of course this is a

dummy data set. The next two structure members, message_R and binary_tree_cost show

the effect of the binary tree on the compression ratio. If the binary tree did not have to

be transmitted the compression ratio would be 71.88%. The cost of sending the binary tree

is 9.75%. The next four members of the structure uncompressed_input_length,

compressed_input_length, total_compress_length and binary_tree_length show how many

bits the original data, the original data after compression, the final compressed output, and

the binary tree were respectively. Note that the binary tree added 78 bits to the output. The original data size is shown as 800 bits. This is because each of the 100 input symbols are of uint8 type and require 8 bits on the machine. The code-tree is stored in the codewords member, and the symbols are stored in the symbols member. The last member of the structure is called FORM and shows the original input data type. This is necessary for this

algorithm, since it is designed to handle various data types, including “char”, “uint8”,

“single”, “double” and the formats of the Pic33 ADC. In the real implementation there is going to be one format only, therefore the receiver does not have to know what type the input string was.


The figure below shows the code-words generated by the algorithm. The code-

words are identical to the ones shown in Figure 3.2, but in different order. To be more

specific, the code-words shown in Figure 3.2 are in lexicographic order, which means the

code-word for a1 comes before the code-word for symbol a2 and so on. However, the code-

words resulting from the simulation are sorted in decreasing probability order. Therefore,

the code-word on the top of the list in Figure 3.7 actually corresponds to symbol a2, because the probability of occurrence of symbol a2 equals 0.35, the highest in the list. Following the same logic, the second and third codewords in the list correspond to symbols a1 and a3, which both have probability of 0.2. This means that they can be interchanged as far as probability order is concerned, resulting in swapped code-words. And finally the fourth and fifth code-words are for symbols a5 and a4 (probabilities 0.15 and 0.1) respectively.

Figure 3.7-Output variable info.codewords of the Static Huffman program.

Figure 3.8-Histogram generated by the Static Huffman Program.

The final set of information generated by the encoder is a histogram of the source

alphabet. The histogram in Figure 3.8 is expected, since the source alphabet consisted of the letters a – e. The probabilities of each source symbol are shown on the vertical axis.

3.3 Adaptive Huffman Codes

In the previous section Static Huffman compression codes were discussed. One of the limitations of the Static Huffman technique was that all the input data must be available at the time of compression, which in turn will require all computational power from the

microcontroller on the CubeSat. The most likely scenario is that when the Pic33 is

compressing the data and preparing for transmission, it will not be able to perform other tasks in a timely manner, or if the compression itself is continuously delayed by interrupt request to the microcontroller, then the processor may not finish compressing on time for transmission. Although the Pic33 can be programmed to not take measurements while compression is in progress, a compression technique can be created that will not introduce

such a burden on the processor all at the same time. For this purpose the Adaptive Huffman compression algorithm needs to be taken into account.

The Adaptive Huffman compression algorithm allows the compression of data as it is acquired. The statistical information about the source data is updated after new data is received, which in this case would be temperature, voltage or current measurement.

Essentially, the Adaptive Huffman coding is the one-pass version of the Static Huffman technique [13]. This will enable the microprocessor to perform calculations when new data is received. Processing the data will still be computationally complex, but will be


performed per measurement, rather than for the complete data set. At least using the

Adaptive Huffman method the programmers of the microcontroller will have increased flexibility in working with the available buffers and memory.

3.3.1 How Adaptive Huffman Codes Work

As in the case of Static Huffman technique, a table must be created and maintained to successfully implement Adaptive Huffman Compression. This table has six columns and variable number of rows corresponding to all possible nodes. The six columns represent the node number, predecessor node pointer, symbol, left and right child node pointers and the weight of the node. The table begins with one row that represents the “not yet transmitted” symbol, designated by NYT. When a symbol is encountered for the first time the NYT node gives birth to a new NYT node and an external node containing the symbol.

The previous NYT node becomes an internal node. As soon as the new nodes have been created, weights are assigned to all the nodes in the tree. The weight of an external node can be looked upon as a counter that keeps track of how many times the symbol has appeared. The weight of an internal node is the sum of the weights of all its offspring nodes

[14]. The implication is that after each time a symbol is received the weight of the external node has to be updated, which in turn will require weight updates for all the internal nodes connected between the root and the external node of interest. The third step of the procedure is to rearrange the tree in increasing node weight order going from left to right and bottom to top. This process will be illustrated in detail in the next section.

Finally, once the tree has been updated, the same procedure is applied to generate the code-words as in the Static Huffman algorithm. The flowchart shown in Figure 3.9 summarizes the steps needed to encode using Adaptive Huffman codes. The update


procedure, fixed symbol code and other parts of the flowchart will be described in detail in the next section, where the actual algorithm design is considered.

Figure 3.9-Adaptive Huffman encoding flowchart [15].

The decompression proceeds in a similar manner as the encoder, since the receiver does not have initial knowledge about the source symbols. As new data is received, a tree is created and maintained identical to the one encountered in the encoder. The decoder is


slightly more complicated than the encoder, and will be described in detail in the next section.

3.3.2 Adaptive Huffman Algorithm Design

In the previous section Adaptive Huffman compression was introduced. In this section the actual design will be given. As mentioned earlier, to successfully compress and decompress, a table must be created. The table used for the Adaptive Huffman algorithm has six columns. The rows of the table represent unique node entries. If the source has k symbols, then a total of 2k−1 unique nodes exist. Of course, not all of these nodes have to be present during compression. The number of external and internal nodes depends on the number of symbols present in the data. In addition to the symbols, a row is dedicated in the table to store the NYT node information. This node is always present in the table, as it represents the symbol that has not been encountered yet. Therefore, if the source is composed of all 8 bit ASCII characters and the NYT symbol, then there can be a total of 2^8 + 1 = 257 symbols, and 2·257 − 1 = 513 nodes. These unique node numbers are stored in the first column of the table. The second column of the table contains pointers to the predecessor internal node. The third column in the table represents the symbol. All internal nodes will have -1 stored in the symbol column, because the actual source symbols can appear only as external nodes. The actual symbols will have their ASCII value stored in this field. The fourth and fifth columns of the table are left and right child node pointers.

These fields will contain the node numbers of the left and right child nodes. Please note that the first five columns describe a full binary tree, which can be traced from top to bottom. The last column of the table represents the weight of each node [16].


Node number   Predecessor pointer   Symbol   Left node pointer   Right node pointer   Weight
513           N/A                   ROOT     511                 512                  2
512           513                   "a"      -1                  -1                   1
511           513                   -1       509                 510                  1
510           511                   "b"      -1                  -1                   1
509           511                   NYT      -1                  -1                   0
⁞             ⁞                     ⁞        ⁞                   ⁞                    ⁞
Table 3.2-Table for storing Adaptive Huffman tree.

The table given above would result after transmitting the first two letters of

“abracadabra”. The -1 indicates that a field is not applicable, and is a convenient way for

the code to know that this can be skipped. The following discussion should shed light on

what this table looks like if viewed as a tree.

The table begins with the highest possible node number, which is assigned to the

root. Since at the beginning there is only the NYT node in the table, it is also the root of

the tree. When a symbol is encountered for the first time, a fixed code-word is sent followed

by the code-word for the NYT node. The fixed code-word is generated as follows: first e and r are selected such that m = 2^e + r, where m is the number of all possible symbols in the alphabet. Then, if k is the decimal value of the source symbol and k is between 0 and 2r − 1, the fixed code-word is the (e+1)-bit binary representation of k; otherwise the fixed code-word is the e-bit binary representation of k − r. Since in the scope of this project the source symbols are all possible 8 bit ASCII characters, e and r are 8 and 1 respectively, resulting in m = 257. Please note that the NYT represents an additional symbol, therefore there are 257 instead of 256 symbols. For example, if the decimal value of the transmitted symbol is 1, the fixed code-word will be the (8+1) = 9 bit binary representation of k = 1; if the decimal value of the transmitted symbol is 113, then the fixed code-word will be the 8 bit binary representation of 113 − 1 = 112. This convention was adapted from [14].
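A minimal MATLAB sketch of this fixed-code rule, with e = 8 and r = 1, reproduces the two examples above; it is an illustration of the convention from [14], not the project's NYT sub-function from Appendix B.

% Sketch of the fixed code-word rule with e = 8, r = 1 (m = 2^8 + 1 = 257).
% k is the decimal value of the symbol seen for the first time.
e = 8;  r = 1;
for k = [1 113]                       % the two examples from the text
    if k <= 2*r - 1
        code = dec2bin(k, e + 1);     % (e+1)-bit representation of k
    else
        code = dec2bin(k - r, e);     % e-bit representation of k - r
    end
    fprintf('k = %3d  ->  %s\n', k, code);
end
% k = 1   -> 000000001 (9 bits);  k = 113 -> 01110000 (8 bits, value 112)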

In addition to the new symbol, the NYT code-word has to be sent, to indicate that this symbol has not been transmitted before. As soon as the sequence is sent, the NYT gives birth to two external nodes. The left external node is assigned as the new NYT and the node number is decreased by 2, relative to the old NYT node. The right external node is assigned to the symbol, and the node number assigned to it is 1 less than the old NYT node. The old NYT node becomes an internal node.

Figure 3.10 shows how the contents of Table 3.2 were filled in three steps. The weights of the nodes are inscribed into the circles. The symbol and node number are

separated by a semicolon next to the circle. For example, note that according to Table 3.2 the left and right pointers of node 512 are -1. Indeed, looking at Figure 3.10 one can see that node 512 does not have any left or right child nodes, hence the value -1. Please note that the node weights are in the correct order: going from left to right and bottom to top the weights are in increasing order, therefore there is no update required. However, when the next letter of the sequence "r" is transmitted, the binary tree will look as shown on the left side of Figure 3.11. Now, the nodes are not in increasing order anymore. Consequently, the tree update procedure has to be called to ensure the nodes are in increasing order. If not, the symbols with highest occurrence will not have the shortest code-words. After the update, the tree will take the form shown on the right side of Figure 3.11.

Figure 3.10-Adaptive Huffman tree example.

The update procedure is a simple loop that starts at the bottom left side of the tree and compares the nodes to each other. In essence it is a sorting process. After the tree has been re-arranged, another function checks the weights of the nodes, and verifies they are up to date. This is necessary after the tree update, since certain nodes are being swapped during the update.
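A minimal sketch of the weight bookkeeping behind this step is shown below, with the tree stored in the six-column layout of Table 3.2. The numeric markers (0 for "no predecessor", -1 for "not applicable", 256 for the NYT symbol) are illustrative assumptions; this is not the update_weights function from Appendix B.

% Sketch: propagate a weight increment from a leaf up to the root using the
% predecessor-pointer column of a Table 3.2-style matrix.
NODE = 1; PRED = 2; SYM = 3; LEFT = 4; RIGHT = 5; WGT = 6;   % column indices

tbl = [513   0  -1 511 512 2;   % root (0 = no predecessor, -1 = internal node)
       512 513  97  -1  -1 1;   % external node for 'a' (ASCII 97)
       511 513  -1 509 510 1;   % internal node
       510 511  98  -1  -1 1;   % external node for 'b' (ASCII 98)
       509 511 256  -1  -1 0];  % NYT node (256 used here as its symbol number)

row = find(tbl(:,SYM) == 97);             % another 'a' arrives: locate its leaf
while ~isempty(row)
    tbl(row,WGT) = tbl(row,WGT) + 1;      % update this node's weight
    pred = tbl(row,PRED);                 % follow the predecessor pointer
    row  = find(tbl(:,NODE) == pred);     % empty once the root has been passed
end
disp(tbl)                                 % weights of nodes 512 and 513 increased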

Figure 3.11-Adaptive Huffman tree node update example.

As soon as the input has been compressed, 20 bits are concatenated to the sequence to be transmitted as shown in Figure 3.12. These bits encode the length of the input data, which the decoder will use to know when to stop decoding. A custom MATLAB function


called adaptive_huffman_zip along with three sub-functions called NYT, update_weights,

and update_order perform all the steps described above. An additional function called

adap_huff_get_codewords will trace the tree to generate the code-words. All these

functions can be found in Appendix B.

Figure 3.12-Adaptive Huffman compression output format.

The design of the decoder is similar to the encoder. When the transmitted sequence

is received the decoder separates the first 20 bits as they represent the length of the encoded

message. As in the case of the encoder, the decoder too, maintains a table identical to the compressor table. To decode the received stream, the decompressor reads in a single bit. It compares this bit to the code-words to find a match. If there is no match, the decoder reads in an additional bit. However, if there is a match, the algorithm checks if the matching symbol corresponds to the NYT node, in which case a flag is set to true and the algorithm

knows that this is the first appearance of a symbol. In accord with the compressor the

decompressor will read in bits to decode the new symbol. However, if the code-word

matches an existing symbol in the table, the algorithm will replace the code-word with the

actual symbol and send it to the output stream. As soon as the next symbol is determined,

the table is updated as described for the encoder. This algorithm repeats continuously until

the length of the decompressed string equals the length of the input string to the compressor. To make this description less scripted and easier to visualize, please consider the flowchart in Figure 3.13.


Figure 3.13-Adaptive Huffman decoder flowchart [17].

The MATLAB implementation of the decoder is called adaptive_huffman_unzip, and can be found in Appendix B.


3.3.3 Efficiency of Adaptive Huffman Codes

The discussion shows that Adaptive Huffman Compression technique is a dynamic

and powerful method. Of course this comes with added complexity for the encoder and

decoder. Compared with the Static Huffman method, the Adaptive Huffman technique

provides two important improvements. First, the tree does not have to be transmitted,

reducing the overhead necessary for code management. The encoder and decoder will have

the same initial conditions enabling them to create identical trees. Second, compression

can start even when the complete data set is not available. Of course, the algorithm

discussed in the previous section will have to be modified slightly to enable this. Instead

of dedicating 20 bits for the length of the sequence, a known termination sequence can be

agreed on between the transmitter and receiver. In addition to the benefits, there are also

disadvantages. Probably, the most significant obstacle for choosing the Adaptive Huffman

procedure is the size of the table. Although the table does not have to be transmitted, it still

has to be maintained in the transmitter and receiver. The binary tree created by this

algorithm has six columns compared to the four sustained by the Static Huffman method.

The second obstacle, as mentioned already, is the added computational complexity, which

balances the added benefit of being adaptive. The third disadvantage, with very serious consequences, is that there is no room for error. The mishandling of one symbol by the algorithm will result in a chain reaction rendering the complete data unrecoverable.

To quantify the time and space complexities of the adaptive method the big O notation is used. If m denotes the length of the data, and n represents the number of source symbols used for the data, then the time complexity is given by O(m·n), and the space complexity is given by O(n).


3.3.4 Sample Output of the Adaptive Huffman Algorithm

In this section there will be an opportunity to see sample output of the Adaptive

Huffman algorithm. For simulation and comparison purposes the same dummy data set will be used as for the Static Huffman compression, namely a source alphabet consisting of the letters a, b, c, d, e with probability distribution 0.35, 0.2, 0.2, 0.15, 0.1.

The output of the Adaptive Huffman algorithm has two variables as before, zipped

and info. The info structure in this case has fewer members, because the binary tree does

not have to be transmitted. Figure 3.14 shows the info structure of the simulation.

Figure 3.14-Output variable info of the Adaptive Huffman program.

The compression ratio R is 63.63% compared to the 62.13% of the Static Huffman

algorithm. In this case the adaptive method performed slightly better. The fact that the tree

did not have to be transmitted certainly played a role in this. The cost in this case refers to the 20 bit overhead added to inform the receiver of the size of the message.

Figure 3.15-Final binary tree table of the Adaptive Huffman simulation.

The codewords

member shows that there are 11 code-words for this source. However, as the third column of the table in Figure 3.15 suggested, there are only five real symbols represented by the numbers 97-101. The symbol numbered 256 represents the NYT node as discussed above. Therefore the final code-words are on the 2nd, 4th, 5th, 6th and 8th lines in Figure 3.16

above. However, these code-words have been constantly updated during the compression

and decompression processes and have not been the same at all stages. This implies that at

different stages of compression and decompression the letter "a" is not always represented by the code-word "11".

Figure 3.16-Output variable info.codewords of the Adaptive Huffman program.

3.4 Arithmetic Coding

Arithmetic compression is a state-of-the-art technique that approaches optimal

performance [18]. To understand the most basic principle behind Arithmetic compression,

one has to go back to Huffman Compression. The examples and discussions in Section 3.2

used the source as it was given, without modifications. However, Block Coding theory [19]

suggests that the expected code-word length of the nth transformation of the source approaches the entropy as n approaches infinity. This is shown in Equation 3.1. Transformation of the source means an extension of the source, and depends on n, the order of transformation.

lim_{n→∞} L^(n) / n = H(S)   (3.1)

For example, the second order transformation of a source with symbols a1, a2 is a1a1, a1a2, a2a1, a2a2, and similarly, the third order transformation of the source would consist of the symbols a1a1a1, …, a2a2a2. Intuitively Equation 3.1 is logical. As n approaches infinity, the input string to the encoder would be considered one symbol of the extension alphabet. To convert the expected code-word length of the extension to a first order equivalent, one has to divide it by the order of the extension.

The relevance of the discussion above to Arithmetic Compression is that Arithmetic compression, in effect, treats the complete input string as one block, one "single symbol" consisting of multiple source symbols, whereas Huffman source extensions grow exponentially in size and become difficult to manage. In fact, Huffman code extensions grow exponentially; the nth extension of a source alphabet of size k has k^n symbols.

3.4.1 How Arithmetic Coding Works

In Arithmetic Coding the input message is represented by a numeric tag at the output of the encoder. The tag is guaranteed to be unique. An example will help to understand the encoding procedure that results in a unique tag. The source symbol and probability distribution from the previous sections will be used to encode the sequence a4a2a3. As a reminder, the source alphabet consists of five symbols a1, a2, a3, a4, a5 with probability distribution 0.2, 0.35, 0.2, 0.1, 0.15. After a4 has been transmitted, the appearance of symbol a2 is a conditional probability defined as P(a2|a4). Similarly, the transmission of a3 in the sequence is a conditional probability given as P(a3|a4a2).

Textbooks use a convenient way to demonstrate the conditional probabilities and the compression process in a visual manner, which is shown in Figure 3.17.

Figure 3.17-Generating unique tag for Static Arithmetic Coding.

Initially, the symbols are lined up as shown on the left side, and the probabilities are summed together. This creates the cumulative distribution function of the source. The first transmitted symbol is a4, therefore the upper and lower bounds of a4 become the upper and lower bounds of the second stage, corresponding to the second symbol in the sequence. The intervals of all the symbols have to be rescaled accordingly. As an example, the lower bound of symbol a1 at the second stage is 0.75, whereas the upper bound is u = (u − l) · F_X(a1) + l = (0.85 − 0.75) · 0.2 + 0.75 = 0.77, where u and l are the upper and lower bounds of the previous interval. All other intervals are calculated similarly for all stages. At the third stage, after the complete sequence has been encoded, the interval corresponding to the sequence is (0.78925, 0.79625).

Finally, as a convenience, the tag is chosen to be the midpoint of the interval of interest. Therefore the tag corresponding to the sequence a4a2a3 is T = (0.78925 + 0.79625)/2 = 0.79275. To generalize the process above, Equations 3.2 – 3.4 [20] can be used to find the tag interval and the tag of a sequence of any length in a sequential order.

l^(n) = l^(n−1) + (u^(n−1) − l^(n−1)) · F_X(x_n − 1)   (3.2)

u^(n) = l^(n−1) + (u^(n−1) − l^(n−1)) · F_X(x_n)   (3.3)

T = (l^(n) + u^(n)) / 2   (3.4)

The last stage in Figure 3.17

makes it clear that once the upper and lower bounds of a sequence are known, the tag can

be calculated. Equations 3.2 – 3.4 make it possible to find only the interval of interest of a

sequence of any length. l^(n) and u^(n) are the lower and upper bound of the sequence after n symbols, respectively. The (n) in the superscript denotes the n-th symbol in the sequence, and F_X is the CDF. For

example, using Equation 3.2 one can calculate the lower bound for a4, the first symbol in the sequence, to be

l^(1) = l^(0) + (u^(0) − l^(0)) · F_X(a4 − 1) = 0 + (1 − 0) · F_X(a3) = F_X(a3) = 0.75

Initially the lower and upper bounds are initialized to 0 and 1 respectively. The symbol before a4 is a3, therefore the CDF of a3 is calculated from Figure 3.17 as F_X(a3) = P(a1) + P(a2) + P(a3) = 0.2 + 0.35 + 0.2 = 0.75. Of course it does not make sense to transmit a

floating point number, therefore the binary representation of the tag is needed for

transmission. As a side note, remember that a binary number

0.b1 b2 b3 ⋯ bn,   where bi = 0 or 1,

is converted to the following equivalent decimal number:

b1 · 2^(−1) + b2 · 2^(−2) + b3 · 2^(−3) + ⋯ + bn · 2^(−n)

As a simple example, the number 0.75 is equivalent to 0.5 + 0.25 = 2^(−1) + 2^(−2) = 0.11 in binary. For transmission the preceding 0 and period can be discarded. However, not all decimal numbers can be converted into finite binary equivalents. For this reason, the binary representation of the tag may not be equal to the tag calculated above. As long as the binary representation of the tag is a number within the upper and lower bounds of the interval, the tag requirements are met. This is satisfactory, because the interval for each transmitted sequence is unique. Now, keeping all these in mind, the binary tag for the sequence a4a2a3 is 11001011 = 1·2^(−1) + 1·2^(−2) + 0·2^(−3) + 0·2^(−4) + 1·2^(−5) + 0·2^(−6) + 1·2^(−7) + 1·2^(−8) = 0.79296875, which is within the required interval.
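The conversion can be checked with a short MATLAB sketch (the bit string and interval are the values computed above):

bits = '11001011' - '0';                    % fractional tag bits as numbers
tag  = sum(bits .* 2.^-(1:numel(bits)))     % 0.79296875
ok   = (tag > 0.78925) && (tag < 0.79625)   % true: tag lies inside the interval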

At the receiver, the decoder will take the tag and try to find the corresponding

source sequence. During the first step the decoder compares the tag to the intervals of the

first stage, which are just the source symbols and their CDF. The first symbol will be

decoded as the symbol whose interval encompasses the tag. Then, the tag is updated using

Equation 3.5 [21], and the previous step is repeated for the new tag.

t* = (t* − F_X(x_k − 1)) / (F_X(x_k) − F_X(x_k − 1))   (3.5)

As an example, consider that the receiver has received 0.79275 and it uses this value to decode the sequence. Initially, the decoder compares the tag to the CDF of the source symbols. From Figure 3.17 it is apparent that 0.79275 is within 0.75 − 0.85, therefore the first symbol decoded is a4. Then the tag is recalculated as (0.79275 − 0.75) / (0.85 − 0.75) = 0.4275. Comparing the new tag to the source CDF, the decoder will choose symbol a2 because 0.4275 is enclosed within (0.2 − 0.55). Finally, during the next iteration the tag equals (0.4275 − 0.2) / (0.55 − 0.2) = 0.65, corresponding to symbol a3. The resulting decoded sequence a4a2a3 is identical to the transmitted sequence.
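The floating-point version of this example can be reproduced with the short MATLAB sketch below, which applies Equations 3.2 – 3.4 directly. The names (P, F, seq) are illustrative; the project code uses the integer implementation of Section 3.4.2 instead.

% Floating-point tag generation for the example sequence a4 a2 a3.
P   = [0.2 0.35 0.2 0.1 0.15];     % probabilities of a1 ... a5
F   = [0 cumsum(P)];               % F(k+1) = F_X(a_k), with F(1) = 0
seq = [4 2 3];                     % the sequence a4 a2 a3 as symbol indices

l = 0;  u = 1;                     % initial lower and upper bounds
for x = seq
    w = u - l;                     % width of the current interval
    u = l + w * F(x + 1);          % Equation 3.3
    l = l + w * F(x);              % Equation 3.2
end
tag = (l + u) / 2;                 % Equation 3.4
fprintf('interval = (%.5f, %.5f), tag = %.5f\n', l, u, tag)
% prints: interval = (0.78925, 0.79625), tag = 0.79275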

3.4.2 Arithmetic Coding Algorithm Design

The example and discussion in the previous section shed light on Arithmetic

Coding, but cannot be directly converted into a useful algorithm. There are two major

obstacles. First, the microcontroller would run out of precision for any realistic sequence,

because the tag interval becomes smaller with each additional symbol in the sequence. The

second problem is the inefficient tag generation process. Powerful and efficient algorithms

have been created to overcome these obstacles. In order to avoid floating point operations

and very small decimal numbers, the tag interval is rescaled in a predefined manner that

keeps all numbers in a manageable range. An example of this would be rescaling the upper

and lower bound in the first stage of Figure 3.17 from (0.75, 0.85) to (0, 1).

For this project a version of Arithmetic Coding, called Integer Implementation, is

used. With this method all numbers are converted into integers, including the upper and

lower bounds of the intervals. There are a few nuances that have to be accounted for when

performing rescaling to ensure that the sequence can be decoded unambiguously. The first

step towards successful rescaling is to select a large initial upper boundary. If |x| is the length of the input sequence to the encoder, and p = 2 + ⌈log2(|x|)⌉, then the upper bound

is named whole and is given by Equation 3.6. Equations 3.7 – 3.8 show the half and quarter

values needed for the rescaling algorithm. Please see [22] for details about the derivation

whole = 2^p   (3.6)

half = whole / 2 = 2^(p−1)   (3.7)

quarter = whole / 4 = 2^(p−2)   (3.8)

of these values. There are three scenarios that trigger rescaling. The first two scenarios

Figure 3.18-Arithmetic Coding Case 0 and Case 1 rescaling [23].

are straightforward. These two cases are depicted in Figure 3.18. l and u are the lower

and upper boundaries of the current stage. The cases are shown on the left side. The

conditions when each of these cases are triggered are shown inside the figure, and the

operation that takes place for each case is shown on the right side. The numbers and 1 correspond to half and whole as calculated in Equation 3.7 and Equation 3.6 respectively.

Case 0 in Figure 3.18 shows that when the upper bound is below the half mark, the algorithm transmits a 0, followed by the rescaling operation. Case 1 is triggered when the lower bound is on the right side of the half mark; the encoder then emits a 1, followed by the rescaling operation.

The third scenario is shown separately in Figure 3.19, because it requires a little more care. The third rescaling is called Case S and is triggered when the upper boundary is contained in the third quarter and the lower bound is contained in the second quarter. The rescaling for Case 0 and Case 1 cannot be applied to the scenario shown in Figure 3.19, because the interval may keep shrinking around the midpoint and eventually crash the program.


Therefore, when the conditions for Case S are present, a counter S is initialized and the rescaling shown for Case S takes place until the lower bound is contained within the first quarter, or the upper bound is contained within the fourth quarter. These conditions are shown as Case 2Q and Case 3Q. After each rescaling operation the counter is incremented by 1. As soon as condition Case 2Q or Case 3Q is satisfied, the counter increments one last time and the encoder emits S zeroes or ones depending on the condition. These are shown in detail in Figure 3.19.

Figure 3.19-Arithmetic Coding Case S rescaling [23].
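To make the three cases concrete, the following MATLAB sketch shows one possible form of the rescaling loop for a single encoding step. The constants and starting bounds are arbitrary example values, and the variable names are illustrative rather than those of the arithmetic_zip function in Appendix C. Following the standard integer implementation in [22], the S pending bits are emitted together with the next Case 0 or Case 1 bit, which plays the same role as the Case 2Q / Case 3Q emission described above.

```matlab
whole = 64;  half = whole/2;  quarter = whole/4;   % Equations 3.6 - 3.8 with m = 6
l = 20;  u = 30;                                   % example current bounds for one symbol
s = 0;                                             % pending Case S bits carried over

% One pass of the rescaling loop (Cases 0, 1 and S) for the current symbol.
bits = '';                                    % bits emitted during this step
while true
    if u < half                               % Case 0: interval entirely in the lower half
        bits = [bits '0' repmat('1', 1, s)];  % emit 0 plus any pending Case S bits
        s = 0;
        l = 2*l;  u = 2*u + 1;
    elseif l >= half                          % Case 1: interval entirely in the upper half
        bits = [bits '1' repmat('0', 1, s)];  % emit 1 plus any pending Case S bits
        s = 0;
        l = 2*(l - half);  u = 2*(u - half) + 1;
    elseif l >= quarter && u < 3*quarter      % Case S: interval straddles the midpoint
        s = s + 1;                            % remember one pending bit
        l = 2*(l - quarter);  u = 2*(u - quarter) + 1;
    else
        break;                                % no rescaling condition holds any longer
    end
end
```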

Exact implementation of the conditions is very important, otherwise the decoder will not be able to decompress the sequence properly, or the program may end up in an infinite scaling loop. As will be seen later, the decoder is highly dependent on the S zeroes or ones to undo the rescaling and find the proper symbol. For successful rescaling the upper and lower bound must be recalculated for the next symbol to be encoded. Equation 3.2 and

Equation 3.3 have to be modified slightly to suit the integer implementation algorithm. The results are shown in Equations 3.9 and 3.10.

l_new = l + ⌊ (u − l + 1) · cum_count(k − 1) / total_count ⌋    (3.9)

u_new = l + ⌊ (u − l + 1) · cum_count(k) / total_count ⌋ − 1    (3.10)

Here l and u are the lower and upper bounds from the previous step, and k is the index of the symbol currently being encoded.

In the integer implementation, the cumulative distribution cannot be used, as shown

in Equations 3.2 and 3.3, since the CDF is not an integer. For the integer implementation

the CDF has been replaced with a variable called cum_count, which is the cumulative count

of the source symbols. For example, assume the source consists of three symbols which appear in the sequence 5, 7 and 2 times (the first symbol appears 5 times, etc.). Then the cum_count variable is constructed as [5, 5+7, 5+7+2] = [5, 12, 14]. So, when the second symbol is at the input of the encoder, the algorithm will select 12 as the scaling factor. Notice the similarity between cum_count and the CDF. The total_count variable in the denominator of Equations 3.9 and 3.10 represents the length of the sequence to be encoded. To ensure that the bounds result in integers, the calculations are rounded down to the nearest integer. This is denoted as ⌊ ⌋ in the equations and can be implemented in MATLAB using the floor function.
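To make Equations 3.9 and 3.10 concrete, the MATLAB sketch below updates the bounds for one symbol using the three-symbol cum_count example above; the variable names are illustrative and do not necessarily match the code in Appendix C.

```matlab
% Bound update for one symbol (Equations 3.9 and 3.10), illustrative values.
cum_count   = [0 5 12 14];            % cumulative counts with the prepended 0
total_count = cum_count(end);         % length of the sequence, here 14
m     = 2 + ceil(log2(total_count));  % word length, see Equation 3.6
whole = 2^m;
l = 0;                                % initial lower bound
u = whole - 1;                        % initial upper bound
k = 2;                                % encode the second symbol of the alphabet

range = u - l + 1;
u = l + floor(range * cum_count(k + 1) / total_count) - 1;   % Equation 3.10
l = l + floor(range * cum_count(k)     / total_count);       % Equation 3.9
```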

Another advantage of this process is that the rescaling operation and the binary tag

generation are combined into a single loop. This is possible, because each bit added to the

binary tag in the example discussed in Section 3.4.1 effectively multiplies the number by 2. Therefore, since during each rescaling shown in Figure 3.18 and Figure 3.19 the upper and lower bounds are multiplied by 2, a bit can be issued at each rescaling. This is a powerful and

efficient way of generating the binary tag without complex calculations.

Another important observation is that multiplication by two in decimal can be

implemented as a shifting operation in binary, reducing computational complexity for the

Pic33.


The last, and probably the most important observation is that in order to check whether the lower or upper bounds are below or above the half mark of the interval, one

has to convert the lower and upper boundaries into binary numbers and look at the most significant bit (MSB). If the MSBs of the upper and lower bounds are both 1, then the interval is

contained above the mid-point. This is described in detail in [24] and can increase

efficiency significantly if implemented.

A few final notes are necessary before an encoding flowchart diagram can be given.

The source symbols for the Arithmetic compression technique are represented by the 8 bit

extended ASCII table, as was the case for the previous two compression techniques. The algorithm sweeps through the source twice, since this is Static Arithmetic Coding. During the first sweep statistical data is collected. In particular two one-dimensional arrays are

generated. The first array, called sym, contains the symbols in sorted order. The second array, called count, contains the symbol counts in the input sequence. The symbol count sequence is used to generate the cum_count array used by the algorithm. The sym and count arrays are aligned. This means that the count of the symbol at the first index of the sym array is located at the first index of the count array. The cum_count array is offset by 1 index, because a 0 is prepended to it before compression begins. The necessity of the 0 at the beginning can be seen in Equation 3.9. If the symbol being encoded is located at the first index of the symbol array, then according to Equation 3.9 the algorithm needs the cumulative count corresponding to the symbol before the symbol of interest. However, since there is no symbol before the first symbol, the cum_count value is 0.
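As a minimal sketch of this first sweep (the actual get_freq_arith function is in Appendix C), the following MATLAB fragment builds the three arrays for the "abracadabra" example used later in this section; the implementation in the appendix may differ in detail.

```matlab
% First sweep: gather symbol statistics for the input sequence (illustrative sketch).
msg   = 'abracadabra';                        % input sequence of 8 bit ASCII symbols
sym   = unique(msg);                          % distinct symbols in sorted order: 'abcdr'
count = arrayfun(@(s) sum(msg == s), sym);    % occurrences of each symbol: [5 2 1 1 2]
cum_count = [0 cumsum(count)];                % cumulative counts with a 0 prepended
```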

During the second sweep through the source, compression takes place. The

algorithm gets the first symbol in the sequence and looks up the sym array to find its index


in the symbol list. It then uses the index number to look up the cum_count value corresponding to this particular index. As soon as the lower and upper bounds have been calculated, rescaling of the intervals begins until the code corresponding to the first symbol has been emitted. The process starts over for the second and the remaining symbols in the sequence. The discussion in the previous few pages is combined into the flowchart shown in Figure 3.20.

Figure 3.20-Static Arithmetic Coding flowchart.

In addition to the binary tag, the transmitter also has to send the symbols and their counts to the decoder. For this purpose, the source symbols and their counts are encoded


and appended. First the algorithm checks how many distinct symbols are present in the

source and converts this value to an 8 bit binary number. Then the encoder passes through

the sym array and converts the symbols to 8 bit numbers and appends them to each other.

The symbol counts are encoded in a slightly different manner. The algorithm sweeps

through the count array and for each symbol count it calculates the number of bits necessary

to represent the value. This information is encoded in a 4 bit binary number followed by

the binary representation of the count.
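The sketch below shows one way this header could be assembled in MATLAB, reusing the sym and count arrays from the first sweep; the variable names are illustrative and the real encoder in Appendix C may build the header differently.

```matlab
% Assemble the symbol/count header of Figure 3.21 (illustrative sketch).
header = dec2bin(numel(sym), 8);                      % number of distinct symbols, 8 bits
for k = 1:numel(sym)
    header = [header dec2bin(double(sym(k)), 8)];     % each symbol as an 8 bit ASCII value
end
for k = 1:numel(count)
    nbits  = max(1, ceil(log2(count(k) + 1)));        % bits needed to represent this count
    header = [header dec2bin(nbits, 4) ...            % 4 bit field: how many bits follow
                     dec2bin(count(k), nbits)];       % the count itself
end
```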

Figure 3.21-Static Arithmetic Coding output format. To demonstrate the output format shown in Figure 3.21, the sequence

“abracadabra” can be used. Table 3.3 shows the source alphabet information of the

sequence. Based on the information in Table 3.3, Figure 3.22 shows how the symbol and count information for symbols “a” and “b” would look.

The first 8 bits represent the number of symbols, which is 5 in this case. The second 8 bits represent symbol “a”; similarly, the third set of 8 bits represents symbol “b”. The

ellipsis after symbol “b” means that all source symbols must be represented in this way.

After each symbol has been encoded, the symbol counts will follow. The first four bits

represent the number of bits required to represent the count of symbol “a”. In this example,

the symbol appears 5 times, therefore 3 bits are sufficient to encode the number 5. The next


3 bits show the count of symbol “a”. This process is repeated for the remaining source

symbols.

Symbol   Symbol ASCII Value   Symbol Binary Value   Count   Cumulative Count
-        -                    -                     -       0
a        97                   01100001              5       5
b        98                   01100010              2       7
c        99                   01100011              1       8
d        100                  01100100              1       9
r        114                  01110010              2       11

Table 3.3-Source alphabet information of the sequence "abracadabra".

Figure 3.22-Static Arithmetic Coding output example. The decoder implementation of Static Arithmetic coding is similar to the encoder, with a few adjustments. In Section 3.4.1 an example of Arithmetic decoding was given that

utilized Equation 3.5 to update the tag approximation after each symbol was identified.


This process is slightly different when integrated into the algorithm, because the tag is

represented as a binary sequence and not a decimal number. Therefore, in order to set the

initial tag value the decoder portion of the algorithm will approximate the tag from the

input to the decoder. In order for this to take place the decoder extracts the number of bits

that are necessary to approximately represent the tag. This calculation was shown above as

m = 2 + ⌈log₂|X|⌉, where m is the number of bits and |X| is the length of the sequence to be decoded. Note that |X| can be extracted from the cum_count sequence generated by the decoder. As soon as m is found, the decoder uses the following equation to approximate the tag value:

z = z + 2^(m−i),  applied for each i ≤ m, i ≤ |X|, whose received bit equals 1    (3.11)

z in the equation is the tag approximation, and its initial value is set to 0. The index i points to the next bit of the received binary sequence.

The conditions above ensure that the loop does not overshoot the number of bits

necessary to represent the sequence and the total length of the sequence. Additionally,

this calculation takes place if the bit of interest is a 1, otherwise it does not add any value

to the tag. Once the tag is approximated, the program uses Equations 3.9 and 3.10 to

calculate the lower and upper bounds of each possible symbol from the 8 bit ASCII table.

The symbol of interest is the symbol whose upper and lower bounds contain the tag

approximation. As soon as a symbol is decoded, interval rescaling takes place identical to

the encoder. However, in this case the tag approximation is also rescaled along with the

lower and upper bounds. Another nuance of the decoder is that after each symbol has been

decompressed, index i in Equation 3.11 has to be incremented by 1 to point to the next

incoming bit. If the bit is a “1” then the tag value has to be incremented by 1. This is

necessary, since each rescaling operation is in effect shifting the binary representation of


the tag to the left by 1 bit, and one additional bit has to be added at the location of the previous LSB. The decompression stops after the number of decoded symbols equals the number of original symbols encoded. The flowchart for the decoder is given in Figure 3.23.
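A minimal sketch of the tag initialization of Equation 3.11 is given below, assuming the received bit stream is stored in a char vector; all names and values are illustrative and need not match arithmetic_unzip in Appendix C.

```matlab
code = '110010111010';                   % example received bit stream (char vector)
m    = 6;                                % number of tag bits, m = 2 + ceil(log2(|X|))

% Approximate the initial tag from the first m received bits (Equation 3.11).
z = 0;                                   % tag approximation, initial value 0
i = 1;                                   % index of the next bit in the received stream
while i <= m && i <= length(code)        % do not read past m bits or the end of the stream
    if code(i) == '1'
        z = z + 2^(m - i);               % a 1 bit contributes the weight of its position
    end
    i = i + 1;
end
```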

Figure 3.23-Static Arithmetic decoder flowchart.


The MATLAB implementation of the decoder and encoder are given in Appendix

C. The Arithmetic encoder is called arithmetic_zip and the decoder is called arithmetic_unzip. An additional function called get_freq_arith runs through the input before compression begins to gather statistical information. This function is similar to the get_freq_huff function encountered for Huffman compression.

3.4.3 Efficiency of Arithmetic Coding

Arithmetic coding is a state-of-the-art compression technique, approaching near optimal lossless compression results [18]. On top of this, the time complexity of Arithmetic coding is linearly dependent on the input message size, which was not the case for the Huffman implementations in this project, where the measured dependency was exponential.

Using big O notation, and assuming that the length of the input message is k and the number of distinct source symbols is n, the time complexity of Arithmetic coding is O(k + n). The memory complexity can be given as O(n) [25], assuming an efficient implementation of the algorithm. As can be seen, the space complexity depends only on the source alphabet size.

The decoder may perform slightly slower than the encoder, since it has to search for the tag interval. Nevertheless, the decoder is also linearly dependent on the message size. In either case the complexity of the decoder will not be a bottleneck for this project, because it will be performed on the ground station with powerful enough computers.

The encoding is a different matter, since it will be performed on a slower Pic33 microcontroller on the CubeSat.


3.4.4 Sample Output of the Static Arithmetic Algorithm

The Arithmetic encoder has two outputs. The first output, called zipped, contains the compressed message and is sufficient for the decoder to perform decoding. The second output is a structure type variable named info. The info output of the Arithmetic encoder contains less information than the Static and Adaptive Huffman outputs. There are no code-words and no binary tree in this case. The output only contains the compression ratio and the cost of sending the symbols and their counts. The expected code-word length also found its way into the output, but it is not necessary, as there are no code-words here. In this particular instance the average code-word length is the ratio of the encoded sequence length to the message size.

Figure 3.24-Output of the Static Arithmetic program.

The information structure resulting from compressing the sequence “abracadabra”

and the compressed binary sequence are shown in Figure 3.24. Of course, the compression ratio is not very impressive and should rather be called expansion. This is due to the overhead of transmitting the symbols and their counts, which represents 87.5% of the output sequence.

However, the compression ratio for the five symbol example used for the previous two


compression methods turns out to be 61%, which is in line with the compression ratios

provided by the Huffman and Adaptive Huffman methods.

3.5 Adaptive Arithmetic Coding

In the previous section Static Arithmetic coding was discussed. One of the limitations

of the static compression methods is that the complete input data set must be available at

the time of compression. However, this may not be realistic in terms of storage on the

CubeSat. Therefore, it may be desirable to encode the data as it is being captured and then save it. This is most certainly the more likely approach. For this purpose, in this section

Adaptive Arithmetic coding will be considered.

3.5.1 How Adaptive Arithmetic Coding Works

Section 3.3 revealed the great difficulty and challenge that came during the transition from Static Huffman to Adaptive Huffman compression. This is not the case for

Arithmetic compression. In fact, the Static and Adaptive Arithmetic Coding techniques are almost identical to each other. The few minor differences are given in this section.

The first change in the algorithm is the count array. Since the program does not have the opportunity to sweep through the input, the symbol counts in the data are not known. As a workaround all symbol counts are initialized to 1.

The second difference is the total_count variable, which is the length of the input sequence. Again, since the length of the sequence is not known in advance, total_count is initially set to 2^20 − 1. This is an arbitrarily large number and gives the algorithm

enough room to converge during rescaling. A few experiments have revealed that smaller

values for the total_count estimate send the algorithm into an infinite loop, and the program


is not able to recover. If the approximate length of the input sequence is known, however,

a value slightly greater than the expected length can be set for the initial value of

total_count. This is also sufficient for the program to converge. The large number is just a

safe starting point for data of different lengths, but results in a few extra bits in the output

sequence. As soon as the first input symbol is encoded, total_count is updated with the

actual value that can be found at the last index of the cum_count array. After each symbol passes through the encoder, the count for that symbol has to be incremented by 1.
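The model bookkeeping described above can be sketched in a few lines of MATLAB; the variable names are illustrative and do not necessarily match adaptive_arithmetic_zip in Appendix D.

```matlab
% Adaptive model: initialization and per-symbol update (illustrative sketch).
count       = ones(1, 256);              % every extended-ASCII symbol starts with count 1
cum_count   = [0 cumsum(count)];
total_count = 2^20 - 1;                  % conservative initial estimate of the sequence length

c = double('a');                         % ASCII value of the symbol just encoded (example)
% ... the symbol is coded with Equations 3.9 and 3.10 using cum_count and total_count ...

count(c + 1) = count(c + 1) + 1;         % update the model after the symbol has been coded
cum_count    = [0 cumsum(count)];
total_count  = cum_count(end);           % from now on the actual cumulative total is used
```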

Another difference and advantage of the adaptive technique is the small overhead of the compressor. Unlike Static Arithmetic coding, the symbols and their counts do not have to be transmitted for the adaptive method. Only the size of the sequence is transmitted to the decoder, which takes 20 bits. If the sequence to be encoded is larger than 2^20 − 1 = 1,048,575 symbols, then additional bits can be dedicated to encoding the length of the sequence.

This information is necessary for the decoder in order to stop decompression when all

symbols have been decoded. As an alternative, the 20 bits can be replaced by a special

sequence that indicates the end of the stream. If the encoder and decoder can agree on such

a sequence, the overhead can be completely eliminated. However, for a large input

message, the 20 bits will not represent significant cost. The format of the compressor output

is identical to the Adaptive Huffman output, and is given in Figure 3.12. Appendix D

includes the adaptive_arithmetic_zip and adaptive_arithmetic_unzip algorithms

implemented according to the above discussion.

Considering the similarity of the Static and Adaptive methods, and the discussion

above, a separate sub-section is not dedicated to the Adaptive Arithmetic algorithm design.


3.5.2 Efficiency of Adaptive Arithmetic Coding

As has already been emphasized, Arithmetic coding is a state-of-the-art compression tool. Although the Adaptive counterpart adds another level of freedom to this algorithm, it comes at the expense of a slightly reduced compression ratio. As will be seen later, the compression ratio difference between the Static and Adaptive Arithmetic techniques can be negligible, depending on the source alphabet and the symbol probabilities. The algorithm run time is also slightly increased for the adaptive approach. Using big O notation, and assuming that the length of the input message is k and the number of distinct source symbols is n, the time complexity of Adaptive Arithmetic coding is O(k·n), and the memory complexity is identical to its static counterpart, given as O(n) [26].


4 Performance Tests and Comparison

In Chapter 3, four compression algorithms have been discussed, including detailed descriptions of their operation. In this chapter, benchmark tests will be run to determine the most suitable compression technique for the CubeSat project. Compression and decompression time, compression ratio and algorithm efficiency have been chosen as the most important factors for comparison. In order to perform benchmark tests, it is necessary to understand the Pic33 ADC output form factors, as they affect test outcome. Therefore

Section 4.1 is dedicated to this topic.

4.1 Pic33 Analog-to-Digital Converter Output Format

The Pic33 microcontroller chosen for this project belongs to the dsPIC33F family. This family consists of high-performance digital signal controllers.

Figure 4.1-dsPic33 ADC Output Format.


The ADC form factors are of particular interest, because the input data for the

benchmark tests had to be converted to one of these form factors for sound test results.

According to the dsPic33F datasheet [10] the ADC can sample with 10 or 12 bit resolution

and output 16 bits (2 bytes) per measurement. Both the 10 and 12 bit modes can take four different forms. The different combinations of the output forms and the 2 possible processing modes are shown in Figure 4.1. As established in the introduction, the 10 bit resolution mode is sufficient for this project.

In order to convert the sample data to the formats above, two MATLAB functions have been created, called conversion_input_10bit and conversion_output_10bit which are available in Appendix E. The first function converts the input data to any of the formats above, which can be selected by the user. The format is supplied as an input argument during run-time. For example, if the ADC is configured to 10 bit unsigned integer format, then the data to the conversion_input_10bit function must range from 0 to 1023. The function will then convert each input measurement to the closest quantization level identical to what the ADC would do. This implies that some minor details of the test data may be lost during this process, but it is necessary to assess the performance of the compression algorithms. Please note that this does not mean that compression is lossy. The

compression itself is lossless, but because the test data is not quantized by the ADC, and is

coming from a different source, it has to be converted to one of the ADC formats, to

simulate the most realistic scenario. In MATLAB, numbers are by default represented as

double precision variables. As a result, each number occupies 64 bits. The input conversion function changes each number from 64 bits to 16 bits after quantization. So, as can be seen,


the ADC output is simulated properly, including output size and quantization levels. This

will ensure that the test results are trustworthy.
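A simplified sketch of the quantization idea behind conversion_input_10bit is given below for the unsigned integer (0b00) form; it is not the actual function from Appendix E, only an illustration of mapping double-precision samples onto 10 bit levels stored in 16 bit words.

```matlab
% Simulate the 10 bit unsigned-integer ADC output (illustrative sketch).
data    = [511.3 512.7 1020.2];          % example samples already scaled to the 0-1023 range
levels  = round(data);                   % snap each sample to the nearest quantization level
levels  = min(max(levels, 0), 1023);     % clamp to the 10 bit range
words16 = uint16(levels);                % one 16 bit word (2 bytes) per measurement
bytes   = typecast(words16, 'uint8');    % split into 8 bit symbols for the compressor
```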

The output conversion function restores the input data to its original state and is the

reverse process of input conversion. The details of these functions are really outside the

scope of this project, as there will be no need to implement such conversion in the

microcontroller. Therefore, the specific conversion methods are not given here.

Nevertheless, these functions are available in the appendix for reference. In addition,

Figure 4.2 shows the data type and size at each step of the encoding and decoding process.

At the input of the compressor, the sample test data is represented as a double type in MATLAB

occupying 64 bits. Before the actual compression takes place, the input conversion function

“quantizes the data” and represents it in 16 bits as would the dsPic33 ADC. These 16 bits

are split into 2 bytes, because during compression and decompression the source alphabet

is made up of the 256 symbols from the 8 bit ASCII table. The output of the compressor is

a sequence of 0s and 1s that are represented as char type in MATLAB. However, to

calculate the compression ratio, the number of 0s and 1s is of interest, because they will be represented as bits in the microcontroller.

Figure 4.2-Data conversion from input to output.

During decompression, the sequence of 0s and 1s is decoded and the resulting symbols are represented as 8 bit numbers. The decoded

sequence, before being output to the user, passes through the output conversion stage and


is converted back to double. This is crucial in MATLAB to compare the data before compression and after decompression.

4.2 Benchmark Test

All tests in this section were accomplished on a computer with a 3.07GHz Intel Core i7-950 processor and Windows 7 Professional operating system. In the first part the test data is given, followed by compression ratio and execution time results.

4.2.1 Test Data

Six data sets have been chosen for compression. The first data set is the book of

Genesis. Of course the data captured by the CubeSat will be alphanumeric, therefore limiting the compression test to numeric data only will not draw a complete picture of the abilities of the algorithm to handle different data types. In addition, if the CubeSat project expands in the future in a way that firmware updates and strings of commands can be sent to it, then being able to compress text may be beneficial. The symbol histogram of the book

Figure 4.3-Symbol histogram of the book of Genesis.


of Genesis is shown in Figure 4.3. The symbol with the highest probability is the space

(DEC value = 32), and the symbol with the second highest probability is the letter e (DEC value

= 101). This is expected for the English language.

The second and third sets of data are sample indoor and outdoor temperature data, shown in Figure 4.4 and Figure 4.5 respectively. The temperature data is changing slowly over time. In other words, there are no abrupt changes in the amplitude between samples.

Figure 4.4-Sample indoor temperature data.

Figure 4.5-Sample outdoor temperature data.

The fourth and fifth sets of data are wind speed and wind gust sample measurements, plotted in Figure 4.6 and Figure 4.7 respectively. Compared with the temperature measurements, these have abrupt changes, so there will be an opportunity to see how such data is managed by the algorithms.

Figure 4.6-Sample wind speed data.

Figure 4.7-Sample wind gust data.

The sixth and final set of data is sample magnetic field measurements. A plot of these measurements is shown in Figure 4.8. Note that the magnetic field measurements alternate between smooth and suddenly changing samples.

Figure 4.8-Sample magnetic field data.

Before compression, all data sets had to be normalized to a range supported by one of the ADC forms given in Figure 4.1. To achieve this, the sample data was divided by its maximum value. This normalization is comparable to the 0b10 (unsigned fractional) output format of the ADC. To transition from the 0b10 to the 0b00 (unsigned integer) output format, the normalized data needs to be multiplied by a factor of 1023. The results of the compression test are given in Section 4.2.2.
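A two-line MATLAB sketch of this normalization, with illustrative sample values and variable names:

```matlab
data      = [68.1 75.4 82.0];    % example measurement samples
data_frac = data ./ max(data);   % normalized to 0-1, comparable to the 0b10 (fractional) form
data_int  = data_frac * 1023;    % scaled to 0-1023 for the 0b00 (integer) form;
                                 % quantization itself is applied by conversion_input_10bit
```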


4.2.2 Compression Ratio Test

Before presenting the results, remember that in the scope of this report the compression ratio R, given in Equation 4.1, shows the reduction of the data size compared to the uncompressed input. This is necessary to emphasize, as different definitions of compression ratio are used in the literature.

R = (uncompressed size − compressed size) / uncompressed size    (4.1)
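Expressed in MATLAB with illustrative sizes (both measured in bits):

```matlab
% Compression ratio R as defined in Equation 4.1 (illustrative sizes in bits).
uncompressed_bits = 1000;                                      % size of the input data
compressed_bits   = 650;                                       % size of the encoder output
R = (uncompressed_bits - compressed_bits) / uncompressed_bits  % R = 0.35, i.e. 35 %
```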

The results are separated by the data sets, and the given charts are self-explanatory. The results for the book of Genesis are shown in Figure 4.9.

Figure 4.9-Compression results for the book of Genesis.

Although the scale of the chart may be deceiving, all four compression techniques resulted in similar compression ratios. The difference between the best


performance achieved by Static Arithmetic Coding and the worst performance achieved by

Static Huffman Codes is only 0.49%, which is negligible.

Figure 4.10 and Figure 4.11 show the compression results for the sample indoor and outdoor temperature measurements respectively. These figures show two sets of results. The first set of compression results, shown on the left side of each chart, was achieved using the 0b10 (unsigned fractional) output format of the ADC, whereas the second set of results makes use of the 0b00 (unsigned integer) output format. These formats are labeled 0b10 and 0b00 in these and the remaining figures.

Figure 4.10-Compression results for indoor temperature data.

The indoor temperature test results

show that all four compression techniques are performing similarly in terms of

compression ratio, where the best result is achieved by the Adaptive Huffman technique resulting in 45.33% or 52.08% reduction in data size depending on the output format of the

ADC. When it comes to the outdoor temperature test the Adaptive Arithmetic method has


slightly better performance than its counterparts, resulting in approximately 32.41% or

33.93% compression ratio depending on the ADC output format.

One question that arises is the compression ratio difference resulting from

compressing the indoor and outdoor temperature data. Using the unsigned fractional

format, the indoor temperature data was reduced 10% more in size than the outdoor temperature data. This difference almost doubles with the unsigned integer format.

Figure 4.11-Compression results for outdoor temperature data.

Figure 4.4 and

Figure 4.5 give a little insight into the possible reason. The first observation is that the

outdoor temperature data has a wider range of values, varying from approximately 55 to

105 °F, whereas the indoor temperature goes back and forth between 68 and 82 °F.

The second observation is that the outdoor temperature is much smoother than the

indoor temperature. These two factors combined result in the difference. Smoother data means that there are many symbols in the source alphabet and each of them has almost


equal probability, and the wider range of data means there are more source symbols

compared to the narrower range. This can be visualized using the symbol histogram of the

indoor and outdoor temperature measurements shown in Figure 4.12. The plot on the left

side shows the skewed probability distribution of the source symbols for the indoor

temperature data. The trace on the right side shows the probability distribution for the

outdoor temperature measurements. Note that the outdoor temperature data has more

symbols which have almost equal probability. This is an important factor to consider when

implementing the final compression algorithm on the CubeSat.

Figure 4.12-Symbol histogram for indoor and outdoor temperature data source alphabet. While this is not a major factor for the CubeSat voltage, current and on-board

temperature data, it still can have substantial impact if the space temperature is captured.

This data most likely varies widely between extremely cold and extremely hot temperatures, depending on the CubeSat's position in space. Therefore, alternative approaches

can be considered instead of compressing all the data in one run. For example, the data

captured when the CubeSat is radiated by the sun’s rays, can be compressed in a different

thread than the data captured when there is no sun, and then these two sets can be combined before transmission. One conclusion from the discussion above is that for lossless


compression it is desirable to have fewer source symbols with highly skewed probability to achieve high compression ratio.

Figure 4.13-Compression results for magnetic field data.

Figure 4.14-Compression results for wind gust data.


Figure 4.15-Compression results for wind speed data. The next results come from compressing the sample magnetic field measurements.

The resulting chart is shown in Figure 4.13. Adaptive arithmetic compression is able to

achieve slightly better results than its counterparts, although the difference is not significant. Using the 0b00 output format, Adaptive Arithmetic coding is able to achieve

3.58% higher compression ratio compared to the worst performing Static Huffman method.

The final two test results were performed on the wind gust and wind speed sample data and are shown in Figure 4.14 and Figure 4.15 respectively. Given a certain output format, the compression techniques are able to achieve almost equal results.

4.2.3 Timing Test

To perform the timing test, a slightly different approach has been used. The source

data for the timing test is composed of uint8 pseudorandom numbers between 0 – 255 generated by MATLAB. The reason for this approach is to have variable size data. The


data sets used in the previous section were almost of equal length, and would not be helpful

to determine the effect of data size on the compression and decompression durations.

Nevertheless, the compression and decompression durations for the previous six data sets

are in accord with the results of this section.

For testing purposes, 43 data sets were created ranging in length from 10 to 10^6 symbols, or from 8 bytes to approximately 5469 kilobytes (kB). This approach gave good insight into the performance of each algorithm and how each is able to manage small as well as large data sets. It would not make sense to consider the compression results for the pseudorandom sources: since there is no structure in the data, there is no correlation from symbol to symbol, and therefore it is not reasonable to expect any significant compression from this source.

Figure 4.16-Compression time results (linear scale).
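A sketch of how a single timing run could be set up is shown below; the data generation and the tic/toc timing are straightforward, while the call to huffman_zip merely stands in for whichever of the four encoders is being measured (its exact call signature is an assumption).

```matlab
% Generate one pseudorandom data set and time its compression (illustrative sketch).
n    = 1e4;                                  % number of symbols in this data set
data = uint8(randi([0 255], 1, n));          % pseudorandom symbols in the range 0-255
msg  = char(data);                           % the compressors work on 8 bit ASCII symbols

tic;
zipped = huffman_zip(msg);                   % placeholder for any of the four encoders
t_compress = toc;                            % elapsed compression time in seconds
```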


The results have been separated into three different parts: compression time,

decompression time and the total compression time, given as the sum of compression and

decompression times. The preliminary compression time results are shown in Figure 4.16.

The compression time of Adaptive Huffman is increasing so rapidly with data size that the

compression time details of the remaining algorithms cannot be seen. The same data is

shown in Figure 4.17 but in logarithmic scale, to preserve the details of the remaining

compression techniques. From the figure it can be seen that the best compression time is

almost consistently achieved by the Static Huffman method, and Adaptive Huffman

technique provides the largest compression time. Static Arithmetic and Adaptive

Arithmetic are almost identical in their behavior in terms of compression time.

Figure 4.17-Compression time results (logarithmic scale). The decompression time results follow a similar trend with the exception of Static

Huffman codes. Figure 4.18 reveals that Static Huffman does not perform very well for


decompression. The exponential shape of the curve is expected, as it has to constantly

search the binary tree for the code-word – symbol match. If it cannot find the match, an

additional bit is read from the encoded sequence and the search starts all over again. This is a major problem with large data sizes, because the binary tree size increases exponentially. Both arithmetic compression techniques show similar semi-linear performance as during compression.

Figure 4.18-Decompression time results (linear scale). The final graph is the combined compression and decompression time shown in linear scale in Figure 4.19, and logarithmic scale in Figure 4.20. The graph in logarithmic scale helps to see the details that have been suppressed in Figure 4.19 by the large processing time of the Adaptive Huffman method.

Before concluding, a few remarks are necessary. First, Adaptive Huffman compression cannot be considered for this project purely based on the compression time.


In addition to the compression time, large memory is required for storing the binary tree, and it is very computationally complex. Second, for data sizes up to 310 kB Static Huffman, Static Arithmetic and Adaptive Arithmetic compression techniques behave similarly, but Static Huffman has superior compression time performance for larger data sizes.

Figure 4.19-Total compression and decompression time (linear scale).

Figure 4.20-Total compression and decompression time (logarithmic scale).

Third, when it comes to decompression time, both Huffman compression techniques are out of the question because of their non-linear behavior and computational burden.

However, assuming decompression will only be performed on the ground station, the time associated with the process may not be a significant problem.

4.2.4 Discussion of Test Results

Many factors affect the compression ratio, most of which have been discussed. In addition, the compression ratio test runs for this project showed significant differences depending on the selected output format. Therefore, additional tests have to be run after the algorithm is implemented on the microcontroller. It is also recommended that the test be run with the 12 bit mode of the ADC. Appendix F has the input and output conversion files to mimic the 12 bit mode of the ADC, therefore, additional tests can be run using the algorithms designed for this report in MATLAB.

The compression ratio results in the previous section show that there is no specific way of identifying which compression technique will perform better. Nevertheless, the differences seen in the compression ratio results between the different algorithms were less than

5%, and for most of the data did not exceed 1.5%. However, the timing test results along with the time and space efficiency discussions identify the possible candidate for the compression technique of choice. First, Adaptive Huffman coding is most certainly out of the equation given its time complexity, which could also be seen in the timing test results.

For 31 kB of input data it took the Adaptive Huffman algorithm approximately 108 sec and

109 sec respectively for encoding and decoding. In contrast it took the Static Huffman


algorithm less than 1 sec for compression and decompression combined, and the Static

Arithmetic and Adaptive Arithmetic algorithms were able to perform the same task in

approximately 2.5 seconds. Considering that decompression of the Static Huffman algorithm will be performed on the ground station only, it is very attractive for its fast compression time. However, the fact that the complete data set must be available at the time of compression, along with the large space complexity required for storing the binary tree, makes it less attractive for implementation.

The two remaining algorithms, both based on the arithmetic method, are very powerful and efficient techniques, which will achieve near optimum lossless compression results. One of the major advantages is that there is no binary tree involved with these algorithms. The algorithms perform only integer arithmetic, and the multiplications by two can be replaced by shift operations. The binary tree is replaced by two arrays. One of the arrays keeps track of the symbol counts, and the second array maintains the cumulative count. Since there is only a minor difference

between Static and Adaptive Arithmetic compression, the Adaptive method is more favorable, as it makes compression possible when new measurements are received from the sensors.

Although the four compression techniques described here are not the only lossless compression algorithms, they do perform well with the anticipated data. Other lossless compression techniques, such as LZ78, exist [7]; however, they primarily work for sources with repeating sequences of symbols. Another lossless compression technique, called Run-Length coding [7], can be implemented if the data to be encoded has long runs of 0s or 1s. Neither LZ78 nor Run-Length coding would be suitable for encoding the test data presented here.

Most compression software packages nowadays combine different techniques: the data at each stage is compressed using a different method, or a hybrid technique is used. However, most of these techniques are protected by US or international patents and require fast processors to be useful. The Adaptive Arithmetic compression technique in its basic form is sufficient for the current goals set forth in the CubeSat requirements. Therefore, the recommendation is to use Adaptive Arithmetic coding.

4.3 Delta Compression

The compression tests in Section 4.2 were run only using the ADC unsigned fractional and unsigned integer formats. Why were the signed fractional and signed integer formats not selected for the benchmark tests? The simple answer is that the test data used was only positive; therefore, with any of the signed formats the negative range would go unused, resulting in fewer usable quantization levels. In order to utilize the signed formats, an approach called Delta compression [27] could be used. With Delta compression, the symbols are not compressed directly; rather, the difference between subsequent values is compressed. Since the difference between subsequent values can be negative, the signed output formats of the ADC are ideal. As one would guess, this technique can provide very good compression results for slowly changing measurements such as the presented temperature data, because the difference between subsequent values will be close to zero, resulting in few source symbols with high probability. Delta compression can be combined with any of the lossless compression methods mentioned above, but may result in lossy compression. For example, if the signed integer


format is used, the resolution of the ADC will not be enough to encode small differences between subsequent values. However, the signed fractional format can provide a resolution of roughly 2^(−9) ≈ 0.002, enough to encode measurement deltas down to about 0.002. Since there is a chance that this can result in lossy compression, it is left up to the programmers to implement and experiment with this method.
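A minimal sketch of the delta idea is shown below: the differences (which may be negative) are what would be passed to one of the encoders above, and a running sum restores the data on the ground. The variable names are illustrative.

```matlab
% Delta pre-processing and its inverse (illustrative sketch).
data    = [68.1 68.3 68.2 68.6];        % example slowly changing measurements
deltas  = [data(1) diff(data)];         % first sample followed by sample-to-sample differences
% deltas, not data, would be fed to the chosen lossless encoder

restored = cumsum(deltas);              % decoder side: running sum rebuilds the original data
```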


5 Conclusions

This project investigated four lossless compression techniques. Each compression

technique was implemented in MATLAB. The complete implementation details were

presented in Chapter 3. As was seen these algorithms were sufficient to perform all

necessary steps to simulate compression and decompression.

After the algorithms were developed, they were compared in benchmark tests.

Different data sets were input and performance measurements were collected for each

method. Performance measures included compression and decompression time, memory

complexity and most important of all compression ratio. The benchmark tests revealed that

all algorithms performed similarly in terms of compression ratio, but major differences

were seen with the involved durations. As was seen the Huffman compression techniques

were not suitable for a CubeSat because of the required memory, and the

compression/decompression durations. In contrast the Arithmetic coding techniques

offered fast processing and easy implementation. In addition the Adaptive Arithmetic

method offered extra agility and was recommended for implementation and further testing.

In order to understand how the dsPic33 ADC bit mode and output format affect the

performance measures it is recommended that additional tests be run after Adaptive

Arithmetic is implemented on the microprocessor. In addition, to avoid errors resulting

from the communications channel, it is recommended that channel coding be implemented.

This is important as isolated errors in the received signal can create a chain reaction of

errors and introduce high bit and packet error rates.


Bibliography

[1] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 1.

[2] mathematicalmonk, "Youtube," 26 August 2011. [Online]. Available: http://www.youtube.com/watch?v=UrefKMSEuAI. [Accessed 1 September 2013].

[3] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 14.

[4] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 16.

[5] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 47.

[6] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 6.

[7] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012.

[8] K. Sayood, Ed., Lossless Compression Handbook, San Diego, CA: Academic Press, 2003.

[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed., Hoboken, NJ: John Wiley & Sons, Inc., 2006.

[10] Microchip Technology Inc., dsPIC33FJXXXGPX06/X08/X10 Data Sheet, 2007, p. 235.

[11] K. Sayood, Ed., Lossless Compression Handbook, San Diego, CA: Academic Press, 2003, p. 83.

[12] M. Balík, M. Hanuš, J. Holub and M. Paulíček, "Data Compression Applets Library," Czech Technical University in Prague, 2005. [Online]. Available: http://www.stringology.org/DataCompression/sh/index_en.html. [Accessed 15 February 2014].

[13] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 67.

[14] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 68.


[15] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, pp. 69, 72.

[16] R. Seeck, "About BinaryEssence," [Online]. Available: http://www.binaryessence.com/dct/en000097.htm. [Accessed 15 September 2013].

[17] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 74.

[18] I. H. Witten, R. M. Neal and J. G. Cleary, "Arithmetic Coding for Data Compression," Communications of the ACM, vol. 30, no. 6, 1987.

[19] K. Sayood, Ed., Lossless Compression Handbook, San Diego, CA: Academic Press, 2003, p. 18.

[20] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 100.

[21] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 102.

[22] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, pp. 91-133.

[23] mathematicalmonk, "Youtube," 8 October 2011. [Online]. Available: https://www.youtube.com/watch?v=t8_198HHSfI. [Accessed 1 October 2013].

[24] K. Sayood, Introduction to DATA COMPRESSION, 4th ed., Waltham, MA: Morgan Kaufmann, 2012, p. 113.

[25] M. Balík, M. Hanuš, J. Holub and M. Paulíček, "Data Compression Applets Library," Czech Technical University in Prague, 2005. [Online]. Available: http://www.stringology.org/DataCompression/ak-int/index_en.html. [Accessed 20 February 2014].

[26] M. Balík, M. Hanuš, J. Holub and M. Paulíček, "Data Compression Applets Library," Czech Technical University in Prague, 2005. [Online]. Available: http://www.stringology.org/DataCompression/aak-11/index_en.html. [Accessed 22 February 2014].

[27] S. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, San Diego: California Technical Publishing, 1998, pp. 481-502.


Appendix A: MATLAB Code for Static Huffman Compression

“get_freq_huff” function

“get_codewords” function


“huffman_zip” function


“huffman_unzip” function


Appendix B: MATLAB Code for Adaptive Huffman Compression

“adap_huff_get_codewords” function


“adaptive_huffman_zip” function


“adaptive_huffman_unzip” function


Appendix C: MATLAB Code for Static Arithmetic Coding

“get_freq_arith” function

“arithmetic_zip” function


“arithmetic_unzip” function


Appendix D: MATLAB Code for Adaptive Arithmetic Coding

“adaptive_arithmetic_zip” function


“adaptive_arithmetic_unzip” function


Appendix E: MATLAB Code for 10 bit Conversion

“conversion_input_10bit” function


“conversion_output_10bit” function


Appendix F: MATLAB Code for 12 bit Conversion

“conversion_input_12bit” function


“conversion_output_12bit” function
