CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

High-Throughput, Lossless Data Compression and Decompression On FPGAs

A graduate project submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

Vikas Udayashekar

in collaboration with

Spoorthi Suresh

May 2012

The graduate project of Vikas Udayashekar is approved:

______Dr. Somnath Chattopadhyay Date

______Dr. Ahmad Sarfaraz Date

______Dr. Ramin Roosta, Chair Date

California State University, Northridge

ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without the mention of the people who made it possible and whose constant encouragement and guidance have been a source of inspiration throughout the course of the project. We express our sincere gratitude to Dr. Ramin Roosta, our project committee chairperson. His invaluable assistance is one of the main reasons that this project has been successfully completed. We also wish to thank the other members of our graduate project committee, Dr. Somnath Chattopadhyay and Dr. Ahmad Sarfaraz, for their suggestions and support. We would like to extend our profound gratitude to our Department Chair, Dr. Ali Amini, for facilitating and helping us. It is by God’s grace and the continuous support of our parents and friends that we have been able to complete our MS program. Our family’s invaluable support in providing us with a high quality of education has helped us achieve our goals. We also want to express our appreciation to the Electrical and Computer Engineering Department at California State University, Northridge, including all the professors whose classes we had the pleasure to take.


TABLE OF CONTENTS

SIGNATURE PAGE
ACKNOWLEDGEMENT
LIST OF FIGURES
ABSTRACT

CHAPTER 1 INTRODUCTION
    1 Introduction and Background
        1.1 How does compression work?
        1.2 Text and Signals: lossless and lossy compression

CHAPTER 2 842B ALGORITHM
    2.1 Introduction

CHAPTER 3 842B FPGA DESIGN
    3.1 Introduction
    3.2 FPGA Compression Pipeline
    3.3 FPGA Decompression Pipeline

CHAPTER 4 SOFTWARE LANGUAGE/HARDWARE IMPLEMENTATION
    4.1 Programmable Devices
        4.1.1 Programmable Logic Devices
        4.1.2 Complex Programmable Devices
        4.1.3 Field Programmable Gate Arrays (FPGA)
            4.1.3.1 Advantages of FPGA
    4.2 Hardware Design and Development
        4.2.1 Design Entry
        4.2.2 Synthesis
        4.2.3 Simulation
        4.2.4 Implementation
            4.2.4.1 Translate
            4.2.4.2 Map
            4.2.4.3 Place and Route
    4.3 Device Programming
    4.4 Verilog HDL
        4.4.1 Importance of HDLs
        4.4.2 Why Verilog?

CHAPTER 5 RESULT AND DISCUSSION
    Verification in Modelsim (Xilinx)
    Compression
    Decompression
    HDL Synthesis Report - Compression
    HDL Synthesis Report - Decompression

CHAPTER 6 CONCLUSION

REFERENCE

APPENDIX

LIST OF FIGURES

Figure 1: 8-Byte Input Split into 7 Phrases
Figure 2: FPGA Compression Pipeline
Figure 3: FPGA Decompression Pipeline
Figure 4: Internal Structure of a CPLD
Figure 5: Internal Structure of an FPGA
Figure 6: Internal Architecture of CLB
Figure 7: Design Flow
Figure 8: Simulation result 1 for Compression
Figure 9: Simulation result 2 for Compression
Figure 10: Simulation result 3 for Compression
Figure 11: Simulation result 4 for Compression
Figure 12: Simulation result 5 for Compression
Figure 13: Simulation result 6 for Compression
Figure 14: Simulation result 7 for Compression
Figure 15: Simulation result 8 for Compression
Figure 16: Simulation result 1 for Decompression
Figure 17: Simulation result 2 for Decompression
Figure 18: Simulation result 3 for Decompression
Figure 19: Simulation result 4 for Decompression
Figure 20: Simulation result 5 for Decompression
Figure 21: Simulation result 6 for Decompression
Figure 22: Simulation result 7 for Decompression
Figure 23: Simulation result 8 for Decompression


ABSTRACT

High-Throughput, Lossless Data Compression and Decompression on FPGAs

By

Vikas Udayashekar

Master of Science in Electrical Engineering

Lossless compression is often applied before data is written to a storage medium or sent across a transmission medium. Compression saves storage space or transmission bandwidth; when the data is subsequently read, a decompression operation is performed. Though this scheme has clear benefits, the execution time of compression and decompression is critical to its application in real-time systems. Software compression utilities are often slow, leading to degraded system performance. Hardware-based solutions, on the other hand, often drive large resource requirements and are not amenable to supporting future algorithmic changes. We present a high-throughput, streaming, lossless compression algorithm and its efficient implementation on FPGAs. The proposed solution provides a peak throughput of 1 GB/sec per engine, with a sustained overall measured throughput of 2.66 GB/sec on a PCIe-based FPGA board with compression and decompression engines. This result represents an overall speedup of 13.6x over the reference software implementation. With multiple engines running in parallel, the proposed design provides a path to potential speedups of up to two orders of magnitude. In the current implementation, the achievable overall throughput is limited only by the available PCIe bus bandwidth.


CHAPTER 1: INTRODUCTION

1. Introduction and Background

Lossless data compression is often used to save storage space or reduce the required transmission bandwidth. Data compression algorithms are often implemented in software. Although this approach saves valuable real estate on processor chips and allows for later modifications to the algorithm, it can sometimes be a performance bottleneck. Most of the existing work on data compression has concentrated on achieving the best possible compression efficiency, aiming for the entropy of the data. However, in a number of applications the execution speed of the compression/decompression operation is more important than the compression efficiency. In such applications hardware-based fast compression algorithms may be used. Many such algorithms exist, such as the custom-hardware-based ALDC [1], MXT [2] and 842 [3], and the FPGA-based XMatchPRO [4]. However, most of these solutions use expensive CAM (content-addressable memory) structures to implement the history windows (dictionaries) and achieve throughputs in the range of 100 MB/sec to 400 MB/sec. In this design, we present the 842B algorithm, a hardware-friendly, lossless compression algorithm derived from the original 842 algorithm [3], and its FPGA implementation. Instead of expensive CAMs, the proposed algorithm uses hashing-based dictionary lookups and offers a throughput of 1 GByte/sec per engine. The low-latency streaming architecture allows compression and decompression of data blocks of arbitrary size, and the engines can be placed directly on the transmission channels. Additionally, multiple overlapping sliding compression windows (dictionaries) of different lengths yield better compression efficiency. The compressor and decompressor designs presented here are very lean and require only modest FPGA resources. They are therefore suitable for use as small modules in application areas where FPGA-based systems can be applied, including signal and image processing, network routers and transmitter-receiver systems, effectively increasing the CPU cycles and bandwidth resources available for other purposes in such systems.

1.1 How does compression work?

Compression relies on the fact that data is redundant: to some extent it was generated following rules, and if we can learn those rules we can predict the data accurately. A compressor can reduce the size of a file by deciding which data is more frequent and assigning it fewer bits than less frequent data. Compression therefore has two parts: one that guesses which symbols are the most frequent, and another that outputs the "decision" of the first.
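As a small illustration of the idea that frequent symbols should receive fewer bits, the following Python sketch computes the ideal code length of each symbol from its relative frequency. It illustrates the general principle only; it is not the 842B method, which is dictionary-based, and the function names and sample data are chosen here purely for illustration.

```python
import math
from collections import Counter


def ideal_code_lengths(data: bytes) -> dict:
    """Return the ideal code length in bits, -log2(p), for each symbol."""
    counts = Counter(data)
    total = len(data)
    return {sym: -math.log2(n / total) for sym, n in counts.items()}


def ideal_total_bits(data: bytes) -> float:
    """Total size in bits if every symbol used its ideal code length."""
    lengths = ideal_code_lengths(data)
    return sum(lengths[sym] for sym in data)


# Hypothetical sample: 'a' is frequent, 'c' is rare.
sample = b"aaaaaaabbc"
lengths = ideal_code_lengths(sample)
total_bits = ideal_total_bits(sample)
```

In this sample the frequent symbol 'a' needs well under one bit on average, while the rare symbol 'c' needs more than three, so the total falls far below the 80 bits of the uncompressed form.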

1.2 Text and signals: lossless and lossy compression

We have seen before that we may want to compress different kinds of data such as text, databases, binary programs, sound, image and video. In practice we distinguish between text compression and signal compression. We make this separation because databases and binary programs have the same characteristics as text. Likewise, sound, image and video are signals and thus share properties. On the other hand, text and image data have nothing in common, and that is why they do not belong to the same group.

We also make this separation because we use different kinds of compression for these two groups. That comes from the nature of the data. Digital signals are an imperfect representation of an analog signal, so when compressing them we can discard some of the information to achieve more compression. This is done with transformation and quantization algorithms.

Let's say we have a byte from an image whose value is 65 and which represents the quantity of red in a given pixel. If this byte is 66 after decompression, we would not notice the difference between that red and a very slightly stronger red. However, if that was a text file, 65 would be 'A' (assuming ASCII), and there is a big difference if we decompress it as 66, which would be a 'B' instead of an 'A'. Due to the nature of text we cannot afford errors.

So we use lossless compression for text, where the decompressed file must match the original bit for bit, and lossy compression for signals, where some error is acceptable and in most cases is not detected. Note, however, that signals can also be compressed losslessly, though the compression achieved is then far worse than with lossy compression. In most cases signal compression amounts to discarding as much data as possible while retaining as much quality as possible.

CHAPTER 2: 842B ALGORITHM

2.1 Introduction

The 842B algorithm identifies repeating patterns of 8, 4 and 2 bytes in the input data stream and replaces them with 6- to 8-bit pointers to previously seen data. The algorithm follows the same principles as the original 842 algorithm [3]. Every 8-byte chunk of the input data is divided into 7 phrases (Figure 1), which are compared against previously seen phrases. Dictionaries store the input phrases so that they can be looked up when subsequent inputs are processed. For constant-time phrase look-up, the address of the phrase in the dictionary (the pointer) is stored in a hash table at a location given by the hash value of the phrase. During compression, the 7 sub-phrases of the 8-byte input are hashed into 7 keys, which are used to read the pointers from the hash tables. Using these pointers, 7 phrases are read from the dictionaries and compared against the input sub-phrases. The compressed output is generated as the smallest possible combination of pointers and raw data, together with a template that indicates the composition of the compressed data.

Decompression involves decoding the template, extracting the pointers and raw phrases from the compressed data, reading the remaining phrases from the dictionaries and reconstructing the uncompressed data. Reading the phrases from the dictionaries requires the dictionary contents to be reconstructed. The dictionaries are reconstructed on the fly, much as in the compression operation, by simply writing the decompressed phrases back into them. Note that decompression requires no hashing and no hash tables, since the pointers are already present in the compressed data.

Unlike the original 842 algorithm, which uses a single phrase dictionary, the 842B algorithm uses three separate dictionaries representing three different sliding history windows, one each for 8-, 4- and 2-byte phrases. The three dictionaries redundantly store the 7 sub-phrases of the current input.

Figure 1: 8-Byte Input Split into 7 Phrases
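The split of one 8-byte input into 7 phrases can be sketched in Python as follows. The sketch assumes one 8-byte phrase, two 4-byte phrases and four 2-byte phrases taken at aligned, non-overlapping offsets, which matches the 1, 2 and 4 ports described for the three dictionaries; the exact layout is given by Figure 1, and the function name is hypothetical.

```python
def split_phrases(chunk: bytes):
    """Split an 8-byte chunk into its 8-, 4- and 2-byte phrases.

    Assumption: phrases are aligned and non-overlapping within the chunk.
    """
    if len(chunk) != 8:
        raise ValueError("expected exactly 8 bytes")
    phrases8 = [chunk]                                      # one 8-byte phrase
    phrases4 = [chunk[0:4], chunk[4:8]]                     # two 4-byte phrases
    phrases2 = [chunk[i:i + 2] for i in range(0, 8, 2)]     # four 2-byte phrases
    return phrases8, phrases4, phrases2


p8, p4, p2 = split_phrases(b"ABCDEFGH")
```

Each of the three lists would feed its own dictionary, so the same input bytes are stored three times at different granularities.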

While this redundant storage of data would seem wasteful, it has two benefits. First, multi-port RAM arrays become expensive to implement as the port count increases; a 7-port RAM array is replaced with 1-, 2- and 4-port RAM arrays for the 8-, 4- and 2-byte phrases, respectively, trading port count against RAM capacity. Second, once that tradeoff is made, the optimal dictionary size can be chosen independently for each of the 3 phrase lengths. In addition to the basic phrase comparisons, the 842B algorithm incorporates performance enhancers such as detecting multiple consecutive 8-byte repeats and replacing them with just a 5-bit template and a repeat count. Long strings of zeros are also detected, as a special case of repeats.
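The repeat detection mentioned above can be illustrated with a short Python sketch that groups consecutive identical 8-byte chunks into a chunk and a repeat count. This is only a behavioral illustration: the actual template values, the width of the repeat count and the handling of zero runs are not specified in this section, and the function name is hypothetical.

```python
def collapse_repeats(chunks):
    """Group consecutive identical chunks into [chunk, repeat_count] pairs.

    repeat_count is the number of extra copies that follow the first one.
    """
    groups = []
    for chunk in chunks:
        if groups and groups[-1][0] == chunk:
            groups[-1][1] += 1          # same as previous chunk: count a repeat
        else:
            groups.append([chunk, 0])   # new chunk: start a fresh group
    return groups


zeros = bytes(8)
stream = [b"ABCDEFGH"] * 3 + [b"12345678"] + [zeros] * 2
groups = collapse_repeats(stream)
```

A run of all-zero chunks is handled by the same loop, which is consistent with the text treating zero strings as a special case of repeats.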

CHAPTER 3: 842B FPGA DESIGN

3.1 Introduction

We present the FPGA design of the 842B compressor and decompressor. Both are designed as multi-stage pipelines. The input data is streamed through the pipelines, which process one block of input per cycle. The compression operation is feed-forward and lends itself well to pipelining. On the other hand, the dictionary write-back in the decompression pipeline introduces a feedback loop and results in multiple data hazard conditions. As stated earlier, the 842B algorithm operates on 7 different phrases for each 8-byte input. Although these phrases are independent of each other and can be processed in parallel, software implementations of the 842B algorithm process them sequentially. To exploit this parallelism and achieve improved performance, our FPGA pipelines include seven parallel data-paths, one for each phrase.

3.2 FPGA Compression Pipeline

Figure 2 shows the stages of the compression pipeline as implemented in the FPGA. The compression pipeline takes 8 bytes of input per cycle and outputs one compressed data word and a template. Every 8-byte input is broken into 7 phrases; the hashing, hash-table look-up, dictionary look-up and phrase comparison for all these phrases are performed in parallel, and a 7-bit match/mismatch status is generated. Based on the match/mismatch status, the encoder encodes the pointers and the raw input phrases into the smallest possible output. A template indicating the composition of the compressed output is also generated. The mapping from the match/mismatch status to the corresponding smallest output combination (and thus the template) is determined statically and is implemented on the FPGA using look-up tables. Given a match/mismatch status, the 5-bit template can be read directly from the look-up table.

Figure 2: FPGA Compression Pipeline
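The look-up path for a single phrase (hash, hash-table read, dictionary read, compare), together with the sequential write counter used in this section, can be modeled in Python as below. This is a behavioral sketch under stated assumptions, not the FPGA implementation: the class and method names are hypothetical, the modulo hash is only a stand-in for the XOR-tree hash used in the design, and the sizes in the usage lines are arbitrary.

```python
class PhraseDictionary:
    """Behavioral model of one dictionary and its direct-mapped hash table."""

    def __init__(self, entries: int, table_slots: int):
        self.phrases = [None] * entries     # dictionary RAM: stored phrases
        self.table = [None] * table_slots   # hash table RAM: dictionary pointers
        self.next_addr = 0                  # sequential write counter

    def _hash(self, phrase: bytes) -> int:
        # Stand-in hash (assumption): low-order bits via modulo.
        return int.from_bytes(phrase, "big") % len(self.table)

    def lookup_and_insert(self, phrase: bytes):
        """Return the pointer of a matching earlier phrase, or None on a miss."""
        slot = self._hash(phrase)
        pointer = self.table[slot]
        hit = pointer is not None and self.phrases[pointer] == phrase
        # Write the phrase at the counter address and record that address
        # in the hash table, overwriting any older pointer in the same slot.
        self.phrases[self.next_addr] = phrase
        self.table[slot] = self.next_addr
        self.next_addr = (self.next_addr + 1) % len(self.phrases)  # wrap around
        return pointer if hit else None


d = PhraseDictionary(entries=4, table_slots=16)
first = d.lookup_and_insert(b"AB")    # miss: nothing stored yet
second = d.lookup_and_insert(b"CD")   # miss
third = d.lookup_and_insert(b"AB")    # hit: matches the entry at address 0
```

Because the phrase read back from the dictionary is always compared against the input, a stale or overwritten pointer simply produces a miss rather than a wrong match.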

In addition to the above operations, the compression pipeline also performs hash table and dictionary writes. During the hash table read/write cycle, the next dictionary address, generated sequentially using a counter, is stored in the hash table at a location given by the hash value of the phrase. Since many input phrases might hash to the same value, a pointer may be overwritten by the latest pointer hashing to the same location. The implementation is analogous to a direct-mapped cache memory. This dictionary and hash table-based design uses regular RAM arrays instead of the CAMs typically used in other hardware compressors and hence is more area-efficient and simpler to implement. During the dictionary read/write cycle, the input phrase is written into the dictionary at a location given by the output of the counter. Note that, since all 7 phrases are processed simultaneously, up to 4 simultaneous reads and 4 writes from and to the memory banks are needed. Since the FPGA block RAMs used to implement the dictionaries and the hash tables provide only two read/write ports, we duplicate the dictionaries to support multiple read ports. This increases the memory requirements four-fold for a 4-port dictionary. Multiple writes are supported simply by performing a single wide write operation and thus do not demand further dictionary replication.

The compressor's performance depends on many factors. The first is the dictionary size, which represents the amount of data that is "remembered". The larger the dictionary, the more phrases are remembered and hence the higher the probability of finding a phrase match. On the other hand, a larger dictionary requires longer pointers, which in turn increases the size of the compressed data. Larger dictionaries also require more FPGA resources. Thus, there is a tradeoff between allocated hardware resources and algorithm performance. Our simulations indicate a dictionary size-performance sweet spot that yields the best average compression ratio. This sweet spot occurs at different dictionary sizes for different phrase sizes. In our FPGA design the dictionary size is 2KB, 2KB and 512B for the 8-byte, 4-byte and 2-byte phrase dictionaries, respectively. The dictionaries also include a wrap-around mechanism: they wrap to the beginning when they are filled. This behavior represents windows, equal in size to the dictionaries, sliding over the data to be compressed. The wrap-around approach results in better compression than an approach that flushes the dictionaries once they are filled.

The hashing scheme and the sizes of the hash tables are another factor affecting the compression efficiency. Efficient and effective hashing is one of the keys to achieving good compression performance. A hashing function generates the address of the hash table location where the pointer to a dictionary entry is stored. Since hashing is a many-to-one function, i.e. many different phrases hash to the same value, dictionary pointers may be overwritten. It is therefore important to have a hashing scheme that spreads the hashes evenly across the entire hash table. For efficient hardware implementation, the hashing scheme must also be (i) lightweight (requiring few resources) and (ii) simple, so that a high clock frequency can be achieved. The simplest hashing scheme is the modulo operation, which simply selects the lower-order bits of the input phrase as the hash. This scheme, however, results in poor hash quality, creating multiple conflicts in certain table locations while leaving the rest untouched. Our scheme instead builds XOR trees: the input phrase is bitwise ANDed with a constant, and the bits of the result are XORed together to generate one bit of the hash. Generating an N-bit hash from an M-bit input requires N M-bit constants, for a total of M x N 2-input AND operations and N(M-1) 2-input XOR operations. The optimal values of the constants are determined experimentally and hard-coded into the FPGA, which reduces the AND operations to simple bit selection. The selected bits are then XORed using an XOR tree. The selection of appropriate hash constants is critical for effective hashing. We use the Random Invertible Binary Matrix (RIBM) approach to generate the XOR tree's hash constants [5, 6]. The Random Invertible Binary Matrix is produced off-line by filling a matrix randomly with 1s and 0s and checking it for invertibility, to ensure maximum dispersion in the output bits.

Even with very good hashing techniques, hash conflicts occur, overwriting a current pointer with a new one. Conflicts result in "forgetting" a previously written phrase, as the pointer to that phrase is lost. Although an N-way set-associative hash table could be used to increase the hash hit rate, we chose a large direct-mapped organization for its simplicity. Larger hash tables reduce the probability of pointers being overwritten and hence increase the chances of finding a previously written phrase in the dictionary. Larger hash tables, however, require more FPGA resources. As with regular caches, hash table sizes can be optimized against performance, since increasing the table size beyond a certain point yields diminishing performance gains. For our design, a hash table with roughly 4 times the number of entries in the corresponding dictionary achieves good performance.
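The XOR-tree hashing and the off-line generation of hash constants described above can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the design: the function names are hypothetical, the constants are drawn from a seeded pseudo-random generator rather than being the experimentally chosen values, and for a non-square N x M matrix the invertibility check is interpreted as requiring full row rank over GF(2).

```python
import random


def xor_tree_hash(phrase: int, constants: list) -> int:
    """Bit i of the hash is the XOR of the bits of (phrase AND constants[i])."""
    result = 0
    for i, constant in enumerate(constants):
        parity = bin(phrase & constant).count("1") & 1   # XOR-reduce selected bits
        result |= parity << i
    return result


def gf2_rank(rows: list) -> int:
    """Rank over GF(2) of a binary matrix whose rows are given as integers."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low_bit = pivot & -pivot                         # lowest set bit of pivot
        rows = [r ^ pivot if r & low_bit else r for r in rows]
    return rank


def random_hash_constants(n_bits: int, m_bits: int, seed: int = 0) -> list:
    """Draw N random M-bit constants until they are linearly independent."""
    if n_bits > m_bits:
        raise ValueError("hash width must not exceed input width")
    rng = random.Random(seed)
    while True:
        rows = [rng.getrandbits(m_bits) for _ in range(n_bits)]
        if gf2_rank(rows) == n_bits:
            return rows


def gate_counts(n_bits: int, m_bits: int):
    """AND and XOR gate counts before the constants are hard-coded."""
    return m_bits * n_bits, n_bits * (m_bits - 1)


# Example: an 8-bit hash of a 16-bit (2-byte) phrase.
constants = random_hash_constants(8, 16, seed=1)
key = xor_tree_hash(0x4142, constants)
and_ops, xor_ops = gate_counts(8, 16)
```

For the 8-bit hash of a 16-bit phrase in this example, the formulas in the text give 128 two-input AND operations and 120 two-input XOR operations before the ANDs are reduced to bit selection.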

The pattern encoding for our design is shown in the following table, taken from the C-Pack compression and decompression algorithm [7].

Table: Pattern encoding table for compression and decompression

3.3 FPGA Decompression Pipeline

Figure 3 shows the 842B decompression pipeline in the FPGA. One set of compressed data and its template is fed to the pipeline per cycle. The data decoder decodes the template and extracts the various pointers and raw phrases from the compressed data. These extracted pointers are used to read the phrases from the three dictionaries.

Figure 3: FPGA Decompression Pipeline

The data generator produces the 8-byte uncompressed data by selecting each 2-byte phrase from one of four sources, namely the extracted phrases or the data read from one of the three dictionaries. This module thus simply contains four 4:1 multiplexers. The select lines for these multiplexers are read directly from a look-up table using the 5-bit template, much as in the compressor design. The decompressed output is written into the three dictionaries to reconstruct them on the fly. The dictionary write-back introduces a feedback path in the pipeline, which leads to possible data hazards: the compressed data might contain a pointer to a dictionary location that has not been written yet.

Four scenarios can lead to a data hazard. The first is where a pointer points to a phrase that arrived one cycle earlier (1-ahead). In this case, the dictionary data being read is still in the data-gen stage and has not yet been written into the dictionary. This data, however, is required in the data-gen stage in the next cycle and hence can be forwarded. A hazard detection and data forwarding unit added to the data-gen stage detects this situation and forwards the data appropriately. Since there are three separate dictionaries, three forwarding units are required. The other three data hazards occur when the read and write requests for the same address arrive during the same cycle (4-ahead), or when a read request arrives one or two cycles earlier than the write request (3-ahead and 2-ahead, respectively). The 4-ahead hazard is a true read-during-write condition, whereas the 2-ahead and 3-ahead hazards occur due to pipelining, as a consequence of the read/write operations requiring more than 1 cycle. To address these hazards, a dictionary bypass unit is added at the output of each dictionary; it bypasses the dictionary and forwards the dictionary write data as the response to the read request. The hazard detection logic in the decompressor could have been avoided by disallowing near pointers, i.e. pointers between phrases less than or equal to 4 cycles apart, during compression. This approach would have yielded simpler pipeline logic but reduced compression efficiency.
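The effect of the forwarding and bypass logic can be illustrated with a simplified Python model of a dictionary whose writes take several cycles to land. The sketch collapses the forwarding unit and the bypass unit into a single check of in-flight writes; the class name, the write latency of 3 cycles and the addresses used are assumptions made for illustration, not values from the design.

```python
from collections import deque


class PipelinedDictionary:
    """Dictionary whose writes become visible `latency` cycles after issue."""

    def __init__(self, entries: int, latency: int):
        self.mem = [None] * entries
        self.latency = latency
        self.pending = deque()              # in-flight (cycles_left, addr, data)

    def write(self, addr: int, data: bytes) -> None:
        self.pending.append((self.latency, addr, data))

    def tick(self) -> None:
        """Advance one cycle and commit writes whose delay has elapsed."""
        still_pending = deque()
        for cycles_left, addr, data in self.pending:
            if cycles_left <= 1:
                self.mem[addr] = data
            else:
                still_pending.append((cycles_left - 1, addr, data))
        self.pending = still_pending

    def read_raw(self, addr: int):
        """Read without hazard handling: may return stale contents."""
        return self.mem[addr]

    def read_bypassed(self, addr: int):
        """Read with bypass: the newest in-flight write to addr wins."""
        for _, pending_addr, data in reversed(self.pending):
            if pending_addr == addr:
                return data
        return self.mem[addr]


d = PipelinedDictionary(entries=8, latency=3)
d.write(5, b"AB")
d.tick()
stale = d.read_raw(5)           # write not committed yet
forwarded = d.read_bypassed(5)  # bypass returns the in-flight data
```

Without the bypass, a pointer to a recently decompressed phrase would read stale dictionary contents, which is the situation the hazard units in the pipeline are there to prevent.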

CHAPTER 4: SOFTWARE LANGUAGE/HARDWARE IMPLEMENTATION

This chapter gives details of programmable logic devices and Verilog HDL. Programmable devices such as PLDs, CPLDs and FPGAs are explained. At the end, the history of Verilog HDL, the importance of HDLs and the advantages of Verilog HDL are discussed.

4.1 Programmable Devices

Programmable devices are those devices which can be programmed by the user. Various programmable devices are PLDs, CPLDs, ASICs and FPGAs.

4.1.1 Programmable Logic Devices

At the low end of the spectrum are the original Programmable Logic Devices (PLDs). A programmable logic device is an IC that is user configurable and is capable of implementing logic functions. These were the first chips that could be used to implement a flexible digital logic design in hardware. In other words, one could remove a couple of the 7400-series TTL parts (ANDs, ORs, and NOTs) from the board and replace them with a single PLD. Other names for this class of device are Programmable Logic Array (PLA), Programmable Array Logic (PAL), and Generic Array Logic (GAL). PLDs have several clear advantages over the 7400-series TTL parts that they replaced. First, of course, a single chip requires less board area, power, and wiring. Another advantage is that the design inside the chip is flexible, so a change in the logic does not require any rewiring of the board. Rather, the decoding logic can be altered simply by replacing that one PLD with another part that has been programmed with the new design. Inside each PLD is a set of fully connected macro cells. These macro cells typically comprise some amount of combinatorial logic (AND and OR gates) and a flip-flop. In other words, a small Boolean logic equation can be built within each macro cell. Hardware designs for these simple PLDs are generally written in languages like ABEL or PALASM (the hardware equivalents of assembly) or drawn with the help of a schematic capture tool.

4.1.2 Complex Programmable Devices

As chip densities increased, it was natural for the PLD manufacturers to evolve their products into larger (logically, but not necessarily physically) parts called Complex Programmable Logic Devices (CPLDs). For most practical purposes, CPLDs can be thought of as multiple PLDs (plus some programmable interconnect) in a single chip. The larger size of a CPLD allows either more logic equations or a more complicated design to be implemented. In fact, these chips are large enough to replace dozens of those 7400-series parts. Figure 4 contains a block diagram of a CPLD. Each of the four logic blocks shown is equivalent to one PLD. However, in an actual CPLD there may be more or fewer than four logic blocks. These logic blocks themselves comprise macro cells and interconnect wiring, just like an ordinary PLD.

Figure 4: Internal Structure of a CPLD

Unlike the programmable interconnect within a PLD, the switch matrix within a CPLD may or may not be fully connected. In other words, some of the theoretically possible connections between logic block outputs and inputs may not actually be supported within a given CPLD. The effect of this is most often to make 100% utilization of the macro cells very difficult to achieve. Some hardware designs simply will not fit within a given CPLD, even though there are sufficient logic gates and flip-flops available. Because CPLDs can hold larger designs than PLDs, their potential uses are more varied. They are still sometimes used for simple applications like address decoding, but more often contain high-performance control logic or complex finite state machines. At the high end (in terms of numbers of gates), there is also a lot of overlap in potential applications with FPGAs. Traditionally, CPLDs have been chosen over FPGAs whenever high-performance logic is required. Because of its less flexible internal architecture, the delay through a CPLD (measured in nanoseconds) is more predictable and usually shorter.

4.1.3 Field Programmable Gate Arrays (FPGA)

'Field Programmable' means that the FPGA's function is defined by a user's program rather than by the manufacturer of the device. A typical integrated circuit performs a particular function defined at the time of manufacture. In contrast, a program written by someone other than the device manufacturer defines the FPGA's function. Depending on the particular device, the program is either 'burned' in permanently or semi-permanently as part of a board assembly process, or is loaded from an external memory each time the device is powered up. This user programmability gives the user access to complex integrated designs without the high engineering costs associated with application specific integrated circuits (ASICs). The FPGA is an integrated circuit that contains many (64 to over 10,000) identical logic cells that can be viewed as standard components. The individual cells are interconnected by a matrix of wires and programmable switches. The logic cell architecture varies between different device families.
Generally speaking, each logic cell combines a few binary inputs (typically between 3 and 10) to one or two outputs according to a Boolean logic function specified in the user program. The cell's combinatorial logic may be physically implemented as a small look-up table memory (LUT) or as a set of multiplexers and gates. LUT devices tend to be a bit more flexible and provide more inputs per cell than multiplexer cells at the expense of propagation delay.
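The idea of implementing a logic cell's Boolean function as a small look-up table memory can be illustrated in Python: the function is evaluated once for every input combination, and afterwards the inputs are simply used as an address into the stored table. The function names and the 3-input majority function are chosen here purely for illustration.

```python
def make_lut(func, n_inputs: int) -> list:
    """Precompute a truth table: one stored bit per input combination."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> i) & 1 for i in range(n_inputs)]  # input i = bit i of index
        table.append(func(*bits) & 1)
    return table


def lut_eval(table: list, *bits: int) -> int:
    """Evaluate the function by using the inputs as a memory address."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]


# Hypothetical example: 3-input majority function stored in an 8-entry LUT.
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
```

Changing the stored table changes the function without changing the structure of the cell, which is what makes a LUT-based cell programmable.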

Figure 5: Internal Structure of an FPGA

The development of the FPGA was distinct from the PLD/CPLD evolution. There are three key parts of its structure: logic blocks, interconnect, and I/O blocks. The I/O blocks form a ring around the outer edge of the part. Each of these provides individually selectable input, output, or bi-directional access to one of the general-purpose I/O pins on the exterior of the FPGA package. Inside the ring of I/O blocks lies a rectangular array of logic blocks. The wiring that connects logic blocks to each other and to the I/O blocks is called the programmable interconnect.

Figure 6: Internal Architecture of CLB

The logic blocks within an FPGA can be as small and simple as the macro cells in a PLD (a so-called fine-grained architecture) or larger and more complex (coarse-grained). However, they are never as large as an entire PLD, as the logic blocks of a CPLD are. The logic blocks of a CPLD contain multiple macro cells, whereas the logic blocks in an FPGA are generally nothing more than a couple of logic gates or a look-up table and a flip-flop.

4.1.3.1 Advantages of FPGA

Because of all the extra flip-flops, and with densities ranging from several thousand to a few million gates, the architecture of an FPGA is much more flexible than that of a CPLD. This makes FPGAs better in register-heavy applications. They are also often used in places where the processing of input data streams must be performed at a very fast pace. In addition, FPGAs are usually denser (more gates in a given area) and cost less than CPLDs, so they are the best choice for larger logic designs. FPGAs use static memory, so they are reprogrammable.

4.2 Hardware Design and Development

A description of the hardware's structure and behavior is written in a high-level hardware description language (usually VHDL or Verilog), and that code is then compiled and downloaded prior to execution. Of course, schematic capture is also an option for design entry, but it has become less popular as designs have become more complex and the language-based tools have improved. The overall process of hardware development for programmable logic is shown in Figure 7.

Figure 7: Design Flow

4.2.1 Design Entry

In the design entry process, the behavior of the circuit is written in a hardware description language such as VHDL or Verilog.

4.2.2 Synthesis

First, an intermediate representation of the hardware design is produced. This step is called synthesis, and the result is a representation called a netlist. During this step the code is checked for syntax and semantic errors, and a synthesis report is generated that lists any errors and warnings. The netlist is device independent, so its contents do not depend on the particulars of the FPGA or CPLD; it is usually stored in a standard format called the Electronic Design Interchange Format (EDIF).

4.2.3 Simulation

A simulator is a software program used to verify the functionality of a circuit. Inputs are applied to the design and the corresponding outputs are checked; if the expected outputs are obtained, the design is functionally correct. Simulation presents the outputs as waveforms of zeros and ones. Although problems with the size or timing of the hardware may still crop up later, the designer can at least be sure that the logic is functionally correct before going on to the next stage of development.

4.2.4 Implementation

Device implementation places the verified design on the FPGA. The steps in design implementation are:

Translate
Map
Place and route

4.2.4.1 Translate

Translate converts the EDIF file to an NGD (Native Generic Database) file, which means that the design is expressed as gates (a netlist). The translate process generates a translate report, which lists the errors and warnings found during translation. This report also gives the list of device and I/O utilization, which helps the designer to select the best device.

4.2.4.2 Map

Mapping converts the NGD file obtained from the translate process to an NCD (Native Circuit Description) file, which means that the gates are mapped to physical components such as flip-flops and multiplexers.

4.2.4.3 Place and Route

Place is the process of selecting the specific logic blocks in the FPGA where the design's gates will reside. Route is the physical routing of the interconnect between logic blocks. In other words, the logic blocks (CLBs) and I/O blocks are assigned to specific locations on the die and interconnections are made between them. This step maps the logical structures described in the netlist onto actual macrocells, interconnections, and input and output pins. The process is similar to the equivalent step in the development of a printed circuit board, and it may likewise allow for either automatic or manual layout optimizations. The result of the place and route process is a bitstream. This name is used generically, despite the fact that each CPLD or FPGA (or family) has its own, usually proprietary, bitstream format. The bitstream is the binary data that must be loaded into the FPGA or CPLD to make the chip implement a particular hardware design.

4.3 Device Programming

Once the bitstream file has been created for a particular FPGA or CPLD, it is downloaded to the device. The details of this process depend on the chip's underlying process technology. The programming technologies used are PROM (one-time programmable), EPROM, EEPROM, and Flash. Just like their memory counterparts, PROM- and EPROM-based logic devices can only be programmed with the help of a separate piece of lab equipment called a device programmer. On the other hand, many of the devices based on EEPROM or Flash technology are in-circuit programmable. In other words, the additional circuitry required to perform device reprogramming is provided within the FPGA or CPLD silicon as well. This makes it possible to erase and reprogram the device internals via a JTAG interface or from an on-board embedded processor.

In addition to non-volatile technologies, there are also programmable logic devices based on SRAM technology. In such cases, the contents of the device are volatile. This has both advantages and disadvantages. The obvious disadvantage is that the internal logic must be reloaded after every system or chip reset, which means that an additional memory chip is needed to hold the bitstream. The advantage is that the contents of the logic device can be changed simply by loading a different bitstream.

4.4 Verilog HDL

The history of the Verilog HDL [8] goes back to the 1980s, when a company called Gateway Design Automation developed a logic simulator, Verilog-XL, and with it a hardware description language. Cadence Design Systems acquired Gateway in 1989 and with it the rights to the language and the simulator. In 1990, Cadence put the language (but not the simulator) into the public domain, with the intention that it should become a standard, nonproprietary language.

The Verilog HDL is now maintained by a not-for-profit organization, Accellera, which was formed from the merger of Open Verilog International (OVI) and VHDL International. OVI had the task of taking the language through the IEEE standardization procedure. In December 1995, Verilog HDL became IEEE Std. 1364-1995. A significantly revised version was published in 2001 as IEEE Std. 1364-2001. A further revision in 2005 added only a few minor changes. Accellera has also developed a new standard, SystemVerilog, which extends Verilog; it became an IEEE standard (1800-2005) in 2005. There is also a draft standard for analog and mixed-signal extensions to Verilog, Verilog-AMS.

4.4.1 Importance of HDLs

HDLs have many advantages compared to traditional schematic-based design:

Designs can be described at a very abstract level by use of HDLs. Designers can write their RTL description without choosing a specific fabrication technology. Logic synthesis tools can automatically convert the design to any fabrication technology. If a new technology emerges, designers do not need to redesign their circuit.
Functional verification of the design can be done early in the design cycle.
HDLs give a better representation of the design because of their simplicity compared to gate-level schematics. Modification and optimization of the design become easy with HDLs.
The design cycle time is cut significantly because the chance of a functional bug at a later stage in the design flow is minimal [8].
4.4.2 Why Verilog?

Verilog HDL has evolved as a standard hardware description language and offers many useful features for hardware design:

It is easy to learn and easy to use, due to the similarity of its syntax to that of the C programming language.
Different levels of abstraction can be mixed in the same design.
Verilog HDL libraries are available for post-logic-synthesis simulation.
Most synthesis tools support Verilog HDL.
The Programming Language Interface (PLI) is a powerful feature that allows the user to write custom C code to interact with the internal data structures of Verilog. Designers can customize a Verilog HDL simulator to their needs with the PLI [8].
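The point about mixing levels of abstraction can be illustrated with a short sketch. The module below is a hypothetical example, not part of this project's design; it combines a gate-level primitive, dataflow assignments, and a behavioral block in one registered full adder.

```verilog
// Hypothetical example: three abstraction levels in a single module.
module mixed_levels_sketch (
    input  wire clk,
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire sum,
    output reg  carry_q
);
    wire axb;
    wire carry;

    // Gate level: an instance of the built-in xor primitive.
    xor g1 (axb, a, b);

    // Dataflow level: continuous assignments.
    assign sum   = axb ^ cin;
    assign carry = (a & b) | (axb & cin);

    // Behavioral level: the carry is registered on the rising clock edge.
    always @(posedge clk)
        carry_q <= carry;
endmodule
```

The appendix listings use the same combination: continuous assignments in the phase comparator and clocked always blocks in the dictionaries and encoder.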

CHAPTER 5: RESULT AND DISCUSSION

Verification in ModelSim (Xilinx)

Compression

Figure 8: Simulation result 1 for Compression

Figure 9: Simulation result 2 for Compression

Figure 10: Simulation result 3 for Compression

Figure 11: Simulation result 4 for Compression

Figure 12: Simulation result 5 for Compression

Figure 13: Simulation result 6 for Compression

Figure 14: Simulation result 7 for Compression

Figure 15: Simulation result 8 for Compression

Decompression

Figure 16: Simulation result 1 for Decompression

Figure 17: Simulation result 2 for Decompression

Figure 18: Simulation result 3 for Decompression

Figure 19: Simulation result 4 for Decompression

Figure 20: Simulation result 5 for Decompression

Figure 21: Simulation result 6 for Decompression

Figure 22: Simulation result 7 for Decompression

Figure 23: Simulation result 8 for Decompression

HDL Synthesis Report- Compression

HDL Synthesis Report- Decompression

CHAPTER 6: CONCLUSION

Compression saves storage space or transmission bandwidth; when the data is subsequently read, a decompression operation is performed. In this project, a high-throughput, streaming, lossless compression algorithm and its efficient implementation on FPGAs were achieved. The proposed solution reaches its peak throughput on a PCIe-based FPGA board with compression and decompression engines, which represents an overall speedup over the reference software implementation. With multiple engines running in parallel, the proposed design provides a path to potential speedups of up to two orders of magnitude. In the current implementation, the achievable overall throughput is limited only by the available PCIe bus bandwidth. The FPGA pipelines include seven parallel data-paths, one for each phrase (one 8-byte, two 4-byte, and four 2-byte phrases), to exploit parallelism and achieve improved performance. The wraparound approach, in which new entries overwrite the oldest ones once a dictionary is full, resulted in better compression than flushing the dictionaries once they are filled.

REFERENCES

[1] Craft, D. J., "A fast hardware data compression algorithm and some algorithmic extensions", IBM Journal of Research and Development, 42(6), 733-745, November 1998.
[2] Tremaine, R. B., et al., "IBM Memory Expansion Technology (MXT)", IBM Journal of Research and Development, 45(2), 271-285, March 2001.
[3] Franaszek, P. A., Lastras, L. A., Peng, S., and Robinson, J. T., "Data Compression with Restricted Parsings", Data Compression Conference (DCC'06), 203-212, 2006.
[4] Núñez, J. L., et al., "X-MatchPRO: A ProASIC-Based 200 Mbytes/s Full-Duplex Lossless Data Compressor", Lecture Notes in Computer Science, 2147/2001, 613-617, January 2001.
[5] Qureshi, M. K., et al., "Enhancing Lifetime and Security of PCM-Based Main Memory with Start-Gap Wear Leveling", 42nd International Symposium on Microarchitecture (MICRO 2009), December 2009.
[6] Vandierendonck, H. and De Bosschere, K., "XOR-based hash functions", IEEE Transactions on Computers, 54(7), 800-812, July 2005.
[7] Chen, X., Yang, L., Dick, R. P., Shang, L., and Lekatsas, H., "C-Pack: A High-Performance Microprocessor Cache Compression Algorithm", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, August 2010.
[8] Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, 3rd Edition, SunSoft Press, 1996.

APPENDIX

VERILOG HDL FOR COMPRESSION AND DECOMPRESSION, VERIFICATION AND REPORT

Compression

Top module:

// Top level of the compression engine. The 68-bit encoder output is read
// out one byte at a time, selected by sel.
module top_comp_new(sel,clk,reset,dataout);

input [3:0] sel;
input clk;
input reset;

output [7:0] dataout;

wire [3:0] sel;
wire clk;
wire reset;
wire [63:0] datain;
wire [7:0] dataout;
wire [67:0] en_data_out;

// Search keys: one 8-byte, two 4-byte and four 2-byte phrases
wire [63:0] key8;
wire [31:0] key4_1;
wire [31:0] key4_2;
wire [15:0] key2_1;
wire [15:0] key2_2;
wire [15:0] key2_3;
wire [15:0] key2_4;

// Dictionary addresses returned by the hash units
wire [3:0] addr8;
wire [3:0] addr4_1;
wire [3:0] addr4_2;
wire [3:0] addr2_1;
wire [3:0] addr2_2;
wire [3:0] addr2_3;
wire [3:0] addr2_4;

// Miss flags from the phrase comparator (1 = phrase differs from dictionary entry)
wire mis8;
wire mis4_1;
wire mis4_2;
wire mis2_1;
wire mis2_2;
wire mis2_3;
wire mis2_4;

// Dictionary contents read at the addresses above
wire [63:0] data8;
wire [31:0] data4_1;
wire [31:0] data4_2;
wire [15:0] data2_1;
wire [15:0] data2_2;
wire [15:0] data2_3;
wire [15:0] data2_4;

// Match flags from the phrase comparator (1 = phrase equals dictionary entry)
wire pout8;
wire pout4_1;
wire pout4_2;
wire pout2_1;
wire pout2_2;
wire pout2_3;
wire pout2_4;

assign key8 = datain;
assign key4_1 = datain[63:32];
assign key4_2 = datain[31:0];
assign key2_1 = datain[63:48];
assign key2_2 = datain[47:32];
assign key2_3 = datain[31:16];
assign key2_4 = datain[15:0];

// Fixed 64-bit word used as the input data
assign datain = 64'haabbccdd12345678;

Hash8 c0(key8,clk,reset,addr8);
Hash4 c1(key4_1,key4_2,clk,reset,addr4_1,addr4_2);
Hash2 c2(key2_1,key2_2,key2_3,key2_4,clk,reset,addr2_1,addr2_2,addr2_3,addr2_4);

dict8 c3(addr8,key8,mis8,clk,reset,data8);
dict4 c4(addr4_1,addr4_2,key4_1,key4_2,mis4_1,mis4_2,clk,reset,data4_1,data4_2);
dict2 c5(addr2_1,addr2_2,addr2_3,addr2_4,key2_1,key2_2,key2_3,key2_4,
         mis2_1,mis2_2,mis2_3,mis2_4,clk,reset,data2_1,data2_2,data2_3,data2_4);

phase_comp c6(datain,data8,data4_1,data4_2,data2_1,data2_2,data2_3,data2_4,clk,reset,
              mis8,mis4_1,mis4_2,mis2_1,mis2_2,mis2_3,mis2_4,
              pout8,pout4_1,pout4_2,pout2_1,pout2_2,pout2_3,pout2_4);

encoder c7(clk,reset,datain,pout8,pout4_1,pout4_2,pout2_1,pout2_2,pout2_3,pout2_4,
           addr8,addr4_1,addr4_2,addr2_1,addr2_2,addr2_3,addr2_4,en_data_out);

// Byte-wide read-out of the 68-bit encoded word
assign dataout = (sel == 4'b0000) ? en_data_out[7:0] :
                 (sel == 4'b0001) ? en_data_out[15:8] :
                 (sel == 4'b0010) ? en_data_out[23:16] :
                 (sel == 4'b0011) ? en_data_out[31:24] :
                 (sel == 4'b0100) ? en_data_out[39:32] :
                 (sel == 4'b0101) ? en_data_out[47:40] :
                 (sel == 4'b0110) ? en_data_out[55:48] :
                 (sel == 4'b0111) ? en_data_out[63:56] :
                 (sel == 4'b1000) ? {4'b0000, en_data_out[67:64]} :
                 8'b0;

endmodule
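The simulation runs reported in Chapter 5 require a testbench around this top module. The following is a minimal sketch of such a testbench, not the one used for the reported results. The testbench name, the 10 ns clock period, and the reset length are assumptions. It must be compiled together with top_comp_new and its submodules, and no particular output values are implied.

```verilog
`timescale 1ns/1ps

// Hypothetical testbench sketch for top_comp_new.
module tb_top_comp_new_sketch;
    reg  [3:0] sel;
    reg        clk;
    reg        reset;
    wire [7:0] dataout;
    integer    k;

    // Device under test, connected by position as in the listing above.
    top_comp_new dut (sel, clk, reset, dataout);

    // Free-running clock; the 10 ns period is an assumption.
    initial clk = 1'b0;
    always #5 clk = ~clk;

    initial begin
        sel   = 4'd0;
        reset = 1'b1;                 // hold synchronous reset for two cycles
        repeat (2) @(negedge clk);
        reset = 1'b0;
        repeat (4) @(negedge clk);    // let the pipeline settle

        // Step through all nine byte positions of the 68-bit encoded word.
        for (k = 0; k <= 8; k = k + 1) begin
            sel = k[3:0];
            #1;
            $display("sel=%0d dataout=%h", sel, dataout);
        end
        $finish;
    end
endmodule
```

Signals are changed on the falling clock edge so that they are stable when the design samples them on the rising edge.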

2 byte hash:

module Hash2(key2_1,key2_2,key2_3,key2_4,clk,reset,addr2_1,addr2_2,addr2_3,addr2_4);

input [15:0] key2_1;
input [15:0] key2_2;
input [15:0] key2_3;
input [15:0] key2_4;

input clk; input reset; output [3:0] addr2_1; output [3:0] addr2_2; output [3:0] addr2_3; output [3:0] addr2_4;

wire [15:0] key2_1; wire [15:0] key2_2; wire [15:0] key2_3; wire [15:0] key2_4;

wire clk; wire reset; reg [3:0] addr2_1; reg [3:0] addr2_2; reg [3:0] addr2_3; reg [3:0] addr2_4;

reg [19:0] hashtable2_1 [7:0]; reg [19:0] hashtable2_2 [7:0]; reg [19:0] hashtable2_3 [7:0];

reg [19:0] hashtable2_4 [7:0];

reg match_1; reg match_2; reg match_3; reg match_4;

reg [3:0] count_1; reg [3:0] count_2; reg [3:0] count_3; reg [3:0] count_4;

reg [2:0] i_1; reg [2:0] i_2; reg [2:0] i_3; reg [2:0] i_4;

always@(posedge clk) begin if(reset == 1'b1) begin hashtable2_1[0] <= 20'd0; hashtable2_1[1] <= 20'd0; hashtable2_1[2] <= 20'd0; hashtable2_1[3] <= 20'd0; hashtable2_1[4] <= 20'd0; hashtable2_1[5] <= 20'd0; hashtable2_1[6] <= 20'd0; hashtable2_1[7] <= 20'd0; count_1 <= 4'd0; i_1 <= 3'd0;

end else if(match_1 == 1'b0) begin hashtable2_1[i_1] <= {key2_1,count_1}; count_1 <= count_1 + 1; i_1 <= i_1 + 1; end end always@( reset or key2_1) begin if(reset == 1'b1) begin addr2_1 = 4'd0;

match_1 = 1'd0; end else if(key2_1 == hashtable2_1[0][19:4]) begin addr2_1 = hashtable2_1[0][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[1][19:4]) begin addr2_1 = hashtable2_1[1][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[2][19:4]) begin addr2_1 = hashtable2_1[2][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[3][19:4]) begin addr2_1 = hashtable2_1[3][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[4][19:4]) begin addr2_1 = hashtable2_1[4][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[5][19:4]) begin addr2_1 = hashtable2_1[5][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[6][19:4]) begin addr2_1 = hashtable2_1[6][3:0]; match_1 = 1'd1; end else if(key2_1 == hashtable2_1[7][19:4])

begin addr2_1 = hashtable2_1[7][3:0]; match_1 = 1'd1; end

else begin addr2_1 = addr2_1; match_1 = 1'd0; end end always@(posedge clk) begin if(reset == 1'b1) begin hashtable2_2[0] <= 20'd0; hashtable2_2[1] <= 20'd0; hashtable2_2[2] <= 20'd0; hashtable2_2[3] <= 20'd0; hashtable2_2[4] <= 20'd0; hashtable2_2[5] <= 20'd0; hashtable2_2[6] <= 20'd0; hashtable2_2[7] <= 20'd0; count_2 <= 4'd0; i_2 <= 3'd0;

end else if(match_2 == 1'b0) begin hashtable2_2[i_2] <= {key2_2,count_2}; count_2 <= count_2 + 1; i_2 <= i_2 + 1; end end

always@( reset or key2_2 ) begin if(reset == 1'b1) begin addr2_2 = 4'd0; match_2 = 1'd0; end

else if(key2_2 == hashtable2_2[0][19:4]) begin

addr2_2 = hashtable2_2[0][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[1][19:4]) begin addr2_2 = hashtable2_2[1][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[2][19:4]) begin addr2_2 = hashtable2_2[2][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[3][19:4]) begin addr2_2 = hashtable2_2[3][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[4][19:4]) begin addr2_2 = hashtable2_2[4][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[5][19:4]) begin addr2_2 = hashtable2_2[5][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[6][19:4]) begin addr2_2 = hashtable2_2[6][3:0]; match_2 = 1'd1; end else if(key2_2 == hashtable2_2[7][19:4]) begin addr2_2 = hashtable2_2[7][3:0]; match_2 = 1'd1; end

else begin addr2_2 = addr2_2; match_2 = 1'd0; end end always@(posedge clk) begin if(reset == 1'b1) begin hashtable2_3[0] <= 20'd0; hashtable2_3[1] <= 20'd0; hashtable2_3[2] <= 20'd0; hashtable2_3[3] <= 20'd0; hashtable2_3[4] <= 20'd0; hashtable2_3[5] <= 20'd0; hashtable2_3[6] <= 20'd0; hashtable2_3[7] <= 20'd0; count_3 <= 4'd0; i_3 <= 3'd0;

end else if(match_3 == 1'b0) begin hashtable2_3[i_3] <= {key2_3,count_3}; count_3 <= count_3 + 1; i_3 <= i_3 + 1; end end

always@( reset or key2_3) begin if(reset == 1'b1) begin addr2_3 = 4'd0; match_3 = 1'd0; end

else if(key2_3 == hashtable2_3[0][19:4]) begin addr2_3 = hashtable2_3[0][3:0]; match_3 = 1'd1; end

else if(key2_3 == hashtable2_3[1][19:4])

begin addr2_3 = hashtable2_3[1][3:0]; match_3 = 1'd1; end else if(key2_3 == hashtable2_3[2][19:4]) begin addr2_3 = hashtable2_3[2][3:0]; match_3 = 1'd1; end else if(key2_3 == hashtable2_3[3][19:4]) begin addr2_3 = hashtable2_3[3][3:0]; match_3 = 1'd1; end else if(key2_3 == hashtable2_3[4][19:4]) begin addr2_3 = hashtable2_3[4][3:0]; match_3 = 1'd1; end else if(key2_3 == hashtable2_3[5][19:4]) begin addr2_3 = hashtable2_3[5][3:0]; match_3 = 1'd1; end else if(key2_3 == hashtable2_3[6][19:4]) begin addr2_3 = hashtable2_3[6][3:0]; match_3 = 1'd1; end else if(key2_3 == hashtable2_3[7][19:4]) begin addr2_3 = hashtable2_3[7][3:0]; match_3 = 1'd1; end else begin addr2_3 = addr2_3; match_3 = 1'd0; end

end always@(posedge clk) begin if(reset == 1'b1) begin hashtable2_4[0] <= 20'd0; hashtable2_4[1] <= 20'd0; hashtable2_4[2] <= 20'd0; hashtable2_4[3] <= 20'd0; hashtable2_4[4] <= 20'd0; hashtable2_4[5] <= 20'd0; hashtable2_4[6] <= 20'd0; hashtable2_4[7] <= 20'd0; count_4 <= 4'd0; i_4 <= 3'd0;

end else if(match_4 == 1'b0) begin hashtable2_4[i_4] <= {key2_4,count_4}; count_4 <= count_4 + 1; i_4 <= i_4 + 1; end end

always@( reset or key2_4 ) begin if(reset == 1'b1) begin addr2_4 = 4'd0; match_4 = 1'd0; end

else if(key2_4 == hashtable2_4[0][19:4]) begin addr2_4 = hashtable2_4[0][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[1][19:4]) begin addr2_4 = hashtable2_4[1][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[2][19:4]) begin addr2_4 = hashtable2_4[2][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[3][19:4]) begin addr2_4 = hashtable2_4[3][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[4][19:4]) begin addr2_4 = hashtable2_4[4][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[5][19:4]) begin addr2_4 = hashtable2_4[5][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[6][19:4]) begin addr2_4 = hashtable2_4[6][3:0]; match_4 = 1'd1; end

else if(key2_4 == hashtable2_4[7][19:4]) begin addr2_4 = hashtable2_4[7][3:0]; match_4 = 1'd1; end

else begin addr2_4 = addr2_4; match_4 = 1'd0; end end endmodule

4 byte hash:

module Hash4(key4_1,key4_2,clk,reset,addr4_1,addr4_2); input [31:0] key4_1; input [31:0] key4_2;

input clk; input reset; output [3:0] addr4_1; output [3:0] addr4_2;

wire [31:0] key4_1; wire [31:0] key4_2;

wire clk; wire reset; reg [3:0] addr4_1; reg [3:0] addr4_2;

reg [35:0] hashtable4_1 [7:0]; reg match_1; reg [3:0] count_1; reg [2:0] i_1;

reg [35:0] hashtable4_2 [7:0]; reg match_2; reg [3:0] count_2; reg [2:0 ] i_2;

always@(posedge clk) begin if(reset == 1'b1) begin hashtable4_1[0] <= 36'd0; hashtable4_1[1] <= 36'd0; hashtable4_1[2] <= 36'd0; hashtable4_1[3] <= 36'd0; hashtable4_1[4] <= 36'd0; hashtable4_1[5] <= 36'd0; hashtable4_1[6] <= 36'd0; hashtable4_1[7] <= 36'd0; count_1 <= 4'd0; i_1 <= 3'd0; end else if(match_1 == 1'b0) begin hashtable4_1[i_1] <= {key4_1,count_1};

count_1 <= count_1 + 1; i_1 <= i_1 + 1; end end always@( reset or key4_1) begin if(reset == 1'b1) begin addr4_1 = 4'd0; match_1 = 1'd0; end

else if(key4_1 == hashtable4_1[0][35:4]) begin addr4_1 = hashtable4_1[0][3:0]; match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[1][35:4]) begin addr4_1 = hashtable4_1[1][3:0]; match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[2][35:4]) begin addr4_1 = hashtable4_1[2][3:0]; match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[3][35:4]) begin addr4_1 = hashtable4_1[3][3:0]; match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[4][35:4]) begin addr4_1 = hashtable4_1[4][3:0]; match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[5][35:4]) begin addr4_1 = hashtable4_1[5][3:0];

match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[6][35:4]) begin addr4_1 = hashtable4_1[6][3:0]; match_1 = 1'd1; end

else if(key4_1 == hashtable4_1[7][35:4]) begin addr4_1 = hashtable4_1[7][3:0]; match_1 = 1'd1; end

else begin addr4_1 = addr4_1; match_1 = 1'd0; end

end always@(posedge clk) begin if(reset == 1'b1) begin hashtable4_2[0] <= 36'd0; hashtable4_2[1] <= 36'd0; hashtable4_2[2] <= 36'd0; hashtable4_2[3] <= 36'd0; hashtable4_2[4] <= 36'd0; hashtable4_2[5] <= 36'd0; hashtable4_2[6] <= 36'd0; hashtable4_2[7] <= 36'd0; count_2 <= 4'd0; i_2 <= 3'd0; end else if(match_2 == 1'b0) begin hashtable4_2[i_2] <= {key4_2,count_2}; count_2 <= count_2 + 1; i_2 <= i_2 + 1; end end

always@( reset or key4_2 ) begin if(reset == 1'b1) begin addr4_2 = 4'd0; match_2 = 1'd0; end

else if(key4_2 == hashtable4_2[0][35:4]) begin addr4_2 = hashtable4_2[0][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[1][35:4]) begin addr4_2 = hashtable4_2[1][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[2][35:4]) begin addr4_2 = hashtable4_2[2][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[3][35:4]) begin addr4_2 = hashtable4_2[3][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[4][35:4]) begin addr4_2 = hashtable4_2[4][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[5][35:4]) begin addr4_2 = hashtable4_2[5][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[6][35:4]) begin

addr4_2 = hashtable4_2[6][3:0]; match_2 = 1'd1; end

else if(key4_2 == hashtable4_2[7][35:4]) begin addr4_2 = hashtable4_2[7][3:0]; match_2 = 1'd1; end

else begin addr4_2 = addr4_2; match_2 = 1'd0; end end endmodule

8 byte hash:

module Hash8(key8,clk,reset,addr8);

input [63:0] key8;
input clk;
input reset;
output [3:0] addr8;

wire [63:0] key8; wire clk; wire reset; reg [3:0] addr8;

reg [67:0] hashtable8 [7:0]; reg match; reg [3:0] count; reg [2:0] i;

always@(posedge clk) begin if(reset == 1'b1) begin hashtable8[0] <= 68'd0; hashtable8[1] <= 68'd0; hashtable8[2] <= 68'd0; hashtable8[3] <= 68'd0; hashtable8[4] <= 68'd0;

hashtable8[5] <= 68'd0; hashtable8[6] <= 68'd0; hashtable8[7] <= 68'd0; count <= 4'd0; i <= 3'd0;

end else if(match == 1'b0) begin hashtable8[i] <= {key8,count}; count <= count + 1; i <= i + 1; end end always@( reset or key8 ) begin if(reset == 1'b1) begin addr8 = 4'd0; match = 1'd0; end

else if(key8 == hashtable8[0][67:4]) begin addr8 = hashtable8[0][3:0]; match = 1'd1; end

else if(key8 == hashtable8[1][67:4]) begin addr8 = hashtable8[1][3:0]; match = 1'd1; end

else if(key8 == hashtable8[2][67:4]) begin addr8 = hashtable8[2][3:0]; match = 1'd1; end

else if(key8 == hashtable8[3][67:4]) begin addr8 = hashtable8[3][3:0]; match = 1'd1; end

else if(key8 == hashtable8[4][67:4]) begin addr8 = hashtable8[4][3:0]; match = 1'd1; end

else if(key8 == hashtable8[5][67:4]) begin addr8 = hashtable8[5][3:0]; match = 1'd1; end

else if(key8 == hashtable8[6][67:4]) begin addr8 = hashtable8[6][3:0]; match = 1'd1; end

else if(key8 == hashtable8[7][67:4]) begin addr8 = hashtable8[7][3:0]; match = 1'd1; end else begin addr8 = addr8; match = 1'd0; end end endmodule
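Hash2, Hash4 and Hash8 repeat the same eight-entry lookup for different key widths. A parameterized form of one lookup unit is sketched below as a possible alternative; it is not the code that was simulated or synthesized for this project. The module name, port names and the KEY_BITS parameter are assumptions. The sketch also differs from the listings in two ways: the search block uses always @(*), so it is re-evaluated when the table changes as well as when the key changes, and addr is driven to zero on a miss instead of holding its previous value.

```verilog
// Hypothetical parameterized lookup unit (one key, eight entries).
module hash_lookup_sketch #(
    parameter KEY_BITS = 64
) (
    input  wire                clk,
    input  wire                reset,   // synchronous, active high
    input  wire [KEY_BITS-1:0] key,
    output reg  [3:0]          addr,
    output reg                 match
);
    // Each entry stores {key, 4-bit dictionary address}, as in the listings.
    reg [KEY_BITS+3:0] entries [0:7];
    reg [3:0]          count;    // dictionary address given to the next new key
    reg [2:0]          wr_idx;   // next entry to overwrite (wraps around)
    integer            k;        // loop index for the reset loop
    integer            j;        // loop index for the search loop

    // Insert the key when it is not found in the table.
    always @(posedge clk) begin
        if (reset) begin
            for (k = 0; k < 8; k = k + 1)
                entries[k] <= {(KEY_BITS+4){1'b0}};
            count  <= 4'd0;
            wr_idx <= 3'd0;
        end else if (!match) begin
            entries[wr_idx] <= {key, count};
            count  <= count + 4'd1;
            wr_idx <= wr_idx + 3'd1;
        end
    end

    // Combinational search. Scanning from entry 7 down to entry 0 lets the
    // lowest matching index win, like the if / else-if chain in the listings.
    always @(*) begin
        match = 1'b0;
        addr  = 4'd0;
        if (!reset) begin
            for (j = 7; j >= 0; j = j - 1) begin
                if (key == entries[j][KEY_BITS+3:4]) begin
                    match = 1'b1;
                    addr  = entries[j][3:0];
                end
            end
        end
    end
endmodule
```

With this form, Hash2 would correspond to four instances with KEY_BITS set to 16, Hash4 to two instances with KEY_BITS set to 32, and Hash8 to one instance with the default width.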

2 byte dictionary:

module dict2(addr2_1,addr2_2,addr2_3,addr2_4,key2_1,key2_2,key2_3,key2_4,
             mis2_1,mis2_2,mis2_3,mis2_4,clk,reset,data2_1,data2_2,data2_3,data2_4);

input [3:0] addr2_1;
input [3:0] addr2_2;
input [3:0] addr2_3;
input [3:0] addr2_4;

input clk; input reset; input mis2_1;

input mis2_2; input mis2_3; input mis2_4; input [15:0] key2_1; input [15:0] key2_2; input [15:0] key2_3; input [15:0] key2_4; output [15:0] data2_1; output [15:0] data2_2; output [15:0] data2_3; output [15:0] data2_4; wire [3:0] addr2_1; wire [3:0] addr2_2; wire [3:0] addr2_3; wire [3:0] addr2_4; wire [15:0] key2_1; wire [15:0] key2_2; wire [15:0] key2_3; wire [15:0] key2_4;

wire clk; wire reset; wire mis2_1; wire mis2_2; wire mis2_3; wire mis2_4; wire [15:0] data2_1; wire [15:0] data2_2; wire [15:0] data2_3; wire [15:0] data2_4; reg [15:0] dictionary2_1 [15:0]; reg [15:0] dictionary2_2 [15:0]; reg [15:0] dictionary2_3 [15:0]; reg [15:0] dictionary2_4 [15:0]; reg [3:0] count2_1; reg [3:0] count2_2; reg [3:0] count2_3;

reg [3:0] count2_4; always@(posedge clk) begin if(reset == 1'b1) begin dictionary2_1[0] <= 16'd0; dictionary2_1[1] <= 16'd0; dictionary2_1[2] <= 16'd0; dictionary2_1[3] <= 16'd0; dictionary2_1[4] <= 16'd0; dictionary2_1[5] <= 16'd0; dictionary2_1[6] <= 16'd0; dictionary2_1[7] <= 16'd0; dictionary2_1[8] <= 16'd0; dictionary2_1[9] <= 16'd0; dictionary2_1[10] <= 16'd0; dictionary2_1[11] <= 16'd0; dictionary2_1[12] <= 16'd0; dictionary2_1[13] <= 16'd0; dictionary2_1[14] <= 16'd0; dictionary2_1[15] <= 16'd0; count2_1 <= 4'd0; end

else if(mis2_1 == 1'b1) begin dictionary2_1[count2_1] <= key2_1; count2_1 <= count2_1 + 1; end end

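always@(posedge clk)
begin
    if(reset == 1'b1)
    begin
        dictionary2_2[0] <= 16'd0;
        dictionary2_2[1] <= 16'd0;
        dictionary2_2[2] <= 16'd0;
        dictionary2_2[3] <= 16'd0;
        dictionary2_2[4] <= 16'd0;
        dictionary2_2[5] <= 16'd0;
        dictionary2_2[6] <= 16'd0;
        dictionary2_2[7] <= 16'd0;
        dictionary2_2[8] <= 16'd0;
        dictionary2_2[9] <= 16'd0;
        dictionary2_2[10] <= 16'd0;
        dictionary2_2[11] <= 16'd0;
        dictionary2_2[12] <= 16'd0;
        dictionary2_2[13] <= 16'd0;
        dictionary2_2[14] <= 16'd0;
        dictionary2_2[15] <= 16'd0;
        count2_2 <= 4'd0;
    end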

else if(mis2_2 == 1'b1) begin dictionary2_2[count2_2] <= key2_2; count2_2 <= count2_2 + 1; end end always@(posedge clk) begin if(reset == 1'b1) begin dictionary2_3[0] <= 16'd0; dictionary2_3[1] <= 16'd0; dictionary2_3[2] <= 16'd0; dictionary2_3[3] <= 16'd0; dictionary2_3[4] <= 16'd0; dictionary2_3[5] <= 16'd0; dictionary2_3[6] <= 16'd0; dictionary2_3[7] <= 16'd0; dictionary2_3[8] <= 16'd0; dictionary2_3[9] <= 16'd0; dictionary2_3[10] <= 16'd0; dictionary2_3[11] <= 16'd0; dictionary2_3[12] <= 16'd0; dictionary2_3[13] <= 16'd0; dictionary2_3[14] <= 16'd0; dictionary2_3[15] <= 16'd0; count2_3 <= 4'd0; end

else if(mis2_3 == 1'b1) begin dictionary2_3[count2_3] <= key2_3; count2_3 <= count2_3 + 1; end end

always@(posedge clk) begin if(reset == 1'b1) begin dictionary2_4[0] <= 16'd0; dictionary2_4[1] <= 16'd0; dictionary2_4[2] <= 16'd0; dictionary2_4[3] <= 16'd0; dictionary2_4[4] <= 16'd0; dictionary2_4[5] <= 16'd0; dictionary2_4[6] <= 16'd0; dictionary2_4[7] <= 16'd0; dictionary2_4[8] <= 16'd0; dictionary2_4[9] <= 16'd0; dictionary2_4[10] <= 16'd0; dictionary2_4[11] <= 16'd0; dictionary2_4[12] <= 16'd0; dictionary2_4[13] <= 16'd0; dictionary2_4[14] <= 16'd0; dictionary2_4[15] <= 16'd0; count2_4 <= 4'd0; end

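    else if(mis2_4 == 1'b1)
    begin
        // Miss: store the new phrase at this bank's own write position
        dictionary2_4[count2_4] <= key2_4;
        count2_4 <= count2_4 + 1;
    end
end

// Read ports: contents at the addresses supplied by the hash unit
assign data2_1 = dictionary2_1[addr2_1];
assign data2_2 = dictionary2_2[addr2_2];
assign data2_3 = dictionary2_3[addr2_3];
assign data2_4 = dictionary2_4[addr2_4];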

endmodule

4 byte dictionary:

module dict4(addr4_1,addr4_2,key4_1,key4_2,mis4_1,mis4_2,clk,reset,data4_1,data4_2);

input [3:0] addr4_1;
input [3:0] addr4_2;
input [31:0] key4_1;

input [31:0] key4_2; input clk; input reset; input mis4_1; input mis4_2; output [31:0] data4_1; output [31:0] data4_2; wire [3:0] addr4_1; wire [3:0] addr4_2; wire [31:0] key4_1; wire [31:0] key4_2; wire clk; wire reset; wire mis4_1; wire mis4_2; wire [31:0] data4_1; wire [31:0] data4_2; reg [31:0] dictionary4_1 [15:0]; reg [31:0] dictionary4_2 [15:0]; reg [3:0] count1; reg [3:0] count2; always@(posedge clk) begin if(reset == 1'b1) begin dictionary4_1[0] <= 32'd0; dictionary4_1[1] <= 32'd0; dictionary4_1[2] <= 32'd0; dictionary4_1[3] <= 32'd0; dictionary4_1[4] <= 32'd0; dictionary4_1[5] <= 32'd0; dictionary4_1[6] <= 32'd0; dictionary4_1[7] <= 32'd0; dictionary4_1[8] <= 32'd0; dictionary4_1[9] <= 32'd0; dictionary4_1[10] <= 32'd0; dictionary4_1[11] <= 32'd0; dictionary4_1[12] <= 32'd0; dictionary4_1[13] <= 32'd0; dictionary4_1[14] <= 32'd0; dictionary4_1[15] <= 32'd0; count1 <= 4'd0; end

else if(mis4_1 == 1'b1) begin dictionary4_1[count1] <= key4_1; count1 <= count1 + 1; end end

always@(posedge clk) begin if(reset == 1'b1) begin dictionary4_2[0] <= 32'd0; dictionary4_2[1] <= 32'd0; dictionary4_2[2] <= 32'd0; dictionary4_2[3] <= 32'd0; dictionary4_2[4] <= 32'd0; dictionary4_2[5] <= 32'd0; dictionary4_2[6] <= 32'd0; dictionary4_2[7] <= 32'd0; dictionary4_2[8] <= 32'd0; dictionary4_2[9] <= 32'd0; dictionary4_2[10] <= 32'd0; dictionary4_2[11] <= 32'd0; dictionary4_2[12] <= 32'd0; dictionary4_2[13] <= 32'd0; dictionary4_2[14] <= 32'd0; dictionary4_2[15] <= 32'd0; count2 <= 4'd0; end

else if(mis4_2 == 1'b1) begin dictionary4_2[count2] <= key4_2; count2 <= count2 + 1; end end assign data4_1 = dictionary4_1[addr4_1]; assign data4_2 = dictionary4_2[addr4_2];

endmodule

8 byte dictionary:

module dict8(addr8,key8,mis8,clk,reset,data8);

input [3:0] addr8;
input [63:0] key8;
input clk;
input reset;
input mis8;
output [63:0] data8;

wire [3:0] addr8;
wire [63:0] key8;
wire clk;
wire reset;
wire mis8;
wire [63:0] data8;

reg [63:0] dictionary8 [15:0];
reg [3:0] count;   // next entry to write; wraps around after 16 writes

always@(posedge clk)
begin
    if(reset == 1'b1)
    begin
        dictionary8[0] <= 64'd0;
        dictionary8[1] <= 64'd0;
        dictionary8[2] <= 64'd0;
        dictionary8[3] <= 64'd0;
        dictionary8[4] <= 64'd0;
        dictionary8[5] <= 64'd0;
        dictionary8[6] <= 64'd0;
        dictionary8[7] <= 64'd0;
        dictionary8[8] <= 64'd0;
        dictionary8[9] <= 64'd0;
        dictionary8[10] <= 64'd0;
        dictionary8[11] <= 64'd0;
        dictionary8[12] <= 64'd0;
        dictionary8[13] <= 64'd0;
        dictionary8[14] <= 64'd0;
        dictionary8[15] <= 64'd0;
        count <= 4'd0;
    end
    else if(mis8 == 1'b1)
    begin
        // Miss: store the new phrase at the next entry
        dictionary8[count] <= key8;
        count <= count + 1;
    end
end

// Read port: contents at the address supplied by the hash unit
assign data8 = dictionary8[addr8];

endmodule
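The dict2, dict4 and dict8 modules differ only in phrase width and in the number of banks. As a possible refactoring, a single parameterized bank is sketched below. The module name, port names and the WIDTH and ADDR_BITS parameters are assumptions introduced for this sketch; it is not the code used to produce the reported results.

```verilog
// Hypothetical parameterized dictionary bank.
module dict_bank_sketch #(
    parameter WIDTH     = 16,   // phrase width in bits (16, 32 or 64 in this design)
    parameter ADDR_BITS = 4     // 2**ADDR_BITS entries per bank
) (
    input  wire                 clk,
    input  wire                 reset,   // synchronous, active high
    input  wire                 miss,    // 1: store key at the next entry
    input  wire [ADDR_BITS-1:0] addr,    // read address from the hash unit
    input  wire [WIDTH-1:0]     key,
    output wire [WIDTH-1:0]     data
);
    localparam DEPTH = 1 << ADDR_BITS;

    reg [WIDTH-1:0]     mem [0:DEPTH-1];
    reg [ADDR_BITS-1:0] wr_ptr;
    integer             k;

    always @(posedge clk) begin
        if (reset) begin
            // Clear every entry, as the unrolled reset code does.
            for (k = 0; k < DEPTH; k = k + 1)
                mem[k] <= {WIDTH{1'b0}};
            wr_ptr <= {ADDR_BITS{1'b0}};
        end else if (miss) begin
            mem[wr_ptr] <= key;
            wr_ptr      <= wr_ptr + 1'b1;   // wraps around when the bank is full
        end
    end

    // Asynchronous read at the hashed address.
    assign data = mem[addr];
endmodule
```

Under these assumptions, dict8 corresponds to one instance with WIDTH set to 64, dict4 to two instances with WIDTH set to 32, and dict2 to four instances with the default width. Giving each bank its own write pointer inside the module removes the possibility of indexing one bank with another bank's counter.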

Phase comparator:

module phase_comp(data8, d8, d4_1, d4_2, d2_1, d2_2, d2_3, d2_4, clk, reset,
                  mis8, mis4_1, mis4_2, mis2_1, mis2_2, mis2_3, mis2_4,
                  pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4);

input [63:0] data8; input [63:0] d8; input [31:0] d4_1; input [31:0] d4_2; input [15:0] d2_1; input [15:0] d2_2; input [15:0] d2_3; input [15:0] d2_4;

input clk; input reset;

output mis8; output mis4_1; output mis4_2;

output mis2_1; output mis2_2; output mis2_3; output mis2_4;

output pout8; output pout4_1; output pout4_2; output pout2_1; output pout2_2; output pout2_3; output pout2_4;

wire [63:0] data8; wire [63:0] d8; wire [31:0] d4_1; wire [31:0] d4_2; wire [15:0] d2_1; wire [15:0] d2_2; wire [15:0] d2_3; wire [15:0] d2_4;

wire clk; wire reset;

wire mis8; wire mis4_1; wire mis4_2; wire mis2_1; wire mis2_2; wire mis2_3; wire mis2_4;

wire pout8; wire pout4_1; wire pout4_2; wire pout2_1; wire pout2_2; wire pout2_3; wire pout2_4;

wire [63:0] data8_temp;

assign data8_temp = data8;

// For each phrase, misX is 1 when the phrase differs from the dictionary
// entry and poutX is 1 when it matches. Both are forced to 0 during reset.
assign mis8 = (reset == 1'd1)?1'd0: (data8_temp == d8)?1'd0:1'd1;
assign pout8 = (reset == 1'd1)?1'd0: (data8_temp == d8)?1'd1:1'd0;

assign mis4_1 = (reset == 1'd1)?1'd0: (data8_temp[63:32] == d4_1)?1'd0:1'd1;
assign pout4_1 = (reset == 1'd1)?1'd0: (data8_temp[63:32] == d4_1)?1'd1:1'd0;
assign mis4_2 = (reset == 1'd1)?1'd0: (data8_temp[31:0] == d4_2)?1'd0:1'd1;
assign pout4_2 = (reset == 1'd1)?1'd0: (data8_temp[31:0] == d4_2)?1'd1:1'd0;

assign mis2_1 = (reset == 1'd1)?1'd0: (data8_temp[63:48] == d2_1)?1'd0:1'd1;
assign pout2_1 = (reset == 1'd1)?1'd0: (data8_temp[63:48] == d2_1)?1'd1:1'd0;
assign mis2_2 = (reset == 1'd1)?1'd0: (data8_temp[47:32] == d2_2)?1'd0:1'd1;
assign pout2_2 = (reset == 1'd1)?1'd0: (data8_temp[47:32] == d2_2)?1'd1:1'd0;
assign mis2_3 = (reset == 1'd1)?1'd0: (data8_temp[31:16] == d2_3)?1'd0:1'd1;
assign pout2_3 = (reset == 1'd1)?1'd0: (data8_temp[31:16] == d2_3)?1'd1:1'd0;
assign mis2_4 = (reset == 1'd1)?1'd0: (data8_temp[15:0] == d2_4)?1'd0:1'd1;
assign pout2_4 = (reset == 1'd1)?1'd0: (data8_temp[15:0] == d2_4)?1'd1:1'd0;

endmodule

Encoder:

module encoder(clk, reset, data8,
               pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4,
               addr8, addr4_1, addr4_2, addr2_1, addr2_2, addr2_3, addr2_4,
               dataout);

input clk;
input reset;
input [63:0] data8;

input pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4;

input [3:0] addr8, addr4_1, addr4_2, addr2_1, addr2_2, addr2_3, addr2_4;

output [67:0] dataout;

wire [63:0] data8;
wire clk;
wire reset;

wire pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4;

wire [3:0] addr8, addr4_1, addr4_2, addr2_1, addr2_2, addr2_3, addr2_4;

reg [67:0] dataout;

// Codeword layout for a match: bits [3:0] hold the dictionary address, bits [7:4]
// the match code, and bits [8] upward the bytes that did not match.
// Concatenations narrower than 68 bits are zero-extended on assignment.
always @(posedge clk) begin
    if (reset == 1'b1) begin
        dataout <= 68'd0;
    end
    else if (data8 == 64'd0) begin                  // all-zero word
        dataout <= 68'd0;
    end
    else if (pout8 == 1'd1) begin                   // whole word matched
        dataout <= {4'b0001, addr8};
    end
    else if (pout4_1 == 1'd1) begin                 // bits [63:32] matched
        dataout <= {data8[31:0], 4'b0010, addr4_1};
    end
    else if (pout4_2 == 1'd1) begin                 // bits [31:0] matched
        dataout <= {data8[63:32], 4'b0011, addr4_2};
    end
    else if (pout2_1 == 1'd1) begin                 // bits [63:48] matched
        dataout <= {data8[47:0], 4'b0111, addr2_1};
    end
    else if (pout2_2 == 1'd1) begin                 // bits [47:32] matched
        dataout <= {data8[63:48], data8[31:0], 4'b0110, addr2_2};
    end
    else if (pout2_3 == 1'd1) begin                 // bits [31:16] matched
        dataout <= {data8[63:32], data8[15:0], 4'b0101, addr2_3};
    end
    else if (pout2_4 == 1'd1) begin                 // bits [15:0] matched
        dataout <= {data8[63:16], 4'b0100, addr2_4};
    end
    else begin                                      // no match: raw word, code 1000 in bits [3:0]
        dataout <= {data8, 4'b1000};
    end
end

endmodule
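
The codeword layout produced by the encoder above can be checked in simulation. The following is only a minimal testbench sketch: the module name encoder_tb, the check task and the stimulus values are assumptions made for illustration and are not part of the original design. It drives the match flags directly instead of instantiating the comparator and dictionaries.

```verilog
`timescale 1ns/1ps

// Hypothetical testbench: encoder_tb, check and the stimulus values are
// illustrative only and do not appear in the original project.
module encoder_tb;

reg         clk, reset;
reg  [63:0] data8;
reg         pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4;
reg  [3:0]  addr8, addr4_1, addr4_2, addr2_1, addr2_2, addr2_3, addr2_4;
wire [67:0] dataout;

encoder dut(clk, reset, data8,
            pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4,
            addr8, addr4_1, addr4_2, addr2_1, addr2_2, addr2_3, addr2_4,
            dataout);

always #5 clk = ~clk;

// Wait for the registered output to update, then compare it with the
// expected codeword (narrower arguments are zero-extended to 68 bits).
task check;
    input [67:0] expected;
    begin
        @(posedge clk);
        #1;
        if (dataout !== expected)
            $display("FAIL: got %h, expected %h", dataout, expected);
        else
            $display("ok:   %h", dataout);
    end
endtask

initial begin
    clk = 1'b0; reset = 1'b1; data8 = 64'd0;
    {pout8, pout4_1, pout4_2, pout2_1, pout2_2, pout2_3, pout2_4} = 7'd0;
    {addr8, addr4_1, addr4_2, addr2_1, addr2_2, addr2_3, addr2_4} = 28'd0;
    check(68'd0);                                   // output cleared during reset

    reset = 1'b0;
    data8 = 64'h1122334455667788;

    pout8 = 1'b1; addr8 = 4'd3;                     // whole word found at address 3
    check({4'b0001, 4'd3});

    pout8 = 1'b0; pout4_1 = 1'b1; addr4_1 = 4'd5;   // upper half found at address 5
    check({32'h55667788, 4'b0010, 4'd5});

    pout4_1 = 1'b0;                                 // nothing found: literal word
    check({64'h1122334455667788, 4'b1000});

    $finish;
end

endmodule
```

Inputs are changed one time unit after each rising clock edge, so the encoder samples stable values on the following edge and the comparison never races with the register update.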

Decompression

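Top module:

module top_dcomp(datain, clk, reset, dataout);

input clk;
input reset;
input [67:0] datain;

output [63:0] dataout;

wire clk;
wire reset;

wire [67:0] datain;

wire [63:0] dataout8;   // read data from the 8 byte dictionary
wire [63:0] Dataout8;   // read data from the 4 byte dictionary
wire [63:0] dataOut8;   // read data from the 2 byte dictionary

wire [67:0] data_out;
wire [63:0] dataout;

wire [3:0] addr8;
wire [3:0] addr4;
wire [3:0] addr2;

// The reconstructed word dataout is fed back to all three dictionaries so that
// they are updated with the same data the compressor saw.
// As connected here, only dataout8 reaches data_gen; Dataout8 and dataOut8 are not used downstream.
decoder  d1(clk, reset, datain, data_out, addr8, addr4, addr2);
dict8    d2(addr8, dataout8, clk, reset, dataout);
dict4    d3(addr4, Dataout8, clk, reset, dataout);
dict2    d4(addr2, dataOut8, clk, reset, dataout);
data_gen d5(clk, reset, data_out, dataout8, dataout);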

endmodule

Decoder:

module decoder(clk, reset, datain, dataout, addr8, addr4, addr2);

input clk;
input reset;
input [67:0] datain;

output [67:0] dataout;
output [3:0] addr8;
output [3:0] addr4;
output [3:0] addr2;

wire clk;
wire reset;
wire [67:0] datain;

wire [67:0] dataout;
wire [3:0] addr8;
wire [3:0] addr4;
wire [3:0] addr2;

// The codeword is passed through unchanged for data_gen.
assign dataout = datain;

// The encoder places the 4-bit dictionary address in bits [3:0] of the codeword.
// The original slice datain[7:3] is five bits wide and overlaps the match code,
// so the address is taken from datain[3:0] instead.
assign addr8 = datain[3:0];
assign addr4 = datain[3:0];
assign addr2 = datain[3:0];

endmodule

2 byte dictionary:

module dict2(addr8, dataout8, clk, reset, datain8);

input [3:0] addr8;
input clk;
input reset;
input [63:0] datain8;

output [63:0] dataout8;

wire [3:0] addr8;
wire clk;
wire reset;
wire [63:0] datain8;
wire [63:0] dataout8;

wire [15:0] datain2_1, datain2_2, datain2_3, datain2_4;

reg [15:0] dictionary2_1 [15:0];
reg [15:0] dictionary2_2 [15:0];
reg [15:0] dictionary2_3 [15:0];
reg [15:0] dictionary2_4 [15:0];

reg [3:0] count2_1, count2_2, count2_3, count2_4;

integer i;

// Split the input word into four 16-bit lanes (lane 1 is the least significant).
// The original listing assigned to undeclared nets data2_*; the declared nets datain2_* are used here.
assign datain2_1 = datain8[15:0];
assign datain2_2 = datain8[31:16];
assign datain2_3 = datain8[47:32];
assign datain2_4 = datain8[63:48];

// Each lane keeps its own 16-entry dictionary. A value that is not yet present is
// stored at the lane's write pointer, which wraps around after 16 insertions.
always @(posedge clk) begin
    if (reset == 1'b1) begin
        for (i = 0; i < 16; i = i + 1) begin
            dictionary2_1[i] <= 16'd0;
            dictionary2_2[i] <= 16'd0;
            dictionary2_3[i] <= 16'd0;
            dictionary2_4[i] <= 16'd0;
        end
        count2_1 <= 4'd0;
        count2_2 <= 4'd0;
        count2_3 <= 4'd0;
        count2_4 <= 4'd0;
    end
    else begin
        if (datain2_1 != dictionary2_1[0]  && datain2_1 != dictionary2_1[1]  &&
            datain2_1 != dictionary2_1[2]  && datain2_1 != dictionary2_1[3]  &&
            datain2_1 != dictionary2_1[4]  && datain2_1 != dictionary2_1[5]  &&
            datain2_1 != dictionary2_1[6]  && datain2_1 != dictionary2_1[7]  &&
            datain2_1 != dictionary2_1[8]  && datain2_1 != dictionary2_1[9]  &&
            datain2_1 != dictionary2_1[10] && datain2_1 != dictionary2_1[11] &&
            datain2_1 != dictionary2_1[12] && datain2_1 != dictionary2_1[13] &&
            datain2_1 != dictionary2_1[14] && datain2_1 != dictionary2_1[15]) begin
            dictionary2_1[count2_1] <= datain2_1;
            count2_1 <= count2_1 + 1;
        end

        if (datain2_2 != dictionary2_2[0]  && datain2_2 != dictionary2_2[1]  &&
            datain2_2 != dictionary2_2[2]  && datain2_2 != dictionary2_2[3]  &&
            datain2_2 != dictionary2_2[4]  && datain2_2 != dictionary2_2[5]  &&
            datain2_2 != dictionary2_2[6]  && datain2_2 != dictionary2_2[7]  &&
            datain2_2 != dictionary2_2[8]  && datain2_2 != dictionary2_2[9]  &&
            datain2_2 != dictionary2_2[10] && datain2_2 != dictionary2_2[11] &&
            datain2_2 != dictionary2_2[12] && datain2_2 != dictionary2_2[13] &&
            datain2_2 != dictionary2_2[14] && datain2_2 != dictionary2_2[15]) begin
            dictionary2_2[count2_2] <= datain2_2;
            count2_2 <= count2_2 + 1;
        end

        if (datain2_3 != dictionary2_3[0]  && datain2_3 != dictionary2_3[1]  &&
            datain2_3 != dictionary2_3[2]  && datain2_3 != dictionary2_3[3]  &&
            datain2_3 != dictionary2_3[4]  && datain2_3 != dictionary2_3[5]  &&
            datain2_3 != dictionary2_3[6]  && datain2_3 != dictionary2_3[7]  &&
            datain2_3 != dictionary2_3[8]  && datain2_3 != dictionary2_3[9]  &&
            datain2_3 != dictionary2_3[10] && datain2_3 != dictionary2_3[11] &&
            datain2_3 != dictionary2_3[12] && datain2_3 != dictionary2_3[13] &&
            datain2_3 != dictionary2_3[14] && datain2_3 != dictionary2_3[15]) begin
            dictionary2_3[count2_3] <= datain2_3;
            count2_3 <= count2_3 + 1;
        end

        if (datain2_4 != dictionary2_4[0]  && datain2_4 != dictionary2_4[1]  &&
            datain2_4 != dictionary2_4[2]  && datain2_4 != dictionary2_4[3]  &&
            datain2_4 != dictionary2_4[4]  && datain2_4 != dictionary2_4[5]  &&
            datain2_4 != dictionary2_4[6]  && datain2_4 != dictionary2_4[7]  &&
            datain2_4 != dictionary2_4[8]  && datain2_4 != dictionary2_4[9]  &&
            datain2_4 != dictionary2_4[10] && datain2_4 != dictionary2_4[11] &&
            datain2_4 != dictionary2_4[12] && datain2_4 != dictionary2_4[13] &&
            datain2_4 != dictionary2_4[14] && datain2_4 != dictionary2_4[15]) begin
            dictionary2_4[count2_4] <= datain2_4;
            count2_4 <= count2_4 + 1;
        end
    end
end

// Asynchronous read: the same address selects one entry from every lane.
assign dataout8 = {dictionary2_4[addr8], dictionary2_3[addr8], dictionary2_2[addr8], dictionary2_1[addr8]};

endmodule

4 byte dictionary:

module dict4(addr8, dataout8, clk, reset, datain8);

input [3:0] addr8;
input clk;
input reset;
input [63:0] datain8;

output [63:0] dataout8;

wire [3:0] addr8;
wire clk;
wire reset;
wire [63:0] datain8;
wire [63:0] dataout8;

wire [31:0] datain4_1, datain4_2;

reg [31:0] dictionary4_1 [15:0];
reg [31:0] dictionary4_2 [15:0];

reg [3:0] count4_1, count4_2;

integer i;

// Split the input word into two 32-bit lanes (lane 1 is the least significant).
// The original listing assigned to undeclared nets data4_*; the declared nets datain4_* are used here.
assign datain4_1 = datain8[31:0];
assign datain4_2 = datain8[63:32];

// The listing lost the header and reset branch of the second lane's always block;
// both lanes are handled here in one clocked block, following the pattern of dict2.
always @(posedge clk) begin
    if (reset == 1'b1) begin
        for (i = 0; i < 16; i = i + 1) begin
            dictionary4_1[i] <= 32'd0;
            dictionary4_2[i] <= 32'd0;
        end
        count4_1 <= 4'd0;
        count4_2 <= 4'd0;
    end
    else begin
        if (datain4_1 != dictionary4_1[0]  && datain4_1 != dictionary4_1[1]  &&
            datain4_1 != dictionary4_1[2]  && datain4_1 != dictionary4_1[3]  &&
            datain4_1 != dictionary4_1[4]  && datain4_1 != dictionary4_1[5]  &&
            datain4_1 != dictionary4_1[6]  && datain4_1 != dictionary4_1[7]  &&
            datain4_1 != dictionary4_1[8]  && datain4_1 != dictionary4_1[9]  &&
            datain4_1 != dictionary4_1[10] && datain4_1 != dictionary4_1[11] &&
            datain4_1 != dictionary4_1[12] && datain4_1 != dictionary4_1[13] &&
            datain4_1 != dictionary4_1[14] && datain4_1 != dictionary4_1[15]) begin
            dictionary4_1[count4_1] <= datain4_1;
            count4_1 <= count4_1 + 1;
        end

        if (datain4_2 != dictionary4_2[0]  && datain4_2 != dictionary4_2[1]  &&
            datain4_2 != dictionary4_2[2]  && datain4_2 != dictionary4_2[3]  &&
            datain4_2 != dictionary4_2[4]  && datain4_2 != dictionary4_2[5]  &&
            datain4_2 != dictionary4_2[6]  && datain4_2 != dictionary4_2[7]  &&
            datain4_2 != dictionary4_2[8]  && datain4_2 != dictionary4_2[9]  &&
            datain4_2 != dictionary4_2[10] && datain4_2 != dictionary4_2[11] &&
            datain4_2 != dictionary4_2[12] && datain4_2 != dictionary4_2[13] &&
            datain4_2 != dictionary4_2[14] && datain4_2 != dictionary4_2[15]) begin
            dictionary4_2[count4_2] <= datain4_2;
            count4_2 <= count4_2 + 1;
        end
    end
end

// Asynchronous read. Lane 2 holds the upper half of the word, so it is placed first;
// the original concatenation listed lane 1 first, which swapped the two halves.
assign dataout8 = {dictionary4_2[addr8], dictionary4_1[addr8]};

endmodule

8 byte dictionary:

module dict8(addr8, dataout8, clk, reset, datain8);

input [3:0] addr8;
input clk;
input reset;
input [63:0] datain8;

output [63:0] dataout8;

wire [3:0] addr8;
wire clk;
wire reset;
wire [63:0] datain8;
wire [63:0] dataout8;

reg [63:0] dictionary8 [15:0];
reg [3:0] count;

integer i;

// A 64-bit word that is not yet present is stored at the write pointer,
// which wraps around after 16 insertions.
always @(posedge clk) begin
    if (reset == 1'b1) begin
        for (i = 0; i < 16; i = i + 1) begin
            dictionary8[i] <= 64'd0;
        end
        count <= 4'd0;
    end
    else if (datain8 != dictionary8[0]  && datain8 != dictionary8[1]  &&
             datain8 != dictionary8[2]  && datain8 != dictionary8[3]  &&
             datain8 != dictionary8[4]  && datain8 != dictionary8[5]  &&
             datain8 != dictionary8[6]  && datain8 != dictionary8[7]  &&
             datain8 != dictionary8[8]  && datain8 != dictionary8[9]  &&
             datain8 != dictionary8[10] && datain8 != dictionary8[11] &&
             datain8 != dictionary8[12] && datain8 != dictionary8[13] &&
             datain8 != dictionary8[14] && datain8 != dictionary8[15]) begin
        dictionary8[count] <= datain8;
        count <= count + 1;
    end
end

// Asynchronous read of the addressed entry.
assign dataout8 = dictionary8[addr8];

endmodule

Data generator:

module data_gen(clk, reset, datain, datain8, dataout);

input clk;
input reset;
input [67:0] datain;
input [63:0] datain8;

output [63:0] dataout;

wire clk;
wire reset;
wire [67:0] datain;
wire [63:0] datain8;
wire [63:0] dataout;

// Rebuild the 64-bit word from the match code in datain[7:4]: matched bytes come
// from the dictionary word datain8, unmatched bytes from the codeword itself.
// The encoder writes the no-match code 1000 into bits [3:0], whereas the last test
// below reads bits [7:4]; that behavior is kept as in the original listing.
assign dataout =
    (datain[7:4] == 4'b0001) ? datain8 :
    (datain[7:4] == 4'b0010) ? {datain8[63:32], datain[39:8]} :
    (datain[7:4] == 4'b0011) ? {datain[39:8], datain8[31:0]} :
    (datain[7:4] == 4'b0100) ? {datain[55:8], datain8[15:0]} :
    (datain[7:4] == 4'b0101) ? {datain[55:24], datain8[31:16], datain[23:8]} :
    (datain[7:4] == 4'b0110) ? {datain[55:40], datain8[47:32], datain[39:8]} :
    (datain[7:4] == 4'b0111) ? {datain8[63:48], datain[55:8]} :
    (datain[7:4] == 4'b1000) ? datain[67:4] :
                               64'd0;

endmodule