A benchmark for water column data compression

ing. Bouke Grevelt

August 27, 2017, 65 pages

Supervisor: dr. ir. A.L. Varbanescu Host organisation: Qps B.V., http://www.qps.nl, Zeist, +31 (0)30 69 41 200 Contact: Jonathan Beaudoin, PhD, Fredericton, +1 506 454 4487, [email protected]

Universiteit van Amsterdam, Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering
http://www.software-engineering-amsterdam.nl

Contents

Abstract
Preface
1 Introduction
  1.1 Problem definition
  1.2 Approach
  1.3 Document structure
2 Related Work
  2.1 Water column data compression
  2.2 Benchmarking & test benches
  2.3 Corpus creation
3 Water Column Corpus
  3.1 Existing corpora
    3.1.1 Calgary corpus
    3.1.2 Canterbury corpus
    3.1.3 Ekushe-Khul
    3.1.4 Silesia corpus
    3.1.5 Prague Corpus
  3.2 Corpus creation method
  3.3 Results
4 Metrics
  4.1 Generic compression metrics
  4.2 Real-time compression
  4.3 Processing
  4.4 Cost reduction
  4.5 Overview
5 Test bench
  5.1 Requirements
  5.2 Structural design
    5.2.1 Test execution logic
    5.2.2 Input files
  5.3 Compression algorithm
  5.4 Compressed file / decompressed file
  5.5 Results
    5.5.1 Configuration file
  5.6 Behavioral design
    5.6.1 Test execution
    5.6.2 Metrics computation
6 Benchmark
  6.1 Publication
  6.2 Rules
7 Using the test bench
  7.1 Installing the test bench
  7.2 Adding a new algorithm to the test bench
    7.2.1 The algorithm to add
    7.2.2 Building a shared library
    7.2.3 Exposing the algorithm in Python
    7.2.4 Implementing the interface
    7.2.5 Add the algorithm to the test bench configuration
  7.3 Configuring the test bench
  7.4 Running the test bench
8 Results and Analysis
  8.1 Algorithms
  8.2 Generic compression metrics
    8.2.1 Deviation in (de)compression time
    8.2.2 LZMA compression time
    8.2.3 Jpg2k compression ratio
    8.2.4 Random access decompression compared to full file decompression
    8.2.5 Jpg2k performance
  8.3 The processing metric
  8.4 The real-time metric
  8.5 The cost reduction metric
9 Conclusion & Future work
  9.1 Threats to validity
    9.1.1 File properties considered in corpus selection
    9.1.2 Restriction of candidate files for corpus selection
    9.1.3 Hardware differences between acquisition and processing
  9.2 Future work
    9.2.1 Water Column Corpus inclusion criteria
    9.2.2 Improve performance
    9.2.3 Data generation
    9.2.4 Lossy compression
    9.2.5 Hardware differences
  9.3 Community involvement
Bibliography
A Water column data
B Generic Water column Format
  B.1 Purpose
  B.2 Structure
  B.3 Conversion
C Test bench Configuration file

Abstract

Multibeam echosounders are devices that use acoustic pulses to measure the topology of the sea floor. Modern multibeam echosounders make the raw data that is used to determine the sea floor topology available to the user. This data is referred to as water column data. The scientific community has identified many uses for water column data to improve and augment current hydrographic applications. Due to its large size compared to other hydrographic data, surveyors often choose not to store water column data. Compression of water column data has been identified as a possible solution to this problem and multiple compression methods have been proposed. As there is currently no clear definition of how to measure the performance of water column compression algorithms, it is unclear what the current state of the art is. In this work, we show that benchmarking can be used to evaluate the performance of water column compression algorithms. We present the Water Column Corpus, a methodologically selected, representative set of water column data files in the public domain to be used as the input data for the benchmark, together with a set of metrics used to measure compression algorithm performance. A test bench was developed and published in the public domain to compute the presented metrics for the files in the Water Column Corpus. Four generic compression algorithms and one water column specific compression algorithm are included in the test bench. The results clearly distinguish the different algorithms based on their performance.

Preface

This thesis describes my research project at Quality Positioning Services (QPS). QPS has been my employer since before I started the Software Engineering program at the University of Amsterdam (UvA).

Acknowledgements

First and foremost I need to thank Ana Lucia Varbanescu for her inexhaustible guidance and support throughout this whole endeavor. Thanks to Jonathan Beaudoin for always finding the time for a chat or a review and of course for coming up with the topic. Thanks to Duko van der Schaaf for supporting me when I wanted to work less to go back to school. Thanks to Duncan Mallace & Tom Weber for helping me find specific water column data. Thanks to QPS for supporting me throughout the program and with the thesis especially. But most of all thanks to Charlotte for putting up with me for two years of stress, complaints and preoccupation.

Chapter 1

Introduction

QPS B.V., hereafter referred to as 'the host organization', is a company that builds and sells hydrographic software. One of the main purposes of this software is to gather and display information on the depth and form of the seafloor. This is referred to as bathymetric data. A number of different types of devices are capable of gathering bathymetric data. The type of device most commonly used for this purpose is the multibeam echosounder (MBES). This device emits sound waves towards the seafloor and determines the location of the seafloor based on the received echo. The technique is similar to ultrasound imaging in the medical field. Many modern multibeam echosounders make the raw data that was used to compute the bathymetry available to the user. This data is referred to as water column data1.

Figure 1.1: Water column data in grey with seafloor detections in green and red

The scientific community has identified opportunities to use water column data for a number of purposes, including:
• Improved noise recognition [Cla06]
• Improved bottom detection [Cla06]
• Improved ability to estimate minimum clearance over wrecks [Cla06], [CRBW14]
• Study of marine life [CRBW14]
• Study of sub-aquatic gas vents [CRBW14]
• Study of suspended sediment [CRBW14]
These novel applications provide opportunities for the host organization to improve and expand their software suite.

A problem with water column data is its volume. With data rates of several gigabytes per hour [Bea10], storage requirements drastically increase when water column data is to be stored in addition

1 More information on water column data can be found in appendix A.

to the data more commonly stored during hydrographic acquisition. Beaudoin states that "A ten-fold increase in data storage requirements is not uncommon" [Bea10]. Moszynski et al. state that "the size of water column data can easily exceed 95% of all data collected by a multibeam system" [MCKL13]. The additional costs incurred by the high volume of water column data are prohibitive to the collection of this data [MCKL13][Bea10][APR+16]. The host organization believes that reducing the volume of water column data will make the collection of the data less prohibitive and thus allow end users to benefit from future water column data related innovations. The problem of water column data size has been noted in the scientific community. Applying compression to the data is brought forward as a possible solution to this problem. Beulens et al. [BWSP06] note that the data size is one of the challenges in water column data processing and state that "making use of a data compression scheme is the preferred approach" to solving this issue [BWSP06]. Moszynski et al. [MCKL13], Beaudoin [Bea10] and Amblas et al. [APR+16] have proposed algorithms for water column compression.

1.1 Problem definition

The field of water column compression currently consists of three publications, by Beaudoin [Bea10], Moszynski et al. [MCKL13], and Amblas et al. [APR+16]. The authors of all three publications present results that show that their proposed algorithm outperforms some general purpose compression algorithm. As all of the publications use different sets of input files and none include a direct comparison against any of the other water column specific compression algorithms, it is unclear what the current state of the art in water column compression is. Furthermore, because none of the publications discuss their motivation for the selection of the input data used in the experiment, the scope of the results is unclear. Therefore, the research question addressed in this thesis is: 'How to evaluate the performance of lossless water column data compression algorithms?'

1.2 Approach

According to Sim et al. [SEH03], the lack of emphasis on validation and comparison with other solutions observed in the water column compression field is typical for immature scientific communities. The authors state that benchmarking can have a positive effect on the maturity of a scientific community as the creation of a benchmark "requires a community to examine their understanding of the field, come to an agreement on what are the key problems, and encapsulate this knowledge in an evaluation" [SEH03, p. 1]. In this work, we present our vision for a benchmark for water column data compression based on the three components of a benchmark according to Sim et al. [SEH03]: motivation for the comparison, task sample, and performance measures. The motivation for comparison is the desire to know the current state of the art in water column compression (as discussed in section 1.1). The task sample consists of a set of input files that are selected using an empirical method based on the method for corpus creation presented by Arnold & Bell [AB97]. Performance measures are selected based on a literature review of related work, further augmented with new measures which we believe are relevant to the domain. The positive effect of benchmarking discussed by Sim et al. [SEH03] depends on a collaborative effort to create a benchmark within that community. We therefore invite members of the community to respond, contribute and collaborate on this work in order to get to a community driven benchmark for water column compression. To facilitate contributions from the scientific community, we provide not only the definition of the benchmark but also a test bench for the evaluation of the benchmark. This test bench facilitates computation of the performance measures over the task sample. This work focuses solely on the lossless compression of water column data. Although the domain may benefit from lossy compression, the lack of widely used water column processing applications makes it hard to determine what types and which amount of loss are acceptable. Possible future work on the evaluation of lossy compression algorithms is described in section 9.2.4.

1.3 Document structure

In chapter 2 we review related work and its impact on our approach. In chapter 3 we present the Water Column Corpus, a set of water column files to be used as the input for compression algorithms, selected to be representative of the real-world data water column compression algorithms may encounter. The Water Column Corpus is what Sim et al. [SEH03] refer to as the 'task sample' of the benchmark. Chapter 4 presents the metrics that are to be computed for each algorithm in the benchmark. The metrics are what Sim et al. [SEH03] refer to as the 'performance measures'. Chapter 5 presents the design of a test bench that computes the metrics specified in chapter 4 over a collection of input files. Next to the three components of a benchmark presented by Sim et al. [SEH03] (motivation for the comparison, task sample and performance measures), we present two more components that we believe to be important for a benchmark in chapter 6: a set of rules that algorithms to be included in the benchmark should adhere to, and the method of benchmark result publication. As part of this work, the design described in chapter 5 has been implemented. Chapter 7 describes how this test bench can be used. This includes running the test bench and adding a new algorithm to the set of algorithms used by the test bench. In chapter 8 the results of running the test bench are presented and analyzed. Chapter 9 contains the conclusion, threats to validity, and future work.

Chapter 2

Related Work

In this chapter, we present relevant existing work, roughly divided into sections based on three different research areas.

2.1 Water column data compression

In [MCKL13], Moszynski et al. describe a method to compress MBES data (which includes water column data) based on Huffman coding. They propose two adaptations to standard Huffman coding to improve performance:
• Create a Huffman tree once for each message type instead of for every message. The authors assume that probabilities will be the same among different messages of the same type.
• Encode water column amplitude data in its true resolution instead of the file format's resolution, which is often too high.
The average compression ratio is approximately 1:3 and outperforms the general purpose compression methods ZIP, 7-ZIP and RAR in both compression rate and compression time. There are some exceptions, specifically when the settings of the multibeam echosounder are changed during the survey. In that case, the Huffman tree is no longer optimal (since it was created for the first packet) and the compression ratio drops. The authors use two files for validation. Both originate from the same multibeam echosounder (a Reson 7125) and contain "relatively flat and homogenous bathymetry" [MCKL13, p. 81]. The results obtained this way may not be representative for other systems and other types of survey (e.g., fish schools, wrecks, or gas plumes). The authors note that their proposed solution (including file format) is specifically suited for large files as "Reading the structure of the compressed file and retrieving the information such as the total number of datagrams, original size of particular datagrams and their types is also possible without the need of decoding the whole compressed dataset." [MCKL13, p. 81]. However, decompression times, either for a single record or the complete file, are not part of the results.
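To illustrate the first adaptation, the sketch below (our own illustration, not the authors' implementation) builds a Huffman code from the first datagram of each type and reuses it for subsequent datagrams of that type; a real codec would additionally need an escape mechanism for symbols that do not occur in that first datagram.

```python
import heapq
from collections import Counter

def build_huffman_code(data: bytes) -> dict:
    """Return a symbol -> bitstring mapping built from the byte frequencies in data."""
    freq = Counter(data)
    # Heap entries: (frequency, tie breaker, {symbol: code_so_far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol in the data
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # left branch
        merged.update({s: "1" + c for s, c in c2.items()})  # right branch
        counter += 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
    return heap[0][2]

codes_per_type = {}  # datagram type -> Huffman code table, built once per type

def encode_datagram(dgram_type: int, payload: bytes) -> str:
    # The code table is created only for the first datagram of each type; the
    # assumption (as in [MCKL13]) is that symbol probabilities are stable per type.
    if dgram_type not in codes_per_type:
        codes_per_type[dgram_type] = build_huffman_code(payload)
    table = codes_per_type[dgram_type]
    # Symbols unseen in the first datagram of this type would need an escape code.
    return "".join(table[b] for b in payload)
```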

In [Bea10], the author uses the JasPer [AK00] implementation of the JPEG2000 algorithm to compress the amplitude data in water column data. Lossless encoding leads to a compression rate of approximately 1:1.5. Compression rates of up to 1:10 are attainable when lossy compression is used, while still "yielding very little loss of resolution or introduction of artifacts" [Bea10, p. 14]. The author uses data from a single type of multibeam echosounder (a Kongsberg EM3002) that was collected over a wreck. The results obtained this way may not be representative for other types of

systems or surveys (e.g., fish schools, homogeneous bathymetry, or gas plumes). The source of the data is of specific interest as the proposed algorithm only compresses amplitude samples. For Kongsberg data, this constitutes the majority of the water column data. For other data formats, however, it does not. The Reson s7k water column encoding, for instance, may contain phase data, which typically has approximately the same size as the amplitude data. The performance of the proposed algorithm on data in the Kongsberg encoding may therefore not be representative of the algorithm's performance on other file formats. The author does not include information on compression time or decompression time for the algorithm.

In [APR+16], the authors apply FAPEC, a compression algorithm initially developed for space data communications, to water column data. FAPEC "includes mechanisms for the efficient compression of sequences with repeated values" [APR+16, p. 47]. The proposed algorithm uses a pre-processing pipeline tailored to MBES water column data. The test results show a compression rate of approximately 1:1.7. The algorithm was considerably faster than any of the other general purpose compression techniques that were evaluated (Rar & 7Zip, among others), with FAPEC being more than twice as fast as the runner up. The authors use two files for validation. Both originate from the same type of multibeam echosounder (a Kongsberg EM710) and correspond to "a relatively smooth an homogeneous bathymetry" [APR+16, p. 47], but one of the files includes a wreck. The results may not be representative for other types of systems or other types of surveys (e.g., fish schools or gas plumes).

Summarizing: the results presented in these works are incomparable and may not be representative for the complete domain. Our benchmark should provide representative and comparable results for all included algorithms.

2.2 Benchmarking & test benches

In [DHL15], the authors present a benchmark framework for data compression. The intended algorithms are lightweight memory compression algorithms for databases. A benchmark is defined by the user as a combination of data generation parameters and the algorithms that should be included in the benchmark. The authors place specific emphasis on the performance of the framework. They aim to get maximum performance by reducing redundant actions in the framework's execution. The framework uses standardized interfaces (not otherwise described in the work) that algorithms should conform to for easy inclusion in the framework. As we want our test bench to have high performance, we should prevent redundant actions. Whether or not that requires an approach similar to the 'sophisticated approach' described by the authors will have to become clear during the research. The benchmark specification file offers an easy interface for the user to select which algorithms to include in a run. The use of standardized interfaces for the algorithms makes it easy to include new algorithms into the framework. Both features would be valuable in our test bench as well.

In [Swa08], the author proposes a single measure for the efficiency of data compression and storage. This measure is based on compression ratio, compression time, and decompression time. As many algorithms are asymmetric in compression and decompression performance, such a measure requires information on the expected ratio of compression and decompression actions for a single file. The single measure for the efficiency of data compression presented is a cost measure. The author defines measures for the profits of storing data and the costs of storing data. The overall efficiency measure is the ratio of these two measures. We want the benchmark for water column compression to include a single metric representative of the algorithm's performance. A cost measure would appear to be a logical choice for such a measure. For water column compression, the ratio of compression and decompression is interesting as well, both because the data may be decompressed multiple times (if processed on multiple systems), and because (part of) the compressed data may not be decompressed at all (if the data was stored only in

case it was needed). Water column compression adds some complexity compared to data storage in the sense that the user of the compression algorithm is likely to want to compress the data as it is generated. This means that not only the relation between compression time and decompression time is important, but also the relation between compression time and generation time.

In [GVI+14], the authors are faced with a problem similar to the one addressed in this work: a relatively immature field (graph processing) which does not have a clear method for performance analysis. As a step towards the goal of creating a benchmark suite, seven challenges for creating such a suite are presented:
• Evaluation process: How to design the benchmark process in such a way that a 'fair' comparison of platforms can be attained. This relates to rules and/or definitions for data format, workflows, multi-tenancy and tuning.
• Selection and design of performance metrics: Which metrics are of interest and how these can be normalized to directly compare runs on different hardware.

• Dataset selection: How to select a dataset that is representative, but also able to stress bottlenecks of the platforms.
• Algorithm coverage: How to select a representative, reduced set of graph-processing algorithms.
• Scalability: How to make the benchmark deal with platforms ranging from super-computer to small-business scale.
• Portability: How to balance the required features of the benchmark suite against the amount of work it takes to make a platform "benchmark ready".

• Reporting of results: Ideally the benchmark produces a single metric that represents the performance of the platform. The authors believe that such a metric will be hard or impossible to find, as no platform can offer the best performance over the whole dataset, even when only a single metric is evaluated.
As the fields of water column compression and graph processing differ significantly, not all challenges apply to a benchmark for water column compression. We specifically believe that algorithm coverage and scalability are not relevant in our domain. The former because the water column compression domain consists of a single type of algorithm (the compression algorithm), and the latter because we believe that there will not be much diversity in the types of platforms that will run water column compression algorithms. Portability can be an important factor in the adoption of the benchmark: if the work required for the inclusion of an algorithm into the benchmark is too large, it is likely that the benchmark will not be adopted by algorithm implementers. The authors suggest that benchmarking graph-processing algorithms requires re-implementation of the algorithm in the domain of the benchmark. For the water column data compression benchmark, we want to look into methods that do not require such re-implementations.

In [CHI+15], a benchmark is introduced for graph-processing platforms (based on the vision detailed in ”Benchmarking Graph-Processing platforms: A Vision” [GVI+14]).

The benchmark uses a choke-point based design; the problems that challenge the current technology are collected and described early in the design process [Bon14]. These choke-points are identified by experts in the community. The benchmark uses both generated and real-world input data. We believe that the field of water column compression is not mature enough for the choke-points in water column compression to be identified. Therefore a choke-point based benchmark, although an interesting concept for future work, is not feasible at this time. The combination of real-world data and generated data could be beneficial for the water column data compression benchmark. Using real-world data would increase the credibility of the benchmark. Data generation could be used to show how certain factors of the input data affect the different compression algorithms. This work unfortunately contains no reference to the 'normalized metrics' referred to in the preceding publication [GVI+14].

In [Pow01], the author describes the creation of a test bench for compression algorithms that compresses a number of widely known corpora, among which the Calgary and Canterbury corpora. The test bench is employed by the author to maintain a benchmark for generic compression algorithms. The test bench is periodically run and the results are published on a website. The downside of this method is that only algorithms that have been built for UNIX can be included in the test bench. Also, the algorithms to be included in the benchmark need to be made available to the maintainer of the benchmark. The author states that it is impossible to make sure that different compression tests are run under the same conditions due to differences in processor speed and system resources. The author claims to counter this by reporting all speed measurements relative to the speed measurement on the same file by the UNIX compress utility. It is unclear if the normalization employed by the author is meant to normalize for hardware differences (e.g. different platforms running the test bench) or differences in the state of the system between different test runs. However, we believe that reporting performance relative to the performance of another algorithm is not a good approach for either, because it assumes that both algorithms scale with system changes in the same way. This is not necessarily the case. Differences in parallelization, for instance, are likely to result in different scaling behavior as the CPU load changes between runs.

Summarizing: the primary difference between our benchmark and those presented in this section is the difference in domain. Water column data compression performance evaluation requires metrics specific to the domain, such as the ability of an algorithm to compress the data at the rate at which it is generated. Similar to the test benches described in this section, our test bench should be designed with a focus on performance, portability and ease of use.

2.3 Corpus creation

In [AB97], the authors look into the reliability of empirical evaluations for lossless compression (of text). They indicate that the 'Calgary corpus', which was commonly used for that purpose at the time, had become outdated, and voice a concern that new compression algorithms may be tailored to that corpus. The goal of a corpus of files is to facilitate a one-to-one comparison of compression methods by creating a small, publicly available sample of 'likely' files for the compression purpose. A corpus should conform to the following criteria:

• Be representative of the files that are likely to be used for the compression. • Be widely available and in the public domain. • Not be larger than necessary.

• Perceived to be valid and useful (otherwise it will not be used). This can be attained by including widely used files and publishing the inclusion procedure for the corpus.
• Actually be valid and useful. The performance of compression algorithms on files in the corpus should reflect the typical performance of the algorithms.
The authors created a new corpus for file compression using the following steps:
1. Select a large group of files that are relevant for inclusion in the corpus (all public domain).
2. Divide the files into groups according to their type (e.g. HTML, C code, English text).

3. Select a small number of representative files per group:
(a) Create a scatter plot of the file size before and after compression of the files using a number of compression algorithms.
(b) Fit a straight line to the points using ordinary regression techniques.
(c) For each group, pick the file that is closest to the fitted line.
The authors note that due to the large deviation between files, the absolute compression ratio of the corpus may not be representative, but the relative compression ratio is representative.

In [IR08], Islam and Rajon propose a corpus for evaluating compression algorithms for Bengali text. The authors claim that the necessity of such a corpus arises because results on corpora in English have little significance for text in Bengali. The corpus was created in a way similar to that used by Arnold and Bell [AB97]. Like them, the authors start out with a candidate data set that is categorized. Islam and Rajon pre-process their candidate files to remove all non-Bengali text and convert the files to Unicode encoding before compression. They use type-to-token ratio to select the files for inclusion in the corpus rather than compression ratio. They select two files for each category: the one with the lowest TTR and the one with the highest TTR. Although the authors show that there is a relation between TTR and compression ratio, they do not explain why they chose TTR over compression ratio. Most interesting is their choice to select files with the lowest and highest TTR instead of the approach used by Arnold and Bell, who use linear regression to find the most representative file in a category. Selecting the files with the lowest and highest TTR would appear to select the most unrepresentative files instead.

In his master's thesis, Řezníček evaluates existing corpora for the evaluation of compression methods and constructs a new one [Jak10]. The author believes a new corpus to be necessary to overcome a number of problems in the Calgary, Canterbury, and large Canterbury corpora, namely the lack of large files (in the Calgary and Canterbury corpora), over-representation of English text, lack of files that are a concatenation of large projects, and the lack of medical images and databases. The methodology for creating this corpus is very similar to the methodology proposed by Arnold and Bell, but includes a method to update the corpus. The method to update the corpus is of specific interest, as the rest of the methodology is practically the same as that used by Arnold & Bell [AB97]. It is very likely that a corpus of water column data will become outdated at some point, because the corpus should be representative of real-world data, and the real-world data continually changes because of advances in technology. We do see a potential problem in the method presented by Řezníček: any modification to the corpus should be versioned, but as there is no central agent responsible for versioning, nor a centralized location that contains all versions of the corpus, the updater of the corpus has no means to determine what the latest version of the corpus is. This induces the risk of multiple versions of the corpus having the same version number.

Summarizing: the methodology used for the creation of corpora is either not described or based on that presented by Arnold & Bell [AB97] with slight modifications. New corpora are proposed not because authors disagree with the methodology used for previous corpora, but because the previous corpora have become outdated.

Chapter 3

Water Column Corpus

In order for a test bench to evaluate the performance of compression algorithms, it needs data to compress. In order for the test bench to have credibility, the data should be representative of the data the algorithms encounter during normal operation. At the same time, the volume of data used by the test bench should be as low as possible to guarantee swift execution. The test bench can generate its own input data, use a set of input files, or use a combination of both. The advantage of data generation is that it provides the opportunity to change specific properties of the input data in isolation. This allows the user of the test bench to review the effects of these individual properties of the data on compression algorithm performance. Our test bench, in its first version, includes a corpus of real-world data to be used as input for the algorithms: the Water Column Corpus. A data generator for water column data does not currently exist. Based on our analyses of water column data, a generator will be developed and included in a future release of the test bench. In this chapter, we review the methods used for the creation of other corpora, present the method that we have used for the Water Column Corpus, and finally present the Water Column Corpus itself.

3.1 Existing corpora

Although we are the first to define a corpus for water column compression, a number of corpora have been defined for general purpose compression. In this section, we will look at existing corpora and the method used to create them.

3.1.1 Calgary corpus

Probably the most widely known corpus for lossless data compression is the Calgary corpus, created by Bell et al. for evaluating a number of lossless compression algorithms [BCW90]. The corpus consists of 14 files of 9 distinct types. Bell et al. provide no information on the method used for selection of the files other than that the file types are 'common on computer systems' [BCW90, p. 583]. The Calgary corpus was made publicly available and has been used frequently for lossless compression evaluation in the 1990s [Jak10, p. 44].

3.1.2 Canterbury corpus

The Canterbury corpus was published in 1997 by Arnold & Bell [AB97] as a reaction to issues in the Calgary corpus. The authors argue that the corpus had become outdated and was not constructed methodically, and indicate a concern that new compression algorithms may be tailored to the content of the corpus. The authors argue that a proper corpus should consist of a small sample of likely files. However, determining the likeliness of data is precisely the problem in compression algorithm development. Because of this, files for empirical validation are often selected haphazardly from files that are available. The authors claim that this way of working reduces repeatability and validity. They present

a methodology for creating a corpus of test data for compression methods based on the following criteria:
• The corpus should be representative of the files that are likely to be encountered in real-world applications.
• The corpus should be widely available and in the public domain.
• The corpus should not be larger than necessary.
• The corpus should be perceived to be valid and useful. This can be attained by including widely used files and publishing the inclusion procedure for the corpus.
• The corpus should actually be valid and useful. The performance of compression algorithms on files in the corpus should reflect the typical performance of the algorithms.
Based on these criteria, a method for creating a corpus is presented that consists of the following steps:
1. Select a large number of candidate files from the public domain.
2. Divide the files into groups according to their type.
3. Compress all files.

4. Use linear regression to determine the correlation between compressed and uncompressed file size.
5. Choose the file that is closest to the regression line in each category to be included in the corpus.

3.1.3 Ekushe-Khul

Islam & Rajon propose a corpus for evaluating compression algorithms for Bengali text [IR08]. The authors claim that the necessity of such a corpus arises because results on corpora in English have little significance for text in Bengali. The method used for corpus creation is similar to that used for the Canterbury corpus. Islam & Rajon make the following changes to the methodology presented by Arnold & Bell:
• Prior to compression, all files are stripped of non-Bengali text and converted to Unicode.
• File selection is based on type-to-token ratio rather than compression ratio.
• For each group, the files with the highest and lowest type-to-token ratio are selected, whereas Arnold & Bell select the file with the compression rate closest to the regression line.
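The type-to-token ratio itself is simple to compute. The following sketch is our own illustration of the statistic Islam & Rajon use for selection; the whitespace tokenization is a simplifying assumption, not their exact pre-processing:

```python
def type_to_token_ratio(text: str) -> float:
    """TTR: number of distinct tokens divided by the total number of tokens."""
    tokens = text.split()  # simplification: whitespace tokenization
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```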

3.1.4 Silesia corpus

As part of his dissertation, Sebastian Deorowicz presents the Silesia corpus [Deo03] for lossless compression. The author believes a new corpus to be necessary to overcome a number of problems in the Calgary, Canterbury, and large Canterbury corpora, namely the lack of large files (in the Calgary and Canterbury corpora), over-representation of English text, lack of files that are a concatenation of large projects, and the lack of medical images and databases. Furthermore, the author states that the Canterbury corpus is specifically faulty because it contains a file for which the "specific structure causes different algorithms to achieve strange results (...) the differences in the compression ratio on this file are large enough to dominate the overall average corpus ratio" [Deo03, p. 92]. The Silesia corpus is meant to be used in combination with the Calgary corpus. Therefore, the Silesia corpus contains files of the types missing from that corpus. The author provides no insight into the method used for selecting files to be included in the corpus.

3.1.5 Prague Corpus

In his master's thesis, Jakub Řezníček presents the Prague Corpus [Jak10]. The author includes a verbose methodology (similar to the one used for the Canterbury corpus) and instructions on how to update the corpus. The method for corpus creation contains the following steps:

• Define file types (e.g. image, database, binary) that are frequently used in practice. These will be the groups used for the corpus.
• Collect candidate files that are real (not randomly generated) and can be placed in the public domain. Use as many sources as possible to prevent similarity.
• Decompress internally compressed files (such as PDFs).
• Divide the candidate files into groups.
• Subgroup the candidate files based on file extension.
• Compress all files using several (at least three) compression programs using different algorithms.
• For each subgroup that contains at least 15 files, and optionally for each complete group:
  – Determine the correlation between uncompressed and compressed file sizes using linear regression for each compression method.
  – Select the file closest to the regression line for each compression method.
  – From the selected files, select the one with the lowest compression ratio for inclusion in the corpus.

If one wants to update the Prague corpus, the same method should be followed. The contributor may then choose to either make changes to, or completely replace the corpus. The old corpus should always be kept available. The new corpus should be documented in a report (an optional template is provided) and the new corpus should be published on the Internet.

3.2 Corpus creation method

In this section, we present a method for the creation of a corpus of water column data. The method is based on that proposed by Arnold & Bell for the Canterbury corpus [AB97]. Although a number of new corpora have been suggested to replace the Calgary corpus, most of them still use the method presented by Arnold & Bell (with slight adaptations) for constructing the new corpus. Those that do not, do not report any method at all for selecting files. The only remark we have found that could be considered criticism of the method presented by Arnold & Bell is that of Deorowicz on the inclusion of the XML file that has such different compression characteristics across different compression algorithms that it dominates the overall corpus ratio [Deo03, p. 92]. Deorowicz, however, provides no evidence for this statement and we have not been able to find similar statements in literature or a web search. Like the authors of other corpora, we have adapted the method proposed by Arnold & Bell for our purposes. The following adaptations have been made to the original method:
• As there is little water column data in the public domain, it is not feasible to collect a sufficiently large sample of files that is in the public domain. Instead, we select candidate files regardless of whether they are in the public domain. Once a file has been selected for inclusion in the corpus, we request permission from the owner to put the file in the public domain. If the owner does not grant permission, the file is removed from the candidate files and a new file is selected.
• Although Arnold & Bell describe the selection process of representative files for the categories in detail, they provide little insight into the categorization of candidate files. We believe that the chosen categories serve two purposes: stating which types of files are important in the field

and reducing the search scope for candidate files. As we start with a limited set of data, there is no need to limit our search scope. The purpose of categorization in our case is thus only to state which types of water column data are important in the domain. We take a number of categories from those used by the "Shallow survey conference": an important conference in the domain at which multibeam echosounder manufacturers are asked to gather data over various types of objects. From this data the categories survey, water seep, wreck, and target have been taken. We have added two categories that we believe to be important in the domain: fish schools and gas vents. Figure 3.1 provides example data for each of the included categories.
• Files that contain water column data often also contain other types of maritime data. Also, one format supports data from multiple devices in one file whereas the other formats do not. In order to get comparable files, the files need to be processed to remove all non-water column data and to split files that contain data from different devices.
Because the current state of the art in water column compression cannot be identified (which is part of the problem we are addressing with the test bench proposed in this thesis), we cannot use the state of the art water column compression algorithm to determine the compression ratio for the files. Instead, we use the Lempel-Ziv-Markov chain algorithm (LZMA), as this algorithm (used by the 7-Zip archiver and often referred to simply as '7-Zip') is used by all authors of water column compression algorithms to compare their algorithm's performance against, and either has the best compression ratio of all the algorithms compared against [MCKL13], [APR+16], or is the only algorithm the author compares against [Bea10]. After these adaptations, our process for creating a corpus of water column data comprises the following steps:
1. Gather a set of files that contain water column data (either proprietary or in the public domain).
2. Process the files to remove all non-water column data and split files when needed.
3. Divide the files into groups according to their category.
4. Compress all files using the LZMA algorithm.
5. Use linear regression to determine the correlation between compressed and uncompressed file size for each group.
6. Choose the file that is closest to the regression line for each group.
7. Ask the owner of the file for permission to put the file in the public domain.
8. If permission is granted, the file is selected for inclusion in the corpus. If not, the file is removed from the set of candidate files and the process is repeated from step 4.
We strongly agree with Řezníček [Jak10] that it is very likely for any corpus to become outdated, and as such it requires a method for updating and/or replacing the current version. We propose a high-level approach so as not to restrict the solution space for any future iteration: the Water Column Corpus should always have a maintainer. The maintainer is responsible for the publication of the Water Column Corpus. The publication includes versioning of the corpus. Publications should include the methodology used for corpus selection (or a reference to a previous version of the corpus if the same methodology is used) and the corpus itself. All versions of the Water Column Corpus should be published at a single location.
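Steps 5 and 6 of this process can be sketched as follows. This is our own illustration (the tuple layout and helper name are hypothetical), using ordinary least squares to fit compressed size against original size and selecting, per group, the file closest to the fitted line:

```python
import numpy as np

def select_representative(files):
    """files: list of (name, original_size, compressed_size) tuples for one group."""
    x = np.array([f[1] for f in files], dtype=float)  # original sizes
    y = np.array([f[2] for f in files], dtype=float)  # LZMA-compressed sizes
    slope, intercept = np.polyfit(x, y, 1)            # ordinary linear regression
    residuals = np.abs(y - (slope * x + intercept))   # vertical distance to the line
    return files[int(np.argmin(residuals))][0]        # file closest to the fitted line
```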

3.3 Results

Files containing water column data were gathered from the host organization’s intranet, a number of industry specialists, and the scientific community. After processing, this yielded a set of 1066 candidate files with a total size of just over 600GB.

Figure 3.1: Water column data categories included in the corpus: (a) Survey, (b) Gas vent, (c) Fish, (d) Target, (e) Water seep, (f) Wreck

The data was grouped based on meta-data when available. If this was not possible, we used visual inspection to determine the category of the data. As this is a slow process, the timeline of the project did not allow us to classify all of the candidate files. For maximum efficiency, we focused on the categorization of the larger sets of files. 768 out of 1066 files were categorized. An overview of the categories and the number of candidate files assigned to each can be found in table 3.1.

Category     N     Water column contains
Fish         8     A school of fish
Vent         15    A stream of bubbles
Survey       415   Little else than the return from the sea floor
Target       142   A small (<2 m) object
Wreck        86    A sunken object (e.g., the wreck of a ship)
Water seep   92    A seep of fresh water into salt water
Total        768

Table 3.1: Categorized candidate files

All candidate files were compressed using LZMA. We made use of the open source LZMA SDK1 and its python binding pylzma2. All files were compressed using the default settings of the algorithm. For each group, a scatter plot was created of the original file size versus the compressed file size. Using linear regression, a line was fitted to the data to determine the representative compression rate for the group. The results can be found in figure 3.2, which shows distinct groups with a linear relation for all categories except the wreck category, which appears to contain two groups. Closer inspection showed that a set of files had been included that did contain wreck data, but for which the surveyed area was relatively large compared to the wreck. As a result, only a small part of the water column data actually contained a wreck. Therefore, the data was not representative of the category and the files were removed from the candidate files. Table 3.2 shows the results of the categorization effort. Correlation and deviation numbers are very similar to those presented by Arnold and Bell. Table 3.3 shows the files that were selected to be included in the corpus. We obtained permission to put all these files in the public domain, so there was no need to reiterate the process for any of the categories.

1 http://www.7-zip.org/sdk.html
2 https://pypi.python.org/pypi/pylzma

Figure 3.2: Original vs compressed size of 768 categorized candidate files
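For reference, the per-file measurement described above amounts to something like the following sketch, assuming the pylzma binding's compress() with its default settings; the helper name and input handling are illustrative, not the test bench's actual code:

```python
import pylzma  # python binding of the LZMA SDK referenced above

def compressed_sizes(paths):
    """Return (original_size, compressed_size) per file, using pylzma's defaults."""
    sizes = []
    for path in paths:
        with open(path, "rb") as fh:
            data = fh.read()
        sizes.append((len(data), len(pylzma.compress(data))))
    return sizes
```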

Bin          N     Average compression rate   Standard deviation   Correlation   Best match ratio
Fish         8     0.7223                     0.01                 0.9991        0.7195
Survey       415   0.5479                     0.11                 0.9812        0.5477
Target       142   0.6546                     0.02                 0.9966        0.6541
Vent         15    0.4774                     0.10                 0.9947        0.4833
Water seep   92    0.2621                     0.05                 0.993         0.2622
Wreck        58    0.5041                     0.04                 0.9949        0.5053

Table 3.2: Categorized compression properties

Category     Best match                                      Size (MB)
Fish         20060922 232920 7125 (400kHz) filtered.s7k      4293.3
Survey       0003 20140910 094146 True North.wcd             1241.4
Target       20140724 105715 filtered.s7k                    1023.2
Vent         0006 20090402 210804 ShipName.wcd               363.2
Water seep   0046 20110309 215521 IKATERE.wcd                551.7
Wreck        0012 - Bathy plus WCD 0 from qinsy db.wcd       40.4

Table 3.3: Selected files for the corpus

To validate the corpus, we reviewed the average compression ratios of different compression methods

on both the complete set of candidate files and the corpus. The results are shown in table 3.4. There is some deviation in the average compression ratios between the candidate files and the corpus. This is similar to the results obtained by Arnold and Bell [AB97, p. 207] and likely due to the relatively high deviation in compression ratios for files in a group (as shown in table 3.2). This means that our corpus, like the Canterbury corpus, is unreliable for absolute performance estimates. Fortunately, like the Canterbury corpus, the relative compression ratios are very similar: ZIP has a compression ratio that is 118.8% of the compression ratio of LZMA for the complete set of candidate files, and 118.6% for the corpus. Similarly, the compression ratio of BZ2 is 102.5% of that of LZMA for the complete set of candidate files and 102.4% for the corpus. Therefore the corpus is reliable for relative compression performance estimation.

Algorithm   Average compression ratio (candidate files)   Average compression ratio (Water Column Corpus)
LZMA        0.5113                                        0.5287
ZIP         0.6078                                        0.6268
BZ2         0.5241                                        0.5413

Table 3.4: Average compression ratios of candidate files and corpus
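These relative figures follow directly from table 3.4; for example, on the corpus: 0.6268 / 0.5287 ≈ 1.186 for ZIP relative to LZMA and 0.5413 / 0.5287 ≈ 1.024 for BZ2 relative to LZMA, matching the 118.6% and 102.4% quoted above.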

Chapter 4

Metrics

In this chapter, we present the metrics that will be reported by the test bench. The test bench will include both generic compression metrics, which are presented in the first section, and water column data specific metrics, which are presented in the subsequent sections. Specifics on the implementation of metric computation in the test bench can be found in chapter 5.

4.1 Generic compression metrics

We studied publications on water column compression, generic compression evaluation, and recent publications on lossless compression algorithms to see which metrics are reported. The reported metrics were: compression ratio (by all publications), and compression time and/or decompression time (by most). All of these metrics will be included in the test bench:

CR = Sc / So

where
CR is the compression ratio.
Sc is the size of the compressed data.
So is the size of the original data.

Tc The time required to compress the data.

Td The time required to decompress the compressed data.

4.2 Real-time compression

An important feature of a compression algorithm for water column data is its ability to compress data at the rate it is generated by the multibeam echosounder during data acquisition. Inability to compress the data 'on the fly' means that the data will have to be stored uncompressed first and compressed later. This separate compression step takes up time and thus induces cost. The ability of an algorithm to perform real-time compression depends not only on the performance of the algorithm, but also on the environment in which it is executed. As compression will be one of many tasks executed by hydrographic acquisition software in parallel, the algorithm will have access to limited resources. The test bench includes the 'real-time compression' metric: the ratio of the average water column record generation interval to the average water column record compression time:

RTC = Trga / Trca

where
Trga is the average water column record generation interval.
Trca is the average water column record compression time.

The average water column generation interval can be computed from the generation time included in the water column record:

Trga = (Twcl − Twcf) / (N − 1)

where
Trga is the average water column generation interval.
Twcf is the generation time of the first water column record.
Twcl is the generation time of the last water column record.
N is the number of water column records.

The average water column record compression time depends on the performance of the compression algorithm in the real-time environment:

Trca = Tc(e) / N

where
Trca is the average water column record compression time.
N is the number of water column records in the file.
Tc(e) is the compression time of the file in computation environment e.

The computation environment is defined by two user parameters: the memory and the CPU capacity available to the algorithm during compression. More details on how the test bench emulates the real-time environment can be found in section 5.6.2. RTC has a range of (0, ∞), where:
• RTC = 1 indicates that on average the compression of water column records requires the same time as it takes for the hardware to generate them.
• RTC < 1 indicates that the compression of water column records on average takes more time than it takes for the hardware to generate them; real-time compression is not possible.
• RTC > 1 indicates that on average water column records can be compressed in less time than it takes for the hardware to generate them; real-time compression is possible.
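A minimal sketch of how RTC can be computed from the quantities defined above (our own illustration; the timestamps and the measured compression time are assumed to come from the test bench's emulated acquisition environment, see section 5.6.2):

```python
def real_time_compression(record_times, compression_time_s):
    """record_times: generation timestamps (seconds) of the water column records;
    compression_time_s: Tc(e), the file compression time in the emulated environment."""
    n = len(record_times)
    t_rga = (record_times[-1] - record_times[0]) / (n - 1)  # average generation interval
    t_rca = compression_time_s / n                          # average per-record compression time
    return t_rga / t_rca                                     # RTC > 1: real-time compression is possible

# Example: 500 records generated over 100 s and compressed in 80 s
# gives RTC = (100 / 499) / (80 / 500) ≈ 1.25.
```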

4.3 Processing

After data has been stored during acquisition, it is transferred to another department where it is processed. Using water column compression adds overhead to the processing phase as data needs to be decompressed prior to its processing. Although there is a theoretical advantage of decreased

disk space requirements due to compression, we believe that these cost savings are negligible when compared to the overhead in processing time. The processing metric is included to indicate the overhead induced in processing due to the application of water column data compression. It is defined as the product of the decompression time required for processing and the number of times the file will be decompressed in the processing phase:

P = Tdp ∗ Ndp

where Tdp is the decompression time for processing. Ndp is the number of decompressions.

The number of decompressions Ndp on a single workstation can be reduced by storing the decompressed data in between different processing sessions. As it is common for hydrographic processing software to offer up disk space in favor of processing speed, we assume that processing software will follow this same trend for water column decompression and thus that the number of processing sessions on a single machine has no influence on the number of decompressions. We have observed that support for concurrent data access is uncommon in hydrographic processing software, while it is common for multiple people to work on the same data. This means that the data is decompressed on each workstation. Ndp is therefore normally equal to the number of workstations working on processing the same data. The decompression time for processing, Tdp, depends on the performance of the compression algorithm and on the number of records that are required to be decompressed for processing. The latter depends strongly on the reason for collecting water column data. If water column data is recorded with the intention of improving bathymetric data, it is likely to be stored for the entire survey, but only used in locations where bathymetric data appears to be faulty. In other words, only a small part of the stored water column data will need to be processed (and thus decompressed). Conversely, if water column data is gathered with the intention of investigating a specific feature (e.g. a wreck or marine life), it is likely that water column data will only be stored at the location of that feature. In this situation, it is likely that all the stored water column records will be processed (and thus decompressed). We assume that the processing software selects the best decompression method for the data: full file decompression if all records are required during processing, or individual record decompression if only a limited subset of records is required for processing. The decompression time is therefore defined as:

Tdp = min(Td, Tdpartial)

where Tdp is the time required for decompression in processing. Td is the time required for full file decompression.

Tdpartial is the time required for partial decompression of the required water column records

The partial decompression time depends on the ratio of water column records the user expects to decompress during processing:

23 Tdpartial = N ∗ Rp ∗ Tdra

where

Tdpartial is the partial decompression time.
N is the number of water column records.
Rp is the ratio of the water column records that the user expects to process to the number of water column records in the file (a user parameter).

Tdra is the average random access decompression time per record.

The average random access decompression time is calculated by the test bench by randomly selecting a number of records for decompression and averaging their decompression times.
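The processing metric can then be computed as in the following sketch (our own illustration; the argument names are hypothetical):

```python
def processing_metric(t_d, t_dra, n, r_p, n_dp):
    """t_d: full file decompression time; t_dra: average random access decompression
    time per record; n: number of records; r_p: processing ratio (user parameter);
    n_dp: number of decompressions (user parameter)."""
    t_d_partial = n * r_p * t_dra       # Tdpartial: decompress only the required records
    t_dp = min(t_d, t_d_partial)         # assume the processing software picks the cheaper option
    return t_dp * n_dp                   # P = Tdp * Ndp
```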

4.4 Cost reduction

As discussed in section 1.1, the cost incurred by handling and storing water column data is prohibitive to its collection. This cost includes one-time costs such as the cost of transfer of the data from the survey vessel to the processing station on shore over expensive satellite links and acquisition of storage space, as well as continuous costs such as maintenance on the storage space for the data. We refer to the sum of these costs as the total cost of data ownership. Employing water column compression will decrease the volume of data and thus the total cost of ownership of the data, but it also incurs cost because of time lost on compression and decompression. The cost metric shows the reduction in cost realized by the application of a specific compression algorithm:

C = Rco − Cc − Cd

where Rco is the reduction in cost of ownership. Cc is the cost incurred by compression. Cd is the cost incurred by decompression.

The reduction of the cost of ownership depends on the data reduction realized by the compression algorithm and the cost of ownership of data. The latter is a user parameter:

Rco = Rd ∗ Co

where Rco is the reduction in cost of ownership. Rd is the data reduction resulting from the application of the compression algorithm. Co is the cost of ownership of data.

The data reduction is computed from the compression ratio and the file size:

24 Rd = So ∗ (1 − CR)

where Rd is data reduction resulting from the application of the compression algorithm. So is the size of the original, uncompressed file. CR is the compression ratio for the file.

The cost incurred by compression depends on the ability of the algorithm to perform real-time compression. If the data can be compressed in real-time, no additional cost is incurred by compression. If the algorithm cannot process the data in real-time, we assume that the data is compressed after acquisition. Compression is likely to occur on the ship (to reduce the amount of data that needs to be sent to shore) and thus takes up time that could otherwise have been spent on acquisition. We refer to this time as ship time. The equation for the cost of compression is thus:

Cc = Tc ∗ Cst if RTC < 1, otherwise Cc = 0

where Cc is the compression cost. Tc is the compression time for the complete file. RTC is the real-time compression metric. Cst is the cost of ship time (a user parameter).

The cost of decompression for processing is the product of the time spent on decompressing the data we need for processing (i.e. the processing metric presented in section 4.3) and the cost of processing time:

Cd = P ∗ Cpt

where P is the processing metric. Cpt is the cost of processing time (a user parameter).
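The following minimal Python sketch combines the cost equations above into a single computation; the function name, argument names, and units are illustrative assumptions rather than test bench code.

def cost_reduction(s_o, cr, c_o, t_c, rtc, c_st, p, c_pt):
    # s_o  : So, size of the original, uncompressed file (e.g. in MB)
    # cr   : CR, compression ratio (compressed size / original size)
    # c_o  : Co, cost of ownership of data per unit of size (user parameter)
    # t_c  : Tc, compression time for the complete file (hours)
    # rtc  : RTC, the real-time compression metric
    # c_st : Cst, cost of ship time per hour (user parameter)
    # p    : P, the processing metric (hours)
    # c_pt : Cpt, cost of processing time per hour (user parameter)
    r_d = s_o * (1 - cr)                  # Rd = So * (1 - CR)
    r_co = r_d * c_o                      # Rco = Rd * Co
    c_c = t_c * c_st if rtc < 1 else 0.0  # Cc: only incurred without real-time compression
    c_d = p * c_pt                        # Cd = P * Cpt
    return r_co - c_c - c_d               # C = Rco - Cc - Cd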

4.5 Overview

In this chapter, we proposed the following metrics to be included in the test bench:

Metric                  Symbol  Range      Better when
Compression ratio       CR      (0, ∞)     Lower
Compression time        Tc      (0, ∞)     Lower
Decompression time      Td      (0, ∞)     Lower
Real-time compression   RTC     (0, ∞)     Higher
Processing overhead     P       (0, ∞)     Lower
Cost reduction          C       (−∞, ∞)    Higher

Table 4.1: Metrics

In order to compute these metrics, the user must supply the following parameters to the test bench:
• Memory available to the algorithm during acquisition.
• Processor capacity (as a percentage) available to the algorithm during acquisition.
• The number of times a file is decompressed for processing.
• Processing ratio.
• The cost of processing time.
• The cost of ship time.
• The cost of ownership of data.
Additional information on how to determine the right values for these parameters can be found in table 7.1 in section 7.3.

Chapter 5

Test bench

In this chapter, we present a test bench for water column compression algorithms based on the input files and metrics presented in chapters 3 and 4, respectively. The purpose of the test bench is both to facilitate the computation of metrics for the benchmark and to enforce that all benchmark results are computed in the same manner.

5.1 Requirements

The requirements for the test bench are based on the requirements for benchmarks presented by Sim et al. [SEH03]. As this work is meant as a first step towards a benchmark for water column compression and we expect it to change based on input from the (scientific) community, we have added changeability as a requirement.
• Accessibility: The test bench (including the set of input files) should be in the public domain.
• Affordability: The work to make an algorithm available to the test bench should not take more than a day. A complete test run should not take more than a day.
• Relevance: The results of the test bench should be representative of real-world applications.
• Solvability: The test bench should not use (artificial) input data that is incompressible.
• Portability: The test bench should be able to handle algorithms written in many programming languages.
• Scalability: The test bench should support algorithms at different levels of maturity.
• Changeability: The test bench should support changing the set of input files and the set of metrics.

5.2 Structural design

This section describes the structural design of the test bench based on the requirements presented in the previous section. For portability, the test bench is implemented in Python, which means it runs, without requiring compilation, on any platform for which a Python interpreter is available. Figure 5.1 shows the components of the test bench and their relation. Each component will be discussed in detail in the following sections.

Figure 5.1: structural test bench design

5.2.1 Test execution logic
The test execution logic is the code that runs when the test bench is executed. It governs the program flow of the test bench. Detailed information on the program flow can be found in section 5.6.1. More details on the way metrics are computed can be found in section 5.6.2.

5.2.2 Input files
The Water Column Corpus presented in chapter 3 is the default set of input files. However, the test bench can be configured to use any set of input files by editing the configuration file.

Input file format
Water column data files come in many different encodings, as it is common for MBES manufacturers to use their own, proprietary encoding. If the test bench supported files in their native format, the responsibility of decoding the data would lie with the compression algorithm. We see two problems with this approach. First, it makes the test bench hard to extend with new file formats, as that would require an update of all the compression algorithms. Second, there is a risk of algorithms supporting only a single encoding1. The inability of compression algorithms to compress the same files would make the test bench useless for comparing algorithm performance. In order to overcome these problems, we present a generalized file format in which the test bench's default input files are encoded. The goals for this format are:
• Be generic, so that the data from any vendor encoding can be converted to this format.
• Be as close as possible to the original file in its content.

• Contain all the information essential to water column data.
• Contain all the data needed by the test bench.

1This may already be the case for current publications on water column compression: Moszynski et al. use an application that reads files in the ’s7k’ format [MCKL13, p. 81]. Amblas et al. [APR+16] & Beaudoin [Bea10] do not specifically mention which formats are supported, but both only include data from Kongsberg systems in their input data sets.

Based on these goals, we defined the generic water-column format (GWF). Details of the file format can be found in appendix B. In order for the data to be as generic as possible, we include as few fields as possible while still making sure that all the data needed for compression and the test bench is included. For each water column record we include the fields identifier, generation time, and number of beams, and for each beam we include the amplitude samples and phase samples. In order to retain as much similarity with the original file as possible, we do not perform conversion on the sample data (which constitutes the majority, >90%, of the data). Instead, the file contains a field indicating the format in which the samples are stored. Furthermore, all data that does not fit in one of the defined fields is stored as 'generic data'. Both the water column record header and each individual beam contain a field in which generic data can be stored. Because of these precautions, conversion from a vendor specific format to the generic water-column format can be achieved largely by reordering the data in the original file. More details and examples of such a conversion can be found in appendix B. We do not expect that there is a correlation between the compressibility of a file and its structure. Therefore, the performance of a water column data compression algorithm should not differ when processing an input file in the GWF file format compared to processing the same data in its original format. However, we have not conducted extensive research to support this claim, and users of the test bench may be concerned that results obtained on files in the GWF file format are not representative of an algorithm's real-life performance. Therefore, the test bench supports input files both in the GWF file format and in all the original file formats of the Water Column Corpus (.all, .wcd and .s7k).

5.3 Compression algorithm

The compression algorithm is the algorithm for which we want to evaluate the performance. In order for the algorithm to be used by the test bench, it should be made available as a Python module that supports the interface shown in figure 5.2. Compression algorithms are often written in C ([APR+16], [Bea10]) or C++ ([MCKL13]) for speed. Algorithms written in these programming languages can easily be made available as a Python module by compiling them as shared libraries and making the libraries available in Python using the ctypes module2. There are many other methods to make code written in other languages available to the test bench3.

Figure 5.2: algorithm interface (operations: Init(parameters : string) : int; Compress(inputPath : string, outputPath : string) : int; Decompress(inputPath : string, outputPath : string) : int; Decompress(inputPath : string, outputPath : string, recordID) : int)

The init operation is used to set algorithm parameters that are used in (de)compression. The parameters are specific to the algorithm that is used. For example, generic compression algorithms often include a parameter to favor either speed or compression rate. The test bench passes parameters in JSON4 notation as they have been configured in the configuration file of the test bench. As the parameters are specific to the algorithm, all parameters and their possible values should be supplied by the algorithm manufacturer. The compress and decompress operations both take the path of an input file (the original file for the compress function, the compressed file for the decompress function) and the path to an output file. The algorithm is responsible for creating the output file at the specified location and writing the (un)compressed data to that file. The interface specifies an operation to decompress the entire file and an operation to decompress a single record from the input file. All operations should return an integer which is negative in case of failure (algorithm specific return values may be used) and zero in case of success.

2https://docs.python.org/3/library/ctypes.html 3https://wiki.python.org/moin/IntegratingPythonWithOtherLanguages 4http://www.json.org

5.4 Compressed file / decompressed file

The compressed and decompressed files are transient files that are used to store the output of the compress and decompress functions, respectively. The algorithm under test is responsible for the creation of the files and the test bench is responsible for the removal of the files when they are no longer necessary. The files are used by the test bench for the calculation of the compression ratio metric.

5.5 Results

The results of the test bench are stored in an SQLite database. The results consist of all the metrics computed during the test run. The database has a simple layout with columns for the input file name, the name of the algorithm, and one column for each computed metric. Refer to figure 7.2 for an example.
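As an illustration, the results database can be inspected with Python's built-in sqlite3 module. The table and column names in this sketch ('results', 'input_file', 'algorithm', 'compression_ratio') are assumptions for the sake of the example; check the database produced by your version of the test bench for the actual schema.

import sqlite3

# Open the database produced by a test run and print the compression ratio per
# algorithm for one corpus file.
with sqlite3.connect('results.sqlite') as connection:
    rows = connection.execute(
        "SELECT algorithm, compression_ratio FROM results WHERE input_file = ?",
        ('Wreck',))
    for algorithm, ratio in rows:
        print(f'{algorithm}: {ratio:.2f}')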

5.5.1 Configuration file
The configuration file contains the configuration for the test bench. This includes:
• The paths to all the input files.
• The path to the Python module(s) that contain the compression algorithm(s).
• The compression algorithm parameters.
• The metric parameters.
The configuration file uses the JSON format for readability. An example of a configuration file can be found in appendix C. Section 7.3 provides more details on the adaptation of the configuration file.

5.6 Behavioral design

In this section we present the behavioral design of the test bench. Section 5.6.1 describes the program flow of a test run. Section 5.6.2 describes how the metrics proposed in chapter 4 are computed.

5.6.1 Test execution
A run of the test bench computes the metrics proposed in chapter 4 for one or more compression algorithms using one or more input files. Both the algorithms and the input files to use are specified by the user in the configuration file. Figure 5.3 shows the execution flow of a test run of the test bench as a flow chart. The gray boxes in the flow chart's processes indicate (parts of) metrics that are calculated in that step. The test bench iterates over all the input files. For each file, the test bench starts by reading metadata from the file. This metadata is necessary for the calculation of metrics later on (the file size, the number of records in the file, and the timespan of the file) and for determining which records will be used for random access decompression. Once the metadata for the file has been read, the following steps are executed for each algorithm:

1. The algorithm is instructed to compress and subsequently decompress the input file, yielding measurements for the compression ratio, compression time, and decompression time metrics.
2. The algorithm is instructed to decompress a single record from the compressed file for each record that was selected for random access decompression.

3. The test bench emulates the real time scenario in which limited resources are available to the algorithm (this process is described in detail in section 5.6.2) and subsequently instructs the algorithm to compress the input file again to compute the average record compression time in a low resource environment. After the compression step completes, all resources allocated for the real time scenario are released.

When these actions have been completed for all input files and all algorithms, the metrics that require user parameters are computed from the previously calculated values and the user defined parameters. Calculating these metrics outside of the file and algorithm loops allows for fast re-computation of these metrics from previous results when only the parameters have been changed.

5.6.2 Metrics computation
Chapter 4 presents the metrics that are computed by the test bench, but does not address the question of how these metrics should be computed. The computation of the metrics CR, P and C is obvious from their definition. The computation of the metrics Tc, Td and RTC, however, is less obvious. In this section, we discuss the computation of these metrics.

Time measurement
Except for the compression ratio, the computation of all metrics depends on time measurement. It is important to discern which time we want to measure for each metric. Compression time (Tc) and decompression time (Td) are metrics that indicate the performance of the algorithm. Therefore we want to exclude anything that is not related to algorithm performance. Examples of such activities are the operating system executing other tasks, and the time spent waiting on disk access to complete. For the real-time compression (RTC) metric, on the other hand, we explicitly want to include the influence of other processes that claim CPU time in parallel to the execution of the compression algorithm. The test bench is written in the Python programming language, which offers a number of different methods to get timing information. We have tested these methods to find the right method to use for both scenarios. The most important observation we made is that not every method behaves according to the documentation. Specifically, we found a number of methods for which the results, contrary to the documentation, differ significantly when executed on different operating systems.

Figure 5.3: test bench execution flow (flow chart: for each input file, read the file metadata and select water column record IDs for random access decompression; for each algorithm, compress, decompress, random access decompress, then compress again while emulating the acquisition environment; finally, compute the parameterized metrics RTC, P, and C)

Based on the results of our investigation, process_time from the time module is used for measurements where time spent on other processes and I/O should be ignored, and perf_counter from the time module is used when wall clock time should be measured. For optimal resolution, we advise running the test bench on Linux.
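As an illustration, the following sketch shows how the two timing functions could be combined around a single compression call; the helper is not part of the test bench, and the callable passed in stands for an algorithm's compress function.

import time

def timed_call(compress, input_path, output_path):
    # process_time excludes time spent on other processes and on I/O waits and is
    # suited to the compression/decompression time metrics; perf_counter measures
    # wall clock time and is suited to measurements where the influence of other
    # processes should be included.
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    compress(input_path, output_path)
    return time.process_time() - cpu_start, time.perf_counter() - wall_start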

Real time compression
The purpose of acquisition environment emulation is to compute a metric that is representative of the performance of the compression algorithm during real-time data acquisition (refer to section 4.2 for more information on the metric). The test bench controls two variables based on parameters supplied by the user: the CPU load and the occupied memory. Experimentation showed that exact replication of the processor load using a continuous spin-sleep cycle on all cores of the CPU did not yield satisfactory results unless specifically tailored to the OS running the test bench. Using a number of threads that continuously spin (and thus take up the full capacity of a logical core) proved to be more stable and portable. The disadvantage of this method is that we have only two options per logical core of a CPU: apply 100% load to the core or apply no load to the core. This means we can control the CPU load with a granularity of 100 / (number of logical cores) percent. For instance, on a machine with eight logical cores, we can introduce loads of 0%, 12.5%, 25%, 37.5%, 50%, 62.5%, 75%, 87.5% and 100%. In order to get a representative real-time compression metric, the compression time (Tc(e)) is calculated twice, using the two CPU loads closest to the required CPU load. For example, if the user supplied a CPU load of 67% and the test bench is running on a system with eight logical cores, the compression time will be measured once with a load of 62.5% and once with a load of 75%. Linear interpolation is then used to get the compression time at the CPU load configured by the user.
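The following Python sketch illustrates the idea of occupying whole logical cores with busy-spinning workers and interpolating between the two measured compression times. It is a simplified illustration under our own assumptions, not the test bench's actual implementation; for instance, it uses multiprocessing rather than threads so that each worker can occupy its own core.

import multiprocessing

def _spin():
    # Busy wait: keeps one logical core at (close to) 100% load until terminated.
    while True:
        pass

class CoreLoad:
    """Occupy a number of logical cores with busy-spinning worker processes."""

    def __init__(self, cores):
        self._workers = [multiprocessing.Process(target=_spin) for _ in range(cores)]

    def __enter__(self):
        for worker in self._workers:
            worker.start()
        return self

    def __exit__(self, *exc):
        for worker in self._workers:
            worker.terminate()
        for worker in self._workers:
            worker.join()

def interpolate(lower_load, time_at_lower, higher_load, time_at_higher, requested_load):
    # Linear interpolation of the compression time between the two measured CPU loads.
    fraction = (requested_load - lower_load) / (higher_load - lower_load)
    return time_at_lower + fraction * (time_at_higher - time_at_lower)

# Example: on an eight-core machine, a requested load of 67% would be bracketed by
# measurements taken inside `with CoreLoad(cores=5):` (62.5%) and `with CoreLoad(cores=6):`
# (75%). On Windows, code using multiprocessing must be guarded by `if __name__ == '__main__':`.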

Chapter 6

Benchmark

The test bench presented in this thesis can be used as a means to collect metrics for lossless water column compression methods. As discussed in the introduction, our goal is for these metrics to be used in a benchmark. In this chapter, we present our vision on how such a benchmark should be organized. The first section deals with the methods for the publication of results. The second section deals with the rules algorithm implementers should conform to in order for their algorithm to be included in the benchmark.

6.1 Publication

The results of a benchmark should be easily accessible to anyone interested. Publication on a webpage is a common means to accomplish this. There are two main methods that are used for the publication of benchmark results. The first approach is to have a centralized agent that periodically computes the metrics used in the benchmark for all programs included in the benchmark. The metric computations for all of the programs are run on the same hardware in the same state to obtain results that are directly comparable. This approach requires that programs that are to be included in the benchmark are made available to the maintainer of the benchmark in some form. An example of this approach is the Canterbury corpus benchmark published on http://corpus.canterbury.ac.nz as described by Powell [Pow01]. In the other approach, the calculation of metrics is not performed by the benchmark maintainer, but by individuals who wish to include a program in the benchmark. The acquired results are sent to the benchmark maintainer with the request to add these to the benchmark. Benchmarks using this approach usually supplement the results with a description of the hardware used to obtain them. An example of this approach is the Large Text Compression Benchmark hosted at http://mattmahoney.net/dc/text.html. Because the scientific community for water column compression is relatively immature, we believe that the best procedure for a benchmark in the domain is the one that is least restrictive both to contributors and maintainers. For that reason we propose to use the second approach for the benchmark. This approach does not require the maintainer to periodically compute results and it does not require contributors to make their algorithm available to the maintainer. The downside of the approach is that the results are not directly comparable, as they are likely to be obtained on different hardware. However, we expect that the diversity in hardware will be low. Surveying and processing software are typically run on standard personal computers. We do not expect results obtained on hardware configurations such as embedded systems or supercomputers to be submitted to the benchmark. A submission for inclusion in the benchmark should contain the following items:
• The metrics computed for each file in the Water Column Corpus.
• The settings used for the computation of these metrics (e.g. the test bench configuration file).
• A description of the hardware used to compute the metrics: CPU type, amount of memory, and operating system.

Benchmark results may be validated by the benchmark maintainer. The maintainer may decide (but is not required) to do so at the request for inclusion, or can be requested to do so by any party concerned about the validity of a result in the benchmark. The availability rule presented in the following section requires that any algorithm included in the benchmark is freely available for reproduction of results. If the benchmark maintainer is not convinced of the validity of the submission, (s)he may choose to decline the request for inclusion, or remove the result from the benchmark. The submitter of the results will be informed of this step and the reasons for it.

6.2 Rules

Algorithms that are to be included in the benchmark have to abide by a set of rules. It is specifically not our intention to formulate a set of rules to prevent 'fraudulent' behavior. We believe the inappropriateness of fraudulent behavior (such as solutions that store the data to be compressed in any location other than the compressed file, as described for instance by Arnold & Bell [AB97, p203]) to be self-evident. Our intention for this set of rules is to address situations that are less evident, and to make sure that all algorithms handle them in the same way. Specifically, these rules serve two primary purposes: to facilitate the reproduction of published results, and to make sure that metrics computed on different algorithms are comparable. Reproduction of published results is important as we have chosen a publication method that allows anyone to add results to the benchmark. The scientific community needs to be able to reproduce any noteworthy result to determine its validity. Being able to compare the metrics computed on different algorithms is at the core of the benchmark.

Lossless
Rule: Only compression algorithms that produce a file that is bit-wise equal to the input file after subsequent compress and decompress actions are allowed to be included in the benchmark.
Motivation: Many compression programs employ pre-processing steps prior to the actual compression. The purpose of these pre-processing steps is to manipulate the data in such a way that it is optimal for the subsequent compression step(s). Pre-processing steps may be reversible or irreversible. A reversible pre-processing step is, for example, often employed in the compression of continuous signals (such as audio) [Smi13, p. 487]. An example of irreversible pre-processing is the quantization step in JPEG(2000) [SCE01]. Sometimes the combination of irreversible pre-processing and lossless compression is still referred to as lossless compression, as the actual compression is lossless. We believe the compression algorithm proposed by Moszynski et al. [MCKL13] to be an example of such a combination. The compression algorithm (Huffman coding) is lossless, but the pre-processing step that converts sample values from linear scale to logarithmic scale is unlikely to be reversible when using discrete values. In the context of our test bench, we define losslessness as the bitwise equality between the input file and the result of a subsequent compression and decompression action on the input file, including any pre- or post-processing steps.
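As an illustration, bitwise equality between the original file and the result of a compress/decompress round trip can be checked with Python's standard library; the file names below are placeholders.

import filecmp

# A submission is lossless only if the round-tripped file is byte-for-byte identical
# to the original. 'original.gwf' and 'roundtrip.gwf' are placeholder names for the
# input file and the result of compressing and then decompressing it.
assert filecmp.cmp('original.gwf', 'roundtrip.gwf', shallow=False)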

Repetitive (de)compression
Rule: Algorithms may not use any methods that increase algorithm performance when working on the same file multiple times.
Motivation: A simple example of such behavior would be the caching of the algorithm's output. When the algorithm detects that it is ordered to execute the same action on the same file as was executed for the cached output, it can immediately return the cached output. This is undesirable as it would cause the test bench to report incorrect metrics. For instance, the test bench will compress each file at least twice: once 'normally' and once in an environment that simulates the acquisition environment (see section 4.2). Applying methods that rely on historic information to increase performance would lead to a non-representative real-time metric.

Execution confined to the system
Rule: The execution of the algorithm must be confined to the system of which the properties are included in the request for publication.
Motivation: This means that if the algorithm was run on a computing cluster, the specifics of that cluster should be supplied with the request for publication. Conversely, it means that if the request for publication included the specifics of one personal computer, the processing space for the algorithm is confined to that computer.

File handling
Rule: Algorithms should read records from the input files only once, and in the order in which they are stored in the file.
Motivation: This simulates the real-time scenario in which data is not read from a file but directly received from the MBES. If the algorithm requires history, it is the responsibility of the algorithm to buffer the data.

Tailoring
Rule: Algorithms should not be tailored to the input data.
Motivation: The metrics published in the benchmark should be representative of the algorithm's performance on real data, not specifically on the files included in the Water Column Corpus.

Resources
Rule: There is no restriction on the resources that the algorithms may use (including parallelization) other than what is mentioned in the rule 'Execution confined to the system'.
Motivation: The only situation in which restrictions to available resources apply is in the simulation of the acquisition situation (i.e. the real-time metric); in that case all resources except those that should be available to the algorithm are made unavailable by the test bench.

Publication content
Rule: Publication requests should include the metrics presented in chapter 4 computed over the full Water Column Corpus presented in chapter 3.
Motivation: Other metrics and/or metrics computed on other files will not be included in the benchmark. This allows for validation of published results.

Availability
Rule: All algorithms that are to be included in the benchmark must either be freely accessible or allow a free and unrestricted trial period of at least seven days.
Motivation: This allows for validation of published results.

Chapter 7

Using the test bench

7.1 Installing the test bench

In order to run the test bench, the user should have Python version ≥ 3.0 installed1. The test bench is hosted in a GitHub repository2. To be able to run the test bench, the directory named 'testbench' should be downloaded. The test bench itself is fully encompassed in the source files that reside in the 'testbench' directory and does not require any installation. However, the test bench does have dependencies on external Python modules. To install all the dependencies of the test bench, browse to the 'testbench' folder and run the following command:

python -m pip install .

After this command has executed, all the dependencies of the test bench should be properly installed. The configuration may need to be changed to your specific use case (as is discussed in section 7.3), but the default setup should be enough to verify that the test bench has been correctly installed. The test bench can be run by running either the run.bat or run.sh script (depending on the operating system). The Water Column Corpus is hosted on Google Drive3. The default configuration expects these files to be present in a sub-folder of the test bench named "Water Column Corpus".

7.2 Adding a new algorithm to the test bench

This section shows what steps need to be performed to add an algorithm to the test bench. We will use the replication of the JPEG2000 based method proposed by Beaudoin [Bea10] as an example. For the sake of brevity, we will refer to this implementation as ’the jp2k algorithm’ in the remainder of this document.

7.2.1 The algorithm to add
The jp2k algorithm has been implemented in C++ as a library and works on individual water column records. The library exposes the three functions shown in listing 7.1.

1https://www.python.org/downloads/ 2https://github.com/bgrevelt/Thesis 3https://drive.google.com/drive/folders/0B-cItQ2KwxYsS0lmWnFROGNqeGs?usp=sharing 4for details on the format, refer to appendix B

char* Compress(char* uncompressedData, uint32_t uncompressedDataSize, uint32_t* compressedSize);
char* Decompress(char* compressedData, uint32_t compressedDataSize, uint32_t* decompressedSize);
void Destroy(char* data);

Listing 7.1: JP2k interface

The compress function takes a pointer to memory containing a water column record in the gwf format4 and the size of that memory. The third parameter is a pointer to an integer in which the size of the compressed data will be stored upon completion of the function. The function returns a pointer to the compressed data. The decompress function is the inverse of the compress function and has similar arguments. The library takes on the responsibility of allocating the memory for the compressed data on a call to the compress function and for the decompressed data on a call to the decompress function. In order to deallocate the memory when the data is no longer required, the library exposes a destroy function, which should be called with the return value of a call to the compress or decompress function as its argument.

7.2.2 Building a shared library
There are multiple ways of making programs written in C++ (and other languages) available in Python, but we have found the use of shared libraries to be the easiest. The functions exposed through the library can easily be made available by using the ctypes module. The specifics of creating a shared library from the algorithm's source code are platform and language dependent and will not be discussed here in detail. It is, however, important to note that the platform selected for the shared library should be the same as that of the Python distribution you are planning to use. For example, a 64-bit version of Python requires a shared library that was built for a 64-bit platform. Once a shared library has been built from the source code, it may be necessary to determine the names of the exposed functions. This is the case for the jp2k algorithm, as C++ uses name mangling and thus the names exposed by the shared library do not match the function names in the source. There are different methods of finding the exposed names for different platforms. On Windows, the dumpbin executable that is part of Visual Studio may be used:

dumpbin /exports

On Linux, the nm command can be used:

nm -D

On OS X, the same nm command can be used, but the -D option can be omitted. Listing 7.2 shows that the exported function names of the jp2k algorithm on Windows are "?Compress@@YAPEADPEADIPEAI@Z", "?Decompress@@YAPEADPEADIPEAI@Z", and "?Destroy@@YAXPEAD@Z".

File Type: DLL

  Section contains the following exports for jpeg_2k_compression_lib.dll
  (...)

  ordinal  hint  RVA       name
        1     0  00002930  ?Compress@@YAPEADPEADIPEAI@Z = ?Compress@@YAPEADPEADIPEAI@Z (char * __cdecl Compress(char *,unsigned int,unsigned int *))
        2     1  00002B30  ?Decompress@@YAPEADPEADIPEAI@Z = ?Decompress@@YAPEADPEADIPEAI@Z (char * __cdecl Decompress(char *,unsigned int,unsigned int *))
        3     2  00002D80  ?Destroy@@YAXPEAD@Z = ?Destroy@@YAXPEAD@Z (void __cdecl Destroy(char *))

  Summary
  (...)

Listing 7.2: JP2k DLL dumpbin output

7.2.3 Exposing the algorithm in python
In order for a Python program (like the test bench) to get access to functions in a dynamic library, the functions need to be wrapped. This functionality is offered by the built-in ctypes library5. Listing 7.3 shows how the functions of the jp2k algorithm are wrapped.

1 jp2kdll = ctypes.cdll.LoadLibrary('jpeg_2k_compression_lib.dll')
2
3 _compress_function = getattr(jp2kdll, "?Compress@@YAPEADPEADIPEAI@Z")
4 _decompress_function = getattr(jp2kdll, "?Decompress@@YAPEADPEADIPEAI@Z")
5 _destroy_function = getattr(jp2kdll, "?Destroy@@YAXPEAD@Z")
6
7 _decompress_function.restype = ctypes.POINTER(ctypes.c_char)
8 _compress_function.restype = ctypes.POINTER(ctypes.c_char)

Listing 7.3: Exposing JP2k code using the ctypes module

The dynamic library is loaded in line 1. Lines 3 through 5 create references to the exposed functions using the function names we determined in the previous section. For the jp2k algorithm we need to use the getattr function because the function names (after name mangling) contain characters which are not valid in Python identifiers. If the function names do not contain such characters, the more readable <reference> = <library>.<function name> form can be used instead. The ctypes module assumes that all exposed functions have an integer return type. As this is not the case for our compress and decompress functions, we indicate the right return type (a char pointer) in lines 7 and 8.

7.2.4 implementing the interface
As discussed in section 5.3, each compression algorithm needs to conform to a specific interface in order for the test bench to be able to use it. Listing 7.4 shows the template code for this interface.

def init(parameters):
    ...

def compress(input_path, output_path):
    ...

def decompress(input_path, output_path, record_id = None):
    ...

Listing 7.4: Compression algorithm interface template

5https://docs.python.org/3/library/ctypes.html

If the algorithm to be wrapped supports the required interface, all that is needed is to call the right functions in the shared library. The jp2k algorithm, however, implements a different interface, working on memory containing a single water column record instead of files containing multiple records. Therefore we need to write glue code to support the required interface. For convenience, _compress and _decompress functions are created that encapsulate all ctypes-related code (listing 7.5). The final step for exposing the jp2k algorithm is to write the code to support the required interface. This code is shown in listing 7.6. A simple file format is used for the compressed data. Each record is stored in a block of data. Each block contains the block's size, the record number, and the compressed data. Because the algorithm does not support parameterization, the init function (line 1) has an empty function body. The compress function (line 4) reads the input file using the gwf module (which is distributed along with the test bench) and passes each individual record to the jp2k algorithm for compression. The result, along with the block header, is written to the output file. The decompress function (line 12) applies the same steps in reverse order. With the interface in place, the algorithm is now ready to be included in the test bench. This process is described in the following section.

def _compress(data):
    compressed_data_length = ctypes.c_uint()
    compressed = _compress_function(
        ctypes.c_char_p(data),
        ctypes.c_uint(len(data)),
        ctypes.byref(compressed_data_length))
    data = compressed[:compressed_data_length.value]

    _destroy_function(compressed)

    return data

def _decompress(data):
    decompressed_data_length = ctypes.c_uint()
    decompressed = _decompress_function(
        ctypes.c_char_p(data),
        ctypes.c_uint(len(data)),
        ctypes.byref(decompressed_data_length))
    data = decompressed[:decompressed_data_length.value]

    _destroy_function(decompressed)

    return data

Listing 7.5: Jpeg2k exposed function wrapping

7.2.5 Add the algorithm to the test bench configuration
The final step in adding the new algorithm to the test bench is to add it to the test bench configuration file. This file can be found under testbench/configuration/config.cfg. The algorithm is included by adding a JSON object to the "algorithms" section of the configuration file, consisting of the algorithm's name, the path to the module, and the algorithm parameters. Listing 7.7 shows the lines that are added to the configuration file to include the jp2k algorithm. The jp2k algorithm does not support any parameters; the parameter field is therefore an empty object.

7.3 Configuring the test bench

The test bench configuration file contains the parameters used by the test bench, encoded in the JSON format6. Appendix C contains a listing (listing C.8) of the configuration file as shipped with the test bench.

6http://www.json.org

1 def init(parameters):
2     pass
3
4 def compress(input_path, output_path):
5     gwf_file = gwf.File(input_path)
6     with open(output_path, 'wb') as out_file:
7         for record in gwf_file.read():
8             data = record.serialize()
9             compressed_data = _compress(data)
10             write_block(compressed_data, record.ping_number, out_file)
11
12 def decompress(input_path, output_path, record_id = None):
13     file_size = os.path.getsize(input_path)
14     with open(input_path, 'rb') as input_file, open(output_path, 'wb') as output_file:
15         while input_file.tell() < file_size:
16             block_size, current_record_id = read_block_header(input_file)
17             if record_id == None or record_id == current_record_id:
18                 compressed_data = read_compressed_data(input_file, block_size)
19                 decompressed_data = _decompress(compressed_data)
20                 write_decompressed_data(output_file, decompressed_data)
21             else:
22                 skip_block(input_file, block_size)
23
24 def read_compressed_data(file, block_size):
25     compressed_data_size = block_size - header_length
26     return file.read(compressed_data_size)
27
28 def write_decompressed_data(file, data):
29     file.write(data)
30
31 def write_block(data, record_id, file):
32     write_block_header(data, record_id, file)
33     file.write(data)
34
35 def read_block_header(file):
36     header_data = file.read(header_length)
37     return struct.unpack(_record_header_format, header_data)
38
39 def write_block_header(data, record_id, file):
40     block_length = len(data) + header_length
41     file.write(struct.pack(_record_header_format, block_length, record_id))
42
43 def skip_block(file, block_size):
44     compressed_data_size = block_size - header_length
45     file.seek(compressed_data_size, 1)

Listing 7.6: JP2k adapter code

1 "algorithms": 2 [ 3 (...) 4 { 5 "name": "JPEG 2000 based compression", 6 "path": "algorithms/jpeg2k.py", 7 "parameters": {} 8 } 9 ]

Listing 7.7: Addition of the JP2k algorithm to the test bench's configuration file

The configuration file consists of three sections: input files, algorithms, and metric parameters.

Input files
The test bench uses the Water Column Corpus presented in chapter 3 by default. The set of input files is defined by a simple list of paths, so the user can easily change this if the Water Column Corpus is placed at a different location or if (s)he would like to include other files.

Algorithms
The test bench ships with five reference implementations of compression algorithms: the jp2k algorithm that has been discussed in the previous sections, and four general purpose algorithms: ZIP in 'STORED' and 'DEFLATED' modes, and LZMA in 'fast' and 'maximum compression' modes. The process for adding an algorithm has been explained in section 7.2; a sketch of what such a wrapper can look like for the ZIP algorithms is shown below. If the user would like to exclude an algorithm, the complete object, starting at the left brace and ending at the right brace, should be deleted from the configuration file.
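As an illustration of how little code such a wrapper requires, the sketch below shows how a ZIP algorithm could be exposed through the interface of section 5.3 using Python's zipfile module. It follows the file handling described in section 8.1 (the whole input is compressed as a single archive member); it is an illustrative sketch and not necessarily identical to the shipped reference implementation.

import zipfile

# Compression mode for this wrapper; zipfile.ZIP_STORED would give the
# 'ZIP without compression' variant instead.
_mode = zipfile.ZIP_DEFLATED

def init(parameters):
    pass  # this wrapper takes no parameters

def compress(input_path, output_path):
    # Compress the input file in its entirety as a single archive member.
    with zipfile.ZipFile(output_path, 'w', compression=_mode) as archive:
        archive.write(input_path, arcname='data')
    return 0

def decompress(input_path, output_path, record_id=None):
    # The whole file is stored as one member, so a single-record request still
    # decompresses everything (the behavior described in section 8.1).
    with zipfile.ZipFile(input_path) as archive, open(output_path, 'wb') as out_file:
        out_file.write(archive.read('data'))
    return 0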

Metric parameters
This section of the configuration file contains the parameters that are used for the computation of a number of metrics. More details on the metrics and how they use these parameters can be found in chapter 4. These parameters are use case specific and should therefore always be updated to reflect your specific use case. Table 7.1 provides detailed information on each of the metric parameters and guidelines on how to determine the best value for your use case.

Acquisition CPU available — metric: Real-time (section 4.2)
Meaning: The percentage of CPU time available to the compression algorithm during acquisition.
Guideline: This can be determined by monitoring the CPU usage of the machine during acquisition. Different operating systems provide different tools to monitor the CPU usage: Windows has the task manager, MacOS has the activity monitor, and both Linux and MacOS have the 'top' command line utility. The CPU time available to the compression algorithm is usually 100 minus the measured CPU usage (see footnote 7).

Acquisition memory available — metric: Real-time (section 4.2)
Meaning: The amount of memory available to the compression algorithm during acquisition, in megabytes.
Guideline: This can be determined by monitoring the amount of memory that is in use on average during acquisition. Different operating systems provide different tools to monitor memory usage: Windows has the task manager, MacOS has the activity monitor, and both Linux and MacOS have the 'top' command line utility. The amount of memory available to the algorithm is the total amount of memory installed on the system minus the amount of memory in use.

Cost of processing time — metric: Cost (section 4.4)
Meaning: The monetary value of processing time per hour.
Guideline: This is usually known within the organization. The user is free to use any currency, as long as the same currency is used for all monetary parameters (cost of processing time, cost of ship time, cost of data ownership). The cost metric will be reported in the same currency.

Cost of ship time — metric: Cost (section 4.4)
Meaning: The monetary value of ship time per hour.
Guideline: Ships (including crew and machinery) are often hired to perform surveys. In this case, this parameter is equal to the ship's day rate divided by 24. If the survey is conducted with own materials, the cost of ship time is usually known within the organization. The user is free to use any currency, as long as the same currency is used for all monetary parameters (cost of processing time, cost of ship time, cost of data ownership). The cost metric will be reported in the same currency.

7Most operating systems report the CPU load on a scale from 0 to 100 percent. There are, however, operating systems that report the CPU load on a scale from 0 to 100 times the number of logical cores percent. In this case, scale the available CPU percentage back to a 0-100 range and use that value in the configuration.

Cost of data ownership — metric: Cost (section 4.4)
Meaning: The monetary cost of storing and maintaining data over its life span, per megabyte.
Guideline: Larger organizations may have determined the cost of ownership for data. If this has not been determined for your organization, a proper estimation could be made by calculating what the cost of ownership would be if the organization's data was hosted by a third party. The test bench's default value has been calculated based on the current cost of Amazon cloud hosting8, based on a total data size of < 500 TB and a life cycle of 50 years. This value has been doubled to account for non-storage related costs such as administration and the cost of data transfer. The user is free to use any currency, as long as the same currency is used for all monetary parameters (cost of processing time, cost of ship time, cost of data ownership). The cost metric will be reported in the same currency.

Number of processing decompressions — metric: Processing (section 4.3)
Meaning: The number of times a compressed file is expected to be decompressed.
Guideline: This depends on the way of working in the company. As a rule of thumb, use the number of systems that will be used to process the data.

Processing ratio — metric: Processing (section 4.3)
Meaning: The fraction of compressed records that will be decompressed during processing.
Guideline: This value will normally need to be set to one, as water column data is usually only collected on sites where it is known to be of value. Engineering trials are a noteworthy exception, where water column data is often collected during the entire trial with the sole purpose of having access to it if it is needed for troubleshooting. Continued progress in the field of water column compression may also enable users to collect water column data for the entire survey as the size becomes less restrictive. In both cases, internal analysis should provide the right number to use for this parameter.

Table 7.1: Metric parameter details

8https://aws.amazon.com/s3/pricing/

7.4 Running the test bench

Now that the algorithm has been added to the setup, it will be included in future test bench runs. For optimal accuracy of the metrics, the user has to make sure that the operating system running the test bench has as few other tasks to perform as possible. In addition, disable any features your CPU may have that influence the CPU clock speed(s). These features are known under many names and can usually be disabled in the BIOS/UEFI. The test bench can be run by executing the run.bat script on Windows or the run.sh script on OS X or Linux. This will start the calculation of the metrics on the configured files. Once the test bench run is complete, the results are stored in an sqlite3 database: 'results.sqlite'. There are many (open source) tools that can handle the sqlite3 format and/or convert it to more generic file formats such as CSV. The test bench supports two visualization options for the result data. If the test bench is run with the --show-table flag, the results are shown as a table on the command line, as shown in figure 7.1. The --show-graph switch will generate bar graphs of each computed metric. Figure 7.2 shows an example of such a bar graph for the compression ratio metric.

Figure 7.1: Test bench output in tabular format

Figure 7.2: Test bench output in bar graph format for the compression ratio metric

Chapter 8

Results and Analysis

In this chapter, we present and discuss the results of a run of our test bench. The test bench (in its default configuration) computes the metrics presented in chapter 4 for the files in the Water Column Corpus presented in chapter 3.

8.1 Algorithms

Five algorithms are included in the test bench in the default configuration. The included algorithms are presented in the following sections.

ZIP
Two configurations of the ZIP algorithm are included in the test bench: one that uses ZIP's DEFLATE algorithm for compression (hereafter referred to as 'ZIP with compression') and one that does not use any compression method (hereafter referred to as 'ZIP without compression'). The latter algorithm is included in the test bench to act as a baseline. Since no compression is applied, it is a worst-case scenario for the compression ratio metric. For the same reason, we expect the algorithm to have low compression and decompression times compared to algorithms that do use compression. As shown in section 7.2.4, file handling can be part of the code that has to be written to make an algorithm available to the test bench. For the ZIP algorithms we have selected a simple file handling mechanism: the input data is compressed in its entirety, and the compressed data is written to the output file. As a consequence, the ZIP algorithms need to decompress the entire compressed file when instructed to decompress a single record. As the other algorithms have a file handling method that is more suited to random access decompression (as will be detailed in the subsequent sections), we expect the test bench results to show that the ZIP algorithms have poor random access decompression performance compared to the other algorithms.

LZMA
The LZMA compression algorithm implementation used in the test bench1 offers three 'compression modes': fast, normal and maximum compression. The fast (hereafter referred to as 'LZMA fast') and the maximum compression (hereafter referred to as 'LZMA maximum compression') modes are included in the test bench. We expect the test bench results to show that LZMA fast has better performance than LZMA maximum compression in compression and decompression time. Conversely, we expect LZMA maximum compression to have better performance in compression ratio. Contrary to the ZIP algorithms, the LZMA algorithms compress the input data one record at a time. For each compressed record, the record number, the compressed record size, and the compressed record are written to the output file. This allows the algorithm to decompress only the requested

1https://pypi.python.org/pypi/pylzma

record when instructed to decompress a single record. We therefore expect the test bench results to show that the LZMA algorithms have better random access decompression performance than the ZIP algorithms.

JPEG 2000 based compression
Our reproduction of the jpeg2000-based compression method proposed by Beaudoin [Bea10] is hereafter referred to as jpg2k. This algorithm uses the Jasper [AK00] library to compress the amplitude data of the water column record. As this algorithm uses the same method of file handling as the LZMA algorithms, we similarly expect the test bench results to show that the jpg2k algorithm outperforms the ZIP algorithms in random access decompression.

8.2 Generic compression metrics

Figures 8.1, 8.2, 8.3, and 8.4 show the measurements for the generic compression metrics: compression ratio, compression time, decompression time and random access decompression time. All time measurements have been normalized to the size of the data that was (de)compressed to make the graphs easier to read. Most of the measurements are in line with our expectations: ZIP without compression always has a compression ratio of 1.0 (i.e. no compression) and is consistently the fastest algorithm in both compression and decompression. The LZMA algorithms consistently have a lower (better) compression ratio than the ZIP algorithms, but also have higher (worse) compression and decompression times. LZMA maximum compression consistently offers a lower (better) compression ratio than LZMA fast, while the latter is typically faster than the former in both compression and decompression. As expected, the random access decompression performance of the ZIP algorithms is poor compared to the other algorithms. Some measurements, however, are not according to our expectations or stand out otherwise. These are discussed in the subsequent sections.

8.2.1 Deviation in (de)compression time
Figures 8.2 and 8.3 show that there is a deviation in the average compression and decompression times per byte over the different input files. This is further detailed in table 8.1. It suggests that the contents of a file not only influence the compression ratio, but also the rate at which the data can be compressed or decompressed. Neither of these properties of the input files has been taken into account in the selection process for the Water Column Corpus. Table 8.1 shows that the compression time has a strong correlation with the compression ratio, the file characteristic used to select files for the Water Column Corpus. This makes it less likely that a file has been selected for the corpus that has a representative compression ratio but a non-representative compression time. The decompression time, however, has a very low correlation with the compression ratio, which means that it is possible that the files selected for the corpus, although representative in terms of compression ratio, are not representative of the decompression time. This observation is further discussed in sections 9.1 and 9.2.

8.2.2 LZMA compression time
The LZMA fast algorithm, contrary to our expectation, has a higher compression time for the 'Fish' file than the LZMA maximum compression algorithm, even though the latter has a lower compression ratio. Interestingly, this is not a fluke: we observed the same result in five test bench runs and in a separate measurement made without the test bench. It appears that there is something specific in the content of the 'Fish' file that causes this behavior.

File                               Average compression time   Average decompression time   Average compression ratio
                                   (µs per byte)              (µs per byte)
Fish                               0.21                       0.037                        0.84
Survey                             0.17                       0.050                        0.67
Target                             0.20                       0.030                        0.76
Vent                               0.18                       0.045                        0.62
Water seep                         0.12                       0.030                        0.43
Wreck                              0.22                       0.045                        0.63
Range (% of mean)                  0.10 (57)                  0.021 (51)
Correlation to compression ratio   0.80                       0.10

Table 8.1: Average (de)compression time statistics.
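The correlation values in the last row of table 8.1 can be reproduced from the tabulated values, for example with Python's statistics module; the small differences with respect to the reported 0.80 and 0.10 are due to rounding of the tabulated inputs.

from statistics import correlation  # available from Python 3.10

# Values taken from table 8.1.
compression_time   = [0.21, 0.17, 0.20, 0.18, 0.12, 0.22]       # µs per byte
decompression_time = [0.037, 0.050, 0.030, 0.045, 0.030, 0.045]  # µs per byte
compression_ratio  = [0.84, 0.67, 0.76, 0.62, 0.43, 0.63]

print(round(correlation(compression_time, compression_ratio), 2))    # ~0.79
print(round(correlation(decompression_time, compression_ratio), 2))  # ~0.09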

8.2.3 Jpg2k compression ratio
The jpg2k algorithm and the ZIP with compression algorithm alternately have the third lowest compression ratio. This inconsistency between files may be explained by the fact that the two files for which ZIP with compression outperforms the jpg2k algorithm are the only two files in the Water Column Corpus that contain phase data. The jpeg2000 based algorithm as proposed by Beaudoin [Bea10] only compresses the amplitude samples in the water column data. As the phase data constitutes approximately half of the file size for the two files that contain phase data, it is likely that the compression ratio for the jpg2k algorithm would improve if the phase data were compressed as well.

8.2.4 Random access decompression compared to full file decompression
Figure 8.4b shows the random access decompression times at a zoom level at which the two LZMA algorithms and the jpg2k algorithm can be compared. The relation between the algorithms is similar to the relation for the decompression time metric, but the differences between the algorithms are less pronounced for the random access decompression time metric. Further inspection showed that this is due to the ratio of time spent on file handling to time spent on actual decompression. Even though the actual I/O time is not included in the time measurement2, there is still some overhead incurred by searching the compressed file for the record to decompress. In the case of full file decompression, this overhead is negligible compared to the decompression time. However, when a single record is decompressed, searching the file for the right record can take longer than the actual decompression of the record and thus has a strong influence on the random access decompression time. As the file reading code (and thus the induced overhead of searching for the record to decompress) is equal for the jpg2k algorithm and the LZMA algorithms, the random access decompression times for the jpg2k algorithm and the LZMA algorithms are closer than their respective full file decompression times.

8.2.5 Jpg2k performance
The jpg2k algorithm, which is the only algorithm in the test bench that was specifically designed for water column data, does not appear to outperform any of the generic compression algorithms. Its compression ratio is often similar to that of the ZIP with compression algorithm, while the latter has much lower compression and decompression times. The algorithm performs especially poorly on decompression, where it is consistently the slowest algorithm by at least a factor of two.

8.3 The processing metric

The processing metric (section 4.3) is indicative of the time spent on decompression during processing. The metric is computed using either the full file decompression time or the time spent individually decompressing the required records, depending on which of the two is smaller. The number of required records is determined by the processing ratio, a user parameter. A processing ratio of 1.0 indicates that all records are to be decompressed during processing. The results of a test bench run with this setting are shown in figure 8.5a. As expected, the processing metric values correlate to the (full file) decompression time (figure 8.3). When the processing ratio is reduced, the processing metric starts to correlate with the random access decompression time instead of the full file decompression time for some algorithms, as the time required for partial decompression becomes less than the time required for full file decompression. Figures 8.5b and 8.5c show this behavior for processing ratios of 0.1 and 0.01, respectively. Due to the way we implemented file handling in the ZIP algorithms, random access decompression of a single record takes more time than full file decompression (as discussed in section 8.1). The values for these algorithms, therefore, do not change when the processing ratio is lowered. For most measurements, we see that the values correlate to those of the random access decompression when a processing ratio of 0.1 is used. A notable exception is the 'Survey' file, for which the values have not changed compared to the values obtained with a processing ratio of 1.0. The reason for this behavior probably lies in the fact that the random access decompression time for this file is relatively high, as we can see in figure 8.4b. When a processing ratio of 0.01 is used, we see that all values except those of the ZIP algorithms now correlate to the random access decompression time instead of the full file decompression time.

2refer to section 5.6.2 for more information on how time is measured in the test bench.

Figure 8.1: Compression ratio

Figure 8.2: Compression time

Figure 8.3: Decompression time

Figure 8.4: Random access decompression time: (a) random access decompression time, (b) random access decompression time zoomed in

8.4 The real-time metric

As discussed in section 4.2, the real-time metric is an indication of the ability of an algorithm to compress data at the rate at which it is generated. Figure 8.6 shows the computed values for the real-time metric using the default test bench configuration. In this configuration, the algorithm is allowed to use 50% of the CPU time and 2 GB of memory. Sub-figure 8.6a shows the computed values over the full range of the results. Sub-figure 8.6b shows the same data, zoomed in to the turning point for real-time compression.

The relative order of the algorithms’ real-time performance is equal to that of their performance for compression time: for each file, the ordering of the algorithms from best to worst is the same for the compression time metric and the real-time metric. This signifies that the CPU load and memory consumption induced during the measurement of the real-time compression time in this test run have similar effects on all compression algorithms.

The values show that not only do the algorithms differ in their ability to perform real-time compression, there is also a large difference between the files. The ’Vessel’, ’WaterSeep’ and ’Wreck’ files can be compressed in real time by any algorithm, whereas the ’Target’ file cannot be compressed in real time by any algorithm that actually applies compression. A closer look at the definition of the metric explains why. The real-time metric is defined as the ratio of the average compression time for a single record to the generation interval of records in the input file. The metric’s dependency on the record generation interval is explicit, but the metric depends on another property of the input file as well: the average record compression time depends not only on the algorithm’s compression performance, but also on the size of the records. We can combine these two dependencies in the data rate of the input file, which we define as the size of the file divided by the timespan of the records in the file. The data rates for the input files are shown in figure 8.7. When comparing the real-time metric measurements to the data rates, there appears to be an inverse correlation between the two. As the data rate of the input files has not been taken into account in the selection process for the Water Column Corpus, there is a chance that the files selected for the corpus are less representative with respect to this property. This observation is further discussed in sections 9.1 and 9.2.
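In our own notation (these symbols are not taken from chapter 4), the link between the real-time metric and the data rate can be made explicit. Assume an algorithm compresses at a roughly constant throughput of $v$ bytes per second, and consider a file of size $S$ containing $N$ records recorded over a timespan $T$. The average record size is then $S/N$ and the average generation interval is $T/N$, so

\[
M_{\mathrm{rt}} \;=\; \frac{\bar{t}_{\mathrm{compress}}}{\Delta t_{\mathrm{record}}}
\;\approx\; \frac{(S/N)/v}{T/N}
\;=\; \frac{S}{v\,T}
\;=\; \frac{D}{v},
\qquad\text{with } D = \frac{S}{T}.
\]

Under this simplification the real-time metric scales linearly with the data rate $D$ for a fixed algorithm throughput, which is consistent with the relation observed between the real-time measurements and the data rates of figure 8.7.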

Figure 8.5: Processing metric: (a) processing ratio = 1.0, (b) processing ratio = 0.1, (c) processing ratio = 0.01

8.5 The cost reduction metric

The cost reduction metric presented in section 4.4 encompasses the measurements of all other metrics and yields a single (monetary) value that represents the value of applying a compression method. Figure 8.8 shows the values computed by the test bench in the default configuration. To ease comparison between files, the values have been normalized by file size.

The figure clearly shows that the ZIP without compression algorithm offers no value: because it does not actually compress the data, there is no data reduction and therefore no cost reduction. Because the algorithm is fast, its induced costs are low. For all other algorithms, there is a clear correlation between the cost reduction and the real-time metric. Whenever the real-time metric is 1.0 or higher, the cost reduction value is positive. This means that the cost savings induced by the data reduction are large enough to compensate for the costs incurred by decompression during processing, but not large enough to additionally compensate for the cost that would be incurred by non-real-time compression. Based on the results obtained from this test bench run, the only algorithm that consistently offers cost reduction over all files is the ZIP with compression algorithm.

It is important to note that this metric relies heavily on the supplied user parameters, which may differ significantly between organizations. The cost of ship time, for instance, may vary greatly based on the size of the vessel. The cost of data storage may differ based on the company’s requirements for redundancy and administration. And the cost of processing may vary based on the country of operation. The default values in the configuration have been chosen with care, based on information from professionals in the domain and the current costs of cloud hosting, but may not be representative for all use cases.

Figure 8.6: Real time metric: (a) overview, (b) zoomed

Figure 8.7: Input file data rates

Figure 8.8: Cost reduction metric
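As an illustration of the trade-off described above, the sign of the cost reduction can be thought of as a balance of three terms. This is our own simplified sketch and not the cost model defined in section 4.4; the parameter names merely echo the configuration file in appendix C, and the units are assumptions.

    def cost_reduction(saved_storage_gb, decompression_hours, overrun_hours,
                       cost_of_data_ownership, cost_of_processing_time,
                       cost_of_ship_time, number_of_processing_decompressions=1):
        """Illustrative cost balance, NOT the exact metric of section 4.4.

        saved_storage_gb    -- data volume removed by compression (assumed GB)
        decompression_hours -- time spent decompressing during processing (assumed hours)
        overrun_hours       -- extra acquisition time when compression is not real-time
        cost parameters     -- unit costs, as in the test bench configuration (assumed units)
        """
        savings = saved_storage_gb * cost_of_data_ownership
        processing_cost = (decompression_hours * number_of_processing_decompressions
                           * cost_of_processing_time)
        ship_cost = overrun_hours * cost_of_ship_time
        return savings - processing_cost - ship_cost

The structure mirrors the discussion above: as long as compression keeps up with acquisition, the overrun term vanishes and the storage savings only need to outweigh the processing cost, whereas any acquisition overrun is quickly dominated by the much larger cost of ship time.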

Chapter 9

Conclusion & Future work

The scientific community has identified a number of promising applications for multibeam echosounder water column data. However, many hydrographic surveyors choose not to collect water column data due to its large volume compared to other types of hydrographic data. Compression of the water column data has been identified as a possible solution, and three different algorithms for water column compression have been proposed [Bea10][MCKL13][APR+16]. No direct comparison of these three algorithms has been performed, so it is unclear what the current state of the art in water column data compression is.

The goal of this work was to answer the research question posed in section 1.1: ’How to evaluate the performance of lossless water column data compression algorithms?’. Our hypothesis was that, as in other disciplines, benchmarking could be used to provide such an evaluation. We have methodologically selected a set of input files representative of real-world data and placed these files in the public domain so they can be used without restriction. Furthermore, we have defined metrics for the different components of water column data compression performance and created a test bench to compute these metrics. We have used the test bench to compute the proposed metrics over the proposed set of input files. The results of this test bench run, presented in chapter 8, show that our test bench is a valid method for comparing the various performance components of lossless water column data compression: the results clearly distinguish the different algorithms and conform to the expectations we formulated based on known properties of the algorithms.

We conclude that benchmarking can indeed be used to evaluate the performance of lossless water column data compression algorithms, and that the addition of a test bench simplifies the process of obtaining results for such a benchmark. We now invite the community to evaluate the benchmark so that it can grow into a community-driven benchmark.

9.1 Threats to validity

We have identified a number of threats to the validity of the methodology used for the selection of the Water Column Corpus. As a result, certain properties of the files selected for the corpus could be less representative of real-world data than we would like them to be. This means that the threats presented in the subsequent sections may limit the implications of the results presented in chapter 8 to a subset of the complete water column compression domain. They do not, however, threaten the validity of our conclusion that benchmarking can be used to evaluate the performance of lossless water column data compression algorithms. Moreover, it is precisely because our benchmark and test bench have exposed interesting performance behavior over the input files that we are able to identify these threats.

9.1.1 File properties considered in corpus selection

The analysis of the results appears to indicate that the data rate (defined as the file size divided by the timespan of the records in the file) correlates with the real-time performance of the algorithms. We also observed significant deviation of the decompression time per byte over different files. Because neither of these input file properties was included in the selection process for the Water Column Corpus, we cannot say whether the Water Column Corpus is representative of real-world data with respect to these properties.

9.1.2 Restriction of candidate files for corpus selection

Another threat to the validity of the Water Column Corpus can be found in the set of candidate files, which consisted of the water column data available to the host organization. Because the domain of the host organization does not span the complete domain of water column data, the input set is restricted. Of the applications of water column data mentioned by Colbo et al. [CRBW14] (fisheries, marine mammals, zooplankton, kelp ecosystems, aquaculture, gas venting, near-surface bubbles, suspended sediment, physical oceanography and wrecks/archeological oceanography), less than half were present in the set of candidate files. As a result, the obtained results may not be representative for the domains for which no data was included in the candidate files.

9.1.3 Hardware differences between acquisition and processing

The test bench computes all metrics on a single system. In reality, compression and decompression are likely to occur on different systems (the acquisition system and the processing system, respectively), and these systems may have different hardware configurations. If the difference in hardware configurations has a significant impact on the performance of the compression algorithm(s), the results of the test bench may not be representative of the real-life situation. This risk is especially prominent for the cost reduction metric, which integrates time spent on compression during acquisition and time spent on decompression during processing.

9.2 Future work

9.2.1 Water Column Corpus inclusion criteria

Section 9.1 indicated why more research regarding the creation of the next iteration of the Water Column Corpus, with better selection criteria and a larger set of candidate files, is needed.

9.2.2 Improve performance

The test bench in its default configuration has a run time of approximately 14 hours on a high-end machine. As the number of algorithms included in a test bench run is likely to grow, and the number of input files may grow as well, we think it is opportune to look into ways to reduce the run time of the test bench. One research direction is to investigate whether it is possible to include only part of the selected input files without compromising the representativeness of the data set. Another direction is to replace unnecessary measurements with modeling, for example modeling the compression performance in a high-load environment from the results obtained in a ’normal’ environment. Such a development could reduce the test bench run time to less than half the current run time.

9.2.3 Data generation

Generated data is a valuable tool to stress known weaknesses of algorithms and to prevent algorithms from being tailored to the input data. A water column data generator could thus be a worthwhile addition to the benchmark.

9.2.4 Lossy compression

In this work, we have focused solely on the lossless compression of water column data, mainly because it is currently unknown what amount of loss is acceptable for water column data. To determine this, processing of water column data should become more common; we expect that the development of high-performance lossless compression can be a means to attain this goal. Once water column processing is more common and the problems associated with it are better understood, we believe that the acceptable loss for each application should be determined, and the focus of the scientific community should shift towards lossy compression. The best compression ratio we have measured is close to 0.25, and the average compression ratio of the best performing algorithm (LZMA max compression) is only 0.5. Beaudoin [Bea10] has shown that with lossy compression, compression ratios of 0.1 can be attained ”yielding very little loss of resolution or introduction of artifacts”. All the metrics presented in this work are valid for lossy compression as well; a test bench and/or benchmark for lossy water column compression could therefore be obtained by extending our work.

9.2.5 Hardware differences

Research should determine whether the threat to validity addressed in section 9.1.3 is a threat in practice. If it is, the test bench could be adapted to support partial runs: the metrics relevant to acquisition could then be determined on the acquisition hardware and the metrics relevant to processing on the processing hardware.

9.3 Community involvement

The positive effect of a benchmark on the community depends on the involvement of the community in that benchmark. At this point, we would like to call on the community to contribute to the benchmark. Such contributions can come in the form of feedback on the proposed benchmark, sharing relevant water column data, and/or sharing (new) water column data compression algorithms for evaluation. We would like to specifically invite members of the community that have access to a large body of (proprietary) water column data to collaborate on the creation of the next iteration of the Water Column Corpus.

Bibliography

[AB97] Ross Arnold and Tim Bell. A corpus for the evaluation of lossless compression algorithms. In Data Compression Conference, 1997. DCC’97. Proceedings, pages 201–210. IEEE, 1997.

[AK00] Michael D Adams and Faouzi Kossentini. JasPer: A software-based JPEG-2000 codec implementation. In Image Processing, 2000. Proceedings. 2000 International Conference on, volume 2, pages 53–56. IEEE, 2000.

[APR+16] David Amblas, Jordi Portell, Xavier Rayo, Alberto G. Villafranca, Enrique García-Berro, and Miquel Canals. Real-time lossless compression of multibeam echosounder water column data. 2016.

[BCW90] Timothy C Bell, John G Cleary, and Ian H Witten. Text compression. Prentice-Hall, Inc., 1990.

[Bea10] Jonathan Beaudoin. Application of JPEG 2000 wavelet compression to multibeam echosounder mid-water acoustic reflectivity measurements. In Canadian Hydrographic Conference, Quebec City (Canada), 2010.

[Bon14] Peter Boncz. Choke-point based benchmark design, 2014. URL: http://ldbcouncil.org/blog/choke-point-based-benchmark-design.

[BWSP06] B Buelens, R Williams, AHJ Sale, and Tim Pauly. Computational challenges in processing and analysis of full-watercolumn multibeam sonar data. 2006.

[CHI+15] Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, and Peter Boncz. Graphalytics: A big data benchmark for graph-processing platforms. In Proceedings of the GRADES’15, page 7. ACM, 2015.

[Cla06] JE Hughes Clarke. Applications of multibeam water column imaging for hydrographic survey. Hydrographic Journal, 120:3, 2006.

[CRBW14] Keir Colbo, Tetjana Ross, Craig Brown, and Tom Weber. A review of oceanographic applications of water column data from multibeam echosounders. Estuarine, Coastal and Shelf Science, 145:41–56, 2014.

[Deo03] Sebastian Deorowicz. Universal lossless data compression algorithms. Philosophy Dissertation Thesis, Gliwice, 2003.

[DHL15] Patrick Damme, Dirk Habich, and Wolfgang Lehner. A benchmark framework for data compression techniques. In Technology Conference on Performance Evaluation and Benchmarking, pages 77–93. Springer, 2015.

[GVI+14] Yong Guo, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella, and Theodore L Willke. Benchmarking graph-processing platforms: a vision. In Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering, pages 289–292. ACM, 2014.

[IR08] Md Rafiqul Islam and SA Ahsan Rajon. On the design of an effective corpus for evaluation of Bengali text compression schemes. In Computer and Information Technology, 2008. ICCIT 2008. 11th International Conference on, pages 236–241. IEEE, 2008.

[Jak10] Bc. Jakub Řezníček. Corpus for comparing compression methods and an extension of the ExCom library. 2010.

[MCKL13] Marek Moszynski, Andrzej Chybicki, Marcin Kulawiak, and Zbigniew Lubniewski. A novel method for archiving multibeam sonar data with emphasis on efficient record size reduction and storage. Polish Maritime Research, 20, 2013.

[Pow01] Matt Powell. Evaluating lossless compression methods. In New Zealand Research Students’ Conference, Canterbury, New Zealand, pages 35–41. Citeseer, 2001.

[SCE01] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5):36–58, 2001.

[SEH03] Susan Elliott Sim, Steve Easterbrook, and Richard C Holt. Using benchmarking to advance research: A challenge to software engineering. In Software Engineering, 2003. Proceedings. 25th International Conference on, pages 74–83. IEEE, 2003.

[Smi13] Steven Smith. Digital signal processing: a practical guide for engineers and scientists. Newnes, 2013.

[Swa08] Jakub Swacha. Assessing the efficiency of data compression and storage system. In Computer Information Systems and Industrial Management Applications, 2008. CISIM’08. 7th, pages 109–114. IEEE, 2008.

Appendix A

Water column data

The purpose of this chapter is to introduce some multibeam echosounder related concepts and terminology that are used in this work. It is not meant as a comprehensive introduction to multibeam echosounders, as this is outside the scope of this work. Readers interested in such a comprehensive introduction may find it in the document published by L3 Communications at https://www.ldeo.columbia.edu/res/pi/MB-System/sonarfunction/SeaBeamMultibeamTheoryOperation.pdf

A multibeam echosounder is a device that emits an acoustic pulse into the water that is wide in the port-starboard direction and narrow in the aft-bow direction. This pulse travels down to the sea floor, where it is reflected. The reflection travels back to the multibeam echosounder, where it is recorded by an array of hydrophones. Because the reflection is recorded by an array of hydrophones, the signal can be filtered to contain only the signal that arrived at the multibeam echosounder from a certain angle. This process is called beam forming, and the signal that is formed for a specific angle is called a beam. The data after beam forming consists of a collection of beams, where each beam consists of an angle and a representation of the signal. The signal representation always includes samples of the signal amplitude and, for some devices, samples of the phase of the signal as well.

Water column data is normally shown as a ’wedge plot’: for each beam, the amplitude samples are drawn on a straight line starting at a central node (representing the multibeam echosounder) and going downwards at the beam angle. Figure A.1a is an example of such a plot. In this representation it is hard to identify the individual beams. Figure A.1b shows the same data visualized in a different way: the amplitude samples for each beam are drawn on a line going straight down, and the beams are stacked horizontally. In this representation, the individual beams are more clearly visible.
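The two visualizations can be reproduced with a few lines of plotting code. The sketch below is our own illustration with synthetic data; it assumes numpy and matplotlib are available and assumes a constant range step per sample.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical record: 128 beams spanning -60..60 degrees, 500 amplitude
    # samples per beam, with a constant (assumed) range step per sample.
    angles_deg = np.linspace(-60, 60, 128)
    amplitudes = np.random.rand(128, 500)   # stand-in for real amplitude samples
    range_step = 0.1                        # metres per sample (assumed)

    ranges = np.arange(amplitudes.shape[1]) * range_step
    theta = np.deg2rad(angles_deg)[:, None]

    # Wedge plot: each beam is drawn along a straight line from the transducer
    # downwards at the beam angle; each sample sits at its range along that line.
    x = ranges * np.sin(theta)    # across-track position
    z = -ranges * np.cos(theta)   # depth (negative = downwards)

    plt.figure()
    plt.scatter(x.ravel(), z.ravel(), c=amplitudes.ravel(), s=1)
    plt.xlabel("across-track distance")
    plt.ylabel("depth")
    plt.title("Wedge plot (synthetic data)")

    # Horizontally stacked: one column per beam, samples going straight down.
    plt.figure()
    plt.imshow(amplitudes.T, aspect="auto", origin="upper")
    plt.xlabel("beam index")
    plt.ylabel("sample index")
    plt.title("Horizontally stacked beams (synthetic data)")
    plt.show()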

Figure A.1: Water column data record visualized in two modes: (a) wedge plot, (b) horizontally stacked

Appendix B

Generic Water column Format

B.1 Purpose

The generic water column format has been developed as a format to be used in the evaluation of water column compression algorithms. It achieves genericity by abstracting away all properties of the data but the ones that are necessary for compression. The goal of the format is that any water column format can be converted to the generic water column format largely by restructuring the original data. The generic water column format adds some overhead to the original format, but this is kept to a minimum.

B.2 Structure

The generic water column format consists of a collection of water column records. Each record consists of a record header, as shown in table B.1, followed by a beam entry for each beam in the record. The format of the beam entry is shown in table B.2.

Bytes         Field                          Data type   Note
0..1          Number of beams (Nb)           uint16
2             Amplitude sample format (Fsa)  uint8       *1
3             Phase sample format (Fsp)      uint8       *1
4..7          Generic swath data size (Ssg)  uint32
8..8+Ssg-1    Generic swath data             uint8[]

Table B.1: Record header

The record header is followed by Nb beam entries:

Bytes         Field                          Data type   Note
0..3          Swath number                   uint32
4..7          Number of amplitude samples    uint32
8..11         Number of phase samples        uint32
12..13        Generic beam data size         uint16
variable      Amplitude samples              (Fsa)[]     *2
variable      Phase samples                  (Fsp)[]     *2
variable      Generic data                   uint8[]     *2

Table B.2: Beam entry

*1) Sample format according to table B.3

Value   Format                    Bytes per sample
0       8 bit unsigned integer    1
1       8 bit signed integer      1
2       16 bit unsigned integer   2
3       16 bit signed integer     2
4       32 bit unsigned integer   4
5       32 bit signed integer     4
6       64 bit unsigned integer   8
7       64 bit signed integer     8
8       32 bit IEEE float         4
9       64 bit IEEE float         8

Table B.3: Sample format enumeration

*2) The size and/or position of the amplitude, phase, and generic data are variable and depend on the data size of the sample formats used for the amplitude and phase data. Refer to table B.3 for specifics on the data size for each sample format.
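As a reading aid for tables B.1–B.3, the sketch below parses one GWF record from a byte buffer. It is our own illustration and not part of the test bench; it assumes little-endian byte order and assumes the amplitude samples, phase samples, and generic beam data follow the fixed beam-entry fields in that order.

    import struct

    # struct format codes for the sample formats of table B.3 (little-endian assumed)
    SAMPLE_FORMATS = {
        0: "B", 1: "b", 2: "H", 3: "h", 4: "I",
        5: "i", 6: "Q", 7: "q", 8: "f", 9: "d",
    }

    def read_record(buf, offset=0):
        """Parse one generic water column record starting at `offset` in `buf`."""
        # Record header (table B.1): Nb, Fsa, Fsp, Ssg, followed by generic swath data.
        nb, fsa, fsp, ssg = struct.unpack_from("<HBBI", buf, offset)
        offset += 8
        generic_swath = buf[offset:offset + ssg]
        offset += ssg

        beams = []
        for _ in range(nb):
            # Fixed part of the beam entry (table B.2).
            swath_no, n_amp, n_phase, generic_size = struct.unpack_from("<IIIH", buf, offset)
            offset += 14
            amp_fmt, phase_fmt = SAMPLE_FORMATS[fsa], SAMPLE_FORMATS[fsp]
            amplitudes = struct.unpack_from("<%d%s" % (n_amp, amp_fmt), buf, offset)
            offset += n_amp * struct.calcsize(amp_fmt)
            phases = struct.unpack_from("<%d%s" % (n_phase, phase_fmt), buf, offset)
            offset += n_phase * struct.calcsize(phase_fmt)
            generic_beam = buf[offset:offset + generic_size]
            offset += generic_size
            beams.append((swath_no, amplitudes, phases, generic_beam))

        return generic_swath, beams, offset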

B.3 Conversion

Figure B.1 shows how a water column record in the Kongsberg .all encoding is converted to a generic water column format record. All yellow fields are considered ’generic data’ in the GWF encoding: data that is not necessary for compression and is therefore treated simply as generic data. The orange fields are similar, but represent generic data that belongs to a specific beam. The white fields are fields that are present in both the original encoding and the GWF encoding; the dotted lines show the mapping from the Kongsberg .all field names to the GWF field names. The green fields are fields that are not present in the original data. In this specific case, these are the sample formats for the phase and amplitude data and the number of phase samples for each beam (as the Kongsberg .all format does not include phase data). The total amount of overhead induced by the GWF format is thus 1 + 1 + (Nb * 4) bytes; for a record with 256 beams, for example, this amounts to 1026 bytes.

Figure B.1: Mapping of the Kongsberg .all format to GWF

Appendix C

Test bench Configuration file

The following listing shows the default configuration file as distributed with the test bench.

{
    "input files":
    [
        "Water Column Corpus/Fish.gwf",
        "Water Column Corpus/Survey.gwf",
        "Water Column Corpus/Target.gwf",
        "Water Column Corpus/Vent.gwf",
        "Water Column Corpus/WaterSeep.gwf",
        "Water Column Corpus/Wreck.gwf"
    ],

    "algorithms":
    [
        {
            "name": "ZIP no compression",
            "path": "algorithms/zip.py",
            "parameters":
            {
                "compress" : false
            }
        },
        {
            "name": "ZIP with compression",
            "path": "algorithms/zip.py",
            "parameters":
            {
                "compress" : true
            }
        },
        {
            "name": "JPEG 2000 based compression",
            "path": "algorithms/jpeg2k.py",
            "parameters": {}
        },
        {
            "name": "LZMA fast",
            "path": "algorithms/wclzma.py",
            "parameters":
            {
                "compression mode" : 0
            }
        },
        {
            "name": "LZMA max compression",
            "path": "algorithms/wclzma.py",
            "parameters":
            {
                "compression mode" : 2
            }
        }
    ],

    "metric parameters":
    {
        "acquisition cpu available" : 50,
        "acquisition memory available" : 2048,
        "cost of processing time" : 200,
        "cost of ship time" : 900,
        "cost of data ownership" : 0.014,
        "number of processing decompressions" : 1,
        "Processing ratio" : 1
    }
}

Listing C.8: Example test bench configuration file
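For reference, a configuration file with this structure can be inspected with a few lines of Python. This is only an illustration of the file layout; the test bench's own loading code may differ, and the path below is hypothetical.

    import json

    with open("config.json") as f:   # path is illustrative
        config = json.load(f)

    print("Input files:", len(config["input files"]))
    for algorithm in config["algorithms"]:
        print(algorithm["name"], "->", algorithm["path"], algorithm["parameters"])
    print("Processing ratio:", config["metric parameters"]["Processing ratio"])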
