
Waveform Signal Entropy and Compression Study of Whole-Building Energy Datasets

Thomas Kriechbaumer
Chair for Application and Middleware Systems, Technische Universität München, Germany
[email protected]

Hans-Arno Jacobsen
Chair for Application and Middleware Systems, Technische Universität München, Germany
[email protected]

arXiv:1810.10887v1 [cs.OH] 25 Oct 2018

ABSTRACT

Electrical energy consumption has been an ongoing research area since the coming of smart homes and Internet of Things devices. Consumption characteristics and usage profiles are directly influenced by building occupants and their interaction with electrical appliances. Extracted information from these data can be used to conserve energy and increase user comfort levels. Data analysis together with machine learning models can be utilized to extract valuable information for the benefit of the occupants themselves, power plants, and grid operators. Public energy datasets provide a scientific foundation to develop and benchmark these algorithms and techniques. With datasets exceeding tens of terabytes, we present a novel study of five whole-building energy datasets with high sampling rates, their signal entropy, and how a well-calibrated measurement can have a significant effect on the overall storage requirements. We show that some datasets do not fully utilize the available measurement precision, therefore leaving potential accuracy and space savings untapped. We benchmark a comprehensive list of 365 file formats, transparent data transformations, and lossless compression algorithms. The primary goal is to reduce the overall dataset size while maintaining an easy-to-use and accessible API. We show that with careful selection of file format and encoding scheme, we can reduce the size of some datasets by up to 73%.

1 INTRODUCTION

Home and building automation promise many benefits for the occupants and power utilities. From increased user comfort levels to demand response and lower electricity costs, Smart Homes offer a variety of assistance and informational gains. Internet of Things devices, a combination of sensors and actuators, can be intelligently controlled based on sensor data or external triggers. Power monitoring and smart metering are a key step to fulfill these promises. The influx of renewable energies and the increased momentum of changes in the power grid and its operations are a main driving factor for further research in this area.

Non-intrusive load monitoring (NILM) can be one solution to identify and disaggregate power consumers (appliances) from a single-point measurement in the building. Utilizing a centralized data acquisition system saves costs for hardware and installation in the electrical circuits under observation. The NILM community heavily relies on long-term measurement data, in the form of public datasets, to craft new algorithms, train models, and evaluate their accuracy on per-appliance energy consumption or appliance identification. In recent years these datasets grew significantly in size and sampling characteristics (temporal and amplitude resolution). Collecting, distributing, and managing large-scale data storage facilities is an ongoing research topic [11, 46] and strongly depends on the environment and systems architecture.

High sampling rates are particularly interesting for NILM to extract waveform information from voltage and current signals [24]. Early datasets targeted at load disaggregation and appliance identification started with under 2 GiB [26], whereas recently published datasets reach nearly 100 TiB of raw data [28]. Working with such quantities requires specialized storage and processing techniques, which can be costly and maintenance-heavy. Optimizing infrastructure costs for storage is part of ongoing research [29, 37].

The data quality requirements typically define a fixed sampling rate and bit-resolution for a static environment. Removing or augmenting measurements might impede further research; therefore, no filtering or preprocessing steps are performed before releasing the data. Compression techniques can be classified as lossy or lossless [8]. Lossy algorithms allow for some margin of error when encoding the data and typically give a metric for the remaining accuracy or lost precision. For comparison, most audio, image, and video compression algorithms remove information not detectable by a human ear or eye. This allows for a data rate reduction in areas of the signal a user cannot detect or perceives at a reduced resolution due to typical human physiology. Depending on the targeted use case, certain aspects of the input signal are considered unimportant and might not be reconstructable. Encoding only the amplitude and frequency of the signal can lead to vast space savings, assuming phase alignment, harmonics, or other signal characteristics are not required for future analysis. On the contrary, lossless encoding schemes guarantee a 1:1 representation of all measurement data with a reversible data transformation. If the intended use case or audience for a given dataset is not known or is very diverse in their requirements, only lossless compression can be applied to keep all data accessible for future use. Recent works pointed out an imbalance in the amount of research on steady-state versus waveform-based compression of electricity signals [10].

Further consideration must be given to communication bandwidth (transmission to a remote endpoint) and in-memory processing (SIMD computation). The efficient use of network channels can be a key requirement for real-time monitoring of streaming data. In the case of one-time transfers (or burst transmissions), chunking is used to split large datasets into more manageable (smaller) files. However, choosing a maximum file size depends on the available memory and CPU (instruction set and cache size). Distributing large datasets as a single file creates an unnecessary burden for researchers and the required infrastructure. A suitable file format must be considered for raw data storage, as well as easy access to metadata, such as calibration factors, timestamps, and identifier tags.

None of the existing datasets (NILM or related datasets with high sampling rates) share a common file format, chunk size, or signal sampling distribution. This heterogeneity makes it difficult to apply algorithms and evaluation pipelines on more than one dataset. Therefore, researchers working with multiple datasets have to implement custom importer and converter stages, which can be time-consuming and error-prone.

This work provides an in-depth analysis of public whole-building datasets, and gives a comprehensive evaluation of best-practice storage techniques and signal conditioning in the context of energy data collection. The key contributions of this work are:

(1) A numerical analysis of signal entropy and measurement calibration of public whole-building energy datasets by evaluating all signal channels with respect to their available resolution and sample distribution over the entire measurement period. The resulting entropy metrics further motivate our contributions and the need for a well-calibrated measurement system.

(2) An exhaustive benchmark of storage requirements and potential space savings with a comprehensive collection of 365 file formats, lossless compression techniques, and reversible data transformations. We re-encode and normalize data from all datasets to evaluate the effect of compression. We present the best-performing combinations and their overall space savings. The full ranking can be used to select the optimal file format and compression for offline storage of large long-term energy datasets.

(3) A full-scale evaluation of increasingly larger data chunks per file and their final compression ratio. The dependency between input size and achievable compression ratio is evaluated up to 3072 MiB per file. The results provide an evidence-based guideline for future selection of chunk sizes and possible environmental factors for consideration.

We give an in-depth evaluation of file formats and signal characteristics that directly affect storage, encoding, and compression of such data. Each of the analyzed datasets was created with a dedicated set of requirements; therefore, a single best option does not exist. However, with this study, we want to help the community to better understand the fundamental causes of compression performance in the field of waveform-based whole-building energy datasets. We provide a definition of measurement calibration and its effects on the storage requirements based on signal entropy. Published datasets are self-contained and final, which allows us to prioritize the compression ratio and achievable space saving over other common compression metrics (CPU load, throughput, or latency). We define the achievable space saving and compression ratio as the only criterion when dealing with large (offline) datasets.

The rest of this paper is structured as follows: We discuss related work in Section 2. We describe the evaluated datasets in Section 3, which are then used in the experiments in Sections 4, 5, and 6. Finally, we present results in Section 7, before concluding in Section 8.

2 RELATED WORK

NILM and related fields distinguish between low and high sampling rates to capture voltage and current measurements. Low sampling rates (or low-frequency) are typically 1 Hz or slower. High sampling rates (or high-frequency) are typically above 500 Hz (or at least satisfy the Nyquist–Shannon sampling theorem [41]). Recording multiple channels with high sampling rates requires oscilloscopes or specialized data acquisition systems as presented in [21, 27, 31].

Low-frequency energy data can benefit greatly from compression when applied to smart meter data, as multiple recent works have shown [13, 39, 44, 45]. Electricity smart meters can be a source of high data volume with measurement intervals of 1 s, 60 s, 15 min, or higher. Possible transmission and storage savings due to lossless compression have been evaluated in [45]. While the achievable compression ratio increased with smaller sampling intervals, the benefits of compression vanish quickly above 15 min intervals. Various encodings (ASCII- and binary-based) have been evaluated for such low-frequency measurements, and in most cases, a binary encoding greatly outperforms an ASCII-based encoding. The need for smart data compression was discussed in [33], which further motivates in-depth research in this area. The main focus of the authors was smart meter data with low temporal resolution from 10,000 meters or more. Various compression techniques were presented and a fast-streaming differential compression algorithm was evaluated: removing steady-state power measurements (t_{i+1} - t_i = 0) can save on average 62% of required storage space.

High-frequency energy data offers a significantly larger potential for lossless compression, due to the inherent repeating waveform signal. Tariq et al. [43] utilized general-purpose compressors, such as LZMA, and achieved good compression ratios on some datasets. Applying differential compression and omitting timestamps can yield size reductions of up to 98% on smart grid data; however, these results are not comparable as there is no generalized uniform data source. The presented results use a single data channel and an ASCII-based data representation as a baseline for their comparison, which contains an inherent encoding overhead.

The SURF file format [36] was designed to store NILM datasets and provide an API to create and modify such files. The internal structure is based on wave-audio and augments it with new types of metadata chunks. To the best of our knowledge, the SURF file format did not gain any traction due to its lack of support in common scientific computing frameworks. The recently published EMD-DF file format [35], by the same authors, relies on the same wave-audio encoding, while extending it with more metadata and annotations. Neither SURF nor EMD-DF provides any built-in support for compression. The power grid community defined the PQDIF [42] (for power quality and quantity monitoring) and COMTRADE [5] (for transient data in power systems) file formats. Both specifications outline a structured view of numerical data in the context of energy measurements. Raw measurements are augmented with precomputed evaluations (statistical metrics), which can cause a significant overhead in required storage space. While PQDIF supports a simple LZ compression, COMTRADE does not offer such capabilities. To the best of our knowledge, these file formats never gained traction outside the power grid operations community.
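The fast-streaming differential idea summarized above — delta-encode the stream and collapse steady-state measurements whose difference to the previous sample is zero — can be sketched in a few lines. This is our own illustrative reconstruction with invented helper names and run-length handling, not the implementation evaluated in [33]:

```python
def delta_encode(values):
    """d_0 = v_0, then only the differences of neighboring samples."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def drop_steady_state(deltas):
    """Collapse runs of zero deltas (steady-state power) into (0, count) pairs."""
    out = []
    for d in deltas:
        if d == 0 and out and out[-1][0] == 0:
            out[-1] = (0, out[-1][1] + 1)   # extend the current zero-run
        else:
            out.append((d, 1))
    return out

def restore(pairs):
    """Reverse both steps: expand the runs, then integrate the deltas."""
    values, acc = [], 0
    for d, count in pairs:
        for _ in range(count):
            acc += d
            values.append(acc)
    return values

power = [230, 230, 230, 231, 231, 231, 231, 500, 500, 230]
packed = drop_steady_state(delta_encode(power))
assert restore(packed) == power   # lossless round-trip
assert len(packed) < len(power)   # steady-state runs are collapsed
```

The savings scale with how long the signal stays constant, which matches the observation that smart meter data with long steady-state periods compresses far better than fluctuating waveform data.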

Lossy compression can achieve multiple magnitudes higher compression ratios than lossless compression, with minimal loss of accuracy for certain use cases [13]. Using piecewise polynomial regression, the authors achieved good compression ratios on three existing smart grid scenarios. The compressed parametrical representation was stored in a relational database system. However, this approach only applies if the use case and expected data transformation are known before applying a lossy data reduction. A 2-dimensional representation for power quality data was proposed in [17] and [38], which then could be used to employ compression approaches from image processing and other related fields. While both approaches can be categorized as lossy compression due to their numerical approximation using polynomial or trigonometric functions, they require a specialized encoder and decoder which is not readily available in scientific computing frameworks.

The NilmDB project [34] provides a generalized user interface to access, query, and analyze large time-series datasets in the context of power quality diagnostics and NILM. A distributed architecture and a custom storage format were employed to work efficiently with "big data". The underlying data persistence is organized hierarchically in the filesystem and utilizes tree-based structures to reduce storage overhead. This internal data representation is capable of handling multiple streams and non-uniform data rates but lacks support for data compression or more efficient coding schemes. NILMTK [6], an open-source NILM toolkit, provides an evaluation workbench for power disaggregation and uses the HDF5 [15] file format with a custom metadata structure. Most available public datasets require a specialized converter to import them into a NILMTK-usable file format. While the documentation states that a zlib data compression is applied, some converters currently use bzip2 or Blosc [1].

3 EVALUATED DATASETS

While there is a vast pool of smart meter datasets^1, i.e., low sampling rates of measurements every 1 s, 15 min, or 1 h, a majority of the underlying information is already lost (signal waveform). The raw signals are aggregated into single root-mean-squared voltage and current readings, frequency spectrums, or other metrics accumulated over the last measurement interval. This can already be classified as a type of lossy compression. For some use cases, this data source is sufficient to work with, while other fields require high sampling rates to extract more information from the signals.

All following experiments and evaluations were performed on publicly accessible datasets: The Reference Energy Disaggregation Data Set (REDD [26]), Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED [3]), UK Domestic Appliance-Level Electricity dataset (UK-DALE [25]), and the Building-Level Office eNvironment Dataset (BLOND [28]). We will refer to these datasets by their established acronyms: REDD, BLUED, UK-DALE, and BLOND. Based on the energy dataset survey provided by the NILM-Wiki^1, these are all datasets of long-term continuous measurements with voltage and current waveforms from selected buildings or households. The data acquisition systems and data types are comparable to warrant their use in this context (Table 1).

^1 http://wiki.nilm.eu/datasets.html

Table 1: Overview of evaluated datasets: long-term continuous measurements containing raw voltage and current waveforms.

Dataset     Current Channels  Voltage Channels  Sampling Rate  Values
REDD        2                 1                 15 kHz         24-bit
BLUED       2                 1                 12 kHz         16-bit
UK-DALE     1                 1                 16 kHz         24-bit
BLOND-50    3                 3                 50 kHz         16-bit
BLOND-250   3                 3                 250 kHz        16-bit

Measurement systems and their analog-to-digital converters (ADC) always output a unit-less integer number, either in [0, 2^bits) for unipolar ADCs or [-2^(bits-1), 2^(bits-1)) for bipolar ADCs. During setup and calibration, a common factor is determined to convert raw values into a voltage or current reading. Some datasets publish raw values and the corresponding calibration factors, while others directly publish Volt- and Ampere-based readings as float values. Datasets only available as floating-point values are converted back into their original integer representation without loss of precision by reversing the calibration step from the analog-to-digital converter for each channel:

measurement_i = ADC_i * calibration_channel
[Volt] = [steps] * [Volt/step]
[Ampere] = [steps] * [Ampere/step]

Each of the mentioned datasets was published in a different (compressed) file format and encoding scheme. To allow for comparisons between these datasets, we decompressed, normalized, and re-encoded all data before analyzing them (raw binary encoding).

From REDD, we used the entire available High Frequency Raw Data: house_3 and house_5, each with 3 channels: current_1, current_2, and voltage. The custom file format encodes a single channel per file. In total, 1.4 GiB of raw data from 126 files were used.

From BLUED, we used all available waveform data (1 location, 16 sub-datasets) and 3 channels: current_a, current_b, voltage. The CSV-like text files contain voltage and two current channels and a dedicated measurement timestamp. In total, 41.1 GiB of raw data from 6430 files were used.

From UK-DALE, we selected house_1 from the most recent release (UK-DALE-2017-16kHz, the longest continuous recording). The compressed FLAC files contain 2 channels: current and voltage. In total, 6259.1 GiB of raw data from 19491 files were used.

From BLOND, we selected the aggregated mains data of both sub-datasets: BLOND-50 and BLOND-250. The HDF5 files with compression contain 6 channels: current{1-3} and voltage{1-3}. In total, 10246.7 GiB of raw data from 61125 files of BLOND-50, and 11899.0 GiB of raw data from 35490 files of BLOND-250 were used.

The data acquisition systems (DAQ) of all datasets produce a linear pulse-code modulated (LPCM) stream. The analog signals are sampled in uniform intervals and converted to digital values (Figure 1). The quantization levels are distributed linearly in a fixed measurement range, which requires a signal conditioning step in the DAQ system. ADCs typically cannot directly measure mains voltage and require a step-down converter or measurement probe. Mains current signals need to be converted into a proportional voltage.

4 ENTROPY ANALYSIS

DAQ units provide a way to collect digital values from analog systems. As such, the quality of the data depends strongly on the correct calibration and selection of measurement equipment. Mains electricity signals are typically not compatible with modern digital systems, requiring an indirect measurement through step-down transformers or other methods. Mains voltage can vary by up to ±10% during normal operation of the grid [2, 14], making it necessary to design the measurement range with a safety margin. The expected signal, plus any margin for spikes, should be equally distributed on the available ADC resolution range. Leaving large areas of the available value range unused can be prevented by carefully selecting input characteristics and signal conditioning (step-down calibration). A rule of thumb for range calibration is that the expected signal should occupy 80-90%, leaving enough bandwidth for unexpected measurements. Input signals larger than the measurement range get recorded as the minimum/maximum value. Grossly exceeding the rated input signal level could damage the ADC, unless dedicated signal conditioning and protection is employed.

Each dataset is split into multiple files, making it necessary to merge all histograms into a total result at the end of the computing run. Since all histograms can be combined with a simple summation, the process can be parallelized and computed without any particular order. Computing and merging all histograms is, therefore, best accomplished in a distributed compute cluster with multiple nodes or similar environments.

5 DATA REPRESENTATION

Choosing a suitable file format for potentially large datasets involves multiple tradeoffs and decisions, including supported platforms, scientific computing frameworks, metadata, error correction, compression, and chunking. The available choices for data representation can range from CSV data (ASCII-parsable) to binary file formats and custom encoding schemes. From the energy dataset survey and the evaluated datasets, it can be noted that every dataset uses a different file format, encoding scheme, and optionally compression.

Publishing and distributing large datasets requires storage systems capable of providing long-term archives of scientific measurement data. Lossless compression helps to minimize storage costs and distribution efforts. At the same time, other researchers accessing the data benefit from smaller files and shorter access times to download the data.
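The gap between ASCII-parsable and binary representations is easy to reproduce on a toy sample. The sketch below is our own illustration using only the standard-library `struct` module, not part of the paper's benchmark pipeline:

```python
import struct

# Five signed 16-bit ADC samples, stored as raw binary and as a CSV text row.
samples = [-32768, -12345, 0, 4242, 32767]

binary = struct.pack("<%dh" % len(samples), *samples)   # little-endian int16
csv_row = (",".join(str(s) for s in samples) + "\n").encode("ascii")

assert len(binary) == 2 * len(samples)   # fixed 2 bytes per sample
assert len(csv_row) == 27                # digits, signs, separators, newline
# The ASCII row is ~2.7x larger before any compression is applied.
```

At tens of kilohertz per channel, this per-sample overhead multiplies directly into the dataset footprint, which is one reason all datasets here are decompressed and re-encoded into a raw binary baseline before comparing compressors.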

Figure 1: Linear pulse-code modulation stream of a sinusoidal waveform sampled with a 16-bit ADC. The waveform corresponds to a 230 V mains voltage signal. (Axes: Time [samples] on the x-axis; Raw ADC Values [steps], -32768 to 32767, and Voltage [V], -300 to 300, on the y-axes.)

Electricity signals (current and voltage) contain a repetitive waveform with some form of distortion depending on the load. In an ideal power grid, the voltage would follow a perfect sinusoidal waveform without any offset or error. This would allow us to accurately predict the next voltage measurement. However, constant fluctuations in the supply and demand cause the signals to deviate. The fact that each signal is primarily continuous (without sudden jumps) can be beneficial to compression algorithms.

A delta encoding scheme only stores the numerical difference of neighboring elements in a time-series measurement vector. This can be useful for slow-changing signals because the difference of a signal might require fewer bytes to encode than the absolute value:

∀ i ∈ {1 ... n} : d_i = v_i - v_{i-1}
d_0 = v_0

We extracted the probability mass function (PMF) of all evaluated datasets for the full bit-range (16- or 24-bit). The value histogram is a structure mapping each possible measurement value (integer) to the number of times this value was recorded. Ideally, the region between the lowest and highest value contains a continuous value range without gaps. However, the quantization level (step size) could cause a mismatch and result in skipped values. We then normalize this histogram to obtain the PMF and compute the signal entropy per channel, which gives an estimation of the actual information contained in the raw data and provides a lower bound for the achievable compression ratio based on the Kolmogorov complexity.

X = {-2^(bits-1), ..., 0, ..., 2^(bits-1) - 1}
hist = histogram(dataset, X)
f_X = hist / Σ_{x ∈ X} hist[x]
∀ x ∈ X where f_X(x) = 0 : f_X(x) = 1
H(x) = - Σ_{x ∈ X} f_X(x) · log2(f_X(x))

We compare the original data representation (format, compression, encoding) of each dataset, reformat them into various file formats, and evaluate their storage saving based on a comprehensive list of lossless compression algorithms. This involves encoding raw data in a more suitable representation to compare their compressed size: CS = compressed_size / original_size * 100%, and the resulting space saving: SS = 100% - CS. We define the main goal as reducing the overall required storage space for each dataset, and deliberately do not consider compression or decompression speed. The performance characteristics (throughput and speed) are well known for individual compression techniques [4] and are of minor importance in the case of large static datasets which require only a single compression step before distribution. Performance metrics are important when dealing with repeated compression of raw data, which is not the case for static energy datasets. Repeated decompression is however relevant because researchers might want to read and parse the files over and over again while analyzing them (if in-memory processing is not feasible). As noted in [4], decompression speed

and throughput is typically not a performance bottleneck in data analytics tasks.

Building a novel data compression scheme for energy data is counter-productive, since most scientific computing frameworks lack support and the idea suffers from the "not invented here" and "yet another standard" problems, both common anti-patterns in the field of engineering when developing new solutions despite existing suitable approaches [13, 26, 36]. Therefore, a key requirement is that each file format must be supported in common scientific computing systems to read (and possibly write) data files.

We selected four format types: raw binary, HDF5 (data model and file format for storing and managing data), Zarr (chunked, compressed, N-dimensional arrays), and audio-based PCM containers.

Raw binary formats provide a baseline for comparison. All samples are encoded as integer values (16-bit or 24-bit) and are compressed with a general-purpose compressor: zlib/gzip, LZMA, bzip2, and zstd, all with various parameter values. The input for each compressor is either raw-integer or variable-length encoded data (LEB128S [19]), which is serialized either row- or column-based from all channels (interweaving). The LEB128S encoding is additionally evaluated with delta encoding of the input.

The Hierarchical Data Format 5 (HDF5) [15] provides structured metadata and data storage, data transformations, and libraries for most scientific computing frameworks. All data is organized in natively-typed arrays (multi-dimensional matrices) with various filters for data compression, checksumming, and other reversible transformations before storing the data to a file. The API transparently reverses these transformations and compression filters while reading data. HDF5 is popular in the scientific community and used for various big-data-type applications [7, 12, 18, 40]. The public registry for HDF5 filters^1 currently lists 21 data transformations, most of them compression-related. Each HDF5 file is evaluated with and without the shuffle filter, zlib/gzip, lzf, MAFISC [22] with LZMA, szip [20], Bitshuffle [30] with LZ4, zstd, and the full Blosc [1] compression suite, again all with various parameter values.

Zarr [32] organizes all data in a filesystem-like structure, which can be archived as a single archive file or as a tree structure in the filesystem. Each channel is stored as a separate array (data stream) with optional chunk-based compression via zlib/gzip, LZMA, bzip2, or Blosc (with shuffle, Bitshuffle, or no-shuffle filter), again all with various parameter values. Each Zarr file is additionally evaluated with a delta filter to reduce the value range.

Audio-based formats use LPCM-type data encoding (PCM16 or PCM24) with a fixed precision and sampling rate. All channels are encoded into a single container using lossless compression formats: FLAC [16], ALAC [23], and WavPack [9]. These formats do not provide tune-able parameters.

Calibration factors, timestamps, and labels can augment the raw data in a single file while providing a unified API for accessing data and metadata. Raw binary formats lack this type of integrated support and require additional tooling and encoding schemes for metadata. Audio-based formats require a container format to store metadata, typically designed for the needs of the music and entertainment industry. Out of these formats, only HDF5 and Zarr provide support for encoding and storing arbitrary metadata objects (complex types or matrices) together with measurement data.

Most audio-based formats support at most 8 signal channels, while general-purpose formats such as HDF5 and Zarr have no restrictions on the total number of channels per file. The sampling rate can also be a limiting factor: FLAC supports at most 655.35 kHz and ALAC only 384 kHz. ADC resolution (bit depth) is mostly bound by existing technological limitations and will not exceed 32-bit in the foreseeable future. While these constraints are within the requirements for all datasets under evaluation, they need to be considered for future dataset collection and the design of measurement systems.

In total, we encoded the evaluated datasets with 365 different data representation formats: 54 raw, 264 HDF5-based, 44 Zarr-based, and 3 audio-based, and gathered their per-file compression size as a benchmark. The complete list, including all parameters and compression options, is available in the online appendix^2. The full analysis was performed in a distributed computing environment and consumed approx. 1,176,000 CPU-core-hours (dual Intel Xeon E5-2630v3 machines with 128 GiB RAM and 10 Gbit Ethernet interfaces).

6 CHUNK SIZE IMPACT

Each dataset is provided in equally-sized files, typically based on measurement duration. Working with a single large file can be cumbersome due to main memory restrictions or available local storage space. Assuming a typical desktop computer with 8 GiB of main memory is used for processing, a single file from a dataset must be fully loaded into memory before any computation can be done. Depending on the analysis and algorithms, multiple copies might be required for intermediary results and temporary copies. This means the main memory size is an upper bound for the maximum feasible chunk size.

Some file formats and data types support internal chunking or streamed data access, in which data can be read into memory sequentially or with random access. In such environments other factors will limit the usable chunk size, such as file system capabilities, network-attached storage, or other limitations.

The evaluated datasets are distributed with the following chunk sizes of raw data: REDD: 11.4 MiB or 4 min, BLUED: 6.6 MiB or 1.65 min, UK-DALE: 329.2 MiB or 60 min, BLOND-50: 171.7 MiB or 5 min, BLOND-250: 343.3 MiB or 2 min. Measurement duration and file size are not strictly linked, causing a slight variation in file sizes across the entire measurement period of each dataset. Observed real-world time does not affect any of the compression algorithms under test and is therefore omitted. The sampling rate and channel count directly affect the data rate (bytes per time unit) and explain the non-uniform chunk sizes mentioned for each dataset.

We compare the best-performing data representation formats of each dataset from the previous experiment, benchmark them with different chunk sizes, and estimate their effect on the overall compression ratio. For this evaluation, we define the compression ratio as CR = original_size / compressed_size. The chunk sizes range from 1, 2, 4, 8, 16, 32, 64, 128 MiB, and then continue in steps of

^1 https://support.hdfgroup.org/services/filters.html
^2 The online appendix is available through the program chair (double-blind review).

128 MiB up to 3072 MiB. To reduce the required computational effort, we greedily consume data from the first available dataset file, until the predefined chunk limit is fulfilled. The chunk size is determined using the number of samples (across all channels) and their integer byte count (2 or 3 bytes); only full samples for all channels are included in a chunk.

Table 2: Entropy analysis of whole-building energy datasets with high sampling rates. The amount of unique measurement values for each channel is extracted, which corresponds to a usage percentage over the available measurement resolution. The lowest and highest observed value is used to determine the observed range.

Dataset     Channel      Values    Usage  Range  H(x)
REDD        current_1     87713      1%     4%   14.3
(24-bit)    current_2     85989      1%     5%   14.9
            voltage     2925155     17%    18%   21.1
BLUED       current_a      5855      9%    10%    7.8
(16-bit)    current_b      7684     12%    13%    9.7
            voltage       11302     17%    18%   13.2
UK-DALE     current     6981612     42%    81%   19.0
(24-bit)    voltage    15135594     90%   100%   23.2
BLOND-50    current1      51122     78%   100%   12.6
(16-bit)    current2      49355     75%   100%   11.2
            current3      48658     74%   100%   11.3
            voltage1      58396     89%    92%   15.3
            voltage2      57975     88%    91%   15.4
            voltage3      59596     91%    95%   15.4
BLOND-250   current1      52721     80%   100%   12.4
(16-bit)    current2      51802     79%   100%   10.8
            current3      50989     78%   100%   11.6
            voltage1      58488     89%    91%   15.3
            voltage2      57912     88%    92%   15.4
            voltage3      59742     91%    94%   15.4

7 RESULTS

7.1 Entropy Analysis

Entropy is based on the probability for a given measurement (signal value). The histogram of an entire measurement channel shows the number of times a single measurement value was seen in the dataset (Figure 3). The plots show the raw measurement bandwidth in ADC values on the x-axis and a logarithmic y-axis for the number of occurrences of each value. The raw ADC values are bipolar and centered on 0: −32768 ... 32767 for BLUED, BLOND-50, and BLOND-250; −8388608 ... 8388607 for REDD and UK-DALE.

The voltage histogram shows a distinctive sinusoidal distribution (peaks at minimum and maximum values). The current histogram would show a similar distribution if the power draw were constant (pure-linear or resistive loads); however, multiple levels of current values can be observed, indicating high activity and fluctuations. REDD and BLUED (Figures 3a and 3b) show a center-biased distribution, indicating sub-optimal calibration performance and unused measurement bandwidth. UK-DALE, BLOND-50, and BLOND-250 (Figures 3c, 3d, 3e) show a wide range of highly used values, with the voltage channels utilizing around 90% of the available bandwidth.

REDD and BLUED use only a small percentage of the available range, indicating a low entropy based on the used data type. UK-DALE utilizes a reasonable slice, while BLOND covers almost the entire possible range (Table 2). Assuming a well-calibrated data acquisition system, the expected percentage should reflect the expected measurement values. Low range usage (REDD, BLUED) leads to lost precision which would have been freely available with the given hardware, whereas high usage (UK-DALE, BLOND) means almost all available measurement precision is reflected in the raw data. Some datasets utilize 100% of the available measurement range, while REDD only uses 5%. A high range utilization does not result in an equally high usage, as the histogram can contain gaps (ADC values with 0 occurrences in the datasets).

7.2 Data Representation

The evaluation compares the compressed size (CS, the final file size after compression and file format encapsulation, in percent of the uncompressed size) of 365 data representation formats. For brevity, only the 30 best-performing formats are shown in Figure 2. Each of the 365 data representations was tested on all datasets, and the full evaluation is available in the online appendix3. The following evaluation and benchmark uses the raw data from each dataset as described in Section 3. In total, 27.8 TiB of raw data was re-encoded 365 times.

HDF5 and Zarr are general-purpose file formats for numerical data with broad support in scientific computing frameworks. As such, they only support 16-bit and 32-bit integer values, which causes a 1-byte overhead for REDD and UK-DALE. The baseline used for comparison is a raw concatenated byte string with dataset-native data types (16-bit and 24-bit). This allows us to obtain comparable evaluation results, while other published benchmarks compared ASCII-like encodings against binary representations, skewing the results significantly.

Overall, it can be noted that all three audio-based formats performed well, given their inherently targeted nature of compressing waveforms with high temporal resolution. ALAC and FLAC achieved the best overall CS across all datasets, followed by HDF5+MAFISC and HDF5+zstd, which can overcome the 1-byte overhead. Although the general-purpose compressors and their individual data representation formats were intended to serve as a baseline for comparison with the more advanced schemes (HDF5, Zarr, and audio-based), one can conclude that even plain bzip2 or LZMA compression can achieve comparable compression results. A trade-off to consider is the lack of metadata and internal structure, which might cause additional data handling overhead, as easy-to-use import and parsing tools are not available. Variable-length encoding using LEB128S is a suitable input for the bzip2 and LZMA compressors when combined with a column-based storage format. Delta encoding resulted in comparably good CS in certain combinations.

Some datasets are inherently more compressible than others. This is a result of the entropy analysis and can be observed in Figure 2.

3 The online appendix is available through the program chair (double-blind review).
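The per-channel figures reported in Table 2 (unique values, usage, observed range, and H(x)) follow directly from a value histogram. The sketch below is our own minimal reconstruction of that computation, not the authors' tooling; `channel_stats` and the toy input are illustrative names:

```python
from collections import Counter
from math import log2

def channel_stats(adc_values, bits):
    """Unique-value count, resolution usage, observed range, and Shannon
    entropy H(x) of one channel, mirroring the columns of Table 2."""
    counts = Counter(adc_values)
    n = len(adc_values)
    # H(x) = -sum(p(v) * log2(p(v))) over all observed values v
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    usage = len(counts) / 2**bits                      # unique values vs. available codes
    rng = (max(counts) - min(counts)) / (2**bits - 1)  # lowest..highest observed value
    return len(counts), usage, rng, h

# Toy 16-bit channel: a near-constant load touches few ADC codes,
# so both usage and entropy stay low.
values, usage, rng, h = channel_stats([0, 1, 0, -1] * 1000, bits=16)
```

A channel that only ever visits three ADC codes, as above, has H(x) of 1.5 bits regardless of the 16-bit container, which is exactly the gap a compressor exploits.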

Figure 2: Compression performance for the top-30 data representation formats and their transformation filters. Each data representation format was applied on a per-file basis to every dataset. [Bar chart; y-axis: Compressed Size [%], 0 to 100, per dataset (REDD, BLUED, UK-DALE, BLOND-50, BLOND-250). The top-30 formats include ALAC and FLAC; bzip2, LZMA-alone, LZMA, and LZMA-xz over raw, LEB128S, and LEB128S-delta row/column layouts, some with DELTA distance-1 to distance-4; and HDF5 with MAFISC-LZMA, zstd, and gzip filters (shuffle/no_shuffle; individual-dsets, all-column-chunk, all-row-chunk).]

While the majority of tested data representation formats achieve a data reduction compared to the baseline, some formats are counter-productive and generate a larger output (CS over 100%); this behavior affects most HDF5- and Zarr-based formats. The benchmark shows that higher entropy strongly correlates with higher CS per dataset. Compressing BLUED consistently yields smaller file sizes with most compressors than any other dataset. It can be noted that REDD, UK-DALE, and both BLOND datasets perform at around 50-60% CS, while BLUED shows a significantly smaller CS below 30%, due to its very low signal entropy (Table 2).

Choosing the best-performing data representation for each dataset, the following space savings (SS) can be achieved when compared against the raw binary encoding applied to all data files: REDD: 73.0%, BLUED: 48.3%, UK-DALE: 40.5%, BLOND-50: 51.3%, BLOND-250: 55.4%. Two out of the five evaluated datasets (REDD and BLUED) showed the highest space savings with a general-purpose compressor (bzip2) and variable-length encoding. Variable-length encoding (LEB128S) and delta encoding yield the largest space saving for such types of data, with at most a 1-byte overhead (depending on the used compressor). ALAC and HDF5+MAFISC performed best on UK-DALE, BLOND-50, and BLOND-250, given their higher signal entropy and value range utilization.

When comparing the raw space savings against the actually published dataset, which typically is already compressed, we can achieve the following additional space savings: REDD: 96.4%, BLUED: 61.2%, UK-DALE: -1.3%, BLOND-50: 23.3%, BLOND-250: 26.0%. All datasets show space savings, except for UK-DALE, which shows an insignificant increase in the overall dataset size. This means the originally published FLAC files are already compressed to a high extent; this is supported by Figure 2, showing FLAC among the highest-ranking formats in this study. While an absolute space saving for REDD might be insignificant in most use cases (desktop computing and data center), a more compelling reduction in storage space for BLOND-250 can be substantially beneficial.

7.3 Chunk Size Impact

The chunk size evaluation (Figure 4) contains the averaged CR per chunk size for all datasets except REDD, which only contains a small amount of data and was therefore omitted; a detailed per-dataset evaluation is available in the online appendix3.

The evaluated chunk size range starts with very small chunks, which would not be recommended for large datasets because of the increased handling and container overhead.

(a) REDD (24-bit)  (b) BLUED (16-bit)  (c) UK-DALE (24-bit)  (d) BLOND-50 (16-bit)  (e) BLOND-250 (16-bit)

Figure 3: Semi-logarithmic histogram of ADC values for each dataset and channel. Current signals show distinct steps, corresponding to prolonged usage at certain power levels. For visualization reasons, the scatter plot was smoothed and the full histogram is available in the online appendix3.
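Several of the top-performing representations pair signed LEB128 (LEB128S) variable-length encoding with distance-k delta encoding before handing the bytes to bzip2 or LZMA. A minimal sketch of those two transformations, with our own helper names and a synthetic ramp signal standing in for waveform data:

```python
import lzma

def leb128s_encode(values):
    """Signed LEB128: 7 payload bits per byte, high bit = continuation.
    Small magnitudes (typical after delta encoding) shrink to one byte."""
    out = bytearray()
    for v in values:
        while True:
            byte = v & 0x7F
            v >>= 7  # arithmetic shift preserves the sign
            if (v == 0 and not byte & 0x40) or (v == -1 and byte & 0x40):
                out.append(byte)  # sign bit agrees: last byte
                break
            out.append(byte | 0x80)
    return bytes(out)

def delta(values, distance=1):
    """DELTA distance-k: difference each sample against the one k steps back."""
    return [v - (values[i - distance] if i >= distance else 0)
            for i, v in enumerate(values)]

# Delta + LEB128S typically shrinks smooth waveforms before LZMA sees them.
samples = [int(1000 * (i % 100) / 100) for i in range(5000)]
packed = lzma.compress(leb128s_encode(delta(samples)))
```

Both steps are lossless and cheap to invert, which matches the paper's preference for transparent transformations over preprocessing that would alter the released measurements.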

As such, chunk sizes starting with 128 MiB can be considered a viable storage strategy. The resulting CR ramps up quickly for most formats until it levels off between 32 MiB and 64 MiB. Above this mark, no significant improvement in CR can be achieved by increasing the chunk size. Some file formats even show a slight linear decrease in CR with very large chunk sizes (above approx. 1.5 GiB). The ALAC and FLAC compressors show a slight improvement (2-3%) in CR with larger chunk sizes. In most use cases this size reduction comes at a great cost in RAM requirements to process files above 2048 MiB. HDF5 has its own concept of "chunks", used for I/O and the filter pipeline, with a default size of 1 MiB. Internal limitations do not allow for HDF5 chunks larger than 2048 MiB; however, HDF5 in general can be used for files larger than this limit. The MAFISC filter with LZMA compression experiences large fluctuations for neighboring chunk size steps and should, therefore, be tuned separately. Overall, increasing the chunk size has a negligible effect on the final compression ratio and only pushes up the RAM requirements for processing.

Choosing a file format for long-term whole-building energy datasets is a crucial component, directly affecting the visibility and accessibility of the data by other researchers. Using an unsupported encoding or requiring specialized tools to read the data is cumbersome and error-prone and should be avoided. We recommend using well-known file formats, such as HDF5 or FLAC, which are widely adopted and provide built-in support for metadata, compression, and error-detection. While ALAC and FLAC already provide internal compression, we recommend the MAFISC or zstd filters for HDF5, due to their superior compression ratio. The serialization orientation (row- or column-based) has only a minor effect.

Large datasets should be split into multiple smaller files to facilitate data handling and to reduce transfer and loading times for short amounts of data. We have found that compression algorithms (together with the above-described file formats) yield higher space savings with chunk sizes above 256 MiB to 384 MiB. Small files show a modest compression ratio, while larger files require more transfer bandwidth and time before the data can be analyzed.

8 CONCLUSIONS

We presented a comprehensive entropy analysis of public whole-building energy datasets with waveform signals. Some datasets leave a majority of the available ADC range unused, causing lost precision and accuracy. A well-calibrated measurement system maximizes the achievable precision. Using 365 different data representation formats, we have shown that immense space savings of up to 73% are achievable by choosing a suitable file format and data transformation. Low-entropy datasets show higher achievable compression ratios. Audio-based file formats perform considerably well, given the similarities to electricity waveforms. Transparent data transformations are particularly beneficial, such as MAFISC and SHUFFLE-based approaches. The input size shows a mostly stable dependency to the achievable compressed size, with variations of a few percentage points (limited by RAM). Waveform data shows a nearly constant compression ratio, independent of the input chunk size. Splitting large datasets into multiple smaller files is important for data handling, but insignificant in terms of space savings.

[Line plot; y-axis: Averaged Compression Ratio Over All Datasets [X:1], approx. 1.6 to 2.2; x-axis: Original Chunk Size [MB], 1 to 3072; series: LZMA-alone LEB128S-delta-row, ALAC, LZMA-alone raw-row, FLAC, BZ2-9 LEB128S-delta-row, BZ2-9 raw-row, HDF5 no_shuffle MAFISC-LZMA all-column-chunk.]

Figure 4: Chunk size impact of different representations.
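The chunk-size sweep behind Figure 4 can be approximated as follows. `averaged_cr` is our own sketch using bzip2 from the standard library as a stand-in for the benchmarked compressors, and the small synthetic buffer replaces real waveform data:

```python
import bz2

def averaged_cr(raw: bytes, chunk_sizes_mib):
    """Compress independent chunks of each candidate size and report
    CR = original_size / compressed_size, as defined in this evaluation."""
    results = {}
    for mib in chunk_sizes_mib:
        n = mib * 2**20
        chunks = [raw[i:i + n] for i in range(0, len(raw), n)]
        compressed = sum(len(bz2.compress(c)) for c in chunks)
        results[mib] = len(raw) / compressed
    return results

# Toy data: repetitive byte patterns stand in for low-entropy waveform samples.
raw = bytes(range(256)) * (2**20 // 256) * 8   # 8 MiB
crs = averaged_cr(raw, [1, 2, 4, 8])
```

Because each chunk is compressed independently, larger chunks trade a (usually small) CR gain against the RAM needed to hold a chunk in memory, which is the effect the figure quantifies.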

7.4 Summary and Recommendations

The entropy analysis shows a lack of measurement range calibration in some datasets. This results in unutilized precision that would have been available with the given hardware DAQ units. The used range directly affects the contained entropy, and therefore the achievable compression ratio. A well-calibrated measurement system is a key requirement to achieve the best signal range and resolution.

9 REFERENCES

[1] Francesc Alted. 2017. Blosc: A high performance compressor optimized for binary data. (November 2017). Retrieved January 20, 2018 from http://blosc.org/
[2] American National Standards Institute. 2016. ANSI C84.1-2016: Standard for Electric Power Systems and Equipment—Voltage Ratings (60 Hz). (2016).
[3] Kyle Anderson, Adrian Ocneanu, Diego Benitez, Derrick Carlson, Anthony Rowe, and Mario Berges. 2012. BLUED: A Fully Labeled Public Dataset for Event-Based Non-Intrusive Load Monitoring Research. In SustKDD '12. ACM, Beijing, China, 1–5.
[4] R. Arnold and T. Bell. 1997. A corpus for the evaluation of lossless compression algorithms. In Data Compression Conference, 1997. DCC '97. Proceedings. 201–210. https://doi.org/10.1109/DCC.1997.582019
[5] IEEE Standards Association. 2018. COMTRADE: Common format for Transient Data Exchange for power systems. (January 2018). Retrieved January 20, 2018 from https://standards.ieee.org/findstds/standard/C37.111-2013.html
[6] Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, and Mani Srivastava. 2014. NILMTK: An Open Source Toolkit for Non-intrusive Load Monitoring. In ACM e-Energy '14. ACM, New York, NY, USA, 265–276. https://doi.org/10.1145/2602044.2602051
[7] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. 2014. Parallel Data Analysis Directly on Scientific File Formats. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 385–396. https://doi.org/10.1145/2588555.2612185
[8] Abraham Bookstein and James A. Storer. 1992. Data compression. Information Processing & Management 28, 6 (1992), 675–680. https://doi.org/10.1016/0306-4573(92)90060-D Special Issue: Data compression for images and texts.
[9] David Bryant. 2018. WavPack: Hybrid Lossless Audio Compression. (January 2018). Retrieved January 20, 2018 from http://www.wavpack.com/
[10] J. C. S. de Souza, T. M. L. Assis, and B. C. Pal. 2017. Data Compression in Smart Distribution Systems via Singular Value Decomposition. IEEE Transactions on Smart Grid 8, 1 (Jan 2017), 275–284. https://doi.org/10.1109/TSG.2015.2456979
[11] E. Deelman and A. Chervenak. 2008. Data Management Challenges of Data-Intensive Scientific Workflows. In 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID). 687–692. https://doi.org/10.1109/CCGRID.2008.24
[12] Matthew T. Dougherty, Michael J. Folk, Erez Zadok, Herbert J. Bernstein, Frances C. Bernstein, Kevin W. Eliceiri, Werner Benger, and Christoph Best. 2009. Unifying Biological Image Formats with HDF5. Commun. ACM 52, 10 (Oct. 2009), 42–47. https://doi.org/10.1145/1562764.1562781
[13] Frank Eichinger, Pavel Efros, Stamatis Karnouskos, and Klemens Böhm. 2015. A Time-series Compression Technique and Its Application to the Smart Grid. The VLDB Journal 24, 2 (April 2015), 193–218. https://doi.org/10.1007/s00778-014-0368-8
[14] European Committee for Electrotechnical Standardization. 1989. CENELEC Harmonisation Document HD 472 S1. (1989).
[15] Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. 2011. An Overview of the HDF5 Technology Suite and Its Applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases (AD '11). ACM, New York, NY, USA, 36–47. https://doi.org/10.1145/1966895.1966900
[16] Xiph.Org Foundation. 2018. FLAC: Free Lossless Audio Codec. (January 2018). Retrieved January 20, 2018 from https://xiph.org/flac/
[17] O. N. Gerek and D. G. Ece. 2004. 2-D analysis and compression of power-quality event data. IEEE Transactions on Power Delivery 19, 2 (April 2004), 791–798. https://doi.org/10.1109/TPWRD.2003.823197
[18] L. Gosink, J. Shalf, K. Stockinger, Kesheng Wu, and W. Bethel. 2006. HDF5-FastQuery: Accelerating Complex Queries on HDF Datasets using Fast Bitmap Indices. In 18th International Conference on Scientific and Statistical Database Management (SSDBM'06). 149–158. https://doi.org/10.1109/SSDBM.2006.27
[19] Free Standards Group. 2018. DWARF Debugging Information Format Specification Version 3.0. (January 2018). Retrieved January 20, 2018 from http://dwarfstd.org/doc/Dwarf3.pdf
[20] HDF Group. 2017. Szip Compression in HDF Products. (November 2017). Retrieved January 20, 2018 from https://support.hdfgroup.org/doc_resource/SZIP/
[21] Anwar Ul Haq, Thomas Kriechbaumer, Matthias Kahl, and Hans-Arno Jacobsen. 2017. CLEAR – A Circuit Level Electric Appliance Radar for the Electric Cabinet. In 2017 IEEE International Conference on Industrial Technology (ICIT '17). 1130–1135. https://doi.org/10.1109/ICIT.2017.7915521
[22] Nathanael Hübbe and Julian Kunkel. 2013. Reducing the HPC-datastorage footprint with MAFISC—Multidimensional Adaptive Filtering Improved Scientific data Compression. Computer Science - Research and Development 28, 2 (01 May 2013), 231–239. https://doi.org/10.1007/s00450-012-0222-4
[23] Apple Inc. 2018. ALAC: Apple Lossless Audio Codec. (January 2018). Retrieved January 20, 2018 from https://macosforge.github.io/alac/
[24] Matthias Kahl, Anwar Ul Haq, Thomas Kriechbaumer, and Hans-Arno Jacobsen. 2017. A Comprehensive Feature Study for Appliance Recognition on High Frequency Energy Data. In Proceedings of the 2017 ACM Eighth International Conference on Future Energy Systems (e-Energy '17). ACM, New York, NY, USA. https://doi.org/10.1145/3077839.3077845
[25] Jack Kelly and William Knottenbelt. 2015. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data 2, 150007 (2015). https://doi.org/10.1038/sdata.2015.7
[26] J. Zico Kolter and Matthew J. Johnson. 2011. REDD: A Public Data Set for Energy Disaggregation Research. In SustKDD '11 (2011), Vol. 25. 59–62.
[27] Thomas Kriechbaumer, Anwar Ul Haq, Matthias Kahl, and Hans-Arno Jacobsen. 2017. MEDAL: A Cost-Effective High-Frequency Energy Data Acquisition System for Electrical Appliances. In Proceedings of the 2017 ACM Eighth International Conference on Future Energy Systems (e-Energy '17). ACM, New York, NY, USA. https://doi.org/10.1145/3077839.3077844
[28] Thomas Kriechbaumer and Hans-Arno Jacobsen. 2018. BLOND, a building-level office environment dataset of typical electrical appliances. (March 2018). https://doi.org/10.1038/sdata.2018.48
[29] Guoxin Liu and Haiying Shen. 2017. Minimum-Cost Cloud Storage Service Across Multiple Cloud Providers. IEEE/ACM Trans. Netw. 25, 4 (Aug. 2017), 2498–2513. https://doi.org/10.1109/TNET.2017.2693222
[30] K. Masui, M. Amiri, L. Connor, M. Deng, M. Fandino, C. Höfer, M. Halpern, D. Hanna, A.D. Hincks, G. Hinshaw, J.M. Parra, L.B. Newburgh, J.R. Shaw, and K. Vanderlinde. 2015. A compression scheme for radio data in high performance computing. Astronomy and Computing 12, Supplement C (2015), 181–190. https://doi.org/10.1016/j.ascom.2015.07.002
[31] M. N. Meziane, T. Picon, P. Ravier, G. Lamarque, J. C. Le Bunetel, and Y. Raingeaud. 2016. A Measurement System for Creating Datasets of On/Off-Controlled Electrical Loads. In 2016 IEEE 16th International Conference on Environment and Electrical Engineering (EEEIC). 1–5. https://doi.org/10.1109/EEEIC.2016.7555847
[32] Alistair Miles. 2018. Zarr: A Python package providing an implementation of chunked, compressed, N-dimensional arrays. (January 2018). Retrieved January 20, 2018 from https://zarr.readthedocs.io/en/latest/
[33] Muhammad Nabeel, Fahad Javed, and Naveed Arshad. 2013. Towards Smart Data Compression for Future Energy Management System. In Fifth International Conference on Applied Energy.
[34] J. Paris, J. S. Donnal, and S. B. Leeb. 2014. NilmDB: The Non-Intrusive Load Monitor Database. IEEE Transactions on Smart Grid 5, 5 (Sept 2014), 2459–2467. https://doi.org/10.1109/TSG.2014.2321582
[35] Lucas Pereira. 2017. EMD-DF: A Data Model and File Format for Energy Disaggregation Datasets. In Proceedings of the 4th ACM International Conference on Systems for Energy-Efficient Built Environments (BuildSys '17). ACM, New York, NY, USA, Article 52, 2 pages. https://doi.org/10.1145/3137133.3141474
[36] Lucas Pereira, Nuno Nunes, and Mario Bergés. 2014. SURF and SURF-PI: A File Format and API for Non-intrusive Load Monitoring Public Datasets. In Proceedings of the 5th International Conference on Future Energy Systems (e-Energy '14). ACM, New York, NY, USA, 225–226. https://doi.org/10.1145/2602044.2602078
[37] Krishna P.N. Puttaswamy, Thyaga Nandagopal, and Murali Kodialam. 2012. Frugal Storage for Cloud File Systems. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12). ACM, New York, NY, USA, 71–84. https://doi.org/10.1145/2168836.2168845
[38] A. Qing, Z. Hongtao, H. Zhikun, and C. Zhiwen. 2011. A Compression Approach of Power Quality Monitoring Data Based on Two-dimension DCT. In 2011 Third International Conference on Measuring Technology and Mechatronics Automation, Vol. 1. 20–24. https://doi.org/10.1109/ICMTMA.2011.12
[39] Martin Ringwelski, Christian Renner, Andreas Reinhardt, Andreas Weigel, and Volker Turau. 2012. The Hitchhiker's Guide to choosing the Compression Algorithm for your Smart Meter Data. (September 2012), 935–940. https://doi.org/10.1109/EnergyCon.2012.6348285
[40] S. Sehrish, J. Kowalkowski, M. Paterno, and C. Green. 2017. Python and HPC for High Energy Physics Data Analyses. In Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing (PyHPC'17). ACM, New York, NY, USA, Article 8, 8 pages. https://doi.org/10.1145/3149869.3149877
[41] C. E. Shannon. 1949. Communication in the Presence of Noise. Proceedings of the IRE 37, 1 (Jan 1949), 10–21. https://doi.org/10.1109/JRPROC.1949.232969
[42] IEEE Power & Energy Society. 2018. IEEE 1159 - PQDIF: Power Quality and Quantity Data Interchange Format. (January 2018). Retrieved January 20, 2018 from http://grouper.ieee.org/groups/1159/3/docs.html
[43] Z. B. Tariq, N. Arshad, and M. Nabeel. 2015. Enhanced LZMA and BZIP2 for improved energy data compression. In 2015 International Conference on Smart Cities and Green ICT Systems (SMARTGREENS). 1–8.
[44] Andreas Unterweger and Dominik Engel. 2015. Resumable load data compression in smart grids. IEEE Transactions on Smart Grid 6, 2 (2015), 919–929. https://doi.org/10.1109/TSG.2014.2364686
[45] Andreas Unterweger, Dominik Engel, and Martin Ringwelski. 2015. The Effect of Data Granularity on Load Data Compression. Springer International Publishing, Cham, 69–80. https://doi.org/10.1007/978-3-319-25876-8_7
[46] D. Yuan, Y. Yang, X. Liu, and J. Chen. 2010. A cost-effective strategy for intermediate data storage in scientific cloud workflow systems. In 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS). 1–12. https://doi.org/10.1109/IPDPS.2010.5470453
