Comparison and Model of Compression Techniques for Smart Cloud Log File Handling
Copyright IEEE. The final publication is available at IEEExplore via https://doi.org/10.1109/CCCI49893.2020.9256609. 978-1-7281-7315-3/20/$31.00 © 2020 IEEE

Josef Spillner
Zurich University of Applied Sciences
Winterthur, Switzerland
[email protected]

Abstract—Compression as a data coding technique has seen approximately 70 years of research and practical innovation. Nowadays, powerful compression tools with good trade-offs exist for a range of file formats from plain text to rich multimedia. Yet in the dilemma of cloud providers to reduce log data sizes as much as possible while having to keep as much as possible around for regulatory reasons and compliance processes, many companies are looking for smarter solutions beyond brute compression. In this paper, a comprehensive applied research setting around network and system logs is introduced by comparing text compression ratios and performance. The benchmark encompasses 13 tools and 30 tool-configuration-search combinations. The tool and algorithm relationships as well as benchmark results are modelled in a graph. After discussing the results, the paper reasons about limitations of individual approaches and suitable combinations of compression with smart adaptive log file handling. The adaptivity is based on the exploitation of knowledge on format-specific compression characteristics expressed in the graph, for which a proof-of-concept advisor service is provided.

Index Terms—log file management, compression algorithms, text compression, benchmark, adaptivity, smart systems

I. INTRODUCTION

Cloud computing has become a mature backbone for millions of delivered applications and services. Besides global-scale/hyper-scale infrastructure providers with dozens of data centres, many smaller managed network and platform providers are successfully covering market needs for specialised services [1]. One key issue for these providers is the handling of dynamically generated data from their services and hosted applications. Increasingly automated operations demand more insights into the provisioning and delivery situations, and therefore access to larger amounts of historic data [2]. Additionally, regulations may demand the storage of such data for longer periods of time, and occasional search for suspicious occurrences of terms. One of the most important information sources are log files, and therefore complex log management systems are set up to collect, transform and unify log messages. At the end of such pipelines, logs are compressed and stored for as long as necessary, while still being available for occasional information retrieval [3].

Consequently, providers aim at finding compression tools which squeeze the logs into the smallest possible files, while tolerating slow compression, as long as content search, in most cases preceded by decompression, is fast. The additional cost of log management, along with monitoring and other operations, should be kept to a minimum to allow for tight pricing of offered cloud services. Increasing diversity and progress in generic compression tools and log-specific algorithms [4], [5], [6] leaves many operators without a systematic framework to choose suitable and economic log compression tools and respective configurations. This prevents a systematic solution to exploit cost tradeoffs, such as increasing investment into better compression levels while saving long-term storage cost. In this paper, such a framework is constructed by giving a comprehensive overview with benchmark results of 30 total combinations of compression tools, decompression/search tools and associated configurations.

The four concrete technical contributions of the paper are:
1) A rich graph model of compression algorithms, formats, tools, settings and runtime characteristics (compressgraph).
2) A robust test bench aiming at reproducible model creation with integration of relevant tools for accurate ratio and performance benchmarking (compressbench).
3) A reference input and results dataset of text compression and search tools applied to representative log files (compressrefdata).
4) A programmable advisor service that exploits the graph to recommend suitable compression for a given situation (compressadvisor).

All four contributions are publicly available¹. The relation between them is summarised in Fig. 1.

¹ Contributions records: https://doi.org/10.5281/zenodo.4053735

Fig. 1. Contributions of this paper

In the next sections, related works are summarised and log file scenarios defined. Afterwards, the tool comparison is introduced with the compression graph model, a testbed with curated sample data and the plan of the experiments. The results are then presented and discussed, and the advisor service presented, before proceeding to an outlook on potential future compression tools that favour smart handling over the quest for raw compression ratios.

II. RELATED WORK

In recent years, the use of online services has seen a significant growth, leading to an increase in log messages to preserve (spatial growth). For multiple reasons, including legal requirements, log files are also stored longer (temporal growth). The product of both growth factors leads to a superlinear increase in resources required to store logs. Hence, some researchers have focused specifically on new compression algorithms for log files, while others have looked into comparison approaches.

Logs can be produced by applications, by system components or by network or user activities on a system. They are typically semi-structured, combining regular entry types (dates, times, hosts) with irregular user-defined messages. For a primer on application log structures and their semantic interpretation, which is also exploited by more recent compression algorithms, the work by Nimbalkar et al. [7] explains the problem domain and offers an RDF-based solution that links to domain vocabularies.

Logzip [4] has been proposed to exploit log-specific redundancy in contrast to that found in generic text. Specifically, Logzip extracts hidden structures by first sampling log lines and then clustering them by tokens and other features. One limitation of Logzip is the reliance on spaces as token separators, which excludes widespread other formats. Vehicle traffic logs can be compressed semantically with high efficiency as shown in a recent study [8]. Multi-level Log Compression (MLC) [9] is another proposal aimed at compressing log files in a cloud backup workflow. It promises ratio improvements of around 16% over state-of-the-art compression tools. Text compression beyond ASCII, applicable to the human-readable log messages, has been explored by modifications to existing byte-level compressors such as bzip2, with significant effectiveness improvements reported [10], and semantic compression for text has been investigated as well [11].

While these research prototypes are promising, a baseline comparison of widely deployed compression tools would be of immediate usefulness to operators and is in the focus of this paper. There are many benchmarks and measurements [...] recent and comprehensive comparisons of log file compression and smart selection of best tools for this task.

A general observation can be made about the apparent business necessity of industrial compression research and tool development. This is evidenced not only by Logzip (Huawei), but also by the generic tools Brotli (Google) and Zstandard (Facebook). A second observation concerns the optimisation dimension. Most recent research works aim at a decreased compression ratio, typically at the cost of increased compression time. In contrast, another class of compression algorithms aims primarily at searchable compression with ratios being a secondary concern. Our work combines them in a common model.

III. ADAPTIVELY COMPRESSED LOG FILE SCENARIOS

Software adaptivity is controlled by goals and constraints. For compression processes, typical goals are fast compression or decompression times, fast search (often in conjunction with decompression), low-memory or low-energy (de)compression, or optimal compression ratio. The constraints are manifold, ranging from not having the appropriate tool installed to inherent file size limitations in the tools. This knowledge needs to be captured in a knowledge base so that it can be exploited at runtime. In contrast to pure mechanical abstraction layers such as Squash, the knowledge can then lead to dynamic decisions about which codec and which parameterisation to use in any context. The novel proposal in this paper is to model the relationships in a graph, so that for instance format-equivalent compression tool alternatives can be queried dynamically based on situational context defined by goals and constraints. Through autonomous or intelligent decision-making between the possible candidates, based on an advisor service, smart adaptive log file handling is achieved.

This handling shall be illustrated by a scenario: A provider wants to store and rotate logs, asks the advisor, and gets a command-line ready to execute on the files to achieve the highest possible compression. Afterwards, the provider notices that CPU usage is high and negatively affects the business application. The constraint for less CPU involvement is brought to the advisor, leading to updated advice on a command-line that achieves still high compression with tolerable CPU load. As the higher-level choice is remembered, new tools that are added in later years are taken into account