Data Compression to Define Information Content
Hydrol. Earth Syst. Sci. Discuss., 10, 2029–2065, 2013
www.hydrol-earth-syst-sci-discuss.net/10/2029/2013/
doi:10.5194/hessd-10-2029-2013
© Author(s) 2013. CC Attribution 3.0 License.

This discussion paper is/has been under review for the journal Hydrology and Earth System Sciences (HESS). Please refer to the corresponding final paper in HESS if available.

Data compression to define information content of hydrological time series

S. V. Weijs (1), N. van de Giesen (2), and M. B. Parlange (1)

(1) Environmental fluid mechanics and Hydrology Lab, ENAC, Ecole Polytechnique Fédérale de Lausanne, Station 2, 1015 Lausanne, Switzerland
(2) Water resources management, Delft University of Technology, Stevinweg 1, P.O. Box 5048, 2600 GA, Delft, The Netherlands

Received: 31 January 2013 – Accepted: 6 February 2013 – Published: 14 February 2013
Correspondence to: S. V. Weijs (steven.weijs@epfl.ch)
Published by Copernicus Publications on behalf of the European Geosciences Union.

Abstract

When inferring models from hydrological data or calibrating hydrological models, we might be interested in the information content of those data to quantify how much can potentially be learned from them. In this work we take a perspective from (algorithmic) information theory (AIT) to discuss some underlying issues regarding this question. In the information-theoretical framework, there is a strong link between information content and data compression. We exploit this by using data compression performance as a time series analysis tool and highlight the analogy to information content, prediction, and learning (understanding is compression). The analysis is performed on time series of a set of catchments, searching for the mechanisms behind compressibility.

We discuss the deeper foundation from algorithmic information theory, some practical results, and the inherent difficulties in answering the question: "How much information is contained in this data?"
The conclusion is that the answer to this question can only be given once the following counter-questions have been answered: (1) Information about which unknown quantities? (2) What is your current state of knowledge/beliefs about those quantities? Quantifying information content of hydrological data is closely linked to the question of separating aleatoric and epistemic uncertainty and quantifying maximum possible model performance, as addressed in current hydrological literature. The AIT perspective teaches us that it is impossible to answer this question objectively without specifying prior beliefs. These beliefs are related to the maximum complexity one is willing to accept as a law and what is considered as random.

1 Introduction

How much information is contained in hydrological time series? This question is not often explicitly asked, but it underlies many challenges in hydrological modeling and monitoring. Information content of time series is, for example, relevant for decisions regarding what to measure and where, to achieve optimal monitoring network designs (Alfonso et al., 2010a,b; Mishra and Coulibaly, 2010; Li et al., 2012). The question also arises in hydrological model inference and calibration, when deciding how much model complexity is warranted by the data (Jakeman and Hornberger, 1993; Vrugt et al., 2002; Schoups et al., 2008; Laio et al., 2010; Beven et al., 2011).

There are, however, some issues in quantifying information content of data. Although the question seems straightforward, the answer is not. This is partly due to the fact that the question is not completely specified.
The answers found in data are relative to the question that one asks the data. Moreover, the information content of those answers depends on how much was already known before the answer was received. An objective assessment of information content is therefore only possible when prior knowledge is explicitly specified.

In this paper, we take a perspective from (algorithmic) information theory, (A)IT, on quantifying information content in hydrological data. This puts information content in the context of data compression. The framework naturally shows how specification of the question and prior knowledge enter the problem and to what degree an objective assessment is possible using tools from information theory. The illustrative link between information content and data compression is elaborated in practical explorations of information content, using common compression algorithms.

This paper must be seen as a first exploration of the compression framework to define information content in hydrological time series, with the objective of introducing the analogies and showing how they work in practice. The results will also serve as a benchmark in a follow-up study (Weijs et al., 2013), where new compression algorithms are developed that employ hydrological knowledge to improve compression.

2 Information content, patterns, and compression of data

From the framework of information theory, originated by Shannon (1948), we know that the information content of a message, data point, or observation can be equated to surprisal, defined as $-\log P$, where $P$ is the probability assigned to the event before observing it.
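As a small numerical illustration of surprisal, the sketch below evaluates $-\log P$ for a few probabilities. The base-2 logarithm (giving bits), the function name `surprisal_bits`, and the example probabilities are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an event that was assigned probability p.

    Base 2 is an assumption here; any other base only changes the unit.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # log2(1/p) equals -log2(p) and avoids returning -0.0 for p = 1.
    return math.log2(1.0 / p)


# Hypothetical probabilities assigned to the event that was then observed.
for p in (0.5, 0.25, 0.01):
    print(f"P = {p:<4} -> {surprisal_bits(p):.2f} bits")
```

The less probable the observed event was judged beforehand, the more bits it conveys; an event that was certain in advance conveys none.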
To keep this paper concise, we refer the reader to Shannon (1948) and Cover and Thomas (2006) for more background on information theory. See also Weijs et al. (2010a,b) for an introduction to and interpretations of information measures in the context of hydrological prediction and model calibration. We also refer the reader to Singh and Rajagopal (1987), Singh (1997), and Ruddell et al. (2013) for more references on applications of information theory in the geosciences. In the following, the interpretation of information content as description length is elaborated.

2.1 Information theory: entropy and code length

Data compression seeks to represent the most likely events (most frequent characters in a file) with the shortest codes, yielding the shortest total code length. Like high probabilities, short codes are a limited resource that has to be allocated as efficiently as possible. When required to be uniquely decodable, short codes come at the cost of longer codes elsewhere. This follows from the fact that such codes must be prefix free, i.e. no code can be the first part (prefix) of another one. This is formalized by the following theorem of McMillan (1956), who generalized the inequality (Eq. 1) of Kraft (1949) to all uniquely decodable codes.

$$\sum_i A^{-l_i} \leq 1 \tag{1}$$

in which $A$ is the alphabet size (2 in the binary case) and $l_i$ is the length of the code assigned to event $i$. In other words, one can see the analogy between prediction and data compression through the similarity between the scarcity of short codes and the scarcity of large predictive probabilities. Just as there are only 4 probabilities of $\frac{1}{4}$ available, there are only 4 prefix-free binary codes as short as $-\log_2 \frac{1}{4} = 2$ (see Fig. 1, code A).
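The scarcity of short codes expressed by Eq. (1) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper names `kraft_sum` and `is_prefix_free` and the three example codes are hypothetical and are not necessarily the codes shown in Fig. 1.

```python
from fractions import Fraction


def kraft_sum(code_lengths, alphabet_size=2):
    """Left-hand side of Eq. (1): the sum over i of A**(-l_i), computed exactly."""
    return sum(Fraction(1, alphabet_size ** length) for length in code_lengths)


def is_prefix_free(codewords):
    """True if no code word is the first part (prefix) of a different code word."""
    return not any(
        a != b and b.startswith(a) for a in codewords for b in codewords
    )


# Hypothetical binary codes for four events.
fixed = ["00", "01", "10", "11"]      # four codes of length 2
variable = ["0", "10", "110", "111"]  # one short code, paid for by two longer ones
too_short = ["0", "1", "00", "01"]    # tries to give every event a short code

for name, code in [("fixed", fixed), ("variable", variable), ("too_short", too_short)]:
    lengths = [len(word) for word in code]
    print(name, "Kraft sum:", kraft_sum(lengths), "prefix free:", is_prefix_free(code))
```

The first two codes use up the budget of Eq. (1) exactly, while the third exceeds it and is accordingly not prefix free: shortening one code word forces others to become longer.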
In contrast to probabilities, which can be chosen freely, the code lengths