arXiv Report 2021

A Bounded Measure for Estimating the Benefit of Visualization: Theoretical Discourse and Conceptual Evaluation

Min Chen1 and Mateu Sbert2

1University of Oxford, UK and 2 University of Girona, Spain

[Figure 1 image residue: panel titles "Visual Mapping with Topological Abstraction" and "Visual Mapping with a Integral"; repeated labels "Color", "Pixel Color", "Opacity".]

(a) mapping from different sets of voxel values to the same pixel color; (b) mapping from different geographical paths to the same line segment.

Figure 1: Visual encoding typically features many-to-one mapping from data to visual representations, hence information loss. The significant amount of information loss in volume visualization and metro maps suggests that viewers not only can abide the information loss but also benefit from it. Measuring such benefits can lead to new advancements of visualization, in theory and practice.

Abstract
Information theory can be used to analyze the cost-benefit of visualization processes. However, the current measure of benefit contains an unbounded term that is neither easy to estimate nor intuitive to interpret. In this work, we propose to revise the existing cost-benefit measure by replacing the unbounded term with a bounded one. We examine a number of bounded measures that include the Jensen-Shannon divergence and a new divergence measure formulated as part of this work. We describe the rationale for proposing a new divergence measure. As the first part of comparative evaluation, we use visual analysis to support the multi-criteria comparison, narrowing the search down to several options with better mathematical properties. The theoretical discourse and conceptual evaluation in this paper provide the basis for further comparative evaluation through synthetic and experimental case studies, which are to be reported in a separate paper.

arXiv:2103.02505v1 [cs.IT] 3 Mar 2021

1. Introduction

To most of us, it seems rather intuitive that visualization should be accurate, different data values should be visually encoded differently, and visual distortion should be disallowed. However, when we closely examine most (if not all) visualization images, we can notice that inaccuracy is ubiquitous. The two examples in Figure 1 evidence the presence of such inaccuracy. In volume visualization, when a pixel is used to depict a set of voxels along a ray, many different sets of voxel values may result in the same pixel color. In a metro map, a variety of complex geographical paths may be distorted and depicted as a straight line. Since there is little doubt that volume visualization and metro maps are useful, some "inaccurate" visualization must be beneficial.

In terms of information theory, the types of inaccuracy featured in Figure 1 are different forms of information loss (or many-to-one mapping). Chen and Golan proposed an information-theoretic measure [CG16] for analyzing the cost-benefit of data intelligence workflows. It enables us to consider the positive impact of information loss (e.g., reducing the cost of storing, processing, displaying, perceiving, and reasoning about the information) as well as its negative impact (e.g., being misled by the information). The measure provides a concise explanation of the benefit of visualization: visualization and other data intelligence processes (e.g., statistics and algorithms) all typically cause information loss, and visualization allows human users to reduce the negative impact of information loss effectively using their knowledge.
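The many-to-one mapping in the volume-visualization example can be illustrated with a tiny Python sketch (a toy illustration of maximum intensity projection, not code from the paper): distinct voxel sequences along different rays collapse to the same pixel value.

```python
# Toy illustration: maximum intensity projection (MIP) maps each sequence
# of voxel values along a ray to a single pixel value -- a many-to-one
# mapping, hence information loss.
rays = [
    [12, 200, 37, 90],
    [200, 0, 0, 0],
    [5, 5, 200, 199],
]
pixels = [max(ray) for ray in rays]
print(pixels)  # three different rays, one pixel value: [200, 200, 200]
```

No viewer of the three identical pixels can tell the three rays apart; recovering the voxel values requires knowledge external to the image.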

© 2021 The Author(s), with LaTeX template from Computer Graphics Forum © 2021 The Eurographics Association and John Wiley & Sons Ltd. Published by John Wiley & Sons Ltd.

M. Chen and M. Sbert / A Bounded Measure for Estimating the Benefit of Visualization: Theoretical Discourse and Conceptual Evaluation

The mathematical formula of the measure features a term based on the Kullback-Leibler (KL) divergence [KL51] for measuring the potential distortion of a user or a group of users in reconstructing the information that may have been lost or distorted during a visualization process. While using the KL-divergence is mathematically intrinsic for measuring the potential distortion, its unboundedness property has some undesirable consequences. The simplest phenomenon of making a false representation (i.e., always displaying 1 when a binary value is 0, or always 0 when it is 1) happens to be a singularity condition of the KL-divergence. The amount of distortion measured by the KL-divergence often has many more bits than the entropy of the information space itself. This is not intuitive to interpret and hinders practical applications.

In this paper, we propose to replace the KL-divergence with a bounded term. We first confirm that boundedness is a necessary property. We then conduct multi-criteria decision analysis (MCDA) [IN13] to compare a number of bounded measures, which include the Jensen-Shannon (JS) divergence [Lin91] and a new divergence measure D_new^k (including its variations) formulated as part of this work. We use visual analysis to aid the observation of the mathematical properties of these candidate measures, narrowing down from eight options to five. In a separate but related paper (included in the supplementary materials), we use synthetic and experimental case studies to instantiate values that may be returned by the five options. It also explores the relationship between measuring the benefit of visualization and measuring the viewers' knowledge used during visualization. The search for the best way to measure the benefit of visualization will likely entail a long journey. The main contribution of this work is to initiate this endeavor.

Figure 2: Alphabet compression may reduce the cost of P_i and P_{i+1}, especially when human knowledge can reduce potential distortion. (Diagram residue: two processes, from "Visual Mapping & Displaying" to "Performing Visualization Tasks"; annotations read "increasing AC can cause more PD", "human knowledge can reduce PD in visualization", "more PD can lead to more cost", and "increasing AC can lead to cost reduction".)

2. Related Work

Claude Shannon's landmark article in 1948 [Sha48] signifies the birth of information theory. It has been underpinning the fields of data communication, compression, and encryption since. As a mathematical framework, information theory provides a collection of useful measures, many of which, such as Shannon entropy [Sha48], cross entropy [CT06], mutual information [CT06], and Kullback-Leibler divergence [KL51], are widely used in applications of physics, biology, neurology, psychology, and computer science (e.g., visualization, computer graphics, computer vision, data mining, machine learning), and so on. In this work, we also consider the Jensen-Shannon divergence [Lin91] in detail.

Information theory has been used extensively in visualization [CFV∗16]. It has enabled many applications in visualization, including scene and shape complexity analysis by Feixas et al. [FdBS99] and Rigau et al. [RFS05], light source placement by Gumhold [Gum02], view selection in mesh rendering by Vázquez et al. [VFSH04] and Feixas et al. [FSG09], attribute selection by Ng and Martin [NM04], view selection in volume rendering by Bordoloi and Shen [BS05] and Takahashi and Takeshima [TT05], multi-resolution volume visualization by Wang and Shen [WS05], focus of attention in volume rendering by Viola et al. [VFSG06], feature highlighting by Jänicke and Scheuermann [JWSK07, JS10] and Wang et al. [WYM08], transfer function design by Bruckner and Möller [BM10] and Ruiz et al. [RBB∗11, BRB∗13b], multi-modal data fusion by Bramon et al. [BBB∗12], isosurface evaluation by Wei et al. [WLS13], measuring observation capacity by Bramon et al. [BRB∗13a], measuring information content by Biswas et al. [BDSW13], proving the correctness of "overview first, zoom, details-on-demand" by Chen and Jänicke [CJ10] and Chen et al. [CFV∗16], and confirming visual multiplexing by Chen et al. [CWB∗14].

Ward first suggested that information theory might be an underpinning theory for visualization [PAJKW08]. Chen and Jänicke [CJ10] outlined an information-theoretic framework for visualization, and it was further enriched by Xu et al. [XLS10] and Wang and Shen [WS11] in the context of scientific visualization. Chen and Golan proposed an information-theoretic measure for analyzing the cost-benefit of visualization processes and workflows [CG16]. It was used to frame an observation study showing that human developers usually entered a huge amount of knowledge into a machine learning model [TKC17]. It motivated an empirical study confirming that knowledge could be detected and measured quantitatively via controlled experiments [KARC17]. It was used to analyze the cost-benefit of different virtual reality applications [CGJM19]. It formed the basis of a systematic methodology for improving the cost-benefit of visual analytics workflows [CE19]. It survived qualitative falsification by using arguments in visualization [SEAKC19]. It offered a theoretical explanation of "visual abstraction" [VCI20]. The work reported in this paper continues the path of theoretical developments in visualization [CGJ∗17], and is intended to improve the original cost-benefit formula [CG16], in order to make it a more intuitive and usable measurement in practical visualization applications.

3. Overview and Motivation

Visualization is useful in most data intelligence workflows, but the usefulness is not universally true because the effectiveness of visualization is usually data-, user-, and task-dependent. The cost-benefit ratio proposed by Chen and Golan [CG16] captures the essence of such dependency. Below is the qualitative expression of the measure:

Benefit / Cost = (Alphabet Compression − Potential Distortion) / Cost   (1)

Consider the scenario of viewing some data through a particular visual representation. The term Alphabet Compression (AC) measures the amount of information loss due to visual abstraction [VCI20]. Since the visual representation is fixed in the scenario, AC is thus largely data-dependent. AC is a positive measure reflecting the fact that visual abstraction must be useful in many cases though it may result in information loss. This apparently counter-intuitive term is essential for asserting why visualization is useful. (Note that the term also helps assert the usefulness of statistics, algorithms, and interaction since they all usually cause information loss [CE19]. See also Appendix A.)

The positive implication of the term AC is counterbalanced by the term Potential Distortion, while both are moderated by the term Cost. The term Cost encompasses all costs of the visualization process, including computational costs (e.g., visual mapping and rendering), cognitive costs (e.g., cognitive load), and consequential costs (e.g., impact of errors). As illustrated in Figure 2, increasing AC typically enables the reduction of cost (e.g., in terms of energy, or its approximation such as time or money).

The term Potential Distortion (PD) measures the informative divergence between viewing the data through visualization with information loss and reading the data without any information loss. The latter might be ideal but is usually at an unattainable cost except for values in a very small data space (i.e., in a small alphabet as discussed in [CG16]). As shown in Figure 2, increasing AC typically causes more PD. PD is data-dependent and user-dependent. Given the same visual representation with the same amount of information loss, one can postulate that a user with more knowledge about the data or visual representation usually suffers less distortion. This postulation is a focus of this paper.

Consider the visual representation of a network of arteries in Figure 3. The image was generated from a volume dataset using the maximum intensity projection (MIP) method. While it is known that MIP cannot convey depth information well, it has been widely used for observing some classes of medical data, such as arteries. The highlighted area in Figure 3 shows an apparently flat area, which is a distortion from the actuality of a tubular surface likely with some small wrinkles and bumps. The doctors who deal with such medical data are expected to have sufficient knowledge to reconstruct the reality adequately from the "distorted" visualization, while being able to focus on the more important task of making diagnostic decisions, e.g., about aneurysms.

Figure 3: A volume dataset was rendered using the MIP method (image by Min Chen, 2008). A question about a "flat area" in the image can be used to tease out a viewer's knowledge that is useful in a visualization process. (The accompanying survey question reads: "Question 5: The image on the right depicts a computed tomography dataset (arteries) that was rendered using a maximum intensity projection (MIP) algorithm. Consider the section of the image inside the red circle (also in the inset of a zoomed-in view). Which of the following illustrations would be the closest to the real surface of this part of the artery? A: Curved, rather smooth; B: Curved, with wrinkles and bumps; C: Flat, rather smooth; D: Flat, with wrinkles and bumps.")

As shown in some recent works, it is possible for visualization designers to estimate AC, PD, and Cost qualitatively [CGJM19, CE19] and quantitatively [TKC17, KARC17]. It is highly desirable to advance the scientific methods for quantitative estimation, towards the eventual realization of computer-assisted analysis and optimization in designing visual representations. This work focuses on one challenge of quantitative estimation, i.e., how to estimate the benefit of visualization to human users with different knowledge about the depicted data and visual encoding.

Building on the methods of observational estimation [TKC17] and controlled experiments [KARC17], one may reasonably anticipate a systematic method based on a short interview by asking potential viewers a few questions. For example, one may use the question in Figure 3 to estimate the knowledge of doctors, patients, and any other people who may view such a visualization. The question is intended to tease out two pieces of knowledge that may help reduce the potential distortion due to the "flat area" depiction. One piece is about the general knowledge that associates arteries with tube-like shapes. Another, which is more advanced, is about the surface texture of arteries and the limitations of the MIP method.

In a separate paper (see the supplementary materials), the question in Figure 3 is one of eight questions used in a survey for collecting empirical data for evaluating the bounded measures considered in this paper. As this paper focuses on the theoretical discourse and conceptual evaluation, we use a highly abstracted version of this example to introduce the relevant information-theoretic notations and elaborate the problem statement addressed by this paper.

4. Mathematical Notations and Problem Statement

Figure 4: A simplified scenario of visualizing three sequences of voxels, which are rendered using the MIP method. A viewer needs to determine if the surface in the volume is curved or flat (or straight in the illustration). (Diagram labels: "Pixel Value", "Voxel Values", "Decision: 'curved' or 'flat'".)

Mathematical Notation. Consider the simplified scenario in Figure 4, where three sequences of voxels are rendered using the MIP method, resulting in three pixel values on the left. Here the three voxel sequences exemplify a volume with N_x × N_y × N_z voxels (illustrated as 1 × 3 × 10). Let each voxel value be an 8-bit unsigned integer. In information theory, the possible 256 values are referred to as an alphabet, denoted here as D_vxl = {0, 1, 2, ..., 255}. The 256 valid values [0, 255] are its letters. The alphabet is associated with a probability mass function (PMF) P(D_vxl). The Shannon entropy of this alphabet, H(D_vxl), measures the average uncertainty and information of the voxel, and is defined as:

H(D_vxl) = − ∑_{i=0}^{255} p_i log2 p_i,  where p_i ∈ [0, 1], ∑_{i=0}^{255} p_i = 1
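As a numerical check of the definition above, a short Python sketch (the helper name `shannon_entropy` is ours, for illustration only) reproduces the 8-bit maximum entropy of the 256-letter voxel alphabet D_vxl, and shows that a skewed PMF has lower entropy:

```python
import math

def shannon_entropy(pmf):
    """Shannon entropy in bits; terms with p_i = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Uniform PMF over the 256-letter voxel alphabet D_vxl: maximum entropy.
uniform = [1 / 256] * 256
print(shannon_entropy(uniform))   # 8.0 bits

# A skewed PMF, e.g., 90% of the probability mass on one "empty space" value.
skewed = [0.9] + [0.1 / 255] * 255
print(shannon_entropy(skewed))    # well below 8 bits
```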

Here, p_i indicates the probability for the voxel to have its value equal to i ∈ [0, 255]. When all 256 values are equally probable (i.e., ∀i ∈ [0, 255], p_i = 1/256), we have H(D_vxl) = 8 bits. In practice, an application usually deals with a specific type of volume data, and the probability of different values may vary noticeably. For example, in medical imaging, a voxel at the corner of a volume is more likely to have a value indicating an empty space.

The entire volume of N_x × N_y × N_z voxels can be defined as a composite alphabet D_vlm. Its letters are all valid combinations of voxel values. If the N_x × N_y × N_z voxels are modelled as independent and identically distributed random variables, we have:

H(D_vlm) = ∑_{k=1}^{M} H(D_vxl,k) = 8 × N_x × N_y × N_z bits

where M = N_x × N_y × N_z. For the volume illustrated in Figure 4, H(D_vlm) would be 240 bits. However, this is the maximum entropy of such a volume. In real world applications, it is very unlikely for the N_x × N_y × N_z voxels to be independent and identically distributed random variables. Although domain experts may not have acquired the ground truth PMF, by measuring a very large corpus of volume data, they have intuitive knowledge as to what may be possible or not. For example, doctors, who handle medical imaging data, do not expect to see a car, ship, aeroplane, or other "weird" objects in a volume dataset. This intuitive and imprecise knowledge about the PMF can explain humans' ability to decode visualization featuring some "shortcomings" such as various visual multiplexing phenomena (e.g., occlusion, displacement, and information omission) [CWB∗14]. In the follow-on paper (see supplementary materials), we will explore means to measure such ability quantitatively.

Similarly, we can define an alphabet for a pixel, R_pxl, and a composite alphabet for an image, R_img. For the example in Figure 4, we assume a simple MIP algorithm that selects the maximum voxel value along each ray and assigns it to the corresponding pixel as an 8-bit monochromatic value. It is obvious that the potential variations of a pixel are much fewer than the potential combined variations of all voxels along a ray. Hence in terms of Shannon entropy, most likely H(D_vlm) − H(R_img) ≫ 0, indicating significant information loss during the rendering process.

Given an analytical task to be performed through visualization, the analytical decision alphabet A usually contains a small number of letters, such as {contain artefact X, no artefact X} or {big, medium, small, tiny, none}. The entropy of A is usually much lower than that of R_img, i.e., H(R_img) − H(A) ≫ 0, indicating further information loss during perception and cognition. As discussed in Section 3, this is referred to as alphabet compression and is a general trend of all data intelligence workflows. The question is thus about how much the analytical decision was disadvantaged by the loss of information. This is referred to as potential distortion.

In the original quantitative formula proposed in [CG16], the potential distortion is measured using the Kullback-Leibler divergence (or KL-divergence) [KL51]. Given an alphabet Z with two PMFs P and Q, the KL-divergence measures how much Q differs from the reference distribution P:

D_KL(P(Z)||Q(Z)) = ∑_{i=1}^{n} p_i (log2 p_i − log2 q_i) = ∑_{i=1}^{n} p_i log2 (p_i / q_i)   (2)

where n = ‖Z‖ is the number of letters in the alphabet Z, and p_i and q_i are the probability values associated with letter z_i ∈ Z. D_KL is also measured in bits. Because D_KL is an unbounded measure regardless of the maximum entropy of Z, it is not easy to relate, quantitatively, the value of potential distortion to that of alphabet compression. This leads to the problem to be addressed in this paper and the follow-on paper in the supplementary materials.

Note: In this paper, to simplify the notations in different contexts, for an information-theoretic measure, we use an alphabet Z and its PMF P interchangeably, e.g., H(P(Z)) = H(P) = H(Z). Appendix B provides more mathematical background about information theory, which may be helpful to some readers.

Problem Statement. Recall our brief discussion in Section 3 about an analytical task that may be affected by the MIP image in Figure 3. Let us define the analytical task as binary options about whether the "flat area" is actually flat or curved. In other words, it is an alphabet A = {curved, flat}. The likelihood of the two options is represented by a probability distribution or probability mass function (PMF) P(A) = {1 − ε, 0 + ε}, where 0 < ε < 1. Since most arteries in the real world are of tubular shapes, one can imagine that a ground truth alphabet A_G.T. might have a PMF P(A_G.T.) strongly in favor of the curved option. However, the visualization seems to suggest the opposite, implying a PMF P(A_MIP) strongly in favor of the flat option. It is not difficult to interview some potential viewers, enquiring how they would answer the question. One may estimate a PMF P(A_doctors) from doctors' answers, and another, P(A_patients), from patients' answers.

Table 1: Imaginary scenarios where probability data is collected for estimating knowledge related to alphabet A = {curved, flat}. The ground truth (G.T.) PMFs are defined with ε = 0.01 and 0.0001 respectively. The potential distortion (as "→ value") is computed using the KL-divergence.

                  Scenario 1               Scenario 2
Q(A_G.T.):      {0.99, 0.01}             {0.9999, 0.0001}
P(A_MIP):       {0.01, 0.99} → 6.50      {0.0001, 0.9999} → 13.28
P(A_doctors):   {0.99, 0.01} → 0.00      {0.99, 0.01} → 0.05
P(A_patients):  {0.7, 0.3} → 1.12        {0.7, 0.3} → 3.11

Table 2: Imaginary scenarios for estimating knowledge related to alphabet B = {wrinkles-and-bumps, smooth}. The ground truth (G.T.) PMFs are defined with ε = 0.1 and 0.001 respectively. The potential distortion (as "→ value") is computed using the KL-divergence.

                  Scenario 3               Scenario 4
Q(B_G.T.):      {0.9, 0.1}               {0.999, 0.001}
P(B_MIP):       {0.1, 0.9} → 2.54        {0.001, 0.999} → 9.94
P(B_doctors):   {0.8, 0.2} → 0.06        {0.8, 0.2} → 1.27
P(B_patients):  {0.1, 0.9} → 2.54        {0.1, 0.9} → 8.50

Table 1 shows two scenarios where different probability data is obtained. The values of PD are computed using the KL-divergence as proposed in [CG16]. In Scenario 1, without any knowledge, the visualization process would suffer 6.50 bits of potential distortion (PD).
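The PD values in Table 1 can be reproduced directly from Eq. 2. A minimal Python sketch (the function name is ours, for illustration only) computes D_KL(P||Q) against the ground-truth PMF of Scenario 1:

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in bits (Eq. 2); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Scenario 1 of Table 1: the ground truth strongly favors "curved" (epsilon = 0.01).
q_gt       = [0.99, 0.01]   # Q(A_G.T.) over {curved, flat}
p_mip      = [0.01, 0.99]   # what the MIP image suggests
p_doctors  = [0.99, 0.01]
p_patients = [0.70, 0.30]

for name, p in [("MIP", p_mip), ("doctors", p_doctors), ("patients", p_patients)]:
    print(name, round(kl_divergence(p, q_gt), 2))  # 6.5, 0.0, 1.12
```

Note that the divergence is taken with the ground-truth PMF as the second argument, matching the "→ value" entries in Tables 1 and 2.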


As doctors are not fooled by the "flat area" shown in the MIP visualization, their knowledge is worth 6.50 bits. Meanwhile, patients would suffer 1.12 bits of PD on average; their knowledge is worth 5.38 = 6.50 − 1.12 bits.

In Scenario 2, the PMFs P(A_G.T.) and P(A_MIP) depart further away from each other, while P(A_doctors) and P(A_patients) remain the same. Although doctors and patients would suffer more PD, their knowledge is worth more than in Scenario 1 (i.e., 13.28 − 0.05 = 13.23 bits and 13.28 − 3.11 = 10.17 bits respectively).

Similarly, the binary options about whether the "flat area" is actually smooth or not can be defined by an alphabet B = {wrinkles-and-bumps, smooth}. Table 2 shows two scenarios about collected probability data. In these two scenarios, doctors exhibit much more knowledge than patients, indicating that the surface texture of arteries is a piece of specialized knowledge.

The above example demonstrates that using the KL-divergence to estimate PD can differentiate the knowledge variation between doctors and patients regarding the two pieces of knowledge that may reduce the distortion due to the "flat area". When it is used in Eq. 1 in a relative or qualitative context (e.g., [CGJM19, CE19]), the unboundedness of the KL-divergence does not pose an issue. However, this does become an issue when the KL-divergence is used to measure PD in an absolute and quantitative context. From the two diverging PMFs P(A_G.T.) and P(A_MIP) in Table 1, or P(B_G.T.) and P(B_MIP) in Table 2, we can observe that the smaller ε is, the more divergent the two PMFs become and the higher value the PD has. Indeed, consider an arbitrary alphabet Z = {z_1, z_2}, and two PMFs defined upon Z: P = [0 + ε, 1 − ε] and Q = [1 − ε, 0 + ε]. When ε → 0, we have the KL-divergence D_KL(Q||P) → ∞.

Meanwhile, the Shannon entropy of Z, H(Z), has an upper bound of 1 bit. It is thus not intuitive or practical to relate the value of D_KL(Q||P) to that of H(Z). Many applications of information theory do not relate these two types of values explicitly. When reasoning about such relations is required, the common approach is to impose a lower-bound threshold for ε (e.g., [KARC17]). However, there is not yet a consistent method for defining such a threshold for various alphabets in different applications, while preventing a range of small or large values (i.e., [0, σ) or (1 − σ, 1]) in a PMF is often inconvenient in practice. Indeed, for a binary alphabet with two arbitrary PMFs P and Q, in order to restrict its D_KL(P||Q) ≤ 1, one has to set 0.0658 ≲ σ ≲ 0.9342, rendering some 13% of the probability range [0, 1] unusable. In the following section, we discuss several approaches to defining a bounded measure for PD.

5. Bounded Measures for Potential Distortion (PD)

Let P_i be a process in a data intelligence workflow, Z_i be its input alphabet, and Z_{i+1} be its output alphabet. P_i can be a human-centric process (e.g., visualization and interaction) or a machine-centric process (e.g., statistics and algorithms). In the original proposal [CG16], the value of Benefit in Eq. 1 is measured using:

Benefit = AC − PD = H(Z_i) − H(Z_{i+1}) − D_KL(Z'_i||Z_i)   (3)

where H() is the Shannon entropy of an alphabet and D_KL() is the KL-divergence of an alphabet from a reference alphabet. Because the Shannon entropy of an alphabet with a finite number of letters is bounded, AC, which is the entropic difference between the input and output alphabets, is also bounded. On the other hand, as discussed in the previous section, PD is unbounded. Although Eq. 3 can be used for relative comparison, it is not quite intuitive in an absolute context, and it is difficult to imagine that the amount of informative distortion can be more than the maximum amount of information available.

In this section, we present the unpublished work by Chen and Sbert [CS19], which shows mathematically that for alphabets of a finite size, the KL-divergence used in Eq. 3 should ideally be bounded. In their arXiv report, they also outlined a new divergence measure and compared it with a few other bounded measures. Building on the initial comparison in [CS19], we use visualization in Section 6 to assist the multi-criteria analysis and selection of a bounded divergence measure to replace the KL-divergence used in Eq. 3. In the separate follow-on paper in the supplementary materials, we further examine the practical usability of a subset of bounded measures by evaluating them using synthetic and experimental data.

5.1. A Conceptual Proof of Boundedness

According to the mathematical definition of D_KL in Eq. 2, D_KL is of course unbounded. We do not in any way try to prove that this formula is bounded. We are interested in a scenario where an alphabet Z is associated with two PMFs, P and Q, which is very much the scenario of measuring the potential distortion in Eq. 1. We ask a question: is it conceptually necessary for D_KL to yield an unbounded value to describe the divergence between P and Q in this scenario, despite that H(P) and H(Q) are both bounded?

We highlight the word "conceptually" because this relates to the concept of another information-theoretic measure, cross entropy, which is defined as:

H_CE(P,Q) = − ∑_{i=1}^{n} p_i log2 q_i = ∑_{i=1}^{n} p_i log2 (1/q_i)
          = ∑_{i=1}^{n} p_i log2 (p_i/q_i) − ∑_{i=1}^{n} p_i log2 p_i
          = D_KL(P||Q) + H(P)   (4)

Conceptually, cross entropy measures the cost of a coding scheme. If a code (i.e., an alphabet Z) has a true PMF P, the optimal coding scheme should require only H(P) bits according to Shannon's source coding theorem [CT06]. However, if the code designer mistakes the PMF as Q, the resulting coding scheme will have H_CE(P,Q) bits. From Eq. 4, we can observe that the inefficiency is described by the term D_KL(P||Q). Naturally, we can translate our aforementioned question to: should such inefficiency be bounded if there is a finite number of codewords (letters) in the code (alphabet)?

Coding theory has been applied to visualization, e.g., for explaining the efficiency of logarithmic plots in displaying data of a family of skewed PMFs and the usefulness of redundancy in visual design [CJ10]. Here we focus on proving that H_CE(P,Q) is conceptually bounded.
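The decomposition in Eq. 4 is easy to verify numerically. The sketch below (illustrative helper names; any PMF pair with matching support would work) confirms H_CE(P,Q) = D_KL(P||Q) + H(P), using a dyadic P and a mistakenly assumed uniform Q:

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    """H_CE(P,Q): expected code length when P-distributed data is coded for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P||Q): the coding inefficiency caused by mistaking P for Q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.125, 0.125]    # true PMF, H(P) = 1.75 bits
Q = [0.25, 0.25, 0.25, 0.25]     # assumed (uniform) PMF

print(cross_entropy(P, Q))       # 2.0 bits: uniform coding costs log2(4)
print(kl(P, Q) + entropy(P))     # Eq. 4: the same 2.0 bits
```

With the uniform Q, the inefficiency D_KL(P||Q) is 0.25 bits, consistent with the special case discussed at the end of this section, where the inefficiency is at most log2 n.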


Let Z be an alphabet with a finite number of letters, {z_1, z_2, ..., z_n}, and Z is associated with a PMF, Q, such that:

q(z_n) = ε  (where 0 < ε < 2^{−(n−1)}),
q(z_{n−1}) = (1 − ε) 2^{−(n−1)},
q(z_{n−2}) = (1 − ε) 2^{−(n−2)},
···
q(z_2) = (1 − ε) 2^{−2},
q(z_1) = (1 − ε) 2^{−1} + (1 − ε) 2^{−(n−1)}.   (5)

When we encode this alphabet using an entropy binary coding scheme [Mos12], we can be assured to achieve an optimal code with the lowest average length for codewords. One example of such a code for the above probability is:

z_1: 0,  z_2: 10,  z_3: 110
···
z_{n−1}: 111...10 (with n − 2 "1"s and one "0")
z_n: 111...11 (with n − 1 "1"s and no "0")   (6)

In this way, z_n, which has the smallest probability, will always be assigned a codeword with the maximal length of n − 1. Entropy coding is designed to minimize the average number of bits per letter when one transmits a "very long" sequence of letters in the alphabet over a communication channel. Here the phrase "very long" implies that the string exhibits the above PMF Q (Eq. 5).

Suppose that Z is actually of PMF P, but is encoded as in Eq. 6 based on Q. The transmission of Z using this code will have inefficiency. As mentioned above, the cost is measured by the cross entropy H_CE(P,Q), and the inefficiency is measured by the term D_KL(P||Q) in Eq. 4.

Clearly, the worst case is that the letter z_n, which was encoded using n − 1 bits, turns out to be the most frequently used letter in P (instead of the least, as in Q). It is so frequent that all letters in the long string are of z_n. So the average codeword length per letter of this string is n − 1. The situation cannot be worse. Therefore, n − 1 is the upper bound of the cross entropy. From Eq. 4, we can also observe that D_KL(P||Q) must also be bounded since H_CE(P,Q) and H(P) are both bounded as long as Z has a finite number of letters. Let ⊤_CE be the upper bound of H_CE(P,Q). The upper bound for D_KL(P||Q), ⊤_KL, is thus:

D_KL(P||Q) = H_CE(P,Q) − H(P) ≤ ⊤_CE − min_{∀P(Z)} H(P)   (7)

There is a special case worth noting. In practice, it is common to assume that Q is a uniform distribution, i.e., q_i = 1/n, ∀q_i ∈ Q, typically because Q is unknown or varies frequently. Hence the assumption leads to a code with an average length equaling log2 n (or in practice, the smallest integer ≥ log2 n). Under this special (but rather common) condition, all letters in a very long string have codewords of the same length. The worst case is that all letters in the string turn out to be the same letter. Since there is no informative variation in the PMF P for this very long string, i.e., H(P) = 0, in principle, the transmission of this string is unnecessary. The maximal amount of inefficiency is thus log2 n. This is indeed much lower than the upper bound ⊤_CE = n − 1, justifying the assumption or use of a uniform Q in many situations.

A more formal proof of the boundedness of H_CE(P,Q) and D_KL(P||Q) for an alphabet with a finite number of letters can be found in Appendix C with more detailed discussions. It is necessary to note again that the discourse in this section and Appendix C does not imply that the KL-divergence is incorrect. Firstly, the KL-divergence applies to both discrete probability distributions (PMFs) and continuous distributions. Secondly, the KL-divergence is one of the many divergence measures found in information theory, and a member of the huge collection of statistical distance or difference measures. There is no simple answer as to which measure is correct or incorrect, or which is better. We therefore should not over-generalize the proof to undermine the general usefulness of the KL-divergence.

5.2. Existing Candidates of Bounded Measures

While numerical approximation may provide a bounded KL-divergence, it is not easy to determine the value of ε, and it is difficult to ensure that everyone uses the same ε for the same alphabet or comparable alphabets. It is therefore desirable to consider bounded measures that may be used in place of D_KL. The Jensen-Shannon divergence is such a measure:

D_JS(P||Q) = (1/2) (D_KL(P||M) + D_KL(Q||M)) = D_JS(Q||P)
           = (1/2) ∑_{i=1}^{n} ( p_i log2 (2 p_i / (p_i + q_i)) + q_i log2 (2 q_i / (p_i + q_i)) )   (8)

where P and Q are two PMFs associated with the same alphabet Z, and M is the average distribution of P and Q. Each letter z_i ∈ Z is associated with a probability value p_i ∈ P and another q_i ∈ Q. With the base 2 logarithm as in Eq. 8, D_JS(P||Q) is bounded by 0 and 1.

Another bounded measure is the conditional entropy H(P|Q):

H(P|Q) = H(P) − I(P;Q) = H(P) − ∑_{i=1}^{n} ∑_{j=1}^{n} r_{i,j} log2 (r_{i,j} / (p_i q_j))   (9)

where I(P;Q) is the mutual information between P and Q, and r_{i,j} is the joint probability of the two conditions of z_i, z_j ∈ Z that are associated with P and Q. H(P|Q) is bounded by 0 and H(P). Because I(P;Q) measures the amount of shared information between P and Q (and therefore a kind of similarity), H(P|Q) thus increases if P and Q are less similar. Note that we use H(P|Q) and I(P;Q) here in the context that P and Q are associated with the same alphabet Z, though the general definitions of H(P|Q) and I(P;Q) are more flexible.

The above two measures in Eqs. 8 and 9 consist of logarithmic scaling of probability values, in the same form as Shannon entropy. They are entropic measures. There are many other divergence measures in information theory, including many in the family of f-divergences [LV06]. However, many are also unbounded.

Meanwhile, entropic divergence measures belong to the broader family of statistical distances or difference measures. In this work, we considered a set of non-entropic measures in the form of

© 2021 The Author(s) with LATEX template from Computer Graphics Forum © 2021 The Eurographics Association and John Wiley & Sons Ltd. M. Chen and M. Sbert / A Bounded Measure for Estimating the Benefit of Visualization: Theoretical Discourse and Conceptual Evaluation

Minkowski distances, which have the following general form:

$$\mathbb{D}_{M}^{k}(P,Q) = \sqrt[k]{\sum_{i=1}^{n} |p_i - q_i|^k} \quad (k > 0) \qquad (10)$$

where we use the symbol 𝔻 instead of D because the measure is not entropic.

5.3. New Candidates of Bounded Measures

For each letter z_i ∈ Z, D_KL(P||Q) measures the difference between its self-information -log2(p_i) and -log2(q_i) with respect to P and Q. Similarly, D_JS(P||Q) measures the difference of self-information with the involvement of an average distribution (P+Q)/2. Meanwhile, it is interesting to consider the difference of two probability values, i.e., |p_i - q_i|, and the information content of that difference. This would lead to measuring log2 |p_i - q_i|, which is unfortunately an unbounded term in [-∞, 0].

Let u = |p_i - q_i|. The function log2(u^k + 1) (where k > 0) is an isomorphic transformation of log2 u. The former preserves all information of the latter, while offering a bounded measure in [0, 1]. Although log2(u^k + 1) and log2 u are both monotonically increasing measures, they have different gradient functions, or visually, different shapes. We thus introduce a power parameter k to enable our investigation into different shapes. The introduction of k reflects the open-minded nature of this work. It follows the same generalization approach as Minkowski distances and α-divergences [vH14], avoiding a fixation on their special cases such as the Euclidean distance or D_KL.

We first consider a commutative measure D_new^k:

$$D_{new}^{k}(P\|Q) = \frac{1}{2}\sum_{i=1}^{n}(p_i + q_i)\log_2\bigl(|p_i - q_i|^k + 1\bigr) \qquad (11)$$

where k > 0. Because 0 ≤ |p_i - q_i|^k ≤ 1, we have

$$\frac{1}{2}\sum_{i=1}^{n}(p_i + q_i)\log_2(0+1) \;\le\; D_{new}^{k}(P\|Q) \;\le\; \frac{1}{2}\sum_{i=1}^{n}(p_i + q_i)\log_2(1+1)$$

Since log2 1 = 0, log2 2 = 1, Σ p_i = 1, and Σ q_i = 1, D_new^k(P||Q) is thus bounded by 0 and 1. The formulation of D_new^k(P||Q) was derived from its non-commutative version:

$$D_{ncm}^{k}(P\|Q) = \sum_{i=1}^{n} p_i \log_2\bigl(|p_i - q_i|^k + 1\bigr) \qquad (12)$$

which captures the non-commutative property of D_KL. In this work, we focus on two options of D_new^k and D_ncm^k, i.e., when k = 1 and k = 2.

As D_JS, D_new^k, and D_ncm^k are bounded by [0, 1], if any of them is selected to replace D_KL, Eq. 3 can be rewritten as

$$\text{Benefit} = H(Z_i) - H(Z_{i+1}) - H_{max}(Z_i)\,D(Z_i\|Z_i') \qquad (13)$$

where H_max denotes maximum entropy, while D is a placeholder for D_JS, D_new^k, or D_ncm^k.

We have considered the option of using H(Z_i) instead of H_max(Z_i). However, this would lead to an undesirable paradox. Consider an alphabet Z_i = {z_a, z_b} with a PMF P_i = {p_a, 1-p_a}. Consider a simple visual mapping that is supposed to encode the probability value p_a using the luminance of a monochrome shape, with luminance(p_a) = p_a, black = 0, and white = 1. Unfortunately, the accompanying legend displays incorrect labels: black for p_a = 1 and white for p_a = 0. The visualization results thus feature a "lie" distribution P_i' = {1-p_a, p_a}. An obvious paradoxical scenario is when P_i = {1, 0}, which has an entropy value H(Z_i) = 0. Although D_JS, D_new, and D_ncm would all return 1 as the maximum value of divergence for the visual mapping, the term H(Z_i) D(Z_i||Z_i') would indicate that there was no divergence. Hence H(Z_i) cannot be used instead of H_max(Z_i).

6. Comparing Bounded Measures: Conceptual Evaluation

Given those bounded candidates in the previous section, we would like to select the most suitable measure to be used in Eq. 13. In the history of measurement science [Kle12], an enormous amount of research effort has been devoted to inventing, evaluating, and selecting different candidate measures (e.g., metric vs. imperial measurement systems; temperature scales: Celsius, Fahrenheit, Kelvin, Rankine, and Réaumur; and seismic magnitude scales: Richter, Mercalli, moment magnitude, and many others). There is usually no ground truth as to which is correct, and the selection decision is rarely determined only by mathematical definitions or rules [PS91]. Similarly, there are numerous statistical distance and difference measures; selecting a measure in a certain application is often an informed decision based on multiple factors.

Measuring the benefit of visualization and the related informative divergence in visualization processes is a new topic in the field of visualization. It is not unreasonable to expect that more research effort will be made in the coming years, decades, or unsurprisingly, centuries. The work presented in this paper and the follow-on paper in the supplementary materials represents the early thought and early effort in this endeavor. In this work, we devised a set of criteria and conducted multi-criteria decision analysis (MCDA) [IN13] to evaluate the candidate measures described in the previous section.

Our criteria fall into two main categories. The first group of criteria reflects five desirable conceptual or mathematical properties, as shown in Table 3. The second group reflects assessments based on numerical instances constructed synthetically or obtained from experiments. This paper focuses on conceptual evaluation based on the first group of criteria, while the follow-on paper in the supplementary materials focuses on empirical evaluation based on the second group of criteria.

For criteria 1, 4, and 5 in the first group, we use visualization plots to aid our analysis of the mathematical properties. Based on our analysis, we score each divergence measure against a criterion using ordinal values between 0 and 5 (0: unacceptable, 1: fall-short, 2: inadequate, 3: mediocre, 4: good, 5: best). We intentionally do not assign weights to these criteria. While we will offer our view as to the importance of different criteria, we encourage readers to apply their own judgement to weight these criteria. We hope that readers will reach the same conclusion as ours. We draw our conclusion about the conceptual evaluation in Section 7, where we also outline the need for empirical evaluation.

Criterion 1. This is essential since the selected divergence


Table 3: A summary of multi-criteria decision analysis. Each measure is scored against a conceptual criterion using an integer in [0, 5] with 5 being the best. The symbol I indicates an interim conclusion after considering one or a few criteria.

Criteria (Importance)             | 0.3DKL | DJS | H(P|Q) | Dnew k=1 | Dnew k=2 | Dncm k=1 | Dncm k=2 | DM k=2 | DM k=200
1. Boundedness (critical)         |   0    |  5  |   5    |    5     |    5     |    5     |    5     |   3    |    3
I 0.3DKL is eliminated but used below only for comparison. The other scores are carried forward.
2. Number of PMFs (important)     |   5    |  5  |   2    |    5     |    5     |    5     |    5     |   5    |    5
3. Entropic measures (important)  |   5    |  5  |   5    |    5     |    5     |    5     |    5     |   1    |    1
4. Curve shapes, Fig. 5 (helpful) |   5    |  5  |   1    |    2     |    4     |    2     |    4     |   3    |    3
5. Curve shapes, Fig. 6 (helpful) |   5    |  4  |   1    |    3     |    5     |    3     |    5     |   2    |    3
I Eliminate H(P|Q), DM k=2, DM k=200 based on criteria 1-5.
Sum (criteria 1-5)                |   —    | 24  |  14    |   20     |   24     |   20     |   24     |  14    |   15
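The candidate measures compared in Table 3 can be sketched in code. The following Python functions are our own illustrative implementations of Eqs. 8 and 10-12 (the function names `d_js`, `d_minkowski`, `d_new`, and `d_ncm` are not from the paper); they also check the boundedness claims on a maximally divergent binary alphabet:

```python
import math

def d_kl(p, q):
    """KL-divergence D_KL(P||Q) in bits; unbounded, and undefined when
    some q_i = 0 while p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def d_js(p, q):
    """Jensen-Shannon divergence (Eq. 8); bounded by [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

def d_minkowski(p, q, k=2.0):
    """Minkowski distance (Eq. 10); bounded, but its upper bound exceeds 1."""
    return sum(abs(pi - qi) ** k for pi, qi in zip(p, q)) ** (1.0 / k)

def d_new(p, q, k=2):
    """The commutative divergence of Eq. 11; bounded by [0, 1]."""
    return 0.5 * sum((pi + qi) * math.log2(abs(pi - qi) ** k + 1)
                     for pi, qi in zip(p, q))

def d_ncm(p, q, k=2):
    """The non-commutative variant of Eq. 12; also bounded by [0, 1]."""
    return sum(pi * math.log2(abs(pi - qi) ** k + 1) for pi, qi in zip(p, q))

# Two maximally divergent PMFs on a binary alphabet:
P, Q = [1.0, 0.0], [0.0, 1.0]
print(d_js(P, Q))            # 1.0 -- the maximum of the bounded entropic measures
print(d_new(P, Q, k=2))      # 1.0
print(d_ncm(P, Q, k=2))      # 1.0
print(d_minkowski(P, Q, 2))  # sqrt(2) ~ 1.414 -- bounded, but above 1
```

The last line illustrates the semantic drawback of D_M^k noted under Criterion 1: its upper bound depends on k and always exceeds 1, whereas the entropic candidates saturate exactly at 1.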

[Figure 5 (plots): y-axis = divergence value (0 to 1.1); x-axis = p1 (0 to 1); curves for α = 0.0, 0.1, ..., 1.0, with 0.3·DKL at α = 0.5 shown for reference. Panels: (a) DKL(P||Q); (b) 0.3DKL(P||Q); (c) DJS(P||Q); (d) H(P|Q); (e) Dnew^k(P||Q), k = 1; (f) Dnew^k(P||Q), k = 2; (g) DM^k(P,Q), k = 2; (h) DM^k(P,Q), k = 200.]

Figure 5: The different measurements of the divergence of two PMFs, P = {p1, 1-p1} and Q = {q1, 1-q1}. The x-axis shows p1, varying from 0 to 1, while we set q1 = (1-α)p1 + α(1-p1), α ∈ [0,1]. When α = 1, Q is most divergent away from P.

measure is to be bounded. Otherwise, we could just use the KL-divergence. Let us consider a simple alphabet Z = {z1, z2}, which is associated with two PMFs, P = {p1, 1-p1} and Q = {q1, 1-q1}. We set q1 = (1-α)p1 + α(1-p1), α ∈ [0,1], such that when α = 1, Q is most divergent away from P. The entropy values of P and Q fall into the range [0, 1]. Hence, semantically, it is more intuitive to reason about an unsigned value representing their divergence within the same range.

Figure 5 shows several measures by varying the value of p1 in the range [0, 1]. We can observe that DKL rises above 1 quickly when α = 1, p1 ≤ 0.22. Its scaled version, 0.3DKL, does not rise as quickly as DKL, but still rises above 1 when α = 1, p1 ≤ 0.18. In fact, DKL and 0.3DKL are not only unbounded, they do not return valid values when p1 = 0 or p1 = 1. We therefore score them 0 for Criterion 1.

DJS, H(P|Q), D_new^k, and D_ncm^k are all bounded by [0, 1], and semantically intuitive. We score them 5. Although D_M^k is a bounded measure, its semantic interpretation is not ideal, because its upper bound depends on k and is always > 1. We thus score it 3. Although 0.3DKL is eliminated based on Criterion 1, it is kept in Table 3 as a benchmark in analyzing criteria 2-5. Meanwhile, we carry all other scores forward to the next stage of analysis.

Criterion 2. For criteria 2-5, we follow the base-criterion method [HSS20] by considering DKL and 0.3DKL as the benchmark. Criterion 2 concerns the number of PMFs as the input variables of each measure. DKL and 0.3DKL depend on two PMFs, P and Q. All candidates of the bounded measures depend on two PMFs, except the conditional entropy H(P|Q), which depends on three. Because it requires some effort to obtain a PMF, especially a joint probability distribution, this makes H(P|Q) less favourable, and it is scored 2.

Criterion 3. In addition, we prefer to have an entropic measure, as it is more compatible with the measure of alphabet compression as well as with DKL, which is to be replaced. For this reason, D_M^k is scored 1.

Criterion 4. One may wish for a bounded measure to have a geometric behaviour similar to DKL, since it is the most popular divergence measure. Since DKL rises far too quickly, as shown in Figure 5, we use 0.3DKL as a benchmark, though it is still unbounded. As Figure 5 plots the curves for α = 0.0, 0.1, ..., 1.0, we can visualize the "geometric shape" of each bounded measure and compare it with that of 0.3DKL.

From Figure 5, we can observe that DJS has almost a perfect match when α = 0.5, while D_new^k (k = 2) is also fairly close. They thus score 5 and 4, respectively, in Table 3. Meanwhile, the lines of H(P|Q) curve in the opposite direction of 0.3DKL. We score it 1. D_new^k (k = 1) and D_M^k (k = 2, k = 200) are of similar shapes, with D_M^k correlating with 0.3DKL slightly better. We thus score D_new^k (k = 1) 2 and D_M^k (k = 2, k = 200) 3. For the PMFs P and Q concerned, D_ncm^k has the same curves as D_new^k. Hence D_ncm^k has the same score as D_new^k in Table 3.

Figure 6: A visual comparison of the candidate measures in a range near zero. Similar to Figure 5, P = {p1, 1-p1} and Q = {q1, 1-q1}, but only the curve α = 1 is shown, i.e., q1 = 1 - p1. The line segments of DKL and 0.3DKL in the range [0, 0.1^10] do not represent the actual curves. The ranges [0, 0.1^10] and [0.1, 0.5] are only for references to the nearby contexts, as they do not use the same logarithmic scale as in [0.1^10, 0.1].

Criterion 5. We now consider Figure 6, where the candidate measures are visualized in comparison with DKL and 0.3DKL in a range close to zero, i.e., [0.1^10, 0.1]. The ranges [0, 0.1^10] and [0.1, 0.5] are there only for references to the nearby contexts, as they do not have the same logarithmic scale as that in the range [0.1^10, 0.1]. We can observe that in [0.1^10, 0.1] the curve of 0.3DKL rises almost as quickly as DKL. This confirms that simply scaling the KL-divergence is not an adequate solution. The curves of D_new^{k=1} and D_new^{k=2} converge to their maximum value 1.0 earlier than that of DJS. If the curve of 0.3DKL is used as a benchmark as in Figure 5, the curve of D_new^{k=2} is closer to 0.3DKL than that of DJS. We thus score D_new^{k=2}: 5, DJS: 4, D_new^{k=1}: 3, D_M^k (k = 200): 3, D_M^k (k = 2): 2, and H(P|Q): 1. As in Figure 5, D_ncm^k has the same curves and thus the same score as D_new^k.

The sums of the scores for criteria 1-5 indicate that H(P|Q) and D_M^k are much less favourable than DJS, D_new^k, and D_ncm^k. Because these criteria have more holistic significance than the instance-based analysis for criteria 6-9, we can eliminate H(P|Q) and D_M^k from further consideration. Ordinal scores in MCDA are typically subjective. Nevertheless, in our analysis, ±1 in those scores would not affect the elimination.

7. Discussions and Conclusions

In this paper, we have considered the need to improve the mathematical formulation of an information-theoretic measure for analyzing the cost-benefit of visualization as well as other processes in a data intelligence workflow [CG16]. The concern about the original measure is its unbounded term based on the KL-divergence. As discussed in the early sections of this paper, although using the KL-divergence in [CG16] as part of the cost-benefit measure is a conventional or orthodox choice, its unboundedness leads to several issues in the potential applications of the cost-benefit measure to practical problems:

• It is not intuitive to interpret a set of values indicating that the amount of distortion in viewing a visualization that features some information loss could be much more than the total amount of information contained in the visualization.
• It is difficult to specify some simple visualization phenomena. For example, before a viewer observes a variable x using visualization, the viewer incorrectly assumes that the variable is a constant (e.g., x ≡ 10, and probability p(10) = 1). The KL-divergence cannot measure the potential distortion of this phenomenon of bias because this is a singularity condition, unless one changes p(10) by subtracting a small value 0 < ε < 1.
• If one tries to restrict the KL-divergence to return values within a bounded range, e.g., determined by the maximum entropy of the visualization space or the underlying data space, one could potentially lose a non-trivial portion of the probability range (e.g., 13% in the case of a binary alphabet).

To address these problems, we have proposed to replace the KL-divergence in the cost-benefit measure with a bounded measure. We have obtained a proof that the divergence used in the cost-benefit formula is conceptually bounded, as long as the input and output alphabets of a process have a finite number of letters.
We have considered a number of bounded measures to replace


the unbounded term, including a new divergence measure D_new^k and its variant D_ncm^k. We have conducted multi-criteria decision analysis to select the best measure among these candidates. In particular, we have used visualization to aid the observation of the mathematical properties of the candidate measures, assisting in the analysis of three of the criteria considered in this paper.

From Table 3, we can observe the process of narrowing down from eight candidate measures to five. In particular, three candidate measures, DJS, D_new^k (k = 2), and D_ncm^k (k = 2), received the same total scores. It is not easy to separate them.

In the history of measurement science [Kle12], scientists encountered many similar dilemmas in choosing different measures. For example, the Celsius, Fahrenheit, Réaumur, Rømer, and Delisle temperature scales exhibit similar mathematical properties; their proposal and adoption were largely determined by practical instances:

• Fahrenheit (original) — 0 degrees: the freezing point of brine (a high-concentration solution of salt in water); 32 degrees: ice water; 96 degrees: average human body temperature;
• Fahrenheit (present) — 32 degrees: the freezing point of water; 212 degrees: the boiling point of water;
• Réaumur — 0 degrees: the freezing point of water; 80 degrees: the boiling point of water;
• Rømer — 7.5 degrees: the freezing point of water; 60 degrees: the boiling point of water;
• Celsius (original) — 0 degrees: the boiling point of water; 100 degrees: the freezing point of water;
• Celsius (1744–1954) — 0 degrees: the freezing point of water; 100 degrees: the boiling point of water;
• Celsius (1954–2019) — redefined based on absolute zero and the triple point of VSMOW (specially prepared water);
• Celsius (2019–now) — redefined based on the Boltzmann constant;
• Delisle — 0 degrees: the boiling point of water; -1 degree: the contraction of the mercury in hundred-thousandths.

In addition, Newton proposed two temperature systems based on his observation of some 18 instance values [Wik20].

The effort of these scientific pioneers suggests that observing how candidate measures relate to practical instances was part of the scientific process for selecting among them. Therefore, building on the work presented in this paper, we carried out further investigation into a group of criteria based on observed instances in synthetic and experimental data. This empirical evaluation is reported in a separate follow-on paper in the supplementary materials, where we narrow the remaining five candidate measures down to one measure, and propose to revise the original cost-benefit ratio in [CG16] based on the combined conclusion derived from the conceptual evaluation and the empirical evaluation.

References

[BBB*12] Bramon R., Boada I., Bardera A., Rodríguez Q., Feixas M., Puig J., Sbert M.: Multimodal data fusion based on mutual information. IEEE Transactions on Visualization and Computer Graphics 18, 9 (2012), 1574–1587.
[BDSW13] Biswas A., Dutta S., Shen H.-W., Woodring J.: An information-aware framework for exploring multivariate data sets. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 2683–2692.
[BM10] Bruckner S., Möller T.: Isosurface similarity maps. Computer Graphics Forum 29, 3 (2010), 773–782.
[BRB*13a] Bramon R., Ruiz M., Bardera A., Boada I., Feixas M., Sbert M.: An information-theoretic observation channel for volume visualization. Computer Graphics Forum 32, 3pt4 (2013), 411–420.
[BRB*13b] Bramon R., Ruiz M., Bardera A., Boada I., Feixas M., Sbert M.: Information theory-based automatic multimodal transfer function design. IEEE Journal of Biomedical and Health Informatics 17, 4 (2013), 870–880.
[BS05] Bordoloi U., Shen H.-W.: View selection for volume rendering. In Proc. IEEE Visualization (2005), pp. 487–494.
[CE19] Chen M., Ebert D. S.: An ontological framework for supporting the design and evaluation of visual analytics systems. Computer Graphics Forum 38, 3 (2019), 131–144.
[CE20] Chen M., Edwards D. J.: 'Isms' in visualization. In Foundations of Data Visualization, Chen M., Hauser H., Rheingans P., Scheuermann G., (Eds.). Springer, 2020.
[CFV*16] Chen M., Feixas M., Viola I., Bardera A., Shen H.-W., Sbert M.: Information Theory Tools for Visualization. A K Peters, 2016.
[CG16] Chen M., Golan A.: What may visualization processes optimize? IEEE Transactions on Visualization and Computer Graphics 22, 12 (2016), 2619–2632.
[CGJ*17] Chen M., Grinstein G., Johnson C. R., Kennedy J., Tory M.: Pathways for theoretical advances in visualization. IEEE Computer Graphics and Applications 37, 4 (2017), 103–112.
[CGJM19] Chen M., Gaither K., John N. W., McCann B.: Cost-benefit analysis of visualization in virtual environments. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2019), 32–42.
[CJ10] Chen M., Jänicke H.: An information-theoretic framework for visualization. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1206–1215.
[CS19] Chen M., Sbert M.: On the upper bound of the Kullback-Leibler divergence and cross entropy. arXiv:1911.08334, 2019.
[CT06] Cover T. M., Thomas J. A.: Elements of Information Theory. John Wiley & Sons, 2006.
[CWB*14] Chen M., Walton S., Berger K., Thiyagalingam J., Duffy B., Fang H., Holloway C., Trefethen A. E.: Visual multiplexing. Computer Graphics Forum 33, 3 (2014), 241–250.
[DCK12] Dasgupta A., Chen M., Kosara R.: Conceptualizing visual uncertainty in parallel coordinates. Computer Graphics Forum 31, 3 (2012), 1015–1024.
[FdBS99] Feixas M., del Acebo E., Bekaert P., Sbert M.: An information theory framework for the analysis of scene complexity. Computer Graphics Forum 18, 3 (1999), 95–106.
[FSG09] Feixas M., Sbert M., González F.: A unified information-theoretic framework for viewpoint selection and mesh saliency. ACM Transactions on Applied Perception 6, 1 (2009), 1–23.
[Gol20] Golin M. J.: Lecture 17: Huffman coding. http://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf, accessed in March 2020.
[Gum02] Gumhold S.: Maximum entropy light source placement. In Proc. IEEE Visualization (2002), pp. 275–282.
[HSS20] Haseli G., Sheikh R., Sana S. S.: Base-criterion on multi-criteria decision-making method and its applications. International Journal of Management Science and Engineering Management 15, 2 (2020), 79–88.


[IN13] Ishizaka A., Nemery P.: Multi-Criteria Decision Analysis: Methods and Software. John Wiley & Sons, 2013.
[JS10] Jänicke H., Scheuermann G.: Visual analysis of flow features using information theory. IEEE Computer Graphics and Applications 30, 1 (2010), 40–49.
[JWSK07] Jänicke H., Wiebel A., Scheuermann G., Kollmann W.: Multifield visualization using local statistical complexity. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1384–1391.
[KARC17] Kijmongkolchai N., Abdul-Rahman A., Chen M.: Empirically measuring soft knowledge in visualization. Computer Graphics Forum 36, 3 (2017), 73–85.
[KL51] Kullback S., Leibler R. A.: On information and sufficiency. Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[Kle12] Klein H. A.: The Science of Measurement: A Historical Survey. Dover Publications, 2012.
[Lin91] Lin J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37 (1991), 145–151.
[LV06] Liese F., Vajda I.: On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory 52, 10 (2006), 4394–4412.
[Mos12] Moser S. M.: A Student's Guide to Coding and Information Theory. Cambridge University Press, 2012.
[NM04] Ng C. U., Martin G.: Automatic selection of attributes by importance in relevance feedback visualisation. In Proc. Information Visualisation (2004), pp. 588–595.
[PAJKW08] Purchase H. C., Andrienko N., Jankun-Kelly T. J., Ward M.: Theoretical foundations of information visualization. In Information Visualization: Human-Centered Issues and Perspectives, Springer LNCS 4950, 2008, pp. 46–64.
[PS91] Pedhazur E. J., Schmelkin L. P.: Measurement, Design, and Analysis: An Integrated Approach. Lawrence Erlbaum Associates, 1991.
[RBB*11] Ruiz M., Bardera A., Boada I., Viola I., Feixas M., Sbert M.: Automatic transfer functions based on informational divergence. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 1932–1941.
[RFS05] Rigau J., Feixas M., Sbert M.: Shape complexity based on mutual information. In Proc. IEEE Shape Modeling and Applications (2005).
[SEAKC19] Streeb D., El-Assady M., Keim D., Chen M.: Why visualize? Untangling a large network of arguments. IEEE Transactions on Visualization and Computer Graphics, early view (2019). 10.1109/TVCG.2019.2940026.
[Sha48] Shannon C. E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948), 379–423.
[TKC17] Tam G. K. L., Kothari V., Chen M.: An analysis of machine- and human-analytics in classification. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017).
[TT05] Takahashi S., Takeshima Y.: A feature-driven approach to locating optimal viewpoints for volume visualization. In Proc. IEEE Visualization (2005), pp. 495–502.
[VCI20] Viola I., Chen M., Isenberg T.: Visual abstraction. In Foundations of Data Visualization, Chen M., Hauser H., Rheingans P., Scheuermann G., (Eds.). Springer, 2020. Preprint at arXiv:1910.03310, 2019.
[VFSG06] Viola I., Feixas M., Sbert M., Gröller M. E.: Importance-driven focus of attention. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 933–940.
[VFSH04] Vázquez P.-P., Feixas M., Sbert M., Heidrich W.: Automatic view selection using viewpoint entropy and its application to image-based modelling. Computer Graphics Forum 22, 4 (2004), 689–700.
[vH14] van Erven T., Harremos P.: Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60, 7 (2014), 3797–3820.
[Wik20] Wikipedia: Newton scale. https://en.wikipedia.org/wiki/Newton_scale, accessed in Nov. 2020.
[WLS13] Wei T.-H., Lee T.-Y., Shen H.-W.: Evaluating isosurfaces with level-set-based information maps. Computer Graphics Forum 32, 3 (2013), 1–10.
[WS05] Wang C., Shen H.-W.: LOD Map – a visual interface for navigating multiresolution volume visualization. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2005), 1029–1036.
[WS11] Wang C., Shen H.-W.: Information theory in scientific visualization. Entropy 13 (2011), 254–273.
[WYM08] Wang C., Yu H., Ma K.-L.: Importance-driven time-varying data visualization. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1547–1554.
[XLS10] Xu L., Lee T. Y., Shen H. W.: An information-theoretic framework for flow visualization. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1216–1224.


• Cost (Ct) of the forward transformation from input to output Appendices and the inverse transformation of reconstruction provides a fur- A Bounded Measure for Estimating ther balancing factor in the cost-benefit metric in addition to the trade-off between AC and PD. In practice, one may measure the the Benefit of Visualization: cost using time or a monetary measurement. Theoretical Discourse and Why is visualization useful? There have been many arguments Conceptual Evaluation about why visualization is useful. Streeb et al. collected a large number of arguments and found many arguments are in conflict Min Chen, University of Oxford, UK with each other [SEAKC19]. Chen and Edwards presented an Mateu Sbert, University of Girona, Spain overview of schools of thought in the field of visualization, and showed that the “why” question is a bone of major contention [CE20].

Appendix A: Explanation of the Original Cost-Benefit Measure

This appendix contains an extraction from a previous publication [CE19], which provides a relatively concise but informative description of the cost-benefit ratio proposed in [CG16]. The inclusion of this is to minimize the readers' effort to locate such an explanation. The extraction has been slightly modified. In addition, at the end of this appendix, we provide a relatively informal and somewhat conversational discussion about using this measure to explain why visualization is useful.

Chen and Golan introduced an information-theoretic metric for measuring the cost-benefit ratio of a visual analytics (VA) workflow or any of its component processes [CG16]. The metric consists of three fundamental measures that are abstract representations of a variety of qualitative and quantitative criteria used in practice, including operational requirements (e.g., accuracy, speed, errors, uncertainty, provenance, automation), analytical capability (e.g., filtering, clustering, classification, summarization), cognitive capabilities (e.g., memorization, learning, context-awareness, confidence), and so on. The abstraction results in a metric with the desirable mathematical simplicity [CG16]. The qualitative form of the metric is as follows:

    Benefit / Cost = (Alphabet Compression − Potential Distortion) / Cost    (14)

The metric describes the trade-off among the three measures:

• Alphabet Compression (AC) measures the amount of entropy reduction (or information loss) achieved by a process. As it was noticed in [CG16], most visual analytics processes (e.g., statistical aggregation, sorting, clustering, visual mapping, and interaction) feature many-to-one mappings from input to output, hence losing information. Although information loss is commonly regarded as harmful, it cannot be all bad if it is a general trend of VA workflows. Thus the cost-benefit metric makes AC a positive component.

• Potential Distortion (PD) balances the positive nature of AC by measuring the errors typically due to information loss. Instead of measuring mapping errors using some third-party metrics, PD measures the potential distortion when one reconstructs inputs from outputs. The measurement takes into account humans' knowledge that can be used to improve the reconstruction processes. For example, given an average mark of 62%, the teacher who taught the class can normally guess the distribution of the marks among the students better than an arbitrary person.

The most common argument about the "why" question is that visualization offers insight or helps humans to gain insight. When this argument is used outside the visualization community, there are often counter-arguments that statistics and algorithms can offer insight automatically, and often with better accuracy and efficiency. There are also concerns that visualization may mislead viewers, which cast further doubts about the usefulness of visualization, while leading to a related argument that "visualization must be accurate" in order for it to be useful. The accuracy argument itself is not bullet-proof since there are many types of uncertainty in a visualization process, from uncertainty in data, to that caused by visual mapping, and to that during perception and cognition [DCK12]. Nevertheless, it is easier to postulate that visualization must be accurate, as it seems to be counter-intuitive to condone the idea that "visualization can be inaccurate," not to mention the ideas of "visualization is normally inaccurate" or "visualization should be inaccurate."

The word "inaccurate" is itself an abstraction of many different types of inaccuracy. Misrepresentation of truth is one type of inaccuracy. Such acts are mostly wrong, but some (such as wordplay and sarcasm) may cause less harm. Converting a student's mark in the range of [0, 100] to the range of [A, B, C, D, E, F] is another type of inaccuracy. This is a common practice, and must be useful. From an information-theoretic perspective, these two types of inaccuracy are information loss.

In their paper [CG16], Chen and Golan observed that statistics and algorithms usually lose more information than visualization. Hence, this provides the first hint about the usefulness of visualization. They also noticed that, as with wordplay and sarcasm, the harm of information loss can be alleviated by knowledge. For someone who can understand a wordplay (e.g., a pun) or can sense a sarcastic comment, the misrepresentation can be corrected by that person at the receiving end. This provides the second hint about the usefulness of visualization, because any "misrepresentation" in visualization may be corrected by a viewer with appropriate knowledge.

On the other hand, statistics and algorithms are also useful, and sometimes more useful than visualization. Because statistics and algorithms usually cause more information loss, some aspects of information loss must be useful. One important merit of losing information in one process is that the succeeding process has less information to handle, and thus incurs less cost. This is why Chen and Golan divided information loss into two components, a positive

component called alphabet compression and a negative component called potential distortion [CG16].

© 2021 The Author(s) with LATEX template from Computer Graphics Forum. © 2021 The Eurographics Association and John Wiley & Sons Ltd.
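To make the qualitative form of Eq. 14 concrete, the following sketch (our illustration, not code from [CG16] or [CE19]) computes alphabet compression for the mark-to-grade banding mentioned earlier, under an assumed uniform PMF over marks. The banding scheme, the potential-distortion value, and the cost value are hypothetical placeholders chosen only to show the structure of the measure.

```python
from collections import Counter
from math import log2

def entropy(pmf):
    """Shannon entropy (bits) of a probability mass function."""
    return -sum(p * log2(p) for p in pmf if p > 0)

# Toy input alphabet: integer marks 0..100, assumed uniform for illustration.
marks = list(range(101))
p_marks = [1 / 101] * 101

# Many-to-one mapping: band marks into grades A..F (hypothetical scheme).
def grade(m):
    return "FEDCBA"[min(m // 10, 5)]

# PMF of the output alphabet induced by the many-to-one mapping.
out_counts = Counter(grade(m) for m in marks)
p_grades = [c / 101 for c in out_counts.values()]

# Alphabet compression: entropy removed by the mapping.
ac = entropy(p_marks) - entropy(p_grades)

# Potential distortion and cost: placeholder values standing in for the
# reconstruction divergence and the process cost discussed in the text.
pd = 1.0
cost = 2.0
benefit_over_cost = (ac - pd) / cost
print(round(ac, 3), round(benefit_over_cost, 3))
```

The point of the sketch is only that AC is positive whenever the mapping is genuinely many-to-one, and that the benefit trades AC off against PD before dividing by cost.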

The positive component explains why statistics, algorithms, visualization, and interaction are useful, because they all lose information. The negative component explains why they are sometimes less useful, because information loss may cause distortion during information reconstruction. Both components are moderated by the cost of a process (i.e., statistics, algorithms, visualization, or interaction) in losing information and reconstructing the original information. Hence, given a dataset, the best visualization is the one that loses most information while causing the least distortion. This also explains why visual abstraction is effective when the viewers have adequate knowledge to reconstruct the lost information, and may not be effective otherwise [VCI20].

The central thesis by Chen and Golan [CG16] may appear to be counter-intuitive to many as it suggests "inaccuracy is a good thing", partly because the word "inaccuracy" is an abstraction of many meanings and itself features information loss. Perhaps the reason for the conventional wisdom is that it is relatively easy to think that "visualization must be accurate". To a very small extent, this is a bit like the easiness of thinking "the earth is flat" a few centuries ago, because the evidence supporting that wisdom was available everywhere, right in front of everyone at that time. Once we step outside the field of visualization, we can see the phenomena of inaccuracy everywhere, in statistics and algorithms as well as in visualization and interaction. All these suggest that "the earth may not be flat," or "inaccuracy can be a good thing."

In summary, the cost-benefit measure by Chen and Golan [CG16] explains that when visualization is useful, it is because visualization has a better trade-off than simply reading the data, simply using statistics alone, or simply relying on algorithms alone. The ways to achieve a better trade-off include: (i) visualization may lose some information to reduce the human cost in observing and analyzing the data, (ii) it may lose some information since the viewers have adequate knowledge to recover such information or can acquire such knowledge at a lower cost, (iii) it may preserve some information because it reduces the reconstruction distortion in the current and/or succeeding processes, and (iv) it may preserve some information because the viewers do not have adequate knowledge to reconstruct such information or it would cost too much to acquire such knowledge.

Appendix B: Formulae of the Basic and Relevant Information-Theoretic Measures

This section is included for self-containment. Some readers who have the essential knowledge of probability theory but are unfamiliar with information theory may find these formulas useful.

Let Z = {z_1, z_2, ..., z_n} be an alphabet and z_i be one of its letters. Z is associated with a probability distribution or probability mass function (PMF) P(Z) = {p_1, p_2, ..., p_n} such that p_i = p(z_i) ≥ 0 and ∑_{i=1}^{n} p_i = 1. The Shannon entropy of Z is:

    H(Z) = H(P) = − ∑_{i=1}^{n} p_i log2 p_i    (unit: bit)

Here we use the base 2 logarithm, as the unit of bit is more intuitive in the context of computer science and data science.

An alphabet Z may have different PMFs in different conditions. Let P and Q be such PMFs. The KL-divergence D_KL(P||Q) describes the difference between the two PMFs in bits:

    D_KL(P||Q) = ∑_{i=1}^{n} p_i log2 (p_i / q_i)    (unit: bit)

D_KL(P||Q) is referred to as the divergence of P from Q. It is not a metric since D_KL(P||Q) ≡ D_KL(Q||P) cannot be assured.

Related to the above two measures, Cross Entropy is defined as:

    H(P,Q) = H(P) + D_KL(P||Q) = − ∑_{i=1}^{n} p_i log2 q_i    (unit: bit)

Sometimes, one may consider Z as two alphabets Z_a and Z_b with the same ordered set of letters but two different PMFs. In such a case, one may denote the KL-divergence as D_KL(Z_a||Z_b), and the cross entropy as H(Z_a, Z_b).

Appendix C: Conceptual Boundedness of H_CE(P,Q) and D_KL

For readers who are not familiar with information-theoretic notations and measures, it is helpful to read Appendix B first. According to the mathematical definition of cross entropy:

    H(P,Q) = − ∑_{i=1}^{n} p_i log2 q_i = ∑_{i=1}^{n} p_i log2 (1 / q_i)    (15)

H(P,Q) is of course unbounded. When q_i → 0, we have log2(1/q_i) → ∞. As long as p_i ≠ 0 and is independent of q_i, H(P,Q) → ∞. Hence the discussion in this appendix is not a literal proof disputing that H(P,Q) is unbounded when this mathematical formula is applied without any change. It argues instead that the concept of cross entropy implies that the measure should be bounded when n is a finite number.

Definition 1. Given an alphabet Z with a true PMF P, cross entropy H(P,Q) is the average number of bits required when encoding Z with an alternative PMF Q.

This is a widely-accepted and used definition of cross entropy in the literature of information theory. The concept can be observed from the formula of Eq. 15, where log2(1/q_i) is considered as the mathematically-supposed length of a codeword that is used to encode letter z_i ∈ Z with a probability value q_i ∈ Q. Because ∑_{i=1}^{n} p_i = 1, H(P,Q) is thus the weighted or probabilistic average length of the codewords for all letters in Z.

Here a codeword is the digital representation of a letter in an alphabet. A code is a collection of the codewords for all letters in an alphabet. In communication and computer science, we usually use binary codes as digital representations for alphabets, such as the ASCII code and variable-length codes.


However, when a letter z_i ∈ Z is given a probability value q_i, it is not necessary for z_i to be encoded using a codeword of length log2(1/q_i) bits, or more precisely, the nearest integer above or equal to it, i.e., ⌈log2(1/q_i)⌉ bits, since a binary codeword cannot have fractional bits digitally. For example, consider a simple alphabet Z = {z_1, z_2}. Regardless of what PMF is associated with Z, Z can always be encoded with a 1-bit code, e.g., codeword 0 for z_1 and codeword 1 for z_2, as long as neither of the two probability values in P is zero, i.e., p_1 ≠ 0 and p_2 ≠ 0. However, if we had followed Eq. 15 literally, we would have created codes similar to the following examples:

• if P = {1/2, 1/2}, codeword 0 for z_1 and codeword 1 for z_2;
• if P = {3/4, 1/4}, codeword 0 for z_1 and codeword 10 for z_2;
• ...
• if P = {63/64, 1/64}, codeword 0 for z_1 and codeword 111111 for z_2;
• ...

As shown in Fig. 7(a), such a code is very wasteful. Hence, in practice, encoding Z according to Eq. 15 literally is not desirable. Note that the discussion about encoding is normally conducted in conjunction with the Shannon entropy. Here we use the cross entropy formula for our discussion to avoid a deviation from the flow of reasoning.

Let Z be an alphabet with a finite number of letters, {z_1, z_2, ..., z_n}, and Z is associated with a PMF, Q, such that:

    q(z_n) = ε,  (where 0 < ε < 2^−(n−1))
    q(z_{n−1}) = (1 − ε) 2^−(n−1)
    q(z_{n−2}) = (1 − ε) 2^−(n−2)                    (16)
    ...
    q(z_2) = (1 − ε) 2^−2
    q(z_1) = (1 − ε) 2^−1 + (1 − ε) 2^−(n−1)

We can encode this alphabet using the Huffman encoding, which is a practical binary coding scheme that adheres to the principle of entropy coding, i.e., obtaining a code whose average codeword length approaches the Shannon entropy [Mos12]. Entropy coding is designed to minimize the average number of bits per letter when one transmits a "very long" sequence of letters in the alphabet over a communication channel. Here the phrase "very long" implies that the string exhibits the above PMF Q (Eq. 16). In other words, given an alphabet Z and a PMF Q, the Huffman encoding algorithm creates an optimal code with the lowest average length of codewords when the code is used to transmit a "very long" sequence of letters in Z. One example of such a code for the above PMF Q is:

    z_1 : 0
    z_2 : 10
    z_3 : 110                                        (17)
    ...
    z_{n−1} : 111...10  (with n − 2 "1"s and one "0")
    z_n : 111...11  (with n − 1 "1"s and no "0")

Fig. 7(b) illustrates such a code using a binary tree. In this way, z_n, which has the smallest probability value, will always be assigned a codeword with the maximum length of n − 1.

[Figure 7: Two examples of binary codes illustrated as binary trees: (a) a wasteful binary tree, and (b) the most unbalanced binary tree.]

Lemma 1. Let Z be an alphabet with n letters and Z is associated with a PMF P. If Z is encoded using the aforementioned entropy coding, the maximum length of any codeword for z_i ∈ Z is always ≤ n − 1.

We can prove this lemma by confirming that when one creates a binary code for an n-letter alphabet Z, the binary tree shown in Fig. 7(b) is the worst unbalanced tree without any wasteful leaf nodes. Visually, we can observe that the two letters with the lowest probability values always share the lowest internal node as their parent node. The remaining n − 2 letters are to be hung on the rest of the binary subtree. Because the subtree is not allowed to waste leaf space, the n − 2 leaf nodes can be comfortably hung on the root and up to n − 3 internal nodes. A formal proof can be obtained using induction. For details, readers may find Golin's lecture notes useful [Gol20]. See also [CT06] for related mathematical theorems.

Theorem 1. Let Z be an alphabet with a finite number of letters and Z is associated with two PMFs, P and Q. With the Huffman encoding, conceptually the cross entropy H(P,Q) should be bounded.

Let n be the number of letters in Z. According to Lemma 1, when Z is encoded in conjunction with PMF Q using the Huffman encoding, the maximum codeword length is ≤ n − 1. In other words, in the worst case scenario, there is a letter z_k ∈ Z that has the lowest probability value q_k, i.e., q_k ≤ q_j, ∀ j = 1, 2, ..., n and j ≠ k. With the Huffman encoding, z_k will be encoded with the longest codeword of up to n − 1 bits.

According to Definition 1, there is a true PMF P. Let L(z_i, q_i) be the codeword length of z_i ∈ Z determined by the Huffman encoding. We can write a conceptual cross entropy formula as:

    H(P,Q) = ∑_{i=1}^{n} p_i · L(z_i, q_i) ≤ ∑_{i=1}^{n} p_i · L(z_k, q_k) ≤ n − 1

where q_k is the lowest probability value in Q and z_k is encoded with a codeword of up to n − 1 bits (i.e., L(z_k, q_k) ≤ n − 1). Hence conceptually H(P,Q) is bounded by n − 1 if the Huffman encoding is used. Since we can find a bounded solution for any n-letter alphabet with any PMF, the claim of unboundedness has been falsified. □

Corollary 1. Let Z be an alphabet with a finite number of letters and Z is associated with two PMFs, P and Q. With the Huffman encoding, conceptually the KL-divergence D_KL(P||Q) should be bounded.
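Lemma 1 and Theorem 1 can be illustrated numerically. The sketch below (our illustration, not part of the original appendix) builds Huffman codeword lengths for an Eq. 16-style PMF with n = 6 and a hypothetical ε = 10^−3, confirming that no codeword exceeds n − 1 bits and that the conceptual cross entropy ∑ p_i L(z_i, q_i) is at most n − 1 for an arbitrary (here uniform) true PMF P.

```python
import heapq
from itertools import count

def huffman_lengths(pmf):
    """Return {letter index: codeword length} via the classic Huffman
    two-smallest-merge algorithm, tracked with a heap of (weight, tie, depths)."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(q, next(tie), {i: 0}) for i, q in enumerate(pmf)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {k: v + 1 for k, v in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Eq. 16-style PMF for n = 6 with a tiny epsilon (hypothetical values).
n, eps = 6, 1e-3
q = [(1 - eps) * 2 ** -1 + (1 - eps) * 2 ** -(n - 1)]  # q(z_1)
q += [(1 - eps) * 2 ** -i for i in range(2, n)]        # q(z_2) .. q(z_{n-1})
q += [eps]                                             # q(z_n)
assert abs(sum(q) - 1.0) < 1e-12

lengths = huffman_lengths(q)
# Lemma 1: the maximum codeword length is at most n - 1.
assert max(lengths.values()) <= n - 1

# Theorem 1: the conceptual cross entropy sum(p_i * L_i) is <= n - 1
# for ANY true PMF P, e.g., a uniform one.
p = [1 / n] * n
conceptual_ce = sum(p[i] * lengths[i] for i in range(n))
assert conceptual_ce <= n - 1
print(sorted(lengths.values()), round(conceptual_ce, 3))
```

For this PMF the lengths reproduce the most unbalanced tree of Fig. 7(b): one 1-bit codeword, then 2, 3, 4, and finally two codewords of the maximum length n − 1 = 5.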


For an alphabet Z with a finite number of letters, the Shannon entropy H(P) is bounded regardless of its PMF P. The upper bound of H(P) is log2 n, where n is the number of letters in Z. Since we have

    H(P,Q) = H(P) + D_KL(P||Q)
    D_KL(P||Q) = H(P,Q) − H(P)

using Theorem 1, we can infer that with the Huffman encoding, conceptually D_KL(P||Q) is also bounded. □

Further discussion. The code created using Huffman encoding is also considered to be optimal for source coding (i.e., assuming there is no need for error correction and detection). A formal proof can be found in [Gol20].

Let Z be an n-letter alphabet, and Q be its associated PMF. When we use the Shannon entropy to determine the length of each codeword mathematically, we have:

    L(z_i, q_i) = ⌈log2 (1 / q_i)⌉,  z_i ∈ Z, q_i ∈ Q

As we showed before, the length of a codeword can be infinitely long if q_i → 0. Huffman encoding makes the length finite as long as n is finite. This difference between the mathematically-literal entropy encoding and the Huffman encoding is important to our proof that conceptually H(P,Q) and D_KL(P||Q) are bounded.

However, we should not draw a conclusion that there is much difference between the communication efficiency gained based on the mathematically-literal entropy encoding and that gained using the Huffman encoding. In fact, in terms of the average length of codewords, they differ by less than one bit, since both lie between H(Q) and H(Q) + 1 [CT06], although their difference in terms of the maximum length of individual codewords can be very large.

For example, if Z is a two-letter alphabet and its PMF Q is {0.999, 0.001}, the Huffman encoding results in a code with one bit for each letter, while the mathematically-literal entropy encoding results in 1 bit for z_1 ∈ Z and 10 bits for z_2 ∈ Z. The probabilistic average length of the two codewords, which indicates the communication efficiency, is 1 bit for the Huffman encoding and 1.009 bits for the mathematically-literal entropy encoding, while the entropy H(Q) is 0.0114 bits. As predicted, 0.0114 < 1 < 1.009 < 1.0114.

Consider another example with a five-letter alphabet and Q = {0.45, 0.20, 0.15, 0.15, 0.05}. The mathematically-literal entropy encoding assigns five codewords with lengths of {2, 3, 3, 3, 5}, while the Huffman encoding assigns codewords with lengths of {1, 3, 3, 3, 3}. The probabilistic average length of the former is 2.65, while that of the Huffman encoding is 2.1, and the entropy H(Q) is 2.02. As predicted, 2.02 < 2.1 < 2.65 < 3.02.

Appendix D: Authors' Revision Report following the SciVis 2020 Reviews

2 December 2020

Dear EuroVis 2021 co-chairs, IPC members, and Reviewers,

Bounded Measure for Estimating the Benefit of Visualization: Theoretical Discourse and Conceptual Evaluation
Min Chen and Mateu Sbert

The original version of this paper, "1051: A Bounded Measure for Estimating the Benefit of Visualization, by Chen, Sbert, Abdul-Rahman, and Silver", was submitted to EuroVis 2020, where the paper received scores (5, 4.5, 2.5, 2). The co-chairs recommended for the paper to be resubmitted to CGF after a major revision (i.e., fast track). Unfortunately, the first author was an editor-in-chief of CGF until January 2020 and still had administrative access to the CGF review system at that time. Hence it would not be possible for CGF to arrange a confidential review process for this paper.

The paper was revised according to the reviewers' comments and was subsequently submitted to SciVis 2020. The SciVis submission, the revision report, and the EuroVis 2020 reviews can be found as an arXiv report https://arxiv.org/pdf/2002.05282.pdf.

The SciVis 2020 submission improved upon the EuroVis 2020 submission with additional clarifications in the main text and detailed explanations in appendices, including a mathematical proof. The reviewers were not impressed, and scored the submission as [3.5, 2, 2, 1.5], which is much lower than the scores for the EuroVis 2020 submission. We appreciate that all four SciVis 2020 reviewers stated transparently that they had (and likely still have) doubts about the theoretic measure proposed by Chen and Golan, TVCG, 2016, which this paper tries to improve. We also appreciate that the reviewers stated that they tried not to let these doubts prejudice this work.

We notice that almost all comments that have influenced the rejection decision are in the form of requiring further explanations and clarifications. As the summary review states, "The paper demands a lot from the reader; without detailed knowledge of information theory concepts, as well as details of the cost-benefit ratio, ...", clearly simply addressing this issue through adding appendices has not met the reviewers' concerns. When we converted the SciVis 2020 submission into the CGF format, it resulted in 11.5 pages, 1.5 pages above the limit of the EuroVis 2021 guideline. Considering the needs (i) to add a fair amount of additional explanation and clarification, and (ii) to reduce the paper length to 10 pages, we decided to split our paper into two papers. We are aware that EuroVis and CGF rarely see two-part publications. We hence have rewritten the two submissions substantially, making them self-contained and relatively independent, while avoiding unnecessary duplications. This is the first paper, on "Theoretical Discourse and Conceptual Evaluation". The second (follow-on) paper, entitled

Bounded Measure for Estimating the Benefit of Visualization: Case Studies and Empirical Evaluation
Min Chen, Alfie Abdul-Rahman, Deborah Silver, and Mateu Sbert

is included in the supplementary materials. Similarly, this submission is also included in the supplementary materials of the second paper. Both submissions are available as arXiv reports, together with the corresponding revision reports.

In this revision report, we focus on the reviewers' concerns and queries related to the theoretical discourses in the original EuroVis and SciVis submissions. Those comments are highlighted in orange. We structure our feedback based on the summary review, and incorporate individual reviewers' comments when they are mentioned in the summary.

The comments highlighted in purple are for the follow-on paper to address.

Yours sincerely,
Min Chen and Mateu Sbert

Summary Review:

1. The need for a bounded measure for PD is not sufficiently justified; there are several arguments (see in particular R2 and R3) that this requirement is not necessary. This, however, makes the premise of the paper doubtful. (R1, R3, R4)

Clarification and Action: The previous text accompanying Tables 1 and 2 has explained the difficulties in interpreting the unbounded PD measure. We have now added a new section, Section 4 Mathematical Notations and Problem Statement, which introduces more basic information theory concepts and hopefully helps readers to appreciate the original problem statement better. We have also added new text in Section 7 Discussions and Conclusions to summarize the needs to study new measures.

Of course, one can argue that the original DKL can be used. Imagine that we had in history a temperature measurement system that had been defined based on humans' normal temperature (0 degrees) and the maximum temperature when they were ill (100 degrees), or a weight measurement system that had been a power function of the current metric system, e.g., w′ = w^100 (which would show rapid growth, as DKL does). One might argue that they are adequate enough, and there is no need to consider other new measuring systems. The history of measurement science shows the opposite. As scientists, we would not discourage a serious exploration of alternative measures.

2. Some of the authors' trains of thought cannot be fully comprehended, as there are discrepancies with earlier work that are not being clarified. (R1)

Action: We hope that the addition of Section 4 Mathematical Notations and Problem Statement helps explain the connection between the cost-benefit ratio and volume rendering via information theory. We also hope that the addition of Section 5.3 New Candidates of Bounded Measures helps explain the thought process behind the proposed measures, as well as some technical considerations, such as why we use maximum entropy to scale the potential distortion instead of the Shannon entropy itself.

3. The alleged proofs are unsustainable. (R1, R3, R4)

Clarification and Action: We guess that the reviewers may have misinterpreted the proof as proving that the mathematical formula of DKL is bounded. Obviously this is not the case. The proof is about a key concept that DKL and Cross Entropy are supposed to model. The proof showed that this key concept is bounded if the alphabet has a finite number of letters, while DKL cannot assure a bounded solution. This was explained in Appendix C, but was not in the main text due to the previous space limitation. Because we have a bit more space now after splitting the paper, we have added further clarification in Section 5.1 A Conceptual Proof of Boundedness.

4. The proposed alternative measures are sometimes very ad hoc (R1, R3, R4), some are dimensionally doubtful (R4), they are (probably) not additive (R3) and there is no natural language interpretation (R1); other reasonable choices would have been possible (R3).

R4: I am also skeptical about the quantity D^k_new in Eq. (9), except for the case k=1. If one does a dimensional consideration by (for the moment) leaving out the summand 1 in the logarithm in Eq. (9), ...

Clarification and Action: Due to the page limits associated with the previous submissions, we did not explain the semantic meaning of the proposed measures. We have added a new subsection, Section 5.3 New Candidates of Bounded Measures, where we described the semantic meaning of the proposed measures, addressing the "natural language interpretation issue" raised by R1.

We also explain the reason for considering the power factor k. This is a common approach in defining statistical measures and information-theoretic measures. This is just part of the research investigation to study different options while avoiding a potentially-biased assumption by fixing k to 1. In fact, we started this investigation with k = 1. It was our curiosity that led us to consider other k values. We believe such an approach in research should be encouraged.

We would also like to point out that Shannon entropy itself is a parameterized formula, as different logarithmic scales have been used in different applications, such as the natural log in thermodynamics, log2 in computer science and communication theory, and log10 and the natural log in psychology. It is important for us scientists not to dismiss such diversity as "undesirable".

On the additive nature of the measure, this depends on the definition of the addition operator. Divergence measures are often not distance metrics. Addition of two divergence values is not always semantically meaningful. We encourage all researchers to continue the investigation into different mathematical properties of the measures considered in this paper, as well as any measures that may be proposed in the future for visualization. There is no reason that visualization researchers should not be involved in developing mathematical concepts and formulae, especially when they are for visualization.

5. The evaluation of measures has a strong ad hoc character (R1, R2, R3, R4); with the combination of Likert scales and MCDA almost any connection to information theory is lost (R3).

Clarification and Action: We disagree with the reviewers' critique about the "strong ad hoc character". If this critique were valid, the same could be applied to most research efforts in the history of measurement science. For many measurement systems, such as time, temperature, length, weight, earthquake magnitude, etc., there are usually only partial constraints defined by physical principles, such as the starting point of the universe and absolute zero temperature, which are also sometimes associated with some uncertainty. Most parameters such as critical point, interval, and linearity are defined based on different criteria. Otherwise, we would not have debates about imperial vs. metric systems, or Celsius vs. Fahrenheit vs. Kelvin vs. Rankine vs. Réaumur for temperature measurement. Of course, we should all aspire to discover more mathematically-defined principles for evaluating different measurement systems designed to measure the benefit of visualization. However, it is not reasonable to reject the MCDA approach based on a speculation of the existence of a complete set of principles. We have added new text at the beginning of Section 6 Comparing Bounded Measures: Conceptual Evaluation, and in Section 7 Discussions and Conclusions, drawing on some historical facts in measurement science to support our approach.

In terms of R3's comment about the connection with information theory, part of the problem studied in this paper is that there are several optional information-theoretical measures and statistical measures. They exhibit different properties, in ways similar to the options of Celsius, Fahrenheit, Kelvin, Rankine, Réaumur, and other scales for temperature measurement. Our decisions on which temperature measurement system to use are based on our multi-criteria judgement. For now, choosing which measure to use for measuring the benefit of visualization is perhaps more complex. Using MCDA helps to make the process more transparent and rigorous than without.

6. The paper demands a lot from the reader; without detailed knowledge of information theory concepts, as well as details of the cost-benefit ratio, the paper is difficult to understand. It would be better if the basic terms and concepts were briefly presented again in the paper, with reference to the appendix (R1). It would be nice if some of the two case studies were interwoven into the mathematical presentation to better anchor the concepts (R2). The notation needs to be improved (R1).

Action: We have now added a new Section 4 Mathematical Notations and Problem Statement, which includes 3/4 pages of new text and a new figure (Figure 4) for introducing the mathematical notions and basic concepts of information to help readers who are not familiar with information theory. We have adopted R2's suggestion for making links to the case studies. In this paper, we have made links with the volume rendering case study, while in the follow-on paper, we made links with the metro map case study.

7. The validation of the PD measure is questionable (R1, R3, R4); for details see in particular R3.

R3: Hence, the chosen measure was not, it turns out, actually determined by information theory, and its actual value for *quantitative* analysis was never demonstrated.

Clarification and Action: We guess that the reviewers were concerned about the semantic meaning of the proposed measure, rather than its "validation". We have added a new subsection, Section 5.3 New Candidates of Bounded Measures, where we described the semantic meaning of the proposed measures.

There are numerous entropy and divergence measures in information theory, exhibiting different semantic meanings and mathematical properties. One approach is to test the candidates of bounded measures in different circumstances to see how they affect the semantic interpretations, in a way similar to interpreting what a negative temperature measure means, or how earthquake magnitude 8 relates to magnitude 5. This is why we conducted two experiments to collect real data. Hence we do not agree with R3's comment that the "actual value for *quantitative* analysis was never demonstrated." The case studies of volume rendering and metro map are exactly for that. R1 and R4 did consider the two case studies as part of the validation process, while requesting more analysis of the collected data (R1) and critiquing the experimental design (R4).

Note that the two experiments have now been moved to a separate follow-on paper. They are no longer "demonstrated" in this paper as in the previous EuroVis 2020 and SciVis 2020 submissions. R1 and R4's comments will thus be addressed in the separate follow-on paper. Nevertheless, we added a discussion in Section 7 Discussions and Conclusions to state that the two case studies are part of the evaluation process.

Appendix E: SciVis 2020 Reviews

We regret to inform you that we are unable to accept your IEEE VIS 2020 SciVis Papers submission:
1055 - A Bounded Measure for Estimating the Benefit of Visualization
The reviews are included below. IEEE VIS 2020 SciVis Papers had 125 submissions and we conditionally accepted 32, for a provisional acceptance rate of about 25%.
......
—————————————————————-

Reviewer 1 review

Paper type: Theory & Model
Expertise: Expert
Overall Rating: 2 - Reject
The paper is not ready for publication in SciVis / TVCG. The work may have some value but the paper requires major revisions or additional work that are beyond the scope of the conference review cycle to meet the quality standard. Without this I am not going to be able to return a score of '4 - Accept'.
Supplemental Materials: Acceptable with minor revisions (specify revisions in The Review section)

Justification

The paper presents a new measure to assess the benefits of using visualization. The measure revises Chen and Golan's cost-benefit ratio by proposing to measure distortion differently. The previous measure was unbounded and this issue is addressed in the revised measures. The work is evaluated using analysis of two case studies where subjects' distortion was assessed through questionnaires and compared with the distortion as indicated by the proposed measures.

Overall, the paper presents a novel idea, but it is not ready for publication. It lacks detail on fundamental concepts that were presented in earlier work, but are not widely known in the community. There are notable discrepancies to prior work, that are never addressed. The premise of the work is dubious, since the unbounded measure of distortion indicates that, according to the measure, visualizations can be arbitrarily bad, but in practice, where the aim is to maximize benefit, this is not a problem, since benefit can be bounded from above in the measure. The rating-based (pre-)selection of the proposed alternatives needs more details on the rating process to be objective. The suggested alternatives are rather ad-hoc and only compared to each other, but never to the previous distortion measure in whether they capture distortion and thus benefit more accurately. The appropriate validation of a measure is to show that it does measure what it was intended to measure. The paper is currently still lacking this and it is questionable whether the authors can provide this validation within the reviewing cycle.

The Review

I will elaborate on the issues mentioned above.

1. The paper lacks detail on fundamental concepts. This includes both information-theoretic concepts as well as details concerning the cost-benefit ratio. Although the authors provided a brief introduction in the appendix, this should really be in the paper in a brief form. Elaborations can remain in the appendix. Information that is crucial to understand Equation (2) is lacking: there is no discussion of what Z'_i denotes, nor how Z_i, Z_i+1, and Z'_i are related. Note how much the paper by Chen and Jänicke 10 years ago elaborated on the basic concepts in contrast to the submitted work. The authors may have spent 10 years thinking deeply about information theory for visualization and are thus deeply familiar with it, but they should respect the needs of the readers, who mostly studied something else. Especially, if the authors want a large part of the field of visualization to understand and apply the methods they propose.

2. There are also notable differences to prior work that are glossed over. Chen and Golan describe visualization as a sequence of transformations from data to a decision and say that the benefit for the whole sequence is H(Z_0) - H(Z_n+1) - sum_i=0..n D_KL(Z'_i || Z_i); i.e. the distortion accumulates. In contrast, the formula in the current submission comprises just one transition from the "ground truth" (which presumably isn't the data) to the decision variable about the ground truth. Why do the intermediate distortions no longer show up? And how does ground-truth relate to data?

3. The premise of the work is dubious. The introduction notes that the measure of distortion used in the cost-benefit ratio is unbounded. It shows both that it is unbounded, as well as that even for simple cases, the distortion can be arbitrarily large, even though the information is low, and that this is counter-intuitive.

......
was not applied to any of them. Instead the authors performed two studies. However, they do not fully model these studies in their framework. There is no mention of the sample space underlying the experiment, or the random variables. How can we estimate the probability distribution for the decision variable from the responses? The authors instead use one probability distribution for each answer. Is this meaningful? It removes all the probabilistic effects that underlie the foundation of the measures in the first place. Only some outcomes of the studies are discussed. Why? The distributions for ground truth are guessed, while there is arguably a ground truth present from which the distribution can be read off (e.g. the number of subway stations between two given stations). In the discussion, the images do not seem to play a role, only the distortion between ground truth and response. Furthermore it is confusing that both Q and Q' are described as probability distributions for ground truth, that F remains mysterious, and that the value for alphabet compression cannot be traced. Do the topological and the geographical subway maps have the same alphabet compression? Nor is it clear how Q, Q', and F relate to Z_i, Z_i+1, and Z'_i in Equation (2). The distribution function for Q in 5.2 is also ill-justified.

7. Lastly: notation. The paper uses notation rather freely, e.g. to specify probability distributions, which are functions, using set or interval notation, which is non-standard, inconsistent, and confusing for people well-versed in probability theory. Similar concepts get different symbols, and the same thing is referred to by different expressions (e.g. H(Z), H(P(Z)), H(P)). It is also very confusing when probability distributions are denoted by their alphabets, in particular when different symbols
only the distortion between ground truth and response. Furthermore it is confusing 2. There are also notable differences to prior work that are glossed over. Chen and that both Q and Q’ are described as probability distributions for ground truth, that F Golan describe visualization as a sequence of transformations from data to a de- remains mysterious, and that the value for alphabet compression cannot be traced. cision and say that the benefit for the whole sequences is H(Z_0) - H(Z_n+1) - Do the topological and the geographical subway maps have the same alphabet com- sum_i=0..n D_KL(Z’_i || Z_i); i.e. the distortion accumulates. In contrast, the for- pression? Nor is it clear, how Q, Q’, and F relate to Z_i, Z_i+1, and Z’_i in Equation mula in the current submission comprises just one transition from the "ground truth" (2). The distribution function for Q in 5.2 is also ill-justified. (which presumably aren’t the data) to the decision variable about the ground truth. 7. Lastly: notation. The paper uses notation rather freely, e.g. to specify probabil- Why do the intermediate distortions no longer show up? And how does ground-truth ity distributions, which are functions, using set or interval notation, which is non- relate to data? standard, inconsistent, and confusing for people well-versed in probability theory. 3. The premise of the work is dubious. The introduction notes that the measure Similar concepts get different symbols, and the same thing is referred to by differ- of distortion used in the cost-benefit ratio is unbounded. It shows both that it is un- ent expressions (e.g. H(Z), H(P(Z)), H(P)). It is also very confusing when probability bounded, as well as that even for simple cases, the distortion can be arbitrarily large, distributions are denoted by their alphabets, in particular when different symbols even though the information is low, and that this is counter- intuitive. 
However, I will for alphabets refer to probability distributions over the same alphabet. The authors hold, that the unboundedness is not a problem in practice. Since H(Z_i) is bounded should make notation consistent both within and between paper and appendix, and from above by log |Z_i|, H(Z_i+1) and D_KL(Z’_i || Z_i) are bounded from below use established conventions in probability theory and information theory. by 0, the benefit, defined as H(Z_i) - H(Z_i+1) - D_KL(Z’_i || Z_i), is bounded from —————————————————————- above by log |Z_i| as well. It may not be bounded by below, but since the aim is Reviewer 2 review to maximize the cost-benefit ratio, this is of no concern. Only maximizing functions that cannot be bounded by above are problematic. Note that log-likelihood scores Paper type also cannot be bounded from below, yet statistics has no reservations in using them Theory & Model for maximum-likelihood estimation. A different motivation would be Kulback-Leibler Expertise divergence overestimates distortion and therefore non-confusing visualizations are Knowledgeable incorrectly claimed to be confusing by the cost-benefit ratio. Chen an Golan wrote that negative benefit indicates confusing visualizations. How well do different mea- Overall Rating sures for distortion align with this assessment? For which measures does the sign 3.5 - Between Possible Accept and Accept of the benefit measure indicate confusing visualizations? For which measures do Supplemental Materials benefit values correlate better with visualization quality? Acceptable 4. It doesn’t help the discussion that the paper claims that Kullback-Leibler diver- gence is unbounded, and then provide a "proof" in Section 4.1 that it is bounded for Justification particular choices of Q. 
Since the measure of distortion is applied in the paper to The manuscript is an interesting mathematical foray into how loss of information can relate the ground truth and the decision variable, both of which have probability dis- actually benefit a visualization user, and what the bounds of the distortion- loss might tributions not under our control, the premises of the proof are not met and its result be. The idea, the analysis, and the proof are valuable contributions to the visualiza- is therefore irrelevant. The section should be removed. tion field, in particular since the conclusion is counterintuitive to many audiences, 5. The suggested alternatives are rather ad-hoc. While Kullback-Leibler divergence, and a hangup of some domain experts (those who decline to use scientific visual- cross entropy have interpretations in data transmission contexts (which are even re- ization altogether because of inherent distortions). On the for improvement side, the counted in the paper: the expected number of (excess) bits when using code lengths manuscript is not an easy read, and could also benefit from certain clarifications. The for symbols that would be optimal for a distribution P,when the actual symbol proper- two real world examples are intriguing, although the domain motivation (why these ties are given by distribution Q). There is no interpretation of the new measures given questions) could be further clarified and discussed. (in the sense of how to translate values for these measures to meaningful natural- The Review language sentences), yet these previous measures are declared unintuitive and the The manuscript contributes a fascinating mathematical foray into how loss of infor- new one’s intuitive without any indication of why. The assessment of the different mation can actually benefit a visualization user. 
This is a valuable contribution to the measures uses a catalogue of criteria and values are assigned to each measures, field, in particular since the conclusion is counterintuitive. A major drawback is that again, rather ad-hoc. There are no clear rules, under which conditions a certain the manuscript is not an easy read, in particular for a potential computer science number (ranging from 0-5) is assigned, nor a discussion of inter-coder consistency or design graduate student, which is a pity, and that it could benefit from certain (Would different people assign the same numbers?). Overall, I do not see that such clarifications. qualitative assessments are helping. See next point. [Note that Shannon put forth axioms that a measure for uncertainty should have and showed that entropy is the The manuscript makes several significant contributions, including: only formula meeting all the properties specified axiomatically. This however is quite ** a theoretical discussion of information loss, based on a theoretical basis in infor- different to the approach taken by the authors of the present submission.] mation theory. The information theory premise is valid, and builds on peer-reviewed 6. The proper validation of a measure is to show that it measures what it is intended prior publications. The same approach gave us, years ago, the first semblance of to measure. Although there are numerous studies in visualization that establish one formalism in analyzing Shneiderman’s "Overview First" mantra, and putting some design as working or working better with respect to some task, the cost-benefit ratio bounds to it (good for novices, bad for experts). Albeit not perfect, that formalism

opened up the possibility of a discussion in the field. Information Theory is one possible theoretic framework, leading to discussion in the field, and so it has merit.

** an interesting theoretical discussion of potential distortion (PD), as a term that could be calibrated and bounded. There seems to be disagreement in the field about whether PD should be absolutely minimized, or it could be calibrated as part of a benefit ratio. That is good: it means PD is something we need to discuss as a field. This manuscript is the beginning of that conversation. The PD subway example in the manuscript is brilliant, although somewhat cryptic, in that it illustrates well these different views of PD, and how we may be at cross-purposes when discussing it: In NYC, the shortest path involves exiting the subway system, walking for a few minutes on the street (hence the length judgment, and the importance of PD), and re-entering the subway system at a different line. In contrast, in the [Chicago?] subway, all the lines are well connected, and all that matters is the number of stops (PD almost unimportant). However, had I not lived in NYC for a few years, I wouldn't even know these things. It's possible the authors themselves are unaware of an alternative interpretation of subway maps and PD—than the one they used in the paper. This may be the case about the manuscript medical vis example, as well—I do not have experience educating novice medical students, but the authors do. This fact, that the manuscript looks at PD from a different point of view than me, has tremendous value to me. It would be helpful to clarify that this manuscript views things differently from several of us, and that overall, sometimes PD is just a quantity to be minimized, whereas other times PD is a quantity where bounding would be useful.

** an ad hoc construction of a bound, which the manuscript evaluates via a user study. Whereas the construction is ad hoc, and there must be other ways of approaching the problem, I do not think it detracts significantly from the manuscript. It works reasonably well as a proof of concept.

Overall, this manuscript seems to me to present great opportunities for the sci vis field to grow.

Suggestions for improvement:
* It would be great if some part of the two case studies were weaved into the mathematical exposition, to help anchor the concepts. Humans are great at generalizing from examples. For example, I really can't parse this statement: "Suppose that Z is actually of PMF P, but is encoded as Eq. 4 based on Q." What does "based on Q" mean here? The easier to follow the proof and function construction, the better. In general, it might be worth moving the proof and ad hoc construction to the appendix, or compacting them.
* Instead of reiterating what was done, step by step, in the conclusion, it would be good to see some discussion/summary of assumptions, limitations, and where the work agrees or disagrees with other published observations.
* The information theory approach and the PD discussion are interesting, and a valid way of approaching this problem. The user study is also a reasonable way to evaluate this theory. However, here and in previous manuscripts in this vein, it seems to me the supporting evidence is subject to how we interpret the term "information". For example, in the two case studies, the questions asked are not typical of the two domains in some settings, but they may be typical in other settings. E.g., a clinician would be interested in where an anomaly is located and in its shape, and how much uncertainty there is; whereas a nursing student might be interested in general literacy of the type described in the manuscript evaluation. In the second case study, a Chicago subway user would often be interested in the number of stops and line changes, and not so much in walking distance, because the lines are well connected; whereas a NYC subway user often needs, for the fastest route, to come out of the subway system, walk for 5 minutes, and enter the subway system at a different point – I see how the type of judgments used in the second case study would be useful there. These considerations should appear both in the introduction, to help motivate the manuscript, and in the discussion, to help anchor it.
* A reworking of most figure captions would be good (see Minor comments below).

Minor: I am afraid that we are becoming so mathematically illiterate as a field, that the manuscript would benefit from reminding the readers what "unbounded" means in math terms. Also, a gentle brief definition of terms unfamiliar to a graduate student (e.g., "MCDA, a sub-discipline of operations research that explicitly evaluates multiple conflicting criteria in decision making", "data intelligence workflow, *def here*", "cross entropy") might encourage students to keep reading the paper.

When discussing the relationship between AC and PD, I'd be interested in any practical examples supporting the increase/decrease illustrated in Fig. 2.

Citations are grammatically invisible. Not good: "observational estimation in [43]". Good: "observational estimation in Smith et al. [43]" or "observational estimation [43]".

Fig 6 and Fig 7
* Some reorganization of the caption would be helpful (e.g., in Fig 7, I'd start with "Divergence values for six users using five candidate measures, on an example scenario with four data values, A, B, C, D.")
* "them" -> "the data values" (?)

Fig 7
* typo: "A, B, C, are D." (and D).
* clarify that "pr." stands for "process", or remove

—————————————————————-

Reviewer 3 review

Paper type
Theory & Model

Expertise
Expert

Overall Rating
1.5 - Between Strong Reject and Reject

Supplemental Materials
Not acceptable

Justification
The basic story of the paper is: information theory is a promising way to characterize the process of data vis because it gives ways to quantify information transmission. From some of the measures it provides (Shannon entropy, and KL divergence), the authors have previously [11] developed a measure of "benefit" (eq 2). But now, it is apparently a big problem that the negative term in benefit ("potential distortion" or PD) is unbounded, so a bounded measure is needed, and the paper aims to make a well-informed pick from among many possible choices. However, the authors do not make a case that, insofar as this is about data visualization, a bounded measure is actually a problem that merits a whole paper, and the evaluation of the new measure, such as it is, has a strained relationship to both information theory and to visualization practice. This submission is on a topic of interest to visualization research, and it may be enthusiastically received by the adherents of [11] and the related work in the same vein. The submission should stand on its own as a piece of scholarship, but does not.

The Review
The authors are pursuing information theory as a means of putting data vis on mathematical footing. The current submission relies heavily on a previous paper [11]. In response to a previous round of reviews (from a rejected EuroVis 2020 submission), the authors were defensive about critiques that they felt were of [11], rather than of the new work, and have written a new summary of the [11] work as Appendix A. However, every submission presents a new opportunity for scrutiny, from a new set of reviewers, which ideally generates new opportunities for authors to strengthen or reconsider a line of work, even if initial steps have passed peer review. Nonetheless, this review will strive to consider the new submission on its own terms.

My high-level considerations are:
((1)) the motivation of needing a bounded information theoretic measure of PD, and the mathematical strength of the connection to information theory, and
((2)) "intuitive" and "practical" (as noted at the end of paragraph 4) applicability to vis applications

First ((1)):
We accomplish theoretical research when we can reframe new things in terms of

pre-existing and trusted theory. The authors trust information theory, and advertise, in the Abstract and Intro, how they are drawing on information theory. However, the authors seem bothered (top of 2nd column, 1st page) by the fact that the KL divergence when communicating a binary variable can be unbounded, because "the amount of distortion measured by the KL-divergence often has much more bits than the entropy of the information space itself. This is not intuitive to interpret ..". Later they say (middle 2nd column, page 3) "it is difficult to imagine that the amount of informative [sic?] distortion can be more than the maximum amount of information available."

Actually, it is not difficult to imagine, if you accept information theory (as the authors surely want us to do), hence this is a worrisome start to the math foundations of this work. If you have an extremely biased coin Q (say, p(head)=0.99, p(tail)=0.01), and want to use bits to communicate a sequence of flips of Q, the optimal scheme will involve something like using "0" to represent some number (greater than 1) of heads in a row, i.e., some kind of compression, which can be very efficient for Q given how low the entropy is: 0.081. But then, if you are tasked with encoding a sequence of flips of an *unbiased* coin P (p(head) = p(tail) = 0.5), you are going to be very disadvantaged with your Q-optimized encoding, so much so that it could easily more than double the expected length of the encoding, as compared to a P-optimized encoding (i.e. "0"=head, "1"=tail). As the authors surely know, the KL divergence D_KL(P|Q) tells you the expected number of extra bits needed to represent a sample from P, when using a code optimized for Q, rather than one optimized for P. For this specific example, D_KL(P|Q) = 2.32918, which motivates how D_KL can naturally exceed 1 even for a binary event. But again, the authors probably know this (in fact their section 4.1 is closely related to this situation), so it is unclear what they mean by saying this is "unintuitive".

Also, if the worry is that KL divergence is unbounded because of the places where probabilities can become very small, then why not try some smoothing? That is a known thing in ML; for example this paper:
https://papers.nips.cc/paper/8717-when-does-label-smoothing-help.pdf
talks about "label smoothing" in the context of minimizing cross entropy (which is the same as minimizing KL divergence if the original or data distribution P is fixed).

Another good property of KL divergence is its additivity: P=P1*P2 and Q=Q1*Q2 ==> D_KL(P|Q) = D_KL(P1|Q1) + D_KL(P2|Q2), which is relevant for considering how models and phenomena can have both coarse-grained and fine-grained components. KL divergence nicely analyzes this in terms of simple addition, which would also seem to be valuable for the authors, given how in their driving equation (1) KL divergence appears in the numerator of a fraction.

Instead, the authors invent some new measures (9) and (10), which raises various red flags. Inventing new measures for information divergence seems like an interesting research direction, but (red flag #1) the most informed peer reviewers for that will be found in statistics, so it is a little scary for the value of these measures to be evaluated by vis researchers, when fundamentally, the underlying math actually has nothing to do with visualization (any more than it has to do with any other application of statistics). Red flag #2 is that the new measures involve a free parameter "k", unlike the standard information-theoretic measures, but maybe this is okay in the same way that smoothing is okay. Still, it doesn't seem like the new measures will be additive, or this isn't shown (red flag #3). Red flag #4 is that section 4.1 ("A Mathematical Proof of Boundedness") is not a proof at all - it simply describes one specific scenario where the optimal encoding for Q will require more, but a bounded amount more, bits to encode a sample of P. That does not prove anything, nor does it illuminate anything about how intuitive or reasonable information divergence measures should behave in general. The many red flags undermine the desired rigorous connection to information theory. So the connection to stats or information theory, as it is commonly used in current research, appears weak, which makes the proposed new measure appear contrived.

If, at the end of the day, there is only a tenuous connection to information theory, and yet there is a perceived need to have a bounded information divergence measure, why not redirect the creative thought that went into formulating the new measures (9) and (10), into some bounded monotonic reparameterization of D_KL, like arctan(D_KL)? The authors have not explained why this much simpler and obvious strategy is not acceptable, and yet, based on their motivation from Figure 2, it should suffice. Figure 2 nicely documents the desired *qualitative* effects of varying AC, PD, and Cost. None of the later quantitative arguments were compelling, so it was too bad the authors did not pursue a simpler way of meeting their goals that maintains at least some connection to D_KL.

We finally turn to how the different possible measures are actually compared and evaluated. Criterion 1 of Section 4.3 essentially amounts to looking at lots of plots, and making qualitative judgments about what behaves in the desired way. A suitably formulated monotonic reparameterization of D_KL could have done well here, and the qualitative nature of this evaluation undermines the authors' mission to find quantitative measures for visualization. Criteria 2 through 5 in Table 3 involve a 1–5 Likert scale (!) for judging the various alternative measures. Subsequent criteria (described below) are used in a "multi-criteria decision analysis (MCDA)" analysis, a fancy way of saying: a ranking based on some heuristically selected weighted averages of various desiderata. The combination of Likert scales and MCDA is where we lose any remaining connection to information theory. How we know: a different set of equally plausible candidate measures, under a different set of criteria, with different Likert scales, and different weightings, could have produced a different winner. Hence, the chosen measure was not, it turns out, actually determined by information theory, and its actual value for *quantitative* analysis was never demonstrated.

Now for consideration ((2)): "intuitive" and "practical" (as noted at the end of paragraph 4) applicability to visualizations.

The exposition starts with the claim that "inaccuracy" is ubiquitous, even necessary, in visualization, and that information theory promises to quantify the inaccuracy. Even if this claim was made in [11], it bears scrutiny as the foundation of this paper. As Alfred Korzybski tells us, "the map is not the territory" and "the word is not the thing". OF COURSE there is information missing in a visualization; it wouldn't be visualization, and it wouldn't be useful for visual analysis and communication, if it merely re-iterated all the data. Visualizations are useful insofar as they make task-specific design decisions about WHAT to show (and what not to show), and audience-specific design decisions about HOW to show it. The editing task of choosing what to show is where the authors would point out "inaccuracy", but that is not how visualization research, or scientific visualization research in particular, understands that term, regardless of [11].

The later evaluation criteria in Sec 4.3 do not convey a realistic consideration of visualization applications. In Criterion 6 the authors embrace an encompassing view of an analysis workflow that has to end in some binary decision, in which the mapping from data to decision actually has to transmit less than a single bit (since p(good)=0.8, so entropy is ∼0.72). If some decision support system were so refined and mature that the human user need only make a binary decision, and the accuracy of that decision is the entirety of the story, does that seem like a compelling setting for visualization research? There is no room in the analysis for using visualization to justify or contextualize the decision, and no downstream stakeholders have any role in the analysis, so why even consider visualization, instead of a well-trained machine learning method? The story for Criterion 7 is nearly as contrived, and again not actually admitting any visualizations per se, so it is not compelling as a setting for evaluation.

Vis researchers use case studies with the belief that, even though they assess some limited and contrived scenario, they nonetheless reveal something informative and representative about broader and more general situations. The two case studies were unconvincing individually, but there is also a general problem: where are the encodings of the many alphabets Z_i, Z_i+1, etc., one for each step of some workflow? I thought the authors' theory is based on analyzing each of those steps, but the case studies collapse everything from some simplistic reduction of the data, to some simplistic reduction of a viewer's response. Because the case studies are so structurally disconnected from all the previous mathematical exposition, it's harder to see how the case study results support the goals of the paper.

The Volume Visualization (Criterion 8) case study just doesn't work. The questions asked about volume visualization in the questionnaire are completely unrecognizable

© 2021 The Author(s) with LATEX template from Computer Graphics Forum © 2021 The Eurographics Association and John Wiley & Sons Ltd. M. Chen and M. Sbert / A Bounded Measure for Estimating the Benefit of Visualization: Theoretical Discourse and Conceptual Evaluation to me, as someone with experience with real-world biomedical applications of volume Furthermore, the authors are looking for a more suitable mathematical quantity that rendering. The questionnaire involves variations on the same false premise- that a is bounded. They propose some candidates for this; the selection of the most suitable good visualization is one that allows you to infer the data from the picture. No: in quantity seems to me to be rather ad hoc and arbitrary. The main reason is that the volume visualization the whole point is that there are huge families of transformations selection process is not based on realistic examples. on the data that are purposely invisible (a task will typically require scrutinizing some Correcting these shortcomings seems to me to require a major revision, which is not materials or interfaces and ignoring others), and, the questions to answer for the task possible within the short revision cycle. are not about recovering the original data values at any one location in the rendered image, but rather about inferring higher-order properties about structures and their The Review geometric inter- relationships (e.g. what is the shape of the bone fracture, or, what Preliminary remark: is the size and extent of a lesion relative to other anatomy). That all the questions I’m convinced that information theory plays an important role in the foundation of here involve pre-existing renderings, of *old* test datasets (wholly detached from visualization. 
any real-world application), was economical on the part of the authors but deficient as scholarship. I am unconvinced that this questionnaire can shed light on how to evaluate volume visualizations, or how to evaluate measures for evaluating volume visualizations.

The London Tube map (Criterion 9) case study is more interesting, but describing Beck's map as "deformed" with a "significant loss of information" risks missing the point. Beck recognized that for the commuter's task of navigating the subway, absolute geography is irrelevant (because you can't see it), but topology and counting stops matter (because that's what you do experience). Beck made a smart choice about what to show. Both kinds of maps used in the study are faithful: one is faithful to geography (while distorting the simplest depiction of the topology), and one is faithful to the topology (while distorting geography). There is something interesting to learn from a case study about measuring the mismatch between geographic vs topological maps, and geographic vs topological tasks, but this paper doesn't convince me that information theory is at all relevant to that. Instead, the case study is used here to help choose a bounded information distortion measure, but again the reliance on a Likert scale ("spot on", "close", "wild guess") is a sign that the ultimate choice of measure is much less connected to information theory than the Abstract and Intro imply.

The more interesting component of the London Tube map case study was that cost was considered. But hang on: why is the choice of a bounded information divergence measure – for the numerator in (1) – so crucially important, when the choice of cost – for the denominator in (1) – is basically left completely unspecified? The authors say that cost can be different things "(e.g., in terms of energy, time, or money)", but if the goal is meaningful quantitation, how can cost be so under-specified? It could have a huge effect on the relative cost/benefit ratios of comparable vis methods, which further undermines the idea that the choice of bounded divergence measure alone warrants a paper.

—————————————————————-

Reviewer 4 review

Paper type
Theory & Model

Expertise
Knowledgeable

Overall Rating
2 - Reject
The paper is not ready for publication in SciVis / TVCG. The work may have some value but the paper requires major revisions or additional work that are beyond the scope of the conference review cycle to meet the quality standard. Without this I am not going to be able to return a score of '4 - Accept'.

Supplemental Materials
Acceptable

Justification
The paper deals with the improvement of a quantitative measure for the cost-benefit ratio of visualizations. In previous work (Ref. [11]) a proposal was made for this ratio. In the numerator there are two terms, one bounded, the other unbounded. The authors argue that both terms should be bounded. The argumentation does not seem quite conclusive to me; at least it is not clear to me why this is so important.

However, I believe that information theory based on Shannon's mathematical theory of communication (which operates largely on a syntactic level) is not sufficient for this. Semantics and task-dependent weighting play too big a role in visualization. Although semantics can be represented by suitable alphabets, these have to be defined against the background of complex knowledge frameworks into which the user sorts the perceived information, utilizing perceptual and content-related prior knowledge. Therefore, I think that Shannon's theory represents a rigorous constraint on any further theory formation about all semantic and pragmatic aspects of information, as well as about visualization. The question is how strong this constraint is and which dimensions need to be added to reflect the importance of semantics in data visualization.

Furthermore, I think that the benefit of a visualization depends very much on how cognitive peculiarities (e.g. the particular strength of processing spatial information or the limited capacity of short-term memory) are taken into account in the visual design. I do not see (maybe I know too little about this) how it is taken into account in the information-theoretic framework.

As I am not an expert in these matters, I do not wish to include my fundamental doubts in my assessment. I try to evaluate the paper by asking what it brings to the table, *assuming* that the described information theory approach is adequate for "estimating the benefit of visualization".

—– Review:

Why should we be interested in estimating the benefits of visualization quantitatively? Of course, because it would help to advance visualization. So the overall goal seems to make sense.

The paper is based on Ref. [11], in which Chen and Golan proposed an information-theoretic measure for the cost-benefit of data visualization workflows. They propose to measure the cost-benefit ratio by (alphabet compression - potential distortion) / cost. Here, "alphabet compression" is the information loss due to visual abstraction and the "potential distortion" is the informative divergence between viewing the data through visualization with information loss and reading *all* the data.

The "alphabet compression" is computed as the entropic difference between the input and output alphabets. This is a bounded quantity. The "potential distortion" was measured in Ref. [11] as the Kullback-Leibler (KL) divergence of an alphabet from some reference alphabet. This quantity is unbounded. The only goal of the paper is to find an appropriate information-theoretical quantity for the "potential distortion" which is also *bounded*.

As a reason for this goal, the authors say that the benefit calculated using the KL divergence "is not quite intuitive in an absolute context, and it is difficult to imagine that the amount of informative distortion can be more than the maximum amount of information available."

From a practical point of view, I do not really see the problem: as I mentioned before, the goal is to improve visualization, which includes trying to keep the "potential distortion" small. In other words, you are practically working at the other end of the scale, and you would not even feel that the quantity is not bounded. For example, in the visualization of high-dimensional data, there is an entire branch of research that is mainly concerned with finding projections with the least possible distortion.

The fact that the amount of information distortion can be greater than the maximum amount of available information could be seen as a theoretical disadvantage. To better understand the motivation for the paper, I would like to see this point worked out in more detail.
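[The unboundedness under discussion is easy to demonstrate numerically. The following minimal sketch is illustrative only — the two PMFs are invented for this note, not taken from the paper: when Q assigns near-zero probability to an outcome that P favours, D_KL(P‖Q) exceeds the maximum entropy log2(n) of the alphabet, which is the bound the authors would like PD to respect.]

```python
import math

def entropy(p):
    """Shannon entropy of a PMF, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

n = 4
h_max = math.log2(n)                 # maximum entropy of a 4-letter alphabet: 2 bits
p = [0.97, 0.01, 0.01, 0.01]         # invented example PMFs
q = [1e-6, 0.5, 0.25, 0.249999]      # q nearly excludes the outcome p favours

print(h_max)                         # 2.0
print(kl_divergence(p, q))           # roughly 19 bits, far above h_max
```

[Letting q[0] shrink further makes D_KL(P‖Q) arbitrarily large, so the quantity has no bound in terms of the alphabet at all.]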

© 2021 The Author(s). Computer Graphics Forum © 2021 The Eurographics Association and John Wiley & Sons Ltd.
M. Chen and M. Sbert / A Bounded Measure for Estimating the Benefit of Visualization: Theoretical Discourse and Conceptual Evaluation

Some of the statements in the paper puzzled me, especially the repeated statement that the following fact would seem counter-intuitive: that high alphabet compression is a feature of a benefit of visualization. This surprises me, as so many efforts of visualization are directed in exactly this direction: information reduction, visual abstraction, feature extraction, visual summaries, first overview then (user-controlled) details... The intuition of visualization experts should rather tell them that information reduction is a quality feature - up to the point where essential information is no longer visually represented. The last point is missing in the model, unless the omission or loss of essential information is subsumed under "information distortion".

I find the search for bounded measures for the Potential Distortion (PD) unconvincing. I am aware of the size of the problem: we try to find suitable abstractions to create a theoretical framework that covers a huge number of very different use cases. This is a mammoth task, which cannot be accomplished in one step. Rather, one has to approach it in many attempts with subsequent modifications and iterative improvements. Here we want even more, namely a useful quantitative formulation that captures essential aspects of information processing in the visualization pipeline. One basic problem is the vague, imprecise terminology; another is the high dimensionality of the problem. I respect any attempt to address this huge problem. Nevertheless, I did not find the described selection of an appropriate mathematical quantity that measures PD convincing.

Section 4 starts with the "mathematical proof" (Sect. 4.1). I don't think this is a mathematical proof, since a special PMF Q is used here (or is this distribution mathematically somehow distinguished?).

In Sect. 4.2, again a reason is given (the arbitrariness of specifying epsilon) why "it is [...] desirable to consider bounded measures that may be used in place of D_KL". I don't understand this argumentation. In equation (9), it should probably read k ≥ 1, so that the slope of |...|^k is not negative.

In equation (11), H_max should be explained: over which set is the maximum searched for?

In Sect. 4.3, a search for the "most suitable measure" is compared to the search for a suitable unit (e.g., metric vs. imperial). This is a misleading comparison. While the mentioned example is about different units of measurement for the same operationally defined quantity (units of measurement that can be converted by multiplying by a constant factor), this paper is about selecting a suitable (operationally defined) quantity. The problem here corresponds more to the question of which metric is the best one to solve a task, e.g. to measure the dissimilarity of objects. This has nothing to do with the choice of units of measurement (which in our case are [log_2(p)] = bits).

But the inappropriately chosen example reminds us of another point, which is extremely important and yet very elementary! If you add or subtract two quantities, they must have the same unit of measurement; otherwise you get complete nonsense. In our case we calculate AC - PD. AC has, because of log_2(p), the unit bit; if we would calculate with log_256(p) instead, we would have the result in the unit byte. So we have to calculate AC in the unit "bit". Otherwise we pretend to be quantitative and yet we are only doing pseudo-science. This means that PD must also be an entropic quantity calculated with logarithms to base 2. This condition is fulfilled by the KL divergence, the Jensen-Shannon divergence and the conditional entropy. The condition is certainly not fulfilled by the Minkowski distances. The latter should therefore not be brought into play at all!

I am also skeptical about the quantity D^k_new in Eq. (9), except for the case k=1. If one does a dimensional consideration by (for the moment) leaving out the summand 1 in the logarithm in Eq. (9), then one can consider the exponent k as a multiplicative factor before the logarithm and thus before the sum; i.e. k is a factor that changes the unit of measurement. Although this is not a proof (especially since you cannot ignore the presence of the summand 1 in the logarithm), it is an indication that k unequal to 1 could be problematic. It could mean, so to speak, subtracting bytes from bits.

Aren't only terms of the type p * log(p) considered in information theory, where the p are always probabilities with values from the interval [0,1] and the logarithms are always taken to the same base? I think there is a deep reason for this.

I acknowledge the effort to make the selection of the "best" quantity halfway rational. But even the use of multi-criteria decision analysis (MCDA) cannot hide the fact that there is a great deal of arbitrariness in the analysis, as most numerical values were simply assumed. That these are somehow plausible is not enough to convince me. I think Sect. 4 is a rendition of the path the authors have taken, but not the way it should be presented. My recommendation would be to shorten this drastically.

However, this does not dispel my fundamental reservations about the approach to finding an appropriate measure of PD. The examples shown with the tiny alphabets are, compared to realistic tasks, extremely simple and in my opinion not convincing. I would find it more convincing if, for example, two really realistic examples with realistic questions/tasks and suitable visualizations were used to find out which PD measure is most suitable. In doing so, I would leave out all candidates for which it is actually clear from the outset that they will not perform well (e.g. for dimensionality reasons).

In Section 5, the examples chosen again are too artificial and academic. The subway maps were primarily designed to quickly convey the number of stops between two locations (the most important information, e.g. to get off at the right station), and only secondly to estimate travel times (at least within the city, the travel time is approximately defined by the number of stops) and only thirdly to estimate actual distances.

I also found the tomography example unworldly. Better would be e.g. the artery data set, but with reasonable questions. For example, a doctor would look for constrictions in the arteries and the spatial location of the constrictions in the arterial network; furthermore, she/he would estimate how narrow the constrictions really are (taking into account possible measurement and reconstruction errors) and what the medical risk is – taking into account the overall blood flow in the arterial network and considering which body regions are supplied.

Overall, it is not quite clear to me why the motivating factor for this work, namely the boundedness of PD, is so important. Furthermore, I find the choice of a more suitable mathematical quantity not convincing. This selection should be made on the basis of real, realistic problems with appropriately designed visualizations. A summarizing discussion of the assumptions, limitations and scope of the theory is missing.

Summary Rating
Reject
The paper is not ready for publication in SciVis / TVCG. The work may have some value but the paper requires major revisions or additional work that are beyond the scope of the conference review cycle to meet the quality standard. Without this I am not going to be able to return a score of '4 - Accept'.

The Summary Review
The authors aim to contribute to the foundation of data visualization by mathematically capturing essential characteristics of the visualization process. The main goal is a general mathematical expression that can be used to calculate the cost-benefit ratio of a visualization on an information-theoretical basis. While the costs are given by e.g. mean response times of the users, the benefit is more difficult to define. In Ref. [11], to which the first author of the current paper already contributed, a measure for the "benefit" was developed (Eq. 2). It consists of the difference "Alphabet Compression (AC)" - "Potential Distortion (PD)". The subject of the present work is PD, which measures the informative divergence between viewing the data through visualization with information loss and reading the data without any information loss. In [11] an expression based on Kullback-Leibler (KL) divergence was used for PD. Since the KL divergence is unbounded, PD can be arbitrarily large. In the eyes of the authors this is a problem which they - and this is the actual aim of the paper - want to solve by a more suitable mathematical expression.

To this end, the authors first present a set of candidates for bounded measures, oriented on common measures of information theory; some of these are parameterized, resulting in greater diversity. To select the most suitable measure, they define a set of criteria and apply them, using also multi-criteria decision analysis (MCDA). Then they validate the results with two case studies and conclude that a newly introduced measure, D^2_new, is the most suitable one.

All 4 reviewers assume that information theory will be part of a foundational theory of visualization and will at least provide constraints for further approaches.
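[The reviewer's contrast between the unbounded KL divergence and the bounded alternatives named above (Jensen-Shannon divergence), and the point that a change of logarithm base is only a constant unit factor, can both be checked with a small sketch. The PMFs below are invented for illustration; they do not come from the paper.]

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy; base 2 gives bits, base 256 gives bytes."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def kl(p, q, base=2.0):
    """D_KL(P || Q); unbounded as q[i] -> 0 where p[i] > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, base=2.0):
    """Jensen-Shannon divergence; bounded by log_base(2), i.e. 1 bit for base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m, base) + 0.5 * kl(q, m, base)

p = [0.97, 0.01, 0.01, 0.01]
q = [1e-6, 0.5, 0.25, 0.249999]

print(kl(p, q))   # large (~19 bits here) and unbounded in general
print(js(p, q))   # stays below 1 bit no matter how extreme p and q are

# A change of logarithm base is only a constant unit factor:
# H_256 = H_2 / log2(256), i.e. bits -> bytes divides by 8,
# which is the reviewer's point about mixing units in AC - PD.
assert abs(entropy(p, 256) - entropy(p, 2) / 8) < 1e-12
```

[The boundedness of the Jensen-Shannon divergence follows from the fact that each half-KL term is taken against the mixture M = (P+Q)/2, where no probability ratio can exceed 2.]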


However, the majority of the reviewers have doubts about the approach [11]. Since [11] and the follow-up work were positively assessed in peer reviews, these reviewers tried NOT to include their reservations in the evaluation, but to assess the manuscript on the basis of the assumption that the approach [11] is meaningful.

The main results of the review are:

+ The effort to make a contribution to the theoretical basis of visualization is to be evaluated positively, especially since fundamental concepts are not sufficiently clarified and it is therefore a very difficult undertaking. (R1, R2, R3, R4)

- The need for a bounded measure for PD is not sufficiently justified; there are several arguments (see in particular R2 and R3) that this requirement is not necessary. This, however, makes the premise of the paper doubtful. (R1, R3, R4)

- Some of the authors' trains of thought cannot be fully comprehended, as there are discrepancies with earlier work that are not being clarified. (R1)

- The alleged proofs are unsustainable. (R1, R3, R4)

- The proposed alternative measures are sometimes very ad hoc (R1, R3, R4), some are dimensionally doubtful (R4), they are (probably) not additive (R3) and there is no natural-language interpretation (R1); other reasonable choices would have been possible (R3).

- The evaluation of measures has a strong ad hoc character (R1, R2, R3, R4); with the combination of Likert scales and MCDA almost any connection to information theory is lost (R3).

- The paper demands a lot from the reader; without detailed knowledge of information theory concepts, as well as details of the cost-benefit ratio, the paper is difficult to understand. It would be better if the basic terms and concepts were briefly presented again in the paper (with reference to the appendix) (R1). It would be nice if parts of the two case studies were interwoven into the mathematical presentation to better anchor the concepts (R2).

- The notation needs to be improved (R1).

- The validation of the PD measure is questionable (R1, R3, R4); for details see in particular R3.

- When trying to apply the presented theory to real applications, huge gaps open up (R1, R3, R4); it starts with the fact that it is the goal of almost all visualizations to reduce the amount of information and to transmit only the information that is essential for the respective application situation; and it ends with the fact that the case studies in the paper, on closer inspection, hardly support the goals of the work (R1, R3, R4).

- A summarizing discussion of the assumptions, limitations and scope of the theory presented is missing. (R1, R4)

—————————————————————-
