Uncertainty-aware Multidimensional Ensemble Data Visualization and Exploration

Haidong Chen, Song Zhang, Senior Member, IEEE, Wei Chen, Member, IEEE, Honghui Mei, Jiawei Zhang, Andrew Mercer, Ronghua Liang, Member, IEEE, and Huamin Qu, Member, IEEE

Abstract—This paper presents an efficient visualization and exploration approach for modeling and characterizing the relationships and uncertainties in the context of a multidimensional ensemble dataset. Its core is a novel dissimilarity-preserving projection technique that characterizes not only the relationships among the mean values of the ensemble data objects but also the relationships among the distributions of ensemble members. This uncertainty-aware projection scheme leads to an improved understanding of the intrinsic structure in an ensemble dataset. The analysis of the ensemble dataset is further augmented by a suite of visual encoding and exploration tools. Experimental results on both artificial and real-world datasets demonstrate the effectiveness of our approach.

Index Terms—Ensemble visualization, uncertainty quantification, uncertainty visualization, multidimensional data visualization

1 INTRODUCTION

Recent developments in scientific simulation research have resulted in an increased role for scientific simulations and analysis. One common way to study uncertainty is the ensemble simulation, which employs stochastic initial conditions or multiple parameterizations to produce an ensemble of simulation outcomes. For example, in a numerical weather simulation, both the initial conditions and the simulation parameters (e.g., cumulus schemes and microphysics schemes) can be perturbed in an ensemble forecast simulation. The number of ensemble runs ranges from dozens to thousands. In this paper, all ensemble members of a single data entry are called an ensemble data object, e.g., the ensemble numerical weather forecast at a geospatial location. The ensemble members of an ensemble data object may be averaged to obtain the ensemble mean, which is regarded as a representative of the ensemble members. Unfortunately, important information is lost in the averaging process.

It is highly beneficial to characterize the uncertainty of an ensemble dataset during the entire analysis process. In most scientific applications, the resulting simulation output is a multivariate field. For instance, the output of a typical Weather Research and Forecasting (WRF) simulation [1] includes more than 100 variables such as temperature, pressure, and wind direction. Therefore, the challenges in analyzing an ensemble dataset include both the uncertainty with respect to each output variable and the high dimensionality of the dataset.

There have been a large number of uncertainty visualization and analysis methods [2] that go beyond traditional summary statistics [3] and depict ensemble data in more detail. Conventional visualization solutions exploit glyphs [4], [5], [6] and visual variables [7], [8] to encode uncertainties. However, most of them are designed only for 1D or 2D datasets and have limited capabilities to reveal the intrinsic structures in an ensemble dataset. Recent methods can effectively characterize and analyze the uncertainty structures [9], [10] and forms [11], [12] but are designed for data objects with one variable.

To address the second challenge, high dimensionality, multidimensional projection techniques [13], [14] are widely used to build a low-dimensional layout that respects the distances among data objects in the high-dimensional space. In this low-dimensional layout, closely positioned points indicate similar data objects in the high-dimensional space. However, most conventional projection methods are developed for datasets in which a data object has only one instance. Simply using the ensemble mean as an instance to represent a data object for projection suffers from heavy information loss, because the shape of the distribution of each ensemble data object is discarded during the averaging process. A conceptual example is shown in Fig. 1 (a,b), where four ensemble data objects with similar ensemble means but different ensemble distributions are lumped together by a conventional multidimensional projection method applied to the ensemble means. This can mislead users. For example, users might conclude that these four data objects are almost the same because they are located close to each other.

It remains a challenging task to visually explore the intrinsic structures of a high-dimensional ensemble dataset as well as the uncertainty of each individual ensemble data object.

• Haidong Chen, Wei Chen, Honghui Mei, and Jiawei Zhang are with the State Key Lab of CAD & CG, Zhejiang University, CHINA, 310058. E-mail: [email protected], [email protected], [email protected], and [email protected].
• Wei Chen is the corresponding author. He is also with the Cyber Innovation Joint Research, Zhejiang University.
• Song Zhang is with the Department of Computer Science and Engineering, Mississippi State University, U.S., 39762. E-mail: [email protected].
• Andrew Mercer is with the Geosciences Department, Mississippi State University, U.S., 39762. E-mail: [email protected].
• Ronghua Liang is with the College of Information Engineering, Zhejiang University of Technology, CHINA, 310014. E-mail: [email protected].
• Huamin Qu is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. E-mail: [email protected].

Digital Object Identifier no. 10.1109/TVCG.2015.2410278

1077-2626 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS

The key contribution of this paper is a novel uncertainty-aware multidimensional projection approach that generalizes the conventional projection methods for ensemble datasets. Its core is a new dissimilarity measure for ensemble data objects and an enhanced Laplacian-based projection scheme. Compared with the conventional projection methods, which can only respect the distances among ensemble means, our dissimilarity measure preserves the relationships among the ensemble distributions as well as the ensemble means. By setting the distributional difference weight to zero (see Equation (7)), our approach regresses to a conventional projection scheme. The enhanced Laplacian-based projection scheme achieves a balance between the local and global layout by imposing global constraints in constructing the Laplacian system.

Fig. 1 (c) illustrates the effectiveness of our uncertainty-aware projection approach, with which four data objects are positioned in the 2D visual plane. In this projection, U_2 and U_3 are located close to each other due to their similar ensemble means and distributions. As U_1 and U_4 follow different distributions from U_2 and U_3, they are positioned far away from U_2 and U_3 even though all of them have similar ensemble means.

Fig. 1. Projecting an example 1D ensemble dataset (a) which consists of four ensemble data objects with similar mean values but different distributions. (b) The result of a conventional multidimensional projection algorithm applied to the ensemble means. (c) The result of our method.

We note that, as with any projection method, our method will not preserve the high-dimensional structures with 100% accuracy. Our goal is to provide users with a projection method that visualizes not only the differences between the means of ensemble data objects, but also the differences between the shapes of the ensemble distributions. As demonstrated in Section 6 (see Fig. 6 and Fig. 11), a better separation of data objects with different ensemble distributions can be achieved with our method, but not with previous projection methods using ensemble means.

In addition, we augment users' abilities to visually study an ensemble dataset with a suite of visual exploration widgets: 1) an uncertainty histogram that allows for selecting the interesting range of uncertainty values in the visualized data; 2) an ensemble bar that embodies key information in an ensemble data object; 3) a parallel coordinates view that shows the details of the ensemble distribution for a particular ensemble data object; and 4) an optional geo-location view that depicts the uncertainty patterns identified in the projection view. Experimental results on a synthetic dataset and numerical weather simulations alike show that ensemble data objects with similar ensemble means but different ensemble distributions can be clearly distinguished with our approach.

In summary, the main contributions of this work include:
• A novel uncertainty-aware multidimensional projection technique for ensemble datasets.
• A multi-view interactive visualization system for effective exploration of intrinsic structures and uncertainties in ensemble datasets.

2 RELATED WORK

Our work relates to multiple topics of visualization including uncertainty visualization, ensemble data visualization, and multidimensional/multivariate data visualization.

Uncertainty spreads throughout the entire data analysis pipeline, including acquisition, transformation, and visualization [15], [16]. Depicting uncertainty can significantly help analysts make better decisions [17]. Thomson et al. [18] proposed a typology for uncertainty visualization in intelligence analysis. Potter et al. [19] presented a comprehensive survey of uncertainty quantification and visualization for scientific data. In the past decade, a large number of uncertainty visualization techniques have been developed. They can be roughly classified into four categories: glyph based, visual variable based, geometry based, and animation based. Glyph based methods encode uncertainties into well-designed glyphs (e.g., the flow radar glyph [4], the circular glyph [20], the summary plot [5], and the uncertain ODF glyph [21]) and place them into the original data field. Similarly, visual variables such as color [10], [22], brightness [8], [23], blurriness [24], and texture [7], [25] can also be employed to encode uncertainty. The geometry based approaches are a family of techniques that adapt the basic geometry to represent uncertainty, including point [22], line [26], cube [27], and surrounding volume [10], [28]. Coninx et al. [29] and Lundstrom et al. [30] demonstrated that animation can be used to express uncertainty as well. In general, most of these techniques focus on 1D or 2D datasets and cannot be readily extended to multidimensional datasets. Wu et al. [31] introduced the standard error ellipsoid to characterize uncertainty arising in any stage of a visual analytic process for multidimensional datasets. In addition, evaluations of different uncertainty visualization techniques have attracted much attention recently. Deitrick et al. [17] conducted an empirical evaluation to explore the influence of uncertainty visualization on decision making. Sanyal et al. [32] compared four commonly used techniques for 1D and 2D datasets. They identified that glyph representations are generally good options.

Ensemble data is common in many scientific fields such as meteorology, hydrology, and chemistry. Ensembles effectively represent multiple simulations of a model with varied initial or boundary conditions, yielding slightly different results. They can be regarded as a type of uncertainty data. A convenient visualization method for ensemble data is the small-multiple method [33], [34] associated with linking and brushing operations. Spaghetti plots [35] are another approach to visualizing ensemble datasets. Potter et al. [34] introduced a framework, Ensemble-Vis, to visualize and explore ensemble datasets by leveraging multiple coordinated views. Sanyal et al. [6] developed a tool named Noodles to visualize ensemble uncertainty modeled with standard deviation. Unfortunately, Noodles is only applicable to single-variable ensemble data. Recent work in this area has progressed from simple visualization to formal analysis of the results. Thomas et al. [36] presented an interactive system to study off-shore structures in an ensemble ocean forecasting dataset. Gosink et al. [11] proposed a method to characterize different types of predictive uncertainty in ensemble datasets based on Bayesian model averaging. Mathias et al. [12] developed a Lagrangian framework for ensemble flow field analysis.

The differences between the terms multidimensional and multivariate are subtle. The term multidimensional refers to independent dimensions, while the term multivariate refers to dependent variables [37]. Nevertheless, multivariate data can be treated as multidimensional because the relationships among dimensions are typically unknown in advance. Conventional approaches to visualizing multidimensional data employ the scatter plot matrix, parallel coordinates, star coordinates, etc. Among these approaches, multidimensional projection has gained much attention due to its ability to characterize similarities among data points in a visual space (2D or 3D), and has been proven to be a useful tool in many applications. Daniels et al. [38] employed an LSP-like projection technique for interactive vector field analysis. Chen et al. [39] proposed to embed high-dimensional DTI fibers into a 2D space with MDS for interactive exploration. Joia et al. [14] introduced an advanced projection technique called local affine multidimensional projection (LAMP) to interactively correlate similar data instances. Anand et al. [40] used random projection to find interesting substructures in a high-dimensional dataset. Although these approaches are capable of visualizing data by their low-dimensional layout, they are designed for data without uncertainty. Our approach advances the multidimensional projection scheme by taking uncertainty into account.
3 UNCERTAINTY-AWARE MULTIDIMENSIONAL PROJECTION

Multidimensional projection is emerging as an effective visual exploration technique for high-dimensional datasets. It projects a set of data points in the d-dimensional space into an l-dimensional visual space, typically l ∈ {2, 3}. Conventional techniques assume that the value of each data object is deterministic. Many recent scientific simulations, however, produce a collection of values at each data object due to perturbed initial conditions or parameterizations. This type of data is called ensemble data.

The conceptual model of our data consists of n ensemble data objects U = {U_1, U_2, ..., U_n}. Each object has m d-dimensional ensemble members, U_i = {U_i^1, U_i^2, ..., U_i^m | U_i^t ∈ R^d}. The goal of our uncertainty-aware projection scheme is to build an l-dimensional representation V = {V_1, V_2, ..., V_n | V_i ∈ R^l} that preserves the relationships among data objects in terms of both the ensemble means and the ensemble distributions.

3.1 Approach Overview

To strike a balance between projection efficiency and accuracy, we employ a two-step multidimensional projection: first, a small set of control points are selected from U based on the ensemble means and projected to a 2D space with the conventional multidimensional scaling method [13]; then all other objects in U are projected to the 2D space with an enhanced Laplacian system that combines the influences from both control points and other data points.

One distinctive feature of our approach is that we take both the ensemble mean differences and the ensemble distribution differences into account to measure the dissimilarity D(U_i, U_j) of any pair of ensemble data objects U_i and U_j. The former is described by the Euclidean distance between the ensemble means, E(Ū_i, Ū_j), and the latter is captured by a difference measure between ensemble distributions, J(U_i || U_j), based on relative entropy [41]. To reconstruct the continuous ensemble distribution for each data object, a multidimensional Kernel Density Estimation (KDE) method that considers the dimensional correlations is employed.

3.2 Ensemble Data Object and Probability Distribution

For simplicity, we suppose that the data objects are independent. Currently, many uncertainty modeling and visualization methods [9], [10] assume that the ensemble members for a given data object are drawn from a known parametric distribution such as a Gaussian. In practice, the ensemble distribution is often not Gaussian and can vary from data object to data object.

KDE is a non-parametric approach to approximating the underlying continuous distribution with a sum of kernels centered at each sample. Specifically, the multidimensional KDE [42] for data object U_i is defined as:

U_i(x) = Σ_{t=1}^{m} w_t K_H(x − U_i^t),   (1)

where H is a bandwidth matrix, w_t is a weight factor that can be determined by prior knowledge, and Σ_{t=1}^{m} w_t = 1. K_H is the multidimensional kernel function satisfying K_H(x) ≥ 0 in the entire domain and ∫ K_H(x) dx = 1. As such, the estimated density function can be interpreted as a probability density function.

A range of kernels can be adopted for KDE, including the uniform, Gaussian, and Epanechnikov kernels. We employ the widely used normal kernel [43], [44], because the kernel type is less important than the bandwidth parameter in terms of its influence on the estimation [45].

In general, there are three choices for H: the scaled identity matrix H = h²I, the diagonal matrix H = diag(h_1², h_2², ..., h_d²), and the general symmetric positive definite matrix. The scaled identity matrix implies that the variances in all dimensions are identical. The diagonal matrix implies that each dimension has its own bandwidth and no correlation exists between any two dimensions. In practice, a variety of automatic, data-driven bandwidth selection methods can be used to estimate the bandwidth h_k on the k-th dimension, e.g., Silverman's rule of thumb [45]:

h_k = (4σ_k^5 / 3m)^{1/5} ≈ 1.06 σ_k m^{−1/5},   (2)

where σ_k is the standard deviation of the samples on the k-th dimension and m is the number of samples. More sophisticated approaches like cross-validation exist in [46]. The general symmetric positive definite matrix encodes an individual bandwidth for each dimension and the linear relationship between any two dimensions.

Each type of bandwidth matrix has its own advantages and limitations. The scaled identity matrix is prone to under-fitting and has a large fitting error. The general matrix, which requires that H is symmetric and positive definite, is the most accurate but needs more parameters and is prone to over-fitting. The diagonal matrix strikes a balance between them.

To take advantage of the simplicity of the diagonal matrix while preserving correlations among dimensions, we choose to perform KDE in a space defined by the principal component transformation [42]. Specifically, a mean centering process is first applied to each data object before estimation so that the distributional differences (to be detailed in Section 3.3) are not influenced by the ensemble means:

Ũ_i^t = U_i^t − Ū_i.   (3)

We then apply principal component analysis to Ũ_i = {Ũ_i^t | t = 1, 2, ..., m}, which yields a transformation matrix Φ. Φ is composed of the eigenvectors of the covariance matrix of Ũ_i. Lastly, we transform each ensemble member Ũ_i^t into a new set U_i* = {U_i^{t*} | t = 1, 2, ..., m} by:

U_i^{t*} = Φ^T Ũ_i^t.   (4)

Consequently, the probability U_i(x) at location x is computed as U_i*(x*) on U_i* instead, where x* = Φ^T (x − Ū_i). Because the bases of the new space are eigenvectors that are orthogonal to each other, the diagonal bandwidth matrix H* for estimating U_i*(x*) can be quickly estimated with Equation (2) on U_i*.
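As a concrete illustration of Equations (1)–(4), the density reconstruction can be sketched as below. This is a minimal sketch, not the authors' implementation: it assumes equal kernel weights w_t = 1/m, a Gaussian product kernel, and Silverman bandwidths computed in the PCA-rotated space; the function name kde_pca is ours.

```python
import numpy as np

def kde_pca(members):
    """Density estimate for one ensemble data object (Eqs. (1)-(4)).

    members: (m, d) array of ensemble members, d >= 2.
    Returns p(x): the KDE evaluated at a point x of shape (d,).
    Assumes equal kernel weights w_t = 1/m and a Gaussian product
    kernel with per-dimension Silverman bandwidths in PCA space.
    """
    m, d = members.shape
    mean = members.mean(axis=0)
    centered = members - mean                          # Eq. (3): mean centering
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    _, phi = np.linalg.eigh(cov)                       # columns: eigenvectors
    rotated = centered @ phi                           # Eq. (4): decorrelated members
    sigma = rotated.std(axis=0, ddof=1)
    h = np.maximum(1.06 * sigma * m ** (-0.2), 1e-12)  # Eq. (2), Silverman's rule

    def density(x):
        z = ((x - mean) @ phi - rotated) / h           # (m, d) standardized offsets
        kern = np.exp(-0.5 * (z ** 2).sum(axis=1))     # product of 1-D normal kernels
        return kern.sum() / (m * (2 * np.pi) ** (d / 2) * h.prod())

    return density
```

Because the rotation is orthogonal and the kernel is a product of one-dimensional normals, the estimate stays a valid probability density while still reflecting correlations among the original dimensions.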
3.3 Dissimilarity Estimation

The dissimilarity measure between pairs of data objects plays an essential role in most multidimensional projection techniques. Geometric distances such as the Euclidean distance, the cosine distance, and the geodesic distance have prevailed in the multidimensional projection literature for years. However, these measures cannot be directly applied to ensemble datasets.

A naive way to deal with ensemble datasets is to use summary statistics like the ensemble means to represent the data objects and calculate the dissimilarities among them. Unfortunately, this simple scheme results in a large amount of information loss. The Anscombe's quartet dataset [47] shows that four data objects with the same means and variances can have very different ensemble distributions. Clearly, it is impossible to distinguish them from each other with the ensemble means.

Therefore, it is natural to extend the definition of dissimilarity among ensemble data objects by considering their probability distributions. The Jensen-Shannon divergence (JSD) [41], a relative-entropy measure based on the Kullback-Leibler divergence, is a dissimilarity measure of distributions. JSD is symmetric, non-negative, and bounded. It is widely used in the community of uncertain data mining [48]. The distribution difference J(U_i || U_j) between the data objects U_i and U_j is defined as:

J(U_i || U_j) = (1/2) ∫_D U_i(x) log [2 U_i(x) / (U_i(x) + U_j(x))] dx + (1/2) ∫_D U_j(x) log [2 U_j(x) / (U_i(x) + U_j(x))] dx,   (5)

where U_i(x) and U_j(x) are the reconstructed probability density functions for U_i and U_j respectively, and D denotes the high-dimensional space where all the ensemble members U_i^t reside. Typically, we set the base of the log function to 2 so that J(U_i || U_j) is bounded by 1. Only when U_i and U_j have the same distribution does J(U_i || U_j) reach 0. By convention, 0 log 0 is defined as 0.

In most cases, it is impossible to obtain the analytical solution of Equation (5). Therefore we compute a numerical solution with sampling. In our implementation, the Metropolis-Hastings (MH) algorithm [49], a Markov Chain Monte Carlo method, is employed, because it can generate a sequence of samples from a function proportional to the target probability density distribution when direct sampling is difficult. As more and more samples are produced, the distribution of the samples more closely approximates the target distribution. Consequently, by the law of large numbers, we can rewrite Equation (5) as:

J(U_i || U_j) ≈ (1/2S) Σ_{t=1}^{S} log [2 U_i(x_i^t) / (U_i(x_i^t) + U_j(x_i^t))] + (1/2S) Σ_{t=1}^{S} log [2 U_j(x_j^t) / (U_i(x_j^t) + U_j(x_j^t))],   (6)

where {x_i^t | t = 1, 2, ..., S} ∼ U_i(x) and {x_j^t | t = 1, 2, ..., S} ∼ U_j(x) are S samples generated by MH.

Therefore, we define the dissimilarity between U_i and U_j as a weighted sum of the Euclidean distance between their ensemble means and the distributional difference between their probability density functions:

D(U_i, U_j) = α E(Ū_i, Ū_j) + β J(U_i || U_j),   (7)

where α ∈ [0, 1] and β ∈ [0, 1] are two parameters adjustable by users, with α + β > 0. When α ≠ 0 and β = 0, the distributional differences are missed by our dissimilarity measure. In such a case, our approach regresses to the conventional projection methods because only the dissimilarities among the ensemble means are captured.

In summary, the proposed dissimilarity measure is endowed with the following properties:
• Symmetry (D(U_i, U_j) = D(U_j, U_i)): The dissimilarities between any pair of data objects are symmetric. Therefore, the measure can be seamlessly incorporated into many state-of-the-art multidimensional projection techniques.
• Uncertainty-awareness: The dissimilarity considers the differences in ensemble distributions, therefore differentiating data objects that are indistinguishable by summary statistics.
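The Monte Carlo estimator of Equation (6) and the combined measure of Equation (7) can be sketched as follows. This is an illustrative sketch, not the paper's code: it takes pre-drawn samples as input (the paper draws them with Metropolis-Hastings; any sampler whose output follows the two densities yields the same estimator), and the function names are ours.

```python
import numpy as np

def jsd_monte_carlo(p, q, samples_p, samples_q):
    """Sampled Jensen-Shannon divergence, Eq. (6).

    p, q: density functions (e.g., KDE reconstructions);
    samples_p / samples_q: samples drawn from p and q.
    Log base 2 bounds the result by 1.
    """
    eps = 1e-300  # guards log(0) where one density vanishes

    def half(samples, a, b):
        pa = np.array([a(x) for x in samples])
        pb = np.array([b(x) for x in samples])
        return np.mean(np.log2(2 * pa / (pa + pb + eps) + eps))

    return 0.5 * half(samples_p, p, q) + 0.5 * half(samples_q, q, p)

def dissimilarity(mean_i, mean_j, jsd_ij, alpha=0.5, beta=0.5):
    """Eq. (7): weighted sum of the ensemble-mean distance and the JSD."""
    return alpha * np.linalg.norm(mean_i - mean_j) + beta * jsd_ij
```

For two well-separated distributions the estimate approaches the bound of 1; for identical distributions it approaches 0, matching the properties listed above.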
3.4 The Enhanced Laplacian-based Projection

Our scheme for projecting ensemble data objects to the visual space is inspired by the Least Square Projection (LSP) technique [50]. Generally, it is a two-step local technique. In the first step, a subset of data objects are projected to the visual space. In the second step, the rest of the data objects are interpolated according to the K-nearest neighborhood graph. However, this method inherits the drawbacks of local methods. It may bias the data objects projected in the second step in favor of the control points projected in the first step [51] (see Fig. 3 (a)). Thus, instead of directly using the Laplacian system originally proposed in [50], we propose to add global constraints to the equations. In other words, our method associates each data object with two small sets: a near set and a random set, instead of only one K-nearest set. Both sets are used for constructing the neighborhood graph for interpolation. The random set plays the role of the global constraints. This simple scheme strikes a balance between the locality maintained by the near set and the global layout preserved by the random set.

A large number of samples from the ensemble distributions may be needed to obtain an accurate estimation of the relative entropy (Equation (6)), especially when the ensemble distribution is complicated. To avoid pairwise distributional difference calculations, we select control points and construct the neighborhood graph based only on the ensemble means Ū = {Ū_1, Ū_2, ..., Ū_n}. Control points C = {C_i | C_i ∈ U, i = 1, 2, ..., K} are selected with a K-center algorithm [52]. If we do not have any prior knowledge about the dataset, the number of control points is set to K = √n [53]. The near set N_i for each ensemble object U_i is then defined as the ensemble objects whose means are the K-nearest neighbors of Ū_i. Apparently, N_i might not contain the true K-nearest neighbors of U_i because only the ensemble mean information is utilized. However, in practice the true K-nearest neighbors in N_i can be guaranteed by enlarging the near set. In such a case, all other objects in N_i that are not true K-nearest neighbors of U_i can be deemed objects in the random set R_i. Fig. 2 shows an example of building the neighborhood graph for the ensemble data object U_1.

Fig. 2. The neighborhood graph for U_1 with a near set of size 4 and a random set of size 2: N_1 = {U_2, U_3, U_4, U_5} and R_1 = {U_8, U_10}.

In our implementation, an iterative majorization algorithm called Scaling by Majorizing a Convex Function (SMACOF) [13] is employed to build a low-dimensional representation {V_i^c | V_i^c ∈ R², i = 1, 2, ..., K} for the control points. V_i^c is the 2D projection of the control point C_i.

Technically, the Laplacian-based projection scheme rests on the theory of convex combination. That is, the low-dimensional representation of each high-dimensional data object can be regarded as a linear combination of its neighborhood in the visual space. Mathematically, let V_i ∈ R² be the projection of ensemble data object U_i; according to the convex combination theory, V_i can be written as:

V_i = Σ_{U_j ∈ {N_i ∪ R_i}} τ_ij V_j,   (8)

where τ_ij > 0 and Σ τ_ij = 1. Typically, the inverse of the dissimilarity, D_inv(U_i, U_j) = 1 / D(U_i, U_j), between the ensemble data objects U_i and U_j is used to define the weight:

τ_ij = D_inv(U_i, U_j) / Σ_{U_j ∈ {N_i ∪ R_i}} D_inv(U_i, U_j).   (9)

By reorganizing Equation (8) for all ensemble data objects into a matrix representation, a sparse linear system constrained by the control points can be derived [50], which can then be solved in a least squares sense. The solution of this system is the low-dimensional representation of the ensemble dataset U.

Fig. 3 demonstrates the effect of using a random set. A synthetic dataset consisting of 500 data objects in 5 clusters is employed. Each ensemble data object has 80 members. Fig. 3 (a) displays the projection result without the random set. In this case, our method regresses to the Least Square Projection technique. Fig. 3 (b,c) show the results with the size of the random set set to 4 and 8, respectively. It is reasonable that a large random set preserves more global relationships among points, leading to low stress. This is empirically verified by the plot of standard normalized stress over the size of the random set (see Fig. 3 (e)). For comparison, the projection result of MDS implemented with SMACOF [13], which optimizes the global relationships among points, is shown in Fig. 3 (d). MDS provides the smallest stress but requires the pairwise dissimilarity matrix, so it is usually not feasible for projecting large datasets. Fig. 3 (f) shows the running time of our method with different sizes of the random set. From this plot, we can see that the running time is closely linear in the sum of |N_i| and |R_i| (|N_i| = 20 in this example). This is because the distributional difference estimation is much more time-consuming than building and solving the linear system. In summary, the addition of a random set provides a balance between the fast but inaccurate local methods (e.g., Laplacian-based projection methods) and the accurate but slow global methods (e.g., MDS).

Fig. 3. The influence of different sizes of the random set in our approach. (a) |R_i| = 0, which is the LSP technique (stress 0.614). (b) Our method with |R_i| = 4 (stress 0.087). (c) Our method with |R_i| = 8 (stress 0.055). (d) MDS projection (stress 0.011). (e) The stress plot. (f) The time plot.
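The interpolation step of Equations (8)–(9) amounts to solving a least-squares system in which the control points act as constraints. The sketch below is a simplified, dense version under stated assumptions: dissimilarities are precomputed, the random set is drawn uniformly at random, and control-point coordinates (e.g., from a K-center selection followed by SMACOF MDS) are supplied; the paper builds a sparse system instead, and all names here are illustrative.

```python
import numpy as np

def project_with_constraints(diss, control_idx, control_pos,
                             near_k=4, rand_k=2, seed=0):
    """Sketch of the enhanced Laplacian projection (Eqs. (8)-(9)).

    diss:        (n, n) precomputed dissimilarities D(Ui, Uj)
    control_idx: indices of control points
    control_pos: their 2D positions (e.g., from SMACOF MDS)
    Returns an (n, 2) layout via a dense least-squares solve.
    """
    rng = np.random.default_rng(seed)
    n = diss.shape[0]
    A = np.eye(n)
    for i in range(n):
        order = [j for j in np.argsort(diss[i]) if j != i]
        near = order[:near_k]                       # near set Ni
        rest = order[near_k:]
        rand = list(rng.choice(rest, size=min(rand_k, len(rest)),
                               replace=False))      # random set Ri (global constraint)
        nbrs = near + rand
        w = 1.0 / (diss[i, nbrs] + 1e-12)           # inverse dissimilarity
        A[i, nbrs] = -w / w.sum()                   # Eq. (8)/(9): Vi - sum tau_ij Vj = 0
    # extra rows pinning control points to their MDS positions
    C = np.zeros((len(control_idx), n))
    rhs = np.zeros((n + len(control_idx), 2))
    for r, (ci, pos) in enumerate(zip(control_idx, control_pos)):
        C[r, ci] = 1.0
        rhs[n + r] = pos
    V, *_ = np.linalg.lstsq(np.vstack([A, C]), rhs, rcond=None)
    return V
```

With the random set removed (rand_k = 0) the sketch degenerates to an LSP-style purely local interpolation, mirroring the comparison in Fig. 3 (a).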

4 UNCERTAINTY QUANTIFICATION AND VISUALIZATION FOR INDIVIDUAL ENSEMBLE DATA OBJECTS

The uncertainty-aware 2D projection of the multidimensional ensemble data objects helps users understand the relationships among data objects in a simple 2D layout. It is equally important to allow users to further examine the uncertainty of each ensemble data object. This section describes our solutions for quantifying and visualizing the uncertainty of the i-th ensemble data object U_i.

4.1 Uncertainty Quantification

Quantifying uncertainty plays an important role in the pipeline of uncertainty visualization [15]. It offers the variables for visual encoding. In this section, we describe two measures for quantifying the overall uncertainty of an ensemble data object and the detailed deviation of each ensemble member. We also discuss the limitations and application scenarios of each measure.

In the 1D case, the standard deviation is a widely used metric to quantify the uncertainty of a random variable. Inspired by this rule, our first measure models the overall uncertainty O_i of the ensemble data object U_i as the sum of the standard deviations in all dimensions:

O_i = Σ_{k=1}^{d} σ_i^k,   (10)

where σ_i^k represents the standard deviation on the k-th dimension. The deviation δ_i^t of the t-th ensemble member of U_i is defined as its Euclidean distance to the ensemble mean:

δ_i^t = ||U_i^t − Ū_i||.   (11)

The first measure is simple and can be used for most applications. However, it does not take the correlations among dimensions into account. The second measure employs the covariance matrix to quantify the uncertainty of each ensemble data object. The covariance matrix can be considered a generalization of variance to multiple dimensions. It measures the dispersion of an ensemble data object with respect to the ensemble mean. A geometrical representation of a covariance matrix is the hyper-ellipsoid [31], [54], whose axes correspond to the eigenvectors and square roots of the eigenvalues of the covariance matrix. Similar to [31], we use the volume of the hyper-ellipsoid to represent the overall uncertainty of an ensemble data object:

O_i = [π^{d/2} / Γ(d/2 + 1)] Π_{k=1}^{d} √λ_k,   (12)

where Γ(·) indicates the Gamma function and λ_k represents the eigenvalues of the covariance matrix Σ_i. Once established, the deviation of an ensemble member is defined as the Mahalanobis distance [55] to the ensemble mean:

δ_i^t = √[(U_i^t − Ū_i) Σ_i^{−1} (U_i^t − Ū_i)^T].   (13)

The second measure requires more computation than the first one due to a matrix inversion. In addition, a sufficient number of ensemble members is demanded by the second measure to correctly represent the correlations among dimensions. Thus, the second measure is not a good option when the number of ensemble members is much smaller than the dimensionality of the dataset.

4.2 From Quantification to Visualization

Following the basic principles identified by Sanyal et al. [32], that glyph size and color are two effective variables to display uncertainty in a 2D space, we design a color-bar-based representation called the ensemble bar to depict the overall uncertainty and the distribution of ensemble members. Our goal is to provide users a quick preview of the uncertainty patterns in an ensemble data object rather than complete details, which will be displayed in the parallel coordinates view. The height of the ensemble bar encodes the overall uncertainty: a higher bar implies an ensemble data object with larger overall uncertainty. The pattern shown in the bar depicts the distribution of all ensemble member deviations. To help users easily explore the distribution differences of different ensemble data objects, all ensemble bars have an identical width.

Let δ_i = {δ_i^1, δ_i^2, ..., δ_i^m} be the deviations of the m ensemble members from the ensemble mean Ū_i. The height of the ensemble bar is given by:

H_i = (1 − O_i) H_min + O_i H_max,   (14)

where H_min and H_max represent the minimum and maximum heights of all ensemble bars, respectively, and O_i ∈ [0, 1] is the normalized overall uncertainty of U_i.

In order to convey the distribution of ensemble member deviations in the bar, we first set the left and right edges of the bar to correspond to the minimum deviation min(δ_i) and the maximum deviation max(δ_i), respectively. Then, we equally discretize the entire bar into a set of bins and count the number of ensemble members in each bin. Each bin is drawn as a rectangle with a linearly varied sequential color map (e.g., from white to green in this paper).

The number of bins has a great impact on the pattern shown in an ensemble bar. In practice, various useful guidelines and rules of thumb can be employed. In our implementation, we take the square root of the number of ensemble members as the number of bins. This scheme is simple and has been widely used in many applications such as Excel. Users are allowed to interactively adjust the number of bins used to discretize the ensemble bar.

In general, the ensemble bar allows users to compare the overall uncertainties of different ensemble data objects through the heights of the ensemble bars (see Fig. 4). In addition, the patterns of the bins enable users to identify different types of distributions. Taking the uniform distribution as an example, an approximately monochromatic ensemble bar is produced with our method (see Fig. 4 (f)). This indicates that all bins in the ensemble bar have almost the same number of ensemble members. Similarly, for a positively skewed distribution in which more ensemble members are close to the ensemble mean, an ensemble bar with the dark-

[Figure: system interface with the uncertainty histogram, projection view, ensemble bar view, and widgets.]

green bin on the left is generated (see Fig. 4 (a,b)). 0.0 1.0 Overall Uncertainty Parameters view (e) WRF Points of Interest Microphysics e
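The two quantification measures can be sketched in numpy as follows. This is a minimal sketch: the function names are ours, the square roots of the covariance eigenvalues are used as the ellipsoid semi-axes, and the Mahalanobis deviation is written out in its standard form.

```python
import math
import numpy as np

def overall_uncertainty_sum_std(members):
    """First measure (Eq. 10): sum of per-dimension standard deviations.
    members: (m, d) array holding the m ensemble members of one object."""
    return float(members.std(axis=0).sum())

def deviations_euclidean(members):
    """Eq. 11: Euclidean distance of each member to the ensemble mean."""
    return np.linalg.norm(members - members.mean(axis=0), axis=1)

def overall_uncertainty_ellipsoid(members):
    """Second measure (Eq. 12): volume of the covariance hyper-ellipsoid,
    whose semi-axes are the square roots of the covariance eigenvalues."""
    d = members.shape[1]
    lam = np.clip(np.linalg.eigvalsh(np.cov(members, rowvar=False)), 0.0, None)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * float(np.prod(np.sqrt(lam)))

def deviations_mahalanobis(members):
    """Mahalanobis distance of each member to the ensemble mean (the
    deviation used together with the covariance-based measure)."""
    diff = members - members.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(members, rowvar=False))
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
```

The pseudo-inverse guards against a rank-deficient sample covariance, which is exactly the regime (fewer members than dimensions) in which the paper falls back to the first measure.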

Fig. 4. Various types of ensemble bars show different types of ensemble data objects, each of which consists of 100 members. The overall uncertainty increases from (a) to (f): (a) O_1 = 0.2, (b) O_2 = 0.25, (c) O_3 = 0.4, (d) O_4 = 0.45, (e) O_5 = 0.6, (f) O_6 = 0.65. In each case, a bar in dark green indicates the peak of the distribution of ensemble member deviations. The ensemble member deviations approximately follow: (a-b) the positively skewed distribution; (c-d) the normal distribution; (e) the negatively skewed distribution; (f) the uniform distribution.

Fig. 5. The main interface of our visual exploration system. (a) The uncertainty histogram. (b) The 2D projection view. (c) The ensemble bar view. (d) The continuous parallel coordinates view. (e) The parameters view. (f) The Geo-Location view.
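The bar construction described in Section 4.2 (height from Eq. (14), a square-root bin count, and bins spanning the range from min(δ_i) to max(δ_i)) can be sketched as follows; the function name and the default heights are illustrative, not values from the paper:

```python
import math

def ensemble_bar(deviations, overall_uncertainty, h_min=10.0, h_max=100.0):
    """Sketch of the ensemble bar construction: bar height via Eq. (14)
    and bin counts over the [min, max] deviation range.
    deviations: per-member deviations delta_i of one data object;
    overall_uncertainty: the normalized O_i in [0, 1]."""
    # Eq. (14): interpolate between the global minimum and maximum heights.
    height = (1.0 - overall_uncertainty) * h_min + overall_uncertainty * h_max
    # Square-root rule of thumb for the number of bins.
    n_bins = max(1, round(math.sqrt(len(deviations))))
    lo, hi = min(deviations), max(deviations)
    width = (hi - lo) / n_bins or 1.0  # guard against identical deviations
    counts = [0] * n_bins
    for dev in deviations:
        # Clamp so that the maximum deviation falls into the last bin.
        counts[min(int((dev - lo) / width), n_bins - 1)] += 1
    return height, counts
```

A strongly left-heavy `counts` profile corresponds to the positively skewed case (a-b) in Fig. 4, while a near-flat profile corresponds to the uniform case (f).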

5 VISUAL EXPLORATION AND INTERACTIONS

This section describes an integrated system with a suite of visualization and interaction tools for visually exploring a multidimensional ensemble dataset.

5.1 The Exploration Workflow

The exploration process starts with an uncertainty histogram, on which users can select target uncertainty intervals (e.g., intervals with high levels of uncertainty). Then, users can interactively investigate the low-dimensional layout of the ensemble data objects in the 2D projection view, which is equipped with a set of interactions such as zoom in/out and pointer/lasso selection. Once points of interest are specified, the geospatial locations associated with each data object are highlighted in the Geo-Location view. Meanwhile, the ensemble bar view shows all selected data objects. Users can further drill down to a specific ensemble bar to explore the high-level distribution and the overall uncertainty of a data object. Afterwards, users can select bins of interest in the ensemble bar to examine the selected ensemble members in an animated continuous parallel coordinates view. The parallel coordinates view allows users to study the distribution of ensemble members on each dimension. To investigate the source of uncertainty (e.g., parameter perturbations in this paper), a radial plot is provided to display the parameter configurations that generated these ensemble members.

5.2 Exploration Tools

The main user interface is composed of a set of linked widgets and views to enhance interactivity. Fig. 5 shows an overview of our system.

Uncertainty histogram. At the top left corner, an uncertainty histogram widget allows users to select data points with overall uncertainty in a desired range.

Projection view. This view displays the intrinsic structures of an ensemble dataset in the 2D plane. The distances among 2D data points encode the similarities of the corresponding ensemble data objects with regard to both ensemble means and ensemble distributions. Users can select either a point or a group of points for further analysis. A set of operations is provided in this view to support exploration, including zooming, panning, hovering, pointer selection, lasso selection, and multi-selection with the control key pressed.

Ensemble bar view. To facilitate comparison of the overall uncertainties of different data objects, the ensemble bar view is displayed at the right side of the main interface. The overall uncertainties are perceived through the heights of the bars, and the ensemble distributions are depicted by different types of patterns in the ensemble bars. The following interactions are provided in this view:
• Choosing a color map and the bin number;
• Dragging bars together for detailed comparison;
• Reordering bars according to the shown patterns; and
• Selecting bins of interest to further examine ensemble members in the parallel coordinates view.

Parallel coordinates view. A continuous representation is employed to convey the uncertainty on each dimension. Bright areas on a dimension indicate high certainty. Users can detect outliers with line dyeing, brushing, local zooming, and axis reordering. To ensure a smooth interaction, animated transitions are implemented in this view.

Parameters view. Studying the effect of parameters in an ensemble simulation (that is, when ensemble members are generated by different parameterizations) is an important task. For this purpose, we employ a radial plot in which each parameter corresponds to one of the equiangular axes. Initially, all parameter configurations are shown as a context. Once users select a set of ensemble members, the corresponding parameter configurations are highlighted in this view.

Geo-Location view. For geolocation-related applications, a geo-location view is employed. Uncertainty features identified in the projection view may be easily understood and verified by the domain experts in this view.

6 EXPERIMENTS AND DISCUSSION

We conducted three experiments to demonstrate the effectiveness and usefulness of our uncertainty-aware projection method and the exploration system. The system was implemented with standard C++ and Qt 5.0. The experiments were performed on a PC equipped with an Intel Core 2 Duo 3.0 GHz CPU, 4 GB host memory, and an NVIDIA Quadro 4000 video card with 1.5 GB video memory.

6.1 The Synthetic Dataset

We generated a synthetic 5D dataset consisting of 300 ensemble data objects to demonstrate the effectiveness of our method and the influence of the dissimilarity weights. We first randomly chose three anchor points that constitute an equilateral triangle as cluster centers in a 5D space. Then, we generated 100 ensemble data objects in each cluster. Each data object had 250 ensemble members following either a uniform or a normal distribution. The ensemble mean of a data object in a cluster was set as the cluster center with a small random bias, and the correlation matrix was set as a fixed scaled identity matrix. This process generated a dataset with three clusters. The Euclidean distances between any two cluster centers are equal. The geometrical differences of data objects in the same cluster are subtle, but the distributional differences are significant. The equilateral triangle structure in this dataset is designed to demonstrate the capability of our method to preserve high-dimensional features in a visual space.

Fig. 6. The projections produced with different dissimilarity weights on the ensemble mean difference and the ensemble distribution difference in the dissimilarity measure. (a) α = 1.0, β = 0.0. (b) α = 0.5, β = 1.0. (c) α = 0.25, β = 1.0. (d) α = 0.0, β = 1.0.

Fig. 6 shows the effect of the dissimilarity weights α and β (see Equation (7)) in our uncertainty-aware multidimensional projection method. Without considering the distributional differences (β = 0), all data objects with similar ensemble means are projected close to each other (see Fig. 6 (a)), even though data objects within a cluster have distinct distributions. This is consistent with the results of conventional multidimensional projection methods designed for certain (uncertainty-free) datasets. When the distributional differences are incorporated (β > 0), data objects within a cluster are gradually separated (see Fig. 6 (b,c)). For demonstration, we color the uniformly distributed data objects green and the normally distributed data objects purple. From Fig. 6 (b,c), we can observe that there are three clusters. In each cluster, we can further identify two sub-clusters, each of which corresponds to a particular type of distribution. However, these structures cannot be observed in Fig. 6 (a), as data objects within a cluster become distinguishable only when ensemble distributions are considered.

In general, our method not only preserves the global structure of the three clusters but also distinguishes data objects between the two types of ensemble distributions within each cluster. The degree of separation of the data objects within a cluster depends on the relative values of α and β. When the ensemble mean is not considered (α = 0), all data objects are separated into two clusters solely based on the two types of ensemble distributions (see Fig. 6 (d)).

6.2 The NBA Players' Statistics Dataset

We applied our method to a dataset consisting of 933 NBA players' career statistics¹ since 1981. In this dataset, we treat each player as an ensemble data object and the statistics of a season as an ensemble member. Each ensemble member is a 16-dimensional vector that describes the statistics of a season: Games Played, Minutes Played, Field Goals, Field Goals Attempted, 3-Point Field Goals, 3-Point Field Goals Attempted, Free Throws, Free Throws Attempted, Offensive Rebounds, Defensive Rebounds, Assists, Steals, Blocks, Turnovers, Personal Fouls, and Points. Each ensemble data object has at least 5 ensemble members. Because the number of ensemble members for most ensemble data objects is smaller than the dataset dimensionality of 16, we chose the first uncertainty quantification measure to evaluate the overall uncertainty (Equation (10)) and the ensemble member deviations from the mean (Equation (11)). The overall uncertainty in this case reflects the fluctuation of a player's performance over the years, i.e., a player's inconsistency.

Fig. 7 shows the results of the uncertainty-aware projection with different parameters. The color encodes the position of a player. We can see a triangular structure in Fig. 7 (a,b). The points in the upper part mainly represent players who play center (C) and players who play both center and forward (C-F). The points in the lower part primarily represent players who play guard (G) and players who play both forward and guard (F-G). In addition, it is easy to find that yellow points are distributed in both the upper and the lower parts. This is because forward (F) players are usually versatile. For demonstration, we annotated a set of players with their names. The results in Fig. 7 (a,b) show that the points on the left side are role players and the points on the right side are key players. More specifically, the points in the upper right corner represent excellent players of C or C-F, and the points in the lower right corner represent excellent players of G or F-G.

Fig. 7 (c) shows the result of the projection using only distributional differences. This result indicates that the seasonal fluctuation patterns of players in the C and C-F positions are quite different from those of players in the F-G and G positions. Furthermore, if we only use the geometrical differences among players to model their dissimilarities, the consistency of a player will be hidden. For example, without considering the distributional differences, the points in the region indicated by the magenta circle in Fig. 7 (a) are lumped together. This implies that their ensemble means are similar. However, Fig. 7 (c) reveals that their ensemble distributions are distinct. This explains why the points inside the magenta circle in Fig. 7 (b) are much more dispersed. And this dispersion may help a general manager identify consistent players from a pool of players who are similar in mean performance.

1. The statistics of a player whose career covers over 5 seasons are collected from a professional basketball statistics website (http://www.basketball-reference.com).

To further inspect the difference between the results with and without considering distributional differences, we highlight a group of points in a region indicated by the black circle in Fig. 7 (a). The same group of points is also highlighted in black in Fig. 7 (b) for comparison. Because Fig. 7 (a) only considers the geometrical differences among ensemble means, the highlighted points that are close to each other have subtle differences among their ensemble means. Take Josh Smith and Joe Barry Carroll as an example: they are located close to each other in Fig. 7 (a). However, they are dispersed when the distributional difference is taken into account in Fig. 7 (b). This is because their distributional differences are significant, which can be clearly verified in Fig. 7 (c) (notice the annotations in blue). The ensemble bars and the parallel coordinates views shown in Fig. 8 further verify this observation. On the other hand, Joe Barry Carroll and Shareef Abdur-Rahim are positioned closer to each other when considering the distributional difference because their ensemble distributions are quite similar, which is evident in their ensemble bars and parallel coordinates representations in Fig. 8.

From the heights of the ensemble bars in Fig. 8, we can conclude that the consistency of Josh Smith is higher than those of Joe Barry Carroll and Shareef Abdur-Rahim. Further examining the details, we can see that the high overall uncertainties of Joe Barry Carroll and Shareef Abdur-Rahim are partly caused by some flat, close-to-zero outlier ensemble members (highlighted in purple). These were caused by seasons in which they may have had injuries.

Fig. 7. Results for the NBA players' statistics dataset. The color encodes a player's position (C, C-F, F, F-G, G). (a) The result that considers only the geometrical differences, α = 1.0, β = 0.0. (b) The result of our method with α = 0.55, β = 0.45. (c) The result that considers only the distributional differences, α = 0.0, β = 1.0. In this example, |R_i| = 8 and |N_i| = 2.

Fig. 8. The ensemble bars and parallel coordinates representations for the three selected players.

6.3 The Numerical Weather Simulation Dataset

The dataset used in this experiment is an ensemble WRF simulation of the 1993 superstorm at 6:00 PM, Mar. 13th, 1993. We select a region with latitude from 30°N to 40°N and longitude from 85°W to 75°W. The dataset consists of 7,050 geospatial grid points. Each grid point contains 8 variables: wind speed in the zonal (E-W) and meridional (N-S) directions, temperature, soil temperature, water temperature, relative humidity, specific humidity, and mean sea level pressure. Forty runs of the WRF simulation with different cumulus schemes and microphysics schemes generated the ensemble members. In other words, our simulation produced an 8-dimensional ensemble dataset with 7,050 ensemble data objects, each of which has 40 ensemble members. As the number of ensemble members is much larger than the dimensionality of the dataset, the second uncertainty quantification measure was employed in this case.

Fig. 9 presents our results. To help users easily associate the points in the projection view with their geospatial locations, we use color to encode the geospatial locations. Specifically, we synthesize a 2D texture that covers the entire simulation region, i.e., latitude from 30°N to 40°N and longitude from 85°W to 75°W. For simplicity, a linear gradient fill from violet to orange along the diagonal is adopted for this texture. In this way, the points sampled from the sea area are encoded with colors close to violet and the points sampled from the land area are encoded with colors close to orange (see Fig. 9). For clarity, the map is drawn as a background. From the projection result, we can easily find that the points in the right part are closer to the sea area and the points in the left part are closer to the land area.

Fig. 9. The result of our projection method with α = 0.4, β = 0.6, |R_i| = 24, and |N_i| = 4. The color of a point is indexed from a 2D linear map (from violet to orange) with the geo-coordinates.

Our system can help users explore the dataset from multiple perspectives and levels-of-detail. We present two exploration scenarios below.

Visual cluster analysis. Identifying highly correlated data objects is a crucial data exploration task for gaining insights into an ensemble dataset, because highly correlated data objects (i.e., the clusters) constitute the major structure of a dataset. Fig. 10 shows the result of our uncertainty-aware projection method colored by the overall uncertainty. In this result, a color set from purple (low uncertainty) to yellow (high uncertainty) is employed. Because the overall uncertainty varies significantly, we employ a log transformation and scale the overall uncertainty to [0, 1].

Fig. 10. Cluster identification and analysis based on our uncertainty-aware projection method. Three preliminary clusters are clearly shown with our method. The geospatial locations encoded with overall uncertainties are displayed on the map. In this example, α = 0.3, β = 0.7, |R_i| = 24, and |N_i| = 4.

From the resulting 2D manifold, we can easily find that the entire dataset can be roughly grouped into three clusters: C1, C2, and C3. With the lasso tool, we can further select the points in these clusters and show the geospatial locations associated with each data object in the Geo-Location view. The results show that C1 contains data objects located in the Chesapeake Bay and Delaware Bay area, C2 contains data objects located in the sea area, and C3 contains data objects located over the land. This separation of data points based on land types is particularly interesting given that the geospatial location is not part of the data used in the projection. From the results in the Geo-Location view, we can further find that data objects over the sea have lower uncertainties than those over the land.

For comparison, we also projected this dataset by considering only the geometrical differences and only the distributional differences, respectively. The near set, the random set, the neighborhood graph, and the control points used in each projection were identical, while α and β were different. Fig. 11 (a) shows the result that only takes geometrical differences into account, i.e., α = 1.0 and β = 0.0. Fig. 11 (c) displays the result that only considers distributional differences, i.e., α = 0.0 and β = 1.0. Fig. 11 (b) presents the result that considers both geometrical and distributional differences, with α = 0.3 and β = 0.7.

From Fig. 11 (a), we can see that the major structure shown in this result is similar to that in the result produced by our uncertainty-aware projection method in Fig. 11 (b). However, the difference between the data objects over the sea and over the land is captured much more clearly by taking both the geometrical and distributional differences into account. Noticing the red points in C1 highlighted in Fig. 11 (a), many of them are lumped with points in C3. This is because the geometrical differences of the ensemble means among these points are subtle. On the other hand, from Fig. 11 (c) we know that the distributional differences between data objects in C1 and C3 are significant, because C1 and C3 are clearly separated in this projection.

Furthermore, more points outside the three main clusters (e.g., the points in the orange circle) are produced by the method without considering distributional differences in Fig. 11 (a). To investigate the reason why our method can avoid these outliers, three groups (G1, G2, and G3) of points are highlighted in different colors. For inspection, the points of G1, G2, and G3 are also highlighted in Fig. 11 (b) and Fig. 11 (c). We calculate the average pairwise geometrical difference between two groups by:

E(P, Q) = ( Σ_{U_i∈P, U_j∈Q} E(Ū_i, Ū_j) ) / ( |P||Q| ).   (15)

Similarly, the average pairwise distributional difference between two groups is calculated by:

J(P, Q) = ( Σ_{U_i∈P, U_j∈Q} J(U_i, U_j) ) / ( |P||Q| ).   (16)

In this case, E(G1, G2) = 0.29 and E(G1, G3) = 0.27. This means that the average geometrical dissimilarities of G1 to G2 and to G3 are almost the same, which faithfully explains why the points in G1 are located between G2 and G3 in Fig. 11 (a). However, J(G1, G2) = 0.82 and J(G1, G3) = 0.44. This means that the ensemble distributions of the points in G1 have higher dissimilarities to those in G2. The resulting projection in Fig. 11 (c) visually verifies this observation. From the result presented in Fig. 11 (b), we conclude that our method strikes a balance between methods that consider only geometrical differences or only distributional differences.

Another interesting observation is that there are two points located between G1 and G2 in Fig. 11 (b) (the region indicated by the black circle). We highlight their geospatial locations in the Geo-Location view. From the magnified map, we can see that they are sampled from the Florida shore of the Gulf of Mexico. For demonstration, we also highlight these two points in both Fig. 11 (a) and Fig. 11 (c) (the black points in the black circles). From Fig. 11 (a), we can infer that the ensemble means of these two points are similar to those of points over the land. From Fig. 11 (c), we can infer that their ensemble distributions are much more similar to those of points over the sea.
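The group-level averages of Equations (15) and (16) share one form and can be sketched as follows. `mean_distance` is an illustrative choice for the geometrical difference E; any distributional measure J(U_i, U_j) can be passed in the same way to obtain Eq. (16).

```python
import numpy as np

def mean_pairwise(P, Q, diff):
    """Average pairwise difference between two groups of ensemble data
    objects (the form shared by Eqs. 15 and 16): sum diff(Ui, Uj) over
    all Ui in P and Uj in Q, divided by |P||Q|."""
    return sum(diff(ui, uj) for ui in P for uj in Q) / (len(P) * len(Q))

def mean_distance(mem_i, mem_j):
    """Geometrical difference E: distance between the ensemble means.
    mem_i, mem_j: (m, d) arrays of ensemble members."""
    return float(np.linalg.norm(mem_i.mean(axis=0) - mem_j.mean(axis=0)))
```

Comparing `mean_pairwise(G1, G2, E)` against `mean_pairwise(G1, G3, E)`, as done in the text, quantifies why a means-only projection places G1 between G2 and G3.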

P1 P2

P3 Warm Sector

P4

(a) (b) (c)

Fig. 11. (a) The result that considers only geometrical differences, α =1.0, β =0.0, i.e., the projection method applied to the ensemble means. (b) The result of our uncertainty-aware projection, α =0.3, β =0.7. (c) The result that considers only the distributional differences, α =0.0 and β =1.0. In this example, |Ri| =24and |Ni| =4.

P3 that their ensemble distributions are much more similar to P1 P2 points over the sea. These observations further prove the P4 necessity of modeling the dissimilarities among ensembles P1 with both geometrical and distributional differences. Ensemble distribution investigation. Our system can also be employed to investigate the patterns of uncertainty in different areas and study the reason for the uncertainty of a single data object. For simplicity, we selected a set of P2 representative points (P1, P2, P3, and P4 in Fig. 11 (b)) in different areas for analysis. Fig. 12 shows the ensemble bars and parallel coordinate representations for the selected points. From the heights of the ensemble bars, we can see that P1 or P2 has a higher overall uncertainty than those P3 of P3 and P4. From the patterns depicted in the ensemble bars of P1 and P2, we can infer that their ensemble member deviations follow an approximately uniform distribution. By examining their parallel coordinates representations, we can visually verify this observation. From the ensemble bars P4 of P3 and P4, we may conclude that their ensemble member deviations approximately follow a positively skewed dis- tribution which is further confirmed by their parallel co- ordinates representations. Another interesting observation about the ensemble bars of P3 and P4 is that the color of the last bin varies greatly (see the magnified bins on the Fig. 12. The ensemble bars and parallel coordinates representations for four selected data objects. The selected bins are magnified on the right right side). This suggests that P4 has more outlier members. side. All selected ensemble members are highlighted in purple. By selecting the last bin in the ensemble bars of P3 and P4, more ensemble members are highlighted in the parallel coordinates representation of P4. need to put more emphasis on geometrical differences, a large α should be assigned. 
On the other hand, if users want to highlight the distributional differences among data 6.4 Discussion objects, a large β needs to be assigned. Without any prior Parameter configuration. Fig. 3 (e,f) reveal that the near set knowledge, α and β are both set as 0.5 for a preliminary |Ni| and the random set |Ri| have a great impact on projec- study of an ensemble data set. tion efficiency and accuracy. A large near set and random Projection performance. The overall projection complex- set require more computation time and resource, while a ity of our method is determined by three steps: building the high projection accuracy (i.e., low stress) can be obtained. neighborhood graph, estimating dissimilarities, and solving Because the dissimilarity estimation process relates to the the sparse linear system. number of ensemble members and the dimensionality of Building the neighborhood graph involves defining a the dataset, it is intractable to find an optimal size for the near set and a random set for each data object. A standard 2 near set and the random set. In our implementation, a procedure to find the nearest neighbors is prohibitive√ O(n ). good balance between efficiency and accuracy is achieved By reorganizing the input data objects into n clusters [50], √ 3 by setting |Ni| close to n/3 and |Ri| no larger than 4. the complexity can be reduced to O(n 2 ). The experiment in section 6.1 demonstrates the influence A core component of our method is to estimate the distri- of the dissimilarity weights α and β. Empirically, if users butional differences among data objects in the neighborhood 12 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS graph. The complexity is O(n(|Ni|+|Ri|)mdS). Here |Ni| is separation is caused by the more precise ensemble simu- the size of the near set, |Ri| is the size of the random set, m lations over the sea than over the land. 
This hypothesis is the number of ensemble members, d is the dimensionality came from the past experience with WRF simulations and of the dataset, and S √is the number of samples. In our was validated by examining the parallel coordinates of data implementation |Ni| = n/3, both |Ri| and S are constants, objects over the land and the sea. 3 thus the complexity becomes O(n 2 md). The meteorologist also observed that the warm sector As the derived linear system is sparse, symmetric, and and the cold sector (see Fig. 11 (c)) are much better sepa- positive definite, the conjugated gradient√ method is em- rated in the projection view when considering distributional ployed. In this case, the complexity is O(n k) [50], where differences rather than considering geometrical differences T k is the condition number of the matrix L L.(L represents only. This is because the uncertainty between the ensemble the coefficient matrix of the linear system.) members significantly increased in the warm sector located In general, the complexity of our method is closer to the center of storm at that time point. The reason for 3 √ O(max(n 2 md, n k)). In practice, dissimilarity estimation larger uncertainty is the uncertain timing of the cold front dominates the projection time in our experiments. For exam- associated with the storm. The different ensemble members ple, several hours are required to compute the dissimilarities timed the cold frontal passage differently, which affects in the numerical weather simulation dataset on our com- when each point in the warm sector leaves the warm sector puter. A parallel implementation of the entire system would and enters the cold sector. alleviate this problem. The computation consumption can be Based on these observations, the meteorologist agreed further decreased by employing a fast KDE algorithm, e.g., that projecting the data objects according to both geomet- the KD-tree based KDE. This is an avenue for future work. 
rical and distributional information provides a novel and In addition, from Equation (7), we know that changing useful perspective on the data. the value of α and/or β for exploration does not need to re-estimate the geometrical and distributional differences. 7.2 The Uncertainty Visualization Accordingly, we can obtain a new projection in seconds by The meteorologist stated that it is a routine but an effective solving a new linear system. means to use color for showing uncertainty on a geospa- Strengths and limitations. By considering both geomet- tial map. The color map provided a useful overview of rical and distributional differences, data objects with similar the uncertainty of data objects in regions of high or low summary statistics can be distinguished with our approach. uncertainty. The meteorologist further confirmed that the Our enhanced Laplacian-based projection scheme strikes a high uncertainty region (i.e., the yellow region) in Fig. 10 balance between the computation time and accuracy, which coincides with the path of the super storm center. can be easily verified in Fig. 3 (e,f). Currently, both the Besides color mapping, the ensemble bar representation probability reconstruction and the dissimilarity estimation provides a supplementary visualization for the ensemble processes in our approach are limited to numerical data uncertainty. The meteorologist asserted that this represen- objects. In the future, we will seek to adapt our approach tation was intuitive and helped him study the consensus of for categorical multivariate ensemble data visualization and ensemble members and compare different data objects. The exploration. meteorologist concluded that the ensemble bars can be used to characterize and identify different types of uncertainty. 
7 EVALUATION

We evaluated the exploration system with a meteorologist collaborator to analyze the ensemble simulation of the 1993 superstorm. We examined the ensemble data at the 44th hour of the entire 49-hour simulation. We intentionally chose a time point towards the end of the simulation, when there is more diversity among the ensemble distributions. According to the findings in Section 6.3, the dissimilarity weights were specified as α = 0.3, β = 0.7 so that data objects from different land types were clearly separated. The meteorologist can interactively adjust these weights to compare projection patterns. The observations and feedback are summarized in this section.

7.1 The Uncertainty-aware Projection Method

The meteorologist observed that, without considering the distributional differences among data objects, the Chesapeake Bay/Delaware Bay and the surrounding land overlap in the projection view (see Fig. 11 (a)). When both geometrical and distributional differences are considered, these two types of land cover are clearly separated (see Fig. 11 (b, c)). The meteorologist hypothesized that this separation reflects differing ensemble distributions over water and land. This hypothesis came from past experience with WRF simulations and was validated by examining the parallel coordinates of data objects over the land and the sea.

The meteorologist also observed that the warm sector and the cold sector (see Fig. 11 (c)) are much better separated in the projection view when distributional differences are considered rather than geometrical differences only. This is because the uncertainty among the ensemble members increased significantly in the warm sector, located closer to the center of the storm at that time point. The reason for the larger uncertainty is the uncertain timing of the cold front associated with the storm: the ensemble members timed the cold frontal passage differently, which affects when each point in the warm sector leaves the warm sector and enters the cold sector.

Based on these observations, the meteorologist agreed that projecting the data objects according to both geometrical and distributional information provides a novel and useful perspective on the data.

7.2 The Uncertainty Visualization

The meteorologist stated that using color to show uncertainty on a geospatial map is routine but effective. The color map provided a useful overview of the uncertainty of data objects in regions of high or low uncertainty. The meteorologist further confirmed that the high-uncertainty region (i.e., the yellow region) in Fig. 10 coincides with the path of the superstorm center.

Besides color mapping, the ensemble bar representation provides a supplementary visualization of the ensemble uncertainty. The meteorologist asserted that this representation was intuitive and helped him study the consensus of ensemble members and compare different data objects. He concluded that the ensemble bars can be used to characterize and identify different types of uncertainty. For example, ensemble bars with dark green to the left indicate positively skewed distributions in which most ensemble members are grouped together with few outliers (e.g., data objects located on the sea surface). Ensemble bars with dark green in the middle or to the right represent little to no agreement among ensemble members (e.g., data objects located in the warm sector).

7.3 The Linked Multi-view Interface

The meteorologist found the linked interaction mode helpful. A cluster of points in the projection view can be easily selected, and their ensemble distributions can then be reviewed immediately in the ensemble bar view. Selecting bins in an ensemble bar allows a group of ensemble members to be quickly selected and reviewed in the parallel coordinates view and the parameter view. In particular, the last several bins in an ensemble bar usually contain the outlier ensemble members that require further analysis.

The parallel coordinates view helps the meteorologist examine the distribution of individual variables of a data object. The meteorologist prefers the continuous parallel coordinates plot for a more intuitive presentation of the distribution. The ability to reorder the axes in the parallel coordinates view helps organize similarly distributed variables close to each other for comparison.

The meteorologist pointed out that the soil temperature variable was not simulated but inherited from the input NARR dataset; thus, it contributed no uncertainty to the ensembles. He expressed the desire to interactively select a group of variables for analysis. Currently, this feature is not supported in our system due to the time-consuming dissimilarity computation. However, a parallel computing system can help achieve this goal in the future.

IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 13
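The meteorologist's reading of the ensemble bars amounts to a statement about distribution shape: mass grouped to the left with a long tail indicates positive skew (consensus plus a few outliers), while mass spread across the bins indicates low agreement. The sketch below makes that classification concrete; the synthetic members, the skewness threshold, and the function name are illustrative assumptions, not part of the paper's method.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

# Two synthetic ensemble data objects (50 members each).
# "Sea-like": members tightly grouped with a few high outliers.
sea = np.concatenate([rng.normal(10.0, 0.2, 47), [14.0, 15.0, 16.0]])
# "Warm-sector-like": members spread widely, little consensus.
warm = rng.uniform(5.0, 15.0, 50)

def bar_type(members, skew_thresh=1.0):
    """Crude classification echoing how the ensemble bars are read."""
    if skew(members) > skew_thresh:
        return "grouped-with-outliers (positively skewed)"
    return "low-agreement (spread out)"

print(bar_type(sea))
print(bar_type(warm))
```

Sample skewness is only one summary; the ensemble bars themselves show the full binned distribution, which is why outlier members end up concentrated in the last few bins.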
8 CONCLUSION

We present a visualization and exploration system for multidimensional ensemble datasets whose core is a novel uncertainty-aware multidimensional projection method. This method considers not only the ensemble means but also the ensemble distributions. Experiments on an artificial dataset demonstrate the ability of the uncertainty-aware projection to distinguish ensemble data objects with similar means but different ensemble distributions. Results on both real-world and simulation datasets verify that 1) differences in ensemble distributions are crucial for proper ensemble analysis, and 2) our uncertainty-aware projection can distinguish ensemble distributions in real-world data.

ACKNOWLEDGMENTS

This work was partly supported by the National High Technology Research and Development Program of China (2012AA12090), the Major Program of the National Natural Science Foundation of China (61232012), the National Natural Science Foundation of China (61422211, 61379076), the Zhejiang Provincial Natural Science Foundation of China (LR13F020001, LR14F020002), the National Science Foundation (NSF1117871), Pacific Northwest National Laboratory under U.S. Department of Energy Contract DE-AC05-76RL01830, the Program for New Century Excellent Talents in University of China (NCET-12-1087), and the NUS-ZJU SeSama center.

REFERENCES

[1] J. Michalakes, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang, "The weather research and forecast model: software architecture and performance," in Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing in Meteorology, 2004, pp. 156–168.
[2] C. Johnson and A. Sanderson, "A next step: Visualizing errors and uncertainty," IEEE Computer Graphics and Applications, vol. 23, no. 5, pp. 6–10, 2003.
[3] K. Potter, S. Gerber, and E. Anderson, "Visualization of uncertainty without a mean," IEEE Computer Graphics and Applications, vol. 33, no. 1, pp. 75–79, 2013.
[4] M. Hlawatsch, P. Leube, W. Nowak, and D. Weiskopf, "Flow radar glyphs: static visualization of unsteady flow with uncertainty," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 1949–1958, 2011.
[5] K. Potter, J. Kniss, R. Riesenfeld, and C. Johnson, "Visualizing summary statistics and uncertainty," Computer Graphics Forum, vol. 29, no. 3, pp. 823–832, 2010.
[6] J. Sanyal, S. Zhang, J. Dyer, A. Mercer, P. Amburn, and R. Moorhead, "Noodles: A tool for visualization of numerical weather model ensemble uncertainty," IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1421–1430, 2010.
[7] R. Botchen, D. Weiskopf, and T. Ertl, "Texture-based visualization of uncertainty in flow fields," in Proceedings of IEEE Visualization (VIS '05), 2005, pp. 647–654.
[8] S. Djurcilov, K. Kim, P. Lermusiaux, and A. Pang, "Visualizing scalar volumetric data with uncertainty," Computers & Graphics, vol. 26, no. 2, pp. 239–248, 2002.
[9] K. Pöthkow, B. Weber, and H. Hege, "Probabilistic marching cubes," Computer Graphics Forum, vol. 30, no. 3, pp. 931–940, 2011.
[10] K. Pöthkow and H. Hege, "Positional uncertainty of isocontours: Condition analysis and probabilistic measures," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 10, pp. 1393–1406, 2011.
[11] L. Gosink, K. Bensema, T. Pulsipher, H. Obermaier, M. Henry, H. Childs, and K. Joy, "Characterizing and visualizing predictive uncertainty in numerical ensembles through Bayesian model averaging," IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2703–2712, 2013.
[12] M. Hummel, H. Obermaier, C. Garth, and K. Joy, "Comparative visual analysis of Lagrangian transport in CFD ensembles," IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, 2013.
[13] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications. Springer, 2005.
[14] P. Joia, F. Paulovich, D. Coimbra, J. Cuminato, and L. Nonato, "Local affine multidimensional projection," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2563–2571, 2011.
[15] A. Pang, C. Wittenbrink, and S. Lodha, "Approaches to uncertainty visualization," The Visual Computer, vol. 13, no. 8, pp. 370–390, 1997.
[16] M. Riveiro, "Evaluation of uncertainty visualization techniques for information fusion," in Proceedings of the 10th International Conference on Information Fusion, IEEE, 2007, pp. 1–8.
[17] S. Deitrick and R. Edsall, "The influence of uncertainty visualization on decision making: An empirical evaluation," Progress in Spatial Data Handling, pp. 719–738, 2006.
[18] J. Thomson, E. Hetzler, A. MacEachren, M. Gahegan, and M. Pavel, "A typology for visualizing uncertainty," in Proceedings of SPIE, vol. 5669, 2005, pp. 146–157.
[19] K. Potter, P. Rosen, and C. Johnson, "From quantification to visualization: A taxonomy of uncertainty visualization approaches," Uncertainty Quantification in Scientific Computing, pp. 226–249, 2012.
[20] T. Pfaffelmoser, M. Mihai, and R. Westermann, "Visualizing the variability of gradients in uncertain 2D scalar fields," IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 11, pp. 1948–1961, 2013.
[21] F. Jiao, J. Phillips, Y. Gur, and C. Johnson, "Uncertainty visualization in HARDI based on ensembles of ODFs," in IEEE Pacific Visualization Symposium, 2012, pp. 193–200.
[22] G. Grigoryan and P. Rheingans, "Point-based probabilistic surfaces to show surface uncertainty," IEEE Transactions on Visualization and Computer Graphics, vol. 10, no. 5, pp. 564–573, 2004.
[23] V. Dinesha, N. Adabala, and V. Natarajan, "Uncertainty visualization using HDR volume rendering," The Visual Computer, vol. 28, no. 3, pp. 265–278, 2012.
[24] C. H. Lee and A. Varshney, "Representing thermal vibrations and uncertainty in molecular surfaces," in SPIE Conference on Visualization and Data Analysis, vol. 4665, 2002, pp. 80–90.
[25] P. Rhodes, R. Laramee, R. Bergeron, and T. Sparr, "Uncertainty visualization methods in isosurface rendering," in Eurographics, 2003, pp. 83–88.
[26] B. Zehner, N. Watanabe, and O. Kolditz, "Visualization of gridded scalar data with uncertainty in geosciences," Computers & Geosciences, vol. 36, no. 10, pp. 1268–1275, 2010.
[27] G. Schmidt, S. Chen, A. Bryden, M. Livingston, L. Rosenblum, and B. Osborn, "Multidimensional visual representations for underwater environmental uncertainty," IEEE Computer Graphics and Applications, vol. 24, no. 5, pp. 56–65, 2004.
[28] T. Pfaffelmoser, M. Reitinger, and R. Westermann, "Visualizing the positional and geometrical variability of isosurfaces in uncertain scalar fields," Computer Graphics Forum, vol. 30, no. 3, pp. 951–960, 2011.
[29] A. Coninx, G. Bonneau, J. Droulez, and G. Thibault, "Visualization of uncertain scalar data fields using color scales and perceptually adapted noise," in Proceedings of the ACM SIGGRAPH Symposium on Applied Perception in Graphics and Visualization, 2011, pp. 59–66.

[30] C. Lundström, P. Ljung, A. Persson, and A. Ynnerman, "Uncertainty visualization in medical volume rendering using probabilistic animation," IEEE Transactions on Visualization and Computer Graphics, vol. 13, no. 6, pp. 1648–1655, 2007.
[31] Y. Wu, G. Yuan, and K. Ma, "Visualizing flow of uncertainty through analytical processes," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2526–2535, 2012.
[32] J. Sanyal, S. Zhang, G. Bhattacharya, P. Amburn, and R. Moorhead, "A user study to compare four uncertainty visualization methods for 1D and 2D datasets," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1209–1218, 2009.
[33] A. Wilson and K. Potter, "Toward visual analysis of ensemble data sets," in Proceedings of the Workshop on Ultrascale Visualization, 2009, pp. 48–53.
[34] K. Potter, A. Wilson, P. Bremer, D. Williams, C. Doutriaux, V. Pascucci, and C. Johnson, "Ensemble-Vis: A framework for the statistical visualization of ensemble data," in IEEE International Conference on Data Mining Workshops, 2009, pp. 233–240.
[35] P. Diggle, P. Heagerty, K. Liang, and S. Zeger, Analysis of Longitudinal Data. Oxford University Press, 2002.
[36] T. Höllt, A. Magdy, G. Chen, G. Gopalakrishnan, I. Hoteit, C. Hansen, and M. Hadwiger, "Visual analysis of uncertainties in ocean forecasts for planning and operation of off-shore structures," in Proceedings of the IEEE Pacific Visualization Symposium (PacificVis), 2013, pp. 185–192.
[37] P. Wong and R. Bergeron, "30 years of multidimensional multivariate visualization," Scientific Visualization, Overviews, Methodologies, and Techniques, pp. 3–33, 1997.
[38] J. Daniels, E. Anderson, L. G. Nonato, and C. Silva, "Interactive vector field feature identification," IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1560–1568, 2010.
[39] W. Chen, Z. Ding, S. Zhang, A. MacKay-Brandt, S. Correia, H. Qu, J. Crow, D. Tate, Z. Yan, and Q. Peng, "A novel interface for interactive exploration of DTI fibers," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1433–1440, 2009.
[40] A. Anand, L. Wilkinson, and T. Dang, "Visual pattern discovery using random projections," in Proceedings of the IEEE Conference on Visual Analytics Science and Technology, 2012, pp. 43–52.
[41] S. Cha, "Comprehensive survey on distance/similarity measures between probability density functions," International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 4, pp. 300–307, 2007.
[42] D. Scott and S. Sain, "Multi-dimensional density estimation," Handbook of Statistics, vol. 23, pp. 229–263, 2004.
[43] M. Minnotte and D. Scott, "The mode tree: A tool for visualization of nonparametric density features," Journal of Computational and Graphical Statistics, vol. 2, no. 1, pp. 51–68, 1993.
[44] O. Lampe and H. Hauser, "Interactive visualization of streaming data with kernel density estimation," in IEEE Pacific Visualization Symposium, 2011, pp. 171–178.
[45] B. Silverman, Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.
[46] D. Scott, Multivariate Density Estimation. Wiley, 1992.
[47] F. Anscombe, "Graphs in statistical analysis," The American Statistician, vol. 27, no. 1, pp. 17–21, 1973.
[48] B. Jiang, J. Pei, Y. Tao, and X. Lin, "Clustering uncertain data based on probability distribution similarity," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 751–763, 2013.
[49] C. Andrieu, N. De Freitas, A. Doucet, and M. Jordan, "An introduction to MCMC for machine learning," Machine Learning, vol. 50, no. 1, pp. 5–43, 2003.
[50] F. Paulovich, L. Nonato, R. Minghim, and H. Levkowitz, "Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping," IEEE Transactions on Visualization and Computer Graphics, vol. 14, no. 3, pp. 564–575, 2008.
[51] S. Ingram, T. Munzner, and M. Olano, "Glimmer: Multilevel MDS on the GPU," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 2, pp. 249–261, 2009.
[52] T. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theoretical Computer Science, vol. 38, pp. 293–306, 1985.
[53] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.
[54] S. Brandt, Data Analysis: Statistical and Computational Methods for Scientists and Engineers. Springer, 1999.
[55] P. C. Mahalanobis, "On the generalized distance in statistics," in Proceedings of the National Institute of Sciences of India, vol. 2, no. 1, 1936, pp. 49–55.

Haidong Chen received a BS degree from Northeast Normal University in 2009 and a master's degree from Zhejiang University in 2011. He is currently pursuing a PhD degree at Zhejiang University. His research interests include uncertainty visualization and high-dimensional data visualization.

Song Zhang received a BS degree in computer science from Nankai University in 1996 and a PhD degree in computer science from Brown University in 2006. He is an associate professor in the Department of Computer Science and Engineering, Mississippi State University. His research interests include scientific visualization, data analysis, medical imaging, and computer graphics.

Wei Chen received a PhD degree from the Fraunhofer Institute for Graphics, Darmstadt, Germany, in July 2002. He is a professor in the State Key Lab of CAD & CG at Zhejiang University, P.R. China. From July 2006 to September 2008, he was a visiting scholar at Purdue University. His current research interests include visualization and visual analytics.

Honghui Mei is currently pursuing a PhD degree at Zhejiang University. His research interests include visualization and visual analytics.

Jiawei Zhang is a PhD student in Electrical and Computer Engineering. His research interests include visual analytics and information visualization.

Andrew Mercer received a PhD in meteorology from the University of Oklahoma in 2008. He is a meteorologist whose previous research has involved using artificial intelligence and advanced statistics to model weather phenomena. His current research interests lie in the integration of artificial intelligence into all facets of the geosciences.

Ronghua Liang received a PhD degree in computer science from Zhejiang University in 2003. He is currently a professor of computer science and the vice dean of the College of Information Engineering, Zhejiang University of Technology, China. His research interests include information visualization, computer vision, and medical visualization.

Huamin Qu obtained a BS in mathematics from Xi'an Jiaotong University, China, and an MS and a PhD (2004) in computer science from Stony Brook University. He is an associate professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology. His main research interests are in visualization and computer graphics.