Analysis of Uncertain Data: Smoothing of Histograms Eugene Fink Ankur Sarin Jaime G. Carbonell
[email protected] [email protected] [email protected] www.cs.cmu.edu/~eugene www.cs.cmu.edu/~jgc Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States Abstract—We consider the problem of converting a set of numeric data points into a smoothed approximation of II. PROBLEM the underlying probability distribution. We describe a We consider the task of smoothing a probability distribution. representation of distributions by histograms with vari- Specifically, we need to develop a procedure that inputs a able-width bars, and give a greedy smoothing algorithm point set or a fine-grained histogram, and constructs a based on this representation. coarser-grained histogram. We explain the representation of Keywords—Probability distributions, smoothing, com- probability distributions in the developed system and then pression, visualization, histograms. formalize the smoothing problem. Representation: We encode an approximated probability I. INTRODUCTION density function by a histogram with variable-width bars, HEN analyzing a collection of numeric measurements, which is called a piecewise-uniform distribution and abbre- Wwe often need to represent it as a histogram that shows viated as PU-distribution. This representation is a set of dis- the underlying distribution. For example, consider the set of joint intervals, with a probability assigned to each interval, as 5000 numeric points in Figure 9(a) on the next-to-last page of shown in Figure 1 [Bardak et al., 2006a; Fu et al., 2008]. We the paper. We may convert it to a fine-grained histogram in assume that all values within a specific interval are equally likely; thus, every interval is a uniform distribution, and the Figure 9(b), to a coarser version in Figure 9(c), or to a very histogram comprises multiple such distributions.