The History of Histograms (Abridged)


Yannis Ioannidis
Department of Informatics and Telecommunications, University of Athens
Panepistimioupolis, Informatics Buildings, 157-84 Athens, Hellas (Greece)
[email protected]

Abstract

The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small!

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003

1 Prehistory

The word 'histogram' is of Greek origin, as it is a composite of the words 'isto-s' (ιστός) (= 'mast', also means 'web' but this is not relevant to this discussion) and 'gram-ma' (γράμμα) (= 'something written'). Hence, it should be interpreted as a form of writing consisting of 'masts', i.e., long shapes vertically standing, or something similar. It is not, however, a word that was originally used in the Greek language¹. The term 'histogram' was coined by the famous statistician Karl Pearson² to refer to a "common form of graphical representation". In the Oxford English Dictionary quotes from "Philosophical Transactions of the Royal Society of London", Series A, Vol. CLXXXVI (1895), p. 399, it is mentioned that "[The word 'histogram' was] introduced by the writer in his lectures on statistics as a term for a common form of graphical representation, i.e., by columns marking as areas the frequency corresponding to the range of their base." Stigler identifies the lectures as the 1892 lectures on the geometry of statistics [69].

The above quote suggests that histograms were used long before they received their name, but their birth date is unclear. Bar charts (i.e., histograms with an individual 'base' element associated with each column) most likely predate histograms, and this helps us put a lower bound on the timing of their first appearance. The oldest known bar chart appeared in a book by the Scottish political economist William Playfair³ titled "The Commercial and Political Atlas" (London, 1786) and shows the imports and exports of Scotland to and from seventeen countries in 1781 [74]. Although Playfair was skeptical of the usefulness of his invention, it was adopted by many in the following years, including, for example, Florence Nightingale, who used bar charts in 1859 to compare mortality in the peacetime army to that of civilians and through them convinced the government to improve army hygiene.

¹ To the contrary, the word 'history' is indeed part of the Greek language ('istoria' - ιστορία) and in use since ancient times. Despite its similarity to 'histogram', however, it appears to have a different etymology, one that is related to the original meaning of the word, which was 'knowledge'.
² His claim to fame includes, among others, the chi-square test for statistical significance and the term 'standard deviation'.
³ In addition to the bar chart, Playfair is probably the father of the pie chart and other extremely intuitive and useful visualizations that we use today.
From all the above, it is clear that histograms were first conceived as a visual aid to statistical approximations. Even today this point is still emphasized in the common conception of histograms: Webster's defines a histogram as "a bar graph of a frequency distribution in which the widths of the bars are proportional to the classes into which the variable has been divided and the heights of the bars are proportional to the class frequencies". Histograms, however, are extremely useful even when disassociated from their canonical visual representation and treated as purely mathematical objects capturing data distribution approximations. This is precisely how we approach them in this paper.

In the past few decades, histograms have been used in several fields of informatics. Besides databases, histograms have played a very important role primarily in image processing and computer vision. Given an image (or a video) and a visual pixel parameter, a histogram captures for each possible value of the parameter (Webster's "classes") the number of pixels that have this value (Webster's "frequencies"). Such a histogram is a summary that is characteristic of the image and can be very useful in several tasks: identifying similar images, compressing the image, and others. Color histograms are the most common in the literature, e.g., in the QBIC system [21], but several other parameters have been proposed as well, e.g., edge density, texturedness, intensity gradient, etc. [61]. In general, histograms used in image processing and computer vision are accurate. For example, a color histogram contains a separate and precise count of pixels for each possible distinct color in the image. The only element of approximation might be in the number of bits used to represent different colors: fewer bits imply that several actual colors are represented by one, which will be associated with the number of pixels that have any of the colors that are grouped together. Even this kind of approximation is not common, however.

In databases, histograms are used as a mechanism for full-fledged compression and approximation of data distributions. They first appeared in the literature and in systems in the 1980's and have been studied extensively since then at a continuously increasing rate. In this paper, we concentrate on the database notion of histograms, discuss the most important developments on the topic so far, and outline several problems that we believe are interesting and whose solution may further expand their applicability and usefulness.

2 Histogram Definitions

2.1 Data Distributions

Consider a relation R with n numeric attributes Xi (i = 1..n). The value set Vi of attribute Xi is the set of values of Xi that are present in R. Let Vi = { vi(k) : 1 ≤ k ≤ Di }, where vi(k) < vi(j) when k < j. The spread si(k) of vi(k) is defined as si(k) = vi(k+1) − vi(k), for 1 ≤ k < Di. (We take si(Di) = 1.) The frequency fi(k) of vi(k) is the number of tuples in R with Xi = vi(k). The area ai(k) of vi(k) is defined as ai(k) = fi(k) × si(k). The data distribution of Xi is the set of pairs Ti = { (vi(1), fi(1)), (vi(2), fi(2)), ..., (vi(Di), fi(Di)) }. The joint frequency f(k1, .., kn) of the value combination <v1(k1), .., vn(kn)> is the number of tuples in R that contain vi(ki) in attribute Xi, for all i. The joint data distribution T1,..,n of X1, .., Xn is the entire set of (value combination, joint frequency) pairs. In the sequel, for 1-dimensional cases, we use the above symbols without the subscript i.
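To make these definitions concrete, here is a minimal sketch in Python (not part of the original paper; the function name and representation are our own illustrative choices) that computes the value set, frequencies, spreads, and areas of a single attribute from the list of its values, one per tuple of R:

```python
from collections import Counter

def data_distribution(column):
    """Compute the data distribution T = {(v(k), f(k))} of one numeric
    attribute, plus the spreads s(k) and areas a(k) of Section 2.1.
    `column` is any iterable of numeric values (one per tuple of R)."""
    freq = Counter(column)                # f(k): number of tuples with value v(k)
    values = sorted(freq)                 # value set V, so that v(k) < v(j) for k < j
    D = len(values)
    # s(k) = v(k+1) - v(k) for k < D; by convention s(D) = 1
    spreads = [values[k + 1] - values[k] for k in range(D - 1)] + [1]
    areas = [freq[v] * s for v, s in zip(values, spreads)]  # a(k) = f(k) * s(k)
    T = [(v, freq[v]) for v in values]    # the data distribution itself
    return T, spreads, areas

# Example: attribute values drawn from a 7-tuple relation
T, s, a = data_distribution([1, 1, 2, 5, 5, 5, 9])
print(T)  # [(1, 2), (2, 1), (5, 3), (9, 1)]
print(s)  # [1, 3, 4, 1]
print(a)  # [2, 3, 12, 1]
```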
2.2 Motivation for Histograms

Data distributions are very useful in database systems but are usually too large to be stored accurately, so histograms come into play as an approximation mechanism. The two most important applications of histogram techniques in databases have been selectivity estimation and approximate query answering, within query optimization (for the former) or pre-execution user-level query feedback (for both). Our discussion below focuses exactly on these two, especially range-query selectivity estimation, as this is the most popular issue in the literature. It should not be forgotten, however, that histograms have proved to be useful in the context of several other database problems as well, e.g., load-balancing in parallel join query execution [65], partition-based temporal join execution [68], and others.

2.3 Histograms

A histogram on an attribute X is constructed by partitioning the data distribution of X into β (≥ 1) mutually disjoint subsets called buckets and approximating the frequencies and values in each bucket in some common fashion. This definition leaves several degrees of freedom in designing specific histogram classes, as there are several possible choices for each of the following (mostly orthogonal) aspects of histograms [67]; a small code sketch of one such design point follows the list.

Partition Rule: This is further analyzed into the following characteristics:

• Partition Class: This indicates if there are any restrictions on the buckets. Of great importance is the serial class, which requires that buckets are non-overlapping with respect to some parameter (the next characteristic), and its subclass end-biased, which requires at most one non-singleton bucket.

• Sort Parameter: This is a parameter whose value for each element in the data distribution is derived from the corresponding attribute value and frequencies. All serial histograms require that the sort parameter values in each bucket form a contiguous range. Attribute value (V), frequency (F), and area (A) are examples of sort parameters that have been discussed in the literature.

• Source Parameter: This captures the property of the data distribution that is approximated in each bucket; frequency (F) and area (A) are the most commonly used source parameters.
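As an illustration of one point in this design space — not any specific histogram class from the paper — the following sketch builds a serial histogram with sort parameter V by partitioning a data distribution into β buckets covering contiguous value ranges, storing the average frequency per bucket, and then uses it for the range-query selectivity estimation of Section 2.2. The equi-count partitioning and the uniform-spread assumption inside buckets are our own illustrative choices:

```python
def build_serial_histogram(T, beta):
    """Partition a data distribution T = [(value, frequency), ...], sorted
    by value, into beta buckets contiguous in the sort parameter V. Each
    bucket stores its value range, its number of distinct values, and its
    average frequency (a uniform-frequency approximation)."""
    per_bucket = max(1, -(-len(T) // beta))    # ceil(len(T) / beta) values per bucket
    buckets = []
    for i in range(0, len(T), per_bucket):
        chunk = T[i:i + per_bucket]
        lo, hi = chunk[0][0], chunk[-1][0]
        avg_f = sum(f for _, f in chunk) / len(chunk)
        buckets.append((lo, hi, len(chunk), avg_f))
    return buckets

def estimate_range_count(buckets, a, b):
    """Estimate the number of tuples with a <= X <= b, assuming values and
    frequencies are spread uniformly inside each bucket."""
    total = 0.0
    for lo, hi, n_vals, avg_f in buckets:
        if lo == hi:                           # singleton bucket
            total += n_vals * avg_f if a <= lo <= b else 0.0
            continue
        overlap = max(0.0, min(b, hi) - max(a, lo))
        total += n_vals * avg_f * (overlap / (hi - lo))
    return total

# With the data distribution from the previous sketch:
T = [(1, 2), (2, 1), (5, 3), (9, 1)]
buckets = build_serial_histogram(T, beta=2)
print(buckets)                             # [(1, 2, 2, 1.5), (5, 9, 2, 2.0)]
print(estimate_range_count(buckets, 1, 4)) # 3.0 (the exact answer is also 3)
```

Different choices along the aspects above — e.g., equi-width value ranges instead of equi-count chunks, frequency or area as the sort parameter, or end-biased rather than general serial partitions — yield the different histogram classes that the taxonomy distinguishes.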