Discriminative and Dynamic Similarity-Preserving Sketching of Streaming Histograms
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015
D2HistoSketch: Discriminative and Dynamic Similarity-Preserving Sketching of Streaming Histograms
Dingqi Yang, Bin Li, Laura Rettig, and Philippe Cudré-Mauroux
Dingqi Yang, Laura Rettig, and Philippe Cudré-Mauroux are with the Department of Informatics at the University of Fribourg, Switzerland. Bin Li is with the School of Computer Science, Fudan University, Shanghai, China. Manuscript received xxx; revised xxx.
Abstract: Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is a challenging task for streaming histograms, where the elements of a histogram are observed one after the other in an online manner. The ever-growing cardinality of histogram elements over the data streams makes any similarity computation inefficient in that case. To tackle this problem, we propose in this paper D2HistoSketch, a similarity-preserving sketching method for streaming histograms to efficiently approximate their Discriminative and Dynamic similarity. D2HistoSketch can quickly and memory-efficiently maintain a set of compact and fixed-size sketches of streaming histograms to approximate the similarity between histograms. To provide high-quality similarity approximations, D2HistoSketch considers both discriminative and gradual forgetting weights for similarity measurement, and seamlessly incorporates them in the sketches. Based on both synthetic and real-world datasets, our empirical evaluation shows that our method is able to efficiently and effectively approximate the similarity between streaming histograms while outperforming state-of-the-art sketching methods. Compared to full streaming histograms with both discriminative and gradual forgetting weights in particular, D2HistoSketch is able to dramatically reduce the classification time (with a 7500x speedup) at the expense of only a small loss in accuracy (about 3.25%).
Index Terms: Similarity-Preserving Sketching, Histograms, Streaming Data, Concept Drift, Discriminative Weighting
1 INTRODUCTION
Histograms are an important statistic reflecting the empirical distribution of data. They have been widely used not only as a popular data analysis and visualization tool, but also as an important feature for measuring similarities between data instances, such as color histograms for images or word histograms for documents. As a result, histogram-based similarity measures have been extensively exploited in many classification and clustering tasks and for various application domains, including image processing [1], document analysis [2], social network analysis [3], and business intelligence [4].
Despite its importance in machine learning, computing histogram-based similarities is often difficult in practice, particularly for data streams. In this study, we consider streaming histograms, where the elements of a histogram are observed over a data stream as shown in Fig. 1(a) in Section 3. Streaming histograms can be used for a wide range of applications, such as solving range queries and similarity search in a streaming database, and change detection and classification over data streams. In practice, streaming histograms are often seen when online or offline businesses observe their customers' activity data. For example, a Point of Interest (POI), such as a supermarket or a restaurant, may observe a continuous data stream of visits from its customers and consider analyzing the histogram of its customers' visits. By measuring the similarity between two POIs based on such histograms, one can build various high-quality applications. For example, semantic place labeling [4] infers a POI's type based on the assumption that two POIs sharing similar histograms of their customers probably belong to the same type. However, it is challenging to measure the similarity between such streaming histograms in practice, due to the ever-increasing cardinality of the histogram elements over time. In the above example, this corresponds to the case of an ever-growing number of customers. The monotonically increasing size of the streaming histograms makes any similarity computation inefficient, which further makes learning algorithms impractical.
To solve this problem, similarity-preserving data sketching (hashing) techniques [5] have been intensively studied in stream data processing [6], [7]. Their key idea is to maintain a set of compact and fixed-size sketches for the original data to approximate their similarity under a certain measure. In the current literature, most existing data sketching techniques [8], [9], [10], [11] consider the case of streaming data instances, where complete data instances are received one by one from a data stream (e.g., a stream of images whose color histogram can be easily derived). In contrast, a streaming histogram assumes that the elements of a histogram describing an individual data instance are continuously received in arbitrary order from a data stream (e.g., the histogram of customers' visits to a POI), which departs from classical techniques that focus on sketching complete data instances. Therefore, these methods cannot be efficiently applied to sketching streaming histograms.
In this paper, we tackle the similarity-preserving data sketching problem for streaming histograms. Specifically, an efficient similarity-preserving sketching method for streaming histograms should allow for fast and memory-efficient maintenance of the sketches. Fast maintenance requires sketches of streaming histograms to be incrementally updatable. In other words, the new sketch of a streaming histogram should be incrementally computed from the former sketch and the newly arrived element. Moreover, memory-efficient maintenance requires the sketching method to create a small and bounded memory overhead when computing the sketches, which differs from existing sketching methods that require a large set of random variables as in-memory parameters [11], [12], [13], [14], where the size of these parameters is proportional to the cardinality of the histogram elements. In addition, to maintain high-quality similarity-preserving sketches, the following two issues should be considered when measuring similarities.
First, as histogram elements are not all equally important when measuring histogram-based similarity, discriminative similarity should be considered, which refers to the similarity that improves the discriminative capability of some classification/clustering methods [15]. Specifically, in the case of labeled histograms, a histogram element appearing only in the histograms of a specific label has more discriminative capability than one appearing uniformly in all histograms. Taking the example of semantic place labeling, where we want to classify POIs according to their customers' visiting patterns, this means that visits from users with stronger preferences for visiting a specific type of POI are more discriminative. For static datasets, such a discriminative similarity can be easily computed using various feature weighting methods [16] to give a higher weight to more discriminative histogram elements. However, it is not straightforward to incorporate such a discriminative similarity in sketching streaming histograms, where discriminative weights have to be updated over time and, more importantly, incorporated in the sketches.
for streaming histograms to approximate their Discriminative and Dynamic similarity. D2HistoSketch is designed to efficiently maintain a set of compact and fixed-size sketches over streaming histograms to approximate the similarity between the histograms. Specifically, to measure the similarity between histograms, our method focuses on normalized min-max similarity, which has been proven to be an effective similarity measure for nonnegative data in various application domains [11]. To create a sketch from a histogram, we borrow the idea of consistent weighted sampling [20], which was originally proposed for approximating min-max similarity for complete data instances. In addition, we formally derive a memory-efficient sketching method with few in-memory parameters. To efficiently maintain the sketch over the streaming histogram elements, we first adjust the original sketch to seamlessly incorporate both discriminative weights and gradual forgetting weights, and then incrementally compute the new sketch based on the adjusted sketch and the incoming histogram element. Our main contributions can be summarized as follows:
• To the best of our knowledge, this is the first work to consider discriminative and dynamic similarity-preserving sketching over streaming histograms.
• We design an efficient similarity-preserving sketching method for streaming histograms, D2HistoSketch, which allows for fast and memory-efficient maintenance of the sketches, where the sketches can be incrementally updated with a small and bounded memory overhead.
• To provide high-quality similarity-preserving sketches for downstream tasks, D2HistoSketch considers both discriminative similarity, which improves the discriminative capability of the sketches for some classification/clustering methods, and dynamic similarity, which adapts to concept drift by gradually forgetting outdated histogram elements.
• We empirically evaluate our method on multiple classification tasks using both synthetic and real-world
Second,