Composite Subset Measures
Total Page:16
File Type:pdf, Size:1020Kb
Composite Subset Measures Lei Chen1, Raghu Ramakrishnan1,2, Paul Barford1, Bee-Chung Chen1, Vinod Yegneswaran1 1 Computer Sciences Department, University of Wisconsin, Madison, WI, USA 2 Yahoo! Research, Santa Clara, CA, USA {chenl, pb, beechung, vinod}@cs.wisc.edu [email protected] ABSTRACT into a hierarchy, leading to a very natural notion of multidimen- Measures are numeric summaries of a collection of data records sional regions. Each region represents a data subset. Summarizing produced by applying aggregation functions. Summarizing a col- the records that belong to a region by applying an aggregate opera- lection of subsets of a large dataset, by computing a measure for tor, such as SUM, to a measure attribute (thereby computing a new each subset in the (typically, user-specified) collection is a funda- measure that is associated with the entire region) is a fundamental mental problem. The multidimensional data model, which treats operation in OLAP. records as points in a space defined by dimension attributes, offers Often, however, more sophisticated analysis can be carried out a natural space of data subsets to be considered as summarization based on the computed measures, e.g., identifying regions with candidates, and traditional SQL and OLAP constructs, such as abnormally high measures. For such analyses, it is necessary to GROUP BY and CUBE, allow us to compute measures for subsets compute the measure for a region by also considering other re- drawn from this space. However, GROUP BY only allows us to gions (which are, intuitively, “related” to the given region) and summarize a limited collection of subsets, and CUBE summarizes their measures. In this paper, we introduce composite subset meas- all subsets in this space. Further, they restrict the measure used to ures, which differ from the traditional GROUP BY and CUBE summarize a data subset to be a one-step aggregation, using func- approaches in three ways: tions such as SUM, of field-values in the data records. 1) The measures for a region can be computed as the summaries In this paper, we introduce composite subset measures, computed of other “related” regions in a compositional manner. The re- by aggregating not only data records but also the measures of other lationships capture various types of proximity in related subsets. We allow summarization of naturally related re- multidimensional space. gions in the multidimensional space, offering more flexibility than 2) In contrast to the CUBE construct, we do not offer a way to either GROUP BY or CUBE in the choice of what data subsets to compute the summary of every region; this is typically over- summarize. Thus, our framework allows more meaningful sum- kill for the kinds of complex measures we seek to compute. maries to be computed for a targeted collection of data subsets. 3) The language and algebra are carefully designed to enable We propose an algebra called AW-RA and an equivalent pictorial highly scalable, parallelizable, and distributed evaluation language called aggregation workflows. Aggregation workflows strategies based on streaming the data in one or more passes, allow for intuitive expression of composite measure queries, and possibly with interleaved sorts. the underlying algebra is designed to facilitate efficient multi-scan execution. We describe an evaluation framework based on multi- This study is motivated by our ongoing work in two different ap- ple passes of sorting and scanning over the original dataset. In plication domains: environmental monitoring [18] and analysis of each pass, several measures are evaluated simultaneously, and network traffic data [7]. Similar problems have been faced by dependencies between these measures and containment relation- researchers dealing with data at Internet companies such as Google ships between the underlying subsets of data are orchestrated to and Yahoo!, leading them to also develop systems for scalable reduce the memory footprint of the computation. We present a aggregation of tabular data [10,17,22]. In contrast to the proposal performance evaluation that demonstrates the benefits of our ap- in [22], we explore a declarative, query-style approach in the spirit proach. of OLAP constructs such as [4, 14, 19,20, 26]. Further, the focus of [10] is a highly parallelizable, distributed evaluation framework 1. INTRODUCTION for simple one-step aggregation queries. This is an issue that we do not tackle in this paper, but such highly distributed computation In the multidimensional model of data, records in a central fact has been a strong consideration in the design of our language, and table are viewed as points in a multidimensional space. Attributes we intend to address it in future work. In keeping with this goal, are divided into dimension attributes, which are the coordinates of we have avoided implementation choices that require us to assign the data point, and measure attributes, which are values associated unique ids to records or to maintain indexes over (potentially with points. The domain of values for each dimension is organized widely distributed) tables, and focused on evaluation strategies Permission to copy without fee all or part of this material is granted pro- that orchestrate aggregation steps across one or more scans of data vided that the copies are not made or distributed for direct commercial (partitions). advantage, the VLDB copyright notice and the title of the publication and Consider the following network traffic analysis example. Self- its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to propagating worm outbreaks have the potential to wreak havoc in post on servers or to redistribute to lists, requires a fee and/or special per- the Internet, and can take place on a variety of time scales. We can mission from the publisher, ACM. potentially identify new outbreaks based on the escalation of the VLDB ‘06, September 12–15, 2006, Seoul, Korea. traffic into a network from one time period to the next. This kind Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09 of escalation, which is defined on a per-time period and sub- 403 network basis, is a composite measure built on the traffic measures attack packets identified in each network. Table 1 lists the attrib- for two adjacent time periods. utes used in our examples. When used to compute composite measures, existing tools, such as The schema of a multidimensional dataset with d dimension relation algebra or multidimensional query languages, frequently attributes has a dimension vector X = (X1, X2,…, Xd), and possibly result in nested expressions that are hard for human analysts to additional measure attributes. While there are no explicit measure understand and for the processing system to optimize. Further, attributes in the Dshield dataset, they are typical in multidimen- their use requires us to import data into a DBMS, which can itself sional datasets. Each record r in is denoted as a tuple of be a challenge for very large datasets [10]. Our goal is to develop a dimension values followed by measure values, if any: r = standalone, lightweight yet highly scalable analysis system that (x ,x ,…, x ,m,… ), where x is the value for dimension attribute supports composite measure queries. 1 2 d 1 i Xi. In the Dshield dataset, X = (Time, Source, Target, TargetPort), Our contributions are as follows: which is abbreviated as X ={t,U,T,P}. 1) We propose a pictorial language (called aggregation work- 2.1 Domains and Domain Hierarchies flows) and algebra for expressing composite aggregation A base domain D (X ) is associated with each dimension attrib- queries and representing streaming plans for such queries. base i ute Xi. For example, the base domain for attributes Source and 2) We show that any query in our language can be expressed in Target is 4-byte integers, and the base domain for Time is the our algebra, and present a comprehensive framework for number of seconds since the first second of 1970 (UNIX time). highly scalable, scan-based evaluation in one or more passes. A base domain can be generalized. For example, we can We show how to exploit sorting between passes and orches- generalize the base domain of Source IP into the /24 subnet tration of dependencies between different aggregation steps. domain (256 contiguous IP addresses). Each value in this domain 3) We present an evaluation that demonstrates the potential of is a 3-byte integer representing one /24 subset. Given two domains our methods for significant performance gains over tradi- Di and Dj, Di <D Dj indicates that Dj is a domain generalization of tional relational approaches. Di; we also say that Di is more specific than Dj. This work is a first step in our broader research agenda to develop All the domains associated with a given dimension attribute form a efficient, streamlined tools for domain specialists to mine large, domain generalization hierarchy, which is a directed acyclic graph complex datasets. The complete, technical report version of this (DAG). Each node in this graph represents one domain. The rela- paper [8] discusses the optimization problem of finding good tionship <D defines a partial order in the graph. Di<DDj, if there is multi-pass streaming plans and describes a greedy optimizer. an arc chain from Dj to Di. A domain hierarchy is linear when the Beyond that, the approach offers potentially unlimited parallelism domain hierarchy graph forms a single path. For any dimension and ability to distribute computation, but our current implementa- attribute, there is a special domain called DALL with a single value tion does not take advantage of these opportunities. ALL, which is the generalization of all possible values for the The rest of the paper is organized as follows. In Section 2, we given dimension.