DATABASE RESEARCH AT COLUMBIA UNIVERSITY

Shih-Fu Chang,∗ Luis Gravano, Gail E. Kaiser, Kenneth A. Ross, Salvatore J. Stolfo
Dept. of Computer Science, Columbia University, New York, NY 10027. http://www.cs.columbia.edu

∗ Dept. of Electrical Engineering.

1 Introduction

Columbia University has a number of projects that touch on database systems issues. In this report, we describe the Columbia Fast Query Project (Section 2), the JAM project (Section 3), the CARDGIS project (Section 4), the Columbia Internet Information Searching Project (Section 5), the Columbia Content-Based Visual Query project (Section 6), and projects associated with Columbia's Programming Systems Laboratory (Section 7).

2 The Columbia Fast Query Project

(Project page: http://www.cs.columbia.edu/~kar/fastqueryproj.html)

Faculty: Ross.

The focus of the Columbia Fast Query Project is to process complex queries efficiently. Ideally, we aim for interactive query response. However, we also aim to improve the performance of noninteractive queries over huge datasets.

2.1 Complex OLAP Query Processing

In [1] we present the notion of "multi-feature queries." Multi-feature queries succinctly express complex queries such as "Find the total sales among minimum-price suppliers of each item." Such queries need multiple views and/or subqueries in standard SQL. We demonstrate significant performance improvements of a specialized evaluation algorithm for multi-feature queries over a commercial system.

In [2] we have developed techniques for recognizing when an arbitrary relational query is amenable to the following kind of evaluation strategy: (a) partition the data according to some attributes, (b) apply a (simpler) query to each partition, and (c) union the results. Such evaluation strategies are particularly effective when the partitioned data is small enough to fit in memory. Our criteria for recognizing such programs are syntactic and, surprisingly, are nonrestrictive in the sense that every query that can be evaluated in this partitioned fashion can be expressed in a way that satisfies our criterion.

In [3], we investigate techniques for evaluating correlated subqueries in SQL. Our techniques apply to all nested subqueries, unlike techniques such as query rewriting that apply only to a limited class of subqueries. The basic idea is to cache the invariant part of the nested subquery between iterations, and to reevaluate just the variant part on each iteration. Integrating this technique into a cost-based optimizer required careful design. Our work was implemented in the Sybase IQ commercial database system, and will be present in their next commercial release.

In [4] we propose techniques for performing a join of two large relations using a join index. Our techniques are novel in that they require only a single pass of each participating relation, even if both tables are much larger than main memory, with all intermediate I/O performed on tuple identifiers. This work is extended in [5] to perform a self-join of a large relation with a single pass through the relation.

A datacube operation computes an aggregate over many different sets of grouping attributes. In particular, for d attributes chosen as possible "dimensions," 2^d aggregates are computed, one for each possible set of dimension attributes. In practice, data is often sparse in the d-dimensional space. We have developed techniques that are particularly efficient for computing the datacube of large, sparse datasets [6]. We have also developed novel algorithms to evaluate complex datacube queries involving multiple dependent aggregates at each granularity [7]. Previous techniques supported only simple aggregates (such as sum, max, etc.) at each granularity.

2.2 Materialized Views

Materialized views precompute and store expressions that can be used as subexpressions of complex queries. By doing work in advance, one can speed up the process of answering interactive queries.

An important practical question is choosing which views to materialize. In [8] we address this problem, demonstrating that it sometimes pays to materialize additional views just to support the maintenance of a given materialized view.

In [9] we look at the problem of view adaptation, namely how one can incrementally modify a materialized view after a change in the view definition.

When one queries multiple materialized views and base tables, one would like to see a single consistent database state. A more general notion of serializability for accessing both base data and materialized views is presented in [10].

Different maintenance policies might be used for performance reasons. For example, deferred maintenance leads to fast base updates but adds overhead to queries; immediate maintenance does the opposite. In [11] we describe how to combine various policies within a single system, and measure the performance of various alternatives.

Maintenance of nested relations within materialized views is described in [12].

2.3 Data Reduction and Visualization

In huge data warehouses it often makes sense to summarize a dataset in order to reduce its size. An approximate answer to a query computed using the summary may be feasible when an exact answer using the full dataset is infeasible. A survey of such data reduction techniques is presented in [13].

Techniques for visualizing large multidimensional datasets are presented in [14]. Our techniques enable one to visually identify, for any pair of dimensions, regions where the two-dimensional distribution is not explainable as the independent combination of one-dimensional distributions.

3 JAM

Faculty: Stolfo.

We address the performance problem associated with Knowledge Discovery in Databases (or data mining) over inherently distributed databases. We study approaches to substantially increase the amount of data a knowledge discovery system can handle effectively. Meta-learning is a general technique we have developed to integrate a number of distinct learning processes executed in a parallel and distributed computing environment. Many other researchers have likewise invented ingenious methods for integrating multiple learned models, as exemplified by stacking, boosting, bagging, weighted voting of mixtures of experts, etc. Several meta-learning strategies we have proposed have been shown to process massive amounts of data that main-memory-based learning algorithms cannot efficiently handle.

We collaborate with the Financial Services Technology Consortium in applying our prototype systems to large problems of interest in the financial and banking industry. One such problem is fraud detection.

Fraudulent electronic transactions are a significant problem, one that will grow in importance as the number of access points in the nation's financial information system grows. Financial institutions today typically develop custom fraud detection systems targeted to their own asset bases. Recently, though, banks have come to realize that a unified, global approach is required, involving the periodic sharing with each other of information about attacks.

We propose another wall to protect the nation's financial systems from threats. This new wall of protection consists of pattern-directed inference systems using models of anomalous or errant transaction behaviors to forewarn of impending threats. This approach requires analysis of large distributed databases of information about transaction behaviors to produce models of "probably fraudulent" transactions.

The key difficulties in this approach are: financial companies don't share their data for a number of (competitive and legal) reasons; the databases that companies maintain on transaction behavior are huge and growing rapidly; real-time analysis is desirable to update models when new events are detected; and distribution of models in a networked environment is essential to maintain up-to-date detection capability.

We propose a novel system to address these difficulties and thereby protect against electronic fraud. Our system has two key component technologies: local fraud detection agents that learn how to detect fraud and provide intrusion detection services within a single corporate information system, and a secure, integrated meta-learning system that combines the collective knowledge acquired by individual local agents.

Once derived local classifier agents or models are produced at some site(s), two or more such agents may be composed into a new classifier agent by a meta-learning agent. The meta-learning system proposed will allow financial institutions to share their models of fraudulent transactions by exchanging classifier agents in a secured agent infrastructure. But they will not need to disclose their proprietary data.

Our publications include [15, 16, 17, 18, 19, 20, 21].

4 CARDGIS

Faculty: Gravano, Kaiser, Ross, Stolfo, and others.

CARDGIS is the acronym for a newly-formed center: The USC/ISI and Columbia University Center for Applied Research in Digital Government Information Systems. The center's mission is research in the
design and development of advanced information systems with capabilities for information workers and the general public to individually and cooperatively generate, share, interact with and effectively utilize knowledge in a national-scale statistical digital library. The center's goals include demonstrating the value of such knowledge sharing when applied to vast and continually growing amounts of statistical data. This will be achieved through the development, deployment and evolution of pilot systems in collaboration with several Federal Statistical agencies, among them the U.S. Bureau of the Census and the Bureau of Labor Statistics, and their broad user communities. Both the Census Bureau and the Bureau of Labor Statistics have committed to participate in and collaborate with CARDGIS research projects.

This project brings together researchers in databases, collaboration technologies and tools, human-computer interaction, knowledge representation, knowledge discovery and data mining, WWW technology, security, and software engineering, drawn from Columbia and USC/ISI, along with social and statistical science experts from Federal, State and Local Statistical agencies. These agencies routinely collect large amounts of statistical data. The project will develop technology that will build upon and incorporate existing systems, including the integration of legacy databases. It will support the development of new information systems with features that provide for sophisticated viewing, analysis and manipulation of data by statisticians, sociologists, policy makers, teachers, students, and the general public, while maintaining data security; secure collaboration among dynamic groups of users intending to achieve purposeful goals such as policy development; and dynamic integration of statistical data with more traditional digital libraries and other information resources in secured application-specific contexts.

CARDGIS research will proceed initially along six key coordinated thrusts: ontology development and merging; integration of information distributed over multiple, heterogeneous legacy sources; human-computer interaction; collaboration technologies to assist dynamic groups of distributed users; secured infrastructures to maintain confidentiality; and distributed data mining and analysis for large statistical sources. A report describing the challenges of moving towards a digital government is available [22].

(Project pages: JAM: http://www.cs.columbia.edu/~sal/JAM/PROJECT/; CARDGIS: http://www.cs.columbia.edu/~sal/DIGGOV/census/index.htm)

5 Internet Information Searching

(http://www.cs.columbia.edu/~gravano)

Faculty: Gravano.

Increasingly, users want to issue complex queries across Internet sources to obtain the data they require. Because of the size of the Internet, it is not possible to process such queries in naive ways, e.g., by accessing all the available sources. Thus, we must process queries in a way that scales with the number of sources. Also, sources vary in the type of information objects they contain and in the interface they present to their users. Some sources contain text documents and support simple query models where a query is just a list of keywords. Other sources contain more structured data and provide query interfaces in the style of relational model interfaces. User queries might require accessing sources supporting radically different interfaces and query models. Thus, we must process queries in a way that deals with heterogeneous sources.

Our past research focused on processing queries over static document sources. As part of this research, we implemented GlOSS, a scalable system that helps users choose the best sources for their queries [23, 24, 25]. One key goal of GlOSS is to scale with large numbers of document sources, for which it keeps succinct descriptions of the source contents. GlOSS needs to extract these content summaries from the sources. Unfortunately, the source contents are often hidden behind search interfaces that vary from source to source. In some cases, even the same organization uses search engines from different vendors to index different internal document collections. To facilitate accessing multiple document sources by making their interfaces more uniform, we coordinated an informal standards effort involving the major search engine vendors and users of their technology. The result of this effort, STARTS (Stanford Protocol Proposal for Internet Retrieval and Search), includes a description of the content summaries that each source should export so that systems like GlOSS can operate [26].

Sources on the Internet offer many types of information: dynamic documents generated on the fly; structured, non-text information with sophisticated query interfaces; resource directories specialized in a certain topic; images; video. To complicate matters further, many Internet sources provide little more than a query interface, and do not export any content summaries or information about their capabilities. Searching for information across general heterogeneous Internet sources thus presents many challenging new problems: from characterizing sources with novel information types, for efficient query processing, to indexing non-traditional resource properties, for helping users get what they really want. We are developing tools to allow users to exploit all the information available on the Internet in meaningful ways, by building on work from several fields, most notably from the database, information retrieval, and natural language processing fields. Combining technologies from different areas is crucial to achieve the goal of letting users access the information that they want easily, without forcing them to be aware of the different interfaces, formats, and models used to represent this information at the originating sources.

6 Content-Based Visual Queries

Faculty: Chang.

The amount of visual information on-line, including images, videos, and graphics, is increasing at a rapid pace. Textual annotations provide direct means for visual searching and retrieval, but are not sufficient. Visual features of the images and video provide additional descriptions of their content. By exploring the synergy between textual and visual features, image search systems can be greatly improved.

There has been substantial progress in developing powerful tools which allow users to specify image queries by giving examples, drawing sketches, selecting visual features (e.g., color and texture), and arranging spatial-temporal structure of features [27]. Much success has been achieved, particularly in specific domains, such as sports, remote sensing, and medical applications.

New challenges remain in applying the above content-based image search tools to meet real user needs. Our experience indicates that use of the image search systems varies greatly, depending on the user background and the task. One unique aspect of image search systems is the active role played by users. By modeling the users and learning from them in the search process, we can better adapt to the users' subjectivity. We describe our current research directions in the following.

Although today's computer vision systems cannot recognize high-level objects in unconstrained images, we are finding that low-level visual features can be used to partially characterize image content. Local region features (such as color, texture, face, contour, motion) and their spatial/temporal relationships are being used in several innovative image searching systems [28, 29]. We argue that the automated segmentation of images/video objects does not need to accurately identify real-world objects contained in the imagery. One objective is to extract "salient" visual features and index them with efficient data structures for powerful querying. To link the features to high-level semantics, we are also investigating (semi-)automated systems for extraction and recognition of semantic objects (e.g., people, cars) using user-defined models.

Ideal image representations should be amenable to dynamic feature extraction and indexing. Today's compression standards (such as JPEG, MPEG-2) are not suited to this need. The objective in the design of these compression standards was to reduce bandwidth and increase subjective quality. Although many interesting analysis and manipulation tasks can still be achieved in today's compression formats [30], the potential functionalities of image search and retrieval were not considered. Recent trends in compression, such as MPEG-4 and object-based video, have shown interest and promise in this direction. One of our goals is to develop systems in which the video objects are extracted, then encoded, transmitted, manipulated, and indexed flexibly with efficient adaptation to users' preferences and system conditions.

Another direction is to break the barrier of decoding image semantic content by using user interaction and domain knowledge. These systems learn from the users' input as to how the low-level visual features are to be used in the matching of images at the semantic level. For example, the system may model the cases in which low-level feature search tools are successful in finding the images with the desired semantic content. We have developed a unique concept called Semantic Visual Templates [31], which use a two-way interaction system to find a small subset of graphic icons representing the semantic concepts (e.g., sunsets or high jumpers). In an environment including distributed, federated search engines, a meta-search system may monitor users' preferences for different feature models and search tools and then recommend search options to help users improve search efficiency [32].

Exploring the association of visual features with other multimedia features, such as text, speech, and audio, provides another fruitful direction. Video often has text transcripts and audio that may also be analyzed, indexed, and searched. Images on the World Wide Web typically have text associated with them. In [33], we have shown significant performance improvement in searching Web visual content by using integrated multimedia features.

7 The Programming Systems Lab

Faculty: Kaiser.

The Programming Systems Lab at Columbia University is currently working on two database-oriented projects: Pern and Xanth.
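The meta-learning idea of Section 3, training classifiers on separate data partitions (as if held at different sites) and combining their predictions at a higher level, can be illustrated with a small sketch. This is not the JAM system itself: the toy threshold "learners" and the majority-vote combiner below are illustrative stand-ins for the real base learning algorithms and learned meta-classifiers.

```python
from collections import Counter

def train_base_classifier(partition):
    """Learn a one-dimensional threshold rule from one data partition.

    Each example is (value, label); the resulting "classifier" predicts 1
    when the value exceeds the midpoint between the two class means --
    a toy stand-in for a real learning algorithm run locally at one site.
    """
    pos = [v for v, y in partition if y == 1]
    neg = [v for v, y in partition if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda v: 1 if v > threshold else 0

def meta_combine(classifiers, v):
    """Combine base predictions by majority vote (a simple meta-learner)."""
    votes = Counter(c(v) for c in classifiers)
    return votes.most_common(1)[0][0]

# Three partitions, as if owned by three different institutions;
# only the trained agents, never the raw data, are exchanged.
partitions = [
    [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)],
    [(0.0, 0), (0.3, 0), (0.7, 1), (1.0, 1)],
    [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)],
]
agents = [train_base_classifier(p) for p in partitions]
print(meta_combine(agents, 0.9))  # prints 1
```

The point of the exchange-the-agents design is visible even in this sketch: `meta_combine` needs only the trained classifiers, so each partition's raw examples can stay at their originating site.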

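Pern's extension mechanism, described in Section 7.1 below (standard transaction operations whose semantics can be modified by attaching before and after plugins, with conventional locking as the default), can be sketched roughly as follows. The class and method names here are hypothetical, not Pern's actual API: Pern is a C component, later reimplemented in Java as JPernLite.

```python
class TransactionManager:
    """Toy transaction manager with pluggable operation semantics.

    Each operation (here lock, unlock, commit) runs any registered
    "before" plugins, then the default behavior, then any "after"
    plugins -- a rough sketch of Pern-style extensibility.
    """

    def __init__(self):
        self.locks = {}    # object id -> transaction id holding the lock
        self.before = {}   # op name -> list of plugin callables
        self.after = {}

    def add_plugin(self, op, when, fn):
        table = self.before if when == "before" else self.after
        table.setdefault(op, []).append(fn)

    def _run(self, op, when, *args):
        table = self.before if when == "before" else self.after
        for fn in table.get(op, []):
            fn(*args)

    def lock(self, txn, obj):
        self._run("lock", "before", txn, obj)
        holder = self.locks.get(obj)
        if holder is not None and holder != txn:
            raise RuntimeError(f"conflict: {obj} held by {holder}")
        self.locks[obj] = txn  # default: exclusive locking
        self._run("lock", "after", txn, obj)

    def unlock(self, txn, obj):
        if self.locks.get(obj) == txn:
            del self.locks[obj]

    def commit(self, txn):
        self._run("commit", "before", txn)
        # Strict two-phase locking: all locks released only at commit.
        for obj in [o for o, t in self.locks.items() if t == txn]:
            self.unlock(txn, obj)
        self._run("commit", "after", txn)

log = []
tm = TransactionManager()
tm.add_plugin("lock", "after", lambda txn, obj: log.append((txn, obj)))
tm.lock("T1", "record42")
tm.commit("T1")
print(log)  # prints [('T1', 'record42')]
```

An extended transaction model would be realized by replacing the conflict test in `lock` or by plugins that relax or augment it; by default, the manager behaves as a conventional exclusive-lock, two-phase scheduler.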
(Project pages: Content-Based Visual Queries: http://www.ctr.columbia.edu/~sfchang/vis-project; Programming Systems Lab: http://www.psl.cs.columbia.edu/)

7.1 Pern

The Pern project is concerned with programmable transaction managers that can realize a wide range of human-oriented extended transaction models (ETMs) and, more recently, with application of such transaction management services in conjunction with emerging distributed computing protocol standards such as CORBA and HTTP.

An earlier prototype, called simply Pern [34], operated as a component (written in C) that can be linked into or externally added on to a system that requires either the conventional ACID properties or a special-purpose extended transaction model. Pern's lock conflict table is replaceable, and its standard transaction operations (begin, commit, abort, lock, unlock) can be associated with before and after plugins to modify or augment the semantics of the operations, by default implemented through the usual two-phase locking. Pern has been added on externally to Cap Gemini's ProcessWeaver workflow product and linked into our own Oz software development environment, and has also been used in tandem with the conventional transaction manager built into GIE Emeraude's PCTE environment framework product. The Cord [35] extended transaction modeling language and corresponding interpreter provide means for defining rules for resolving concurrency control conflicts as they arise. Cord has been used to implement altruistic locking and epsilon serializability, and has been integrated with U. Wisconsin's Exodus as well as with Pern. The current version of Pern, called JPernLite [36], has been redesigned and is now implemented in Java. We are currently developing JCord, which will expand on Cord to support dynamic construction and modification of ETMs on the fly from rules received from clients, and negotiation of conflicts among ETMs for different applications coincidentally accessing the same data served by the same or different instances of JPernLite.

7.2 Xanth

Xanth addresses integration of heterogeneous information resources from the viewpoint of accomplishing useful work, particularly in a team context. Our goal is to develop a middleware infrastructure for group information spaces, or groupspaces, in which workflow and transaction collaboration services, and the distributed query, information retrieval, data mining, etc. services being developed by other Columbia database projects, can operate.

The Xanth prototype in particular operates as a middlebase [37], imposing hypermedia object management on top of external data residing in any combination of conventional databases, websites, proprietary repositories, etc. dispersed over the Internet or organizational intranet. One key concept of a middlebase is that not only are the information repositories heterogeneous, but so are the protocols used to access these repositories. Further, the presentation clients and protocols for communicating with them may be similarly heterogeneous, and legacy, like the data sources constructed without the potential for sharable information spaces in mind.

The Xanth project is currently working on groupviews, team-oriented user interface metaphors for collaboration environments like those supported by Xanth. One experiment combines VRML with MUDs, with information subspaces mapped to platforms in the New York City subway system and movement between subspaces mapped to trips on the subway trains.

References

[1] D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.
[2] D. Chatziantoniou and K. A. Ross. Groupwise processing of relational queries. In VLDB Conference, pages 476-485, 1997.
[3] J. Rao and K. A. Ross. Reusing invariants: A new strategy for correlated queries. In ACM SIGMOD Conference on Management of Data, June 1998.
[4] Z. Li and K. Ross. Fast joins using join indices. VLDB Journal, 1998. (to appear).
[5] H. Lei and K. A. Ross. Faster joins, self-joins and multi-way joins using join indices. In Proceedings of the Workshop on Next Generation Information Technologies and Systems, pages 89-101, 1997. Full version to appear in Data and Knowledge Engineering.
[6] K. A. Ross and D. Srivastava. Fast computation of sparse datacubes. In International Conference on Very Large Databases, pages 116-125, August 1997.
[7] K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. In Extending Database Technology, pages 263-277, 1998.
[8] K. A. Ross, D. Srivastava, and S. Sudarshan. Materialized view maintenance and integrity constraint checking: Trading space for time. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 447-458, June 1996.
[9] A. Gupta, I. S. Mumick, and K. A. Ross. Adapting materialized views after redefinitions. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 211-222, 1995.
[10] A. Kawaguchi, D. Lieuwen, I. S. Mumick, D. Quass, and K. A. Ross. Concurrency control theory for deferred materialized views. In Proceedings of the International Conference on Database Theory, pages 306-320, 1997.
[11] L. Colby, A. Kawaguchi, D. Lieuwen, I. S. Mumick, and K. A. Ross. Supporting multiple view maintenance policies. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 405-416, May 1997.
[12] A. Kawaguchi, D. Lieuwen, I. S. Mumick, and K. A. Ross. Implementing incremental view maintenance in nested data models. In Proceedings of the Workshop on Database Programming Languages, 1997.
[13] D. Barbara et al. The New Jersey data reduction report. Data Engineering Bulletin, 20(4):3-45, 1997.
[14] S. Berchtold, H. V. Jagadish, and K. A. Ross. Independence diagrams: A technique for visual data mining. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.
[15] W. Lee, S. J. Stolfo, and K. Mok. Mining audit data to build intrusion detection models. In KDD98, 1998.
[16] P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD98, 1998.
[17] A. L. Prodromidis and S. J. Stolfo. Mining databases with different schemas: Integrating incompatible classifiers. In KDD98, 1998.
[18] W. Lee and S. Stolfo. Data mining approaches for intrusion detection. In Proceedings 7th USENIX Security Symposium, 1998.
[19] S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, D. Fan, and P. Chan. JAM: Java agents for meta-learning over distributed databases. In KDD97 and AAAI97 Workshop on AI Methods in Fraud and Risk Management, 1997.
[20] P. Chan and S. Stolfo. Sharing learned models among remote database partitions by local meta-learning. In KDD96, 1996.
[21] P. Chan and S. Stolfo. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Integration of Information.
[22] H. Schorr and S. Stolfo. Towards the digital government of the 21st century. Report of the Workshop on Research and Development Opportunities in Federal Information Services, 1997. http://www.isi.edu/nsf/final.html.
[23] L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text-database discovery problem. In Proceedings of the 1994 ACM International Conference on Management of Data (SIGMOD'94), May 1994.
[24] L. Gravano and H. García-Molina. Generalizing GlOSS for vector-space databases and broker hierarchies. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 78-89, September 1995.
[25] L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems (PDIS'94), September 1994.
[26] L. Gravano, C. K. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for Internet meta-searching. In Proceedings of the 1997 ACM International Conference on Management of Data (SIGMOD'97), May 1997.
[27] S.-F. Chang, J. R. Smith, M. Beigi, and A. Benitez. Visual information retrieval from large distributed on-line repositories. Communications of the ACM, Special Issue on Visual Information Management, 40(12):63-71, Dec. 1997.
[28] S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong. VideoQ: An automatic content-based video search system using visual cues. In ACM Multimedia Conference, 1997. Demo: http://www.ctr.columbia.edu/VideoQ.
[29] J. R. Smith and S.-F. Chang. VisualSEEk: A fully automated content-based image query system. In ACM Multimedia Conference, 1996. Demo: http://www.ctr.columbia.edu/VisualSEEk.
[30] J. Meng and S.-F. Chang. CVEPS: A compressed video editing and parsing system. In ACM Multimedia Conference, 1996. Demo: http://www.ctr.columbia.edu/WebClip.
[31] S.-F. Chang, W. Chen, and H. Sundaram. Semantic visual templates: Linking visual features to semantics. In IEEE International Conference on Image Processing, 1998.
[32] M. Beigi, A. Benitez, and S.-F. Chang. MetaSEEk: A content-based meta search engine for images. In SPIE Conference on Storage and Retrieval for Image and Video Databases, 1998. Demo: http://www.ctr.columbia.edu/metaseek.
[33] J. R. Smith and S.-F. Chang. Visually searching the web for content. IEEE Multimedia Magazine, 4(3):12-20, 1997. Demo: http://www.ctr.columbia.edu/webseek.
[34] G. T. Heineman and G. E. Kaiser. An architecture for integrating concurrency control into environment frameworks. In 17th International Conference on Software Engineering, 1995.
[35] G. T. Heineman and G. E. Kaiser. The CORD approach to extensible concurrency control. In 13th International Conference on Data Engineering, 1997.
[36] J. J. Yang and G. E. Kaiser. JPernLite: An extensible transaction server for the world wide web. In 9th ACM Conference on Hypermedia and Hypertext, 1998.
[37] G. E. Kaiser and S. E. Dossick. Workgroup middleware for distributed projects. In IEEE 7th International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 1998.