DATABASE RESEARCH AT COLUMBIA UNIVERSITY

Shih-Fu Chang,∗ Luis Gravano, Gail E. Kaiser, Kenneth A. Ross, Salvatore J. Stolfo
Dept. of Computer Science, Columbia University, New York, NY 10027. http://www.cs.columbia.edu

∗ Dept. of Electrical Engineering.

1 Introduction

Columbia University has a number of projects that touch on database systems issues. In this report, we describe the Columbia Fast Query Project (Section 2), the JAM project (Section 3), the CARDGIS project (Section 4), the Columbia Internet Information Searching Project (Section 5), the Columbia Content-Based Visual Query project (Section 6), and projects associated with Columbia's Programming Systems Laboratory (Section 7).

2 The Columbia Fast Query Project

(Project page: http://www.cs.columbia.edu/~kar/fastqueryproj.html)

Faculty: Ross.

The focus of the Columbia Fast Query Project is to process complex queries efficiently. Ideally, we aim for interactive query response. However, we also aim to improve the performance of noninteractive queries over huge datasets.

2.1 Complex OLAP Query Processing

In [1] we present the notion of "multi-feature queries." Multi-feature queries succinctly express complex queries such as "Find the total sales among minimum-price suppliers of each item." Such queries need multiple views and/or subqueries in standard SQL. We demonstrate significant performance improvements of a specialized evaluation algorithm for multi-feature queries over a commercial system.

In [2] we have developed techniques for recognizing when an arbitrary relational query is amenable to the following kind of evaluation strategy: (a) partition the data according to some attributes, (b) apply a (simpler) query to each partition, and (c) union the results. Such evaluation strategies are particularly effective when the partitioned data is small enough to fit in memory. Our criteria for recognizing such programs are syntactic and, surprisingly, are nonrestrictive in the sense that every query that can be evaluated in this partitioned fashion can be expressed in a way that satisfies our criterion.

In [3], we investigate techniques for evaluating correlated subqueries in SQL. Our techniques apply to all nested subqueries, unlike techniques such as query rewriting that apply only to a limited class of subqueries. The basic idea is to cache the invariant part of the nested subquery between iterations, and to reevaluate just the variant part on each iteration. Integrating this technique into a cost-based optimizer required careful design. Our work was implemented in the Sybase IQ commercial database system, and will be present in their next commercial release.

In [4] we propose techniques for performing a join of two large relations using a join index. Our techniques are novel in that they require only a single pass of each participating relation, even if both tables are much larger than main memory, with all intermediate I/O performed on tuple identifiers. This work is extended in [5] to perform a self-join of a large relation with a single pass through the relation.

A datacube operation computes an aggregate over many different sets of grouping attributes. In particular, for d attributes chosen as possible "dimensions," 2^d aggregates are computed, one for each possible set of dimension attributes. In practice, data is often sparse in the d-dimensional space. We have developed techniques that are particularly efficient for computing the datacube of large, sparse datasets [6]. We have also developed novel algorithms to evaluate complex datacube queries involving multiple dependent aggregates at each granularity [7]. Previous techniques supported only simple aggregates (such as sum, max, etc.) at each granularity.

2.2 Materialized Views

Materialized views precompute and store expressions that can be used as subexpressions of complex queries. By doing work in advance, one can speed up the process of answering interactive queries.

An important practical question is choosing which views to materialize. In [8] we address this problem, demonstrating that it sometimes pays to materialize additional views just to support the maintenance of a given materialized view.

In [9] we look at the problem of view adaptation, namely how one can incrementally modify a materialized view after a change in the view definition.

When one queries multiple materialized views and base tables, one would like to see a single consistent database state. A more general notion of serializability for accessing both base data and materialized views is presented in [10].

Different maintenance policies might be used for performance reasons. For example, deferred maintenance leads to fast base updates but adds overhead to queries; immediate maintenance does the opposite. In [11] we describe how to combine various policies within a single system, and measure the performance of various alternatives.

Maintenance of nested relations within materialized views is described in [12].

2.3 Data Reduction and Visualization

In huge data warehouses it often makes sense to summarize a dataset in order to reduce its size. An approximate answer to a query computed using the summary may be feasible when an exact answer using the full dataset is infeasible. A survey of such data reduction techniques is presented in [13].

Techniques for visualizing large multidimensional datasets are presented in [14]. Our techniques enable one to visually identify, for any pair of dimensions, regions where the two-dimensional distribution is not explainable as the independent combination of one-dimensional distributions.

3 JAM

Faculty: Stolfo.

We address the performance problem associated with Knowledge Discovery in Databases (or data mining) over inherently distributed databases. We study approaches to substantially increase the amount of data a knowledge discovery system can handle effectively. Meta-learning is a general technique we have developed to integrate a number of distinct learning processes executed in a parallel and distributed computing environment. Many other researchers have likewise invented ingenious methods for integrating multiple learned models, as exemplified by stacking, boosting, bagging, weighted voting of mixtures of experts, etc. Several meta-learning strategies we have proposed have been shown to process massive amounts of data that main-memory-based learning algorithms cannot efficiently handle.

We collaborate with the Financial Services Technology Consortium in applying our prototype systems to large problems of interest in the financial and banking industry. One such problem is fraud detection.

Fraudulent electronic transactions are a significant problem, one that will grow in importance as the number of access points in the nation's financial information system grows. Financial institutions today typically develop custom fraud detection systems targeted to their own asset bases. Recently, though, banks have come to realize that a unified, global approach is required, involving the periodic sharing with each other of information about attacks.

We propose another wall to protect the nation's financial systems from threats. This new wall of protection consists of pattern-directed inference systems using models of anomalous or errant transaction behaviors to forewarn of impending threats. This approach requires analysis of large distributed databases of information about transaction behaviors to produce models of "probably fraudulent" transactions.

The key difficulties in this approach are: financial companies don't share their data for a number of (competitive and legal) reasons; the databases that companies maintain on transaction behavior are huge and growing rapidly; real-time analysis is desirable to update models when new events are detected; and distribution of models in a networked environment is essential to maintain up-to-date detection capability.

We propose a novel system to address these difficulties and thereby protect against electronic fraud. Our system has two key component technologies: local fraud detection agents that learn how to detect fraud and provide intrusion detection services within a single corporate information system, and a secure, integrated meta-learning system that combines the collective knowledge acquired by individual local agents.

Once derived local classifier agents or models are produced at some site(s), two or more such agents may be composed into a new classifier agent by a meta-learning agent. The meta-learning system proposed will allow financial institutions to share their models of fraudulent transactions by exchanging classifier agents in a secured agent infrastructure. But they will not need to disclose their proprietary data.

Our publications include [15, 16, 17, 18, 19, 20, 21].

4 CARDGIS

Faculty: Gravano, Kaiser, Ross, Stolfo, and others.

CARDGIS is the acronym for a newly-formed center: The USC/ISI and Columbia University Center for Applied Research in Digital Government Information Systems. The center's mission is research in the
design and development of advanced information systems with capabilities for information workers and the general public to individually and cooperatively generate, share, interact with and effectively utilize knowledge in a national-scale statistical digital library. The center's goals include demonstrating the value of such knowledge sharing when applied to vast and continually growing amounts of statistical data. This will be achieved through the development, deployment and evolution of pilot systems in collaboration with several Federal Statistical agencies, among them the U.S. Bureau of the Census and the Bureau of Labor Statistics, and their broad user communities. Both the Census Bureau and the Bureau of Labor Statistics have committed to participate in and collaborate with CARDGIS research projects.

This project brings together researchers in databases, collaboration technologies and tools, human-computer interaction, knowledge representation, knowledge discovery and data mining, WWW technology, security, and software engineering, drawn from Columbia and USC/ISI, along with social and statistical science experts from Federal, State and Local Statistical agencies. These agencies routinely collect large amounts of statistical data. The project will develop technology that will build upon and incorporate existing systems, including the integration of legacy databases. It will support the development of new information systems with features that provide for sophisticated viewing, analysis and manipulation of data by statisticians, sociologists, policy makers, teachers, students, and the general public, while maintaining data security; secure collaboration among dynamic groups of users intending to achieve purposeful goals such as policy development; and dynamic integration of statistical data with more traditional digital libraries and other information resources in secured application-specific contexts.

CARDGIS research will proceed initially along six key coordinated thrusts: ontology development and merging; integration of information distributed over multiple, heterogeneous legacy sources; human-computer interaction; collaboration technologies to assist dynamic groups of distributed users; secured infrastructures to maintain confidentiality; and distributed data mining and analysis for large statistical sources. A report describing the challenges of moving towards a digital government is available [22].

(Project pages: JAM: http://www.cs.columbia.edu/~sal/JAM/PROJECT/; CARDGIS: http://www.cs.columbia.edu/~sal/DIGGOV/census/index.htm)

5 Internet Information Searching

(http://www.cs.columbia.edu/~gravano)

Faculty: Gravano.

Increasingly, users want to issue complex queries across Internet sources to obtain the data they require. Because of the size of the Internet, it is not possible to process such queries in naive ways, e.g., by accessing all the available sources. Thus, we must process queries in a way that scales with the number of sources. Also, sources vary in the type of information objects they contain and in the interface they present to their users. Some sources contain text documents and support simple query models where a query is just a list of keywords. Other sources contain more structured data and provide query interfaces in the style of relational model interfaces. User queries might require accessing sources supporting radically different interfaces and query models. Thus, we must process queries in a way that deals with heterogeneous sources.

Our past research focused on processing queries over static document sources. As part of this research, we implemented GlOSS, a scalable system that helps users choose the best sources for their queries [23, 24, 25]. One key goal of GlOSS is to scale with large numbers of document sources, for which it keeps succinct descriptions of the source contents. GlOSS needs to extract these content summaries from the sources. Unfortunately, the source contents are often hidden behind search interfaces that vary from source to source. In some cases, even the same organization uses search engines from different vendors to index different internal document collections. To facilitate accessing multiple document sources by making their interfaces more uniform, we coordinated an informal standards effort involving the major search engine vendors and users of their technology. The result of this effort, STARTS (Stanford Protocol Proposal for Internet Retrieval and Search), includes a description of the content summaries that each source should export so that systems like GlOSS can operate [26].

Sources on the Internet offer many types of information: dynamic documents generated on the fly; structured, non-text information with sophisticated query interfaces; resource directories specialized in a certain topic; images; video. To complicate matters further, many Internet sources provide little more than a query interface, and do not export any content summaries or information about their capabilities. Searching for information across general heterogeneous Internet sources thus presents many challenging new problems: from characterizing sources with novel information types, for efficient query processing, to indexing non-traditional resource properties, for helping users get what they really want. We are developing tools to allow users to exploit all the information available on the Internet in meaningful ways, by building on work from several fields, most notably from the database, information retrieval, and natural language processing fields. Combining technologies from different areas is crucial to achieve the goal of letting users access the information that they want easily, without forcing them to be aware of the different interfaces, formats, and models used to represent this information at the originating sources.

6 Content-Based Visual Queries

Faculty: Chang.

The amount of visual information on-line, including images, videos, and graphics, is increasing at a rapid pace. Textual annotations provide direct means for visual searching and retrieval, but are not sufficient. Visual features of the images and video provide additional descriptions of their content. By exploring the synergy between textual and visual features, image search systems can be greatly improved.

There has been substantial progress in developing powerful tools which allow users to specify image queries by giving examples, drawing sketches, selecting visual features (e.g., color and texture), and arranging spatial-temporal structure of features [27]. Much success has been achieved, particularly in specific domains, such as sports, remote sensing, and medical applications.

New challenges remain in applying the above content-based image search tools to meet real user needs. Our experience indicates that use of the image search systems varies greatly, depending on the user background and the task. One unique aspect of image search systems is the active role played by users. By modeling the users and learning from them in the search process, we can better adapt to the users' subjectivity. We describe our current research directions in the following.

Although today's computer vision systems cannot recognize high-level objects in unconstrained images, we are finding that low-level visual features can be used to partially characterize image content. Local region features (such as color, texture, face, contour, motion) and their spatial/temporal relationships are being used in several innovative image searching systems [28, 29]. We argue that the automated segmentation of images/video objects does not need to accurately identify real-world objects contained in the imagery. One objective is to extract "salient" visual features and index them with efficient data structures for powerful querying. To link the features to high-level semantics, we are also investigating (semi-)automated systems for extraction and recognition of semantic objects (e.g., people, cars) using user-defined models.

Ideal image representations should be amenable to dynamic feature extraction and indexing. Today's compression standards (such as JPEG, MPEG-2) are not suited to this need. The objective in the design of these compression standards was to reduce bandwidth and increase subjective quality. Although many interesting analysis and manipulation tasks can still be achieved in today's compression formats [30], the potential functionalities of image search and retrieval were not considered. Recent trends in compression, such as MPEG-4 and object-based video, have shown interest and promise in this direction. One of our goals is to develop systems in which the video objects are extracted, then encoded, transmitted, manipulated, and indexed flexibly with efficient adaptation to users' preferences and system conditions.

Another direction is to break the barrier of decoding image semantic content by using user interaction and domain knowledge. These systems learn from the users' input as to how the low-level visual features are to be used in the matching of images at the semantic level. For example, the system may model the cases in which low-level feature search tools are successful in finding the images with the desired semantic content. We have developed a unique concept called Semantic Visual Templates [31], which use a two-way interaction system to find a small subset of graphic icons representing the semantic concepts (e.g., sunsets or high jumpers). In an environment including distributed, federated search engines, a meta-search system may monitor users' preferences for different feature models and search tools and then recommend search options to help users improve search efficiency [32].

Exploring the association of visual features with other multimedia features, such as text, speech, and audio, provides another fruitful direction. Video often has text transcripts and audio that may also be analyzed, indexed, and searched. Images on the World Wide Web typically have text associated with them. In [33], we have shown significant performance improvement in searching Web visual content by using integrated multimedia features.

7 The Programming Systems Lab

Faculty: Kaiser.

The Programming Systems Lab at Columbia University is currently working on two database-oriented projects: Pern and Xanth.
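The meta-learning idea of Section 3, training classifiers on separate data partitions (as if held at different sites) and combining their predictions at a higher level, can be illustrated with a small sketch. This is not the JAM system itself: the toy threshold "learners" and the majority-vote combiner below are illustrative stand-ins for the real base learning algorithms and learned meta-classifiers.

```python
from collections import Counter

def train_base_classifier(partition):
    """Learn a one-dimensional threshold rule from one data partition.

    Each example is (value, label); the resulting "classifier" predicts 1
    when the value exceeds the midpoint between the two class means --
    a toy stand-in for a real learning algorithm run locally at one site.
    """
    pos = [v for v, y in partition if y == 1]
    neg = [v for v, y in partition if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda v: 1 if v > threshold else 0

def meta_combine(classifiers, v):
    """Combine base predictions by majority vote (a simple meta-learner)."""
    votes = Counter(c(v) for c in classifiers)
    return votes.most_common(1)[0][0]

# Three partitions, as if owned by three different institutions;
# only the trained agents, never the raw data, are exchanged.
partitions = [
    [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)],
    [(0.0, 0), (0.3, 0), (0.7, 1), (1.0, 1)],
    [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)],
]
agents = [train_base_classifier(p) for p in partitions]
print(meta_combine(agents, 0.9))  # prints 1
```

The point of the exchange-the-agents design is visible even in this sketch: `meta_combine` needs only the trained classifiers, so each partition's raw examples can stay at their originating site.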

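Pern's extension mechanism, described in Section 7.1 below (standard transaction operations whose semantics can be modified by attaching before and after plugins, with conventional locking as the default), can be sketched roughly as follows. The class and method names here are hypothetical, not Pern's actual API: Pern is a C component, later reimplemented in Java as JPernLite.

```python
class TransactionManager:
    """Toy transaction manager with pluggable operation semantics.

    Each operation (here lock, unlock, commit) runs any registered
    "before" plugins, then the default behavior, then any "after"
    plugins -- a rough sketch of Pern-style extensibility.
    """

    def __init__(self):
        self.locks = {}    # object id -> transaction id holding the lock
        self.before = {}   # op name -> list of plugin callables
        self.after = {}

    def add_plugin(self, op, when, fn):
        table = self.before if when == "before" else self.after
        table.setdefault(op, []).append(fn)

    def _run(self, op, when, *args):
        table = self.before if when == "before" else self.after
        for fn in table.get(op, []):
            fn(*args)

    def lock(self, txn, obj):
        self._run("lock", "before", txn, obj)
        holder = self.locks.get(obj)
        if holder is not None and holder != txn:
            raise RuntimeError(f"conflict: {obj} held by {holder}")
        self.locks[obj] = txn  # default: exclusive locking
        self._run("lock", "after", txn, obj)

    def unlock(self, txn, obj):
        if self.locks.get(obj) == txn:
            del self.locks[obj]

    def commit(self, txn):
        self._run("commit", "before", txn)
        # Strict two-phase locking: all locks released only at commit.
        for obj in [o for o, t in self.locks.items() if t == txn]:
            self.unlock(txn, obj)
        self._run("commit", "after", txn)

log = []
tm = TransactionManager()
tm.add_plugin("lock", "after", lambda txn, obj: log.append((txn, obj)))
tm.lock("T1", "record42")
tm.commit("T1")
print(log)  # prints [('T1', 'record42')]
```

An extended transaction model would be realized by replacing the conflict test in `lock` or by plugins that relax or augment it; by default, the manager behaves as a conventional exclusive-lock, two-phase scheduler.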
(Project pages: Content-Based Visual Queries: http://www.ctr.columbia.edu/~sfchang/vis-project; Programming Systems Lab: http://www.psl.cs.columbia.edu/)

7.1 Pern

The Pern project is concerned with programmable transaction managers that can realize a wide range of human-oriented extended transaction models (ETMs) and, more recently, with application of such transaction management services in conjunction with emerging distributed computing protocol standards such as CORBA and HTTP.

An earlier prototype, called simply Pern [34], operated as a component (written in C) that can be linked into or externally added on to a system that requires either the conventional ACID properties or a special-purpose extended transaction model. Pern's lock conflict table is replaceable, and its standard transaction operations (begin, commit, abort, lock, unlock) can be associated with before and after plugins to modify or augment the semantics of the operations, by default implemented through the usual two-phase locking. Pern has been added on externally to Cap Gemini's ProcessWeaver workflow product and linked into our own Oz software development environment, and has also been used in tandem with the conventional transaction manager built into GIE Emeraude's PCTE environment framework product. The Cord [35] extended transaction modeling language and corresponding interpreter provide means for defining rules for resolving concurrency control conflicts as they arise. Cord has been used to implement altruistic locking and epsilon serializability, and has been integrated with U. Wisconsin's Exodus as well as with Pern. The current version of Pern, called JPernLite [36], has been redesigned and is now implemented in Java. We are currently developing JCord, which will expand on Cord to support dynamic construction and modification of ETMs on the fly from rules received from clients, and negotiation of conflicts among ETMs for different applications coincidentally accessing the same data served by the same or different instances of JPernLite.

7.2 Xanth

Xanth addresses integration of heterogeneous information resources from the viewpoint of accomplishing useful work, particularly in a team context. Our goal is to develop a middleware infrastructure for group information spaces, or groupspaces, in which workflow and transaction collaboration services, and the distributed query, information retrieval, data mining, etc. services being developed by other Columbia database projects, can operate.

The Xanth prototype in particular operates as a middlebase [37], imposing hypermedia object management on top of external data residing in any combination of conventional databases, websites, proprietary repositories, etc. dispersed over the Internet or organizational intranet. One key concept of a middlebase is that not only are the information repositories heterogeneous, but so are the protocols used to access these repositories. Further, the presentation clients and protocols for communicating with them may be similarly heterogeneous, and legacy, like the data sources constructed without the potential for sharable information spaces in mind.

The Xanth project is currently working on groupviews, team-oriented user interface metaphors for collaboration environments like those supported by Xanth. One experiment combines VRML with MUDs, with information subspaces mapped to platforms in the New York City subway system and movement between subspaces mapped to trips on the subway trains.

References

[1] D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.
[2] D. Chatziantoniou and K. A. Ross. Groupwise processing of relational queries. In VLDB Conference, pages 476-485, 1997.
[3] J. Rao and K. A. Ross. Reusing invariants: A new strategy for correlated queries. In ACM SIGMOD Conference on Management of Data, June 1998.
[4] Z. Li and K. Ross. Fast joins using join indices. VLDB Journal, 1998. (to appear).
[5] H. Lei and K. A. Ross. Faster joins, self-joins and multi-way joins using join indices. In Proceedings of the Workshop on Next Generation Information Technologies and Systems, pages 89-101, 1997. Full version to appear in Data and Knowledge Engineering.
[6] K. A. Ross and D. Srivastava. Fast computation of sparse datacubes. In International Conference on Very Large Databases, pages 116-125, August 1997.
[7] K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. In Extending Database Technology, pages 263-277, 1998.
[8] K. A. Ross, D. Srivastava, and S. Sudarshan. Materialized view maintenance and integrity constraint checking: Trading space for time. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 447-458, June 1996.
[9] A. Gupta, I. S. Mumick, and K. A. Ross. Adapting materialized views after redefinitions. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 211-222, 1995.
[10] A. Kawaguchi, D. Lieuwen, I. S. Mumick, D. Quass, and K. A. Ross. Concurrency control theory for deferred materialized views. In Proceedings of the International Conference on Database Theory, pages 306-320, 1997.
[11] L. Colby, A. Kawaguchi, D. Lieuwen, I. S. Mumick, and K. A. Ross. Supporting multiple view maintenance policies. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 405-416, May 1997.
[12] A. Kawaguchi, D. Lieuwen, I. S. Mumick, and K. A. Ross. Implementing incremental view maintenance in nested data models. In Proceedings of the Workshop on Database Programming Languages, 1997.
[13] D. Barbara et al. The New Jersey data reduction report. Data Engineering Bulletin, 20(4):3-45, 1997.
[14] S. Berchtold, H. V. Jagadish, and K. A. Ross. Independence diagrams: A technique for visual data mining. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.
[15] W. Lee, S. J. Stolfo, and K. Mok. Mining audit data to build intrusion detection models. In KDD98, 1998.
[16] P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD98, 1998.
[17] A. L. Prodromidis and S. J. Stolfo. Mining databases with different schemas: Integrating incompatible classifiers. In KDD98, 1998.
[18] W. Lee and S. Stolfo. Data mining approaches for intrusion detection. In Proceedings 7th USENIX Security Symposium, 1998.
[19] S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, D. Fan, and P. Chan. JAM: Java agents for meta-learning over distributed databases. In KDD97 and AAAI97 Workshop on AI Methods in Fraud and Risk Management, 1997.
[20] P. Chan and S. Stolfo. Sharing learned models among remote database partitions by local meta-learning. In KDD96, 1996.
[21] P. Chan and S. Stolfo. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Integration of Information.
[22] H. Schorr and S. Stolfo. Towards the digital government of the 21st century. Report of the Workshop on Research and Development Opportunities in Federal Information Services, 1997. http://www.isi.edu/nsf/final.html.
[23] L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text-database discovery problem. In Proceedings of the 1994 ACM International Conference on Management of Data (SIGMOD'94), May 1994.
[24] L. Gravano and H. García-Molina. Generalizing GlOSS for vector-space databases and broker hierarchies. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 78-89, September 1995.
[25] L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems (PDIS'94), September 1994.
[26] L. Gravano, C. K. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for Internet meta-searching. In Proceedings of the 1997 ACM International Conference on Management of Data (SIGMOD'97), May 1997.
[27] S.-F. Chang, J. R. Smith, M. Beigi, and A. Benitez. Visual information retrieval from large distributed on-line repositories. Communications of the ACM, Special Issue on Visual Information Management, 40(12):63-71, Dec. 1997.
[28] S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong. VideoQ: An automatic content-based video search system using visual cues. In ACM Multimedia Conference, 1997. Demo: http://www.ctr.columbia.edu/VideoQ.
[29] J. R. Smith and S.-F. Chang. VisualSEEk: A fully automated content-based image query system. In ACM Multimedia Conference, 1996. Demo: http://www.ctr.columbia.edu/VisualSEEk.
[30] J. Meng and S.-F. Chang. CVEPS: A compressed video editing and parsing system. In ACM Multimedia Conference, 1996. Demo: http://www.ctr.columbia.edu/WebClip.
[31] S.-F. Chang, W. Chen, and H. Sundaram. Semantic visual templates: Linking visual features to semantics. In IEEE International Conference on Image Processing, 1998.
[32] M. Beigi, A. Benitez, and S.-F. Chang. MetaSEEk: A content-based meta search engine for images. In SPIE Conference on Storage and Retrieval for Image and Video Databases, 1998. Demo: http://www.ctr.columbia.edu/metaseek.
[33] J. R. Smith and S.-F. Chang. Visually searching the web for content. IEEE Multimedia Magazine, 4(3):12-20, 1997. Demo: http://www.ctr.columbia.edu/webseek.
[34] G. T. Heineman and G. E. Kaiser. An architecture for integrating concurrency control into environment frameworks. In 17th International Conference on Software Engineering, 1995.
[35] G. T. Heineman and G. E. Kaiser. The CORD approach to extensible concurrency control. In 13th International Conference on Data Engineering, 1997.
[36] J. J. Yang and G. E. Kaiser. JPernLite: An extensible transaction server for the world wide web. In 9th ACM Conference on Hypermedia and Hypertext, 1998.
[37] G. E. Kaiser and S. E. Dossick. Workgroup middleware for distributed projects. In IEEE 7th International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 1998.