COVER FEATURE

Database Technology for Decision Support Systems

Creating the framework for an effective decision support system—one that leverages business data from numerous discrete touch points—is a daunting but doable task.

Surajit Chaudhuri, Microsoft Research
Umeshwar Dayal, Hewlett-Packard Laboratories
Venkatesh Ganti, Microsoft Research

Decision support systems are the core of business IT infrastructures because they give companies a way to translate a wealth of business information into tangible and lucrative results. Collecting, maintaining, and analyzing large amounts of data, however, are mammoth tasks that involve significant technical challenges, expenses, and organizational commitment.

Online transaction processing systems allow organizations to collect large volumes of daily business point-of-sales data. OLTP applications typically automate structured, repetitive data processing tasks such as order entry and banking transactions. This detailed, up-to-date data from various independent touch points must be consolidated into a single location before analysts can extract meaningful summaries. Managers use this aggregated data to make numerous day-to-day business decisions—everything from managing inventory to coordinating mail-order campaigns.

DECISION SUPPORT SYSTEM COMPONENTS

A successful decision support system is a complex creation with numerous components. A fictitious business example, the Footwear Sellers Company, helps illustrate a decision support system's various components. FSC manufactures footwear and sells through two channels—directly to customers and through resellers. FSC's marketing executives need to extract the following information from the company's aggregate business data:

• the five states reporting the highest increases in youth product category sales within the past year,
• total footwear sales in New York City within the past month by product family,
• the 50 cities with the highest number of unique customers, and
• one million customers who are most likely to buy the new Walk-on-Air shoe model.

Before building a system that provides this decision support information, FSC's analysts must address and resolve three fundamental issues:

• what data to gather and how to conceptually model the data and manage its storage,
• how to analyze the data, and
• how to efficiently load data from several independent sources.

As Figure 1 shows, these issues correlate to a decision support system's three principal components: a data warehouse server, online analytical processing and data mining tools, and back-end tools for populating the data warehouse.

Data warehouses contain data consolidated from several operational databases and tend to be orders of magnitude larger than operational databases, often hundreds of gigabytes to terabytes in size. Typically, the data warehouse is maintained separately from the organization's operational databases because analytical applications' functional and performance requirements are quite different from those of operational databases. Data warehouses exist principally for decision support applications and provide historical, summarized, and consolidated data that is more appropriate than detailed, individual records.

Figure 1. Decision support system architecture, which consists of three principal components: a data warehouse server, analysis and data mining tools, and data warehouse back-end tools.


The workloads consist of ad hoc, complex queries that access millions of records and perform multiple scans, joins, and aggregates. Query response times are more important than transaction throughput.

Because data warehouse construction is a complex process that can take many years, some organizations instead build data marts, which contain information for specific departmental subsets. For example, a marketing data mart may include only customer, product, and sales information and may not include delivery schedules. Several departmental data marts can coexist with the main data warehouse and provide a partial view of the warehouse contents. Data marts roll out faster than data warehouses but can involve complex integration problems later if the initial planning does not reflect a complete business model.

Online analytical processing and data mining tools enable sophisticated data analysis. Back-end tools—such as extraction, transformation, and load tools—populate the data warehouse from external data sources.

DATA WAREHOUSE

Most data warehouses use relational database technology because it offers a robust, reliable, and efficient approach for storing and managing large volumes of data. The most significant issue associated with data warehouse construction is database design, both logical and physical. Building a logical schema for an enterprise data warehouse requires extensive business modeling.

Logical database design

In the star schema design, the database consists of a fact table that describes all transactions and a dimension table for each entity. For the fictitious FSC, each sales transaction involves several entities—a customer, a salesperson, a product, an order, a transaction date, and the city where the transaction occurred. Each transaction also has attributes—the number of units sold and the total amount the customer paid. Each tuple in the fact table consists of a pointer to each entity in a transaction and the numeric measures associated with the transaction. Each dimension table consists of columns that correspond to the entity's attributes. Computing the join between a fact table and a set of dimension tables is more efficient than computing a join among arbitrary relations.

Some entities, however, are associated with hierarchies, which star schemas do not explicitly support. A hierarchy is a multilevel grouping in which each level consists of a disjoint grouping of the values in the level immediately below it. For example, all products can be grouped into a disjoint set of categories, which are themselves grouped into a disjoint set of families.

Snowflake schemas are a refinement of star schemas in which a dimensional hierarchy is explicitly represented by normalizing the dimension tables. In the schema depicted in Figure 2, a set of attributes describes each dimension and may be related via a relationship hierarchy. For example, the FSC's product dimension consists of five attributes: the product name (Running Shoe 2000), category (athletics), product family (shoe), price ($80), and the profit margin (80 percent).

Figure 2. Snowflake schema for a hypothetical Footwear Sellers Company. A set of attributes describes each dimension and is related via a relationship hierarchy.
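To make the fact/dimension split concrete, the sketch below declares a pared-down version of the Figure 2 tables and runs one aggregate query that joins the fact table to two dimension tables. This is only an illustration: it uses SQLite through Python's standard sqlite3 module, and the column names and types are simplified assumptions rather than FSC's actual schema.

```python
import sqlite3

# In-memory database standing in for the warehouse's relational back end.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified dimension tables (one per entity) and a fact table whose tuples
# point to the dimensions and carry the numeric measures of each transaction.
cur.executescript("""
CREATE TABLE product (ProdNo INTEGER PRIMARY KEY, Name TEXT, Category TEXT,
                      Family TEXT, UnitPrice REAL);
CREATE TABLE city    (CityName TEXT PRIMARY KEY, State TEXT);
CREATE TABLE sales   (OrderNo INTEGER, CustomerNo INTEGER, SalespersonID INTEGER,
                      DateKey TEXT, CityName TEXT, ProdNo INTEGER,
                      Quantity INTEGER, TotalPrice REAL);
""")

cur.executemany("INSERT INTO product VALUES (?,?,?,?,?)",
                [(1, "Running Shoe 2000", "athletics", "shoe", 80.0),
                 (2, "Walk-on-Air", "casual", "shoe", 95.0)])
cur.executemany("INSERT INTO city VALUES (?,?)",
                [("New York", "NY"), ("Seattle", "WA")])
cur.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?,?,?)",
                [(10, 100, 7, "2001-11-02", "New York", 1, 2, 160.0),
                 (11, 101, 7, "2001-11-03", "New York", 2, 1, 95.0),
                 (12, 102, 8, "2001-11-03", "Seattle",  1, 1, 80.0)])

# A typical star join: the fact table joined to two dimensions, aggregating a
# measure at a coarser level (total sales per product family per city).
cur.execute("""
SELECT c.CityName, p.Family, SUM(s.TotalPrice) AS total_sales
FROM sales s
JOIN product p ON s.ProdNo  = p.ProdNo
JOIN city    c ON s.CityName = c.CityName
GROUP BY c.CityName, p.Family
""")
print(cur.fetchall())
```

Because every join in the query runs between the fact table and a dimension table on a key/foreign-key pair, the query optimizer has far less work to do than for joins among arbitrary relations.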
Physical database design

Database systems use redundant structures such as indices and materialized views to efficiently process complex queries. Determining the most appropriate set of indices and views is a complex physical design problem. While index lookups and scans can be effective for data-selective queries, data-intensive queries can require sequential scans of an entire relation or a relation's vertical partitions. Improving the efficiency of table scans and exploiting parallelism to reduce query response times are important design considerations.1

Index structures and their usage

Query processing techniques that exploit indices through index intersection and union are useful for answering multiple-predicate queries. Index intersections exploit multiple-condition selectivities and can significantly reduce or eliminate the need to access the base tables if all projection columns are available via index scans.

The specialized nature of star schemas makes master-detail join indices especially attractive for decision support. While indices traditionally map the value in a column to a list of rows with that value, a join index maintains the relationship between a foreign key and its matching primary keys. In the context of a star schema, a join index can relate the values of one or more attributes of a dimension table to matching rows in the fact table. The schema in Figure 2, for example, can support a join index on City that maintains, for each city, a list of the tuples' record identifiers in the fact table that correspond to sales in that city. Essentially, the join index precomputes a binary join. Multikey join indices can represent precomputed n-way joins. For example, a multidimensional join index built over the sales database can join City.CityName and Product.Name to the fact table. Consequently, the index entry for (Seattle, Running Shoe) points to the record identifiers of tuples in the Sales table with that combination.

Materialized views and usage

Many data warehouse queries require summary data and therefore use aggregates. Materializing summary data can accelerate many common queries. In the FSC example, two data views—total sales grouped by product family and city, and total number of customers grouped by city—can efficiently answer three of the marketing department's queries: states reporting the highest increases in youth product category sales, total footwear sales in New York City by product family, and the 50 cities with the highest number of unique customers.

The challenges in exploiting materialized views are similar to the index challenges:

• identify the views to materialize,
• exploit the materialized views to answer queries, and
• update the materialized views during load and refresh.

Because materialized views require extremely large amounts of space, currently adopted solutions only support a restricted class of structurally simple materialized views.
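As a concrete illustration of the first two challenges, choosing a view and answering a query from it, the sketch below materializes the "total sales by product family and city" summary described above and answers the New York City query from the summary instead of the fact table. It is a minimal sketch using SQLite via Python's sqlite3 module; the table layout and data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A tiny, denormalized stand-in for the sales fact table joined to its dimensions.
cur.execute("CREATE TABLE sales (CityName TEXT, Family TEXT, ProductName TEXT, TotalPrice REAL)")
cur.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("New York", "shoe",  "Running Shoe 2000", 160.0),
    ("New York", "shoe",  "Walk-on-Air",        95.0),
    ("New York", "socks", "Trail Sock",         12.0),
    ("Seattle",  "shoe",  "Running Shoe 2000",  80.0),
])

# Materialize the view: total sales grouped by product family and city.
cur.execute("""
CREATE TABLE sales_by_family_city AS
SELECT CityName, Family, SUM(TotalPrice) AS total_sales
FROM sales
GROUP BY CityName, Family
""")

# Answer "total footwear sales in New York City by product family" from the
# much smaller summary table rather than by rescanning the fact table.
cur.execute("""
SELECT Family, total_sales
FROM sales_by_family_city
WHERE CityName = 'New York'
""")
print(cur.fetchall())
```

A production system would also have to handle the third challenge: refreshing sales_by_family_city whenever the underlying fact table is loaded or updated.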
ONLINE ANALYTICAL APPLICATIONS

In a typical online analytical application, a query aggregates a numeric measure at higher levels in the dimensional hierarchy. An example is the first FSC marketing query that asks for a set of aggregate hierarchical measures—the five states reporting the highest increases in youth product category sales within the past year. State and year are ancestors of the city and date entities.

In the context of the FSC data warehouse, a typical OLAP session to determine regional sales of athletic shoes in the last quarter might proceed as follows:

• The analyst issues a select sum(sales) group by country query to view the distribution of sales of athletic shoes in the last quarter across all countries.
• After isolating the country with either the highest or the lowest total sales relative to market size, the analyst issues another query to compute the total sales within each state of that country to understand the reason for its exceptional behavior.

The analyst traverses down the hierarchy associated with the city dimension. Such a downward traversal of the hierarchy from the most summarized to the most detailed level is called drill-down. In a roll-up operation, the analyst goes up one level—perhaps from the state level to the country level—in a dimensional hierarchy.

Key OLAP-related issues include the conceptual data model and server architectures.

OLAP conceptual data model

The multidimensional model shown in Figure 3 uses numeric measures as its analysis objects. Each numeric measure in this aggregation-centric conceptual data model depends on dimensions that describe the entities in the transaction. For instance, the dimensions associated with a sale in the FSC example are the customer, the salesperson, the city, the product name, and the date the sale was made. Together, the dimensions uniquely determine the measure, so the multidimensional data model treats a measure as a value in the dimensions' multidimensional space.

With a multidimensional data view, drill-down and roll-up queries are logical operations on the cube depicted in Figure 3. Another popular operation is to compare two measures that are aggregated by the same dimensions, such as sales and budget.

Figure 3. A sample multidimensional database. Each numeric measure in this aggregation-centric conceptual data model depends on dimensions that describe the entities in the transaction. Hierarchical summarization paths run from industry to category to product, from country to state to city, and from year to quarter to month or week to date.

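The drill-down session just described can be mimicked with ordinary group-by aggregation at two levels of the geography hierarchy. The sketch below does this over an in-memory list of (country, state, sales) tuples in plain Python; the data and the choice of "lowest total sales" as the exceptional case are made up for the illustration.

```python
from collections import defaultdict

# Toy fact data: (country, state, athletic-shoe sales in the last quarter).
facts = [
    ("USA", "NY", 120.0), ("USA", "WA", 80.0), ("USA", "CA", 200.0),
    ("Canada", "ON", 30.0), ("Canada", "BC", 25.0),
]

def group_sum(rows, key):
    """Aggregate the sales measure grouped by the given key function."""
    totals = defaultdict(float)
    for country, state, sales in rows:
        totals[key(country, state)] += sales
    return dict(totals)

# Step 1: select sum(sales) group by country.
by_country = group_sum(facts, lambda country, state: country)
print(by_country)            # {'USA': 400.0, 'Canada': 55.0}

# Step 2 (drill-down): pick the exceptional country, then group by state within it.
target = min(by_country, key=by_country.get)
by_state = group_sum([r for r in facts if r[0] == target],
                     lambda country, state: state)
print(target, by_state)      # Canada {'ON': 30.0, 'BC': 25.0}
```

A roll-up would simply reverse the traversal, re-aggregating the state-level totals back to the country level.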
OLAP analysis can involve more complex statistical calculations than simple aggregations such as sum, count, and average. Examples include functions such as moving averages and the percentage change of an aggregate within a certain period compared with a different time period. Many commercial OLAP tools provide such additional functions.

The time dimension is particularly significant for decision support processes such as trend analysis. For example, FSC's market analysts may want to chart sales activity for a class of athletic shoes before or after major national athletic contests. Sophisticated trend analysis is possible if the database has built-in knowledge of calendars and other sequential characteristics of the time dimension. The OLAP Council (http://www.olapcouncil.org) has defined a list of other such multidimensional cube operations.

OLAP server and middleware architectures

Although traditional relational servers do not efficiently process complex OLAP queries or support multidimensional data views, three types of online analytical processing servers—relational (ROLAP), multidimensional (MOLAP), and hybrid (HOLAP)—now support OLAP on data warehouses built with relational database systems.

ROLAP servers. ROLAP middleware servers sit between the relational back-end server where the data warehouse is stored and the client front-end tools. ROLAPs support multidimensional OLAP queries and typically optimize for specific back-end relational servers. They identify the views to be materialized, rephrase user queries in terms of the appropriate materialized views, and generate multistatement SQL for the back-end server. They also provide additional services such as query scheduling and resource assignment. ROLAP servers exploit the scalability and transactional features of relational systems, but intrinsic mismatches between OLAP-style querying and SQL can create performance bottlenecks in OLAP servers.

Bottlenecks are becoming less of a problem with the OLAP-specific SQL extensions implemented in relational servers such as Oracle, IBM DB2, and Microsoft SQL Server. Functions such as median, mode, rank, and percentile extend the aggregate functions. Other feature additions include aggregate computations over moving windows, running totals, and breakpoints to enhance support for reporting applications.

Multidimensional queries require grouping by different sets of attributes. Jim Gray and colleagues2 proposed two operators—roll-up and cube—to augment SQL and address this requirement. Roll-up of a list of attributes such as product, year, and city over a data set results in answer sets with the following aggregations:

• group by product, year, and city;
• group by product and year; and
• group by product.

Given a list of k columns, the cube operator provides a group-by for each of the 2^k combinations of columns. Such multiple group-by operations can be executed efficiently by recognizing commonalities among them.3 When applicable, precomputing can enhance OLAP server performance.4
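The cube operator's 2^k group-bys can be spelled out directly: for each subset of the grouping columns, aggregate the measure over that subset. The sketch below does this naively in plain Python over a few hand-made rows; a real server would share work among the group-bys rather than recompute each one from scratch, as reference 3 describes.

```python
from collections import defaultdict
from itertools import combinations

# Rows of (product, year, city, sales); the data is invented for the illustration.
rows = [
    ("Running Shoe 2000", 2001, "New York", 160.0),
    ("Running Shoe 2000", 2001, "Seattle",   80.0),
    ("Walk-on-Air",       2000, "New York",  95.0),
]
columns = ("product", "year", "city")

def cube(rows, columns):
    """Return {grouping columns: {group key: sum of sales}} for all 2^k groupings."""
    result = {}
    for r in range(len(columns) + 1):
        for idx in combinations(range(len(columns)), r):
            grouping = tuple(columns[i] for i in idx)
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in idx)   # the empty key () is the grand total
                totals[key] += row[-1]
            result[grouping] = dict(totals)
    return result

for grouping, totals in cube(rows, columns).items():
    print(grouping or ("ALL",), totals)
```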
MOLAP servers. MOLAP is a native server architecture that does not exploit the functionality of a relational back end but directly supports a multidimensional data view through a multidimensional storage engine. MOLAP enables the implementation of multidimensional queries on the storage layer via direct mapping. MOLAP's principal advantage is its excellent indexing properties; its disadvantage is poor storage utilization, especially when the data is sparse. Many MOLAP servers adapt to sparse data sets through a two-level storage representation and extensive compression. In a two-level storage representation, either directly or by means of design tools, the user identifies a set of one- or two-dimensional subarrays that are likely to be dense and represents them in the array format. Traditional indexing structures can then index these smaller arrays. Many of the techniques devised for statistical databases are relevant for MOLAP servers. Although MOLAP servers offer good performance and functionality, they do not scale well for extremely large data sizes.

HOLAP servers. The HOLAP architecture combines ROLAP and MOLAP technologies. In contrast to MOLAP, which performs better when the data is reasonably dense, ROLAP servers perform better when the data is extremely sparse. HOLAP servers identify sparse and dense regions of the multidimensional space and take the ROLAP approach for sparse regions and the MOLAP approach for dense regions. HOLAP servers split a query into multiple queries, issue the queries against the relevant data portions, combine the results, and then present the result to the user. HOLAP's selective view materialization, selective index building, and query and resource scheduling are similar to its MOLAP and ROLAP counterparts.

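One way to picture the two-level representation that MOLAP servers use for sparse data is a sparse outer structure (a dictionary keyed by the sparse dimensions) whose values are small dense arrays along a remaining dimension. The sketch below is an assumed, greatly simplified illustration in plain Python, not how any particular MOLAP engine actually stores its cubes.

```python
import array

MONTHS = 12  # dense inner dimension: one slot per month

# Outer level: only (product, city) combinations that actually have sales get an
# entry. Inner level: a dense 12-element array of monthly sales for that pair.
cube = {}

def add_sale(product, city, month, amount):
    key = (product, city)
    if key not in cube:
        cube[key] = array.array("d", [0.0] * MONTHS)
    cube[key][month - 1] += amount

add_sale("Running Shoe 2000", "Seattle", 11, 80.0)
add_sale("Running Shoe 2000", "Seattle", 12, 80.0)
add_sale("Walk-on-Air", "New York", 12, 95.0)

# Empty (product, city) combinations consume no space at the outer level, while
# lookups along the dense time dimension are simple array indexing.
print(cube[("Running Shoe 2000", "Seattle")][11])   # December sales: 80.0
print(len(cube))                                     # 2 occupied outer cells
```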
DATA MINING

Suppose that FSC wants to launch a catalog mailing campaign with an expense budget of less than $1 million. Given this constraint, the marketing analysts want to identify the set of customers most likely to respond to and buy from the catalog. Data mining tools provide such advanced predictive and analytical functionality by identifying distribution patterns and characteristic behaviors within a data set. Knowledge discovery—the process of specifying and achieving a goal through iterative data mining—typically consists of three phases:

• data preparation,
• model building and evaluation, and
• model deployment.

Data preparation

In the data preparation phase, the analyst prepares a data set containing enough information to build accurate models in subsequent phases. To address FSC's information requirement, an accurate model will predict whether a customer is likely to buy products advertised in a new catalog. Because predictions are based on factors potentially influencing customers' purchases, a model data set would include all customers who responded to mailed catalogs in the past three years, their demographic information, the 10 most expensive products each customer purchased, and information about the catalogs that elicited their purchases.

Data preparation can involve complex queries with large results. For instance, preparing the FSC data set involves joins between the customer relation and the sales relation as well as identifying the top 10 products for each customer. All the issues related to efficiently processing decision support queries are equally relevant in the data mining context. In fact, data mining platforms use OLAP or relational servers to meet their data preparation requirements.

Data mining typically involves iteratively building models on a prepared data set and then deploying one or more models. Because building models on large data sets can be expensive, analysts often work initially with data set samples. Data mining platforms, therefore, must support computing random samples of data over complex queries.
Building and evaluating mining models

Only after deciding which model to deploy does the analyst build the model on the entire prepared data set. The goal of the model-building phase is to identify patterns that determine a target attribute. A target attribute example in the FSC data set is whether a customer purchased at least one product from a past catalog.

Several classes of data mining models help predict both explicitly specified and hidden attributes. Two important issues influencing model selection are the accuracy of the model and the efficiency of the algorithm for constructing the model over large data sets. Statistically, the accuracy of most models improves with the amount of data used, so the algorithms for inducing mining models must be efficient and scalable to process large data sets within a reasonable amount of time.

Model types

Classification models are predictive. When given a new tuple, classification models predict whether the tuple belongs to one of a set of target classes. In the FSC catalog example, a classification model would determine, based on past behavior, whether a customer is or is not likely to purchase from a catalog. Decision trees and naïve Bayes models are two popular types of classification models.5-7

Regression trees and logistic regression are two popular types of regression models, which predict numeric attributes, such as a customer's salary or age.5

With some applications, the analyst does not explicitly know the set of target classes and considers them hidden. The analyst uses clustering models such as K-Means and Birch to determine the appropriate set of classes and to classify a new tuple into one of these hidden classes.6,7

Analysts use rule-based models such as the association rules model to explore whether the purchase of a certain set of footwear products is indicative, with some degree of confidence, of a purchase of another product.

Additional model considerations

No single model class or algorithm will always build the ideal model for all applications. Data mining platforms must therefore support several types of model construction for evaluation and provide additional functionality for extensibility and interchangeability.

In some cases, analysts may want to build a unique correlation model that the data mining platform does not support. To handle such requirements, mining platforms must support extensibility.

Many commercial products build models for specific domains, but the actual database on which the model must be deployed may be in a different database system. Data mining platforms and database servers therefore must also be capable of interchanging models. The Data Mining Group (http://www.dmg.org) recently proposed using the Predictive Model Markup Language, an XML-based standard, to interchange several popular predictive model classes. The idea is that any database supporting the standard can import and deploy any model described in the standard.

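As an illustration of the classification task in the catalog example, the sketch below trains a decision tree on a handful of made-up customer records and predicts whether new customers are likely to respond. It assumes the scikit-learn library is available; the feature choices and data are invented stand-ins for the prepared FSC data set, not a method the article prescribes.

```python
from sklearn.tree import DecisionTreeClassifier

# Prepared (toy) training data: one row per customer.
# Features: [age, past catalog purchases, total spend last year in dollars]
X_train = [
    [25, 0,  40.0],
    [34, 3, 310.0],
    [41, 1, 120.0],
    [52, 5, 640.0],
    [19, 0,  15.0],
    [46, 2, 220.0],
]
# Target attribute: 1 if the customer bought from a past catalog mailing.
y_train = [0, 1, 0, 1, 0, 1]

# Build the classification model on the prepared data set.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Score new customers; those predicted 1 would go on the mailing list.
new_customers = [[30, 2, 180.0], [22, 0, 25.0]]
print(model.predict(new_customers))          # e.g. [1 0]
print(model.predict_proba(new_customers))    # class membership probabilities
```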
Mining model deployment

During the model deployment phase, the analyst applies the selected model to data sets to predict a target attribute with an unknown value. For the current set of customers in the FSC example, the prediction is whether they will purchase a product from the new catalog. Deploying a model on an input data set—a subset or partitioning of the input data set—may result in another data set. In the FSC example, the model deployment phase identifies the subset of customers that will be mailed catalogs.

When input data sets are extremely large, the model deployment strategy must be very efficient. Using indexes on the input relation to filter out tuples that will not be in the deployment result may be necessary, but this requires tighter integration between database systems and model deployment. Unfortunately, the research community has devoted less attention to deployment efficiency than to model-building scalability.

ADDITIONAL OLAP AND DATA MINING ISSUES

Other important issues in the context of OLAP and data mining technology include packaged applications, platform application program interfaces and the impact of XML, approximate query processing, OLAP and data mining integration, and Web mining.

Packaged applications

To develop a complete OLAP or data mining analysis solution, the analyst must perform a series of complex queries and build, tune, and deploy complex models. Several commercial tools try to bridge the gap between actual solution requirements for well-understood domains and the support from a given OLAP or data mining platform. Packaged applications and reporting tools can exploit vertical-domain knowledge to make the analyst's task easier by providing higher-level, domain-specific abstractions. The Data Warehousing Information Center (http://www.dwinfocenter.org/) and KDnuggets (http://www.kdnuggets.com/solutions/index.html) provide comprehensive lists of domain-specific solutions.

Businesses can purchase such solutions instead of developing their own analysis, but specific vertical-domain solutions are limited by their feature sets and therefore may not satisfy all of a company's analysis needs as its business grows.

Platform APIs and XML's impact

Several OLAP and data mining platforms provide APIs so analysts can build custom solutions. However, solution providers typically have had to program for a variety of OLAP or data mining engines to provide an engine-independent solution. New Web-based XML services offer solution providers a common interface for OLAP engines. Microsoft and Hyperion have published the XML for Analysis specification (http://www.essbase.com/downloads/XML_Analysis_spec.pdf), a Simple Object Access Protocol-based XML API designed specifically for standardizing the data access interaction between a client application and an analytical data provider (OLAP and data mining) working over the Web. With this specification, solution providers can program using a single XML API instead of multiple vendor-specific APIs.

Approximate query processing

Complex aggregate query processing typically requires accessing large amounts of data in the warehouse. For example, computing FSC's average sales across cities requires a scan of all warehouse data. In many cases, however, approximate query processing is an option for obtaining a reasonably accurate estimate very quickly. The basic idea is to summarize the underlying data as concisely as possible and then answer aggregate queries using the summary instead of the actual data. The Approximate Query Processing project (http://www.research.microsoft.com/dmx/ApproximateQP) and the AQUA Project (http://www.bell-labs.com/project/aqua) provide additional information about this approach.

OLAP and data mining integration

OLAP tools help analysts identify relevant portions of the data, while mining models effectively enhance this functionality. For example, if the FSC product sales increase does not meet the targeted rates, the marketing managers will want to know the anomalous regions and product categories that failed to meet the target. An exploratory analysis that identifies anomalies uses a technique that marks an aggregate measure at a higher level in a dimensional hierarchy with an anomaly score. The anomaly score computes the overall deviation of the actual aggregate values from corresponding expected values over all its descendants.8,9 Analysts can use data mining tools such as regression models to compute expected values.

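Returning to the approximate query processing idea described above, the simplest summary is a uniform random sample: estimate the aggregate from the sample instead of scanning every row. The sketch below contrasts an exact average with a sample-based estimate in plain Python; the data is synthetic, and real systems typically rely on precomputed synopses rather than sampling at query time.

```python
import random

random.seed(42)

# Synthetic per-city sales figures standing in for a large warehouse table.
city_sales = [random.gauss(100_000, 25_000) for _ in range(1_000_000)]

# Exact answer: requires touching every row.
exact_avg = sum(city_sales) / len(city_sales)

# Approximate answer: average over a 1 percent uniform random sample.
sample = random.sample(city_sales, k=len(city_sales) // 100)
approx_avg = sum(sample) / len(sample)

print(f"exact   : {exact_avg:,.0f}")
print(f"estimate: {approx_avg:,.0f}")
print(f"relative error: {abs(approx_avg - exact_avg) / exact_avg:.2%}")
```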
Web mining

Most major businesses maintain a Web presence where customers can browse, inquire about, and purchase products. Because each customer has one-to-one contact with the business through the Web site, companies can personalize the experience. For example, the Web site may recommend products, services, or articles within the customer's interest category. Amazon.com has pioneered the deployment of such personalization systems.

The two key issues involved in developing and deploying such Web systems are data collection and personalization techniques. Analysis of Web log data—automatically collected records of customer behavior at the Web site—can reveal typical patterns. For example, this kind of analysis would allow FSC to offer a special on athletic socks to a customer who purchases shoes. Data mining models can exploit such behavioral data—particularly when it is combined with the demographic data the customer enters during registration or checkout—to personalize the Web pages the customer sees with appropriate advertisements. Over time, when a large user community develops, the business can recommend additional products based on similarities among users' behavioral patterns. Data mining models can identify such similar user classes.

DATA WAREHOUSE TOOLS

Building a data warehouse from independent data sources is a multistep process that involves extracting the data from each source, transforming it to conform to the warehouse schema, cleaning it, and then loading it into the warehouse. The Data Warehousing Information Center provides an extensive list of ETL (extract, transform, load) tools for use in this operations sequence.

Extraction and transformation

The goal of the extraction step is to bring data from disparate sources into a database where it can be modified and incorporated into the data warehouse. The goal of the subsequent data transformation step is to address discrepancies in schema and attribute value conventions. A set of rules and scripts typically handles the transformation of data from an input schema to the destination schema.

For example, an FSC distributor may report sales transactions as a flat file in which each record describes all entities and the number of units involved in the transaction. The distributor might split each customer name into three fields: first name, middle initial, and last name. To incorporate the distributor's sales information into the FSC data warehouse with the schema shown in Figure 2, the analyst must first extract the records and then, for each record, transform all three name-related source columns to derive a value for the customer name attribute.

Data cleaning

Data entry errors and differences in schema conventions can cause the customer dimension table to have several corresponding tuples for a single customer, resulting in inaccurate query responses and inappropriate mining models. For example, if the customer table has multiple tuples for several FSC customers in New York, New York may incorrectly appear in the list of top 50 cities with the highest number of unique customers. Tools that help detect and correct data anomalies can have a high payoff, and a significant amount of research addresses the problems of duplicate elimination10 and data-cleaning frameworks.11
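A toy version of that transformation, together with a first pass at cleaning, might look like the sketch below: it merges the distributor's three name fields into a single customer name and normalizes city values so that spelling variants do not create duplicate customer tuples. The field names and matching rules are assumptions made for the illustration, not part of any particular ETL tool.

```python
import csv, io

# Flat file as the distributor might deliver it (inlined here to keep the example runnable).
flat_file = io.StringIO(
    "first_name,middle_initial,last_name,city,units\n"
    "Ann,M,Smith,New York,2\n"
    "Ann,M.,smith,new york,1\n"
    "Bob,,Jones,Seattle,4\n"
)

CITY_SYNONYMS = {"ny": "New York", "new york": "New York", "seattle": "Seattle"}

def transform(record):
    """Map the source schema onto the warehouse's CustomerName/CityName attributes."""
    name = " ".join(p.strip(". ") for p in
                    (record["first_name"], record["middle_initial"], record["last_name"])
                    if p.strip())
    city = CITY_SYNONYMS.get(record["city"].strip().lower(), record["city"].strip())
    return {"CustomerName": name.title(), "CityName": city, "Quantity": int(record["units"])}

rows = [transform(r) for r in csv.DictReader(flat_file)]

# Crude duplicate detection for the customer dimension: same normalized name and city.
unique_customers = {(r["CustomerName"], r["CityName"]) for r in rows}
print(rows)
print(unique_customers)   # both "Ann M Smith, New York" variants collapse to one customer
```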
Loading

After its extraction and transformation, data may still require additional preprocessing before it is loaded into the data warehouse. Typically, batch load utilities handle functions such as checking integrity constraints; sorting; summarizing, aggregating, and performing other computations to build derived tables stored in the warehouse; and building indices and other access paths. In addition to populating the warehouse, a load utility must allow the system administrator to monitor status; to cancel, suspend, and resume a load; and to restart after failure with no data integrity loss. Because load utilities for data warehouses handle much larger data volumes than operational databases, they exploit pipelined and partitioned parallelism.1

Refreshing

Refreshing a data warehouse consists of propagating updates on source data that correspondingly update the base tables and the derived data—materialized views and indexes—stored in the warehouse. The two issues to consider are when to refresh and how to refresh.

Typically, data warehouses are refreshed periodically according to a predetermined schedule, such as daily or weekly. Only if some OLAP queries require current data, such as up-to-the-minute stock quotes, is it necessary to propagate every update. The data warehouse administrator sets the refresh policy depending on user needs and traffic. Refresh schedules may differ for different sources. The administrator must choose the refresh cycles properly so that the data volume does not overwhelm the incremental load utility.

Most commercial utilities use incremental loading during refresh to reduce data volume, inserting only the updated tuples if the data sources support extracting relevant portions of data. However, the incremental load process can be difficult to manage because the update must be coordinated with ongoing transactions.

METADATA

Metadata is any information required to manage the data warehouse, and metadata management is an essential warehousing architecture element. Administrative metadata includes all information required to set up and use a warehouse. Business metadata includes business terms and definitions, data ownership, and charging policies. Operational metadata includes information collected during warehouse operation, such as the lineage of migrated and transformed data; the data currency (active, archived, or purged); and monitoring information such as usage statistics, error reports, and audit trails. Warehouse metadata often resides in a repository that enables metadata sharing among tools and processes for designing, setting up, using, operating, and administering a warehouse.

Concerted efforts within industry and academia have brought substantial technological progress to the data warehousing task, reflected by the number of commercial tools that exist for each of the three major operations: populating the data warehouse from independent operational databases, storing and managing the data, and analyzing the data to make intelligent business decisions. Despite the plethora of commercial tools, however, several interesting avenues for research remain.

Data cleaning is related to heterogeneous data integration, a problem that has been studied for many years. To date, much emphasis has centered on data inconsistencies rather than schema inconsistencies. Although data cleaning is the subject of recent attention, more work needs to be done to develop domain-independent tools that solve the variety of data-cleaning problems associated with data warehouse development.

Most data mining research has focused on developing algorithms for building more accurate models or for building models faster. The other two stages of the knowledge discovery process—data preparation and mining model deployment—have largely been ignored. Both stages present several interesting problems specifically related to achieving better synergy between database systems and data mining technology. Ultimately, new tools should give analysts more efficient ways to prepare a good data set for achieving a specific goal and more efficient ways to deploy models over the results of arbitrary SQL queries. ✸

References

1. T. Barclay et al., "Loading Databases Using Dataflow Parallelism," SIGMOD Record, Dec. 1994, pp. 72-83.
2. J. Gray et al., "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub Totals," Data Mining and Knowledge Discovery J., Apr. 1997, pp. 29-54.
3. S. Agrawal et al., "On the Computation of Multidimensional Aggregates," Proc. VLDB Conf., Morgan Kaufmann, San Francisco, 1996, pp. 506-512.
4. V. Harinarayan, A. Rajaraman, and J.D. Ullman, "Implementing Data Cubes Efficiently," Proc. SIGMOD Conf., ACM Press, New York, 1996, pp. 205-216.
5. L. Breiman et al., Classification and Regression Trees, Chapman & Hall/CRC, Boca Raton, Fla., 1984.
6. V. Ganti, J. Gehrke, and R. Ramakrishnan, "Mining Very Large Data Sets," Computer, Aug. 1999, pp. 38-45.
7. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
8. S. Sarawagi, "User Adaptive Exploration of OLAP Data Cubes," Proc. VLDB Conf., Morgan Kaufmann, San Francisco, 2000, pp. 307-316.
9. J. Han, "OLAP Mining: An Integration of OLAP with Data Mining," Proc. IFIP Conf. Data Semantics, Chapman & Hall/CRC, Boca Raton, Fla., 1997, pp. 1-11.
10. M. Hernandez and S. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. SIGMOD Conf., ACM Press, New York, 1995, pp. 127-138.
11. H. Galhardas et al., "Declarative Data Cleaning: Model, Language, and Algorithms," Proc. VLDB Conf., Morgan Kaufmann, San Francisco, 2001, pp. 371-380.

Surajit Chaudhuri is a senior researcher and manager of the Data Management, Exploration, and Mining Group at Microsoft Research. His research interests include self-tuning database systems, decision support systems, and integration of text, relational, and semistructured information. He received a PhD in computer science from Stanford University. Chaudhuri is a member of the IEEE Computer Society and the ACM. Contact him at [email protected].

Umeshwar Dayal is a principal laboratory scientist at Hewlett-Packard Laboratories. His research interests are data mining, business process management, distributed information management, and decision support technologies, especially as applied to e-business. He received a PhD in applied mathematics from Harvard University. Dayal is a member of the ACM and the IEEE. Contact him at [email protected].

Venkatesh Ganti is a researcher in the Data Management, Exploration, and Mining Group at Microsoft Research. His research interests include mining and monitoring large evolving data sets and decision support systems. He received a PhD in computer science from the University of Wisconsin-Madison. Ganti is a member of the ACM. Contact him at [email protected].
