Mining Database Structure; Or, How to Build a Data Quality Browser
Total Page:16
File Type:pdf, Size:1020Kb
Mining Database Structure; Or, How to Build a Data Quality Browser Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk AT&T Labs-Research {t amr,johnsont,muthu,vshkap}@research.att.com ABSTRACT correctly modeling a customer and entering all information related to the sale, or a provisioning group may promptly Data mining research typically assumes that the data to be enter in the service/circuits they provision but might not analyzed has been identified, gathered, cleaned, and pro- delete them as diligently. cessed into a convenient form. While data mining tools Unfortunately, these disordered databases have a signif- greatly enhance the ability of the analyst to make data- icant cost. Planning, analysis, and data mining are frus- driven discoveries, most of the time spent in performing an trated by incorrect or missing data. New projects which analysis is spent in data identification, gathering, cleaning require access to multiple databases are difficult, expensive, and processing the data. Similarly, schema mapping tools and perhaps even impossible to implement. have been developed to help automate the task of using A variety of tools have been developed for database clean- legacy or federated data sources for a new purpose, but ing [42] and for schema mapping [35], as we discuss in more assume that the structure of the data sources is well un- detail below. In our experience, however, one is faced with derstood. However the data sets to be federated may come the difficult problem of understanding the contents and struc- from dozens of databases containing thousands of tables and ture of the database(s) at hand before they can be cleaned tens of thousands of fields, with little reliable documentation or have their schemas mapped. Large production databases about primary keys or foreign keys. often have hundreds to thousands of tables with thousands We are developing a system, Bellman, which performs to tens of thousands of fields. Even in a clean database, data mining on the structure of the database. In this paper, discovering the database structure is difficult because of the we present techniques for quickly identifying which fields scale of the problem. have similar values, identifying join paths, estimating join Production databases often contain many additional prob- directions and sizes, and identifying structures in the database. lems which make understanding their structure much more The results of the database structure mining allow the an- difficult. Constructing an entity (e.g., a corporate customer alyst to make sense of the database content. This informa- or a data service offering) often requires many joins with tion can be used to e.g., prepare data for data mining, find long join paths, often across databases. The schema doc- foreign key joins for schema mapping, or identify steps to umentation is usually sparse and out-of-date. Foreign key be taken to prevent the database from collapsing under the dependencies are usually not maintained and may degrade weight of its complexity. over time. Conversely, tables may contain undocumented foreign keys. A table may contain heterogeneous entities, 1. INTRODUCTION i.e. sets of rows in the table that have different join paths. A seeming invariant of large production databases is that The convention for recording information may be different in they become disordered over time. The disorder arises from different tables (e.g. a customer name might be recorded in a variety of causes including incorrectly entered data, incor- one field in one table, but in two or more fields in another). rect use of the database (perhaps due to a lack of documenta- As an aid to our data cleaning efforts, we have devel- tion), and use of the database to model unanticipated events oped Bellman, a data quality browser. Bellman provides and entities (e.g., new services or customer types). Admin- the usual query and schema navigation tools, and also a col- istrators and users of these databases are under demanding lection of tools and services which are designed to help the time pressures and frequently do not have the time to care- user discover the structure in the database. Bellman uses fully plan, monitor, and clean their database. For example, database profiling [13] to collect summaries of the database the sales force is more interested in making a sale than in tablespaces, tables, and fields. These summaries are dis- played to the user in an interactive manner or are used for more complex queries. Bellman collects the conventional profiles (e.g., number of rows in a table, number of distinct Permission to make digital or hard copies of all or part of this work for values in a field, etc.), as well as more sophisticated profiles personal or classroom use is granted without fee provided that copies are (which is one of the subjects of this paper). not made or distributed for profit or commercial advantage and that copies In order to understand the structure of a database, it is bear this notice and the full citation on the first page. To copy otherwise, to necessary to understand how fields relate to one another. republish, to post on servers or to redistribute to lists, requires prior specific Bellman collects concise summaries of the values of the fields. permission and/or a fee. ACM SIGMOD '2002 June 4-6, Madison, Wisconsin,USA These summaries allow Bellman to determine whether two Copyright 2002 ACM 1-58113-497-5/02/06 ...$5.00. 240 fields can be joined, and if so the direction of the join (e.g. In this paper, we make the following contributions: one to many, many to many, etc.) and the size of the join result. Even when two fields cannot be joined, Bellman can • We develop several methods for finding related database use the field value summaries to determine whether they fields using small summaries. axe textually similar, and if the text of one field is likely to be contained in another. These questions can be posed * We evaluate the size of the summaries required for as queries on the summarized information, with results re- accuracy. turned in seconds. Using the summaries, Bellman can pose data mining queries such as, • We present new algorithms for mining the structure of a database. • Find all of the joins (primary key, foreign key, or oth- erwise) between this table and any other table in the database. 2. SUMMARIZING VALUES OF A FIELD Our approach to database structure mining is to first col- • Given field F, find all sets of fields {.T} such that the lect summaries of the database. These summaries can be contents of F are likely to be a composite of the con- computed quickly, and represent the relevant features of the tents of .T. database in a small amount of space. Our data mining algo- rithms operate from these summaries, and as a consequence • Given a table T, does it have two (largely) disjoint are fast because the summaries are small. subsets which join to tables T1 and T2 (i.e. is T het- Many of these summaries are quite simple, e.g. the num- erogeneous)? ber of tuples in each table, the number of distinct and the These data mining queries, and many others, can also be number of null values of each field, etc. Other summaries answered from the summaries, and therefore evaluated in are more sophisticated and have significantly more power seconds. This interactive database structure mining allows to reveal the structure of the database. In this section, we the user to discover the structure of the database, enabling present these more sophisticated summaries and the algo- the application of data cleaning, data mining, and schema rithms which use them as input. mapping tools. 2.1 Set Resemblance 1.1 Related Work The resemblance of two sets A and B is p = [AnB[/[AUB[. The database research community has explored some as- The resemblance of two sets is a measure of how similar they pects of the problem of data cleaning [42]. One aspect of this are. These sets are computed from fields of a table by a research addresses the problem of finding duplicate values in query such as A =Select Distinct R.A. For our purposes we a table [22, 32, 37, 38]. More generally, one can perform ap- are more interested in computing size of the intersection of proximate matching, in which joins predicates can be based A and B, which can be computed from the resemblance by on string distance [37, 39, 20]. Our interest is in finding related fields among all fields in the database, rather than [AnBI= p~P (IAI + IBI) (1) performing any particular join. In [8], the authors compare two methods for finding related fields. However these axe Our system profiles the number of distinct values of each crude methods which heavily depend on schema informa- field, so [A[ and IBI is always available. tion. The real significance of the resemblance is that it can be AJAX [14] and ARKTOS [46] axe query systems designed easily estimated. Let H be the universe of values from which to express and optimize data cleaning queries. However, the elements of the sets are drawn, and let h : H ~ .Af map ele- user must first determine the data cleaning process. ments of H uniformly and "randomly" to the set of natural A related line of research is in schema mapping, and espe- numbers .Af. Let s(A) = mina~A(h(a)). Then cially in resolving naming and structural conflicts [3, 27, 41, 36, 21]. While some work has been done to automatically de- Pr[s(A) = s(B)] ----p tect database structure [21, 11, 33, 10, 43, 5], they are aimed That is, the indicator variable I[s(A) = s(B)] is a Bernoulli at mapping particular pairs of fields, rather than summariz- random variable with parameter p.