
Data Profiling Revisited

Felix Naumann∗
Qatar Computing Research Institute (QCRI), Doha, Qatar
[email protected]

ABSTRACT

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.

Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.

1. DATA PROFILING

"Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data."1

∗On leave from Hasso Plattner Institute, Potsdam, Germany ([email protected]).
1Wikipedia on "Data Profiling", 2/2013.

Profiling data is an important and frequent activity of any IT professional and researcher. We can safely assume that any reader of this article has engaged in the activity of data profiling, at least by eye-balling spreadsheets, database tables, XML files, etc. Possibly more advanced techniques were used, such as key-word-searching in data sets, sorting, writing structured queries, or even using dedicated data profiling tools. While the importance of data profiling is undoubtedly high, and while efficient and effective profiling is an enormously difficult challenge, it has yet to be established as a research area in its own right. We focus our discussion on relational data, the predominant format of traditional data profiling methods, but we do regard data profiling for other data models in a separate section.

Data profiling encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies. More advanced techniques detect approximate properties or conditional properties of the data set at hand. To allow focus, the broad field of data mining is deliberately omitted from the discussion here, as justified below.

Obviously, all such discovered metadata refer only to the given data instance and cannot be used to derive with certainty schematic/semantic properties, such as primary keys or foreign key relationships. Figure 1 shows a classification of data profiling tasks. The tasks for "single sources" correspond to the state of the art in tooling and research (see Section 2), while the tasks for "multiple sources" reflect new research directions for data profiling (see Section 5).

[Figure 1: A classification of data profiling tasks. Single source: single-column tasks (cardinalities, patterns and data types, distributions) and multiple-column tasks (uniqueness and keys, inclusion dependencies and foreign keys, functional dependencies, conditional and approximate dependencies). Multiple sources: schematic overlap (schema matching, cross-schema dependencies), data overlap (record linkage, duplicate detection), and topical overlap (topic discovery, topical clustering).]

Systematic data profiling, i.e., profiling beyond the occasional exploratory SQL query or spreadsheet browsing, is usually performed by dedicated tools or components, such as IBM's Information Analyzer, Microsoft's SQL Server Integration Services (SSIS), or Informatica's Data Explorer. Their approaches all follow the same general procedure: A user specifies the data to be profiled and selects the types of metadata to be generated. Next, the tool computes the metadata in batch using SQL queries and/or specialized algorithms. Depending on the volume of the data and the selected profiling results, this step can last minutes to hours. The results are usually displayed in a vast collection of tabs, tables, charts, and other visualizations to be explored by the user. Typically, discoveries are then translated into constraints or rules that are enforced in a subsequent cleansing/integration phase. For instance, after discovering that the most frequent pattern for phone numbers is (ddd) ddd-dddd, this pattern can be promoted to the rule that all phone numbers must be formatted accordingly. Most cleansing tools can then either transform differently formatted numbers or at least mark them as violations.

Use cases for profiling. The need to profile a new or unfamiliar set of data arises in many situations, in general to prepare for some subsequent task.

Query optimization. Basic profiling is performed by most database management systems to support query optimization with statistics about tables and columns. These profiling results can be used to estimate the selectivity of operators and ultimately the cost of a query plan.

Data cleansing. Probably the most typical use case is profiling data to prepare a data cleansing process. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a data set, for instance by determining the number of records that do not conform to previously established constraints.
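The single-column statistics and pattern discovery described above, including the promotion of a most frequent pattern such as (ddd) ddd-dddd, can be sketched in a few lines. This is a toy illustration, not code from any of the tools mentioned; the digit/letter generalization and the sample values are assumptions:

```python
from collections import Counter

def value_pattern(value):
    """Generalize a string: digits -> 'd', letters -> 'a', keep punctuation."""
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c
                   for c in value)

def profile_column(values):
    """Compute basic single-column profiling metadata."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(v) for v in non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "top_pattern": patterns.most_common(1)[0][0],
    }

phones = ["(123) 456-7890", "(555) 123-4567", "555-0199", None]
print(profile_column(phones)["top_pattern"])  # (ddd) ddd-dddd
```

A tool could then suggest enforcing the discovered top pattern as a rule and flag the differently formatted value as a violation.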
Data integration. Often the data sets to be integrated are somewhat unfamiliar, and the integration expert wants to explore the data sets first: How large are they? What data types are needed? What are the semantics of columns and tables? Are there dependencies between tables and among databases, etc.? The abundance of (linked) open data and the desire and vast potential to integrate them with enterprise data have amplified this need.

Scientific data management. The management of data that is gathered during scientific experiments or observations has created additional motivation for efficient and effective data profiling: When importing raw data, e.g., from scientific experiments or extracted from the Web, into a DBMS, it is often useful and necessary to profile the data and then devise an adequate schema.

Data analytics. Almost any statistical analysis or data mining run is preceded by a profiling step to help the analyst understand the data at hand and appropriately configure tools, such as SPSS or Weka. Pyle describes detailed steps of analyzing and subsequently preparing data for data mining [38].

Knowledge about data types, keys, foreign keys, and other constraints supports data modeling, helps keep data consistent, improves query optimization, and reaps all the other benefits of structured data management. Other research efforts have mentioned query formulation, indexing, scientific discovery, and database reverse engineering as further motivation for data profiling [26, 35, 42].

Time to revisit. Recent trends in the database field have added challenges, but also opportunities, for data profiling. First, under the big data umbrella, industry and research have turned their attention to data that they do not own or have not made use of yet. Data profiling can help assess which data might be useful and reveals the yet unknown characteristics of such new data: Before exposing an infrastructure to Twitter's firehose, it might be worthwhile to know about the properties of the data one is receiving; before downloading significant parts of the linked data cloud, some prior sense of the integration effort is needed; before augmenting a warehouse with text mining results, an understanding of their quality is required. Leading researchers have recently noted "If we just have a bunch of data sets in a repository, it is unlikely anyone will ever be able to find, let alone reuse, any of this data. With adequate metadata, there is some hope, but even so, challenges will remain [. . . ]" [4].

Second, much of the data that shall be exploited is of non-traditional type for data profiling, i.e., non-relational (e.g., linked open data), non-structured (e.g., tweets and blogs), and heterogeneous (e.g., open government data). And it is often truly "big", both in terms of schema, rendering algorithms that are exponential in the number of schema elements infeasible, and in terms of data, rendering main-memory based methods infeasible. Existing profiling methods are not adequate to handle that kind of data: Either they do not scale well (e.g., dependency discovery), or there simply are no methods yet (e.g., incremental profiling, profiling multiple data sets, profiling textual attributes).

Third, different and new data management architectures and frameworks have emerged, including distributed systems, key-value stores, multi-core- or main-memory-based servers, column-oriented layouts, streaming input, etc. These new premises provide interesting opportunities, as we discuss later.

Profiling challenges. Data profiling, even in a traditional relational setting, is non-trivial for three reasons: First, the results of data profiling are computationally complex to discover. For instance, discovering key candidates or dependencies usually involves some sorting step for each considered column. Second, the discovery aspect of the profiling task demands the verification of complex constraints on all columns and combinations of columns in a database; thus the solution space of uniqueness-, inclusion dependency-, or functional dependency-discovery is exponential in the number of attributes. Third, profiling is often performed on data sets that may not fit into main memory.

Various tools and algorithms have tackled these challenges in different ways. First, many rely on the capabilities of an underlying DBMS, as many profiling tasks can be expressed as SQL queries. Second, many have developed innovative ways to handle the individual challenges, for instance using indexing schemes, parallel processing, and reusing intermediate results. Third, several methods have been proposed that deliver only approximate results for various profiling tasks, for instance by profiling samples. Finally, users are asked to narrow down the discovery process to certain columns or tables. For instance, there are tools that verify inclusion dependencies on user-suggested pairs of columns, but that cannot automatically check inclusion between all pairs of columns or column sets.

The following section elaborates these traditional data profiling tasks and gives a brief overview of known approaches. Sections 3 – 6 are the main contributions of this article, defining and motivating new research perspectives for data profiling. These areas include interactive profiling (users can act upon profiling results and re-profile efficiently), incremental profiling (profiling results are incrementally updated as new data arrives), profiling heterogeneous data and multiple sources simultaneously, profiling non-relational data (XML and RDF), and profiling on different architectures (column stores, key-value stores, etc.).

This article is not intended to be a survey of existing approaches, though there is certainly a need for such, nor is it a formal framework for future data profiling developments. Rather, it strives to spark interest in this research area and to assemble a wide range of research challenges.
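Many of the basic profiling tasks mentioned above can indeed be expressed as plain SQL. A minimal sketch using Python's built-in sqlite3 module; the table, column names, and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@x.org"), (2, "b@x.org"), (3, None), (3, "b@x.org")])

stats = conn.execute("""
    SELECT COUNT(*),               -- cardinality
           COUNT(email),           -- non-null values
           COUNT(DISTINCT email),  -- distinct values
           COUNT(DISTINCT id)      -- equals COUNT(*) only for a key candidate
    FROM customers""").fetchone()

print(stats)  # (4, 3, 2, 3): id is not a key candidate in this instance
```

Single-column counts like these are cheap; it is the combinatorial tasks, e.g., checking all column pairs for inclusion dependencies, that exceed what a handful of such queries can cover.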
2. STATE OF THE ART

While the introduction mentions current industrial profiling tools, this section discusses current research directions. In its basic form, data profiling is about analyzing data values of a single column, summarized as "traditional data profiling". More advanced techniques detect relationships among columns of one or more tables, which we discuss as "dependency detection". Finally, we distinguish data profiling from the broad field of "data mining", which we deliberately exclude from further discussion.

Traditional data profiling. The most basic form of data profiling is the analysis of individual columns in a given table. Typically, the generated metadata comprises various counts, such as the number of values, the number of unique values, and the number of non-null values. These metadata are often part of the basic statistics gathered by a DBMS. Mannino et al. give a much-cited survey on statistics collection and its relationship to database optimization [32]. In addition to the basic counts, the maximum and minimum values are discovered and the data type is derived (usually restricted to string vs. numeric vs. date). Slightly more advanced techniques create histograms of value distributions, for instance to optimize range queries [37], and identify typical patterns in the data values in the form of regular expressions [40]. Data profiling tools display such results and can suggest some actions, such as declaring a column with only unique values a key candidate or suggesting to enforce the most frequent patterns.

Dependency detection. Dependencies are metadata that describe relationships among columns. The difficulties are twofold: First, pairs of columns or column sets must be regarded, and second, the chance existence of a dependency in the data at hand does not imply that this dependency is meaningful. The most frequent real-world use case is the discovery of foreign keys [30, 41] with the help of inclusion dependencies [6, 33]. Current data profiling tools often avoid checking all combinations of columns, but rather ask the user to suggest a candidate key/foreign-key pair to verify. Another form of dependency, which is also relevant for data cleansing, is the functional dependency (FD). Again, much research has been performed to automatically detect FDs [26, 45].
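As a concrete illustration of dependency checking (not any specific published algorithm), verifying whether an FD A → B holds in a given instance amounts to testing that no two tuples agree on A but disagree on B. A naive sketch, with hypothetical zip/city data:

```python
def fd_holds(rows, lhs, rhs):
    """Return True iff the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        determinant = tuple(row[a] for a in lhs)
        dependent = tuple(row[b] for b in rhs)
        # Violation: same determinant values, different dependent values.
        if seen.setdefault(determinant, dependent) != dependent:
            return False
    return True

rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
]
print(fd_holds(rows, ["zip"], ["city"]))  # True
print(fd_holds(rows + [{"zip": "10115", "city": "Bern"}],
               ["zip"], ["city"]))        # False
```

Checking one candidate FD is linear in the number of tuples; the discovery variant must consider exponentially many column combinations, which is where the cited algorithms invest their effort.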
Both types of dependencies can be relaxed in two ways. First, conditional dependencies need to hold only for tuples that fulfill the condition. Conditional inclusion dependencies (CINDs) were proposed for data cleaning and contextual schema matching [11]. Different aspects of CIND discovery have been addressed in [5, 17, 22, 34]. Conditional functional dependencies (CFDs) were introduced in [20] for data cleaning. Algorithms for discovering CFDs are also proposed in [14, 21]. Second, approximate dependencies need to hold only for a certain percentage of the data – they are not guaranteed to hold for the entire relation. Such dependencies are often discovered using sampling [27] or other summarization techniques [16]. Finally, algorithms for the discovery of columns and column combinations with only unique values (which is, strictly speaking, a constraint and not a dependency) have been proposed in [2, 42].

Data mining. Rahm and Do distinguish data profiling from data mining by the number of columns that are examined: "Data profiling focusses on the instance analysis of individual attributes. [...] Data mining helps discover specific data patterns in large data sets, e.g., relationships holding between several attributes" [39]. Yet, a different distinction is more useful to separate the different use cases: Data profiling gathers technical metadata to support data management, while data mining and data analytics discover non-obvious results to support business management. In this way, data profiling results are information about columns and column sets, while data mining results are information about rows or row sets (clustering, summarization, association rules, etc.).

Of course, such a distinction is not strict. Some data mining technology does express information about columns, such as feature selection methods for sets of values within a column [7] or regression techniques to characterize columns [13]. Yet with the distinction above, we concentrate on data profiling and put aside the broad area of data mining, which has already received unifying treatment in numerous text books and surveys.

To reiterate our motivation: There are various individual techniques for various individual profiling tasks. What is lacking, even for the state of the art, is a unified view of data profiling as a field and a unifying framework of its tasks.

3. INTERACTIVE DATA PROFILING

Data profiling research has hardly yet recognized that data profiling is an inherently user-oriented task. In most cases, the produced metadata is consumed directly by the user, or it is at least regarded by a user before being put to use in some application, such as schema design or data cleansing. We suggest involving the user already during the algorithmic part of data profiling, hence "interactive profiling".

Online profiling. Despite many optimization efforts, data profiling might last longer than a user is willing to wait in front of a screen with nothing to look at. Online profiling shows intermediate results as they are created. However, simply hooking the graphical interface into existing algorithms is usually not sufficient: Data that is sorted by some attribute or has a skewed order yields misleading intermediate results. Solutions might be approximate or sampling-based methods, whose results gracefully improve as more computation is invested. Naturally, such intermediate results do not reflect the properties of the entire data set. Thus, some form of confidence, along with a progress indicator, can be shown to allow an early interpretation of the results.

Apart from entertaining users during computation, an advantage of online profiling is that the user may abort the profiling run altogether. For instance, a user might decide early on that the data set is not interesting (or clean) enough for the task at hand.

Profiling on queries and views. In many cases, data profiling is performed with the purpose of cleaning the data or the schema to some extent, for instance, to be able to insert it into a data warehouse or to integrate it with some other data set. However, each cleansing step changes the data, and thus implicitly also the metadata produced by profiling. In general, after each cleansing step a new profiling run should be performed. For instance, only after cleaning up zip codes does the functional dependence with the city values become apparent. Or only after deduplication does the uniqueness of email addresses reveal itself.

A modern profiling system should be able to allow users to virtually interact with the data and re-compute profiling results. For instance, the profiling system might show a 96% uniqueness for a certain column. The user might recognize that indeed the attribute should be completely unique and is in fact a key. Without performing the actual cleansing, a user might want to virtually declare the column to be a key and re-perform profiling on this virtually cleansed data. Only then might a foreign key for this attribute be recognized.

In short, a user might want to act upon profiling results in an ad-hoc fashion without going through the entire cleansing and profiling loop, but remain within the profiling tool context and perform cleansing and re-profiling only on a virtually cleansed view. When satisfied, the virtual cleansing can of course be materialized. A key enabling technology for this kind of interaction is the ability to efficiently re-perform profiling on slightly changed data, as discussed in the next section. In the same manner, profiling results can be efficiently achieved on query results: While calculating the query result, profiling results can be generated on the side, thus showing a user not only the result itself, but also the nature of that data. Faceted search provides similar features in that a user is presented with cardinalities based on the chosen filters.

For all suggestions above, new algorithms and data structures are needed to enhance the user experience of data profiling.

4. INCREMENTAL DATA PROFILING

A data set is hardly ever fixed: Transactional data is appended to frequently, analytics-oriented data sets experience periodic updates (typically daily), and large data sets available on the web are updated every few weeks or months. Data profiling methods should be able to efficiently handle such moving targets, in particular without re-profiling the entire data set.

Incremental profiling. An obvious, but as yet under-examined extension to data profiling is to re-use earlier profiling results to speed up computation on changed data. That is, the profiling system is provided with a data set and with knowledge of its delta compared to a previous version, and it has stored any intermediate or final profiling results on that previous version. In the simplest cases, profiling metadata can be calculated associatively (e.g., sum, count, equi-width histograms), in some cases some intermediate metadata can help (e.g., sum and count for average, indexes for value patterns), and finally in some cases a complete recalculation might be necessary (e.g., median or clustering).

There is already some research on performing individual profiling tasks incrementally. For instance, the AD-Miner algorithm allows an incremental update of functional dependency information [19]. Fan et al. focus on the area of conditional functional dependencies and also consider incremental updates [20]. The area of data mining, on the other hand, has seen much related work, for instance on association rule mining and other data mining applications [24].

Continuous profiling. While for incremental profiling we assumed periodic updates (or periodic profiling runs), a further use case is to update profiling results while (transactional) data is created or updated. If the profiling results can be expressed as a query, and if they shall be computed only on a temporal window of the data, this use case can be served by data stream management systems [23]. If this is not the case, continuous profiling methods need to be developed, whose results can be displayed in a dashboard. Of particular importance is to find a good tradeoff between recency, accuracy, and resource consumption. Use cases for continuous profiling include internet traffic monitoring or the profiling of incoming search queries.
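The associativity distinction above can be made concrete with a toy summary type that merges a delta into stored statistics without rescanning the base data. This is only a sketch; it keeps a full distinct-value set for simplicity, where a real system would likely use a compact sketch instead:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSummary:
    count: int = 0
    total: float = 0.0
    distinct: set = field(default_factory=set)

    def add(self, delta_values):
        # Associative statistics: a delta is merged in O(|delta|),
        # without re-profiling previously seen data.
        self.count += len(delta_values)
        self.total += sum(delta_values)
        self.distinct.update(delta_values)

    @property
    def mean(self):
        # Derivable from stored intermediate metadata (sum and count).
        return self.total / self.count

summary = ColumnSummary()
summary.add([1, 2, 2])  # initial profiling run
summary.add([3])        # incremental update with a delta
print(summary.count, summary.mean, sorted(summary.distinct))  # 4 2.0 [1, 2, 3]
```

The median, by contrast, cannot be maintained from such summaries and illustrates the cases where a complete recalculation is needed.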
Multi-measure profiling. Each profiling algorithm has its own scheme of running through the data and collecting or aggregating whatever information is needed. Realizing that multiple types of profiling metadata shall be collected, it is likely that many of these runs can be combined. Thus, in a manner similar to multi-query optimization, there is a high potential for efficiency gains, in particular with respect to I/O cost. While such potential is already realized in commercial systems, it has not yet been investigated for the more complex tasks that are not covered by these tools.

5. PROFILING HETEROGENEOUS DATA

While typical profiling tasks assume a single, largely homogeneous database or even only a single table, there are many use cases in which a combined profiling of multiple, heterogeneous data sets is needed. In particular when integrating data, it is useful to learn about the common properties of the participating data sets. From profiling, one can learn about their integrability, i.e., how well their data and schemata fit together, and learn in advance the properties of the integrated data set. Even profiling a single source that stores data for multiple or many domains, such as DBpedia or Freebase, can profit from techniques that profile heterogeneous data.

Degrees of heterogeneity. Heterogeneity in data sets can appear at many different levels and in many different degrees of severity. Data profiling methods can be used to uncover these heterogeneities and possibly provide hints on how to overcome them. Heterogeneity is traditionally divided into syntactic heterogeneity, structural heterogeneity, and semantic heterogeneity [36]. Discovering syntactic heterogeneity, in the context of data profiling, is precisely what traditional profiling aims at, e.g., finding inconsistent formatting. Next, structural heterogeneity appears in the form of unmatched schemata and differently structured information. Such problems are only partly addressed by traditional profiling, e.g., by discovering schema information, such as types, keys, or foreign keys. Finally, semantic heterogeneity addresses the underlying and possibly mismatched meaning of the data. For data profiling, we interpret it as the discovery of semantic overlap of the data and their domain(s).

Data profiling for integration. Our focus here is on profiling tasks to discover structural and semantic heterogeneity, arguing that structural profiling seeks information about the schema and semantic profiling seeks information about the data. Both serve to assess the integrability of data sets, and thus also indicate the necessary integration effort, which is vital to project planning. The integration effort might be expressed in terms of similarity, but also in terms of man-months or in terms of which tools are needed.

An important issue in integrated information systems, irrelevant for single databases, is the schematic similarity, i.e., the degree to which their schemata complement each other and the degree to which they overlap. There is an obvious relation to schema matching techniques, which aim at automatically finding correspondences between schema elements [18]. Already Smith et al. have recognized that schema matching techniques often play the role of profiling tools [43]: Rather than using them to derive schema mappings and perform data transformation, they play roles that have a more informative character, such as assessment of project feasibility or the identification of integration targets. However, the mere matching of schema elements might not suffice as a profiling-for-integration result: Additional information on the structure of the values of the matching columns can provide further details about the integration difficulty.

After determining schematic overlap, a next step is to determine data overlap, i.e., the (estimated) number of real-world objects that are represented in both data sets, or that are represented multiple times in a single data set. Such multiple representations are typically identified using entity matching methods (aka record linkage, entity resolution, duplicate detection, and many other names) [15]. However, estimating the number of matches without actually performing the matching on the entire data set is an open problem. If used to determine the integration effort, it is additionally important to know how diversely such matching records are represented, i.e., how difficult it is to devise good similarity measures and find appropriate thresholds.

Topical profiling. When profiling yet unknown data from a large pool of sources, it is necessary to recognize the topic or domain covered by the source. One recently proposed use case for such source discovery is situational BI, where warehouse data is complemented with data from openly available sources [3, 31]. Examples for such sources are the set of linked open data sources (linkeddata.org) or tables gleaned from the web: "Data on the Web reflects every topic in existence, and topic boundaries are not always clear." [12]

Topical profiling should be able to match a data set to a given set of topics or domains. Given two data sets, it should be able to determine topical overlap between them. There is already initial work on topical profiling for traditional databases in the iDisc system [44], which matches tables to topics or clusters them by topic, and for web data [8], which discovers frequent patterns of concepts and aggregates them to topics.
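A rough first cut at the data-overlap question above, under the (hypothetical) assumption that both sources expose a comparable identifying attribute: compare sampled value sets of the two columns via Jaccard similarity. This is only a sketch of the idea, not one of the cited entity matching methods, and it sidesteps the open estimation problem by sampling:

```python
import random

def jaccard(a, b):
    """Jaccard similarity of two value collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

def estimated_overlap(col1, col2, sample_size=1000, seed=0):
    # Sample both columns and compare value sets; a high score hints
    # at data overlap (or an inclusion dependency) between sources.
    rng = random.Random(seed)
    s1 = rng.sample(col1, min(sample_size, len(col1)))
    s2 = rng.sample(col2, min(sample_size, len(col2)))
    return jaccard(s1, s2)

ids_a = list(range(100))      # e.g., identifiers in source A
ids_b = list(range(50, 150))  # e.g., identifiers in source B
print(estimated_overlap(ids_a, ids_b, sample_size=100))  # 50 shared of 150
```

Exact value overlap is of course only a crude proxy: the hard part noted above is that matching records may be represented diversely, so set equality misses matches that a learned similarity measure would find.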
If in addition some dirt- warrant copying data for the purpose of profiling. iness, i.e., violations to constraints, are to be in- serted, or if conditional dependencies are needed, Storage architectures. Of all modern hardware the task becomes even more daunting. The mea- architectures, columnar storage seems the most sures for (iii) need to be carefully selected, in par- promising for many data profiling tasks, which of- ticular if they are to go beyond traditional mea- ten are inherently column-oriented: Analyzing in- sures of response time and cost efficiency and in- dividual columns for patterns, data types, unique- clude the evaluation of approximate results. Fi- ness, etc. involves reading only the data of that col- nally, the benchmark should be able to evaluate not umn and thus matches precisely the sweet-spot of only entire profiling systems but also methods for columns stores [1]. This advantage may dwindle individual tasks. when analyzing column-combinations, for instance to discover functional dependencies, but even then Types of data. Data comes not only in relational one can avoid reading entire rows of data. form, but also in tree or graph shapes, such as XML As data profiling includes many different tasks and RDF data. A first step is to adapt traditional on many tables and columns, a promising research profiling tasks to those models. An example is Pro- avenue is the use of many cores, GPUs, or dis- LOD, which profiles linked open data delivered as tributed environments for parallelization. Paral- RDF triples [10]. A further challenge arises from lelization can occur at different levels: A compre- the sheer size of many RDF data sets, so profiling hensive profiling run might distribute individual, in- computation must be distributed [9]. In addition, dependent profiling tasks to different nodes (task such data models demand new, data model-specific parallelism). 
Another approach is to partition data profiling tasks, such as maximum tree depth or av- for a single profiling task (data parallelism). As erage node-degree. most profiling tasks are not associative, in the sense Structured data is often intermingled with un- that profiling results for subsets of column-values structured, textual data, for instance in product in- cannot be aggregated to overall results, horizontal formation or user profiles on the web. The field partitioning is usually not useful or at least raises of linguistics knows various measures to character- some coordination overhead. For instance, unique- ize a text from simple measures, such as average ness within each partition of a column does not sentence length, to complex measures, such as vo- imply uniqueness of the entire column, but com- cabulary richness [25] as visualized in [29]. Thus, municating the sets of distinct values is sufficient. data profiling might be extended to text profiling Finally, task parallelism can again be applied to and possibly to methods that jointly profile both finer-grained tasks, such as sorting or hashing, that data and text. A discussion on the large area of form the basic building blocks of many profiling al- text mining is omitted, for the same reasons data gorithms. mining was omitted from this article. Further challenges arise when performing data profiling on key-value stores: Typically, the val- 7. AN OUTLOOK ues contain some structured data, without enforced schemata. Thus, even defining the expected results This article points out the potentials and the on such “soft schema” values is a challenge, and a needs of modern data profiling – there is yet much first step must involve schema profiling as described principled research to do. A planned first step is in Section 5. 
7. AN OUTLOOK

This article points out the potentials and the needs of modern data profiling – there is yet much principled research to do. A planned first step is to develop a general framework for data profiling, which classifies and formalizes profiling tasks, shows its amenability to a range of use cases, and provides a means to compare various techniques, both in their abilities and in their efficiency.

At the same time, this article shall serve as a "call to arms" for database researchers to develop more efficient and more advanced profiling techniques, in particular for the fast-growing areas of "big data" and "linked data", both of which have attracted great interest from industry, but both of which have proven that data is difficult to grasp and use effectively. Data profiling can bridge this gap by showing what the data sets are about, how well they fit the data environment at hand, and what steps are needed to make use of them.

Several research areas were deliberately omitted in this article, in particular data mining and text mining, as reasoned above, but also data visualization: Because data profiling targets users, effectively visualizing the profiling results is of utmost importance. A suggestion for such a visual data profiling tool is the Profiler system [28]. A strong cooperation between the database community, which produces the data and metadata to be visualized, and the visualization community, which enables users to understand and make use of the data, is needed.

Acknowledgments. Discussions and collaboration with Ziawasch Abedjan, Jana Bauckmann, Christoph Böhm, and Frank Kaufer inspired this article.

8. REFERENCES

[1] D. J. Abadi. Column stores for wide and sparse data. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), pages 292–297, Asilomar, CA, 2007.
[2] Z. Abedjan and F. Naumann. Advancing the discovery of unique column combinations. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1565–1570, Glasgow, UK, 2011.
[3] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen. Fusion Cubes: Towards self-service business intelligence. International Journal of Data Warehousing and Mining (IJDWM), in press, 2013.
[4] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, U. Dayal, M. Franklin, J. Gehrke, L. Haas, A. Halevy, J. Han, H. V. Jagadish, A. Labrinidis, S. Madden, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, K. Ross, C. Shahabi, D. Suciu, S. Vaithyanathan, and J. Widom. Challenges and opportunities with Big Data. Technical report, Computing Community Consortium, http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf, 2012.
[5] J. Bauckmann, Z. Abedjan, H. Müller, U. Leser, and F. Naumann. Discovering conditional inclusion dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 2094–2098, Maui, HI, 2012.
[6] J. Bauckmann, U. Leser, F. Naumann, and V. Tietz. Efficiently detecting inclusion dependencies. In Proceedings of the International Conference on Data Engineering (ICDE), pages 1448–1450, Istanbul, Turkey, 2007.
[7] J. Berlin and A. Motro. Database schema matching using machine learning with feature selection. In Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE), pages 452–466, Toronto, Canada, 2002.
[8] C. Böhm, G. Kasneci, and F. Naumann. Latent topics in graph-structured data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 2663–2666, Maui, HI, 2012.
[9] C. Böhm, J. Lorey, and F. Naumann. Creating voiD descriptions for web-scale data. Journal of Web Semantics, 9(3):339–345, 2011.
[10] C. Böhm, F. Naumann, Z. Abedjan, D. Fenz, T. Grütze, D. Hefenbrock, M. Pohl, and D. Sonnabend. Profiling linked open data with ProLOD. In Proceedings of the International Workshop on New Trends in Information Integration (NTII), pages 175–178, Long Beach, CA, 2010.
[11] L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 243–254, Vienna, Austria, 2007.
[12] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. Communications of the ACM, 54(2):72–79, 2011.
[13] S. Chaudhuri, U. Dayal, and V. Ganti. Data management technology for decision support systems. Advances in Computers, 62:293–326, 2004.
[14] F. Chiang and R. J. Miller. Discovering data quality rules. Proceedings of the VLDB Endowment, 1:1166–1177, 2008.
[15] P. Christen. Data Matching. Springer Verlag, Berlin – Heidelberg – New York, 2012.
[16] G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1–294, 2012.
[17] O. Curé. Conditional inclusion dependencies for data cleansing: Discovery and violation detection issues. In Proceedings of the International Workshop on Quality in Databases (QDB), Lyon, France, 2009.
[18] J. Euzenat and P. Shvaiko. Ontology Matching. Springer Verlag, Berlin – Heidelberg – New York, 2007.
[19] S. M. Fakhrahmad, M. H. Sadreddini, and M. Z. Jahromi. AD-Miner: A new incremental method for discovery of minimal approximate dependencies using logical operations. Intelligent Data Analysis, 12(6):607–619, 2008.
[20] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):1–48, 2008.
[21] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(4):683–698, 2011.
[22] L. Golab, F. Korn, and D. Srivastava. Efficient and effective analysis of data quality using pattern tableaux. IEEE Data Engineering Bulletin, 34(3):26–33, 2011.
[23] L. Golab and M. T. Özsu. Data Stream Management. Morgan Claypool Publishers, 2010.
[24] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[25] D. I. Holmes. Authorship attribution. Computers and the Humanities, 28:87–106, 1994.
[26] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. Computer Journal, 42:100–111, 1999.
[27] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 647–658, Paris, France, 2004.
[28] S. Kandel, R. Parikh, A. Paepcke, J. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Proceedings of Advanced Visual Interfaces (AVI), pages 547–554, Capri, Italy, 2012.
[29] D. A. Keim and D. Oelke. Literature fingerprinting: A new method for visual literary analysis. In Proceedings of Visual Analytics Science and Technology (VAST), pages 115–122, Sacramento, CA, 2007.
[30] S. Lopes, J.-M. Petit, and F. Toumani. Discovering interesting inclusion dependencies: application to logical database tuning. Information Systems, 27(1):1–19, 2002.
[31] A. Löser, F. Hueske, and V. Markl. Situational business intelligence. In Proceedings of Business Intelligence for the Real-Time Enterprise (BIRTE), pages 1–11, Auckland, New Zealand, 2008.
[32] M. V. Mannino, P. Chu, and T. Sager. Statistical profile estimation in database systems. ACM Computing Surveys, 20(3):191–221, 1988.
[33] F. D. Marchi, S. Lopes, and J.-M. Petit. Efficient algorithms for mining inclusion dependencies. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 464–476, Prague, Czech Republic, 2002.
[34] F. D. Marchi, S. Lopes, and J.-M. Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53–73, 2009.
[35] V. M. Markowitz and J. A. Makowsky. Identifying extended entity-relationship object structures in relational schemas. IEEE Transactions on Software Engineering, 16(8):777–790, 1990.
[36] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems. Prentice-Hall, 2nd edition, 1999.
[37] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 294–305, Montreal, Canada, 1996.
[38] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[39] E. Rahm and H.-H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[40] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 381–390, Rome, Italy, 2001.
[41] A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, and U. Leser. A machine learning approach to foreign key discovery. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), Providence, RI, 2009.
[42] Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. GORDIAN: Efficient and scalable discovery of composite keys. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 691–702, Seoul, Korea, 2006.
[43] K. P. Smith, M. Morse, P. Mork, M. H. Li, A. Rosenthal, M. D. Allen, and L. Seligman. The role of schema matching in large enterprises. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2009.
[44] W. Wu, B. Reinwald, Y. Sismanis, and R. Manjrekar. Discovering topical structures of databases. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1019–1030, Vancouver, Canada, 2008.
[45] H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 16(2):197–219, 2008.