Why Your Data Won't Mix: Semantic Heterogeneity
Alon Y. Halevy
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195
[email protected]

1. INTRODUCTION

When database schemas for the same domain are developed by independent parties, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity. Semantic heterogeneity also appears in the presence of multiple XML documents, web services and ontologies, or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. In order for multiple data systems to cooperate with each other, they must understand each other's schema. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel.

This article begins by reviewing several common scenarios in which resolving semantic heterogeneity is crucial for building data sharing applications. We then explain why resolving semantic heterogeneity is difficult, and review some recent research and commercial progress in addressing the problem. Finally, we point out the key open problems and opportunities in this area.

2. SCENARIOS OF SEMANTIC HETEROGENEITY

Enterprise Information Integration (EII): Enterprises today are increasingly facing data management challenges that involve accessing and analyzing data residing in multiple sources, such as database systems, legacy systems, ERP systems and XML files and feeds. For example, in order for an enterprise to obtain a “single view of customer”, they must tap into multiple databases. Similarly, to present a unified external view of their data, either to cooperate with a third party or to create an external facing web site, they must access multiple sources. As the electronic marketplace becomes more prevalent, these challenges are becoming bottlenecks in many organizations.

There are many reasons for which data in enterprises resides in multiple sources in what appears to be a haphazard fashion. First, many data systems were developed independently for targeted business needs, but when the business needs changed, data needed to be shared between different parts of the organization. Second, enterprises acquire many data sources as a result of mergers and acquisitions.

Over the years, there have been multiple approaches to addressing enterprise information integration challenges. Until the late 1990s, the two leading approaches were data warehousing and building custom solutions. Data warehousing solutions had the disadvantage of accessing stale data in many cases and not being able to work across enterprise boundaries. Custom code solutions are expensive, hard to maintain, and typically not extensible.

In the late 1990s, several companies offered solutions that queried multiple data sources in real-time. In fact, the term EII is typically used to refer to these solutions. While the users of these systems still see a single schema (whether relational or XML), queries are translated on the fly to appropriate queries over the individual data sources, and results are combined appropriately from partial results obtained from the sources. Consequently, answers returned to the user are always based on fresh data. Interestingly, several of these companies built their products on XML platforms, because the flexibility of XML (and more generally of semi-structured data) made it more appropriate for data integration applications. A recent article [8] surveys some of the challenges faced by this industry. Finally, more recent research has proposed peer-to-peer architectures for sharing data with rich structure and semantics [1].

In any of these data sharing architectures, reconciling semantic heterogeneity is key. No matter whether the query is issued on the fly, or data is loaded into a warehouse, or whether data is shared through web services or in a peer-to-peer fashion, the semantic differences between data sources need to be reconciled. Typically, these differences are reconciled by semantic mappings. These are expressions that specify how to translate data from one data source into another in a way that preserves the semantics of the data, or alternatively, reformulate a query posed on one source into a query on another source. Semantic mappings can be specified in a variety of mechanisms, including SQL queries, XQuery expressions, XSLT scripts, or even Java code.

In practice, the key issue is the amount of effort it takes to specify a semantic mapping. In a typical data integration scenario, over half of the effort (and sometimes up to 80%) is spent on creating the mappings, and the process is labor intensive and error prone. Today, most EII products come with some tool for specifying these mappings, but the tools are completely manual: an expert needs to specify the exact mapping between the two schemas.

Querying and Indexing the Deep Web: The deep web refers to web content that resides in databases and is accessible behind forms. Deep web content is typically not indexed by search engines because the crawlers that these engines employ cannot go past the forms. In a sense, the form can be seen as a (typically small) schema, and unless the crawler can understand the meaning of the fields in the form, it gets stuck there.

The amount and value of content on the deep web are spectacular. By some estimates, there are 1-2 orders of magnitude more content on the deep web than the surface web. Examples of such content range from classified ads in thousands of newspapers across the world, to data in government databases, product databases, university repositories and more.

Here too, the challenge stems from the fact that there is a very wide variety in the way web-site designers model aspects of a given domain. Therefore, it is impossible for designers of web crawlers to assume certain standard form-field names and structures as they crawl. Even in a simple domain such as searching for used cars, the heterogeneity in forms is amazing. Of course, the main challenge comes from the scale of the problem. For example, the web site at www.everyclassified.com, the first site to aggregate content from thousands of form-based sources, includes over 5000 semantic mappings of web forms in the common categories of classified ads. Later in the article, I will describe the ideas which made this web site possible.

It is important to emphasize that accessing the deep web is even more of a challenge for the content providers than it is for the search engines. The content providers thrive on getting users' attention. In the early days of the WWW, any good database would be immediately known (e.g., IMDB for movies). However, the number of such databases today is vast (estimated in the hundreds of thousands), and people do not know about them. Instead, people's searches start from the search box of their favorite engine, and these engines do a very poor job of indexing deep web content. Hence, if I create an excellent database of middle-eastern recipes and put it on the web behind a form, it may remain invisible. Ironically, I'm better off creating a set of web pages with my recipe contents than creating an easily searchable [...]

[...]erties. To sell their products online, the merchants need to send a feed that adheres to the prescribed schema. However, on the back-end, the data at the merchant is stored in their local schema, which is likely quite different from the one prescribed by the retailer (and typically covers a small fragment of that schema). Hence, the problem we face here is creating mappings between thousands of merchants and a growing number of recognized online retailers (roughly 10 of them in the USA at this time). An interesting point to note about this scenario is that there is not necessarily a single correct semantic mapping from the merchant's schema to that of the retailer. Instead, because there are subtle differences between product categories, and products can often be mapped to several categories, there are multiple mappings that may make sense, and the “best” one is the one that ultimately sells more products!

Schema versus Data heterogeneity: Heterogeneity occurs not only in the schema, but also in the actual data values themselves. For example, there may be multiple ways of referring to the same product. Hence, even though you are told that a particular field in a merchant's data maps to ProductName, that may not be enough to resolve multiple references to a single product. As other common examples, there are often multiple ways of referring to companies (e.g., IBM vs. International Business Machines), people's names (that are often incomplete), and addresses. To fully integrate data from multiple sources one needs to handle both the semantic-level heterogeneity and the data-level heterogeneity. Typically, different products have addressed these two parts of the problem in isolation. As one example, several of the products for “global spend analysis” have focused on data-level heterogeneity. This article focuses mostly on schema heterogeneity.

Schema heterogeneity and semi-structured data: I argue that the problem of semantic heterogeneity is exacerbated when we deal with semi-structured data, for several reasons. First, the applications involving semi-structured data are typically ones that involve sharing data among multiple parties, and hence semantic heterogeneity is part of the problem from the start. Second, schemas for semi-structured data are much more flexible, so we are more likely to see variations to the schema.
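
The difference between schema-level and data-level heterogeneity can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions, not a method taken from any product discussed in this article: the merchant field names (item_title, maker, cost_cents), the target fields other than ProductName, and the alias table are all hypothetical. It shows a schema-level mapping that renames and converts fields, followed by a separate data-level step that reconciles two ways of referring to the same company.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical record in a merchant's local schema (field names are assumptions).
merchant_row = {
    "item_title": "  Laptop X ",
    "maker": "International Business Machines",
    "cost_cents": 129900,
}

# Schema-level mapping: target field -> (source field, value transformation).
# Only ProductName comes from the article; the other target names are assumed.
SCHEMA_MAPPING: Dict[str, Tuple[str, Callable[[Any], Any]]] = {
    "ProductName": ("item_title", str.strip),
    "Manufacturer": ("maker", str.strip),
    "PriceUSD": ("cost_cents", lambda cents: cents / 100),
}

# Data-level reconciliation: a hypothetical alias table that maps different
# spellings of a company name to one canonical form.
COMPANY_ALIASES = {
    "international business machines": "IBM",
    "ibm": "IBM",
}


def apply_mapping(row: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a row from the merchant schema into the target schema."""
    return {
        target: transform(row[source])
        for target, (source, transform) in SCHEMA_MAPPING.items()
    }


def canonical_company(name: str) -> str:
    """Return the canonical company name, or the cleaned input if unknown."""
    cleaned = name.strip()
    return COMPANY_ALIASES.get(cleaned.lower(), cleaned)


def translate(row: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the schema-level mapping, then the data-level reconciliation."""
    out = apply_mapping(row)
    out["Manufacturer"] = canonical_company(out["Manufacturer"])
    return out


if __name__ == "__main__":
    print(translate(merchant_row))
```

Keeping the two steps separate mirrors the observation that the two kinds of heterogeneity are usually addressed in isolation: the mapping table alone would leave “International Business Machines” and “IBM” as distinct values even though the fields line up correctly.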