Enterprise Information Integration: Successes, Challenges and Controversies
Alon Y. Halevy∗ (Editor), Naveen Ashish†, Dina Bitton‡, Michael Carey§, Denise Draper¶, Jeff Pollock‖, Arnon Rosenthal∗∗, Vishal Sikka††

∗ University of Washington
† NASA Ames
‡ Callixa
§ BEA Systems
¶ Microsoft Corporation
‖ Network Inference
∗∗ Mitre Corporation
†† SAP

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SIGMOD '05, Baltimore, Maryland, USA. Copyright 200X ACM XXXXXXXXX/XX/XX ...$5.00.

ABSTRACT

The goal of EII systems is to provide uniform access to multiple data sources without first having to load them into a data warehouse. Since the late 1990's, several EII products have appeared in the marketplace and significant experience has been accumulated from fielding such systems. This collection of articles, by individuals who were involved in this industry in various ways, describes some of these experiences and points to the challenges ahead.

1. INTRODUCTORY REMARKS

Alon Halevy
University of Washington & Transformic Inc.

Beginning in the late 1990's, we have been witnessing the budding of a new industry: Enterprise Information Integration (EII). The vision underlying this industry is to provide tools for integrating data from multiple sources without having to first load all the data into a central warehouse. In the research community we have been referring to these as data integration systems.

This collection of articles accompanies a session at the SIGMOD 2005 Conference in which the authors discuss the successes of the EII industry, the challenges that lie ahead of it and the controversies surrounding it. The following few paragraphs serve as an introduction to the issues, admittedly with the perspective of the editor.

Several factors came together at the time to contribute to the development of the EII industry. First, some technologies developed in the research arena had matured to the point that they were ready for commercialization, and several of the teams responsible for these developments started companies (or spun off products from research labs). Second, the needs of data management in organizations had changed: the need to create coherent external web sites required integrating data from multiple sources, and the web-connected world raised the urgency for companies to start communicating with others in various ways. Third, the emergence of XML piqued the appetites of people to share data. Finally, there was a general atmosphere in the late 90's that any idea is worth a try (even good ones!). Importantly, data warehousing solutions were deemed inappropriate for supporting these needs, and the cost of ad-hoc solutions was beginning to become unaffordable.

Broadly speaking, the architectures underlying the products were based on similar principles. A data integration scenario started with identifying the data sources that would participate in the application, and then building a virtual schema (often called a mediated schema), which would be queried by users or applications. Query processing would begin by reformulating a query posed over the virtual schema into queries over the data sources, and then executing it efficiently with an engine that created plans that span multiple data sources and dealt with the limitations and capabilities of each source. Some of the companies coincided with the emergence of XML, and built their systems on an XML data model and query language (XQuery was just starting to be developed at the time). These companies had to address double the problems of the other companies, because the research on efficient query processing and integration for XML was only in its infancy, and hence they did not have a vast literature to draw on.

Some of the first applications in which these systems were fielded successfully were customer-relationship management, where the challenge was to provide the customer-facing worker a global view of a customer whose data resides in multiple sources, and digital dashboards that required tracking information from multiple sources in real time.

As with any new industry, EII has faced many challenges, some of which still impede its growth today. The following are representative ones:

Scaleup and performance: The initial challenge was to convince customers that the idea would work. How could a query processor that accesses the data sources in real time have a chance of providing adequate and predictable performance? In many cases, administrators of (very carefully tuned) data sources would not even consider allowing a query from an external query engine to hit them. In this context EII tools often faced competition from the relatively mature data warehousing tools. To complicate matters, the warehousing tools started emphasizing their real-time capabilities, supposedly removing one of the key advantages of EII over warehousing. The challenge was to explain to potential customers the tradeoffs between the cost of building a warehouse, the cost of a live query and the cost of accessing stale data. Customers want simple formulas they could apply to make their buying decisions, but those are not available.

Horizontal vs. Vertical growth: From a business perspective, an EII company had to decide whether to build a horizontal platform that can be used in any application or to build special tools for a particular vertical. The argument for the vertical approach was that customers care about solving their entire problem, rather than paying for yet another piece of the solution and having to worry about how it integrates with other pieces. The argument for the horizontal approach is the generality of the system and often the inability to decide (in time) which vertical to focus on. The problem boiled down to how to prioritize the scarce resources of a startup company.

Integration with EAI tools and other middleware: To put things mildly, the space of data management middleware products is a very complicated one. Different companies come at related problems from different perspectives and it's often difficult to see exactly which part of the problem a tool is solving. The emergence of EII tools only further complicated the problem. A slightly more mature sector is EAI (Enterprise Application Integration), whose products try to facilitate hooking up applications to talk to each other and thereby support certain workflows. Whereas EAI tends to focus on arbitrary applications, EII focuses on the data and querying it. However, at some point, data needs to be fed into applications, and their output feeds into other data sources. In fact, to query the data one can use an EII tool, but to update the data one typically has to resort to an EAI tool. Hence, the separation between EII and EAI tools may be a temporary one. Other related products include data cleaning tools and reporting and analysis tools, whose integration with EII and EAI stands to see significant improvement.

Meta-data Management and Semantic Heterogeneity: One of the key issues faced in data integration projects is locating and understanding the data to be integrated.

[...] potential, and its positioning still needs to be further refined. I personally believe that the success of the industry will depend to a large extent on delivering useful tools at the higher levels of the information food chain, namely for meta-data management and schema heterogeneity.

2. TOWARDS COST-EFFECTIVE AND SCALABLE INFORMATION INTEGRATION

Naveen Ashish
NASA Ames

EII is an area that I have had involvement with for the past several years. I am currently a researcher at NASA Ames Research Center, where data integration has been and remains one of my primary research and technical interests. At NASA I have been involved both in developing information integration systems in various domains and in building mission-driven applications of use to NASA. The domains have included integration of enterprise information, and integration of aviation safety related data sources. Previously I was a co-developer for some of the core technologies for Fetch Technologies, an information extraction and integration company spun out of my research group at USC/ISI in 2000. My work at NASA has provided the opportunity to look at EII not only from a developer and provider perspective, but also from a consumer perspective in terms of applying EII to the NASA enterprise's information management needs.

A primary concern for EII today is the scalability and economics of data integration. The need for middleware and integration technology is inevitable, more so with the shifting computing paradigms as noted in [10]. Our experience with enterprise data integration applications in the NASA enterprise tells us that traditional schema-centric mediation approaches to data integration problems are often overkill and lead to excessive and unnecessary investment in terms of time, resources and cost. In fact, the investment in schema management per new source integrated and in heavy-weight middleware are reasons why user costs increase directly (linearly) with the user benefit, with the primary investment going to the middleware IT product and service providers. What benefits end users, however, are integration technologies that truly demonstrate economies of scale, with the cost of adding newer sources decreasing significantly as the total number of sources integrated increases.

How is scalability and cost-effectiveness in data integra-