The Hyperview Approach to the Integration of Semistructured Data
Total Page:16
File Type:pdf, Size:1020Kb
The HyperView Approach to the Integration of Semistructured Data Lukas C. Faulstich1 Dissertation am Fachbereich Mathematik und Informatik der Freien Universität Berlin Eingereicht am: 16. November 1999 Verteidigt am: 15. Februar 2000 Gutachter: Prof. Dr. Heinz Schweppe Prof. Dr. Herbert Weber (TU Berlin) Prof. Dr. Hartmut Ehrig (TU Berlin) Betreuer: Prof. Dr. Heinz Schweppe Prof. Dr. Herbert Weber (TU Berlin) Prof. Dr. Hartmut Ehrig (TU Berlin) Dr. Ralf-Detlef Kutsche (TU Berlin) Dr. Gabriele Taentzer (TU Berlin) 1Supported by the German Research Society, Berlin-Brandenburg Graduate School on Distributed Infor- mation Systems (DFG grant no. GRK 316) ii to Myra iii Abstract In order to use the World Wide Web to answer a specific question, one often has to collect and combine information from multiple Web sites. This task is aggravated by the structural and se- mantic heterogeneity of the Web. Virtual Web sites are a promising approach to solve this problem for particular, focused application domains. A virtual Web site is a Web site that serves pages containing concentrated information that has been extracted, homogenized,andcombined from several underlying Web sites. The goal is to save the user from tediously searching and browsing multiple pages at all these sites. The HyperView approach to the integration of semistructured data sources presented in this thesis provides a methodology, a formal framework, and a software environment for building such virtual Web sites. To achieve this kind of integration, data has to be extracted from external Web documents, integrated into a common representation, and then presented to the user in form of Web documents. The HyperView approach treats these three steps of extraction, integration, and presentation uniformly as consecutive views that map between different levels of abstraction. Each of these levels is modeled by schemata and corresponds to an architectural layer of the HyperView System. The HyperView methodology provides a guideline for modeling each of these layers and defining views between them in order to establish a virtual Web site. In HyperView, the contents of Web sites as well as the consecutive results of the views are represented as graphs. A special graph-based data model (CGDM) has been developed to this purpose. The view mechanism of HyperView supports mappings between graphs. Views are de- fined by sets of graph transformation rules. Since it is in general not feasible to materialize views over Web sites in advance, a demand-driven rule activation mechanism has been formally de- scribed and implemented in the HyperView System. This mechanism incrementally materializes views in response to queries issued against them. The HyperView System has been implemented in Prolog. Graph transformation rules are compiled into Prolog predicates that can be executed efficiently. Web documents are loaded into the HyperView System using a standard HTTP client. HyperView based virtual Web sites are supported using the Java servlet technology. The case studies in the fields of Digital Libraries and of Town Information Systems included in this thesis demonstrate the applicability of HyperView for integrating semistructured information sources and for building virtual Web sites. An explorative study shows how the HyperView approach can be applied in the context of the emerging standards related to XML. The main contributions of this thesis are: 1. the key idea of applying the same view mechanism uniformly to solve the problems of extraction, integration, and presentation, 2. the HyperView methodology for modeling and integrating Web sites, 3. the formal framework defining the data model, rule concept, and in particular the demand- driven view materialization mechanism of HyperView, 4. the HyperView System prototype providing a platform for building virtual integrated Web sites 5. the validation of the HyperView methodology and system in the mentioned case studies. In short, this thesis covers the whole problem of building virtual Web sites including method- ology, formal foundation, and software support. iv Contents 1 Introduction 1 1.1Integrationofsemistructuredinformationsources................... 2 1.2 The HyperView approach.................................. 3 1.2.1 DataModelandViewMechanism........................ 3 1.2.2 Architecture..................................... 4 1.2.3 Application of the HyperView Technology . ................. 5 1.3RelatedWork(Overview)................................. 6 1.4Overview........................................... 7 2 HyperView by Example: Wrapping Publisher Web Sites 9 2.1 Digital Libraries of Electronic Journals . ..................... 10 2.1.1 The DARWIN project................................ 10 2.1.2 Usecases....................................... 11 2.2 Modeling publisher Web Sites . ............................ 11 2.2.1 Genericapproach.................................. 11 2.2.2 GraphSchemata.................................. 12 2.2.3 The HyperView DatabaseSchema......................... 12 2.2.4 ACRSchemataofExampleWebSources..................... 13 2.2.5 RepresentingHTMLPagesasHTMLgraphs.................. 14 2.3 Building Views on publisher Web Sites . ..................... 14 2.3.1 QueriesandRules................................. 15 2.3.2 DefiningaViewovertheHTMLGraphs.................... 16 2.3.3 DefiningaViewovertheACRGraphs...................... 18 2.3.4 Querying the HyperView system . ..................... 20 2.4 The Architecture of DARWIN ............................... 21 2.5Summary........................................... 21 3 Formal Framework 23 3.1ClusteredGraphDataModel(CGDM).......................... 24 3.1.1 Motivation...................................... 24 3.1.2 Basicdefinitions.................................. 25 3.1.3 Schemataandinstances.............................. 28 3.2Rules............................................. 30 3.2.1 Ruleapplication.................................. 34 3.3QueriesandOracles.................................... 37 3.3.1 Applying a rule to a virtual data graph ..................... 39 3.3.2 Hyperviews..................................... 42 3.3.3 Usingaruletoanswerasubquery........................ 42 3.3.4 Chainingrulestoansweraquery........................ 43 3.4Reuseofexistingsubgraphs................................ 47 3.5BibliographyonGraph-Transformation......................... 47 3.6Summary........................................... 48 v vi CONTENTS 4TheHyperView System 51 4.1EncodingofGraphs..................................... 52 4.1.1 PlainGraphs.................................... 52 4.1.2 ClusteredGraphs.................................. 52 4.1.3 Typechecking.................................... 53 4.2EncodingofQueries.................................... 53 4.3EncodingofRules...................................... 54 4.4RuleActivation....................................... 56 4.5Queryexecution....................................... 56 4.6ComplexityandPerformance............................... 57 4.7Metadatamanagement................................... 57 4.7.1 Schemaclusters................................... 57 4.7.2 The meta cluster.................................. 58 4.7.3 WWWmetadata.................................. 58 4.8 The HyperView Systemprototype............................. 59 4.9Summary........................................... 60 5TheHVQL Query Language 61 5.1Introduction......................................... 62 5.2BasicNotations....................................... 62 5.3GraphPatterns....................................... 62 5.4GraphLiterals........................................ 63 5.5Queries............................................ 63 5.5.1 Syntax........................................ 63 5.5.2 Semantics...................................... 65 5.5.3 Implementation................................... 66 5.6Rules............................................. 66 5.6.1 Syntax........................................ 67 5.6.2 Semantics...................................... 67 5.6.3 Implementation................................... 68 5.6.4 Example....................................... 68 5.7MetaEdges.......................................... 69 5.8 HTML Edges......................................... 71 5.9 Embedding of HVQL in the HyperView System..................... 71 5.10Summary........................................... 72 6 Support for Web Interfaces 73 6.1Introduction......................................... 74 6.2 Architecture of the HyperView Webserver........................ 74 6.3 Conceptual model of the virtual HyperView Website.................. 76 6.4HTMLCodeGeneration.................................. 77 6.4.1 Phase1:Preparation................................ 77 6.4.2 Phase2:GenerationofaHTMLskeleton.................... 77 6.4.3 Phase 3: HTML dump and generation of variable HTML code . 78 6.4.4 HVQL notationforHTMLrules.......................... 79 6.5 The HyperView Browser.................................. 79 6.5.1 Customization................................... 81 6.6Summary........................................... 82 CONTENTS vii 7 Case Study: Town Information 85 7.1Introduction......................................... 86 7.2Scenario........................................... 86 7.2.1 UseCase....................................... 88 7.3Developingaculturaleventcalendar........................... 88 7.3.1 Conceptualschema................................. 88 7.3.2 Wrappingtowninformationsites......................... 88 7.4TheculturalcalendarWebsite..............................