
Structural Graph-based Metamodel Matching

Dissertation

for the attainment of the academic degree of Doktoringenieur (Dr.-Ing.)

submitted to the Technische Universität Dresden, Faculty of Computer Science

submitted by

Dipl.-Inf. Konrad Voigt, born 21 January 1981 in Berlin

Reviewers: Prof. Dr. rer. nat. habil. Uwe Aßmann (Technische Universität Dresden), Prof. Dr. Jorge Cardoso (Universidade de Coimbra, PT)

Date of defence: Dresden, 2 November 2011

Dresden, December 2011

Abstract

Data integration has been, and still is, a challenge for applications processing multiple heterogeneous data sources. Across the domains of schemas, ontologies, and metamodels, this imposes the need for mapping specifications, i.e. the task of discovering semantic correspondences between elements. Support for the development of such mappings has been researched, producing matching systems that automatically propose mapping suggestions. However, especially in the context of metamodel matching, the result quality of state of the art matching techniques leaves room for improvement. Although the traditional approach of pair-wise element comparison works on smaller data sets, its quadratic complexity leads to poor runtime and memory performance and eventually to the inability to match when applied to real-world data. The work presented in this thesis seeks to address these shortcomings. Thereby, we take advantage of the graph structure of metamodels. Consequently, we derive a planar graph edit distance for metamodel similarity calculation and a mining-based matching to make use of redundant information. We also propose a planar graph-based partitioning to cope with large-scale matching. These techniques are then evaluated using real-world mappings from SAP business integration scenarios and the MDA community. The results demonstrate an improvement in quality as well as manageable runtime and memory consumption for large-scale metamodel matching.

Acknowledgements

This dissertation was conducted at SAP Research Dresden, directed by Dr. Gregor Hackenbroich. I am grateful to SAP Research for financing this work through a pre-doctoral position. Further, I want to express my gratitude to my advisor Prof. Uwe Aßmann as well as to Prof. Jorge Cardoso and Prof. Alexander Schill, who agreed to be my co-advisors. I thank my supervisor Uwe Aßmann for giving me the opportunity to work on this interesting and challenging topic and for all the advice he gave me, and Prof. Alexander Schill for constructive comments on this work. I owe a great debt to Prof. Jorge Cardoso, with whom I had the pleasure to work. Jorge has been the kind of mentor all young researchers should have. I am grateful to Petko Ivanov, Thomas Heinze, Peter Mucha, and Philipp Simon. It was very motivating to discuss with them, and the thesis benefited a lot from their contributions. A big thanks to you students; without you this thesis would have been impossible. Further, I would like to thank the people from the TU Dresden and SAP Research who helped me with their input and discussions. Special thanks go to the ones commenting on my work: Eldad Louw for his perfect English and joy, Dr. Kay Kadner for his support in the TEXO project and magic moments, Eric Peukert for valuable discussions and expert knowledge exchange, Birgit Grammel for reminding me of myself and sharing some of the PhD agonies, Dr. Andreas Rummler for the mentoring, Dr. Roger Kilian-Kehr for critical thoughts, Dr. Karin Fetzer for comments on short notice, Arne for physical and psychological work-outs, and Daniel Michulke for regular lunch meetings. I am grateful to Dr. Gregor Hackenbroich for his support, his valuable and precise comments, and for allowing me to continue my work at SAP Research, and I would also like to thank Annette Fiebig, who supported me not only in administrative issues. I also would like to thank my friends for tolerating and enduring my absence and lust for work. Thank you for still knowing me. Finally, I want to thank my family for encouragement and enjoyable moments. And thank you, Karen, not only for reading countless versions of my thesis and commenting on each of them thoroughly, but also for all your love and trust in me. Thank you.


Contents

1 Introduction
  1.1 Quality Problem in Matching
  1.2 Scalability Problem in Matching
  1.3 Research Questions and Contributions
  1.4 Thesis Outline

2 Background
  2.1 Metamodel Matching
    2.1.1 Metamodel
    2.1.2 Matching
  2.2 Graph Theory
    2.2.1 Definitions
    2.2.2 Metamodel representation
    2.2.3 Graph properties
    2.2.4 Graph matching
    2.2.5 Graph mining
    2.2.6 Graph partitioning and clustering
  2.3 Summary

3 Problem Analysis
  3.1 Motivating Example
    3.1.1 Retail scenario description
    3.1.2 ERP and POS metamodels
    3.1.3 Data integration problems
  3.2 Problem Analysis
    3.2.1 Problems and scope
    3.2.2 Objectives
    3.2.3 Requirements
    3.2.4 Approach
    3.2.5 Research question
  3.3 Summary

4 Related Work
  4.1 Matching Systems
    4.1.1 Schema matching
    4.1.2 Ontology matching
    4.1.3 Metamodel matching
  4.2 Matching Quality
  4.3 Matching Scalability
  4.4 Summary

5 Structural Graph Edit Distance and Graph Mining Matcher
  5.1 Planar Graph Edit Distance Matcher
    5.1.1 Analysis of graph matching algorithms
    5.1.2 Planar graph edit distance algorithm
    5.1.3 Example calculation
    5.1.4 Improvement by k-max degree partial seed matches
  5.2 Graph Mining Matcher
    5.2.1 Analysis of graph mining algorithms
    5.2.2 Graph model for mining based matching
    5.2.3 Design pattern matcher
    5.2.4 Redundancy matcher
  5.3 Summary

6 Planar Graph-based Partitioning for Large-scale Matching
  6.1 Partition-based Matching
  6.2 Planar Graph-based Partitioning
    6.2.1 Analysis of graph partitioning algorithms
    6.2.2 Planar Edge Separator based partitioning
  6.3 Assignment of Partitions for Matching
    6.3.1 Partition similarity
    6.3.2 Assignment algorithms
    6.3.3 Comparison
  6.4 Summary

7 Evaluation
  7.1 Evaluation strategy
  7.2 Evaluation framework: MatchBox
    7.2.1 Processing steps and architecture
    7.2.2 Matching techniques
    7.2.3 Parameters and configuration
  7.3 Evaluation Data Sets
    7.3.1 Data set metrics
    7.3.2 Enterprise service repository mappings
    7.3.3 ATL-zoo mappings
    7.3.4 Summary
  7.4 Evaluation Criteria
  7.5 Results for Graph-based Matching
    7.5.1 Graph edit distance results
    7.5.2 Graph mining results
    7.5.3 Discussion
  7.6 Results for Graph-based Partitioning
    7.6.1 Partition size
    7.6.2 Partition assignment
    7.6.3 Summary
  7.7 Discussion of Results
    7.7.1 Applicability
    7.7.2 Limitations
  7.8 Summary

8 Conclusion
  8.1 Summary
  8.2 Conclusion and Contributions
  8.3 Recommendations for Data Model Development
  8.4 Further Work

A Evaluation Data Import
  A.1 ATL-zoo Data Import
    A.1.1 Import of metamodels
    A.1.2 Import of ATL-transformations
  A.2 ESR Data Import

B MatchBox Architecture
  B.1 Architecture
  B.2 Combination Methods

Bibliography


Chapter 1

Introduction

Data integration has been, and still is, a challenge for applications processing multiple heterogeneous data sources. Across the domains of schemas, ontologies, and metamodels this heterogeneity inevitably imposes the need for mapping specifications. Thereby, a mapping specification requires the task of creating semantic correspondences between elements to integrate multiple data sources.

An industrial example for the integration of multiple data sources is given by the point of sale scenario [134], where data coming from several retail stores needs to be integrated into a central system. A retail store has cashiers processing sales using tills and a local system collecting all data. The data from the sales of products is sent to and aggregated in a central Enterprise Resource Planning (ERP) system [136]. Thereby, the central system and the third-party systems of different stores naturally differ in their internal formats, which may be defined using schemas, ontologies, or metamodels. To integrate the systems a mapping between the concepts of the third-party systems and the central ERP is needed. For instance, two different representations of a purchase order or of customer data have to be mapped onto each other.

Support for the development of such mappings has been researched, producing matching systems that automatically propose mapping suggestions, e.g. in schema matching [126], ontology matching [33] and metamodel matching [98]. In the three matching domains most of the proposed systems claim an overall result of automatically finding nearly complete mappings, e.g. in [94, 25, 37, 38, 32]. These results are, however, only possible when running on a limited data set. When applied to heterogeneous real-world data, our evaluation demonstrates that established state of the art matching techniques find only approximately half of all possible mappings. This observation has also been confirmed by the results of the ontology alignment real-world task [32, 125]. Consequently, in this thesis we identified the challenge of (1) improving the quality of matching results.


The challenge of improving matching quality is not restricted to metamodel matching but also applies to schema and ontology matching. The three domains differ in the way data structures are defined. Schemas allow for a tree-based definition of elements and types. Ontologies and metamodels follow a similar way with a graph-based structure, typed relations, and the notion of inheritance. The difference in structure also concerns matching, which is either tree-based or graph-based, utilizing the structures available.

In this thesis we concentrate on metamodel matching, where the data is defined through metamodels using object-oriented concepts such as classes, attributes, references, etc. We focus on the area of metamodel matching and thus Model Driven Architecture (MDA) [118], since MDA constitutes an industrial standard and specifies graph-based data models. This approach is justified by two reasons: first, the field of metamodel matching is relatively unexplored and provides matching on typed graphs. Second, metamodels are increasingly applied in industry, for instance at SAP [4] throughout several products [135, 9, 143], leading to the necessity for improved matching systems. However, the concepts developed in this thesis are not restricted to metamodels but can also be applied to schema and ontology matching. The general applicability of our approach is demonstrated in our evaluation by applying our concepts to metamodels and schemas.

Besides the necessity of an improvement in matching quality, another challenge is raised by an increase in size and complexity of schemas and metamodels in an enterprise context. As a consequence, the task of matching these large-scale schemas and metamodels for data integration purposes has surpassed the capabilities of most matching systems [125]. The traditional approach of pair-wise element comparison leads to quadratic complexity and thus to runtime and memory issues. This results in the second challenge of (2) missing support for large-scale matching.

The goal of this thesis is to improve the result quality of metamodel matching in terms of correctness and completeness in the context of large-scale matching tasks. The main hypothesis of this thesis is that metamodel matching, and matching in general, can be improved by utilizing the graph structure of metamodels. This hypothesis is confirmed and validated using real-world data, both from the domain of metamodels and from business schemas.

In the following, we will discuss the problems of matching quality and scalability. Based on these problems we will state our research questions and corresponding hypotheses, to conclude with our contributions. This introduction is completed by outlining our thesis chapters.

1.1 Quality Problem in Matching

Metamodel matching aims at the calculation of mappings between two metamodels. These mappings should be presented to a user to support him in the task of mapping specification. Those mappings are finally used to integrate heterogeneous metamodels. Consequently, metamodel matching should reduce the effort as well as the errors in the process of mapping specification. Since the mappings calculated are intended to be used by a domain expert, often in an enterprise context, it is essential that those mappings are, as far as possible, correct and complete. Correctness and completeness define the quality of metamodel matching. If the mappings calculated show a low correctness and completeness, metamodel matching may even impose an additional burden on a user applying it. Therefore, quality is one of the main concerns of metamodel matching.

One reason why, on average, only about half of all mappings are found and correct is the insufficient amount of information contained in a metamodel, i.e. its expressiveness. A metamodel defines object-oriented structures and thus explicit information like packages, classes, relations, attributes, etc. But a metamodel does not contain implicit information such as the meaning of terms, codes, preconditions, etc. This additional knowledge is typically only known to domain experts.

An improvement of quality can be achieved by taking additional information into account. This can be generally available information, domain-specific knowledge, configurations, or pre-processing and post-processing. For instance, the work of Garces [45] proposes a domain-specific language to incorporate external knowledge in matching processes. Some systems, e.g. [28], try to capture this additional knowledge in external representations to improve matching quality. The additional knowledge is external, because metamodels do not require a user to specify that information, thus it is not part of the metamodel. Another approach is to reuse existing mappings, for instance by using transitivity for mapping calculation [23] or by coverage analysis [132]. However, each of these approaches requires external or domain-specific knowledge, which cannot be taken for granted.

Another reason for deficits in matching quality is the existence of redundant information. This is due to the fact that one element may be mapped onto multiple occurrences of another, thus producing multiple mappings instead of one. This redundant information produces misleading mappings, which may present more incorrect mappings to a user.

To summarize, we identified the problem of (1) insufficient correctness and completeness of matching results.

1.2 Scalability Problem in Matching

The problem of scalability naturally arises in the context of large-scale data, as confirmed by Rahm [125]. Especially in industry, metamodels easily exceed the size of 1,000 elements and may contain even more than 8,000 elements. For instance, the aforementioned point of sale scenario involves metamodels of such a size, because data about stores, customers, transactions, billing, etc. has to be stored [134]. Matching two metamodels with 8,000 elements each results in 64,000,000 comparisons to be made and stored in memory for a combination. Even with state of the art approaches this leads to issues of high memory consumption and a runtime overhead.

The memory and runtime problems result in unacceptable system applicability and usability. High memory consumption potentially leads to an unresponsive system and in the worst case to an abortion or crash of the matching system. The runtime problem becomes especially important in the case of an interactive scenario. If a user wants to obtain metamodel matching results for a given matching task of two metamodels, he typically wants to receive system feedback in seconds or, better, milliseconds. With large-scale metamodels the system response time increases quadratically to minutes or even hours.

Only some systems tackle this problem by limiting the matching context with light-weight matching techniques [10]. However, these techniques reduce the result quality significantly and do not tackle the memory problem for metamodels of arbitrary size. Another approach in the area of schema matching is to reduce the number of comparisons but not the schema size, e.g. in COMA++ [25] and aFlood [54]. These approaches apply a partitioning on the input, splitting it into smaller subparts, and define the comparisons to be made based on these parts. However, the parts are not matched independently, thus the runtime is decreased but not the memory consumption. Hence, the approaches do not solve the problems of memory consumption and unmatchable metamodels. In addition, the approaches [25, 54] are limited to schemas or ontologies, hence they are not easily applicable to metamodels. We derive the second core problem of our thesis: (2) missing support for matching large-scale metamodels.
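To make the quadratic growth concrete, the following minimal sketch (ours, not part of any matching system cited here) shows the naive pair-wise comparison loop and the resulting n · m similarity matrix that has to be kept in memory:

```python
# Illustrative sketch of naive pair-wise matching; all names are ours.
def naive_match(source_elements, target_elements, similarity):
    """Compare every source element with every target element."""
    scores = {}
    for s in source_elements:        # n iterations
        for t in target_elements:    # m iterations per source element
            scores[(s, t)] = similarity(s, t)
    return scores                    # n * m entries held in memory

# Two metamodels with 8,000 elements each:
print(8_000 * 8_000)  # 64000000 comparisons (and matrix entries)
```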

1.3 Research Questions and Contributions

Our main objectives are to improve quality and increase scalability for metamodel matching. A solution using information additional to metamodels is not a viable choice, because this information is generally not available or requires an additional effort by a domain expert to produce. Therefore, we focus on information inherent to metamodels, namely the structural graph information of a metamodel.

Structural graph information has been utilized for fix-point based similarity propagation in schema matching [107], but not for a comparison and similarity calculation solely based on structure. The reason lies in the significant computational complexity of the problem: the calculation of an edge-preserving mapping between two graphs is known to be NP-complete [47]. However, the complexity of (sub-)graph isomorphism calculation can be reduced by restricting a general graph structure to specific graph classes that obey certain graph properties; an observation that will prove useful when investigating our research question:

Research question 1. "How can metamodel graph structure be used to improve the correctness, completeness, runtime, and memory consumption of metamodel matching?"

The identification of general (sub-)graph isomorphism is NP-complete, therefore the complexity must be reduced. It is known that a restriction of the input graph to special classes of graphs reduces the complexity of the isomorphism calculation. The problem is to identify the classes which retain as much information as possible, thus the class which is the closest to the general graph. A well-known restricting class are trees. Trees are used for path computations, indices, matching [162], etc. However, a tree represents only a smaller part of the original complete graph. Especially in the context of matching this imposes a drawback in terms of quality, because less context information is available for matching.

Another, more general class of graphs, known since 1930, are planar graphs. Planar graphs are graphs which can be drawn in a plane without intersecting edges. They have been formally characterized by Kuratowski's Theorem [90]. These graphs allow for a reduced polynomial complexity of several general graph problems and hence address the NP-problem. Recently they have gained interest (among others) in fingerprint classification through the work of Neuhaus [113] and in general graph theory through Aleksandrov [3]. Due to their results we opt for planarity as a promising property for metamodel matching. Of course not every metamodel is a planar graph per se, but each metamodel can be transformed into a planar graph in logarithmic time [11] by removing a minimal number of edges. In the case of removed edges the original and the planar isomorphism problem are not equivalent, because the removed edges are not considered for the isomorphism calculation. For the special class of trees the same inequivalence applies. Representing metamodels as planar graphs offers advantages over trees: the number of edges removed when transforming a metamodel into a planar graph is considerably lower than for a transformation into a tree.¹ We formulate our first hypothesis in trying to answer our research question:

Hypothesis H1. Subgraph isomorphism calculation based on planar metamodel graphs improves correctness and completeness of metamodel matching.
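To illustrate the planarity property underlying this hypothesis, the following sketch tests small graphs for planarity; it assumes the networkx library and is not the planarization procedure developed later in this thesis:

```python
import networkx as nx

# A small graph that can be drawn without crossing edges.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")])
is_planar, embedding = nx.check_planarity(G)
print(is_planar)  # True

# K5, the complete graph on five vertices, is non-planar by
# Kuratowski's Theorem.
print(nx.check_planarity(nx.complete_graph(5))[0])  # False
```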
The second reason for a deficit in the result quality of previous approaches is, as mentioned, redundant information, which produces misleading mappings. To tackle this problem we aim to identify and process such redundant information to increase the overall matching result quality. Graph theory provides established algorithms used to discover reoccurring patterns, so-called graph-mining algorithms [14]. Applying and adapting those techniques to identify the redundant information is in our opinion an area worth exploring, which leads to our next hypothesis:

Hypothesis H2. Mining for reoccurring patterns on metamodel graphs improves correctness and completeness of metamodel matching.

Furthermore, the scalability problem we identified has to be tackled in order to improve memory consumption as well as to reduce runtime. To achieve this, the matching process has to reduce the number of comparisons and contexts, that is, the size of the metamodels to be matched. A common approach in graph theory is partitioning, that is the separation of a graph into independent (unconnected) subgraphs of similar size [117]. Since metamodels are graphs and we investigate planarity, we formulate the following hypothesis:

Hypothesis H3. Partitioning of planar metamodel graphs and partition-based matching improves support for and enables arbitrary large-scale metamodel matching.

Thereby, "support" refers to a reduction in memory consumption and runtime on a local machine. The improved support also includes matching of metamodels of arbitrary size, because partitioning enables an independent matching of maximal-sized partitions in a distributed environment. The three hypotheses presented will be validated throughout our thesis. This validation results in the following four contributions for structural graph-based metamodel matching:

C1 We propose a planar graph edit distance algorithm for the improvement of matching quality by efficiently calculating graph structure similarity for metamodels.

C2 We suggest two matching algorithms based on graph-pattern mining for detecting (1) design patterns and (2) redundantly modelled information for metamodel similarity calculation.

¹ Our results show an average of 1% of edges removed for planarity in contrast to 19% for trees, see Sect. 7.3.

C3 To tackle the large-scale matching problem, we propose to use planar graph-based partitioning for metamodel matching. Our approach divides the input graph into subgraphs, i.e. partitions. The calculation of partition pairs is then investigated using four partition assignment algorithms. The partitioning and partition assignment reduce the memory consumption and runtime of a matching system.

C4 We perform a comprehensive real-world evaluation incorporating 31 large-scale mappings from SAP business integration scenarios as well as 20 mappings from the MDA community. These data sets form our gold standards, i.e. they are compared with the mappings obtained by our matching algorithms.

The results of our evaluation validate our claims. That means we demonstrate the effectiveness, i.e. improvements in correctness and completeness, and efficiency, i.e. decreases in runtime and memory consumption, of our solution for graph-based metamodel matching.

1.4 Thesis Outline

We structure our thesis as follows: Chap. 2 describes the foundations of graph-based metamodel matching, introducing metamodel matching techniques, the basics of graph theory, graph matching, graph mining, and graph partitioning. Additionally, it also defines the graph properties reducibility and planarity. Our problem analysis is given in Chap. 3, performing a root-cause analysis for large-scale metamodel matching. The problem analysis concludes with our requirements and derives our research question. Related approaches to the identified problems of matching quality and scalability are presented in Chap. 4. Thereby, we provide an overview of the state of the art of matching techniques as well as strategies for large-scale matching. Chapter 5 presents our approach to improving the matching quality by graph-based matching utilizing planarity and redundant information. Chapter 6 presents our algorithm for graph-based partitioning that tackles the scalability problem in matching. In Chap. 7 we validate our results with our graph-based matching framework MatchBox and real-world data. The data of our comprehensive evaluation stems from the MDA community as well as from business message mappings within SAP. Using this data we validate our algorithms w.r.t. the quality and scalability improvements. Based on the results obtained we discuss the applicability and limitations of our algorithms. Finally, we summarize and conclude this thesis in Chap. 8, giving recommendations for matching-oriented data model development and pointing out directions for further research.

Chapter 2

Background

Since our work addresses metamodel matching employing structural graph-based approaches, this chapter will introduce the fundamental areas of metamodel matching and graph theory. We give a definition of metamodel, matching, and basic graph theory concepts. The foundations of structural matching are presented by an overview of the state of the art in graph matching and graph mining. We also discuss the state of the art in graph partitioning for the purpose of large-scale matching.

2.1 Metamodel Matching

Metamodel matching is the discovery of semantic correspondences between metamodels, i.e. the matching of metamodel elements. In the following subsections we will define both terms, metamodel and matching.

2.1.1 Metamodel

A metamodel is "the shared structure, syntax, and semantics of technology and tool frameworks". This definition is given by the OMG in the Meta Object Facility (MOF) [120] specification. An interpretation is that a metamodel is a prescriptive specification of a domain with the main goal of the specification of a language for metadata. This language allows to efficiently develop domain-specific solutions based on the domain specification. This view is shared by several authors such as [6, 51, 97]. Consequently, a metamodel consists of (1) abstract syntax and (2) static semantics. The (1) abstract syntax specifies the modelling elements available. The (2) static semantics define well-formedness constraints, thus defining which model elements are allowed to be composed. Model elements are used to specify a metamodel, which itself describes a set of valid instances, the models. These relations are called the three-layered architecture of metamodels [118] and are depicted in Fig. 2.1.

[Figure 2.1: MOF three layer architecture and example — M3: meta-metamodel (MOF), M2: metamodel (UML), M1: model (UML class diagram), connected by instanceof relations]

The instanceof relation connects the different layers M1–M3, thus each element of a lower layer is an instance of an element of the upper one. A meta-metamodel on M3 defines metamodels on M2, where each of these metamodels defines models on M1. On the right-hand side of Figure 2.1 an example is given. MOF defines the elements available to define UML, where on M2 the UML metamodel defines which elements are available for class diagrams. Finally, on M1 a concrete class diagram can be modelled.

The constructs which MOF provides for the definition of metamodels are object-oriented constructs. A metamodel can be defined using the two main elements: packages and classes. A package is the main container for classes, separating metamodels into modules. A class represents a type and can be instantiated. It can contain any number of attributes, references, and operations. An attribute itself has a type acting as a means for specifying values of a class' instance. Relationships between classes are represented by references and associations. An association is a binary relation between two classes, whereas a reference acts as a pointer on the associations. MOF also supports the notion of inheritance as a relation between two classes. MOF provides a range of primitive types, e.g. string or integer, and it also provides the possibility of defining custom data types. A special data type is the enumeration, which allows to specify a range of values of an attribute.

The classes and other object-oriented elements are used to define a metamodel, for instance UML [119], BPMN [115] or SysML [116]. A Java-based implementation of MOF is the Eclipse Modeling Framework (EMF) [142]. It provides the same concepts for modelling as MOF but extends them by a Java-specific type system. We use EMF as the implementation and language for expressing and matching metamodels.

[Figure 2.2: Generic matcher receiving two elements as input (source element e_s, target element e_t) and calculating a corresponding similarity value in [0, 1] as output]

The definition of a metamodel by the OMG or EMF is not formal. MOF is defined verbally, whereas EMF is defined by its implementation. A precise and formal definition of a metamodel can be given by adopting a schema matching algebra [161]. This definition complements the classification of state of the art matching techniques. The schema matching algebra defines a schema in a generic way, thus being technical space independent¹. We adopt this definition of schema as follows:

Definition 1. (Metamodel) A metamodel M is described by a signature S = (E, R, L, F) where

• E = {e1, e2, ..., en} is a finite set of elements,

• R = {r1, r2, ..., rn} | r ⊆ E × E × E is the finite set of relations between elements,

• L = {l1, l2, ..., ln} is a finite, constant set of labels,

• F = {f1, f2, ..., fn} | f : E × E × E → L is the finite set of functions mapping from elements to labels.

According to this definition a metamodel comprises elements. In the case of MOF these elements are class, package, reference, attribute, operation, and enumeration. The relations in the case of MOF are containments, inheritance, and associations in general. The names or values of the elements and relations are labels. The definition also covers schemas, with the elements: element, attribute, and type. The relations in the case of schemas are limited to containment and types.
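As an illustration only, Definition 1 can be encoded directly as a data structure. The sketch below is ours (not part of MOF, EMF, or the cited algebra) and reads the third component of a relation triple as the relation's kind, which is one possible interpretation:

```python
from dataclasses import dataclass, field

@dataclass
class Signature:
    elements: set                 # E: finite set of elements
    relations: set                # R: triples over elements (here: source, target, kind)
    labels: set                   # L: finite, constant set of labels
    labelling: dict = field(default_factory=dict)  # F: relations -> labels

# A two-element metamodel fragment: a class containing an attribute.
sig = Signature(
    elements={"Order", "id"},
    relations={("Order", "id", "containment")},
    labels={"Order", "id", "containment"},
    labelling={("Order", "id", "containment"): "containment"},
)
```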

2.1.2 Matching

Matching is the discovery of semantic correspondences between metamodel elements, that is, individuals and relations. The match operator is defined as operating on two metamodels; its output is a mapping between elements of these metamodels. Following the definition by Rahm and Bernstein [7] we define the match operator as follows:

¹ For a definition of technical space see [91].

[Figure 2.3: Architecture and process of a generic matching system — metamodel 1 and metamodel 2 are imported into an internal data model, processed by several matchers, and the matcher results are combined into an output mapping]

Definition 2. (Match) The match operator is a function operating on two input metamodels M1 and M2. The function's output is a set of mappings between the elements of M1 and M2. Each mapping specifies that a set of elements of M1 corresponds (matches) to a set of elements of M2. The semantics of a correspondence can be described by an expression attached to the mapping.

The match operator is realized by a matching system as described in the following.

2.1.2.1 Architecture of a matching system

A generic representation of a parallel matching system (e.g. [23, 24, 151]) and its components is depicted in Figure 2.3. On the left-hand side two input metamodels are given; they are then processed by the matching system (match operator) and an output mapping is created. The matching system is separated into components as follows:

1. Metamodel import – transforms a metamodel into the matching system's internal data model

2. Matcher – calculates a similarity value between all pairs of elements

3. Combination – combines the matcher results to create an output mapping

1. Metamodel import The import component transforms a given metamodel into a matching system's internal model. Thereby, some systems apply pre-processing steps, e.g. [64, 95]. That means they exploit properties of the input metamodels to adjust the subsequent matching process. For instance, the weights of name-based techniques are adjusted if major differences in the element names of the two metamodels are detected.

2. Matcher A matcher calculates a semantic correspondence (match) between two elements. Unfortunately, it has been noted in several publications, e.g. [126, 25], that there is no precise mathematical way of denoting a correct match between two elements. This is due to the fact that metamodels (as well as schemas and ontologies) contain insufficient information to precisely define the semantics of elements. Therefore, implementations of the match operator have to rely on heuristics approximating the notion of a correct mapping. A match is realized by a matching technique, which incorporates information such as labels, structure, types, external resources, etc. We define a matching technique as follows:

Definition 3. (Matching Technique) A matching technique is a function mapping input metamodel elements onto a value between 0 and 1; fm : E × E → [0, 1] ⊂ R with (es, et) → [0, 1]. This value represents the confidence defined by the function.

An implementation of a matching technique is a matcher and is therefore defined as:

Definition 4. (Matcher) A matcher is an implementation of a matching tech- nique.
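A minimal sketch of Definitions 3 and 4: a matching technique as a function from an element pair to a confidence value in [0, 1], and a matcher as its (here trivial) implementation. The function and element names are illustrative, not taken from any system discussed in this thesis:

```python
def name_equality_matcher(source_name: str, target_name: str) -> float:
    """Trivial matcher: confidence 1.0 for names equal up to case, else 0.0."""
    return 1.0 if source_name.lower() == target_name.lower() else 0.0

print(name_equality_matcher("Order", "order"))          # 1.0
print(name_equality_matcher("Order", "PurchaseOrder"))  # 0.0
```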

Figure 2.2 presents an abstract representation of a matcher. It depicts two given input metamodel elements (along with their corresponding context) which are processed by a matcher. Thereby, a matcher makes use of a particular matching technique to derive a similarity. In the subsequent Section 2.1.2.2 we present a classification of and details on matching techniques.

3. Combination The combination component aggregates the results of all matchers and finally selects the output matches as mappings. Thereby, the most common way is to employ different strategies to achieve the aggregation of the matcher results [8, 25, 124, 151]. Common strategies are to average the separate results or to follow a weighted approach. Further examples are the minimum, maximum or similarity flooding [107] strategies. The aggregation can be followed by a selection which, for instance, applies a threshold for the similarity value of matches to be considered for the output mapping.
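The following sketch illustrates one possible combination step: a weighted average over per-matcher similarity tables followed by threshold-based selection. Weights, threshold, and element names are illustrative assumptions, not the configuration of any cited system:

```python
def combine(matcher_results, weights, threshold=0.5):
    """Aggregate dicts mapping element pairs to similarities, then select."""
    aggregated = {}
    for results, weight in zip(matcher_results, weights):
        for pair, sim in results.items():
            aggregated[pair] = aggregated.get(pair, 0.0) + weight * sim
    # Selection: keep only pairs whose aggregated similarity passes the threshold.
    return {pair: sim for pair, sim in aggregated.items() if sim >= threshold}

name_sim = {("Order", "PurchaseOrder"): 0.6, ("Order", "Customer"): 0.1}
type_sim = {("Order", "PurchaseOrder"): 0.8, ("Order", "Customer"): 0.3}
print(combine([name_sim, type_sim], weights=[0.5, 0.5]))
# {('Order', 'PurchaseOrder'): 0.7}
```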

Types of matching systems The matching system depicted in Figure 2.3 implies a parallel execution of matchers, which is not obligatory. Indeed, there are three types of matching systems, namely:

• Parallel matching systems,

• Sequential matching systems, 14 Background

• Hybrid matching systems.

Parallel matching systems, e.g. [25], apply each matcher independently to the input metamodels. The matchers are executed in parallel and their results are aggregated. This approach is also followed by MatchBox [151], our proposed system for metamodel matching. In contrast, a sequential matching system, e.g. [37, 38], applies matchers one after another, i.e. a matcher's result serves as input for the following one. This allows for an incremental refinement of matching results but may worsen an existing error. Finally, hybrid systems are also possible; for instance, [64] uses fix-point calculations by incrementally executing parallel matchers and feeding their results back as input to the same matchers.

Hybrid matching systems have been generalized in meta-matching systems [123]. These systems are actually composition systems for matchers. They allow a user to specify the matcher interaction and combination to be applied. Matchers are combined via operators that have an order. This allows, for instance, for an intersection or union of matcher results, and thus of matching techniques. In the following section we will classify and detail these matching techniques.

2.1.2.2 Matching techniques

Several matching techniques have been proposed during the last decades, originating from the areas of database schema matching, ontology matching, and metamodel matching. For a common understanding of these matching techniques and the self-containment of this thesis we provide an overview of them. The most popular classification of matching techniques has been proposed by Rahm and Bernstein in 2001 [126]. It has been refined and adopted by Shvaiko in 2007 [33], presenting a more complete and up-to-date view on matching techniques. We decided to combine the classifications of Rahm and Shvaiko into one, as outlined in [147]. The combined classification has been developed with respect to the information used for matching, e.g. names (labels) or relations. Our classification of matching techniques is given in Figure 2.4. We call the classification adopted because we removed the class of matching techniques relying on upper-level formal ontologies, since it is actually a special form of reuse. We also removed the class of language-based techniques because it actually defines a specialisation of the existing class of string-based techniques. Furthermore, we refined the class of graph-based techniques, thus extending the classification.

[Figure 2.4: Classification of matching techniques — element-level: syntactic (string-based, constraint-based) and external (linguistic resources, mapping reuse); structure-level: syntactic (graph-based with local, global, and region variants operating on general graphs or trees; taxonomy-based) and external (repository of structures, logic-based)]

As can be seen, there are two types of classes: element-level and structure-level matching. The types differentiate between techniques operating on elements and their properties and techniques using relations between the elements and thus the structure. Both classes are described in detail in the two following sections. Thereby, every technique class will be refined and examples for corresponding matching systems are given.

2.1.2.3 Element-level techniques

Element-level matching techniques make use of information available as properties of elements. In the context of metamodels and our algebraic definition, an element is an individual, thus a class, an attribute, a package, an operation, or an enumeration. An element's label is used for matching. This covers labels such as names, documentation or data types.

String-based String-based techniques cover similarity calculation using string information. Relevant string information includes an element's name, but also metadata such as documentation, annotation, etc. The techniques can be divided into the following four classes:

• Prefix-based calculation uses a common prefix as the basis for a heuristic to derive a similarity value.

• Suffix-based calculation is similar to prefix-based calculation but uses a suffix instead.

• Edit-distance-based calculation aims at calculating the number of edit operations necessary to transform one string into another. The more operations needed, the less similar two given names are. The most popular approach is the Levenshtein distance [153].

• N-gram calculation targets the linguistic similarity of model elements. The element labels are split into n-character sized tokens (n-grams). The similarity is then computed as the total count of equal character sequences of size n compared to the overall number of n-grams. The resulting ratio is the string similarity.

String-based matching techniques are used by several matching systems in the form of a name matcher [24, 37, 38, 98, 107, 151] or derivations thereof.
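As an illustration of the n-gram technique from the list above, the following sketch computes a Dice-style trigram similarity; this is one common formulation, not necessarily the exact ratio used by the cited systems:

```python
def ngrams(label: str, n: int = 3) -> set:
    """Split a label into its n-character tokens (n-grams)."""
    return {label[i:i + n] for i in range(len(label) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    grams_a, grams_b = ngrams(a.lower(), n), ngrams(b.lower(), n)
    if not grams_a or not grams_b:
        return 0.0
    shared = len(grams_a & grams_b)
    # Ratio of shared n-grams to the overall number of n-grams.
    return 2.0 * shared / (len(grams_a) + len(grams_b))

print(round(ngram_similarity("PurchaseOrder", "PurchOrder"), 2))  # 0.63
```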

Constraint-based Constraint-based matching techniques use information which defines a certain constraint on an element. Constraints include data types, keys, or cardinalities. Constraint-based techniques follow the rationale that two elements having similar constraints should be similar. Two main classes can be separated:

• Data types are used to derive a similarity of elements based on the data types' similarity. For simple types such as integer or float a static type conversion table can be used (a minimal sketch follows this list). For complex types such as structures etc. more advanced techniques have to be applied.

• Multiplicity can be used to derive similarity. For instance, similar intervals of data types indicate a certain similarity.
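The static type conversion table mentioned above can be sketched as a simple symmetric lookup; all similarity values below are illustrative assumptions:

```python
# Keys are stored in sorted order so that lookups are symmetric.
_TYPE_TABLE = {
    ("int", "int"): 1.0,
    ("int", "long"): 0.9,
    ("float", "int"): 0.6,
    ("string", "string"): 1.0,
}

def type_similarity(a: str, b: str) -> float:
    key = tuple(sorted((a, b)))
    return _TYPE_TABLE.get(key, 0.0)  # unknown combinations count as dissimilar

print(type_similarity("long", "int"))  # 0.9
```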

Linguistic resources Linguistic resources are used by matching techniques relying on external sources. These external sources can be dictionaries, a common-knowledge thesaurus, or a domain-specific dictionary. An example for a domain-specific dictionary is a code list, encoding terms in a code as used by SAP [29]. Another popular example is WordNet [39], a publicly available dictionary used for matching.

Mapping reuse Mapping reuse techniques take advantage of mappings already calculated. A prerequisite is a storage for mappings which contains all mappings and the corresponding metamodels in order to reuse these mappings. The simplest approach uses transitivity as an indicator for similarity, i.e. if an element A maps onto an element B, and B maps onto an element C, then one may conclude that A maps onto C. Another approach is to use existing matching techniques to derive a similarity between elements to be mapped and already mapped ones, to reuse the knowledge of their mappings. An example for mapping reuse is COMA [24], which uses fragments (in fact, complex types) to derive mappings for the elements referencing those fragments [23].

[Figure 2.5: Example graph for (a) global, (b) local, and (c) region-based matching context; grey highlights the elements used for matching]

2.1.2.4 Structure-level techniques

Structure-level matching techniques follow the rationale "structure matters", which is grounded in the theory of meaning [35]. Thereby, it is noted that relations between elements and their positions are similar for similar elements. This structure, as encoded in relations, e.g. containment or inheritance, can be used to match different elements. An important aspect of relationship matching techniques is the kind of graph they operate on: in the context of matching, two classes are of interest, a general graph and a tree. The following matching techniques can be applied on both. However, a general graph contains more information, whereas a tree allows for optimized algorithms reducing complexity, especially in terms of runtime. We distinguish four classes of structure-level matching techniques as depicted in Fig. 2.5²: global graph-based, local graph-based, region graph-based, and taxonomy-based matching.

(a) Global graph-based Global graph-based matching uses the complete graph, in contrast to local graph-based matching, which only investigates relative elements, e.g. parent elements. Global graph-based matching techniques are either exact or inexact. Exact algorithms describe a mapping from a vertex (element) onto another vertex as well as a mapping for edges. Subgraph isomorphism algorithms are exact algorithms. In contrast, inexact algorithms allow for an error-tolerant approach since vertices can be removed or relabelled.

• Exact algorithms, e.g. subgraph isomorphism algorithms, calculate a mapping between two metamodel graphs. The result of an exact algorithm is a mapping for each element and relation of one metamodel onto an element or relation of the other metamodel, if and only if they have the same type.

² A circle represents an element where an edge represents a relation, as defined in the convention of Sect. 2.2.2.

• Inexact algorithms such as the graph edit distance or maximum common subgraph algorithms apply a sequence of edit operations, composed of: add, remove, and relabel (rename). A sequence of such operations defines a mapping from one graph onto another, thus calculating the maximal common subgraph along with the operations necessary.

Global graph-based techniques have not been investigated in depth so far. However, there are selected related results, e.g. a tree-based edit distance approach by Zhang et al. [159], a simplified maximum common subgraph by Le and Kuntz [92], and an edit distance approach using expectation maximization by Doshi and Thomas [27].

(b) Local graph-based Local graph-based matching techniques make use of the context of an element, i.e. the relation of this element to its neighbours in a metamodel's graph. Traditional local graph-based matching techniques operate on a tree. Therefore, they use the children, leaf, sibling, and parent relationships, relative to a given element. An extension of these techniques is to generalize from a graph's spanning tree and use the neighbours in the graph for matching. Examples for local graph-based techniques are the children, leaf, sibling, and parent matchers in [24, 151]. For a description of those see Sect. 7.2 in our evaluation.

(c) Region graph-based Region graph-based techniques make use of regions within a graph, i.e. subgraphs of the complete graph. These subgraphs are studied regarding their occurrences in the two metamodel graphs and regarding their frequency, i.e. how often they occur in the complete graph. This frequency can be used to derive a similarity between the subgraphs' elements. For instance, subgraphs sharing a high frequency are more similar. In contrast to local techniques, the context of region techniques is not restricted to a specific kind of relationship, since a frequency is determined. Examples of region graph-based techniques are the graph mining matcher in Sect. 5.2 and the filtered context matcher of COMA++ [25].

Taxonomy-based Taxonomy-based matching techniques operate on the special taxonomy graph, in contrast to the general relationship graph. The techniques used for taxonomies are specialized in making use of the tree structure, for instance name path matching and aggregation via super- or subconcept rules (parent-child relations).

Repository of structures The approach of a repository of structures is similar to mapping reuse. A repository contains the mappings, the corresponding metamodels, and coefficients denoting similarities between the metamodels. The storage of similarities allows for a faster retrieval of mappings for a given metamodel. The coefficients are metrics such as structure name, root name, maximal path length, etc. These numbers act as an index for a set of metamodels, which allows for an efficient retrieval.

Logic-based Logic-based matching techniques make use of additional constraints defined on metamodels. This covers conditions defined over the metamodels as well as conditions applied to the metamodels. The matching is based on constraints in a logic language, or performed via post-processing by adding reasoned mappings. For instance, if a mapping between attributes exists, then a mapping between the containing classes has to exist, because attributes need a containing element. Adding this mapping is an example of logic-based matching.

2.2 Graph Theory

In this section we introduce basic terms such as graph and labelled graph. The basic terms are followed by a discussion of metamodel graph representations. Subsequently, we define special graph properties which are useful for matching and partitioning, and provide the foundations of the fields of graph matching, graph mining, and graph partitioning.

2.2.1 Definitions

Graphs are structures originating from the field of mathematics. They are a collection of vertices and edges, where the edges connect the vertices, thus establishing a pair-wise relation. The first known study of graph theory is Leonhard Euler's 1736 work on the "Seven Bridges of Königsberg" problem [31]. This work has been refined further and is applied in many areas of today's computer science, e.g. in path finding problems, layouting, search computing, query optimization, etc. A graph is defined as follows:

Definition 5. (Graph) A graph G consists of two sets V and E, G = (V,E). V is the set of vertices and E ⊆ V × V is the set of edges.

A graph is called undirected iff the edge set is symmetric, i.e. with e1 = (v1, v2) also e2 = (v2, v1) is in E. Otherwise, the vertex pairs defining an edge are ordered and the graph is called directed. A graph is finite if the set of vertices is finite. A graph comprising an infinite set of vertices is infinite. Figure 2.6 (a) depicts an example for a graph, showing the vertices (circles) being connected by edges (lines). An example of a directed graph is given in Fig. 2.6 (b), adding to each edge a direction indicated by an arrow.

[Figure 2.6: Example for (a) a graph, (b) a directed graph, and (c) a directed labelled graph]

Definition 6. (Number of edges/vertices) The number of vertices n is defined as n = |V|. The number of edges m is defined as m = |E|.

In addition, each vertex and edge of a graph may have a label. A label may represent a colour, type, weight or name of a vertex or edge.

Definition 7. (Labelled Graph) A labelled graph G is defined as a graph and two labelling functions fe : E → Le and fv : V → Lv that map edges and vertices onto edge labels and vertex labels, respectively.

Labelled graphs are also called attributed graphs. Whenever referring to a graph in this work we refer to a labelled, undirected, finite graph. An example for a directed labelled graph is presented in Fig. 2.6 (c). Each vertex has an assigned label, in our example a name.
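For illustration, a labelled, undirected, finite graph as in Definitions 5–7 can be built as follows, assuming the networkx library; the attribute names are our choice:

```python
import networkx as nx

G = nx.Graph()                        # undirected, finite
G.add_node("A", label="A")            # f_v: vertex -> vertex label
G.add_node("B", label="B")
G.add_edge("A", "B", label="bInA")    # f_e: edge -> edge label

print(G.number_of_nodes(), G.number_of_edges())   # n = |V|, m = |E|
print(G.nodes["A"]["label"], G.edges["A", "B"]["label"])
```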

2.2.2 Metamodel representation

Ehrig et al. show in their work [30] that metamodels are equivalent to labelled, directed, finite graphs (attributed typed graphs extended by inheritance). That means for each metamodel a graph exists which has the same expressiveness and allows for the same transformations (graph operations). Even though we could treat metamodels as graphs per se, we base our observations on a metamodel's mapping onto a graph to explicitly discuss the representations of relations, because we want to use the structure for matching. The first step towards a metamodel graph mapping is to separate vertex and edge mappings, where:

• A vertex mapping specifies which elements of a metamodel are represented as vertices of the metamodel's graph,

• An edge mapping defines which metamodel elements relate these vertices.

[Figure 2.7: Package, class, attribute, and operation mapping onto a vertex]

These mappings are based on the graphical representation of a metamodel as defined in [120] or, similarly, in the Unified Modeling Language (UML) [121].

2.2.2.1 Graph-based representation

Vertex mapping The elements which are mapped onto vertices are: package, class, attribute, enumeration, operation, and data type. In the metamodel's graphical representation they are represented as boxes or parts of boxes. Figure 2.7 depicts the correspondence between the elements package, class, attribute, and operation and corresponding vertices. In the context of a labelled graph each vertex is labelled according to an element's name. An additional labelling is the type information, which can also be represented in a graph.

Edge mapping Edges of a graph express relations between vertices. Consequently, in a mapping between metamodels and graphs, edges may represent relationships such as inheritance, reference, and containment. These relations are also represented as edges within a metamodel's graph. The mappings of these relations can be defined as follows:

1. Inheritance can be represented explicitly by edges representing the inheritance relation or implicitly via copying all inherited members into the corresponding subclasses.

2. References and containment can be mapped onto separate edge types.

It is important to realize that the mapping between a metamodel and a graph is not unique due to different representations of inheritance relations. For example, Fig. 2.8 depicts an example metamodel and three different graph representations. The metamodel captures common scenarios, where four classes A, B, C, and D are connected by references or containments. A is related to B via bInA, B is related to D via dInB, C is contained in A via the relation cInA, and D is contained in C by dInC. Furthermore, A and D are related by an inheritance, i.e. D is a subclass of A.

[Figure 2.8: (a) Inheritance-based, (b) reference-based, and (c) complete graph representation of a metamodel]

The first graph (a) is the inheritance-based graph, where only inheritance relations are represented as edges. According to the vertex mapping the package P and the classes A, B, C, and D are mapped onto vertices. Vertices A and D are connected by a dotted line representing the inheritance. Furthermore, all vertices are connected with the package vertex to guarantee the reachability of all vertices, thus there is an edge from each vertex to P. The reference-based representation (b) is similar to (a), but represents the implicit containment of a package and its classes by edges. Further, A is connected to all vertices B, C, and D, and B and C are connected to D. Finally, the complete graph (c) is a combination of both representations: reference- and inheritance-based.

2.2.2.2 Tree-based representation

Matching techniques such as a parent, children, or leaf matcher³ operate on a tree, because this representation allows for efficient processing of the matcher logic. A tree is a special class of a graph, where each vertex takes part in a parent-child relationship and the graph has a special root vertex. Formally, we define a tree as:

Definition 8. (Tree) A graph G is called a tree, if it has no simple cycle and |E| = |V| − 1.

A simple cycle is a closed path in which no vertex other than the start and end vertex is traversed twice. The number of edges relates to the number of vertices, because every vertex except the root has exactly one parent. A metamodel only defines one trivial explicit tree. In a metamodel all elements are contained in a package and each package may contain sub-packages. This containment forms an explicit tree consisting of a package as the root vertex and the contained elements as its children. The leaves of the tree are the attributes of the classes. However, there are also containment relationships defined via references between classes (marked as containment relations). These relations are also part of the overall containment and should be considered in a tree as well. All of the containment references may form a graph, thus one needs to define a proper tree. In the following the tree definition of a metamodel's graph is discussed. Please note that the vertex mapping is analogous to the mapping onto a graph.

³ For a detailed explanation of these matchers please see Sect. 7.2 in our evaluation.

[Figure 2.9: (a) Flattened containment, (b) flattened references, and (c) containment tree representations of a metamodel]

Edge mapping The problem consists of a mapping from a graph onto a tree. Several approaches to this problem can be found in the literature. The closest one in mapping a metamodel onto a tree is the Minimal Spanning Tree (MST) of a metamodel. An MST is a subgraph of a graph connecting all vertices, thereby forming a tree with the sum of the costs of all edges being minimal. An established algorithm for determining the MST of a graph is Kruskal's algorithm [84]. The mapping onto a tree structure starts with a package and the elements contained. In order to compute the MST, first all references and all containment references are handled as edges of a graph. An example of a metamodel mapping is depicted in Fig. 2.9, with three possible MSTs based on the three types of relations. In Fig. 2.9 (a) the resulting tree is formed by the flattened containment hierarchy, therefore C is contained in A, and D in C, whereas B follows the containment in the package P. The flattened reference import is depicted in Fig. 2.9 (b), having the path A, B, D, because the algorithm takes the left-side path first (assuming these elements have been created first). Since we are calculating the MST, no element will have multiple occurrences in the resulting tree.
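A sketch of the MST computation for an example like Fig. 2.9, assuming networkx (whose minimum_spanning_tree uses Kruskal's algorithm by default); the edge weights, which prefer containment over plain references, are our illustrative assumption:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("P", "A", weight=1)  # package containment
G.add_edge("P", "B", weight=1)
G.add_edge("A", "C", weight=1)  # containment reference cInA
G.add_edge("C", "D", weight=1)  # containment reference dInC
G.add_edge("A", "B", weight=2)  # reference bInA
G.add_edge("B", "D", weight=2)  # reference dInB

mst = nx.minimum_spanning_tree(G)  # Kruskal's algorithm by default
print(sorted(mst.edges()))
# The four weight-1 edges already span all vertices, so the plain
# reference edges are dropped and each element occurs exactly once.
```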

Besides the problem of defining a proper tree structure, inheritance has to be dealt with. A common approach ignores the semantics of inheritance and discards this information. Alternatively, a metamodel can be flattened, i.e. all attributes and relations of a super class are copied into its subclasses. This approach preserves the information defined by inheritance, but loses the connection between super and subclasses. In Fig. 2.9 (a) and (b) a flattened import is shown based on the reference and containment hierarchy, whereas (c) shows a non-flattened import, because D does not contain the references inherited from A.
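A minimal sketch of this flattening step, assuming a metamodel is given as a plain dictionary mapping each class to its superclass and its own attributes (the class and attribute names are hypothetical):

    # Hypothetical metamodel: class name -> (superclass or None, own attributes).
    classes = {
        "A": (None, ["id", "name"]),
        "C": ("A", ["price"]),
        "D": ("C", ["date"]),
    }

    def flatten(cls):
        """Copy all inherited attributes into cls (cf. Fig. 2.9 (a)/(b))."""
        superclass, attributes = classes[cls]
        if superclass is None:
            return list(attributes)
        # Inherited attributes are copied in; the super/sub link itself is lost.
        return flatten(superclass) + list(attributes)

    print(flatten("D"))  # ['id', 'name', 'price', 'date']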

2.2.3 Graph properties

In this section we introduce the graph properties reducibility and planarity. Reducibility is used by our redundancy matcher and planarity in our planar graph-based edit distance matcher as well as in the planar graph-based partitioning.

2.2.3.1 Reducibility

Reducibility describes the behaviour of a graph under edge contraction. Edge contraction defines the merging of vertices and their edges. If under edge contraction the graph can be contracted into a single vertex, the graph is called reducible. Formally, edge contraction is defined as:

Definition 9. (Edge contraction) Let the graph G = (V,E) contain an edge e = (u,v) with u ≠ v. Let f be a function which maps every vertex in V \ {u,v} to itself, and {u,v} to a new vertex w.

• The contraction of e results in a new graph G′ = (V′,E′), where V′ = (V \ {u,v}) ∪ {w}, E′ = E \ {e}, and

• For every x ∈ V, x′ = f(x) ∈ V′ is incident to an edge e′ ∈ E′ if and only if the corresponding edge e ∈ E is incident to x in G.

The foundation of reducibility are graph minors, which are based on the principle of edge contraction. A minor of a graph is defined as follows:

Definition 10. (Graph minor) An undirected graph H is called a minor of the graph G if H is isomorphic to a graph obtained by zero or more edge contractions on a subgraph of G.

A minor H of G thus results from applying a sequence of edge contraction operations to a subgraph of G. An edge contraction is applied on a particular edge by removing it and merging its incident vertices. This operation can be applied on a set of edges in any order. Figure 2.10 shows a minor of a graph along with the edge contractions necessary. The graph in (b) is a minor of the graph in (a), because a sequence of edge contraction operations can be defined that leads to a subgraph of (a).


Figure 2.10: Example of a graph, a minor of this graph, and the corresponding edge contraction

These operations are shown in Fig. 2.10 (c), where three vertices are labelled to explain the operations. First, the edge connecting the vertices a and c is removed and both vertices are merged. The result is a multi-edge between a and b. Second, one edge of the multi-edge is removed and both vertices are merged. The remaining multi-edge becomes a self-edge of a. Finally, the self-edge is deleted. The resulting graph H is a minor of G. This sequence of edge contractions has also been used in the area of compilers [2]. In this field the analysis of control flow graphs requires graph (tree) operations such as edge contraction, too. Thereby, use is made of two special transformations, T1 and T2. Applying them on a graph in any order defines the reducibility of a graph:

Definition 11. (Reducibility) Let G = (V,E) be a directed graph on which the following transformations are defined:

• T1: Remove a self-edge (v,v) in G.

• T2: If (v,w) is the only edge entering w and w ≠ v, delete w. Replace all edges (w,x) by a new edge (v,x) and additionally all edges (x,w) by an edge (x,v).

A graph is called reducible if and only if it can be transformed into a single vertex by applying the two transformations T1 and T2. Otherwise the graph is called irreducible.
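The two transformations translate directly into code. The following sketch (our own illustration, assuming the directed graph is given as a successor dictionary) applies T1 and T2 until no rule fires any more and reports whether a single vertex remains:

    def is_reducible(successors):
        """successors: dict mapping each vertex to the set of its successors."""
        succ = {v: set(ws) for v, ws in successors.items()}
        changed = True
        while changed:
            changed = False
            for v in succ:  # T1: remove self-edges
                if v in succ[v]:
                    succ[v].discard(v)
                    changed = True
            for w in list(succ):  # T2: merge w into its unique predecessor v
                preds = [v for v in succ if w in succ[v]]
                if len(preds) == 1 and preds[0] != w:
                    v = preds[0]
                    succ[v].discard(w)
                    succ[v] |= succ.pop(w)  # redirect edges (w, x) to (v, x)
                    changed = True
                    break  # the vertex set changed, so restart the scan
        return len(succ) == 1

    # A simple loop with entry s: s -> a, a -> b, b -> a and b -> x.
    print(is_reducible({"s": {"a"}, "a": {"b"}, "b": {"a", "x"}, "x": set()}))  # True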

2.2.3.2 Planarity

The planarity property of a graph defines when a graph is called planar. It has not been considered for metamodel matching so far. However, planarity allows for interesting applications, since graph algorithms requiring planarity can reduce some NP-complete problems to a polynomial, mostly quadratic, runtime when applied on planar graphs.


Figure 2.11: Example of a graph, its planar embedding, and non-planar extension


Figure 2.12: The non-planar graphs K5 and K3,3

For instance, identifying graph similarity by subgraph isomorphism calculation on general graphs is known to be NP-complete. If the input graphs are restricted to be planar, the problem can be solved in almost quadratic time. We make use of this property for our planar graph edit distance matcher (Chap. 5) and our planar partitioning (Chap. 6). An illustrative description of planarity is: a graph is planar if a drawing of it exists in which none of the graph's edges intersect. That means a graph is planar if it can be embedded into a plane without intersecting edges. This process creates a planar embedding. Figure 2.11 depicts an example of a graph (a) and the corresponding planar embedding (b). As can be seen, the example graph is planar. However, if the graph is extended as shown by the dashed line in (c), the graph becomes non-planar, because there is no drawing without intersecting edges. The graph including the dashed line in Fig. 2.11 (c) is a special graph called K5, which is one of the two basic non-planar graphs. The other basic graph is called K3,3 and consists of six vertices, arranged in two rows of three vertices each, with edges between every pair of vertices from different rows. Figure 2.12 depicts both graphs.


Figure 2.13: Subset relation between general graphs, planar graphs, and trees

These graphs are essential for the formal definition of planar graphs, which was given in 1930 by Kuratowski's theorem [90]. It states:

Theorem 1. A finite graph is planar if and only if it does not contain a subgraph that is a subdivision of K5 (the complete graph on five vertices) or K3,3 (the complete bipartite graph on six vertices, three of which connect to each of the other three).

Thereby, a subdivision is the result of a vertex insertion in an edge. That means an edge is split into two edges via a new vertex; the two new edges still connect the original vertices via the new vertex. Instead of using subdivisions, their counterpart, minors, can be used for defining planarity. Wagner's conjecture [152] was presented in 1937 and proved in 2004 by Robertson and Seymour [131]. It states:

Theorem 2. (Planar) A finite graph is planar if and only if it does not include K5 or K3,3 as a minor.

According to the definition of a minor, which is a contraction of vertices and their edges, the theorem states that the graph must not be contractible to either one of the non-planar graphs K5 and K3,3. Consequently, every tree is also a planar graph. The relation between general graphs, planar graphs, and trees is depicted in Fig. 2.13. The set notation shows the subset relation between the special classes of graphs. Each tree is also a planar graph, and each planar graph (and tree) is naturally a general graph, whereas not every graph is planar or a tree. The question arises whether metamodels are planar and, if not, how to make them planar. Both theorems are not suited for implementing a planarity check, but there are algorithms performing the check with linear complexity. The most popular one will be described, followed by an algorithm to make metamodels planar.

2.2.3.3 Planarity check for metamodels

Metamodels are not planar per se, therefore a planarity check is needed. The planarity check is well-known in graph theory, so in this thesis we selected an established algorithm by Hopcroft and Tarjan [61], because it has the lowest runtime complexity (O(n)). The idea of the algorithm is to divide a graph into bi-connected components (cf. [20], page 11), which are tested for their planarity. Hopcroft and Tarjan have shown that planarity of all components results in planarity of the complete graph. The test for each component is done by detecting cycles in them. Each cycle is arranged as a path, where non-cycle edges have to be arranged on the left-hand or right-hand side. Both groups have to contain non-interlacing edges. If for a given edge no group can be found without violating the non-interlacing property, the graph is non-planar. For further details refer to [61].
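Linear-time planarity tests are available off the shelf. The following usage sketch relies on the networkx library, whose check_planarity function implements the (likewise linear) left-right planarity test rather than Hopcroft and Tarjan's algorithm; the graphs are toy examples.

    import networkx as nx

    K5 = nx.complete_graph(5)            # the smallest non-planar graph
    is_planar, _ = nx.check_planarity(K5)
    print(is_planar)                     # False

    K4 = nx.complete_graph(4)            # K4 is still planar
    is_planar, embedding = nx.check_planarity(K4)
    print(is_planar)                     # True; 'embedding' is a planar embedding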

2.2.3.4 Maximal planar subgraph for metamodels

If a planarity check fails, a metamodel needs to be made planar in order to take advantage of the planarity property. The planarity property can be established by removing vertices or edges. An approach exists that solves the problem such that the number of remaining edges is maximal. That means that if an edge that has previously been removed were re-added, the graph would be non-planar. This algorithm has been proposed by Cai et al. [11], resulting in a complexity of O(m log n) (m is the number of edges and n the number of vertices). The idea of the algorithm is to recursively compute planar subgraphs of all successors of a particular edge. Afterwards, the subgraphs are combined into one graph by deleting planarity-violating edges. The approach by Cai et al. tests every edge whereas Hopcroft and Tarjan consider all paths. Furthermore, the algorithm by Cai et al. uses a so-called attachment, which is a set of blocks grouping non-interlacing non-cycle edges. The attachments are used to determine the edges to be removed by testing for planarity. Finally, the attachments are recursively merged to construct the maximal planar subgraph. For further details please refer to [11]. The algorithms allow checking any given metamodel for planarity and performing planarisation if necessary. The overall complexity of both operations is O(m log n) (m is the number of edges and n the number of vertices).
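To illustrate the notion of edge maximality (this is not Cai et al.'s O(m log n) algorithm, only a naive greedy stand-in), edges can be added one at a time and kept only if the graph stays planar; the sketch reuses the networkx planarity check from above.

    import networkx as nx

    def maximal_planar_subgraph(nodes, edges):
        """Greedily keep every edge that does not destroy planarity."""
        planar = nx.Graph()
        planar.add_nodes_from(nodes)
        removed = []
        for u, v in edges:
            planar.add_edge(u, v)
            if not nx.check_planarity(planar)[0]:
                planar.remove_edge(u, v)  # re-adding it would break planarity
                removed.append((u, v))
        return planar, removed

    K5 = nx.complete_graph(5)
    planar, removed = maximal_planar_subgraph(K5.nodes, K5.edges)
    print(len(removed))  # 1: K5 minus a single edge is planar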

2.2.4 Graph matching

Graph matching is the similarity calculation of two input graphs. It has first been treated as a mathematical problem but was adopted in several application domains such as pattern recognition and computer vision, computer-aided design, image processing, graph grammars, graph transformation, and bio computing [145].


Figure 2.14: Classification of graph matching algorithms adopted from [145]

Conceptually, graph matching can be seen as the decision problem whether a graph H contains a subgraph isomorphic to a graph G, i.e. whether both graphs share similarities. This condition can be relaxed to identifying a subgraph contained in both graphs and even to graphs within a certain distance. Unfortunately, the problem of finding a graph isomorphism for general graphs belongs to NP, and it is open whether it is NP-complete or in P [127]. Therefore, approximations or restrictions on graph properties can be used to cope with the complexity problem. To solve the aforementioned questions, so-called graph matching algorithms have been proposed.

2.2.4.1 Overview of graph matching algorithms

Graph matching algorithms can be divided into two classes: exact and inexact algorithms. Exact algorithms aim at a subgraph calculation whereas inexact algorithms are error-tolerant and allow certain distances between the subgraphs. Figure 2.14 depicts a classification provided by [145]. The subsequent paragraphs deal with the exact, i.e. graph and subgraph isomorphism algorithms, and inexact algorithms, i.e. graph edit distance, maximum common subgraph and combinatorial methods, in detail.

Exact matching Exact matching algorithms aim at calculating, for two given input graphs, a one-to-one mapping between their vertices and edges. The resulting mapping is a graph isomorphism, which is defined as follows:

Definition 12. ((Sub-)Graph Isomorphism) Given a graph Gs = (Vs,Es) as source and Gt = (Vt,Et) as target, a graph isomorphism is defined by a function f : Vs → Vt such that for every edge es = (v,w) ∈ Es also et = (f(v), f(w)) ∈ Et. If |Vs| < |Vt|, then f is called a subgraph isomorphism.

The standard approach on subgraph isomorphism identification is based on backtracking and has been proposed by Ullmann [144] in 1976. Despite being widely referenced, the algorithm lacks scalability. That means it is only applicable for a rather small number of vertices in the graphs compared (less than 20). Our experiments showed that even at a size of 15 elements for metamodels to be matched the runtime rises to minutes. This is due to a search space explosion for graph isomorphism calculation, because every possible edge/vertex combination needs to be investigated.
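Later backtracking algorithms such as VF2 prune this search space more aggressively in practice. A minimal usage sketch with networkx's VF2 implementation (the host graph and the triangle pattern are hypothetical):

    import networkx as nx
    from networkx.algorithms import isomorphism

    host = nx.Graph([("P", "A"), ("P", "B"), ("A", "B"), ("B", "C")])
    pattern = nx.Graph([(1, 2), (2, 3), (1, 3)])  # a triangle

    matcher = isomorphism.GraphMatcher(host, pattern)
    print(matcher.subgraph_is_isomorphic())   # True: P, A, B form a triangle
    for mapping in matcher.subgraph_isomorphisms_iter():
        print(mapping)                        # e.g. {'P': 1, 'A': 2, 'B': 3}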

Inexact matching Inexact matching approaches relax the edge preservation condition of graph isomorphism identification. That means edge mappings are not necessarily required if vertex mappings have been found. One approach on inexact matching is defined by the graph edit distance, proposed in 1983 by Sanfeliu [133]. We define the graph edit distance as follows:

Definition 13. (Graph Edit Distance) Let Gs and Gt be two graphs. The Graph Edit Distance (GED) is defined as a finite sequence of edit operations leading to an isomorphism between Gs and Gt. The edit operations are addition, deletion, or relabelling of a vertex or edge.

Several approaches for a calculation of the graph edit distance have been proposed; a survey of graph edit distance algorithms can be found in [44]. Still, the problem of graph edit distance calculation for general graphs remains NP-complete [44], because every vertex of the source graph can be mapped on every vertex of the target graph with different edit distances. Therefore, general edit distance algorithms are not applicable for large graphs [110], but again approximate or input-restricting algorithms can be applied.
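networkx ships an exact graph edit distance that illustrates Definition 13 on small inputs; the two labelled toy graphs below are hypothetical, and the comment on the expected value assumes the default unit costs.

    import networkx as nx

    source = nx.Graph([("Customer", "Address"), ("Customer", "Order")])
    target = nx.Graph([("CUSTM", "ADDR")])

    # Exact GED: minimal number of vertex/edge additions, deletions, and
    # relabellings. Exponential in the worst case, hence small graphs only.
    distance = nx.graph_edit_distance(source, target)
    print(distance)  # 2.0 with unit costs: delete one vertex and one edge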

2.2.5 Graph mining

The essential task of graph mining algorithms is to discover frequent subgraphs (patterns) in one or more graphs. Mining algorithms have been used in the domains of bioinformatics for the discovery of frequent chemical fragments and classification [103], in VLSI reverse engineering [89], and in general for tasks of deriving association rules due to similar patterns [14]. These applications have in common that they can be reduced to the problem of finding recurring subgraphs. We define a frequent subgraph as follows.

Definition 14. (Frequent Subgraph) Given a subgraph S = (V′,E′) of a graph G = (V,E) with V′ ⊆ V and E′ ⊆ E, and a function f : G → N. The frequency of S is defined by f(S). If f(S) > t, t ∈ N, then S is called frequent.

A subgraph has to occur more than t times in a graph to be called frequent. We define such frequent subgraphs as patterns.

Definition 15. (Pattern) A pattern P is a frequent subgraph.


Figure 2.15: Example of a graph, a pattern and embeddings of this pattern

According to this definition, each pattern has one or more occurrences. We call these occurrences embeddings.

Definition 16. (Embedding) If a subgraph S of G exists so that S is isomorphic to a pattern P, then S is an embedding of P.

For clarification, Fig. 2.15 depicts examples of the previously defined terms. On the left (a) a graph is shown with (b) a possible pattern. An embedding of this pattern is shown in (c). Our example pattern has to have more than one embedding to be frequent; accordingly, we display another possible embedding in (d). According to [89], there are two distinct settings classifying the mining algorithms w.r.t. their scenario, namely the single graph setting and the graph transaction setting:

• Single graph setting defines an extraction of patterns in one graph where the frequency of a pattern is the number of embeddings in this graph.

• Graph transaction setting defines an extraction of patterns on a number of graphs. The frequency of a pattern is thereby determined by the number of graphs which have at least one embedding of this pattern.

Please note that single graph setting algorithms are not applicable in graph transaction scenarios, but graph transaction algorithms can be applied in single graph settings. We depict the two settings in our classification in Fig. 2.16 with the two additional dimensions approximate and complete, which have been introduced in [89]. Since the main complexity of the algorithms is due to the subgraph isomorphism tests, they can be separated into complete and approximate algorithms. Complete mining algorithms are complete in the sense that they are guaranteed to discover all frequent subgraphs, that is, patterns. In contrast, approximate algorithms calculate a subset of the complete set of all patterns and thus not the optimal, i.e. complete, solution.
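In the single graph setting, the frequency function f of Definition 14 can be realised naively by enumerating embeddings with the VF2 matcher used above; real mining algorithms avoid this exhaustive enumeration. The graph, pattern, and threshold are hypothetical.

    import networkx as nx
    from networkx.algorithms import isomorphism

    def frequency(graph, pattern):
        """Count the embeddings of `pattern` in `graph` (single graph setting)."""
        matcher = isomorphism.GraphMatcher(graph, pattern)
        # Deduplicate by node set: automorphisms of the pattern would otherwise
        # count the same embedding several times.
        return len({frozenset(m) for m in matcher.subgraph_isomorphisms_iter()})

    G = nx.path_graph(5)        # the path 0-1-2-3-4
    pattern = nx.path_graph(2)  # a single edge
    t = 2
    print(frequency(G, pattern) > t)  # True: 4 embeddings exceed the threshold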

Figure 2.16: Classification of graph mining algorithms

2.2.6 Graph partitioning and clustering

Graph partitioning and graph clustering both deal with the problem of splitting a graph into smaller subgraphs. Graph clustering algorithms try to optimize the calculated clusters w.r.t. an a priori defined quality criterion.


In contrast, graph partitioning aims at calculating subgraphs of nearly equal size in the context of a weighted graph. We define the graph partitioning problem as given in [117] as follows:

Definition 17. (Graph partitioning) Given a weighted graph G = (V,E) and a positive integer p, the problem of graph partitioning consists of finding p subsets V1, V2, ..., Vp of V with i, j ∈ {1,...,p} such that

1. ∪_{i=1,...,p} Vi = V with Vi ⊆ V, Vi ≠ ∅, and Vi ∩ Vj = ∅ for i ≠ j,

2. w(Vi) ≈ w(V)/p, where w(Vi) and w(V) are the sums of the vertex weights in Vi and V, respectively, and ≈ allows for small deviations in size,

3. the cut size, i.e. the sum of the weights of the edges crossing between the subsets, is minimized.
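The two optimisation criteria of Definition 17 are straightforward to evaluate for a given partitioning. A small sketch (the edge weights, vertex weights, and the partitioning itself are hypothetical):

    def balance_and_cut(edges, vertex_weights, parts):
        """Return the part weights w(Vi) and the cut size of a partitioning.

        edges: iterable of (u, v, edge_weight); parts: disjoint vertex sets.
        """
        part_of = {v: i for i, part in enumerate(parts) for v in part}
        part_weights = [sum(vertex_weights[v] for v in part) for part in parts]
        cut_size = sum(w for u, v, w in edges if part_of[u] != part_of[v])
        return part_weights, cut_size

    edges = [("a", "b", 1), ("b", "c", 3), ("c", "d", 1), ("a", "d", 1)]
    weights = {"a": 1, "b": 1, "c": 1, "d": 1}
    print(balance_and_cut(edges, weights, [{"a", "b"}, {"c", "d"}]))
    # ([2, 2], 4): perfectly balanced, cut edges are b-c (3) and a-d (1)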

The goal of graph partitioning is the calculation of a given number of subgraphs balanced in their weight. Each subset Vi of Def. 17 together with the corresponding edges forms a partition.

Definition 18. (Partition) Given a graph G = (V,E), each subgraph Si ⊆ G is a partition.

Thus a partition not only defines a subset of a graph but requires each vertex of the graph to be part of only one partition; hence partitions are disjoint. Partitioning algorithms try to find a minimal number of vertices or edges to be removed from a graph such that the resulting vertices or edges form partitions. We give a classification of hierarchical graph clustering and graph partitioning algorithms in Fig. 2.17. Algorithms for graph splitting are either hierarchical graph clustering or graph partitioning algorithms, each separated into local and global approaches. Local graph clustering approaches begin with assigning each vertex to a different cluster. Then, these clusters are merged until a given criterion is reached. Approaches following this behaviour are called agglomerative.


Figure 2.17: Classification of graph partitioning and clustering algorithms; adopted from [117]

Representatives are given in the modularity approach by Newman [114] or density-based clustering [93]. In contrast, there are divisive clustering approaches which begin with the complete graph, splitting it recursively until again a given criterion is fulfilled. Examples for global clustering are Betweenness [48] or clustering based on the Kirchhoff equations [155]. Local graph partitioning approaches are similar to local clustering algorithms; the most popular one is the greedy approach by Kernighan and Lin [78]. Their approach has been adopted in hMetis [75], transforming it into a global multilevel approach. Another example for global partitioning is bisection, which recursively splits a graph by using the vertex distances [111]. For a more detailed and extensive survey of graph partitioning approaches please refer to [117].
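The Kernighan–Lin heuristic is available in networkx as a bisection; a usage sketch on a hypothetical graph with two dense clusters joined by one bridge edge (for p > 2 parts the bisection would be applied recursively, as in the bisection approach mentioned above):

    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    # Two triangles connected by the single bridge edge (2, 3).
    G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

    part_a, part_b = kernighan_lin_bisection(G, seed=42)
    print(sorted(part_a), sorted(part_b))  # ideally [0, 1, 2] and [3, 4, 5]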

2.3 Summary

We have introduced the fundamental concepts of metamodel matching and graph theory. The key findings of this chapter may be summarized as follows:

Metamodel matching We defined a metamodel as the prescriptive specification of a domain which specifies a language for metadata. We described the MOF-standard as a meta-metamodel defining object-oriented concepts like classes, attributes, and relations such as inheritance or references. A formal definition of a metamodel was given, defining a metamodel as a set of individuals, labels, labelling functions, and relations. We then defined the match operator, which is applicable on two metamodels, creating a mapping between elements of these two metamodels. Presenting a common matching system architecture, we also defined a matching technique as a function assigning a similarity value to two given metamodel elements, and we presented a classification of the state of the art of matching techniques. These classes are separated into element-level techniques, e.g. string-based, and structure-level techniques, e.g. graph matching.

Graph theory We gave an overview of the basics of graph theory, defining a graph as a set of vertices connected by edges. Subsequently, we defined a directed graph as a graph with directed edges and a labelled graph as a graph with a labelling function assigning a label to vertices and edges. Finally, we stated that we use a directed, labelled, and finite graph in this thesis. We also introduced the terminology and gave a short overview of the state of the art for graph matching, graph mining, and graph partitioning.

Reducibility and planarity We defined reducibility as a special graph property, which is used in our matching and partitioning approaches. Reducibility defines the edge contraction operation on a graph, i.e. the deletion of edges and merging of corresponding vertices leading to a hierarchical graph. The graph property planarity allows efficient algorithms to be applied in case of NP-complete subgraph isomorphism and partitioning problems. We gave a definition of planarity, stating that a graph is planar if it cannot be reduced to the special graphs K5 and K3,3 or, more illustratively, if the graph can be drawn in a plane without any edge intersection.

Chapter 3

Problem Analysis

The core problem addressed in this thesis is insufficient matching quality and insufficient support of scalability. In this chapter we present our structured problem analysis to demonstrate the problems we tackle with our work and to define its scope. We begin with an illustrative example for the core problem. Then we describe our methodology for analysing the problem, followed by the root-cause analysis and the objectives which lead to the resulting requirements of our solution. The relation to the requirements is established by presenting our systematic approach for developing a solution for the problems analyzed and our research question.

3.1 Motivating Example

Our motivating example originates from the area of retail stores and is an official SAP scenario [134]. A retail store has cashiers processing sales using tills and a local system collecting all data. This data is sent to and aggregated in a central Enterprise Resource Planning (ERP) system. Since SAP does not produce tills and associated systems, the central system and third-party systems of different stores need to be integrated. The subsequent section deals with a refined description of the scenario motivating the problem of data integration of different formats and thus metamodels. The related metamodels are described in Sect. 3.1.2, followed by an exemplary description of problems in applying matching on a large-scale scenario in Sect. 3.1.3.

3.1.1 Retail scenario description

The retail scenario describes an integration scenario between a Point of Sale (POS) system and an Enterprise Resource Planning (ERP) system. A POS system is a third-party system located in a retail store with its own user interface tailored to support cashiers. A POS system is specialized to the sales process of a customer; thus it needs to be supplied with master data such as article data as well as promotion and bonus buys.



Figure 3.1: Example of data integration in case of message exchange between a retail store (POS system) and an ERP system

Complementarily, an ERP system includes functionality to plan and manage the business of a company. It provides means for article data management, planning, ordering, goods movement, report generation, etc. Figure 3.1 illustrates the retail store scenario. On the left-hand side a customer interacting with a store and the corresponding POS system is shown. On the right-hand side, the company is represented by its ERP system and a graph representing a report, indicating the planning facilities provided by the ERP system. In the middle, data is exchanged between the ERP and the POS system. For instance, the data can be related to goods movement or sales from the POS to the ERP, or article data as well as promotion and bonus buys. According to [134] the data flow contains:

• Data transfer from the ERP to the POS system

– Article and price data: inventoried and value-only article types in various article categories (single article, generic article ...)
– Bonus buys data (with requirements and conditions necessary to determine a specific deal)
– Promotion data (with special price, start and end date of promotion)

• Transfer of data from the POS system to the ERP

– Sales and returns: article purchases or returns at the POS
– Tendering: legally valid means of payment used at the POS
– Financial transactions: involve monetary flow at the POS without movements of goods (e.g. cash withdrawals)
– Totals transactions: represent information on the balancing of registers and stores
– Control transactions: technical information about behaviour of the POS (e.g. cash drawer opening / closing)


Figure 3.2: Details of retail store data integration example

The data exchange of both systems is depicted in more detail in Fig. 3.2. The third-party POS and the SAP ERP system exchange messages in different formats to communicate. These formats need to be transformed into each other to enable the different systems to process them. The design time challenge is the integration of both data formats, i.e. the integration of both metamodels. This metamodel integration is achieved by specifying a mapping between the metamodels' elements. This could be done manually; however, in our case the source metamodel constitutes 971 elements, whereas the target metamodel has 3,775 elements. A mapping specification will easily require weeks to be completed [143], so this task is time-consuming and error-prone, as also identified in several publications, e.g. in [8, 37, 151]. The required assistance is provided by metamodel matching, as will be discussed in the following section.

3.1.2 ERP and POS metamodels

In case of two different metamodels defining two different specifications of similar entities, a data integration problem arises. In our example these specifications are messages exchanged between the POS and ERP system with respect to sales, goods movement, etc. Both systems have been developed for a special purpose and by different vendors, the third-party POS for supporting cashiers and the SAP ERP for planning the whole business, and therefore both systems have different schemas tailored to their purpose. The metamodel of a POS system needs to capture information about transactions, article purchases, payments, cashier withdrawals, etc. An ERP system's metamodel needs to define all this information and additional data about stores, bonuses, etc. In our scenario the POS metamodel consists of 3,775 elements, i.e. classes and attributes. The metamodel of an ERP for retail contains 971 elements. Figure 3.3 depicts excerpts of these metamodels, the POS metamodel on the left and the ERP metamodel on the right. The POS excerpt starts with the RetailTransaction containing three elements: RetailLoyalty, RetailCustomer, and TransactionItem. The RetailLoyalty describes the bonus programme involvement of a customer, i.e. the points awarded, the eligible amount and quantity of points, and the redeemed points.


Figure 3.3: Example of a transaction in a retail store metamodel and a central ERP system metamodel

Thereby, each RetailLoyalty element is assigned to a customer and vice versa. A RetailCustomer defines a customer by his name and email. Furthermore, each customer has an address consisting of a street, city, postal code, and country code. The last element of a RetailTransaction is the transaction itself, i.e. the TransactionItem element, which contains information about the tax, possible granted discount, payment information, and optional gift card certificates. This transaction is assigned to an order. This CustomerOrder has two identifiers and relates to the transaction element. The ERP excerpt follows a different naming used in SAP ERP systems, originating from a restricted length of identifiers in old ERP systems. A company-internal transaction is described by the _-POSDW_-TRANSACTION_INT element. It relates to two elements, the retail line item as information about the purchase and the item element acting as a proxy for the _-POSDW_-LOYALTY element. The loyalty captures information about the points awarded, the eligible amount and quantity, redeemed points, and additionally the customer card holder's name. Each transaction also consists of a _-POSDW_-RETAILLINEITEM which has a LINEITEMVOID acting as proxy for the associated customer. A _-POSDW_-CUSTM has a name, street, city, and country.

3.1.3 Data integration problems

The data integration problem of both metamodels is the problem of defining a mapping between the POS and the ERP metamodels. The mapping defines the correspondences between the metamodels, which in our example is straightforward. A customer, transaction item, and loyalty each have corresponding elements. However, considering the real-world scale of this problem, i.e. more than 3,000 elements, this task requires a lot of time for specifying and checking the mappings. Metamodel matching assists in this task by calculating the element correspondences. The most common approach is to use the name similarity of elements. Considering our example, this works for the loyalty items, because the names RetailLoyalty and _-POSDW_-LOYALTY are similar to each other. This is done by using a string edit distance, e.g. refer to [153]. However, the customer elements will be hard to map using name-based similarity. A type-based approach may assist by discovering the mapping between both country code elements (countryCode and CUSTCOUN). Still, the proxy objects such as Item or LINEITEMVOID will not be discovered; also references such as CUSTCARDNUMBER are hard to find. Consequently, the mappings discovered by matching tend to be incomplete. Another problem is the discovery of incorrect mappings. For instance, consider the relation named other between RetailCustomer and RetailLoyalty in the POS metamodel. There exists an equally named relation in the ERP metamodel, but it relates _-POSDW_-CUSTM and LINEITEMVOID. Therefore, the mapping discovered solely on names is incorrect. A possible solution is a combination of type and name information as well as the consideration of structural properties based on trees. Consequently, metamodel matching may produce incomplete and partially incorrect results. That leads to the conclusion that the matching task lacks quality and mapping calculation is a challenging task. Even with the use of all matching techniques presented in Sect. 2.1.2.2 not all mappings are found. So, the matching quality leaves room for improvement. Another issue arises when considering the size of the matching task. A common metamodel matching system compares all pairs of source and target elements to derive the mapping. In our case these are 971 × 3,775 elements, which results in 3,665,525 comparisons. These comparisons have to be done for each matching technique applied, which increases the total even more. This high number of comparisons leads to a high runtime (in the range of minutes or even hours) for the matching as well as high memory consumption (in the range of gigabytes) for the given task. Considering even bigger examples of metamodels with more than 10,000 elements, the matching task may be impossible due to lack of memory. To conclude, the real-world large-scale retail example shows that both matching quality and scalability can be improved. An analysis of these problems, objectives, corresponding requirements, and our approach is presented in the subsequent section.

3.2 Problem Analysis

The problem analysis we present will justify and detail our research question on improving metamodel matching in quality and scalability by utilizing structural information. The problem analysis was carried out by applying the method of Zielorientierte Projektplanung (ZOPP), i.e. Objectives-oriented Project Planning, which has been proposed by the German Technical Cooperation (GTZ) [60]. The idea of ZOPP is to begin with a problem hierarchy, which is the basis for narrowing the scope. The resulting scoped problem hierarchy is transformed into a goal hierarchy to finally define the objectives resulting from the problems. Thereby, the problem hierarchy consists of problems and subproblems, where each subproblem is a cause of a problem. In the following, we use the first section to present our problem hierarchy and define the scope. In the second section we derive our goal hierarchy and define the corresponding objectives. The third section formulates the objectives as requirements, on which our approach in the fourth section is based.

3.2.1 Problems and scope

The problems we identified are captured in the problem hierarchy. The problem hierarchy is a tree, where the parent-child relation is defined as "leads to" and the child-parent relation as "is caused by". That means each problem is a subproblem of another one, thus forming a problem hierarchy.

3.2.1.1 Problems

The core problem our thesis deals with is the insufficient matching result quality and support for scalability. This problem has also been illustrated in the previous example in Sect. 3.1.3. In the following we will describe our cause analysis and the resulting subproblems.

1. Insufficient matching quality and scalability The core problem we identified corresponds to our research question and is the root of the problem hierarchy in Fig. 3.4. Its two cause problems separate the problem tree into the problem of incorrect and incomplete results as well as insufficient support for industrial-scale matching.

2. Incorrect and incomplete matching results The matching quality of today's matching systems in general and metamodel matching systems in particular is insufficient. That means the results of today's matching systems are partially incorrect and incomplete. Consequently, they leave room for improvement. This problem is backed by two arguments. First, we investigated the state of the art of metamodel matching and implemented a system incorporating these techniques.


Figure 3.4: Problem hierarchy

We carried out an analysis using real-world data and demonstrated that only half of all matches were found [151], which was confirmed for large-scale schemas by [125]. Second, a metamodel contains insufficient information for matching.

2.1 A metamodel contains insufficient information for matching This observation has already been made for schemas [126] and is easily transferred to metamodels. A metamodel (see Definition 1) only contains individuals and labels, thus static structure. It lacks not only information about executional semantics, but also about the context and other metadata, which is needed in order to formulate mappings. Sometimes, only a domain expert is capable of specifying a mapping, because of his knowledge of the semantics of the elements.

2.2 Existing matching techniques use (unavailable) special information There are matching techniques which try to make use of additional special information (metadata) or background knowledge [29]. However, this leads to the second problem of existing matching techniques using (unavailable) special information. The information they make use of does not exist per se. Therefore, one has to assume that this information is missing; thus these techniques may even decrease the result quality.

2.3 Metamodels contain redundant information Another possible cause is the problem that metamodels contain redundant information.

Even though it is considered bad style and should be avoided, real-world metamodels contain redundant information such as duplicated information or multiple ways of expressing the same entity. This redundant information is misleading for matching techniques and decreases the result quality.

2.4 Existing matching techniques do not fully exploit structural information The fourth problem identified by us is that existing matching techniques do not fully exploit structural information. The rationale behind this problem is that the more information is available and the more matching techniques make use of it, the better the achieved result. For an overview of existing matching systems refer to Sect. 4.1.

3. Insufficient support for large-scale metamodel matching The problem of insufficient support for large-scale metamodel matching states that runtime and memory consumption problems arise when increasing the size of the metamodels to be matched to a certain extent. That means matching metamodels of arbitrary size is not supported. The main reason for this problem is the quadratic number of comparisons for matching. The matches are derived by comparing each element of one metamodel with each element of another metamodel; thus there is a quadratic number of comparisons due to the Cartesian product of elements being compared.

3.1 High memory consumption causes matching termination One consequence of large-scale metamodels and a quadratic number of comparisons is an increasing memory consumption, which finally leads to problems of insufficient memory. When hitting memory boundaries the process has to terminate and the resulting mapping is empty. Currently, metamodel matching systems are not able to match such metamodels.

3.2 High runtime causes unresponsive systems Matching of large-scale metamodels naturally leads to an increasing runtime. In consequence, the response time is in the range of minutes or even hours, provoking an unresponsive system and waiting times for a user.

3.3 Support by problem size reduction causes loss in quality Another cause for insufficient scalability support is a resulting trade-off between result quality and scalability. As stated before in the two subproblems concerning memory and runtime, the scalability issue has to be tackled by reducing the problem size. However, if one follows a context reduction approach like the generic matching and differencing system EMF Compare [10], the matching is

limited to neighbours of an element to match instead of considering the whole metamodel; as a consequence the result quality decreases considerably, since less information can be used for matching.

3.2.1.2 Scope

We define the scope of our work by selecting some of the identified problems and neglecting the others, based on the possibility to improve shortcomings by graph-based techniques. Furthermore, we want to pursue a generic solution, thus we neglect problems which require special support. We also base our scope on the restriction of a matching-system-independent solution, i.e. there is no need to change an existing system besides adding our proposed solution. The following problems will be tackled: metamodels contain redundant information (2.3), existing matching techniques do not fully exploit structural information (2.4), high memory consumption (3.1), high runtime (3.2), and the trade-off between scalability and quality (3.3). Consequently, we contribute to the solution of the problem of insufficient matching result quality due to incorrect and incomplete results (2.) and we aim to provide support for matching metamodels of large scale (3.). We also contribute to the solution of the core problem of insufficient matching quality and scalability. The problems which we do not consider (2.1, 2.2) and which are out of scope will be discussed in the two subsequent paragraphs.

2.1 Insufficient matching information in metamodels This problem has been identified in the areas of ontologies [33] and XML schemas [126], and is easily transferred to metamodels. Attempts to solve the problem have been made by adding metainformation such as mappings to upper-level ontologies [28], etc. However, none of these proposals could present a complete solution, i.e. none could identify all matches. Therefore, this problem will not be tackled by our work; instead we will concentrate on using already available information.

2.2 Existing matching techniques use (unavailable) special information We will not tackle this problem but derive our approach from it. Rather than finding a way to add the missing information, we pursue the usage of already existing information, namely the graph structure of a metamodel. We decide to only make use of already available information, because we do not want to impose another task on the user. Adding the information would require an additional effort for the matching task. Furthermore, the additional special information requires an adaptation of the infrastructure to support the loading, storing, etc. thereof.


Figure 3.5: Objective hierarchy

3.2.2 Objectives

The objectives we define are derived from the problem hierarchy in Fig. 3.4. Removing the problems 2.1 and 2.2, which are out of scope, we present the objective hierarchy in Fig. 3.5. The hierarchy follows the structure of the problem hierarchy with the final objective of increasing matching quality and supporting scalability.

1. Increase matching quality and support scalability The two subobjectives, an increase in quality and concepts for scalability support, form our main goal.

2. Increase correctness and completeness of matching results The correctness and completeness define the quality of the matching results; if both are increased without decreasing each other, the overall result quality is improved. An improvement can be achieved by pursuing the following two subobjectives tackling the previous problems.

2.2 Exploit structural information for matching The structural information, i.e. a metamodel's graph structure, should be used for matching. That means matching algorithms should use this information for matching calculation. There are several algorithms in graph theory that make use of the structure for similarity calculation; a detailed analysis of these algorithms is presented in Chap. 5.

2.4 Exploit redundant information for matching Redundant information in the metamodels to be compared can cause misleading matches but can also be used for matching. This information should be facilitated to increase the result quality. Redundant information first needs to be identified by so-called mining algorithms and then be matched.

In Chap. 5 we will present an analysis of the existing algorithms.

3. Support matching of large-scale metamodels This objective implies concepts to tackle the problems of memory and runtime. Along with these two problems, the trade-off between scalability and quality has to be considered.

3.1 Concept for managed memory consumption Scalability should be supported by a concept which allows for managed memory consumption throughout the matching process. This memory consumption should also allow for matching of metamodels of arbitrary size.

3.2 Concept for reducing matching runtime The complexity of matching is quadratic in the number of the input metamodels' elements to be matched. Consequently, the runtime increases quadratically with the number of input elements. The objective is to develop a concept which reduces the matching runtime.

3.3 Optimize trade-off between scalability and matching result quality While introducing a concept for managed memory consumption and reduced runtime, the problem is split into smaller subproblems. This problem reduction implies a trade-off between result quality and scalability, which should be optimized. That means support for managed memory consumption while having a reasonable runtime and result quality.

3.2.3 Requirements

The objectives identified allow for a definition of requirements to find solutions that fulfil these objectives. The requirements are given in Tab. 3.1 and grouped into two groups: the first deals with the objectives of increasing correctness and completeness by exploiting structural and redundant information (Objectives 2.2 and 2.4), the second deals with the support of scalability (Objective 3).

R1 Exploit structural and redundant information The solution should automatically exploit available structural and redundant information, i.e. without user interaction.

R1.1 Exploit redundant information and graph structure The solution should exploit the explicit graph structure of a metamodel. This information should be used for matching in combination with other matching techniques, e.g. linguistic ones. In addition, redundant information should be exploited for metamodel matching.

Table 3.1: Overview of requirements

R1 Exploit structural and redundant information
    R1.1 Exploit graph structure and redundant information
    R1.2 Improve correctness and/or completeness
    R1.3 Maximal quadratic complexity

R2 Matching metamodels of large scale
    R2.1 Partition input metamodels
    R2.2 Reduce runtime
    R2.3 Minimal losses, preserve, or even improve result quality

R1.2 Improve correctness and completeness While matching the redundant information, the solution should improve the correctness and completeness of the mappings calculated.

R1.3 Maximal quadratic complexity If possible, the solution should adhere to the quadratic complexity imposed by the matching problem.

R2 Matching large-scale metamodels The solution should allow matching of large-scale metamodels.

R2.1 Partition input metamodels The solution should support a partitioning of the input metamodels which allows for a reduced problem space for matching.

R2.2 Reduce runtime The solution should reduce the overall runtime of the matching process.

R2.3 Minimal loss of result quality The solution should minimize the loss of result quality. If possible, it should preserve or even improve the correctness and completeness of the mappings calculated.

3.2.4 Approach

The requirements identified specify the solution that shall fulfil our objectives (cf. Sect. 3.2.2). The connection between both is our approach of developing a solution based on the requirements for the objectives. Figure 3.6 depicts our approach and its single steps.


Figure 3.6: Steps of our problem solving approach on increasing quality and scalability

The basis is the main objective of increasing the matching result quality and supporting scalability. It is split into the objectives concerned with structural and redundant information, and with matching metamodels of arbitrary size (scalability). These objectives will be approached by us in a uniform manner consisting of the following four steps:

1. Review state of the art graph theory algorithms The main focus of our work is on using structural (graph) information. Therefore, we first review generic algorithms in the field of graph theory which provide solutions for the respective objectives.

2. Identify special requirements w.r.t. metamodel matching The next step is to narrow down the set of algorithms reviewed. We identify requirements which are imposed by the domain of metamodel matching. That means requirements such as specific graph properties, graph type restrictions, complexity, etc. They are used as a basis for a selection and adaptation of the generic graph algorithms.

3. Select and adapt suitable algorithms The algorithms studied are compared w.r.t. the identified requirements. The algorithms which fulfil the requirements are adapted and changed according to the domain of metamodel matching.

4. Validate via implementation and comparative evaluation The algorithms designed in the selection and adaptation step are implemented in a prototype and evaluated in a comprehensive evaluation using real-world data sets. Finally, the results obtained are compared to the objectives and requirements.

3.2.5 Research question

The research question given and motivated in our introduction is: "How can metamodel graph structure be used to improve the correctness, completeness, runtime and memory consumption of metamodel matching?" We defined three hypotheses H1, H2, and H3. H1 and H2 state that the subgraph isomorphism and graph-mining algorithms will improve the result quality. These hypotheses are supported by our requirements for matching R1 and also by our approach of surveying the state of the art in graph theory and selecting and adapting suitable algorithms. Consequently, this will contribute to the answer of our research question. The hypothesis H3 deals with planar partitioning for large-scale matching and is supported by the requirements of R2 and again by our approach. In investigating the state of the art in graph partitioning and clustering we derive the planar partitioning and thus contribute to H3 and finally to our research question.

3.3 Summary

In this chapter we gave a motivating example for metamodel matching and challenges regarding matching quality and scalability. Moreover, we performed a problem analysis, derived objectives and requirements, and outlined our approach on a solution for the objectives.

Motivating example The motivating example presented by us originates from the area of retail stores. A retail store runs a third-party system for cashiers and goods movement, while all data is gathered in a central ERP system. The third-party systems have to be integrated with the central system provided by SAP. Thereby, metamodel matching assists in specifying mappings for this data integration problem. However, the result quality suffers from missing information, different naming conventions, etc. Also, both metamodels consist of more than 900 elements, which leads to runtime and memory consumption issues.

Problem analysis We performed a problem analysis on the matching result quality and scalability issues based on ZOPP. Thereby, we first identified two cause problems: (1) insufficient correctness and completeness of matching results and (2) insufficient support for large-scale metamodel matching. We also performed a root-cause analysis for these two problems. This was followed by a scope identification and a derivation of the objectives which constitute a solution to the cause problems. Namely the objectives are: exploit structural information for matching, exploit redundant information for matching, and support matching of metamodels of arbitrary size. These objectives lead to a set of requirements our solution has to fulfil, which are refined in the respective chapters 5 and 6 where our solution is presented. Finally, we presented our four steps for solving the cause problems, which are pursued for all three objectives. They involve a study of existing graph algorithms based on an objective-specific requirement analysis. Based on the requirements, algorithms are selected and adapted in the context of large-scale metamodel matching. The proposed solution is then investigated in a comprehensive evaluation.

Chapter 4

Related Work

Having set the foundation of our work and analyzed the problems of matching quality and scalability, we want to give an overview on state of the art matching systems. We first introduce the areas of schema, ontology, and metamodel matching and selected systems. Subsequently, we present related structural matching approaches w.r.t. our proposed planar graph edit distance and graph mining-based matching. We point out the shortcomings of the existing approaches and how our proposals may complement them, thus establishing our research ground. This is also done for large-scale matching approaches, discussing their shortcomings in either quality or memory consumption and their relation to our planar partitioning.

4.1 Matching Systems

Several systems have been developed in the areas of schema, ontology, and metamodel matching. Although all systems tackle the problem of metadata matching, they were and are being researched in a relatively independent manner. Prior to studying matching systems, one needs to clarify the relation of the domains of schemas, ontologies, and metamodels. In this work, we adopt the perspective of Aßmann et al. [6] and extend it by including XML schemas. Schemas, ontologies, and metamodels provide vocabulary for a language and define validity rules for the elements of the language. The difference is in the nature of the language: either it is prescriptive or descriptive [6]. Thereby, schemas and metamodels are restrictive specifications, i.e. they specify and restrict a domain in a data model and systems specification, hence they are prescriptive. As a complement, ontologies are descriptive specifications and as such focus on the description of the environment. Schemas are used for a tree-based description of data for processing by a machine, whereas metamodels abstract a domain of interest in a graph-based structure. Therefore, while using a similar vocabulary made of linguistic and structural information, the three domains differ in their purpose.

Since the three domains have a different purpose, matching systems for these domains have been developed independently. First, (1) schema matching systems have been developed to mainly support business and data integration, as well as schema evolution [24, 77, 100, 107, 126]. Thereby, the schema matching systems take advantage of the explicit tree structure. Second, with the advent of the Semantic Web, (2) ontology matching systems are dedicated to ontology evolution and merging, as well as semantic web service composition and matchmaking [17, 53, 76, 154, 68, 160]. They are especially of use in the biological domain for aligning large taxonomies as well as for classification tasks. Finally, in the context of MDA, which requires refinement and model transformation, (3) metamodel matching systems are concerned with model transformation development, with the purpose of data integration as well as metamodel evolution [10, 37, 38, 73, 151]. In the following we will describe matching systems from each of these domains and their applied matching techniques to provide an overview on the state of the art in matching.

4.1.1 Schema matching

Based on a survey of schema-based matching approaches [138], three widely-known systems applicable in the schema matching domain were selected as representatives of the group. The three selected systems are COMA++ [24], Cupid [100], and the similarity flooding algorithm [107]. Similarity flooding was first implemented for schema matching, but has since been adapted and is nowadays used in systems in other domains, e.g. ontologies [160] and metamodels [38].

Similarity flooding The similarity flooding algorithm [107] is a hybrid matching algorithm that relies on the hypothesis that if two entities are similar, then there is also a certain similarity between their neighbours. Schemas are represented as directed labeled graphs and the algorithm runs in an iterative fix-point computation to produce a mapping between vertices of the input graphs. First, initial string-based techniques are applied to obtain starting similarity values for the fix-point computation. Starting from similar elements, the similarity is propagated to neighbour elements through propagation coefficients. The process is finished when a fix point is reached and the final mapping is then returned. Similarity flooding has been integrated with linguistic techniques in the prototype model management system Rondo [108].

COMA++ COMA++ [23] is a matching system that uses a COmbination of MAtching techniques to produce mapping results. The matchers operate on an internal model, namely a directed acyclic graph. Schema elements are represented by nodes in the graph and they can be connected by directed links representing different types of relationships. The matcher library consists of linguistic matchers, matchers that combine structural and terminological matching techniques, as well as matchers based on a repository of structures. The results from the different matchers are aggregated and selected to a single similarity value for a pair of elements. It has to be noted that COMA++ was extended to also be applicable in the ontology domain.

Cupid Cupid [100] is a matching system that applies both terminological and structural matchers. Input schemas are imported into an internal representation as trees, on which the matching techniques operate. The matching algorithm has three phases and operates only on tree structures, to which non-tree cases are reduced. The first phase applies string-, language-, and constraint-based techniques, as well as linguistic resources, to compute similarity. During the second phase local tree-based techniques are applied. In the third phase a weighted aggregation and a threshold-based selection produce the final mapping. A comparison of our concepts with the approaches of COMA++ and similarity flooding can be found in Sect. 4.2 and 4.3.

4.1.2 Ontology matching

The ontology matching domain is also related to our work and presents a multitude of systems (more than 30). Therefore, we selected systems based on their performance in the Ontology Alignment Evaluation Initiative (OAEI) contest benchmark¹ and their relatedness to our approach, i.e. they are considered if they apply graph-based techniques or tackle the scalability problem. Three systems were chosen, namely Aflood [53], Falcon-AO [69], and RiMOM [160].

Aflood The Aflood (Anchor-Flood) system [53] targets the mapping of large-scale ontologies. It does not compare each element of two ontologies, but rather chooses so-called anchor elements to group elements based on an anchor’s neighbourhood. Elements are assigned to a group based on the inheritance relation. The elements in the resulting groups are matched using local matchers. The local matchers apply string-based and language-based techniques, and make use of linguistic resources. Additionally, instance matching is supported as part of the system. Based on this information a structural similarity among elements in a so-called semantic link cloud [53] is computed to produce the alignments (mappings).

¹ http://oaei.ontologymatching.org/

Falcon-AO Falcon-AO aims at automatic ontology mapping by means of a library of four matchers. The library consists of two linguistic matchers, one calculating element similarity based on a vector space model and the other applying a string edit distance. The third matcher is called GMO (Graph Matching for Ontologies) [63], which is a graph-based matcher. The last matcher performs the partitioning of large ontologies; GMO and the partitioning matcher are discussed in Sect. 4.2. Finally, the values from the matchers are aggregated using a weighted approach.

RiMOM The Risk Minimization Ontology Mapping system (RiMOM) [95, 160] is an ontology matching system that uses several matching configurations. The system evaluates preliminary information, i. e. input metrics, about how different configurations would perform and prefers some of them based on that information. The system applies linguistic matching techniques, namely a string edit distance, the linguistic resource WordNet, a vector-based similarity, and a path similarity. Hence, it uses linguistic and local matching based on trees. The different similarity values are combined to produce a single value that gives the final mapping.

4.1.3 Metamodel matching

Although the research area of metamodel matching is developing, it does not yet offer as many matching systems as the ontology and schema matching domains. It has to be noted that we consider only metamodel matching systems, and no model matching tools as described in [81]. Six metamodel matching systems were identified by us, namely the Atlas Model Weaver [37] (extended by AML [45]), EMFCompare [10], GUMM [38], MatchBox [151], which is the system proposed by us and described in Sect. 7.2, ModelCVS [73] (implicitly applying COMA++ from the schema matching domain), and SAMT4MDE [18].

AMW Fabro et al. [37] proposed the Atlas Model Weaver (AMW), a system for semi-automatic model integration using matching transformations and weaving models. A weaving model captures the different possible relationships between two models. The matching is performed by creating the weaving model between the two input metamodels. Thereby, both linguistic and structural information is exploited. The linguistic approach calculates string similarity between names of elements and also uses dictionaries, while the structural similarity computation follows the similarity flooding approach. In

AML [45] an extension was proposed which allows the modelling of domain-specific matchers to improve the matching results for specific matching tasks, but with the drawback of low performance [46].

EMFCompare The main focus of EMFCompare [10] is to calculate differences between versions of the same model by applying both matching and differencing techniques. However, the authors of EMFCompare explicitly state its applicability to metamodels, thus we also added it because of its popularity. The matching engine uses statistical information, heuristics, and instance information. The name of an element, its content, type, and relations are the metrics being analyzed in order to produce a similarity value. Each of these pieces of information is represented as a string and a string edit distance is applied for similarity calculation. In order to reduce information noise and to provide better results, EMFCompare applies a filter on the match results.

GUMM Falleri et al. proposed the Generic and Useful Model Matcher system (GUMM) [38] for metamodel matching based on the similarity flooding algorithm [107]. Similar to us, they use directed labeled graphs to internally represent metamodels. On these graphs they apply the similarity flooding algorithm. For the initial similarities, which the algorithm needs, string-based metrics are used, including for example the Levenshtein distance [153]. A fix-point iteration process is then started to produce the result mapping.

ModelCVS The ModelCVS system [73] proposes an approach to match metamodels, in which metamodels are semi-automatically transformed into ontologies, a process which the authors call metamodel lifting. Afterwards, an alignment (matching) between the ontologies is performed and finally a bridging between the metamodels is derived from the ontology mapping. To find similarities between the resulting ontologies, several tools were applied and the one with the best results was chosen, namely COMA++ [24].

SAMT4MDE The Semi-Automatic Matching Tool for Model Driven Engineering (SAMT4MDE) [18] is a tool for metamodel matching. A matching algorithm based on cross-kind-relationships is used and extended to also use structural information from the models being matched. The overall similarity value between two classes is a weighted sum of the base similarity and the structural similarity. The base similarity uses the idea of cross-kind-relationship implication, for example if A contains B and B is-a C, then A contains C. The base similarity calculation is also based on a repository of taxonomies. The structural similarity is computed using taxonomy- and tree-based techniques.

4.2 Matching Quality

The majority of matching systems, including schema, ontology, and metamodel matching, applies local matching techniques [100, 107, 23, 95, 53, 37, 10, 38, 18]. As described before, local matching does not take the complete structure into account. Moreover, matching in most of the systems is restricted to trees. However, there are approaches and systems which apply global graph-based techniques and which are consequently related to our work on graph edit distance concepts (Sect. 5.1). We discuss those approaches in the following section, highlighting the differences from our approach. Next, we describe related approaches that make use of region-based (redundant) information for matching, which is related to the graph mining-based matching as proposed by us in Sect. 5.2.

Graph matching In metamodel matching the only graph matching technique used in systems [36, 38] is similarity flooding [107], which originates from schema matching. This local technique suffers from two main drawbacks: first, it does not consider the graph structure as a whole but rather investigates neighbour relations. Second, it uses propagation and thus heavily relies on input matches, i. e. linguistic information, which may propagate an error multiple times to wrong results. In contrast, our approach uses the whole graph structure and does not suffer from error propagation. It is worth noting that the results of our planar graph edit distance may serve as an input to similarity flooding. However, this option has not been investigated by us, since a parallel similarity combination tends to be superior to similarity flooding, as demonstrated by [23]. In the field of ontology matching two approaches for global graph-based matching were proposed. The first approach is GMO [63], which is part of the Falcon system. Its basic idea is to transform an input ontology into a bipartite graph. Inside a bipartite graph the vertices can be split into two disjoint sets. Between each element of one set and each element of the other set an edge is established. These edges represent the similarity. GMO uses structural measurements to iteratively refine the matrix. Finally, after the iterations converge, a mapping can be derived. The main problem of this approach is that metamodels are not bipartite, and thus GMO cannot be applied. One could add vertices and edges to a metamodel until it becomes bipartite, but this procedure could lead to three times the number of edges, as also shown for ontologies in [56], because all elements (vertices) have to be arranged into the disjoint groups. This added information would potentially produce misleading or wrong mappings, thus we disregard this approach. Another interesting idea in ontology matching was proposed by [27], which applies Expectation Maximization. The approach is to transform an ontology into a directed labeled graph and interpret the problem of matching as a maximum likelihood search. This search is known from the field of statistics and evaluates the probability that a given match is valid for two given ontology graphs; thereby, linguistic techniques are used. The computation is done over several iterations, each refining the previous result. The main drawback of the approach is its lack of scalability. The authors report a calculation time of 18 minutes for 40 elements [27], which makes it inapplicable for real-world scenarios, which show on average 180 elements as given by our evaluation data in Sect. 7.3. The work most closely related to our approach has been proposed in [162]: a general tree edit distance specialised for schema matching by means of a schema-specific cost function. In contrast, our approach is not limited to trees and can be applied to any graph.

Redundant information The usage of redundant information has not been tackled explicitly so far. However, there are two non-metamodel approaches which are to some extent related to our work. The first one originates from ontology matching and aims at matches calculated based on pre-defined patterns [130]. The authors define four patterns which are evaluated on a given ontology, and based on the embeddings (occurrences) mappings are derived. In contrast, our approach is capable of detecting arbitrary patterns, which makes it generic, whereas their approach is ontology- and domain-specific. Another related approach is presented by COMA++ [25] with the fragment approach. The idea is to compare the paths of elements in a schema to differentiate between the contexts of re-occurring elements. The approach collects all multiple matches of a matching technique and refines them using their paths. We follow a different approach and do not rely on a first matching but rather extract redundant information explicitly. Furthermore, with our Design Pattern matcher we are capable of identifying design patterns, which tend not to be present as multiple matches.

Table 4.1 gives an overview of related structural matching approaches. The approaches are classified according to whether they operate on graphs, perform local, global, or region-based matching, and whether they are applicable to metamodels. Thereby, local, global, and region-based matching refers to our extended matching technique classification in Sect. 2.1.2.2. We conclude that our work is the only one applying global graph-based techniques on metamodels and also the only one applying region graph-based techniques on metamodel graphs.

Table 4.1: Overview on related structural matching approaches

Approach               Graph  Local  Global  Region  Metamodel
Our work                 ✓      ✓      ✓       ✓        ✓
Melnik et al. [107]      ✓      ✓      ✗       ✗        ✗
Falleri et al. [38]      ✓      ✓      ✗       ✗        ✓
Fabro et al. [37]        ✓      ✓      ✗       ✗        ✓
GMO [63]                 ✓      ✗      ✓       ✗        ✗
Doshi & Thomas [27]      ✓      ✗      ✓       ✗        ✗
Do et al. [25]           ✗      ✓      ✗       ✓        ✗
Ritze et al. [130]       ✓      ✗      ✗       ✓        ✗
Zhang [162]              ✗      ✗      ✓       ✗        ✗

4.3 Matching Scalability

In this section we present related approaches to the scalability support of matching systems. Thereby, we again consider schema, ontology, and metamodel matching, because in metamodel matching only EMFCompare [10] takes scalability into account. Large-scale matching follows one of the following approaches:

1. Light-weight matching,

2. Context reduction,

3. Self-configuring matching workflows, or

4. Partitioning.

Light-weight matching Light-weight matching employs only simple matching techniques, such as linguistic ones, to save runtime and memory. Still, those approaches do not solve the runtime and memory consumption problem; they only postpone it, because they still need to match the Cartesian product of source and target elements. In metamodel matching EMFCompare [10] and in ontology matching the RiMOM [95] system follow this approach.

Context reduction Another proposal for scalability support in ontology matching is given by Aflood [54]. They apply an incremental approach, starting to match only certain elements, so-called anchors. If matches are found, the neighbours of those anchors are matched recursively until no match can be found. Although the approach seems to perform well on certain data, in the worst case, e.g. identical metamodels, again the complete data is matched. Consequently, the scalability problem is only tackled partially.

Self-configuring matching workflows Peukert [123] and RiMOM [95] apply a different approach by making use of self-configuring matching workflows. Thereby, light-weight matching techniques serve as a selector of elements to be matched, to reduce the number of comparisons to be made by subsequent techniques. Unfortunately, this also only tackles runtime but not memory and hence does not solve the scalability problem.

Partitioning Partitioning is an approach following the divide and conquer paradigm. Thereby, the input is separated into parts and those parts are either used for the selection of match partners or for an independent matching of the parts. COMA++ [25] follows a partitioning approach by identifying fragments. Those fragments represent either types or user-defined patterns, and determine the comparisons to be made. This approach reduces the number of comparisons for matching but does not tackle the memory problem, because the matchers still use the complete metamodel (schema) for matching. Falcon-AO [156] took this idea and applied a partitioning based on the graph clustering algorithm ROCK [156]. To reduce the number of comparisons for matching, they apply a threshold-based assignment, as also studied by us. However, the main problem of this approach is the partition similarity calculation. Falcon-AO proposes to calculate anchor pairs (source-target pairs with a high similarity) using name and annotation matchers prior to partitioning, or to have them defined by a user. The matching is only applied to partitions with shared anchors. Consequently, both ontologies have to be matched completely with light-weight techniques, which again imposes memory issues.
Another approach based on the prominent clustering algorithm k-means also applies partitioning for schemas [140]. Thereby, they assume a user-defined small-sized pattern which helps in selecting the centroids (similar to the anchors). All elements with a neighbourhood larger than the pattern are considered as candidates. In the next step all partitions consisting of centroids are incrementally extended until a fix point is reached, i. e. no further element gets assigned to a partition. The authors note that the algorithm sometimes produces unbalanced clusters, which negatively influences the solution. Again, the clustering algorithm reduces runtime and memory in most cases, but it cannot ensure a maximal cluster (partition) size and thus does not tackle the memory problem in general.
Recently, Gross [50] proposed a partitioning for COMA++. The approach is to sequentially partition an input ontology into maximal-sized partitions. These partitions are matched independently and in parallel. To tackle the memory problem, local information is stored in elements by pre-processing. That means that an element’s children, siblings, parents, and name similarity are pre-computed and stored for parallel matching. Besides imposing a matching overhead in pre-processing, the authors also note the major drawback of their approach: it does not allow for global structural matching or similarity flooding. The reason is the missing structural information, since it cannot be stored without storing the complete ontology, which again does not solve the problem. In contrast, our approach allows for a structure-preserving partitioning and thus for global graph-based matching and similarity flooding.

Table 4.2: Overview on related large-scale matching systems

Approach            Runtime  Memory  Partitioning  Struct. Pres.  Assignment  Metamodel
Our work               ✓        ✓         ✓             ✓             ✓           ✓
COMA++ Part. [50]      ✓        ✓         ✓             ✗             ✗           ✗
Falcon-AO [156]        ✓        ✗         ✓             ✓             ✓           ✗
COMA++ Frag. [25]      ✓        ✗         ✓             ✗             ✗           ✗
k-Means [140]          ✓        ✗         ✓             ✗             ✗           ✗
Aflood [54]            ✓        ✗         ✗             ✗             ✗           ✗
EMFCompare [10]        ✓        ✗         ✗             ✗             ✗           ✓
RiMOM [95]             ✓        ✗         ✗             ✗             ✗           ✗
Peukert [123]          ✓        ✗         ✗             ✓             ✗           ✗

Table 4.2 summarizes our related work on large-scale matching. The features investigated are runtime and memory: we discuss whether the corresponding approach reduces runtime and memory and thus tackles the problem. Partitioning states whether the corresponding approach applies a splitting into parts, while structure-preserving investigates whether the approach still allows for global structural matchers. The assignment column gives insights into whether the approach utilizes an assignment of partitions, and the last column specifies the applicability to metamodels. To conclude, our approach differs from the state of the art in tackling both memory and runtime while providing a structure-preserving partitioning and explicit consideration of assignment.

4.4 Summary

Our work is related to the fields of schema, ontology, and metamodel matching. Therefore, we gave an overview of matching systems from these related areas and discussed in detail related work dealing with improvements of matching quality by structural matching and with matching scalability.

Matching quality The common approach to graph matching across all three domains [37, 38, 160] is similarity flooding [107]. However, this algorithm is a similarity propagation approach which does not use the global graph structure. Two approaches from ontology matching either suffer from the assumption of bipartite graphs [63] or from scalability issues [27]. A related approach from the area of schema matching calculates the edit distance between trees [162], whereas we calculate an edit distance for graphs (Sect. 5.1). Redundant information is not explicitly used for matching. However, the fragment-based approach of COMA++ [25] tries to identify complex types in schemas to derive a similarity between those fragments, but this approach relies on multiple matches. Another approach from ontology matching aims at calculating occurrences of four pre-defined patterns [130], which we exceed by calculating such patterns via mining and then checking their occurrences for matching as given in Sect. 5.2.

Matching scalability Approaches to matching scalability can be separated into light-weight matching [10], context reduction [54], self-tuning matching processes [95, 123], and partitioning [140, 25, 156, 50]. Although all approaches tackle the runtime issue one way or the other, the memory consumption problem is only tackled by [156, 50]. Thereby, [156] suffers from its domain specifics and partition assignment calculation, which may lead to memory issues. The alternative approach [50], on the other hand, does not apply a structure-preserving partitioning and is therefore not applicable to global graph-based algorithms or similarity flooding. These issues are tackled by our planar partitioning as given in Chap. 6.

Chapter 5

Structural Graph Edit Distance and Graph Mining Matcher

The problem of insufficient matching quality identified by us led to the objective of exploiting structural and redundant information. In this chapter we propose solutions for this objective by building on established graph theory. The solutions are three novel matching approaches: (1) a planar graph edit distance matcher, (2) a design pattern matcher, and (3) a redundancy matcher. These matchers are described by us in a uniform manner while categorizing them into graph matching and graph mining. Since the algorithms perform structural graph-based matching, we first conduct a requirements-driven analysis of generic algorithms from graph theory. We then select algorithms as the basis for our matchers and describe them. Subsequently, we provide an illustrative example calculation and remarks on our algorithms.

5.1 Planar Graph Edit Distance Matcher

In order to exploit structural information for increased matching quality we propose to apply an approximate subgraph isomorphism algorithm. Graph isomorphism algorithms have not been used for matching so far because of their intrinsic complexity, especially in runtime [150]. However, our approach is based on an efficient, i. e. of quadratic complexity, generic graph matching algorithm presented by Neuhaus and Bunke [113]. Their algorithm has been proposed for fingerprint classification and operates on unlabelled, planar graphs with each vertex having more than one edge. Since their algorithm shows good quality and a quadratic runtime, we propose to adjust and extend the algorithm for metamodel matching. We propose techniques to consider labels and a pre-processing planarization. We also propose to consider one-edge vertices, e. g. attributes, for the edit distance calculation and thus for matching. To further increase the quality of the algorithm we present our k-max degree approach based on partial user-input mappings.


[Figure: (a) Source graph, (b) Target graph, (c) Common subgraph, (d) Extended common subgraph]

Figure 5.1: Example of source and target graphs, a common subgraph, and our extensions

Figure 5.1 depicts a source (a) and target graph (b), an example output of the original algorithm (c), and an example output of our extensions (d). The original output determines the common subgraph (A, B, C) as interpretation of the GED, omitting the vertex a1, because it violates the prerequisite of biconnectivity. Our extensions add a similarity value in [0,1] for each vertex as shown in Fig. 5.1 (d). Further, we include all one-edge vertices in the calculation, thus adding a1 to the output graph. In the following, we will first present an analysis of existing generic graph matching algorithms to justify our selection of the approximate algorithm by Neuhaus and Bunke. Subsequently, we introduce our algorithm for metamodel matching and an example calculation for the planar graph edit distance. Finally, we present an improvement in result quality by our k-max degree approach, which uses partial user-defined mappings as input. Since the seed mappings are based on selected elements presented to a user, we propose a simple, scalable, and efficient heuristic for the selection of these elements.

5.1.1 Analysis of graph matching algorithms

In the following we first perform our requirements analysis of graph matching algorithms. Second, we present suitable graph matching algorithms, separating them into exact and inexact algorithms. Finally, we compare them to our criteria, thus establishing the justification and basis for our planar graph edit distance matcher.

5.1.1.1 Requirements for metamodel matching

Our requirements analysis is based on the objectives defined in our problem analysis in Sect. 3.2.3. Recapitulating, we defined that a matching algorithm should (R1.1) exploit graph structure, (R1.2) improve correctness and completeness of existing matching systems, and (R1.3) have at most quadratic complexity.

When exploiting the graph structure, an algorithm should support directed labelled graphs, i.e. metamodel graphs. While supporting metamodel graphs, an algorithm should make maximum use of such structural information. That means a complete algorithm computing all possible matches should be preferred over an approximate algorithm. This is backed by the second requirement of improving correctness and completeness, which is naturally more likely to be met by a complete algorithm than by an approximate one. The last requirement of quadratic runtime is applied to graph matching algorithms to preserve the upper bound of the overall matching process. To summarize, the requirements for graph matching algorithms in the context of metamodel matching are:

1. Support of directed labelled graphs

2. Maximal use of structural information

3. Quadratic complexity

These requirements will be used in the following section for an overview and comparison of graph matching algorithms.

5.1.1.2 Overview of graph matching algorithms

Table 5.1 depicts an overview of the graph algorithms grouped into exact and inexact matching algorithms. These algorithms are arranged according to our requirements of temporal and spatial complexity, input graph restrictions, and whether they are approximate or complete algorithms. The selection of particular algorithms is based on their popularity and their complexity, which should be quadratic if it is not a standard algorithm. To cope with larger graphs, algorithms have been proposed which either restrict the input graphs to certain properties or approximate the solution. Table 5.1 shows in the first five rows an overview of exact algorithms. The algorithm by Ullmann [144] exhibits an exponential complexity and is therefore not applicable for metamodel matching, since it cannot be used for mid-size structures (more than 20 elements). The approximate algorithm by Cordella [16] shows a better behaviour, but has an unacceptable worst-case complexity. Luks [99] proposed an approach restricting the input graphs to vertices of bounded valence (degree, as in number of neighbours). In metamodels this is not given, therefore we neglect this approach, because enforcing the property would result in a huge loss of information. A restriction of the input graphs to unique labels has been proposed by Dickinson [19]. Again, metamodels do not show unique node labels, i. e. a fixed set of labels, thus we cannot apply the algorithm. Hopcroft [62] restricts the input to planar graphs, proposing a polynomial algorithm, but due to its worse than quadratic memory complexity, we cannot apply this algorithm for metamodel matching. The second group of rows in Tab. 5.1 shows three inexact algorithms which satisfy a polynomial execution time. The graph edit distance calculation proposed by Neuhaus [113] has a quadratic runtime and linear memory consumption by restricting the input to planar graphs. The approximate approach proposed by Riesen et al. [128] requires bipartite graphs as input and shows a quadratic runtime. Finally, Zhang [159] proposed an edit distance based on a tree representation showing a quadratic complexity both in runtime and space, where spatial complexity can be considered the same as memory consumption.

Algorithm         Time Comp.   Spatial Comp.   Restriction          Nature
Exact
Ullmann [144]     O(n!·n²)     O(n³)           –                    complete
Cordella [16]     O(n!·n)      O(n)            –                    approx.
Luks [99]         O(n⁵)        n.a.            bounded valence      complete
Dickinson [19]    O(n²)        n.a.            unique node labels   complete
Hopcroft [62]     O(n²)        n.a.            planar graphs        complete

Inexact
Neuhaus [113]     O(n²)        O(n)            planar graphs        complete
Riesen [128]      O(n²)        O(n)            bipartite graphs     approx.
Zhang [159]       O(n²)        O(n²)           trees                complete

Table 5.1: Overview and comparison of graph matching algorithms; n.a. – information not available, n – number of nodes

5.1.1.3 Summary

Based on the comparison of matching algorithms in Tab. 5.1 we selected the algorithm we want to apply in the context of metamodel matching. Recapitulating the requirements, every algorithm supports directed labelled graphs. The second requirement states a maximal use of structural information, i. e. complete algorithms are preferred over approximate ones; therefore [16, 128] are neglected. Additionally, the tree-based approach [159] is not used, because a tree representation of a graph naturally contains less information than the graph itself. The third requirement is an at most quadratic runtime; as a result [99, 144] are not considered by us. [19] requires unique node labels, which is not fulfilled by metamodel graphs; therefore we do not consider it. The standard algorithm by Hopcroft and Tarjan [62] shows memory issues and is consequently not considered. Finally, we selected [113] as the inexact algorithm for planar edit distance calculation in the context of metamodel matching, since it fulfils all of our requirements.

5.1.2 Planar graph edit distance algorithm

The approximate algorithm we propose to use for metamodel matching computes a lower-bounded Graph Edit Distance (GED), instead of the minimal GED, for two given input graphs. Thereby, the algorithm assumes planar labelled biconnected graphs; thus the input of the planar graph edit distance algorithm is a planar labelled biconnected source and target graph. The output of the original algorithm is a common subgraph and the corresponding GED. The output in terms of similarity is therefore one or zero: one if a vertex (element) is part of the common subgraph, zero if not. The algorithm calculates the minimal GED to reach a nearly maximal common subgraph. There are two parameters which influence the GED calculation: the costs for deletion and insertion as well as the function calculating the distance between two vertices. Finally, a seed match is needed to denote the starting point for the calculation. Algorithm 5.1 presents the algorithm in pseudo code, which consists of:

1. Fetch the seed match (or the next match)

2. Calculate neighbourhood distance to identify new matches

(a) Distance function based on linguistic and structural similarity
(b) Processing of neighbours’ attributes

3. Add new matches for next iteration and continue with step 2

1. Seed match In order to apply the algorithm in the context of metamodels, several adjustments and extensions are necessary. First, a seed match needs to be given which is used as a starting point for the neighbourhood edit distance calculation. Therefore, we define the match between the root package vertices as input.

2. Neighbourhood distance Starting with the seed match vs⁰ → vt⁰ (both metamodels’ root packages), the neighbours of the source vertex N(vs) and the neighbours of the target vertex N(vt) are matched against each other. The results are matches between a set of source and target vertices; this serves as input for the next iteration. The match between both neighbourhoods (N(vs), N(vt)) is calculated using a matrix of all neighbour vertices’ distances, which is searched for the minimal edit distances. Thereby, one takes advantage of the planar embedding, which ensures that the neighbourhood of vs, vt is ordered. So, there is no

Algorithm 5.1 Planar graph edit distance algorithm

1  input: Gs, Gt, seed match vs⁰ → vt⁰
2  variable: FIFO queue Q
3  add seed match vs⁰ → vt⁰ to Q
4  fetch next match vs → vt from Q
5  match neighbourhood N(vs) to the neighbourhood N(vt)
6  add new matches occurring in step 5 to Q
7  if Q is not empty, go to step 4
8  delete all unprocessed vertices and edges in Gs and Gt
9  return output mapping

need to calculate all distances for all permutations of all vertices of N(v) by applying any cyclic edit distance approach. Since the algorithm only needs to traverse the whole graph once, and the cyclic string edit distance based on the planar embedding has a complexity of O(d² log d) [122], the overall complexity is O(n · d² · log d), where n is the number of vertices and d the number of edges per node. The vertex distance function used is based on structural and linguistic similarity, with a dynamic parameter calculation for the weights of the distance function.
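To make the control flow of Algorithm 5.1 concrete, the following Python sketch reproduces the queue-driven traversal under simplifying assumptions: the cyclic string edit distance over the planar neighbourhood ordering is replaced by a greedy nearest-neighbour pairing, and all names (planar_ged_matching, vertex_distance, THRESHOLD) are illustrative rather than taken from our implementation.

from collections import deque

THRESHOLD = 0.74  # threshold T of Eq. 5.4, value reported in Sect. 5.1.2

def planar_ged_matching(graph_s, graph_t, seed_s, seed_t, vertex_distance):
    # graph_s / graph_t map each vertex to its (planarly ordered) neighbours;
    # vertex_distance stands in for the weighted distance of Eq. 5.1.
    queue = deque([(seed_s, seed_t)])            # steps 2-3: seed match in Q
    matches = {}                                 # accumulated output mapping
    while queue:                                 # step 7: until Q is empty
        vs, vt = queue.popleft()                 # step 4: fetch next match
        if (vs, vt) in matches:
            continue
        matches[(vs, vt)] = 1.0 - vertex_distance(vs, vt)
        # Step 5: match the two neighbourhoods. The real algorithm uses a
        # cyclic edit distance over the planar ordering; here each source
        # neighbour is greedily paired with its closest free target neighbour.
        free_targets = list(graph_t.get(vt, ()))
        for ns in graph_s.get(vs, ()):
            if not free_targets:
                break
            best = min(free_targets, key=lambda nt: vertex_distance(ns, nt))
            if 1.0 - vertex_distance(ns, best) >= THRESHOLD:
                free_targets.remove(best)
                queue.append((ns, best))         # step 6: feed next iteration
    return matches                               # step 9: output mapping

Because every vertex pair enters the queue at most once and each neighbourhood is scanned once, the sketch preserves the single-traversal character that keeps the original algorithm near-quadratic.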

2. (a) Distance function We defined the distance function of a vertex as a weighting of structural and linguistic similarity, since this allows us to consider structural and linguistic information and compensates for the absence of one of them. The linguistic similarity ling(vs, vt) is the average string edit distance between the vertex and edge labels. Thereby, we make use of the Levenshtein distance of the name matcher of MatchBox as given in Sect. 7.2.

dist(vs, vt) = α · struct(vs, vt) + β · ling(vs, vt), with α + β = 1   (5.1)

The structural similarity struct(vs, vt) is defined as the ratio of structural information, i. e. the attribute count ratio attr(vs, vt) and the reference count ratio ref(vs, vt), thus allowing for a fast computation. The parameters do not necessarily have to be the same as for the overall distance.

struct(vs, vt) = α · attr(vs, vt) + β · ref(vs, vt), with α + β = 1   (5.2)

The ratios of the attributes and references of a source vertex vs to those of a target vertex vt are defined as the attribute and reference count ratios, respectively.

attr(vs, vt) = 1 − |countattr(vs) − countattr(vt)| / max(countattr(vs), countattr(vt))   (5.3)

Finally, we defined the costs for insertion and deletion as being equal, since both operations are equivalent for the similarity calculation. Consequently, a vertex to be matched has to be deleted if no matching partner with a similarity greater than T can be found. The following threshold condition must hold to keep a vertex vs:

sim(vs, vt) = 1 − dist(vs, vt) ≥ 1 − costdel(vs) − costins(vt) = T   (5.4)

As an example, based on our evaluation data (see Sect. 7.3), we tested different values of T and obtained best results for T = 0.74. Moreover, the similarity of edges is reflected by their label similarity, because our experiments showed that this allows for better results than a numeric cost function. The parameters α and β denote a weighting between the structural and linguistic similarity and can be set constant, e.g. to 0.7 and 0.3. For a dynamic parameter calculation we adopted the Harmony approach of Mao et al. [101]. Since the problem is to find the maximum number of 1:1 matches between vertices, Harmony aims at calculating unique matches by evaluating combinations which allow for maximum similarity. This is computed in a matrix by selecting the maximums of rows and columns, deriving the ratio of unique to total matches. Indeed, we noted an improvement of more than 1 % using Harmony over a fixed configuration, considering the corresponding minor thesis [58] and our evaluation in Chap. 7.
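A minimal, runnable sketch of Eqs. 5.1–5.4 under the fixed configuration α = 0.7, β = 0.3, T = 0.74 reported above; the Vertex type, the reuse of the same weights in struct_sim, and difflib as a stand-in for the Levenshtein-based name matcher are our assumptions, not the MatchBox API.

from dataclasses import dataclass
from difflib import SequenceMatcher

ALPHA, BETA, T = 0.7, 0.3, 0.74   # weights and threshold as reported above

@dataclass
class Vertex:                      # minimal stand-in for a metamodel element
    name: str
    n_attrs: int
    n_refs: int

def ling_sim(vs, vt):
    # stand-in for the Levenshtein-based name matcher of MatchBox
    return SequenceMatcher(None, vs.name.lower(), vt.name.lower()).ratio()

def ratio(a, b):
    # count ratio of Eq. 5.3: 1 - |a - b| / max(a, b)
    return 1.0 if max(a, b) == 0 else 1.0 - abs(a - b) / max(a, b)

def struct_sim(vs, vt):
    # Eq. 5.2: weighted attribute and reference count ratios
    return ALPHA * ratio(vs.n_attrs, vt.n_attrs) + BETA * ratio(vs.n_refs, vt.n_refs)

def sim(vs, vt):
    # weighted combination corresponding to sim = 1 - dist (Eqs. 5.1/5.4);
    # following the worked example in Sect. 5.1.3, alpha weighs the linguistic part
    return ALPHA * ling_sim(vs, vt) + BETA * struct_sim(vs, vt)

def keep(vs, vt):
    # threshold test of Eq. 5.4: keep the pair only if sim >= T
    return sim(vs, vt) >= T

# usage: keep(Vertex("RetailLoyalty", 0, 3), Vertex("loyalty", 0, 3))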

2. (b) Additional processing The original algorithm only considers edges contained in biconnected components of a graph, because only then can an ordering be defined. That means only vertices reachable by at least two edges are considered. Consequently, all attributes would be ignored. Therefore, we extended the algorithm to match these vertices after the cyclic string edit distance for a specific vertex combination has been calculated. This only affects the complexity additively, so it still is nearly quadratic.

3. Add new matches The processing is performed until no more matches can be found, which occurs if no neighbours exist or they do not lead to a match. Thereby, we added a removal of invalid matches, i.e. when a class is matched onto an attribute or a package vertex the similarity results in zero. The same penalty applies to a match of opposite inheritance or containment directions. Finally, all unprocessed vertices and edges are removed, thus constructing a common subgraph based on the minimal GED. The output matches are created based on the matches calculated before; the output is formed by all matches of the edit path calculation and all corresponding similarity values.

[Figure: (a) POS metamodel excerpt, (b) ERP metamodel excerpt, (c) POS metamodel graph, (d) ERP metamodel graph]

Figure 5.2: Example metamodel excerpts and corresponding graph representations

5.1.3 Example calculation

In the following we will present an illustrative example for our planar graph edit distance calculation. The example is taken from the retail store integration scenario as given in Sect. 3.1.1. A reduced version of the POS and ERP metamodels is depicted in Fig. 5.2 (a) and (b). At the bottom the corresponding graph representations of the excerpts can be found; graph (c) corresponds to excerpt (a) and (d) to (b) respectively. In both excerpts we removed the attributes for a better overview. For simplicity we removed the labels of the classes and replaced them by the letters a, b, c, and d. Still, the information is preserved by the corresponding edge labels. The graph representation will be used in the example calculation which is given in Fig. 5.3. The example depicts three steps of the edit distance calculation with the similarity values of the elements calculated. These similarity values represent the costs of relabeling an edge and thus determine if a path should be followed. The root package vertex and thus the starting point is given by the elements a and u, which are highlighted by a bold border in Fig. 5.3 at the top.

(i) First iteration The first step of the algorithm is to examine the neighbourhood of a and u by following all outgoing edges. In the figure this is highlighted in dark grey. The labels of the edges are given and represent edge and reference names or, if none are present, the target element’s name. In the example both containments from a to b and a to c have no name, therefore the edges are labelled according to the target elements’ labels RetailLoyalty and RetailCustomer.

[Figure content, iteration 1: sim(b,v) = 0.7·ling(b,v) + 0.3·struct(b,v) = 0.7·0.53 + 0.3·1.0 = 0.671; sim(c,v) = 0.2235. Iteration 2: sim(a,u) = 0.7·0.42 + 0.3·1.0 = 0.594; sim(c,x) = 0.7·0.5 + 0.3·0.5 = 0.5; sim(a,w) = 0.305; sim(a,x) = 0.34; sim(c,u) = 0.096; sim(c,w) = 0.19. Iteration 3: sim(d,v) = 0.7·0.06 + 0.3·0.3 = 0.102; sim(d,w) = 0.7·0.0 + 0.3·0.5 = 0.15; sim(a,v) = 0.7·0.11 + 0.3·0.66 = 0.248; sim(b,w) = 0.7·0.5 + 0.3·0.6 = 0.39]

Figure 5.3: Example calculation of planar graph edit distance

In contrast, the edge connecting u and v is already labelled as loyalty. After identifying the vertices’ neighbours, the similarity of these source and target elements is calculated, i.e. c and v, and b and v have to be compared. The similarity between b and v is determined by the weighted sum of the linguistic and structural similarity. The linguistic similarity of RetailLoyalty and loyalty is 0.53 using the Levenshtein distance [153], as also used in our name matcher. The structural similarity is determined by the ratio of the three edges of b to the three edges of v, which calculates as 1. For our example we use the following parameters: α = 0.7, β = 0.3 and T = 0.5. As a consequence the weighted sum is calculated as 0.7 · 0.53 + 0.3 · 1 = 0.67. Since 0.67 > 0.5, the pair (b, v) is put into the list of matches and will be considered in the next iteration. The pair (c, v) yields a similarity of 0.22 and will be discarded.
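The first-iteration arithmetic can be checked directly; the snippet below simply re-evaluates the weighted sum for (b, v) with the values stated above.

# Re-computation of the first-iteration similarity for (b, v):
# ling(RetailLoyalty, loyalty) = 0.53, structural ratio 3/3 = 1.0.
alpha, beta, threshold = 0.7, 0.3, 0.5

sim_bv = alpha * 0.53 + beta * 1.0
print(round(sim_bv, 2), sim_bv > threshold)   # 0.67 True -> (b, v) is kept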

(ii) Second iteration The next iteration starts with the previously obtained pair (b, v), again highlighted by a bold outline. The pairs for comparison are all combinations of a, c and u, w, x. Again, these pairs are added to the list of matches and are part of the algorithm’s output.

(iii) Third iteration Finally, we consider the following iteration of (c, x), where the already calculated pairings are considered and combined with the newly calculated pairings of (a, v), (b, w), (d, v) and (d, w) for the final output. In summary, the algorithm first calculated a correspondence between (b, v) and extended it to the pairs (a, u), (c, x). Along with these pairings the similarity values form the output of the planar graph edit distance. This calculation does not retrieve all mappings, because it misses the pairing (b, w). Therefore, we investigated further improvements of the algorithm as given in the subsequent section.

5.1.4 Improvement by k-max degree partial seed matches

Neuhaus and Bunke stated possible improvements by one seed match [113]; because the algorithm relies on its starting point, we had a closer look at how to handle one or more seed matches. Following this, we investigated two possible approaches: (1) with or (2) without user interaction. The approach without user interaction is to use the default root package matches (one seed match). Taking user interaction into account allows for further improvement in quality, but requires additional effort, because a user has to define mappings for a given source and target mapping problem. Therefore, we decided to follow a simple and scalable approach which grants a maximum of information provided. To keep the interaction simple, we require the user to give feedback only once, on a reduced matching problem. Consequently, we present the user with a selection of source and target elements rather than letting him decide on the whole range of elements. That means the user only needs to match a subset. The subset calculation by k-max degree follows a scalable approach by applying a simple heuristic. K-max degree is based on the rationale that the seed mapping elements should have the most structural information. Investigating our evaluation data, we noted a normal distribution of edges per node, so the probability for vertices to match is highest if they have the highest degree and thus the most structural information. The computation can be done in linear time and therefore provides a scalable approach.

Matchable elements selection We propose to select the k vertices with maximal degree, i. e. the ranking of the k vertices which contain the most structural information, as sketched below. So for a given k and graph G(V,E) we define k-max degree as:

degreemax(G, k) = {v1, v2, ..., vk | degree(vi) ≥ degree(vi−1) ∧ vi ∈ V}   (5.5)
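Since the selection only ranks vertices by degree, it amounts to a single pass over the graph plus a sort; the following Python sketch (our naming, with the graph as a plain adjacency mapping) illustrates Eq. 5.5.

# k-max degree selection (Eq. 5.5): return the k vertices of highest degree.
# graph is assumed to be an adjacency mapping {vertex: list of neighbours}
# in which every incident edge of a vertex is listed.

def k_max_degree(graph, k):
    return sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k]

# For the example of Fig. 5.4 with k = 2, this would return the two source
# vertices of degree 4 (1.a, 2.c) as seed-mapping candidates for the user.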

[Figure: (a) k-max degree selection over source and target metamodel, (b) seed mapping specification by the user]

Figure 5.4: Illustrative example of k-max degree selection and user interaction by mapping specification

Seed matches The user-defined mappings yield information on matches and non-matches which are used by the planar GED matcher. These seed matches are included in the algorithm by adding them to the list of all matches. The planar GED matcher makes use of them during the graph edit path calculation as well as during the cyclic edit distance calculation. That means the calculation of a match as in Alg. 5.1, line 5, returns the value 1.0 for seed matches and 0 for non-matches. An example of the k-max degree seed match selection and specification is depicted in Fig. 5.4; the graphs are abstract representations of the former example of the retail store integration and edit distance calculation. In the example we choose k = 2, i.e. we select the two vertices with a maximum degree. The first step (a) is to select the relevant vertices in the source and target metamodel. The degrees of the vertices 1.a and 2.c are both 4, therefore they are selected as mapping candidates. In the target metamodel the selection results in the vertices 1.u and 2.x, which both have a degree of 3. The selection is followed by the mapping specification of a user as given in step (b) in Fig. 5.4. The nodes with the maximum degree are presented as candidates and a user has to specify mappings between them. These mappings, as well as the information regarding non-matches, i. e. matches which were dismissed by a user, are used for the graph edit distance calculation. The results show a considerable improvement, which will be discussed in our evaluation (Sect. 7.5.1.4).

5.2 Graph Mining Matcher

The GED calculates a global graph-based similarity but does not consider redundant information. To actually exploit redundant information, i. e. patterns, they first have to be discovered in a metamodel and second they have to be mapped.

[Figure: the three phases (1) pattern mining, (2) pattern mapping, and (3) embedding mapping, leading from source and target metamodel via patterns and embeddings of mapped patterns to the output mapping]

Figure 5.5: General mining-based matching process

We call the discovery phase pattern mining and split the mapping phase into pattern mapping and embedding mapping. Consequently, we propose three phases of mining-based matching, namely:

1. Pattern mining,

2. Pattern mapping,

3. Embedding mapping.

These three phases are depicted in Fig. 5.5. Since graph theory provides mature algorithms for pattern mining, we propose to apply generic graph mining algorithms in the first phase. In this phase a set of reoccurring patterns and their embeddings for two given input metamodels is identified. In the second phase we propose to map the patterns of the sets onto each other, identifying possible correspondences; thereby, several matching techniques can be used. The third step of embedding mapping calculates the actual output mapping. This is done by comparing the embeddings of mapped patterns with each other in order to discover element mappings; a skeleton of this process is sketched below. Reoccurring information can occur for two reasons: design patterns or redundant modelling. Therefore, we propose two mining approaches for these two kinds of patterns. We base our design pattern (DP) matcher on gSpan [157] and our redundancy matcher on GREW [88]. We adopted and integrated both algorithms in our matching process as described before and extended them to handle metamodel graphs and especially to take linguistic information into account. Thereby, the algorithms operate on a common graph model, which is a directed labelled (typed) graph. The graph model defined in Sect. 2.2.2 is extended to reflect the labels as type classes. The type classes are used to represent linguistic information, since the algorithms rely on precomputed similarity classes (edge and vertex types) while mining. In the following we present our analysis of graph mining algorithms and justify our choice of gSpan as well as of GREW.
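As a skeleton (not our implementation), the three phases compose as follows; the phase functions are passed in as parameters, since they are instantiated differently for the design pattern and the redundancy matcher.

# Skeleton of the three-phase mining-based matching process of Fig. 5.5.
# mine_patterns, map_patterns and map_embeddings are the phase
# implementations (e.g. gSpan- or GREW-based mining); they are parameters
# here because the two matchers instantiate them differently.

def mining_based_matching(source_graph, target_graph,
                          mine_patterns, map_patterns, map_embeddings):
    # (1) pattern mining: reoccurring patterns and their embeddings
    src_patterns = mine_patterns(source_graph)   # {pattern: [embeddings]}
    tgt_patterns = mine_patterns(target_graph)
    # (2) pattern mapping: correspondences between the two pattern sets
    pattern_pairs = map_patterns(src_patterns, tgt_patterns)
    # (3) embedding mapping: compare embeddings of mapped patterns to
    #     derive the element-level output mapping
    mapping = []
    for ps, pt in pattern_pairs:
        mapping.extend(map_embeddings(src_patterns[ps], tgt_patterns[pt]))
    return mapping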

5.2.1 Analysis of graph mining algorithms

Following the structure of the GED section, we first provide our requirements on mining algorithms for the purpose of metamodel matching. We then provide an overview of graph mining algorithms, separating them according to their setting of either single graph or graph transaction. This is followed by a comparison of the algorithms presented to derive the base algorithms for our design pattern and redundancy matcher.

5.2.1.1 Requirements for metamodel matching

Patterns in metamodels occur for two reasons: either they are the result of applying design patterns [43] or of redundant modelling. Thus, the first requirement (R1.1), as formulated in our problem analysis in Sect. 3.2.3, is to exploit exactly that information. Since the algorithms should lead to an improvement (R1.2) of quality, they should make maximal use of the available information and thus support metamodel graph properties, that is, directed labelled graphs including reflexive edges (self references). An additional consequence is the requirement to save embeddings during pattern discovery, because the final output mapping is calculated between elements, i. e. embeddings. The third requirement (R1.3) of Sect. 3.2.3 is an at most quadratic complexity in runtime. This requirement cannot be fulfilled by graph mining algorithms, because they inherit NP-complete complexity from subgraph isomorphism detection [14]¹. The reason for this is given by the task of discovering frequent patterns in one or more graphs. Thereby, a starting pattern has to be generated, incrementally extended, and tested for subgraph isomorphism on the input graphs. The consequence is an overall exponential runtime due to the isomorphism tests of the candidates generated [14]. Moreover, the complexity of mining algorithms depends on the input graph size, frequency of patterns, pattern size, etc. This makes a complexity analysis a difficult task which has not been done so far. However, there has been work comparing mining algorithms on common data sets, as in [103], that shows results for both runtime and memory. Consequently, we state the requirements of low runtime and memory consumption for an algorithm relative to the known approaches. To summarise, our requirements for graph mining algorithms are:

1. Exploit redundant and design information,

2. Saving of pattern embeddings,

¹There is one mining algorithm that makes use of planarity. However, it has been shown that such an approach suffers from a high constant c for O(c · n²) and is only feasible for pattern graphs with more than 800 nodes [86]. The patterns mined here are much smaller and thus the approach is not applicable.

3. Support of metamodel graph properties,

4. Low runtime,

5. Low memory consumption.

5.2.1.2 Overview of graph mining algorithms

An overview and comparison of graph mining algorithms is depicted in Tab. 5.2. The first group at the top of the table gives an overview of the popular single graph setting algorithms; for a complete survey please refer to [89]. As confirmed by [14, 89], the most popular and first single graph setting algorithm is SUBDUE [1]. The original algorithm was presented in 1994 and has been improved since [79]. The algorithm follows a greedy search by incremental pattern extension. Thereby, the extended patterns are checked for their compression of the input graph, that is, the replacement of all pattern embeddings by one vertex and the resulting final graph size. The best compressing patterns form the final output. The GBI [106] algorithm follows a similar principle of replacing structures (reduction) but does not check for compression but rather for the entropy of the patterns. GREW [88] is dedicated to pattern extraction from one large graph. So far, complete approaches like SiGram [89] are not applicable to large graphs (with more than 100 vertices). GREW is approximate, so it is applicable in large-scale scenarios while still remaining memory efficient. The graph transaction setting deals with pattern mining on a set of graphs. Several algorithms have been proposed, as given by a survey in [14]; therefore we only present a selection based on popularity and acceptance in the second group in Tab. 5.2. The first graph transaction setting mining system was proposed in AGM [66], which introduced an apriori level-wise approach. Thereby, all patterns which consist of one vertex serve as starting points. These patterns are merged recursively, where each step forms a level in the pattern search. Each merging is followed by a check for embeddings, i. e. the pattern is validated. The maximal valid patterns form the final output. To prevent redundant checks, patterns are encoded and merged according to certain rules. This approach has been further refined by FSG [87] by optimizing the performance, as was done in FFSM [65]. gSpan [157] followed a similar approach but introduced a depth-first-search based encoding system for the graphs and patterns, which is more efficient in terms of computation time and memory [79]. A further improvement of gSpan can be found in CloseGraph [158], which compresses patterns and thus reduces the memory consumption. In [103] these mining systems have been compared on a common data set and implementation, concluding that gSpan is the fastest and most memory efficient system.

Algorithm          Nature     Memory  Runtime  MM Graph  Emb.
Single graph
SUBDUE [1]         approx.    +       –        ✓         ✓
GBI [106]          approx.    –       –        ✓         ✓
SiGram [89]        complete   –       +        ✗         ✓
GREW [88]          approx.    +       ++       ✓         ✓

Graph transaction
AGM [66]           complete   +       ––       ✓         ✗
FSG [87]           complete   +       –        ✓         ✓
FFSM [65]          complete   ––      +        ✓         ✓
gSpan [157]        complete   +       +        ✓         ✓
CloseGraph [158]   complete   ++      +        ✓         ✓

Table 5.2: Overview and comparison of graph mining algorithms; ✓/✗ – support/no support; –/+ – more/less memory/runtime

5.2.1.3 Summary

We selected two algorithms to serve as a basis for mining-based metamodel matching. We chose two algorithms because of the two reasons for patterns in metamodels, which are (1) redundant information and (2) design patterns. (1) Information is redundant if it occurs more than once in one metamodel, thus we investigated algorithms of the single graph setting. Since graph mining algorithms show an exponential complexity, we weighted runtime and memory as most important. GREW shows the best runtime compared to the other algorithms [88] and is especially applicable to large-scale scenarios, in contrast to SiGram, which due to its complete nature needs to save every embedding, resulting in an increased memory consumption. Since the other algorithms [79, 106] are also approximate, show a worse runtime than GREW, and are not scalable [88, 102], GREW consequently forms the basis of our redundancy matcher. Since (2) design patterns occur across metamodels, we selected an algorithm from the graph transaction setting. Again, memory consumption and runtime are of interest, as well as support for metamodel graphs and embedding saving. AGM [66] is not capable of saving embeddings and needs to be discarded. Furthermore, a study [103] compared the gSpan, FSG, and FFSM algorithms, concluding that gSpan outperforms the others w.r.t. memory and runtime. The proposed enhancement of gSpan, i. e. CloseGraph [158], even improves the memory consumption while preserving its runtime. Therefore, we chose this algorithm as the basis for our design pattern matcher. In the following we will present the graph model on which both mining matchers operate.

Figure 5.6: Example for mining graph model types using similarity classes. The figure shows the example metamodel (top left) and the resulting typed graph (top right); pair similarities: P1 = (RetailTransaction, RetailLoyalty) = 0.47, P2 = (RetailTransaction, RetailCustomer) = 0.41, P3 = (RetailTransaction, AddressCustomer) = 0.12, P4 = (RetailCustomer, AddressCustomer) = 0.57; threshold t = 0.4; resulting classes S1 = R = (P1, P2), S2 = A = (P4)

5.2.2 Graph model for mining-based matching

The mining of metamodels is done using a typed graph model. Thereby, all vertices and edges are grouped into classes of similar elements to encode linguistic and metamodel type information. The final type of an element e is a tuple (t, S) where t is the metamodel type of the element, such as class or attribute, and S a similarity class.

A similarity class S is defined by a two-step process. First, all elements are sorted according to their type, i.e. class, reference, operation, attribute, or package. These groups of different element types are then split into sub-similarity classes based on name similarity. Thereby, a pair of elements (ei, ej) is assigned to a similarity class if sim(ei, ej) > t, with t being a fixed threshold² and ei the first element in a type class. The similarity class assignment is repeated for all assigned elements ej and unassigned elements eu, this time propagating a potential error by the condition sim(ei, ej) · sim(ej, eu) > t. The multiplication prevents too large similarity classes. For instance, let sim(a, b) = 0.7, sim(b, c) = 0.7, and t = 0.7, and to demonstrate the intransitivity of string similarity let sim(a, c) = 0.2. If we only apply a fixed threshold, a, b, and c would be added in two steps to the same class. However, if we take the error of a similarity of 0.7 into account, sim(a, b) · sim(b, c) = 0.49, so c would not become part of the class, but rather create a new one. Finally, all similarity classes are assigned a label, thus establishing the similarity classes S.

Figure 5.6 depicts an example of the similarity class calculation and the resulting graph. On the top left the example metamodel is depicted and on the top right the resulting graph with the similarity classes as labels.

²The evaluation in the context of a master thesis [112] showed that t = 0.7 yields the best results in terms of correctness (precision) and completeness (recall).
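For illustration, the two-step construction can be sketched as follows (Python; the sim function is a stand-in for the linguistic matchers used in this thesis, so absolute values differ, and all names are ours):

from difflib import SequenceMatcher

def sim(a, b):
    # Stand-in string similarity; the thesis uses linguistic matchers
    # such as tri-grams, so the absolute values will differ.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_similarity_classes(elements, t=0.7):
    """Group elements of one metamodel type into similarity classes.
    An element joins a class only if the product of similarities along
    the chain from the class seed stays above t (error propagation);
    otherwise it seeds a new class."""
    classes = []  # each class: list of (element, chain weight) pairs
    for e in elements:
        placed = False
        for cls in classes:
            for member, weight in cls:
                w = weight * sim(member, e)
                if w > t:
                    cls.append((e, w))
                    placed = True
                    break
            if placed:
                break
        if not placed:
            classes.append([(e, 1.0)])  # new class seeded by e
    return [[m for m, _ in cls] for cls in classes]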

Figure 5.7: Design pattern matcher process: (1) pattern mining on the source and target metamodel simultaneously, (3) mapping of the embeddings of the mapped patterns

The dashed lines represent the correspondences between elements. On the lower left the similarity class calculation is shown and on the lower right the resulting classes; pairings of elements are noted as Pi. The example threshold has been set to t = 0.4, so the similarity class S1 = (P1, P2) results from the first comparison of all classes with the first element of the type class, that is RetailTransaction. In the absence of our error propagation, the second iteration would add the class AddressCustomer because of its similarity to RetailCustomer. This is not desirable, because P2 · P4 = 0.23; therefore P4 is assigned to a new similarity class S2 = (P4). S1 is assigned the label R and S2 the label A, which are used to type the vertices. For simplicity we omitted attributes and references, but they are considered in the same manner.

The resulting typed graph, labelled according to the metamodel types (t) and the similarity classes (S), is used by the design pattern matcher and the redundancy matcher for mining patterns. The linguistic similarity is ensured by the previous similarity class calculation. In the subsequent sections we will describe our two mining matchers in detail along the three steps of mining-based matching mentioned before.

5.2.3 Design pattern matcher

One type of patterns are design patterns, which occur in both the source and the target metamodel. Therefore, we propose an approach which searches for patterns in both metamodels simultaneously. We base our approach on gSpan [157], whose main idea is mining by incremental pattern extension. Starting with a trivial pattern of one edge, it is extended incrementally until no further extension is possible. The possibility of an extension is defined by its occurrences (embeddings) in both metamodels.

The design pattern matcher process is depicted in Fig. 5.7. First, both graphs are searched for common patterns; thus the pattern mapping step is obsolete, because a pattern is only found if an embedding

exists in both graphs. The subsequent mapping of the patterns' embeddings yields the final mappings. The algorithm as given in Alg. 5.2 performs the pattern identification and mapping and finally the mapping of the pattern embeddings. The first step is a pre-processing (Alg. 5.2, line 3). The two input metamodel graphs are processed as described in the previous section to establish the similarity classes, which serve as the types. These types allow for isomorphism tests using structural and linguistic information. Next, all frequent edge types (edges having a frequency of more than one) are determined and used for the actual pattern mining (line 4).

Algorithm 5.2 Design pattern matcher algorithm

1  input: source Gs(Vs, Es), target Gt(Vt, Et)
2  variables: pattern set found
3  pre-process Gs, Gt
4  for each ef ∈ {e | freq(e) > 1 ∧ e ∈ Es ∪ Et}
5      add patterns of minePatterns(ef) to found
6      mark edge as visited
7  for each p ∈ found
8      if p is relevant
9          add (emb(p, Gs), emb(p, Gt)) to output mapping
10 return output mapping

5.2.3.1 Pattern mining

The pattern mining follows the steps proposed by gSpan [157], which are shown in pseudocode in Alg. 5.3. In detail these steps are:

1. Mine for design patterns

(a) Mining of patterns by incremental extension of frequent edges
(b) Repeat pattern extensions until no frequent pattern is found

2. Filter patterns

The first pattern identification step is (1) the pattern mining. Given an ordered list of frequent edge types, the most frequent edge type is selected as the first pattern. Starting with this one-edge pattern, a possible extension on the basis of all embeddings in the first graph (Gs) is calculated (Alg. 5.3, line 3). Second, it is checked whether this extension is also possible in the second graph Gt. That means, if there exist embeddings of this pattern in Gt (line 5), the pattern will be extended if possible, i.e. an edge will be added to the pattern and checked in all its embeddings (Alg. 5.3, line 6). The extension steps are repeated for all frequent edge types until all have been processed.

Algorithm 5.3 Pattern mining in design pattern matcher (minePatterns)

1  input: Gs, Gt, pattern p
2  output: pattern set found
3  ext ← extensions of p in Gs
4  for each x ∈ ext
5      if x exists in Gt
6          x.embeddings ← emb(x, Gs) ∩ emb(x, Gt)
7          if x is closed
8              add x to found
9  for each px ∈ ext
10     if px is extendable
11         minePatterns(Gs, Gt, px)
12 return found

Such an (1.a) extension of a pattern is built by adding an edge under the restriction of a depth first search (DFS) tree. That means, each edge to be added is enumerated, and the resulting graph has to comply with an enumeration imposed by a DFS tree. Additionally, back-edges have to be inserted first, i.e. edges that are not part of the DFS tree but rather point from a DFS tree node back to another one. This DFS restriction as given in the original algorithm defines an order on the possible extensions and thus reduces the search space [157]. If an extension only occurs in one graph, the pattern is not frequent and all extensions derivable from it can be discarded. This is due to the antimonotonicity of the frequency measure, which states that an infrequent subgraph cannot become frequent by adding further edges [157].

The (1.b) extension of the current pattern is repeated until no embedding can be found in the source or target graph. Thereby, each extension pattern and the corresponding embeddings are saved. To reduce memory consumption we apply a compression technique, the so-called closed graphs [158]. That means, if an extension of a pattern occurs in every embedding, the unextended pattern does not need to be saved along with its embeddings, because it is a subset of this extension. This condition is called equivalent occurrence, meaning that the count of all embeddings of a pattern and the count of all embeddings of the pattern extended by an edge are equal. Applying this technique reduces memory consumption, thus making the approach applicable to larger metamodels (more than 1,000 elements). In the following iteration the next frequent edge type is chosen and again extended as given in step (1.a).

The resulting set can contain trivial and misleading patterns, leading to misleading matches which may affect the result quality. In addition, the exponential runtime complexity depends on the number of patterns as well as their size.
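The two pruning rules translate into a simple check per extension; the following is an illustrative sketch (the embedding bookkeeping is our simplification, not the gSpan/CloseGraph implementation):

def classify_extension(pattern_embs_s, pattern_embs_t, extension_embs):
    """Decide how to handle a pattern after computing an extension.
    No embedding in one graph: infrequent, and by antimonotonicity no
    extension can become frequent, so the whole branch is pruned.
    Equivalent occurrence: the extension occurs in every embedding of
    the pattern, so only the extension is stored (closed graphs)."""
    if not pattern_embs_s or not pattern_embs_t:
        return "prune branch"
    if len(extension_embs) == len(pattern_embs_s) + len(pattern_embs_t):
        return "store extension only"
    return "store pattern and extension"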

To improve result quality and runtime we propose to (2) filter the set of found patterns based on their relevance (Alg. 5.2, line 8). For the relevance calculation we propose a weighted approach which relates the size and the frequency of a pattern. The relevance of a pattern p is calculated by

r(p) = |p|^α · freq(p)^β

where |p| is the size of the pattern (number of edges and vertices) and freq(p) its frequency (number of embeddings). The parameters α and β determine the weight of the influence of the size or the frequency of a pattern. For instance, if β is negative, the frequency has to be small for a higher relevance.
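As a sketch, this relevance ranking is direct to implement (the α and β defaults below are illustrative, not values from our evaluation):

def relevance(size, freq, alpha=1.0, beta=-0.5):
    # r(p) = |p|^alpha * freq(p)^beta; a negative beta favours rare patterns
    return size ** alpha * freq ** beta

# Rank (size, frequency) pairs of mined patterns, most relevant first.
patterns = [(6, 2), (2, 9), (4, 4)]
print(sorted(patterns, key=lambda p: relevance(*p), reverse=True))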

5.2.3.2 Pattern mapping

Since patterns are only mined as valid patterns if they occur in both graphs, i.e. if there exist embeddings in both, there are no patterns exclusive to the source or target metamodel. As a result, there is no need to map patterns, but only to map the respective embeddings.

5.2.3.3 Embedding mapping

The previous mining step calculated patterns and their corresponding embeddings in the source and target graph. Consequently, all combinations of source and target embeddings of the same pattern map onto each other. Therefore, we propose to create the cartesian product of the source and target embeddings as output mappings (Alg. 5.2, line 9). These mappings are on pattern level and thus between subgraphs. The element-wise mappings do not need to be calculated, because they are defined by the elements' respective positions in the pattern they belong to.
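A minimal sketch of this step, assuming each embedding is a tuple of elements ordered by pattern position (names are ours):

from itertools import product

def embedding_mappings(source_embs, target_embs):
    """Cartesian product of the source and target embeddings of one
    pattern; element correspondences follow position-wise."""
    return [list(zip(s, t)) for s, t in product(source_embs, target_embs)]

# Two source embeddings and one target embedding of a two-element pattern.
print(embedding_mappings(
    [("RetailCustomer", "Address"), ("Customer", "CustomerAddress")],
    [("Client", "ClientAddress")]))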

5.2.3.4 Example

We will work on the example depicted in Fig. 5.8 to demonstrate the principle of our design pattern matcher. Imagine two metamodels in a typed graph representation as described in Sect. 5.2. There exist the following similarity classes for the source and target metamodel: A and R for classes, and a solid or dashed line for references. For the sake of simplicity we depicted reference types as lines and skipped attributes. The similarity classes (types) are used to determine valid assignments and thus extensions, i.e. only elements of the same type are assignable.

(a) Start pattern The pattern mining step processes all edge types, so we depict one exemplary type, which is the solid line. The pattern of two vertices and an edge of type solid line is shown on the top left (Fig. 5.8 a). The vertices are enumerated according to a DFS convention. The embeddings of this pattern are highlighted in the source and target metamodel by bold lines; one embedding is highlighted in each metamodel. Please note that we only highlighted one embedding each for a better overview; indeed there exist three embeddings in the source metamodel and two in the target metamodel.

Figure 5.8: Example for pattern mining by the design pattern matcher: (a) start pattern, (b) first extension, (c) invalid extension, (d) complete extension; each panel shows the pattern next to its embeddings in the source and target metamodel

(b) First extension The first extension of the pattern by an edge is shown in Fig. 5.8 (b). This extension is valid, because each metamodel contains an embedding of the extended pattern. Again the nodes are labelled according to a DFS convention.

(c) Invalid extension The next extension shows the negative case with an invalid extension (Fig. 5.8 c). The pattern is extended according to the DFS convention by a fourth vertex with the dashed line type. This pattern has an embedding in the target graph, as highlighted in bold. However, there is no embedding in the source graph for the given edge type. This particular extension and all extensions resulting from it are discarded for further mining.

(d) Complete extension An example of a completely extended pattern is shown at the bottom of Fig. 5.8. The pattern has been extended by a fourth vertex with the dashed line type, which has embeddings in both metamodels. Finally, the pattern already determines the mappings between the source and target elements by the position of the vertices within the pattern. We depicted the mapping between those elements by a dashed line with an arrow. These mappings form the output of the pattern mapping step and are used for similarity value calculation for all elements, e.g. by the name path matcher (see Sect. 7.2).

5.2.3.5 Remarks

The graph mining algorithm has to solve two fundamental problems. First, it has to ensure DFS-tree conforming extensions. This is done by a canonical code building, which is an NP-complete problem and exponential in the size of the patterns. Second, all embeddings of a pattern, i.e. all subgraph isomorphisms, have to be calculated, which is also NP-complete.

The algorithm's complexity can be bounded by O(k·f + r·f), where f is the count of frequent subgraphs, k is the maximal count of subgraph isomorphisms of a pattern, and r is the count of non-DFS-conforming extensions that have to be filtered. That means, k·f bounds the maximal number of subgraph isomorphism computations, while r·f bounds the number of extension filterings.

Consequently, the algorithm's runtime needs to be reduced in order to make it applicable in metamodel matching scenarios. Therefore, we proposed to increase the number of edge and vertex types by the similarity class calculation. This decreases k, because a pattern has fewer isomorphisms in the presence of more types. We also limited the degree of a vertex to d, which limits the possible extensions. Finally, we also introduced a maximum pattern size to reduce the filtering calculations.

5.2.4 Redundancy matcher

Since redundant information occurs as frequent subgraphs in one graph, we propose to mine for redundant information based on the established approximate graph mining algorithm GREW [88]. The basic idea is to reduce a graph by removing edges of one type and merging their connected vertices. This is based on the edge contraction principle as defined in Def. 9 in Sect. 2.2.3.1. Interestingly, edge contraction is contrary to the design pattern mining, which incrementally extends a pattern. In contrast, edge contraction reduces the graph step-wise until no further reduction or contraction is possible; thus attributes are merged into classes, classes with classes, etc.

Figure 5.9: Redundancy matcher process: (1) independent pattern mining on the source and target metamodel, (2) pattern mapping (planar GED), (3) embedding mapping

Figure 5.9 depicts the matching process of the redundancy matcher. In contrast to the design pattern matcher, the mining is not performed on the two metamodel graphs simultaneously but independently of each other. The patterns extracted are compared to each other in a second step using our planar graph edit distance, and finally their embeddings are mapped. Algorithm 5.4 shows the redundancy matcher's steps. The pre-processing (line 3) is the same as for the design pattern matcher and creates a typed graph using the similarity class calculation as described in Sect. 5.2. Lines 4 and 5 show the independent pattern identification, whereas line 7 specifies the distance computation between all identified patterns. Finally, the mappings are created as given in lines 8 and 9. In the following sections we will detail each of these steps.

Algorithm 5.4 Redundancy matcher algorithm

1  input: source Gs(Vs, Es), target Gt(Vt, Et)
2  variables: source patterns founds, target patterns foundt
3  pre-process Gs, Gt
4  add relevant patterns in minePatterns(Gs) to founds
5  add relevant patterns in minePatterns(Gt) to foundt
6  for each ps in founds
7      for each pt in foundt with minimal distance to ps
8          create mapping(ps.embeddings, pt.embeddings)
9  return output mapping

5.2.4.1 Pattern mining

The pattern mining, adapted from the GREW algorithm [88] and shown in Alg. 5.5, comprises the following steps:

1. Mine for redundant patterns

(a) Determine the most frequent edge type for contraction
(b) Contract independent edges if frequent
(c) Repeat until no further contraction is possible

2. Repeat the mining for redundant patterns with a new edge type

3. Filter patterns

The input graph of the source metamodel is mined for redundant information by (1.(a)) determining the most frequent edge type as the starting type. The type te of an edge e = (v, u) connecting the elements v and u is defined as te = (e, tu, tv), i.e. by the edge itself and its incident vertices. An edge type te is frequent if |emb(te)| > fmin, i.e. if more than fmin non-overlapping edges of type te exist. The starting type is the one with the maximal number of occurrences (Alg. 5.5, line 7). For this frequent edge type tf an overlay graph Go is constructed. A vertex in Go represents an occurrence of the edge type. An edge in Go is created if two occurrences share a vertex in G. If two edges to be contracted are adjacent, one needs to be chosen. For the best result quality we propose to apply a maximal independent set calculation on these vertices using a greedy algorithm as given in [52] (lines 8 and 9), because this aims at a maximal number of contractions and thus maximal pattern size.
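For illustration, the selection of non-overlapping occurrences can be sketched with a greedy scan (a stand-in for the maximal independent set algorithm of [52]; the occurrence representation is ours):

def independent_edge_occurrences(occurrences):
    """Greedily pick a maximal set of pairwise non-overlapping
    occurrences of one edge type; two occurrences conflict, i.e. are
    adjacent in the overlay graph Go, if they share a vertex."""
    used, independent = set(), []
    for u, v in occurrences:
        if u not in used and v not in used:
            independent.append((u, v))
            used.update((u, v))
    return independent

# The middle two occurrences share vertex 'r2', so only one survives.
print(independent_edge_occurrences(
    [("r1", "a1"), ("r2", "a2"), ("r2", "a3"), ("r4", "a4")]))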

Algorithm 5.5 Pattern identification for redundancy matcher (minePatterns)

1  input: graph G
2  variables: found
3  ζ ← G
4  while frequent edges exist
5      for each ef ∈ {e | freq(e) > 1 in ζ}, ordered by frequency
6          calculate maximal independent set M for type tf of ef
7          if |M| > minFreq
8              add pattern represented by tf to found
9              mark every edge in M
10     contract marked edges in ζ
11     remove marked edges from G
12 return found

Afterwards, (1.(b)) the maximal independent set of edges is contracted. That means the determined edges as well as their incident vertices are replaced by a multi-vertex (line 10). A multi-vertex is a vertex w that represents an embedding of a pattern, i.e. the edge e and its incident vertices u, v. The edges incident to u, v are replaced by multi-edges which are connected to the multi-vertex w. Consequently, the multi-vertex is connected with the original graph and represents the original edge and its position in the subgraphs of the incident multi-vertices. By this strategy, smaller patterns are joined into larger patterns in every iteration.

The (1.(c)) edge contraction is repeated until no further contraction is possible. This occurs if the graph is reduced to two remaining classes or no independent set can be calculated.

The (2.) mining process is repeated, because the results of the algorithm depend on the type chosen for contraction. Consequently, patterns to be found are possibly missed. This behaviour explains the approximate nature of the algorithm, as also noted by the GREW authors in [88]. However, the result quality can be improved by applying the algorithm multiple times on a graph. Thereby, the most frequent non-processed edge type is chosen and the contraction is applied again. This results in different patterns for each iteration. To reduce the overhead of each run, already contracted and thus used edges are removed from the graph to improve the coverage of the algorithm.

The (3.) extracted patterns are filtered in the same manner as for the design pattern matcher, i.e. based on their relevance.
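The contraction itself can be sketched on an adjacency-set graph as follows (the multi-edge bookkeeping of GREW is omitted and all names are ours; the edges passed in are assumed to form an independent set):

def contract_edges(graph, edges):
    """Contract each (u, v): replace u and v by a multi-vertex whose
    neighbours are the union of the remaining neighbours of u and v."""
    for u, v in edges:
        w = f"{u}+{v}"  # multi-vertex representing the contracted edge
        neighbours = (graph.pop(u) | graph.pop(v)) - {u, v}
        graph[w] = neighbours
        for n in neighbours:
            graph[n] = (graph[n] - {u, v}) | {w}
    return graph

# Contracting the R-A edge merges both vertices into multi-vertex 'R+A'.
g = {"R": {"A", "R2"}, "A": {"R"}, "R2": {"R"}}
print(contract_edges(g, [("R", "A")]))  # {'R2': {'R+A'}, 'R+A': {'R2'}}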

5.2.4.2 Pattern mapping

Since the patterns have been mined separately for the source and target graph, a mapping has to be calculated. Therefore, we propose to apply our GED as described in Sect. 5.1. It is used to calculate a similarity value as well as a mapping between two patterns. The planar graph edit distance determines the distance of two patterns; the most similar patterns are finally mapped onto each other.

5.2.4.3 Embedding mapping

The embeddings are mapped in the same way as for the design pattern matcher, with the additional information of mapped patterns. That means only embeddings of two mapped patterns are matched with each other, leading to the final element mappings.

5.2.4.4 Example

We present in Fig. 5.10 an illustrative example for our redundancy matcher. The example shows the complete process of pattern mining, pattern mapping, and embedding mapping. For simplicity, we depict only one metamodel for mining, which contains a redundant address (A). The metamodel has been pre-processed by assigning types to edges and vertices. The vertex types are depicted as labels, denoted as R and A. Edge types are represented by a dashed or solid line.

Figure 5.10: Example for pattern mining by the redundancy matcher: (a) input metamodel, (b) select most frequent edge type, (c) resulting edge contraction, (d) pattern found, (e) map source and target patterns, (f) map source and target embeddings

Select most frequent edge type The pattern mining begins by selecting the most frequent edge type (Fig. 5.10 a), defined as the type of an edge and of its adjacent vertices; here this is the dashed line between vertices of type R and A. This type is preferred over the one given by two vertices of type R connected by a solid line, because the latter's maximal independent set is smaller, since its embeddings share one vertex (R).

Contract selected edge type In the subsequent step all embeddings of this edge type are contracted (Fig. 5.10 b). That means the edges are removed and the adjacent vertices are merged into a new multi-vertex. In our example the dashed line edges are removed and the vertices R and A are merged into a new multi-vertex of the new type RA. Since no further contraction is possible, this is the resulting pattern as given in (d).

Map patterns This pattern has been mined for one metamodel; imagine now an additional pattern from another metamodel, as given in Fig. 5.10 (e). The pattern from the previous steps is shown at the top, while the example pattern is depicted at the bottom. Both patterns are mapped using our planar graph edit distance. The resulting mapping is shown as dashed lines between the corresponding elements.

Map embeddings Finally, the pattern embeddings have to be mapped onto each other. This step is given in Fig. 5.10 (f). Using an arbitrary matching algorithm the elements are mapped, again depicted by a dashed line. Exactly these mappings between the redundant address elements are the output of the redundancy matcher.

5.2.4.5 Remarks

Again, the complexity introduces runtime problems for this matcher. The isomorphism test of patterns is tackled by applying our planar graph edit distance. However, the problem of determining the maximal independent set remains. We tackle it with an approximate greedy algorithm [52]. The last problem, identifying the edge positions, is reduced by limiting the pattern size.

5.3 Summary

In order to tackle the problem of insufficient matching quality we proposed three approaches which make use of either structural or redundant information: a planar graph edit distance algorithm as well as two graph mining based approaches, the design pattern matcher and the redundancy matcher. We analyzed and adopted existing graph theory algorithms. Table 5.3 depicts a summary of our contributions.

Planar graph edit distance matcher The planar graph edit distance algorithm results from a structured analysis of graph matching algorithms. These algorithms suffer from the problem of complexity; hence we conducted a requirement analysis resulting in the selection of a planar graph edit distance algorithm with quadratic runtime for our purposes. We proposed adaptations and extensions of this algorithm to take linguistic and structural information into account and to calculate mappings for metamodel matching. Additionally, we proposed the k-max degree approach for further improvement of the matching quality. Thereby, we make use of k user input mappings ranked by the degree of elements.

Graph mining matcher Our graph mining matchers are based on the idea of discovering reoccurring patterns in a metamodel for matching. There are two reasons for such patterns: the presence of design patterns and redundant modelling. We investigated the state of the art in graph mining and, based on a requirements analysis, selected an approach for each class of patterns. We proposed two matchers, the design pattern matcher and the redundancy matcher. Both approaches comprise three steps: pattern mining, pattern mapping, and embedding mapping.

The design pattern matcher mines for patterns based on the incremental extension of a pattern. Checking for each extension whether it is a valid pattern, i.e. whether it occurs in both metamodels, design patterns are found. The pattern mapping is part of the identification, because a pattern has to occur in both graphs. The final element mapping is calculated using matchers on the elements of pattern embeddings.

The redundancy matcher mines two metamodels independently for patterns. The mining is based on the principle of reducibility, contracting the most frequent edge type until no further reduction is possible. This approach is repeated for different edge types. The resulting patterns of both metamodels are then mapped using our GED. Finally, the embeddings of mapped patterns are matched with each other to calculate the output mapping.

Planar graph edit distance:
• Adoption and extension of the planar graph edit distance for metamodel matching
• Definition of a vertex distance function based on structural and linguistic similarity
• Dynamic parameter calculation for the weights of the distance function
• Consideration of relations' direction by similarity value penalties
• Similarity calculations during cyclic string edit distance calculation
• Processing of one-neighbour vertices
• k-max degree approach for improved results

Design pattern & redundancy:
• Architecture and process for pattern mining algorithms in metamodel matching
• Adoption of two algorithms for design pattern (gSpan) and redundancy-based (GREW) mining
• Definition of a typed graph model including linguistic information encoded in similarity classes (types)
• Relevance-based filtering of patterns

Table 5.3: Contributions of structural graph-based matchers

Chapter 6

Planar Graph-based Partitioning for Large-scale Matching

In this chapter we discuss our solution for the problem of insufficient support for large-scale metamodel matching, i.e. the problems of memory and runtime while mapping metamodels with thousands of elements. An approach to tackle the memory problem is to split the input metamodels into parts and match them independently. We propose to calculate these parts using a planar partitioning approach based on a heuristic element removal and connected component calculation. The subsequent partition merging is based on our proposal of utilizing the structural measurements of coupling and cohesion.

However, partitioning does not decrease runtime, because a comparison of all partition elements still leads to the cartesian product of source and target elements. The number of comparisons can be decreased by selecting only a subset of partition pairs for matching in a so-called partition assignment process. The selection and assignment of partitions for matching can reduce runtime but also decrease result quality. Therefore, we present and compare four partition assignment approaches, which are evaluated and discussed in our evaluation.

The subsequent sections present an overview of partitioning-based matching, followed by our planar graph-based partitioning, and conclude with a description and comparison of the assignment approaches we apply.

6.1 Partition-based Matching

Especially in industry, current matching systems cannot fulfil the demands of matching large-scale metamodels, i.e. metamodels with more than 1,000 elements. Either they are not able to match them at all or they require enormous resources for matching. As stated in our problem analysis in Sect. 3.2.1, the main reason for the scalability issues of matching systems is the traditional approach of pair-wise element comparison w.r.t. an input source and target metamodel. Calculating element correspondences for all pairs of elements, a system shows at least quadratic growth in memory and runtime.

Figure 6.1: Processing steps of partition-based matching and assignment: the input metamodels are partitioned, partition pairings are determined by an assignment, and each pairing is matched independently by a matching system

An approach that is well-known in the area of schema matching [25] is to split the input into smaller parts using clustering and to match these parts independently. We adopt this idea for metamodel matching as given in the processing steps overview in Fig. 6.1. There, the input metamodels are split into parts (partitions) in the first step (6.2) by a partitioning component. Partitioning is similar to clustering but offers the advantage of guaranteeing an upper-bounded partition size as well as a similar size for partitions. However, partitioning reduces the memory needed but not the runtime. The reason is simple: matching each element of each partition of the source and the target metamodel with each other, the number of comparisons is the same as matching pair-wise without partitioning. In addition, the partitioning algorithm itself introduces some computational overhead.

To overcome this problem, promising combinations of source and target partitions are identified and assigned to each other via an assignment algorithm, resulting in sets of partition pairs in the second step (6.3). The assignment is done based on a partition similarity, which we propose to calculate using partition representatives instead of performing a pair-wise partition element comparison. The goal is to only match partitions which show a high similarity. To select these pairs we study two similarity threshold-based approaches as well as two algorithms calculating optimal one-to-one or one-to-many assignments. The assigned partition pairs serve as input for an arbitrary matching system and are matched independently. The first step, the partitioning, is described in the subsequent Sect. 6.2, introducing our planar graph-based partitioning algorithm. The second step, the assignment, is detailed in Sect. 6.3.

6.2 Planar Graph-based Partitioning

Inspired by a generic algorithm from graph theory [3], we propose an approach that divides an input metamodel into subgraphs, i.e. partitions. Thereby, we exploit structural and metamodel type information. The idea is to first split an input metamodel into partitions until these partitions fall under a pre-defined size limit. The splitting is achieved by removing elements from the metamodel. The remaining elements are ordered according to their connectivity to the partitions obtained. The ordered elements are finally merged with the partitions previously calculated. We propose to merge these partitions and remaining elements based on coupling and cohesion, two structural measurements, to achieve a structure-preserving partitioning. The partitioning results in a number of output partitions, which does not need to be user-defined since the maximal size of the partitions is already given. In the following sections we analyse partitioning and clustering algorithms to select the basis for our planar graph-based partitioning algorithm.

6.2.1 Analysis of graph partitioning algorithms

Partitioning and clustering are similar approaches as described in the background chapter (Sect. 2.2.6). Both calculate the splitting of an input into smaller parts. The main difference is that partitioning aims at parts of similar size, whereas clustering does not.

In the following we present an analysis of existing graph partitioning and graph clustering algorithms with the purpose of selecting the basis for our partitioning phase. First, we derive requirements in the context of large-scale metamodel matching and then study existing approaches w.r.t. these requirements. Based on this, we select the basis for our approach in the subsequent section.

6.2.1.1 Requirements for metamodel matching

The requirements for large-scale metamodel matching state that an algorithm should partition input metamodels (R2.1), reduce runtime (R2.2), and lead to a minimal loss of result quality (R2.3) (see Sect. 3.2.3).

The partitioning of an input metamodel should also reduce the number of comparisons performed and thus reduce memory consumption. The reduction in memory consumption can be ensured if the input splitting guarantees a maximal size per partition. Therefore, the partitioning algorithm has to ensure a maximal size of partitions and consequently a variable number of partitions. To reduce runtime, an algorithm should not introduce additional complexity. Since matching has a quadratic complexity, a partitioning algorithm may not exceed quadratic complexity. Moreover, a minimal loss in result quality should be achieved. Therefore, an algorithm has to support metamodel graph properties to utilize structural, type-based, and linguistic information for a better quality and thus a smaller loss by partitioning. To summarize, we derived the following requirements:

1. At most quadratic complexity,

2. Support of metamodel graph properties,

3. A variable number of result partitions, i.e. a maximal size of partitions.

Subsequently, we will give an overview of graph partitioning and clustering algorithms and arrange them according to our requirements.

6.2.1.2 Overview of graph partitioning and clustering algorithms

An overview of graph clustering and partitioning algorithms is given in Tab. 6.1. The table shows 18 algorithms organized in three groups and arranged according to our requirements. The first group depicts clustering algorithms, the second partitioning algorithms, and the third a combined algorithm. In the following we explain our selection of the planar edge separator (Planar t-Separator) algorithm [3] for metamodel matching.

The first requirement is an at most quadratic runtime. The clustering approaches of Edge Betweenness from community detection [48], Markov chain-based clustering [26], Spectral Bisection [141], and Minimum Cutting Tree-based clustering [40] do not fulfil this requirement and are discarded. Next, the approaches Modularity [114], PNN [41], HCS [55], Kirchhoff equations [155], Planar Separator [96], and Multilevel clustering [22] do not support edge weights and thus typed graphs. Therefore, we do not consider these algorithms for partitioning either. Additionally, the approaches of Kernighan-Lin [78], Normalized Minimal Cut [137], Heuristic Optimization [5], hMetis [75], and Recursive Bisection [139] do not explicitly support a variable number of clusters or partitions. Although an upper bound may be derived from a fixed number of clusters, these algorithms distribute elements without consideration of structure, because they do not allow for size deviations as the graph-based partitioning does. Consequently, we also do not consider these algorithms.

The remaining algorithms fulfilling our requirements are Local Betweenness [49], Chameleon [74], and the Planar Edge Separator [3]. We selected the planar edge separator because local betweenness shows a worse complexity and hence does not scale as well. Chameleon and the Planar Edge Separator are quite similar, because both follow the idea of initial splitting and remerging, but the authors of the planar edge separator argue that their approach tends to perform better than hMetis [3]. Since Chameleon is an additive extension of hMetis, we conclude that the planar edge separator performs better. However, for comparison we selected a representative algorithm for local betweenness from software clustering [15] and a density-based clustering algorithm [93] similar to Chameleon and compared them in the context of a diploma thesis [59], concluding that indeed the PES performs best. In the following section we explain our adoption of the PES as well as our extended merging phase based on coupling and cohesion.

Name                         Complexity    Edge weights   Variable no. of parts   Maximal size of parts
Clustering
Modularity [114]             O((n+m)n)     ✗              ✓                       ✓
PNN [41]                     O(n²)         ✗              ✓                       ✓
Edge Betweenness [48]        O(n³)         ✗              ✓                       ✓
HCS [55]                     O(nm)         ✗              ✓                       ✓
Markov [26]                  O(n³)         ✗              ✓                       ✗
Min. Cut. Tree [40]          O(n³)         ✓              ✓                       ✓
Normalized MinCut [137]      O(n²)         ✓              ✗                       ✓
Kirchhoff equations [155]    O(n)          ✗              ✗                       ✗
Local Betweenness [49]       O(m log m)    ✓              ✓                       ✓
Partitioning
Kernighan-Lin [78]           O(n² log n)   ✓              ✗                       ✓
Heuristic Optim. [5]         n.a.          ✓              ✗                       ✓
Spectral Bisection [141]     O(n³)         ✗              ✗                       ✓
hMetis [75]                  O(n + m)      ✓              ✗                       ✓
Planar Separator [96]        O(n)          ✗              ✗                       ✓
Planar Edge Separator [3]    O(n²)         ✓              ✓                       ✓
Recursive Bisection [139]    O(n + m)      ✓              ✗                       ✓
Chameleon [74]               O(n²)         ✓              ✓                       ✓
Hybrid
Multilevel Clustering [22]   O(n)          ✗              ✗                       ✓

Table 6.1: Graph clustering and partitioning algorithms; ✓ – requirement fulfilled, ✗ – requirement not fulfilled, n – no. of vertices, m – no. of edges

6.2.2 Planar Edge Separator based partitioning

The main idea of the partitioning algorithm is to split a metamodel into subgraphs of similar size by removing vertices from the graph and calculating connected components¹. The input of the algorithm is a maximal partition size or a maximal number of partitions. The overall partitioning is achieved in three phases: partitioning, re-partitioning, and merging.

1. Partitioning First, an input metamodel graph is initially partitioned using a heuristic approach by removing vertices and their edges. These vertices and edges are selected based on levels of the metamodel graph. The partitions are given by the connected components of the vertices remaining after removal.

2. Re-partitioning Since some of the resulting partitions may exceed the size limit, they are re-partitioned in quadratic time while taking advantage of their planarity. Thereby, again vertices are removed to re-calculate connected components.

3. Merging Finally, all previously removed vertices that do not belong to any partition need to be merged with the previously calculated partitions to retain maximal structural information. There we propose to apply a merging of elements and partitions based on the coupling and cohesion of partition pairs, for maximal structural information and thus matching result quality.

The three phases of our algorithm are given in Alg. 6.1 and are detailed in the following paragraphs. To illustrate the algorithm we use an extended version of the POS metamodel of Sect. 3.1.2 as shown in Fig. 6.2. It gives an example of a retail store as previously introduced, with a class diagram on the left and the corresponding graph on the right side. For the sake of simplicity the graph shows only classes, but attributes are represented in a similar way. Please note that class labels and relation types are abbreviated. Based on this example a calculation of the partitioning is depicted in Fig. 6.3.

¹A connected component is a connected subgraph, i.e. each vertex is reachable from each other vertex. One example algorithm for a connected component calculation is Depth First Search (DFS).

Algorithm 6.1 Planar graph-based partitioning algorithm outline

Input: Graph G(V, E), maximal partition size wmax = t · w(G)

1. Partitioning

(1) Add a virtual vertex v0 to G and connect it to vertices vi ∈ degreemax(G, k), with k chosen such that all v ∈ V are reachable.

(2) Compute the SSSP tree T rooted in v0 w.r.t. G.

(3) Select a set of levels L in T, using a heuristic restricted by wmax.

(4) Move the vertices in L into Vsep and compute the connected components P1, ..., Pm of graph G with V \ Vsep.

2. Re-partitioning

(5) Construct an SSSP tree Tj consistent with T for each partition pj with w(pj) > wmax.

(6) Select levels Lj with respect to Tj whose removal partitions pj into components of weight ≤ wmax.

(7) Insert the vertices of Lj into Vsep and compute the connected components in pj with Vj \ Vsep.

3. Merging

(8) For each pair of partitions (pi, pj) compute coup(pi, pj) and coh(pi, pj) and add them to the list of merging candidates iff coup(pi, pj) > threscoup and coh(pi, pj) > threscoh.

(9) Select (pi, pj) with maximal coup and merge them to a new partition pij, if w(pij) ≤ wmax. Re-compute coup and coh for pij and all partitions connected to it.

(10) Repeat from (8) as long as a pair (pi, pj) exists with coup(pi, pj) > threscoup and w(pij) ≤ wmax.

1. Partitioning The first phase of the algorithm is dedicated to an initial partitioning of the input metamodel graph. The goal is to find a minimal set of vertices to be removed to achieve maximal-sized partitions. The size is bounded by the input wmax. Thereby, we follow a heuristic level-based approach as given in [3], which reduces the partitioning problem to a Single Source Shortest Path (SSSP)² problem. As given in Alg. 6.1 (1), an SSSP tree is rooted in a virtual element vertex v0 that is required for a connected graph as a basis for the SSSP tree calculation³. As an extension of the original algorithm, we propose to connect v0 to element vertices of maximal degree until all elements are reachable. First, the element vertex of the maximal degree (maximal number of edges) is connected to v0; as a result all vertices it connects to are also indirectly connected to v0.

²An SSSP algorithm computes the shortest paths from one vertex to all other vertices with respect to certain costs.
³The vertex v0 is needed because the root package of a metamodel cannot be used for partitioning, since every vertex would be in the same level due to its direct reachability from the root package.

Figure 6.2: Example metamodel and the corresponding graph; it contains the classes TransactionItem (T), FoodItem (F), SoftItem (S), HardItem (H), RetailTransaction (RT), RetailStore (RS), Payment (P), CustomerOrder (CO), RetailCustomer (RC), Staff (S), Cash (Ca), CreditCard (CC), Check (Ce), and Address (A), connected by inh, ref, cont, and bInA relations

Next, the element vertex of the remaining unreachable elements that has the maximal degree is selected and again connected to v0. This procedure is repeated until all elements are reachable via v0. We chose the maximal degree as criterion because then a minimal number of new edges to v0 is created, since a maximal degree indicates a higher number of reachable vertices.

The vertex v0 is the root of the SSSP tree calculation as in Alg. 6.1 (2), e.g. using Dijkstra's algorithm [21]. The resulting SSSP tree is then arranged in levels, where each level is a set of vertices with the same distance from the root. An example of a level graph is given in Fig. 6.3 (b).

(3) To select levels for removal we apply a heuristic as proposed in [3]. Outlining their approach, first a level graph is constructed (Fig. 6.3 (a) shows the levels and (b) the resulting level graph). That is a graph where each vertex li represents a level Li and is connected to all levels with a higher distance. Each edge between two vertices vi, vj gets costs assigned as follows:

cost(vi, vj) = cost(L(vj) \ L(vi)) + 2 · ⌊2 · w(Gi,j) / wmax⌋ · (d(vj−1) − d(vi))    (6.1)

L(v{i,j}) is the set of vertices that have the same distance to v0 as v{i,j}; thus the first part of Equation 6.1 depicts the cost for the removal of level L(vj). The second part considers the resulting weight of the partitions when removing Lj. The sum of elements in the levels between L(vi) and L(vj) is w(Gi,j), while d(vi) and d(vj−1) represent the number of levels between L(vi) and L(vj−1) respectively. These costs are used for a shortest path computation on the level graph.

Figure 6.3: Example calculation for planar partitioning: (a) SSSP tree with levels L0 to L5, (b) level graph with source ls and target lt, (c) initial partitioning

The resulting shortest path is a representation of the levels to be removed, which have the lowest costs.

(4) The removal of levels from the graph is done by adding their vertices to the separator set Vsep. The corresponding edges of the removed vertices are also removed from the graph. The remaining vertices of the graph G are searched for their connected components, which form the initial set of partitions as in Alg. 6.1 (4). Since a heuristic approach has been applied, it can happen that some partitions exceed the upper bound wmax and have to be re-partitioned.

An example of the initial partitioning phase is given in Fig. 6.3. First, the virtual root v0 has been added to the metamodel graph and connected to the element vertex with the maximum degree, that is TransactionItem (vertex T). Since all elements are then reachable, no additional connection has to be introduced. Following the SSSP calculation, the resulting tree is depicted in Fig. 6.3 (a). All resulting six levels are shown in our example, labelled L0, ..., L5. These levels form the level graph with added source and target vertices ls, lt as depicted in Fig. 6.3 (b). Each vertex represents a level of the SSSP tree and is connected to its succeeding levels. For our example the shortest path from ls to lt is via l3; thus this level is selected for removal and the contained vertices and their edges are removed as depicted in Fig. 6.3 (c). The result is formed by calculating the connected components of the remaining vertices, leading to three partitions.

2. Re-partitioning The optional re-partitioning phase is applied to all partitions p ∈ P that exceed the size limit wmax, as given in Alg. 6.1 (5–7). Since it is a complex calculation and identical to the one proposed in [3] (pp. 5–9), we refrain from detailing it. The phase is optional, but some partitions may violate the size limit. An SSSP tree Tj for pj is calculated, which has to be consistent with the original tree T. A tree is called consistent if all vertex distances of Tj are at most the distances of the SSSP tree T of phase 1. Of this SSSP tree, two levels can be removed in a way that the resulting three partitions have at most half of the original weight, using the approach given in [3]. These two levels are removed by adding their vertices to Vsep, and the connected components are calculated. If necessary, a set of fundamental cycles⁴ is removed. The removal guarantees resulting partitions of sufficient weight ≤ wmax and hence size. Having ensured the maximal size of all partitions p ∈ P, the remaining vertices in Vsep are still unassigned and thus not part of any partition. They are merged with the calculated partitions w.r.t. the size wmax as described in the following phase.

3. Merging The previous partitioning and re-partitioning produces two outputs: a set of partitions P and the elements removed during the phases, Vsep. Since the elements in Vsep are not part of any partition, they need to be merged with the previously calculated partitions. In contrast to the simple random merge strategy of [3], we propose a merge strategy based on structure, thus being structure-preserving. To ensure the matching of all elements, we propose to add the remaining elements of the separator Vsep to P as partitions of the size of one element. We select partitions to be merged based on two measurements: coupling and cohesion.

We propose to perform the merging on a weighted graph allowing for metamodel specifics. Thereby, we define the edge weights of the input metamodel graph in accordance with the density-based clustering approach [93], i.e. attribute edges have a weight of 5, containment/aggregation edges of 4, associations of 2, and inheritance a weight of 1. The weight assignment follows the rationale of importance, that is, a higher weight indicates a higher importance of a relation. First, the merging phase as given in Alg. 6.1 (8–10) calculates coupling and cohesion for all pairs of partitions. Inspired by [74] we define coupling and cohesion as follows:

coup(pi, pj) = w(E{pi,pj}) / ((w(Epi) + w(Epj)) / 2)    (6.2)

coh(pi, pj) = wavg(E{pi,pj}) / ((|pi| / (|pi| + |pj|)) · wavg(Epi) + (|pj| / (|pi| + |pj|)) · wavg(Epj))    (6.3)
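Both measures translate directly into code; the following sketch works on lists of edge weights (the handling of the single-element case anticipates the rule given below; all names are ours):

def coupling(inner_i, inner_j, connecting):
    """Eq. 6.2: connecting edge weight over the mean inner edge weight."""
    mean_inner = (sum(inner_i) + sum(inner_j)) / 2
    if mean_inner == 0:  # two single-element partitions, one connecting edge
        return sum(connecting)
    return sum(connecting) / mean_inner

def cohesion(inner_i, inner_j, connecting, size_i, size_j):
    """Eq. 6.3: average connecting edge weight over the size-weighted
    average inner edge weight of both partitions."""
    def w_avg(ws):
        return sum(ws) / len(ws) if ws else 0.0
    total = size_i + size_j
    denom = (size_i / total) * w_avg(inner_i) + (size_j / total) * w_avg(inner_j)
    return 1.0 if denom == 0 else w_avg(connecting) / denom

# The CO vs. (RC, A) case of Fig. 6.4: one connecting edge of weight 2, CO
# has no inner edge, (RC, A) one inner edge of weight 4 -> 3/4 < 4/5: no merge.
print(cohesion([], [4], [2], size_i=1, size_j=2))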

⁴A fundamental cycle is a path inside a tree with the same start and end vertex containing at most one non-tree edge.

Equation 6.2 defines coupling as the ratio between the sum of weights of the edges connecting both partitions, w(E{pi,pj}), and the average (w(Epi) + w(Epj)) / 2 of the inner edge weight sums of both partitions pi, pj. Complementarily, cohesion is defined by taking the relative partition size into account. Cohesion weights the average inner edge weight wavg(Epi) of each partition by the relative size |pi| / (|pi| + |pj|) of that partition, with |pi| being the number of elements in pi w.r.t. both partitions. We chose both measurements because coupling allows us to rank source and target partitions based on their connectivity, whereas cohesion allows us to preserve partitions which consist of closely related elements, instead of merging them and thus adding misleading information for matching.

In the case of two partitions with one element each and one connecting edge, we propose to set the coupling to the weight of the connecting edge and the cohesion to one, thus ensuring a preferred merging of single-element partitions.

Beginning with the pair of partitions with maximal coupling, the partitions are merged as given in Alg. 6.1, phase 3 (9). Thereby, a pair pi, pj is merged if the cohesion fulfils coh > threscoh. We chose threscoh = 4/5 because it captures the borderline case of two partitions with connected elements. That is, the inner connection of two partitions has the maximum weight 5 for attribute relations, and they should not be merged if there is a weaker connection between both partitions, i.e. 4 for containment relations. Additionally, the maximum size restriction has to be fulfilled. The coupling and cohesion values have to be re-calculated for the new merged partition pnew and all partitions connected to pi, pj. The merging of partitions of maximal coupling is repeated as given in step (10) until no pair remains that has a sufficient cohesion and can be merged without violating the upper bound wmax. The final output of the algorithm is the set of partitions P which has been created in the merging phase.

Figure 6.4 depicts an example for the merging phase. On the left (a) the partitions resulting from the previous initial partitioning are depicted as dashed-line polygons; the numbers on the lines represent the weights of the relations as defined above. The coupling of (P, CO) is calculated from the weight of the edges between them, w(E{P,CO}) = 2, and the average of their own (inner) edge weights; for the partition including P that is w(EP) = 1 + 1 + 1 = 3 and w(ECO) = 0. Consequently, coup(P, CO) = 2 / ((3 + 0)/2) = 4/3. Calculating the coupling of the other pairs yields the following numbers: (RT, CO) = 4/3, (RC, A) = 4, (CO, RC) = 2, (RS, S) = 8/9, (RS, RC) = 8/9. Please note that the pair (RS, CO) is identical to the partition pair (RT, CO). Resulting from this, the partitions RC and A are merged, since they have the highest coupling. In the next step, considering the recalculated coupling and cohesion values, the partitions CO and RT are merged.

Figure 6.4: Example of the partition merging phase: (a) before merging, (b) after merging

The partitions CO and (RC, A) are not merged because of their cohesion. The cohesion coh(CO, (RC, A)) is calculated with wavg(ECO) = 0, because CO has no edge, and wavg(E(RC,A)) = 4, because (RC, A) has only one edge. The cohesion is 2 / ((1/3) · 0 + (2/3) · 4) = 3/4 and does not exceed the threshold of 4/5. Finally, (RS, S) are merged, while the other partitions do not satisfy the cohesion threshold. Since then no more merge partners are available, the final output is exactly as depicted in Fig. 6.4 (b).

We have presented our planar graph-based partitioning, which splits a metamodel into subgraphs of similar size while optimizing structural information. The optimization is achieved by our proposed merging of partitions based on coupling and cohesion, thus leading to better matching results by structural matchers. Finally, our partitioning reduces the memory consumption of a matching system and allows for distributed matching, because it produces self-contained matching tasks. However, it still does not tackle the problem of runtime, especially on a local machine, which occurs due to the pair-wise comparison of all partition elements. We address this in the following section.

6.3 Assignment of Partitions for Matching

Having tackled the memory consumption of a matching system, the runtime still remains an issue and is even increased by the partition calculation. Considering a pair-wise comparison of all partitions of a source and target metamodel, the result is the cartesian product and thus a matching of all source with all target elements. Figure 6.5 depicts an example of the pair-wise assignment of all partitions; the source partitions are represented by grey circles, the target partitions by white ones, and the assignment, and consequently the pairs for matching, by arrows. The problem is now to reduce the number of comparisons with a minimal loss in matching quality. That means an algorithm is needed which determines relevant partitions and their assignment to be matched without performing the actual matching.

Figure 6.5: Partition matching without assignment

This process is called partition assignment and is based on matching those partitions which have the highest similarity. The similarity is calculated using partition representatives instead of a pair-wise element comparison to reduce the computational overhead.

In this section we study and compare four partition assignment algorithms. These are:

• Threshold-based and quantile-based assignment, selecting a subset for matching,
• Hungarian and generalized assignment, aiming at optimal one-to-one or one-to-many partition assignments respectively.

First, a similarity measurement for partitions has to be defined, as discussed in the following subsection. Then, we can apply and discuss the four algorithms for partition assignment.

6.3.1 Partition similarity

The first step towards the partition assignment problem is to obtain similarity values for each partition pair. These values can be calculated in various ways by applying arbitrary matching techniques, e.g. as in [156]. However, if such techniques are applied on every element of a partition, the final result is again the cartesian product of all elements, which is undesirable because of the computational effort.

Therefore, we propose to first select representatives for each partition. These representatives are then compared pair-wise using structural and linguistic information, leading to source and target partition similarities. Since the selection itself should not introduce additional overhead, we propose to base the representative selection on the k-max degree approach as introduced in Sect. 5.1.4, where k is the number of representatives per partition. The selection is done in linear time and uses the k elements with the maximal degree, i.e. the maximal number of neighbours. The partition similarity is obtained by the similarity measures introduced for the graph edit distance matcher in Sect. 5.1. There we defined linguistic and structural similarity.
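A sketch of the representative selection, assuming a partition is given as a map from element to its neighbours (names are ours):

import heapq

def k_max_degree_representatives(partition, k):
    """Pick the k elements with the maximal degree (number of
    neighbours); for a fixed k this heap-based selection is
    effectively linear in the partition size."""
    return heapq.nlargest(k, partition, key=lambda e: len(partition[e]))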

Figure 6.6: Partition matching with threshold-based assignment

Linguistic similarity The linguistic similarity is defined by us as the distance between the labels of two given elements. Thereby, we make use of a name matcher, e. g. a tri-gram similarity.

Structural similarity We define the structural similarity as the ratio of edges of two given elements. Thereby, we consider containment, attribute, and inheritance edges and average the results. Recapitulating, it is defined for two given elements v_s and v_t as:

struct(v_s, v_t) = \frac{1}{2} attr(v_s, v_t) + \frac{1}{2} ref(v_s, v_t)    (6.4)

The structural and linguistic similarity can be aggregated for a cluster similarity calculation or be used separately. We demonstrated in the context of a master thesis [59] that using structural similarity is superior to linguistic similarity, because it yields the same matching result quality at a better runtime.
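A minimal sketch of Eq. (6.4) follows; the min/max ratio used for attr and ref is an assumption for illustration, the exact definitions are those of the graph edit distance matcher in Sect. 5.1.

    // Sketch of Eq. (6.4) on the edge counts of two elements.
    static double structSim(int attrS, int refS, int attrT, int refT) {
        return 0.5 * ratio(attrS, attrT) + 0.5 * ratio(refS, refT);
    }

    static double ratio(int a, int b) {
        if (a == 0 && b == 0) return 1.0;   // no edges of this kind on either side
        return (double) Math.min(a, b) / Math.max(a, b);
    }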

6.3.2 Assignment algorithms

In the following sections we will describe two straight-forward approaches: threshold-based and quantile-based assignment. Since both approaches use the calculated similarity values to exclude partition assignments, they may miss partition pairs that should be selected for matching. Therefore, we also present two approaches mapping the assignment problem onto an optimization problem aiming at one-to-one or one-to-many assignments. These approaches are called Hungarian and generalized assignment.

6.3.2.1 Threshold-based assignment

A high similarity of a given partition pair indicates a large number of similar elements in both partitions. Therefore, selecting pairs of partitions with a high similarity should result in a large number of matching elements. The rationale behind defining a threshold thres, as for instance in the matching system Falcon-AO [156], is to select only those partition pairs that have a similarity exceeding thres. This can be formulated as in (6.5).

f(thres) = \{(p_i^s, p_j^t) \mid sim(p_i^s, p_j^t) \geq thres\}    (6.5)

The core problem of this approach is the definition of the threshold itself. A threshold thres largely depends on the given scenario and varies as shown in our evaluation. For instance, consider an example of 4 source and 4 target partitions with similarity values between the partition pairs of 0.5 and 0.8. A threshold of 0.9 would fail to select any pair at all, even though it may have worked for previous scenarios. In contrast, a threshold of 0.4 would select every pair for matching. Therefore, a preferable (average) threshold that works best for any scenario cannot be given. The threshold-based assignment shows a computational complexity of O(n²), where n is the maximum of the source and target partition count, because for a given threshold each pair needs to be checked for its similarity. Figure 6.6 depicts an example result for threshold-based assignment. In this example a subset of 6 pairs is selected for matching, but two partitions have no match partner, which shows the potential problems of using a threshold-based assignment. Please note that this example serves as a comparative example for the non-assignment given in Fig. 6.5.
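A minimal sketch of Eq. (6.5) over a precomputed representative-similarity matrix (sim[i][j] holds the similarity of source partition i and target partition j; the names are ours):

    import java.util.*;

    // Sketch of threshold-based assignment (Eq. 6.5): O(n^2) similarity checks.
    static List<int[]> thresholdAssign(double[][] sim, double thres) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < sim.length; i++)
            for (int j = 0; j < sim[i].length; j++)
                if (sim[i][j] >= thres)
                    pairs.add(new int[] { i, j });  // partition pair (i, j) is matched
        return pairs;
    }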

6.3.2.2 Quantile-based assignment

We propose quantile-based assignment to overcome the scenario dependency of threshold-based assignment. A quantile q describes a fraction of the partition pairs to match, e. g. the quantile q = 0.5 selects the half of all partition pairs with the highest similarity. In case the number of source partitions is denoted as |P_s| = m and the number of target partitions as |P_t| = n, then the number of selected pairs based on the quantile is ⌈q · m · n⌉ and these pairs can be defined as:

f(q) = Q := \{(p_i^s, p_j^t) \mid p_i^s \in P_s \wedge p_j^t \in P_t \wedge |Q| = \lceil q \cdot |P_s| \cdot |P_t| \rceil \wedge
\forall (p_i^s, p_j^t): sim(p_i^s, p_j^t) \geq sim(p_k^s, p_l^t), (p_k^s, p_l^t) \in (P_s \times P_t) \setminus Q\}    (6.6)

Unlike the threshold-based assignment, the quantile-based approach allows one to predict the number of partition pairs selected and can thereby prevent having none or all partition pairs selected for matching. This produces a better result quality (see our evaluation in Sect. 7.5.1) and limits the influence of a specific scenario. The quantile-based assignment shows a computational complexity of O(n² log n) with n as the partition count, because all partition pairs need to be sorted according to their similarity, for instance by Quicksort (O(n log n)), and then selected w.r.t. a given quantile.
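A corresponding sketch of Eq. (6.6), again over an assumed representative-similarity matrix, sorts all partition pairs by similarity and keeps the top fraction q:

    import java.util.*;

    // Sketch of quantile-based assignment (Eq. 6.6): O(n^2 log n) sorting.
    static List<int[]> quantileAssign(double[][] sim, double q) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < sim.length; i++)
            for (int j = 0; j < sim[i].length; j++)
                pairs.add(new int[] { i, j });
        // sort descending by similarity
        pairs.sort((a, b) -> Double.compare(sim[b[0]][b[1]], sim[a[0]][a[1]]));
        int keep = (int) Math.ceil(q * pairs.size());  // ⌈q·|Ps|·|Pt|⌉ pairs
        return new ArrayList<>(pairs.subList(0, keep));
    }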

Figure 6.7: Partition matching with quantile-based assignment

In Fig. 6.7 an example result for quantile-based assignment with q = 0.5 is depicted. As in threshold-based assignment, a subset of partitions to be matched is selected, but in contrast to threshold-based assignment more partitions get selected, namely 8, i. e. 50% of all possible pairs (16). Still, two partitions are not assigned and consequently will not be matched.

6.3.2.3 Hungarian assignment

Since the partition assignment is closely related to the Knapsack problem [105], we propose to apply two algorithms from this area. The Knapsack problem deals with the optimal distribution of n elements to m containers. The Hungarian algorithm [85], proposed in 1955 and named after the Hungarian mathematicians whose work it builds on, is one solution for the underlying assignment problem. The partition assignment problem is the same problem, dealing with an optimal distribution of n partitions to m partitions. The assignment problem originally underlying the Hungarian algorithm tries to identify the set of optimal assignments of workers to jobs. Mapped onto the partition assignment problem, the workers represent the source partitions, whereas the jobs are represented by the target partitions. Thereby, the set of workers P_s (source partitions) and the set of jobs P_t (target partitions) are assigned in a one-to-one manner. That means only one job is assigned to one worker and vice versa, thus they form exact one-to-one assignments. The assignments are represented in a boolean matrix M = (m_ij) with i as the worker and j as the job, where m_ij = 1 if a job is performed by the corresponding worker and 0 otherwise. A combination of a job and a worker also has an associated cost value represented in a cost matrix C with each matrix element c_ij ∈ [0, 1]. The optimal solution assigns each worker a job while minimizing the costs, as defined in the following equation.

minimize \sum_{i=1}^{|P_s|} \sum_{j=1}^{|P_t|} c_{ij} m_{ij}    (6.7)

subject to \sum_{i=1}^{|P_s|} m_{ij} = 1, \; j \in 1 \ldots |P_t| \;\wedge\; \sum_{j=1}^{|P_t|} m_{ij} = 1, \; i \in 1 \ldots |P_s|

Figure 6.8: Partition matching with Hungarian assignment

We propose to adapt the solution of this problem for partition assignment in a one-to-one manner. That means the Hungarian algorithm can be used to identify one-to-one partition pairs with an optimal overall similarity. This is done by defining the costs as c_ij = 1 − sim_ij, because the costs depict the distance between two partitions. The main idea of the algorithm is to subtract from each row and column of the cost matrix the minimal value. Subtracting the row or column minimum leads to at least one zero in each row and column. Then the algorithm tries to mark exactly |P_t| rows or columns in such a way that all zeros are covered. If this procedure fails, it again subtracts the minimum of all uncovered values. Thereby, the algorithm shows a complexity of O(n³). Since we adopted the algorithm unchanged, we refer to [85] for details. Figure 6.8 depicts an example output of the Hungarian assignment. As described, one-to-one assignments are produced with an overall optimal similarity. Unfortunately, multiple assignments of source to target partitions are not identified, since every source gets exactly one target assigned. The problem of one-to-one assignments becomes more distinct in case of a count mismatch between the source and target partitions. For instance, in case of 5 source and 100 target partitions only 5 assignments can be computed. The low number of assignments may lead to a decrease in matching result quality. Therefore, we also investigate the generalized assignment as described in the following.
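The adaptation itself is a simple cost conversion, as the following sketch shows; the O(n³) Hungarian solver is assumed to exist behind the hypothetical name HungarianSolver.solve, which returns the assigned target index per source partition:

    // Sketch: costs are distances, c_ij = 1 - sim_ij; the solver itself
    // follows [85] and is assumed to be available.
    static int[] hungarianAssign(double[][] sim) {
        double[][] cost = new double[sim.length][];
        for (int i = 0; i < sim.length; i++) {
            cost[i] = new double[sim[i].length];
            for (int j = 0; j < sim[i].length; j++)
                cost[i][j] = 1.0 - sim[i][j];   // high similarity = low cost
        }
        return HungarianSolver.solve(cost);     // one target per source partition
    }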

6.3.2.4 Generalized assignment

The Generalized Assignment problem [105] also tackles the assignment of a set of source and target elements w.r.t. costs. In contrast to the Hungarian algorithm, it relaxes the condition of one-to-one assignments to one-to-many. Generalized Assignment also deals with the Knapsack problem and an element-to-bag distribution. Therefore, a profit matrix P = (p_ij) is defined, representing the profit of assigning an element to a bag. For our partition assignment problem that is the assignment of a source partition to target partitions and thus their representative similarity sim_ij = sim(p_i^s, p_j^t). In addition, every target partition p_j^t ∈ P_t gets a capacity c_j (similar to a container).

Figure 6.9: Partition matching with generalized assignment

The capacity denotes the maximal weight assignable to p_j^t. Consequently, every pair of source partition p_i^s ∈ P_s and target partition p_j^t is assigned a weight w_ij. The weight and capacity determine the degree of assignability between two partitions. We chose to use the number of matching partition representatives of two partitions p^s, p^t as weight. This number counts the representative pairs (e_s, e_t) with similarity values exceeding a certain threshold x. The number of representatives considered, chosen by their degree, is a user-defined parameter. We note the weight definition as follows:

w_{ij} = w(p_i^s, p_j^t) = |\{(e_s, e_t) \mid sim(e_s, e_t) > x, e_s \in p_i^s, e_t \in p_j^t\}|    (6.8)

We define the capacity as the cardinality of the union of all representative pairs counted by the weights, because this represents the maximal number of assignments possible. That means c_j = |\bigcup_{i,j} \{(e_s, e_t) \mid sim(e_s, e_t) > x, e_s \in p_i^s, e_t \in p_j^t\}|. The following equation summarizes the optimization problem to be solved.

maximize \sum_{i=1}^{|P_s|} \sum_{j=1}^{|P_t|} sim_{ij} x_{ij}    (6.9)

subject to \sum_{i=1}^{|P_s|} w_{ij} x_{ij} \leq c_j, \; j \in 1 \ldots |P_t|

x_{ij} \in \{0, 1\}, \; i \in 1 \ldots |P_s|, \; j \in 1 \ldots |P_t|

Unfortunately, the underlying Knapsack problem is NP-complete [105]. Therefore, several algorithms have been proposed to find an approximate solution to the problem, as presented in [13]. Martello and Toth [104] proposed an algorithm with an average deviation of 0.1% compared to the optimal solution. Since we implemented the algorithm unchanged, we only give a short outline. The algorithm consists of two phases: an initial assignment phase and an optimization phase. The initial phase is executed for each of the measures p_ij, p_ij/w_ij, −w_ij, and −w_ij/c_j, choosing assignments such that each element is assigned to the bag with the maximal measure value. The procedure is done for all elements and measures, choosing the solution with the highest profit. Subsequently, the optimization phase tries to swap elements to improve the profit under the capacity restriction. Figure 6.9 depicts an example output for the generalized assignment solution of the partition assignment problem. It shows that every source partition gets at least one target partition assigned for matching and that multiple assignments for one source partition are part of the output. However, the weight calculation and thus the capacity calculation need the number of representatives and the definition of a threshold, which introduces another parameter. The algorithm also suffers from a decrease in the fraction of pairs selected with increasing input size. The more partitions are part of the input, the smaller the fraction of all possible pairs selected for matching. The fewer pairs are selected for matching, the fewer elements are matched, which potentially decreases the result quality.
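The weight of Eq. (6.8) can be sketched as a simple count of representative pairs above the internal threshold x; the RepSim callback and all names are illustrative assumptions:

    import java.util.List;

    interface RepSim { double sim(int repS, int repT); }  // assumed similarity callback

    // Sketch of Eq. (6.8): number of representative pairs exceeding x.
    static int weight(List<Integer> repsS, List<Integer> repsT, RepSim f, double x) {
        int w = 0;
        for (int es : repsS)
            for (int et : repsT)
                if (f.sim(es, et) > x) w++;   // matching representative pair
        return w;
    }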

6.3.3 Comparison

The four assignment approaches are different in their nature and each shows advantages and disadvantages. Therefore, we have arranged the approaches in Tab. 6.2. The first column shows the approaches and the next columns list the corresponding advantages and disadvantages. The threshold-based assignment is the simplest approach and allows for many-to-many mappings. As discussed before, it suffers from the fixed number as threshold and thus is scenario-specific. Besides the major drawback of scenario dependence, also the coverage, that is the fraction of elements of a metamodel considered for matching, is unknown. As shown in the example in Fig. 6.6, the threshold-based assignment may skip partitions for matching and thus lower the coverage. Coverage is the share of assigned partition pairs relative to all possible pairs, i. e. the cartesian product.

Quantile-based assignment also allows for many-to-many mappings and shows stable results in contrast to threshold-based assignment. Since a fraction of all possible pairs is selected, the number is predictable. However, the quantile still suffers from the problem of coverage, because partitions may be skipped for matching, thus the coverage is unknown.

Interpreting assignment as an optimization problem, the approach of Hungarian assignment produces one-to-one mappings for all partitions and thus it is complete. That means it shows a complete coverage, because every partition will be considered for matching. The strength of the Hungarian is also its weakness, because it does not allow for one-to-many assignments. It also shows cubic complexity (O(n³)), which introduces computational overhead. Since the Hungarian algorithm selects optimal one-to-one pairs, the overall number of pairs is lower than for the quantile (equal to the number of source or target partitions).

The generalized assignment approximation tries to overcome some of the limitations of the Hungarian algorithm. It allows for one-to-many mappings while having a quadratic complexity. However, it still shows a small number of pairs being selected for matching, which may reduce matching quality. Additionally, the generalized assignment relies on an internal threshold (the capacity), which needs to be defined based on average scenarios.

Assignment        | Complexity   | Advantages                               | Disadvantages
Threshold         | O(n²)        | n:m mappings                             | manual threshold, scenario dependent, unknown coverage
Quantile          | O(n² log n)  | n:m mappings, selected pairs predictable | unknown coverage
Hungarian [85]    | O(n³)        | 1:1 mappings, complete coverage          | O(n³), no n:1 mappings
Generalized [104] | O(n² log n)  | 1:n mappings, complete coverage          | uses internal threshold

Table 6.2: Advantages and disadvantages of the four assignment approaches

As discussed, all algorithms show advantages as well as disadvantages. Therefore, we will examine the algorithms in our evaluation to derive recommendations for the usage of partition assignment approaches.

6.4 Summary

In order to solve the memory problems in the context of large-scale metamodel matching, we proposed a planar graph-based partitioning and a partition-based matching process. To reduce runtime, we also considered the partition assignment problem. Thereby, our novel approach determines and compares partition representatives and, based on their similarity, selects partitions for matching. We studied four solutions, two of them mapping the assignment onto an optimization problem. These contributions are summarized in Tab. 6.3 and detailed as follows.

Planar graph-based partitioning We have presented a planar partitioning approach to cope with the demands of large-scale metamodel matching. First, we deduced a requirement-driven analysis of existing partitioning and clustering algorithms. We concluded with the selection of planar graph-based partitioning [3], also referred to as PES, mainly because of its quadratic runtime and support for metamodel graphs. We adopted the three-phase approach of planar partitioning, which splits an input metamodel into partitions of similar size by an initial partitioning utilizing our k-max degree approach and a re-partitioning by removing elements. Finally, these elements and partitions are merged as proposed by us, using coupling and cohesion to optimize the amount of structural information per partition.

Concept: Planar graph-based partitioning

• Planar graph-based partitioning for matching with a partition-based matching process
• Definition of the seed vertex for partitioning based on k-max degree
• New partition merging phase based on coupling and cohesion

Concept: Partition assignment

• Partition similarity calculation based on k-max degree representatives
• Analysis and comparison of four assignment approaches
• Mapping of the partition assignment problem on the generalized assignment algorithm

Table 6.3: Contributions of graph-based partitioning and assignment

Partition assignment To reduce the number of comparisons between partitions, we proposed solutions for the partition assignment problem. There, partitions get assigned in pairs for matching based on their similarity, preferring pairs with high similarity for increased result quality. For an efficient partition similarity calculation we propose to apply our k-max degree approach for selecting partition representatives. These representatives are used for the partition similarity calculation. For the selection of partitions to be matched we investigated four partition assignment approaches: threshold, quantile, Hungarian, and generalized assignment.

Based on an initial similarity between partitions, a selection of pairs has to be made. Thereby, threshold and quantile allow for many-to-many partition assignments by either selecting pairs above a certain threshold or a fraction of possible pairs. In contrast, Hungarian and Generalized Assignment try to solve the assignment as an optimization problem. They produce either one-to-one (Hungarian) or one-to-many assignments (Generalized Assignment). These approaches will be compared in our evaluation.

Chapter 7

Evaluation

In Chap. 5 we presented a planar graph edit distance matcher and graph mining-based matching to improve matching result quality. To tackle the runtime and memory issues, we proposed in Chap. 6 a planar graph-based partitioning and discussed four partition assignment algorithms. This evaluation chapter will answer the question: To which degree did we achieve our goals to improve and support large-scale metamodel matching? Therefore, we present the fourth contribution of our work: a systematic evaluation based on existing real-world large-scale mappings from the MDA community and SAP business scenarios. These mappings are compared to the automatic results of our validating matching system. This matching system incorporates our matchers and the proposed partitioning. In detail, we first present our evaluation strategy and describe the data sets used. We distinguish academic data sets from real-world business data and will show that especially the real-world data fulfils the requirements of a hard and diverse data set. We then describe our matching framework MatchBox, which serves as the implementation basis for our evaluation. Subsequently, we introduce the measurements used. We then evaluate our matchers and the partitioning in terms of correctness and completeness as well as memory consumption and runtime behaviour. Finally, we summarize and discuss our results by presenting the applicability and limitations of our proposed solutions.

7.1 Evaluation strategy

In order to study the quality of a matching system, the quality of the mappings calculated has to be assessed. This assessment is done by comparing the mappings calculated with existing mappings, so-called gold-standards, for two given metamodels. Based on this, correctness and completeness and their harmonic mean can be derived. Those measures are also widely used in matching system evaluations, e. g. in [126, 37, 33, 38].


Based on this approach, the main goal of our evaluation is to demonstrate to which degree our solution meets the requirements and corresponding goals. We define the following success criteria based on our requirements. The first goal is to increase the correctness and completeness of matching results (G1), the second to support matching of large-scale metamodels (G2). For the first of our goals, G1, we derived the following questions to be answered by our evaluation:

• To which degree does our planar graph edit distance matcher improve result quality?

• To which degree does our design pattern and redundancy matcher improve result quality?

The derived success criterion for our matchers is an increase in result quality. The methodology we apply is to compare our matchers to a baseline. The baseline is a matching system with state-of-the-art matching techniques. Thereby, we observe and interpret the resulting changes in terms of correctness and completeness of the matching results. In order to assess the goal G2 regarding our graph-based partitioning, we formulate the following questions:

• Which partition size is the optimal choice w.r.t. maximal result quality?

• To which degree does our planar partitioning reduce memory consumption?

• Which assignment algorithm should be used for runtime reduction while minimizing the loss in result quality?

The resulting success criterion is a decrease in memory consumption and runtime with a minimal loss in result quality. The methodology we follow to answer these questions is to compare our matching system without (baseline) and with our partitioning algorithm. Thereby, we investigate memory consumption and runtime as well as changes in correctness and completeness to determine the trade-off between quality and scalability.

The main challenge of a matching system's evaluation is the test data, i. e. the gold-standards. Since we developed generic concepts, we investigated several data sets from different technical spaces to apply our concepts. First, we extracted gold-standards from model transformations of the ATL-zoo as proposed by us in [151]. Since this data set is of an academic nature, we also investigated mappings available within SAP. Thereby, we discovered real-world message mappings between business schemas which serve as the second data set. In the following we will first describe our matching system's implementation and subsequently our data sets.

7.2 Evaluation framework: MatchBox

In this section we will introduce our evaluation matching system. Basically, it takes two metamodels as input and creates a mapping, i. e. correspondences between model elements, as output. In order to create mappings we use a matcher framework that forms the basis for combining results of different metamodel matching techniques. For this purpose, we adopted and extended a matcher combination approach as proposed by us in [146, 147]. We have chosen the SAP Auto Mapping Core (AMC), which is an implementation inspired by COMA++ [25], a schema matching framework. In contrast to COMA++, the AMC consists of a set of matchers operating on trees. It incorporates schema indexing techniques for industrial applications, whereas COMA++ is an academic prototype operating on a directed acyclic graph (closest to a tree).

The evaluation was performed on a laptop running Java 1.6.0.22 (64-bit) on 4 Intel i5 cores with 2.4 GHz each. The main memory was 4 GB, of which 2 GB had been assigned to Java. The operating system was Windows 7 (64-bit).

In the following we will explain MatchBox's architecture and its components. Afterwards, we outline the matching algorithms implemented, demonstrating how they are applied to metamodels. Finally, we describe the combination of the different matchers' results by an aggregation and selection leading to the creation of mappings.

7.2.1 Processing steps and architecture

MatchBox is built around an exchangeable matching core, enriching it with a graph model and functionality for metamodel import, similar to a traditional matching system as described in Chap. 2. In order to create mapping results, several steps have to be performed, as outlined in Fig. 7.1:

1. Importing metamodels into the internal data model of MatchBox,

2. Applying matchers to obtain similarity values,

3. Combining similarity values of different matchers (aggregation and selection) and creating a mapping.

These steps are in detail: (1) The metamodels have to be transformed into the internal graph model of MatchBox. This step is necessary to apply generic matching techniques, e. g. independent of the technical space or level of abstraction. Having transformed the metamodels, several matchers can be applied, each leading to separate results for two metamodel elements (2). The system can be configured by choosing which matchers should be involved in the matching process. Each matcher places its results into a matrix containing the similarity values for all source/target element combinations.

Figure 7.1: Processing of our metamodel matcher combination framework MatchBox [151]

These matcher result matrices are arranged in a cube, which needs to be aggregated in order to select the results for a mapping. This is done in the third step (3) to form an aggregation matrix (e. g. by calculating the average). The entries in the matrix are similarity values for each pair of source and target elements. These values are filtered using a selection, e. g. by selecting all elements exceeding a certain threshold. Finally, the selected entries are used to create a mapping.

7.2.2 Matching techniques

The matchers of the core operate on the internal data model. Each matcher takes two elements as input and produces their similarity value as output. Adopting the SAP AMC framework, we applied a set of the most common matchers, namely: name matcher, name path matcher, parent matcher, children matcher, sibling matcher, leaf matcher, and data type matcher. The concepts of the matchers implemented are described in the following.

Name matcher This matcher targets the linguistic similarity of metamodel elements. It splits given labels into tokens following a case-sensitive approach. Afterwards, for each token a similarity based on trigrams is computed. The trigram approach determines the total count of equal character sequences of size three (trigrams) and finally compares them to the overall number of trigrams. Alternatively, a string edit distance based on Levenshtein [153] can be used.
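One plausible reading of this trigram comparison is sketched below (ours, not necessarily the exact AMC implementation): it relates the shared trigrams of two labels to all their trigrams.

    import java.util.*;

    // Sketch of a trigram similarity: shared trigrams over all trigrams.
    static double trigramSim(String a, String b) {
        Set<String> ta = trigrams(a), tb = trigrams(b);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0;  // two very short labels
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);                          // equal character sequences
        Set<String> all = new HashSet<>(ta);
        all.addAll(tb);                                // overall number of trigrams
        return (double) common.size() / all.size();
    }

    static Set<String> trigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++)
            result.add(s.substring(i, i + 3));         // all substrings of size three
        return result;
    }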

Name path matcher This matcher performs a name matching on the containment path of an element. Hence, it helps to distinguish sublevel domains in a structured containment tree even if leaf nodes have equal names. Essentially, a name path is a concatenation of all elements along the containment path of a specific element. For matching, the name matcher is applied on both name paths, whereby separators are omitted.

Parent matcher This matcher follows the rationale that having similar parents indicates a similarity of elements. The parent matcher computes the similarity of a source and target element by applying a specific matcher (e. g. the name matcher) to the source's and target's parents and returns the similarity calculated.

Children matcher The children matcher follows the rationale that having similar child elements implies a similarity of the parent elements. This matcher uses any matcher to calculate an initial similarity; for the implementation we chose the leaf matcher, since this matcher shows the best results. The children matcher evaluates the set of available children for a given source and target node. Comparing both sets by applying the leaf matcher leads to a set of similarities which are combined using the average strategy.

Sibling matcher The sibling matcher follows an approach similar to the children matcher. It is based on the idea that a source element whose siblings have a certain similarity to the siblings of a given target element indicates a specific similarity of both elements. As in the children matcher, any matcher can be used for the calculation of similarity values for the siblings. In our implementation, we again chose the leaf matcher. The results of the separate matching between the different siblings are stored in a set. Finally, the set is combined as in the children matcher using the average of all values.

Leaf matcher This matcher computes a similarity based on similar leaf children. Thereby, the subtree beneath an element is traversed and all elements without children (leaves) are collected. The set of leaves corresponding to a source element is compared to the set belonging to a target element by applying the name matcher and aggregating the single results for the source and target element. These aggregated values are aggregated again using the average strategy.

Data type matcher The data type matcher uses a static data type conversion table. In contrast to the type system provided by XML, metamodels and in particular EMF allow a broader range of types. For example, EMF allows defining data types based on Java classes. We extended the data type matcher by conversion values for metamodel types and a comparison of instance classes. For instance, comparing two attributes, one of type EFloat, the other of type EInt, the data type matcher evaluates their data types, performs a look-up in its type table, and returns a similarity of 0.6.
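A minimal sketch of such a conversion table follows; apart from the EFloat/EInt value of 0.6 stated above, all entries and names are illustrative assumptions.

    import java.util.*;

    // Sketch of a static type conversion table; entries other than
    // EFloat/EInt (0.6, stated in the text) are assumed values.
    static final Map<String, Double> TYPE_TABLE = new HashMap<>();
    static {
        TYPE_TABLE.put("EFloat|EInt", 0.6);
        TYPE_TABLE.put("EInt|ELong", 0.8);   // assumed value
    }

    static double typeSim(String s, String t) {
        if (s.equals(t)) return 1.0;          // identical types match fully
        Double v = TYPE_TABLE.get(s + "|" + t);
        if (v == null) v = TYPE_TABLE.get(t + "|" + s);  // table is symmetric
        return v != null ? v : 0.0;
    }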

Figure 7.2: Example of aggregation and selection of matcher results: (i) matcher result matrices (name and leaf matcher), (ii) aggregation (max), (iii) selection (threshold > 0.4)

7.2.3 Parameters and configuration

Following the AMC concept, each matcher produces an output result in the form of a matrix. This matrix orders the source and target elements along the X and Y axis respectively. The cells are filled with similarity values between 0 and 1. All similarity matrices are arranged along the matcher types (Z axis), resulting in a cube. An example of a similarity matrix is given in Fig. 7.2, part (i).

In order to combine the results obtained, the combination component supports different strategies adopted from the AMC, similar to the ones proposed in [8] or [25]. The strategies are aggregation, selection, direction, and a combination thereof. The aggregation reduces the similarity cube to a matrix by aggregating all matcher result matrices into one. It is defined by three strategies: max, average, and weighted. The selection filters possible matches from the aggregated matrix according to a defined strategy, e. g. see Fig. 7.2 (ii) and (iii). Possible strategies are threshold, maxN, and maxDelta. The direction is dedicated to a ranking of matched elements according to their similarity. Matchers which use other matchers need to combine similarity values, e. g. the children and sibling matchers. This is covered by providing two strategies: average and dice. For more details on the strategies please see Sect. B.2 in the appendix.

These combination strategies grant the possibility of configuring the matching results by using the strategies and their parameters as outlined before. Typically, a default strategy is defined for a user of MatchBox. Following the matching process, having aggregated and selected the similarity values, the mappings are created.
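The following sketch illustrates the max aggregation and threshold selection on the similarity cube (cube[m][i][j]: similarity assigned by matcher m to source element i and target element j); the names are ours, not the AMC API.

    import java.util.*;

    // Sketch of max aggregation: reduce the matcher cube to one matrix.
    static double[][] aggregateMax(double[][][] cube) {
        int n = cube[0].length, m = cube[0][0].length;
        double[][] agg = new double[n][m];
        for (double[][] matrix : cube)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < m; j++)
                    agg[i][j] = Math.max(agg[i][j], matrix[i][j]);
        return agg;
    }

    // Sketch of threshold selection: keep element pairs above the threshold.
    static List<int[]> selectByThreshold(double[][] agg, double thres) {
        List<int[]> mapping = new ArrayList<>();
        for (int i = 0; i < agg.length; i++)
            for (int j = 0; j < agg[i].length; j++)
                if (agg[i][j] > thres) mapping.add(new int[] { i, j });
        return mapping;   // selected correspondences form the mapping
    }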

7.3 Evaluation Data Sets

The goal of our data set selection was to have heterogeneous real-world and academic data that covers a variety of structural and linguistic properties. We selected two evaluation data sets, the ATL-zoo¹ as proposed by us in [151] and web service message mappings in an enterprise environment (ESR). To introduce both sets we will first define the metrics used to describe them; the data set statistics can also be found in [149]. We separated them because the first is an academic data set and the second is from an enterprise context of SAP. In addition, the ATL-zoo data sets consist of EMF [109] metamodels, whereas the ESR data is imported from schemas extracted from web service message mappings.

7.3.1 Data set metrics

The real-world and academic data sets we introduce are diverse in size, structure, naming, etc., and incorporate in sum 51 mappings. Since the metamodel definition or graph itself can be difficult to comprehend when exceeding a certain size, we characterize them using metrics. We divide the metrics into the following three groups:

• Dimensions of metamodels and mappings,

• Linguistic properties, and

• Structural properties.

Dimensions The first metric is the size s_m of a metamodel m as defined in (7.1). It is defined as the number of elements E_m of the metamodel m. Please note that the size does not include the number of relations or labels.

s_m = |E_m|    (7.1)

The size of a metamodel is not sufficient information to characterize a mapping between two metamodels with sizes s_s and s_t. Therefore, we also define the source-target element ratio as a metric to show the ratio of the sizes of the metamodels mapped. We define the ratio (7.2) as the minimum of the source and target metamodel sizes divided by their maximum.

r = \frac{Min(s_s, s_t)}{Max(s_s, s_t)}    (7.2)

Following the ratio, a mapping's size is also of interest, because it shows the number of matches which are to be discovered by a matching system.

¹http://www.eclipse.org/m2m/atl/atlTransformations/

We define a mapping’s size (7.3) as the number of matches M, which are a subset of the cartesian product of source Es and target elements Et.

m_{s,t} = |M|, \; M \subseteq E_s \times E_t    (7.3)

Having defined a mapping’s size and the source-target metamodel ratio, the next step is to investigate the coverage of a mapping. That is the average of the fractions of the mapping and metamodel sizes. That means how many metamodel elements are mapped relative to all elements.

c_m = \frac{1}{2} \left( \frac{m_s}{s_s} + \frac{m_t}{s_t} \right)    (7.4)

Linguistic properties In accordance with the matching techniques classification of Sect. 2.1.2.2, the linguistic properties can be given by the number of labels available. In the context of metamodels these labels are names. Therefore, the name similarity, and thus the amount of linguistic information available for matching, can be measured by the name overlap of the elements of the metamodels to be compared. The name overlap, as also used by [64], describes the ratio between the identical and all names. As defined in (7.5), it is the fraction of identical (overlapping) source element name labels L_s^n (n denotes the name) and target element name labels L_t^n, i. e. L_s^n ∩ L_t^n, and all name labels given by both metamodels, L_s^n ∪ L_t^n.

N = \frac{|L_s^n \cap L_t^n|}{|L_s^n \cup L_t^n|}    (7.5)

Since the name overlap only describes a strict linguistic similarity by requesting identical names, we relax the condition by using the token overlap instead of names. We define the set of tokens of a name as its trigrams tri(L), that is all substrings of size three, e. g. as in the matching systems [100, 23]. We took trigrams because they are a common approach for name matching, and we define the metric as given in (7.6). The equation is similar to Eq. 7.5 but uses tokens rather than names.

T = \frac{|tri(L_s^n) \cap tri(L_t^n)|}{|tri(L_s^n) \cup tri(L_t^n)|}    (7.6)
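A sketch computing the name overlap (7.5) and token overlap (7.6) from two sets of element names follows; it reuses the trigrams tokenizer sketched for the name matcher in Sect. 7.2.2, and all names are ours.

    import java.util.*;

    // Sketch of Eq. (7.5): shared labels (or tokens) over all labels.
    static double overlap(Set<String> ls, Set<String> lt) {
        Set<String> common = new HashSet<>(ls);
        common.retainAll(lt);               // identical names (or tokens)
        Set<String> all = new HashSet<>(ls);
        all.addAll(lt);                     // all names of both metamodels
        return all.isEmpty() ? 0.0 : (double) common.size() / all.size();
    }

    // Sketch of Eq. (7.6): the same ratio on the trigram sets of all names.
    static double tokenOverlap(Set<String> ls, Set<String> lt) {
        Set<String> ts = new HashSet<>(), tt = new HashSet<>();
        for (String name : ls) ts.addAll(trigrams(name));
        for (String name : lt) tt.addAll(trigrams(name));
        return overlap(ts, tt);
    }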

Structural properties The structural properties are defined by the underlying graph structure, which we aim to utilize for matching. Since we make use of our k-max degree approach, we give values for the maximal degree of a metamodel's elements, that is the number of neighbours or edges (except for the root package element). For the definition please see Sect. 2.2, Eq. 5.5.

We define the edge count as another metric. The number of edges is defined as the number of relations R_m of a metamodel m, as given in (7.7).

e_m = |R_m|    (7.7)

To measure the amount of information contained in a tree, we define the tree-original ratio. The ratio captures the number of edges in a metamodel's tree R_tree relative to the original graph R_graph. It is defined as in (7.8), which gives the fraction between the tree's edges and the graph's edges.

\delta_{tree} = \frac{|R_{tree}|}{|R_{graph}|}    (7.8)

Similar to the tree ratio, we also define the planar-original graph ratio. That is the fraction of the edges contained in the planarisation of a graph, R_planar, and the edges of the original graph R_graph (see (7.9)).

\delta_{planar} = \frac{|R_{planar}|}{|R_{graph}|}    (7.9)

In the subsequent sections we will apply our metrics to the ESR and ATL-zoo data to characterise the mapping data sets used for our evaluation.

7.3.2 Enterprise service repository mappings

The enterprise service repository mappings are a heterogeneous real-world data set covering various business domains. It consists of more than 100 mappings². These mappings are extracted from the ESR³ of SAP. Moreover, they are heterogeneous in naming⁴ as well as in structure.

The mappings have been manually exported from the ESR. Since they are defined in a SAP proprietary mapping format [83], they have been imported using a parser generated from a grammar specification using EMFText [57]. The grammar is given in Appendix A.2. The ESR mappings and metamodels are then automatically transformed into our internal mapping model using Java.

From the available ESR data we selected 31 mappings and consequently 62 participating metamodels. We based the selection on the matchability of the cases⁵. Please note that we did not remove existing duplicate schemas, since they participate in different mappings and are, therefore, also part of the matching multiple times.

²The domains included in this diverse data set are: financial, human resources, point of sale, health care, device integration, accounting, supplier relationship management (SRM), supply chain management (SCM), etc.
³A central repository in which service interfaces and enterprise services are modeled and their metadata is stored [83].
⁴The languages used are English, German, and Spanish, and there are metamodels that only use codes as naming, etc.
⁵The selection was based on a threshold of 0.2 for the F-Measure obtained by an average matching configuration. That means at least 20% of all mappings should be found by the standard MatchBox and be correct.


Figure 7.3: Size of ESR metamodels

Metamodel size The ESR data set consists of 62 metamodels participating in 31 mappings. The maximum size is 1,401 elements, whereas the smallest metamodel consists of 6 elements. Figure 7.3 depicts the sizes of all 62 metamodels sorted in ascending order. It can be seen that most of them are below a size of 200 elements. However, we also identified 15 large-scale examples which are used to evaluate the scalability of our approach.

Source target ratio We depict the distribution of the source-target element ratio r in Fig. 7.4. To better illustrate the smaller examples, we removed the ratio for examples with more than 500 elements (those are almost equal in size and thus close to the line). The centred line shows a one-to-one ratio, which means source and target size are equal. Most of the mappings show a similar size ratio. However, there are extreme values, with a ratio of 17 to 479 elements being mapped onto each other. This raises the question of the mapping size and coverage, which will be answered in the following.

Mapping size The ESR mapping data set comprises 31 mappings with an average size of 150.09. The maximal size is 1,401, whereas the minimum is 4. The distribution of the mappings is given in Fig. 7.5. Most of the mappings are below 200 elements and thus not of large-scale dimension. However, eight mappings exceed the size of 500 and are not depicted for a better overview. The ratio of those mappings is almost one.

Mapping coverage The coverage of the ESR mappings, i. e. the ratio of source and target elements mapped, is depicted in Fig. 7.6. On average the coverage is 0.83, which means that on average 83% of source and target elements are mapped onto each other. The maximum is 1, which means all elements are mapped, whereas the minimum is 0.30; there, only 30% of all elements are mapped onto each other. Again, these values underline the real-world properties of the ESR mappings and their considerable range.


Figure 7.4: Source and target metamodel element ratio for ESR mappings with metamodels smaller than 500 elements


Figure 7.5: Size of ESR gold-standard mappings


Figure 7.6: Coverage of ESR gold-standard mappings

7.3.2.1 Linguistic properties

The linguistic properties of the ESR mappings are investigated by the name overlap and the token overlap.

Name overlap The diagram in Fig. 7.7 shows the distribution of the name overlap of our ESR data (grey bars). 16 of all metamodel pairs show no name overlap, 4 are identical in their names with an overlap of 1.0, and 4 show only an overlap of 0.1. The remainder distributes between 0.1 and 1.0 overlap. This demonstrates that the ESR mappings include some identical mappings but many with different naming, which indicates failure when applying name-based matching and thus potential improvements by structure-based approaches. This is also underlined by an average name overlap of 0.28.

Token overlap The token overlap is depicted in Fig. 7.7 (white bars). Complementary to the name overlap, it depicts the number of overlapping trigrams. This time all mappings show a token overlap of at least 0.1. In addition, 17 cases have a token overlap of 0.2 or less, which is also unsuitable for name-based matching, but 6 cases have a token overlap of 1.0; these are the identical mappings and derivations thereof.

7.3.2.2 Structural properties

The structural properties of the enterprise service mappings are first measured by the number of edges available per metamodel, indicating the information available for matching. Figure 7.8 shows the number of edges for all ESR mapping metamodels ordered according to the metamodel's size. On average a metamodel has 203.79 edges; at most 1,491 edges exist, and the minimum is 5 edges.


Figure 7.7: Name and token overlap for ESR source and target metamodels


Figure 7.8: Number of edges for ESR metamodels

Maximal degree Since we follow a degree-based approach whenever a representative or element is selected, we took values for the maximal degree of the ESR metamodels. We ordered them according to the metamodel's size in Fig. 7.9. The maximal degree has no direct connection to a metamodel's size, although small metamodels tend to show a smaller maximal degree than large ones. The average maximal degree is 25.16, with a maximum of 141 and a minimum of 5. The average degree of all ESR metamodel elements including attributes is 1.13.

The argumentation of our approach is based on the amount of structural information and the superiority of planar graphs compared to trees. Therefore, we investigated the edge ratio of the ESR metamodel graphs w.r.t. a tree as well as a planar graph representation.

Tree-original graph ratio The percentage of edges lost when comparing a tree or planar graph to the original graph is depicted in Fig. 7.10. The test cases are ordered in ascending order of the metamodels' size.


Figure 7.9: Maximal degree for ESR metamodels


Figure 7.10: Loss in edges (1 − δ_planar, 1 − δ_tree respectively) for ESR metamodel graphs

The values of the tree edge loss show that, independent of a metamodel's size, only 3 metamodels are indeed trees, since they have zero loss. However, 59 of 62, i. e. 95% of all models, are not trees. The loss of information is up to 0.25. On average 0.13 is lost, which means that on average a tree of an ESR metamodel only contains 87% of the structural information in terms of edges compared to the original metamodel graph.

Planar-original graph ratio In comparison to the tree-original ratio, we also took the planar-original graph ratio. The value is on average 0.999, meaning almost all ESR metamodels are planar and can be used with our algorithms without any loss in structural information. The maximum loss of edges is 0.026%.

7.3.3 ATL-zoo mappings

The ATL-zoo mappings are a data set extracted from a publicly available source⁶. It is a collection of 109 ATL transformations serving educational and academic purposes.

⁶http://www.eclipse.org/m2m/atl/atlTransformations/


Figure 7.11: Size of ATL metamodels

These transformations are defined between two metamodels using the Atlas Transformation Language (ATL) [70], which defines a grammar for specifying mappings. These mappings express one-to-one as well as one-to-many correspondences. We used these transformations for matching evaluation by transforming them into our internal mapping model. A detailed description of our implementation can be found in Appendix A.1. We selected 10 out of the 109 transformations based on their real-world closeness, omitting educational and toy examples. We also omitted transformations whose metamodels are not based on EMF, because our matching framework MatchBox relies on it. Additionally, we took 10 examples from an academic investigation by Kappel et al. [73]. They used examples for integrating Ecore, UML, and WebML, which we easily imported into our internal data model, since their mapping model is close to ours. In the following we will detail the ATL data set similar to the ESR data set.

Metamodel size The sizes of the participating metamodels range from 5 to 142 elements with an average of 42.55 elements. The corresponding sizes are also depicted in Fig. 7.11 in ascending order. The metamodels are considerably smaller than the ones from the ESR data set, which can also be seen in the size of the gold mappings.

Source target ratio The ratio between the source and target metamodel's sizes for the ATL-zoo metamodels is given in Fig. 7.12. Compared to the ESR mappings, it shows a similar behaviour, with most of the mappings being of rather small and similar size. The average source-target ratio is similar to the ESR's ratio. Thereby, the values range from 0.13 to 1.00, with the biggest size mismatch being 130 to 24 elements.


Figure 7.12: Source and target metamodel element ratio for ATL mappings


Figure 7.13: Size of ATL gold-standard mappings

Mapping size The mappings extracted from the ATL-zoo range in size from a minimum of 3 to a maximum of 100. Furthermore, most of them are around the size of 10 elements, as shown in the diagram of Fig. 7.13. The average size is 24.95 elements mapped for a source and target metamodel.

Mapping coverage The coverage of the mappings is of a lower dimension compared to the ESR, i. e. on average it is 0.35 for ATL in contrast to 0.83 for ESR. That means that the mappings cover fewer elements than the real-world mappings, which underlines the academic nature of the ATL-zoo mappings. The distribution of the coverage is shown in Fig. 7.14, ordered according to the gold mapping size.


Figure 7.14: Coverage of ATL gold-standard mappings

7.3.3.1 Linguistic properties

The linguistic properties of the ATL data set are again given by the overlap of names and tokens for each pair of metamodels participating in a gold-standard mapping (see Fig. 7.15).

Name overlap The ratio of identical names, that is the name overlap, is on average 0.17, with 9 metamodel pairs showing a minimum of 0 and 2 pairs a maximum of 0.74. The remaining metamodel pairs show values between 0.1 and 0.4.

Token overlap The token overlap is in a similar range, with an average of 0.24, a minimum of 0, and a maximum of 0.9, where only 1 of all 20 transformations has no overlap at all. The majority has an overlap of 0.1; one case shows an overlap of 0.85 (part of the class 0.9 in Fig. 7.15). This shows that the ATL-zoo has slightly less linguistic information available compared to the ESR data.

7.3.3.2 Structural properties

The structural properties are depicted in Fig. 7.16 in terms of edges, ordered according to the corresponding metamodel's size in ascending order. Naturally, the number of edges increases with an increasing metamodel size. On average a metamodel has 54.65 edges, with at most 177 edges and a minimum of 6 edges.

Maximal degree The maximal degree distribution for the ATL-zoo is shown in Fig. 7.17. The average maximal degree is 7.2, ranging from 3 up to 17. This is less than the 25.16 of the ESR mappings, which can be explained by the metamodels' sizes. The average element degree including attributes is 1.33.


Figure 7.15: Name and token overlap for ATL source and target metamodels


Figure 7.16: Number of edges for ATL metamodels


Figure 7.17: Maximal degree for ATL metamodels


Figure 7.18: Loss in edges (1 − δ_planar, 1 − δ_tree respectively) for ATL metamodel graphs

Tree-original graph ratio The ratio of edges between a tree and a metamodel's graph is depicted in Fig. 7.18. None of the ATL-zoo metamodels is a tree. The maximal ratio is 0.87, which shows that at least 13% of all edges are lost and not available for matching. Furthermore, the minimum ratio is 0.37, which implies a considerable loss of 63% of the edges. On average, more than one quarter of all edges are not part of a tree, given an average ratio of 0.72.

Planar-original graph ratio In contrast to trees, a planar graph holds more information, as given by the planar-original graph ratio in Fig. 7.18. Only 3 of the 40 metamodels are not planar, which means that for 37 metamodels no information is lost. The remaining three have a loss of 4%, 7%, and 12% respectively. That means that almost all edges are preserved by making the graph planar. The amount of information contained is consequently more than in a corresponding tree, since the planar-original graph ratio is on average 0.99.

7.3.4 Summary

Both data sets, ESR and ATL, are used by us for evaluating our matching system. The ESR data set shows a wider range of mappings and, in terms of size, also provides large-scale examples. In contrast, the ATL-zoo metamodels show more structural information than the ones from ESR, which is due to the fact that the ESR metamodels are extracted from schema definitions which serve as a basis of exchange in the SAP PI system [83]. Schemas by their nature have less structural information, because they are built around tree definitions which become graphs by type references. These schemas are used as metamodels on a type-based graph. That means the type information is preserved in a graph structure rather than flattened and converted to a tree. Table 7.1 shows an overview of the numbers presented to characterize and compare the ATL and ESR data sets.

Metric              |   ESR                       |   ATL
                    | Avg.    Max.      Min.      | Avg.   Max.    Min.
Metamodel size      | 180.94  1,401.00  6.00      | 42.55  142.00  5.00
Source target ratio | 0.55    1.00      0.04      | 0.87   1.00    0.19
Mapping size        | 153.94  1,401.00  4.00      | 24.95  100.00  3.00
Mapping coverage    | 0.83    1.00      0.30      | 0.35   1.00    0.15
Name overlap        | 0.28    1.00      0.00      | 0.17   0.74    0.01
Token overlap       | 0.40    1.00      0.02      | 0.24   0.85    0.03
Maximal degree      | 25.16   141.00    5.00      | 7.20   17.00   3.00
Edge size           | 203.79  1,491.00  5.00      | 54.65  177.00  6.00
Tree ratio          | 0.87    1.00      0.75      | 0.72   0.87    0.37
Planar ratio        | 1.00    1.00      0.98      | 0.99   1.00    0.88

Table 7.1: Comparison of metrics for the ESR and ATL data sets

We conclude that the ESR and ATL data sets provide heterogeneous examples in size, linguistic, and structural properties. Consequently, they provide a profound basis for our evaluation, as given in the following.

7.4 Evaluation Criteria

To evaluate the matching quality we use the established measures: precision p, recall r, and F-Measure F, which are defined in [129] as follows. Let tp be the true positives, i. e. correct results found, fp the false positives, i. e. found but incorrect results, and fn the false negatives, i. e. not found but correct results. Then the formulas for these measures are as in (7.10):

p = \frac{tp}{tp + fp} \qquad r = \frac{tp}{tp + fn} \qquad F = 2 \cdot \frac{p \cdot r}{p + r}    (7.10)

• Precision p is the share of correct results relative to all results obtained by matching. One can say precision denotes the correctness; for instance, a precision of 0.8 means that 8 of 10 matches are correct.

• Recall r is the share of correctly found results relative to the number of all correct results. It denotes the completeness, i. e. a recall of 1 specifies that all mappings were found; however, there is no statement about how many more (incorrect) matches were found.

• F-Measure F represents the balance between precision and recall. It is commonly used in the field of information retrieval and applied by many matching approach evaluations, e.g. [98, 72, 37, 38, 151].

The F-Measure can be seen as the effectiveness of the matching, balancing precision and recall equally. For instance, a precision and recall of 0.5 leads to an F-Measure of 0.5, stating that half of all correct results were found and half of all results found are correct. It is important to note that, when we refer to average precision, recall, and F-Measure, we took the average of those measures separately. That means an average F-Measure is the average of all F-Measures and not calculated as the F-Measure from the average precision and recall.
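A small worked example (ours, for illustration) shows why this distinction matters. Consider two test cases with (p, r) = (1.0, 0.2) and (0.2, 1.0):

    F_1 = 2 \cdot \frac{1.0 \cdot 0.2}{1.0 + 0.2} = 0.33
    F_2 = 2 \cdot \frac{0.2 \cdot 1.0}{0.2 + 1.0} = 0.33

The average F-Measure is therefore 0.33, whereas the F-Measure of the averaged precision and recall (p = r = 0.6) would be 0.6 and would considerably overstate the result quality.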

7.5 Results for Graph-based Matching

The answer to the question ”To which degree does our planar graph edit distance matcher improve result quality?” will be given in this section. Thereby, we followed the approach to first take values for the best average combination using our original system without our graph matchers. We then exchange one of the original matchers for our graph matchers and measure the quality again. The resulting delta shows the improvements achieved by our approach. The detailed series can be found in [149].

Baseline Precision, recall, and F-Measure serve as a basis for comparison of the result quality of our GED matcher and our mining matchers with our baseline of established matching techniques implemented in the MatchBox system. In order to define a baseline representing the best configuration of MatchBox, we first determined the optimal combination of matchers achieving the highest F-Measure. That means the best suited number and choice of specific matchers along with a fixed threshold. We investigated combinations of 4 to 6 matchers⁷ and varied the parameters combination strategy, selection strategy, threshold, MaxN, and delta (cf. Sect. B.2 in Appendix B) in order to identify the configuration leading to the best results.

As our MatchBox system applies tree-based matching techniques⁸, several tree definitions have been investigated by us in [151]. Possible trees are based on the containment, inheritance, or reference relations of a metamodel. We conclude with the containment hierarchy (see Sect. 2.2.2 for the different representations) performing with the best F-Measure [151]. The corresponding configuration consists of the four matchers: name, parent, children, and sibling matcher.

⁷ The range of 4 to 6 matchers has been confirmed by [23] as giving best results.
⁸ State of the art matching techniques implemented operate on a tree, e.g. parent, children, sibling, and leaf matchers. Therefore, for comparison we need to investigate an optimal tree although it contains less information than a planar graph.

System | Precision | Recall | F-Measure
MatchBox (ESR) | 0.450 | 0.612 | 0.485
Falcon (ESR) | 0.758 | 0.392 | 0.449
MatchBox (ATL) | 0.437 | 0.449 | 0.419
EMF Compare (ATL) | 0.368 | 0.220 | 0.247

Figure 7.19: Baseline quality results for the ESR and ATL data sets

For an optimal F-Measure these linguistic and similarity propagation-based matchers are configured with a fixed threshold of 0.3, a delta of 0.04, and average combination for single matcher results. This configuration leads to the results as given in Fig. 7.19. The first group of bars on the left states the results for the ESR data set with precision 0.45, recall 0.612, and F-Measure 0.485. To demonstrate the quality of our matching system and to verify that the data sets are hard, we applied third-party matching systems. For the ESR data set we took the established ontology matching system Falcon [69], because it is also capable of matching schemas. The results for the Falcon system are an average precision of 0.758, a recall of 0.392, and an F-Measure of 0.449. It can be seen that the Falcon system shows a higher precision than MatchBox but a lower F-Measure. This is due to the optimized configuration of Falcon for precise results (precision), while we optimized the configuration of MatchBox for F-Measure⁹. The second group from the right depicts the baseline for the ATL data set with values of precision 0.437, recall 0.449, and F-Measure 0.419. As third-party matching system we took the most prominent approach EMF Compare [10] instead of Falcon, because in contrast to EMF Compare, Falcon is not tailored to metamodels. The values for EMF Compare are depicted in Fig. 7.19 on the right. The values are considerably lower (MatchBox F-Measure 0.419 vs. 0.247 for EMF Compare), because EMF Compare is made for model differencing. That means its assumption is a high similarity of the inputs, which is the case for differencing. However, its authors state that their task is model comparison, thus we took their approach as comparison.

⁹ We could provide a corresponding configuration with similar values for precision but a lower F-Measure, but this leads to the discussion of precision vs. recall. A high precision could be preferred over a high recall, i.e. more correct but less complete results. However, recall could also be preferred over precision because a user has to check all mappings anyway. Therefore, we optimized our evaluation to F-Measure to represent the harmonic mean between both measures.

We would like to point out that EMF Compare also provides an extension mechanism which allows to integrate arbitrary matching techniques. Consequently, MatchBox or parts of it may be integrated in EMF Compare, effectively increasing the result quality. To conclude, it can be seen that the F-Measure values of third-party systems are below ours, demonstrating the hardness of our matching tasks and confirming that only about half of all matches are found and correct for our real-world data sets.

7.5.1 Graph edit distance results

Our evaluation of our Graph Edit Distance matcher (GED) consists of the following three incremental steps:

1. We start with a comparison of the graph edit distance matcher added to MatchBox using reference-based graphs (Sect. 2.2.2) for the ESR and ATL data sets.

2. Then we show further improvement by considering the inheritance graph (Sect. 2.2.2) using our GED matcher in case of the ATL data set. Thereby, the results are combined with the ones obtained by the reference-based GED matcher.

3. Subsequently, we present our k-max degree approach, showing its benefits with increasing k for the ATL data set.

The subsequent sections discuss and present each of the aforementioned steps.

7.5.1.1 Reference-based GED for ESR

We first compared the graphs based on reference representations, i.e. inheritance is not considered. Since the ESR data set is made of business schemas it contains no inheritance, thus we do not need to differentiate between the two possible graph modes. For this purpose, we exchanged the leaf matcher with our planar GED matcher, because the leaf matcher was contributing less to the overall result than the other matchers [150]. The results obtained by adding our reference-based graph edit distance matcher are depicted in Fig. 7.20 in the centre. The numbers of precision 0.494, recall 0.585, and F-Measure 0.517 show an improvement compared to the baseline depicted on the left of the figure. The delta obtained by our GED approach shows an increase of 0.044 (4.4%) in precision and a small loss in recall of 0.027 (2.7%), which yields an increase of 0.032 (3.2%) in F-Measure. For a detailed discussion of this distribution please see Sect. 7.7. In this section we show that our approach in the best case improves the precision significantly by 0.454 (45.4%), where at worst it decreases the precision by 0.046 (4.6%). The recall only changes slightly, which is a clear indication of the precision improvement achieved by the GED.

Measure | MatchBox (baseline) | MatchBox + GED | Delta to baseline
Precision | 0.45 | 0.494 | 0.044
Recall | 0.612 | 0.585 | -0.027
F-Measure | 0.485 | 0.517 | 0.032

Figure 7.20: Result quality and delta for the ESR data set using the baseline MatchBox configuration and MatchBox with the GED matcher

7.5.1.2 Reference-based and inheritance-based GED for ATL

The ATL data set of the MDA community consists of metamodels which also make use of inheritance, e.g. the UML metamodel in an extensive manner. Therefore, we investigated different graph representation modes as introduced in Sect. 2.2.2. The best mode w.r.t. our evaluation is the representation based on no inheritance, i.e. reference edges are not mixed with inheritance edges. Combining MatchBox and a GED matcher based on the reference graph without inheritance matching, the complete ATL test set yields a precision of 0.519, recall of 0.462, and F-Measure of 0.455. Compared to the original MatchBox, this is an improvement of 0.082 (8.2%) in precision, 0.013 (1.3%) in recall, and 0.036 (3.6%) in F-Measure. Since the inheritance information would otherwise be discarded, we added a second planar GED matcher (Inh) based on the inheritance edge graph. Its results are then combined with those of the other matchers. The final numbers of this matcher using inheritance as in Fig. 7.21 (MatchBox + GED + Inh) are: precision 0.541, recall 0.467, and F-Measure 0.464. This leads to an improvement in F-Measure of 0.9% w.r.t. the planar GED and of 5.8% compared to the original MatchBox system. The main improvements are in precision (10.4%), i.e. the mappings presented are more likely to be correct. Finally, we took the values for both GED matchers and a 5-max degree approach, where we chose 5 based on a conservative estimation of low effort for a user, while still achieving improvements w.r.t. the original system's baseline.

Measure | MatchBox (baseline) | MatchBox + GED | MatchBox + GED + Inh | MatchBox + GED + Inh + 5-max | Delta to baseline
Precision | 0.437 | 0.519 | 0.541 | 0.582 | 0.145
Recall | 0.449 | 0.462 | 0.467 | 0.582 | 0.133
F-Measure | 0.419 | 0.455 | 0.464 | 0.543 | 0.124

Figure 7.21: Result quality for the ATL data set using the GED matcher and corresponding delta

Figure 7.21 depicts the final numbers of 0.582 for recall, 0.582 for precision, and 0.543 for F-Measure¹⁰. We achieved an improvement compared to the baseline of 12.4%, increasing the F-Measure from 0.41 to 0.54, i.e. a relative increase of 31.7%. In the following we describe the additional improvements using our k-max degree approach.

7.5.1.3 k-max degree GED for ATL

The evaluation of our k-max degree approach is based on the same set of ATL-zoo test data. We applied only our GED reference matcher and GED inheritance matcher to measure the improvement of the calculation. We assume a one-time input by the user giving correct matches (Seed(k)) for the given problem of k vertices. We took values for k from 0 to 20, because for greater k the values remain mostly unchanged. Here k = 0 denotes the reference value, i.e. matching without seed mappings. The user input was simulated by querying the gold-standards for corresponding mappings; a sketch of this seed selection is given below. As discussed at the end of this chapter in Sect. 7.5.1.4, we refrained from taking the values for the ESR data set. The short reason is the nature of the ESR data set, which would require the GED matcher to start multiple calculations at multiple elements, because the elements and thus seeds are less connected than in ATL. This feature is currently not supported by our implementation, since it raises research questions for result merging. We separate the results into two series using the ATL data set: (1) use k-max and (2) use and add k-max. The first series should demonstrate the impact of seed matches on the similarity calculation. Therefore, we did not add the seed input matches after calculation, to avoid influencing the final results by introducing correct additional matches. The second series preserves the seed matches by adding them after the matching, presenting the results for real-world usage, because a user is interested in all matches.
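The following sketch illustrates how such seed input can be simulated; the graph representation and the gold-standard lookup are hypothetical stand-ins, not the thesis implementation:

    def k_max_degree_seeds(adjacency, gold_standard, k):
        # Rank vertices by their degree and keep the k highest.
        ranked = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
        # Simulated one-time user input: look up the correct match for each
        # seed vertex in the gold standard, as done in this evaluation.
        return {v: gold_standard[v] for v in ranked[:k] if v in gold_standard}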

¹⁰ Please note that the F-Measure is not based on an F-Measure calculation of both averages in precision and recall. Instead, it is the average of the F-Measures taken.

[Line chart: delta of precision, recall, and F-Measure (0 to 0.1) over k = 0 to 20.]

Figure 7.22: Delta of precision, recall, and F-measure for increasing k input mappings used only by the GED matcher for the ATL data set

[Line chart: delta of precision, recall, and F-Measure (0 to 0.25) over k = 0 to 20.]

Figure 7.23: Delta of precision, recall, and F-measure for increasing k input mappings used by the GED matcher and added to the final result for the ATL data set

Figure 7.22 depicts scenario (1), i.e. using only the GED matcher with the given k-max correct matches as input for the calculation, but not adding the correct matches to the overall result, in case of the ATL data. The delta between k = 0 and 20 is shown to demonstrate the benefit of the GED matcher. The measurements are increasing; however, they decrease at k = 5, because the structural vertices provide less correct additional information than for k = 4. Figure 7.23 depicts the results for all three quality metrics when the input matches are added to the final results. Increasing k improves the results, and for k = 3, 5, 8, 17 the slope is maximal, which is due to the nature of our test data. Overall, we achieve an increase in quality by increasing k. We can conclude that the minimal k should be three, since our evaluation shows a considerable increase from this value on. We also demonstrated that our GED is capable of discovering new matches by using seed matches.

[Histogram: frequency of precision deltas, bins ranging from -0.2 to 0.6.]

Figure 7.24: Distribution of precision delta comparing the MatchBox base- line to MatchBox with the GED matcher for ATL and ESR data sets

7.5.1.4 Summary

Our evaluation based on 51 gold-standards (ESR and ATL) has investigated the quality of the planar GED, comparing it to the baseline of the original MatchBox. Combining the numbers from the ESR and ATL data, we noted an improvement of 0.07 (7%) in precision and a decrease of 0.009 (0.9%) in recall, leading to an overall improvement of 0.04 (4%) in F-Measure. Figure 7.24 shows the distribution of the delta of precision for ATL and ESR for a detailed explanation of the GED behaviour. It shows the deltas and their frequency, which illustrates the increase and decrease in precision of our approach. We selected precision because the F-Measure shows a similar but less expressive behaviour, while the recall remains almost unchanged. Indeed there are 7 cases in which the precision shows a loss of up to 0.15 (15%). However, the loss can be explained by two factors: (1) a metamodel size mismatch, and (2) a low token overlap. A source-target metamodel ratio of 0.2 and lower produces a decrease in quality, because it leads to a high calculated edit distance and consequently to a low similarity. Moreover, this behaviour coincides with a token overlap below 0.1, i.e. only 10% of all element name parts are similar. Since our GED relies partly on linguistic similarity, the structure is insufficient to compensate for the low token overlap; thus our GED calculates wrong matches and decreases the result quality. Both in combination, the size mismatch and the token overlap, may serve as a basis for deciding whether the GED matcher should be applied or not, thus compensating for its disadvantages. On the other hand, there are four cases with an increase of up to 0.454 (45.4%, assigned to the 0.5 bin). This considerable number is again explained by the source-target ratio and token overlap. For each of these cases the source-target ratio exceeds 0.5 and coincides with a token overlap of at least 0.4. Consequently, this demonstrates that our GED matcher needs linguistic information and that in those cases the GED can be applied with result quality gains.

[Schematic: (a) positive example – submetamodel variant; (b) negative example – size mismatch.]

Figure 7.25: Positive and negative example for quality changes by the GED matcher

There exist three special cases with gains in precision, but with a size ratio or token overlap below the mentioned thresholds of 0.2 and 0.1, respectively. In these cases one metamodel is structurally similar to parts of the other. The GED is effective in that case and can be applied for improved matching results. Figure 7.25 depicts a positive and a negative example illustrating the GED's behaviour. On the left-hand side (a), one of the metamodels to be mapped is a variant of a submetamodel of the other, which is the case for the special improvements of 0.12 and 0.14. Misleading matches are eliminated, thus improving the precision considerably. The negative example in Fig. 7.25 (b) illustrates the size mismatch, which results in a quality loss of up to 15% in precision. An interesting question is posed by the following observation. Although we applied graph-based techniques which allow for up to 25% more information, the results improved by only up to 15%. The reason is the overall matching process. The previous matching does not solely use a tree-based structure but also applies the name matcher. This matcher calculates its results without any knowledge about structure and similarities for the cartesian product of all elements. Consequently, almost all matches are found (high recall) but with a low precision. The traditional tree-based techniques refine the result, as our GED does as well, exceeding their results, but not in the aforementioned dimensions, because the structure is still not sufficient to capture all semantics and thus mappings.

Applicability We deduce from our observations that the planar graph edit distance matcher can be applied to any metamodel, except those which show a source-target ratio below 0.2 (w.r.t. our data set) and a token overlap below 0.1. In addition, we observed that the GED produces good results for a source-target ratio above 0.5 and a token overlap above 0.4. A sketch of such a pre-check is given below.
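The following sketch shows one possible pre-check based on these observed thresholds; the definitions of size ratio and token overlap are our own simplifications (whole element names instead of name parts), not necessarily those used internally by MatchBox:

    def ged_applicable(source_names, target_names):
        # Source-target ratio: smaller element count over larger element count.
        ratio = min(len(source_names), len(target_names)) / \
                max(len(source_names), len(target_names))
        # Token overlap (simplified): fraction of shared names.
        src, tgt = set(source_names), set(target_names)
        overlap = len(src & tgt) / min(len(src), len(tgt))
        # Thresholds observed in the evaluation: if both fall below,
        # fall back to a non-structural matcher instead of the GED.
        return not (ratio < 0.2 and overlap < 0.1)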

7.5.2 Graph mining results

The evaluation of both graph-mining based matchers is similar to the GED evaluation. We first derive optimal configurations of the mining matcher parameters and then compare their results to our baseline, i.e. the original MatchBox system. Recapitulating, the mining-based matching is performed by pattern mapping and embeddings mapping. This processing order is also followed in our evaluation. That is, first we discuss parameters such as the maximal pattern size to cope with the complexity of the pattern extraction and the pattern mapping. Then we discuss the relevance-based filtering of patterns to reduce the number of comparisons and embedding mappings. Finally, we evaluate the overall quality achieved by the mapping of patterns. We present both mining matchers in each phase in the following. We used the ATL and the ESR data sets to obtain our values.

7.5.2.1 Preliminaries

As already mentioned, the cause problem of graph mining is its exponential complexity. Even though we applied the optimization of using closed graphs, the problem remains. To prevent it from restricting our implementation to a certain size, several solution strategies can be applied:

• Process abortion after a pre-defined time out,

• Definition of a maximal pattern size,

• Filtering of patterns.

The most obvious limitation is the runtime itself. After mining for a specific time, the algorithm can be aborted. Already found patterns can be processed in the following phases. In our tests, the time limit was set to 15 seconds and was reached only in the largest test cases (1,401 elements). We also noticed during our evaluation that graph mining applied to identical metamodels explodes both in the number of patterns and in runtime. This is due to the fact that two identical metamodels contain any combination of patterns in size and position. Therefore, we conclude that our mining-based matching should not be applied to identical metamodels, but these can be easily detected by pre-processing (token overlap and source-target ratio). Consequently, we removed all identical mapping test cases for this part of the evaluation.
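A minimal sketch of this abort strategy follows; the pattern representation and the extension function are hypothetical placeholders:

    import time

    def mine_bounded(seeds, extend, max_size=12, time_limit=15.0):
        # Bounded pattern mining: stop after `time_limit` seconds; patterns
        # found so far are still handed to the subsequent mapping phases.
        start = time.time()
        found, frontier = [], list(seeds)
        while frontier and time.time() - start < time_limit:
            pattern = frontier.pop()
            found.append(pattern)
            # Never grow a pattern beyond `max_size` elements (cf. Sect. 7.5.2.2).
            if len(pattern) < max_size:
                frontier.extend(extend(pattern))
        return found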

7.5.2.2 Pattern mining and mapping

The parameters in question for the extraction phase are: a time out, a maximal pattern size, and a filtering of patterns. Since the time out is not needed for our data set, we evaluated the size restriction and the filtering.

Since the algorithms spend most of the calculation time on big patterns (more than 12 elements), we evaluated a maximum size restriction of the patterns to mine. Our design pattern matcher and the redundancy matcher are both exponential in runtime with respect to the maximal size of found patterns, because both algorithms calculate subgraph isomorphisms for patterns. Additionally, the design pattern matcher also depends on the number of existing patterns, because it searches for all of them, in contrast to the redundancy matcher. This fact also causes exponential memory consumption dependent on pattern size for the design pattern matcher. Evaluating maximal pattern sizes from 5 to 30, we concluded that limiting the size to 12 performed best in terms of runtime, since the exponential growth led to a runtime of minutes for certain examples larger than 13¹¹.

The relevance-based filtering of patterns uses two criteria to decide if a given pattern should be filtered: size and frequency. Therefore, we measured the combination of both relative to the F-Measure obtained. The result for a maximal average F-Measure was α = 2 and β = −1 with a threshold for linguistic classes of t = 0.5 [112]. This means that the quadratic pattern size outweighs the pattern frequency in determining a pattern's relevance. In other words, the bigger a pattern, the more important it is; the more frequent, the less important. A sketch of this weighting follows below.

Another possibility is to filter according to a pattern's type. For instance, trivial patterns, which are attribute-element relationships, need to be filtered. The patterns that were found were categorized as given in the corresponding diploma thesis [112]. There, the patterns were separated into trivial patterns (attribute-class relations), stars (multiple attributes-class), lists (relations forming a path), trees (reference or inheritance trees), and graphs (any other pattern). We observed that our design pattern matcher discovered patterns in 19 of 46 mappings; most of these patterns are from the ATL-zoo test data. In 13 cases those patterns were graphs and in 12 cases trees. It also discovered list patterns in 18 cases, star patterns in 13 cases, and trivial patterns in 19 cases. However, in 19 cases patterns had been filtered out. The redundancy matcher mined patterns in 43 out of 46 metamodel pairs (mappings). Thereby, in only two cases tree patterns were found. The redundancy matcher mined lists in 32 cases and star patterns in 27 cases. This underlines the mining for redundant information, which is local information not conforming to a design pattern, whereas the design pattern matcher mined more graph and tree patterns.
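One consistent reading of this α/β weighting (the exact scoring formula is not spelled out here, so this sketch is an assumption) scores a pattern as size^α · frequency^β and drops patterns below a relevance threshold:

    def relevance(size, frequency, alpha=2, beta=-1):
        # alpha = 2, beta = -1: relevance grows quadratically with pattern
        # size and shrinks with pattern frequency (size**2 / frequency).
        return (size ** alpha) * (frequency ** beta)

    def filter_patterns(patterns, min_relevance):
        # Keep only patterns whose relevance reaches the threshold;
        # `patterns` holds hypothetical (size, frequency) pairs.
        return [p for p in patterns if relevance(*p) >= min_relevance]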

¹¹ We performed a detailed analysis of pattern sizes in the context of a diploma thesis [112].

Measure | MatchBox (baseline) | MatchBox + Design pattern | MatchBox + Redundancy
Precision | 0.446 | 0.468 | 0.468
Recall | 0.556 | 0.568 | 0.559
F-Measure | 0.464 | 0.480 | 0.471

Figure 7.26: Result quality for design pattern, redundancy, and baseline matchers

7.5.2.3 Embedding mappings and results

Using the previous insights for the different parameters, the quality of the design pattern and redundancy matcher was compared to the baseline matching system. Figure 7.26 shows the results obtained by the matchers. An overall improvement of the quality could be achieved by adding the design pattern matcher. Precision as well as recall could be improved, meaning that incorrect mappings were filtered out and new mappings could be found. The overall average precision improved by 2.2%, while the recall for all mappings improved by 1.2%. Thereby, the improvements were limited to 11 of the test cases; for the others either no patterns were mined or the result remained unchanged. The results of the redundancy matcher are different. The ESR and ATL data sets both show redundant information and thus patterns, which leads on average to the results as shown in Fig. 7.26, i.e. improvements of 0.5% in precision and 0.3% in recall. The evaluation showed that the design pattern matcher is best applicable to small- to mid-sized transformations. The redundancy matcher obtained its best results on large models with much redundant and structural information. It could only improve these special-case transformations and is not applicable in general, as the average numbers for F-Measure show. The use of linguistic information during the mining process is essential for the precision of the mappings as well as for the runtime of the algorithms. Nonetheless, additional limits for runtime and maximal pattern size are needed.

7.5.3 Discussion

The high complexity in runtime and memory of the mining algorithms used demanded artificial limitations of the complexity, i.e. a maximal pattern size and a frequency-based pattern filtering. This results in a decrease in the patterns found and their applicability for matching. That means only patterns of limited size (12) were found; hence a divergence in quality has to be accepted. Therefore, we conclude that mining-based matching can be applied, but on average it only shows small improvements. This opens directions for further research, for instance how to apply light-weight matching techniques to reduce the search space instead of using filtering.

Applicability The mining can be applied to any metamodel except identity mappings and metamodels without any patterns. Both properties are easily identifiable via pre-processing; still, there exist metamodels with patterns which do not benefit from our approach.

7.6 Results for Graph-based Partitioning

Our proposed planar partitioning based on the planar edge separator (PES) and the corresponding four assignment algorithms should tackle the problem of large-scale metamodel matching (Chap. 6). Therefore, we identified the questions: Which partition size is the optimal choice w.r.t. maximal result quality? To which degree does our planar partitioning reduce memory consumption? Which assignment algorithm should be used for runtime reduction while minimizing the loss in result quality? To answer these questions we determined the best configuration for the following parameters: partition size, partitioning algorithm, and assignment algorithm (see [149] for the series). We apply the following incremental approach:

1. First, we apply partitioning only, i.e. planar partitioning and two comparable algorithms. We determine their quality, memory consumption as well as runtime, and conclude with the choice of the PES.

2. Subsequently, we investigate the behaviour of the assignment algorithms to demonstrate the effectiveness of the generalized assignment.

3. Having selected a partitioning and an assignment algorithm, we finally present a comparison between partition-based matching and the baseline.

In the following we briefly describe the measurements used, then give our results for the partitioning algorithms, and subsequently discuss the assignment algorithms.

Measure | Baseline
Precision | 0.31
Recall | 0.54
F-Measure | 0.34
Runtime | 1.75 min
Max. memory | 566 MB

Figure 7.27: Quality and runtime baseline for 15 ESR large-scale mappings without partition-based matching

Metrics The result quality is characterized by the introduced metrics precision, recall, and F-Measure. The scalability is assessed using the runtime and memory consumption.

Runtime If not stated otherwise, runtime refers to the complete matching process in seconds, including partitioning, partition assignment, and partition matching. Otherwise, we state explicitly which phases runtime was measured for.

Memory consumption Memory consumption is given by the average of memory in megabytes (MB) used during the complete matching process. We measured memory and runtime using the built-in facilities provided by Java, with a separate memory-observing thread, averaging each number over 10 runs. The resulting baseline numbers for our baseline system (without partitioning) are given in Fig. 7.27¹². The numbers for precision, recall, and F-Measure are 0.31, 0.54, and 0.34. The corresponding memory consumption has a maximum of 566 MB with a runtime of 1.75 min.
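The thesis measurements rely on Java's built-in facilities; the following sketch shows the analogous observer-thread setup in Python (Unix-only, via the standard resource module; ru_maxrss units are platform-dependent), purely as an illustration:

    import resource, threading, time

    def run_measured(match_task, runs=10, interval=0.1):
        # Average runtime and peak resident memory over `runs` executions,
        # with a separate thread observing memory while the task runs.
        times, mems = [], []
        for _ in range(runs):
            samples, stop = [], threading.Event()
            def observe():
                while not stop.is_set():
                    samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
                    time.sleep(interval)
            observer = threading.Thread(target=observe)
            observer.start()
            begin = time.time()
            match_task()
            times.append(time.time() - begin)
            stop.set()
            observer.join()
            # Take one final reading in case the task finished very quickly.
            samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
            mems.append(max(samples))
        return sum(times) / runs, sum(mems) / runs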

7.6.1 Partition size

First, we need to identify the data set as a candidate for evaluating our partitioning. Therefore, we compared MatchBox with and without our graph-based partitioning, respectively applying the enterprise service and ATL data sets. Thereby, we took our PES with the generalized assignment and a partition size of 100¹³. The results for the ATL data set demonstrated that introducing partitioning creates an additional overhead, since the runtimes are 4.53 seconds with partitioning vs. 3.92 seconds without partitioning. It also shows a decrease in F-Measure of 4.49%. The decrease is explained by the removal of context information when partitioning a given input and matching those partitions independently.

¹² We restricted our data set to all samples with more than 200 elements. We also added examples with more than 800 elements of the complete ESR data with an F-Measure below 0.2 but above 0.1, to obtain an evaluation data set of 15 ESR mappings.
¹³ We also tested 50, 150, and 200, where 100 was the most representative. However, the results and conclusions hold for the other sizes as well.

[Two line charts: precision (left) and recall (right) over partition sizes 50 to 500.]

Figure 7.28: Comparison of precision and recall dependent on the partition size for PES

Minimal metamodel size Our evaluation showed that partitioning should be applied for metamodels with more than 200 elements. Consequently, we also reduced the ESR data by removing examples where the participating metamodels have fewer than 200 elements, which leaves us with 15 mappings as gold-standards for evaluation.

Result quality Figure 7.28 depicts our measurements of result quality. Interestingly, the PES shows a precision of 0.27 and a recall of 0.47 at a partition size of 50. This results from the structural arrangement of type information in the business schema metamodels. These types are isolated with their definition in partitions – by our PES and our proposed merging – and positively matched with each other. This information is lost when increasing the partition size, because then more than the type information is contained in a single partition. The recall, as given in Fig. 7.28, also shows an increase beginning with a size of 50 up to 500, again due to the increase in context information.

Runtime and memory We observe that the partition size influences the result quality. But it also affects memory and partitioning runtime, as shown in Fig. 7.29. The PES shows a runtime of up to 0.43 minutes at worst. An increase in the partition size also leads to an increase in the runtime, because more re-partitionings and mergings have to be calculated. Regarding memory consumption, an increase in size does not affect the memory consumption much, because the same number of elements and their partition assignments still have to be stored.

[Two line charts: partitioning runtime in minutes (left) and maximum memory in MB (right) over partition sizes 50 to 500.]

Figure 7.29: Comparison of memory and runtime dependent on the partition size for PES

To summarize, the optimal trade-off between quality and scalability is given by a partition size of 50. The optimal values are precision 0.27, recall 0.47, runtime 0.27 min, and a memory consumption of 356 MB. Still, the quality and especially the precision are lower than the values obtained by the original system; therefore we evaluate the assignment approaches in the following section.

7.6.2 Partition assignment

Based on our PES with a partition size of 50, we show precision, recall, F-Measure, runtime, and memory consumption for all assignment approaches depending on their input parameter, e.g. a threshold for threshold-based assignment. Thereby, we use the same configuration for the assigned partitions to be matched with the four baseline matchers: name, parent, children, and sibling. The threshold remains 0.3, with an average aggregation, but to reduce the number of wrong matches we do not apply a delta selection for the single match tasks. The final output mappings are then combined using a delta of 0.04¹⁴. The similarity of partitions is calculated via 5 representatives. The representatives are chosen based on their degree, preferring the maximum, as sketched below.
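A minimal sketch of this representative-based partition similarity, assuming an element-level similarity function and averaging over all representative pairs (the exact aggregation is not stated here):

    def representatives(partition, degree, n=5):
        # Pick the n elements with the highest degree as representatives.
        return sorted(partition, key=degree, reverse=True)[:n]

    def partition_similarity(p, q, degree, element_sim, n=5):
        # Compare only the representatives instead of all element pairs.
        rp, rq = representatives(p, degree, n), representatives(q, degree, n)
        return sum(element_sim(a, b) for a in rp for b in rq) / (len(rp) * len(rq))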

7.6.2.1 Threshold-based assignment

Investigating an increasing threshold from 0.1 up to 0.9, our first observation is that there is no change for values between 0.1 and 0.5 (see Fig. 7.30, left). The reason for this observation is the partition similarity, which exceeds these thresholds for partitions containing type information. At a threshold of 0.7 we can note an increase in precision, followed by a decrease at a threshold of 0.8.

¹⁴ Appendix B describes the aggregation and selection parameters in more detail.

[Two charts: precision, recall, and F-Measure (left) and memory in MB plus runtime in minutes (right) over thresholds 0.1 to 0.9.]

Figure 7.30: Quality, memory, and runtime results for different thresholds using threshold-based assignment

[Two charts: precision, recall, and F-Measure (left) and memory in MB plus runtime in minutes (right) over quantiles 0.1 to 0.9.]

Figure 7.31: Quality, memory, and runtime results for different quantiles using quantile-based assignment

This observation can be explained by the fact that for a high threshold a reduced number of partitions with low similarity are assigned to each other, which reduces the number of wrong matches, leading to a high precision. But if the threshold is set too high, correct matches are filtered out, decreasing the recall. The precision increase is accompanied by a decrease in recall, which follows from the same argument, i.e. a reduced number of partitions matched also reduces the number of matches found. The runtime and memory consumption are depicted in Fig. 7.30 on the right. Both decrease with an increase in the threshold, because fewer pairs are selected for matching and thus fewer resources are used. The best result in F-Measure (0.259) for our data is obtained with a threshold of t = 0.5. For this best quality, the numbers for memory and runtime are 632 MB and 2.08 min.

[Two charts: precision, recall, and F-Measure (left) and runtime plus memory (right) over one to four iterations.]

Figure 7.32: Quality, memory, and runtime results for different iterations using Hungarian assignment

7.6.2.2 Quantile-based assignment

The quantile-based assignment is evaluated by increasing the quantile and thus the fraction of partitions to be matched. The results are shown in Fig. 7.31, with our quality measurements on the left. It can be seen that with an increasing number of partitions matched the recall increases while the precision increases slightly; at higher quantile values both remain unchanged. The explanation is similar to the one for threshold-based assignment, i.e. an increasing number of partitions matched means an increasing number of elements for matching. In contrast to threshold-based assignment there is no static threshold, but rather a dynamic threshold ensuring a minimum number of partitions to be matched. Accordingly, the memory as well as the runtime increase with an increase in matched elements. The optimal F-Measure is given by a quantile of 0.4, i.e. 40% of all partitions are matched, leading to 559 MB in memory and 1.18 min in runtime. In contrast to threshold-based assignment, the quantile approach leads to more stable results and is therefore the better choice for arbitrary matching tasks. In comparison, quantile-based assignment shows a higher optimal precision and recall. The overall average F-Measure for quantile-based assignment yields 0.284, compared to 0.259 for threshold-based assignment. Both selection strategies are sketched below.
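A compact sketch of the two selection strategies over scored partition pairs (a simplification: the thesis speaks of the fraction of partitions, while this sketch ranks pairs):

    def threshold_assign(scored_pairs, t):
        # Static threshold: keep every partition pair scoring at least t.
        return [pair for pair, sim in scored_pairs if sim >= t]

    def quantile_assign(scored_pairs, q):
        # Dynamic threshold: keep the top q fraction of pairs by similarity,
        # guaranteeing a minimum number of assignments regardless of scores.
        ranked = sorted(scored_pairs, key=lambda ps: ps[1], reverse=True)
        return [pair for pair, _ in ranked[:max(1, int(q * len(ranked)))]]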

7.6.2.3 Hungarian

Aiming at optimal one-to-one partition assignments, the Hungarian approach does not require any parameters to be defined. However, the result of the algorithm may improve with the number of recalculations (iterations). The results for one up to four iterations are given in Fig. 7.32. The more iterations are executed, the worse precision, runtime, and memory consumption become.
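The thesis uses its own implementation; for illustration, the optimal one-to-one assignment can be computed with SciPy's Hungarian-method solver by negating the similarity matrix (hypothetical values):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # sim[i][j]: similarity of source partition i and target partition j.
    sim = np.array([[0.9, 0.1, 0.3],
                    [0.2, 0.8, 0.4]])
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    rows, cols = linear_sum_assignment(-sim)
    assignments = list(zip(rows, cols))  # optimal one-to-one partition pairs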

[Two charts: precision, recall, and F-Measure (left) and runtime plus memory (right) over thresholds 0.1 to 0.9.]

Figure 7.33: Quality, memory, and runtime results for different thresholds using generalized assignment

The maximal F-Measure of 0.298 can be obtained by executing one iteration. Thereby, the memory is 410 MB and the runtime is 0.65 min. The F-Measure is better than for threshold (0.259) and quantile (0.284). The average memory consumption of 410 MB is lower than for quantile (559 MB) and threshold (480 MB), because the Hungarian algorithm calculates optimal one-to-one assignments. Thereby, the memory used for the matching is only 181 MB, since the number of comparisons is reduced.

7.6.2.4 Generalized

The generalized assignment can be investigated w.r.t. the internal threshold used for the assignment of partitions. Thereby, we took values for the threshold between 0.1 and 0.9, in steps of 0.1. The results are depicted in Fig. 7.33, showing a largely insensitive behaviour: the memory consumption and runtime vary in a small interval around 430 MB and 0.7 min. The best F-Measure of 0.307 is given at a threshold of 0.2, with a memory consumption of 438 MB and a runtime of 0.69 min. Therefore, we chose this value for the comparison with the other assignment approaches. The result of our evaluation of the four assignment approaches is that threshold and quantile perform worse than the Hungarian and generalized assignments, which perform similarly. However, the generalized assignment has a higher recall of 0.42, compared to 0.36 for the Hungarian. In memory and runtime both assignment algorithms perform similarly; thus, based on our test data, neither of the two can be favoured. However, the generalized assignment may perform better in the case of many-to-many mappings, in contrast to the Hungarian. Therefore, we chose the generalized assignment for a comparison to the baseline in the subsequent discussion.

Measure | Baseline | Partitioning
Precision | 0.31 | 0.33
Recall | 0.54 | 0.43
F-Measure | 0.34 | 0.31
Runtime | 1.75 min | 0.69 min
Memory | 566 MB | 438 MB

Figure 7.34: Quality, memory, and runtime results for partition-based matching

7.6.3 Summary

We summarize our evaluation by comparing our baseline system MatchBox on the 15 large-scale metamodels with graph-based partitioning and generalized assignment for partition-based matching. Figure 7.34 depicts a comparison of the baseline to partition-based matching. We note that the overall runtime of a complete matching process can be reduced from 1.75 min to 0.69 min. The memory consumption can be decreased from 566 MB to 438 MB on average. These values were obtained for local matching without parallel matching. However, the concept we provide also allows matching in parallel, even though we did not implement it. To summarize, on average 23% less memory is consumed and 60% less runtime is needed. Still, a small loss in quality (0.11 in recall) has to be accepted, which is partly compensated by a precision gain of 0.02. Even though these numbers demonstrate the effectiveness, we also want to illustrate the scalability of our approach. Therefore, we changed our 15 large-scale examples and unfolded the graph by flattening it. That means we copied the content of type definitions to all referring classes (sketched below), thus raising the size from about 1,400 elements to about 8,000 elements. The results for quality, memory, and runtime are depicted in Fig. 7.35. The quality shows the same behaviour, improving the precision by 0.02 while the recall is decreased by 0.15. However, the memory consumption decreases from 806 MB to 517 MB on average, and even from 1.7 GB to 900 MB at the maximum of the saving. The runtime also improves from 7.92 min to 3.38 min. These numbers constitute savings of 36% in memory and 57% in runtime on average. We have thus demonstrated that our planar graph-based partitioning approach reduces both the runtime and the memory consumption of a matching system. However, we also noted a limitation of our approach: it should not be applied to metamodels with fewer than 200 elements, because of the partitioning overhead.
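A minimal sketch of this flattening step, assuming a simplified (hypothetical) representation where classes map to lists of referenced type names and types map to their contained elements:

    def flatten(classes, types):
        # classes: class name -> list of referenced type names
        # types:   type name  -> list of contained elements
        flat = {}
        for name, refs in classes.items():
            members = []
            for type_name in refs:
                # Copy the type's content into every referring class,
                # unfolding shared type definitions.
                members.extend(types[type_name])
            flat[name] = members
        return flat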

Measure | Baseline (flat) | Partitioning (flat)
Precision | 0.35 | 0.37
Recall | 0.56 | 0.41
F-Measure | 0.38 | 0.34
Runtime | 7.92 min | 3.38 min
Memory | 806 MB | 517 MB

Figure 7.35: Quality, memory, and runtime results for partition-based matching on flattened data

Applicability We can conclude that our approach is applicable to any meta- model with more than 200 elements.

7.7 Discussion of Results

Our evaluation investigated the improvements in quality achieved by our GED matcher and our mining matchers, as well as the improvements in scalability achieved by our partitioning and partition assignment approaches. However, the observations made relate to the test data used. As in all matching evaluations, it is simply impossible to have a complete set of test data covering every possible data model. Still, our test data set of 51 examples, with considerable diversity in structure and domain, shows a range that has not been applied in other evaluations. Having discussed the basis of our evaluation, we address in the following the applicability and limitations of our matching and partitioning approaches. Table 7.2 shows a summary of our evaluated concepts, their applicability, and their limitations.

7.7.1 Applicability

The first statement to be made is that we showed our concepts to be applicable in general. That means any metamodel can be processed by our algorithms, where planarity is enforced when not given. Thereby, for the real-world ESR data at most 0.01% of all edges are removed, which does not influence the result quality. Next, our graph edit distance matcher can be applied to any metamodel to increase the result quality, especially in terms of precision. It has to be noted that the GED naturally performs better the more the metamodels are structurally similar (source-target ratio). There are cases where the GED decreases the quality due to an input size mismatch and a low token overlap, but these may be detected by a pre-processing of the input metamodel sizes and token overlap.

Concept | Result quality | Applicability | Limitation
Tree-based matching | Baseline | Spanning tree of containment relations | Choice of potentially multiple trees
Planar graph edit distance | Increase | Metamodels | Size mismatch, token overlap needed, k-max degree only for ATL
Mining-based matching | Small increase (1%) | Non-identical metamodels | Small improvements, patterns need to exist
Planar partitioning | Decrease; gain in memory and runtime | Metamodels of size > 200 elements | Trade-off in recall and memory/runtime

Table 7.2: Results and limitations for graph-based matching and partitioning

In case of a detected mismatch in size ratio and token overlap, a fallback matcher can be applied. The same applies to our graph mining matchers, which may be applied to any metamodel. However, to show an increase in result quality two requirements have to be fulfilled: first, the metamodels need to contain patterns, and second, the metamodels should not be identical. Again, the second property can easily be checked by simple input metrics. The pattern existence can only be checked after the matcher has been applied, which may increase runtime. However, w.r.t. quality, if no pattern exists a fallback matcher may be used. Our observations show that, indeed, the mining matchers barely increase the overall result quality. A possible reason is the limitation of patterns by filtering; thus it may be worth investigating how the search space can be reduced before the mining, e.g. by light-weight matching techniques. Regarding our mining matchers, it should also be investigated how they can be applied in other ways, for instance by identifying patterns, mapping them, and reusing them in coverage approaches such as [132]. Another possible application is to apply pattern mining for metamodel decomposition and the identification of mappable parts. The partitioning we proposed is applicable to metamodels with more than 200 elements. The reason is that below that size, partitioning is more expensive than matching and does not improve the results; thus partitioning should not be applied. Regarding large-scale metamodels, we have shown that our partitioning can be applied, substantially reducing memory consumption and runtime, while improving precision with a small decrease in recall.

7.7.2 Limitations

One of the limitations of our structural approaches is the need for linguistic information. We observed that relying only on structural information leads to a decrease in result quality. However, combining linguistic and structural information leads to an increase in result quality, especially in precision. The next limitation concerns our k-max degree seed match approach for the GED. We observed that in the case of the ATL-zoo the result quality is increased using this approach. However, in the case of the ESR data set it does not influence the quality at all. The reason is the behaviour of the GED, which does not start at a given seed match, but rather reuses the seed similarity during computation. Beginning with the package element, the k-max degree elements are either never reached during the edit distance calculation or are already part of it. Therefore, to take advantage of them, the GED would have to start the calculation at the seed match elements. In our implementation the GED does not start at a given seed match, because it would then have to start at different points for multiple seeds, leading to multiple runs, which increases the runtime considerably. Additionally, the separate results would have to be merged, a problem not investigated by us. Therefore, we see this as a point for further work. Our graph-based partitioning leads to a loss in result quality, which shows the trade-off of decomposing a matching problem. One remedy is an increase in the partition size, which comes along with an increase in memory consumption and runtime. Having knowledge of the mapping application and its constraints, one may choose a suitable size for a given matching problem.

7.8 Summary

In this chapter we presented our fourth contribution, a comprehensive evaluation of our proposed graph-based matching and partitioning approaches. We defined as our success criteria an increase in result quality and support for scalability. The approach we followed is a comparison of a baseline system to the same system enhanced by our algorithms. Therefore, we first implemented a state-of-the-art matching system adapting tree-based schema matching techniques for graphs. Next, we described our test data sets from the MDA community and SAP. Based on these data sets we investigated the correctness and completeness of the results obtained using our approaches, i.e. precision and recall. For our graph-based matcher we observed a quality gain, while the graph-based mining only leads to minor improvements. Finally, we investigated the improvements in memory and runtime and the effect on quality of partitioning for large-scale metamodels, demonstrating that we successfully meet the success criteria.

Data sets Our test data sets consist of 20 metamodels and mappings from the MDA community (ATL) and 31 SAP message mappings (ESR). Both data sets are made of heterogeneous data from different domains, such as purchase orders, UML version mappings, etc. We applied three state of the art matching systems to the data sets, showing that the mapping result quality reaches an average F-Measure of only 0.48. These F-Measures show that our 51 mappings constitute a hard matching task. They also show the need for improvement in result quality.

Planar graph edit distance The planar graph edit distance matcher (GED) has been evaluated, showing an improvement in result quality of up to 0.45 (45%) in precision. On average our GED improves precision by 7%, resulting in an average increase in F-Measure of 4%. It can be applied to any graph, but its result quality decreases with an increasing size mismatch between the two inputs. Furthermore, we have shown that the best results are achieved by separating reference-based and inheritance-based matching.

Graph mining The graph mining matchers we proposed were able to identify patterns in both data sets and derive corresponding mappings. However, they only increased the overall result quality by about 1%. The main reason for the low gain is the filtering of patterns to achieve a reasonable runtime. In addition, only some metamodels contain design patterns or redundant information that can be used by our mining.

Planar graph-based partitioning We have shown that our planar partition-based matching (PES) reduces memory by about 23% and runtime by about 57%, while providing a gain in precision of 2% at a loss in recall of 11%. Thereby, the memory consumption is reduced by the PES, while the runtime improvements are achieved by selecting the partitions to be matched based on the generalized assignment. The minimal size for a metamodel to be partitioned is 200 elements, while the optimal size of a single partition using our PES is 50.

Chapter 8

Conclusion

In this chapter we complete this thesis by presenting a summary and conclusions of our work. We revisit our contributions and derive recommendations for data model development from our results. Finally, we discuss and point out further research directions for graph-based matching and applications of planarity.

8.1 Summary

In the beginning of our thesis we established the basis for our work by introducing the fundamental concepts of metamodel matching and graph theory. In metamodel matching we defined a metamodel to be a modelling paradigm based on object-oriented concepts to describe a domain of interest. As a means of integrating heterogeneous metamodels we described the foundations of metamodel matching. We gave an overview of established element-level and structure-level matching techniques. Subsequently, we defined graphs and the related concepts of graph isomorphism calculation. To circumvent the complexity problem of graph isomorphism calculation, we took advantage of the special graph property of planarity. Planarity allows for reduced complexity of the isomorphism algorithms. We also introduced basic concepts and the state of the art in graph mining and graph partitioning. Next, having defined the concepts used, we carried out the problem analysis of our thesis (Chap. 3). Applying the ZOPP methodology we defined our two cause problems of (1) insufficient correctness and completeness of matching results as well as (2) insufficient support for large-scale metamodel matching. We then performed a root-cause analysis for these two problems, identified the scope, and derived the following objectives: exploit structural information for matching, exploit redundant information for matching, and support matching of metamodels of arbitrary size. These three objectives impose requirements which lead to our research question

of how to improve metamodel matching by graph-based algorithms. We also presented our research approach, which is to analyse and adopt existing graph theory algorithms rather than to develop new algorithms from scratch. In Chap. 4 we investigated related work in the area of large-scale metamodel matching. Thereby, we studied the three areas of metamodel, ontology, and schema matching. We introduced related matching systems and algorithms in each of those areas, especially pointing out approaches which tackle matching quality and scalability. We compared the related matching systems and matching techniques, and concluded that most of the approaches use tree-based techniques or similarity flooding and none of them apply global graph-based matching with planarity. The scalability problem is tackled only by a few approaches in schema and ontology matching, but none of them applies structure-preserving partitioning and also none of them (explicitly) addresses different solutions for the partition assignment problem. We presented our first two contributions of structural graph-based matchers in Chap. 5. We proposed the adaptation of a planar Graph Edit Distance algorithm (GED). The GED calculates similarities between two metamodel graphs in quadratic runtime complexity. Thereby, the GED has been extended by us to take linguistic and structural information into account. To improve the GED result quality we proposed a seed mapping approach named k-max degree, which makes use of k user input mappings ranked according to their number of neighbours (degree). We further presented our graph mining matchers based on the idea of discovering reoccurring patterns in a metamodel for matching. We proposed two matchers, the design pattern matcher and the redundancy matcher. The design pattern matcher mines for patterns based on an incremental extension of a pattern. The design patterns are found by incrementally checking two metamodels in parallel for valid patterns. The redundancy matcher we proposed mines two metamodels independently for patterns. The mining is based on the principle of reducibility, reducing the most frequent edge type until no further reduction is possible. The resulting patterns for both metamodels are then mapped using our proposed planar GED. Next, we presented our third contribution, a graph-based partitioning algorithm for large-scale matching and a comparison of partition assignment algorithms (Chap. 6). We proposed to adapt an existing planar partitioning algorithm to solve the memory problems in the context of large-scale metamodel matching. The algorithm ensures partitions of similar size by an iterative splitting of an input metamodel. We extended the algorithm by proposing a merging phase based on coupling and cohesion, thus taking structural information into account. We also presented and discussed the threshold, quantile, Hungarian, and generalized assignment algorithms with the goal of runtime reduction. In our evaluation we studied the behaviour of those assignments and made observations on when to apply which one.

With our comprehensive evaluation (Chap. 7) of the graph-based metamodel matching and partitioning we provided the fourth contribution. We presented two data sets used for a comparison of our baseline system MatchBox to our algorithms. The test data consists of 51 gold-standard mappings and originates from the ATL-zoo and the Enterprise Service Repository. We investigated different metamodel graph representations with our test data and concluded that a graph based on containment relations performs best in quality for tree-based matching. Thereby, quality refers to the correctness (precision) and completeness (recall) of the mappings calculated. We observed a successful quality gain for the GED, while the graph-based mining only yielded minor improvements. Finally, we studied the memory consumption and runtime of our baseline system MatchBox with and without our planar partitioning algorithm. The results with partitioning were a substantial reduction in memory consumption and runtime with a small trade-off in quality. In the following section we revisit our contributions, provide the numbers of our evaluation, and discuss the results obtained in more detail.

8.2 Conclusion and Contributions

In the introduction of our thesis we claimed to provide four contributions. In our problem analysis we derived two main objectives as well as our main research question. In the following we want to show how our contributions fulfil the objectives defined. This leads to an answer to our research question by interpreting our evaluation insights. In this thesis the four contributions presented were:

C1 The planar graph edit distance matcher for improvements in correctness and completeness by metamodel graph similarity calculation,

C2 The design pattern matcher and the redundancy matcher, utilizing redundant information and design patterns for improvements in correctness and completeness,

C3 The planar graph-based partitioning for metamodel matching and a study of assignment algorithms for reductions in runtime and memory consumption, and

C4 A comprehensive real-world evaluation of our approaches based on 51 industrial large-scale gold-standard mappings.

In the following two paragraphs we compare objectives 2 and 3, which result from the main objective 1, an increase in matching quality and support for scalability (Sect. 3.2.2), to our contributions.

Increase correctness and completeness of matching results One main objective was to increase the correctness and completeness of matching results. The subobjective which contributed most to this main objective was 2.2, to exploit structural information for matching. The structural information was exploited by our planar GED, combining linguistic information and global graph-based structure. The resulting improvements were up to 45% in correctness (precision) and up to 27% in completeness (recall). On average the precision improved by 7%, the recall remained almost unchanged (-0.9%), and the F-Measure improved by 4%, which shows that the objective has been fulfilled.

The approach of subobjective 2.4, to exploit redundant information for matching, provided a smaller contribution. The subobjective was addressed by our design pattern matcher and our redundancy matcher. Both matchers show small improvements on average, with 2.2% in precision, 1.2% in recall, and 1.5% in F-Measure, and are limited to examples which exhibit patterns. Therefore, the two mining approaches contributed considerably less to objective 2.

Support matching metamodels of large-scale size This objective addresses the scalability support for matching. The objective has been split into three subobjectives: one targeting the memory consumption, one the runtime, and the last the quality trade-off.

Subobjective 3.1, a concept for managed memory consumption, requires us to reduce the memory consumption of the overall matching process. This is tackled by our planar partitioning. The partitioning produces similar-sized partitions under the restriction of an upper bound on the partition size. An independent matching of these partitions ensures a managed memory consumption. Our evaluation showed that with a partition size of 50 elements, on average 23% of memory is saved. We also presented a concept which allows for structure-preserving distributed matching of metamodels of arbitrary size; hence we consider the objective as fulfilled.

The second subobjective was 3.2, a concept for reducing matching runtime, which we pursued by a study of our assignment algorithms for partitions. We concluded that the generalized assignment algorithm is an optimal choice, with an average precision gain of 2%. By assigning partitions to be matched with each other we reduced the number of comparisons, which has been confirmed by our evaluation, demonstrating a runtime reduction of 57% on average compared to traditional approaches.

However, the memory reduction also reduces the quality, because of the reduced matching context. The third subobjective thus required to 3.3 optimize the trade-off between scalability and matching result quality. In our evaluation we determined the best parameters for partition size, assignment algorithms, etc., concluding that a partition size of 50, i.e. the average type size of our test data, together with the generalized assignment gives the best trade-off. Consequently, we also fulfilled this objective.

With the fulfilment of our two main objectives we reached our overall goal of increasing matching quality and supporting scalability. Moreover, we also enabled distributed matching with our planar partitioning, thus removing any memory boundaries. This allows matching large-scale metamodels in a cloud-based environment and thus in a distributed manner, e.g. as also under investigation in [50, 125]. In addition, we demonstrated the feasibility of planarity and thus answered our three hypotheses, which we will now discuss in more detail.

Research question and hypotheses The first hypothesis H1 stated that planar subgraph isomorphism calculation improves result quality. This hypothesis has been confirmed by our planar GED (C1) and our evaluation (C4). In almost all cases, the GED utilized global structural and linguistic information to improve the matching result quality, confirming H1.

The hypothesis H2 stated that mining on metamodel graphs improves result quality. This hypothesis has been confirmed for some cases by our evaluation (C4) and our two mining matchers (C2). In case of existing patterns the mining matchers can improve the result quality. However, in most cases no patterns were found and thus the result quality remained unchanged. Therefore, we can only partly confirm this hypothesis and see room for further work in mining-based matching.

The third and last hypothesis H3 stated that planar partitioning can provide support for large-scale metamodel matching. Our evaluation (C4) and the planar partitioning with assignment (C3) demonstrated that the hypothesis is indeed correct.

Overall, two of our hypotheses have been fully confirmed and one partly. We conclude that we reached our goal: we improved the matching result quality and provided support for large-scale metamodel matching. Therefore, our thesis answered our research question: "How can structural graph-based algorithms improve the correctness, completeness, runtime, and memory consumption of metamodel matching?"

Still, not all of our hypotheses could be confirmed and some limitations exist. In the following section we make recommendations for data model development that would allow exploiting the information contained in a metamodel for improved matching.

8.3 Recommendations for Data Model Development

Our evaluation pointed out that even when applying graph-based techniques the resulting quality may be insufficient. In addition, we presented a concept for scalability support, but with a trade-off in quality. Therefore, we want to present recommendations for data model design such that a resulting data model provides more information for improved matching.

Planar modelling To maximize the performance of the matching algorithms, a metamodel should fulfil our basic assumption of planarity. Although data model designers tend to create planar models, they do not ensure this property, because in some cases complex non-planar relations are needed. We recommend creating planar models to allow taking full advantage of the structural information of a metamodel.

Precise modelling Precise modelling refers to recommendations to prevent the modelling of untyped elements and to apply domain-specific modelling. The problem of untyped elements is present when element type semantics are modelled using string attributes, e.g. for complex types or enumerations. These implicit semantics are hard to match and hence should be avoided by modelling types as explicit classes and enumerations as such, thus being precise.

Imprecise modelling also concerns the modelling of redundant information rather than the modelling of types. This information may be used by our redundancy matcher, but we recommend avoiding such information. Instead, the modelling of types and design patterns should be applied, because such information is easy to match and allows for reuse.

The modelling of types introduces another problem, namely a mixed modelling of domain-specific and unspecific parts, which are hard to match. We recommend being as domain-specific (precise) as possible. The more specifically the information is modelled, the better a matcher can distinguish between elements and thus find similarities. This is supported by a separation of domain-specific and unspecific information via modules, because they allow for an explicit matching context.
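To make the recommendation concrete, the following is a minimal sketch using the Eclipse Modeling Framework's Ecore API (all class, attribute, and literal names are invented for illustration); it contrasts an imprecise, string-typed gender attribute with a precise variant using an explicit enumeration:

import org.eclipse.emf.ecore.*;

public class PreciseModelling {
    public static void main(String[] args) {
        EcoreFactory f = EcoreFactory.eINSTANCE;

        // Imprecise: type semantics hidden in a string attribute; the
        // male/female semantics stay implicit and are hard to match.
        EClass person = f.createEClass();
        person.setName("Person");
        EAttribute gender = f.createEAttribute();
        gender.setName("gender");
        gender.setEType(EcorePackage.Literals.ESTRING);
        person.getEStructuralFeatures().add(gender);

        // Precise: the same semantics as an explicit enumeration,
        // which a matcher can align with types of another metamodel.
        EEnum genderEnum = f.createEEnum();
        genderEnum.setName("Gender");
        EEnumLiteral male = f.createEEnumLiteral();
        male.setName("MALE");
        genderEnum.getELiterals().add(male);
        EEnumLiteral female = f.createEEnumLiteral();
        female.setName("FEMALE");
        female.setValue(1);
        genderEnum.getELiterals().add(female);

        EAttribute typedGender = f.createEAttribute();
        typedGender.setName("gender");
        typedGender.setEType(genderEnum);
    }
}

With the enumeration variant, a matcher can align Gender with a corresponding explicit type in another metamodel, whereas the string-typed variant leaves this information implicit.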

Modularization A problem we encountered is huge metamodels without any separation into modules. Their size leads to an immense search space for matching techniques as well as to a splitting of big connected types in case of partitioning; connected types are elements forming a composite element by containment relations. The partitioning of such types leads to worse results, because only parts of them are matched. Our recommendation is to separate a metamodel using hierarchical modules (packages). These modules should ideally contain 50 elements, complying with our partition size. The module hierarchy supports our partitioning due to the level selection approach, which aims at hierarchical models.

When assigning the partitions, the representatives we select for the partition similarity potentially provide insufficient information; in this case the quality of the matching is reduced due to missing or incorrect assignments. Consequently, we recommend modelling module representatives with a precise labelling, i.e. a module root element or elements which act as dedicated representatives and may lead to more correct assignments.

Reuse We also advocate the reuse of information, because computational runtime is needed if recurring information has to be identified multiple times, e.g. by our redundancy matcher, and the identified information may be incorrect. It would be more efficient to put these shared elements into modules and reuse these modules across metamodels. Such modules would allow for improved matching results, e.g. by reuse-based approaches such as [132].

The recommendations given are design guidelines for a metamodel designer in order to support matching. However, even if a perfect metamodel is given, the insufficient expressiveness of metamodels means that there is still further work to be done in matching. This point, among others, is outlined in the subsequent section on further work.

8.4 Further Work

We present directions for future research on metamodel matching in general, our matching and partitioning techniques, and algorithms making use of planarity.

Metamodel matching The quality problem in metamodel matching still exists, as is the case in the areas of schema and ontology matching. Therefore, it is still necessary either to change the matching process or to research matching techniques for improved result quality w.r.t. a certain domain or problem.

Nowadays, matching assumes a complete matching calculation and presentation to the user. An alternative matching approach could take the low result quality into account and instead present the top-n matches to the user. In the area of schema matching one approach on top-n matches has been proposed [42]. However, research on a guided process, possible incremental learning, and ways of visualizing is still needed.

Interestingly, our investigation of the state of the art also revealed that reuse-based and statistical matching have not been researched intensively. Consequently, they constitute one direction for further research; for instance, techniques from model matching [81, 82] may be adopted for that purpose.

Another open issue is the upper bound for the average F-Measure. A formal analysis of the matching problem and techniques may show or prove the upper bound for matching, for instance w.r.t. certain data set characteristics. If an upper bound can be formally determined, it is possible to provide users of matching tools with statements on the quality of the automatically calculated mappings.

This goes along with the next problem of a missing metamodel matching benchmark, which would allow for a tool and technique comparison. A benchmark could also provide hints for promising techniques for adoption and insights into an average matching quality. A possible benchmark could benefit from the ATL-zoo data as proposed by us in [151] and App. A.1, because it is publicly available.

Planar GED Our planar Graph Edit Distance matcher (GED) provides gains in quality but also shows drawbacks for size mismatches and a low token overlap between source and target metamodels. This problem may be tackled by a heuristic approach which applies a light-weight matching technique and starts at the most similar element. In addition, k-max degree seeds as starting points for the edit distance calculation are also promising. This procedure may be repeated for multiple k-max seeds or light-weight matches, which raises the question of how to merge the results of multiple GED runs; this has to be answered by further research.

We also observed the precision improvements of our GED. Since this indicates a filter behaviour of our matcher, it may be promising to integrate it into a two-staged process. The first stage calculates mappings based on local techniques, e.g. a name and parent matcher, while the second stage refines the results obtained. To achieve this, it has to be researched how to combine our GED with a matching process-based system, e.g. as in [123].

The GED we proposed is currently limited to metamodel matching, because it uses metamodel-specific type information for two separate runs (reference and inheritance). However, it is interesting to see how the GED behaves when applied to arbitrary models. Such work could yield promising results as outlined by us in [148], because traditional model matching techniques rely on labels and statistics rather than on structure [81].

Mining-based matching The mining-based matching has shown a less favourable matching quality in the average case. The underlying problem is the exploding search space and, as a consequence, runtime and memory consumption. Therefore, more research is needed on how to prune and decrease the initial search space. One possible approach is to restrict the search space to pre-defined patterns [130], but this restriction induces quality problems due to domain-specific patterns. Another solution to the search space problem is a pre-processing phase based on a preselection of relevant regions via partitioning or light-weight matchers. This allows for an increased pattern size due to the search space reduction, while still not limiting the space to pre-defined patterns.

It is further interesting to investigate how pattern mining can be applied for metamodel decomposition and mapping reuse, because the problem of mapping reuse approaches, e.g. as in [132], is the initial pattern repository. Using our mining, frequent patterns can be extracted and stored in a repository, and then a mapping reuse approach can be applied to derive new mappings.

Partition-based matching The most obvious missing work for our partitioning is an implementation of distributed matching based on it. So far we only provided the concept for distributed and parallel matching but did not implement it. If implemented, it is especially interesting to observe specific match configurations for certain partition pairings. So far we apply the same matchers and configuration to each partition pair. However, particular partition pairs may require particular matchers and configurations. Therefore, it should be studied how to derive the matchers and configurations for partition pairs. When such configurations are available, it may be interesting to research the distribution of matching tasks on multiple machines w.r.t. matcher configurations.

Another open point is the partition similarity calculation. The calculation time and assignment quality may be improved by applying an incremental approach. Instead of comparing each pair of representative elements, the process could be changed to perform assignments during the similarity calculation; for instance, partitions are assigned based on the maximal similarity. However, it needs to be researched how exactly to process the partition pairs. Also, the worst-case behaviour of identical partition similarities needs to be studied.

Planarity With our thesis we successfully applied graph algorithms based on planarity. We observed that the graph theory of the last century provides promising algorithms for a multitude of problems. Planarity as a special graph property turned out to be useful for metamodel matching. However, it may also be used for other algorithms and applications where it makes more information available for calculation, e.g. for software diagram layout [71] or any problem connected to shortest-path calculations under special assumptions, as in [80].

The end With our thesis we provide contributions to the field of metamodel matching in terms of quality and scalability. We encouraged the use of the graph structure and of the rather unused graph property of planarity. Although our partitioning solves the scalability problem via distributed matching, structural matching still does not solve the quality issues completely; thus the never-ending journey on matching quality continues.

Appendix A

Evaluation Data Import

A.1 ATL-zoo Data Import

We propose to use existing model transformations as gold standards for matching evaluation. Since an evaluation by model transformations requires a considerable amount of test data, one has to choose a model transformation language providing such data. Once the data is available, the transformations have to be imported into mappings serving as a base for comparison. The idea is to use a common mapping metamodel (format) to compare the result mapping of a matching system and the gold-standard. For the gold-standard mappings we used the Atlas Transformation Language (ATL) [70], which is a model transformation language providing the base for a considerable range of 103 transformations in the ATL-zoo1. Our approach is based on the matching component MatchBox, a matcher combination system as proposed in [151]. Figure A.1 depicts an example of an ATL-script import into the ATL-model and finally into the mapping metamodel of MatchBox.

1 http://www.eclipse.org/atl/zoo

(i) ATL-script:

rule Member2Male {
    from
        s : Families!Member (not s.isFemale())
    to
        t : Persons!Male (
            fullName <- s.firstName + ' ' + s.familyName
        )
}

(ii) ATL-model and (iii) Mapping Model: tree diagrams omitted; the ATL-model shows the Rule with its InPatternElement (Member), OutPatternElement (Male), and the fullName Binding whose value is an OperatorCallExp over the firstName and familyName VariableExps, while the Mapping Model shows the resulting Mapping with its Match between source and target elements.

Figure A.1: Example of an import from ATL

The example presents an ATL-transformation between two different metamodels describing a family. The metamodels can be described as follows: the first metamodel defines a family as Member(s), differentiating the gender by a boolean attribute; in contrast, the second metamodel separates the concepts by the explicit classes Male and Female. The names of persons are also modelled differently: the first uses a firstName and a familyName, the second uses a fullName.

Figure A.1 (i) depicts the ATL-script transformation between both metamodels; corresponding elements are highlighted in grey. The from and to parts define the (declarative) mapping between the classes Member and Male. The part defining the name mapping is the (imperative) mapping in the so-called binding. Figure A.1 (ii) shows the corresponding ATL-model and (iii) the representation in our internal mapping metamodel. In the following sections we describe the transformations between the ATL and MatchBox mapping model representations.

A.1.1 Import of metamodels

As discussed before, the imported metamodels' elements are necessary for the creation of a mapping model. Therefore, the metamodel importer is responsible for traversing the metamodel elements and publishing them in the internal mapping metamodel. In our implementation we applied the graph-based model utilized by MatchBox. The internal model is a subset of EMOF and therefore represents a typed graph that supports inheritance and associations as first-class concepts.
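As an illustration, a minimal sketch of such a traversal using the EMF API (on which MatchBox is built) might look as follows; the publish method is a hypothetical stand-in for creating nodes in the internal typed graph and is not part of the actual implementation:

import org.eclipse.emf.common.util.TreeIterator;
import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EPackage;

public class MetamodelTraversal {
    // Walk all elements of a metamodel (EPackage) in containment order.
    static void traverse(EPackage metamodel) {
        publish(metamodel);
        TreeIterator<EObject> it = metamodel.eAllContents();
        while (it.hasNext()) {
            EObject element = it.next();
            publish(element);
            // Inheritance is a first-class concept in the internal
            // model, so supertypes of classes are recorded as well.
            if (element instanceof EClass) {
                ((EClass) element).getESuperTypes().forEach(MetamodelTraversal::publish);
            }
        }
    }

    // Hypothetical: would create a node in the internal typed graph.
    static void publish(EObject element) {
        System.out.println("node: " + element);
    }
}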

A.1.2 Import of ATL-transformations

The import of an ATL-script is twofold. First, we transform the ATL-script into an ATL-model. Second, we apply a model transformation for creating a corresponding mapping model. The actual import is a mapping between the ATL metamodel as in Fig. A.2 and our mapping metamodel as in Fig. A.3. A model transformation (module) in ATL is a set of transformation rules. These transformation rules are the declarative part of ATL. ATL also offers an imperative part, i.e. a so-called binding of a transformation rule and a helper expression. Consequently, the following parts are considered:

• A Declarative mapping consists of an in- and an out-pattern defining corresponding classes.

• An Imperative mapping consists of bindings and helpers specifying expressions on corresponding attributes.

The internal mapping metamodel of MatchBox is simpler, because it is not executable. A mapping consists of a set of matches, and each match defines a correspondence between a set of source and target elements, as given in Fig. A.3.

(Class diagram omitted: it shows the ATL concepts Module, ModuleElement, Rule, MatchedRule, Query, Helper, InPattern, InPatternElement, OutPattern, OutPatternElement, and Binding together with their name attributes and multiplicities.)

Figure A.2: Excerpt of the ATL metamodel

(Class diagram omitted: it shows a Mapping containing Matches; a Match relates sourceElements and targetElements of type ModelElement, which carries name, documentation, minCard, maxCard, dataType, and isAttribute, and stores similarities.)

Figure A.3: Excerpt of the MatchBox model

The mapping between both models is shown in Table A.1, where we describe each ATL concept and the corresponding MatchBox mapping model concept, providing a description for further details or restrictions. Subsequently, we describe the mapping grouped into the declarative and the imperative mapping.

Declarative mapping An ATL-transformation is the starting point of our import. This transformation contains any number of rules. Each rule consists of an in- and an out-pattern; the in-pattern denotes the source element and the out-pattern the target element(s). Our mapping metamodel consists of a mapping followed by a set of matches, each relating source and target elements. Consequently, a mapping is created for the transformation, and each transformation rule's components lead to a match. Algorithm A.1 shows the implementation of this mapping in pseudo code.

Algorithm A.1 ATL-Rule Import
Require: Matl, Mmap ∈ Model, m ∈ Match
 1: resolve all alias and helper definitions
 2: for all r in Matl.rules do
 3:   for all out in r.outs do
 4:     m ← create m
 5:     m.source ← getS(r.in.type)
 6:     m.target ← getS(r.out.type)
 7:     importBinding(r.binding)
 8:     Mmap ← Mmap ∪ m
 9:   end for
10: end for

Matl and Mmap are the input ATL-model and the mapping model, and m is the output match returned by the rule import. Since ATL supports the definition of so-called helpers, these have to be resolved as well. This is done by using a helper's signature, i.e. the return type and the input parameters, which constitutes a mapping from the input to the output. Afterwards, all rules are traversed, creating a match which gets as source and target the corresponding in and out elements. Besides these explicit elements, a rule consists of a binding, which specifies constraints over the in- and out-patterns. For instance, a Binding specifies a mapping between attributes of the in- and out-patterns. Therefore, each binding has to be evaluated too.
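For illustration, the rule traversal of Algorithm A.1 could be sketched in Java as follows; Rule and Match are hypothetical simplifications of the actual ATL and MatchBox model types, not the real APIs:

import java.util.ArrayList;
import java.util.List;

class Rule {                                    // simplified ATL rule
    String inType;                              // type of the in-pattern element
    List<String> outTypes = new ArrayList<>();  // out-pattern element types
}

class Match {                                   // simplified MatchBox match
    String source, target;
    Match(String s, String t) { source = s; target = t; }
}

public class AtlRuleImport {
    // One match per combination of a rule's in-pattern element and
    // each of its out-pattern elements (cf. Algorithm A.1, lines 2-8).
    static List<Match> importRules(List<Rule> rules) {
        List<Match> mapping = new ArrayList<>();
        for (Rule r : rules) {
            for (String out : r.outTypes) {
                mapping.add(new Match(r.inType, out));
                // here the rule's bindings would be imported as well
            }
        }
        return mapping;
    }
}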

Table A.1: Mapping from ATL to the MatchBox mapping metamodel

ATL Element | MatchBox Element | Description
Module | Mapping | –
Module.elements | Mapping.matches | –
Rule | Match(es) | –
Rule.outPattern.elements | match.target | Each combination of in- and out-pattern is mapped onto a match
Rule.inPattern.elements | match.source | Each combination of in- and out-pattern is mapped onto a match
Binding.property | match.target | A binding's property is mapped onto a target of a match
Binding.value | match.source/target | The value is mapped depending on its type, as specified below
Nav.OrAttr.CallExp | match.source | By resolving the expression a match source is determined, whereas the binding's property is set as a target of the match
VariableExp | match.target | The resolved type of the VariableExp is mapped onto the match target, using the in-pattern as source element(s)
OperatorCallExp | match.target | Each operand is mapped onto a new match using the in-pattern sources, mapped according to the operand's type
SequenceExp | match.source | Each element of the sequence is mapped as defined before, using it as a source for a match, whereas the binding's property is set as a target of the match

Algorithm A.2 Binding Import (importB)
Require: b ∈ Binding, m ∈ Match
 1: v ← b.value
 2: if v isOfType NavigationOrAttributeCallExp then
 3:   m ← importB(v.source, v)   {resolve type definitions}
 4: else if v isOfType VariableExp then
 5:   m ← resolve variable type of v.referredVariable and get corresponding model element
 6: else if v isOfType OperatorCallExp then
 7:   m ← importB(v.value)   {resolve parameters and types}
 8: else if v isOfType SequenceExp then
 9:   for all e in v.value.elements do
10:     if e isOfType NavigationOrAttributeCallExp then
11:       m ← importB(e)
12:     end if
13:   end for
14: else
15:   {do nothing}
16: end if

Imperative mapping Our Binding evaluation is depicted in Algorithm A.2. A binding has a so-called value, which is an OclExpression. This OclExpression can have multiple types; however, we consider a limited range that covers the most common ones. A binding's value can be (1) a NavigationOrAttributeCallExp, for instance rule.name, (2) a VariableExp, e.g. A, (3) an OperatorCallExp, e.g. not, or (4) a SequenceExp, e.g. a + b. A binding also refers to a specific property, which has to be resolved according to the context provided by the out-pattern. In the following we describe each of these cases.

The (1) NavigationOrAttributeCallExp (lines 2 to 3) denotes the '.' operator, thus referring to a specific attribute which has to be resolved in order to constitute a mapping. Once the type of this attribute is available, it is mapped onto the target element determined by the binding property and a corresponding match is created. Referring to our example in Fig. A.1, s.firstName denotes a NavigationOrAttributeCallExp, whereas firstName is a VariableExp.

A (2) VariableExp (lines 4 to 5) is imported by resolving the type of the variable this expression refers to. The match is created from this type for the previously determined target. In our example in Fig. A.1, firstName and familyName are of type VariableExp.

Importing an (3) OperatorCallExp (lines 6 to 8) leads to a handling of the concatenation operation; all other operations are ignored. All operands are resolved regarding their type, and for each operand a match is created, with the match target being the one determined at the beginning.

Figure A.1 also depicts an example of a (4) SequenceExp, which is the + in s.firstName + ' ' + s.familyName. This results in an evaluation of both participating name expressions. We implemented the import recursively, because these expressions can be nested, e.g. the import of a VariableExp can trigger the import of another VariableExp.

A.2 ESR Data Import

Below we provide the EMFText grammar used for the parser generation. This parser imports the ESR data into an internal representation, which in turn is transformed into the internal mapping metamodel of MatchBox.

SYNTAXDEF es FOR

START Mapping

OPTIONS {
    reloadGeneratorModel = "true";
    generateCodeFromGeneratorModel = "true";
    usePredefinedTokens = "true";
    memoize = "true";
}

TOKENS {
    DEFINE NAMEPATH $ ('/' ('0'..'9'|'a'..'z'|'A'..'Z'|'_'|'/'|'@'|'['|']'|'.'|':'|'-')*) $;
    DEFINE EXPRESSION $ (('0'..'9'|'a'..'z'|'A'..'Z'|'_'|'\''|'@'|':')
        ('0'..'9'|'a'..'z'|'A'..'Z'|'_'|'/'|'@'|'\''|'.'|':'|'-'|'='|'{'|'}'|'?'|'@')*) $;
    DEFINE OPERATOR $ (('a'..'z'|'0'..'9'|'A'..'Z')* '(') $;
}

RULES {
    Mapping ::= matches*;
    Matches ::= target "=" source;
    ConstExpression ::= ("getSendSys()" | "sender()" | "const(value=" (value[EXPRESSION])? ")");
    ExpressionList ::= expression ("," tail)?;
    AttributeList ::= ("," | "%WHITESPACE")? attribute[EXPRESSION] ("," tail)?;
    SourceSchemaElement ::= (namepath[NAMEPATH] | "null");
    Target ::= namepath[NAMEPATH];

    ConcatExpression ::= "concat(" elementList? ", delimeter=" (delimiter[EXPRESSION])? ")";
    ConcatLinesExpression ::= "concatLines(" elements ("," options)? ")";
    TrimExpression ::= "trim(" source ")";
    SubstringExpression ::= "substring(" source ("," options)? ")";
    SplitTextExpression ::= "splitText(" elements ("," options)? ")";
    ToUppercaseExpression ::= "toUpperCase(" source ("," options)? ")";

    TransformDateExpression ::= "TransformDate(" source "," options ")";
    CurrentDateExpression ::= "currentDate(" options ")";

    CreateIfExpression ::= "createIf(" condition ")";

    FixedValueExpression ::= "FixValues(" source ("," table)? ")";
    CopyPerValueExpression ::= "CopyPerValue(" elements ("," options)? ")";
    SplitByValueExpression ::= "SplitByValue(" source ("," options)? ")";
    ReplaceValueExpression ::= "replaceValue(" source ("," options)? ")";
    CopyValueExpression ::= "CopyValue(" source ("," options)? ")";

    SumExpression ::= "sum(" source ("," options)? ")";
    CountByTemplateExpression ::= "countByTemplate(" elementList ("," options)? ")";
    CountExpression ::= "count(" source ("," options)? ")";
    CounterExpression ::= "counter(" options ")";
    GetPredecessorExpression ::= "GetPredecessor(" elements ("," options)? ")";

    RemoveContextExpression ::= "removeContexts(" source ")";
    CollapseContext ::= "collapseContexts(" source ")";

    ExistsExpression ::= "exists(" source ")";
    AndExpression ::= "and(" source1 "," source2 ")";
    OrExpression ::= "or(" source1 "," source2 ")";
    NotExpression ::= "not(" source ")";
    IfExpression ::= "iF(" elements ("," options)? ")";
    IfWithoutElseExpression ::= "ifWithoutElse(" elements ("," options)? ")";
    StringEqualsExpression ::= "stringEquals(" leftSide "," rightSide ")";
    EqualsAExpression ::= "equalsA(" leftSide "," rightSide ")";
    CheckAllTrueExpression ::= "checkAllTrue(" elements ("," options)? ")";
    CheckIfEmptyExpression ::= "checkIfEmpty(" source ("," options)? ")";
    CheckTrueExpression ::= "CheckTrue(" source ("," options)? ")";
    CheckForTrueExpression ::= "checkForTrue(" elements ("," options)? ")";
    ConvertToStringExpression ::= "convertToString(" (source ",")? options ")";
    GenerateItemIDExpression ::= "generateItemId(" source ("," options)? ")";
    MapWithDefaultExpression ::= "mapWithDefault(" source ("," options)? ")";
    ValueMapExpression ::= "valuemap(" source ("," options)? ")";
    UseOneAsManyExpression ::= "useOneAsMany(" elements ("," options)? ")";
    AddFormatExpression ::= "addFormat(" elements ("," options)? ")";
    FormatByExampleExpression ::= "formatByExample(" elements ("," options)? ")";
    UnknownOperationExpression ::= operation[OPERATOR] elements ("," options)? ")";
}

Appendix B

MatchBox Architecture

B.1 Architecture

Figure B.1 shows an overview of MatchBox's architecture. As described in Sec. 7.2, it uses an exchangeable matching core, which includes a set of matching components (matchers) operating on the information provided by an internal representation of metamodels. Additionally, it contains a combination component providing means for an aggregation and selection of results. The core, configuration, and combination components are implemented by the AMC.

The matching core operates on the repository, which contains all metamodels, provided models (instances), and mappings. The mapping importer implements an import from the ATL transformation language, the Ontology Alignment API (OAA) [34], and the ESR mapping format to the mapping model of the AMC. Thereby, the internal data model is a typed graph called Genie; for a detailed description see the master's thesis [67]. The configuration component implements the configuration of the aggregation and selection strategies. The metamodel importer transforms metamodels into the internal data model. The data model of the repository imposed by the AMC matchers is introduced in the subsequent section. We have implemented the MatchBox prototype using Java and the Eclipse Modeling Framework (EMF) [109]; MatchBox's repository is consequently EMF-based, while the core and the remaining components are implemented in plain Java.

B.2 Combination Methods

• Aggregation The aggregation reduces the similarity cube to a matrix by aggregating all matchers' result matrices into one. This aggregation is defined by four strategies: Max, Min, Average, and Weighted. The Max strategy is an optimistic one, selecting the highest similarity value calculated by any matcher. In contrast, the Min strategy selects the lowest value. Average evens out the matcher results by calculating their mean. To account for a different importance of matcher results, Weighted computes the weighted sum of the results according to user-defined weights. (A combined sketch of the aggregation and selection strategies follows this list.)

(Architecture diagram omitted: Figure B.1 shows the matching core with its matcher library, combination, configuration, and evaluation components, operating on a repository of metamodels, models, and mappings that is filled by the metamodel importer and the mapping importers for ATL and the OA API.)

Figure B.1: Architecture of MatchBox

• Selection The selection chooses possible matches from the aggregated matrix according to a defined strategy: Threshold, MaxN, or MaxDelta. Threshold selects all matches exceeding a particular threshold. MaxN returns the N matches with the highest similarity values for a source and target element. Finally, MaxDelta chooses, first, the match with the highest similarity value and, second, all candidates within a certain relative range given by the value for delta; the range is defined as the interval [Max − Max · Delta, Max]. Furthermore, the selection strategies can be combined, such as Threshold and MaxDelta.

• Direction The direction is dedicated to a ranking of matched elements according to their similarity. In the Forward strategy, elements of the larger metamodel (in total number of elements) are selected for each element of the smaller one. The Backward strategy is the inverse of the Forward strategy. Both strategies can be combined in the Both strategy, resulting in independence of the metamodel size. Consequently, a match then has to have a similarity value for each direction, forward and backward.

• Combined similarity Matchers which use other matchers, e.g. the children or sibling matcher, need means to combine the similarity values. This is covered by providing two strategies: Average

and Dice. Average divides the sum of the similarity values of all matches by the number of matches. The more optimistic approach, returning higher similarity values, is Dice, which uses the Dice coefficient [12]; it is computed by dividing the number of matchable elements by the count of all elements.
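As referenced in the list above, the following compact Java sketch illustrates the Average aggregation and the MaxDelta selection; it is a simplified, hypothetical rendering, not the actual MatchBox implementation:

import java.util.ArrayList;
import java.util.List;

public class CombinationSketch {
    // Aggregation (Average): reduce the similarity cube
    // (matcher x source x target) to a single matrix.
    static double[][] aggregateAverage(double[][][] cube) {
        int src = cube[0].length, tgt = cube[0][0].length;
        double[][] m = new double[src][tgt];
        for (double[][] matcher : cube)
            for (int i = 0; i < src; i++)
                for (int j = 0; j < tgt; j++)
                    m[i][j] += matcher[i][j] / cube.length;
        return m;
    }

    // Selection (MaxDelta): per source element, keep the best match
    // and all candidates within the interval [max - max*delta, max].
    static List<int[]> selectMaxDelta(double[][] m, double delta) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < m.length; i++) {
            double max = 0;
            for (double v : m[i]) max = Math.max(max, v);
            for (int j = 0; j < m[i].length; j++)
                if (max > 0 && m[i][j] >= max - max * delta)
                    matches.add(new int[] { i, j });
        }
        return matches;
    }
}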

Bibliography

[1] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Data Bases (VLDB'94), 1994.
[2] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006.
[3] Lyudmil Aleksandrov, Hristo Djidjev, Hua Guo, and Anil Maheshwari. Partitioning planar graphs with costs and weights. Journal of Experimental Algorithmics, 11, 2007.
[4] Michael Altenhofen, Thomas Hettel, and Stefan Kusterer. OCL support in an industrial environment. In 6th OCL Workshop at the UML/MoDELS, 2006.
[5] Cecilia R. Aragon, Catherine Schevon, Lyle A. McGeoch, and David S. Johnson. Optimization by simulated annealing: An experimental evaluation; part I, graph partitioning. Operations Research, 37:865–892, 1989.
[6] Uwe Aßmann, Steffen Zschaler, and Gerd Wagner. Ontologies for Software Engineering and Software Technology, chapter Ontologies, Metamodels, and the Model-Driven Paradigm, pages 249–273. 2006.
[7] Philip A. Bernstein, Alon Y. Levy, and Rachel A. Pottinger. A vision for management of complex models. SIGMOD Record, 29:55–63, 2000.
[8] Philip A. Bernstein, Sergey Melnik, Michalis Petropoulos, and Christoph Quix. Industrial-strength schema matching. SIGMOD Record, 33:38–43, 2004.
[9] Carsten Bönnen and Mario Herger. SAP NetWeaver Visual Composer. SAP Press, 2007.
[10] Cédric Brun and Alfonso Pierantonio. Model differences in the Eclipse Modeling Framework. Upgrade, Special Issue on Model-Driven Software Development IX, pages 29–34, 2008.


[11] Jiazhen Cai, Xiaofeng Han, and Robert Endre Tarjan. An O(m log n)-time algorithm for the maximal planar subgraph problem. SIAM Journal on Computing, 22:1142–1162, 1993.
[12] Silvana Castano, Valeria De Antonellis, Maria G. Fugini, and Barbara Pernici. Conceptual schema analysis: techniques and applications. ACM Transactions on Database Systems, 23(3):286–333, 1998.
[13] Dirk G. Cattrysse and Luk N. Van Wassenhove. A survey of algorithms for the generalized assignment problem. European Journal of Operational Research, 60:260–272, 1992.
[14] Deepayan Chakrabarti and Christos Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Computing Surveys, 38:2, 2006.
[15] Yves Chiricota, Fabien Jourdan, and Guy Melançon. Software components capture using graph clustering. In Proceedings of the 11th IEEE International Workshop on Program Comprehension (IWPC'03), 2003.
[16] Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1367–1372, 2004.
[17] Isabel F. Cruz, Flavio Palandri Antonelli, Cosmin Stroe, Ulas C. Keles, and Angela Maduko. Using AgreementMaker to align ontologies for OAEI 2010. In Proceedings of the 5th International Workshop on Ontology Matching (OM'10), 2010.
[18] Jose de Sousa, Denivaldo Lopes, Daniela Barreiro Claro, and Zair Abdelouahab. A step forward in semi-automatic metamodel matching: Algorithms and tool. In Proceedings of the 11th International Conference on Enterprise Information Systems (ICEIS'09), 2009.
[19] Peter J. Dickinson, Horst Bunke, Arek Dadej, and Miro Kraetzl. On graphs with unique node labels. Graph-Based Representations in Pattern Recognition, pages 13–23, 2003.
[20] Reinhard Diestel. Graph Theory. Graduate Texts in Mathematics. Springer, 2005.
[21] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.
[22] Hristo Djidjev. A scalable multilevel algorithm for graph clustering and community structure detection. Algorithms and Models for the Web-Graph, pages 117–128, 2007.

[23] Hong Hai Do. Schema Matching and Mapping-based Data Integration. PhD thesis, University of Leipzig, 2006.
[24] Hong-Hai Do and Erhard Rahm. COMA – a system for flexible combination of schema matching approaches. In Proceedings of the International Conference on Very Large Data Bases (VLDB'02), pages 610–621, 2002.
[25] Hong-Hai Do and Erhard Rahm. Matching large schemas: Approaches and evaluation. Information Systems, 32(6):857–885, 2006.
[26] Stijn Van Dongen. A cluster algorithm for graphs. Technical report, CWI: Centre for Mathematics and Computer Science, 2000.
[27] Prashant Doshi, Ravikanth Kolli, and Christopher Thomas. Inexact matching of ontology graphs using expectation-maximization. Web Semantics: Science, Services and Agents on the World Wide Web, 7:90–106, 2009.
[28] Christian Drumm. Improving schema mapping by exploiting domain knowledge. PhD thesis, University of Karlsruhe, 2008.
[29] Christian Drumm, Matthias Schmitt, Hong-Hai Do, and Erhard Rahm. QuickMig: automatic schema matching for data migration projects. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM'07), 2007.
[30] Hartmut Ehrig, Karsten Ehrig, Ulrike Prange, and Gabriele Taentzer. Formal integration of inheritance with typed attributed graph transformation for efficient VL definition and model manipulation. IEEE Symposium on Visual Languages and Human-Centric Computing, pages 71–78, 2005.
[31] Leonhard Euler. Solutio problematis ad geometriam situs pertinentis. Comment. Acad. Sci. U. Petrop., 8:128–140, 1736. Reprinted in Opera Omnia Series Prima, Vol. 7, pp. 1–10, 1766.
[32] J. Euzenat, A. Ferrara, L. Hollink, A. Isaac, C. Joslyn, V. Malaisé, C. Meilicke, A. Nikolov, J. Pane, M. Sabou, et al. Results of the ontology alignment evaluation initiative 2009. In Proceedings of the 4th International Workshop on Ontology Matching (OM'09), 2009.
[33] Jérôme Euzenat and Pavel Shvaiko. Ontology Matching. Springer, 2007.
[34] Jérôme Euzenat and Petko Valtchev. Similarity-based ontology alignment in OWL-Lite. In Proceedings of the European Conference on Artificial Intelligence (ECAI'04), 2004.

[35] Joerg Evermann. Theories of meaning in schema matching: An exploratory study. Information Systems, 34:28–44, 2009.
[36] Marcos Didonet Del Fabro, Jean Bézivin, and Patrick Valduriez. Weaving models with the Eclipse AMW plugin. In Eclipse Summit Europe 2006, 2006.
[37] Marcos Didonet Del Fabro and Patrick Valduriez. Semi-automatic model integration using matching transformations and weaving models. In Proceedings of the 2007 ACM Symposium on Applied Computing (SAC'07), 2007.
[38] Jean-Rémy Falleri, Marianne Huchard, Mathieu Lafourcade, and Clémentine Nebut. Metamodel matching for automatic model transformation generation. In Proceedings of the 11th International Conference on Model Driven Engineering Languages and Systems (MoDELS'08), 2008.
[39] Christiane Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, May 1998. http://wordnet.princeton.edu.
[40] Gary William Flake, Robert Endre Tarjan, and Kostas Tsioutsiouliklis. Graph clustering and minimum cut trees. Internet Mathematics, 1(4):385–408, 2004.
[41] Pasi Fränti, Olli Virmajoki, and Ville Hautamäki. Fast PNN-based clustering using k-nearest neighbor graph. In Proceedings of the IEEE International Conference on Data Mining (ICDM'03), 2003.
[42] Avigdor Gal. Managing uncertainty in schema matching with top-k schema mappings. Journal on Data Semantics, 6, 2006.
[43] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[44] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis & Applications, pages 1–17, 2010.
[45] Kelly Garcés, Frédéric Jouault, Pierre Cointe, and Jean Bézivin. Managing model adaptation by precise detection of metamodel changes. In Proceedings of the 5th European Conference on Model-Driven Architecture Foundations and Applications (ECMDA-FA'09), 2009.
[46] K. J. Garcés-Pernett. Une approche pour l'adaptation et l'évaluation de stratégies génériques d'alignement de modèles. PhD thesis, 2010.

[47] Michael Randolph Garey and David Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. 1979.
[48] Michelle Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99:7821–7826, 2002.
[49] Steve Gregory. Local betweenness for finding communities in networks. Technical report, University of Bristol, 2008.
[50] Anika Gross, Michael Hartung, Toralf Kirsten, and Erhard Rahm. On matching large life science ontologies in parallel. In Proceedings of Data Integration in the Life Sciences (DILS'10), 2010.
[51] Giancarlo Guizzardi. On ontology, ontologies, conceptualizations, modeling languages, and (meta)models. Frontiers in Artificial Intelligence and Applications, Databases and Information Systems IV, pages 18–39, 2007.
[52] Magnús Halldórsson and Jaikumar Radhakrishnan. Greed is good: approximating independent sets in sparse and bounded-degree graphs. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, 1994.
[53] Seddiqui Hanif and Masaki Aono. Anchor-Flood: Results for OAEI 2009. In Proceedings of the Workshop on Ontology Matching (OM'09), 2009.
[54] Seddiqui Hanif and Masaki Aono. An efficient and scalable algorithm for segmented alignment of ontologies of arbitrary size. Journal of Web Semantics, 7:344–356, 2009.
[55] Erez Hartuv and Ron Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175–181, 2000.
[56] Jonathan Hayes and Claudio Gutiérrez. Bipartite graphs as intermediate model for RDF. The Semantic Web – ISWC'04, pages 47–61, 2004.
[57] Florian Heidenreich, Jendrik Johannes, Sven Karol, Mirko Seifert, and Christian Wende. Derivation and refinement of textual syntax for models. In Proceedings of the 5th European Conference on Model-Driven Architecture Foundations and Applications (ECMDA-FA'09), 2009.
[58] Thomas Heinze. Matching algorithms on planar/reducible graphs for meta-model matching. Technical University (TU) Dresden, 2009. Minor thesis.

[59] Thomas Heinze. Graph-based clustering for large-scale metamodel matching. Master's thesis, Technical University (TU) Dresden, 2010.
[60] Stefan Helming and Michael Göbel. ZOPP Objectives-oriented Project Planning. Deutsche Gesellschaft für Technische Zusammenarbeit (GTZ) GmbH, 1997.
[61] John E. Hopcroft and Robert Endre Tarjan. Efficient planarity testing. Journal of the ACM, 21(4), 1974.
[62] John E. Hopcroft and J. K. Wong. Linear time algorithm for isomorphism of planar graphs (preliminary report). In STOC'74, 1974.
[63] Wei Hu, Ningsheng Jian, Yuzhong Qu, and Yanbing Wang. GMO: A graph matching for ontologies. In Proceedings of the K-Cap 2005 Workshop on Integrating Ontologies, 2005.
[64] Wei Hu, Yuanyuan Zhao, Dan Li, Gong Cheng, Honghan Wu, and Yuzhong Qu. Falcon-AO: Results for OAEI 2007. In Proceedings of the 2nd International Workshop on Ontology Matching (OM'07), 2007.
[65] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the IEEE International Conference on Data Mining (ICDM'03), 2003.
[66] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. Principles of Data Mining and Knowledge Discovery, pages 13–23, 2000.
[67] Petko Ivanov. Data model enhancement of a matching system based on comparative analysis of related tools. Master's thesis, Technical University (TU) Dresden, 2010.
[68] Yves R. Jean-Mary and Mansur R. Kabuka. ASMOV: Results for OAEI 2010. In Proceedings of the 5th International Workshop on Ontology Matching (OM'10), 2010.
[69] Ningsheng Jian, Wei Hu, Gong Cheng, and Yuzhong Qu. Falcon-AO: Aligning ontologies with Falcon. In K-Cap 2005 Workshop on Integrating Ontologies, 2005.
[70] Frédéric Jouault, Freddy Allilaire, Jean Bézivin, and Ivan Kurtev. ATL: A model transformation tool. Science of Computer Programming, 72:31–39, 2008.
[71] Michael Jünger and Petra Mutzel. Maximum planar subgraphs and nice embeddings: Practical layout tools. Algorithmica, 16:33–59, 1996.

[72] Gerti Kappel, Elisabeth Kapsammer, Horst Kargl, Gerhard Kramler, Thomas Reiter, Werner Retschitzegger, Wieland Schwinger, and Manuel Wimmer. On models and ontologies – a layered approach for model-based tool integration. In Proceedings of Modellierung 2006, 2006.
[73] Gerti Kappel, Horst Kargl, Gerhard Kramler, Andrea Schauerhuber, Martina Seidl, Michael Strommer, and Manuel Wimmer. Matching metamodels with semantic systems – an experience report. In Proceedings of the Workshops at BTW 2007, 2007.
[74] George Karypis, Eui-Hong Han, and Vipin Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32:68–75, 1999.
[75] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20:359, 1999.
[76] Jean François Djoufak Kengue, Jérôme Euzenat, and Petko Valtchev. OLA in the OAEI 2007 evaluation contest. In Proceedings of the International Workshop on Ontology Matching (OM'07), 2007.
[77] David Kensche, Christoph Quix, Xiang Li, and Yong Li. GeRoMeSuite: a system for holistic generic model management. In Proceedings of the International Conference on Very Large Data Bases (VLDB'07). Demonstration, 2007.
[78] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49:291–307, 1970.
[79] Nikhil S. Ketkar, Lawrence B. Holder, and Diane J. Cook. Subdue: compression-based frequent pattern discovery in graph data. In Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, 2005.
[80] Philip Klein, Satish Rao, Monika Rauch, and Sairam Subramanian. Faster shortest-path algorithms for planar graphs. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, 1994.
[81] Dimitrios S. Kolovos, Davide Di Ruscio, Alfonso Pierantonio, and Richard F. Paige. Different models for model matching: An analysis of approaches to support model differencing. In Proceedings of the 2009 ICSE Workshop on Comparison and Versioning of Software Models (CVSM'09), 2009.

[82] Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69:197–210, 2010.
[83] Mandy Krimmel and Joachim Orb. SAP NetWeaver Process Integration (2nd Edition). SAP Press, 2010.
[84] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. In Proceedings of the American Mathematical Society, volume 7, pages 48–50, 1956.
[85] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[86] Jacek P. Kukluk, Lawrence B. Holder, and Diane J. Cook. Algorithm and experiments in testing planar graphs for isomorphism. Graph Algorithms and Applications, 5:313–356, 2003.
[87] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Proceedings of the IEEE International Conference on Data Mining (ICDM'01), 2001.
[88] Michihiro Kuramochi and George Karypis. GREW – a scalable frequent subgraph discovery algorithm. In Proceedings of the IEEE International Conference on Data Mining (ICDM'04), 2004.
[89] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11:243–271, 2005.
[90] C. Kuratowski. Sur le problème des courbes gauches en topologie. Fundamenta Mathematicae, 15:271–283, 1930.
[91] Ivan Kurtev, Mehmet Aksit, and Jean Bézivin. A global perspective on technological spaces. Available at: http://wwwhome.cs.utwente.nl/~kurtev/TechnologicalSpaces.doc, 2002.
[92] Bach Thanh Le and Rose Dieng-Kuntz. A graph-based algorithm for alignment of OWL ontologies. In Proceedings of Web Intelligence, 2007.
[93] Jong Kook Lee, Seung Jae Seung, Soo Dong Kim, Woo Hyun, and Dong Han Han. Component identification method with coupling and cohesion. In Proceedings of the Asia-Pacific Software Engineering Conference, 2001.
[94] Yoonkyong Lee, Mayssam Sayyadian, Anhai Doan, and Arnon Rosenthal. Tuning schema matching software using synthetic scenarios. In Proceedings of the International Conference on Very Large Data Bases (VLDB'05), 2005.

[95] Juanzi Li, Jie Tang, Yi Li, and Qiong Luo. RiMOM: A dynamic multi-strategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering, pages 1218–1232, 2008.
[96] Richard J. Lipton and Robert E. Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, pages 177–189, 1979.
[97] Henrik Lochmann. HybridMDSD: Multi-Domain Engineering with Model-Driven Software Development using Ontological Foundations. Technical University (TU) Dresden, 2009.
[98] Denivaldo Lopes, Slimane Hammoudi, and Zair Abdelouahab. Schema matching in the context of model driven engineering: From theory to practice. In Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering (SCSS'06), 2006.
[99] Eugene M. Luks. Isomorphism of graphs of bounded valence can be tested in polynomial time. In FOCS, pages 42–49, 1980.
[100] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with Cupid. In The VLDB Journal, pages 49–58, 2001.
[101] Ming Mao, Yefei Peng, and Michael Spring. A harmony based adaptive ontology mapping approach. In Proceedings of the 2008 International Conference on Semantic Web & Web Services (SWWS'08), 2008.
[102] Marc Wörlein, Alexander Dreweke, Thorsten Meinl, Ingrid Fischer, and Michael Philippsen. EDGAR: the embedding-based graph miner. In Proceedings of the International Workshop on Mining and Learning with Graphs, 2006.
[103] Marc Wörlein, Thorsten Meinl, Ingrid Fischer, and Michael Philippsen. A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In Proceedings of the Conference on Knowledge Discovery in Databases (PKDD'05), 2005.
[104] Silvano Martello and Paolo Toth. An algorithm for the generalized assignment problem. In Operational Research, 1981.
[105] Silvano Martello and Paolo Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, 1990.

[106] Takashi Matsuda, Tadashi Horiuchi, Hiroshi Motoda, and Takashi Washio. Extension of graph-based induction for general graph structured data. In Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications, 2000.
[107] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of the 18th International Conference on Data Engineering (ICDE'02), 2002.
[108] Sergey Melnik, Erhard Rahm, and Philip A. Bernstein. Rondo: A programming platform for generic model management. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD'03), 2003.
[109] Ed Merks, David Steinberg, Frank Budinsky, and Marcelo Paternostro. Eclipse Modeling Framework. Addison-Wesley, second edition, 2009.
[110] Bruno T. Messmer and Horst Bunke. Efficient error-tolerant subgraph isomorphism detection. In Shape, Structure and Pattern Recognition, pages 231–240, 1994.
[111] Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. Geometric separators for finite-element meshes. SIAM Journal on Scientific Computing, 19:386, 1998.
[112] Peter Mucha. Musterbasiertes Abbilden von Metamodellen. Master's thesis, Technical University (TU) Dresden, 2010.
[113] Michel Neuhaus and Horst Bunke. An error-tolerant approximate matching algorithm for attributed planar graphs and its application to fingerprint classification. In Proceedings of the 10th International Workshop on Structural and Syntactic Pattern Recognition, 2004.
[114] Mark E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69, 2004.
[115] Object Management Group (OMG). Business Process Modeling Notation (BPMN) Specification, February 2006. Final Adopted Specification, OMG document dtc/06-02-01, http://www.omg.org/docs/dtc/06-02-01.pdf.
[116] Object Management Group (OMG). OMG Systems Modeling Language (OMG SysML) Specification, Version 1.0, 2007. OMG document ptc/2007-02-03.

[117] Per olof Fjallstr¨ om.¨ Algorithms for graph partitioning: A survey. Com- puter and Information Science, 3(10):10, 1998.

[118] OMG. Model driven architecture guide, version 1.0.1. 2003. omg/03- 06-01.

[119] OMG. Unified Modeling Language: Superstructure, version 2.0. Object Management Group, 2005. formal/05-07-04.

[120] OMG. Meta Object Facility (MOF) 2.0 Core Specification. Object Man- agement Group, 2006. formal/06-01-01.

[121] OMG. Unified Modeling Language: Infrastructure, version 2.0. Object Management Group, 2006. formal/05-07-05.

[122] Guillermo Peris and Andr´es Marzal. Fast cyclic edit distance compu- tation with weighted edit costs in classification. In Proceedings of the International Conference on Pattern Recognition (ICPR’02), 2002.

[123] Eric Peukert, Henrike Berthold, and Erhard Rahm. Rewrite- techniques for performance optimzation of schema matching pro- cesses. In Proceedings of 13th International Conference on Extending Database Technology (EDBT’10), 2010.

[124] Eric Peukert, Sabine Massmann, and Kathleen Koenig. Comparing similarity combination methods for schema matching. In Gesellschaft fr Informatik (GI), editor, Proceedings of the GI Jahrestagung (1). Gesellschaft fr Informatik (GI), 2010.

[125] Erhard Rahm. Towards large-scale schema and ontology matching. Schema Matching and Mapping, pages 3–27, 2010.

[126] Erhard Rahm and Philip A. Bernstein. A survey of approaches to au- tomatic schema matching. The VLDB Journal, 10(4):334–350, 2001.

[127] R. Read and D. Corneil. The graph isomorphism disease. Journal of Graph Theory, 1:339–363, 1977.

[128] Kaspar Riesen, Michel Neuhaus, and Horst Bunke. Bipartite graph matching for computing the edit distance of graphs. Graph-Based Representations in Pattern Recognition, 4538:1, 2007.

[129] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, 1979.

[130] Dominique Ritze, Christian Meilicke, Ondřej Šváb-Zamazal, and Heiner Stuckenschmidt. A pattern-based ontology matching approach for detecting complex correspondences. In Proceedings of the Workshop on Ontology Matching (OM'09), 2009.

[131] Neil Robertson and P. D. Seymour. Graph minors. XX. Wagner's conjecture. Journal of Combinatorial Theory, Series B, 92:325–357, 2004.

[132] Barna Saha, Ioana Stanoi, and Kenneth L. Clarkson. Schema covering: a step towards enabling reuse in information integration. In Proceedings of the IEEE International Conference on Data Engineering (ICDE'10), 2010.

[133] Alberto Sanfeliu and King-Sun Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13(3):353–362, 1983.

[134] SAP AG. Integration scenario POS, 2009. https://wiki.sdn.sap.com/wiki/display/CK/Integration+Scenario+POS. Last visited June 2011.

[135] SAP AG. Process composer, 2011. http://sap.com/platform/netweaver/components/sapnetweaverbpm/index.epx. Last visited June 2011.

[136] August-Wilhelm Scheer and Frank Habermann. Enterprise resource planning: making ERP a success. Communications of the ACM, 43:57–61, April 2000.

[137] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[138] Pavel Shvaiko and Jérôme Euzenat. A survey of schema-based matching approaches. Journal on Data Semantics, 4:146–171, 2005.

[139] Horst D. Simon and Shang-Hua Teng. How good is recursive bisection? SIAM Journal on Scientific Computing, 18:1436–1445, 1997.

[140] Marko Smiljanic, Maurice van Keulen, and Willem Jonker. Using element clustering to increase the efficiency of XML schema matching. In Proceedings of the ICDE Workshops, page 45, 2006.

[141] Daniel A. Spielman and Shang-Hua Teng. Spectral partitioning works: Planar graphs and finite element meshes. Linear Algebra and its Applications, 421:284–305, 2007.

[142] Dave Steinberg, Frank Budinsky, Marcelo Paternostro, and Ed Merks. EMF: Eclipse Modeling Framework. Addison-Wesley, second edition, 2008.

[143] Gunther Stuhec. Using CCTS modeler warp 10 to customize business information interfaces. SAP Developer Network, 2007.

[144] Julian R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM, 23(1):31–42, 1976.

[145] Gabriel Valiente. Graph isomorphism and related problems. Technical report, Technical University of Catalonia, 2001.

[146] Konrad Voigt. Generation of language-specific transformation rules based on metamodels. In Proceedings of the 1st IoS PhD Symposium 2008 at I-ESA’08, 2008.

[147] Konrad Voigt. Towards combining model matchers for transformation development. In Proceedings of the 1st International Workshop on Future Trends of Model-Driven Development at ICEIS'09 (FTMDD'09), 2009.

[148] Konrad Voigt. Semi-automatic matching of heterogeneous model- based specifications. In Proceedings of the Software Engineering 2010 (Doc. Symp.) (SE’10), 2010.

[149] Konrad Voigt. Evaluation data series, 2011. http://www.voggy.de/files/evaluation.zip.

[150] Konrad Voigt and Thomas Heinze. Meta-model matching based on planar graph edit distance. In Proceedings of the International Conference on Model Transformation (ICMT'10), 2010.

[151] Konrad Voigt, Petko Ivanov, and Andreas Rummler. MatchBox: Combined meta-model matching for semi-automatic mapping generation. In Proceedings of the 2010 ACM Symposium on Applied Computing (SAC'10), 2010.

[152] Klaus Wagner. Über eine Erweiterung eines Satzes von Kuratowski [On an extension of a theorem of Kuratowski]. Deutsche Mathematik, 2:280–285, 1937.

[153] Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168–173, 1974.

[154] Peng Wang and Baowen Xu. Lily: Ontology alignment results for OAEI 2009. In Proceedings of the 4th International Workshop on Ontology Matching (OM'09), 2009.

[155] Fang Wu and Bernardo A. Huberman. Finding communities in linear time: a physics approach. The European Physical Journal B-Condensed Matter and Complex Systems, 38(2):331–338, 2004.

[156] Wei Hu, Yuzhong Qu, and Gong Cheng. Matching large ontologies: A divide-and-conquer approach. Data & Knowledge Engineering, 67:140–160, October 2008.

[157] Xifeng Yan and Jiawei Han. gSpan: graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM’02), 2002.

[158] Xifeng Yan and Jiawei Han. CloseGraph: mining closed frequent graph patterns. In Proceedings of the Ninth ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD'03), 2003.

[159] Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245–1262, 1989.

[160] Xiao Zhang, Qian Zhong, Juanzi Li, and Jie Tang. RiMOM results for OAEI 2010. In Proceedings of the 5th International Workshop on Ontology Matching (OM'10), 2010.

[161] Zhi Zhang, Pengfei Shi, Haoyang Che, and Jun Gu. An algebraic framework for schema matching. Informatica, 19(3):421–446, 2008.

[162] Zhongping Zhang, Rong Li, Shunliang Cao, and Yangyong Zhu. Similarity metric for XML documents. In Proceedings of the Workshop on Knowledge and Experience Management, 2003.

List of Figures

2.1 MOF three-layer architecture and example ...... 10
2.2 Generic matcher receiving two elements as input ...... 11
2.3 Architecture and process of a generic matching system ...... 12
2.4 Classification of matching techniques ...... 15
2.5 Examples for global, local, and region-based matching ...... 17
2.6 Examples for a graph ...... 20
2.7 Package, class, attribute, and operation mapping ...... 21
2.8 Inheritance-, reference-based, and complete metamodel graph ...... 22
2.9 Flattened containment and reference metamodel tree ...... 23
2.10 Example of a graph, a minor, and edge contraction ...... 25
2.11 Example of a graph and planarity ...... 26
2.12 The non-planar graphs K5 and K3,3 ...... 26
2.13 Subset relation between general, planar, and trees ...... 27
2.14 Classification of graph matching algorithms ...... 29
2.15 Example of a graph, a pattern and embeddings of this pattern ...... 31
2.16 Classification of graph mining algorithms ...... 32
2.17 Classification of graph partitioning and clustering algorithms ...... 33

3.1 Example of data integration for POS and ERP ...... 36
3.2 Details of retail store data integration example ...... 37
3.3 Metamodel example POS and ERP ...... 38
3.4 Problem hierarchy ...... 41
3.5 Objective hierarchy ...... 44
3.6 Steps of our problem solving approach ...... 47

5.1 Example of source and target graphs, and a common subgraph ...... 64
5.2 Example metamodel excerpts and corresponding graphs ...... 70
5.3 Example calculation of planar graph edit distance ...... 71
5.4 Illustrative example of k-max degree with user interaction ...... 73
5.5 General mining-based matching process ...... 74
5.6 Example for mining graph model types using similarity classes ...... 78
5.7 Design pattern matcher process ...... 79
5.8 Example for pattern mining by the design pattern matcher ...... 83


5.9 Redundancy matcher process ...... 85
5.10 Example for pattern mining by the redundancy matcher ...... 88

6.1 Processing steps of partition-based matching and assignment ...... 92
6.2 Example metamodel and the corresponding graph ...... 98
6.3 Example calculation for planar partitioning ...... 99
6.4 Example of the partition merging phase ...... 102
6.5 Partition matching without assignment ...... 103
6.6 Partition matching with threshold-based assignment ...... 104
6.7 Partition matching with quantile-based assignment ...... 106
6.8 Partition matching with Hungarian assignment ...... 107
6.9 Partition matching with generalized assignment ...... 108

7.1 Processing of our framework MatchBox ...... 116
7.2 Example of aggregation and selection of matcher results ...... 118
7.3 Size of ESR metamodels ...... 122
7.4 Source and target metamodel element ratio for ESR ...... 123
7.5 Size of ESR gold-standard mappings ...... 123
7.6 Coverage of ESR gold-standard mappings ...... 124
7.7 Name and token overlap for ESR metamodels ...... 125
7.8 Number of edges for ESR metamodels ...... 125
7.9 Maximal degree for ESR metamodels ...... 126
7.10 Loss in edges for ESR metamodel graphs ...... 126
7.11 Size of ATL metamodels ...... 127
7.12 Source and target metamodel element ratio for ATL ...... 128
7.13 Size of ATL gold-standard mappings ...... 128
7.14 Coverage of ATL gold-standard mappings ...... 129
7.15 Name and token overlap for ATL metamodels ...... 130
7.16 Number of edges for ATL metamodels ...... 130
7.17 Maximal degree for ATL metamodels ...... 130
7.18 Loss in edges for ATL metamodel graphs ...... 131
7.19 Baseline quality results for the ESR and ATL data sets ...... 134
7.20 Result quality for the ESR data using the GED ...... 136
7.21 Result quality for the ATL data using the GED ...... 137
7.22 Delta of quality for increasing k used only by the GED ...... 138
7.23 Delta of quality for increasing k added to the result ...... 138
7.24 Distribution of precision delta for GED ...... 139
7.25 Examples for quality changes by the GED matcher ...... 140
7.26 Result quality for mining-based matchers ...... 143
7.27 Quality and runtime baseline for partition-based matching ...... 145
7.28 Comparison of precision and recall dependent on size ...... 146
7.29 Comparison of memory and runtime dependent on size ...... 147
7.30 Quality and runtime for threshold-based assignment ...... 148
7.31 Quality and runtime for quantile-based assignment ...... 148

7.32 Quality and runtime for Hungarian assignment ...... 149
7.33 Quality and runtime for generalized assignment ...... 150
7.34 Summary for partition-based matching ...... 151
7.35 Summary for flat partition-based matching ...... 152

A.1 Example of an import from ATL ...... 167
A.2 Excerpt of the ATL metamodel ...... 169
A.3 Excerpt of the MatchBox model ...... 169

B.1 Architecture of MatchBox ...... 176

List of Tables

3.1 Overview of requirements ...... 46

4.1 Overview on related structural matching approaches ...... 58
4.2 Overview on related large-scale matching systems ...... 60

5.1 Overview and comparison of graph matching algorithms ...... 66
5.2 Overview and comparison of graph mining algorithms ...... 77
5.3 Contributions of structural graph-based matchers ...... 90

6.1 Graph clustering and partitioning algorithms ...... 95
6.2 Comparison of the four assignment approaches ...... 110
6.3 Contributions of graph-based partitioning and assignment ...... 111

7.1 Comparison of metrics for ESR and ATL data set ...... 132
7.2 Results and limitations of our concepts ...... 153

A.1 Mapping from ATL to the MatchBox mapping metamodel ...... 171


List of Algorithms

5.1 Planar graph edit distance algorithm ...... 68
5.2 Design pattern matcher algorithm ...... 80
5.3 Pattern mining in design pattern matcher (minePatterns) ...... 81
5.4 Redundancy matcher algorithm ...... 85
5.5 Pattern identification for redundancy matcher (minePatterns) ...... 86
6.1 Planar graph-based partitioning algorithm outline ...... 97
A.1 ATL-RuleImport ...... 170
A.2 BindingImport (importB) ...... 172
