Combining Lexical and Semantic Similarity Measures with Machine Learning Approach for Ontology and Schema Matching Problem

© Lev Bulygin Lomonosov Moscow State University, Moscow, Russia [email protected] Abstract. Ontology and schema matching is one of the most important tasks for data integration. We suggest to combine the different matchers with machine learning approach. The features are the outputs of lexical and semantic similarity functions. Naive Bayesian classifier, logistic regression and gradient tree boosting have been trained on these features. The proposed approach is tested on the OAEI 2017 benchmark “conference” with the various splits of the data on train and test sets. Experiments show that final combined model in element-level matching outperformed the single matchers. Results are compared with EditDistance matcher and WordNet matcher. Keywords: ontology matching, element-level matcher, lexical matcher, semantic matcher, word2vec, word2net.

approach on OAEI 2017 “conference” ontologies with 1 Introduction single matchers: edit distance and WordNet. In Section 2 we reviewed related work and it's Data integration is now a very important problem, as the evolution. Section 3 describes the setup of the paper. In amount of data is growing very much. One of the most Section 4 presents our matching algorithm in detail. important steps in automatic data integration is the Section 5 shows the experiments and the analysis. We comparison of ontologies or schemas. This problem is conclude this paper in Section 6. traditionally solved manually or semi-automatically. Manual ontology or schema matching is very labor- 2 Related Work intensive, long in time and prone to errors. Therefore now the task is to maximally automate the process of Schema matching and ontology matching are very comparing the circuit with the greatest accuracy. similar. In this article, there is no difference for schema For schema matching and ontology matching matching and ontology matching because we only work element-level and structure-level matchers are used. with the names of the entities. So we considered the Element-level matchers use information of elements of works about schema matching and ontology matching schema (name of columns, description). Structure-level together. In [1] a review of approaches for automatic matchers use information of structure (type similarity, schema matching are conducted. They introduced the key properties, hierarchy of elements). classification of matchers and concluded that it is Recently, element-level matchers have been widely impossible to create a fully automatic matching system, used lexical and semantic information. For getting universal for all subject areas. One of the most important semantic information WordNet and Word2vec are used. conclusions is that hybrid matching system is better than Word2vec allow us to get a meaning of the word and we single system. Intuitively, this is obvious, since a hybrid can use this for improving ontology and schema matcher uses more information to make a decision. matching. So we can combine many lexical and semantic In [2] the ready-made solutions for automatic schema information into one similarity matrix and train a matching are compared and the concept of pre-match machine learning model for trying to solve ontology and effort and post-match effort introduced. Pre-match effort schema matching problems. is learning of the matcher, configuration of thresholds This work is performed as Master thesis. The main and weights. Post-match effort is correction and goal of this work is to propose a approach for solving improvement of the match output. In the same year, in ontology and schema matching problem with the greatest [3] the authors described their COMA system. The accuracy. To achieve this goal we need to perform the authors introduced new methods of post-match effort: following tasks: review related work, create the reuse of matchings, aggregation of individual matchers. architecture of solution, select of information for In [4] the Target-based Integration Query System (TIQS) matching and create experiments. We compared our are described. In [5] a new classification on matching dimensions are presented. A natural issue is uncertainty, In [6] uncertainty are considered, monotonicity and Proceedings of the XX International Conference statistical monotonicity are introduced . All the above “Data Analytics and Management in Data Intensive articles don't use machine learning for aggregation Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

245 results of matchers. 3 Problem Statement, Evaluation Metrics In [7] Bayesian Networks are used for combining and Similarity Measures ontology matching. As the features lexical information (N-gram, Levenshtein, Dice Coefficient etc) were used. 3.1 Problem Statement The best accuracy is 88%. In [8] the outputs from several single matchers are An ontology O is by a 3-tuple (C, P, I). C is the classes, used and the authors trained on this data a multi-layer denoting the concepts. P is the relations within the neural network.The authors used lexical information and ontology. I is the instances of classes. The task of the WordNet similarity. ontology matching is to find the alignment between In [9] ontology matching problem in detail are entities in a source ontology and a target ontology . An alignment is a set {( , , )| , described: applications, classifications of ontology 1 2 }, where is an entity in 𝑂𝑂 , is an entity in , and𝑂𝑂 matching techniques, evaluations and alignment. They 1 2 1 1 2 is the confidence of the correspondence.𝑒𝑒 𝑒𝑒 𝑐𝑐𝑐𝑐𝑐𝑐 𝑒𝑒 ∈ 𝑂𝑂 𝑒𝑒 ∈ suggest three dimensions of building correspondences: 2 1 1 2 2 conceptual, extensional and semantic. 𝑂𝑂 Our algorithm𝑒𝑒 works only𝑂𝑂 𝑒𝑒with names of 𝑂𝑂entities In [10] Multiple Concept Similarity for ontology 𝑐𝑐without𝑐𝑐𝑐𝑐 the structure information. The input of our mapping are described. The author reduced the ontology algorithm is the two ontologies: source and target. The mapping problem to the machine learning problem. They output of our algorithm is the alignment. used lexical information (prefix, suffix, Edit distance, n- gram), semantic information (WordNet) and special 3.2 Evaluation Metrics types of similarity called Word List similarity (a The effectiveness of ontology matching could similarity for sentences). measured by precision, recall and F-measure. | | In [11] a machine-learning approach to ontology = , (1) alignment problem are described. The authors used | | | 𝐴𝐴∩𝐵𝐵| = , (2) lexical information, semantic information (WordNet) 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 | | 𝐴𝐴 and structural information for training Support Vector 𝐴𝐴∩𝐵𝐵 = , Machine, K-Nearest Neighbours, Decision Tree, 𝑟𝑟 𝑟𝑟𝑐𝑐𝑎𝑎𝑙𝑙𝑙𝑙 𝐵𝐵 (3) 2⋅ 𝑝𝑝𝑝𝑝𝑒𝑒𝑐𝑐𝑖𝑖𝑠𝑠𝑖𝑖𝑡𝑡𝑛𝑛⋅𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 AdaBoost models. The authors improves F-measure 𝐹𝐹 − 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝+𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟 criterion up to 99%. where A is a predict alignment and B is a true In [12] Bayeasian Networks are used for composition alignment. of matchers. The resulting model achieved 81% accuracy. The authors used the outputs of the lexical 3.3 Used Similarity Measures matchers, structure-level matchers, synonym matchers As features for machine learning we used the similarity and instance-level matchers. measures. We extract from ontology the names of In [13] all stages of ontology matching problem in entities. Further we need to create sets of words from detail are described: feature selection, methods of names for computing some similarity measures. combining matchers and experiments. Example. We have two names of entities: In [14] word2vec embeddings for ontology matching “early_paid_applicant” and “early-registered- problem are used. Authors proposed the new algorithm of participant”. The sets are [“early”, “paid”, “applicant”] matching and sentence2vec algorithm. The results matcher and [“early”, “registered”, “participant”]. are compared with various types of WordNet, EditDistance matchers. In [15] the combination of word2vec features and Metrics based on lexical information from names of a neural network are used. One of the most important entities: problems is that articles using machine learning can not be • N-gram [11]. Let ngram(S, N) be the set of compared with each other because of the lack of a unified substrings of string S of length N. The n-gram benchmark. The OAEI is not intended to use machine similarity for two strings S and T: learning, so the training dataset is chosen by the author of | ( , ) ( , )| ( , ) = (4) the article. (| |,| |) 𝑛𝑛𝑛𝑛𝑛𝑛𝑚𝑚𝑚𝑚 𝑆𝑆 𝑁𝑁 ∩𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 𝑇𝑇 𝑁𝑁 At the moment, the most promising option is the • Dice coefficient [11]. The Dice similarity score is 𝑛𝑛 𝑛𝑛𝑟𝑟𝑎𝑎𝑚𝑚 𝑆𝑆 𝑇𝑇 𝑚𝑚𝑚𝑚𝑚𝑚 𝑆𝑆 𝑇𝑇 −𝑁𝑁+1 using of machine learning to create a hybrid model with defined as twice the shared information lexical, semantic and structure information. In this (intersection) divided by sum of cardinalities. For article, we want to connect many different single two sets X and Y, the Dice similarity score is: matchers to one hybrid matcher using machine learning. | | ( , ) = (5) One of the newest features we used is Word2vec. So far | | | | 2∗ 𝑋𝑋∩𝑌𝑌 we only used the single element-level matchers. • Jaccard similarity [11]. For two sets X and Y, the 𝑑𝑑 𝑑𝑑𝑑𝑑𝑒𝑒 𝑋𝑋 𝑌𝑌 𝑋𝑋 + 𝑌𝑌 All reviewed papers use widely the lexical Jaccard similarity score is: information and WordNet distances. In this paper we | | ( , ) = (6) used 29 similarity metrics from 17 various single | | 𝑋𝑋∩𝑌𝑌 element-level matchers. • Jaro measure [11]. The Jaro measure is a type of edit 𝑗𝑗 𝑗𝑗𝑐𝑐𝑐𝑐𝑎𝑎𝑟𝑟𝑑𝑑 𝑋𝑋 𝑌𝑌 𝑋𝑋∪𝑌𝑌 distance, developed mainly to compare short strings, such as first and last names.

246

• Substring similarity [11]. For longest substring T of • Ratio. For two strings X, Y: two strings X and Y substring similarity is: = (2.0 100), (10) | | ( , ) = (7) 𝑀𝑀 | | | | • where T is the total number of characters in both 2∗ 𝑇𝑇 𝑅𝑅 𝑅𝑅𝑡𝑡𝑖𝑖𝑒𝑒 𝑟𝑟𝑟𝑟𝑠𝑠𝑛𝑛𝑑𝑑 ⋅ 𝑇𝑇 ⋅ • Monge-Elkan [11]. Token-based and internal strings, and M is the number of matches in the two 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑋𝑋 𝑌𝑌 𝑋𝑋 + 𝑌𝑌 similarity function for tokens: strings X and Y. | | • Soundex [6]. Phonetic measure such as soundex ( , ) = ( [ ], [ ]) match string based on their sound. These measures | | 𝑥𝑥 ,| | 1 ′ have been especially effective in matching names, (8) 𝑀𝑀𝑀𝑀 𝑗𝑗=1 𝑦𝑦 𝑠𝑠𝑠𝑠𝑚𝑚 𝑥𝑥 𝑦𝑦 𝑥𝑥 𝑖𝑖=1∑ 𝑚𝑚𝑚𝑚𝑥𝑥 𝑠𝑠𝑠𝑠𝑚𝑚 𝑥𝑥 𝑖𝑖 𝑦𝑦 𝑗𝑗 since names are often spelled in different ways that • Smith-Waterman [11]. The Smith-Waterman sound the same. algorithm performs local sequence alignment; that is, for determining similar regions between two • Token Sort. Fuzzy Wuzzy token sort ratio raw strings. raw_score is a measure of the strings similarity as an int in the range [0, 100]. For two strings X and Y, • Needleman-Wunsch [11]. The Needleman-Wunsch the score is obtained by splitting the two strings into distance generalizes the Levenshtein distance and tokens and then sorting the tokens. The score is then considers global alignment between two strings. the fuzzy wuzzy ratio raw score of the transformed Specifically, it is computed by assigning a score to strings. each alignment between the two input strings and choosing the score of the best alignment, that is, the • Tversky Index. For sets X and Y: | | maximal score. ( , ) = , (11) | | | | | | 𝑋𝑋∩𝑌𝑌 • Affine gap [17]. Returns the affine gap score • where , > 0. 𝑇𝑇𝑇𝑇 𝑋𝑋 𝑌𝑌 𝑋𝑋∩𝑌𝑌 +𝛼𝛼 𝑋𝑋−𝑌𝑌 +𝜂𝜂 𝑌𝑌−𝑋𝑋 between two strings. The affine gap measure is an • Overlap coefficient. For sets X and Y: extension of the Needleman-Wunsch measure that 𝛼𝛼 𝜂𝜂 | | ( , ) = (12) handles the longer gaps more gracefully. (| |,| |) 𝑋𝑋∩𝑌𝑌 • Bag distance. The number of common symbols in 𝑂𝑂𝑂𝑂 𝑋𝑋 𝑌𝑌 𝑚𝑚𝑚𝑚𝑚𝑚 𝑋𝑋 𝑌𝑌 two strings. Metrics based on semantic information from names • Cosine similarity [11]. For two sets X and Y: of entities: | | • WordNet similarity. Best score of similarity between ( , ) = (9) | | | | synonyms of one word with synonyms of other 𝑋𝑋∩𝑌𝑌 • word. We use a sentence similarity algorithm from Partial Ratio. Given𝑐𝑐 𝑐𝑐 two𝑠𝑠𝑖𝑖𝑛𝑛𝑒𝑒 strings𝑋𝑋 𝑌𝑌 X �and𝑋𝑋 ⋅ 𝑌𝑌Y, let the shorter string (X) be of length m. It finds the fuzzy [14]. wuzzy ratio similarity measure between the shorter • Word2vec and Sentence2vec similarity. Word2vec string and every substring of length m of the longer is a model that are used to produce word string, and returns the maximum of those similarity embeddings. We used word2vec model trained by measures. the texts of Google News. The dimensionality of • Soft TFIDF [21]. A variant of TF-IDF that considers these vectors is 300. We used sentence2vec words equal based on Jaro Winkler rather than exact algorithm from [14]. match. • Editex. Editex is a phonetic distance measure that 4 Proposed Matching Algorithm combines the properties of edit distances with the The input of the matching system is the two letter-grouping strategy used by Soundex and ontologies. For each ontology the system extracts the Phonix. names of the entities. Further it generate all possible pairs • Generalized Jaccard. This similarity measure is of the names of the ontology with the names of the other softened version of the Jaccard measure. The ontology. For all pairs system computes the all similarity Jaccard measure is promising candidate for tokens measures. All outputs of the similarity measures are which exactly match across the sets. concatenated into the one similarity matrix. The computed features are used as input to a machine • JaroWinkler. The Jaro-Winkler measure is designed learning model. to capture cases where two strings have a low Jaro score, but share a prefix and thus are likely to match. Program 1 Element-level matching algorithm

• Levenshtein distance [11]. Levenshtein distance Input: Ontology1, Ontology2 - input ontologies computes the minimum cost of transforming one or schemas, THRESHOLD - threshold for create string into the other. matching between elements Output: alignment - output alignment for • Partial Token Sort. For two strings X and Y, the Ontology1 and Ontology2 score is obtained by splitting the two strings into for Entity1 Ontology1 do for Entity2 Ontology2 do tokens and then sorting the tokens. The score is then ∈ Name1 get_name(Entity1) the fuzzy wuzzy partial ratio raw score of the ∈ Name2 get_name(Entity2) transformed strings. ← Features get_features(Name1, Name2) ← ←

247

Match predict_match(Features) Table 1 Our split of dataset “conference” from OAEI if Match > THRESHOLD then 2017 benchmark alignment.append((Name1,Name2))← end if end for Train pairs Test pairs end for return alignment Conference, Ekaw Iasted, Sigkdd Ekaw, Sigkdd Cmt, Confof Program 2 Creating dataset and training a machine Cmt, Sigkdd Confof, Edas learning model Cmt, Ekaw Edas, Iasted Ekaw, Iasted Input: TrueAlignments - set of true alignments, Confof, Ekaw TrainAlignments - set of train pairs Edas, Ekaw Cmt, Iasted Conference, Iasted Edas, Sigkdd Output: Model - trained model for predict Conference, Sigkdd matching Confof, Sigkdd Conference, Edas Conference, Confof for Ontology1, Ontology2 in TrueAlignments do Confof, Iasted Cmt, Edas for Entity1 Ontology1 do Cmt, Conference for Entity2 Ontology2 do ∈ if (Entity1, Entity2) TrainAlignments ∈ then add_to_train(Entity1,∈ Entity2, 1) 5.2 Experiments else add_to_train(Entity1, Entity2, 0) As a baseline solution we take EditDistance and end if WordNet matchers. For every matcher we select end for threshold with the best F-measure on the test dataset. As end for the machine learning models we chose Naive Bayes end for 27 28 Model.train() Classifier , Logistic regression as the simple models return Model 29 and Gradient boosting as the complex model. For training a machine learning model we need the Table 2 The results on test dataset ontologies and the alignments. We split the alignments on the training alignments and testing alignments. Precision, F- Method Recall, % Further we train a machine learning model on the % measure, % features from train alignments. Then we can evaluate F- measure on testing alignment. EditDistance 50 35.8 41.7

5 Preliminaries and Results WordNet 9.1 38.2 14.7 Naive Bayes 2.13 80.08 4.15 5.1 Collection of Data Classifier The experiments were conducted at the dataset Logistic 75.58 40.12 52.41 “conference” from OAEI 2017. The dataset consists 16 regression ontologies from the same domain (conference organisation) and 21 true ontology matchings. This XGBoost 69.15 45.67 55.01 dataset is convenient for machine learning because all ontologies are from the same domain and this dataset has the several true matchings between ontologies. We can see on Table 2 that XGBoost is the model We get entities from ontologies with parsing code on with best F-measure 55.01%. The worst model is Naive Python and get array of tuples: (entity1, entity2, match), Bayes Classifiers with F-measure 4.15%. EditDistance is where entity1, entity2 - string names of entities, match - better than WordNet by 27%. For getting more stable bool variable, 1 means that the entities are matched, 0 results we splitted the datasets randomly on training and means that the entities are not matched. We generated 29 testing pairs 20 times and averaged the results. The similarity measures from 3.3 with Python packages: results are show on Table 3. We can see that XGBoost py_stringmatching24, fuzzycomp25, gensim26. are remained the best model. The final dataset is very unbalanced: 391 572 negative samples and 305 positive samples. After generating features we split data randomly on train and test datasets in equal proportions. In Table 1 are described our split.

24 https://github.com/kvpradap/py_stringmatching .GaussianNB.html 25 https://github.com/fuzzycode/fuzzycomp 28 http://scikit- 26 https://github.com/RaRe-Technologies/gensim learn.org/stable/modules/generated/sklearn.linear_mode 27 http://scikit- l.LogisticRegression.html learn.org/stable/modules/generated/sklearn.naive_bayes 29 https://github.com/dmlc/xgboost

248

Table 3 The averaged results on 20 various random Semantics IV, pp. 146-171 (2005). doi: splits on train and test datasets 10.1007/11603412_5 [6] Gal, A.: Why is Schema Matching Tough and Precision, F- Method Recall, % What Can We Do About It? In: ACM SIGMOD % measure, % Record, vol. 35, issue 4, pp. 2-5 (2006). doi: EditDistance 47.31 39.2 42.82 10.1145/1228268.1228269 [7] Svab, O., Svatek, V.: Combining Ontology WordNet 11.87 41.69 18.35 Mapping Methods Using Bayesian Networks. In: OM'06 Proceedings of the 1st International Naive Bayes 2.63 81.49 5.10 Conference on Ontology Matching, vol. 225, pp. Classifier 206-210. Logistic 74.98 43.43 54.95 [8] Hariri, B., Abolhassani, H., Sayyadi, H.: A regression Neural-Networks-Based Approach for Ontology XGBoost 72.72 47.51 57.29 Alignment. (2006). https://www.jstage.jst.go.jp/article/softscis/2006/ 0/2006_0_1248/_article [9] Euzenat, J., Shvaiko, P.: Ontology Matching. 6 Conclusions Springer-Verlag Berlin Heidelberg, Berlin In this paper, we combined the lexical and semantic (2007). doi: 10.1007/978-3-642-38721-0 similarity measures into one hybrid model. The [10] Ichise, R.: Machine Learning Approach for experiments show that hybrid model is better than single Ontology Mapping Using Multiple Concept matchers. Similarity Measures. In: Seventh IEEE/ACIS In future work we want to run our solution on other International Conference on Computer and benchmarks. We must add the structure-level and Information Science, (2008). doi: instance-level features because in our dataset there are 10.1109/ICIS.2008.51 examples in which the names of entities are exactly the [11] Nezhadi, A., Shadgar, B., Osareh, A.: Ontology same but they are not matched. Alignment Using Machine Learning Techniques. In: International Journal of Computer Science & Acknowledgments. This work is supervised by Sergey Information Technology, vol. 3, pp. 139-150 Stupnikov, Institute of Informatics Problems, Federal (2011). doi: 10.5121/ijcsit.2011.3210 Research Center “Computer Science and Control“ of the [12] Nikovsi, D., Esenther, A., Ye, X., Shiba, M., Russian Academy of Sciences”. Takayama, S.: Bayesian Networks for Matcher Composition in Automatic Schema Matching. References In: Mitsubishi Electric Research Laboratories (2011). doi: 10.1.1.644.2168 [1] Rahm, E., Bernstein, A.: A survey of approaches to automatic schema matching. In: The [13] Ngo, D.: Enhancing Ontology Matching by International Journal on Very Large Data Bases, Using Machine Learning, Graph Matching and vol. 10, issue 4, pp. 334-350 (2001). doi: Information Retrieval Techniques. In: University 10.1007/s007780100057 Montpellier II - Sciences et Techniques du Languedoc (2012). doi: 10.1.1.302.587 [2] Do, H., Melnik, S., Rahm, E.: Comparison of Schema Matching Evaluations. In: Revised [14] Zhang, Y., Wang, X., Lai, S., He, S., Liu, K., Papers from the NODe 2002 Web and Database- Zhao, J., Lv, X.: Ontology Matching with Word Related Workshops on Web, Web-Services, and Embeddings. In: Chinese Computational Database Systems, pp. 221-237 (2002). doi: Linguistics and Natural Language Processing 10.1007/3-540-36560-5_17 Based on Naturally Annotated Big Data, pp 34- 45 (2014). doi: 10.1007/978-3-319-12277-9_4 [3] Do, H., Rahm, E.: COMA: a system for flexible combination of schema matching approaches. In: [15] Jayawardana, V., Lakmal, D., Silva, N., Perera, VLDB '02 Proceedings of the 28th international A., Sugathadasa, K., Ayesha, B.: Deriving a conference on Very Large Data Bases, pp. 610- Representative Vefctor for Ontology Classes 621 (2002). doi: 10.1016/B978-155860869- with Instance Word Vector Embeddings. In: 6/50060-3 2017 Seventh International Conference on Innovative Computing Technology (INTECH), [4] L. Xu, D. Embley.: Automating Schema (2017). doi: 10.1109/INTECH.2017.8102426 Mapping for Data Integration. (2003). http://www.deg.byu.edu/papers/AutomatingSche maMatching.journal.pdf [5] Shvaiko, P., Euzenat, J.: A Survey of Schema- Based Matching Approaches. In: Journal on Data

249