
Singapore Management University
Institutional Knowledge at Singapore Management University

Research Collection School Of Information Systems, School of Information Systems

5-2002

Product Schema Integration for Electronic Commerce: A synonym comparison approach

Guanghao YAN, State University of New York at Stony Brook
Wee-Keong NG, Nanyang Technological University
Ee Peng LIM, Singapore Management University, [email protected]

DOI: https://doi.org/10.1109/TKDE.2002.1000344


Citation YAN, Guanghao; NG, Wee-Keong; and LIM, Ee Peng. Product Schema Integration for Electronic Commerce: A synonym comparison approach. (2002). IEEE Transactions on Knowledge and Data Engineering. 14, (3), 583-598. Research Collection School Of Information Systems. Available at: https://ink.library.smu.edu.sg/sis_research/122

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 3, MAY/JUNE 2002, pp. 583-598
https://doi.org/10.1109/TKDE.2002.1000344

Product Schema Integration for Electronic Commerce: A Synonym Comparison Approach

Guanghao Yan, Student Member, IEEE, Wee Keong Ng, Member, IEEE Computer Society, and Ee-Peng Lim, Senior Member, IEEE

Abstract: In any electronic commerce system, the heterogeneity of product descriptions is a critical impediment to efficient business information exchange. In the ABECOS electronic commerce system, buyer agents, seller agents, and directory agents liaise with one another in e-commerce activities. Only when agents have a common ontology of product descriptions (also called product schemas) are they able to interact seamlessly in e-commerce activities. This gives rise to the Product Schema Integration problem (PSI): the problem of integrating heterogeneous schemas of a certain product into one globally compatible schema. In this paper, we adopt an integration approach based on product attribute synonyms. We give a formal definition of the problem and show that it is NP-complete. We contrast our approach with conventional schema integration in federated databases. We also propose a set of approximate algorithms for PSI and evaluate their performance.

Index Terms: Electronic commerce, product description, relational schema, schema integration.

G. Yan is with the Computer Science Department, State University of New York at Stony Brook, Stony Brook, NY 11794-4400. E-mail: [email protected].
W.K. Ng and E.-P. Lim are with the School of Computer Engineering, Nanyang Technological University, Nanyang Ave., Singapore 639798. E-mail: [email protected], [email protected].
Manuscript received 30 Mar. 1999; revised 20 Oct. 2000; accepted 18 Jan. 2001; posted to Digital Library 2001. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 109503.

1 INTRODUCTION

As information on the Internet becomes more and more dynamic and heterogeneous, software agents have been touted as the new building blocks for a new Internet structure. From an electronic commerce perspective, agent technologies are useful for making providers aware of consumer needs and consumers aware of providers' offerings [1]. Organizations and individuals will not only establish their presence via web sites; they will also provide agents to interact with other agents. In order to realize this, the ABECOS (Agent-Based Electronic COmmerce System) Project at the School of Computer Engineering, Nanyang Technological University, Singapore, started in July 1997 with the key objective of designing and implementing a software infrastructure for a very large, distributed, agent-based web commerce system [21], [23], [24], [37], [38]. The project aims to construct generic prototypes for three types of agents: buyer agent, seller agent, and directory agent. Buyer agents and seller agents interact and transact on behalf of buyers and sellers. In addition, directory agents, corresponding to conventional web search engines, locate seller agents for buyer agents.

In ABECOS, directory agents keep information about seller agents. Since seller agents distinguish themselves by the products they sell, directory agents use the buyer's specifications of the desired product as the search query to find relevant seller agents. In this case, directory agents no longer follow the simple keyword search strategy adopted by traditional search engines; they use users' product specifications as the search criteria, compare them with descriptions provided by different seller agents, and return the addresses of relevant seller agents. Once buyer agents obtain the addresses, they communicate with each relevant seller agent, query for the details of products, negotiate the prices, and complete the deal.

To describe a product, a set of attributes is used. For example, we may use (Celine Dion, Falling into You, Sony) to represent the singer's name, the title, and the company of a music CD, respectively. In relational database terminology, the set of attribute names of a relation is called the schema of that relation. Correspondingly, we may call the set of attribute names of a product the schema of that product. In this way, the schema of music CDs is (artist, album, company). Different instantiations of this schema correspond to different music CD records. However, different sellers may differ in the way they describe their products. They may adopt different sets of attributes or vocabularies to describe the same product. For example, (year, classification, singer, title, company) may be another schema for music CDs. We refer to such a seller-specific schema as a local product schema.

Since different sellers use different local product schemas, the directory agent in ABECOS is required to be "intelligent" enough to understand the semantics of different local product schemas and to identify the correspondence between the local schema attributes and the attributes involved in the buyer agents' search queries. Moreover, when buyer agents and seller agents in ABECOS interact directly, they should also understand each other's product descriptions. This is an ontology heterogeneity problem. Data integration is a solution to this problem; we can construct an integrated schema (or global schema) for a product before the user specifies the search query. As A. Levy has argued, a data-integration system lets users focus on specifying what they want, rather than thinking


about how to obtain the answers [14]. This is the main function of a global product schema. Using the global schema as the common description of products, the complex interaction among heterogeneous software agents (seller and buyer agents) can be carried out. As shown in Fig. 1, in ABECOS, the directory agent integrates different local product schemas into a global product schema; the buyer agent obtains the global schema from the directory agent when a user wants to buy such a product. Then, the user specifies his or her requirements for the product according to the global product schema, and the buyer agent sends these specifications as a search query to the directory agent. Finally, when the directory agent has located relevant seller agents, it returns the addresses of those seller agents as search results to the buyer agent.

Fig. 1. Interactions among users, buyer agents, and directory agent.

Product schema integration is a subfield of data integration. In the Asilomar report on database research [3], in J. Gray's brief essay on database research [12], and in another paper on database research in the 21st century by A. Silberschatz et al. [34], heterogeneous data integration has been identified as one of the major research directions in the future. It has also been pointed out that, in electronic commerce systems, heterogeneous information sources must be integrated and that the integration is an extremely difficult problem [34]. Although significant progress has been made in data integration, this problem is by no means solved; name matching across distributed sources remains one major problem [14]. Thus, we need schema management facilities, such as directory agents in ABECOS, which can adapt to the dynamic nature of heterogeneous data.

1.1 Characteristics of Product Schema Integration

As in other schema integration problems (such as in database integration), the heterogeneity among local product schemas in ABECOS can be classified into two categories, namely, naming conflicts and missing attributes. Naming conflicts include synonyms, words similar in meaning but different in spelling, and homonyms, words similar in spelling but different in meaning in different contexts. For example, album and title are synonyms in the local product schemas of music CDs. In addition, some product attributes used by one seller agent in ABECOS may not be used by another seller agent in ABECOS. This results in missing product attributes. For example, some sellers may use chassis as an attribute to describe a PC while some others do not.

It has been pointed out by D. Florescu et al. in a survey on database techniques for the World Wide Web [10] that web data integration has to, in addition, deal with a large and evolving number of web sources, little metadata about the characteristics of the sources, and a large degree of source autonomy. Specifically, product schema integration in the ABECOS commerce system has the following characteristics:

• Limited knowledge of local schemas. Since product information is proprietary, a directory agent may only obtain product schemas without much information about attribute domains or data types from the sellers' web pages. Thus, conventional schema integration methods built on the availability of attribute domain information no longer work. This results in additional difficulties in understanding the semantics of local product schemas.
• Large number of local schemas. The number of seller agents in the system can be quite large. In this situation, human intervention is hardly feasible. A scalable and fully automated solution is therefore required.
• Fast local schema evolution. Whenever new features of a product are added or old features of a product are removed, local schemas of the product must be updated. For example, a multimedia PC product can be extended to include additional peripherals, which leads to new product schemas. Although the updates to a single local schema may not be very frequent, the total updates to all local schemas in the system can be so frequent that human intervention is also hardly feasible. Thus, the integration is again desired to be fully automated, low-cost, and efficient.

To successfully integrate product schemas, the approach we adopt should address the above characteristics. Note that most of these characteristics also apply to heterogeneous information integration in other non-agent-based distributed e-commerce systems. Because of the large number of sellers and the frequent changes of information, an automatic integration is generally desired in the future. We may restate the problem as a set of basic questions:

• What is a description of a product? What is its schema?
• What is the information needed for identifying the correspondence between attributes from heterogeneous local product schemas? How is such information represented?
• How can we integrate different local product schemas? Is it possible to do the integration automatically?

We shall refer to this set of problems as the Product Schema Integration problem (PSI). Although there are many existing methodologies for schema integration in multidatabase systems, PSI is different because of the characteristics involved.

1.2 Paper Organization

The organization of this paper is as follows: The next section provides an overview of related work in the area of database schema integration and electronic commerce. Sections 3 and 4 elaborate on the formal definition and complexity of PSI, respectively. In Section 5, we compare the performance of three algorithms for PSI. Conclusions and future work are briefly discussed in Section 6 and some complexity results are given in the Appendix.

2 RELATED WORK

2.1 Database Schema Integration

Schema integration is an important issue in the context of view integration (in database design) and database integration (in multidatabase management) [2]. Much work has been done during the past decades in these two areas [6], [7], [13], [17], [18], [19], [22], [26], [31], [35], [39]. It has been pointed out that success in schema integration depends on understanding the semantics of the schema components, i.e., attributes, relations, entity sets, etc., and the ability to reason with semantics. There are three levels of information that can be used to determine the semantics of attributes: attribute names, attribute field specifications, and attribute domains.

2.1.1 Levels of Schema

Attribute Name. At the attribute name level, it is found that having only attribute names is not sufficient for complete semantic schema integration [26]. The main problems of schema integration at this level are the presence of homonyms and synonyms. Homonyms can be detected by comparing concepts with the same name in different schemas, and synonyms can only be detected after external specifications [2]. According to the characteristics of PSI given in Section 1.1, product schema integration should be performed at this level.

In MUVIS (Multi-User View Integration System) [13], a synonyms lexicon has been used to identify the equivalence of attribute names. It overcomes the drawbacks of simple attribute name comparison, but constructing a synonyms lexicon is a tough and time-consuming task. Furthermore, the relationships of words in a synonyms lexicon are static whereas the relationships of attribute names are dynamic; the equivalence relationship of some words may not hold from one context to another. For example, in the schema of an "automobile," model is an attribute name equivalent to type or classification, but this relationship does not apply in the schema of a "computer." To alleviate this situation, common concepts, i.e., properties or characteristics possessed by certain objects, can be used to help the understanding of attribute semantics [39]. For example, "horizontal-position" is a common concept for attributes like x-axis or horizontal-axis. In this way, homonyms can be easily detected since their concepts are different. For synonyms, once they have been identified in a synonyms lexicon, whether their concepts agree decides whether they are equivalent attribute names.

Field Specification. The second level of attribute correspondence is attribute field specifications, i.e., characteristics of attributes such as uniqueness, cardinality, domain, semantic integrity constraints, security constraints, allowable operations, and scale [22]. They are used to describe the properties of different attributes in database schemas. In the Semint project [19], [30], a classifier is used to categorize attributes according to their field specifications such as data type, length, and key fields, and an algorithm has been proposed to determine the degree of similarity and dissimilarity among attributes. However, this approach is based on the assumption that most database designers have the same knowledge about designing a "good" database. This assumption makes it possible to use attribute field specifications to help determine the likelihood that attributes are equivalent. One of the weaknesses of this approach is that attribute field information may not always be available, as is the case in ABECOS.

Attribute Domain. At the attribute domain level, relationships among attributes, entities, and relations are defined formally into five types: EQUAL, CONTAINS, OVERLAP, CONTAINED-IN, and DISJOINT [17]. Several formulations of different attribute integration strategies were given according to different relationships of attributes to be integrated. With this approach, entities and relations can be integrated by identifying the relationships of attributes of these entities and relations. This approach has been used to implement a tool for schema integration [32]. However, it has been pointed out that determining such relationships can be time consuming and tedious [31], even with complete information of the domain of every attribute. The Carnot project [7] and MIT's COIN project [4] also address schema integration at this level. A large knowledge base has been established to store the information of different semantic contexts. This approach is based on the availability of the complete domain of each attribute.

2.1.2 Metadata Representation

Metadata are data that describe information of other data. In multidatabase systems, the metadata of schemas of component databases are critical to the whole process of database integration. Such metadata describe the correspondences among attributes of different schemas. Various metadata representations have been proposed [15]:

• Tables. Attribute correspondence can be maintained in a simple table structure. Each tuple in the table represents a static correspondence between attribute names in different databases. This is the simplest representation of metadata. For example, one tuple from the table representing metadata of music CD schemas can be (singer, artist).
• Ontologies. An ontological representation of attribute correspondence uses an ontology, a knowledge base consisting of entities and relationships with abstraction, inference, and typing mechanisms, to represent semantics of the component databases [7]. This is the most powerful and complex among the three. In the Carnot multidatabase system [7], a knowledge base known as Cyc is used to store ontologies [18]. Articulation axioms like $ist(G, \phi) \Leftrightarrow ist(S_i, \psi)$ are used to represent the logical relationship that logical expression $\phi$ in global schema $G$ is equivalent to logical expression $\psi$ in local schema $S_i$.

2.1.3 Relevance

In product schema integration in ABECOS, we need metadata to represent the semantic information of attribute names. We have the following considerations when we design the metadata representation: First, the fast evolution of local product schemas makes tables no longer suitable since tables can only represent static correspondences. Any change in one tuple may lead to changes in other tuples, thereby making the maintenance of a table very expensive. Second, the diversity of alternative attribute names employed by different sellers makes it impossible to build a complete synonym lexicon for all kinds of products. Third, predicting all possible semantic context situations and setting up and maintaining an ontology knowledge base is difficult because of the lack of complete attribute domain information. Last, in electronic commerce transactions, fast response time is very important. The use of large knowledge bases may incur undesirable overheads that undermine the performance of the system. Therefore, we should find a simple but flexible way to represent the metadata of local product schemas.

2.2 Standardization Efforts in Electronic Commerce

Much effort has been expended to provide related standards on the issues of heterogeneity in electronic commerce [5], [8], [9], [16], [25], [27], [36]. If these standards are adopted widely by e-commerce participants, then heterogeneity problems will be largely resolved. However, it will be some time before the standards are widely used, and we foresee a future whereby multiple standards exist. Thus, product schema integration is still meaningful, though it will be simplified to some extent by the adoption of standards.

The eCo Framework Project [9], conducted by CommerceNet [8], has addressed some of the heterogeneity issues. They created a base set of common terms and mappings among existing terms for e-commerce specifications. The eCo working group considers a list of related specifications, among which the RosettaNet Specification [27] and the Common Business Library (CBL) [5] shall be briefly discussed next.

2.2.1 RosettaNet Specification

RosettaNet [27] creates "property" definitions for various entities in electronic commerce, such as property definitions for a certain product. The word "property" is similar to "attribute" as discussed in Section 1. For example, "Modem" is a property (or an attribute) for computer products. Once these property definitions are completed, they will be distributed to some standards maintenance organizations that will then enumerate possible values for those properties. Then, property definitions, as well as their values, are distributed to companies in the industry supply chain as standards for business information format, say for product descriptions.

Let us take a look at how one property of a product is defined by RosettaNet. Consider the definition of the "Central Processor Unit" property for a laptop given by the RosettaNet Laptop Technical Specification [28]. There are several fields for the property "Central Processor Unit": Property Name, Synonym, Property Definition, Dictionary References, Where Used, Property Type, etc. Moreover, some of these fields contain subfields. For example, Property Name has Abbreviation and Acronym as its subfields. All these fields serve as metadata for the product property (attribute). In our product schema integration work, we use some of them to add semantics to product attributes. In fact, we have examined a mechanism based on "product attribute synonyms" to integrate product schemas. We shall elaborate on this later.

2.2.2 Common Business Library (CBL)

The Common Business Library (CBL), developed by Veo Systems, Inc. [36], is a set of building blocks with common semantics and syntax to ensure interoperability among XML applications. CBL consists of information models for generic business concepts including:

• business description primitives like companies, services, and products,
• business forms like catalogs, purchase orders, and invoices, and
• standard measurements, date and time, location, classification codes.

Specifically, CBL consists of an extensible, public set of XML DTDs and modules. These building blocks can be assembled to create complete XML documents representing a business interaction such as a purchase order or an inventory stock query. Where possible, CBL takes advantage of other standards using, for example, relevant ISO standards for dates, currencies, and names. CBL is closely related to the work of RosettaNet, and the property definitions given by RosettaNet can be referenced by CBL to compose DTDs and modules for various e-commerce transactions, including product descriptions.

To use CBL, an organization starts by creating a CBL document describing its offer and services. Then, it integrates a CBL system with its back-end system by writing custom code that interprets information between the CBL format and the organization's previous format. It is like building a "wrapper" for back-end systems by using CBL blocks. After that, organizations interact on the basis of CBL semantics and syntax.
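The RosettaNet property fields listed in Section 2.2.1 suggest one way a seller's property metadata could feed a set of alternative attribute names. The following is a minimal sketch under stated assumptions, not the RosettaNet data model or the ABECOS implementation: the class `PropertyRecord`, its field names, the lowercase normalization, and the sample values are all hypothetical and chosen only to mirror the fields named in the text (Property Name with its Abbreviation and Acronym subfields, Synonym, Property Definition).

```python
from dataclasses import dataclass, field


@dataclass
class PropertyRecord:
    """Hypothetical record mirroring some of the property fields named in the text."""
    name: str                                           # Property Name
    abbreviations: list = field(default_factory=list)   # subfield of Property Name
    acronyms: list = field(default_factory=list)        # subfield of Property Name
    synonyms: list = field(default_factory=list)        # Synonym field
    definition: str = ""                                # Property Definition

    def synonym_set(self) -> frozenset:
        """Collect every alternative name of the property into one set.

        Lowercasing and trimming are assumptions of this sketch, made so that
        names from different sellers can be compared by plain equality.
        """
        names = [self.name, *self.abbreviations, *self.acronyms, *self.synonyms]
        return frozenset(n.strip().lower() for n in names if n.strip())


# Illustrative values only; they are not taken from the RosettaNet specification.
cpu = PropertyRecord(
    name="Central Processor Unit",
    acronyms=["CPU"],
    synonyms=["Processor"],
)
cpu_names = cpu.synonym_set()
```

Keeping the alternative names in a flat set, rather than in a correspondence table, means that a change to one property's names touches only that property's record.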

3 PROBLEM FORMULATION

3.1 Synonym Set

We use synonym sets to represent the metadata of product schemas. A synonym set is a set of alternative names for an attribute. These alternative names include synonyms, abbreviations, and acronyms for the attribute. It is different from a value set, which is the domain (or set of possible values) of the attribute. For instance, {type, sort, category, class} is a synonym set representing the real world conceptualization of the genre of a music CD, whose value set may be {rock, classical, jazz, country, Christian, R&B, misc}. The synonym set refers to attributes that represent the genre of a music CD record while the value set contains the possible values of that attribute. In this way, the metadata of a product schema is represented as a set of synonym sets, which we call a description. For example, ({artist, singer}, {album, title}, {publisher, company}) is a description of music CD records. In ABECOS, when sellers describe local product schemas, they not only give the attribute names they are using, but also provide synonym sets as metadata to facilitate the understanding of descriptions by other sellers.

We say two synonym sets are semantically coherent if their intersection is not empty. Here, note that semantic coherence does not mean semantic equivalence; it only implies a close semantic relationship. The size of the intersection of two synonym sets is an indicator of the degree of proximity between them. We observe that semantically coherent synonym sets probably refer conceptually to the "same attribute." For instance, synonym sets {type, model, sort, category, suit, genus, genre, class, classification} and {sort, genre, style, class, classification, group, make, lot, type, kind} are semantically coherent. However, it is still possible that some semantically coherent synonym sets do not refer to the same attribute, such as in the case of homonyms. To handle this situation, we consider other issues like the degree of coherence of the synonym sets. We shall elaborate on these issues when we formally define the integration problem.

The approach of using synonym sets to represent the metadata of local product schemas is simple and flexible. There is no complicated knowledge or context information to be stored in knowledge bases. Generally, sellers are familiar with the terminology of the products they sell. Thus, they are able to provide synonym sets easily and accurately. When there are updates to local product schemas, such as adding or removing synonym sets or attribute names, sellers only need to change the synonym sets. Any change to one synonym set does not lead to additional changes in other synonym sets. Furthermore, the correspondence among local product schemas can be identified by examining the intersection, if any, of their synonym sets. In this way, the semantic identification process is transformed into a process of scalable and fully automated synonym set matching.

To further illustrate the use of synonym sets in describing products, we present an example of PC product descriptions, which is more complicated than the previous example of music CD product descriptions. These product schemas are abstracted and synthesized from the websites of some major computer vendors. As shown in Table 2, there are four different PC product descriptions in four separate small tables $D_1$, $D_2$, $D_3$, and $D_4$, respectively. Each cell of the small tables is a synonym set for a certain attribute. For example, synonym set {memory, main memory} in table $D_1$ represents the memory of a PC product while synonym set {memory, RAM} in table $D_2$ represents the same attribute of the product. In addition, we note that the numbers of synonym sets in these four tables are also different.

3.2 Problem Definition

Suppose the schema of a product is $D_i = \{A_{i,1}, A_{i,2}, \ldots, A_{i,t_i}\}$ ($1 \le i \le n$, $t_i \ge 1$), where $A_{i,j}$ is the synonym set for the $j$th product attribute, and $t_i$ is the number of synonym sets used by vendor $i$. We make a few observations. First, the number of synonym sets employed by each seller may differ due to the different number of attributes found in the local product schemas. Second, synonym sets from the same seller may contain common attribute names because an attribute name may have multiple semantics in different contexts. Last, attribute synonym sets from different sellers may contain common attribute names because the sellers may choose the same attribute name for an attribute. We formally express these observations below: For all $1 \le i, j \le n$, $i \ne j$, we have

1. $|t_i - t_j| \ge 0$;
2. $|A_{i,u} \cap A_{i,v}| \ge 0$ for $1 \le u, v \le t_i$;
3. $|A_{i,u} \cap A_{j,v}| \ge 0$ for $1 \le u \le t_i$, $1 \le v \le t_j$.

The problem that we address may be informally described as follows: Given a set of heterogeneous descriptions $D_1, D_2, \ldots, D_n$ by $n$ sellers for the same product, we wish to find a description $D = \{I_1, I_2, \ldots, I_m\}$ ($m \ge t_i$, $1 \le i \le n$) for the product that integrates all the different schemas from the sellers. Here, $I_k$, $1 \le k \le m$, is an integrated synonym set for a certain attribute. Specifically, $I_k$ is an intersection of some semantically coherent synonym sets belonging to different sellers. $I_k$ is derived from the $D_i$'s as follows: We first define a column $P_k$ of $U = \bigcup_{i=1}^{n} D_i$ to be a subset of $U$ such that $|P_k \cap D_i| \in \{0, 1\}$ for $1 \le i \le n$ and $\bigcap_{B \in P_k} B \ne \emptyset$. All synonym sets in one column are assumed to be semantically coherent and at most one synonym set from each $D_i$ may be included in the column. Then, $I_k$ ($1 \le k \le m$) is defined as the intersection of the synonym sets in $P_k$, i.e., $I_k = \bigcap_{B \in P_k} B$.

Therefore, to find an integrated description $D$ is to find a set of columns, each corresponding to a group of semantically coherent synonym sets. Using the concept of partitions in set theory, the set of columns is a partition of $U$. We denote it as $P = \{P_1, P_2, \ldots, P_\ell\}$ ($\ell \ge m$). However, such a partition may not be unique because of the three observations mentioned previously. We introduce two notions for characterizing a partition: A partition is correct if synonym sets within each column of the partition have nonempty intersection. A partition is good if synonym sets within each column of the partition have a high degree of semantic coherence. In general, we want to find partitions that are correct and have a high degree of goodness. Let us examine the factors that determine the goodness of a partition.
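Before turning to those factors, the notions of semantic coherence, column, and integrated synonym set from Sections 3.1 and 3.2 can be made concrete with a short Python sketch. It is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, synonym sets are modeled as `frozenset` values, a description is a list of synonym sets, and the second description `D2` is invented for the example (only `D1` comes from the music CD description in the text).

```python
def coherent(a, b):
    """Two synonym sets are semantically coherent if they share at least one name."""
    return bool(a & b)


def integrated_set(column):
    """I_k: the intersection of all synonym sets in a non-empty column.

    The column must hold frozenset objects, since frozenset.intersection
    is called on its elements.
    """
    return frozenset.intersection(*column)


def is_column(column, descriptions):
    """Check the two column conditions from Section 3.2:
    at most one synonym set from each description D_i, and a
    non-empty common intersection of the chosen synonym sets.
    """
    for d in descriptions:
        if sum(1 for a in column if a in d) > 1:
            return False
    return len(column) > 0 and len(integrated_set(column)) > 0


# D1 is the music CD description from the text; D2 is a hypothetical second seller.
D1 = [frozenset({"artist", "singer"}),
      frozenset({"album", "title"}),
      frozenset({"publisher", "company"})]
D2 = [frozenset({"singer", "vocalist"}),
      frozenset({"title", "name"}),
      frozenset({"year"})]

singer_column = [D1[0], D2[0]]
singer_names = integrated_set(singer_column)
```

Here `singer_column` pairs one synonym set from each seller, and its integrated synonym set keeps only the names both sellers agree on. A synonym set with no coherent partner, such as the one containing only `year`, would form a column by itself.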

Fig. 2. Steps of Product Schema Integration.

Column Similarity. This measures the degree of similarity among synonym sets in the column. Let $P_k = \{A_1, A_2, \ldots, A_{k_m}\}$ for $1 \le k_m \le n$. Let $W = \bigcup_{A \in P_k} A$ such that $|W| = w$. We define a $w$-dimensional vector

$$\vec{W} = (a_1, a_2, \ldots, a_w),$$

where $a_u \in W$ for $1 \le u \le w$. Then, a $w$-dimensional vector $C_v$ is derived from $A_v$ ($1 \le v \le k_m$) as follows: Let $C_v(u)$ ($1 \le u \le w$) be the $u$th component of vector $C_v$. Then, $C_v(u) = 1$ if $\vec{W}(u) \in A_v$; $C_v(u) = 0$ otherwise. Let $C_{P_k}$ be the mean vector of all synonym sets in $P_k$, i.e., $C_{P_k} = \sum_{v=1}^{k_m} C_v / k_m$. The similarity of a column $P_k$ is given by $\sum_{v=1}^{|P_k|} \sigma(C_v, C_{P_k}) / |P_k|$, where function $\sigma$ is defined as follows: For vectors $C_X = (x_1, x_2, \ldots, x_w)$ and $C_Y = (y_1, y_2, \ldots, y_w)$,

$$\sigma(C_X, C_Y) = \frac{\sum_{u=1}^{w} x_u y_u}{\sqrt{\sum_{u=1}^{w} x_u^2 \cdot \sum_{u=1}^{w} y_u^2}}.$$

This function is generally used to compute vector similarity in information retrieval [29]. From its definition, we have $0 \le \sigma(C_X, C_Y) \le 1$. Since each vector $C_v$ has a similarity value with $C_{P_k}$, the mean of these similarities is the column similarity of $P_k$, i.e., the similarity of all the synonym sets in $P_k$. The more similar the synonym sets in $P_k$ are, the larger is the value of column similarity; when all the synonym sets are identical, the column similarity is maximum.

Column Popularity. This measures the extent to which the column is contributed by each seller. The popularity of a column $P_k$ is given by the ratio $0 \le (|P_k| - 1)/(n - 1) \le 1$. By definition, the number of synonym sets in $P_k$ lies between 1 and $n$. When $|P_k| = n$, the number of sellers, the popularity of $P_k$ is maximum. When $|P_k| = 1$, the attribute referred to by $P_k$ is used by only one seller. We observe that a popular column contributes to a better partition.

Column Evenness. This measures the extent to which synonym sets in $P_k$ are semantically coherent. The evenness of a column $P_k$ is given by

$$0 \le \frac{\left|\bigcap_{A \in P_k} A\right|}{|P_k|} \sum_{A \in P_k} \frac{1}{|A|} \le 1.$$

For an arbitrary synonym set $A$ in $P_k$, $\left|\bigcap_{B \in P_k} B\right| / |A|$ is the degree of semantic coherence between $A$ and the other synonym sets. When the alternative attribute names of $A$ are all employed by the other synonym sets in the same column (the ideal case), $\left|\bigcap_{B \in P_k} B\right| / |A|$ has a maximal value of 1. When no other synonym sets employ any common words in $A$, $\left|\bigcap_{B \in P_k} B\right| / |A|$ has a minimal value of 0. Let this value be the coherence scale of $A$. Then, the arithmetic mean of the coherence scale of synonym sets in $P_k$ is the evenness of $P_k$. We note that its value also lies between 0 and 1. The higher the semantic coherence among the synonym sets, the higher is the evenness.

The combination of similarity, popularity, and evenness yields the following measure:

$$\varphi(P_k) = h_1 \frac{\sum_{v=1}^{|P_k|} \sigma(C_v, C_{P_k})}{|P_k|} + h_2 \frac{|P_k| - 1}{n - 1} + h_3 \frac{\left|\bigcap_{A \in P_k} A\right|}{|P_k|} \sum_{A \in P_k} \frac{1}{|A|}$$

for computing the goodness of $P_k$, where the weights $h_1, h_2, h_3$ sum to 1. For simplicity, we may give equal importance to the three measures by having equal values for the weights. For the entire partition $P$, the following function:

$$\Phi(P) = \frac{1}{\ell} \sum_{k=1}^{\ell} \varphi(P_k)$$

is an evaluation of the goodness of the partition. Since $0 \le \varphi(P_k) \le 1$, we have $0 \le \Phi(P) \le 1$. In general, we would like $\Phi(P)$ to be as large as possible. With the above preliminaries, we are now ready to define the Product Schema Integration Problem. We give the formal definition of the problem as follows:

Definition 1. Given a set of descriptions $D_1, D_2, \ldots, D_n$ ($n \ge 1$), where $D_i = \{A_{i,1}, A_{i,2}, \ldots, A_{i,t_i}\}$ ($1 \le i \le n$; $t_i \ge 1$) and $A_{i,j}$ ($1 \le j \le t_i$) is a synonym set, find a partition $P = \{P_1, P_2, \ldots, P_\ell\}$ ($\ell \ge \min(t_i)$) of $\bigcup_{i=1}^{n} D_i$ satisfying the following conditions:

1. $\left|\bigcap_{A \in P_k} A\right| \ge 1$ for $1 \le k \le \ell$,
2. $|P_k \cap D_i| \in \{0, 1\}$ for $1 \le i \le n$ and $1 \le k \le \ell$,
3. $P_k \cap P_j = \emptyset$ for $1 \le k, j \le \ell$ and $k \ne j$, and
4. $\Phi(P)$ is maximal.

Once a partition $P$ exists, we may compute the integrated schema as follows: $D = \{I_k \mid I_k = \bigcap_{B \in P_k} B,\ 1 \le k \le \ell\}$, which is trivial compared to the process to compute $P$. Thus, the Product Schema Integration Problem is essentially the problem of finding $P$. Fig. 2 shows the three steps of Product Schema Integration: synonym set matching, partition evaluation, and integrated schema generation. At the end of these three steps, the input local product schemas can be integrated into an output global schema. Note that the Schema Integration Problem defined above is an optimization problem. The difficulty lies in the fact that when we find one partition satisfying the first three conditions of the problem, we do not know whether it is the optimum partition, i.e., whether $\Phi$ is maximum. In order to analyze the complexity of the problem in a more convenient way, we redefine the problem as a decision problem:

TABLE 1 Optimal Partition Resulting from the Schema Integration Process

Definition 2 (Product Schema Integration). Given a number $M$ ($0 \le M \le 1$), and a set of descriptions $D_1, D_2, \ldots, D_n$ ($n \ge 1$), where $D_i = \{A_{i,1}, A_{i,2}, \ldots, A_{i,t_i}\}$ ($1 \le i \le n$; $t_i \ge 1$) and $A_{i,j}$ ($1 \le j \le t_i$) is a synonym set, does there exist a partition $P = \{P_1, P_2, \ldots, P_\ell\}$ ($\ell \ge \min(t_i)$) of $\bigcup_{i=1}^{n} D_i$ satisfying the following conditions:

1. $\left|\bigcap_{A \in P_k} A\right| \ge 1$ for $1 \le k \le \ell$,
2. $|P_k \cap D_i| \in \{0, 1\}$ for $1 \le i \le n$ and $1 \le k \le \ell$,
3. $P_k \cap P_j = \emptyset$ for $1 \le k, j \le \ell$ and $k \ne j$, and
4. $\Phi(P) \ge M$, where $\Phi(P)$ is the goodness of the partition?

3.3 An Example

So far, we have discussed the definition of the Product Schema Integration Problem and the steps to do such an integration. Now, we shall illustrate PSI by integrating the four PC product schemas given in Table 2 in Section 3.1. To perform the Product Schema Integration in this example, we should first find a partition $P$. Table 1 shows such an example of a partition: Synonym sets in the same row refer to the "same" attribute and each row is a column (as defined in Section 3.2).

Note that in $D_1$ and $D_2$, {processor, CPU, chipset} and {chip, processor} are two synonym sets referring to the CPU of a PC. Although these two synonym sets are different, their intersection is not empty. Thus, they are semantically coherent. However, we observe that two synonym sets {type, sort, classification} and {model, serial number, brand} in $D_1$ intersect with synonym set {type, classification, model} in $D_2$. Conceptually, {type, sort, classification} refers to the classification of a PC such as desktop or notebook, and {model, serial number, brand} refers to the brand of the PC such as IBM or Apple. Thus, {type, classification, model} refers to the same attribute of a PC as the former one of the two synonym sets in $D_1$.
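The three column measures of Section 3.2 can be made concrete on the CPU column from the example above. The following Python sketch is illustrative only: the function names, the list-of-sets representation of a column, and the default equal weights are assumptions, synonym sets are assumed non-empty, and n > 1 is assumed so that the popularity ratio is defined.

```python
from math import sqrt


def column_similarity(column):
    """Mean cosine similarity between each synonym set's 0/1 vector and the column's mean vector."""
    vocabulary = sorted(set().union(*column))  # the word set W of the column
    vectors = [[1.0 if word in synonyms else 0.0 for word in vocabulary]
               for synonyms in column]
    mean = [sum(vec[u] for vec in vectors) / len(vectors)
            for u in range(len(vocabulary))]

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / sqrt(sum(a * a for a in x) * sum(b * b for b in y))

    return sum(cosine(vec, mean) for vec in vectors) / len(vectors)


def column_popularity(column, n):
    """(|P_k| - 1) / (n - 1); assumes n > 1 sellers."""
    return (len(column) - 1) / (n - 1)


def column_evenness(column):
    """Mean coherence scale: |common words| / |A| averaged over the synonym sets A."""
    common = set.intersection(*map(set, column))
    return len(common) / len(column) * sum(1 / len(synonyms) for synonyms in column)


def column_goodness(column, n, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three measures; equal weights are an assumed default."""
    h1, h2, h3 = weights
    return (h1 * column_similarity(column)
            + h2 * column_popularity(column, n)
            + h3 * column_evenness(column))


def partition_goodness(partition, n, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Average column goodness over all columns of the partition."""
    return sum(column_goodness(c, n, weights) for c in partition) / len(partition)


# CPU column from the example, with n = 4 sellers in total.
cpu_column = [{"processor", "CPU", "chipset"}, {"chip", "processor"}]
print(column_popularity(cpu_column, 4), column_evenness(cpu_column))
```

For this column the only shared name is "processor", so the evenness is (1/2)(1/3 + 1/2) = 5/12, and with two of four sellers contributing, the popularity is 1/3.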

TABLE 2 Different Descriptions of PCs from Different Sellers
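Once a partition is fixed, conditions 1 to 3 of Definition 1 can be checked mechanically, and the integrated schema follows by intersecting each column. The Python sketch below is a minimal illustration under stated assumptions: the function names and the (seller index, set index) addressing of synonym sets are hypothetical, and the data is only the two-seller fragment of $D_1$ and $D_2$ quoted in Section 3.3, not the full content of Table 2.

```python
def column_sets(descriptions, column):
    """Resolve (seller, index) pairs of one column to their synonym sets."""
    return [set(descriptions[i][j]) for i, j in column]


def satisfies_conditions(descriptions, partition):
    """Check conditions 1-3 of Definition 1 and that every synonym set is placed."""
    seen = set()
    for column in partition:
        if not set.intersection(*column_sets(descriptions, column)):
            return False  # condition 1: the column shares no attribute name
        sellers = [i for i, _ in column]
        if len(sellers) != len(set(sellers)):
            return False  # condition 2: two synonym sets from the same seller
        if seen.intersection(column):
            return False  # condition 3: a synonym set appears in two columns
        seen.update(column)
    everything = {(i, j) for i, d in enumerate(descriptions) for j in range(len(d))}
    return seen == everything


def integrated_schema(descriptions, partition):
    """One integrated attribute per column: the names common to all its synonym sets."""
    return [set.intersection(*column_sets(descriptions, column)) for column in partition]


# Fragment of D1 and D2 from the example (assumed subset of Table 2).
descriptions = [
    [{"processor", "CPU", "chipset"},
     {"type", "sort", "classification"},
     {"model", "serial number", "brand"}],
    [{"chip", "processor"},
     {"type", "classification", "model"}],
]
partition = [[(0, 0), (1, 0)], [(0, 1), (1, 1)], [(0, 2)]]
print(satisfies_conditions(descriptions, partition))
print(integrated_schema(descriptions, partition))
```

With only two sellers the type column keeps both "type" and "classification"; intersecting over more sellers narrows such a column further.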

From Table 1, we compute the intersections of the synonym sets in each row and derive an integrated description

D = {{type}, {processor}, {model}, {memory}, {cache}, {CD-Rom}, {hard drive}, {floppy drive, f-drive}, {modem}, {bays}, {slots}, {mouse}, {ports}, {keyboard}, {chassis}, {warranty}, {weight}, {color}}.

4 COMPLEXITY OF PSI

The problem of integrating heterogeneous product schemas is originally an optimization problem (as defined in Definition 1). The difficulty comes from the large number of possible candidate solutions (partitions). In the worst case, the intersection of one synonym set and any synonym set from the other sellers can be nonempty. We shall analyze the complexity of the problem in this case.

Suppose there are $n$ sellers each having $m$ synonym sets and the size of each column in any partition is $n$. Then, let us compute the number of possible ways to construct a solution (partition): To constitute the first column in one partition, there are $m^n$ possible ways since we can select one synonym set from each seller. For the second column, there are $(m-1)^n$ ways since each seller now has $m - 1$ synonym sets after the first selection. In this way, we have $m^n (m-1)^n \cdots 2^n 1^n$, i.e., $(m!)^n$ ways to construct one candidate partition.

However, the actual number of candidate partitions is less than $(m!)^n$ because some of these combinations result in the same partition; a partition is only a combination of columns, not a permutation of them. As the number of permutations of columns is $m!$, the number of possible partitions is thus $(m!)^n / m!$, i.e., $(m!)^{n-1}$. According to Stirling's series, we have

$$m! = \sqrt{2\pi m} \left(\frac{m}{e}\right)^{m} e^{\lambda(m)},$$

where

$$\lambda(m) = \frac{1}{12m} - \frac{1}{360m^3} + \frac{1}{1260m^5} - \cdots.$$

Clearly, $(m!)^{n-1}$ is an exponential function of $n$. In the general case when the size of each column varies between 1 and $n$, the complexity of the problem is far higher.

In order to analyze the complexity of PSI, we have redefined it as a decision problem in Definition 2. In the rest of this section, we shall prove that PSI is NP-complete. To show that PSI is NP-complete, we first define a version of PSI called UPSI and show that it is NP-complete. Then, we reduce UPSI to PSI (Lemma 3). In order to show that UPSI is NP-complete, we show that n-dimensional matching (nDM) reduces to UPSI (Lemma 2). The complexity results for nDM are presented in the Appendix. We begin by showing that PSI is in NP in the following lemma:

Lemma 1. PSI $\in$ NP.

Proof. We can solve PSI with a nondeterministic algorithm: For all the possible partitions of $\bigcup_{i=1}^{n} D_i$, compute the value of $\Phi$ to see whether it satisfies the requirements given in the definition. It is obvious that the time complexity for checking each of the first three requirements in the definition is polynomial. For the fourth requirement, suppose there are $n$ sellers and the average number of synonym sets of each seller is $m \ge 1$. Then, for a partition $P$, $\max(|P|) = n \cdot m$, and for each column of $P$, the time complexity of computing function $\varphi$ in all the cases is $O(n^3)$. Thus, the time complexity to compute $\Phi(P)$ is $O(mn^4)$. This means that PSI can be solved in polynomial time by a nondeterministic algorithm. □

Next, we define UPSI (Uniform Product Schema Integration problem), a subproblem of PSI, and show that it is NP-complete:

Definition 3 (UPSI). Given a number $M$ ($0 \le M \le 1$), and a set of descriptions $D_1, D_2, \ldots, D_n$ ($n \ge 1$), where $D_i = \{A_{i,1}, A_{i,2}, \ldots, A_{i,m}\}$ ($1 \le i \le n$, $m \ge 1$) and $A_{i,j}$ ($1 \le j \le m$) is a synonym set, does there exist a partition $P = \{P_1, P_2, \ldots, P_m\}$ of $\bigcup_{i=1}^{n} D_i$ satisfying the following conditions:

1. $|P_k| = n$ for $1 \le k \le m$,
2. $\left|\bigcap_{A \in P_k} A\right| \ge 1$ for $1 \le k \le m$,
3. $|P_k \cap D_i| \in \{0, 1\}$ for $1 \le i \le n$ and $1 \le k \le m$,
4. $P_k \cap P_j = \emptyset$ for $1 \le k, j \le m$ and $k \ne j$, and
5. $\Phi(P) \ge M$?

Lemma 2. UPSI $\in$ NP-complete.

Proof. It is easy to see that UPSI $\in$ NP since a nondeterministic algorithm need only guess a candidate partition and check in polynomial time whether that partition satisfies all the given requirements.

We transform nDM to UPSI. First, we consider an instance of nDM. Let $W_1, W_2, \ldots, W_n$ be $n$ disjoint sets each having $q$ elements such that $W_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,q}\}$ for $1 \le i \le n$. Let n-tuple $(w_{1,i}, w_{2,j}, \ldots, w_{n,k})$, $1 \le i, j, k \le q$, be an arbitrary element of the set $S \subseteq W_1 \times W_2 \times \cdots \times W_n$. We shall construct $n$ sets $D_1, D_2, \ldots, D_n$ in the corresponding instance of UPSI. Suppose $S = (s_1, s_2, \ldots, s_m)$, $m \ge 1$. We use a one-to-one function $\mu$ to assign a distinguished value to each element of $S$, such that $\mu(s_i) = y_i$. Note that $y_i \ne w_{j,k}$, $1 \le i \le m$, $1 \le j \le n$, and $1 \le k \le q$. Set $D_\ell$ ($1 \le \ell \le n$) contains $q$ elements $A_{\ell,i}$ ($1 \le i \le q$), each constructed as follows: $A_{\ell,i} = \{w_{\ell,i}\} \cup Y$, where $Y$ contains $y_j$ ($1 \le j \le m$) if and only if $w_{\ell,i}$ is the $\ell$th element of $s_j$; $Y = \emptyset$ if $w_{\ell,i}$ is not an element of any n-tuple $s_j$. For the other parameter $M$ in the instance of UPSI, we let $M = 0$. Thus, we have constructed an instance of UPSI from the instance of nDM. We shall prove that each of these two instances is satisfiable if and only if the other one is satisfiable.

If the instance of nDM is satisfiable, then there exists a set $S' \subseteq S$ satisfying the following conditions:

1. $|S'| = q$,
2. $\forall s_1 = (w_{1,i}, w_{2,j}, \ldots, w_{n,k})$ and $s_2 = (w_{1,i'}, w_{2,j'}, \ldots, w_{n,k'}) \in S'$ for $1 \le i, j, k, i', j', k' \le q$, if $s_1 \ne s_2$, then $w_{1,i} \ne w_{1,i'}$, $w_{2,j} \ne w_{2,j'}, \ldots, w_{n,k} \ne w_{n,k'}$.

For each n-tuple $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}) \in S'$ there is a corresponding set $P_t = \{A_{1,i}, A_{2,j}, \ldots, A_{n,k}\}$, $1 \le t \le q$, having the following properties:

1. The first elements of $A_{1,i}, A_{2,j}, \ldots, A_{n,k}$ are $w_{1,i}, w_{2,j}, \ldots, w_{n,k}$, respectively.
2. If $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}) = s_r$, $1 \le r \le m$, then according to the construction of $D_1, D_2, \ldots, D_n$, $\{y_r\} \subseteq \bigcap_{A \in P_t} A$.
3. $\{y_r\} = \bigcap_{A \in P_t} A$ because of the second condition satisfied by $S'$ (if there is $y_{r'} \in \bigcap_{A \in P_t} A$, then there is an n-tuple $s_{r'} \in S$ such that $s_{r'} = s_r$, which conflicts with the second condition satisfied by $S'$).

Since $|S'| = q$, there exist $q$ such sets $P_1, P_2, \ldots, P_q$ comprising a partition $P = \{P_1, P_2, \ldots, P_q\}$ of $D_1 \cup D_2 \cup \cdots \cup D_n$, satisfying the following conditions:

1. $|P_k| = n$ for $1 \le k \le q$,
2. $\left|\bigcap_{A \in P_k} A\right| = 1$ for $1 \le k \le q$,
3. $|P_k \cap D_i| = 1$ for $1 \le i \le n$ and $1 \le k \le q$,
4. $P_k \cap P_j = \emptyset$ for $1 \le k, j \le q$ and $k \ne j$, and
5. $\Phi(P) > 0$.

The first condition is satisfied because $P_t = \{A_{1,i}, A_{2,j}, \ldots, A_{n,k}\}$ for $1 \le t \le q$. The second and fourth conditions are satisfied from the previous description of the third property of $P_t$. The third condition is obviously satisfied from the construction of $P_t$. Because the second condition is satisfied, we have $\varphi(P_t) > 0$ and $\Phi(P) > 0$. Then, the fifth condition is also satisfied. Therefore, the instance of UPSI is satisfiable.

Conversely, suppose the instance of UPSI is satisfiable, i.e., there exists a partition $P = \{P_1, P_2, \ldots, P_q\}$ of $D_1 \cup D_2 \cup \cdots \cup D_n$ satisfying the five conditions given by UPSI's definition. Let $P_t = \{A_{1,i}, A_{2,j}, \ldots, A_{n,k}\}$, $1 \le t \le q$. Then, the first elements of $A_{1,i}, A_{2,j}, \ldots, A_{n,k}$, which are $w_{1,i}, w_{2,j}, \ldots, w_{n,k}$, construct an n-tuple

$$(w_{1,i}, w_{2,j}, \ldots, w_{n,k}) \in W_1 \times W_2 \times \cdots \times W_n.$$

Consequently, there are $q$ such n-tuples, and they compose a set $S'$. Since $D_1, D_2, \ldots, D_n$ are derived from $S \subseteq W_1 \times W_2 \times \cdots \times W_n$, it is not difficult to see that $S' \subseteq S$. The set $S'$ has the following properties:

1. $|S'| = q$,
2. $\forall s_1 = (w_{1,i}, w_{2,j}, \ldots, w_{n,k})$ and $s_2 = (w_{1,i'}, w_{2,j'}, \ldots, w_{n,k'}) \in S'$, $1 \le i, j, k, i', j', k' \le q$, if $s_1 \ne s_2$, then $w_{1,i} \ne w_{1,i'}$, $w_{2,j} \ne w_{2,j'}, \ldots, w_{n,k} \ne w_{n,k'}$.

The first property is obvious. As to the second one, we suppose that $s_1 \ne s_2$, but $w_{z,x} = w_{z,x'}$, $1 \le z \le n$, $1 \le x, x' \le q$. However, $s_1$ is derived from $P_t = (A_{1,i}, A_{2,j}, \ldots, A_{z,x}, \ldots, A_{n,k})$, and $s_2$ is derived from $P_{t'} = (A_{1,i'}, A_{2,j'}, \ldots, A_{z,x'}, \ldots, A_{n,k'})$, $1 \le t, t' \le q$. Thus, the first elements of $A_{z,x}$ and $A_{z,x'}$ are $w_{z,x}$ and $w_{z,x'}$ respectively, which are the same. This tells us that $A_{z,x} = A_{z,x'}$, which conflicts with the condition that $P_t \cap P_{t'} = \emptyset$. Therefore, the second property of $S'$ is valid. Now, it is clear that $S'$ is the solution to the instance of nDM.

To see that the transformation from the instance of nDM to the instance of UPSI can be performed in polynomial time, it suffices to observe that the number of synonym sets in the instance of UPSI is bounded by a polynomial in $q$, and the details of the construction itself are straightforward. □

Finally, we show that UPSI reduces to PSI:

Lemma 3. UPSI $\propto$ PSI.

Proof. We define a transformation from each instance of UPSI to a corresponding instance of PSI. Suppose there is an arbitrary instance of UPSI: $M = q$, $n$ descriptions, and each has $m$ element sets. The corresponding instance of PSI also has $n$ descriptions identical to those in UPSI, i.e., $t_i = m$, $1 \le i \le n$. The number $M$ in PSI is now $q$.

It is easy to see that this transformation can be computed by a polynomial time algorithm. Each of the $n$ descriptions that must be specified in PSI is exactly the same as the corresponding description in UPSI, and the value of $M$ is also the same.

Consequently, if the answer to UPSI is "yes," i.e., there exists a partition satisfying the five requirements in the definition of UPSI, this partition is also a solution to PSI. Thus, the answer to PSI should also be "yes." If the answer to PSI is "no," i.e., there is no partition satisfying the requirements in the definition of PSI, it is also true that the answer to UPSI is "no." □

Since PSI is in NP (Lemma 1) and it is NP-hard (Lemma 3), PSI is NP-complete:

Theorem 1. PSI $\in$ NP-complete.

5 ALGORITHMS FOR PSI

Since PSI is NP-complete, it is not likely that good polynomial time deterministic algorithms can be found. Thus, we focus on finding approximate algorithms that are acceptable with respect to both time complexity and result accuracy. Figs. 3, 4, and 5 show three approximation algorithms for PSI.

5.1 Approximation Algorithms

Algorithm I. The following algorithm is repeated (first loop, index $i$) for each seller $D_i$: For each synonym set (second loop, index $k$) of the seller, we select the most semantically coherent (having the largest intersection) synonym set (fourth loop, index $u$) from each of the other sellers' descriptions (third loop, index $j$); these semantically coherent synonym sets constitute a column $P_k$. That is, $A_{i,k}$ ($1 \le k \le t_i$) of $D_i$ ($1 \le i \le n$) is compared with $A_{j,u}$ ($1 \le u \le t_j$) of $D_j$ ($1 \le j \le n$) by evaluating the size of their intersection (fourth loop).

Fig. 3. Algorithm I for PSI.

Fig. 4. Algorithm II for PSI. Fig. 5. Algorithm III for PSI.

Each iteration of the second loop results in a column, and these columns constitute partition $P_i$ at the end of each iteration of the first loop. This results in $n$ candidate partitions. The final step in the algorithm chooses the partition having the largest goodness value.

When determining the time complexity of the above algorithm, we make the following two assumptions:

• The average value of $t_i$ ($1 \le i \le n$), the number of synonym sets in each seller, is $m$ ($m \ge 1$).
• The time spent on matching any two synonym sets is constant.

Consider the fourth loop, which iterates through every synonym set of a seller. Based on the first assumption, the number of iterations is $m$. However, when a synonym set has been selected to join a column $P_k$ (of a synonym set in the second loop), it is removed from consideration in subsequent iterations. Hence, the number of iterations of the fourth loop decreases: $m, m - 1, \ldots, 1$ as we iterate through each synonym set of the second loop.

As there are $n - 1$ other sellers to match against for a given seller, the third loop iterates $n - 1$ times. Last, the first and second loops iterate $n$ and $m$ times, respectively. Therefore, the complexity of the whole algorithm is

$$n(n-1) \sum_{g=1}^{m} g = n(n-1)\,m(m+1)/2 = O(n^2 m^2).$$

Note that the time complexity in the best, worst, and average cases is the same since the number of comparisons in these three cases remains the same.

In reality, we expect the complexity of the algorithm to be way below $O(n^2 m^2)$ because the average number of attributes to describe a product is not very large, say around 50 at most. Thus, the algorithm reduces to $O(n^2)$.

Algorithm II. Fig. 4 shows Algorithm II. For PSI, the difference between Algorithms I and II is in the fourth loop: In Algorithm I, we compute the size of the intersection of synonym sets $A_{j,u}$ and $A_{i,k}$, while in Algorithm II, we compute the size of the intersection of synonym set $A_{j,u}$ and the union of all the existing synonym sets already in $P_k$. In other words, in Algorithm I, we select the synonym set having the largest intersection with $A_{i,k}$ to join the column $P_k$, while in Algorithm II, we select the synonym set having the largest intersection with the union of all existing synonym sets in $P_k$ to join $P_k$.

Based on the second assumption made in computing the time complexity of Algorithm I, we know that the time complexity of Algorithm II is the same as Algorithm I since the time to compute the union of the synonym sets in column $P_k$ can be treated as constant. Hence, the time complexity of Algorithm II is also $O(n^2 m^2)$, as we can see from the algorithms themselves.

Algorithm III. Fig. 5 shows Algorithm III. For PSI, Algorithm III is almost the same as Algorithm II except for one operation in the fourth loop: In Algorithm III, we compute the size of the intersection of synonym set $A_{j,u}$ and the intersection of all the existing synonym sets already in column $P_k$, while in Algorithm II, we compute the size of the intersection of synonym set $A_{j,u}$ and the union of all the existing synonym sets already in column $P_k$. Algorithm III seems to be more "strict" in constructing a column; it ensures that the evenness of the solution is larger than zero. The time complexity of Algorithm III is also $O(n^2 m^2)$ since it has a structure similar to Algorithms I and II.

Therefore, the time complexity of the three approximation algorithms is the same. The difference among them is their performance, i.e., the goodness of the resulting partitions. We shall elaborate on this in the next section.

5.2 Performance Comparison

The performance of each algorithm on different input product descriptions with respect to different parameters is shown in Figs. 6, 7, 8, and 9. We plot four sets of graphs where the vertical axes are goodness, similarity, popularity, and evenness of a solution. In each set, we use each of the following parameters in turn as the horizontal axis:

p: The average percentage of input overlap, i.e., the average percentage of synonym set overlap within a seller.
n: The number of input descriptions, i.e., the number of sellers.
m: The average size of each description, i.e., the average number of synonym sets in each description.
w: The average size of each synonym set.

We created synthetic descriptions of PCs as inputs for our experiments. For realism, we refer to actual PC descriptions from websites of PC sellers when generating synthetic

Fig. 6. Goodness of solution versus different values of percentage of input overlap, number of input descriptions, average size of input descriptions, and average size of each synonym set. (a) n = 10, m = 20, w = 5, (b) p = 10%, m = 20, w = 5, (c) p = 10%, n = 10, w = 5, and (d) p = 10%, n = 10, m = 20.

Fig. 7. Similarity of solution versus different values of percentage of input overlap, number of input descriptions, average size of input descriptions, and average size of each synonym set. (a) n = 10, m = 20, w = 5, (b) p = 10%, m = 20, w = 5, (c) p = 10%, n = 10, w = 5, and (d) p = 10%, n = 10, m = 20.

Fig. 8. Popularity of solution versus different values of percentage of input overlap, number of input descriptions, average size of input descriptions, and average size of each synonym set. (a) n = 10, m = 20, w = 5, (b) p = 10%, m = 20, w = 5, (c) p = 10%, n = 10, w = 5, and (d) p = 10%, n = 10, m = 20.

descriptions. We arrange the PC descriptions according to different values of the four parameters above. We let p = 10%, n = 10, m = 20, and w = 5 be the average case for the input descriptions. We make the following observations in the experiments:

1. The goodness of each solution generated by each algorithm increases with m when m is larger than 10, while it decreases when p increases. Parameters n and w do not have direct or inverse effects on the goodness of solutions (Fig. 6).
2. The similarity of each solution generated by each algorithm increases with m, while it decreases when p increases. Parameters n and w do not have direct or inverse effects on the similarity of solutions (Fig. 7).
3. The popularity of each solution generated by each algorithm increases with w. Parameters p, n, and m do not have direct or inverse effects on the popularity of solutions (Fig. 8).
4. The evenness of each solution generated by each algorithm increases with m, while it decreases when p or w increases. Parameter n does not have direct or inverse effects on the evenness of solutions (Fig. 9).

With regard to Observation 1 (Fig. 6), we note that:

• Algorithm I yields the best performance among the three algorithms when n is larger than 20; the goodness of the partition generated by Algorithm I in this case is the highest among the three. Algorithm I gives the second best performance in the other situations.
• Algorithm II always yields the worst performance in all situations. This is a result of the way it constructs the columns for product schemas. Since it chooses the most semantically coherent synonym sets with the union of the synonym sets already in the column, it is more likely that the evenness of the columns is zero, i.e., not all the synonym sets in one column intersect. This reduces the goodness of the solution.
• Algorithm III yields the best performance in all situations. This is a result of the way it constructs columns for product schemas. Algorithm III is "strict" in selecting synonym sets for a column (see Section 5.1); it selects the synonym set having the largest intersection with the intersection, instead of the union, of the existing synonym sets in the column. Thus, it selects synonym sets in a narrower range but this ensures the evenness of the partition to be larger than zero.

With regard to Observation 2 (Fig. 7), we note that:

• Algorithms I, II, and III have quite similar performance, i.e., similar values of similarity.
• Algorithm I has the best performance, i.e., the highest similarity, no matter what values n and m have. It also gives the best performance when p is smaller than 10 percent or when w is larger than five.
• Algorithm II always has the worst performance in all situations because it selects the synonym set having

Fig. 9. Evenness of solution versus different values of percentage of input overlap, number of input descriptions, average size of input descriptions, and average size of each synonym set. (a) n = 10, m = 20, w = 5, (b) p = 10%, m = 20, w = 5, (c) p = 10%, n = 10, w = 5, and (d) p = 10%, n = 10, m = 20.

the largest union instead of intersection with the existing synonym sets in the column. This results in columns having more diverse synonym sets, which reduces the similarity of the whole partition.

• Algorithm III has the best performance for all values of n. It also has the best performance when p is larger than 10 percent and when w is smaller than five.

With regard to Observation 3 (Fig. 8), we note that:

• Algorithm I yields the best performance, i.e., the highest popularity, when p is larger than 20 percent, or when w is larger than nine. It gives the worst performance when p is less than 10 percent, when n is larger than 10, or when w is between five and eight.
• Algorithm II gives the best performance when m is smaller than 15, when n is larger than 20, when w is smaller than four, or when w is larger than nine. It gives the worst performance when p is larger than 10 percent.
• Algorithm III gives the best performance when p is between 2 percent and 20 percent, when n is larger than two, when m is larger than 17, or when w is between five and eight. It gives the worst performance when m is smaller than 17, when w is smaller than four, or when w is larger than nine.

With regard to Observation 4 (Fig. 9), we note that:

• In all the situations, Algorithm III has the best performance, i.e., the highest evenness, while Algorithm II has the worst and Algorithm I has the second best.

In summary, we may make the following conclusions from the experiments:

• Algorithm III has the best performance in most cases. It is capable of solving PSI and it maximizes the evenness of the solutions. Algorithms I and II maximize the similarity and popularity of the solutions to some extent, but not as much as Algorithm III does for the evenness of the solutions. We should thus adopt Algorithm III in our implementation of directory agents in our ABECOS system.
• Parameter p, the overlap of synonym sets within one description, inversely affects the performance of each algorithm in all the cases except for the popularity of the solutions. Thus, we should reduce p as much as possible.
• Parameter n, the number of sellers, does not have any direct or inverse effects on the performance of each algorithm in all the cases. Thus, the algorithms are suitable for a large number of sellers.
• Parameter m, the size of the product description, directly affects the performance of each algorithm with respect to the goodness and similarity; in the other cases, it does not affect the performance too much.

• Parameter w, the size of each synonym set, directly affects the performance of each algorithm with respect to the goodness and popularity; it has a direct effect on the performance of each algorithm with respect to the evenness of the solutions. Thus, we should raise w as much as possible.

Therefore, to increase the performance of product schema integration, we have some advice for sellers who provide synonym sets for their product descriptions:

• Sellers should try to avoid using identical attribute names in different synonym sets of a single product description (in order to reduce the value of parameter p). For example, when p is less than 10 percent, the goodness of the integration result is about 72 percent, which is quite good (see Fig. 6). This means that if there is a total of 50 attribute names in use by one seller, the number of attribute names that appear more than once should not be larger than five.
• Sellers are advised to make the size of synonym sets as large as possible (to raise the value of w). For example, when the average size of synonym sets is around eight, the integration result has a very high goodness value (see Fig. 6). However, sometimes it is not easy to provide many synonyms for one attribute name, say CPU. From the experiments, we know that the average size of synonym sets should be at least three in order to secure the performance of integration.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we address the problem (PSI) of integrating heterogeneous product descriptions in an agent-based electronic commerce system. Although several solutions to database schema integration have been proposed in multidatabase research work, PSI is distinct in the E-commerce context for several reasons: the need to overcome limited knowledge about local product databases, the requirement of automated integration, the large number of product schemas, and the evolution of local schemas. This paper describes our preliminary work in defining, analyzing, and solving PSI in the ABECOS commerce system [21], [23], [24], [37], [38]. We adopt synonym sets to represent the semantic information of local product schemas. Thus, the problem is defined as finding the optimum synonym set matching among local product schemas. We also prove that the problem is NP-complete. Experiments show that our approximation algorithm is feasible and computes solutions for PSI.

Future work can be focused on the following issues: First, it is not clear at this moment how complex attribute relationships such as the aggregate relationship can be handled. For example, in the descriptions of PCs, an attribute memory has an aggregate relationship with another two attributes, standard memory and extensible memory. Second, we should consider more complex product schemas such as hierarchical schemas. Attributes in a hierarchical schema are divided into several abstract levels; some attributes may have a set of child attributes. To integrate such schemas, we identify the correspondence of different attribute levels, as well as the attributes within each level. Third, the maintenance of integration results is also an important issue to be considered. To avoid recomputing the global schema from scratch, we should adopt incremental algorithms for adding, deleting, and updating local schemas. Fourth, we should study product schema integration in the context of XML documents. Last, further theoretical investigation on the approximation algorithms will be performed to determine their performance bounds. We will report the results of our work in future papers.

APPENDIX

COMPLEXITY RESULTS

The problem of 3-Dimensional Matching (3DM) is defined as follows [11]:

Definition 4 (3DM). Given a set $S \subseteq W \times X \times Y$, where $W$, $X$, and $Y$ are disjoint sets having the same number $q$ of elements, does there exist a subset $S' \subseteq S$ such that $|S'| = q$ and no two elements of $S'$ agree in any coordinate?

We generalize 3DM to nDM as follows and show that it is NP-complete:

Definition 5 (nDM). Given a set $S \subseteq W_1 \times W_2 \times \cdots \times W_n$, where $W_i$ ($1 \le i \le n$) are disjoint sets having the same number $q$ of elements, does there exist a subset $S' \subseteq S$ such that $|S'| = q$ and no two elements of $S'$ agree in any coordinate?

Theorem 2. nDM $\in$ NP-complete.

Proof. Since 3DM is NP-complete, let us suppose nDM, $n \ge 3$, to be NP-complete. We shall prove that (n+1)DM is also NP-complete under this assumption.

First, we consider an instance of nDM. Let $W_1, W_2, \ldots, W_n$ be $n$ disjoint sets each having $q$ elements, such that $W_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,q}\}$ for $1 \le i \le n$. Let n-tuple $(w_{1,i}, w_{2,j}, \ldots, w_{n,k})$, $1 \le i, j, k \le q$, be an arbitrary element of the set $S \subseteq W_1 \times W_2 \times \cdots \times W_n$. Let $W_1, W_2, \ldots, W_n, W_{n+1}$ be the $n + 1$ disjoint sets in the corresponding instance of (n+1)DM, where $W_1, W_2, \ldots, W_n$ are the same as those in nDM and $W_{n+1} = \{x_1, x_2, \ldots, x_q\}$.

Now, let us construct set $T$, the corresponding set of $S$ in nDM, $T \subseteq W_1 \times W_2 \times \cdots \times W_n \times W_{n+1}$ as follows: If $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}) \in S$ for $1 \le i, j, k \le q$, then the (n+1)-tuples $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}, x_1)$, $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}, x_2), \ldots, (w_{1,i}, w_{2,j}, \ldots, w_{n,k}, x_q)$ are elements of set $T$. We shall prove that each of these two instances is satisfiable if and only if the other one is satisfiable.

If the instance of nDM is satisfiable, then there exists a set $S' \subseteq S$, which satisfies the following conditions:

1. $|S'| = q$,
2. $\forall s_1 = (w_{1,i}, w_{2,j}, \ldots, w_{n,k})$ and $s_2 = (w_{1,i'}, w_{2,j'}, \ldots, w_{n,k'}) \in S'$ for $1 \le i, j, k, i', j', k' \le q$, if $s_1 \ne s_2$, then $w_{1,i} \ne w_{1,i'}$, $w_{2,j} \ne w_{2,j'}, \ldots, w_{n,k} \ne w_{n,k'}$.

For each n-tuple $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}) \in S'$, we extend it to an (n+1)-tuple $(w_{1,i}, w_{2,j}, \ldots, w_{n,k}, x_i)$. This (n+1)-tuple is an element of $T$. All such extended (n+1)-tuples compose a set $T' \subseteq T$, and $T'$ satisfies the following conditions:

1. jT 0jˆq, [10] D. Florescu, A. Levy, and A. Mendelzon, ªDatabase Techniques for the World-Wide Web: A Survey,º ACM Special Interest Group 2. 8t1 ˆ w1;i;w2;j; ...;wn;k;xi† and t2 ˆ w1;i0 ;w2;j0 ; ...; on Management of Data Record CSIGMOD Record '98), vol. 27, no. 3, 0 0 0 0 wn;k0 ;xi0 †2T for 1  i; j; k; i ;j;k  q if t1 6ˆ t2, Sept. 1998. then [11] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. p. 46, W.H. Freeman and Co., w1;i 6ˆ w1;i0 ;w2;j 6ˆ w2;j0 ; ...;wn;k 6ˆ wn;k0 ;xi 6ˆ xi0 : 1979. [12] J.N. Gray, Database Systems: A Textbook Case of Research Paying The first condition is satisfied because jS0jˆq and Off, http://www.cs.washington.edu/homes/lazowska/cra/ database.html, 1999. 0 0 0 jT jˆjS j. From the second condition satisfied by S [13] S. Hayne and S. Ram, ªMulti-User View Integration System and the fact that W1;W2; ...;Wn, Wn‡1 are n ‡ 1 disjoint &MUVIS): An Expert System for View Integration,º Proc. IEEE sets, we can see that the second condition for T 0 is also Sixth Int'l Conf. Data Eng. CICDE '90), pp. 402±409, Feb. 1990. satisfied. Thus, T 0 is the solution of &n+1)DM. [14] M. Hearst, ªTrends and Controversies: Information Integration,º IEEE Intelligent Systems J., pp. 12±24, Sept./Oct. 1998. Conversely, if the instance of &n+1)DM is satisfiable, [15] R.D. Holowczak and W.-S. Li, ªA Survey on Attribute Corre- then there exists a set T 0  T, and T 0 satisfies the spondence and Heterogeneity Metadata Representation,º Proc. following conditions: First IEEE Metadata Conf., Ap. 1996. [16] Internet Content Exchange &ICE), http://www.vignette.com/ 1. jT 0jˆq, Products/ice, 1999. [17] J.A. Larson, S.B. Navathe, and R. Elmasri, ªA Theory of Attribute 2. 8t1 ˆ w1;i;w2;j;;wn;k;x`† and t2 ˆ w1;i0 ;w2;j0 ; ...; Equivalence in Databases with Application to Schema Integration,º 0 0 0 0 0 wn;k0 ;x`0 †2T for 1 i; j; k; `; i ;j;k;` q if t1 6ˆt2, IEEE Trans. Software Eng., vol. 15, no. 4, pp. 449±463, Apr. 1989. 
then w1;i 6ˆw1;i0 ;w2;j 6ˆw2;j0 ; ...;wn;k 6ˆwn;k0 , x` 6ˆ x`0 . [18] D.B. Lenat, ªCyc: A Large-Scale Investment in Knowledge For each n ‡ 1†-tuple w ;w ; ...;w ;x †,wejust Infrastructure,º Comm. ACM, pp. 33±38, Nov. 1995. 1;i 2;j n;k ` [19] W.-S. Li and C. Clifton, ªSemantic Integration in Heterogeneous discard its last element, i.e., x` to make it an n-tuple like Databases Using Neural Networks,º Proc. 20th Int'l Conf. Very w1;i;w2;j; ...;wn;k†2S.Thus,allsuchn-tuples are Large Data Bases, Sept. 1994. composed of a set S0  S, and S0 satisfies the following [20] W.-S. Li and C. Clifton, ªUsing Field Specifications to Determine Attribute Equivalence in Heterogeneous Databases,º Proc. IEEE conditions: Third Int'l Workshop Research Issues on Data Eng.: Interoperability in Multidatabase Systems, pp. 174±177, Apr. 1993. 1. jS0jˆq, [21] E.-P. Lim and W.-K. Ng, ªOverview of the Agent-Based Electronic 2. 8s1 ˆ w1;i;w2;j; ...;wn;k† and s2 ˆ w1;i0 ;w2;j0 ; ...; Commerce &ABECOS) Project,º IEEE Data Eng. Bull., vol. 23, no. 1, 0 0 0 0 pp. 49±54, Mar. 2000. w 0 †2S for 1  i; j; k; i ;j;k  q if s 6ˆ s , then n;k 1 2 [22] S. Navathe, R. Elmasri, and J. Larson, ªIntegrating User Views in w1;i 6ˆ w1;i0 ;w2;j 6ˆ w2;j0 ; ...;wn;k 6ˆ wn;k0 . Database Design,º IEEE Computer, vol. 19, no. 1, pp. 50±62, Jan. The first condition is satisfied because jS0jˆjT 0. Since 1986. the second condition is satisfied by T 0 each n-tuple in S0 [23] W.-K. Ng, G. Yan, and E.-P. Lim, ªHeterogeneous Product 0 Description in Electronic Commerce,º ACM SIGecom Exchanges, is the first n elements of the n ‡ 1†-tuple in T , we can vol. 1, no. 1, pp. 7±13, Aug. 2000. 0 see that the second condition is also satisfied by S . Thus, [24] W.-K. Ng, G. Yan, and E.-P. Lim, ªStandardization and Integration the instance of nDM is satisfiable. in Business-to-Business electronic Commerce,º IEEE Intelligent Therefore, we have proven that, if &n+1)DM is NP- Systems, Jan./Feb. 2001. 
[25] Open Buying on the Internet &OBI), http://www.openbuy.org, complete, then nDM &n  3) is NP-complete. Since 3DM 2000. is NP-complete, from the mathematical induction we [26] W.J. Premerlani and M.R. Blaha, ªAn Approach for Reverse conclude that nDM &n  3) is NP-complete. tu Engineering of Relational Databases,º Comm. ACM, vol. 37, no. 5, pp. 42±49, May 1994. [27] RosettaNet, http://www.rosettanet.org, 1999. REFERENCES [28] RosettaNet Laptop Technical Specification, http://www. rosetta- net. org/general/finished\_projects/laptop. html, May 1998. [1]N. Adam and Y. Yesha, ªStrategic Directions in Electronic [29] G. Salton, Automatic Text Processing, chapter 10, Addison-Wesley Commerce and Digital Libraries: Towards a Digital Agora,º Pub., 1989. ACM Computing Surveys, vol. 28, no. 4, pp. 818±835, Dec. 1996. [30] P. Scheuermann, W.-S. Li, and C. Clifton, ªMultidatabase Query [2] C. Batini, M. Lenzerini, and S.B. Navathe, ªA Comparative Analysis of Methodologies for Database Schema Integration,º Processing with Uncertainty in Global Keys and Attribute ACM Computing Surveys, vol. 18, no. 4, pp. 323±364, Dec. 1986. Values,º J. Am. Soc. for Information Science, vol. 49, no. 3, pp. 283± [3] P. Bernstein, M. Brodie, S. Ceri, and D. DeWitt, The Asilomar 301, 1998. Report on Database Research, Sept. 1998, http://www. [31] A. Sheth and J. Larson, ªFederated Database Systems for Managing research.microsoft.com/Gray/Asilomar_DB_98.html. Distributed Heterogeneous, and Autonomous Databases,º ACM [4] S. Bressan, C.H. Goh, K. Fynn, M. Jakobisiak, K. Hussein, H. Kon, Computing Surveys, vol. 22, no. 3, pp. 183±236, Sept. 1990. T. Lee, S. Madnick, T. Pena, J. Qu, A. Shum, and M. Skegel, ªThe [32] A. Sheth, J. Larson, A. Cornelio, and S.B. Navathe, ªA Tool for Context Interchange Mediator Prototype,º Proc. ACM SIGMOD/ Integrating Conceptual Schemas and User Views,º Proc. IEEE PODS Joint Conf., May 1997. Fourth Int'l Conf. Data Eng. CICDE '88), Feb. 1988. 
[5] Common Business Library &CBL), http://www.veosystems.com/ [33] A. Silberschatz and S. Zdonik, ªStrategic Directions in Database xml/cbl/cbl.html. SystemsÐBreaking Out of the Box,º ACM Computing Surveys, [6] D. Clements, M. Ganesh, S.-Y. Hwang, E.-P. Lim, , K. Mediratta, J. vol. 28, no. 4, pp. 764±778, Dec. 1996. Srivastava, and H.-R. Yang, ªMyriad: Design and Implementation [34] A.Silberschatz,M.Stonebraker,andJ.Ullman,ªDatabase of Federated Database Prototype,º Proc. Int'l Conf. Computer Research: Achievements and Opportunities Into the 21st Cen- Systems and Education, June 1994. tury,º ACM Special Interest Group on Management of Data Record [7] C. Collet, M.N. Huhns, and W.-M. Shen, ªResource Integration CSIGMOD Record '96), vol. 25, no. 1 Mar. 1996. Using a Large Knowledge Base in Carnot,º Computer, vol. 24, [35] J.M. Smith, P.A. Bernstein, U. Dayal, N. Goodman, T. Landers, T. no. 12, pp. 55±62, Dec. 1991. Lin, and E. Wang, ªMultibase-Integrating Heterogeneous Dis- [8] CommerceNet, http://www.commerce.net. 1999. tributed Database Systems,º Proc. Nat'l Computer Conf., pp. 487± [9] The eCo Framework Project, http://www.commerce.net/ 499, 1981. projects/currentprojects/eco, 1999. [36] Veo Systems, Inc., http://www.veosystems.com, 1999. 598 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 3, MAY/JUNE 2002

[37] G. Yan, W.-K. Ng, and E.-P. Lim, "Toolkits for a Distributed, Agent-Based Web Commerce System," Proc. Int'l IFIP Working Conf. Trends in Distributed Systems for Electronic Commerce (TrEC '98), June 1998.
[38] G. Yan, W.-K. Ng, and E.-P. Lim, "Incremental Maintenance of Product Schema in Electronic Commerce - A Synonym-Based Approach," Second Int'l Conf. Information, Comm., and Signal Processing (ICICS '99), Dec. 1999.
[39] C. Yu, W. Sun, S. Dao, and D. Keirsey, "Determining Relationships Among Attributes for Interoperability of Multi-Database System," Proc. Workshop Multi-Database and Semantic Interoperability, Nov. 1990.

Guanghao Yan received the BSc degree from Peking University, China, in 1997 and the MSc degree from the Nanyang Technological University, Singapore, in 1999. He is currently a graduate student at the Department of Computer Science, the State University of New York at Stony Brook. He works in the area of data integration for electronic product descriptions. He is a student member of the ACM and IEEE Computer Society.

Wee Keong Ng received the MSc and PhD degrees from the University of Michigan, Ann Arbor in 1994 and 1996, respectively. He is an assistant professor of the School of Computer Engineering at the Nanyang Technological University, Singapore. He works and publishes widely in the areas of Web warehousing, information extraction, electronic commerce, and data mining. He has organized and chaired international workshops, including tutorials, and has actively served in the program committees of numerous international conferences. He is a member of the ACM and the IEEE Computer Society.

Ee-Peng Lim is an associate professor of the School of Computer Engineering at the Nanyang Technological University, Singapore. His research interests are in Web warehousing, database integration, and digital libraries. He is currently the director of the Centre for Advanced Information Systems (CAIS), a research centre focusing on web computing and database research. He is also the guest editor of the special issue on mobile commerce of the Journal of Database Management. His research has been published in a number of international journals including IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Information Systems, Decision Support Systems, and Distributed and Parallel Databases. He is a member of the ACM and a senior member of the IEEE.
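As a supplementary illustration of Definition 5 and of the construction of T used in the proof of Theorem 2, the following Python sketch may help. It is only a sketch under stated assumptions: the function names (`is_matching`, `has_matching`, `extend_instance`), the element labels, and the two toy 3DM instances are hypothetical and do not come from the paper, and the exhaustive search is exponential, so it is meant for checking the equivalence on tiny inputs, not for solving nDM.

```python
from itertools import combinations


def is_matching(tuples, q):
    """True if `tuples` has exactly q elements and no two agree in any coordinate."""
    tuples = list(tuples)
    if len(tuples) != q:
        return False
    return all(
        a != b
        for s, t in combinations(tuples, 2)
        for a, b in zip(s, t)
    )


def has_matching(S, q):
    """Brute-force nDM decision: try every q-element subset of S (exponential)."""
    return any(is_matching(c, q) for c in combinations(sorted(S), q))


def extend_instance(S, q):
    """Build T from S as in the proof: append each of x1..xq to every tuple of S."""
    X = [f"x{m}" for m in range(1, q + 1)]  # hypothetical labels for W_{n+1}
    return {s + (x,) for s in S for x in X}


# Toy 3DM instances with q = 2 (illustrative data only).
S_yes = {("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a1", "b2", "c2")}
S_no = {("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a2", "b1", "c2")}

for S in (S_yes, S_no):
    T = extend_instance(S, 2)
    # The nDM instance and the constructed (n+1)DM instance agree.
    print(has_matching(S, 2), has_matching(T, 2))
```

On these two instances the pair of answers agrees in each case, which is what the two directions of the proof assert. Note that T contains q times as many tuples as S, so the construction is polynomial in the size of the instance.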
