From: KDD-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Resource and Knowledge Discovery in Global Information Systems: A Preliminary Design and Experiment

Osmar R. Zaïane and Jiawei Han*
School of Computing Science
Simon Fraser University
Burnaby, B.C., Canada V5A 1S6
{zaiane, han}@cs.sfu.ca

Abstract

Efficient and effective discovery of resource and knowledge from the Internet has become an imminent research issue, especially with the advent of the Information Super-Highway. A multiple layered database (MLDB) approach is proposed to handle the resource and knowledge discovery in global information base. A preliminary experiment shows the advantages of such an approach. Information retrieval, data mining, and data analysis techniques can be used to extract and transform information from a lower layer database to a higher one. Resources can be found by controlled search through different layers of the database, and knowledge discovery can be performed efficiently in such a layered database.

Introduction

With the rapid expansion of information base and user community in the Internet, efficient and effective discovery and use of the resources in the global information network has become an important issue in the research into global information systems.

There have been many interesting studies on information indexing and searching in the global information base with many global information system servers developed, including Archie, Veronica, WAIS, etc. Although these tools provide indexing and document delivery services, they aim at a very specific service like FTP or gopher. Attempts have also been made to discover resources in the World Wide Web (Schwartz et al. 1992). Spider-based indexing techniques, like the WWW Worm (McBryan 1994), RBSE database (Eichmann 1994), Lycos and others, create a substantial value to the web users but generate an increasing Internet backbone traffic. They not only flood the network and overload the servers, but also lose the structure and the context of the documents gathered. These wandering software agents on the World Wide Web have already created controversies. Other indexing solutions, like ALIWEB (Koster 1994) or Harvest (Bowman et al. 1994), behave well on the network but still struggle with the difficulty to isolate information with relevant context. Essence (Bowman et al. 1994), which uses a “semantic” indexing, is one of the most comprehensive indexing systems currently known. However, it still cannot solve most of the problems posed for systematic discovery of resources and knowledge in the global information base.

In this article, a different approach, called a Multiple Layered DataBase (MLDB) approach is proposed to facilitate information discovery in global information systems. An MLDB is a database composed of several layers of information, with the lowest layer corresponding to the primitive information stored in the global information base and the higher ones storing summarized information extracted from the lower layers. Every layer i (i ∈ [1..n]) stores, in a conventional database, general information extracted from layer i - 1. This extraction of information is called generalization.

The proposal of the multiple layered database architecture is based on the previous studies on multiple layered databases (Han, Fu, & Ng 1994) and data mining (Piatetsky-Shapiro & Frawley 1991; Han, Cai, & Cercone 1993) and the following observation: the multiple layered database architecture transforms a huge, unstructured, global information base into progressively smaller, better structured, and less remote databases to which the well-developed database technology and the emerging data mining techniques may apply. By doing so, the power and advantages of current database systems can be naturally extended to global information systems, which may represent a promising direction.

The remainder of the paper is organized as follows: in Section 2, a model for global MLDB is introduced, and methods for construction and maintenance of the layers of the global MLDB are also proposed; resource and knowledge discovery using the global MLDB is investigated in Section 3; a preliminary experiment is presented in Section 4; finally, the study is summarized in Section 5.

*Research partially supported by the Natural Sciences and Engineering Research Council of Canada under the grant OGP0037230 and by the Networks of Centres of Excellence Program of Canada under the grant IRIS-HM15.

Generalization: Formation of higher layers

Layer-1 is a detailed abstraction of the layer-0 information. It should be substantially smaller than the primitive layer global information base but still rich enough to preserve most of the interesting pieces of general information for a diverse community of users to browse and query. Layer-1 is the lowest layer of information manageable by database systems. However, it is usually still too large and too widely distributed for efficient storage, management and search in the global network. Further compression and generalization can be performed to generate higher layered databases.

Example 2 Construction of an MLDB on top of the layer-1 global database.

The two layer-1 relations presented in Example 1 can be further generalized into layer-2 database which may contain two relations, doc_brief and person_brief, with the following schema,

1. doc_brief(file_addr, authors, title, publication, publication_date, abstract, category_description, language, keywords, major_index, URL_links, num_pages, form, size_doc, access_frequency).

2. person_brief(last_name, first_name, publications, affiliation, e_mail, research_interests, size_home_page, access_frequency).

The layer-2 relations are generated after studying the access frequency of the different fields in the layer-1 relations. The least popular fields are dropped while the remaining ones are inherited by the layer-2 relations. Long text data or structured-valued data fields are generalized by summarization techniques.

Further generalization can be performed on layer-2 relations in several directions. One possible direction is to partition the doc_brief file into different files according to different classification schemes, such as category description (e.g., cs_document), access frequency (e.g., hot_list_document), countries, publications, etc., or their combinations. Choice of partitions can be determined by studying the referencing statistics. Another direction is to further generalize some attributes in the relation and merge identical tuples to obtain a “summary” relation (e.g., doc_summary) with data distribution statistics associated (Han, Cai, & Cercone 1993). The third direction is to join two or more relations. For example, doc_author_brief can be produced by generalization on the join of document and person. Moreover, different schemes can be combined to produce even higher layered databases. □

Clearly, [...] attribute by attribute, into appropriate higher layer concepts. Different lower level concepts may be generalized into the same concepts at a higher level and be merged together, reducing the size of the database.

Generalization on nonnumerical values should rely on the concept hierarchies which represent background knowledge that directs generalization. Using a concept hierarchy, primitive data can be expressed in terms of generalized concepts in a higher layer.

A portion of the concept hierarchy for keywords is illustrated in Fig. 1. Notice that a contains-list specifies a concept and its immediate subconcepts, and an alias-list specifies a list of synonyms (aliases) of a concept, which avoids the use of complex lattices in the “hierarchy” specification. The introduction of alias-lists allows flexible queries and helps dealing with documents using different terminologies and languages.

Generalization on numerical attributes can be performed automatically by inspecting data distribution. In many cases, it may not require any predefined concept hierarchies. For example, the size of document can be clustered into several groups according to a relatively uniform data distribution criteria or using some statistical clustering analysis tools.

Concept hierarchies allow us two kinds of generalization, data generalization and relation generalization. The data generalization aims to summarize tuples by eliminating unnecessary fields in higher layers which often involves merging generalized data within a set-valued data item. The summarization can also be done by compressing data like multimedia data, long text data, structured-valued data, etc. Relation generalization aims to summarize relations by merging identical tuples in a relation and incrementing counts.

Incremental updating of the global MLDB

The global information base is dynamic, with information added, removed and updated constantly at different sites. It is very costly to reconstruct the whole MLDB database. Incremental updating could be the only reasonable approach to make the information updated and consistent in the global MLDB.

In response to the updates to the original information base, the corresponding layer-1 and higher layers should be updated incrementally.

We only examine the incremental database update at insertion and update. Similar techniques can be easily extended to deletions. When a new file is connected to the network, a new tuple t is obtained by the layer-1 construction algorithm. The new tuple is