Developing Semantic Digital Libraries Using Data Mining Techniques
DEVELOPING SEMANTIC DIGITAL LIBRARIES USING DATA MINING TECHNIQUES

By

HYUNKI KIM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2005

Copyright 2005 by Hyunki Kim

To my family

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Dr. Su-Shing Chen. He has provided me with financial support, assistance, and active encouragement over the years. I would also like to thank my committee members, Dr. Gerald Ritter, Dr. Randy Chow, Dr. Jih-Kwon Peir, and Dr. Yunmei Chen. Their comments and suggestions were invaluable.

I would like to thank my parents, Jungza Kim and Chaesoo Kim, for their spiritual support from thousands of miles away. I would also like to thank my beloved wife, Youngmee Shin, and my sweet daughter, Gayoung Kim, for their constant love, encouragement, and patience. I sincerely apologize to my family for not having taken care of them for so long. I would never have finished my study without them.

Finally, I would like to thank my friends, Meongchul Song, Chee-Yong Choo, Xiaoou Fu, and Yu Chen, for their help.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Motivation
  1.2 Objective
  1.3 Approach
  1.4 Research Contributions
  1.5 Outline of Dissertation

2 BACKGROUND
  2.1 Digital Libraries
    2.1.1 Digital Objects
    2.1.2 Metadata
    2.1.3 Interoperability in Digital Libraries
  2.2 Federated Search
  2.3 OAI Protocol for Metadata Harvesting
  2.4 Data Mining
    2.4.1 Document Preprocessing
    2.4.2 Document Classification
    2.4.3 Document Clustering

3 DATA MINING AND SEARCHING IN THE OAI-PMH ENVIRONMENT
  3.1 Introduction
  3.2 Self-Organizing Map
  3.3 Data Mining Method using the Self-Organizing Map
    3.3.1 The Data
    3.3.2 Preprocessing: Feature Extraction/Selection and Construction of Document Input Vectors
    3.3.3 Construction of a Concept Hierarchy
  3.4 System Architecture
    3.4.1 Harvester
    3.4.2 Data Provider
    3.4.3 Service Providers
  3.5 Integration of OAI-PMH and Z39.50 Protocols
    3.5.1 Integrating Federated Search with OAI Cross-Archive Search
    3.5.2 Data Collections
    3.5.3 Mediator
    3.5.4 Semantic Mapping of Search Attributes
    3.5.5 OAI-PMH and Non-OAI-PMH Target Interfaces
    3.5.6 Client Interface
  3.6 Discussion
  3.7 Summary and Future Research
    3.7.1 Summary
    3.7.2 Future Research

4 AUTOMATED ONTOLOGY LINKING BY ASSOCIATIVE NAIVE BAYES CLASSIFIER
  4.1 Introduction
  4.2 Related Work
    4.2.1 Document Classification
    4.2.2 Frequent Pattern Mining
  4.3 Gene Ontology
  4.4 Associative Naïve Bayes Classifier
    4.4.1 Definition of Class-support and Class-all-confidence
    4.4.2 ANB Learning Algorithm
    4.4.3 ANB Classification Algorithm
  4.5 Experiments
    4.5.1 Real World Datasets
    4.5.2 Preprocessing and Feature Selection
    4.5.3 Experiments
  4.6 Summary and Future Research
    4.6.1 Summary
    4.6.2 Future Research

5 DATA MINING OF MEDLINE DATABASE
  5.1 Introduction
  5.2 Data Mining Method for Organizing MEDLINE Database
    5.2.1 The Data
    5.2.2 Text Categorization
    5.2.3 Text Clustering using the Results of MeSH Descriptor Categorization
    5.2.4 Feature Extraction and Selection
    5.2.5 Construction of a Concept Hierarchy
    5.2.6 Experimental Results
  5.3 User Interfaces
    5.3.1 MeSH Major Topic Tree View and MeSH Term Tree View
    5.3.2 MeSH Co-occurrence Tree View
    5.3.3 SOM Tree View
  5.4 Discussion