Developing Semantic Digital Libraries Using Data Mining Techniques

Total Page:16

File Type:pdf, Size:1020Kb

Developing Semantic Digital Libraries Using Data Mining Techniques DEVELOPING SEMANTIC DIGITAL LIBRARIES USING DATA MINING TECHNIQUES By HYUNKI KIM A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2005 Copyright 2005 by Hyunki Kim To my family ACKNOWLEDGMENTS I would like to express my sincere gratitude to my advisor, Dr. Su-Shing Chen. He has provided me with financial support, assistance, and active encouragement over the years. I would also like to thank my committee members, Dr. Gerald Ritter, Dr. Randy Chow, Dr. Jih-Kwon Peir, and Dr. Yunmei Chen. Their comments and suggestions were invaluable. I would like to thank my parents, Jungza Kim and ChaesooKim, for their spiritual support from thousand miles away. I would also like to thank my beloved wife, Youngmee Shin, and my sweet daughter, Gayoung Kim, for their constant love, encouragement, and patience. I sincerely apologize to my family for having not taken care of them for so long. I would never have finished my study without them. Finally, I would like to thank my friends, Meongchul Song, Chee-Yong Choo, Xiaoou Fu, Yu Chen, for their help. iv TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................................................................................. iv LIST OF TABLES........................................................................................................... viii LIST OF FIGURES ........................................................................................................... ix ABSTRACT....................................................................................................................... xi CHAPTER 1 INTRODUCTION ........................................................................................................1 1.1 Motivation...............................................................................................................1 1.2 Objective.................................................................................................................2 1.3 Approach.................................................................................................................2 1.4 Research Contributions...........................................................................................4 1.5 Outline of Dissertation............................................................................................5 2 BACKGROUND ..........................................................................................................6 2.1 Digital Libraries......................................................................................................6 2.1.1 Digital Objects..............................................................................................8 2.1.2 Metadata .......................................................................................................9 2.1.3 Interoperability in Digital Libraries............................................................12 2.2 Federated Search...................................................................................................12 2.3 OAI Protocol for Metadata Harvesting.................................................................15 2.4 Data Mining..........................................................................................................17 2.4.1 Document Preprocessing............................................................................18 2.4.2 Document Classification ............................................................................23 2.4.3 Document Clustering..................................................................................23 3 DATA MINING AND SEARCHING IN THE OAI-PMH ENVIRONEMENT.......26 3.1 Introduction...........................................................................................................26 3.2 Self-Organizing Map ............................................................................................28 3.3 Data Mining Method using the Self-Organizing Map..........................................30 3.3.1 The Data .....................................................................................................31 v 3.3.2 Preprocessing: Feature Extraction/Selection and Construction of Document Input Vectors..................................................................................32 3.3.3 Construction of a Concept Hierarchy.........................................................35 3.4 System Architecture..............................................................................................38 3.4.1 Harvester.....................................................................................................39 3.4.2 Data Provider..............................................................................................41 3.4.3 Service Providers........................................................................................42 3.5 Integration of OAI-PMH and Z39.50 Protocols...................................................48 3.5.1 Integrating Federated Search with OAI Cross-Archive Search .................48 3.5.2 Data Collections .........................................................................................49 3.5.3 Mediator .....................................................................................................50 3.5.4 Semantic Mapping of Search Attributes.....................................................50 3.5.5 OAI-PMH and Non-OAI-PMH Target Interfaces......................................51 3.5.6 Client Interface ...........................................................................................51 3.6 Discussion.............................................................................................................52 3.7 Summary and Future Research.............................................................................54 3.7.1 Summary.....................................................................................................54 3.7.2 Future Research..........................................................................................55 4 AUTOMATED ONTOLOGY LINKING BY ASSOCIATIVE NAIVE BAYES CLASSIFIER..............................................................................................................56 4.1 Introduction...........................................................................................................56 4.2 Related Work........................................................................................................58 4.2.1 Document Classification ............................................................................58 4.2.2 Frequent Pattern Mining.............................................................................62 4.3 Gene Ontology......................................................................................................64 4.4 Associative Naïve Bayes Classifier......................................................................65 4.4.1 Definition of Class-support and Class-all-confidence................................67 4.4.2 ANB Learning Algorithm...........................................................................70 4.4.3 ANB Classification Algorithm...................................................................71 4.5 Experiments ..........................................................................................................74 4.5.1 Real World Datasets...................................................................................75 4.5.2 Preprocessing and Feature Selection..........................................................77 4.5.3 Experiments................................................................................................82 4.6 Summary and Future Research.............................................................................87 4.6.1 Summary.....................................................................................................87 4.6.2 Future Research..........................................................................................87 5 DATA MINING OF MEDLINE DATABASE..........................................................90 5.1 Introduction...........................................................................................................90 5.2 Data Mining Method for Organizing MEDLINE Database .................................93 5.2.1 The Data .....................................................................................................93 5.2.2 Text Categorization ....................................................................................93 5.2.3 Text Clustering using the Results of MeSH Descriptor Categorization.....94 5.2.4 Feature Extraction and Selection................................................................95 vi 5.2.5 Construction of a Concept Hierarchy.........................................................96 5.2.6 Experimental Results..................................................................................96 5.3 User Interfaces......................................................................................................97 5.3.1 MeSH Major Topic Tree View and MeSH Term Tree View.....................97 5.3.2 MeSH Co-occurrence Tree View ...............................................................98 5.3.3 SOM Tree View .........................................................................................99 5.4 Discussion.............................................................................................................99
Recommended publications
  • DEWS: a Decentralized Engine for Web Search
    DEWS: A Decentralized Engine for Web Search Reaz Ahmed, Rakibul Haque, Md. Faizul Bari, Raouf Boutaba David R. Cheriton School of Computer Science, University of Waterloo [r5ahmed|m9haque|mfbari|rboutaba]@uwaterloo.ca Bertrand Mathieu Orange Labs, Lannion, France [email protected] (Technical Report: CS-2012-17) Abstract The way we explore the Web is largely governed by centrally controlled, clustered search engines, which is not healthy for our freedom in the Internet. A better solution is to enable the Web to index itself in a decentralized manner. In this work we propose a decentralized Web search mechanism, named DEWS, which will enable the existing webservers to collaborate with each other to form a distributed index of the Web. DEWS can rank the search results based on query keyword relevance and relative importance of websites. DEWS also supports approximate matching of query keywords and incremental retrieval of search results in a decentralized manner. We use the standard LETOR 3.0 dataset to validate the DEWS protocol. Simulation results show that the ranking accuracy of DEWS is very close to the centralized case, while network overhead for collaborative search and indexing is logarithmic on network size. Simulation results also show that DEWS is resilient to changes in the available pool of indexing webservers and works efficiently even in presence of heavy query load. 1 Introduction Internet is the largest repository of documents that man kind has ever created. Voluntary contributions from millions of Internet users around the globe, and decentralized, autonomous hosting infrastructure are the sole factors propelling the continuous growth of the Internet.
    [Show full text]
  • Exploring the Feasibility of Applying Data Mining for Library Reference Service Improvement a Case Study of Turku Main Library
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by National Library of Finland DSpace Services Ming Zhan ([email protected]) Exploring the Feasibility of Applying Data Mining for Library Reference Service Improvement A Case Study of Turku Main Library Master thesis in Information and Knowledge Management Master’s Programme Supervisors: Gunilla Widén Faculty of Social Sciences, Business and Economics Åbo Akademi University Åbo 2016 Abstract Data mining, as a heatedly discussed term, has been studied in various fields. Its possibilities in refining the decision-making process, realizing potential patterns and creating valuable knowledge have won attention of scholars and practitioners. However, there are less studies intending to combine data mining and libraries where data generation occurs all the time. Therefore, this thesis plans to fill such a gap. Meanwhile, potential opportunities created by data mining are explored to enhance one of the most important elements of libraries: reference service. In order to thoroughly demonstrate the feasibility and applicability of data mining, literature is reviewed to establish a critical understanding of data mining in libraries and attain the current status of library reference service. The result of the literature review indicates that free online data resources other than data generated on social media are rarely considered to be applied in current library data mining mandates. Therefore, the result of the literature review motivates the presented study to utilize online free resources. Furthermore, the natural match between data mining and libraries is established. The natural match is explained by emphasizing the data richness reality and considering data mining as one kind of knowledge, an easy choice for libraries, and a wise method to overcome reference service challenges.
    [Show full text]
  • D:\Documents and Settings\Ana\My Documents\Biserka-Knjiga\Digital
    CULTURELINK Network of Networks for Research and Cooperation in Cultural Development was established by UNESCO and the Council of Europe in 1989. Focal point of the Network is the Institute for International Relations, Zagreb, Croatia. Members Networks, associations, foundations, institutions and individuals engaged in cultural development and cooperation. Aims of the Network To strengthen communication among its members; to collect, process and disseminate information on culture and cultural development in the world; to encourage joint research projects and cultural cooperation. Philosophy Promotion and support for dialogue, questioning and debating cultural practices and policies for cultural development. Mailing address CULTURELINK/IMO Ul. Lj. F. Vukotinovića 2 P.O. Box 303, 10000 Zagreb, Croatia Tel.: +385 1 48 77 460 Fax.: +385 1 48 28 361 E-mail: clink@ irmo.hr URL: http://www.culturelink.hr http://www.culturelink.org Download address http://www.culturelink.org/publics/joint/digital_culture-en.pdf DIGITALCULTURE: THECHANGINGDYNAMICS ThisWorkhasbeenpublishedwiththefinancialsupportoftheUNESCOOfficein Venice-UNESCORegionalBureauforScienceandCultureinEurope. UNESCO-BRESCE The designations employed and the presentation of the material throughout this publication do not imply the expressing of any opinion whatsoever on the part of the UNESCO Secretariat concerning the legal status of any country or territory, city or areaorofitsauthorities,thedelimitationsofitsfrontiersorboundaries. The authors are responsible for the choice and the presentation
    [Show full text]
  • A Self-Sustainable Peer-To-Peer Digital Library Framework for Evolving Communities Ashraf A
    Old Dominion University ODU Digital Commons Computer Science Theses & Dissertations Computer Science Winter 2007 FreeLib: A Self-Sustainable Peer-to-Peer Digital Library Framework for Evolving Communities Ashraf A. Amrou Old Dominion University Follow this and additional works at: https://digitalcommons.odu.edu/computerscience_etds Part of the Computer Sciences Commons Recommended Citation Amrou, Ashraf A.. "FreeLib: A Self-Sustainable Peer-to-Peer Digital Library Framework for Evolving Communities" (2007). Doctor of Philosophy (PhD), dissertation, Computer Science, Old Dominion University, DOI: 10.25777/twrn-ya69 https://digitalcommons.odu.edu/computerscience_etds/96 This Dissertation is brought to you for free and open access by the Computer Science at ODU Digital Commons. It has been accepted for inclusion in Computer Science Theses & Dissertations by an authorized administrator of ODU Digital Commons. For more information, please contact [email protected]. FREELIB: A SELF-SUSTAINABLE PEER-TO-PEER DIGITAL LIBRARY FRAMEWORK FOR EVOLVING COMMUNITIES by Ashraf A. Amrou M.S. February 2001, Alexandria University B.S. June 1995, Alexandria University A Dissertation Submitted to the Faculty of Old Dominion University in Partial Fulfillment of the Requirement for the Degree of DOCTOR OF PHILOSOPHY COMPUTER SCIENCE OLD DOMINION UNIVERSITY December 2007 Approved by: Kurt Maly (Co-Director) Mohammad Zubair (Co-Director) rfussein Abdel-Wahab (Member) Ravi Mjjldcamala (Member) lamea (Member Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. ABSTRACT FREELIB: A SELF-SUSTAINABLE PEER-TO-PEER DIGITAL LIBRARY FRAMEWORK FOR EVOLVING COMMUNITIES Ashraf Amrou Old Dominion University, 2007 Co-Directors of Advisory Committee: Dr. Kurt Maly Dr. Mohammad Zubair The need for efficient solutions to the problem of disseminating and sharing of data is growing.
    [Show full text]
  • 2020 SIGACT REPORT SIGACT EC – Eric Allender, Shuchi Chawla, Nicole Immorlica, Samir Khuller (Chair), Bobby Kleinberg September 14Th, 2020
    2020 SIGACT REPORT SIGACT EC – Eric Allender, Shuchi Chawla, Nicole Immorlica, Samir Khuller (chair), Bobby Kleinberg September 14th, 2020 SIGACT Mission Statement: The primary mission of ACM SIGACT (Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory) is to foster and promote the discovery and dissemination of high quality research in the domain of theoretical computer science. The field of theoretical computer science is the rigorous study of all computational phenomena - natural, artificial or man-made. This includes the diverse areas of algorithms, data structures, complexity theory, distributed computation, parallel computation, VLSI, machine learning, computational biology, computational geometry, information theory, cryptography, quantum computation, computational number theory and algebra, program semantics and verification, automata theory, and the study of randomness. Work in this field is often distinguished by its emphasis on mathematical technique and rigor. 1. Awards ▪ 2020 Gödel Prize: This was awarded to Robin A. Moser and Gábor Tardos for their paper “A constructive proof of the general Lovász Local Lemma”, Journal of the ACM, Vol 57 (2), 2010. The Lovász Local Lemma (LLL) is a fundamental tool of the probabilistic method. It enables one to show the existence of certain objects even though they occur with exponentially small probability. The original proof was not algorithmic, and subsequent algorithmic versions had significant losses in parameters. This paper provides a simple, powerful algorithmic paradigm that converts almost all known applications of the LLL into randomized algorithms matching the bounds of the existence proof. The paper further gives a derandomized algorithm, a parallel algorithm, and an extension to the “lopsided” LLL.
    [Show full text]
  • Towards the Ontology Web Search Engine
    TOWARDS THE ONTOLOGY WEB SEARCH ENGINE Olegs Verhodubs [email protected] Abstract. The project of the Ontology Web Search Engine is presented in this paper. The main purpose of this paper is to develop such a project that can be easily implemented. Ontology Web Search Engine is software to look for and index ontologies in the Web. OWL (Web Ontology Languages) ontologies are meant, and they are necessary for the functioning of the SWES (Semantic Web Expert System). SWES is an expert system that will use found ontologies from the Web, generating rules from them, and will supplement its knowledge base with these generated rules. It is expected that the SWES will serve as a universal expert system for the average user. Keywords: Ontology Web Search Engine, Search Engine, Crawler, Indexer, Semantic Web I. INTRODUCTION The technological development of the Web during the last few decades has provided us with more information than we can comprehend or manage effectively [1]. Typical uses of the Web involve seeking and making use of information, searching for and getting in touch with other people, reviewing catalogs of online stores and ordering products by filling out forms, and viewing adult material. Keyword-based search engines such as YAHOO, GOOGLE and others are the main tools for using the Web, and they provide with links to relevant pages in the Web. Despite improvements in search engine technology, the difficulties remain essentially the same [2]. Firstly, relevant pages, retrieved by search engines, are useless, if they are distributed among a large number of mildly relevant or irrelevant pages.
    [Show full text]
  • Creating and Managing a Digital Library
    Creating and Managing a Digital Library 1 Agenda • Housekeeping items • Introduction of speakers – Laura Costello – Sharon Yang – Richard Jost • Questions 2 3 Creating and Managing a Digital Library Creating a digital library takes time, effort, and resources, but the tools are available, thanks to the popularity of the digital movement. In this panel discussion, three experts in the field of librarianship will share their views on the role of the librarian in providing tools and guidance to create and manage a digital library. 4 Introducing Our Panel A Climate of Demand: The shift from traditional print collections to the emergence of eBooks and demand-driven acquisitions Laura Costello, Head of Research & Emerging Technologies Stony Brook University Altmetrics and Research Support: Tips on using altmetrics to measure the impact of your digital library Sharon Q Yang, Professor and Systems Librarian Rider University Libraries, New Jersey Staffing the Library of the Future: The skills librarians need as the library transitions from a print to a more technical environment Richard Jost, Director of the Law Librarianship Program University of Washington Information School 5 ‘ A Climate of Demand 6 Hi! I’m Laura Head of Research & Emerging Technologies at Stony Brook‘ University [email protected] http://bit.ly/CostelloDDA 7 Evaluating Demand- Driven Acquisitions Chandos Publishing,‘ 2016 8 1. Context 2. Case Studies ‘ 3. New Research 4. What’s next? 9 ➔ What we talk about when we talk about monographs ◆ Serials ◆ Space ‘ ◆ Processing/Outsourcing
    [Show full text]
  • Merging Multiple Search Results Approach for Meta-Search Engines
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by D-Scholarship@Pitt MERGING MULTIPLE SEARCH RESULTS APPROACH FOR META-SEARCH ENGINES By Khaled Abd-El-Fatah Mohamed B.S-------- Cairo University, Egypt, 1995 M.A-------Cairo University, Egypt, 1999 M.A------- University of Pittsburgh 2001 Submitted to the Graduate Faculty of School of Information Sciences in Partial Fulfillment of the requirements for the degree of Doctor of Philosophy University of Pittsburgh 2004 UNIVERSITY OF PITTSBURGH INFORMATION SCIENCES This dissertation was presented by Khaled Abd-El-Fatah Mohamed It was defended on Janauary 29, 2004 and approved by Chris Tomer, PhD, Associate Professor, DLIS Jose-Marie Griffiths, PhD, Professor, DLIS Don King, Research Professor, DLIS Amy Knapp, PhD, ULS Dissertation Director: Chris Tomer, PhD, Associate Professor MERGING MULTIPLE SEARCH RESULTS APPROACH FOR META-SEARCH ENGINES Khaled A. Mohamed, PhD University of Pittsburgh, 2004 Meta Search Engines are finding tools developed for enhancing the search performance by submitting user queries to multiple search engines and combining the search results in a unified ranked list. They utilized data fusion technique, which requires three major steps: databases selection, the results combination, and the results merging. This study tries to build a framework that can be used for merging the search results retrieved from any set of search engines. This framework based on answering three major questions: 1. How meta-search developers could define the optimal rank order for the selected engines. 2. How meta-search developers could choose the best search engines combination. 3. What is the optimal heuristic merging function that could be used for aggregating the rank order of the retrieved documents form incomparable search engines.
    [Show full text]
  • Organizing User Search Histories
    Global Journal of Computer Science and Technology Network, Web & Security Volume 13 Issue 13 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172 & Print ISSN: 0975-4350 Organizing user Search Histories By Ravi Kumar Yandluri Gokaraju Rangaraju Institute of Engineering & Technology, India Abstract - Internet userscontinuously make queries over web to obtain required information. They need information about various tasks and sub tasks for which they use search engines. Over a period of time they make plenty of related queries. Search engines save these queries and maintain user’s search histories. Users can view their search histories in chronological order. However, the search histories are not organized into related groups. In fact there is no organization made except the chronological order. Recently Hwang et al. studied the problem of organizing historical search information of users into groups dynamically. This automatic grouping of user search histories can help search engines also in various applications such as collaborative search, sessionization, query alterations, result ranking and query suggestions. They proposed various techniques to achieve this. In this paper we implemented those techniques practically using a prototype web application built in Java technologies. The experimental results revealed that the proposed application is useful to organize search histories. Indexterms : search engine, search history, click graph, query grouping. GJCST-E Classification : H.3.5 Organizing user Search Histories Strictly as per the compliance and regulations of: © 2013. Ravi Kumar Yandluri. This is a research/review paper, distributed under the terms of the Creative Commons Attribution- Noncommercial 3.0 Unported License http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction inany medium, provided the original work is properly cited.
    [Show full text]
  • A Community-Based Approach to Personalizing Web Search
    COVER FEATURE A Community-Based Approach to Personalizing Web Search Barry Smyth University College Dublin Researchers can leverage the latent knowledge created within search communities by recording users’search activities—the queries they submit and results they select—at the community level.They can use this data to build a relevance model that guides the promotion of community-relevant results during regular Web search. ver the past few years, current Web search through rates should be preferred during ranking. engines have become the dominant tool for Unfortunately, Direct Hit’s so-called popularity engine accessing information online. However, even did not play a central role on the modern search stage today’s most successful search engines struggle (although the technology does live on as part of the O to provide high-quality search results: Approx- Teoma search engine) largely because the technology imately 50 percent of Web search sessions fail to find proved inept at identifying new sites or less well- any relevant results for the searcher. traveled ones, even though they may have had more The earliest Web search engines adopted an informa- contextual relevance to a given search. tion-retrieval view of search, using sophisticated term- Despite Direct Hit’s fate, the notion that searchers based matching techniques to identify relevant docu- themselves could influence the ranking of results by virtue ments from repeated occurrences of salient query terms. of their search activities remained a powerful one. It res- Although such techniques proved useful for identifying onates well with the ideas that underpin the social web, a set of potentially relevant results, they offered little a phrase often used to highlight the importance of a suite insight into how such results could be usefully ranked.
    [Show full text]
  • Our Digital Library Brochure & User Guide
    Using the app, borrowed titles appear on your shelf, where you have the option to: Branch Libraries Digital Library - Click Open Book and/or Open Audiobook & Other Services to begin reading or listening to a title. Collection - Click Manage Loan to see options like Renew, Return, or Aurora Memorial Library Send To Device to send a book to a Kindle device. 115 East Pioneer Trail Aurora, OH 44202 330-562-6502 To access your account for available information: - Loans- Your checked out titles will be listed Garrettsville Library along with their expiration dates. 10482 South Street - Holds- Any holds that you have on titles currently Garrettsville, OH 44231 checked out to other users. 330-527-4378 - Wish List- Any wish list titles that you would like to have saved for check out on a later date. Pierce Streetsboro Library 8990 Kirby Lane - Settings- Ability to change your settings to a 7 or 14-day Streetsboro, OH 44241 loan period, audience content filtering when browsing, 330-626-4458 Featuring and display options. Randolph Library Creativebug, Flipster, General Information 1639 State Route 44 Randolph, OH 44265 330-325-7003 Hoopla, Kanopy Windham Library & OverDrive Renaissance Family Center 9005 Wilverne Drive - Digital Movies, Documentaries & More To obtain operating requirements for any of these Windham, OH 44288 services, visit the Help and/or FAQ sections on the 330-326-3145 - TV Shows & Mysteries service's website. - Music Outreach Services- Home Delivery, - eBooks & Audiobooks Please note that using cellular data Library Express & The Library Box 10482 South Street - Magazines is generally not recommended for streaming Garrettsville, OH 44231 and/or downloading multimedia content.
    [Show full text]
  • Digital Library Models and Prospects
    Paper submission for ASIS Mid-Year 1996 meeting: Digital Library Models and Prospects by Gregory B. Newby Graduate School of Library and Information Science University of Illinois at Urbana-Champaign 501 E. Daniel Street; Champaign; IL; 61820 Email: [email protected] / Telephone 217-244-7365 ABSTRACT Digital libraries are the means by which people of the next millennium will access materials found in current libraries, yet the nature of digital libraries is only now being shaped. Different visions of digital libraries include the digital library for electronic access to materials previously found in print form, access to data stores, and access to scholarly (self-published) materials. In order to bring about significant progress towards the realization of full-scale digital libraries, existing publishers must participate. Yet economic impediments to full participation of publishers exist, including fears of unauthorized duplication, distribution, or modification. Uncertainty about whether digital access to published materials may yield acceptable fees or royalties may be addressed through adoption of “pay as you go” or “buy once, use many” models, yet significant infrastructure development is needed to make these models feasible. The proliferation of scholarly reprint archives, moderated discussion groups, and electronic conferences and proceedings is helping to address some of the publisher concerns, as are developments in network access, browser standards, bibliographic standards, data security, and encryption. THE DIGITAL LIBRARY Conferences, journals, books, and projects have emerged in 1995 and 1996 on a topic which has been of only limited interest previously: digital libraries. The basic concept of the digital library may be traced back at least to such visionaries as Bush (1945), Licklider (1965) and Lancaster (1969).
    [Show full text]