Data Clustering and Cleansing for Bibliography Analysis

by Sunanda Patro

Master in Computer Application (MCA), Berhampur University, India

Master of Science (Computing), University of Tasmania, Australia, 2006

A thesis submitted in fulfillment

of the requirements for the degree of

Doctor of Philosophy

School of Computer Science & Engineering

University of New South Wales

2012

This thesis entitled: Data Clustering and Cleansing for Bibliography Analysis written by Sunanda Patro has been approved for the School of Computer Science & Engineering

Supervisor: A/Prof. Wei Wang

Signature Date

The final copy of this thesis has been examined by the signatory, and I find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.

Declaration

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signature Date

Abstract

Advances in computational resources and the communications infrastructure, as well as the rapid rise of the World Wide Web, have increased the wide availability of published papers in electronic form. Digital libraries - indexed collections of articles - have become an essential resource for academic communities. Citation records are also used to measure the impact of specific publications in the research community. Several bibliography databases have been established, which automatically extract bibliography information and cited references from digital repositories.

Maintaining, updating and correcting citation data in bibliography databases is an important and ever-increasing task. Bibliography databases contain many errors arising, for example, from data entry mistakes, imperfect citation-gathering software or common author names. Text mining is an emerging technology to deal with such problems. In this thesis, new text mining techniques are proposed to deal with three different data quality problems in real-life bibliography data, which include:

1. Clustering search results from citation-enhanced search engines

2. Learning top-k transformation rules from complex co-referent records

3. Comparative citation analysis based on semi-automatically cleansed bibliography data.

The first issue has been tackled by proposing a new similarity function that incorporates domain information, and by implementing an outlier-conscious algorithm in the generation of clusters. Experimental results confirm that the proposed clustering method is superior to prior approaches.

The second problem has been tackled by developing an efficient and effective method to extract top-k high quality transformation rules for a given set of possibly co-referent record pairs. An effective algorithm is proposed that performs careful local analysis for each record pair, generates candidate rules and finally chooses the top-k rules based on a scoring function. Extensive experiments performed on several publicly available real-world datasets demonstrate its advantage over the previous state-of-the-art algorithm in both effectiveness and efficiency.

The final problem has been broached by developing a semi-automatic tool that performs extensive data cleaning, correcting errors found in the citations returned from Google Scholar and converting the citations into structured data formats suitable for citation analysis. The results are then compared with the results from the most widely used subscription-based citation database, Scopus. Extensive experiments on various bibliometric indexes of a collection of research in computer science have demonstrated the usefulness of Google Scholar in conducting citation analysis, and highlighted its broader international impact on the quality of the publication.

Acknowledgements

I would like to thank my thesis supervisor Associate Professor Wei Wang for his guidance, encouragement, valuable time and support throughout the course of the thesis. I would also like to thank my associate-supervisor Prof. Xuemin Lin for his guidance during the research program.

The financial support offered through the University Postgraduate Award (UPA) and the Supplementary Engineering Award from the Faculty of Engineering is acknowledged.

Special thanks go to my family and friends, and in particular my children Suman and Shruti and my husband Gangadhara Prusty, who have been waiting patiently to see the special day when I submit the thesis. Their unconditional support and encouragement motivated me to finish my studies.

I extend my sincerest gratitude to my parents and parents-in-law for their encouragement and co-operation shown during this period.

Finally, I would like to dedicate my thesis to my beloved late grandmother (amamma) for her encouragement, affection, inspiration and blessings, which always motivated me to study and excel. I wish she could have been here...

Contents

Chapter

1 Introduction 1

1.1 Bibliography Data ...... 5

1.2 Current Problems ...... 7

1.3 Outline ...... 14

2 Effective Snippet Clustering with Domain Knowledge 16

2.1 Introduction ...... 16

2.2 Literature Review ...... 19

2.2.1 Flat Clustering ...... 19

2.2.2 Hierarchical Clustering ...... 24

2.2.3 Multiple Clustering Approaches ...... 28

2.3 Snippet Miner ...... 30

2.3.1 Overview ...... 30

2.3.2 Domain-knowledge Mining ...... 31

2.3.3 Snippet Parsing ...... 33

2.3.4 Measuring Similarities between Snippets ...... 34

2.3.5 Cluster Generation ...... 38

2.4 Experiments ...... 39

2.5 Conclusions ...... 46

3 Learning Top-k Transformation Rules 51

3.1 Introduction ...... 51

3.2 Literature Review ...... 54

3.3 The Local-alignment-based Algorithm ...... 64

3.3.1 Pre-processing the Input Data ...... 64

3.3.2 Segmentation ...... 64

3.3.3 Local Alignment ...... 65

3.3.4 Obtaining top-k Rules ...... 70

3.4 Experiment ...... 74

3.4.1 Experiment Setup ...... 74

3.4.2 Datasets ...... 77

3.4.3 Quality of the Rules ...... 79

3.4.4 Correct and Incorrect Rules ...... 80

3.4.5 Unsupervised versus Supervised ...... 86

3.4.6 Effect of Numeric Rules ...... 87

3.4.7 Performance without UpdateRules ...... 90

3.4.8 Precision and Consistency ...... 91

3.4.9 Execution Time ...... 93

3.5 Conclusions ...... 96

4 Semi-automatic Comparative Citation Analysis 97

4.1 Introduction ...... 97

4.2 Proposed Study ...... 101

4.3 Literature Review ...... 102

4.4 Google Scholar (GS) ...... 115

4.5 Scopus ...... 120

4.6 Data Collection ...... 124

4.7 Experimental Analysis ...... 128

4.7.1 Citation Counts ...... 128

4.7.2 Growth in Citations over the Years ...... 131

4.7.3 h-index, g-index and hg-index ...... 132

4.7.4 Overlap and Uniqueness of Citing References ...... 135

4.8 Conclusions ...... 137

5 Conclusions and Future Work 139

Bibliography 143

List of Tables

Table

2.1 Queries ...... 39

2.2 Closely Related Concepts With the Query Concepts ...... 41

2.3 Cluster Sample for Clustering ...... 47

2.4 Cluster Sample for Web Mining ...... 48

2.5 Cluster Sample for Relevance Feedback ...... 49

2.6 Cluster Sample for Information Retrieval ...... 49

2.7 Cluster Sample for Query Expansion ...... 50

3.1 Example of Segmentation ...... 77

3.2 Example of Segmentation ...... 77

3.3 Example of a Cluster of Coreferent Records ...... 79

3.4 Example Rules Found ...... 80

4.1 Author Information ...... 124

4.2 Statistics of Citations in GS ...... 129

4.3 Statistics of Citations in Scopus ...... 130

4.4 Mean, Std Dev and Variance of Citation Counts ...... 130

4.5 Measuring h-index, g-index and hg-index ...... 135

4.6 Distribution of Unique and Overlapping Citations ...... 137

List of Figures

Figure

2.1 Parsing a String into a Set of Concepts ...... 32

2.2 An Example Direct Association Graph ...... 35

2.3 An Example Indirect Association Graph ...... 37

2.4 Cluster Quality (k = 10) ...... 43

2.5 Cluster Quality (k = 15) ...... 43

2.6 Cluster Quality (k = 20) ...... 44

2.7 Cluster Quality (k = 25) ...... 44

2.8 HAC+, k vs. Purity ...... 45

2.9 HAC+, k vs. Purity ...... 45

2.10 HAC+, k vs. Outlier ...... 46

3.1 Optimal Local Alignment ...... 67

3.2 Optimal Local Alignment After Recognising Multi-token Abbreviation (R4) ...... 69

3.3 CCSB, Number of Correct Rules ...... 81

3.4 Cora, Number of Correct Rules ...... 82

3.5 Restaurant, Number of Correct Rules ...... 82

3.6 CCSB, Number of Incorrect Rules ...... 83

3.7 Cora, Number of Incorrect Rules ...... 83

3.8 Restaurant, Number of Incorrect Rules ...... 84

3.9 CCSB, Number of Incorrect Rules vs. Number of Correct Rules . 84

3.10 Cora, Number of Incorrect Rules vs. Number of Correct Rules . . 85

3.11 Restaurant, Number of Incorrect Rules vs. Number of Correct Rules 85

3.12 Supervised vs. Unsupervised CCSB, Number of Correct Rules . . 86

3.13 Supervised vs. Unsupervised Cora, Number of Correct Rules . . . 87

3.14 Supervised Segmentation CCSB, Number of Correct Rules . . . . 88

3.15 Supervised Segmentation Cora, Number of Correct Rules . . . . . 88

3.16 Unsupervised Segmentation CCSB, Number of Correct Rules . . . 89

3.17 Unsupervised Segmentation Cora, Number of Correct Rules . . . 89

3.18 CCSB, Impact of UpdateRules ...... 90

3.19 Cora, Impact of UpdateRules ...... 91

3.20 CCSB, Precision vs. k ...... 92

3.21 CCSB, Precision vs. n ...... 92

3.22 Execution time vs. Input Record Pairs (n) ...... 94

3.23 Execution time vs. Output Size (k) ...... 95

4.1 GS Search Results for an Article Title ...... 118

4.2 GS Results from Citing Link of an Article ...... 119

4.3 Scopus Search Interface ...... 122

4.4 Results of Scopus Author Search ...... 123

4.5 DBLP Search Results for Author Elio Masciari ...... 127

4.6 Distribution of Citations Over Years ...... 131

4.7 Grouping Distribution of Citations ...... 132

Chapter 1

Introduction

The World Wide Web continues to grow at an amazing speed. There is also a quickly growing number of text documents managed in organisational Intranets, which represent the accumulated knowledge of organisations and are important for their success in today's information society. Due to the large size, high dynamics and large diversity of the Web and of organisational Intranets, it has become a challenging task to find the truly relevant content for a given purpose. The enormous amount of information stored in unstructured texts cannot easily be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns.

Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. It is a variation of a field called data mining, which aims to find interesting patterns in large databases. The difference between data mining and text mining is that, in text mining, the patterns are extracted from natural language text rather than structured databases of facts.

Handling unstructured data is not an easy task, as such data has several distinctive characteristics 1 :

1. It has no specified format, for example .txt, .html, .xml, .pdf, .doc

2. It has variable length; for example, one record might contain a phrase of a few words or sentences, or an academic paper of many pages

3. It may contain variable spelling, such as misspellings, abbreviations and singular versus plural words

4. The data may contain punctuation and other non-alphanumeric characters, such as periods, question marks, dashes, equal signs and quotation marks

5. The contents are not predefined and do not have to adhere to a predefined set of values. Even when restricted to one discipline, such as actuarial science, a paper can be on a variety of topics, such as ratemaking, reserving, reinsurance, enterprise risk management or data mining.

1 http://www.casact.org/pubs/forum/10spforum/Francis_Flynn.PDF

Text mining is the discovery, via computer, of new, previously unknown information by automatic extraction of information from unstructured text. It can be used to augment existing data in corporate databases by making unstructured text data available for analysis. Text mining deals with the machine-supported analysis of text by using techniques from information retrieval, information extraction and natural language processing (NLP), and connects them with the algorithms and methods of KDD, data mining, machine learning and statistics. Typical text mining tasks include text categorisation, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarisation and entity relation modelling. There are two main phases to text mining: (1) pre-processing and integration of unstructured data and (2) statistical analysis of the pre-processed data to extract content from the text.
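As a rough illustration of the pre-processing phase mentioned above, the sketch below tokenises raw snippets, removes stop words and builds TF-IDF weight vectors. The token pattern and the tiny stop-word list are simplifying assumptions, not the pipeline actually used in this thesis.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "on", "to", "is"}

def tokenise(text):
    """Lower-case the text and keep alphanumeric tokens that are not stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    token_lists = [tokenise(d) for d in documents]
    n_docs = len(documents)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors

snippets = [
    "Efficient processing of XML twig patterns",
    "A holistic approach to XML query processing",
]
print(tfidf_vectors(snippets))
```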

One of the important pre-processing steps in text mining is data cleansing, also called data cleaning or scrubbing, which deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data [RD00].

It deals with data quality problems in single data collections (e.g., due to misspellings during data entry, missing information or other invalid data) and in multiple data sources that need to be integrated (e.g., data warehouses and global Web-based information systems).

The actual process of data cleaning may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

In general, it involves several phases [RD00]: (1) detecting the types of errors and inconsistencies to be removed, (2) defining the data transformation and cleaning steps based on the degree of heterogeneity and the dirtiness of the data, (3) evaluating the correctness and effectiveness of the transformation definitions, (4) executing the transformation steps and (5) replacing the dirty data in the original sources with the cleaned data.

The well-known methods in data cleaning [Mül05] include: (1) parsing, which performs the detection of syntax errors; (2) data transformation, which allows the mapping of the data from its given format into the format expected by the appropriate application; (3) duplicate elimination, which requires an algorithm for determining whether data contains duplicate representations of the same entity; and (4) statistical methods, which analyse the data using various statistical functions or clustering algorithms and find values that are unexpected and thus erroneous.
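The duplicate-elimination step can be illustrated with a hedged sketch: Python's standard difflib is used here as a stand-in similarity function, and the 0.85 threshold and sample records are illustrative assumptions only.

```python
from difflib import SequenceMatcher
from itertools import combinations

def likely_duplicates(records, threshold=0.85):
    """Return pairs of records whose character-level similarity exceeds the threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs

records = [
    "Efficient processing of XML twig pattern: a novel one-phase holistic solution",
    "Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution",
    "Learning top-k transformation rules",
]
print(likely_duplicates(records))
```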

One of the fundamental problems in data cleaning is determining when two tuples refer to the same real-world entity [BG04]. Data often contains errors, for example, typographical errors, or has multiple representations, such as abbreviations; thus, an exact comparison does not suffice for detecting duplicates in these cases.

Record linkage refers to the task of finding records in a dataset that refer to the same entity across different data sources (e.g., data files, books, websites and databases). It is a form of data cleaning that identifies equivalent but textually distinct items in the extracted data prior to mining. Record or data linkage techniques are used to link records that relate to the same entity (e.g., patient, customer or household) in one or more datasets, where a unique identifier for each entity is not available. Hence, the record linkage problem of identifying and linking duplicate records that arises in the context of data cleaning is a necessary first step for many database applications.

Another important application in the field of text mining is clustering, which is often one of the first steps in text-mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships.

Clustering can be considered the most important unsupervised learning problem that deals with finding a structure in a collection of unlabelled data. It is a convenient method for identifying homogeneous groups of objects called clusters. Clustering can be defined as the process of organising objects into groups in which the members are similar in some way. Objects in a specific cluster intuitively share many characteristics but are dissimilar to objects not belonging to that cluster. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data pre-processing step to identify homogeneous groups on which to build supervised models.
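As a minimal, generic illustration of clustering (not the algorithm proposed later in this thesis), the sketch below groups a few publication titles by their TF-IDF vectors using k-means; it assumes scikit-learn is available, and the titles and cluster count are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Hierarchical clustering of web search results",
    "Snippet clustering with domain knowledge",
    "Learning transformation rules for record linkage",
    "Top-k rule learning from coreferent records",
]

# Vectorise the titles and group them into two clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(titles, labels)))
```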

1.1 Bibliography Data

One of the main sources of text data that requires text mining techniques (both record linkage and clustering) is bibliography data. Advances in computational resources and the communications infrastructure, as well as the rapid rise of the World Wide Web, have led to the increasingly widespread availability of published papers in electronic form. These registered publications usually contain citations to previous work, and indices of these citations are valuable for literature searches, analyses and evaluations. Bibliography Digital Libraries, such as DBLP, CiteSeer and MEDLINE, contain a large number of citation records in different disciplines. Digital libraries strive to enrich documents by examining their content, extracting information and using it to enhance the ways they can be located and presented [WDDT04]. Such digital libraries have been an important resource for academic communities, since scholars often search for relevant works. Researchers also use citation records to measure the publications' impact in the research community. In addition, citations are often used when users search for articles of interest.

With the electronic availability of scholarly documents, electronic database searching has become the de facto mode of information retrieval, and several bibliography databases have been established that automatically extract bibliography information and cited references from electronic documents retrieved from digital archives and repositories. In recent years, two multi-disciplinary databases, Scopus from Elsevier and Google Scholar from Google Inc., have attracted much attention.

Google Scholar is a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines. It provides a simple search interface displaying the results as a listing of 10–100 items per page. Each retrieved article is represented by title, authors and source. Under each retrieved article, the number of citing articles is noted and can be retrieved by clicking on the relevant link. Google Scholar is an important service for those that do not have access to expensive multidisciplinary databases such as the Thomson-Reuters Scientific Citation Index or Scopus. This is potentially a powerful new tool for citation analysis, particularly in subject areas that have experienced rapid changes in communication and publishing.

Scopus, officially named SciVerse Scopus, is a bibliography database that contains abstracts and citations for academic journal articles. It offers powerful browsing, searching, sorting and saving functions, as well as exporting to citation management software. The search results in Scopus can be displayed as a listing of 20–200 items per page. The results can also be refined by source title, author name, year of publication, document type and/or subject area, and a new search can be initiated within the results. Scopus also offers Related Documents, which returns a list of documents that share cited references with the currently selected document. Scopus Citation Tracker further enhances citation analysis by enabling citations to be viewed by year, providing users with a powerful way to explore citation data over time. Both Scopus and Google Scholar cover peer-reviewed journals and conferences in the scientific, technical, medical and social science areas.

With the increasing need for citation information, it is important to keep the citations of bibliography databases consistent and up-to-date. However, due to data entry errors, imperfect citation gathering software and common author names, bibliography databases often have many errors in their citation collections. There are two important problems that degrade the quality of these digital libraries: Mixed Citations (MC), where homonyms are mixed together, and Split Citation (SC), where citations of the same author appear under different name variants [LOKP05]. Concurrently, text mining is becoming an increasingly well-understood method to deal with such problems.

1.2 Current Problems

This thesis provides new techniques to deal with three different data quality problems in real-life bibliography data. The following paragraphs briefly discuss these problems.

First, one of the major problems in dealing with data quality issues in bibliography data is the search results from citation-enhanced search engines such as Google Scholar. Given a user query, search engines strive to return a list of the most relevant results in the form of snippets, with each snippet represented by a title and a short description. There are two fundamental problems with this approach. First, users' queries are often ambiguous and their search goals are often narrower in scope than the queries used to express them. For example, indexing has different meanings in different contexts, such as economics, mathematics and databases. Second, when the search keyword is too general, results describing different facets of the general topic are mixed up in the query results. As a result, users are forced to sift through the list sequentially to find the required information. For example, for a computer science programmer searching for articles on the Cobra Programming Language, Google Scholar returns a long, mixed list of articles for Cobra from various disciplines, including biology, chemistry, medicine, physics and computer science, along with articles from authors with the surname Cobra. Only one article related to the Cobra Programming Language is returned from the top 20 results using Google Scholar 2 .

2 http://scholar.google.com.au/scholar?hl=en&num=20&q=cobra&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0 (accessed 25/2/12)

A possible solution to this problem is to cluster search results into different groups, thus enabling users to identify their required group at a glance. Recently, Web snippet clustering has gained popularity to help users in searching the Web by processing only the snippets instead of the entire long description of Web page content. It has thus become a challenging variant of classical clustering by reflecting the potentially unbounded themes in the snippets returned by the search engine [FG08]. The work in [ZE99] introduces an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labelled by phrases extracted from the snippets. The work in [ZEMK97, ZE98] proposes the Suffix Tree Clustering algorithm and compares the approach with algorithms such as k-means and Group Average Agglomerative Hierarchical Clustering (GAHAC). In [LC03], a statistical model is built on the background knowledge and topical terms are then extracted to generate multi-level summaries of the Web snippets returned by the search engine. The hierarchical Web snippet clustering method proposed in [FG08] generates hierarchical labels by constructing a sequence of labelled and weighted bipartite graphs representing the individual snippets on one side and a set of labels on the other side. Web snippet clustering in [OW04, ZD04] uses SVD on a term-document matrix to find meaningful long labels. The study [GPPS06] develops a meta-search engine that groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters. In their work, clustering is performed by means of a fast version of the furthest-point-first (FPF) algorithm for metric k-centre clustering. Cluster labels are obtained by combining intra-cluster and inter-cluster term extraction based on the information gain measure.

This thesis discusses the problem of automatically clustering search results returned by Google Scholar. Currently, its search results comprise a linear list of publications represented by snippets, and the only way to sort them is by date. The proposed publication snippet clustering system will benefit researchers by helping them to quickly find the group of publications related to a particular area or topic, or to find alternative query keywords. Here, two key issues are identified: the need for a good similarity function to reflect the similarities between research publications, and the need for a clustering algorithm that is robust to short documents and noise. These issues are tackled by the utilisation of domain information in proposing a new similarity function, and the implementation of an outlier-conscious algorithm in generating the required number of clusters. Cluster labels are also generated to facilitate user browsing through the clusters. Experimental results confirm that the proposed clustering method is superior to prior approaches.

The second challenge of dealing with data quality issues is the identification of multiple records referring to the same entity, even if they are not bit-wise identical.

There are, for example, dozens of correct ways to cite a publication, depending on the bibliography style one uses. The following citations refer to the same publication, although they are textually far apart:

[Zhang, W. & Yu, C. T. A necessary condition for a doubly recur- sive rule to be equivalent to a linear recursive rule. In SIGMOD Conference, 345–356 (1987) ]

[ Weining Zhang & C. T. Yu. A necessary condition for a doubly recursive rule to be equivalent to a linear recursive rule. In Pro- ceedings of Association for Computing Machinery Special Inter- est Group on Management of Data 1987 annual conference,San Francisco,May 27–29,1987, pages 345–356, 1987]

Again, real data is inevitably noisy, inconsistent or contains errors (e.g., typographical errors and Optical Character Recognition (OCR) errors), which result in the same publication record appearing as different entities in a database or Web search. For example, the errors introduced when the citation was extracted by a program from the Web returned two occurrences of the same publication 3 :

Efficient processing of XML twig pattern: A novel one-phase holistic solution

[CITATION] Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution
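The two occurrences above differ only in capitalisation and a bracketed marker, so even a very simple normalisation step makes them compare equal. The sketch below assumes that case, punctuation and markers such as [CITATION] carry no information for matching.

```python
import re

def normalise_title(title):
    """Lower-case, drop bracketed markers such as [CITATION], and collapse punctuation and whitespace."""
    title = re.sub(r"\[[^\]]*\]", " ", title)       # remove [CITATION]-style markers
    title = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return " ".join(title.split())

a = "Efficient processing of XML twig pattern: A novel one-phase holistic solution"
b = "[CITATION] Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution"
print(normalise_title(a) == normalise_title(b))  # True
```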

Existing record linkage approaches mainly rely on similarity functions based on the surface forms of the records; hence, they are not able to identify complex coreferent records. This seriously limits the effectiveness of existing approaches, which fall into two broad categories. The first category is to learn a good similarity function using machine learning techniques [TKM02, CR02, BM03a]. The study [BM03c] shows that trainable similarity measures are capable of learning the specific notion of similarity that is appropriate for a specific domain. While this approach focuses on homogeneous string-based transformations, the approach in [MNK+05] uses a heterogeneous set of models to relate complex domain-specific relationships between two values. The work in [SB02a] employs active learning techniques to minimise the interaction with users, and is recently improved by [AGK10]. Another category of approaches is to model complex or domain-specific transformation rules [MK09, ACK09]. The learned rules can be used to identify more coreferent pairs [ACK08, ACGK08]. Although only a few works in record linkage focus on transformation rules, they have been widely employed in many other areas, and automatic rule learning algorithms have been developed accordingly. In Natural Language Processing (NLP), the study [Tur02] finds sets of synonyms by considering word co-occurrences. Later, the work [PKM03] utilises a similar idea to identify paraphrases and grammatical sentences by looking at the co-occurrence of sets of words. Recently, researchers have used transformation rules to deduplicate URLs without even fetching the content [BYKS09a, DKS08]. However, the rules and their discovery algorithms are heavily tailored for URLs.

3 http://scholar.google.com.au/scholar?hl=en&num=100&q=Efficient+processing+of+XML+twig+pattern%3A+A+novel+one-phase+holistic+solution&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0 (accessed 25/2/12)

This thesis focuses on record linkage approaches for complex coreferent records that have little surface similarity (e.g., '23rd' and 'twenty-third', 'Robert' and 'Bob', 'VLDB' and 'Very Large Databases'). It proposes an automatic method to extract top-k high quality transformation rules given a set of possibly coreferent record pairs. The transformation rules are then incorporated to recognise these domain-specific equivalence relationships. This is obtained by proposing an effective and efficient top-k transformation rule learning algorithm. The algorithm performs careful local analysis for each record pair and generates candidate rules. Finally, the top-k rules are selected based on a scoring function. Extensive experiments performed on several publicly available real-world datasets demonstrate its advantage over the previous state-of-the-art algorithm in both effectiveness and efficiency.
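The following is a deliberately simplified sketch of the rule-learning idea, not the local-alignment algorithm of Chapter 3: each possibly coreferent pair is tokenised, unmatched token segments are paired up as candidate rules, and candidates are scored by how many input pairs support them. The tokenisation, the alignment via difflib and the support-count score are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

def candidate_rules(pair):
    """Align two tokenised records and emit (lhs, rhs) pairs for the unmatched segments."""
    left, right = pair[0].lower().split(), pair[1].lower().split()
    matcher = SequenceMatcher(None, left, right)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield (" ".join(left[i1:i2]), " ".join(right[j1:j2]))

def top_k_rules(pairs, k):
    """Score candidate rules by how many record pairs support them and keep the top k."""
    counts = Counter(rule for pair in pairs for rule in candidate_rules(pair))
    return counts.most_common(k)

pairs = [
    ("Proc. of the 22nd VLDB Conference", "Proceedings of the 22nd Very Large Data Bases Conference"),
    ("VLDB 1996", "Very Large Data Bases 1996"),
]
print(top_k_rules(pairs, 3))
```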

Another challenge that arises in cleaning bibliography data is performing an accurate and acceptable citation analysis of academics and researchers. In recent years, several database producers have noticed the potential of citation indexing, and several bibliography databases have been established that automatically extract bibliography information and cited references from electronic documents retrieved from digital archives and repositories. Among these databases, Google Scholar has gained popularity because it is free and has broad coverage. It automatically extracts the bibliography data from the reference sections of documents (mostly in PDF and PS formats) and determines citation counts for papers in its collections as well as for citations where the document is not available. The search interface of Google Scholar is simple and easy to use. The publications in the query result are typically ranked according to oft-cited and highly relevant articles [RT05]. Under each retrieved article, the number of citing articles is noted and can be retrieved by clicking on the relevant link.

A major disadvantage of Google Scholar is that its records are retrieved in a way that is impractical for use with large sets, requiring a tedious process of manually cleaning, organising and classifying the information into meaningful and usable formats. As discussed previously, the citation results are in the form of snippets, with each snippet represented by a title and a short description. The search results also suffer from the limitations of the automatic extraction of references. For instance, Google Scholar frequently has several entries for the same paper due to misspelled author names, different ordering of authors etc. Conversely, it may group together citations of different papers, such as the journal and conference versions of a paper with the same or similar title. For example, for the article "A primitive operator for similarity joins in data cleaning", the returned list of citations from Google Scholar contains two versions (conference and pre-print) of Benchmarking declarative approximate selection predicates. There are also two versions of Efficient algorithms for approximate member extraction using signature-based inverted lists, the second being Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists, which differs from the first in the capitalisation of the title (lowercase to uppercase).

As a result, to conduct citation analysis, the existing work in the literature had to manually clean these errors and classify the citation information according to proper formats (i.e., fields such as author, venue and source), which is a tedious and time-consuming task [BB05, KT06, MY07, VS08, FPMP08, LBLB10, Fra10, GP10]. To ease this manual processing job, this thesis aims to develop a semi-automatic tool that performs extensive data cleaning to deal with errors and presents the citations in a suitable format to conduct the citation analysis. It performs the citation analysis of researchers in the field of computer science, which is a relatively new field of study where conference papers are considered a more important form of publication than is generally the case in other scientific disciplines, and which has also been less studied in the literature. The results are compared with the results from the most widely used subscription-based citation database, Scopus. Extensive experiments on various bibliometric indexes of a collection of authors in computer science underline the usefulness of Google Scholar for scholars conducting citation analysis, highlighting its broader international effect on the quality of the publication.
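Chapter 4 reports several bibliometric indexes; as a small worked example, the sketch below computes the h-index, g-index and hg-index from a list of per-paper citation counts using their usual definitions (h: the largest h such that h papers each have at least h citations; g: the largest g such that the top g papers together have at least g^2 citations; hg = sqrt(h*g)). The citation counts are invented for illustration.

```python
import math

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    counts = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(counts, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

citations = [42, 18, 12, 7, 5, 3, 1]            # illustrative citation counts
h, g = h_index(citations), g_index(citations)
print(h, g, round(math.sqrt(h * g), 2))         # h-index, g-index, hg-index
```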

1.3 Outline

The outline of the thesis is discussed below.

Chapter 2 focuses on various snippet-clustering methods and proposes a new similarity function to cluster citations returned from the publicly available Google Scholar search engine.

This chapter is based on the work published in DBKDA 2009: Sunanda Patro and Wei Wang, Effective Snippet Clustering with Domain Knowledge, in DBKDA 2009: First International Conference on Advances in Databases, Knowledge, and Data Applications, pp. 44-49, March 1-6, 2009, Cancun, Mexico. (Best Paper Award)

Chapter 3 reviews various record linkage methods and proposes a rule learning algorithm to find citation records in a dataset that refer to the same entity across different data sources.

This chapter is based on the work published in DEXA 2011: Sunanda Patro and Wei Wang, Learning Top-k Transformation Rules, in DEXA 2011: 22nd International Conference on Database and Expert Systems Applications, August 29 - September 2, 2011, Toulouse, France.

Chapter 4 discusses data cleaning issues and develops a semi-automatic tool to handle the errors and limitations of the automatic extraction of references from Google Scholar's search results. Further, it presents studies dedicated to the analysis and comparison of the performance of two widely used tools (Google Scholar and Scopus) for citation searches.

Chapter 5 provides the conclusions and future work.

Chapter 2

Effective Snippet Clustering with Domain Knowledge

2.1 Introduction

With the phenomenal increase in the amount of information available on the World Wide Web, search engines have become an indispensable tool for users to obtain desired information. Given a user query, search engines strive to return a list of the most relevant results in the form of snippets, with each snippet represented by a title and a short description. There are two fundamental problems with this approach. First, users' queries are often ambiguous and their search goals are often narrower in scope than the queries used to express them. For example, the query indexing has different meanings in different contexts, such as economics, mathematics and databases. Second, when the search keyword is too general, results describing different facets of the general topic are mixed up in the query result. As a result, users are forced to sift through the list sequentially to find the required information.

A promising technique to address the above problem is to organise the search results into clusters of semantically related pages so the user can quickly view the entire result set and use clusters themselves to filter the results or refine the query [CCG+06, Hea06, ZD04].

This work tackles the problem of automatically clustering search results returned by Google Scholar 1 , a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines. Currently, Google Scholar's search results comprise a linear list of publications, each represented by a snippet, and the only way to sort them is by date.

1 http://scholar.google.com

The proposed publication snippet clustering system will benefit researchers by helping them to quickly find the group of publications related to a particular area or topic, or to find alternative query keywords.

Although there is much work in both Web page clustering and snippet clustering, existing approaches are not immediately applicable to the proposed system due to the unique characteristics of the problem. First, the system has to take only snippets as input. The full text of documents is not always available, for example, due to a lack of access to a particular publication repository, an invalid URL or files in different formats (for example, scanned images from Google Books). Second, snippets are usually short. In fact, the system has to rely heavily on the title of each snippet, as there is often no description in the snippet (for example, a citation to an old publication). Third, the quality of snippets is far from perfect. Due to the nature of automatic extraction, snippets may contain errors or may be incomplete, and descriptions of snippets could be empty or meaningless. Fourth, domain knowledge is usually required, even by humans, to assess the similarity between two publications. A publication about decision trees and another about the naïve Bayes classifier are unrelated literally, but are relevant to domain experts. Traditional Web document clustering methods either require extra information (e.g., the entire Web page or hyperlinks) or are mainly designed for long documents [CD07]. While existing works on snippet clustering are more relevant to the proposed system, they are found to perform inadequately in the current setting due to the abundance of noise in the input snippets and the lack of domain knowledge to accurately assess the similarity of two research publications.

Based on the above analysis, two key issues are identified: (1) the need for a good similarity function to reflect the similarities between research publications and (2) the need for a clustering algorithm that is robust to short documents and noise.

To tackle the first issue, the current work proposes to extract useful domain knowledge from the DBLP database. It then defines a semantically meaningful similarity function that leverages both the cosine-based document similarity and a semantic similarity function based on the closeness of concept phrases appearing in the snippets. To tackle the second issue, in the online clustering phase, it takes as inputs only the title and the description of snippets, and runs an outlier-conscious hierarchical clustering algorithm based on the similarity matrix formed by the newly proposed similarity function. It also generates cluster labels to facilitate user browsing through the clusters. Experimental results confirm that the proposed clustering method is superior to prior approaches.
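A hedged sketch of the kind of combined similarity just described: cosine similarity over sparse TF-IDF vectors is blended with a concept-overlap score. The 0.5 mixing weight, the Jaccard-style concept overlap and the toy inputs are illustrative assumptions rather than the function defined in Section 2.3.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
           math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / norm if norm else 0.0

def concept_similarity(concepts_a, concepts_b):
    """Jaccard overlap of the domain concepts detected in two snippets."""
    if not concepts_a or not concepts_b:
        return 0.0
    return len(concepts_a & concepts_b) / len(concepts_a | concepts_b)

def snippet_similarity(vec_a, vec_b, concepts_a, concepts_b, alpha=0.5):
    """Blend textual and concept-level similarity with mixing weight alpha."""
    return alpha * cosine(vec_a, vec_b) + (1 - alpha) * concept_similarity(concepts_a, concepts_b)

# Two snippets that share a domain concept but no words.
sim = snippet_similarity({"decision": 1.0, "tree": 1.0}, {"naive": 1.0, "bayes": 1.0},
                         {"classification"}, {"classification"})
print(sim)  # 0.5: no word overlap, full concept overlap
```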

This chapter is organised as follows: Section 2.2 briefly introduces related work; Section 2.3 discusses the proposed snippet mining system; Section 2.4 presents the experimental results and Section 2.5 concludes the chapter.

2.2 Literature Review

Various approaches to text clustering have been developed [JMF99, KP04, XI05]. Typically, clustering approaches can be categorised as agglomerative or partitional (based on the underlying methodology of the algorithm), or as hierarchical (hierarchy of clusters) or flat (flat set of clusters), based on the structure of the final solution [ZZHM04].

Based on the above categories, this section discusses the related work for various types of snippet clustering algorithms.

2.2.1 Flat Clustering

Flat clustering can be further classified as word-based clustering, where common words that are shared among documents act as features, and term-based clustering, where sentences of variable lengths are used as features to cluster the text snippets.

2.2.1.1 Word-based Clustering

To improve the quality and interpretation of Web search results, the system [WK02] combines the retrieved snippet content with the link information in the anchor text, meta-content and anchor window of the in-links. Combining links and content analysis, it resolves many of the shortcomings of previous approaches and extends the standard k-means algorithm to make it more suitable to clustering in the Web domain.

Following the work of [HN02], the paper [NN05] investigates how rough set theory and its ideas, such as approximations of concepts, could be practically applied in the task of search result clustering. Tolerance classes are used to approximate concepts existing in documents and to enrich the vector representation of snippets. Sets of documents sharing similar concepts are grouped together to form clusters, and using a special heuristic, concise and intelligible cluster labels are derived from the tolerance classes. The experimental results show that the Tolerance Rough Set and upper approximation can have positive effects on clustering quality.

The work [KJNY01] gives new relational fuzzy clustering algorithms (FCMdd and RFCMdd) based on the idea of medoids, and presents several applications of these algorithms to Web mining, including Web document clustering, snippet clustering and Web access log analysis. Further, to enhance the performance of these algorithms, in [JJKY], the authors introduce n-gram and vector space methods to generate the (dis)similarity distance matrix. The results of their experiments show that the n-gram based approach performs better than the vector space approach and the Suffix Tree Clustering (STC) algorithm developed by [ZE99].

The paper [GPMS06] describes a meta-search engine called Armil, which groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters. With no external sources of knowledge, clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-centre clustering, and cluster labelling is achieved by combining intra- and inter-cluster term extractions based on a variant of the information gain measure. Using a comprehensive set of snippets obtained from the Open Directory Project hierarchy as a benchmark, Armil achieves better performance than Vivisimo, the de facto industrial standard in Web snippet clustering.

The article [GPPS06] proposes a much faster variant of FPF [Gon85] based on a filtering step that exploits the triangular inequality and shows its suitability for Web snippet clustering. While being far more efficient to run, the algorithm seems to perform with accuracy comparable to the strong baselines consisting of fast variants of the classical k-means iterative algorithm. It also shows that higher efficiency can be obtained by using metrics that exploit the internal structure of the snippets.

2.2.1.2 Term-based Clustering

The paper [ZE98] introduces an incremental linear-time clustering algorithm, STC, which first identifies sets of documents that share common phrases and then creates clusters according to these phrases. It evaluates the effectiveness of clustering search results by using both search snippets and entire documents. The results show that clustering methods using snippets outperform methods using entire documents. Inspired by this work, the Carrot framework was created in [Wei01] to facilitate research on clustering search results. Grouper [ZE99], which is an interface to the results of the HuskySearch meta-search engine, uses the STC algorithm introduced in [ZE98] to extract phrases from snippets and dynamically groups the search results into clusters labelled according to the phrases.

The work [WS03] presents conclusions drawn from an experimental application of STC to documents in Polish, indicating fragile areas where the algorithm seems to fail due to specific properties of the input data. It indicates that the characteristics of the produced clusters (number and value), unlike for English documents, strongly depend on the pre-processing phase. It also attempts to investigate the influence of two primary STC parameters, the merge threshold and the minimum base cluster score, on the number and quality of results. It also introduces two approaches to efficient, approximate stemming of Polish words: a quasi-stemmer and an automaton-based approach.

Lingo [OSW04] emphasises cluster description quality by using algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. The paper [OW04] highlights the strengths and weaknesses of Lingo and compares the clusters acquired from Lingo to the search results of STC [ZE99].

An improved Lingo [OSW04] algorithm, Suffix Array Similarity Clustering (SASC), is presented in [BZZM10]. This method creates clusters by adopting an improved suffix array that ignores redundant suffixes and computes document similarity based on the title and short document snippets returned by Web search engines. Experiments show that the SASC algorithm outperforms Lingo not only in time consumption, but also in cluster description quality and precision when compared to STC.

The paper [ZHC+04] reformulates the clustering problem as a salient phrase ranking problem. The candidate phrase extraction process is similar to STC [ZE98], but it further calculates several important properties to identify salient phrases and utilises learning methods to rank these salient phrases. Given a query and the ranked list of titles and snippets, the method first extracts and ranks salient phrases as candidate cluster names based on a regression model learned from human-labelled training data. The documents are then assigned to relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters.

Based on a new suffix tree and a new base cluster combining algorithm with a new partial phrase join operation [JK06], the proposed approach is suitable for on-the-fly Web search result clustering and cluster labelling. The experimental results show that this approach provides more readable and true common phrases for Web document clusters, and performs better than conventional Web search result clustering.

The article [WMH+08] proposes a variant of STC for Web search result clustering and labelling. It is an incremental and linear-time algorithm with significantly lower memory requirements. It is claimed that it generates more readable labels than STC, as it inserts into the suffix tree more true common phrases and joins partial phrases to construct true common phrases. Experimental results show that the proposed new approach performs better than conventional Web search result clustering.

The paper [WZP+08] proposes a new method for clustering Web search results based on an interactive suffix tree clustering algorithm (ISTC). The phrases extracted from the snippets are used as the characteristics for clustering. In the course of interaction with users, it only returns cluster labels to users in the first tier. When users want to have further interaction, they can select a document that they are interested in for a second clustering instead of the traditional recursive clustering. ISTC can also be applied to Chinese and English information processing; it avoids the recursive algorithm, achieving linear-time behaviour and improving the efficiency of search engines. Experimental results verify the feasibility and effectiveness of this method.

The study [KPT09] introduces a variation of STC, called STC+, with a scoring formula that favours phrases that occur in document titles, and a novel suffix tree based algorithm called NM-STC, which results in hierarchically organised clusters. The comparative user evaluation shows that both STC+ and NM-STC are significantly preferred to STC, and that NM-STC is about two times faster than STC and STC+.

Using semantic information for clustering Web snippets, the work [WHL09] proposes an improved Web search result clustering algorithm based on STC. It uses a latent semantic indexing method to assist in finding common descriptive and meaningful topic phrases for the final document clusters. This makes the search engine results easy to browse and helps users quickly find Web information. Evaluation of the experimental results demonstrates that clustering Web search results based on the proposed improved suffix tree algorithm has better performance in cluster label quality and snippet assignment precision.

The study [WXC09] proposes a more effective Web snippet clustering algorithm that combines the advantages of the vector space model (VSM) and STC document models. The proposed improved suffix tree algorithm takes into account the semantic information of candidate label phrases and offers descriptive, readable and conceptual topic labels for the final document groups. Evaluation of the results demonstrates that this algorithm performs better in making search engine results easy to browse and helping users to quickly find Web pages in which they are interested.

2.2.2 Hierarchical Clustering

In [LC03], a statistical model is built on background knowledge and topical terms are then extracted to generate multi-level summaries of the Web snippets returned by the search engine. It shows that the terms selected to be part of the hierarchy are better summary terms than the top TF.IDF terms, and that the hierarchy provides users with more access to the documents retrieved than by using a ranked list alone.

In an attempt to overcome the shortcomings of STC [ZE98], the study [ZD04] proposes a Semantic, Hierarchical, Online Clustering (SHOC) approach to automatically organise Web search results into groups. Combining the power of two novel techniques (key phrase discovery and orthogonal clustering), SHOC not only generates clusters that are both reasonable and readable, but also works for multiple languages, including English and oriental languages such as Chinese.

DisCover [KLR+04], a hierarchical monothetic clustering algorithm, builds a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, it progressively identifies topics in a way that maximises the coverage while maintaining the distinctiveness of the topics. The user studies show that the proposed algorithm generates better hierarchies and is superior to other algorithms as a summarising and browsing tool.

SnakeT [FG08], the first complete and open-source system in the literature, offers both hierarchical clustering and folder labelling with variable-length sentences, and achieves efficiency and efficacy performance close to the best known engine Vivisimo. The system produces a hierarchy of labelled clusters by constructing a sequence of labelled and weighted bipartite graphs representing the individual snippets on one side and a set of labelled clusters on the other side. It emphasises the accuracy of the labels rather than that of the clusters.

Based on the fact that similar Web snippets share a small number of phrases, the paper [LW10] introduces a new method for hierarchical clustering of Web snippets by exploiting a phrase-based document index. In this method, a hierarchy of Web snippets is built based on phrases instead of all snippets, and the snippets are then assigned to the corresponding clusters consisting of phrases. Experiments show that this method outperforms the traditional hierarchical clustering algorithm.

Using an overlapping and hierarchical clustering method, the work [JK] proposes a search result clustering system called Cluiser. The system finds groups of Web pages along with representative tag words and is found to be most effective for finding the meaning of unknown keywords. It produces more diverse clusters compared with previous approaches.

WISE [CDN06] explores Web page hierarchical soft clustering as an alternative method of organising search results. It is based on (1) an algorithm ignoring less relevant documents and adding relevant documents; (2) statistical phrase extraction to define concepts; (3) Web content mining techniques to semantically represent the Web pages and (4) an overlapping clustering algorithm to organise results into a hierarchy of concepts and a classical labelling process. Experimental results demonstrate the correctness of the clusters, the quality of the labels, concept disambiguation and language-independence. However, it still needs formal evaluation.

2.2.2.1 Graph-based Approach

The approach in [LY06] uses tokens as basic units for clustering, which avoids segmentation for oriental languages, and can be applied to any language. It introduces a Directed Probability Graph (DPG) model that identifies meaningful phrases as cluster labels using statistical methods without any external knowledge. The clustering procedure is performed without calculating the similarity between pair-wise documents, and the experiments show that the proposed clustering algorithm is very efficient and suitable for online Web snippet clustering.

A new clustering strategy, called TermCut, is presented in [NQL+11, NLQ+09] to cluster short text snippets by finding core terms in the corpus. The motivation of this method is to cluster short text snippets according to the core terms rather than the similarity between text snippets. In this method, a collection of short text snippets is modelled as a graph in which each vertex represents a piece of a short text snippet and each weighted edge between two vertices measures the relationship between the two vertices. The algorithm then recursively selects a core term and bisects the graph such that the short text snippets in one part of the graph contain the term, whereas the snippets in the other part do not. It is applied to different types of short text snippets, including questions and search results, and the experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.

The paper [NC10] presents a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, referred to as Word Sense Induction (WSI). The method first acquires the senses (i.e. meanings) of a query by means of a graph based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. It then clusters the search results based on their semantic similarity to the induced word senses. The experiments, conducted on datasets of ambiguous queries, show that this method improves search result clustering in terms of both clustering quality and degree of diversification.

Based on probabilistic latent semantic analysis, the framework in [NPH+09] collects a large external data collection called the universal dataset and then builds a clustering system on both the original snippets and a rich set of hidden topics discovered in the universal data collection. The main motivation is that, by enriching short snippets with hidden topics from large resources of documents on the Web, it is able to cluster and label such snippets effectively in a topic-oriented manner without considering entire Web pages. Careful evaluation shows that this method can yield impressive clustering quality.

The research [ZD08] combines Web snippet categorisation, clustering and personalisation techniques to recommend relevant results to users. Using a socially constructed Web directory, such as the Open Directory Project (ODP), it develops a Recommender Intelligent Browser (RIB). By comparing the similarities between the semantics of each ODP category represented by the category-documents and the Web snippets, the Web snippets are organised into a hierarchy. Based on an automatically formed user profile, which takes into consideration desktop computer information and concept drift, the proposed search strategy recommends relevant search results to users. This research also intends to verify text categorisation, clustering and feature selection algorithms in the context where only Web snippets are available.

2.2.3 Multiple Clustering Approaches

The study [BRG07] proposes a method of improving the accuracy of clustering short texts by enriching their representation with additional features from Wikipedia. Given a short text snippet, it creates two query strings, from the title and the description of the text respectively, and uses the two strings as queries to retrieve the top 20 Wikipedia articles from a Lucene index (http://lucene.apache.org/). The titles of the Wikipedia articles are referred to as concepts, which are added into the feature space of the original short text snippet. It runs six different clustering algorithms provided by CLUTO (http://glaros.dtc.umn.edu/gkhome/views/cluto/). The results obtained indicate that this enriched representation of text items can substantially improve clustering accuracy compared to the conventional bag-of-words representation.

In order to improve the performance of Web snippet clustering, the work in [JHR09] applies the ODP to expand the original snippets with related conceptual terms. Using a test dataset of 240 queries, it performs experiments with two clustering techniques: k-means clustering as the non-overlapping approach and STC as the overlapping approach. With the proposed text enrichment method, k-means clustering yields an overall performance improvement of up to 15.51 per cent based on the F1 measure, while STC with text enrichment improves performance by up to 53.71 per cent.

Recent work [RKT11] compares various document clustering techniques, in- cluding k-means, an SVD based method and a graph based approach, and their performance on short text data collected from Twitter. It defines a measure for evaluating the cluster error with these techniques, and the experimental observa- tions show that graph based approach using affinity propagation performs best in clustering short text data with minimal cluster errors.

The work [BK10] proposes a hybrid approach to content clustering that combines the best of Web information retrieval methods and the personal preference information of users. It modifies the STC algorithm and uses a sentence-based approach that considers the relationship between the terms. Thus, it combines the best of the flat and hierarchical approaches in a hybrid manner for effective information retrieval. Experimental results show that this approach has great promise for a wide range of queries.

2.3 Snippet Miner

2.3.1 Overview

A similarity function is vital to the task of document clustering. It is observed that traditional similarity functions, such as cosine similarity, do not work well for snippets, as snippets are usually short and noisy. In this chapter, a new similarity function based on automatically mined domain knowledge is proposed to tackle the former challenge (shortness), together with an outlier-conscious hierarchical clustering algorithm to tackle the latter (noise).

Since the current work deals with computer science citations within Google Scholar, domain knowledge is obtained from another, manually edited bibliography database, DBLP (http://dblp.uni-trier.de/xml/). The proposed framework is based on the intuition that there are usually close associations between words or phrases used in different publication titles from the same authors. Such associations can be aggregated, and word/phrase associations that are supported by many authors are most likely to share the same semantic connotation.

The proposed algorithm has several steps:

Offline The DBLP database is processed to mine the domain knowledge, that is, similarities between pairs of concept phrases.

Online Search results returned by Google Scholar for users' queries are clustered as follows:

   1. retrieve the snippets for a given query from Google Scholar;
   2. extract concepts from each snippet;
   3. evaluate the newly proposed pair-wise similarities between snippets and store the result in a similarity matrix M; the similarity measure takes advantage of the domain knowledge obtained in the offline phase;
   4. run the outlier-conscious hierarchical clustering algorithm to obtain clusters, and generate a label for each cluster.

The rest of the sections describe the above steps in more detail.

2.3.2 Domain-knowledge Mining

This work focuses on automatic methods for mining concept associations from large-scale domain datasets. This is preferable because new terms and acronyms are continuously invented and used in computer science papers; it would be infeasible to manually build or maintain such a knowledge base.

DBLP has been selected as it is one of the most comprehensive computer science bibliography databases and all entries are manually created. For each publication in the DBLP database, the title and author information are extracted in the first step. In the second step, each title is parsed into a set of intermediate features, which are the longest contiguous segments of words in the paper title that do not contain any boundary words (such as stop words and other words like "using" and "based on"). Then, from each intermediate feature, all contiguous length-2 sub-phrases are extracted (intermediate features of one or two words are kept as they are) to form the final features, called concept phrases, or concepts for short. Standard stemming is then applied to the words in the concepts.

For example, the concepts extracted from the paper title “query expansion using local and global document analysis” are shown in Figure 2.1. Since “using” and “and” are boundary words, the intermediate features are “query expansion”, “local” and “global document analysis”. The last feature is further decomposed into “global document” and “document analysis”. The final set of features, or concepts, is marked in blue in Figure 2.1.

[Figure 2.1: Parsing a String into a Set of Concepts. Intermediate features: “query expansion”, “local”, “global document analysis”; final concepts: “query expansion”, “local”, “global document”, “document analysis”.]
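To make the parsing step concrete, the following sketch implements the concept-extraction procedure described above in Python. It is a minimal illustration rather than the thesis code: the boundary-word list (BOUNDARY_WORDS) and the crude suffix-stripping stemmer are hypothetical stand-ins for the stop-word list and the standard stemmer used in this work.

import re

# Hypothetical boundary-word list (stop words plus connectives such as "using", "based", "on").
BOUNDARY_WORDS = {"using", "based", "on", "and", "for", "of", "the", "a", "an", "with", "in", "to"}

def stem(word):
    # Crude suffix-stripping stand-in for a standard stemmer (e.g., Porter).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_concepts(title):
    """Parse a paper title into concept phrases of at most two words."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    # Split the title into intermediate features at boundary words.
    features, current = [], []
    for tok in tokens:
        if tok in BOUNDARY_WORDS:
            if current:
                features.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        features.append(current)
    # Keep short features as they are; decompose longer ones into contiguous length-2 sub-phrases.
    concepts = set()
    for feat in features:
        feat = [stem(w) for w in feat]
        if len(feat) <= 2:
            concepts.add(" ".join(feat))
        else:
            for i in range(len(feat) - 1):
                concepts.add(" ".join(feat[i:i + 2]))
    return concepts

print(extract_concepts("query expansion using local and global document analysis"))
# e.g. {'query expansion', 'local', 'global document', 'document analysi'} (the toy stemmer truncates "analysis")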

In order to uncover hidden associations among concepts, each concept is first mapped to a weighted vector. The set of all authors in the DBLP database is denoted as $A$. Each concept $c_i$ is represented as the following weighted vector:

$$c_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,|A|})$$

where $w_{i,j}$ records the weight of the $j$-th author with respect to concept $c_i$. A weighting scheme is devised that resembles the traditional tf-idf scheme in information retrieval:

$$w(c_i, a_j) = tf(c_i, a_j) \cdot \log \frac{N}{df(a_j)}$$

where $tf(c_i, a_j)$ is the number of times the concept $c_i$ is used by the author $a_j$, $N$ is the number of publications in the database and $df(a_j)$ is the number of publications of the author $a_j$.

Given the vector representation of each concept, the widely used cosine similarity is adopted to evaluate similarities between pairs of concepts; more formally:

$$sim(c_i, c_j) = \frac{\sum_{k=1}^{|A|} w_{i,k}\, w_{j,k}}{\sqrt{\sum_{k=1}^{|A|} w_{i,k}^2}\; \sqrt{\sum_{k=1}^{|A|} w_{j,k}^2}} \qquad (2.1)$$

where $c_i$ and $c_j$ are concepts and $w_{i,k}$ ($w_{j,k}$) is the weight of author $k$ in concept $c_i$ ($c_j$). This similarity function between concepts captures the following intuitions well: a pair of concepts used by many authors in common, or appearing in the same title, will bear a high similarity score; similarly, concept pairs that are not shared by any common authors will have zero similarity.
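As an illustration of the weighting scheme and Equation (2.1), the sketch below builds the author-weighted vector for each concept from (concept, author) usage counts and computes the cosine similarity between two concepts. The tiny in-memory publications list (pairs of concept sets and author lists) is invented sample data; a real run would iterate over the parsed DBLP titles.

import math
from collections import defaultdict

# Invented toy data: each record is (set of concepts in the title, list of authors).
publications = [
    ({"query expansion", "document analysis"}, ["a. smith", "b. jones"]),
    ({"query expansion", "query log"},         ["a. smith"]),
    ({"document analysis"},                    ["c. wu"]),
]

N = len(publications)                          # number of publications in the database
tf = defaultdict(lambda: defaultdict(int))     # tf[concept][author]
df = defaultdict(int)                          # df[author]: number of publications of the author

for concepts, authors in publications:
    for a in authors:
        df[a] += 1
        for c in concepts:
            tf[c][a] += 1

def concept_vector(c):
    # w(c_i, a_j) = tf(c_i, a_j) * log(N / df(a_j))
    return {a: tf[c][a] * math.log(N / df[a]) for a in tf[c]}

def concept_similarity(ci, cj):
    """Cosine similarity between two concepts over the author dimension (Equation 2.1)."""
    vi, vj = concept_vector(ci), concept_vector(cj)
    dot = sum(vi[a] * vj.get(a, 0.0) for a in vi)
    ni = math.sqrt(sum(w * w for w in vi.values()))
    nj = math.sqrt(sum(w * w for w in vj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

print(concept_similarity("query expansion", "query log"))
print(concept_similarity("query log", "document analysis"))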

2.3.3 Snippet Parsing

The user's query is submitted to Google Scholar and snippets are obtained by screen-scraping the returned HTML pages. Each snippet is parsed into a title and a description, which are then concatenated. Stop-word removal and stemming are then performed, followed by domain-specific data cleaning, including the removal of tags such as “[citation]” and the discarding of non-English words.

All the concepts contained in each snippet are then extracted using the same procedure as in the domain-knowledge mining.

2.3.4 Measuring Similarities between Snippets

In order to measure similarities between snippets while drawing on the results mined from the domain knowledge, a hybrid similarity measure is proposed: one part is the traditional cosine similarity, and the other is a new two-step similarity function based on a graph model that considers both direct and indirect associations between all concepts appearing in the snippets.

Direct Association Since each snippet is associated with a set of concepts, the similarity between a pair of snippets is measured based on the similarities between their sets of concepts. It should be noted that standard similarity functions for sets, for example Jaccard similarity, are not appropriate here, as they are ignorant of the semantic relationships between concepts. Therefore, it is proposed to measure the similarities based on a graph model. Given a pair of snippets, a weighted undirected graph G is generated, where nodes in G correspond to concepts appearing in the snippets. There exists an edge with weight d between two nodes if the concept similarity between them, according to Equation (2.1), is d (d > 0).

Figure 2.2 illustrates the graph between two snippets: s1: “query expansion using local and global document processing” and s2: “probabilistic query expan- sion using query logs”.

There are two types of nodes in the graph: (1) those from one snippet that are connected with a node from the other snippet (called active nodes), and (2) those that are not (called dead nodes). For example, in Figure 2.2, the nodes “document analysis” and “probabilistic query” are active nodes, while the node “local” is a dead node.

[Figure 2.2: An Example Direct Association Graph. The concept nodes of s1 (“query expansion”, “local”, “global document”, “document analysis”) and of s2 (“probabilistic query”, “query expansion”, “query log”) are connected by weighted edges where the concept similarity is non-zero; dashed edges connect each snippet node to its own concept nodes.]

Once the graph of concepts is generated, it is enhanced by adding the snippets as new nodes and connecting each snippet node to the concept nodes that it owns. To this end, the weights on such edges (the dashed edges in Figure 2.2) need to be determined. The weights are assigned in a probabilistic way: from a snippet node $s$, a small weight of $\alpha$ is assigned to each of its edges to dead concept nodes; the edge weights for the remaining active concept nodes $c_i$ are calculated as follows:

$$d(s, c_i) = \frac{idf(c_i)}{deg(s)} \cdot (1 - n_{dead} \cdot \alpha)$$

where $idf(c_i)$ is the inverted document frequency of node $c_i$, $deg(s)$ is the degree of node $s$ (that is, the total number of edges from $s$), and $n_{dead}$ is the number of dead nodes.

Based on this weighted, enhanced graph, the similarity between two snippets,

$s_1$ and $s_2$, is defined as:

$$sim_d(s_1, s_2) = \sum_{i=1}^{m} d(s_1, c_i) \left( \sum_{j=1}^{n} d(c_i, c_j) \cdot d(c_j, s_2) \right)$$

where $m$ ($n$) is the number of active nodes of $s_1$ ($s_2$) and $d(x, y)$ is the weight of the edge between nodes $x$ and $y$. Interpreting the edge weights as the probability of moving between the two nodes that define the edge, the above similarity function calculates the overall probability of reaching one snippet node from the other in the enhanced graph.
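The following sketch illustrates the direct-association computation: it builds the snippet-to-concept edge weights as in the formula above and sums the two-step path probabilities between the two snippet nodes. The domain_sim dictionary, the idf values and ALPHA are hypothetical inputs standing in for the mined domain knowledge and tuned parameters; this is a simplified illustration, not the thesis implementation.

ALPHA = 0.01  # hypothetical small weight assigned to edges towards dead concept nodes

def snippet_to_concept_weights(concepts, other_concepts, domain_sim, idf):
    """Edge weights d(s, c_i) from a snippet node s to its own concept nodes."""
    def sim(a, b):
        return domain_sim.get((a, b), domain_sim.get((b, a), 0.0))
    active = [c for c in concepts if any(sim(c, c2) > 0 for c2 in other_concepts)]
    dead = [c for c in concepts if c not in active]
    deg = len(concepts)                    # deg(s): number of edges from the snippet node
    weights = {c: ALPHA for c in dead}     # a small weight alpha for every dead concept
    for c in active:
        # d(s, c_i) = idf(c_i) / deg(s) * (1 - n_dead * alpha), as in the formula above
        weights[c] = idf.get(c, 1.0) / deg * (1.0 - len(dead) * ALPHA)
    return weights, active

def direct_similarity(c1, c2, domain_sim, idf):
    """sim_d(s1, s2): total probability of the walks s1 -> c_i -> c_j -> s2."""
    def sim(a, b):
        return domain_sim.get((a, b), domain_sim.get((b, a), 0.0))
    w1, active1 = snippet_to_concept_weights(c1, c2, domain_sim, idf)
    w2, active2 = snippet_to_concept_weights(c2, c1, domain_sim, idf)
    return sum(w1[ci] * sim(ci, cj) * w2[cj]
               for ci in active1 for cj in active2 if sim(ci, cj) > 0)

# Hypothetical mined concept similarities (Equation 2.1) and idf values
domain_sim = {("query expansion", "query expansion"): 1.0,
              ("query expansion", "probabilistic query"): 0.2,
              ("document analysis", "query log"): 0.18}
idf = {"query expansion": 1.2, "document analysis": 1.5, "local": 2.0,
       "global document": 1.3, "probabilistic query": 1.8, "query log": 1.4}
s1 = {"query expansion", "local", "global document", "document analysis"}
s2 = {"probabilistic query", "query expansion", "query log"}
print(direct_similarity(s1, s2, domain_sim, idf))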

Indirect Association One problem with the previous similarity measure is that it does not consider indirectly related concepts, that is, two concepts connected via a third concept. For example, as shown in Figure 2.3, the nodes “global document” and “query log” are not directly connected, yet they are reachable from each other via the concept “web search” in the domain knowledge. This suggests that “global document” and “query log” do have some similarity.

To model the above observation, each concept in the title is expanded by including all its related concepts in the domain knowledge into the graph. Then the similarity induced by indirect associations is defined as:

$$sim_{ind}(s_1, s_2) = \sum_{i=1}^{m} d(s_1, c_i) \cdot \left( \sum_{k=1}^{p} \sum_{j=1}^{n} d(c_i, c_k) \cdot d(c_k, c_j) \cdot d(c_j, s_2) \right)$$

where $m$ ($n$) is the number of active nodes in $s_1$ ($s_2$), and $p$ is the number of concept nodes that $c_i$ connects to.

[Figure 2.3: An Example Indirect Association Graph. The concept nodes of the two snippets are expanded with related concepts from the domain knowledge (e.g., “document processing”, “web search”, “search engine”), so that “global document” and “query log” become reachable via “web search”.]

Final Similarity Measure By integrating the similarities from direct and indirect associations, a similarity based on the domain knowledge, $sim_{dom}$, is obtained:

$$sim_{dom}(s_i, s_j) = sim_d(s_i, s_j) + sim_{ind}(s_i, s_j)$$

Finally, the traditional cosine similarity function is enhanced with $sim_{dom}$. Empirically, the following combination based on the geometric mean is found to work best:

$$sim(s_i, s_j) = \sqrt{sim_{cosine}(s_i, s_j) \cdot sim_{dom}(s_i, s_j)}$$
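A minimal sketch of the final combination step, assuming the cosine similarity and the direct and indirect domain similarities have already been computed for a snippet pair (the numeric values below are hypothetical):

import math

def domain_similarity(sim_direct, sim_indirect):
    """sim_dom = sim_d + sim_ind."""
    return sim_direct + sim_indirect

def combined_similarity(sim_cosine, sim_dom):
    """Final measure: geometric mean of the cosine and domain-based similarities."""
    return math.sqrt(sim_cosine * sim_dom)

# Example values for one snippet pair
sim_dom = domain_similarity(0.21, 0.09)
print(combined_similarity(0.12, sim_dom))   # about 0.19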

2.3.5 Cluster Generation

The hierarchical agglomerative clustering (HAC) algorithm is chosen to generate clusters.

The HAC algorithm takes as input a pair-wise similarity matrix and produces a nested sequence of partitions. There are several variants of the HAC algorithm with respect to evaluating the similarities between two clusters of objects. The average link algorithm is chosen, that is, the similarity between two clusters is calculated based on average similarities between objects from two clusters. This variant is known to be more reliable and robust to outliers.

Outlier-Conscious Clustering However, even with the average-link HAC algorithm, the quality of the clusters is still substantially affected by outliers. The fundamental reason is that the HAC algorithm implicitly assumes that there are no outliers in the dataset and that every object must belong to a cluster. In the current problem setting, there is abundant noise in the snippets, partly because these publication records and snippets are automatically generated. The HAC algorithm therefore tends to generate many singleton clusters, and almost all of them turn out to be outliers on manual inspection.

Therefore, the following modification to the basic HAC algorithm is proposed, which is called outlier-conscious clustering. The HAC algorithm is run several times; after each run, all singleton clusters are marked as outliers and removed from the input to the next run. Experimental results show that this modification improves the final cluster quality.
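The sketch below is one possible (assumed) implementation of the outlier-conscious modification, using SciPy's average-link agglomerative clustering on a precomputed similarity matrix. Similarities are converted to distances as 1 - sim, and the number of iterations and k are caller-chosen parameters, as in the experiments later in this chapter.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def outlier_conscious_hac(sim_matrix, k, n_iterations=3):
    """Run average-link HAC repeatedly; singleton clusters are marked as outliers and removed."""
    remaining = np.arange(sim_matrix.shape[0])     # indices of snippets still being clustered
    outliers, clusters = [], {}
    for _ in range(n_iterations):
        if len(remaining) < 2:
            break
        dist = 1.0 - sim_matrix[np.ix_(remaining, remaining)]   # similarities -> distances
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")   # average-link HAC
        labels = fcluster(Z, t=k, criterion="maxclust")
        sizes = np.bincount(labels)
        singletons = sizes[labels] == 1            # snippets that ended up alone
        outliers.extend(remaining[singletons].tolist())
        clusters = {}
        for idx, lab in zip(remaining[~singletons], labels[~singletons]):
            clusters.setdefault(int(lab), []).append(int(idx))
        remaining = remaining[~singletons]
    return clusters, outliers

# Toy 5x5 similarity matrix (hypothetical values)
S = np.array([[1.0, 0.8, 0.7, 0.1, 0.0],
              [0.8, 1.0, 0.6, 0.1, 0.0],
              [0.7, 0.6, 1.0, 0.2, 0.0],
              [0.1, 0.1, 0.2, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
print(outlier_conscious_hac(S, k=2))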

Cluster-labelling Generation After generating the clusters, the next step is to generate representative labels for each cluster. The simple heuristic of choosing words that occur in more than 50 per cent of the snippets in the cluster is used. This agrees with the intuition of allowing a cluster to have multiple, frequently occurring topical labels.

2.4 Experiments

Experiments are conducted on a PC with Intel PIII 3.2 GHz CPU, 2 GB RAM and Windows XP. Domain data is collected from DBLP; it provides bibliography information on major computer science journals and proceedings, and indexes more than one million articles.

Google Scholar is used as the query search engine; it is a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines.

Various types of query keywords in the computer science domain are selected. The five example queries used in the experiments are listed in Table 2.1. For each query, the top 100 results are used for clustering. For the proposed outlier-conscious clustering algorithm, 3 iterations are performed; in each iteration, clusters of size 1 are identified as outliers and removed. The number of clusters k varies from 10 to 25.

Query ID   Query
Q1         clustering
Q2         web mining
Q3         relevance feedback
Q4         information retrieval
Q5         query expansion

Table 2.1: Queries

The following algorithms are implemented and compared.

FPF   This is one of the latest proposed methods and is based on the k-centre clustering algorithm [GPPS06].

HAC   This is the outlier-conscious algorithm using only the cosine similarity.

HAC+  This is the outlier-conscious algorithm using the proposed similarity function that leverages domain knowledge.

The effectiveness of the different clustering algorithms is measured using the purity measure [GDV07]. Domain experts were requested to manually examine the clusters and mark irrelevant snippets. Let the number of irrelevant snippets in cluster $C_i$ be $n_i$; then the purity of this cluster is defined as $\rho_i = 1 - \frac{n_i}{|C_i|}$. The overall purity of the clusters is calculated as a weighted average of the purities of the individual clusters:

$$\rho = \sum_i \frac{n_i}{\sum_j n_j}\, \rho_i$$

Domain Knowledge The strength of the proposed clustering algorithm partly lies in the new similarity function, which gives non-zero similarity values for related concepts even if they do not literally share any common word. The closely related concepts for the five queries are shown in Table 2.2. As shown, the proposed mining algorithm is able to extract closely related concepts from the domain knowledge.

Clustering             Web mining            Relevance feedback
Cluster algorithm      Website navigation    Information retrieval
High dimension         Browsing pattern      Image retrieval
Supervised problem     Web usage             Web search
Knowledge discovery    Usage pattern         Search engine
Fuzzy approximation    Pattern discovery     Supervised learning

Information retrieval  Query expansion
Retrieval system       Retrieval feedback
Relevance feedback     Search improvement
Language model         Phrase analysis
Information access     Retrieval technique
Web search             Natural language

Table 2.2: Closely Related Concepts With the Query Concepts

Cluster Quality The purities of the three algorithms with k = 10, 15, 20 and 25 are shown in Figures 2.4, 2.5, 2.6 and 2.7 respectively. It can be observed that HAC+ performs better than the other two algorithms. FPF suffers from its inability to deal with outliers. HAC+ outperforms HAC, as the cosine similarity used in the HAC algorithm fails to capture semantic similarities. As a concrete example, the following articles are placed in one cluster by HAC:

• “An association thesaurus for information retrieval”

• “Query expansion using associated queries”

• “Mining term association rules for automatic global query expansion”

Obviously, HAC groups them into one cluster based on the common occurrence of the word “association”, although the phrases “association thesaurus”, “associated queries” and “association rules” have different meanings in their respective contexts.

In contrast, the proposed HAC+ approach does not restrict concepts to single words, and aims to find the semantic relationships between concepts from the domain data. For example, it is able to put two related papers into one cluster: “on modeling information retrieval with probabilistic inference” and “a hidden markov model information retrieval system”.

It is also observed that, among the five queries, the performance of the HAC+ algorithm is lowest for query Q1. The reason is that the domain dataset (the DBLP dataset) contains article information from the computer science domain only. Consequently, the proposed domain-knowledge-based approach performs well if the snippets returned by the search engine are from the computer science domain. For query Q1 (i.e. “clustering”), it is found that many of the snippets returned by the search engine come from other domains (e.g., physics, medicine, etc.). As a result, the proposed algorithm effectively degenerates to the HAC algorithm for snippets whose contents are not covered by the domain knowledge.

Figures 2.8 and 2.9 examine the purity of the proposed HAC+ with varying numbers of clusters (k). It can be seen that there is a drop in purity with increasing k for all queries. Figure 2.10 plots the number of outliers detected by HAC+ against the number of clusters. As expected, HAC+ is able to detect more outliers as the number of clusters increases.

Tables 2.3–2.7 show sample clusters, containing the titles and the labels generated by HAC+, for all queries.

[Figure 2.4: Cluster Quality (k = 10). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.5: Cluster Quality (k = 15). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.6: Cluster Quality (k = 20). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.7: Cluster Quality (k = 25). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.8: HAC+, k vs. Purity. Purity per query for k = 10, 15, 20, 25.]

[Figure 2.9: HAC+, k vs. Purity. Purity per query against the number of clusters k.]

[Figure 2.10: HAC+, k vs. Outlier. Number of outliers per query against the number of clusters k.]

2.5 Conclusions

This work focuses on clustering snippets returned by the citation-enhanced search engine Google Scholar. A snippet mining framework is built by proposing a novel similarity function based on mined domain knowledge and an outlier-conscious clustering algorithm. The key contributions of the proposed framework are: (1) identification of appropriate features (concepts) for clustering the snippets returned by the Google Scholar search engine, (2) exploration of author information in the domain dataset to generate similarity measures between concept pairs, (3) utilisation of domain information in proposing a new similarity function and (4) analysis of outliers and implementation of an outlier-conscious algorithm in generating the required number of clusters. Overall, the experimental results suggest that domain knowledge can be useful in improving the quality of clusters and detecting outliers. In addition, clustering the snippets returned by the search engine is a reasonable and speedy alternative to downloading the original documents.

Query: clustering        Labels: data, cluster
Cluster sample:
1. automatic subspace clustering of high dimensional data for data mining applications
2. efficient and effective clustering methods for spatial data mining
3. birch: an efficient data clustering method for very large databases
4. clustering data streams
5. [book] algorithms for clustering data
6. clustering countries on attitudinal dimensions: a review and synthesis
7. data clustering: a review

Table 2.3: Cluster Sample for Clustering

In future work, we plan to investigate alternative methods of extracting useful domain knowledge. One interesting direction could be the use of citation information. It is also planned to experiment with datasets in various other domains, to perform a large-scale evaluation of the proposed system and to obtain user feedback to further improve the system.

Query: web mining        Labels: survey, web, research, mining
Cluster sample:
1. [citation] research on web mining-based intelligent search engine
2. the research of web mining
3. [citation] conference tutorial notes: web mining: concepts, practices and research
4. web mining-concepts, applications and research directions
5. web mining research
6. web mining: research and practice
7. [citation] web mining research: a survey, sigkdd explorations
8. [citation] web mining research: a survey
9. web mining research: a survey
10. [citation] research on web mining: a survey
11. web mining: a survey in the fuzzy framework

Table 2.4: Cluster Sample for Web Mining

Query: relevance feedback    Labels: relevance, learning, image, space, feedback, retrieval, semantic, user
Cluster sample:
1. learning and inferring a semantic space from user's relevance feedback for image retrieval
2. learning a semantic space from user's relevance feedback for image retrieval
3. one-class svm for learning in image retrieval
4. learning similarity measure for natural image retrieval with relevance feedback
5. image retrieval with relevance feedback: from heuristic weight adjustment to optimal learning methods

Table 2.5: Cluster Sample for Relevance Feedback

Query: information retrieval    Labels: audio, intelligent, information, multimedia, retrieval
Cluster sample:
1. [book] multimedia information retrieval: content-based information retrieval from large text and audio ...
2. [book] intelligent multimedia information retrieval
3. an overview of audio information retrieval
4. approaches to intelligent information retrieval
5. using linear algebra for intelligent information retrieval

Table 2.6: Cluster Sample for Information Retrieval

Query: query expansion    Labels: expansion, web, query, experiment, trec
Cluster sample:
1. query routing for web search engines: architecture and experiments
2. experiments on interfaces to support query expansion
3. overview of the trec-2001 web track
4. okapi/keenbow at trec-8
5. ucla-okapi at trec-2: query expansion experiments

Table 2.7: Cluster Sample for Query Expansion

Chapter 3

Learning Top-k Transformation Rules

3.1 Introduction

Real data are inevitably noisy, inconsistent or erroneous. For example, there are dozens of correct ways to cite a publication, depending on the bibliography style used; however, there can also be hundreds of citations to the same publication that contain typographical errors, Optical Character Recognition (OCR) errors, or errors introduced when the citation was extracted from Web pages by a program.

Record linkage is the process of bringing together two or more separate records pertaining to the same entity, even if their surface forms are different. It is a cornerstone of ensuring the high quality of mission-critical data, either in a single database or during data integration from multiple data sources. Therefore, it has been used in many applications, including data cleaning and data integration from multiple sources or the Web. Record linkage is known under many different names, including entity resolution and near-duplicate detection, and is a well-studied problem that has accumulated a large body of work [Win99, EIV07].

Most existing record linkage approaches exploit similarities between values of intrinsic attributes, and many similarity or distance functions have been proposed to model different types of errors. For example, edit distance is used to account for typographical errors and misspellings [WXLZ09], whereas Jaro-Winkler distance is designed for comparing English names [WT91] and Soundex is used to account for misspellings due to similar pronunciations. Various similarity measures, such as Jaccard and cosine similarity, are used to measure the similarity of multi-token strings [XWLY08]. The well-known tf-idf heuristic is used to accommodate misspellings in [CRF03, MYC08], and weights are then learned automatically using machine learning techniques [BM02, SB02b, TKM02]. Multiple similarity functions may be used on the same or different attributes of the records, either by simply aggregating them or by treating them as relational input to a classifier [SB02b].

The previous chapter focused on proposing a similarity function to cluster citation snippets. However, similarity functions alone are insufficient when coreferent records have little surface similarity (e.g., 23rd and twenty-third, or VLDB and Very Large Databases). Therefore, existing systems incorporate transformation rules to recognise these domain-specific equivalence relationships. This chapter focuses on addressing this problem.

Traditionally, transformation rules were created manually by experts. Not only is this process tedious, expensive and error-prone, but the generated rules are seldom comprehensive enough. This is the main motivation for semi-automatic methods that learn a set of high-quality candidate rules [ACK09]; domain experts can then manually validate or refine the candidate rules. Hence, it is important that a majority of the candidate rules generated by these algorithms be correct. The rule learning algorithm should also be able to cope with large input datasets and learn top-k rules efficiently. As demonstrated in the experiments, existing methods [ACK09] fail to meet both requirements.

In this chapter, a novel method is proposed to learn top-k transformation rules automatically from a set of input record pairs known to be coreferent. It performs a meticulous local alignment for each pair of records by considering a set of commonly used edit operations, and then generates a number of candidate rules based on the optimal local alignment. Statistics of the candidate rules are maintained and aggregated to select the final top-k rules. Extensive experiments are conducted against the existing state-of-the-art algorithm, and the proposed rule learning algorithm is found to outperform the existing method in both effectiveness and efficiency.

The contributions can be summarised as follows:

• A local-alignment-based rule learning algorithm is proposed. Compared with the global Greedy algorithm in [ACK09], the proposed algorithm generates fewer candidate rules, and these candidate rules are more likely to be correct.

• Extensive experiments are performed using several publicly available real-world datasets; the experimental results show an increase of 3.3 times in the percentage of correct rules compared with the previous approach, and a speed-up of up to 300 times in efficiency.

This chapter is organised as follows: Section 3.2 introduces related work; Section 3.3 presents the proposed algorithm; Section 3.4 discusses the experimental results; and Section 3.5 concludes the study.

3.2 Literature Review

This section describes various record linkage approaches when dealing with complex coreferent bibliography records and discusses the application of various transformation rules when dealing with such complex data.

The thesis [HH96] presents a system for integrating bibliography information from many heterogeneous sources that identifies related bibliography records and presents them together. It describes an author-title clustering algorithm that auto- matically identifies bibliography records that refer to the same work and clusters them together. It tolerates errors and cataloguing variations within the records by using a full-text search engine and an n-gram-based approximate string match- ing algorithm to build the clusters. In experiments with a collection of 240,000 records of the computer science literature, the algorithm identifies more than 90 per cent of the related records and includes incorrect records in less than 1 per cent of the clusters.

Along with a simple baseline method for comparison, the work [GBL98] investigates three methods of identifying citations to identical articles and grouping them together: Word Matching, Word and Phrase Matching, and LikeIt, a method based on the LikeIt intelligent string comparison algorithm introduced in [Yia93]. For all of these methods, it finds that normalising certain aspects of citations (conversion to lowercase, removal of hyphens, removal of citation tags, expansion of common abbreviations and removal of extraneous words and characters) tends to improve the results. Experiments on three sets of citations taken from the neural network literature show that Word and Phrase Matching is the best performer, followed by Word Matching, Baseline and LikeIt.

The paper [LGB99] presents machine-learning techniques that identify vari- ant forms of citations to the same paper. A number of algorithms are presented, including edit distance, word frequency or occurrence, subfields or structure, and probabilistic models. The accuracy and efficiency of all algorithms are quanti- tatively compared for a number of datasets. The algorithm based on word and phrase matching is found to perform best, while the algorithm based on a string edit distance performs poorly in comparison.

The work [MNU00] uses a two-step process of object identification on citations. The key idea involves applying a cheap approximate distance measure to divide the data efficiently into overlapping subsets, called canopies. The more expensive measure, edit distance, is then applied to compute the similarity between the objects that occur in a common canopy. The experimental results on grouping bibliography citations from the reference sections of research papers show that the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude. It also decreases errors by 25 per cent in comparison with a previously used algorithm.

The paper [PMM+02] considers the problem in the context of citation matching, that is, the problem of determining which citations correspond to the same publication. The proposed approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Uncertainty is handled by extending standard models to incorporate probabilities over possible mappings between language terms and domain objects. Inference is based on Markov Chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation datasets show that the method outperforms other citation matching algorithms.

The work in [SB02b] designs a learning-based de-duplication system (ALIAS) that allows the automatic construction of the de-duplication function by using a novel method of interactively discovering challenging training pairs. Unlike an ordinary learner that trains on a static training set, an active learner picks subsets of instances that, when labelled, will provide the highest information gain to the learner. With this approach, the more difficult task of bringing together the potentially confusing record pairs is automated by the learner.

Experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy.

The paper [CR02] presents an adaptive scheme for clustering and matching entity names by employing a set of features such as “match” and “edit distance”. An experimental evaluation on a number of sample datasets shows that adaptive methods sometimes perform better in a particular domain and can be competitive with the best baseline system.

To improve duplicate detection, the MARLIN (Multiply Adaptive Record Linkage with Induction) framework [BM03b] employs learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. It presents two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space-based measure that employs a Support Vector Machine (SVM) for training. Experimental results on several real-world datasets containing duplicate records show that MARLIN can improve duplicate detection accuracy over traditional techniques.

The paper [HGZ+04] describes two supervised learning approaches to dis- ambiguate name entities in citations. The Naive Bayes approach determines the author with the highest posterior probability of writing the paper of a citation, and the SVM approach classifies a test citation to the closest author class. Both approaches use three types of citation attributes: co-author names, paper title key- words and journal title keywords, and they achieve more than 90 per cent accuracy in disambiguating author names. Experiments show that co-author names appear to be the most robust attribute for name disambiguation, whereas using journal title words provides better results than paper title words.

The author resolution problem [BG04] is formulated as an iterative process by using co-author relationships as additional context information. Extensive evaluations on synthetically generated bibliography data show that iterative de- duplication can be a powerful and practical approach for data cleaning.

Based on conditional random fields [LMP01], the approach in [SD05] makes decisions collectively, performing simultaneous inferences for all candidate match pairs and allowing information to propagate from one candidate match to another via the attributes (or fields) that they have in common. For example, a paper in

Proc. IJCAI-03 that is found to be the same paper that appears in Proc. 18th

IJCAI implies that the two strings refer to the same venue, which in turn can help match other pairs of IJCAI papers. Experiments on bibliography databases show that the proposed approach outperforms the standard models making decisions independently.

The work [CM05] constructs a conditional random field model of de-duplication to capture the relational dependencies of the database records, and then employs a novel relational partitioning algorithm to de-duplicate records jointly. For two citation-matching datasets, it shows that collectively de-duplicating paper and venue records results in an error reduction of up to 30 per cent in venue de-duplication and up to 20 per cent in paper de-duplication.

In [DHM05], new methods are designed for reference comparison that effectively exploit the associations between references. The method first accumulates positive and negative evidence by propagating information between reconciliation decisions and then gradually enriches references by merging attribute values.

The experiments show that the proposed method can considerably improve pre- cision and recall over standard methods on a diverse set of personal information datasets and standard citation datasets.

The work [LOKP05] devises effective and efficient solutions to two practical problems: Mixed Citations (i.e. author names are homonyms) and Split Citations

(i.e. citations of the same author appear under different name variants). They present an effective solution based on a state-of-the-art sampling-based approxi- mate join algorithm and associated information of author names (e.g., co-authors, titles or venues). The effectiveness of the proposed approach is verified through preliminary experimental results.

The papers [KMC05, KM06] propose a domain-independent data-cleaning approach for reference disambiguation, referred to as Relationship-based Data Cleaning (RelDC). For the purpose of disambiguation, RelDC systematically exploits the relationships among entities and views the database as a graph of entities linked to each other via relationships. First, it utilises a feature-based method to identify a set of candidate entities (choices) for a reference to be disambiguated. Second, graph-theoretic techniques are used to discover and analyse relationships between the entity containing the reference and the set of candidates. The extensive empirical evaluation of the approach demonstrates that RelDC improves the quality of reference disambiguation beyond traditional techniques.

A new ontology-driven solution to the entity disambiguation problem in unstructured text is proposed in [HAMA06]. Going beyond traditional syntax-based disambiguation techniques, it uses different relationships in a document, as well as in the ontology, to provide clues for determining the correct entity. The effectiveness of this approach is demonstrated through evaluations against a manually disambiguated document set containing over 700 entities. The evaluation was performed over DBWorld announcements using an ontology created from DBLP (comprising over three million entities), and the results demonstrate its applicability to scenarios involving real-world data.

The work [CKM07] presents a graphical approach for entity resolution and its empirical evaluation on datasets taken from two different domains. It demonstrates that a technique that measures the degree of interconnectedness between various pairs of nodes in the graph can significantly improve the quality of entity resolution. Further, it presents an algorithm for making that technique self-adaptive to the underlying data, thus minimising the required participation of the domain analyst and further improving the disambiguation quality.

A recent article [BdCdMG+11] proposes similarity functions especially designed for the digital library domain that identify duplicated bibliography metadata records in an efficient and effective way. These functions compare proper nouns and support spelling variations, omissions of middle names, abbreviations and inversions of names. The experimental results show that the proposed functions improve the quality of metadata de-duplication compared to four different baselines, and achieve statistically equivalent results compared to a state-of-the-art method for replica identification based on genetic programming.

The paper [BBHG11] extends the work of [BdCdMG+11] so that, instead of setting thresholds on the scores returned by the similarity functions, the scores are used to train classification algorithms that automatically identify duplicate references. The experiments show that the classifiers improve the quality of the results by up to 11 per cent compared to the unsupervised heuristic-based approach.

The study [SK08] presents an overview of the duplication detection algorithms and an up-to-date state of their application in different library systems. Individual algorithms, their application processes for duplicate detection and their results are described based on available literature. The study finds that, when algorithms are applied in one step, a faster application is achieved, but the percentage of database clean up usually remains low. Most algorithms are two-step applications, which result in a greater improvement in database quality.

To overcome the limitations of previous algorithms that use active learning for record matching, the article [AGK10] proposes a new active learning algorithm that generalises the techniques for finding maximal frequent itemsets and cleverly navigates through the similarity space to find a conjunction of similarity thresh- olds. The experimental results show that the proposed algorithm is able to give probabilistic guarantees on the quality of the result while requiring fewer samples than passive learning algorithms.

Apart from the approaches cited above, another category of approaches to the record linkage problem is to model complex or domain-specific transformation rules. The learned rules can then be used to identify more coreferent pairs. Some of the related works are described below.

In order to identify matching objects, Active Atlas [TKM01,TKM02] applies a set of domain-independent string transformations to compare the objects’ shared attributes. It simultaneously learns to tailor both domain independent transfor- mations and mapping rules to a specific application domain through limited user input. The experimental results demonstrate that this approach achieves higher accuracy and requires less user involvement than previous methods across various application domains.

The article [MNK+05] presents an algorithm HFM (Hybrid Field Matcher), which combines the best aspects of the machine-learning and expert systems ap- proaches to create expert-like rules for field matching. In the first step, it uses a library of heterogeneous transformations that enables it to capture complex re- lationships between two field values. In the second step, machine-learning tech- niques are used to automatically customise these transformations for a given do- main so that highly accurate decisions can be made. The experiments show that

HFM produces superior results in domains where simpler field matching metrics, including Active Atlas [TKM02], fail to capture important distinctions.

The work [ACDK08] shows that existing domain knowledge, encoded as rules, can be used effectively to address the synonym problem and this can make the disambiguation task simpler, without the need for much training data. It ex- amines a subset of application scenarios in named entity extraction, categorises the possible variations in entity names and defines rules for each category. Using these rules, synonyms are generated and matched to the actual occurrence in the data sets.

The works in [ACK08, ACGK08] propose a framework that leverages domain knowledge to derive transformation rules. In their works, transformations are also learned from readily available specialised tables from various domains and online resources. String variations are captured by combining a traditional similarity function with the derived set of transformation rules. These transformations, coupled with an underlying similarity function, are used to define the similarity between two strings. The experiments over real data show that incorporating transformations significantly enhances record matching quality, and that the performance of computing a similarity join is improved by orders of magnitude through this technique.

The paper [MK09] presents a data mining approach to discover heterogeneous transformations between two datasets, without labelled training data. The pro- posed approach first finds a set of possible matches based on the cosine similarity between record pairs, and then mines transformations with the highest mutual information among these pairs. The experiments demonstrate that it discovers many types of specialised transformations, such as synonyms and abbreviations, and shows that exploiting these transformations can improve record linkage.

Most of the previous work relies on domain knowledge or some external knowledge to learn, from a given pair of matching records, useful transformation rules that can transform one record into the other. The recent framework in [ACK09] generates transformation rules without relying on any user-supplied input. The proposed techniques primarily rely on the alignment of the tokens of a string pair, applying a Greedy approach to the token subsequences to learn useful transformations. The main idea is to identify candidate rules from the unmapped tokens of each record pair and then compute aggregate scores of the candidate rules to find the top-k high-quality rules. Experiments over real-life data show better accuracy and scalability to larger datasets compared to previous approaches.

Although only a few works on record linkage focus on transformation rules, such rules have been widely employed in many other areas, and automatic rule learning algorithms have been developed accordingly.

In NLP, the work [Tur02] finds sets of synonyms by considering word co- occurrences. Later a similar idea is utilised in [PKM03] to identify paraphrases and grammatical sentences by examining co-occurrences of sets of words.

In linguistics, the article [SH97] introduces an original data structure and efficient algorithms that learn certain families of transformations relevant to part-of-speech tagging and phonological rule systems. The rule-based approach to the automated learning of linguistic knowledge [Bri95] has been shown, for a number of tasks, to capture information in a clearer and more direct fashion without a compromise in performance.

Recently, the articles [BYKS09b, AKL+09] address the problem of discovering rules that transform a given URL to others that are likely to have similar content. Given a set of URLs partitioned into equivalence classes based on content (URLs in the same equivalence class have similar content), they address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. Without examining page contents, these rewrite rules are used to canonise URL names, thus reducing crawling overhead and increasing crawl efficiency; for example, http://en.wikipedia.org/?title=* and http://en.wikipedia.org/wiki/* always refer to the same Web page.

Another closely related area is the substitution rules used in query rewriting

[JRMG06, RBC+08]. For example, when a user submits a query apple music player to a search engine, the query may be changed to apple ipod. The substi- tution rules are mainly mined from query logs, and the key challenges are how to

find similar queries and how to rank them.

3.3 The Local-alignment-based Algorithm

Similar to [ACK09], the input to transformation rule learning algorithms is a set of coreferent record pairs. The overall idea of the proposed new algorithm is to perform careful local alignment for each record pair, generate candidate rules from the optimal local alignment and aggregate the strength of the rules over all input pairs to winnow high-quality rules from all of the candidate rules.

3.3.1 Pre-processing the Input Data

Standard pre-processing is performed for input record pairs, which includes removing all non-alphanumeric characters except spaces and then converting char- acters to lower case. The case information is not used, as it is not reliable for noisy input data. Stopwords are removed and the strings are then tokenised into a se- quence of tokens using white space as the separator.
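A minimal sketch of the pre-processing step, assuming a small illustrative stop-word list (STOPWORDS is a placeholder; the exact list is not prescribed here):

import re

STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to"}  # placeholder list

def preprocess(record):
    """Lower-case, strip non-alphanumeric characters, drop stopwords, tokenise on white space."""
    record = record.lower()
    # Non-alphanumeric characters are replaced with a space here, a design choice that
    # splits hyphenated ranges such as "219-230" into separate tokens.
    record = re.sub(r"[^a-z0-9 ]+", " ", record)
    return [t for t in record.split() if t not in STOPWORDS]

print(preprocess("J. Karkkainen. Sparse Sufix Trees. pp. 219-230, 1996"))
# ['j', 'karkkainen', 'sparse', 'sufix', 'trees', 'pp', '219', '230', '1996']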

3.3.2 Segmentation

The goal of segmentation is to decompose a record into a set of semantic constituents known as fields, that is, substrings of tokens in a tokenised record. For example, bibliography records can usually be segmented into authors, title and venue fields. The proposed framework does not rely on a specific type of segmentation method; one can use either supervised methods (e.g., CRF [LMP01]) or simple rule-based segmentation methods (e.g., based on punctuation in the raw input records).

Although this step is optional, appropriate segmentation is beneficial to gen- erating better rules and faster execution of the algorithm for three reasons. First, it does not consider mappings for two tokens in different fields in two records.

Hence, fewer candidate rules are generated, and this speeds up the algorithm.

Second, most rules generated across fields are actually erroneous. For exam- ple, comparing an author name with initial p in the authors field with the token proceedings that appears in the venue field would generate a wrong rule, with p as an abbreviation of proceedings. Third, it is possible that different parameter settings can be used to learn rules for a particular field. For example, transfor- mation rules for author names (e.g., omitting the middle name) probably do not apply for paper titles.

3.3.3 Local Alignment

This phase performs field alignment and then finds the optimal local alignment between the values of the corresponding fields.

3.3.3.1 Computing the Optimal Local Alignment

We need to find a series of edit operations with the least cost to transform one string into another. After analysing common transformations, the following set of common edit operations is considered:

• copy: copy the token exactly;

• abbreviation: allows one token to be a subsequence of another token, for example, department ⇔ dept;

• initial: allows a single-letter token to be equivalent to the first letter of another token, for example, peter ⇔ p;

• edit: allows the usual edit operations (i.e. insertion, deletion and substitution), for example, schütze ⇔ schuetze;

• unmapped: tokens that are not involved in any transformation are denoted as unmapped tokens.

In addition to assigning a cost to each of the above edit operations, the operations are also prioritised as copy > initial > abbreviation > edit > unmapped. That is, if a higher priority operation can convert one token into another, operations of lower priority are not considered. For example, between the tokens department and dept, since an abbreviation operation can change one into the other, the edit operation is not considered. The cost of copy is always 0, and the cost of unmapped is always higher than the other costs; the costs of the other operations are subject to tuning. Algorithm 1 performs such local alignment and returns the minimum cost between two strings.

For example, Figure 3.1 shows two strings S and T (i.e., there is no segmentation) and the optimal local alignment between them, where white cells are copied, green cells are unmapped, and yellow cells involve other types of edit operations.

Algorithm 1: AlignStrings(S, T)
 1  c ← 0
 2  for each token w ∈ S do
 3      mincost ← min{cost(w, w′) | w′ ∈ T}    /* follows the priorities of edit operations */
 4      if w is not unmapped then
 5          c ← c + mincost
 6  for each unmapped token w ∈ S ∪ T do
 7      c ← c + unmapped_cost(w)
 8  return c

S: Juha Kärkkäinen: Sparse Suffix Trees. COCOON 1996, Hong Kong. 219-230
T: J. Karkkainen. Sparse Sufix Trees. Second Annual International Conference on Computing and Combinatorics, pp. 219-230, 1996

[Figure 3.1: Optimal Local Alignment. White cells are copied tokens, green cells are unmapped tokens (e.g., COCOON, Hong, Kong, pp) and yellow cells involve other edit operations (e.g., Juha ⇔ J, Kärkkäinen ⇔ Karkkainen, Suffix ⇔ Sufix).]
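The sketch below mirrors Algorithm 1 in Python. The concrete cost values (COSTS) and the edit-distance threshold are illustrative assumptions; only the priority order and the structure of the computation follow the description above, and the treatment of unmapped tokens in T is a simple approximation.

def is_subsequence(short, long):
    it = iter(long)
    return all(ch in it for ch in short)

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance (single-row formulation).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Illustrative costs; copy is free and unmapped is the most expensive.
COSTS = {"copy": 0.0, "initial": 0.2, "abbreviation": 0.3, "edit": 0.5, "unmapped": 1.0}

def token_cost(w, w2):
    """Cost of mapping token w to w2, following the operation priorities."""
    if w == w2:
        return COSTS["copy"], "copy"
    if (len(w) == 1 and w2.startswith(w)) or (len(w2) == 1 and w.startswith(w2)):
        return COSTS["initial"], "initial"
    if is_subsequence(w, w2) or is_subsequence(w2, w):
        return COSTS["abbreviation"], "abbreviation"
    if edit_distance(w, w2) <= 2:                      # small, assumed threshold
        return COSTS["edit"], "edit"
    return COSTS["unmapped"], "unmapped"

def align_strings(S, T):
    """Return the minimum alignment cost between two tokenised strings (cf. Algorithm 1)."""
    cost, mapped = 0.0, set()
    for w in S:
        best = min((token_cost(w, w2) for w2 in T), key=lambda x: x[0],
                   default=(COSTS["unmapped"], "unmapped"))
        if best[1] != "unmapped":
            cost += best[0]
            mapped.add(w)
    for w in S:
        if w not in mapped:
            cost += COSTS["unmapped"]
    # Approximation: tokens of T that cannot be mapped to any token of S pay the unmapped cost.
    for w2 in T:
        if all(token_cost(w, w2)[1] == "unmapped" for w in S):
            cost += COSTS["unmapped"]
    return cost

S = ["juha", "karkkainen", "sparse", "suffix", "trees", "cocoon", "1996"]
T = ["j", "karkkainen", "sparse", "sufix", "trees", "pp", "219230", "1996"]
print(align_strings(S, T))   # 3.5 with the illustrative costs above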

3.3.3.2 Aligning Fields

If the input records have been partitioned into fields, these fields also need to be aligned. In the easy case, where each field has a class label (as those output by the supervised segmentation methods), the alignment of fields is trivial — fields with the same class label (e.g., authors to authors) are just aligned. If fields do not have class labels, the Algorithm 2 is used to find the optimal alignment — an alignment such that the total cost is minimum. This is done by reducing the problem into a maximum weighted bipartite graph matching problem, which can be efficiently solved by the Hungarian algorithm [Kuh55] in O(V 2 log V +VE) = 68

3 O(Bmax ), where Bmax is the maximum number of fields in the records.

Algorithm 2: AlignBlocks(X, Y)
 1  Construct a weighted bipartite graph G = (A ∪ B, E, λ) such that there is a 1-to-1 mapping
    between nodes in A and the fields in X, and a 1-to-1 mapping between nodes in B and the
    fields in Y; E = {e_ij} connects A_i and B_j, and the weight of an edge is the negative of
    its cost, i.e., λ(e_ij) = −AlignStrings(A_i, B_j)
 2  M ← FindMaxMatching(G)    /* use the Hungarian algorithm */
 3  return M
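Field alignment can be implemented directly with an off-the-shelf assignment solver. The sketch below builds a cost matrix from pairwise field costs and solves it with SciPy's Hungarian-algorithm implementation, scipy.optimize.linear_sum_assignment. The field_cost function is a trivial token-overlap stand-in for AlignStrings, used only so the example is self-contained.

import numpy as np
from scipy.optimize import linear_sum_assignment

def field_cost(field_a, field_b):
    # Stand-in for AlignStrings: the number of tokens that the two fields do not share.
    a, b = set(field_a.split()), set(field_b.split())
    return len(a ^ b)

def align_fields(fields_x, fields_y):
    """Optimal 1-to-1 alignment of fields, minimising the total alignment cost (cf. Algorithm 2)."""
    cost = np.array([[field_cost(fx, fy) for fy in fields_y] for fx in fields_x])
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return [(fields_x[i], fields_y[j]) for i, j in zip(rows, cols)]

X = ["juha karkkainen", "sparse suffix trees", "cocoon 1996 hong kong 219-230"]
Y = ["sparse sufix trees", "karkkainen j", "conference on computing and combinatorics pp 219-230 1996"]
for fx, fy in align_fields(X, Y):
    print(fx, "<->", fy)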

3.3.3.3 Multi-token Abbreviation

Some tokens remain unmapped because they are involved in multi-token abbreviations. To address this issue, the abbreviation operation is generalised to include multiple-token to one-token abbreviations. For example, in Figure 3.1, one such instance is Conference on Computing and Combinatorics ⇔ COCOON.

Multi-token abbreviations are classified into the following categories:

• Acronym: A set of tokens [u_1, u_2, . . . , u_k] maps to a token v via an acronym mapping if there exists a prefix length l_i for each u_i such that u_1[1, l_1] ◦ u_2[1, l_2] ◦ . . . ◦ u_k[1, l_k] = v, where ◦ concatenates two strings. For example, association computing machinery ⇔ acm.

• Partial Acronym: A set of tokens [u_1, u_2, . . . , u_k] maps to a token v via a partial acronym mapping if there exists a prefix or suffix v_s of v, longer than a minimum length threshold, and a subsequence ss(u_i) of each u_i, such that ss(u_1) ◦ ss(u_2) ◦ . . . ◦ ss(u_k) = v_s. That is, v_s is equal to a subsequence of the concatenated string u_1 ◦ . . . ◦ u_k. For example, conference on knowledge discovery and data mining ⇔ sigkdd.

[Figure 3.2: Optimal Local Alignment After Recognising Multi-token Abbreviation (R4). The alignment of Figure 3.1 extended with the mapping Conference on Computing and Combinatorics ⇔ COCOON, denoted R4.]

It should be noted that such possible mappings are checked only among the contiguous unmapped tokens in both directions in two strings. In addition, the search is prioritised, where acronym > partial acronym > others.

Algorithm 3: MultiToken-Abbreviation(X, Y)
 Input: X and Y are the input record pair
 1  for each remaining unmapped token v ∈ X do
 2      for each remaining contiguous set of unmapped tokens u in Y do
 3          str ← u_1 ◦ . . . ◦ u_k
 4          if there exists a prefix or suffix v_s of v such that v_s is a subsequence of str then
 5              partialAcronym ← true
 6              if each match in u_i starts from its first character then
 7                  fullAcronym ← true
 8          if partialAcronym = true or fullAcronym = true then
 9              remove the matched tokens from X and Y

The algorithm to discover both types of acronyms is depicted in Algorithm 3.

As illustrated in Figure 3.2, the partial acronym mapping, denoted by R4, is recognised.
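The following sketch checks whether a single token abbreviates a contiguous run of tokens, distinguishing full acronyms (built from token prefixes, in order) from partial acronyms (a long-enough prefix or suffix of the token is a subsequence of the concatenated run), along the lines of Algorithm 3. MIN_LEN and the zero-length prefix allowance (so that tokens such as "and" may contribute nothing to an acronym) are assumed parameters, not values taken from the thesis.

MIN_LEN = 3   # assumed minimum length for the prefix/suffix of v in a partial acronym

def is_acronym(v, tokens):
    """Can v be written as u_1[:l_1] + u_2[:l_2] + ... + u_k[:l_k], prefixes taken in token order?"""
    if not tokens:
        return v == ""
    head, rest = tokens[0], tokens[1:]
    for l in range(0, len(head) + 1):     # l = 0 lets a token contribute nothing (assumption)
        if v.startswith(head[:l]) and is_acronym(v[l:], rest):
            return True
    return False

def is_subsequence(vs, s):
    it = iter(s)
    return all(ch in it for ch in vs)

def classify_abbreviation(v, tokens):
    """Return 'acronym', 'partial', or None for token v against a contiguous run of tokens."""
    if is_acronym(v, tokens):
        return "acronym"
    concat = "".join(tokens)
    # Partial acronym: a sufficiently long prefix or suffix of v is a subsequence of the concatenation.
    for k in range(len(v), MIN_LEN - 1, -1):
        if is_subsequence(v[:k], concat) or is_subsequence(v[-k:], concat):
            return "partial"
    return None

print(classify_abbreviation("acm", ["association", "computing", "machinery"]))                 # acronym
print(classify_abbreviation("sigkdd", ["knowledge", "discovery", "and", "data", "mining"]))    # partial
print(classify_abbreviation("cocoon", ["conference", "on", "computing", "and", "combinatorics"]))  # partial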

3.3.4 Obtaining top-k Rules

The top-k rules are selected by generating the candidate rules in the first phase and then computing and aggregating their scores across all input pairs of records.

Definition 1 (Rule) A rule is in the form of lhs ⇔ rhs, where both lhs and rhs are a sequence of tokens. The rule means that lhs is equivalent to rhs.

Definition 2 (Atomic Rule) An atomic rule is a rule where either the lhs or the rhs is a single token or empty (denoted as ⊥).

It is to be noted that an omission rule is a special rule where one side of the rule is empty; for example, since pp is still unmapped in Figure 3.2, an omission rule pp ⇔ ⊥ is generated.

Depending on the number of tokens on each side of the rule, a rule can be a 1-1 rule (e.g., peter ⇔ p), a 1-n rule (e.g., vldb ⇔ very large databases) or an n-m rule (e.g., information systems ⇔ inf sys). In practice, it is observed that almost all n-m rules can be decomposed into several 1-1 or 1-n rules; for example, the n-m rule above can be decomposed into two 1-n rules: information ⇔ inf and systems ⇔ sys. Therefore, the present work focuses on finding atomic rules and assembling them to form n-m rules.

3.3.4.1 Generating Candidate Rules

At this stage, the rules are generated from mapped tokens and unmapped to- kens in different manners:

• For mappings that are identified either by local alignment (Algorithm 1) or by multi-token abbreviation (Algorithm 3), the rules can be easily generated. It is to be noted that the trivial copying rules (i.e., A ⇔ A) are not generated. Adjacent rules are also combined into n-m rules; for example, once the two mappings information ⇔ inf and systems ⇔ sys are identified, and if information and systems are adjacent, as well as inf and sys, then the rule information systems ⇔ inf sys is also generated.

• For each of the remaining unmapped tokens in one string, it is postulated that it might be deleted (i.e., transformed to ⊥) or mapped to every contiguous subset of the unmapped tokens in the other string. This is done for both strings, and it is expected that, although many of these candidate rules are invalid, some valid rules (usually corresponding to complex, domain-specific transformations sharing little surface similarity; one example found in our experiments is maaliskuu ⇔ march, where maaliskuu in Finnish means March) will appear frequently when aggregated over a large number of input pairs. To avoid generating too many invalid rules from the above postulation, the rules with support less than the minimum threshold (2) are filtered out; as a result, most of the infrequent noisy rules are removed from the generated candidate rules. A sketch of this generation step is given after the example rules below.

Based on the mappings in Figure 3.2, the following rules are generated from mapped tokens:

• Juha ⇔ J,

• Kärkkäinen ⇔ Karkkainen,

• Juha Kärkkäinen ⇔ Karkkainen J,

• Suffix ⇔ Sufix,

• COCOON ⇔ Conference on Computing and Combinatorics.

Candidate rules generated from unmapped tokens pp are:

• pp ⇔ ⊥

• pp ⇔ Hong

• pp ⇔ Kong

• pp ⇔ Hong Kong
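The generation step for unmapped tokens can be sketched as follows (an illustrative Python sketch, not the thesis implementation; the same procedure would be applied in both directions). It reproduces the candidate rules for pp shown above.

def rules_from_unmapped(unmapped_x, unmapped_y):
    # An unmapped token is postulated to be deleted (mapped to ⊥) or mapped to
    # every contiguous subset of the unmapped tokens on the other side.
    rules = []
    for v in unmapped_x:
        rules.append((v, "⊥"))                      # omission rule, e.g. pp ⇔ ⊥
        for i in range(len(unmapped_y)):
            for j in range(i + 1, len(unmapped_y) + 1):
                rules.append((v, " ".join(unmapped_y[i:j])))
    return rules

print(rules_from_unmapped(["pp"], ["Hong", "Kong"]))
# -> [('pp', '⊥'), ('pp', 'Hong'), ('pp', 'Hong Kong'), ('pp', 'Kong')]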

3.3.4.2 Score of a Rule

Definition 3 The score of a rule R is defined as:

score(R) = log(1 + freq(R)) · log(1 + len(R)) · wt(R),   (3.1)

where freq(R) is the number of occurrences of the rule R, len(R) is the total number of tokens in R, and wt(R) is the weight assigned based on the type of the rule.

The above heuristic definition takes into consideration the popularity of a rule, its length and its type. The length component is important because, whenever ir ⇔ information retrieval holds, ir ⇔ information also holds; hence freq(ir ⇔ information retrieval) ≤ freq(ir ⇔ information), and without the length component the partially correct rule would be ranked at least as high as the complete rule. The rule-type component is used to capture the intuition that, for example, a rule generated from mapped tokens is more plausible than one generated by (randomly) pairing unmapped tokens.
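As a concrete illustration, the score can be computed as in the following minimal sketch; the type-dependent weights are the values reported later in Section 3.4.1 for the LA algorithm (mapped tokens: 4, exact acronym: 3, partial acronym: 2, others: 1), and the function names are assumptions of this sketch.

import math

TYPE_WEIGHTS = {"mapped": 4, "acronym": 3, "partial acronym": 2, "other": 1}

def score(freq, lhs_tokens, rhs_tokens, rule_type):
    length = len(lhs_tokens) + len(rhs_tokens)   # len(R): total tokens in the rule
    return math.log(1 + freq) * math.log(1 + length) * TYPE_WEIGHTS[rule_type]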

3.3.4.3 Select the top-k Rules

The complete algorithm is given in Algorithm 4, which learns top-k rules with the maximum scores from a set of input pairs of records.

Algorithm 4: TopkRule(D, k)

Input: D = {(X1, Y1), . . . , (XN, YN)} are N input record pairs
1   R ← ∅;
2   for each input record pair pi = (X, Y) do
3       bX ← segment X into partitions;
4       bY ← segment Y into partitions;
5       M ← AlignBlocks(bX, bY);
6       for each candidate rule R generated from the mapping M do
7           if R ∉ R then
8               R.support ← {pi};
9               R ← R ∪ {R};
10          else
11              update the R ∈ R so that R.support now includes pi;
12  for each candidate rule R ∈ R do
13      R.score ← calcScore(R.support);   /* based on Equation (3.1) */
14  output ← ∅;
15  i ← 1;
16  while i ≤ k do
17      TopRule ← arg maxR∈R {R.score};
18      UpdateRules(R, TopRule);
19      output ← output ∪ {TopRule};
20      i ← i + 1;
21  return output

In Algorithm 4, Lines 1–11 generate all candidate rules from the mappings obtained for each record pair by considering the best local alignment and possible multi-token abbreviations. Lines 12–13 calculate the score of each candidate rule. The rule with the maximum score, TopRule, is then iteratively selected (Lines 16–20), and its support is withdrawn from the other conflicting candidate rules by the procedure UpdateRules. This process is repeated until k high-quality rules are found.

Algorithm 5: UpdateRules(R, TopRule)
1  {p1, p2, . . . , pl} ← the support of rule TopRule;
2  for each pi do
3      CR ← all the rules that conflict with TopRule from the record pair pi;
4      for each rule R ∈ CR do
5          R.support ← R.support \ {pi};

The procedure UpdateRules is illustrated in Algorithm 5. A rule R1 is said to conflict with another rule R2 if one side of R1 is identical to one side of R2. Two typical types of conflicting rules are listed below, followed by a sketch of the selection loop:

• A ⇔ B and A ⇔ C

• A ⇔ BCD and A ⇔ B
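The selection loop of Algorithms 4 and 5 can be sketched in Python as follows. The rule objects (with lhs, rhs, support as a set of record-pair identifiers, and a score field), the conflict test, and the re-scoring of rules from their current support in each iteration are simplifying assumptions of this sketch, not the thesis implementation.

def conflicts(r1, r2):
    # Two rules conflict when they share one side, e.g. A <-> B vs A <-> C.
    return r1 is not r2 and (r1.lhs == r2.lhs or r1.lhs == r2.rhs
                             or r1.rhs == r2.lhs or r1.rhs == r2.rhs)

def top_k_rules(rules, k, calc_score):
    output = []
    for _ in range(k):
        for r in rules:
            r.score = calc_score(r)                   # e.g. Equation (3.1) on the current support
        candidates = [r for r in rules if r not in output and r.support]
        if not candidates:
            break
        top = max(candidates, key=lambda r: r.score)  # Line 17 of Algorithm 4
        output.append(top)
        for pair in list(top.support):                # Algorithm 5: UpdateRules
            for r in rules:
                if conflicts(r, top):
                    r.support.discard(pair)
    return output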

Analysis of the Algorithm. Let the number of input record pairs be N and let the average number of candidate rules generated by each record pair be f; thus, |R| = f · N. The time complexity of Algorithm 4 is O(k · (|R| + N · f)) = O(k|R|) = O(k · N · f). In practice, it is observed that only a constant number of rules conflict with the TopRule in each of the k iterations; hence, the time complexity is expected to be O(k · N).

3.4 Experiment

In this section, the experimental evaluations and analyses are performed.

3.4.1 Experiment Setup

The following algorithms are used in the experiments:

Greedy  This is the state-of-the-art algorithm for learning top-k transformation rules, proposed by Microsoft researchers [ACK09]. The approach first analyses the differences between the strings, using the hypothesis that consistent differences occurring across many examples are indicative of a transformation rule. Based on this intuition, a rule learning problem is formulated that seeks a concise set of transformation rules accounting for a large number of differences. The algorithm is linear in the input size, which allows it to scale easily with the number of input examples.

LA This is the proposed local-alignment-based algorithm.

It is to be noted that, for the proposed LA algorithm, the weights set for the various types of rules in Equation (3.1) are: rules from mapped tokens: 4, exact acronym: 3, partial acronym: 2, and others: 1. A partitioning method is used to segment each record into three fields, based on an algorithm that takes the local alignment costs into consideration, as described in the next section.

All experiments are performed on a PC with an AMD Opteron 8378 CPU and 96GB of memory, running Linux 2.6.32. All programs are implemented in Java.

3.4.1.1 Unsupervised Segmentation

Based on an unsupervised approach, the strings are partitioned into a given number of fields. For a given value of n, each string is partitioned into n disjoint fields, such that each field in one string shares the maximum number of exact tokens with the corresponding field in the other string. The key intuition is that, if two fields share most of their exact tokens, they are more likely to refer to the same key field. This is obtained by partitioning the strings into a set of blocks. Two types of blocks are generated: Mapped and Unmapped.

Definition 4 For a pair of strings S and T , a Mapped block in S is defined as the subsequence of tokens of maximum length that exists in T . An Unmapped block in S is a token that does not have an exact match token in T .

Example 3.4.1 Consider two strings S: A B C D E F G and T: D B C M F N E.

Mapped blocks in S are B C D and E F; Mapped blocks in T are D B C, F and E.

Unmapped blocks in S are A and G; Unmapped blocks in T are M and N.

The strings are then partitioned into a set of n disjoint subsequences of blocks (combinations of Mapped and Unmapped blocks) representing the fields. All possible ways to divide the strings into n fields are then enumerated.
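The enumeration step can be sketched as follows (an illustrative Python sketch under the assumption that each record is already represented as a list of blocks, each block being a list of tokens). The candidate segmentations of the two records would then be compared using the field-alignment cost of Algorithm 2, and the pair with the minimum total cost kept.

from itertools import combinations

def candidate_segmentations(blocks, n):
    # Choose n-1 cut positions among the gaps between consecutive blocks,
    # yielding every way to divide the block sequence into n contiguous fields.
    m = len(blocks)
    for cuts in combinations(range(1, m), n - 1):
        bounds = (0,) + cuts + (m,)
        yield [sum(blocks[bounds[i]:bounds[i + 1]], []) for i in range(n)]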

Two possible sets of fields returned for the following two strings S and T, with n = 3, are shown in Tables 3.1 and 3.2.

S : E. M. McCreight. A space-economical suffix tree construction algorithm.

Journal of the ACM, 23(2):262–272, Apr. 1976.

T : McCreight, E. M. A space-economical suffix tree construction algorithm.

JACM 23, 2 (Apr. 1976), 262–272.

Table 3.1 shows a more accurate segmentation than Table 3.2, and it is the one picked by Algorithm 2.

String | Segment 1 | Segment 2 | Segment 3
S | E. M. McCreight | A space-economical suffix tree construction algorithm. | Journal of the ACM, 23(2):262–272, Apr. 1976
T | McCreight, E. M. | A space-economical suffix tree construction algorithm. | JACM 23, 2 (Apr. 1976), 262–272.
Table 3.1: Example of Segmentation

String | Segment 1 | Segment 2 | Segment 3
S | E. M. McCreight | A space-economical suffix tree construction algorithm. Journal | of the ACM, 23(2):262–272, Apr. 1976
T | McCreight, E. M. | A space-economical suffix tree construction algorithm. JACM | 23, 2 (Apr. 1976), 262–272.
Table 3.2: Example of Segmentation

3.4.2 Datasets

The following three real-world datasets are used for the experiments.

CCSB  This dataset comes from the Collection of Computer Science Bibliographies (http://liinwww.ira.uka.de/bibliography/Misc/CiteSeer/). The site was queried with 38 keyword queries (e.g., data integration), and the top-200 results were collected. Each query result is referred to as a cluster, as it may contain multiple citations to the same paper. Only the clusters whose size is larger than one are kept for the experiments. Five different bibliography styles (these, acm, finplain, abbrv and naturemag, mainly from http://amath.colorado.edu/documentation/LaTeX/reference/faq/bibstyles.pdf) were used, where the i-th bibliography style is applied to the i-th citation (if any) in each cluster to obtain the corresponding transformed string from the LaTeX output. As a result, 3,030 clusters were generated. An example is shown in Table 3.3. All possible pairs of the strings produced by LaTeX within the same cluster were then formed and used as the input record pairs for the transformation rule learning algorithms. This results in 12,456 pairs.

Cora  This is the hand-labeled Cora dataset from the RIDDLE project (http://www.cs.utexas.edu/users/ml/riddle/data.html). It contains 1,295 citations of 112 Computer Science papers. The citations are clustered according to the actual paper they refer to, and all possible pairs are then generated within each cluster. As a result, 112 clusters were generated, with a total of 17,184 input record pairs.

Restaurant  This is the Restaurant dataset from the RIDDLE project. It contains 533 and 331 restaurants assembled from Fodor's and Zagat's restaurant guides, respectively, and 112 pairs of coreferent restaurants were identified. As mentioned above, a similar record-pair generation method was applied, generating 112 clusters and 112 record pairs.

It should be noted that all pairs of citations within the same cluster (i.e., referring to the same publication) were generated in order to exploit the ground truth data maximally. This, however, introduces a bias into the frequency of the rules (i.e., freq(R) in Equation (3.1)). For example, for a cluster of size 2t, a rule A ⇔ B can have a frequency of up to t² from this cluster alone. To remedy this problem, the frequency is defined as the number of clusters (rather than record pairs) in which the rule is generated. This is applied to both algorithms.
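The cluster-based frequency can be computed as in the following minimal sketch (rules are assumed to be represented as hashable (lhs, rhs) tuples, and generated_rules is a placeholder for the candidate-rule generation of Section 3.3.4.1):

from collections import defaultdict

def cluster_frequency(pairs_with_cluster_ids, generated_rules):
    # freq(R) = number of distinct clusters (not record pairs) in which R is generated.
    clusters_per_rule = defaultdict(set)
    for cluster_id, pair in pairs_with_cluster_ids:
        for rule in generated_rules(pair):
            clusters_per_rule[rule].add(cluster_id)
    return {rule: len(cids) for rule, cids in clusters_per_rule.items()}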


ID | Citation
1 | H. Turtle and W. B. Croft. Inference networks for document retrieval. In Proceedings of the 13th International Conference on Research and Development in Information Retrieval
2 | Turtle, H., and Croft, W. B. Inference networks for document retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1990), pp. 1–24.
3 | H. Turtle ja W. B. Croft. Inference networks for document retrieval. Kirjassa Proc. Thirteenth Intl. Conf. on Res. and Development in Information Retrieval, s. 1, 1990.
4 | Turtle, H. & Croft, W. B. Inference networks for document retrieval. In Proc. Thirteenth Intl. Conf. on Res. and Development in Information Retrieval, 1 (1990).
5 | Howard R. Turtle & W. Bruce Croft. Inference networks for document retrieval. Proceedings of the Thirteenth Intl. Conf. on research and Dev, pages 1–24, 1990.
Table 3.3: Example of a Cluster of Coreferent Records

3.4.3 Quality of the Rules

For both the Greedy and LA algorithms, the top-k rules are generated for each dataset with varying values of k. All output rules are then validated by domain experts, who classify each rule into one of the following three categories:

correct: the rule is absolutely correct; for example, proceedings ⇔ proc

partially correct: the rule is only partially correct; for example, ipl information processing letters ⇔ inf process lett vol

incorrect: the rule is absolutely incorrect; for example, computer science ⇔ volume.

Some of the correct rules discovered by LA are shown in Table 3.4.

ID | Rule
1 | focs ⇔ annual ieee symposium on foundations of computer science
2 | computer science ⇔ comput sci
3 | pages ⇔ pp
4 | 5th ⇔ fifth
5 | maaliskuu ⇔ march
6 | robert ⇔ rob
7 | vol ⇔ ⊥
Table 3.4: Example Rules Found

The quality of the rules generated by the algorithms is evaluated by counting the number of correct and incorrect rules. Precision is calculated as the fraction of correct rules in the top-k output rules (note that this definition is different from that in [ACK09], where precision was defined as the number of incorrect rules). The partially correct rules are ignored when counting the number of correct and incorrect rules.

3.4.4 Correct and Incorrect Rules

Figures 3.3–3.5 show the number of correct rules for the two algorithms when varying k on the three datasets. Figures 3.6–3.8 show the number of incorrect rules when varying k. It can be seen that:

• LA outperforms Greedy substantially on all datasets, generating more correct rules and fewer incorrect rules.

• Since the Cora dataset is dirtier than the CCSB dataset, the precision of both algorithms is lower on it. It can also be seen that the Greedy algorithm is affected more by the noise in the dataset than the LA algorithm.

• Since the Restaurant dataset is small, fewer rules are generated. For the top-50 rules, LA generates 68 per cent correct rules, whereas Greedy has a precision of 34 per cent.

Figure 3.3: CCSB, Number of Correct Rules

Figures 3.9–3.11 show how many incorrect rules are generated for a given target number of correct rules on the three datasets, following the evaluation in [ACK09]. It can be seen that many more incorrect rules are generated by the Greedy algorithm than by the LA algorithm for any fixed number of correct rules.

Figure 3.4: Cora, Number of Correct Rules

Figure 3.5: Restaurant, Number of Correct Rules

Figure 3.6: CCSB, Number of Incorrect Rules

Figure 3.7: Cora, Number of Incorrect Rules

Figure 3.8: Restaurant, Number of Incorrect Rules

Figure 3.9: CCSB, Number of Incorrect Rules vs. Number of Correct Rules

Figure 3.10: Cora, Number of Incorrect Rules vs. Number of Correct Rules

Figure 3.11: Restaurant, Number of Incorrect Rules vs. Number of Correct Rules

Figure 3.12: Supervised vs. Unsupervised CCSB, Number of Correct Rules

3.4.5 Unsupervised versus Supervised

In this section, the results from the proposed unsupervised segmentation approach are compared with the results from supervised segmentation (obtained from the segmented field labels of the datasets).

As expected, for varying k, supervised segmentation provides more correct rules than the unsupervised approach (Figures 3.12 and 3.13). It can also be observed that the advantage of the supervised approach is more pronounced at higher values of k than at lower values of k.

Figure 3.13: Supervised vs. Unsupervised Cora, Number of Correct Rules

3.4.6 Effect of Numeric Rules

Rules with numeric values on both the lhs and the rhs (e.g., 199 ⇔ 1994) are called numeric rules. It is observed that most of the numeric rules, generated as a result of edit-distance errors (e.g., 1993 ⇔ 1994) and abbreviations (e.g., 1 ⇔ 12), are noisy and thus degrade the precision of both algorithms. This section compares the performance of LA with and without the inclusion of numeric rules on the CCSB and Cora datasets, for both the supervised (Figures 3.14 and 3.15) and the unsupervised (Figures 3.16 and 3.17) method of segmentation.

It can be seen that, for both the supervised and the unsupervised method of segmentation, the exclusion of numeric rules increases the precision on both datasets.

Figure 3.14: Supervised Segmentation CCSB, Number of Correct Rules

Figure 3.15: Supervised Segmentation Cora, Number of Correct Rules

Figure 3.16: Unsupervised Segmentation CCSB, Number of Correct Rules

Figure 3.17: Unsupervised Segmentation Cora, Number of Correct Rules

Figure 3.18: CCSB, Impact of UpdateRules

3.4.7 Performance without UpdateRules

In this section, the top-k rules are examined without updating the support of the rules (Line 18 in Algorithm 4). The main motivation is to see how effective the generated rules are when the support of the rules is not updated. In order to avoid conflicts, once a TopRule R is selected in Line 17, all overlapping rules (i.e., rules whose lhs is the same as the lhs or the rhs of R) are removed from further consideration.

Figures 3.18 and 3.19 plot the number of correct rules generated with and without updating the support of the rules. As shown, for both CCSB and Cora, the approach without updating the support of the rules returns more correct rules for varying values of k. However, the major drawback is that many potentially correct rules are ignored by this approach. For example, if the rule pages ⇔ pp is selected in the top-k, then high-scoring valid overlapping rules such as pages ⇔ p and pages ⇔ ss are removed from further consideration.

Figure 3.19: Cora, Impact of UpdateRules

3.4.8 Precision and Consistency

Figure 3.20 plots the precision versus k graph in the CCSB dataset. Since both algorithms strive to find high-quality rules first, the precision is highest when k is small, and decreases with increasing k. Precisions for both algorithms become stable for k ≥ 500. In all settings, LA’s precision is much higher than Greedy’s.

As shown in Figure 3.20, the precisions of LA and Greedy are 84 per cent and 42 per cent, when k = 100, and 51.3 per cent and 15.4 per cent, when k = 1, 000.

A good transformation rule learner should perform consistently when the number of input record pairs is varied. Such results are shown in Figure 3.21, where the number of input record pairs varies from 2,000 to 12,000 and k = 200. As shown, the precision of the proposed LA algorithm remains stable (between 62 and 69 per cent).

Figure 3.20: CCSB, Precision vs. k

Figure 3.21: CCSB, Precision vs. n

3.4.9 Execution Time

This section investigates the efficiency of the algorithms with respect to the number of input record pairs (n) by measuring their running time. The result is shown in Figure 3.22. It can be seen that the time grows more quickly for the Greedy algorithm than for the LA algorithm. This is mainly because, with increasing input pairs, many more candidate rules are generated by the Greedy algorithm, as it does not perform careful local alignment. The fact that the running time of Greedy is up to 300 times that of LA is also due to the vast number of candidate rules generated.

Figure 3.23 plots the running time versus the output size k. It can be seen that both algorithms require more time as k increases, but Greedy's time grows more quickly with k. This is mainly because Greedy needs to update the support of a vast number of candidate rules in each iteration; hence, it takes much more time.

Figure 3.22: Execution time vs. Input Record Pairs (n) (panels (a)–(f): CCSB, time vs. k for n = 2,000, 4,000, 6,000, 8,000, 10,000 and 12,000)

Figure 3.23: Execution time vs. Output Size (k) (panels (a)–(e): CCSB, time vs. n for k = 200, 400, 600, 800 and 1,000)

3.5 Conclusions

This work focuses on record linkage approaches for complex coreferent records that have little surface similarity (e.g., “VLDB” and “Very Large Databases”). Transformation rules are incorporated to recognise these domain-specific equivalence relationships. The rules are obtained by proposing an effective and efficient top-k transformation rule learning algorithm. The proposed algorithm performs a meticulous local alignment for each pair of records by considering a set of commonly used edit operations, and then generates a number of candidate rules based on the optimal local alignment. Statistics of the candidate rules are maintained and aggregated to select the optimal top-k rules. Compared with the global Greedy algorithm in [ACK09], the proposed algorithm generates fewer candidate rules, and the generated candidate rules are more likely to be correct. Extensive experiments are performed using several publicly available real-world datasets. The results show that the percentage of correct rules is up to three times that of the previous approach, with a speed-up in efficiency of up to 300 times.

Future work may investigate the impact of the partitioning method on the quality of the rules. Another direction is to explore more efficient data structures and algorithms to further accelerate the proposed rule learning algorithm. Finally, the proposed method will be generalised to more complex transformation rules, for example, rules containing gaps or variables.

Chapter 4

Semi-automatic Comparative Citation Analysis

4.1 Introduction

The previous chapters mainly focus on developing new approaches to address the data cleaning issues arising in bibliography data. It is equally imperative to evaluate the benefit of these techniques in a real setting. This chapter aims to address this issue by evaluating the utility of various citation-enhanced search engines for conducting citation analysis. To that end, a semi-automatic tool is developed to perform data cleaning prior to performing the large-scale citation analysis.

Citation analysis is a major subfield of informetrics, allowing a researcher to follow the development and impact of an article through time by looking backwards at the references the author cites, and forwards to those authors who then cite the article. Academic institutions, federal agencies, publishers, editors, authors and librarians increasingly rely on citation analysis for making hiring, promotion, tenure, funding and/or reviewer and journal evaluation and selection decisions [MY07]. In general, citation counts are used to measure the impact of articles and journals, thus enabling researchers to examine growth, popularity and trends in a particular research stream. Assuming that scientists cite the work that they have found useful in pursuing their own research, the number of citations received by a publication is seen as a quantitative measure of its impact in the scientific community. The number of citations is frequently incorporated into decisions of academic advancement. However, citation data is heavily influenced by the coverage of the specific database, since a database can take into account only citations from items it indexes.

Until recently, ISI/Thomson Reuters was the single most comprehensive source offering large-scale bibliography database services, such as locating citations and/or conducting citation analyses. This service maintains citation databases covering thousands of academic journals and is now available online via ISI's Web of Knowledge service, which provides researchers and other interested parties with access to a wide range of bibliography and citation analysis services. Data from the ISI Citation Indexes and the Journal Citation Reports are routinely used by promotion committees at universities all over the world. However, the ISI citation databases have been criticised for several limitations [MY07], because they:

1. cover mainly North American, Western European and English-language titles

2. are limited to citations from 8,700 journals

3. do not count citations from books and most conference proceedings

4. provide different coverage between research fields

5. have citing errors, such as homonyms and synonyms, and inconsistencies in the use of initials and in the spelling of non-English names (however, many of these errors derive from the primary documents themselves, rather than from faulty ISI indexing).

The availability of citation data in other bibliography databases opens up the possibility of extending the data source for performing citation analysis, particularly to include other publication types of written scholarly communication, such as books, chapters in edited books and conference proceedings. The inclusion of other publication types will contribute to the validity of bibliometric analysis when evaluating fields in which the internationally oriented scientific journal is not the main medium for communicating research findings [ND06].

In recent years, several database producers have noticed the potential of citation indexing and have manually added cited references to a subset of their records [ND06]. Discipline-oriented databases such as Chemical Abstracts by the American Chemical Society, MathSciNet by the American Mathematical Society and PsycINFO by the American Psychological Association have introduced citation indexing to their bibliography databases. With the electronic availability of scholarly documents, electronic database searching has become the de facto mode of information retrieval. Several bibliography databases [ND06] have been established that automatically extract bibliography information and cited references from electronic documents retrieved from digital archives and repositories. Some of these databases offer sophisticated features for citation searching and provide detailed information on download frequencies, which may serve as an additional basis for assessing the resonance and impact of publications. Some of these remarkable services include:

• CiteSeer (http://citeseer.ist.psu.edu/): currently focuses on computer/information science literature.

• PubMed (http://www.ncbi.nlm.nih.gov/pubmed/): mainly focuses on medicine and the biomedical sciences.

• SciFinder Scholar (http://cas.org/products/scifindr/index.html): covers journals and patent information on chemistry and medicine.

• Faculty of 1,000 (www.facultyof1000.com): highlights and evaluates the most important articles in biology and medicine.

• RePEc (http://repec.org/): covers research papers in economics.

• SMEALSearch (http://library.wlu.edu/details.php?resID=225): indexes academic business documents.

Beyond these discipline-oriented databases, two multidisciplinary databases, Scopus (www.scopus.com) from Elsevier and Google Scholar (GS, http://scholar.google.com/) from Google Inc, have attracted much attention. Both were introduced in 2004; Scopus requires a paid subscription, while GS is free. Each of these databases has a different scope of citation coverage and uses unique methods to record and count citations [KASW09]. The differences in citation counts among the databases could have implications for citation analysis studies and for the use of citation counts in academic advancement decisions. This chapter aims to compare the utility of these citation-enhanced databases in the field of computer science, a relatively new field of study in which conference papers are considered a more important form of publication than is generally the case in other scientific disciplines.

4.2 Proposed Study

Given that the various scientific databases have their own characteristics, the present study aims to compare the utility of the currently most popular sources of scientific information, namely Scopus and GS. The study first looks into the data quality issues in the citation results returned from GS (such as several entries for the same paper due, for example, to misspelled titles, author names, different ordering of authors, journal and conference versions of a paper with the same or similar title, or two or three versions of the same paper being found online). Another difficulty in performing citation analysis with GS is that its records are retrieved in a way that is very impractical for use with large datasets, requiring a tedious process of manually cleaning, organising and classifying the information into meaningful and useable formats. As a result, to conduct citation analysis, most of the work in the literature had to manually clean these errors and classify the citation information into proper formats (i.e., different fields such as author, venue and source), which is a tedious and time-consuming task. To ease this manual processing job, this thesis aims to develop a semi-automatic tool that performs extensive data cleaning to deal with these errors and presents the citations in a format suitable for conducting the analysis. Furthermore, the accuracy of the proposed data cleaning tool is analysed.

Using various bibliometric indicators (such as papers, cited papers, citations, citations per year and citations per paper, as well as the h-index and variants such as the g-index and hg-index), an analysis of the scholarly performance of various researchers in computer science is conducted. Computer science is a relatively new field of study, in which conference papers are considered a more important form of publication than is generally the case in other scientific disciplines, and it is less studied in the existing literature. The study examines the extent to which GS citation data can be used for evaluative purposes in computer science research, and uses the two databases (GS and Scopus) to perform and compare up-to-date citation analysis in this field. In addition, it analyses the strengths and weaknesses of Scopus and GS, their overlaps and uniqueness, and their characteristics (such as document type, language and content level), and discusses the implications of the findings for citation analysis.
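For reference, the indicators named above can be computed from a list of per-paper citation counts as in the following sketch; these are the standard definitions of the indices, not code from this thesis.

import math

def h_index(citations):
    # Largest h such that h papers have at least h citations each.
    cites = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

def g_index(citations):
    # Largest g such that the g most-cited papers have at least g^2 citations in total.
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def hg_index(citations):
    # Geometric mean of the h-index and the g-index.
    return math.sqrt(h_index(citations) * g_index(citations))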

4.3 Literature Review

This section surveys existing studies dedicated to the analysis and comparison of the performance of different citation-enhanced databases.

The paper [GMLG01] compares citations from CiteSeer, a Web-based citation indexing system, with those from ISI for computer science papers. The finding of more citations from conference papers than from ISI articles indicates the importance of conference papers among computer science professionals for disseminating research results.

The authors in [ZL02] compare the scholarly communication patterns in XML research as revealed by the citation analysis of ResearchIndex data and ISI data.

They find that citation data obtained from ResearchIndex contain more citing papers, a wider variety of document types and more information about cited papers than ISI data. From these findings, they conclude that citation analysis studies should combine ISI data with data available on the Web in order to obtain a larger and richer characterisation of scholarly communication structures and processes.

The work in [VS03] compares bibliography and Google Web citations to articles in 46 journals in library and information science. It finds that Web citations significantly correlate with bibliography citations listed in the Social Sciences Citation Index and with the ISI Journal Impact Factor for most journals (57 per cent). The study also shows that Web citation counts are typically higher than bibliography citation counts for the same article.

The author [Jac05a] has conducted detailed analysis of citation results from

GS in comparison with the results from Scopus and Web of Science (WoS). The author searched for documents citing Eugene Garfield, an article by Garfield published in 1955 in Science, the journal Current Science and its 30 most-cited articles. He found that the coverage of Current Science by GS was abysmal and lacking in its scope of handling data, and that the coverage of the social sciences by Scopus was modest. He also found a considerable overlap between GS and WoS, and many unique documents in each source, stating that the majority of these unique documents are relevant and substantial.

The study in [Nor05] begins with an overview of how to use GS for citation analysis and identifies advanced search techniques that are not well documented by GS. This study also compares the citation counts provided by WoS and GS for articles in the field of Webometric and makes several suggestions for improving

GS. The analysis of citations shows that GS is good at finding additional citations, and concludes that GS provides a free alternative or complementary service to other citation indexes.

The article [BB05] presents a case study comparing the citation counts pro- vided by WoS, Scopus, and GS for articles from the Journal of the American Soci- ety for Information Science and Technology (JASIST) published in 1985 and 2000.

It finds that WoS provides maximum citation counts for older articles, whereas for newer articles, GS provides statistically significant higher citation counts than ei- ther WoS or Scopus. However, the authors recommend conducting more rigorous studies before these findings are considered definitive.

The paper [PS05] evaluates ISI and GS by comparing their citations of pa- pers in mathematics, chemistry, physics, computing sciences, molecular biology, ecology, fisheries, oceanography, geosciences, economics and psychology. Each discipline is represented by three authors, and each author is represented by three articles (i.e. 99 articles in total). The findings suggest that GS can be a substitute for ISI and reports a good correlation in citation counts between the two sources without assessing accuracy or relevance and quality of the citing articles.

The work in [Jac05b] presents the outcome of the analysis of the individual and aggregate citation scores calculated by WoS and GS for the papers published in 22 volumes of the Asian Pacific Journal of Allergy and Immunology (APJAI) and discusses the reasons for the significant limitations of GS in calculating and reporting the citedness scores.

The authors in [KT06] conduct a similar analysis to [VS03] of Google ci- tations to open-access (OA) journals in library and information science. Their

Google search results show that 282 research articles were published in the year

2000 in 15 peer-reviewed and library and information science (LIS) OA journals and were invoked by 3,045 URL citations. Of these URL citations, 43 per cent 105 were created for formal scholarly reasons equivalent to traditional citations and

18 per cent were created for informal scholarly reasons. Of the sources of URL citations, 82 per cent were in English, 88 per cent were full text papers and 58 per cent were non-HTML documents. Of the URL citations, 60 per cent were text

URLs only and 40 per cent were hyperlinked.

The paper [ND06] presents an overview of new citation-enhanced databases in the context of research evaluation. It reports the limitations of Thomson Reuter’s

Scientific Citation Index and reviews the characteristics of citation-enhanced databases, such as Chemical Abstracts, GS and Scopus. It suggests that citation- enhanced databases need to be examined carefully with regard to both their po- tentialities and their limitations for citation analysis.

In an attempt to provide a robust study, the authors in [BBGW06] compared citation counts for articles from two disciplines(oncology and condensed matter physics) and two years(1993 and 2003) to test the hypothesis that the different scholarly publication coverage provided by the three search tools WoS, Scopus and GS would lead to different citation counts from each. The result showed that

WoS returned the highest average number of citations and Scopus returned the highest number of citations for condensed matter physics in 1993 and 2003 . The data showed a significant difference in the mean citation rates between all pairs of resources, except between GS and Scopus for condensed matter physics in 2003.

For articles published in 2003, GS returned the largest amount of unique- citing material for oncology and WoS returned the largest amount for condensed matter physics.

The study [Saa06] explores the correlation between authors’ total citation counts with their h-indices, journals’ h-indices with their ISI impact factors, 106 and compares authors’ h-indices as obtained from two separate services, namely

Thompson/ISI and GS. The results show that the correlation between the Thomp- son/ISI h-index and the Thompson/ISI total number of citations was 0.87, and that the correlation between the GS h-index and the Thompson/ISI total number of ci- tations was 0.83, and that the correlation between the two h-indices was 0.82. Of the 55 comparisons, eight yielded a higher Thompson/ISI h-index, another eight yielded no difference between the two indices and the remaining 39 cases yielded a higher GS h-index. The correlation between the journals impact factors and their h-indices was found to be 0.70.

In [Buc06], the authors analysed the nature and extent of errors made by the

Science Citation Index ExpandedTM (SCIE) and the SciFinder ScholarTM (SFS) during data entry. Their analysis included 5,400 cited articles from 204 randomly selected cited-article lists published in three core chemistry journals. They discov- ered that failure to map cited articles to target-source articles was due to transcrip- tion errors, target-source article errors, omitted cited articles and reason unknown, where the mapping error rates ranged from 1.2 to 6.9 percent.

The study [MY07] examines the effects of using Scopus and GS on citation counts and the ranking of scholars as measured by WoS. Using the citations to the work of 25 library and information science as a case study, it examines more than 10,000 citing and purportedly citing documents, and provides several useful suggestions for scholars conducting citation analysis, as well as those who need assistance in compiling their own citation records. The study finds that the addi- tion of Scopus citations to those of WoS could significantly alter the rankings of authors. It also finds that GS stands out in its coverage of conference proceed- ings as well as international, non-English language journals, and indexes a wide 107 variety of document types, which may be of significant value to researchers and others. Finally, it concludes that the use of Scopus and GS, in addition to WoS, reveals a more comprehensive and accurate picture of the extent of the scholarly relationship between LIS and other fields.

The work [BILL07] introduces a set of measures for comparing rankings of different citation databases (WoS, Scopus and GS) induced by the number of cita- tions the publications receive in each database. The results show high similarities between the rankings of the ISI WoS and Scopus, and lower similarities between

GS and other tools. In his next work [BI08], the author compares the h-indices of a list of highly-cited Israeli researchers based on citation counts retrieved from the three databases. The results obtained through GS are found to be considerably dif- ferent from the results based on the WoS and Scopus, and the author recommends the need for data cleaning to achieve a fair analysis.

Based on previous methods, the study [KT07] uses a combined Web and URL citation method to collect a wide range of data. To a sample of 1,650 articles from

108 OA journals, it compares the correlations between ISI and GS across mul- tiple disciplines (biology, chemistry, physics, computing, sociology, economics, psychology and education). The study finds a large disciplinary difference in the percentage overlap between ISI and GS citation sources. It also finds that GS is more comprehensive for social sciences and possibly when conference articles are valued and published online. It concludes that replacing traditional citation sources with the Web or GS for research impact calculations would be problem- atic.

In another study, a sample of 1,483 publications, which is representative of the scholarly production of the LIS faculty, was searched in WoS, Google, and 108

GS [VS08]. As a result, the median number of citations found through WoS was zero for all types of publications except book chapters and the median for

GS ranged from one for print/subscription journal articles to three for books and book chapters. For Google, the median number of citations ranged from nine for conference papers to 41 for books. The study finds that almost 92 per cent of the citations identified through GS represent intellectual impact (primarily citations from journal articles) and non-intellectual impact (bibliography services) are the largest single contributor of citations identified through Google. The study recom- mends GS as a promising tool for research evaluation, especially in a field where rapid and fine-grained analysis is desirable.

The paper [KT08] takes a sample of 882 articles from 39 OA ISI-indexed jour- nals in 2001 from biology, chemistry, physics and computing, and classifies the type, language, publication year and accessibility of the GS unique citing sources.

It finds that the majority of GS unique citations (70 per cent) were from full-text sources, and that there were large disciplinary differences between types of cit- ing documents, suggesting that a wide range of non-ISI citing sources, especially from non-journal documents, are accessible by GS. An important corollary drawn from this study is that the wider coverage of OA Web documents by GS is likely to give a boost to the impact of OA research and the OA movement.

The article [FPMP08] aims to compare the content coverage and practical util- ity of various databases, namely PubMed, Scopus, WoS and GS. It uses the exam- ple of a keyword search to evaluate the usefulness of these databases in biomed- ical information retrieval and a specific published article to evaluate their utility in performing citation analysis. It finds PubMed to be an optimal tool in biomed- ical electronic research and states that Scopus covers a wider journal range and 109 offers the capability for citation analysis limited to recent articles (published after

1995) compared with WoS, and GS can help in the retrieval of the most oblique information, but is marred by inadequate, less often updated citation information.

The paper [Jac08b] focuses on the practical limitations in the content and soft- ware of the databases that are used to calculate the h-index for assessing the pub- lishing productivity and impact of researchers. The author discusses the related features of GS, Scopus and WoS, and demonstrates how a much more realistic and fair h-index can be computed for F. W. Lancaster than the one produced automat- ically. His work [Jac08a] aims to analyse the pros and cons of the three databases for determining the h-index. From his findings, the h-index, developed by Jorge E.

Hirsch to quantify the scientific output of researchers, received well-deserved at- tention by researchers. Many of them also recommended derivative metrics based on Hirsch’s idea to compensate for potential distortion factors, such as high self- citation rates and the need for scrutiny, as the content and software characteristics of reference-enhanced databases could strongly influence the h-index values. He further uses Scopus and Thomson-Reuters’ databases [Jac09a] to explore the ex- tent of the absence of data elements that are critical from the perspective of a scientometric evaluation of the scientific productivity and effect of countries in terms of the most common indicators. These indicators include the number of publications, the number of citations and the ratio of citations received to papers published, as well as the effect these may have on the h-index of countries. He

finds that the presence of the language field is comparable between the Thomson-

Reuters and Scopus databases, and that the rate of presence of the subject category

field is better in Scopus, even though it has far fewer subject categories than the

Thomson-Reuters databases. However, the omission rate of country identification 110 hurts its impressive author-identification feature. He further [Jac09b] uses the two citation databases, Scopus and WoS, to discuss the results of experiments in de- termining the h-index at the country level for the 10 Ibero-American countries of

South America. The results show that, despite the significant differences in the content of the two databases in terms of their source base and the extent of cited reference enhancement of records, the rank correlation of the 10 countries based on the h-index values returned by WoS and Scopus is very high.

The study [MR08] examines the differences between Scopus and WoS in the citation counting, citation ranking and h-index of 22 top human-computer interac- tion (HCI) researchers from EQUATOR, which is a large British Interdisciplinary

Research Collaboration project. The results indicate that Scopus provides signif- icantly more coverage of HCI literature than WoS, and no significant differences exist between the two databases if citations in journals only are compared. Sco- pus also generates significantly different maps of citation networks of individual scholars than those generated by WoS. The study also presents a comparison of h-index scores based on GS with those based on the union of Scopus and WoS.

It concludes that Scopus can be used as a sole data source for citation-based re- search and evaluation in HCI, especially when citations in conference proceedings are sought.

To test the extent to which the use of the freely available database GS can be expected to yield valid citation counts in the field of chemistry, the work [BWM-

RAT09] examines a comprehensive set of 1,837 papers that were accepted for publication by the Angewandte Chemie International Edition (one of the prime chemistry journals in the world) or that are rejected by the journal but then pub- lished elsewhere. Analyses of citations for the set of papers returned by three fee- 111 based databases (Science Citation Index, Scopus, and Chemical Abstracts) were then compared to the analysis of citations found using GS data. The study finds that citations returned by the three fee-based databases shows similar results, and the results of the analysis using GS citation data differ greatly from the findings from the fee-based databases. Therefore, the study supports both the convergent validity of citation analyses based on data from the fee-based databases and the lack of convergent validity of the citation analysis based on the GS data.

The paper [MS09] uses citations from 1996–2007 to the work of 80 randomly selected full-time information studies (IS) faculty members from North Amer- ica to assess differences between Scopus and WoS. Results show that, when the assessment is limited to smaller citing entities (e.g., journals, conference proceed- ings and institutions), the two databases produce considerably different results, whereas when the assessment is limited to larger entities (e.g., research domains and countries), the two databases produce similar pictures of scholarly impact.

The authors in [KASW09] compare the citation-count profiles of articles pub- lished in general medical journals among the citation databases of WoS, Scopus and GS. Using these databases, they retrieve total citation counts for 328 articles published in JAMA, Lancet, or the New England Journal of Medicine and anal- yse the article characteristics in a linear regression model. The results showe that, the databases produce quantitatively and qualitatively different citation counts for articles published in three general medical journals. GS and Scopus retrieve more citations per article respectively, than WoS. Compared with WoS, Scopus retrieves more citations from nonEnglish-language sources and fewer citations from arti- cles, editorials and letters. GS has significantly fewer citations to industry funding and group-authored articles. 112

The article [LIdMAM09] addresses the robustness of country-by-country rankings according to the number of published articles and their average cita- tion impact in the field of oncology. It compares rankings based on bibliometric indicators derived from the WoS with those calculated from Scopus. It finds that the oncological journals in Scopus that are not covered by WoS tend to be na- tionally oriented journals, and discusses its implications for the construction of bibliometric indicators.

The study [HvdW09] conducts a systematic comparison between the GS h- index and the ISI Journal Impact Factor for a sample of 838 journals in Economics

& Business and shows that there is substantial agreement between the ISI Journal

Impact Factor and the GS h-index for most sub-disciplines, as well as for those sub-disciplines that have limited ISI coverage (Finance & Accounting, Marketing, and General Management & Strategy).

The authors in [ACHVH09] pay special attention to the works devoted to using the h-index and its variations to compare scientists from different research areas.

They review some works that compare the h-index and related indices when they are computed using different bibliography databases, mainly focusing the use of

WoS, Scopus and GS. Their review concludes that the h-index is quite dependent on the database used and that it is much more difficult to compute those indices using

GS than WoS or Scopus.

The article [LBLB10] compares the citation analysis potential of four databases: WoS, Scopus, SciFinder and GS. It finds that WoS provides cover- age back to 1900, whereas Scopus only has completed citation information from

1996 onwards and provides better coverage of clinical medicine and nursing than

WoS. It also discovers that SciFinder has the strongest coverage of chemistry and 113 the natural sciences, while GS has the capability to link citation information to individual references. It concludes that, although Scopus and WoS provide com- prehensive citation reports, all databases miss linking to some references included in other databases.

The authors in [JPi10] aim to study the qualitative and quantitative differences in the received citation counts for the Serbian Dental Journal (SDJ) found in WoS,

Scopus and GS. They experiment with 158 papers from SDJ and 249 received citations found in the three analysed databases. They find that the greatest number of citations (189) derives from GS, while only 15 per cent of the citations are found in all three databases. They also find a significant difference in the percentage of unique citations in the databases of GS (58 per cent), Scopus (6 per cent) and

WoS (4 per cent). The highest percentage of database overlap is between WoS and Scopus (70 per cent), while the overlap between Scopus and GS is 18 per cent and 17 per cent between WoS and GS. As none of the examined databases can provide a comprehensive picture, the study recommends taking into account of all three available sources.

The author in [BI10] examines GS, Scopus and WoS through the citations of Introduction to informetrics by Leo Egghe and Ronald Rousseau, and find that

Scopus citations are comparable to WoS citations when limiting the citation period to 1996 onwards. GS misses about 30 per cent of the citations covered by Scopus and WoS (90 citations), but another 108 citations located by GS are not covered either by Scopus or by WoS. The author also mentions that, although GS is not very user friendly as a bibliometric data collection tool at this point, it performs considerably better than reported in previous studies.

The article [SUP10] analyses the citation count of articles published by the

Croatian Medical Journal in 2005-2006 based on data from the WoS, Scopus and

GS. As a result, GS returns the greatest proportion of articles with citations (45 per cent), followed by Scopus (42 per cent) and WoS (38 per cent). Almost half (49 per cent) of the articles have no citations, and 11 per cent have an equal number of identical citations in all three databases. The greatest overlap is found between

WoS and Scopus (54 per cent), followed by Scopus and GS (51 per cent), and

WoS and GS (44 per cent). The greatest number of unique citations is found by

GS, where the majority of citations (64 per cent) come from journals, followed by books and PhD theses. Approximately 55 per cent of all citing documents are open access (OA) full-text resources, and the language of citing documents is mostly English, while 29 per cent of citing documents are in Chinese. From these results, GS is believed to serve as an alternative bibliometric tool that provides an orientational insight into citations.

The paper [Fra10] provides a case study of computer science scholars and journals evaluated on the WoS and GS databases. It concludes that GS computes significantly higher indicator scores than WoS, that citation-based rankings of both scholars and journals do not change significantly across the two data sources, and that rankings based on the h-index show a moderate degree of variation.

In order to measure the degree to which GS can compete with bibliographical databases, search results from GS are compared with those from WoS [Mik10]. For impact measures, the h-index is investigated, and the ranks of records displayed in GS and WoS are compared by means of Spearman's footrule. The results show significant similarities in the measures for the two sources.

To investigate whether the h-index can be reliably computed through alternative sources of citation records, WoS, PsycINFO and GS are used to collect citation records for known publications of four Spanish psychologists [GP10]. Compared with WoS, PsycINFO includes a larger percentage of publication records, whereas GS outperforms both WoS and PsycINFO. Compared with WoS, PsycINFO retrieves a larger number of citations in unique areas of psychology, but it retrieves a smaller number of citations in areas that are close to statistics or the neurosciences, whereas GS retrieves the largest number of citations in all cases. Incorrect citations are scarce in WoS (0.3 per cent), more prevalent in PsycINFO (1.1 per cent) and overwhelming in GS (16.5 per cent). All platforms retrieve unique citations, with the largest set coming from GS. WoS and PsycINFO cover distinct areas of psychology unevenly, thus imposing different penalties on the h-index of researchers working in different fields. To obtain fair and accurate h-indices, the author suggests the need for a union of the citations retrieved by all three platforms.

4.4 Google Scholar (GS)

In November 2004, Google, the producer of the most popular Web search engine, introduced GS in beta version: a freely available service that uses Google's crawler to index the content of scholarly material and adds citation counts to raise or lower individual articles in the rankings of a result set [BB05].

GS uses a matching algorithm to look for keyword search terms in the title, abstract or full text of an article from multiple publishers and websites, where the number of times a journal article, book chapter or website is cited also plays an important part [BBGW06]. Unlike most scholarly research databases, it looks beyond the journal literature: it claims to include peer-reviewed papers, theses, books, abstracts and articles from academic publishers, professional societies, pre-print repositories, universities and other scholarly organisations, and it aims to rank documents the way researchers do, weighing the full text of each document, where it was published and who it was written by, as well as how often and how recently it has been cited in other scholarly literature 3. It covers a large number of documents by crawling the Web automatically, but it also includes papers from several digital libraries, including those from ACM, IEEE and Springer. GS automatically extracts the bibliography data from the reference sections of the documents (mostly in PDF and PS formats) and determines citation counts for papers in its collections, as well as for citations for which the document itself is not available.

Figure 4.1 shows parts of the GS search interface and a section of the search results for a search on the article title “Edjoin: an efficient algorithm for similarity joins with edit distance constraints”. By default, GS provides the title of and a hyperlink to the document, the authors, the publication year, information about the source, the citation count, and a hyperlink to the citing documents.

Figure 4.2 shows a section of results of the citing link of the same article.

The publications in the query result are typically ranked according to the number of citations, displaying the most-cited and highly relevant articles at the top of the result set [RT05]. The search interface of GS is simple and easy to use; it allows a quick search and an advanced search. In the advanced search, the results can be limited by title words, authors, source, date of publication and subject areas. The languages of the interface and of the search are optional. The results can be displayed as a listing of 10–100 items per page. Each retrieved article is represented by title, authors and source, but the abstract and information on free full-text availability are not provided by GS. Under each retrieved article, the number of citing articles is noted and these can be retrieved by clicking on the relevant link. Clicking on the article title leads to a list of possible links to the article, usually on the journal's site. Users are able to view the full text only of those items that are available free and those to which their libraries subscribe. In addition, GS provides links to relevant articles and allows for a general Google Web search using self-selected keywords from the article and the author name [FPMP08].

3 http://scholar.google.com/intl/en/scholar/about.html

GS is an important service for those who do not have access to expensive multidisciplinary databases such as the Thomson Reuters Science Citation Index or Scopus. It is potentially a powerful new tool for citation analysis, particularly in subject areas that have experienced rapid changes in communication and publishing.

Figure 4.1: GS Search Results for an Article Title

Figure 4.2: GS Results from Citing Link of an Article

4.5 Scopus

The largest abstract and citation database, Scopus, was launched in November

2004. It contains both peer-reviewed research literature and quality Web sources, and offers researchers a quick, easy and comprehensive resource to support their research needs in the scientific, technical, medical and social sciences fields and, more recently, also in the arts and humanities 4. It also offers powerful browsing, searching, sorting and saving features, as well as export to citation management software. Coverage in Scopus goes back to 1966 for bibliography records and abstracts and to 1996 for citations. It provides various types of searches, including quick, basic and advanced searches, and searches by author or source.

Figure 4.3 shows a portion of the search interface provided by Scopus and Figure 4.4 shows the author search results for Elio Masciari.

In the basic search, the results for the chosen keywords can be limited by date of publication, by addition to Scopus, by document type and by subject areas, whereas the author search is based only on author names. The advanced search combines the basic search without the limits and the author search, and it allows more operators and codes. The source search is confined to the selection of a subject area, a source type, a source title, the ISSN number and the publisher. Subject areas covered in Scopus include chemistry, physics, mathematics and engineering, life and health sciences, arts and humanities, social sciences, psychology, economics, biology, agriculture, environmental sciences and general sciences [YM06]. The search results in Scopus can be displayed as a listing of 20–200 items per page, and documents can be saved to a list and/or can be exported, printed or e-mailed. The results can be refined by source title, author name, year of publication, document type and/or subject area, and a new search can be initiated within the results. The presence of an abstract, references and free full text is noted under each article title, in addition to where these can be found.

4 http://info.scopus.com/scopus-in-detail/facts/

When abstracts are displayed, the keywords are highlighted. The fields that can be included in the output are optional (i.e. citation information and bibliographical information of citations). In addition, Scopus has search tips written in 10 languages [FPMP08].

Beyond the generic cited reference index, Scopus has separate indexes for cited author, year, title, source and pages. These are comprehensive options for citation searching, and they facilitate the performance of citation analysis using different approaches. Scopus also offers Related Documents, which returns a list of documents that share cited references with the currently selected document.

Through its refinement option, Scopus provides an overview of search results according to author, publication year, journal title, document type and subject area. Scopus Citation Tracker further enhances citation analysis by enabling citations to be viewed by year, providing users with a powerful way to explore citation data over time. The Scopus Citation Tracker tabulates citation data and shows how often the individual documents have been cited in individual years, as well as an overall total. If necessary, the tabular representation can be exported as a text file [ND06].

Figure 4.3: Scopus Search Interface

Figure 4.4: Results of Scopus Author Search

4.6 Data Collection

In order to perform a fair comparative citation analysis, a list of 10 authors was randomly selected from the DBLP XML file 5 and is shown in Table 4.1. It can be seen that the randomly selected authors come from various geographically scattered regions (i.e., countries) and have varying numbers of publications, as shown in Tables 4.2 and 4.3, providing fairness in the data set used to conduct the citation analysis.

Author Name             Affiliation                                   City         Country
Elio Masciari           National Research Council                     Rome         Italy
Guido Schäfer           Vrije Universiteit Amsterdam                  Amsterdam    Netherlands
Luis Baumela            Technical University of Madrid                Madrid       Spain
Patrick Marais          University of Cape Town                       Cape Town    South Africa
Pernilla Qvarfordt      FX Palo Alto Laboratory                       Palo Alto    USA
Rajib Mall              Indian Institute of Technology, Kharagpur     Kharagpur    India
Sunita Sarawagi         Indian Institute of Technology, Bombay        Mumbai       India
Tai-Lung Chen           Chung Hua University                          Hsin-chu     Taiwan
Timothy C. Lethbridge   University of Ottawa                          Ottawa       Canada
Yoshiharu Ishikawa      Nagoya University                             Nagoya       Japan

Table 4.1: Author Information

For computer science researchers, the DBLP Web site is a popular tool to trace the work of colleagues and to retrieve bibliography details when composing the lists of references for new papers [Ley09]. The DBLP is a large archive of publication records that documents the papers published in a range of computer science conferences and journals, as well as some workshops and symposiums.

5 http://dblp.uni-trier.de/xml/

Figure 4.5 shows a snapshot of DBLP’s page for Elio Masciari. The bibliography records are contained in a large XML file that can be downloaded. Although

DBLP does not provide any citation data, it can be used to provide a suitable list of seed articles for analysis.
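As a rough illustration of this step, the sketch below (in Python) streams through a local copy of the downloaded dblp.xml dump and collects the records of a single author as a seed list. The record tags and child elements follow the publicly documented DBLP format, but the function name and the handling of the DTD-defined character entities are assumptions made for illustration only, not the exact implementation used in this thesis.

```python
import xml.etree.ElementTree as ET

# Record types used by DBLP for regular publications.
PUB_TAGS = {"article", "inproceedings", "incollection", "book", "phdthesis"}

def publications_of(author_name, dblp_path="dblp.xml"):
    """Stream the large DBLP dump and collect (title, year, key) tuples
    for every record that lists the given author.

    Note: dblp.xml declares special characters via an external DTD
    (dblp.dtd); depending on the parser configuration these entities
    may need to be resolved before parsing.
    """
    pubs = []
    for _, elem in ET.iterparse(dblp_path, events=("end",)):
        if elem.tag in PUB_TAGS:
            authors = [a.text for a in elem.findall("author") if a.text]
            if author_name in authors:
                title = (elem.findtext("title") or "").strip()
                pubs.append((title, elem.findtext("year"), elem.get("key")))
            elem.clear()  # release memory for processed records
    return pubs

# Example: seed list for one of the selected authors.
# seeds = publications_of("Elio Masciari")
```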

The citation data for all selected authors were manually collected from Scopus by performing an author search on Scopus and then storing the data in comma-separated value (CSV) files.

A major disadvantage of GS is that its records are retrieved in a way that is impractical for use with large sample sizes and that requires a tedious process of manually extracting, verifying, cleaning, organising, classifying and saving the bibliography information into a meaningful and usable format [MY07]. For instance, GS frequently has several entries for the same paper, due to, for example, misspelled author names or different orderings of authors. Conversely, GS may group together citations of different papers, for example a journal and a conference version of a paper with the same or a similar title. It may also count a citation published in two different forms (such as a pre-print and a journal article) as two citations, and inflate citation counts through the inclusion of non-scholarly sources (e.g., theses, pre-prints and technical reports). Finally, it lacks information about the document type, document language, document length and the refereed status of the retrieved citations.

However, access to the necessary bibliography data using GS provides a way to perform a large-scale citation analysis of authors, conferences and journals in various disciplines. Given an article, it provides its citation count and full details on the other articles that cite it. All that is required is a suitable list of articles to seed the analysis, which is where a service such as the DBLP comes into play.

To collect data from GS, a semi-automatic tool was developed to map each paper found in DBLP for a given author to GS to determine its citations. For a given list of publications from DBLP, the tool first fetches the list of citations from

GS for each publication. In the second step, it performs extensive data cleaning to deal with errors in the citations and limitations of the automatic extraction of references in GS. In addition to dealing with these issues, the developed tool also detects citations from technical reports, theses and pre-prints, and removes these citations from the analysis. For articles available in Scopus, but missed in DBLP, an exact match search (including title, authors and venue) was performed on GS to retrieve the cited articles. The same process was performed on Scopus for articles in DBLP that do not appear on the author's list of published articles.

To measure the effectiveness of the proposed data-cleaning tool, a sample list of 300 citations was manually verified to obtain a list of cleaned citations. The list was then compared to the list of cleaned citations obtained from the proposed tool. The accuracy was found to be 98.8 per cent, which is reasonable.

Figure 4.5: DBLP Search Results for Author Elio Masciari

4.7 Experimental Analysis

This section presents the comparative analysis of citations returned from GS and Scopus.

4.7.1 Citation Counts

Tables 4.2 and 4.3 display the descriptive statistics of the citation counts from the two databases. As shown, for all authors, Scopus fails to obtain some publication information that is obtainable in GS. Overall, 26 per cent of the publications are missing from Scopus but available in GS.

For all authors, GS returns the highest total number of citations and the highest average number of citations. The total citations for the top 10 and the top five publications in GS are also much higher than the corresponding totals in Scopus. As expected, the average citation counts of all authors from GS are several times larger than those from Scopus. As shown in Table 4.4, this is also the case for the standard deviations.

Since the citation counts are highly skewed, a t-test is performed under the hypothesis that the different scholarly publication coverage provided by the two tools will lead to different citation counts. It begins with the following hypotheses:

Ho: There is no difference between the citation counts extracted from the two resources.

Ha: A difference exists between the citation counts extracted from the two resources.

Table 4.4 displays the summary results. The data show a significant difference in the mean citation rates between the databases. A pair-wise comparison using Student's t-test is performed on the database pair. Based on this investigation, there is a statistically significant difference in citation counts between the two databases (p < 0.05).
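The exact test setup is not spelled out beyond the description above; the following minimal sketch (assuming SciPy is available) reproduces one plausible reading of it, a paired Student's t-test on the per-author total citation counts taken from Tables 4.2 and 4.3.

```python
from scipy import stats

# Per-author total citation counts from Tables 4.2 (GS) and 4.3 (Scopus),
# listed in the same author order.
gs_totals     = [386, 394, 296, 165, 234, 380, 3522, 23, 1797, 860]
scopus_totals = [103, 121, 104, 56, 67, 173, 545, 14, 491, 75]

# Paired t-test on the matched per-author counts.
t_stat, p_value = stats.ttest_rel(gs_totals, scopus_totals)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant difference
```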

Author                  Total          Total       Average     Total Citations for   Total Citations for
                        Publications   Citations   Citations   Top 10 Publications   Top 5 Publications
Elio Masciari           39             386         9           342                   284
Guido Schäfer           35             394         11          261                   172
Luis Baumela            37             296         8           251                   186
Patrick Marais          25             165         6           158                   128
Pernilla Qvarfordt      16             234         14          226                   187
Rajib Mall              84             380         4           189                   125
Sunita Sarawagi         67             3522        52          2201                  1449
Tai-Lung Chen           15             23          1           23                    23
Timothy C. Lethbridge   81             1797        22          998                   626
Yoshiharu Ishikawa      51             860         16          756                   702

Table 4.2: Statistics of Citations in GS

Author                  Total          Total       Average     Total Citations for   Total Citations for
                        Publications   Citations   Citations   Top 10 Publications   Top 5 Publications
Elio Masciari           22             103         4           101                   84
Guido Schäfer           28             121         4           101                   67
Luis Baumela            30             104         3           100                   90
Patrick Marais          23             56          2           55                    45
Pernilla Qvarfordt      8              67          8           67                    67
Rajib Mall              64             173         2           113                   84
Sunita Sarawagi         41             545         13          450                   318
Tai-Lung Chen           14             14          1           14                    14
Timothy C. Lethbridge   65             491         7           347                   228
Yoshiharu Ishikawa      30             75          2           72                    57

Table 4.3: Statistics of Citations in Scopus

Database   Mean     Std Dev   Variance
GS         805.70   1080.15   1166725.12
Scopus     174.90   186.08    34627.43

Table 4.4: Mean, Std Dev and Variance of Citation Counts

Figure 4.6: Distribution of Citations Over Years (total citations per year for GS and Scopus, 1996–2010)

4.7.2 Growth in Citations over the Years

This section studies the distribution of citations across different years. As year information is not available for all citations provided by GS, only those citations with year information are considered in this study. As Scopus only provides citation information published from 1996 onwards, the study compares the distribution of citations from 1996 for both databases.

Figure 4.6 shows the distribution of citations published in different years. It can be seen that both Scopus and GS show the same trend in the number of citations published (i.e. the number of citations increases in subsequent years).

Figure 4.7 groups the published citations into five-year periods. As expected, for both databases, more citations are available for the past five years than for the preceding 10 years.
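The grouping behind Figure 4.7 is a simple binning of citation years; a minimal sketch is shown below, assuming the citation years have already been extracted into a list (the function name and bin labels are illustrative only).

```python
from collections import Counter

PERIODS = [("1996-2000", 1996, 2000),
           ("2001-2005", 2001, 2005),
           ("2006-2011", 2006, 2011)]

def group_by_period(citation_years):
    """Count citations falling into the three periods used in Figure 4.7."""
    counts = Counter()
    for year in citation_years:
        for label, lo, hi in PERIODS:
            if lo <= year <= hi:
                counts[label] += 1
                break
    return counts

# Example with a hypothetical list of citation years:
# print(group_by_period([1997, 2003, 2008, 2009, 2011]))
```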

Figure 4.7: Grouping Distribution of Citations (total citations per period, 1996–2000, 2001–2005 and 2006–2011, for GS and Scopus)

4.7.3 h-index, g-index and hg-index

4.7.3.1 h-index

Originally presented by Hirsch [Hir05], the h-index is an index that attempts to measure both the productivity and impact of the published work of a scholar.

The index is based on the set of the scientist’s most-cited papers and the number of citations received in other publications: A scientist has index h if h of his or her Np papers have at least h citations each, and the other (Np-h) papers have no more than h citations each.
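The definition translates directly into a few lines of code; the following sketch is a straightforward reading of it, not the implementation used in the thesis.

```python
def h_index(citations):
    """h-index: the largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

# Five papers with these citation counts give an h-index of 3.
assert h_index([10, 8, 5, 2, 1]) == 3
```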

One of the main advantages of the h-index is that it combines both the quantity (publications) and the effect (citations) of the author's papers, and it performs better than other single-number criteria (e.g., total number of documents and citations) commonly used to evaluate a researcher's scientific output. Another advantage of this indicator is that it is simple to compute from the citation data available through the WoS of the ISI Web of Knowledge 6. The h-index has been proven robust in the sense that it is insensitive to a set of lowly cited papers [Van07].

However, it has received much criticism [BW11, Wen07, Sek08] for disadvantaging newcomers, since their publication and citation rates are relatively low. The h-index does not account for the number of authors of a paper or for the information contained in author placement, which is significant in some scientific fields. Although it is useful for comparing the best scientists, its power for distinguishing between average scientists is limited. Despite being based on the number of citations received, it lacks sensitivity to performance changes. As a result, it allows scientists to rest on their laurels, since the number of citations received may increase even if no new paper is published.

4.7.3.2 g-index

Proposed by Leo Egghe [Egg06], the g-index aims to improve on the h-index by giving more weight to highly cited articles. The g-index attempts to address the shortcomings of the h-index and is calculated based on the distribution of citations received by a given researcher's publications: Given a set of articles ranked in decreasing order of the number of citations they received, the g-index is the (unique) largest number such that the top g articles received (together) at least g² citations.

In simple terms, this means that an author with a g-index of n has received, on average, at least n citations for each of his or her top n articles. In this way, it is similar to the h-index, with the difference that the number of citations per individual article is not made explicit.

6 http://www.isiwebofknowledge.com/

It is easy to prove that g ≥ h [Egg06]. However, although the g-index is successful in evaluating the production of a researcher by incorporating the actual citations of his or her papers, it has the drawback of being greatly influenced by a single very successful paper [ACHVH10].
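As with the h-index, the definition above can be computed directly from the sorted citation counts; the sketch below is one straightforward reading of it, not the thesis's own implementation.

```python
def g_index(citations):
    """g-index: the largest g such that the top g papers together
    received at least g*g citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cites, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

# Cumulative sums 10, 18, 23, 25, 26 against 1, 4, 9, 16, 25 give g = 5.
assert g_index([10, 8, 5, 2, 1]) == 5
```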

4.7.3.3 hg-index

The paper [ACHVH10] presents a combined index, the hg-index, which tries to fuse together the benefits of both previous measures while minimising the drawbacks that each one presents. The hg-index of a researcher is computed as the geometric mean of his or her h- and g-indices, that is, hg = √(h · g).

It is trivial to demonstrate that the hg-index corresponds to a value nearer to the h-index than to the g-index, thus avoiding the problem of the large influence that a very successful paper can have on the g-index. Some additional benefits of this new index are that:

• it is very simple to compute once the h- and g-indices have been obtained;

• it provides more granularity than the h- and g-indices and is easy to understand and compare with existing indices;

• it takes into account the number of citations of highly cited papers (the h-index is insensitive to highly cited papers), but it significantly reduces the impact of single very highly cited papers (a drawback of the g-index), thus achieving a better balance between the impact of the majority of the author's best papers and of the very highly cited papers.
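As a minimal check of the formula hg = √(h · g) given above, the following snippet reproduces one GS entry of Table 4.5 from its h- and g-index values.

```python
import math

def hg_index(h, g):
    """hg-index: geometric mean of the h- and g-indices."""
    return math.sqrt(h * g)

# Elio Masciari (GS) in Table 4.5: h = 9, g = 19.
print(round(hg_index(9, 19), 2))  # 13.08
```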

Table 4.5 depicts the h-index, g-index and hg-index of all authors considered.

The results show that, for all authors, GS provides higher values for all three indices. These findings show that it matters which citation tool is used to compute the h-index, g-index or hg-index of scientists. In addition, there seem to be disciplinary differences in the coverage of the databases, which creates a dilemma for policy makers and promotion committees. The study recommends further studies to explore the relative strengths and weaknesses of the currently available citation tools. It also recommends further exploration of the capabilities and limitations of GS, especially the citing items.

                        GS                               Scopus
Author                  h-index   g-index   hg-index     h-index   g-index   hg-index
Elio Masciari           9         19        13.08        5         10        7.07
Guido Schäfer           11        18        14.07        7         10        8.37
Luis Baumela            8         16        11.31        5         10        7.07
Patrick Marais          7         12        9.17         4         7         5.29
Pernilla Qvarfordt      7         15        10.25        4         8         5.66
Rajib Mall              11        15        12.85        6         10        7.75
Sunita Sarawagi         29        59        41.36        13        23        17.29
Tai-Lung Chen           3         4         3.46         2         3         2.45
Timothy C. Lethbridge   23        41        30.71        12        21        15.87
Yoshiharu Ishikawa      9         29        16.16        4         8         5.66

Table 4.5: Measuring h-index, g-index and hg-index

4.7.4 Overlap and Uniqueness of Citing References

In the next step, a list of top cited articles is gathered for all authors, and the citing references for the respective articles are analysed for uniqueness. For these top cited articles, an automated matching algorithm is developed to identify the overlap and uniqueness of citing references between GS and Scopus. For each article, the algorithm divides its citing references into three groups (a sketch of this matching step is given after the list below):

1. overlapping citations from GS and Scopus

2. unique citations from GS

3. unique citations from Scopus.
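The exact matching criterion of the developed algorithm is not detailed here; the sketch below shows one simple way such a categorisation could be performed, by comparing normalised title strings of the citing references. The record format and the normalisation rule are illustrative assumptions only.

```python
import re

def normalise(title):
    """Crude normalisation: lower-case the title and keep alphanumerics only."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def categorise(gs_refs, scopus_refs):
    """Split citing-reference titles into the three groups used in Table 4.6."""
    gs_keys = {normalise(t) for t in gs_refs}
    sc_keys = {normalise(t) for t in scopus_refs}
    return {
        "common": gs_keys & sc_keys,
        "unique_gs": gs_keys - sc_keys,
        "unique_scopus": sc_keys - gs_keys,
    }

# Example with two hypothetical citing titles that differ only in punctuation:
# categorise(["A Survey of Record Linkage."], ["A survey of record linkage"])
```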

Following the work in [BBGW06], to test the accuracy rate of the matching algorithm, a sample set of 200 citing references is checked manually to determine whether the algorithm has placed each citing reference in the correct category (of the three listed above). A citing article is marked as an error if it is not placed in the correct category by the matching algorithm. The matching algorithm is found to categorise the citing articles with an accuracy rate of 99.1 per cent. With this acceptable accuracy, the matching algorithm is then used to categorise the citing references of all top cited articles.

Table 4.6 shows the distribution of the unique and overlapping references returned by the algorithm. As shown, for all articles, GS provides a greater number of unique citing references than Scopus, and it also produces higher citation counts. Overall, GS provides 70 per cent of unique citations, whereas Scopus provides only 20 per cent of unique citations. From further investigation, the unique citations from GS contain 0.14 per cent of citations in languages other than English, and 0.02 per cent of citations are from books and book chapters.

Article Title                                                                                                Unique to GS   Unique to Scopus   Common to GS and Scopus
Fast detection of XML structural similarity                                                                  45             5                  39
A group-strategyproof mechanism for Steiner forests                                                          29             1                  17
Determining the egomotion of an uncalibrated camera from instantaneous optical flow                          36             4                  21
Animation space: A truly linear framework for character animation                                            15             2                  18
Conversing with the user based on eye-gaze patterns                                                          47             0                  29
Load balanced routing in mobile ad hoc networks                                                              10             2                  24
Modeling multidimensional databases                                                                          413            7                  63
Performance effective pre-scheduling strategy for heterogeneous grid systems in the master slave paradigm    3              2                  7
What knowledge is important to a software professional?                                                      106            10                 49
Evaluation of signature files as set access facilities in OODBs                                              81             4                  31

Table 4.6: Distribution of Unique and Overlapping Citations

4.8 Conclusions

The study underlines the usefulness of GS for scholars conducting citation analysis, and for those who need assistance in compiling their own citation records. GS helps to identify a considerable number of citations not found in Scopus, thus significantly increasing bibliometric indices (e.g., the h-index). The study again suggests the need for data preparation and data cleaning to achieve useful and correct results, and informs researchers of the value of using multiple sources for identifying citations to authors, papers and journals. It develops a semi-automatic tool, with an accuracy of 98.8 per cent, to ease the tedious and time-consuming task of cleaning the errors from GS citation results. The experiments conducted on the two databases (GS and Scopus) show that GS offers invaluable help in collecting citations not only from scholarly sources not covered by subscription databases, but also from journals covered by subscription databases. These could be useful in showing evidence of the broader international impact of a publication. The study also finds a number of publications (26 per cent) that are missing in Scopus but are available in GS. These results suggest that GS citations may provide a more equitable measure of impact across the discipline than Scopus citations. In addition, as GS is free, it can be useful for performing citation analysis where subscription-based tools are not available.

Chapter 5

Conclusions and Future Work

This thesis mainly emphasises text mining approaches for dealing with data quality problems. It further outlines the major steps for data transformation, data cleaning and clustering, and proposes new techniques to handle data from single and multiple sources.

Chapter 2 tackles the problem of automatically clustering search results returned by the popular, freely accessible Web search engine Google Scholar. Two key issues are identified: the need for a good similarity function to reflect the similarities between research publications, and the need for a clustering algorithm that is robust to short documents and noise. These issues are tackled by utilising domain information in a new similarity function and by implementing an outlier-conscious algorithm to generate the required number of clusters. The experimental results suggest that domain knowledge can be useful in improving the quality of the clusters and in detecting outliers. In addition, clustering the snippets returned by the search engine is a reasonable and speedy alternative to downloading the original documents.

Future work will investigate alternative methods of extracting useful domain knowledge. One interesting direction could be the use of citation information. There is also a plan to experiment with datasets in various other domains, perform large-scale evaluations of the proposed system and obtain user feedback to further improve the proposed system.

Chapter 3 focuses on record linkage approaches for complex co-referent records that have little surface similarity (e.g. VLDB and Very Large Databases). Transformation rules are incorporated to recognise these domain-specific equivalence relationships. These rules are obtained by proposing an effective and efficient top-k transformation rule learning algorithm. The algorithm is based on performing a careful local alignment of the input co-referent record pairs and generating candidate rules based on the optimal local alignment. Compared with the global greedy algorithm in [ACK09], the proposed algorithm generates fewer candidate rules, and the generated candidate rules are more likely to be correct. Extensive experiments are performed using several publicly available real-world datasets. The results show an increase of three times in the percentage of correct rules compared with the previous approach, and a speed-up in efficiency of up to 300 times.

Future work may investigate the impact of the partitioning method on the quality of the rules. Another direction is to explore more efficient data structures and algorithms to further accelerate the proposed rule learning algorithm. Finally, the proposed method will be generalised to more complex transformation rules, for example, rules containing gaps or variables.

Chapter 4 suggests the need for data preparation and data cleaning to achieve useful and correct results. It informs researchers of the value of using multiple sources to identify citations to authors, papers and journals. It develops a semi-automatic tool, with an accuracy of 98.8 per cent, to ease the tedious and time-consuming task of cleaning the errors from Google Scholar citation results. In addition, the tool is able to classify the citations into proper fields (e.g. author, source and venue), in order to make them suitable for conducting citation analysis.

Using the two databases (Google Scholar and Scopus), it compares the scholarly performance of various researchers in computer science by using several bibliometric indicators (e.g. number of citations, citations per year and h-index). The experimental results show that Google Scholar outperforms Scopus in its coverage of the number of citations and unique citations. These results agree with recent outcomes from the literature [BI10, ŠUP10, Fra10, GP10, JPi10] obtained with various sources of citations, namely an information science book, the Croatian Medical Journal, computer science journals, Spanish psychologists and the Serbian Dental Journal. A pair-wise comparison using Student's t-test performed on the mean citation rates of the two databases shows a statistically significant difference in citation counts between them (p < 0.05). Google Scholar helps to identify a considerable number of citations not found in Scopus, thus significantly increasing the bibliometric indices (h-, g- and hg-index). The study also finds a number of publications (26 per cent) that are missing in Scopus but available in Google Scholar. These results suggest that Google Scholar citations may provide a more equitable measure of impact across the discipline than Scopus citations. As Google Scholar is free, it can be useful for performing citation analysis where subscription-based tools are not available.

Future work involves large-scale, discipline-specific tests. It is important to examine the exact composition of the citing material and to test the intrinsic quality of the citations found in the databases. It would be interesting to compare the impact of different geographic locations when conducting the citation analysis of authors, conferences and journals. This can help to analyse the performance of young scientists and scholars from different geographic locations (e.g. based on countries, institutions or continents).

It is believed that the proposed snippet clustering and transformation rule learning methods can be applied to any short text or snippets in any domain. Future work will explore this with datasets from other domains and will conduct large-scale citation analysis in various other disciplines.

Bibliography

[ACDK08] Rema Ananthanarayanan, Vijil Chenthamarakshan, Prasad M. Deshpande, and Raghuram Krishnapuram. Rule based synonyms for entity extraction from noisy text. In AND, pages 31–38, 2008.

[ACGK08] Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, and Raghav Kaushik. Incorporating string transformations in record match- ing. In SIGMOD Conference, pages 1231–1234, 2008.

[ACHVH09] Sergio Alonso, Francisco Javier Cabrerizo, Enrique Herrera- Viedma, and Francisco Herrera. h-index: A review focused in its variants, computation and standardization for different scien- tific fields. J. Informetrics, 3(4):273–289, 2009.

[ACHVH10] Sergio Alonso, Francisco Javier Cabrerizo, Enrique Herrera- Viedma, and Francisco Herrera. hg-index: a new index to char- acterize the scientific output of researchers based on the h- and g-indices. Scientometrics, 82(2):391–400, 2010.

[ACK08] Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.

[ACK09] Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Learning string transformations from examples. PVLDB, 2(1):514–525, 2009.

[AGK10] Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783–794, 2010.

[AKL+09] Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. Url normalization for de-duplication of web pages. In CIKM, pages 1987–1990, 2009.

[BB05] Kathleen Bauer and Nisa Bakkalbasi. An examination of citation counts in a new scholarly communication environment. D-Lib Magazine, 11(9), 2005.

[BBGW06] Nisa Bakkalbasi, Kathleen Bauer, Janis Glover, and Lei Wang. Three options for citation tracking: Google scholar, scopus and web of science. Biomedical Digital Libraries, 3, 2006.

[BBHG11] Eduardo Borges, Karin Becker, Carlos Heuser, and Renata Galante. An automatic approach for duplicate bibliographic metadata identification using classification. 2011.

[BdCdMG+11] Eduardo N. Borges, Moisés G. de Carvalho, Renata de Matos Galante, Marcos André Gonçalves, and Alberto H. F. Laender. An unsupervised heuristic-based approach for bibliographic metadata deduplication. Inf. Process. Manage., 47(5):706–718, 2011.

[BG04] Indrajit Bhattacharya and Lise Getoor. Iterative record linkage for cleaning and integration. In DMKD, pages 11–18, 2004.

[BI08] Judit Bar-Ilan. Which h-index? - a comparison of wos, scopus and google scholar. Scientometrics, 74(2):257–271, 2008.

[BI10] Judit Bar-Ilan. Citations to the "introduction to informetrics" indexed by wos, scopus and google scholar. Scientometrics, 82(3):495–506, 2010.

[BILL07] Judit Bar-Ilan, Mark Levene, and Ayelet Lin. Some measures for comparing citation databases. J. Informetrics, 1(1):26–34, 2007.

[BK10] W. Aisha Banu and P. Sheikh Abdul Kader. A hybrid context based approach for web information retrieval. International Journal of Computer Applications, 10(7):25–28, November 2010.

[BM02] Mikhail Bilenko and Raymond J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical report, University of Texas at Austin, 2002.

[BM03a] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39–48, New York, NY, USA, 2003. ACM.

[BM03b] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.

[BM03c] Mikhail Bilenko and Raymond J. Mooney. On evalu- ation and training-set construction for duplicate detection. In PROCEEDINGS OF THE KDD-2003 WORKSHOP ON DATA CLEANING, RECORD LINKAGE, AND OBJECT CONSOLIDATION, WASHINGTON DC, pages 7–12, 2003.

[BRG07] Somnath Banerjee, Krishnan Ramanathan, and Ajay Gupta. Clustering short texts using wikipedia. In SIGIR, pages 787– 788, 2007.

[Bri95] Eric Brill. Transformation-based error-driven learning and natu- ral language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, 1995.

[Buc06] Robert A. Buchanan. Accuracy of cited references: The role of citation databases. College and Research Libraries, 67(4):292– 303, 2006.

[BW11] Lutz Bornmann and Marx Werner. The h index as a research performance indicator. European Science Editing, 37(3):77–80, 2011.

[BWMRAT09] L. Bornmann, W. Marx, H. Schier, E. Rahm, A. Thor, and H.-D. Daniel. Convergent validity of bibliometric google scholar data in the field of chemistry: citation counts for papers that were accepted by angewandte chemie international edition or rejected but published elsewhere, using google scholar, science citation index, scopus, and chemical abstracts. Journal of Informetrics, 3(1):27–35, 2009.

[BYKS09a] Ziv Bar-Yossef, Idit Keidar, and Uri Schonfeld. Do not crawl in the dust: Different urls with similar text. ACM Trans. Web, 3(1):1–31, 2009.

[BYKS09b] Ziv Bar-Yossef, Idit Keidar, and Uri Schonfeld. Do not crawl in the dust: Different urls with similar text. TWEB, 3(1):3, 2009.

[BZZM10] Shunlai Bai, Wenhao Zhu, Bofeng Zhang, and Jianhua Ma. Search results clustering based on suffix array and vsm. IEEE-ACM International Conference on Green Computing and Communications and International Conference on Cyber, Physical and Social Computing, 0:852–857, 2010.

[CCG+06] Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, and Nivio Ziviani. Link-based similarity measures for the classification of web documents. JASIST, 57(2):208–221, 2006.

[CD07] Hung Chim and Xiaotie Deng. A new suffix tree similarity mea- sure for document clustering. In WWW, pages 121–130, 2007.

[CDN06] Ricardo Campos, Gaël Dias, and Célia Nunes. Wise: Hierarchical soft clustering of web page search results based on web content mining techniques. In Web Intelligence, pages 301–304, 2006.

[CKM07] Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra. Adaptive graphical approach to entity resolution. In JCDL, pages 204–213, 2007.

[CM05] Aron Culotta and Andrew McCallum. Joint deduplication of multiple record types in relational data. In CIKM, pages 257– 258, 2005.

[CR02] William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In KDD, pages 475–480, 2002.

[CRF03] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fien- berg. A comparison of string metrics for matching names and records. In In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, 2003.

[DHM05] Xin Dong, Alon Y. Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In SIGMOD Conference, pages 85–96, 2005.

[DKS08] Anirban Dasgupta, Ravi Kumar, and Amit Sasturkar. De-duping urls via rewrite rules. In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 186–194, New York, NY, USA, 2008. ACM.

[Egg06] Leo Egghe. Theory and practise of the g-index. Scientometrics, 69(1):131–152, 2006.

[EIV07] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.

[FG08] Paolo Ferragina and Antonio Gulli. A personalized search en- gine based on web-snippet hierarchical clustering. Softw., Pract. Exper., 38(2):189–225, 2008.

[FPMP08] Matthew E Falagas, Eleni I Pitsouni, George A Malietzis, and Georgios Pappas. Comparison of PubMed, scopus, web of sci- ence, and google scholar: strengths and weaknesses. The FASEB Journal: Official Publication of the Federation of American Societies for Experimental Biology, 22(2):338–342, February 2008.

[Fra10] Massimo Franceschet. A comparison of bibliometric indicators for computer science scholars and journals on web of science and google scholar. Scientometrics, 83(1):243–258, 2010.

[GBL98] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In ACM DL, pages 89– 98, 1998.

[GDV07] Fatih Gelgi, Hasan Davulcu, and Srinivas Vadrevu. Term ranking for clustering web search results. In WebDB, 2007.

[GMLG01] Abby Goodrum, Katherine W. McCain, Steve Lawrence, and C. Lee Giles. Scholarly publishing in the internet age: a citation analysis of computer science literature. Inf. Process. Manage., 37(5):661–675, 2001.

[Gon85] Teofilo F. Gonzalez. Clustering to minimize the maximum inter- cluster distance. Theor. Comput. Sci., 38:293–306, 1985.

[GP10] Miguel A. García-Pérez. Accuracy and completeness of publication and citation records in the web of science, psycinfo, and google scholar: A case study for the computation of indices in psychology. JASIST, 61(10):2070–2085, 2010.

[GPMS06] Filippo Geraci, Marco Pellegrini, Marco Maggini, and Fabrizio Sebastiani. Cluster generation and cluster labelling for web snip- pets: A fast and accurate hierarchical solution. In SPIRE, pages 25–36, 2006.

[GPPS06] Filippo Geraci, Marco Pellegrini, Paolo Pisati, and Fabrizio Se- bastiani. A scalable algorithm for high-quality clustering of web snippets. In SAC, pages 1058–1062, 2006.

[HAMA06] Joseph Hassell, Boanerges Aleman-Meza, and Ismailcem Budak Arpinar. Ontology-driven automatic entity disambiguation in un- structured text. In International Semantic Web Conference, pages 44–57, 2006.

[Hea06] Marti A. Hearst. Clustering versus faceted categories for infor- mation exploration. Commun. ACM, 49(4):59–61, 2006.

[HGZ+04] Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In JCDL, pages 296–305, 2004.

[HH96] Jeremy A. Hylton and Jeremy A. Hylton. Identifying and merg- ing related bibliographic records. Technical report, MIT LCS Masters Thesis, 1996.

[Hir05] Jorge E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569–16572, 2005.

[HN02] Tu Bao Ho and Ngoc Binh Nguyen. Nonhierarchical document clustering based on a tolerance rough set model. Int. J. Intell. Syst., 17(2):199–212, 2002.

[HvdW09] Anne-Wil Harzing and Ron van der Wal. A google scholar h-index for journals: An alternative metric to measure journal impact in economics and business. Journal of the American Society for Information Science and Technology, 60:41–46, 2009.

[Jac05a] Peter Jacso. As we may search comparison of major features of the web of science, scopus, and google scholar citation-based and citation-enhanced databases. Current Science, 89(9):1537–1547, 2005.

[Jac05b] Peter Jacso. Comparison and analysis of the citedness scores in web of science and google scholar. Digital libraries Implementing strategies and sharing experiences, pages 360 – 369, 2005.

[Jac08a] Peter Jacso. The plausibility of computing the h-index of schol- arly productivity and impact using reference-enhanced databases. 32:266 – 283, 2008.

[Jac08b] Peter Jacso. Testing the calculation of a realistic h-index in google scholar, scopus, and web of science for f. w. lancaster. Library Trends, 56:784–815, 2008.

[Jac09a] Peter Jacso. Errors of omission and their implications for com- puting scientometric measures in evaluating the publishing pro- ductivity and impact of countries. Online Information Review, 33:376 – 385, 2009.

[Jac09b] Peter Jacso. The h-index for countries in web of science and scopus. Online Information Review, 33:831 – 837, 2009.

[JHR09] Supakpong Jinarat, Choochart Haruechaiyasak, and Arnon Rungsawang. Web snippet clustering based on text enrichment with concept hierarchy. In ICONIP (2), pages 309–317, 2009.

[JJKY] Zhihua Jiang, Anupam Joshi, Raghu Krishnapuram, and Liyu Yi. Retriever: Improving web search engine results using clustering.

[JK] Jongkol Janruang and Worapoj Kreesuradej. Hierarchical and overlapping clustering of retrieved web pages. In Recent Advances in Intelligent Information Systems, pages 345–358.

[JK06] Jongkol Janruang and Worapoj Kreesuradej. A new web search result clustering based on true common phrase label discovery. In CIMCA/IAWTIC, page 242, 2006.

[JMF99] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, 1999.

[JPi10] Jelena Jaćimović, Ružica Petrović, and Slavoljub Živković. A citation analysis of serbian dental journal using web of science, scopus and google scholar. DOAJ, 57(4):201–211, 2010.

[JRMG06] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. Generating query substitutions. In WWW, pages 387–396, 2006.

[KASW09] Abhaya V. Kulkarni, Brittany Aziz, Iffat Shams, and Jason W. Busse. Comparisons of citations in web of science, scopus, and google scholar for articles published in general medical journals. JAMA, 302(10):1092–1096, 2009.

[KJNY01] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, and Liyu Yi. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Transactions on Fuzzy Systems, 9:595–608, 2001.

[KLR+04] Krishna Kummamuru, Rohit Lotlikar, Shourya Roy, Karan Singal, and Raghu Krishnapuram. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In WWW, pages 658–665, 2004.

[KM06] Dmitri V. Kalashnikov and Sharad Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767, 2006.

[KMC05] Dmitri V. Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen. Exploiting relationships for domain-independent data cleaning. In SDM, 2005.

[KP04] Sotiris B. Kotsiantis and Panayiotis E. Pintelas. Recent advances in clustering: A brief survey. WSEAS Transactions on Information Science and Applications, 1:73–81, 2004.

[KPT09] Stella Kopidaki, Panagiotis Papadakos, and Yannis Tzitzikas. Stc+ and nm-stc: Two novel online results clustering methods for web searching. In WISE, pages 523–537, 2009.

[KT06] Kayvan Kousha and Mike Thelwall. Motivations for url citations to open access library and information science articles. Scientometrics, 68(3):501–517, 2006.

[KT07] Kayvan Kousha and Mike Thelwall. Google scholar citations and google web/url citations: A multi-discipline exploratory analysis. JASIST, 58(7):1055–1065, 2007.

[KT08] Kayvan Kousha and Mike Thelwall. Sources of google scholar citations outside the science citation index: A comparison between four science disciplines. Scientometrics, 74(2):273–294, 2008.

[Kuh55] Harold W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[LBLB10] Jie Li, Judy F. Burnham, Trey Lemley, and Robert M. Britton. Citation analysis: Comparison of web of science (registered), scopus, scifinder (registered), and google scholar. Journal of Electronic Resources in Medical Libraries, 7:196–217, 2010.

[LC03] Dawn J. Lawrie and W. Bruce Croft. Generating hierarchical summaries for web searches. In SIGIR, pages 457–458, 2003.

[Ley09] Michael Ley. Dblp - some lessons learned. PVLDB, 2(2):1493–1500, 2009.

[LGB99] Steve Lawrence, C. Lee Giles, and Kurt D. Bollacker. Autonomous citation matching. In Agents, pages 392–393, 1999.

[LIdMAM09] Carmen López-Illescas, Félix de Moya Anegón, and Henk F. Moed. Comparing bibliometric country-by-country rankings derived from the web of science and scopus: the effect of poorly cited journals in oncology. J. Information Science, 35(2):244–256, 2009.

[LMP01] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.

[LOKP05] Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In IQIS, pages 69–76, 2005.

[LW10] Zhao Li and Xindong Wu. A phrase-based method for hierarchical clustering of web snippets. In AAAI, 2010.

[LY06] Jianchao Li and Tianfang Yao. An efficient token-based approach for web-snippet clustering. In SKG, page 13, 2006.

[Mik10] Susanne Mikki. Comparing google scholar and isi web of science for earth sciences. Scientometrics, 82(2):321–331, 2010.

[MK09] Matthew Michelson and Craig A. Knoblock. Mining the het- erogeneous transformations between data sources to aid record linkage. In IC-AI, pages 422–428, 2009.

[MNK+05] Steven Minton, Claude Nanjo, Craig A. Knoblock, Martin Michalowski, and Matthew Michelson. A heterogeneous field matching method for record linkage. In ICDM, pages 314–321, 2005.

[MNU00] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to ref- erence matching. In KDD, pages 169–178, 2000.

[MR08] Lokman I. Meho and Yvonne Rogers. Citation counting, ci- tation ranking, and h-index of human-computer interaction re- searchers: A comparison between scopus and web of science. CoRR, abs/0803.1716, 2008.

[MS09] Lokman I. Meho and Cassidy R. Sugimoto. Assessing the scholarly impact of information studies: A tale of two citation databases: scopus and web of science. Journal of the American Society for Information Science and Technology, 60:2499–2508, 2009.

[Mul05] H. Müller. Problems, methods, and challenges in comprehensive data cleansing. Informatik-Berichte, Professoren des Inst. für Informatik, 2005.

[MY07] Lokman I. Meho and Kiduk Yang. Impact of data sources on citation counts and rankings of lis faculty: Web of science versus scopus and google scholar. JASIST, 58(13):2105–2125, 2007.

[MYC08] Erwan Moreau, François Yvon, and Olivier Cappé. Robust similarity measures for named entities matching. In COLING, 2008.

[NC10] Roberto Navigli and Giuseppe Crisafulli. Inducing word senses to improve web search result clustering. In EMNLP, pages 116– 126, 2010.

[ND06] Christoph Neuhaus and Hans-Dieter Daniel. Data sources for performing citation analysis: An overview. Journal of Documentation, 64(2):193–210, 2006.

[NLQ+09] Xingliang Ni, Zhi Lu, Xiaojun Quan, Wenyin Liu, and Bei Hua. Short text clustering for search results. In APWeb/WAIM, pages 584–589, 2009.

[NN05] Chi Lang Ngo and Hung Son Nguyen. A method of web search result clustering based on rough sets. In Web Intelligence, pages 673–679, 2005.

[Nor05] Alireza Noruzi. Google scholar: The new generation of citation indexes. Libri, 55:170–180, 2005.

[NPH+09] Cam-Tu Nguyen, Xuan Hieu Phan, Susumu Horiguchi, Thu- Trang Nguyen, and Quang-Thuy Ha. Web search clustering and labeling with hidden topics. ACM Trans. Asian Lang. Inf. Process., 8(3), 2009.

[NQL+11] Xingliang Ni, Xiaojun Quan, Zhi Lu, Liu Wenyin, and Bei Hua. Short text clustering by finding core terms. Knowl. Inf. Syst., 27(3):345–365, 2011.

[OSW04] Stanislaw Osinski, Jerzy Stefanowski, and Dawid Weiss. Lingo: Search results clustering algorithm based on singular value de- composition. In Intelligent Information Systems, pages 359–368, 2004.

[OW04] Stanislaw Osinski and Dawid Weiss. Conceptual clustering using lingo algorithm: Evaluation on open directory project data. In Intelligent Information Systems, pages 369–377, 2004.

[PKM03] Bo Pang, Kevin Knight, and Daniel Marcu. Syntax-based align- ment of multiple translations: Extracting paraphrases and gener- ating new sentences. In HLT-NAACL, 2003.

[PMM+02] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart J. Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In NIPS, pages 1401–1408, 2002.

[PS05] Daniel Pauly and Konstantinos L. Stergiou. Equivalence of results from two citation analyses: Thomson isi's citation index and google's scholar service. Ethics in Science and Environmental Politics, pages 33–35, 2005.

[RBC+08] Filip Radlinski, Andrei Z. Broder, Peter Ciccolo, Evgeniy Gabrilovich, Vanja Josifovski, and Lance Riedel. Optimizing rel- evance and revenue in ad search: a query substitution approach. In SIGIR, pages 403–410, 2008.

[RD00] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.

[RKT11] Aniket Rangrej, Sayali Kulkarni, and Ashish V. Tendulkar. Com- parative study of clustering techniques for short text documents. In WWW (Companion Volume), pages 111–112, 2011.

[RT05] Erhard Rahm and Andreas Thor. Citation analysis of database publications. SIGMOD Record, 34(4):48–53, 2005.

[Saa06] Gad Saad. Exploring the h-index at the author and journal levels using bibliometric data of productive consumer schol- ars and business-related journals respectively. Scientometrics, 69(1):117–120, 2006.

[SB02a] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive dedu- plication using active learning. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA, 2002. ACM.

[SB02b] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive dedu- plication using active learning. In KDD, pages 269–278, 2002.

[SD05] Parag Singla and Pedro Domingos. Collective object identifica- tion. In IJCAI, pages 1636–1637, 2005.

[Sek08] Cagan H. Sekercioglu. Quantifying Coauthor Contributions. Science, 322(5900):371+, October 2008.

[SH97] Giorgio Satta and John C. Henderson. String transformation learning. In ACL, pages 444–451, 1997.

[SK08] Anestis Sitas and Sarantos Kapidakis. Duplicate detection algo- rithms of bibliographic descriptions, 2008.

[ŠUP10] Marijan Šember, Ana Utrobičić, and Jelka Petrak. Croatian medical journal citation score in web of science, scopus, and google scholar. Croatian Medical Journal, 51(2):99–103, 2010.

[TKM01] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Inf. Syst., 26(8):607–633, 2001.

[TKM02] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learn- ing domain-independent string transformation weights for high accuracy object identification. In KDD, pages 350–359, 2002.

[Tur02] Peter D. Turney. Mining the web for synonyms: Pmi-ir versus lsa on toefl. CoRR, cs.LG/0212033, 2002.

[Van07] Jerome K. Vanclay. On the robustness of the h-index. CoRR, abs/cs/0701074, 2007.

[VS03] Liwen Vaughan and Debora Shaw. Bibliographic and web cita- tions: What is the difference? JASIST, 54(14):1313–1322, 2003.

[VS08] Liwen Vaughan and Debora Shaw. A new look at evidence of scholarly citation in citation indexes and from web sources. Scientometrics, 74(2):317–330, 2008.

[WDDT04] Ian H. Witten, Katherine J. Don, Michael Dewsnip, and Valentin Tablan. Text mining in a digital library. International Journal on Digital Libraries, 4:2004, 2004.

[Wei01] Dawid Weiss. A Clustering Interface for Web Search Results in Polish and English, 2001.

[Wen07] Michael C. Wendl. H-index: however ranked, citations need con- text. Nature, 449(7161):403, 2007.

[WHL09] Han Wen, Guo-Shun Huang, and Zhao Li. Clustering web search results using semantic information. In International Conference on Machine Learning and Cybernetics, volume 3, pages 1504– 1509, 2009.

[Win99] William E. Winkler. The state of record linkage and current re- search problems. Technical report, Statistical Research Division, U.S. Census Bureau, 1999.

[WK02] Yitong Wang and Masaru Kitsuregawa. On combining link and contents information for web page clustering. In DEXA, pages 902–913, 2002.

[WMH+08] Junze Wang, Yijun Mo, Benxiong Huang, Jie Wen, and Li He. Web search results clustering based on a novel suffix tree struc- ture. In ATC, pages 540–554, 2008.

[WS03] Dawid Weiss and Jerzy Stefanowski. Web search results cluster- ing in polish: Experimental evaluation of carrot. In IIS, pages 209–218, 2003.

[WT91] William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Decen- nial Census. Technical report, US Bureau of the Census, 1991.

[WXC09] Han Wen, Nanfeng Xiao, and Qiong Chen. Web snippets clus- tering based on an improved suffix tree algorithm. In FSKD (1), pages 542–547, 2009.

[WXLZ09] Wei Wang, Chuan Xiao, Xuemin Lin, and Chengqi Zhang. Effi- cient approximate entity extraction with edit distance constraints. In SIGMOD, 2009.

[WZP+08] Ying Wang, Wanli Zuo, Tao Peng, Fengling He, and Hailong Hu. Clustering web search results based on interactive suffix tree algorithm. Convergence Information Technology, International Conference on, 2:851–857, 2008.

[XI05] Rui Xu and Donald C. Wunsch II. Survey of clustering algo- rithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[XWLY08] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Ef- ficient similarity joins for near duplicate detection. In WWW, 2008.

[Yia93] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, pages 311– 321, 1993.

[YM06] Kiduk Yang and Lokman I. Meho. Citation analysis: A comparison of google scholar, scopus, and web of science. Proceedings of the American Society for Information Science and Technology, 43(1):1–15, 2006.

[ZD04] Dell Zhang and Yisheng Dong. Semantic, hierarchical, online clustering of web search results. In APWeb, pages 69–78, 2004.

[ZD08] Dengya Zhu and Heinz Dreher. Improving web search by cat- egorization, clustering, and personalization. In ADMA, pages 659–666, 2008.

[ZE98] Oren Zamir and Oren Etzioni. Web document clustering: A fea- sibility demonstration. In SIGIR, pages 46–54, 1998.

[ZE99] Oren Zamir and Oren Etzioni. Grouper: A dynamic cluster- ing interface to web search results. Computer Networks, 31(11- 16):1361–1374, 1999.

[ZEMK97] Oren Zamir, Oren Etzioni, Omid Madani, and Richard M. Karp. Fast and intuitive clustering of web documents. In KDD, pages 287–290, 1997.

[ZHC+04] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jin- wen Ma. Learning to cluster web search results. In SIGIR, pages 210–217, 2004.

[ZL02] Dangzhi Zhao and Elisabeth Logan. Citation analysis using scientific publications on the web as data source: a case study in the xml research area. Scientometrics, 54(3):449–472, 2002.

[ZZHM04] Yongzheng Zhang, A. Nur Zincir-Heywood, and Evangelos E. Milios. Term-based clustering and summarization of web page collections. In Canadian Conference on AI, pages 60–74, 2004.