Data Clustering and Cleansing for Bibliography Analysis

by Sunanda Patro

Master in Computer Application (MCA), Berhampur University, India

Master of Science (Computing), University of Tasmania, Australia, 2006

A thesis submitted in fulfillment

of the requirements for the degree of

Doctor of Philosophy

School of Computer Science & Engineering

University of New South Wales

2012

This thesis entitled: Data Clustering and Cleansing for Bibliography Analysis written by Sunanda Patro has been approved for the School of Computer Science & Engineering

Supervisor: A/Prof. Wei Wang

Signature Date

The final copy of this thesis has been examined by the signatory, and I find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.

Declaration

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signature Date

Abstract

Advances in computational resources and the communications infrastructure, as well as the rapid rise of the World Wide Web, have increased the wide availability of published papers in electronic form. Digital libraries - indexed collections of articles - have become an essential resource for academic communities. Citation records are also used to measure the impact of specific publications in the research community. Several bibliography databases have been established, which automatically extract bibliography information and cited references from digital repositories.

Maintaining, updating and correcting citation data in bibliography databases is an important and ever-increasing task. Bibliography databases contain many errors arising, for example, from data entry mistakes, imperfect citation-gathering software or common author names. Text mining is an emerging technology to deal with such problems. In this thesis, new text mining techniques are proposed to deal with three different data quality problems in real-life bibliography data, which include:

1. Clustering search results from citation-enhanced search engines

2. Learning top-k transformation rules from complex co-referent records

3. Comparative citation analysis based on semi-automatically cleansed bibliography data.

The first issue has been tackled by proposing a new similarity function that incorporates domain information, and by implementing an outlier-conscious algorithm in the generation of clusters. Experimental results confirm that the proposed clustering method is superior to prior approaches.

The second problem has been tackled by developing an efficient and effective method to extract top-k high quality transformation rules for a given set of possibly co-referent record pairs. An effective algorithm is proposed that performs careful local analysis for each record pair, generates candidate rules and finally chooses the top-k rules based on a scoring function. Extensive experiments performed on several publicly available real-world datasets demonstrate its advantage over the previous state-of-the-art algorithm in both effectiveness and efficiency.

The final problem has been broached by developing a semi-automatic tool that performs extensive data cleaning, correcting errors found in the citations returned from Google Scholar and converting the citations into structured data formats suitable for citation analysis. The results are then compared with the results from the most widely used subscription-based citation database, Scopus. Extensive experiments on various bibliometric indexes of a collection of research in computer science have demonstrated the usefulness of Google Scholar in conducting citation analysis, and highlighted its broader international impact on the quality of the publication.

Acknowledgements

I would like to thank my thesis supervisor Associate Professor Wei Wang for his guidance, encouragement, valuable time and support throughout the course of the thesis. I would also like to thank my associate-supervisor Prof. Xuemin Lin for his guidance during the research program.

The financial support offered through the University Postgraduate Award (UPA) and the Supplementary Engineering Award from the Faculty of Engineering is acknowledged.

Special thanks go to my family and friends, and in particular my children Suman and Shruti and my husband Gangadhara Prusty, who have been waiting patiently to see the special day when I submit the thesis. Their unconditional support and encouragement motivated me to finish my studies.

I extend my sincerest gratitude to my parents and parents-in-law for their encouragement and co-operation shown during this period.

Finally, I would like to dedicate my thesis to my beloved late grandmother (amamma) for her encouragement, affection, inspiration and blessings, which always motivated me to study and excel. I wish she could have been here...

Contents

Chapter

1 Introduction 1

1.1 Bibliography Data ...... 5

1.2 Current Problems ...... 7

1.3 Outline ...... 14

2 Effective Snippet Clustering with Domain Knowledge 16

2.1 Introduction ...... 16

2.2 Literature Review ...... 19

2.2.1 Flat Clustering ...... 19

2.2.2 Hierarchical Clustering ...... 24

2.2.3 Multiple Clustering Approaches ...... 28

2.3 Snippet Miner ...... 30

2.3.1 Overview ...... 30

2.3.2 Domain-knowledge Mining ...... 31

2.3.3 Snippet Parsing ...... 33

2.3.4 Measuring Similarities between Snippets ...... 34

2.3.5 Cluster Generation ...... 38

2.4 Experiments ...... 39

2.5 Conclusions ...... 46

3 Learning Top-k Transformation Rules 51

3.1 Introduction ...... 51

3.2 Literature Review ...... 54

3.3 The Local-alignment-based Algorithm ...... 64

3.3.1 Pre-processing the Input Data ...... 64

3.3.2 Segmentation ...... 64

3.3.3 Local Alignment ...... 65

3.3.4 Obtaining top-k Rules ...... 70

3.4 Experiment ...... 74

3.4.1 Experiment Setup ...... 74

3.4.2 Datasets ...... 77

3.4.3 Quality of the Rules ...... 79

3.4.4 Correct and Incorrect Rules ...... 80

3.4.5 Unsupervised versus Supervised ...... 86

3.4.6 Effect of Numeric Rules ...... 87

3.4.7 Performance without UpdateRules ...... 90

3.4.8 Precision and Consistency ...... 91

3.4.9 Execution Time ...... 93

3.5 Conclusions ...... 96

4 Semi-automatic Comparative Citation Analysis 97

4.1 Introduction ...... 97

4.2 Proposed Study ...... 101

4.3 Literature Review ...... 102

4.4 Google Scholar (GS) ...... 115

4.5 Scopus ...... 120

4.6 Data Collection ...... 124

4.7 Experimental Analysis ...... 128

4.7.1 Citation Counts ...... 128

4.7.2 Growth in Citations over the Years ...... 131

4.7.3 h-index, g-index and hg-index ...... 132

4.7.4 Overlap and Uniqueness of Citing References ...... 135

4.8 Conclusions ...... 137

5 Conclusions and Future Work 139

Bibliography 143

List of Tables

Table

2.1 Queries ...... 39

2.2 Closely Related Concepts With the Query Concepts ...... 41

2.3 Cluster Sample for Clustering ...... 47

2.4 Cluster Sample for Web Mining ...... 48

2.5 Cluster Sample for Relevance Feedback ...... 49

2.6 Cluster Sample for Information Retrieval ...... 49

2.7 Cluster Sample for Query Expansion ...... 50

3.1 Example of Segmentation ...... 77

3.2 Example of Segmentation ...... 77

3.3 Example of a Cluster of Coreferent Records ...... 79

3.4 Example Rules Found ...... 80

4.1 Author Information ...... 124

4.2 Statistics of Citations in GS ...... 129

4.3 Statistics of Citations in Scopus ...... 130

4.4 Mean, Std Dev and Variance of Citation Counts ...... 130

4.5 Measuring h-index, g-index and hg-index ...... 135

4.6 Distribution of Unique and Overlapping Citations ...... 137

List of Figures

Figure

2.1 Parsing a String into a Set of Concepts ...... 32

2.2 An Example Direct Association Graph ...... 35

2.3 An Example Indirect Association Graph ...... 37

2.4 Cluster Quality (k = 10) ...... 43

2.5 Cluster Quality (k = 15) ...... 43

2.6 Cluster Quality (k = 20) ...... 44

2.7 Cluster Quality (k = 25) ...... 44

2.8 HAC+, k vs. Purity ...... 45

2.9 HAC+, k vs. Purity ...... 45

2.10 HAC+, k vs. Outlier ...... 46

3.1 Optimal Local Alignment ...... 67

3.2 Optimal Local Alignment After Recognising Multi-token Abbreviation (R4) ...... 69

3.3 CCSB, Number of Correct Rules ...... 81

3.4 Cora, Number of Correct Rules ...... 82

3.5 Restaurant, Number of Correct Rules ...... 82

3.6 CCSB, Number of Incorrect Rules ...... 83

3.7 Cora, Number of Incorrect Rules ...... 83

3.8 Restaurant, Number of Incorrect Rules ...... 84

3.9 CCSB, Number of Incorrect Rules vs. Number of Correct Rules . 84

3.10 Cora, Number of Incorrect Rules vs. Number of Correct Rules . . 85

3.11 Restaurant, Number of Incorrect Rules vs. Number of Correct Rules 85

3.12 Supervised vs. Unsupervised CCSB, Number of Correct Rules . . 86

3.13 Supervised vs. Unsupervised Cora, Number of Correct Rules . . . 87

3.14 Supervised Segmentation CCSB, Number of Correct Rules . . . . 88

3.15 Supervised Segmentation Cora, Number of Correct Rules . . . . . 88

3.16 Unsupervised Segmentation CCSB, Number of Correct Rules . . . 89

3.17 Unsupervised Segmentation Cora, Number of Correct Rules . . . 89

3.18 CCSB, Impact of UpdateRules ...... 90

3.19 Cora, Impact of UpdateRules ...... 91

3.20 CCSB, Precision vs. k ...... 92

3.21 CCSB, Precision vs. n ...... 92

3.22 Execution time vs. Input Record Pairs (n) ...... 94

3.23 Execution time vs. Output Size (k) ...... 95

4.1 GS Search Results for an Article Title ...... 118

4.2 GS Results from Citing Link of an Article ...... 119

4.3 Scopus Search Interface ...... 122

4.4 Results of Scopus Author Search ...... 123

4.5 DBLP Search Results for Author Elio Masciari ...... 127

4.6 Distribution of Citations Over Years ...... 131

4.7 Grouping Distribution of Citations ...... 132

Chapter 1

Introduction

The World Wide Web continues to grow at an amazing speed. There is also a quickly growing number of text documents managed in organisational Intranets, which represent the accumulated knowledge of organisations and are important for their success in today's information society. Due to the large size, high dynamics and large diversity of the Web and of organisational Intranets, it has become a challenging task to find the truly relevant content for a given purpose. The enormous amount of information stored in unstructured texts cannot easily be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns.

Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. It is a variation of a field called data mining, which aims to find interesting patterns in large databases. The difference between data mining and text mining is that, in text mining, the patterns are extracted from natural language text rather than structured databases of facts.

Handling unstructured data is not an easy task, as such data has several distinctive characteristics 1 :

1. It has no specified format, for example .txt, .html, .xml, .pdf, .doc

2. It has variable length; for example, one record might contain a phrase of a few words or sentences, or an academic paper of many pages

3. It may contain variable spelling, such as misspellings, abbreviations and singular versus plural words

4. The data may contain punctuation and other non-alphanumeric characters, such as periods, question marks, dashes, equal signs and quotation marks

5. The contents are not predefined and do not have to adhere to a predefined set of values. Even when restricted to one discipline, such as actuarial science, a paper can be on a variety of topics, such as ratemaking, reserving, reinsurance, enterprise risk management or data mining.

1 http://www.casact.org/pubs/forum/10spforum/Francis_Flynn.PDF

Text mining is the discovery, via computer, of new, previously unknown information by automatic extraction of information from unstructured text. It can be used to augment existing data in corporate databases by making unstructured text data available for analysis. Text mining deals with the machine-supported analysis of text by using techniques from information retrieval, information extraction and natural language processing (NLP), and connects them with the algorithms and methods of KDD, data mining, machine learning and statistics. Typical text mining tasks include text categorisation, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarisation and entity relation modelling. There are two main phases to text mining: (1) pre-processing and integration of unstructured data and (2) statistical analysis of the pre-processed data to extract content from the text.
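As a rough illustration of the pre-processing phase mentioned above, the sketch below tokenises raw snippets, removes stop words and builds TF-IDF weight vectors. The token pattern and the tiny stop-word list are simplifying assumptions, not the pipeline actually used in this thesis.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "on", "to", "is"}

def tokenise(text):
    """Lower-case the text and keep alphanumeric tokens that are not stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    token_lists = [tokenise(d) for d in documents]
    n_docs = len(documents)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors

snippets = [
    "Efficient processing of XML twig patterns",
    "A holistic approach to XML query processing",
]
print(tfidf_vectors(snippets))
```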

One of the important pre-processing steps in text mining is data cleansing, also called data cleaning or scrubbing, which deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data [RD00].

It deals with data quality problems in single data collections (e.g., due to misspellings during data entry, missing information or other invalid data) and in multiple data sources that need to be integrated (e.g., data warehouses and global Web-based information systems).

The actual process of data cleaning may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

In general, it involves several phases [RD00]: (1) detecting the types of errors and inconsistencies to be removed, (2) defining the data transformation and cleaning steps based on the degree of heterogeneity and the dirtiness of the data, (3) evaluating the correctness and effectiveness of the transformation definitions, (4) executing the transformation steps and (5) replacing the dirty data in the original sources with the cleaned data.

The well-known methods in data cleaning [Mül05] include: (1) parsing, which performs the detection of syntax errors; (2) data transformation, which allows the mapping of the data from its given format into the format expected by the appropriate application; (3) duplicate elimination, which requires an algorithm for determining whether data contains duplicate representations of the same entity; and (4) statistical methods, which analyse the data using various statistical functions or clustering algorithms and find values that are unexpected and thus erroneous.
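The duplicate-elimination step can be illustrated with a hedged sketch: Python's standard difflib is used here as a stand-in similarity function, and the 0.85 threshold and sample records are illustrative assumptions only.

```python
from difflib import SequenceMatcher
from itertools import combinations

def likely_duplicates(records, threshold=0.85):
    """Return pairs of records whose character-level similarity exceeds the threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs

records = [
    "Efficient processing of XML twig pattern: a novel one-phase holistic solution",
    "Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution",
    "Learning top-k transformation rules",
]
print(likely_duplicates(records))
```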

One of the fundamental problems in data cleaning is determining when two tuples refer to the same real-world entity [BG04]. Data often contains errors, for example, typographical errors, or has multiple representations, such as abbreviations; thus, an exact comparison does not suffice for detecting duplicates in these cases.

Record linkage refers to the task of finding records in a dataset that refer to the same entity across different data sources (e.g., data files, books, websites and databases). It is a form of data cleaning that identifies equivalent but textually distinct items in the extracted data prior to mining. Record or data linkage techniques are used to link records that relate to the same entity (e.g., patient, customer or household) in one or more datasets, where a unique identifier for each entity is not available. Hence, the record linkage problem of identifying and linking duplicate records that arises in the context of data cleaning is a necessary first step for many database applications.

Another important application in the field of text mining is clustering, which is often one of the first steps in text-mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships.

Clustering can be considered the most important unsupervised learning problem that deals with finding a structure in a collection of unlabelled data. It is a convenient method for identifying homogeneous groups of objects called clusters. Clustering can be defined as the process of organising objects into groups in which the members are similar in some way. Objects in a specific cluster intuitively share many characteristics but are dissimilar to objects not belonging to that cluster. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data pre-processing step to identify homogeneous groups on which to build supervised models.
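As a minimal, generic illustration of clustering (not the algorithm proposed later in this thesis), the sketch below groups a few publication titles by their TF-IDF vectors using k-means; it assumes scikit-learn is available, and the titles and cluster count are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Hierarchical clustering of web search results",
    "Snippet clustering with domain knowledge",
    "Learning transformation rules for record linkage",
    "Top-k rule learning from coreferent records",
]

# Vectorise the titles and group them into two clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(dict(zip(titles, labels)))
```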

1.1 Bibliography Data

One of the main sources of text data that requires text mining techniques (both record linkage and clustering) is bibliography data. Advances in computational resources and the communications infrastructure, as well as the rapid rise of the World Wide Web, have led to the increasingly widespread availability of published papers in electronic form. These registered publications usually contain citations to previous work, and indices of these citations are valuable for literature searches, analyses and evaluations. Bibliography Digital Libraries, such as DBLP, CiteSeer and MEDLINE, contain a large number of citation records in different disciplines. Digital libraries strive to enrich documents by examining their content, extracting information and using it to enhance the ways they can be located and presented [WDDT04]. Such digital libraries have been an important resource for academic communities, since scholars often search for relevant works. Researchers also use citation records to measure the publications' impact in the research community. In addition, citations are often used when users search for articles of interest.

With the electronic availability of scholarly documents, electronic database searching has become the de facto mode of information retrieval, and several bibliography databases have been established that automatically extract bibliography information and cited references from electronic documents retrieved from digital archives and repositories. In recent years, two multi-disciplinary databases, Scopus from Elsevier and Google Scholar from Google Inc., have attracted much attention.

Google Scholar is a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines. It provides a simple search interface displaying the results as a listing of 10–100 items per page. Each retrieved article is represented by title, authors and source. Under each retrieved article, the number of citing articles is noted and can be retrieved by clicking on the relevant link. Google Scholar is an important service for those that do not have access to expensive multidisciplinary databases such as the Thomson-Reuters Scientific Citation Index or Scopus. This is potentially a powerful new tool for citation analysis, particularly in subject areas that have experienced rapid changes in communication and publishing.

Scopus, officially named SciVerse Scopus, is a bibliography database that contains abstracts and citations for academic journal articles. It offers powerful browsing, searching, sorting and saving functions, as well as exporting to citation management software. The search results in Scopus can be displayed as a listing of 20–200 items per page. The results can also be refined by source title, author name, year of publication, document type and/or subject area, and a new search can be initiated within the results. Scopus also offers Related Documents, which returns a list of documents that share cited references with the currently selected document. Scopus Citation Tracker further enhances citation analysis by enabling citations to be viewed by year, providing users with a powerful way to explore citation data over time. Both Scopus and Google Scholar cover peer-reviewed journals and conferences in the scientific, technical, medical and social science areas.

With the increasing need for citation information, it is important to keep the citations of bibliography databases consistent and up-to-date. However, due to data entry errors, imperfect citation gathering software and common author names, bibliography databases often have many errors in their citation collections. There are two important problems that degrade the quality of these digital libraries: Mixed Citations (MC), where homonyms are mixed together, and Split Citation (SC), where citations of the same author appear under different name variants [LOKP05]. Concurrently, text mining is becoming an increasingly well-understood method to deal with such problems.

1.2 Current Problems

This thesis provides new techniques to deal with three different data quality problems in real-life bibliography data. The following paragraphs briefly discuss these problems.

First, one of the major problems in dealing with data quality issues in bibliography data is the search results from citation-enhanced search engines such as Google Scholar. Given a user query, search engines strive to return a list of the most relevant results in the form of snippets, with each snippet represented by a title and a short description. There are two fundamental problems with this approach. First, users' queries are often ambiguous and their search goals are often narrower in scope than the queries used to express them. For example, indexing has different meanings in different contexts, such as economics, mathematics and databases. Second, when the search keyword is too general, results describing different facets of the general topic are mixed up in the query results. As a result, users are forced to sift through the list sequentially to find the required information. For example, for a computer science programmer searching for articles on the Cobra Programming Language, Google Scholar returns a long, mixed list of articles for Cobra from various disciplines, including biology, chemistry, medicine, physics and computer science, along with articles from authors with the surname Cobra. Only one article related to the Cobra Programming Language is returned from the top 20 results using Google Scholar 2 .

2 http://scholar.google.com.au/scholar?hl=en&num=20&q=cobra&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0 (accessed 25/2/12)

A possible solution to this problem is to cluster search results into different groups, thus enabling users to identify their required group at a glance. Recently, Web snippet clustering has gained popularity to help users in searching the Web by processing only the snippets instead of the entire long description of Web page content. It has thus become a challenging variant of classical clustering by reflecting the potentially unbounded themes in the snippets returned by the search engine [FG08]. The work in [ZE99] introduces an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labelled by phrases extracted from the snippets. The work in [ZEMK97, ZE98] proposes the Suffix Tree Clustering algorithm and compares the approach with algorithms such as k-means and Group Average Agglomerative Hierarchical Clustering (GAHAC). In [LC03], a statistical model is built on the background knowledge and topical terms are then extracted to generate multi-level summaries of the Web snippets returned by the search engine. The hierarchical Web snippet clustering method proposed in [FG08] generates hierarchical labels by constructing a sequence of labelled and weighted bipartite graphs representing the individual snippets on one side and a set of labels on the other side. Web snippet clustering in [OW04, ZD04] uses SVD on a term-document matrix to find meaningful long labels. The study [GPPS06] develops a meta-search engine that groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters. In their work, clustering is performed by means of a fast version of the furthest-point-first (FPF) algorithm for metric k-centre clustering. Cluster labels are obtained by combining intra-cluster and inter-cluster term extraction based on the information gain measure.

This thesis discusses the problem of automatically clustering search results returned by Google Scholar. Currently, its search results comprise a linear list of publications represented by snippets, and the only way to sort them is by date. The proposed publication snippet clustering system will benefit researchers by helping them to quickly find the group of publications related to a particular area or topic, or to find alternative query keywords. Here, two key issues are identified: the need for a good similarity function to reflect the similarities between research publications, and the need for a clustering algorithm that is robust to short documents and noise. These issues are tackled by the utilisation of domain information in proposing a new similarity function, and the implementation of an outlier-conscious algorithm in generating the required number of clusters. Cluster labels are also generated to facilitate user browsing through the clusters. Experimental results confirm that the proposed clustering method is superior to prior approaches.

The second challenge of dealing with data quality issues is the identification of multiple records referring to the same entity, even if they are not bit-wise identical.

There are, for example, dozens of correct ways to cite a publication, depending on the bibliography style one uses. The following citations refer to the same publication, although they are textually far apart:

[Zhang, W. & Yu, C. T. A necessary condition for a doubly recur- sive rule to be equivalent to a linear recursive rule. In SIGMOD Conference, 345–356 (1987) ]

[ Weining Zhang & C. T. Yu. A necessary condition for a doubly recursive rule to be equivalent to a linear recursive rule. In Pro- ceedings of Association for Computing Machinery Special Inter- est Group on Management of Data 1987 annual conference,San Francisco,May 27–29,1987, pages 345–356, 1987]

Again, real data is inevitably noisy, inconsistent or contains errors (e.g., typographical errors and Optical Character Recognition (OCR) errors), which result in the same publication record appearing as different entities in a database or Web search. For example, the errors introduced when the citation was extracted by a program from the Web returned two occurrences of the same publication 3 :

Efficient processing of XML twig pattern: A novel one-phase holistic solution

[CITATION] Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution
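The two occurrences above differ only in capitalisation and a bracketed marker, so even a very simple normalisation step makes them compare equal. The sketch below assumes that case, punctuation and markers such as [CITATION] carry no information for matching.

```python
import re

def normalise_title(title):
    """Lower-case, drop bracketed markers such as [CITATION], and collapse punctuation and whitespace."""
    title = re.sub(r"\[[^\]]*\]", " ", title)       # remove [CITATION]-style markers
    title = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return " ".join(title.split())

a = "Efficient processing of XML twig pattern: A novel one-phase holistic solution"
b = "[CITATION] Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution"
print(normalise_title(a) == normalise_title(b))  # True
```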

Existing record linkage approaches mainly rely on similarity functions based on the surface forms of the records; hence, they are not able to identify complex coreferent records. This seriously limits the effectiveness of existing approaches, which fall into two broad categories. The first category is to learn a good similarity function using machine learning techniques [TKM02, CR02, BM03a]. The study [BM03c] shows that trainable similarity measures are capable of learning the specific notion of similarity that is appropriate for a specific domain. While this approach focuses on homogeneous string-based transformations, the approach in [MNK+05] uses a heterogeneous set of models to relate complex domain-specific relationships between two values. The work in [SB02a] employs active learning techniques to minimise the interaction with users, and is recently improved by [AGK10]. Another category of approaches is to model complex or domain-specific transformation rules [MK09, ACK09]. The learned rules can be used to identify more coreferent pairs [ACK08, ACGK08]. Although only a few works in record linkage focus on transformation rules, they have been widely employed in many other areas, and automatic rule learning algorithms have been developed accordingly. In Natural Language Processing (NLP), the study [Tur02] finds sets of synonyms by considering word co-occurrences. Later, the work [PKM03] utilises a similar idea to identify paraphrases and grammatical sentences by looking at the co-occurrence of sets of words. Recently, researchers have used transformation rules to deduplicate URLs without even fetching the content [BYKS09a, DKS08]. However, the rules and their discovery algorithms are heavily tailored for URLs.

3 http://scholar.google.com.au/scholar?hl=en&num=100&q=Efficient+processing+of+XML+twig+pattern%3A+A+novel+one-phase+holistic+solution&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0 (accessed 25/2/12)

This thesis focuses on record linkage approaches for complex coreferent records that have little surface similarity (e.g., '23rd' and 'twenty-third', 'Robert' and 'Bob', 'VLDB' and 'Very Large Databases'). It proposes an automatic method to extract top-k high quality transformation rules given a set of possibly coreferent record pairs. The transformation rules are then incorporated to recognise these domain-specific equivalence relationships. This is obtained by proposing an effective and efficient top-k transformation rule learning algorithm. The algorithm performs careful local analysis for each record pair and generates candidate rules. Finally, the top-k rules are selected based on a scoring function. Extensive experiments performed on several publicly available real-world datasets demonstrate its advantage over the previous state-of-the-art algorithm in both effectiveness and efficiency.
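The following is a deliberately simplified sketch of the rule-learning idea, not the local-alignment algorithm of Chapter 3: each possibly coreferent pair is tokenised, unmatched token segments are paired up as candidate rules, and candidates are scored by how many input pairs support them. The tokenisation, the alignment via difflib and the support-count score are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

def candidate_rules(pair):
    """Align two tokenised records and emit (lhs, rhs) pairs for the unmatched segments."""
    left, right = pair[0].lower().split(), pair[1].lower().split()
    matcher = SequenceMatcher(None, left, right)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield (" ".join(left[i1:i2]), " ".join(right[j1:j2]))

def top_k_rules(pairs, k):
    """Score candidate rules by how many record pairs support them and keep the top k."""
    counts = Counter(rule for pair in pairs for rule in candidate_rules(pair))
    return counts.most_common(k)

pairs = [
    ("Proc. of the 22nd VLDB Conference", "Proceedings of the 22nd Very Large Data Bases Conference"),
    ("VLDB 1996", "Very Large Data Bases 1996"),
]
print(top_k_rules(pairs, 3))
```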

Another challenge that arises in cleaning bibliography data is performing an accurate and acceptable citation analysis of academics and researchers. In recent years, several database producers have noticed the potential of citation indexing, and several bibliography databases have been established that automatically extract bibliography information and cited references from electronic documents retrieved from digital archives and repositories. Among these databases, Google Scholar has gained popularity because it is free and has broad coverage. It automatically extracts the bibliography data from the reference sections of documents (mostly in PDF and PS formats) and determines citation counts for papers in its collections as well as for citations where the document is not available. The search interface of Google Scholar is simple and easy to use. The publications in the query result are typically ranked according to oft-cited and highly relevant articles [RT05]. Under each retrieved article, the number of citing articles is noted and can be retrieved by clicking on the relevant link.

A major disadvantage of Google Scholar is that its records are retrieved in a way that is impractical for use with large sets, requiring a tedious process of manually cleaning, organising and classifying the information into meaningful and usable formats. As discussed previously, the citation results are in the form of snippets, with each snippet represented by a title and a short description. The search results also suffer from the limitations of the automatic extraction of references. For instance, Google Scholar frequently has several entries for the same paper due to misspelled author names, different ordering of authors etc. Conversely, it may group together citations of different papers, such as the journal and conference versions of a paper with the same or similar title. For example, for the article "A primitive operator for similarity joins in data cleaning", the returned list of citations from Google Scholar contains two versions (conference and pre-print) of Benchmarking declarative approximate selection predicates. There are also two versions of Efficient algorithms for approximate member extraction using signature-based inverted lists, the second being Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists, which differs from the first in the capitalisation of the title (lowercase to uppercase).

As a result, to conduct citation analysis, the existing work in the literature had to manually clean these errors and classify the citation information according to proper formats (i.e., fields such as author, venue and source), which is a tedious and time-consuming task [BB05, KT06, MY07, VS08, FPMP08, LBLB10, Fra10, GP10]. To ease this manual processing job, this thesis aims to develop a semi-automatic tool that performs extensive data cleaning to deal with errors and presents the citations in a suitable format to conduct the citation analysis. It performs the citation analysis of researchers in the field of computer science, which is a relatively new field of study where conference papers are considered a more important form of publication than is generally the case in other scientific disciplines, and which has also been less studied in the literature. The results are compared with the results from the most widely used subscription-based citation database, Scopus. Extensive experiments on various bibliometric indexes of a collection of authors in computer science underline the usefulness of Google Scholar for scholars conducting citation analysis, highlighting its broader international effect on the quality of the publication.
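Chapter 4 reports several bibliometric indexes; as a small worked example, the sketch below computes the h-index, g-index and hg-index from a list of per-paper citation counts using their usual definitions (h: the largest h such that h papers each have at least h citations; g: the largest g such that the top g papers together have at least g^2 citations; hg = sqrt(h*g)). The citation counts are invented for illustration.

```python
import math

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    counts = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(counts, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

citations = [42, 18, 12, 7, 5, 3, 1]            # illustrative citation counts
h, g = h_index(citations), g_index(citations)
print(h, g, round(math.sqrt(h * g), 2))         # h-index, g-index, hg-index
```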

1.3 Outline

The outline of the thesis is discussed below.

Chapter 2 focuses on various snippet-clustering methods and proposes a new similarity function to cluster citations returned from the publicly available Google Scholar search engine.

This chapter is based on the work published in DBKDA 2009: Sunanda Patro and Wei Wang, Effective Snippet Clustering with Domain Knowledge, in DBKDA 2009: First International Conference on Advances in Databases, Knowledge, and Data Applications, pp. 44-49, March 1-6, 2009, Cancun, Mexico. (Best Paper Award)

Chapter 3 reviews various record linkage methods and proposes a rule learning algorithm to find citation records in a dataset that refer to the same entity across different data sources.

This chapter is based on the work published in DEXA 2011: Sunanda Patro and Wei Wang, Learning Top-k Transformation Rules, in DEXA 2011: 22nd International Conference on Database and Expert Systems Applications, August 29 - September 2, 2011, Toulouse, France.

Chapter 4 discusses data cleaning issues and develops a semi-automatic tool to handle the errors and limitations of the automatic extraction of references from Google Scholar's search results. Further, it presents studies dedicated to the analysis and comparison of the performance of two widely used tools (Google Scholar and Scopus) for citation searches.

Chapter 5 provides the conclusions and future work.

Chapter 2

Effective Snippet Clustering with Domain Knowledge

2.1 Introduction

With the phenomenal increase in the amount of information available on the World Wide Web, search engines have become an indispensable tool for users to obtain desired information. Given a user query, search engines strive to return a list of the most relevant results in the form of snippets, with each snippet represented by a title and a short description. There are two fundamental problems with this approach. First, users' queries are often ambiguous and their search goals are often narrower in scope than the queries used to express them. For example, the query indexing has different meanings in different contexts, such as economics, mathematics and databases. Second, when the search keyword is too general, results describing different facets of the general topic are mixed up in the query result. As a result, users are forced to sift through the list sequentially to find the required information.

A promising technique to address the above problem is to organise the search results into clusters of semantically related pages so the user can quickly view the entire result set and use clusters themselves to filter the results or refine the query [CCG+06, Hea06, ZD04].

This work tackles the problem of automatically clustering search results returned by Google Scholar 1 , a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines. Currently, Google Scholar's search results comprise a linear list of publications, each represented by a snippet, and the only way to sort them is by date.

1 http://scholar.google.com

The proposed publication snippet clustering system will benefit researchers by helping them to quickly find the group of publications related to a particular area or topic, or to find alternative query keywords.

Although there is much work in both Web page clustering and snippet clustering, existing approaches are not immediately applicable to the proposed system due to the unique characteristics of the problem. First, the system has to take only snippets as input. The full text of documents is not always available, for example, due to a lack of access to a particular publication repository, an invalid URL or files in different formats (for example, scanned images from Google Books). Second, snippets are usually short. In fact, the system has to rely heavily on the title of each snippet, as there is often no description in the snippet (for example, a citation to an old publication). Third, the quality of snippets is far from perfect. Due to the nature of automatic extraction, snippets may contain errors or may be incomplete, and descriptions of snippets could be empty or meaningless. Fourth, domain knowledge is usually required, even by humans, to assess the similarity between two publications. A publication about decision trees and another about the naïve Bayes classifier are unrelated literally, but are relevant to domain experts. Traditional Web document clustering methods either require extra information (e.g., the entire Web page or hyperlinks) or are mainly designed for long documents [CD07]. While existing works on snippet clustering are more relevant to the proposed system, they are found to perform inadequately in the current setting due to the abundance of noise in the input snippets and the lack of domain knowledge to accurately assess the similarity of two research publications.

Based on the above analysis, two key issues are identified: (1) the need for a good similarity function to reflect the similarities between research publications and (2) the need for a clustering algorithm that is robust to short documents and noise.

To tackle the first issue, the current work proposes to extract useful domain knowledge from the DBLP database. It then defines a semantically meaningful similarity function that leverages both the cosine-based document similarity and a semantic similarity function based on the closeness of concept phrases appearing in the snippets. To tackle the second issue, in the online clustering phase, it takes as inputs only the title and the description of snippets, and runs an outlier-conscious hierarchical clustering algorithm based on the similarity matrix formed by the newly proposed similarity function. It also generates cluster labels to facilitate user browsing through the clusters. Experimental results confirm that the proposed clustering method is superior to prior approaches.
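A hedged sketch of the kind of combined similarity just described: cosine similarity over sparse TF-IDF vectors is blended with a concept-overlap score. The 0.5 mixing weight, the Jaccard-style concept overlap and the toy inputs are illustrative assumptions rather than the function defined in Section 2.3.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
           math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / norm if norm else 0.0

def concept_similarity(concepts_a, concepts_b):
    """Jaccard overlap of the domain concepts detected in two snippets."""
    if not concepts_a or not concepts_b:
        return 0.0
    return len(concepts_a & concepts_b) / len(concepts_a | concepts_b)

def snippet_similarity(vec_a, vec_b, concepts_a, concepts_b, alpha=0.5):
    """Blend textual and concept-level similarity with mixing weight alpha."""
    return alpha * cosine(vec_a, vec_b) + (1 - alpha) * concept_similarity(concepts_a, concepts_b)

# Two snippets that share a domain concept but no words.
sim = snippet_similarity({"decision": 1.0, "tree": 1.0}, {"naive": 1.0, "bayes": 1.0},
                         {"classification"}, {"classification"})
print(sim)  # 0.5: no word overlap, full concept overlap
```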

This chapter is organised as follows: Section 2.2 briefly introduces related work; Section 2.3 discusses the proposed snippet mining system; Section 2.4 presents the experimental results and Section 2.5 concludes the chapter.

2.2 Literature Review

Various approaches to text clustering have been developed [JMF99, KP04, XI05]. Typically, clustering approaches can be categorised as agglomerative or partitional (based on the underlying methodology of the algorithm), or as hierarchical (hierarchy of clusters) or flat (flat set of clusters), based on the structure of the final solution [ZZHM04].

Based on the above categories, this section discusses the related work for various types of snippet clustering algorithms.

2.2.1 Flat Clustering

Flat clustering can be further classified as word-based clustering, where common words that are shared among documents act as features, and term-based clustering, where sentences of variable lengths are used as features to cluster the text snippets.

2.2.1.1 Word-based Clustering

To improve the quality and interpretation of Web search results, the system [WK02] combines the retrieved snippet content with the link information in the anchor text, meta-content and anchor window of the in-links. Combining links and content analysis, it resolves many of the shortcomings of previous approaches and extends the standard k-means algorithm to make it more suitable to clustering in the Web domain.

Following the work of [HN02], the paper [NN05] investigates how rough set theory and its ideas, such as approximations of concepts, could be practically applied in the task of search result clustering. Tolerance classes are used to approximate concepts existing in documents and to enrich the vector representation of snippets. Sets of documents sharing similar concepts are grouped together to form clusters, and using a special heuristic, concise and intelligible cluster labels are derived from the tolerance classes. The experimental results show that the Tolerance Rough Set and upper approximation can have positive effects on clustering quality.

The work [KJNY01] gives new relational fuzzy clustering algorithms (FCMdd and RFCMdd) based on the idea of medoids, and presents several applications of these algorithms to Web mining, including Web document clustering, snippet clustering and Web access log analysis. Further, to enhance the performance of these algorithms, in [JJKY], the authors introduce n-gram and vector space methods to generate the (dis)similarity distance matrix. The results of their experiments show that the n-gram based approach performs better than the vector space approach and the Suffix Tree Clustering (STC) algorithm developed by [ZE99].

The paper [GPMS06] describes a meta-search engine called Armil, which groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters. With no external sources of knowledge, clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-centre clustering, and cluster labelling is achieved by combining intra- and inter-cluster term extractions based on a variant of the information gain measure. Using a comprehensive set of snippets obtained from the Open Directory Project hierarchy as a benchmark, Armil achieves better performance than Vivisimo, the de facto industrial standard in Web snippet clustering.

The article [GPPS06] proposes a much faster variant of FPF [Gon85] based on a filtering step that exploits the triangular inequality and shows its suitability for Web snippet clustering. While being far more efficient to run, the algorithm seems to perform with accuracy comparable to the strong baselines consisting of fast variants of the classical k-means iterative algorithm. It also shows that higher efficiency can be obtained by using metrics that exploit the internal structure of the snippets.

2.2.1.2 Term-based Clustering

The paper [ZE98] introduces an incremental linear-time clustering algorithm, STC, which first identifies sets of documents that share common phrases and then creates clusters according to these phrases. It evaluates the effectiveness of clustering search results by using both search snippets and entire documents. The results show that clustering methods using snippets outperform methods using entire documents. Inspired by this work, the Carrot framework was created in [Wei01] to facilitate research on clustering search results. Grouper [ZE99], which is an interface to the results of the HuskySearch meta-search engine, uses the STC algorithm introduced in [ZE98] to extract phrases from snippets and dynamically groups the search results into clusters labelled according to the phrases.

The work [WS03] presents conclusions drawn from an experimental application of STC to documents in Polish, indicating fragile areas where the algorithm seems to fail due to specific properties of the input data. It indicates that the characteristics of the produced clusters (number and value), unlike for English documents, strongly depend on the pre-processing phase. It also attempts to investigate the influence of two primary STC parameters, the merge threshold and the minimum base cluster score, on the number and quality of results. It also introduces two approaches to efficient, approximate stemming of Polish words: a quasi-stemmer and an automaton-based approach.

Lingo [OSW04] emphasises cluster description quality by using algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. The paper [OW04] highlights the strengths and weaknesses of Lingo and compares the clusters acquired from Lingo to the search results of STC [ZE99].

An improved Lingo [OSW04] algorithm, Suffix Array Similarity Clustering (SASC), is presented in [BZZM10]. This method creates clusters by adopting an improved suffix array that ignores redundant suffixes and computes document similarity based on the title and short document snippets returned by Web search engines. Experiments show that the SASC algorithm outperforms Lingo not only in time consumption, but also in cluster description quality and precision when compared to STC.

The paper [ZHC+04] reformulates the clustering problem as a salient phrase ranking problem. The candidate phrase extraction process is similar to STC [ZE98], but it further calculates several important properties to identify salient phrases and utilises learning methods to rank these salient phrases. Given a query and the ranked list of titles and snippets, the method first extracts and ranks salient phrases as candidate cluster names based on a regression model learned from human-labelled training data. The documents are then assigned to relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters.

Based on a new suffix tree and a new base cluster combining algorithm with a new partial phrase join operation [JK06], the proposed approach is suitable for on-the-fly Web search result clustering and cluster labelling. The experimental results show that this approach provides more readable and true common phrases for Web document clusters, and performs better than conventional Web search result clustering.

The article [WMH+08] proposes a variant of STC for Web search result clustering and labelling. It is an incremental and linear-time algorithm with significantly lower memory requirements. It is claimed that it generates more readable labels than STC, as it inserts into the suffix tree more true common phrases and joins partial phrases to construct true common phrases. Experimental results show that the proposed new approach performs better than conventional Web search result clustering.

The paper [WZP+08] proposes a new method for clustering Web search results based on an interactive suffix tree clustering algorithm (ISTC). The phrases extracted from the snippets are used as the characteristics for clustering. In the course of interaction with users, it only returns cluster labels to users in the first tier. When users want to have further interaction, they can select a document that they are interested in for a second clustering instead of the traditional recursive clustering. ISTC can also be applied to Chinese and English information processing; it avoids the recursive algorithm, achieving linear-time behaviour and improving the efficiency of search engines. Experimental results verify the feasibility and effectiveness of this method.

The study [KPT09] introduces a variation of STC, called STC+, with a scoring formula that favours phrases that occur in document titles, and a novel suffix tree based algorithm called NM-STC, which results in hierarchically organised clusters. The comparative user evaluation shows that both STC+ and NM-STC are significantly preferred to STC, and that NM-STC is about two times faster than STC and STC+.

Using semantic information for clustering Web snippets, the work [WHL09] proposes an improved Web search result clustering algorithm based on STC. It uses a latent semantic indexing method to assist in finding common descriptive and meaningful topic phrases for the final document clusters. This makes the search engine results easy to browse and helps users quickly find Web information. Evaluation of the experimental results demonstrates that clustering Web search results based on the proposed improved suffix tree algorithm has better performance in cluster label quality and snippet assignment precision.

The study [WXC09] proposes a more effective Web snippet clustering algorithm that combines the advantages of the vector space model (VSM) and STC document models. The proposed improved suffix tree algorithm takes into account the semantic information of candidate label phrases and offers descriptive, readable and conceptual topic labels for the final document groups. Evaluation of the results demonstrates that this algorithm performs better in making search engine results easy to browse and helping users to quickly find Web pages in which they are interested.

2.2.2 Hierarchical Clustering

In [LC03], a statistical model is built on background knowledge and topical terms are then extracted to generate multi-level summaries of the Web snippets returned by the search engine. It shows that the terms selected to be part of the hierarchy are better summary terms than the top TF.IDF terms, and that the hierarchy provides users with more access to the documents retrieved than by using a ranked list alone.

In an attempt to overcome the shortcomings of STC [ZE98], the study [ZD04] proposes a Semantic, Hierarchical, Online Clustering (SHOC) approach to automatically organise Web search results into groups. Combining the power of two novel techniques (key phrase discovery and orthogonal clustering), SHOC not only generates clusters that are both reasonable and readable, but also works for multiple languages, including English and oriental languages such as Chinese.

DisCover [KLR+04], a hierarchical monothetic clustering algorithm, builds a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, it progressively identifies topics in a way that maximises the coverage while maintaining the distinctiveness of the topics. The user studies show that the proposed algorithm generates better hierarchies and is superior to other algorithms as a summarising and browsing tool.

SnakeT [FG08], the first complete and open-source system in the literature, offers both hierarchical clustering and folder labelling with variable-length sentences, and achieves efficiency and efficacy performance close to the best known engine Vivisimo. The system produces a hierarchy of labelled clusters by constructing a sequence of labelled and weighted bipartite graphs representing the individual snippets on one side and a set of labelled clusters on the other side. It emphasises the accuracy of the labels rather than that of the clusters.

Based on the fact that similar Web snippets share a small number of phrases, the paper [LW10] introduces a new method for hierarchical clustering of Web snippets by exploiting a phrase-based document index. In this method, a hierarchy of Web snippets is built based on phrases instead of all snippets, and the snippets are then assigned to the corresponding clusters consisting of phrases. Experiments show that this method outperforms the traditional hierarchical clustering algorithm.

Using an overlapping and hierarchical clustering method, the work [JK] proposes a search result clustering system called Cluiser. The system finds groups of Web pages along with representative tag words and is found to be most effective for finding the meaning of unknown keywords. It produces more diverse clusters compared with previous approaches.

WISE [CDN06] explores Web page hierarchical soft clustering as an alternative method of organising search results. It is based on (1) an algorithm ignoring less relevant documents and adding relevant documents; (2) statistical phrase extraction to define concepts; (3) Web content mining techniques to semantically represent the Web pages and (4) an overlapping clustering algorithm to organise results into a hierarchy of concepts and a classical labelling process. Experimental results demonstrate the correctness of the clusters, the quality of the labels, concept disambiguation and language-independence. However, it still needs formal evaluation.

2.2.2.1 Graph-based Approach

The approach in [LY06] uses tokens as basic units for clustering, which avoids segmentation for oriental languages, and can be applied to any language. It introduces a Directed Probability Graph (DPG) model that identifies meaningful phrases as cluster labels using statistical methods without any external knowledge. The clustering procedure is performed without calculating the similarity between pair-wise documents, and the experiments show that the proposed clustering algorithm is very efficient and suitable for online Web snippet clustering.

A new clustering strategy, called TermCut, is presented in [NQL+11, NLQ+09] to cluster short text snippets by finding core terms in the corpus. The motivation of this method is to cluster short text snippets according to the core terms rather than the similarity between text snippets. In this method, a collection of short text snippets is modelled as a graph in which each vertex represents a piece of a short text snippet and each weighted edge between two vertices measures the relationship between the two vertices. The algorithm then recursively selects a core term and bisects the graph such that the short text snippets in one part of the graph contain the term, whereas the snippets in the other part do not. It is applied to different types of short text snippets, including questions and search results, and the experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.

The paper [NC10] presents a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, referred to as Word Sense Induction (WSI). The method first acquires the senses (i.e. meanings) of a query by means of a graph based clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. It then clusters the search results based on their semantic similarity to the induced word senses. The experiments, conducted on datasets of ambiguous queries, show that this method improves search result clustering in terms of both clustering quality and degree of diversification.

Based on probabilistic latent semantic analysis, the framework in [NPH+09] collects a large external data collection called the universal dataset and then builds a clustering system on both the original snippets and a rich set of hidden topics discovered in the universal data collection. The main motivation is that, by enriching short snippets with hidden topics from large resources of documents on the Web, it is able to cluster and label such snippets effectively in a topic-oriented manner without considering entire Web pages. Careful evaluation shows that this method can yield impressive clustering quality.

The research [ZD08] combines Web snippet categorisation, clustering and personalisation techniques to recommend relevant results to users. Using a socially constructed Web directory, such as the Open Directory Project (ODP), it develops a Recommender Intelligent Browser (RIB). By comparing the similarities between the semantics of each ODP category represented by the category-documents and the Web snippets, the Web snippets are organised into a hierarchy. Based on an automatically formed user profile, which takes into consideration desktop computer information and concept drift, the proposed search strategy recommends relevant search results to users. This research also intends to verify text categorisation, clustering and feature selection algorithms in the context where only Web snippets are available.

2.2.3 Multiple Clustering Approaches

The study [BRG07] proposes a method of improving the accuracy of clustering short texts by enriching their representation with additional features from Wikipedia. Given a short text snippet, it creates two query strings, from the title and the description of the text respectively, and uses the two strings as queries to retrieve the top 20 Wikipedia articles from a Lucene index (http://lucene.apache.org/). The titles of the Wikipedia articles are referred to as concepts, which are added into the feature space of the original short text snippet. It runs six different clustering algorithms provided by CLUTO (http://glaros.dtc.umn.edu/gkhome/views/cluto/). The results obtained indicate that this enriched representation of text items can substantially improve clustering accuracy compared to the conventional bag-of-words representation.

In order to improve the performance of Web snippet clustering, the work in [JHR09] applies the ODP to expand the original snippets with related conceptual terms. Using a test dataset of 240 queries, it performs experiments with two clustering techniques: k-means clustering as the non-overlapping approach and STC as the overlapping approach. With the proposed text enrichment method, k-means clustering yields an overall performance improvement of up to 15.51 per cent based on the F1 measure, while STC with text enrichment improves performance by up to 53.71 per cent.

Recent work [RKT11] compares various document clustering techniques, in- cluding k-means, an SVD based method and a graph based approach, and their performance on short text data collected from Twitter. It defines a measure for evaluating the cluster error with these techniques, and the experimental observa- tions show that graph based approach using affinity propagation performs best in clustering short text data with minimal cluster errors.

The work [BK10] proposes a hybrid approach to content clustering that combines the best of Web information retrieval methods and the personal preference information of users. It modifies the STC algorithm and uses a sentence-based approach that considers the relationship between the terms. Thus, it combines the best of the flat and hierarchical approaches in a hybrid manner for effective information retrieval. Experimental results show that this approach has great promise for a wide range of queries.

2.3 Snippet Miner

2.3.1 Overview

A similarity function is vital to the task of document clustering. It is observed that traditional similarity functions, such as cosine similarity, do not work well for snippets, as snippets are usually short and noisy. In this chapter, a new similarity function based on automatically mined domain knowledge is proposed to tackle the former challenge (shortness), together with an outlier-conscious hierarchical clustering algorithm to tackle the latter (noise).

Since the current work deals with computer science citations within Google Scholar, domain knowledge is obtained from another, manually edited bibliography database, DBLP (http://dblp.uni-trier.de/xml/). The proposed framework is based on the intuition that there are usually close associations between words or phrases used in different publication titles from the same authors. Such associations can be aggregated, and word/phrase associations that are supported by many authors are most likely to share the same semantic connotation.

The proposed algorithm has several steps:

Offline The DBLP database is processed to mine the domain knowledge, that is, similarities between pairs of concept phrases.

Online Search results returned by Google Scholar for users' queries are clustered as follows:

   1. retrieve the snippets for a given query from Google Scholar;
   2. extract concepts from each snippet;
   3. evaluate the newly proposed pair-wise similarities between snippets and store the result in a similarity matrix M; the similarity measure takes advantage of the domain knowledge obtained in the offline phase;
   4. run the outlier-conscious hierarchical clustering algorithm to obtain clusters, and generate a label for each cluster.

The rest of the sections describe the above steps in more detail.

2.3.2 Domain-knowledge Mining

This work focuses on automatic methods for mining concept associations from large-scale domain datasets. This is preferable because new terms and acronyms are continuously invented and used in computer science papers; it would be infeasible to manually build or maintain such a knowledge base.

DBLP has been selected as it is one of the most comprehensive computer science bibliography databases and all entries are manually created. For each publication in the DBLP database, the title and author information are extracted in the first step. In the second step, each title is parsed into a set of intermediate features, which are the longest contiguous segments of words in the paper title that do not contain any boundary words (such as stop words and other words like "using" and "based on"). Then, from each intermediate feature, all contiguous length-2 sub-phrases are extracted (intermediate features of one or two words are kept as they are) to form the final features, called concept phrases, or concepts for short. Standard stemming is then applied to the words in the concepts.

For example, the concepts extracted from the paper title “query expansion using local and global document analysis” are shown in Figure 2.1. Since “using” and “and” are boundary words, the intermediate features are “query expansion”, “local” and “global document analysis”. The last feature is further decomposed into “global document” and “document analysis”. The final set of features, or concepts, is marked in blue in Figure 2.1.

[Figure 2.1: Parsing a String into a Set of Concepts. Intermediate features: “query expansion”, “local”, “global document analysis”; final concepts: “query expansion”, “local”, “global document”, “document analysis”.]
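To make the parsing step concrete, the following sketch implements the concept-extraction procedure described above in Python. It is a minimal illustration rather than the thesis code: the boundary-word list (BOUNDARY_WORDS) and the crude suffix-stripping stemmer are hypothetical stand-ins for the stop-word list and the standard stemmer used in this work.

import re

# Hypothetical boundary-word list (stop words plus connectives such as "using", "based", "on").
BOUNDARY_WORDS = {"using", "based", "on", "and", "for", "of", "the", "a", "an", "with", "in", "to"}

def stem(word):
    # Crude suffix-stripping stand-in for a standard stemmer (e.g., Porter).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_concepts(title):
    """Parse a paper title into concept phrases of at most two words."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    # Split the title into intermediate features at boundary words.
    features, current = [], []
    for tok in tokens:
        if tok in BOUNDARY_WORDS:
            if current:
                features.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        features.append(current)
    # Keep short features as they are; decompose longer ones into contiguous length-2 sub-phrases.
    concepts = set()
    for feat in features:
        feat = [stem(w) for w in feat]
        if len(feat) <= 2:
            concepts.add(" ".join(feat))
        else:
            for i in range(len(feat) - 1):
                concepts.add(" ".join(feat[i:i + 2]))
    return concepts

print(extract_concepts("query expansion using local and global document analysis"))
# e.g. {'query expansion', 'local', 'global document', 'document analysi'} (the toy stemmer truncates "analysis")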

In order to uncover hidden associations among concepts, each concept is first mapped to a weighted vector. The set of all authors in the DBLP database is denoted as $A$. Each concept $c_i$ is represented as the following weighted vector:

$$c_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,|A|})$$

where $w_{i,j}$ records the weight of the $j$-th author with respect to concept $c_i$. A weighting scheme is devised that resembles the traditional tf-idf scheme in information retrieval:

$$w(c_i, a_j) = tf(c_i, a_j) \cdot \log \frac{N}{df(a_j)}$$

where $tf(c_i, a_j)$ is the number of times the concept $c_i$ is used by the author $a_j$, $N$ is the number of publications in the database and $df(a_j)$ is the number of publications of the author $a_j$.

Given the vector representation of each concept, the widely used cosine similarity is adopted to evaluate similarities between pairs of concepts; more formally:

$$sim(c_i, c_j) = \frac{\sum_{k=1}^{|A|} w_{i,k}\, w_{j,k}}{\sqrt{\sum_{k=1}^{|A|} w_{i,k}^2}\; \sqrt{\sum_{k=1}^{|A|} w_{j,k}^2}} \qquad (2.1)$$

where $c_i$ and $c_j$ are concepts and $w_{i,k}$ ($w_{j,k}$) is the weight of author $k$ in concept $c_i$ ($c_j$). This similarity function between concepts captures the following intuitions well: a pair of concepts used by many authors in common, or appearing in the same title, will bear a high similarity score; similarly, concept pairs that are not shared by any common authors will have zero similarity.
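As an illustration of the weighting scheme and Equation (2.1), the sketch below builds the author-weighted vector for each concept from (concept, author) usage counts and computes the cosine similarity between two concepts. The tiny in-memory publications list (pairs of concept sets and author lists) is invented sample data; a real run would iterate over the parsed DBLP titles.

import math
from collections import defaultdict

# Invented toy data: each record is (set of concepts in the title, list of authors).
publications = [
    ({"query expansion", "document analysis"}, ["a. smith", "b. jones"]),
    ({"query expansion", "query log"},         ["a. smith"]),
    ({"document analysis"},                    ["c. wu"]),
]

N = len(publications)                          # number of publications in the database
tf = defaultdict(lambda: defaultdict(int))     # tf[concept][author]
df = defaultdict(int)                          # df[author]: number of publications of the author

for concepts, authors in publications:
    for a in authors:
        df[a] += 1
        for c in concepts:
            tf[c][a] += 1

def concept_vector(c):
    # w(c_i, a_j) = tf(c_i, a_j) * log(N / df(a_j))
    return {a: tf[c][a] * math.log(N / df[a]) for a in tf[c]}

def concept_similarity(ci, cj):
    """Cosine similarity between two concepts over the author dimension (Equation 2.1)."""
    vi, vj = concept_vector(ci), concept_vector(cj)
    dot = sum(vi[a] * vj.get(a, 0.0) for a in vi)
    ni = math.sqrt(sum(w * w for w in vi.values()))
    nj = math.sqrt(sum(w * w for w in vj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

print(concept_similarity("query expansion", "query log"))
print(concept_similarity("query log", "document analysis"))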

2.3.3 Snippet Parsing

The user's query is submitted to Google Scholar and snippets are obtained by screen-scraping the returned HTML pages. Each snippet is parsed into a title and a description, which are then concatenated. Stop-word removal and stemming are then performed, followed by domain-specific data cleaning, including the removal of tags such as “[citation]” and the discarding of non-English words.

All the concepts contained in each snippet are then extracted using the same procedure as in the domain-knowledge mining.

2.3.4 Measuring Similarities between Snippets

In order to measure similarities between snippets while drawing on the results mined from the domain knowledge, a hybrid similarity measure is proposed: one part is the traditional cosine similarity, and the other is a new two-step similarity function based on a graph model that considers both direct and indirect associations between all concepts appearing in the snippets.

Direct Association Since each snippet is associated with a set of concepts, the similarity between a pair of snippets is measured based on the similarities between their sets of concepts. It should be noted that standard similarity functions for sets, for example Jaccard similarity, are not appropriate here, as they are ignorant of the semantic relationships between concepts. Therefore, it is proposed to measure the similarities based on a graph model. Given a pair of snippets, a weighted undirected graph G is generated, where nodes in G correspond to concepts appearing in the snippets. There exists an edge with weight d between two nodes if the concept similarity between them, according to Equation (2.1), is d (d > 0).

Figure 2.2 illustrates the graph between two snippets: s1: “query expansion using local and global document processing” and s2: “probabilistic query expan- sion using query logs”.

There are two types of nodes in the graph: (1) those from one snippet that are connected with a node from the other snippet (called active nodes), and (2) those that are not (called dead nodes). For example, in Figure 2.2, the nodes “document analysis” and “probabilistic query” are active nodes, while the node “local” is a dead node.

[Figure 2.2: An Example Direct Association Graph. The concept nodes of s1 (“query expansion”, “local”, “global document”, “document analysis”) and of s2 (“probabilistic query”, “query expansion”, “query log”) are connected by weighted edges where the concept similarity is non-zero; dashed edges connect each snippet node to its own concept nodes.]

Once the graph of concepts is generated, it is enhanced by adding the snippets as new nodes and connecting each snippet node to the concept nodes that it owns. To this end, the weights on such edges (the dashed edges in Figure 2.2) need to be determined. The weights are assigned in a probabilistic way: from a snippet node $s$, a small weight of $\alpha$ is assigned to each of its edges to dead concept nodes; the edge weights for the remaining active concept nodes $c_i$ are calculated as follows:

$$d(s, c_i) = \frac{idf(c_i)}{deg(s)} \cdot (1 - n_{dead} \cdot \alpha)$$

where $idf(c_i)$ is the inverted document frequency of node $c_i$, $deg(s)$ is the degree of node $s$ (that is, the total number of edges from $s$), and $n_{dead}$ is the number of dead nodes.

Based on this weighted, enhanced graph, the similarity between two snippets,

$s_1$ and $s_2$, is defined as:

$$sim_d(s_1, s_2) = \sum_{i=1}^{m} d(s_1, c_i) \left( \sum_{j=1}^{n} d(c_i, c_j) \cdot d(c_j, s_2) \right)$$

where $m$ ($n$) is the number of active nodes of $s_1$ ($s_2$) and $d(x, y)$ is the weight of the edge between nodes $x$ and $y$. Interpreting the edge weights as the probability of moving between the two nodes that define the edge, the above similarity function calculates the overall probability of reaching one snippet node from the other in the enhanced graph.
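The following sketch illustrates the direct-association computation: it builds the snippet-to-concept edge weights as in the formula above and sums the two-step path probabilities between the two snippet nodes. The domain_sim dictionary, the idf values and ALPHA are hypothetical inputs standing in for the mined domain knowledge and tuned parameters; this is a simplified illustration, not the thesis implementation.

ALPHA = 0.01  # hypothetical small weight assigned to edges towards dead concept nodes

def snippet_to_concept_weights(concepts, other_concepts, domain_sim, idf):
    """Edge weights d(s, c_i) from a snippet node s to its own concept nodes."""
    def sim(a, b):
        return domain_sim.get((a, b), domain_sim.get((b, a), 0.0))
    active = [c for c in concepts if any(sim(c, c2) > 0 for c2 in other_concepts)]
    dead = [c for c in concepts if c not in active]
    deg = len(concepts)                    # deg(s): number of edges from the snippet node
    weights = {c: ALPHA for c in dead}     # a small weight alpha for every dead concept
    for c in active:
        # d(s, c_i) = idf(c_i) / deg(s) * (1 - n_dead * alpha), as in the formula above
        weights[c] = idf.get(c, 1.0) / deg * (1.0 - len(dead) * ALPHA)
    return weights, active

def direct_similarity(c1, c2, domain_sim, idf):
    """sim_d(s1, s2): total probability of the walks s1 -> c_i -> c_j -> s2."""
    def sim(a, b):
        return domain_sim.get((a, b), domain_sim.get((b, a), 0.0))
    w1, active1 = snippet_to_concept_weights(c1, c2, domain_sim, idf)
    w2, active2 = snippet_to_concept_weights(c2, c1, domain_sim, idf)
    return sum(w1[ci] * sim(ci, cj) * w2[cj]
               for ci in active1 for cj in active2 if sim(ci, cj) > 0)

# Hypothetical mined concept similarities (Equation 2.1) and idf values
domain_sim = {("query expansion", "query expansion"): 1.0,
              ("query expansion", "probabilistic query"): 0.2,
              ("document analysis", "query log"): 0.18}
idf = {"query expansion": 1.2, "document analysis": 1.5, "local": 2.0,
       "global document": 1.3, "probabilistic query": 1.8, "query log": 1.4}
s1 = {"query expansion", "local", "global document", "document analysis"}
s2 = {"probabilistic query", "query expansion", "query log"}
print(direct_similarity(s1, s2, domain_sim, idf))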

Indirect Association One problem with the previous similarity measure is that it does not consider indirectly related concepts, that is, two concepts connected via a third concept. For example, as shown in Figure 2.3, the nodes “global document” and “query log” are not directly connected, yet they are reachable from each other via the concept “web search” in the domain knowledge. This suggests that “global document” and “query log” do have some similarity.

To model the above observation, each concept in the title is expanded by including all its related concepts in the domain knowledge into the graph. Then the similarity induced by indirect associations is defined as:

$$sim_{ind}(s_1, s_2) = \sum_{i=1}^{m} d(s_1, c_i) \cdot \left( \sum_{k=1}^{p} \sum_{j=1}^{n} d(c_i, c_k) \cdot d(c_k, c_j) \cdot d(c_j, s_2) \right)$$

where $m$ ($n$) is the number of active nodes in $s_1$ ($s_2$), and $p$ is the number of concept nodes that $c_i$ connects to.

[Figure 2.3: An Example Indirect Association Graph. The concept nodes of the two snippets are expanded with related concepts from the domain knowledge (e.g., “document processing”, “web search”, “search engine”), so that “global document” and “query log” become reachable via “web search”.]

Final Similarity Measure By integrating the similarities from direct and indirect associations, a similarity based on the domain knowledge, $sim_{dom}$, is obtained:

$$sim_{dom}(s_i, s_j) = sim_d(s_i, s_j) + sim_{ind}(s_i, s_j)$$

Finally, the traditional cosine similarity function is enhanced with $sim_{dom}$. Empirically, the following combination based on the geometric mean is found to work best:

$$sim(s_i, s_j) = \sqrt{sim_{cosine}(s_i, s_j) \cdot sim_{dom}(s_i, s_j)}$$
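A minimal sketch of the final combination step, assuming the cosine similarity and the direct and indirect domain similarities have already been computed for a snippet pair (the numeric values below are hypothetical):

import math

def domain_similarity(sim_direct, sim_indirect):
    """sim_dom = sim_d + sim_ind."""
    return sim_direct + sim_indirect

def combined_similarity(sim_cosine, sim_dom):
    """Final measure: geometric mean of the cosine and domain-based similarities."""
    return math.sqrt(sim_cosine * sim_dom)

# Example values for one snippet pair
sim_dom = domain_similarity(0.21, 0.09)
print(combined_similarity(0.12, sim_dom))   # about 0.19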

2.3.5 Cluster Generation

The hierarchical agglomerative clustering (HAC) algorithm is chosen to generate clusters.

The HAC algorithm takes as input a pair-wise similarity matrix and produces a nested sequence of partitions. There are several variants of the HAC algorithm with respect to evaluating the similarities between two clusters of objects. The average link algorithm is chosen, that is, the similarity between two clusters is calculated based on average similarities between objects from two clusters. This variant is known to be more reliable and robust to outliers.

Outlier-Conscious Clustering However, even with the average-link HAC algorithm, the quality of the clusters is still substantially affected by outliers. The fundamental reason is that the HAC algorithm implicitly assumes that there are no outliers in the dataset and that every object must belong to a cluster. In the current problem setting, there is abundant noise in the snippets, partly because these publication records and snippets are automatically generated. The HAC algorithm therefore tends to generate many singleton clusters, and almost all of them turn out to be outliers on manual inspection.

Therefore, the following modification to the basic HAC algorithm is proposed, which is called outlier-conscious clustering. The HAC algorithm is run several times; after each run, all singleton clusters are marked as outliers and removed from the input to the next run. Experimental results show that this modification improves the final cluster quality.
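The sketch below is one possible (assumed) implementation of the outlier-conscious modification, using SciPy's average-link agglomerative clustering on a precomputed similarity matrix. Similarities are converted to distances as 1 - sim, and the number of iterations and k are caller-chosen parameters, as in the experiments later in this chapter.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def outlier_conscious_hac(sim_matrix, k, n_iterations=3):
    """Run average-link HAC repeatedly; singleton clusters are marked as outliers and removed."""
    remaining = np.arange(sim_matrix.shape[0])     # indices of snippets still being clustered
    outliers, clusters = [], {}
    for _ in range(n_iterations):
        if len(remaining) < 2:
            break
        dist = 1.0 - sim_matrix[np.ix_(remaining, remaining)]   # similarities -> distances
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")   # average-link HAC
        labels = fcluster(Z, t=k, criterion="maxclust")
        sizes = np.bincount(labels)
        singletons = sizes[labels] == 1            # snippets that ended up alone
        outliers.extend(remaining[singletons].tolist())
        clusters = {}
        for idx, lab in zip(remaining[~singletons], labels[~singletons]):
            clusters.setdefault(int(lab), []).append(int(idx))
        remaining = remaining[~singletons]
    return clusters, outliers

# Toy 5x5 similarity matrix (hypothetical values)
S = np.array([[1.0, 0.8, 0.7, 0.1, 0.0],
              [0.8, 1.0, 0.6, 0.1, 0.0],
              [0.7, 0.6, 1.0, 0.2, 0.0],
              [0.1, 0.1, 0.2, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
print(outlier_conscious_hac(S, k=2))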

Cluster-labelling Generation After generating the clusters, the next step is to generate representative labels for each cluster. The simple heuristic of choosing words that occur in more than 50 per cent of the snippets in the cluster is used. This agrees with the intuition of allowing a cluster to have multiple, frequently occurring topical labels.

2.4 Experiments

Experiments are conducted on a PC with Intel PIII 3.2 GHz CPU, 2 GB RAM and Windows XP. Domain data is collected from DBLP; it provides bibliography information on major computer science journals and proceedings, and indexes more than one million articles.

Google Scholar is used as the query search engine; it is a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines.

Various types of query keywords in the computer science domain are selected. The five example queries used in the experiments are listed in Table 2.1. For each query, the top 100 results are used for clustering. For the proposed outlier-conscious clustering algorithm, 3 iterations are performed; in each iteration, clusters of size 1 are identified as outliers and removed. The number of clusters k varies from 10 to 25.

Query ID   Query
Q1         clustering
Q2         web mining
Q3         relevance feedback
Q4         information retrieval
Q5         query expansion

Table 2.1: Queries

The following algorithms are implemented and compared.

FPF   This is one of the latest proposed methods and is based on the k-centre clustering algorithm [GPPS06].

HAC   This is the outlier-conscious algorithm using only the cosine similarity.

HAC+  This is the outlier-conscious algorithm using the proposed similarity function that leverages domain knowledge.

The effectiveness of the different clustering algorithms is measured using the purity measure [GDV07]. Domain experts were requested to manually examine the clusters and mark irrelevant snippets. Let the number of irrelevant snippets in cluster $C_i$ be $n_i$; then the purity of this cluster is defined as $\rho_i = 1 - \frac{n_i}{|C_i|}$. The overall purity of the clusters is calculated as a weighted average of the purities of the individual clusters:

$$\rho = \sum_i \frac{n_i}{\sum_j n_j}\, \rho_i$$

Domain Knowledge The strength of the proposed clustering algorithm partly lies in the new similarity function, which gives non-zero similarity values for related concepts even if they do not literally share any common word. The closely related concepts for the five queries are shown in Table 2.2. As shown, the proposed mining algorithm is able to extract closely related concepts from the domain knowledge.

Clustering             Web mining            Relevance feedback
Cluster algorithm      Website navigation    Information retrieval
High dimension         Browsing pattern      Image retrieval
Supervised problem     Web usage             Web search
Knowledge discovery    Usage pattern         Search engine
Fuzzy approximation    Pattern discovery     Supervised learning

Information retrieval  Query expansion
Retrieval system       Retrieval feedback
Relevance feedback     Search improvement
Language model         Phrase analysis
Information access     Retrieval technique
Web search             Natural language

Table 2.2: Closely Related Concepts With the Query Concepts

Cluster Quality The purities of the three algorithms with k = 10, 15, 20 and 25 are shown in Figures 2.4, 2.5, 2.6 and 2.7 respectively. It can be observed that HAC+ performs better than the other two algorithms. FPF suffers from its inability to deal with outliers. HAC+ outperforms HAC, as the cosine similarity used in the HAC algorithm fails to capture semantic similarities. As a concrete example, the following articles are placed in one cluster by HAC:

• “An association thesaurus for information retrieval”

• “Query expansion using associated queries”

• “Mining term association rules for automatic global query expansion”

Obviously, HAC groups them into one cluster based on the common occurrence of the word “association”, although the phrases “association thesaurus”, “associated queries” and “association rules” have different meanings in their respective contexts.

In contrast, the proposed HAC+ approach does not restrict concepts to single words, and aims to find the semantic relationships between concepts from the domain data. For example, it is able to put two related papers into one cluster: “on modeling information retrieval with probabilistic inference” and “a hidden markov model information retrieval system”.

It is also observed that, among the five queries, the performance of the HAC+ algorithm is lowest for query Q1. The reason is that the domain dataset (the DBLP dataset) contains article information from the computer science domain only. Consequently, the proposed domain-knowledge-based approach performs well if the snippets returned by the search engine are from the computer science domain. For query Q1 (i.e. “clustering”), it is found that many of the snippets returned by the search engine come from other domains (e.g., physics, medicine, etc.). As a result, the proposed algorithm effectively degenerates to the HAC algorithm for snippets whose contents are not covered by the domain knowledge.

Figures 2.8 and 2.9 examine the purity of the proposed HAC+ with varying numbers of clusters (k). It can be seen that there is a drop in purity with increasing k for all queries. Figure 2.10 plots the number of outliers detected by HAC+ against the number of clusters. As expected, HAC+ is able to detect more outliers as the number of clusters increases.

Tables 2.3–2.7 show sample clusters, containing the titles and the labels generated by HAC+, for all queries.

[Figure 2.4: Cluster Quality (k = 10). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.5: Cluster Quality (k = 15). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.6: Cluster Quality (k = 20). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.7: Cluster Quality (k = 25). Purity of FPF, HAC and HAC+ on queries Q1–Q5.]

[Figure 2.8: HAC+, k vs. Purity. Purity per query for k = 10, 15, 20, 25.]

[Figure 2.9: HAC+, k vs. Purity. Purity per query against the number of clusters k.]

[Figure 2.10: HAC+, k vs. Outlier. Number of outliers per query against the number of clusters k.]

2.5 Conclusions

This work focuses on clustering snippets returned by the citation-enhanced search engine Google Scholar. A snippet mining framework is built by proposing a novel similarity function based on mined domain knowledge and an outlier-conscious clustering algorithm. The key contributions of the proposed framework are: (1) identification of appropriate features (concepts) for clustering the snippets returned by the Google Scholar search engine, (2) exploration of author information in the domain dataset to generate similarity measures between concept pairs, (3) utilisation of domain information in proposing a new similarity function and (4) analysis of outliers and implementation of an outlier-conscious algorithm in generating the required number of clusters. Overall, the experimental results suggest that domain knowledge can be useful in improving the quality of clusters and detecting outliers. In addition, clustering the snippets returned by the search engine is a reasonable and speedy alternative to downloading the original documents.

Query: clustering        Labels: data, cluster
Cluster sample:
1. automatic subspace clustering of high dimensional data for data mining applications
2. efficient and effective clustering methods for spatial data mining
3. birch: an efficient data clustering method for very large databases
4. clustering data streams
5. [book] algorithms for clustering data
6. clustering countries on attitudinal dimensions: a review and synthesis
7. data clustering: a review

Table 2.3: Cluster Sample for Clustering

In future work, we plan to investigate alternative methods of extracting useful domain knowledge. One interesting direction could be the use of citation information. It is also planned to experiment with datasets in various other domains, to perform a large-scale evaluation of the proposed system and to obtain user feedback to further improve the system.

Query: web mining        Labels: survey, web, research, mining
Cluster sample:
1. [citation] research on web mining-based intelligent search engine
2. the research of web mining
3. [citation] conference tutorial notes: web mining: concepts, practices and research
4. web mining-concepts, applications and research directions
5. web mining research
6. web mining: research and practice
7. [citation] web mining research: a survey, sigkdd explorations
8. [citation] web mining research: a survey
9. web mining research: a survey
10. [citation] research on web mining: a survey
11. web mining: a survey in the fuzzy framework

Table 2.4: Cluster Sample for Web Mining

Query: relevance feedback    Labels: relevance, learning, image, space, feedback, retrieval, semantic, user
Cluster sample:
1. learning and inferring a semantic space from user's relevance feedback for image retrieval
2. learning a semantic space from user's relevance feedback for image retrieval
3. one-class svm for learning in image retrieval
4. learning similarity measure for natural image retrieval with relevance feedback
5. image retrieval with relevance feedback: from heuristic weight adjustment to optimal learning methods

Table 2.5: Cluster Sample for Relevance Feedback

Query: information retrieval    Labels: audio, intelligent, information, multimedia, retrieval
Cluster sample:
1. [book] multimedia information retrieval: content-based information retrieval from large text and audio ...
2. [book] intelligent multimedia information retrieval
3. an overview of audio information retrieval
4. approaches to intelligent information retrieval
5. using linear algebra for intelligent information retrieval

Table 2.6: Cluster Sample for Information Retrieval

Query: query expansion    Labels: expansion, web, query, experiment, trec
Cluster sample:
1. query routing for web search engines: architecture and experiments
2. experiments on interfaces to support query expansion
3. overview of the trec-2001 web track
4. okapi/keenbow at trec-8
5. ucla-okapi at trec-2: query expansion experiments

Table 2.7: Cluster Sample for Query Expansion

Chapter 3

Learning Top-k Transformation Rules

3.1 Introduction

Real data are inevitably noisy, inconsistent or erroneous. For example, there are dozens of correct ways to cite a publication, depending on the bibliography style used; however, there can also be hundreds of citations to the same publication that contain typographical errors, Optical Character Recognition (OCR) errors, or errors introduced when the citation was extracted from Web pages by a program.

Record linkage is the process of bringing together two or more separate records pertaining to the same entity, even if their surface forms are different. It is a cornerstone of ensuring the high quality of mission-critical data, either in a single database or during data integration from multiple data sources. Therefore, it has been used in many applications, including data cleaning and data integration from multiple sources or the Web. Record linkage is known under many different names, including entity resolution and near-duplicate detection, and is a well-studied problem that has accumulated a large body of work [Win99, EIV07].

Most existing record linkage approaches exploit similarities between values of intrinsic attributes, and many similarity or distance functions have been proposed to model different types of errors. For example, edit distance is used to account for typographical errors and misspellings [WXLZ09], whereas Jaro-Winkler distance is designed for comparing English names [WT91] and Soundex is used to account for misspellings due to similar pronunciations. Various similarity measures, such as Jaccard and cosine similarity, are used to measure the similarity of multi-token strings [XWLY08]. The well-known tf-idf heuristic is used to accommodate misspellings in [CRF03, MYC08], and weights are then learned automatically using machine learning techniques [BM02, SB02b, TKM02]. Multiple similarity functions may be used on the same or different attributes of the records, either by simply aggregating them or by treating them as relational input to a classifier [SB02b].

The previous chapter focused on proposing a similarity function to cluster citation snippets. However, similarity functions alone are insufficient when coreferent records have little surface similarity (e.g., 23rd and twenty-third, or VLDB and Very Large Databases). Therefore, existing systems incorporate transformation rules to recognise these domain-specific equivalence relationships. This chapter focuses on addressing this problem.

Traditionally, transformation rules were created manually by experts. Not only is this process tedious, expensive and error-prone, but the generated rules are seldom comprehensive enough. This is the main motivation for semi-automatic methods that learn a set of high-quality candidate rules [ACK09]; domain experts can then manually validate or refine the candidate rules. Hence, it is important that a majority of the candidate rules generated by these algorithms be correct. The rule learning algorithm should also be able to cope with large input datasets and learn top-k rules efficiently. As demonstrated in the experiments, existing methods [ACK09] fail to meet both requirements.

In this chapter, a novel method is proposed to learn top-k transformation rules automatically from a set of input record pairs known to be coreferent. It performs a meticulous local alignment for each pair of records by considering a set of commonly used edit operations, and then generates a number of candidate rules based on the optimal local alignment. Statistics of the candidate rules are maintained and aggregated to select the final top-k rules. Extensive experiments are conducted against the existing state-of-the-art algorithm, and the proposed rule learning algorithm is found to outperform the existing method in both effectiveness and efficiency.

The contributions can be summarised as follows:

• A local-alignment-based rule learning algorithm is proposed. Compared with the global Greedy algorithm in [ACK09], the proposed algorithm generates fewer candidate rules, and these candidate rules are more likely to be correct.

• Extensive experiments are performed using several publicly available real-world datasets; the experimental results show an increase of 3.3 times in the percentage of correct rules compared with the previous approach, and a speed-up of up to 300 times in efficiency.

This chapter is organised as follows: Section 3.2 introduces related work; Section 3.3 presents the proposed algorithm; Section 3.4 discusses the experimental results; and Section 3.5 concludes the study.

3.2 Literature Review

This section describes various record linkage approaches when dealing with complex coreferent bibliography records and discusses the application of various transformation rules when dealing with such complex data.

The thesis [HH96] presents a system for integrating bibliography information from many heterogeneous sources that identifies related bibliography records and presents them together. It describes an author-title clustering algorithm that auto- matically identifies bibliography records that refer to the same work and clusters them together. It tolerates errors and cataloguing variations within the records by using a full-text search engine and an n-gram-based approximate string match- ing algorithm to build the clusters. In experiments with a collection of 240,000 records of the computer science literature, the algorithm identifies more than 90 per cent of the related records and includes incorrect records in less than 1 per cent of the clusters.

Along with a simple baseline method for comparison, the work [GBL98] investigates three methods of identifying citations to identical articles and grouping them together: Word Matching, Word and Phrase Matching, and LikeIt, a method based on the LikeIt intelligent string comparison algorithm introduced in [Yia93]. For all of these methods, it finds that normalising certain aspects of citations (conversion to lowercase, removal of hyphens, removal of citation tags, expansion of common abbreviations and removal of extraneous words and characters) tends to improve the results. Experiments on three sets of citations taken from the neural network literature show that Word and Phrase Matching is the best performer, followed by Word Matching, Baseline and LikeIt.

The paper [LGB99] presents machine-learning techniques that identify vari- ant forms of citations to the same paper. A number of algorithms are presented, including edit distance, word frequency or occurrence, subfields or structure, and probabilistic models. The accuracy and efficiency of all algorithms are quanti- tatively compared for a number of datasets. The algorithm based on word and phrase matching is found to perform best, while the algorithm based on a string edit distance performs poorly in comparison.

The work [MNU00] uses a two-step process of object identification on citations. The key idea involves applying a cheap approximate distance measure to divide the data efficiently into overlapping subsets, called canopies. The more expensive measure, edit distance, is then applied to compute the similarity between the objects that occur in a common canopy. The experimental results on grouping bibliography citations from the reference sections of research papers show that the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude. It also decreases errors by 25 per cent in comparison with a previously used algorithm.

The paper [PMM+02] considers the problem in the context of citation matching, that is, the problem of determining which citations correspond to the same publication. The proposed approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Uncertainty is handled by extending standard models to incorporate probabilities over possible mappings between language terms and domain objects. Inference is based on Markov Chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation datasets show that the method outperforms other citation matching algorithms.

The work in [SB02b] designs a learning-based de-duplication system (ALIAS) that allows the automatic construction of the de-duplication function by using a novel method of interactively discovering challenging training pairs. Unlike an ordinary learner that trains on a static training set, an active learner picks subsets of instances that, when labelled, will provide the highest information gain to the learner. With this approach, the more difficult task of bringing together the potentially confusing record pairs is automated by the learner.

Experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy.

The paper [CR02] presents an adaptive scheme for clustering and matching entity names by employing a set of features such as “match” and “edit distance”. An experimental evaluation on a number of sample datasets shows that adaptive methods sometimes perform better in a particular domain and can be competitive with the best baseline system.

To improve duplicate detection, the MARLIN (Multiply Adaptive Record Linkage with Induction) framework [BM03b] employs learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. It presents two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space-based measure that employs a Support Vector Machine (SVM) for training. Experimental results on several real-world datasets containing duplicate records show that MARLIN can improve duplicate detection accuracy over traditional techniques.

The paper [HGZ+04] describes two supervised learning approaches to dis- ambiguate name entities in citations. The Naive Bayes approach determines the author with the highest posterior probability of writing the paper of a citation, and the SVM approach classifies a test citation to the closest author class. Both approaches use three types of citation attributes: co-author names, paper title key- words and journal title keywords, and they achieve more than 90 per cent accuracy in disambiguating author names. Experiments show that co-author names appear to be the most robust attribute for name disambiguation, whereas using journal title words provides better results than paper title words.

The author resolution problem [BG04] is formulated as an iterative process by using co-author relationships as additional context information. Extensive evaluations on synthetically generated bibliography data show that iterative de- duplication can be a powerful and practical approach for data cleaning.

Based on conditional random fields [LMP01], the approach in [SD05] makes decisions collectively, performing simultaneous inferences for all candidate match pairs and allowing information to propagate from one candidate match to another via the attributes (or fields) that they have in common. For example, a paper in

Proc. IJCAI-03 that is found to be the same paper that appears in Proc. 18th

IJCAI implies that the two strings refer to the same venue, which in turn can help match other pairs of IJCAI papers. Experiments on bibliography databases show that the proposed approach outperforms the standard models making decisions independently.

The work [CM05] constructs a conditional random field model of de-duplication to capture the relational dependencies of the database records, and then employs a novel relational partitioning algorithm to de-duplicate records jointly. For two citation-matching datasets, it shows that collectively de-duplicating paper and venue records results in an error reduction of up to 30 per cent in venue de-duplication and up to 20 per cent in paper de-duplication.

In [DHM05], new methods are designed for reference comparison that effectively exploit the associations between references. The method first accumulates positive and negative evidence by propagating information between reconciliation decisions and then gradually enriches references by merging attribute values.

The experiments show that the proposed method can considerably improve pre- cision and recall over standard methods on a diverse set of personal information datasets and standard citation datasets.

The work [LOKP05] devises effective and efficient solutions to two practical problems: Mixed Citations (i.e. author names are homonyms) and Split Citations

(i.e. citations of the same author appear under different name variants). They present an effective solution based on a state-of-the-art sampling-based approxi- mate join algorithm and associated information of author names (e.g., co-authors, titles or venues). The effectiveness of the proposed approach is verified through preliminary experimental results.

The papers [KMC05, KM06] propose a domain-independent data-cleaning approach for reference disambiguation, referred to as Relationship-based Data Cleaning (RelDC). For the purpose of disambiguation, RelDC systematically exploits the relationships among entities and views the database as a graph of entities linked to each other via relationships. First, it utilises a feature-based method to identify a set of candidate entities (choices) for a reference to be disambiguated. Second, graph-theoretic techniques are used to discover and analyse relationships between the entity containing the reference and the set of candidates. The extensive empirical evaluation of the approach demonstrates that RelDC improves the quality of reference disambiguation beyond traditional techniques.

A new ontology-driven solution to the entity disambiguation problem in unstructured text is proposed in [HAMA06]. Going beyond traditional syntax-based disambiguation techniques, it uses different relationships in a document, as well as in the ontology, to provide clues for determining the correct entity. The effectiveness of this approach is demonstrated through evaluations against a manually disambiguated document set containing over 700 entities. The evaluation was performed over DBWorld announcements using an ontology created from DBLP (comprising over three million entities), and the results demonstrate its applicability to scenarios involving real-world data.

The work [CKM07] presents a graphical approach for entity resolution and its empirical evaluation on datasets taken from two different domains. It demonstrates that a technique that measures the degree of interconnectedness between various pairs of nodes in the graph can significantly improve the quality of entity resolution. Further, it presents an algorithm for making that technique self-adaptive to the underlying data, thus minimising the required participation of the domain analyst and further improving the disambiguation quality.

A recent article [BdCdMG+11] proposes similarity functions especially designed for the digital library domain that identify duplicated bibliography metadata records in an efficient and effective way. These functions compare proper nouns and support spelling variations, omissions of middle names, abbreviations and inversions of names. The experimental results show that the proposed functions improve the quality of metadata de-duplication compared to four different baselines, and achieve statistically equivalent results compared to a state-of-the-art method for replica identification based on genetic programming.

The paper [BBHG11] extends the work of [BdCdMG+11] so that, instead of setting thresholds on the scores returned by the similarity functions, the scores are used to train classification algorithms that automatically identify duplicate references. The experiments show that the classifiers improve the quality of the results by up to 11 per cent compared to the unsupervised heuristic-based approach.

The study [SK08] presents an overview of the duplication detection algorithms and an up-to-date state of their application in different library systems. Individual algorithms, their application processes for duplicate detection and their results are described based on available literature. The study finds that, when algorithms are applied in one step, a faster application is achieved, but the percentage of database clean up usually remains low. Most algorithms are two-step applications, which result in a greater improvement in database quality.

To overcome the limitations of previous algorithms that use active learning for record matching, the article [AGK10] proposes a new active learning algorithm that generalises the techniques for finding maximal frequent itemsets and cleverly navigates through the similarity space to find a conjunction of similarity thresh- olds. The experimental results show that the proposed algorithm is able to give probabilistic guarantees on the quality of the result while requiring fewer samples than passive learning algorithms.

Apart from the approaches cited above, another category of approaches to the record linkage problem is to model complex or domain-specific transformation rules. The learned rules can then be used to identify more coreferent pairs. Some of the related works are described below.

In order to identify matching objects, Active Atlas [TKM01,TKM02] applies a set of domain-independent string transformations to compare the objects’ shared attributes. It simultaneously learns to tailor both domain independent transfor- mations and mapping rules to a specific application domain through limited user input. The experimental results demonstrate that this approach achieves higher accuracy and requires less user involvement than previous methods across various application domains.

The article [MNK+05] presents an algorithm HFM (Hybrid Field Matcher), which combines the best aspects of the machine-learning and expert systems ap- proaches to create expert-like rules for field matching. In the first step, it uses a library of heterogeneous transformations that enables it to capture complex re- lationships between two field values. In the second step, machine-learning tech- niques are used to automatically customise these transformations for a given do- main so that highly accurate decisions can be made. The experiments show that

HFM produces superior results in domains where simpler field matching metrics, including Active Atlas [TKM02], fail to capture important distinctions.

The work [ACDK08] shows that existing domain knowledge, encoded as rules, can be used effectively to address the synonym problem and this can make the disambiguation task simpler, without the need for much training data. It ex- amines a subset of application scenarios in named entity extraction, categorises the possible variations in entity names and defines rules for each category. Using these rules, synonyms are generated and matched to the actual occurrence in the data sets.

The works in [ACK08, ACGK08] propose a framework that leverages domain knowledge to derive transformation rules. In their works, transformations are also learned from readily available specialised tables from various domains and online resources. String variations are captured by combining a traditional similarity function with the derived set of transformation rules. These transformations, coupled with an underlying similarity function, are used to define the similarity between two strings. The experiments over real data show that incorporating transformations significantly enhances record matching quality, and that the performance of computing a similarity join is improved by orders of magnitude through this technique.

The paper [MK09] presents a data mining approach to discover heterogeneous transformations between two datasets, without labelled training data. The pro- posed approach first finds a set of possible matches based on the cosine similarity between record pairs, and then mines transformations with the highest mutual information among these pairs. The experiments demonstrate that it discovers many types of specialised transformations, such as synonyms and abbreviations, and shows that exploiting these transformations can improve record linkage.

Most of the previous work relies on domain knowledge or some external knowledge to learn, from a given pair of matching records, useful transformation rules that can transform one record into the other. The recent framework in [ACK09] generates transformation rules without relying on any user-supplied input. The proposed techniques primarily rely on the alignment of the tokens of a string pair, applying a Greedy approach to the token subsequences to learn useful transformations. The main idea is to identify candidate rules from the unmapped tokens of each record pair and then compute aggregate scores of the candidate rules to find the top-k high-quality rules. Experiments over real-life data show better accuracy and scalability to larger datasets compared to previous approaches.

Although only a few works on record linkage focus on transformation rules, such rules have been widely employed in many other areas, and automatic rule learning algorithms have been developed accordingly.

In NLP, the work [Tur02] finds sets of synonyms by considering word co- occurrences. Later a similar idea is utilised in [PKM03] to identify paraphrases and grammatical sentences by examining co-occurrences of sets of words.

In linguistics, the article [SH97] introduces an original data structure and efficient algorithms that learn certain families of transformations relevant to part-of-speech tagging and phonological rule systems. The rule-based approach to the automated learning of linguistic knowledge [Bri95] has been shown, for a number of tasks, to capture information in a clearer and more direct fashion without a compromise in performance.

Recently, the articles [BYKS09b, AKL+09] address the problem of discovering rules that transform a given URL to others that are likely to have similar content. Given a set of URLs partitioned into equivalence classes based on content (URLs in the same equivalence class have similar content), they address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. Without examining page contents, these rewrite rules are used to canonise URL names, thus reducing crawling overhead and increasing crawl efficiency; for example, http://en.wikipedia.org/?title=* and http://en.wikipedia.org/wiki/* always refer to the same Web page.

Another closely related area is the substitution rules used in query rewriting

[JRMG06, RBC+08]. For example, when a user submits a query apple music player to a search engine, the query may be changed to apple ipod. The substi- tution rules are mainly mined from query logs, and the key challenges are how to

find similar queries and how to rank them.

3.3 The Local-alignment-based Algorithm

Similar to [ACK09], the input to transformation rule learning algorithms is a set of coreferent record pairs. The overall idea of the proposed new algorithm is to perform careful local alignment for each record pair, generate candidate rules from the optimal local alignment and aggregate the strength of the rules over all input pairs to winnow high-quality rules from all of the candidate rules.

3.3.1 Pre-processing the Input Data

Standard pre-processing is performed for input record pairs, which includes removing all non-alphanumeric characters except spaces and then converting char- acters to lower case. The case information is not used, as it is not reliable for noisy input data. Stopwords are removed and the strings are then tokenised into a se- quence of tokens using white space as the separator.
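A minimal sketch of the pre-processing step, assuming a small illustrative stop-word list (STOPWORDS is a placeholder; the exact list is not prescribed here):

import re

STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to"}  # placeholder list

def preprocess(record):
    """Lower-case, strip non-alphanumeric characters, drop stopwords, tokenise on white space."""
    record = record.lower()
    # Non-alphanumeric characters are replaced with a space here, a design choice that
    # splits hyphenated ranges such as "219-230" into separate tokens.
    record = re.sub(r"[^a-z0-9 ]+", " ", record)
    return [t for t in record.split() if t not in STOPWORDS]

print(preprocess("J. Karkkainen. Sparse Sufix Trees. pp. 219-230, 1996"))
# ['j', 'karkkainen', 'sparse', 'sufix', 'trees', 'pp', '219', '230', '1996']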

3.3.2 Segmentation

The goal of segmentation is to decompose a record into a set of semantic constituents known as fields, that is, substrings of tokens in a tokenised record. For example, bibliography records can usually be segmented into authors, title and venue fields. The proposed framework does not rely on a specific type of segmentation method; one can use either supervised methods (e.g., CRF [LMP01]) or simple rule-based segmentation methods (e.g., based on punctuation in the raw input records).

Although this step is optional, appropriate segmentation is beneficial to gen- erating better rules and faster execution of the algorithm for three reasons. First, it does not consider mappings for two tokens in different fields in two records.

Hence, fewer candidate rules are generated, and this speeds up the algorithm.

Second, most rules generated across fields are actually erroneous. For exam- ple, comparing an author name with initial p in the authors field with the token proceedings that appears in the venue field would generate a wrong rule, with p as an abbreviation of proceedings. Third, it is possible that different parameter settings can be used to learn rules for a particular field. For example, transfor- mation rules for author names (e.g., omitting the middle name) probably do not apply for paper titles.

3.3.3 Local Alignment

This phase performs field alignment and then finds the optimal local alignment between the values of the corresponding fields.

3.3.3.1 Computing the Optimal Local Alignment

We need to find a series of edit operations with the least cost to transform one string into another. After analysing common transformations, the following set of common edit operations is considered:

• copy: copy the token exactly;

• abbreviation: allows one token to be a subsequence of another token, for example, department ⇔ dept;

• initial: allows a single-letter token to be equivalent to the first letter of another token, for example, peter ⇔ p;

• edit: allows the usual edit operations (i.e. insertion, deletion and substitution), for example, schütze ⇔ schuetze;

• unmapped: tokens that are not involved in any transformation are denoted as unmapped tokens.

In addition to assigning a cost to each of the above edit operations, the operations are also prioritised as copy > initial > abbreviation > edit > unmapped. That is, if a higher priority operation can convert one token into another, operations of lower priority are not considered. For example, between the tokens department and dept, since an abbreviation operation can change one into the other, the edit operation is not considered. The cost of copy is always 0, and the cost of unmapped is always higher than the other costs; the costs of the other operations are subject to tuning. Algorithm 1 performs such local alignment and returns the minimum cost between two strings.

For example, Figure 3.1 shows two strings S and T (i.e., there is no segmentation) and the optimal local alignment between them, where white cells are copied, green cells are unmapped, and yellow cells involve other types of edit operations.

Algorithm 1: AlignStrings(S, T)
 1  c ← 0
 2  for each token w ∈ S do
 3      mincost ← min{cost(w, w′) | w′ ∈ T}    /* follows the priorities of edit operations */
 4      if w is not unmapped then
 5          c ← c + mincost
 6  for each unmapped token w ∈ S ∪ T do
 7      c ← c + unmapped_cost(w)
 8  return c

S: Juha Kärkkäinen: Sparse Suffix Trees. COCOON 1996, Hong Kong. 219-230
T: J. Karkkainen. Sparse Sufix Trees. Second Annual International Conference on Computing and Combinatorics, pp. 219-230, 1996

[Figure 3.1: Optimal Local Alignment. White cells are copied tokens, green cells are unmapped tokens (e.g., COCOON, Hong, Kong, pp) and yellow cells involve other edit operations (e.g., Juha ⇔ J, Kärkkäinen ⇔ Karkkainen, Suffix ⇔ Sufix).]
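The sketch below mirrors Algorithm 1 in Python. The concrete cost values (COSTS) and the edit-distance threshold are illustrative assumptions; only the priority order and the structure of the computation follow the description above, and the treatment of unmapped tokens in T is a simple approximation.

def is_subsequence(short, long):
    it = iter(long)
    return all(ch in it for ch in short)

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance (single-row formulation).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Illustrative costs; copy is free and unmapped is the most expensive.
COSTS = {"copy": 0.0, "initial": 0.2, "abbreviation": 0.3, "edit": 0.5, "unmapped": 1.0}

def token_cost(w, w2):
    """Cost of mapping token w to w2, following the operation priorities."""
    if w == w2:
        return COSTS["copy"], "copy"
    if (len(w) == 1 and w2.startswith(w)) or (len(w2) == 1 and w.startswith(w2)):
        return COSTS["initial"], "initial"
    if is_subsequence(w, w2) or is_subsequence(w2, w):
        return COSTS["abbreviation"], "abbreviation"
    if edit_distance(w, w2) <= 2:                      # small, assumed threshold
        return COSTS["edit"], "edit"
    return COSTS["unmapped"], "unmapped"

def align_strings(S, T):
    """Return the minimum alignment cost between two tokenised strings (cf. Algorithm 1)."""
    cost, mapped = 0.0, set()
    for w in S:
        best = min((token_cost(w, w2) for w2 in T), key=lambda x: x[0],
                   default=(COSTS["unmapped"], "unmapped"))
        if best[1] != "unmapped":
            cost += best[0]
            mapped.add(w)
    for w in S:
        if w not in mapped:
            cost += COSTS["unmapped"]
    # Approximation: tokens of T that cannot be mapped to any token of S pay the unmapped cost.
    for w2 in T:
        if all(token_cost(w, w2)[1] == "unmapped" for w in S):
            cost += COSTS["unmapped"]
    return cost

S = ["juha", "karkkainen", "sparse", "suffix", "trees", "cocoon", "1996"]
T = ["j", "karkkainen", "sparse", "sufix", "trees", "pp", "219230", "1996"]
print(align_strings(S, T))   # 3.5 with the illustrative costs above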

3.3.3.2 Aligning Fields

If the input records have been partitioned into fields, these fields also need to be aligned. In the easy case, where each field has a class label (as those output by the supervised segmentation methods), the alignment of fields is trivial — fields with the same class label (e.g., authors to authors) are just aligned. If fields do not have class labels, the Algorithm 2 is used to find the optimal alignment — an alignment such that the total cost is minimum. This is done by reducing the problem into a maximum weighted bipartite graph matching problem, which can be efficiently solved by the Hungarian algorithm [Kuh55] in O(V 2 log V +VE) = 68

3 O(Bmax ), where Bmax is the maximum number of fields in the records.

Algorithm 2: AlignBlocks(X, Y)
 1  Construct a weighted bipartite graph G = (A ∪ B, E, λ) such that there is a 1-to-1 mapping
    between nodes in A and the fields in X, and a 1-to-1 mapping between nodes in B and the
    fields in Y; E = {e_ij} connects A_i and B_j, and the weight of an edge is the negative of
    its cost, i.e., λ(e_ij) = −AlignStrings(A_i, B_j)
 2  M ← FindMaxMatching(G)    /* use the Hungarian algorithm */
 3  return M
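Field alignment can be implemented directly with an off-the-shelf assignment solver. The sketch below builds a cost matrix from pairwise field costs and solves it with SciPy's Hungarian-algorithm implementation, scipy.optimize.linear_sum_assignment. The field_cost function is a trivial token-overlap stand-in for AlignStrings, used only so the example is self-contained.

import numpy as np
from scipy.optimize import linear_sum_assignment

def field_cost(field_a, field_b):
    # Stand-in for AlignStrings: the number of tokens that the two fields do not share.
    a, b = set(field_a.split()), set(field_b.split())
    return len(a ^ b)

def align_fields(fields_x, fields_y):
    """Optimal 1-to-1 alignment of fields, minimising the total alignment cost (cf. Algorithm 2)."""
    cost = np.array([[field_cost(fx, fy) for fy in fields_y] for fx in fields_x])
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return [(fields_x[i], fields_y[j]) for i, j in zip(rows, cols)]

X = ["juha karkkainen", "sparse suffix trees", "cocoon 1996 hong kong 219-230"]
Y = ["sparse sufix trees", "karkkainen j", "conference on computing and combinatorics pp 219-230 1996"]
for fx, fy in align_fields(X, Y):
    print(fx, "<->", fy)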

3.3.3.3 Multi-token Abbreviation

Some tokens remain unmapped because they are involved in multi-token abbreviations. To address this issue, the abbreviation operation is generalised to include multiple-token to one-token abbreviations. For example, in Figure 3.1, one such instance is Conference on Computing and Combinatorics ⇔ COCOON.

Multi-token abbreviations are classified into the following categories:

• Acronym: A set of tokens [u_1, u_2, . . . , u_k] maps to a token v via an acronym mapping if there exists a prefix length l_i for each u_i such that u_1[1, l_1] ◦ u_2[1, l_2] ◦ . . . ◦ u_k[1, l_k] = v, where ◦ concatenates two strings. For example, association computing machinery ⇔ acm.

• Partial Acronym: A set of tokens [u_1, u_2, . . . , u_k] maps to a token v via a partial acronym mapping if there exists a prefix or suffix v_s of v, longer than a minimum length threshold, and a subsequence ss(u_i) of each u_i, such that ss(u_1) ◦ ss(u_2) ◦ . . . ◦ ss(u_k) = v_s. That is, v_s is equal to a subsequence of the concatenated string u_1 ◦ . . . ◦ u_k. For example, conference on knowledge discovery and data mining ⇔ sigkdd.

[Figure 3.2: Optimal Local Alignment After Recognising Multi-token Abbreviation (R4). The alignment of Figure 3.1 extended with the mapping Conference on Computing and Combinatorics ⇔ COCOON, denoted R4.]

It should be noted that such possible mappings are checked only among the contiguous unmapped tokens in both directions in two strings. In addition, the search is prioritised, where acronym > partial acronym > others.

Algorithm 3: MultiToken-Abbreviation(X, Y)
 Input: X and Y are the input record pair
 1  for each remaining unmapped token v ∈ X do
 2      for each remaining contiguous set of unmapped tokens u in Y do
 3          str ← u_1 ◦ . . . ◦ u_k
 4          if there exists a prefix or suffix v_s of v such that v_s is a subsequence of str then
 5              partialAcronym ← true
 6              if each match in u_i starts from its first character then
 7                  fullAcronym ← true
 8          if partialAcronym = true or fullAcronym = true then
 9              remove the matched tokens from X and Y

The algorithm to discover both types of acronyms is depicted in Algorithm 3.

As illustrated in Figure 3.2, the partial acronym mapping, denoted by R4, is recognised.
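The following sketch checks whether a single token abbreviates a contiguous run of tokens, distinguishing full acronyms (built from token prefixes, in order) from partial acronyms (a long-enough prefix or suffix of the token is a subsequence of the concatenated run), along the lines of Algorithm 3. MIN_LEN and the zero-length prefix allowance (so that tokens such as "and" may contribute nothing to an acronym) are assumed parameters, not values taken from the thesis.

MIN_LEN = 3   # assumed minimum length for the prefix/suffix of v in a partial acronym

def is_acronym(v, tokens):
    """Can v be written as u_1[:l_1] + u_2[:l_2] + ... + u_k[:l_k], prefixes taken in token order?"""
    if not tokens:
        return v == ""
    head, rest = tokens[0], tokens[1:]
    for l in range(0, len(head) + 1):     # l = 0 lets a token contribute nothing (assumption)
        if v.startswith(head[:l]) and is_acronym(v[l:], rest):
            return True
    return False

def is_subsequence(vs, s):
    it = iter(s)
    return all(ch in it for ch in vs)

def classify_abbreviation(v, tokens):
    """Return 'acronym', 'partial', or None for token v against a contiguous run of tokens."""
    if is_acronym(v, tokens):
        return "acronym"
    concat = "".join(tokens)
    # Partial acronym: a sufficiently long prefix or suffix of v is a subsequence of the concatenation.
    for k in range(len(v), MIN_LEN - 1, -1):
        if is_subsequence(v[:k], concat) or is_subsequence(v[-k:], concat):
            return "partial"
    return None

print(classify_abbreviation("acm", ["association", "computing", "machinery"]))                 # acronym
print(classify_abbreviation("sigkdd", ["knowledge", "discovery", "and", "data", "mining"]))    # partial
print(classify_abbreviation("cocoon", ["conference", "on", "computing", "and", "combinatorics"]))  # partial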

3.3.4 Obtaining top-k Rules

The top-k rules are selected by generating the candidate rules in the first phase and then computing and aggregating their scores across all input pairs of records.

Definition 1 (Rule) A rule is in the form of lhs ⇔ rhs, where both lhs and rhs are a sequence of tokens. The rule means that lhs is equivalent to rhs.

Definition 2 (Atomic Rule) An atomic rule is a rule where either the lhs or the rhs is a single token or empty (denoted as ⊥).

It is to be noted that an omission rule is a special rule where one side of the rule is empty; for example, since pp is still unmapped in Figure 3.2, an omission rule pp ⇔ ⊥ is generated.

Depending on the number of tokens on each side of the rule, a rule can be a 1-1 rule (e.g., peter ⇔ p), a 1-n rule (e.g., vldb ⇔ very large databases) or an n-m rule (e.g., information systems ⇔ inf sys). In practice, it is observed that almost all n-m rules can be decomposed into several 1-1 or 1-n rules; for example, the n-m rule above can be decomposed into two 1-n rules: information ⇔ inf and systems ⇔ sys. Therefore, the present work focuses on finding atomic rules and assembling them to form n-m rules.

3.3.4.1 Generating Candidate Rules

At this stage, the rules are generated from mapped tokens and unmapped to- kens in different manners:

• For mappings that are identified either by local alignment (Algorithm 1) or by multi-token abbreviation (Algorithm 3), the rules can be easily generated. It is to be noted that the trivial copying rules (i.e., A ⇔ A) are not generated. Adjacent rules are also combined into n-m rules; for example, once the two mappings information ⇔ inf and systems ⇔ sys are identified, and if information and systems are adjacent, as well as inf and sys, then the rule information systems ⇔ inf sys is also generated.

• For each of the remaining unmapped tokens in one string, it is postulated that it might be deleted (i.e., transformed to ⊥) or mapped to every contiguous subset of the unmapped tokens in the other string. This is done for both strings, and it is expected that, although many of these candidate rules are invalid, some valid rules (usually corresponding to complex, domain-specific transformations sharing little surface similarity; one example found in our experiments is maaliskuu ⇔ march, where maaliskuu in Finnish means March) will appear frequently when aggregated over a large number of input pairs. To avoid generating too many invalid rules from the above postulation, the rules with support less than the minimum threshold (2) are filtered out; as a result, most of the infrequent noisy rules are removed from the generated candidate rules. A sketch of this generation step is given after the example rules below.

Based on the mappings in Figure 3.2, the following rules are generated from mapped tokens:

• Juha ⇔ J,

• Kärkkäinen ⇔ Karkkainen,

• Juha Kärkkäinen ⇔ Karkkainen J,

• Suffix ⇔ Sufix,

• COCOON ⇔ Conference on Computing and Combinatorics.

Candidate rules generated from unmapped tokens pp are:

• pp ⇔ ⊥

• pp ⇔ Hong

• pp ⇔ Kong

• pp ⇔ Hong Kong
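The generation step for unmapped tokens can be sketched as follows (an illustrative Python sketch, not the thesis implementation; the same procedure would be applied in both directions). It reproduces the candidate rules for pp shown above.

def rules_from_unmapped(unmapped_x, unmapped_y):
    # An unmapped token is postulated to be deleted (mapped to ⊥) or mapped to
    # every contiguous subset of the unmapped tokens on the other side.
    rules = []
    for v in unmapped_x:
        rules.append((v, "⊥"))                      # omission rule, e.g. pp ⇔ ⊥
        for i in range(len(unmapped_y)):
            for j in range(i + 1, len(unmapped_y) + 1):
                rules.append((v, " ".join(unmapped_y[i:j])))
    return rules

print(rules_from_unmapped(["pp"], ["Hong", "Kong"]))
# -> [('pp', '⊥'), ('pp', 'Hong'), ('pp', 'Hong Kong'), ('pp', 'Kong')]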

3.3.4.2 Score of a Rule

Definition 3 The score of a rule R is defined as:

score(R) = log(1 + freq(R)) · log(1 + len(R)) · wt(R),   (3.1)

where freq(R) is the number of occurrences of the rule R, len(R) is the total number of tokens in R, and wt(R) is the weight assigned based on the type of the rule.

The above heuristic definition takes into consideration the popularity of a rule, its length and its type. The length component is important because, whenever ir ⇔ information retrieval holds, ir ⇔ information also holds; hence freq(ir ⇔ information retrieval) ≤ freq(ir ⇔ information), and without the length component the partially correct rule would be ranked at least as high as the complete rule. The rule-type component is used to capture the intuition that, for example, a rule generated from mapped tokens is more plausible than one generated by (randomly) pairing unmapped tokens.
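As a concrete illustration, the score can be computed as in the following minimal sketch; the type-dependent weights are the values reported later in Section 3.4.1 for the LA algorithm (mapped tokens: 4, exact acronym: 3, partial acronym: 2, others: 1), and the function names are assumptions of this sketch.

import math

TYPE_WEIGHTS = {"mapped": 4, "acronym": 3, "partial acronym": 2, "other": 1}

def score(freq, lhs_tokens, rhs_tokens, rule_type):
    length = len(lhs_tokens) + len(rhs_tokens)   # len(R): total tokens in the rule
    return math.log(1 + freq) * math.log(1 + length) * TYPE_WEIGHTS[rule_type]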

3.3.4.3 Select the top-k Rules

The complete algorithm is given in Algorithm 4, which learns top-k rules with the maximum scores from a set of input pairs of records.

Algorithm 4: TopkRule(D, k)

Input: D = {(X1, Y1), . . . , (XN, YN)} are N input record pairs
1   R ← ∅;
2   for each input record pair pi = (X, Y) do
3       bX ← segment X into partitions;
4       bY ← segment Y into partitions;
5       M ← AlignBlocks(bX, bY);
6       for each candidate rule R generated from the mapping M do
7           if R ∉ R then
8               R.support ← {pi};
9               R ← R ∪ {R};
10          else
11              update the R ∈ R so that R.support now includes pi;
12  for each candidate rule R ∈ R do
13      R.score ← calcScore(R.support);   /* based on Equation (3.1) */
14  output ← ∅;
15  i ← 1;
16  while i ≤ k do
17      TopRule ← arg maxR∈R {R.score};
18      UpdateRules(R, TopRule);
19      output ← output ∪ {TopRule};
20      i ← i + 1;
21  return output

In Algorithm 4, Lines 1–11 generate all candidate rules from the mappings obtained for each record pair by considering the best local alignment and possible multi-token abbreviations. Lines 12–13 calculate the score of each candidate rule. The rule with the maximum score, TopRule, is then iteratively selected (Lines 16–20), and its support is withdrawn from the other conflicting candidate rules by the procedure UpdateRules. This process is repeated until k high-quality rules are found.

Algorithm 5: UpdateRules(R, TopRule)
1  {p1, p2, . . . , pl} ← the support of rule TopRule;
2  for each pi do
3      CR ← all the rules that conflict with TopRule from the record pair pi;
4      for each rule R ∈ CR do
5          R.support ← R.support \ {pi};

The procedure UpdateRules is illustrated in Algorithm 5. A rule R1 is said to conflict with another rule R2 if one side of R1 is identical to one side of R2. Two typical types of conflicting rules are listed below, followed by a sketch of the selection loop:

• A ⇔ B and A ⇔ C

• A ⇔ BCD and A ⇔ B
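The selection loop of Algorithms 4 and 5 can be sketched in Python as follows. The rule objects (with lhs, rhs, support as a set of record-pair identifiers, and a score field), the conflict test, and the re-scoring of rules from their current support in each iteration are simplifying assumptions of this sketch, not the thesis implementation.

def conflicts(r1, r2):
    # Two rules conflict when they share one side, e.g. A <-> B vs A <-> C.
    return r1 is not r2 and (r1.lhs == r2.lhs or r1.lhs == r2.rhs
                             or r1.rhs == r2.lhs or r1.rhs == r2.rhs)

def top_k_rules(rules, k, calc_score):
    output = []
    for _ in range(k):
        for r in rules:
            r.score = calc_score(r)                   # e.g. Equation (3.1) on the current support
        candidates = [r for r in rules if r not in output and r.support]
        if not candidates:
            break
        top = max(candidates, key=lambda r: r.score)  # Line 17 of Algorithm 4
        output.append(top)
        for pair in list(top.support):                # Algorithm 5: UpdateRules
            for r in rules:
                if conflicts(r, top):
                    r.support.discard(pair)
    return output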

Analysis of the Algorithm. Let the number of input record pairs be N and let the average number of candidate rules generated by each record pair be f; thus, |R| = f · N. The time complexity of Algorithm 4 is O(k · (|R| + N · f)) = O(k|R|) = O(k · N · f). In practice, it is observed that only a constant number of rules conflict with the TopRule in each of the k iterations; hence, the time complexity is expected to be O(k · N).

3.4 Experiment

In this section, the experimental evaluations and analyses are performed.

3.4.1 Experiment Setup

The following algorithms are used in the experiments:

Greedy  This is the state-of-the-art algorithm for learning top-k transformation rules, proposed by Microsoft researchers [ACK09]. The approach first analyses the differences between the strings, using the hypothesis that consistent differences occurring across many examples are indicative of a transformation rule. Based on this intuition, a rule learning problem is formulated that seeks a concise set of transformation rules accounting for a large number of differences. The algorithm is linear in the input size, which allows it to scale easily with the number of input examples.

LA This is the proposed local-alignment-based algorithm.

It is to be noted that, for the proposed LA algorithm, the weights set for the various types of rules in Equation (3.1) are: rules from mapped tokens: 4, exact acronym: 3, partial acronym: 2, and others: 1. A partitioning method is used to segment each record into three fields, based on an algorithm that takes the local alignment costs into consideration, as described in the next section.

All experiments are performed on a PC with an AMD Opteron 8378 CPU and 96GB of memory, running Linux 2.6.32. All programs are implemented in Java.

3.4.1.1 Unsupervised Segmentation

Based on an unsupervised approach, the strings are partitioned into a given number of fields. For a given value of n, each string is partitioned into n disjoint fields, such that each field in one string shares the maximum number of exact tokens with the corresponding field in the other string. The key intuition is that, if two fields share most of their exact tokens, they are more likely to refer to the same key field. This is obtained by partitioning the strings into a set of blocks. Two types of blocks are generated: Mapped and Unmapped.

Definition 4 For a pair of strings S and T , a Mapped block in S is defined as the subsequence of tokens of maximum length that exists in T . An Unmapped block in S is a token that does not have an exact match token in T .

Example 3.4.1 Consider two strings S: A B C D E F G and T: D B C M F N E.

Mapped blocks in S are B C D and E F; Mapped blocks in T are D B C, F and E.

Unmapped blocks in S are A and G; Unmapped blocks in T are M and N.

The strings are then partitioned into a set of n disjoint subsequences of blocks (combinations of Mapped and Unmapped blocks) representing the fields. All possible ways to divide the strings into n fields are then enumerated.
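The enumeration step can be sketched as follows (an illustrative Python sketch under the assumption that each record is already represented as a list of blocks, each block being a list of tokens). The candidate segmentations of the two records would then be compared using the field-alignment cost of Algorithm 2, and the pair with the minimum total cost kept.

from itertools import combinations

def candidate_segmentations(blocks, n):
    # Choose n-1 cut positions among the gaps between consecutive blocks,
    # yielding every way to divide the block sequence into n contiguous fields.
    m = len(blocks)
    for cuts in combinations(range(1, m), n - 1):
        bounds = (0,) + cuts + (m,)
        yield [sum(blocks[bounds[i]:bounds[i + 1]], []) for i in range(n)]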

Two possible sets of fields returned for the following two strings S and T, with n = 3, are shown in Tables 3.1 and 3.2.

S : E. M. McCreight. A space-economical suffix tree construction algorithm.

Journal of the ACM, 23(2):262–272, Apr. 1976.

T : McCreight, E. M. A space-economical suffix tree construction algorithm.

JACM 23, 2 (Apr. 1976), 262–272.

Table 3.1 shows a more accurate segmentation than Table 3.2, and it is the one picked by Algorithm 2.

String | Segment 1 | Segment 2 | Segment 3
S | E. M. McCreight | A space-economical suffix tree construction algorithm. | Journal of the ACM, 23(2):262–272, Apr. 1976
T | McCreight, E. M. | A space-economical suffix tree construction algorithm. | JACM 23, 2 (Apr. 1976), 262–272.
Table 3.1: Example of Segmentation

String | Segment 1 | Segment 2 | Segment 3
S | E. M. McCreight | A space-economical suffix tree construction algorithm. Journal | of the ACM, 23(2):262–272, Apr. 1976
T | McCreight, E. M. | A space-economical suffix tree construction algorithm. JACM | 23, 2 (Apr. 1976), 262–272.
Table 3.2: Example of Segmentation

3.4.2 Datasets

The following three real-world datasets are used for the experiments.

CCSB  This dataset comes from the Collection of Computer Science Bibliographies (http://liinwww.ira.uka.de/bibliography/Misc/CiteSeer/). The site was queried with 38 keyword queries (e.g., data integration), and the top-200 results were collected. Each query result is referred to as a cluster, as it may contain multiple citations to the same paper. Only the clusters whose size is larger than one are kept for the experiments. Five different bibliography styles (these, acm, finplain, abbrv and naturemag, mainly from http://amath.colorado.edu/documentation/LaTeX/reference/faq/bibstyles.pdf) were used, where the i-th bibliography style is applied to the i-th citation (if any) in each cluster to obtain the corresponding transformed string from the LaTeX output. As a result, 3,030 clusters were generated. An example is shown in Table 3.3. All possible pairs of the strings produced by LaTeX within the same cluster were then formed and used as the input record pairs for the transformation rule learning algorithms. This results in 12,456 pairs.

Cora  This is the hand-labeled Cora dataset from the RIDDLE project (http://www.cs.utexas.edu/users/ml/riddle/data.html). It contains 1,295 citations of 112 Computer Science papers. The citations are clustered according to the actual paper they refer to, and all possible pairs are then generated within each cluster. As a result, 112 clusters were generated, with a total of 17,184 input record pairs.

Restaurant  This is the Restaurant dataset from the RIDDLE project. It contains 533 and 331 restaurants assembled from Fodor's and Zagat's restaurant guides, respectively, and 112 pairs of coreferent restaurants were identified. As mentioned above, a similar record-pair generation method was applied, generating 112 clusters and 112 record pairs.

It should be noted that all pairs of citations within the same cluster (i.e., referring to the same publication) were generated in order to exploit the ground truth data maximally. This, however, introduces a bias into the frequency of the rules (i.e., freq(R) in Equation (3.1)). For example, for a cluster of size 2t, a rule A ⇔ B can have a frequency of up to t² from this cluster alone. To remedy this problem, the frequency is defined as the number of clusters (rather than record pairs) in which the rule is generated. This is applied to both algorithms.
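The cluster-based frequency can be computed as in the following minimal sketch (rules are assumed to be represented as hashable (lhs, rhs) tuples, and generated_rules is a placeholder for the candidate-rule generation of Section 3.3.4.1):

from collections import defaultdict

def cluster_frequency(pairs_with_cluster_ids, generated_rules):
    # freq(R) = number of distinct clusters (not record pairs) in which R is generated.
    clusters_per_rule = defaultdict(set)
    for cluster_id, pair in pairs_with_cluster_ids:
        for rule in generated_rules(pair):
            clusters_per_rule[rule].add(cluster_id)
    return {rule: len(cids) for rule, cids in clusters_per_rule.items()}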


ID | Citation
1 | H. Turtle and W. B. Croft. Inference networks for document retrieval. In Proceedings of the 13th International Conference on Research and Development in Information Retrieval
2 | Turtle, H., and Croft, W. B. Inference networks for document retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1990), pp. 1–24.
3 | H. Turtle ja W. B. Croft. Inference networks for document retrieval. Kirjassa Proc. Thirteenth Intl. Conf. on Res. and Development in Information Retrieval, s. 1, 1990.
4 | Turtle, H. & Croft, W. B. Inference networks for document retrieval. In Proc. Thirteenth Intl. Conf. on Res. and Development in Information Retrieval, 1 (1990).
5 | Howard R. Turtle & W. Bruce Croft. Inference networks for document retrieval. Proceedings of the Thirteenth Intl. Conf. on research and Dev, pages 1–24, 1990.
Table 3.3: Example of a Cluster of Coreferent Records

3.4.3 Quality of the Rules

For both the Greedy and LA algorithms, the top-k rules are generated for each dataset with varying values of k. All output rules are then validated by domain experts, who classify each rule into one of the following three categories:

correct: the rule is absolutely correct; for example, proceedings ⇔ proc

partially correct: the rule is only partially correct; for example, ipl information processing letters ⇔ inf process lett vol

incorrect: the rule is absolutely incorrect; for example, computer science ⇔ volume.

Some of the correct rules discovered by LA are shown in Table 3.4.

ID | Rule
1 | focs ⇔ annual ieee symposium on foundations of computer science
2 | computer science ⇔ comput sci
3 | pages ⇔ pp
4 | 5th ⇔ fifth
5 | maaliskuu ⇔ march
6 | robert ⇔ rob
7 | vol ⇔ ⊥
Table 3.4: Example Rules Found

The quality of the rules generated by the algorithms is evaluated by counting the number of correct and incorrect rules. Precision is calculated as the fraction of correct rules in the top-k output rules (note that this definition is different from that in [ACK09], where precision was defined as the number of incorrect rules). The partially correct rules are ignored when counting the number of correct and incorrect rules.

3.4.4 Correct and Incorrect Rules

Figures 3.3–3.5 show the number of correct rules for the two algorithms when varying k on the three datasets. Figures 3.6–3.8 show the number of incorrect rules when varying k. It can be seen that:

• LA outperforms Greedy substantially on all datasets, generating more correct rules and fewer incorrect rules.

• Since the Cora dataset is dirtier than the CCSB dataset, the precision of both algorithms is lower on it. It can also be seen that the Greedy algorithm is affected more by the noise in the dataset than the LA algorithm.

• Since the Restaurant dataset is small, fewer rules are generated. For the top-50 rules, LA generates 68 per cent correct rules, whereas Greedy has a precision of 34 per cent.

Figure 3.3: CCSB, Number of Correct Rules

Figures 3.9–3.11 show how many incorrect rules are generated for a given target number of correct rules on the three datasets, following the evaluation in [ACK09]. It can be seen that many more incorrect rules are generated by the Greedy algorithm than by the LA algorithm for any fixed number of correct rules.

Figure 3.4: Cora, Number of Correct Rules

Figure 3.5: Restaurant, Number of Correct Rules

Figure 3.6: CCSB, Number of Incorrect Rules

Figure 3.7: Cora, Number of Incorrect Rules

Figure 3.8: Restaurant, Number of Incorrect Rules

Figure 3.9: CCSB, Number of Incorrect Rules vs. Number of Correct Rules

Figure 3.10: Cora, Number of Incorrect Rules vs. Number of Correct Rules

Figure 3.11: Restaurant, Number of Incorrect Rules vs. Number of Correct Rules

Figure 3.12: Supervised vs. Unsupervised CCSB, Number of Correct Rules

3.4.5 Unsupervised versus Supervised

In this section, the results from the proposed unsupervised segmentation approach are compared with the results from supervised segmentation (obtained from the segmented field labels of the datasets).

As expected, for varying k, supervised segmentation provides more correct rules than the unsupervised approach (Figures 3.12 and 3.13). It can also be observed that the advantage of the supervised approach is more pronounced at higher values of k than at lower values of k.

Figure 3.13: Supervised vs. Unsupervised Cora, Number of Correct Rules

3.4.6 Effect of Numeric Rules

Rules with numeric values on both the lhs and the rhs (e.g., 199 ⇔ 1994) are called numeric rules. It is observed that most of the numeric rules, generated as a result of edit-distance errors (e.g., 1993 ⇔ 1994) and abbreviations (e.g., 1 ⇔ 12), are noisy and thus degrade the precision of both algorithms. This section compares the performance of LA with and without the inclusion of numeric rules on the CCSB and Cora datasets, for both the supervised (Figures 3.14 and 3.15) and the unsupervised (Figures 3.16 and 3.17) method of segmentation.

It can be seen that, for both the supervised and the unsupervised method of segmentation, the exclusion of numeric rules increases the precision on both datasets.

Figure 3.14: Supervised Segmentation CCSB, Number of Correct Rules

Figure 3.15: Supervised Segmentation Cora, Number of Correct Rules

Figure 3.16: Unsupervised Segmentation CCSB, Number of Correct Rules

Figure 3.17: Unsupervised Segmentation Cora, Number of Correct Rules

Figure 3.18: CCSB, Impact of UpdateRules

3.4.7 Performance without UpdateRules

In this section, the top-k rules are examined without updating the support of the rules (Line 18 in Algorithm 4). The main motivation is to see how effective the generated rules are when the support of the rules is not updated. In order to avoid conflicts, once a TopRule R is selected in Line 17, all overlapping rules (i.e., rules whose lhs is the same as the lhs or the rhs of R) are removed from further consideration.

Figures 3.18 and 3.19 plot the number of correct rules generated with and without updating the support of the rules. As shown, for both CCSB and Cora, the approach without updating the support of the rules returns more correct rules for varying values of k. However, the major drawback is that many potentially correct rules are ignored by this approach. For example, if the rule pages ⇔ pp is selected in the top-k, then high-scoring valid overlapping rules such as pages ⇔ p and pages ⇔ ss are removed from further consideration.

Figure 3.19: Cora, Impact of UpdateRules

3.4.8 Precision and Consistency

Figure 3.20 plots the precision versus k graph in the CCSB dataset. Since both algorithms strive to find high-quality rules first, the precision is highest when k is small, and decreases with increasing k. Precisions for both algorithms become stable for k ≥ 500. In all settings, LA’s precision is much higher than Greedy’s.

As shown in Figure 3.20, the precisions of LA and Greedy are 84 per cent and 42 per cent, when k = 100, and 51.3 per cent and 15.4 per cent, when k = 1, 000.

A good transformation rule learner should perform consistently when the number of input record pairs is varied. Such results are shown in Figure 3.21, where the number of input record pairs varies from 2,000 to 12,000 and k = 200. As shown, the precision of the proposed LA algorithm remains stable (between 62 and 69 per cent).

Figure 3.20: CCSB, Precision vs. k

Figure 3.21: CCSB, Precision vs. n

3.4.9 Execution Time

This section investigates the efficiency of the algorithms with respect to the number of input record pairs (n) by measuring their running time. The result is shown in Figure 3.22. It can be seen that the time grows more quickly for the Greedy algorithm than for the LA algorithm. This is mainly because, with increasing input pairs, many more candidate rules are generated by the Greedy algorithm, as it does not perform careful local alignment. The fact that the running time of Greedy is up to 300 times that of LA is also due to the vast number of candidate rules generated.

Figure 3.23 plots the running time versus the output size k. It can be seen that both algorithms require more time as k increases, but Greedy's time grows more quickly with k. This is mainly because Greedy needs to update the support of a vast number of candidate rules in each iteration; hence, it takes much more time.

Figure 3.22: Execution time vs. Input Record Pairs (n) (panels (a)–(f): CCSB, time vs. k for n = 2,000, 4,000, 6,000, 8,000, 10,000 and 12,000)

Figure 3.23: Execution time vs. Output Size (k) (panels (a)–(e): CCSB, time vs. n for k = 200, 400, 600, 800 and 1,000)

3.5 Conclusions

This work focuses on record linkage approaches for complex coreferent records that have little surface similarity (e.g., “VLDB” and “Very Large Databases”). Transformation rules are incorporated to recognise these domain-specific equivalence relationships. The rules are obtained by proposing an effective and efficient top-k transformation rule learning algorithm. The proposed algorithm performs a meticulous local alignment for each pair of records by considering a set of commonly used edit operations, and then generates a number of candidate rules based on the optimal local alignment. Statistics of the candidate rules are maintained and aggregated to select the optimal top-k rules. Compared with the global Greedy algorithm in [ACK09], the proposed algorithm generates fewer candidate rules, and the generated candidate rules are more likely to be correct. Extensive experiments are performed using several publicly available real-world datasets. The results show that the percentage of correct rules is up to three times that of the previous approach, with a speed-up in efficiency of up to 300 times.

Future work may investigate the impact of the partitioning method on the quality of the rules. Another direction is to explore more efficient data structures and algorithms to further accelerate the proposed rule learning algorithm. Finally, the proposed method will be generalised to more complex transformation rules, for example, rules containing gaps or variables.

Chapter 4

Semi-automatic Comparative Citation Analysis

4.1 Introduction

The previous chapters mainly focus on developing new approaches to address the data cleaning issues arising in bibliography data. It is equally imperative to evaluate the benefit of these techniques in a real setting. This chapter aims to address this issue by evaluating the utility of various citation-enhanced search engines for conducting citation analysis. To that end, a semi-automatic tool is developed to perform data cleaning prior to performing the large-scale citation analysis.

Citation analysis is a major subfield of informetrics, allowing a researcher to follow the development and impact of an article through time by looking backwards at the references the author cites, and forwards to those authors who then cite the article. Academic institutions, federal agencies, publishers, editors, authors and librarians increasingly rely on citation analysis for making hiring, promotion, tenure, funding and/or reviewer and journal evaluation and selection decisions [MY07]. In general, citation counts are used to measure the impact of articles and journals, thus enabling researchers to examine growth, popularity and trends in a particular research stream. Assuming that scientists cite the work that they have found useful in pursuing their own research, the number of citations received by a publication is seen as a quantitative measure of its impact in the scientific community. The number of citations is frequently incorporated into decisions of academic advancement. However, citation data is heavily influenced by the coverage of the specific database, since a database can take into account only citations from items it indexes.

Until recently, ISI/Thomson Reuters was the single most comprehensive source offering large-scale bibliography database services, such as locating citations and/or conducting citation analyses. This service maintains citation databases covering thousands of academic journals and is now available online via ISI's Web of Knowledge service, which provides researchers and other interested parties with access to a wide range of bibliography and citation analysis services. Data from the ISI Citation Indexes and the Journal Citation Reports are routinely used by promotion committees at universities all over the world. However, the ISI citation databases have been criticised for several limitations [MY07], because they:

1. cover mainly North American, Western European and English-language titles

2. are limited to citations from 8,700 journals

3. do not count citations from books and most conference proceedings

4. provide different coverage between research fields

5. have citing errors, such as homonyms and synonyms, and inconsistencies in the use of initials and in the spelling of non-English names (however, many of these errors derive from the primary documents themselves, rather than from faulty ISI indexing).

The availability of citation data in other bibliography databases opens up the possibility of extending the data source for performing citation analysis, particularly to include other publication types of written scholarly communication, such as books, chapters in edited books and conference proceedings. The inclusion of other publication types will contribute to the validity of bibliometric analysis when evaluating fields in which the internationally oriented scientific journal is not the main medium for communicating research findings [ND06].

In recent years, several database producers have noticed the potential of citation indexing and have manually added cited references to a subset of their records [ND06]. Discipline-oriented databases such as Chemical Abstracts by the American Chemical Society, MathSciNet by the American Mathematical Society and PsycINFO by the American Psychological Association have introduced citation indexing to their bibliography databases. With the electronic availability of scholarly documents, electronic database searching has become the de facto mode of information retrieval. Several bibliography databases [ND06] have been established that automatically extract bibliography information and cited references from electronic documents retrieved from digital archives and repositories. Some of these databases offer sophisticated features for citation searching and provide detailed information on download frequencies, which may serve as an additional basis for assessing the resonance and impact of publications. Some of these remarkable services include:

• CiteSeer (http://citeseer.ist.psu.edu/): currently focuses on computer/information science literature.

• PubMed (http://www.ncbi.nlm.nih.gov/pubmed/): mainly focuses on medicine and the biomedical sciences.

• SciFinder Scholar (http://cas.org/products/scifindr/index.html): covers journals and patent information on chemistry and medicine.

• Faculty of 1,000 (www.facultyof1000.com): highlights and evaluates the most important articles in biology and medicine.

• RePEc (http://repec.org/): covers research papers in economics.

• SMEALSearch (http://library.wlu.edu/details.php?resID=225): indexes academic business documents.

Beyond these discipline-oriented databases, two multidisciplinary databases, Scopus (www.scopus.com) from Elsevier and Google Scholar (GS, http://scholar.google.com/) from Google Inc, have attracted much attention. Both were introduced in 2004; Scopus requires a paid subscription, while GS is free. Each of these databases has a different scope of citation coverage and uses unique methods to record and count citations [KASW09]. The differences in citation counts among the databases could have implications for citation analysis studies and for the use of citation counts in academic advancement decisions. This chapter aims to compare the utility of these citation-enhanced databases in the field of computer science, a relatively new field of study in which conference papers are considered a more important form of publication than is generally the case in other scientific disciplines.

4.2 Proposed Study

Given that the various scientific databases have their own characteristics, the present study aims to compare the utility of the currently most popular sources of scientific information, namely Scopus and GS. The study first looks into the data quality issues in the citation results returned from GS (such as several entries for the same paper due, for example, to misspelled titles, author names, different ordering of authors, journal and conference versions of a paper with the same or similar title, or two or three versions of the same paper being found online). Another difficulty in performing citation analysis with GS is that its records are retrieved in a way that is very impractical for use with large datasets, requiring a tedious process of manually cleaning, organising and classifying the information into meaningful and useable formats. As a result, to conduct citation analysis, most of the work in the literature had to manually clean these errors and classify the citation information into proper formats (i.e., different fields such as author, venue and source), which is a tedious and time-consuming task. To ease this manual processing job, this thesis aims to develop a semi-automatic tool that performs extensive data cleaning to deal with these errors and presents the citations in a format suitable for conducting the analysis. Furthermore, the accuracy of the proposed data cleaning tool is analysed.

Using various bibliometric indicators (such as papers, cited papers, citations, citations per year and citations per paper, as well as the h-index and variants such as the g-index and hg-index), an analysis of the scholarly performance of various researchers in computer science is conducted. Computer science is a relatively new field of study, in which conference papers are considered a more important form of publication than is generally the case in other scientific disciplines, and it is less studied in the existing literature. The study examines the extent to which GS citation data can be used for evaluative purposes in computer science research, and uses the two databases (GS and Scopus) to perform and compare up-to-date citation analysis in this field. In addition, it analyses the strengths and weaknesses of Scopus and GS, their overlaps and uniqueness, and their characteristics (such as document type, language and content level), and discusses the implications of the findings for citation analysis.
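For reference, the indicators named above can be computed from a list of per-paper citation counts as in the following sketch; these are the standard definitions of the indices, not code from this thesis.

import math

def h_index(citations):
    # Largest h such that h papers have at least h citations each.
    cites = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

def g_index(citations):
    # Largest g such that the g most-cited papers have at least g^2 citations in total.
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cites, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def hg_index(citations):
    # Geometric mean of the h-index and the g-index.
    return math.sqrt(h_index(citations) * g_index(citations))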

4.3 Literature Review

This section surveys existing studies dedicated to the analysis and comparison of the performance of different citation-enhanced databases.

The paper [GMLG01] compares citations from CiteSeer, a Web-based citation indexing system, with those from ISI for computer science papers. The finding of more citations from conference papers than from ISI articles indicates the importance of conference papers among computer science professionals for disseminating research results.

The authors in [ZL02] compare the scholarly communication patterns in XML research as revealed by the citation analysis of ResearchIndex data and ISI data.

They find that citation data obtained from ResearchIndex contain more citing papers, a wider variety of document types and more information about cited papers than ISI data. From these findings, they conclude that citation analysis studies should combine ISI data with data available on the Web in order to obtain a larger and richer characterisation of scholarly communication structures and processes.

The work in [VS03] compares bibliography and Google Web citations to articles in 46 journals in library and information science. It finds that Web citations significantly correlate with bibliography citations listed in the Social Sciences Citation Index and with the ISI Journal Impact Factor for most journals (57 per cent). The study also shows that Web citation counts are typically higher than bibliography citation counts for the same article.

The author [Jac05a] has conducted detailed analysis of citation results from

GS in comparison with the results from Scopus and Web of Science (WoS). The author searched for documents citing Eugene Garfield, an article by Garfield published in 1955 in Science, the journal Current Science and its 30 most-cited articles. He found that the coverage of Current Science by GS was abysmal and lacking in its scope of handling data, and that the coverage of the social sciences by Scopus was modest. He also found a considerable overlap between GS and WoS, and many unique documents in each source, stating that the majority of these unique documents are relevant and substantial.

The study in [Nor05] begins with an overview of how to use GS for citation analysis and identifies advanced search techniques that are not well documented by GS. This study also compares the citation counts provided by WoS and GS for articles in the field of Webometric and makes several suggestions for improving

GS. The analysis of citations shows that GS is good at finding additional citations, and concludes that GS provides a free alternative or complementary service to other citation indexes.

The article [BB05] presents a case study comparing the citation counts pro- vided by WoS, Scopus, and GS for articles from the Journal of the American Soci- ety for Information Science and Technology (JASIST) published in 1985 and 2000.

It finds that WoS provides maximum citation counts for older articles, whereas for newer articles, GS provides statistically significant higher citation counts than ei- ther WoS or Scopus. However, the authors recommend conducting more rigorous studies before these findings are considered definitive.

The paper [PS05] evaluates ISI and GS by comparing their citations of pa- pers in mathematics, chemistry, physics, computing sciences, molecular biology, ecology, fisheries, oceanography, geosciences, economics and psychology. Each discipline is represented by three authors, and each author is represented by three articles (i.e. 99 articles in total). The findings suggest that GS can be a substitute for ISI and reports a good correlation in citation counts between the two sources without assessing accuracy or relevance and quality of the citing articles.

The work in [Jac05b] presents the outcome of the analysis of the individual and aggregate citation scores calculated by WoS and GS for the papers published in 22 volumes of the Asian Pacific Journal of Allergy and Immunology (APJAI) and discusses the reasons for the significant limitations of GS in calculating and reporting the citedness scores.

The authors in [KT06] conduct a similar analysis to [VS03] of Google ci- tations to open-access (OA) journals in library and information science. Their

Google search results show that 282 research articles were published in the year

2000 in 15 peer-reviewed and library and information science (LIS) OA journals and were invoked by 3,045 URL citations. Of these URL citations, 43 per cent 105 were created for formal scholarly reasons equivalent to traditional citations and

18 per cent were created for informal scholarly reasons. Of the sources of URL citations, 82 per cent were in English, 88 per cent were full text papers and 58 per cent were non-HTML documents. Of the URL citations, 60 per cent were text

URLs only and 40 per cent were hyperlinked.

The paper [ND06] presents an overview of new citation-enhanced databases in the context of research evaluation. It reports the limitations of Thomson Reuter’s

Scientific Citation Index and reviews the characteristics of citation-enhanced databases, such as Chemical Abstracts, GS and Scopus. It suggests that citation- enhanced databases need to be examined carefully with regard to both their po- tentialities and their limitations for citation analysis.

In an attempt to provide a robust study, the authors in [BBGW06] compared citation counts for articles from two disciplines(oncology and condensed matter physics) and two years(1993 and 2003) to test the hypothesis that the different scholarly publication coverage provided by the three search tools WoS, Scopus and GS would lead to different citation counts from each. The result showed that

WoS returned the highest average number of citations and Scopus returned the highest number of citations for condensed matter physics in 1993 and 2003 . The data showed a significant difference in the mean citation rates between all pairs of resources, except between GS and Scopus for condensed matter physics in 2003.

For articles published in 2003, GS returned the largest amount of unique- citing material for oncology and WoS returned the largest amount for condensed matter physics.

The study [Saa06] explores the correlation between authors’ total citation counts with their h-indices, journals’ h-indices with their ISI impact factors, 106 and compares authors’ h-indices as obtained from two separate services, namely

Thompson/ISI and GS. The results show that the correlation between the Thomp- son/ISI h-index and the Thompson/ISI total number of citations was 0.87, and that the correlation between the GS h-index and the Thompson/ISI total number of ci- tations was 0.83, and that the correlation between the two h-indices was 0.82. Of the 55 comparisons, eight yielded a higher Thompson/ISI h-index, another eight yielded no difference between the two indices and the remaining 39 cases yielded a higher GS h-index. The correlation between the journals impact factors and their h-indices was found to be 0.70.

In [Buc06], the authors analysed the nature and extent of errors made by the

Science Citation Index ExpandedTM (SCIE) and the SciFinder ScholarTM (SFS) during data entry. Their analysis included 5,400 cited articles from 204 randomly selected cited-article lists published in three core chemistry journals. They discov- ered that failure to map cited articles to target-source articles was due to transcrip- tion errors, target-source article errors, omitted cited articles and reason unknown, where the mapping error rates ranged from 1.2 to 6.9 percent.

The study [MY07] examines the effects of using Scopus and GS on citation counts and the ranking of scholars as measured by WoS. Using the citations to the work of 25 library and information science as a case study, it examines more than 10,000 citing and purportedly citing documents, and provides several useful suggestions for scholars conducting citation analysis, as well as those who need assistance in compiling their own citation records. The study finds that the addi- tion of Scopus citations to those of WoS could significantly alter the rankings of authors. It also finds that GS stands out in its coverage of conference proceed- ings as well as international, non-English language journals, and indexes a wide 107 variety of document types, which may be of significant value to researchers and others. Finally, it concludes that the use of Scopus and GS, in addition to WoS, reveals a more comprehensive and accurate picture of the extent of the scholarly relationship between LIS and other fields.

The work [BILL07] introduces a set of measures for comparing rankings of different citation databases (WoS, Scopus and GS) induced by the number of cita- tions the publications receive in each database. The results show high similarities between the rankings of the ISI WoS and Scopus, and lower similarities between

GS and other tools. In his next work [BI08], the author compares the h-indices of a list of highly-cited Israeli researchers based on citation counts retrieved from the three databases. The results obtained through GS are found to be considerably dif- ferent from the results based on the WoS and Scopus, and the author recommends the need for data cleaning to achieve a fair analysis.

Based on previous methods, the study [KT07] uses a combined Web and URL citation method to collect a wide range of data. To a sample of 1,650 articles from

108 OA journals, it compares the correlations between ISI and GS across mul- tiple disciplines (biology, chemistry, physics, computing, sociology, economics, psychology and education). The study finds a large disciplinary difference in the percentage overlap between ISI and GS citation sources. It also finds that GS is more comprehensive for social sciences and possibly when conference articles are valued and published online. It concludes that replacing traditional citation sources with the Web or GS for research impact calculations would be problem- atic.

In another study, a sample of 1,483 publications, which is representative of the scholarly production of the LIS faculty, was searched in WoS, Google, and 108

GS [VS08]. As a result, the median number of citations found through WoS was zero for all types of publications except book chapters and the median for

GS ranged from one for print/subscription journal articles to three for books and book chapters. For Google, the median number of citations ranged from nine for conference papers to 41 for books. The study finds that almost 92 per cent of the citations identified through GS represent intellectual impact (primarily citations from journal articles) and non-intellectual impact (bibliography services) are the largest single contributor of citations identified through Google. The study recom- mends GS as a promising tool for research evaluation, especially in a field where rapid and fine-grained analysis is desirable.

The paper [KT08] takes a sample of 882 articles from 39 OA ISI-indexed jour- nals in 2001 from biology, chemistry, physics and computing, and classifies the type, language, publication year and accessibility of the GS unique citing sources.

It finds that the majority of GS unique citations (70 per cent) were from full-text sources, and that there were large disciplinary differences between types of cit- ing documents, suggesting that a wide range of non-ISI citing sources, especially from non-journal documents, are accessible by GS. An important corollary drawn from this study is that the wider coverage of OA Web documents by GS is likely to give a boost to the impact of OA research and the OA movement.

The article [FPMP08] aims to compare the content coverage and practical util- ity of various databases, namely PubMed, Scopus, WoS and GS. It uses the exam- ple of a keyword search to evaluate the usefulness of these databases in biomed- ical information retrieval and a specific published article to evaluate their utility in performing citation analysis. It finds PubMed to be an optimal tool in biomed- ical electronic research and states that Scopus covers a wider journal range and 109 offers the capability for citation analysis limited to recent articles (published after

1995) compared with WoS, and GS can help in the retrieval of the most oblique information, but is marred by inadequate, less often updated citation information.

The paper [Jac08b] focuses on the practical limitations in the content and soft- ware of the databases that are used to calculate the h-index for assessing the pub- lishing productivity and impact of researchers. The author discusses the related features of GS, Scopus and WoS, and demonstrates how a much more realistic and fair h-index can be computed for F. W. Lancaster than the one produced automat- ically. His work [Jac08a] aims to analyse the pros and cons of the three databases for determining the h-index. From his findings, the h-index, developed by Jorge E.

Hirsch to quantify the scientific output of researchers, received well-deserved at- tention by researchers. Many of them also recommended derivative metrics based on Hirsch’s idea to compensate for potential distortion factors, such as high self- citation rates and the need for scrutiny, as the content and software characteristics of reference-enhanced databases could strongly influence the h-index values. He further uses Scopus and Thomson-Reuters’ databases [Jac09a] to explore the ex- tent of the absence of data elements that are critical from the perspective of a scientometric evaluation of the scientific productivity and effect of countries in terms of the most common indicators. These indicators include the number of publications, the number of citations and the ratio of citations received to papers published, as well as the effect these may have on the h-index of countries. He

finds that the presence of the language field is comparable between the Thomson-

Reuters and Scopus databases, and that the rate of presence of the subject category

field is better in Scopus, even though it has far fewer subject categories than the

Thomson-Reuters databases. However, the omission rate of country identification 110 hurts its impressive author-identification feature. He further [Jac09b] uses the two citation databases, Scopus and WoS, to discuss the results of experiments in de- termining the h-index at the country level for the 10 Ibero-American countries of

South America. The results show that, despite the significant differences in the content of the two databases in terms of their source base and the extent of cited reference enhancement of records, the rank correlation of the 10 countries based on the h-index values returned by WoS and Scopus is very high.

The study [MR08] examines the differences between Scopus and WoS in the citation counting, citation ranking and h-index of 22 top human-computer interac- tion (HCI) researchers from EQUATOR, which is a large British Interdisciplinary

Research Collaboration project. The results indicate that Scopus provides signif- icantly more coverage of HCI literature than WoS, and no significant differences exist between the two databases if citations in journals only are compared. Sco- pus also generates significantly different maps of citation networks of individual scholars than those generated by WoS. The study also presents a comparison of h-index scores based on GS with those based on the union of Scopus and WoS.

It concludes that Scopus can be used as a sole data source for citation-based re- search and evaluation in HCI, especially when citations in conference proceedings are sought.

To test the extent to which the use of the freely available database GS can be expected to yield valid citation counts in the field of chemistry, the work [BWM-

RAT09] examines a comprehensive set of 1,837 papers that were accepted for publication by the Angewandte Chemie International Edition (one of the prime chemistry journals in the world) or that are rejected by the journal but then pub- lished elsewhere. Analyses of citations for the set of papers returned by three fee- 111 based databases (Science Citation Index, Scopus, and Chemical Abstracts) were then compared to the analysis of citations found using GS data. The study finds that citations returned by the three fee-based databases shows similar results, and the results of the analysis using GS citation data differ greatly from the findings from the fee-based databases. Therefore, the study supports both the convergent validity of citation analyses based on data from the fee-based databases and the lack of convergent validity of the citation analysis based on the GS data.

The paper [MS09] uses citations from 1996–2007 to the work of 80 randomly selected full-time information studies (IS) faculty members from North Amer- ica to assess differences between Scopus and WoS. Results show that, when the assessment is limited to smaller citing entities (e.g., journals, conference proceed- ings and institutions), the two databases produce considerably different results, whereas when the assessment is limited to larger entities (e.g., research domains and countries), the two databases produce similar pictures of scholarly impact.

The authors in [KASW09] compare the citation-count profiles of articles pub- lished in general medical journals among the citation databases of WoS, Scopus and GS. Using these databases, they retrieve total citation counts for 328 articles published in JAMA, Lancet, or the New England Journal of Medicine and anal- yse the article characteristics in a linear regression model. The results showe that, the databases produce quantitatively and qualitatively different citation counts for articles published in three general medical journals. GS and Scopus retrieve more citations per article respectively, than WoS. Compared with WoS, Scopus retrieves more citations from nonEnglish-language sources and fewer citations from arti- cles, editorials and letters. GS has significantly fewer citations to industry funding and group-authored articles. 112

The article [LIdMAM09] addresses the robustness of country-by-country rankings according to the number of published articles and their average cita- tion impact in the field of oncology. It compares rankings based on bibliometric indicators derived from the WoS with those calculated from Scopus. It finds that the oncological journals in Scopus that are not covered by WoS tend to be na- tionally oriented journals, and discusses its implications for the construction of bibliometric indicators.

The study [HvdW09] conducts a systematic comparison between the GS h- index and the ISI Journal Impact Factor for a sample of 838 journals in Economics

& Business and shows that there is substantial agreement between the ISI Journal

Impact Factor and the GS h-index for most sub-disciplines, as well as for those sub-disciplines that have limited ISI coverage (Finance & Accounting, Marketing, and General Management & Strategy).

The authors in [ACHVH09] pay special attention to the works devoted to using the h-index and its variations to compare scientists from different research areas.

They review some works that compare the h-index and related indices when they are computed using different bibliography databases, mainly focusing the use of

WoS, Scopus and GS. Their review concludes that the h-index is quite dependent on the database used and that it is much more difficult to compute those indices using

GS than WoS or Scopus.

The article [LBLB10] compares the citation analysis potential of four databases: WoS, Scopus, SciFinder and GS. It finds that WoS provides cover- age back to 1900, whereas Scopus only has completed citation information from

1996 onwards and provides better coverage of clinical medicine and nursing than

WoS. It also discovers that SciFinder has the strongest coverage of chemistry and 113 the natural sciences, while GS has the capability to link citation information to individual references. It concludes that, although Scopus and WoS provide com- prehensive citation reports, all databases miss linking to some references included in other databases.

The authors in [JPi10] aim to study the qualitative and quantitative differences in the received citation counts for the Serbian Dental Journal (SDJ) found in WoS,

Scopus and GS. They experiment with 158 papers from SDJ and 249 received citations found in the three analysed databases. They find that the greatest number of citations (189) derives from GS, while only 15 per cent of the citations are found in all three databases. They also find a significant difference in the percentage of unique citations in the databases of GS (58 per cent), Scopus (6 per cent) and

WoS (4 per cent). The highest percentage of database overlap is between WoS and Scopus (70 per cent), while the overlap between Scopus and GS is 18 per cent and 17 per cent between WoS and GS. As none of the examined databases can provide a comprehensive picture, the study recommends taking into account of all three available sources.

The author in [BI10] examines GS, Scopus and WoS through the citations of Introduction to informetrics by Leo Egghe and Ronald Rousseau, and find that

Scopus citations are comparable to WoS citations when limiting the citation period to 1996 onwards. GS misses about 30 per cent of the citations covered by Scopus and WoS (90 citations), but another 108 citations located by GS are not covered either by Scopus or by WoS. The author also mentions that, although GS is not very user friendly as a bibliometric data collection tool at this point, it performs considerably better than reported in previous studies.

The article [SUP10] analyses the citation count of articles published by the

Croatian Medical Journal in 2005-2006 based on data from the WoS, Scopus and

GS. As a result, GS returns the greatest proportion of articles with citations (45 per cent), followed by Scopus (42 per cent) and WoS (38 per cent). Almost half (49 per cent) of the articles have no citations, and 11 per cent have an equal number of identical citations in all three databases. The greatest overlap is found between

WoS and Scopus (54 per cent), followed by Scopus and GS (51 per cent), and

WoS and GS (44 per cent). The greatest number of unique citations is found by

GS, where the majority of citations (64 per cent) come from journals, followed by books and PhD theses. Approximately 55 per cent of all citing documents are open access (OA) full-text resources, and the language of citing documents is mostly English, while 29 per cent of citing documents are in Chinese. From these results, GS is believed to serve as an alternative bibliometric tool that provides an orientational insight into citations.

The paper [Fra10] provides a case study of computer science scholars and journals evaluated on the WoS and GS databases. It concludes that GS computes significantly higher indicator scores than WoS, that citation-based rankings of both scholars and journals do not change significantly across the two data sources, and that rankings based on the h-index show a moderate degree of variation.

In order to measure the degree to which GS can compete with bibliographical databases, search results from GS are compared with those from WoS [Mik10]. For impact measures, the h-index is investigated, and the ranks of records displayed in GS and WoS are compared by means of Spearman's footrule. The results show significant similarities in the measures for the two sources.

To investigate whether the h-index can be reliably computed through alternative sources of citation records, WoS, PsycINFO and GS are used to collect citation records for known publications of four Spanish psychologists [GP10]. Compared with WoS, PsycINFO includes a larger percentage of publication records, whereas GS outperforms both WoS and PsycINFO. Compared with WoS, PsycINFO retrieves a larger number of citations in unique areas of psychology, but it retrieves a smaller number of citations in areas that are close to statistics or the neurosciences, whereas GS retrieves the largest number of citations in all cases. Incorrect citations are scarce in WoS (0.3 per cent), more prevalent in PsycINFO (1.1 per cent) and overwhelming in GS (16.5 per cent). All platforms retrieve unique citations, with the largest set coming from GS. WoS and PsycINFO cover distinct areas of psychology unevenly, thus imposing different penalties on the h-index of researchers working in different fields. To obtain fair and accurate h-indices, the author suggests the need for a union of the citations retrieved by all three platforms.

4.4 Google Scholar (GS)

In November 2004, Google, the producer of the most popular Web search engine, introduced GS in beta version: a freely available service that uses Google's crawler to index the content of scholarly material and adds citation counts to raise or lower individual articles in the rankings of a result set [BB05].

GS uses a matching algorithm to look for keyword search terms in the title, abstract or full text of an article from multiple publishers and websites, where the number of times a journal article, book chapter or website is cited also plays an important part [BBGW06]. Unlike most scholarly research databases, it looks beyond the journal literature: it claims to include peer-reviewed papers, theses, books, abstracts and articles from academic publishers, professional societies, pre-print repositories, universities and other scholarly organisations, and it aims to rank documents the way researchers do, weighing the full text of each document, where it was published and who it was written by, as well as how often and how recently it has been cited in other scholarly literature 3. It covers a large number of documents by crawling the Web automatically, but it also includes papers from several digital libraries, including those from ACM, IEEE and Springer. GS automatically extracts the bibliography data from the reference sections of the documents (mostly in PDF and PS formats) and determines citation counts for papers in its collections, as well as for citations for which the document itself is not available.

Figure 4.1 shows parts of the GS search interface and a section of the search results for a search on the article title “Edjoin: an efficient algorithm for similarity joins with edit distance constraints”. By default, GS provides the title of and a hyperlink to the document, the authors, the publication year, information about the source, the citation count, and a hyperlink to the citing documents.

Figure 4.2 shows a section of results of the citing link of the same article.

The publications in the query result are typically ranked according to the number of citations, displaying the most-cited and highly relevant articles at the top of the result set [RT05]. The search interface of GS is simple and easy to use; it allows a quick search and an advanced search. In the advanced search, the results can be limited by title words, authors, source, date of publication and subject areas. The languages of the interface and of the search are optional. The results can be displayed as a listing of 10–100 items per page. Each retrieved article is represented by title, authors and source, but the abstract and information on free full-text availability are not provided by GS. Under each retrieved article, the number of citing articles is noted and these can be retrieved by clicking on the relevant link. Clicking on the article title leads to a list of possible links to the article, usually on the journal's site. Users are able to view the full text only of those items that are available free and those to which their libraries subscribe. In addition, GS provides links to relevant articles and allows for a general Google Web search using self-selected keywords from the article and the author name [FPMP08].

3 http://scholar.google.com/intl/en/scholar/about.html

GS is an important service for those who do not have access to expensive multidisciplinary databases such as the Thomson Reuters Science Citation Index or Scopus. It is potentially a powerful new tool for citation analysis, particularly in subject areas that have experienced rapid changes in communication and publishing.

Figure 4.1: GS Search Results for an Article Title

Figure 4.2: GS Results from Citing Link of an Article

4.5 Scopus

The largest abstract and citation database, Scopus, was launched in November

2004. It contains both peer-reviewed research literature and quality Web sources, and offers researchers a quick, easy and comprehensive resource to support their research needs in the scientific, technical, medical and social sciences fields and, more recently, also in the arts and humanities 4. It also offers powerful browsing, searching, sorting and saving features, as well as export to citation management software. Coverage in Scopus goes back to 1966 for bibliography records and abstracts and to 1996 for citations. It provides various types of searches, including quick, basic and advanced searches, and searches by author or source.

Figure 4.3 shows a portion of the search interface provided by Scopus and Figure 4.4 shows the author search results for Elio Masciari.

In the basic search, the results for the chosen keywords can be limited by date of publication, by addition to Scopus, by document type and by subject areas, whereas the author search is based only on author names. The advanced search combines the basic search without the limits and the author search, and it allows more operators and codes. The source search is confined to the selection of a subject area, a source type, a source title, the ISSN number and the publisher. Subject areas covered in Scopus include chemistry, physics, mathematics and engineering, life and health sciences, arts and humanities, social sciences, psychology, economics, biology, agriculture, environmental sciences and general sciences [YM06]. The search results in Scopus can be displayed as a listing of 20–200 items per page, and documents can be saved to a list and/or can be exported, printed or e-mailed. The results can be refined by source title, author name, year of publication, document type and/or subject area, and a new search can be initiated within the results. The presence of an abstract, references and free full text is noted under each article title, in addition to where these can be found.

4 http://info.scopus.com/scopus-in-detail/facts/

When abstracts are displayed, the keywords are highlighted. The fields that can be included in the output are optional (i.e. citation information and bibliographical information of citations). In addition, Scopus has search tips written in 10 languages [FPMP08].

Beyond the generic cited reference index, Scopus has separate indexes for cited author, year, title, source and pages. These are comprehensive options for citation searching, and they facilitate the performance of citation analysis using different approaches. Scopus also offers Related Documents, which returns a list of documents that share cited references with the currently selected document.

Through its refinement option, Scopus provides an overview of search results according to author, publication year, journal title, document type and subject area. Scopus Citation Tracker further enhances citation analysis by enabling citations to be viewed by year, providing users with a powerful way to explore citation data over time. The Scopus Citation Tracker tabulates citation data and shows how often the individual documents have been cited in individual years, as well as an overall total. If necessary, the tabular representation can be exported as a text file [ND06].

Figure 4.3: Scopus Search Interface

Figure 4.4: Results of Scopus Author Search

4.6 Data Collection

In order to perform a fair comparative citation analysis, a list of 10 authors was randomly selected from the DBLP XML file 5 and is shown in Table 4.1. It can be seen that the randomly selected authors come from various geographically scattered regions (i.e., countries) and have varying numbers of publications, as shown in Tables 4.2 and 4.3, providing fairness in the data set used to conduct the citation analysis.

Author Name             Affiliation                                   City         Country
Elio Masciari           National Research Council                     Rome         Italy
Guido Schäfer           Vrije Universiteit Amsterdam                  Amsterdam    Netherlands
Luis Baumela            Technical University of Madrid                Madrid       Spain
Patrick Marais          University of Cape Town                       Cape Town    South Africa
Pernilla Qvarfordt      FX Palo Alto Laboratory                       Palo Alto    USA
Rajib Mall              Indian Institute of Technology, Kharagpur     Kharagpur    India
Sunita Sarawagi         Indian Institute of Technology, Bombay        Mumbai       India
Tai-Lung Chen           Chung Hua University                          Hsin-chu     Taiwan
Timothy C. Lethbridge   University of Ottawa                          Ottawa       Canada
Yoshiharu Ishikawa      Nagoya University                             Nagoya       Japan

Table 4.1: Author Information

For computer science researchers, the DBLP Web site is a popular tool to trace the work of colleagues and to retrieve bibliography details when composing the lists of references for new papers [Ley09]. The DBLP is a large archive of publication records that documents the papers published in a range of computer science conferences and journals, as well as some workshops and symposiums.

5 http://dblp.uni-trier.de/xml/

Figure 4.5 shows a snapshot of DBLP’s page for Elio Masciari. The bibliography records are contained in a large XML file that can be downloaded. Although

DBLP does not provide any citation data, it can be used to provide a suitable list of seed articles for analysis.
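As a rough illustration of this step, the sketch below (in Python) streams through a local copy of the downloaded dblp.xml dump and collects the records of a single author as a seed list. The record tags and child elements follow the publicly documented DBLP format, but the function name and the handling of the DTD-defined character entities are assumptions made for illustration only, not the exact implementation used in this thesis.

```python
import xml.etree.ElementTree as ET

# Record types used by DBLP for regular publications.
PUB_TAGS = {"article", "inproceedings", "incollection", "book", "phdthesis"}

def publications_of(author_name, dblp_path="dblp.xml"):
    """Stream the large DBLP dump and collect (title, year, key) tuples
    for every record that lists the given author.

    Note: dblp.xml declares special characters via an external DTD
    (dblp.dtd); depending on the parser configuration these entities
    may need to be resolved before parsing.
    """
    pubs = []
    for _, elem in ET.iterparse(dblp_path, events=("end",)):
        if elem.tag in PUB_TAGS:
            authors = [a.text for a in elem.findall("author") if a.text]
            if author_name in authors:
                title = (elem.findtext("title") or "").strip()
                pubs.append((title, elem.findtext("year"), elem.get("key")))
            elem.clear()  # release memory for processed records
    return pubs

# Example: seed list for one of the selected authors.
# seeds = publications_of("Elio Masciari")
```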

The citation data for all selected authors were manually collected from Scopus by performing an author search on Scopus and then storing the data in comma-separated value (CSV) files.

A major disadvantage of GS is that its records are retrieved in a way that is impractical for use with large sample sizes and that requires a tedious process of manually extracting, verifying, cleaning, organising, classifying and saving the bibliography information into a meaningful and usable format [MY07]. For instance, GS frequently has several entries for the same paper, due to, for example, misspelled author names or different orderings of authors. Conversely, GS may group together citations of different papers, for example a journal and a conference version of a paper with the same or a similar title. It may also count a citation published in two different forms (such as a pre-print and a journal article) as two citations, and inflate citation counts through the inclusion of non-scholarly sources (e.g., theses, pre-prints and technical reports). Finally, it lacks information about the document type, document language, document length and the refereed status of the retrieved citations.

However, access to the necessary bibliography data using GS provides a way to perform a large-scale citation analysis of authors, conferences and journals in various disciplines. Given an article, it provides its citation count and full details on the other articles that cite it. All that is required is a suitable list of articles to seed the analysis, which is where a service such as the DBLP comes into play.

To collect data from GS, a semi-automatic tool was developed to map each paper found in DBLP for a given author to GS to determine its citations. For a given list of publications from DBLP, the tool first fetches the list of citations from

GS for each publication. In the second step, it performs extensive data cleaning to deal with errors in the citations and limitations of the automatic extraction of references in GS. In addition to dealing with these issues, the developed tool also detects citations from technical reports, theses and pre-prints, and removes these citations from the analysis. For articles available in Scopus, but missed in DBLP, an exact match search (including title, authors and venue) was performed on GS to retrieve the cited articles. The same process was performed on Scopus for articles in DBLP that do not appear on the author's list of published articles.

To measure the effectiveness of the proposed data-cleaning tool, a sample list of 300 citations was manually verified to obtain a list of cleaned citations. The list was then compared to the list of cleaned citations obtained from the proposed tool. The accuracy was found to be 98.8 per cent, which is reasonable.

Figure 4.5: DBLP Search Results for Author Elio Masciari

4.7 Experimental Analysis

This section presents the comparative analysis of citations returned from GS and Scopus.

4.7.1 Citation Counts

Tables 4.2 and 4.3 display the descriptive statistics of the citation counts from the two databases. As shown, for all authors, Scopus fails to obtain some publication information that is obtainable in GS. Overall, 26 per cent of the publications are missing from Scopus but available in GS.

For all authors, GS returns the highest total number of citations and the highest average number of citations. The total citations for the top 10 and the top five publications in GS are also much higher than the corresponding totals in Scopus. As expected, the average citation counts of all authors from GS are several times larger than those from Scopus. As shown in Table 4.4, this is also the case for the standard deviations.

Since the citation counts are highly skewed, a t-test is performed under the hypothesis that the different scholarly publication coverage provided by the two tools will lead to different citation counts. It begins with the following hypotheses:

Ho: There is no difference between the citation counts extracted from the two resources.

Ha: A difference exists between the citation counts extracted from the two resources.

Table 4.4 displays the summary results. The data show a significant difference in the mean citation rates between the databases. A pair-wise comparison using Student's t-test is performed on the database pair. Based on this investigation, there is a statistically significant difference in citation counts between the two databases (p < 0.05).
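The exact test setup is not spelled out beyond the description above; the following minimal sketch (assuming SciPy is available) reproduces one plausible reading of it, a paired Student's t-test on the per-author total citation counts taken from Tables 4.2 and 4.3.

```python
from scipy import stats

# Per-author total citation counts from Tables 4.2 (GS) and 4.3 (Scopus),
# listed in the same author order.
gs_totals     = [386, 394, 296, 165, 234, 380, 3522, 23, 1797, 860]
scopus_totals = [103, 121, 104, 56, 67, 173, 545, 14, 491, 75]

# Paired t-test on the matched per-author counts.
t_stat, p_value = stats.ttest_rel(gs_totals, scopus_totals)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant difference
```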

Author                  Total          Total       Average     Total Citations for   Total Citations for
                        Publications   Citations   Citations   Top 10 Publications   Top 5 Publications
Elio Masciari           39             386         9           342                   284
Guido Schäfer           35             394         11          261                   172
Luis Baumela            37             296         8           251                   186
Patrick Marais          25             165         6           158                   128
Pernilla Qvarfordt      16             234         14          226                   187
Rajib Mall              84             380         4           189                   125
Sunita Sarawagi         67             3522        52          2201                  1449
Tai-Lung Chen           15             23          1           23                    23
Timothy C. Lethbridge   81             1797        22          998                   626
Yoshiharu Ishikawa      51             860         16          756                   702

Table 4.2: Statistics of Citations in GS

Author                  Total          Total       Average     Total Citations for   Total Citations for
                        Publications   Citations   Citations   Top 10 Publications   Top 5 Publications
Elio Masciari           22             103         4           101                   84
Guido Schäfer           28             121         4           101                   67
Luis Baumela            30             104         3           100                   90
Patrick Marais          23             56          2           55                    45
Pernilla Qvarfordt      8              67          8           67                    67
Rajib Mall              64             173         2           113                   84
Sunita Sarawagi         41             545         13          450                   318
Tai-Lung Chen           14             14          1           14                    14
Timothy C. Lethbridge   65             491         7           347                   228
Yoshiharu Ishikawa      30             75          2           72                    57

Table 4.3: Statistics of Citations in Scopus

Database   Mean     Std Dev   Variance
GS         805.70   1080.15   1166725.12
Scopus     174.90   186.08    34627.43

Table 4.4: Mean, Std Dev and Variance of Citation Counts

Figure 4.6: Distribution of Citations Over Years (total citations per year for GS and Scopus, 1996–2010)

4.7.2 Growth in Citations over the Years

This section studies the distribution of citations across different years. As year information is not available for all citations provided by GS, only those citations with year information are considered in this study. As Scopus only provides citation information published from 1996 onwards, the study compares the distribution of citations from 1996 for both databases.

Figure 4.6 shows the distribution of citations published in different years. It can be seen that both Scopus and GS show the same trend in the number of citations published (i.e. the number of citations increases in subsequent years).

Figure 4.7 groups the published citations into five-year periods. As expected, for both databases, more citations are available for the past five years than for the preceding 10 years.
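The grouping behind Figure 4.7 is a simple binning of citation years; a minimal sketch is shown below, assuming the citation years have already been extracted into a list (the function name and bin labels are illustrative only).

```python
from collections import Counter

PERIODS = [("1996-2000", 1996, 2000),
           ("2001-2005", 2001, 2005),
           ("2006-2011", 2006, 2011)]

def group_by_period(citation_years):
    """Count citations falling into the three periods used in Figure 4.7."""
    counts = Counter()
    for year in citation_years:
        for label, lo, hi in PERIODS:
            if lo <= year <= hi:
                counts[label] += 1
                break
    return counts

# Example with a hypothetical list of citation years:
# print(group_by_period([1997, 2003, 2008, 2009, 2011]))
```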

Figure 4.7: Grouping Distribution of Citations (total citations per period, 1996–2000, 2001–2005 and 2006–2011, for GS and Scopus)

4.7.3 h-index, g-index and hg-index

4.7.3.1 h-index

Originally presented by Hirsch [Hir05], the h-index is an index that attempts to measure both the productivity and impact of the published work of a scholar.

The index is based on the set of the scientist’s most-cited papers and the number of citations received in other publications: A scientist has index h if h of his or her Np papers have at least h citations each, and the other (Np-h) papers have no more than h citations each.
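The definition translates directly into a few lines of code; the following sketch is a straightforward reading of it, not the implementation used in the thesis.

```python
def h_index(citations):
    """h-index: the largest h such that h papers have at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

# Five papers with these citation counts give an h-index of 3.
assert h_index([10, 8, 5, 2, 1]) == 3
```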

One of the main advantages of the h-index is that it combines both the quantity (publications) and the effect (citations) of the author's papers, and it performs better than other single-number criteria (e.g., total number of documents and citations) commonly used to evaluate a researcher's scientific output. Another advantage of this indicator is that it is simple to compute from the citation data available through the WoS of the ISI Web of Knowledge 6. The h-index has been proven robust in the sense that it is insensitive to a set of lowly cited papers [Van07].

However, it has received much criticism [BW11, Wen07, Sek08] for disadvantaging newcomers, since their publication and citation rates are relatively low. The h-index does not account for the number of authors of a paper or for the information contained in author placement, which is significant in some scientific fields. Although it is useful for comparing the best scientists, its power for distinguishing between average scientists is limited. Despite being based on the number of citations received, it lacks sensitivity to performance changes. As a result, it allows scientists to rest on their laurels, since the number of citations received may increase even if no new paper is published.

4.7.3.2 g-index

Proposed by Leo Egghe [Egg06], the g-index aims to improve on the h-index by giving more weight to highly cited articles. The g-index attempts to address the shortcomings of the h-index and is calculated based on the distribution of citations received by a given researcher's publications: Given a set of articles ranked in decreasing order of the number of citations they received, the g-index is the (unique) largest number such that the top g articles received (together) at least g² citations.

In simple terms, this means that an author with a g-index of n has received, on average, at least n citations for each of his or her top n articles. In this way, it is similar to the h-index, with the difference that the number of citations per individual article is not made explicit.

6 http://www.isiwebofknowledge.com/

It is easy to prove that g ≥ h [Egg06]. However, although the g-index is successful in evaluating the production of a researcher by incorporating the actual citations of his or her papers, it has the drawback of being greatly influenced by a single very successful paper [ACHVH10].
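As with the h-index, the definition above can be computed directly from the sorted citation counts; the sketch below is one straightforward reading of it, not the thesis's own implementation.

```python
def g_index(citations):
    """g-index: the largest g such that the top g papers together
    received at least g*g citations."""
    cites = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(cites, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

# Cumulative sums 10, 18, 23, 25, 26 against 1, 4, 9, 16, 25 give g = 5.
assert g_index([10, 8, 5, 2, 1]) == 5
```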

4.7.3.3 hg-index

The paper [ACHVH10] presents a combined index, the hg-index, which tries to fuse together the benefits of both previous measures while minimising the drawbacks that each one presents. The hg-index of a researcher is computed as the geometric mean of his or her h- and g-indices, that is, hg = √(h · g).

It is trivial to demonstrate that the hg-index corresponds to a value nearer to the h-index than to the g-index, thus avoiding the problem of the large influence that a very successful paper can have on the g-index. Some additional benefits of this new index are that:

• it is very simple to compute once the h- and g-indices have been obtained;

• it provides more granularity than the h- and g-indices and is easy to understand and compare with existing indices;

• it takes into account the number of citations of highly cited papers (the h-index is insensitive to highly cited papers), but it significantly reduces the impact of single very highly cited papers (a drawback of the g-index), thus achieving a better balance between the impact of the majority of the author's best papers and of the very highly cited papers.
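As a minimal check of the formula hg = √(h · g) given above, the following snippet reproduces one GS entry of Table 4.5 from its h- and g-index values.

```python
import math

def hg_index(h, g):
    """hg-index: geometric mean of the h- and g-indices."""
    return math.sqrt(h * g)

# Elio Masciari (GS) in Table 4.5: h = 9, g = 19.
print(round(hg_index(9, 19), 2))  # 13.08
```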

Table 4.5 depicts the h-index, g-index and hg-index of all authors considered.

The results show that, for all authors, GS provides higher values for all three indices. These findings show that it matters which citation tool is used to compute the h-index, g-index or hg-index of scientists. In addition, there seem to be disciplinary differences in the coverage of the databases, which creates a dilemma for policy makers and promotion committees. The study recommends further studies to explore the relative strengths and weaknesses of the currently available citation tools. It also recommends further exploration of the capabilities and limitations of GS, especially the citing items.

                        GS                               Scopus
Author                  h-index   g-index   hg-index     h-index   g-index   hg-index
Elio Masciari           9         19        13.08        5         10        7.07
Guido Schäfer           11        18        14.07        7         10        8.37
Luis Baumela            8         16        11.31        5         10        7.07
Patrick Marais          7         12        9.17         4         7         5.29
Pernilla Qvarfordt      7         15        10.25        4         8         5.66
Rajib Mall              11        15        12.85        6         10        7.75
Sunita Sarawagi         29        59        41.36        13        23        17.29
Tai-Lung Chen           3         4         3.46         2         3         2.45
Timothy C. Lethbridge   23        41        30.71        12        21        15.87
Yoshiharu Ishikawa      9         29        16.16        4         8         5.66

Table 4.5: Measuring h-index, g-index and hg-index

4.7.4 Overlap and Uniqueness of Citing References

In the next step, a list of top cited articles is gathered for all authors, and the citing references for the respective articles are analysed for uniqueness. For these top cited articles, an automated matching algorithm is developed to identify the overlap and uniqueness of citing references between GS and Scopus. For each article, the algorithm divides its citing references into three groups (a sketch of this matching step is given after the list below):

1. overlapping citations from GS and Scopus

2. unique citations from GS

3. unique citations from Scopus.
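The exact matching criterion of the developed algorithm is not detailed here; the sketch below shows one simple way such a categorisation could be performed, by comparing normalised title strings of the citing references. The record format and the normalisation rule are illustrative assumptions only.

```python
import re

def normalise(title):
    """Crude normalisation: lower-case the title and keep alphanumerics only."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def categorise(gs_refs, scopus_refs):
    """Split citing-reference titles into the three groups used in Table 4.6."""
    gs_keys = {normalise(t) for t in gs_refs}
    sc_keys = {normalise(t) for t in scopus_refs}
    return {
        "common": gs_keys & sc_keys,
        "unique_gs": gs_keys - sc_keys,
        "unique_scopus": sc_keys - gs_keys,
    }

# Example with two hypothetical citing titles that differ only in punctuation:
# categorise(["A Survey of Record Linkage."], ["A survey of record linkage"])
```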

Following the work in [BBGW06], to test the accuracy rate of the matching algorithm, a sample set of 200 citing references is checked manually to determine whether the algorithm has placed each citing reference in the correct category (of the three listed above). A citing article is marked as an error if it is not placed in the correct category by the matching algorithm. The matching algorithm is found to categorise the citing articles with an accuracy rate of 99.1 per cent. With this acceptable accuracy, the matching algorithm is then used to categorise the citing references of all top cited articles.

Table 4.6 shows the distribution of the unique and overlapping references returned by the algorithm. As shown, for all articles, GS provides a greater number of unique citing references than Scopus, and it also produces higher citation counts. Overall, GS provides 70 per cent of unique citations, whereas Scopus provides only 20 per cent of unique citations. From further investigation, the unique citations from GS contain 0.14 per cent of citations in languages other than English, and 0.02 per cent of citations are from books and book chapters.

Article Title                                                                                                Unique to GS   Unique to Scopus   Common to GS and Scopus
Fast detection of XML structural similarity                                                                  45             5                  39
A group-strategyproof mechanism for Steiner forests                                                          29             1                  17
Determining the egomotion of an uncalibrated camera from instantaneous optical flow                          36             4                  21
Animation space: A truly linear framework for character animation                                            15             2                  18
Conversing with the user based on eye-gaze patterns                                                          47             0                  29
Load balanced routing in mobile ad hoc networks                                                              10             2                  24
Modeling multidimensional databases                                                                          413            7                  63
Performance effective pre-scheduling strategy for heterogeneous grid systems in the master slave paradigm    3              2                  7
What knowledge is important to a software professional?                                                      106            10                 49
Evaluation of signature files as set access facilities in OODBs                                              81             4                  31

Table 4.6: Distribution of Unique and Overlapping Citations

4.8 Conclusions

The study underlines the usefulness of GS for scholars conducting citation analysis, and for those who need assistance in compiling their own citation records. GS helps to identify a considerable number of citations not found in Scopus, thus significantly increasing bibliometric indices (e.g., the h-index). The study again suggests the need for data preparation and data cleaning to achieve useful and correct results, and informs researchers of the value of using multiple sources for identifying citations to authors, papers and journals. It develops a semi-automatic tool, with an accuracy of 98.8 per cent, to ease the tedious and time-consuming task of cleaning the errors from GS citation results. The experiments conducted on the two databases (GS and Scopus) show that GS offers invaluable help in collecting citations not only from scholarly sources not covered by subscription databases, but also from journals covered by subscription databases. These could be useful in showing evidence of the broader international impact of a publication. The study also finds a number of publications (26 per cent) that are missing in Scopus but are available in GS. These results suggest that GS citations may provide a more equitable measure of impact across the discipline than Scopus citations. In addition, as GS is free, it can be useful for performing citation analysis where subscription-based tools are not available.

Chapter 5

Conclusions and Future Work

This thesis mainly emphasises text mining approaches for dealing with data quality problems. It further outlines the major steps for data transformation, data cleaning and clustering, and proposes new techniques to handle data from single and multiple sources.

Chapter 2 tackles the problem of automatically clustering search results returned by the popular, freely accessible Web search engine Google Scholar. Two key issues are identified: the need for a good similarity function to reflect the similarities between research publications, and the need for a clustering algorithm that is robust to short documents and noise. These issues are tackled by utilising domain information in a new similarity function and by implementing an outlier-conscious algorithm to generate the required number of clusters. The experimental results suggest that domain knowledge can be useful in improving the quality of the clusters and in detecting outliers. In addition, clustering the snippets returned by the search engine is a reasonable and speedy alternative to downloading the original documents.

Future work will investigate alternative methods of extracting useful domain knowledge. One interesting direction could be the use of citation information. There is also a plan to experiment with datasets in various other domains, perform large-scale evaluations of the proposed system and obtain user feedback to further improve the proposed system.

Chapter 3 focuses on record linkage approaches for complex co-referent records that have little surface similarity (e.g. VLDB and Very Large Databases). Transformation rules are incorporated to recognise these domain-specific equivalence relationships. These rules are obtained by proposing an effective and efficient top-k transformation rule learning algorithm. The algorithm is based on performing a careful local alignment of the input co-referent record pairs and generating candidate rules based on the optimal local alignment. Compared with the global greedy algorithm in [ACK09], the proposed algorithm generates fewer candidate rules, and the generated candidate rules are more likely to be correct. Extensive experiments are performed using several publicly available real-world datasets. The results show an increase of three times in the percentage of correct rules compared with the previous approach, and a speed-up in efficiency of up to 300 times.

Future work may investigate the impact of the partitioning method on the quality of the rules. Another direction is to explore more efficient data structures and algorithms to further accelerate the proposed rule learning algorithm. Finally, the proposed method will be generalised to more complex transformation rules, for example, rules containing gaps or variables.

Chapter 4 suggests the need for data preparation and data cleaning to achieve useful and correct results. It informs researchers of the value of using multiple sources to identify citations to authors, papers and journals. It develops a semi-automatic tool, with an accuracy of 98.8 per cent, to ease the tedious and time-consuming task of cleaning the errors from Google Scholar citation results. In addition, the tool is able to classify the citations into proper fields (e.g. author, source and venue), in order to make them suitable for conducting citation analysis.

Using the two databases (Google Scholar and Scopus), it compares the scholarly performance of various researchers in computer science by using several bibliometric indicators (e.g. number of citations, citations per year and h-index). The experimental results show that Google Scholar outperforms Scopus in its coverage of the number of citations and unique citations. These results agree with recent outcomes from the literature [BI10, ŠUP10, Fra10, GP10, JPi10] obtained with various sources of citations, namely an information science book, the Croatian Medical Journal, computer science journals, Spanish psychologists and the Serbian Dental Journal. A pair-wise comparison using Student's t-test performed on the mean citation rates of the two databases shows a statistically significant difference in citation counts between them (p < 0.05). Google Scholar helps to identify a considerable number of citations not found in Scopus, thus significantly increasing the bibliometric indices (h-, g- and hg-index). The study also finds a number of publications (26 per cent) that are missing in Scopus but available in Google Scholar. These results suggest that Google Scholar citations may provide a more equitable measure of impact across the discipline than Scopus citations. As Google Scholar is free, it can be useful for performing citation analysis where subscription-based tools are not available.

Future work involves large-scale, discipline-specific tests. It is important to examine the exact composition of the citing material and to test the intrinsic quality of the citations found in the databases. It would be interesting to compare the impact of different geographic locations when conducting the citation analysis of authors, conferences and journals. This can help to analyse the performance of young scientists and scholars from different geographic locations (e.g. based on countries, institutions or continents).

It is believed that the proposed snippet clustering and transformation rule learning methods can be applied to any short text or snippets in any domain. Future work will explore this with datasets from other domains and will conduct large-scale citation analysis in various other disciplines.

Bibliography

[ACDK08] Rema Ananthanarayanan, Vijil Chenthamarakshan, Prasad M. Deshpande, and Raghuram Krishnapuram. Rule based synonyms for entity extraction from noisy text. In AND, pages 31–38, 2008.

[ACGK08] Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, and Raghav Kaushik. Incorporating string transformations in record match- ing. In SIGMOD Conference, pages 1231–1234, 2008.

[ACHVH09] Sergio Alonso, Francisco Javier Cabrerizo, Enrique Herrera- Viedma, and Francisco Herrera. h-index: A review focused in its variants, computation and standardization for different scien- tific fields. J. Informetrics, 3(4):273–289, 2009.

[ACHVH10] Sergio Alonso, Francisco Javier Cabrerizo, Enrique Herrera- Viedma, and Francisco Herrera. hg-index: a new index to char- acterize the scientific output of researchers based on the h- and g-indices. Scientometrics, 82(2):391–400, 2010.

[ACK08] Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.

[ACK09] Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Learning string transformations from examples. PVLDB, 2(1):514–525, 2009.

[AGK10] Arvind Arasu, Michaela Götz, and Raghav Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783–794, 2010.

[AKL+09] Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. Url normalization for de-duplication of web pages. In CIKM, pages 1987–1990, 2009.

[BB05] Kathleen Bauer and Nisa Bakkalbasi. An examination of citation counts in a new scholarly communication environment. D-Lib Magazine, 11(9), 2005.

[BBGW06] Nisa Bakkalbasi, Kathleen Bauer, Janis Glover, and Lei Wang. Three options for citation tracking: Google scholar, scopus and web of science. Biomedical Digital Libraries, 3, 2006.

[BBHG11] Eduardo Borges, Karin Becker, Carlos Heuser, and Renata Galante. An automatic approach for duplicate bibliographic metadata identification using classification. 2011.

[BdCdMG+11] Eduardo N. Borges, Moisés G. de Carvalho, Renata de Matos Galante, Marcos André Gonçalves, and Alberto H. F. Laender. An unsupervised heuristic-based approach for bibliographic metadata deduplication. Inf. Process. Manage., 47(5):706–718, 2011.

[BG04] Indrajit Bhattacharya and Lise Getoor. Iterative record linkage for cleaning and integration. In DMKD, pages 11–18, 2004.

[BI08] Judit Bar-Ilan. Which h-index? - a comparison of wos, scopus and google scholar. Scientometrics, 74(2):257–271, 2008.

[BI10] Judit Bar-Ilan. Citations to the "introduction to informetrics" indexed by wos, scopus and google scholar. Scientometrics, 82(3):495–506, 2010.

[BILL07] Judit Bar-Ilan, Mark Levene, and Ayelet Lin. Some measures for comparing citation databases. J. Informetrics, 1(1):26–34, 2007.

[BK10] W. Aisha Banu and P. Sheikh Abdul Kader. A hybrid context based approach for web information retrieval. International Journal of Computer Applications, 10(7):25–28, November 2010.

[BM02] Mikhail Bilenko and Raymond J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical report, University of Texas at Austin, 2002.

[BM03a] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39–48, New York, NY, USA, 2003. ACM.

[BM03b] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.

[BM03c] Mikhail Bilenko and Raymond J. Mooney. On evalu- ation and training-set construction for duplicate detection. In PROCEEDINGS OF THE KDD-2003 WORKSHOP ON DATA CLEANING, RECORD LINKAGE, AND OBJECT CONSOLIDATION, WASHINGTON DC, pages 7–12, 2003.

[BRG07] Somnath Banerjee, Krishnan Ramanathan, and Ajay Gupta. Clustering short texts using wikipedia. In SIGIR, pages 787– 788, 2007.

[Bri95] Eric Brill. Transformation-based error-driven learning and natu- ral language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, 1995.

[Buc06] Robert A. Buchanan. Accuracy of cited references: The role of citation databases. College and Research Libraries, 67(4):292– 303, 2006.

[BW11] Lutz Bornmann and Marx Werner. The h index as a research performance indicator. European Science Editing, 37(3):77–80, 2011.

[BWMRAT09] L. Bornmann, W. Marx, H. Schier, E. Rahm, A. Thor, and H.-D. Daniel. Convergent validity of bibliometric google scholar data in the field of chemistry: citation counts for papers that were accepted by angewandte chemie international edition or rejected but published elsewhere, using google scholar, science citation index, scopus, and chemical abstracts. Journal of Informetrics, 3(1):27–35, 2009.

[BYKS09a] Ziv Bar-Yossef, Idit Keidar, and Uri Schonfeld. Do not crawl in the dust: Different urls with similar text. ACM Trans. Web, 3(1):1–31, 2009.

[BYKS09b] Ziv Bar-Yossef, Idit Keidar, and Uri Schonfeld. Do not crawl in the dust: Different urls with similar text. TWEB, 3(1):3, 2009.

[BZZM10] Shunlai Bai, Wenhao Zhu, Bofeng Zhang, and Jianhua Ma. Search results clustering based on suffix array and vsm. IEEE-ACM International Conference on Green Computing and Communications and International Conference on Cyber, Physical and Social Computing, 0:852–857, 2010.

[CCG+06] Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, and Nivio Ziviani. Link-based similarity measures for the classification of web documents. JASIST, 57(2):208–221, 2006.

[CD07] Hung Chim and Xiaotie Deng. A new suffix tree similarity mea- sure for document clustering. In WWW, pages 121–130, 2007.

[CDN06] Ricardo Campos, Gaël Dias, and Célia Nunes. Wise: Hierarchical soft clustering of web page search results based on web content mining techniques. In Web Intelligence, pages 301–304, 2006.

[CKM07] Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra. Adaptive graphical approach to entity resolution. In JCDL, pages 204–213, 2007.

[CM05] Aron Culotta and Andrew McCallum. Joint deduplication of multiple record types in relational data. In CIKM, pages 257– 258, 2005.

[CR02] William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In KDD, pages 475–480, 2002.

[CRF03] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fien- berg. A comparison of string metrics for matching names and records. In In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, 2003.

[DHM05] Xin Dong, Alon Y. Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In SIGMOD Conference, pages 85–96, 2005.

[DKS08] Anirban Dasgupta, Ravi Kumar, and Amit Sasturkar. De-duping urls via rewrite rules. In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 186–194, New York, NY, USA, 2008. ACM.

[Egg06] Leo Egghe. Theory and practise of the g-index. Scientometrics, 69(1):131–152, 2006.

[EIV07] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.

[FG08] Paolo Ferragina and Antonio Gulli. A personalized search en- gine based on web-snippet hierarchical clustering. Softw., Pract. Exper., 38(2):189–225, 2008.

[FPMP08] Matthew E Falagas, Eleni I Pitsouni, George A Malietzis, and Georgios Pappas. Comparison of PubMed, scopus, web of sci- ence, and google scholar: strengths and weaknesses. The FASEB Journal: Official Publication of the Federation of American Societies for Experimental Biology, 22(2):338–342, February 2008.

[Fra10] Massimo Franceschet. A comparison of bibliometric indicators for computer science scholars and journals on web of science and google scholar. Scientometrics, 83(1):243–258, 2010.

[GBL98] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In ACM DL, pages 89– 98, 1998.

[GDV07] Fatih Gelgi, Hasan Davulcu, and Srinivas Vadrevu. Term ranking for clustering web search results. In WebDB, 2007.

[GMLG01] Abby Goodrum, Katherine W. McCain, Steve Lawrence, and C. Lee Giles. Scholarly publishing in the internet age: a citation analysis of computer science literature. Inf. Process. Manage., 37(5):661–675, 2001.

[Gon85] Teofilo F. Gonzalez. Clustering to minimize the maximum inter- cluster distance. Theor. Comput. Sci., 38:293–306, 1985.

[GP10] Miguel A. García-Pérez. Accuracy and completeness of publication and citation records in the web of science, psycinfo, and google scholar: A case study for the computation of indices in psychology. JASIST, 61(10):2070–2085, 2010.

[GPMS06] Filippo Geraci, Marco Pellegrini, Marco Maggini, and Fabrizio Sebastiani. Cluster generation and cluster labelling for web snip- pets: A fast and accurate hierarchical solution. In SPIRE, pages 25–36, 2006.

[GPPS06] Filippo Geraci, Marco Pellegrini, Paolo Pisati, and Fabrizio Se- bastiani. A scalable algorithm for high-quality clustering of web snippets. In SAC, pages 1058–1062, 2006.

[HAMA06] Joseph Hassell, Boanerges Aleman-Meza, and Ismailcem Budak Arpinar. Ontology-driven automatic entity disambiguation in un- structured text. In International Semantic Web Conference, pages 44–57, 2006.

[Hea06] Marti A. Hearst. Clustering versus faceted categories for infor- mation exploration. Commun. ACM, 49(4):59–61, 2006.

[HGZ+04] Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In JCDL, pages 296–305, 2004.

[HH96] Jeremy A. Hylton and Jeremy A. Hylton. Identifying and merg- ing related bibliographic records. Technical report, MIT LCS Masters Thesis, 1996.

[Hir05] Jorge E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569–16572, 2005.

[HN02] Tu Bao Ho and Ngoc Binh Nguyen. Nonhierarchical document clustering based on a tolerance rough set model. Int. J. Intell. Syst., 17(2):199–212, 2002.

[HvdW09] Anne-Wil Harzing and Ron van der Wal. A google scholar h-index for journals: An alternative metric to measure journal impact in economics and business. Journal of the American Society for Information Science and Technology, 60:41–46, 2009.

[Jac05a] Peter Jacso. As we may search comparison of major features of the web of science, scopus, and google scholar citation-based and citation-enhanced databases. Current Science, 89(9):1537–1547, 2005.

[Jac05b] Peter Jacso. Comparison and analysis of the citedness scores in web of science and google scholar. Digital libraries Implementing strategies and sharing experiences, pages 360 – 369, 2005.

[Jac08a] Peter Jacso. The plausibility of computing the h-index of schol- arly productivity and impact using reference-enhanced databases. 32:266 – 283, 2008.

[Jac08b] Peter Jacso. Testing the calculation of a realistic h-index in google scholar, scopus, and web of science for f. w. lancaster. Library Trends, 56:784–815, 2008.

[Jac09a] Peter Jacso. Errors of omission and their implications for com- puting scientometric measures in evaluating the publishing pro- ductivity and impact of countries. Online Information Review, 33:376 – 385, 2009.

[Jac09b] Peter Jacso. The h-index for countries in web of science and scopus. Online Information Review, 33:831 – 837, 2009.

[JHR09] Supakpong Jinarat, Choochart Haruechaiyasak, and Arnon Rungsawang. Web snippet clustering based on text enrichment with concept hierarchy. In ICONIP (2), pages 309–317, 2009.

[JJKY] Zhihua Jiang, Anupam Joshi, Raghu Krishnapuram, and Liyu Yi. Retriever: Improving web search engine results using clustering.

[JK] Jongkol Janruang and Worapoj Kreesuradej. Hierarchical and overlapping clustering of retrieved web pages. In Recent Advances in Intelligent Information Systems, pages 345–358.

[JK06] Jongkol Janruang and Worapoj Kreesuradej. A new web search result clustering based on true common phrase label discovery. In CIMCA/IAWTIC, page 242, 2006.

[JMF99] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, 1999.

[JPi10] Jelena Jaćimović, Ružica Petrović, and Slavoljub Živković. A citation analysis of serbian dental journal using web of science, scopus and google scholar. DOAJ, 57(4):201–211, 2010.

[JRMG06] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. Generating query substitutions. In WWW, pages 387–396, 2006.

[KASW09] Abhaya V. Kulkarni, Brittany Aziz, Iffat Shams, and Jason W. Busse. Comparisons of citations in web of science, scopus, and google scholar for articles published in general medical journals. JAMA, 302(10):1092–1096, 2009.

[KJNY01] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, and Liyu Yi. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Transactions on Fuzzy Systems, 9:595–608, 2001.

[KLR+04] Krishna Kummamuru, Rohit Lotlikar, Shourya Roy, Karan Singal, and Raghu Krishnapuram. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In WWW, pages 658–665, 2004.

[KM06] Dmitri V. Kalashnikov and Sharad Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767, 2006.

[KMC05] Dmitri V. Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen. Exploiting relationships for domain-independent data cleaning. In SDM, 2005.

[KP04] Sotiris B. Kotsiantis and Panayiotis E. Pintelas. Recent advances in clustering: A brief survey. WSEAS Transactions on Information Science and Applications, 1:73–81, 2004.

[KPT09] Stella Kopidaki, Panagiotis Papadakos, and Yannis Tzitzikas. Stc+ and nm-stc: Two novel online results clustering methods for web searching. In WISE, pages 523–537, 2009.

[KT06] Kayvan Kousha and Mike Thelwall. Motivations for url citations to open access library and information science articles. Scientometrics, 68(3):501–517, 2006.

[KT07] Kayvan Kousha and Mike Thelwall. Google scholar citations and google web/url citations: A multi-discipline exploratory analysis. JASIST, 58(7):1055–1065, 2007.

[KT08] Kayvan Kousha and Mike Thelwall. Sources of google scholar citations outside the science citation index: A comparison between four science disciplines. Scientometrics, 74(2):273–294, 2008.

[Kuh55] Harold W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[LBLB10] Jie Li, Judy F. Burnham, Trey Lemley, and Robert M. Britton. Citation analysis: Comparison of web of science (registered), scopus, scifinder (registered), and google scholar. Journal of Electronic Resources in Medical Libraries, 7:196–217, 2010.

[LC03] Dawn J. Lawrie and W. Bruce Croft. Generating hierarchical summaries for web searches. In SIGIR, pages 457–458, 2003.

[Ley09] Michael Ley. Dblp - some lessons learned. PVLDB, 2(2):1493–1500, 2009.

[LGB99] Steve Lawrence, C. Lee Giles, and Kurt D. Bollacker. Autonomous citation matching. In Agents, pages 392–393, 1999.

[LIdMAM09] Carmen López-Illescas, Félix de Moya Anegón, and Henk F. Moed. Comparing bibliometric country-by-country rankings derived from the web of science and scopus: the effect of poorly cited journals in oncology. J. Information Science, 35(2):244–256, 2009.

[LMP01] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.

[LOKP05] Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In IQIS, pages 69–76, 2005.

[LW10] Zhao Li and Xindong Wu. A phrase-based method for hierarchical clustering of web snippets. In AAAI, 2010.

[LY06] Jianchao Li and Tianfang Yao. An efficient token-based approach for web-snippet clustering. In SKG, page 13, 2006.

[Mik10] Susanne Mikki. Comparing google scholar and isi web of science for earth sciences. Scientometrics, 82(2):321–331, 2010.

[MK09] Matthew Michelson and Craig A. Knoblock. Mining the het- erogeneous transformations between data sources to aid record linkage. In IC-AI, pages 422–428, 2009.

[MNK+05] Steven Minton, Claude Nanjo, Craig A. Knoblock, Martin Michalowski, and Matthew Michelson. A heterogeneous field matching method for record linkage. In ICDM, pages 314–321, 2005.

[MNU00] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to ref- erence matching. In KDD, pages 169–178, 2000.

[MR08] Lokman I. Meho and Yvonne Rogers. Citation counting, ci- tation ranking, and h-index of human-computer interaction re- searchers: A comparison between scopus and web of science. CoRR, abs/0803.1716, 2008.

[MS09] Lokman I. Meho and Cassidy R. Sugimoto. Assessing the scholarly impact of information studies: A tale of two citation databases: scopus and web of science. Journal of the American Society for Information Science and Technology, 60:2499–2508, 2009.

[Mul05] H. Müller. Problems, methods, and challenges in comprehensive data cleansing. Informatik-Berichte, Professoren des Inst. für Informatik, 2005.

[MY07] Lokman I. Meho and Kiduk Yang. Impact of data sources on citation counts and rankings of lis faculty: Web of science versus scopus and google scholar. JASIST, 58(13):2105–2125, 2007.

[MYC08] Erwan Moreau, François Yvon, and Olivier Cappé. Robust similarity measures for named entities matching. In COLING, 2008.

[NC10] Roberto Navigli and Giuseppe Crisafulli. Inducing word senses to improve web search result clustering. In EMNLP, pages 116– 126, 2010.

[ND06] Christoph Neuhaus and Hans-Dieter Daniel. Data sources for performing citation analysis: An overview. Journal of Documentation, 64(2):193–210, 2006.

[NLQ+09] Xingliang Ni, Zhi Lu, Xiaojun Quan, Wenyin Liu, and Bei Hua. Short text clustering for search results. In APWeb/WAIM, pages 584–589, 2009.

[NN05] Chi Lang Ngo and Hung Son Nguyen. A method of web search result clustering based on rough sets. In Web Intelligence, pages 673–679, 2005.

[Nor05] Alireza Noruzi. Google scholar: The new generation of citation indexes. Libri, 55:170–180, 2005.

[NPH+09] Cam-Tu Nguyen, Xuan Hieu Phan, Susumu Horiguchi, Thu- Trang Nguyen, and Quang-Thuy Ha. Web search clustering and labeling with hidden topics. ACM Trans. Asian Lang. Inf. Process., 8(3), 2009.

[NQL+11] Xingliang Ni, Xiaojun Quan, Zhi Lu, Liu Wenyin, and Bei Hua. Short text clustering by finding core terms. Knowl. Inf. Syst., 27(3):345–365, 2011.

[OSW04] Stanislaw Osinski, Jerzy Stefanowski, and Dawid Weiss. Lingo: Search results clustering algorithm based on singular value de- composition. In Intelligent Information Systems, pages 359–368, 2004.

[OW04] Stanislaw Osinski and Dawid Weiss. Conceptual clustering using lingo algorithm: Evaluation on open directory project data. In Intelligent Information Systems, pages 369–377, 2004.

[PKM03] Bo Pang, Kevin Knight, and Daniel Marcu. Syntax-based align- ment of multiple translations: Extracting paraphrases and gener- ating new sentences. In HLT-NAACL, 2003.

[PMM+02] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart J. Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In NIPS, pages 1401–1408, 2002.

[PS05] Daniel Pauly and Konstantinos L. Stergiou. Equivalence of results from two citation analyses: Thomson isi's citation index and google's scholar service. Ethics in Science and Environmental Politics, pages 33–35, 2005.

[RBC+08] Filip Radlinski, Andrei Z. Broder, Peter Ciccolo, Evgeniy Gabrilovich, Vanja Josifovski, and Lance Riedel. Optimizing rel- evance and revenue in ad search: a query substitution approach. In SIGIR, pages 403–410, 2008.

[RD00] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.

[RKT11] Aniket Rangrej, Sayali Kulkarni, and Ashish V. Tendulkar. Com- parative study of clustering techniques for short text documents. In WWW (Companion Volume), pages 111–112, 2011.

[RT05] Erhard Rahm and Andreas Thor. Citation analysis of database publications. SIGMOD Record, 34(4):48–53, 2005.

[Saa06] Gad Saad. Exploring the h-index at the author and journal levels using bibliometric data of productive consumer schol- ars and business-related journals respectively. Scientometrics, 69(1):117–120, 2006.

[SB02a] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive dedu- plication using active learning. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA, 2002. ACM.

[SB02b] Sunita Sarawagi and Anuradha Bhamidipaty. Interactive dedu- plication using active learning. In KDD, pages 269–278, 2002.

[SD05] Parag Singla and Pedro Domingos. Collective object identifica- tion. In IJCAI, pages 1636–1637, 2005.

[Sek08] Cagan H. Sekercioglu. Quantifying Coauthor Contributions. Science, 322(5900):371+, October 2008.

[SH97] Giorgio Satta and John C. Henderson. String transformation learning. In ACL, pages 444–451, 1997.

[SK08] Anestis Sitas and Sarantos Kapidakis. Duplicate detection algo- rithms of bibliographic descriptions, 2008.

[ŠUP10] Marijan Šember, Ana Utrobičić, and Jelka Petrak. Croatian medical journal citation score in web of science, scopus, and google scholar. Croatian Medical Journal, 51(2):99–103, 2010.

[TKM01] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Inf. Syst., 26(8):607–633, 2001.

[TKM02] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learn- ing domain-independent string transformation weights for high accuracy object identification. In KDD, pages 350–359, 2002.

[Tur02] Peter D. Turney. Mining the web for synonyms: Pmi-ir versus lsa on toefl. CoRR, cs.LG/0212033, 2002.

[Van07] Jerome K. Vanclay. On the robustness of the h-index. CoRR, abs/cs/0701074, 2007.

[VS03] Liwen Vaughan and Debora Shaw. Bibliographic and web cita- tions: What is the difference? JASIST, 54(14):1313–1322, 2003.

[VS08] Liwen Vaughan and Debora Shaw. A new look at evidence of scholarly citation in citation indexes and from web sources. Scientometrics, 74(2):317–330, 2008.

[WDDT04] Ian H. Witten, Katherine J. Don, Michael Dewsnip, and Valentin Tablan. Text mining in a digital library. International Journal on Digital Libraries, 4:2004, 2004.

[Wei01] Dawid Weiss. A Clustering Interface for Web Search Results in Polish and English, 2001.

[Wen07] Michael C. Wendl. H-index: however ranked, citations need con- text. Nature, 449(7161):403, 2007.

[WHL09] Han Wen, Guo-Shun Huang, and Zhao Li. Clustering web search results using semantic information. In International Conference on Machine Learning and Cybernetics, volume 3, pages 1504– 1509, 2009.

[Win99] William E. Winkler. The state of record linkage and current re- search problems. Technical report, Statistical Research Division, U.S. Census Bureau, 1999.

[WK02] Yitong Wang and Masaru Kitsuregawa. On combining link and contents information for web page clustering. In DEXA, pages 902–913, 2002.

[WMH+08] Junze Wang, Yijun Mo, Benxiong Huang, Jie Wen, and Li He. Web search results clustering based on a novel suffix tree struc- ture. In ATC, pages 540–554, 2008.

[WS03] Dawid Weiss and Jerzy Stefanowski. Web search results cluster- ing in polish: Experimental evaluation of carrot. In IIS, pages 209–218, 2003.

[WT91] William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Decen- nial Census. Technical report, US Bureau of the Census, 1991.

[WXC09] Han Wen, Nanfeng Xiao, and Qiong Chen. Web snippets clus- tering based on an improved suffix tree algorithm. In FSKD (1), pages 542–547, 2009.

[WXLZ09] Wei Wang, Chuan Xiao, Xuemin Lin, and Chengqi Zhang. Effi- cient approximate entity extraction with edit distance constraints. In SIGMOD, 2009.

[WZP+08] Ying Wang, Wanli Zuo, Tao Peng, Fengling He, and Hailong Hu. Clustering web search results based on interactive suffix tree algorithm. Convergence Information Technology, International Conference on, 2:851–857, 2008.

[XI05] Rui Xu and Donald C. Wunsch II. Survey of clustering algo- rithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[XWLY08] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Ef- ficient similarity joins for near duplicate detection. In WWW, 2008.

[Yia93] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, pages 311– 321, 1993.

[YM06] Kiduk Yang and Lokman I. Meho. Citation analysis: A comparison of google scholar, scopus, and web of science. Proceedings of the American Society for Information Science and Technology, 43(1):1–15, 2006.

[ZD04] Dell Zhang and Yisheng Dong. Semantic, hierarchical, online clustering of web search results. In APWeb, pages 69–78, 2004.

[ZD08] Dengya Zhu and Heinz Dreher. Improving web search by cat- egorization, clustering, and personalization. In ADMA, pages 659–666, 2008.

[ZE98] Oren Zamir and Oren Etzioni. Web document clustering: A fea- sibility demonstration. In SIGIR, pages 46–54, 1998.

[ZE99] Oren Zamir and Oren Etzioni. Grouper: A dynamic cluster- ing interface to web search results. Computer Networks, 31(11- 16):1361–1374, 1999.

[ZEMK97] Oren Zamir, Oren Etzioni, Omid Madani, and Richard M. Karp. Fast and intuitive clustering of web documents. In KDD, pages 287–290, 1997.

[ZHC+04] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jin- wen Ma. Learning to cluster web search results. In SIGIR, pages 210–217, 2004.

[ZL02] Dangzhi Zhao and Elisabeth Logan. Citation analysis using scientific publications on the web as data source: a case study in the xml research area. Scientometrics, 54(3):449–472, 2002.

[ZZHM04] Yongzheng Zhang, A. Nur Zincir-Heywood, and Evangelos E. Milios. Term-based clustering and summarization of web page collections. In Canadian Conference on AI, pages 60–74, 2004.