Specialized Research Datasets in the CiteSeerX Digital Library

Sumit Bhatiaα, Cornelia Carageaβ, Hung-Hsuan Chenβ, Jian Wuβ, Pucktada Treeratpitukβ, Zhaohui Wuβ, Madian Khabsaα, Prasenjit Mitraαβ and C. Lee Gilesαβ
αComputer Science and Engineering
βCollege of Information Sciences and Technology
The Pennsylvania State University, University Park, PA-16802, USA

WOSP '12, June 14, 2012, Washington D.C., USA

ABSTRACT
We provide an overview of some of the specialized datasets that were created for various projects related to the CiteSeerX digital library. These datasets are not those usually available from CiteSeerX, and awareness of these datasets could possibly further the state-of-the-art research in academic digital library data management and analysis.

Categories and Subject Descriptors
H4.0 [Information Systems Applications]: General

General Terms
Documentation, Standardization.

Keywords
CiteSeerX, academic digital library, datasets.

1. INTRODUCTION
The increase in the amount of scientific literature behooves the development of new techniques for efficient management and analysis of scientific publications. Furthermore, the techniques and methods developed should be scalable to meet the demands of this ever increasing data. A major bottleneck in research for academic document mining and analysis is the unavailability of public datasets for evaluating and comparing proposed techniques with the state-of-the-art. In this paper, we provide an overview of seven different specialized datasets derived from CiteSeerX (citeseerx.ist.psu.edu), a digital library and search engine with a focus on computer science publications. In some cases this data is contained in the CiteSeerX distribution, but is not readily available. In many cases, given the size of the data (more than 1.6 million documents at time of writing), it could also be quite difficult to generate these datasets due to lack of computing resources. These datasets can be made available for use by the research community, and it is hoped that they will help further the state-of-the-art in academic data management and analysis.

2. DOCUMENT-ELEMENT SUMMARIZATION DATASET
A document-element is defined as an entity, separate from the running text of the document, that either augments or summarizes the information contained in the running text [2]. In academic documents, a number of document-elements are used for a variety of purposes like reporting and summarizing experimental results (plots, tables), describing a process (flow charts) or presenting an algorithm (pseudo-code). Given the importance of these document-elements, a number of document-element search engines have been proposed [4, 3, 10]. These search engines, however, only provide a thumbnail view of the document-elements and a small snippet describing the dataset that is usually the element's caption. Oftentimes, the captions do not provide sufficient details to understand the content of the document-element. In our previous research [2, 1] we explored the problem of generating descriptive summaries of these document-elements that can help the end-users to understand their content without having to read the entire paper. For testing the proposed algorithms, a dataset consisting of 290 document-elements (163 figures, 78 tables and 49 algorithms) from 152 different Computer Science publications was prepared. Full text of each paper in the dataset is available. Further, a gold set of summaries of each document-element as created by two human evaluators is also available. This dataset can be used to further develop and evaluate document-element summarization systems.

3. CITATION GRAPH AND CITATION RECOMMENDATION DATASET
The citation recommendation dataset is compiled from the CiteSeerX citation graph and the metadata available for each paper indexed in CiteSeerX, as of December 2011. We define a citing paper as a paper for which we have access to its content and the reference list, and a citation as a paper that occurs in the reference list of at least one citing paper in the corpus, and for which we have access to its content, but may or may not have access to its reference list. In the CiteSeerX citation graph, there are 1,345,249 citing papers and 9,150,279 citations. The total number of links in the graph, i.e., [citing paper → citation], is 25,526,384. Figure 1 shows that the citations in the CiteSeerX citegraph typically follow a Zipf distribution, i.e., only a few citations are cited by many citing papers, whereas the majority of them are cited rarely. Figure 2 shows the number of citations per citing paper, i.e., the size of the reference list. As can be seen in the figure, very few citing papers have a large number of citations, whereas for most of the citing papers the number of citations ranges between 8 and 32.

Figure 1: The citations in the CiteSeerX citegraph follow a Zipf distribution, i.e., only a few citations are cited by many citing papers, whereas the majority of them are cited rarely.

Figure 2: Number of citations per citing paper in the CiteSeerX citegraph.

From the CiteSeerX citegraph and the available metadata, a smaller dataset is constructed for the task of citation recommendation by filtering out papers that:

• do not have title and abstract,
• have fewer than 5 or more than 200 citations, or
• cite fewer than 5 or more than 100 other papers.

In the resulting citation graph, there are 190,450 citations, 293,711 citing papers, and 2,839,455 links.

This dataset can be used for a variety of academic literature analysis tasks such as citation recommendation [7], studying research trends and identifying influential papers and authors.

4. AUTHOR NAME DISAMBIGUATION
Automatically identifying the author of a given scientific publication is a crucial pre-processing step in many bibliometric analysis tasks such as finding influential authors, finding researcher homepages and studying the temporal research interests of a given researcher [11]. The data for the disambiguated authors in CiteSeerX is available within the standard CiteSeerX databases. The dataset contains more than 300,000 unique authors found in the CiteSeerX digital library using the disambiguation algorithm proposed by Treeratpituk and Giles [11]. The dataset also provides the canonical name of each disambiguated author, his or her total number of papers and number of citations, homepage URL (if available), and the list of affiliations found to be associated with the author. Further, for each disambiguated author a unique key is provided that can be used to retrieve all the papers related to the author. In addition, the standard dataset used for evaluating the disambiguation algorithm by Treeratpituk and Giles [11] is also available. The dataset contains author records of 10 highly ambiguous names sampled from the CiteSeerX database. The most ambiguous name in the dataset contains 525 records corresponding to 99 unique authors.

5. OPEN ACCESS BOOK PROJECT
The aims of the Open Access Book Search project are:

• to provide search and navigation inside freely available online books,
• to extract and index metadata, hierarchy structure, table of contents and citations present in these books.

The dataset used in this project consists of 5945 books that are freely available online along with their associated PDF documents. The associated metadata consists of extracted text from the PDF files, book titles, authors, date of publishing, ISBN, page number, language and country of publication. It can be used to develop and evaluate information extraction and metadata extraction techniques, entity recognition algorithms for books, or for documents with heterogeneous format and structure.

6. ACKSEER – ACADEMIC ACKNOWLEDGMENT DATASET
AckSeer is a beta automatic acknowledgment indexing search engine that explores automatic identification of acknowledgments in academic documents [8, 9]. The system also extracts acknowledged entities (researcher names, funding agencies, etc.) from the acknowledgments and indexes them to enable search. Currently, AckSeer indexes acknowledgments from more than 500,000 papers in CiteSeerX. These acknowledgments contain more than 4 million acknowledged entities with approximately 2 million of them unique. Entity extraction is based on multiple state-of-the-art Named Entity Recognizers (NER). Acknowledged entities are ranked by citation.

Two datasets related to this project are publicly available. The first one includes manually tagged acknowledgments to measure the performance of the extractors. The second dataset contains the acknowledged entities as extracted by the named entity recognizers deployed in AckSeer. The latter can be used to study the distribution of acknowledged entities, and perhaps build a graph between the authors and the acknowledged entities.

7. COAUTHOR NETWORK DATASET
CollabSeer (http://collabseer.ist.psu.edu/) is a search engine for discovering potential collaborators for a given author [6, 5]. CollabSeer discovers potential collaborators by analyzing the structure of a user's coauthor network and research interests. Currently, CollabSeer supports three different network structure analysis modules for collaborator

mit URLs to crawl.

9. CONCLUSION
Though CiteSeerX is a popular data source for academic document research in data mining, entity disambiguation, information extraction, etc., there are other interesting data sets that can be used and extracted. We present here an overview of such data sets and discuss possible use in academic document data analysis and management.

10. ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation and DTRA.

11. REFERENCES
[1] S. Bhatia, S. Lahiri, and P. Mitra. Generating synopses for document-element search. In CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 2003–2006, New York, NY, USA, 2009. ACM.
[2] S.
