Specialized Research Datasets in the CiteSeerX

Sumit Bhatiaα, Cornelia Carageaβ, Hung-Hsuan Chenβ, Jian Wuβ, Pucktada Treeratpitukβ, Zhaohui Wuβ, Madian Khabsaα, Prasenjit Mitraαβ and C. Lee Gilesαβ
αComputer Science and Engineering
βCollege of Information Sciences and Technology
The Pennsylvania State University, University Park, PA-16802, USA

ABSTRACT
We provide an overview of some of the specialized datasets that were created for various projects related to the CiteSeerX digital library. These datasets are not those usually available from CiteSeerX, and awareness of them could further the state of the art in academic digital library data management and analysis.

Categories and Subject Descriptors
H4.0 [Information Systems Applications]: General

General Terms
Documentation, Standardization.

Keywords
CiteSeerX, academic digital library, datasets.

1. INTRODUCTION
The increase in the amount of scientific literature behooves the development of new techniques for efficient management and analysis of scientific publications. Furthermore, the techniques and methods developed should be scalable to meet the demands of this ever-increasing data. A major bottleneck in research on academic document mining and analysis is the unavailability of public datasets for evaluating and comparing proposed techniques with the state of the art. In this paper, we provide an overview of seven different specialized datasets derived from CiteSeerX (.ist.psu.edu), a digital library and search engine with a focus on computer science publications. In some cases this data is contained within the CiteSeerX distribution, but is not readily available. In many cases, given the size of the data (more than 1.6 million documents at the time of writing), it could also be quite difficult to generate these datasets due to lack of resources. These datasets can be made available for use by the research community, and it is hoped that they will help further the state of the art in academic data management and analysis.

2. DOCUMENT-ELEMENT SUMMARIZATION DATASET
A document-element is defined as an entity, separate from the running text of the document, that either augments or summarizes the information contained in the running text [2]. In academic documents, document-elements are used for a variety of purposes such as reporting and summarizing experimental results (plots, tables), describing a process (flow charts) or presenting an algorithm (pseudo-code). Given the importance of these document-elements, a number of document-element search engines have been proposed [4, 3, 10]. These search engines, however, only provide a thumbnail view of the document-element and a small snippet describing it, usually the element’s caption. Oftentimes, the captions do not provide sufficient detail to understand the content of the document-element.

In our previous research [2, 1] we explored the problem of generating descriptive summaries of these document-elements that can help end-users understand their content without having to read the entire paper. For testing the proposed algorithms, a dataset consisting of 290 document-elements (163 figures, 78 tables and 49 algorithms) from 152 different publications was prepared. The full text of each paper in the dataset is available. Further, a gold set of summaries of each document-element, as created by two human evaluators, is also available. This dataset can be used to further develop and evaluate document-element summarization systems.
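As one illustration of how the gold summaries might be used for evaluation, the sketch below scores a system-generated summary against the summaries of the two human evaluators and averages the result. It assumes, purely for illustration, that each summary is represented as a set of sentence identifiers taken from the full text of the paper; the actual format of the gold summaries is not specified here, and the function names and sample values are hypothetical.

```python
def precision_recall_f1(system, gold):
    """Score a set of selected sentence ids against one gold set."""
    system, gold = set(system), set(gold)
    overlap = len(system & gold)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_over_evaluators(system, gold_sets):
    """Average precision, recall and F1 over all evaluators' gold sets."""
    scores = [precision_recall_f1(system, gold) for gold in gold_sets]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Hypothetical sentence ids; not taken from the dataset.
system_summary = {3, 7, 12, 15}
evaluator_a = {3, 7, 9, 12}
evaluator_b = {7, 12, 15, 20}

p, r, f = average_over_evaluators(system_summary, [evaluator_a, evaluator_b])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Averaging over the two evaluators is only one possible choice; scoring against the union or the intersection of their summaries would be equally easy to express with the same helper.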

WOSP ’12, June 14, 2012, Washington D.C., USA

3. CITATION GRAPH AND CITATION RECOMMENDATION DATASET
The citation recommendation dataset is compiled from the CiteSeerX citation graph and the metadata available for each paper indexed in CiteSeerX, as of December 2011. We define a citing paper as a paper for which we have access to its content and the reference list, and a citation as a paper that occurs in the reference list of at least one citing paper in the corpus, and for which we have access to its content, but may or may not have access to its reference list. In the CiteSeerX citation graph, there are 1,345,249 citing papers and 9,150,279 citations. The total number of links in the graph, i.e., [citing paper → citation], is 25,526,384. Figure 1 shows that the citations in the CiteSeerX citegraph typically follow a Zipf distribution, i.e., only a few citations are cited by many citing papers, whereas the majority of them are cited rarely. Figure 2 shows the number of citations per citing paper, i.e., the size of the reference list. As can be seen in the figure, very few citing papers have a large number of citations, whereas for most of the citing papers the number of citations ranges between 8 and 32.

Figure 1: The citations in the CiteSeerX citegraph follow a Zipf distribution, i.e., only a few citations are cited by many citing papers, whereas the majority of them are cited rarely.

Figure 2: Number of citations per citing paper in the CiteSeerX citegraph.

From the CiteSeerX citegraph and the available metadata, a smaller dataset is constructed for the task of citation recommendation by filtering out papers that:

• do not have title and abstract,
• have fewer than 5 or more than 200 citations, or
• cite fewer than 5 or more than 100 other papers.

In the resulting citation graph, there are 190,450 citations, 293,711 citing papers, and 2,839,455 links.

This dataset can be used for a variety of academic literature analysis tasks such as citation recommendation [7], studying research trends and identifying influential papers and authors.

4. AUTHOR NAME DISAMBIGUATION
Automatically identifying the author of a given scientific publication is a crucial pre-processing step in many bibliometric analysis tasks such as finding influential authors, finding researcher homepages and studying the temporal research interests of a given researcher [11]. The data for the disambiguated authors in CiteSeerX is available within the standard CiteSeerX databases. The dataset contains more than 300,000 unique authors found in the CiteSeerX digital library using the disambiguation algorithm proposed by Treeratpituk and Giles [11]. The dataset also provides the canonical name of each disambiguated author, his or her total number of papers and number of citations, homepage URL (if available), and the list of affiliations found to be associated with the author. Further, for each disambiguated author a unique key is provided that can be used to retrieve all the papers related to the author. In addition, the standard dataset used for evaluating the disambiguation algorithm by Treeratpituk and Giles [11] is also available. The dataset contains author records of 10 highly ambiguous names sampled from the CiteSeerX database. The most ambiguous name in the dataset contains 525 records corresponding to 99 unique authors.

5. BOOK PROJECT
The aims of the Open Access Book Search project are:

• to provide search and navigation inside freely available online books,
• to extract and index metadata, hierarchy structure, table of contents and citations present in these books.

The dataset used in this project consists of 5945 books that are freely available online along with their associated PDF documents. The associated metadata consists of extracted text from the PDF files, book titles, authors, date of publishing, ISBN, page number, language and country of publication. It can be used to develop and evaluate information extraction and metadata extraction techniques, entity recognition algorithms for books, or for documents with heterogeneous format and structure.

6. ACKSEER: ACADEMIC ACKNOWLEDGMENT DATASET
AckSeer is a beta automatic acknowledgment indexing search engine that explores automatic identification of acknowledgments in academic documents [8, 9]. The system also extracts acknowledged entities (researcher names, funding agencies, etc.) from the acknowledgments and indexes them to enable search. Currently, AckSeer indexes acknowledgments from more than 500,000 papers in CiteSeerX. These acknowledgments contain more than 4 million acknowledged entities, approximately 2 million of them unique. Entity extraction is based on multiple state-of-the-art Named Entity Recognizers (NER). Acknowledged entities are ranked by citation.

Two datasets related to this project are publicly available. The first one includes manually tagged acknowledgments to measure the performance of the extractors. The second dataset contains the acknowledged entities as extracted by the named entity recognizers deployed in AckSeer. The latter can be used to study the distribution of acknowledged entities, and perhaps to build a graph between the authors and the acknowledged entities.

7. COAUTHOR NETWORK DATASET
CollabSeer (http://collabseer.ist.psu.edu/) is a search engine for discovering potential collaborators for a given author [6, 5]. CollabSeer discovers potential collaborators by analyzing the structure of a user’s coauthor network and research interests. Currently, CollabSeer supports three different network structure analysis modules for collaborator search: Jaccard similarity, cosine similarity, and our relation strength similarity. Users can further refine the recommendation results by clicking on their topics of interest, which are generated by automatically extracting key phrases from previous publications.

CollabSeer uses the CiteSeerX dataset to build a coauthor network, which includes over 1,300,000 computer science related documents and over 0.3 million unique authors. This coauthor network database can be used for evaluating bibliometric tasks such as identifying potential collaborators and for studying problems related to academic social networks.

8. FOCUSED CRAWLING FOR ACADEMIC CONTENT
The CiteSeerX crawler crawls the web and collects documents to be indexed by CiteSeerX. A major challenge for the crawler is to differentiate between academic and non-academic content so as to keep the CiteSeerX repository clean. The dataset used for focused crawling consists of a URL whitelist that contains the seed URLs used for crawling. The list is constantly updated to remove dead links, blacklisted URLs and URLs which do not provide ingestable documents (not all downloaded documents are ingestable). At present, there are more than 100,000 URLs in the whitelist from which the CiteSeerX crawler crawls new academic content. In addition to the URL whitelist, statistical information on the crawling history for hosts, domains and top level domains is also available. Finally, a collection of researcher homepages is available, consisting of a crawl (mime-type text/html) of fifteen US universities.

The crawl database described above provides an invaluable resource for focused web crawling research. The dataset can be used for developing and testing focused crawling techniques [12, 13], academic webpage classification and identifying researcher homepages. The URL associated with this project is http://louise.ist.psu.edu/, from where users can view the crawl history and document ranking and submit URLs to crawl.

9. CONCLUSION
Though CiteSeerX is a popular data source for academic document research in data mining, entity disambiguation, information extraction, etc., there are other interesting data sets that can be used and extracted. We present here an overview of such data sets and discuss their possible use in academic document data analysis and management.

10. ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation and DTRA.

11. REFERENCES
[1] S. Bhatia, S. Lahiri, and P. Mitra. Generating synopses for document-element search. In CIKM ’09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 2003–2006, New York, NY, USA, 2009. ACM.
[2] S. Bhatia and P. Mitra. Summarizing figures, tables, and algorithms in scientific publications to augment search results. ACM Trans. Inf. Syst., 30(1):3:1–3:24, Mar. 2012.
[3] S. Bhatia, P. Mitra, and C. L. Giles. Finding algorithms in scientific articles. In WWW 2010, pages 1061–1062, 2010.
[4] S. Bhatia, S. Tuarob, P. Mitra, and C. L. Giles. An algorithm search engine for software developers. In Proceedings of the 3rd International Workshop on Search-Driven Development: Users, Infrastructure, Tools, and Evaluation, SUITE ’11, pages 13–16, New York, NY, USA, 2011. ACM.
[5] H.-H. Chen, L. Gou, X. Zhang, and C. L. Giles. Capturing missing edges in social networks using vertex similarity. In Proceedings of the sixth international conference on Knowledge capture, K-CAP ’11, pages 195–196, New York, NY, USA, 2011. ACM.
[6] H.-H. Chen, L. Gou, X. Zhang, and C. L. Giles. Collabseer: a search engine for collaboration discovery. In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL ’11, pages 231–240, New York, NY, USA, 2011. ACM.
[7] S. Kataria, P. Mitra, and S. Bhatia. Utilizing context in generative bayesian models for linked corpus. In M. Fox and D. Poole, editors, AAAI. AAAI Press, 2010.
[8] M. Khabsa, S. Koppman, and C. Giles. Towards building and analyzing a social network of acknowledgments in scientific and academic documents. Social Computing, Behavioral-Cultural Modeling and Prediction, pages 357–364, 2012.
[9] M. Khabsa, P. Treeratpituk, and C. Giles. Ackseer: A repository and search engine for automatically extracted acknowledgments from digital libraries. In JCDL, 2012.
[10] Y. Liu, K. Bai, P. Mitra, and C. L. Giles. Tableseer: automatic table metadata extraction and searching in digital libraries. In JCDL, pages 91–100. ACM, 2007.
[11] P. Treeratpituk and C. L. Giles. Disambiguating authors in academic publications using random forests. In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, JCDL ’09, pages 39–48, New York, NY, USA, 2009. ACM.
[12] J. Wu, P. Teregowda, J. P. F. Ramírez, P. Mitra, and L. Giles. A study of the crawling strategy evolution for academic document search engines. ACM WebScience, 2012.
[13] S. Zheng, P. Dmitriev, and C. L. Giles. Graph-based seed selection for web-scale crawlers. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 1967–1970, New York, NY, USA, 2009. ACM.
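As a supplementary note to the citation recommendation dataset of Section 3, the filtering criteria listed there can be sketched in Python as follows. This is only one possible reading of those criteria, not the procedure actually used to build the dataset: it assumes that the title and abstract requirement applies to both ends of a link, that the bound on citations received applies to the cited paper, that the bound on reference-list size applies to the citing paper, and that all counts are taken once on the unfiltered graph. The function name, argument layout and toy data are hypothetical.

```python
from collections import Counter


def filter_citation_graph(references, metadata,
                          min_cited=5, max_cited=200,
                          min_refs=5, max_refs=100):
    """Return the (citing, cited) links that survive the filters.

    references: dict mapping a citing paper id to the list of ids it cites.
    metadata:   dict mapping a paper id to a dict with 'title' and 'abstract'.
    """
    def has_text(paper_id):
        record = metadata.get(paper_id, {})
        return bool(record.get("title")) and bool(record.get("abstract"))

    # Number of distinct citing papers per cited paper, on the full graph.
    times_cited = Counter(
        cited for refs in references.values() for cited in set(refs)
    )

    links = []
    for citing, refs in references.items():
        unique_refs = set(refs)
        if not has_text(citing):
            continue
        if not (min_refs <= len(unique_refs) <= max_refs):
            continue
        for cited in sorted(unique_refs):
            if has_text(cited) and min_cited <= times_cited[cited] <= max_cited:
                links.append((citing, cited))
    return links


# Toy graph with small thresholds so the effect of each rule is visible.
full = {"title": "T", "abstract": "A"}
toy_metadata = {
    "p1": full, "p2": full, "p3": full, "p4": full,
    "a": full, "b": full, "d": full,
    "c": {"title": "T", "abstract": ""},  # missing abstract
}
toy_references = {
    "p1": ["a", "b"],
    "p2": ["a", "b", "c"],
    "p3": ["a"],                  # reference list too short
    "p4": ["a", "b", "c", "d"],   # reference list too long
}
toy_links = filter_citation_graph(
    toy_references, toy_metadata,
    min_cited=2, max_cited=3, min_refs=2, max_refs=3,
)
print(toy_links)
```

In the toy graph, paper "a" is dropped for being cited too often, "c" for lacking an abstract and "d" for being cited too rarely, so only the links into "b" from citing papers with an acceptable reference-list size remain. A stricter variant would repeat the filter until the counts stop changing, since removing papers alters the counts of their neighbours.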