Specialized Research Datasets in the CiteSeerX Digital Library

Sumit Bhatiaα, Cornelia Carageaβ, Hung-Hsuan Chenβ, Jian Wuβ, Pucktada Treeratpitukβ, Zhaohui Wuβ, Madian Khabsaα, Prasenjit Mitraαβ and C. Lee Gilesαβ
αComputer Science and Engineering
βCollege of Information Sciences and Technology
The Pennsylvania State University, University Park, PA-16802, USA

WOSP '12, June 14, 2012, Washington D.C., USA

ABSTRACT
We provide an overview of some of the specialized datasets that were created for various projects related to the CiteSeerX digital library. These datasets are not those usually available from CiteSeerX, and awareness of them could further state-of-the-art research in academic digital library data management and analysis.

Categories and Subject Descriptors
H4.0 [Information Systems Applications]: General

General Terms
Documentation, Standardization.

Keywords
CiteSeerX, academic digital library, datasets.

1. INTRODUCTION
The increase in the amount of scientific literature behooves the development of new techniques for efficient management and analysis of scientific publications. Furthermore, the techniques and methods developed should be scalable to meet the demands of this ever-increasing data. A major bottleneck in research on academic document mining and analysis is the unavailability of public datasets for evaluating and comparing proposed techniques with the state of the art. In this paper, we provide an overview of seven different specialized datasets derived from CiteSeerX (citeseerx.ist.psu.edu), a digital library and search engine with a focus on computer science publications. In some cases this data is contained in the CiteSeerX distribution, but is not readily available. In many cases, given the size of the data (more than 1.6 million documents at time of writing), it could also be quite difficult to generate these datasets due to lack of computing resources. These datasets can be made available for use by the research community, and it is hoped that they will help further the state of the art in academic data management and analysis.

2. DOCUMENT-ELEMENT SUMMARIZATION DATASET
A document-element is defined as an entity, separate from the running text of the document, that either augments or summarizes the information contained in the running text [2]. In academic documents, a number of document-elements are used for a variety of purposes, such as reporting and summarizing experimental results (plots, tables), describing a process (flow charts) or presenting an algorithm (pseudo-code). Given the importance of these document-elements, a number of document-element search engines have been proposed [4, 3, 10]. These search engines, however, only provide a thumbnail view of the document-elements and a small descriptive snippet, usually the element's caption. Oftentimes, the captions do not provide sufficient details to understand the content of the document-element. In our previous research [2, 1] we explored the problem of generating descriptive summaries of these document-elements that can help end-users understand their content without having to read the entire paper. For testing the proposed algorithms, a dataset consisting of 290 document-elements (163 figures, 78 tables and 49 algorithms) from 152 different Computer Science publications was prepared. The full text of each paper in the dataset is available. Further, a gold set of summaries of each document-element, as created by two human evaluators, is also available. This dataset can be used to further develop and evaluate document-element summarization systems.

3. CITATION GRAPH AND CITATION RECOMMENDATION DATASET
The citation recommendation dataset is compiled from the CiteSeerX citation graph and the metadata available for each paper indexed in CiteSeerX, as of December 2011. We define a citing paper as a paper for which we have access to its content and the reference list, and a citation as a paper that occurs in the reference list of at least one citing paper in the corpus, and for which we have access to its content, but may or may not have access to its reference list. In the CiteSeerX citation graph, there are 1,345,249 citing papers and 9,150,279 citations. The total number of links in the graph, i.e., [citing paper → citation], is 25,526,384. Figure 1 shows that the citations in the CiteSeer citegraph typically follow a Zipf distribution, i.e., only a few citations are cited by many citing papers, whereas the majority of them are cited rarely. Figure 2 shows the number of citations per citing paper, i.e., the size of the reference list. As can be seen in the figure, very few citing papers have a large number of citations, whereas for most of the citing papers the number of citations ranges between 8 and 32.

Figure 1: The citations in the CiteSeer citegraph follow a Zipf distribution, i.e., only a few citations are cited by many citing papers, whereas the majority of them are cited rarely.

Figure 2: Number of citations per citing paper in the CiteSeer citegraph.

From the CiteSeer citegraph and the available metadata, a smaller dataset is constructed for the task of citation recommendation by filtering out papers that:

• do not have title and abstract,
• have fewer than 5 or more than 200 citations, or
• cite fewer than 5 or more than 100 other papers.

In the resulting citation graph, there are 190,450 citations, 293,711 citing papers, and 2,839,455 links.

This dataset can be used for a variety of academic literature analysis tasks such as citation recommendation [7], studying research trends and identifying influential papers and authors.

4. AUTHOR NAME DISAMBIGUATION
Automatically identifying the author of a given scientific publication is a crucial pre-processing step in many bibliometric analysis tasks such as finding influential authors, finding researcher homepages and studying the temporal research interests of a given researcher [11]. The data for the disambiguated authors in CiteSeerX is available within the standard CiteSeerX databases. The dataset contains more than 300,000 unique authors found in the CiteSeerX digital library using the disambiguation algorithm proposed by Treeratpituk and Giles [11]. The dataset also provides the canonical name of each disambiguated author, his or her total number of papers and number of citations, homepage URL (if available), and the list of affiliations found to be associated with the author. Further, for each disambiguated author a unique key is provided that can be used to retrieve all the papers related to the author. In addition, the standard dataset used for evaluating the disambiguation algorithm by Treeratpituk and Giles [11] is also available. The dataset contains author records of 10 highly ambiguous names sampled from the CiteSeerX database. The most ambiguous name in the dataset contains 525 records corresponding to 99 unique authors.

5. OPEN ACCESS BOOK PROJECT
The aims of the Open Access Book Search project are:

• to provide search and navigation inside freely available online books,
• to extract and index metadata, hierarchy structure, table of contents and citations present in these books.

The dataset used in this project consists of 5945 books that are freely available online along with their associated PDF documents. The associated metadata consists of extracted text from the PDF files, book titles, authors, date of publishing, ISBN, page number, language and country of publication. It can be used to develop and evaluate information extraction and metadata extraction techniques, entity recognition algorithms for books, or for documents with heterogeneous format and structure.

6. ACKSEER – ACADEMIC ACKNOWLEDGMENT DATASET
AckSeer is a beta automatic acknowledgment indexing search engine that explores automatic identification of acknowledgments in academic documents [8, 9]. The system also extracts acknowledged entities (researcher names, funding agencies, etc.) from the acknowledgments and indexes them to enable search. Currently, AckSeer indexes acknowledgments from more than 500,000 papers in CiteSeerX. These acknowledgments contain more than 4 million acknowledged entities, approximately 2 million of them unique. Entity extraction is based on multiple state-of-the-art Named Entity Recognizers (NER). Acknowledged entities are ranked by citation.

Two datasets related to this project are publicly available. The first one includes manually tagged acknowledgments to measure the performance of the extractors. The second dataset contains the acknowledged entities as extracted by the named entity recognizers deployed in AckSeer. The latter can be used to study the distribution of acknowledged entities, and perhaps build a graph between the authors and the acknowledged entities.

7. COAUTHOR NETWORK DATASET
CollabSeer (http://collabseer.ist.psu.edu/) is a search engine for discovering potential collaborators for a given author [6, 5]. CollabSeer discovers potential collaborators by analyzing the structure of a user's coauthor network and research interests. Currently, CollabSeer supports three different network structure analysis modules for collaborator [...]

[...] mit URLs to crawl.

9. CONCLUSION
Though CiteSeerX is a popular data source for academic document research in data mining, entity disambiguation, information extraction, etc., there are other interesting data sets that can be used and extracted. We present here an overview of such data sets and discuss their possible use in academic document data analysis and management.

10. ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation and DTRA.

11. REFERENCES
[1] S. Bhatia, S. Lahiri, and P. Mitra. Generating synopses for document-element search. In CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 2003–2006, New York, NY, USA, 2009. ACM.
[2] S.
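As an illustration of the filtering step described in Section 3, the following Python sketch applies the three stated rules to a toy citation graph. It is not the code used to build the dataset. The function name, the metadata layout (a dictionary with "title" and "abstract" keys) and the link representation are hypothetical. It also assumes one particular reading of the rules: the "5 to 200 citations" bound is applied to papers in the citation role (how often they are cited), the "5 to 100 cited papers" bound is applied to papers in the citing role (the size of their reference list), and all counts are taken once on the unfiltered graph rather than recomputed after each removal.

```python
from collections import Counter


def filter_citation_graph(metadata, links,
                          cited_range=(5, 200), refs_range=(5, 100)):
    """Return (citing_papers, citations, kept_links) after filtering.

    metadata: dict mapping paper id -> {"title": str, "abstract": str}
              (hypothetical layout, assumed for this sketch).
    links:    iterable of (citing_id, cited_id) pairs.
    """
    links = set(links)  # drop duplicate links before counting

    # Counts are computed once, on the unfiltered graph (an assumption).
    times_cited = Counter(cited for _, cited in links)
    refs_made = Counter(citing for citing, _ in links)

    def has_text(pid):
        # Rule 1: the paper must have both a title and an abstract.
        meta = metadata.get(pid, {})
        return bool(meta.get("title")) and bool(meta.get("abstract"))

    # Rule 3, applied to the citing role: reference list size within bounds.
    citing_ok = {p for p, n in refs_made.items()
                 if has_text(p) and refs_range[0] <= n <= refs_range[1]}

    # Rule 2, applied to the citation role: times cited within bounds.
    cited_ok = {p for p, n in times_cited.items()
                if has_text(p) and cited_range[0] <= n <= cited_range[1]}

    # Keep only links whose two endpoints both survive.
    kept_links = {(a, b) for a, b in links
                  if a in citing_ok and b in cited_ok}
    citing_papers = {a for a, _ in kept_links}
    citations = {b for _, b in kept_links}
    return citing_papers, citations, kept_links
```

The two Counter objects also give the raw quantities behind Figures 1 and 2: times_cited holds how often each citation is cited, and refs_made holds the reference list size of each citing paper. Recomputing the counts after each removal until the graph stabilizes would be a different, stricter variant of the filter.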