Big Scholarly Data in CiteSeerX: Information Extraction from the Web Alexander G. Ororbia II Jian Wu Madian Khabsa Pennsylvania State University Pennsylvania State University Pennsylvania State University University Park, PA, 16802 University Park, PA, 16802 University Park, PA, 16802
[email protected] [email protected] [email protected] Kyle Williams C. Lee Giles Pennsylvania State University Pennsylvania State University University Park, PA, 16802 University Park, PA, 16802
[email protected] [email protected] ABSTRACT there are at least 114 million English scholarly documents 2 We examine CiteSeerX, an intelligent system designed with or their records accessible on the Web [15]. the goal of automatically acquiring and organizing large- In order to provide convenient access to this web-based scale collections of scholarly documents from the world wide data, intelligent systems, such as CiteSeerX, are developed web. From the perspective of automatic information extrac- to construct a knowledge base from this unstructured in- tion and modes of alternative search, we examine various formation. CiteSeerX does this autonomously, even lever- functional aspects of this complex system in order to in- aging utility-based feedback control to minimize computa- vestigate and explore ongoing and future research develop- tional resource usage and incorporate user input to correct ments1. automatically extracted metadata [28]. The rich metadata that CiteSeerX extracts has been used for many data min- ing projects. CiteSeerX provides free access to over 4 million Categories and Subject Descriptors full-text academic documents and rarely seen functionalities, e.g., table search. H.4 [Information Systems Applications]: General; H.3 In this brief paper, after a brief architectural overview of [Information Storage and Retrieval]: Digital Libraries the CiteSeerXsystem, we highlight several CiteSeerX-driven research developments that have enabled such a complex General Terms system to aid researchers' search for academic information.