SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web
Pradeep B. Teregowda (Pennsylvania State University), Isaac G. Councill (Google), Juan Pablo Fernández R. (Pennsylvania State University), Madian Khabsa (Pennsylvania State University), Shuyi Zheng (Pennsylvania State University), C. Lee Giles (Pennsylvania State University)

Abstract

SeerSuite is a framework for scientific and academic digital libraries and search engines built by crawling scientific and academic documents from the web, with a focus on providing reliable, robust services. In addition to full text indexing, SeerSuite supports autonomous citation indexing and automatically links references in research articles to facilitate navigation, analysis and evaluation. SeerSuite enables access to extensive document, citation, and author metadata by automatically extracting, storing and indexing metadata. SeerSuite also supports MyCiteSeer, a personal portal that allows users to monitor documents, store user queries, build document portfolios, and interact with the document metadata. We describe the design of SeerSuite and the deployment and usage of CiteSeerx as an instance of SeerSuite.

1 Introduction

Efficient and reliable access to the vast scientific and scholarly publications on the web requires advanced citation index retrieval systems [21] such as CiteSeer [22, 4], CiteSeerx [5], Google Scholar [9], ACM Portal [1], etc. The SeerSuite application framework provides a unique, advanced and automatic citation index system that is usable and comprehensive, and provides efficient access to scientific publications. To realize these goals our design focuses on reliable, scalable and robust services.

A previous implementation, CiteSeer (maintained as of this date), was designed to support such services. However, CiteSeer was a research prototype and, as such, suffered serious limitations. SeerSuite was designed to provide a framework that would replace CiteSeer and provide most of its functionality, while being extensible. SeerSuite improves on several aspects of the original CiteSeer, such as reliability, robustness and scalability. It does this by adopting a multi-tier architecture with a service orientation and a loose coupling of modules.

CiteSeerx, an instance of SeerSuite, is one of the top ranked resources on the web and indexes nearly one and a half million documents. The collection spans computer and information science (CIS) and related areas such as mathematics, physics and statistics. CiteSeerx acquires its documents primarily by automatically crawling authors' web sites for academic and research documents. CiteSeerx daily receives approximately two million hits and has more than two hundred thousand documents downloaded from its cache. The MyCiteSeer personal portal has over ten thousand registered users.

While the SeerSuite application framework has most of the functionality of CiteSeer, SeerSuite represents a complete redesign of CiteSeer. SeerSuite takes advantage of and includes state of the art machine learning methods and uses open source applications and modules. The structure and architecture of SeerSuite are designed to enhance the ease of maintenance and to reduce the cost of operations. SeerSuite is designed to run on off-the-shelf hardware infrastructure.

With CiteSeerx, SeerSuite focuses primarily on CIS. However, there are requests for similar focused services in other fields such as chemistry [33] and archaeology [37]. SeerSuite can be adapted to provide services similar to those provided by CiteSeerx in these areas. SeerSuite is a part of the ChemXSeer project [3], a digital library search engine and collaboration service for chemistry. Though designed as a framework for CiteSeerx-like digital libraries and search engines, the modular and extensible framework allows for applications that use SeerSuite components as stand-alone services or as a part of other systems.

2 Background and Related Work

Domain specific repositories and digital library systems have been very popular over the last decades, with several examples such as arXiv [23] for physics and RePEc [15] for economics. The Greenstone digital library [10] was developed with a similar goal. These repository systems, unlike CiteSeerx, depend on manual submission of document metadata, a tedious and resource expensive process. As such, CiteSeerx, which crawls authors' web pages for documents, is in many ways a unique system, closest in design to Google Scholar.

The popularity of services provided by the original CiteSeer and limitations in its design and architecture were the motivation behind the design of SeerSuite [18, 30]. The design of SeerSuite was an incremental process involving users and researchers. User input was received in the form of feedback and feature requests. Exchange of ideas between researchers and users across the world through collaborations, reviews and discussions has made a significant contribution to the design. In addition to ensuring reliable scalable services, portability of the overall system and components was identified as an essential feature that would encourage adoption of SeerSuite elsewhere. During the process of designing the architecture of SeerSuite, other academic repository content management system (CMS) architectures such as Fedora [28] and DSpace [8] were studied. The Blackboard architecture pattern [34] had a strong influence on the design of the metadata extraction system. The main obstacles in adopting existing repository and CMS systems were the levels of customization and the effort required to meet SeerSuite design requirements. More specifically, implementation of the citation graph structure with a focus on automatic metadata extraction and workflow requirements for maintaining and updating citation graph structures made these approaches cumbersome to use.

In addition to repository, search engine and digital library architectures, advances in metadata extraction methods [24, 25, 19, 27, 32] and the availability of open source systems have influenced SeerSuite design. We begin our discussion of SeerSuite by describing the architecture.

3 Architecture

In the context of SeerSuite, reliability refers to the ability of the framework instances to provide around the clock service with minimal downtime, scalability to the ability to support an increasing number of user requests and documents in the collection, and robustness to the ability of the instance to continue providing services while some of the underlying services are unavailable or resource constrained.

An outline of SeerSuite architecture is shown in figure 1. By adopting a loosely coupled approach for modules and subsystems, we ensure that instances can be scaled and can provide robust service. We describe the overall architecture in the web application, data storage, metadata extraction and ingestion, crawling and maintenance sections.

Figure 1: SeerSuite Architecture

3.1 Web application

The web application makes use of the model view controller architecture implemented with the Spring framework. The application presentation uses a mix of JavaServer Pages and JavaScript to generate the user interface. Design of the user interface allows the look and feel to be modified by switching Cascading Style Sheets (CSS). The use of the JavaServer Pages Standard Tag Library (JSTL) supports template based construction. The web pages allow further interaction through the use of JavaScript frameworks. While the development environment is mostly Linux, components are developed with portability as a focus. Adoption of the Spring framework allows development to concentrate on application design. The framework supports the application by handling interactions between the user and database in a transparent manner.

Servlets use Data Access Objects (DAO) with the support of the framework to interact with databases and the repository. The index, database and repository enable the data objects crawled from the web to be stored and accessed through the web, using efficient programmatic interfaces. The web application is deployed through a web application archive file.

3.2 Data storage

The databases store metadata and relationships in the form of tables providing transaction support, where transactions include adding, updating or deleting objects. The database design partitions the data storage across three main databases. This allows for growth in any one component to be handled by further partitioning the database horizontally.

The main database contains objects and is focused on transactions and version tracking of objects central to SeerSuite and digital library implementations. Objects stored in the main database include document metadata, author, citation, keywords, tags, and hub (URL).

final level contains metadata, text and cached files. The document identifier structure ensures that there are a nominal number of sub folders under any folder in the tree structure.

An index provides a fast, efficient method of storing and accessing text and metadata through the use of an inverted file. SeerSuite uses Apache Solr, an index application [16] supported by Apache Tomcat, to provide full text and metadata indexing operations. The metadata items are obtained from the database and the full text from the repository. The interaction of the application in the form of controllers is through the REST (Representational State Transfer) [20] interface.
This allows