
Articles

CiteSeerX: AI in a Digital Library Search Engine

Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea, Suppawong Tuarob, Alexander Ororbia, Douglas Jordan, Prasenjit Mitra, C. Lee Giles

CiteSeerX is a digital library search engine providing free access to more than 5 million scholarly documents, with nearly a million users and millions of hits per day. We present key AI technologies used in the following components: document classification and de-duplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5–6 years. We show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. We also present AI technologies, implemented in table and algorithm search, that are special search modes in CiteSeerX. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and search engines.

In 1997 CiteSeerX's predecessor, CiteSeer, was developed at the NEC Research Institute, Princeton, NJ. The service transitioned to the College of Information Sciences and Technology at the Pennsylvania State University in 2003. Since then, the project has been directed by C. Lee Giles. CiteSeer was the first digital library search engine to provide autonomous citation indexing (Giles, Bollacker, and Lawrence 1998). After serving as a public search engine for nearly eight years, CiteSeer began to grow beyond the capabilities of its original architecture. It was redesigned with a new architecture and new features, such as author and table search, and renamed CiteSeerX.

CiteSeerX is unique compared with other scholarly digital libraries and search engines: all of its documents are harvested from the public web.

Copyright © 2015, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602. FALL 2015.

This is different from arXiv, Harvard ADS, and PubMed, where papers are submitted by authors or pushed by publishers. Unlike Google Scholar and Microsoft Academic Search, where a significant portion of documents have only metadata (such as titles, authors, and abstracts) available, users have full-text access to all papers searchable in CiteSeerX. In addition, CiteSeerX keeps its own repository, which serves cached versions of papers even if their previous links are not alive any more. In addition to paper downloads, CiteSeerX provides automatically extracted metadata and citation context, which enables users to locate the relevant paragraphs and sentences. CiteSeerX provides all metadata through an OAI (Open Archives Initiative) service interface and on Amazon S3 (Amazon charges based on usage). Document metadata download service is not available from Google Scholar and only recently became available from Microsoft Academic Search. Finally, CiteSeerX performs automatic extraction and indexing on paper entities such as tables and figures, a capability rarely seen in other scholarly search engines. For all features, CiteSeerX extracts information from the PDFs of scholarly documents, since this is their most common format.

CiteSeerX also provides a digital library search engine framework that can be deployed on similar sites. This framework, called SeerSuite (Teregowda et al. 2010), has been under active development and applied to other digital libraries. Some services are also available online, such as text extraction (Williams et al. 2014a).

AI techniques are used in many CiteSeerX components, including document classification, de-duplication, automatic metadata extraction, author disambiguation, and more. Here, we describe the AI techniques used in these components and their performance. We also briefly discuss some AI techniques that are under active development.

CiteSeerX Overview

Figure 1 illustrates the top-level architecture of the CiteSeerX system. At the front end, all user requests are processed by the CiteSeerX web service, which is supported by three data servers. Searching requests are handled by the index server, cached full-text documents are provided by the repository server, and all metadata are retrieved from the database server. At the back end, the web crawler harvests PDF files across the web. These files are passed to the data-extraction module, where text contents are extracted and classified. Scholarly documents are then parsed, and the metadata, such as titles, authors, and abstracts, are extracted. The ingestion module writes all metadata into the database. The PDF files are renamed with document IDs and saved to the repository server. Finally, the index data are updated. This architecture was recently migrated from a physical machine cluster to a private cloud using virtualization techniques (Wu et al. 2014).

CiteSeerX extensively leverages open source software, which significantly reduces development effort. Red Hat Enterprise Linux (RHEL) 5 and 6 are the operating systems for all servers; heartbeat-ldirectord provides virtual IP and load-balancing services. Tomcat 7 is used for web service deployment on the web and indexing servers. MySQL is used as the database management system to store metadata. Apache Solr is used for the index, and the Spring framework is used in the web application.

Highlights of AI Technologies

In this section, we highlight four AI solutions that are leveraged by CiteSeerX and that tackle different challenges in the metadata extraction and ingestion modules (tagged C, E, D, and A in figure 1). Document classification (tag C) is used to separate academic papers from all PDF documents harvested by the web crawler; metadata extraction (tag E) is used to extract header and citation information automatically from academic papers; de-duplication (tag D) eliminates duplicate and near-duplicate academic papers based on their metadata and helps to construct the citation graph; author disambiguation (tag A) attempts to differentiate authors that share the same names, or to group name variants that belong to the same author. Author disambiguation enhances search accuracy by author names. It also enables better bibliometric analysis by allowing a more accurate counting and grouping of publications and citations. Overcoming these four challenges is critical to retrieving accurate and clean metadata used for searching and scientific studies. We also present AI technologies implemented in two specific search engines, table and algorithm search, developed on top of CiteSeerX.

Document Classification

Textual contents extracted from crawled documents are examined by a binary filter, which judges whether they are scholarly or not. Here scholarly documents include all conference papers, journal and transaction articles, technical reports, theses, and even slides and posters, as long as they are freely accessible online. The current filter uses a rule-based model, which looks for reference sections identified by key words such as references, bibliography, and their variants. While this appears simple, it performs surprisingly well, with 80–90 percent precision and 70–80 percent recall evaluated on manually labeled document sets. Here, the precision is defined as the number of correctly identified scholarly documents over all the documents examined by the filter; the recall is defined as the number of correctly identified scholarly documents over all scholarly documents in the labeled sample. The relatively low recall is caused by scholarly documents not containing any


Figure 1. Top Level Architecture of CiteSeerX.

of the key words/key phrases above, for example, invited talks, slides, posters, and others. In some papers, such sections are called literature cited or notes. The 10–20 percent of false positives are due to nonscholarly documents such as government reports, resumes, newsletters, and some PDF files simply containing these key words/key phrases in their text bodies.

We have developed a more sophisticated AI approach to increase the classification recall and precision (Caragea et al. 2014b). The new approach utilizes structural features to classify documents. We considered four types of structural features. File-specific features include file size and page count. Text-specific features include document length in terms of characters, words, and lines; the average number of words/lines per page; the reference ratio (the number of references and reference mentions divided by the total number of tokens in a document); the space ratio (the percentage of space characters); the symbol ratio (the percentage of words that start with nonalphanumeric characters); the length ratio (the length of the shortest line divided by the longest line); and the number of lines that start with uppercase letters and with nonalphanumeric characters. Section-specific features include whether the document has section headings such as abstract, introduction or motivation, conclusions, references or bibliography, and acknowledgments. Containment features are self-references such as this paper, this report, this thesis, and others.

To evaluate this new classification approach, we drew two random samples, one from the crawl repository (hereafter sample C) and one from the CiteSeerX production repository (hereafter sample P). The crawl repository is populated by the web crawler, which actively crawls the web (Wu et al. 2012) and downloads PDF documents. The production repository contains academic papers selected from the crawl repository based on certain criteria, along with associated metadata. Obviously, sample C has more diverse document types than sample P. The gold standard was generated by manually labeling documents in these two samples. The performance was evaluated with multiple classifiers, all of which are trained on the features above. The results on sample C indicate that the support vector machine (SVM) (Cortes and Vapnik 1995) achieves the highest precision (88.9 percent), followed by the logistic regression classifier (Bishop 2006), which achieves a slightly lower precision (88.0 percent). The naïve Bayes classifier achieves the highest recall (88.6 percent) at the sacrifice of a low precision (70.3 percent). Sample P yields similar results. In either sample, the classifiers based on the structural features significantly outperform the baseline in terms of precision, recall, and accuracy. We find the top three informative features for sample C are reference ratio, appearance of reference or bibliography, and document length; the top three for sample P are page count, number of words, and number of characters. We are working toward extending this binary classifier so that it can classify documents into multiple categories, including papers, reports, books, slides, posters, and resumes. Documents in these categories can be aligned on topics and displayed on the same web page, which helps users to retrieve information more efficiently.
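As a rough illustration of the structural features above, the sketch below computes a handful of them (reference ratio, space ratio, symbol ratio, and a reference-section flag) from extracted text. The function and feature names are illustrative rather than taken from the CiteSeerX code base; the real system feeds such feature vectors to trained SVM, logistic regression, or naïve Bayes classifiers.

```python
import re

def structural_features(text: str, file_size: int, page_count: int) -> dict:
    """Compute a few of the structural features described in the text.
    Names and exact definitions here are illustrative."""
    tokens = text.split()
    n_tokens = max(len(tokens), 1)
    n_chars = max(len(text), 1)
    # Reference mentions such as [12]; a crude stand-in for the
    # reference counting used by the real feature extractor.
    ref_mentions = len(re.findall(r"\[\d+\]", text))
    has_ref_section = bool(
        re.search(r"(?i)^\s*(references|bibliography)\s*$", text, re.MULTILINE)
    )
    return {
        "file_size": file_size,
        "page_count": page_count,
        "num_chars": len(text),
        "num_words": len(tokens),
        "num_lines": len(text.splitlines()),
        "reference_ratio": ref_mentions / n_tokens,
        "space_ratio": text.count(" ") / n_chars,
        "symbol_ratio": sum(not t[0].isalnum() for t in tokens) / n_tokens,
        "has_reference_section": has_ref_section,
    }
```

A classifier trained on vectors like these can then separate, say, a paper (high reference ratio, reference heading present) from a newsletter that merely mentions the word references in its body.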


Metadata Extraction

Document metadata is extracted from textual content extracted from PDF files crawled from the web. This step is a prerequisite for documents to be indexed and clustered. The header metadata includes 15 fields: title, authors, affiliation, address, note, email, date, abstract, introduction, phone, key word, web, degree, publication number, and page information. The citation metadata basically contains the same fields as the header, but it has to be located and parsed by a different algorithm due to its special format.

Header Extraction

Header extraction is performed using SVMHeaderParse (Han et al. 2003), an SVM-based header extractor. The idea is to classify textual contents into multiple classes, each of which corresponds to a header metadata field. Although regular expressions and rule-based systems do not require any training models and are generally faster, they depend on the application domain and a set of rules or regular expressions that can only be set by domain experts. SVMs are well known for their generalization performance in handling high-dimensional data. In SVMHeaderParse, the traditional binary SVM is extended to a multiclass classifier. The whole process can be divided into three phases: feature extraction, line classification, and metadata extraction.

SVMHeaderParse first extracts word-specific and line-specific features from textual content. The word-specific extraction is performed using a rule-based, context-dependent word clustering method. The rules are extracted from various domain databases, including the standard Linux dictionary, Bob Baldwin's collection of 8441 first names and 19,613 last names, Chinese last names, U.S. state names and Canada province names, USA city names, country names, and month names. We also construct domain databases from training data: affiliation, address, degree, pubnum, note, abstract, introduction, and phone. Text orthographic properties are also extracted, for example, capitalization. The domain databases' words and bigrams are then clustered based on their properties. For example, an author line Chungki Lee James E. Burns is represented as Cap1NoneDictWord: :MayName: :Mayname: :SingleCap: :MayName: after word clustering.

Line-specific features are also extracted, such as the number of words a line contains, the line number, the percentages of dictionary and nondictionary words, the percentage of date words, and the percentage of numbers in the line. There are also features representing the percentages of class-specific words in a line, for instance, the percentage of affiliation words, address words, date words, degree words, phone words, publication number words, note words, and page number words in a line.

The line classification algorithm includes two steps: independent line classification and contextual line classification. In step 1, 15 classifiers (corresponding to the 15 header fields) are trained on 15 labeled feature vector sets, each of which is generated by collecting all feature vectors labeled as a certain class. The goal is to classify text lines into a single class or multiple classes. In the second step, the classification is improved by taking advantage of the context around a line. Specifically, the class labels of the N lines before and after the current line L are encoded as binary features and concatenated to the feature vector of line L formed in step 1. A line classifier is then trained on these labeled feature vectors with additional contextual information. Text lines are then reclassified by the contextual classifiers. This process is repeated such that, in each iteration, the feature vector of each line is extended by incorporating the neighborhood label information predicted in the previous iteration, and it converges when the percentage of lines with new class labels falls below a threshold.

The key to extracting metadata from classified lines is identifying information chunk boundaries. Here, we focus on author name extraction. While it is relatively easy to identify chunks in punctuation-separated multiauthor lines, it is challenging to identify chunks in space-separated multiauthor lines. First, we generate all potential name sequences based on predefined name patterns; second, each name sequence is manually labeled; third, an SVM classifier is trained on the labeled sample; finally, the potential name sequences are classified, and the one with the highest score is predicted as the correct name sequence. An example is presented in figure 2. The core software we use for SVM training and testing is SVMlight (Joachims 1999).

To evaluate the extractor, we use a data set containing 935 labeled headers of computer science papers (Seymore, McCallum, and Rosenfeld 1999 [hereafter S99]). The overall accuracy is 92.9 percent, which is better than the accuracy (90 percent) reported by S99. Specifically, the accuracies of the author, affiliation, address, and publication number classes are improved by 7 percent, 9 percent, 15 percent, and 35 percent, respectively. Note that the evaluation above was based on a set of high-quality extracted text files. In real situations, the extracted metadata may be noisy if the input text files are poorly extracted from the original PDF files. In addition, the design of the current classifier is optimized for computer science papers. It does not necessarily perform equally well for other subject domains such as medical science, physics, and chemistry. We are planning a new extractor that autonomously chooses a domain-dependent model to extract header metadata.

Citation Extraction

CiteSeerX uses ParsCit (Councill, Giles, and Kan 2008) for citation extraction, which is an implementation of a reference string parsing package.


score label possible name sequence

1.640 ACCEPT Alan Fekete | David Gupta | Victor Luchangco | Nancy Lynch | Alex Shvartsman

0.900 REJECT Alan Fekete | David Gupta | Victor Luchangco Nancy | Lynch Alex Shvartsman

0.006 REJECT Alan Fekete | David Gupta Victor | Luchangco Nancy | Lynch | Alex Shvartsman

Figure 2. Example of Potential Name Sequence.
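The candidate-generation step behind figure 2 can be sketched as follows: enumerate every way to segment a space-separated author line into consecutive two- or three-token names. The two-to-three-token rule is a stand-in for the predefined name patterns mentioned in the text; the real system then scores each candidate sequence with a trained SVM and keeps the highest-scoring one.

```python
def candidate_name_sequences(tokens, min_len=2, max_len=3):
    """Enumerate segmentations of an author line into consecutive
    names of min_len..max_len tokens each. Segmentations that leave
    leftover tokens are discarded."""
    if not tokens:
        return [[]]  # one valid (empty) segmentation
    sequences = []
    for k in range(min_len, max_len + 1):
        if k <= len(tokens):
            head = " ".join(tokens[:k])
            for rest in candidate_name_sequences(tokens[k:], min_len, max_len):
                sequences.append([head] + rest)
    return sequences
```

For the line in figure 2, this enumeration would produce both the accepted segmentation (all two-token names) and rejected ones such as Victor Luchangco Nancy | Lynch Alex Shvartsman, leaving the choice to the scorer.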

A. Bookstein and S. T. Klein, <title>Detecting content-bearing words by serial clustering,</title> Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 319–327, 1995.

Figure 3. A Labeled Reference String. Each green (gray) bracket represents a token.

The core of ParsCit is a conditional random field (CRF) (Lafferty et al. 2001) model used to label the token sequences in reference strings. This core is wrapped by a heuristic model with added functionality to identify reference strings in plain text files and to retrieve citation contexts. We summarize the learning model of ParsCit next.

In the reference section of a paper, each reference string can be viewed as a set of fields (for example, author, title, year) with punctuation or spaces as delimiters. We break this string down into a sequence of tokens, each of which is assigned a label from a set of classes (figure 3). To classify a token, we can make use of any information derived from the reference string, including previous classification results. We encode separate features for each token using multiple types of features, including orthographic case, presence of punctuation, numeric properties, token location within the reference string, and whether the token is a possible publisher name, a surname, a month name, and others.

ParsCit attempts to find the reference section before parsing the reference strings. For generality, ParsCit is designed to be independent of specific formatting and finds reference strings using a set of heuristics, given a plain UTF-8 text file. It first searches for sections labeled references, bibliography, or common variations of these key phrases. If it finds a label too early in the document, it seeks subsequent matches. The final match is considered the starting point of the reference section. The ending point is found by searching for subsequent section labels such as appendices, acknowledgments, or the document end.

The next phase is to segment individual reference strings. We do this by constructing a number of regular expressions matching common marker styles, for example, [1] or 1., then counting the number of matches to each expression in the reference section text. The regular expression with the greatest number of matches is selected. If no reference string markers are found, several heuristics are used to decide where individual references start and end, based on the length of previous lines, strings that appear to be author name lists, and ending punctuation.

The next step is to tag individual reference strings automatically using the CRF model. Each tagged field is normalized into a standard representation. This makes the structure of the metadata more uniform and easier to analyze later.

ParsCit also extracts citation context, based on the reference markers discovered above, by scanning the body text and locating citations that match a particular reference string. Citation context allows a user to quickly and easily see what other researchers say about an article of interest. For marked citation lists, the matching expression can be a square bracket or other parenthetical expression. For unmarked citation lists (such as Wu and Giles [2012]), the expression is constructed using author last names.

The evaluations were based on three data sets: Cora (S99), CiteSeerX, and the CS data set (Cortez et al. 2007).
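The marker-style selection described above (count matches per candidate regex, keep the style with the most matches) can be sketched as follows. The two patterns shown are illustrative; ParsCit's actual pattern set is larger.

```python
import re

# Candidate reference-marker styles; illustrative, not ParsCit's full set.
MARKER_STYLES = {
    "square": re.compile(r"^\s*\[\d+\]", re.MULTILINE),  # [1] Author ...
    "dotted": re.compile(r"^\s*\d+\.\s", re.MULTILINE),  # 1. Author ...
}

def dominant_marker_style(reference_text: str):
    """Pick the marker style with the most matches in the reference
    section; None means fall back to the markerless heuristics."""
    counts = {name: len(p.findall(reference_text))
              for name, p in MARKER_STYLES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

When this returns None, a real implementation would fall back to the line-length, author-list, and punctuation heuristics described above.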


The results show that the performance of ParsCit is comparable to the original CRF-based system in Peng and McCallum (2006). ParsCit outperforms FLUX-CiM (Cortez et al. 2007), an unsupervised reference string parsing system, on the key common fields of author and title. However, ParsCit makes some errors in not segmenting volume, number, and pages, as ParsCit currently does not tokenize beyond white spaces (for example, 11(4):11-22 versus 11 (4) 11 - 22). It would be desirable to incorporate some preprocessing heuristics into ParsCit to correct such errors.

Document De-Duplication and Citation Graph

Duplication is rare in submission-based digital libraries, such as arXiv and PubMed. For a crawl-based digital library, such as CiteSeerX, document duplication is inevitable and should be handled intelligently. Two types of duplication are considered: bitwise and near duplication.

Bitwise duplicates are documents that are identical in all bits. Web crawlers may download the same document in two crawl batches, but the system should keep only the first copy. Bitwise duplicates have the same cryptographic hash values, such as SHA-1, so as long as a new document has exactly the same SHA-1 as an existing one in the database, it is dropped immediately.

Near duplicates (NDs) refer to documents with similar content but minor differences, for example, a preprint and a published version of the same paper. Intuitively, NDs can be detected if two papers have the same title and authors. However, matching title/author strings directly is inefficient because it is sensitive to the text extraction and parsing results, which are noisy in general.

NDs in CiteSeerX are detected using a key-mapping algorithm, applied after the metadata extraction module but before papers are ingested. Most papers we crawled do not contain complete publication information, that is, journal name, year, and page numbers, so we have to rely on title and author information, which appear in almost all papers. When a document is imported, a set of keys is generated by concatenating normalized author last names and normalized titles. For example, two keys are generated for the paper Focused Crawling Optimization by Jian Wu: wu_focuscrawloptimize and wu_crawloptimize. The first key is matched against all keys in the key-map database table. If it is not found, a new document cluster is created with the above two keys, and the metadata extracted from this paper is imported as the metadata of the new cluster. If it matches an existing key in the table, this paper is merged into the cluster owning that key. In either case, the two keys above are written into the key-map table as cluster keys. If a paper contains multiple authors, additional keys are generated using the second author's last name. Note that an offset title is used in key generation, that is, crawloptimize, in which the first word is removed from the original title. These provide alternative keys for paper matching.

An alternative to the key-mapping algorithm would be a full-text-based approach. This approach would not be susceptible to errors in metadata extraction; however, it could still suffer from other shortcomings.

We conducted an experiment to evaluate the performance of the key-mapping algorithm and compare it with the state-of-the-art simhash approach (Charikar 2002). In the experiment, 100,000 documents were randomly sampled from the CiteSeerX collection. Documents containing fewer than 15 tokens were filtered out after standard information retrieval preprocessing such as stemming and removing stop words, resulting in a collection of 95,558 documents. Documents belonging to the same cluster were grouped. The precision was estimated by manually inspecting a random sample of 20 document pairs belonging to the same cluster. The recall was estimated based on a set of 360 known ND pairs from a previous study (Williams and Giles 2013). We checked each of these 360 pairs to determine whether they had been assigned to the same cluster, and recall was calculated as the proportion of the 360 pairs that had been. Table 1 compares the performance of the two algorithms. The precision and recall of the simhash algorithm are quoted from Williams and Giles (2013).

Algorithm      Precision   Recall
key-mapping    0.65        0.67
simhash        0.94        0.88

Table 1. Performance Comparison of ND Algorithms.

As can be seen from table 1, the simhash method outperforms the key-mapping method. The main reason is that the key-mapping method is largely dependent on the quality of the author and title extraction, since any errors could result in incorrect keys. The results are also only preliminary due to the relatively small sample size, and more analysis is needed before strong conclusions can be drawn. Indeed, there are cases where the key-mapping approach works better. For instance, given a pair consisting of a paper and an extended version of that paper, full-text-based methods may not consider them NDs due to there potentially being large differences in their text content. The key-mapping approach, however, may still group the papers together, as they may have the same title and authors. Whether or not this is desirable will vary depending on the application.
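A minimal sketch of the key-generation step of the key-mapping algorithm follows, assuming simple lowercasing and punctuation stripping. The example keys in the text (wu_focuscrawloptimize) suggest CiteSeerX also normalizes or stems individual title words; this sketch omits that step, so its keys are longer than the real ones.

```python
import re

def _norm(s: str) -> str:
    """Lowercase and drop nonalphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def cluster_keys(first_author_last_name: str, title: str) -> list:
    """Build the two cluster keys described above: last name plus
    normalized title, and last name plus the offset title (the title
    with its first word removed)."""
    words = title.split()
    last = _norm(first_author_last_name)
    return [
        f"{last}_{_norm(''.join(words))}",
        f"{last}_{_norm(''.join(words[1:]))}",
    ]
```

At ingestion time, the first key would be looked up in the key-map table; a miss creates a new cluster with both keys, and a hit merges the paper into the cluster owning the matching key.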


Aside from full-text documents, a document cluster may also contain citations, which typically contain title, author, venue, and year information. When a citation is parsed from a paper, it is merged into a document cluster in the same way as a document. The concept of a document cluster integrates papers and citations, making it more accurate to perform statistical calculations, ranking, and network analysis. The citation graph is naturally generated on top of document clusters; each directional edge represents a citation relationship. Document clusters are modified when a user correction occurs. When a user corrects the metadata of a paper, it is removed from its previous cluster and put into a new cluster based on the corrected metadata. The previous cluster is deleted if it becomes empty.

Author Disambiguation

In addition to document search, another important function of CiteSeerX is author search, which enables users to find an author's basic information and previous publications. Author search is also the foundation of several other services we provide, such as collaborator search (Chen et al. 2011) and expert search (Chen et al. 2013).

To search for an author, a typical query string is the author's name. However, processing a name-based query is complex. First, different authors may share the same name. It is not straightforward to decide whether the John Smith of one paper is the same as the one in another paper. Second, one author may have several name variations. For instance, Dr. W. Bruce Croft could be written as W. B. Croft or Bruce Croft. The ambiguity of names is a common issue for most digital libraries.

To disambiguate authors, one could define a distance function between two authors and identify the similarity between a pair of authors based on this distance function. However, a digital library usually contains millions of papers and tens of millions of un-disambiguated authors, so it is infeasible to compare every pair of authors. To reduce the number of comparisons, CiteSeerX groups names into small blocks and assumes that an author can only have name variations within the same block; thus, we only need to check name pairs within the same block. CiteSeerX groups two names into one block if the last names are the same and the first initials are the same.

In many cases, simple name information alone is insufficient for author disambiguation. Other information related to authors is used, including, but not limited to, their affiliations, emails, collaborators (coauthors), and contextual information, such as the key phrases and topics of their published papers. CiteSeerX relies on this additional information for author disambiguation. For example, if we found that the John Smith of paper A and the J. Smith of paper B belong to different affiliations, then the two Smiths are less likely to be the same person. However, if both papers discuss similar topics and have a similar set of authors, the two names are very likely to refer to the same person. Specifically, CiteSeerX selects 12 features for two target authors of two different papers, including the similarity between the two authors' first names, the similarity between their middle names, the authoring orders in the two papers, the similarity between the emails of the first authors, the similarity between the affiliations of the first authors in the two papers, the similarity between the author names in the two papers, and the similarity between the titles of the two papers.

While we can classify authors by applying a supervised learning approach on the aforementioned features, such a classification may output results that violate the transitivity principle. For example, even if author A and author B are classified as one person, and author B and author C are also classified as one person, author A and author C may be classified as two different persons. By applying density-based spatial clustering of applications with noise (DBSCAN), a clustering algorithm based on the density reachability of data points, CiteSeerX resolves most of these inconsistent cases (Huang, Ertekin, and Giles 2006). The remaining small portion of ambiguous cases are those located at cluster boundaries. These authors are difficult to disambiguate even manually, due to insufficient or incorrectly parsed author information.

We have compared the accuracy of several supervised learning approaches on the author disambiguation problem, including random forest, support vector machine, logistic regression, naïve Bayes, and decision trees. We found that the accuracy achieved by the random forest significantly outperforms the other learning methods (Treeratpituk and Giles 2009), and its model can be trained within a reasonable period. Thus, CiteSeerX applies a random forest on the 12 features listed above for author disambiguation.

Table Extraction

CiteSeerX has partially incorporated a special mode (Liu et al. 2007) to search tables in scholarly papers. Tables are ubiquitous in digital libraries. They are widely used to present experimental results or statistical data in a condensed fashion and are sometimes the only source of that data. However, it can be very difficult to extract these tables from PDF files due to their complicated formats and schema specifications, and even more difficult to extract the data in the tables automatically. CiteSeerX uses the table metadata extractor developed by Liu et al. (2007), which comprises three major parts: a text information stripper, a table box detector, and a table metadata extractor.

FALL 2015 41 Articles

Figure 4. An Example of the Box-Cutting Result. The figure shows an article page divided into 15 numbered boxes (1–15).

The text information stripper extracts the textual information from the original PDF files word by word, by analyzing the output of a general text extractor such as PDFBox or PDFLib TET. These words are then reconstructed with their position information and written into a document content file, which specifies the position, line width, and fonts of each line. An example line in the file is

[X,Y]=[31,88] Width=[98] font=[14] Text=[This paper]

Based on the document content file, the tables are identified using a box-cutting method, which attempts to divide all literal components in a page into boxes. Each box is defined as a rectangular region containing adjacent text lines with a uniform font size. The spacing between two text lines in a box must be less than the maximal line spacing between two adjacent lines in the same paragraph. Figure 4 illustrates the result of the box-cutting method when applied to an article page. These boxes are then classified into three categories: small-font, regular-font, and large-font, depending on how their font sizes compare to the font size of the document body text. Of course, other methods can be considered.
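The box-cutting step can be sketched as follows. This is a simplified reconstruction under the rules just stated (uniform font size, bounded line spacing), not the TableSeer code itself, and the function names are ours:

```python
def cut_boxes(lines, max_line_spacing):
    """Group consecutive text lines into boxes: a box holds adjacent
    lines with a uniform font size whose vertical gaps stay below the
    maximal in-paragraph line spacing.

    Each line is a dict such as {"y": 88, "font": 14}, mirroring the
    document content file format shown above."""
    boxes, current = [], [lines[0]]
    for prev, line in zip(lines, lines[1:]):
        same_font = line["font"] == prev["font"]
        close = abs(line["y"] - prev["y"]) <= max_line_spacing
        if same_font and close:
            current.append(line)
        else:
            boxes.append(current)
            current = [line]
    boxes.append(current)
    return boxes

def classify_box(box, body_font):
    """Label a box small-font, regular-font, or large-font relative to
    the font size of the document body text."""
    size = box[0]["font"]
    if size < body_font:
        return "small-font"
    return "regular-font" if size == body_font else "large-font"
```

Table captions are typically set in a smaller font than the body, which is why the detector described next scans the small-font boxes first.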

42 AI MAGAZINE Articles

The next step is to find tables and their metadata in these boxes (Liu et al. 2006). First, the algorithm generates table candidates by matching small-font boxes against a list of key words, such as Table and TABLE. If it detects a tabular structure from the white-space information, it confirms the box as a real table. If the algorithm does not find any table candidates in the first iteration, it starts a second iteration on regular-font boxes and then on large-font boxes. Most tables are in fact detected among small-font boxes. Figure 4 shows that Box 12 is detected to contain the key word Table. Its neighbor, Box 13, is confirmed as the table body.

Work in this area is still of interest. The ICDAR conference has held table competitions for several years with the goal of improving extraction precision and recall.

Algorithm Extraction and Indexing

Recently AlgorithmSeer, a prototype of the algorithm search mode of CiteSeerX (Tuarob, Mitra, and Giles 2014), has been designed. In this section, we focus on the AI technologies used in algorithm extraction and indexing.

Algorithms are ubiquitous in computer science and related fields. They offer stepwise instructions for solving computational problems. With numerous new algorithms developed and reported every year, it would be useful to build systems that automatically identify, extract, index, and search this ever increasing collection of algorithms. Such systems could be useful to researchers and software developers looking for cutting-edge solutions to their problems.

A majority of algorithms in computer science documents are represented as pseudocodes. We developed three methods for detecting pseudocodes in scholarly documents: a rule-based, a machine-learning based, and a combined method. All methods process textual content extracted from PDF documents. The rule-based method detects the presence of pseudocode captions using a set of regular expressions that capture captions containing at least one algorithm key word or its variants, for example, algorithm and pseudocode. This method yields high detection precision (87 percent) but low recall (45 percent), because a sizable fraction of pseudocodes (roughly 26 percent) do not have associated captions.

The machine-learning method directly detects the presence of pseudocode content. It relies on the assumption that pseudocodes are written in a sparse, programming-like manner, which can be visually spotted as sparse regions in documents. Our experiments show that a sparse region contains at least four consecutive lines whose nonspace characters number less than 80 percent of the average number of characters per line. We use 47 features to characterize each sparse region, including a combination of font-style based, context based, content based, and structure based features. The machine-learning method can capture most noncaptioned pseudocodes, but some still remain undetected. Such pseudocodes are either written in a descriptive manner or are presented as figures.

The combined method combines the benefits of both the rule-based and machine-learning methods through a set of heuristics in two steps: (1) for a given document, run the rule-based and the machine-learning methods; (2) for each pseudocode box detected by the machine-learning method, if a pseudocode caption detected by the rule-based method is in proximity, the pseudocode box and the caption are combined. The combined method achieves a precision of 87 percent and a recall of 67 percent (Tuarob et al. 2013).

Indexable metadata of an algorithm includes the caption, reference sentences, and a synopsis. The former two are directly extracted from the original text. The synopsis is a summary of an algorithm, which is generated using a supervised learning approach to rank and retrieve the top relevant sentences in the document (Bhatia et al. 2011). The synopsis generation method employs a naïve Bayes classifier to learn features from the captions and references. The features include content-based and context-based features (Tuarob et al. 2013). Again, algorithm search is still work in progress with much left to do at this time.

Development and Deployment

Although CiteSeerX utilizes many open source software packages, many of the core components are designed and coded by CiteSeerX contributors. The current CiteSeerX codebase inherited little from its predecessor's (CiteSeer) codebase. The core parts of the main web apps were written by Isaac Councill and Juan Pablo Fernández Ramírez. Other components were developed by graduate students, postdocs, and software engineers over at least three to four years.

Usage and Payoff

CiteSeer started running in 1998 and its successor CiteSeerX has been running since 2008. Since then, the document collection has been steadily growing (see figure 5). The goal of CiteSeerX is to improve the dissemination of and access to scholarly and scientific literature. Currently, CiteSeerX has registered users from around the world and is hit more than 2 million times a day. The download rate is about 3–10 PDF files per second (Teregowda, Urgaonkar, and Giles 2010). Besides the web search, CiteSeerX also provides an OAI protocol for metadata harvesting in order to facilitate content dissemination. By programmatically accessing the CiteSeerX OAI harvest URL, it is possible to download metadata for all CiteSeerX papers. We receive about 5000 OAI requests each month on average.
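The two pseudocode detectors described earlier in this section can be sketched roughly as below. The regular expression and thresholds follow the description in the text (caption key words and their variants; at least four consecutive lines with nonspace characters under 80 percent of the average), while the function names and the exact regex are ours:

```python
import re

# Caption test in the spirit of the rule-based detector: does a line
# contain an algorithm key word or a variant of it?
CAPTION_RE = re.compile(r"\b(algorithm|pseudo-?code)\b", re.IGNORECASE)

def has_pseudocode_caption(line):
    return bool(CAPTION_RE.search(line))

def sparse_regions(lines, min_lines=4, ratio=0.8):
    """Find runs of at least `min_lines` consecutive lines whose
    nonspace character counts fall below `ratio` times the average
    line length -- the sparse, programming-like regions that the
    machine-learning detector starts from."""
    avg = sum(len(l.replace(" ", "")) for l in lines) / len(lines)
    regions, run = [], []
    for i, l in enumerate(lines):
        if len(l.replace(" ", "")) < ratio * avg:
            run.append(i)
        else:
            if len(run) >= min_lines:
                regions.append((run[0], run[-1]))
            run = []
    if len(run) >= min_lines:
        regions.append((run[0], run[-1]))
    return regions
```

In the actual system each sparse region would then be scored with the 47 font-style, context, content, and structure features before classification.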


Figure 5. Changes in the CiteSeerX Document Collection Since 2008. The indexed counts reflect unique papers without ND. (Bar chart of documents, in millions, by year from 2008 to 2014.)

Researchers are interested in more than just our metadata but also the actual documents. CiteSeerX usually receives a dozen or so data requests per month through the contact form on the CiteSeerX website (Williams et al. 2014b). Those requests include graduate students seeking project data sets and researchers looking for large data sets for experiments. For these requests, dumps of our database are available on Amazon S3. This alleviates our cost of distributing the data, since users pay for downloads.

The CiteSeerX (and CiteSeer) data have been used in much novel research work. For example, the CiteSeer data was used to predict the ranking of computer scientists (Feitelson and Yovel 2004). In terms of algorithm design, Councill, Giles, and Kan (2008) used CiteSeerX data along with two other data sets to evaluate the performance of ParsCit. Madadhain et al. (2005) used CiteSeerX as a motivating example for JUNG, a software library to manipulate, analyze, and visualize data that can be represented as a graph or network. Pham, Klamma, and Jarke (2011) studied the knowledge network created at the journal/conference level using citation linkage to identify the development of subdisciplines based on the combination of the DBLP and CiteSeerX data sets. Recently, Caragea et al. (2014a) generated a large cleaned data set by matching and merging CiteSeerX and DBLP metadata. This data set contains cleaned metadata of papers in both CiteSeerX and DBLP, including citation contexts, which is useful for studying ranking and recommendation systems (Chen et al. 2011). Other data sets have also been created (Bhatia et al. 2012).

Previously, the majority of CiteSeerX papers were from the computer and information sciences. Recently, a large number of papers have been crawled and ingested from mathematics, physics, and medical science by incorporating papers from open-access repositories, such as PubMed (subset), and crawling URLs released by Microsoft Academic Search. It is challenging to extract metadata from noncomputer science papers due to their different header and citation formats. As a result, the current extraction tools need to be revised to be more general and efficient for papers in new domains. We are working to implement a conditional random field (CRF)–based metadata extraction tool.
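As a sketch of what a CRF-based header extractor consumes, each line of a paper's header can be turned into a feature dictionary for a sequence labeler (with labels such as title, author, affiliation, or email). The feature set below is illustrative only; these are plausible header features and names of our own, not the actual tool's feature set:

```python
def line_features(line, position):
    """Per-line features of the kind a CRF-based header extractor
    could feed to a sequence labeler. Feature names are illustrative."""
    words = line.split()
    return {
        "position": position,                # header fields sit near the top
        "num_words": len(words),
        "has_email": "@" in line,            # strong cue for the email field
        "has_digit": any(c.isdigit() for c in line),
        "title_case_ratio": (
            sum(w[:1].isupper() for w in words) / len(words) if words else 0.0
        ),
    }

feats = line_features("Jian Wu and C. Lee Giles", position=1)
```

A CRF then labels the whole line sequence jointly, so that, for example, an affiliation line is more likely when it follows an author line.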

This tool will replace SVMHeaderParse, and we will rebuild the entire production database and repository. We expect this to encourage users from multiple disciplines to search and download scholarly papers from CiteSeerX, and to be useful for studying cross-discipline citation and social networks.

In addition to increasing the collection size, CiteSeerX also strives to increase metadata quality. For example, we are using multiple data-cleaning techniques to sanitize and correct wrong metadata. We are also developing new algorithms to improve the quality of text and metadata extraction. Users will soon see cleaner and more accurate descriptions and more reliable statistics.

Besides data services, CiteSeerX has released the digital library search engine framework SeerSuite (Teregowda et al. 2010), which can be used for building personalized digital library search engines. As far as we know, at least five other SeerSuite instances are running in the world.

Main Design Concepts

While RHEL is the major development environment, the application uses Java as the main programming language for portability. The architecture of the web application is implemented within the Spring framework, which allows developers to concentrate on application design rather than on access methods connecting to applications, for example, databases. The application presentation uses a mixture of Java server pages and JavaScript to generate the user interface. The web application is composed of servlets, which are Java classes used to extend the capabilities of web servers (that is, Tomcat) by responding to HTTP requests. These servlets interact with the indexing and database servers for key word search and for generating document summary pages. They also interact with the repository server to serve cached files.

We partition the entire data set across three databases. The main database contains document metadata and is focused on transactions and version tracking. The second database stores the citation graphs, and the third stores user information, queries, and document portfolios. The databases are deployed using MySQL and are driven by InnoDB for its fully transactional features with rollback and commit.

Metadata extraction methods are built on top of a Perl script, with a multithreaded wrapper in Java. The wrapper is responsible for acquiring documents from the crawl database and repository, as well as for controlling jobs. The Perl script assembles multiple components, such as text extraction, document classification, and header and citation extraction. It works in an automatic mode, in which each thread keeps processing new batches until it receives a stop command. It is fast enough to extract one document per second on average. The ingestion system is written in Java and contains methods to interact with the extraction server, the database, the Solr index, and the repository servers. The crawl document importer middleware is written in Python and uses the Django framework. The name disambiguation module is written in Ruby and uses the Ruby on Rails framework.

The current CiteSeerX codebase has been designed to be modular and portable. Adding new functionality, such as other database or repository types, is as easy as implementing more classes. In addition, the use of Java beans enables CiteSeerX to dynamically load objects in memory. One can specify which implementation of a class to use, allowing for more flexibility at runtime.

Development Cost

CiteSeerX is not just a digital library that allows users to search and download documents from a large database and repository. It encapsulates and integrates AI technologies designed to optimize and enhance the document acquisition, extraction, and searching processes. The implementation of these AI technologies makes CiteSeerX unique and more valuable, but it also makes the cost of rebuilding it difficult to estimate. An estimation using the SLOCCount software suggests that the total cost to rewrite just the core web app Java code from scratch might be as high as $1.5 million. Table 2 lists the estimated effort, time, and cost in the default configuration.

Description                               Estimation
Total physical source lines of code       44,756
Development effort (person-years)         10.82
Schedule estimate (years)                 1.32
Estimated average number of developers    8.18
Total estimated development cost*         $1,462,297

Table 2. Estimation of Effort, Time, and Cost in the SLOCCount Default Configuration.

Table 2 implies that it could take a great deal of effort to build a CiteSeer-like system from scratch. Looking at these challenges in the context of AI, one would face numerous obstacles. To begin with, all the machine-learning algorithms need to be scalable enough to support processing hundreds of thousands of documents per day. For example, the document classifier should not be too slow compared to the crawling rate. Similarly, the information extraction part must not be a bottleneck in the ingestion pipeline. However, having fast machine-learning algorithms should not compromise accuracy. This also extends to citation matching and clustering, which compares millions of candidate citations.
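For reference, Table 2's numbers can be reproduced from SLOCCount's default model, basic organic-mode COCOMO with a default annual salary of $56,286 and a 2.4 overhead factor. The sketch below assumes those defaults:

```python
def sloccount_estimate(sloc, salary=56286.0, overhead=2.4):
    """Reproduce SLOCCount's default (basic COCOMO) estimate:

    effort   = 2.4 * KSLOC**1.05   (person-months)
    schedule = 2.5 * effort**0.38  (months)
    cost     = person-years * salary * overhead
    """
    ksloc = sloc / 1000.0
    effort_pm = 2.4 * ksloc ** 1.05
    schedule_m = 2.5 * effort_pm ** 0.38
    return {
        "effort_py": effort_pm / 12.0,
        "schedule_y": schedule_m / 12.0,
        "developers": effort_pm / schedule_m,
        "cost": effort_pm / 12.0 * salary * overhead,
    }

est = sloccount_estimate(44756)
# Approximately 10.8 person-years, 1.3 years, 8.2 developers, and
# $1.46 million, matching Table 2.
```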


Therefore, obtaining the best balance between accuracy and scalability is a major challenge, one that is addressed by relying on heuristic-based models for certain problems and on algorithms that optimize intractable solutions for others.

Recently, CiteSeerX was migrated from a cluster composed of 18 physical machines to a private cloud platform (Wu et al. 2014) for scalability, stability, and maintainability. Our cost analysis indicates that, based on the current market, moving to a private cloud is more cost-effective than a public cloud solution such as Amazon EC2. The major challenges include a lack of documentation, resource allocation, system compatibility, a complete and seamless migration plan, redundancy/data backup, configuration, security, and backward availability. Most of the AI modules are compatible with the new environment.

Maintenance

CiteSeerX has been maintained by graduate students, postdocs, and department information-technology (IT) support since it moved to Pennsylvania State University. With limited funding and human resources, it is challenging to maintain such a system and add new features. CiteSeerX is currently maintained by one postdoc, two graduate students, an IT technician, and one undergraduate student. However, this changes as students leave. Only the postdoc is full time. The maintenance work includes periodically checking system health, testing and implementing new features, fixing bugs, upgrading software, and adding new documents.

CiteSeerX data is updated regularly. The crawling rate varies from 50,000 to 100,000 PDF documents per day. Of the crawled documents, about 40–50 percent are eventually identified as scholarly and ingested into the database.

Conclusion

We described CiteSeerX, an open access digital library search engine, focusing on the AI technologies used in multiple components of the system. CiteSeerX eliminates bitwise duplicates using hash functions; it uses a key-mapping algorithm to cluster documents and citations. The metadata is automatically extracted using an SVM-based header extractor, and the citations are parsed by ParsCit, a reference string parsing package. Author names are disambiguated to facilitate author search. CiteSeerX has modules to extract tables and index them for search. Algorithms are also extracted from PDF papers using combinations of rule-based and machine-learning methods. These AI technologies make CiteSeerX unique and add value to its services. CiteSeerX has a large worldwide user base, with a download rate of 3–10 documents per second. Its data is updated daily and is utilized in many novel research projects. CiteSeerX takes advantage of existing open source software. It contains more than 40,000 lines of code in its Java codebase. It could take up to 10 person-years to rebuild this codebase from scratch. Despite limited funding and human resources, CiteSeerX has been maintained regularly with many features planned for the near future. Migrating to the cloud environment is a milestone in scaling up the whole system. New features that could be incorporated into CiteSeerX are algorithm search (Tuarob et al. 2013), figure search (Choudhury et al. 2013), and acknowledgment search (Giles and Councill 2004; Khabsa, Treeratpituk, and Giles 2012).

Acknowledgments

We acknowledge partial support from the National Science Foundation and suggestions from Robert Neches.

References

Bhatia, S.; Caragea, C.; Chen, H.-H.; Wu, J.; Treeratpituk, P.; Wu, Z.; Khabsa, M.; Mitra, P.; and Giles, C. L. 2012. Specialized Research Datasets in the CiteSeerX Digital Library. D-Lib Magazine 18(7/8).

Bhatia, S.; Tuarob, S.; Mitra, P.; and Giles, C. L. 2011. An Algorithm Search Engine for Software Developers. In Proceedings of the 3rd International Workshop on Search-Driven Development: Users, Infrastructure, Tools, and Evaluation, 13–16. New York: Association for Computing Machinery.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ: Springer-Verlag New York, Inc.

Caragea, C.; Wu, J.; Ciobanu, A.; Williams, K.; Fernandez-Ramirez, J.; Chen, H.-H.; Wu, Z.; and Giles, C. L. 2014a. CiteSeerX: A Scholarly Big Dataset. In Proceedings of the 36th European Conference on Information Retrieval, 311–322. Berlin: Springer.

Caragea, C.; Wu, J.; Williams, K.; Gollapalli, S. D.; Khabsa, M.; and Giles, C. L. 2014b. Automatic Identification of Research Articles from Crawled Documents. Paper presented at the 2014 WSDM Workshop on Web-Scale Classification: Classifying Big Data from the Web, 28 February 2014, New York, NY.

Charikar, M. 2002. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC '02), 380–388. New York: Association for Computing Machinery.

Chen, H.-H.; Gou, L.; Zhang, X.; and Giles, C. L. 2011. CollabSeer: A Search Engine for Collaboration Discovery. In Proceedings of the 2011 ACM/IEEE Joint Conference on Digital Libraries (JCDL '11), 231–240. New York: Association for Computing Machinery.

Chen, H.-H.; Treeratpituk, P.; Mitra, P.; and Giles, C. L. 2013. CSSeer: An Expert Recommendation System Based on CiteSeerX. In Proceedings of the 13th ACM/IEEE Joint Conference on Digital Libraries (JCDL '13), 381–382. New York: Association for Computing Machinery.

Choudhury, S. R.; Tuarob, S.; Mitra, P.; Rokach, L.; Kirk, A.; Szep, S.; Pellegrino, D.; Jones, S.; and Giles, C. L. 2013. A Figure Search Engine Architecture for a Chemistry Digital Library. In Proceedings of the 13th ACM/IEEE Joint Conference on Digital Libraries (JCDL '13), 369–370. New York: Association for Computing Machinery.


Cortes, C., and Vapnik, V. 1995. Support-Vector Networks. Machine Learning 20(3): 273–297.

Cortez, E.; Da Silva, A. S.; Gonçalves, M. A.; Mesquita, F.; and De Moura, E. S. 2007. FLUX-CiM: Flexible Unsupervised Extraction of Citation Metadata. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL '07), 215–224. New York: Association for Computing Machinery.

Councill, I.; Giles, C. L.; and Kan, M.-Y. 2008. ParsCit: An Open-Source CRF Reference String Parsing Package. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC '08). Paris: European Language Resources Association (ELRA).

Feitelson, D. G., and Yovel, U. 2004. Predictive Ranking of Computer Scientists Using CiteSeer Data. Journal of Documentation 60(1): 44–61.

Giles, C. L., and Councill, I. 2004. Who Gets Acknowledged: Measuring Scientific Contributions Through Automatic Acknowledgement Indexing. Proceedings of the National Academy of Sciences of the United States of America 101(51): 17599–17604.

Giles, C. L.; Bollacker, K.; and Lawrence, S. 1998. CiteSeer: An Automatic Citation Indexing System. In Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL '98), 89–98. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Han, H.; Giles, C. L.; Manavoglu, E.; Zha, H.; Zhang, Z.; and Fox, E. A. 2003. Automatic Document Metadata Extraction Using Support Vector Machines. In Proceedings of the 2003 ACM/IEEE Joint Conference on Digital Libraries (JCDL '03), 37–48. New York: Association for Computing Machinery.

Huang, J.; Ertekin, S.; and Giles, C. L. 2006. Efficient Name Disambiguation for Large-Scale Databases. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '06), 536–544. Berlin: Springer.

Joachims, T. 1999. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods, 169–184. Cambridge, MA: The MIT Press.

Khabsa, M.; Treeratpituk, P.; and Giles, C. L. 2012. AckSeer: A Repository and Search Engine for Automatically Extracted Acknowledgments from Digital Libraries. In Proceedings of the 2012 ACM/IEEE Joint Conference on Digital Libraries (JCDL '12), 185–194. New York: Association for Computing Machinery.

Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 282–289. San Francisco: Morgan Kaufmann.

Liu, Y.; Bai, K.; Mitra, P.; and Giles, C. L. 2007. TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries. In Proceedings of the 7th ACM/IEEE-Computer Society Joint Conference on Digital Libraries (JCDL '07), 91–100. New York: Association for Computing Machinery.

Liu, Y.; Mitra, P.; Giles, C. L.; and Bai, K. 2006. Automatic Extraction of Table Metadata from Digital Documents. In Proceedings of the 6th ACM/IEEE-Computer Society Joint Conference on Digital Libraries (JCDL '06), 339–340. New York: Association for Computing Machinery.

Madadhain, J.; Fisher, D.; Smyth, P.; White, S.; and Boey, Y. 2005. Analysis and Visualization of Network Data Using JUNG. Journal of Statistical Software 10(1): 1–35.

Peng, F., and McCallum, A. 2006. Information Extraction from Research Papers Using Conditional Random Fields. Information Processing and Management 42(4): 963–979.

Pham, M.; Klamma, R.; and Jarke, M. 2011. Development of Computer Science Disciplines: A Social Network Analysis Approach. Social Network Analysis and Mining 1(4): 321–340.

Seymore, K.; McCallum, A.; and Rosenfeld, R. 1999. Learning Hidden Markov Model Structure for Information Extraction. In Machine Learning for Information Extraction: Papers from the AAAI Workshop, ed. M. E. Califf, WS-99-11. Menlo Park, CA: AAAI Press.

Teregowda, P. B.; Councill, I. G.; Fernández, R. J. P.; Khabsa, M.; Zheng, S.; and Giles, C. L. 2010. SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web. Paper presented at the USENIX Conference on Web Application Development (WebApps '10), 23–24 June, Boston, MA.

Teregowda, P.; Urgaonkar, B.; and Giles, C. L. 2010. Cloud Computing: A Digital Libraries Perspective. In Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing (Cloud '10), 115–122. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Treeratpituk, P., and Giles, C. L. 2009. Disambiguating Authors in Academic Publications Using Random Forests. In Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries (JCDL '09), 39–48. New York: Association for Computing Machinery.

Tuarob, S.; Bhatia, S.; Mitra, P.; and Giles, C. L. 2013. Automatic Detection of Pseudocodes in Scholarly Documents Using Machine Learning. Paper presented at the Twelfth International Conference on Document Analysis and Recognition (ICDAR 2013), 25–28 August, Washington, D.C.

Tuarob, S.; Mitra, P.; and Giles, C. L. 2014. Building a Search Engine for Algorithms. ACM SIGWEB Newsletter (Winter): Article 5.

Williams, K., and Giles, C. L. 2013. Near Duplicate Detection in an Academic Digital Library. In Proceedings of the 2013 ACM Symposium on Document Engineering (DocEng '13), 91–94. New York: Association for Computing Machinery.

Williams, K.; Li, L.; Khabsa, M.; Wu, J.; Shih, P.; and Giles, C. L. 2014a. A Web Service for Scholarly Big Data Information Extraction. In Proceedings of the 21st IEEE International Conference on Web Services (ICWS 2014). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Williams, K.; Wu, J.; Choudhury, S. R.; Khabsa, M.; and Giles, C. L. 2014b. Scholarly Big Data Information Extraction and Integration in the CiteSeerX Digital Library. In Proceedings of the 2014 IEEE 30th International Conference on Data Engineering Workshops (ICDEW). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Wu, J.; Teregowda, P.; Ramírez, J. P. F.; Mitra, P.; Zheng, S.; and Giles, C. L. 2012. The Evolution of a Crawling Strategy for an Academic Document Search Engine: Whitelists and Blacklists. In Proceedings of the 3rd Annual ACM Web Science Conference, 340–343. New York: Association for Computing Machinery.

Wu, J.; Teregowda, P.; Williams, K.; Khabsa, M.; Jordan, D.; Treece, E.; Wu, Z.; and Giles, C. L. 2014. Migrating a Digital Library into a Private Cloud. In Proceedings of the IEEE International Conference on Cloud Engineering (IC2E 2014). Piscataway, NJ: Institute of Electrical and Electronics Engineers.


Hung-Hsuan Chen is a researcher at the Computational Intelligence Technology Center at the Industrial Technolo - gy Research Institute. He received his Ph.D. in 2014 at Penn - sylvania State University. He is interested in information extraction, information retrieval, data mining, natural lan - guage processing, and graph analysis. He developed CSSeer and CollabSeer.

Madian Khabsa is a Ph.D. candidate in computer science and engineering at Pennsylvania State University. His research interests are big data, information retrieval and extraction, applied machine learning, and data mining. He is also interested in building and contributing to large-scale systems.

Cornelia Caragea is an assistant professor at the University of North Texas, where she is part of the interdisciplinary Knowledge Discovery from Digital Information (KDDI) research cluster, with a dual appointment in computer science and engineering and library and information sciences. She obtained her Ph.D. from Iowa State University in 2009. Her research interests are machine learning, text mining, information retrieval, social and information networks, and natural language processing, with applications to scholarly big data and digital libraries.

Suppawong Tuarob is a Ph.D. candidate in computer science and engineering at Pennsylvania State University. He earned his BSE and MSE in computer science and engineering from the University of Michigan–Ann Arbor. His research involves data mining in large-scale scholarly data and health-care domains.

Alexander Ororbia is a Ph.D. graduate student in information sciences and technology at Pennsylvania State University. He received his BS in computer science and engineering at Bucknell University, with minors in mathematics and philosophy. His research focuses on developing connectionist architectures capable of deep learning.

Douglas Jordan is a senior in the College of Engineering at Pennsylvania State University majoring in computer science and mathematics. He is in charge of transferring CiteSeerX data.
Jian Wu is a postdoctoral scholar in the College of Information Sciences and Technology at Pennsylvania State University. He received his Ph.D. in astronomy and astrophysics in 2011. He is currently the technical director of CiteSeerX. His research interests include information extraction, web crawling, cloud computing, and big data.

Kyle Williams is a Ph.D. candidate in information sciences and technology at Pennsylvania State University. His research interests include information retrieval, machine learning, and digital libraries. His work involves the use of search, machine learning, and text inspection for managing document collections.

Prasenjit Mitra is a professor in the College of Information Sciences and Technology at Pennsylvania State University. He serves on the graduate faculty of the Department of Computer Sciences and Engineering and is an affiliate faculty member of the Department of Industrial and Manufacturing Engineering at Pennsylvania State University. He received his Ph.D. from Stanford University in 2004. His research contributions have been primarily in the fields of information extraction, information integration, and information visualization.

C. Lee Giles is the David Reese Professor at the College of Information Sciences and Technology at Pennsylvania State University, University Park, PA, with appointments in computer science and engineering and supply chain and information systems. He is a Fellow of the ACM, IEEE, and INNS (Gabor prize). He is probably best known for his work on estimating the size of the web and on the search engine and digital library CiteSeer, which he cocreated, developed, and maintains. He has published more than 400 refereed articles with nearly 26,000 citations and an h-index of 77 according to Google Scholar.
