Cover Page

DBLPMINER: A TOOL FOR EXPLORING BIBLIOGRAPHIC DATA

A Project

Presented to the faculty of the Department of Computer Science

California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

Computer Science

by

Tony Le

FALL 2014

Thesis Approval Page

DBLPMINER: A TOOL FOR EXPLORING BIBLIOGRAPHIC DATA

A Project

by

Tony Le

Approved by:

______________________________, Committee Chair
Dr. Du Zhang

______________________________, Second Reader
Dr. Jinsong Ouyang

____________________
Date


Thesis Format Approval Page

Student: Tony Le

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

______________________________, Graduate Coordinator
Dr. Nikrouz Faroughi

____________________
Date

Department of Computer Science


Thesis Abstract Form

Abstract

of

DBLPMINER: A TOOL FOR EXPLORING BIBLIOGRAPHIC DATA

by

Tony Le

Exploring publications in academia usually entails browsing a collection of published literature. In the digital space, collections typically come in the form of primitive bibliographic databases or full-featured digital libraries. Bibliographic databases contain an organized collection of references to published literature, whereas digital libraries host publications in full text.

Although they lack the full-text content of a publication, bibliographies contain rich metadata, such as publication titles, venues, and contributing authors, that is adequate for finding publications of interest.

One example of a bibliographic database is DBLP, which provides publication records in the computer science discipline. DBLP supports researchers in publication exploration by providing interfaces for finding specific bibliographic records or collections of records by a specific author. However, it does not provide a means for exploring families of similar publications by topic areas.

In this project, an application tool named DBLPminer is developed that extends upon the

DBLP dataset. Addressing this topic limitation of DBLP, DBLPminer provides an interface for linking DBLP records to topic areas and accessing them by topic. Using this tool, researchers may organize

DBLP records by topic categories and subsequently explore publications by topic categories or

find associated topics for a given DBLP record. At its foundation, the application indexes DBLP records by topic categories under the taxonomy of the 2012 ACM Computing Classification System.

Indexed DBLP publication records are made accessible via a prototype web application interface.

Lastly, analysis is performed on these datasets and on the indexing algorithm's performance.

There exist other tools like DBLPminer that aid researchers in exploring publications through bibliographic data. Some tools take a social network approach such as Arnetminer and

ResearchGate, allowing researchers to connect with one another to share and access publications.

Tools such as Scholarometer and Google Scholar are more search oriented, with interfaces for finding publications of interest. Although similar, DBLPminer differs from tools such as

Arnetminer, ResearchGate, and Scholarometer because it focuses on the publication relationships by topic instead of authorship and references. Another difference between DBLPminer and

ResearchGate, Scholarometer, and Google Scholar is that it focuses on publications specific to the computer science community.

______________________________, Committee Chair
Dr. Du Zhang

____________________
Date


TABLE OF CONTENTS

Page

List of Tables ...... viii

List of Figures ...... ix

Chapter

1. INTRODUCTION ...... 1

1.1. Related Work ...... 1

2. BACKGROUND ...... 4

3. DESIGN ...... 7

3.1. High Level Application Structure ...... 7

3.2. Development Environment ...... 8

3.3. Internal Application Structure ...... 8

3.3.1. MongoDB ...... 8

3.3.2. Configuration Files ...... 9

3.3.3. Content Curation Module ...... 12

3.3.4. Web Server Module ...... 18

3.3.5. Web Application ...... 20

4. INDEXING ANALYSIS ...... 28

4.1. Taxonomy Dataset Analysis ...... 28

4.1.1. Distinct Topic Terms ...... 28

4.1.2. Topic Names ...... 31

4.2. Performance Analysis ...... 32

5. CONCLUSION AND FUTURE WORK ...... 39


5.1. Future Work ...... 39

Appendix A. Software Dependencies ...... 41

Appendix B. Source Code ...... 42

Bibliography ...... 71


LIST OF TABLES

Tables Page

1. Table 1 – DBLP XML Element Tags ...... 4

2. Table 2 – 2012 ACM CCS Major Topic Categories ...... 5

3. Table 3 – ACM CCS SKOS XML Element Tags ...... 6

4. Table 4 – mongodb.json configuration file ...... 10

5. Table 5 – ccs.json configuration file ...... 10

6. Table 6 – .json configuration file ...... 11

7. Table 7 – analysis.json configuration file ...... 12

8. Table 8 – Application Script Commands ...... 12

9. Table 9 – MongoDB Search String Parsing Example...... 18

10. Table 10 – Web Server URL Routes ...... 19

11. Table 11 – REST API Routes ...... 20

12. Table 12 – Application’s Custom Web Components ...... 21

13. Table 13 – Taxonomy Distinct Term Counts...... 29

14. Table 14 – Most Frequent Terms ...... 30

15. Table 15 – Topic Name Length Counts ...... 31

16. Table 16 – Total Bibliographies Indexed ...... 32

17. Table 17 – Bibliographies Indexed by N-grams ...... 34

18. Table 18 – Single Topic Term Indexing Performance ...... 35

19. Table 19 – Bigram Topic Term Indexing Performance ...... 36

20. Table 20 – Trigram+ Topic Term Indexing Performance...... 38


LIST OF FIGURES

Figures Page

1. Figure 1 – High Level Application Structure...... 7

2. Figure 2 – MongoDB Document Structure ...... 9

3. Figure 3 – Script & 3rd Party Module Dependencies ...... 13

4. Figure 4 – Taxonomy Migration Pseudo code ...... 15

5. Figure 5 – Bibliography Migration Pseudo code ...... 16

6. Figure 6 – Bibliography Indexing Pseudo code ...... 17

7. Figure 7 – Text Search String Creation Example ...... 18

8. Figure 8 – Web Application Navigation by View ...... 22

9. Figure 9 – Topics View Screenshot ...... 23

10. Figure 10 – Search / Home View Screenshot ...... 24

11. Figure 11 – Subtopics View Screenshot ...... 24

12. Figure 12 – Topic Relevant Bibliographies Screenshot ...... 25

13. Figure 13 – Bibliography Search Results Screenshot ...... 26

14. Figure 14 – Bibliography Info Screenshot ...... 27


Chapter 1

INTRODUCTION

DBLP hosts a large collection of computer science publications, allowing for simple queries for retrieving publications based on conference, authors, or publication title [1]. Complex queries, such as finding relevant publications or publications pertaining to a particular topic domain, are not supported. The significance of this missing feature lies in the benefit of discovery during publication exploration. Researchers exploring the DBLP dataset could benefit from more generalized searches based on topic categories, or from finding similar publications given a publication of interest.

DBLPminer is developed to address the limitations of DBLP and its dataset, organizing

DBLP records by indexing them against topic categories extracted from a taxonomy dataset. The aforementioned taxonomy dataset is obtained from the 2012 ACM Computing Classification

System (ACM CCS), containing topic categories within the field of computer science [2].

Interface access to the organized collection of publications by topic is provided via a web application accessing DBLPminer’s built-in web services. Therefore, researchers can use this tool to organize the DBLP dataset by topic categories and subsequently perform complex queries on publications associated with topic information.

1.1 Related Work

DBLPminer is a web application that organizes and allows for the retrieval of publication records by topic categories using the DBLP dataset. There are other web applications similar to

DBLPminer that offer publication exploration based on bibliographic data. However, these applications differ from DBLPminer in their approach towards the organization and presentation of publication records.


Arnetminer is an online web service that provides comprehensive search and mining services for researcher social networks [3]. Whereas DBLPminer organizes publications by topic categories, Arnetminer organizes publications by researchers and builds a network of content using researcher activity. Identification of connections between researchers, conferences, and publications is achieved using social network analysis. Using these connections, services are created featuring expert finding, geographic search, reviewer recommendation, association search, course search, academic performance evaluation, and topic modeling. Bibliographic information is extracted from researcher profile websites and digital libraries.

Scholarometer is a web service and browser extension that offers a social tool for facilitating citation analysis to measure the impact of an author's publications [4]. It is more similar to Arnetminer than to DBLPminer in that emphasis is placed on publication authors. The tool offers publication search functionality based on authors. In addition, Scholarometer provides statistical data ranking authors based on impact and discipline. The tool sources its bibliographic data from Google Scholar.

Google Scholar provides a way to broadly find scholarly literature across many disciplines and sources [5]. Users can explore related works, citations, authors, and publications.

Documents made available in full detail are ranked by weighing their content, publication venue, author, and citation references. DBLPminer is similar to Google Scholar in that both services provide a means to find and explore scholarly works.

ResearchGate is a social network connecting researchers to enable sharing and access of scientific output, knowledge, and expertise [6]. Researchers using ResearchGate can share publications, connect and collaborate with colleagues, discuss research problems, and analyze

personal statistics. Statistical information collected on research includes views, downloads, and citations.

Like DBLPminer, these tools aid researchers in exploring publications through bibliographic data. Some tools take a social network approach such as Arnetminer and

ResearchGate, allowing researchers to connect with one another to share and access publications.

Other tools such as Scholarometer and Google Scholar are more search oriented with interfaces for finding publications of interest. DBLPminer differs from these tools because it focuses on publication relationships by topic instead of authorship and citation references. Another difference between DBLPminer and most of these tools is that it focuses on publications specific to the computer science community.


Chapter 2

BACKGROUND

DBLPminer utilizes the DBLP and ACM CCS datasets to fulfill its end goal. Topic records from the ACM CCS dataset are used to categorize publications from the DBLP dataset. Specifically, the names of topic categories are used to match bibliography titles for categorization. Because this task deals with text, natural language processing techniques are used to achieve bibliography categorization by topic.

The DBLP dataset, obtainable as an Extensible Markup Language (XML) file, is perpetually growing with a current size of about 2.6 million records. Publications are organized within the XML file using tags based on publication venue with additional tags to mark content.

These tags coincide with entry types of the BibTeX format. A complete listing of publication venue and content tags is presented in Table 1.

Table 1 – DBLP XML Element Tags

Publication Venue Tags: article, inproceedings, proceedings, book, incollection, phdthesis, masterthesis, www

Content Tags: author, editor, title, booktitle, pages, year, address, journal, volume, number, month, url, ee, cdrom, cite, publisher, note, crossref, isbn, series, school, chapter

In detail, the venue tags article, proceedings, book, masterthesis, and phdthesis are straightforward in explanation. The inproceedings tag corresponds to articles within conference proceedings, while the incollection tag corresponds to a part of a book that has its own title. Finally, the www tag corresponds to author web pages and is not used in DBLPminer. Content tags identify various metadata information relevant to publications. DBLPminer focuses on the title tag to associate publications with topics.

The 2012 ACM CCS dataset contains a collection of topic categories fully organized as a taxonomy of fourteen major topics. Among the fourteen topic trees, only twelve are considered for the DBLPminer application. The two ignored topic trees contain topics that appear irrelevant to differentiating computer science topics. A complete listing of these major topic categories is shown in Table 2.

Table 2 – 2012 ACM CCS Major Topic Categories

Major Topic Categories: Applied computing, Computer systems organization, Computing methodologies, Hardware, Human-centered computing, Information systems, Mathematics of computing, Networks, Security and privacy, Social and professional topics, Software and its engineering, Theory of computation

Omitted Major Topic Categories: General and reference; Proper nouns: People, technologies and companies

The dataset can be obtained in the Simple Knowledge Organization System XML (SKOS XML) format, a semantic web model standard for expressing the basic structure and content of concept schemes. For each topic record within this file, DBLPminer extracts the following pieces of information via element tags and attributes: topic name, topic aliases, ID, parent topic IDs, and subtopic IDs. A listing of the tags and attributes containing this information is shown in Table 3. IDs are used in building the taxonomy relationships between topics and subtopics, whereas the name and aliases are used as topic labels.


Table 3 – ACM CCS SKOS XML Element Tags

Element          Attribute                   Description
skos:Concept     rdf:about (record ID)       Taxonomy record
skos:prefLabel                               Record's topic name
skos:altLabel                                Record's topic alias
skos:broader     rdf:resource (parent ID)    Reference to parent topic's ID
skos:narrower    rdf:resource (child ID)     Reference to child subtopic ID

Natural language processing techniques are used to index publication records by topic categories. One common technique in natural language processing is filtering out common words that appear to be of minimal value in selecting documents matching a user need, such as a, be, for, and in. This collection of words is known as stopwords. DBLPminer utilizes a dictionary of stopwords for bibliography indexing.

Another technique in natural language processing is stemming. Stemming involves reducing inflected or derived words into their base or root form. There are several stemming algorithms available, each with different stemming approaches and results. DBLPminer uses the

English Snowball stemmer for bibliography indexing.
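To make this preprocessing concrete, the following is a minimal Python sketch of stopword removal followed by Snowball stemming; the stopword subset and function name are illustrative only and do not reproduce DBLPminer's actual implementation.

# Minimal sketch: reduce a topic name to stemmed, stopword-free terms.
# Assumes the PyStemmer package; the stopword set below is a small
# illustrative subset, not DBLPminer's actual stopword dictionary.
import Stemmer

STOPWORDS = {"a", "an", "and", "be", "for", "in", "its", "of", "the"}
stemmer = Stemmer.Stemmer("english")  # English Snowball stemmer

def preprocess(topic_name):
    """Lowercase the name, drop stopwords, and stem the remaining terms."""
    terms = [t for t in topic_name.lower().split() if t not in STOPWORDS]
    return stemmer.stemWords(terms)

print(preprocess("Mathematics of computing"))      # ['mathemat', 'comput']
print(preprocess("Software and its engineering"))  # ['softwar', 'engin']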


Chapter 3

SOFTWARE DESIGN

3.1 High Level Application Structure

DBLPminer is a composite application consisting of several components. The content curation component exposes functionality for migrating the DBLP and ACM CCS datasets to a database, indexing publication records by topic categories, and analyzing the records after persistence. The web server component houses a web application and provides web services for accessing publication and topic records. The web application acts as a prototype REST web service client. Lastly, a set of configuration files is available for fine-tuning content curation, web server, and database settings. These components are visually presented in Figure 1.

Figure 1 – High Level Application Structure

In terms of usage, researchers first modify configuration files to fine-tune the database, web server, and content curation scripts. Subsequently, the DBLP and ACM CCS datasets are migrated into MongoDB for persistence using the content curation scripts. Furthermore,

researchers will use the indexing functionality in the content curation scripts to index DBLP publications by ACM CCS topic categories. Once all records are updated, researchers can interact with the datasets via the web application in a web browser after starting the web server.

3.2 Development Environment

DBLPminer is developed using various languages and technologies. The modules corresponding to these languages and technologies are also shown in Figure 1. Programming languages used in development include Python and JavaScript. Python is used to develop the content curation module scripts, whereas JavaScript is used to develop both the web server and web application modules. HTML and CSS are supplementary languages used for designing the web application's components. Technologies used include MongoDB [7], Node.js [10] and Polymer

[11]. MongoDB is a document-based database used to persist DBLPminer's publication and topic records. The web server is constructed using the Node.js platform. The Polymer web components framework is used to build the web application. A detailed description for each technology used can be found in Section 3.3.

3.3 Internal Application Structure

DBLPminer comprises various components implemented using several languages and technologies. The modules and technologies introduced in Section 3.1 and Section 3.2, respectively, are covered in more detail in this section.

3.3.1 MongoDB

MongoDB is an open-source document database software written in C++ [7]. The database stores records as documents, grouped into collections. Compared to SQL, documents are analogous to records and collections are analogous to tables. MongoDB features JSON-styled

documents (BSON) with dynamic schemas, offering simplicity and power for rapid prototyping

[7]. BSON documents stored in a collection require a unique _id field that acts as a primary key.

DBLPminer creates two collections in MongoDB – bibliographies and acmccs.

Document structures listing important fields for each collection are shown in Figure 2. The main fields in a bibliography document are title and authors. The dblp and extra fields contain DBLP metadata and other extraneous content. A topic document consists of label properties in the name and aliases fields, and tree path properties in the broader, narrower, and path fields. The broader and narrower fields correspond to parent and child topics respectively, while the path field provides a complete listing of ancestor topics leading up to the document's topic.

Figure 2 – MongoDB Document Structure
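Since the figure itself is not reproduced here, the sketch below approximates the two document shapes based on the fields described above; the field values are hypothetical placeholders.

# Approximate shapes of the two document types (values are hypothetical;
# only the field names described in the text are taken from DBLPminer).
bibliography_doc = {
    "_id": "...",                                   # MongoDB primary key
    "title": "A Sample Publication Title",
    "authors": ["First Author", "Second Author"],
    "dblp": {"key": "...", "mdate": "..."},         # DBLP-specific metadata
    "extra": {"year": "2014", "journal": "..."},    # other extracted content
}

topic_doc = {
    "_id": "...",                                   # ACM CCS record ID
    "name": "Machine learning",
    "aliases": [],
    "broader": ["<parent topic id>"],               # parent topics
    "narrower": ["<child topic id>"],               # child subtopics
    "path": ["<root id>", "<ancestor id>"],         # ancestors from root to parent
}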

3.3.2 Configuration Files

DBLPminer uses various configuration files to direct and alter how modules operate.

The MongoDB instance connection is configured by the mongodb.json file. Parameters and their default values in that file are presented in Table 4. The host and port parameters configure the MongoDB host URL and port number, respectively. The db parameter pertains to the target database name in


MongoDB. The collections accessible within the database are configured by the collections parameter.

Table 4 – mongodb.json configuration file

Key Default Value Description host “localhost” MongoDB server host URL

port 27017 MongoDB listening port

db “miner” Database name

collections [“acmccs”, ”bibliographies”] List of document collections

The ACM CCS taxonomy dataset file and its topics to migrate are configured by the ccs.json file.

The contents of the configuration file are shown in Table 5. The input parameter points to the dataset

SKOS XML file location. The output parameter points to a text file location for logging topic names. The topic_choices parameter is a listing of all topic trees from the ACM CCS dataset for reference; therefore, modification of this parameter will have no effect and is not encouraged. The topics parameter lists all topics to migrate from the topic_choices list.

Table 5 – ccs.json configuration file

Key             Default Value
input           ./data//ccs2012.xml
output          ./data/output/taxonomy.txt
topic_choices   ["General and reference", "Mathematics of computing", "Information systems", "Security and privacy", "Networks", "Human-centered computing", "Social and professional topics", "Theory of computation", "Computing methodologies", "Applied computing", "Computer systems organization", "Hardware", "Software and its engineering", "Proper nouns: People, technologies and companies"]
topics          ["Mathematics of computing", "Information systems", "Security and privacy", "Networks", "Human-centered computing", "Social and professional topics", "Theory of computation", "Computing methodologies", "Applied computing", "Computer systems organization", "Hardware", "Software and its engineering"]


DBLP dataset migration can be configured via the dblp.json configuration file. The contents of this configuration file are presented in Table 6. The input parameter points to the dataset XML file location. The output parameter points to a text file location for logging bibliography names.

The _venues and _metadata parameters are listings of publication venues and metadata tags for reference; therefore, modification of these parameters will have no effect. Publication venues to consider for migration are controlled by the venues parameter. Possible publication venues to consider can be obtained from the _venues listing. Metadata content to consider for migration is listed in the _metadata parameter and managed by the metadata parameter. The url and version parameters are used to manually keep track of where the XML file is obtained and its version.

Table 6 – dblp.json configuration file

Key         Default Value
input       ./data/xml/dblp.xml
output      ./data/xml/output/titles.txt
_venues     ["article", "inproceedings", "proceedings", "book", "incollection", "phdthesis", "masterthesis", "www"]
venues      ["article", "inproceedings", "proceedings", "book", "incollection", "phdthesis", "masterthesis"]
_metadata   ["editor", "booktitle", "pages", "year", "address", "journal", "volume", "number", "month", "url", "ee", "cdrom", "cite", "publisher", "note", "crossref", "isbn", "series", "school", "chapter"]
metadata    ["editor", "booktitle", "pages", "year", "address", "journal", "volume", "number", "month", "url", "ee", "cdrom", "cite", "publisher", "note", "crossref", "isbn", "series", "school", "chapter"]
url         "http://dblp.uni-trier.de/xml/"
version     "14-Apr-2014 22:54"

Natural language processing in dataset analysis is configured in the analysis.json file. The contents of the file are shown in Table 7. Munging options are listed in munging_options and configured in munging_mode. The none option performs no operations on text. The

removeStopwords option removes stopwords from text, whereas the stem option stems text. The removeStopwordsAndStem option first removes stopwords from text and subsequently stems it.

Table 7 – analysis.json configuration file

Key Default Value munging_options ["none", "removeStopwords", "stem", "removeStopwordsAndStem"]

munging_mode "removeStopwordsAndStem"

The logging.json file configures logging settings for DBLPminer. The application implements logging using Python’s built-in logging library. Logging is organized by groups, each with its own settings for displaying logging output and sources for log output. Exact parameter details for that file can be found in Python’s logging module documentation [8].
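As a rough illustration, the sketch below loads such a file and applies it with the standard library; the file path, logger name, and the assumption that logging.json follows the dictConfig schema are illustrative rather than taken from DBLPminer's source.

# Rough sketch: apply a JSON logging configuration via the standard library.
# The path, the logger name, and the dictConfig-schema assumption are
# illustrative only.
import json
import logging
import logging.config

with open("logging.json") as f:
    logging.config.dictConfig(json.load(f))

log = logging.getLogger("curation")   # hypothetical logger group name
log.info("logging configured")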

3.3.3 Content Curation Module

Table 8 – Application Script Commands

Python Script Commands       Arguments
$> dblpminer.py migrate      taxonomy, bibliographies
$> dblpminer.py index        bibliographies
$> dblpminer.py analysis     count_indices_per_topic, count_indices_per_tree, count_topic_ngrams, count_topic_term_frequency, count_topic_terms, output_indices_per_topic

The content curation component exposes functionality for: migrating content from the

DBLP and ACM CCS datasets, indexing bibliographies by topic category, and analysis. These functionalities are exposed via commands upon running the application script. The commands and arguments are shown in Table 8. The migrate command takes taxonomy or bibliographies as arguments for the task of migrating the corresponding dataset from its XML file to the database. The index command performs indexing on bibliographic records according to topic categories. The analysis command

takes in various arguments corresponding to the names of implemented functions for performing analysis on the datasets and on indexing performance.

Figure 3 – Script & 3rd Party Module Dependencies

The content curation module is bundled as a collection of Python scripts. Application scripts and 3rd party module dependencies are presented in Figure 3. A few of these scripts are straightforward. The dblpminer.py script serves as the module's application entry point, housing the aforementioned commands. General utilities used in the application, such as file directory manipulation, are implemented in util.py. Several scripts depend on this utility script for manipulating file directories. MongoDB database connection management is implemented in

persistence.py. The persistence script depends on the PyMongo library, a database driver for

MongoDB.

Implemented in bibliographies.py and taxonomy.py are object models for managing read, write, and update operations against bibliography and topic records respectively. Because these object models interact with the database for persistence and retrieval, their respective scripts depend on persistence.py. Furthermore, both object model scripts expose functionality for migrating their respective datasets from XML files to the database. These XML files are parsed using the Lxml library.

Bibliography categorization by topic category is implemented in indexing.py. It has a dependency on taxonomy.py for retrieving topic records and a dependency on bibliographies.py for updating publication records. Stemming operations are performed using the English Snowball

Stemmer via the PyStemmer library. Stopword removal is assisted by a text file containing a line-delimited list of stopwords.

Functionality for analysis and testing on records and indexing performance is implemented in analysis.py. Because analysis touches upon many aspects of the application, this script has a fairly large set of dependencies encompassing: persistence, taxonomy, bibliographies, and indexing scripts.

The entry point script dblpminer.py, upon receiving a command, routes it to commands.py, as this script facilitates access to the functionality of the other scripts. For that reason, commands.py has dependencies on almost all application scripts, including the persistence, taxonomy, bibliographies, indexing, and analysis scripts.


3.3.3.1 Taxonomy Migration

DesiredTrees = List of root topic names for desired topic trees declared in ccs.json

Topics = List of all topic records from SKOS XML file

RootTopics = root topics subset of Topics that intersects DesiredTrees

FOR each topic in RootTopics:

Walk down topic tree updating each node’s ancestor info

SelectedTopics = subset of Topics which have ancestor updated

FOR each topic in SelectedTopics:

Save topic to database

Figure 4 – Taxonomy Migration Pseudo code

Taxonomy information is integrated into DBLPminer by migrating topic records from the

ACM CCS SKOS XML file to MongoDB. Figure 4 contains pseudo code detailing how taxonomy migration is accomplished. First, a list containing the names of desired topic trees is retrieved from the ccs.json configuration file to determine what topic trees will be migrated. The application will create a list of all topics by iterating over the contents of the SKOS XML file.

Once all topics are extracted, ancestral information is updated for all relevant topics by propagating path information via tree traversal for each desired topic tree. In the context of tree data structures, "ancestral information" refers to a list containing all nodes visited in sequential order, starting from the root up to the current node's parent. Because only topic records that fall under desired topic trees have their ancestral information updated, records which fall solely under undesired topic trees are filtered out from migration. The resultant filtered set of topics is subsequently saved to the database, and topic names are dumped into an output file specified by the configuration file.
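A simplified sketch of the path-propagation step is shown below; the in-memory topic representation, function name, and example IDs are assumptions for illustration and do not mirror taxonomy.py exactly.

# Simplified sketch of ancestor-path propagation over the desired topic trees.
# `topics` maps a topic ID to a dict with "narrower" child IDs; this in-memory
# shape and the IDs below are assumptions for illustration.
def propagate_paths(topics, root_ids):
    """Depth-first walk from each desired root, recording each node's ancestors."""
    selected = {}

    def walk(topic_id, path):
        topic = topics[topic_id]
        topic["path"] = list(path)              # ancestors from root to parent
        selected[topic_id] = topic
        for child_id in topic.get("narrower", []):
            walk(child_id, path + [topic_id])

    for root_id in root_ids:
        walk(root_id, [])
    return selected                              # only topics under desired trees

topics = {
    "t1": {"name": "Computing methodologies", "narrower": ["t2"]},
    "t2": {"name": "Artificial intelligence", "narrower": []},
}
print(propagate_paths(topics, ["t1"])["t2"]["path"])   # ['t1']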


3.3.3.2 Bibliography Migration

DesiredVenues = List of selected publication venues declared in dblp.json

DesiredTags = List of selected metadata tags declared in dblp.json

FOR each record in DBLP XML file:

IF record type is one of DesiredVenues:

Parse title, author, and dblp-meta element information

Parse other metadata elements that fall under DesiredTags

Save current record containing parsed element data to database

Figure 5 – Bibliography Migration Pseudo code

Bibliographic data is assimilated into DBLPminer by migrating publication records from

DBLP XML file to MongoDB. Figure 5 displays the pseudo code implementation for bibliography migration. First, a list of desired publication venues and metadata tags is gathered from the dblp.json configuration file. The XML file is then iterated over, considering only records that fall under the desired publication venues. For each such bibliographic record, author, title, and dblp-meta tag information is extracted, along with any additional tag information that falls under the desired metadata tags. Finally, the record, along with its filtered content, is saved to the database.
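A simplified Python sketch of this streaming migration, assuming the Lxml and PyMongo libraries mentioned earlier, is shown below; the abbreviated tag lists and the collection handle are illustrative, and details such as nested title markup are glossed over.

# Simplified sketch of streaming the DBLP XML file and saving records for the
# desired venues. The abbreviated venue/tag sets and the collection handle are
# illustrative; entity resolution requires the DBLP DTD to be present.
from lxml import etree

DESIRED_VENUES = {"article", "inproceedings", "proceedings", "book",
                  "incollection", "phdthesis", "masterthesis"}
DESIRED_TAGS = {"year", "journal", "booktitle", "pages", "ee", "crossref"}

def migrate(xml_path, collection):
    for _, elem in etree.iterparse(xml_path, events=("end",), load_dtd=True):
        if elem.tag in DESIRED_VENUES:
            record = {
                "title": elem.findtext("title"),
                "authors": [a.text for a in elem.findall("author")],
                "dblp": dict(elem.attrib),            # key, mdate, ...
                "extra": {t: elem.findtext(t) for t in DESIRED_TAGS
                          if elem.find(t) is not None},
            }
            collection.insert_one(record)   # PyMongo 3+; older drivers use insert()
        # Clear completed top-level records to keep memory bounded.
        if elem.getparent() is not None and elem.getparent().tag == "dblp":
            elem.clear()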

3.3.3.3 Bibliography Indexing

Publication records stored within MongoDB are indexed according to topic names.

Figure 6 displays the pseudo code implementation for the indexing process. The process iterates over all topic categories retrieved from the acmccs collection in MongoDB, retrieving from the database the bibliographies whose titles contain terms matching the topic category name. These relevant bibliographies are successively indexed according to the associated topic name.


FOR each RootTopic in Topic Trees retrieved from acmccs collection in MongoDB:

FOR each Topic in RootTopic (traverse tree depth-first):

TopicName = Topic’s name

RelevantBibliographies = Find bibliographies from bibliography collection in MongoDB where title contains terms from TopicName

FOR each bibliography in RelevantBibliographies:

Update bibliography tag info with TopicName

Figure 6 – Bibliography Indexing Pseudo code

Using terms from topic names to find matching bibliography titles is a complicated process. Text search string generation requires significant preprocessing in order to achieve suitable results. First, the topic name is converted into a list of terms. That list of terms is then filtered for stopwords and stemmed to remove inflections. The resultant list of terms is then concatenated to build the search string. Finally, the search string is forwarded as a text search query to MongoDB for parsing to retrieve matching bibliographies. An example of this process is depicted in Figure 7. Each rounded box represents a single string, and adjacently connected boxes represent a list of strings. Labeled arrows describe transformations performed on strings.

MongoDB text search takes in any combination of individual terms or phrases. The database parses the search string, performing a logical OR search on terms within the search string unless a phrase is present. To specify a phrase, phrases within the search string are enclosed with double quotes. When a phrase is present in a search string, MongoDB performs a logical

AND between the phrase and the remaining set of terms. DBLPminer currently implements text search by converting all individual terms into phrases to force MongoDB into performing a logical

AND on all individual terms.

Figure 7 – Text Search String Creation Example

An example of how search strings are parsed by MongoDB is shown in Table 9. The complete search string is enclosed by single quotes, while phrases within the string are enclosed with double quotes. The last example is how DBLPminer currently performs text search for indexing bibliographies. For more information regarding text search in

MongoDB, refer to their documentation on text search [9].

Table 9 – MongoDB Search String Parsing Example

Search String                          MongoDB search method
'ssl certificate'                      "ssl" OR "certificate"

' "ssl certificate" ' "ssl certificate"

' "ssl certificate" authority key' "ssl certificate" AND ("authority" OR "key")

' "mathemat" "comput" ' "mathemat" AND "comput"

3.3.4 Web Server Module

The web server component is built using the node.js platform and two node.js modules – express.js and mongo.js. Node.js is a JavaScript platform for building fast, scalable web applications that utilizes an event-driven non-blocking I/O model [10]. Express.js, a minimalist

web framework, is used to efficiently develop web application structures. Mongo.js is a

MongoDB wrapper database driver for node.js. Using these technologies and JavaScript,

DBLPminer implements a web server with a RESTful API web service simply by declaring how

URL routes are handled by the server and what port the server is listening to.

3.3.4.1 Routes

The web server organizes routes by two categories – static files and the RESTful web service API. These URL routes are presented in Table 10. The routes "/" and "/public" point to a public folder on the webserver, which houses the DBLPminer web application. The route

“/components” directs to the server’s directory containing additional web assets for generating the web application – namely, web components. The "/api" route is the root route for the RESTful web service API.

Table 10 – Web Server URL Routes

Category      URL Route      Description
Static        /              Root
Static        /public        Webpage
Static        /components    Web Assets
Web Service   /api           REST API

3.3.4.2 REST API

RESTful web services have the following characteristics: a base URI, a media type, standard HTTP methods, and hypertext links to reference state and resources. DBLPminer implements RESTful web services as it exhibits those characteristics: "/api" as the base URI, responding to calls with the JSON media type, providing the HTTP method GET, and providing

several routes identifying resources and differentiating application states. Implemented API routes on the web server are listed in Table 11.

Table 11 – REST API Routes

Route                           HTTP Method   Params   Description
/api/bibliographies             GET                    Returns a list of bibliographies
/api/bibliographies/:id         GET           id       Returns bibliography data given an id
/api/bibliographies/topic/:id   GET           id       Returns relevant bibliographies given a topic id
/api/topics                     GET                    Returns root topic list
/api/topics/:id                 GET           id       Returns topic data given a topic id
/api/topics/:id/subtopics       GET           id       Returns subtopic list given a topic id
/api/search/bibliographies      GET           query    Returns bibliographies given query data
/api/search/topics              GET           query    Returns topics given query parameters

The "/api/bibliographies" route and its child routes handle all incoming requests for data pertaining to a publication record or a collection of records. The "/api/topics" route and its child routes handle all incoming requests for a topic record or a collection of records. The "api/search" family of routes correspond to collections of records when performing text search on publication and topic records.

3.3.5 Web Application

DBLPminer has a simple prototype web application that utilizes the REST web service

API to present and navigate through topic category and publication records. The web application is developed using the Polymer web components framework. Polymer is based on encapsulated and interoperable custom elements that extend HTML [11]. These elements are pieced together to create rich web applications. DBLPminer’s web interface is built using several custom web

components in conjunction with various premade web components offered by the Polymer framework. A mixture of HTML, CSS, and JavaScript is used to create these custom web components.

3.3.5.1 Web Components

Table 12 – Application’s Custom Web Components

Category   HTML Elements
Layouts    Site layout components; list layout components
Views      Main view components; list view components; item view components

The web components developed for DBLPminer are shown in Table 12. These components are grouped into two categories – layouts and views. Layout components provide the web application's overall look and feel. Two components are used to design the overall website menu navigation and content view layouts, while four components are used in designing list layouts. Web components in the view category encapsulate the various web page views navigable on the website. All view components utilize the dblpminer-view web component to bootstrap the view's general look and feel. List views utilize the bibliography-list and topic-list web components to bootstrap the look and feel for lists. The item view components are single-item web page views for publication and topic records.


3.3.5.2 Site Navigation

Figure 8 – Web Application Navigation by View

As mentioned in Section 3.3.5.1, the web interface can be broken down into three types of views – main, list, and item. Navigation across the website entails traversing amongst these types of views. Figure 8 depicts the overall navigation structure of the web application.

The main views are presented in a navigation drawer. The four main views are home, topics, stats, and search views. Non-main view items in the drawer include a list view containing recently added bibliographies and a search form stub. General information about the project can be found in the home view. The topic view presents a list of topic categories, starting from root topics as shown in Figure 9. Each item in the list navigates to single topic views containing

detailed information, subtopics, and associated publication records.

Figure 9 – Topics View Screenshot

Currently a stub, the stats view is intended for presenting statistical information on the dataset. The search view presents an interface for finding bibliographies and topics via a text search string. It is implemented as a search bar that can be toggled into view by a search icon button, as shown in Figure 10.


Figure 10 – Search / Home View Screenshot

List views allow for interaction with collections of data. The list views are: subtopics, bibliographies per topic, and search results for both bibliographies and topics. In a list view, a list item can be navigated to and examined in full detail, or, in the case of a topic, expanded to introduce a list of subtopics pertaining to that topic, as shown in Figure 11.

Figure 11 – Subtopics View Screenshot


When a topic item navigates to a view of publications relevant to that topic, a view like

Figure 12 is displayed. In this view, publication records associated with the topic "Mathematics of computing" are presented. A button can be clicked to view detailed information on that topic, or a bibliography item in the list can be clicked to view more information about that bibliography record.

Figure 12 – Topic Relevant Bibliographies Screenshot


Similarly, list items in search result views can be expanded to view more information on the item as shown in Figure 13, depicting sample search results for "machine learning" in the bibliographies collection.

Figure 13 – Bibliography Search Results Screenshot

Item views are pages that present data pertaining to a single item. DBLPminer has two views for items, one for viewing bibliographic content and one for viewing topic content. In the example depicted in Figure 14, bibliographic information is shown for a publication with the title

"A Diversity Measure for Tree-based Classifier Ensembles." In this view, information pertaining to authors, year, and publication venue are shown. Also shown are metadata used by the DBLP database to organize records. Lastly, topic tags that this publication has been indexed under are displayed.


Figure 14 – Bibliography Info Screenshot


Chapter 4

INDEXING ANALYSIS

The foundation of DBLPminer's functionality is the indexing of bibliographies by topic categories. The accuracy of these bibliography indices affects the overall performance of DBLPminer in terms of publication exploration. In this chapter, analysis is conducted on the ACM CCS taxonomy dataset, and the performance of the indexing algorithm is analyzed through experiments.

4.1 Taxonomy Dataset Analysis

Since the ACM CCS taxonomy serves as the foundation for classifying bibliographic records, it is beneficial to understand the collection of terms used in implementing DBLPminer's bibliography indexing. Statistics gathered on the taxonomy dataset focus on the twelve topic trees used by DBLPminer. Refer to Table 2 for a listing of the twelve default topic trees.

4.1.1 Distinct Topic Terms

Distinct topic term counts per topic tree are computed on the taxonomy dataset. Different approaches are considered based on term variations: unmodified, stemmed, and stemmed with stopword removal. The statistics gathered are presented in Table 13. In the unmodified approach, the sum of distinct terms over the individual topic tree sets is 2,687, while the distinct term total for all topic trees combined is 1,614. This differential of 1,073 indicates instances where a term is shared between two or more topic trees.

In the approach with stemming applied via the English Snowball stemming algorithm, the total number of distinct terms amongst all topic trees is reduced by 17.7% in comparison with the unmodified approach, to a total of 1,329. Compared to 1,073 shared instances in the unmodified approach, the stemming approach yields an increase of 4.6%, with a total of 1,122 shared instances. With stemming applied, overall term counts are reduced as similar terms are joined, while shared term instances across topic trees increase since these shared terms become more generalized after being reduced to their "root" form. With the inclusion of stopword removal in the stemmed approach, the distinct term total across all topic trees becomes 1,308, a reduction of 1.6% or 21 terms compared to the stemmed approach. Similarly, shared term instances are reduced by 5.5%, to a total of 1,060 instances.


Table 13 – Taxonomy Distinct Term Counts

Topic Trees                       Unmodified Terms   Stemmed Terms   Stemmed Terms & Stopword Removal
Applied computing                 214                196             190
Computer systems organization     90                 82              76
Computing methodologies           341                309             297
Hardware                          319                294             285
Human-centered computing          143                134             130
Information systems               387                348             340
Mathematics of computing          178                162             156
Networks                          157                147             142
Security and privacy              122                116             110
Social and professional topics    183                173             169
Software and its engineering      271                243             234
Theory of computation             282                247             239
Totals:                           2,687              2,451           2,368
Distinct Totals:                  1,614              1,329           1,308

The stemmed approach with stopword removal is implemented for bibliography indexing in DBLPminer. This approach is favored because it naively maximizes distinctive term quality by reducing terms to their "root" form as defined by the Snowball stemming algorithm while also filtering out common terms with minimal contribution in value via stopword removal.
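The counting itself is straightforward; the sketch below shows how the three variants reported in Table 13 could be computed for a topic tree, reusing the preprocessing approach from Chapter 2 (the toy topic names, stopword subset, and helper names are illustrative).

# Sketch of computing the three distinct-term counts reported in Table 13 for
# one topic tree. The stopword subset and the toy topic names are illustrative.
import Stemmer

STOPWORDS = {"a", "an", "and", "for", "in", "its", "of", "the"}
stemmer = Stemmer.Stemmer("english")

def distinct_terms(topic_names, stem=False, drop_stopwords=False):
    terms = set()
    for name in topic_names:
        words = name.lower().split()
        if drop_stopwords:
            words = [w for w in words if w not in STOPWORDS]
        if stem:
            words = stemmer.stemWords(words)
        terms.update(words)
    return terms

tree = ["Network security", "Security services", "Security of networks"]   # toy data
print(len(distinct_terms(tree)),
      len(distinct_terms(tree, stem=True)),
      len(distinct_terms(tree, stem=True, drop_stopwords=True)))   # prints: 5 4 3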


Examining terms in more detail, the most frequent terms used per topic tree are computed and shown in Table 14. Each stemmed term shown is prepended with the exact number of instances in which the term is used. The three most frequently used families of terms in the taxonomy dataset are network, comput, and system. Out of these three terms, comput is the most commonly used, as it appears within the top three encountered terms for five topic trees. The network family of terms appears to have the most impact on a single topic tree, with a frequency of 70 instances in the Networks topic tree.

Table 14 – Most Frequent Terms

Applied computing: 15 comput, 14 enterpris, 9 document, 8 architectur, 8 manag, 7 busi, 6 process, 6 system, 5 languag
Computer systems organization: 16 architectur, 12 comput, 11 system, 5 real-tim, 5 robot, 4 data, 4 embed, 4 multipl, 4 network, 3 instruct
Computing methodologies: 37 learn, 23 algorithm, 23 model, 19 simul, 13 comput, 12 represent, 8 imag, 8 plan, 7 algebra, 7 languag
Hardware: 23 circuit, 19 design, 12 power, 11 system, 9 test, 8 analysi, 8 devic, 8 emerg, 8 hardwar, 8 synthesi, 7 energi, 7 integr
Human-centered computing: 17 comput, 17 design, 16 social, 13 interact, 12 visual, 9 collabor, 9 mobil, 9 studi, 9 system, 8 interfac, 8 user
Information systems: 37 data, 24 retriev, 24 search, 21 databas, 21 storag, 21 system, 19 web, 17 inform, 15 queri, 14 languag
Mathematics of computing: 10 comput, 10 differenti, 10 optim, 9 analysi, 9 graph, 8 calculus, 8 equat, 7 mathemat, 6 algorithm, 6 statist
Networks: 70 network, 19 protocol, 6 algorithm, 6 area, 6 wireless, 4 mobil, 4 secur, 4 topolog, 3 access, 3 data, 3 design
Security and privacy: 30 secur, 8 privaci, 6 attack, 6 system, 4 engin, 4 hardwar, 4 manag, 4 protocol, 3 authent, 3 control, 3 cryptographi
Social and professional topics: 21 comput, 12 educ, 10 inform, 10 system, 9 manag, 8 polici, 7 technolog, 6 engin, 6 softwar, 5 histori, 4 access
Software and its engineering: 47 softwar, 36 languag, 16 system, 11 architectur, 10 develop, 10 manag, 10 model, 8 program, 6 analysi
Theory of computation: 28 algorithm, 22 theori, 16 complex, 16 learn, 15 program, 13 comput, 13 logic, 12 model, 10 optim, 8 data, 8 databas
All Topics: 113 network, 109 comput, 100 system, 77 model, 74 languag, 68 data, 64 algorithm, 63 softwar, 58 learn


4.1.2 Topic Names

Topic name length counts per topic tree are computed on the taxonomy dataset and presented in Table 15. Name lengths are organized into single-term unigrams, phrased n-grams, and their combined total. Intuitively, increasing the number of terms used to match relevant bibliographies by topic should result in narrower finds. Therefore, unigram topic terms should yield broader finds compared to n-gram topic terms. When unigrams are commonly shared between topic trees, such as comput, accuracy is expected to suffer significantly more when there are no other terms to support topic categorization. Conversely, unigram terms that are mostly contained within their own topic tree pose minimal threat to categorization accuracy. Name lengths computed on the taxonomy dataset are promising, as phrased n-grams make up 89.2% of the total topic categories.

Table 15 – Topic Name Length Counts

Topic Trees                       All     Unigram   N-gram
Applied computing                 160     40        120
Computer systems organization     61      7         54
Computing methodologies           267     19        248
Hardware                          224     14        210
Human-centered computing          119     11        108
Information systems               324     31        293
Mathematics of computing          139     14        125
Networks                          118     5         113
Security and privacy              78      7         71
Social and professional topics    139     24        115
Software and its engineering      213     32        181
Theory of computation             212     16        196
Totals:                           2,054   220       1,834


4.2 Performance Analysis

Bibliography indexing performance depends on both topic name input and how text search is performed to retrieve bibliographies. Since analysis of topic name input is discussed in

Section 4.1, this section will focus on text search methodology. Using text search capabilities of

MongoDB, performance will depend on how text search strings are composed. Table 16 shows indexed bibliography totals using 3 different approaches for text search string composition.

Table 16 – Total Bibliographies Indexed

Topic Trees                       Any Term    All Terms (stemmed)   All Terms (not stemmed)
Applied computing                 1,853,539   981,572               103,759
Computer systems organization     1,306,965   350,686               78,957
Computing methodologies           2,276,984   613,136               188,813
Hardware                          2,138,655   239,099               64,979
Human-centered computing          1,831,156   238,996               42,827
Information systems               2,284,986   478,553               156,581
Mathematics of computing          1,858,230   295,108               90,663
Networks                          1,928,774   257,563               152,342
Security and privacy              1,819,550   58,692                26,421
Social and professional topics    1,825,507   646,521               44,706
Software and its engineering      2,130,579   570,197               140,520
Theory of computation             2,173,441   1,096,748             119,633
Total Indexed Bibliographies:     2,521,720   1,423,199             873,481

As mentioned in Section 3.3.3.3, topic terms are stemmed and filtered, removing stopwords in the generation of text search strings. MongoDB processes a text search based on how the text search string is constructed. In the approach considering any-term, a total of

2,521,720 bibliographies are indexed, whereas the all-term approach indexes 1,423,199, or 56.4% of that total. This total is reduced further when the all-term approach ignores stemming, yielding 873,481 indexed bibliographies.


Compared to the DBLP dataset's maximum total of 2,587,888 records, the any-term approach achieves the highest coverage at 97.4%, while coverage using the all-term approach with and without stemming is 54.9% and 33.7%, respectively. Therefore, indexing using MongoDB text search functionality is bounded between 33.7% and 97.4% in potential coverage.

While maximizing coverage is desired, precision plays a factor in ultimately deciding the approach used in DBLPminer. Loosely indexing bibliographies using any term yields higher coverage, but reduces precision due to the variety of accepted term combinations. Conversely, indexing publications to topic names nearly verbatim yields better precision, but minimizes leeway in matching similar publications. The all-term approach with stemming is used for DBLPminer as it strikes a balance between these two drastically different approaches. The remainder of this section focuses on analysis pertaining to the all-term approach with stemming.

Indexing coverage is also analyzed with respect to topic name lengths shown in Table 15.

The computed results are presented in Table 17. Total bibliography indices appear to decrease as topic name lengths increase: 1,390,481 with unigrams, 1,356,199 with only bigrams, and

1,331,023 with topic names containing two or more terms (n-grams). Out of the total bibliographies indexed, 97.7% are indexed under unigram topic categories, while 93.5% are indexed under n-gram topic categories. Bibliographies indexed by unigram topic names may not be precise, since unigram topics make up only 10.5% of the total topics per Table 15 while contributing to 97.7% of the total indexed publications. Topic trees with high indexing contributions from unigram topics, such as applied computing, networks, and theory of computation, result from stemmed unigram terms that span multiple topic categories, such as computer and network.


Table 17 – Bibliographies Indexed by N-grams

Topic Trees                       All         Unigrams    Bigrams     N-grams     Unigram ∩ N-gram
Applied computing                 981,572     803,895     666,670     670,587     492,910
Computer systems organization     350,686     169,304     174,674     350,207     168,825
Computing methodologies           613,136     106,329     595,344     613,136     106,329
Hardware                          239,099     175,502     198,131     209,316     145,719
Human-centered computing          238,238     64,092      234,831     238,982     64,078
Information systems               478,553     238,238     454,849     468,243     227,928
Mathematics of computing          295,108     163,006     275,324     276,270     144,168
Networks                          244,637     244,637     215,321     218,186     205,260
Security and privacy              58,692      15,853      54,327      58,198      15,359
Social and professional topics    646,521     82,033      624,514     630,767     66,279
Software and its engineering      570,197     416,731     570,197     570,197     416,731
Theory of computation             1,096,748   1,084,269   753,816     758,301     745,822
Totals:                           1,423,199   1,390,481   1,356,199   1,331,023   1,423,199

Relevance is measured using precision and recall. Assuming bibliography relevance is determined by matching all topic terms minus stopwords, recall is always 100% since the indexing process performs a search for relevant publications according to topic terms. Precision is not as easily computed as the commonality, distinctiveness, and combination of terms vary among topics. Indexing performance and sample results on three topic categories containing a single term are shown in Table 18. In the case where a stemmed topic term is extremely distinctive and uncommon such as psycholog from the topic Psychology, precision is high at about 99%. Although more common with 12 occurrences, the term electron from the topic

Electronics remains fairly distinctive, yielding a precision of 99%. In the case where a stemmed term is neither as distinctive nor as uncommon, such as visual from the topic Visualization, precision appears to suffer much more.

Table 18 – Single Topic Term Indexing Performance

Topic: Psychology
Terms (frequency): psycholog (1)
Retrieved Count: 1,201
Precision: ~99%
First 10 Query Results:
1. Identifying Psychological Theme Words from Emotion Annotated Interviews.
2. Attention Metaphors: How Metaphors Guide the Cognitive Psychology of Attention.
3. Human Psychology of Common Appraisal: The Reddit Score.
4. Issues in the Design of Workstations for Psychology Experimentation.
5. Discrimination Nets as Psychological Models.

Topic: Electronics
Terms: electron (12)
Retrieved Count: 12,831
Precision: ~99%
First 10 Query Results:
1. Electron-beam position monitoring and feedback control in Duke Free-Electron Laser Facility.
2. Exploiting time in electronic health record correlations.
3. Changes to the electronic health records market in light of health information technology certification and meaningful use.
4. Technology and Electronic Communications Act 2000.
5. A Review of Aeronautical Electronics and Its Parallelism With Automotive Electronics.

Topic: Visualization
Terms: visual (19)
Retrieved Count: 34,966
Precision: ~70%
First 10 Query Results:
1. Virtual instrument for measurement, processing data, and visualization of vibration patterns of piezoelectric devices.
2. GaDeVi - Game Development Integrating Tracking and Visualization Devices into Virtools.
3. Motion Visualization of Ultrasound Imaging.
4. Fast Cost-Volume Filtering for Visual Correspondence and Beyond.
5. Virtually Visual: The effects of visual technologies on online identification.

Out of the 34,966 Visualization-associated bibliographies, 14,393 bibliographies contain the terms visualize or visualization. These bibliographies are correctly indexed. Although matching the stemmed term visual, the remaining 20,573 bibliographies cover a variety of areas ranging from visualization, to computation with visual data, to visual languages, etc.

Roughly half of these remaining bibliographies match the concept of visualization; therefore, precision for the topic Visualization comes to about 70%. Precision suffers for this topic because the stemmed form of visualization, visual, occurs more often and is not as distinctive.


Table 19 – Bigram Topic Term Indexing Performance

Topic: Concurrent algorithms
Terms: concurr (8), algorithm (64)
Retrieved Count: 373
Precision: ~80%
First 5 Query Results:
1. A Slicing Algorithm of Concurrency Modeling Based on Petri Nets.
2. The concurrency hierarchy, and algorithms for unbounded concurrency.
3. An Efficient Algorithm-Based Concurrent Error Detection for FFT Networks.
4. Concurrent Algorithmic Debugging.
5. LR-algorithm: concurrent operations on priority queues.

Topic: Machine learning
Terms: machin (13), learn (58)
Retrieved Count: 5,147
Precision: ~99%
First 5 Query Results:
1. Machine learning techniques for scheduling jobs with incompatible families and unequal ready times on parallel batch machines.
2. Algorithms of Machine Learning for K-Clustering.
3. Extracting NPC behavior from computer games using computer vision and machine learning techniques.
4. Machine learning with Lipschitz classifiers.
5. A Machine-Learning Framework for Hybrid Machine Translation.

Topic: Network architectures
Terms: network (113), architectur (42)
Retrieved Count: 4,534
Precision: ~98%
First 5 Query Results:
1. The Architecture of NG-MON: A Passive Network Monitoring System for High-Speed IP Networks.
2. A Novel Architecture for Hierarchically Nested Network Mobility.
3. An Independent Function-Parallel Firewall Architecture for High-Speed Networks (Short Paper).
4. Network Architecture for Scalable Ad Hoc Networks.
5. An Overlay Network Architecture for Data Placement Strategies in a P2P Streaming Network.

Indexing performances and sample results for three topic categories containing two terms are shown in Table 19. The uncommon term concurr and common term algorithm from the topic

Concurrent algorithms lead to a precision score of about 80%. Retrieved bibliographies from this topic are least precise when the two terms appear non-sequentially within the bibliographic title, a strong indication that the order and adjacency of terms affect relevance to topic meaning.

Machine learning contains stemmed terms that are fairly common and mildly common in learn and machin, respectively. The union of the two terms produces a distinctive set of terms that describe a specific topic area, leading to a high precision score of 99%. Like the topic Machine learning, even though both terms are fairly common for the topic Network architectures, precision is a high 98% because the union of the two terms presents a distinct topic area for classification.

Presented in Table 20 are indexing performance and sample results for three topic categories containing three or more terms. Following the trend observed in the two-term topic categories, combining three terms of varying commonality still yields fairly high precision scores. The topic Bayesian nonparametric models has a precision score of 95%, while Power grid design and Computer science education have scores of 89% and 93% respectively. Also as with the two-term topic categories, precision is hindered by cases where the terms appear non-sequentially, which can indicate a topic that differs from the intended topic category. The bibliographic title "A concept map-embedded educational computer game for improving students' learning performance in natural science courses", retrieved for the topic Computer science education, is an example of an inaccuracy caused by non-sequential matching terms.


Table 20 – Trigram+ Topic Term Indexing Performance

Topic: Bayesian nonparametric models
    Terms (frequency): bayesian (5), nonparametr (5), model (77)
    Retrieved Count: 82
    Precision: ~95%
    First 5 Query Results:
        1. Gaussian Beam Processes: A Nonparametric Bayesian Measurement Model for Range Finders.
        2. A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views.
        3. A sparse nonparametric hierarchical Bayesian approach towards inductive transfer for preference modeling.
        4. Dynamic Bayesian Network and Nonparametric Regression for Nonlinear Modeling of Gene Networks from Time Series Gene Expression Data.
        5. Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis.

Topic: Power grid design
    Terms (frequency): power (14), grid (4), design (57)
    Retrieved Count: 37
    Precision: ~89%
    First 5 Query Results:
        1. Power-Aware Designers at Odds with Power Grid Designers?
        2. Control Design and Implementation for High Performance Shunt Active Filters in Aircraft Power Grids.
        3. Design and Realization of Clustering Based Power Grid SCADA System.
        4. Hierarchical power supply noise evaluation for early power grid design prediction.
        5. A fast algorithm for power grid design.

Topic: Computer science education
    Terms (frequency): comput (109), scienc (10), educ (13)
    Retrieved Count: 521
    Precision: ~93%
    First 5 Query Results:
        1. Computer Science and Computer Science Education.
        2. Proceedings of the 2010 International Conference on Frontiers in Education: Computer Science & Computer Engineering, FECS 2010, July 12-15, 2010, Las Vegas, Nevada, USA.
        3. Proceedings of the 2007 International Conference on Frontiers in Education: Computer Science & Computer Engineering, FECS 2007, June 25-28, 2007, Las Vegas, Nevada, USA.
        4. Reconfigurable Computing Education in Computer Science.
        5. Issues in undergraduate education in computational science and high performance computing.


Chapter 5

CONCLUSION AND FUTURE WORK

Bibliographic information from sources such as DBLP can be valuable for exploring publications in academia. Unfortunately, DBLP does not support broad searches by topic area.

DBLPminer is a tool developed to address this limitation of DBLP. DBLPminer consists of modules for publication and topic category management, web services for content interaction, and a GUI web application that interfaces with the web services. At the application's foundation is the categorization of publication records by topic categories. The indexing implementation matches bibliography titles against the terms of each topic name, after the terms have been stemmed and filtered for stopwords.
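As a recap of that indexing step, a minimal sketch of how a topic name becomes a title search, mirroring the logic in indexing.py (Appendix B); the stopword list shown is only an illustrative subset of the project's ./data/stopwords file:

import Stemmer   #PyStemmer

stemmer = Stemmer.Stemmer('english')
stopwords = ['and', 'of', 'the', 'for']   #illustrative subset of the stopword list

def topic_to_search_text(topic_name):
    words = topic_name.lower().split()                                    #tokenize the topic name
    terms = [stemmer.stemWord(w) for w in words if w not in stopwords]    #drop stopwords, then stem
    return ' '.join(u'"{0}"'.format(t) for t in terms)                    #quote each term for MongoDB $text

#e.g. topic_to_search_text('Concurrent algorithms') returns '"concurr" "algorithm"'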

The performance of the algorithm for indexing bibliographic records by topic is analyzed. Performance is measured by the total number of bibliographies indexed (coverage) and by the precision of matching bibliographies to topics. With the aforementioned implementation, coverage is maximized while keeping accuracy in consideration. Precision varies from topic to topic depending on the commonality, distinctiveness, and combination of topic terms. Overall, precision remains high across the sample topics analyzed.
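For reference, precision here is the standard measure

    precision = |relevant retrieved bibliographies| / |retrieved bibliographies|

so, for the Visualization example discussed earlier, precision is roughly (14,393 + ~10,300) / 34,966 ≈ 0.70, where the ~10,300 relevant bibliographies among the remaining visual matches is an estimate ("roughly half" of 20,573).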

5.1 Future Work

Because this project serves as a foundation, several avenues can be taken to improve the overall application tool. The content curation, web server, and web application components can all benefit from modifications and additions. The indexing algorithm in the content curation module can be improved for better performance; for example, further analysis of the taxonomy dataset could determine which terms are most valuable for categorization, and the indexing algorithm could be adjusted accordingly.


Improvements can also be made to DBLPminer's interfaces. The web services implemented on the web server can be made more robust and secure, as the current implementation is rudimentary. Additional web service endpoints can be added to support retrieval of statistical information. The web application can be improved with features such as an interface for interacting with statistical data or a search form for more complicated queries. The web application can also be improved in how it manages state information between pages.


Appendix A. Software Dependencies

Component                   Software / Module   Version   Source
Content Curation (Python)   Python              2.7.8     https://www.python.org/downloads/
                            PyMongo             2.7.2     http://api.mongodb.org/python/current/installation.html
                            PyStemmer           1.3.0     https://github.com/snowballstem/pystemmer
                            lxml                3.3.5     http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Web Server (JavaScript)     Node.js             0.10.31   http://nodejs.org/download/
                            Express.js          4.9.0     http://expressjs.com/starter/installing.html
                            Mongo.js            0.14.1    https://github.com/mafintosh/mongojs
Web Application             Polymer             0.4.0     https://www.polymer-project.org/docs/start/getting-the-code.html
(HTML, CSS, JavaScript)
Database                    MongoDB             2.6.4     http://www.mongodb.org/downloads
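For convenience, hypothetical commands for installing the Python and JavaScript dependencies listed above; package names and versions follow the table, but the exact commands are not part of the original setup notes:

pip install pymongo==2.7.2 PyStemmer==1.3.0 lxml==3.3.5   #content curation dependencies
npm install express@4.9.0 mongojs@0.14.1                  #web server dependencies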


Appendix B. Source Code

dblpminer.py

import argparse, atexit, json, logging.config
from lib import commands, util

''' initialize logging settings from configuration file '''
def init_logging():
    with open('./config/lib/logging.json') as data:
        config = json.load(data)
    if config.get('handlers'):
        filenames = (h['filename'] for h in config['handlers'].values() if h.get('filename'))
        [util.makedirs(filename) for filename in filenames]
    logging.config.dictConfig(config)

''' application entry point '''
def main():
    #set up logging
    init_logging()                                          #initialize logging
    logger = logging.getLogger('miner')                     #log under "miner" group
    logger.info('Application started.')
    exit_hook = lambda s: logger.info(s)
    atexit.register(exit_hook, 'Application terminated.')   #run logging exit hook at termination

    #create command line argument parser
    prog_desc = 'DBLPminer content curation tool.'
    argparser = argparse.ArgumentParser(prog='dblpminer', description=prog_desc)
    subparser = argparser.add_subparsers(title='Commands')

    #commands: data migration
    tooltip = "migration commands"
    migration_subparser = subparser.add_parser('migrate', help=tooltip)
    cmd = commands.Migration()
    migration_subparser.add_argument('data', nargs=1, choices=cmd.options, help=cmd.help)
    migration_subparser.set_defaults(func=cmd.router)

    #commands: bibliography indexing
    tooltip = 'indexing commands'
    index_subparser = subparser.add_parser('index', help=tooltip)
    cmd = commands.Index()
    index_subparser.add_argument('data', nargs=1, choices=cmd.options, help=cmd.help)
    index_subparser.set_defaults(func=cmd.router)

    #commands: analysis
    tooltip = 'content analysis'
    analysis_subparser = subparser.add_parser('analysis', help=tooltip)
    cmd = commands.Analysis()
    analysis_subparser.add_argument('data', nargs=1, choices=cmd.options, help=cmd.help)
    analysis_subparser.set_defaults(func=cmd.router)

    #parse arguments & call corresponding function
    args = argparser.parse_args()
    args.func(args)

if __name__ == '__main__':
    main()
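For illustration, hypothetical invocations of the content curation tool, assuming the configuration files under ./config are in place; the available subcommands and choices follow the argparse setup above and the command classes in commands.py below:

python dblpminer.py migrate acmccs                     #migrate ACM CCS topics into MongoDB
python dblpminer.py migrate dblp                       #migrate DBLP bibliographies into MongoDB
python dblpminer.py index bibliographies               #index bibliographies by topic terms
python dblpminer.py analysis count_indices_per_tree    #run one of the analysis functions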

commands.py

import bibliographies, indexing, persistence, taxonomy, analysis
import inspect

class Router:
    def __init__(self):
        self.help = ""
        self.routes = {}
        self.options = self.routes.keys()

    def router(self, args):
        return self.routes[args.data[0]]()

class Migration(Router):
    def __init__(self):
        self.help = 'content to migrate.'
        self.routes = {
            'acmccs': self.migrate_acmccs,
            'dblp': self.migrate_dblp
        }
        self.options = self.routes.keys()

    def migrate_acmccs(self):
        with persistence.MongoDB() as mongoDB:
            taxonomy.ACMCCS(mongoDB).migrate_topics()

    def migrate_dblp(self):
        with persistence.MongoDB() as mongoDB:
            bibliographies.DBLP(mongoDB).migrate_bibliographies()

class Index(Router):
    def __init__(self):
        self.help = 'index content.'
        self.routes = { 'bibliographies': self.index }
        self.options = self.routes.keys()

    def index(self):
        with persistence.MongoDB() as mongoDB:
            indexing.TopicIndexer(mongoDB).index_bibliographies()

class Analysis(Router):
    def __init__(self):
        self.help = 'Analyze content'
        self.options = [name for (name, value) in inspect.getmembers(analysis, inspect.isfunction)]

    def router(self, args):
        getattr(analysis, args.data[0])()

taxonomy.py

import json, logging
from lxml import etree
import persistence, util

''' ACM Computing Classification System topic data management '''
class ACMCCS:
    def __init__(self, db_client):
        self.logger = logging.getLogger('taxonomy')     #set logging category
        self.collection = db_client.db.acmccs           #set persistence collection object
        with open('./config/lib/ccs.json') as data:     #open configuration file and
            self.config = json.load(data)               #load configuration data

    ''' extract topic information from ACM CCS xml file '''
    def extract_topics_from_xml(self):
        ns = lambda prefix, name: '{%s}%s' % (prefix, name)
        _id = lambda el, name: int(el.get(ns(rdf, name))[1:])

        #iterate through {*}Concept element blocks
        for _, topic in etree.iterparse(self.config['input'], tag='{*}Concept'):
            rdf, skos = topic.nsmap['rdf'], topic.nsmap['skos']   #get mappings to ns prefixes

            #create structure to store topic data
            data = { '_id': _id(topic, 'about'), 'aliases': [], 'broader': [], 'narrower': [] }

            #extract information from topic element's children elements
            for child in topic:
                if child.tag == ns(skos, 'prefLabel'):        #topic name
                    data['name'] = unicode(child.text)
                elif child.tag == ns(skos, 'altLabel'):       #topic alternative names
                    data['aliases'] += [unicode(child.text)]
                elif child.tag == ns(skos, 'broader'):        #topic parent ids
                    data['broader'] += [_id(child, 'resource')]
                elif child.tag == ns(skos, 'narrower'):       #topic children ids
                    data['narrower'] += [_id(child, 'resource')]

            yield data
            topic.clear()   #mark element for garbage collection

    ''' update taxonomy path information to topic data (used during data migration) '''
    def __update_taxonomy_paths(self, topics, path=[], topic=None):
        if topic:
            #insert/update taxonomy path info to topic data
            path_str = ';'.join(unicode(_id) for _id in path)
            topic['path'] = topic['path'] + [path_str] if topic.get('path') else [path_str]

            #recursively update subtopics
            subtopics = (t for t in topics if t['_id'] in topic['narrower'])
            for subtopic in subtopics:
                self.__update_taxonomy_paths(topics, path=path + [topic['_id']], topic=subtopic)
        else:
            #start from root topics
            root_topics = [t for t in topics if t['name'] in self.config['topics']]
            for root_topic in root_topics:
                self.__update_taxonomy_paths(topics, topic=root_topic)

    ''' write topic labels to output file '''
    def create_output_file(self, print_aliases=False):
        print_format = lambda pad, label: ('%s- %s\n' % (pad, label)).encode('UTF-8')

        with open(util.makedirs(self.config['output']), 'w') as f:
            for depth, topic in self.iterate_trees():   #traverse topic trees
                f.write(print_format('\t' * depth, topic['name']))
                #also write topic label aliases to file if print_aliases is set to TRUE
                for alias in (alias for alias in topic['aliases'] if print_aliases):
                    f.write(print_format('\t' * depth, alias))

    ''' migrate topic information from xml file to database & output txt file '''
    def migrate_topics(self):
        self.logger.info('Migrating ACM CCS topics to database.')
        topics = [t for t in self.extract_topics_from_xml()]    #extract topic data from xml

        #update taxonomy path info per topic and prune topics according to path relevance
        self.__update_taxonomy_paths(topics)
        filtered_topics = [t for t in topics if t.get('path')]

        [self.save_topic(topic) for topic in filtered_topics]   #persist topics to database
        self.create_output_file()                               #write topic labels to output

        #log results
        selected, total = len(filtered_topics), len(topics)
        self.logger.info('%s/%s ACM CCS topics migrated to database.' % (selected, total))

    ''' persist topic data to database '''
    def save_topic(self, topic):
        fields, weights = [('name', 'text'), ('aliases', 'text')], {'name': 2, 'aliases': 1}
        self.collection.ensure_index(fields, weights=weights, name='fts')
        self.collection.ensure_index('path', name='path')
        self.collection.ensure_index('broader', name='broader')
        self.collection.ensure_index('narrower', name='narrower')
        return self.collection.find_and_modify(
            query = {'_id': topic['_id']},
            update = {'$setOnInsert': topic},
            upsert = True
        )

    ''' retrieve topic information from database '''
    def find_topic(self, criteria={}, projection={}):
        docs = self.collection
        return docs.find_one(criteria, projection) if projection else docs.find_one(criteria)

    def find_topics(self, criteria={}, projection={}, sort={}, limit=0):
        docs = self.collection
        cursor = docs.find(criteria, projection) if projection else docs.find(criteria)
        return cursor.sort(sort.items()).limit(limit) if sort else cursor.limit(limit)

    ''' iterate topic trees (depth-first: pre-order & in-order modes) '''
    def iterate_topics(self, topic=None, path=[], mode='pre_order'):
        if not topic:
            #start from root topics
            root_topics = [topic for topic in self.find_topics(criteria={'path': [u'']})]

            #create generator for iterating root subtopics recursively
            for root_topic in root_topics:
                for _, topic in self.iterate_topics(topic=root_topic, mode=mode):
                    yield path, topic
        else:
            if mode == 'pre_order':       #yield path & topic information in pre-order traversal
                yield (path, topic)
            path.append(topic['name'])    #update tree path for children nodes

            #create generator for iterating subtopics recursively
            criteria = {'_id': {'$in': topic['narrower']}}
            subtopics = [subtopic for subtopic in self.find_topics(criteria)]

            for subtopic in subtopics:
                for _, t in self.iterate_topics(topic=subtopic, path=path, mode=mode):
                    yield path, t

            path.pop()                    #update path information for ancestor node
            if mode == 'in_order':        #yield path & topic information in in-order traversal
                yield (path, topic)

    def iterate_tree(self, node_name, topic=None, depth=0):
        if not topic:
            topic = self.collection.find_one({'name': node_name})
            if topic:
                for d, t in self.iterate_tree(node_name, topic=topic):
                    yield d, t
        else:
            yield depth, topic
            depth += 1
            criteria = {'_id': {'$in': topic['narrower']}}
            subtopics = [subtopic for subtopic in self.find_topics(criteria)]

            for subtopic in subtopics:
                for d, t in self.iterate_tree(node_name, topic=subtopic, depth=depth):
                    yield d, t

    def iterate_trees(self, topic=None):
        root_topics = [t for t in self.find_topics(criteria={'path': [u'']})]
        for root_topic in root_topics:
            name = root_topic['name']
            for d, t in self.iterate_tree(name, topic=root_topic):
                yield d, t

bibliographies.py

import json, logging, re
from lxml import etree
import persistence, util

''' DBLP data management '''
class DBLP:
    def __init__(self, db_client):
        self.logger = logging.getLogger('dblp')            #set logging category
        self.collection = db_client.db.bibliographies      #set persistence collection object
        with open('./config/lib/dblp.json') as data:       #open configuration file and
            self.config = json.load(data)                  #load configuration data

    ''' format title element into html/plain text (support for extract_bibliographies_from_xml) '''
    def __format_title(self, element):
        tostring = lambda el, m: etree.tostring(el, encoding='unicode', method=m).strip()
        text, html = tostring(element, 'text'), tostring(element, 'html')
        html = re.sub('^', '', re.sub('\$', '', html))
        return { 'html': html, 'text': text }

    ''' extract bibliography information from DBLP xml file '''
    def extract_bibliographies_from_xml(self):
        source, venues = self.config['input'], self.config['venues']
        for _, element in etree.iterparse(source, load_dtd=True, tag=venues):
            dblp, authors, extra, title = dict(element.items()), [], {}, None   #init data variables
            for metadata in element:
                if metadata.tag == 'author':                      #extract author metadata
                    authors += [metadata.text]
                elif metadata.tag == 'title':                     #extract title metadata
                    title = self.__format_title(metadata)
                elif metadata.tag in self.config['metadata']:     #extract other metadata
                    extra[metadata.tag] = metadata.text

            #yield bibliography contents if a title exists
            if title:
                yield { 'venue': element.tag, 'dblp': dblp, 'title': title, 'authors': authors, 'extra': extra }
            element.clear()   #mark element for garbage collection

    ''' migrate bibliography information from xml file to database & output txt file '''
    def migrate_bibliographies(self):
        self.logger.info('Migrating DBLP bibliographies to database.')
        with open(util.makedirs(self.config['output']), 'w') as f:
            for count, bibliography in enumerate(self.extract_bibliographies_from_xml()):
                self.save_bibliography(bibliography)
                f.write((u'%s\n' % bibliography['title']['text']).encode('UTF-8'))

                if (count + 1) % 100000 == 0:
                    self.logger.debug('%8d DBLP bibliographies migrated to database.' % (count + 1))
            self.logger.info('%9d DBLP bibliographies migrated to database.' % (count + 1))

    ''' persist bibliography data to database '''
    def save_bibliography(self, bibliography):
        fields, weights = [('title.text', 'text'), ('tags', 'text')], {'title.text': 10, 'tags': 1}
        self.collection.ensure_index(fields, weights=weights, name='fts')
        self.collection.ensure_index('dblp.key', unique=True, name='dblp.key')
        self.collection.ensure_index('dblp.mdate', name='dblp.mdate')
        self.collection.ensure_index('authors', name='authors')
        return self.collection.find_and_modify(
            query = {'dblp.key': bibliography['dblp']['key']},
            update = {'$setOnInsert': bibliography},
            upsert = True
        )

    ''' update bibliography tag data in database '''
    def update_bibliography_tags(self, _id, tags):
        return self.collection.find_and_modify(
            query = {'_id': _id},
            update = {'$set': {'tags': tags}}
        )

    ''' clears tag field from all documents in database '''
    def clear_bibliograhies_tags(self):
        return self.collection.update(
            {'tags': {'$exists': True}},
            {'$unset': {'tags': True}},
            multi=True
        )

    ''' retrieve bibliography information from database '''
    def find_bibliographies(self, text=None, criteria={}, projection={}, sort={}, limit=0):
        criteria = {'$text': {'$search': text}} if text else criteria
        docs = self.collection
        cursor = docs.find(criteria, projection) if projection else docs.find(criteria)
        return cursor.sort(sort.items()).limit(limit) if sort else cursor.limit(limit)

    def count_matching_bibliographies(self, text):
        return self.collection.find({'$text': {'$search': text}}, {}).count()

indexing.py

import logging
import Stemmer
import bibliographies, taxonomy

class TopicIndexer:
    def __init__(self, db_client):
        self.logger = logging.getLogger('indexer')
        self.ccs = taxonomy.ACMCCS(db_client)
        self.dblp = bibliographies.DBLP(db_client)

        with open('./data/stopwords', 'r') as f:
            self.stopwords = [line.strip() for line in f]
        self.stemmer = Stemmer.Stemmer('english')

    def index_bibliographies(self):
        stem = lambda word: self.stemmer.stemWord(word)                    #stems word
        quote = lambda text: u'"{0}"'.format(text)                         #wraps word in ""
        concat = lambda words: unicode(" ".join(unicode(word) for word in words))
        union = lambda list_a, list_b: list(set(list_a) | set(list_b))     #union two lists

        self.logger.info('Indexing bibliographies:')
        indexed_set = set()
        topic_trees, current_tree = [], None

        for path, topic in self.ccs.iterate_topics():
            topic_name = topic['name'].lower()
            search_terms = [stem(word) for word in topic_name.split() if word not in self.stopwords]
            search_text = concat(quote(term) for term in search_terms)
            topic_indices = path + [topic_name]
            matches = set()

            for doc in self.dblp.find_bibliographies(text=search_text, projection={}):
                indexed_set.add(doc['_id'])
                tags = union(doc['tags'], topic_indices) if doc.get('tags') else topic_indices
                matches.add(doc['_id'])
                self.dblp.update_bibliography_tags(doc['_id'], tags)

            if len(path) == 0:
                self.logger.debug('%7d bibliographies indexed.' % len(indexed_set))
                current_tree = { 'name': topic_name, 'matches': set(matches) }
                topic_trees.append(current_tree)
            else:
                current_tree['matches'] |= set(matches)

        self.logger.info('%8d bibliographies indexed in total.' % len(indexed_set))

        self.logger.info('Counting relevant bibliographies:')
        aggregate_set = set()
        for tree in sorted(topic_trees, key=lambda topic_tree: topic_tree['name']):
            aggregate_set |= tree['matches']
            self.logger.info('%5d %s' % (len(tree['matches']), tree['name']))
        self.logger.info('%5d Total (All topics)' % len(aggregate_set))

analysis.py

import json, logging, collections
import Stemmer
import bibliographies, indexing, taxonomy, persistence, util

logger = logging.getLogger('analysis')

class Tool:
    def __init__(self):
        #load configuration
        with open('./config/lib/analysis.json') as data:
            self.config = json.load(data)

        #access MongoDB collections
        with persistence.MongoDB() as db:
            self.ccs = taxonomy.ACMCCS(db)
            self.dblp = bibliographies.DBLP(db)

        #load stopwords / stemmer
        with open('./data/stopwords', 'r') as f:
            self.stopwords = [line.strip() for line in f]
        self.stemmer = Stemmer.Stemmer('english')

        #set up munger
        stem = lambda words: self.stemmer.stemWords(words)
        rm_stop = lambda words: [w for w in words if w not in self.stopwords]
        self.munge = {
            'none': lambda l: l,
            'stem': lambda l: stem(l),
            'removeStopwords': lambda l: rm_stop(l),
            'removeStopwordsAndStem': lambda l: stem(rm_stop(l))
        }[self.config['munging_mode']]

    def analyze_topics(self):
        trees, root = [], None
        for depth, topic in self.ccs.iterate_trees():
            name = topic['name']
            terms = self.munge(name.lower().split())
            size = len(terms)
            if depth == 0:
                root = {'name': name, 'terms': [], 'ngrams': collections.defaultdict(int)}
                trees.append(root)
            root['terms'] += [terms]
            root['ngrams']['all'] += 1
            root['ngrams'][str(size)] += 1
            if size > 1:
                root['ngrams']['2+'] += 1
        return sorted(trees, key=lambda tree: tree['name'])

    def analyze_terms(self):
        all_counter = collections.defaultdict(int)
        trees, root = [], None
        for depth, topic in self.ccs.iterate_trees():
            name = topic['name']
            terms = self.munge(name.lower().split())
            if depth == 0:
                root = {'name': name, 'topics': [], 'terms': collections.defaultdict(int)}
                trees.append(root)
            for term in terms:
                all_counter[term] += 1
                root['terms'][term] += 1
            root['topics'] += [{'name': name, 'terms': terms}]

        folder = './data/output/terms'
        with open(util.makedirs('%s/terms.txt' % folder), 'w') as f:
            terms = sorted(all_counter.iteritems(), key=lambda t: (-t[1], t[0]))
            for term, count in terms:
                f.write(('%4d %s\n' % (count, term)).encode('UTF-8'))

        for tree in trees:
            with open(util.makedirs('%s/%s.txt' % (folder, tree['name'])), 'w') as f:
                terms = sorted(tree['terms'].iteritems(), key=lambda t: (-t[1], t[0]))
                for term, count in terms:
                    f.write(('%4d %s\n' % (count, term)).encode('UTF-8'))
            with open(util.makedirs('%s/tree/%s.txt' % (folder, tree['name'])), 'w') as f:
                count = lambda term: tree['terms'][term]
                topics = sorted(tree['topics'], key=lambda k: (max(map(count, k['terms'])), k['name']))
                for topic in topics:
                    f.write(('%s\n' % topic['name']).encode('UTF-8'))
                    for term in sorted(topic['terms'], key=lambda t: -count(t)):
                        f.write(('%4d: %s\n' % (count(term), term)).encode('UTF-8'))
                    f.write('\n'.encode('UTF-8'))

    def analyze_indexing(self):
        quote = lambda text: u'"{0}"'.format(text)
        concat = lambda words: unicode(" ".join(unicode(word) for word in words))
        for depth, topic in self.ccs.iterate_trees():
            name = topic['name']
            terms = self.munge(name.lower().split())
            size = len(terms)
            search_text = concat(quote(term) for term in terms)
            cursor = self.dblp.find_bibliographies(text=search_text, projection={})
            docs = set(b['_id'] for b in cursor)
            yield depth, name, size, docs

    def analyze_indicies(self):
        logger.info('Listing relevant bibliographies per topic:')
        quote = lambda text: u'"{0}"'.format(text)
        concat = lambda words: unicode(" ".join(unicode(word) for word in words))
        for name in self.config['topics']:
            terms = self.munge(name.lower().split())
            size = len(terms)
            search_text = concat(quote(term) for term in terms)
            cursor = self.dblp.find_bibliographies(text=search_text, projection={'title.text': 1})
            with open('./data/output/indices/%s.txt' % name, 'w') as f:
                for c, b in enumerate(cursor):
                    f.write((u'%s\n' % b['title']['text']).encode('UTF-8'))
            logger.info('%7d: %s\n' % (c + 1, name))

def count_topic_terms():
    logger.info('Counting distinct words per topic tree:')
    total_terms = set()
    for tree in Tool().analyze_topics():
        distinct_terms = set(term for terms in tree['terms'] for term in terms)
        total_terms |= distinct_terms
        logger.info('%5d %s' % (len(distinct_terms), tree['name']))
    logger.info('%5d Total (All topics)' % len(total_terms))

def count_topic_term_frequency():
    logger.info('Counting term frequency:')
    Tool().analyze_terms()

def count_topic_ngrams():
    logger.info('Counting ngrams per topic tree:')
    logger.info(' %5s %5s %5s %5s %5s %s' % ('uni', 'bi', 'bi+', 'all', 'inter', 'TREE:'))
    group = lambda d, name: (d['1'], d['2'], d['2+'], d['all'], d['inter'], name)
    counter_totals = collections.defaultdict(int)
    for tree in Tool().analyze_topics():
        counter = tree['ngrams']
        ngrams = [terms for terms in tree['terms'] if len(terms) > 1]
        unigrams = set(terms[0] for terms in tree['terms'] if len(terms) == 1)
        intersection = [n for n in ngrams if any(u in unigrams for u in n)]

        folder = './data/output/terms/intersections'
        with open(util.makedirs('%s/%s.txt' % (folder, tree['name'])), 'w') as f:
            for term in intersection:
                f.write((u'%s\n' % term).encode('UTF-8'))

        counter['inter'] = len(intersection)
        counter_totals['inter'] += len(intersection)
        for key in ['1', '2', '2+', 'all']:
            counter_totals[key] += counter[key]
        logger.info(' %5d %5d %5d %5d %5d %s' % group(counter, tree['name']))
    logger.info(' %5d %5d %5d %5d %5d %s' % group(counter_totals, 'Total (All topics)'))

def count_indices_per_tree():
    logger.info('Counting bibliography indices per topic tree:')
    trees, root = [], None
    for depth, name, size, docs in Tool().analyze_indexing():
        if depth == 0:
            root = {'name': name, 'ngrams': collections.defaultdict(set)}
            trees.append(root)
        root['ngrams']['all'] |= docs
        root['ngrams'][str(size)] |= docs
        if size > 1:
            root['ngrams']['2+'] |= docs

    totals = collections.defaultdict(set)
    set_names = {'1': 'unigrams', '2': 'bigrams', '2+': 'ngrams', 'all': 'all', 'inter': 'intersection'}
    for tree in sorted(trees, key=lambda d: d['name']):
        for key in ['1', '2', '2+', 'all']:
            totals[key] |= tree['ngrams'][key]
        tree['ngrams']['inter'] = tree['ngrams']['1'] & tree['ngrams']['2+']
        totals['inter'] |= tree['ngrams']['inter']
        logger.info(' %s' % tree['name'])
        for key in set_names.keys():
            logger.info(' %7d %s' % (len(tree['ngrams'][key]), set_names[key]))
    logger.info(' Totals (All topics)')
    for key in set_names.keys():
        logger.info(' %7d %s' % (len(totals[key]), set_names[key]))

def count_indices_per_topic():
    logger.info('Counting bibliography indices per topic:')
    trees, root = [], None
    for depth, name, size, docs in Tool().analyze_indexing():
        if depth == 0:
            root = {'name': name, 'topics': []}
            trees.append(root)
        root['topics'] += [(len(docs), size, name)]

    folder = './data/output/indices/counts'
    for tree in trees:
        logger.info('%s (%d topics):' % (tree['name'], len(tree['topics'])))
        with open(util.makedirs('%s/%s.txt' % (folder, tree['name'])), 'w') as f:
            tree['topics'].sort(key=lambda x: (x[1], -x[0]))
            for count, size, name in tree['topics']:
                f.write((u'%7d %2d %s\n' % (count, size, name)).encode('UTF-8'))

def output_indices_per_topic():
    Tool().analyze_indicies()

persistence.py

import json, pymongo

class MongoDB:
    def __init__(self):
        with open('./config/db/mongodb.json') as c:
            config = json.load(c)                                   #load data from config
        self.client = pymongo.MongoClient(config['host'], config['port'])
        self.db = self.client[config['db']]                         #database object

    '''context manager's (with keyword) entry hook'''
    def __enter__(self):
        return self

    '''context manager's (with keyword) exit hook'''
    def __exit__(self, type, value, traceback):
        self.client.close()


util.py

import os, time

''' create necessary directories for a given file path '''
def makedirs(filepath):
    if not os.path.exists(os.path.dirname(filepath)):
        os.makedirs(os.path.dirname(filepath))
    return filepath

server.js

/* DB connection */
var mongojs = require('mongojs')                           //load mongojs module
var config = require('../config/db/mongodb.json')         //load config file
var url = config['host']+':'+config['port']                //build connection url
        + '/'+config['db']
var db = mongojs.connect(url, config['collections'])      //create db client
process.on('exit', function(){db.close()})                 //close connection on exit

/* Express */
var express = require('express')                           //load express module
var router = express.Router()                              //get express routing object

/* Web Service Routes */
router.route('/search/topics').get(function(req, res, next) {
    search_text = req.query.text
    db.acmccs
      .find({'$text': {'$search': search_text}})
      .limit(30, function(err, docs) {
          if (err) res.send(err)
          res.send(docs)
      })
})

router.route('/search/bibliographies').get(function(req, res, next) {
    search_text = req.query.text
    db.bibliographies
      .find({'$text': {'$search': search_text}}, {'title.text':1})
      .limit(30, function(err, docs) {
          if (err) res.send(err)
          res.send(docs)
      })
})

router.route('/bibliographies').get(function(req, res, next) {
    db.bibliographies
      .find({}, {'title.text':1})
      .sort({'dblp.mdate':-1})
      .limit(30, function(err, docs) {
          if (err) res.send(err)
          res.send(docs)
      })
})

router.route('/bibliographies/:_id').get(function(req, res, next) {
    key = mongojs.ObjectId(req.params._id)
    db.bibliographies.findOne({'_id':key}, function(err, data) {
        if (err) res.send(err)
        res.send(data)
    })
})

router.route('/bibliographies/topic/:_id').get(function(req, res, next) {
    key = parseInt(req.params._id)
    db.acmccs.findOne({'_id':key}, {'name':1}, function(err, topic) {
        db.bibliographies
          .find({'tags':topic.name}, {'title.text':1,'tags':1})
          .limit(20, function(err, docs) {
              if (err) res.send(err)
              res.send(docs)
          })
    })
})

router.route('/topics').get(function(req, res, next) {
    db.acmccs.find({'path':['']}, {'name':1, 'narrower':1}, function(err, topics) {
        if (err) res.send(err)
        res.send(topics)
    })
})

router.route('/topics/all').get(function(req, res, next) {
    db.acmccs.find({}, {'name':1}, function(err, topics) {
        if (err) res.send(err)
        res.send(topics)
    })
})

router.route('/topics/:_id').get(function(req, res, next) {
    key = parseInt(req.params._id)
    db.acmccs.findOne({'_id':key}, function(err, topic) {
        if (err) res.send(err)
        res.send(topic)
    })
})

router.route('/topics/:_id/subtopics').get(function(req, res, next) {
    key = parseInt(req.params._id)
    db.acmccs.find({'broader':key}, {'name':1, 'narrower':1}, function(err, topics) {
        if (err) res.send(err)
        res.send(topics)
    })
})

/* Create Web Server */
var expressApp = express()
expressApp                                                 //create routes for web service, assets, and index
    .use('/api', router)
    .use('/components', express.static(__dirname+'/components'))
    .use(express.static(__dirname+'/public'))

var server = expressApp.listen(21025, function() {
    console.log('Listening on port %d', server.address().port)
})
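For illustration, hypothetical requests against the web service once the server is listening on port 21025; the paths follow the routes defined above:

curl http://localhost:21025/api/topics                                            #root topics of the taxonomy
curl http://localhost:21025/api/topics/all                                        #names of all topics
curl "http://localhost:21025/api/search/topics?text=machine%20learning"           #full-text search over topic names
curl "http://localhost:21025/api/search/bibliographies?text=machine%20learning"   #full-text search over bibliography titles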

dblpminer-layout.html

dblpminer-layout.css

* { font-family: 'RobotoDraft', sans-serif; }                            /* Font */
html, body { height: 100%; width: 100%; overflow: hidden }
.main-logo { font-size: 32px; color: #014731; padding: 4px 0 0 16px; }   /* Logo */
/* Drawer */
#drawerPanel:not([narrow]) #menuButton { display: none; }
[drawer] { background-color: #eee; box-shadow: 1px 0 1px rgba(0, 0, 0, 0.1); }
core-menu#menu { padding: 16px 0; margin: 0; }
core-menu#menu paper-item { padding-left: 32px; font-size: 16px; font-weight: normal !important; height: 56px; color: #003E2B; }
core-menu#menu paper-item.core-selected { background-color: #dedede; }
/* App Bar */
#appBar { color: #fff; background-color: #003E2B; font-size: 20px; font-weight: 400; }
core-toolbar.medium-tall { height: 144px; }
core-toolbar span { max-width: 960px; }
/* Search Panel */
#searchBar { background-color: #eee; }
#searchType paper-button.core-selected { color: green; }
#searchPanel { background-color: #ddd; }
/* Content */
[main] { height: 100%; max-width: 960px; background-color: white; }
#drawerPanel[narrow] #views { position: absolute; top: 0; right: 0; bottom: 0; left: 0; overflow: auto; }

dblpminer-layout.js

Polymer('dblpminer-layout', {
    responsiveWidth: '860px',
    ready: function() {
        this.$.searchType.selected = 0
        this.page = location.hash.slice(1) || 'home'
        addEventListener('popstate', this.popstate.bind(this))
    },
    pageChanged: function() {
        if (this.poppedPage !== this.page) history.pushState(this.page, '', '#'+this.page)
    },
    popstate: function(event) {
        this.poppedPage = this.page = event.state
    },
    back: function() { history.back() },
    menuSelect: function(e, detail) {
        if (detail.isSelected) {
            this.pageTitle = detail.item.label
            if (this.narrow) { this.$.drawerPanel.closeDrawer() }
        }
    },
    togglePanel: function() { this.$.drawerPanel.togglePanel(); },
    tagSignal: function(e, detail, sender) {
        this.targetID = detail._id
        this.targetName = detail.name
        this.page = detail.tag
        this.pageTitle = detail.title
    },
    toggleSearch: function() {
        this.searching = !this.searching || false
        this.$.searchInput.inputValue = ""
    },
    searchTypeSelect: function(e, detail) {
        if (detail.isSelected) this.searchType = detail.item.getAttribute('tag')
    },
    searchKeypress: function(e) {
        ENTER_KEYCODE = 13
        if (e.keyCode == ENTER_KEYCODE) this.performSearch()
    },
    performSearch: function() {
        this.$.searchInput.commit()
        if (this.$.searchInput.value) {
            this.searchCollection = this.searchType
            this.searchText = this.$.searchInput.value
            this.fire('core-signal', {
                name: 'tag',
                data: { tag: 'results', name: this.searchText, title: 'Search Results' }
            })
        }
        this.searching = false
    }
})

dblpminer-view.html

dblpminer-view.css

* { font-family: 'RobotoDraft', sans-serif; }
::content section {
    padding: 8px 16px;
    margin: 24px;
    line-height: 32px;
    color: #555;
    font-size: 16px;
    background-color: #eee;
    max-width: 960px;
}
::content section h1 { font-weight: normal; margin: 16px 0px 0px 0px; color: green; }
::content section p { margin: 8px 8px; }

section-divider.html

topic-list.html

topic-list-item.html

topic-list-item.css paper-item:hover { color: #E2C776; } paper-icon-button:hover { color: green; }

bibliography-list.html

bibliography-list-item.html

bibliography-list-item.css paper-item:hover { color: #E2C776; }

home-view.html

home-view.css a { text-decoration: none; color: green; } .link { width: 16px; height: 16px; } paper-button { color: green; height: 40px; font-size: 16px; }

topics-view.html

bibliographies-view.html

stats.html


search-form.html

results-view.html

subtopics-view.html


topic-view.html

topic-view.css


#mainMeta { padding-left: 24px; }
#mainMeta td:first-child { font-weight: bold; }
#mainMeta td:last-child { text-align: right; }

topic-bibliographies-view.html

topic-bibliographies-view.css

paper-button {
    margin-top: 10px;
    margin-bottom: 8px;
    font-size: 14px;
    background: #003E2B;
    color: #E2C776;
    height: 40px;
}
paper-button:hover { background: green; color: yellow; }

bibliography-view.html


bibliography-view.css

table { color: #555; }
#mainMeta { padding-left: 24px; }
#mainMeta td:first-child { font-weight: bold; }
#dblpMeta { font-size: 12px; text-transform: uppercase; }
#dblpMeta td:first-child { color: green; }
.tag {
    color: gray;
    font-size: 12px;
    display: inline-block;
    background-color: #ddd;
    border-radius: 0.6em;
    padding-left: 8px;
    padding-right: 8px;
    margin-bottom: 8px;
}


index.html DBLPminer


Bibliography

[1] dblp: DBLP Bibliography. (2014). Retrieved September 28, 2014, from http://www.informatik.uni-trier.de/~ley/db/

[2] The 2012 ACM Computing Classification System - Association for Computing Machinery. (2014). Retrieved September 28, 2014, from http://www.acm.org/about/class/class/2012

[3] Arnetminer Introduction. (2014). Retrieved September 28, 2014, from http://arnetminer.org/introduction

[4] Scholarometer: Browser Extension and Web Service for Academic Impact Analysis. (2014). Retrieved September 28, 2014, from http://scholarometer.indiana.edu/

[5] About Google Scholar. (2014). Retrieved November 12, 2014, from http://scholar.google.com/intl/en-US/scholar/about.html

[6] ResearchGate. (2014). Retrieved November 12, 2014, from http://www.researchgate.net/about

[7] MongoDB. (2014). Retrieved September 28, 2014, from http://www.mongodb.org/

[8] Logging Configuration - Python 2.7.8 documentation. (2014). Retrieved September 28, 2014, from https://docs.python.org/2/library/logging.config.html

[9] $text - MongoDB Manual 2.6.4. (2014). Retrieved September 28, 2014, from http://docs.mongodb.org/manual/reference/operator/query/text/

[10] node.js. (2014). Retrieved September 28, 2014, from http://nodejs.org/

[11] Polymer. (2014). Retrieved September 28, 2014, from https://www.polymer-project.org/