CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Determining the Erdős Number in DBLP-DB and More

A thesis submitted in partial fulfillment of the requirements

For the degree of Master of Science in Computer Science

Bageshree Parmar

December 2015

The thesis of Bageshree Parmar is approved:

______

Dr. Adam Kaplan Date

______

Dr. Robert D McIlhenny Date

______

Dr. John Noga, Chair Date

California State University, Northridge

Table of Contents

Signature

List of Figures

Abstract

1. Introduction

2. Related Work

3. Application Overview

3.1. MariaDB

3.2. Servlets/JSP

4. Design of the Application

4.1. Database Design

4.2. Web Application Design

4.3. Shortest Path Algorithm

4.4. Application Features

4.5. Lessons Learned

5. Results

5.1. Screenshots of Web Application

5.2. Facts From Database

5.3. Performance of the Application

6. Future Work

6.1. Database Setup

6.2. Better Author Identification

6.3. Add Admin Privileges

6.4. Find Multiple Shortest Paths

6.5. Application to other problems

7. Conclusion

References

List of Figures

Figure 4.1. Excerpt from DBLP XML file

Figure 4.2. Database schema

Figure 4.3. CREATE TABLE sample query using CONNECT

Figure 4.4. Code snippet in Java used for creating coauthors table

Figure 5.1. Screenshot of application homepage

Figure 5.2. Screenshot of collaboration distance tool

Figure 5.3. Screenshot of Erdős number tool

Figure 5.4. Screenshot of advanced search tool

Figure 5.5. Co-Publications button showing the publication list between the authors

Figure 5.6. Comparison of Erdős numbers in DBLP and MR dataset

Abstract

Determining the Erdős Number in DBLP-DB and More

Bageshree Parmar

Master of Science in Computer Science

This thesis presents a study of the collaboration graph of the bibliography data of the

computer science community. Such a graph shows how the authors in the community

relate to each other based upon co-authorship of academic papers. This study is supported

by a Java-based web application. The application features tools to calculate the

collaboration distance and Erdős number in the collaboration graph. The back-end of the

web application uses a MariaDB database created from the DBLP XML file consisting of

the computer science publication records. The analysis from this graph reveals many

interesting characteristics about the relationship between the authors in this community.

The study has identified the most prolific authors in the computer science community. It

also shows how the Erdős number of an author in the collaboration graph of the computer

science community matches with the Erdős number in the collaboration graph of the

mathematics community.

1. Introduction

Social network analysis has gained increasing popularity in recent years. The origin

of social network analysis can be traced back to the “six degrees of separation”

phenomenon observed in an experiment, i.e., that any two people in the U.S. are connected

through about six intermediaries. Since then, much research has been done to analyze

this observation, and many studies have confirmed it in various social networks. The

strategy in social network analysis is to investigate social structures using network and

graph theories. Networks play a significant role in computer science, and researchers

study various networks such as citation networks, collaboration networks, etc. in

academia. Underlying any network is a graph, and this thesis focuses on the collaboration

graph. The structure and evolution of collaboration among researchers can be

investigated using such a collaboration graph or a co-authorship network.

A collaboration graph is a graph modeling a social network where the vertices

represent people in that network and where two distinct people are joined by an edge

whenever there is a collaborative relationship between them of a particular kind [7]. It is

a simple graph with no loop edges and no multiple edges. The collaboration graph need

not be connected. An isolated vertex will represent a person who has never co-authored a

paper. Paul Erdős was an extremely prolific and significant figure in the field of

mathematics. He had more than 500 co-authors and published around 1500 technical

papers according to data from Oakland University [5]. The Erdős number of a

researcher is defined as the collaboration distance between Paul Erdős and that researcher

in the collaboration graph. V. Yegnanarayanan et al. [13] observe that the collaboration graph

of mathematicians has a "small world topology", i.e., it has a large number of vertices,

most of small degree, that are highly clustered, and a giant connected component with

short average distances between vertices.

Collaboration graphs are used to measure the closeness of relationships between the

participants of the network. One useful and widespread application of such graphs is to find the

collaboration distance between participants. This is defined as the length of the shortest

path between two distinct nodes. If no such path exists, then the collaboration distance

between them is said to be infinite. Collaboration distance can be used to evaluate a

group of authors or journals/conferences, or to assess citations of an author.
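
The definition above can be illustrated with a breadth-first search on a small, made-up graph. This is a minimal sketch, not the application's implementation: the class name, the single-letter author names, and the use of -1 to stand for an infinite distance are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch: the class, the sample graph, and the -1 convention are assumptions.
class CollaborationDistanceSketch {

    // Tiny made-up collaboration graph: A-B, B-C, E-F; D has never co-authored a paper.
    static Map<String, List<String>> sampleGraph() {
        return Map.of(
                "A", List.of("B"),
                "B", List.of("A", "C"),
                "C", List.of("B"),
                "D", List.of(),
                "E", List.of("F"),
                "F", List.of("E"));
    }

    // Returns the number of edges on a shortest path, or -1 when no path exists
    // (the "infinite" collaboration distance).
    static int distance(Map<String, List<String>> graph, String from, String to) {
        if (from.equals(to)) {
            return 0; // convention for identical endpoints; the definition covers distinct nodes
        }
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String coauthor : graph.getOrDefault(current, List.of())) {
                if (dist.containsKey(coauthor)) {
                    continue; // already reached by a path that is at least as short
                }
                int d = dist.get(current) + 1;
                if (coauthor.equals(to)) {
                    return d;
                }
                dist.put(coauthor, d);
                queue.add(coauthor);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = sampleGraph();
        System.out.println("A to C: " + distance(graph, "A", "C"));
        System.out.println("A to E: " + distance(graph, "A", "E"));
    }
}
```

In this sample graph, A and C are two edges apart, while A and E lie in different components, so the method reports -1 for that pair.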

This thesis focuses on the Erdős number and the concept of collaboration distance in

the collaboration graph constructed from the bibliography data of the computer science

community that is available at the DBLP website [9]. It is an on-line reference for

bibliographic information on major computer science publications. It provides free access

to the data for research purposes. It was started by the computer science department of

the University of Trier and is currently maintained by both the University of Trier and

Schloss Dagstuhl - Leibniz Center for Informatics, Germany. It currently indexes over

2.6 million publications, published by more than 1.4 million authors. It also indexes

about 25,000 journal volumes, more than 24,000 conferences and workshops, and more

than 17,000 monographs.

In the Erdős collaboration graph, two authors are joined by an edge if they have

co-authored a paper. The Erdős number of an author is the collaboration distance

from that author to Paul Erdős in the graph. Existing online tools for calculating

Erdős numbers are discussed in the next chapter.

The nature of the data available is useful in investigating the collaboration

characteristics of authors in the computer science community. The bibliographic records also

cover a wide time range that can be used to study the evolution of the network over time.

The motivation behind this project was to analyze and review the DBLP network to find

interesting facts or patterns. Another goal was to create a tool to find the collaboration

distance between authors in the computer science community. Microsoft offers a tool

to find the collaboration distance between academic authors, but its data source is

vague, the scope is not limited to any particular field, many spurious results are presented

(e.g. two authors who share a first initial and last name are usually identified as the same

person), and many computer science publications are not included.

This report has seven chapters. Chapter 1 provides a general overview of the thesis and

the motivation behind implementing it. Chapter 2 surveys similar applications

and tools. Chapter 3 explores the technologies and software used in the

development of the application. Chapter 4 describes the detailed application development

process, reasons for modifications to some of the features, and lessons learned. Chapter 5

presents the results of the study and details of the web application.

Chapter 6 discusses further enhancements to the application. Chapter 7 concludes the thesis.

2. Related Work

This chapter presents applications related to this thesis about calculating the

collaboration distance.

Microsoft Academic Search [2] is a web-based search implementation and an

experimental research service concentrating exclusively on scholarly material. The

service indexes millions of academic publications and displays the key relationships

between and among subjects, content, and authors, highlighting the critical links that help

define scientific research. Two features in their web application are similar to features

developed in this thesis. One of them is the Co-author Graph feature that shows a visual

graph of all the co-authors of a particular author. The nodes in the graph represent the

authors, and the edges show the number of publications common between two authors.

One can refine the search by clicking on the nodes and edges to get more information

about the authors and co-authored publications respectively. Another feature is the Co-

author Path feature that provides a visual display of shortest path between two authors.

This tool shows the relationship and the degree of separation between the coauthors. The

nodes and edges in the graph show the author and the common publication information as

mentioned in the previous tool. However, Microsoft Academic Search does not provide

explicit information about its data source; only some details of the sources are

mentioned on its website. The tool has information from various disciplines, but

appears incomplete for any particular area. Hence, this tool is not focused on a particular

field or community but acts more like an (incomplete) search engine for academics with

access to some tools that perform functions similar to those accomplished in this thesis.

Another similar application, MathSciNet [4], is an electronic publication offering

access to a database of reviews, abstracts, and bibliographic information for the

mathematical sciences literature. MathSciNet contains almost 3 million items and over

1.7 million direct links to original articles. This web of citations allows users to track the

history and influence of research publications in the mathematical sciences. The

MathSciNet database provides an excellent example of a relatively large collaboration

graph. The website hosts a free online tool called Collaboration Distance, which finds the

shortest publications-path between two mathematicians as well as the Erdős number of a

mathematician. This tool also shows the actual chain of co-authors that realizes the

collaboration distance. The user has to enter two author names in the input fields

provided, and the search returns the shortest path between those two authors in the

collaboration graph. For each edge, it shows the paper co-authored by the two authors.

Hence, this tool is also similar in functionality to the one developed in this thesis but is

only focused on the mathematics community.

3. Application Overview

This chapter presents the architectural overview of the web application developed to

find the Erdős number and the collaboration distance between two authors. The

application is implemented using JSP/Servlets and supported by a MariaDB database as

the back-end. An overview of the technologies used for the development of the web

application is as follows.

3.1 MariaDB

MariaDB is an open source relational database management system and a drop-in

replacement for MySQL. An important reason for selecting this database management

system was the power of the CONNECT storage engine in MariaDB 10.0. This storage

engine enables MariaDB to access external local or remote data as if they were standard

tables on the server. This is done by defining tables based on different data types, in

particular, files in various formats, data extracted from other DBMS or products via

ODBC, or data retrieved from the environment [6].

XML is used to encode and store data having any structure. The tag hierarchy in an

XML file describes a tree structure of the data. Modern database management systems,

including MariaDB, implement something close to the relational model and work on

tables that are not hierarchical but tabular, with rows and columns.

Nevertheless, the CONNECT engine can work with such hierarchical data. The user

can specify what data to extract from the XML structure.

The CONNECT engine was selected for this thesis because it 1) supports

tables represented by XML files, 2) supports large table sizes, and 3) handles multiple

nodes in an XML file very easily. The DBLP dataset is a huge XML file of

approximately 1.7 GB (at the time of download on 6/23/2015), and the entire data set was

imported from this XML file into MariaDB. CONNECT can also handle multiple nodes in

the XML document. This feature is important because the author node can be “multiple”,

meaning that there can be more than one author of a publication. This information needs

to be imported correctly into the relational model. Most of the tools available online could

not handle multiple nodes in an XML file, and importing the data into MariaDB failed.

CONNECT provides two ways to handle multiple nodes. The first is to return as many

rows as there are authors, repeating the other columns as if a join had been made between

the author column and the rest of the table. To achieve this, the “multiple” node name and

the “expand” option have to be specified when creating the table. They are defined as

sub-options of the “option_list” in the CREATE TABLE query. The “limit” sub-option, if defined,

specifies the maximum number of values that will be expanded. If not specified, it

defaults to 10. Any values above the limit will be ignored, and a warning message will be

issued. The second way is to set the “expand” option mentioned

above to false, which returns a comma-separated list of the multiple node values in a

single row.
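
The effect of the two choices can be simulated in plain Java. The sketch below does not call MariaDB or CONNECT; it only mimics, under stated assumptions, what the two settings produce for one publication record. The class name, method names, and sample data are hypothetical, the ", " separator is an assumption, and the limit is modelled only for the expanded case, as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation only; nothing here calls MariaDB or CONNECT.
class MultipleNodeSketch {

    // Mimics expand = 1: one {title, author} row per author, keeping at most `limit` authors.
    static List<String[]> expandedRows(String title, List<String> authors, int limit) {
        List<String[]> rows = new ArrayList<>();
        int kept = Math.min(limit, authors.size());
        for (int i = 0; i < kept; i++) {
            rows.add(new String[] {title, authors.get(i)});
        }
        return rows;
    }

    // Mimics expand = 0: a single row whose author column is a comma-separated list.
    static String[] singleRow(String title, List<String> authors) {
        return new String[] {title, String.join(", ", authors)};
    }

    public static void main(String[] args) {
        List<String> authors = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            authors.add("Author " + i);
        }
        // Row counts with the default limit of 10 and with a raised limit of 50.
        System.out.println(expandedRows("Some Paper", authors, 10).size());
        System.out.println(expandedRows("Some Paper", authors, 50).size());
        // Author column of the single-row form for a two-author record.
        System.out.println(singleRow("Some Paper", authors.subList(0, 2))[1]);
    }
}
```

For a record with 12 authors, the simulated default limit of 10 keeps only 10 rows, so two authors are lost; raising the limit keeps all 12. Section 4.5 reports this kind of truncation as a problem encountered during the import.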

3.2 Servlets/JSP

A Java servlet is a program that runs within a web server. Servlets receive and respond

to requests from web clients across HTTP. Servlets are the Java counterpart to

PHP and ASP.NET. JSP (JavaServer Pages) is a server-side technology that gives software

developers a dynamic and platform-independent method for building

Web-based applications. JSP has access to Java APIs, including the JDBC API to access

enterprise databases, which is also used for this thesis and discussed in a later section. It is

a technology for developing web pages that support dynamic content, which lets

developers insert Java code between <% and %> tags in HTML pages. A JSP page is a type of servlet

designed to fulfill the role of a user interface for a Java web application. JSP can be used

to collect user input from HTML forms, present results from a database or another source,

and create web pages dynamically.

4. Design of the Application

This chapter presents the design of the web application, which includes details about

the back-end database schema, user interface, and the features implemented in the

application.

4.1 Database Design

The collaboration graph of the DBLP database is the graph whose vertices represent authors of

computer science publications and whose edges connect authors who have

been co-authors on at least one paper. DBLP is a high-quality citation digital library that

has nearly complete coverage of the computer science community and hence was chosen as

the dataset for this thesis. The database for this study was designed primarily with a web

application in mind. Hence, it was designed as the back-end part of the application and in

such a way as to optimize for the needs of the web application.

Figure 4.1. Excerpt from DBLP XML file

The DBLP XML file format follows the BibTeX file format and contains millions of

records. The file structure is defined in the DTD (document type definition) file named

dblp.dtd accompanying the XML file. An excerpt from the XML file is shown in Figure

4.1. The XML root element contains a long sequence of bibliographic records.

The DTD lists several elements to be used as a bibliographic record:

<!ELEMENT dblp (article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www)*>

These tags correspond to the entry types used in BibTeX. The publication records used in

the database are given by one of the following elements in DBLP XML file:

• article – An article from a journal or magazine.

• inproceedings – An article in conference proceedings.

• incollection – A part of a book having its own title.

The elements in the DBLP XML file that are used for this thesis are described

below:

<author>: The name of the author. In DBLP, there is an author element for each author.

The order of the author elements inside a record is the same as at the head of the paper.

DBLP specifies an author name as a full name without commas. If name parts are

abbreviated, each initial is followed by a dot. For example, it would use H. P. Smith

rather than H P Smith or HP Smith. A dot is followed by a blank or a hyphen. DBLP

distinguishes authors with the same name by appending a space character and a four-digit

number to the names.

<title>: The title of the work. This is a required element in each DBLP publication record. It may contain <sub> elements for subscripts, <sup> elements for superscripts, <i> elements for italics, and <tt> elements for typewriter text style. These elements may be nested.

<pages>: This element is given in the form “from-to”. If the number of the last page is unknown, it takes the form “from-”. A single-page paper is given as the page number without a hyphen. For articles split across several parts of a magazine, it holds a comma-separated list of page numbers or page ranges. In rare cases, the pages element may contain any character sequence.

<year>: The year of publication. It contains a four-digit number according to the Gregorian calendar. For journal articles, it is the date of publication of the issue. The year field of a conference paper specifies when the conference took place; the year field of the enclosing proceedings specifies the publication date.

<booktitle>: The title of a book, part of which is being cited.

<journal>: The journal or magazine in which the work was published. Abbreviations are provided for many journals.

<number>: The (issue) number of a journal or magazine.

<volume>: The volume of a journal or multi-volume book.

The database schema designed for the application is shown in Figure 4.2. The tables named article, inproceedings, and incollection represent the elements <article>, <inproceedings>, and <incollection> in the XML file. They are created using the CONNECT engine by importing data from the XML file. The publication types are initially kept in separate tables because, when importing data from the XML file, the element that will be identified as a row in a table must be specified. How the CREATE TABLE query works while importing from the XML file is explained in detail later in this section.

Figure 4.2. Database schema

Figure 4.3. CREATE TABLE sample query using CONNECT engine

Figure 4.3 shows a sample SQL CREATE TABLE query using CONNECT in MariaDB. By default, the column names specified in the query correspond to the tag names in the XML file. XML is case-sensitive, and hence the column names should be the same as they appear in the file, though there is an option to use column names that differ from the tags in the XML file. The order of the columns in the table can also differ from the order in which the nodes appear in the XML file. The “tabname” option defines the root element of the XML file. Rows in the tables are identified by declaring the “rownode” option as article, incollection, or inproceedings in the CREATE TABLE query. Because a record in the XML file can have more than one author, the “mulnode” option is specified as ‘author’. The “expand” option is defined as ‘1’, which returns as many rows as there are authors for a publication record, with the other columns repeated for each author. The “limit” option is defined as 50 because a few papers have more than 40 authors (10 is the default value if this option is not specified, as discussed in Section 3.1).

Once the desired data was imported into MariaDB, new tables were created for use in SQL join operations and in the processing of the algorithm. The art_inp_inc table combines the data of all publication types in one table using a UNION ALL operation on the three original tables, i.e., article, inproceedings, and incollection. The author table contains distinct author names along with an auto-generated authorid. Similarly, the paper table contains distinct paper titles with an auto-generated paperid and the paper type information (i.e., whether the paper is a conference paper, a journal article, or a book chapter). The coauthors table contains a pre-computed list of the coauthors of each author.
Originally, this table was created so that the algorithm could obtain the coauthors of an author directly and minimize computation time, but it is now used only for analysis of the database. After the advanced search feature was implemented, the table could no longer be used in the algorithm, because the coauthor list then depends on the selected paper type(s). A Java code snippet for the creation of the coauthors table is shown in Figure 4.4.

Figure 4.4. Code snippet in Java used for creating coauthors table

The author_paper table was created by joining the tables author, paper, and art_inp_inc. This table contains the author id, paper id, and paper type information, which is used later in the execution of the algorithm. The algorithm queries the database on the fly to find the coauthors of the authors entered in the application. The coauthor result depends on the application feature issuing the query. For the advanced search feature, the coauthor list includes only the coauthor data from the paper type(s) selected in the application.

Another necessary task in creating the tables is adding indexes. Indexes improve the speed of operations on a table, especially SELECT queries. An index can be created on one or more columns, providing the basis for both rapid random lookups and efficient ordering of access to records. Indexes are created on the columns that are used in SQL queries. For example, the art_inp_inc table has an index on the column title because it is used in the join operation with the tables author and paper to produce the author_paper table. The author_paper table, in turn, has an index on the columns authorid, paperid, and papertype because the algorithm uses this table to find the coauthor list.

4.2 Web Application Design

This thesis is supported by a Java-based web application to find the Erdős number of the computer science authors in DBLP.
Java was chosen as the programming language for this application in order to gain experience with designing and creating Java-based web applications. The application provides tools to find the collaboration distance and the Erdős number, advanced search options, and an email service to receive suggestions from users of the application. The front end of the application uses JSP/Servlets, and the back end is a MariaDB database. JDBC is used for communication between the application and the database. JSON is used for formatting the data before displaying the results in HTML. jQuery and JavaScript are also used to perform server- and client-side validations and to fulfill certain functionality.

4.3 Shortest Path Algorithm

A shortest path algorithm based on the concept of bi-directional search is designed to find the collaboration distance between two authors in the coauthor graph. This graph search algorithm runs two simultaneous searches: one forward from the initial state and one backward from the goal, stopping when the two meet in the middle. The reason for choosing this approach is that the dataset used in this application is huge, and performing a bi-directional search improves the performance of the algorithm drastically. It is faster than a naive breadth-first search. The number of authors in the database is over 1.5 million, and many authors have 50 or more co-authors, including some with more than 500 co-authors. For instance, in a problem where both searches expand a tree with branching factor b and the distance from start to goal is d, each of the two searches has complexity O(b^(d/2)), and the sum of these two search times is much less than the O(b^d) complexity that would result from a single search from the beginning to the goal.

The algorithm, implemented in Java, computes the shortest path between two authors in the coauthor graph. The implementation is structured in four classes, three servlets, and three JSP files.
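
The bi-directional idea can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the thesis code: the class name, the in-memory adjacency map, the integer author ids, and the sample graph are all hypothetical, whereas the application obtains coauthors from the database. The sketch always advances the side whose outer level is smaller and stops as soon as the two searches meet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: an in-memory graph with integer author ids stands in for the database.
class BidirectionalSearchSketch {

    // Made-up graph: chain 1-2-3-4, extra edge 1-5, and a separate component 6-7.
    static Map<Integer, List<Integer>> sampleGraph() {
        return Map.of(
                1, List.of(2, 5),
                2, List.of(1, 3),
                3, List.of(2, 4),
                4, List.of(3),
                5, List.of(1),
                6, List.of(7),
                7, List.of(6));
    }

    // Returns the authors on a shortest path from author1 to author2, or an empty list if none exists.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> graph, int author1, int author2) {
        if (author1 == author2) {
            return List.of(author1);
        }
        // Each map records, for every author reached from one side, the author it was reached from.
        Map<Integer, Integer> parent1 = new HashMap<>();
        Map<Integer, Integer> parent2 = new HashMap<>();
        parent1.put(author1, author1);
        parent2.put(author2, author2);
        Set<Integer> frontier1 = new HashSet<>(List.of(author1));
        Set<Integer> frontier2 = new HashSet<>(List.of(author2));

        while (!frontier1.isEmpty() && !frontier2.isEmpty()) {
            // Advance the side whose outer level is smaller.
            boolean advanceFirst = frontier1.size() <= frontier2.size();
            Set<Integer> frontier = advanceFirst ? frontier1 : frontier2;
            Map<Integer, Integer> own = advanceFirst ? parent1 : parent2;
            Map<Integer, Integer> other = advanceFirst ? parent2 : parent1;

            Set<Integer> next = new HashSet<>();
            for (int current : frontier) {
                for (int coauthor : graph.getOrDefault(current, List.of())) {
                    if (own.containsKey(coauthor)) {
                        continue; // already reached from this side
                    }
                    own.put(coauthor, current);
                    if (other.containsKey(coauthor)) {
                        return buildPath(coauthor, parent1, parent2); // the two searches met
                    }
                    next.add(coauthor);
                }
            }
            if (advanceFirst) {
                frontier1 = next;
            } else {
                frontier2 = next;
            }
        }
        return List.of(); // one side ran out of authors: no path exists
    }

    private static List<Integer> buildPath(int meet, Map<Integer, Integer> parent1,
                                           Map<Integer, Integer> parent2) {
        List<Integer> path = new ArrayList<>();
        // Trace back from the meeting point to author1, then reverse.
        int node = meet;
        path.add(node);
        while (parent1.get(node) != node) {
            node = parent1.get(node);
            path.add(node);
        }
        Collections.reverse(path);
        // Continue from the meeting point to author2.
        node = meet;
        while (parent2.get(node) != node) {
            node = parent2.get(node);
            path.add(node);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(sampleGraph(), 1, 4));
        System.out.println(shortestPath(sampleGraph(), 1, 6));
    }
}
```

Returning at the first meeting is safe here because each side is advanced one whole level at a time: before a level is expanded, no path shorter than the combined depth of the two searches exists, so the first connection found has the minimum possible length.
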
The Author class defines the Author object and helps set or retrieve the author metadata for the algorithm. The algorithm queries coauthor data for the Author object received from the input in the application. The getCoauthors method defined in the SQLUtility class returns a list of coauthors to the algorithm. Each Author object has a label of type Integer that defines the position of that Author object during the algorithm execution. The algorithm requires two authors, and it constructs the coauthor path object in the FindCoauthorPath class.

The Author class has a createAuthor method to produce new Author objects. The createAuthor method first tests whether there is already an Author object with the specified author id in AuthorMap. A new object is created only if the test fails. After creation, an Author object contains only the author’s id. The algorithm loads the list of coauthors only when required. The boolean field coauthorsLoaded contains the required state information. The getCoauthors method loads the coauthors list based on the paper type(s) selected in the advanced search option; by default, all paper types are considered when fetching the coauthor data from the database.

The algorithm logic is implemented in the CollaborationDistance class. As discussed above, the algorithm uses a bi-directional breadth-first search approach to optimize its performance. Hence, the search starts from both authors, and the algorithm prefers to visit next the side with the lower number of authors. The method coauthorPath in the CollaborationDistance class implements this idea. The algorithm labels the two authors, author1 and author2, with 1 and -1, respectively. The coauthors of author1 are labeled 2, and the coauthors of author2 are labeled -2. The position1 and position2 variables contain the sets of authors who form the outer level of the labeling processes.
The main loop chooses the side on which to advance next depending on the sizes of these sets. When a level is expanded, the unvisited authors are stored in the set named next. Whenever the sets position1 and position2 meet, the authors have a coauthor path between them, and that path is recovered by tracing it in reverse order from the meeting point. If the two sides never meet, the authors are not connected, and no path exists between them.

4.4 Application Features

Erdős number: This tool finds the Erdős number of any author in the database, i.e., the collaboration distance between that author and Paul Erdős. One of the authors is set to Paul Erdős, and the user only needs to enter the other author. The purpose of this tool is to study any similarities between the Erdős number found in MathSciNet and the one calculated in this application. It would also be interesting to identify authors who have published in computer science and have the same Erdős number in both datasets but a different set of publications on the path. This may happen because MathSciNet consists of mathematical publications, whereas this application uses the database of computer science publications from DBLP.

Collaboration distance: This tool finds the shortest path between any two authors in the database used in this application. It also displays the publications co-authored by adjacent authors on the path. This tool helped in analyzing the six degrees of separation and the small world phenomenon discussed earlier in this report.

Advanced search option: This feature provides the capability to select the type(s) of publication to consider while finding the collaboration distance. The user can select publication type categories such as journal articles, conference papers, and book chapters. This tool helped to identify any patterns based on the type of publication.
Contact us: This tool lets a user provide suggestions via email. Since this application is designed for academic research purposes, it is important to get feedback and comments from its users. This tool will also help to maintain the database and to correct any errors identified by authors or users. This is important because the database is large, and although DBLP takes great care to provide information that is as correct as possible, errors can still be introduced into the database and need to be fixed. For example, an author may notice that their paper is listed under another author with the same name. In this case, the suggestion to update the database will help fix the error.

4.5 Lessons Learned

This section describes the problems faced while designing the application.

Database setup issue: There were issues while importing the data from the XML file into MariaDB because of the character set used in the DBLP XML file. The XML file is encoded in plain ASCII. Additional ISO/IEC 8859-1 (latin-1) characters are defined as named entities in the DTD accompanying the XML file and are used whenever necessary. Most parts of DBLP are restricted to ISO-8859-1 (latin-1) characters, i.e., the first 256 Unicode characters. However, some <author> elements contain only latin-1 characters, whereas numerical entities may be outside of this range. For example, <title> elements may include Greek letters like ε. All characters above the first 256 Unicode characters are given as numerical entities. The file contains foreign-language characters, and its encoding is Latin1. The XML DTD file contains the data needed to resolve the named entities. Although the default character set of MariaDB is Latin1, it had problems importing the XML file, which has the same encoding. The problem was in referencing the entities from the DTD file.
After numerous failed attempts, the named character entities used in the XML file were replaced with their corresponding numeric entities defined in the DTD file. A simple Perl command accomplished this by replacing every match of a regular expression with a replacement string. This method worked, and the data was then successfully imported from the XML file into the MariaDB database.

Sample Perl command used:

perl -pi -w -e 's/oldregex/newregex/g;' DBLP.xml

The MariaDB database was created and updated several times during the implementation of the application. The database is huge, and problems were identified while executing queries. It is impossible to verify every single record in the database for abnormalities or unidentified exceptions. One such situation arose when the number of authors for a publication record in the XML file was over 25. The limit on the number of authors to expand for each record during import defaulted to 10, which truncated the remaining authors from the record. This issue was discovered while comparing the coauthors table with the coauthor lists provided on the DBLP website. Therefore, the database was created again from the initial step of importing the data, so that the CREATE TABLE query would retrieve the correct number of authors from the XML file.

Another issue was that duplicate and insufficient values were populated in the database for the publication title. Initially, the data from the <title> element of each record was stored in the database without the other metadata such as the volume, pages, etc. Then, distinct papers were selected and stored in the paper table to identify each publication title uniquely. However, it was realized later, when importing the <incollection> type data, that many titles were simply ‘Preface’ and that only the metadata associated with each record could distinguish them.
This is because the incollection records represent book chapters, and many of them have the title Preface but come from different books. This was not known beforehand, so the authors of all Preface titles were treated as coauthors of each other, which introduced incorrect entries in the coauthors table. This issue also led to creating the database again, this time including the metadata of all paper titles to distinguish the Preface items of different books.

Last but not least, a lot of time was spent creating the database because of the indexes added to the tables. Since the data is huge, creating indexes was necessary to speed up the queries. However, INSERT statements fed by join operations took more time on tables that had indexes. The reason is that on every insert or update, the database needs to insert or update the index entries as well. Hence, making changes to the database or creating it again takes a lot of time.

5. Results

This chapter presents screenshots of the web application and facts about the collaboration graph and Erdős numbers as analyzed in the database. The screenshots show the important features implemented in the application. The facts present interesting results identified as part of the thesis.

5.1 Screenshots of Web Application

The home page of the application, shown in Figure 5.1, gives a brief overview of the application and the concepts used in implementing it. The important features of the application are provided to the user as links in the left sidebar, as shown in Figure 5.1.

The Erdős number link opens a page layout as shown in Figure 5.2, where a user can enter an author name to find that author's Erdős number. Here one of the author name fields is already set to Paul Erdős.

The collaboration distance link opens a layout similar to the Erdős number tool, as shown in Figure 5.3.
This tool helps a user find the collaboration distance between the two authors entered.

The advanced search link opens a layout as shown in Figure 5.4. Here a user can select publication type categories using checkboxes. This tool lets a user find the Erdős number or the collaboration distance based on the publication type(s) selected.

In all the tools, a button named “Co-publications”, shown in Figure 5.5, appears next to each author name in the results area. Clicking this button opens a pop-up that displays the publications shared by the author whose button was clicked and the author listed directly above. In the advanced search tool, the Co-publications button filters the results based on the types selected.

Figure 5.1. Screenshot of application homepage

Figure 5.2. Screenshot of collaboration distance tool

Figure 5.3. Screenshot of Erdős number tool

Figure 5.4. Screenshot of advanced search tool

Figure 5.5. Co-Publications button showing the publication list between the authors

5.2 Facts Computed in the Database

1) Figure 5.6 compares the number of authors with Erdős number 1 and 2 in the DBLP dataset and in the Mathematical Reviews (MR) dataset. These results were computed using a Java program and the coauthors table. The numbers in the DBLP dataset are considerably smaller than those in the MR dataset because DBLP contains computer science publications, and Paul Erdős has fewer publications in DBLP than in MR.

Erdős number    DBLP dataset    MathSciNet
1               165             511
2               3780            11009

Figure 5.6. Comparison of Erdős numbers in DBLP and MR dataset

2) Another comparison to Mathematical Reviews: - One similarity is observed when comparing Erdős numbers found with the database used in this application and with the Mathematical Reviews database.
The Erdős number of many authors in the computer science coauthor graph is the same as that in the mathematics coauthor graph. However, the publications used in computing the path may be different, because the Mathematical Reviews database contains mathematics publications and the database used in this application contains computer science publications.

3) Number of vertices in the computer science co-author graph is 1594172, which is the number of authors in the database.

4) Number of edges in the co-author graph is 2983838, which is the number of publications in the database. Hence, we can observe that the number of edges in the graph is almost double the number of vertices in the graph.

5.3 Performance of the Application

The application currently performs well when displaying results from the database to the user. An earlier implementation used the coauthors table, which stored the pre-computed coauthor list of each author. This resulted in very good performance because no processing was required at the back-end, and the only time spent was in executing the algorithm. The new implementation does not use the coauthors table; the list of coauthors required by the algorithm is calculated on the fly while the application runs. Nevertheless, the application works with no noticeable performance degradation. A primary reason for this is the use of indexes in the tables.

6. Future Work

This chapter presents ideas that can be implemented to enhance the application in the future. Time constraints limited the amount of work done in this thesis. The following ideas can be implemented to improve the user experience of the web application and to automate certain tasks that keep the database up to date.

6.1 Automatic Database Setup

The DBLP XML file is huge, with millions of publication records. The data from the XML file is imported into MariaDB to construct the back-end for the web application.
The size of the database is approximately 1.6 GB, and it will keep increasing as DBLP keeps adding data. Problems were faced importing such a huge file, specifically while creating tables with SQL join operations. Most of the SQL queries involving join operations took a long time to finish and consumed a lot of memory and hard disk space. This is because indexes were created to enhance the performance of select queries, as the database is large.

At present, the database is already set up for use with the web application. The XML file was downloaded from the DBLP website on June 23, 2015. DBLP updates the file regularly, so it is very important to propagate the updates from the DBLP file to the database used in this application. At present, the database needs to be rebuilt from scratch from the XML file.

Writing automated scripts to create the required database is a very good option to solve the above problem. The scripts can regularly check the DBLP website for updates, execute the SQL queries, and create the desired database for the application automatically. Alternatively, the administrator can start the scripts manually at any time to create the database as required. This capability is a very important feature to implement so that the admin can keep the database up to date.

6.2 Better Author Identification

The DBLP dataset contains many different authors with the same name. It is very important to be able to identify the authors based on some metadata. The database distinguishes different authors with the same name by appending a four-digit number to the end of the author name. However, the user of the application has no way to tell such homonymous authors apart while entering a name in the application. For example, when a user types Chen Li in the author name input field, the application suggests at least seven different authors with that same name.
In this case, the user will not be able to tell which author they are looking for in the drop-down list of suggestions.

A very good solution to this problem is to store author affiliation information in the database. When the user types an author name in the input field, the application can then show the drop-down list of suggested author names together with their affiliations. This would make it very easy for a user to identify the author for whom they are searching.

One strategy for identification would be to use the affiliation information of all the authors in the database. However, no central source for this information has been found, which makes importing it into MariaDB a difficult manual process. The data needs to be available in one place, because the thesis database has over 1.5 million authors and it is not feasible to insert such information manually from various sources. Another issue with importing the affiliation information concerns the author name format. If the format of the author names differs from the one used in DBLP, the names will not match during import, and the database may end up with incorrect values.

A less interesting approach to author identification is to use the publication year range of an author. This metadata can make it somewhat easier to pick an author name from the suggestions in the web application. It would work in the following way: the user sees the years in which the author was active alongside the author name. This can help in deciding which author to choose from the list.

6.3 Add Admin Privileges

Providing admin capabilities through a Log In button to manage the database at the back-end would be a good enhancement for the web application. The administrator can go in and update or correct the database when desired.
A user interface can be designed to help the admin update the database easily and quickly. For example, if the admin wants to update the name of an author in the database, they can log in to the application and fill in a form with the new author name to replace the existing one, and the form would make the change in the database. The admin need not work with the database through a terminal and need not even know any SQL queries. They just need to choose a field to update and enter values through an input field in a form. This would make the necessary changes to the database automatically. The admin can perform this task in particular after receiving an email from an author who has identified issues with their publications, the spelling of their name, etc.

The other admin privilege is related to the enhancement discussed in section 6.1. The admin can have access to scripts that create the database from the XML file to keep it up to date.

6.4 Find Multiple Shortest Paths

Another enhancement for the application is to find multiple shortest paths between two authors. This feature can be tricky to implement and may degrade the performance of the application because of the extra processing required for each possible combination of collaboration paths. Although this feature is not implemented explicitly at present, the results displayed in the application do show different shortest paths when the same query is performed multiple times. This is because the application currently retrieves the coauthors of an author from the database on the fly; the coauthor list can come back in a different order for each query, so the application may show a different one of the shortest paths each time.

6.5 Application to other problems

This study has applications to other similar problems in social network analysis. The collaboration distance tool can be used in any social or academic network to find the shortest path between any two people.
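As an illustration of how a collaboration distance can be computed in an arbitrary network, the following minimal Python sketch uses breadth-first search, a standard technique for shortest paths in unweighted graphs. It is not necessarily the algorithm described in Section 4.3. The function name collaboration_path and the toy network with its names are hypothetical and do not come from the DBLP data.

```python
from collections import deque


def collaboration_path(graph, source, target):
    """Return one shortest path from source to target, or None if none exists.

    graph maps each person to a list of direct collaborators (undirected).
    """
    if source not in graph or target not in graph:
        return None
    if source == target:
        return [source]

    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor in parent:
                continue
            parent[neighbor] = current
            if neighbor == target:
                # Walk back through the parents to rebuild the path.
                path = [target]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical toy network: Ana - Ben - Cy - Dee, with Eve isolated.
network = {
    "Ana": ["Ben"],
    "Ben": ["Ana", "Cy"],
    "Cy": ["Ben", "Dee"],
    "Dee": ["Cy"],
    "Eve": [],
}

path = collaboration_path(network, "Ana", "Dee")
print(path)           # prints ['Ana', 'Ben', 'Cy', 'Dee']
print(len(path) - 1)  # prints 3, the collaboration distance
print(collaboration_path(network, "Ana", "Eve"))  # prints None
```

The distance is the number of edges on the returned path, and a result of None corresponds to the case where no path exists between the two people.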
Any network can be used to analyze the ‘small world’ phenomenon, as noted in several earlier papers. Some similar problems are discussed below for reference.

A famous problem similar to the one in this thesis is the Bacon number (as in the game Six Degrees of Kevin Bacon), which connects actors who have appeared in a film together. The Bacon number of an actor or actress is the number of degrees of separation he or she has from Bacon, as defined by the game. This is an application of the Erdős number concept to the Hollywood movie industry. The higher the Bacon number, the farther the actor is from Kevin Bacon.

Another application of this study is finding the Erdős–Bacon number, which is the sum of one's Erdős number and one's Bacon number. The lower the number, the closer a person is to both Erdős and Bacon, which reflects a small world phenomenon across academia and entertainment. In general, to have a defined Erdős–Bacon number, it is a necessary (but not a sufficient) condition to have both appeared in a film and co-authored an academic paper.

One can also consider the temporal and spatial metadata of the publication records and analyze the evolution of the database over a geographical domain. This may help researchers reveal trends or patterns in the computer science community. Miloš Savić et al. [10] presented a study on the structure and evolution of scientific collaboration in Serbian mathematical journals. They used various metrics, such as degree centrality, to analyze the evolution of scientific collaboration over time. Mike Lieberman [13] has performed social network analysis on networks with big data, such as Facebook, Twitter, Google, and Craigslist, and analyzed how that data can serve various use cases. This is a good example of social network analysis applied to big data.

7. Conclusion

The analysis of the collaboration graph from the bibliography data of the computer science community reveals many interesting facts.
The graph has roughly 1.5 million authors as its vertices, with an edge between every pair of authors who have a joint publication. It shows the relationships between authors and the growth of the community over the years. The most prolific author in the computer science community is Wei Wang, with 2218 coauthors. This number is much larger than that of Paul Erdős in the mathematics community. Other prolific authors include Wei Zhang with 1584 coauthors and Wei Li with 1473 coauthors.

One observation in this thesis is that the Erdős number of many authors in the collaboration graph of the computer science community matches their Erdős number in the collaboration graph of the mathematics community. The six degrees of separation phenomenon can be seen to hold in this dataset even though Paul Erdős is not a prolific author in computer science journals. His primary area was combinatorics, and most of his publications appeared in mathematics journals. The analysis of the DBLP data also reveals that the shortest path between any two authors is not more than six, except in cases where there is no path.

References

[1] DBLP. http://dblp.uni-trier.de. Accessed on November 12, 2015.

[2] Microsoft Academic Search. http://academic.research.microsoft.com/VisualExplorer. Accessed on October 2, 2015.

[3] MariaDB Tutorial. http://www.techonthenet.com/mariadb/. Accessed on October 20, 2015.

[4] MathSciNet. http://www.ams.org/mathscinet/. Accessed on August 20, 2015.

[5] Erdős Number Project. http://wwwp.oakland.edu/enp/. Accessed on October 1, 2015.

[6] CONNECT Storage Engine in MariaDB. https://mariadb.com/kb/en/mariadb/connect/. Accessed on June 11, 2015.

[7] Mathematical Reviews. http://www.ams.org/mr-database. Accessed on November 15, 2015.

[8] DBLP XML file location. http://dblp.uni-trier.de/xml/. Accessed on July 16, 2015.

[9] Ergin Elmacioglu, Dongwon Lee. On Six Degrees of Separation in DBLP-DB and More.
In Proceedings of the ACM SIGMOD Record Conference (New York, NY, USA) June 2005. Vol. 34, No. 2, pp. 33-40.

[10] Miloš Savić, Mirjana Ivanović, Miloš Radovanović, Zoran Ognjanović, Aleksandar Pejović, Tatjana Jakšić Krüger. The structure and evolution of scientific collaboration in Serbian mathematical journals. Scientometrics, December 2014, Vol. 101, No. 3, pp. 1805-1830.

[11] Michael Ley. DBLP: some lessons learned. In Proceedings of the VLDB Endowment. Vol. 2, No. 2, August 2009, pp. 1493-1500.

[12] V. Yegnanarayanan, G. K. Umamaheshwari. A Note on the Importance of Collaboration Graphs. International Journal of Mathematical Sciences and Applications September 2011. Vol. 1, No. 3, pp. 1113-1121.

[13] Mike Lieberman. Visualizing Big Data: Social Network Analysis. In Proceedings of Digital Research Conference (San Antonio, Texas) March 2014.
</div> </article> </div> </div> </div>
</html>