CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
Determining the Erdős Number in DBLP-DB and More
A thesis submitted in partial fulfillment of the requirements
For the degree of Master of Science in Computer Science
By
Bageshree Parmar
December 2015
The thesis of Bageshree Parmar is approved:
______
Dr. Adam Kaplan Date
______
Dr. Robert D McIlhenny Date
______
Dr. John Noga, Chair Date
California State University, Northridge
Table of Contents
Signature
List of Figures
Abstract
1. Introduction
2. Related Work
3. Application Overview
3.1. MariaDB
3.2. Servlets/JSP
4. Design of the Application
4.1. Database Design
4.2. Web Application Design
4.3. Shortest Path Algorithm
4.4. Application Features
4.5. Lessons Learned
5. Results
5.1. Screenshots of Web Application
5.2. Facts From Database
5.3. Performance of the Application
6. Future Work
6.1. Database Setup
6.2. Better Author Identification
6.3. Add Admin Privileges
6.4. Find Multiple Shortest Paths
6.5. Application to other problems
7. Conclusion
References
List of Figures
Figure 4.1. Excerpt from DBLP XML file
Figure 4.2. Database schema
Figure 4.3. CREATE TABLE sample query using CONNECT
Figure 4.4. Code snippet in Java used for creating coauthors table
Figure 5.1. Screenshot of application homepage
Figure 5.2. Screenshot of collaboration distance tool
Figure 5.3. Screenshot of Erdős number tool
Figure 5.4. Screenshot of advanced search tool
Figure 5.5. Co-Publications button showing the publication list between the authors
Figure 5.6. Comparison of Erdős numbers in DBLP and MR dataset
Abstract
Determining the Erdős Number in DBLP-DB and More
By
Bageshree Parmar
Master of Science in Computer Science
This thesis presents a study of the collaboration graph of the bibliography data of the
computer science community. Such a graph shows how the authors in the community
relate to each other based upon co-authorship of academic papers. This study is supported
by a Java-based web application. The application features tools to calculate the
collaboration distance and Erdős number in the collaboration graph. The back-end of the
web application uses a MariaDB database created from the DBLP XML file consisting of
the computer science publication records. The analysis from this graph reveals many
interesting characteristics about the relationship between the authors in this community.
The study has identified the most prolific authors in the computer science community. It
also shows how the Erdős number of an author in the collaboration graph of the computer
science community matches with the Erdős number in the collaboration graph of the
mathematics community.
1. Introduction
Social network analysis has gained increasing popularity in recent years. The origin
of social network analysis can be traced back to the “six degrees of separation”
phenomenon, which is based on an experiment suggesting that any two people in the U.S.
are connected through about six intermediaries. Since then, a lot of research has been
done to analyze this claim, and many studies have found support for it in various social
networks. The strategy in social network analysis is to investigate social structures
using network and graph theories. Networks play a significant role in computer science,
and researchers in academia study various networks such as citation networks and
collaboration networks. Underlying any network is a graph, and this thesis focuses on the
collaboration graph. The structure and evolution of collaboration among researchers can
be investigated using such a collaboration graph, also called a co-authorship network.
A collaboration graph is a graph modeling a social network where the vertices
represent people in that network and where two distinct people are joined by an edge
whenever there is a collaborative relationship between them of a particular kind [7]. It is
a simple graph with no loop edges and no multiple edges. The collaboration graph need
not be connected. An isolated vertex will represent a person who has never co-authored a
paper. Paul Erdős was an extremely prolific and significant figure in the field of
mathematics. He had more than 500 co-authors and published around 1500 technical
papers according to data from Oakland University [5]. The Erdős number of a
researcher is defined as the collaboration distance between Paul Erdős and that
researcher in the collaboration graph. V. Yegnanarayanan et al. [13] say that the
collaboration graph of mathematicians has a "small world topology", i.e., it has a large
number of vertices, most of small degree, that are highly clustered, and a giant
connected component with short average distances between vertices.
Collaboration graphs are used to measure the closeness of relationships between the
participants of the network. One useful and widespread use of such graphs is to find the
collaboration distance between participants. This is defined as the length of the shortest
path between two distinct nodes. If no such path exists, then the collaboration distance
between them is said to be infinite. Collaboration distance can be used to evaluate a
group of authors or journals/conferences, or to assess citations of an author.
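The definition above can be illustrated with a minimal breadth-first search sketch in Java. This is only an illustration under stated assumptions, not the implementation used in this thesis: the class name, the method names, and the five sample authors are hypothetical, and an infinite collaboration distance is represented by -1.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch only: class, method names, and sample data are hypothetical.
final class CollaborationDistanceSketch {

    // Adds an undirected co-authorship edge between two authors.
    static void addEdge(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Returns the number of edges on a shortest path between the two authors,
    // or -1 when no path exists (the "infinite" collaboration distance).
    static int distance(Map<String, Set<String>> graph, String from, String to) {
        if (from.equals(to)) {
            return 0;
        }
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : graph.getOrDefault(current, Set.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(current) + 1);
                    if (next.equals(to)) {
                        return dist.get(next);
                    }
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    // Hypothetical sample graph: a chain A - B - C - D and an isolated author E.
    static Map<String, Set<String>> demoGraph() {
        Map<String, Set<String>> g = new HashMap<>();
        addEdge(g, "A", "B");
        addEdge(g, "B", "C");
        addEdge(g, "C", "D");
        g.computeIfAbsent("E", k -> new HashSet<>()); // never co-authored a paper
        return g;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = demoGraph();
        System.out.println("A to D: " + distance(g, "A", "D"));
        System.out.println("A to E: " + distance(g, "A", "E"));
    }
}
```

If one of the two authors is fixed to Paul Erdős, the same computation yields the Erdős number of the other author.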
This thesis focuses on the Erdős number and the concept of collaboration distance in
the collaboration graph constructed from the bibliography data of the computer science
community that is available at the DBLP website [9]. DBLP is an on-line reference for
bibliographic information on major computer science publications, and it provides free
access to the data for research purposes. It was started by the computer science
department of the University of Trier and is currently maintained by both the University
of Trier and Schloss Dagstuhl - Leibniz Center for Informatics, Germany. It currently
indexes over 2.6 million publications, published by more than 1.4 million authors. It
also indexes about 25,000 journal volumes, more than 24,000 conferences and workshops,
and more than 17,000 monographs.
In the Erdős collaboration graph, two authors are joined by an edge if they have
co-authored a paper. The Erdős number of an author is the collaboration distance
from that author to Paul Erdős in the graph. The existing online tools for
calculating Erdős numbers are discussed in the next chapter.
The nature of the data available is useful in investigating the collaboration
characteristics of authors in the computer science community. The bibliographic records
also cover a wide time range that can be used to study the evolution of the network over
time. The motivation behind this project was to analyze and review the DBLP network to
find interesting facts or patterns. Another goal was to create a tool to find the
collaboration distance between authors in the computer science community. Microsoft
offers a tool to find the collaboration distance between academic authors, but its data
source is vague, the scope is not limited to any particular field, many spurious results
are presented (e.g. two authors who share a first initial and last name are usually
identified as the same person), and many computer science publications are not included.
This report has seven chapters. Chapter 1 provides a general overview of the thesis and
the motivation behind implementing it. Chapter 2 surveys similar applications and tools
that are available. Chapter 3 explores the technologies and software used in the
development of the application. Chapter 4 describes the detailed application development
process, reasons for modifications to some of the features, and lessons learned. Chapter 5
presents the results of the study and details of the web application. Chapter 6 discusses
further enhancements to the application. Chapter 7 concludes the thesis.
2. Related Work
This chapter presents existing applications that, like the one developed in this thesis,
calculate the collaboration distance.
Microsoft Academic Search [2] is a web-based search implementation and an
experimental research service concentrating exclusively on scholarly material. The
service indexes millions of academic publications and displays the key relationships
between and among subjects, content, and authors, highlighting the critical links that help
define scientific research. Two features in this web application are similar to features
developed in this thesis. One of them is the Co-author Graph feature that shows a visual
graph of all the co-authors of a particular author. The nodes in the graph represent the
authors, and the edges show the number of publications common between two authors.
One can refine the search by clicking on the nodes and edges to get more information
about the authors and co-authored publications respectively. Another feature is the
Co-author Path feature that provides a visual display of the shortest path between two
authors. This tool shows the relationship and the degree of separation between the
coauthors. The nodes and edges in the graph show the author and the common publication
information as in the previous tool. However, Microsoft Academic Search does not provide
explicit information about its data source; only some details of the sources of the data
are mentioned on the website. The tool has information from various disciplines, but
appears incomplete for any particular area. Hence, this tool is not focused on a
particular field or community but acts more like an (incomplete) search engine for
academics with access to some tools that perform functions similar to the ones
accomplished in this thesis.
Another similar application, MathSciNet [4], is an electronic publication offering
access to a database of reviews, abstracts, and bibliographic information for the
literature of the mathematical sciences. MathSciNet contains almost 3 million items and over
1.7 million direct links to original articles. This web of citations allows users to track the
history and influence of research publications in the mathematical sciences. The
MathSciNet database provides an excellent example of a relatively large collaboration
graph. The website hosts a free online tool called Collaboration Distance, which finds the
shortest publications-path between two mathematicians as well as the Erdős number of a
mathematician. This tool also shows the actual chain of co-authors that realizes the
collaboration distance. The user has to enter two author names in the input fields
provided, and the search returns the shortest path between those two authors in the
collaboration graph. For each edge, it shows the paper co-authored by the two authors.
Hence, this tool is also similar in functionality to the one developed in this thesis but is
only focused on the mathematics community.
3. Application Overview
This chapter presents the architectural overview of the web application developed to
find the Erdős number and the collaboration distance between two authors. The
application is implemented using JSP/Servlets and supported by a MariaDB database as
the back-end. An overview of the technologies used for the development of the web
application is as follows.
3.1 MariaDB
MariaDB is an open source relational database management system and a drop-in
replacement for MySQL. An important reason for selecting this database management
system was the power of the CONNECT storage engine in MariaDB 10.0. This storage
engine enables MariaDB to access external local or remote data as if they were standard
tables on the server. This is done by defining tables based on different table types, in
particular, files in various formats, data extracted from other DBMS or products via
ODBC, or data retrieved from the environment [6].
XML is used to encode and store data having any structure. The tag hierarchy in an
XML file describes a tree structure of the data. Modern database management systems,
including MariaDB, implement something close to the relational model and work on
tables that are structurally not hierarchical but tabular, with rows and columns.
Nevertheless, the CONNECT engine can help to work with such hierarchical data. The user
can specify what data to extract from the XML structure.
The reasons for selecting the CONNECT engine for this thesis are that it 1) supports
tables represented by XML files, 2) supports large table sizes, and 3) handles multiple
nodes in an XML file very easily. The DBLP dataset is a huge XML file of
approximately 1.7 GB (at the time of download on 6/23/2015), and the entire data was
imported from this XML file into MariaDB. The ability to handle multiple nodes in
the XML document is important because the author node can be “multiple”,
meaning that there can be more than one author of a publication. This information needs
to be imported correctly into the relational model. Most of the tools available online
could not handle multiple nodes in an XML file, and importing the data into MariaDB failed.
CONNECT provides two ways to handle multiple nodes. The first one is to return as many
rows as there are authors, repeating the other columns as if a join had been made between
the author column and the rest of the table. To achieve this, the “multiple” node name and
the “expand” option have to be specified when creating the table. They are defined as
sub-options of the “option_list” in the create table query. The “limit” sub-option, if
defined, specifies the maximum number of values that will be expanded. If not specified, it
defaults to 10. Any values above the limit will be ignored, and a warning message will be
issued. The second way to see multiple values is to set the “expand” option mentioned
above to false, which will return a comma-separated list of the multiple node values in a
single row.
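The difference between the two options can be made concrete with a small plain-Java simulation. It does not call MariaDB or the CONNECT engine; it only mimics the row shapes described above for one publication record. The class name, method names, and sample values are hypothetical, and applying the limit only in the expanded case is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the two behaviors described in the text.
// It does not use CONNECT; all names and sample data are hypothetical.
final class ExpandOptionSketch {

    // "expand" enabled: one {author, title} row per author, with the title repeated.
    // Authors beyond the limit are dropped, as the text describes for the "limit" sub-option.
    static List<String[]> expanded(String title, List<String> authors, int limit) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < authors.size() && i < limit; i++) {
            rows.add(new String[] {authors.get(i), title});
        }
        return rows;
    }

    // "expand" disabled: a single row whose author column is a comma-separated list.
    static String[] collapsed(String title, List<String> authors) {
        return new String[] {String.join(", ", authors), title};
    }

    public static void main(String[] args) {
        List<String> authors = List.of("Author One", "Author Two", "Author Three");
        for (String[] row : expanded("Sample Title", authors, 10)) {
            System.out.println(row[0] + " | " + row[1]);
        }
        String[] single = collapsed("Sample Title", authors);
        System.out.println(single[0] + " | " + single[1]);
    }
}
```

The expanded form is the one that maps directly onto a relational author-to-paper relationship, which is why it matters for building a collaboration graph.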
3.2 Servlets/JSP
A Java servlet is a program that runs within a web server. Servlets receive and respond
to requests from web clients across HTTP. They are the Java counterpart to
PHP and ASP.NET. JSP is a server-side programming technology that gives software
developers a dynamic and platform-independent method for building
Web-based applications. JSP has access to Java APIs, including the JDBC API to access
enterprise databases, which is also used for this thesis and discussed in a later section.
It is a technology for developing web pages that support dynamic content, which lets
developers insert Java code between the tags <% and %> in HTML pages. It is a type of
servlet designed to fulfill the role of a user interface for a Java web application. JSP
can be used to collect user input from HTML forms, present results from a database or
another source, and create web pages dynamically.
4. Design of the Application
This chapter presents the design of the web application, which includes details about
the back-end database schema, user interface, and the features implemented in the
application.
4.1 Database Design
The collaboration graph of the DBLP database is the graph that represents authors of
computer science publications as vertices, with edges connecting the authors who have
been co-authors on at least one paper. DBLP is a high-quality digital bibliography that
has nearly complete coverage of the computer science community and hence was chosen as
the dataset for this thesis. The database for this study was designed primarily with a
web application in mind. Hence, it was designed as the back-end of the application and
optimized for the needs of the web application.
Figure 4.1. Excerpt from DBLP XML file
The DBLP XML file format follows the BibTeX file format and contains millions of
records. The file structure is defined in the DTD (document type definition) file named
dblp.dtd accompanying the XML file. An excerpt from the XML file is shown in Figure
4.1. The XML root element is <dblp>.
The DTD lists several elements to be used as a bibliographic record:
<!ELEMENT dblp (article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www)*>
These tags correspond to the entry types used in BibTeX. The publication records used in
the database are given by one of the following elements in the DBLP XML file:
• article – An article from a journal or magazine.
• inproceedings – An article in a conference proceeding.
• incollection – A part of a book having its own title.
The following elements of the DBLP XML file are used in this thesis:
• author – The order of the author elements inside a record is the same as in the
heading of the paper. DBLP specifies an author name as a full name without commas. If
name parts are abbreviated, each initial is followed by a dot. For example, it would use
H. P. Smith rather than H P Smith or HP Smith. A dot is followed by a blank or a hyphen.
DBLP distinguishes authors with the same name by appending a space character and a
four-digit number to the names.
• title – The title of the record. It may contain sub elements for subscripts, sup
elements for superscripts, i elements for italics, and tt elements for typewriter text
style. These elements may be nested.
• pages – The page range of the record. If the last page is unknown, then it takes the
form “from-”. A single-page paper is given as the page number without a hyphen. For
articles split across a magazine, it has a comma-separated list of page numbers or page
ranges. In rare cases, the pages element may contain any character sequence.
• year – The year of publication according to the Gregorian calendar. For journal
articles, it is the date of publication of the issue. The year field of conference
papers specifies the date when the conference took place; the year field of the
enclosing proceedings specifies the publication date.
provided for many journals.
The database schema designed for the application is shown in Figure 4.2. The tables
named article, inproceedings, and incollection represent the corresponding elements of
the XML file and were created using the CONNECT engine by importing data from the XML
file. The publication types are initially kept in separate tables because importing
data from the XML file requires specifying the element that will be identified as a row
in a table. How the create table query works while importing from the XML file is
explained in detail later in this section.
Figure 4.2. Database schema
Figure 4.3. Create Table sample query using CONNECT engine
Figure 4.3 shows a sample SQL create table query using CONNECT in MariaDB. By
default, the column names specified in the query correspond to the tag names in the XML
file. XML is case-sensitive, and hence the column names should be written exactly as the
tags appear in the file, although there is an option to use column names that differ from
the tags in the XML file. The order of the columns in the table can also be different
from the order in which the nodes appear in the XML file.
The “tabname” option defines the root element of the XML file. Rows in the tables are
identified by declaring the “rownode” option as article, incollection, or inproceedings
in the create table query. Because a record in the XML file can have more than one
author, the “mulnode” option is specified as author. The “expand” option is defined as 1,
which returns as many rows as the number of authors for a publication record, with the
other columns repeated for each of the authors. The “limit” option is defined as 50
because there were a few papers with more than 40 authors (10 is the default value if
this option is not specified, as discussed earlier in Section 3.1).
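The options described above can be assembled into a statement along the following lines. This is a hedged sketch rather than the query shown in Figure 4.3: the column list, the column sizes, and the file name dblp.xml are assumptions made for illustration, while ENGINE=CONNECT, TABLE_TYPE=XML, FILE_NAME, TABNAME, and OPTION_LIST with the rownode, mulnode, expand, and limit sub-options are the CONNECT options named in the text.

```java
// Illustrative sketch only: builds a CREATE TABLE statement for the CONNECT engine.
// The column names, column sizes, and the file name 'dblp.xml' are assumptions.
final class ConnectTableSketch {

    // rowNode is the XML element that becomes one table row (e.g. "article");
    // limit is the maximum number of author values expanded per record.
    static String createTableSql(String rowNode, int limit) {
        return "CREATE TABLE " + rowNode + " (\n"
             + "  author CHAR(100),\n"
             + "  title CHAR(255),\n"
             + "  year INT\n"
             + ") ENGINE=CONNECT TABLE_TYPE=XML FILE_NAME='dblp.xml' TABNAME='dblp'\n"
             + "  OPTION_LIST='rownode=" + rowNode
             + ",mulnode=author,expand=1,limit=" + limit + "'";
    }

    public static void main(String[] args) {
        // Prints one statement per publication type used in this thesis.
        for (String rowNode : new String[] {"article", "inproceedings", "incollection"}) {
            System.out.println(createTableSql(rowNode, 50) + ";\n");
        }
    }
}
```

Generating the statement from the row node name reflects the point made earlier in this section: each publication type needs its own table because the row element must be fixed per table.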
Once the desired data was imported into MariaDB, new tables were created for use in
SQL join operations or in the processing of the algorithm. The art_inp_inc table combines
the data of all publication types in one table using a UNION ALL operation on the three
original tables, i.e., article, inproceedings, and incollection. The author table
contains distinct author names along with an auto-generated authorid. Similarly, the
paper table contains distinct paper titles with an auto-generated paperid and the paper
type information (i.e. whether the paper is a conference paper, a journal article, or a
book chapter). The coauthors table contains a pre-computed list of the coauthors of each
author. This table was originally created so that the algorithm could fetch the coauthors
of an author directly and minimize the computation time. After the advanced search
feature was implemented, the table could no longer be used in the algorithm, and it is
now used only for analysis of the database. A Java code snippet for the creation of the
coauthors table is shown in Figure 4.4.
Figure 4.4. Code snippet in Java used for creating coauthors table
The author_paper table was created by joining the tables author, paper, and art_inp_inc.
This table contains the author id, paper id, and the paper type information, which is used
later in the execution of the algorithm. The algorithm makes a request to the database on
the fly to find the coauthors of the input author names in the application. The coauthor
result depends on the application feature firing the query. For the advanced search
feature, the coauthor list will only include the coauthor data from the paper type selected
in the application.
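The effect of the paper type filter can be sketched in memory as follows. This is an illustration under stated assumptions, not the query used by the application: the class, the Row type, and the sample rows are hypothetical, and the real lookup runs as SQL against the author_paper table.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// In-memory illustration of a coauthor lookup over author_paper rows.
// All names and sample data are hypothetical; the application uses SQL for this.
final class CoauthorLookupSketch {

    // One row of the author_paper table: author id, paper id, and paper type.
    static final class Row {
        final int authorId;
        final int paperId;
        final String paperType;

        Row(int authorId, int paperId, String paperType) {
            this.authorId = authorId;
            this.paperId = paperId;
            this.paperType = paperType;
        }
    }

    // Returns the ids of all authors sharing at least one paper of a selected type
    // with the given author.
    static Set<Integer> coauthors(List<Row> rows, int authorId, Set<String> selectedTypes) {
        Set<Integer> papers = new HashSet<>();
        for (Row r : rows) {
            if (r.authorId == authorId && selectedTypes.contains(r.paperType)) {
                papers.add(r.paperId);
            }
        }
        Set<Integer> result = new TreeSet<>();
        for (Row r : rows) {
            if (r.authorId != authorId && papers.contains(r.paperId)) {
                result.add(r.authorId);
            }
        }
        return result;
    }

    // Hypothetical sample: author 1 shares an article with author 2 and a
    // conference paper with author 3; author 4 wrote a paper alone.
    static List<Row> demoRows() {
        return List.of(
            new Row(1, 10, "article"),
            new Row(2, 10, "article"),
            new Row(1, 11, "inproceedings"),
            new Row(3, 11, "inproceedings"),
            new Row(4, 12, "article"));
    }

    public static void main(String[] args) {
        System.out.println(coauthors(demoRows(), 1, Set.of("article")));
        System.out.println(coauthors(demoRows(), 1, Set.of("article", "inproceedings")));
    }
}
```

Because the result depends on the selected types, a single pre-computed coauthors table cannot serve every search, which is consistent with the decision to query the database on the fly.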
Another necessary task in creating the tables is adding indexes. Indexes help improve the
speed of operations on a table, especially when performing SELECT queries.
An index can be created using one or more columns, providing the basis for both rapid
random lookups and efficient ordering of access to records. Indexes are created for those
columns in a table that will be used in SQL queries. For example, the art_inp_inc table
has an index on the column title because it is used in the join operation with the tables
author and paper to get the author_paper table. Likewise, the author_paper table has an
index on the columns authorid, paperid, and papertype because the algorithm uses this
table to find the coauthor list.
4.2 Web Application Design
This thesis is supported by a Java-based web application to find the Erdős number of
the computer science authors in DBLP. Java was chosen as the programming language for
implementing this application in order to gain experience with designing and creating
Java-based web applications. The application provides tools to find the collaboration
distance and the Erdős number, advanced search options, and an email service to receive
suggestions from users of the application. The front end of the application uses
JSP/Servlets, and the back-end is a MariaDB database. JDBC is used for communication
between the application and the database. JSON is used for formatting the data before
displaying the results in HTML. jQuery and JavaScript are also used to support server-side
and client-side validation and to implement certain functionality.
4.3 Shortest Path Algorithm
A shortest path algorithm based on the concept of bi-directional search is designed to
find the collaboration distance between two authors in the coauthor graph. This graph
search algorithm runs two simultaneous searches: one forward from the initial state and
one backward from the goal, stopping when the two meet in the middle. The reason for
choosing this approach is that the dataset used in this application is huge, and
performing a bi-directional search improves the performance of the algorithm drastically.
It is faster than the naive breadth-first search. The number of authors in the database
is over 1.5 million, and many authors have more than 50 co-authors, including some with
more than 500 co-authors. For instance, in a problem where both searches expand
a tree with branching factor b and the distance from start to goal is d, each of the two
searches has complexity O(b^(d/2)), and the sum of these two search times is much less
than the O(b^d) complexity that would result from a single search from the beginning to
the goal.
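As a rough worked example of this comparison, assume purely for illustration a branching factor of b = 50 (matching the "more than 50 co-authors" mentioned above) and a distance of d = 6. These values are assumptions, not measurements from the database.

```latex
\[
b = 50,\quad d = 6:\qquad
b^{d} = 50^{6} = 15{,}625{,}000{,}000,
\qquad
2\,b^{d/2} = 2 \cdot 50^{3} = 250{,}000 .
\]
```

Under these assumptions, the two half-depth searches together visit on the order of 62,500 times fewer vertices than a single full-depth search.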
The algorithm, implemented in Java, computes the shortest path between two authors
in the coauthor graph. The implementation is structured in four classes, three servlets,
and three JSP files. The Author class defines the Author object and helps set or retrieve
the author metadata for the algorithm. The algorithm queries coauthor data for the Author
object received from the input in the application. The getCoauthors method defined in the
SQLUtility class returns a list of coauthors to the algorithm. Each Author object has a
label of type Integer that defines the position of that Author object during the
execution of the algorithm.
The algorithm requires two authors, and it constructs the coauthor path object in
the FindCoauthorPath class. The class Author has a createAuthor method to produce new
Author objects. The createAuthor method first tests whether there is already an Author
object with the specified author id in AuthorMap. A new object is only created if the
test fails. After creation, an Author object only contains the author’s id. The algorithm
loads the list of coauthors only when required. The boolean field coauthorsLoaded
contains the required state information. The getCoauthors method loads the coauthors
list based on the paper type(s) selected in the advanced search option; the default is to
consider all the paper types when fetching the coauthors data from the database.
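The object reuse and lazy loading described above can be sketched as follows. This is an assumption-laden illustration, not the thesis code: the class is named AuthorSketch to avoid suggesting it is the real Author class, the loader parameter stands in for the database call, and only the names createAuthor, AuthorMap (here AUTHOR_MAP), coauthorsLoaded, and getCoauthors come from the text.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

// Illustrative sketch of a cached factory with lazily loaded coauthors.
// The loader argument is a hypothetical stand-in for the database query.
final class AuthorSketch {

    // Plays the role of AuthorMap: at most one object per author id.
    private static final Map<Integer, AuthorSketch> AUTHOR_MAP = new HashMap<>();

    final int id;
    private boolean coauthorsLoaded = false;
    private List<Integer> coauthorIds = List.of();

    private AuthorSketch(int id) {
        this.id = id;
    }

    // Returns the existing object for this id, creating one only if none exists yet.
    static AuthorSketch createAuthor(int id) {
        return AUTHOR_MAP.computeIfAbsent(id, AuthorSketch::new);
    }

    // Loads the coauthor list on first use only; later calls reuse the stored list.
    List<Integer> getCoauthors(IntFunction<List<Integer>> loader) {
        if (!coauthorsLoaded) {
            coauthorIds = loader.apply(id);
            coauthorsLoaded = true;
        }
        return coauthorIds;
    }

    public static void main(String[] args) {
        AuthorSketch first = createAuthor(7);
        AuthorSketch second = createAuthor(7);
        System.out.println("same object: " + (first == second));
        System.out.println(first.getCoauthors(authorId -> List.of(1, 2, 3)));
    }
}
```

Deferring the load matters in a search because many authors are created as neighbors of a visited author but are never expanded themselves.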
The algorithm logic is implemented in the CollaborationDistance class. As discussed
above, the algorithm uses a bi-directional breadth-first search approach to optimize its
performance. Hence, the search starts from both authors, and the algorithm prefers to
advance the side with the lower number of authors to be visited next. The method
coauthorPath in the CollaborationDistance class implements this idea. The algorithm
labels the two input authors, author1 and author2, with 1 and -1 respectively. The
coauthors of author1 are labeled 2, and the coauthors of author2 are labeled -2. The
position1 and position2 variables contain the sets of authors who form the outer level of
the labeling process on each side. The main loop chooses the side to advance next
depending on the size of these sets. When a level is expanded, the unvisited authors are
stored in the set named next. Whenever the two sides meet, the authors have a coauthor
path between them, and that path is recovered by tracing it in reverse order from the
meeting point. If the two sides never meet, the authors are not connected, and no path
exists between them.
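A minimal sketch of such a bi-directional breadth-first search is given below. It is not the CollaborationDistance class itself: it keeps the graph in memory instead of querying the database, records depths in two maps instead of the signed labels described above, and returns only the length of the path rather than the path. The class name and sample graph are hypothetical; position1, position2, and next follow the variable names in the text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative bi-directional BFS over an in-memory coauthor graph.
// Depth maps replace the signed labels; names and sample data are hypothetical.
final class BidirectionalSearchSketch {

    // Returns the collaboration distance between the two authors, or -1 if none exists.
    static int distance(Map<Integer, Set<Integer>> graph, int author1, int author2) {
        if (author1 == author2) {
            return 0;
        }
        Map<Integer, Integer> depth1 = new HashMap<>();
        Map<Integer, Integer> depth2 = new HashMap<>();
        Set<Integer> position1 = new HashSet<>();
        Set<Integer> position2 = new HashSet<>();
        depth1.put(author1, 0);
        position1.add(author1);
        depth2.put(author2, 0);
        position2.add(author2);

        while (!position1.isEmpty() && !position2.isEmpty()) {
            // Advance the side whose outer level currently holds fewer authors.
            boolean advanceSide1 = position1.size() <= position2.size();
            Set<Integer> frontier = advanceSide1 ? position1 : position2;
            Map<Integer, Integer> mine = advanceSide1 ? depth1 : depth2;
            Map<Integer, Integer> other = advanceSide1 ? depth2 : depth1;

            Set<Integer> next = new HashSet<>();
            for (int current : frontier) {
                for (int neighbor : graph.getOrDefault(current, Set.of())) {
                    if (other.containsKey(neighbor)) {
                        // The two searches meet: combine the depths of both sides.
                        return mine.get(current) + 1 + other.get(neighbor);
                    }
                    if (!mine.containsKey(neighbor)) {
                        mine.put(neighbor, mine.get(current) + 1);
                        next.add(neighbor);
                    }
                }
            }
            if (advanceSide1) {
                position1 = next;
            } else {
                position2 = next;
            }
        }
        return -1; // one side ran out of authors: no path exists
    }

    // Hypothetical sample: a chain 1 - 2 - 3 - 4 - 5 and an isolated author 6.
    static Map<Integer, Set<Integer>> demoGraph() {
        Map<Integer, Set<Integer>> g = new HashMap<>();
        for (int i = 1; i < 5; i++) {
            g.computeIfAbsent(i, k -> new HashSet<>()).add(i + 1);
            g.computeIfAbsent(i + 1, k -> new HashSet<>()).add(i);
        }
        g.put(6, new HashSet<>());
        return g;
    }

    public static void main(String[] args) {
        System.out.println(distance(demoGraph(), 1, 5));
        System.out.println(distance(demoGraph(), 1, 6));
    }
}
```

Because each side is expanded one complete level at a time, the first meeting already gives a shortest path length; recovering the actual path would additionally require storing a predecessor for each visited author.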
4.4 Application Features
Erdős number: - This tool is used to find the Erdős number of any author in the database,
i.e. the collaboration distance between that author and Paul Erdős. While using this
tool, one of the authors is set to Paul Erdős and the user only needs
to enter the other author. The purpose of designing this tool is to study any
similarities between the Erdős number found in MathSciNet and the one calculated in
this application. It would also be interesting to identify authors with computer science
publications who have the same Erdős number in both but a different set of publications
on the path. This may happen because MathSciNet consists of mathematical publications,
while this application uses the database of computer science publications from DBLP.
Collaboration distance: - This tool finds the shortest path between any two authors in
the database used in this application. It also displays the common publications
co-authored by each pair of adjacent authors in the path. This tool helped in analyzing
the six degrees of separation and small world phenomena discussed earlier in this report.
Advanced search option: - This feature provides the capability to select the type of
publication(s) to consider while finding the collaboration distance. The user can select
publication type categories such as conference papers, journal articles, and book
chapters. This tool helped to identify any patterns based on the type of publication.
Contact us: - This tool provides a user with the capability to send suggestions via
email. Since this application is designed for academic research purposes, it is important
to get feedback and comments from users of the application. This tool will also help to
maintain the database and correct any errors identified by the authors or users. This is
important because the database is large, and although DBLP has taken a lot of care to
provide information that is as correct as possible, there are scenarios where errors are
introduced in the database and need to be fixed. For example, an author may find that
his paper is listed under another author’s name (whose name is the same as the author
who identified it). In this case, the suggestion to update the database will help fix
errors.
4.5 Lessons Learned
This section describes the problems faced while designing the application.
Database setup issue: There were issues while importing the data from the XML file into
MariaDB because of the character set used in the DBLP XML file. The XML file is
encoded in plain ASCII. Additional ISO/IEC 8859-1 (latin-1) characters are defined as
named entities in the DTD accompanying the XML file and used whenever necessary.
Most parts of DBLP are restricted to ISO-8859-1 (latin-1) characters, i.e. the first 256
Unicode characters. However, while some elements such as ‹author› contain only latin-1
characters, other elements may contain characters outside of this range. For example,
‹title› elements may include Greek letters like ε. All characters above the first 256
Unicode characters are given as numerical entities. The file thus contains foreign
language characters, and its declared encoding is Latin1. The DTD file contains the
definitions needed to resolve the named entities.
Though MariaDB has Latin1 as its default character set, it had problems importing
the XML file that has the same encoding. The problem was in resolving the entities
defined in the DTD file. After numerous failed attempts, the named character entities
used in the XML file were replaced with their corresponding numeric entities as defined
in the DTD file. A simple Perl command accomplished this. The command replaced each match
of a regular expression with a replacement string. This method worked, and the data
was then successfully imported from the XML file into the MariaDB database.
Sample Perl command used: perl -pi -w -e 's/oldregex/newregex/g;' DBLP.xml
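For comparison, the same kind of substitution can be sketched in Java, the language of the application. This is not the command that was used for the thesis: the class name is hypothetical, and the three-entry mapping table is a small illustrative excerpt rather than the full set of entities defined in dblp.dtd. Names that are not in the table, including the predefined XML entities such as &amp;, are left unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: rewrites named character entities as numeric entities.
// The mapping below is a tiny hypothetical excerpt, not the full DTD entity list.
final class EntityRewriteSketch {

    private static final Map<String, Integer> CODE_POINTS = Map.of(
        "uuml", 252,    // u with diaeresis
        "ouml", 246,    // o with diaeresis
        "eacute", 233); // e with acute accent

    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Replaces every known named entity in the line; unknown names stay as they are.
    static String toNumericEntities(String line) {
        Matcher m = NAMED_ENTITY.matcher(line);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("<author>J&uuml;rgen M&ouml;ller</author>"));
    }
}
```

Numeric entities can be resolved by an XML parser without consulting the DTD, which is why this rewriting sidesteps the entity lookup that caused the import to fail.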
The MariaDB database was created and updated several times during the
implementation of the application. The database is very large, and problems were identified
only while executing queries, since it is impossible to verify every single record for
abnormalities or unexpected exceptions. One such problem arose when a publication record
in the XML file had more than 25 authors. The limit on the number of authors expanded for
each record during import was set to 10 by default, so the remaining authors of the record
were truncated. This issue was discovered while comparing the coauthors table with the
coauthor lists provided on the DBLP website. Therefore, the database was created again,
starting from the initial import step, so that the CREATE TABLE query would read the
correct number of authors from the XML file.
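A check of the following kind could flag such truncation early. The sketch counts the ‹author› elements of each record in a small in-memory XML sample and reports the records that exceed a given limit. The sample data, the function name and the limit used in the call are illustrative assumptions, and the real DBLP file would need a streaming parser because of its size.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample in the style of the DBLP XML file.
SAMPLE = """<dblp>
  <article key="a/1"><author>A</author><author>B</author><title>T1</title></article>
  <article key="a/2"><author>A</author><author>B</author><author>C</author><title>T2</title></article>
</dblp>"""


def records_exceeding(xml_text, limit):
    """Return {record key: author count} for records with more than `limit` authors."""
    root = ET.fromstring(xml_text)
    oversized = {}
    for record in root:
        count = len(record.findall("author"))
        if count > limit:
            oversized[record.get("key")] = count
    return oversized


print(records_exceeding(SAMPLE, limit=2))
```

Running the same count against the imported tables and comparing the two results would show which records lost authors during import.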
Another issue was that duplicate and insufficient values were populated in the
database for the publication title. Initially, only the title data from each
record was stored in the database, excluding other metadata such as the volume and pages.
Then, distinct papers were selected and stored in the paper table to identify each
publication title uniquely. However, it was realized later, when importing the incollection
records, that only the metadata associated with each record could distinguish them. This is
because the incollection records represent book chapters, and many records have the title
Preface but come from different books. This was not known beforehand, so the authors of all
Preface titles were treated as co-authors of each other, which introduced incorrect entries
in the co-authors table. This issue also led to creating the database again, this time
including the metadata of all paper titles to distinguish the Preface items of different
books.
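The effect of the Preface problem can be illustrated with a small sketch. The records, the field names (title, booktitle) and the helper function are made up for illustration and are not the schema used in the thesis.

```python
# Two made-up book chapters that share the title "Preface".
records = [
    {"title": "Preface", "booktitle": "Book A", "authors": ["Alice", "Bob"]},
    {"title": "Preface", "booktitle": "Book B", "authors": ["Carol"]},
]


def group_authors(records, key_fields):
    """Group authors by a paper key built from the given metadata fields."""
    papers = {}
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        papers.setdefault(key, set()).update(rec["authors"])
    return papers


by_title = group_authors(records, ["title"])
by_title_and_book = group_authors(records, ["title", "booktitle"])
print(len(by_title))           # the two prefaces collapse into one paper
print(len(by_title_and_book))  # the two prefaces stay distinct
```

Keying on the title alone makes Alice, Bob and Carol appear as co-authors of one paper, whereas adding the book metadata to the key keeps the two chapters apart.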
Last but not least, a lot of time was spent creating the database because of
the indexes added to the tables. Since the data set is very large, creating indexes was a
necessity to speed up the queries. However, INSERT statements fed by join operations took
longer on tables with indexes, because on every insert or update the
database must insert or update the index entries as well. Hence, making changes to the
database or creating the database again takes a lot of time.
5. Results
This chapter presents the screenshots of the web application and facts about the
collaboration graph and Erdős number as analyzed in the database. The screenshots
present the functionality of the important features implemented in the application. The
facts present interesting results identified as part of the thesis.
5.1 Screenshots of Web Application
The home page of the application, shown in Figure 5.1, gives a brief overview of the
application and the concepts used in implementing it. The important
features of the application are provided to the user as links in the left sidebar.
The Erdős number link opens the page shown in Figure 5.2, where a
user can enter an author name to find that author's Erdős number. Here one of the two
author name fields is already set to Paul Erdős. The collaboration distance link opens a
layout similar to the Erdős number tool, as shown in Figure 5.3. This tool helps a user find
the collaboration distance between the two authors entered. The advanced search link opens
the layout shown in Figure 5.4. Here a user can select publication type categories
using checkboxes. This tool lets a user find the Erdős number or
the collaboration distance based on the publication type(s) selected. In all the tools, a
button named “Co-publications”, shown in Figure 5.5, appears next to each author name in the
results area. Clicking this button opens a pop-up that displays the publications shared by
the author next to whom the button appears and the
author listed above. In the advanced search tool, the Co-publications button
filters the results by the publication types selected.
Figure 5.1. Screenshot of application homepage
Figure 5.2. Screenshot of collaboration distance tool
Figure 5.3. Screenshot of Erdős number tool
Figure 5.4. Screenshot of advanced search tool
Figure 5.5. Co-Publications button showing the publication list between the authors
5.2 Facts Computed in the Database
1) Figure 5.6 compares the number of authors with Erdős number 1 and 2
in the DBLP and the Mathematical Reviews (MR) datasets. These results were computed
using a Java program and the coauthors table. The numbers in the DBLP dataset are much
smaller than those in the MR dataset because DBLP covers computer
science publications, and Paul Erdős has fewer publications in DBLP than in MR.
                 DBLP dataset   MathSciNet
Erdős number 1   165            511
Erdős number 2   3780           11009
Figure 5.6. Comparison of Erdős numbers in DBLP and MR dataset
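Counts of this kind can be obtained with a breadth-first search from Paul Erdős over the coauthor lists. The thesis used a Java program for this; the Python sketch below only illustrates the idea on a made-up toy graph, and the function name and data layout are assumptions.

```python
from collections import deque


def count_by_distance(coauthors, source):
    """Breadth-first search; returns {distance: number of authors at that distance}."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        author = queue.popleft()
        for other in coauthors.get(author, ()):
            if other not in distance:
                distance[other] = distance[author] + 1
                queue.append(other)
    counts = {}
    for d in distance.values():
        counts[d] = counts.get(d, 0) + 1
    return counts


# Made-up toy coauthor graph, stored as adjacency lists.
toy = {
    "Erdős": ["A", "B"],
    "A": ["Erdős", "C"],
    "B": ["Erdős", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}
print(count_by_distance(toy, "Erdős"))
```

In the toy graph, two authors have Erdős number 1 and two have Erdős number 2; the entry for distance 0 is Erdős himself.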
2) Another comparison to Mathematical Reviews: one similarity is observed between
the Erdős numbers found with the database used in this application and those found with the
Mathematical Reviews database. The Erdős number of many authors in the computer science
coauthor graph is the same as in the mathematics coauthor graph. However, the publications
used in computing the path may differ because the Mathematical Reviews database
contains mathematics publications, whereas the database used in this application contains
computer science publications.
3) The number of vertices in the computer science co-author graph is 1594172, which is the
number of authors in the database.
4) The number of edges in the co-author graph is 2983838, which is the number of
publications in the database. Hence, the number of edges in the graph
is almost double the number of vertices.
5.3 Performance of the Application
The application currently performs well when displaying results from the
database to the user. An earlier implementation used the coauthors table, which stored
the precomputed coauthor list of each author. This gave very good performance
because no additional processing was required at the back-end; the only time spent was
in executing the algorithm. The new implementation does not use the coauthors
table, and the list of coauthors required by the algorithm is calculated on the fly while
the application runs. Nevertheless, the application shows no noticeable performance
degradation. A primary reason for this is the use of indexes on the tables.
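An on-the-fly coauthor lookup can be sketched as an indexed self-join. The example below uses Python's built-in sqlite3 module as a stand-in for MariaDB, and the table author_paper, its columns and the index names are hypothetical rather than the schema of the thesis.

```python
import sqlite3

# In-memory SQLite database standing in for the MariaDB back-end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_paper (author TEXT, paper_id INTEGER);
    CREATE INDEX idx_author ON author_paper (author);
    CREATE INDEX idx_paper ON author_paper (paper_id);
    INSERT INTO author_paper VALUES
        ('Alice', 1), ('Bob', 1), ('Alice', 2), ('Carol', 2), ('Dave', 3);
""")


def coauthors_of(conn, author):
    """Compute the coauthor list on the fly with a self-join on paper_id."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.author
        FROM author_paper AS a
        JOIN author_paper AS b ON a.paper_id = b.paper_id
        WHERE a.author = ? AND b.author <> ?
        ORDER BY b.author
        """,
        (author, author),
    )
    return [row[0] for row in rows]


print(coauthors_of(conn, "Alice"))
```

With indexes on both join columns, the database can locate an author's papers and the other authors of those papers without scanning the whole table, which is the effect the paragraph above attributes to indexing.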
6. Future Work
This chapter presents ideas that can be implemented to enhance the application in the
future. Time constraints limited the amount of work done in this thesis. The following
ideas could improve the user experience with the web
application and automate certain tasks to keep the database up to date.
6.1 Automatic Database Setup
The DBLP XML file is very large, with millions of publication records. The data from the
XML file is imported into MariaDB to construct the back-end for the web application. The
size of the database is approximately 1.6 GB, and it will keep increasing as DBLP
keeps adding data. Problems were faced when importing such a large file, specifically
while creating tables with SQL join operations. Most of the SQL queries involving join
operations took a long time to finish and consumed a lot of memory and hard disk space.
This is partly because the tables carry indexes, which are needed to speed up SELECT
queries on such a large database but must be maintained whenever rows are inserted.
At present, the database is already set up for use with the web application. The XML
file was downloaded from the DBLP website on 6/23/2015. DBLP updates the file
regularly, so it is very important to propagate the updates from the DBLP file to
the database used in this application. At present, this requires rebuilding the database
from scratch from the XML file.
Writing automated scripts to create the required database is a very good way to
solve the above problem. The scripts can regularly check the DBLP website for an
update, execute the SQL queries, and create the desired database for the application
automatically. Alternatively, the administrator can start the scripts manually at any time
to create the database as required. This capability is a very important feature to
implement so that the admin can keep the database up to date.
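One way such a script could decide whether a rebuild is needed is to compare a checksum of the downloaded XML file with the checksum recorded at the last import. The sketch below shows only this decision step. The download and the rebuild are omitted, and the function name and the use of SHA-256 are assumptions.

```python
import hashlib


def needs_rebuild(xml_bytes, stored_digest):
    """Return (rebuild?, digest) by comparing the file's SHA-256 with the stored one."""
    digest = hashlib.sha256(xml_bytes).hexdigest()
    return digest != stored_digest, digest


# Made-up file contents standing in for two versions of the DBLP XML file.
old_file = b"<dblp><article/></dblp>"
new_file = b"<dblp><article/><article/></dblp>"

# First run: nothing is stored yet, so remember the digest of the imported file.
_, digest = needs_rebuild(old_file, stored_digest=None)

print(needs_rebuild(old_file, digest)[0])  # unchanged file
print(needs_rebuild(new_file, digest)[0])  # updated file
```

A scheduled job could run this check and start the import scripts only when the result is true.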
6.2 Better Author Identification
The DBLP dataset contains many different authors with the same name. It is very
important to be able to identify the authors based on some metadata. The database
distinguishes different authors with the same name by appending a four-digit number to
the end of the author name. However, a user of the application has no way to tell these
homonymous authors apart while entering a name. For example, when
a user types Chen Li in the author name input field, the application suggests
at least seven different authors with that same name. In this case,
the user cannot tell which entry in the drop-down
list of suggestions is the author they are looking for.
A very good solution to this problem is to store author affiliation information
in the database. When the user types an author name in the input field, the
application can then show the drop-down list of suggested author names together with their
affiliations. This would make it much easier for a user to identify the author for
whom they are searching. This strategy requires affiliation
information for all the authors in the database. However, no central source for this
information has been found, which makes importing it into MariaDB a
difficult manual process. Finding the data in one place is essential because the
thesis database has over 1.5 million authors, so it is not feasible to insert such
information manually from various sources. Another issue with importing
affiliation information is the author name format. If the format
of the author names in the source differs from the one used in DBLP,
the names will not match, and the database may end up with
incorrect values.
A less powerful approach to author identification is to find the publication
year range of each author. This metadata can help the user pick the right author
name from the suggestions in the web application. It would work in the following way: the
user sees each author name together with the years in which that author was active.
This can help in identifying which author to choose from the list.
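The year-range idea can be sketched as follows. The author names follow the four-digit suffix convention described above, but the publication years are made up for illustration, and the function name and label format are assumptions.

```python
def suggestion_labels(publications):
    """Map each author name to a label with the first and last publication year."""
    years = {}
    for author, year in publications:
        first, last = years.get(author, (year, year))
        years[author] = (min(first, year), max(last, year))
    return [
        "%s (%d-%d)" % (author, first, last)
        for author, (first, last) in sorted(years.items())
    ]


# Made-up (author, publication year) pairs for two homonymous authors.
sample = [
    ("Chen Li 0001", 1998), ("Chen Li 0001", 2012),
    ("Chen Li 0002", 2009), ("Chen Li 0002", 2014),
]
for label in suggestion_labels(sample):
    print(label)
```

Each suggestion in the drop-down list would then carry the period in which the author published, which gives the user one more cue for choosing among homonyms.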
6.3 Add Admin Privileges
Providing admin capabilities through a Log In button to manage the database at the
back-end would be a good enhancement for the web application. The administrator could log
in and update or correct the database when desired. A user interface can be designed
to help the admin update the database easily and quickly. For example, if the
admin wants to update the name of an author in the database, they can log in to the
application and fill in a form with the new author name to replace the existing one,
and the application would make the change in the database. The admin need not work with
the database through a terminal or know any SQL to perform such an action;
they only need to choose a field to update and enter values in the input fields of a form.
The application would then make the necessary changes to the database automatically. This
task would be performed by the admin particularly when an author sends an email
identifying issues with their publications, the spelling of their name, and so on.
The other admin privilege is related to the enhancement discussed in section 6.1.
The admin can be given access to the scripts that create the database from the XML file to
keep it up to date.
6.4 Find Multiple Shortest Paths
Another enhancement for the application is to find multiple shortest paths between
two authors. This feature can be tricky to implement and may degrade the performance of
the application because of the extra processing required for each possible
collaboration path. Although this feature is not implemented explicitly at present, the
application can already display different shortest paths when
the same query is performed multiple times. This is because the application currently
finds the coauthors of an author on the fly from the database, and the database may return
the coauthor list in a different order for each query, so the algorithm may show a
different one of the shortest paths to the user.
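One possible way to enumerate all shortest paths is a breadth-first search that records every predecessor lying on a shortest route, followed by a backward walk from the target. The sketch below is an assumption about how the feature could be built, not the algorithm implemented in the thesis, and the toy graph is made up.

```python
from collections import deque


def all_shortest_paths(coauthors, source, target):
    """Return every shortest path from source to target in an unweighted graph."""
    distance = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        author = queue.popleft()
        for other in coauthors.get(author, ()):
            if other not in distance:
                distance[other] = distance[author] + 1
                parents[other] = [author]
                queue.append(other)
            elif distance[other] == distance[author] + 1:
                parents[other].append(author)  # another equally short route

    if target not in distance:
        return []  # the two authors are not connected

    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in build(p)]

    return sorted(build(target))


# Made-up toy graph with two shortest paths between X and Y.
toy = {
    "X": ["A", "B"],
    "A": ["X", "Y"],
    "B": ["X", "Y"],
    "Y": ["A", "B"],
}
print(all_shortest_paths(toy, "X", "Y"))
```

The number of shortest paths can grow quickly in a dense coauthor graph, so a real implementation would probably cap the number of paths returned.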
6.5 Application to other problems
This study has applications to other similar problems in social network analysis. The
collaboration distance tool can be used in any social or academic network to find the
shortest path between any two people. Any such network can be used to analyze the ‘small
world’ phenomenon, as noted in several earlier papers. Some similar problems
are discussed below. A well-known analogue of this thesis is the Bacon number
(as in the game Six Degrees of Kevin Bacon), which connects actors who appeared in a film
together. The Bacon number of an actor or actress is the number of degrees of separation
he or she has from Bacon, as defined by the game. This is an application of the Erdős
number concept to the Hollywood movie industry. The higher the Bacon number, the
farther the actor is from Kevin Bacon.
Another application for this study can be to find the Erdős–Bacon number, which is
the sum of one's Erdős number and one's Bacon number. The lower the number, the
closer a person is to Erdős and Bacon, which reflects a small world phenomenon in
academia and entertainment. In general, to have a defined Erdős–Bacon number, it is a
necessary (but not a sufficient) condition for one to have both appeared in a film and co-
authored an academic paper.
One can also consider the time and spatial metadata of the publication records and
analyze the evolution of the database over a geographical domain. This may help
researchers reveal any trends or patterns in the computer science community.
Miloš Savić et al. [10] presented a study on the structure and evolution of scientific
collaboration in Serbian mathematical journals. They used various metrics, such as degree
centrality, to analyze the evolution of scientific collaboration over time.
Mike Lieberman [13] has done social network analysis on networks with big data, such as
Facebook, Twitter, Google and Craigslist, and analyzed how that data can serve various use
cases. This is a good example of applying social network analysis to big data.
7. Conclusion
The analysis of the collaboration graph built from the bibliography data of the computer
science community reveals many interesting facts. The graph has roughly 1.6 million
authors as its vertices, with an edge between every pair of authors who have a joint
publication. It shows the relationships between authors and the growth of the community
over the years. The most prolific author in the computer science community is Wei Wang
with 2218 coauthors. This number is much larger than that of Paul Erdős in the mathematics
community. Other prolific authors include Wei Zhang with 1584 coauthors and Wei Li
with 1473 coauthors.
One observation in this thesis is that the Erdős number of many authors in the
collaboration graph of the computer science community matches their Erdős number
in the collaboration graph of the mathematics community. The six degrees of separation
phenomenon holds in this dataset even though Paul Erdős is not a
prolific author in computer science journals. His primary area was combinatorics, and
most of his publications appeared in mathematics journals. The analysis of the DBLP data
also reveals that the shortest path between any two authors is not more than six, except in
cases where there is no path.
References
[1] DBLP. http://dblp.uni-trier.de. Accessed on November 12, 2015.
[2] Microsoft Academic Search. http://academic.research.microsoft.com/VisualExplorer. Accessed on October 2, 2015.
[3] MariaDB Tutorial. http://www.techonthenet.com/mariadb/. Accessed on October 20, 2015.
[4] MathSciNet. http://www.ams.org/mathscinet/. Accessed on August 20, 2015.
[5] Erdős Number Project. http://wwwp.oakland.edu/enp/. Accessed on October 1, 2015.
[6] CONNECT Storage Engine in MariaDB. https://mariadb.com/kb/en/mariadb/connect/. Accessed on June 11, 2015.
[7] Mathematical Reviews. http://www.ams.org/mr-database. Accessed on November 15, 2015.
[8] DBLP XML file location. http://dblp.uni-trier.de/xml/. Accessed on July 16, 2015.
[9] Ergin Elmacioglu, Dongwon Lee. On Six Degrees of Separation in DBLP-DB and More. In Proceedings of the ACM SIGMOD Record Conference (New York, NY, USA) June 2005. Vol. 34, No. 2, pp. 33-40.
[10] Miloš Savić, Mirjana Ivanović, Miloš Radovanović, Zoran Ognjanović, Aleksandar Pejović, Tatjana Jakšić Krüger. The structure and evolution of scientific collaboration in Serbian mathematical journals. Scientometrics, December 2014, Vol. 101, No. 3, pp. 1805-1830.
[11] Michael Ley. DBLP: some lessons learned. In Proceedings of the VLDB Endowment. Vol. 2, No. 2, August 2009, pp. 1493-1500.
[12] V. Yegnanarayanan, G. K. Umamaheshwari. A Note on the Importance of Collaboration Graphs. International Journal of Mathematical Sciences and Applications September 2011. Vol. 1, No. 3, pp. 1113-1121.
[13] Mike Lieberman. Visualizing Big Data: Social Network Analysis. In Proceedings of Digital Research Conference (San Antonio, Texas) March 2014.