DBLP — Some Lessons Learned ∗
Total Page:16
File Type:pdf, Size:1020Kb
DBLP — Some Lessons Learned ∗ Michael Ley Universitat¨ Trier, Informatik D–54286 Trier Germany [email protected] ABSTRACT Bibliometric studies are a third group of publications. They The DBLP Computer Science Bibliography evolved from an make use of the full semantics of the data. The main disad- early small experimental Web server to a popular service vantages of DBLP for this purpose are the lacking citation for the computer science community. Many design decisions information and the varying coverage for different subfields and details of the public XML-records behind DBLP never of computer science[3]. The main advantages are the free were documented. This paper is a review of the evolution availability and the inclusion of many conference proceed- of DBLP. The main perspective is data modeling. In DBLP ings which play an essential role for many branches of CS persons play a central role, our discussion of person names and are poorly covered by other bibliographic data bases. A may be applicable to many other data bases. All DBLP fourth group of papers deals with person name disambigua- data are available for your own experiments. You may either tion, a special aspect of data quality. download the complete set, or use a simple XML-based API DBLP is a (very imperfect) \authority file”[1] for com- described in an online appendix. puter science researchers. We try to identify the persons be- hind the research papers and to treat synonyms and homo- 1. INTRODUCTION nyms as precise as possible. Incomplete and inconsistent in- formation, imperfect software, lack of time, and our own in- In June 2009 the DBLP Computer Science Bibliography1 ability are the limiting constraints for this task. The DBLP from the University of Trier contained more than 1.2 mil- web server lists all known papers published by a person on lion bibliographic records. For computer science researchers her/his \person page". This simple mapping becomes tricky, the DBLP web site is a popular tool to trace the work of as soon as a person has several names (synonyms) or if there colleagues and to retrieve bibliographic details when com- are several persons with the same name (homonyms). The posing the lists of references for new papers. Ranking and main obstacles are the bad habit to abbreviate (given) names profiling of persons, institutions, journals, or conferences is beyond recognition and spelling errors. The main algorith- another sometimes controversial usage of DBLP. mic idea we use to identify names we should check more The DBLP data may be downloaded. The bibliographic precisely, is to look at person pairs which have the distance records are contained in a huge XML file2. We are aware of two in the coauthor graph and have a \similar" name[6]. more than 400 publications which mention the usage of these DBLP isn't a well designed project. It grew from a small data for an amazing variety of purposes. Many researchers scale experimental server which was set up at the end of simply need non-toy files to test and evaluate their algo- 1993 to test web technology. In retrospect many ad hoc rithms. They are interested in XML, but not in the seman- solutions are poorly designed. Nevertheless, our policy is to tics of the data. Others inspect the DBLP data more closely: keep DBLP as stable as possible. Data formats, URLs etc. It is easy to derive several graphs like the bipartite person- are only changed if they prevent important functionality, publication graph, the person-journal or person-conference and not if we simply recognized them as unaesthetic. graphs, or the coauthor graph, which is an example of a social network. Methods for analysis and visualization of Section 2 of this paper describes dblp.xml. Beyond the these medium sized graphs are reported in many papers. syntactic framework given by the DTD, there are a lot of conventions and micro-syntax rules. We already commented ∗ An early version of this paper was titled \dblp.xml | A on the special treatment of person names in DBLP. In sec- Documentation". Version: June 18, 2009 tion 3 you may find more details. In section 4 we sketch the 1http://dblp.uni-trier.de 2 remaining HTML-style part of DBLP. The new DBLP XML 2009/06: sizeof(dblp.xml) = 532MB, sizeof(...gz) = 93MB services are explained in the online appendix. Our example application is a simple crawler which finds the shortest path between two DBLP authors in the coauthor graph. In addi- Permission to copy without fee all or part of this material is granted provided tion the appendix lists code to map person names to DBLP that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, URLs. and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers 2. DBLP RECORDS or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. The DBLP data set is available from the location VLDB ‘09, August 24-28, 2009, Lyon, France Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00. http://dblp.uni-trier.de/xml/ The file dblp.xml contains all bibliographic records which which are published in journals, transactions, maga- make DBLP. It is accompanied by the data type definition zines, or newsletters. The second part of a DBLP key file dblp.dtd. You need this auxiliary file to read the XML- typically designates the conference series or periodical file with a standard parser[4]. dblp.xml has a simple layout: the papers appeared in. The last part of the key may <?xml version="1.0" encoding="ISO-8859-1"?> be any sequence of alphanumerical characters, in most <!DOCTYPE dblp SYSTEM "dblp.dtd"> cases these IDs are derived from the authors' names <dblp> and the year of publication, sometimes a letter is ap- record 1 pended to make this key part unique. Keys are not ... changed if we correct misspelled author names or year record n information. Obviously keys are not functional depen- </dblp> dent from the contents of their records, you should not make any assumption about the last part of a key. The header line specifies ISO-8859-1 (\Latin-1") as the en- coding, but in fact the file only contains characters <128, The DBLP key namespace layout is coarse and some- i.e. pure ASCII. All non-ASCII characters are represented times fuzzy: periodicals or conference series often are by symbolic or numeric entities. The symbolic entities like renamed, sometimes they are split or they are joined. é for the character '´e'are declared in the DTD. Nu- The world of publications does not form a hierarchy, meric entities like é should be understood by any XML the mapping into the DBLP namespace often is a very parser without declaration. helpful sub-categorization, but sometimes it may be In practice there are some obstacles in parsing the large wrong or controversial. A typical problem is the de- XML file which cost us a lot of time: The SAX parser con- cision when to treat a satellite workshop as a self- tained in the Java standard distribution has a limit for han- contained event or as a part of the hosting conference. dling symbolic entities. When starting the Java virtual ma- chine the option 'DentityExpansionLimit' has to be set to a • mdate is the date of the last modification of this record. large number. The parser contained in Java 1.6 distribution The format complies with ISO 8601, i.e. YYYY-MM-DD. is not able to handle large XML files. You should install We have no explicit change log for DBLP, but the an alternative SAX Java parser. We now use Xerces-J from mdate attribute makes it easy to load only recent addi- the Apache XML project, which reads dblp.xml without tions into your application. The file dblp.xml contains any problem. The DBLP FAQ3 reports more details. only the current version of the records. dblph.xml is The XML root element <dblp> contains a long sequence an extended version of the XML file which additionally of bibliographic records. The DTD lists several elements to contains old versions of the records. These informa- be used as a bibliographic record: tion were extracted from the daily backups of the last years. If may be useful for the analysis of the evolution <!ELEMENT j j dblp (article inproceedings of DBLP. dblph.xml is only updated occasionally. proceedingsjbookjincollectionj phdthesisjmastersthesisjwww)*> In our example the elements author, title, pages, year, volume, journal, and number are used similar to the corre- These tags correspond to the entry types used in BibTEX[5]. sponding BibT X fields. DBLP records may be understood as \BibTEX records in E XML syntax +": Author <article key="journals/cacm/Szalay08" mdate="2008-11-03"> In BibTEX there is at most one author field which may <author>Alexander S. Szalay</author> contain an 'and'-separated list of names. In DBLP records <title>Jim Gray, astronomer.</title> there is an author element for each author. The order of <pages>58-65</pages> the author elements inside a record is significant, it should <year>2008</year> be the same as on the head of the paper. In BibTEX a name <volume>51</volume> may be entered as "Michael Ley" or as "Ley, Michael" <journal>Commun. ACM</journal> (last name, first name). In DBLP we always use the first <number>11</number> form, there should be no comma inside of author elements.