Cover Page

DBLPMINER: A TOOL FOR EXPLORING BIBLIOGRAPHIC DATA

A Project

Presented to the faculty of the Department of Computer Science

California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

Computer Science

by

Tony Le

FALL 2014

Thesis Approval Page

DBLPMINER: A TOOL FOR EXPLORING BIBLIOGRAPHIC DATA

A Project

by

Tony Le

Approved by:

______________________________, Committee Chair
Dr. Du Zhang

______________________________, Second Reader
Dr. Jinsong Ouyang

____________________
Date


Thesis Format Approval Page

Student: Tony Le

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

______________________________, Graduate Coordinator
Dr. Nikrouz Faroughi

____________________
Date

Department of Computer Science


Thesis Abstract Form

Abstract

of

DBLPMINER: A TOOL FOR EXPLORING BIBLIOGRAPHIC DATA

by

Tony Le

Exploring publications in academia usually entails browsing a collection of published literature. In the digital space, collections typically come in the form of primitive bibliographic databases or full-featured digital libraries. Bibliographic databases contain an organized collection of references to published literature, whereas digital libraries host publications in full text.

Although they lack the full-text content of a publication, bibliographies contain rich metadata, such as publication titles, venues, and contributing authors, that is adequate for finding publications of interest.

One example of a bibliographic database is DBLP, which provides publication records in the computer science discipline. DBLP supports researchers in publication exploration by providing interfaces for finding specific bibliographic records or collections of records by a specific author. However, it does not provide a means for exploring families of similar publications by topic areas.

In this project, an application tool named DBLPminer is developed that extends upon the

DBLP dataset. Addressing this topic limitation of DBLP, DBLPminer provides an interface for linking DBLP records to topic areas and accessing them by topic. Using this tool, researchers may organize

DBLP records by topic categories and subsequently explore publications by topic categories or

find associated topics for a given DBLP record. At its foundation, the application indexes DBLP records by topic categories under the taxonomy of the 2012 ACM Computing Classification System.

Indexed DBLP publication records are made accessible via a prototype web application interface.

Lastly, analysis is performed on these datasets and on the indexing algorithm's performance.

There exist other tools like DBLPminer that aid researchers in exploring publications through bibliographic data. Some tools take a social network approach such as Arnetminer and

ResearchGate, allowing researchers to connect with one another to share and access publications.

Tools such as Scholarometer and Google Scholar are more search oriented, with interfaces for finding publications of interest. Although similar, DBLPminer differs from tools such as

Arnetminer, ResearchGate, and Scholarometer because it focuses on the publication relationships by topic instead of authorship and references. Another difference between DBLPminer and

ResearchGate, Scholarometer, and Google Scholar is that it focuses on publications specific to the computer science community.

______________________________, Committee Chair
Dr. Du Zhang

____________________
Date


TABLE OF CONTENTS

Page

List of Tables ...... viii

List of Figures ...... ix

Chapter

1. INTRODUCTION ...... 1

1.1. Related Work ...... 1

2. BACKGROUND ...... 4

3. DESIGN ...... 7

3.1. High Level Application Structure ...... 7

3.2. Development Environment ...... 8

3.3. Internal Application Structure ...... 8

3.3.1. MongoDB ...... 8

3.3.2. Configuration Files ...... 9

3.3.3. Content Curation Module ...... 12

3.3.4. Web Server Module ...... 18

3.3.5. Web Application ...... 20

4. INDEXING ANALYSIS ...... 28

4.1. Taxonomy Dataset Analysis ...... 28

4.1.1. Distinct Topic Terms ...... 28

4.1.2. Topic Names ...... 31

4.2. Performance Analysis ...... 32

5. CONCLUSION AND FUTURE WORK ...... 39


5.1. Future Work ...... 39

Appendix A. Software Dependencies ...... 41

Appendix B. Source Code ...... 42

Bibliography ...... 71


LIST OF TABLES

Tables Page

1. Table 1 – DBLP XML Element Tags ...... 4

2. Table 2 – 2012 ACM CCS Major Topic Categories ...... 5

3. Table 3 – ACM CCS SKOS XML Element Tags ...... 6

4. Table 4 – mongodb.json configuration file ...... 10

5. Table 5 – ccs.json configuration file ...... 10

6. Table 6 – .json configuration file ...... 11

7. Table 7 – analysis.json configuration file ...... 12

8. Table 8 – Application Script Commands ...... 12

9. Table 9 – MongoDB Search String Parsing Example...... 18

10. Table 10 – Web Server URL Routes ...... 19

11. Table 11 – REST API Routes ...... 20

12. Table 12 – Application’s Custom Web Components ...... 21

13. Table 13 – Taxonomy Distinct Term Counts...... 29

14. Table 14 – Most Frequent Terms ...... 30

15. Table 15 – Topic Name Length Counts ...... 31

16. Table 16 – Total Bibliographies Indexed ...... 32

17. Table 17 – Bibliographies Indexed by N-grams ...... 34

18. Table 18 – Single Topic Term Indexing Performance ...... 35

19. Table 19 – Bigram Topic Term Indexing Performance ...... 36

20. Table 20 – Trigram+ Topic Term Indexing Performance...... 38


LIST OF FIGURES

Figures Page

1. Figure 1 – High Level Application Structure...... 7

2. Figure 2 – MongoDB Document Structure ...... 9

3. Figure 3 – Script & 3rd Party Module Dependencies ...... 13

4. Figure 4 – Taxonomy Migration Pseudo code ...... 15

5. Figure 5 – Bibliography Migration Pseudo code ...... 16

6. Figure 6 – Bibliography Indexing Pseudo code ...... 17

7. Figure 7 – Text Search String Creation Example ...... 18

8. Figure 8 – Web Application Navigation by View ...... 22

9. Figure 9 – Topics View Screenshot ...... 23

10. Figure 10 – Search / Home View Screenshot ...... 24

11. Figure 11 – Subtopics View Screenshot ...... 24

12. Figure 12 – Topic Relevant Bibliographies Screenshot ...... 25

13. Figure 13 – Bibliography Search Results Screenshot ...... 26

14. Figure 14 – Bibliography Info Screenshot ...... 27


Chapter 1

INTRODUCTION

DBLP hosts a large collection of computer science publications, allowing for simple queries for retrieving publications based on conference, authors, or publication title [1]. Complex queries, such as finding relevant publications or publications pertaining to a particular topic domain, are not supported. The significance of this missing feature lies in the benefit of discovery during publication exploration. Researchers exploring the DBLP dataset could benefit from more generalized searches based on topic categories, or from finding similar publications given a publication of interest.

DBLPminer is developed to address the limitations of DBLP and its dataset, organizing

DBLP records by indexing them against topic categories extracted from a taxonomy dataset. The aforementioned taxonomy dataset is obtained from the 2012 ACM Computing Classification

System (ACM CCS), containing topic categories within the field of computer science [2].

Interface access to the organized collection of publications by topic is provided via a web application accessing DBLPminer’s built-in web services. Therefore, researchers can use this tool to organize the DBLP dataset by topic categories and subsequently perform complex queries on publications associated with topic information.

1.1 Related Work

DBLPminer is a web application that organizes and allows for the retrieval of publication records by topic categories using the DBLP dataset. There are other web applications similar to

DBLPminer that offer publication exploration based on bibliographic data. However, these applications differ from DBLPminer in their approach towards the organization and presentation of publication records.


Arnetminer is an online web service that provides comprehensive search and mining services for researcher social networks [3]. Whereas DBLPminer organizes publications by topic categories, Arnetminer organizes publications by researchers and builds a network of content using researcher activity. Identification of connections between researchers, conferences, and publications is achieved using social network analysis. Using these connections, services are created featuring expert finding, geographic search, reviewer recommendation, association search, course search, academic performance evaluation, and topic modeling. Bibliographic information is extracted from researcher profile websites and digital libraries.

Scholarometer is a web service and browser extension that offers a social tool for facilitating citation analysis to measure the impact of an author's publications [4]. It is more similar to Arnetminer than to DBLPminer in that emphasis is placed on publication authors. The tool offers publication search functionality based on authors. In addition, Scholarometer provides statistical data ranking authors based on impact and discipline. The tool sources its bibliographic data from Google Scholar.

Google Scholar provides a way to broadly find scholarly literature across many disciplines and sources [5]. Users can explore related works, citations, authors, and publications.

Documents made available in full detail are ranked by weighing their content, publication venue, author, and citation references. DBLPminer is similar to Google Scholar in that both services provide a means to find and explore scholarly works.

ResearchGate is a social network connecting researchers to enable sharing and access of scientific output, knowledge, and expertise [6]. Researchers using ResearchGate can share publications, connect and collaborate with colleagues, discuss research problems, and analyze

personal statistics. Statistical information collected on research includes views, downloads, and citations.

Like DBLPminer, these tools aid researchers in exploring publications through bibliographic data. Some tools take a social network approach such as Arnetminer and

ResearchGate, allowing researchers to connect with one another to share and access publications.

Other tools such as Scholarometer and Google Scholar are more search oriented with interfaces for finding publications of interest. DBLPminer differs from these tools because it focuses on publication relationships by topic instead of authorship and citation references. Another difference between DBLPminer and most of these tools is that it focuses on publications specific to the computer science community.


Chapter 2

BACKGROUND

DBLPminer utilizes the DBLP and ACM CCS datasets to fulfill its end goal. Topic records from the ACM CCS dataset are used to categorize publications from the DBLP dataset. Specifically, the names of topic categories are used to match bibliography titles for categorization. Because this task deals with text, natural language processing techniques are used to achieve bibliography categorization by topic.

The DBLP dataset, obtainable as an Extensible Markup Language (XML) file, is perpetually growing with a current size of about 2.6 million records. Publications are organized within the XML file using tags based on publication venue with additional tags to mark content.

These tags coincide with entry types of the BibTeX format. A complete listing of publication venue and content tags is presented in Table 1.

Table 1 – DBLP XML Element Tags

Publication Venue Tags: article, inproceedings, proceedings, book, incollection, phdthesis, masterthesis, www

Content Tags: author, editor, title, booktitle, pages, year, address, journal, volume, number, month, url, ee, cdrom, cite, publisher, note, crossref, isbn, series, school, chapter

In detail, the venue tags article, proceedings, book, masterthesis, and phdthesis are straightforward in explanation. The inproceedings tag corresponds to articles within conference proceedings, while the incollection tag corresponds to a part of a book that has its own title. Finally, the www tag corresponds to author web pages and is not used in DBLPminer. Content tags identify various metadata information relevant to publications. DBLPminer focuses on the title tag to associate publications with topics.

The 2012 ACM CCS dataset contains a collection of topic categories fully organized as a taxonomy of fourteen major topics. Among the fourteen topic trees, only twelve are considered for the DBLPminer application. The two ignored topic trees contain topics that appear irrelevant to differentiating computer science topics. A complete listing of these major topic categories is shown in Table 2.

Table 2 – 2012 ACM CCS Major Topic Categories

Major Topic Categories: Applied computing, Computer systems organization, Computing methodologies, Hardware, Human-centered computing, Information systems, Mathematics of computing, Networks, Security and privacy, Social and professional topics, Software and its engineering, Theory of computation

Omitted Major Topic Categories: General and reference; Proper nouns: People, technologies and companies

The dataset can be obtained in the Simple Knowledge Organization System XML (SKOS XML) format, a semantic web model standard for expressing the basic structure and content of concept schemes. For each topic record within this file, DBLPminer extracts the following pieces of information via element tags and attributes: topic name, topic aliases, ID, parent topic IDs, and subtopic IDs. A listing of the tags and attributes containing this information is shown in Table 3. IDs are used in building the taxonomy relationships between topics and subtopics, whereas the name and aliases are used as topic labels.


Table 3 – ACM CCS SKOS XML Element Tags

Element          Attribute                   Description
skos:Concept     rdf:about (record ID)       Taxonomy record
skos:prefLabel                               Record's topic name
skos:altLabel                                Record's topic alias
skos:broader     rdf:resource (parent ID)    Reference to parent topic's ID
skos:narrower    rdf:resource (child ID)     Reference to child subtopic ID

Natural language processing techniques are used to index publication records by topic categories. One common technique in natural language processing is filtering out common words that appear to be of minimal value in selecting documents matching a user need, such as a, be, for, and in. This collection of words is known as stopwords. DBLPminer utilizes a dictionary of stopwords for bibliography indexing.

Another technique in natural language processing is stemming. Stemming involves reducing inflected or derived words into their base or root form. There are several stemming algorithms available, each with different stemming approaches and results. DBLPminer uses the

English Snowball stemmer for bibliography indexing.
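To make this preprocessing concrete, the following is a minimal Python sketch of stopword removal followed by Snowball stemming; the stopword subset and function name are illustrative only and do not reproduce DBLPminer's actual implementation.

# Minimal sketch: reduce a topic name to stemmed, stopword-free terms.
# Assumes the PyStemmer package; the stopword set below is a small
# illustrative subset, not DBLPminer's actual stopword dictionary.
import Stemmer

STOPWORDS = {"a", "an", "and", "be", "for", "in", "its", "of", "the"}
stemmer = Stemmer.Stemmer("english")  # English Snowball stemmer

def preprocess(topic_name):
    """Lowercase the name, drop stopwords, and stem the remaining terms."""
    terms = [t for t in topic_name.lower().split() if t not in STOPWORDS]
    return stemmer.stemWords(terms)

print(preprocess("Mathematics of computing"))      # ['mathemat', 'comput']
print(preprocess("Software and its engineering"))  # ['softwar', 'engin']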


Chapter 3

SOFTWARE DESIGN

3.1 High Level Application Structure

DBLPminer is a composite application consisting of several components. The content curation component exposes functionality for migrating the DBLP and ACM CCS datasets to a database, indexing publication records by topic categories, and analyzing the records after persistence. The web server component houses a web application and provides web services for accessing publication and topic records. The web application acts as a prototype REST web service client. Lastly, a set of configuration files is available for fine-tuning content curation, web server, and database settings. These components are visually presented in Figure 1.

Figure 1 – High Level Application Structure

In terms of usage, researchers first modify configuration files to fine-tune the database, web server, and content curation scripts. Subsequently, the DBLP and ACM CCS datasets are migrated into MongoDB for persistence using the content curation scripts. Furthermore,

researchers will use the indexing functionality in the content curation scripts to index DBLP publications by ACM CCS topic categories. Once all records are updated, researchers can interact with the datasets via the web application in a web browser after starting the web server.

3.2 Development Environment

DBLPminer is developed using various languages and technologies. The modules corresponding to these languages and technologies are also shown in Figure 1. Programming languages used in development include Python and JavaScript. Python is used to develop the content curation module scripts, whereas JavaScript is used to develop both the web server and web application modules. HTML and CSS are supplementary languages used for designing the web application's components. Technologies used include MongoDB [7], Node.js [10] and Polymer

[11]. MongoDB is a document-based database used to persist DBLPminer's publication and topic records. The web server is constructed using the Node.js platform. The Polymer web components framework is used to build the web application. A detailed description for each technology used can be found in Section 3.3.

3.3 Internal Application Structure

DBLPminer comprises various components implemented using several languages and technologies. The modules and technologies introduced in Section 3.1 and Section 3.2, respectively, are covered in more detail in this section.

3.3.1 MongoDB

MongoDB is an open-source document database software written in C++ [7]. The database stores records as documents, grouped into collections. Compared to SQL, documents are analogous to records and collections are analogous to tables. MongoDB features JSON-styled

documents (BSON) with dynamic schemas, offering simplicity and power for rapid prototyping

[7]. BSON documents stored in a collection require a unique _id field that acts as a primary key.

DBLPminer creates two collections in MongoDB – bibliographies and acmccs.

Document structures listing important fields for each collection are shown in Figure 2. The main fields in a bibliography document are title and authors. The dblp and extra fields contain DBLP metadata and other extraneous content. A topic document consists of label properties in the name and aliases fields, and tree path properties in the broader, narrower, and path fields. The broader and narrower fields correspond to parent and child topics respectively, while the path field provides a complete listing of ancestor topics leading up to the document's topic.

Figure 2 – MongoDB Document Structure
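Since the figure itself is not reproduced here, the sketch below approximates the two document shapes based on the fields described above; the field values are hypothetical placeholders.

# Approximate shapes of the two document types (values are hypothetical;
# only the field names described in the text are taken from DBLPminer).
bibliography_doc = {
    "_id": "...",                                   # MongoDB primary key
    "title": "A Sample Publication Title",
    "authors": ["First Author", "Second Author"],
    "dblp": {"key": "...", "mdate": "..."},         # DBLP-specific metadata
    "extra": {"year": "2014", "journal": "..."},    # other extracted content
}

topic_doc = {
    "_id": "...",                                   # ACM CCS record ID
    "name": "Machine learning",
    "aliases": [],
    "broader": ["<parent topic id>"],               # parent topics
    "narrower": ["<child topic id>"],               # child subtopics
    "path": ["<root id>", "<ancestor id>"],         # ancestors from root to parent
}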

3.3.2 Configuration Files

DBLPminer uses various configuration files to direct and alter how modules operate.

The MongoDB instance connection is configured by the mongodb.json file. Parameters and their default values in that file are presented in Table 4. The host and port parameters configure the MongoDB host URL and port number, respectively. The db parameter pertains to the target database name in


MongoDB. The collections accessible within the database are configured by the collections parameter.

Table 4 – mongodb.json configuration file

Key Default Value Description host “localhost” MongoDB server host URL

port 27017 MongoDB listening port

db “miner” Database name

collections [“acmccs”, ”bibliographies”] List of document collections

The ACM CCS taxonomy dataset file and its topics to migrate are configured by the ccs.json file.

The contents of the configuration file are shown in Table 5. The input parameter points to the dataset

SKOS XML file location. The output parameter points to a text file location for logging topic names. The topic_choices parameter is a listing of all topic trees from the ACM CCS dataset for reference; therefore, modification of this parameter will have no effect and is not encouraged. The topics parameter lists all topics to migrate from the topic_choices list.

Table 5 – ccs.json configuration file

Key             Default Value
input           ./data//ccs2012.xml
output          ./data/output/taxonomy.txt
topic_choices   ["General and reference", "Mathematics of computing", "Information systems", "Security and privacy", "Networks", "Human-centered computing", "Social and professional topics", "Theory of computation", "Computing methodologies", "Applied computing", "Computer systems organization", "Hardware", "Software and its engineering", "Proper nouns: People, technologies and companies"]
topics          ["Mathematics of computing", "Information systems", "Security and privacy", "Networks", "Human-centered computing", "Social and professional topics", "Theory of computation", "Computing methodologies", "Applied computing", "Computer systems organization", "Hardware", "Software and its engineering"]


DBLP dataset migration can be configured via the dblp.json configuration file. The contents of this configuration file are presented in Table 6. The input parameter points to the dataset XML file location. The output parameter points to a text file location for logging bibliography names.

The _venues and _metadata parameters are listings of publication venues and metadata tags for reference; therefore, modification of these parameters will have no effect. Publication venues to consider for migration are controlled by the venues parameter. Possible publication venues to consider can be obtained from the _venues listing. Metadata content to consider for migration is listed in the _metadata parameter and managed by the metadata parameter. The url and version parameters are used to manually keep track of where the XML file is obtained and its version.

Table 6 – dblp.json configuration file

Key         Default Value
input       ./data/xml/dblp.xml
output      ./data/xml/output/titles.txt
_venues     ["article", "inproceedings", "proceedings", "book", "incollection", "phdthesis", "masterthesis", "www"]
venues      ["article", "inproceedings", "proceedings", "book", "incollection", "phdthesis", "masterthesis"]
_metadata   ["editor", "booktitle", "pages", "year", "address", "journal", "volume", "number", "month", "url", "ee", "cdrom", "cite", "publisher", "note", "crossref", "isbn", "series", "school", "chapter"]
metadata    ["editor", "booktitle", "pages", "year", "address", "journal", "volume", "number", "month", "url", "ee", "cdrom", "cite", "publisher", "note", "crossref", "isbn", "series", "school", "chapter"]
url         "http://dblp.uni-trier.de/xml/"
version     "14-Apr-2014 22:54"

Natural language processing in dataset analysis is configured in the analysis.json file. The contents of the file are shown in Table 7. Munging options are listed in munging_options and configured in munging_mode. The none option performs no operations on text. The

removeStopwords option removes stopwords from text, whereas the stem option stems text. The removeStopwordsAndStem option first removes stopwords from text and subsequently stems it.

Table 7 – analysis.json configuration file

Key Default Value munging_options ["none", "removeStopwords", "stem", "removeStopwordsAndStem"]

munging_mode "removeStopwordsAndStem"

The logging.json file configures logging settings for DBLPminer. The application implements logging using Python’s built-in logging library. Logging is organized by groups, each with its own settings for displaying logging output and sources for log output. Exact parameter details for that file can be found in Python’s logging module documentation [8].
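As a rough illustration, the sketch below loads such a file and applies it with the standard library; the file path, logger name, and the assumption that logging.json follows the dictConfig schema are illustrative rather than taken from DBLPminer's source.

# Rough sketch: apply a JSON logging configuration via the standard library.
# The path, the logger name, and the dictConfig-schema assumption are
# illustrative only.
import json
import logging
import logging.config

with open("logging.json") as f:
    logging.config.dictConfig(json.load(f))

log = logging.getLogger("curation")   # hypothetical logger group name
log.info("logging configured")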

3.3.3 Content Curation Module

Table 8 – Application Script Commands

Python Script Commands       Arguments
$> dblpminer.py migrate      taxonomy, bibliographies
$> dblpminer.py index        bibliographies
$> dblpminer.py analysis     count_indices_per_topic, count_indices_per_tree, count_topic_ngrams, count_topic_term_frequency, count_topic_terms, output_indices_per_topic

The content curation component exposes functionality for: migrating content from the

DBLP and ACM CCS datasets, indexing bibliographies by topic category, and analysis. These functionalities are exposed via commands upon running the application script. The commands and arguments are shown in Table 8. The migrate command takes taxonomy or bibliographies as arguments for the task of migrating the corresponding dataset from its XML file to the database. The index command performs indexing on bibliographic records according to topic categories. The analysis command

takes in various arguments corresponding to the names of implemented functions for performing analysis on the datasets and on indexing performance.

Figure 3 – Script & 3rd Party Module Dependencies

The content curation module is bundled as a collection of Python scripts. Application scripts and 3rd party module dependencies are presented in Figure 3. A few of these scripts are straightforward. The dblpminer.py script serves as the module's application entry point, housing the aforementioned commands. General utilities used in the application, such as file directory manipulation, are implemented in util.py. Several scripts depend on this utility script for manipulating file directories. MongoDB database connection management is implemented in

persistence.py. The persistence script depends on the PyMongo library, a database driver for

MongoDB.

Implemented in bibliographies.py and taxonomy.py are object models for managing read, write, and update operations against bibliography and topic records respectively. Because these object models interact with the database for persistence and retrieval, their respective scripts depend on persistence.py. Furthermore, both object model scripts expose functionality for migrating their respective datasets from XML files to the database. These XML files are parsed using the Lxml library.

Bibliography categorization by topic category is implemented in indexing.py. It has a dependency on taxonomy.py for retrieving topic records and a dependency on bibliographies.py for updating publication records. Stemming operations are performed using the English Snowball

Stemmer via the PyStemmer library. Stopword removal is assisted by a text file containing a line-delimited list of stopwords.

Functionality for analysis and testing on records and indexing performance is implemented in analysis.py. Because analysis touches upon many aspects of the application, this script has a fairly large set of dependencies encompassing: persistence, taxonomy, bibliographies, and indexing scripts.

The entry point script dblpminer.py, upon receiving a command, routes it to commands.py, as this script facilitates access to the functionality of the other scripts. For that reason, commands.py has dependencies on almost all application scripts, including the persistence, taxonomy, bibliographies, indexing, and analysis scripts.


3.3.3.1 Taxonomy Migration

DesiredTrees = List of root topic names for desired topic trees declared in ccs.json

Topics = List of all topic records from SKOS XML file

RootTopics = root topics subset of Topics that intersects DesiredTrees

FOR each topic in RootTopics:

Walk down topic tree updating each node’s ancestor info

SelectedTopics = subset of Topics which have ancestor updated

FOR each topic in SelectedTopics:

Save topic to database

Figure 4 – Taxonomy Migration Pseudo code

Taxonomy information is integrated into DBLPminer by migrating topic records from the

ACM CCS SKOS XML file to MongoDB. Figure 4 contains pseudo code detailing how taxonomy migration is accomplished. First, a list containing the names of desired topic trees is retrieved from the ccs.json configuration file to determine what topic trees will be migrated. The application will create a list of all topics by iterating over the contents of the SKOS XML file.

Once all topics are extracted, ancestral information is updated for all relevant topics by propagating path information via tree traversal for each desired topic tree. In the context of tree data structures, "ancestral information" refers to a list containing all nodes visited in sequential order, starting from the root up to the current node's parent. Because only topic records that fall under desired topic trees have their ancestral information updated, records which fall solely under undesired topic trees are filtered out from migration. The resultant filtered set of topics is subsequently saved to the database, and topic names are dumped into an output file specified by the configuration file.
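A simplified sketch of the path-propagation step is shown below; the in-memory topic representation, function name, and example IDs are assumptions for illustration and do not mirror taxonomy.py exactly.

# Simplified sketch of ancestor-path propagation over the desired topic trees.
# `topics` maps a topic ID to a dict with "narrower" child IDs; this in-memory
# shape and the IDs below are assumptions for illustration.
def propagate_paths(topics, root_ids):
    """Depth-first walk from each desired root, recording each node's ancestors."""
    selected = {}

    def walk(topic_id, path):
        topic = topics[topic_id]
        topic["path"] = list(path)              # ancestors from root to parent
        selected[topic_id] = topic
        for child_id in topic.get("narrower", []):
            walk(child_id, path + [topic_id])

    for root_id in root_ids:
        walk(root_id, [])
    return selected                              # only topics under desired trees

topics = {
    "t1": {"name": "Computing methodologies", "narrower": ["t2"]},
    "t2": {"name": "Artificial intelligence", "narrower": []},
}
print(propagate_paths(topics, ["t1"])["t2"]["path"])   # ['t1']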


3.3.3.2 Bibliography Migration

DesiredVenues = List of selected publication venues declared in dblp.json

DesiredTags = List of selected metadata tags declared in dblp.json

FOR each record in DBLP XML file:

IF record type is one of DesiredVenues:

Parse title, author, and dblp-meta element information

Parse other metadata elements that fall under DesiredTags

Save current record containing parsed element data to database

Figure 5 – Bibliography Migration Pseudo code

Bibliographic data is assimilated into DBLPminer by migrating publication records from

DBLP XML file to MongoDB. Figure 5 displays the pseudo code implementation for bibliography migration. First, a list of desired publication venues and metadata tags is gathered from the dblp.json configuration file. The XML file is then iterated over, considering only records that fall under the desired publication venues. For each such bibliographic record, author, title, and dblp-meta tag information is extracted, along with any additional tag information that falls under the desired metadata tags. Finally, the record, along with its filtered content, is saved to the database.
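A simplified Python sketch of this streaming migration, assuming the Lxml and PyMongo libraries mentioned earlier, is shown below; the abbreviated tag lists and the collection handle are illustrative, and details such as nested title markup are glossed over.

# Simplified sketch of streaming the DBLP XML file and saving records for the
# desired venues. The abbreviated venue/tag sets and the collection handle are
# illustrative; entity resolution requires the DBLP DTD to be present.
from lxml import etree

DESIRED_VENUES = {"article", "inproceedings", "proceedings", "book",
                  "incollection", "phdthesis", "masterthesis"}
DESIRED_TAGS = {"year", "journal", "booktitle", "pages", "ee", "crossref"}

def migrate(xml_path, collection):
    for _, elem in etree.iterparse(xml_path, events=("end",), load_dtd=True):
        if elem.tag in DESIRED_VENUES:
            record = {
                "title": elem.findtext("title"),
                "authors": [a.text for a in elem.findall("author")],
                "dblp": dict(elem.attrib),            # key, mdate, ...
                "extra": {t: elem.findtext(t) for t in DESIRED_TAGS
                          if elem.find(t) is not None},
            }
            collection.insert_one(record)   # PyMongo 3+; older drivers use insert()
        # Clear completed top-level records to keep memory bounded.
        if elem.getparent() is not None and elem.getparent().tag == "dblp":
            elem.clear()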

3.3.3.3 Bibliography Indexing

Publication records stored within MongoDB are indexed according to topic names.

Figure 6 displays the pseudo code implementation for the indexing process. The process iterates over all topic categories retrieved from the acmccs collection in MongoDB, retrieving from the database the bibliographies whose titles contain terms matching the topic category name. These relevant bibliographies are successively indexed according to the associated topic name.


FOR each RootTopic in Topic Trees retrieved from acmccs collection in MongoDB:

FOR each Topic in RootTopic (traverse tree depth-first):

TopicName = Topic’s name

RelevantBibliographies = Find bibliographies from bibliography collection in MongoDB where title contains terms from TopicName

FOR each bibliography in RelevantBibliographies:

Update bibliography tag info with TopicName

Figure 6 – Bibliography Indexing Pseudo code

Using terms from topic names to find matching bibliography titles is a complicated process. Text search string generation requires significant preprocessing in order to achieve suitable results. First, the topic name is converted into a list of terms. That list of terms is then filtered for stopwords and stemmed to remove inflections. The resultant list of terms is then concatenated to build the search string. Finally, the search string is forwarded as a text search query to MongoDB for parsing to retrieve matching bibliographies. An example of this process is depicted in Figure 7. Each rounded box represents a single string, and adjacently connected boxes represent a list of strings. Labeled arrows describe transformations performed on strings.

MongoDB text search takes in any combination of individual terms or phrases. The database parses the search string, performing a logical OR search on terms within the search string unless a phrase is present. To specify a phrase, phrases within the search string are enclosed with double quotes. When a phrase is present in a search string, MongoDB performs a logical

AND between the phrase and the remaining set of terms. DBLPminer currently implements text search by converting all individual terms into phrases to force MongoDB into performing a logical

AND on all individual terms.

Figure 7 – Text Search String Creation Example

An example of how search strings are parsed by MongoDB is shown in Table 9. The complete search string is enclosed by single quotes, while phrases within the string are enclosed with double quotes. The last example is how DBLPminer currently performs text search for indexing bibliographies. For more information regarding text search in

MongoDB, refer to their documentation on text search [9].

Table 9 – MongoDB Search String Parsing Example

Search String                          MongoDB search method
'ssl certificate'                      "ssl" OR "certificate"

' "ssl certificate" ' "ssl certificate"

' "ssl certificate" authority key' "ssl certificate" AND ("authority" OR "key")

' "mathemat" "comput" ' "mathemat" AND "comput"

3.3.4 Web Server Module

The web server component is built using the node.js platform and two node.js modules – express.js and mongo.js. Node.js is a JavaScript platform for building fast, scalable web applications that utilizes an event-driven non-blocking I/O model [10]. Express.js, a minimalist

web framework, is used to efficiently develop web application structures. Mongo.js is a

MongoDB wrapper database driver for node.js. Using these technologies and JavaScript,

DBLPminer implements a web server with a RESTful API web service simply by declaring how

URL routes are handled by the server and what port the server is listening to.

3.3.4.1 Routes

The web server organizes routes by two categories – static files and the RESTful web service API. These URL routes are presented in Table 10. The routes "/" and "/public" point to a public folder on the webserver, which houses the DBLPminer web application. The route

“/components” directs to the server’s directory containing additional web assets for generating the web application – namely, web components. The "/api" route is the root route for the RESTful web service API.

Table 10 – Web Server URL Routes

Category      URL Route      Description
Static        /              Root
Static        /public        Webpage
Static        /components    Web Assets
Web Service   /api           REST API

3.3.4.2 REST API

RESTful web services have the following characteristics: a base URI, a media type, standard HTTP methods, and hypertext links to reference state and resources. DBLPminer implements RESTful web services as it exhibits those characteristics: "/api" as the base URI, responding to calls with the JSON media type, providing the HTTP method GET, and providing

several routes identifying resources and differentiating application states. Implemented API routes on the web server are listed in Table 11.

Table 11 – REST API Routes

Route                           HTTP Method   Params   Description
/api/bibliographies             GET                    Returns a list of bibliographies
/api/bibliographies/:id         GET           id       Returns bibliography data given an id
/api/bibliographies/topic/:id   GET           id       Returns relevant bibliographies given a topic id
/api/topics                     GET                    Returns root topic list
/api/topics/:id                 GET           id       Returns topic data given a topic id
/api/topics/:id/subtopics       GET           id       Returns subtopic list given a topic id
/api/search/bibliographies      GET           query    Returns bibliographies given query data
/api/search/topics              GET           query    Returns topics given query parameters

The "/api/bibliographies" route and its child routes handle all incoming requests for data pertaining to a publication record or a collection of records. The "/api/topics" route and its child routes handle all incoming requests for a topic record or a collection of records. The "api/search" family of routes correspond to collections of records when performing text search on publication and topic records.

3.3.5 Web Application

DBLPminer has a simple prototype web application that utilizes the REST web service

API to present and navigate through topic category and publication records. The web application is developed using the Polymer web components framework. Polymer is based on encapsulated and interoperable custom elements that extend HTML [11]. These elements are pieced together to create rich web applications. DBLPminer’s web interface is built using several custom web

components in conjunction with various premade web components offered by the Polymer framework. A mixture of HTML, CSS, and JavaScript is used to create these custom web components.

3.3.5.1 Web Components

Table 12 – Application’s Custom Web Components

Category   HTML Elements
Layouts    Site layout components; list layout components
Views      Main view components; list view components; item view components

The web components developed for DBLPminer are shown in Table 12. These components are grouped into two categories – layouts and views. Layout components provide the web application's overall look and feel. Two components are used to design the overall website menu navigation and content view layouts, while four components are used in designing list layouts. Web components in the view category encapsulate the various web page views navigable on the website. All view components utilize the dblpminer-view web component to bootstrap the view's general look and feel. List views utilize the bibliography-list and topic-list web components to bootstrap the look and feel for lists. The item view components are single-item web page views for publication and topic records.


3.3.5.2 Site Navigation

Figure 8 – Web Application Navigation by View

As mentioned in Section 3.3.5.1, the web interface can be broken down into three types of views – main, list, and item. Navigation across the website entails traversing amongst these types of views. Figure 8 depicts the overall navigation structure of the web application.

The main views are presented in a navigation drawer. The four main views are home, topics, stats, and search views. Non-main view items in the drawer include a list view containing recently added bibliographies and a search form stub. General information about the project can be found in the home view. The topic view presents a list of topic categories, starting from root topics as shown in Figure 9. Each item in the list navigates to single topic views containing

detailed information, subtopics, and associated publication records.

Figure 9 – Topics View Screenshot

Currently a stub, the stats view is intended for presenting statistical information on the dataset. The search view presents an interface for finding bibliographies and topics via a text search string. It is implemented as a search bar that can be toggled into view by a search icon button, as shown in Figure 10.


Figure 10 – Search / Home View Screenshot

List views allow for interaction with collections of data. The list views are: subtopics, bibliographies per topic, and search results for both bibliographies and topics. In a list view, a list item can be navigated to and examined in full detail, or, in the case of a topic, expanded to introduce a list of subtopics pertaining to that topic, as shown in Figure 11.

Figure 11 – Subtopics View Screenshot


When a topic item navigates to a view of publications relevant to that topic, a view like

Figure 12 is displayed. In this view, publication records associated with the topic "Mathematics of computing" are presented. A button can be clicked to view detailed information on that topic, or a bibliography item in the list can be clicked to view more information about that bibliography record.

Figure 12 – Topic Relevant Bibliographies Screenshot


Similarly, list items in search result views can be expanded to view more information on the item as shown in Figure 13, depicting sample search results for "machine learning" in the bibliographies collection.

Figure 13 – Bibliography Search Results Screenshot

Item views are pages that present data pertaining to a single item. DBLPminer has two views for items, one for viewing bibliographic content and one for viewing topic content. In the example depicted in Figure 14, bibliographic information is shown for a publication with the title

"A Diversity Measure for Tree-based Classifier Ensembles." In this view, information pertaining to authors, year, and publication venue are shown. Also shown are metadata used by the DBLP database to organize records. Lastly, topic tags that this publication has been indexed under are displayed.


Figure 14 – Bibliography Info Screenshot


Chapter 4

INDEXING ANALYSIS

The foundation of DBLPminer's functionality is the indexing of bibliographies by topic categories. The accuracy of these bibliography indices affects the overall performance of DBLPminer in terms of publication exploration. In this chapter, analysis is conducted on the ACM CCS taxonomy dataset, and the performance of the indexing algorithm is analyzed through experiments.

4.1 Taxonomy Dataset Analysis

Since the ACM CCS taxonomy serves as the foundation for classifying bibliographic records, it is beneficial to understand the collection of terms used in implementing DBLPminer's bibliography indexing. Statistics gathered on the taxonomy dataset focus on the twelve topic trees used by DBLPminer. Refer to Table 2 for a listing of the twelve default topic trees.

4.1.1 Distinct Topic Terms

Distinct topic term counts per topic tree are computed on the taxonomy dataset. Different approaches are considered based on term variations: unmodified, stemmed, and stemmed with stopword removal. The statistics gathered are presented in Table 13. In the unmodified approach, the sum of distinct terms over the individual topic tree sets is 2,687, while the distinct term total for all topic trees combined is 1,614. This differential of 1,073 indicates instances where a term is shared between two or more topic trees.

In the approach with stemming applied via the English Snowball stemming algorithm, the total number of distinct terms amongst all topic trees is reduced by 17.7% in comparison with the unmodified approach, to a total of 1,329. Compared to 1,073 shared instances in the unmodified approach, the stemming approach yields an increase of 4.6%, with a total of 1,122 shared instances. With stemming applied, overall term counts are reduced as similar terms are joined, while shared term instances across topic trees increase since these shared terms become more generalized after being reduced to their "root" form. With the inclusion of stopword removal in the stemmed approach, the distinct term total across all topic trees becomes 1,308, a reduction of 1.6% or 21 terms compared to the stemmed approach. Similarly, shared term instances are reduced by 5.5%, to a total of 1,060 instances.


Table 13 – Taxonomy Distinct Term Counts

Topic Trees                       Unmodified Terms   Stemmed Terms   Stemmed Terms & Stopword Removal
Applied computing                 214                196             190
Computer systems organization     90                 82              76
Computing methodologies           341                309             297
Hardware                          319                294             285
Human-centered computing          143                134             130
Information systems               387                348             340
Mathematics of computing          178                162             156
Networks                          157                147             142
Security and privacy              122                116             110
Social and professional topics    183                173             169
Software and its engineering      271                243             234
Theory of computation             282                247             239
Totals:                           2,687              2,451           2,368
Distinct Totals:                  1,614              1,329           1,308

The stemmed approach with stopword removal is implemented for bibliography indexing in DBLPminer. This approach is favored because it naively maximizes distinctive term quality by reducing terms to their "root" form as defined by the Snowball stemming algorithm while also filtering out common terms with minimal contribution in value via stopword removal.
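The counting itself is straightforward; the sketch below shows how the three variants reported in Table 13 could be computed for a topic tree, reusing the preprocessing approach from Chapter 2 (the toy topic names, stopword subset, and helper names are illustrative).

# Sketch of computing the three distinct-term counts reported in Table 13 for
# one topic tree. The stopword subset and the toy topic names are illustrative.
import Stemmer

STOPWORDS = {"a", "an", "and", "for", "in", "its", "of", "the"}
stemmer = Stemmer.Stemmer("english")

def distinct_terms(topic_names, stem=False, drop_stopwords=False):
    terms = set()
    for name in topic_names:
        words = name.lower().split()
        if drop_stopwords:
            words = [w for w in words if w not in STOPWORDS]
        if stem:
            words = stemmer.stemWords(words)
        terms.update(words)
    return terms

tree = ["Network security", "Security services", "Security of networks"]   # toy data
print(len(distinct_terms(tree)),
      len(distinct_terms(tree, stem=True)),
      len(distinct_terms(tree, stem=True, drop_stopwords=True)))   # prints: 5 4 3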


Examining terms in more detail, the most frequent terms used per topic tree are computed and shown in Table 14. Each stemmed term shown is prepended with the exact number of instances in which the term is used. The three most frequently used families of terms in the taxonomy dataset are network, comput, and system. Out of these three terms, comput is the most commonly used, as it appears within the top three encountered terms for five topic trees. The network family of terms appears to have the most impact on a single topic tree, with a frequency of 70 instances in the Networks topic tree.

Table 14 – Most Frequent Terms

Applied computing: 15 comput, 14 enterpris, 9 document, 8 architectur, 8 manag, 7 busi, 6 process, 6 system, 5 languag
Computer systems organization: 16 architectur, 12 comput, 11 system, 5 real-tim, 5 robot, 4 data, 4 embed, 4 multipl, 4 network, 3 instruct
Computing methodologies: 37 learn, 23 algorithm, 23 model, 19 simul, 13 comput, 12 represent, 8 imag, 8 plan, 7 algebra, 7 languag
Hardware: 23 circuit, 19 design, 12 power, 11 system, 9 test, 8 analysi, 8 devic, 8 emerg, 8 hardwar, 8 synthesi, 7 energi, 7 integr
Human-centered computing: 17 comput, 17 design, 16 social, 13 interact, 12 visual, 9 collabor, 9 mobil, 9 studi, 9 system, 8 interfac, 8 user
Information systems: 37 data, 24 retriev, 24 search, 21 databas, 21 storag, 21 system, 19 web, 17 inform, 15 queri, 14 languag
Mathematics of computing: 10 comput, 10 differenti, 10 optim, 9 analysi, 9 graph, 8 calculus, 8 equat, 7 mathemat, 6 algorithm, 6 statist
Networks: 70 network, 19 protocol, 6 algorithm, 6 area, 6 wireless, 4 mobil, 4 secur, 4 topolog, 3 access, 3 data, 3 design
Security and privacy: 30 secur, 8 privaci, 6 attack, 6 system, 4 engin, 4 hardwar, 4 manag, 4 protocol, 3 authent, 3 control, 3 cryptographi
Social and professional topics: 21 comput, 12 educ, 10 inform, 10 system, 9 manag, 8 polici, 7 technolog, 6 engin, 6 softwar, 5 histori, 4 access
Software and its engineering: 47 softwar, 36 languag, 16 system, 11 architectur, 10 develop, 10 manag, 10 model, 8 program, 6 analysi
Theory of computation: 28 algorithm, 22 theori, 16 complex, 16 learn, 15 program, 13 comput, 13 logic, 12 model, 10 optim, 8 data, 8 databas
All Topics: 113 network, 109 comput, 100 system, 77 model, 74 languag, 68 data, 64 algorithm, 63 softwar, 58 learn


4.1.2 Topic Names

Topic name length counts per topic tree are computed on the taxonomy dataset and presented in Table 15. Name lengths are organized into single-term unigrams, phrased n-grams, and their combined total. Intuitively, increasing the number of terms used to match relevant bibliographies by topic should result in narrower finds. Therefore, unigram topic terms should yield broader finds compared to n-gram topic terms. When unigrams are commonly shared between topic trees, such as comput, accuracy is expected to suffer significantly more when there are no other terms to support topic categorization. Conversely, unigram terms that are mostly contained within their own topic tree pose minimal threat to categorization accuracy. Name lengths computed on the taxonomy dataset are promising, as phrased n-grams make up 89.2% of the total topic categories.

Table 15 – Topic Name Length Counts

Topic Trees                       All     Unigram   N-gram
Applied computing                 160     40        120
Computer systems organization     61      7         54
Computing methodologies           267     19        248
Hardware                          224     14        210
Human-centered computing          119     11        108
Information systems               324     31        293
Mathematics of computing          139     14        125
Networks                          118     5         113
Security and privacy              78      7         71
Social and professional topics    139     24        115
Software and its engineering      213     32        181
Theory of computation             212     16        196
Totals:                           2,054   220       1,834


4.2 Performance Analysis

Bibliography indexing performance depends on both topic name input and how text search is performed to retrieve bibliographies. Since analysis of topic name input is discussed in

Section 4.1, this section will focus on text search methodology. Using text search capabilities of

MongoDB, performance will depend on how text search strings are composed. Table 16 shows indexed bibliography totals using 3 different approaches for text search string composition.

Table 16 – Total Bibliographies Indexed

Topic Trees                       Any Term    All Terms (stemmed)   All Terms (not stemmed)
Applied computing                 1,853,539   981,572               103,759
Computer systems organization     1,306,965   350,686               78,957
Computing methodologies           2,276,984   613,136               188,813
Hardware                          2,138,655   239,099               64,979
Human-centered computing          1,831,156   238,996               42,827
Information systems               2,284,986   478,553               156,581
Mathematics of computing          1,858,230   295,108               90,663
Networks                          1,928,774   257,563               152,342
Security and privacy              1,819,550   58,692                26,421
Social and professional topics    1,825,507   646,521               44,706
Software and its engineering      2,130,579   570,197               140,520
Theory of computation             2,173,441   1,096,748             119,633
Total Indexed Bibliographies:     2,521,720   1,423,199             873,481

As mentioned in Section 3.3.3.3, topic terms are stemmed and filtered, removing stopwords in the generation of text search strings. MongoDB processes a text search based on how the text search string is constructed. In the approach considering any-term, a total of

2,521,720 bibliographies are indexed, whereas the all-term approach indexes 1,423,199, or 56.4% of that total. This total is reduced further when the all-term approach ignores stemming, yielding 873,481 indexed bibliographies.


Compared to the DBLP dataset's maximum total of 2,587,888 records, the any-term approach achieves the highest coverage at 97.4%, while coverage using the all-term approach with and without stemming is 54.9% and 33.7%, respectively. Therefore, indexing using MongoDB text search functionality is bounded between 33.7% and 97.4% in potential coverage.

While maximizing coverage is desired, precision plays a factor in ultimately deciding the approach used in DBLPminer. Loosely indexing bibliographies using any term yields higher coverage, but reduces precision due to the variety of accepted term combinations. Conversely, indexing publications to topic names nearly verbatim yields better precision, but minimizes leeway in matching similar publications. The all-term approach with stemming is used for DBLPminer as it strikes a balance between these two drastically different approaches. The remainder of this section focuses on analysis pertaining to the all-term approach with stemming.

Indexing coverage is also analyzed with respect to topic name lengths shown in Table 15.

The computed results are presented in Table 17. Total bibliography indices appear to decrease as topic name lengths increase: 1,390,481 with unigrams, 1,356,199 with only bigrams, and

1,331,023 with topic names containing two or more terms (n-grams). Out of the total bibliographies indexed, 97.7% are indexed under unigram topic categories, while 93.5% are indexed under n-gram topic categories. Bibliographies indexed by unigram topic names may not be precise, since unigram topics make up only 10.5% of the total topics per Table 15 while contributing to 97.7% of the total indexed publications. Topic trees with high indexing contributions from unigram topics, such as applied computing, networks, and theory of computation, result from stemmed unigram terms that span multiple topic categories, such as computer and network.


Table 17 – Bibliographies Indexed by N-grams

Topic Trees                       All         Unigrams    Bigrams     N-grams     Unigram ∩ N-gram
Applied computing                 981,572     803,895     666,670     670,587     492,910
Computer systems organization     350,686     169,304     174,674     350,207     168,825
Computing methodologies           613,136     106,329     595,344     613,136     106,329
Hardware                          239,099     175,502     198,131     209,316     145,719
Human-centered computing          238,238     64,092      234,831     238,982     64,078
Information systems               478,553     238,238     454,849     468,243     227,928
Mathematics of computing          295,108     163,006     275,324     276,270     144,168
Networks                          244,637     244,637     215,321     218,186     205,260
Security and privacy              58,692      15,853      54,327      58,198      15,359
Social and professional topics    646,521     82,033      624,514     630,767     66,279
Software and its engineering      570,197     416,731     570,197     570,197     416,731
Theory of computation             1,096,748   1,084,269   753,816     758,301     745,822
Totals:                           1,423,199   1,390,481   1,356,199   1,331,023   1,423,199

Relevance is measured using precision and recall. Assuming bibliography relevance is determined by matching all topic terms minus stopwords, recall is always 100% since the indexing process performs a search for relevant publications according to topic terms. Precision is not as easily computed as the commonality, distinctiveness, and combination of terms vary among topics. Indexing performance and sample results on three topic categories containing a single term are shown in Table 18. In the case where a stemmed topic term is extremely distinctive and uncommon such as psycholog from the topic Psychology, precision is high at about 99%. Although more common with 12 occurrences, the term electron from the topic

Electronics remains fairly distinctive, yielding a precision of 99%. In the case where a stemmed term is neither as distinctive nor as uncommon, such as visual from the topic Visualization, precision appears to suffer much more.

Table 18 – Single Topic Term Indexing Performance

Topic: Psychology
Terms (frequency): psycholog (1)
Retrieved Count: 1,201
Precision: ~99%
First 10 Query Results:
1. Identifying Psychological Theme Words from Emotion Annotated Interviews.
2. Attention Metaphors: How Metaphors Guide the Cognitive Psychology of Attention.
3. Human Psychology of Common Appraisal: The Reddit Score.
4. Issues in the Design of Workstations for Psychology Experimentation.
5. Discrimination Nets as Psychological Models.

Topic: Electronics
Terms: electron (12)
Retrieved Count: 12,831
Precision: ~99%
First 10 Query Results:
1. Electron-beam position monitoring and feedback control in Duke Free-Electron Laser Facility.
2. Exploiting time in electronic health record correlations.
3. Changes to the electronic health records market in light of health information technology certification and meaningful use.
4. Technology and Electronic Communications Act 2000.
5. A Review of Aeronautical Electronics and Its Parallelism With Automotive Electronics.

Topic: Visualization
Terms: visual (19)
Retrieved Count: 34,966
Precision: ~70%
First 10 Query Results:
1. Virtual instrument for measurement, processing data, and visualization of vibration patterns of piezoelectric devices.
2. GaDeVi - Game Development Integrating Tracking and Visualization Devices into Virtools.
3. Motion Visualization of Ultrasound Imaging.
4. Fast Cost-Volume Filtering for Visual Correspondence and Beyond.
5. Virtually Visual: The effects of visual technologies on online identification.

Out of the 34,966 Visualization-associated bibliographies, 14,393 bibliographies contain the terms visualize or visualization. These bibliographies are correctly indexed. Although matching the stemmed term visual, the remaining 20,573 bibliographies cover a variety of areas ranging from visualization, to computation with visual data, to visual languages, etc.

Roughly half of these remaining bibliographies match the concept of visualization; therefore, precision for the topic Visualization comes to about 70%. Precision suffers for this topic because the stemmed form of visualization, visual, occurs more often and is not as distinctive.


Table 19 – Bigram Topic Term Indexing Performance

Topic: Concurrent algorithms
Terms: concurr (8), algorithm (64)
Retrieved Count: 373
Precision: ~80%
First 5 Query Results:
1. A Slicing Algorithm of Concurrency Modeling Based on Petri Nets.
2. The concurrency hierarchy, and algorithms for unbounded concurrency.
3. An Efficient Algorithm-Based Concurrent Error Detection for FFT Networks.
4. Concurrent Algorithmic Debugging.
5. LR-algorithm: concurrent operations on priority queues.

Topic: Machine learning
Terms: machin (13), learn (58)
Retrieved Count: 5,147
Precision: ~99%
First 5 Query Results:
1. Machine learning techniques for scheduling jobs with incompatible families and unequal ready times on parallel batch machines.
2. Algorithms of Machine Learning for K-Clustering.
3. Extracting NPC behavior from computer games using computer vision and machine learning techniques.
4. Machine learning with Lipschitz classifiers.
5. A Machine-Learning Framework for Hybrid Machine Translation.

Topic: Network architectures
Terms: network (113), architectur (42)
Retrieved Count: 4,534
Precision: ~98%
First 5 Query Results:
1. The Architecture of NG-MON: A Passive Network Monitoring System for High-Speed IP Networks.
2. A Novel Architecture for Hierarchically Nested Network Mobility.
3. An Independent Function-Parallel Firewall Architecture for High-Speed Networks (Short Paper).
4. Network Architecture for Scalable Ad Hoc Networks.
5. An Overlay Network Architecture for Data Placement Strategies in a P2P Streaming Network.

Indexing performances and sample results for three topic categories containing two terms are shown in Table 19. The uncommon term concurr and common term algorithm from the topic

Concurrent algorithms lead to a precision score of about 80%. Retrieved bibliographies from this topic are least precise when the two terms appear non-sequentially within the bibliographic title, a strong indication that the order and adjacency of terms affect relevance to topic meaning.

Machine learning contains stemmed terms that are fairly common and mildly common in learn and machin, respectively. The union of the two terms produces a distinctive set of terms that describe a specific topic area, leading to a high precision score of 99%. Like the topic Machine learning, even though both terms are fairly common for the topic Network architectures, precision is a high 98% because the union of the two terms presents a distinct topic area for classification.

Presented in Table 20 are indexing performance and sample results for three topic categories containing three or more terms. Following the trend observed in the two-term topic categories, combining three terms of varying commonality still yields fairly high precision scores. The topic Bayesian nonparametric models has a precision score of 95%, while Power grid design and Computer science education have scores of 89% and 93% respectively. Also as with the two-term topic categories, precision is hindered by cases where the terms appear non-sequentially, which can indicate a topic that differs from the intended topic category. The bibliographic title "A concept map-embedded educational computer game for improving students' learning performance in natural science courses", retrieved for the topic Computer science education, is an example of an inaccuracy caused by non-sequential matching terms.


Table 20 – Trigram+ Topic Term Indexing Performance

Topic: Bayesian nonparametric models
    Terms (frequency): bayesian (5), nonparametr (5), model (77)
    Retrieved Count: 82
    Precision: ~95%
    First 5 Query Results:
        1. Gaussian Beam Processes: A Nonparametric Bayesian Measurement Model for Range Finders.
        2. A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views.
        3. A sparse nonparametric hierarchical Bayesian approach towards inductive transfer for preference modeling.
        4. Dynamic Bayesian Network and Nonparametric Regression for Nonlinear Modeling of Gene Networks from Time Series Gene Expression Data.
        5. Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis.

Topic: Power grid design
    Terms (frequency): power (14), grid (4), design (57)
    Retrieved Count: 37
    Precision: ~89%
    First 5 Query Results:
        1. Power-Aware Designers at Odds with Power Grid Designers?
        2. Control Design and Implementation for High Performance Shunt Active Filters in Aircraft Power Grids.
        3. Design and Realization of Clustering Based Power Grid SCADA System.
        4. Hierarchical power supply noise evaluation for early power grid design prediction.
        5. A fast algorithm for power grid design.

Topic: Computer science education
    Terms (frequency): comput (109), scienc (10), educ (13)
    Retrieved Count: 521
    Precision: ~93%
    First 5 Query Results:
        1. Computer Science and Computer Science Education.
        2. Proceedings of the 2010 International Conference on Frontiers in Education: Computer Science & Computer Engineering, FECS 2010, July 12-15, 2010, Las Vegas, Nevada, USA.
        3. Proceedings of the 2007 International Conference on Frontiers in Education: Computer Science & Computer Engineering, FECS 2007, June 25-28, 2007, Las Vegas, Nevada, USA.
        4. Reconfigurable Computing Education in Computer Science.
        5. Issues in undergraduate education in computational science and high performance computing.


Chapter 5

CONCLUSION AND FUTURE WORK

Bibliographic information from sources such as DBLP can be valuable for exploring publications in academia. Unfortunately, DBLP does not support broad searches by topic area.

DBLPminer is a tool developed to address this limitation of DBLP. DBLPminer consists of modules for publication and topic category management, web services for content interaction, and a GUI web application that interfaces with the web services. At the application's foundation is the categorization of publication records by topic categories. The indexing implementation matches bibliography titles against the terms of each topic name, after the terms have been stemmed and filtered for stopwords.
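As a recap of that indexing step, a minimal sketch of how a topic name becomes a title search, mirroring the logic in indexing.py (Appendix B); the stopword list shown is only an illustrative subset of the project's ./data/stopwords file:

import Stemmer   #PyStemmer

stemmer = Stemmer.Stemmer('english')
stopwords = ['and', 'of', 'the', 'for']   #illustrative subset of the stopword list

def topic_to_search_text(topic_name):
    words = topic_name.lower().split()                                    #tokenize the topic name
    terms = [stemmer.stemWord(w) for w in words if w not in stopwords]    #drop stopwords, then stem
    return ' '.join(u'"{0}"'.format(t) for t in terms)                    #quote each term for MongoDB $text

#e.g. topic_to_search_text('Concurrent algorithms') returns '"concurr" "algorithm"'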

The performance of the algorithm for indexing bibliographic records by topic is analyzed. Performance is measured by the total number of bibliographies indexed (coverage) and by the precision of matching bibliographies to topics. With the aforementioned implementation, coverage is maximized while keeping accuracy in consideration. Precision varies from topic to topic depending on the commonality, distinctiveness, and combination of topic terms. Overall, precision remains high across the sample topics analyzed.
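For reference, precision here is the standard measure

    precision = |relevant retrieved bibliographies| / |retrieved bibliographies|

so, for the Visualization example discussed earlier, precision is roughly (14,393 + ~10,300) / 34,966 ≈ 0.70, where the ~10,300 relevant bibliographies among the remaining visual matches is an estimate ("roughly half" of 20,573).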

5.1 Future Work

Because this project serves as a foundation, several avenues can be taken to improve the overall application tool. The content curation, web server, and web application components can all benefit from modifications and additions. The indexing algorithm in the content curation module can be improved for better performance; for example, further analysis of the taxonomy dataset could determine which terms are most valuable for categorization, and the indexing algorithm could be adjusted accordingly.


Improvements can also be made to DBLPminer's interfaces. The web services implemented on the web server can be made more robust and secure, as the current implementation is rudimentary. Additional web service endpoints can be added to support retrieval of statistical information. The web application can be improved with features such as an interface for interacting with statistical data or a search form for more complicated queries. The web application can also be improved in how it manages state information between pages.


Appendix A. Software Dependencies

Component                   Software / Module   Version   Source
Content Curation (Python)   Python              2.7.8     https://www.python.org/downloads/
                            PyMongo             2.7.2     http://api.mongodb.org/python/current/installation.html
                            PyStemmer           1.3.0     https://github.com/snowballstem/pystemmer
                            lxml                3.3.5     http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
Web Server (JavaScript)     Node.js             0.10.31   http://nodejs.org/download/
                            Express.js          4.9.0     http://expressjs.com/starter/installing.html
                            Mongo.js            0.14.1    https://github.com/mafintosh/mongojs
Web Application             Polymer             0.4.0     https://www.polymer-project.org/docs/start/getting-the-code.html
(HTML, CSS, JavaScript)
Database                    MongoDB             2.6.4     http://www.mongodb.org/downloads
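For convenience, hypothetical commands for installing the Python and JavaScript dependencies listed above; package names and versions follow the table, but the exact commands are not part of the original setup notes:

pip install pymongo==2.7.2 PyStemmer==1.3.0 lxml==3.3.5   #content curation dependencies
npm install express@4.9.0 mongojs@0.14.1                  #web server dependencies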


Appendix B. Source Code

dblpminer.py

import argparse, atexit, json, logging.config
from lib import commands, util

''' initialize logging settings from configuration file '''
def init_logging():
    with open('./config/lib/logging.json') as data:
        config = json.load(data)
    if config.get('handlers'):
        filenames = (h['filename'] for h in config['handlers'].values() if h.get('filename'))
        [util.makedirs(filename) for filename in filenames]
    logging.config.dictConfig(config)

''' application entry point '''
def main():
    #set up logging
    init_logging()                                          #initialize logging
    logger = logging.getLogger('miner')                     #log under "miner" group
    logger.info('Application started.')
    exit_hook = lambda s: logger.info(s)
    atexit.register(exit_hook, 'Application terminated.')   #run logging exit hook at termination

    #create command line argument parser
    prog_desc = 'DBLPminer content curation tool.'
    argparser = argparse.ArgumentParser(prog='dblpminer', description=prog_desc)
    subparser = argparser.add_subparsers(title='Commands')

    #commands: data migration
    tooltip = "migration commands"
    migration_subparser = subparser.add_parser('migrate', help=tooltip)
    cmd = commands.Migration()
    migration_subparser.add_argument('data', nargs=1, choices=cmd.options, help=cmd.help)
    migration_subparser.set_defaults(func=cmd.router)

    #commands: bibliography indexing
    tooltip = 'indexing commands'
    index_subparser = subparser.add_parser('index', help=tooltip)
    cmd = commands.Index()
    index_subparser.add_argument('data', nargs=1, choices=cmd.options, help=cmd.help)
    index_subparser.set_defaults(func=cmd.router)

    #commands: analysis
    tooltip = 'content analysis'
    analysis_subparser = subparser.add_parser('analysis', help=tooltip)
    cmd = commands.Analysis()
    analysis_subparser.add_argument('data', nargs=1, choices=cmd.options, help=cmd.help)
    analysis_subparser.set_defaults(func=cmd.router)

    #parse arguments & call corresponding function
    args = argparser.parse_args()
    args.func(args)

if __name__ == '__main__':
    main()
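For illustration, hypothetical invocations of the content curation tool, assuming the configuration files under ./config are in place; the available subcommands and choices follow the argparse setup above and the command classes in commands.py below:

python dblpminer.py migrate acmccs                     #migrate ACM CCS topics into MongoDB
python dblpminer.py migrate dblp                       #migrate DBLP bibliographies into MongoDB
python dblpminer.py index bibliographies               #index bibliographies by topic terms
python dblpminer.py analysis count_indices_per_tree    #run one of the analysis functions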

commands.py

import bibliographies, indexing, persistence, taxonomy, analysis
import inspect

class Router:
    def __init__(self):
        self.help = ""
        self.routes = {}
        self.options = self.routes.keys()

    def router(self, args):
        return self.routes[args.data[0]]()

class Migration(Router):
    def __init__(self):
        self.help = 'content to migrate.'
        self.routes = {
            'acmccs': self.migrate_acmccs,
            'dblp': self.migrate_dblp
        }
        self.options = self.routes.keys()

    def migrate_acmccs(self):
        with persistence.MongoDB() as mongoDB:
            taxonomy.ACMCCS(mongoDB).migrate_topics()

    def migrate_dblp(self):
        with persistence.MongoDB() as mongoDB:
            bibliographies.DBLP(mongoDB).migrate_bibliographies()

class Index(Router):
    def __init__(self):
        self.help = 'index content.'
        self.routes = { 'bibliographies': self.index }
        self.options = self.routes.keys()

    def index(self):
        with persistence.MongoDB() as mongoDB:
            indexing.TopicIndexer(mongoDB).index_bibliographies()

class Analysis(Router):
    def __init__(self):
        self.help = 'Analyze content'
        self.options = [name for (name, value) in inspect.getmembers(analysis, inspect.isfunction)]

    def router(self, args):
        getattr(analysis, args.data[0])()

taxonomy.py

import json, logging
from lxml import etree
import persistence, util

''' ACM Computing Classification System topic data management '''
class ACMCCS:
    def __init__(self, db_client):
        self.logger = logging.getLogger('taxonomy')     #set logging category
        self.collection = db_client.db.acmccs           #set persistence collection object
        with open('./config/lib/ccs.json') as data:     #open configuration file and
            self.config = json.load(data)               #load configuration data

    ''' extract topic information from ACM CCS xml file '''
    def extract_topics_from_xml(self):
        ns = lambda prefix, name: '{%s}%s' % (prefix, name)
        _id = lambda el, name: int(el.get(ns(rdf, name))[1:])

        #iterate through {*}Concept element blocks
        for _, topic in etree.iterparse(self.config['input'], tag='{*}Concept'):
            rdf, skos = topic.nsmap['rdf'], topic.nsmap['skos']   #get mappings to ns prefixes

            #create structure to store topic data
            data = { '_id': _id(topic, 'about'), 'aliases': [], 'broader': [], 'narrower': [] }

            #extract information from topic element's children elements
            for child in topic:
                if child.tag == ns(skos, 'prefLabel'):        #topic name
                    data['name'] = unicode(child.text)
                elif child.tag == ns(skos, 'altLabel'):       #topic alternative names
                    data['aliases'] += [unicode(child.text)]
                elif child.tag == ns(skos, 'broader'):        #topic parent ids
                    data['broader'] += [_id(child, 'resource')]
                elif child.tag == ns(skos, 'narrower'):       #topic children ids
                    data['narrower'] += [_id(child, 'resource')]

            yield data
            topic.clear()   #mark element for garbage collection

    ''' update taxonomy path information to topic data (used during data migration) '''
    def __update_taxonomy_paths(self, topics, path=[], topic=None):
        if topic:
            #insert/update taxonomy path info to topic data
            path_str = ';'.join(unicode(_id) for _id in path)
            topic['path'] = topic['path'] + [path_str] if topic.get('path') else [path_str]

            #recursively update subtopics
            subtopics = (t for t in topics if t['_id'] in topic['narrower'])
            for subtopic in subtopics:
                self.__update_taxonomy_paths(topics, path=path + [topic['_id']], topic=subtopic)
        else:
            #start from root topics
            root_topics = [t for t in topics if t['name'] in self.config['topics']]
            for root_topic in root_topics:
                self.__update_taxonomy_paths(topics, topic=root_topic)

    ''' write topic labels to output file '''
    def create_output_file(self, print_aliases=False):
        print_format = lambda pad, label: ('%s- %s\n' % (pad, label)).encode('UTF-8')

        with open(util.makedirs(self.config['output']), 'w') as f:
            for depth, topic in self.iterate_trees():   #traverse topic trees
                f.write(print_format('\t' * depth, topic['name']))
                #also write topic label aliases to file if print_aliases is set to TRUE
                for alias in (alias for alias in topic['aliases'] if print_aliases):
                    f.write(print_format('\t' * depth, alias))

    ''' migrate topic information from xml file to database & output txt file '''
    def migrate_topics(self):
        self.logger.info('Migrating ACM CCS topics to database.')
        topics = [t for t in self.extract_topics_from_xml()]    #extract topic data from xml

        #update taxonomy path info per topic and prune topics according to path relevance
        self.__update_taxonomy_paths(topics)
        filtered_topics = [t for t in topics if t.get('path')]

        [self.save_topic(topic) for topic in filtered_topics]   #persist topics to database
        self.create_output_file()                               #write topic labels to output

        #log results
        selected, total = len(filtered_topics), len(topics)
        self.logger.info('%s/%s ACM CCS topics migrated to database.' % (selected, total))

    ''' persist topic data to database '''
    def save_topic(self, topic):
        fields, weights = [('name', 'text'), ('aliases', 'text')], {'name': 2, 'aliases': 1}
        self.collection.ensure_index(fields, weights=weights, name='fts')
        self.collection.ensure_index('path', name='path')
        self.collection.ensure_index('broader', name='broader')
        self.collection.ensure_index('narrower', name='narrower')
        return self.collection.find_and_modify(
            query = {'_id': topic['_id']},
            update = {'$setOnInsert': topic},
            upsert = True
        )

    ''' retrieve topic information from database '''
    def find_topic(self, criteria={}, projection={}):
        docs = self.collection
        return docs.find_one(criteria, projection) if projection else docs.find_one(criteria)

    def find_topics(self, criteria={}, projection={}, sort={}, limit=0):
        docs = self.collection
        cursor = docs.find(criteria, projection) if projection else docs.find(criteria)
        return cursor.sort(sort.items()).limit(limit) if sort else cursor.limit(limit)

    ''' iterate topic trees (depth-first: pre-order & in-order modes) '''
    def iterate_topics(self, topic=None, path=[], mode='pre_order'):
        if not topic:
            #start from root topics
            root_topics = [topic for topic in self.find_topics(criteria={'path': [u'']})]

            #create generator for iterating root subtopics recursively
            for root_topic in root_topics:
                for _, topic in self.iterate_topics(topic=root_topic, mode=mode):
                    yield path, topic
        else:
            if mode == 'pre_order':       #yield path & topic information in pre-order traversal
                yield (path, topic)
            path.append(topic['name'])    #update tree path for children nodes

            #create generator for iterating subtopics recursively
            criteria = {'_id': {'$in': topic['narrower']}}
            subtopics = [subtopic for subtopic in self.find_topics(criteria)]

            for subtopic in subtopics:
                for _, t in self.iterate_topics(topic=subtopic, path=path, mode=mode):
                    yield path, t

            path.pop()                    #update path information for ancestor node
            if mode == 'in_order':        #yield path & topic information in in-order traversal
                yield (path, topic)

    def iterate_tree(self, node_name, topic=None, depth=0):
        if not topic:
            topic = self.collection.find_one({'name': node_name})
            if topic:
                for d, t in self.iterate_tree(node_name, topic=topic):
                    yield d, t
        else:
            yield depth, topic
            depth += 1
            criteria = {'_id': {'$in': topic['narrower']}}
            subtopics = [subtopic for subtopic in self.find_topics(criteria)]

            for subtopic in subtopics:
                for d, t in self.iterate_tree(node_name, topic=subtopic, depth=depth):
                    yield d, t

    def iterate_trees(self, topic=None):
        root_topics = [t for t in self.find_topics(criteria={'path': [u'']})]
        for root_topic in root_topics:
            name = root_topic['name']
            for d, t in self.iterate_tree(name, topic=root_topic):
                yield d, t

bibliographies.py

import json, logging, re
from lxml import etree
import persistence, util

''' DBLP data management '''
class DBLP:
    def __init__(self, db_client):
        self.logger = logging.getLogger('dblp')            #set logging category
        self.collection = db_client.db.bibliographies      #set persistence collection object
        with open('./config/lib/dblp.json') as data:       #open configuration file and
            self.config = json.load(data)                  #load configuration data

    ''' format title element into html/plain text (support for extract_bibliographies_from_xml) '''
    def __format_title(self, element):
        tostring = lambda el, m: etree.tostring(el, encoding='unicode', method=m).strip()
        text, html = tostring(element, 'text'), tostring(element, 'html')
        html = re.sub('^', '', re.sub('\$', '', html))
        return { 'html': html, 'text': text }

    ''' extract bibliography information from DBLP xml file '''
    def extract_bibliographies_from_xml(self):
        source, venues = self.config['input'], self.config['venues']
        for _, element in etree.iterparse(source, load_dtd=True, tag=venues):
            dblp, authors, extra, title = dict(element.items()), [], {}, None   #init data variables
            for metadata in element:
                if metadata.tag == 'author':                      #extract author metadata
                    authors += [metadata.text]
                elif metadata.tag == 'title':                     #extract title metadata
                    title = self.__format_title(metadata)
                elif metadata.tag in self.config['metadata']:     #extract other metadata
                    extra[metadata.tag] = metadata.text

            #yield bibliography contents if a title exists
            if title:
                yield { 'venue': element.tag, 'dblp': dblp, 'title': title, 'authors': authors, 'extra': extra }
            element.clear()   #mark element for garbage collection

    ''' migrate bibliography information from xml file to database & output txt file '''
    def migrate_bibliographies(self):
        self.logger.info('Migrating DBLP bibliographies to database.')
        with open(util.makedirs(self.config['output']), 'w') as f:
            for count, bibliography in enumerate(self.extract_bibliographies_from_xml()):
                self.save_bibliography(bibliography)
                f.write((u'%s\n' % bibliography['title']['text']).encode('UTF-8'))

                if (count + 1) % 100000 == 0:
                    self.logger.debug('%8d DBLP bibliographies migrated to database.' % (count + 1))
            self.logger.info('%9d DBLP bibliographies migrated to database.' % (count + 1))

    ''' persist bibliography data to database '''
    def save_bibliography(self, bibliography):
        fields, weights = [('title.text', 'text'), ('tags', 'text')], {'title.text': 10, 'tags': 1}
        self.collection.ensure_index(fields, weights=weights, name='fts')
        self.collection.ensure_index('dblp.key', unique=True, name='dblp.key')
        self.collection.ensure_index('dblp.mdate', name='dblp.mdate')
        self.collection.ensure_index('authors', name='authors')
        return self.collection.find_and_modify(
            query = {'dblp.key': bibliography['dblp']['key']},
            update = {'$setOnInsert': bibliography},
            upsert = True
        )

    ''' update bibliography tag data in database '''
    def update_bibliography_tags(self, _id, tags):
        return self.collection.find_and_modify(
            query = {'_id': _id},
            update = {'$set': {'tags': tags}}
        )

    ''' clears tag field from all documents in database '''
    def clear_bibliograhies_tags(self):
        return self.collection.update(
            {'tags': {'$exists': True}},
            {'$unset': {'tags': True}},
            multi=True
        )

    ''' retrieve bibliography information from database '''
    def find_bibliographies(self, text=None, criteria={}, projection={}, sort={}, limit=0):
        criteria = {'$text': {'$search': text}} if text else criteria
        docs = self.collection
        cursor = docs.find(criteria, projection) if projection else docs.find(criteria)
        return cursor.sort(sort.items()).limit(limit) if sort else cursor.limit(limit)

    def count_matching_bibliographies(self, text):
        return self.collection.find({'$text': {'$search': text}}, {}).count()

indexing.py

import logging
import Stemmer
import bibliographies, taxonomy

class TopicIndexer:
    def __init__(self, db_client):
        self.logger = logging.getLogger('indexer')
        self.ccs = taxonomy.ACMCCS(db_client)
        self.dblp = bibliographies.DBLP(db_client)

        with open('./data/stopwords', 'r') as f:
            self.stopwords = [line.strip() for line in f]
        self.stemmer = Stemmer.Stemmer('english')

    def index_bibliographies(self):
        stem = lambda word: self.stemmer.stemWord(word)                    #stems word
        quote = lambda text: u'"{0}"'.format(text)                         #wraps word in ""
        concat = lambda words: unicode(" ".join(unicode(word) for word in words))
        union = lambda list_a, list_b: list(set(list_a) | set(list_b))     #union two lists

        self.logger.info('Indexing bibliographies:')
        indexed_set = set()
        topic_trees, current_tree = [], None

        for path, topic in self.ccs.iterate_topics():
            topic_name = topic['name'].lower()
            search_terms = [stem(word) for word in topic_name.split() if word not in self.stopwords]
            search_text = concat(quote(term) for term in search_terms)
            topic_indices = path + [topic_name]
            matches = set()

            for doc in self.dblp.find_bibliographies(text=search_text, projection={}):
                indexed_set.add(doc['_id'])
                tags = union(doc['tags'], topic_indices) if doc.get('tags') else topic_indices
                matches.add(doc['_id'])
                self.dblp.update_bibliography_tags(doc['_id'], tags)

            if len(path) == 0:
                self.logger.debug('%7d bibliographies indexed.' % len(indexed_set))
                current_tree = { 'name': topic_name, 'matches': set(matches) }
                topic_trees.append(current_tree)
            else:
                current_tree['matches'] |= set(matches)

        self.logger.info('%8d bibliographies indexed in total.' % len(indexed_set))

        self.logger.info('Counting relevant bibliographies:')
        aggregate_set = set()
        for tree in sorted(topic_trees, key=lambda topic_tree: topic_tree['name']):
            aggregate_set |= tree['matches']
            self.logger.info('%5d %s' % (len(tree['matches']), tree['name']))
        self.logger.info('%5d Total (All topics)' % len(aggregate_set))

analysis.py

import json, logging, collections
import Stemmer
import bibliographies, indexing, taxonomy, persistence, util

logger = logging.getLogger('analysis')

class Tool:
    def __init__(self):
        #load configuration
        with open('./config/lib/analysis.json') as data:
            self.config = json.load(data)

        #access MongoDB collections
        with persistence.MongoDB() as db:
            self.ccs = taxonomy.ACMCCS(db)
            self.dblp = bibliographies.DBLP(db)

        #load stopwords / stemmer
        with open('./data/stopwords', 'r') as f:
            self.stopwords = [line.strip() for line in f]
        self.stemmer = Stemmer.Stemmer('english')

        #set up munger
        stem = lambda words: self.stemmer.stemWords(words)
        rm_stop = lambda words: [w for w in words if w not in self.stopwords]
        self.munge = {
            'none': lambda l: l,
            'stem': lambda l: stem(l),
            'removeStopwords': lambda l: rm_stop(l),
            'removeStopwordsAndStem': lambda l: stem(rm_stop(l))
        }[self.config['munging_mode']]

    def analyze_topics(self):
        trees, root = [], None
        for depth, topic in self.ccs.iterate_trees():
            name = topic['name']
            terms = self.munge(name.lower().split())
            size = len(terms)
            if depth == 0:
                root = {'name': name, 'terms': [], 'ngrams': collections.defaultdict(int)}
                trees.append(root)
            root['terms'] += [terms]
            root['ngrams']['all'] += 1
            root['ngrams'][str(size)] += 1
            if size > 1:
                root['ngrams']['2+'] += 1
        return sorted(trees, key=lambda tree: tree['name'])

    def analyze_terms(self):
        all_counter = collections.defaultdict(int)
        trees, root = [], None
        for depth, topic in self.ccs.iterate_trees():
            name = topic['name']
            terms = self.munge(name.lower().split())
            if depth == 0:
                root = {'name': name, 'topics': [], 'terms': collections.defaultdict(int)}
                trees.append(root)
            for term in terms:
                all_counter[term] += 1
                root['terms'][term] += 1
            root['topics'] += [{'name': name, 'terms': terms}]

        folder = './data/output/terms'
        with open(util.makedirs('%s/terms.txt' % folder), 'w') as f:
            terms = sorted(all_counter.iteritems(), key=lambda t: (-t[1], t[0]))
            for term, count in terms:
                f.write(('%4d %s\n' % (count, term)).encode('UTF-8'))

        for tree in trees:
            with open(util.makedirs('%s/%s.txt' % (folder, tree['name'])), 'w') as f:
                terms = sorted(tree['terms'].iteritems(), key=lambda t: (-t[1], t[0]))
                for term, count in terms:
                    f.write(('%4d %s\n' % (count, term)).encode('UTF-8'))
            with open(util.makedirs('%s/tree/%s.txt' % (folder, tree['name'])), 'w') as f:
                count = lambda term: tree['terms'][term]
                topics = sorted(tree['topics'], key=lambda k: (max(map(count, k['terms'])), k['name']))
                for topic in topics:
                    f.write(('%s\n' % topic['name']).encode('UTF-8'))
                    for term in sorted(topic['terms'], key=lambda t: -count(t)):
                        f.write(('%4d: %s\n' % (count(term), term)).encode('UTF-8'))
                    f.write('\n'.encode('UTF-8'))

    def analyze_indexing(self):
        quote = lambda text: u'"{0}"'.format(text)
        concat = lambda words: unicode(" ".join(unicode(word) for word in words))
        for depth, topic in self.ccs.iterate_trees():
            name = topic['name']
            terms = self.munge(name.lower().split())
            size = len(terms)
            search_text = concat(quote(term) for term in terms)
            cursor = self.dblp.find_bibliographies(text=search_text, projection={})
            docs = set(b['_id'] for b in cursor)
            yield depth, name, size, docs

    def analyze_indicies(self):
        logger.info('Listing relevant bibliographies per topic:')
        quote = lambda text: u'"{0}"'.format(text)
        concat = lambda words: unicode(" ".join(unicode(word) for word in words))
        for name in self.config['topics']:
            terms = self.munge(name.lower().split())
            size = len(terms)
            search_text = concat(quote(term) for term in terms)
            cursor = self.dblp.find_bibliographies(text=search_text, projection={'title.text': 1})
            with open('./data/output/indices/%s.txt' % name, 'w') as f:
                for c, b in enumerate(cursor):
                    f.write((u'%s\n' % b['title']['text']).encode('UTF-8'))
            logger.info('%7d: %s\n' % (c + 1, name))

def count_topic_terms():
    logger.info('Counting distinct words per topic tree:')
    total_terms = set()
    for tree in Tool().analyze_topics():
        distinct_terms = set(term for terms in tree['terms'] for term in terms)
        total_terms |= distinct_terms
        logger.info('%5d %s' % (len(distinct_terms), tree['name']))
    logger.info('%5d Total (All topics)' % len(total_terms))

def count_topic_term_frequency():
    logger.info('Counting term frequency:')
    Tool().analyze_terms()

def count_topic_ngrams():
    logger.info('Counting ngrams per topic tree:')
    logger.info(' %5s %5s %5s %5s %5s %s' % ('uni', 'bi', 'bi+', 'all', 'inter', 'TREE:'))
    group = lambda d, name: (d['1'], d['2'], d['2+'], d['all'], d['inter'], name)
    counter_totals = collections.defaultdict(int)
    for tree in Tool().analyze_topics():
        counter = tree['ngrams']
        ngrams = [terms for terms in tree['terms'] if len(terms) > 1]
        unigrams = set(terms[0] for terms in tree['terms'] if len(terms) == 1)
        intersection = [n for n in ngrams if any(u in unigrams for u in n)]

        folder = './data/output/terms/intersections'
        with open(util.makedirs('%s/%s.txt' % (folder, tree['name'])), 'w') as f:
            for term in intersection:
                f.write((u'%s\n' % term).encode('UTF-8'))

        counter['inter'] = len(intersection)
        counter_totals['inter'] += len(intersection)
        for key in ['1', '2', '2+', 'all']:
            counter_totals[key] += counter[key]
        logger.info(' %5d %5d %5d %5d %5d %s' % group(counter, tree['name']))
    logger.info(' %5d %5d %5d %5d %5d %s' % group(counter_totals, 'Total (All topics)'))

def count_indices_per_tree():
    logger.info('Counting bibliography indices per topic tree:')
    trees, root = [], None
    for depth, name, size, docs in Tool().analyze_indexing():
        if depth == 0:
            root = {'name': name, 'ngrams': collections.defaultdict(set)}
            trees.append(root)
        root['ngrams']['all'] |= docs
        root['ngrams'][str(size)] |= docs
        if size > 1:
            root['ngrams']['2+'] |= docs

    totals = collections.defaultdict(set)
    set_names = {'1': 'unigrams', '2': 'bigrams', '2+': 'ngrams', 'all': 'all', 'inter': 'intersection'}
    for tree in sorted(trees, key=lambda d: d['name']):
        for key in ['1', '2', '2+', 'all']:
            totals[key] |= tree['ngrams'][key]
        tree['ngrams']['inter'] = tree['ngrams']['1'] & tree['ngrams']['2+']
        totals['inter'] |= tree['ngrams']['inter']
        logger.info(' %s' % tree['name'])
        for key in set_names.keys():
            logger.info(' %7d %s' % (len(tree['ngrams'][key]), set_names[key]))
    logger.info(' Totals (All topics)')
    for key in set_names.keys():
        logger.info(' %7d %s' % (len(totals[key]), set_names[key]))

def count_indices_per_topic():
    logger.info('Counting bibliography indices per topic:')
    trees, root = [], None
    for depth, name, size, docs in Tool().analyze_indexing():
        if depth == 0:
            root = {'name': name, 'topics': []}
            trees.append(root)
        root['topics'] += [(len(docs), size, name)]

    folder = './data/output/indices/counts'
    for tree in trees:
        logger.info('%s (%d topics):' % (tree['name'], len(tree['topics'])))
        with open(util.makedirs('%s/%s.txt' % (folder, tree['name'])), 'w') as f:
            tree['topics'].sort(key=lambda x: (x[1], -x[0]))
            for count, size, name in tree['topics']:
                f.write((u'%7d %2d %s\n' % (count, size, name)).encode('UTF-8'))

def output_indices_per_topic():
    Tool().analyze_indicies()

persistence.py

import json, pymongo

class MongoDB:
    def __init__(self):
        with open('./config/db/mongodb.json') as c:
            config = json.load(c)                                   #load data from config
        self.client = pymongo.MongoClient(config['host'], config['port'])
        self.db = self.client[config['db']]                         #database object

    '''context manager's (with keyword) entry hook'''
    def __enter__(self):
        return self

    '''context manager's (with keyword) exit hook'''
    def __exit__(self, type, value, traceback):
        self.client.close()


util.py

import os, time

''' create necessary directories for a given file path '''
def makedirs(filepath):
    if not os.path.exists(os.path.dirname(filepath)):
        os.makedirs(os.path.dirname(filepath))
    return filepath

server.js

/* DB connection */
var mongojs = require('mongojs')                           //load mongojs module
var config = require('../config/db/mongodb.json')         //load config file
var url = config['host']+':'+config['port']                //build connection url
        + '/'+config['db']
var db = mongojs.connect(url, config['collections'])      //create db client
process.on('exit', function(){db.close()})                 //close connection on exit

/* Express */
var express = require('express')                           //load express module
var router = express.Router()                              //get express routing object

/* Web Service Routes */
router.route('/search/topics').get(function(req, res, next) {
    search_text = req.query.text
    db.acmccs
      .find({'$text': {'$search': search_text}})
      .limit(30, function(err, docs) {
          if (err) res.send(err)
          res.send(docs)
      })
})

router.route('/search/bibliographies').get(function(req, res, next) {
    search_text = req.query.text
    db.bibliographies
      .find({'$text': {'$search': search_text}}, {'title.text':1})
      .limit(30, function(err, docs) {
          if (err) res.send(err)
          res.send(docs)
      })
})

router.route('/bibliographies').get(function(req, res, next) {
    db.bibliographies
      .find({}, {'title.text':1})
      .sort({'dblp.mdate':-1})
      .limit(30, function(err, docs) {
          if (err) res.send(err)
          res.send(docs)
      })
})

router.route('/bibliographies/:_id').get(function(req, res, next) {
    key = mongojs.ObjectId(req.params._id)
    db.bibliographies.findOne({'_id':key}, function(err, data) {
        if (err) res.send(err)
        res.send(data)
    })
})

router.route('/bibliographies/topic/:_id').get(function(req, res, next) {
    key = parseInt(req.params._id)
    db.acmccs.findOne({'_id':key}, {'name':1}, function(err, topic) {
        db.bibliographies
          .find({'tags':topic.name}, {'title.text':1,'tags':1})
          .limit(20, function(err, docs) {
              if (err) res.send(err)
              res.send(docs)
          })
    })
})

router.route('/topics').get(function(req, res, next) {
    db.acmccs.find({'path':['']}, {'name':1, 'narrower':1}, function(err, topics) {
        if (err) res.send(err)
        res.send(topics)
    })
})

router.route('/topics/all').get(function(req, res, next) {
    db.acmccs.find({}, {'name':1}, function(err, topics) {
        if (err) res.send(err)
        res.send(topics)
    })
})

router.route('/topics/:_id').get(function(req, res, next) {
    key = parseInt(req.params._id)
    db.acmccs.findOne({'_id':key}, function(err, topic) {
        if (err) res.send(err)
        res.send(topic)
    })
})

router.route('/topics/:_id/subtopics').get(function(req, res, next) {
    key = parseInt(req.params._id)
    db.acmccs.find({'broader':key}, {'name':1, 'narrower':1}, function(err, topics) {
        if (err) res.send(err)
        res.send(topics)
    })
})

/* Create Web Server */
var expressApp = express()
expressApp                                                 //create routes for web service, assets, and index
    .use('/api', router)
    .use('/components', express.static(__dirname+'/components'))
    .use(express.static(__dirname+'/public'))

var server = expressApp.listen(21025, function() {
    console.log('Listening on port %d', server.address().port)
})
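For illustration, hypothetical requests against the web service once the server is listening on port 21025; the paths follow the routes defined above:

curl http://localhost:21025/api/topics                                            #root topics of the taxonomy
curl http://localhost:21025/api/topics/all                                        #names of all topics
curl "http://localhost:21025/api/search/topics?text=machine%20learning"           #full-text search over topic names
curl "http://localhost:21025/api/search/bibliographies?text=machine%20learning"   #full-text search over bibliography titles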

dblpminer-layout.html

dblpminer-layout.css

* { font-family: 'RobotoDraft', sans-serif; }                            /* Font */
html, body { height: 100%; width: 100%; overflow: hidden }
.main-logo { font-size: 32px; color: #014731; padding: 4px 0 0 16px; }   /* Logo */
/* Drawer */
#drawerPanel:not([narrow]) #menuButton { display: none; }
[drawer] { background-color: #eee; box-shadow: 1px 0 1px rgba(0, 0, 0, 0.1); }
core-menu#menu { padding: 16px 0; margin: 0; }
core-menu#menu paper-item { padding-left: 32px; font-size: 16px; font-weight: normal !important; height: 56px; color: #003E2B; }
core-menu#menu paper-item.core-selected { background-color: #dedede; }
/* App Bar */
#appBar { color: #fff; background-color: #003E2B; font-size: 20px; font-weight: 400; }
core-toolbar.medium-tall { height: 144px; }
core-toolbar span { max-width: 960px; }
/* Search Panel */
#searchBar { background-color: #eee; }
#searchType paper-button.core-selected { color: green; }
#searchPanel { background-color: #ddd; }
/* Content */
[main] { height: 100%; max-width: 960px; background-color: white; }
#drawerPanel[narrow] #views { position: absolute; top: 0; right: 0; bottom: 0; left: 0; overflow: auto; }

dblpminer-layout.js

Polymer('dblpminer-layout', {
    responsiveWidth: '860px',
    ready: function() {
        this.$.searchType.selected = 0
        this.page = location.hash.slice(1) || 'home'
        addEventListener('popstate', this.popstate.bind(this))
    },
    pageChanged: function() {
        if (this.poppedPage !== this.page) history.pushState(this.page, '', '#'+this.page)
    },
    popstate: function(event) {
        this.poppedPage = this.page = event.state
    },
    back: function() { history.back() },
    menuSelect: function(e, detail) {
        if (detail.isSelected) {
            this.pageTitle = detail.item.label
            if (this.narrow) { this.$.drawerPanel.closeDrawer() }
        }
    },
    togglePanel: function() { this.$.drawerPanel.togglePanel(); },
    tagSignal: function(e, detail, sender) {
        this.targetID = detail._id
        this.targetName = detail.name
        this.page = detail.tag
        this.pageTitle = detail.title
    },
    toggleSearch: function() {
        this.searching = !this.searching || false
        this.$.searchInput.inputValue = ""
    },
    searchTypeSelect: function(e, detail) {
        if (detail.isSelected) this.searchType = detail.item.getAttribute('tag')
    },
    searchKeypress: function(e) {
        ENTER_KEYCODE = 13
        if (e.keyCode == ENTER_KEYCODE) this.performSearch()
    },
    performSearch: function() {
        this.$.searchInput.commit()
        if (this.$.searchInput.value) {
            this.searchCollection = this.searchType
            this.searchText = this.$.searchInput.value
            this.fire('core-signal', {
                name: 'tag',
                data: { tag: 'results', name: this.searchText, title: 'Search Results' }
            })
        }
        this.searching = false
    }
})

dblpminer-view.html

dblpminer-view.css

* { font-family: 'RobotoDraft', sans-serif; }
::content section {
    padding: 8px 16px;
    margin: 24px;
    line-height: 32px;
    color: #555;
    font-size: 16px;
    background-color: #eee;
    max-width: 960px;
}
::content section h1 { font-weight: normal; margin: 16px 0px 0px 0px; color: green; }
::content section p { margin: 8px 8px; }

section-divider.html

topic-list.html

topic-list-item.html

topic-list-item.css paper-item:hover { color: #E2C776; } paper-icon-button:hover { color: green; }

bibliography-list.html

bibliography-list-item.html

bibliography-list-item.css paper-item:hover { color: #E2C776; }

home-view.html

home-view.css a { text-decoration: none; color: green; } .link { width: 16px; height: 16px; } paper-button { color: green; height: 40px; font-size: 16px; }

topics-view.html

bibliographies-view.html

stats.html


search-form.html

results-view.html

subtopics-view.html


topic-view.html

topic-view.css


#mainMeta { padding-left: 24px; }
#mainMeta td:first-child { font-weight: bold; }
#mainMeta td:last-child { text-align: right; }

topic-bibliographies-view.html

topic-bibliographies-view.css

paper-button {
    margin-top: 10px;
    margin-bottom: 8px;
    font-size: 14px;
    background: #003E2B;
    color: #E2C776;
    height: 40px;
}
paper-button:hover { background: green; color: yellow; }

bibliography-view.html


bibliography-view.css

table { color: #555; }
#mainMeta { padding-left: 24px; }
#mainMeta td:first-child { font-weight: bold; }
#dblpMeta { font-size: 12px; text-transform: uppercase; }
#dblpMeta td:first-child { color: green; }
.tag {
    color: gray;
    font-size: 12px;
    display: inline-block;
    background-color: #ddd;
    border-radius: 0.6em;
    padding-left: 8px;
    padding-right: 8px;
    margin-bottom: 8px;
}


index.html DBLPminer


Bibliography

[1] dblp: DBLP Bibliography. (2014). Retrieved September 28, 2014, from http://www.informatik.uni-trier.de/~ley/db/

[2] The 2012 ACM Computing Classification System - Association for Computing Machinery. (2014). Retrieved September 28, 2014, from http://www.acm.org/about/class/class/2012

[3] Arnetminer Introduction. (2014). Retrieved September 28, 2014, from http://arnetminer.org/introduction

[4] Scholarometer: Browser Extension and Web Service for Academic Impact Analysis. (2014). Retrieved September 28, 2014, from http://scholarometer.indiana.edu/

[5] About Google Scholar. (2014). Retrieved November 12, 2014, from http://scholar.google.com/intl/en-US/scholar/about.html

[6] ResearchGate. (2014). Retrieved November 12, 2014, from http://www.researchgate.net/about

[7] MongoDB. (2014). Retrieved September 28, 2014, from http://www.mongodb.org/

[8] Logging Configuration - Python 2.7.8 documentation. (2014). Retrieved September 28, 2014, from https://docs.python.org/2/library/logging.config.html

[9] $text - MongoDB Manual 2.6.4. (2014). Retrieved September 28, 2014, from http://docs.mongodb.org/manual/reference/operator/query/text/

[10] node.js. (2014). Retrieved September 28, 2014, from http://nodejs.org/

[11] Polymer. (2014). Retrieved September 28, 2014, from https://www.polymer-project.org/