A Comparison of Taxonomy Generation Techniques Using Bibliometric Methods: Applied to Research Strategy Formulation

Steven L. Camiña

Working Paper CISL# 2010-01
July 2010

Composite Information Systems Laboratory (CISL)
Sloan School of Management, Room E53-320
Massachusetts Institute of Technology
Cambridge, MA 02142

by Steven L. Camiña
S.B., E.E.C.S., M.I.T., 2009

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

July 2010

Copyright 2010 Steven L. Camiña. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, July 23, 2010

Certified by: Stuart Madnick, John Norris Maguire Professor of Information Technologies and Professor of Engineering Systems, Massachusetts Institute of Technology, Thesis Co-Supervisor

Certified by: Wei Lee Woon, Assistant Professor, Masdar Institute of Science and Technology, Thesis Co-Supervisor

Accepted by: Dr. Christopher J. Terman, Chairman, Department Committee on Graduate Theses

ABSTRACT

This paper investigates the modeling of research landscapes through the automatic generation of hierarchical structures (taxonomies) composed of terms related to a given research field. Several taxonomy generation algorithms are discussed and analyzed, each based on the analysis of a data set of bibliometric information obtained from a credible online publication database. The algorithms considered are the Dijkstra-Jarnik-Prim (DJP) algorithm, Kruskal's algorithm, Edmonds' algorithm, the Heymann algorithm, and a genetic algorithm. Evaluation experiments attempt to determine which taxonomy generation algorithm is most likely to output a taxonomy that is a valid representation of the underlying research landscape.

Table of Contents

CHAPTER 1: Introduction
  1.1 Motivations
    1.1.1 Experts and the Decision Making Process
    1.1.2 Research Landscapes
    1.1.3 Analysis of Publication Databases
  1.2 Technology Forecasting Using Data Mining and Semantics
  1.3 Project Objectives
  1.4 Overview
CHAPTER 2: Literature Review
  2.1 Technology Forecasting
  2.2 Taxonomy Generation
  2.3 Bibliometric Analysis
CHAPTER 3: Taxonomy Generation Process
  3.1 Chapter Overview
  3.2 Extracting Bibliometric Information
    3.2.1 Engineering Village
    3.2.2 Scopus
  3.3 Quantifying Term Similarity
    3.3.1 Cosine Similarity
    3.3.2 Symmetric Normalized Google Distance Similarity
    3.3.3 Asymmetric Normalized Google Distance Similarity
  3.4 Populating the Term Similarity Matrix
  3.5 Choosing a Root Node
    3.5.1 Betweenness Centrality
    3.5.2 Closeness Centrality
  3.6 Taxonomy Generation Algorithms
    3.6.1 Dijkstra-Jarnik-Prim Algorithm
    3.6.2 Kruskal's Algorithm
    3.6.3 Edmonds' Algorithm
    3.6.4 The Heymann Algorithm
    3.6.5 The Genetic Algorithm
  3.7 Viewing Taxonomies
  3.8 Taxonomy Generation Process Summary
CHAPTER 4: Taxonomy Evaluation Methodology
  4.1 Introduction
  4.2 Taxonomy Evaluation Criteria
  4.3 Evaluating the Consistency of Taxonomy Generation Algorithms
  4.4 Evaluating Individual Taxonomies
  4.5 Synthetic Data Generation
CHAPTER 5: Results
  5.1 Introduction
  5.2 Evaluating the Consistency of Taxonomy Generation Algorithms
    5.2.1 Backend Data Set Consistency
    5.2.2 Term Consistency
    5.2.3 Consistency Test Summary
  5.3 Evaluating Individual Taxonomies
    5.3.1 Using the top 100 terms
    5.3.2 Using the top 250 terms
    5.3.3 Using the top 500 terms
    5.3.4 Evaluating Individual Taxonomies Analysis
  5.4 Synthetic Data Generation
    5.4.1 Estimating the Optimal Bibliometric Data Set Size
    5.4.2 Measuring Algorithm Variant Consistency Using Synthetic Data
  5.5 Analysis of Results
CHAPTER 6: Conclusion
  6.1 Recommendations
  6.2 Summary of Accomplishments
  6.3 Limitations and Suggestions for Further Research
REFERENCES
APPENDIX
