Identification of Emerging Scientific Topics in Bibliometric Databases

Dissertation approved by the Faculty of Economics (Fakultät für Wirtschaftswissenschaften) of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doctor of Economics (Dr. rer. pol.)

by Dipl.-Inform.Wirt Carolin Mund

Date of oral examination: 18.07.2014
Referee: Prof. Dr. Rudi Studer
Co-referee: Prof. Dr. Ulrich Schmoch

“One is always considered mad, when one discovers something that others cannot grasp.”
Edward D. Wood, Jr. (1922-1978)

I’m deeply grateful to all the people who contributed to the completion of this thesis. You all found your personal way to help me out in times of need or to challenge me when I was getting too comfortable. The following short list can only hint at all the support I got, but it should give you an idea of how thankful I am: My family and friends, Rudi, Uli, Rainer, Achim, my colleagues and friends from NISTEP and Fraunhofer ISI, …

Short Table of Contents

I Overview
    1 Introduction
    2 Emerging Topics and Their Indicators
II Fundamentals
    3 Bibliometrics
    4 Machine Learning Foundations
    5 Latent Dirichlet Allocation (LDA)
III Contributions
    6 Emerging Topics – What They Look Like
    7 Emerging Topics – Interdisciplinarity as one Indicator
    8 Emerging Topics – Why Citation Analysis is not an Adequate Metric
    9 Emerging Topics – How They can be Detected
    10 Emerging Topics – How New Terms are Introduced in the Scientific Landscape
    11 Conclusions
IV Appendix
V Publication bibliography

I Overview

1 Introduction

This thesis is concerned with the assessment of novel methods to discover emerging topics in science. This chapter explores the motivation behind this task. In particular, it explains why such an approach is necessary (Section 1.1) and describes its essential characteristics (Section 1.2). The chapter ends with an outline of the thesis and its relation to the author’s previous publications.

1.1 Motivation

This thesis addresses the identification of emerging topics in science. Typically, publication data are used to map progress in science and thus the emergence of novel topics.
The goal of this thesis is to use these so-called bibliometric data to monitor the scientific landscape and detect emerging topics. The limitations of the data and of the derivable indicators[1] have to be acknowledged in this regard. Therefore, a clear focus of this thesis is on the distinction between reliable and unreliable indicators of emerging topics.

The necessity for indicators that are independent of citation or impact measures is illustrated by a publication by Mendel (1865), which is discussed in more detail in the course of this thesis and also serves as a running example:

Mendel wrote a highly innovative paper in 1865 titled “Versuche über Pflanzen-Hybriden” (Experiments with Plant Hybrids) (Mendel 1865). This paper represented groundbreaking work for the understanding of genetics and inheritance, and the scientific community could have been expected to acknowledge the findings and follow them up with further studies. Instead, there were only two noteworthy reactions: a few researchers questioned his figures, but the majority ignored his work for decades (see e.g. Atkins 2003, pp. 46f, van Raan 2004). Only one “misleading” citation was made in 1881 (Atkins 2003, p. 47). Thirty-five years later, Mendel’s findings were confirmed or, more precisely, rediscovered (by Hugo de Vries, Carl Correns and Erich Tschermak) and only then acknowledged by the scientific community for the first time.[2] Similarly, but in a different field, the later findings by Planck were first “met by silence [... and] regarded as a mathematical ruse” (Atkins 2003, p. 205).

One reason for the belated acknowledgement could be that the publication by Mendel “drowned” in the vast sea of scientific publications. It is true that, at that time, scientific publications were not produced in the same quantity as nowadays (for a discussion of growth rates in science, see Michels and Schmoch 2012), but access to publication data was also unstructured and complicated.
While the introduction of the internet has increased awareness of worldwide publications (and also facilitated open access, an option that was simply not possible with former means of publication), past publications were restricted to physical outlets and thus also locally bound. Garfield (1970) argues that Mendel’s paper would have been cited if the ISI Science Citation Index[3] had been around at that time: “I like to think that SCI will not only prevent inadvertent neglect of useful work but, feel confident it will prevent much unwitting duplication of research and publication” (Garfield 1970, p. 70).

[1] If not stated otherwise, the term indicator will denote any part of the system that enables the flagging of documents or topics (clusters) as “emerging” (or, based on the lack thereof, as “not emerging”). Indicators are calculated based on certain characteristics of the publications. Those characteristics that are computable, comparable and stable are labelled features, and only these can form the basis for indicators. In themselves, features have no explanatory power about the emergence of a topic. However, indicators are generated by applying these features to rules, topic models etc. The concept of features is explained in more detail in Chapter 4.
[2] http://www16.us.archive.org/stream/planthybridizati00robe/planthybridizati00robe_djvu.txt, last accessed on 2014/02/14.

Besides the restrictions related to the publication form, there are other known factors that might influence the reception of a publication. Even if a paper is read by a wide variety of researchers, in the end, its reusability in other (related) work and applications determines its dissemination and also the upper limit for its citation count. As Mendel’s work was deemed useful in retrospect, other reasons must have prevented its recognition. One possible explanation is that Mendel’s paper was rejected simply on the grounds of its high innovativeness, i.e.
scientists were overwhelmed by the novelty of the findings. The gap between the state of knowledge prior to and after his groundbreaking work might have hindered other scientists from relating it to their work (cf. Grinnell 1987, pp. 45f). However, another factor that will be discussed later in more detail is that Mendel had to rely on mathematics to explain his findings, a fact that was not well received in his scientific environment (Barber 1961, Atkins 2003, p. 47). Regardless of the exact or main reason, in the end, the relevance of Mendel’s work was acknowledged, albeit belatedly.

Given this background, it is important to grasp how humanity evolves as an intelligent species: discoveries and errors (and errors are merely “negative” discoveries[4]) are passed on to other humans and generations (cf. Section 2.1, Johnson 2013, p. 172). This spread of knowledge avoids repetition of efforts, errors and failures and ensures the advancement of science at the research front (the point of development where humanity is currently positioned), in contrast to the knowledge level of individuals or groups.

Thus, it becomes irrelevant whether these earlier discoveries were made by the same person, group, country etc. Regardless of their source, they form the basis for further common advancement. Nonetheless, as the above example and later discussions show, the selection of related work can be biased due to various factors (see in particular Sections 2.2 and 3.2.1). However, in the ideal case, research is based on the most recent discoveries in a scientific field.

By developing a system for the semi-automatic detection of emerging topics in science, this thesis aims to facilitate their accessibility and observance. The resulting procedure is comparable to detecting outliers in a set of documents. The main
