Topic Discovery from Textual Data

Shaheen Syed

This research was funded by the project SAF21, “Social Science Aspects of Fisheries for the 21st Century”. SAF21 is a project financed under the EU Horizon 2020 Marie Skłodowska-Curie (MSC) ITN – ETN program (project 642080).

© 2018 Shaheen Syed
ISBN: 978-90-393-7086-5

Topic Discovery from Textual Data

Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain

Thema Ontdekking in Tekstuele Data

Machinaal Leren en Natuurlijke Taalverwerking voor Kennisontdekking in het Domein van de Visserij

(met een samenvatting in het Nederlands)

Proefschrift

ter verkrijging van de graad van doctor aan de Universiteit Utrecht op gezag van de rector magnificus, prof.dr. H.R.B.M. Kummeling, ingevolge het besluit van het college voor promoties in het openbaar te verdedigen op woensdag 20 maart 2019 des middags te 4.15 uur door

Shaheen Ali Shah Syed

geboren op 9 februari 1985 te Rotterdam

Promotor: Prof. dr. S. Brinkkemper
Copromotor: Dr. M. Spruit

Acknowledgments

This thesis is the result of my three-year PhD journey in which I had the pleasure to meet beautiful minds and souls. Throughout this journey, I also had the chance to work with and learn from some amazing people, for which I am truly grateful. My nearest and dearest have always supported me in the most loving and caring way, something that I truly cherish.

I want to thank my supervisors, Marco Spruit, Sjaak Brinkkemper, and Bruce Edmonds, for giving me the freedom to explore and work on my own ideas and interests. You have been a positive influence and working with all of you has been a real pleasure.

I want to thank my co-authors, Melania Borit, Charlotte Weber, and Lia ní Aodha, for putting up with my crazy way of working. I have learned a lot from each of you, and you have shown me interesting and alternative views of the world that have enriched me personally and professionally.

I want to thank Michaela Aschan for being a great mentor, for providing me with lots of opportunities, and for being an inspirational person. I want to thank Sjaak Brinkkemper for developing a master’s program that prepared me for many of the academic challenges in this PhD journey. I want to thank Melania Borit for the nice lunches, for welcoming me, and for helping me during my many visits to Tromsø.

I want to show my gratitude to the EU for funding the project, to all the people involved in writing the SAF21 proposal, to the SAF21 members who I have met, to the UiT BRIDGE group for having me as a guest, to the University of Utrecht for welcoming me, to the Manchester Metropolitan University and the Centre for Policy Modelling for providing me with a work environment, and to all the other people I have had the pleasure to meet and talk to during my PhD.

I want to thank Charlotte Weber who has played many roles throughout this journey and surely will continue to do so in the future. Thank you for being such an amazing and caring person and thank you for showing me how to become a better version of myself. You truly are a unique soul, a blessing to the universe, and I am grateful to have had the pleasure to meet you.

And last but not least, I want to thank my family for raising me, for making me the person I am today, and for supporting and loving me all these years.

Thank you all,

— Shaheen Syed

Contents

1 Introduction
  1.1 Knowledge Discovery Process
  1.2 Topic Models
    1.2.1 Latent Dirichlet Allocation
  1.3 Research Domain
  1.4 Research Questions
    1.4.1 Main Research Question (MRQ)
    1.4.2 Research Questions (RQ)
  1.5 Research Methods
    1.5.1 Computational Experiment
    1.5.2 Quantitative Content Analysis
    1.5.3 Social Network Analysis
  1.6 Dissertation Outline

2 Full-Text or Abstract?
  2.1 Introduction
  2.2 Background
    2.2.1 Latent Dirichlet Allocation
    2.2.2 Topic Coherence Measurement
  2.3 Methodology
    2.3.1 The Experiment
    2.3.2 Dataset
    2.3.3 Creating LDA Models
    2.3.4 Topic Coherence
  2.4 Results
    2.4.1 DS1 Dataset
    2.4.2 DS2 Dataset
    2.4.3 Human Topic Ranking
  2.5 Discussion
  2.6 Conclusion

3 Exploring Dirichlet Priors
  3.1 Introduction
  3.2 Background
    3.2.1 Latent Dirichlet Allocation
    3.2.2 Research Utilizing LDA
    3.2.3 Coherence Scores
  3.3 Methods
    3.3.1 Dataset
    3.3.2 Dirichlet Hyperparameters
    3.3.3 Creating LDA Models
    3.3.4 Topic Coherence
    3.3.5 Human Topic Ranking
    3.3.6 Relaxing LDA Assumptions
  3.4 Results
    3.4.1 Topic Coherence
    3.4.2 Human Topic Ranking
  3.5 Discussion and Conclusion

4 Bootstrapping a Semantic Lexicon
  4.1 Introduction
  4.2 Previous Work
  4.3 Lexicon Bootstrapping
    4.3.1 Domain and Seed Words
    4.3.2 Building the Corpus
    4.3.3 Chunking
    4.3.4 Scoring Verbs
    4.3.5 Verb Extraction Pattern
    4.3.6 Bootstrapping
  4.4 Evaluation
  4.5 Conclusion

5 Topic Analysis of Fisheries Science
  5.1 Introduction
  5.2 Methods
    5.2.1 Latent Dirichlet Allocation
    5.2.2 Assumptions behind LDA
    5.2.3 Creating the Data Set
    5.2.4 Creating the LDA Model
    5.2.5 Calculating Model Quality
    5.2.6 Labeling Topics
    5.2.7 Calculating Topical Trends over Time
    5.2.8 Calculating Topics over Journals
    5.2.9 Relaxing LDA Assumptions and Future Research Directions
  5.3 Results and Discussion
    5.3.1 Uncovering Fisheries Topics
    5.3.2 Topic Proportions within Documents
    5.3.3 Topical Trends over Time and Topic Prevalence
    5.3.4 Topical Trends over Journals
    5.3.5 Validation of Results
  5.4 Conclusion and Recommendations
  Appendix

6 Sub-Topic Analysis of Fishery Models
  6.1 Introduction
  6.2 Methods
    6.2.1 Latent Dirichlet Allocation
    6.2.2 Topic Interpretation
    6.2.3 Creating the Dataset
    6.2.4 Pre-processing the Data Set
    6.2.5 Creating LDA Models
    6.2.6 Identifying Subtopics
    6.2.7 Labelling the Topics
    6.2.8 Calculating Sub-Topical Modelling Trends
  6.3 Results and Discussion
    6.3.1 General Modelling Topics
    6.3.2 Subtopics within Estimation Models
    6.3.3 Subtopics within Stock Assessment Models
  6.4 Conclusions
  Appendix

7 Global Network of Fisheries Science
  7.1 Introduction
  7.2 Results
    7.2.1 Topology of the Co-Authorship Network
    7.2.2 Country-Level Giants
    7.2.3 Institutional Dynamics
    7.2.4 Hidden Collaborative Groups
    7.2.5 Country Clusters
    7.2.6 Communities of Authors and their Topical Foci
  7.3 Discussion
    7.3.1 A Bourdieusian Perspective
    7.3.2 Democratizing Fisheries Science?
    7.3.3 Systems of Regionalization
    7.3.4 Collaboration Styles
    7.3.5 The Topical Landscape of Fisheries Science
  7.4 Limitations and Ways Forward
  7.5 Conclusion
  7.6 Materials and Methods
    7.6.1 Data Collection
    7.6.2 Social Network Analysis
    7.6.3 Hidden Groups
    7.6.4 Community Detection
    7.6.5 Topic Modeling
  Appendix

8 Conclusions
  8.1 Scientific Contributions
    8.1.1 LDA Workflow
  8.2 Limitations
    8.2.1 Latent Dirichlet Allocation
    8.2.2 Fisheries Domain
  8.3 Future Work
  8.4 Personal Reflections

Bibliography

Published Work

Summary

Samenvatting

Curriculum Vitae

Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle. As with all matters of the heart, you’ll know when you find it.

– Steve Jobs

Chapter 1

Introduction

It is estimated that the world’s data will grow to approximately 160 billion terabytes by 2025 (Reinsel et al., 2017), with most of that data occurring in an unstructured form—typically in the form of text. Approximately 20% of that data will be critical to the continuity of humans’ daily lives, and nearly 10% will be hypercritical with a direct and immediate impact on humans’ health and wellbeing (e.g., medical systems and telemetry). Today, we have reached the point where more data is being produced than can physically be stored (Hilbert and Lopez, 2011). In academia, research output is expected to grow at a rate of 8–9% annually, doubling roughly every nine years (Bornmann and Mutz, 2015). Across different fields and disciplines, problems, challenges, and opportunities arise to ingest all this data, create useful knowledge, and use the results for prediction and understanding (Blei, 2018; Science, 2011).

The surge of large volumes of data makes manual probing of the data slow, expensive, and subjective (Jiawei et al., 2012). New computational tools and algorithms are necessary to support the extraction of useful knowledge in a more structured way (Blei, 2012). Furthermore, to optimally and efficiently explore large volumes of data, it is essential to follow a well-established and accepted systematic process for comprehensive data analysis and knowledge extraction (Krochmal and Husi, 2018). The Knowledge Discovery in Databases (KDD) process is one such approach, which provides a systematic, iterative process to support the extraction of useful knowledge from large volumes of data (Fayyad et al., 1996). This process-oriented approach creates the context for developing and exploring the tools necessary to control the flood of data more systematically and helps to gain a better understanding of the data.

Extracting knowledge from data is particularly challenging for unstructured data, such as text in documents, where the data does not have a clear, semantically overt structure which a computer can easily understand (Manning et al., 2009). The field of natural language processing (NLP) is an area of research and application that examines how computers can be used to understand and manipulate natural language text (Chowdhury, 2003). NLP studies how humans interpret and understand natural language, and the field intends to develop appropriate tools and techniques to enable computers to assist in this task. The foundation of NLP is rooted in the fields of computer and information science, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, and psychology (Chowdhury, 2003). From the perspective of natural language understanding—in contrast to natural language generation—applications of NLP include, but are certainly not limited to, the understanding of sentiments and opinions within text (e.g., positive or negative reviews), exploring entities within text (e.g., persons, companies, locations), part-of-speech tagging (e.g., verbs, nouns, adjectives), and automatically uncovering the themes or topics from documents. It is the uncovering of topics that forms the subject of this thesis.

Typically, one would use keyword searches to find documents exhibiting certain themes or topics. However, a common problem with traditional keyword searches is that they have difficulties in detecting the underlying ideas, themes, or topics from documents (Blei and Lafferty, 2009; Srivastava and Sahami, 2009). This is especially true when the topics are hidden or latent, meaning that they are not explicitly mentioned in the document. For instance, a hypothetical document containing the words “blue”, “red”, and “green” might not explicitly mention the underlying topic of color. Here, a topic is thus a reference to a group of words that one would commonly use to describe something, and such words typically occur within the same linguistic context (DiMaggio et al., 2013). More formally, the group of words tends to co-occur, and this phenomenon is rooted in the distributional hypothesis; namely, words with similar meaning tend to occur in similar contexts (Harris, 1954).

Within the field of NLP, techniques from the field of machine learning—the science of getting computers to act without being explicitly programmed—are commonly applied. One such technique is probabilistic topic modeling, a type of unsupervised machine learning, which can automatically capture latent topics from large collections of documents (Blei, 2012; Blei and Lafferty, 2006, 2007, 2009; Blei et al., 2003; Rosen-Zvi et al., 2004). More concretely, topic models can discover groups of co-occurring words (the topics) from thousands or millions of documents without the need to manually annotate or label them. The uncovered latent topics can help users to explore, classify, and organize individual documents, as well as a document collection (called a corpus), and can additionally provide an extension to traditional search mechanisms.

The unsupervised nature of topic models makes them ideal candidates to employ on a variety of corpora and in numerous fields (Gatti et al., 2015; Hall et al., 2008; Sun and Yin, 2017; Wang and McCallum, 2006; Westgate et al., 2015). This is primarily since they: (i) require little human intervention and knowledge in the pre-analysis phase, (ii) generate reproducible results without human subjectivity bias, and (iii) can easily scale to thousands or millions of documents (Debortoli et al., 2016; Quinn et al., 2010). However, researchers usually treat topic models as black boxes, without thoroughly exploring their underlying assumptions and parameter values (Chen et al., 2016). This black box approach comes from the inherently complex statistical nature of topic models, and with improper use, their results may not sustain under scrutiny, may generate replication failure, may produce different outputs on multiple runs, and could be interpreted as questionable (Chuang et al., 2014, 2012; Grimmer and Stewart, 2013). In particular, the variety of parameters and hyper-parameters are commonly set to their default values (Wallach et al., 2009), oftentimes caused by a lack of clear guidelines on how to optimally set them (Chen et al., 2016).

Motivated by the above factors, this thesis aims to provide answers regarding how to optimally and efficiently employ probabilistic topic models to large collections of documents. In this context, we have selected the research domain of fisheries science as our testbed (described further in Section 1.3), and we take fisheries scientific publications as our source of textual data. In particular, for fisheries sustainability, it is crucial that fisheries science appropriately considers the ecological, social, economic, and institutional elements (Boström, 2012; Dahl, 2012; Rindorf et al., 2017). To date, the social, economic, and institutional considerations have been largely neglected (Hicks et al., 2016; Levin et al., 2015), despite international agreements and legislation (Stephenson et al., 2018). Though illuminating, the assessment of these considerations has been primarily performed with classical methods from the social sciences. Here, we aim to expand on such analyses by employing techniques from the field of NLP and probabilistic topic models to computationally analyze large collections of scientific data for the assessment of fisheries sustainability.

Our objective here is two-fold. First, this thesis scientifically investigates how to optimally and efficiently apply and interpret topic models to large collections of documents. Specifically, we study how different types of textual data, pre-processing steps, and hyper-parameter settings affect the quality of the derived latent topics. In doing so, we contribute to the methodological analysis and optimization of topic models, providing a starting point for researchers who want to apply topic models with scientific rigor to scientific publications. Second, by applying topic models to fisheries science publications, we study the domain through a new computational lens and expand on traditional approaches to assess fisheries sustainability. Moreover, we construct unique datasets comprising thousands of scientific articles and provide a quantitative assessment of fisheries science. For both objectives, we follow the systematic and well-grounded KDD process for comprehensive data analysis and knowledge extraction.

1.1 Knowledge Discovery Process

Within this thesis, our aim is to create new and useful knowledge by exploring latent topics derived from (raw) data. In this pursuit, we follow the well-established knowledge discovery process called Knowledge Discovery in Databases (KDD), originally proposed by Fayyad et al. (1996). Their seminal work provides an iterative, systematic process to extract useful knowledge from large collections of data, which takes the data as the starting point and, through a sequence of steps, ends with the derived knowledge (see Fig. 1.1).

[Figure 1.1: The Knowledge Discovery in Databases (KDD) process, showing the steps from data selection (target data), via pre-processing, transformation, and data mining (patterns), to interpretation/evaluation (knowledge).]

Although the process stems from more than two decades ago, the fundamental principles underlying the process still hold today. In fact, books devoted to analyzing data effectively, efficiently and in a scalable manner for the purpose of knowledge discovery incorporate the same, or very similar, iterative process (Jiawei et al., 2012). However, within the industry, the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Shearer, 2000) is widely considered the number-one knowledge discovery process (Spruit and Lytras, 2018). CRISP-DM, although based on KDD (Wirth, 2000), is highly tuned towards the industry, where typically the evaluation phase addresses business needs, and the deployment phase covers customers’ needs. This makes CRISP-DM less suitable for academic purposes, and hence for ours.

The KDD process starts with obtaining data and selecting only data of interest (target data), on which discovery is to be performed. This is followed by a pre-processing step that might include, but not be limited to, removing noise and outliers from the data, normalizing, handling missing values, and so on. The transformation step converts the data into appropriate features (i.e., variables describing the data) or uses dimensionality reduction techniques, such as principal component analysis, to reduce the number of features. In practice, the subsequent data mining step oftentimes, if not always, dictates how the transformation into appropriate features must commence. The data mining step is the process where the intelligent data mining happens by applying algorithms, such as classification, regression and clustering methods, which result in the discovery of patterns, such as latent topics. Such patterns can then be interpreted and evaluated for their usefulness and correctness, and, if satisfactory, can lead to the creation of knowledge.
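The sequence of steps above can be sketched end to end in a few lines of code. The following is a minimal illustration on a toy text corpus; the records, helper names, and the trivially simple "mining" step are hypothetical stand-ins for real components (a topic model would occupy the data mining slot in practice).

```python
from collections import Counter

# Hypothetical raw records; in a real study these would come from a
# database, files on disk, or the web.
raw_records = [
    {"id": 1, "lang": "en", "text": "Fishery management and stock assessment."},
    {"id": 2, "lang": "nl", "text": "Visserijbeheer en modellen."},
    {"id": 3, "lang": "en", "text": "Stock assessment models estimate parameters."},
]

# 1. Selection: keep only the target data (here: English documents).
target = [r for r in raw_records if r["lang"] == "en"]

# 2. Pre-processing: normalize case and strip punctuation (a stand-in
#    for removing noise and outliers and handling missing values).
def preprocess(text):
    return [w.strip(".,") for w in text.lower().split()]

tokenized = [preprocess(r["text"]) for r in target]

# 3. Transformation: turn documents into features (here, word counts).
features = [Counter(tokens) for tokens in tokenized]

# 4. Data mining: extract a pattern; trivially, the corpus-wide term
#    frequencies stand in for a real algorithm such as a topic model.
corpus_counts = sum(features, Counter())
pattern = corpus_counts.most_common(3)

# 5. Interpretation/evaluation: inspect the pattern and decide whether
#    it constitutes useful knowledge.
print(pattern)  # e.g. [('stock', 2), ('assessment', 2), ...]
```

Each step feeds the next, mirroring the iterative character of the KDD process: if the pattern is unsatisfactory, one returns to an earlier step and adjusts it.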

The KDD process has been used in many different areas and domains (Krochmal and Husi, 2018), and it has served as the basis for more domain-tailored knowledge discovery processes (Shearer, 2000). It is particularly useful as its aim is to place emphasis on all steps when deriving knowledge from data, and not specifically on a single process, which is, oftentimes, the data mining process that entails optimizing algorithmic details. To reiterate the authors’ words: “most previous work on KDD focused primarily on the data mining step. However, the other steps are equally, if not more, important for a successful application of KDD in practice” (Fayyad et al., 1996). Within the KDD process, the data mining step involves fitting (statistical) models to data to extract patterns and is shown as a single step within the overall sequence of steps. However, data mining can also refer to the entire KDD process, and the two definitions are used interchangeably in the literature. In the latter case, the term “data mining” is treated in the broader sense of the words (Jiawei et al., 2012).

The reference to databases in KDD can look confusing, as it might imply that the data must originate from a database. Indeed, databases can hold and accumulate vast amounts of data, and tapping into this (raw) data can yield interesting results. However, the original KDD paper defines data as “a set of facts” (Fayyad et al., 1996), which might be stored—and typically is—within a database, but may very well be data originating from elsewhere, such as documents on a disk or content from the web. In this sense, and given that data today can come from a multitude of sources, we interpret the KDD process more as the process of Knowledge Discovery from Data.

1.2 Topic Models

As previously mentioned, this thesis explores the research direction of automatically uncovering topics from a corpus in an attempt to improve and better understand the knowledge discovery process. In this pursuit, we employ topic modeling algorithms to automatically uncover topics from documents, and our focus here is on the most popular and highly studied topic model, Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

LDA overcomes the limitations of older topic models, such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990) and probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999). At the time of writing, the original LDA method proposed by Blei et al. (2003) has over 23,000 citations. The technique has received much attention from machine learning researchers and other scholars, and has been adopted and extended in many ways (Blei and Lafferty, 2006, 2007; Chang and Blei, 2010; Doyle and Elkan, 2009; Reisinger et al., 2010; Rosen-Zvi et al., 2004; Wallach, 2006b; Wang and Blei, 2009; Whye Teh et al., 2004). To provide a sense of the popularity of topic models within the scientific literature, Fig. 1.2 shows how frequently LDA and the two older topic models, LSI (synonymously referred to as Latent Semantic Analysis or LSA) and pLSI (synonymously referred to as probabilistic Latent Semantic Analysis or pLSA), occur within the title, abstract or keywords of publications covered by Scopus in the period 2000 to 2017. Frequencies are obtained by searching for the full name of the topic model enclosed by quotation marks, e.g. “latent Dirichlet allocation” (to disambiguate it from linear discriminant analysis).

[Figure 1.2: Mentions of topic model algorithms (LDA, PLSI, PLSA, LSI, LSA) within the title, abstract or keywords of all articles covered by Scopus from 2000–2017.]

One of the main advantages of LDA is that it produces a set of topics that describe the entire corpus and are individually interpretable, while simultaneously describing how each document exhibits those topics proportionally (Blei and Lafferty, 2009; Griffiths et al., 2007). For example, we can infer that a corpus of 1,000 fisheries science publications contains four topics, and those four topics can be interpreted as topics concerning “fisheries management”, “gear”, “growth”, and “vessels”. Additionally, we can infer how these topics are proportionally present within each document. For example, one document might contain 60% on the topic of “fisheries management”, and 40% on the topic of “vessels”. Technically, all topics are present within each document, but only a few topics make up the most substantial part of the document, with the remaining topics very close to 0%. LDA thus follows the natural assumption that documents exhibit multiple topics in mixed proportions (Blei and Lafferty, 2009), and is able to capture the heterogeneity of topics within documents (Erosheva et al., 2004). In contrast to other topic models (e.g., pLSI), LDA can handle large amounts of data as the model parameters do not grow linearly with the number of documents (Blei et al., 2003; Chen et al., 2016), which prevents overfitting of the model. LDA has also been shown to outperform traditional cluster-based methods for information retrieval purposes (Wei and Croft, 2006) and baseline clustering methods (e.g., K-means) for scientometric research (Yau et al., 2014). More research into the benefits of LDA and the differences between other topic model algorithms is described in Anaya (2011) and Steyvers and Griffiths (2007).

To date, researchers apply LDA on a variety of corpora and in numerous domains, and many toolkits are available that more easily enable this: Mallet (McCallum, 2002), Gensim (Rehurek and Sojka, 2010) and Stanford TMT (Ramage and Rosen, 2009). Knowing how to employ LDA optimally fosters a better understanding of the underlying implications of the model parameters, pre-processing steps and selected data, and can help future research yield better and scientifically valid results. For these reasons, this thesis takes LDA as the primary technique to uncover latent topics from a set of documents.

[Figure 1.3: Graphical diagram of the application of LDA to a set of documents, their conversion into a bag-of-words representation (D documents × V words), the two-phased process of LDA (generative process and inference process), and the output representation of topics within documents (document-topic distribution, D documents × K topics) and words within topics (topic-word distribution, K topics × V words).]

1.2.1 Latent Dirichlet Allocation

We will succinctly describe LDA in non-technical language solely for the purpose of understanding the remaining part of this chapter. We do so by presenting the analytical steps involved, the adopted terminology, and the underlying statistical assumptions. Fig. 1.3 displays a graphical diagram of the LDA process where we assume that there are D documents, V words, and K topics. The number of documents D, and all the (distinct) words V within those documents, are defined by the corpus under study. The number of topics, K, is assumed to be known a priori. A formal and mathematical description is given in Chapters 2 and 3, and a non-technical description with concrete examples is additionally given in Chapter 5. We further refer the interested reader to Blei (2012) for an introductory explanation.

Each document includes some words, and the collection of all the words is considered the vocabulary of the corpus (hence the V). In many areas of NLP, documents are represented as bag-of-words (BOW) features, where the words and their frequency within a document are treated as individual document features. A BOW representation of a single document is merely a row (i.e., vector) within a table (i.e., matrix) where each column represents a word (i.e., term or feature) from the vocabulary, and the cells represent the frequency or count of that word within the document. Fig. 1.3 depicts a BOW representation for three hypothetical documents and three hypothetical words. Document d1 thus contains word1 one time, word2 zero times, word3 three times, and so on. The process of converting documents into BOW features is performed in the pre-processing phase of LDA, and if desired, words with similar meaning (e.g., plural and singular words) can be grouped (stemming or lemmatization), words that have no specific meaning can be filtered out (removing stop words such as “the” and “an”), and high- and low-frequency words can be eliminated (pruning).
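The bag-of-words conversion described above can be sketched in a few lines. The mini-corpus, its contents, and the document names below are hypothetical and serve only to illustrate the row-per-document, column-per-word layout.

```python
from collections import Counter

# Three hypothetical documents; contents are illustrative only.
docs = {
    "d1": "fishing vessel fishing gear fishing",
    "d2": "stock growth growth",
    "d3": "vessel gear stock",
}

# Tokenize and build the vocabulary V: all distinct words in the corpus.
tokenized = {d: text.split() for d, text in docs.items()}
vocab = sorted({w for tokens in tokenized.values() for w in tokens})

# Bag-of-words matrix: one row (vector) per document, one column per
# vocabulary word, cells holding the word's frequency in that document.
# Note that word order within a document is discarded.
bow = {d: [Counter(tokens)[w] for w in vocab] for d, tokens in tokenized.items()}

print(vocab)      # ['fishing', 'gear', 'growth', 'stock', 'vessel']
print(bow["d1"])  # [3, 1, 0, 0, 1] -> 'fishing' three times, etc.
```

Stemming, stop-word removal, and pruning would be applied to the token lists before the matrix is built.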

Creating the LDA model comprises two steps (see Fig. 1.3): (1) generating documents based on statistical sampling rules, called the generative process; and (2) inferring model parameters from these generated documents, called the inference process. The first step, the generative process, involves the following steps:

1. For each topic (of K topics):

(a) Sample a distribution over V words (from a Dirichlet distribution).

2. For every document (of D documents):

(a) Sample a distribution over K topics (from a Dirichlet distribution).

(b) For every word within this document (document length can be fixed or sampled from an appropriate distribution, e.g., Poisson):

i. Sample a single topic from the previously sampled topic distribution.

ii. Sample a word from this topic.

The Dirichlet distribution can be viewed as a “distribution of distributions”, where every point on the distribution (called the simplex, see Fig. 1.4) represents some probability distribution over K discrete classes (where K ≥ 2). The shape of the Dirichlet distribution is defined by a parameter value for each of the K classes. Technically, this is a vector of length K and consists of positive values. Fig. 1.4 shows two Dirichlet distributions with three classes (K=3) and corresponding parameter values (labeled a1, a2, and a3) for each of the three classes. When sampling from the Dirichlet distribution, we obtain some probability mass (between 0 and 1) for each of the K classes, with the restriction that the sum of the probability masses for all classes equals 1. Furthermore, it is important to note that within LDA there are two Dirichlet distributions from which we sample (for words within topics (step 1a), and for topics in documents (step 2a)), and these distributions are considered prior distributions. The chosen parameters of the two Dirichlet distributions affect the smoothing or shape of the distribution (see Fig. 1.4), and the drawn samples will be different with different parameterizations, which in turn affects the output of LDA.
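The generative process can be made concrete with a short simulation. The snippet below samples from a Dirichlet distribution by normalizing independent Gamma draws (a standard construction) and then generates documents exactly as in steps 1–2 above; the sizes K, V, D, the symmetric prior values, and the fixed document length are hypothetical choices for illustration only.

```python
import random

random.seed(7)

def sample_dirichlet(alphas):
    """Draw one sample from a Dirichlet distribution by normalizing
    independent Gamma(alpha_k, 1) variates so they sum to 1."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs):
    """Sample an index from a discrete probability distribution."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical sizes: K topics, V vocabulary words, D documents.
K, V, D = 3, 8, 2
alpha = [0.5] * K  # symmetric prior for topics in documents (step 2a)
beta = [0.1] * V   # symmetric prior for words within topics (step 1a)

# Step 1: for each topic, sample a distribution over the V words.
topic_word = [sample_dirichlet(beta) for _ in range(K)]

# Step 2: for each document, sample a topic distribution, then generate
# each word by sampling a topic first and a word from that topic second.
documents = []
for _ in range(D):
    doc_topics = sample_dirichlet(alpha)  # step 2a
    words = []
    for _ in range(5):                    # fixed document length
        z = sample_discrete(doc_topics)   # step 2b-i: pick a topic
        w = sample_discrete(topic_word[z])  # step 2b-ii: pick a word
        words.append(w)
    documents.append(words)

print(documents)  # two generated documents of five word indices each
```

Lowering the prior values concentrates the sampled distributions on fewer classes (sparser topics and topic mixtures), which is the smoothing effect described above.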

Within the first part, LDA produces a set of generated documents governed by the generative process, the prior Dirichlet distributions, and the statistical sampling process. The second part, the inference process, is directed at uncovering the parameters that most likely would have resulted in those generated documents. This process can be viewed as reverse engineering the generative process. Technically, we want to infer the hidden variables given the observed documents, also referred to as inferring the posterior probability of the hidden variables given the data. The hidden variables are (i) the word probabilities within each topic, (ii) the topic probabilities within each document, and (iii) the mapping of each word to a topic, called topic assignment. Calculating the posterior probability is intractable and several inference methods exist (Blei and Jordan, 2006; Newman et al., 2007; Porteous et al., 2008; Teh et al., 2006; Wang et al., 2011). Although the discussion of posterior inference techniques is beyond the scope of this thesis, after inference has converged, we obtain topics (represented as topic-word distributions) and how these topics are proportionally present within each document (represented as document-topic distributions).

[Figure 1.4: Two Dirichlet distributions with three classes (K = 3), shaped differently due to the parameterization of the distribution. On the left side, a symmetrical Dirichlet is shown (α1 = α2 = α3 = 0.33) and on the right side an asymmetrical Dirichlet where one class has more weight (α1 = 0.6, α2 = 0.2, α3 = 0.2). Created with the R library MCMCpack.]

In a way, LDA can be seen as a dimensionality reduction technique, as documents are now described as distributions over K topics, in contrast to the original bag-of-words representation of length V. The document-topic distributions provide insights into the topical decomposition of the entire corpus. Additionally, documents with similar topic distributions, measured by, for instance, the Kullback-Leibler divergence (Kullback and Leibler, 1951) or the Hellinger distance (Hellinger, 1909), can be viewed as comparable in the topical content they address. Documents that are comparable can be used for searching (retrieval) or clustering (classification). Until now, we have presented the topics as distributions over V words. To gain insights into the semantic meaning of a topic distribution, one typically sorts the probabilities of words within a particular topic in descending order. The top words (usually the top 10) are most informative about the latent meaning of the topic, and typically a domain expert assigns a label to each topic that best captures the semantics of the top words.
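Both uses of the output, comparing documents by their topic distributions and reading a topic through its top words, can be sketched as follows; the distributions and the five-word vocabulary are hypothetical:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical document-topic distributions over K = 3 topics.
doc_a = np.array([0.7, 0.2, 0.1])
doc_b = np.array([0.6, 0.3, 0.1])
doc_c = np.array([0.1, 0.1, 0.8])

print(hellinger(doc_a, doc_b))  # small distance: topically similar
print(hellinger(doc_a, doc_c))  # large distance: topically different

# Reading a topic: sort its word probabilities in descending order and
# keep the highest-probability words (hypothetical 5-word vocabulary).
vocab = ["fishery", "management", "model", "estimate", "stock"]
topic = np.array([0.35, 0.30, 0.05, 0.10, 0.20])
top_words = [vocab[i] for i in np.argsort(topic)[::-1][:3]]
print(top_words)  # ['fishery', 'management', 'stock']
```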

An example of two topic-word distributions with the top 10 high probability words is shown in Fig. 1.5. The two latent topics are uncovered from fisheries publications (Syed et al., 2018a), and after close examination, they can be labeled Management and Models. The document-topic distribution depicts how topics are proportionally present within a single document, with the example document in Fig. 1.5 comprising nearly 60% on the Models topic, and 40% on the Management topic. Although we only show two topics, the full model actually contains 25 topics (K = 25), and each document exhibits each of the 25 topics in some proportion. Thus, by employing LDA, one can explore the prevailing topics in a corpus, while also exploring how individual documents exhibit them.

[Fig. 1.5: topic-word distributions (top 10):

MANAGEMENT            MODELS
Fishery      .017     Model         .051
Management   .014     Estimate      .021
Fishing      .010     Value         .015
State        .006     Variable      .012
Resource     .006     Parameter     .010
Economic     .006     Analysis      .008
Vessel       .005     Effect        .008
Policy       .005     Distribution  .007
Area         .005     Base          .007
Fish         .005     Sample        .007

Document-topic distribution (bar chart): models ≈ 0.6, management ≈ 0.4.]

Figure 1.5: Two LDA topic-word probability distributions showing the top 10 high probability words, and a document-topic distribution of a hypothetical document.

1.3 Research Domain

This thesis operates within the domain of fisheries. A fishery can be defined as “the complex of people, their institutions, their harvest and their observations associated with and including a targeted stock or group of stocks (i.e., usually fish), and increasingly, the associated ecosystems that produce said stocks” (Link, 2010). Simply put, fisheries deal with all aspects of harvesting fish, including the people, methods, tools, management, boats, ecosystem, and fish, and how they are interlinked.

The primary reason for the focus on this research domain is the overarching project in which this thesis is embedded: the EU Horizon 2020, Marie Skłodowska-Curie, European Training Network called “Social Science Aspects of Fisheries for the 21st Century”, or SAF21 for short. The project aims to improve the management (i.e., social) understanding of fisheries, where fisheries are recognized to be part of social-ecological systems (Ostrom, 2009); systems where the social (e.g., human interests and actions) and ecological (e.g., fish and ecosystem) dimensions are connected and interlinked.

For fisheries sustainability, it becomes increasingly important to appropriately consider the four elements inherent within social-ecological systems through equal and balanced investigation (Stephenson et al., 2018). These four elements are the ecological, social, economic, and institutional considerations (Boström, 2012; Dahl, 2012; Rindorf et al., 2017). To date, it is argued that typically the ecological considerations are primarily addressed, and as a result, the social, economic, and institutional considerations are relatively neglected (Hicks et al., 2016; Levin et al., 2015).

We additionally strive to provide insights into the prevalence of these dimensions within fisheries, and specifically within fisheries science publications. Fisheries science publications are thus taken as the source of data, i.e., the documents or corpus used as input for LDA, within this thesis. Additionally, fisheries domain experts are consulted to interpret and evaluate the latent topics derived from these data.

1.4 Research Questions

1.4.1 Main Research Question (MRQ)

A large body of scientific work aimed at uncovering latent topical structures from textual data has focused on the theoretical aspects of the topic model’s algorithmic complexity. Such aspects include, but are not limited to, model run time, model complexity, model inference, and other forms of optimizing algorithmic details. Although these are highly relevant and important endeavors, little research has taken the solution-oriented, applied data science approach as its vantage point (Spruit and Lytras, 2018); that is, how to optimally and efficiently apply topic models to large collections of documents and interpret them to gain a better understanding of the underlying topical content.

Indeed, much research has applied topic models in a variety of contexts and research domains, such as transportation research (Sun and Yin, 2017), scientific literature (Hall et al., 2008; Wang et al., 2011; Wang and McCallum, 2006), conservation science (Westgate et al., 2015), and the fields of operations research and management science (Gatti et al., 2015). Although each uncovers interesting patterns and relevant knowledge from documents, they often lack foundational theory for data selection and model (hyper-)parameterization (Chen et al., 2016), which form the basis of any derived output. Especially when the underlying data represents knowledge itself—as is the case with scientific literature—and when it forms the basis of a new knowledge discovery process, making theoretically justified choices to optimally utilize the potential of topic models becomes extremely important. Furthermore, within an applied data science context, and from a domain-specific perspective, there is space to not only interpret the latent topics in isolation, but to interpret them in a broader, domain-related context. We hypothesize that data selection and model parameterization significantly affect the derived underlying latent topical structure from a set of documents, and, subsequently, affect any derived knowledge in this regard. We further hypothesize that we

can create new (fisheries) domain knowledge from the interpretation of latent topics and explore them beyond the raw topic-word and document-topic distributions. Thus, this dissertation poses the following main research question (MRQ):

MRQ — How can we improve the knowledge discovery process from textual data through latent topical perspectives?

In this thesis, we take the entire KDD process, from raw data to useful knowledge, as a blueprint for an appropriate knowledge discovery process. The ultimate objective of a KDD process is to turn raw data, with implicitly captured knowledge, into patterns that explicitly expose this knowledge. Each step in a KDD process is considered equally important, and together they make up the knowledge discovery process (see Section 1.1). Since our main research question is to improve the knowledge discovery process of topical perspectives from textual data, it is imperative that the intricacies of each KDD step are dealt with appropriately, that is, with sufficient scientific rigor.

1.4.2 Research Questions (RQ)

To find answers and support our main research question, we formulated six research questions (RQ1–6) with the aim to improve and better understand the knowledge discovery process. We provide a list of the six research questions below, and each is treated in more detail in subsequent sections of this chapter:

1. What types of textual data result in high-quality latent topics?
2. How does the hyper-parameterization of a topic model algorithm affect the quality of latent topics?
3. Can we assess the quality of latent topics using a semi-automatically constructed list of semantically related words?
4. How can we construct knowledge from latent topics derived from large collections of documents?
5. How can we construct knowledge from latent sub-topics derived from large collections of documents?
6. How can we utilize knowledge derived from latent topics for a subsequent knowledge discovery process?

The mapping of the six research questions to existing KDD steps is shown in Fig. 1.6. The six research questions can be divided into two parts. The first consists of RQ1, RQ2, and RQ3, which scientifically study the effects several KDD steps have on the derived patterns (the latent topics). RQ1 studies the KDD data selection and related pre-processing

steps, RQ2 studies the settings of hyper-parameters within the KDD data mining step, and RQ3 studies the evaluation of the patterns within the interpretation and evaluation step. Thus, RQ1–RQ3 aim to optimize the choices made in a number of KDD steps to ensure that such choices are based on proper scientific rationale, rather than relying, for instance, on default choices (Spruit and Lytras, 2018; Wallach et al., 2009). The second part consists of RQ4, RQ5, and RQ6, which aim to improve the understanding of the derived patterns and their translation into useful knowledge. Below, each RQ is discussed in more detail.

[Fig. 1.6 diagram: the KDD steps (data selection → pre-processing → transformation → data mining → interpretation/evaluation → knowledge), annotated with RQ1 (Chapter 2), RQ2 (Chapter 3), RQ3 (Chapter 4), RQ4 (Chapter 5), RQ5 (Chapter 6), and RQ6 (Chapter 7).]

Figure 1.6: The six research questions (RQ1–6) mapped onto the KDD steps with a reference to the corresponding chapters in which they are addressed.

RQ1 — What types of textual data result in high-quality latent topics?

The first step of the KDD process starts with the data, and the selection of appropriate or relevant data thereof. A well-known phrase amongst computer scientists is “garbage in, garbage out”: the quality of the data fed into a system affects the quality of the results obtained from that system. As the first step within the KDD process, data selection can be extremely important, as it could affect any subsequent step or process. This thesis operates within the domain of fisheries and uses scientific literature, in the form of scientific peer-reviewed articles, as the basis for data. Logically, that leaves the choice between two variants of this data: the abstracts of the scientific articles, and the full text of the articles. Although any combination of article meta-data can be viewed as a type of representation, from an article data retrieval perspective, abstracts and full-text data are usually treated as two types. Abstracts are more easily accessible than the full text of scientific articles. When employing probabilistic topic models, many studies use abstract data as the source of textual data (Gatti et al., 2015; Grimmer and Stewart, 2013; Sun and Yin,


2017; Westgate et al., 2015), and a smaller number of studies use full-text data (Alston and Pardey, 2016; Hall et al., 2008; Wang and McCallum, 2006). It can be argued that the abstract should contain most parts of the research underpinning the article (or at least the essential parts), and, as such, should be sufficient to distill the main topical content of the article (Lin, 2009).

This research question aims to (i) explore the effects of using abstracts and full-text documents on the quality of derived latent topics, and (ii) explore what role related pre-processing steps play within this process. Until now, no study has examined the effects of using abstracts versus full-text data when uncovering latent topics. As a result, researchers using abstracts or full-text publications oftentimes do so without any scientific rationale, and the choice for either is typically not argued for. There are various reasons why this might be the case: one might only have access to abstract data (i.e., availability), one may want to keep the computational time to a minimum (i.e., feasibility), or one may want to reduce the pre-processing steps that are often necessary when dealing with full-text articles (i.e., simplicity). These pre-processing steps could include scraping the publishers’ repositories; converting PDF to plain text, either directly or with the aid of optical character recognition (OCR) software; or an increased boilerplate cleaning phase. Since the data is a crucial determinant within any knowledge discovery process, and since our main aim is to improve any topical content derived from it, we study the empirical effects abstracts and full-text data have on the derived latent topics. Fig. 1.6 highlights where RQ1 takes place: mainly the data and data selection phases, while also considering aspects of the pre-processing phase.
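As an illustration of the kind of pre-processing involved, a minimal cleaning step that lowercases, tokenizes, and removes stopwords might look as follows; the stopword list and the example sentence are hypothetical:

```python
import re

# A small, hypothetical stopword list; real pipelines use larger ones.
STOPWORDS = {"the", "of", "and", "in", "a", "is", "are", "for"}

def preprocess(text):
    """Lowercase, tokenize, and drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

doc = "The management of fisheries is a complex task for policy makers."
print(preprocess(doc))
# ['management', 'fisheries', 'complex', 'task', 'policy', 'makers']
```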

RQ2 — How does the hyper-parameterization of a topic model algorithm affect the quality of latent topics?

Many topic model algorithms allow for tuning and tweaking a variety of hyper-parameters that affect the model’s output. This is certainly the case for LDA. Since LDA is a Bayesian probabilistic topic model, one of its advantages is that it incorporates prior knowledge into the model before topic discovery starts (see Section 1.2.1). Although Bayesian statistics is often heralded for its ability to use prior knowledge in the statistical equation, this can simultaneously be seen as a disadvantage, as selecting an appropriate prior can be a difficult task in itself. For LDA, this prior knowledge is encoded in the two Dirichlet priors on (i) the distribution of topics within documents, and (ii) the distribution of words within topics. Effectively, such priors can be set to be symmetrical or asymmetrical, meaning that (i) the probability distribution of topics is evenly (symmetrically) or unevenly (asymmetrically) distributed within a document, and (ii) the probability distribution of words within a topic is evenly (symmetrically) or unevenly (asymmetrically) distributed. One might surmise that, for instance, the topics within a corpus are not all equally present, such as when one topic is more popular or gains more attention than another. In this case, a symmetrical prior distribution, thus an evenly distributed prior for topics in documents, can be a false assumption. The symmetrical Dirichlet prior, which is almost always the

default choice for researchers using LDA (Wallach et al., 2009), can thus, possibly, lead to incorrect or less optimal latent topics.

Little research has been performed on the effects of Dirichlet priors on the discovery of latent topics; a single study explores these effects for documents related to news content and patents (Wallach, 2006b). However, studies exploring the effects of priors on scientific articles have not been performed. RQ2 thus aims to (i) explore the empirical effects Dirichlet priors have on the quality of the uncovered latent topics, and (ii) simultaneously consider the effects on different parts of the scientific article (abstract and full-text). Fig. 1.6 shows where RQ2 fits into the overall KDD process: mainly in the data mining phase. Furthermore, incorporating appropriate prior knowledge within the KDD process—here through a statistical Bayesian prior—“ensures that useful knowledge is derived from the data” (Fayyad et al., 1996). With the main research question targeted at improving the understanding of latent topics, RQ2 is an important question to help achieve this goal.
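The practical effect of a Dirichlet prior's concentration on the sampled distributions can be illustrated with NumPy (toy parameter values): a sparse symmetric prior (values below 1) produces documents dominated by a few topics, whereas a dense prior spreads the probability mass more evenly:

```python
import numpy as np

rng = np.random.default_rng(7)
K = 10  # toy number of topics

# Sparse symmetric prior: each document concentrates on few topics.
sparse = rng.dirichlet(np.full(K, 0.1), size=2000)
# Dense symmetric prior: topic mass is spread out more evenly.
dense = rng.dirichlet(np.full(K, 5.0), size=2000)

# Average mass of each document's single largest topic:
print(sparse.max(axis=1).mean())  # close to 1: one topic tends to dominate
print(dense.max(axis=1).mean())   # much closer to 1/K
```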

RQ3 — Can we assess the quality of latent topics using a semi-automatically constructed list of semantically related words?

LDA outputs each topic as a discrete probability distribution over all the words in the vocabulary. Thus, each word within a topic has a certain probability assigned to it, and when sorted by probability value, the words with the highest probabilities reveal the semantic meaning of that topic (see Section 1.2.1). The use of coherence scores is a typical approach to quantify the quality of the words within topics (Aletras and Stevenson, 2013; Newman et al., 2010b; Röder et al., 2015; Stevens et al., 2012), and is also the approach adopted in this thesis. To assess the quality of topics, the gold standard approach is human topic ranking (Lau et al., 2011). However, the human ranking approach is oftentimes too expensive or time-consuming. Coherence scores have been shown to correlate well with human ranking data (Chang et al., 2009), and can thus be considered an appropriate and adequate approach when human ranking is not feasible.
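As a simplified illustration of the idea behind such coherence scores (the toy documents below are hypothetical, and the formula is a sketch of one co-occurrence-based measure, not the exact scores used in the cited work): top words that frequently co-occur in documents score higher than top words that never do:

```python
import math
from itertools import combinations

# Toy document collection, each document reduced to its set of words.
docs = [
    {"fish", "stock", "management"},
    {"fish", "stock", "model"},
    {"model", "estimate", "parameter"},
    {"fish", "management", "policy"},
]

def doc_freq(*words):
    """Number of documents containing all of the given words."""
    return sum(1 for d in docs if all(w in d for w in words))

def coherence(top_words):
    """Sketch of a co-occurrence-based coherence score: sums the (smoothed)
    log ratio of joint to single document frequencies over word pairs."""
    return sum(
        math.log((doc_freq(w1, w2) + 1) / doc_freq(w1))
        for w1, w2 in combinations(top_words, 2)
    )

print(coherence(["fish", "stock"]))      # words that often co-occur
print(coherence(["fish", "parameter"]))  # words that never co-occur: lower
```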

In addition to coherence scores, RQ3 investigates whether a list of semantically related words, called a “semantic lexicon”, can be obtained to assess the quality of the words that constitute a latent topic (Igo and Riloff, 2009; Qadir et al., 2015; Thelen and Riloff, 2002; Ziering et al., 2013b). Ideally, such lists should be constructed from a separate, unrelated collection of textual data, such as content from the web. The semi-automatic process of creating a semantic lexicon is based on a bootstrapping process, which starts with an initial set of words (called “seed words”) and continues with an automatic process to find semantically related words. The semi-automatic bootstrapping process is necessary to steer the process towards a specific context or topic, as determined by the choice of seed words. In doing so, we aim to investigate whether semantic lexicons, which consist of semantically related words, can be used to assess the quality of latent topics. RQ3 thus aims to assess the quality of latent topics with a reference list of words, which can then help in “finding understandable patterns that can be interpreted as useful or interesting knowledge” (Fayyad et al., 1996). With respect to the KDD process, RQ3

can be mapped onto the evaluation phase, as shown in Fig. 1.6.
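The expand-from-seeds idea behind this bootstrapping can be sketched as follows; the corpus, the co-occurrence scoring, and all words are hypothetical stand-ins for the verb-similarity-based method used later in this thesis:

```python
# Toy corpus: each "sentence" is a list of words.
corpus = [
    ["salmon", "cod", "herring", "caught"],
    ["cod", "haddock", "herring", "sold"],
    ["salmon", "haddock", "river"],
    ["car", "road", "engine"],
]

def score(word, lexicon):
    """How often `word` occurs in a sentence with any current lexicon word."""
    return sum(1 for s in corpus if word in s and any(w in s for w in lexicon))

def bootstrap(seeds, iterations=2):
    """Grow a semantic lexicon from seed words, one best candidate per round."""
    lexicon = set(seeds)
    for _ in range(iterations):
        candidates = {w for s in corpus for w in s} - lexicon
        best = max(candidates, key=lambda w: score(w, lexicon))
        if score(best, lexicon) == 0:
            break  # no remaining candidate is related to the lexicon
        lexicon.add(best)
    return lexicon

print(bootstrap(["salmon", "cod"]))  # adds 'herring' and 'haddock', never 'car'
```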

Up to this point, RQ1, RQ2, and RQ3 all aim to improve the quality of the knowledge output derived within the KDD process. The subsequent three research questions are aimed at the interpretation of the derived patterns, the latent topics, and how such patterns can be turned into usable and, within the context of this thesis, also re-usable knowledge. We take the lessons learned from RQ1, RQ2 and, to a lesser extent, RQ3 as groundwork for a large-scale topic model analysis within the domain of fisheries science through answering RQ4–6.

RQ4 — How can we construct knowledge from latent topics derived from large collections of documents?

The primary result of a topic model analysis is a set of topics and, for each topic, a list of words with associated probabilities. This new representation, where documents are represented as distributions over topics, forms the basis for the evaluation step and, ultimately, the knowledge construction step within the KDD process (Fig. 1.6). This research question aims to interpret the latent topics derived from the data mining phase and to extract useful, relevant, and understandable knowledge from them. Several studies have already explored the latent topical contents of collections of scientific publications, and we certainly do not claim to be pioneers in this respect. However, we are the first to perform such a (large-scale) analysis within the domain of fisheries, and the first to use scientific rationale for the steps leading up to the evaluation phase. Also, we aim to go beyond exploring the latent structures of the documents and interpret the results in a broader context. That is, we aim to construct new knowledge and shed light on the ecological, social, economic, and institutional considerations within fisheries for increased fisheries sustainability (Boström, 2012; Dahl, 2012; Rindorf et al., 2017; Stephenson et al., 2018). Thus, RQ4 is aimed at the interpretation of latent topics for the construction of new knowledge, which is the final step in the KDD process (Fig. 1.6).

RQ5 — How can we construct knowledge from latent sub-topics derived from large collections of documents?

Within the KDD process, there is a possibility to utilize the results or outcomes of one step and return to a previous step to exploit this result. This process is indicated by the dashed line in Fig. 1.6 (below each consecutive step). RQ5 uses this possibility in the penultimate phase (evaluation and interpretation), making the output of that phase the input of a new KDD process. More concretely, we explore how we can utilize the latent topics of one KDD process to then re-run the KDD process with a subset of the original document collection. LDA lends itself to such an analysis, as the latent topics can be used to select a subset of the original set of documents (for instance, documents that mainly address a single topic). In doing so, we extract sub-topics from a main topic. The distinction between a main topic and a sub-topic can be made more explicit with an example. Imagine a collection of news articles. When performing a topic model analysis, we might find (latent) topics related to sports, politics, and the

economy. When isolating the articles covering (mainly) the sports topic, we can then take the collection of sports articles as our starting point. A new topic model analysis on that subset of articles could then reveal specific types of sports, such as football and basketball. Within the context of fisheries science, a main topic can be, for instance, fisheries management, with sub-topics covering aspects of the management process, such as the management of recreational fisheries.

RQ5, similar to RQ4, aims to construct new knowledge, but now on a sub-topical level. It provides a knowledge discovery process on a lower level (the sub-topics), in contrast to the higher level (the main topics) analysis performed in RQ4.
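The selection step for such a sub-topic analysis can be sketched as follows, assuming a hypothetical document-topic matrix from a first LDA run: each document is assigned its dominant topic, and the matching subset becomes the corpus for a second run:

```python
import numpy as np

# Hypothetical document-topic matrix from a first LDA run: 5 docs, 3 topics.
doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.65, 0.25, 0.10],
    [0.05, 0.15, 0.80],
    [0.55, 0.30, 0.15],
])

# Keep the documents whose dominant topic is topic 0 (say, "sports").
dominant = doc_topic.argmax(axis=1)
subset = np.where(dominant == 0)[0]
print(subset.tolist())  # [0, 2, 4]: feed these into a second LDA run
```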

RQ6 — How can we utilize knowledge derived from latent topics for a subsequent knowledge discovery process?

The traditional KDD process has knowledge itself as its final output. Although this appears to be the final step, this knowledge may very well be the start (or any other step) of a new (related or unrelated) KDD process. Extended KDD models capture this step more explicitly (Quick and Choo, 2014). RQ4 and RQ5 explore the latent topics extracted from fisheries science publications with the aim to explore and better understand the corpus, and shed light on particular domain-specific challenges, such as fisheries sustainability. Such analysis reveals interesting results, but we aim to combine these results with an additional analysis. RQ6 explores how we can use the discovery of latent topics (and the knowledge they contain) to enrich a new knowledge discovery process. Within the domain of fisheries, and within the context of the scientific articles, we aim to use latent topics to gain a better understanding of the knowledge production of fisheries science. More concretely, we utilize techniques from the field of social network analysis, particularly community detection algorithms, to explore its spatial, temporal, and, by enriching the analysis with a topic model, topical foci. In doing so, RQ6 aims to explore whether new knowledge can be constructed by combining topic model analysis with a new knowledge discovery process.

1.5 Research Methods

Within this thesis, different research methods are used to formulate answers to the research questions described in Section 1.4. We briefly describe the three major ones: computational experiments (used for RQ1–3), content analysis (used for RQ4–6), and social network analysis (used for RQ6).


1.5.1 Computational Experiment

A computational experiment concerns itself with the theoretical analysis and empirical testing of a computational method, such as approximation or optimization algorithms. The experiment is a set of tests run under controlled conditions for a specific purpose (Barr et al., 1995). Such a purpose can include demonstrating known truths, validating hypotheses, or examining the performance of something new. Typically, the effects and influences of controllable variables (factors) within an experiment are measured and studied on some phenomenon. Within computational experiments, the studied phenomena oftentimes entail the algorithm’s performance (examples described later). When new computational methods are presented, their contributions should be evaluated scientifically and reported in an objective manner, which is not always done (Barr et al., 1995). A proper experiment includes the following seven steps (Montgomery, 2012):

1. Recognition of and statement of the problem
2. Selection of the response variables
3. Choice of factors, levels, and ranges
4. Choice of experimental design
5. Performing the experiment
6. Statistical analysis of the data
7. Conclusions and recommendations

Recognition of and statement of the problem: The first step defines the purpose of the experiment and should be determined before performing the actual experiment. The objective is a statement of the questions to be answered and the reasons the experiment should be conducted. For computational experiments, two types can be discerned: (i) comparing different algorithms addressing the same class of problems, or (ii) characterizing an algorithm in isolation. The latter type is used within RQ1 and RQ2, with the objective to gain an understanding of the behavior of the algorithm (LDA) and the factors that influence its behavior (data, pre-processing, and priors). In both cases, the experiment is performed by singling out factors while fixing the remaining code, which is a prerequisite of a well-designed experiment (Barr et al., 1995). RQ3 fits into the former type, exploring different algorithms with the aim of constructing a high-quality semantic lexicon.

Selection of the response variables: The second step defines how to measure the performance of the algorithm under study. Here, performance is the response variable and can be determined or characterized in a number of ways. Typically, performance addresses the quality of the output, the time it takes to find a good or optimal solution, the robustness of the algorithm, or the tradeoff between feasibility and quality. Within


RQ1 and RQ2, the experiments are designed to compare the quality of the output, that is, the latent topics derived from the data (documents) under study, and for RQ3, the quality of the semantic lexicon. Additionally, robustness is addressed by experimenting with different datasets.

Choice of factors, levels, and ranges: The third step—often performed simultaneously with the second step—deals with the selection of factors that affect the performance of the algorithm within the computational experiment. Such factors can take on many forms, but within the context of RQ1–3, the factors include data types (abstract and full-text), data sources (journals and web), parametric distributions (Dirichlet priors), model parameters (number of topics) and, to a lesser extent, data representation as a result of different pre-processing steps.

Choice of experimental design and execution: The fourth and fifth steps concern the design and execution of the experiment, to ensure that appropriate test results can be collected and are suitable for statistical analysis (Mason et al., 2003). Additionally, the design should achieve the experimental goals and measure the response variable without bias. The variation and combination of factors (with their values) are explored with a grid search, effectively experimenting with a structured range of values, to allow for a wide array of measurements. By using a grid search, the aim is to find the combinations of factors that result in high-quality latent topics, or algorithmic variations that lead to high-quality semantic lexicons. Additionally, random initializations are used to balance out some effects of uncontrollable factors.
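Such a grid search can be sketched as follows; the factor levels and the placeholder scoring function are hypothetical:

```python
from itertools import product

# Hypothetical factor grid for the experiments described above.
data_types = ["abstract", "full-text"]
priors = ["symmetric", "asymmetric"]
num_topics = [10, 25, 50]

def run_experiment(data, prior, k, seed):
    """Placeholder: train a model with these settings and score its topics."""
    return 0.0  # e.g. a coherence score

results = []
for data, prior, k in product(data_types, priors, num_topics):
    # Several random initializations per grid cell, to balance out
    # uncontrollable factors such as the sampler's starting state.
    scores = [run_experiment(data, prior, k, seed) for seed in range(3)]
    results.append(((data, prior, k), sum(scores) / len(scores)))

print(len(results))  # 2 * 2 * 3 = 12 grid cells
```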

Statistical analysis of the data: Within the sixth step, statistical methods are used to analyze the data so that the results and conclusions are objective in nature. The statistical tests cannot prove causality; instead, they allow for an indication of the strength of the relationships between the factors and the performance measure. Typically, the evaluation of the algorithm is done through well-studied and accepted performance metrics, such as error or loss functions.

Conclusions and recommendations: The seventh and final step includes the derivation of practical conclusions. Besides statistical tests, graphical methods, such as visualizations, are used to present the results. Within this thesis, results obtained from RQ1–3 feed into RQ4–6, which are focused on the interpretation of latent topics for knowledge discovery. Moreover, the computational experiments serve as scientifically justified choices within a KDD process to convert latent topics into useful knowledge for the fisheries domain.

1.5.2 Quantitative Content Analysis

Several definitions of content analysis exist, but within the context of this thesis, we adopt the definition proposed by Neuendorf (2016): “Content analysis is a summarizing, quantitative analysis of messages that follows the standard of the scientific method

and is not limited as to the types of variables that may be measured or the context in which the messages are created or presented.” Simply put, content analysis is a research methodology to study and make inferences from text. The scientific method guards against threats to reliability, validity, generalizability, and replicability. The goal of (any) quantitative analysis is to produce counts or measurements of categories (such as themes or topics) or variables (such as individual words) (Fink, 2009). Within the context of this thesis, the adopted definition also allows for the measurement of latent variables (elements that are not explicitly present), in contrast to definitions where content is strictly bound to manifest content (elements that are physically present and countable) (Berelson, 1952). The summarization aspect of the definition deals with summarizing results (nomothetic approach), instead of reporting on all the details within the text (idiographic approach). Within nomothetic approaches, conclusions are broadly based, generalizable, objective, summarizing, and inflexible (Neuendorf, 2016).

A downside of (traditional) content analysis is the manual coding process. To provide answers to RQ4–6, we apply quantitative content analysis to fisheries science publications through probabilistic topic models (LDA). In doing so, we overcome the need for manual coding; this type of content analysis is commonly referred to as “computer-assisted content analysis” (Chuang et al., 2014). During this process, we utilize results obtained from RQ1–3 to increase the reliability and validity of the results.

1.5.3 Social Network Analysis

Within a social network, people are typically represented as circles (called nodes) and are connected through a type of relationship (called an edge). Such a relationship might entail a friendship (online or physical), a collegial work relationship or, within the context of this thesis, a scientific collaboration (co-authorship). The language used to represent social networks, the constellation of all nodes and edges, is the formal language of graph theory (West, 2000). In short, social network analysis (SNA) can be described as the “study of human relationships by means of graph theory” (Tsvetovat and Kouznetsov, 2011). More formally, SNA comprises a broad approach to sociological analysis and a set of methodological techniques that aim to describe and explore the patterns apparent in the social relationships that individuals and groups form with each other (Scott, 2017). The patterns are not limited to those derived from visualization only, but can include computational patterns as well, such as detecting communities in large networks (Clauset et al., 2004; Girvan and Newman, 2002; Newman, 2012a). SNA has also been shown to allow for a variety of extensions, for example, by modeling knowledge velocity and viscosity within networks (Helms and Buijsrogge, 2005).

To provide answers to RQ6, we utilize SNA to detect community structures found within a large social network created from authors (nodes) and their collaborations through co-authorship (edges).
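As a minimal illustration of how such a co-author network can be represented and how sub-networks can be extracted, consider the following sketch in plain Python (the author names are hypothetical). Note that this sketch only extracts connected components, the coarsest form of sub-network; the community detection algorithms cited above (e.g. Clauset et al., 2004) split such components further into denser communities.

```python
def build_coauthor_graph(papers):
    """Build an undirected co-author graph: nodes are authors,
    edges connect authors who share at least one paper."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())   # also register solo authors
        for a in authors:
            for b in authors:
                if a != b:
                    graph[a].add(b)
    return graph

def connected_subnetworks(graph):
    """Return the connected components of the graph via depth-first
    search; proper community detection would refine these further."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# hypothetical toy corpus: author lists of three papers
papers = [["Syed", "Spruit"], ["Syed", "Borit"], ["Weber"]]
g = build_coauthor_graph(papers)
comps = connected_subnetworks(g)
```

On this toy input, the three connected authors form one sub-network and the solo author forms another.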


1.6 Dissertation Outline

This thesis consists of eight chapters, of which Chapters 2–7 discuss the six research questions described in Section 1.4. Before outlining each chapter individually, we present an overview in Table 1.1 of Chapters 2–7 along with the research question each addresses, the corresponding publication, the employed research method, and a summary of the dataset that we constructed for the domain of fisheries science.

Table 1.1: High-level overview of Chapters 2–7 in relation to the research questions, the corresponding publications, the employed research methods, and descriptions of the datasets used.

Chapter 2 (RQ1)
  Publication: S. Syed and M. Spruit. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174, Tokyo, Japan, 2017. IEEE. doi: 10.1109/DSAA.2017.61
  Research method: Computational experiment
  Dataset: 4,417 abstracts; 4,417 full-text; 15,004 abstracts; 15,004 full-text

Chapter 3 (RQ2)
  Publication: S. Syed and M. Spruit. Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12(3):399–423, 2018b. doi: 10.1142/S1793351X18400184
  Research method: Computational experiment
  Dataset: 8,012 full-text; 4,417 abstracts

Chapter 4 (RQ3)
  Publication: S. Syed, M. Spruit, and M. Borit. Bootstrapping a Semantic Lexicon on Verb Similarities. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 1, pages 189–196. Scitepress, 2016. doi: 10.5220/0006036901890196
  Research method: Computational experiment
  Dataset: 300 websites

Chapter 5 (RQ4)
  Publication: S. Syed, M. Borit, and M. Spruit. Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19(4):643–661, 2018a. doi: 10.1111/faf.12280
  Research method: Quantitative content analysis
  Dataset: 46,582 full-text

Chapter 6 (RQ5)
  Publication: S. Syed and C. T. Weber. Using Machine Learning to Uncover Latent Research Topics in Fishery Models. Reviews in Fisheries Science & Aquaculture, 26(3):319–336, 2018. doi: 10.1080/23308249.2017.1416331
  Research method: Quantitative content analysis
  Dataset: 22,236 full-text

Chapter 7 (RQ6)
  Publication: S. Syed, L. ni Aodha, C. Scougal, and M. Spruit. Mapping the global network of fisheries science collaboration. Reinforcing or broad-based structures of knowledge production? (submitted for publication). 2018b
  Research method: Quantitative content analysis; social network analysis
  Dataset: 73,240 abstracts; 106,137 authors; 100,175 affiliations


Chapter 1 — Introduction

The introduction places this thesis into context and highlights the relevance of this work. We start with an introduction of the challenges of unstructured data, a description of the Knowledge Discovery in Databases (KDD) process, the technique of topic modeling and specifically Latent Dirichlet Allocation, and a brief description of the fisheries domain in which this thesis is embedded. It then provides an overview of the research questions and employed research methods.

Chapter 2 — Full-Text or Abstract? Examining Topic Coherence Scores Using La- tent Dirichlet Allocation

This chapter answers RQ1, which constitutes the basis of the data selection phase within a KDD process. Through computational experiments, we examine the effects that abstract and full-text data have upon the quality of the derived latent topics. Different datasets in terms of size and scope are analyzed. The quality of the derived latent topics is measured by coherence scores, a quantitative metric that correlates well with the qualitative human interpretation, which is considered the gold standard in topic modeling. We also evaluate the quality of topics using a domain expert evaluation. Furthermore, we provide guidelines for the data pre-processing step that help increase the quality of latent topics, which is a crucial second step within the KDD process.

Published as: S. Syed and M. Spruit. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174, Tokyo, Japan, 2017. IEEE. doi: 10.1109/DSAA.2017.61

Chapter 3 — Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation

This chapter answers RQ2 and deals with hyper-parameter optimization within the KDD data mining phase. Through computational experiments, we examine the effects of symmetrical and asymmetrical Dirichlet priors on the quality of the derived latent topics, concerning both the topics within documents and the words within topics. As a result, a total of four different combinations of Dirichlet priors are studied concurrently, with a randomized and grid-search approach to allow for a wide array of experimental results. We furthermore evaluate the quality of derived latent topics using domain expert evaluation. Additionally, by conducting computational experiments on different datasets (in size and scope), we aim to enhance generalizability.

Published as: S. Syed and M. Spruit. Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12(3):399–423, 2018b. doi: 10.1142/S1793351X18400184

A short version of this chapter was published as: S. Syed and M. Spruit. Selecting Priors for Latent Dirichlet Allocation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pages 194–202, Laguna Hills, CA, USA, 2018a. IEEE. doi: 10.1109/ICSC.2018.00035

Chapter 4 — Bootstrapping a Semantic Lexicon on Verb Similarities

This chapter answers RQ3 and examines the construction of a semantic lexicon, a list of semantically related words. We conduct computational experiments by comparing different algorithms with the aim of creating semantic lexicons from a list of initial seed words. The semantic lexicon can serve as a validation and assessment instrument for the quality of latent topics, which can be mapped to the evaluation step within the KDD process. Here, we aim to develop optimized algorithms that facilitate such evaluation, rather than performing the actual validation step.

Published as: S. Syed, M. Spruit, and M. Borit. Bootstrapping a Semantic Lexicon on Verb Similarities. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 1, pages 189–196. Scitepress, 2016. doi: 10.5220/0006036901890196

Chapter 5 — Narrow Lenses for Capturing the Complexity of Fisheries: A Topic Analysis of Fisheries Science from 1990 to 2016

This chapter answers RQ4 and deals with the knowledge discovery process concerning the obtained latent topics (patterns) from fisheries science publications. We apply quantitative content analysis to understand the knowledge captured in latent topics from 46,582 full-text articles from 21 fisheries journals over a period of 26 years (1990–2016). In doing so, we intend to understand the topical and temporal dynamics found within 26 years of scientific output. The derived knowledge from latent topics will aid in a better understanding of fisheries sustainability by quantifying the underlying dimensions (e.g., social, ecological, economic and institutional) within fisheries science.

Published as: S. Syed, M. Borit, and M. Spruit. Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19(4):643–661, 2018a. doi: 10.1111/faf.12280

Chapter 6 — Using Machine Learning to Uncover Latent Research Topics in Fishery Models

This chapter answers RQ5 and deals with the knowledge discovery process concerning the obtained latent sub-topics; this is in contrast to Chapter 5, which deals with the broader and more general topics. Quantitative content analysis is employed to examine 22,236 full-text publications from 13 fisheries journals over a time span of 26 years (1990–2016). The sub-topics are derived through a re-iteration of the KDD process, where latent topics are used to produce latent sub-topics. In doing so, we strive to

improve the understanding of sub-topics within the domain of fisheries science and, more specifically, the science regarding fisheries models. Similar to Chapter 5, we additionally aim to quantify fisheries dimensions to better understand the aspects of fisheries sustainability, although our focus here is specifically on fisheries models.

Published as: S. Syed and C. T. Weber. Using Machine Learning to Uncover Latent Research Topics in Fishery Models. Reviews in Fisheries Science & Aquaculture, 26 (3):319–336, 2018. doi: 10.1080/23308249.2017.1416331

Chapter 7 — Mapping the Global Network of Fisheries Science Collaboration

This chapter answers RQ6 by utilizing the knowledge output from one KDD process and, through a feedback loop, uses that knowledge to enrich a new KDD process. Within the first KDD process, we employ quantitative content analysis through probabilistic topic models to analyze 73,240 fisheries publications from 50 fisheries journals over a time span of 17 years (2000–2017). The first step uncovers the latent topical structure of the constructed corpus, which provides an overview of the latent topics found within all the documents, and the topical distribution of each individual document. In the second step, we employ techniques from the field of social network analysis to construct a co-author network of 106,137 authors from 100,175 different affiliations. We use community detection algorithms to detect hidden communities (sub-networks) within the larger co-author network. Besides studying spatial (through affiliation data) and temporal (through publication years) characteristics of the network, we enrich the new knowledge discovery process with a topic model analysis.

Submitted for publication: S. Syed, L. ni Aodha, C. Scougal, and M. Spruit. Mapping the global network of fisheries science collaboration. Reinforcing or broad-based structures of knowledge production? (submitted for publication). 2018b

Chapter 8 — Conclusions

In this final chapter of the thesis, we answer our research questions based on the results obtained in the previous chapters. We discuss the results and findings, their implications, and the limitations; we also provide directions for future research and highlight some personal reflections.

Chapter 2

Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation

This paper assesses topic coherence and human topic ranking of uncovered latent topics from scientific publications when utilizing the topic model latent Dirichlet allocation (LDA) on abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis that states that words with similar meaning tend to co-occur within a similar context. Although LDA has gained much attention from machine-learning researchers, most notably with its adaptations and extensions, little is known about the effects of different types of textual data on generated topics. Our research is the first to explore these practical effects and shows that document frequency, document word length, and vocabulary size have mixed practical effects on topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noise terms being part of the topic-word distributions, causing topics to be more coherent and ranked higher. Differences between abstract and full-text data are more apparent within small document collections, with differences as large as 90% high-quality topics for full-text data, compared to 50% high-quality topics for abstract data.

This work was originally published as:

S. Syed and M. Spruit. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174, Tokyo, Japan, 2017. IEEE. doi: 10.1109/DSAA.2017.61


2.1 Introduction

There is an ever-growing amount of scientific literature with which scientists must grapple and which threatens to overwhelm their capacity to stay up to date with new research (Larsen and von Ins, 2010). As a consequence, increased availability of tools and algorithms is necessary to match the ever-growing scientific output (Boyack and Klavans, 2014). These tools and algorithms could aid in exploring large document collections in alternative and structured new ways in contrast to traditional searches. This is especially important as the topics within articles, the main ideas within articles that can be shared among similar articles, cannot always be detected through traditional keyword searches (Srivastava and Sahami, 2009).

Topic models are machine-learning algorithms to uncover hidden or latent thematic structures (i.e. topics) from large collections of documents (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003; Blei and Lafferty, 2009). The latent thematic structures automatically emerge from the statistical properties of the documents and, as such, no prior labeling or annotation is necessary. In turn, the thematic structures can be used to automatically categorize or summarize documents at a scale that would be impossible to achieve manually. Topic modeling approaches have proved to be very helpful in elucidating the key ideas within a set of documents (Griffiths and Steyvers, 2004; Grimmer and Stewart, 2013; Rusch et al., 2013), and they do so with a speed and quantitative rigor that a traditional narrative review cannot match (Grimmer and Stewart, 2013).

One of the most popular and highly researched topic models is latent Dirichlet allocation (LDA) (Blei et al., 2003). LDA is a generative probabilistic topic model that overcomes the limitations of other well-known topic model algorithms such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990) and probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999). LDA models documents as multinomial distributions over K latent topics, and each topic is modeled as a multinomial distribution over the fixed vocabulary V. As such, LDA captures the heterogeneity of research topics or ideas within scientific publications and can be viewed as a mixed membership model (Erosheva et al., 2004).

Utilizing LDA to uncover latent topics from textual data has been successfully applied in several research domains. Griffiths and Steyvers (2004) performed LDA on 28,154 abstracts of the journal Proceedings of the National Academy of Sciences (PNAS) to uncover topics and to illustrate their relation to the journal’s categorization scheme. Gatti et al. (2015) used LDA on 80,757 abstracts from 37 primary journals from the fields of operations research and management science (OR/MS) to gain insight into the historical and current publication trends. A similar approach was performed within the field of transportation research on 17,163 abstracts from 22 leading transportation journals (Sun and Yin, 2017) and within the field of conservation science on 9,834 abstracts (Westgate et al., 2015). Besides being performed on abstract data, LDA has also been applied to 12,500 full-text research articles within the field of computational linguistics (Hall et al., 2008), 2,326 articles from Neural Information Processing Systems papers (NIPS) (Wang and McCallum, 2006), and 1,060 articles within agricultural and resource economics (Alston and Pardey, 2016).

However, the reason for choosing abstract data over full-text data, or vice versa, when using LDA has not been argued for. Although some researchers (e.g. Gatti et al., 2015) mention that abstract data is likely to contain a high density of words, thus making it suitable for LDA, others simply mention the dataset without explaining the rationale for the choice. There are various reasons why this might be the case: one might simply only have access to abstract data (i.e. availability), one may want to keep the computational time to a minimum (i.e. feasibility), or one may want to reduce the pre-processing steps that are often necessary when dealing with full-text articles (i.e. simplicity). These pre-processing steps could include scraping the publishers’ repositories; converting PDF to plain text, either directly or with the aid of optical character recognition (OCR) software; or an increased boilerplate cleaning phase. However, a more scientific rationale is required to aid in the choice of abstract or full-text data when uncovering latent topics with LDA.

This research is the first to explore the practical effects of choosing abstract or full-text data when uncovering latent topics with LDA. In particular, it shows the practical effects when revealing latent semantic structures from documents concerning scientific research publications. The differences between topics are calculated with a topic coherence measure (Aletras and Stevenson, 2013; Röder et al., 2015; Stevens et al., 2012; O’Callaghan et al., 2015) that shows, in contrast to the likelihood of held-out data, a higher correlation with human topic ranking data, the gold standard for topic interpretability. The underlying idea of topic coherence is rooted in the distributional hypothesis of linguistics (Harris, 1954): words with similar meanings tend to occur in similar contexts. Additionally, we use the knowledge of a domain expert to rank topics, thus providing, along with topic coherence, a comparison of topic quality from a human perspective.

2.2 Background

2.2.1 Latent Dirichlet Allocation

LDA is a generative probabilistic topic model that aims to uncover latent or hidden thematic structures from a corpus D. The latent thematic structure, expressed as topics and topic proportions per document, is represented by hidden variables that LDA posits onto the corpus. The generative nature of LDA describes an imaginary random process, based on probabilistic sampling rules, from which the documents are assumed to come. However, we only observe the words within documents and need to infer the hidden structure, that is, the topics and topic proportions per document, by applying

statistical inference techniques. This process aims to answer the question: Which hidden structure or topic model is most likely to have generated these documents? In doing so, we obtain the posterior distribution that captures the hidden structure given the observed documents. The generative process is defined as follows:

1. For every topic k ∈ {1, ..., K}:
   (a) draw a distribution over the vocabulary V: β_k ∼ Dir(η)
2. For every document d:
   (a) draw a distribution over topics: θ_d ∼ Dir(α) (i.e. the per-document topic proportions)
   (b) for each word position n within document d:
      i. draw a topic assignment: z_{d,n} ∼ Mult(θ_d), where z_{d,n} ∈ {1, ..., K} (i.e. the per-word topic assignment)
      ii. draw a word: w_{d,n} ∼ Mult(β_{z_{d,n}}), where w_{d,n} ∈ {1, ..., V}
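The generative steps above can be simulated directly. A minimal numpy sketch, with toy values for the number of topics K, vocabulary size V, number of documents D, and document length N (all chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(42)
K, V, D, N = 3, 50, 4, 20          # topics, vocabulary size, documents, words/doc
eta, alpha = 0.1, 0.5              # Dirichlet hyperparameters

# step 1: one distribution over the vocabulary per topic, beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V), rows sum to 1

corpus = []
for d in range(D):
    # step 2a: per-document topic proportions, theta_d ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    doc = []
    for n in range(N):
        # step 2b-i: per-word topic assignment, z ~ Mult(theta)
        z = rng.choice(K, p=theta)
        # step 2b-ii: word drawn from that topic's distribution, w ~ Mult(beta_z)
        w = rng.choice(V, p=beta[z])
        doc.append(w)
    corpus.append(doc)
```

Running this yields D documents of N word ids each; inference then works backwards from such observed words to the hidden beta, theta, and z.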

Each topic β_k is a multinomial distribution over the vocabulary V and comes from a Dirichlet distribution β_k ∼ Dir(η). Additionally, every document is represented as a distribution over K topics that comes from a Dirichlet distribution θ_d ∼ Dir(α). The Dirichlet parameter α denotes the smoothing of topics within documents, and η denotes the smoothing of words within topics. The joint distribution of all the hidden variables β_K (topics), θ_D (per-document topic proportions), z_D (word topic assignments), and observed variables w_D (words in documents) is expressed by (2.1):

p(β_K, θ_D, z_D, w_D | α, η) = ∏_{k=1}^{K} p(β_k | η) ∏_{d=1}^{D} p(θ_d | α) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β_K)    (2.1)

Figure 2.1 shows the LDA probabilistic graphical model in plate notation (Buntine, 1994), where the unshaded nodes represent the hidden random variables, the shaded nodes the observed random variables, and the edges the conditional dependencies between them. The rectangles, called plates, represent replication. The graphical model is equivalent to the joint probability of all the hidden and observed variables expressed in (2.1). We have K topics β_K (K-plate) as distributions over words depending on the Dirichlet parameter η, i.e. ∏_{k=1}^{K} p(β_k | η). For all D documents (D-plate) we have a per-document topic proportion θ_d depending on the Dirichlet parameter α, i.e. ∏_{d=1}^{D} p(θ_d | α). Finally, for all N words (N-plate) of a document d ∈ D, the per-word topic assignment z_{d,n} depends on the previously drawn per-document topic proportion θ_d, and the drawn word w_{d,n} depends on the per-word topic assignment z_{d,n} and all the topics β_K, i.e. ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β_K), where we retrieve the probability of w_{d,n} from the K × V topic matrix given z_{d,n}.


Figure 2.1: LDA represented as a graphical model in which the nodes denote the random variables and the edges the dependencies between them. Unshaded nodes are unobserved or hidden variables and the shaded nodes represent the observed random variables. The boxes, called plates, indicate replication.

The per-word topic assignments, the per-document topic distributions, and the topics are the latent variables and are not observed. We must condition on the only observed variables, the words within the documents, to infer the hidden structure through statistical inference. This can be viewed as a reversal of the generative process. The conditional probability, also known as the posterior, is expressed by (2.2):

p(β_K, θ_D, z_D | w_D) = p(β_K, θ_D, z_D, w_D) / p(w_D)    (2.2)

Unfortunately, computation of the posterior is intractable due to the denominator (Blei et al., 2003). The marginal probability p(w_D) is the sum of the joint distribution over all instantiations of the hidden structure and is exponentially large (Blei, 2012). Although the posterior cannot be computed exactly, a close enough approximation to the true posterior can be achieved with statistical posterior inference. Two main types of inference techniques can be discerned: sampling-based algorithms (e.g. Newman et al., 2007; Porteous et al., 2008) and variational-based algorithms (e.g. Blei and Jordan, 2006; Teh et al., 2006; Wang et al., 2011). It is important to note that both variational and sampling-based algorithms provide similarly accurate results (Asuncion et al., 2012).


2.2.2 Topic Coherence Measurement

After approximating LDA’s posterior distribution, the K topics are represented as multinomial distributions over V. Each topic distribution contains every word but assigns a different probability to each of the words. The words within topics with high probability are words that tend to co-occur more frequently. These high-probability words, usually the top 10 or top 15, are used to interpret and semantically label the topics. However, LDA outputs as many topics as are defined by K: a low K results in too few or very broad topics, whereas a high K results in uninterpretable topics or topics that ideally should have been merged. Choosing the right value of K is thus an important task in topic modeling algorithms, including LDA.
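As a small illustration of this interpretation step, the following sketch extracts the top-N words from a toy topic-word matrix (the vocabulary and probabilities here are invented for illustration, not taken from any fitted model):

```python
import numpy as np

# toy topic-word matrix: 2 topics over a 5-word vocabulary (rows sum to 1)
vocab = ["fish", "stock", "model", "catch", "river"]
beta = np.array([
    [0.40, 0.30, 0.05, 0.20, 0.05],
    [0.05, 0.10, 0.50, 0.05, 0.30],
])

def top_words(beta, vocab, n=3):
    """Return the n highest-probability words per topic, the usual
    basis for semantically labeling topics by hand."""
    return [[vocab[i] for i in np.argsort(row)[::-1][:n]] for row in beta]

print(top_words(beta, vocab))
```

Here the first topic would likely be labeled something like "fish stocks and catches" and the second "modeling", based purely on their most probable words.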

Measures such as the predictive likelihood of held-out data (Wallach et al., 2009) have been proposed to evaluate the quality of generated topics. However, such a measure correlates negatively with human interpretability (Chang et al., 2009), making topics with high predictive likelihood less coherent from a human perspective. This is especially important when generated topics are used for browsing document collections or for understanding the trends and developments within a particular research field. As a result, researchers have proposed topic coherence measures, an automated approach to assessing the coherence of a topic (Aletras and Stevenson, 2013; Newman et al., 2010a). The underlying idea is rooted in the distributional hypothesis of linguistics (Harris, 1954): words with similar meanings tend to occur in similar contexts. The topics are considered to be coherent if all or most of the words, for example, the topic’s top-N words, are related. The computational challenge is to obtain a measure that correlates highly with human topic ranking data, such as topic ranking data obtained by word and topic intrusion tests (Chang et al., 2009). Human topic ranking data are often considered to be the gold standard, and consequently a measure that correlates well is a good indicator of topic interpretability. A recent study by Röder et al. (2015) systematically and empirically explored the multitude of topic coherence measures and their correlation with available human topic ranking data. New coherence measures, obtained by combining existing elementary components, were explored as well. Their systematic approach revealed a previously unexplored coherence measure, which they labeled C_V, that achieves the highest correlation with all available human topic ranking data. As a result, this study adopts the C_V coherence measure for topic coherence calculations.

C_V is based on four parts: (i) segmentation of the data into word pairs, (ii) calculation of word and word-pair probabilities, (iii) calculation of a confirmation measure that quantifies how strongly a word set supports another word set, and finally (iv) aggregation of individual confirmation measures into an overall coherence score.

(i) Data segmentation pairs each of the topic’s top-N words with every other top-N word. Let W = {w_1, ..., w_N} be the set of a topic’s top-N most probable words, S_i a segmented pair of a word W′ ∈ W paired with all other words W* ⊆ W, and S the set of all pairs, defined as S = {(W′, W*) | W′ = {w_i}; w_i ∈ W; W* = W}. For example, if W = {w_1, w_2, w_3}, then a pair S_i = (W′ = {w_1}, W* = {w_1, w_2, w_3}). Such segmentation measures the extent to which the subset W* supports, or conversely undermines, the subset W′ (Douven and Meijs, 2007).

(ii) Probabilities of single words p(w_i) or joint probabilities of two words p(w_i, w_j) can be estimated by Boolean document calculation, that is, the number of documents in which w_i (or both w_i and w_j) occurs, divided by the total number of documents. The Boolean document calculation, however, ignores the frequencies and distances of words. C_V instead incorporates a Boolean sliding window calculation, in which a new virtual document is created for every window of size s when sliding over the document at a rate of one word token per step. For example, document d_1 with word tokens w_1, w_2, ... results in virtual documents d′_1 = {w_1, ..., w_s} and d′_2 = {w_2, ..., w_{s+1}}, and so on. The probabilities p(w_i) and p(w_i, w_j) are subsequently calculated from the total number of virtual documents. In contrast to the Boolean document calculation, the Boolean sliding window calculation captures word token proximity to some degree.
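The Boolean sliding window estimation can be sketched in a few lines of plain Python, assuming tokenized documents and a window size s (toy data below):

```python
def sliding_window_probs(documents, s=2):
    """Estimate p(w) and p(w_i, w_j) from Boolean sliding windows:
    each window of size s over a document is one virtual document."""
    windows = []
    for doc in documents:
        if len(doc) <= s:
            windows.append(set(doc))          # short docs yield one window
        else:
            for i in range(len(doc) - s + 1):
                windows.append(set(doc[i:i + s]))
    total = len(windows)

    def p(*words):
        # fraction of virtual documents containing all given words
        return sum(all(w in win for w in words) for win in windows) / total

    return p

p = sliding_window_probs([["a", "b", "a", "c"]], s=2)
# windows: {a,b}, {b,a}, {a,c}  ->  p("a") = 3/3, p("a","b") = 2/3
```

The same `p` closure then feeds both the single-word and joint probabilities needed for the NPMI calculation in step (iii).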

(iii) For every S_i = (W′, W*), we calculate a confirmation measure φ that quantifies how strongly W* supports W′, based on the similarity of W′ and W* in relation to all the words in W. To calculate this similarity, W′ and W* are represented as context vectors (Aletras and Stevenson, 2013) as a means to capture the semantic support of all the words in W. These vectors v⃗(W′) and v⃗(W*) are created by pairing them to all words in W, as exemplified in (2.3). The agreement between individual words w_i and w_j is calculated via normalized pointwise mutual information (NPMI), as shown in (2.4). NPMI, in contrast to pointwise mutual information (PMI), shows a higher correlation with human topic ranking data (Bouma, 2009). Additionally, ε is used to avoid the logarithm of zero, and γ places more weight on higher NPMI values.

The confirmation measure φ of a pair S_i is obtained by calculating the cosine similarity of its context vectors, with u⃗ = v⃗(W′) and w⃗ = v⃗(W*), as expressed in (2.5).

v⃗(W′) = { Σ_{w_i ∈ W′} NPMI(w_i, w_j)^γ }_{j=1,...,|W|}    (2.3)

NPMI(w_i, w_j)^γ = ( log( (P(w_i, w_j) + ε) / (P(w_i) · P(w_j)) ) / (−log(P(w_i, w_j) + ε)) )^γ    (2.4)

φ_{S_i}(u⃗, w⃗) = ( Σ_{i=1}^{|W|} u_i · w_i ) / ( ‖u⃗‖₂ · ‖w⃗‖₂ )    (2.5)

(iv) The final coherence score is the arithmetic mean of all confirmation measures φ.
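Equations (2.4) and (2.5), together with the aggregation step (iv), can be sketched as follows. This is a simplified stand-alone version in plain Python; the full C_V pipeline additionally wires in the segmentation and sliding-window probability estimation of steps (i) and (ii):

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12, gamma=1.0):
    """NPMI of a word pair, with eps guarding log(0) and the result
    raised to the power gamma, as in (2.4)."""
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return (pmi / -math.log(p_ij + eps)) ** gamma

def cosine(u, w):
    """Cosine similarity between two context vectors, as in (2.5)."""
    dot = sum(a * b for a, b in zip(u, w))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in w))
    return dot / norm

def coherence(confirmations):
    """Step (iv): arithmetic mean of all confirmation measures."""
    return sum(confirmations) / len(confirmations)
```

For statistically independent words, p(w_i, w_j) = p(w_i) p(w_j) and the NPMI is (numerically) zero; identical context vectors give a cosine confirmation of 1.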


2.3 Methodology

2.3.1 The Experiment

This paper explores uncovered latent topics and their topic coherence scores, a proxy for topic quality, when applying LDA to abstract and full-text data. Besides topic coherence, we explore the effects on human topic ranking, often considered the gold standard for topic interpretability, for topics uncovered from abstract and full-text data. In doing so, we explore the practical effects that document type, and more specifically word length, vocabulary size, and document frequency, have on the coherence and interpretability of LDA topics.

2.3.2 Dataset

Two datasets were created that contain abstract and full-text data: DS1 contains 4,417 research articles (1996 to 2016) from the journal Canadian Journal of Fisheries and Aquatic Sciences, and DS2 contains 15,004 research articles (2000 to 2016) from 12 top-tier fisheries journals: Canadian Journal of Fisheries and Aquatic Sciences, Fish and Fisheries, Fisheries, Fisheries Management and Ecology, Fisheries Oceanography, Fisheries Research, Fishery Bulletin, Marine and Coastal Fisheries, North American Journal of Fisheries Management, Reviews in Fish Biology and Fisheries, Reviews in Fisheries Science, and Transactions of the American Fisheries Society. Note that DS1 ⊂ DS2 for Y = 2000 to 2016. Because the articles were downloaded in full text, regular expressions were used to extract the abstracts from the full-text articles.
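A hedged sketch of such a regex-based extraction follows; the exact pattern used for this chapter is not specified, so the section headings assumed here ("Abstract" and "Introduction") are purely illustrative:

```python
import re

# Illustrative only: assumes the abstract sits between the headings
# "Abstract" and "Introduction" in the plain-text article.
ABSTRACT_RE = re.compile(
    r"Abstract\s*(.*?)\s*(?:1\.?\s*)?Introduction",
    re.DOTALL | re.IGNORECASE,
)

def extract_abstract(full_text):
    """Return the text between the assumed headings, or None."""
    match = ABSTRACT_RE.search(full_text)
    return match.group(1).strip() if match else None

doc = "Title\nAbstract\nWe study fish stocks.\n1 Introduction\nFisheries..."
print(extract_abstract(doc))  # "We study fish stocks."
```

In practice, heading conventions vary across journals, so a per-journal pattern (or a fallback chain of patterns) would be needed.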

The DS1 dataset relates to studies where a single domain-specific journal was analyzed (e.g. Wang and McCallum, 2006), and DS2 to studies where LDA was used to uncover topics from a multitude of related domain-specific journals (e.g. Gatti et al., 2015; Sun and Yin, 2017; Hall et al., 2008). The two datasets allow for a comparison not only of abstract and full-text data but also of corpus size (i.e. the number of scientific publications). An overview of DS1 and DS2 is given in Table 2.1, and histograms of token and vocabulary (i.e. distinct words) frequencies are displayed in Fig. 2.2.

The choice of these journals was based on two factors: (i) they are domain-specific journals but cover a broad scope of research topics from the field of fisheries, and (ii) a fisheries domain expert was available to manually label and rank the topics as an alternative means of assessing the quality of topics. Furthermore, a domain-specific journal might increase the generalizability to other domain-specific journals (e.g. journals in the domain of social psychology or resource economics) compared to a more general or broadly oriented journal such as Nature, Science, or PLOS ONE. The domain of fisheries includes a multitude of knowledge production approaches, from mono- to


Figure 2.2: Histograms of token and vocabulary frequencies for DS1 and DS2 for both abstract and full-text data: (a) DS1 abstract, (b) DS1 full-text, (c) DS2 abstract, (d) DS2 full-text. Panels (b) and (d) contain very long tails for the number of tokens (up to 18,000).


Table 2.1: Overview of the DS1 and DS2 datasets, where J = number of journals; Y = time range; D = number of documents; Nd = mean document length; N = number of tokens; V = vocabulary size.

      DS1                       DS2
      Abstract    Full-text     Abstract     Full-text
J     1           1             12           12
Y     1996-2016   1996-2016     2000-2016    2000-2016
D     4,417       4,417         15,004       15,004
Nd    108.94      3,855.36      123.7        3,850.78
N     481,168     17,029,133    1,856,700    57,777,025
V     14,643      142,852       25,781       379,116

transdisciplinary. Biologists, oceanographers, mathematicians, computer scientists, anthropologists, sociologists, political scientists, economists, and researchers from many other disciplines contribute to the body of knowledge of fisheries, together with non-academic participants such as decision makers and stakeholders. Within the domain of fisheries, text analytics techniques have only been applied in a small number of cases (e.g. Jarić et al., 2012; Syed et al., 2016).

All research articles were downloaded from the journals' repositories and converted from PDF to plain text. Full-text data and abstract data were tokenized, and single-character words, numbers, and punctuation marks were removed. Furthermore, we removed all single-occurrence words, words that occurred in more than 90% of the documents, and words that belonged to a standard English stop word list (n = 153). Apart from grouping lowercase and uppercase words, no normalization method such as stemming or lemmatization was applied to reduce inflectional and derivational word forms to a common base form; stemming algorithms can be overly aggressive and could produce unrecognizable words that reduce interpretability when labeling the topics. Stemming might also lead to another problem, namely that it cannot be deduced whether a stemmed word comes from a verb or a noun (Evangelopoulos et al., 2012).
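The filtering steps above can be sketched as follows. This is a minimal illustration on a toy corpus; the stop word list and thresholds stand in for the actual ones used in the study:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "in", "a"}  # stand-in for the standard English list (n = 153)

def preprocess(documents, max_doc_freq=0.9):
    """Tokenize and filter documents as described: drop single-character
    words, numbers, punctuation, stop words, single-occurrence words, and
    words appearing in more than 90% of the documents."""
    # Lowercase and keep alphabetic tokens longer than one character
    tokenized = [
        [t for t in re.findall(r"[a-z]+", doc.lower())
         if len(t) > 1 and t not in STOP_WORDS]
        for doc in documents
    ]
    # Corpus-wide term frequency: used to remove single-occurrence words
    term_freq = Counter(t for doc in tokenized for t in doc)
    # Document frequency: used to remove words present in > 90% of documents
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n_docs = len(documents)
    keep = {
        t for t in term_freq
        if term_freq[t] > 1 and doc_freq[t] / n_docs <= max_doc_freq
    }
    return [[t for t in doc if t in keep] for doc in tokenized]

docs = [
    "Fish stocks of the Atlantic cod were sampled in 1996.",
    "Cod stocks in the Atlantic show fish population decline.",
    "Salmon fish migration in the Atlantic was observed.",
]
print(preprocess(docs))  # → [['stocks', 'cod'], ['cod', 'stocks'], []]
```

Note how "fish" and "atlantic" are dropped by the 90% document-frequency rule, while "were" and "salmon" fall to the single-occurrence rule.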

2.3.3 Creating LDA Models

For both datasets, and for both abstract and full-text data, we created 40 different LDA models by varying the K parameter (i.e. the number of topics) from 1 to 40 and repeating this process three times (4 × 120 = 480 LDA models in total). The Dirichlet parameters are set to be symmetrical for the smoothing of words within topics (η = 1/V) and topics within documents (α = 1/K). By keeping α < 1, the modes of the Dirichlet distribution are close to the corners of the simplex, thus favoring just a few topics for every document and leaving the larger part of the topic proportions very close to zero. The LDA models are created using the Python library Gensim (Rehurek and Sojka, 2010). Gensim approximates the posterior with a variational inference method called online LDA (Hoffman et al., 2010). The convergence iteration parameter for the expectation step (E-step), the part where per-document parameters are fit for the variational distributions, is set to 100 [see Algorithm 2 in Hoffman et al. (2010)].
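This model-creation step can be sketched with Gensim as follows. The toy corpus and hyperparameter values are illustrative; the symmetric scalar priors α = 1/K and η = 1/V mirror the description above:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy preprocessed corpus; the thesis used 4,417 / 15,004 articles.
texts = [
    ["salmon", "river", "migration", "spawning"],
    ["stock", "assessment", "model", "recruitment"],
    ["salmon", "spawning", "river", "habitat"],
    ["model", "stock", "estimation", "recruitment"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

K = 2                # number of topics (varied from 1 to 40 in the study)
V = len(dictionary)  # vocabulary size

# Symmetric Dirichlet priors: alpha = 1/K (document-topic), eta = 1/V (topic-word)
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=K,
    alpha=1.0 / K,
    eta=1.0 / V,
    passes=10,
    random_state=42,
)
for topic_id in range(K):
    print(topic_id, lda.show_topic(topic_id, topn=4))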

2.3.4 Topic Coherence

For every LDA model created (480 in total), we calculated the CV coherence score as explained in Section 2.2.2. Segmentation into word pairs is obtained by pairing every word from the top 15 words with every other word from the top 15 words. In some cases, coherence calculations are based on the top 10 most probable words. However, as no stemming or lemmatization was applied, several words with the same base form were among the top 10 words (e.g. sample, sampling), so analyzing the top 10 words would effectively mean analyzing fewer than 10 distinct words. To avoid logarithms of zero when calculating coherence scores, ε is set to a very small number, 10^−12, as proposed by Stevens et al. (2012). We furthermore set γ = 1 to place equal weight on all NPMI values, since Röder et al. (2015) showed that this setting has the highest correlation with all topic ranking data, in contrast to Aletras and Stevenson (2013), where γ = 2 showed better results. To capture word proximity when calculating word or word-pair probabilities, the Boolean sliding window is set to s = 110 (Röder et al., 2015).

The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative), was additionally analyzed by a fisheries domain expert. The domain expert is affiliated with the leading competence institution for fishery and aquaculture in Norway. The analysis consisted of an inspection of the top 15 most probable words for each topic, together with an inspection of the document titles and content. Additionally, the domain expert rated the topics (high, medium, low) by assessing the coherence of the top 15 words and the presence of incorrect terms (i.e. words) within each topic. High-quality topics contain no incorrect terms, medium-quality topics contain one or two, and low-quality topics contain three or more.
An incorrect term is defined as a word that has no semantic relationship with the topic’s top 15 words. The domain expert attached a label to each topic that best captured the semantics of the top 15 words.
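The elbow criterion used above, selecting the K whose coherence score has the maximum absolute second derivative, can be sketched as follows. In practice the scores would come from a coherence measure such as Gensim's CoherenceModel with coherence='c_v' (an assumption about tooling; the thesis computes CV as described in Section 2.2.2):

```python
def elbow_k(scores, k_start=2):
    """Return the K whose coherence score has the maximum absolute
    second derivative; scores[i] is the score for K = k_start + i."""
    best_k, best_d2 = None, -1.0
    for i in range(1, len(scores) - 1):
        # Discrete second derivative at interior point i
        d2 = abs(scores[i + 1] - 2 * scores[i] + scores[i - 1])
        if d2 > best_d2:
            best_k, best_d2 = k_start + i, d2
    return best_k

# Illustrative coherence curve: steep rise that flattens after K = 3.
coherence = [0.30, 0.45, 0.52, 0.55, 0.56, 0.565, 0.566]
print(elbow_k(coherence))  # → 3
```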

2.4 Results

Fig. 2.3 shows the obtained CV coherence scores for all 480 LDA models created, with Fig. 2.3a and Fig. 2.3b displaying the results for the DS1 and DS2 datasets, respectively.


[Figure: two line plots of the CV coherence score (y-axis, 0.30–0.65) against the number of topics K (x-axis, 0–40), each with an abstract and a full-text curve — (a) DS1 dataset, (b) DS2 dataset]

Figure 2.3: Calculated CV topic coherence scores for LDA models with K = {1, ..., 40} for (a) DS1 and (b) DS2. The coherence score is the mean score over all 3 runs. Scores for DS1, with 4,417 documents, show that full-text data achieves a higher topic coherence score for all K values. In contrast, DS2, with 15,004 documents, shows similar coherence scores. Individual lines for each run are not shown for clarity.

The lines represent the mean coherence scores from 3 runs in which the number of topics was varied from 1 to 40. A visual inspection of Fig. 2.3a shows that LDA models created with full-text data from the DS1 dataset achieve higher mean coherence scores for all values of K, a result that is not visible for DS2 (Fig. 2.3b).

Table 2.2 displays the actual coherence score values for uncovered topics from abstract and full-text data for both datasets. It shows the mean CV coherence score (X̄), the standard deviation (s), and the difference between mean values (X̄2 − X̄1) calculated from all three runs for K = {2, ..., 40}. Positive differences between mean values indicate a higher achieved coherence score for full-text data. We furthermore calculate the significance (p < 0.05, p < 0.01, and p < 0.001) of the difference between X̄1 and X̄2 with an independent two-sample t-test, as Levene's test for homoscedasticity indicated equal variances for all K values.
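The independent two-sample t-test with equal variances assumed reduces to the pooled-variance formula below. The scores here are illustrative, not the thesis's unrounded per-run values (which is why the published t-statistics cannot be reproduced from the rounded table entries):

```python
import math

def two_sample_t(a, b):
    """Student's independent two-sample t-statistic with pooled variance
    (equal variances assumed), as used to compare abstract vs full-text
    coherence scores over the three runs."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Three illustrative coherence scores per condition (abstract vs full-text).
abstract = [0.39, 0.44, 0.42]
fulltext = [0.54, 0.57, 0.55]
t = two_sample_t(abstract, fulltext)
print(round(t, 2))  # → -8.04
```

The two-tailed p-value would then come from the t distribution with n1 + n2 − 2 degrees of freedom (e.g. scipy.stats.ttest_ind with equal_var=True computes both in one call).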


Table 2.2: Calculated coherence scores for abstract and full-text data for both datasets. X̄ = mean coherence score, s = standard deviation of the coherence score, X̄2 − X̄1 = difference in mean coherence scores, t = calculated t-statistic, p = two-tailed p-value, K = number of topics.

Dataset DS1 (4,417 documents)

      Abstract (1)      Full-text (2)     Statistics (t-test)
K     X̄1      s1       X̄2      s2       X̄2−X̄1   t        p
2     0.392   0.040    0.547   0.058    0.156    -3.14    0.0350*
3     0.454   0.032    0.536   0.034    0.082    -2.49    0.0671
4     0.433   0.027    0.556   0.012    0.123    -5.88    0.0042**
5     0.454   0.028    0.575   0.012    0.121    -5.67    0.0048**
6     0.479   0.044    0.572   0.020    0.093    -2.71    0.0534
7     0.503   0.009    0.560   0.001    0.057    -8.98    0.0009***
8     0.509   0.024    0.567   0.017    0.058    -2.83    0.0474*
9     0.492   0.016    0.576   0.013    0.084    -5.86    0.0042**
10    0.475   0.008    0.566   0.017    0.091    -6.90    0.0023**
11    0.473   0.015    0.578   0.008    0.105    -8.87    0.0009***
12    0.491   0.010    0.572   0.010    0.081    -7.99    0.0013**
13    0.484   0.010    0.591   0.009    0.107    -11.08   0.0004***
14    0.515   0.014    0.568   0.006    0.052    -5.03    0.0074**
15    0.475   0.022    0.583   0.008    0.107    -6.40    0.0031**
16    0.485   0.021    0.585   0.006    0.100    -6.59    0.0028**
17    0.489   0.015    0.590   0.022    0.101    -5.40    0.0057**
18    0.506   0.035    0.592   0.015    0.086    -3.24    0.0315*
19    0.493   0.009    0.589   0.011    0.096    -9.92    0.0006***
20    0.493   0.007    0.584   0.009    0.091    -11.54   0.0003***
21    0.504   0.020    0.579   0.004    0.076    -5.37    0.0058**
22    0.497   0.012    0.576   0.009    0.079    -7.51    0.0017**
23    0.486   0.009    0.572   0.022    0.086    -5.09    0.0070**
24    0.489   0.001    0.584   0.015    0.095    -9.14    0.0008***
25    0.471   0.006    0.567   0.011    0.096    -10.95   0.0004***
26    0.490   0.016    0.589   0.019    0.099    -5.72    0.0046**
27    0.482   0.013    0.573   0.009    0.091    -8.15    0.0012**
28    0.488   0.009    0.585   0.007    0.097    -12.22   0.0003***
29    0.500   0.017    0.590   0.002    0.090    -7.50    0.0017**
30    0.475   0.010    0.583   0.002    0.108    -14.37   0.0001***
31    0.478   0.010    0.584   0.009    0.105    -11.41   0.0003***
32    0.488   0.007    0.588   0.006    0.100    -15.40   0.0001***
33    0.484   0.013    0.581   0.000    0.097    -10.57   0.0005***
34    0.488   0.002    0.594   0.010    0.107    -14.57   0.0001***
35    0.502   0.011    0.584   0.013    0.082    -6.78    0.0025**
36    0.481   0.002    0.578   0.002    0.097    -59.63   0.0000***
37    0.491   0.015    0.591   0.009    0.100    -8.17    0.0012**
38    0.476   0.008    0.580   0.010    0.105    -12.07   0.0003***
39    0.483   0.024    0.576   0.008    0.094    -5.26    0.0063**
40    0.494   0.007    0.586   0.008    0.092    -12.55   0.0002***

∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001


Table 2.2: continued.

Dataset DS2 (15,004 documents)

      Abstract (1)      Full-text (2)     Statistics (t-test)
K     X̄1      s1       X̄2      s2       X̄2−X̄1   t        p
2     0.448   0.016    0.490   0.004    0.041    -3.47    0.0255*
3     0.434   0.016    0.517   0.024    0.084    -4.04    0.0156*
4     0.482   0.020    0.522   0.014    0.040    -2.37    0.0772
5     0.484   0.016    0.520   0.016    0.035    -2.19    0.0938
6     0.488   0.017    0.543   0.010    0.055    -3.92    0.0172*
7     0.507   0.029    0.529   0.002    0.022    -1.07    0.3433
8     0.496   0.010    0.518   0.019    0.022    -1.40    0.2336
9     0.527   0.015    0.531   0.007    0.004    -0.36    0.7350
10    0.536   0.007    0.538   0.013    0.002    -0.19    0.8593
11    0.539   0.010    0.536   0.011    -0.002   0.24     0.8238
12    0.550   0.013    0.545   0.006    -0.005   0.53     0.6255
13    0.538   0.007    0.533   0.003    -0.004   0.84     0.4469
14    0.536   0.014    0.548   0.003    0.012    -1.15    0.3129
15    0.558   0.017    0.555   0.008    -0.003   0.24     0.8195
16    0.542   0.007    0.561   0.010    0.019    -2.22    0.0902
17    0.562   0.022    0.557   0.009    -0.005   0.27     0.7997
18    0.558   0.015    0.550   0.005    -0.008   0.66     0.5441
19    0.543   0.017    0.553   0.011    0.010    -0.73    0.5081
20    0.550   0.019    0.561   0.006    0.011    -0.82    0.4574
21    0.569   0.014    0.560   0.014    -0.009   0.67     0.5398
22    0.559   0.016    0.564   0.006    0.005    -0.41    0.7012
23    0.562   0.006    0.562   0.012    -0.000   0.04     0.9733
24    0.552   0.008    0.564   0.006    0.012    -1.63    0.1794
25    0.548   0.006    0.564   0.011    0.016    -1.84    0.1392
26    0.554   0.018    0.564   0.011    0.010    -0.67    0.5403
27    0.553   0.010    0.561   0.010    0.008    -0.79    0.4720
28    0.552   0.004    0.567   0.014    0.015    -1.43    0.2267
29    0.543   0.015    0.560   0.003    0.018    -1.68    0.1682
30    0.558   0.007    0.557   0.012    -0.001   0.14     0.8980
31    0.557   0.014    0.568   0.006    0.011    -1.03    0.3628
32    0.553   0.002    0.557   0.003    0.004    -1.61    0.1825
33    0.541   0.009    0.564   0.004    0.023    -3.26    0.0311*
34    0.554   0.010    0.565   0.013    0.011    -0.97    0.3885
35    0.550   0.002    0.568   0.014    0.018    -1.77    0.1521
36    0.550   0.016    0.573   0.010    0.023    -1.69    0.1667
37    0.545   0.009    0.576   0.005    0.031    -4.18    0.0139*
38    0.550   0.008    0.565   0.003    0.014    -2.26    0.0867
39    0.546   0.019    0.577   0.005    0.032    -2.32    0.0814
40    0.569   0.016    0.574   0.009    0.004    -0.33    0.7560

∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001


2.4.1 DS1 Dataset

Although every X̄2 (full-text) outperforms X̄1 (abstract), not all differences are statistically significant. For k = 3 and k = 6, the differences between coherence scores are not significant but are still very close to the 5% significance threshold. The largest difference between mean values is achieved at k = 2 (a two-topic LDA model), although it is only significant at p < 0.05. Looking at all K values, three achieve p < 0.05 significance, 18 achieve p < 0.01 significance, and 16 achieve p < 0.001 significance. The choice of full-text data results in topics with higher overall coherence for all K values, and these differences are significant for all but two LDA models. The abstract data achieved the optimal coherence score (via the elbow method) at k = 14, and the full-text data at k = 13.

2.4.2 DS2 Dataset

The DS2 dataset, with 15,004 research articles from 12 top-tier fisheries journals, shows that only 5 LDA models are significantly different at the 5% significance threshold: k = 2, 3, 6, 33, and 37. Looking at the actual coherence scores, most LDA models show a slightly higher coherence score for full-text data compared to abstract data. However, the large differences in coherence scores and significance levels observed for the DS1 dataset are absent here. The LDA model with the optimal coherence score for abstract data is at k = 17, and for full-text data at k = 16.

2.4.3 Human Topic Ranking

Table 2.3 shows the results of the human topic ranking by a fisheries domain expert. For an equal comparison, the LDA models with optimal coherence scores were ranked and compared. The LDA model from DS1 abstract data (k = 14) contains 50% high-quality topics, 36% medium-quality topics, and 14% low-quality topics. In contrast, the LDA model from full-text data (k = 13) contains 92% high-quality, 8% medium-quality, and no low-quality topics. DS2 abstract and full-text data show similar ranking scores: almost 90% high-quality topics, with just two topics ranked as medium-quality. Table 2.4 provides examples of high-, medium-, and low-quality topics, their top 15 words, and the incorrect terms that caused the topics to be ranked lower for DS1. A two-dimensional inter-topic distance map for the LDA models is displayed in Fig. 2.4. This two-dimensional representation is obtained by computing the distance between topics (Chuang et al., 2012) and applying multidimensional scaling (Sievert and Shirley, 2014). It displays the similarity between topics with respect to their probability distributions over words. Furthermore, it shows the topic label that best captures the semantics of the top 15 words. The color coding indicates the quality of the topics based on human interpretation (see Section 2.3.4 for the ranking method). It shows that,


Table 2.3: Manual topic ranking for DS1 and DS2 datasets for abstract and full-text. H = high-quality, M = medium-quality, and L = low-quality topics.

      DS1                          DS2
      Abstract       Full-text     Abstract        Full-text
H     7/14 (50.0%)   12/13 (92.3%) 15/17 (88.2%)   14/16 (87.5%)
M     5/14 (35.7%)   1/13 (7.7%)   2/17 (11.8%)    2/16 (12.5%)
L     2/14 (14.3%)   0/13 (0%)     0/17 (0%)       0/16 (0%)

overall, more high-quality topics are obtained from full-text data than from the abstract counterpart for DS1, and similar topic rankings are achieved for DS2.


[Figure: four inter-topic distance maps with labeled topic nodes — (a) DS1 abstract, (b) DS1 full-text, (c) DS2 abstract, (d) DS2 full-text]

Figure 2.4: Inter-topic distance map showing a two-dimensional representation (via multidimensional scaling) of the latent topics. The distance between the nodes represents the topic similarity with respect to the distributions of words. The surface of the nodes represents the prevalence of the topic within the corpus. Color coding is used to display the topic ranking: green = high-quality topic, orange = medium-quality topic, and red = low-quality topic.
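The inter-topic distances behind such a map can be sketched with the Jensen–Shannon divergence between topic–word distributions, after which a 2-D embedding would come from multidimensional scaling (e.g. scikit-learn's MDS; an assumption about tooling — the thesis cites Chuang et al. (2012) and Sievert and Shirley (2014) for the actual method). The divergence itself, with a base-2 log so that values fall in [0, 1], is:

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base-2 log) between two discrete
    distributions, e.g. two topics' probability distributions over words."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability entries
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two toy topic-word distributions over a 4-word vocabulary.
topic_a = [0.7, 0.2, 0.05, 0.05]
topic_b = [0.1, 0.1, 0.4, 0.4]
print(jensen_shannon(topic_a, topic_b))
```

Identical topics have divergence 0, and topics with disjoint support have divergence 1, which makes the measure a convenient input for a distance-based 2-D projection.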

Table 2.4: A selection of topics from the DS1 dataset with the 15 most probable words, topic label, and ranking data. Text in bold indicates incorrect terms.

[Table flattened in extraction. Recoverable structure: the selection covers the abstract topics fish distribution, population models, and population genetics and the full-text topics salmon population dynamics and population genetics, each listed with its top 15 words and a high/medium/low ranking; noise terms such as used, using, two, among, and within appear among the marked incorrect terms.]


2.5 Discussion

The coherence of a topic is based on the topic's top 15 words and shows how strongly pairs of these top 15 words support each other within the corpus. Such an approach, drawing on the philosophical premise that a set of statements or facts is coherent if its statements or facts support each other, informs us about the understandability and interpretability of topics from a human perspective. The LDA models obtained from DS1 full-text data show a higher coherence overall than those from DS1 abstract data, with the test statistics showing that these differences are significant for all but two LDA models. In contrast, such significant differences are not present within the DS2 dataset when comparing abstract and full-text data, although full-text data achieved more topics with a higher coherence score.

Additionally, topic ranking by a fisheries domain expert shows similar, or even greater, improvements for the DS1 full-text data: topics uncovered from full-text data contain 92% high-quality topics compared to 50% high-quality topics from abstract data. The quality of topics from a human perspective was lowered by the inclusion of incorrect terms in the top 15 words. Such terms, however, are not related to the biological, ecological, or socio-ecological meanings of those topics but can be seen as noise terms: using, used, use, within, total, two, and among. There is little to no specific semantic meaning behind these terms, and although they are important in written text, they are less important when uncovering latent semantic structures (i.e. topics) from documents. This issue may potentially be rectified with a part-of-speech (POS) tagger to eliminate the verbs or prepositions that crop up as noise among the topic's top words. However, one should proceed carefully in cases where verbs are important cues for understanding the semantics of the top words. For example, Table 2.4 shows that fishing and feeding are among the top 15 words, and in these cases, the verbs are important terms that are necessary for understanding the semantic context. In such cases, one might proceed with a domain-specific stop word list to prevent such terms from becoming part of the topic–word distribution. Lower-ranked topics caused by noise terms are not as apparent for full-text data, nor for abstract data from the DS2 dataset. Such noise terms seem less of an issue as document frequency, document length, or vocabulary size increases.
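The domain-specific stop word list suggested above can be applied as a final filter over a topic's ranked words. The noise list here simply reuses the terms identified in this chapter; a POS tagger (e.g. NLTK's pos_tag) could serve the same purpose, at the risk of dropping meaningful verbs such as fishing or feeding:

```python
# Noise terms observed among the top topic words in this chapter.
NOISE_TERMS = {"using", "used", "use", "within", "total", "two", "among"}

def filter_top_words(top_words, keep=15):
    """Drop noise terms from a topic's probability-ranked word list,
    letting the next-most-probable words move up so that up to `keep`
    words remain."""
    cleaned = [w for w in top_words if w not in NOISE_TERMS]
    return cleaned[:keep]

topic = ["fish", "using", "stock", "two", "salmon", "model", "among", "river"]
print(filter_top_words(topic, keep=5))  # → ['fish', 'stock', 'salmon', 'model', 'river']
```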

Also worth noting is the increased level of detail of the topics within DS1 full-text data (Fig. 2.4b) compared to DS1 abstract data (Fig. 2.4a). For example, the topics salmon population dynamics and salmon reproduction were uncovered from full-text data, whereas the single topic salmon was uncovered from abstract data. Similarly, the topics dealing with lakes are split into three topics (lake nutrients and algae, lake sediments, and lake ecology) for full-text data, compared to two (lake trophic level interaction and lake nutrients and algae) for the abstract data. Lastly, the topics dealing with models were split into three (estimation models, stock assessment models, and reproduction models) for full-text data, in contrast to an overarching population models topic for abstract data. Such a clear difference between low- and high-granularity topics is not present


within the DS2 dataset. Although differences in document length and vocabulary size exist, similar to DS1, it seems that a higher number of documents makes up for these differences in granularity. A comparison between other LDA models (not presented) shows similar granularity between abstract and full-text data for DS2. Although an article's abstract aims to provide a complete but succinct description of the whole paper, it is often restricted by a limit on the number of words. Such a word limit, combined with a relatively small number of documents, has practical effects on the level of detail (i.e. granularity) of uncovered LDA topics.

Besides topic coherence, topic ranking, and the level of detail, Fig. 2.4 shows a number of uncovered topics that are present in abstract data but absent in full-text data.

Within DS1, for example, the topics temperature effects, cod genetics, management, and fish abundance were not found within full-text data, and neither were related topics showing semantic resemblance to these absent topics. Although we identified similar and more detailed topics, there remains an inconsistency between some uncovered topics from the two data types. Knowing that abstracts were retrieved from full-text articles and are thus, in essence, a subset of the full-text data, one might question why these differences exist. One reason might be that manual topic labeling is subject to the subjectivity inherent in human interpretation, and an analysis of the topics by others could yield opposite results, explaining away any differences between the two data types. On the other hand, topic labeling is usually performed by inspecting the topic words with the highest probabilities (top 10 or 15). Such an approach might up-weight terms that have high probability under all topics. Other approaches to identifying the terms that best describe a topic exist (e.g. Blei and Lafferty, 2009; Tang and Maclennan, 2005) and could yield different results. Finally, abstract data, being restricted by the limited number of words, may fail to adequately convey the heterogeneity of research ideas or topics that are part of a document. Uncovered latent topics might thus not completely resemble the document collection and, as a result, provide a limited or even incorrect view of the underlying thematic structure.

2.6 Conclusion

In this paper, we presented a comparison between topic coherence scores and human topic rankings when creating LDA topics from abstract and full-text data. Two datasets were compared: DS1, consisting of a single fisheries journal with 4,417 scientific research articles spanning 20 years of scientific output, and DS2, consisting of 12 fisheries journals and 15,004 articles spanning 16 years of research. The two types of data, abstract and full-text, combined with two different datasets, a single journal and a set of journals, allow for comparison on a variety of characteristics, such as document length, document frequency, and vocabulary size. Topics were statistically compared by adopting the CV coherence measure, which shows the highest correlation with all available human topic-ranking data. Furthermore, the LDA models with the optimal coherence scores were manually inspected and ranked by a fisheries domain expert.


Our results show that, for abstract data, LDA models uncovered from a single journal with a relatively low number of documents are very prone to noise terms that crop up among the topic's top words, the words that are often used to capture the semantics of the topic. Such noise terms require special attention when dealing with abstract data, e.g. an extended cleaning phase, POS filtering, or a domain-specific stop word list. Our results show that full-text data seem less affected by such words, which increases both the coherence scores and the manual topic rankings. On the other hand, increasing the number of documents (e.g. DS2) results in fewer noise terms and thus an improvement in coherence and human topic ranking for both abstract and full-text data. Furthermore, on a small dataset (e.g. DS1), abstract topic distributions capture broader topics, while full-text topics achieve more fine-grained results. These differences in detail are not present for bigger datasets containing a higher number of documents, regardless of the choice of abstract or full-text data.

We identified a number of topics that were uncovered from abstract data but were absent among the topics uncovered from full-text data. A detailed analysis of the reasons behind these differences would yield interesting results and would be a possible direction for future research.


Chapter 3

Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) has gained much attention from researchers and is increasingly being applied to uncover underlying semantic structures from a variety of corpora. However, nearly all researchers use symmetrical Dirichlet priors, often unaware of the underlying practical implications that they bear. This research is the first to explore symmetrical and asymmetrical Dirichlet priors on topic coherence and human topic ranking when uncovering latent semantic structures from scientific research articles. More specifically, we examine the practical effects of several classes of Dirichlet priors on 2,000 LDA models created from abstract and full-text research articles. Our results show that symmetrical or asymmetrical priors on the document–topic distribution or the topic–word distribution for full-text data have little effect on topic coherence scores and human topic ranking. In contrast, asymmetrical priors on the document–topic distribution for abstract data show a significant increase in topic coherence scores and improved human topic ranking compared to a symmetrical prior. Symmetrical or asymmetrical priors on the topic–word distribution show no real benefits for either abstract or full-text data.

This work was originally published as:

S. Syed and M. Spruit. Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12(3):399–423, 2018b. doi: 10.1142/S1793351X18400184


3.1 Introduction

Global research efforts have led to an ever-increasing amount of scientific output. Combined with the digitalization of scientific archives, this increase is threatening to overwhelm today's scientists trying to keep track of and identify relevant literature (Larsen and von Ins, 2010). Consequently, scientists need new tools and algorithms for browsing these collections in a structured way, particularly as topics within articles (the ideas contained within articles that can be shared among similar articles) cannot always be detected through traditional keyword searches (Srivastava and Sahami, 2009). Probabilistic topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and probabilistic latent semantic indexing (pLSI) (Hofmann, 1999) are machine-learning algorithms used to automatically uncover underlying semantic structures, such as themes or topics, in large collections of documents. These underlying semantic structures can subsequently be used to categorize, summarize, and annotate large document collections in a purely unsupervised fashion.

LDA, although the simplest topic model, has received much attention from machine-learning researchers and has been adopted and extended in many ways. LDA is a generative probabilistic topic model that aims to uncover hidden or latent thematic structures from large collections of documents. LDA is a three-level hierarchical Bayesian model that models documents as discrete distributions over K latent topics, and every topic is modeled as a multinomial distribution over the fixed vocabulary. Uncovering latent thematic structures proceeds through posterior inference of the latent variables given the observed words. Apart from its applicability to text, LDA has proven useful for other types of data, such as image (Fergus et al., 2005), video (Mehran et al., 2009), and audio (Kim et al., 2009).

As a conjugate prior to the multinomial distribution, LDA uses a Dirichlet prior to simplify posterior inference. Typically, these priors and related hyperparameters are set to be symmetrical, assuming that a priori all topics have equal probability to be assigned to a document and all words have an equal chance to be assigned to a topic. The reasons for choosing symmetrical priors, compared to asymmetrical priors, are not explicitly stated and are often implicitly assumed to have little or no practical effect (Wallach et al., 2009). However, hyperparameters can have a significant effect on the achieved accuracy for various inference techniques, such as Gibbs sampling, variational Bayes, or collapsed variational Bayes (Asuncion et al., 2012). In fact, inference methods have relatively similar predictive performance when the hyperparameters are optimized, thereby explaining away most differences between them.

Little research has examined the effects of Dirichlet priors on the quality of generated topics. Among the few, Wallach et al. (2009) demonstrated that using an asymmetric Dirichlet prior on the document–topic distribution shows significant performance gains concerning the likelihood of held-out documents. However, the likelihood correlates negatively with human interpretability (Chang et al., 2009), which is often considered the gold standard for topic quality. Consequently, researchers have proposed topic coherence measures (Aletras and Stevenson, 2013; Stevens et al., 2012; Newman et al., 2010a; Röder et al., 2015), a proxy for topic quality that shows improved correlation with human topic ranking data. The underlying idea of topic coherence is rooted in the distributional hypothesis of linguistics (Harris, 1954): words with similar meanings tend to occur in similar contexts. This paper is the first to explore the practical effects of several classes of Dirichlet priors on the coherence of generated topics. More specifically, we study topic coherence for combinations of symmetrical and asymmetrical priors on the document–topic distribution, as well as the topic–word distribution, when uncovering latent topics with LDA. In addition, topics are ranked by a domain expert on interpretability, providing a qualitative analysis of topic quality for different classes of Dirichlet priors in addition to the quantitative measure. Such analyses can provide valuable guidance to researchers utilizing LDA tools such as Mallet and Gensim (Rehurek and Sojka, 2010) to uncover topical structures from scientific articles (Gatti et al., 2015; Sun and Yin, 2017; Westgate et al., 2015; Wang and McCallum, 2006; Alston and Pardey, 2016) while unknowingly leaving hyperparameters set to their defaults (i.e. symmetrical).

3.2 Background

3.2.1 Latent Dirichlet Allocation

LDA is a generative probabilistic topic model that aims to uncover latent semantic structures from a set of documents, D. The latent semantic structures can subsequently be used to organize, categorize, and annotate documents without the need for prior human labeling or annotation. LDA models documents as discrete distributions over K latent topics, and every topic is modeled as a discrete distribution over the fixed vocabulary. As a result, LDA captures the heterogeneity of ideas prevailing in a document collection and can be viewed as a mixed membership model (Erosheva et al., 2004). The underlying latent semantic structure is expressed by topics β, topic proportions θ, and topic assignments z, the hidden variables that LDA posits onto the corpus. However, β, θ, and z are unobserved, and the goal is to determine them from the observed variables (i.e. the words within the documents). LDA's structure allows the observed variables to interact with structured distributions of a hidden variable model (Blei and Lafferty, 2009). Learning the hidden variables (i.e. the underlying semantic structure) can be achieved by inferring the posterior distribution of the latent variables given the observed documents. The interaction between latent and observed variables is manifested in the generative process behind LDA, the imaginary random process from which we assume the documents arise, based on probabilistic sampling rules. The generative process is described as follows:

1. For every topic k = 1, ..., K:

   (a) draw a distribution over the vocabulary V, βk ∼ Dir(η)

2. For every document d:

   (a) draw a distribution over topics, θd ∼ Dir(α) (i.e. the per-document topic proportions)
   (b) for each word wd,n within document d:

      i. draw a topic assignment zd,n ∼ Mult(θd), where zd,n ∈ {1, ..., K} (i.e. the per-word topic assignment)
      ii. draw a word wd,n ∼ Mult(β_{zd,n}), where wd,n ∈ {1, ..., V}

Figure 3.1: LDA represented as a graphical model in which the nodes denote the random variables (the Dirichlet parameters α and η, the per-document topic proportions θd, the per-word topic assignments zd,n, the topics βk, and the observed words wd,n) and the edges the dependencies between them. Unshaded nodes are unobserved or hidden variables, and the shaded node represents the observed random variables. The boxes, called plates, indicate replication over N words, D documents, and K topics.

Where K is the number of topics, V is the vocabulary size, and α and η are the Dirichlet hyperparameters that affect the smoothing of topic proportions within documents and words within topics, respectively. The joint distribution (see Figure 3.1 for the graphical representation) of all the hidden and observed variables becomes:

p(\beta_K, \theta_D, z_D, w_D \mid \alpha, \eta) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} \Big( p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta_{1:K}) \Big)   (3.1)
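As an illustration, the generative process above can be sketched in a few lines of Python. This is a toy sketch, not the model used in this chapter: the vocabulary, corpus size, and hyperparameter values are invented for demonstration, and the Dirichlet draw is implemented by normalizing independent Gamma samples.

```python
import random

random.seed(7)

def draw_dirichlet(alpha):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_corpus(n_docs, doc_len, alpha, eta, vocab):
    K = len(alpha)
    # 1. For every topic k, draw a distribution over the vocabulary: beta_k ~ Dir(eta)
    betas = [draw_dirichlet(eta) for _ in range(K)]
    corpus = []
    for _ in range(n_docs):
        # 2a. Draw the per-document topic proportions: theta_d ~ Dir(alpha)
        theta = draw_dirichlet(alpha)
        doc = []
        for _ in range(doc_len):
            # 2b-i. Draw a topic assignment z ~ Mult(theta)
            z = random.choices(range(K), weights=theta)[0]
            # 2b-ii. Draw a word w ~ Mult(beta_z)
            doc.append(random.choices(vocab, weights=betas[z])[0])
        corpus.append(doc)
    return corpus

vocab = ["cod", "stock", "model", "survey", "gear"]  # illustrative vocabulary
corpus = generate_corpus(n_docs=3, doc_len=8, alpha=[0.5, 0.5], eta=[0.1] * 5, vocab=vocab)
```

Posterior inference then runs this process in reverse: given only `corpus`, it recovers estimates of the `betas` and `theta` values that plausibly generated it.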

To learn the distribution of the hidden variables, we invert the generative process and fit the hidden variables onto the observed words. The hidden structure is thus described by the posterior distribution of the latent variables given the observed words:


p(\beta_K, \theta_D, z_D \mid w_D, \alpha, \eta) = \frac{p(\beta_K, \theta_D, z_D, w_D \mid \alpha, \eta)}{p(w_D \mid \alpha, \eta)}   (3.2)

p(w_D \mid \alpha, \eta) = \int_{\beta_K} \int_{\theta_D} \sum_{z_D} p(\beta_K, \theta_D, z_D, w_D \mid \alpha, \eta)   (3.3)

However, the posterior is intractable to compute (Blei et al., 2003) due to the evidence as expressed in (3.3). The solution is to approximate the posterior using inference techniques. Two main posterior inference techniques can be discerned: (i) sampling-based algorithms (Newman et al., 2007; Porteous et al., 2008) and (ii) variational or optimization-based algorithms (Blei and Jordan, 2006; Teh et al., 2006; Wang et al., 2011). Sampling-based algorithms, such as Markov Chain Monte Carlo (MCMC) sampling, sample from the posterior, usually one variable at a time, while fixing the other variables. Repeating this process for several iterations causes the process to converge, after which the sampled values have the same distribution as if they came from the true posterior. Variational inference aims to find a simplified parametric distribution that is closest to the true posterior as measured by the Kullback-Leibler (KL) divergence. Once inference is complete, the posterior distribution reveals the latent structure of the documents expressed by topics β, topic proportions θ, and topic assignments z.

One way to think about LDA is to imagine a document in which one highlights words with colored markers. Words that relate to one topic are colored blue, words that relate to another topic are colored red, and so on. After all of the words have been colored (excluding words such as "the", "a"), all the words with the same color are the topics, and the article will blend the colors in different proportions. Different documents will have different blends of colors, and we could use the proportion of the various colors to situate this specific document in a document collection (e.g. documents addressing mainly the blue topic). Moreover, documents with the same blend of colors discuss the topics in similar proportion and are considered closely related from a topical perspective. Technically, documents with similar topic distributions are close in Kullback-Leibler divergence, a measure to calculate the distance between two probability distributions. LDA as a statistical model captures this intuition. We refer the interested reader to (Blei, 2012) for a concise introduction to LDA.
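The closing intuition, comparing documents by their topic blends via Kullback-Leibler divergence, can be made concrete with a small computation. The topic proportions below are invented for illustration; since KL divergence is not symmetric, a symmetrised variant is used for the comparison.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sym_kl(p, q):
    """Symmetrised KL, so that the comparison does not depend on argument order."""
    return kl_divergence(p, q) + kl_divergence(q, p)

doc_a = [0.7, 0.2, 0.1]  # mostly the "blue" topic
doc_b = [0.6, 0.3, 0.1]  # a similar blend
doc_c = [0.1, 0.1, 0.8]  # a very different blend
```

Here `sym_kl(doc_a, doc_b)` is much smaller than `sym_kl(doc_a, doc_c)`, capturing that documents a and b discuss the topics in similar proportions.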

3.2.2 Research Utilizing LDA

Topic modeling algorithms, and specifically LDA, have been helpful in elucidating the key ideas within a set of documents, such as articles published in the journal PNAS (Griffiths and Steyvers, 2004), political science texts (Grimmer and Stewart, 2013), or data-driven journalism (Rusch et al., 2013). Moreover, it is considered that this approach could provide insight into the development of a scientific field and changes in research priorities (Neff and Corley, 2009), and do so with greater speed and quantitative rigor than would otherwise be possible through traditional narrative reviews (Grimmer and Stewart, 2013). As such, LDA has been applied, for example, in the domain of transportation research (Sun and Yin, 2017), computer science (Hall et al., 2008; Wang et al., 2011; Wang and McCallum, 2006), fisheries science (Syed and Weber, 2018; Syed et al., 2018a), conservation science (Westgate et al., 2015), and the fields of operations research and management science (Gatti et al., 2015).

3.2.3 Coherence Scores

Measures such as predictive likelihood on held-out data (Wallach et al., 2009) have been proposed to evaluate the quality of generated topics. However, such measures correlate negatively with human interpretability (Chang et al., 2009), making topics with high predictive likelihood less coherent from a human perspective. High-quality or coherent latent topics are of particular importance when they are used to browse document collections or understand the trends and developments within a particular research field. As a result, researchers have proposed topic coherence measures, which automatically quantify the coherence of topics (Aletras and Stevenson, 2013; Stevens et al., 2012; Newman et al., 2010a; Röder et al., 2015). Topics are considered to be coherent if all or most of the words (e.g. a topic's top-N words) are related. Topic coherence measures aim to correlate highly with human topic evaluation, such as topic ranking data obtained by, for example, word and topic intrusion tests (Chang et al., 2009). Human topic ranking data are often considered the gold standard and, consequently, a measure that correlates well with them is a good indicator of topic interpretability. A recent study by Röder et al. (2015) systematically and empirically explored the multitude of topic coherence measures and their correlation with available human topic ranking data; new coherence measures obtained by combining existing elementary elements were also examined. This systematic approach revealed a previously unexplored coherence measure, labeled CV, that achieves the highest correlation with all available human topic ranking data. This study adopts the CV coherence measure for calculating topic coherence; a detailed description of the calculations behind this measure follows below.

The calculation of CV starts with the segmentation of the topic's top-N words into pairs of word subsets, S_i = (W′, W*), where W′ ⊆ W, W* ⊆ W, and W consists of the topic's top-N most probable words. More formally, a pair S_i is defined as:

S = \{ (W', W^*) \mid W' = \{ w_i \};\; w_i \in W;\; W^* = W \}   (3.4)

For example, if W = {w1, w2, w3}, then one pair might be S_i = (W′ = {w1}, W* = {w1, w2, w3}). Such segmentation measures the extent to which the subset W* supports or, conversely, undermines the subset W′ (Douven and Meijs, 2007). The support between the word subsets of a pair S_i = (W′, W*) is calculated with a confirmation measure φ. CV uses an indirect confirmation measure that considers not only the words within a pair, but also all words in W. A direct confirmation measure, such as the difference, ratio, or likelihood measure, could place a low probability on high-support but low-frequency pairs. An indirect confirmation measure overcomes this by pairing every subset with W, thereby increasing the semantic support of supporting pairs. Word subsets are now represented as context vectors (Aletras and Stevenson, 2013), such as v⃗(W′), by pairing them to all words in W, as exemplified in (3.5). The relatedness between context vectors and words in W is calculated by normalized pointwise mutual information (NPMI), as shown in (3.6).

\vec{v}(W') = \left\{ \sum_{w_i \in W'} \mathrm{NPMI}(w_i, w_j)^{\gamma} \right\}_{j=1,\ldots,|W|}   (3.5)

\mathrm{NPMI}(w_i, w_j)^{\gamma} = \left( \frac{\log \frac{P(w_i, w_j) + \varepsilon}{P(w_i) \cdot P(w_j)}}{-\log\left( P(w_i, w_j) + \varepsilon \right)} \right)^{\gamma}   (3.6)

In contrast to pointwise mutual information, NPMI achieves a higher correlation with human topic ranking data (Aletras and Stevenson, 2013), which is generally a result of reducing the impact of low-frequency counts in word co-occurrences (Bouma, 2009). Given our running example of W = {w1, w2, w3}, we obtain the context vector for w1 as w⃗1 = {NPMI(w1, w1)^γ, NPMI(w1, w2)^γ, NPMI(w1, w3)^γ}, with the constant ε preventing logarithms of zero, and γ placing more weight on higher NPMI values.
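A minimal sketch of the NPMI computation in (3.6), assuming the single-word and co-occurrence probabilities have already been estimated; the probability values in the usage checks below are illustrative.

```python
import math

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """Normalized pointwise mutual information, with eps to avoid log(0).

    Returns a value near +1 when the words always co-occur, near 0 when they
    are independent, and negative when they co-occur less than expected.
    """
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / -math.log(p_ij + eps)

always_together = npmi(0.1, 0.1, 0.1)    # close to 1
independent = npmi(0.1, 0.1, 0.01)       # close to 0
never_together = npmi(0.1, 0.1, 0.0)     # negative
```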

Probabilities of single words p(wi) or the joint probability of two words p(wi, wj) can be estimated using a Boolean document calculation—that is, the number of documents in which (wi) or (wi, wj) occurs, divided by the total number of documents. The Boolean document calculation, however, ignores the frequencies and distances of words. CV incorporates a Boolean sliding window calculation in which a new virtual document is created for every window of size s when sliding over the document at a rate of one word token per step. For example, document d1 with words w results in virtual documents d′1 = {w1, ..., ws}, d′2 = {w2, ..., ws+1}, and so on. The probabilities p(wi) and p(wi, wj) are subsequently calculated from the total number of virtual documents. In contrast to the Boolean document calculation, the Boolean sliding window calculation captures word token proximity to some degree.
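The Boolean sliding window construction can be sketched as follows; the tokens and window size in the example are illustrative, not the s = 110 used later in the chapter.

```python
def sliding_windows(tokens, s):
    """One virtual document per window of size s, stepping one token at a time."""
    if len(tokens) <= s:
        return [tokens]
    return [tokens[i:i + s] for i in range(len(tokens) - s + 1)]

def window_probability(docs, words, s):
    """Fraction of virtual documents in which all given words co-occur."""
    windows = [w for d in docs for w in sliding_windows(d, s)]
    hits = sum(1 for w in windows if all(word in w for word in words))
    return hits / len(windows)

# Document ["a", "b", "c", "d"] with s = 2 yields three virtual documents:
# ["a", "b"], ["b", "c"], ["c", "d"]
windows = sliding_windows(["a", "b", "c", "d"], 2)
p_b = window_probability([["a", "b", "c", "d"]], ["b"], 2)        # 2 of 3 windows
p_ab = window_probability([["a", "b", "c", "d"]], ["a", "b"], 2)  # 1 of 3 windows
```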

The indirect confirmation measure φ_{S_i}(u⃗, w⃗) is obtained by calculating the cosine similarity between the context vectors u⃗ = v⃗(W′) and w⃗ = v⃗(W*) of a pair S_i = (W′, W*), as shown in (3.7).


\phi_{S_i}(\vec{u}, \vec{w}) = \frac{\sum_{i=1}^{|W|} u_i \cdot w_i}{\|\vec{u}\|_2 \cdot \|\vec{w}\|_2}   (3.7)

Finally, the arithmetic mean of the individual confirmation measures is used to arrive at an overall topic coherence score.
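The final two steps, cosine similarity between context vectors as in (3.7) and the arithmetic mean over all pair confirmations, can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two context vectors, as in eq. (3.7)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coherence(pair_confirmations):
    """Overall topic coherence: the arithmetic mean of per-pair confirmations."""
    return sum(pair_confirmations) / len(pair_confirmations)

identical = cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 1.0: identical directions
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])           # 0.0: no shared support
score = coherence([0.2, 0.4])                          # mean of pair confirmations
```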

3.3 Methods

3.3.1 Dataset

We compare the influence of Dirichlet hyperparameters on two datasets containing scientific research articles related to the domain of fisheries. The first dataset, DS1, contains all full-text research articles published by the journal Canadian Journal of Fisheries and Aquatic Sciences and the journal ICES Journal of Marine Science from 1996 to 2016, with D = 8,012 documents, a vocabulary size of V = 203,248, a total of N = 29,469,919 words, and on average 3,678 words per document. The second dataset, DS2, contains only abstract data from the journal Canadian Journal of Fisheries and Aquatic Sciences, with D = 4,417, V = 14,643, N = 481,168, and 109 words on average per document. Both journals are domain-specific (i.e. fisheries) journals, but cover a wide scope of research directions related to the biological, ecological, and socio-ecological aspects of fisheries.

The domain of fisheries includes a multitude of knowledge production approaches, from mono- to transdisciplinary. Biologists, oceanographers, mathematicians, computer scientists, anthropologists, sociologists, political scientists, economists, and researchers from many other disciplines contribute to the body of knowledge of fisheries, together with non-academic participants such as decision makers and stakeholders. Within the domain of fisheries, text analytics techniques have been applied in only a few cases (e.g. Syed et al., 2016; Syed and Spruit, 2017).

These journals were chosen for several reasons. First, a fisheries domain expert was available to rank the topics manually. Second, domain-specific journals, in contrast to generic journals such as Nature, Science, or PLOS ONE, are often the subject of study when uncovering topical structures from scientific publications, as in research within computational linguistics (Hall et al., 2008) or neural information processing systems (NIPS) (Wang and McCallum, 2006); focusing on such journals thus increases the generalizability of our results to similar approaches. Finally, the two journals have the highest publication output within the analyzed period compared to all other fisheries journals.

Words that were part of a standard list of stop words (n = 153), single-occurrence words, and words occurring in ≥ 90% of the documents (e.g. fish, analysis, research) were removed. The removal of words occurring in ≥ 90% of the documents serves as an estimate to prevent frequently occurring words from dominating all topics. All documents were tokenized and represented as bag-of-words features. Apart from grouping lowercase and uppercase words, no normalization method (e.g. stemming or lemmatization) was applied to reduce inflectional and derivational forms of words to a common base form. Stemming algorithms can be overly aggressive and could result in unrecognizable words that reduce interpretability when labeling the topics. Stemming might also lead to another problem: it cannot be deduced whether a stemmed word comes from a verb or a noun (Evangelopoulos et al., 2012). As human topic ranking was part of our topic quality evaluation, interpretability was considered to be highly important.

Table 3.1: Notation of Dirichlet classes. α = document–topic distribution, η = topic–word distribution

Abbreviation   α            η
AA             Asymmetric   Asymmetric
AS             Asymmetric   Symmetric
SA             Symmetric    Asymmetric
SS             Symmetric    Symmetric
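The filtering rules above (stop-word removal, dropping single-occurrence words, and dropping words present in at least 90% of documents) can be sketched as follows; the tiny corpus and stop-word list are invented for illustration, and the sketch returns token lists rather than the bag-of-words features used in the study.

```python
from collections import Counter

def preprocess(docs, stop_words, max_doc_freq=0.9):
    """Drop stop words, single-occurrence words, and words in >= max_doc_freq of docs."""
    tokenized = [[w.lower() for w in d.split()] for d in docs]
    counts = Counter(w for d in tokenized for w in d)          # corpus frequency
    doc_freq = Counter(w for d in tokenized for w in set(d))   # document frequency
    n_docs = len(tokenized)

    def keep(w):
        return (w not in stop_words
                and counts[w] > 1                       # no single-occurrence words
                and doc_freq[w] / n_docs < max_doc_freq)  # not in >= 90% of docs

    return [[w for w in d if keep(w)] for d in tokenized]

docs = ["the cod stock model", "the cod stock survey model", "the gear survey stock"]
# "the" is a stop word, "stock" occurs in every document, "gear" occurs only once.
cleaned = preprocess(docs, {"the"})
```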

3.3.2 Dirichlet Hyperparameters

Hyperparameter α controls the shape of the document–topic distribution, whereas η controls the shape of the topic–word distribution. A large α leads to documents containing many topics, and a large η leads to topics with many words. In contrast, small values for α and η result in sparse distributions: documents containing a small number of topics and topics with a small number of words. In essence, the hyperparameters α and η have a smoothing effect on the multinomial variables θ and β, respectively. Four different classes or combinations of Dirichlet priors are explored, as listed in Table 3.1, in which we follow a similar notation (i.e. AA, AS, SA, SS) as described in (Wallach et al., 2009).
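The smoothing effect of α can be illustrated by sampling document–topic distributions for a small and a large symmetric α and counting how many topics receive non-negligible mass. The threshold, sample count, and K below are arbitrary choices for the demonstration.

```python
import random

random.seed(42)

def draw_dirichlet(alpha):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def mean_effective_topics(alpha, n_samples=500, threshold=0.05):
    """Average number of topics with non-negligible mass: a rough sparsity measure."""
    total = 0
    for _ in range(n_samples):
        theta = draw_dirichlet(alpha)
        total += sum(1 for p in theta if p > threshold)
    return total / n_samples

K = 20
sparse = mean_effective_topics([0.05] * K)  # small alpha: few active topics per document
dense = mean_effective_topics([5.0] * K)    # large alpha: mass spread over many topics
```

With a small α, most sampled documents concentrate on only one or two topics, while a large α spreads probability mass over many topics.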

Symmetrical priors are often the default setting for LDA tools such as Mallet and Gensim; they assume a priori that each of the K topics has an equal probability of being assigned to a document, and that each word has an equal chance of being assigned to a topic. For the symmetrical prior α, the hyperparameter is a vector with value 1/K, where K is the number of topics. The symmetrical prior η has a scalar parameter with value 1/V, where V is the size of the vocabulary (full-text data DS1: V = 203,248; abstract data DS2: V = 14,643). For the asymmetrical priors, we utilize an iterative learning process to approximate the hyperparameters from the data; estimation is required as no exact closed-form solution exists. Estimating hyperparameters can increase model quality, and their values can reveal specific properties of the corpus: α for the distinctiveness of underlying semantic structures and η for the group size of commonly co-occurring words (Heinrich, 2005). Several methods for hyperparameter estimation exist, such as gradient ascent, fixed-point iteration, and the Newton-Raphson method. Estimating the Dirichlet parameter α aims to maximize p(D | α) by maximizing the log-likelihood function of the data D, with log p̄_k being the observed sufficient statistics (the following is analogous for η).

F(\alpha) = \log p(D \mid \alpha) = N \log \Gamma\Big( \sum_k \alpha_k \Big) - N \sum_k \log \Gamma(\alpha_k) + N \sum_k (\alpha_k - 1) \log \bar{p}_k   (3.8)

\text{with } \log \bar{p}_k = \frac{1}{N} \sum_i \log p_{i,k}

This study adopts the Newton-Raphson method (Huang, 2005), which provides a quadratically converging method for parameter estimation. Given an initial value for α, the parameters are iteratively updated to arrive at an asymmetrical Dirichlet distribution learned from the data. The update is given in (3.9), with ∇F being the gradient, iteratively stepping along the gradient to maximize the log-likelihood function F in (3.8).

\alpha_k^{new} = \alpha_k^{old} - \frac{(\nabla F)_k - b}{q_{kk}}

(\nabla F)_k = \frac{\partial F}{\partial \alpha_k} = N \Big( \Psi\Big( \sum_k \alpha_k \Big) - \Psi(\alpha_k) + \log \bar{p}_k \Big)

b = \frac{\sum_j (\nabla F)_j / q_{jj}}{1/c + \sum_j 1/q_{jj}}

c = N \Psi'\Big( \sum_k \alpha_k \Big)

q_{kk} = -N \Psi'(\alpha_k)   (3.9)
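The log-likelihood F(α) from (3.8) is straightforward to evaluate with the log-gamma function. In the hypothetical example below, the observed sufficient statistics log p̄_k are taken to be uniform, in which case a more concentrated α yields a higher likelihood; the candidate values are illustrative only.

```python
import math

def log_likelihood(alpha, log_p_bar, n=1):
    """F(alpha) = N log Gamma(sum_k alpha_k) - N sum_k log Gamma(alpha_k)
                  + N sum_k (alpha_k - 1) log p_bar_k   (eq. 3.8)."""
    s = sum(alpha)
    return n * (math.lgamma(s)
                - sum(math.lgamma(a) for a in alpha)
                + sum((a - 1) * lp for a, lp in zip(alpha, log_p_bar)))

# Sufficient statistics of roughly uniform observed proportions over K = 3 topics:
log_p_bar = [math.log(1 / 3)] * 3

f_concentrated = log_likelihood([5.0] * 3, log_p_bar)
f_uniform = log_likelihood([1.0] * 3, log_p_bar)
f_sparse = log_likelihood([0.1] * 3, log_p_bar)
```

A full estimator would repeat the Newton-Raphson update of (3.9) until F stops improving; only the objective function is shown here.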

3.3.3 Creating LDA Models

LDA models were created for four different classes of priors on α and η, as listed in Table 3.1. For each class of priors, LDA models were produced by varying the number of topics parameter K = {1, ..., 50} and repeating the process five times; one class thus resulted in 250 LDA models. The same approach was performed on both datasets: DS1 for full-text data and DS2 for abstract data. A total of 2000 different LDA models were created.
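The experimental grid can be sketched as a simple cross product, confirming the model counts stated above; the training call itself is omitted (in this study, models were fit with Gensim).

```python
from itertools import product

prior_classes = ["AA", "AS", "SA", "SS"]   # Table 3.1
topic_counts = range(1, 51)                # K = 1, ..., 50
runs = range(5)                            # five repetitions per configuration
datasets = ["DS1", "DS2"]                  # full text and abstracts

# 2 datasets x 4 prior classes x 50 values of K x 5 runs = 2000 LDA models
experiments = list(product(datasets, prior_classes, topic_counts, runs))
```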


Given that our datasets focus on fisheries only, making them homogeneous in nature, a small number of topics is expected—typically around 10 to 20 given the scope and aims of the selected journals.

The Python library Gensim (Rehurek and Sojka, 2010) was used to create the LDA models. Posterior inference approximation is performed with online variational Bayes (VB) as proposed by Hoffman et al. (2010). Online VB is based on an online stochastic optimization process and produces similar or improved (Hoffman et al., 2010) and faster (Bottou and Bousquet, 2007) LDA models compared to its batch variant. The Newton-Raphson process of iteratively learning asymmetrical Dirichlet priors can conveniently be incorporated into online LDA in linear time. The convergence iteration parameter for the expectation step (i.e. E-step) is set to 100, where per-document parameters are fit for the variational distributions [see Algorithm 2 in (Hoffman et al., 2010)].

3.3.4 Topic Coherence

The coherence of topics was calculated using the CV coherence measure described in detail in Section 3.2.3; CV has been shown to obtain the highest correlation with all available human topic ranking data. The segmentation of the topic's top-N words and the subsequent confirmation measures are calculated for N = 15, pairing each of the top 15 words with every other top-15 word and calculating their semantic support within the corpus. N = 15 was chosen, in contrast to, for example, N = 10 (Aletras and Stevenson, 2013), because no stemming or lemmatization was applied; with N = 10, several words with the same base form were among the top 10 words (e.g. sample, sampling), so analyzing the top 10 words would effectively mean analyzing fewer than 10 distinct words. The constant ε for the NPMI calculations (see (3.6)) avoids logarithms of zero and acts as a smoothing factor. This value is set to a very small number, 10^{-12}, as proposed by Stevens et al. (2012); the coherence measure is highly dependent on the smoothing constant, and a very small value significantly reduces the scores for unrelated words compared to, for example, ε = 1 (Mimno et al., 2011). The γ constant for the NPMI calculations (see (3.5)) is set to 1 to place equal weight on all NPMI values; in contrast to γ = 2 (Aletras and Stevenson, 2013), γ = 1 produced a higher correlation with human topic ranking data (Röder et al., 2015). The sliding window s for the Boolean sliding window calculation is set to 110 (Röder et al., 2015).

3.3.5 Human Topic Ranking

A fisheries domain expert manually ranked a selection of topics by inspecting the topic's top 15 most probable words and a selection of document titles and content. The domain expert is affiliated with the leading competence institution for fishery and aquaculture in Norway. As topic coherence scores are also obtained from the topic's top 15 words, the manual ranking of the top 15 words allows for an equal comparison between the two proposed assessments. The domain expert was asked to provide a label for each topic that best captures the semantics of the top 15 words. In addition, the domain expert was asked to rank the topics with respect to semantically correct or, conversely, incorrect words. An incorrect word could be a wrong fisheries domain-related word that does not match the topic label and, thus, does not fit the semantics of the majority of correct words. For example, in cases where most of the topic words relate to the fish species cod, an incorrect domain-related word might refer to a different kind of species. Furthermore, incorrect terms may be noise terms (i.e. words that serve a grammatical or syntactical purpose only). Topics are subsequently ranked by the proportion of correct terms among the top 15 words. High-quality topics have ≥ 90% correct words, medium-quality topics have ≥ 80% but < 90% correct words, and low-quality topics have < 80% correct words.
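The ranking thresholds can be expressed as a small function; the name `topic_quality` is ours, not part of the study's tooling.

```python
def topic_quality(correct_words, total_words=15):
    """Rank a topic by the fraction of semantically correct words in its top-N list.

    >= 90% correct: high quality; >= 80% but < 90%: medium quality; else: low.
    """
    fraction = correct_words / total_words
    if fraction >= 0.90:
        return "high"
    if fraction >= 0.80:
        return "medium"
    return "low"
```

For example, a topic with 14 of its top 15 words judged correct is high quality, one with 13 correct words is medium quality, and one with 11 correct words is low quality.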

3.3.6 Relaxing LDA Assumptions

At the time of writing, the original LDA paper by Blei et al. (2003) has over 22,000 citations. The technique has received much attention from machine learning researchers and other scholars and has been adopted and extended in a variety of ways. More concretely, relaxing the assumptions behind LDA can result in richer representations of the underlying semantic structures. The bag-of-words assumption has been relaxed by conditioning words on the previous words (i.e. a Markovian structure) (Wallach, 2006a); the document exchangeability assumption (i.e. the order in which documents are analyzed) has been relaxed by the dynamic topic model (Blei and Lafferty, 2006); and Bayesian non-parametric models can be utilized to automatically uncover the number of topics (Teh et al., 2004). Furthermore, LDA has been extended in various ways. Topics might correlate: a topic about "cars" is more likely to also be about "emissions" than about "diseases". The Dirichlet distribution cannot capture such dependencies, and a more flexible distribution, such as the logistic normal, is more appropriate for capturing covariance between topics; the correlated topic model aids in this task (Blei and Lafferty, 2007). Other examples extending LDA include the author-topic model (Rosen-Zvi et al., 2004), the relational topic model (Chang and Blei, 2010), the spherical topic model (Reisinger et al., 2010), the sparse topic model (Wang and Blei, 2009), and the bursty topic model (Doyle and Elkan, 2009). Topic models that relax or extend the original LDA model bring additional computational complexity and their own sets of limitations and challenges; nevertheless, it would be interesting to explore these models in future research.


3.4 Results

3.4.1 Topic Coherence

The coherence scores for the prior classes AA, AS, SA, and SS obtained from 8,012 full-text research articles (DS1) are shown in Figure 3.2. Additionally, the coherence scores for prior classes obtained from 4,417 abstracts (DS2) are shown in Figure 3.3. The coherence score represents the mean coherence score from all five runs for each value of k.

A visual inspection of Figs. 3.2a–3.2f (full-text data) shows that similar coherence scores are obtained for AA and AS (Figure 3.2a), with both sharing an asymmetrical prior over α but a different prior over η. Similar results are obtained when comparing SA and SS (Figure 3.2f), sharing a symmetrical prior over α and a different prior over η. Thus, varying η, while maintaining a similar prior over α, shows no real difference in the obtained coherence scores. A slightly increased coherence is obtained for an asymmetrical prior over α (e.g., Figure 3.2d) for k > 20. The other combinations explored (e.g. AA–SA, AA–SS, and AS–SS) show similar results: a slight increase in coherence for an asymmetrical prior over α, with η showing no real benefits on topic coherence.

Figs. 3.3a–3.3f show coherence scores for LDA models obtained from abstract data (DS2). AA–AS (Figure 3.3a) shows that different priors over η, while maintaining the same asymmetrical prior over α, result in similar coherence scores. Similarly, a symmetrical prior over α (Figure 3.3f) with different priors over η shows no real differences in topic coherence. However, a large difference in coherence is obtained when varying the priors over α (Figure 3.3d), with an asymmetrical α showing improved coherence over a symmetrical α. For DS2, asymmetrical priors over α, in contrast to the results for DS1, show higher coherence scores for all values of k. Moreover, varying the priors over η for both DS1 and DS2 has a negligible effect on the obtained coherence scores.

Table 3.2 shows the coherence score values obtained from DS1 for k = {2, ..., 50}, with X̄ representing the mean coherence over five runs, s the standard deviation, and f and p the one-way ANOVA F-value and p-value, respectively. The last six columns show the post hoc significance thresholds for all six comparisons of Dirichlet priors.
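For reference, the one-way ANOVA F-value reported in the tables is the ratio of the between-group to the within-group mean squares. A minimal sketch follows; the p-value additionally requires the F distribution (e.g. via scipy.stats) and is omitted here, and the group values below are invented.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean squares."""
    k = len(groups)                          # number of groups (prior classes)
    n = sum(len(g) for g in groups)          # total number of observations (runs)
    grand_mean = sum(x for g in groups for x in g) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

equal_means = anova_f([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]])      # F = 0: no group effect
separated = anova_f([[1.0, 1.0, 1.1], [5.0, 5.0, 5.1]])        # large F: clear effect
```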

Table 3.2 reveals that significant differences are obtained starting from k ≥ 25, although this does not hold for every k ≥ 25. For k < 25, except for k = 6, no significant differences are obtained for combinations of priors; asymmetrical or symmetrical priors over α and η have no significant effect on topic coherence. However, the coherence score values for k < 25 show slightly higher values (shown in bold) for a symmetrical prior over α compared to an asymmetrical prior. In contrast, for k ≥ 25, an asymmetrical prior over α shows higher coherence values compared to a symmetrical prior. For all k where p is significant, SA–SS show no significance and AA–AS show significance only for k = 6 and k = 47, indicating the marginal importance of symmetrical or asymmetrical priors over η.

Figure 3.2: A comparison of calculated CV topic coherence scores for all classes of priors (i.e. AA, AS, SA, SS), shown pairwise in panels (a) AA–AS, (b) AA–SA, (c) AA–SS, (d) AS–SA, (e) AS–SS, and (f) SA–SS. Coherence scores represent mean scores from five runs for K = {1, ..., 50}. DS1 = 8,012 full-text articles.

Figure 3.3: A comparison of calculated CV topic coherence scores for all classes of priors (i.e. AA, AS, SA, SS), shown pairwise in panels (a) AA–AS, (b) AA–SA, (c) AA–SS, (d) AS–SA, (e) AS–SS, and (f) SA–SS. Coherence scores represent mean scores from five runs for K = {1, ..., 50}. DS2 = 4,417 abstracts.

Table 3.3 shows the coherence score values and ANOVA statistics for DS2. For all k > 2, the difference is significant (p < 0.001). These significant differences are caused by using an asymmetrical prior over α compared to a symmetrical prior. Where DS1 shows mixed results between different priors over α, for DS2 every combination with an asymmetrical prior over α outperforms the symmetrical priors over α. The post hoc tests for comparisons between different priors over η are in almost all cases not significant, following a similar trend to DS1.
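The per-k comparisons reported in Tables 3.2 and 3.3 follow a standard recipe: a one-way ANOVA over the five coherence runs per prior class, followed by post hoc pairwise tests. A minimal sketch with SciPy; the coherence values below are illustrative placeholders, not the thesis data:

```python
from scipy.stats import f_oneway

# Five C_V coherence runs per prior class for a single k
# (made-up values for illustration; not the values in Table 3.3).
runs = {
    "AA": [0.549, 0.551, 0.546, 0.553, 0.548],
    "AS": [0.483, 0.477, 0.486, 0.481, 0.479],
    "SA": [0.466, 0.470, 0.463, 0.468, 0.472],
    "SS": [0.483, 0.484, 0.480, 0.486, 0.482],
}

# One-way ANOVA across the four prior classes.
f_value, p_value = f_oneway(*runs.values())
print(f"F = {f_value:.3f}, p = {p_value:.6f}")

# A significant p-value would then be followed by post hoc pairwise
# comparisons for the six prior-class pairs (e.g. Tukey's HSD).
```

With clearly separated group means and small within-group variance, as in this toy example, the ANOVA yields a very small p-value, mirroring the pattern seen for DS2.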

Table 3.2: Coherence score values and one-way ANOVA test statistics for DS1 for K = {2, ..., 50}. * p < 0.05, ** p < 0.01, *** p < 0.001. [Columns: mean X̄ and standard deviation s for each prior class (AA, AS, SA, SS); ANOVA f and p; post hoc significance for the six pairwise comparisons AA–AS, AA–SA, AA–SS, AS–SA, AS–SS, SA–SS.]

[Table 3.2: continued (K = 26, ..., 50).]

Table 3.3: Coherence score values and one-way ANOVA test statistics for DS2 for K = {2, ..., 50}. * p < 0.05, ** p < 0.01, *** p < 0.001. [Columns: mean X̄ and standard deviation s for each prior class (AA, AS, SA, SS); ANOVA f and p; post hoc significance for the six pairwise prior comparisons.]

[Table 3.3: continued (K = 26, ..., 50).]


Table 3.4: Human topic ranking for DS2 (abstract) on k = 17 LDA model

Class High-quality Medium-quality Low-quality

AA 15/17 (88%)   2/17 (12%)   0/17 (0%)
AS 15/17 (88%)   2/17 (12%)   0/17 (0%)
SA 12/17 (70.5%) 4/17 (23.6%) 1/17 (5.9%)
SS 11/17 (64.7%) 6/17 (35.3%) 0/17 (0%)

3.4.2 Human Topic Ranking

The results of the fisheries domain expert's human topic ranking are shown in Table 3.4 (see Section 3.3.5 for the classification method). Human topic ranking was performed on DS2 for k = 17 LDA models, which is the k-value that shows the best coherence score (via the elbow method) and, simultaneously, the k-value with the largest difference amongst all prior classes (ANOVA f = 41.06). A similar pattern as found for the topic coherence scores can be identified (Figs. 3.3a–3.3f): AA and AS, with an asymmetrical prior over α, result in more high-quality topics (88%) compared to SA and SS, with a symmetrical prior over α (70.5% and 64.7% high-quality topics, respectively). AA and AS perform identically, indicating that priors over η have no effect on human topic ranking. Furthermore, SA and SS show similarly lower human topic rankings, with three topics classified differently: SA has 70.5% high-quality topics compared to 64.7% for SS, but simultaneously one low-quality topic. A two-dimensional inter-topic distance map for DS2 with k = 17 is displayed in Figure 3.4 for all classes of priors. This two-dimensional representation is obtained by computing the distance between topics (Chuang et al., 2012) and applying multidimensional scaling (Sievert and Shirley, 2014). It displays the similarity between topics concerning their probability distribution over words (i.e. β). In addition, a topic label that best captures the semantics of the top 15 words is attached to each topic.

We omitted human topic ranking results for DS1 as they show an equal number of high-quality and medium-quality topics for all classes of priors and for several arbitrarily chosen k-values (k < 25). These results are in line with the topic coherence scores, which are similar for all prior classes (see Figs. 3.2a–3.2f). An inspection of k ≥ 25 LDA models (the point where significant differences between prior classes start) shows an increasing number of incorrect terms for LDA models with a symmetrical prior over α (SA and SS), compared to models with an asymmetrical prior over α.
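The inter-topic distance map used below can be approximated by computing pairwise distances between topic–word distributions and projecting them into two dimensions with multidimensional scaling. A sketch with SciPy and scikit-learn; the random topic matrix is a stand-in for LDA's fitted β:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Stand-in for beta: k topics x V words, each row a probability distribution.
k, vocab = 17, 200
beta = rng.dirichlet(np.full(vocab, 0.1), size=k)

# Pairwise Jensen-Shannon distances between topic-word distributions.
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = jensenshannon(beta[i], beta[j])

# Project the distance matrix to 2-D, as in Sievert and Shirley (2014).
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords.shape)  # one (x, y) coordinate per topic
```

Each topic then gets one point on the map; node size (topic prevalence) and the human-ranking color coding are added on top of these coordinates.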


[Figure 3.4 panels: (a) AA, (b) AS, (c) SA, (d) SS; nodes carry topic labels such as fish diseases & parasites, otoliths, lake nutrients, population genetics, climate change, acoustic survey, and fishing mortality.]

Figure 3.4: A two-dimensional inter-topic distance map (via multidimensional scaling) for all classes of priors for DS2 with k = 17. The surface of a node indicates the overall topic prevalence within the corpus. Color coding is used to indicate the human topic ranking classification: white = high-quality, grey = medium-quality, and black = low-quality.


3.5 Discussion and Conclusion

Our results show that an asymmetrical prior over α results in increased topic coherence and improved topic ranking compared to a symmetrical prior. However, this particularly holds for the DS2 dataset, the collection of 4,417 abstracts, and not necessarily for the DS1 dataset, the collection of 8,012 full-text documents. Thus, selecting a different prior over α has large practical implications for datasets that contain a smaller vocabulary and are homogeneous in nature. Symmetrical or asymmetrical priors over η show no real benefits regarding topic coherence and human topic ranking.

The results on DS2 are in line with research performed by Wallach et al. (2009), who found that an asymmetrical prior over α improves the likelihood of held-out data and that different priors over η show no real differences; the vocabulary size of the DS2 dataset is comparable to the vocabulary size used in their research. However, for DS1, the full-text dataset with a significantly larger vocabulary and a higher average number of words per document, we found no difference for combinations of priors over α and η, making topic coherence and manual topic ranking less influenced by the choice of priors on full-text data.

A symmetrical prior over α assumes that all topics have an equal probability of being assigned to a document. Such an assumption ignores the fact that certain topics are more prominent in a document collection and, consequently, would logically have a higher probability of being assigned to a document. Conversely, specific topics are less common and are, thus, not appropriately reflected by a symmetrical prior distribution. Logically speaking, an asymmetrical prior over α would capture this intuition and would, therefore, be the preferred choice. We have empirically shown that this intuition indeed results in significantly higher topic coherence and a better topic ranking for DS2, and for DS1 with k ≥ 25. For DS1 with k < 25, the differences are not significant, although human topic ranking shows slightly better topics for the classes with an asymmetrical prior over α.
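This intuition can be illustrated by sampling document–topic proportions θ from both priors: under a symmetrical α every topic is equally likely a priori, while an asymmetrical α lets some topics dominate. A NumPy sketch with illustrative α values:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 5

# Symmetrical prior: identical concentration for every topic.
alpha_sym = np.full(k, 0.5)

# Asymmetrical prior: a priori, topic 0 is expected to be prominent
# (the concrete values here are illustrative).
alpha_asym = np.array([2.0, 0.5, 0.5, 0.25, 0.25])

theta_sym = rng.dirichlet(alpha_sym, size=10_000)
theta_asym = rng.dirichlet(alpha_asym, size=10_000)

# Expected topic proportions: uniform vs. skewed toward topic 0.
print(theta_sym.mean(axis=0).round(2))   # close to [0.2, 0.2, 0.2, 0.2, 0.2]
print(theta_asym.mean(axis=0).round(2))  # topic 0 takes the largest share
```

The expected proportion of topic i under a Dirichlet prior is αᵢ / Σα, so the asymmetrical prior directly encodes that some topics are more prevalent than others.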

Concerning priors over η, we naturally want topic–word distributions to be different from each other so as to avoid conflicts between them. A symmetrical prior over η will reflect the power-law usage of words (i.e. some words occur in all topics) while simultaneously resolving ambiguity between topics with a few distinct word co-occurrences (Wallach et al., 2009). Therefore, symmetrical priors over η are the preferred choice. Although our empirical results indicate no real benefits when varying priors over η, the symmetrical prior shows a slight, but still very marginal, overall improvement in coherence and ranking results for both datasets.

Table 3.5: A selection of modeling topics for k = 17. Terms in bold are considered incorrect words. [For DS1 and DS2, each prior class (AA, AS, SA, SS) is listed with a topic label, e.g. Models (population), Models (abundance), Models (fishing), Models (growth), its top 15 words, and its High or Medium human ranking.]


Human topic ranking was based on the presence of incorrect terms among the topic's 15 most probable words. A closer look into the reasons why topics are ranked lower reveals that all topics contain correct domain-related terms, and are only ranked lower due to the presence of so-called noise terms (e.g. used, using, two, among, total, higher, within, great, large, high, significantly). Lower-ranked topics contain a higher number of such terms and, as such, are classified as medium- or low-quality topics. Interestingly, none of the topics contain incorrect domain-related terms that could refer to, for example, the biological, ecological, socio-ecological, or social aspects of fisheries. A selection of topics and incorrect terms is shown in Table 3.5. Topics uncovered from abstract data, combined with a symmetrical prior over α, are more prone to contain such noise terms. For full-text data, all classes of priors show an equal but low number of noise words.

A growing amount of research utilizes LDA to uncover latent semantic structures from scientific research articles as a means to discover topical trends and developments within a particular research area (Gatti et al., 2015; Sun and Yin, 2017; Westgate et al., 2015; Wang and McCallum, 2006; Alston and Pardey, 2016). These approaches are often characterized by (i) exploring one or several domain-specific journals (e.g. journals related to transportation research, or operations research and management science), (ii) using abstract data, and (iii) using an open-source tool (e.g. Mallet, Gensim) to perform LDA. Our approach touches upon all three characteristics; thus, we recommend an asymmetrical prior over α and a symmetrical prior over η for optimal topic coherence and topic ranking.

Figure 3.4 shows a visual representation of the 17 latent topics for DS2. We identify several overlapping topics (e.g. otoliths, population genetics) and several semantically related topics (e.g. population dynamics and population genetics; lake nutrients and lake waters). At the same time, we find several topics occurring in one of the prior classes that are absent in the other prior classes. For instance, fish diseases and parasites occurs only in AA (Figure 3.4a). One reason might be that manual topic labeling is limited by the subjectivity inherent in human interpretation (Urquhart, 2001); indeed, an analysis of the topics by another domain expert could yield contradictory results. Another reason might be the probabilistic nature of LDA, where differences are merely a result of differences in sampling. Although such an analysis is outside the scope of this research, it is an interesting direction for future research. Furthermore, the research performed by Wallach et al. was applied to corpora of patent, newsgroup, and news data, whereas this paper analyzed scientific research articles. Future research might focus on different types of scientific articles, more broadly oriented journals, or other unexplored forms of textual data to gain more insight into the practical effects Dirichlet priors have on LDA's latent topics.


Chapter 4

Bootstrapping a Semantic Lexicon on Verb Similarities

We present a bootstrapping algorithm to create a semantic lexicon from a list of seed words and a corpus mined from the web. We exploit extraction patterns to bootstrap the lexicon and use collocation statistics to dynamically score new lexicon entries. Extraction patterns are subsequently scored by calculating their conditional probability in relation to a non-related text corpus. We find that verbs that are highly domain-related achieve the highest accuracy, and that collocation statistics affect the accuracy both positively and negatively during the bootstrapping runs.

This work was originally published as:

S. Syed, M. Spruit, and M. Borit. Bootstrapping a Semantic Lexicon on Verb Similarities. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 1, pages 189–196. Scitepress, 2016. doi: 10.5220/0006036901890196

73 CHAPTER 4. BOOTSTRAPPING A SEMANTIC LEXICON

4.1 Introduction

Semantic lexicons for specific domains are increasingly important in many Natural Language Processing (NLP) tasks such as word sense disambiguation, anaphora resolution, and discourse processing. Online lexical resources such as WordNet (Miller et al., 1990) and Cyc (Lenat, 1995) are useful for generic domains but often fall short for domains that include specific terms, jargon, acronyms, and other lexical variations. Handcrafting domain-specific lexicons can be time-consuming and costly. Various techniques have therefore been developed to automatically create semantic lexicons, such as lexicon induction, lexicon learning, lexicon bootstrapping, lexical acquisition, hyponym learning, and web-based information extraction.

We define a semantic lexicon as a dictionary of hyponym words that share the same hypernym. For example, cat, dog, and cow are all hyponym words that share the hypernym ANIMAL. Likewise, the words red, green, and blue are semantically related by the hypernym COLOR. A semantic lexicon differs from an ontology or taxonomy as it does not describe the formal representation of shared conceptualizations or information about concepts and their instances, nor does it provide a strict hierarchy of classes.

Several attempts have been made to automatically create semantic lexicons from text corpora by utilizing semantic relations in conjunctions (dogs and cats and cows), lists (dogs, cats, cows), appositives (labrador retriever, a dog), and compound nouns (dairy cow) (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Phillips and Riloff, 2002; Widdows and Dorow, 2002). Others have used extraction patterns (Colombia was divided, the country was divided) (Thelen and Riloff, 2002; Igo and Riloff, 2009), instance/concept ("is-a") relationships (Pantel and Ravichandran, 2004), coordination patterns (Ziering et al., 2013b), multilingual symbiosis (Ziering et al., 2013a), or combinations thereof (Qadir and Riloff, 2012). Lexical learning from informal text, such as social media, has also been performed (Qadir et al., 2015).

Learning semantic lexicons is often based on existing corpora, which may not be available for all domains. Furthermore, little research on lexicon learning has been performed on web text from informative sites, forums, blogs, and comment sections, which contain content written by a variety of people and thus different writing idiosyncrasies. We incorporate web mining techniques to build our own domain corpus and combine the BASILISK bootstrapping algorithm (Thelen and Riloff, 2002) with the Pointwise Mutual Information (PMI) scoring metric proposed by Igo and Riloff (2009). We base the hyponym relationships on the extraction pattern context of verb stem similarities and use a probability scoring metric to extract the most suitable verbs. Our aim is to use existing web content to create a semantic lexicon and subsequently use extraction patterns to find semantically related words. Extraction patterns, and more specifically the verbs within these patterns, are scored against a non-related text corpus to explore whether verbs that occur more frequently in domain text are more likely to be accompanied by semantically related nouns.


The remainder of this paper is structured as follows: First, we present previous work in semantic lexicon learning for hypernym–hyponym relationships. Second, we detail our bootstrapping algorithm, the extraction of the text corpus, the creation of extraction patterns, and the scoring method. Third, we analyze the semantic lexicon and the accuracy of our algorithm during each successive run of the bootstrapping process. We then conclude our work and provide new research directions.

4.2 Previous Work

Riloff and Shepherd (1997) used noun co-occurrence statistics to bootstrap a semantic lexicon from raw data. Their bootstrapping algorithm scored conjunctions, lists, appositives, and nominal compounds to find category words within a context window. It is one of the first attempts to build a semantic lexicon and uses human intervention to review the words and select the best ones for the final dictionary. Roark and Charniak (1998) applied a similar technique but used a different definition of noun co-occurrence and a different way of scoring candidate words. Riloff and Shepherd (1997) ranked and selected candidate words based on the ratio of noun co-occurrences with the seed list to the total frequency of the noun in the corpus, while Roark and Charniak (1998) used log-likelihood statistics (Dunning, 1993) for the final ranking.

Widdows and Dorow (2002) used graph models of the British National Corpus for lexical acquisition. They focused on the relationships between nouns when they occurred as part of a list. The nodes represent nouns and are linked to each other when they conjunct with either and or or. The edges are weighted by the frequency of the co-occurrence. Their algorithm mitigates the infection of bad words entering the candidate word list by looking at type frequency rather than token frequency.

Phillips and Riloff (2002) automatically created a semantic lexicon by looking at strong syntactic heuristics. They distinguish between two types of lexicons, (1) proper noun phrases and (2) common nouns, by utilizing the syntactic relationships between them. Syntactic structures are defined by appositives, compound nouns, and "is-a" clauses (identity clauses with a main verb of to be). Statistical filtering is applied to avoid inaccurate syntactic structures and deterioration of the lexicon entries. Pantel and Ravichandran (2004) also looked at "is-a" relationships by examining concept signatures and applying a top-down approach. Their method first uses co-occurrence statistics for semantic classes and then finds the most appropriate hyponym relationship.

Thelen and Riloff (2002) proposed a weakly supervised bootstrapping algorithm, called BASILISK, to learn semantic lexicons by looking at extraction patterns. They used the AutoSlog (Riloff, 1996) extraction pattern learner and a list of manually selected seed words to build semantic lexicons for the MUC-4 proceedings, a terrorism-related corpus. Extraction patterns capture role relationships and are used to find noun phrases with similar semantic meaning as a result of syntax and lexical semantics. This

is explained by their example verb robbed. The subject of the verb robbed often indicates the perpetrator, while the direct object of the verb robbed could indicate the victim or target. To avoid inaccurate words entering the lexicon and to increase the overall accuracy, they learned multiple categories simultaneously. Igo and Riloff (2009) used BASILISK and applied co-occurrence statistics for hypernym and seed word collocation by computing a variation of Pointwise Mutual Information (PMI). They used web statistics to re-rank the words after the bootstrapping process was finished.

Qadir and Riloff (2012) used pattern-based dictionary induction, contextual semantic tagging, and coreference resolution in an ensemble approach. They combined the three techniques in a single bootstrapping process and added lexicon entries only if words occurred in at least two of the three methods. Since each method exploits independent sources, their ensemble approach improves precision in the early stages of the bootstrapping process, in which semantic drift (Curran et al., 2007) can decrease the overall accuracy substantially. Ziering et al. (2013a) also employed an ensemble method and used linguistic variation between multiple languages to reduce semantic drift.

We combine the BASILISK bootstrapping algorithm (Thelen and Riloff, 2002) and the co-occurrence statistics proposed by Igo and Riloff (2009) to dynamically score each word before it enters the lexicon. We compare BASILISK's AvgLog scoring metric with the PMI metric at each bootstrap run, rather than using PMI scores to re-rank the lexicon after the bootstrapping process has finished.

4.3 Lexicon Bootstrapping

We use a list of seed words to initiate the bootstrapping process and use this same set of seed words to create a highly related text corpus by mining the web. Web mining was applied because domain-specific corpora are not always readily available. Before bootstrapping begins, we use the linguistic expressions of extraction patterns (Riloff, 1996) and group noun phrases when they occur with the same stemmed verb. The stemmed verbs are then given a probability score with respect to a non-related text corpus: verbs that are domain specific, that is, that occur infrequently in general text, receive a higher score. We then use BASILISK's bootstrapping algorithm and a PMI scoring metric before adding new words to the lexicon. The PMI score is a measure of association between words. For this, we use web count statistics between hyponym and hypernym words to calculate collocation statistics and use these scores to allow only the strongest hyponyms to enter the lexicon, thus reducing incorrect lexicon entries during the bootstrapping process.

A schematic overview of the bootstrapping process is displayed in Figure 4.1 and Algorithm 1. We discuss the bootstrapping process in detail in the subsequent sections.

76 CHAPTER 4. BOOTSTRAPPING A SEMANTIC LEXICON

Figure 4.1: Schematic overview of the bootstrapping process: the seed words are used to create a corpus, from which extraction patterns are extracted; patterns are scored and the top-N are placed in a pattern pool, nouns from the pool form a word pool, and the top-N words are added to the semantic lexicon.

Algorithm 1 Bootstrap Lexicon

 1: lexicon ← seedwords
 2: corpus ← TopNSearches(seedwords)
 3: patterns ← Patterns(corpus, 0.5 ≤ p ≤ 1.0)
 4: for i = 0 to i < m do
 5:     patternpool ← Score(patterns, topN + i)
 6:     words ← GetNouns(patternpool)
 7:     lexicon ← ScoreWord(words, topN) ∉ lexicon
 8:     i++
 9: end for
10: return lexicon
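The overall loop of Algorithm 1 can be sketched in Python. This is an illustrative reconstruction, not the thesis implementation: the pattern score below is a simple lexicon-overlap count standing in for the RlogF metric, the candidate-word score is a pattern-membership count standing in for AvgLog/PMI (both described in Section 4.3.6), and `patterns` is assumed to already be a list of noun sets grouped by stemmed verb.

```python
def bootstrap(seed_words, patterns, m=5, top_n=20):
    """Sketch of Algorithm 1: grow a lexicon from seed words via scored patterns.

    `patterns` is a list of sets of nouns, each set grouped under one stemmed
    verb (Section 4.3.5). Both scoring functions are simplified stand-ins for
    the RlogF and AvgLog/PMI metrics of Section 4.3.6.
    """
    lexicon = set(seed_words)
    for i in range(m):
        # Rank patterns by overlap with the current lexicon; the pattern pool
        # grows by one on every run, letting new patterns enter the process.
        pool = sorted(patterns, key=lambda p: len(p & lexicon), reverse=True)[:top_n + i]
        candidates = set().union(*pool) - lexicon
        if not candidates:
            break
        # Add the single best-scoring candidate word (top N = 1), then repeat.
        best = max(candidates, key=lambda w: sum(w in p for p in pool))
        lexicon.add(best)
    return lexicon

patterns = [{"fish", "tuna", "quota"}, {"fish", "cod"}, {"car", "road"}]
lexicon = bootstrap(["fish"], patterns, m=2, top_n=2)
print(len(lexicon))  # one seed word plus two bootstrapped words -> 3
```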


4.3.1 Domain and Seed Words

We chose to build a single lexicon for the fisheries domain. The fisheries domain includes a multitude of knowledge production approaches, from mono- to transdisciplinary. Biologists, oceanographers, mathematicians, computer scientists, anthropologists, sociologists, political scientists, economists, and researchers from many other disciplines contribute to the fisheries body of knowledge, together with non-academic participants such as decision makers and stakeholders. Due to these diverse contributions of specialized language from a multitude of knowledge production approaches, the fisheries domain is characterized by a large body of words. For the same reason, this body of words is extremely rich in concepts. However, sometimes different words refer to the same abstraction (e.g. fishermen and fishers), and sometimes the same words refer to different abstractions (e.g. fisher behavior may mean one thing for an economist and something else for an anthropologist). In addition, the fisheries domain has a high frequency of compound words (e.g. fisheries management, fishing method), which differentiate it from other resource management domains, such as forestry. The definition of a hyponym-hypernym relation is therefore based on a more abstract level and also includes transitive relationships: if x is a hyponym of y, and y is a hyponym of z, then x is a hyponym of z.

The seed words were chosen by experts from the Arctic University of Norway, who were asked for a list of 10 nouns or compound nouns that best cover the domain. The list contains the phrases fishery ecosystem, fisheries management, fisheries policy, fishing methods, fishing gear, fishing area, fish, fish species, fishermen, and fish supply chain.

4.3.2 Building the Corpus

The fisheries corpus was created by running a Google search for each of the seed words and extracting the content of the first 30 URLs. We enclosed the 10 search terms in quotation marks to force Google into an exact match. We found that the first 30 results provided sufficient text for the corpus while still containing search-related content. After mining the pages we started an extensive data cleaning process. Besides cleaning HTML, JavaScript, and CSS tags, we needed to remove text from, e.g., headers, footers, labels, and sidebars that entered the corpus. We removed content such as "click here", "all rights reserved", and "copyright". Finally, we scored the mined text fragments with an en-US and en-GB lookup dictionary D as:

S_{text_i} = ( Σ_{i=1}^{V_i} w_i ) / V_i        (4.1)


Figure 4.2: The lexical dispersion plot shows how often a word, displayed on the y-axis, occurs in the cleaned corpus. The x-axis displays the position within the corpus from beginning to end.

Where V_i is the vocabulary size of text_i, and w_i a word match such that w_i ∈ D. We found S ≥ 0.3 sufficient to clean the corpus even further, as it removes non-English text_i and improperly formatted text. For example, a phrase like "Not logged inTalkContributionsCreate accountLog in" bears no meaning and serves navigational purposes only when rendered by the web browser, yet it is extracted from the HTML content. Figure 4.2 shows the dispersion plot for the seed word unigrams after the cleaning phase.
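The dictionary score of equation 4.1 can be sketched in a few lines. This is an illustrative version: the `english` set below is a tiny stand-in for the actual en-US/en-GB lookup dictionary D.

```python
def dictionary_score(tokens, dictionary):
    """Score a mined text fragment per equation 4.1: the fraction of its
    vocabulary (unique tokens) found in an English lookup dictionary."""
    vocabulary = set(tokens)
    return sum(1 for w in vocabulary if w in dictionary) / len(vocabulary)

# Tiny stand-in for the en-US/en-GB lookup dictionary D.
english = {"the", "fish", "quota", "management", "of"}

good = dictionary_score("the management of the fish quota".split(), english)
junk = dictionary_score("inTalkContributionsCreate accountLog in".split(), english)
print(good, junk)  # fragments with S >= 0.3 are kept; the junk fragment scores 0.0
```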

4.3.3 Chunking

We used the Punkt sentence tokenizer and a part-of-speech (POS) tagger to create individual sentences with their grammatical properties. We then used a shallow parser with regular expressions (RE) on POS tags to extract noun phrases (NP) when they occur as a direct object or subject. The verb that precedes or follows the NP is used to group related NPs when they share the same stem.


Figure 4.3: Parse tree for the noun phrase European fish with its accompanying transitive verb manage, and the noun phrase the marine environment with transitive verb protect. Both NPs occur as direct objects.

The RE grammar for the parser is defined as (1) (<VB.*><DT>?<JJ>*<NN.*>+) for direct object noun phrases and (2) (<DT>?<JJ>*<NN.*>+<VB.*>) for subject noun phrases. Pattern (1) extracts a verb in any tense (<VB.*>), followed by an optional determiner (<DT>?), followed by zero or more adjectives (<JJ>*), and ending with at least one noun (<NN.*>+). Figure 4.3 shows the parse tree of the phrase manage European fish and protect the marine environment, from which we extract the two patterns manage European fish and protect the marine environment.

Pattern (2) extracts patterns starting with an optional determiner (<DT>?), followed by zero or more adjectives (<JJ>*), followed by at least one noun (<NN.*>+), and ending with a verb in any tense (<VB.*>). Figure 4.4 shows the parse tree for the phrase the northern fisherman catches, in which the NP occurs as a subject.

Figure 4.4: Parse tree for the noun phrase the northern fisherman with its accompanying transitive verb catches. The NP occurs as a subject.

4.3.4 Scoring Verbs

The verb tenses that precede an NP as a direct object, or follow the NP as a subject, are stemmed with the Porter algorithm (Porter, 1980) to subsequently group NPs when creating the extraction patterns. Stemming groups verb tenses into the same stem, which is not necessarily the root of the verb. To score the stemmed verbs we used a non-related text corpus containing rural information. The idea behind scoring the domain related verbs is to limit the extraction patterns to verbs that are more frequently found in the fisheries domain. For example, the verb create is often found in all sorts of domains, but the verb catch is not. The noun phrases that are linked to the verb catch are therefore more likely to contain semantically related words.

Figure 4.5: Frequency of stemmed verbs and estimated probability.

Let D_1 be the set of stemmed verbs from the seed word related corpus (Section 4.3.2) and D_2 the set of stemmed verbs from the non-related (rural) corpus, with vocabulary sizes |V_{D_1}| ≈ |V_{D_2}|. Now let L be the combined set of stemmed verbs such that for every l ∈ L we have that l ∈ D_1 ∪ D_2. Let N_{l,D_1} denote the number of instances of l contained in D_1. We estimate the probability that stemmed verb l is contained within the seed word related set D_1 as:

P(l | D_1) = N_{l,D_1} / ( N_{l,D_1} + N_{l,D_2} )        (4.2)

For example, the verb supervise, which includes supervised and supervising, gets stemmed into supervis. A frequency of N_{l,D_1} = 12 and N_{l,D_2} = 3 results in a score of 0.8. The distribution of scores is shown in Figure 4.5, which displays the number of verbs at each estimated probability. For example, there are 50 verbs with a score P ≥ 0.9 that are highly domain related, such as prohibit, catch, rescue, exploit, breath, deplete, and fish.
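Equation 4.2 amounts to counting each stem in both corpora. The sketch below reproduces the supervis example; the toy verb lists are illustrative, not the actual corpora.

```python
from collections import Counter

def verb_scores(domain_stems, general_stems):
    """Equation 4.2: P(l | D1) = N_l,D1 / (N_l,D1 + N_l,D2) per stemmed verb l."""
    n1, n2 = Counter(domain_stems), Counter(general_stems)
    return {l: n1[l] / (n1[l] + n2[l]) for l in set(n1) | set(n2)}

# Toy counts matching the supervis example: 12 domain occurrences, 3 rural ones.
scores = verb_scores(["supervis"] * 12 + ["catch"] * 9,
                     ["supervis"] * 3 + ["creat"] * 20 + ["catch"] * 1)
print(scores["supervis"])  # -> 0.8
```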


Table 4.1: Example phrases where the transitive verb of a phrase gets stemmed into the same stem.

stem  | verb      | noun phrase        | phrase
regul | regulated | industry           | ... regulated industry ...
regul | regulated | seafood trade      | ... regulated seafood trade ...
regul | regulated | quota              | ... quota regulated ...
regul | regulates | legislation        | ... legislation regulates ...
catch | catching  | fish               | ... catching wild fish ...
catch | catching  | marine life        | ... catching marine life ...
catch | catch     | deep-dwelling fish | ... catch deep-dwelling fish ...
catch | catches   | fisherman          | ... the fisherman catches ...

4.3.5 Verb Extraction Pattern

We selected verbs for which P ≥ 0.5 and grouped them in steps of 0.1 such that 0.9 ≤ P ≤ 1.0, 0.8 ≤ P ≤ 1.0, ..., 0.5 ≤ P ≤ 1.0, so that we analyze the top 10%, 20%, ..., 50% of verbs. To create the extraction patterns, any noun or compound noun extracted by the chunker is grouped together if they share the same stemmed verb in either the subject or direct object case. For example, Table 4.1 shows some phrases for the root verb regulate, which gets stemmed into regul, and for the root verb catch, whose stem is identical to its root. The nouns industry, seafood trade, quota, and legislation are grouped together into an extraction pattern because they all share the same stem regul.
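The grouping step reduces to building a map from verb stem to the set of noun phrases it occurs with. A minimal sketch, using the Table 4.1 rows as input (the (stem, noun phrase) pairs are assumed to come from the chunker and stemmer):

```python
from collections import defaultdict

def build_patterns(stem_np_pairs):
    """Group extracted noun phrases into extraction patterns keyed by the
    shared stemmed verb (as in Table 4.1)."""
    patterns = defaultdict(set)
    for stem, noun_phrase in stem_np_pairs:
        patterns[stem].add(noun_phrase)
    return dict(patterns)

pairs = [("regul", "industry"), ("regul", "seafood trade"), ("regul", "quota"),
         ("regul", "legislation"), ("catch", "fish"), ("catch", "marine life")]
print(sorted(build_patterns(pairs)["regul"]))
# -> ['industry', 'legislation', 'quota', 'seafood trade']
```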

For every NP, we have extracted the noun or compound noun. We have not restricted ourselves to the head noun, as compound nouns are generally more informative for the fisheries domain. For example, marine science was accepted as a fisheries related word but science was not. An overview of some extraction patterns is listed below:

• {industry, seafood trade, quota, legislation}
• {fish, marine life, deep-dwelling fish, fisherman}
• {catch, conservation, sustainability, growth, fishing, participation, tuna, management approach}
• {oxygen, air, fish, equipment, right}
• {sediment resuspension, world, bycatch, potential, fishing technique, hunger}

We found 468 different extraction patterns in our corpus. An overview of the distribution is shown in Figure 4.6 (y-axis logarithmically scaled).


Figure 4.6: The frequency of nouns in an extraction pattern. The x-axis shows the extraction patterns found in our corpus (468). The y-axis (log scale) shows the number of nouns grouped within each pattern.


4.3.6 Bootstrapping

The bootstrapping process starts with a list of 10 seed words (Section 4.3.1). Next, all the extraction patterns are scored by calculating the RlogF score (Riloff, 1996):

RlogF(pattern_i) = (F_i / N_i) · log₂(F_i)   if F_i ≥ 1
RlogF(pattern_i) = −1                        if F_i = 0        (4.3)

Where F_i is the number of lexicon words found in pattern_i and N_i the total number of nouns in pattern_i. Extraction patterns that contain nouns already part of the lexicon receive a higher score. The first iteration selects the top N patterns, which are then placed into a pattern pool. We used N = 20 for the first iteration and incremented it by 1 on every subsequent run to allow new patterns to enter the process.
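The RlogF score of equation 4.3 translates directly into code. A minimal sketch, with a toy pattern and lexicon for illustration:

```python
import math

def rlogf(pattern_nouns, lexicon):
    """Equation 4.3: (F/N) * log2(F) when F >= 1, else -1, where F is the
    number of pattern nouns already in the lexicon and N the pattern size."""
    n = len(pattern_nouns)
    f = sum(1 for noun in pattern_nouns if noun in lexicon)
    return (f / n) * math.log2(f) if f >= 1 else -1.0

lexicon = {"fish", "quota"}
print(rlogf(["industry", "seafood trade", "quota", "fish"], lexicon))
# (2/4) * log2(2) -> 0.5
```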

The next step is to score all the nouns that are part of the pattern pool. We evaluated two scoring metrics: (1) BASILISK's AvgLog, and (2) PMI using search counts from the Bing search engine. The AvgLog score is defined as:

AvgLog(word_i) = ( Σ_{j=1}^{P_i} log₂(F_j + 1) ) / P_i        (4.4)

(1) AvgLog uses all the patterns to score the nouns found in the top N patterns. P_i is the number of patterns in which word_i occurs and F_j the number of lexicon words found in pattern j. The nouns in the pattern pool are given a higher score, and are thus considered more semantically related, when they also occur in other extraction patterns with a high number of lexicon word matches.
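A minimal sketch of the AvgLog metric of equation 4.4, with an illustrative toy pattern set:

```python
import math

def avg_log(word, patterns, lexicon):
    """Equation 4.4: average log2(F_j + 1) over the P_i patterns containing
    word_i, where F_j is the number of lexicon words in pattern j."""
    containing = [p for p in patterns if word in p]
    if not containing:
        return 0.0
    return sum(math.log2(sum(n in lexicon for n in p) + 1)
               for p in containing) / len(containing)

patterns = [{"quota", "fish", "industry"}, {"fish", "tuna", "quota"}]
lexicon = {"quota", "tuna"}
score = avg_log("fish", patterns, lexicon)  # (log2(1+1) + log2(2+1)) / 2
```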

(2) is based on the hypernym collocation statistics proposed by Igo and Riloff (2009). We implement the PMI scoring metric within the bootstrapping process and dynamically calculate collocation statistics before adding new words to the lexicon. We hypothesize that lexicon words that occur more often in collocation with their domain are more likely to be semantically related. We use the number of hits between a lexicon word (hyponym) and its hypernym word by utilizing the NEAR operator of Microsoft's Bing search engine. We chose a collocation range of 10 and define the PMI score as:

PMI(x, y) = log( P(x, y) / ( P(x) · P(y) ) )        (4.5)

PMI(x, y) = log(N) + log( Count(x, y) / ( Count(x) · Count(y) ) )        (4.6)


PMI(x, y) = log( Count(x, y) / ( Count(x) · Count(y) ) )        (4.7)

PMI(x, y) is the Pointwise Mutual Information that lexicon word x occurs with hypernym y, where P(x, y) is the probability that x and y occur together on the web (using the NEAR operator), P(x) the probability that x occurs on the web, and P(y) the probability that y occurs on the web. We would have to calculate probabilities such as P(x) by dividing the count of x by the total number of web pages N. However, N is not known and can be omitted because it is the same for each lexicon word. We can therefore rewrite PMI(x, y) as equation 4.7: the log of the number of hits from the collocation statistics divided by the counts of the individual parts.
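Since log(N) shifts every score by the same constant, only equation 4.7 is needed for ranking candidates. A minimal sketch; the hit counts below are hypothetical, not actual Bing statistics:

```python
import math

def pmi(count_xy, count_x, count_y):
    """Equation 4.7: PMI up to the constant log(N), from raw web hit counts."""
    return math.log(count_xy / (count_x * count_y))

# Hypothetical hit counts for a hyponym NEAR its hypernym vs. the terms alone.
strong = pmi(count_xy=900, count_x=5_000, count_y=80_000)
weak = pmi(count_xy=30, count_x=5_000, count_y=80_000)
# A candidate that collocates more often with its hypernym scores higher,
# so only the strongest hyponyms enter the lexicon.
```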

Each noun that was part of the extraction pattern was given a score and the top-N nouns were selected to enter the lexicon. We added the noun with the highest score (N=1) to the lexicon and repeated the bootstrapping process.

4.4 Evaluation

We evaluated the lexicon entries against a gold standard dictionary. Domain experts labeled every noun or compound noun that was found in the extraction patterns. A value of 1 was assigned if the word was related to the fisheries domain and 0 if it had no relationship. We did not distinguish between highly relevant domain words and less relevant words. For example, the word marine science is highly relevant and often has a direct link to the domain, but the word conservation in itself can be ambiguous: it could relate to, e.g., conserving artwork or to preventing the depletion of natural resources. However, a word was considered correct if any sense of the word is semantically related. Furthermore, unknown words were manually looked up for their meaning. For example, plecoglossus altivelis, oncorhynchus keta, leucosternon, peach anthias and khaki grunter are types of fish unknown to the annotators, yet are semantically related.

We ran the bootstrapping algorithm until the lexicon contained 100 words. We repeated the process for the top 10% to top 50% scored verbs, as discussed in Section 4.3.5, and compared the AvgLog (Sb) and PMI (Spmi) scoring metrics. Examples of lexicon entries are: invertebrate seafood, ground fish, shellfish, demersal fish, mackerel, carp, crabs, deepwater shrimp, school, life, squid, hake, fisherman, cod, tuna, conservation reference size, trout, quota, sea, shrimp, mortality, freshwater, trawl, salmon, tenkara, snapper, method, license, attractor, pocket water, fee, vessel license, style fly, artisan, plastic worm, freshwater fly, saltwater, jack mackerel, trawler.

Figure 4.7 shows the accuracy (percentage of correct lexicon words) for Sb when learning 100 semantically related words. The lines represent the verb probability groupings (e.g. top 10%, 20%). The accuracy degrades when incorrect words enter the lexicon and start to contribute to learning more incorrect words. The accuracy converges to around 70% for all probability groupings. Figure 4.8 shows the accuracy for Spmi. It shows that using web statistics for hyponym-hypernym words affects the accuracy both positively and negatively, yet the accuracy still converges to around 70% after 100 words are learned.

The use of Spmi affects the top 30% verbs substantially, outperforming Sb up to a lexicon size of 90 words. However, Spmi negatively affects the accuracy for the top 50% verbs compared to Sb.

Figure 4.7: Graph that shows the accuracy when learning 100 words for the top 10% to top 50% verbs when scoring candidate words with BASILISK's AvgLog (Sb).

Scoring verbs before creating extraction patterns causes differences in accuracy up to a lexicon size of 90. Verbs that occur more often in domain related text, as discussed in Section 4.3.4, benefit the accuracy of the lexicon up to a certain size, yet have limited effect on large lexicons. We would, however, have expected the top 10% verbs to outperform the top 20%, and so on. This does not hold for either Sb or Spmi. For example, when using Sb and learning 50 words, the top 50% verbs achieved a higher accuracy (0.74) than the top 20% (0.66). Similarly, when learning 40 words and using Spmi, the top 20% verbs achieved a lower accuracy (0.68) than the top 30% verbs (0.78). The top 10% verbs for Sb and Spmi achieve the highest accuracy in nearly all stages of the bootstrapping process. Small lexicons would benefit from selecting only the top 10% verbs to create extraction patterns, achieving an accuracy of around 0.8 when learning 50 words. An overview of all accuracy scores is given in Table 4.2.


Figure 4.8: Graph that shows the accuracy when learning 100 words for the top 10% to top 50% verbs when scoring candidate words with Pointwise Mutual Information (Spmi).

Table 4.2: Lexicon accuracy when learning up to 100 words (lexicon sizes 20 to 100). Accuracy scores are given for the top 10%, ..., 50% verbs for both Sb (scoring new entries with BASILISK's AvgLog) and Spmi (scoring new entries with PMI).


4.5 Conclusion

In this paper, we presented a bootstrapping algorithm based on BASILISK and a highly related corpus that was created by mining web pages. We created the corpus using the same set of seed words that was initially used to start the bootstrapping process. We used extraction patterns to group noun phrases with similar semantic meaning, grouping them when they share the same stemmed verb. We scored the extraction patterns against a non-related text corpus and calculated accuracy scores for the top 10%, 20%, 30%, 40%, and 50% of verbs. Next to BASILISK's original scoring metric, we used a PMI score based on hyponym-hypernym collocation statistics.

We found varied results between the scored extraction patterns. Patterns created from strong verbs, i.e. verbs that occur most often in domain related text and less frequently in general (non-related) text, produced a higher accuracy lexicon when looking at the top 10% scored verbs, while the other groupings showed mixed results. The use of collocation statistics via the NEAR operator of Microsoft's Bing search engine improved accuracy for a number of verb scores but simultaneously degraded accuracy for others.

The achieved accuracy covers web text for the fisheries domain, and more research is needed concerning the generalizability to other domains and other forms of text, such as scientific literature and other technical language. Furthermore, research is needed to explain why accuracy varies between verb scores and why collocation statistics work better in some cases. Finally, research is also necessary into scoring the verbs against a non-related text corpus, to see which types or genres of non-related corpora best contrast with the domain under study.


Chapter 5

Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016

Despite increased fisheries science output and publication outlets, the global crisis in fisheries management is as present as ever. Since a narrow research focus may be a contributing factor to this failure, this study uncovers topics in fisheries research and their trends over time. This interdisciplinary research evaluates whether science is diversifying fisheries research topics in an attempt to capture the complexity of the fisheries system, or if it is multiplying research on similar topics, attempting to achieve an in-depth, but possibly marginal, understanding of a few selected components of this system. By utilizing latent Dirichlet allocation as a generative probabilistic topic model, we analyze a unique dataset consisting of 46,582 full-text articles published in the last 26 years in 21 specialized scientific fisheries journals. Among the 25 topics uncovered by the model, only one (Fisheries management) refers to the human dimension of fisheries understood as socio-ecological complex adaptive systems. The most prevalent topics in our dataset directly relating to fisheries refer to Fisheries management, Stock assessment, and Fishing gear, with Fisheries management attracting the most interest. We propose directions for future research focus that could most likely contribute to providing useful advice for successful management of fisheries.

This work was originally published as:

S. Syed, M. Borit, and M. Spruit. Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19(4):643–661, 2018a. doi: 10.1111/faf.12280


5.1 Introduction

Following a similar trend to scientific research at large, fisheries science research output has significantly increased in the last three decades (Aksnes and Browman, 2016), in parallel with an increase in the number of fisheries scientific journals (Mather et al., 2008). This rapid expansion of the field is attributed to the growing concern about the state of global fish stocks and to the major role that science has been playing in fisheries management (Jarić et al., 2012). However, despite the increased volume of fisheries science output and publication outlets, the global crisis in marine capture fisheries management is as present as ever, with unforeseen consequences ranging from fisheries-induced evolutionary changes among wild fish populations (Belgrano and Fowler, 2013) to conflicts between states over the implementation of best available science (Brooks et al., 2016). There are various hypotheses regarding causes and contributing factors for failures of fisheries management, including data uncertainty, model inadequacy, ecosystem structure, institutional efficacy, economic discord, or research focus (Smith and Link, 2005). Among these, research focus is the least explored (Smith and Link, 2005). Using hybrid content analysis of a unique dataset consisting of 46,582 fisheries science full-text articles published in the last 26 years, we uncover focus topics in fisheries research and their trends.

Fisheries are socio-ecological complex adaptive systems (SECASs) in which macroscopic properties emerge from local actions that spread to higher scales due to the collective behavior of agents (fish and humans); these properties then feed back, in a nonlinear way, influencing individuals' options and behaviors, but they typically only do so diffusely and over long timescales (Levin et al., 2013; Ostrom, 2009). A fishery can be defined as "the complex of people, their institutions, their harvest and their observations associated with and including a targeted stock or group of stocks (i.e. usually fish), and increasingly, the associated ecosystems that produce said stocks" (Link, 2010). Deconstructing the concept, the two main dimensions of a fishery are the human dimension (i.e. human agents, communities of these, and their institutions) and the natural dimension (i.e. biotic, such as predator species and prey species, and abiotic, such as water temperature and nutrients) (Charles, 2000; Lennox et al., 2017; Österblom et al., 2013). The purpose of this study is to assess whether fisheries science output reflects this conceptual diversity of fisheries as SECASs, and if so, to what extent. Is science diversifying fisheries research topics in an attempt to capture the complexity of the fisheries system, or is it multiplying research on similar topics, trying to achieve an in-depth, but possibly marginal, understanding of a few selected components of this system? Based on the critical reflection that "the majority of fisheries scientists have a biologically oriented background, they can be a bit naïve regarding other factors when it comes to the prominence of economic or social considerations" (Link, 2010), the working hypothesis of this study is that the human dimension of fisheries might be under-represented in the fisheries specialty literature.

Assessments of the development and trends in fisheries science have so far been mostly based on reviews (e.g. Johnson et al., 2013) or bibliometric evaluations (e.g. Aksnes and Browman, 2016) of scientific publications in the field. Limitations of these studies include: taking into account only a limited number of publications (e.g. Jarić et al., 2012) or a limited time period (e.g. 2000–2009; Jarić et al., 2012); having a limited scope (e.g. artisanal coral reef fisheries research (Johnson et al., 2013), fish stock assessment research (Kumaresan et al., 2014), shark by-catch research (Molina and Cooke, 2012)); and using proxies for full-text articles (e.g. titles (Jarić et al., 2012), abstracts (Aksnes and Browman, 2016)) or proxies for topics of research (e.g. one word per topic (Jarić et al., 2012; Aksnes and Browman, 2016)). Most importantly, all these previous attempts to map the fisheries science field are top-down approaches, with topics of interest manually predefined by the analysts (e.g. species, region, habitat, study object (Jarić et al., 2012)) and the analyzed data manually assigned to these topics. However, such approaches are limited due to the subjectivity inherent in human decisions, and analyses of the same research field can yield opposite results (e.g. Rose et al. (2011) vs. Hill and Lackups (2010) on the field of cetacean research).

In contrast to previous approaches, we follow a strategy that is completely novel for the field of fisheries science: a bottom-up approach (Debortoli et al., 2016) that utilizes topic modeling to uncover hidden research topics within fisheries science publications. Topic modeling algorithms are machine learning methods that automatically uncover hidden or latent thematic structures from large collections of documents. Topic models can produce a set of interpretable topics that can be viewed as groups of co-occurring words associated with a single topic or theme (DiMaggio et al., 2013). Such groups of co-occurring words (i.e. topics) are words that tend to come up together within the same linguistic context more frequently than one would expect by chance alone. These co-occurring words tend to purport similar meaning and refer to a similar subject. For example, in the context of fisheries science, an author might write a text to which she/he gave the key words 'community structure', 'subtropical areas', 'reference points', and 'weight'. This text might more frequently use the words 'parameters', 'estimation', 'stock', 'modeling', 'male', 'female', 'sex', and 'spawning'. If we wanted to use topic modeling to uncover the latent topics of this hypothetical text, based on how often these most used words appear together (i.e. co-occur), the automated topic model would group the first four words in one topic and the last four words in a different topic. These two topics would then be manually labeled by a domain expert, most likely as 'stock assessment modeling' and 'reproduction', respectively. Note that the subject of these two topics is not similar to the one that might be inferred from the key words given to this hypothetical text. Thus, these topics are latent; they are hidden in the pattern of co-occurring words. In essence, topic models are able to exploit the co-occurrence structure of texts and produce the topics as lists of words that frequently come up together, within and between documents; technically, such lists of words are probability distributions over words.

The topics emerge from the statistical properties of the documents and thus overcome the need for manual annotation of the collection of texts, though manual interpretation of the subject of a topic might still be needed, as it is still considered the gold standard in the domain of topic modeling (Lau et al., 2011). As such, we allow the documents to speak for themselves and view them through the computational lens of the topic model, rather than relying on the manifest or reported content by their authors. Document collections that are too large to explore manually can now be analyzed to study phenomena of the sort that can only be viewed through the macroscopic computational lens of the topic model (Mohr and Bogdanov, 2013). Topic modeling approaches have been helpful in elucidating the key ideas within a set of documents, such as articles published in the journal PNAS (Griffiths and Steyvers, 2004), political science texts (Grimmer and Stewart, 2013), and data-driven journalism (Rusch et al., 2013). Moreover, it is considered that this approach could provide insight into the development of a scientific field and changes in research priorities (Neff and Corley, 2009), and do so with greater speed and quantitative rigor than would otherwise be possible through traditional narrative reviews (Grimmer and Stewart, 2013). As such, topic modeling has been applied, for example, in the domains of transportation research (Sun and Yin, 2017), computer science (Hall et al., 2008; Wang et al., 2011; Wang and McCallum, 2006), fisheries modeling (Syed and Weber, 2018), conservation science (Westgate et al., 2015), and the fields of operations research and management science (Gatti et al., 2015).

After identifying the hidden topics of fisheries science, we analyze the extent to which these topics cover the complexity of the fisheries domain. Afterwards, we examine topic similarity, topic co-occurrence, topic prevalence, and topical trends over the last 26 years. We furthermore identify patterns of increasing and decreasing topic trends over specific periods of time (i.e. hot and cold topics in 1990–1995, 1995–2000, 2000–2005, 2005–2010, and 2010–2016), and describe the distribution of uncovered topics over journals.

5.2 Methods

5.2.1 Latent Dirichlet Allocation

This paper utilizes the topic model latent Dirichlet allocation (LDA) (Blei et al., 2003). LDA is a Bayesian probabilistic topic model that assumes documents exhibit multiple topics in mixing proportions, thus capturing the heterogeneity of, for example, research topics within scientific publications. In statistics, this is often referred to as a mixed-membership model (Erosheva et al., 2004). Technically, a topic is a multinomial distribution over the words in the vocabulary, where each word has a different probability within each topic; within a topic, more prominent words have a higher probability, and groups of high-probability words can be considered co-occurring clusters or constellations of words that describe a certain underlying topic or theme. A document might be 60% about the topic fisheries management and 40% about the topic stock assessment. A topic "about" a subject (e.g. fisheries management) is a probability distribution that places high probability on words that would be used to describe the subject (DiMaggio et al., 2013). Note that the underlying topics, and the extent to which each document exhibits them, are not known in advance. These details are the output of the LDA analysis and emerge automatically from the statistical properties of the documents and the assumptions behind LDA.
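The mixed-membership intuition, in which a document blends topics in proportions such as 60%/40%, can be made concrete with a small generative sketch. The vocabulary, topics, and hyperparameters below are toy values invented for illustration, not the model fitted in this study:

```python
import random

random.seed(42)

def sample_dirichlet(alphas):
    """Sample topic proportions from a Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Two toy topics, each a probability distribution over a tiny vocabulary.
topics = [
    {"stock": 0.5, "assessment": 0.3, "model": 0.2},
    {"salmon": 0.4, "river": 0.4, "spawn": 0.2},
]

def generate_document(n_words, alpha=(1.0, 1.0)):
    """Generative recipe: draw topic proportions, then a topic, then a word."""
    theta = sample_dirichlet(alpha)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(topics)), weights=theta)[0]  # topic draw
        vocab, probs = zip(*topics[z].items())
        words.append(random.choices(vocab, weights=probs)[0])     # word draw
    return theta, words

theta, doc = generate_document(10)
assert len(doc) == 10 and abs(sum(theta) - 1.0) < 1e-9
```

Fitting LDA amounts to reversing this recipe: given only the words, infer plausible topics and per-document mixtures.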

One way to think about LDA is to imagine a document in which one highlights words with colored markers. Words that relate to one topic are colored blue, words that relate to another topic are colored red, and so on. After all of the words have been colored (excluding words such as 'the' and 'a'), each set of same-colored words forms a topic, and the article blends the colors in different proportions. Different documents will have different blends of colors, and we could use the proportions of the various colors to situate a specific document in a document collection (e.g. documents addressing mainly the blue topic). Moreover, documents with the same blend of colors discuss the topics in similar proportions and are considered closely related from a topical perspective. Technically, documents with similar topic distributions are close in Kullback-Leibler divergence, a measure of the distance between two probability distributions. LDA as a statistical model captures this intuition. We refer the interested reader to Blei (2012) for a concise introduction to LDA.
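The divergence computation can be sketched in a few lines; the topic proportions below are invented for illustration. Note that Kullback-Leibler divergence is asymmetric, so in practice a symmetrized variant is often used when comparing documents:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical topic proportions over three topics for three documents.
doc_a = [0.6, 0.3, 0.1]    # mostly the "blue" topic
doc_b = [0.55, 0.35, 0.1]  # a similar blend
doc_c = [0.1, 0.2, 0.7]    # a very different blend

# Similar blends yield a small divergence; dissimilar blends a large one.
assert kl_divergence(doc_a, doc_b) < kl_divergence(doc_a, doc_c)
```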

LDA is best described by its generative process, i.e. the imaginary probabilistic recipe that generates the documents as well as the hidden structure. The hidden structure comprises the topics, modeled as distributions over words, and the topic proportions per document, where each document has some probability for each latent topic (i.e. mixing topic proportions). More formally, the generative process also assigns each word to a topic, so as to allow documents to exhibit multiple topics, analogous to the colored-words example. Given the observed documents, the aim is to infer the hidden structure, answering the question "what is the likely hidden topical structure that has generated these documents?", a process that can be seen as reverse-engineering the generative process. Technically, we want to infer the posterior distribution of the latent variables given the observed documents. An analogy to this process is the local farmers market example: one might estimate which vegetables, and in what quantities, are being sold at the local farmers market by post-hoc inspection of people's shopping baskets. Seeing more baskets refines the estimation of the products and their quantities and provides an estimate of the market's produce (Rhody, 2013). Two main types of inference techniques can be discerned: sampling-based algorithms (e.g. Newman et al., 2007; Porteous et al., 2008) and variational-based algorithms (e.g. Blei and Jordan, 2006; Teh et al., 2006; Wang et al., 2011). To simplify posterior inference, LDA uses a Dirichlet distribution as a conjugate prior for the multinomial distribution, hence the name latent Dirichlet allocation. The posterior distribution reveals the probability distributions of words for each topic, and the topic proportions per document. Note that the obtained structure is latent, and therefore, the probability distributions of words are not semantically labeled.
However, when sorted, the words with the highest probability within a topic will relate to what one would call a topic or theme (DiMaggio et al., 2013; Mohr and Bogdanov, 2013). In this context, it is important to mention that the LDA model does not give a name to the identified latent topics (i.e. the model does not label the topics). The output of the model groups the co-occurring words under numbered topics (i.e. topic 1, topic 2, topic 3, etc.). Albeit a subjective endeavor, possibly affecting the statistical objectivity of the LDA method, a human analyst can interpret the common subject of the words within each topic and consequently give a name (i.e. a post-hoc label) to this topic in order to increase the readability and interpretability of these topics (DiMaggio et al., 2013). Research into the automatic assignment of topic labels exists; however, manual annotation by a human expert is still considered the gold standard in labeling topics (Lau et al., 2011). For our study, instead of using the topic numbering provided by the LDA model (i.e. topic 1, topic 2, topic 3, etc.), and in order to increase the readability of the text and the interpretability of the results, we chose to give a specific label to each topic using the gold standard in this domain, i.e. manual annotation (see the section Labeling Topics).

5.2.2 Assumptions behind LDA

LDA is a bag-of-words model in which documents are represented as unordered sequences of words. Such an assumption neglects word order and possibly important cues to the content of a document (Steyvers and Griffiths, 2007). Although an unrealistic assumption, it is reasonable when uncovering the semantic structure of text (Blei, 2012; Blei and Lafferty, 2006). Consider a thought experiment in which the words of a document are shuffled. After finding a high number of words like spawning, eggs, and growth, one can still infer that the document deals with some aspect of reproduction. LDA further assumes document exchangeability, that is, the order in which documents are analyzed is unimportant, and all documents are analyzed jointly. Consequently, LDA is unable to explicitly capture evolving topics from documents that cover large time spans (e.g. centuries). To do that, we would need to resort to a more complicated and computationally expensive dynamic topic model (Blei and Lafferty, 2006). Such an approach is currently not feasible given the large dataset used here, but it would be interesting to explore in future work. Since all documents are analyzed jointly, the model still reflects that current literature builds on previous literature; document exchangeability is, however, a limitation for topics whose terminology has changed radically over time. For example, the field of atomic physics was described by words relating to "matter" in the late 19th century, "electron" in the middle of the 20th century, and "quantum" in the late 20th century. Likewise, the field of neuroscience evolved from being described by words relating to "nerve", to "neuron", to "ca2" over the last 100 years (Blei and Lafferty, 2006). The dynamic topic model overcomes this limitation by using a sequence of time slices in which topics are conditioned on the topics of the previous slice; the standard LDA model used in this study does not.
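The bag-of-words assumption can be verified with a toy check: shuffling a document's words destroys the text but leaves its word counts, and hence its LDA representation, untouched. The example words below echo the thought experiment above:

```python
import random
from collections import Counter

words = "spawning eggs growth spawning larvae growth eggs".split()
shuffled = words[:]
random.shuffle(shuffled)

# The bag-of-words representation (word counts) ignores word order entirely.
assert Counter(words) == Counter(shuffled)
```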

96 CHAPTER 5. TOPIC ANALYSIS OF FISHERIES SCIENCE

5.2.3 Creating the Data Set

Taking into consideration the issue of having access to the electronic version of the text, we decided to include in our analysis only journal articles, as the majority of these are now available for download from online databases. Thus, we excluded books, book chapters, and reports, which may have introduced bias in our results. Furthermore, due to computational and time constraints, we limited the number of journals included in our analysis. Thus, the total volume of fisheries publications is underestimated in our analysis, which limits the results of this study. The dataset was constructed following a set of inclusion criteria to obtain a diverse set of journals that reflects fisheries science while maintaining computational feasibility. First, we included all journals with the term "fisheries" or "fishery" in their title that are listed by the Fisheries Science Citation Index Expanded (SCIE) 2016 provided by Thomson Reuters and have an impact factor of ≥ 1.0. Second, we included all journals from the Fisheries SCIE 2016 that do not include these words in their titles, but explicitly address fisheries in their aims and scopes, and have an impact factor of ≥ 1.0. Third, we included the top four journals with the highest 2016 impact factor with the term "marine" in their title, indexed by any list from the SCIE or the Social Science Citation Index (SSCI), and explicitly addressing fisheries in their aims and scopes. All journals were subject to the subscription rights of the University of Tromsø – The Arctic University of Norway. A total of 21 journals satisfied these criteria (Table 5.1). Although journals that do not match these criteria also publish fisheries research, such journals were not considered specialized fisheries research outlets (e.g. the journal Ecology and Society).

Table 5.1: An overview of the dataset used when creating the LDA model to uncover latent topics from fisheries publications. The dataset consists of 46,582 full-text publications from 21 top-tier ISI fisheries journals: Fish and Fisheries; Reviews in Fish Biology and Fisheries; Fisheries; Aquaculture Environment Interactions; ICES Journal of Marine Science; Reviews in Fisheries Science & Aquaculture; Canadian Journal of Fisheries and Aquatic Sciences; Fisheries Research; Ecology of Freshwater Fish; Marine Resource Economics; Fisheries Oceanography; Journal of Fish Biology; Transactions of the American Fisheries Society; CCAMLR Science; Fisheries Management and Ecology; Knowledge and Management of Aquatic Ecosystems; North American Journal of Fisheries Management; Marine and Coastal Fisheries; Aquatic Conservation: Marine and Freshwater Ecosystems; Marine Ecology Progress Series; and Marine Policy. Rank is the 2016 Fisheries ISI Journal Citation Reports (JCR) rank provided by Thomson Reuters; journals without a rank are not covered by the Fisheries JCR index but cover fisheries aspects within their aims and scopes. IF is the impact factor, Ymin the lowest publication year, Ymax the highest publication year, N the number of documents (publications) deemed fit for further analysis, W the mean number of words, Std. W the estimated standard deviation of the number of words, and V the mean vocabulary size. Word and vocabulary statistics are obtained after the data cleaning process. [The per-journal table values could not be recovered from the source conversion and are not reproduced here.]


Moreover, even though some of the most influential and highly cited fisheries papers are published in high-impact journals such as Nature and Science, they contribute only marginally to the total number of papers published in fisheries-related journals. Including all publications from Nature and Science in our analysis would result in a high number of fisheries-irrelevant topics (e.g. astrophysics), as these journals typically publish a broad range of topics. Using keyword searches to obtain only fisheries-related publications would be a top-down approach and would hence be biased by (i) the search terms used, and (ii) the way publications are indexed and, subsequently, retrieved.

We downloaded full-text research articles published in the 21 journals covering fisheries aspects over a time span of 26 years (1990–2016) to allow for enough variation in publication trends. Analyzing full-text articles, compared to abstracts alone, results in more detailed and higher-quality topics (Syed and Spruit, 2017). Only research articles were considered; other types of publications, such as errata, conference reports, forewords, announcements, dedications, letters, comments, and book reviews, were excluded. A total of 46,582 articles were deemed fit for further analysis. The year of publication was chosen to be the issue year in which the article appeared in print, regardless of the year of acceptance or (first) online publication. Information about the journal name, the time range for which articles were collected, the journal's impact factor, the total number of articles deemed fit for further analysis, and word statistics is given in Table 5.1. Additionally, an overview of the number of publications per journal per year is shown in Fig. 5.1. Not all journals provided articles for the complete time span of 26 years. For example, the journal Fish and Fisheries started in 2000 and, therefore, only articles from 2000 to 2016 were included in the study. Another example relates to journal subscription rights, which did not allow data collection for all years. For example, the Canadian Journal of Fisheries and Aquatic Sciences started in 1901, but our subscription only allowed access from 1996.


Figure 5.1: The number of publications (y-axis) per journal (color-coded) for the years 1990–2016 (x-axis) that were used to create the LDA model. The total number of documents was 46,582. [Figure not reproduced; the color legend lists the 21 journals of Table 5.1.]


All articles appeared in portable document format (PDF) and were first converted to their plain-text representation. This transformation covered all elements of each article, including the header, title, author information, affiliation information, abstract, keywords, content, tables, bibliography, and captions. Several articles, mainly from the early 1990s, were image-based PDFs unsuitable for direct conversion from PDF to plain text. In these cases, the Tesseract optical character recognition (OCR) software library was used to convert these articles into text-based PDFs and then to their plain-text representation. To make sure we only analyzed the content text of each article, we used regular expressions, an advanced text pattern search method, to remove boilerplate content such as journal information, article metadata, acknowledgments, and bibliographies. Additionally, multi-language abstracts and non-English articles (e.g. articles that appeared in French) were also removed.
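As a simplified illustration of the regular-expression cleaning step (the actual patterns used for the corpus were tailored to each journal's layout and are not reproduced here), the sketch below strips everything from a "References" heading onward:

```python
import re

def strip_references(text):
    """Remove the bibliography by cutting at a 'References' heading (toy pattern)."""
    return re.split(r"\n\s*References\s*\n", text, maxsplit=1)[0]

article = "Introduction\nFish stocks decline.\n\nReferences\nSmith, J. (1999). ..."
assert "Smith" not in strip_references(article)
```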

LDA is a bag-of-words model in which documents are represented as sequences of individual word features. As such, every document was tokenized. Tokenization is the process of obtaining individual words (also known as unigrams) from sentences. Unigrams lose important semantic cues that are encoded by compound words. To overcome this, bi-grams were included by combining two consecutive unigrams that occurred ≥ 20 times within each document. As a result, compound words, such as "rainbow trout", are preserved. Additionally, we used named entity recognition (NER), a technique from natural language processing (NLP), to retrieve entities related to names, nationalities, companies, locations, objects, etc. from the documents. Entities such as "the European Union", "the Norwegian Research Council", and "marine protected areas" are thus preserved and included in the analysis. The inclusion of bi-grams and entities allows for a richer bag-of-words representation than a standard unigram representation.
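A minimal version of the bi-gram step can be sketched by counting consecutive word pairs. The study applied a threshold of ≥ 20 occurrences per document; a lower threshold is used here on a toy text, and in practice a collocation model (e.g. Gensim's phrase detection) would be used instead of this pure-Python sketch:

```python
from collections import Counter

def add_bigrams(tokens, min_count=2):
    """Append frequent consecutive word pairs (bi-grams) to the token list."""
    pairs = Counter(zip(tokens, tokens[1:]))
    bigrams = [f"{a}_{b}" for (a, b), n in pairs.items() if n >= min_count]
    return tokens + bigrams

doc = "rainbow trout grow fast and rainbow trout spawn".split()
enriched = add_bigrams(doc)
assert "rainbow_trout" in enriched  # the compound word is preserved
```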

Although all tokens within a document serve an important function, for topic modeling they are not all equally important. We proceeded by filtering out numbers, punctuation marks, and single-character words, as they bear no topical meaning. Furthermore, we removed stop words (e.g. the, is, a, which), words that occurred only once (mainly typos and incorrectly hyphenated words), and words that occurred in ≥ 90% of the documents (e.g. result, study, show), as they have no discriminative topical significance. Omitting frequently occurring words prevents such words from dominating all topics.
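These filtering rules can be sketched as a single pass over the corpus. The thresholds below mirror the ones in the text; the stop-word list is a toy subset and the documents are invented:

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "which", "of"}

def filter_vocabulary(docs, max_doc_frac=0.9):
    """Drop stop words, single-character tokens, numbers, words occurring only
    once, and words occurring in >= max_doc_frac of the documents."""
    term_freq = Counter(w for d in docs for w in d)
    doc_freq = Counter(w for d in docs for w in set(d))
    n_docs = len(docs)

    def keep(w):
        return (w not in STOP_WORDS and len(w) > 1 and not w.isdigit()
                and term_freq[w] > 1
                and doc_freq[w] / n_docs < max_doc_frac)

    return [[w for w in d if keep(w)] for d in docs]

docs = [["the", "cod", "stock", "stock"],
        ["a", "cod", "quota", "quota"],
        ["cod", "is", "typo1x"]]
cleaned = filter_vocabulary(docs)
assert all("the" not in d for d in cleaned)     # stop word removed
assert all("typo1x" not in d for d in cleaned)  # single occurrence removed
assert all("cod" not in d for d in cleaned)     # occurs in >= 90% of documents
```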

For grammatical reasons, different word forms or derivationally related words can have a similar meaning and, ideally, we would want such terms to be grouped. Stemming and lemmatization are two NLP techniques that reduce inflectional and derivational forms of words to a common base form. Stemming heuristically cuts off derivational affixes to achieve some normalization, albeit crude in most cases. Stemming loses the ability to relate stemmed words back to their original part-of-speech, such as verbs or nouns, and decreases the interpretability of topics in later stages (Evangelopoulos et al., 2012). For example, the term "fishing" will be stemmed to "fish"; likewise, "modeling" will be stemmed to "model", and cannot be returned to its original part-of-speech (i.e. verb). Our analysis uses lemmatization, a more sophisticated normalization method that uses a vocabulary and morphological analysis to reduce words to their base form, called the lemma. It is best described by its most basic example, normalizing the verbs "am", "are", and "is" to "be", although such terms are filtered out of our analysis. Likewise, lemmatization correctly normalizes "fisheries" and "fishery", and "policies" and "policy". Additionally, uppercase and lowercase words were grouped. The final corpus consisted of 46,582 full-text publications with around 130 million words and 170,000 unique words.
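The contrast between the two normalization strategies can be illustrated with a toy example. The suffix rule and the tiny lemma lookup table below are invented for illustration; the study used a full morphological lemmatizer rather than this sketch:

```python
def crude_stem(word):
    """Heuristic stemming: blindly cut common suffixes (loses part-of-speech)."""
    for suffix in ("ing", "ies", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A tiny lemma lookup standing in for a vocabulary plus morphological analysis.
LEMMAS = {"fisheries": "fishery", "policies": "policy", "is": "be", "are": "be"}

def lemmatize(word):
    return LEMMAS.get(word, word)

assert crude_stem("fishing") == "fish"      # the verb collapses onto the noun
assert lemmatize("fisheries") == "fishery"  # the correct base form is preserved
assert lemmatize("policies") == "policy"
```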

5.2.4 Creating the LDA Model

LDA assumes that the number of topics to uncover is known in advance and is set by the K-parameter. As the optimal number of topics is not known in advance, we created 50 different LDA models by varying the K-parameter from 1 to 50. Measures to determine the optimal LDA model are described in the next section. The LDA models were created using the Python library Gensim (Rehurek and Sojka, 2010). Since LDA is a Bayesian probabilistic model, we can incorporate prior knowledge into the model. Prior knowledge can be encoded by symmetrical or asymmetrical Dirichlet priors. A symmetrical prior distribution of topics within documents assumes that all topics have an equal probability of being assigned to a document. Such an assumption ignores that certain topics are more prominent in a document collection and, consequently, would logically have a higher probability of being assigned to a document. Conversely, specific topics are less common and, thus, not appropriately reflected by a symmetrical prior distribution. Logically speaking, an asymmetrical prior captures this intuition and is, therefore, the preferred choice (Syed and Spruit, 2018a; Wallach et al., 2009). Additionally, we iteratively optimize the prior by learning it from the data using the Newton-Raphson method (Huang, 2005). To infer the hidden variables (i.e. to infer the posterior distribution of the hidden variables given the observed documents), we use a variational inference technique called online LDA (Hoffman et al., 2010).
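The effect of the prior choice follows directly from the Dirichlet's expectation: with concentration parameters α, the expected proportion of topic i is αᵢ / Σⱼ αⱼ. A symmetrical prior therefore expects every topic to be equally prevalent, while an asymmetrical one does not. The α values below are illustrative, not those learned in the study:

```python
def expected_topic_proportions(alphas):
    """Expected value of a Dirichlet-distributed topic mixture: alpha_i / sum(alpha)."""
    total = sum(alphas)
    return [a / total for a in alphas]

symmetric = expected_topic_proportions([1, 1, 1, 1])
asymmetric = expected_topic_proportions([4, 2, 1, 1])

assert symmetric == [0.25, 0.25, 0.25, 0.25]  # every topic equally likely a priori
assert asymmetric[0] == 0.5                   # prominent topics favored a priori
```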

5.2.5 Calculating Model Quality

Analogous to choosing the right number of clusters in techniques such as k-means clustering, choosing the right number of topics is an important task in topic modeling, including LDA, to appropriately capture the underlying topics in a dataset. A low number of topics will result in a few overly broad topics, while a high number will yield meaningless topics; such topics are merely the result of the statistical nature of LDA. Several approaches to determine the optimal number of topics have been proposed. One such approach is to fit various topic models to a training set of documents and calculate the model fit on a test set (held-out data) (Scott and Baldridge, 2013). The model that fits the test set best would be considered a better model. However, topic models are used by humans to interpret and explore the documents, and there is no technical reason that the best-fitting model would aid best in this task (Blei, 2012). In fact, research has shown that such measures correlate negatively with human interpretation (Chang et al., 2009).

Another approach is to assess the quality of topics with human topic ranking, which is considered the gold standard when assessing the interpretability of generated topics. Such ranking is often based on word or topic intrusion tests, in which an intruder word or topic needs to be recognized within a set of related or cohesive words or topics (Chang et al., 2009). However, this approach is time-consuming and expensive, as for every created topic model (e.g. 1 to 50), and for every topic within that model, the interpretability of individual words and sets of words needs to be assessed. To circumvent this, a more quantitative approach that maintains human interpretability is preferred. One way is to assess the quality of topics with coherence measures based on the distributional hypothesis (Harris, 1954), which states that words with similar meanings tend to co-occur within similar contexts. Such an approach, drawing on the philosophical premise that a set of statements or facts is coherent if its statements or facts support each other, informs us about the understandability and interpretability of topics from a human perspective. This study utilized the CV coherence measure (Röder et al., 2015), which has shown the highest correlation with all available human topic ranking data and is thus an appropriate quantitative approach.

The CV coherence score for all 50 LDA models was calculated, and an elbow method, estimating the (inflection) point where adding more topics will not increase coherence, was used to obtain the optimal number of topics.
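One simple way to operationalize the elbow criterion is to pick the number of topics after which the marginal gain in coherence drops below a threshold. The coherence values below are synthetic, chosen only to illustrate the shape of such a curve, not the scores measured in the study:

```python
def elbow(ks, scores, min_gain=0.01):
    """Return the smallest K after which coherence gains fall below min_gain."""
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < min_gain:
            return ks[i - 1]
    return ks[-1]

# Synthetic coherence curve: steep gains that flatten out around K = 25.
ks = [5, 10, 15, 20, 25, 30, 35]
coherence = [0.30, 0.40, 0.47, 0.52, 0.55, 0.553, 0.554]
assert elbow(ks, coherence) == 25
```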

5.2.6 Labeling Topics

As previously described, the topical structure that permeates the document collection is latent, and the probability distributions of words (i.e. topics) are not semantically labeled (i.e. they are not given a name by the LDA model). When sorted, the top 10 or top 15 high-probability words within each topic are used to describe the topic. However, algorithmic analyses of content remain limited in their capacity to understand latent meanings or the subtleties of human language (Lewis et al., 2013), and manual labeling is still considered the gold standard in topic modeling (Lau et al., 2011). Thus, the labeling of each topic (i.e. giving a name to each topic; e.g. a topic with the five most probable words 'pig', 'cow', 'sheep', 'goat', 'horse' would be labeled 'domestic animals') was performed by a human analyst, i.e. a fisheries domain expert. When identifying the common subject of each topic (i.e. the name or label of the topic), the analyst used the following procedure. First, the analyst closely inspected the 15 most probable words from each topic. Second, the analyst inspected the titles of the documents in the dataset that were assigned by the topic model to that respective topic. The interested reader can find a sample of publication titles that have high probability within a single topic in Table 5.4 in the Appendix. Third, based on the previous two steps, the analyst labeled the topic (i.e. gave it a name; e.g. if the LDA model included in topic 1 the words 'pig', 'cow', and 'farm', and the titles of the documents assigned to this topic have the subject of domestic animals in common, then the analyst gave topic 1 the label 'domestic animals'). Furthermore, to validate the labeling of the topics, we visualized the topics in a two-dimensional area by computing the distance between topics (Chuang et al., 2012) and applying multidimensional scaling (Sievert and Shirley, 2014). This two-dimensional topic representation displays the similarity between topics with respect to their word distributions, i.e. the words and their corresponding probabilities within each topic. Clustering and overlapping nodes indicate similar word distributions, and the surface of a node indicates the relative topic prevalence in the complete set of documents. The topic prevalence indicates how widespread a topic is within all the documents, as all topic proportions add up to 100%. In a fourth step, the analyst used this visualization to validate the choice of the final label for each topic (e.g. topics using a similar vocabulary usually refer to similar subjects; thus, for example, if the topics labeled 'domestic animals' and 'astrophysics' appeared close together in the two-dimensional topic representation, this would raise suspicions, and the analyst would have to go through the labeling procedure again in order to find labels that make sense for the two vocabulary-close topics). The labels were further validated in a fifth step, as described in the section Validation of Results.

5.2.7 Calculating Topical Trends over Time

To gain insight into the topical temporal dynamics of the fisheries field, we aggregated the document-topic proportions for each year and for every individual topic into a composite topic-year proportion. Doing so provides a sense of how the prevalence of each topic within fisheries science publications has changed over the last 26 years. Additionally, to obtain insight into increasing and decreasing topical trends, we fit a one-dimensional least-squares polynomial for different time intervals. The polynomial coefficient is used as a proxy for the trend and defines the slope of the composite topic-year proportions over a range of years. Coefficients are multiplied by the number of years within each time interval to obtain the change measured in percentage points. Positive values indicate increasing, or hot, topics, and negative values decreasing, or cold, topics. The time intervals allow for historical comparisons between 1990–1995, 1995–2000, 2000–2005, 2005–2010, and 2010–2016. Color coding is used to represent the hot (i.e. red) and cold (i.e. blue) topical trends.
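The trend computation can be sketched without any libraries: fit a degree-one least-squares polynomial to the composite topic-year proportions and scale the slope by the interval length. The topic proportions below are invented for illustration:

```python
def slope(years, proportions):
    """Least-squares slope of topic proportion (in %) against year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(proportions) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, proportions))
    den = sum((x - mean_x) ** 2 for x in years)
    return num / den

def trend(years, proportions):
    """Change in percentage points over the interval: slope x number of years."""
    return slope(years, proportions) * (years[-1] - years[0])

# A hypothetical "hot" topic rising from 3% to 5% over 1990-1995.
years = [1990, 1991, 1992, 1993, 1994, 1995]
props = [3.0, 3.4, 3.8, 4.2, 4.6, 5.0]
assert abs(trend(years, props) - 2.0) < 1e-9  # +2 percentage points: a hot topic
```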

5.2.8 Calculating Topics over Journals

Following a similar approach as for topical trends over time, we aggregated topic proportions per journal to gain insight into how topics are covered by the journals included in this study. Doing so enables us to identify broadly oriented or more narrowly focused journals. Note that the aggregation of topic proportions is handled per journal and covers only the years for which articles were downloaded (see Table 5.1). For example, for the journal Fish and Fisheries, the journal topic distributions cover the time range 2000–2016, whereas the journal ICES Journal of Marine Science covers the complete time range of 1990–2016.
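The per-journal aggregation is a grouped average of document-topic proportions; the journal names and proportions below are illustrative only:

```python
from collections import defaultdict

def topic_proportions_per_journal(documents):
    """Average the document-topic proportions of all documents in each journal."""
    groups = defaultdict(list)
    for journal, props in documents:
        groups[journal].append(props)
    return {
        journal: [sum(col) / len(rows) for col in zip(*rows)]
        for journal, rows in groups.items()
    }

docs = [
    ("Journal A", [0.75, 0.25]),  # (journal, per-document topic proportions)
    ("Journal A", [0.25, 0.75]),
    ("Journal B", [0.125, 0.875]),
]
per_journal = topic_proportions_per_journal(docs)
assert per_journal["Journal A"] == [0.5, 0.5]    # broadly oriented journal
assert per_journal["Journal B"] == [0.125, 0.875]  # narrowly focused journal
```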

5.2.9 Relaxing LDA Assumptions and Future Research Directions

At the time of writing, the original LDA method proposed by Blei et al. (2003) has over 20,000 citations. The technique has received much attention from machine learning researchers and other scholars and has been adopted and extended in a variety of ways. More concretely, relaxing the assumptions behind LDA can result in richer representations of the underlying semantic structures. The bag-of-words assumption has been relaxed by conditioning words on the previous words (i.e. a Markovian structure) (Wallach, 2006b); the document exchangeability assumption has been relaxed by the previously mentioned dynamic topic model (Blei and Lafferty, 2006); and Bayesian non-parametric models can be utilized to automatically uncover the number of topics (Whye Teh et al., 2004). Furthermore, LDA has been extended in various ways. Topics might correlate: a topic about 'cars' is more likely to co-occur with a topic about 'emission' than with one about 'diseases'. The Dirichlet distribution implicitly assumes independence between topics, and a more flexible distribution, such as the logistic normal, is more appropriate for capturing covariance between topics. The correlated topic model aids in this task (Blei and Lafferty, 2007). Other examples extending LDA include the author-topic model (Rosen-Zvi et al., 2004), the relational topic model (Chang and Blei, 2010), the spherical topic model (Reisinger et al., 2010), the sparse topic model (Wang and Blei, 2009), and the bursty topic model (Doyle and Elkan, 2009). Apart from its applicability to text, LDA can be applied to audio (Kim et al., 2009), video (Mehran et al., 2009), and image (Fergus et al., 2005) data. Topic models that relax or extend the original LDA model bring additional computational complexity and their own sets of limitations and challenges; nevertheless, it would be interesting to explore these models in future research.

5.3 Results and Discussion

5.3.1 Uncovering Fisheries Topics

The LDA model with the optimal coherence score contains 25 topics (K = 25). The ten most probable words (i.e. the words with the highest probabilities), together with the semantically attached label for each uncovered latent topic, are shown in Table 5.2. The manually assigned labels for the 25 topics are: (1) Conservation, (2) Morphology,


(3) Salmon, (4) Reproduction, (5) Non-fish species, (6) Coral reefs, (7) Biochemistry, (8) Freshwater, (9) Diet, (10) North Atlantic, (11) Southern hemisphere, (12) Development, (13) Genetics, (14) Assemblages, (15) Growth experiments, (16) Stock assessment, (17) Growth, (18) Tracking and movement, (19) Fishing gear, (20) Primary production, (21) Models, (22) Salmonids, (23) Acoustics and swimming, (24) Estuaries, and (25) Fisheries management. These 25 topics can be grouped into overarching themes: aquatic organism biology (n = 11), specific aquatic organisms (n = 4), aquatic habitats (n = 3), geographical areas (n = 2), modeling (n = 2), management (n = 2), and fishing technology (n = 1). A visual representation of the topics, their proportions within the complete corpus, and their grouping into overarching themes can be found in Fig. 5.2.

Table 5.2: The 25 uncovered topics from 46,582 fisheries science articles published in 21 fisheries-specialized journals in the period 1990–2016. Each topic displays the ten most probable words (i.e. the words with the highest probability) and is manually labeled with a logical topic description that best captures the semantics of the top words, with the topic ID in parentheses. [Table body omitted.]


Conditioning the topics on the word 'fishery' or 'fisheries' (i.e. taking into consideration the probability assigned to this word), these 25 topics can be divided into four groups, the first two of which we considered to be directly related to fisheries: topics using the word often (n = 3), moderately (n = 5), infrequently (n = 8), or almost not at all (n = 9) (Fig. 5.3). The topics using this word often are, in descending order: (25) Fisheries management, (19) Fishing gear, and (16) Stock assessment. Almost one-fifth of the topics use the word 'fishery' or 'fisheries' moderately: (1) Conservation, (3) Salmon, (4) Reproduction, (8) Freshwater, and (24) Estuaries. One-third of the topics use the word infrequently: (2) Morphology, (7) Biochemistry, (10) North Atlantic, (13) Genetics, (15) Growth experiments, (17) Growth, (18) Tracking and movement, and (22) Salmonids. Another one-third of the topics use this word almost not at all: (5) Non-fish species, (6) Corals, (9) Diet, (11) Southern hemisphere, (12) Development, (14) Assemblages, (20) Primary production, (21) Models, and (23) Acoustics and swimming.
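Conditioning topics on a word amounts to reading off the probability mass each topic assigns to that word (or set of word forms) in LDA's topic-word matrix. A minimal sketch, assuming `phi` is the k-by-vocabulary probability matrix produced by some LDA implementation:

```python
import numpy as np

def word_weight_per_topic(phi, vocab, words):
    """Per-topic probability mass of the given word forms
    (e.g. 'fishery' and 'fisheries'); phi has shape (k, len(vocab))
    and each row sums to 1."""
    idx = [vocab.index(w) for w in words if w in vocab]
    return phi[:, idx].sum(axis=1)
```

Topics can then be ranked by this weight to separate those using the word often, moderately, infrequently, or almost not at all.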

Considering all the 25 topics, only one (i.e. (25) Fisheries management) refers explicitly to the human dimension component of the fishery system, something that confirms our working hypothesis that the human dimension of fisheries is under-represented in the fisheries specialty literature. To evaluate whether the human dimension of fisheries as SECASs is further refined within the Fisheries management topic, following the same methodology as described above, we created a new LDA model that zooms in on this topic, thereby creating subtopics from the broader Fisheries management topic. The new model uncovered 12 subtopics from the topic Fisheries management (Table 5.3), out of which eight assign higher probability to the term 'fishery' or 'fisheries' (i.e. use this word often or moderately) and, thus, were considered directly related to fisheries: three on various management approaches (i.e. Co-management, Precautionary approach, Quota systems); three on economics (i.e. Markets, Bioeconomics, Blue economy); and two on type of fishery (i.e. Small scale fisheries, Recreational fisheries).
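The zoom-in step can be sketched as filtering the corpus to the documents dominated by the topic of interest and refitting LDA on that subset; `fit_lda` is a hypothetical stand-in for an arbitrary LDA trainer, and `theta` holds the per-document topic proportions:

```python
def zoom_in(documents, theta, target_topic, fit_lda, k_sub):
    """Refit LDA on the documents whose dominant topic (the argmax of
    their topic proportions) is target_topic, uncovering subtopics."""
    subset = [doc for doc, props in zip(documents, theta)
              if max(range(len(props)), key=props.__getitem__) == target_topic]
    return fit_lda(subset, k_sub), len(subset)
```

Applied to the Fisheries management documents with k_sub = 12, this mirrors the shape of the analysis behind Table 5.3.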

Out of the total of 25 topics uncovered by our analysis, three were considered generic (i.e. (10) North Atlantic, (11) Southern hemisphere, (21) Models). Out of the remaining 22 topics, 20 cover the natural dimension of fisheries, reasonably mirroring the curriculum of fisheries biology and fisheries ecology higher education courses (e.g. Jennings et al., 2009; King, 2007), but not addressing topics such as climate change. However, considering the focus of the two remaining topics, i.e. (1) Conservation and (25) Fisheries management, it is apparent that the research focus in fisheries during the last 26 years has not entirely captured the complexity of the fisheries domain, especially of the human dimension component, something also observed for specialized research areas such as, for example, by-catch reduction technology (Campbell and Cornwell, 2008; Molina and Cooke, 2012). These results seem to be confirmed by the bibliometric analyses published in Aksnes and Browman (2016) and Jarić et al. (2012), where no human dimension related words were identified among the most frequent words used in fisheries publication titles and abstracts. This situation might not be surprising given the institutional context in which fisheries research is performed. For example, within the International Council for the Exploration of the Sea (ICES), which is one of


Figure 5.2: Inter-topic distance map that shows a two-dimensional representation (via multidimensional scaling) of the 25 uncovered fisheries topics. The distance between the nodes represents topic similarity with respect to the distributions over words (i.e. LDA's output). The surface of the nodes indicates topic prevalence within the corpus, with bigger nodes representing topics that are more prominent within the document collection (all nodes add up to 100%).
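A map like the one in Fig. 5.2 can be produced by computing pairwise distances between the topic-word distributions and projecting them to two dimensions. The sketch below uses Jensen-Shannon distance and classical (Torgerson) multidimensional scaling; the exact distance measure and scaling method behind a given visualization tool may differ.

```python
import numpy as np

def js_distance_matrix(phi):
    """Pairwise Jensen-Shannon distances between the rows of phi,
    each row being one topic's probability distribution over words."""
    def kl(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    k = phi.shape[0]
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            m = 0.5 * (phi[i] + phi[j])  # mixture distribution
            d[i, j] = d[j, i] = np.sqrt(0.5 * kl(phi[i], m) + 0.5 * kl(phi[j], m))
    return d

def classical_mds(d, dims=2):
    """Classical (Torgerson) MDS: embed a distance matrix in `dims`
    coordinates via the eigendecomposition of the double-centred
    squared-distance matrix."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:dims]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```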


Figure 5.3: Inter-topic distance map showing topics conditioned on the word 'fishery' (including 'fisheries'). The figure is similar to Fig. 5.2 but expresses the differences in probability assigned to the word 'fishery'. Bigger nodes assign higher probability to the word 'fishery' and can be considered more directly related to fisheries science.


Table 5.3: Table showing the 12 uncovered sub-topics from the documents (n = 3,390) dealing with the topic Fisheries management. The sub-topics provide a “zoomed-in” view of the topical decomposition from the subset of documents on fisheries manage- ment. Each topic displays the ten most probable words and the semantic label that best describes the underlying latent topic.

Topic | Label | Top-10 words | Theme
1 | Spatial Planning | Marine, Policy, Stakeholder, Process, Coastal, Development, Sea, Plan, Environmental, Regional | Non-Fisheries
2 | Markets | Price, Market, Fish, Product, Production, Model, Value, Seafood, Estimate, Sector | Economics
3 | Bioeconomics | Cost, Model, Stock, Fishery, Effort, Value, Scenario, Harvest, Fish, Rate | Economics
4 | Conservation and MPA | Marine, Mpa, Conservation, Protect, Ocean, Ecosystem, Mpas, Protection, Sea, Habitat | Non-Fisheries
5 | Small-Scale Fisheries | Fishing, Fisher, Fish, Study, Catch, Fishery, Boat, Fisherman, Local, Community | Type of Fishery
6 | Blue Economy | Marine, Fish, Aquaculture, Fishery, Shark, Water, Coastal, Development, Production, Specie | Economics
7 | Pollution | Ship, Vessel, Oil, Port, Shipping, Pollution, Risk, International, Trade, Country | Non-Fisheries
8 | Legislation | Sea, Law, International, Convention, Agreement, Country, China, Water, Coastal, Maritime | Non-Fisheries
9 | Co-Management | Fishery, Community, Social, Fishing, System, Right, Group, Fisher, Government, Local | Management Approaches
10 | Quota Systems | Fishery, Vessel, Fishing, Catch, Quota, Fish, Fleet, System, Total, Stock | Management Approaches
11 | Precautionary Approach | Fishery, Stock, Specie, Catch, Fishing, Datum, Assessment, Fish, Whale, Ecosystem | Management Approaches
12 | Recreational Fisheries | Angler, Recreational, Fish, Fishing, Survey, Respondent, Catch, Fishery, Estimate, Value | Type of Fishery

the most important fisheries-related intergovernmental organizations, despite having had over the years various groups working more or less directly with different aspects of this human dimension, now only one out of more than 45 expert groups works explicitly with this dimension of fisheries. This group, the Strategic Initiative Human Dimension (SIHD), became operational in 2015.

Persisting in having this heavily unbalanced focus between the two dimensions of fisheries systems (i.e. the human dimension (i.e. human agents, communities of these, and their institutions) and the natural dimension (i.e. biotic, such as predator species and prey species, and abiotic, such as water temperature and nutrients)) will not help in understanding the behavior of fisheries stakeholders (from fishers to consumers), leading to unintended, and too often undesirable, management outcomes (Fulton et al., 2011), and thus unsustainable fisheries. Responding to the challenges posed by sustainable fisheries necessitates the development of stronger networks within the family of human dimension sciences and across disciplinary boundaries with the natural dimension sciences (Symes and Hoefnagel, 2010a). Without providing an exhaustive list and in random order, the human dimension in fisheries could be included in fisheries science by addressing topics such as: institutional aspects (enforcement and compliance, policy interactions, etc.), social aspects (gender, religion/beliefs, welfare, social cohesion, social networks, education and learning, human agency, health, safety and security at sea, food security, perception, attitudes, social norms, compliance, mental models of various actors involved in fisheries, etc.), economic aspects (poverty, innovation, distribution of benefits, spiritual, inspirational, and aesthetic services of fisheries, etc.), political aspects (power structures, transparency, etc.), and cultural aspects (traditional/local ecological knowledge, history, cultural dimensions, culinary choices, heritage, blue humanities, fisheries literacy, etc.) (Charles, 2000; De Young et al., 2008; ICES, 2016; Österblom et al., 2013; Sowman, 2011; Spalding et al., 2017; Stone-Jovicich, 2015).

Continuing our analysis, three topic clusters can be identified in Fig. 5.2, indicating a similar probability distribution over words (i.e. topics that are, to some extent, related in the words they use to describe the theme): a growth cluster (the topics Growth experiments (15), Diet (9), Non-fish species (5), Primary production (20), Development (12), Reproduction (4), and Growth (17)); an institutions cluster (the topics Fisheries management (25) and Conservation (1)); and a salmonids cluster (the topics Freshwater (8), Tracking and movement (18), and Salmonids (22); one would expect to find here also the topic Salmon (3), but, interestingly from a linguistics point of view, this topic seems rather isolated). The most isolated topics are Morphology (2) and Biochemistry (7), most probably indicating the use of very specific distributions over words.

The most frequent aquatic organisms mentioned in our corpus are salmonids (e.g. salmon, trout) and other freshwater organisms (e.g. perch); shark (within the topic Reproduction); crab, mussel, and oyster (within the topic Non-fish species); cod, lamprey, and herring (within the topic North Atlantic); whale (within the topic Southern hemisphere); sturgeon (within the topic Growth); tuna (within the topic Fishing gear); and shrimp (within the topic Estuaries). Commercially important species, such as anchoveta, pollock, and tilapia, were not included among the most frequent words of any of the 25 topics. These findings are relatively consistent with Aksnes and Browman (2016) and Jarić et al. (2012), who reported that the most frequently studied group of species was the Salmonidae, followed by the Atlantic cod, and that there is no correlation between the production of various species and the number of publications about these species.

Aquaculture has not been identified as a topic of its own in our dataset, and the word aquaculture was not included in the top 10 most frequent words of any of the 25 topics. However, the word aquaculture was included in the top 10 most frequent words for the sub-topic Blue economy under the topic Fisheries management, possibly indicating the interest in this relatively new industry in a context that focuses on increasing economic activities in the marine and maritime sector.

With regard to the typology of fishers by Charles (2000), only the recreational type is specifically mentioned among the most frequent words in our corpus, with the word angler included in the topic (8) Freshwater. This might be because the research focusing on the other types (e.g. subsistence, artisanal) does not employ a very specific vocabulary, or because there might be a lag in research on, for example, small-scale and artisanal fishery (Purcell and Pomeroy, 2015). However, if we look only at topic (25) Fisheries management, the recreational type and the small-scale type each have their own subtopic, indicating that, from a management perspective, these two types of fisheries have been relatively extensively explored by fisheries scientists.

Out of the 25 topics uncovered by our LDA model, two refer to large geographical areas: the North Atlantic (10) and the Southern hemisphere (11). The words Norwegian (within the topic North Atlantic) and Florida (within the topic Estuaries (24)) are the only specific geographic references among the top 10 most probable words. These very few specific geographic references might indicate that most of the fisheries research is focused on a few areas around the globe, leaving large zones underexplored, as also indicated in Molina and Cooke (2012), or that research about other regions is published in languages other than English.

5.3.2 Topic Proportions within Documents

For every document, the LDA model infers the topical decomposition, indicating which topics are found in that document and in what proportions. The assumptions behind LDA cause documents to exhibit mainly a small number of main topics, with the other topics very close to zero (note that all topic proportions per document sum up to 1). This structure assumes that documents are often about some topics, rather than being about all topics equally. Following this line of reasoning, we analyzed the remaining topic proportions for documents exhibiting one of the topics as the dominant topic, this being defined as the document's topic proportion that exceeds all other topic proportions. Such an analysis provides insight into the document composition that comes from the mixing topic proportions, that is, which topics co-occur within fisheries research articles.
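The dominant-topic analysis described above can be sketched as follows, assuming `theta` is the document-topic proportion matrix inferred by LDA (each row sums to 1):

```python
import numpy as np

def remaining_topic_profile(theta):
    """For each topic t, average the topic proportions over the documents
    in which t is the dominant topic (the argmax of the row). Row t of the
    result gives t's average dominant proportion at column t and the
    average remaining proportions elsewhere; rows with no dominated
    documents stay NaN."""
    k = theta.shape[1]
    dominant = theta.argmax(axis=1)
    profile = np.full((k, k), np.nan)
    for t in range(k):
        docs = theta[dominant == t]
        if len(docs):
            profile[t] = docs.mean(axis=0)
    return dominant, profile
```

Sorting each row of `profile` (excluding the diagonal) from high to low yields a co-occurrence heat map of the kind shown in Fig. 5.4.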

Fig. 5.4 shows the average remaining topic proportions with respect to the dominant topic, displayed as a heat map. The dominant topic, shown on the left (i.e. rows), has the average dominant topic proportion within parentheses. A higher number indicates that more of the document content deals with the dominant topic, while the other topics make up the smaller remaining portion of the document. Conversely, lower numbers reflect a dominant topic making up a smaller portion of the document, leaving more room for other topics to be part of that document. For example, documents dealing with the topic Fishing gear on average allocate 45% of their content to their own topic, leaving 55% of the remaining content to other topics (for example, 9% to the topic Models and between 4 and 6% to each of the topics Fisheries management, Acoustics and swimming, Stock assessment, and Southern hemisphere). Furthermore, the remaining average topic proportions, shown at the top (i.e. columns), are sorted from high to low and reflect which topics more frequently, or to a higher extent, co-occur with other topics. For example, given any document with a dominant topic, the remaining topic proportion deals more often with the topics Models, Stock assessment, Reproduction, or Growth experiments (i.e. these topics have the highest co-occurrence), and less often with the topics Salmonids, Genetics, or Salmon (i.e. these three topics have the lowest co-occurrence).

Documents dealing with the most prevalent topic in the corpus directly relating to fisheries, i.e. Fisheries management, allocate to this topic almost 60% of their content, about 13% to Conservation, and between 3 and 5% to topics such as (in descending order): Fishing gear, Models, and Stock assessment. Documents dealing with the most prevalent topic in the rest of the corpus, i.e. Primary production, allocate to this topic 55% of their content, and between 3 and 6% to topics such as (in descending order): Non-fish species, Biochemistry, Models, Diet, Southern hemisphere, and Growth experiments. Co-occurrence of such topics might be natural. However, it would be interesting to consider whether other mixtures of topics would bring novel, and possibly also innovative, insight into fisheries science. Such insight is highly needed in order to achieve sustainable fisheries exploitation and implement the fisheries-related actions of the international ocean governance objectives (European Commission, 2016a).

5.3.3 Topical Trends over Time and Topic Prevalence

To gain insight into the temporal changes of topics, we display the topical trend values and topic prevalence in Fig. 5.5. The left-hand side displays the fitted increase (hot topics) or decrease (cold topics) in percentage points for different time intervals and represents the change in composite topic-year proportions within a certain time frame. Additionally, we display the average composite topic-year proportions for every topic on the right-hand side, referred to as topic prevalence. Individual trend lines for the


[Fig. 5.4 heat map data. Dominant topics and their average dominant-topic proportions, in descending order: (25) Fisheries Management 59.0%; (20) Primary Production 55.9%; (13) Genetics 50.4%; (19) Fishing Gear 45.4%; (23) Acoustics and Swimming 44.8%; (6) Corals 44.4%; (14) Assemblages 44.4%; (7) Biochemistry 43.9%; (15) Growth Experiments 43.6%; (17) Growth 43.3%; (21) Models 43.0%; (5) Non-Fish Species 42.8%; (11) Southern Hemisphere 42.7%; (18) Tracking and Movement 42.4%; (8) Freshwater 41.9%; (1) Conservation 40.5%; (16) Stock Assessment 40.3%; (2) Morphology 39.7%; (3) Salmon 38.5%; (4) Reproduction 38.3%; (22) Salmonids 38.0%; (9) Diet 37.1%; (24) Estuaries 37.0%; (12) Development 34.3%; (10) North Atlantic 34.1%. Per-topic co-occurrence values omitted.]

Figure 5.4: Heat map displaying the dominant topic (left) and the remaining average topic proportions (top) for all 46,582 documents from the corpus. This indicates the extent to which documents about one main topic relate to the other uncovered topics (i.e. the degree of topic co-occurrence), decreasing from left to right. For example, documents that primarily focus on Fisheries management also focus on Conservation (12.7%) or Fishing gear (5.3%).


25 broad fisheries topics, as well as the topical trends and prevalence for the Fisheries management sub-topics, can be found in Figs. 5.7, 5.8, 5.9 in the Appendix.
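The fitted change in percentage points reported for each time interval can be sketched as an ordinary least-squares slope over the composite topic-year proportions; a positive slope marks a hot topic and a negative one a cold topic. This is an illustrative reimplementation, not the study's exact code.

```python
def topic_trend(years, proportions):
    """Least-squares slope of a topic's yearly proportion, returned in
    percentage points per year (proportions given as fractions)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(proportions) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, proportions))
    den = sum((x - mean_x) ** 2 for x in years)
    return 100 * num / den
```

For example, a topic moving from 10% to 14% of the corpus over three years has a trend of +2 percentage points per year.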

With regard to the groups of topics described in Table 5.2, during the entire period of 26 years, taken together, the 11 topics referring to aquatic organism biology themes made up 40% of the corpus; the groups of topics on the themes of specific species, modeling, management, and habitats, between 11% and 14% each; the two topics on the theme of geographical regions, 7% together; and the fishing gear topic, almost 5%. The topics Models, Primary production, and Fisheries management are the most prevalent in the corpus during the entire period of 26 years, with approximately 7, 6.5, and 6%, respectively, of the total number of documents. The topics Salmon, Salmonids, and Morphology are the least prevalent, each accounting for only around 2% of the corpus. The topic Models has been among the top three most prevalent topics since 1995. Primary production was the most prevalent topic in the first three time periods we analyzed (from 1990 to 2005). Topics directly relating to fisheries, such as Fisheries management and Stock assessment, were among the top three most prevalent topics in the period 2005–2010 (both topics) and 2010–2016 (only Fisheries management). The topic Conservation joined this top during the last six years.

Taken together, topics that could be grouped as relating to freshwater fisheries (i.e. Freshwater, Tracking and movement, Salmonids, Salmon, Estuaries, Assemblages) account for one-sixth of the corpus. These results, combined with the declining interest in these topics in the last 16 years (see Fig. 5.5), confirm the findings of Jarić et al. (2012), which indicate a general decline in the frequency of studies focused on freshwater habitats.

The top four hottest topics of the last 26 years (overall column) are (in descending order): Fisheries management, Conservation, Models, and Fishing gear (with Stock assessment, the third topic directly relating to fisheries, in 7th place). The interest in models has also been confirmed by Jarić et al. (2012), and this, in addition to the prevalence of the modeling topics described above, provides empirical evidence for the fact that modeling is one of the most important research methods in fisheries science (Angelini and Moloney, 2007). The topic Fisheries management, the third most prevalent topic in the corpus, remained among the top three hottest topics during the last 11 years, while it was the coldest topic in the first half of the 1990s.

The configuration of the top three hottest and top three coldest topics has fluctuated across the five time periods we analyzed. However, the topic Primary production has constantly ranked among the coldest topics (with the topic Biochemistry also in this group in the period 1990–2005). Over the entire 26-year period, it is interesting to note that Primary production was the most prevalent topic but at the same time had the steepest decline in interest. The topics Models and Conservation have been a constant presence in the top three hottest topics since the turn of the century, whereas the topic Fisheries management only joined this top category in 2005. Another topic directly relating to fisheries, Fishing gear, was part of this top in the period 2000–2005.


[Fig. 5.5 panel titles: Topic trends (percentage points), left; Topic prevalence (percent), right.]

Figure 5.5: Topical trends and prevalence for all 25 uncovered fisheries topics displayed as a heat map. The topical trends (left) show the increasing/hot (red) and decreasing/cold (blue) topics for different time intervals. The value represents the fitted (via linear regression) change in percentage points within each time interval. The topic prevalence (right) shows the average cumulative topic proportion in percentage for different time intervals for each of the 25 fisheries topics. It shows how present a topic is within a certain time interval given all the scientific output within that time interval. Individual trend lines per year can be found in Fig. 5.7 in the Appendix.


The increased prevalence in the corpus of the topic Fishing gear and the constant interest shown in this topic, together with the constant prevalence in the corpus of the topic Stock assessment and the constant interest shown in this topic, indicate that fishing gear and stock assessment have been central elements of fisheries science in the last 26 years. The increasing prevalence of, and interest in, the topic Fisheries management might indicate the strengthening of the connection between fisheries science and management processes, in the light of the growing concern about the status of fish stocks worldwide. Stock assessments provide a scientific and quantitative basis to the process of developing and implementing a management plan (Hoggarth et al., 2005) and, as such, are indispensable to management processes (Hoggarth et al., 2006). The interest in the topics Growth and Reproduction seems to be the most stable among all the 25 topics when looking at the entire period of 26 years, even though the prevalence of these topics is rather small (around 3%). The constant interest in the latter can be explained by the importance of fish reproduction for fisheries assessment and management (Jakobsen et al., 2016).

5.3.4 Topical Trends over Journals

Although many journals included in our analysis overlap to some extent in their content, it is possible to identify journals that seem specialized in specific topics (Fig. 5.6). For example, almost one-fifth of the publications that appeared in the journal Fish and Fisheries relate to the topic Fisheries management, whereas another approximately one-fifth relates to the topic Conservation. Among the topics directly relating to fisheries, the journals Marine Policy and Marine Resource Economics are highly specialized in the topic Fisheries management, and the journals Fisheries Research and CCAMLR Science in the topic Fishing gear. It appears that no journal is highly specialized in Stock assessment, with this topic being addressed by almost all the journals. The top three journals publishing on this topic are ICES Journal of Marine Science, Fisheries Oceanography, and Fish and Fisheries, with 10–11% of the publication space of each of these journals covering this topic.
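Journal specialization of this kind can be sketched by averaging the per-document topic proportions within each journal, so that each resulting row shows how that journal's publication space divides over the topics; `theta` is again assumed to hold LDA's document-topic proportions:

```python
import numpy as np

def journal_topic_matrix(journals, theta):
    """Average topic proportions per journal; journals[i] names the
    journal of document i and theta[i] its topic proportions. Each
    output row sums to (approximately) 1."""
    names = sorted(set(journals))
    rows = np.vstack([theta[[j == name for j in journals]].mean(axis=0)
                      for name in names])
    return names, rows
```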

CHAPTER 5. TOPIC ANALYSIS OF FISHERIES SCIENCE

[Figure 5.6 appears here: a heat map with the 25 topics, Conservation (1) through Fisheries Management (25), as columns and the 21 included journals as rows; each cell gives the percentage of a journal's publication space covered by a topic.]

Figure 5.6: Topical distribution over journals displayed as a heat map. For each included journal (left), the coverage of topics (top) in percentages is displayed (percentage values in cells, row total = 100%). The heat map displays which journals publish which topics and in what proportions. See Table 5.1 for journal year coverage, as this differs between journals (e.g. 1990–2016, 1997–2016, 2000–2016).


5.3.5 Validation of Results

We validated the output of the LDA model (including its labeling) by comparing the hot/cold LDA topics in the period 2000–2009 with the hot/cold words used in publication titles in the same period identified by the bibliometric study of Jarić et al. (2012). Some of the words that they identified as having the greatest increase in frequency can be directly linked to some of the hot topics identified by our analysis, e.g. by-catch, longline, and seabird in the bibliometric study relate to the topic Fishing gear (19) identified by our study; genetic relates to the topic Genetics (13). Likewise, some of the words with the largest decrease in frequency can be directly linked to some of the cold topics identified by the LDA analysis over the same period, e.g. Atlantic in the bibliometric study relates to the topic North Atlantic (10) identified by our study; growth, to the topics Growth (17) and Growth experiments (15); recruitment, to the topic Stock assessment (16); feeding, to the topic Diet (9).

5.4 Conclusion and Recommendations

From the analysis of more than 46,500 full-text articles published in 21 top fisheries journals, it is apparent that, during the last 26 years, the research focus of fisheries science has been predominantly on the natural dimension of the fisheries system, with 22 out of 25 topics referring to this dimension. While the natural dimension of fisheries was split into various aspects, covering topics from specific species to fish catch technology, the human dimension was explicitly expressed through only one topic, albeit the hottest topic in the data set and the second most prevalent: Fisheries management. Although there is undoubtedly some scientific production addressing various human dimensions of fisheries, it could be that the narrative used to describe the human dimension is not explicit enough to be captured by word co-occurrence, or that the human dimension is not prevalent enough to be recognized as a general topic or specific subtopic by the LDA model. Additionally, it might be that the scientific production on the human dimension is published in journals other than those specialized in fisheries, or in other types of outlets, such as books and book chapters. We could advance various hypotheses as to why this might be the case (e.g. most fishery scientists are trained and oriented towards the biological/natural sciences (Link, 2010); those who are not tend to publish in outlets that foster recognition from within their own scientific communities, such as books, rather than journals), but this exercise would be outside the scope of this study. Instead, we want to emphasize two important recommendations: 1. Diversification of the scientific focus so that it covers more of the complexity of fisheries, especially the human dimension (funding bodies play a crucial role here by setting the research agenda); and 2. Publication of fisheries research in outlets more likely to reach the intended audience (i.e. top interdisciplinary journals or specialized fisheries journals, if the objective of publishing is to contribute to fisheries sustainability by reaching fisheries policy- and decision-makers). A lack of interdisciplinary synthesis in research is one of the major factors in fisheries collapses (Smith and Link, 2005). Thus, more integrative research, and research focused on the underrepresented topics, might provide insight into the fine mechanisms of fisheries as socio-ecological complex adaptive systems and, thus, a critical input for developing successful fisheries management approaches.


Appendix

Calculating Topical Trends over Time

For every topic $k = 1, \dots, K$ and for every year $y \in \{1990, \dots, 2016\}$, we calculate the composite topic-year proportion $\hat{\phi}_{y,k}$ as expressed in Equation 5.1.

$$\hat{\phi}_{y,k} = \frac{\sum_{d \in y} \phi_{d,k}}{n_y} \tag{5.1}$$

$\hat{\phi}_{y,k}$ aggregates, for every document $d$ of year $y$, the $k$-th topic proportion $\phi_{d,k}$, and divides the sum by the number of documents published in that year, $n_y$, to obtain a yearly average.
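The averaging in Equation 5.1 is straightforward to implement. The sketch below is a minimal illustration under the assumption that the per-document topic proportions (the output of the LDA model) are available as dictionaries; the function name `topic_year_proportions` and the toy data are hypothetical:

```python
from collections import defaultdict

def topic_year_proportions(doc_topics, doc_years):
    """Composite topic-year proportion (Equation 5.1): sum the k-th topic
    proportion phi_{d,k} over all documents d of year y and divide by n_y,
    the number of documents published in that year."""
    sums = defaultdict(float)   # running sum of phi_{d,k} per (year, topic)
    n_y = defaultdict(int)      # number of documents per year
    for phis, year in zip(doc_topics, doc_years):
        n_y[year] += 1
        for k, phi in phis.items():
            sums[(year, k)] += phi
    return {(year, k): s / n_y[year] for (year, k), s in sums.items()}

# Toy corpus: two documents from 1990 and one from 1991.
docs = [{0: 0.8, 1: 0.2}, {0: 0.4, 1: 0.6}, {0: 1.0}]
years = [1990, 1990, 1991]
props = topic_year_proportions(docs, years)
# props[(1990, 0)] is the average of 0.8 and 0.4, i.e. approximately 0.6
```

The same averaging is applied per year and per topic to produce the prevalence curves in Figure 5.7.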

[Table 5.4 appears here: for each of the 25 topics, a selection of publication titles that have a high probability (roughly 67–96%) for that single topic, together with the publication year.]

Table 5.4: A selection of publication titles that have high probability for a single topic.


[Figure 5.7 appears here: 25 line charts, one per topic, from Conservation (1) through Fisheries Management (25), each plotting topic prevalence (y-axis, percent) against publication year (x-axis, 1990–2016).]

Figure 5.7: Topic prevalence for each of the 25 uncovered fisheries science topics. The x-axis represents the year, with the y-axis displaying the prevalence of the topic within all the documents within that year. The prevalence indicates the cumulative topic proportion relative to all the documents published in that year. For example, in 2016, almost 9% of the papers published in this year cover aspects of fisheries management.


[Figure 5.8 appears here: two heat-map panels showing, for the 12 fisheries management sub-topics, the topic trends in percentage points (left) and the topic prevalence in percent (right).]

Figure 5.8: Topical trends and prevalence for all 12 uncovered fisheries management sub-topics displayed as a heat map. The topical trends (left) show the increasing/hot (red) and decreasing/cold (blue) topics for different time intervals. The value represents the fitted (via linear regression) change in percentage points within each time interval. The topic prevalence (right) shows the average cumulative topic proportion in percentage for different time intervals for each of the 12 fisheries management sub-topics. It shows how present a sub-topic is within a certain time interval, given all the scientific output within that time interval for documents relating primarily to the broad topic of fisheries management.
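The "fitted (via linear regression) change in percentage points" used for the hot/cold labels reduces to an ordinary least-squares slope of prevalence against year. A minimal sketch (the function name and toy series are hypothetical; the actual analysis fits each time interval separately):

```python
def fitted_trend(years, prevalences):
    """Ordinary least-squares slope of topic prevalence (in percent)
    against publication year, i.e. the change in percentage points
    per year within a time interval."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(prevalences) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, prevalences))
    var = sum((x - mean_x) ** 2 for x in years)
    return cov / var

# A topic rising linearly from 4% to 9% over 1990-2015:
years = list(range(1990, 2016))
prevalence = [4.0 + 0.2 * i for i in range(26)]
slope = fitted_trend(years, prevalence)  # 0.2 percentage points per year
```

A positive slope marks a topic as hot (red) within the interval, a negative slope as cold (blue).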


[Figure 5.9 appears here: 12 line charts, one per fisheries management sub-topic, from Spatial Planning (1) through Recreational Fisheries (12), each plotting topic prevalence against publication year (1990–2016).]

Figure 5.9: Topic prevalence for each of the 12 uncovered fisheries management sub-topics. The x-axis represents the year, with the y-axis displaying the prevalence of the topic within all the documents within that year relating primarily to fisheries management. The prevalence indicates the cumulative topic proportion relative to all the fisheries management documents published in that year.


Chapter 6

Using Machine Learning to Uncover Latent Research Topics in Fishery Models

Modelling has become the most commonly used method in fisheries science, with numerous types of models and approaches available today. The large variety of models and the overwhelming amount of scientific literature published yearly can make it difficult to effectively access and use the output of fisheries modelling publications. In particular, the underlying topic of an article cannot always be detected using keyword searches. As a consequence, identifying the developments and trends within fisheries modelling research can be challenging and time-consuming. This paper utilises a machine learning algorithm to uncover hidden topics and subtopics from peer-reviewed fisheries modelling publications and identifies temporal trends using 22,236 full-text articles extracted from 13 top-tier fisheries journals from 1990–2016. Two modelling topics were discovered: estimation models (a topic that contains the idea of catch, effort, and abundance estimation) and stock assessment models (a topic on the assessment of the current state of a fishery and future projections of fish stock responses and management effects). The underlying modelling subtopics show a change in the research focus of modelling publications over the last 26 years.

This work was originally published as:

S. Syed and C. T. Weber. Using Machine Learning to Uncover Latent Research Topics in Fishery Models. Reviews in Fisheries Science & Aquaculture, 26(3):319–336, 2018. doi: 10.1080/23308249.2017.1416331

CHAPTER 6. SUB-TOPIC ANALYSIS OF FISHERY MODELS

6.1 Introduction

Global research efforts have increased significantly in recent years (OECD, 2008), as has publication output within fisheries science (Aksnes and Browman, 2016). This growth has been partly driven by growing concerns about the state of fish stocks and the need to provide information for policy and decision makers globally. Since each fish stock is typically unique, and experimental approaches cannot be used to predict their response to fishing, it follows that the modelling and simulation of fisheries play a major role in providing management advice; these are among the most frequently used methods in fisheries science (Jarić et al., 2012). Models offer a feasible approach to the approximation of trends and processes, and they advance the understanding of fisheries and ecosystem dynamics (Angelini and Moloney, 2007) while guiding data collection and illuminating core uncertainties (Epstein, 2008). For this reason, and in contrast to common perceptions, a multitude of fisheries models is available besides standard stock assessment models, and these models take on many different shapes and forms depending on their method and purpose. They may include individual-based models to investigate fleet behaviour (Bastardie et al., 2014); Bayesian belief networks to better understand stakeholder viewpoints and perceptions (Haapasaari et al., 2012); or conceptual models to analyse fisheries from a socio-ecological complex adaptive system perspective (Ostrom, 2009; Partelow, 2015).

The frequent use of models and their wide range of applications, in combination with the growing global collections of scholarly literature, have led to an ever-increasing number of publications on the various types of models and approaches. As a result, scientists are suddenly faced with millions of publications, overwhelming their capacity to effectively use these collections and to keep track of new research (Larsen and von Ins, 2010). Online collections can be browsed and explored using keyword searches, through which publications can be collected manually; however, in addition to being time-consuming, the size and growth of the body of research often limit the possibility of identifying all the relevant literature. Another problem is that the underlying topic of an article is not readily available in most collections. Thus, the topic of an article—that is, the idea underlying the article, which may be shared with similar articles—cannot always be detected using keyword searches (Srivastava and Sahami, 2009). Given such challenges, an assessment of the field of fisheries models could reveal overlooked research topics, identify important changes in research directions (i.e., trends), assess the diversity of topics in publication outlets, and ultimately help in identifying new and emerging modelling topics. Furthermore, an improved understanding of fisheries modelling approaches could help researchers to more easily synthesise historical and current research developments.

The developments and trends in fisheries science and fishery models are usually assessed through reviews (e.g., Bjørndal et al., 2004; Prellezo et al., 2012) and bibliometric studies (Jarić et al., 2012; Aksnes and Browman, 2016). These types of studies have several limitations, such as taking into account only a limited number of publications (e.g., only 61 publications; Gerl et al., 2016); a limited time period (e.g., from 2000–2009; Jarić et al., 2012); or a limited scope or very specialised focus (e.g., stock assessment methods (Cadrin and Dickey-Collas, 2015); bio-economic models (Prellezo et al., 2012); models of an ecosystem approach to fisheries (Plagányi, 2007); and models of the Celtic Sea (Minto and Lordan, 2014)). Other limitations include proxies for full text, such as titles (Jarić et al., 2012) and abstracts (Aksnes and Browman, 2016), and proxies for research topics, such as one word per topic (Jarić et al., 2012; Aksnes and Browman, 2016). Most importantly, previous attempts to identify trends in fisheries and fisheries modelling are based on top-down approaches, in which research topics are predefined by the researcher (Debortoli et al., 2016), such as region, species, habitat, or study area. Such approaches are prone to human subjectivity; researchers may end up with different results (Urquhart, 2001), or the mapping of text features to categories may not be explicitly known (Quinn et al., 2010).

This study aims to overcome the limitations of previous approaches by applying a bottom-up approach in which research topics automatically emerge from the statistical properties of the documents. In doing so, the topics are automatically uncovered without prior human labelling, categorisation, or predefined classification of publications, and they are thus not biased by researchers' top-down subjective choices. For this purpose, a probabilistic topic model algorithm called latent Dirichlet allocation (LDA) (Blei et al., 2003), which belongs to the field of unsupervised machine learning algorithms, was used to reveal research topics within the field of fisheries models that are published in peer-reviewed journals and have a strong focus on fisheries. Topic model algorithms can automatically uncover hidden or latent thematic structures (i.e., topics) from large collections of documents. The unsupervised nature of LDA allows documents to "speak" for themselves, and topics emerge without human intervention. They have proven to be very useful in automatically identifying and interpreting scientific themes in relation to the journal's existing themes or categories (Griffiths and Steyvers, 2004).

By utilising unsupervised machine learning, this study aims to provide comprehensive information on topical trends within fisheries modelling research for fisheries scientists and stakeholders. In particular, this study analyses 22,236 full-text scientific publications published within the period from 1990 to 2016 in 13 top-tier fisheries journals. Thus, a unique dataset for the field of fisheries models was created, and topics in fisheries modelling and their underlying sub-topics were identified to determine historical and current research interests. In addition, the species, areas, and methods occurring within the identified topics were assessed.


6.2 Methods

6.2.1 Latent Dirichlet Allocation

The LDA model is a generative probabilistic topic model that represents documents (i.e., fisheries publications) as discrete distributions over K latent topics; each topic is subsequently represented as a discrete distribution over all the words (i.e., vocabulary) used. The words with high probability within the same topic are frequently co-occurring words, which can be seen as clusters or constellations of words that are often used to describe an underlying topic or theme (DiMaggio et al., 2013). In this way, LDA captures the heterogeneity of research ideas or topics within publications. The topics and their relative proportions within documents are hidden (i.e., latent) variables that LDA infers from the observable variables—that is, the words within the documents. The generative process behind LDA involves an imaginary random process, through which documents are created based on probabilistic sampling rules. The topics and their proportions are subsequently inferred from these generated documents by applying statistical inference techniques, such as variational and sampling-based algorithms (Blei and Jordan, 2006; Teh et al., 2006; Hoffman et al., 2010; Wang et al., 2011). LDA extends other popular topic model algorithms such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990) and probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999) while also overcoming their limitations.
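The generative story can be made concrete with a toy simulation. The sketch below is illustrative only (the two topics, their word lists, and all probabilities are invented); real LDA runs this process in reverse, inferring the hidden topic proportions from the observed words:

```python
import random

def generate_document(topic_props, topics, n_words, seed=42):
    """LDA's generative process, in miniature: for each word slot, first
    draw a topic from the document's topic proportions, then draw a word
    from that topic's distribution over the vocabulary."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(topic_props)), weights=topic_props)[0]
        vocab, word_probs = topics[k]
        words.append(rng.choices(vocab, weights=word_probs)[0])
    return words

# Two invented topics with fisheries-flavoured vocabularies.
topics = [(["stock", "biomass", "quota"], [0.5, 0.3, 0.2]),
          (["trawl", "mesh", "bycatch"], [0.4, 0.4, 0.2])]
doc = generate_document([0.7, 0.3], topics, n_words=10)
# a 10-word toy document; in expectation about 70% of the slots use
# topic 0's vocabulary
```

Statistical inference then asks: given only `doc`, what were the topic proportions `[0.7, 0.3]` and the topic-word distributions that most plausibly produced it?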

The LDA model makes two assumptions when analysing and uncovering latent topics from documents. First, documents are represented as "bags of words" (i.e., unordered lists of words) in which the word order is neglected. Although this is an unrealistic assumption, it is reasonable if the aim is to uncover semantic structures from text (Blei and Lafferty, 2006; Blei, 2012). Consider a thought experiment where one imagines shuffling all the words in a document. Even when shuffled, one might find words such as "population", "size", "virtual", "minimum", and "recruitment" and expect that the document deals with aspects of population dynamics. One of the core underlying principles of LDA is based on word co-occurrences, and a small number of co-occurring words is sufficient to resolve problems of ambiguity. Second, LDA assumes that the order in which documents are analysed is unimportant (i.e., document exchangeability is assumed); however, at the end of the analysis, all documents are analysed. As a result, LDA is unable to explicitly capture the evolution of topics over decades or centuries of work. This would require a more complicated and computationally expensive dynamic topic model (Blei and Lafferty, 2006), which is currently not feasible given the large dataset; however, this is a potential approach for future work. Document exchangeability is a limitation in the case of topics whose presentation in the literature has dramatically changed (e.g., in terms of the terminology used to describe the topic), but it still captures the phenomenon by which current literature builds upon previous literature. Nonetheless, the assumption of document exchangeability is especially problematic when analysing topics that span 50 to 100 years of research.
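The bag-of-words assumption in the thought experiment above can be shown in a few lines: two documents with opposite word orders are indistinguishable to LDA. A minimal sketch using only the standard library:

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a document to unordered word counts: the only view of a
    document that LDA ever sees. Word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("population size drives recruitment")
b = bag_of_words("recruitment drives population size")
# a == b: both sentences yield the identical bag of words
```

In the actual analysis, each of the 22,236 full-text articles is reduced to such a count vector before inference.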


6.2.2 Topic Interpretation

The topics emerge from the statistical properties of the documents and the statistical assumptions behind LDA. The topics are represented as discrete distributions over all the words, in which the top words (e.g., top 15) for each topic—that is, the words with the highest probability and those that more frequently co-occur together—provide insights into the semantic meaning of the topic. Topics are thus a reference to these probability distributions over words to exploit text-oriented intuitions. No epistemological claims are made beyond this representation. Furthermore, by no means is the topic distribution over words limited to these top 15 words; in fact, every word occurs in every topic, but with different probabilities. The topics are used to uncover the themes prevailing in the documents, as well as the extent to which such themes are present in each document. In doing so, the main ideas of a publication can be extracted and used to track how they have developed over time. Note that the underlying topics and to what extent the document exhibits these topics are not known in advance. These details are the output of the LDA analysis and emerge automatically from the statistical properties of the documents and the assumptions behind LDA.
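Extracting the top words used to interpret and label a topic is a simple sort over the topic's word distribution. A sketch under the assumption that each topic is available as a word-to-probability mapping (the example distribution is invented):

```python
def top_words(topic_dist, n=15):
    """Return the n highest-probability words of a topic; inspecting
    these top words is how a topic's semantic meaning is interpreted.
    Every word in the vocabulary has some probability in every topic,
    so n merely truncates the full distribution."""
    ranked = sorted(topic_dist.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Invented topic-word probabilities for illustration.
topic = {"stock": 0.08, "assessment": 0.07, "model": 0.05,
         "recruitment": 0.03, "the": 0.01}
labels = top_words(topic, n=3)  # ['stock', 'assessment', 'model']
```

In practice, the ranked lists for all K topics are read side by side and given human-interpretable labels such as Stock assessment or Fishing gear.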

6.2.3 Creating the Dataset

This paper aims to identify latent fisheries modelling topics from scientific research articles published in peer-reviewed journals specialising in fisheries. In this manner, the selection of publications was restricted exclusively to fisheries journals; therefore, it follows that some subjective choices were made to achieve this. All journals included in this analysis contain the term "fishery" or "fisheries" in their title and have an impact factor of 1.0 or higher. Additionally, the ICES Journal of Marine Science was included, because it is part of the International Council for the Exploration of the Sea (ICES), which channels science-based advice to decision makers for sustainable fisheries, and fisheries models are an important focus of this journal. A total of 13 fisheries journals were included in the study (see Table 6.1). A time frame of 26 years, from 1990 to 2016, was chosen to allow for enough variation within publication trends. Due to difficulties with journal subscription rights and the fact that some journals started after 1990 (e.g., Fish and Fisheries was first published in 2000), coverage was incomplete for the complete time range of 26 years for a few journals. Documents that did not constitute a type of research article (e.g., book reviews, forewords, errata, conference reports, comments, policy notes, corrigenda and letters) were discarded. In total, 22,236 full-text research articles from 13 top-tier fisheries journals were downloaded using automated download scripts, as well as by utilising the available application programming interfaces (APIs) offered by the publishers. The use of full-text articles, in contrast to only using abstracts, has been shown to increase topic quality and provide a more detailed overview of the latent topics permeating a document collection (Syed and Spruit, 2017). Table 6.1 provides an overview of the complete dataset utilised in this study.
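The document-type filtering described above can be sketched as a simple membership test; the type strings below mirror the list in the text, though the exact labels returned by each publisher's API will differ:

```python
# Editorial document types discarded from the corpus (labels illustrative).
NON_RESEARCH_TYPES = {"book review", "foreword", "erratum", "conference report",
                      "comment", "policy note", "corrigendum", "letter"}

def is_research_article(doc_type):
    """Keep a document only if its type is not one of the editorial
    categories excluded from the dataset."""
    return doc_type.strip().lower() not in NON_RESEARCH_TYPES

kept = [t for t in ["Research Article", "Book Review", "Erratum"]
        if is_research_article(t)]
# kept == ['Research Article']
```

The 22,236 articles in Table 6.1 are the documents that pass this kind of filter.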

CHAPTER 6. SUB-TOPIC ANALYSIS OF FISHERY MODELS

Table 6.1: Overview of the dataset (i.e., corpus). For each journal, the table lists the years for which documents (i.e., articles) were downloaded, the journal's impact factor (IF) according to the ISI Journal Citation Reports 2016, the number of articles (N), the percentage of journal articles in relation to the total number of documents, the mean vocabulary size V (number of unique words) within each document, the mean number of words W within each document, and the estimated standard deviation Std. W of the words within each document. The 13 journals are: Canadian Journal of Fisheries and Aquatic Sciences, Fish and Fisheries, Fisheries, Fisheries Management and Ecology, Fisheries Oceanography, Fisheries Research, Fishery Bulletin, ICES Journal of Marine Science, Marine and Coastal Fisheries, North American Journal of Fisheries Management, Reviews in Fish Biology and Fisheries, Reviews in Fisheries Science & Aquaculture, and Transactions of the American Fisheries Society. The total number of documents is 22,236.


The selection of fisheries journals and underlying fisheries publications comes with some limitations. First, some of the highly influential and most cited papers on fisheries models are published in high-impact journals such as Nature, Science, and PNAS. Although highly influential, such publications would constitute only a small fraction of our sample and would only marginally or even negligibly contribute to the overall number of 22,236 publications downloaded from fisheries journals for this study. Two other reasons exist to exclude such generic journals. The first reason is that including all publications published in such outlets would drastically increase the number of uncovered topics, as fisheries make up a small portion of the publications in Nature, Science and PNAS. While one might be able to use keyword searches and include only those publications that match fisheries-related terms, this brings up the second reason to exclude such journals: publication filtering is based on the subjective choice of relevant keywords and is limited by how publications are indexed and subsequently can be retrieved (e.g., by title, abstract or full text) from these journals. Through the inclusion of publications from only fisheries journals, such subjective choices and associated limitations are avoided.

The second limitation concerns the exclusion of non-fisheries-specialised journals in which fisheries-modelling-related publications might appear. Such journals focus on, but are not limited to, the field of marine science (e.g., Marine Policy and Advances in Marine Biology), the field of coastal areas or zones (e.g., Coastal Management and Ocean and Coastal Management), the field of toxicology (e.g., Environmental Toxicology and Pharmacology and Aquatic Toxicology), and the field of modelling (e.g., Environmental Modelling & Software and Ecological Modelling), in addition to a number of other journals, such as Developmental Dynamics, Bulletin of the American Meteorological Society, Environmental Science and Technology, Philosophical Transactions of the Royal Society, Environmental Health Perspectives, BioScience, Journal of Fish Biology, and Progress in Oceanography. Some publications related to fisheries modelling approaches are published in these outlets, which is a potential limitation of this study. Again, filtering for fisheries modelling publications in these journals would be biased by the subjective choice of keywords and limited by indexing and retrieval functionalities. Consequently, publications with a focus on novelty in modelling approaches, which are commonly published in specialised modelling journals such as Ecological Modelling, were not assessed in this study. On the other hand, the modelling publications captured within the fisheries journals included in this study can potentially address other topics besides fisheries, such as climate change or habitat loss, which are likely to be included in the analysis of modelling publications.

The third limitation relates to the focus on peer-reviewed journals only. As a result, fisheries modelling research that appears in grey literature was excluded. As grey literature is not indexed in the same way as peer-reviewed studies, selecting only relevant grey literature would, again, introduce bias due to human subjectivity in the search and retrieval.


6.2.4 Pre-processing the Data Set

Several important pre-processing steps were required to transform the documents into appropriate bag-of-words representations. First, each document was converted from PDF format into a plain-text representation. Image-based PDFs, mainly old documents from the 1990s, were converted using the Tesseract optical character recognition (OCR) library. Second, documents were tokenised, which involved creating individual words (e.g., from paragraphs and sentences); meanwhile, numbers, single characters, punctuation marks, and words with only a single occurrence were removed, since they bear no topical meaning. Additionally, words that occurred in ≥ 90% of the documents were discarded due to their lack of distinctive topical significance. Boilerplate content, such as title pages, article metadata, footnotes, margin notes and so on, was also removed. The reference list of each article was maintained so as to allow referenced titles and names of authors to be part of the word distributions of topics. An advantage of this approach is that author names can be part of specific topics, but they can simultaneously introduce bias when the referenced articles have no direct link to the underlying topics. A standard English stop word list (n = 153) was used to remove words that serve only syntactical and grammatical purposes, such as ’the’, ’and’, ’were’ and ’is’. Finally, other than grouping lowercase and uppercase words, no normalisation method, such as stemming or lemmatisation, was applied to reduce the inflectional and derivational forms of words to a common base form (e.g., fishing and fishery to fish). Normalisation reduces the interpretability of topics at later stages, as stemming algorithms can be overly aggressive and may result in unrecognisable words when interpreting topics. Stemming might also lead to another problem, as it cannot be deduced whether a stemmed word comes from a verb or a noun (Evangelopoulos et al., 2012).
For these reasons, and because the interpretability of the topics at a later stage was considered highly significant, an extensive normalisation phase was omitted.
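The filtering steps above can be sketched as follows; the three-document corpus, the small stop word list, and the `preprocess` helper are illustrative stand-ins, not the actual pipeline applied to the 22,236 articles or the standard 153-word stop list:

```python
from collections import Counter

# Illustrative stand-ins for the real corpus and stop word list.
docs = [
    "the catch survey of 42 fish stocks",
    "the stock survey and catch data x",
    "the spawning stock biomass survey",
]
stop_words = {"the", "and", "of"}

def preprocess(docs, stop_words, max_doc_freq=0.9):
    # Tokenise on whitespace, lowercase, and drop numbers and single characters.
    tokenised = [
        [w for w in d.lower().split() if len(w) > 1 and not w.isdigit()]
        for d in docs
    ]
    # Corpus-wide counts: words occurring only once bear no topical meaning.
    counts = Counter(w for doc in tokenised for w in doc)
    # Document frequency: words in >= 90% of documents lack distinctiveness.
    doc_freq = Counter(w for doc in tokenised for w in set(doc))
    limit = max_doc_freq * len(docs)
    return [
        [w for w in doc
         if w not in stop_words and counts[w] > 1 and doc_freq[w] < limit]
        for doc in tokenised
    ]

print(preprocess(docs, stop_words))
```

On this toy corpus, ubiquitous words such as 'the' and 'survey' are removed (the latter occurs in every document), as are hapax words such as 'spawning', leaving only the topically informative tokens.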

6.2.5 Creating LDA Models

The LDA models were created with the Python library Gensim (Rehurek and Sojka, 2010). The number of topics to be uncovered (i.e., the K parameter) varied from one to 50, thus creating 50 different LDA models. The hyper-parameters for the LDA models, which affect the sparsity of the topics created and their relative proportions, were set to be symmetrical. Technically, since LDA is a Bayesian probabilistic model, the symmetrical hyper-parameters encode prior knowledge that a priori assigns equal probabilities to topics within documents, and to words within topics. The quality of each topic was calculated using a topic coherence measure to find the optimal value for K (analogous to finding the right number of clusters for, e.g., k-means clustering). A coherence measure calculates the degree of similarity between a topic’s top N words. This provides a quantitative approach for assessing the interpretability of topics from a human perspective. As such, coherence measures aim to find coherent topics (a topic with the top words ’apple’, ’pear’, and ’banana’ is more coherent than one with ’apple’, ’pear’, and ’car’) rather than topics that are merely artefacts of the statistical assumptions behind LDA.

The CV coherence measure was adopted, since it has shown the highest accuracy among available coherence measures (Röder et al., 2015). An elbow method was employed to find the K value with the best-performing topic coherence score.
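The selection step can be sketched with hypothetical coherence scores; in practice, each score would come from fitting an LDA model for a given K (e.g., Gensim's LdaModel) and scoring it with a coherence measure (e.g., Gensim's CoherenceModel with coherence="c_v"). The scores, the `min_gain` threshold, and the `elbow_k` helper below are illustrative choices, not the exact procedure used in this study:

```python
# Hypothetical coherence scores for K = 1..8; real scores would come from
# evaluating one fitted LDA model per K with a C_V coherence measure.
scores = {1: 0.30, 2: 0.38, 3: 0.45, 4: 0.51, 5: 0.52, 6: 0.52, 7: 0.51, 8: 0.50}

def elbow_k(scores, min_gain=0.02):
    # Walk K upwards and stop where the coherence gain levels off,
    # returning the last K that still improved by at least `min_gain`.
    ks = sorted(scores)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        if scores[k] - scores[prev] >= min_gain:
            best = k
        else:
            break
    return best

print(elbow_k(scores))
```

With these invented scores the coherence gain flattens after K = 4, so the elbow method would select four topics rather than the K with the absolute maximum score.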

6.2.6 Identifying Subtopics

For each modelling topic identified, a zoom-in was employed with the aim of uncovering underlying subtopics within each of the general modelling topics by applying an approach similar to that described above. These subtopics provide a more detailed deconstruction of the respective general modelling topics. A zoom-in is performed on a subset of the data consisting of documents that have the general modelling topic as the dominant topic. The dominant topic is defined as the topic with the highest relative proportion, that is, the topic that exceeds all other topic proportions within a document. Since documents are modelled as mixtures of topics, the dominant topic represents the primary topic of a document.
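A minimal sketch of this dominant-topic selection, assuming hypothetical document-topic proportions (the document names and values are invented for illustration):

```python
# Each document is a mixture of topics; its dominant topic is the one with
# the highest proportion. Hypothetical document-topic proportions:
doc_topics = {
    "doc_a": [0.10, 0.65, 0.25],   # topic 1 dominates
    "doc_b": [0.50, 0.20, 0.30],   # topic 0 dominates
    "doc_c": [0.05, 0.70, 0.25],   # topic 1 dominates
}

def dominant_topic(proportions):
    # Index of the largest topic proportion within the document.
    return max(range(len(proportions)), key=proportions.__getitem__)

# Subset used for a zoom-in on topic 1: documents dominated by topic 1.
subset = [d for d, p in doc_topics.items() if dominant_topic(p) == 1]
print(subset)
```

The zoom-in then re-runs the topic modelling on this subset only, so the subtopics are estimated from documents whose primary theme is the general modelling topic in question.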

6.2.7 Labelling the Topics

The LDA model outputs the uncovered topics as probability distributions over all the words used; when sorted, the top 15 words are used to label the topic semantically. Representing the words as probabilistic topics has the distinct advantage that each topic is individually interpretable (Griffiths et al., 2007), compared to a purely spatial representation like the topic model of latent semantic analysis (Deerwester et al., 1990). As stated before, the distributions of words, and specifically the words with the highest probability within each topic, are used to describe an underlying theme; however, such themes are latent, and a semantic label that best captures those words needs to be attached. For example, a topic with the top five words apple, banana, cherry, pear, and mango describes the underlying theme of fruits and can be labelled as such.

To provide a semantically meaningful and logical interpretation of these probability distributions, a fisheries domain expert manually labelled the topics by close inspection of the top 15 high-probability words, together with an inspection of the document titles and content. Furthermore, to improve the labelling of the topics, the topics were visualised in a two-dimensional space by computing the distance between topics (Chuang et al., 2012) and applying multidimensional scaling (Sievert and Shirley, 2014). This two-dimensional topic representation aided in identifying similarities between topics and thus similarities between topic labels.


6.2.8 Calculating Sub-Topical Modelling Trends

To gain insight into the temporal dynamics of the modelling subtopics, document topic proportions were aggregated into a composite topic-year proportion. Such composite values provide insights into the prevalence of a modelling subtopic within a certain year, given all the publications within that year. This furthermore enables the analysis of changing topic proportions over the course of 26 years, as proportions increase or decrease for each subtopic and for each year. Additionally, to obtain insight into increasing and decreasing topical trends, a one-dimensional least squares polynomial was fitted for different time intervals. The time intervals chosen were 1990–1995, 1995–2000, 2000–2005, 2005–2010, and 2010–2016, so as to allow for historical comparison. The polynomial coefficient is used as a proxy for the trend and defines the slope of the composite topic-year proportions for a range of years. Coefficients are multiplied by the number of years within each time interval to obtain the change measured in percentage points. Positive values indicate increasing or “hot” topics, and negative values indicate decreasing or “cold” topics. Colour coding is used to represent the hot (i.e., red) and cold (i.e., blue) topical trends.
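The trend computation can be sketched as follows; the yearly proportions are invented, and the closed-form degree-1 least-squares fit stands in for the polynomial fitting used in the study (the exact interval-length convention is an assumption here):

```python
# Hypothetical composite topic-year proportions for one subtopic.
year_props = {2010: 0.040, 2011: 0.045, 2012: 0.050, 2013: 0.055,
              2014: 0.060, 2015: 0.065, 2016: 0.070}

def trend_pp(year_props):
    # Closed-form degree-1 least-squares fit; the slope is the trend proxy.
    xs, ys = zip(*sorted(year_props.items()))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    # Multiply by the interval length (and by 100 for percentage points).
    return slope * (xs[-1] - xs[0]) * 100

print(round(trend_pp(year_props), 2))  # positive => "hot" topic
```

With these invented values the proportion grows by 0.5 percentage points per year over six years, so the composite trend is +3.0 percentage points; a negative value would mark a "cold" topic.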

6.3 Results and Discussion

6.3.1 General Modelling Topics

The optimal LDA model for the complete corpus (N = 22,236 documents) uncovered 31 general fisheries topics. The calculated coherence scores used to obtain the optimal number of topics, referred to as the K parameter, can be found in Fig. 6.7 in the Appendix. Among these general fisheries topics, two topics deal with aspects of fisheries modelling. The publications dealing with these two modelling topics account for 12% (N = 2,761 documents) of the total number of publications. The remaining 29 topics, which relate to other aspects of fisheries research, are listed in Table 6.3 in the Appendix. A bibliometric analysis of trends in fisheries science found a higher proportion of publications employing models (around 30%, as estimated from publication titles and abstracts in a dataset containing 695 fisheries-related publications; Jarić et al., 2012). Several reasons can be offered to explain why these two percentages differ, such as the time range used and the journals selected; most importantly, the present paper identifies publications that predominantly deal with fisheries modelling aspects, in contrast to publications in which a modelling method is merely employed.

Figure 6.1 shows the top 15 words and their probabilities for the two modelling topics. The first modelling topic concerns catch-effort and abundance estimation methods and is, therefore, given the short name estimation models. It contains the words ‘catch’, ‘survey’, ‘sampling’, ‘effort’, and ‘sample’ among its top 15 words. These words reflect the collection of both fisheries-independent data, which are usually gathered through


(1) ESTIMATION MODELS              (2) STOCK ASSESSMENT MODELS
word            prob.              word            prob.
MODEL           .015               MODEL           .024
ESTIMATES       .014               STOCK           .014
CATCH           .012               MORTALITY       .014
SURVEY          .008               POPULATION      .012
SAMPLING        .008               RECRUITMENT     .011
ESTIMATED       .008               MODELS          .010
MODELS          .007               BIOMASS         .007
ESTIMATE        .007               YEAR            .007
DISTRIBUTION    .007               RATE            .007
ABUNDANCE       .006               MANAGEMENT      .007
MEAN            .006               PARAMETERS      .006
EFFORT          .006               ASSESSMENT      .006
SAMPLE          .005               FISHERIES       .006
METHOD          .005               ESTIMATES       .006
SIZE            .005               FISHING         .005

Figure 6.1: The two uncovered fisheries modelling topics (i.e., estimation models and stock assessment models) from the dataset containing 22,236 fisheries publications (1990–2016; 13 journals). The figure displays the topic label (top) and the top 15 high-probability words.

survey and sampling methods, and fisheries-dependent data (e.g., collected through logbooks), which commonly provide information on catch and effort. These and other obtained data feed into models in order to estimate intermediate parameters such as natural mortality rate or catchability (Hoggarth et al., 2006); this is a phase of research reflected in estimation models through the words ‘model’, ‘estimates’, ‘estimated’, and ‘estimate’. These types of models might also be called retrospective models, since they interpret the past based on collected data.

The second modelling topic concerns modelling approaches for assessing the current state of a fishery and making future projections and is assigned the short name stock assessment models. It contains the words ‘stock’, ‘mortality’, ‘biomass’, ‘rate’, and ‘estimate’, which reflect the most commonly used indicators (i.e., fish catch, stock biomass, stock size and fishing mortality; Hoggarth et al., 2006) to measure the status of the fishery and the state of the stock (Le Gallic, 2002). These indicators link to reference points, which give quantitative meaning to the goals and objectives set for a fishery (Jennings et al., 2009). Reference points are usually estimated through models that use stock and recruitment data, which is reflected in the words ‘stock’, ‘population’, ‘recruitment’, ‘management’, ‘parameters’, and ‘estimates’ in stock assessment models. Together, indicators and reference points play a crucial role in fisheries management and can be used to give quantitative meaning to the objectives of a fishery (Hoggarth et al., 2006).

The distinction between these two topics shows how they are treated separately in fisheries research publications, whereas in practice (i.e., in fisheries stock assessments



for management), these two topics are connected and combined into one model but reflect the different phases of the model development (Hoggarth et al., 2006). The distribution of publication frequencies for both general modelling topics is shown in Figure 6.2, which highlights the increased research interest in stock assessment models compared to estimation models. Additionally, the top five publications with the highest topic prevalence for each of the two modelling topics, indicating to what extent the content of a publication relates to the modelling topic, are shown in Table 6.2.

Figure 6.2: The number of publications per year for publications related to the topics estimation models and stock assessment models.

Table 6.2: Publication title, year, and topic prevalence (in percentages) for the five publications with the highest topic prevalence for each general modelling topic (estimation models and stock assessment models).


Interestingly, only the topics of estimation models and stock assessment models were uncovered (both of which focus on the ecological dimension of fisheries), whereas topics on economic and social fisheries aspects were not found within the modelling publications. This finding might be a result of the selection of journals used in this study. Most of the included fisheries journals declare a multidisciplinary or interdisciplinary scope, while some specifically include socio-economic considerations and the human dimension as subjects of interest. Therefore, at least one social or economic modelling topic could be expected to be identified by the LDA model. Another reason for the absence of other modelling topics may be that fisheries research is still perceived as a natural science. The International Council for the Exploration of the Sea (ICES) only recently established the Strategic Initiative on the Human Dimension (SIHD) “to support the integration of social and economic science into ICES work” (ICES, 2017), and the majority of the ICES workgroups still lack social science input (ICES, 2016). As a result, social scientists and economists may pursue publication of their models not in a journal related to fisheries, but rather in a journal related to their respective disciplines or with a broader scope, such as Ecology and Society, Marine Resource Economics or Marine Policy. Merit issues could also contribute to the topic bias. Different scientific disciplines receive publication merit for different journals, which often depends more on the index of a journal (e.g., Science Citation Index (SCI), Social Science Citation Index (SSCI), or International Scientific Index (ISI)) than on its impact factor. As a result, non-biological and non-ecological disciplines are less likely to use top-tier fisheries journals as publication outlets.
This might, in turn, lead to low visibility of non-ecological models among fisheries stakeholders, because many fisheries journals such as Fish and Fisheries and Fisheries Research intend to reach fisheries managers, administrators, policy makers and legislators.

6.3.2 Subtopics within Estimation Models

The zoom-in (i.e., the process of uncovering subtopics from general topics) on the general topic of estimation models (N = 1,124 documents) identified 14 subtopics (see Figure 6.7 in the Appendix). Figure 6.3 provides an overview of the 14 estimation model subtopics, the top 15 words of the topics with their probabilities, and the manually attached label that best captures the semantics of the top words. Furthermore, a two-dimensional topic representation can be found in the topic similarity map in Figure 6.4 (A), showing the topic similarity with respect to the distribution of the words. The trends (i.e., the change in overall topic proportion, in percentage points) and prevalence (i.e., the size of the overall topic proportion as a percentage) are presented in Figure 6.5 (A).

Most of the uncovered subtopics can be grouped. The principal group consists of the five subtopics focusing on the biological aspects of fisheries (i.e., catch and abundance, mortality rate (tags), fish distribution, spawning, and length and growth). This highlights the importance and scientific focus of the biological dimension in fisheries research.


(Figure 6.3, panels (1)–(14), each listing the subtopic's top 15 high-probability words: (1) catch and abundance; (2) mortality rate (tags); (3) abundance (surveys); (4) recreational fisheries; (5) parameters and estimators; (6) sampling; (7) abundance (sampling); (8) fish distribution; (9) spawning; (10) net selectivity; (11) vessels and fleet; (12) trawl surveys; (13) length and growth; (14) salmon.)
Figure 6.3: The 14 uncovered subtopics from the documents (N = 1,124) exhibiting the topic estimation models as the dominant topic. The figure displays the subtopic label (top) and the top 15 high-probability words.



Figure 6.4: Topic similarity map that shows a 2-dimensional representation (via multidimensional scaling). A: 14 estimation model subtopics. B: 15 stock assessment model subtopics. The distance between the nodes represents the topic similarity with respect to the distributions of the words (i.e., nodes closer together have more related word probabilities). The surface of the nodes represents the prevalence of the topic within the corpus.

Catch and abundance shows the biggest overall increase over time (+15.46%) and had the largest proportion (14.84%) within the last six years (Figure 6.5 (A)). Most of the other biological subtopics show very little variation over time, and some only make a small contribution in terms of proportion (e.g., spawning, with only 3.82% overall topic proportion; Figure 6.5 (A)). Length and growth showed the highest overall decrease over time (-14.04%), indicating a diminishing scientific interest. The subtopic of length and growth nevertheless remained relatively high in terms of topic proportion, with an average of 9.13% between 2010 and 2016, possibly because growth is an important parameter for stock assessments (Lorenzen, 2016; Maunder et al., 2016) and is also among the most frequently discussed subjects in fisheries, as shown by a previous trend analysis (Jarić et al., 2012). The subtopic of parameters and estimators relates more to the technical aspects of estimation modelling, but appears to be similar to the biological subtopic of mortality rate, as apparent from the similarity map (Figure 6.4 (A)). Vessels and fleet showed a large topic proportion (between 8% and 10%) over the last 16 years (Figure 6.5 (A)). Both the topic of vessels and fleet and that of net selectivity likely relate to biological considerations, but they could also hint at a slightly more economic perspective on industry (fleet) and gear-related matters; however, additional words such as ‘firm’, ‘prices’, or ‘market’ would have to be present to confirm this hypothesis further. The four subtopics of abundance (surveys), sampling, abundance (sampling), and trawl surveys focus on survey and sampling, which are essential methods for gathering data and information on fisheries. In particular, information on catch and stock abundance is required by almost all stock assessment models (Hoggarth et al., 2006). These four subtopics account for a combined overall topic prevalence of 30.73%, indicating their importance to fisheries research. The subtopic of recreational fisheries



Figure 6.5: Trends in changing topic proportions for different time intervals for all subtopics. The left-hand side (A) displays the 14 uncovered estimation model subtopics. The surface of the node represents the topic prevalence within a certain time range and indicates how present a topic was within all the published material of that time frame. The colours indicate the trend in topic proportion (i.e., change in percentage points) and indicate whether a topic increased in popularity (hot topic) or decreased in popularity (cold topic) within that time frame. The right-hand side (B) displays the information for the 15 uncovered stock assessment model subtopics.

refers to a type of fishery that differs in the estimation process from commercial fisheries, as it often employs surveys of anglers. This type of estimation process may refer not only to marine but also to freshwater fisheries. Recreational fisheries underwent an increase in topic proportion from 2.11% in the 1990–1995 period to 7.90% in the 2010–2016 period, indicating the growing importance of recreational fisheries assessments in fisheries science. The increasing impact of recreational fishing on commercial fish stocks (Griffiths and Fay, 2015) is in line with the observed trend in this study. Apart from recreational fisheries, no other types of fisheries (e.g., small-scale, artisanal, or commercial fisheries) were identified by the topic model. The distance of recreational fisheries from the other subtopics in the similarity map may explain this, as authors writing about recreational fisheries use distinctive words that differ from the discourse on other types of fisheries. Another possible explanation may be that there are more studies on recreational fisheries than on other types of fisheries. Salmon is the only topic that focuses on one particular species. The similarity map shows how the topic of salmon differs in the words used, indicating the particularity and specialised research niche of the topic (Figure 6.4 (A)). Salmon showed a positive trend (+5.61%) over the study period; however, this result conflicts with previous research that showed a diminishing research interest in the species (Jarić et al., 2012). This could be due to the increasing effort within aquaculture and the growing economic importance of the species over the period (FAO, 2016) that separates this study from that of Jarić et al. (2012).

Within the top 15 words of the subtopics, important subjects such as species and names/methods can be identified. Three subtopics contain species names (i.e., ‘shrimp’ in sampling, ‘cod’ and ‘crab’ in fish distribution, and ‘salmon’ and ‘chinook’ in salmon). Methods mentioned within the subtopics of estimation models are ‘regression’ in parameters and estimators and ‘Bayesian’ in abundance (sampling). Parameters for fish stock assessments can be estimated through the least squares method, represented in the form of regression analysis; however, maximum likelihood methods are now preferred, as they allow for a better specification of the errors in the models. Bayesian methods are commonly used to incorporate uncertainty into management advice, but this could also involve other methods such as maximum likelihood, bootstrapping, or Monte Carlo modelling (Hoggarth et al., 2006). The two methods ‘regression’ and ‘Bayesian’ do not reflect the current diversity of modelling methods, nor necessarily the most conventional models used in fisheries assessments today, but they seem to have a strong association with the two topics of parameters and estimators and abundance (sampling). Note that references to names of species and methods highlight the importance and relation of such words within a specific topic (technically, they co-occur more frequently to describe the latent topic) but are by no means mutually exclusive (i.e., methods and species can occur in different subtopics simultaneously). They provide information from a topical perspective (i.e., a high-level decomposition of the document into clusters of co-occurring words), but fail to address on what basis such species and methods are linked within a specific topic.


6.3.3 Subtopics within Stock Assessment Models

The zoom-in on the topic of stock assessment models (N = 1,637 documents) revealed 15 subtopics (see Figure 6.7 in the Appendix for the calculated topic coherence scores). Figure 6.6 provides an overview of the 15 subtopics, the top 15 words with their probabilities, and the label attached to each topic. The topic similarity for these subtopics can be found in Figure 6.4 (B). The subtopic trends and prevalence are displayed in Figure 6.5 (B).
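A zoom-in analysis of this kind first requires selecting the subset of documents whose dominant topic is stock assessment models, before fitting a second, finer-grained topic model to that subset. A minimal sketch of the selection step, with hypothetical data (the chapter's actual implementation is not shown in this excerpt):

```python
def dominant_topic(theta):
    """Index of the highest-probability topic in a document's distribution."""
    return max(range(len(theta)), key=lambda k: theta[k])

def filter_by_dominant_topic(docs, topic_index):
    """Keep only the documents whose dominant topic is `topic_index`;
    a second LDA model can then be fitted on this subset."""
    return [doc_id for doc_id, theta in docs if dominant_topic(theta) == topic_index]

# Hypothetical document-topic distributions over three topics; suppose
# topic 1 plays the role of 'stock assessment models'.
docs = [("d1", [0.1, 0.7, 0.2]),
        ("d2", [0.5, 0.2, 0.3]),
        ("d3", [0.2, 0.6, 0.2])]
subset = filter_by_dominant_topic(docs, 1)  # → ["d1", "d3"]
```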

Most of the subtopics of stock assessment models revolve around biological aspects and processes (i.e., growth and length, movement, predation, cod recruitment, fecundity and reproduction, population dynamics, life history, and stock recruitment). The majority of these subtopics show a slight increase over the study period (Figure 6.5 (B)); together, these subtopics have an overall topic proportion of 42.91%, which shows their consistent importance within fisheries science and fisheries management (Hilborn and Walters 1992). Within the biological subtopics, predation stands out as the only subtopic that refers to ‘interaction’, ‘multispecies’ and the ‘ecosystem’. The subtopic of predation increased by 4.67% during the period from 1990 to 1995 (Figure 6.5 (B)), which reflects the increased scientific awareness of predator-prey interaction and model implications in the early 1990s (e.g., Yodzis, 1994). The topic proportion of predation shows a positive trend, as it rose from 3.75% in the period of 1990–1995 to 5.07% in the period of 2010–2016; this might indicate the increased attention of the scientific community to an ecosystem approach to fisheries and the implementation of multispecies and ecosystem considerations within stock assessments, modelling frameworks, and management advice (Maynou, 2014; Möllmann et al., 2014; Gaichas et al., 2017). The four subtopics of harvest strategy, management effects, management tools and reference points all concern management measures and effects, but they mainly address biological components such as ‘recruitment’, ‘abundance’, and ‘biomass’. The subtopic of reference points shows the strongest overall negative trend of all subtopics (-26.55%), indicating that the popularity of this topic among fisheries scientists has decreased over the years.
Nevertheless, the topic of reference points still makes up a relatively large proportion, 9.82% (Figure 6.5 (B)); this is the second largest proportion in the period of 2010–2016 after estimator performance, which has a 15.19% topic proportion within the same period. This highlights the continuity of research on reference points from the 1990s to the present day (Caddy and Mahon, 1995; Caddy, 2004; Froese et al., 2017). The subtopic of estimator performance shows the highest increase (+11.11%) within the overall study period (i.e., 1990–2016) and makes up a large proportion within the last six years of the time frame, from 2010–2016 (15.19%); this finding could be related to the increased overall importance of models in fisheries science (Jarić et al., 2012). The subtopic of freshwater fisheries shows an overall positive trend (+6.28%), even though freshwater fisheries habitats have been found to be less studied than marine fisheries (Jarić et al., 2012). The topic proportion of freshwater fisheries rose over the study period, from 1.82% in 1990–2000 to 8.08% in 2010–2016 (Figure 6.5 (B)). The importance of freshwater fisheries in areas such as Africa and India may explain the increase in research efforts within this field (FAO, 2016).


[Figure 6.6 word tables: the top 15 high-probability words and their probabilities for each of the 15 subtopics: (1) growth and length, (2) estimator performance, (3) harvest strategy, (4) management effects, (5) movement, (6) management tools, (7) predation, (8) Bayesian approach, (9) cod recruitment, (10) fecundity and reproduction, (11) population dynamics, (12) freshwater fisheries (and salmon), (13) life history, (14) stock-recruitment, (15) reference points.]

Figure 6.6: The 15 uncovered subtopics from the documents (N = 1,637) exhibiting the topic stock assessment models as the dominant topic. The figure displays the subtopic label (top) and the top 15 high-probability words.


From the top 15 words (Figure 6.6), related subjects were identified, such as regions, species, and names/methods. The two marine regions mentioned are ‘Atlantic’ and ‘Pacific’, possibly because these are among the world’s major fishing areas (FAO, 2016). The various species names found within the top 15 words, such as ‘cod’, ‘herring’, and ‘anchovy’, cover many of the commercially important species in marine capture production (FAO, 2016). These results stand in stark contrast to a bibliometric study on trends in fisheries science, which found virtually no research on many commercially important species (Aksnes and Browman, 2016); however, those results were based on word frequencies in publication titles and abstracts, which may not mention the species of concern. This finding highlights the strength of the full-text LDA analysis. Other mentioned species, such as ‘abalone’, ‘lobster’, and ‘shark’, may occur with high probability in the subtopics because they represent species of great economic value and are often a focus of conservation efforts (Turpie et al., 2003; Simpfendorfer and Dulvy, 2017).

Several names within the words of the subtopics refer to a method named after a scientist, such as ‘Bayesian’, ‘Bertalanffy’, ‘Ricker’, and ‘Punt’, which could be a direct consequence of the inclusion of the reference list in the analysis. The subtopic of Bayesian approach indicates the importance of this methodology in fishery science and for fisheries models. A Bayesian approach can be used for stock assessments and decision analysis and represents an improved way of fitting models to data and of decision making (Hoggarth et al., 2006). The scientists von Bertalanffy and Ricker both made substantial contributions to fisheries science: von Bertalanffy in metabolism and growth (von Bertalanffy, 1957) and Ricker in the computation and interpretation of biological statistics of fish populations (Ricker, 1975). Their methods are still applied today in the form of growth models (Allen, 1966; Piner et al., 2016) and stock-recruitment models (Baker et al., 2014). The author Punt has not developed any particular method that takes his name; however, his name may occur within the top 15 words due to his significant contribution to research and his publications on estimator performance and data standardisation, as well as his many citations by other scientists within the field. Although Punt is, relatively speaking, a newcomer compared to some of the early influential researchers in the field (e.g., Hjort, Beverton, and Holt), the occurrence of his name is perhaps a result of the time frame examined, or it may indicate that the names of senior scientists and methods have become somewhat common knowledge and are therefore not always explicitly stated or cited.

6.4 Conclusions

The aim of this paper was to uncover fisheries modelling topics from 22,236 scientific publications from 13 peer-reviewed fisheries journals. Additionally, subtopics of the general modelling topics were uncovered to provide insights into their development and trends over the last 26 years. Overall, two main fisheries modelling topics were identified: estimation models and stock assessment models. This study demonstrates that research in the field of fisheries modelling shows a shift of scientific focus in topics and subtopics over the last 26 years. Stock assessment models are outperforming estimation models, and their underlying subtopics have moved from length and growth to catch and abundance, and from reference points to estimator performance. Economically important species and areas feature prominently within the modelling subtopics.

Both general modelling topics focus primarily on the biological aspects of fisheries; however, since this study was limited to publications in 13 fisheries journals, other topics in fisheries modelling (e.g., with a focus on social, management or economic aspects of fisheries) may well exist in publications of other journals. Possible disciplinary merit issues, and the persistent understanding of fisheries as a natural science discipline, might further limit fisheries journals to models with an ecological focus, despite their multidisciplinary scope.

In conclusion, this novel machine learning approach revealed interesting insights into the topical trends of a large dataset of models published in fisheries journals. This approach enables researchers to identify research topics and shifts in research focus, and it provides a bigger picture that captures the main ideas prevailing in scientific publications.

Appendix


[Figure 6.7 plots: coherence score (y-axis) against number of topics (x-axis) for three runs and their average, shown in three panels: all documents, estimation models, and stock assessment models.]

Figure 6.7: Calculated coherence scores (y-axis) for the number of topics (x-axis) (i.e., K parameter) for three different runs. The average coherence score is calculated by averaging the scores over all three runs for the same K parameter. The figures represent the following: all documents (N = 22,236); documents that exhibit the topic estimation models as the dominant topic (N = 1,124); documents that exhibit the topic stock assessment models as the dominant topic (N = 1,637).
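Figure 6.7 selects the number of topics K by averaging a topic coherence score over three runs. Since this excerpt does not restate the exact coherence measure used, the sketch below illustrates the idea with the UMass coherence variant, computed from document co-occurrence counts (stdlib only; function names are illustrative):

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum over ordered pairs of top words
    (w_i, w_j) with i > j of log((D(w_i, w_j) + 1) / D(w_j)), where D
    counts the documents containing the given word(s). Higher (less
    negative) scores indicate that the top words co-occur more often,
    i.e. a more coherent topic."""
    doc_sets = [set(doc) for doc in documents]
    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log(
                (doc_freq(top_words[i], top_words[j]) + 1) / doc_freq(top_words[j]))
    return score

# Tiny hypothetical corpus of tokenised documents.
corpus = [["stock", "assessment", "model"],
          ["stock", "model"],
          ["fish", "habitat"]]
coherence = umass_coherence(["stock", "model"], corpus)  # log(3/2) ≈ 0.405
```

Repeating such a computation for each candidate K and each run, and averaging, gives curves of the kind shown in Figure 6.7.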

Table 6.3: The top 15 words (i.e., the words with the highest probability) for each of the 31 uncovered general fisheries topics. The two topics in bold (i.e., 4 and 9) are the identified modelling topics used in the analysis of this paper, with 4 being the topic estimation models and 9 being the topic stock assessment models.

[Table 6.3 body: the top 15 words per topic for topics 1–31.]

Chapter 7

Mapping the Global Network of Fisheries Science Collaboration

As socio-environmental problems have proliferated over the past decades, one narrative which has captured the attention of policymakers and scientists has been the need for collaborative research that spans traditional boundaries. Collaboration, it is argued, is imperative for solving these problems. Understanding how collaboration occurs in practice is nevertheless important, and may help explain the idea space across a field. In an effort to make sense of the shape of collaboration in fisheries science, here we construct a co-authorship network of the field, from a dataset comprising 73,240 scientific articles, drawn from 50 journals, and published between 2000 and 2017. Using a combination of social network analysis and unsupervised machine learning, the work first maps the global structure of scientific collaboration among fisheries scientists at the author, country and institutional levels. Second, it uncovers the hidden subgroups within the network (here country clusters and communities of authors), detailing also the topical focus of the largest fisheries science communities. We find that while the fisheries science network is becoming more geographically and institutionally extensive, it is simultaneously becoming more intensive. The uncovered network exhibits characteristics suggestive of a thin style of collaboration, and groupings that are more regional than they are global. Although likely shaped by an array of overlapping micro- and macro-level factors, the analysis reveals a number of political-economic patterns that merit reflection by both fisheries scientists and policymakers.

Submitted for publication:

S. Syed, L. ni Aodha, C. Scougal, and M. Spruit. Mapping the global network of fisheries science collaboration. Reinforcing or broad-based structures of knowledge production? (submitted for publication). 2018b


7.1 Introduction

Over the past number of decades, as socio-environmental crises have multiplied, the question of research collaboration has captured the attention of policymakers and scientists alike (Katz and Martin, 1997), with calls for “intensive cooperation” and a widening of perspectives becoming commonplace (Palsson et al., 2013). The underlying assumption driving these calls is that solutions to the complex (social, ecological, socio-ecological) problems facing humanity today are in many instances not going to be found within the confines of traditional disciplinary, thematic, sectoral, or territorial boundaries (European Commission, 2008). In this respect, fisheries have not been an exception. Amidst ongoing dissatisfaction with the outcomes of traditional fisheries science and management, for example in terms of the increasingly precarious status of fish stocks and the communities that depend upon them (Symes et al., 2015), policymakers and scientists have shifted their gaze to the production of knowledge within this space, and actively sought to broaden collaborative efforts in this area (European Commission, 2008, 2016b; Geoghegan-Quinn et al., 2013; IOC-UNESCO, 2017; Rozwadowski, 2002; Smith and Link, 2005; Symes and Hoefnagel, 2010b).

It is unsurprising then, if not entirely consequential, that research collaboration has increased exponentially over the past decades (Wuchty et al., 2007). Further, given the significant amount of empirical research suggesting that social relationships, and the networks these relationships constitute, are important in explaining processes of knowledge production (Bourdieu, 1975, 1991; Forsyth, 2003; Granovetter, 1983; Law, 1987; Moody, 2004; Phelps et al., 2012; Schott, 1991, 1993), it is of no surprise that this shifting character of science (Adams, 2013) has drawn the attention of scholars. Patterns of co-authorship amongst scientists, long recognized as providing a window into collaboration within the academic community (Newman, 2004), have proven a particularly fruitful line of inquiry in this respect (Adams, 2012, 2013; Azoulay et al., 2010; Ding, 2011; Katz, 1994; Katz and Martin, 1997; Leydesdorff and Wagner, 2008; Liu and Xia, 2015; Martin et al., 2013; Newman, 2001; Wagner et al., 2015a). Regarding the field of fisheries, however, whilst scholars have directed their attention towards characterizing the direction and content of fisheries science publications (Aksnes and Browman, 2016; Jarić et al., 2012; Natale et al., 2012; Nikolic et al., 2011; Syed et al., 2018a), and studies have highlighted that collaboration is increasing within this space (Jarić et al., 2012), we know comparatively little about the structure these collaborations are taking. The small body of work that has analyzed co-authorship networks in fisheries science has been narrowly confined in terms of timespan and journal inclusion (Elango and Rajendran, 2012), or to a particular type of fishing (Oliveira Júnior et al., 2016).

Given the applied nature of fisheries science, with science playing a critical role in informing fisheries management decisions (Campling, 2012), and hence having practical consequences for fish and people, understanding how knowledge is produced in this area is especially pertinent. Thus, with an eye to making sense of the shape of fisheries science, here we take scientific collaboration, measured as co-authorship amongst scientists in this field, as our analytical vantage point. Using a combination of social network analysis and topic modeling (a variant of unsupervised machine learning), alongside theoretical insights from the sociology of science, we map the co-authorship network that characterizes this applied domain, and investigate the collaborative entanglements within this space. In doing so we pose questions with respect to how patterns of collaboration differ between subjects and how these have changed over time (Newman, 2004). Our analysis provides a dynamic portrait (Newman, 2004) of the fisheries science community, and an avenue through which the social dynamics underpinning fisheries science collaborations, and consequently the production of knowledge within this space, may be explored (Bourdieu, 1975; Ding, 2011; Forsyth, 2003; Latour, 1993; Liu and Xia, 2015; Martin et al., 2013).

Our study builds upon the important groundwork that has been laid out by previous scholars within this domain (Aksnes and Browman, 2016; Elango and Rajendran, 2012; Jarić et al., 2012; Natale et al., 2012; Nikolic et al., 2011; Oliveira Júnior et al., 2016; Syed et al., 2018a) in a number of ways. First, by focusing our attention on the networks of production, we expand upon existing analysis that has focused on the content of fisheries science (Aksnes and Browman, 2016; Jarić et al., 2012; Natale et al., 2012; Nikolic et al., 2011; Syed et al., 2018a), by characterizing the structure of the community of scientists that produces that output, in a manner that may help us understand its content (Bourdieu, 1975; Forsyth, 2003). Second, as detailed, the work that has previously taken a network approach to the production of fisheries-related knowledge (Elango and Rajendran, 2012; Oliveira Júnior et al., 2016), though illuminating, has hitherto been narrowly bounded either by time or by specific knowledge communities. Here our analysis is based upon a dataset comprising 73,240 scientific journal articles, drawn from all 50 journals in the fisheries category as defined by the Science Citation Index Expanded (SCIE), and published between 2000 and 2017. This category has been cited as containing the core journals within the field (Aksnes and Browman, 2016). Consequently, the network we construct is expansive, comprising 106,137 authors from 100,175 different affiliations, across a broad spectrum of fisheries science research, related to both capture and culture fisheries. This large network is subsequently analyzed at progressively finer levels of granularity (Ding, 2011), across three planes (spatial, temporal, and topical), in a manner which broadens the bounds of the analysis and provides for a multi-dimensional overview of the field.

7.2 Results

The results of our investigation are presented in two parts. First, the macro-level structure of the global fisheries science network is detailed and mapped at the author, country, and institutional levels. Second, moving to a more fine-grained level of analysis, the hidden collaborative groupings within the network, here country clusters and communities of authors, within which the nodes (i.e., countries, authors) are more tightly connected to each other than to the rest of the network (Palla et al., 2005), are specified. At the community level, adding a further layer to our analysis, we detail their topical foci.
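Community detection, the task of finding such tightly connected subgroups, can be illustrated with a simple label-propagation pass over a co-authorship edge list. This is a stdlib-only sketch for illustration only; the specific community-detection algorithm used in the chapter is not restated in this excerpt:

```python
from collections import Counter, defaultdict

def label_propagation(edges, max_iter=20):
    """Community detection by label propagation: every node repeatedly
    adopts the most frequent label among its neighbours. A node keeps its
    current label when that label is among the most frequent; remaining
    ties are broken by the largest label, so the sweep is deterministic."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    labels = {n: n for n in adj}
    for _ in range(max_iter):
        changed = False
        for n in sorted(adj):
            counts = Counter(labels[m] for m in adj[n])
            top = max(counts.values())
            candidates = {lab for lab, c in counts.items() if c == top}
            best = labels[n] if labels[n] in candidates else max(candidates)
            if best != labels[n]:
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical co-authorship graph: two 4-cliques bridged by one edge.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("E", "F"), ("E", "G"), ("E", "H"), ("F", "G"), ("F", "H"), ("G", "H"),
         ("D", "E")]
communities = label_propagation(edges)
# The two cliques converge to two distinct community labels.
```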

7.2.1 Topology of the Co-Authorship Network

In line with broader trends (Adams, 2012, 2013; Leydesdorff et al., 2013), and previous work regarding fisheries (Aksnes and Browman, 2016; Jarić et al., 2012), the fisheries science collaboration network is expanding rapidly (Appendix Fig. 7.7). The number of authors participating in the network has increased steadily, whilst the number of collaborative ties via publication has increased almost exponentially, with a rapid increase visible since 2015. This has been fueled, at least in part, by the volumetric rise in fisheries science publications, which has almost doubled since 2000 (Appendix Fig. 7.8). That said, as the network has expanded, the network degree (i.e., the average number of connections possessed by each scientist (Liu and Xia, 2015)) has increased, whilst the average clustering (i.e., the extent to which a scientist’s co-authors also collaborate with each other (Liu and Xia, 2015; Newman, 2004)) has decreased, indicating that collaboration is indeed becoming more extensive. On the other hand, the density (i.e., the degree of connectedness of the network) has decreased, implying that the network has become less structurally cohesive.
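The three measures referred to above (average degree, density, and average clustering) can all be computed directly from a co-authorship edge list. A stdlib-only sketch with a toy graph (author names are illustrative):

```python
from collections import defaultdict

def network_stats(edges):
    """Average degree, density, and average local clustering coefficient of
    an undirected co-authorship network given as (author, author) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    n = len(adj)
    m = sum(len(neighbours) for neighbours in adj.values()) // 2
    avg_degree = 2 * m / n
    density = 2 * m / (n * (n - 1))

    def local_clustering(v):
        neighbours = adj[v]
        k = len(neighbours)
        if k < 2:
            return 0.0
        # Count edges among the neighbours of v.
        links = sum(1 for a in neighbours for b in neighbours
                    if a < b and b in adj[a])
        return 2 * links / (k * (k - 1))

    avg_clustering = sum(local_clustering(v) for v in adj) / n
    return avg_degree, density, avg_clustering

# Toy network: a triangle of co-authors A, B, C plus a pendant author D.
avg_deg, dens, clust = network_stats([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
```

Tracking these statistics per publication year produces trend curves of the kind summarised above.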

In light of the fragmentation and lack of connectivity which has previously been cited as problematic in fisheries science (Jarić et al., 2012; Symes and Hoefnagel, 2010b), and the existing “narrow lenses” that have been detailed as persisting in the field (Syed et al., 2018a), we might tentatively infer that the structural trends exhibited by the fisheries science network are a good thing. Very dense ties can have a homogenizing effect on a network (Bodin and Crona, 2009), whilst high levels of clustering are indicative of fragmentation and division (Lambiotte and Panzarasa, 2009). For example, the more an individual’s collaborators are also connected to one another, the less likely those connections will lead to new collaborations with “dissimilar others”, thereby making exposure to new ideas similarly unlikely (Granovetter, 1973; Lambiotte and Panzarasa, 2009). Thus, that the global fisheries science network exhibits trends in the opposite direction may well suggest that the network is becoming less fragmented (Borrett et al., 2014).

As it has expanded, however, the number of potential connections across the network that have been realized has decreased, and the network has become less structurally cohesive. This pattern could work to limit the spread of ideas across the network (Moody, 2004), with ties in this sense working to enhance knowledge production (Bodin and Crona, 2009). Seen from this angle, this trend may be indicative of a field that is becoming increasingly divided into silos, albeit silos within which there is considerable collaboration. This could have implications in terms of inhibiting knowledge exchange (Borrett et al., 2014), reinforcing lines of division that already exist, or generating new ones. That said, an element of agonistic pluralism is desirable in all fields, certainly in terms of creating space for historically underrepresented ideas (Matulis and Moyer, 2017). Therefore, cast in a more favorable light, this pattern might suggest that the field is becoming more heterogeneous, in a manner that could provide welcome space for addressing particular problems and the nurturing of new ideas (Borrett et al., 2014), or place-based epistemologies (Escobar, 2004).

7.2.2 Country-Level Giants

Large networks are difficult to visualize, since the nodes (here authors) simply get plotted on top of each other (Moody and Light, 2006). To get a clearer picture of collaboration at the global level, for visualization purposes we therefore aggregated the network of authors at the country level, whereby attribution for each publication was fractionally credited. As has been detailed elsewhere (Aksnes and Browman, 2016; Jarić et al., 2012; Oliveira Júnior et al., 2016), in terms of publication output, the fisheries science network is dominated by authors located in a few geographical regions. A large proportion of the publication volume in this field is produced by a small group of fisheries science powerhouses, comprising a number of traditional fisheries science producers (e.g., US, Canada, Japan, Australia, UK, Norway), who have over the past decades been joined, and in some instances surpassed, by a number of large emerging economies (e.g., China, India, Brazil) (Fig. 7.1 and Appendix Table 7.1).
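Fractional crediting of the kind used here splits each publication's unit of credit equally over its author-country affiliations. A minimal, stdlib-only sketch with hypothetical data (the chapter's exact crediting rules may differ in detail):

```python
from collections import Counter

def fractional_country_credit(publications):
    """Fractional counting at the country level: each publication carries a
    total credit of 1, split equally over its author-country affiliations.

    publications: list of publications, each a list with one country entry
    per author affiliation.
    """
    credit = Counter()
    for countries in publications:
        share = 1.0 / len(countries)
        for country in countries:
            credit[country] += share
    return credit

# Hypothetical example with two publications.
pubs = [["US", "Canada"],          # each country receives 0.5
        ["US", "US", "Norway"]]    # US receives 2/3, Norway 1/3
credit = fractional_country_credit(pubs)
```

Summing the resulting credits per country and normalising by the total gives per-country publication percentages of the kind shown in Fig. 7.1.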

Although cross-border collaboration in the field has increased over time, the patterns across the field are far from even, and the collaborative landscape, when viewed from the global level, is dominated by Western countries (Fig. 7.2). In line with existing analysis (Jarić et al., 2012), the US, UK, and Canada are the most internationally collaborative countries in the network. As the field has become increasingly collaborative, historical links between European and North American countries have intensified, whilst a number of emerging economies have forged strong links with the US. For example, mirroring the pattern in science more generally (Wagner et al., 2015a), China has emerged as a prominent US collaborator, a relationship that is surpassed only by the collaborative relationship between the US and Canada. Conversely, the traditionally strong, albeit at times unequal, relationship between the US and Japan in this field (Finley, 2011; Hamblin, 2000) has dwindled. The pattern between these two countries has also been detected beyond fisheries science (Wagner et al., 2015b) and may be reflective of a number of factors. These may include shifting geopolitical realities (Hamblin, 2000), Japan’s declining publication output (Aksnes and Browman, 2016), and the isolating effect the Anglophone bias in scientific publication has had on a number of countries, including Japan (King, 2004).

7.2.3 Institutional Dynamics

With an eye to further investigating the level of diversity across the field, we aggregated the network of authors at the institutional level (Figs. S4–S7). Spatially, the

159 CHAPTER 7. GLOBAL NETWORK OF FISHERIES SCIENCE

Figure 7.1: Publication percentage per country for the periods 2000–2008 and 2009–2017. Each publication is fractionally credited based on the number of authors and country affiliations. The actual values for the top-25 largest countries can be found in Appendix Table 7.1.


Figure 7.2: The collaboration frequency counts of international country collaborations for the time frames 2000–2008 and 2009–2017. Only the top-10 percent (90th percentile) strongest links of the top-25 largest collaborating countries are shown, sorted clockwise. See Appendix Fig. 7.9 for international and domestic collaborations.

largest institutional cross-border collaborators in the fisheries science network are predominantly located in the Northern Hemisphere, and this pattern remains relatively unchanged. Our analysis does, however, illuminate an increasing typological diversity of institutions across the network. Over time, a number of institutes with a more explicit leaning towards the social sciences, albeit erring on the side of economics (e.g., the Institute of Economic Studies, University of Iceland, and the Socio-Economic Marine Research Institute, NUIG, Ireland), have become prominent institutional collaborators at the international level. Alongside this, reflecting the increasingly multi-actor character of fisheries management, these institutions increasingly comprise a mixture of national institutes, universities, private institutes, and non-governmental organizations (NGOs). This trend—whereby, for example, environmental NGOs have become significant producers of scientific knowledge—has also been noted in other fields related to the socio-environment (Holmes, 2011; da Fonseca, 2003).

7.2.4 Hidden Collaborative Groups

Hitherto we have been speaking about the macro-characteristics of the overall network, albeit at different levels. While this provides a good overview of the global fisheries science network, many of the most important characteristics of a network only become apparent when analyzing the hidden groups within that network (Girvan and Newman, 2002; Newman, 2012b). Thus, to get a more nuanced understanding of the field, we decomposed the network into country clusters and communities of authors (Palla et al., 2005).
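The community decomposition cited here (Palla et al., 2005) rests on clique percolation: k-cliques that share k-1 nodes are merged into one community. Below is a minimal brute-force sketch of that idea on a hypothetical toy graph; the chapter's actual analysis would have used an optimized implementation on the full co-authorship network:

```python
from itertools import combinations

def k_clique_communities(edges, k=3):
    """Brute-force clique percolation: enumerate all k-cliques, link
    cliques that overlap in k-1 nodes, and return the connected
    components of that clique graph as communities (largest first)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    # A k-subset is a clique when every pair inside it is connected.
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: merge cliques sharing k-1 nodes.
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    comms = {}
    for i, c in enumerate(cliques):
        comms.setdefault(find(i), set()).update(c)
    return sorted(comms.values(), key=len, reverse=True)

# Two triangles sharing an edge merge into one community; an isolated
# triangle forms its own.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("b", "d"), ("c", "d"),
         ("x", "y"), ("y", "z"), ("x", "z")]
communities = k_clique_communities(edges, k=3)
```

The overlap criterion is what distinguishes clique percolation from modularity-based methods such as Girvan-Newman: a node may belong to several overlapping communities.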

7.2.5 Country Clusters

Fig. 7.3 presents the main country clusters within the collaboration network. As indicated by the colors, the network divides into distinct clusters, with spatial and temporal variation in clustering visible. Three large distinct country clusters are uncovered within the time period 2000–2008, all of which comprise Northern and Southern partners. Four clusters are visible in the 2009–2017 time period, three of which comprise a mixture of Northern and Southern partners, and one of Southern partners only. Although the clusters are globally dispersed to varying degrees, regions of spatial clustering are visible in all of them. While the countries within each of the clusters have changed over time, all have maintained this spatialized character. This is in keeping with scholarship indicating that, though the bias towards collaboration within territorial borders (regional, national, and linguistic) has decreased over time, spatial proximity remains an important determinant of research collaboration (Hoekman et al., 2010).

In terms of the quantity and quality of collaborative connections, and location within the network, the country clusters are centered on a small group of (mainly Western) countries (Appendix Tables 7.2 and 7.3). Many of the fisheries science powerhouses (as detailed in the previous section) have maintained central positions within the clusters, thus placing them in favorable positions with respect to the control and dissemination of information. The most geographically expansive cluster is centered on the US; the second on North European countries, with Norway, for example, positioned as the best-connected country within that cluster. A further Europe-centered cluster is also evident, with France and Spain the prominent collaborators in this grouping. The fourth cluster, which comprises partners from Africa, Asia, and the Middle East—among them some of the largest aquaculture producers in those regions (FAO, 2018)—is centered on Malaysia and Japan. That said, a number of smaller countries (e.g., Bulgaria, Tunisia, and Cambodia) are positioned favorably on the shortest paths between other authors and thus may be playing important roles as knowledge brokers within their clusters (Newman, 2004).
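Identifying knowledge brokers of this kind typically relies on betweenness centrality: the share of shortest paths on which a node lies. A naive sketch on a small hypothetical graph (production analyses would use Brandes' algorithm, which scales far better):

```python
from collections import deque
from itertools import combinations

def betweenness(adj):
    """Naive betweenness centrality: for every node pair, count the
    fraction of shortest paths that pass through each intermediate
    node. Fine for tiny illustrative graphs only."""
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        paths, queue, best = [], deque([[s]]), None
        while queue:                      # BFS enumerating paths by length
            path = queue.popleft()
            if best is not None and len(path) > best:
                break                     # all shortest paths found
            if path[-1] == t:
                best = len(path)
                paths.append(path)
                continue
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    queue.append(path + [nxt])
        for p in paths:                   # credit intermediate nodes
            for v in p[1:-1]:
                score[v] += 1.0 / len(paths)
    return score

# A hypothetical "broker" node b bridging two cliques sits on every
# shortest path between them.
adj = {
    "a": {"b", "c"}, "c": {"a", "b"},
    "b": {"a", "c", "d", "e"},
    "d": {"b", "e"}, "e": {"b", "d"},
}
scores = betweenness(adj)
```

Here the bridging node accumulates all the betweenness, mirroring how a country on the shortest paths between otherwise unconnected partners can act as a broker.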

7.2.6 Communities of Authors and their Topical Foci

Seeking a finer-grained analysis of the fisheries science landscape, we examined the communities of authors collaborating within the network, across three planes. This involved coupling the social network analysis techniques we have been utilizing thus far with topic modeling, thereby extending the inquiry beyond the spatial and temporal, and adding a topical dimension to the analysis. In excess of 3000 communities of


Figure 7.3: Country clusters ranked 1–4 based on the total number of countries within them, for the periods 2000–2008 and 2009–2017.

authors were identified in the fisheries science network, which we ranked according to the number of authors within them. The distribution of community size across the network is highly skewed: the largest fifty communities comprise in excess of 80 percent of the authors in the network, whilst the remaining 20 percent is composed largely of sole authors or groups of two to three authors (Appendix Fig. 7.14). Fig. 7.4 presents the largest fifteen (ranked 1–15) communities within the network, which together comprise almost 60 percent of the network (communities 16–30 can be viewed in Appendix Fig. 7.15). Though the communities are globally dispersed, all display dense points of regional centralization. Across many, rather than having diminished, this spatial clustering has intensified over time.
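The skew described here can be quantified as the share of authors held by the k largest communities; the size distribution below is hypothetical:

```python
def top_k_share(sizes, k):
    """Share of all authors accounted for by the k largest communities."""
    ranked = sorted(sizes, reverse=True)
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical skewed distribution: a handful of large communities and
# a long tail of singletons and two-to-three-author groups.
sizes = [500, 300, 200, 100] + [2] * 20 + [1] * 60
share = top_k_share(sizes, k=4)
```

With a heavy-tailed distribution like this, a few communities dominate the total even though most communities are tiny.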

Figure 7.4: Spatial distribution of authors within communities for the periods 2000–2008 and 2009–2017. Communities are ranked (1–15) based on the number of authors within the community, and differentiated by different colors. The size of the nodes represents the number of authors within the same location. The spatial distribution of communities 16–30 can be found in Appendix Fig. 7.15.

Figure 7.4 (continued): communities ranked 11–15, for the periods 2000–2008 and 2009–2017.


To varying degrees, each of the fisheries science communities has grown in size over time, with the average number of connections each author has increasing (Appendix Fig. 7.16). The density across each of these fifteen communities is low, however, indicating that although collaboration has increased, only a small number of the potential connections in each of the communities have been realized. This suggests that, when viewed at the individual level, the communities of authors in the fisheries science network are quite loosely knit. As discussed in relation to the global structure of the network, low levels of cohesion may be reflective of a number of things (e.g., fragmentation and division), which could have implications in terms of inhibiting knowledge exchange. In light of our analysis here, this pattern may indicate that though the network has become more collaborative, scholars are engaging in repeated collaborations, rather than forming new links beyond their existing connections (Leahey, 2016; Leahey and Reikowsky, 2008; Saetnan and Kipling, 2016). Our analysis at the country level would seem to support this finding. With respect to the interlinkages between these communities, the most frequent collaborative links are amongst the European, American, and Oceania communities (Fig. 7.5).
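The density measure referred to here is the fraction of possible links that are realized. For an undirected graph with n nodes and m edges it is 2m / (n(n-1)); the community size and link count below are hypothetical, chosen only to illustrate the order of magnitude:

```python
def density(n_nodes, n_edges):
    """Fraction of possible links realized in an undirected graph:
    2m / (n * (n - 1))."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))

# A hypothetical community of 1,000 authors with 2,500 co-authorship
# links realizes only about half a percent of all possible pairs.
d = density(1000, 2500)
```

This is why large communities can look well connected in aggregate while remaining loosely knit at the individual level: the number of possible ties grows quadratically with community size.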

In order to gain a more substantive understanding of the manner in which the authors in the fisheries science network are grouping, we uncovered the topical foci of the network, alongside the community-level and temporal variations therein (Fig. 7.6). In total, sixteen latent topics were identified within our corpus (Appendix Fig. 7.17): Age & growth; Aquaculture (growth effects); Aquaculture (health effects); Climate effects; Diet; Diseases; Gear technology & bycatch; Genetics; Habitats; Immunogenetics; Management; Models (estimation & stock); Physiology; Reproduction; Salmonids; and Shellfish. By and large, these topical foci are relatively consistent with previous analyses of the content of fisheries science (Aksnes and Browman, 2016; Jarić et al., 2012; Syed et al., 2018a). This content has been discussed at length elsewhere, most recently by Syed et al. (2018a), and is therefore not reported on in detail here. That said, uncovering these foci is instructive, as it allows us to investigate whether the fisheries science communities are clustered around particular or similar topics (Clauset et al., 2004), how these may have changed, and how they might be related (Moody and Light, 2006).
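The topic-modeling step can be illustrated with a toy collapsed Gibbs sampler for LDA. This is a sketch of the general technique, not the model actually fitted in this chapter; the mini-corpus, hyperparameters, and two-topic setting are hypothetical:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. Returns per-document topic
    distributions (theta). Illustrative only; real corpora need
    optimized samplers or variational inference."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    z = []                                       # topic of each token
    ndk = [[0] * n_topics for _ in docs]         # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                          # tokens per topic
    for d, doc in enumerate(docs):               # random initialization
        assignments = []
        for w in doc:
            t = rng.randrange(n_topics)
            assignments.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(assignments)
    for _ in range(n_iter):                      # resample every token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                      # remove current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + vocab_size * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [[(ndk[d][k] + alpha) / (len(doc) + n_topics * alpha)
             for k in range(n_topics)]
            for d, doc in enumerate(docs)]

# Hypothetical mini-corpus of titles reduced to keywords.
docs = [
    "salmon aquaculture feed growth".split(),
    "stock assessment model estimation".split(),
    "aquaculture salmon disease feed".split(),
]
theta = lda_gibbs(docs, n_topics=2)
```

The per-document distributions (theta) are what allow topical profiles to be aggregated per community, as done for Fig. 7.6.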

In terms of community-level and temporal variations in topics, while there are commonalities, and each of the communities contains authors engaged in the sixteen topics we have identified, variations in this respect are evident. For example, whilst our analysis indicates an almost across-the-board increase in publication output focused on Management, reflecting the increasing propensity of fisheries scientists in the West to focus their attention on managing human interactions with the natural environment, rather than managing fish per se (Bavington, 2010), the most intense focus on this topic is seen across the European, North American, and Oceania communities. In contrast, a much weaker focus on this topic is seen within the China-, Japan-, and Iran-centered communities. Seeking to investigate this further, we calculated the similarities in cumulative topic distributions across the communities (Appendix Fig. 7.18). Our analysis reveals greater topical similarities among the North American, European, and Oceania communities. A comparable pattern of similarity is discernible across the
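Comparing cumulative topic distributions across communities requires a divergence measure between probability vectors. Jensen-Shannon divergence is one common choice, shown here on hypothetical community profiles; it is an assumption, not stated in the text, that this matches the measure behind Appendix Fig. 7.18:

```python
from math import log

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between
    two topic distributions; 0 means identical topical profiles."""
    def kl(a, b):
        return sum(x * log(x / y, 2) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical cumulative topic proportions over four topics
# (Management, Models, Aquaculture, Diet) for three communities.
north_american = [0.40, 0.35, 0.15, 0.10]
european       = [0.38, 0.32, 0.18, 0.12]
asia_centered  = [0.10, 0.15, 0.45, 0.30]

close = jensen_shannon(north_american, european)
far = jensen_shannon(north_american, asia_centered)
```

Unlike raw KL divergence, the Jensen-Shannon variant is symmetric and finite even when one distribution assigns zero probability to a topic, which makes it a convenient pairwise similarity for clustering communities by topical profile.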



Figure 7.5: Inter-community collaboration for the top-15 largest communities (ranked 1–15). The geographical label for each of the 15 communities indicates where the majority of authors within the community are spatially located.

topical output of a number of communities centered on newer entrants to the network (e.g., European South, Eastern Europe, Brazil, and Iran-centered communities).

Overall, a distinct geography of topics is detectable across the network. In this respect, a clear division of focus across the communities is seen, whereby the topics Management, Models (estimation & stock), Gear technology & bycatch, and Habitats are areas of central, and in some instances intensifying, focus for a number of the largest (Western-centered) communities in the network. On the other hand, a stronger topical leaning towards aquaculture-related topics (e.g., Aquaculture (growth effects), Diet, Diseases, and Immunogenetics) is seen across the communities centered on the large emerging economies, many of whom are large aquaculture (and fish feed) producers (FAO, 2018). That said, a focus on aquaculture is also seen amongst European and North American communities that are concentrated in regions with large-scale interests in aquaculture production (e.g., Norway, the US, Spain, and Eastern Europe (Österblom et al., 2015; FAO, 2018)). For instance, we see a strong topical focus on Salmonids amongst a number of North American and North European communities, which may well be aquaculture related. In this regard, previous analysis has detected an increased focus in fisheries science on aquaculture species such as Atlantic salmon and rainbow trout (Aksnes and Browman, 2016), both of which are Salmonids.

Figure 7.6: The proportion of topics published within each of the 15 largest communities for the periods 2000–2008 and 2009–2017. The number of publications published by each of the 15 communities, for each time period, is shown in parentheses. For example, (1) Eur-Aus (4244) is the largest community (rank 1), with most of its authors spatially located in Europe and Australia, and 4,244 publications published between 2000–2008. These 4,244 publications cover aspects of Age & Growth for 7.7%, Aquaculture (growth effects) for 6.2%, and so on.


7.3 Discussion

As science has become increasingly internationalized, scholars investigating the shifting spatial structure underlying scientific practice (Hoekman et al., 2010) have posed questions as to whether networks of research collaboration are expanding in every region of the globe (Adams, 2012). Others have suggested that a globalized science may open up research fields in a generative manner, to new perspectives that challenge underlying assumptions, develop new methods, and point to previously unrecognized biases (Frickel et al., 2010). In this sense, scientific networks may be understood as reflecting not only authors, but people, actors, organizations, and things that uphold scientific patterns and beliefs, with different networks having different epistemological and ontological implications (Forsyth, 2003). The picture we have uncovered of fisheries science is in many respects similar to the broader trend in scientific output, which has led some scholars to suggest that the historically dominant 'Atlantic Axis' (an axis that has also been dominant in the production of fisheries science) is unlikely to be the main focus of research in the coming decades (Adams, 2012). In a sense, given China's rapid growth over our timeframe, and its outpacing of the US in terms of total volume of scientific papers published in 2016 (Tollefson, 2018), this does seem likely. That said, it remains to be seen what this shift means for the production, shape, or order of knowledge (Escobar, 2004) in fisheries science.

7.3.1 A Bourdieusian Perspective

Though there seems to be broad acceptance that collaboration is a good thing (Adams, 2013; Katz and Martin, 1997), scholars have cautioned against viewing increased global collaboration as an unquestionable good (Adams, 2012; Katz and Martin, 1997; Leahey, 2016; Xie, 2014). For instance, as research team size and internationalism have become yet another metric against which science is judged (Xie, 2014), trends towards stratification in scientific collaboration patterns at both the institutional and individual levels have been detected (Dahdouh-Guebas et al., 2003; Jones et al., 2008; Xie, 2014). Regarding collaboration at the international level, scholars have argued that the manner in which the emerging geography of science is developing reflects historical patterns of Western control and bias (Peters, 2006). Despite this, recent analysis has indicated that empirical work on collaboration tends to be heavily skewed towards the benefits of collaboration (Leahey, 2016; Xie, 2014). This may, we suggest, reflect a tendency of scholars to view scientific fields as largely consensual spaces. Failing to take seriously the role that power plays in shaping these spaces, however, makes it very difficult to distinguish between cooperation based on equality and mutuality, and that which is based on domination and subordination (Albert and Kleinman, 2011).

In this sense, sociologists of science have convincingly shown that the structure of scientific knowledge in any field reflects a combination of micro- and macro-sociological factors (Bourdieu, 1975; Cetina, 1999; Forsyth, 2003; Law, 1987; Mol et al., 2002).


Adopting an explicitly Bourdieusian perspective helps us understand the role of power (understood as the capacity to define what legitimate science is) in these processes (Albert and Kleinman, 2011; Bourdieu, 1975, 1991): for example, in directing the topics pursued, the methodologies adopted, the journals in which research is published, or those we might collaborate with (Bourdieu, 1975). Conceiving of the field in this manner also helps us take seriously the role of consumers (e.g., policy makers, funding agencies, industry, and so on) in determining the structure of the scientific field (Albert and Kleinman, 2011; Bourdieu, 1991). In this respect, historians of fisheries science have been astute in highlighting that much of fisheries science has been based on Western ideas about fish (Finley, 2011), and about how fisheries might be managed (Bavington, 2010). Historians have also argued that the direction and structure of fisheries research has long been shaped by transitory economic and political forces (Bavington, 2010; Finley, 2011; Smith, 1994). This, it has been suggested, has provided the opportunity and encouragement for the development of research programs in certain areas over others, working to side-line longer-term economic, social, and scientific goals, and limiting the development of the scientific field in the process (Smith, 1994).

7.3.2 Democratizing Fisheries Science?

Considering these issues with respect to our analysis here, it is worth noting that although a spirit of internationalism has always animated the field (Hamblin, 2000; Rozwadowski, 2004), the bulk of fisheries science has long been produced by states with significant fishing interests around the globe (Finley, 2011; Smith, 1994). In this regard, while the geography of fisheries science (Adams, 2012) may have expanded, this pattern has not. The largest fisheries research nations (Aksnes and Browman, 2016), including the new entrants, are countries with highly industrialized fishing fleets or significant aquaculture interests (FAO, 2018; Kroodsma et al., 2018). Thus, whilst the arrival of new entrants might in one sense be seen as a shift towards an increasingly democratized global network of science (Xie, 2014), it may well work to further marginalize some actors (Jones et al., 2008; Xie, 2014): for instance, less "developed" countries with significant interests in fisheries in terms of food security and livelihoods (Oliveira Júnior et al., 2016).

A number of regions remain marginal in this system despite increasing volumes of fisheries-related knowledge produced by authors in Asia and Latin America (much of which is aquaculture related). For example, despite strong relative growth rates over the past decades (Aksnes and Browman, 2016), output from the Middle East is negligible when viewed at the macro-level. Similarly, standing as a stark reflection of the inequalities in output between developed and developing countries in this field (as in others) (Jarić et al., 2012), the African continent remains without any large hubs of production. Though the network has become more geographically extensive, this extension appears to be mirroring the shifting patterns of fisheries production, and the growing contribution of aquaculture to the global production of fisheries, much

of which is produced in Asia (FAO, 2018), rather than necessarily mirroring a shift towards an increasingly democratized global network of science (Forsyth, 2003; Xie, 2014).

In light of the existing inequality in publication output, it seems reasonable to suggest that the collaboration patterns we have depicted may serve to amplify inequalities within this domain. For instance, according to our analysis, large emerging economies and developed countries are the largest cross-border collaborators. Further, although prominent North-North and North-South collaborations are visible within the fisheries science network, South-South collaboration remains peripheral when evaluated from a global perspective (Leydesdorff et al., 2013). As is the case in other fields, much of the publication output produced by developed countries displays an increasingly international character, whilst large swathes of the research published in emerging economies remains entirely domestic (Adams, 2013; IOC-UNESCO, 2017) (see also Appendix Fig. 7.9). Given that internationally collaborated scientific papers are more likely to be published and cited, and are therefore more visible (Adams, 2012; Katz and Martin, 1997), these patterns could further side-line work by authors from countries that are already marginalized within this research system. These patterns may also work to reinforce the dominant ways of thinking in the field towards perspectives from the Northern Hemisphere (Forsyth, 2003), which have previously been cited as problematic within this domain (Francis, 1980).

7.3.3 Systems of Regionalization

The fisheries science landscape we have uncovered depicts a more regionalized than globalized system of knowledge production. In this regard, existing research has shown that scientific collaboration at the international level is shaped by the dynamic interplay of geographical, political, economic, historical, cultural, and linguistic factors (Adams et al., 2014; Dahdouh-Guebas et al., 2003; Hoekman et al., 2010; Katz, 1994; Katz and Martin, 1997; Saetnan and Kipling, 2016). Our analysis suggests that a complex mix of these is at play in the fisheries science network. In line with work in other fields (Hoekman et al., 2010; Katz, 1994; Leahey, 2016; Parreira et al., 2017), even as collaboration has become increasingly internationalized, spatial proximity remains an important feature of the collaborative entanglements within the field, and this is seen across the country clusters (Fig. 7.3) and communities of authors (Fig. 7.4) uncovered within the network. This feature of collaboration is in itself reflective of an array of overlapping factors. Among these are regional political groupings, e.g., trade blocs (Parreira et al., 2017), funding mechanisms or opportunities that remain at the national and regional levels (Hoekman et al., 2010), colonial ties (Adams et al., 2014), and no doubt overlapping fishing interests—proximate and distant.

Considering this latter point further, it has been suggested that different scientific fields might have specific “spatial requirements” due to their research topics (Hoekman et al., 2010). For example, collaborative proximity may be due to environmental similarities

among countries. It may therefore make sense that researchers focused on similar geographical areas or biomes would work together (Parreira et al., 2017). As regards fisheries, this seems reasonable given that many countries share closely overlapping fishing grounds, and thus proximate fisheries interests, across shared ecoregions. Indeed, historians have shown that the requirements of the marine environment—for instance, the de-territorializing impulse of fish and the sea (Bear, 2013)—have historically been amongst the drivers of internationalization in the field (Hamblin, 2000; Rozwadowski, 2004). However, our analysis suggests that distant fishing interests also breed collaboration, as do distant colonial ties. For instance, with respect to the country clusters, the fishing interests of France and Spain, which extend along the Eastern Tropical Atlantic and Western Indian Ocean (Campling, 2012), or those of the US, which extend into the Pacific region (Hamblin, 2000; Havice, 2018), might reasonably be highlighted. As a further example, the concentrated research links that France has with its former colonies in North-West and West Africa have been well documented (Adams et al., 2014).

7.3.4 Collaboration Styles

As indicated, sociologists of science have also stressed the role of micro-level characteristics in shaping the structure of scientific fields (Bourdieu, 1975). In this regard, an additional driver of collaboration highlighted in the literature is preferential attachment at the individual level (Wagner and Leydesdorff, 2005). For example, studies have indicated that authors have a tendency to collaborate with "like-minded others" (Leahey, 2016), which may lead to a particular style of collaboration. In this respect, whilst we have detected an increasingly collaborative field, albeit a regionalized one, our analysis of the fisheries science network has identified structural characteristics indicating that the style of collaboration authors are engaging in is a thin one. This pattern may reflect the tendency of scholars to work within their own networks, rather than forming new links beyond those (Saetnan and Kipling, 2016), and may be driven by an array of factors. For instance, scholars might engage in repeat collaborations—which offer returns in terms of trust building and increased certainty—in an effort to mitigate the cost of collaboration. They might also select collaborators from within their own specialty areas, who share areas of expertise and methodological or theoretical perspectives (Leahey, 2016).

In a sense, given the increasingly specialized nature of science (Casadevall and Fang, 2014), including fisheries science (Mather et al., 2008), we might expect this. Indeed, existing research has highlighted that specialization and collaboration in science are not unrelated (Leahey and Reikowsky, 2008). On the one hand, it is precisely this specialization that is driving the need for collaboration (Casadevall and Fang, 2014; Leahey and Reikowsky, 2008). On the other, specialization has been found to inform collaboration strategies, with scientists often having a tendency to engage in within-specialty collaboration rather than complementary collaboration that spans

boundaries (Leahey and Reikowsky, 2008). A Bourdieusian perspective helps us to understand the role that power may play in directing these choices, and thus the structure of the scientific field in a broad sense (Bourdieu, 1975). This might, for example, explain why research has found that the steps towards interdisciplinary science over the past three decades have actually been very small, oftentimes drawing on neighboring fields and only modestly increasing the connections to areas further afield (Porter and Rafols, 2009). The danger with such a strategy, however, is that it can become a reinforcing style of collaboration (Leahey, 2016), with potential costs in terms of the production of novel information and exposure to heterogeneous ideas (Blondel et al., 2008). Given that much advancement in fisheries science has been cited as coming from the branches of the discipline rather than the roots (Francis, 1980), this pattern may hinder the field from developing in a direction that could equip it to address some of its ongoing challenges.

7.3.5 The Topical Landscape of Fisheries Science

A distinct geography of topics has been detected across the field. Though likely reflective of a combination of the macro- and micro-sociological characteristics we have discussed, this geography is further suggestive of the political and economic influences directing the research priorities in this field, and the continuing dominance of specific ideas about fish and fisheries within this space. Unsurprisingly, given that Western fisheries management has been built on, and remains based upon, calculations of maximum sustainable yield and the allocation of quotas (Campling et al., 2012; Finley, 2011; Nielsen and Holm, 2007; Smith, 1994; Winder, 2018), a number of the largest communities in the network remain heavily focused on Models (estimation & stock). Fisheries scientists themselves have suggested that more attention has been paid here to developing sophisticated ways of fitting analytic models than to the actual assumptions underpinning these models (Francis, 1980). This, in itself, is not unrelated to the demands on fisheries scientists to provide numbers for policy. Similarly, reflecting the heavy spotlight on discarding in fisheries over the past two decades or so (Alverson et al., 1994; Borges, 2015; Kelleher, 2005), Gear technology & bycatch is a further area of strong topical focus for the largest community of authors within the network. Further, that a large proportion of the publication output of fisheries science is increasingly commanded by aquaculture-related topics is not unrelated to the rapid investment, and consequent expansion in production, this area has seen (Aksnes and Browman, 2016; Winder, 2018). As capture fisheries have continued to diminish, this area has been given increasing priority both by fisheries managers (Bavington, 2010; Winder, 2018) and by governments as a growth strategy under the rubric of 'blue growth' (Barbesgaard, 2018; Hadjimichael, 2018; Winder, 2018; Winder and Le Heron, 2017). This is reflected not only in the topical foci we have uncovered in the fisheries science network, but across the entire structure of the network.


7.4 Limitations and Ways Forward

There are several limitations of our study that may adversely influence our findings. The more technical methodological limitations have been detailed in the methods section of the paper; here we discuss the more general limitations and suggest possible avenues for further research that might address them. In terms of our corpus, whilst this comprises high-ranking journals in fisheries science, we acknowledge that it does not capture the entire spectrum of work being done in the area of fisheries. The fisheries category in Web of Science (WOS) is delimited to the SCIE, and thus skewed towards work coming from the natural sciences. Though there are few dedicated fisheries journals in the social sciences or arts and humanities, a number do exist beyond this category (e.g., Marine Policy, Ocean and Coastal Management). It is also important to highlight that important fisheries-related work is published beyond these specialized outlets. For example, work oriented towards the social sciences or humanities may be published in journals that have a more general focus, or a stronger leaning towards those sciences (e.g., Journal of Agrarian Change). Further, by virtue of being confined to scientific journal articles, our analysis does not include books or grey literature. It has been suggested, in this respect, that a significant proportion of fisheries-related research is published by national institutes, and may not be visible in the scientific literature (Aksnes and Browman, 2016). In conjunction, as we have touched upon, the results of our analysis are based entirely on English publications, and thus reflect an Anglophone bias (King, 2004). For instance, with respect to our findings, the absence of Russia, the fourth largest producer of capture fisheries in the world (FAO, 2018), is suggestive here.
Though this absence is likely reflective of a number of factors (Bornmann et al., 2015), it is at least in part a reflection of the linguistic bias of our dataset, and thus an underrepresentation. The same is likely the case for other linguistic communities across the field. In this regard, however, it is not irrelevant that this bias is reflected across science, and thus likely plays a role in structuring the shape of science (King, 2004), including scientific collaborations.

Boundaries have to be drawn in all research. However, shifting the boundaries we have applied here, and drawing on an expanded dataset that captures a wider spread of fisheries-related work, might prove a fruitful avenue for further research and may overcome a number of the limitations we have detailed with respect to this study. Doing so would go a long way towards capturing a broader spectrum of fisheries-related knowledge, beyond the traditional domain of ‘fisheries science’. We also suggest that a more in-depth analysis of the styles and drivers of collaboration, and of collaboration inequalities in this field, is an avenue of inquiry worthy of pursuit. In this regard, our analysis could be expanded further by focusing on the funding sources underpinning fisheries science. For instance, studies have indicated that one factor that works to bias research activities towards core regions, particularly when funding is intended to serve the interests of national or regional research-performing entities, is unequal funding opportunities (Hoekman et al., 2010). Investigating this with respect to fisheries science, drawing on an extended corpus, is a specific next step in developing this research.

7.5 Conclusion

Broad-based collaboration, it is argued, is crucial to solving the ongoing challenges with respect to fisheries. In light of this ‘collaboration imperative’, we have mapped and examined the landscape of scientific collaboration across the field of fisheries science. Overall, our analysis has presented a shifting field that has become increasingly collaborative, though less cohesive, with a number of key players maintaining hegemonic positions within the network. By and large, the most productive (and collaborative) countries in terms of fisheries science are those which have large industrialized fisheries-related interests, many of them global in nature. Although the collaboration network has become more extensive, it has also become more intensive in places, with a clear spatial pattern evident in the structure of scientific collaborations across the field. In this respect, the fisheries science landscape is one whereby the centers of knowledge production and the connections between them display trends more akin to regionalization than globalization.

Some of the characteristics of the network suggest that authors across the field may be engaging in a repeat, rather than broad, style of collaboration, which may work as a reinforcing mechanism with respect to the knowledge that is produced by the field. This pattern is likely to limit the potential gains of collaboration, and could have consequences for pushing the boundaries of fisheries science in new and fruitful ways, in a manner which may help address some of the ongoing challenges within the field. Though likely shaped by an array of both micro- and macro-sociological factors, the patterns of collaboration and the geography of topics uncovered across the field betray a number of political-economic influences, which merit reflection by policy makers and scientists alike.

7.6 Materials and Methods

In order to map the landscape of fisheries science collaborations, we employed a combination of social network analysis techniques and topic modeling. Similar approaches have previously been used to characterize scientific fields, both in terms of their collaborative structures (Moody, 2004; Newman, 2001, 2004) and topical foci (Syed et al., 2018a; Syed and Weber, 2018). Our analysis of the network was conducted at progressively finer levels of granularity (Ding, 2011), progressing from the overall macro-structure to the hidden subgroups within the network. In doing so, we examined at each level of granularity both the structure (links) and individual properties (nodes) of the network (McPherson et al., 2001). Further, given that social networks are not static over time, but rather evolve (Bodin and Crona, 2009), alongside analyzing the entire network, we examined our dataset at two time intervals: 2000–2008 and 2009–2017.

7.6.1 Data Collection

Fisheries science publications were selected based on the fisheries category as defined by the Science Citation Index Expanded (SCIE). This category spans a list of 50 journals covering all aspects of fisheries science, technology, and industry. All 50 journals (Appendix Table 7.4) were included, and all articles published between 2000 and 2017 were selected. The Scopus developer API was subsequently utilized to extract article data such as abstracts, authors, and affiliations. Specifically, the Scopus Abstract Retrieval API provides all (meta)data associated with a particular article. The Scopus unique identifier for authors and affiliations was used to disambiguate authors and affiliations with identical names, and to merge the same author appearing under different names. For affiliations without an affiliation ID, a surrogate key was constructed by concatenating all parts of the affiliation address. A filtering process was used to exclude non-English articles and those that did not constitute a research article (such as errata) or contained no abstract. A total of 73,240 articles were deemed fit for further analysis, with a total of 106,137 authors and 100,175 affiliations. The Google Geocoding API was used to convert affiliation addresses into geographic coordinates (latitude–longitude).
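The filtering and surrogate-key steps described above can be sketched as follows. The record structure, field names, and values are illustrative assumptions, not the exact Scopus schema.

```python
# Sketch of the filtering step, assuming article records have already been
# retrieved from the Scopus Abstract Retrieval API as dictionaries.
# Field names and codes here are illustrative, not the exact Scopus schema.

def is_eligible(article):
    """Keep English research articles that have a non-empty abstract."""
    if article.get("language") != "eng":
        return False
    if article.get("document_type") != "ar":  # assumed code for research article
        return False
    if not article.get("abstract"):
        return False
    return True

def surrogate_affiliation_key(affiliation):
    """Fallback key for affiliations lacking a Scopus affiliation ID:
    concatenate all parts of the affiliation address."""
    parts = [affiliation.get(k, "") for k in ("name", "city", "country")]
    return "|".join(p.strip().lower() for p in parts)

articles = [
    {"language": "eng", "document_type": "ar", "abstract": "Stock assessment ..."},
    {"language": "eng", "document_type": "er", "abstract": ""},  # erratum: dropped
]
kept = [a for a in articles if is_eligible(a)]
```

In the study itself, the eligible records were then passed to the geocoding step to resolve affiliation addresses into coordinates.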

7.6.2 Social Network Analysis

The co-author networks were constructed by linking two authors (i.e., nodes) on the basis of co-authorship (i.e., edges). The frequency of collaborations between two authors defined the weight of the edge spanning the two nodes. The resulting network was subsequently analyzed utilizing social network analysis, which provides an array of statistics (Appendix Table 7.5) for doing so (Leydesdorff and Wagner, 2008; Newman, 2001).
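The construction described above can be sketched without a graph library: each co-authored paper adds one to the weight of the edge between every pair of its authors. The paper lists and author names below are hypothetical.

```python
# Minimal sketch of the weighted co-author network: authors are nodes and
# each paper increments the weight of the edge between every author pair.
from collections import Counter
from itertools import combinations

papers = [
    ["Smith", "Jones", "Lee"],   # hypothetical author lists per paper
    ["Smith", "Jones"],
    ["Lee", "Khan"],
]

edge_weights = Counter()
for authors in papers:
    # sort so that (A, B) and (B, A) count as the same undirected edge
    for pair in combinations(sorted(set(authors)), 2):
        edge_weights[pair] += 1
```

The resulting weighted edge list can be loaded directly into a network analysis library to compute the statistics in Appendix Table 7.5.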

7.6.3 Hidden Groups

Most real networks contain groups within which the nodes are more tightly connected to each other than to the rest of the network, oftentimes referred to as clusters or communities (Palla et al., 2005). These groupings might be connected in various ways (e.g., topic, location, history), with studies indicating that links are often homophilous (McPherson et al., 2001). Uncovering these a priori unknown groups allows for the identification of functional units within a system (Blondel et al., 2008), alongside their structural properties (Newman, 2012a), which can vary widely, and may, with respect to our interests here, be consequential in terms of knowledge creation (Granovetter, 1973; Lambiotte and Panzarasa, 2009). Thus, utilizing community detection techniques in an effort to gain a more nuanced understanding of the network, we decomposed the network into country clusters and communities of authors. To better characterize these hidden groups, we utilized centrality measures, which are reflective of the importance and effectiveness of particular nodes within the network (Bodin and Crona, 2009; Freeman, 1978; Newman, 2012a).

7.6.4 Community Detection

To detect community structures, we used the Louvain algorithm (Blondel et al., 2008), extended with a time parameter to allow for community detection at various resolutions (Lambiotte et al., 2014). The inclusion of a time parameter increases community stability and aims to ameliorate community size bias (Fortunato and Barthelemy, 2007). In a comparative study (Lancichinetti and Fortunato, 2009), the Louvain community detection algorithm was found to have ‘excellent performance’ on several classes of benchmark graphs (Girvan and Newman, 2002; Lancichinetti and Fortunato, 2009), although benchmark performance may not necessarily align with broader real-world situations (Newman, 2012a). We performed a grid search (Appendix Fig. 7.19) on the parameter space of the resolution parameter (from 0.1 to 2.0 in steps of 0.1) and, due to the heuristic nature of the Louvain algorithm, conducted 10 different random initializations for each resolution value. In doing so, we aimed to find communities with high modularity (Newman, 2003; Newman and Girvan, 2004), a measure that quantifies the quality of the detected communities. A similar process was performed to detect communities or clusters of countries (Appendix Figs. 7.20 and 7.21). Inter-community collaboration was measured by the (weighted) edges traversing communities within an induced community graph, where communities are represented as nodes themselves. The frequency of domestic and international collaborations between countries was quantified by creating an adjacency matrix of co-authors and their country affiliations. A similar process was performed to quantify collaborations between affiliations. Domestic collaborations were restricted to co-authors with the same country affiliation.
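The modularity score that the grid search maximizes can be sketched in plain Python; the study itself relied on the Louvain implementation with a resolution parameter, which this toy version omits.

```python
# Stdlib sketch of (Newman-Girvan) modularity for an unweighted, undirected
# edge list and a node -> community mapping:
#   Q = sum over communities c of (e_c / m - (d_c / 2m)^2)
# where m is the total edge count, e_c the number of intra-community edges,
# and d_c the summed degree of community c.

def modularity(edges, comm):
    m = len(edges)
    intra = {}    # edges with both endpoints in the same community
    deg_sum = {}  # summed degree per community
    for u, v in edges:
        if comm[u] == comm[v]:
            intra[comm[u]] = intra.get(comm[u], 0) + 1
        deg_sum[comm[u]] = deg_sum.get(comm[u], 0) + 1
        deg_sum[comm[v]] = deg_sum.get(comm[v], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in deg_sum.items())

# Two triangles joined by a single bridge edge: the natural two-community
# split scores higher than lumping every node together.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {n: "a" for n in range(6)}

assert modularity(edges, two_communities) > modularity(edges, one_community)
```

In the grid search, a score of this kind was computed for each of the ten Louvain runs at every resolution value, and the partition with the highest modularity was retained.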

Detecting communities in networks (e.g., social, biological, citation, or metabolic networks) generally falls into two classes: discovering non-overlapping communities, where each node belongs to a single community (Blondel et al., 2008; Clauset et al., 2004; Decelle et al., 2011a,b; Fortunato, 2010; Hofman and Wiggins, 2008; Newman and Girvan, 2004; Newman and Leicht, 2007; Nowicki and Snijders, 2001), or overlapping communities, where nodes can belong to several communities (Ahn et al., 2010; Airoldi et al., 2008; Ball et al., 2011; Derényi et al., 2005; Gopalan and Blei, 2013; Gregory, 2010; Lancichinetti et al., 2011; Viamontes Esquivel and Rosvall, 2011). Increasingly, real-world networks can be characterized as overlapping (Palla et al., 2005), and the most general formulation of a community detection algorithm should ideally accommodate both overlapping and non-overlapping communities (Ball et al., 2011). A major drawback of overlapping community detection algorithms, however, is that the number of communities within a network needs to be known in advance (Ball et al., 2011). Typically, this number is unknown, although recent studies have attempted to apply Bayesian inference and Monte Carlo methods to estimate it (Newman and Reinert, 2016; Riolo et al., 2017). However, a successful application of such methods depends heavily on the choice of an appropriate prior probability. Community detection algorithms based on modularity maximization (a quality index for partitioning networks into communities) circumvent this drawback, but might bias the community sizes they uncover (Ball et al., 2011; Bickel and Chen, 2009; Fortunato and Barthelemy, 2007); typically, they fail to find very small communities. The Louvain algorithm (Blondel et al., 2008) used in this study performs such modularity maximization, and both the number of communities and the division into communities are determined automatically. However, the Louvain algorithm treats communities as disjoint (non-overlapping), forming a technical methodological limitation with respect to our study. Thus, explicitly identifying nodes that bridge communities could be an interesting direction for future research.

7.6.5 Topic Modeling

To uncover latent topics, the topic model Latent Dirichlet Allocation (LDA) (Blei, 2012; Blei et al., 2003) was used. All pre-processing steps to suitably prepare documents for statistical topic inference (Hoffman et al., 2010) are described in our previous work (Syed et al., 2018a); these steps are highly optimized for the fisheries domain (Syed and Spruit, 2017, 2018a). The number of topics to uncover was determined by performing a grid search on the parameter space of the various LDA hyper-parameters (Appendix Fig. 7.22), and the quality of topics was calculated by a topic coherence score (Röder et al., 2015). For readability, topics were labeled by fisheries domain experts via close inspection of each topic’s top words (Appendix Fig. 7.17), titles (Appendix Table 7.6), abstracts, and a visual representation through multidimensional scaling (Appendix Fig. 7.23) (Chuang et al., 2012; Sievert and Shirley, 2014). The Hellinger distance (Hellinger, 1909) was used to calculate the similarities between cumulative topic distributions of communities (Appendix Fig. 7.18).
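The Hellinger distance used to compare communities’ cumulative topic distributions is straightforward to compute; a minimal sketch with toy three-topic distributions (the numbers are illustrative):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions
    over the same support: 0 for identical distributions, at most 1."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

# Toy cumulative topic distributions for two communities (K = 3 topics)
community_a = [0.6, 0.3, 0.1]
community_b = [0.5, 0.3, 0.2]

assert hellinger(community_a, community_a) == 0.0
assert 0.0 < hellinger(community_a, community_b) < 1.0
```

Being symmetric and bounded on [0, 1], the measure lends itself to the pairwise community comparison matrix shown in Appendix Fig. 7.18.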

Appendix



Figure 7.7: Social network analysis metrics obtained from the full network of 106,173 authors from 73,240 publications from 2000–2017. Explanation of metrics is provided in Table 7.5.


Figure 7.8: The frequency counts of publications and journals included in the dataset ranging from 2000–2017.


Figure 7.9: The collaboration frequency counts of domestic and international country collaborations for the time frame 2000–2008 and 2009–2017. Only the top-25 largest collaborating countries are shown, sorted clockwise.


Figure 7.10: The collaboration frequency counts of only the international collabo- rations amongst affiliations for the time frame 2000–2008. Only the top-10 percent highest frequency links for the top-25 largest collaborating affiliations are shown.


Figure 7.11: Similar to Fig. 7.10 but for the time frame 2009–2017.


Figure 7.12: The collaboration frequency counts of the domestic as well as the inter- national collaborations amongst affiliations for the time frame 2000–2008. Only the top-10 percent highest frequency links for the top-25 largest collaborating affiliations are shown.


Figure 7.13: Similar to Fig. 7.12 but for the time frame 2009–2017.


Figure 7.14: Community size distribution (top) for the top-50 largest communities (by number of authors) sorted by rank; rank 1 being the largest, rank 2 the second largest, and so on. The bottom figure displays the community size cumulative percentage in relation to all the authors (nodes) in the network. For example, the top-10 (i.e., rank 1-10) communities account for over 40% of the total number of authors in the network.

Figure 7.15: Spatial distribution of communities ranked 16–30, differentiated by color, with the size of the node representing the number of authors. Panels show 2000–2008 and 2009–2017.

Figure 7.15: Continued with communities ranked 26–30.

Figure 7.16: Social network metrics for the top-15 largest communities sorted by rank for the two time intervals (2000–2008 and 2009–2017).


(1) DISEASES (2) REPRODUCTION (3) HABITATS (4) SALMONIDS word prob. word prob. word prob. word prob. INFECTION .020 FEMALE .039 FISH .035 RIVER .035 DISEASE .019 EGG .037 SPECIE .033 TROUT .028 ISOLATE .019 MALE .029 HABITAT .019 FISH .028 MEAL .018 STURGEON .019 LAKE .013 SALMON .021 SHRIMP .018 SEX .015 SITE .010 RAINBOW .012 FISH .018 STAGE .014 PREY .009 TAG .011 STRAIN .017 REPRODUCTIVE .014 ABUNDANCE .009 HATCHERY .011 PARASITE .015 SPAWNING .013 COMMUNITY .009 LAKE .009 VIRUS .010 SPAWN .013 STUDY .008 ONCORHYNCHUS .009 MORTALITY .009 SPERM .011 TILAPIA .007 RAINBOW TROUT .008

(7) MODELS - (5) GENETICS (6) CLIMATE EFFECTS (8) AGE & GROWTH ESTIMATION & STOCK word prob. word prob. word prob. word prob. POPULATION .035 TEMPERATURE .033 MODEL .034 LENGTH .053 GENETIC .024 WATER .022 ESTIMATE .023 GROWTH .034 SPECIE .022 YEAR .015 STOCK .020 SIZE .030 ANALYSIS .011 SUMMER .012 DATUM .016 MM .029 SAMPLE .011 PERIOD .011 POPULATION .015 AGE .028 REGION .009 CHANGE .011 FISHERY .014 WEIGHT .027 STUDY .008 ABALONE .010 RATE .013 FISH .020 LOCUS .007 HIGH .010 MORTALITY .013 CM .015 VARIATION .007 WINTER .010 CATCH .010 TOTAL .014 MARKER .007 SPRING .010 SIZE .010 ESTIMATE .013

(10) AQUACULTURE - (9) DIET (11) PHYSIOLOGY (12) IMMUNOGENETICS GROWTH EFFECTS word prob. word prob. word prob. word prob. DIET .056 DAY .023 FISH .034 CELL .019 FEED .052 LARVAE .021 ACTIVITY .019 GENE .019 FISH .030 GROWTH .021 CATFISH .018 CARP .014 PROTEIN .024 RATE .019 LEVEL .016 TISSUE .014 ACID .021 SURVIVAL .017 CONTROL .015 SEQUENCE .012 LEVEL .015 HIGH .013 INCREASE .015 PROTEIN .011 DIETARY .014 POND .012 EFFECT .014 EXPRESSION .010 GROWTH .014 GROUP .012 GROUP .014 ANALYSIS .008 LIPID .014 FEED .011 STRESS .013 SHOW .008 WEIGHT .014 EXPERIMENT .010 RESPONSE .012 MUSCLE .008

(13) AQUACULTURE - HEALTH EFFECTS   (14) SHELLFISH     (15) GEAR TECHNOLOGY & BYCATCH   (16) MANAGEMENT
word           prob.                word       prob.   word     prob.                   word         prob.
WATER          .029                 OYSTER     .033    CATCH    .021                    FISH         .012
CONCENTRATION  .024                 SHELL      .019    SEA      .018                    MANAGEMENT   .010
TREATMENT      .015                 CRAB       .017    SPECIE   .017                    AQUACULTURE  .009
MG             .011                 MUSSEL     .016    AREA     .015                    STUDY        .009
HIGH           .010                 CLAM       .012    FISHING  .013                    SYSTEM       .008
STUDY          .009                 SCALLOP    .010    FISH     .012                    FISHERY      .008
TOTAL          .009                 SPECIE     .008    NET      .012                    SPECIE       .007
SAMPLE         .008                 BIVALVE    .006    FISHERY  .011                    PRODUCTION   .007
QUALITY        .008                 INJECTION  .006    DEPTH    .008                    USE          .007
PH             .008                 SITE       .006    SURVEY   .008                    RESEARCH     .006

Figure 7.17: Topic-word distributions of the top-10 high-probability words for the 16 uncovered latent fisheries science topics. Each topic is labeled (top) with a logical topic description that best captures the semantics of the top words.


2000–2008

2009–2017

Figure 7.18: The Hellinger distance between the cumulative topic distributions of the 15 largest communities. The Hellinger distance is a symmetric distance measure between two probability distributions. Smaller distances indicate that two cumulative topic distributions are more similar, meaning that the two communities publish more similar work in terms of latent topics.
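For reference, the Hellinger distance between two discrete distributions P and Q is H(P, Q) = (1/√2) · √Σᵢ(√pᵢ − √qᵢ)², which can be sketched in a few lines; the community distributions below are invented for illustration and are not taken from the study.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    H(P, Q) = (1 / sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2)
    Ranges from 0 (identical) to 1 (disjoint support); symmetric in P and Q.
    """
    assert len(p) == len(q)
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

# Illustrative cumulative topic distributions over four topics
community_a = [0.40, 0.30, 0.20, 0.10]
community_b = [0.35, 0.30, 0.25, 0.10]   # publishes similar work to a
community_c = [0.05, 0.05, 0.10, 0.80]   # publishes very different work

# Communities publishing similar work are close in Hellinger distance
print(hellinger(community_a, community_b) < hellinger(community_a, community_c))  # True
```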


Modularity Scores 2000–2017

[Heatmap: modularity scores for ten random initializations (run-1 to run-10, rows) across resolution values 0.1–2.0 (columns); scores range from roughly 0.81 to 0.85.]

Figure 7.19: Modularity scores obtained by performing a grid search on the resolution parameter, ranging from 0.1 to 2.0 as shown on top, and conducting ten random initializations (run-1 to run-10, shown on the left side) to create author communities (i.e., clusters) for the time frame 2000–2008. A resolution parameter of 1.2 with the second random initialization (run-2) provided author clusters with the highest modularity value.

Modularity Scores 2000–2008

[Heatmap: modularity scores for ten random initializations (run-1 to run-10, rows) across resolution values 0.1–2.0 (columns); scores range from roughly −0.013 to 0.245.]

Figure 7.20: Modularity scores obtained by performing a grid search on the resolution parameter (ranging from 0.1 to 2.0) and conducting ten random initializations (run-1 to run-10) to create country communities (i.e., clusters) for the time frame 2000–2008. Resolution 1.1 with run-3 provides country clusters with the highest modularity value.


Modularity Scores 2009–2017

[Heatmap: modularity scores for ten random initializations (run-1 to run-10, rows) across resolution values 0.1–2.0 (columns); scores range from roughly −0.006 to 0.229.]

Figure 7.21: Similar to Fig. 7.20 but for the time frame 2009–2017. Resolution 1.0 with run-1 provides country clusters with the highest modularity value.


[Heatmap: coherence scores for LDA models with k = 2–30 topics (rows) under twelve hyper-parameter configurations (columns), combining p ∈ {10, 15, 20}, max ∈ {100, 200}, alpha asymmetric, and eta symmetric or asymmetric; scores range from roughly 0.44 to 0.55.]

Figure 7.22: Coherence scores for all created LDA models by performing a grid search on the number of topics or k-parameter (shown on top), and varying the different LDA hyper-parameters (shown on left). p = number of epochs over the corpus, max = convergence iteration parameter for the expectation step (EM-algorithm), alpha = symmetrical (sym.) or asymmetrical (asym.) Dirichlet prior distribution on the topic probabilities within documents (i.e. document-topic proportions). eta = symmetrical (sym.) or asymmetrical (asym.) Dirichlet prior distribution on the word probabilities within topics (i.e. topic-word proportions). A higher coherence score can be viewed as a better LDA model.


[Two-dimensional map (axes PC1 and PC2) of the 16 numbered and labeled topic nodes; node size encodes topic prevalence (legend: 2%, 5%, 10%).]

Figure 7.23: Inter-topic distance map showing a two-dimensional representation (via multidimensional scaling) of the 16 uncovered fisheries science topics with labels. The distance between the nodes represents topic similarity with respect to the distributions of words (i.e., latent Dirichlet allocation's output). The area of each node indicates the topic prevalence within the corpus, with larger nodes representing more prominent topics within the document collection (all nodes add up to 100%).

Table 7.1: The number of absolute and relative publications published between 2000–2008 and 2009–2017 for the 25 countries with the highest publication output. Differences in percentages and absolute values are presented in the diff. columns. A single publication is fractionally credited based on the number of authors and country affiliations.

[Table body (columns: country, publications and percentage per period, percentage diff., publication diff., total publications; rows: United States, Japan, Canada, Australia, United Kingdom, Norway, Spain, France, China, New Zealand, India, Mexico, Germany, Italy, Taiwan, Portugal, Brazil, South Korea, Denmark, Turkey, Finland, Greece, Poland, Sweden, Netherlands, and TOTAL TOP 25).]

Table 7.2: The centrality measures for the four country clusters (ranked 1–4 based on the number of countries within them) within the period 2000–2008. Only the top-5 countries with the highest centrality value for each of the four country clusters are shown. A description of the centrality measures can be found in Table 7.5.

[Table body (rows: clusters ranked 1–4; columns: betweenness centrality, closeness centrality, degree centrality, and eigenvector centrality, each listing the top-5 countries with their values).]

Table 7.3: The centrality measures for the four country clusters (ranked 1–4 based on the number of countries within them) within the period 2009–2017. Only the top-5 countries with the highest centrality value for each of the four country clusters are shown. A description of the centrality measures can be found in Table 7.5.

[Table body (rows: clusters ranked 1–4; columns: betweenness centrality, closeness centrality, degree centrality, and eigenvector centrality, each listing the top-5 countries with their values).]


Table 7.4: The complete list of journals covered by the fisheries category as defined by the Science Citation Index Expanded (SCIE) 2016–2017. This category spans 50 journals covering all aspects of fisheries science, technology, and industry. All 50 journals were included in the dataset. IF = Impact Factor, N = Number of Publications.

RANK JOURNAL NAME IF N

1   Fish and Fisheries                                      9.013   525
2   Reviews in Aquaculture                                  4.618   195
3   Reviews in Fish Biology and Fisheries                   3.575   588
4   Fish & Shellfish Immunology                             3.148   4,530
5   Fisheries                                               3.000   503
6   Aquaculture Environment Interactions                    2.905   161
7   ICES Journal of Marine Science                          2.760   3,350
8   Aquaculture                                             2.570   8,551
9   Reviews in Fisheries Science & Aquaculture              2.545   321
10  Canadian Journal of Fisheries and Aquatic Sciences      2.466   3,446
11  Fisheries Research                                      2.185   3,683
12  Journal of Fish Diseases                                2.138   1,408
13  Ecology of Freshwater Fish                              2.054   938
14  Marine Resource Economics                               1.911   378
15  Marine and Freshwater Research                          1.757   2,202
16  Aquaculture Nutrition                                   1.665   1,403
17  Fish Physiology and Biochemistry                        1.647   1,848
18  Fisheries Oceanography                                  1.578   714
19  Aquacultural Engineering                                1.559   786
20  Diseases of Aquatic Organisms                           1.549   2,528
21  Journal of Fish Biology                                 1.519   5,419
22  Transactions of the American Fisheries Society          1.502   2,266
23  Aquaculture Research                                    1.461   3,873
24  CCAMLR Science                                          1.429   156
25  Fisheries Management and Ecology                        1.327   837
26  Knowledge and Management of Aquatic Ecosystems          1.217   342
27  North American Journal of Fisheries Management          1.201   2,359
28  Marine and Coastal Fisheries                            1.177   291
29  Aquaculture International                               1.095   1,383
30  Journal of the World Aquaculture Society                1.015   1,254
31  New Zealand Journal of Marine and Freshwater Research   0.938   1,003
32  Journal of Aquatic Animal Health                        0.906   615
33  Fishery Bulletin                                        0.879   785
34  Journal of Applied Ichthyology                          0.845   2,785
35  Fisheries Science                                       0.839   2,861


Table 7.4: Continued.

RANK JOURNAL NAME IF N

36  Journal of Shellfish Research                                     0.721   1,946
37  North American Journal of Aquaculture                             0.715   1,035
38  Fish Pathology                                                    0.673   415
39  Acta Ichthyologica et Piscatoria                                  0.670   538
40  Latin American Journal of Aquatic Research                        0.594   583
41  California Cooperative Oceanic Fisheries Investigations Reports   0.586   177
42  Turkish Journal of Fisheries and Aquatic Sciences                 0.484   825
43  Aquatic Living Resources                                          0.448   710
44  Bulletin of the European Association of Fish Pathologists         0.431   630
45  Israeli Journal of Aquaculture-Bamidgeh                           0.348   631
46  Boletim do Instituto de Pesca                                     0.295   232
47  Iranian Journal of Fisheries Sciences                             0.285   516
48  Indian Journal of Fisheries                                       0.235   481
49  California Fish and Game                                          0.219   231
50  Nippon Suisan Gakkaishi                                           0.090   3
TOTAL                                                                         73,240


Table 7.5: Explanation and description of the social network analysis (graph theory) metrics used.

Density
  Description: The actual number of connections divided by the total number of possible connections.
  Utility: Shows the level of connectedness of the network.

Degree
  Description: The number of connections attached to each node.
  Utility: Shows the average number of connections possessed by each scientist.

Weighted degree
  Description: The number of connections attached to each node, taking into account the weight of the connection.
  Utility: Shows the average weighted number of connections possessed by each scientist.

Max cliques
  Description: The maximal complete subgraph of a given graph; in other words, the largest group of nodes where all the nodes are connected to one another.
  Utility: Allows the identification of intense collaborative sub-networks, where everyone within the sub-network has co-authored with everyone else (either through a single or multiple publications).

Average clustering
  Description: The extent to which a scientist's co-authors also collaborate with each other.
  Utility: Allows inferences with respect to the likely exchange of new ideas across the network.

Degree centrality
  Description: Measures the number of links a particular node has to other nodes.
  Utility: Allows the identification of central nodes within the network, in terms of the number of connections they have.

Closeness centrality
  Description: Measures the distance of a node to all other nodes in the network.
  Utility: Allows the identification of nodes that are most likely to receive information quickly in the network.

Eigenvector centrality
  Description: A centrality measure adjusted on the assumption that the centrality of a node cannot be assessed in isolation from the centrality of all the other nodes to which it is connected.
  Utility: Allows the identification of nodes that are well connected to others who are themselves well connected.

Betweenness centrality
  Description: Measures the extent to which a particular node lies between the other nodes in the network.
  Utility: Allows the identification of nodes that may otherwise look uninfluential, but that play important intermediary roles in the network in terms of information flow (e.g., brokers).

Table 7.6: Publications with the highest topic proportion from a single topic, indicating that the publication mostly covers aspects of that particular topic. For each of the 16 topics, the top-2 publications are shown.

[Table body (columns: topic, year, title, journal, topic proportion) for topics 1–8.]

Table 7.6: Continued.

[Table body (columns: topic, year, title, journal, topic proportion) for topics 9–16.]


Chapter 8

Conclusions

Applied data scientists and other scholars employ probabilistic topic models, such as Latent Dirichlet Allocation (LDA), to explore large corpora that would be impossible to examine manually. The popularity of these unsupervised machine learning techniques is further driven by the available open source libraries, such as Mallet (McCallum, 2002), Gensim (Rehurek and Sojka, 2010), and Stanford TMT (Ramage and Rosen, 2009), and by data that is relatively easy to obtain. There are, however, aspects of LDA that can affect the quality of the topic model output: the latent topics. For example, the number-of-topics parameter, the choice of an appropriate Dirichlet prior, the various pre-processing steps, and the choice of data can all affect the quality of the derived latent topics. At the other end of the process, the raw latent topics by themselves provide little insight into the domain-related questions posed of the corpus under study. Motivated by (i) trying to understand how to optimize aspects of LDA to obtain high-quality latent topics, and (ii) trying to use these topics to create valuable and useful knowledge for the domain of fisheries science, this thesis posed the following main research question:
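To make these moving parts concrete (the number of topics k and the two Dirichlet priors alpha and eta), the sketch below implements a toy collapsed Gibbs sampler for LDA. It is a minimal illustration, not the inference machinery of the libraries named above, and the tiny corpus and all names are invented for demonstration.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=42):
    """Toy collapsed Gibbs sampler for LDA.

    docs  : list of tokenized documents
    k     : number of topics (a key quality-affecting parameter)
    alpha : Dirichlet prior on document-topic proportions
    beta  : Dirichlet prior on topic-word proportions
    Returns per-topic word counts from the final sampling state.
    """
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})        # vocabulary size
    ndk = [[0] * k for _ in docs]                # document-topic counts
    nkw = [defaultdict(int) for _ in range(k)]   # topic-word counts
    nk = [0] * k                                 # topic totals
    # random topic initialization for every token
    z = []
    for di, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]
                # remove the token's current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional p(topic | all other assignments)
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

# Invented mini-corpus: two latent themes (stock assessment vs. fish disease)
docs = [["cod", "stock", "catch"], ["stock", "catch", "quota"],
        ["salmon", "disease", "virus"], ["disease", "virus", "infection"]]
topics = lda_gibbs(docs, k=2)
for t, counts in enumerate(topics):
    print(t, sorted(counts, key=counts.get, reverse=True)[:3])
```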

MRQ — How can we improve the knowledge discovery process from textual data through latent topical perspectives?

To formulate the answer, we took the systematic and iterative knowledge discovery process called Knowledge Discovery in Databases (KDD) as a blueprint for effectively discovering knowledge from textual data. Successfully and optimally conducting each step within KDD (see 1.1) contributed to improving the final output: knowledge. Thus, Chapters 2–7 of this thesis each aimed at optimally performing KDD steps which, taken together, improve the overall knowledge discovery process for latent topics from fisheries science publications. In most chapters, we employed the widely studied and popular topic model Latent Dirichlet Allocation (see 1.2.1).

Chapter 2 studied two manifestations of scientific data: the full-text publication and the abstract. The data selection phase is the first step within KDD, and selecting the data that leads to high-quality topics is an essential one. Additionally, Chapter 2 provided insights into the pre-processing steps (the second KDD step) and how they affect the latent topics. Since LDA is a Bayesian probabilistic topic model, prior knowledge can be encoded into the model. Chapter 3 explored all combinations of prior Dirichlet distributions and their effects on the quality of latent topics.

To assess the quality of latent topics, we used coherence measures, which are, to date, the most promising measures and the closest to human judgment. Additionally, we explored other evaluation approaches in Chapter 4. Combined, Chapters 2, 3, and 4 provide methodological optimizations for LDA with the objective of uncovering high-quality topics.
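Coherence measures score a topic's top words by how often they co-occur in the corpus. As an illustration, the sketch below implements the simpler UMass variant; the thesis itself relies on the CV measure, and the toy corpus and word lists here are invented.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass topic coherence: sum over ordered word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where w_j is the higher-ranked
    word and D counts documents containing the given word(s).
    Higher (closer to zero) means a more coherent topic.
    """
    doc_sets = [set(d) for d in docs]
    def d(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score = 0.0
    for wj, wi in combinations(top_words, 2):  # wj is the earlier, more probable word
        score += math.log((d(wi, wj) + 1) / d(wj))
    return score

# Invented mini-corpus of tokenized documents
docs = [["fish", "stock", "catch"], ["fish", "stock", "quota"],
        ["fish", "disease"], ["virus", "disease"], ["stock", "catch"]]

coherent = umass_coherence(["fish", "stock", "catch"], docs)    # words co-occur often
incoherent = umass_coherence(["fish", "virus", "quota"], docs)  # words rarely co-occur
print(coherent > incoherent)  # True
```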

Chapters 5–7 were aimed at interpreting the latent topics in order to discover knowledge; the final output of the KDD process. In Chapters 5 and 6, we explored how to construct knowledge from raw latent topics and sub-topics to shed light on the four pillars (ecological, social, economic, and institutional) of fisheries sustainability. In Chapter 7, we additionally combined latent topics with spatial and temporal data from fisheries science collaborations.

Chapters 2–7 are all aimed at improving the knowledge discovery process, and a mapping of each chapter to the relevant KDD step can be found in Fig. 1.6. Moreover, for Chapters 2–7 we posed six distinct research questions. In the following sections, we reiterate these questions and draw six conclusions; collectively, they answer the main research question.

RQ1 — What types of textual data result in high-quality latent topics?

When uncovering latent topics from scientific publications, one can typically commence by choosing either abstract or full-text data. The two variants are generally treated separately from a data retrieval perspective, with abstract data arguably being the easiest to obtain. Although researchers are increasingly using computer-aided content analysis techniques, such as LDA, no study has examined whether abstract or full-text data produces higher-quality latent topics of the underlying content. In Chapter 2, we quantitatively (i.e., through topic coherence) and qualitatively (i.e., through human ranking) assessed the quality of latent topics derived from abstract and full-text data from fisheries science publications. We constructed two datasets containing 4,400 and 15,000 articles from various peer-reviewed fisheries journals. Each dataset included both the abstract and full-text variants of the same articles (see Table 1.1).

The first dataset of 4,400 articles showed a significant difference in the quality of the derived latent topics, with the full-text topics exhibiting greater coherence scores and higher human topic rankings. Though the data was pre-processed identically, the lower coherence scores obtained from abstract data were mainly caused by so-called noise terms being present within a topic's top words. Such terms are not related to the biological, ecological, or socio-ecological meaning of a topic, but can be seen as semantically incorrect terms: using, used, use, within, total, two, and among. Such noise terms require proper attention when dealing with abstract data, for example, through an enhanced cleaning (pre-processing) phase, part-of-speech filtering, or a domain-specific stop word list.
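A domain-specific stop word filter of this kind can be sketched in a few lines; the function name and thresholds below are illustrative, and the stop word list reuses the noise terms identified above.

```python
# Generic English stop words are usually removed by standard pre-processing;
# abstract corpora additionally benefit from a domain-specific list of
# "noise" terms (the set below reuses examples from the chapter's findings).
DOMAIN_STOP_WORDS = {"using", "used", "use", "within", "total", "two", "among"}

def clean_tokens(tokens, extra_stop_words=DOMAIN_STOP_WORDS, min_length=3):
    """Drop semantically empty terms before fitting a topic model."""
    return [t for t in tokens
            if t.lower() not in extra_stop_words and len(t) >= min_length]

tokens = ["Using", "two", "salmon", "stocks", "within", "fjord"]
print(clean_tokens(tokens))  # ['salmon', 'stocks', 'fjord']
```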

The latent topics from the dataset of 15,000 articles revealed very similar topic coherence and human topic ranking results for both abstract and full-text data. The latent topics from full-text articles were not significantly better in most instances, albeit with somewhat higher coherence scores. The presence of noise terms appeared to be mitigated by the higher number of publications and larger vocabulary sizes.

Through a domain expert’s evaluation, full-text topics obtained from the smaller dataset (4,400 articles) displayed a more fine-grained granularity of the latent meanings of those topics—a finding not present within the larger dataset (15,000 articles). Although we identified similar and detailed topics in both datasets, there remains an inconsistency between some uncovered topics, with specific topics being present in abstract data and absent in full-text data (and vice versa). Although we conducted random initializations and aimed to uncover high-quality latent topics, these discrepancies nevertheless exist, and care should be taken regarding the true underlying topical structure.

[Figure: CV coherence score against the number of topics K (0–40) for abstract and full-text data; panel (a) 4,400 documents, panel (b) 15,000 documents.]

Figure 8.1: (a) Full-text data produces higher-quality latent topics; (b) full-text and abstract data produce topics of comparable quality.

Conclusion I

For relatively small datasets, full-text data produces higher-quality latent topics, whereas, for larger datasets, full-text and abstract data produce topics of comparable quality. Smaller datasets should place more emphasis on the pre-processing steps; specifically, care should be given to domain-specific stop words that can potentially reduce the quality of latent topics. Additionally, for smaller datasets, full-text data results in more detailed topics, which typically have a higher granularity.

RQ2 — How does the hyper-parameterization of a topic model algorithm affect the quality of latent topics?

For LDA, a Bayesian probabilistic topic model, two important hyper-parameters exist that encode prior knowledge into the model. These are the Dirichlet distributions for the probabilities of words within topics and the probabilities of topics within documents (see Fig. 1.3). In Chapter 3, we studied how symmetrical and asymmetrical (learned from data) parameterizations of the Dirichlet distributions affect the quality of the latent topics. We created 2,000 different LDA models to examine six different combinations of priors on two datasets containing 4,400 abstracts and 8,000 full-text articles. To evaluate the quality of topics, we utilized topic coherence scores (as a proxy for topic quality) and human topic ranking.

When looking at the prior Dirichlet distribution of topic probabilities within documents, a natural assumption to make is that specific topics occur more frequently (e.g., popular or more researched topics) and other topics less frequently (e.g., niche research domains). This assumption is encoded by an asymmetrical Dirichlet prior. Conversely, a symmetrical prior (commonly the default choice) would contradict this assumption, as it encodes an equal probability for all topics to be present within documents. Our empirical analysis indeed confirms that an asymmetrical prior of topic probabilities within documents results in topics with greater and statistically significant coherence scores. However, this effect is most pronounced when using abstract data, and it appears to hold (to a far lesser extent) for full-text data only for LDA models with a high number of topics (K parameter). Additionally, human topic ranking showed more high-quality topics when using an asymmetrical prior.

Concerning the prior Dirichlet distribution of word probabilities within topics, we preferably want topics to be different (i.e., distinct) from each other to avoid conflicts between them. A symmetrical prior distribution allows topics to be as different as need be, as it is not (a priori) influenced by the word use statistics of all the documents. Also, a symmetrical prior distribution considers the power-law usage of words (i.e., some words occur in many of the documents). Thus, a symmetrical prior is naturally assumed to be the preferred choice. However, our results show no real benefits when applying a symmetrical or asymmetrical prior distribution to either dataset. Yet, in very few cases, a symmetrical distribution shows slight, though very marginal, overall improvements in coherence and human topic ranking results for both datasets.
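The difference between the two prior choices on the document-topic distribution can be illustrated with a minimal sketch. The decreasing 1/(k + √K) scheme below is a common asymmetric initialization and is shown purely as an illustration, not as the exact parameterization used in Chapter 3:

```python
import math

def symmetric_alpha(num_topics):
    # Every topic gets the same prior weight in every document
    return [1.0 / num_topics] * num_topics

def asymmetric_alpha(num_topics):
    # Earlier (assumed more popular) topics get larger prior weights;
    # the 1/(k + sqrt(K)) scheme is one common initialization
    return [1.0 / (k + math.sqrt(num_topics)) for k in range(num_topics)]

sym = symmetric_alpha(25)
asym = asymmetric_alpha(25)
```

For K = 25, the symmetric prior assigns 0.04 to every topic, whereas the asymmetric prior assigns 0.2 to the first topic and roughly 0.034 to the last, encoding the assumption that some topics occur in far more documents than others.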

Additionally, some concerns raised within the conclusion of RQ1 related to incorrect terms and discrepancies between latent topics (latent topics present in one dataset and absent in another, or vice versa) seem to hold and are reconfirmed within this study.


[Figure: CV coherence score against the number of topics K (0–50) for the AS and SA Dirichlet prior combinations; panel (a) 4,400 abstracts, panel (b) 8,000 full-text articles.]

Figure 8.2: Varying Dirichlet priors result in topics of different quality on abstract data (a) and of very similar quality on full-text data (b). See Chapter 3 for full details.

Conclusion II

For abstract datasets, an asymmetrical Dirichlet distribution of topic probabilities within documents can significantly increase the quality of latent topics. For full-text datasets, a similar, although far less significant, observation holds for LDA models with ≥ 30 topics. A symmetrical or asymmetrical Dirichlet distribution of word probabilities within topics has a negligible effect on the quality of latent topics.

RQ3 — Can we assess the quality of latent topics using a semi-automatically constructed list of semantically related words?

The latent topics uncovered with Latent Dirichlet Allocation are expressed by probability distributions over the (fixed) vocabulary. Typically, when sorted, the words with high probability (e.g., the top 10) reveal the semantic meaning of the latent topic. Such high-probability words can be viewed as semantically related words, as they more frequently co-occur within the same linguistic context. Quantitative (coherence scores) and qualitative (human topic ranking) measures are used to evaluate the quality of the co-occurring words, and thus the quality of the uncovered latent topics. In Chapter 4, we studied an alternative approach to assess the quality of latent topics.
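Extracting a topic's top words from its word distribution is a one-liner; a minimal sketch with a made-up vocabulary and probabilities:

```python
def top_words(topic_word_probs, vocab, n=10):
    # Rank vocabulary indices by descending probability and keep the first n
    ranked = sorted(range(len(vocab)), key=lambda i: topic_word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]

vocab = ["salmon", "model", "stock", "the", "quota"]
probs = [0.30, 0.05, 0.25, 0.02, 0.38]
print(top_words(probs, vocab, n=3))  # ['quota', 'salmon', 'stock']
```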

Semantically related words can additionally be formalized by a semantic lexicon, where the words are related by a hyponym-hypernym association, including transitive relationships. For example, the words “salmon”, “trout”, and “chars” are all hyponyms that share the hypernym salmonidae. Likewise, the words “crab”, “lobster”, and “crayfish” are related by the hypernym crustacean. We studied the construction of such semantic lexicons (from web content) by employing a semi-automatic process called “bootstrapping”, finding semantically related words with extraction patterns (noun phrases that share the same verb). The bootstrapping approach used a number of seed words to steer the lexicon into the correct linguistic context. Before adding new words to the lexicon, two different scoring functions were used, based on word frequency and on collocation statistics (words that occur near each other). A (fisheries) domain expert evaluated the quality of the bootstrapped lexicon.

Extraction patterns that contain strong domain-specific verbs—such as the verbs “fishing” and “regulate” for the domain of fisheries—generally achieved the highest-accuracy lexicons. Thus, strong domain-related verbs, which occur more frequently within a particular domain and less frequently in a non-related domain, are better linguistic cues for correct lexicon words. The use of collocation statistics, in contrast to word frequencies, provided higher-accuracy lexicons only in a small number of cases. When bootstrapping large lexicons (i.e., 100 words), the differences in scoring functions diminished, and the accuracy converged to similar values.
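The bootstrapping loop can be sketched as follows; the function name and the simple known-word score are illustrative stand-ins for the frequency and collocation scoring functions of Chapter 4:

```python
from collections import Counter

def bootstrap_lexicon(candidates_per_pattern, seed_words, target_size):
    """Grow a lexicon from seed words using extraction-pattern candidates.

    candidates_per_pattern maps an extraction pattern (e.g., a shared verb)
    to the noun phrases it extracted; a pattern is trusted more when it has
    already extracted many known lexicon words.
    """
    lexicon = set(seed_words)
    while len(lexicon) < target_size:
        scores = Counter()
        for pattern, nouns in candidates_per_pattern.items():
            known = sum(1 for n in nouns if n in lexicon)
            if known == 0:
                continue  # pattern has no overlap with the lexicon yet
            for n in nouns:
                if n not in lexicon:
                    scores[n] += known  # simple frequency-style score
        if not scores:
            break  # no new candidates can be scored
        best, _ = scores.most_common(1)[0]
        lexicon.add(best)
    return lexicon

patterns = {"catch": ["salmon", "trout", "crab"], "regulate": ["salmon", "quota"]}
print(bootstrap_lexicon(patterns, {"salmon"}, 3))
```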

We experimented with different algorithmic approaches that can, potentially, be utilized to evaluate the quality of latent topics. We have demonstrated several algorithmic optimizations to create lexicons of higher quality but have not performed the actual quality evaluation of latent topics. For future research, the use of semantic similarities—a measure of conceptual distance between two words or sentences—and lexical databases are ideal candidates for such endeavors (Baddeley, 1966; Li et al., 2006; Resnik, 1999). By exploiting extraction patterns, we have limited lexicon entries of semantically related words to nouns only, which may result in limitations when evaluating latent topics.

[Figure: lexicon accuracy against lexicon size (20–100) for different values of ρ; panel (a) frequency scoring, panel (b) collocation scoring.]

Figure 8.3: New lexicon entries bootstrapped from strong fisheries-domain verbs (see the blue line, for instance) resulted in higher-accuracy lexicons. See Chapter 4 for full details.


Conclusion III

Bootstrapping semantic lexicons with extraction patterns and strong domain-related verbs results in higher-accuracy lexicons. Scoring new lexicon entries with collocation statistics shows slightly improved accuracy for smaller lexicons. While (high-quality) semantic lexicons of hyponym-hypernym associations provide a way to assess the quality of latent topics, semantic similarity measures and lexical databases are two promising approaches to perform this evaluation.

RQ4 — How can we construct knowledge from latent topics derived from large collections of documents?

Chapters 2 and 3 studied methodological optimizations of Latent Dirichlet Allocation that enable the creation of high-quality topics. In Chapter 5, we utilized the lessons learned and employed a large-scale topic model analysis for the domain of fisheries science. We constructed a unique dataset of over 46,000 scientific publications from 21 top-tier fisheries journals published between 1990–2016. For fisheries sustainability, it is essential that the ecological, social, economic, and institutional elements are appropriately considered, in other words, in an equal and balanced manner. Within these four pillars of sustainability, it is argued that the ecological considerations have attracted the most attention, leaving the social, economic, and institutional (i.e., human) dimensions relatively neglected. Chapter 5 aimed to investigate whether latent topics can be utilized to assess and quantify the presence of the ecological and human dimensions within fisheries scientific publications. Specifically, we investigated whether fisheries science research is diversified enough to capture the full complexity of the system, or if it is focused on a few selected components of this system.

Our analysis revealed a topical deconstruction of 25 broad topics to describe the entire fisheries science corpus. From the 25 uncovered topics, 24 relate to biological considerations, and a single topic (i.e., fisheries management) covers aspects of the human dimension. It is evident that the research focus in fisheries during the last 26 years has not entirely captured the complexity of the fisheries domain, especially of the human dimension component. To enable a dynamic portrait of the field of fisheries science, we analyzed topical trends for different time intervals. The topics related to fisheries management, conservation, and models show the most substantial proportional increase between 1990–2016. Negative trends were observed for the topics related to non-fish species, biochemistry, and primary production. Although fisheries management was identified as the only topic addressing human dimension considerations, it was simultaneously the topic that showed the highest proportional increase (+5.2%) and was the third most prevalent topic (6.13%). The increasing prevalence of and interest in the topic of fisheries management might indicate the strengthening of the connection between fisheries science and management processes, in the light of the growing concern about the status of fish stocks worldwide.

When exploring latent topics, an extensive post-analysis phase can reveal a detailed and macroscopic view of the corpus under study. It furthermore enables answering questions that would be extremely challenging to answer through manual analysis. For fisheries sustainability, it is essential that the biological and human dimensions are addressed in an equal and balanced manner. Our large-scale topic analysis of fisheries science shows a substantial imbalance in the underlying topical content, focusing heavily on the biological dimension and possibly impeding fisheries sustainability.

Conclusion IV

The interpretation of latent topics in raw form reveals very little insight regarding the domain under study. However, with an extensive post-analysis phase, the latent topics can reveal new and useful knowledge that would have been impossible to obtain through manual analysis. This post-analysis phase can include: (i) the labeling of topic-word distributions, (ii) the aggregation and segregation of document-topic distributions (by time, journal, and overarching theme), (iii) the visualization of latent topics, and (iv) the use of regression methods to obtain trends.
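Point (iv), obtaining trends, can be as simple as fitting an ordinary least-squares slope to a topic's yearly proportions; a minimal sketch, not necessarily the regression setup used in Chapter 5:

```python
def topic_trend(years, proportions):
    # Ordinary least-squares slope of a topic's yearly proportion:
    # positive means the topic is rising, negative means it is declining
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(proportions) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, proportions))
    var = sum((x - mean_x) ** 2 for x in years)
    return cov / var

# Illustrative data: a topic growing by 10 percentage points per year
print(topic_trend([1990, 1991, 1992], [0.10, 0.20, 0.30]))
```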

RQ5 — How can we construct knowledge from latent sub-topics derived from large collections of documents?

In Chapter 5, we uncovered main or general topics from fisheries science publications with the aim to quantify and shed light on the four pillars of sustainability. Main topics provide a broad summary of the corpus and can be viewed as a first-level decomposition of the corpus into latent topics. Additionally, in Chapter 6, we zoomed in on the main topics to explore their underlying sub-topics (i.e., a second-level analysis) from a dataset comprising 22,000 fisheries science publications from 13 top-tier fisheries journals covering aspects of fishery models. Within fisheries science, modeling and simulation are among the most frequently used methods. Fishery models come in a multitude of shapes and varieties, each addressing specific aspects of the fishery system. For fisheries sustainability, and from the viewpoint of fishery models, it is also essential to shed light on the various aspects they address (ecological, social, economic, or institutional). Therefore, in Chapter 6, we aimed to construct new insights from the sub-topics found within the broader topics covering aspects of fishery models.

From the corpus of 22,000 publications, we uncovered two main modeling topics: (i) estimation models (a topic that contains the ideas of catch, effort, and abundance estimation) and (ii) stock assessment models (a topic on the assessment of the current state of a fishery and future projections of fish stock responses and management effects). The topic of estimation models revealed 14 underlying sub-topics, and the topic of stock assessment revealed 15 underlying sub-topics. The sub-topics primarily address ecological aspects of the fishery system (i.e., the biological dimension), with only a few sub-topics addressing the human dimension: management effects and management tools. Both modeling topics and their underlying sub-topics, therefore, primarily focus on the biological aspects of fisheries, a finding that indicates that fishery models might not adequately address the four pillars of sustainability.

A first-level topic analysis produces a set of main or general topics that can subsequently be used to filter the corpus for a zoomed-in or second-level topic analysis. In Chapter 6, we used the broad topics to identify documents explicitly addressing modeling aspects, and then uncovered their underlying sub-topics. Our analysis revealed interpretable sub-topics that can help to explore the corpus in more detail. However, it is essential to understand the implications of such an approach. For example, filtering the corpus by main topics was accomplished by selecting documents where the main topic constituted the most significant part of the document; more concretely, the main topic proportion was greater than that of any of the remaining topics. Alternative approaches include adopting a threshold value, where documents are kept only if the main topic constitutes at least some percentage of the document, for instance 50%. In any case, one needs to take topic co-occurrence into account, as other topics still make up the remaining part of the filtered documents. Technically, with LDA, the word-to-topic assignment can also be used as an alternative approach to filter documents by main topics, as each word in a document gets assigned a topic (see also step 2(b) in the generative process in Section 1.2.1). However, during our analysis, doing so resulted in less interpretable topics and was therefore not used.
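The two filtering strategies discussed above (dominant topic versus an additional proportion threshold) can be sketched in a few lines; the data is illustrative only:

```python
def filter_by_main_topic(doc_topic_dists, target_topic, threshold=None):
    # Keep documents whose dominant topic is target_topic; optionally also
    # require the target topic to reach a minimum proportion (e.g., 0.5)
    selected = []
    for doc_id, dist in enumerate(doc_topic_dists):
        dominant = max(range(len(dist)), key=lambda k: dist[k])
        if dominant == target_topic and (threshold is None or dist[target_topic] >= threshold):
            selected.append(doc_id)
    return selected

dists = [[0.60, 0.30, 0.10], [0.20, 0.70, 0.10], [0.50, 0.45, 0.05]]
print(filter_by_main_topic(dists, 0))                 # [0, 2]
print(filter_by_main_topic(dists, 0, threshold=0.55)) # [0]
```

The third document illustrates the difference: topic 0 is its dominant topic (0.50), yet it barely outweighs the co-occurring topic 1 (0.45), so a 0.55 threshold excludes it.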

Conclusion V

A first-level topic analysis uncovers main topics, which can subsequently be used to perform a second-level topic analysis with the purpose of exploring the main topics’ underlying sub-topics. In doing so, we uncovered interpretable sub-topics that show a more detailed decomposition of the main topic. However, care must be given to the filtering approach, as filtered documents contain co-occurring topics that might not necessarily cover aspects desirable in the subsequent level of analysis.

RQ6 — How can we utilize knowledge derived from latent topics for a subsequent knowledge discovery process?

In Chapters 5 and 6, we have used latent topics and sub-topics uncovered from fisheries science publications to conduct a large-scale assessment of fisheries sustainability. Specifically, we have aimed to shed light on the four pillars of sustainability within general fisheries science topics, as well as within sub-topics for the field of fishery models. In both cases, our analysis revealed a strong bias towards ecological considerations, relatively neglecting the human dimension. In Chapter 7, through an iterative knowledge discovery in databases (KDD) process, we utilized latent topics obtained from 73,240 fisheries publications to enrich a social network analysis of 106,137 authors from 100,175 affiliations.

Our social network analysis of co-authorships provided insights into the structure of fisheries science collaborations, where degree, distance, and centrality measures are used to quantify elements of the network. Additionally, to extend our analysis, we have combined properties of each author, such as publication data, affiliation data (e.g., country or institution), spatial data (such as longitude and latitude information), and hidden community data (through community detection algorithms). In doing so, we have created a social network enriched with temporal and spatial insights and, by using the latent topics, a topical decomposition of the publications. We have shown how latent topics, combined with techniques from social network analysis, can not only provide an analysis of the latent scientific output (as provided in Chapters 5 and 6) but can also provide answers to an array of questions regarding the collaboration patterns within fisheries.

We find that while the fisheries science network is becoming more extensive, it is simultaneously becoming more intensive, with a clear division of focus between the traditional powerhouses (e.g., the US, Canada, Japan, Australia, the UK, and Norway) and new entrants (e.g., China, Brazil, and India). The uncovered network exhibits clusters and links which, though likely shaped by an array of overlapping factors, reveal a number of political-economic patterns that merit reflection by fisheries scientists and policymakers alike.

Our analysis has presented a shifting field that has become increasingly collaborative, though less cohesive, with a number of key players maintaining hegemonic positions within the network. The most productive and collaborative countries are those which have large industrialized fisheries-related interests. Although the collaboration network has become more extensive, it has also become more intensive in places, with a clear spatial pattern evident in the structure of scientific collaborations across the field. In this respect, the fisheries science landscape is one whereby the centers of knowledge production and the connections between them display trends more akin to regionalization than globalization.

The authorship network suggests that authors across the field may be engaging in a repeat, rather than broad, style of collaboration, which may work as a reinforcing mechanism with respect to the knowledge that is produced by the field. This pattern is likely to limit the potential gains of collaboration and could have consequences for pushing the boundaries of fisheries science in new and fruitful ways, in a manner which may help address some of the ongoing challenges within the field.

Conclusion VI

By employing probabilistic topic models to derive knowledge from fisheries science publications, and by utilizing the latent topics in combination with a social network analysis of fisheries science collaboration, we have shown that such a dual analysis reveals, besides insights into the corpus under study, new and unexplored insights for the field of fisheries science and the production of knowledge. Such an investigation was achievable since (i) the two types of analysis were highly related—since authors produce publications—and (ii) meta-data on authors and their affiliations was available through external data sources.

8.1 Scientific Contributions

In the previous sections, we re-introduced our main research question and the six formulated research questions that, collectively, provide answers on how to improve the knowledge discovery process of latent topics from fisheries scientific articles. This section presents a summary of the scientific and some societal contributions.

For researchers wanting to employ latent Dirichlet allocation to uncover latent topics from scientific publications, the following observations can help to accomplish this goal more effectively:

– The more extended and elaborate format of full-text articles has been shown to produce topics of higher granularity. Additionally, for relatively small datasets—around 5,000 articles in our case—full-text articles produce more high-quality topics. For larger datasets—around 15,000 articles—abstract and full-text data produce topics of very similar quality.

– For abstract data, the use of an asymmetrical Dirichlet prior for topics in documents outperforms symmetrical Dirichlet priors concerning topic quality. For full-text data, asymmetrical priors can sometimes increase topic quality, although this increase is not significant most of the time. Asymmetrical or symmetrical Dirichlet priors for words in topics have a negligible effect on the quality of topics.

– The use of topic coherence measures as a proxy for topic quality has been shown to generate interpretable topics, and higher topic coherence scores are associated with higher human topic ranking scores.

– For relatively small datasets, more care should be given to domain-specific stop words, as most standard stop word lists do not adequately filter out non-relevant words, which negatively impacts the quality of the topics.

– When the interpretability of topics is considered highly crucial, some form of normalization is still desired. As per our analysis, we advise applying lemmatization techniques in the pre-processing phase. With lemmatization, in contrast to stemming techniques, verbs and nouns can still be distinguished and, generally speaking, it has an advantageous effect on the readability of the topic-word distributions.


– Local minima (i.e., model convergence to a sub-optimal solution) remain a significant drawback of probabilistic topic models, including LDA. As per our analysis, running multiple random initializations still produced some discrepancies between the uncovered topics. Possible solutions to circumvent this drawback are given in the next section.

– A first-level topic analysis produced topics that can subsequently be used to filter for documents containing a particular topic, whereby a second-level topic analysis produced their underlying sub-topics. As per our analysis, this produced interpretable sub-topics for fishery models. However, a second-level topic analysis comes with some degree of topic co-occurrence, as other topics still make up some part of the filtered documents.

For the domain of fisheries science, we highlight the following contributions:

– The topics uncovered from 46,000 fisheries publications revealed highly imbalanced sustainability considerations, with a significant focus on the biological dimension, largely neglecting the human dimension. Out of the 25 uncovered topics, the human dimension was only expressed through a single topic: fisheries management. However, fisheries management showed the most substantial proportional increase between 1990–2016 and now constitutes the third most prevalent topic.

– The sub-topics uncovered from 22,236 fisheries modeling publications show a strong focus on modeling the biological aspects of the fishery system, with only a few sub-topics covering aspects of the human dimension.

– The fisheries science collaboration network of 106,137 authors has become increasingly collaborative, though less cohesive, with a number of key players maintaining hegemonic positions within the network. By and large, the most productive (and collaborative) countries in terms of fisheries science are those which have large industrialized fisheries-related interests, many of them global in nature.

– The fisheries science collaboration landscape is one whereby the centers of knowledge production and the connections between them display trends more akin to regionalization than globalization. The collaboration network suggests that authors across the field may be engaging in a repeat, rather than broad, style of collaboration.

8.1.1 LDA Workflow

Topic models are a popular unsupervised machine learning technique to understand and explore large collections of documents. This section describes the steps involved (see Fig. 8.4) in going from raw data to an effective topic model analysis. The steps are embedded into the various phases of the KDD process (Fig. 1.1) that we have used throughout this thesis. We furthermore provide the full workflow in Python code for performing LDA on full-text articles, which can be found at: github.com/shaheen-syed/LDA.

EXTRACTION
- extract publication data from the repository: content (abstract or full-text), year of publication, title of publication, journal of publication.

SELECTION
- filter for language (e.g., English).
- filter for article type (e.g., research article).
- filter for missing data (e.g., abstract, year).

PRE-PROCESSING (FULL-TEXT)
- convert PDF to plain text.
- use OCR to convert image-based PDFs to plain text.
- correct for carriage returns and end-of-line hyphenations.
- correct for ligatures.
- remove boilerplate content.
- remove acknowledgements.
- remove bibliography.

PRE-PROCESSING (ABSTRACT)
- remove the copyright statement.

PRE-PROCESSING (GENERAL)
- tokenize to get unigrams and bi-grams.
- apply named entity recognition to find n-grams.
- normalize with lowercasing and lemmatization.
- remove stop words (including domain-specific ones).
- remove numbers.
- remove punctuation.

TRANSFORMATION (depending on LDA tool)
- create a dictionary.
- remove high-frequency and low-frequency words.
- create bag-of-words features per document.
- create a corpus of all bag-of-words features.

DATA MINING
Create LDA models by performing a grid search on:
- the number of topics parameter (K).
- different types of Dirichlet prior distributions (symmetrical or asymmetrical).
- different random initializations.*
- the number of passes over the corpus.*

* perform only when sufficient computing resources are available or the dataset is small.

EVALUATION
- calculate the coherence score for each LDA model.
- select the LDA model with the highest (converging) topic coherence score.
- inspect the top 10 words for each topic.
- revisit the pre-processing stage when a topic's top 10 words are uninterpretable.

INTERPRETATION
- visualize topics (e.g., pyLDAvis).
- infer the document-topic distribution for each publication.
- calculate the largest topic for each document (for inspection of publication titles).
- label the topics by close inspection of the top words, the topic's visualization, and corresponding publication titles (preferably performed by a domain expert).
- calculate cumulative topic distributions, topic trends over time, topic co-occurrence, and the topic distribution per journal.

Figure 8.4: Steps to perform a topic model analysis on scientific articles. Code can be found at github.com/shaheen-syed/LDA

EXTRACTION: The extraction phase involves the process of obtaining a set of documents from a repository, such as Scopus or the Web of Science, or it can involve the steps of scraping a publisher's website to retrieve full-text articles (typically in PDF format). Scopus generally provides publication abstracts, including all the meta-data (journal, authors, affiliations, publication date), through various APIs. The upside of using an API is that publication content is easily obtained for a large number of documents simultaneously; however, these APIs often do not provide full-text for all the publications or journals of interest. In these cases, scraping publishers' websites can be an alternative solution. This process involves building many handcrafted crawlers, as each publisher lists their publications in a different manner on their website. Download limits should always be respected when building such scripts. Another option would be to manually download articles, although such approaches might not be feasible if the document collection of interest contains thousands or tens of thousands of articles. To enable a comparison of topics by time, or a comparison of topics by journal, it is important to store this information alongside the content of the document.

Relevant code:

• extraction.extract_publications()

SELECTION: This step involves the selection or filtering of relevant publications. For example, the LDA analysis might only consider English-language publications, or only take into account publications that constitute proper research articles (excluding errata, editorials, letters, comments, etc.). Additionally, it is important to filter out publications with missing meta-data.

PRE-PROCESSING: The pre-processing phase can be seen as the process of going from a document source to an interpretable representation for the topic model algorithm. This phase is typically different for full-text and abstract data. One of the main differences is that abstract data is often provided in a clean format, whereas full-text is commonly obtained by converting a PDF document into its plain text representation.

Within this phase, an important part is to filter out the content that is not important from a topic model's point of view, rather than from a human's point of view. Abstract data usually comes in a clean format of around 300–400 words with little additional text added to it; typically, the copyright statement is the only text that should be removed. In contrast, full-text articles can contain a lot of additional text that has been added by the publisher, such as article meta-data and boilerplate. It is important that such additional text is removed, and various methods to do so exist. Examples include: deleting the first cover page; deleting the first n bits of the content; using regular expressions or other pattern matching techniques to find and remove additional text; or more advanced methods (e.g., Kohlschütter et al., 2010). For full-text articles, a choice can be made to also exclude the reference list or acknowledgment section of the publication.

Latent Dirichlet allocation, like other probabilistic topic models, is a bag-of-words (BOW) model. Therefore, the words within the documents need to be tokenized: the process of obtaining individual words (also known as unigrams) from sentences.


For English text, splitting words on white space would be the simplest example. Besides obtaining unigrams, it is also important to find multi-word expressions (Manning and Schütze, 1999), such as two-word (bigram) or multi-word (n-gram) combinations. Named entity recognition (NER)—a technique from natural language processing (NLP)—can, for instance, be used to find multi-word expressions related to names, nationalities, companies, locations, and objects within the documents. The inclusion of bigrams and entities allows for a richer bag-of-words representation than a standard unigram representation. Documents from languages with implicit word boundaries may require a more advanced type of tokenization (Goldwater et al., 2006).
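As a simplified illustration of tokenization and bigram detection, using a plain frequency threshold instead of the statistical collocation measures a real pipeline would use:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letter characters (English-only sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def frequent_bigrams(docs, min_count=2):
    """Count adjacent word pairs and keep those above a frequency threshold."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return {f"{a}_{b}" for (a, b), n in counts.items() if n >= min_count}

docs = [
    "Fisheries management of rainbow trout.",
    "Rainbow trout respond to fisheries management measures.",
]
print(frequent_bigrams(docs))  # {'fisheries_management', 'rainbow_trout'}
```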

Although all tokens within a document serve an important grammatical or syntactical function, for topic modeling they are not all equally important. Numbers, punctuation marks, and single-character words need to be filtered out, as they bear no topical meaning. Furthermore, stop words (e.g., the, is, a, which) have no specific meaning from a topical point of view, and such words need to be removed as well. For English, and a number of other languages, there exist fixed lists of stop words that can easily be used (many NLP packages, such as NLTK and spaCy, include them). However, it is important to also create a domain-specific (also referred to as corpus-specific) list of stop words and to filter for those words. Such domain-specific stop words can also become apparent in the evaluation phase; if this is the case, going back to the pre-processing phase and excluding them would be a good approach. Another approach to removing stop words is to use TF-IDF (Salton, 1968) and include or exclude words within a certain threshold. Contrary to our analysis, a study by Schofield et al. (2017) indicated that removing stop words has no substantial effect on model likelihood, topic coherence, or classification accuracy.
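Combining a general stop-word list with a domain-specific one can be sketched as follows; the short lists here are placeholders for, e.g., NLTK's full English list and a corpus-specific list discovered during evaluation:

```python
# Placeholder general stop-word list; NLTK or spaCy provide complete ones.
GENERAL_STOP_WORDS = {"the", "is", "a", "which", "of", "and", "to", "in"}
# Hypothetical domain-specific stop words, typically found in the evaluation phase.
DOMAIN_STOP_WORDS = {"study", "result", "paper"}

def remove_stop_words(tokens, extra=frozenset()):
    """Drop stop words, single-character words, and non-alphabetic tokens."""
    stop = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS | set(extra)
    return [t for t in tokens if t not in stop and len(t) > 1 and t.isalpha()]

tokens = ["the", "study", "of", "cod", "stocks", "is", "a", "result", "1990"]
print(remove_stop_words(tokens))  # ['cod', 'stocks']
```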

For grammatical reasons, different word forms or derivationally related words can have a similar meaning and, ideally, for a topic model analysis, such terms need to be grouped (i.e., they need to be normalized). Stemming and lemmatization are two NLP techniques to reduce inflectional and derivational forms of words to a common base form. Stemming heuristically cuts off derivational affixes to achieve some normalization, albeit crudely in most cases. Stemming loses the ability to relate stemmed words back to their original part of speech, such as verbs or nouns, and decreases the interpretability of topics in later stages (Evangelopoulos et al., 2012). Lemmatization is a more sophisticated normalization method that uses a vocabulary and morphological analysis to reduce words to their base form, called the lemma. For increased topic interpretability, we recommend lemmatization over stemming. Additionally, uppercase and lowercase words can be grouped for further normalization. The process of normalization is particularly critical for languages with a richer morphology (Taghva et al., 2005). Failing to do so can cause the vocabulary to become overly large, which can slow down posterior inference and can lead to topics of poor quality (Boyd-Graber et al., 2014).
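The difference between the two normalization strategies can be illustrated with a toy suffix-stripping stemmer and a toy lemma lookup; both are deliberately simplistic stand-ins for, e.g., NLTK's PorterStemmer and WordNetLemmatizer:

```python
def toy_stem(word):
    """Crude suffix stripping in the spirit of a Porter-style stemmer."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary; a real lemmatizer uses a vocabulary and morphological rules.
LEMMAS = {"fisheries": "fishery", "studies": "study", "managed": "manage"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for word in ("fisheries", "studies", "managed"):
    print(word, "->", toy_stem(word), "vs", toy_lemmatize(word))
```

Note how the stemmer produces truncated forms ("fisher", "stud") that are harder to read back as real words, whereas the lemmatizer returns dictionary forms.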

Relevant code:

• preprocessing.full_text_preprocessing()
• preprocessing.general_preprocessing()

TRANSFORMATION: The transformation phase includes the creation of a dictionary of words and the preparation of the data for the topic model software or package. The dictionary of words (also referred to as the vocabulary) is typically a list of unique words represented as integers. For example, 'fish' is 1, 'population' is 2, and so on. The length of the dictionary is thus the number of unique words within the corpus; normalization reduces the length of the dictionary and speeds up the inference time. The next step is to represent the documents as bag-of-words features. Doing this for all the documents creates a matrix (i.e., a table) with the rows being the individual documents, the columns being the words within the dictionary, and the cells being the frequency of each word within a document. This is one representation of bag-of-words features, and other, less sparse representations exist as well; see, for example, Boyd-Graber et al. (2014). If not performed during the pre-processing phase, words that occur only once, and words that occur in roughly 90% of the documents, can be eliminated as they carry no discriminative topical significance. Omitting highly frequent words in particular prevents such words from dominating all topics. Removing high- and low-frequency words (i.e., pruning) within a matrix representation is generally much faster to perform.
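The dictionary and bag-of-words construction can be sketched in pure Python (Gensim's Dictionary and doc2bow provide equivalent functionality); the pruning thresholds below are illustrative:

```python
from collections import Counter

def build_dictionary(docs, min_docs=2, max_doc_fraction=0.9):
    """Map each retained word to an integer id, pruning rare and ubiquitous words."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n = len(docs)
    vocab = sorted(
        w for w, df in doc_freq.items()
        if df >= min_docs and df / n <= max_doc_fraction
    )
    return {w: i for i, w in enumerate(vocab)}

def to_bow(doc, dictionary):
    """Represent a tokenized document as sorted (word_id, frequency) features."""
    counts = Counter(w for w in doc if w in dictionary)
    return sorted((dictionary[w], c) for w, c in counts.items())

docs = [["fish", "population", "fish"], ["fish", "quota"], ["population", "quota", "fish"]]
dictionary = build_dictionary(docs)
# 'fish' occurs in every document and is pruned by the 90% document-frequency cap.
print(dictionary, to_bow(docs[0], dictionary))
```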

Several LDA tools are available, and each of them requires a slightly different transformation step to make the data suitable for topic analysis. In essence, however, they all require a conversion from words to a bag-of-words representation, and then to some matrix representation of the full corpus. Several LDA packages exist that might be worth exploring: Gensim (Rehurek and Sojka, 2010), Mallet (McCallum, 2002), the Stanford Topic Modeling Toolbox (Ramage and Rosen, 2009), Yahoo! LDA (Narayanamurthy, 2011), and Mr. LDA (Zhai et al., 2012).

Relevant code:

• transformation.transform_for_lda()

DATAMINING: The data mining phase involves fitting or training the LDA model. It also involves a careful analysis of the hyper-parameters and the creation of different LDA models. Similarly to the transformation phase, the choice of LDA module or software tool determines which parameters and hyper-parameters can be adjusted.

Since calculating a closed-form solution of the LDA model is intractable (Blei et al., 2003; Blei, 2012), approximate posterior inference is used to create the distributions of words in topics, and of topics in documents. To avoid local minima, both for variational (Blei and Jordan, 2006; Teh et al., 2006; Wang et al., 2011) and sampling-based (Newman et al., 2007; Porteous et al., 2008) inference techniques, the initialization of the model is an important consideration in the data mining phase. Thus, regardless of the initialization and the inference method used, multiple starting points should be used to guard against local minima and to improve the stability of the inferred latent variables (Boyd-Graber et al., 2014).

Running the inference is the most important step in the data mining phase. It results in the discovery of the latent variables (words in topics, and topics in documents). The convergence of the model (e.g., the likelihood) should be closely monitored. The time required typically depends on the initialization, the number of documents, the model complexity, and the inference technique. A straightforward approach to optimizing the various hyper-parameters would be to perform a grid search, inferring LDA models for combinations of them. Such hyper-parameters include the number of epochs or passes over the corpus, the number of iterations for convergence, the number of topics, the types of Dirichlet priors, and the starting points.
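The grid search described above can be sketched as follows; fit_and_score_lda is a hypothetical stand-in that would, in practice, train an LDA model (e.g., with Gensim) and return a quality score such as topic coherence:

```python
from itertools import product

def fit_and_score_lda(num_topics, passes, seed):
    """Hypothetical stand-in: train an LDA model and return a quality score.
    Here the score is faked; a real implementation would train, e.g., a Gensim
    model and compute a coherence score on it."""
    return 1.0 / (abs(num_topics - 25) + 1) - 0.01 * passes + 0.001 * seed

def grid_search(num_topics_grid, passes_grid, seeds):
    """Evaluate every hyper-parameter combination and keep the best-scoring one."""
    best = None
    for num_topics, passes, seed in product(num_topics_grid, passes_grid, seeds):
        score = fit_and_score_lda(num_topics, passes, seed)
        if best is None or score > best["score"]:
            best = {"num_topics": num_topics, "passes": passes,
                    "seed": seed, "score": score}
    return best

best = grid_search(num_topics_grid=[10, 25, 50], passes_grid=[5, 10], seeds=[0, 1, 2])
print(best)
```

Iterating over several seeds per configuration also implements the multiple-starting-points advice from the previous paragraph.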

Relevant code:

• datamining.execute_lda()

EVALUATION: The evaluation phase includes a careful analysis and inspection of the latent variables from the various created LDA models. Since LDA is an unsupervised machine learning technique, extra care should be given during this post-analysis phase, in contrast to, for example, supervised methods, where a labeled gold-standard dataset typically exists.

Measures such as predictive likelihood on held-out data (Wallach et al., 2009) have been proposed to evaluate the quality of generated topics. However, such measures correlate negatively with human interpretability (Chang et al., 2009), making topics with high predictive likelihood less coherent from a human perspective. High-quality or coherent latent topics are of particular importance when they are used to browse document collections or to understand the trends and developments within a particular research field. As a result, researchers have proposed topic coherence measures, which are a quantitative approach to automatically uncover the coherence of topics (Aletras and Stevenson, 2013; Stevens et al., 2012; Newman et al., 2010a; Röder et al., 2015). Topics are considered coherent if all or most of the words (e.g., a topic's top-N words) are related. Topic coherence measures aim to correlate highly with human topic evaluation, such as topic ranking data obtained by, for example, word and topic intrusion tests (Chang et al., 2009). Human topic ranking data are often considered the gold standard and, consequently, a measure that correlates well with them is a good indicator of topic interpretability.
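To illustrate the idea behind co-occurrence-based coherence measures, here is a simplified UMass-style score computed from document co-occurrence counts. This is a sketch only, not the Cv measure used in this thesis, and it assumes every top word occurs in at least one document:

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Simplified UMass-style coherence: sum over word pairs of the log of the
    smoothed co-occurrence count over the single-word document count."""
    doc_sets = [set(doc) for doc in docs]
    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)
    score = 0.0
    for w_i, w_j in combinations(top_words, 2):
        score += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j))
    return score

docs = [["fish", "stock", "quota"], ["fish", "stock"], ["quota", "policy"], ["stock", "fish"]]
coherent = umass_coherence(["fish", "stock"], docs)
mixed = umass_coherence(["fish", "policy"], docs)
print(coherent, mixed)  # the frequently co-occurring pair scores higher
```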

Exploring the topics by a human evaluator is considered the best approach. However, since this involves inspecting all the different models, it might not be feasible. Topic coherence measures can quantitatively calculate a proxy for topic quality, and in our analysis, topics with high coherence were considered interpretable by domain experts. Combining coherence measures with a manual inspection is thus a good approach to find the LDA model that results in meaningful and interpretable topics. In short, three questions should be answered satisfactorily (Boyd-Graber et al., 2014):

1. Are topics meaningful, interpretable, coherent and useful?

2. Are topics within documents meaningful, appropriate and useful?

3. Do the topics facilitate a better understanding of the underlying corpus?

The evaluation phase can also result in topics that are very similar (i.e., identical topics), topics that should ideally be merged or split (i.e., chained or mixed topics), topics that are uninterpretable (i.e., nonsensical), or topics that contain unimportant, too specific, or too general words. In those cases, it would be wise to revisit the pre-processing phase and repeat the analysis.

Relevant code:

• evaluation.calculate_coherence()
• evaluation.plot_coherence()
• evaluation.output_lda_topics()

INTERPRETATION: The interpretation phase, although closely related to the evaluation phase, involves a more fine-grained understanding of the latent variables. The main goal of the interpretation phase is to go beyond the latent variables and understand them in the context of the domain under study. This phase depends heavily on the research question one wants to answer. What topics are present, how they are distributed over time, and how they are related to other topics are possible ways to explore the output of the LDA analysis. Similarly to the evaluation phase, aiming for a deeper understanding of the topics might also reveal flaws in the analysis. For example, a visualization that places two very distinct topics in close proximity, a high probability of a topic in a document that does not cover aspects of that topic, or topics that co-occur when they should not are indicators of flaws or areas for improvement. In such cases, it would be wise to revisit the pre-processing phase and to re-run the analysis with, for instance, different model parameters or pre-processing steps.

Relevant code:

• interpretation.infer_document_topic_distribution()
• interpretation.get_document_title_per_topic()
• interpretation.plot_topics_over_time()
• interpretation.plot_topics_over_time_stacked()
• interpretation.plot_topic_co_occurrence()
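As an illustration of exploring topic distributions over time, per-document topic proportions can be averaged by publication year. The distributions below are illustrative stand-ins for inferred LDA output, and the function name is hypothetical:

```python
from collections import defaultdict

def topics_over_time(doc_topics, years):
    """Average per-document topic proportions for each publication year."""
    by_year = defaultdict(list)
    for dist, year in zip(doc_topics, years):
        by_year[year].append(dist)
    return {
        year: [sum(col) / len(dists) for col in zip(*dists)]
        for year, dists in sorted(by_year.items())
    }

# Illustrative document-topic distributions (two topics) with publication years.
doc_topics = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
years = [1995, 1995, 2005, 2005]
print(topics_over_time(doc_topics, years))
```

Plotting these yearly averages, stacked or as separate lines, gives the kind of topic-trend view produced by the plotting functions listed above.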

8.2 Limitations

The specific limitations are addressed in the individual chapters, though it is worth reiterating some of the most important ones concerning the Latent Dirichlet Allocation method and the chosen domain of fisheries science.

8.2.1 Latent Dirichlet Allocation

Bag-of-words assumption: LDA can be considered a bag-of-words (BOW) model in which the order of words is neglected. For the field of natural language generation, specifically, this assumption would not hold. However, the assumption is realistic if the purpose is to uncover the semantic structure of a corpus (Blei, 2012). Still, treating individual words as document features loses the meaning of compound words (e.g., fisheries management, rainbow trout). We have mitigated this by including two-word expressions (bigrams) and multi-word expressions (n-grams) obtained through named entity recognition methods. To generate more realistic-looking documents within the generative process, one can sample words conditioned on the previously sampled words, thus capturing short- and long-term dependencies. Such approaches (e.g., Griffiths et al., 2005; Wallach, 2006b) relax the BOW assumption and can result in topics that are more meaningful.

Document exchangeability: LDA further assumes document exchangeability: the order in which documents are analyzed is assumed to be unimportant, and all documents are treated as a single, static collection. This assumption is problematic when topics have changed considerably in the way they are described, especially when analyzing documents covering a very long time span. The dynamic topic model (Blei and Lafferty, 2006) is one approach to capture evolving topics by creating topics for different time slices, where each slice is conditioned on the previous time slice. We might surmise that specific fisheries topics were described differently in the early 1990s than they are today and, if so, the standard LDA model used in this thesis does not explicitly capture this phenomenon.

Number of topics: LDA assumes that the number of topics, that is, the value of K, is known a priori. Similar assumptions are typically found within other unsupervised clustering methods, such as K-means clustering or Gaussian mixture models. We have used a grid-search approach whereby LDA models were created by varying the K-parameter. The best model was obtained by calculating the model quality and including an inspection by domain experts. For LDA, this is a common and useful technique. However, Bayesian non-parametric models exist, such as the hierarchical topic model, which automatically infers the number of topics from the data (Whye Teh et al., 2004). Such hierarchical topic models show effective and superior performance over the standard LDA model but come with additional computational complexity.

Posterior inference: To infer the posterior probability of the hidden variables given the observed documents, we have utilized an online learning method (Hoffman et al., 2010) based on variational inference (Jordan et al., 1999). The benefit of online learning methods is that the corpus need not be held in memory, as is the case with batch methods. Besides variational inference, other methods exist, such as Markov chain Monte Carlo (MCMC) based methods like Gibbs sampling (Griffiths and Steyvers, 2004). It is argued that variational inference is computationally faster, and that Gibbs sampling is in principle more accurate (Porteous et al., 2008). However, in a comparative study, different inference techniques showed similarly accurate results when the hyper-parameters of the Dirichlet priors (studied in Chapter 3) are optimized (Asuncion et al., 2012).

Model initialization: One of the drawbacks of probabilistic topic models, including LDA, is that they uncover topics regardless of whether they are naturally there (Blei and Lafferty, 2009), and that the model can yield different topics at different initializations (Chuang et al., 2014). Adequate pre-processing and optimizing for model quality, for instance through coherence scores, can usually produce a set of interpretable and meaningful topics, as we have shown in the various chapters. However, we have identified topics present within one model and absent in another, even when optimizing for topic quality across different random initializations (see Chapter 3). The discrepancies between uncovered topics are inherently caused by approximating the posterior distribution, as a closed-form solution is intractable, and solutions might converge to a sub-optimal solution (i.e., a local optimum). Typically, starting the model with different random initializations is one way to go, which we experimented with in Chapters 2, 3, 6 and 7, although such an approach was not feasible for the extensive corpus of 46,000 full-text publications studied in Chapter 5. This drawback makes LDA less stable, reproducible, and reliable, and solutions to circumvent or mitigate these problems are scarcely studied. Optimizing for starting points (Roberts et al., 2016), using different similarity metrics for model comparison (Koltcov et al., 2014), and adapting approaches from community detection algorithms for networks (Lancichinetti et al., 2015) are a few approaches to consider.

Topic labeling: For readability and interpretability of the results, we have chosen to attach semantically meaningful labels to the probability distributions over words (i.e., the topics). However, the labeling of topics is a very subjective endeavor. In all cases, we have used fisheries domain experts to perform the labeling task and have provided, next to each topic's top words, a selection of document titles strongly associated with the topics, abstracts, and several visualizations to aid in correctly determining topic labels. Nevertheless, manual topic labeling, even though considered the gold standard in topic labeling (Lau et al., 2011), is limited by the subjectivity inherent in human interpretation (Urquhart, 2001), and an analysis of the topics by other domain experts could yield different results.

Topic quality: To evaluate the quality of latent topics, one can fit several topic models to a training set of documents and calculate a model fit, such as perplexity or the log-likelihood, on a test set of the data (Scott and Baldridge, 2013). The model that best fits the test set would be considered the better model. However, topic models are used by humans to interpret and explore documents, and there is no technical reason that the best-fitted model would best help in performing this task (Blei, 2012; Boyd-Graber et al., 2014). In fact, research has shown that such measures correlate negatively with human interpretation (Chang et al., 2009). Within this thesis, we have adopted the Cv coherence measure (Röder et al., 2015), which quantitatively measures the quality of the latent topics from the perspective of human interpretation with near-human accuracy (Boyd-Graber et al., 2014), combined with a qualitative assessment by domain experts. The Cv coherence measure has been shown to outperform all other available coherence measures and can thus be viewed as an adequate measure of topic quality. However, using a reference corpus, such as Wikipedia, can potentially improve the estimates of word frequencies and word co-occurrences that are part of coherence measures, and can thereby improve the quantification of topic quality (Yang et al., 2017).

Research domain: All our LDA computational experiments have been performed on documents related to the fisheries domain. While the reasons for choosing this domain are explained in Chapter 1, the results presented in this thesis might not necessarily generalize to other types of documents. However, the optimization of Dirichlet priors (Chapter 3) has been shown to provide similar results to those previously reported for documents related to news and patent data. Also, given that we studied documents from a domain-specific field (i.e., fisheries), our results might be more generalizable to documents from other domain-specific sciences than to documents found in more general-purpose journals such as Nature, Science, PNAS, and PLOS ONE.

8.2.2 Fisheries Domain

English language: Throughout this thesis, we have explored latent topics derived from scientific publications that were entirely drawn from English-language journals. The analysis of latent topics addressing the four pillars of fisheries sustainability, and the analysis of spatial and temporal characteristics of authors and publication output, all reflect an Anglophone bias. In doing so, we have disregarded all fisheries science output published in non-English languages, such as Spanish, Portuguese, and Chinese. Even though English is considered the lingua franca of science (Montgomery, 2013), research has shown that, for instance, in 2014, 35.6% of 75,513 publications on biodiversity conservation were published in non-English languages (Amano et al., 2016).

As a consequence, the results presented in this thesis do not capture all global scientific output on fisheries.

Journal selection: In terms of our corpus, while it contains high-ranking journals in fisheries science, we acknowledge that it does not capture the entire spectrum of work being done in the area of fisheries. By focusing on peer-reviewed fisheries journals, we have, for instance, ignored all scientific output published in the gray literature, such as governmental or institutional reports. Additionally, by focusing heavily on fisheries journals, some work related to fisheries, in particular work oriented towards the social sciences, may have been missed because it is published in journals with a more general focus, or a stronger leaning towards those sciences. As a consequence, our work does not reflect all output regarding fisheries science. However, we have aimed to include, especially in Chapters 4, 5 and 6, a large number of journals explicitly addressing fisheries science in their aims and scopes, and those considered to be high-impact journals in the field.

8.3 Future Work

Some future work to consider relates directly to the limitations described in Section 8.2. When using LDA, relaxing the bag-of-words assumption (Griffiths et al., 2005; Wallach, 2006b) and the document exchangeability assumption (Blei and Lafferty, 2006) are useful areas to explore further. Also, the use of non-parametric topic models (Whye Teh et al., 2004) can reveal additional and possibly more accurate results. For the domain of fisheries science, including non-English publications and incorporating journals not specifically targeted at fisheries science are approaches to extend the work presented in this thesis. The field of probabilistic topic modeling is an active area of development, with many new approaches presented in recent years. We want to briefly highlight three interesting and promising alternatives that are useful when exploring and studying the scientific output of a particular domain. We conclude with an overview of extensions of LDA that highlight the potential of probabilistic topic model research.

Labeled LDA (L-LDA): Scientific publications are typically annotated with author-defined keywords, and frequently also with journal-assigned keywords. However, the association between a specific keyword and the relevant part of the text is not provided. Labeled LDA (Ramage et al., 2009) is an extension of LDA that learns to find such associations and can be used to zoom in on, or filter for, specific content.

Author-topic model (ATM): The author-topic model (Rosen-Zvi et al., 2004) is an extension of LDA that incorporates author information into the learning process. It infers a topical distribution per author and thus, within the context of scientific publications, reveals which topics a specific author publishes on. Authors who have similar topic distributions, measured by, for instance, the KL-divergence (Kullback and Leibler, 1951) or the Hellinger distance (Hellinger, 1909), can be viewed as publishing similar research. Additionally, entropy measures can, for instance, show whether authors publish more narrowly or broadly oriented work. We have experimented with the ATM, and a working implementation can be found in Gensim (Rehurek and Sojka, 2010). However, as per our analysis, the current implementation fails to handle datasets with roughly more than 10,000 authors.
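The Hellinger distance between two authors' topic distributions can be computed directly from its definition, H(p, q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2); the author distributions below are illustrative:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    ) / math.sqrt(2)

# Illustrative author-topic distributions over three topics.
author_a = [0.7, 0.2, 0.1]
author_b = [0.6, 0.3, 0.1]
author_c = [0.0, 0.1, 0.9]
print(hellinger(author_a, author_b))  # small: similar research profiles
print(hellinger(author_a, author_c))  # large: dissimilar profiles
```

The distance is bounded between 0 (identical distributions) and 1 (disjoint support), which makes it convenient for comparing authors across a corpus.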

Correlated topic model (CTM): The correlated topic model (Blei and Lafferty, 2007) allows topics to correlate with each other, as a document about the topic "cars" is more likely to also be about "emissions" than about "diseases". The standard LDA model uses a Dirichlet distribution, which implicitly assumes independence between topics. The use of a more flexible distribution to model topics in documents, such as the logistic normal, allows for variability between the components (Blei and Lafferty, 2009). The correlated topic model has been shown to fit the data better and to provide a more realistic model of the latent topical structure in documents.

Other topic models to consider: Many other types of topic models exist that show potential directions for future research into probabilistic topic models. Examples include the relational topic model (Chang and Blei, 2010), the spherical topic model (Reisinger et al., 2010), the sparse topic model (Wang and Blei, 2009), the bursty topic model (Doyle and Elkan, 2009), the supervised topic model (Blei and McAuliffe, 2007), the biterm topic model (Yan et al., 2013), topic modeling with network regularization (Mei et al., 2008), the Pachinko allocation model (Wei and McCallum, 2006), the Markov topic model (Wang et al., 2009), the polylingual topic model (Mimno et al., 2009), the cross-collection topic model (Paul, 2009), the differential topic model (Chen et al., 2015), and topic modeling over time (Wang and McCallum, 2006).

8.4 Personal Reflections

Before I started my PhD, I had mainly been working with people very close to my own discipline: computer science. We spoke the same language and we were wearing similar glasses (scientific makeup). When I began my PhD journey, this became vastly different as I suddenly plunged into a pool of fisheries scientists, marine conservationists, political scientists, social scientists, and people from many other disciplines. To make things even more challenging, being part of the European Training Network SAF21 (Social Science Aspects of Fisheries for the 21st Century), I was surrounded by people from different countries and cultures. How to communicate, navigate, and collaborate with my new colleagues, seniors, project members, beneficiaries and host institutions turned out not to be a trivial task, and one that I had to learn. I made some substantial steps, but there are many more hurdles to overcome. In this final chapter, I want to highlight three important lessons that I learned and that are worth mentioning explicitly.


“Know your audience” became an important lesson for me. Oftentimes, I wanted to present the underlying fine-grained mechanisms of the algorithms I was studying, but the better choice was to explain what an algorithm does, not how it does it. Also, writing for fisheries journals was a skill I had to master, sometimes the hard way, through the peer review process. Within this thesis, I have tried to leave out unnecessary technical jargon. However, there is still much to learn in this respect and, thankfully, throughout this PhD journey, the feedback I have received from peers outside of my own discipline has always been constructive and helpful.

“Do what you love to do”. A PhD can be a stressful undertaking, and the Internet is filled with articles, blogs, and fora stating that PhD researchers suffer from mental health problems such as chronic anxiety and clinical depression. Doing research, analyzing, and just coding away is what I love to do, and this is something I was lucky enough to do most of the time. Perhaps apart from the paper writing and review process, the PhD journey felt like a breeze through life. Okay, perhaps a breeze most of the time.

“Three years is challenging”. The streamlined three-year PhD program typically found within EU-funded projects is a very challenging one, especially when it is combined with training camps, network meetings, secondments, and the tedious and long publication process. Thankfully, being part of a European Training Network gave me the opportunity to live, study and work in diverse and multicultural environments in the Netherlands (Utrecht University), Norway (UiT – The Arctic University of Norway) and the UK (Manchester Metropolitan University). It also gave me the opportunity to be surrounded by interesting and inspiring colleagues, while exploring many other parts of the world (Denmark, Spain, Portugal, Iceland, the US, Japan, Italy, and Germany), which made the journey worthwhile and highly enlightening.

Bibliography

J. Adams. Collaborations: The rise of research networks. Nature, 490(7420), 2012.

J. Adams. The fourth age of research. Nature, 497(7451):557–560, 2013. doi: 10.1038/497557a.

J. Adams, K. Gurney, D. Hook, and L. Leydesdorff. International collaboration clusters in Africa. Scientometrics, 98(1):547–556, 2014. doi: 10.1007/s11192-013-1060-2.

Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, 2010. doi: 10.1038/nature09182.

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

D. W. Aksnes and H. I. Browman. An overview of global research effort in fisheries science. ICES Journal of Marine Science: Journal du Conseil, 73(4):1004–1011, 2016. doi: 10.1093/icesjms/fsv248.

M. Albert and D. L. Kleinman. Bringing Pierre Bourdieu to Science and Technology Studies. Minerva, 49(3):263–273, 2011. doi: 10.1007/s11024-011-9174-2.

N. Aletras and M. Stevenson. Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), pages 13–22, Potsdam, Germany, 2013. Association for Computational Linguistics.

K. R. Allen. A Method of Fitting Growth Curves of the von Bertalanffy Type to Observed Data. Journal of the Fisheries Research Board of Canada, 23(2):163–179, 1966. doi: 10.1139/f66-016.

J. M. Alston and P. G. Pardey. Six decades of agricultural and resource economics in Australia: an analysis of trends in topics, authorship and collaboration. Australian Journal of Agricultural and Resource Economics, 60(4):554–568, 2016. doi: 10.1111/1467-8489.12162.


D. L. Alverson, M. H. Freeberg, S. A. Murawski, and J. Pope. A global assessment of fisheries bycatch and discards. Technical report, FAO Fisheries Technical Paper 339, Rome, 1994.

T. Amano, J. P. González-Varo, and W. J. Sutherland. Languages Are Still a Major Barrier to Global Science. PLOS Biology, 14(12):e2000933, 2016. doi: 10.1371/journal.pbio.2000933.

L. H. Anaya. Comparing latent Dirichlet allocation and latent semantic analysis as classifiers. PhD thesis, University of North Texas, 2011.

R. Angelini and C. L. Moloney. Fisheries, Ecology and Modelling: an historical perspective. Pan-American Journal of Aquatic Sciences, 2(2):75–85, 2007.

A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On Smoothing and Inference for Topic Models. In UAI ’09 Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 27–34, Montreal, Quebec, Canada, 2012. AUAI Press Arlington.

P. Azoulay, J. S. G. Zivin, and J. Wang. Superstar Extinction. Quarterly Journal of Economics, 125(2):549–589, 2010. doi: 10.1162/qjec.2010.125.2.549.

A. D. Baddeley. Short-term Memory for Word Sequences as a Function of Acoustic, Semantic and Formal Similarity. Quarterly Journal of Experimental Psychology, 18(4):362–365, 1966. doi: 10.1080/14640746608400055.

M. R. Baker, D. E. Schindler, T. E. Essington, and R. Hilborn. Accounting for escape mortality in fisheries: implications for stock productivity and optimal management. Ecological Applications, 24(1):55–70, 2014. doi: 10.1890/12-1871.1.

B. Ball, B. Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Physical Review E, 84(3):036103, 2011. doi: 10.1103/PhysRevE.84.036103.

M. Barbesgaard. Blue growth: savior or ocean grabbing? The Journal of Peasant Studies, 45(1):130–149, 2018. doi: 10.1080/03066150.2017.1377186.

R. S. Barr, B. L. Golden, J. P. Kelly, M. G. C. Resende, and W. R. Stewart. Designing and reporting on computational experiments with heuristic methods. Journal of Heuristics, 1:9–32, 1995. doi: 10.1007/BF02430363.

F. Bastardie, J. R. Nielsen, and T. Miethe. DISPLACE: a dynamic, individual-based model for spatial fishing planning and effort displacement — integrating underlying fish population models. Canadian Journal of Fisheries and Aquatic Sciences, 71(3):366–386, 2014. doi: 10.1139/cjfas-2013-0126.

D. Bavington. Managed annihilation: an unnatural history of the Newfoundland cod collapse. UBC Press, Vancouver, British Columbia, Canada, 2010.

232 BIBLIOGRAPHY

C. Bear. Assembling the sea: materiality, movement and regulatory practices in the Cardigan Bay scallop fishery. cultural geographies, 20(1):21–41, 2013. doi: 10.1177/1474474012463665.

A. Belgrano and C. W. Fowler. How Fisheries Affect Evolution. Science, 342(6163):1176–1177, 2013. doi: 10.1126/science.1245490.

B. Berelson. Content Analysis in Communication Research. Free Press, Michigan, USA, 1952.

P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009. doi: 10.1073/pnas.0907096106.

T. Bjørndal, D. E. Lane, and A. Weintraub. Operational research models and the management of fisheries and aquaculture: A review. European Journal of Operational Research, 156(3):533–540, 2004. doi: 10.1016/S0377-2217(03)00107-3.

D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012. doi: 10.1145/2133806.2133826.

D. M. Blei. Expressive probabilistic models and scalable method of moments. Communications of the ACM, 61(4):84–84, 2018. doi: 10.1145/3186260.

D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006. doi: 10.1214/06-BA104.

D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning - ICML ’06, pages 113–120, New York, USA, 2006. ACM Press. doi: 10.1145/1143844.1143859.

D. M. Blei and J. D. Lafferty. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35, 2007. doi: 10.1214/07-AOAS114.

D. M. Blei and J. D. Lafferty. Topic Models. In A. N. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications, pages 71–94. Chapman and Hall/CRC, London, UK, 2009.

D. M. Blei and J. D. McAuliffe. Supervised Topic Models. In NIPS’07 Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 121–128, Vancouver, British Columbia, Canada, 2007. Curran Associates Inc.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008. doi: 10.1088/1742-5468/2008/10/P10008.


Ö. Bodin and B. I. Crona. The role of social networks in natural resource governance: What relational patterns make a difference? Global Environmental Change, 19(3):366–374, 2009. doi: 10.1016/j.gloenvcha.2009.05.002.

L. Borges. The evolution of a discard policy in Europe. Fish and Fisheries, 16(3):534–540, 2015. doi: 10.1111/faf.12062.

L. Bornmann and R. Mutz. Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology, 66(11):2215–2222, 2015. doi: 10.1002/asi.23329.

L. Bornmann, C. Wagner, and L. Leydesdorff. BRICS countries and scientific excellence: A bibliometric analysis of most frequently cited papers. Journal of the Association for Information Science and Technology, 66(7):1507–1513, 2015. doi: 10.1002/asi.23333.

S. R. Borrett, J. Moody, and A. Edelmann. The rise of Network Ecology: Maps of the topic diversity and scientific collaboration. Ecological Modelling, 293:111–127, 2014. doi: 10.1016/j.ecolmodel.2014.02.019.

M. Boström. A missing pillar? Challenges in theorizing and practicing social sustainability: introduction to the special issue. Sustainability: Science, Practice and Policy, 8(1):3–14, 2012. doi: 10.1080/15487733.2012.11908080.

L. Bottou and O. Bousquet. The Tradeoffs of Large Scale Learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, volume 20, pages 161–168, 2007.

G. Bouma. Normalized (Pointwise) Mutual Information in Collocation Extraction. In Proceedings of the German Society for Computational Linguistics (GSCL 2009), pages 31–40, Potsdam, Germany, 2009. GSCL.

P. Bourdieu. The specificity of the scientific field and the social conditions of the progress of reason. Social Science Information, 14(6):19–47, 1975. doi: 10.1177/053901847501400602.

P. Bourdieu. The peculiar history of scientific reason. Sociological Forum, 6(1):3–26, 1991. doi: 10.1007/BF01112725.

K. W. Boyack and R. Klavans. Creation of a highly detailed, dynamic, global model and map of science. Journal of the Association for Information Science and Technology, 65(4):670–685, 2014. doi: 10.1002/asi.22990.

J. Boyd-Graber, D. Mimno, and D. Newman. Care and feeding of topic models: Problems, diagnostics, and improvements. In E. M. Airoldi, D. Blei, E. A. Erosheva, and S. E. Fienberg, editors, Handbook of Mixed Membership Models and Its Applications, pages 225–254. Chapman & Hall/CRC, 2014.


C. M. Brooks, L. B. Crowder, L. M. Curran, R. B. Dunbar, D. G. Ainley, K. J. Dodds, K. M. Gjerde, and U. R. Sumaila. Science-based management in decline in the Southern Ocean. Science, 354(6309):185–187, 2016. doi: 10.1126/science.aah4119.

W. L. Buntine. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2:159–225, 1994.

J. F. Caddy. Current usage of fisheries indicators and reference points, and their potential application to management of fisheries for marine invertebrates. Canadian Journal of Fisheries and Aquatic Sciences, 61(8):1307–1324, 2004. doi: 10.1139/f04-132.

J. F. Caddy and R. Mahon. Reference points for fisheries management. FAO Fisheries Technical Paper 347. Technical report, FAO, Rome, 1995.

S. X. Cadrin and M. Dickey-Collas. Stock assessment methods for sustainable fisheries. ICES Journal of Marine Science, 72(1):1–6, 2015. doi: 10.1093/icesjms/fsu228.

L. Campbell and M. Cornwell. Human dimensions of bycatch reduction technology: current assumptions and directions for future research. Endangered Species Research, 5:325–334, 2008. doi: 10.3354/esr00172.

L. Campling. The Tuna ‘Commodity Frontier’: Business Strategies and Environment in the Industrial Tuna Fisheries of the Western Indian Ocean. Journal of Agrarian Change, 12(2-3):252–278, 2012. doi: 10.1111/j.1471-0366.2011.00354.x.

L. Campling, E. Havice, and P. McCall Howard. The Political Economy and Ecology of Capture Fisheries: Market Dynamics, Resource Access and Relations of Exploitation and Resistance. Journal of Agrarian Change, 12(2-3):177–203, 2012. doi: 10.1111/j.1471-0366.2011.00356.x.

A. Casadevall and F. C. Fang. Specialized Science. Infection and Immunity, 82(4):1355–1360, 2014. doi: 10.1128/IAI.01530-13.

K. K. Cetina. Epistemic cultures: How the sciences make knowledge. Harvard University Press, Cambridge, MA, USA, 1999.

J. Chang and D. M. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1):124–150, 2010. doi: 10.1214/09-AOAS309.

J. Chang, S. Gerrish, C. Wang, and D. M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In NIPS’09 Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 288–296, Vancouver, British Columbia, Canada, 2009. Curran Associates Inc.

A. Charles. Sustainable Fishery Systems. Blackwell Science Ltd, Oxford, UK, 2000. doi: 10.1002/9780470698785.

C. Chen, W. Buntine, N. Ding, L. Xie, and L. Du. Differential Topic Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):230–242, 2015. doi: 10.1109/TPAMI.2014.2313127.


T.-H. Chen, S. W. Thomas, and A. E. Hassan. A survey on the use of topic models when mining software repositories. Empirical Software Engineering, 21(5):1843–1919, 2016. doi: 10.1007/s10664-015-9402-8.

G. G. Chowdhury. Natural Language Processing. Annual Review of Information Science and Technology, 37(1):51–89, 2003.

J. Chuang, D. Ramage, C. Manning, and J. Heer. Interpretation and Trust: Designing Model-Driven Visualizations for Text Analysis. In Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems - CHI ’12, pages 443–452, Austin, TX, USA, 2012. ACM Press. doi: 10.1145/2207676.2207738.

J. Chuang, J. D. Wilkerson, R. Weiss, D. Tingley, B. M. Stewart, M. E. Roberts, F. Poursabzi-Sangdeh, J. Grimmer, L. Findlater, J. Boyd-Graber, and J. Heer. Computer-Assisted Content Analysis: Topic Models for Exploring Multiple Subjective Interpretations. In Advances in Neural Information Processing Systems Workshop on Human-Propelled Machine Learning, pages 1–9, Montreal, QC, Canada, 2014. Curran Associates Inc.

A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004. doi: 10.1103/PhysRevE.70.066111.

J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 172–180, 2007.

G. A. B. da Fonseca. Conservation Science and NGOs. Conservation Biology, 17(2):345–347, 2003. doi: 10.1046/j.1523-1739.2003.01721.x.

F. Dahdouh-Guebas, J. Ahimbisibwe, R. Van Moll, and N. Koedam. Neo-colonial science by the most industrialised upon the least developed countries in peer-reviewed publishing. Scientometrics, 56(3):329–343, 2003. doi: 10.1023/A:1022374703178.

A. L. Dahl. Achievements and gaps in indicators for sustainability. Ecological Indicators, 17:14–19, 2012. doi: 10.1016/j.ecolind.2011.04.032.

C. De Young, A. Charles, and A. Hjort. Human dimensions of the ecosystem approach to fisheries: an overview of context, concepts, tools and methods. FAO Fisheries Technical Paper 489. Technical report, Rome, 2008.

S. Debortoli, O. Müller, I. Junglas, and J. vom Brocke. Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial. Communications of the Association for Information Systems, 39:110–135, 2016. doi: 10.17705/1CAIS.03907.

A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011a. doi: 10.1103/PhysRevE.84.066106.


A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and Phase Transitions in the Detection of Modules in Sparse Networks. Physical Review Letters, 107(6):065701, 2011b. doi: 10.1103/PhysRevLett.107.065701.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.

I. Derényi, G. Palla, and T. Vicsek. Clique Percolation in Random Networks. Physical Review Letters, 94(16):160202, 2005. doi: 10.1103/PhysRevLett.94.160202.

P. DiMaggio, M. Nag, and D. Blei. Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics, 41(6):570–606, 2013. doi: 10.1016/j.poetic.2013.08.004.

Y. Ding. Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics, 5(1):187–203, 2011. doi: 10.1016/j.joi.2010.10.008.

I. Douven and W. Meijs. Measuring coherence. Synthese, 156(3):405–425, 2007. doi: 10.1007/s11229-006-9131-z.

G. Doyle and C. Elkan. Accounting for burstiness in topic models. In Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09, pages 281–288, Montreal, QC, Canada, 2009. ACM Press. doi: 10.1145/1553374.1553410.

T. Dunning. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61–74, 1993.

B. Elango and P. Rajendran. Authorship Trends and Collaboration Pattern in the Marine Sciences Literature: A Scientometric Study. International Journal of Information Dissemination and Technology, 2(3):166–169, 2012.

J. M. Epstein. Why Model? Journal of Artificial Societies and Social Simulation, 11(4): 12, 2008.

E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America, 101(suppl 1):5220–5227, 2004. doi: 10.1073/pnas.0307760101.

A. Escobar. Beyond the Third World: imperial globality, global coloniality and anti-globalisation social movements. Third World Quarterly, 25(1):207–230, 2004. doi: 10.1080/0143659042000185417.

European Commission. Communication from the Commission to the Council, the European Parliament, the European Economic and Social Committee and the Committee of the Regions: A European Strategy for Marine. Technical report, European Commission, 2008.


European Commission. International ocean governance: an agenda for the future of our oceans (SWD(2016) 352 final). Technical report, European Commission, Brussels, Belgium, 2016a.

European Commission. Joint Communication to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions: International ocean governance: an agenda for the future of oceans. Technical report, 2016b.

N. Evangelopoulos, X. Zhang, and V. R. Prybutok. Latent Semantic Analysis: five methodological recommendations. European Journal of Information Systems, 21(1):70–86, 2012. doi: 10.1057/ejis.2010.61.

FAO. The State of World Fisheries and Aquaculture 2016. Contributing to food security and nutrition for all. Technical report, Food and Agriculture Organization of the United Nations, Rome, Italy, 2016.

FAO. The state of world fisheries and aquaculture - meeting the sustainable development goals. Technical report, Food and Agriculture Organization of the United Nations, Rome, Italy, 2018.

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, 1996. doi: 10.1145/240455.240464.

R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google’s image search. In Tenth IEEE International Conference on Computer Vision (ICCV’05), pages 1816–1823, Beijing, China, 2005. IEEE. doi: 10.1109/ICCV.2005.142.

A. Fink. How to Conduct Surveys: A Step-by-step Guide. SAGE Publications, Inc, London, UK, 4 edition, 2009.

C. Finley. All the Fish in the Sea: Maximum Sustainable Yield and the Failure of Fisheries Management. The University of Chicago Press, Chicago, IL, USA, 2011.

T. Forsyth. Critical political ecology: the politics of environmental science. Routledge, Abingdon, UK, 2003.

S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010. doi: 10.1016/j.physrep.2009.11.002.

S. Fortunato and M. Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007. doi: 10.1073/pnas.0605965104.

R. C. Francis. Fisheries science now and in the future: A personal view. New Zealand Journal of Marine and Freshwater Research, 14(1):95–100, 1980. doi: 10.1080/00288330.1980.9515849.


L. C. Freeman. Centrality in social networks conceptual clarification. Social Networks, 1(3):215–239, 1978. doi: 10.1016/0378-8733(78)90021-7.

S. Frickel, S. Gibbon, J. Howard, J. Kempner, G. Ottinger, and D. J. Hess. Undone Science: Charting Social Movement and Civil Society Challenges to Research Agenda Setting. Science, Technology, & Human Values, 35(4):444–473, 2010. doi: 10.1177/0162243909345836.

R. Froese, N. Demirel, G. Coro, K. M. Kleisner, and H. Winker. Estimating fisheries reference points from catch and resilience. Fish and Fisheries, 18(3):506–526, 2017. doi: 10.1111/faf.12190.

E. A. Fulton, A. D. M. Smith, D. C. Smith, and I. E. van Putten. Human behaviour: the key source of uncertainty in fisheries management. Fish and Fisheries, 12(1):2–17, 2011. doi: 10.1111/j.1467-2979.2010.00371.x.

S. K. Gaichas, M. Fogarty, G. Fay, R. Gamble, S. Lucey, and L. Smith. Combining stock, multispecies, and ecosystem level fishery objectives within an operational management procedure: simulations to start the conversation. ICES Journal of Marine Science: Journal du Conseil, 74(2):552–565, 2017. doi: 10.1093/icesjms/fsw119.

C. J. Gatti, J. D. Brooks, and S. G. Nurre. A Historical Analysis of the Field of OR/MS using Topic Models. arXiv.org, stat.ML, 2015.

M. Geoghegan-Quinn, E. Fast, K. Jones, and M. Damanaki. Galway Statement on Atlantic Ocean Cooperation: Launching a European Union - Canada - United States of America Research Alliance, 2013.

T. Gerl, H. Kreibich, G. Franco, D. Marechal, and K. Schröter. A Review of Flood Loss Models as Basis for Harmonization and Benchmarking. PLOS ONE, 11(7):e0159791, 2016. doi: 10.1371/journal.pone.0159791.

M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002. doi: 10.1073/pnas.122653799.

S. Goldwater, T. L. Griffiths, and M. Johnson. Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 673–680, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

P. K. Gopalan and D. M. Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 110(36):14534–14539, 2013. doi: 10.1073/pnas.1221839110.

M. Granovetter. The Strength of Weak Ties: A Network Theory Revisited. Sociological Theory, 1:201–233, 1983.


M. S. Granovetter. The Strength of Weak Ties. American Journal of Sociology, 78(6):1360–1380, 1973. doi: 10.1086/225469.

S. Gregory. Finding overlapping communities in networks by label propagation. New Journal of Physics, 12(10):103018, 2010. doi: 10.1088/1367-2630/12/10/103018.

T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Supplement 1):5228–5235, 2004. doi: 10.1073/pnas.0307752101.

T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In Proceedings of the 17th International Conference on Neural Information Processing Systems, volume 17, pages 537–544, Vancouver, British Columbia, Canada, 2005. MIT Press.

T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. Topics in semantic representation. Psychological Review, 114(2):211–244, 2007. doi: 10.1037/0033-295X.114.2.211.

J. Grimmer and B. M. Stewart. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3):267–297, 2013. doi: 10.1093/pan/mps028.

P. Haapasaari, S. Mäntyniemi, and S. Kuikka. Baltic Herring Fisheries Management: Stakeholder Views to Frame the Problem. Ecology and Society, 17(3):art36, 2012. doi: 10.5751/ES-04907-170336.

M. Hadjimichael. A call for a blue degrowth: Unravelling the European Union’s fisheries and maritime policies. Marine Policy, 94:158–164, 2018. doi: 10.1016/j.marpol.2018.05.007.

D. Hall, D. Jurafsky, and C. D. Manning. Studying the history of ideas using topic models. In EMNLP ’08 Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 363–371, Honolulu, Hawaii, 2008. Association for Computational Linguistics.

J. D. Hamblin. Visions of International Scientific Cooperation: The Case of Oceanic Science, 1920–1955. Minerva, 38(4):393–423, 2000. doi: 10.1023/A:1004827125474.

Z. S. Harris. Distributional Structure. WORD, 10(2-3):146–162, 1954. doi: 10.1080/00437956.1954.11659520.

E. Havice. Unsettled Sovereignty and the Sea: Mobilities and More-Than-Territorial Configurations of State Power. Annals of the American Association of Geographers, 108(5):1280–1297, 2018. doi: 10.1080/24694452.2018.1446820.

G. Heinrich. Parameter estimation for text analysis. Technical report, 2005.


E. Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelle’s Journal), 136:210–271, 1909. doi: 10.1515/crll.1909.136.210.

R. Helms and K. Buijsrogge. Knowledge Network Analysis: A Technique to Analyze Knowledge Management Bottlenecks in Organizations. In 16th International Workshop on Database and Expert Systems Applications (DEXA’05), pages 410–414. IEEE, 2005. doi: 10.1109/DEXA.2005.127.

C. C. Hicks, A. Levine, A. Agrawal, X. Basurto, S. J. Breslow, C. Carothers, S. Charnley, S. Coulthard, N. Dolsak, J. Donatuto, C. Garcia-Quijano, M. B. Mascia, K. Norman, M. R. Poe, T. Satterfield, K. St. Martin, and P. S. Levin. Engage key social concepts for sustainability. Science, 352(6281):38–40, 2016. doi: 10.1126/science.aad4977.

M. Hilbert and P. Lopez. The World’s Technological Capacity to Store, Communicate, and Compute Information. Science, 332(6025):60–65, 2011. doi: 10.1126/science.1200970.

H. Hill and M. Lackups. Journal Publication Trends Regarding Cetaceans Found in Both Wild and Captive Environments: What do we Study and Where do we Publish? International Journal of Comparative Psychology, 23(3):414–534, 2010.

J. Hoekman, K. Frenken, and R. J. Tijssen. Research collaboration at a distance: Changing spatial patterns of scientific collaboration within Europe. Research Policy, 39(5):662–673, 2010. doi: 10.1016/j.respol.2010.01.012.

M. D. Hoffman, D. M. Blei, and F. Bach. Online Learning for Latent Dirichlet Allocation. In NIPS’10 Proceedings of the 23rd International Conference on Neural Information Processing Systems, pages 856–864, Vancouver, British Columbia, Canada, 2010. Curran Associates Inc.

J. M. Hofman and C. H. Wiggins. Bayesian Approach to Network Modularity. Physical Review Letters, 100(25):258701, 2008. doi: 10.1103/PhysRevLett.100.258701.

T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’99, pages 50–57, New York, New York, USA, 1999. ACM Press. doi: 10.1145/312624.312649.

D. Hoggarth, C. Mees, and C. O’Neill. A guide to fisheries stock assessment using the FMSP tools. London, UK, 2005.

D. D. Hoggarth, S. Abeyasekera, R. I. Arthur, J. R. Beddington, R. W. Burn, A. S. Halls, G. P. Kirkwood, M. McAllister, P. Medley, C. C. Mees, G. B. Parkes, G. M. Pilling, R. C. Wakeford, and R. L. Welcomme. Stock assessment for fishery management: a framework guide to the stock assessment tools of the Fisheries Management and Science Programme. Technical report, FAO, Rome, 2006.


G. Holmes. Conservation’s Friends in High Places: Neoliberalism, Networks, and the Transnational Conservation Elite. Global Environmental Politics, 11(4):1–21, 2011.

J. Huang. Maximum Likelihood Estimation of Dirichlet Distribution Parameters. Technical report, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, 2005.

ICES. Report of the SIHD survey of the current state of "human dimension" in some ICES groups. Technical report, 2016.

ICES. SIHD, 2017.

S. P. Igo and E. Riloff. Corpus-based Semantic Lexicon Induction with Web-based Corroboration. In Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics, UMSLLS ’09, pages 18–26, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

IOC-UNESCO. Global Ocean Science Report - The Current Status of Ocean Science around the World. Technical report, 2017.

T. Jakobsen, M. J. Fogarty, B. A. Megrey, and E. Moksness, editors. Fish Reproductive Biology. John Wiley & Sons, Ltd, Oxford, 2016. doi: 10.1002/9781118752739.

I. Jarić, G. Cvijanović, J. Knežević-Jarić, and M. Lenhardt. Trends in Fisheries Science from 2000 to 2009: A Bibliometric Study. Reviews in Fisheries Science, 20(2):70–79, 2012. doi: 10.1080/10641262.2012.659775.

S. Jennings, M. J. Kaiser, and J. D. Reynolds. Marine Fisheries Ecology. Blackwell Science Ltd, Oxford, UK, 2009.

J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3 edition, 2012. doi: 10.1016/B978-0-12-381479-1.00001-0.

A. E. Johnson, J. E. Cinner, M. J. Hardt, J. Jacquet, T. R. McClanahan, and J. N. Sanchirico. Trends, current understanding and future research priorities for artisanal coral reef fisheries research. Fish and Fisheries, 14(3):281–292, 2013. doi: 10.1111/j.1467-2979.2012.00468.x.

B. F. Jones, S. Wuchty, and B. Uzzi. Multi-University Research Teams: Shifting Impact, Geography, and Stratification in Science. Science, 322(5905):1259–1262, 2008. doi: 10.1126/science.1158357.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37:183–233, 1999.

J. Katz and B. R. Martin. What is research collaboration? Research Policy, 26(1):1–18, 1997. doi: 10.1016/S0048-7333(96)00917-1.

J. S. Katz. Geographical proximity and scientific collaboration. Scientometrics, 31(1):31–43, 1994. doi: 10.1007/BF02018100.


K. Kelleher. Discards in the world’s marine fisheries: An update. Technical report, FAO Fisheries Technical Paper 470, Rome, 2005.

S. Kim, S. Narayanan, and S. Sundaram. Acoustic topic model for audio information retrieval. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 37–40, New Paltz, NY, USA, 2009. IEEE. doi: 10.1109/ASPAA.2009.5346483.

D. A. King. The scientific impact of nations. Nature, 430(6997):311–316, 2004. doi: 10.1038/430311a.

M. King. Fisheries Biology, Assessment and Management. Blackwell Publishing Ltd, Oxford, UK, 2007. doi: 10.1002/9781118688038.

C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining - WSDM ’10, page 441, 2010. doi: 10.1145/1718487.1718542.

S. Koltcov, O. Koltsova, and S. Nikolenko. Latent dirichlet allocation. In Proceedings of the 2014 ACM conference on Web science - WebSci ’14, pages 161–165, New York, New York, USA, 2014. ACM Press. doi: 10.1145/2615569.2615680.

M. Krochmal and H. Husi. Knowledge Discovery and Data Mining. In A. Vlahou, H. Mischak, J. Zoidakis, and F. Magni, editors, Integration of Omics Approaches and Systems Biology for Clinical Applications, pages 233–247. John Wiley & Sons, Inc., Hoboken, NJ, USA, 1 edition, 2018. doi: 10.1002/9781119183952.ch14.

D. A. Kroodsma, J. Mayorga, T. Hochberg, N. A. Miller, K. Boerder, F. Ferretti, A. Wilson, B. Bergman, T. D. White, B. A. Block, P. Woods, B. Sullivan, C. Costello, and B. Worm. Tracking the global footprint of fisheries. Science, 359(6378):904–908, 2018. doi: 10.1126/science.aao5646.

S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

R. Kumaresan, R. Ezhilrani, K. Vinitha, P. Sivaraman, and R. Jayaraman. Research trends in fish stock assessment during 1999–2013: A scientometrics study. International Journal of Library and Information Science, 3(2):24–36, 2014.

R. Lambiotte and P. Panzarasa. Communities, knowledge creation, and information diffusion. Journal of Informetrics, 3(3):180–190, 2009. doi: 10.1016/j.joi.2009.03.007.

R. Lambiotte, J.-C. Delvenne, and M. Barahona. Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks. IEEE Transactions on Network Science and Engineering, 1(2):76–90, 2014. doi: 10.1109/TNSE.2015.2391998.

A. Lancichinetti and S. Fortunato. Community detection algorithms: A comparative analysis. Physical Review E, 80(5):056117, 2009. doi: 10.1103/PhysRevE.80.056117.


A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato. Finding Statistically Significant Communities in Networks. PLoS ONE, 6(4):e18961, 2011. doi: 10.1371/journal.pone.0018961.

A. Lancichinetti, M. I. Sirer, J. X. Wang, D. Acuna, K. Körding, and L. A. N. Amaral. High-Reproducibility and High-Accuracy Method for Automated Topic Classification. Physical Review X, 5(1):011007, 2015. doi: 10.1103/PhysRevX.5.011007.

P. O. Larsen and M. von Ins. The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics, 84(3):575–603, 2010. doi: 10.1007/s11192-010-0202-z.

B. Latour. We have never been modern. Harvard University Press, Cambridge, 1993.

J. H. Lau, K. Grieser, D. Newman, and T. Baldwin. Automatic Labelling of Topic Models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1536–1545, Portland, Oregon, USA, 2011. Association for Computational Linguistics.

J. Law. On the Social Explanation of Technical Change: The Case of the Portuguese Maritime Expansion. Technology and Culture, 28(2):227, 1987. doi: 10.2307/3105566.

E. Leahey. From Sole Investigator to Team Scientist: Trends in the Practice and Study of Research Collaboration. Annual Review of Sociology, 42(1):81–100, 2016. doi: 10.1146/annurev-soc-081715-074219.

E. Leahey and R. C. Reikowsky. Research Specialization and Collaboration Patterns in Sociology. Social Studies of Science, 38(3):425–440, 2008. doi: 10.1177/0306312707086190.

D. B. Lenat. CYC: a large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995. doi: 10.1145/219717.219745.

R. J. Lennox, J. Alós, R. Arlinghaus, A. Horodysky, T. Klefoth, C. T. Monk, and S. J. Cooke. What makes fish vulnerable to capture by hooks? A conceptual framework and a review of key determinants. Fish and Fisheries, 18(5):986–1010, 2017. doi: 10.1111/faf.12219.

P. S. Levin, G. D. Williams, A. Rehr, K. C. Norman, and C. J. Harvey. Developing conservation targets in social-ecological systems. Ecology and Society, 20(4), 2015. doi: 10.5751/ES-07866-200406.

S. Levin, T. Xepapadeas, A.-S. Crépin, J. Norberg, A. de Zeeuw, C. Folke, T. Hughes, K. Arrow, S. Barrett, G. Daily, P. Ehrlich, N. Kautsky, K.-G. Mäler, S. Polasky, M. Troell, J. R. Vincent, and B. Walker. Social-ecological systems as complex adaptive systems: modeling and policy implications. Environment and Development Economics, 18(2):111–132, 2013. doi: 10.1017/S1355770X12000460.


S. C. Lewis, R. Zamith, and A. Hermida. Content Analysis in an Era of Big Data: A Hybrid Approach to Computational and Manual Methods. Journal of Broadcasting & Electronic Media, 57(1):34–52, 2013. doi: 10.1080/08838151.2012.761702.

L. Leydesdorff and C. S. Wagner. International collaboration in science and the formation of a core group. Journal of Informetrics, 2(4):317–325, 2008. doi: 10.1016/j.joi.2008.07.003.

L. Leydesdorff, C. Wagner, H. W. Park, and J. Adams. International Collaboration in Science: The Global Map and the Network. El Profesional de la Información, 22(1):87–94, 2013.

Y. Li, D. McLean, Z. Bandar, J. O’Shea, and K. Crockett. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8):1138–1150, 2006. doi: 10.1109/TKDE.2006.130.

J. Lin. Is searching full text more effective than searching abstracts? BMC Bioinformatics, 10(1):46, 2009. doi: 10.1186/1471-2105-10-46.

J. Link. Ecosystem-Based Fisheries Management: Confronting Tradeoffs. Cambridge University Press, Cambridge, 2010. doi: 10.1017/CBO9780511667091.

P. Liu and H. Xia. Structure and evolution of co-authorship network in an interdisciplinary research field. Scientometrics, 103(1):101–134, 2015. doi: 10.1007/s11192-014-1525-y.

K. Lorenzen. Toward a new paradigm for growth modeling in fisheries stock assessments: Embracing plasticity and its consequences. Fisheries Research, 180:4–22, 2016. doi: 10.1016/j.fishres.2016.01.006.

C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, USA, 1999.

C. D. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 1 edition, 2009.

T. Martin, B. Ball, B. Karrer, and M. E. J. Newman. Coauthorship and citation patterns in the Physical Review. Physical Review E, 88(1):012814, 2013. doi: 10.1103/PhysRevE.88.012814.

R. L. Mason, R. F. Gunst, and J. L. Hess. Statistical Design and Analysis of Experiments. John Wiley & Sons, Ltd, Hoboken, New Jersey, USA, 2 edition, 2003.

M. E. Mather, D. L. Parrish, and J. M. Dettmers. Mapping the Changing Landscape of Fish-related Journals: Setting a Course for Successful Communication of Scientific Information. Fisheries, 33(9):444–453, 2008. doi: 10.1577/1548-8446-33.9.444.

B. S. Matulis and J. R. Moyer. Beyond Inclusive Conservation: The Value of Pluralism, the Need for Agonism, and the Case for Social Instrumentalism. Conservation Letters, 10(3):279–287, 2017. doi: 10.1111/conl.12281.

M. N. Maunder, P. R. Crone, A. E. Punt, J. L. Valero, and B. X. Semmens. Growth: Theory, estimation, and application in fishery stock assessment models. Fisheries Research, 180:1–3, 2016. doi: 10.1016/j.fishres.2016.03.005.

F. Maynou. Coviability analysis of Western Mediterranean fisheries under MSY scenarios for 2020. ICES Journal of Marine Science, 71(7):1563–1571, 2014. doi: 10.1093/icesjms/fsu061.

A. K. McCallum. MALLET: A Machine Learning for Language Toolkit, 2002.

M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 27(1):415–444, 2001. doi: 10.1146/annurev.soc.27.1.415.

R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 935–942, Miami, FL, USA, 2009. IEEE. doi: 10.1109/CVPR.2009.5206641.

Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. In Proceedings of the 17th international conference on World Wide Web - WWW ’08, page 101, New York, New York, USA, 2008. ACM Press. doi: 10.1145/1367497.1367512.

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.

D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 2 - EMNLP ’09, volume 2, pages 880–889, Morristown, NJ, USA, 2009. Association for Computational Linguistics. doi: 10.3115/1699571.1699627.

D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272, 2011.

C. Minto and C. Lordan. GEPETO: Review of mixed fisheries modelling approaches for the Celtic Sea. Technical report, European Union, 2014.

J. W. Mohr and P. Bogdanov. Introduction—Topic models: What they are and why they matter. Poetics, 41(6):545–569, 2013. doi: 10.1016/j.poetic.2013.10.001.

A. Mol, C. Kwa, L. Thevenot, M. Strathern, A. Barry, C. Thompson, M. Callon, N. Lee, J. Law, and S. D. Brown. Complexities: Social studies of knowledge practices. Duke University Press, Durham, NC, USA, 2002.

J. M. Molina and S. J. Cooke. Trends in shark bycatch research: current status and research needs. Reviews in Fish Biology and Fisheries, 22(3):719–737, 2012. doi: 10.1007/s11160-012-9269-3.

C. Möllmann, M. Lindegren, T. Blenckner, L. Bergström, M. Casini, R. Diekmann, J. Flinkman, B. Müller-Karulis, S. Neuenfeldt, J. O. Schmidt, M. Tomczak, R. Voss, and A. Gårdmark. Implementing ecosystem-based fisheries management: From single-species to integrated ecosystem assessment and advice for Baltic Sea fish stocks. ICES Journal of Marine Science, 71(5):1187–1197, 2014. doi: 10.1093/icesjms/fst123.

D. C. Montgomery. Design and Analysis of Experiments. John Wiley & Sons, Ltd, New York, USA, 8 edition, 2012.

S. L. Montgomery. Does Science Need a Global Language?: English and the Future of Research. University of Chicago Press, Chicago, IL, USA, 2013.

J. Moody. The Structure of a Social Science Collaboration Network: Disciplinary Cohesion from 1963 to 1999. American Sociological Review, 69(2):213–238, 2004. doi: 10.1177/000312240406900204.

J. Moody and R. Light. A view from above: The evolving sociological landscape. The American Sociologist, 37(2):67–86, 2006. doi: 10.1007/s12108-006-1006-8.

S. Narayanamurthy. Yahoo! LDA project, 2011.

F. Natale, G. Fiore, and J. Hofherr. Mapping the research on aquaculture. A bibliometric analysis of aquaculture literature. Scientometrics, 90(3):983–999, 2012. doi: 10.1007/s11192-011-0562-z.

M. W. Neff and E. A. Corley. 35 years and 160,000 articles: A bibliometric exploration of the evolution of ecology. Scientometrics, 80(3):657–682, 2009. doi: 10.1007/s11192-008-2099-3.

K. A. Neuendorf. The Content Analysis Guidebook. SAGE Publications, Inc, London, UK, 2 edition, 2016.

D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent dirichlet allocation. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 1081–1088, Vancouver, British Columbia, Canada, 2007.

D. Newman, J. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108, Stroudsburg, PA, USA, 2010a. Association for Computational Linguistics.

D. Newman, Y. Noh, E. Talley, S. Karimi, and T. Baldwin. Evaluating topic models for digital libraries. Proceedings of the 10th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 215–224, 2010b. doi: 10.1145/1816123.1816156.

M. Newman. Communities, modules and large-scale structure in networks. Nature Physics, 8(1):25–31, 2012a. doi: 10.1038/nphys2162.

M. E. J. Newman. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 64(1):016131, 2001. doi: 10.1103/PhysRevE.64.016131.

M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003. doi: 10.1103/PhysRevE.67.026126.

M. E. J. Newman. Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences, 101(Supplement 1):5200–5205, 2004. doi: 10.1073/pnas.0307545100.

M. E. J. Newman. Communities, modules and large-scale structure in networks. Nature Physics, 8(1):25–31, 2012b. doi: 10.1038/nphys2162.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004. doi: 10.1103/PhysRevE.69.026113.

M. E. J. Newman and E. A. Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences, 104(23):9564–9569, 2007. doi: 10.1073/pnas.0610537104.

M. E. J. Newman and G. Reinert. Estimating the Number of Communities in a Network. Physical Review Letters, 117(7):078301, 2016. doi: 10.1103/PhysRevLett.117.078301.

K. N. Nielsen and P. Holm. A brief catalogue of failures: Framing evaluation and learning in fisheries resource management. Marine Policy, 31(6):669–680, 2007. doi: 10.1016/j.marpol.2007.03.014.

N. Nikolic, J.-L. Baglinière, C. Rigaud, C. Gardes, M. L. Masquilier, and C. Taverny. Bibliometric analysis of diadromous fish research from 1970s to 2010: a case study of seven species. Scientometrics, 88(3):929–947, 2011. doi: 10.1007/s11192-011-0422-x.

K. Nowicki and T. A. B. Snijders. Estimation and Prediction for Stochastic Blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001. doi: 10.1198/016214501753208735.

D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham. An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications, 42(13):5645–5657, 2015. doi: 10.1016/j.eswa.2015.02.055.

OECD. Main Science and Technology Indicators. Science and Technology, 2008:104, 2008. doi: 10.1787/data-00182-en.

J. G. C. Oliveira Júnior, L. P. S. Silva, A. C. M. Malhado, V. S. Batista, N. N. Fabré, and R. J. Ladle. Artisanal Fisheries Research: A Need for Globalization? PLOS ONE, 11(3):e0150689, 2016. doi: 10.1371/journal.pone.0150689.

H. Österblom, A. Merrie, M. Metian, W. J. Boonstra, T. Blenckner, J. R. Watson, R. R. Rykaczewski, Y. Ota, J. L. Sarmiento, V. Christensen, M. Schlüter, S. Birnbaum, B. G. Gustafsson, C. Humborg, C.-M. Mörth, B. Müller-Karulis, M. T. Tomczak, M. Troell, and C. Folke. Modeling Social–Ecological Scenarios in Marine Systems. BioScience, 63(9):735–744, 2013. doi: 10.1525/bio.2013.63.9.9.

H. Österblom, J.-B. Jouffray, C. Folke, B. Crona, M. Troell, A. Merrie, and J. Rockström. Transnational Corporations as ‘Keystone Actors’ in Marine Ecosystems. PLOS ONE, 10(5):e0127533, 2015. doi: 10.1371/journal.pone.0127533.

E. Ostrom. A General Framework for Analyzing Sustainability of Social-Ecological Systems. Science, 325(5939):419–422, 2009. doi: 10.1126/science.1172133.

G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005. doi: 10.1038/nature03607.

G. Palsson, B. Szerszynski, S. Sörlin, J. Marks, B. Avril, C. Crumley, H. Hackmann, P. Holm, J. Ingram, A. Kirman, M. P. Buendía, and R. Weehuizen. Reconceptualizing the ‘Anthropos’ in the Anthropocene: Integrating the social sciences and humanities in global environmental change research. Environmental Science & Policy, 28:3–13, 2013. doi: 10.1016/j.envsci.2012.11.004.

P. Pantel and D. Ravichandran. Automatically Labeling Semantic Classes. In HLT-NAACL, pages 321–328, 2004.

M. R. Parreira, K. B. Machado, R. Logares, J. A. F. Diniz-Filho, and J. C. Nabout. The roles of geographic distance and socioeconomic factors on international collaboration among ecologists. Scientometrics, 113(3):1539–1550, 2017. doi: 10.1007/s11192-017-2502-z.

S. Partelow. Key steps for operationalizing social–ecological system framework research in small-scale fisheries: A heuristic conceptual approach. Marine Policy, 51:507–511, 2015. doi: 10.1016/j.marpol.2014.09.005.

M. Paul. Cross-collection topic models: automatically comparing and contrasting text. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, 2009.

M. A. Peters. The Rise of Global Science and the Emerging Political Economy of International Research Collaborations. European Journal of Education, 41(2):225–244, 2006. doi: 10.1111/j.1465-3435.2006.00257.x.

C. Phelps, R. Heidl, and A. Wadhwa. Knowledge, Networks, and Knowledge Networks. Journal of Management, 38(4):1115–1166, 2012. doi: 10.1177/0149206311432640.

W. Phillips and E. Riloff. Exploiting strong syntactic heuristics and co-training to learn semantic lexicons. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP ’02, volume 10 of EMNLP ’02, pages 125–132, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118693.1118710.

K. R. Piner, H.-H. Lee, and M. N. Maunder. Evaluation of using random-at-length observations and an equilibrium approximation of the population age structure in fitting the von Bertalanffy growth function. Fisheries Research, 180:128–137, 2016. doi: 10.1016/j.fishres.2015.05.024.

E. E. Plagányi. Models for an ecosystem approach to fisheries, volume 477 of FAO Fisheries Technical Paper. FAO, Rome, 2007.

I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation. In KDD ’08 Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 569–577, Las Vegas, Nevada, USA, 2008. ACM Press. doi: 10.1145/1401890.1401960.

A. L. Porter and I. Rafols. Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics, 81(3):719–745, 2009. doi: 10.1007/s11192-008-2197-2.

M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

R. Prellezo, P. Accadia, J. L. Andersen, B. S. Andersen, E. Buisman, A. Little, J. R. Nielsen, J. J. Poos, J. Powell, and C. Röckmann. A review of EU bio-economic models for fisheries: The value of a diversity of models. Marine Policy, 36(2):423–431, 2012. doi: 10.1016/j.marpol.2011.08.003.

S. W. Purcell and R. S. Pomeroy. Driving small-scale fisheries in developing countries. Frontiers in Marine Science, 2:44, 2015. doi: 10.3389/fmars.2015.00044.

A. Qadir and E. Riloff. Ensemble-based Semantic Lexicon Induction for Semantic Tagging. In First Joint Conference on Lexical and Computational Semantics, SemEval ’12, pages 199–208, Montreal, Canada, 2012. Association for Computational Linguistics.

A. Qadir, P. N. Mendes, D. Gruhl, and N. Lewis. Semantic Lexicon Induction from Twitter with Pattern Relatedness and Flexible Term Length. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2432–2439. AAAI Press, 2015.

D. Quick and K.-K. R. Choo. Impacts of increasing volume of digital forensic data: A survey and future research challenges. Digital Investigation, 11(4):273–294, 2014. doi: 10.1016/j.diin.2014.09.002.

K. M. Quinn, B. L. Monroe, M. Colaresi, M. H. Crespin, and D. R. Radev. How to Analyze Political Attention with Minimal Assumptions and Costs. American Journal of Political Science, 54(1):209–228, 2010. doi: 10.1111/j.1540-5907.2009.00427.x.

D. Ramage and E. Rosen. The Stanford Topic Modeling Toolbox, 2009.

D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, Singapore, 2009. Association for Computational Linguistics.

R. Rehurek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, 2010. European Language Resources Association (ELRA). doi: 10.13140/2.1.2393.1847.

D. Reinsel, J. Gantz, and J. Rydning. Data Age 2025: The Evolution of Data to Life-Critical. Technical report, IDC, 2017.

J. Reisinger, A. Waters, B. Silverthorn, and R. J. Mooney. Spherical Topic Models. In Proceedings of the 27th International Conference on Machine Learning, pages 903–910, Haifa, Israel, 2010. International Machine Learning Society (IMLS).

P. Resnik. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11:95–130, 1999. doi: 10.1613/jair.514.

L. M. Rhody. Topic Modeling and Figurative Language. Journal of Digital Humanities, 2(1):19–35, 2013.

W. E. Ricker. Computation and interpretation of biological statistics of fish populations. Bulletin of the Fisheries Research Board of Canada, (191):401, 1975.

E. Riloff. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, AAAI’96, pages 1044–1049. AAAI Press, 1996.

E. Riloff and J. Shepherd. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124, 1997.

A. Rindorf, C. M. Dichmont, J. Thorson, A. Charles, L. W. Clausen, P. Degnbol, D. Garcia, N. T. Hintzen, A. Kempf, P. Levin, P. Mace, C. Maravelias, C. Minto, J. Mumford, S. Pascoe, R. Prellezo, A. E. Punt, D. G. Reid, C. Röckmann, R. L. Stephenson, O. Thebaud, G. Tserpes, and R. Voss. Inclusion of ecological, economic, social, and institutional considerations when setting targets and limits for multispecies fisheries. ICES Journal of Marine Science, 74(2):fsw226, 2017. doi: 10.1093/icesjms/fsw226.

M. A. Riolo, G. T. Cantwell, G. Reinert, and M. E. J. Newman. Efficient method for estimating the number of communities in a network. Physical Review E, 96(3):032310, 2017. doi: 10.1103/PhysRevE.96.032310.

B. Roark and E. Charniak. Noun-phrase Co-occurrence Statistics for Semiautomatic Semantic Lexicon Construction. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 2, COLING ’98, pages 1110–1116, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics. doi: 10.3115/980432.980751.

M. E. Roberts, B. M. Stewart, and D. Tingley. Navigating the Local Modes of Big Data: The Case of Topic Models. Computational Social Science: Discovery and Prediction, pages 51–97, 2016.

M. Röder, A. Both, and A. Hinneburg. Exploring the Space of Topic Coherence Measures. In WSDM ’15 Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408, Shanghai, China, 2015. ACM Press. doi: 10.1145/2684822.2685324.

N. Rose, D. Janiger, E. Parsons, and M. Stachowitsch. Shifting baselines in scientific publications: A case study using cetacean research. Marine Policy, 35(4):477–482, 2011. doi: 10.1016/j.marpol.2010.11.002.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 487–494, Banff, Canada, 2004. AUAI Press Arlington.

H. M. Rozwadowski. The Sea Knows No Boundaries: A Century of Marine Science under ICES. University of Washington Press, Seattle, WA, USA, 2002.

H. M. Rozwadowski. Internationalism, Environmental Necessity, and National Interest: Marine Science and Other Sciences. Minerva, 42(2):127–149, 2004. doi: 10.1023/B:MINE.0000030023.04586.45.

T. Rusch, P. Hofmarcher, R. Hatzinger, and K. Hornik. Model trees with topic model preprocessing: An approach for data journalism illustrated with the WikiLeaks Afghanistan war logs. The Annals of Applied Statistics, 7(2):613–639, 2013. doi: 10.1214/12-AOAS618.

E. R. Saetnan and R. P. Kipling. Evaluating a European knowledge hub on climate change in agriculture: Are we building a better connected community? Scientometrics, 109(2):1057–1074, 2016. doi: 10.1007/s11192-016-2064-5.

G. Salton. Automatic Information Organization and Retrieval. McGraw Hill Text, 1968.

A. Schofield, M. Magnusson, L. Thompson, and D. Mimno. Understanding Text Pre-Processing for Latent Dirichlet Allocation. In Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics: Volume 2, pages 432–436, 2017. Association for Computational Linguistics.

T. Schott. The world scientific community: Globality and globalisation. Minerva, 29(4):440–462, 1991. doi: 10.1007/BF01113491.

T. Schott. World Science: Globalization of Institutions and Participation. Science, Technology, & Human Values, 18(2):196–208, 1993. doi: 10.1177/016224399301800205.

Science. Challenges and Opportunities. Science, 331(6018):692–693, 2011. doi: 10.1126/science.331.6018.692.

J. Scott. Social Network Analysis. SAGE Publications, Inc, London, UK, 4 edition, 2017.

J. G. Scott and J. Baldridge. A recursive estimate for the predictive likelihood in a topic model. Journal of Machine Learning Research, 31:527–535, 2013.

W. Seele, S. Syed, and S. Brinkkemper. The Functional Architecture Modeling Method Applied on Web Browsers. In 2014 IEEE/IFIP Conference on Software Architecture, pages 171–174, Sydney, Australia, 2014. IEEE. doi: 10.1109/WICSA.2014.40.

C. Shearer. The CRISP-DM Model: The New Blueprint for Data Mining. The Journal of Data Warehousing, 5(4):13–22, 2000.

C. Sievert and K. Shirley. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70, Baltimore, Maryland, USA, 2014. Association for Computational Linguistics.

C. A. Simpfendorfer and N. K. Dulvy. Bright spots of sustainable shark fishing. Current Biology, 27(3):R97–R98, 2017. doi: 10.1016/j.cub.2016.12.017.

T. D. Smith and J. S. Link. Autopsy your dead...and living: a proposal for fisheries science, fisheries management and fisheries. Fish and Fisheries, 6(1):73–87, 2005. doi: 10.1111/j.1467-2679.2005.00176.x.

T. D. Smith. Scaling fisheries: the science of measuring the effects of fishing, 1855–1955. Cambridge University Press, Cambridge, UK, 1 edition, 1994.

M. Sowman. New perspectives in small-scale fisheries management: challenges and prospects for implementation in South Africa. African Journal of Marine Science, 33(2):297–311, 2011. doi: 10.2989/1814232X.2011.602875.

A. K. Spalding, K. Biedenweg, A. Hettinger, and M. P. Nelson. Demystifying the human dimension of ecological research. Frontiers in Ecology and the Environment, 15(3):119–119, 2017. doi: 10.1002/fee.1476.

M. Spruit and M. Lytras. Applied data science in patient-centric healthcare: Adaptive analytic systems for empowering physicians and patients. Telematics and Informatics, 35(4):643–653, 2018. doi: 10.1016/j.tele.2018.04.002.

A. Srivastava and M. Sahami. Text mining: Classification, clustering, and applications. CRC Press, 2009.

R. L. Stephenson, S. Paul, M. Wiber, E. Angel, A. J. Benson, A. Charles, O. Chouinard, M. Clemens, D. Edwards, P. Foley, L. Jennings, O. Jones, D. Lane, J. McIsaac, C. Mussells, B. Neis, B. Nordstrom, C. Parlee, E. Pinkerton, M. Saunders, K. Squires, and U. R. Sumaila. Evaluating and implementing social-ecological systems: A comprehensive approach to sustainable fisheries. Fish and Fisheries, (October 2017):1–21, 2018. doi: 10.1111/faf.12296.

K. Stevens, P. Kegelmeyer, D. Andrzejewski, and D. Buttler. Exploring Topic Coherence over Many Models and Many Topics. In EMNLP-CoNLL ’12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961, Jeju Island, Korea, 2012. Association for Computational Linguistics.

M. Steyvers and T. Griffiths. Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Handbook of Latent Semantic Analysis, volume 427, pages 424–440. Erlbaum, 2007.

S. Stone-Jovicich. Probing the interfaces between the social sciences and social-ecological resilience: insights from integrative and hybrid perspectives in the social sciences. Ecology and Society, 20(2):art25, 2015. doi: 10.5751/ES-07347-200225.

L. Sun and Y. Yin. Discovering themes and trends in transportation research using topic modeling. Transportation Research Part C: Emerging Technologies, 77:49–66, 2017. doi: 10.1016/j.trc.2017.01.013.

S. Syed and S. Jansen. On Clusters in Open Source Ecosystems. In Proceedings of the International Workshop on Software Ecosystems, pages 13–25, Potsdam, Germany, 2013. CEUR.

S. Syed and M. Spruit. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174, Tokyo, Japan, 2017. IEEE. doi: 10.1109/DSAA.2017.61.

S. Syed and M. Spruit. Selecting Priors for Latent Dirichlet Allocation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pages 194–202, Laguna Hills, CA, USA, 2018a. IEEE. doi: 10.1109/ICSC.2018.00035.

S. Syed and M. Spruit. Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12(3):399–423, 2018b. doi: 10.1142/S1793351X18400184.

S. Syed and C. T. Weber. Using Machine Learning to Uncover Latent Research Topics in Fishery Models. Reviews in Fisheries Science & Aquaculture, 26(3):319–336, 2018. doi: 10.1080/23308249.2017.1416331.

S. Syed, M. Spruit, and M. Borit. Bootstrapping a Semantic Lexicon on Verb Similarities. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 1, pages 189–196. Scitepress, 2016. doi: 10.5220/0006036901890196.

S. Syed, M. Borit, and M. Spruit. Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19(4):643–661, 2018a. doi: 10.1111/faf.12280.

S. Syed, L. ní Aodha, C. Scougal, and M. Spruit. Mapping the global network of fisheries science collaboration. Reinforcing or broad-based structures of knowledge production? (submitted for publication). 2018b.

D. Symes and E. Hoefnagel. Fisheries policy, research and the social sciences in Europe: Challenges for the 21st century. Marine Policy, 34(2):268–275, 2010a. doi: 10.1016/j.marpol.2009.07.006.

D. Symes and E. Hoefnagel. Fisheries policy, research and the social sciences in Europe: Challenges for the 21st century. Marine Policy, 34(2):268–275, 2010b. doi: 10.1016/J.MARPOL.2009.07.006.

D. Symes, J. Phillipson, and P. Salmi. Europe’s Coastal Fisheries: Instability and the Impacts of Fisheries Policy. Sociologia Ruralis, 55(3):245–257, 2015. doi: 10.1111/soru.12096.

K. Taghva, R. Elkhoury, and J. Coombs. Arabic stemming without a root dictionary. In Proceedings of the International Conference on Information Technology: Coding and Computing, pages 152–157, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

Z. Tang and J. Maclennan. Data Mining With SQL Server 2005. Wiley, 2005.

Y. W. Teh, D. Newman, and M. Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In NIPS’06 Proceedings of the 19th International Conference on Neural Information Processing Systems, pages 1353–1360, Vancouver, British Columbia, Canada, 2006. MIT Press Cambridge, MA, USA.

M. Thelen and E. Riloff. A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 214–221, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118693.1118721.

J. Tollefson. China declared world’s largest producer of scientific articles. Nature, 553(7689):390–390, 2018. doi: 10.1038/d41586-018-00927-4.

M. Tsvetovat and A. Kouznetsov. Social Network Analysis for Startups. O’Reilly Media, Inc, Sebastopol, CA, USA, 2011.

J. K. Turpie, B. J. Heydenrych, and S. J. Lamberth. Economic value of terrestrial and marine biodiversity in the Cape Floristic Region: implications for defining effective and socially optimal conservation strategies. Biological Conservation, 112(1-2):233–251, 2003. doi: 10.1016/S0006-3207(02)00398-1.

C. Urquhart. An encounter with grounded theory: tackling the practical and philosophical issues. In E. M. Trauth, editor, Qualitative research in IS, pages 104–140. IGI Publishing Hershey, PA, USA, 2001.

A. Viamontes Esquivel and M. Rosvall. Compression of Flow Can Reveal Overlapping-Module Organization in Networks. Physical Review X, 1(2):021025, 2011. doi: 10.1103/PhysRevX.1.021025.

L. von Bertalanffy. Quantitative Laws in Metabolism and Growth. The Quarterly Review of Biology, 32(3):217–231, 1957. doi: 10.1086/401873.

C. S. Wagner and L. Leydesdorff. Network structure, self-organization, and the growth of international collaboration in science. Research Policy, 34(10):1608–1618, 2005. doi: 10.1016/j.respol.2005.08.002.

C. S. Wagner, L. Bornmann, and L. Leydesdorff. Recent Developments in China–U.S. Cooperation in Science. Minerva, 53(3):199–214, 2015a. doi: 10.1007/s11024-015-9273-6.

C. S. Wagner, H. W. Park, and L. Leydesdorff. The Continuing Growth of Global Cooperation Networks in Research: A Conundrum for National Governments. PLOS ONE, 10(7):e0131816, 2015b. doi: 10.1371/journal.pone.0131816.

H. M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning - ICML ’06, pages 977–984, Pittsburgh, Pennsylvania, USA, 2006a. ACM Press. doi: 10.1145/1143844.1143967.

H. M. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning - ICML ’06, pages 977–984, New York, New York, USA, 2006b. ACM Press. doi: 10.1145/1143844.1143967.

H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why Priors Matter. In NIPS’09 Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 1973–1981, Vancouver, British Columbia, Canada, 2009. Curran Associates Inc.

H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation Methods for Topic Models. In ICML ’09 Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112, 2009.

C. Wang and D. M. Blei. Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process. In NIPS’09 Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 1982–1989, Vancouver, British Columbia, Canada, 2009. Curran Associates Inc.

C. Wang, B. Thiesson, C. Meek, and D. Blei. Markov topic models. In International Conference on Artificial Intelligence and Statistics, volume 5, pages 583–590, Clearwater Beach, Florida, USA, 2009.

C. Wang, J. Paisley, and D. M. Blei. Online Variational Inference for the Hierarchical Dirichlet Process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pages 752–760, Fort Lauderdale, FL, USA, 2011. PMLR.

X. Wang and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’06, pages 424–433, Philadelphia, PA, USA, 2006. ACM Press. doi: 10.1145/1150402.1150450.

C. T. Weber and S. Syed. Public Perception of Interdisciplinarity: of Twitter Data (submitted for publication). 2018.

L. Wei and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, pages 577–584, Pittsburgh, PA, USA, 2006. doi: 10.1145/1143844.1143917.

X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’06, page 178, New York, New York, USA, 2006. ACM Press. doi: 10.1145/1148170.1148204.

M. J. Westgate, P. S. Barton, J. C. Pierson, and D. B. Lindenmayer. Text analysis tools for identification of emerging topics and research gaps in conservation science. Conservation Biology, 29(6):1606–1614, 2015. doi: 10.1111/cobi.12605.

Y. Whye Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Sharing clusters among related groups: Hierarchical Dirichlet processes. In NIPS’04 Proceedings of the 17th International Conference on Neural Information Processing Systems, pages 1385–1392, Vancouver, British Columbia, Canada, 2004. MIT Press Cambridge.

D. Widdows and B. Dorow. A Graph Model for Unsupervised Lexical Acquisition. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING ’02, pages 1–7, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1072228.1072342.

G. M. Winder. Introduction: Fisheries, Quota Management, Quota Transfer and Bioeconomic Rationalization. In G. M. Winder, editor, Fisheries, Quota Management and Quota Transfer, volume 15 of MARE Publication Series, pages 3–28. Springer International Publishing, Cham, 2018. doi: 10.1007/978-3-319-59169-8.

G. M. Winder and R. Le Heron. Assembling a Blue Economy moment? Geographic engagement with globalizing biological-economic relations in multi-use marine environments. Dialogues in Human Geography, 7(1):3–26, 2017. doi: 10.1177/2043820617691643.

R. Wirth. CRISP-DM: Towards a Standard Process Model for Data Mining. In Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, pages 29–39. Citeseer, 2000.

S. Wuchty, B. F. Jones, and B. Uzzi. The Increasing Dominance of Teams in Production of Knowledge. Science, 316(5827), 2007.

Y. Xie. "Undemocracy": inequalities in science. Science, 344(6186):809–810, 2014. doi: 10.1126/science.1252743.

X. Yan, J. Guo, Y. Lan, and X. Cheng. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web - WWW ’13, pages 1445–1456, New York, New York, USA, 2013. ACM Press. doi: 10.1145/2488388.2488514.

W. Yang, J. Boyd-Graber, and P. Resnik. Adapting Topic Models using Lexical Associations with Tree Priors. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1901–1906, Copenhagen, Denmark, 2017. Association for Computational Linguistics.

C.-K. Yau, A. Porter, N. Newman, and A. Suominen. Clustering scientific documents with topic modeling. Scientometrics, 100(3):767–786, 2014. doi: 10.1007/s11192-014-1321-8.

P. Yodzis. Predator-Prey Theory and Management of Multispecies Fisheries. Ecological Applications, 4(1):51–58, 1994. doi: 10.2307/1942114.

K. Zhai, J. Boyd-Graber, N. Asadi, and M. Alkhouja. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In Proceedings of the 21st International Conference on World Wide Web, pages 879–888, New York, NY, USA, 2012. ACM Press.

P. Ziering, L. van der Plas, and H. Schütze. Multilingual Lexicon Bootstrapping - Improving a Lexicon Induction System Using a Parallel Corpus. In IJCNLP, pages 844–848. Asian Federation of Natural Language Processing, 2013a.

P. Ziering, L. van der Plas, and H. Schütze. Bootstrapping Semantic Lexicons for Technical Domains. In IJCNLP, pages 1321–1329. Asian Federation of Natural Language Processing, 2013b.

All Published Work by Shaheen Syed

1. S. Syed and C. T. Weber. Using Machine Learning to Uncover Latent Research Topics in Fishery Models. Reviews in Fisheries Science & Aquaculture, 26(3):319–336, 2018. doi: 10.1080/23308249.2017.1416331

2. S. Syed, M. Borit, and M. Spruit. Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19(4):643–661, 2018a. doi: 10.1111/faf.12280

3. S. Syed and M. Spruit. Selecting Priors for Latent Dirichlet Allocation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pages 194–202, Laguna Hills, CA, USA, 2018a. IEEE. doi: 10.1109/ICSC.2018.00035

4. S. Syed and M. Spruit. Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12(3):399–423, 2018b. doi: 10.1142/S1793351X18400184

5. S. Syed and M. Spruit. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174, Tokyo, Japan, 2017. IEEE. doi: 10.1109/DSAA.2017.61

6. S. Syed, M. Spruit, and M. Borit. Bootstrapping a Semantic Lexicon on Verb Similarities. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 1, pages 189–196. Scitepress, 2016. doi: 10.5220/0006036901890196

7. W. Seele, S. Syed, and S. Brinkkemper. The Functional Architecture Modeling Method Applied on Web Browsers. In 2014 IEEE/IFIP Conference on Software Architecture, pages 171–174, Sydney, Australia, 2014. IEEE. doi: 10.1109/WICSA.2014.40

8. S. Syed and S. Jansen. On Clusters in Open Source Ecosystems. In Proceedings of the International Workshop on Software Ecosystems, pages 13–25, Potsdam, Germany, 2013. CEUR


Papers under submission

9. S. Syed, L. ní Aodha, C. Scougal, and M. Spruit. Mapping the global network of fisheries science collaboration: Reinforcing or broad-based structures of knowledge production? (submitted for publication), 2018b

10. C. T. Weber and S. Syed. Public Perception of Interdisciplinarity: Sentiment Analysis of Twitter Data (submitted for publication), 2018

Summary

It is estimated that the world's data will grow to roughly 160 billion terabytes by 2025, with most of it in unstructured form. We have already reached the point where more data is produced than can be physically stored. To ingest all this data and to construct valuable knowledge from it, new computational tools and algorithms are needed, especially since manually probing the data is slow, expensive, and subjective.

For unstructured data, such as the text in documents, probabilistic topic models are an active field of research. Topic models are techniques that automatically uncover the hidden, or latent, topics present within a collection of documents. For instance, they allow us to infer the topical content of thousands or millions of documents without labeling or annotating these documents beforehand. This unsupervised nature makes probabilistic topic models a useful tool for applied data scientists who want to interpret and examine large volumes of documents to extract new and valuable knowledge.
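To make this concrete, the sketch below fits a small topic model with scikit-learn's LatentDirichletAllocation. This is an illustration only, not code or data from the thesis: the four-document toy corpus and the choice of two topics are arbitrary assumptions made for the example.

```python
# Minimal sketch of unsupervised topic discovery (illustrative only).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus; no labels or annotations are attached to the documents.
docs = [
    "fish stock assessment and catch quota management",
    "bayesian inference for probabilistic topic models",
    "quota transfer and fisheries stock management",
    "latent dirichlet allocation infers topic distributions",
]

# Bag-of-words representation: each document becomes a word-count vector.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit a two-topic model; the latent topics are inferred from the counts alone.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is one document's distribution over the latent topics.
print(doc_topics.shape)  # (4, 2)
```

Inspecting `lda.components_` then gives, per topic, a weight for every vocabulary word, from which the top-ranked words of each topic can be read off.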

The popularity of topic models is further driven by the availability of open source libraries that handle most of the complexity. However, applied data scientists regularly treat topic models as black boxes, without thoroughly exploring their underlying assumptions and (hyper-)parameter values. This black box approach arises from the inherently complex statistical nature of topic models, and with improper use their results can be of questionable validity. With this in mind, this thesis aims to provide answers on how to employ probabilistic topic models optimally and efficiently on large collections of documents. In this pursuit, we have chosen the domain of fisheries science as our testbed and fisheries scientific publications as our source of textual data. We have taken a systematic and iterative knowledge discovery process called knowledge discovery in databases (KDD) as a blueprint for effectively discovering knowledge from textual data. The main research question in this thesis, therefore, is:

How can we improve the knowledge discovery process from textual data through latent topical perspectives?

The first three chapters of this thesis seek to understand how different types of textual data, pre-processing steps, and hyper-parameter settings of probabilistic topic models affect the quality of the derived latent topics. Specifically, we study the effects that using abstract versus full-text data, and different prior distributions, have on the quality of the latent topics. Additionally, we explore alternative approaches for evaluating the quality of latent topics through semantic lexicons. In doing so, we contribute to the methodological analysis and optimization of topic models, providing a starting point for researchers who want to apply topic models with scientific rigor to scientific publications.
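The role of the prior distributions can be illustrated with a small, self-contained sketch. In LDA, each document's topic proportions are drawn from a Dirichlet prior: a symmetric prior treats all topics as a priori equally likely, whereas an asymmetric prior lets some topics carry more prior mass. The five topics and the concentration values below are arbitrary choices for this example, not values used in the thesis.

```python
# Illustrative sketch (not thesis code): sampling document-topic
# distributions from symmetric vs. asymmetric Dirichlet priors.
import numpy as np

rng = np.random.default_rng(0)
k = 5  # number of topics, arbitrary for this sketch

# Symmetric prior: the same concentration parameter for every topic.
symmetric_alpha = np.full(k, 0.1)

# Asymmetric prior: earlier topics receive more prior mass.
asymmetric_alpha = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])

sym_draws = rng.dirichlet(symmetric_alpha, size=10_000)
asym_draws = rng.dirichlet(asymmetric_alpha, size=10_000)

# The expected topic proportions equal alpha / alpha.sum(): uniform
# (0.2 per topic) under the symmetric prior, skewed under the asymmetric one.
print(sym_draws.mean(axis=0))
print(asym_draws.mean(axis=0))
```

Lowering the concentration values makes individual draws sparser (each document concentrates on fewer topics), which is why these hyper-parameters influence the topics a model recovers.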

The remaining three chapters are aimed at the interpretation of the latent topics and at how such (raw) latent topics can be turned into useful (fisheries) domain knowledge. We explore ways to go beyond the latent structures of the documents and to interpret the results in a broader context. That is, we aim to construct new knowledge and shed light on the ecological, social, economic, and institutional considerations within fisheries, with the goal of increased fisheries sustainability. In other words, by applying topic models to fisheries science publications, we study the domain through a new computational lens and expand on traditional approaches to assessing fisheries sustainability.

Throughout this thesis, each chapter covers specific phases of the KDD process. Combined, they provide guidelines on how to optimize the knowledge discovery process, with the aim of better understanding the latent topical content of scientific publications.
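One concrete ingredient of such optimization is an automated measure of topic quality. The sketch below computes a UMass-style coherence score in the spirit of Mimno et al.; the toy corpus and topic word lists are invented for the example, and the implementation is a simplified illustration, not the one used in the thesis.

```python
# Illustrative sketch of a co-occurrence-based (UMass-style) coherence score.
import math

# Toy corpus: each document is its set of words.
docs = [
    {"fish", "stock", "quota"},
    {"fish", "stock", "catch"},
    {"topic", "model", "prior"},
    {"fish", "catch", "quota"},
]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(word in d for d in docs)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(w1 in d and w2 in d for d in docs)

def umass_coherence(topic_words):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ranked word pairs."""
    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            score += math.log(
                (co_doc_freq(topic_words[m], topic_words[l]) + 1)
                / doc_freq(topic_words[l])
            )
    return score

# A topic whose words frequently co-occur scores higher (closer to zero).
print(umass_coherence(["fish", "stock", "catch"]))   # 0.0
print(umass_coherence(["fish", "model", "quota"]))   # about -1.10
```

Scores like this make it possible to compare model configurations (for example, abstract versus full-text input) without manually inspecting every topic.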

Samenvatting

Tegen het jaar 2025 zal alle data ongeveer 160 miljard terabytes omvatten en zal de meeste data in ongestructureerde vorm zijn. Op dit moment hebben we al het punt bereikt waar meer data wordt geproduceerd dan fysiek opgeslagen kan worden. Om al deze data te verwerken én om er bruikbare kennis aan te ontlenen zijn nieuwe computationele methoden en algoritmes nodig, vooral omdat het handmatig verwerken van al deze data traag, duur en subjectief is.

Voor ongestructureerde data, zoals tekst in documenten, zijn probabilistische topic modellen een actief onderzoeksgebied. Probabilistische topic modellen zijn methoden om automatisch de onderliggende of verhulde onderwerpen of thema's (hier topics genoemd) bloot te leggen in grote verzamelingen documenten. Via probabilistische topic modellen is het bijvoorbeeld mogelijk om de thematische inhoud van duizenden of miljoenen documenten af te leiden zonder deze documenten vooraf te labelen of van aantekeningen te voorzien. Deze automatische methodiek maakt probabilistische topic modellen een handig hulpmiddel voor onderzoekers om uit grote hoeveelheden documenten nieuwe en waardevolle kennis te extraheren.

De populariteit van topic modellen wordt deels gevoed door de vele beschikbare open-source bibliotheken die de meeste onderliggende complexiteit ervan wegnemen. Echter, de werking van topic modellen wordt door onderzoekers regelmatig als een ‘black box’ beschouwd, en de onderliggende aannames en parameters van topic modellen worden niet altijd grondig verkend. Deze ‘black box’-beschouwing kan deels verklaard worden door de complexe statistische aard die inherent is aan probabilistische topic modellen, én bij verkeerd gebruik kunnen de resultaten ervan als twijfelachtig worden geïnterpreteerd. Met dit in acht genomen probeert dit proefschrift antwoorden te vinden om probabilistische topic modellen zo optimaal en effectief mogelijk toe te passen. In dit streven hebben we het domein van de visserijwetenschap gekozen als testcasus én zijn de wetenschappelijke publicaties in dit domein gekozen als bron van tekstuele data. Daarnaast is er een kennisontdekkingsproces genaamd “Knowledge Discovery in Databases” (KDD) gebruikt om als blauwdruk te fungeren voor het effectief omzetten van tekstuele data naar kennis. De hoofdonderzoeksvraag van dit proefschrift is daarom:

Hoe kunnen we het kennisontdekkingsproces van tekstuele gegevens verbeteren door de onderliggende topics in de teksten bloot te leggen?

In de eerste drie hoofdstukken van dit proefschrift proberen we inzicht te krijgen in hoe verschillende soorten tekstuele data, voorbewerkingsstappen en hyper-parameters een effect kunnen hebben op de verhulde topics uit grote verzamelingen documenten. We bestuderen in het bijzonder de effecten die het gebruik van abstracte of volledige tekst én verschillende ‘prior’ distributies hebben op de kwaliteit van de topics. Daarnaast verkennen we alternatieve benaderingen voor het evalueren van de kwaliteit van verhulde topics via semantische lexicons. Hiermee dragen we bij aan de methodologische analyse en optimalisatie van topic modellen én bieden we een startpunt voor onderzoekers die topic modellen met wetenschappelijke nauwgezetheid willen toepassen op wetenschappelijke publicaties.

De overige drie hoofdstukken zijn gericht op de interpretatie van de topics en hoe deze kunnen worden omgezet in nuttige (visserij) domeinkennis. We onderzoeken dus manieren om de latente structuren, de verhulde topics, van documenten te interpreteren en de resultaten ervan in een bredere context te plaatsen. Dat wil zeggen, we willen nieuwe kennis opdoen en licht werpen op de ecologische, sociale, economische en institutionele vraagstukken binnen de visserijwetenschap, specifiek om de duurzaamheid van de visserij te vergroten. Met andere woorden, door het toepassen van topic modellen op wetenschappelijke publicaties van de visserij bestuderen we het wetenschappelijke domein via een nieuwe computationele lens, én breiden we hiermee de traditionele benadering voor het beoordelen van duurzaamheid in dit domein uit.

In dit proefschrift en binnen elk hoofdstuk worden specifieke fasen van het kennisontdekkingsproces behandeld. Gezamenlijk bieden de hoofdstukken antwoorden en richtlijnen voor het optimaliseren van het kennisontdekkingsproces, met als doel een beter inzicht en begrip te ontwikkelen voor het extraheren van latente topics uit wetenschappelijke publicaties.

Curriculum Vitae

Shaheen Syed was born on February 9th, 1985, in Rotterdam, the Netherlands. He obtained his bachelor's degree in Informatics from the Rotterdam University of Applied Sciences in 2011 and his master's degree (cum laude) in Business Informatics from Utrecht University in 2013. His master's thesis explored regression models for predicting open source project failure. After his studies, he worked in industry for three years as a data analyst and software developer.

In 2016, Shaheen started his EU Horizon 2020 Marie Skłodowska-Curie funded PhD research at Manchester Metropolitan University while being enrolled at the Department of Information and Computing Sciences at Utrecht University. During his PhD, he was a visiting researcher at the University of Tromsø in Norway, where he developed machine learning applications for an EU Horizon 2020 project, and he organized and chaired a machine learning workshop for the International Council for the Exploration of the Sea (ICES), an intergovernmental organization.
