Topic Discovery from Textual Data

Shaheen Syed

This research was funded by the project SAF21, “Social Science Aspects of Fisheries for the 21st Century”. SAF21 is a project financed under the EU Horizon 2020 Marie Skłodowska-Curie (MSC) ITN – ETN program (project 642080).

© 2018 Shaheen Syed
ISBN: 978-90-393-7086-5

Topic Discovery from Textual Data
Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain

Thema Ontdekking in Tekstuele Data
Machinaal Leren en Natuurlijke Taalverwerking voor Kennisontdekking in het Domein van de Visserij
(with a summary in Dutch)

Dissertation submitted to obtain the degree of doctor at Utrecht University, on the authority of the Rector Magnificus, Prof. dr. H.R.B.M. Kummeling, in accordance with the decision of the Board for the Conferral of Doctoral Degrees, to be defended in public on Wednesday 20 March 2019 at 4.15 p.m. by Shaheen Ali Shah Syed, born on 9 February 1985 in Rotterdam.

Promotor: Prof. dr. S. Brinkkemper
Copromotor: Dr. M. Spruit

Acknowledgments

This thesis is the result of my three-year PhD journey, in which I had the pleasure of meeting beautiful minds and souls. Throughout this journey, I also had the chance to work with and learn from some amazing people, for which I am truly grateful. My nearest and dearest have always supported me in the most loving and caring way, something that I truly cherish.

I want to thank my supervisors, Marco Spruit, Sjaak Brinkkemper, and Bruce Edmonds, for giving me the freedom to explore and work on my own ideas and interests. You have been a positive influence, and working with all of you has been a real pleasure.

I want to thank my co-authors, Melania Borit, Charlotte Weber, and Lia ní Aodha, for putting up with my crazy way of working. I have learned a lot from each of you, and you have shown me interesting and alternative views of the world that have enriched me personally and professionally.
I want to thank Michaela Aschan for being a great mentor, for providing me with lots of opportunities, and for being an inspirational person. I want to thank Sjaak Brinkkemper for developing a master’s program that prepared me for many of the academic challenges in this PhD journey. I want to thank Melania Borit for the nice lunches, for welcoming me, and for helping me during my many visits to Tromsø.

I want to show my gratitude to the EU for funding the project, to all the people involved in writing the SAF21 proposal, to the SAF21 members whom I have met, to the UiT BRIDGE group for having me as a guest, to the University of Utrecht for welcoming me, to the Manchester Metropolitan University and the Centre for Policy Modelling for providing me with a work environment, and to all the other people I have had the pleasure to meet and talk to during my PhD.

I want to thank Charlotte Weber, who has played many roles throughout this journey and surely will continue to do so in the future. Thank you for being such an amazing and caring person, and thank you for showing me how to become a better version of myself. You truly are a unique soul, a blessing to the universe, and I am grateful to have had the pleasure to meet you.

And last but not least, I want to thank my family for raising me, for making me the person I am today, and for supporting and loving me all these years.

Thank you all,
Shaheen Syed

Contents

1 Introduction
  1.1 Knowledge Discovery Process
  1.2 Topic Models
    1.2.1 Latent Dirichlet Allocation
  1.3 Research Domain
  1.4 Research Questions
    1.4.1 Main Research Question (MRQ)
    1.4.2 Research Questions (RQ)
  1.5 Research Methods
    1.5.1 Computational Experiment
    1.5.2 Quantitative Content Analysis
    1.5.3 Social Network Analysis
  1.6 Dissertation Outline
2 Full-Text or Abstract?
  2.1 Introduction
  2.2 Background
    2.2.1 Latent Dirichlet Allocation
    2.2.2 Topic Coherence Measurement
  2.3 Methodology
    2.3.1 The Experiment
    2.3.2 Dataset
    2.3.3 Creating LDA Models
    2.3.4 Topic Coherence
  2.4 Results
    2.4.1 DS1 Dataset
    2.4.2 DS2 Dataset
    2.4.3 Human Topic Ranking
  2.5 Discussion
  2.6 Conclusion
3 Exploring Dirichlet Priors
  3.1 Introduction
  3.2 Background
    3.2.1 Latent Dirichlet Allocation
    3.2.2 Research Utilizing LDA
    3.2.3 Coherence Scores
  3.3 Methods
    3.3.1 Dataset
    3.3.2 Dirichlet Hyperparameters
    3.3.3 Creating LDA Models
    3.3.4 Topic Coherence
    3.3.5 Human Topic Ranking
    3.3.6 Relaxing LDA assumptions
  3.4 Results
    3.4.1 Topic Coherence
    3.4.2 Human Topic Ranking
  3.5 Discussion and Conclusion
4 Bootstrapping a Semantic Lexicon
  4.1 Introduction
  4.2 Previous Work
  4.3 Lexicon Bootstrapping
    4.3.1 Domain and Seed Words
    4.3.2 Building the Corpus
    4.3.3 Chunking
    4.3.4 Scoring Verbs
    4.3.5 Verb Extraction Pattern
    4.3.6 Bootstrapping
  4.4 Evaluation
  4.5 Conclusion
5 Topic Analysis of Fisheries Science
  5.1 Introduction
  5.2 Methods
    5.2.1 Latent Dirichlet Allocation
    5.2.2 Assumptions behind LDA
    5.2.3 Creating the Data Set
    5.2.4 Creating the LDA Model
    5.2.5 Calculating Model Quality
    5.2.6 Labeling Topics
    5.2.7 Calculating Topical Trends over Time
    5.2.8 Calculating Topic over Journals
    5.2.9 Relaxing LDA Assumptions and Future Research Directions
  5.3 Results and Discussion
    5.3.1 Uncovering Fisheries Topics
    5.3.2 Topic Proportions within Documents
    5.3.3 Topical Trends over Time and Topic Prevalence
    5.3.4 Topical Trends over Journals
    5.3.5 Validation of Results
  5.4 Conclusion and Recommendations
  Appendix
6 Sub-Topic Analysis of Fishery Models
  6.1 Introduction
  6.2 Methods
    6.2.1 Latent Dirichlet Allocation
    6.2.2 Topic Interpretation
    6.2.3 Creating the Dataset
    6.2.4 Pre-processing the Data Set
    6.2.5 Creating LDA Models
    6.2.6 Identifying Subtopics
    6.2.7 Labelling the Topics
    6.2.8 Calculating Sub-Topical Modelling Trends
  6.3 Results and Discussion
    6.3.1 General Modelling Topics
    6.3.2 Subtopics within Estimation Models
    6.3.3 Subtopics within Stock Assessment Models
  6.4 Conclusions
  Appendix
