A PIPELINE FOR RECOGNITION OF TROPHIC INFORMATION IN PRIMARY LITERATURE

by

Jennien Raffington

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Bioinformatics

Guelph, Ontario, Canada

© Jennien Raffington, September, 2020

ABSTRACT

A PIPELINE FOR RECOGNITION OF TROPHIC INFORMATION IN PRIMARY LITERATURE

Jennien Raffington
University of Guelph, 2020

Advisor(s): Dr. Dan Tulpan, Dr. Dirk Steinke

This thesis consists of an investigation into the use of Natural Language Processing methods for the automated extraction and classification of trophic information from primary literature. First, this thesis explores the use of two-character bigrams in training machine learning models for scientific name identification. Afterwards, the composition and testing of the overall trophic analysis pipeline is discussed, which consists of an open information extraction tool, dictionary-based methods, rule-based methods and a machine learning model. Then potential future directions such as the incorporation of noun phrases and document level analysis are mentioned. The results demonstrate that input format has a large influence on the retrieval of information from primary literature and that open information extraction tools can quickly filter simple relations in text, but long-distance relations are difficult to locate.

ACKNOWLEDGEMENTS

This thesis project was funded by the Food from Thought research program to Dr. Dirk Steinke. I wish to express my deepest gratitude to my supervisors Dr. Dan Tulpan and Dr. Dirk Steinke for their guidance during this research project. You made yourselves available whenever I had questions or was unsure of how to proceed and continued to encourage me throughout this process. Without you, the completion of this project would not have been possible. I’d also like to say thank you to Barrington Raffington, Jennifer Raffington and Ryan Brown for their support.

TABLE OF CONTENTS

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Appendices
1 Introduction and Motivation
  1.1 Biodiversity
  1.2 Arthropods and Farmlands
  1.3 Trophic Information Needs
  1.4 Thesis Structure
2 Reviews on Text Mining
  2.1 What is text mining?
  2.2 Motivations of Text Mining
  2.3 Natural Language Processing
    2.3.1 Named Entity Recognition
    2.3.2 Information Extraction & Relation Extraction
  2.4 Conclusion
3 Bigram Based Species Recognition
  3.1 Introduction
  3.2 Bigram Materials and Methods
  3.3 Bigram Results
  3.4 Bigram Method Conclusion
4 Trophic Information Analysis Pipeline
  4.1 Introduction
  4.2 Materials and Methods
    4.2.1 Datasets
    4.2.2 Extraction and Classification Method Implementation
    4.2.3 Comparison Tests
  4.3 Results
    4.3.1 Evaluation Measures
    4.3.2 Extraction Task Results
    4.3.3 Classification Task Results
    4.3.4 Comparison Tests Results
  4.4 Discussion
5 Conclusions and Future Work
  5.1 Future Work
References
Appendices
  S.1. Research Articles Used in Test Set
  S.2. Example of Pipeline Output File
  S.3. Keyword Categories
  S.4. Pipeline Code

LIST OF TABLES

Table 3.1. Global Names Recognition Discovery (GNRD) tool results with results from highest achieving models tested.
Table 4.1. A detailed breakdown of scientific names and common names collected from sources.
Table 4.2. Example of breakdown of relevant sentences with ideal final classifications.
Table 4.3. Examples of sentences analyzed by Ollie with confidence scores.
Table 4.4. Regular expressions for the forms of scientific names located in text by the pipeline.
Table 4.5. Measurement averages broken down by the number of documents included in the calculation.
Table 4.6. Measurement medians broken down by the number of documents included in the calculation.
Table 4.7. Description of two relation types outputted by the pipeline, triplet and keyword-based.
Table 4.8. Measurement results for extraction task based on relation types, triplet and keyword-based.
Table 4.9. Extraction task recall results broken down by trophic category.
Table 4.10. Classification task results broken down by trophic category.
Table 4.11. Ideal PDF results broken down by sentence category.
Table 4.12. Precision and recall of extraction task for the four tested pipeline implementations.
Table 4.13. Precision and recall of classification task for the four tested pipeline implementations.

LIST OF FIGURES

Figure 3.1. Training/testing classification accuracies for all 15 classifiers applied on problems P1, P2 and P3.
Figure 3.2. Venn diagram representing top 100 high frequency bigrams for SCI, ENG and PEO datasets.
Figure 3.3. Runtimes for all 15 classifier methods applied to the 3 classification problems: P1, P2 and P3.
Figure 4.1. Breakdown of number of keywords incorporated for each type of keyword and the total number of keywords.