A PIPELINE FOR RECOGNITION OF TROPHIC INFORMATION IN PRIMARY LITERATURE

by

Jennien Raffington

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Bioinformatics

Guelph, Ontario, Canada

© Jennien Raffington, September 2020

ABSTRACT

A PIPELINE FOR RECOGNITION OF TROPHIC INFORMATION IN PRIMARY LITERATURE

Jennien Raffington
University of Guelph, 2020

Advisors:
Dr. Dan Tulpan
Dr. Dirk Steinke

This thesis investigates the use of Natural Language Processing methods for the automated extraction and classification of trophic information from primary literature. First, it explores the use of two-character bigrams for training machine learning models for scientific name identification. Next, the composition and testing of the overall trophic analysis pipeline is discussed; the pipeline consists of an open information extraction tool, dictionary-based methods, rule-based methods and a machine learning model. Finally, potential future directions, such as the incorporation of noun phrases and document-level analysis, are outlined. The results demonstrate that input format has a large influence on the retrieval of information from primary literature, and that open information extraction tools can quickly filter simple relations in text, but long-distance relations remain difficult to locate.


ACKNOWLEDGEMENTS

This thesis project was funded by the Food from Thought research program, awarded to Dr. Dirk Steinke.

I wish to express my deepest gratitude to my supervisors Dr. Dan Tulpan and Dr. Dirk Steinke for their guidance during this research project. You made yourselves available whenever I had questions or was unsure of how to proceed and continued to encourage me throughout this process. Without you, the completion of this project would not have been possible. I’d also like to say thank you to Barrington Raffington, Jennifer Raffington and Ryan Brown for their support.


TABLE OF CONTENTS

Abstract ...... ii

Acknowledgements ...... iii

Table of Contents ...... iv

List of Tables ...... vi

List of Figures ...... vii

List of Appendices ...... viii

1 Introduction and Motivation ...... 1

1.1 Biodiversity ...... 1

1.2 Arthropods and Farmlands ...... 2

1.3 Trophic Information Needs ...... 3

1.4 Thesis Structure ...... 4

2 Reviews on Text Mining ...... 5

2.1 What is text mining? ...... 5

2.2 Motivations of Text Mining ...... 5

2.3 Natural Language Processing ...... 6

2.3.1 Named Entity Recognition ...... 6

2.3.2 Information Extraction & Relation Extraction ...... 10

2.4 Conclusion ...... 14

3 Bigram Based Species Recognition ...... 16

3.1 Introduction ...... 16

3.2 Bigram Materials and Methods...... 16

3.3 Bigram Results ...... 18

3.4 Bigram Method Conclusion ...... 22


4 Trophic Information Analysis Pipeline ...... 24

4.1 Introduction ...... 24

4.2 Materials and Methods ...... 24

4.2.1 Datasets ...... 24

4.2.2 Extraction and Classification Method Implementation ...... 28

4.2.3 Comparison Tests ...... 32

4.3 Results ...... 32

4.3.1 Evaluation Measures ...... 32

4.3.2 Extraction Task Results ...... 33

4.3.3 Classification Task Results ...... 37

4.3.4 Comparison Tests Results ...... 37

4.4 Discussion ...... 40

5 Conclusions and Future Work ...... 43

5.1 Future Work ...... 44

References ...... 45

Appendices ...... 58

S.1. Research Articles Used in Test Set ...... 58

S.2. Example of Pipeline Output File ...... 65

S.3. Keyword Categories ...... 66

S.4 Pipeline Code ...... 69


LIST OF TABLES

Table 3.1. Global Names Recognition Discovery (GNRD) tool results compared with results from the highest-achieving models tested…………………………………………..…20

Table 4.1. A detailed breakdown of scientific names and common names collected from sources..……………………………………………………………………………………...... 25

Table 4.2. Example of breakdown of relevant sentences with ideal final classifications…………………………………………………………………………….....….26

Table 4.3. Examples of sentences analyzed by Ollie with confidence scores………….40

Table 4.4. Regular expressions for the forms of scientific names located in text by the pipeline……………………………………………………………………………………….…31

Table 4.5. Measurement averages broken down by the number of documents included in the calculation…………………………………………………………………………….....34

Table 4.6. Measurement medians broken down by the number of documents included in the calculation……………………………………………………………………...... …..34

Table 4.7. Description of two relation types outputted by the pipeline, triplet and keyword-based…………………………………………………………………………….…..66

Table 4.8. Measurement results for extraction task based on relation types, triplet and keyword-based.………………………………………………………………………………..66

Table 4.9. Extraction task recall results broken down by trophic category..………………………………………………………………………………….……66

Table 4.10. Classification task results broken down by trophic category………..……………………………………………………………………………….37

Table 4.11. Ideal PDF results broken down by sentence category...………………………………………………………………………………….…...38

Table 4.12. Precision and recall of extraction task for the four tested pipeline implementations …………………………………………………………………………..…..39

Table 4.13. Precision and recall of classification task for the four tested pipeline implementations.…………………………………………………………………………..…..40


LIST OF FIGURES

Figure 3.1. Training/testing classification accuracies for all 15 classifiers applied on problems P1, P2 and P3...... 18

Figure 3.2. Venn diagram representing top 100 high frequency bigrams for SCI, ENG and PEO datasets...... 19

Figure 3.3. Runtimes for all 15 classifier methods applied to the 3 classification problems: P1, P2 and P3...... 22

Figure 4.1. Breakdown of number of keywords incorporated for each type of keyword and the total number of keywords...... 27

Figure 4.2. Flowchart that shows the individual steps in the trophic information analysis pipeline...... 29

Figure 4.3. Boxplot for the extraction measures of all documents, which included documents that had zero relevant information extracted...... 34

Figure 4.4. Boxplot for the extraction measures of the documents that had relevant information extracted...... 35


LIST OF APPENDICES

S.1. Research Articles Used in Test Set...... 58

S.2. Example of Pipeline Output File...... 65

S.3. Keyword Categories ...... 66

S.4. Pipeline Code ...... 66

1 Introduction and Motivation

1.1 Biodiversity

Knowing the community composition of an environment is integral to understanding how that environment works. Within farmland and agricultural ecosystems, arthropods can influence a landscape’s agricultural output (Foster et al., 2011). As a result, biodiversity-associated metrics such as species richness, species assemblages and food webs must be investigated for these populations. Biodiversity is defined as variability across genetic, species and ecosystem scales (Walker, 1992). Genetic diversity has been described as genotypic richness combined with the spectrum of possibilities within a genotype (Vellend, 2006), while species diversity is characterized as the number of species within a defined area (Sanderson et al., 2004). Ecosystem biodiversity, on the other hand, is often used in reference to the variety of landscape types (Baker & Barnes, 1998). Biodiversity at each of these scales contributes to the Earth’s health, an important consideration given that studies have shown global biodiversity to have been in decline for several decades (Butchart et al., 2010). Population and biodiversity decreases in vertebrates (Collen et al., 2009), birds (Gregory, 2006), plants (Niedrist et al., 2009) and arthropods (Hallmann et al., 2017; Lister & Garcia, 2018; Seibold et al., 2019; Van Nuland & Whitlow, 2014; Williams, 1993), occurring simultaneously with observed increases in biodiversity pressures, are evidence of this decline (Butchart et al., 2010). Several human-linked pressures currently contribute to this decline, such as resource consumption, overexploitation and climate change. A better understanding of the decline is achieved through monitoring various landscapes, as monitoring provides population estimates and distribution patterns (Thomsen & Willerslev, 2015).
Those population estimates and distribution patterns are important because they inform conservation effort decisions (Niemelä, 2000).

Management of biodiversity involves practices designed to conserve and assess biodiversity, and to forewarn of looming extinctions (Lindenmayer et al., 2012). Monitoring has increased globally as a result of awareness of and concern for biodiversity loss (Lee et al., 2005), and its effectiveness is predicated on the development and implementation of an accurate monitoring system. An example of a well-designed biodiversity monitoring programme comes from Australia, where both aerial and ground surveys of large macropod populations are completed regularly to gather population data (Lindenmayer et al., 2012). On the other hand, a broad-scale biodiversity monitoring system in Alberta, Canada was found to be ineffective at detecting trends at various spatial scales. The accuracy of a monitoring system lies in its ability to detect statistically relevant changes

(Nielsen et al., 2009). Programmes of a large magnitude have failed for many reasons, such as being too expensive or trying to monitor too many indicators (Lindenmayer & Likens, 2010). Therefore, smaller local monitoring systems are recommended due to their cost effectiveness and niche specificity. These more local monitoring systems would help to monitor biodiversity declines in arthropod-rich environments such as farmlands.

1.2 Arthropods and Farmlands

Technological advancements during the post-war era led to an increase in agricultural intensification (Blaxter & Robertson, 1995), which refers to the practices of simplified crop rotations, increased use of agricultural machinery and crop breeding advancements that increase homogeneity within agricultural habitats (Benton et al., 2003). Agricultural intensification has led to the deterioration of farmland biodiversity (Benton et al., 2003), which makes diversity assessments necessary for monitoring and managing land-health. Currently, multi-species inventories, bird observation or butterfly population models are used to quantify biodiversity (Herzog et al., 2013). These methods are broad, indirect, and ignore common arthropod species that provide important functions to the ecosystem such as pollination, decomposition and predation (Herzog et al., 2013).

The significance of arthropod function within the farmland ecosystem spans all levels of the food chain (Mattoni et al., 2000). The high turnover rate of arthropods legitimizes their use as an indicator of habitat health. Arthropods respond quickly to environmental changes, and their assays are cost effective and efficient. They reflect environmental changes at a microscale, unlike vertebrates, which are insensitive to fluctuations at this level (Mattoni et al., 2000). Utilizing an arthropod-focused monitoring system enables quicker feedback for implemented conservation efforts. Bioindicators have been defined as a “quantifiable characteristic of biochemical, physiological, toxicological or ecological process or function that has been correlated or causally linked to effects at one or more of the organisms, population, community or ecosystem levels of organization” (McCarty et al., 2002). Arthropod communities have been successfully used as bioindicators (Mattoni et al., 2000) in multiple habitats such as woodlands (Williams, 1993), tropical forests (Andresen, 2005), shrublands (Liu et al., 2014) and agroecosystems (Anderson et al., 2011; Paoletti et al., 1999) by providing evidence for successful environmental recovery projects due to their “microgeographic distributions, which may reflect fine-scale heterogeneity in habitats” (Mattoni et al., 2000).

Population size and species richness of certain arthropods are effective indicators for specific environmental variations (Maleque et al., 2009), e.g. dung beetles, whose abundance is correlated with forest fragmentation (Estrada & Coates-Estrada, 2002; Feer & Hingrat, 2005). Parasitic wasps have been monitored in woodland habitats as they tend to be more abundant in species-rich environments with an abundance of broadleaf trees (Maleque et al., 2009). Ants are responsive to multiple types of land conversions and clearances (Underwood & Fisher, 2006). Ant species richness and diversity has been correlated with conversions of tropical land to agricultural use (Naderi et al., 2011; Perfecto et al., 1997; Roth et al., 1994; Vasconcelos, 1999). Overall, arthropods have

been useful in measuring the success of conservation efforts, but they can also negatively impact an ecosystem.

Arthropods are capable of invading new ecosystems and causing severe damage (Sanders et al., 2010). For example, the fall armyworm (FAW) Spodoptera frugiperda (J. E. Smith) (Lepidoptera: Noctuidae) is a moth native to the Americas that has invaded ecosystems on a global scale, specifically affecting many Asian countries. It is known to destroy over 353 species of plants (Firake & Behere, 2020). Research suggests that invasive arthropod species are the main reason for biodiversity loss in several habitats (Grice, 2006; Mcneely, 2001; Molnar et al., 2008; Yan et al., 2001). They decrease indigenous species richness and at times cause extinctions in the ecosystems they invade (Clavero et al., 2009; Doherty et al., 2016). Agricultural activity typically enables invasive species because it produces disturbed sites that are ideal for colonization (Sakai et al., 2001). Invasive species present a global threat to agriculture as they are one of the major sources of crop loss, which can lead to food insecurity (Paini et al., 2016). Crop weeds and pests cause billions of dollars in damages and losses annually (Pimentel et al., 2005). For example, invasive weeds overwhelm approximately 700,000 hectares of U.S. wildlife habitat per year (Babbitt, 1998). However, weeds can serve as a food source for herbivorous arthropods when their preferred food source is unavailable, and these herbivorous arthropods are, in turn, a food source for carnivorous arthropods (Norris & Kogan, 2005). The organisms within a habitat are intricately connected, and therefore trophic information is required to fully understand the various relationships between arthropods, agriculture and farmlands.

1.3 Trophic Information Needs

Recently, a new method for farmland biodiversity assessment was proposed: the BioBio indicator set (Herzog et al., 2013). It measures biodiversity through 23 indicators spanning multiple farm types in Europe and focuses on genetic, species and habitat diversity. Crop cultivars as well as livestock are the basis of the genetic diversity indicators. The species diversity indicators look at pollination, decomposition, plant production and predation. For species diversity, the numbers of vascular plant, wild bee, bumble bee, spider and earthworm species are calculated (Herzog et al., 2013), each representing a specific level of the food chain. Habitat diversity is calculated by analyzing the areas of farmland used for agricultural production. Calculating these indicators requires a workforce, data management and analysis, which cannot be achieved without funding.

Although BioBio focuses on taxa that are well researched (bees, spiders and earthworms), there are insect groups with scattered or, in some cases, limited trophic information whose study would increase the understanding of ecosystems. One front that can be worked on to make ecosystem assessments more feasible is gathering scattered primary literature data to centralize the information on trophic relationships. In addition to enabling feasible niche assessments of biodiversity, centralizing information on trophic relationships allows for the expansion of food webs that have not yet incorporated the world’s most recent research. Food webs are integral to understanding

how different organisms within an ecosystem are connected, as they display the various feeding links between and within species (Beckerman et al., 2006). These feeding links provide information on the functions within an agroecosystem. When an ecosystem experiences biodiversity loss, its function is also affected, decreasing its productivity (Soliveres et al., 2016). Functional diversity refers to clusters of ecosystem inhabitants that provide the same service (Moonen & Bàrberi, 2008). For agroecosystems, those services are important to the agricultural production processes carried out by humankind. Decomposition, water regulation, pollination, weed control and pest control are just some of the services provided by agroecosystem organisms that affect farm production (Moonen & Bàrberi, 2008). Trophic interactions reveal the services provided by a species (Whelan et al., 2016). To manage biodiversity and its effects on the functions and services provided by the ecosystem, the trophic network must be understood (Thompson et al., 2012). Research on trophic networks is published by many groups, but that information must be combined and connected. Unfortunately, the large number of publications and journals in existence makes information extraction a difficult task to accomplish.

With over 50 million scholarly articles published to date (Jinha, 2010), there is a large breadth of information that needs to be processed with the aid of effective and efficient technologies. One such technology is text mining, which has proven to be a very efficient method for information collection and processing in response to the large number of available resources (Rinaldi et al., 2014). Therefore, the focus of this thesis lies in the development of a text mining approach aimed at extracting trophic relationships among various species. These relationships will aid in evaluating farmland habitat health at the microscale level and in updating food webs to better understand the feeding links within ecosystems. This thesis implements rule-based, dictionary-based and machine learning-based methods to create a tool that can find and extract trophic evidence from primary literature. Similar methods have been successfully implemented for automatic information extraction and processing in the areas of protein-protein interactions (Raja et al., 2013), legal documents (Y. L. Chen et al., 2013) and ontology construction (Fortuna et al., 2005).

1.4 Thesis Structure

The rest of the thesis is organized as follows. Chapter two provides a review of research related to computational and experimental trophic biodiversity methods. It also provides an in-depth review of information extraction methods and tools that have been designed and successfully applied in various research areas. Chapter three describes the machine learning methods used and tested for species name recognition. Chapter four discusses materials and methods, looking at the work from a more practical perspective. It provides details regarding the inputs, outputs, datasets, preprocessing methods, core processing methods and evaluation metrics used to analyze the results. Chapter five presents a summary of conclusions and future work.


2 Reviews on Text Mining

2.1 What is text mining?

Text mining represents a subfield of data mining that operates on textual information and whose popularity has increased with the continuous deluge of information (Salloum, AlHamad, et al., 2018). Both fields focus on finding significant insights and extracting knowledge from data, while text mining specializes in extracting relevant information from unstructured text or text documents (Andreas et al., 2005). Text mining methods apply approaches from multiple fields, such as statistics, machine learning and linguistics, to process textual data and identify relevant information (Salloum, AlHamad, et al., 2018).
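The conversion from unstructured text to a structured representation can be illustrated with a minimal sketch using only Python's standard library; the regular expressions below are naive simplifications for illustration, not the preprocessing used by any particular tool:

```python
import re

def preprocess(text):
    """Naively segment raw text into sentences, then into lowercase word tokens."""
    # Split on whitespace that follows terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Reduce each sentence to a list of lowercase alphabetic tokens.
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences]

doc = "Aphids feed on sap. Ladybirds prey on aphids!"
print(preprocess(doc))
# [['aphids', 'feed', 'on', 'sap'], ['ladybirds', 'prey', 'on', 'aphids']]
```

Real pipelines replace both regexes with proper sentence segmenters and tokenizers, but the output shape (a list of token lists) is the kind of structure later extraction steps operate on.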

Text mining methods operate on large amounts of structured and unstructured text (Ittoo et al., 2016). The first step in text mining is pre-processing (Kannan et al., 2014), which typically aims to convert unstructured text into structured information (Tunali & Bilgin, 2012). The second step typically focuses on information extraction, which includes the identification of relationships among entities within the text (Mooney & Bunescu, 2005). The structure of the text must be compatible with the extraction method. Text mining has been applied in multiple domains to decrease the time spent manually extracting information from text (Naidu et al., 2018). Fields can have domain-specific entities and domain-specific semantic rules that make certain text mining approaches inapplicable. Therefore, in specialized fields with domain-specific information, domain knowledge is necessary to successfully apply text mining (Spasic et al., 2005).

2.2 Motivations of Text Mining

The creation of the Internet had a dramatic effect on the research community. Previously, research was only published in print, making its way into books stored in libraries with limited access for a wider audience. With the Internet, a new medium for the publication of scholarly research was formed. Findings can be published in a physical book, a journal, an online version of a peer-reviewed journal, a public community site, a preprint or on social media. Many journals are now moving away from print to strictly online publishing. An advantage of the growth in the number of research journals and publications is the increase in publishing locations and opportunities. Today there are over 20,000 open access journals where research can be published (X. Chen, 2019), and over 50 million articles have been published since 1665 (Jinha, 2010). The disadvantage of multiple journals is an increased difficulty in tracking and finding publications relevant to one’s specific interests. As a result, research becomes very segmented. Studies on similar topics, and studies that may answer each other’s questions, never get connected, although linking research between multiple resources is essential to connect information and to increase confidence in the validity of the research.

Many vital pieces of information are included within the text of a publication. Some research findings can be included even though they may not be the main topic. Trophic information is an example of a vital piece of data that can be conveyed in a few short lines. The dietary habits of species can be buried within large publications. Technical knowledge and infrastructural support are needed to process and organize such information. For example, some community websites, such as BugGuide (BugGuide, 2003), have attempted to organize and categorize taxonomic information such as kingdoms, families and species names into a structured format with labels, but most species-specific knowledge is still embedded in paragraphs of unstructured free text. When information is buried within text, it lacks structure, meaning it is not labelled, not organized in a table, and not displayed within a graph. In the best-case scenario, the title of the section may indicate that the text describes the dietary behavior of the organism. Information in free text is time consuming and difficult to extract automatically since, for example, there is no specific convention for how dietary habits should be described. There are multiple verbs and phrases used in the English language to discuss eating habits, and the nature of the trophic relationship between two species is dependent on the wording. For these reasons, gathering trophic information from multiple research articles and research groups that reinforce one another not only strengthens confidence in the information, but also organizes it to make it accessible for future research and unique use-cases. Extricating relevant information from digitized natural language is difficult to achieve, but fortunately the field of Natural Language Processing (NLP) focuses on creating tools to simplify that task.

2.3 Natural Language Processing

Natural Language Processing (NLP) is a broad area of research that focuses on the use of computers to manipulate and understand written language (Chowdhury, 2010). It involves analyzing how humans understand language and using that information to develop tools and techniques that allow computers to perform language-related tasks. NLP is a multidisciplinary field, as it requires and involves research related to human understanding, linguistics and computational systems (Lu, 2018). Within the field, several tasks and subtasks have been accomplished using various techniques. One technique within NLP is Named Entity Recognition (NER), which focuses on the identification and classification of named entities within unstructured text (Mohit, 2014). Named entities can be domain specific or general. NER is used in Information Extraction (IE) (Jiang, 2012). IE focuses on extracting relevant information from a larger set of textual data (Sarawagi, 2008). Another relevant technique is Relation Extraction (RE), which identifies relationships between named entities in text (Banko & Etzioni, 2008). NER is typically paired with RE to achieve IE (Jiang, 2012). Individually, some techniques may seem simple and inconsequential, but their true power comes from their potential to be combined, like NER and RE. Recognizing named entities is essential to a computer’s ability to understand text, which makes NER a key component of any text mining solution.
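The pairing of NER with RE can be sketched in a few lines of Python; the two species names and the verb list below are hypothetical placeholders standing in for a real entity recognizer and a real set of relation patterns:

```python
import re

# Hypothetical stand-ins for a real NER dictionary and relation patterns.
ENTITIES = {"Coccinella septempunctata", "Aphis fabae"}
TROPHIC_VERBS = r"(?:feeds on|preys on|eats)"

def extract_relations(sentence):
    """NER (entity lookup) followed by RE (verb pattern) yields (subj, verb, obj) triplets."""
    triplets = []
    for subj in ENTITIES:
        for obj in ENTITIES - {subj}:
            # Require a trophic verb between two recognized entities.
            pattern = re.escape(subj) + r"\s+(" + TROPHIC_VERBS + r")\s+" + re.escape(obj)
            match = re.search(pattern, sentence)
            if match:
                triplets.append((subj, match.group(1), obj))
    return triplets

print(extract_relations("Coccinella septempunctata preys on Aphis fabae."))
# [('Coccinella septempunctata', 'preys on', 'Aphis fabae')]
```

The directionality of the triplet matters: swapping subject and object reverses the trophic relationship, which is why the extraction tests both orderings and keeps only the one the verb pattern supports.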

2.3.1 Named Entity Recognition

Named entity recognition focuses on the specific task of finding named entities within unstructured text (Lample et al., 2016). The difficulty arises from differentiating between

named entities in text such as physical or geographical locations, persons’ names, companies and species names. There are several methods used to identify and classify text into categories, which can be roughly grouped as follows: dictionary-based methods, rule-based methods and machine learning methods (Akella et al., 2012). A fourth type is the hybrid approach, in which multiple methods are combined.

Dictionary-based approaches. Several tools implement a dictionary approach on some level. The dictionary-based approach refers to having a collection of terms that can be used as a reference for what the classifier should highlight as a true positive when mining through text (Akella et al., 2012). Consequently, whether a term in the text is classified positively depends entirely on whether it matches an entry in the collection. This method has advantages and disadvantages. The advantage of a dictionary approach is that it allows terms that do not follow any rules to be located. The disadvantage is that only entries in the dictionary are recognized, which can lead to a high false negative rate (Akella et al., 2012). It also requires a large amount of curation to generate a useful dictionary if the entity category researched is large, such as species names, of which several million are currently known to exist, a number that increases every year with the discovery of new species. Given this large number, amassing a dictionary that contains every species name would be very difficult. Another issue that can arise within dictionary-based approaches is optical character recognition (OCR) errors (Lample et al., 2016). OCR errors occur when text within an image is converted to digitized text incorrectly, which can lead to false negatives. Despite its disadvantages, it is still a very powerful method and has been implemented in tools like TaxonGrab (Koning et al., 2005), FAT (Sautter et al., 2006) and Linnaeus (Gerner et al., 2010).
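At its core, a dictionary-based tagger reduces to membership checks against the curated term list. In this sketch the two names are placeholders for a full species-name dictionary, and the final comment shows the false-negative pitfall described above:

```python
# Placeholder dictionary standing in for a curated species-name list.
DICTIONARY = {"Apis mellifera", "Danaus plexippus"}

def dictionary_tag(text):
    """Return every dictionary entry that appears verbatim in the text."""
    return [name for name in sorted(DICTIONARY) if name in text]

text = "Apis mellifera visited clover, and Bombus terrestris was also present."
print(dictionary_tag(text))  # ['Apis mellifera']
# Bombus terrestris is a valid species name, but because it is missing
# from the dictionary it goes undetected -- a false negative.
```

Production tools replace the linear scan with indexed structures (e.g. finite-state automata, as Linnaeus does), but the failure mode is identical: coverage of the dictionary bounds recall.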

TaxonFinder is a tool that employs a dictionary approach (P. Leary, 2014), and its collection of species names comes from multiple sources. TaxonFinder is based on a uBio application (P. R. Leary et al., 2007), which uses the uBio database NameBank (The Universal Biological Indexing and Organization System, n.d.) to maintain several indices of scientific names. These indices are cross-referenced using uBio’s ClassificationBank (The Universal Biological Indexing and Organization System, n.d.) taxonomies. The application also scans many RSS feeds daily, each representing a resource related to scientific taxa, looking for new content to be incorporated into the collection. New content is cross-referenced with recent sets of species names to gather context on the taxonomic groups it represents. A peer review process is also used to verify any newly added names (P. R. Leary et al., 2007). This approach, although better than a regular dictionary that is rarely updated, can still produce false negatives. The dictionary approach is only as good as the dictionary it relies on, which must be kept up to date. Other tools combine the dictionary approach with additional approaches.

Linnaeus uses a dictionary approach in its methodology. The basis of its dictionary is the NCBI taxonomy list, but Linnaeus focuses on species names (Gerner et al., 2010). They included species names and common names from the NCBI database but did not

include acronyms. Linnaeus uses heuristics along with the dictionary to disambiguate ambiguous terms such as abbreviated species names. C. elegans is an example of an abbreviated species name that can refer to 41 different species. Also, there are common names that can refer to different NCBI species entries, such as ‘rats’, which can denote Rattus sp. or Rattus norvegicus. The terms from the NCBI taxonomy are grouped by type (species names, common names and acronyms), but Linnaeus foregoes the acronym category. The drawbacks of the NCBI database are its large amount of ambiguity, as one term can refer to several species, and its lack of adherence to taxonomic rules. To overcome the issue with acronyms, Linnaeus employs the tool Acromine (Okazaki et al., 2010) to calculate the frequency of the various expanded forms of an acronym. The frequencies can be used to estimate the probability of the acronym referring to a species entity. Texts are matched using deterministic finite-state automata, which are an efficient way of matching regular expressions. Linnaeus was tested at the mention level (meaning the annotation has to be correct, as well as the location of the annotation in the document) against a manually annotated corpus, and it achieved 94% recall and 97% precision (Gerner et al., 2010). Since it incorporates a dictionary, the pitfalls of the dictionary approach must be considered for the Linnaeus tool as well.

Rule-based approaches. The rule-based method tackles the NER task using a different approach: it uses patterns found in the named entity to find examples of the entity within text (Lample et al., 2016). For an entity like species names, this approach is applicable as the entity follows certain rules. Taxonomic names tend to involve two or three words, referred to as binomen or trinomen (Koning et al., 2005). The first word represents the genus, the second word represents the species, and the third word (if present) represents the subspecies name. The genus is capitalized while the species is not (Koning et al., 2005). These are some of the rules that can be exploited in an entity identification tool designed for species name recognition. The advantage of the rule-based approach is that it is not dependent on a premade dictionary, which allows new terms to be found. The disadvantage lies in situations when the entity within the text does not follow the premade rules or, vice-versa, when a phrase follows the rules without being an entity. The rule-based approach can increase recall but decrease precision and accuracy when recognizing named entities in text. Similar to the dictionary approach, the rule-based approach has been implemented in multiple software packages such as TaxonGrab (Koning et al., 2005) and Linnaeus (Gerner et al., 2010).
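These nomenclature rules translate almost directly into a regular expression. The pattern below is a deliberately naive sketch (a capitalized genus followed by a lowercase species epithet), and its output illustrates exactly the precision problem noted above, matching ordinary capitalized phrases alongside the real binomen:

```python
import re

# Naive binomial rule: capitalized genus, then a lowercase species epithet.
BINOMEN = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")

text = "Larvae of Spodoptera frugiperda damaged maize. The infestation spread quickly."
print(BINOMEN.findall(text))
# ['Larvae of', 'Spodoptera frugiperda', 'The infestation']
# One true positive and two false positives: the rule fires on any
# capitalized word followed by a lowercase word.
```

Real rule-based taggers layer additional constraints (stop-word lists for sentence-initial words, Latinate suffix checks, subspecies handling) on top of this skeleton to recover precision.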

Machine learning approaches. Machine learning has also been used to solve NER problems. Unlike the dictionary or the rule-based approach, machine learning uses trained models to tackle the NER task (Lample et al., 2016). It can be implemented in various ways, but overall it “aims to provide automated extraction of insights from data by means of a predictive model” (Tramèr et al., 2016). Supervised machine learning involves training a model on labelled data in order to predict the labels of new data. Unsupervised machine learning builds the model from unlabeled data and is not used as often in NER; situations where unsupervised machine learning is used for NER usually include a portion of supervised learning as well (Mansouri et al., 2008). The advantage of the machine learning approach is that it can find entities that may not be included

in a dictionary, and entities that may not follow specific rules. The disadvantage of the machine learning approach, specifically supervised machine learning, is that the model tends to be biased towards the data used to train it (Gerner et al., 2010). The machine learning approach has been implemented in various tools such as FAT (Sautter et al., 2006), NetiNeti (Akella et al., 2012) and ChemSpot (Rocktäschel et al., 2012).

Finding all taxon names (FAT) is a tool that employs a hybrid approach, using machine learning and rules to locate the scientific names within a document. In FAT’s first parse of a document, it uses a set of precision rules to locate sequences of words classified as “sure positives” with regard to being valid taxonomic names (Sautter et al., 2006). During the second processing round it uses recall rules to locate all sequences of words that are “sure negatives”, i.e. invalid taxonomic names (Sautter et al., 2006). Both the sure negatives and sure positives are used to build a lexicon, which is used to filter sequences of words that have not been classified yet. For example, if a phrase contains a word that is a sure negative, then that phrase gets filtered out. Words from the sure positives are also used to classify other phrases as negative or positive. In the end, FAT takes all the sure negatives and sure positives identified so far and trains a “word-level language recognizer” (Sautter et al., 2006) to classify any word sequences that remain unclassified. FAT uses very specific precision and recall rules, implemented as regular expressions, to classify sure positives and sure negatives. The rules also model each part of a taxonomic name, including genus, subgenus, species and subspecies. The authors report very good results, with a precision of up to 99.7% for their method when combining multiple documents and combining the trained classifiers, rather than training a new classifier for each individual document. The drawbacks of this method are twofold. First, the classifier does not work on smaller documents, as a small document most likely does not contain a large enough collection of species names to train a classifier on. Second, even if the classifier is trained on a small set of species names, its generalization power will be limited by the inherent bias caused by the limited size of the training set.
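The lexicon filtering step can be sketched as follows. This is a toy illustration of the idea, not FAT's implementation: the word lists and the three-way label are invented, and in FAT the remaining unclassified phrases would go to the trained word-level language recognizer.

```python
# Hedged sketch of lexicon-based filtering: a candidate phrase containing any
# "sure negative" word is discarded, and phrases built entirely from
# "sure positive" words are promoted. The word lists are toy examples.
SURE_POSITIVE = {"Danaus", "plexippus"}
SURE_NEGATIVE = {"the", "however", "results"}

def classify_phrase(words):
    if any(w.lower() in SURE_NEGATIVE for w in words):
        return "negative"
    if all(w in SURE_POSITIVE for w in words):
        return "positive"
    return "unclassified"  # left for the word-level language recognizer

print(classify_phrase(["Danaus", "plexippus"]))   # positive
print(classify_phrase(["However", "plexippus"]))  # negative
```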

NetiNeti is another tool that employs a machine learning approach. NetiNeti uses rules to determine scientific name candidates (Akella et al., 2012). It then uses probabilistic machine learning algorithms that focus on contextual and structural features to classify the candidates. NetiNeti preprocesses the information by tokenizing text, and the candidates are groups of three consecutive tokens (trigrams). The trigrams are then filtered using capitalization rules and by removing common English words. After those filters are passed, the machine learning classifier determines whether the trigram is a species name using structural and contextual features (Akella et al., 2012). If the trigram fails, then the first two words of the trigram are tested as a bigram. If the bigram fails, then the first word is tested as a uninomial. NetiNeti implements probabilistic algorithms such as Naïve Bayes and Maximum Entropy to classify candidate names, with training sets used to build the models. One disadvantage of machine learning approaches is their bias towards their training datasets (Gerner et al., 2010). The advantage of this method is that it incorporates contextual features and can deal with misspellings and OCR errors. NetiNeti

was tested on names that were manually extracted from the book American Seashells, and the same test set was applied to FAT and TaxonFinder. NetiNeti’s precision (98.9%), recall (70.5%) and F-score (82.3%) were higher than those of TaxonFinder and FAT on that specific test set (Akella et al., 2012). In addition, NetiNeti was also tested on biodiversity texts with errors, web pages, as well as PMC and MedLine text, and led to similar results. NetiNeti is one of many NER tools, but the real power of NER lies in its combination with other tasks for larger purposes.
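The trigram-to-bigram-to-uninomial backoff described above can be sketched as below. The `looks_like_name` function is a toy stand-in for NetiNeti's trained Naïve Bayes / Maximum Entropy classifier, and the known-name set is invented for illustration.

```python
# Hedged sketch of NetiNeti-style candidate backoff (trigram -> bigram -> uninomial).
KNOWN_NAMES = {"Apis mellifera ligustica", "Apis mellifera", "Drosophila"}

def looks_like_name(phrase):
    """Stand-in for the probabilistic classifier: here, a simple lookup."""
    return phrase in KNOWN_NAMES

def classify_candidate(tokens):
    """Test the 3-token candidate, then back off to 2 tokens, then 1."""
    for n in (3, 2, 1):
        phrase = " ".join(tokens[:n])
        if looks_like_name(phrase):
            return phrase
    return None

print(classify_candidate(["Apis", "mellifera", "were"]))  # backs off to the bigram
```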

Hybrid approaches. As is often the case, hybrid approaches provide better results by combining complementary methods that address each other’s weaknesses. TaxonGrab is an example of a hybrid approach, which incorporates both a rule-based and a dictionary-based approach (Koning et al., 2005). TaxonGrab uses a reverse dictionary approach, as its dictionary contains non-taxonomic terms. The reverse dictionary (lexicon) is used to find groups of two to three words that are considered candidates for species names; no word in a potential species name may appear in the reverse dictionary. The candidates are then classified using Linnaean rules of nomenclature. TaxonGrab reports precision that is consistently higher than 96.0% and a recall of 94.0% (Koning et al., 2005). As previously mentioned for rule-based methods, it is possible for a phrase that is a species name to be bypassed if it does not follow the rules, or if it contains a term that is in the dictionary. Rule-based methods also tend to bypass common names (Gerner et al., 2010). The combination of the two approaches does increase the strength of the tool, as the rule-based approach minimizes some of the disadvantages of the dictionary-based approach, such as the previously mentioned high false-negative rates that can occur with dictionaries and the need for a large, expansive dictionary that can be time consuming to generate.

2.3.2 Information Extraction & Relation Extraction

Information extraction is a natural language processing task that focuses on extracting relevant text from textual data. The difficulty of the task stems from the amount of irrelevant text that must be parsed to ascertain important data (Leng & Jiang, 2016). Information extraction can help with simplifying the text to a more structured form. One purpose is the extraction of relevant information from unstructured text to populate a database. The simplified text can take the form of a triplet (entity 1, relation, entity 2) in which the relationship is established between two entities. This incorporates the NER and relation extraction (RE) approaches. RE refers to using computer-based techniques to find relations or associations within text (Chun et al., 2006) and has been accomplished using various methods.
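The (entity 1, relation, entity 2) form mentioned above can be represented directly as a typed tuple. The class and example relations below are illustrative, not taken from a specific tool.

```python
from typing import NamedTuple

# Minimal sketch of the triplet representation used to structure extracted text.
class Triple(NamedTuple):
    entity1: str
    relation: str
    entity2: str

triples = [
    Triple("Danaus plexippus", "feeds on", "Asclepias syriaca"),
    Triple("Asclepias syriaca", "grows in", "open fields"),
]

# Once text is reduced to triples, structured queries become trivial,
# e.g. selecting only feeding relations to populate a database:
feeding = [t for t in triples if "feed" in t.relation]
print(feeding[0].entity1)  # Danaus plexippus
```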

The RE approaches include various techniques such as pattern detection and supervised and unsupervised machine learning (Kaushik & Chatterjee, 2018). The pattern-based method uses human-crafted rules to extract relations from text. One downside of the method is that the patterns are domain specific. Another category of methods consists of supervised methods that incorporate machine learning techniques to extract the relations. Within the supervised approach there are multiple algorithms that can be

applied, such as conditional random fields, kernel methods and logistic regression. A third approach is based on unsupervised learning, which can automatically identify new relation patterns in text without training. One example of this is the Open Information Extraction (OIE) method. OIE focuses on extracting information without the input of a prechosen vocabulary (Fader et al., 2011) allowing it to be domain independent and therefore extremely useful.

Open Information Extraction attempts to build on tools that learn to extract relations from labeled training data. An issue with learning from labeled data is that corpora with a wide variety of relations can cause difficulties and target relations can be ignored. Open IE attempts to tackle these issues by recognizing relation phrases directly in text, which eliminates the need for a pre-specified vocabulary. There have been multiple attempts to build systems that achieve this task. TextRunner is an early OIE tool that annotates sentences using part-of-speech (POS) tags, which identify the category of speech a word falls into (e.g. noun, verb, preposition), and noun-phrase chunks, which identify groups of words that represent a noun. A classifier is then used to determine whether the sentence represents a relationship. A parsing tool is used to gather training examples that are automatically labelled as relationships or not, and the classifier is trained on these automatically tagged examples (Yates et al., 2007). TextRunner was run on a subset of 400 tuples and 80.4% of the extracted relations were deemed relevant by human reviewers. WOE is another information extraction approach that seeks to improve on TextRunner (Etzioni et al., 2011). It uses self-supervised learning and heuristic matches to construct training data. WOE operates in two modes: it can be restricted to POS tags (WOEpos) or it can include dependency-parse features (simplified patterns found between words) to improve precision and recall (WOEparse). WOE preprocesses raw Wikipedia text into sentences and attaches NLP annotations. OpenNLP is used to split articles into sentences and add POS tags and NP chunk annotations. Synonyms are also compiled, as certain nouns can be represented using multiple terms. WOE uses a heuristic matching system to match attribute-value pairs in an article. The WOEparse extractor uses a pattern learner to identify whether the words between two nouns represent a semantic relationship. WOEpos uses a conditional random field trained on POS tags to identify and extract text between noun phrases that denote a relationship (F. Wu & Weld, 2010). Both modes of WOE were tested on three sets of 300 randomly chosen sentences that contain a triple of two arguments and one relational phrase. Both WOEpos and WOEparse achieved better precision and recall than TextRunner, which is attributed to better training data and parser features, respectively. In addition, incoherent extractions as well as uninformative extractions are an issue for TextRunner. Incoherent extractions refer to relations where no meaningful relation can be defined. Uninformative extractions are situations where verb-noun combinations are mishandled.

Newer OIE tools have been designed to decrease the number of irrelevant extractions. A newer OIE tool called ReVerb was created to address some of the issues TextRunner and WOE faced. ReVerb focuses on phrase patterns expressed in terms of POS tags and noun-phrase chunks. The tool is built on the assumption that a small group of POS tag patterns covers a large number of relationships in English. ReVerb uses a syntactic

constraint and a lexical constraint to minimize false positives (Fader et al., 2011). The syntactic constraint is meant to prevent uninformative as well as incoherent relations. It only allows relation phrases made up of a verb followed by a preposition, or a verb followed by a noun, adjective or adverb. The longest possible match is chosen in a sentence, and the phrase must appear between the two arguments in the sentence. The lexical constraint is meant to prevent overly specific matches. The creators of ReVerb acknowledge that it has limitations and does not recognize some relationships, such as situations where the sentence is not contiguous, where the relation phrase is not between the arguments, and where the phrase does not match the incorporated POS patterns. Although ReVerb is not perfect, it captures a large proportion of relational text. It looks at the relation verb, which prevents it from confusing a noun in the relational phrase with an argument. ReVerb also applies a confidence score to each relation using a logistic regression classifier (Fader et al., 2011). The confidence score is based on several features, such as the preposition, the length of the total relation, whether or not there are proper nouns and where the preposition is placed. Their philosophy is that it is easier to train a model to apply confidence scores than it is to train a model to extract relations. Ollie is a tool that seeks to build on ReVerb by starting with 110,000 high-confidence ReVerb extractions and using those extractions to find new sentences that contain the same relation to develop a training set (Mausam et al., 2012). The relations are represented as open pattern templates, which display the different ways a relation can be organized in a sentence. Ollie uses the open pattern templates to find extractions in sentences. On average, Ollie achieves 4.4 times more correct extractions than ReVerb.
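The flavor of the syntactic constraint can be sketched as a regular expression over a simplified tag alphabet. This is a rough approximation for illustration, not ReVerb's exact grammar: tags are collapsed to V (verb), W (noun/adjective/adverb-like words) and P (preposition/particle), and the example sentence is hand-tagged.

```python
import re

# Rough sketch of a ReVerb-style constraint: a relation phrase is one or
# more verbs, optionally followed by W words and ending in a preposition.
RELATION = re.compile(r"V+(?:W*P)?")

def relation_spans(tags):
    """Return (start, end) token spans whose tags satisfy the constraint,
    taking the longest possible match at each position."""
    return [(m.start(), m.end()) for m in RELATION.finditer("".join(tags))]

words = ["Calvin", "was", "born", "in", "Missouri"]
tags  = ["W",      "V",   "V",    "P",  "W"]  # hand-tagged for illustration
for start, end in relation_spans(tags):
    print(" ".join(words[start:end]))  # -> was born in
```

Argument extraction (finding the noun phrases on either side of the span) is a separate step in ReVerb and is omitted here.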

Relation extraction has also been used to identify gene-disease associations in several publications. Bhasuran and Natarajan (2018) used joint ensemble learning to automatically extract gene-disease associations. Joint ensemble learning combines domain-specific and domain-independent syntactic and semantic features with word2vec (Mikolov et al., 2013) features; word2vec is an open-source tool that employs neural networks to create distributed representations of words. To recognize gene and disease associations in text, they implemented BANNER, an open-source entity recognition tool (Leaman & Gonzalez, 2008). They also used a dictionary matching process with a gene dictionary they created from multiple sources. Overall, they created a recognition system that used a conditional random field with a fuzzy-matching disease dictionary. An extensive feature set was used to train their support vector machine (SVM), which included syntactic and semantic features, lexical features, concept features, contextual features, patterns, word representations and negation features (Bhasuran & Natarajan, 2018). Syntactic and semantic features include relational keywords, phrases and word windows. Lexical features include specific mentions, while concept features include recognized gene and disease names as well as the sequential word order and the distance among words. Contextual features include words around the gene and disease, as well as the corpus frequency, topic sentence and relationships. For pattern templates, the focus is action verbs and specific genetic and biochemical concepts such as mutation and methylation. Word representations are produced using word2vec. Negation features include negative independence aspects. The joint learning method uses entity semantics and relation patterns to extract relations from text. Ensemble learning was used to build a model based on these features, and an EnsembleSVM was then used to combine multiple SVMs (Bhasuran & Natarajan, 2018). They tested their method on four corpora (EUADR, GAD, CoMAGC and PolySearch), achieving a highest precision of 83.45%, a highest recall of 98.01% and a highest F-score of 87.39%.

Other methods have been used for relation extraction. Chun et al. (2006) designed a method that extracts gene-disease relations by combining a dictionary with machine learning (Chun et al., 2006). Their method initially uses a dictionary to find sentences that contain a disease and a gene name. Then, a relation is extracted between the disease and the gene name in the sentence. This method encounters three different types of false positives in the dictionary-based results: false gene names, false disease names and false relations. They use a gene dictionary and a disease dictionary. Their gene dictionary is curated using five public databases: HUGO (Eyre et al., 2006), LocusLink (Pruitt, 2001), SwissProt (Bairoch, 2000), RefSeq (O’Leary et al., 2016), and DDBJ (Kaminuma et al., 2010). The Unified Medical Language System (Bodenreider, 2004) was used to build the disease dictionary. The training set was built from an annotated corpus of 1,362,285 abstracts retrieved from a Medline (National Library of Medicine, 2018) search. Sentences were classified as a correct co-occurrence if the sentence described in some way a causal relationship between the gene and disease, a therapeutic significance, or the gene being a marker for the disease. They filtered their results using a maximum entropy-based NER method to remove false positives. Several combinations of features were tested for the NER filter, including POS tags, capitalization, contextual features, affixes in the candidate terms, and Greek letters in the candidate term. The highest precision achieved was 90.0% and the highest recall achieved was 98.1% (Chun et al., 2006).
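The dictionary-based first pass described above (keep only sentences mentioning both a gene and a disease) can be sketched as follows. The dictionaries here are tiny toy stand-ins for the curated gene and disease dictionaries, and the sentence splitter is a simplification.

```python
import re

# Hedged sketch of dictionary-based co-occurrence filtering with toy dictionaries.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "lymphoma"}

def co_occurring(text):
    """Return sentences that mention at least one gene and one disease."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if any(g in sentence for g in GENES) and any(d in sentence for d in DISEASES):
            hits.append(sentence)
    return hits

text = ("Mutations in BRCA1 increase breast cancer risk. "
        "TP53 is widely studied. The weather was mild.")
print(co_occurring(text))  # only the first sentence contains both entity types
```

In Chun et al.'s pipeline, these co-occurrence sentences would then be passed to the maximum entropy NER filter to remove the false positives.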

Another tool that automatically finds associations is Plan2L. Plan2L extracts information on the plant Arabidopsis thaliana via a computational pipeline that retrieves documents related specifically to that particular species (Krallinger et al., 2009). The process of detecting relations uses semantic rules and POS tagging. To extract protein interaction evidence, they trained a sentence-level machine learning classifier on a manually selected set of interaction passages. The classifier achieved a precision of 89.8% and a recall of 92.6% in the second BioCreative Challenge (Krallinger et al., 2008). Users are provided with six search options to choose from. A case study was completed using AGAMOUS and LEUNIG, a related gene and regulator (Krallinger et al., 2009). The system scores the sentences that are thought to be evidence of an interaction or relation. Plan2L was able to deduce that there was a regulation relationship between LEUNIG and AGAMOUS, and in some cases the regulation type was correctly classified as ‘Repression’. The Plan2L project shows that the method is useful in multiple fields.

Ontology construction is another area where RE has been applied with great success, specifically in the agricultural field. The RENT algorithm (Kaushik & Chatterjee, 2018) was used for term extraction. RENT

incorporated twenty domain specific regular expression patterns. Those expressions focus on four specific relations, “is_a”, “is_type_of”, and “is_intercrop” as well as “has_synonym”. Terms extracted by the expressions were weighted using a specific set of rules. If a term is a noun, appears often or occurs within multiple patterns, additional weight is added to it. POS-based linguistic filters are applied to the text data as well. They used the terms extracted by the RENT algorithm, then tested two approaches for relation extraction. The first approach was a statistical approach that focused on word frequency distribution, while the second approach was based on semantics and carried out using WordNet (G. A. Miller, 1995). The word frequency distribution approach is based on the idea that related words will have similar positions in sentences and similar frequencies. The WordNet approach uses synonym sets, which contain words that have similar meanings, whether they are nouns, verbs, adjectives or adverbs. They used a path similarity measure to categorize terms based on likeness. The combination of the RENT algorithm terms with the two approaches for relation extraction was characterized as a modified open information extraction (mOIE) scheme (Kaushik & Chatterjee, 2018). The mOIE was successful in identifying only the “has_synonym” relation. They then created the RelExOnt algorithm that uses rules to identify relations. The terms connected by the relations are identified based on certain applied constraints for each relation. Each relation had its own rules and constraints. The RelExOnt algorithm had an average precision of 86.89% on 10 randomly sampled datasets.
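The path similarity measure mentioned above is typically defined as 1 / (1 + shortest-path length) between two terms in the hierarchy. The sketch below computes it over a hand-built toy "is_a" graph; the real measure runs on WordNet's synset hierarchy (e.g. via NLTK), not on a graph like this.

```python
from collections import deque

# Toy "is_a" hierarchy: child -> list of parents. Invented for illustration.
IS_A = {
    "maize":  ["cereal"],
    "wheat":  ["cereal"],
    "cereal": ["crop"],
    "legume": ["crop"],
}

def neighbors(term):
    """Treat is_a edges as undirected: parents plus children of the term."""
    out = set(IS_A.get(term, []))
    out |= {t for t, parents in IS_A.items() if term in parents}
    return out

def path_similarity(a, b):
    """WordNet-style score: 1 / (1 + shortest path length), 0 if unreachable."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return 1 / (1 + dist)
        for nxt in neighbors(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0

print(path_similarity("maize", "wheat"))  # distance 2 via "cereal" -> 1/3
```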

Relation extraction has been tested as a method of identifying genotype-phenotype relationships in text. Khordad and Mercer (2017) identified genotype-phenotype relationships by gathering data for constructing training and testing datasets and making use of several relation extraction tools. They also tailored the rule-based and machine learning approaches to this task by incorporating verbs and prepositions specific to biomedical relationships. Sentences that were agreed upon by both the rule-based and machine learning based approaches became the training set. These sentences were then cleaned manually to remove sentences that did not meet the necessary criteria. In this case, the maximum entropy classifier was selected for the application. The representation of the genotype-phenotype pair was derived from features within the sentence, including the relationship term itself, its position and the stemmed term, among other features. They then attempted a self-training algorithm where labelled data was used to train a classifier that can tag unlabeled data. One issue that arises is overfitting when only the unlabeled instances with the highest confidence are added. To overcome this problem, they added instances with confidence levels between specific thresholds. The self-training model resulted in higher precision, recall and F-measure, equal to 77.70%, 77.84% and 77.77%, respectively (Khordad & Mercer, 2017).

2.4 Conclusion

This review chapter highlights a number of relevant methods, tasks and tools that have been developed in the field of text mining with respect to Named Entity Recognition, Relation Extraction and Information Extraction in biological literature. It is evident from the

reviewed research that text mining is extremely useful and can be achieved using various approaches and methods. The three main methods that are often applied are based on rules, dictionaries or machine learning. One area where more research is needed is the creation of more gold-standard and domain-specific corpora, which would allow text mining tools to be tested on the same data sets and therefore compared more accurately. The amount of information currently available greatly exceeds what can be manually parsed, which is why text mining tools have increased in relevance over the years.


3 Bigram Based Species Recognition

3.1 Introduction

Species name identification is a process used to link different publications (Akella et al., 2012). This is useful when attempting to extract information about a specific species, organism, family or genus. To examine the method, tests were completed in which 2-word combinations were used to train machine learning classifiers with the purpose of identifying arthropod species names in the presence of regular English words and person names that appear in scientific publications. The frequencies of character bigrams were used to build 15 models utilizing machine learning classifiers spanning 7 algorithmic categories (tree-based, rule-based, artificial neural network, Bayesian, boosting, lazy and kernel-based), and the models were tested on 3 datasets corresponding to 3 classification problems.

The overall goal is to create a pipeline that takes primary literature as input and processes it into final trophic classifications of the organisms named in its sentences. This goal is separated into two tasks. The first task is the extraction of the relevant sentences in the text. Relevant sentences are defined as sentences that contain at least one trophic related keyword (TRK) from a precompiled list of key phrases in reference to a scientific name or common name. The second task is the classification of named organisms in the relevant sentences into one of five trophic categories: herbivore, carnivore, omnivore, parasite and detritivore. The problem must be split into two tasks because the extraction of sentences and the classification of organisms within a sentence are two vastly different jobs. Extraction from primary literature can be completed using different methods that each produce varying results. Research articles come in many different formats, contain multiple sections and can be laid out in a variety of ways: one column or multiple columns, with figures placed at the end or interspersed throughout the text. These different layouts can lead to unique challenges when extracting the main text. The classification is a separate task that is carried out on the output of the extraction task and has its own challenges, such as the incorrect identification of named entities and an incorrect final category.

3.2 Bigram Materials and Methods

Datasets: Three datasets were prepared for training the classifiers, which include the following types of information (classes): (i) SCI: 3000 2-word phrases representing arthropod species names, (ii) ENG: 3000 2-word phrases including two consecutive English words commonly used in scientific literature published in English, and (iii) PEO: 3000 2-word phrases representing first and last person names. To validate the results, three additional datasets were prepared corresponding to the same types of information, each including 500 2-word phrases (V_SCI, V_ENG and V_PEO). The validation datasets were not used in the classifiers’ training and testing process. The arthropod species


names were collected from BugGuide - a community site for entomologists who share information and photos of arthropod species (Koning et al., 2005). Arthropod species names were scraped and curated using a custom-built Python script, resulting in a large collection of 11,720 unique 2-word phrases, from which we selected 3000 uniformly at random. Two consecutive English words were gathered from five research papers and each word was checked against an English dictionary. The person names were created from combinations of first and last names obtained from GitHub repositories (Arin, 2016; Tarr, 2015).

The datasets were processed and the frequencies of all 676 two-letter combinations (bigrams) over the standard 26-letter English alphabet were calculated for each dataset entry. Each 2-word phrase in the 3 datasets is therefore represented by a row of 676 bigram frequencies.
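The feature extraction described above can be sketched as follows, using the `re` library as in the thesis. Normalizing counts to relative frequencies within each phrase is an assumption made for illustration; the thesis does not fully specify the normalization.

```python
import re
from itertools import product
from collections import Counter

# Each 2-word phrase becomes a 676-dimensional row of two-letter (a-z)
# bigram frequencies, one column per possible bigram.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 676 feature columns

def bigram_frequencies(phrase):
    """Return the 676-element frequency row for one 2-word phrase."""
    words = re.findall(r"[a-z]+", phrase.lower())
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    total = sum(counts.values()) or 1
    return [counts[b] / total for b in BIGRAMS]

row = bigram_frequencies("Apis mellifera")
print(len(row))  # 676
```

Writing one such row per phrase, with a class label appended, yields exactly the CSV layout that was converted to Weka's ARFF format.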

Classification problems: The goal of this work was to identify ML algorithms capable of distinguishing arthropod species names in a scientific publication. The two major challenges specific to this task were distinguishing arthropod species names from regular English words and from person names, respectively.

To address these challenges, three classification problems were defined, two of which are 2-class problems and one is a more generic 3-class problem.

The first problem (P1) is defined such that, given a set of 2-word species names of arthropod species (SCI) and groups of two consecutive English words commonly encountered in the English literature (ENG), classify them into the corresponding categories. The second problem (P2) considers a set of 2-word arthropod species names (SCI) and 2-word phrases representing first and last person names (PEO) and focuses on classifying them into the 2 categories. The third problem (P3) focuses on distinguishing between all three classes: species names, English words, and person names.

The datasets used for P1 and P2 contain 6000 instances each, with 3000 instances in each class, while the dataset for P3 contains 9000 instances, with 3000 instances in each class. In each dataset, the instances are represented by bigrams in the forward direction.

Classification methods: Machine learning models were trained and tested using Weka (Hall et al., 2009). Ten-fold cross validation was used to train and test each model. The following Weka classifiers were used in this study: tree-based (Random Forest, J48), Bayesian (Bayesian Logistic Regression, Naïve Bayes, Bayes Net, Complement Naïve Bayes, Naïve Bayes Multinomial, Naïve Bayes Updateable), ANN (MLP Classifier), kernel-based (LIBSVM, LIBLINEAR), lazy (Lazy K*, Lazy IBK), rule-based (Decision Table) and boosting (AdaBoost). The default settings for each classifier were used. All dataset files were saved in CSV format and converted to the Weka ARFF-format.
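The 10-fold cross-validation scheme can be illustrated in pure Python. Weka performs this internally; the shuffling seed and index layout below are illustrative only.

```python
import random

# Sketch of 10-fold cross validation: shuffle the instances once, then
# rotate each tenth through the test role while the rest form the training set.
def ten_fold_indices(n_instances, seed=42):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(ten_fold_indices(6000))  # e.g. the 6000-instance P1/P2 datasets
print(len(splits), len(splits[0][1]))  # 10 folds of 600 test instances each
```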

Experimental setup and performance metrics: All 15 classifiers were evaluated with Weka (Hall et al., 2009) using a 10-fold cross-validation approach. All tests were

performed using the default settings for each classifier. Since classes are balanced and include 3,000 items each for training and 500 each for validation, the accuracy of correctly predicted instances (%) and the execution time (seconds) were reported. N-gram extraction was achieved using the re library from Python 3.

3.3 Bigram Results

In this work, the ability of character n-gram-based classifiers to distinguish arthropod species names from regular English words and person names, by solving the three problems described above, was investigated. Classification accuracies are summarized in Figure 3.1.

Figure 3.1. Training/testing classification accuracies for all 15 classifiers applied on problems P1, P2 and P3. Note: the Bayesian Logistic Regression methods could not be applied on non-binary classification problems such as P3.

Classification accuracy: For all three problems, the LIBLINEAR classifier outperformed the other classifiers, with accuracy values of 97.53% (P1), 94.70% (P2) and 91.31% (P3). Random Forest ranked second for problems P1 (96.88%), P2 (93.55%) and P3 (90.98%), while the MLP and SVM classifiers consistently ranked in the top 5. The Bayesian Logistic Regression classifier ranked 3rd and 4th on P1 and P2, respectively, while it could not be executed on P3 due to its binary-class applicability limitation. The Decision Table classifier’s accuracy ranked in the lower half of the models on all three classification problems. The J48 method performed well in both two-class problems but ranked 11th, with an accuracy of 77.67%, for the three-class problem. The poorest performers across the board were AdaBoost (ranked last on all problems) and Lazy IBK, which seem to struggle when applied to the 3 datasets.

The three-class problem proved to be more difficult for all models, and a significant overall decrease in performance, ranging between 5.9% for Random Forest and 33.6% for AdaBoost, was observed.


Classifying 2-word phrases that represent person names, species names and English words (P3) proves to be more difficult than distinguishing between species names and English words (P1) or between species names and person names (P2). Different languages have different n-gram frequencies (Keselj et al., 2003) and require independently built classifiers to distinguish among them. Species names tend to be based on Latin and old Greek, while the text of a scientific article is largely English. Their n-gram frequencies differ, and some n-grams are more prevalent in one of the two categories, which explains the higher classification accuracy for problem P1. This also explains why n-grams are good features for classifiers applied to language identification (Muhammad et al., 2012). Similarly, person names originating from different cultures and language backgrounds have particular n-gram frequency distributions. Since the majority of the first and last names used here are typical of English-based languages but have either a Latin or a Greek origin, the frequency of certain bigrams is high in both the SCI and PEO datasets (Figure 3.2). In contrast, there is less overlap between the sets of high-frequency bigrams in the SCI and ENG datasets, which could explain why the overall classifier accuracies are higher for P1 compared to P2.

Figure 3.2. Venn diagram representing top 100 high frequency bigrams for SCI, ENG and PEO datasets.

Comparison with other species name identification tools: Comparisons were attempted between these results and 9 external tools, including NetiNeti (Akella et al., 2012), TaxonGrab (Koning et al., 2005), LINNAEUS (Gerner et al., 2010), SpeciesTagger (Pafilis et al., 2013), Organism-tagger (Naderi et al., 2011), Solr-Plant (V. Sharma et al., 2019), Whatizit (Rebholz-Schuhmann et al., 2008) and COPIOUS (Nguyen et al., 2019), but only a comparison against the Global Names Recognition and Discovery (GNRD) tool (Pyle, 2016) was achieved, as the other tools were either not available or not functioning as described in their documentation.

A comparison of the bigram-based classifier results with the Global Names Recognition and Discovery (GNRD) tool on the validation datasets is shown in Table 3.1. With an overall accuracy of 96.1%, GNRD identified all species names from the V_SCI dataset, but only when they were capitalized. While GNRD correctly identified all English words, it misidentified 58 of 500 person names as species names. In comparison, our top 4 classifiers capable of solving 3-class problems achieved accuracies between 85.6% and 91.8% on the validation set, consistent with the performance obtained when trained and tested on the combined dataset of 9,000 2-word phrases.

Table 3.1. Global Names Recognition and Discovery (GNRD) tool results compared with results from the highest-achieving models tested.

Method Dataset Num. Correctly Predicted Instances Prediction Accuracy [%]

LIBLINEAR V_SCI 458 91.6
LIBLINEAR V_ENG 432 86.4
LIBLINEAR V_PEO 487 97.4
LIBLINEAR V_SCI + V_ENG + V_PEO 1377 91.8
MLP V_SCI 427 85.4
MLP V_ENG 411 82.2
MLP V_PEO 494 98.8
MLP V_SCI + V_ENG + V_PEO 1332 88.8
Random Forest V_SCI 425 85.0
Random Forest V_ENG 408 81.6
Random Forest V_PEO 493 98.6
Random Forest V_SCI + V_ENG + V_PEO 1326 88.6
LIBSVM V_SCI 399 79.8
LIBSVM V_ENG 391 78.2
LIBSVM V_PEO 494 98.8
LIBSVM V_SCI + V_ENG + V_PEO 1284 85.6
GNRD V_SCI 500 100
GNRD V_ENG 500 100
GNRD V_PEO 442 88.4
GNRD V_SCI + V_ENG + V_PEO 1442 96.1

Runtimes: Classifier runtimes were measured on an HP Notebook 14-cf0018ca equipped with a 4-core Intel Core i5-8250U CPU (1.6 GHz base frequency, 4 MB cache), 8 GB DDR4-2400 SDRAM and a 256 GB PCIe NVMe M.2 SSD, running Windows 10 (64-bit). For each classification problem, the runtimes of the classifiers (Figure 3.3) varied between 10 milliseconds and 9.4 minutes (562 s). In all three scenarios the Decision Table classifier had the longest runtime, with values ranging from 207 s to 562 s, while the fastest models were the lazy classifiers, taking between 0.01 s and 0.30 s. Runtime was not an indicator of performance, since both the fastest and the slowest models ranked in the bottom half of the 15 models for accuracy on all three classification problems.

Figure 3.3. Runtimes for all 15 classifier methods applied to the 3 classification problems: P1, P2 and P3.

3.4 Bigram Method Conclusion

Our results suggest that bigram-based classification is a suitable method for distinguishing arthropod species names from regular English words and person names commonly found in scientific literature. To this end, 3 classification problems were designed, 3 training/testing datasets of 3,000 2-word phrases per category were constructed, along with 3 validation datasets of 500 2-word phrases each. Fifteen classifiers spanning 7 generic categories were considered. The LIBLINEAR classifier outperformed all the other classifiers on all 3 classification tasks with respect to prediction accuracy, while Random Forest, Bayesian Logistic Regression, Multi-Layer Perceptron and LIBSVM ranked in the top 5. The worst-performing classifiers on the 3 problems proposed here were AdaBoost and Lazy IBK. All 15 classifiers produced their highest accuracy on the SCI-ENG (P1) binary problem, and accuracy decreased significantly on the 3-class problem (P3). Their accuracies decreased less sharply when distinguishing between arthropod species names and person names (P2: SCI-PEO). The hypothesis is that the decrease in accuracy on P2 compared to P1 is due to a larger number of shared high-frequency bigrams between the SCI and PEO datasets than between the SCI and ENG datasets. With respect to execution time, neither the fastest nor the slowest classifiers were top performers. When compared with an external software package (GNRD) on a validation dataset, our bigram-based classifiers maintained their performance, which was comparable to but slightly lower than that of GNRD. Future work could explore the use of 3- and 4-grams on the same classification problems and estimate the tradeoff between the practical application of such methods and the gain in accuracy. Moreover, hybridizing the n-gram based classification method with other approaches such as part-of-speech identification and capitalization rules could lead to significant improvements for arthropod species name identification in non-structured text.


4 Trophic Information Analysis Pipeline

4.1 Introduction

The previous chapter discussed the training and testing of algorithms used to create a machine learning model that identifies scientific names. In this chapter the creation and testing of a pipeline for extraction and classification of trophic relations is explained. First, this section provides a description of the overall problem and the two tasks used to create a solution. Then the input and expected output are discussed. The next section explains the full process from input to output. The subsequent paragraphs describe the materials and methods as well as the evaluation measures and results. The chapter ends with a discussion of the previously described results.

Problem and Tasks: The overall goal is to create a pipeline that takes primary literature as input and produces as output final trophic classifications of the organisms named in its sentences. This goal is separated into two tasks. The first task is the extraction of relevant sentences from the text, where a relevant sentence is one that contains at least one trophic related keyword (TRK) from a precompiled list of key phrases in reference to a scientific name or common name. The second task is the classification of named organisms in the relevant sentences into one of five trophic categories: herbivore, carnivore, omnivore, parasite and detritivore. The problem must be split into two tasks because the extraction of sentences and the classification of organisms within a sentence are two very different jobs. Extraction from primary literature can be completed using different methods with varying results: research articles come in many formats, contain multiple sections and can be laid out in a variety of ways, using one or multiple columns, or with figures placed at the end versus interspersed throughout. These different layouts lead to unique challenges when extracting the main text. Classification is a separate task carried out on the output of the extraction task, with its own challenges such as the incorrect identification of named entities and incorrect final categories.

4.2 Materials and Methods

4.2.1 Datasets

Several collections were created for the pipeline. The first was a collection of scientific names and common names, used to identify names in primary literature. To create a large collection of names, data from several sources had to be extracted, cleaned and combined. In total, 3,363,521 scientific names and 298,681 common names were collected, representing kingdoms, phyla, classes, orders, superfamilies, families, tribes, genera, species and intraspecific taxa. In addition to the dictionary data, sufficient data had to be collected for testing. Another collection of 235,886 English words, taken from the Linux operating system's built-in dictionary, was created to help distinguish scientific names from regular words. Finally, a set of 116 keywords and key phrases was collected to identify sentences that contain trophic information.

Dictionary Data Sources: Data sources refer to the original locations of the scientific and common names that were used to assemble the final data file; a detailed breakdown is shown in Table 4.4. First, 32,169 scientific names and 24,370 common names were collected from BugGuide (BugGuide, 2003), a community site for entomologists and entomology enthusiasts that focuses on organisms in the phylum Arthropoda. A total of 3,949,085 scientific names and 324,582 common names were extracted from the Catalogue of Life (Cachuela-Palacio, 2006) website, currently the most comprehensive species-focused database, which also contains higher-level taxonomic information; it is a highly respected resource and is updated regularly. Data including 2,357 scientific names and 2,232 common names was also retrieved from the Entomological Society of America (Entomological Society of America, 1889) website, which focuses on the field of entomology; the site is professionally governed and used by many educational institutions. To complete the data collection, we extracted 90,909 plant scientific names and 30,980 common names from the United States Department of Agriculture PLANTS database (USDA, 2020). The files were filtered to remove duplicate entries and entries containing non-English characters. The names and kingdom designations were combined into two text files: one containing the combined common names (298,681 entries, including 30 manually added common names) and one containing the scientific names (3,363,521 entries). A third file was created with 1,376,507 abbreviated versions of species names from the scientific name file in the first column and the possible expanded species names in the second column. Data is available at https://github.com/JSRaffing/Trophic-Information-Extraction-Pipeline.
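The third file, mapping abbreviated species names to their possible expansions, could plausibly be derived as sketched below. This is an assumption about how such a file is built; the thesis does not publish the script, and the example names (including the collision pair) are illustrative only.

```python
def abbreviate(scientific_name: str):
    """Map a binomial such as 'Paraponera clavata' to 'P. clavata'.

    Illustrative sketch: subspecies, author strings and hybrid markers
    are ignored; single-word (e.g. genus-level) entries are skipped.
    """
    parts = scientific_name.split()
    if len(parts) < 2:
        return None
    return f"{parts[0][0]}. {parts[1]}"

def build_abbreviation_index(names):
    """Group full names under their abbreviated form, since one
    abbreviation (e.g. 'P. clavata') can expand to several binomials."""
    index = {}
    for name in names:
        abbr = abbreviate(name)
        if abbr:
            index.setdefault(abbr, []).append(name)
    return index

# 'Pieris clavata' is a made-up name used only to show an abbreviation clash.
index = build_abbreviation_index(
    ["Paraponera clavata", "Photinus pyralis", "Pieris clavata"]
)
```

Storing all expansions per abbreviation lets a later step disambiguate "P. clavata" using context.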

Table 4.4. A detailed breakdown of scientific names and common names collected from sources.

Source Scientific Names Initially Retrieved Common Names Initially Retrieved

BugGuide 32,169 24,370
Catalogue of Life 3,949,085 324,582
Entomological Society of America 2,357 2,232
USDA PLANTS 90,909 30,980

Test Data Collection: A total of 56 research articles (See Table S.1) were manually curated through organism searches in the research article database Google Scholar. Organisms that fit into each trophic category were the basis of each query. For example, a search for “P. clavata” (an omnivorous ant) was completed and articles were found that discussed its trophic behavior. Once an article was found, it was read by the researcher to locate any relevant sentences that should be extracted and classified by the pipeline. This process was repeated until a total of 200 trophic relations were found. These were represented by 175 relevant sentences in 56 research articles covering all 5 trophic categories. Some sentences matched more than one category given the non-exclusive nature of the 5 categories. An example of the breakdown of a relevant sentence is shown in Table 4.2. A trophic relation was defined as a part of a sentence that contains a TRK in relation to a scientific name or common name.

Table 4.2. Example of the breakdown of relevant sentences with ideal final classifications. Sentences were taken from two of the 56 articles used in the testing set.

Sentence: Hemiargus isola larvae feeding on D. albiflora are parasitised by a braconid wasp in the subfamily Microgastrinae, Cotesia cyaniridis, and a tachinid fly, Aplomya theclarum. (Weeks, 2003)
Number of sentences: 1
Categories represented: 2 (Parasite, Herbivore)
Ideal classifications: Hemiargus isola is a herbivore; the braconid wasp is a parasite; Microgastrinae are parasites; Cotesia cyaniridis is a parasite; the tachinid fly is a parasite; Aplomya theclarum is a parasite

Sentence: Photuris females eat Photinus males or lucibufagin. (Eisner et al., 1997)
Number of sentences: 1
Categories represented: 1 (Carnivore)
Ideal classifications: Photuris is a carnivore

Trophic Keyword Collection: To gather relevant trophic information efficiently, a collection of TRKs was used to find matches in scientific literature. If a statement contained a TRK, it was extracted for further analysis in the subsequent steps. Tools such as YAKE! (Campos et al., 2018) and TextRank (Mihalcea & Tarau, 2004) were explored to facilitate the accumulation of key phrases related to trophic information, but these tools were unable to gather relevant key phrases for this narrow topic. The TRK collection was therefore accumulated manually from various research articles; as a manually curated collection, it is subject to the coverage limitations of any dictionary-based method. Each TRK falls into one of two categories: triplet keywords that lead to a triplet relation, such as “feed on”, and category keywords that lead to keyword-based relations, such as “is omnivorous”. The collection contains both single-word keywords, such as “omnivorous”, and multi-word key phrases, such as “is omnivorous”. The breakdown of the collection of trophic related keywords is shown in Figure 4.1 and the full set of keywords is shown in Table S.3.

Figure 4.1. Breakdown of the number of keywords of each type: 61 triplet keywords and 55 category keywords, for a total of 116 keywords.


4.2.2 Extraction and Classification Method Implementation

Input and Output: The input format for this pipeline is the Portable Document Format (PDF), which can be obtained directly from online resources or easily produced by converting other file types such as text files or MS Word documents. PDF is the most popular format used by journals to share research; nevertheless, extracting the actual text from a PDF is difficult because varied layouts cause continuity and ordering errors depending on the placement of blocks of text and images (Ramakrishnan et al., 2012). Several tools have been developed to aid in that process, such as PDF2Text and LA-PDF (Salloum, Al-Emran, et al., 2018). In this case, the PyMuPDF package (McKie & Liu, 2016) for Python 3 was selected to extract text from the PDFs. The output consists of the relevant sentences found within the text. Relevant sentences, which contain a scientific name and a verb or phrase from our trophic related keyword (TRK) collection, are output with an appended classification describing the named organism as belonging to one of five categories: carnivore, herbivore, omnivore, parasite or detritivore. By default the output is printed directly to the terminal, but the user can choose to save it to a file. This file contains three columns: the sentence in the first column, the corresponding classification/descriptor phrase in the second column and the second organism in the sentence used for the final classification in the third column. Please see Table S.2 for an example of an output file.

Implementation Description: The process contains three main steps: pre-processing, processing in the main section of the pipeline, and post-processing, as shown in Figure 4.2. In the pre-processing step, the primary literature file is converted from a PDF to a text file containing the extracted text, and the relations are then extracted from the sentences. The processing step takes the extracted semantic relations, locates and indexes the trophic related keywords within them, and identifies the scientific names they contain. The trophic related keyword, or the second argument in the relation, is used to categorize the first scientific name argument into a final trophic category. Post-processing cleans the labels and writes the located semantic relations with their final classifications to a result file.


Figure 4.2. Flowchart that shows the individual steps in the trophic information analysis pipeline.

Extracting Semantic Relations (Figure 4.2, Steps 1 to 3): The text is extracted from the PDF using the PyMuPDF Python 3 package. Before the PyMuPDF output is written to the text file, it is cleaned (unnecessary characters are removed) and a new line is added after every period that is followed by a space and a capital letter, because Ollie requires each sentence in the text file to be on its own line. The file is then processed using the Ollie open information extraction tool discussed in Chapter 2 (Mausam et al., 2012). Ollie parses the text file and locates semantic relations. The Ollie results are redirected to a second text file, in which each semantic relation is on a new row and begins with a confidence score calculated by a trained Ollie function. Examples of sentences analyzed by Ollie are shown in Table 4.3. One sentence can have multiple relations extracted that overlap and contain similar segments of the sentence. Ollie also captures relations that do not contain a verb and inserts the verb it considers best suited.
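The cleaning and sentence-per-line step can be approximated with a short regex-based sketch. The exact cleaning rules used in the pipeline are not published, so the control-character stripping and the split pattern below are assumptions:

```python
import re

def clean_and_split(raw_text: str) -> str:
    """Approximate the pre-processing step: strip stray control
    characters, collapse whitespace, and start a new line after a period
    followed by a space and a capital letter (Ollie expects one sentence
    per line). Abbreviations like 'O. tristicolor' survive because the
    letter after the period is lowercase."""
    text = re.sub(r"[\x00-\x09\x0b-\x1f]", " ", raw_text)  # drop control chars
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return re.sub(r"\. (?=[A-Z])", ".\n", text)            # one sentence per line

sample = "O. tristicolor will feed on aphids. Photuris females eat Photinus males."
lines = clean_and_split(sample).split("\n")
```

Note that this simple rule would still wrongly split after an abbreviated genus followed by a capitalized epithet; a production cleaner would need more context.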


Table 4.3. Examples of sentences analyzed by Ollie with confidence scores. Ollie returns each relation in three parts: the subject, the relational phrase and the object.

Sentence: Western flower thrips, Frankliniella occidentalis (Pergande) (Thysanoptera: Thripidae), is one of the most significant pests of commercial vegetables, fruits, and ornamental crops worldwide, causing both direct and indirect damage (Lorenzo et al., 2019).
Ollie extractions with confidence scores:
0.768: (Thripidae; is; one of the most significant pests of commercial vegetables, fruits, and ornamental crops worldwide)
0.534: (Thripidae; be the most significant pests of; commercial vegetables)

Sentence: In the laboratory however, O. tristicolor will feed on aphids if no other prey is available (Nyffeler, 1999).
Ollie extractions with confidence scores:
0.846: (O. tristicolor; will feed on; aphids)[enabler=if no other prey is available]
0.576: (O. tristicolor; will feed in; the laboratory)[enabler=if no other prey is available]
0.115: (no other prey; is; available)
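Rows in the format shown in Table 4.3 can be parsed with a small regex-based sketch. This is a simplified reading of the output format; real Ollie output has additional variants (attributions, multiple enablers) that the pattern below does not cover:

```python
import re

# Each row looks like: 0.846: (subject; relation; object)[enabler=...]
ROW = re.compile(r"^(\d\.\d+):\s*\(([^;]+);\s*([^;]+);\s*([^)]+)\)")

def parse_extraction(line: str):
    """Return (confidence, subject, relation, object) or None.

    Simplified sketch: assumes no semicolons inside the three parts and
    ignores any trailing [enabler=...] annotation.
    """
    m = ROW.match(line.strip())
    if not m:
        return None
    confidence, subj, rel, obj = m.groups()
    return float(confidence), subj.strip(), rel.strip(), obj.strip()

parsed = parse_extraction(
    "0.846: (O. tristicolor; will feed on; aphids)"
    "[enabler=if no other prey is available]"
)
```

The confidence score can then be used to rank or filter competing extractions of the same sentence.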

Locating Trophic Keywords (Figure 4.2, Step 4): Each Ollie extraction is retrieved and parsed for trophic related keywords. The collection of trophic related keywords comprises verbs and phrases that relate to or denote the dietary habit of a subject in a sentence, for example, ‘feed’. If a semantic relation does not contain a TRK, it is not carried to the next step.

Locating Scientific Names (Figure 4.2, Step 5): The sentences that contain a TRK are then parsed to locate scientific or common names. Each sentence is tokenized and every word (except the TRK) is tested with the NLTK (Bird et al., 2009) Python 3 library to determine whether it is a noun. Possible two-word scientific names and abbreviated scientific names are located using regular expressions that match Linnaean scientific notation, displayed in Table 4.4. The matches from the regular expressions are then searched for within the large collected database of scientific names and common names. If a potential match is in the dictionary, it acquires the label ‘Scientific Name’. If it is not, each individual word within the match is searched for within the collections of names; if a word is found within the collection that is one character off, it is treated as a match and given the label ‘Scientific Name’. If the individual word is not located in the collection of names, it is tested by the trained Random Forest classifier (discussed in Chapter 3); if the classifier categorizes it as a scientific name, it receives the label ‘Potential Scientific Name’. The collection of scientific names also stores the kingdom designation for each entry, which is used in the final classification stage.
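The "one character off" dictionary match can be sketched as an edit-distance-1 comparison. The thesis does not specify its exact rule, so reading "one character off" as a single substitution, insertion or deletion is an assumption, and the dictionary below is a toy example:

```python
def one_char_off(a: str, b: str) -> bool:
    """True if a and b are identical or differ by one edit
    (substitution, insertion or deletion)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):  # exactly one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)  # one insertion/deletion
    for i in range(len(long_)):
        if long_[:i] + long_[i + 1:] == short:
            return True
    return False

def label_name(word: str, dictionary: set) -> str:
    """Sketch of the lookup cascade: exact match, then fuzzy match;
    unmatched words would next be passed to the Random Forest classifier."""
    if word in dictionary:
        return "Scientific Name"
    if any(one_char_off(word, entry) for entry in dictionary):
        return "Scientific Name"
    return "Unmatched"

names = {"clavata", "paraponera"}
```

Scanning every dictionary entry per word is O(dictionary size); a production version over 3.3 million names would need an index (e.g. deletion variants or a BK-tree).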

Table 4.4. Regular expressions for the forms of scientific names located in text by the pipeline.

Regular Expression Example

[A-z]\. [a-z]{3,} P. clavata

[A-Z][a-z]{4,}: [A-Z][a-z]{4,} Hymenoptera: Formicidae

[A-Z][a-z]{2,} [A-z]{3,} Homo sapiens
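The three patterns in the table above can be exercised directly in Python. The patterns are compiled verbatim from the table; note that `[A-z]` is taken as written, although it also matches the ASCII characters between 'Z' and 'a' (such as '[' and '_'), so `[A-Za-z]` would be a stricter choice:

```python
import re

# The three regular expressions from the table, as written in the thesis.
PATTERNS = [
    re.compile(r"[A-z]\. [a-z]{3,}"),               # abbreviated: P. clavata
    re.compile(r"[A-Z][a-z]{4,}: [A-Z][a-z]{4,}"),  # order: family pairs
    re.compile(r"[A-Z][a-z]{2,} [A-z]{3,}"),        # binomial: Homo sapiens
]

def find_candidates(sentence: str):
    """Collect every substring matching any of the three patterns.
    Candidates would then be checked against the name dictionary."""
    hits = []
    for pattern in PATTERNS:
        hits.extend(pattern.findall(sentence))
    return hits

hits = find_candidates("P. clavata is omnivorous")
```

These patterns deliberately over-match (e.g. any capitalized word pair); the dictionary lookup and classifier steps filter out the false candidates.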

Final Classification (Figure 4.2, Step 6): Once scientific names and TRKs are located, the next step is to classify the trophic relations. The main identifier of the trophic relation is the TRK. All trophic related keywords have been grouped as representing either a left-to-right relation (e.g. “feeds on”), a right-to-left relation (e.g. “consumed by”), a reflexive relation (e.g. “is omnivorous”) or a category-related keyword (e.g. detritivorous). These groups indicate whether the noun on the left is in the diet of the noun on the right or vice versa. The category-related phrases and keywords determine the final classification directly from the TRK. For example, ‘is omnivorous’ is a phrase that is both category related and reflexive, as it classifies a scientific name found before it in a sentence as an omnivore. A keyword that is category related but lacks a specific direction is ‘omnivorous’, which classifies scientific names on either its left or right as omnivores. Parasites and detritivores each have a set of trophic related keywords specific to their category, such as ‘parasitizes’, which acts as a left-to-right verb but is specific to parasitic organisms, and detritivory-related keywords such as “detritus”. For a final classification of herbivore or carnivore, the TRK must be categorized as left-to-right or right-to-left and the noun that is being eaten must have Plantae or Animalia, respectively, as its kingdom designation in the scientific name collection.
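The directional rules above can be sketched as a small decision function. The keyword groupings, the tiny kingdom lookup and the function name are illustrative assumptions standing in for the pipeline's full TRK collection and name dictionary:

```python
# Toy stand-ins for the pipeline's keyword groups and kingdom lookup.
LEFT_TO_RIGHT = {"feeds on", "eats", "preys on"}
RIGHT_TO_LEFT = {"consumed by", "eaten by"}
CATEGORY = {"is omnivorous": "omnivore", "parasitizes": "parasite",
            "detritivorous": "detritivore"}
KINGDOM = {"Acer pseudoplatanus": "Plantae", "Photinus pyralis": "Animalia"}

def classify(arg1: str, trk: str, arg2: str):
    """Return (organism, trophic category) for one extracted relation,
    or None when no rule applies."""
    if trk in CATEGORY:                 # category keyword decides directly
        return arg1, CATEGORY[trk]
    if trk in LEFT_TO_RIGHT:            # arg1 eats arg2
        eater, eaten = arg1, arg2
    elif trk in RIGHT_TO_LEFT:          # arg2 eats arg1
        eater, eaten = arg2, arg1
    else:
        return None
    kingdom = KINGDOM.get(eaten)        # kingdom of the eaten organism
    if kingdom == "Plantae":
        return eater, "herbivore"
    if kingdom == "Animalia":
        return eater, "carnivore"
    return None

result = classify("Drepanosiphum platanoidis", "feeds on", "Acer pseudoplatanus")
```

The example reuses the triplet from Table 4.7: the eaten organism's kingdom (Plantae) yields a herbivore classification for the eater.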

Post-Processing (Figure 4.2, Step 7): The previous steps append additional special characters, such as brackets and labels, to the semantic relation, and these must be removed before the final output; post-processing completes this removal. An option implemented with the argparse Python 3 library (Bethard, 2009) allows a user to create a Comma Separated Values (CSV) file, with a name of their choosing, that contains the output separated into three columns.
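The optional CSV output could be wired up as below. The flag name `--csv` and the helper names are assumptions (the thesis does not list its command-line options); only the three-column layout follows the description above:

```python
import argparse
import csv
import io

def build_parser():
    """Hypothetical CLI: print to stdout by default, or write a CSV
    when the user supplies a filename via --csv."""
    parser = argparse.ArgumentParser(description="Trophic pipeline output")
    parser.add_argument("--csv", dest="csv_name", default=None,
                        help="write results to this CSV file instead of stdout")
    return parser

def write_rows(rows, handle):
    """Write the three-column output described in the text."""
    writer = csv.writer(handle)
    writer.writerow(["Sentence", "Classification", "Second Organism"])
    writer.writerows(rows)

args = build_parser().parse_args(["--csv", "results.csv"])
buffer = io.StringIO()  # stands in for open(args.csv_name, "w", newline="")
write_rows([("Photuris females eat Photinus males.", "Carnivore", "Photinus")],
           buffer)
```

Using the `csv` module (rather than joining strings with commas) keeps sentences containing commas correctly quoted.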

4.2.3 Comparison Tests

In order to evaluate the results produced by the pipeline described above, two tests were performed. The first test evaluated how much the layout of a PDF influences the results. To test this, an ideal PDF document was created by converting a text file that contained all 175 relevant sentences, each on a new line as required by Ollie. It was then analyzed by the pipeline in exactly the same way as the 56 research articles. The second comparison test ran 5 of the 56 research articles through three different versions of the pipeline: the first version removed the cleaning process, the second removed Ollie and the third removed both pre-processing and Ollie. It must be noted that, to test the removal of Ollie, the implementation had to be changed to work with the output of the PDF extraction rather than the Ollie results. The second test allows an evaluation of the influence of two major pipeline sections: the pre-processing step and the open information extraction tool.

4.3 Results

4.3.1 Evaluation Measures

The following measures were used for evaluating the quality of the results obtained in this study. First, we defined the confusion matrix for both the extraction and classification tasks. True positives (TP) in the extraction task were sentences with at least one Ollie relation that contained a TRK from the predefined collection of phrases within a trophic relation specific to a scientific name. False positives (FP) were sentences extracted from the Ollie results that contained a TRK but did not contain relevant trophic information related to a specific scientific name or common name. For extraction, a false negative (FN) is a sentence that contains a TRK and a scientific name(s) but was not extracted by Ollie.

In the classification task, a true positive was a final classification that was correct based on the information in the semantic relation. A false positive for the classification task is a classification that is incorrect based on the category or a classification that refers to an incorrect noun in the sentence. A false negative (FN) is a sentence within the extracted sentences that contains a TRK and a scientific name(s) but does not receive a final classification.

Precision: Precision measures the ability of an algorithm to distinguish positive instances from negative instances, which makes it relevant to this task. Precision values for both the extraction and classification tasks were calculated using Equation 4.1 (Armah et al., 2014).

Precision = 100 × True Positives / (True Positives + False Positives) (Equation 4.1)

Although precision measures the ability to distinguish true positives from false positives, recall also needs to be calculated to show the percentage of all true positives that were successfully retrieved or classified.

Recall: Recall measures how many of the total true positives were retrieved by an algorithm. The recall rate was measured for both extraction and classification using Equation 4.2 (Armah et al., 2014).

Recall = 100 × True Positives / (True Positives + False Negatives) (Equation 4.2)

Recall and precision are each informative on their own, while the F1 score combines them into a single measure.

F1 score: The F1 score combines precision and recall. It was calculated for both tasks using Equation 4.3 (Armah et al., 2014).

F1 Score = (2 × Precision × Recall) / (Precision + Recall) (Equation 4.3)

4.3.2 Extraction Task Results
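The three evaluation measures defined above translate directly into code. The worked values reuse counts reported for the ideal-PDF test later in this chapter (98 relevant sentences out of 150 with extractions, 175 relevant sentences in total); the function names are illustrative:

```python
def precision(tp: int, fp: int) -> float:
    """Equation 4.1: percentage of retrieved items that are correct."""
    return 100 * tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Equation 4.2: percentage of true items that were retrieved."""
    return 100 * tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    """Equation 4.3: harmonic mean of precision and recall
    (both already expressed as percentages)."""
    return 2 * p * r / (p + r)

p = precision(98, 52)  # 150 sentences with extractions, 98 relevant
r = recall(98, 77)     # 175 relevant sentences in the ideal PDF
```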

Precision: The 56 PDF documents used for testing have a variety of layouts and contain a total of 200 relevant trophic semantic relations, defined as a TRK describing the dietary behavior of a scientifically named organism within the sentence. For the extraction task, 18 of the 56 documents had zero relevant sentences extracted. When considering all 56 documents, the average precision achieved was 35%. When calculating the average precision over the 38 PDF documents that had relevant relations extracted, the average precision was 52% (Table 4.5). The medians for the measures are shown in Table 4.6. The minimum and maximum for precision, recall and F1 score across all documents were 0% and 100%, respectively. For the documents that contained relevant information, the minimums for precision, recall and F1 score were 5%, 33% and 9%, respectively. Boxplots for the measures are shown in Figures 4.3 and 4.4.


Table 4.5. Measurement averages broken down by the number of documents included in the calculation. The all documents calculations include documents that had zero relevant information extracted. The thirty-eight documents represent documents that had relevant information extracted.

Documents Average Precision Average Recall Average F1

All Documents (56) 35% 52% 38%

Thirty-eight that contain relevant extractions 52% 77% 57%

Table 4.6. Measurement medians broken down by the number of documents included in the calculation. The all documents calculations include documents that had zero relevant information extracted. The thirty-eight documents represent documents that had relevant information extracted.

Documents Median Precision Median Recall Median F1

All Documents (56) 28% 55% 41%

Thirty-eight that contain relevant extractions 45% 81% 54%

Figure 4.3. Boxplot for the extraction measures of all documents, including documents that had zero relevant information extracted.



Figure 4.4. Boxplot for the extraction measures of the documents that had relevant information extracted.

Extractions can also be analyzed by relation type, as each type has its own challenges. There are two main types, described in Table 4.7. The first is the triplet, in which the keyword establishes directionality: either the organism after the trophic keyword eats the organism before it, or vice versa, and the kingdom of the organism that is eaten determines the final classification. The second relation type is keyword-based, where the keyword itself denotes the final classification; for example, the key phrase “is omnivorous” denotes that whatever scientific name precedes it in a sentence is omnivorous. The sentences in the herbivore and carnivore categories contain both relation types (triplet and keyword-based), the detritivore and parasite categories contain keyword-based relations, and the omnivore category contains a mixture of both types. Viewed by relation type, triplet relations were extracted with an average precision of 51%, meaning that of all the extractions containing a triplet TRK, only 51% were true trophic relations. Keyword-based relations were extracted with an average precision of 83% (Table 4.8), meaning that of all the extractions containing a keyword-based TRK, 83% contained a trophic relation.


Table 4.7. Description of the two relation types output by the pipeline: triplet and keyword-based.

Relation type: Triplet
Description: Contains at least two arguments and a trophic related keyword; argument 1 eats argument 2 or argument 2 eats argument 1. Form: ARG1 – TRK – ARG2
Example: Drepanosiphum platanoidis eats Acer pseudoplatanus.

Relation type: Keyword-based
Description: Contains one argument and a trophic related keyword in either order. Form: ARG1 – TRK or TRK – ARG1
Example: P. clavata is omnivorous. (Larson et al., 2014)

Table 4.8. Measurement results for extraction task based on relation types, triplet and keyword-based.

Relation Type Average Precision Average Recall Average F1

Triplet 51% 43% 46%

Keyword-based 83% 80% 81%

Recall: The average recall when including all 56 PDF documents was 52%. For the 38 documents that had relevant extractions, the average recall was 77%. Each trophic category had a different recall (Table 4.9). Of the five trophic categories, the extraction of herbivore and parasite relations had the highest recall of 67%. The third highest recall rate was for relations that were omnivorous with a recall of 48%. The carnivorous relations were extracted at a recall rate of 44%. The detritivore category was last with a recall of 25%.

Table 4.9. Extraction task recall results broken down by trophic category.

Category Herbivore Carnivore Parasite Omnivore Detritivore

Recall 67% 44% 67% 48% 25%

F1 Score: The results for the extraction task (shown in Table 4.5) had an average F1 score of 38% for the 56 documents, and 57% for the 38 documents that had relevant extracted information. For the triplet relation type, the F1 score was 46%; for the keyword-based type, it was 81% (Table 4.8).
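The precision, recall, and F1 measures reported throughout this section follow the standard definitions, which can be sketched as follows. The TP/FP/FN counts below are hypothetical, chosen only to roughly reproduce the keyword-based row of Table 4.8.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions used for the extraction and classification tasks."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts chosen to roughly reproduce the keyword-based row
# of Table 4.8 (precision 83%, recall 80%, F1 81%).
p, r, f1 = precision_recall_f1(tp=83, fp=17, fn=21)
print(f"P={p:.0%} R={r:.0%} F1={f1:.0%}")
```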

4.3.3 Classification Task Results

For the classification task, only results from the extraction task were used. The classification task was treated as a multiclass problem, with each category representing its own class. It should be remembered that a single sentence can refer to multiple nouns and categories when discussing trophic information. Of the 38 PDF documents that had information extracted, 37 had final classifications. There were 192 classifications in total, of which 128 were accurate: 44 herbivore, 29 carnivore, 14 parasite, 21 omnivore and 20 detritivore. The precision, recall and F1 results are shown in Table 4.10. The herbivore category had the highest precision rate at 80%. The parasite category had the highest recall rate and F1 score, at 74% and 76% respectively.

Table 4.10. Results for classification task broken down by category. There were 192 classifications in total from the 37 documents that contained final classifications.

Category Herbivore Carnivore Parasite Omnivore Detritivore

Precision 80% 55% 78% 74% 54%

Recall 63% 58% 74% 71% 70%

F1 70% 56% 76% 73% 61%
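Since the classification task is multiclass, the per-category values in Table 4.10 can be computed one-vs-rest from (gold, predicted) label pairs. A minimal sketch, using made-up pairs purely for illustration:

```python
from collections import Counter

def per_class_metrics(pairs, labels):
    """One-vs-rest precision and recall per category, computed from
    (gold, predicted) label pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1

    def safe(num, den):
        return num / den if den else 0.0

    return {c: {"precision": safe(tp[c], tp[c] + fp[c]),
                "recall": safe(tp[c], tp[c] + fn[c])} for c in labels}

# Made-up (gold, predicted) pairs, purely for illustration.
pairs = [("herbivore", "herbivore"), ("herbivore", "carnivore"),
         ("carnivore", "carnivore"), ("parasite", "parasite")]
print(per_class_metrics(pairs, ["herbivore", "carnivore", "parasite"]))
```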

4.3.4 Comparison Tests Results

Ideal PDF Test: The Ollie results from the synthetically created PDF were analyzed to see how much information is lost during the text extraction stage from the source PDF document. Of the 175 relevant sentences included in the synthetic PDF, Ollie extracted relations from 150, a 14% false negative rate even though the relevant sentences were each placed on a new line as required by Ollie. Of the 150 sentences that had extractions, 98 had relevant extractions, meaning the sentence had at least one extraction containing a trophic relation (a trophic keyword referring to a scientific name or common name). Based on those 98 sentences, Ollie therefore had a precision rate of 65% on the 150 sentences with extractions (as shown in Table 4.11). Overall, that is a 56% recall rate, as only 98 of the 175 sentences that had a trophic relation were found by Ollie in this ideal document. For the classification process, 126 final classifications were generated, of which 81 were correct, a precision rate of 64% using Equation 4.1. It should be noted that there were several instances of sentences that were cut too short, which would influence Ollie’s ability to recognize relations, although this did not account for every case where Ollie missed a relation. Ollie also sometimes recognized the period in an abbreviated scientific name as the end of a relation. Incorrect classifications were due either to incorrect identification of the relevant noun or to incorrect kingdom identification of the organism being eaten.

Table 4.11. Ideal PDF results broken down by sentence category.

Sentence Category                        Amount
All Sentences                            175
Sentences with Extractions               150
Sentences with Relevant Extractions      98
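One source of the missed sentences noted above is the period in abbreviated scientific names being read as a sentence or relation boundary. A possible pre-processing workaround is to mask those periods before splitting; the sketch below uses a deliberately naive period-based splitter as a stand-in for the real sentence handling, and the placeholder token is an assumption of the sketch.

```python
import re

# Matches the period in an abbreviated genus name, e.g. "P. clavata".
ABBREV_DOT = re.compile(r"\b([A-Z])\.\s+([a-z]+)")

def protect_abbreviations(text):
    """Mask the period of abbreviated genus names so a naive splitter
    does not treat it as a sentence boundary."""
    return ABBREV_DOT.sub(r"\1<DOT> \2", text)

def split_sentences(text):
    # Deliberately naive period-based splitting, standing in for the
    # sentence handling that tripped on abbreviated names.
    protected = protect_abbreviations(text)
    parts = [s.strip() for s in protected.split(".") if s.strip()]
    return [s.replace("<DOT>", ".") for s in parts]

print(split_sentences("P. clavata is omnivorous. It forages at night."))
```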

Optimization Test Results: The optimization results were analyzed using 5 PDFs to see how different steps in the pipeline influence the precision and recall measures (Figure 4.5). Three modified versions of the proposed pipeline were assessed in this study: a pipeline with the pre-processing step removed, a pipeline with Ollie removed, and a pipeline with both pre-processing and Ollie removed. For the extraction task, the precision was 12% without pre-processing and Ollie, 27% without pre-processing alone, 34% without Ollie alone, and 65% with both pre-processing and Ollie, as shown in Table 4.12. The pipeline without Ollie had the highest recall for the extraction task, at 100%. The classification task had different results: the pipeline that lacked both pre-processing and Ollie had the highest precision at 70%, while the pipeline that contained Ollie had the highest recall at 64%, as shown in Table 4.13. The runtimes were estimated using an HP Notebook 14-cf0018ca equipped with a 4-core Intel Core i5-8250U CPU (base frequency of 1.6 GHz, 4 MB cache), 8 GB DDR4-2400 SDRAM, a 256 GB PCIe NVMe M.2 SSD and running Windows 10 (64-bit). Removing Ollie caused a large increase in runtime: the pipeline without Ollie averaged 21 minutes per document. Without pre-processing, the average runtime per document was 6 minutes. With both pre-processing and Ollie, the average runtime was 12 minutes per document, while the pipeline that lacked both pre-processing and Ollie averaged 14 minutes.

38

[Figure: “Precision and Recall for Different Pipeline Versions” — precision (y-axis) versus recall (x-axis) for the four pipeline versions (−PP−OL, −PP+OL, +PP−OL, +PP+OL), plotted separately for the extraction and classification tasks.]

Figure 4.5. Precision and recall graph for the four tested pipeline implementations. “PP” and “OL” denote “pre-processing” and “Ollie”, while the “+” and “−” symbols denote the presence or absence of the two approaches in the pipeline used to produce the results depicted in this figure.

Table 4.12. Precision and recall of extraction task for the four tested pipeline implementations.

Components Average Precision Average Recall

Without Pre-processing and Ollie 12% 35%

Without Pre-processing 27% 32%

Without Ollie 34% 100%

With Pre-processing and Ollie 65% 37%


Table 4.13. Precision and recall of classification task for the four tested pipeline implementations.

Components Average Precision Average Recall

Without Pre-processing and Ollie 70% 32%

Without Pre-processing 60% 14%

Without Ollie 36% 64%

With Pre-processing and Ollie 39% 35%

4.4 Discussion

Extracting and classifying trophic information is time consuming when completed manually, but a combination of rule-based, dictionary-based and machine learning-based methods offers promising results for automating these tasks. This study makes use of such methods and publicly available resources to create a useful text analysis pipeline. The results showcase the viability of automating relevant information extraction to cope with the growing volume of primary literature.

Several relevant findings can be inferred from the results. First, format type is a large barrier to accurate text extraction from scholarly articles. Information was extracted from only 38 of the 56 research articles used to test the pipeline. PDF is one of the most popular formats for research articles, but attempting to extract text from it can often yield no extraction or incoherent data. Even with the large number of existing PDF text extractors, precise text extraction remains difficult for several reasons. These results support previous findings on PDF text extraction that detailed errors such as missing new lines, missing paragraphs and misspelled words across multiple tools (Bast & Korzen, 2017). The ideal PDF test showed that the initial text extraction step reduces the data available to the classification step: only 56% of the relevant data was recalled, and in the main test 18 documents returned no relevant extracted data at all. The observed effect of the PDF format on the recall of relevant data has been noted in multiple automated information extraction pipelines and frameworks (Luca et al., 2019; Mueller & Huettemann, 2018). It can be inferred that information extraction is largely dependent on the PDF layout. Journals may publish their work in layouts that increase visual appeal, but such layouts can also increase the difficulty of extracting relevant information. Alternate document formats incorporate tags that allow for easier extraction of document structures such as headings and paragraphs (Harmata et al., 2017). PDFs are useful for sharing information, but the format can negatively affect the recall and precision of text extraction, as shown by the results of this study and previous studies. Overall, the initial input format has a large influence on the success of text extraction.

Another finding concerns the advantages and disadvantages of the open information extraction tool Ollie. Ollie quickly analyzes extracted text, as shown by the comparison tests that measured running time for multiple versions of the pipeline. The version of the pipeline that did not contain Ollie had an average runtime of 21 minutes per document, 7 minutes slower than the second-slowest version. Based on running time, Ollie was the better option, but the results suggest that the low recall in the extraction task can be partially attributed to using the tool to recognize trophic-specific semantic relations. Ollie was helpful in filtering a large amount of data, but its input format requirements and relation distance limitations are more noticeable on scholarly writing, where sentences can be complex. One error that arose from the adopted formatting style of species names is that periods within abbreviated scientific names were at times recognized by Ollie as the end of a sentence. This suggests that an additional pre-processing step is required, in which all abbreviated species names are replaced by their full spelling variants before being analyzed by Ollie. These results support findings in which the speed of Ollie is highlighted but its disadvantages when working with complex sentences are noted (Tan et al., 2016; Xing et al., 2018; Zouaq et al., 2017). The text analyzed with Ollie originated from the PDF extractor and contained parsing errors, which affected Ollie’s performance. Open information extraction tools are helpful but best used in scenarios where the relations are simple. Generally, the optimization tests showed that a pipeline component can improve one measure while worsening another.

The collection of trophic keywords was key to identifying relevant trophic information in the Ollie results. The pipeline successfully identified trophic relations by finding text containing a co-occurrence of a scientific name and a keyword in a specific pattern. This supports the results of previous studies that used the co-occurrence of a keyword and a specific named entity to identify domain-specific relations (Raja et al., 2013). As currently constructed, the algorithm can only find keywords that exactly match entries in the list. Consequently, relevant sentences can be missed, as keywords appear in many forms, such as “feeds on” and “fed on”, which have the same meaning but different tenses. The list of trophic keywords can be expanded to increase the recall of relevant trophic information. Previous studies have tested machine learning as a method to automate the identification of domain-specific keywords (Y. F. B. Wu et al., 2005). That process is imperfect as well: depending on the method, the breadth of the extracted keywords can be broader or narrower, and algorithm designers must balance how narrow or broad they would like their results to be. Of the two types of keywords used to find a relation, the category-based keywords and phrases were the most accurate in finding relevant trophic information. The category-based keywords that lead to keyword-based relations had a higher precision rate than the triplet-based keywords (83% compared to 51%, shown in Table 4.8) and a higher recall in the extraction task. It can be inferred that triplet-based keywords like “feed on” had a higher false positive rate because they matched sentences that did not contain named entities, while category-based keywords like “is omnivorous” more often led to results that contained a named entity. The keyword list therefore allowed relevant data to be found, but it also narrows the scope of what can be found while still producing false positives.
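One way to relax the exact-match constraint discussed above would be to compare normalized word forms rather than surface strings. The sketch below uses a small lemma table for irregular forms plus crude suffix stripping as a stand-in for a real lemmatizer (such as NLTK's WordNetLemmatizer); the word lists are illustrative assumptions, not part of the pipeline.

```python
# Small lemma table for irregular verb forms plus crude suffix stripping --
# a stand-in for a real lemmatizer such as NLTK's WordNetLemmatizer.
IRREGULAR = {"fed": "feed", "ate": "eat", "eaten": "eat"}

def lemma(word):
    """Normalize a word so that tense variants fold together."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def keyword_in_sentence(keyword, sentence):
    """Match a (possibly multi-word) keyword against a sentence after
    normalizing every word, so "feeds on" also matches "fed on"."""
    kw = [lemma(w) for w in keyword.split()]
    words = [lemma(w) for w in sentence.split()]
    n = len(kw)
    return any(words[i:i + n] == kw for i in range(len(words) - n + 1))

print(keyword_in_sentence("feeds on", "The larvae fed on decaying leaves"))
```

A real implementation would need part-of-speech information to lemmatize reliably; the suffix stripping here is intentionally crude.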

As results from previous steps are used in subsequent steps, missed data and errors can compound. This type of error propagation has been demonstrated in other studies, as it is common to pipeline architectures (S. Miller et al., 2000; Ren et al., 2019; A. Sharma et al., 2016). The classification task was run on the output of the extraction step, classifying information from 37 of the 38 documents that had relevant extracted data. These results must be interpreted with the understanding that any false negatives generated by the extraction step are never seen by the classification step, while any false positives generated by the extraction step may in turn produce false positives in the classification step. The herbivore category contained many triplet-based relations and still achieved a precision of 80% (Table 4.10). The carnivore category, which also contained many triplet-based relations, achieved 55% precision (Table 4.10). The variation in precision for triplet-based classifications is likely due in part to an accurate triplet relation depending on two smaller tasks: the accurate identification of scientific named entities (scientific names and common names) and the accurate identification of the kingdom of the organism being eaten. One method researched in other studies to minimize error propagation is joint learning, which attempts to capture information beneficial to multiple tasks (Li et al., 2017; Yu & Lam, 2010); this results in a more complex pipeline structure and model, which has its own disadvantages. This study also showed the limitations of sentence- and word-level analysis.
Another source of missed classifications was abbreviated scientific names, which can represent several different species, so the correct kingdom designation could not always be found because this pipeline operates at the sentence and word level. This supports the idea discussed in previous studies that the most information is retrieved when documents are analyzed at multiple levels (Callan, 1994). If evidence from the overall document level were included, abbreviated names might be found in their unabbreviated form elsewhere in the document, allowing them to be classified. The overall pipeline structure can lead to error propagation, so the individual tasks must be designed to minimize their inaccuracies.
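The document-level step suggested above could, for example, map each abbreviated name to a full name seen elsewhere in the same document. A naive sketch follows; the name patterns are simplifying assumptions that would over-match in real prose and would need a proper name recognizer.

```python
import re

# Naive patterns for full and abbreviated two-word scientific names; in
# real prose these would over-match and need a proper name recognizer.
FULL_NAME = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")
ABBREV_NAME = re.compile(r"\b([A-Z])\. ([a-z]+)\b")

def expand_abbreviations(document):
    """Replace 'P. clavata' with 'Paraponera clavata' whenever the full
    name occurs somewhere in the same document (document-level evidence)."""
    genus_by_key = {}  # (initial, epithet) -> genus
    for genus, epithet in FULL_NAME.findall(document):
        genus_by_key.setdefault((genus[0], epithet), genus)

    def repl(match):
        genus = genus_by_key.get((match.group(1), match.group(2)))
        return f"{genus} {match.group(2)}" if genus else match.group(0)

    return ABBREV_NAME.sub(repl, document)

doc = "Paraponera clavata was observed foraging. P. clavata is omnivorous."
print(expand_abbreviations(doc))
```

When no full form exists in the document, the abbreviation is left untouched, which preserves the ambiguity rather than guessing.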

These tests investigate the level of success attainable through the combination of several text-based tasks. The proposed pipeline for automated extraction of trophic-specific relations with a final classification, combining dictionary-based, rule-based, and machine learning methods, achieved the goal outlined in Chapter 1. The pipeline allowed very specific information to be extracted that can be buried within scholarly writings and is therefore difficult to locate.


5 Conclusions and Future Work

Multiple pipelines have been designed for text analysis of information. Most studies have focused on broader relation types such as causal relations or generic relations found in text rather than specific types like trophic relationships. Past studies focused only on a few species for which trophic relations were manually extracted as it was the most economic method. However, as more species information becomes available, manual extraction becomes intractable. This study succeeds in filling the gap of trophic focused relation extraction with the addition of final classification of the trophic relationship by using text mining methods and tailoring them to this specific domain.

In Chapter 1, the relevance and necessity of data mining specifically in scholarly literature was explained. Then in Chapter 2 several tasks in the data mining field were reviewed such as Named Entity Recognition, Relation Extraction and Information Extraction. Chapter 3 investigated the use of 15 machine learning models to recognize scientific names through the design of three problems. The first problem (P1) tested the ability of the model to identify two-worded scientific names versus two consecutive English words. The second problem (P2) investigated the ability of the models to identify two-worded scientific names versus combined first and last person names. The third problem (P3) investigated the ability of the machine learning models to distinguish between all three previously mentioned categories; two-worded scientific names, English words and first and last person names. Of the 15 models tested with Weka, two models achieved the highest accuracy in all three problems. The LIBLINEAR classifier outperformed the other models with accuracy values equal to 97.53% (P1), 94.70% (P2) and 91.31% (P3). The Random Forest classifier had the second highest accuracy in all three problems with values equal to 96.88% (P1), 93.55% (P2) and 90.98% (P3). Chapter 4 explored the use of a pipeline framework to automate the extraction of trophic relations and the classification of those trophic relations into a trophic category. A test set of 56 PDFs was curated, of which 38 PDFs had relevant extractions. The 38 PDFs had average precision, recall and F1 scores of 52%, 77%, and 57% respectively for the extraction task. For the classification task, 37 of the 38 PDFs had classifications. The pipeline achieved the highest precision rate of 80% on the herbivore category for the classification task. Comparison tests were completed to see the influence of PDF as a starting input format, and the influence of different pipeline compositions. 
When analyzed by the pipeline, the ideal PDF, a document containing only relevant trophic information, yielded a recall of 56%, demonstrating the influence of PDF as a starting input format. The tests on the various compositions of the pipeline showed the effect of pre-processing and the information extraction routine (Ollie) on the results. For the extraction task, the pipeline composition that contained pre-processing and Ollie achieved the highest average precision rate at 65%, while the composition without Ollie had the highest average recall rate at 100%. For the classification task, the composition without pre-processing and Ollie achieved the highest precision rate at 70%, while the composition without Ollie achieved the highest average recall rate of 64%.

Current automated pipelines deal with broader relation types and broader domains. This study highlighted the use of keywords to narrow the breadth of results and focus on a highly specific type of information. It demonstrated the disadvantages of the commonly used PDF format in text mining, and the usefulness of Ollie for decreasing running time and filtering information, as well as its relation-length limitations when analyzing scientific writing. The strength of the pipeline in this study is its ability to quickly filter through large amounts of irrelevant information to locate trophic information and to use the information found to predict a final classification.

5.1 Future Work

This study has showcased the applicability of text mining methods to automating trophic information extraction from text. It has also demonstrated the strength of these methods when combined in a pipeline framework to achieve the novel goal of trophic information extraction and classification. The pipeline as currently constructed achieves the task discussed in the first chapter and provides actionable data that is potentially useful for studying several domains, such as food webs, biodiversity conservation and monitoring, and the effect of trophic interactions on ecological function. There are several potential directions for extending the current results. The current pipeline focuses on a short list of specific scientific name structures; future work could expand the types and structures of scientific names identified to more generic noun phrases, which would increase the recall of relevant trophic information. As the current pipeline implements Ollie, which had low recall and precision for long-distance relations, precision and recall could be increased by exploring more advanced open information extraction options and by implementing a pipeline component that can detect long-distance relations. At this time, abbreviated scientific names in a sentence are not disambiguated by the pipeline before classification; a direction for future development would be a pre-processing step that locates the expanded form of each abbreviated scientific name in the text before analysis, which would increase the recall of scientific relations. Currently, the results are written to a results file for users. Another direction for future work is using the pipeline to analyze a large number of scholarly articles to autofill a publicly available database with the extracted information and to build a network from the relationships identified with the pipeline. This would provide the academic community with a curated and accessible trophic information database to complement the aggregated traits database TraitBank (Schulz et al., 2016). Furthermore, the pipeline has been released as a Jupyter notebook; its usefulness would increase if it were packaged as a Python library or as a standalone tool, like Ollie. The hope is that this pipeline serves as an alternative to the manual extraction techniques currently used to find similar information.


REFERENCES

Akella, L. M., Norton, C. N., & Miller, H. (2012). NetiNeti: Discovery of scientific names from text using machine learning methods. BMC Bioinformatics, 13(1), 211.

Anderson, A., Mccormack, S., Helden, A., Sheridan, H., Kinsella, A., & Purvis, G. (2011). The potential of parasitoid Hymenoptera as bioindicators of arthropod diversity in agricultural grasslands. Journal of Applied Ecology, 48(2), 382–390.

Hotho, A., Nürnberger, A., & Paaß, G. (2005). A Brief Survey of Text Mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 20(1), 19–62.

Andresen, E. (2005). Effects of season and vegetation type on community organization of dung beetles in a tropical dry forest. Biotropica: The Journal of Biology and Conservation, 37(2), 291–300.

Arin. (2016). arincli -- ARIN Command Line Interface. GitHub. https://github.com/arineng/arincli/blob/master/lib/last-names.txt

Armah, G. K., Luo, G., & Qin, K. (2014). A Deep Analysis of the Precision Formula for Imbalanced Class Distribution. International Journal of Machine Learning and Computing, 4(5), 417–422.

Babbitt, B. (1998). Statement by Secretary of the Interior on invasive alien species. Proceedings, National Weed Symposium, BLM Weed Page, 8–10.

Bairoch, A. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28(1), 45–48.

Baker, M. E., & Barnes, B. V. (1998). Landscape ecosystem diversity of river floodplains in northwestern Lower Michigan, U.S.A. Canadian Journal of Forest Research, 28(9), 1405–1418.

Banko, M., & Etzioni, O. (2008). The tradeoffs between open and traditional relation extraction. ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 28–36.

Bast, H., & Korzen, C. (2017). A Benchmark and Evaluation for Text Extraction from PDF. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 1–10.

Beckerman, A. P., Petchey, O. L., & Warren, P. H. (2006). Foraging biology predicts food web complexity. Proceedings of the National Academy of Sciences of the United States of America, 103(37), 13745–13749.

Benton, T. G., Vickery, J. A., & Wilson, J. D. (2003). Farmland biodiversity: Is habitat heterogeneity the key? Trends in Ecology and Evolution, 18(4), 182–188.

Bethard, S. (2009). argparse - New Command Line Parsing Module. https://www.python.org/dev/peps/pep-0389/

Bhasuran, B., & Natarajan, J. (2018). Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS ONE, 13(7), e0200699.

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O’ Reilly Media Inc.

Blaxter, K., & Robertson, N. (1995). The science and technology of the modern agricultural revolution. In From Dearth to Plenty: The Modern Revolution in Food Production (pp. 39–250). Cambridge University Press.

Bodenreider, O. (2004). The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1), D267–D270.

BugGuide. (2003). Iowa State University Department of Entomology. https://bugguide.net/node/view/15740

Butchart, S. H. M., Walpole, M., Collen, B., Van Strien, A., Scharlemann, J. P. W., Almond, R. E. A., Baillie, J. E. M., Bomhard, B., Brown, C., Bruno, J., Carpenter, K. E., Carr, G. M., Chanson, J., Chenery, A. M., Csirke, J., Davidson, N. C., Dentener, F., Foster, M., Galli, A., … Watson, R. (2010). Global biodiversity: Indicators of recent declines. Science, 328(5982), 1164–1168.

Cachuela-Palacio, M. (2006). Towards an index of all known species: the Catalogue of Life, its rationale, design and use. Integrative Zoology, 1(1), 18–21.

Callan, J. P. (1994). Passage-level evidence in document retrieval. In SIGIR’94, 302–310.

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A. M., Nunes, C., & Jatowt, A. (2018). YAKE! collection-independent automatic keyword extractor. In European Conference on Information Retrieval, 806–810.

Chen, X. (2019). Scholarly Journals’ publication frequency and number of articles in 2018–2019: A study of SCI, SSCI, CSCD, and CSSCI Journals. Publications, 7(3), 58.

Chen, Y. L., Liu, Y. H., & Ho, W. L. (2013). A text mining approach to assist the general public in the retrieval of legal documents. Journal of the American Society for Information Science and Technology, 64(2), 280–290.

Chowdhury, G. G. (2010). Introduction to modern information retrieval. Facet publishing.


Chun, H. W., Tsuruoka, Y., Kim, J. D., Shiba, R., Ata, N. N., Hishiki, T., & Tsujii, J. (2006). Extraction of gene-disease relations from medline using domain dictionaries and machine learning. Proceedings of the Pacific Symposium on Biocomputing 2006, PSB 2006, 4–15.

Clavero, M., Brotons, L., Pons, P., & Sol, D. (2009). Prominent role of invasive species in avian biodiversity loss. Biological Conservation, 142(10), 2043–2049.

Collen, B., Loh, J., Whitmee, S., McRae, L., Amin, R., & Baillie, J. E. M. (2009). Monitoring Change in Vertebrate Abundance: the Living Planet Index. Conservation Biology, 23(2), 317–327.

Doherty, T. S., Glen, A. S., Nimmo, D. G., Ritchie, E. G., & Dickman, C. R. (2016). Invasive predators and global biodiversity loss. Proceedings of the National Academy of Sciences, 113(40), 11261–11265.

Eisner, T., Goetz, M. A., Hill, D. E., Smedley, S. R., & Meinwald, J. (1997). Firefly “femmes fatales” acquire defensive steroids (lucibufagins) from their firefly prey. Proceedings of the National Academy of Sciences of the United States of America, 94(18), 9723–9728.

Entomological Society of America. (1889). https://www.entsoc.org/

Estrada, A., & Coates-Estrada, R. (2002). Dung beetles in continuous forest, forest fragments and in an agricultural mosaic habitat island at Los Tuxtlas, Mexico. Biodiversity and Conservation, 11(11), 1903–1918.

Etzioni, O., Fader, A., Christensen, J., Soderland, S., & Mausam. (2011). Open information extraction: The second generation. IJCAI International Joint Conference on Artificial Intelligence.

Eyre, T. A., Ducluzeau, F., Sneddon, T. P., Povey, S., Bruford, E. A., & Lush, M. J. (2006). The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Research, 34(suppl_1), D319–D321.

Fader, A., Soderland, S., & Etzioni, O. (2011). Identifying relations for Open Information Extraction. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 1535–1545.

Feer, F., & Hingrat, Y. (2005). Effects of forest fragmentation on a dung community in French Guiana. Conservation Biology, 19(4), 1103–1112.

Fortuna, B., Mladenič, D., & Grobelnik, M. (2005). Semi-automatic construction of topic ontologies. In Semantics, Web and Mining, 121–131.

Foster, W. A., Snaddon, J. L., Turner, E. C., Fayle, T. M., Cockerill, T. D., Farnon Ellwood, M. D., Broad, G. R., Chung, A. Y. C., Eggleton, P., Khen, C. V., & Yusah, K. M. (2011). Establishing the evidence base for maintaining biodiversity and ecosystem function in the oil palm landscapes of South East Asia. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1582), 3277–3291.

Firake, D. M., & Behere, G. T. (2020). Natural mortality of invasive fall armyworm, Spodoptera frugiperda (J. E. Smith) (Lepidoptera: Noctuidae) in maize agroecosystems of northeast India. Biological Control, 148, 104303.

Gerner, M., Nenadic, G., & Bergman, C. M. (2010). LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics, 11(1), 85.

Gregory, R. (2006). Birds as biodiversity indicators for Europe. Significance, 3(3), 106–110.

Grice, A. C. (2006). The impacts of invasive plant species on the biodiversity of Australian rangelands. Rangeland Journal, 28(1), 27–35.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software. ACM SIGKDD Explorations Newsletter, 11(1), 10.

Hallmann, C. A., Sorg, M., Jongejans, E., Siepel, H., Hofland, N., Schwan, H., Stenmans, W., Müller, A., Sumser, H., Hörren, T., Goulson, D., & De Kroon, H. (2017). More than 75 percent decline over 27 years in total flying insect biomass in protected areas. PLoS ONE, 12(10), e0185809.

Harmata, S., Hofer-Schmitz, K., Nguyen, P. H., Quix, C., & Bakiu, B. (2017). Layout-aware semi-automatic information extraction for pharmaceutical documents. In International Conference on Data Integration in the Life Sciences, 71–85.

Herzog, F., Jeanneret, P., Ammari, Y., Angelova, S., Arndorfer, M., Bailey, D., Balázs, K., Báldi, A., Bogers, M., Bunce, R. G. H., Choisis, J.-P., Cuming, D., Dennis, P., Dyman, T., Eiter, S., Elek, Z., Falusi, E., Fjellstad, W., Frank, T., … Zanetti, T. (2013). Measuring farmland biodiversity. Solutions.

Ittoo, A., Nguyen, L. M., & Van Den Bosch, A. (2016). Text analytics in industry: Challenges, desiderata and trends. Computers in Industry, 78, 96–107.

Jiang, J. (2012). Information extraction from text. In Mining Text Data (pp. 11–41). Springer.

Jinha, A. (2010). Article 50 million: An estimate of the number of scholarly articles in existence. Learned Publishing, 23(3), 258–263.

Kaminuma, E., Mashima, J., Kodama, Y., Gojobori, T., Ogasawara, O., Okubo, K., Takagi, T., & Nakamura, Y. (2010). DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Research, 38(suppl_1), D33–D38.

Kannan, S., Gurusamy, V., Vijayarani, S., Ilamathi, J., Nithya, M., Kannan, S., & Gurusamy, V. (2014). Preprocessing Techniques for Text Mining. International Journal of Computer Science & Communication Networks, 5(1), 7–16.

Kaushik, N., & Chatterjee, N. (2018). Automatic relationship extraction from agricultural text for ontology construction. Information Processing in Agriculture, 5(1), 60–73.

Keselj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for authorship attribution. Proceedings of the Conference Pacific Association for Computational Linguistics (PACLING), 255–264.

Khordad, M., & Mercer, R. E. (2017). Identifying genotype-phenotype relationships in biomedical text. Journal of Biomedical Semantics, 8(1), 57.

Koning, D., Sarkar, I. N., & Moritz, T. (2005). TaxonGrab: Extracting Taxonomic Names From Text. Biodiversity Informatics, 2, 79–82.

Krallinger, M., Morgan, A., Smith, L., Leitner, F., Tanabe, L., Wilbur, J., Hirschman, L., & Valencia, A. (2008). Evaluation of text-mining systems for biology: Overview of the Second BioCreative community challenge. Genome Biology, 9(2), 1–9.

Krallinger, M., Rodriguez-Penagos, C., Tendulkar, A., & Valencia, A. (2009). PLAN2L: A web tool for integrated text mining and literature-based bioentity relation extraction. Nucleic Acids Research, 37(suppl_2), W160–W165.

Lai, L. C., Chiu, M. C., Tsai, C. W., & Wu, W. J. (2018). Composition of harvested seeds and seed selection by the invasive tropical fire ant, Solenopsis geminata (Hymenoptera: Formicidae) in Taiwan. Arthropod-Plant Interactions, 12(4), 623–632.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016 - Proceedings of the Conference, arXiv preprint arXiv:1603.01360.

Larson, H. K., Goffredi, S. K., Parra, E. L., Vargas, O., Pinto-Tomas, A. A., & McGlynn, T. P. (2014). Distribution and dietary regulation of an associated facultative Rhizobiales-related bacterium in the omnivorous giant tropical ant, Paraponera clavata. Naturwissenschaften, 101(5), 397–406.

Leaman, R., & Gonzalez, G. (2008). BANNER: An executable survey of advances in biomedical named entity recognition. Pacific Symposium on Biocomputing 2008, PSB 2008, 652–663.

Leary, P. (2014). taxonfinder. GitHub. https://github.com/pleary/node-taxonfinder

Leary, P. R., Remsen, D. P., Norton, C. N., Patterson, D. J., & Sarkar, I. N. (2007). uBioRSS: Tracking taxonomic literature using RSS. Bioinformatics, 23(11), 1434–1436.

Lee, W., McGlone, M., & Wright, E. (2005). Biodiversity Inventory and Monitoring: a review of national and international systems and a proposed framework for future biodiversity monitoring by the Department of Conservation. Landcare Research Contract Report, LC0405/122.

Leng, J., & Jiang, P. (2016). A deep learning approach for relationship extraction from interaction context in social manufacturing paradigm. Knowledge-Based Systems, 100, 188–199.

Li, F., Zhang, M., Fu, G., & Ji, D. (2017). A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics, 18(1), 1–11.

Lindenmayer, D. B., Gibbons, P., Bourke, M., Burgman, M., Dickman, C. R., Ferrier, S., Fitzsimons, J., Freudenberger, D., Garnett, S. T., Groves, C., Hobbs, R. J., Kingsford, R. T., Krebs, C., Legge, S., Lowe, A. J., Mclean, R., Montambault, J., Possingham, H., Radford, J., … Zerger, A. (2012). Improving biodiversity monitoring. Austral Ecology, 37(3), 285–294.

Lindenmayer, D. B., & Likens, G. E. (2010). The science and application of ecological monitoring. Biological Conservation, 143(6), 1317–1328.

Lister, B. C., & Garcia, A. (2018). Climate-driven declines in arthropod abundance restructure a rainforest food web. Proceedings of the National Academy of Sciences of the United States of America, 115(44), E10397–E10406.

Liu, R., Zhu, F., An, H., & Steinberger, Y. (2014). Effect of naturally vs manually managed restoration on ground-dwelling arthropod communities in a desertified region. Ecological Engineering, 73, 545–552.

Lorenzo, M. E., Bao, L., Mendez, L., Grille, G., Bonato, O., & Basso, C. (2019). Effect of Two Oviposition Feeding Substrates on Orius insidiosus and Orius tristicolor (Hemiptera: Anthocoridae). Florida Entomologist, 102(2), 395–402.

Lu, X. (2018). Natural Language Processing and Intelligent Computer-Assisted Language Learning (ICALL). In The TESOL Encyclopedia of English Language Teaching (pp. 1–6).

Luca, F., Thaer, D., Akira, S., & Masashi, I. (2019). Proposal for Automatic Extraction Framework of Superconductors Related Information from Scientific Literature. IEICE Technical Committee on Service Computing, 43.

Maleque, M. A., Maeto, K., & Ishii, H. T. (2009). Arthropods as bioindicators of sustainable forest management, with a focus on plantation forests. Applied Entomology and Zoology, 44(1), 1–11.

Mansouri, A., Affendey, L. S., & Mamat, A. (2008). Named Entity Recognition Approaches. International Journal of Computer Science and Network Security, 8(2), 339–344.

Mattoni, R., Longcore, T., & Novotny, V. (2000). Arthropod monitoring for fine-scale habitat analysis: A case study of the El Segundo sand dunes. Environmental Management, 25(4), 445–452.

Mausam, Schmitz, M., Bart, R., Soderland, S., & Etzioni, O. (2012). Open language learning for information extraction. EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference, 523–534.

McCarty, L. S., Power, M., & Munkittrick, K. R. (2002). Bioindicators versus biomarkers in ecological risk assessment. Human and Ecological Risk Assessment, 8(1), 159–164.

McKie, J. X., & Liu, R. (2016). PyMuPDF. GitHub. https://github.com/pymupdf/PyMuPDF

McNeely, J. (2001). Invasive species: a costly catastrophe for native biodiversity. Land Use and Water Resources Research, 1(1732-2016–140260).

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 3111–3119.

Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41.

Miller, S., Fox, H. J., Ramshaw, L., & Weischedel, R. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics.

Mohit, B. (2014). Named Entity Recognition. In Natural Language Processing of Semitic Languages (pp. 221–245). Springer.

Molnar, J. L., Gamboa, R. L., Revenga, C., & Spalding, M. D. (2008). Assessing the global threat of invasive species to marine biodiversity. Frontiers in Ecology and the Environment, 6(9), 485–492.

Moonen, A. C., & Bàrberi, P. (2008). Functional biodiversity: An agroecosystem approach. Agriculture, Ecosystems and Environment, 127(1–2), 7–21.

Mooney, R. J., & Bunescu, R. (2005). Mining knowledge from text using information extraction. ACM SIGKDD Explorations Newsletter, 7(1), 3–10.

Mueller, R. M., & Huettemann, S. (2018). Extracting Causal Claims from Information Systems Papers with Natural Language Processing for Theory Ontology Learning. Proceedings of the 51st Hawaii International Conference on System Sciences.

Nawab, R. M. A., Stevenson, M., & Clough, P. (2012). Detecting Text Reuse with Modified and Weighted N-grams. Proceedings of the Sixth International Workshop on Semantic Evaluation, 54–58.

Naderi, N., Kappler, T., Baker, C. J. O., & Witte, R. (2011). OrganismTagger: Detection, normalization and grounding of organism entities in biomedical documents. Bioinformatics, 27(19), 2721–2729.

Naidu, R., Bharti, S. K., Babu, K. S., & Mohapatra, R. K. (2018). Text summarization with automatic keyword extraction in Telugu e-newspapers. In Smart Innovation, Systems and Technologies (pp. 555–564). Springer.

National Library of Medicine. (2018). MEDLINE®: Description of the Database. U.S. National Library of Medicine.

Nguyen, N. T., Gabud, R. S., & Ananiadou, S. (2019). COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal, 7.

Niedrist, G., Tasser, E., Lüth, C., Dalla Via, J., & Tappeiner, U. (2009). Plant diversity declines with recent land use changes in European Alps. Plant Ecology, 202(2), 195.

Nielsen, S. E., Haughland, D. L., Bayne, E., & Schieck, J. (2009). Capacity of large-scale, long-term biodiversity monitoring programmes to detect trends in species prevalence. Biodiversity and Conservation, 18(11), 2961–2978.

Niemelä, J. (2000). Biodiversity monitoring for decision-making. Annales Zoologici Fennici, 37(4), 307–317.

Norris, R. F., & Kogan, M. (2005). Ecology of interactions between weeds and arthropods. Annual Review of Entomology, 50, 479–503.

Nyffeler, M. (1999). Prey selection of spiders in the field. Journal of Arachnology, 317–324.

O’Leary, N. A., Wright, M. W., Brister, J. R., Ciufo, S., Haddad, D., McVeigh, R., Rajput, B., Robbertse, B., Smith-White, B., Ako-Adjei, D., Astashyn, A., Badretdin, A., Bao, Y., Blinkova, O., Brover, V., Chetvernin, V., Choi, J., Cox, E., Ermolaeva, O., … Pruitt, K. D. (2016). Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1), D733–D745.

Okazaki, N., Ananiadou, S., & Tsujii, J. (2010). Building a high-quality sense inventory for improved abbreviation disambiguation. Bioinformatics, 26(9), 1246–1253.

Pafilis, E., Frankild, S. P., Fanini, L., Faulwetter, S., Pavloudi, C., Vasileiadou, A., Arvanitidis, C., & Jensen, L. J. (2013). The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE, 8(6), e65390.

Paini, D. R., Sheppard, A. W., Cook, D. C., De Barro, P. J., Worner, S. P., & Thomas, M. B. (2016). Global threat to agriculture from invasive species. Proceedings of the National Academy of Sciences of the United States of America, 113(27), 7575–7579.

Paoletti, M. G., Dunxiao, H., Marc, P., Ningxing, H., Wenliang, W., Chunru, H., Jiahai, H., & Liewan, C. (1999). Arthropods as bioindicators in agroecosystems of Jiang Han Plain, Qianjiang City, Hubei China. Critical Reviews in Plant Sciences, 18(3), 457–465.

Perfecto, I., Vandermeer, J., Hanson, P., & Cartín, V. (1997). Arthropod biodiversity loss and the transformation of a tropical agro-ecosystem. Biodiversity and Conservation, 6(7), 935–945.

Pimentel, D., Zuniga, R., & Morrison, D. (2005). Update on the environmental and economic costs associated with alien-invasive species in the United States. Ecological Economics, 52(3), 273–288.

Pruitt, K. D. (2001). RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Research, 29(1), 137–140.

Pyle, R. L. (2016). Towards a global names architecture: The future of indexing scientific names. ZooKeys, 550, 261.

Raja, K., Subramani, S., & Natarajan, J. (2013). PPInterFinder - A mining tool for extracting causal relations on human proteins from literature. Database.

Ramakrishnan, C., Patnia, A., Hovy, E., & Burns, G. A. P. C. (2012). Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7(1), 7.

Rebholz-Schuhmann, D., Arregui, M., Gaudan, S., Kirsch, H., & Jimeno, A. (2008). Text processing through web services: Calling Whatizit. Bioinformatics, 24(2), 296–298.

Ren, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). Almost unsupervised text to speech and automatic speech recognition. 36th International Conference on Machine Learning, ICML 2019, arXiv preprint arXiv:1905.06791.

Rinaldi, F., Clematide, S., Marques, H., Ellendorff, T., Romacker, M., & Rodriguez-Esteban, R. (2014). OntoGene web services for biomedical text mining. BMC Bioinformatics, 15(14), 1–10.

Rocktäschel, T., Weidlich, M., & Leser, U. (2012). Chemspot: A hybrid system for chemical named entity recognition. Bioinformatics, 28(12), 1633–1640.

Roth, D. S., Perfecto, I., & Rathcke, B. (1994). The effects of management systems on ground-foraging ant diversity in Costa Rica. Ecological Applications, 4(3), 423–436.

Sakai, A. K., Allendorf, F. W., Holt, J. S., Lodge, D. M., Molofsky, J., With, K. A., Baughman, S., Cabin, R. J., Cohen, J. E., Ellstrand, N. C., McCauley, D. E., O’Neil, P., Parker, I. M., Thompson, J. N., & Weller, S. G. (2001). The population biology of invasive species. Annual Review of Ecology and Systematics, 32(1), 305–332.

Salloum, S. A., Al-Emran, M., Monem, A. A., & Shaalan, K. (2018). Using text mining techniques for extracting information from research articles. In Studies in Computational Intelligence (pp. 373–397). Springer.

Salloum, S. A., AlHamad, A. Q., Al-Emran, M., & Shaalan, K. (2018). A Survey of Arabic Text Mining. In Intelligent Natural Language Processing: Trends and Applications (pp. 417–431).

Sanders, C. J., Mellor, R. S., & Wilson, A. J. (2010). Invasive arthropods. OIE Revue Scientifique et Technique, 29(2), 273.

Sanderson, M. A., Skinner, R. H., Barker, D. J., Edwards, G. R., Tracy, B. F., & Wedin, D. A. (2004). Plant species diversity and management of temperate forage and grazing land ecosystems. Crop Science, 44(4), 1132–1144.

Sarawagi, S. (2008). Information extraction. In Foundations and Trends in Databases. Now Publishers Inc.

Sautter, G., Böhm, K., & Agosti, D. (2006). A combining approach to find all taxon names (FAT). Biodiversity Informatics, 3.

Schulz, K., Hammock, J., & Miller, S. (2016). TraitBank: An open digital repository for organism traits. In 2016 International Congress of Entomology.

Seibold, S., Gossner, M. M., Simons, N. K., Blüthgen, N., Müller, J., Ambarlı, D., Ammer, C., Bauhus, J., Fischer, M., Habel, J. C., Linsenmair, K. E., Nauss, T., Penone, C., Prati, D., Schall, P., Schulze, E. D., Vogt, J., Wöllauer, S., & Weisser, W. W. (2019). Arthropod decline in grasslands and forests is associated with landscape-level drivers. Nature, 574(7780), 671–674.

Sharma, A., Gupta, S., Motlani, R., Bansal, P., Shrivastava, M., Mamidi, R., & Sharma, D. M. (2016). Shallow parsing pipeline for Hindi-English code-mixed social media text.

Sharma, V., Restrepo, M. I., & Sarkar, I. N. (2019). Solr-Plant: Efficient extraction of plant names from text. BMC Bioinformatics, 20(1), 263.

Soliveres, S., Van Der Plas, F., Manning, P., Prati, D., Gossner, M. M., Renner, S. C., Alt, F., Arndt, H., Baumgartner, V., Binkenstein, J., Birkhofer, K., Blaser, S., Blüthgen, N., Boch, S., Böhm, S., Börschig, C., Buscot, F., Diekötter, T., Heinze, J., … Allan, E. (2016). Biodiversity at multiple trophic levels is needed for ecosystem multifunctionality. Nature, 536(7617), 456–459.

Spasic, I., Ananiadou, S., McNaught, J., & Kumar, A. (2005). Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, 6(3), 239–251.

Tan, S. S., Lim, T. Y., Soon, L. K., & Tang, E. K. (2016). Learning to extract domain-specific relations from complex sentences. Expert Systems with Applications, 60, 107–117.

Tarr, D. (2015). random-name. GitHub. https://github.com/dominictarr/random-name/blob/master/first-names.txt

The universal biological indexing and organization system. (n.d.). Retrieved August 4, 2020, from http://ubio.org/

Thompson, R. M., Brose, U., Dunne, J. A., Hall, R. O., Hladyz, S., Kitching, R. L., Martinez, N. D., Rantala, H., Romanuk, T. N., Stouffer, D. B., & Tylianakis, J. M. (2012). Food webs: Reconciling the structure and function of biodiversity. Trends in Ecology and Evolution, 27(12), 689–697.

Thomsen, P. F., & Willerslev, E. (2015). Environmental DNA - An emerging tool in conservation for monitoring past and present biodiversity. Biological Conservation, 183, 4–18.

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., & Ristenpart, T. (2016). Stealing machine learning models via prediction APIs. Proceedings of the 25th USENIX Security Symposium, 601–618.

Tunali, V., & Bilgin, T. T. (2012). PRETO: A high-performance text mining tool for preprocessing Turkish texts. ACM International Conference Proceeding Series, 134–140.

Underwood, E. C., & Fisher, B. L. (2006). The role of ants in conservation monitoring: If, when, and how. Biological Conservation, 132(2), 166–182.

USDA, N. (2020). The PLANTS Database. In National Plant Data Team. http://plants.usda.gov

Van Nuland, M. E., & Whitlow, W. L. (2014). Temporal effects on biodiversity and composition of arthropod communities along an urban–rural gradient. Urban Ecosystems, 17(4), 1047–1060.

Vasconcelos, H. L. (1999). Effects of forest disturbance on the structure of ground-foraging ant communities in central Amazonia. Biodiversity and Conservation, 8(3), 407–418.

Vellend, M. (2006). The consequences of genetic diversity in competitive communities. Ecology, 87(2), 304–311.

Walker, B. H. (1992). Biodiversity and Ecological Redundancy. Conservation Biology, 6(1), 18–23.

Weeks, J. A. (2003). Parasitism and ant protection alter the survival of the lycaenid Hemiargus isola. Ecological Entomology, 28(2), 228–232.

Whelan, C., Tomback, D., Kelly, D., & Johnson, M. D. (2016). Trophic interaction networks and ecosystem services. In Why Birds Matter: Avian Ecological Function and Ecosystem Services (pp. 49–72). University of Chicago Press.

Williams, K. S. (1993). Use of Terrestrial Arthropods to Evaluate Restored Riparian Woodlands. Restoration Ecology, 1(2), 107–116.

Wu, F., & Weld, D. S. (2010). Open information extraction using Wikipedia. ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 118–127.

Wu, Y. F. B., Li, Q., Bot, R. S., & Chen, X. (2005). Domain-specific keyphrase extraction. International Conference on Information and Knowledge Management, Proceedings, 283–284.

Xing, W., Qi, J., Yuan, X., Li, L., Zhang, X., Fu, Y., Xiong, S., Hu, L., & Peng, J. (2018). A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach. Bioinformatics, 34(13), i386–i394.

Yan, X., Zhenyu, L., Gregg, W. P., & Dianmo, L. (2001). Invasive species in China - An overview. Biodiversity and Conservation, 10(8), 1317–1341.

Yates, A., Cafarella, M., Banko, M., Etzioni, O., Broadhead, M., & Soderland, S. (2007). TextRunner: Open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 25–26.

Yu, X., & Lam, W. (2010). Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 1399–1407.

Zouaq, A., Gagnon, M., & Jean-Louis, L. (2017). An assessment of open relation extraction systems for the semantic web. Information Systems, 71, 228–239.


APPENDICES

S.1. Research Articles Used in Test Set

Table S.1. The total list of the titles of the 56 articles that were included in the test set for the extraction and classification tasks.

No. Citations of Tested Articles

1. Amundrud, S. L., & Srivastava, D. S. (2020). Thermal tolerances and species interactions determine the elevational distributions of insects. Global Ecology and Biogeography.

2. Ariasari, A., Helmiati, S., & Setyobudi, E. (2018). Food preference of red devil (Amphilophus labiatus) in the Sermo Reservoir, Kulon Progo Regency. IOP Conference Series: Earth and Environmental Science.

3. Awmack, C. S., & Leather, S. R. (2002). Host plant quality and fecundity in herbivorous insects. Annual Review of Entomology, 47(1), 817–844.

4. Baba, Y. G., Watari, Y., Nishi, M., & Sasaki, T. (2019). Notes on the feeding habits of the Okinawan fishing spider, Dolomedes orion (Araneae: Pisauridae), in the southwestern islands of Japan. Journal of Arachnology, 47(1), 154–158.

5. Babangenge, G. B., Jocqué, R., Masudi, F. M., Rödel, M. O., Burger, M., Gvoždík, V., & Pauwels, O. S. G. (2019). Frog-eating spiders in the Afrotropics: An analysis of published and new cases. Bulletin of the Chicago Herpetological Society, 54(3), 57–63.

6. Bara, J. J., Clark, T. M., & Remold, S. K. (2014). Utilization of larval and pupal detritus by Aedes aegypti and Aedes albopictus. Journal of Vector Ecology, 39(1), 44–47.

7. Brandão, C. R. F., Diniz, J. L. M., & Tomotake, E. M. (1991). Thaumatomyrmex strips millipedes for prey: a novel predatory behaviour in ants, and the first case of sympatry in the genus (Hymenoptera: Formicidae). Insectes Sociaux, 38(4), 335–344.

8. Brown, B. V., & Morrison, L. W. (1999). New Pseudacteon (Diptera: Phoridae) from North America That Parasitizes the Native Fire Ant Solenopsis geminata (Hymenoptera: Formicidae). Annals of the Entomological Society of America, 92(3), 308–311.

9. Dadebo, E., Aemro, D., & Tekle-Giorgis, Y. (2014). Food and feeding habits of the African catfish Clarias gariepinus (Burchell, 1822) (Pisces: Clariidae) in Lake Koka, Ethiopia. African Journal of Ecology, 52(4), 471–478.

10. Deepak, C. K., & Ghosh, D. (2018). Insecta: Dermaptera. In Faunal Diversity of Indian Himalaya, Insecta: Dermaptera.

11. Egglishaw, H. J. (1964). The Distributional Relationship between the Bottom Fauna and Plant Detritus in Streams. The Journal of Animal Ecology, 463–476.

12. Eisner, T., Goetz, M. A., Hill, D. E., Smedley, S. R., & Meinwald, J. (1997). Firefly “femmes fatales” acquire defensive steroids (lucibufagins) from their firefly prey. Proceedings of the National Academy of Sciences of the United States of America, 94(18), 9723–9728.

13. Falci Theza Rodrigues, L., Jabour Vescovi Rosa, B. F., Lobo, H., Campos Divino, A., & Da Gama Alves, R. (2015). Diversity and distribution of oligochaetes in tropical forested streams, southeastern Brazil. Journal of Limnology, 74(3).

14. Falqueto, S. A., Fernando, V.-D.-M. V., & Schoereder, J. H. (2005). Are fungivorous less specialist? Ecología Austral, 15(01), 017–022.

15. Fougeyrollas, R., Dolejšová, K., Křivánek, J., Sillam-Dussès, D., Roisin, Y., Hanus, R., & Roy, V. (2018). Dispersal and mating strategies in two neotropical soil-feeding termites, Embiratermes neotenicus and Silvestritermes minutus (Termitidae, Syntermitinae). Insectes Sociaux, 65(2), 251–262.

16. Hall, C. R., Dagg, V., Waterman, J. M., & Johnson, S. N. (2020). Silicon alters leaf surface morphology and suppresses insect herbivory in a model grass species. Plants, 9(5), 643.

17. Hanna, W., & Ruter, J. (2006). Pennisetum purpureum plant named ’Prince’ (Patent No. 11/151,586).

18. Hereward, H. F. R., Gentle, L. K., Ray, N. D., & Sluka, R. D. (2017). Ghost crab burrow density at Watamu Marine National Park: An indicator of the impact of urbanisation and associated disturbance? African Journal of Marine Science, 39(1), 129–133.

19. Ito, T. (2005). Effect of carnivory on larvae and adults of a detritivorous caddisfly, Lepidostoma complicatum: a laboratory experiment. Limnology, 6(2), 73–78.

20. Jørgensen, H. B., & Toft, S. (1997). Role of granivory and insectivory in the life cycle of the carabid beetle Amara similata. Ecological Entomology, 22(1), 7–15.

21. Koleska, D., Vrabec, V., & Kulma, M. (2017). Teira dugesii (Sauria: Lacertidae)-high aggregation. Herpetological Bulletin, 139, 31.

22. Kuusk, A. K., & Ekbom, B. (2012). Feeding habits of lycosid spiders in field habitats. Journal of Pest Science, 85(2), 253–260.

23. Lai, L. C., Chiu, M. C., Tsai, C. W., & Wu, W. J. (2018). Composition of harvested seeds and seed selection by the invasive tropical fire ant, Solenopsis geminata (Hymenoptera: Formicidae) in Taiwan. Arthropod-Plant Interactions, 12(4), 623–632.

24. Larson, H. K., Goffredi, S. K., Parra, E. L., Vargas, O., Pinto-Tomas, A. A., & McGlynn, T. P. (2014). Distribution and dietary regulation of an associated facultative Rhizobiales-related bacterium in the omnivorous giant tropical ant, Paraponera clavata. Naturwissenschaften, 101(5), 397–406.

25. Londt, J. (1995). Afrotropical Asilidae (Diptera) 27: Predation of Asilidae by Asilidae. Annals-Natal Museum, 36(1), 161–167.

26. Lorenzo, M. E., Bao, L., Mendez, L., Grille, G., Bonato, O., & Basso, C. (2019). Effect of Two Oviposition Feeding Substrates on Orius insidiosus and Orius tristicolor (Hemiptera: Anthocoridae). Florida Entomologist, 102(2), 395–402.

27. Mikani, A. (2016). Tachykinin stimulation effects on α-amylase, protease and lipase activities in midgut of American cockroach, Periplaneta americana (Blattodea: Blattidae). Journal of Entomological Society of Iran, 36, 81–88.

28. Motta, R. L., & Uieda, V. S. (2004). Diet and trophic groups of an aquatic insect community in a tropical stream. Brazilian Journal of Biology, 64(4), 809–817.

29. Naegle, M. A., Mugleston, J. D., Bybee, S. M., & Whiting, M. F. (2016). Reassessing the phylogenetic position of the epizoic earwigs (Insecta: Dermaptera). Molecular Phylogenetics and Evolution, 100, 382–390.

30. Nell, H. W., Lelie, L. A. S. der, Woets, J., & Lenteren, J. C. van. (1976). The parasite-host relationship between Encarsia formosa (Hymenoptera: Aphelinidae) and Trialeurodes vaporariorum (Homoptera: Aleyrodidae): II. Selection of host stages for oviposition and feeding by the parasite. Zeitschrift Für Angewandte Entomologie, 89(1–5), 442–454.

31. Nyffeler, M. (1999). Prey selection of spiders in the field. Journal of Arachnology, 317–324.

32. Ocampo, F. C., & Philips, T. K. (2017). Food relocation and nesting behavior of the Argentinian dung beetle genus Eucranium and comparison with the southwest African Scarabaeus (Pachysoma) (Coleoptera: Scarabaeidae). Revista de La Sociedad Entomológica Argentina, 64(1–2).

33. Pacek, S., Seniczak, A., Graczyk, R., Chachaj, B., & Waldon-Rudzionek, B. (2020). The effect of grazing by geese, goats, and fallow deer on soil mites (Acari). Turkish Journal of Zoology, 44(3), 254–265.

34. Pogoreutz, C., & Ahnelt, H. (2014). Gut morphology and relative gut length do not reliably reflect trophic level in gobiids: A comparison of four species from a tropical Indo-Pacific seagrass bed. Journal of Applied Ichthyology, 30(2), 408–410.

35. Qashqaei, A. T., & Ahmadzadeh, F. (2015). Dietary records of yellow-headed agama in Hormozgan Province, Iran. Russian Journal of Herpetology, 22(4), 315–317.

36. Robbins, R., & Aiello, A. (1982). Foodplant and oviposition records for Panamanian Lycaenidae and Riodinidae.

37. Santibañez, S., & Hernández, D. (2019). Revisiting the Alimentary Habits of the Liolaemus Cristiani.

38. Schönberg, C. H. L., Hosie, A. M., Fromont, J., Marsh, L., & O’Hara, T. (2016). Apartment-style living on a kebab sponge. Marine Biodiversity, 46(2), 331–332.

39. Seifert, R. P., & Seifert, F. H. (1979). A Heliconia Insect Community in a Venezuelan Cloud Forest. Ecology, 60(3), 462–467.

40. Sereda, E., Wolters, V., & Birkhofer, K. (2015). Addition of crop residues affects a detritus-based food chain depending on litter type and farming system. Basic and Applied Ecology, 16(8), 746–754.

41. Silcox, M. T., & Teaford, M. F. (2002). The Diet of Worms: An Analysis of Mole Dental Microwear. Journal of Mammalogy, 83(3), 804–814.

42. Somda, N. S. B., Maïga, H., Mamai, W., Yamada, H., Ali, A., Konczal, A., Gnankiné, O., Diabaté, A., Sanon, A., Roch, K. D., Gilles, J. R. L., & Bouyer, J. (2019). Insects to feed insects - feeding Aedes mosquitoes with flies for laboratory rearing. Scientific Reports, 9(1), 1–13.

43. Amundrud, S. L., & Srivastava, D. S. (2020). Thermal tolerances and species interactions determine the elevational distributions of insects. Global Ecology and Biogeography.

44. Starzomski, B. M., Suen, D., & Srivastava, D. S. (2010). Predation and facilitation determine chironomid emergence in a bromeliad-insect food web. Ecological Entomology, 35(1), 53–60.

45. Sweet, M. H. (1979). On the Original Feeding Habits of the Hemiptera (Insecta). Annals of the Entomological Society of America, 72(5), 575–579.

46. Szawaryn, K. (2015). Notes on the genus Mada Mulsant with description of a new Andean species (Coleoptera: Coccinellidae: Epilachnini). Zootaxa, 3936(2), 281–286.

47. Tamaki, G., & Olsen, D. (2019). Feeding potential of predators of Myzus persicae. Journal of the Entomological Society of British Columbia, 74, 23–26.

48. Tan, C. W., Peiffer, M., Hoover, K., Rosa, C., & Felton, G. W. (2019). Parasitic Wasp Mediates Plant Perception of Insect Herbivores. Journal of Chemical Ecology, 45(11–12), 972–981.

49. Tan, Y., Zhu, M., Xu, W., Zhou, W., Lu, D., Shang, H., & Zhu, Z. (2017). Influence of water-stressed rice on feeding behavior of brown planthopper, Nilaparvata lugens (Stål). Journal of Asia-Pacific Entomology, 20(2), 665–670.

50. Tangkawanit, U., Hinmo, N., & Khlibsuwan, W. (2018). Role of different habitats for the functional response of Crytorrhinus lividipennis (Hemiptera: Miridae) on Nilaparvata lugens (Hemiptera: Delphacidae). Biocontrol Science and Technology, 28(7), 663–671.

51. Tauber, C. A., & Tauber, M. J. (1987). Food specificity in predacious insects: a comparative ecophysiological and genetic study. Evolutionary Ecology, 1(2), 175–186.

52. Tierno de Figueroa, J. M., Trenzado, C. E., López-Rodríguez, M. J., & Sanz, A. (2011). Digestive enzyme activity of two stonefly species (Insecta, Plecoptera) and their feeding habits. Comparative Biochemistry and Physiology - A Molecular and Integrative Physiology, 160(3), 426–430.

53. Weeks, J. A. (2003). Parasitism and ant protection alter the survival of the lycaenid Hemiargus isola. Ecological Entomology, 28(2), 228–232.

54. Wondafrash, M., Van Dam, N. M., & Tytgat, T. O. G. (2013). Plant systemic induced responses mediate interactions between root parasitic nematodes and aboveground herbivorous insects. Frontiers in Plant Science, 4, 87.

55. Yonggyun, K., Ibrahim, A. M. A., Jung, S., & Kwoen, M. (2006). Differential Parasitic Capacity of Cotesia plutellae and C. glomerata on Diamondback Moth, Plutella xylostella and Dichotomous Taxonomic Characters. Journal of Asia-Pacific Entomology, 9(3), 293–300.

56. Zweifel, R. G. (1949). Comparison of food habits of Ensatina eschscholtzii and Aneides lugubris. Copeia, 1949(4), 285–297.

S.2. Example of Pipeline Output File

Table S.2. Examples of outputs from two of the tested files. The relation is shown in the first column, the final classification in the second column, and the word(s) that decided the classification in the third column. When a new file is analyzed, one row contains the name of the file. The first set of relations comes from Lorenzo et al. (2019), on feeding substrates; the second comes from Lai et al. (2018), on an invasive ant species.

Ollie Relation | Final Classification | Classification Decider
example1.pdf | -- | --
0.768 Thripidae is one of the most significant pests of commercial vegetables fruits and ornamental crops worldwide | Thripidae is a herbivore | ['fruits', 'crops', 'vegetables']
0.887 These omnivorous habits may provide F. occidentalis with a competitive advantage | F. occidentalis is an omnivore | ['omnivorous']
example2.pdf | -- | --
0.606 various Amara species eat insect eggs larvae and pupae | Amara is a carnivore | ['eggs', 'larvae']


S.3. Keyword Categories

Table S.3. All keywords broken down by category. The total list of keywords and the sets of specific keywords are shown in the table. Left to right and right to left keywords are specific to triplet relations, while reflexive, parasite and detritivore keywords all lead to keyword-based relations.

Keyword Category Words

Total omnivorous on, ingests, detritivory, feed mainly, seed predator, manure, detritivorous on, is an omnivore, are herbivores, attacks, harvest the seeds of, captures, feeds, feeding, parasitize, predated by, found to be prey of, parasite-host, host-parasite relationship of, parasitised by, preys, eating, is granivorous, decompose, ate, are typical detritivores, predation by, forages, prey for, the detritivore, feeds mostly on, decomposer, eat, omnivorous, harvests, feeding of, saprophytic, prey on, endoparasitoid, eats, parasitized by, feeding mostly on, pest of, eaten by, was detritivorous, reared on, specialize on, hunt, cultivate, preying on, feeds on, detritivorous, poop, consumes, predating an, is a herbivore, to eat, foraging on, faeces, host, is detritivorous, fecal, eaten, pests of, predator of, reared from, is a seed predator, parasitic, is carnivorous, fed with, parasite, feeds mainly on, ingest, is herbivorous, consumed by, is insectivorous, excrement, consume, parasitised, herbivorous on, employ, is omnivorous, biological control of, herbivorous, forages on, parasitization by, predation of, feces, saprophagous, predaceous on, carnivorous on, being detritivorous, attacking, by detritivorous, pest control in, is a carnivore, predators on, feed largely upon, stool, cultivating, parasitoid of, consumption of, cultivates, dung, feed on, prefers seeds of, super-parasitism, preys on, prey of, fed upon, preys upon, feeding on, parasite-host association, parasitoid, parasitizes, consumed

Left to right eat, parasitised, parasitization by, parasitoid of, parasitize, parasitizes, feeds on, eaten, predator of, attacking, consumed, ingests, ingest, consumption of, preys upon, preying on, forages, forages on, feeds mainly on, feeds mostly on, pests of, predation of, feeding, cultivate, cultivates, cultivating, eating, reared from, reared on, ate, eat, hunt, feed on, specialize on, harvests, eat, eating, pest of, attacks, feeding on, predating an, captures, prey on, fed upon, predating an, captures, is a seed predator, prefers seeds of, harvests, seed predator, harvest the seeds of, fed upon, fed with, pest control in, biological control of, pest of, attacks, pests of, eating, feeding on, eat, predators on, ate, reared from, reared on, feed on, herbivorous on, carnivorous on, detritivorous on, omnivorous on, attacking, consumed, feeds mostly on, feed largely upon, preys upon, preying on, parasitoid of, consume, eats, feeds on, ingests, predaceous on,

Right to left parasitised by, parasitized by, prey for, prey of, eaten by, feeds, found to be prey of, consumed by, predation by, predated by, by detritivorous, the detritivore

Reflexive detritivory, omnivorous, detritivorous, being detritivorous, are typical detritivores, was detritivorous, is granivorous, is insectivorous, is omnivorous, is herbivorous, are herbivores, is a herbivore, is an omnivore, is a carnivore, is detritivorous, is carnivorous, herbivorous

Parasite parasitised, parasitization by, parasitoid of, parasitize, parasitizes, parasitized by, super-parasitism, parasitised by, endoparasitoid, parasitize, parasitic, host, parasite, parasite-host, parasite-host association, host-parasite relationship of, parasitoid

Detritivore the detritivore, by detritivorous, decomposer, saprophagous, saprophytic, decompose, detritivorous, faeces, feces, dung, fecal, excrement, poop, stool, manure
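The direction categories above determine which side of a keyword names the consumer. A minimal sketch of that logic (the keyword sets below are small illustrative subsets of Table S.3, and `consumer_and_resource` is an illustrative helper, not a function from the pipeline):

```python
# Small illustrative subsets of the Table S.3 direction categories.
LEFT_TO_RIGHT = {"feeds on", "preys on", "eats"}            # consumer precedes the keyword
RIGHT_TO_LEFT = {"eaten by", "predated by", "consumed by"}  # consumer follows the keyword

def consumer_and_resource(before, keyword, after):
    """Given the taxa on each side of a keyword, return (consumer, resource)."""
    if keyword in LEFT_TO_RIGHT:
        return before, after
    if keyword in RIGHT_TO_LEFT:
        return after, before
    return None  # reflexive, parasite and detritivore keywords are handled separately

print(consumer_and_resource("Aphis", "eaten by", "Coccinella"))
```

So the same two taxa yield the same (consumer, resource) pair whether the sentence is written actively or passively, which is why the pipeline keeps the two keyword sets separate.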


S.4. Pipeline Code
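The listing below trains a Random Forest on two-character bigram frequencies; the 676 feature columns it slices from the training file correspond to the 26 × 26 ordered a-z bigrams. A standalone sketch of that featurisation (illustrative code, separate from the listing itself):

```python
# Illustrative sketch of the 676-dimensional bigram feature vector used by
# the pipeline's scientific-name check (26 * 26 ordered a-z bigrams).
from string import ascii_lowercase

ALL_BIGRAMS = [a + b for a in ascii_lowercase for b in ascii_lowercase]  # 676 columns

def bigram_features(term):
    """Count each a-z bigram in a lowercased term; returns a 676-long vector."""
    term = term.lower()
    # range stops at len(term) - 1 so every slice is a full two-character bigram
    grams = [term[i:i + 2] for i in range(len(term) - 1)]
    return [grams.count(bg) for bg in ALL_BIGRAMS]

vec = bigram_features("Daphnia")
print(len(vec), sum(vec))  # 676 6
```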

1. Trophic Information Pipeline
2. Description
3. The following cells work in concert to create a trophic information pipeline, run in a Jupyter notebook, that produces a file containing the final classifications for scientific names found within research articles.
4.
5. Getting Started
6. To run this notebook, the libraries that are imported must be installed on your machine. The Ollie tool must be downloaded from http://knowitall.cs.washington.edu/ollie/ollie-app-latest.jar, as well as the English MaltParser model (engmalt.linear-1.7.mco) from http://www.maltparser.org/mco/english_parser/engmalt.html, based on the instructions from https://github.com/knowitall/ollie. The two downloads should be in the same folder as the Jupyter notebook file. The scientific names file, the common names file, the abbreviated scientific names file, the English words file, the Random Forest training file and the trophic keywords file must be downloaded from https://github.com/JSRaffing/Trophic-Information-Extraction-Pipeline and all kept in the same folder as the previously mentioned downloads. Once that is complete, fill in the necessary names in the Part 1 cell: the names of the files to be analyzed (PDFs) and the name of the result file. All other variables are left as the defaults based on the downloaded file names. If you change a file or a file name, the default must be changed.
7.
8. # Part 1: Filling in the Variables
9.
10. # Files to be analyzed: replace the example file names with the names of your actual pdfs
11. files_to_be_analyzed = ['example_file1.pdf', 'example_file2.pdf', 'example_file3.pdf']

12. # Result file name: replace the example result file name with the name of the file you want
13. output_file = 'Name_of_Result_File.txt'
14. # English words file: to change the file, replace 'words.txt' with the name of your file that is structured with one word on each line.
15. english_dict_file = 'words.txt'
16. # Scientific names file: to change the file, replace 'scinames-final.txt' with the name of your file that has each scientific name on a row in the first column and the corresponding kingdom on the same row in the second column
17. sci_names_file = 'scinames-final.txt'
18. # Common names file: to change the file, replace 'comnames-final.txt' with the name of your file that has each common name on a row in the first column and the corresponding kingdom on the same row in the second column
19. common_names_file = 'comnames-final.txt'
20. # Abbreviated scientific name file: to change the file, replace 'acronamesflipped-may8.txt' with the name of your file that has each abbreviated name on a row in the first column and the corresponding expanded name on the same row in the second column
21. abbreviated_names_file = 'acronamesflipped-final.txt'
22. # Trophic keywords file: to add keywords to the file, open the file and add your keywords to the phrases list as well as the specific category it would fall in
23. trophic_keywords_file = 'trophickeywords-final.txt'
24. # Random Forest training file
25. random_forest_training_file = 'mixedtrain2.csv'
26. # Ollie file
27. ollie_file = 'ollie-app-latest.jar'

28.
29. # Part 2: Importing Libraries
30.
31. # Each imported library is used in a later step.
32. import csv
33. from itertools import zip_longest
34. import nltk
35. from itertools import groupby
36. import sys
37. import re
38. import itertools
39. import operator
40. from itertools import groupby
41. import argparse
42. import subprocess
43. from subprocess import PIPE, Popen
44. import fitz
45. import Levenshtein
46. from Levenshtein import distance
47. from more_itertools import unique_everseen
48. from sklearn.ensemble import RandomForestClassifier
49. import os
50. import pandas as pd
51. import ast
52.
53. print('Libraries imported')
54.
55. # Part 3: Loading Necessary Collections
56.
57. # English Word Collection
58. # The collection of English words is saved as a list by reading the words from a file.

59. english_list = []
60.
61. with open(english_dict_file, "r") as file:
62.     for line in file:
63.         english_list.append(line.rstrip('\n'))  # strip the newline so later membership checks match bare words
64.
65. print('English dictionary loaded')
66.
67. # Scientific Names Collection
68. # The collection of scientific names and kingdoms is saved as a dictionary by reading from a two-column file
69. sci_names = []
70. kingdom_names = []
71.
72. with open(sci_names_file, 'r') as data:
73.     for line in data.readlines():
74.         # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
75.         line = line.rstrip('\n')
76.         entity = line.split('\t')
77.         sci_names.append(entity[0])
78.         kingdom_names.append(entity[1])
79. # Lists are zipped together to create a dictionary.
80. sci_name_dict = dict(zip(sci_names, kingdom_names))
81.
82. print('Scientific names loaded')

83.
84. # Common Names Collection
85. # The collection of common names and kingdoms is saved as a dictionary by reading from a two-column file
86. common_names = []
87. kingdom_names = []
88.
89. with open(common_names_file, 'r') as data:
90.     for line in data.readlines():
91.         # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
92.         line = line.rstrip('\n')
93.         entity = line.split('\t')
94.         common_names.append(entity[0])
95.         kingdom_names.append(entity[1])
96. # Lists are zipped together to create a dictionary.
97. common_names_dict = dict(zip(common_names, kingdom_names))
98.
99. print('Common names loaded')
100.
101. # Abbreviated Scientific Names
102. # The collection of abbreviated names and expanded names is saved as a dictionary by reading from a two-column file
103. abbreviated_names = []
104. expanded_names = []
105.
106. with open(abbreviated_names_file, 'r') as data:
107.     for line in data.readlines():
108.         # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
109.         line = line.rstrip('\n')
110.         entity = line.split('\t')
111.         abbreviated_names.append(entity[0])
112.         expanded_names.append(entity[1])
113.
114. print('Abbreviated names loaded')
115.
116. # Trophic Keywords and Categories
117.
118. # The collection of trophic keywords and specific categories is saved as a dictionary by reading from a two-column file
119. total_keywords_and_categories = []
120.
121. with open(trophic_keywords_file, 'r') as data:
122.     for line in data.readlines():
123.         # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
124.         line = line.rstrip('\n')
125.         category = line.split('\t')
126.         total_keywords_and_categories.append(category[1])
127.
128. total_keywords = ast.literal_eval(total_keywords_and_categories[0])
129. lefttoright = ast.literal_eval(total_keywords_and_categories[1])
130. righttoleft = ast.literal_eval(total_keywords_and_categories[2])
131. reflexive = ast.literal_eval(total_keywords_and_categories[3])
132. parasite = ast.literal_eval(total_keywords_and_categories[4])
133. detritivore = ast.literal_eval(total_keywords_and_categories[5])
134.

135. print('Trophic keywords loaded')
136. print(', '.join(total_keywords))
137. print(', '.join(lefttoright))
138. print(', '.join(righttoleft))
139. print(', '.join(list(reflexive.keys())))
140. print(', '.join(parasite))
141. print(', '.join(detritivore))
142.
143. # Part 4: Random Forest Model
144.
145. # Training Classifier
146. # The file to train the classifier is read and its data is saved into an object
147. mixedtraining3 = pd.read_csv(random_forest_training_file)
148. # The data is cleaned, removing titles
149. del mixedtraining3['name']
150. X = mixedtraining3.iloc[:, 0:676].values
151. y = mixedtraining3['type']
152. # The classifier is trained
153. classifier = RandomForestClassifier(n_estimators=20, random_state=0)
154. classifier.fit(X, y)
155.
156. print('Random Forest model loaded')
157.
158. # Creating Function with Classifier
159. # The function sci_check uses the trained Random Forest model from above to predict whether or not a term is a scientific word
160. def sci_check(term):
161.     entry = {}
162.     all_columns = []
163.     entry_frequencies = []
164.     # Creating list of all possible bigrams
165.     for char in 'abcdefghijklmnopqrstuvwxyz':
166.         all_possible_bigrams = [char+b for b in 'abcdefghijklmnopqrstuvwxyz']
167.         all_columns.extend(all_possible_bigrams)
168.     # Creating list of bigrams in term
169.     chars = [term[i:i+2] for i in range(0, len(term))]
170.     all_columns_2 = all_columns
171.     # Count the frequency of the bigram in the term
172.     for bigram in all_columns_2:
173.         frequency = chars.count(bigram)
174.         entry[bigram] = frequency
175.         entry_frequencies.append(frequency)
176.     # Use random forest model to check if it's a possible scientific name
177.     y_pred = classifier.predict([entry_frequencies])
178.
179.     return y_pred[0]
180.
181. # Part 5: Function to Search Ollie Results
182.
183. # The function "searching_ollie_results" searches through Ollie results for relations that contain a keyword
184. def searching_ollie_results(list_of_lists):
185.     relevant_ollie_results = []
186.     relevant_ollie_results_dict = {}
187.     # Iterate through list of sentences and find sentences that have one of the food phrases


188.     keep = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ.1234567890 ')
189.     for ollie_relation in list_of_lists:
190.         ollie_relation = ''.join(filter(keep.__contains__, str(ollie_relation)))
191.         for keyword in total_keywords:
192.             # Check if phrases are in sentences, lowercasing both
193.             if re.search(keyword.lower(), ollie_relation.lower()):
194.                 # Get the index of phrase in sentence
195.                 keyword_index = re.search(r'\b({})\b'.format(keyword.lower()), ollie_relation.lower())
196.                 # Check if index is equal to None
197.                 if keyword_index is not None:
198.                     # Check if sentence already in results
199.                     if ollie_relation not in relevant_ollie_results_dict.keys():
200.                         # Add sentence and phrase into dictionary
201.                         relevant_ollie_results_dict[ollie_relation] = keyword
202.                         # Add sentence, phrase and indices to final results
203.                         relevant_ollie_results.extend([[ollie_relation, keyword, [keyword_index.start(), keyword_index.end()]]])
204.
205.                     else:
206.                         # If sentence already in result dictionary, save the previous entry
207.                         smallest = min((relevant_ollie_results_dict[ollie_relation], keyword), key=len)
208.                         # Check if smaller phrase is in longer phrase
209.                         if smallest in keyword:
210.                             # Save both phrases
211.                             both = [relevant_ollie_results_dict[ollie_relation], keyword]
212.                             # Keep the phrase that is the longest
213.                             longest = max(both, key=len)
214.                             # Find the index of the phrase
215.                             keyword_index_2 = re.search(r'\b({})\b'.format(relevant_ollie_results_dict[ollie_relation].lower()), ollie_relation.lower())
216.                             # Get the index of the previous entry from the results
217.                             if [ollie_relation, relevant_ollie_results_dict[ollie_relation], [keyword_index_2.start(), keyword_index_2.end()]] in relevant_ollie_results:
218.                                 previous_index = relevant_ollie_results.index([ollie_relation, relevant_ollie_results_dict[ollie_relation], [keyword_index_2.start(), keyword_index_2.end()]])
219.                                 # Replace the previous results with the newer phrase entry
220.                                 relevant_ollie_results[previous_index] = [ollie_relation, longest, [keyword_index.start(), keyword_index.end()]]
221.                         else:
222.                             # If the phrases don't intersect, then extend the results with the new entry
223.                             relevant_ollie_results.extend([[ollie_relation, keyword, [keyword_index.start(), keyword_index.end()]]])
224.
225.     # Remove duplicates from results
226.     relevant_ollie_results = list(relevant_ollie_results for relevant_ollie_results, _ in itertools.groupby(relevant_ollie_results))

227.
228.     return relevant_ollie_results
229.
230. # Part 6: Function to Identify Scientific Names and Add a Final Classification
231.
232. # This function "identify_and_classify" takes the relevant trophic relations found by the "searching_ollie_results" function and locates scientific names while giving them a final classification
233. def identify_and_classify(relevant_ollie_relations):
234.     all_final_classifications = []
235.     # Iterate through sentences of previous output
236.     for i in range(0, len(relevant_ollie_relations)):
237.         current_relation = []
238.         words_in_official_scientific_name = []
239.         words_added_to_current_relation = []
240.         # Saving part of each result as a variable
241.         sentence = relevant_ollie_relations[i][0]
242.         phrase = relevant_ollie_relations[i][1]
243.         words_added_to_current_relation.append(phrase)
244.         current_relation.append(relevant_ollie_relations[i][1])
245.         keyword_string_index = relevant_ollie_relations[i][2]
246.         # Split by spaces
247.         before_space = sentence[:keyword_string_index[0]].split()
248.         after_space = sentence[keyword_string_index[0]:].split()
249.         # Add a rule to check if word is a noun
250.         is_noun = lambda pos: pos[:2] == 'NN'
251.         # Finding nouns in sentence using the previously made rule
252.         nouns = [word for (word, pos) in nltk.pos_tag(before_space+after_space) if is_noun(pos)]
253.         nouns2 = nouns
254.         words_in_english_dictionary = []
255.         words_in_sci_name_list = []
256.         # Checking if any of the nouns land in the dictionary
257.         for word in nouns:
258.             if word not in detritivore:
259.                 if word in english_list:
260.                     words_in_english_dictionary.append(word)
261.                 # Check if word closely matches a word in list of scientific names
262.                 metric = (2, 'noname')
263.                 for master_name in sci_names:
264.                     new_metric = distance(word, master_name)
265.                     if (new_metric < metric[0]):
266.                         metric = (new_metric, master_name)
267.                 if metric[1] == 'noname':
268.                     # Check if word is in abbreviated scientific names or common names
269.                     matching = [s for s in abbreviated_names + common_names if word in s]
270.                     if len(matching) > 0:
271.                         metric = (0, metric[1])
272.                 if metric[0] < 2:
273.                     words_in_sci_name_list.append(word)
274.         # Regular expressions of different forms of scientific names
275.         patterns = ['[A-z]\. [a-z]{3,}', '[A-Z][a-z]{4,}: [A-Z][a-z]{4,}', '[A-Z][a-z]{2,} [A-z]{3,}']
276.         #'[A-Z][a-z]{4,} [A-z]{4,}'
277.         # Finding all regular expression matches in a sentence
278.         located_patterns = []
279.         for pattern in patterns:
280.             searching_for_patterns = re.findall(pattern, relevant_ollie_relations[i][0])
281.             located_patterns.extend(searching_for_patterns)
282.         patterns_in_data_list = []
283.         words_from_model = []
284.         # Check if regular expression scientific name is closely related to a scientific name in the list
285.         for located_word in located_patterns:
286.             # Check if length of results is greater than 0
287.             metric = (2, 'noname')
288.             for master_name in sci_names:
289.                 new_metric = distance(located_word, master_name)
290.                 if (new_metric < metric[0]):
291.                     metric = (new_metric, master_name)
292.             if metric[1] == 'noname':
293.                 # Check if word is in abbreviated scientific names or common names
294.                 matching = [s for s in abbreviated_names + common_names if located_word in s]
295.                 if len(matching) > 0 or '.' in located_word:
296.                     metric = (0, metric[1])
297.             if metric[0] < 2:
298.                 patterns_in_data_list.append(located_word)
299.                 input_1 = (located_word, 'scientificname')
300.                 # Check if word comes before or after the relational phrase in the sentence
301.                 if sentence.index(located_word) < sentence.lower().index(phrase):
302.                     if current_relation.index(phrase) == 0:
303.                         phrase_index = current_relation.index(phrase)
304.                         current_relation.insert(0, input_1)
305.                         words_added_to_current_relation.append(located_word)
306.                     else:
307.                         phrase_index = current_relation.index(phrase)
308.                         current_relation.insert((phrase_index), input_1)
309.                         words_added_to_current_relation.append(located_word)
310.                 else:
311.                     phrase_index = current_relation.index(phrase)
312.                     current_relation.index(phrase)
313.                     current_relation.append(input_1)
314.                     words_added_to_current_relation.append(located_word)
315.             else:
316.                 # Test the located word against the model
317.                 test = sci_check(located_word)
318.                 input_test = [located_word, test]
319.                 if test.item() == 0:
320.                     words_from_model.append(located_word)
321.                     input_1 = (located_word, 'scientificname')
322.                     # If model says it is a scientific word, check if word comes before or after the relational phrase in the sentence
323.                     if sentence.index(located_word) < sentence.lower().index(phrase):
324.                         if current_relation.index(phrase) == 0:
325.                             phrase_index = current_relation.index(phrase)
326.                             current_relation.insert(0, input_1)
327.                             words_added_to_current_relation.append(located_word)


328.                     else:
329.                         phrase_index = current_relation.index(phrase)
330.                         current_relation.insert((phrase_index), input_1)
331.                         words_added_to_current_relation.append(located_word)
332.                 else:
333.                     phrase_index = current_relation.index(phrase)
334.                     current_relation.index(phrase)
335.                     current_relation.append(input_1)
336.                     words_added_to_current_relation.append(located_word)
337.
338.         # Checking through nouns to see if any of them combined is a scientific name
339.         if len(nouns) > 0:
340.             for i in range(0, (len(nouns)-1)):
341.                 possible_scientific_word = nouns[i]+' '+nouns[i+1]
342.                 metric = (2, 'noname')
343.                 for master_name in sci_names:
344.                     new_metric = distance(possible_scientific_word, master_name)
345.                     if (new_metric < metric[0]):
346.                         metric = (new_metric, master_name)
347.                 if metric[0] < 2 and metric[1] != 'noname':
348.                     if possible_scientific_word in sentence:
349.                         # Check if the combination of the words is already in one of the lists of scientific words that have already been found
350.                         if any((possible_scientific_word in s for s in patterns_in_data_list + [', '.join(words_added_to_current_relation)])) == False:
351.                             for each in possible_scientific_word.split():
352.                                 words_in_official_scientific_name.append(each)
353.                             # Check if word came before or after phrase
354.                             if sentence.index(possible_scientific_word) < sentence.lower().index(phrase):
355.                                 phrase_index = current_relation.index(phrase)
356.                                 if phrase_index == 0:
357.                                     input_1 = (possible_scientific_word, 'possiblescientificname')
358.                                     current_relation.insert(0, input_1)
359.                                     words_added_to_current_relation.append(possible_scientific_word)
360.                                 else:
361.                                     phrase_index = current_relation.index(phrase)
362.                                     input_1 = (possible_scientific_word, 'possiblescientificname')
363.                                     current_relation.insert(phrase_index, input_1)
364.                                     words_added_to_current_relation.append(possible_scientific_word)
365.                             # Add word to the end because it comes after the phrase
366.                             else:
367.                                 input_1 = (possible_scientific_word, 'possiblescientificname')
368.                                 current_relation.append(input_1)
369.                                 words_added_to_current_relation.append(possible_scientific_word)
370.

371.                 else:
372.                     # Split the possible scientific word into its parts and see if it's in a list
373.                     splitted = possible_scientific_word.split()
374.                     # Check if substring is in the list of scientific names
375.                     for substring in possible_scientific_word.split():
376.                         metric = (2, 'noname')
377.                         for master_name in sci_names:
378.                             new_metric = distance(substring, master_name)
379.                             if (new_metric < metric[0]):
380.                                 metric = (new_metric, substring)
381.                         commonnamesfull = [x for x in detritivore]
382.                         # Check if the substring has already been added or checked
383.                         if metric[1] != 'noname':
384.                             if substring not in ', '.join(patterns_in_data_list + words_from_model).split() + (' '.join(words_added_to_current_relation)).split():
385.                                 input_1 = (substring, 'scientificname')
386.                                 # If the substring has not already been added, check if it comes before or after the phrase
387.                                 if substring.lower() not in english_list and len(substring) > 2:
388.                                     if sentence.index(substring) < sentence.lower().index(phrase):
389.                                         phrase_index = current_relation.index(phrase)
390.                                         if phrase_index == 0:
391.                                             current_relation.insert(0, input_1)
392.                                             words_added_to_current_relation.append(substring)
393.                                         else:
394.                                             phrase_index = current_relation.index(phrase)
395.                                             current_relation.insert(phrase_index, input_1)
396.                                             words_added_to_current_relation.append(substring)
397.                                     # Add the substring to the end if the word comes after the phrase
398.                                     else:
399.                                         words_added_to_current_relation.append(substring)
400.                                         current_relation.append(input_1)
401.                             else:
402.                                 dictionaryterm = substring
403.                         # Check if it is in the list of common names in the detritivore category
404.                         elif any(substring in s for s in commonnamesfull):
405.                             if substring not in ', '.join(words_added_to_current_relation):
406.                                 input_1 = (substring, 'common scientificname')
407.                                 # Check if word comes before or after the phrase
408.                                 if sentence.index(substring) < sentence.lower().index(phrase):
409.                                     phrase_index = current_relation.index(phrase)
410.                                     if phrase_index == 0:
411.                                         current_relation.insert(0, input_1)
412.                                         words_added_to_current_relation.append(substring)
413.                                     else:
414.                                         phrase_index = current_relation.index(phrase)
415.                                         current_relation.insert(phrase_index, input_1)
416.                                         words_added_to_current_relation.append(substring)
417.                                 else:
418.                                     current_relation.append(input_1)
419.                                     words_added_to_current_relation.append(substring)
420.
421.                         else:
422.                             # Check if word is a scientific word based on the model
423.                             test = sci_check(substring)
424.                             input_test = [substring, test]
425.                             if test.item() == 0:
426.                                 words_from_model.append(substring)
427.                                 # Add possible scientific name if not in previously added lists
428.                                 if substring not in (' '.join(patterns_in_data_list)).split():
429.                                     if substring not in ', '.join(words_added_to_current_relation):
430.                                         if substring not in english_list:
431.                                             input_1 = (substring, 'possiblescientificname')
432.                                             if sentence.index(substring) < sentence.lower().index(phrase):
433.                                                 phrase_index = current_relation.index(phrase)
434.                                                 if phrase_index == 0:
435.                                                     current_relation.insert(0, input_1)
436.                                                     words_added_to_current_relation.append(substring)
437.                                                 else:
438.                                                     phrase_index = current_relation.index(phrase)
439.                                                     current_relation.insert(phrase_index, input_1)
440.                                                     words_added_to_current_relation.append(substring)
441.                                             else:
442.                                                 phrase_index = current_relation.index(phrase)
443.                                                 current_relation.append(input_1)
444.                                                 words_added_to_current_relation.append(substring)
445.
446.


447.         # Using categories of directionality to figure out sentence classification
448.         current_relation = [x[0] for x in groupby(current_relation)]
449.         phrase_index = current_relation.index(phrase)
450.         classifications_and_relation = []
451.         if any('scientificname' in part or 'possiblescientificname' in part or 'common scientificname' in part for part in current_relation):
452.             if len(current_relation) > 2:
453.                 # Separate sentences based on before and after phrase
454.                 classifications_and_relation.append(sentence)
455.                 words_after_phrase = [nm for nm in current_relation[phrase_index:] if 'scientificname' in nm[1]]
456.                 words_before_phrase = [nm for nm in current_relation[:phrase_index] if 'scientificname' in nm[1]]
457.                 # Check if phrase in lefttoright and if there is a word after the phrase
458.                 if phrase in lefttoright and (phrase_index+1) < len(current_relation):
459.                     # Check if word in scientific name list or common name list or abbreviated name list
460.                     for k in range(0, len(words_after_phrase)):
461.                         final_result = 'default'
462.                         word = words_after_phrase[k][0]
463.                         if word not in detritivore:
464.                             metric = (2, 'noname')
465.                             for master_name in sci_names:
466.                                 new_metric = distance(word, master_name)
467.                                 if (new_metric < metric[0]):
468.                                     metric = (new_metric, master_name)
469.                             if metric[1] == 'noname':
470.                                 # Check if word in common name list
471.                                 matching = [s for s in common_names if word in s]
472.                                 if len(matching) > 0:
473.                                     metric = (0, metric[1])
474.                                     kept = min(matching, key=len)
475.                                     final_result = common_names_dict[min(matching, key=len)]
476.                                     metric = list(metric)
477.                                     metric[1] = min(matching, key=len)
478.                                 else:
479.                                     # Check if word in abbreviated name list
480.                                     matching = [s for s in abbreviated_names if word in s]
481.                                     if len(matching) > 0:
482.                                         final_result = 'not available'
483.                             else:
484.                                 if metric[0] < 2:
485.                                     final_result = sci_name_dict[metric[1]]
486.                                 else:
487.                                     final_result = 'none'
488.                             # Check if words in sentence are in the parasite list of words
489.                             sentencesplit = sentence.split()
490.                             parasite_testing = [x for x in parasite if x in sentence]
491.                             # If parasite words are in sentence, add parasite classification
492.                             if len(parasite_testing) > 0:
493.                                 for i in range(0, len(words_before_phrase)):
494.                                     classifications_and_relation.append(str(words_before_phrase[i][0] + " is a parasite" + " - " + ', '.join(parasite_testing)))
495.                             # Check if Animalia is in the kingdom designation
496.                             elif 'Animalia' in str(final_result):
497.                                 for i in range(0, len(words_before_phrase)):
498.                                     classifications_and_relation.append(str(words_before_phrase[i][0] + " is a carnivore" + ' - ' + metric[1]))
499.                                     # If Animalia and Herbivore is in classification then omnivore classification is added
500.                                     if str(words_before_phrase[i][0]) + ' is a herbivore' in classifications_and_relation:
501.                                         classifications_and_relation.append(str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
502.                             # Check if Plantae is in the kingdom designation
503.                             elif 'Plantae' in str(final_result):
504.                                 for i in range(0, len(words_before_phrase)):
505.                                     classifications_and_relation.append(str(words_before_phrase[i][0] + " is a herbivore" + ' - ' + metric[1]))
506.                                     # If Animalia and Herbivore is in classification then omnivore classification is added
507.                                     if str(words_before_phrase[i][0]) + ' is a carnivore' in classifications_and_relation:
508.                                         classifications_and_relation.append(str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
509.                         # Check if word is in detritivore list
510.                         elif word in detritivore:
511.                             for i in range(0, len(words_before_phrase)):
512.                                 # Add the detritivore classification
513.                                 classifications_and_relation.append(str(words_before_phrase[i][0] + " is a detritivore" + ' - ' + word))
514.                         else:
515.                             tag = 'sentence of interest'
516.                 # If phrase in detritivore and right to left, the word after the phrase is classified
517.                 elif phrase in detritivore and phrase in righttoleft:
518.                     for k in range(0, len(words_after_phrase)):
519.                         # Add detritivore classification
520.                         classifications_and_relation.append(str(words_after_phrase[k][0] + " is a detritivore" + ' - ' + phrase))
521.                 # Check if phrase in righttoleft and if there is a word after the phrase
522.                 elif phrase in righttoleft and (phrase_index+1) < len(current_relation):
523.                     # Check if word in scientific name list or common name list or abbreviated name list
524.                     for k in range(0, len(words_before_phrase)):
525.                         final_result = 'default'
526.                         word = words_before_phrase[k][0]
527.                         if word not in detritivore:
528.                             metric = (2, 'noname')
529.                             for master_name in sci_names:
530.                                 new_metric = distance(word, master_name)
531.                                 if (new_metric < metric[0]):
532.                                     metric = (new_metric, master_name)
533.                             # Check if word in common name list
534.                             if metric[1] == 'noname':

535.                                 matching = [s for s in common_names if word in s]
536.                                 if len(matching) > 0:
537.                                     metric = (0, metric[1])
538.                                     kept = (min(matching, key=len))
539.                                     final_result = common_names_dict[min(matching, key=len)]
540.                                     metric = list(metric)
541.                                     metric[1] = kept
542.                                 # Check if word in abbreviated name list
543.                                 else:
544.                                     matching = [s for s in abbreviated_names if word in s]
545.                                     if len(matching) > 0:
546.                                         final_result = 'not available'
547.                             else:
548.                                 if metric[0] < 2:
549.                                     final_result = sci_name_dict[metric[1]]
550.                                 else:
551.                                     final_result = 'none'
552.                             # Check if words in sentence are in the parasite list of words
553.                             sentencesplit = sentence.split()
554.                             parasite_testing = [x for x in parasite if x in sentence]
555.                             # If parasite words are in sentence, add parasite classification
556.                             if len(parasite_testing) > 0:
557.                                 for i in range(0, len(words_after_phrase)):
558.                                     classifications_and_relation.append(str(words_after_phrase[i][0] + " is a parasite" + ' - ' + ', '.join(parasite_testing)))
559.                             # Check if Animalia is in the kingdom designation
560.                             elif 'Animalia' in str(final_result):
561.                                 for i in range(0, len(words_after_phrase)):
562.                                     classifications_and_relation.append(str(words_after_phrase[i][0] + " is a carnivore" + ' - ' + metric[1]))
563.                                     # If Animalia and Herbivore is in classification then omnivore classification is added
564.                                     if str(words_before_phrase[i][0]) + ' is a herbivore' in classifications_and_relation:
565.                                         classifications_and_relation.append(str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
566.                             # Check if Plantae is in the kingdom designation
567.                             elif 'Plantae' in str(final_result):
568.                                 for i in range(0, len(words_after_phrase)):
569.                                     classifications_and_relation.append(str(words_after_phrase[i][0] + " is a herbivore" + ' - ' + metric[1]))
570.                                     # If Animalia and Herbivore is in classification then omnivore classification is added
571.                                     if str(words_after_phrase[i][0]) + ' is a carnivore' in classifications_and_relation:
572.                                         classifications_and_relation.append(str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
573.                         # Check if word is in detritivore list
574.                         elif word in detritivore:
575.                             for i in range(0, len(words_after_phrase)):
576.                                 # Add the detritivore classification


577.                                 classifications_and_relation.append(str(words_after_phrase[i][0] + " is a detritivore" + ' - ' + word))
578.                         else:
579.                             tag = 'sentence of interest'
580.                 # Check if phrase is in list of reflexive keywords
581.                 elif phrase in reflexive.keys():
582.                     words_before_phrase = [nm for nm in current_relation[:phrase_index] if 'scientificname' in nm[1]]
583.                     words_after_phrase = [nm for nm in current_relation[phrase_index:] if 'scientificname' in nm[1]]
584.                     # Check if there are more nouns before or after the keywords
585.                     if len(words_after_phrase) > len(words_before_phrase):
586.                         # Iterate through words after phrase and classify
587.                         for i in range(0, len(words_after_phrase)):
588.                             # Add omnivore classification
589.                             if reflexive[phrase] == 'omnivore':
590.                                 classifications_and_relation.append(str(words_after_phrase[i][0] + ' is an ' + reflexive[phrase] + ' - ' + phrase))
591.                             # Add herbivore classification
592.                             elif reflexive[phrase] == 'herbivore':
593.                                 classifications_and_relation.append(str(words_after_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
594.                                 # Add omnivore classification if carnivore and herbivore classification are present
595.                                 if str(words_after_phrase[i][0]) + ' is a carnivore' in classifications_and_relation:
596.                                     classifications_and_relation.append(str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
597.                             # Add carnivore classification
598.                             elif reflexive[phrase] == 'carnivore':
599.                                 classifications_and_relation.append(str(words_after_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
600.                                 # Add omnivore classification if carnivore and herbivore classification are present
601.                                 if str(words_after_phrase[i][0]) + ' is a herbivore' in classifications_and_relation:
602.                                     classifications_and_relation.append(str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
603.                             else:
604.                                 classifications_and_relation.append(str(words_after_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
605.                     else:
606.                         # Iterate through list of words before phrase
607.                         for i in range(0, len(words_before_phrase)):
608.                             # Add omnivore classification
609.                             if reflexive[phrase] == 'omnivore':
610.                                 classifications_and_relation.append(str(words_before_phrase[i][0] + ' is an ' + reflexive[phrase] + ' - ' + phrase))
611.                             # Add herbivore classification
612.                             elif reflexive[phrase] == 'herbivore':
613.                                 classifications_and_relation.append(str(words_before_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
614.                                 # Add omnivore classification if carnivore and herbivore classification are present
615.                                 if str(words_before_phrase[i][0]) + ' is a carnivore' in classifications_and_relation:


                                classifications_and_relation.append(str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                        # Add carnivore classification
                        elif reflexive[phrase] == 'carnivore':
                            classifications_and_relation.append(str(words_before_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
                            # Add omnivore classification if carnivore and herbivore classifications are both present
                            if str(words_before_phrase[i][0]) + ' is a herbivore' in classifications_and_relation:
                                classifications_and_relation.append(str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                        else:
                            # Include the deciding phrase so downstream splitting on ' - ' does not fail
                            classifications_and_relation.append(str(words_before_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))

        # If the length of the relation and scientific name list is equal to two
        elif len(current_relation) == 2:
            words_before_phrase = [nm for nm in current_relation[:phrase_index] if 'scientificname' in nm[1]]
            words_after_phrase = [nm for nm in current_relation[phrase_index:] if 'scientificname' in nm[1]]
            classifications_and_relation.append(sentence)
            # Add the final classification depending on whether the phrase comes before or after the scientific name
            if phrase in reflexive.keys() and current_relation.index(phrase) == 1:
                for i in range(0, len(words_before_phrase)):
                    classifications_and_relation.append(str(words_before_phrase[i][0] + ' is an ' + reflexive[phrase] + ' - ' + phrase))
            elif phrase in reflexive.keys() and current_relation.index(phrase) == 0:
                for i in range(0, len(words_after_phrase)):
                    classifications_and_relation.append(str(words_after_phrase[i][0] + ' is an ' + reflexive[phrase] + ' - ' + phrase))
            elif phrase in detritivore and phrase in righttoleft:
                for k in range(0, len(words_after_phrase)):
                    classifications_and_relation.append(current_relation[1][0] + " is a detritivore" + ' - ' + phrase)
            else:
                tag = 'Sentence of interest'
        else:
            tag = 'Sentence of interest'

        # Check if the length of the input is greater than one
        if len(classifications_and_relation) > 1:
            # Remove duplicates
            classifications_and_relation = list(unique_everseen(classifications_and_relation))
            # Add the relation and its classifications to the overall list
            all_final_classifications.append(classifications_and_relation)

    return all_final_classifications


# Part 7: Function to Clean Results and Output to a Text File

# Function that cleans the results and outputs them to a file
def send_classifications_to_file(classifications, output_file_name, analyzed_file_name):
    # List of triplets to be written to the file
    final_classification_triplets = []
    # Add the name of the analyzed file
    final_classification_triplets.append((analyzed_file_name, '--', '--'))
    # Iterate through the final classifications for each relation
    for classification in classifications:
        # List of the final classifications for this relation
        categories = [x.split('-')[0] for x in classification[1:]]
        # Iterate through each final classification
        for classified_noun in list(set(categories)):
            # List of the decider(s) for each classification
            duplicate_classifications = [y.split('- ')[1].strip() for x in classifications for y in x if classified_noun in y]
            # Add a triplet of the relation, the classified noun and the decider to the list of triplets
            final_classification_triplets.append((classification[0].strip(), classified_noun.strip(), list(set(duplicate_classifications))))
    # Write to the output file, checking first whether it already exists
    if os.path.isfile(output_file_name) == False:
        # Add a triplet that represents the header to the triplet list
        final_classification_triplets.insert(0, ('Ollie Relation', 'Final Classification', 'Classification Decider'))
        with open(output_file_name, 'w') as f:
            writer = csv.writer(f, delimiter='\t')
            writer.writerows(final_classification_triplets)
    else:
        # Append to the file if it already exists
        with open(output_file_name, 'a') as f:
            writer = csv.writer(f, delimiter='\t')
            writer.writerows(final_classification_triplets)

    return 'Outputs written to file'


# Part 8: Analyzing Files

# Iterate through the list of files to be analyzed
print('Analyzing files')
for file in files_to_be_analyzed:
    print(file)
    extracted_text = ''
    # Open the file
    document = fitz.open(file)
    # Iterate through the pages in the file
    for page in document:
        # Extract the text from each page of the document
        texts = page.getText('text')
        # Add the extracted text to the text object
        extracted_text = extracted_text + texts
    # Remove dash artifacts from the extracted text
    extracted_text = extracted_text.replace(" -", "")
    extracted_text = extracted_text.replace(" - ", "")

    extracted_text = extracted_text.replace("- ", "")
    # Split the extracted text at "References" and keep all the text above it
    extracted_text = extracted_text.split("References", maxsplit=1)[0]
    # Extracted text file name
    extracted_text_file_name = file.split('.')[0] + '-extracted-text.txt'
    # Ollie extractions file name
    relation_file_name = file.split('.')[0] + '-relations.txt'
    # Add a new line between every sentence
    text_with_new_lines = re.sub(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', '\n', extracted_text)
    # Replace newlines followed by lowercase characters with a single space
    replace_new_lines_between_lowercase_characters = re.sub(r'(\n+)(?=[a-z])', " ", text_with_new_lines)
    # Save the extracted text to a file
    with open(extracted_text_file_name, 'w') as out:
        out.write(replace_new_lines_between_lowercase_characters)
    # Save the Ollie extractions to a file
    with open(relation_file_name, 'w') as f:
        # Run the Ollie tool on the extracted text file
        subprocess.run(["java", "-Xmx512m", "-jar", ollie_file, extracted_text_file_name], stdout=f)
    # Open the file containing the relations
    with open(relation_file_name) as f:
        relations = []
        for line in f:
            # Keep only the lines that start with a confidence score
            if re.match(r"^\d+.*$", line):
                # Add the found lines to the relations list
                relations.append(line)
    # Replace characters and spaces added by Ollie
    relations = [relation.replace("\n", "") for relation in relations]
    relations = [relation.replace(")[enabler", ", ") for relation in relations]
    relations = [relation.replace(")[", ", ") for relation in relations]
    relations = [relation.replace("attrib=", " attribute ") for relation in relations]
    # Search through the relations for keywords
    relations_with_keyword = searching_ollie_results(relations)
    print(relations_with_keyword)
    # Analyze the relations and add a final classification
    classifications = identify_and_classify(relations_with_keyword)
    print(classifications)
    # Write the classifications to a file
    send_classifications_to_file(classifications, output_file, file)
