<<

Literature-driven Curation for Taxonomic Name Databases

Hui Yang, Alistair Willis, David R. Morse, Anne de Roeck Department of Computing and Communications {Hui.Yang, Alistair.Willis, David.Morse, Anne.deRoeck}@open.ac.uk

tory museums and other libraries’ collections. Abstract However, due to the lack of semantic metadata, the tasks of finding, extracting, and managing the Digitized biodiversity literature provides a wealth knowledge contained in these volumes is still a of content for using biodiversity knowledge by primarily manual process and remains extremely machines. However, identifying taxonomic names difficult and labour-intensive. The difficulty in and the associated semantic metadata is a difficult accessing the existing taxonomic literature is a and labour intensive process. We present a system severe impediment to research and delivery of to support human assisted creation of semantic metadata. Information extraction techniques auto- the subject’s benefits (Godfray, 2002). Semantic matically identify taxonomic names from scanned tagging of organism mentions in biodiversity documents. They are then presented to users for literature has recently been regarded as a pivotal manual correction or verification. The tools that step to facilitate -aware text mining support the curation process include taxonomic applications, including -specific docu- name identification and mapping, and community- ment retrieval (Sarka, 2007), linking biodiversity driven taxonomic name verification. Our research databases (White, 2007), and semantic enrich- shows the potential for these information extrac- ment of biodiversity articles (Penev et al, 2010). tion techniques to support research and curation in Semantic web is a potential solution to the disciplines dependent upon scanned documents. problems of data fragmentation and knowledge management if the appropriate metadata can be 1 Introduction created (Page, 2006). However, manually creat- Our understanding of the natural world is rapidly ing this metadata is an enormous and unrealistic increasing. At the same time, issues in biodiver- task. The verification process of checking the sity are shown to be relevant to many important validity of a taxonomic name is a specialist task policy areas, such as climate change, food secu- requiring expert skills. rity and habitat management. Biological taxon- In this paper, we present a semi-automated omy is a discipline that underlies all of these ar- system that aims to develop a literature-driven eas; understanding species, their behaviours and curation process among practicing taxonomists, how they interact is of critical importance in be- by providing tools to help taxonomists identify ing able to manage commercially important land and validate appropriate taxonomic names from and environment use (SCBD, 2008). the scanned historical literature. Potential taxo- A major difficulty facing the curation of com- nomic names are automatically extracted from prehensive taxonomic databases is incorporating scanned biodiversity documents with their asso- the knowledge that is currently contained only in ciated contextual information. These are pre- the printed literature, which spans well over one sented for validation to taxonomic curators via hundred million pages. Much of the literature, an online web service. The manually verified or especially old taxonomic monographs that are corrected names can then be indexed, and the both rare documents and extremely valuable for semantic data stored, using the Darwin Core bio- taxonomic research, are almost entirely in paper- diversity data standard. print form and are not directly accessible elec- tronically. Recent large-scale digitization pro- 2 System Framework 1 jects like the Biodiversity Heritage Library Figure 1 shows the process for obtaining meta- (BHL) have worked to digitize the (out of copy- data and curating taxonomic names. Publishers right) biodiversity literature held in natural his- who specialise in biological taxonomy often add appropriate metadata (Penev et al. 2010), but for 1 http://www.biodiversitylibrary.org scanned literature, this is generally not available.

25 Proceedings of the Joint Workshop on NLP&LOD and SWAIE, pages 25–32, Hissar, Bulgaria, 12 September 2013. Biodiversity Taxonomists Scratchpads Literature OCR text neighbourhood Citizen scientists print OCR Error format Correction

digital Taxonomic Name Taxonomic Name Judgement format Identification Mapping publish collect Collection

Plain text Taxonomic check databases BHL create report Taxonomic Name Semantic Web Validation Service Layer

Figure 1. System framework of literature-driven curation for taxonomic databases

The image files from scanned literature are proc- assigned to the same species (Remsen, 2011). essed through the ABBY FineReader or Pri- For example, Actinobacillus actionomy , Actino- meReader Optical Character Recognition (OCR) bacillus actionomyce , and Actinobacillus action- software to generate a plain text file. omycetam could all be variants of the same Next, we identify those tokens in the plain text name. In addition, scanned documents can cause that may be taxonomic names (possibly contain- many OCR errors due to outdated fonts, complex ing errors through imperfect OCR, or other tran- terms, and aspects such as blemishes and stains scription errors). We use an information extrac- on the scanned pages. Wei et al (2010) have ob- tion tool for Named Entity Recognition (NER) served that 35% of taxonomic names in scanned based on Conditional Random Fields (CRFs) documents contain an error, and this creates dif- (Lafferty et al., 2001). The detected names are ficulties for term recognition (Willis et al. 2009). then mapped onto unique identifiers across a For example, erroneous OCR might propose ‘ o’ range of taxonomic databases such as uBio in place of ‘ c’ for the taxon Pioa , not a known Name Bank 2, Encyclopaedia of Life (EoL) 3, and name, rather than Pica (European magpie). Catalogue of Life (CoL) 4. Approaches to taxonomic name recognition Taxonomic names that cannot be found in (TNR) span a broad range from traditional dic- online databases will be validated manu- tionary lookup (Gerner et al., 2010; Koning et ally. Potential unknown taxonomic names are al., 2005; Leary et al., 2007) combined with lin- presented for validation or correction to the re- guistic rule-based (Sautter et al., 2006) to pure search community via the Scratchpads social machine learning (Akella et al., 2012). network 5 (e.g., professional taxonomists, experi- In our system (Figure 1), the first stage is enced citizen scientists and other biodiversity identifying potential taxonomic names. We used specialists) in a community-driven verification a supervised learning algorithm implemented by process. The newly verified taxonomic name, the CRF++ Package 6 . Compared to other ma- along with additional metadata recording the user chine learning algorithms, CRFs are good at se- who verified the name, its context and biblio- quence segmentation labeling tasks such as graphic details is published as a semantic web Named Entity Recognition, which have been service layer (currently a Scratchpads portal). shown to be effective for biological entity identi- fication in the biomedical literature (Yang et al. 3 Taxonomic Name Recognition 2008). They can be easily adapted to similar tasks like Taxonomic Name Recognition (TNR). Automatic identification of taxonomic names from biodiversity text has attracted increasing 3.1 Dataset Preparation and Annotation research interest over the past few years, but is difficult because of the problems of erroneous To assess the performance of the CRFs on taxo- transcription and synonymy. There may be or- nomic texts, we generated training and test sets thographic and other term variation in names from scanned volumes between 1879 and 1911 from the Biodiversity Heritage Library (BHL). Annotations were carried out using the BRAT 7 2 http://www.ubio.org/ Rapid Annotation Tool (BRAT) (Stenetorp et 3 http://eol.org/ 4 6 http://www.catalogueoflife.org/ http://crfpp.googlecode.com/svn/trunk/doc/index.html 5 7 http://scratchpads.eu http://brat.nlplab.org/

26 al., 2012) to mark up taxonomic elements in bio- 3.2 Taxonomic Name Identification diversity literature. All mentions of taxonomic To train the CRF-based recogniser, we used a names in the text were manually tagged and variety of linguistic and semantic features to linked to identifiers in external taxonomic data- characterise the semantics of taxonomic names. bases (i.e. uBio Name Bank, Catalogue of Life, The features used for taxon recognition were and Encyclopaedia of Life) where possible. An- grouped into the following five categories: notated mentions were also assigned to several • Word-token Feature. This type of feature categories that indicate specific linguistic or se- includes word lemma, Part-of-Speech (POS) mantic features (e.g. taxonomic rank, genus ab- tag, and chunk tag of the word, which are ob- breviation or omission) for evaluation analysis. tained from the Genia Tagger 13 . The manually annotated dataset consists of: • (a) Training data. We selected three BHL vol- Context Features. The features for the umes of different groups: Coleoptera lemma and POS tag of the three neighbour- (Beetles) 8, Aves (Birds) 9, and Pisces (Fish) 10 ing words before and after the current word as the training data to build a CRF-based token are also considered. • taxon recogniser. The volume text used for Orthographic Features. Taxonomic names the annotation is clear text , i.e. text from tend to be case sensitive, e.g. Agelaus phae- which OCR errors are removed, which was nicio . Moreover, much taxonomic literature obtained from the INOTAXA Project 11 . Ta- employs abbreviations as standard like A. ble 1 reports the statistical annotation infor- phaenicio . Some special tokens, e.g. Greek mation about these three volumes. symbols ( α, β, γ) and Roman numbers ( I., II ., iv .) also frequently occur in the text. #Pages #Taxonomic Names • Morphologic Features. Some taxonomic Coleoptera 324 7,264 names contain typographic ligatures, e.g., æ Aves 553 8,354 (ae ), œ ( oe ), Æ ( AE ). We observed that some Pisces 234 4,915 mentions contain the same suffix strings Table 1. The statistics on the training data such as -us , -um , -eus . (b) Test data. The dataset used for the evalua- • Domain-specific Features. Taxonomic rank tion of the taxon recogniser is another BHL markers and their abbreviations, e.g., spe- volume about Coleoptera (Beetles) 12 . Taxo- cies , genus , sp ., subg ., fam ., etc., frequently nomic names are annotated in two datasets of occur in the text preceding taxonomic names. different quality text, one is clear text (high- This is a binary property. Y if the word is a quality text) and the other is the original rank marker or O otherwise. OCR text with scanning errors (poor-quality The training data file for the CRFs consists of a text). The reason for building this compara- set of word token instances, each of which con- tive corpus is to estimate the impact of OCR tains a feature vector that is made up of five errors on taxon name recognition. The statis- groups of features described above together with tics about this corpus are given in Table 2. an entity class label – BIO tags. More taxonomic names are found in the

OCR text than the clear text because the Precision Recall F-measure OCR text includes page headings that may contain the scientific name of an organism. Clear text 0.9285 0.8642 0.8952 OCR text 0.4450 0.3716 0.4050 Clear Text OCR Text Table 3. The overall performance of taxonomic #Pages 373 373 name identification on a comparative dataset #Taxonomic Names 5,198 5,414 #Taxonomic Names -- 2,335 Performance Evaluation. The trained CRFs (with OCR error) (43.1%) were evaluated on the test corpus. We compared Table 2. The statistics on the test data the results of the clear text with those of the OCR text in order to test the OCR-error toleration ca- pability of the trained CRFs. As shown in Table 3, the trained CRFs performs well and achieves 8 http://www.biodiversitylibrary.org/ia/mobotbca_12_01_01 an F-measure as high as 0.8952 on the clear text 9 http://www.biodiversitylibrary.org/ia/mobotbca_03_01_00 10 http://www.biodiversitylibrary.org/ia/mobotbca_05_00_00 11 http://www.inotaxa.org/jsp/index.jsp 12 13 http://www.biodiversitylibrary.org/ia/mobotbca_12_04_03 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

27 PDF Image Page OCR Text

Correction window Target word

Figure 2. A web service for OCR error correction

(good-quality text). On the “dirty” OCR text the refer to different taxa (homonyms). Moreover, performance is worse and the F-measure drops to there is lexical and terminological variation 0.405. This shows that OCR errors are a potential among taxonomic names. Second, currently there threat that greatly affects the effectiveness of is not a complete taxonomic database that covers taxonomic name identification. Therefore an all the organisms in the world so multiple taxo- OCR error correction tool is necessary for nomic databases are needed to complement each searching and processing the OCR-scanned text. other. To resolve the problem of orthographic and OCR Error Correction. To reduce the im- term variations between taxonomic names, we pact of OCR errors on the identification of taxo- exploited a generic and effective cascaded nomic names, we developed a mechanism for matching method that consists of two stages: error checking and correction. Figure 2 shows a • screen shot of the web service 14 used to highlight Stage I - Exact Matching: string matching a potential taxon to a user. The left-hand panel is between original identified mentions and da- the image of the original page, and the right-hand tabase entries. If a name mention is a known panel is the corresponding text, which is ex- synonym in the curated list of a taxonomic tracted from the DjVu XML file created by the database, the unique identifier of a taxon en- OCR software. When a word is selected using try associated with the synonym will be as- the navigation content in the right-hand panel, a signed to the mention. It is possible that the small error-correction window pops up, and the mention might be mapped to synonyms of user is allowed to make possible modifications, different organisms. In these ambiguous cas- based on the enlarged image of the target word es, additional information such as the sur- appearing in the pop-up window. rounding context of the mention and the at- tributes of its neighboring mentions are 3.3 Taxonomic Name Mapping needed to help determine the selection of the most appropriate organism. Taxonomic name mapping or normalization is to • Stage II - Rule-based approximate match- map the detected mentions in the text into stan- ing: dardised taxonomic identifiers (Gerner et al. First, a set of transformation rules that (2010). It aims to generate correct lists of unique capture morphologic features of name varia- identifiers (typically from referent taxonomic tions are generated to produce more potential databases) for each taxonomic name. There are extended mentions. Second, for each un- two potential factors that affect mapping accu- matched mention filtered at the first stage, racy. First, taxonomic names are not completely the possible extended candidate names cre- stable, and may change due to taxonomic revi- ated by the transformation rules (described sion. There may be multiple names (synonyms) later) are sent to the taxonomic databases for the same organism, and the same name may again to find the possible matched synonyms of a known taxon entry.

14 http://mcs-notes2.open.ac.uk/CommunityCuration/Book1.nsf/Concept1.xsp

28 Construction of transformation rules. Ac- cies has exactly one entry with a unique identi- cording to the observations on our manually- fier, a name classified as scientific name (i.e. the annotated taxonomic dataset, we roughly group “correct” canonical name), as well as other pos- name variations into the four categories below: sible variants (e.g., synonyms, common misspell- (a) Ligature replacement: typographic liga- ings, or retired names if the organism has been tures (e.g., æ, œ, Æ, Œ, etc. ) that appear in reclassified). Moreover, these three databases taxonomic names of the old literature are provide relevant web services to users for online generally replaced with the corresponding search of taxonomic names. For each candidate two or more consecutive letters. For exam- name, we send a name query to different data- ple, Agelæus phœniceus -> Agelaeus phoe- bases, and automatically extract the relevant niceus , Dendrœca -> Dendroica unique identifier from the returned result. (b) Latin declension: the scientific name of an Mapping Results. We collected a total of organism is always written in either Latin or 8,687 distinct candidate names from four anno- Greek. A Latin noun can be described in dif- tated BHL volumes and mapped them to the cho- ferent declension instances (e.g., First- sen taxonomic databases. Table 4 shows the sta- declension, Second-declension) by changing tistical information of name matches in the indi- its suffix substring. For instance, puellae , vidual databases. It is interesting to note that the puellarum , puellis , puellas , can be normal- matched names in EoL usually can be found in ized as the same root word Puella . uBio, whereas CoL can find some names that do (c) Parenthesized trinomial names: some tax- not appear in either uBio or EoL. Nearly a half of onomic names consist of three parts. These the names (4, 273 names) could not be found in are usually represented as a species names any of the taxonomic databases. This suggests with a subgenus name contained within pa- that machine-learning based TNR can find quite rentheses, e.g., Corvus (Pica) beecheii , Tan- a lot of new names that a simple dictionary ap- agra (Aglaia) diaconus . However, the paren- proach cannot identify. Moreover, biodiversity thesized subgenus name is not used very literature is a potentially useful resource to en- much, and some taxonomic databases do not rich the existing taxonomic databases. contain the information at this low rank lev- uBio EoL CoL el. Therefore, the subgenus name can be ig- Mapped Names 3,565 2,893 3,354 nored when mapping, e.g., Corvus (Pica) (total names: 8,687) (41.1%) (33.3%) (38.6%) beecheii -> Corvus beecheii Table 4. Name mapping in taxonomic databases (d) Taxon variety names: taxon variety names are another special case, which can appear in 4 Community Metadata Collection various name forms like Peucæa æstivalis arizonæ ; Peucæa æstivalis var. arizonæ ; Biodiversity communities have come to the con- Peucæa æstivalis, var arizonæ; Peucæa æsti- sensus that converting unstructured biodiversity valis, β. Arizonæ ; even Peucæa arizonæ due literature into semantically-enabled, machine- to taxonomic inflation in which known sub- readable structured data is essential to use the species are raised to species as a result in a currently highly fragmented data sources. The change in species concept (Isaac et al. 2004). main semantic metadata system is the Darwin Core biodiversity data standard 15 , maintained by A set of linguistic rules are expressed as regu- the Biodiversity Information Standards group lar expressions to record the syntactic and se- 16 (TDWG) , and based on Dublin Core. The main mantic clues found in the name variations dis- objects in Darwin Core represent an organism’s cussed above. These rules are used to transform scientific name, information pertaining to its the original mentions to possible extended candi- classification, and the geographical and geologi- dates for string matching in taxonomic databases. cal contexts of the organism. External taxonomic databases. To link more For this research, key information can be identified mentions to existing external taxo- stored in the dwc:Taxon class, which has terms nomic databases, we chose three widely-used defined for the taxonomic name itself large-scale taxonomic databases: uBio Name (dwc:scientificName ), as well as a unique iden- Bank, Encyclopaedia of Life (EoL), and Cata- tifier, the Life Sciences ID (LSID) to locate the logue of Life (CoL), which separately curate 4.8 million, 1.3 million, and 1.6 million taxonomic names respectively. In each database, each spe- 15 http://rs.tdwg.org/dwc/index.htm 16 http://www.tdwg.org/

29 Link to CoL

Link to BHL

Page citation

Figure 3. A sample web page to show how extracted semantic features link to the BHL and external taxonomic databases taxon across remote databases ( dwc:taxonID ) 2013-06-15 and various terms giving taxonomic information and provenance. For example, the metadata iden- Anthus tifying the LSID and publication data for the correndera species Anthus correndera might be represented using the standard Darwin Core terms: London Med. Repos., 15: 308. urn:lsid:catalogueoflife.org:

taxon:f000e838-29c1-102b-9a4a- Further contextual information can be stored, 00304854f820:col20120721 providing species description information includ- Anthus ing morphological features, biogeographic distri- correndera bution, and ecology. Figure 3 shows a web Aves page 17 corresponding to a species in which the Anthus contexts surrounding the occurrence of the target correndera mention are extracted from the text. Each piece of evidence is given a bibliographic citation that London Med. is linked to the respective copy of the referring Repos., 15: 308. page (here, the BHL). Unique database identifi- ers and hyperlinks to external taxonomic data- The basic Darwin Core terms can be extended bases are provided on the web page if possible. to represent the information obtained via the Connections to external databases increase the original document and the curation tools. Labels understanding and analysis of the behaviour of should be used to represent information about the the target species. These bibliographic linkages verified form of the name and the identity of the allow the system to identify and track back the verifier, with the verifier's Scratchpad login raw data across the range of remote databases. name being the obvious choice. Adding the name The metadata can potentially encode many of the verified form and the verifier with appro- semantic aspects of the data. Identified taxo- priate labels to the metadata nomic names and hyperlinks to repositories will (dwc:nameVerifiedBy and dwc:dateVerified ) improve species-specific document retrieval. En- would then give: coding different names for organisms will im-

prove synonym detection so reconciliation tech- urn:lsid:catalogueoflife.org: niques are needed to connect multiple names. taxon:f000e838-29c1-102b-9a4a- Also, linkages to the unique identifiers of organ- 00304854f820:col20120721 isms facilitate the reconciliation process. Future Anthus work will consider citation information, which correndera improves the traceability of naming authorities. Scratchpad user: 17 Michael Smith http://www9.open.ac.uk/Web_NER/mobotbca_03_01_00/3982.html

30 Link to BHL Page

Fig. 4-2

Fig. 4-3 Fig. 4-1 Figure 4. A sample web page for taxonomic name verification (4-1: Extracted context evidence from the text; 4-2: A multi-option form for judgment collection; 4-3: The distribution of human judgments)

The validation information is collected in a

4.1 Taxonomic Name Validation back-end MySQL database in a metadata format Taxonomic name validation task is to present that contains the curator’s name, verification unknown names for human validation. Validat- time stamp, the target name, the associated pub- ing taxonomic names is a specialist process, re- lication, along with appropriate page citations quiring extensive human involvement and exper- and associated URI page linkages to make the tise. Non-professional taxonomists and citizen support evidence traceable. By ensuring that this scientists are an essential part of this effort. We data is available to the community via the seman- aim to demonstrate how small, lightweight plug- tic web service layer, the judgment is exposed to ins integrated to existing web-based collabora- the community for further validation or modifi- tion tools can facilitate the semantic annotation cation (distribution illustrated in Figure 4-3). of open biodiversity resources via crowdsourcing Our aim in the medium term is to link the val- techniques. idation task to search results within the Scratch- Scratchpads (Smith et al. 2009) are a content pad portal 19 . This will allow us to investigate management system that is optimised for han- whether the output of document searching can be dling biological taxonomy data. Scratchpads are used as a reward for carrying out the validation widely used amongst professional and amateur exercise, and so whether the task can be pre- taxonomists, and so are a useful portal for valida- sented in a (relatively) unobtrusive manner to tion. users. Our curation web service is a Scratchpads plug-in. Text for validation is selected via a sim- 5 Conclusions and Future Work 18 ple recommender system (Figure 4). Users are Increasing numbers of older documents are presented with one or more potential taxonomic scanned and made available online, through digi- names found by the CRFs, as text "snippets" tal heritage projects like BHL. It will become containing the proposed name with the surround- more important to annotate those documents with ing context of the original scan (Figure 4-1). To semantic data in order to curate and manage the collect specialists’ judgments, a multiple-option information contained in the documents. form (Figure 4-2) is used to request a judgment We have described how information extraction of whether the text snippet represents a poten- techniques can be used as part of a curation sys- tially new taxon, a synonym or a name variant of tem to improve the mechanisms for collecting an existing organism.

18 19 http://www9.open.ac.uk/ComTax/mobotbca_03_01_00/4136.html http://taxoncuration.myspecies.info/node/77

31 this metadata. Although we have focused on ceedings of the International Conference on Ma- identifying taxonomic names, the same tech- chine Learning (ICML-2001) , pages 282-289. niques could be used to recognise any data of P. R. Leary, D. P. Remsen, C. N. Norton, D. J. Patter- interest, such as geographical data in historic son, and I. N. Sarkar. 2007. uBioRSS: tracking land documents, or proper names in census data. taxonomic literature using RSS. Bioinformatics, The critical part of the system, of course, is to be 23(11):1434–1436. able to find suitable user groups to provide the R. D. M. Page. 2006. Taxonomic Names, Metadata, appropriate semantic markup, as the data can and The Semantic Web. , rapidly become very large. 3, pp. 1-15. The semantic web can provide a portal to this L. Penev, D. Agosti, T. Georgiev, et al. 2010. Seman- data, if the metadata can be reliably collected. tic tagging of and semantic enhancements to sys- We believe that IE-supported curation techniques tematics papers: ZooKeys working examples. can be used to bring this collection about. ZooKeys, 50:1–16. Future work includes: (1) The datasets were annotated by one computer scientist. It would be D. Remsen. 2011. Biodiversity Informatics - GBIFs interesting to compare the annotated data with role in linking information through scientific names. In The Symposium of Anchoring Biodiver- the verification results from biodiversity experts. sity Information: From Sherborne to the 21st Cen- (2) We need more annotated OCR text for the tury and Beyond. development of an automated OCR-error correc- tion tool and a TNR tool built for OCR text. (3) I. N. Sarkar. 2007. Biodiversity informatics: organiz- Our project is in its early stages and requires ing and linking information across the spectrum of life. Briefings in Bioinformatics, 8(5):347-357. more time for the collection of validation judg- ments; to conduct the evaluation of the validation G. Sautter, K. Böhm, and D. Agosti. 2006. A combin- tool and to analyse the validation results. ing approach to find all taxon names (FAT) in leg- acy biosisystematics literature. Biodiversity infor- Acknowledgments matics, 3:41-53. The work was funded by JISC through the Digi- Secretariat of the Convention on Biological Diversity tal Infrastructure programme, Managing Re- (SCBD). 2008. Guide to the global taxonomy ini- search Data call. We thank Mr Jef Lay for devel- tiative . CBD, Technical Report. oping the OCR error correction tool shown in V. Smith, S. Rycroft, K. Harman, B. Scott, and D. Figure 2. Roberts. 2009. Scratchpads: A Data-Publishing Framework to Build, Share and Manage Informa- References tion on the Diversity of Life. BMC Bioinformatics 10(14):S6 . L. M. Akella, C. N. Norton, and H. Miller. 2012. Net- iNeti: discovery of scientific names from text using P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Anani- machine learning methods. BMC Bioinformatics, adou, and J. Tsujii. 2012. brat: a Web-based Tool 13 (211):1471-2105. for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012 , pag- M. Gerner, G. Nenadic, and C. M. Bergman. 2010. es 102-107. LINNAEUS: A Species Name Identification Sys- tem for Biomedical Literature. BMC Bioinformat- Q. Wei, P. B. Heidorn, and C. Freeland. 2010. Name ics 11:85. Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL). In Proceed- H. C. Godfray. 2002. Challenges for taxonomy. Na- ins of 2010 iConference, pages 284-288. ture, 417 (6884):17-19. R. J. White. 2007. Linking Biodiversity Databases. N. J. Isaac, J. Mallet, and G. M. Mace. 2004. Taxo- Systematics Association Special Volume, 73:111– nomic inflation: its influence on macroecology and 128. conservation. Trends in Ecology and Evolution. 19: 464–469. A. Willis, D. Morse, A. Dil, D. King, D. Roberts, and C. Lyal. 2009. Improving search in scanned docu- D. Koning, N. Sarkar, and T. Moritz. 2005. Taxon- ments: Looking for OCR mismatches. In Proceed- grab: extracting taxonomic names from text. Biodi- ings of the workshop on Advanced Technologies versity Informatics, 2:79–82. for Digital Libraries , 2009. J. Lafferty, A. McCallum, and F. Pereira. 2001. Con- H. Yang, G. Nenadic, and J. Keane. 2008. Identifica- ditional random fields: Probabilistic models for tion of Transcription Factor Contexts in Literature segmenting and labeling sequence data. In Pro- Using Machine Learning Approaches. BMC Bioin- formatics 9:S11.

32