Reconstructing Manual Information Extraction with DB-to-Document Backprojection: Experiments in the Life Science Domain

Mark-Christoph Müller, Sucheta Ghosh, Maja Rey, Ulrike Wittig, Wolfgang Müller and Michael Strube
Heidelberg Institute for Theoretical Studies gGmbH, Heidelberg, Germany
{mark-christoph.mueller, sucheta.ghosh, maja.rey, ulrike.wittig, wolfgang.mueller, [email protected]

Abstract

We introduce a novel scientific document processing task for making previously inaccessible information in printed paper documents available to automatic processing. We describe our data set of scanned documents and data records from the biological database SABIO-RK, provide a definition of the task, and report findings from preliminary experiments. Rigorous evaluation proved challenging due to lack of gold-standard data and a difficult notion of correctness. Qualitative inspection of results, however, showed the feasibility and usefulness of the task.

1 Introduction

Research results from the life sciences are mainly published in the form of written journal or conference papers, even though these results often take the form of measurements of experimental parameters, which would more appropriately be stored in a structured, machine-readable form. While there is some tendency towards directly publishing experimental data, e.g. on SourceData (Liechti et al., 2016) or (for environmental data) PANGAEA [1], this is not the norm yet, and does not help with the huge body of data already published in the conventional literature. It is common practice in the life sciences, therefore, to manually extract information (including measurements and the experimental conditions underlying them) from natural language documents, and to use it to populate biological databases. This process is called biocuration (International Society for Biocuration, 2018) and comprises, for every document, 1) identification and mark-up of curatable information, 2) data extraction, normalization, and consolidation, and 3) database insertion. Despite constant improvements in NLP technology, biocuration involves significant human labor (mostly reading) (Oughtred et al., 2019; Huang et al., 2020; Wu et al., 2020; Abdelhakim et al., 2020), because data quality (i.e. correctness and integrity) has priority over quantity (i.e. more quickly available, but potentially less reliable, data), and the error rates of current NLP systems are still considered too high (Karp, 2016).

For reasons of ergonomics and ease of handling (Buchanan and Loizides, 2007; Köpper et al., 2016; Clinton, 2019), the identification and mark-up step often involves paper printouts and highlighter pens [2], like in the example page in Figure 1.

[1] www.pangaea.de
[2] This is true for our own group, and has been corroborated in 2016 by an informal, unpublished survey among 21 curators from 15+ biological databases. The survey showed that a considerable number of curators rely on paper printouts for close reading and / or highlighting of important information.

Figure 1: Page with mark-up (best viewed in color).

Proceedings of the First Workshop on Scholarly Document Processing, pages 81–90, Online, November 19, 2020. ©2020 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17

As mere intermediate products of the curation process, the manually highlighted printouts are only required until all data from the respective document has been curated, and they will normally be archived afterwards. We argue, however, that the printouts contain even more information which curation simply does not make full use of: First, some document sections, although containing highlighting, will not lead to the creation of a record in the biological database (see our results in Section 5.3). Yet, this highlighting can still be regarded as a kind of relevance annotation, produced by life science domain experts through attentive, task-oriented reading. Obviously, this information should be useful, e.g. for the analysis of how important information is dispersed over a scientific document. Second, for those database records that are created from highlighted document sections, the reference to that section is normally not preserved. Again, an obvious way to use this information is to allow users of the biological database to visually trace the record to its source in the document, including the original context.

In this paper, we describe our approach towards re-purposing scientific document printouts which were manually highlighted during biocuration. More precisely, our research question is: Given records of curated information from the database and the original, scanned source document, (to what degree) can we recover the document section that a particular record was extracted from? We consider this to be a novel scientific document processing task, and propose to refer to it as DB-to-document backprojection.

The remainder of the paper is structured as follows. In Section 2 we describe the data basis of our work. Section 3 introduces the highlighted text extraction task, which we consider as self-contained and only loosely linked to the main task. Section 4 deals with the actual DB-to-document backprojection task, provides a precise definition, and describes our processing steps. Section 5 presents some preliminary experiments, results, and error analysis. Initially, this section will also discuss our approach to evaluation. In Section 6 we discuss some related work, and Section 7 contains our conclusions and directions for the future. Note that, although our data is from the life sciences, the task is relevant for all domains where manual information extraction is performed on natural language documents (like e.g. in Lipani et al. (2014), where information is extracted from IR research papers in the form of machine-readable 'nanopublications').

2 Data

The work in this paper is based on two related data sets, which have been collected in the SABIO-RK Biochemical Reaction Kinetics Database project [3]. SABIO-RK is a curated database containing structured information about biochemical reactions and their corresponding kinetics (Wittig et al., 2017, 2018).

The document data set is an electronic version of our archive of 6,000+ manually highlighted printouts of documents from the life science domain, which have been curated in the 10+ years of our database's existence. Over the years, numerous different curators were involved in the manual mark-up. Different highlighter colors were used, sometimes even within the same document (see Figure 1). In case of equivalent information appearing repeatedly in the same document, curators generally attempted to be economical and to avoid redundancy by highlighting only the most appropriate appearance, which is often, but not always, the first appearance. While the mark-up was performed in a completely unrestricted manner (cf. below), in the vast majority of cases, highlighting was applied directly to words or lines (cf. Figure 1), which greatly helped in extracting the highlighted text (cf. Section 3). In some rare cases, curators selected whole sections by drawing a vertical line at the section's margin. Also, data in tables was sometimes highlighted on the cell level, while in other cases, only the column header, the table header, or even the table caption was highlighted. We created an electronic version of the document collection by scanning and OCR-processing all papers [4], which resulted in a sandwich PDF for each document with the (partially highlighted) background superimposed with the extracted text. OCR was performed with commercial software (Alaris Capture Pro), which was used out-of-the-box. The total number of tokens in the 98 documents is 630,153, with 6,430 tokens/document on average.

[3] http://sabio.h-its.org/
[4] For the experiments reported in this paper, we only use a subset of 98 documents.

The second data set is the record data set, which contains measurements of kinetic parameters that were extracted from individual documents from the document collection in the course of manual curation. Each of the 2,916 records in this data set is linked to exactly one source document (via its PubMed ID), but no lower-level links (to pages or lines) exist. Each document, in turn, can be linked to an arbitrary number of records (29.76 records/document on average). It has to be noted that the above count of 2,916 records contains a considerable number of multiple counts. This is true in particular for records of type experimental condition (cf. below), and is due to the fact that often, several measurements are performed under identical experimental conditions. For scoring and evaluation, however, this does not make a difference, because we conflate semantically identical records before analysis. There are two main types of records, experimental condition and parameter. Each record consists of three to six attribute-value (a-v) pairs. Figures 2 and 3 show one example of each type of record.

conditionName: 'pH',
startValue:    7.7,
buffer:        '0.10 M Tris-HCl,
                100 mM KCl, 1 mM DTT,
                4.0 mM MgCl2,
                10% Glycerol'

Figure 2: Record of type experimental condition with three a-v pairs featuring one numeric, one atomic string, and one complex string value.

parameterName: 'Km',
unitName:      'µM',
startValue:    123,

consist of document pages from which pixels that were detected as belonging to text were removed by inpainting (see 'Page background image' in the lower left part of Figure 4). pdftotext, on the other hand, is used to extract the text that was previously recognized by OCR. It produces one XML file for the document, incl. bounding boxes on the token-level. These tokens reflect the original document layout, but come in correct reading order even for multi-column documents. The second step makes use of some simple image processing. As described above (Section 2), document highlighting can come in any color, so searching for areas of any particular color (like e.g. yellow) is not an option. Instead, our algorithm combines the facts that 1) highlighting is always non-grey and 2) shades of grey in the RGB color model are characterized by identical, or at least highly similar, values in the R, G, and B components.
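In essence, this grey test amounts to thresholding the spread between a pixel's color channels. The following minimal sketch illustrates the idea; the function name and the tolerance value are our own illustrative choices, not the authors' implementation:

```python
def is_highlighted(pixel, tolerance=20):
    """Treat a pixel as potential highlighter ink if its R, G, and B
    values are NOT near-identical, i.e. if it is not a shade of grey."""
    r, g, b = pixel
    return max(r, g, b) - min(r, g, b) > tolerance

# A grey pixel, a yellow highlighter pixel, and a white pixel:
pixels = [(128, 128, 128), (250, 235, 60), (255, 255, 255)]
print([is_highlighted(p) for p in pixels])  # [False, True, False]
```

Note that black text and white paper are both (near-)grey and therefore fall below the threshold; only saturated colors such as highlighter yellow, green, or pink pass the test, regardless of which hue a curator happened to use.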