New Methods for Metadata Extraction from Scientific Literature

Systems Research Institute, Polish Academy of Sciences Dominika Tkaczyk ICM, University of Warsaw New Methods for Metadata Extraction from Scientific Literature Ph.D. Thesis Supervisor: Professor Marek Niezgódka, PhD, DSc arXiv:1710.10201v1 [cs.DL] 27 Oct 2017 Auxiliary supervisor:Lukasz Bolikowski, PhD ICM, University of Warsaw November 2015 Abstract Spreading the ideas and announcing new discoveries and findings in the scientific world is typically realized by publishing and reading scientific literature. Within the past few decades we have witnessed digital revolution, which moved scholarly communication to electronic media and also resulted in a substantial increase in its volume. Nowadays keep- ing track with the latest scientific achievements poses a major challenge for the researchers. Scientific information overload is a severe problem that slows down scholarly communication and knowledge propagation across the academia. Modern research infrastructures facilitate studying scientific literature by providing intelligent search tools, proposing similar and related documents, building and visualizing interactive citation and author networks, assessing the quality and impact of the articles using citation-based statistics, and so on. In order to provide such high quality services the system requires the access not only to the text content of stored documents, but also to their machine-readable metadata. Since in practice good quality metadata is not always available, there is a strong demand for a reliable automatic method of extracting machine- readable metadata directly from source documents. Our research addresses these problems by proposing an automatic, accurate and flexible algorithm for extracting wide range of metadata directly from scientific articles in born- digital form. Extracted information includes basic document metadata, structured full text and bibliography section. Designed as a universal solution, proposed algorithm is able to handle a vast variety of publication layouts with high precision and thus is well-suited for analyzing heteroge- neous document collections. This was achieved by employing supervised and unsuper- vised machine-learning algorithms trained on large, diverse datasets. The evaluation we conducted showed good performance of proposed metadata extraction algorithm. The comparison with other similar solutions also proved our algorithm performs better than competition for most metadata types. Proposed method is a reliable and accurate solution to the problem of extracting the metadata from documents. It allows modern research infrastructures to provide intelligent tools and services supporting the process of consuming the growing volume of scientific literature by the readers, which results in facilitating the communication among the sci- entists and the overall improvement of the knowledge propagation and the quality of the research in the scientific world. Contents 1 Overview 1 1.1 Background and Motivation . .1 1.2 Problem Statement . .4 1.3 Key Contributions . .7 1.4 Thesis Structure . .8 2 State of the Art 9 2.1 Metadata and Content Formats . .9 2.2 Relevant Machine Learning Techniques . 12 2.3 Document Analysis . 17 3 Document Content Extraction 36 3.1 Algorithm Overview . 36 3.2 Document Layout Analysis . 37 3.3 Document Region Classification . 51 3.4 Metadata Extraction . 52 3.5 Bibliography Extraction . 61 3.6 Structured Body Extraction . 68 3.7 Our Contributions . 78 3.8 Limitations . 79 4 Experiments and Evaluation 81 4.1 Datasets Overview . 82 4.2 Page Segmentation . 93 4.3 Content Classification . 95 4.4 Affiliation Parsing . 105 4.5 Reference Parsing . 107 4.6 Extraction Evaluation . 110 4.7 Time Performance . 125 5 Summary and Future Work 130 5.1 Summary . 130 5.2 Conclusions . 131 i CONTENTS 5.3 Outlook . 132 A Detailed Evaluation Results 134 A.1 Content Classification Parameters . 134 A.2 Content Classification Evaluation . 134 A.3 Performance Scores . 136 A.4 Statistical Analysis . 136 B System Architecture 152 B.1 System Usage . 152 B.2 System Components . 154 Acknowledgements 156 ii List of Figures 1.1 The number of records by year of publication in PubMed . .3 1.2 The number of records by year of publication and type in DBLP . .4 1.3 Citation and related documents networks in Scopus . .5 1.4 Documents and authors network in COMAC . .6 1.5 Intelligent metadata input interface from Mendeley . .7 3.1 The architecture of the extraction algorithm . 38 3.2 The bounding boxes of characters . 42 3.3 The bounding boxes of words . 43 3.4 The bounding boxes of lines . 44 3.5 The bounding boxes of zones . 44 3.6 Nearest neighbours in Docstrum algorithm . 45 3.7 Nearest-neighbour distance histogram in Docstrum algorithm . 46 3.8 The reading order of zones . 48 3.9 The mutual position of two zone groups . 51 3.10 Affiliations associated with authors using indexes . 58 3.11 Affiliations associated with authors using distance . 58 3.12 An example of a parsed affiliation string . 59 3.13 The references section with marked first lines of the references . 65 3.14 An example of a parsed bibliographic reference . 66 3.15 Examples of headers in different layouts . 73 3.16 An example of a multi-line header . 76 4.1 Semi-automatic method of creating GROTOAP2 dataset . 86 4.2 The histogram of the zone label coverage of the documents . 87 4.3 The results of page segmentation evaluation . 95 4.4 The results of page segmentation evaluation with filtered zones . 96 4.5 Average F-score for category classifier for various numbers of features . 99 4.6 Average F-score for metadata classifier for various numbers of features . 100 4.7 Average F-score for body classifier for various numbers of features . 100 4.8 The visualization of the confusion matrix of category classification evaluation103 4.9 The visualization of the confusion matrix of metadata classification evaluation104 4.10 The visualization of the confusion matrix of body classification evaluation . 106 iii LIST OF FIGURES 4.11 The visualization of the confusion matrix of affiliation token classification . 107 4.12 The visualization of the confusion matrix of citation token classification evaluation . 109 4.13 The results of bibliographic reference parser evaluation . 110 4.14 The results of the evaluation of extracting basic metadata on PMC . 117 4.15 The results of the evaluation of extracting basic metadata on Elsevier . 118 4.16 The results of the evaluation of extracting authorship metadata on PMC . 118 4.17 The results of the evaluation of extracting authorship metadata on Elsevier 119 4.18 The results of the evaluation of extracting bibliographic metadata on PMC 119 4.19 The results of the evaluation of extracting bibliographic metadata on Elsevier120 4.20 The results of the evaluation of extracting bibliographic references . 121 4.21 The results of the evaluation of extracting document headers on PMC . 121 4.22 The histogram of the extraction algorithm's processing time . 126 4.23 Processing time as a function of the number of pages of a document . 126 4.24 The boxplots of the percentage of time spent on each phase of the extraction 127 4.25 The percentage of time spent on each algorithm step . 128 4.26 The comparison of the average processing time of different extraction algorithms . 129 B.1 The extraction algorithm demonstrator service . 152 iv List of Tables 3.1 The decomposition of layout analysis . 41 3.2 The decomposition of metadata extraction . 56 3.3 The decomposition of bibliography extraction . 63 3.4 The decomposition of body extraction . 70 4.1 The dataset types used for the experiments . 83 4.2 The comparison of the parameters of GROTOAP and GROTOAP2 datasets 85 4.3 The results of manual evaluation of GROTOAP2 . 90 4.4 The results of the indirect evaluation of GROTOAP2 . 91 4.5 Final SVM kernels and parameters values for SVM classifiers . 102 4.6 The results of category classification evaluation . 103 4.7 The results of metadata classification evaluation . 105 4.8 The results of body classification evaluation . 105 4.9 The results of affiliation token classification evaluation . 106 4.10 The results of citation token classification evaluation . 108 4.11 The scope of the information extracted by various metadata extraction systems112 4.12 The number and percentage of successfully processed files for each system . 113 4.13 The winners of extracting each metadata category in each test sets . 123 4.14 The summary of the metadata extraction systems comparison . 124 A.1 The results of SVM parameters searching for category classification . 135 A.2 The results of SVM parameters searching for metadata classification . 135 A.3 The results of SVM parameters searching for body classification . 135 A.4 The confusion matrix of category classification evaluation . 136 A.5 The confusion matrix of metadata classification evaluation . 137 A.6 The confusion matrix of body content classification evaluation . 137 A.7 The confusion matrix of affiliation token classification evaluation . 138 A.8 The confusion matrix of citation token classification evaluation . 138 A.9 The results of the comparison evaluation of basic metadata extraction . 139 A.10 The results of the comparison evaluation of authorship metadata extraction 140 A.11 The results of the comparison evaluation of bibliographic metadata extraction141 A.12 The results of the comparison evaluation of bibliographic references extraction142 A.13 The results of the comparison evaluation of section headers extraction . 142 v LIST OF TABLES A.14 The results of statistical tests of the performance of title extraction . 143 A.15 The results of statistical tests of the performance of abstract extraction . 143 A.16 The results of statistical tests of the performance of keywords extraction . 144 A.17 The results of statistical tests of the performance of authors extraction . 144 A.18 The results of statistical tests of the performance of affiliations extraction . 145 A.19 The results of statistical tests of the performance of relations author-affiliation extraction .

New Methods for Metadata Extraction from Scientific Literature

BIOINFORMATICS Doi:10.1093/Bioinformatics/Btq383

Basic Unix Commands

The Uniprot Knowledgebase

Open Semantic Annotation of Scientific Publications Using DOMEO Paolo Ciccarese1, Marco Ocana2, Tim Clark1,2,3* from Bio-Ontologies 2011 Vienna, Austria

Pdfextractor Creates a Data Science Object from Referenced Data in Another Pa- Per