Extracting Body Text from Academic PDF Documents for Text Mining

Extracting Body Text from Academic PDF Documents for Text Mining Changfeng Yu, Cheng Zhang and Jie Wang Department of Computer Science, University of Massachusetts, Lowell, MA, U.S.A. Keywords: Body-text Extraction, HTML Replication of PDF, Line Sweeping, Backward Traversal. Abstract: Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understandings. The objective is to extract complete sentences in the body text into a txt file with the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents would often mix body and nonbody texts. We devise and implement a system called PDFBoT to detect multiple-column layouts using a line-sweeping technique, remove nonbody text using computed text features and syntactic tagging in backward traversal, and align the remaining text back to sentences and paragraphs. We show that PDFBoT is highly accurate with average F1 scores of, respectively, 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text on tables, figures, and charts over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines. 1 INTRODUCTION bitary layouts is challenging, due to the utmost flex- ibility of PDF typesetting. Instead, we focus on BT It is desirable for text mining applications to extract extraction from single-column and multiple column complete sentences and correct boundaries of para- research papers, reports, and case studies. We do so graphs from the body text of a PDF document into by working with the location, font size, and font style a txt file without hard breaks inside each paragraph. of each character, and the locations and sizes of other Layered reading (http://dooyeed.com) and extractive objects. While a PDF file provides such information, summarization, for example, are such applications. we find it easier to work with HTML replications pro- Layered reading allows the reader to read the most duced by an exiting tool named pdf2htmlEX (Wang, important layer of sentences first based on sentence 2014), with almost the same look and feel of the orig- rankings, then the layer of next important sentences inal PDF document, providing necessary formatting interleaving with the previous layers of sentences in information via HTML tags, classes, and id’s in the the original order of the document, and continue in underlying DOM tree. this fashion until the entire document is read. We devise a system named PDFBoT (PDF to Body Text) that, using pdf2htmlEX as a black box, By “body text” (BT in short) it means the main incorporates certain text formatting features produced text of an article, excluding “nonbondy text” (NBT in by it to identify NBT texts. We use a line-sweeping short) such as headings, footings, sidings (i.e., text on method to detect multi-column layouts and the area side margins), tables, figures, charts, captions, titles, for printing the BT text. We also develop multiple authors, affiliations, and math expressions in the dis- tests to identify NBT text inside the BT-text area and play mode, among other things. use a backward traversal method to deploy these tests. Most existing tools for extracting text from PDF In addition, we use POS (Part-of-Speech) tagging to documents, including pdftotext (FooLabs, 2014) and help identify NBT text that are harder to distinguish. PDFBox (Apache, 2017), extract a mixture of both The rest of the paper is organized as follows: BT and NBT texts. Identifying BT text from such Section 2 is related work on text extractions from mixtures of texts is challenging, if not impossible. PDF. Section 3 describes HTML replications via Other tools extract texts according to rhetorical cate- pdf2htmlEX and Sections 4 presents the architecture gories such as LA-PDFText (Burns, 2013) and logical of PDFBoT and its features Section 5 is evaluation re- text blocks such as Icecite (Korzen, 2017), which only sults with F1 scores and running time, and Section 6 provide a suboptimal solution to our applications. is conclusions and final remarks. Extracting BT text from PDF documents of ar- 235 Yu, C., Zhang, C. and Wang, J. Extracting Body Text from Academic PDF Documents for Text Mining. DOI: 10.5220/0010131402350242 In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 1: KDIR, pages 235-242 ISBN: 978-989-758-474-9 Copyright c 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval 2 RELATED WORK et al., 2019; Wang et al., 2018; Phong et al., 2020). In summary, previous methods, while meeting Existing tools, such as pdftotext (FooLabs, 2014) and with certain success, still fall short of the desired ac- PDFBox (Apache, 2017) the two most widely-used curacy required by text-mining applications relying tools for extracting text from PDF, and a number of on clean extractions of complete sentences and cor- other tools such as pdftohtml (Kruk, 2013), pdftoxml rect boundaries of paragraphs in BT text. (Dejean and Giguet, 2016), pdf2xml (Tiedemann, 2016), ParsCit (Kan, 2016), PDFMiner (Shinyama, 2016), pdfXtk (Hassan, 2013), pdf-extract (Ward, 3 HTML REPLICATION OF PDF 2015), pdfx (Constantin et al, 2011), PDFExtract (Berg, 2011), and Grobid (Lopez, 2017), extract text HTML technologies have been used to replicate PDF from PDF extract BT text and NBT text together with- layouts to facilitate online publishing. A PDF docu- out a clear distinction. PDFBox can extract text in ment can be represented as a sequence of pages, with two-column layouts; some other tools extract text line each page being a DOM tree of objects with sufficient by line across columns. information for an HTML viewer to display the con- Using heuristics is a common approach. For ex- tent (Wang and Liu, 2013). The text extracted from ample, the Java PDF library was used to obtain a PDF by pdf2htmlEX (Wang, 2014) are translated into bounding box for each word, compute the distance HTML text elements that are placed into the same po- between neighboring words, connect them based on a sitions as they are displayed by PDF. set of rules to form a larger text block, place them into Let F denote a PDF document and f the HTML rhetorical categories, and connect these categories file produced by pdf2htmlEX on F. The DOM tree following the order of the underlying document (Ra- for f , denoted by Tf , is divided into four levels: doc- makrishnan et al., 2012). However, this method fails ument, page, text line, and text block (TBK in short). to align broken sentence and determine text on formu- (1) Document Structure. Tf starts with the following las, tables, or figures. Using an intermediate HTML tag as the root: hdiv id=“page-container”i, and each representation generated by pdftohtml (Yildiz et al., of its children is the root of a subtree for a page, listed 2005). Text blocks may also be created by grouping in sequence, with an id indicating its page number characters based on their relative positions (Shigarov and a class name indicating the width and height of et al., 2016), while extracting the tables in PDF. These a page. For example, a child node with hdiv id=“pf7” two methods are focused only on extracting tables. class=“pf w0 h0 data-page-no=“7”i is the root of the Other methods include rule-based and machine- subtree for Page 7, where w0 and h0 are the width and learning models. For example, text may be placed height of the page (specifying the printable area) with into predefined logical text blocks based on a set of the origin at the lower-left corner of the page. rules on the distance, positions, fonts of characters, (2) Page Structure. Each page starts with a page node, words, and text lines (Bast and Korzen, 2017). How- followed by object nodes with contents to be printed. ever, these rules also connect text on tables or fig- Each object occupies a rectangular area (a bounding ures as BT text. A Conditional Random Field (CRF) box) specified on a coordinate system of pixels. The model is trained (Luong et al., 2011; Romary and text of the document is divided into TBKs as leaf Lopez, 2015) to extract texts according to a prede- nodes. Each TBK is represented by a hdivi tag with fined rhetorical category, such as title, abstract, and corresponding attributes, and so the text in a TBK are other sections in the input document. However, this either all BT text or all NBT text. Each object is iden- model fails to determine paragraph boundaries or tified by coordinates (x;y) at the lower-left corner of align broken sentences, among other things. the bounding box relative to the coordinates of its par- CiteSeerX (Giles, 2006), a search engine, extracts ent node. In what follows, these coordinates are re- metadata from indexed articles in scientific docu- ferred to as the starting point of the underlying object. ments for searching purpose, but not focused on the In addition to the starting point, a non-textual object is accuracy of extracting body text. PDFfigures (Clark specified by a width and a height, and a TBK is speci- and Divvala, 2015) chunks the text table and figure fied with a height without a width, where the width is into blocks, then classifies these blocks into captions, implied by the enclosed text, font size and style, and body text, and part-of-figure text. Recent studies have word spacing. The parent of each object may either be shifted attentions to extracting certain types of text, the origin, a node for a figure or a table, or a node due including titles (Yang et al., 2019) (but not text on ta- to some (probably invisible) formatting code.

Load more