
HTML Web Content Extraction Using Paragraph Tags

Howard J. Carey, III, Milos Manic
Department of Computer Science, Virginia Commonwealth University
Richmond, VA, USA
[email protected], [email protected]

Abstract— With the ever expanding use of the internet to disseminate information across the world, gathering useful information from the multitude of web page styles continues to be a difficult problem. The use of computers as a tool to scrape the desired content from a web page has been around for several decades. Many methods exist to extract desired content from web pages, such as Document Object Model (DOM) trees, text density, tag ratios, visual strategies, and fuzzy algorithms. Due to the multitude of different styles and designs, however, finding a single method to work in every case is a very difficult problem. This paper presents a novel method, Paragraph Extractor (ParEx), of clustering HTML paragraph tags and local parent headers to identify the main content within a news article. On websites that use paragraph tags to store their main news article, ParEx shows better performance than the Boilerpipe algorithm, with an F1 score of 97.33% compared to 88.53%.

Keywords—HTML, content extraction, Document Object Model, tag-ratios, tag density.

I. INTRODUCTION

The Internet is an ever growing source of information for the modern age. With billions of users and countless billions of web pages, the amount of data available to a single human being is simply staggering. Attempting to process all of this information is a monumental task. Vast amounts of information that may be important to various entities are provided in web based news articles. Modern day news websites are updated multiple times a day, making more data constantly available. These web based articles offer a good source of information because of their relatively free availability, ease of access, large amount of information, and ease of automation. To analyze news articles from a paper source, the articles would first have to be read into a computer, making the process of extracting the information within much more cumbersome and time consuming. Due to the massive number of news articles available, there is simply too much data for a human to manually determine which of this information is relevant. Thus, automating the extraction of the primary article is necessary to allow further data analysis on any information within a web page.

To help analyze the content of web pages, researchers have been developing methods to extract the desired information from a web page. A modern web page consists of numerous links, advertisements, and various navigation elements. This extra information may not be relevant to the main content of the web page, and can be ignored in many cases. This additional information, such as ads, can also lead to misleading or incorrect information being extracted. Thus, determining the relevant main content of a web page among the extra information is a difficult problem.

Numerous attempts have been made in the past two decades to filter the main content of a web page. Therefore, this paper presents Paragraph Extractor (ParEx), a novel method used to identify the main text content within an article on a website while filtering out as much irrelevant information as possible. ParEx relies upon HTML paragraph tags, denoted by <p> in HTML, combined with clustering and entity relationships to extract the main content of an article. It was shown that ParEx had very high recall and precision scores on a set of test sites that use <p> tags to store their main article content and have little to no user comment section.

The rest of this paper is organized in the following format: Section II examines related works in the field. Section III details the design methodology of ParEx. Section IV discusses the evaluation metrics used to test the method. Section V describes the experimental results of ParEx. And finally, Section VI summarizes the findings of this paper.

II. RELATED WORKS

Early attempts at content extraction mostly had some sort of human interaction required to identify the important features of a website, such as [1], [9], [10]. While these methods could be accurate, they were not easily scalable to bulk data collection.

Other early methods employed various natural language processing methods [7] to help identify relationships between zones of a web page, or utilized HTML tags to identify various regions within the text [14]. Kushmerick developed a method to solely identify the advertisements in a page and remove them [11].

Many methods attempt to utilize the Document Object Model (DOM) to extract formatted HTML data from websites [3], [4]. DOM provides "a platform and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents" [13]. In [2], Wu et al. utilized a DOM tree to help determine a text-to-tag path ratio, while in [20] Mantratzis et al. developed an algorithm that recursively searches through a DOM tree to find which HTML tags contained a high density of hyperlinks.

Layouts can be common throughout web pages in the form of templates. Detection of these templates, and the removal of the similar content that occurs between multiple web pages, can leave the differing content between them, which can be the main article itself, as found by Bar-Yossef et al. in [22]. Chen et al. in [15] explored a method of combining layout grouping and word indexing to detect the template. In [16] and [23], Kao et al. developed an algorithm that utilized the entropy of features, links, and content of a website to help identify template sections. And Yi et al. in [17] proposed a method to classify and cluster web content using a style tree to help compare website structures and determine the template used. Kohlschütter [30] developed a method, Boilerpipe, to detect shallow text features in templates to help detect the boilerplate (any section of a website which is not considered main content) using the number of words and the link density of a website.
Much research tends to build on the work of previous researchers. In [5], Gottron provided a comparison between many of the content extraction algorithms at the time, and modified the document slope curve algorithm from [18]. The modified document slope curve proved the best within his test group. In [18], Pinto et al. expanded on the work of Body Text Extraction from [14] by utilizing a document slope curve to identify content vs. non-content pages in hopes of determining whether a web page had content worth extracting or not. Debnath et al. proposed the algorithms ContentExtractor [21] and FeatureExtractor [22], which compared similarity between blocks across multiple web pages and classified sections as content with respect to a user defined desired feature set.

In [19], Spousta et al. developed a cleaning algorithm that involved regular expression analysis and numerous heuristics involving sentence structure. Their results performed poorly on web pages with poor formatting or numerous images and links. Gottron [24] utilized a content code blurring technique to identify the main content of an article. The method involved applying a content code ratio to different areas of the web page and analyzing the amount of text in the different regions.

Many recent works have taken newer approaches, but still tend to build on the works of previous research. In [12], Bu et al. proposed a method to analyze the main article using fuzzy association rules (FAR). They encoded the min., mean, and max values of all items and features for a web page into a fuzzy set and achieved decent, but quick, results. Song et al. in [6] and Sun et al. [8] expanded on the tag path ratios by looking at text density within a line and taking into account the number of all characters in a subtree compared to all hyperlink tags in a subtree. Peters et al. [25] combined elements of Kohlschütter's boilerplate detection methods [30] and Weninger's CETR methods [3], [4] into a single machine learning algorithm. Their combined methodology showed improvements over using just a single algorithm; however, it had trouble with websites that used little CSS for formatting.

Nethra et al. [26] created a hybrid approach using feature extraction and decision tree classification. They used C4.5 decision tree and Naïve Bayes classifiers to determine which features were important in determining main content. In [27], Bhardwaj et al. proposed an approach combining the word to leaf ratio (WLR) and the link attributes of nodes. Their WLR was defined as the ratio of the number of words in a node to the number of leaves in the subtree of said node. In [28], Gondse et al. proposed a method of extracting content from unstructured text. Using a web crawler combined with user input to decide what to look for, the crawler analyzes the DOM tree of various sites to find potential main content sections. Qureshi et al. [29] created a hybrid model utilizing a DOM tree to determine the average text size and link density of each node.

No algorithm has managed to achieve 100% accuracy in extracting all relevant text from websites so far. With the ever changing style and design of modern web pages, different approaches are continually needed to keep up with the changes. Some algorithms may work on certain websites but not others. There is much work left to be done in the field of website content extraction.

III. PAREX WEB PAGE CONTENT EXTRACTION METHOD

The steps involved in ParEx are shown in Figure 1. The Preprocessing step starts with the original website's HTML being extracted and parsed via JSoup [32]. The Text Extraction step locates and extracts the main content within the website. The Final Text is then analyzed using techniques elaborated on in Section IV.

Figure 1: Flow-diagram of the presented ParEx method.

A. Preprocessing

The HTML code was downloaded directly from each website and parsed using the JSoup API [32] to allow easy filtering of HTML data. In this way, HTML tags were pulled out, processed, and extracted from the original HTML code. JSoup also simplified the process of locating tags and parent tags, allowing for quicker testing of the method.
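As an illustration of this preprocessing step, the short Java sketch below downloads a page with JSoup and lists its paragraph and list-item elements. The URL and the per-element handling are placeholders for illustration, not the exact code used in ParEx.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            // Hypothetical article URL; ParEx downloaded the raw HTML of each test site.
            String url = "https://example.com/news/article.html";

            // Fetch and parse the page into a DOM that JSoup can query.
            Document doc = Jsoup.connect(url).get();

            // Paragraph and list-item tags are the elements ParEx later scores and extracts.
            for (Element e : doc.select("p, li")) {
                System.out.println(e.tagName() + ": " + e.text());
            }
        }
    }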
B. <p> Tag Clustering

The presented ParEx method combines a number of methods used in previous papers into a single, simple, heuristic extraction algorithm. The initial idea stems from the work of Weninger et al. [3], [4] and Sun et al. [8], using a text-to-tag ratio value. Weninger et al. showed that the main content of a site generally contains significantly more text than HTML tags. A ratio of the number of non-HTML-tag characters to HTML tag characters was computed, with higher ratios being much more likely to contain the main content of an HTML document. A downside of this method is that it will grab any large block of text, which can sometimes include comment sections on many websites. The text-to-tag ratio in this experiment is calculated as follows:

    tagRatio = textCount / tagCount        (1)

where the textCount variable is the number of non-tag characters contained in the line and tagCount is the number of HTML tags contained in the line. The tagRatio variable uses the number of characters in the line instead of the number of words to prevent any bias from articles that use fewer, but longer, words, or vice versa. The character count gives a definitive length of the line without concern for the length of the individual words.

Typically, the main HTML content is placed in paragraph tags, denoted by <p>, and has a high text-to-tag ratio. However, advertisements can contain a massive number of characters while only containing a few HTML tags, which can fool the algorithm in the form of a high text-to-tag ratio. To filter out these cases from the main content, a clustered version of the regular tagRatio (1) is used in ParEx to find regions of high text-to-tag ratios, as opposed to simple one-liners. The clustered text-to-tag ratio uses a sliding window technique to assign an average text-to-tag ratio from the line in question and the two lines before and after it. This gives each line of HTML two ratios: an absolute text-to-tag ratio, which is the text-to-tag ratio of the individual line, and a relative text-to-tag ratio, which is the average text-to-tag ratio of the sliding window cluster.

Figure 2 shows an example of the clustered text-to-tag ratio. The second column of numbers represents the absolute text-to-tag ratio for each line. Line 3 will be assigned a clustered ratio that is the average of lines 1-5, 31.8. Line 4 will be assigned a clustered ratio that is the average of lines 2-6, 32.8. This process is repeated for all lines in the raw HTML, before any other formatting is done. This clustering helps filter out one-line advertisements or extraneous information that have high text-to-tag ratios, while favoring clusters of high text-to-tag ratios.

Figure 2: Example showing the clustered text-to-tag ratio.
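The listing below is a minimal Java sketch of this scoring step, assuming the raw HTML has already been split into lines. The absolute ratio follows (1) and the relative ratio averages a five-line window (the line plus two lines on either side); the tag-matching regular expression, the division-by-zero guard, and the helper names are illustrative assumptions rather than the exact ParEx implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagRatio {
        // Simple pattern for counting HTML tags within a line of raw HTML.
        private static final Pattern TAG = Pattern.compile("<[^>]+>");

        // Absolute ratio (1): non-tag characters divided by the number of HTML tags in the line.
        static double absoluteRatio(String line) {
            Matcher m = TAG.matcher(line);
            int tagCount = 0;
            int tagChars = 0;
            while (m.find()) {
                tagCount++;
                tagChars += m.group().length();
            }
            int textCount = line.length() - tagChars;
            return (double) textCount / Math.max(tagCount, 1); // guard against tag-free lines
        }

        // Relative (clustered) ratio: average of the absolute ratios over a 5-line sliding window.
        static List<Double> clusteredRatios(List<String> htmlLines) {
            List<Double> absolute = new ArrayList<>();
            for (String line : htmlLines) {
                absolute.add(absoluteRatio(line));
            }
            List<Double> clustered = new ArrayList<>();
            for (int i = 0; i < absolute.size(); i++) {
                int from = Math.max(0, i - 2);
                int to = Math.min(absolute.size() - 1, i + 2);
                double sum = 0;
                for (int j = from; j <= to; j++) {
                    sum += absolute.get(j);
                }
                clustered.add(sum / (to - from + 1));
            }
            return clustered;
        }
    }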

C. Parent Header

The novelty in this paper recognizes that most websites (all, for the websites tested) group their entire main article under one parent header, either under <p> tags or <li> tags. Essentially, this means that if a single line of the main content can be identified, the entire section can be extracted by finding the parent header tag of that single line. Once every line has a relative tag ratio, the <p> tag with the highest relative tag ratio is extracted. The highest relative tag ratio helps to find the single sentence that is most likely to contain the main article text within it. This extracted line's parent header is taken to be the master header for the web page. All <p> tags under this master header are then extracted and considered to be the primary article text within the web page. Examples of this procedure are shown in Figures 3 and 4 and discussed below.

In Figure 3, there are three <p> tags. If line 3 has the greatest relative text-to-tag ratio, it is extracted as the most likely to contain the main content. The parent of line 3 is found as "local". Then, all <p> tags under "local" are extracted and considered to be the main content. While "main" is the most senior parent in the tree, only the immediate parent is chosen when determining the parent of the relevant <p> tag. Selecting a more senior parent generally selects more information than the desired content. Also, in this case, no tags under "next" would be chosen.

Figure 3: <p> tag extraction example.

Lists in HTML are often used to list out various talking points within an article; however, these are generally not formatted into <p> tags, but into list tags, denoted by <li>. If any <li> tags are found within the master header, then their text contents are added to the extracted article text as well.

In Figure 4, if the <p> tag on line 3 was chosen as the main content containing line, then all <p> tags under the "local" tag would be extracted. However, information within the list tags, <li>, would also be extracted, as the algorithm recognizes any <p> as well as <li> tags under the chosen master parent tag.

Figure 4: <li> tag extraction example.

Once the final list of <p> and <li> tags is extracted, the text of these tags is compared to the original, annotated data that was extracted using the methods discussed in Section A. The techniques used to evaluate the effectiveness of the algorithm are discussed in the next section of the paper.
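A JSoup-based Java sketch of this parent-header step is shown below. For brevity it scores each <p> element by its own text length as a stand-in for the clustered text-to-tag ratio described above; the rest follows the procedure of taking the immediate parent of the best-scoring <p> as the master header and collecting all <p> and <li> text beneath it. Class and method names are illustrative, not the exact ParEx code.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ParentHeaderExtract {
        // Find the best-scoring <p>, then gather all <p>/<li> text under its immediate parent.
        static String extractArticle(String html) {
            Document doc = Jsoup.parse(html);

            Element best = null;
            double bestScore = -1;
            for (Element p : doc.select("p")) {
                // Stand-in score: own text length. ParEx instead uses the clustered
                // (relative) text-to-tag ratio computed over the raw HTML lines.
                double score = p.ownText().length();
                if (score > bestScore) {
                    bestScore = score;
                    best = p;
                }
            }
            if (best == null || best.parent() == null) {
                return "";
            }

            // The immediate parent of the best line is taken as the master header.
            Element master = best.parent();

            StringBuilder article = new StringBuilder();
            for (Element e : master.select("p, li")) {
                article.append(e.text()).append('\n');
            }
            return article.toString();
        }
    }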

IV. EVALUATION METRICS

Three metrics are used to evaluate the performance of ParEx: precision, recall, and F1 score. These are standard evaluation techniques used to determine how accurate an extracted block of text is compared to what it is supposed to be.

Precision (P) is the ratio between the size of the intersection of the extracted text and the actual text, over the size of the extracted text. Precision gives a measurement of how relevant the extracted words are to the actual desired content text and is expressed as:

    P = |S_E ∩ S_A| / |S_E|        (2)

where S_E is the extracted text from the algorithm and S_A is the actual article text.

Recall (R) is the ratio between the size of the intersection of the extracted text and the actual text, over the size of the actual text. Recall gives a measurement of how much of the relevant data was accurately extracted and is expressed as:

    R = |S_E ∩ S_A| / |S_A|        (3)

where S_E is the extracted text from the algorithm and S_A is the actual article text that serves as the baseline comparison for accuracy.

The F1 (F) score is a combination of precision and recall, allowing a single number to represent the accuracy of the algorithm. It is expressed as:

    F = 2 * (P * R) / (P + R)        (4)

where P is the precision and R is the recall.

These metrics analyze the words extracted by each algorithm and compare them to the words in the article itself. If the words extracted by the algorithm closely match the actual words in the article, the F1 score will be higher. Using these three scoring methods allows for a numerical comparison between various algorithms. The F1 score is important because looking purely at either precision or recall can be misleading. Having a high precision but low recall value means nearly all of the extracted data was accurate, but it does not tell how much relevant information is missing. Having a high recall but low precision value means nearly all of the data that should have been extracted was, but it does not tell how much extra, non-relevant data was also extracted. Thus, the F1 score is useful as it combines both precision and recall to determine a balance between the two.

ParEx was tested against manually extracted data from each website. This allowed the results to be verified against a known dataset. The precision, recall, and F1 score were obtained by using the set of words that appeared in the manually extracted text and comparing it against the set of words extracted by the tested algorithms. No importance was given to the order in which certain words appeared.
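A short Java sketch of this word-set scoring is given below. It treats both the annotated and the extracted text as unordered sets of lower-cased, whitespace-separated words, matching the description above; the exact tokenization used in the experiments is an assumption.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Metrics {
        // Split text into an unordered set of lower-cased words; word order is ignored.
        static Set<String> wordSet(String text) {
            Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
            words.remove("");
            return words;
        }

        // Precision (2), recall (3), and F1 (4) over the extracted and annotated word sets.
        static double[] score(String extracted, String actual) {
            Set<String> se = wordSet(extracted);
            Set<String> sa = wordSet(actual);

            Set<String> overlap = new HashSet<>(se);
            overlap.retainAll(sa);

            double p = se.isEmpty() ? 0 : (double) overlap.size() / se.size();
            double r = sa.isEmpty() ? 0 : (double) overlap.size() / sa.size();
            double f = (p + r) == 0 ? 0 : 2 * p * r / (p + r);
            return new double[] { p, r, f };
        }
    }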
V. EXPERIMENTAL RESULTS

The hypothesis being tested was whether or not the ParEx method would be a more effective extraction algorithm than Boilerpipe on websites which used <p> tags to store their main content and had little to no user comment sections. To test this hypothesis, 15 websites that exhibited these characteristics and 15 websites that did not were selected.

The websites selected were mostly local news based websites, such as FOX, CBS, NBC, NPR, etc., with a subject focus on critical infrastructure events in local areas. All 30 of the articles selected were from different websites. The wide variety of websites allowed both methods to be tested on as diverse a set of sites as possible. News article sites were used, as opposed to other sites such as blogs, forums, or shopping sites, because news sites tend to have a single primary article of discussion per site. To acquire HTML data from these sites, it was simply downloaded from the website directly. To compare each algorithm's extracted text with the actual website main content text, the main content needed to be extracted and annotated. The annotation was done manually, by a human. The final decision about what was considered relevant to the main article within the website was up to the annotator.

This method of manual content extraction for comparison has some inherent error risk involved. Different people may consider different parts of an article as main content. For instance, the title and header of the article may be viewed as main content or not. The author information at the bottom of an article may or may not be viewed as main content by different people. For the purposes of this paper, only the main news article content was considered, leaving off any author information, contact information, titles, or author notes. With this in mind, a certain amount of error is to be expected even with accurate algorithm results.

TABLE I: SCORE COMPARISON BETWEEN BOTH METHODS ON THE SET OF <p> TAGGED SITES.

               ParEx                       Boilerpipe
  Metric       Scores     Std. Dev.        Scores     Std. Dev.
  Precision    96.07%     3.53%            85.07%     8.44%
  Recall       98.73%     1.19%            94.00%     5.20%
  F1 Score     97.33%     2.04%            88.53%     7.02%

TABLE II: SCORE COMPARISON BETWEEN BOTH METHODS ON THE SET OF NON-<p> TAGGED SITES.

               ParEx                       Boilerpipe
  Metric       Scores     Std. Dev.        Scores     Std. Dev.
  Precision    40.33%     38.36%           77.53%     15.55%
  Recall       41.53%     40.84%           90.80%     10.88%
  F1 Score     33.73%     34.52%           82.53%     15.34%

Figure 5: Individual website results for the <p> tagged website set.

The accuracy of ParEx was compared against the accuracy of the Boilerpipe tool, developed by Kohlschütter [30]. Boilerpipe is a fully developed tool that is publicly available and provides numerous methods of content extraction via an easy to use API [31], allowing easy testing, and it performs well on a broad range of website types.

Table I shows that ParEx performed better than Boilerpipe in all metrics for websites that exhibited the required characteristics. Table II shows that Boilerpipe performed much better than ParEx on websites that did not exhibit the required characteristics. These results support the original assumption: for websites that use <p> tags to store their main content and have limited to no user comment sections, ParEx will have higher performance.

Examining the differences of each method between Tables I and II shows that while ParEx may be more accurate on sites that exhibit the required characteristics, Boilerpipe is a more generalizable method. ParEx claims an F1 score of 97.33% on the first data set, but only 33.73% on the second. Boilerpipe, however, achieves F1 scores of 88.53% on the first set and 82.53% on the second. Boilerpipe does not work as well on sets that exhibit the characteristics required for ParEx, but it performs more consistently on multiple types of websites.

Figure 5 emphasizes the performance of ParEx over Boilerpipe. ParEx performed as well as or better than Boilerpipe for every website tested. Figure 6 demonstrates Boilerpipe's resiliency on the non-paragraph tagged websites. Note that Figure 6 is scaled from 60%-100% on the y-axis. While dipping to ~40% and ~10% on two of the websites, Boilerpipe's performance on the majority of the test sites is relatively consistent with that of the first set of test sites in Figure 5. Figure 6 also shows that, while ParEx performs poorly on most of the chosen sites, it still manages to perform quite well on a couple of the websites. As expected, it was found that many of the websites where ParEx performed poorly did not use <p> tags to store their main content. This led to a score of 0 (see Figure 6), as the text-to-tag ratio only examines <p> tags for text, and since there were no <p> tags, there was no text. Also as expected, on many of the sites in Table II, ParEx's relative text-to-tag ratio had selected a comment section of the website. If there was no overlap between the words used in the comment section and the actual article, the resulting score was 0%; if there was some minimal overlap in words, it led to a very low result. Thus, the two identified primary requirements for a website to work well with ParEx are: 1) the website must use paragraph tags to store the main article content, and 2) the website should have limited comment sections or other large blocks of text that may fool the clustering algorithm into selecting the wrong block of text as the main content.

Figure 6: Individual website results for the non-<p> tagged website set.

VI. CONCLUSION

This paper presented a new method, called ParEx, to evaluate the content within a website and extract the main content text within.
Building upon previous work with the text-to-tag ratio and clustering methods, the presented ParEx method focused only on the paragraph tags of each website, making the assumption that most websites will have their main article within a number of <p> tags. Two primary requirements were found to optimize the success of ParEx: 1) websites must use <p> tags to store their article content, and 2) websites should have limited or no user comment sections. The results showed that ParEx achieved overall better performance than Boilerpipe on websites that exhibited these characteristics (with F1 scores of 97.33% vs. 88.53% for Boilerpipe). This confirms the requirements of the ParEx approach.

Future work includes further improving the content extraction accuracy by improving the clustering algorithm and the text-to-tag ratio metric, to increase the likelihood that the algorithm will select the correct chunk of <p> tags, as opposed to a user comment section, to extract as the main content. Comparing the differences between a tag ratio that uses characters and one that uses words can also be explored.

ACKNOWLEDGMENT

The authors would like to thank Mr. Ryan Hruska of the Idaho National Laboratory (INL), whose support helped make this effort a success.

REFERENCES
[1] S. Gupta, G. Kaiser, P. Grimm, M. Chiang, J. Starren, "Automating Content Extraction of HTML Documents," in World Wide Web, vol. 8, no. 2, pp. 179-224, June 2005.
[2] G. Wu, L. Li, X. Hu, X. Wu, "Web news extraction via path ratios," in Proc. ACM intl. conf. on Information & Knowledge Management, pp. 2059-2068, 2013.
[3] T. Weninger, W.H. Hsu, "Text Extraction from the Web via Text-to-Tag Ratio," in Database and Expert Systems Application, pp. 23-28, Sept. 2008.
[4] T. Weninger, W.H. Hsu, J. Han, "CETR: content extraction via tag ratios," in Proc. Intl. conf. on World Wide Web, pp. 971-980, April 2010.
[5] T. Gottron, "Evaluating content extraction on HTML documents," in Proc. Intl. conf. on Internet Technologies and Apps, pp. 123-132, 2007.
[6] D. Song, F. Sun, L. Liao, "A hybrid approach for content extraction with text density and visual importance of DOM nodes," in Knowledge and Information Systems, vol. 42, no. 1, pp. 75-96, 2015.
[7] A.F.R. Rahman, H. Alam, R. Hartono, "Content extraction from documents," in Intl. Workshop on Web Document Analysis, pp. 1-4, 2001.
[8] F. Sun, D. Song, L. Liao, "DOM based content extraction via text density," in Proc. Intl. conf. on Research and Development in Information Retrieval, pp. 245-254, 2011.
[9] B. Adelberg, "NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents," in Proc. ACM Intl. conf. on Management of Data, pp. 283-294, 1998.
[10] L. Liu, C. Pu, W. Han, "XWRAP: An XML-enabled wrapper construction system for web information sources," in Proc. Intl. Conf. on Data Engineering, pp. 611-621, 2000.
[11] N. Kushmerick, "Learning to remove Internet advertisements," in Proc. Conf. on Autonomous Agents, pp. 175-181, 1999.
[12] Z. Bu, C. Zhang, Z. Xia, J. Wang, "An FAR-SW based approach for webpage information extraction," in Information Systems Frontiers, vol. 16, no. 5, pp. 771-785, February 2013.
[13] L. Wood, A. Le Hors, V. Apparao, S. Byrne, M. Champion, S. Isaacs, I. Jacobs et al., "Document object model (DOM) level 1 specification," in W3C Recommendation, 1998.
[14] A. Finn, N. Kushmerick, B. Smyth, "Fact or fiction: Content classification for digital libraries," in Joint DELOS-NSF Workshop on Personalisation and Recommender Systems in Digital Libraries, 2001.
[15] L. Chen, S. Ye, X. Li, "Template detection for large scale search engines," in Proc. ACM Symposium on Applied Computing, pp. 1094-1098, 2006.
[16] H. Kao, S. Lin, J. Ho, M. Chen, "Mining web informative structures and contents based on entropy analysis," in Trans. on Knowledge and Data Engineering, vol. 16, no. 1, pp. 41-55, January 2004.
[17] L. Yi, B. Liu, X. Li, "Eliminating noisy information in web pages for data mining," in Proc. ACM intl. conf. on Knowledge Discovery and Data Mining, pp. 296-305, 2003.
[18] D. Pinto, M. Branstein, R. Coleman, W.B. Croft, M. King, W. Li, et al., "QuASM: a system for question answering using semi-structured data," in Proc. ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 46-55, 2002.
[19] M. Spousta, M. Marek, P. Pecina, "Victor: the web-page cleaning tool," in 4th Web as Corpus Workshop (WAC4) - Can we beat Google, pp. 12-17, 2008.
[20] C. Mantratzis, M. Orgun, S. Cassidy, "Separating XHTML content from navigation clutter using DOM-structure block analysis," in Proc. ACM conf. on Hypertext and Hypermedia, pp. 145-147, 2005.
[21] S. Debnath, P. Mitra, C. L. Giles, "Automatic extraction of informative blocks from webpages," in Proc. ACM Symposium on Applied Computing, pp. 1722-1726, 2005.
[22] S. Debnath, P. Mitra, C. L. Giles, "Identifying content blocks from web documents," in Foundations of Intelligent Systems, pp. 285-293, 2005.
[23] S. Lin, J. Ho, "Discovering informative content blocks from Web documents," in Proc. ACM SIGKDD intl. conf. on Knowledge Discovery and Data Mining, pp. 588-593, 2002.
[24] T. Gottron, "Content code blurring: A new approach to content extraction," in Intl. Workshop on Database and Expert Systems Application, pp. 29-33, 2008.
[25] M.E. Peters, D. Lecocq, "Content extraction using diverse feature sets," in Proc. Intl. conf. on World Wide Web Companion, pp. 89-90, 2013.
[26] K. Nethra, J. Anitha, G. Thilagavathi, "Web Content Extraction Using Hybrid Approach," in ICTACT Journal on Soft Computing, vol. 4, no. 2, 2014.
[27] A. Bhardwaj, V. Mangat, "A novel approach for content extraction from web pages," in Recent Advances in Engineering and Computational Sciences, pp. 1-4, 2014.
[28] P. Gondse, A. Raut, "Primary Content Extraction Based On DOM," in Intl. Journal of Research in Advent Technology, vol. 2, no. 4, pp. 208-210, April 2014.
[29] P. Qureshi, N. Memon, "Hybrid model of content extraction," in Journal of Computer and System Sciences, vol. 78, no. 4, pp. 1248-1257, July 2012.
[30] C. Kohlschütter, P. Fankhauser, W. Nejdl, "Boilerplate detection using shallow text features," in Proc. ACM intl. conf. on Web Search and Data Mining, pp. 441-450, 2010.
[31] C. Kohlschütter. (2016, Jan.). boilerpipe [Online]. Available: https://code.google.com/p/boilerpipe/
[32] J. Hedley. (2016, Jan.). jsoup HTML parser [Online]. Available: http://jsoup.org