
DOM-based Content Extraction of HTML Documents

Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm
Dept. of Comp. Sci., Columbia University
New York, NY 10027, US
001-212-939-7184 / 001-212-939-7081
[email protected]

Abstract

Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques incorporating the advantages of previous work on content extraction. Our key insight is to work with the Document Object Model tree, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages.

1. Introduction

Web pages are often cluttered with distracting features around the body of an article that distract the user from the actual content they're interested in. These "features" may include pop-up ads, flashy banner advertisements, unnecessary images, or links scattered around the screen. Automatic extraction of useful and relevant content from web pages has many applications, ranging from enabling end users to access the web more easily over constrained devices like PDAs and cellular phones to providing better access to the web for the visually impaired.

Most traditional approaches to removing clutter or making content more readable involve increasing font size, removing images, disabling JavaScript, etc., all of which eliminate the webpage's inherent look-and-feel. Examples include WPAR [18], Webwiper [19] and JunkBusters [20]. All of these products rely on hardcoded techniques for certain common web page designs as well as "blacklists" of advertisers, which can produce inaccurate results if the software encounters a layout that it hasn't been programmed to handle. Another approach has been content reformatting, which reorganizes the data so that it fits on a PDA; however, this does not eliminate clutter but merely reorganizes it. Opera [21], for example, utilizes its proprietary Small Screen Rendering technology, which reformats web pages to fit inside the screen width. We propose a "Content Extraction" technique that can remove clutter without destroying webpage layout, making more of a page's content viewable at once.

Content extraction is particularly useful for the visually impaired and blind. A common practice for improving web page accessibility for the visually impaired is to increase font size and decrease screen resolution; however, this also increases the size of the clutter, reducing effectiveness. Screen readers for the blind, like Hal Screen Reader by Dolphin Computer Access or Microsoft's Narrator, don't usually remove such clutter automatically either and often read out full raw HTML. Therefore, both groups benefit from extraction, as less material must be read to obtain the desired results.

Natural Language Processing (NLP) and information retrieval (IR) algorithms can also benefit from content extraction, as they rely on the relevance of content and the reduction of "standard word error rate" to produce accurate results [13]. Content extraction allows these algorithms to process only the extracted content as input, as opposed to cluttered data coming directly from the web [14]. Currently, most NLP-based information retrieval applications require writing specialized extractors for each web domain [14][15]. While generalized content extraction is less accurate than hand-tailored extractors, it is often sufficient [22] and reduces the labor involved in adopting information retrieval systems.

While many algorithms for content extraction already exist, few working implementations can be applied in a general manner. Our solution employs a series of techniques that address the aforementioned problems. In order to analyze a web page for content extraction, we pass web pages through an HTML parser that corrects the markup and creates a Document Object Model tree. The Document Object Model (www.w3.org/DOM) is a standard for creating and manipulating in-memory representations of HTML (and XML) content. By parsing a web site's HTML into a DOM tree, we can not only extract information from large logical units similar to Buyukkokten's "Semantic Textual Units" (STUs, see [3][4]), but can also manipulate smaller units such as specific links within the structure of the DOM tree. In addition, DOM trees are highly editable and can easily be used to reconstruct a complete web site. Finally, increasing support for the Document Object Model makes our solution widely portable.
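To make this concrete, the short Java sketch below (illustrative only, not the proxy's source) parses a well-formed XHTML file with the standard javax.xml.parsers API and walks the resulting org.w3c.dom tree to collect the target of every link; the actual system substitutes a forgiving third-party HTML parser that first corrects arbitrary, possibly malformed, markup.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DomWalkSketch {
        // Recursively visit every node and remember the target of each <a> element.
        static void collectLinks(Node node, List<String> hrefs) {
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                hrefs.add(((Element) node).getAttribute("href"));
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                collectLinks(children.item(i), hrefs);
            }
        }

        public static void main(String[] args) throws Exception {
            // Assumes a locally saved, well-formed XHTML file.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new File(args[0]));
            List<String> hrefs = new ArrayList<String>();
            collectLinks(doc.getDocumentElement(), hrefs);
            System.out.println(hrefs);
        }
    }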

2. Related Work

There is a large body of related work in content identification and information retrieval that attempts to solve similar problems using various other techniques. Finn et al. [1] discuss methods for content extraction from "single-article" sources, where the content is presumed to be in a single body. The algorithm tokenizes a page into either words or tags; the page is then sectioned into three contiguous regions, placing boundaries so as to partition the document such that most tags are placed into the outside regions and word tokens into the center region. This approach works well for single-body documents, but it destroys the structure of the HTML and doesn't produce good results for multi-body documents, i.e., documents where content is segmented into multiple smaller pieces, as is common on Web logs ("blogs") like Slashdot (http://slashdot.org). In order for the content of multi-body documents to be successfully extracted, the running time of the algorithm would become polynomial with a degree equal to the number of separate bodies; i.e., extraction from a document containing 8 different bodies would run in O(N^8), N being the number of tokens in the document.
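A rough sketch of our reading of that optimization (not Finn et al.'s code) is shown below: each token is labelled as a tag or a word, and the two boundaries are chosen by brute force so as to maximize the number of tags in the outer regions plus the number of words in the middle region.

    import java.util.List;

    public class SingleBodySketch {
        // Choose boundaries i <= j maximizing:
        // tags in [0,i) + words in [i,j) + tags in [j,n).
        static int[] bestBody(List<Boolean> isTag) {
            int n = isTag.size();
            int[] tagsBefore = new int[n + 1];   // prefix counts of tag tokens
            for (int k = 0; k < n; k++) {
                tagsBefore[k + 1] = tagsBefore[k] + (isTag.get(k) ? 1 : 0);
            }
            int totalTags = tagsBefore[n];
            int bestScore = Integer.MIN_VALUE, bestI = 0, bestJ = 0;
            for (int i = 0; i <= n; i++) {
                for (int j = i; j <= n; j++) {
                    int wordsInside = (j - i) - (tagsBefore[j] - tagsBefore[i]);
                    int score = tagsBefore[i] + wordsInside + (totalTags - tagsBefore[j]);
                    if (score > bestScore) { bestScore = score; bestI = i; bestJ = j; }
                }
            }
            return new int[] { bestI, bestJ };   // tokens in [bestI, bestJ) approximate the body
        }

        public static void main(String[] args) {
            // Toy token stream: true = tag, false = word.
            List<Boolean> tokens = List.of(true, true, false, false, false, true, false, true, true);
            int[] b = bestBody(tokens);
            System.out.println("body spans tokens [" + b[0] + ", " + b[1] + ")");
        }
    }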

McKeown et al. [8][9], in the NLP group at Columbia University, detect the largest body of text on a webpage (by counting the number of words) and classify that as the content. This method works well with simple pages. However, it produces noisy or inaccurate results when handling multi-body documents, especially those with random advertisement and image placement.

Rahman et al. [2] propose another technique that uses structural analysis, contextual analysis, and summarization. The structure of an HTML document is first analyzed and then properly decomposed into smaller subsections. The content of the individual sections is then extracted and summarized. However, this proposal has yet to be implemented. Furthermore, while the paper lays out prerequisites for content extraction, it doesn't actually propose methods to do so.

A variety of approaches have been suggested for formatting web pages to fit on the small screens of cellular phones and PDAs (including the Opera browser [16] and Bitstream ThunderHawk [17]); however, they only reorganize the content of the webpage to fit on a constrained device and require the user to scroll and hunt for content.

Buyukkokten et al. [3][10] define "accordion summarization" as a strategy where a page can be shrunk or expanded much like the instrument. They also discuss a method to transform a web page into a hierarchy of individual content units called Semantic Textual Units, or STUs. First, STUs are built by analyzing syntactic features of an HTML document, such as text contained within paragraph (<P>), table cell (<TD>), and frame component (<FRAME>) tags. These features are then arranged into a hierarchy based on the HTML formatting of each STU. STUs that contain HTML header tags (<H1>, <H2>, and <H3>) or bold text (<B>) are given a higher level in the hierarchy than plain text. This hierarchical structure is finally displayed on PDAs and cellular phones.

While Buyukkokten's hierarchy is similar to our DOM tree-based model, DOM trees remain highly editable and can easily be reconstructed back into a complete web site. DOM trees are also a widely adopted W3C standard, easing support and integration of our technology. The main problem with the STU approach is that once the STUs have been identified, Buyukkokten et al. [3][4] perform summarization on them to produce the content that is then displayed on PDAs and cell phones. However, this requires editing the original content and displaying information that differs from the original work. Our approach retains all of the original work.

Kaasinen et al. [5] discuss methods to divide a web page into individual units likened to cards in a deck. Like STUs, a web page is divided into a series of hierarchical "cards" that are placed into a "deck". This deck of cards is presented to the user one card at a time for easy browsing. The paper also suggests a simple conversion of HTML content to WML (Wireless Markup Language), resulting in the removal of simple information such as images and bitmaps from the web page so that scrolling is minimized for small displays. While this reduction has advantages, the method proposed in that paper shares problems with STUs. The problem with the deck-of-cards model is that it relies on splitting a site into tiny sections that can then be browsed as windows, which means that it is up to the user to determine on which cards the actual contents are located.

None of these concepts solves the problem of automatically extracting just the content, although they do provide simpler means by which the content can be found. Thus, these concepts limit the analysis of web sites. By parsing a web site into a DOM tree, more control can be achieved while extracting content.

3. Our Approach

Our solution employs multiple extensible techniques that incorporate the advantages of the previous work on content extraction. In order to analyze a web page for content extraction, the page is first passed through an HTML parser that corrects the HTML and creates a Document Object Model tree representation of the web page. Once processed, the resulting DOM document can be seamlessly shown as a webpage to the end-user as if it were HTML. This process accomplishes the steps of structural analysis and structural decomposition done by Rahman's, Buyukkokten's and Kaasinen's techniques (see Section 2). The DOM tree is hierarchically arranged and can be analyzed in sections or as a whole, providing a wide range of flexibility for our extraction algorithm. Just as the approach mentioned by Kaasinen et al. modifies the HTML to restructure the content of the site, our content extractor navigates the DOM tree recursively, using a series of different filtering techniques to remove and modify specific nodes and leave only the content behind. Each of the filters can easily be turned on and off and customized to a certain degree.

There are two sets of filters, with different levels of granularity. The first set of filters simply ignores tags or specific attributes within tags. With these filters, images, links, scripts, styles, and many other elements can be quickly removed from the web page. This process of filtering is similar to Kaasinen's conversion of HTML to WML. However, the second set of filters is more complex and algorithmic, providing a higher level of extraction than offered by the conversion of HTML to WML. This set, which can be extended, consists of the advertisement remover, the link list remover, the empty table remover, and the removed link retainer.

The advertisement remover uses an efficient technique to remove advertisements. As the DOM tree is parsed, the values of the "src" and "href" attributes throughout the page are surveyed to determine the servers to which the links refer. If an address matches against a list of common advertisement servers, the node of the DOM tree that contained the link is removed. This process is similar to the use of an operating-system-level "hosts" file to prevent a computer from connecting to advertiser hosts. Hanzlik [6] examines this technique and cites a list of hosts, which we use for our advertisement remover.
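A minimal Java sketch of such a filter over an org.w3c.dom tree might look as follows; the blacklist here is a stand-in for the host list cited from Hanzlik [6], and the helper names are illustrative rather than the proxy's actual API.

    import java.util.Set;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class AdRemoverSketch {
        // Hypothetical stand-in for a real advertiser blacklist.
        static final Set<String> AD_HOSTS = Set.of("ads.example.com", "banner.example.net");

        static void removeAds(Node node) {
            NodeList children = node.getChildNodes();
            // Iterate backwards so that removing a child does not disturb the loop index.
            for (int i = children.getLength() - 1; i >= 0; i--) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && pointsToAdServer((Element) child)) {
                    node.removeChild(child);
                } else {
                    removeAds(child);
                }
            }
        }

        // True if the element's "src" or "href" attribute refers to a known advertiser host.
        static boolean pointsToAdServer(Element e) {
            String url = !e.getAttribute("src").isEmpty()
                    ? e.getAttribute("src") : e.getAttribute("href");
            for (String host : AD_HOSTS) {
                if (url.contains(host)) {
                    return true;
                }
            }
            return false;
        }
    }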
The link list remover employs a filtering technique that removes all "link lists", which are table cells for which the ratio of the number of links to the number of non-linked words is greater than a specific threshold (known as the link/text removal ratio). When the DOM parser encounters a table cell, the link list remover tallies the number of links and the number of non-linked words. The number of non-linked words is determined by taking the number of letters not contained in a link and dividing it by the average number of characters per word, which we preset as 5 (although it may be overridden by the user). If the ratio is greater than the user-determined link/text removal ratio, the content of the table cell (and, optionally, the cell itself) is removed. This algorithm succeeds in removing most long link lists that tend to reside along the sides of web pages while leaving the text-intensive portions of the page intact.
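A simplified version of that test is sketched below; the threshold value is merely illustrative, whereas the real filter exposes both the link/text removal ratio and the characters-per-word constant as user settings.

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class LinkListSketch {
        static final double AVG_CHARS_PER_WORD = 5.0;        // default from the paper, user-overridable
        static final double LINK_TEXT_REMOVAL_RATIO = 0.35;  // illustrative threshold

        // Decide whether a table cell is mostly a list of links (lower-case tag names assumed).
        static boolean isLinkList(Element cell) {
            int linkCount = cell.getElementsByTagName("a").getLength();
            double nonLinkedWords = countNonLinkedChars(cell) / AVG_CHARS_PER_WORD;
            if (linkCount == 0) return false;
            double ratio = nonLinkedWords == 0 ? Double.MAX_VALUE : linkCount / nonLinkedWords;
            return ratio > LINK_TEXT_REMOVAL_RATIO;
        }

        // Count characters of text that do not sit underneath an <a> element.
        static int countNonLinkedChars(Node node) {
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                return 0;   // text inside a link is not counted
            }
            if (node.getNodeType() == Node.TEXT_NODE) {
                return node.getNodeValue().replaceAll("\\s", "").length();
            }
            int total = 0;
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                total += countNonLinkedChars(children.item(i));
            }
            return total;
        }
    }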

The empty table remover removes tables that are empty of any "substantive" information. The user determines, through settings, which HTML tags should be considered to be substance and how many characters within a table are needed to be viewed as substantive. The table remover checks a table for substance after it has been parsed through the filter. If a table has no substance, it is removed from the tree. This algorithm effectively removes any tables left over from previous filters that contain small amounts of unimportant information.

While the above filters remove non-content from the site, the removed link retainer adds link information back at the end of the document to keep the page browsable. The removed link retainer keeps track of all the text links that are removed throughout the filtering process. After the DOM tree is completely parsed, the list of removed links is added to the bottom of the page. In this way, any important navigational links that were previously removed remain accessible.

After the entire DOM tree is parsed and modified appropriately, it can be output either as HTML or as plain text. The plain text output removes all the tags and retains only the text of the site, while eliminating most white space. The result is a text document that contains the main content of the site in a format suitable for summarization, speech rendering or storage. This technique differs significantly from that of Rahman et al. [2], which states that a decomposed web site should be analyzed to find the content. Our algorithm doesn't find the content but eliminates non-content. In this manner, we can still process and return results for sites that don't have an explicit "main body".
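For example, a bare-bones plain text serializer could simply walk the filtered tree and emit its text nodes, as in the sketch below (the proxy's own output code also preserves an HTML rendering and handles white space more carefully).

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class TextOutputSketch {
        // Append the text content of the filtered DOM to out, collapsing white space.
        static void emitText(Node node, StringBuilder out) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
                if (!text.isEmpty()) {
                    out.append(text).append('\n');
                }
                return;
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                emitText(children.item(i), out);
            }
        }
    }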

4. Implementation

In order to make our extractor easy to use, we implemented it as a web proxy (the program and instructions are accessible at http://www.psl.cs.columbia.edu/proxy). This allows an administrator to set up the extractor and provide content extraction services for a group. The proxy is coupled with a graphical user interface (GUI) to customize its behavior. The separate screens of the GUI are shown in Figures 1, 2, and 3. The current implementation of the proxy is in Java for cross-platform support.

Figure 1

Figure 2

Figure 3

While the Content Extraction algorithm's worst-case running time is O(N^2) for complex nested tables, the typical running time without such nesting is O(N), where N is the number of nodes in the DOM tree after the HTML page is parsed. During tests, the algorithm performs quickly and efficiently following the one-time proxy customization. The proxy can handle most web pages, including those with badly formatted HTML, because of the corrections automatically applied while the page is parsed into a DOM tree.

Depending on the type and complexity of the web page, the content extraction suite can produce a wide variety of output. The algorithm performs well on pages with large blocks of text, such as news articles and mid-size to long informational passages. Most navigational bars and extraneous elements of web pages, such as advertisements and side panels, are removed or reduced in size. Figures 4 and 5 show an example.

Figure 4 - Before

Figure 5 - After

When printed out in text format, most of the resulting text is directly related to the content of the site, making it possible to use summarization and keyword extraction algorithms efficiently and accurately. Pages with little or no textual content are extracted with varying results. An example of text-format extraction performed on the webpage in Figure 5 is shown in Figure 6.

Figure 6

The initial implementation of the proxy was designed for simplicity in order to test and design content extraction algorithms. It spawns a new thread to handle each new connection, which limits its scalability. Most of the performance drop from using the proxy originates from the proxy's need to download the entire page before sending it to the client. Future versions will use a staged event architecture and asynchronous callbacks to avoid threading scalability issues.
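The concurrency model is essentially the classic accept loop sketched below (for illustration only; handleClient() stands in for the code that downloads the requested page, filters its DOM and relays the result).

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class ThreadPerConnectionSketch {
        public static void main(String[] args) throws IOException {
            ServerSocket server = new ServerSocket(8080);   // illustrative proxy port
            while (true) {
                Socket client = server.accept();
                // One thread per connection: simple, but it limits scalability.
                new Thread(() -> handleClient(client)).start();
            }
        }

        static void handleClient(Socket client) {
            // Placeholder: fetch the page, run the content extraction filters, write the result back.
            try { client.close(); } catch (IOException ignored) { }
        }
    }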

4.1. Further examples

Figures 7 and 8 show an example of a typical page from www.spacer.com and a filtered version of that site, respectively. This is another good example of a site that is presented in a content-rich format. On the other hand, Figures 9 and 10 show the front page of www.planetunreal.com, a site dedicated to the Unreal Tournament 2003 first-person shooter game (www.epicgames.com), before and after content extraction. Although the extracted results are rich in text, screenshots of the game are also removed, which the user might deem relevant content.

Figure 7 - Before

Figure 8 - After

Figure 9 - Before

Figures 11 and 12 show www.msn.com in its pre- and post-filtered state. Since the site is a portal which contains links and little else, the proxy does not find any coherent content to keep. We are investigating heuristics that would leave such pages untouched.

Figure 10 - After

Figure 11 - Before

Figure 12 - After

4.2. Implementation details

Figure 13. Architectural diagram of the system

In order to analyze a web page for content extraction, the page is passed through an HTML parser that creates a Document Object Model tree. The algorithm begins at the root node of the DOM tree (the <HTML> tag) and proceeds by parsing through its children using a recursive depth-first search function called filterNode(). The function first initializes a Boolean variable (mCheckChildren) to true to allow filterNode() to check the children. The currently selected node is then passed through a filter method called passThroughFilters() that analyzes and modifies the node based on a series of user-selected preferences. At any time within passThroughFilters(), the mCheckChildren variable can be set to false, which allows an individual filter to prevent specific subtrees from being filtered. After the node is filtered accordingly, filterNode() is recursively called on the children if the mCheckChildren variable is still true.
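A condensed sketch of this recursion, mirroring the description above rather than the actual source, is:

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class FilterWalkSketch {
        // Cleared by a filter that wants to handle (or skip) an entire subtree itself.
        static boolean mCheckChildren = true;

        static void filterNode(Node node) {
            mCheckChildren = true;          // assume we will descend into the children
            passThroughFilters(node);       // may modify the node or clear the flag
            if (mCheckChildren) {
                NodeList children = node.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    filterNode(children.item(i));
                }
            }
        }

        static void passThroughFilters(Node node) {
            // Stand-in for the user-configurable filter chain described in the text.
        }
    }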

The filtering method, passThroughFilters(), performs the majority of the content extraction. It begins by examining the node it is passed to see if it is a "text node" (data) or an "element node" (HTML tag). Element nodes are examined and modified in a series of passes. First, any filters that edit an element node but do not delete it are applied. For example, the user can enable a preference that will remove all table cell widths, and it would be applied in the first phase because it modifies the attributes of table cell nodes without deleting them.

The second phase in examining element nodes is to apply all filters that delete nodes from the DOM tree. Most of these filters prevent the filterNode() method from recursively checking the children by setting mCheckChildren to false. A few of the filters in this subset set mCheckChildren to true so as to continue with a modified version of the original filterNode() method. For example, the empty table remover filter sets mCheckChildren to false so that it can itself recursively search through the <TABLE> tag using a bottom-up depth-first search, while filterNode() uses a top-down depth-first search. Finally, if the node is a text node, any text filters are applied (there are currently none, but there may be in the future).
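Schematically, the passes can be pictured as follows (a sketch with hypothetical helper names, not the implementation itself):

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;

    public class FilterPhasesSketch {
        static boolean mCheckChildren = true;

        static void passThroughFilters(Node node) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) node;
                // Phase 1: filters that only edit the element, e.g. dropping table cell widths.
                if ("td".equalsIgnoreCase(e.getTagName())) {
                    e.removeAttribute("width");
                }
                // Phase 2: filters that may delete the element; most of them also stop
                // filterNode() from descending by clearing mCheckChildren.
                if (shouldRemove(e)) {
                    e.getParentNode().removeChild(e);
                    mCheckChildren = false;
                }
            } else if (node.getNodeType() == Node.TEXT_NODE) {
                // Phase 3: text filters would run here (none exist yet).
            }
        }

        // Placeholder for the deletion criteria (ad links, link lists, empty tables, ...).
        static boolean shouldRemove(Element e) {
            return false;
        }
    }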
4.3. Implementation as an integrable framework

Since a content extraction algorithm can be applied to many different fields, we implemented it so that it can be easily used in a variety of cases. Through an extensive set of preferences, the extraction algorithm can be highly customized for different uses. These settings are easily editable through the GUI, through method calls, or by direct manipulation of the settings file on disk. The GUI itself can also easily be integrated (as a Swing JPanel) into any project. The content extraction algorithm is also implemented as an interface for easy incorporation into other Java programs. The content extractor's broad set of features and its customizability allow others to easily add their own version of the algorithm to any experiment or product.

5. Future Work

The current implementation in Java uses a 3rd-party HTML parser to create DOM trees from web pages. Unfortunately, most of the publicly available Java HTML parsers are missing support for important features, such as XHTML or dynamically generated pages (ASP, JSP). To resolve this, we intend to support commercial parsers, such as Microsoft's HTML parser (which is used in Internet Explorer), in the next revision. These are much more robust and support a wider variety of content. Integration will be accomplished by porting the existing proxy to C#/.NET, which will allow for easy integration with COM components (of which the MS HTML parser is one).

We are also working on improving the proxy's performance; in particular, we aim to improve both the latency and the scalability of the current version. The latency of the system will be reduced by adopting a commercial parser and by improving our algorithms to process the DOM incrementally as the page is being loaded. Scalability will be addressed by re-architecting the proxy's concurrency model to avoid the current thread-per-client model, adopting a stage-driven architecture instead.

Finally, we are investigating supporting more sophisticated statistical, information retrieval and natural language processing approaches as additional heuristics to improve the utility and accuracy of our current system.

6. Conclusion

Many web pages contain excessive clutter around the body of an article. Although much research has been done on content extraction, it is still a relatively new field. Our approach, working with the Document Object Model tree as opposed to raw HTML markup, enables us to perform Content Extraction, identifying and preserving the original data instead of summarizing it. The techniques that we have employed, though simple, are quite effective. As a result, we were able to implement our approach in a publicly available Web proxy that anyone can use to extract content from HTML web pages for their own purposes.

7. Acknowledgements

The Programming Systems Laboratory is funded in part by the Defense Advanced Research Projects Agency under DARPA Order K503 monitored by Air Force Research Laboratory F30602-00-2-0611, by National Science Foundation grants CCR-02-03876, EIA-00-71954, and CCR-99-70790, and by Microsoft Research.

In addition, we would like to extend a special thanks to Janak Parekh, Philip N. Gross and Jean-Denis Greze.

8. References

[1] Aidan Finn, Nicholas Kushmerick, and Barry Smyth. "Fact or fiction: Content classification for digital libraries". Smart Media Institute, Department of Computer Science, University College Dublin. August 11, 2002.
Companion), Denver, CO, May 7-11, 1995, pages 320-321.
[8] K.R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, M.Y. Kan, B. Schiffman, S. Teufel. "Columbia Multi-document Summarization: Approach and Evaluation".
[9] N. Wacholder, D. Evans, J. Klavans. "Automatic Identification and Organization of Index Terms for Interactive Browsing".
[10] O. Buyukkokten, H. Garcia-Molina, A. Paepcke. "Text Summarization for Web Browsing on Handheld Devices". In Proc. of 10th Int. World-Wide Web Conf., 2001.
[11] Manuela Kunze, Dietmar Rosner. "An XML-based Approach for the Presentation and Exploitation of Extracted Information".
[12] A. F. R. Rahman, H. Alam and R. Hartono. "Content Extraction from HTML Documents".
[13] Wolfgang Reichl, Bob Carpenter, Jennifer Chu-Carroll, Wu Chou. "Language Modeling for Content Extraction in Human-Computer