Web Page Segmentation and Pagination for Enhancing Readability

Vol. 2(4), Jan. 2016, pp. 234-238

Ahmad Pouramini
Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran
*Corresponding Author's E-mail: [email protected]

Abstract

Web page readability can be defined as the combination of reading comprehension, reading speed and user satisfaction. To improve the readability of a web page, content extraction and transformation techniques are used to present the main content to the reader in a more readable fashion. In this paper, we present the design and architecture of a readability enhancement system. We aim to enhance both reading speed and comprehension. To achieve these goals, we extract the main content and segment it into smaller coherent semantic units. These units are further augmented with text signals such as section headings, captions and page numbers in order to convey the text organization and the visual structure of the page, thereby enhancing content comprehension. The proposed system particularly suits constrained display devices such as mobile phones and PDAs.

Keywords: Reading comprehension; Readability Enhancement; Web page Customization.

1. Introduction

The World Wide Web has grown rapidly in recent years. With the large amount of information on the Internet, web pages have become the main source of information. However, reading web pages on a computer screen or a mobile phone poses some difficulties. Besides the main content, a web page may contain distracting parts such as ads, animations and logos that can degrade the readability of the main content. In addition, color contrast, font style, letter spacing, layout, line height and line length are among the other factors that affect web page readability [1]. The problem can be more serious for specific groups such as older adults, visually impaired users and non-native readers (those reading a page in a non-native language).
These readers need more concentration to comprehend the text, especially if the text is a news or scientific article [2].

In this paper, we propose a system for enhancing web page readability. We define readability as the combination of reading comprehension, reading speed and user satisfaction. The main stages of our method are extracting the main content, segmenting it into coherent semantic units, and presenting each unit to the reader on a separate page. In our definition, a semantic unit is a discrete chunk of information that serves a specific meaningful purpose within the overall structure of a topic, such as an itemized list, a paragraph, an image or a table. In addition, text signals such as section headings are added to each page to help the user keep track of the text organization.

2. Web Page Readability

Typically, readability tools use content extraction methods to eliminate distracting parts, and content reformatting and transformation to enhance reading speed and comprehension. Eliminating distracting parts can also provide easier access to the web on constrained devices such as mobile phones [3], [4]. In the following sections, we review the literature on enhancing page readability and usability.

Article History: JKBEI DOI: 649123/11034 Received Date: 15 Sep. 2015 Accepted Date: 17 Dec. 2015 Available Online: 09 Jan. 2016

2.1. Content Transformation and Reformatting

Richards et al. [5] proposed guidelines for adapting and transforming web content for specific populations, including disabled people, older adults and visually impaired users. Changes such as a different font style, font enlargement, increased inter-letter spacing and enhanced color contrast can increase legibility for these populations. Other studies focus on readability enhancement at the user interface level. In an effort to enhance online reading, Walker et al.
used a visual-syntactic text formatting (VSTF) method in which sentences are analyzed and reformatted into cascading patterns that cue syntactic structure and assist visual processing [6]. In a similar work, [7] offered a visual reformatting method aimed at improving readability for non-native readers of English documents.

2.2. Scrolling vs. Paging

There are several studies on the effects of scrolling and paging on text reading and comprehension. Many of these studies indicate that text comprehension is better with paging, especially for narrative, complex and long texts. Piolat et al. found that while there was no difference in reading speed, paging resulted in better comprehension and recall of information [8]. Imai and Omodani reported that both reading time and comprehension level were better with paging than with scrolling [9]. Sanches et al. also reported that for complex content, scrolling reduced reading comprehension, especially when working memory was low [2]. Similarly, Fukaya et al. found that on small touch devices the comprehension level for narrative texts was slightly better with paging, while for procedural texts both scrolling and paging are suitable [10]. According to Wastlaund et al., reading a text document with a page layout can reduce mental load and enhance speed and comprehension [11].

2.3. Visual Structure and Text Signals

Other studies have investigated the effect of visual structure on reading comprehension [12], [13]. This line of work rests on the effect of text signaling on the cognitive processing of text. Text signals are used by authors to clarify text organization and emphasize important content [13]. They include a variety of writing devices, such as typographical cues, preview statements and overviews, and titles and headings, that communicate the text organization [13]. Hyona et al. found that the presence of headings in a text aids memory for the text.
Moreover, headings facilitate the search for specific information relevant to them. A heading that communicates organizational information may trigger processing of relations between two subsections that otherwise might not occur [14]. Other research has also shown that the presence of headings aids summarization [15]. Lemari et al. investigated the effects of the visual structure of a text on comprehension in segmented presentation [12]. They found that if readers are not provided with any information about the visual structure of the text (pagination), or if they are provided with unusable information, they rely heavily on the segmentation unit to give a structure to the text. As a result, if the segmentation unit does not match the text structure, readers misinterpret the relationship between text segments.

3. System Architecture

Based on these studies, we propose our system for web page readability enhancement. The overall architecture of the system is shown in Figure 1. The input of the system is the HTML document (DOM tree) and the output is the set of text segments extracted from the document. In between, there are two stages, namely content extraction and content segmentation. The content extraction stage identifies the main content of the web page. Its output is one or more nodes classified as the main content. These nodes are the input to the segmentation stage, which decomposes them into semantic units. The resulting segments are further processed to be presented in a more readable and comprehensible format. The following sections explain these stages.

Figure 1. Main stages of the proposed system.

Journal of Knowledge-Based Engineering and Innovation (JKBEI) Universal Scientific Organization, http://www.aeuso.org/jkbei ISSN: 2413-6794 (Online)

3.1. Main Content Extraction

There are many approaches to extracting the main content of web pages.
Most unsupervised methods use heuristic rules to determine the main content automatically. The features used in content classification range from visual features to text features such as sentence or link density. The methods vary mainly in how general or specific they intend to be, and in their target application. To select a suitable approach for the proposed system, we made the following assumptions:

• Our main goal is to enhance the readability of reading materials such as news articles, blogs, encyclopedia articles and so on; therefore, we can make more assumptions about the structure of the web page.
• We work on the page as rendered by a web browser; therefore, we have access to the dynamic and visual properties of the page elements.
• The speed and accuracy of the extraction algorithm are more important than its generality.

Based on these assumptions and requirements, a suitable method could be a densitometric method, which has proved efficient for content-rich documents such as news and encyclopedia articles [16]. To improve the efficiency of such a method, we can employ vision-based features such as the location of a block in the page (e.g. the main block often appears in the central part of the page), because the system has access to the rendered page elements. We selected the method introduced by Kohlschütter et al. for boilerplate detection, which uses shallow text features such as link density and text density ratios. They assumed that textual content on web pages can be grouped into two main classes: long text (most likely the main content) and short text (most likely navigational boilerplate text). Using this simple classification model, they achieved competitive accuracy.

3.2. Content Segmentation into Semantic Units

In this stage, the extracted nodes that contain the main content are decomposed into semantic units. By a semantic unit, we mean a piece of content that conveys coherent information.
We use a recursive algorithm that takes as input the sub-tree associated with each extracted node of the DOM tree. It traverses the nodes of this sub-tree in a depth-first manner.
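
The recursive depth-first traversal described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the text available here does not state when the recursion stops, so the sketch assumes that descent ends at tags matching the semantic-unit examples given in the Introduction (paragraphs, itemized lists, images, tables). The `Node` class, the `UNIT_TAGS` set and the `segment` function are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumption: tags treated as indivisible semantic units, chosen to mirror the
# paper's examples (paragraphs, itemized lists, images, tables).
UNIT_TAGS = {"p", "ul", "ol", "img", "table"}


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element of the extracted main content."""
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def segment(node: Node) -> list[Node]:
    """Depth-first traversal that returns semantic units in document order."""
    if node.tag in UNIT_TAGS:
        # Stop descending: the whole sub-tree is kept as one unit.
        return [node]
    units: list[Node] = []
    for child in node.children:
        units.extend(segment(child))
    if not node.children and node.text.strip():
        # A text-bearing leaf that matched no unit tag (e.g. a heading).
        units.append(node)
    return units


# A small sub-tree standing in for one extracted main-content node.
main = Node("div", children=[
    Node("h2", text="Title"),
    Node("div", children=[
        Node("p", text="First paragraph."),
        Node("ul", children=[Node("li", text="item")]),
    ]),
    Node("table"),
])

print([unit.tag for unit in segment(main)])  # ['h2', 'p', 'ul', 'table']
```

In this sketch, container elements such as `div` are never emitted themselves; only their unit-level descendants are, which keeps each resulting segment small enough to be presented on its own page.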
