Vol. 2(4), Jan. 2016, pp. 234-238

Web Page Segmentation and for Enhancing Readability

Ahmad Pouramini
Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran
*Corresponding Author's E-mail: [email protected]

Abstract

Web page readability can be defined as the combination of reading comprehension, reading speed and user satisfaction. To improve the readability of a web page, content extraction and transformation techniques are used to present the main content to the reader in a more readable fashion. In this paper, we present the design and architecture of a readability enhancement system. We aim at enhancing both reading speed and comprehension. To achieve these goals, we extract the main content and segment it into smaller coherent semantic units. These units are further augmented with text signals such as section headings, captions and page numbers in order to convey the text organization and the visual structure of the page, thus enhancing content comprehension. Our proposed system particularly suits constrained display devices such as mobile phones and PDAs.

Keywords: Reading comprehension; Readability Enhancement; Web page Customization.

1. Introduction

The Web has grown tremendously in recent years. With the large amount of information available on it, web pages have become the main source of information. However, reading web pages on a computer screen or a mobile phone poses some difficulties. Besides the main content, a web page may comprise distracting parts such as ads, animations and logos that can degrade the readability of the main content. In addition, color contrast, font style, letter spacing, layout, line height and length of the content are among the other factors that affect web page readability [1]. The problem can be more serious for specific groups such as older adults, visually impaired users and non-native readers (those reading a page in a non-native language). These people need more concentration to comprehend the text, especially if the text is a news or scientific article [2].

In this paper, we propose a system for enhancing web page readability. We define readability as the combination of reading comprehension, reading speed and user satisfaction. The main stages of our method are extracting the main content, segmenting it into coherent semantic units, and presenting each unit on a separate page to the reader. In our definition, a semantic unit is a discrete chunk of information that serves a specific meaningful purpose within the overall structure of a topic, such as an itemized list, a paragraph, an image or a table. In addition, text signals such as section headings are added to each page to help the user keep track of the text organization.

2. Web Page Readability

Typically, readability tools use content extraction methods to eliminate distracting parts, and content reformatting and transformation to enhance the reading speed and comprehension of the content. Eliminating distracting parts can also provide easier access to the web on constrained devices like mobile phones [3], [4]. In the following sections, we review the literature on enhancing page readability and usability.

Article History: Received Date: 15 Sep. 2015; Accepted Date: 17 Dec. 2015; Available Online: 09 Jan. 2016. JKBEI DOI: 649123/11034


2.1. Content Transformation and Reformatting

Richards and Hanson [5] proposed guidelines for adaptations and transformations aimed at specific populations, including disabled people, older adults and visually impaired users. Certain changes such as font style, font enlargement, increased inter-letter spacing and enhanced color contrast can increase legibility for these populations. Other studies focus on enhancing readability by reformatting the text itself. In an effort to enhance online reading, Walker et al. used a visual-syntactic text formatting (VSTF) method in which sentences are analyzed and reformatted into cascading patterns that cue syntactic structure and assist visual processing [6]. In a similar work, Yu and Miller [7] offered a visual reformatting method which aims to improve readability for non-native readers of English documents.

2.2. Scrolling vs. Paging

There are several studies on the effects of scrolling and paging on text reading and comprehension. Many of these studies indicate that text comprehension is better with paging, especially for narrative, complex and long texts. Piolat et al. found that while there was no difference in reading speed, paging resulted in better comprehension and recall of information [8]. Imai and Omodani reported that both reading time and comprehension level were better with paging than with scrolling [9]. Sanchez and Wiley also reported that for complex content, scrolling reduced reading comprehension, especially when working memory capacity was low [2]. Similarly, Fukaya et al. found that on small touch devices the comprehension level for narrative texts was slightly better with paging, while for procedural texts both scrolling and paging are suitable [10]. According to Wästlund et al., reading a text document with a suitable page layout can reduce mental load and enhance speed and comprehension [11].

2.3. Visual Structure and Text Signals

Some other studies investigated the effect of visual structure on reading comprehension [12], [13]. The underlying assumption is that text signaling affects the cognitive processing of text. Text signals are used by authors to clarify text organization and emphasize important content [13]. They include a variety of writing devices such as typographical cues, preview statements and overviews, and titles and headings that communicate the text organization [13].

Hyönä and Lorch found that the presence of headings in a text aids memory for the text. Moreover, headings facilitate the search for specific information relevant to them. A heading that communicates organizational information may trigger processing of relations between two subsections that otherwise may not occur [14]. Other research also showed how the presence of headings aids summarization [15].

Lemarié et al. investigated the effects of the visual structure of text on comprehension in segmented presentation [12]. They found that if readers are not provided with any information about the visual structure of the text (pagination), or if they are provided with unusable information, they rely heavily on the segmentation unit to give a structure to the text. As a result, if the segmentation unit does not match the text structure, it leads to a misinterpretation of the relationship between text segments.

3. System Architecture

Based on these studies, we propose our system for web page readability enhancement. The overall architecture of the system is shown in Figure 1. The input of the system is the HTML document (DOM tree) and the output is the set of text segments extracted from the document. In between, there are two stages, namely content extraction and content segmentation. The content extraction stage identifies the main content of the web page; its output is one or more nodes classified as the main content. These nodes are the input to the segmentation stage, which decomposes them into semantic units. The resulting segments are further processed so that they can be presented in a more readable and comprehensible format. The following sections explain these stages.

Journal of Knowledge-Based Engineering and Innovation (JKBEI) Universal Scientific Organization, http://www.aeuso.org/jkbei ISSN: 2413-6794 (Online)

Figure 1. Main stages of the proposed system.

3.1. Main Content Extraction

There are many approaches to extracting the main content of web pages. Most unsupervised methods use heuristic rules to determine the main content automatically. The features used in content classification range from visual features to text features such as sentence or link density. The approaches vary mainly in how general or specific they intend to be, and in their target application. To select a suitable approach for the proposed system, we made the following assumptions:

• Our main goal is to enhance the readability of reading materials such as news articles, blogs, encyclopedia articles and so on; therefore, we can make stronger assumptions about the structure of the web page.
• We work on the page as rendered by a browser; therefore, we have access to the dynamic and visual properties of the page elements.
• Speed and accuracy of the extraction algorithm are more important than its generality.

Based on these assumptions and requirements, a suitable method could be a densitometric method, which has proved efficient for content-rich documents such as news and encyclopedia articles [16]. To improve the efficiency of such a method, we can employ vision-based features such as the location of a block in the page (e.g. the main block often appears in the central part of the page), because the system has access to the rendered page elements.

We selected the method introduced by Kohlschütter et al. [16] for boilerplate detection using shallow text features such as link density and text density ratios. They assumed that textual content on web pages can be grouped into two main classes: long text (most likely the main content) and short text (most likely navigational boilerplate text). Using this simple classification model, they achieved competitive accuracy.
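The following Python sketch illustrates how shallow text features of this kind could be computed for a block of text. It is only an illustration under stated assumptions, not the classifier of [16] or the implementation of the proposed system: the `Block` structure, the function names and both threshold values are hypothetical, and text density is approximated here as the number of words per line after wrapping the text at a fixed width.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class Block:
    """A text block of the page (hypothetical structure for illustration)."""
    text: str            # all visible text of the block
    link_text: str = ""  # the part of that text that lies inside link anchors


def link_density(block: Block) -> float:
    """Fraction of the block's words that belong to link anchors."""
    words = block.text.split()
    if not words:
        return 0.0
    return len(block.link_text.split()) / len(words)


def text_density(block: Block, wrap_width: int = 80) -> float:
    """Average number of words per line after wrapping at a fixed width."""
    words = block.text.split()
    if not words:
        return 0.0
    lines = textwrap.wrap(block.text, width=wrap_width)
    return len(words) / len(lines)


def is_main_content(block: Block,
                    max_link_density: float = 0.33,
                    min_text_density: float = 9.0) -> bool:
    """Long, link-poor text is treated as content; thresholds are assumed."""
    return (link_density(block) <= max_link_density
            and text_density(block) >= min_text_density)


# A navigation bar consists only of link text; an article body has none.
nav = Block(text="Home News Sports Contact", link_text="Home News Sports Contact")
article = Block(text="word " * 60)
```

With these illustrative thresholds, `is_main_content(nav)` is false and `is_main_content(article)` is true. In practice, the thresholds would have to be tuned on labeled pages, and a vision-based feature such as the block position could be added as a further condition.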

3.2. Content Segmentation into Semantic Units

In this stage, the extracted nodes which contain the main content are decomposed into semantic units. By a semantic unit, we mean a piece of content which conveys coherent information. We use a recursive algorithm which takes as input the sub-tree associated with each extracted node of the DOM tree. It traverses the nodes of this sub-tree in a depth-first manner. When a node matching the features of a semantic unit is visited, the algorithm adds it to the list of extracted segments.

The main feature we use to identify such a unit is the HTML tag: paragraphs (P), lists (UL, OL), headings (H1, H2, …), images (IMG) and tables (TABLE). These tags mark the main structural units of a reading material. In addition, HTML line breaks (the BR tag) and horizontal lines (the HR tag) are used to break down text that is not enclosed by a block-level element.

For tables and images, the caption must be extracted as a separate text unit and then merged with the corresponding unit. A caption is usually a short text enclosed by a block-level HTML element, placed before or after a table or an image. It may contain the keyword “Table” or “Figure”. These features can be used to identify captions.
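The traversal described in this section can be sketched in Python as follows. This is a minimal illustration, not the system's code: the `Node` class is a hypothetical stand-in for a DOM node, the set of inline tags is an assumption added so that links and emphasis do not split loose text, and caption detection and merging are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Tags treated as semantic units, following the list given in the text.
UNIT_TAGS = {"p", "ul", "ol", "img", "table", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags that split text which is not enclosed by a block-level element.
BREAK_TAGS = {"br", "hr"}
# Assumed set of inline tags whose text stays part of the surrounding text.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node; tag '#text' marks a text node."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def inner_text(node: Node) -> str:
    """Concatenate the text of all text nodes below the given node."""
    if node.tag == "#text":
        return node.text.strip()
    parts = (inner_text(child) for child in node.children)
    return " ".join(part for part in parts if part)


def segment(node: Node,
            units: Optional[List[Tuple[str, str]]] = None) -> List[Tuple[str, str]]:
    """Depth-first traversal that collects (tag, text) semantic units."""
    if units is None:
        units = []
    if node.tag in UNIT_TAGS:
        units.append((node.tag, inner_text(node)))
        return units  # a unit is not decomposed any further
    loose: List[str] = []  # text outside of any unit tag

    def flush() -> None:
        if loose:
            units.append(("text", " ".join(loose)))
            loose.clear()

    for child in node.children:
        if child.tag == "#text" or child.tag in INLINE_TAGS:
            text = inner_text(child)
            if text:
                loose.append(text)
        elif child.tag in BREAK_TAGS:
            flush()  # BR and HR end the current loose text unit
        else:
            flush()
            segment(child, units)
    flush()
    return units


root = Node("div", children=[
    Node("h2", children=[Node("#text", "History")]),
    Node("p", children=[Node("#text", "First paragraph.")]),
    Node("div", children=[Node("#text", "Line one"), Node("br"),
                          Node("#text", "Line two")]),
    Node("ul", children=[Node("li", children=[Node("#text", "item")])]),
])
units = segment(root)
```

For the small tree above, `units` holds the heading, the paragraph, two loose text units separated by the BR tag, and the list, in document order. Because the recursion stops at the first unit tag it meets, a table nested inside a list would stay part of the list unit; whether that is desirable depends on the reading material.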

Figure 2. A page augmented with text signals.

3.3. Transformation and Presentation of the Units

By following the stages above, a list of segments is generated which must be presented to the reader. Before that, in order to enhance the comprehension of each segment, specific text signals are added to it (see Figure 2). The main idea is to repeat the section heading in every segment belonging to that section. Note that the headings are extracted as separate units in the previous stage. To add them to the related segments, the list of extracted units is iterated. When a heading element is visited, it is stored in a variable and used as the title of the units that follow it. We use multiple variables corresponding to the different heading levels (H1, H2, H3, …) and repeat all or some of them in a hierarchical manner in the related units. For example, the title “Machine Learning > History and relations to other fields > Relation to statistics” can be used for a text unit about the relation between machine learning and statistics. In Figure 2, only the last two headings (H2 and H3) are used, because the user usually recalls the main topic of the text (“Machine Learning”).

As mentioned before, this organization of headings helps the user to scan the text and find specific information faster. It also helps them to grasp the relationship of the current topic with preceding topics. In addition to headings, the page number and a progress bar showing the position of the current page among all pages are added to assist the user with scanning the document.
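The heading propagation described here can be sketched in Python as below. The sketch rests on a few assumptions that are not stated in the text: units are given as hypothetical (tag, text) pairs, heading units are not shown as pages of their own, a new heading clears the stored headings of deeper levels, and progress is expressed as a fraction between 0 and 1.

```python
from typing import Dict, List, Tuple


def add_heading_titles(units: List[Tuple[str, str]],
                       max_levels: int = 2) -> List[Dict[str, object]]:
    """Attach the trail of current section headings to every non-heading unit."""
    current: Dict[int, str] = {}  # heading level -> most recent heading text
    pages: List[Dict[str, object]] = []
    for tag, text in units:
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            level = int(tag[1])
            current[level] = text
            # A new heading ends all subsections below its level (assumption).
            for deeper in [lvl for lvl in current if lvl > level]:
                del current[deeper]
            continue
        trail = [current[lvl] for lvl in sorted(current)]
        shown = trail[-max_levels:] if max_levels > 0 else []
        pages.append({"title": " > ".join(shown), "body": text})
    # Page number and progress are the further text signals named in the text.
    total = len(pages)
    for number, page in enumerate(pages, start=1):
        page["page"] = number
        page["progress"] = number / total
    return pages


units = [
    ("h1", "Machine Learning"),
    ("p", "Introductory text."),
    ("h2", "History and relations to other fields"),
    ("h3", "Relation to statistics"),
    ("p", "Text about statistics."),
    ("h2", "Theory"),
    ("p", "Text about theory."),
]
pages = add_heading_titles(units, max_levels=2)
```

With `max_levels=2`, the second page carries the title “History and relations to other fields > Relation to statistics”, which corresponds to showing only the last two headings; a value of 3 would prepend “Machine Learning” as in the full example title.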

3.4. Implementation

The system was implemented as an application with an embedded web browser, which provides access to the DOM tree and the rendered properties of the page elements. These properties are required in the extraction and segmentation phases. Moreover, by using the rendered elements on each page, the original visual appearance of the text is preserved. To present the extracted units on separate pages, JavaScript and CSS3 were used to build a slideshow. Finally, readability guidelines concerning font size, contrast between text and background, and spacing are applied to provide a better presentation, especially on constrained display devices.

Conclusion

In this paper, we have proposed the design of a web page readability tool which aims at enhancing both reading speed and comprehension. We first reviewed related studies on reading and comprehension and, based on them, proposed the architecture and the main stages of the system. The system mainly involves extraction of the main content, decomposition of the extracted content into semantic units, and pagination. The units are augmented with text signals such as section headings in order to convey the text organization. Moreover, presenting each unit on a separate page with enough spacing and margins and a proper font size improves reading speed and user satisfaction. We pointed out that this system is especially useful for constrained devices like PDAs and mobile phones.

References

[1] Q. Pakistan, “Web readability factors affecting users of all ages,” Australian Journal of Basic and Applied Sciences, vol. 5, no. 11, pp. 972–977, 2011.
[2] C. A. Sanchez and J. Wiley, “To scroll or not to scroll: Scrolling, working memory capacity, and comprehending complex texts,” Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 51, no. 5, pp. 730–738, 2009.
[3] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “Extracting content structure for web pages based on visual representation,” in Web Technologies and Applications, Springer, 2003, pp. 406–417.
[4] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, “DOM-based content extraction of HTML documents,” in Proceedings of the 12th International Conference on World Wide Web, 2003, pp. 207–214.
[5] J. T. Richards and V. L. Hanson, “Web accessibility: A broader view,” in Proceedings of the 13th International Conference on World Wide Web, 2004, pp. 72–79.
[6] S. Walker, P. Schloss, C. R. Fletcher, C. A. Vogel, and R. C. Walker, “Visual-syntactic text formatting: A new method to enhance online reading,” Reading Online, vol. 8, no. 6, pp. 1096–1232, 2005.
[7] C.-H. Yu and R. C. Miller, “Enhancing web page readability for non-native readers,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 2523–2532.
[8] A. Piolat, J.-Y. Roussey, and O. Thunin, “Effects of screen presentation on text reading and revising,” International Journal of Human-Computer Studies, vol. 47, no. 4, pp. 565–589, 1997.
[9] J. Imai and M. Omodani, “Reason why comprehension level tends to decrease at reading tasks on displays-challenge to the realization of readable electronic papers,” Nihon Gazo Gakkaishi/Journal of the Imaging Society of Japan, vol. 46, no. 2, 2007.
[10] T. Y. Fukaya, S. Ono, M. Minakuchi, S. Nakashima, M. Hayashi, and H. Ando, “Reading text on a smart phone: Scrolling vs. paging: Toward designing effective electronic manuals,” in User Science and Engineering (i-USEr), 2011 International Conference on, 2011, pp. 59–63.
[11] E. Wästlund, T. Norlander, and T. Archer, “The effect of page layout on mental workload: A dual-task experiment,” Computers in Human Behavior, vol. 24, no. 3, pp. 1229–1245, 2008.
[12] J. Lemarié, H. Eyrolle, and J.-M. Cellier, “The segmented presentation of visually structured texts: Effects on text comprehension,” Computers in Human Behavior, vol. 24, no. 3, pp. 888–902, 2008.
[13] R. F. Lorch Jr, “Text-signaling devices and their effects on reading and memory processes,” Educational Psychology Review, vol. 1, no. 3, pp. 209–234, 1989.
[14] J. Hyönä and R. F. Lorch, “Effects of topic headings on text processing: Evidence from adult readers’ eye fixation patterns,” Learning and Instruction, vol. 14, no. 2, pp. 131–152, 2004.
[15] R. F. Lorch Jr and E. P. Lorch, “Effects of headings on text recall and summarization,” Contemporary Educational Psychology, vol. 21, no. 3, pp. 261–278, 1996.
[16] C. Kohlschütter, P. Fankhauser, and W. Nejdl, “Boilerplate detection using shallow text features,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010, pp. 441–450.
