Web Page Segmentation and Pagination for Enhancing Readability

Vol. 2(4), Jan. 2016, pp. 234-238

Ahmad Pouramini
Department of Computer Engineering, Sirjan University of Technology, Sirjan, Iran
*Corresponding Author's E-mail: [email protected]

Abstract

Web page readability can be defined as the combination of reading comprehension, reading speed and user satisfaction. To improve the readability of a web page, content extraction and transformation techniques are used to present the main content to the reader in a more readable fashion. In this paper, we present the design and architecture of a readability enhancement system that aims to improve both reading speed and comprehension. To achieve these goals, we extract the main content and segment it into smaller, coherent semantic units. These units are then augmented with text signals such as section headings, captions and page numbers, which convey the text organization and the visual structure of the page and thereby improve comprehension. The proposed system is particularly suited to constrained display devices such as mobile phones and PDAs.

Keywords: Reading comprehension; Readability Enhancement; Web page Customization.

1. Introduction

The World Wide Web has grown rapidly in recent years. With the large amount of information on the Internet, web pages have become the main source of information. However, reading web pages on a computer screen or a mobile phone presents some difficulties. Besides the main content, a web page may contain distracting parts such as ads, animations and logos that degrade the readability of the main content. In addition, color contrast, font style, letter spacing, layout, line height and line length are among the other factors that affect web page readability [1].
The problem can be more serious for specific groups such as older adults, visually impaired users and non-native readers (those reading a page in a non-native language). These readers need more concentration to comprehend a text, especially a news or scientific article [2].

In this paper, we propose a system for enhancing web page readability. We define readability as the combination of reading comprehension, reading speed and user satisfaction. The main stages of our method are extracting the main content, segmenting it into coherent semantic units, and presenting each unit to the reader on a separate page. In our definition, a semantic unit is a discrete chunk of information that serves a specific, meaningful purpose within the overall structure of a topic, such as an itemized list, a paragraph, an image or a table. In addition, text signals such as section headings are added to each page to help the user keep track of the text organization.

2. Web Page Readability

Typically, readability tools use content extraction methods to eliminate distracting parts, and content reformatting and transformation to improve reading speed and comprehension. Eliminating distracting parts can also make the web easier to access on constrained devices such as mobile phones [3], [4]. In the following sections, we review the literature on enhancing page readability and usability.

Article History: Received Date: 15 Sep. 2015; Accepted Date: 17 Dec. 2015; Available Online: 09 Jan. 2016. JKBEI DOI: 649123/11034

2.1. Content Transformation and Reformatting

Richards et al. [5] proposed guidelines for adapting and transforming web content for specific populations, including disabled people, older adults and visually impaired users.
Certain changes, such as a different font style, font enlargement, increased inter-letter spacing and enhanced color contrast, can increase legibility for these populations. Other studies focus on readability enhancement at the user interface level. In an effort to enhance online reading, Walker et al. used a visual-syntactic text formatting (VSTF) method in which sentences are analyzed and reformatted into cascading patterns that cue syntactic structure and assist visual processing [6]. Similar work [7] offered a visual reformatting method aimed at improving readability for non-native readers of English documents.

2.2. Scrolling vs. Paging

Several studies have examined the effects of scrolling and paging on text reading and comprehension. Many of them indicate that comprehension is better with paging, especially for narrative, complex and long texts. Piolat et al. found that while there was no difference in reading speed, paging resulted in better comprehension and recall of information [8]. Imai and Omodani reported that both reading time and comprehension level were better with paging than with scrolling [9]. Sanches et al. also reported that, for complex content, scrolling reduced reading comprehension, especially when working memory capacity was low [2]. Similarly, Fukaya et al. found that on small touch devices the comprehension level for narrative texts was slightly better with paging, while for procedural texts both scrolling and paging were suitable [10]. According to Wastlaund et al., reading a text document with a page layout can reduce mental load and improve speed and comprehension [11].

2.3. Visual Structure and Text Signals

Other studies have investigated the effect of visual structure on reading comprehension [12], [13]. The underlying assumption is that text signaling affects the cognitive processing of text. Text signals are used by authors to clarify text organization and emphasize important content [13].
They include a variety of writing devices, such as typographical cues, preview statements, overviews, titles and headings, that communicate the text organization [13]. Hyona et al. found that the presence of headings in a text aids memory for the text. Moreover, headings facilitate the search for specific information relevant to them. A heading that communicates organizational information may trigger processing of relations between two subsections that otherwise might not occur [14]. Other research has also shown that the presence of headings aids summarization [15]. Lemari et al. investigated the effects of the visual structure of a text on comprehension in segmented presentation [12]. They found that if readers are given no information about the visual structure of the text (pagination), or are given unusable information, they rely heavily on the segmentation unit to give structure to the text. As a result, if the segmentation unit does not match the text structure, readers misinterpret the relationships between text segments.

Journal of Knowledge-Based Engineering and Innovation (JKBEI), Universal Scientific Organization, http://www.aeuso.org/jkbei, ISSN: 2413-6794 (Online)

3. System Architecture

Based on these studies, we propose our system for web page readability enhancement. The overall architecture of the system is shown in Figure 1. The input of the system is the HTML document (DOM tree) and the output is the set of text segments extracted from the document. In between, there are two stages: content extraction and content segmentation. The content extraction stage identifies the main content of the web page; its output is one or more nodes classified as main content. These nodes are the input to the segmentation stage, which decomposes them into semantic units.
The resulting segments are further processed to be presented in a more readable and comprehensible format. The following sections explain these stages.

Figure 1. Main stages of the proposed system.

3.1. Main Content Extraction

There are many approaches to extracting the main content of web pages. Most unsupervised methods use heuristic rules to determine the main content automatically. The features used in content classification range from visual features to text features such as sentence or link density. The approaches vary mainly in how general or specific they intend to be and in their target application. To select a suitable approach for the proposed system, we made the following assumptions:

• Our main goal is to enhance the readability of reading materials such as news articles, blogs and encyclopedia articles; therefore, we can make more assumptions about the structure of the web page.
• We work on the page as rendered by a web browser; therefore, we have access to the dynamic and visual properties of the page elements.
• The speed and accuracy of the extraction algorithm are more important than its generality.

Based on these assumptions and requirements, a suitable method could be a densitometric method, which has proved efficient for content-rich documents such as news and encyclopedia articles [16]. To improve the efficiency of such a method, we can employ vision-based features such as the location of a block in the page (e.g. the main block often appears in the central part of the page), because the system has access to the rendered page elements. We selected the method introduced by Kohlschütter et al. for boilerplate detection, which uses shallow text features such as link density and text density ratios. They assumed that textual content on web pages can be grouped into two main classes: long text (most likely the main content) and short text (most likely navigational boilerplate text).
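The two-class idea above can be illustrated with a minimal sketch. It is not the classifier of Kohlschütter et al. and not the implementation of the proposed system: the `TextBlock` structure, the function names, and both thresholds are assumptions chosen for illustration, and word count is used as a simple stand-in for "long" versus "short" text.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A contiguous run of text taken from the page (hypothetical structure)."""
    text: str          # visible text of the block
    linked_words: int  # number of its words that sit inside <a> elements


def word_count(block: TextBlock) -> int:
    """Number of whitespace-separated words in the block."""
    return len(block.text.split())


def link_density(block: TextBlock) -> float:
    """Share of the block's words that are link text (0.0 for an empty block)."""
    words = word_count(block)
    return block.linked_words / words if words else 0.0


def is_main_content(block: TextBlock,
                    min_words: int = 15,
                    max_link_density: float = 0.33) -> bool:
    """Long, mostly unlinked text counts as content; thresholds are assumed."""
    return (word_count(block) >= min_words
            and link_density(block) <= max_link_density)


# A 40-word article paragraph with two linked words, and a four-link menu.
article = TextBlock(text=" ".join(["word"] * 40), linked_words=2)
menu = TextBlock(text="Home News Sport Contact", linked_words=4)
print(is_main_content(article), is_main_content(menu))
```

With these placeholder thresholds the article paragraph is kept and the menu is rejected. In a real system the thresholds would need to be tuned on labeled pages.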
Using this simple two-class model, Kohlschütter et al. achieved competitive accuracy.

3.2. Content Segmentation into Semantic Units

In this stage, the extracted nodes that contain the main content are decomposed into semantic units. By a semantic unit, we mean a piece of content that conveys coherent information. We use a recursive algorithm that takes as input the sub-tree of the DOM tree associated with each extracted node. It traverses the nodes of this sub-tree in a depth-first manner.
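A depth-first segmentation of this kind might look like the following sketch. The text above only states that the algorithm is recursive and depth-first, so everything else here is an assumption: the `Node` class is a minimal stand-in for a DOM element, the set of tags treated as semantic units is illustrative, and `paginate` shows one possible way to attach the heading and page-number signals mentioned in the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Tags treated as indivisible semantic units (an assumption of this sketch).
UNIT_TAGS = {"p", "ul", "ol", "table", "img", "figure", "blockquote", "pre"}


@dataclass
class Node:
    """Minimal stand-in for a DOM element."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def segment(root: Node) -> List[Tuple[str, Node]]:
    """Depth-first walk that returns (nearest preceding heading, unit) pairs."""
    units: List[Tuple[str, Node]] = []
    heading = ""

    def visit(node: Node) -> None:
        nonlocal heading
        if node.tag in HEADING_TAGS:
            heading = node.text            # remember the latest section heading
        elif node.tag in UNIT_TAGS:
            units.append((heading, node))  # emit the unit whole, do not descend
        else:
            for child in node.children:    # container: recurse in document order
                visit(child)

    visit(root)
    return units


def paginate(units: List[Tuple[str, Node]]) -> List[Dict[str, object]]:
    """Put each unit on its own page, with heading and page-number signals."""
    total = len(units)
    return [{"heading": h, "page": i, "of": total, "tag": n.tag, "text": n.text}
            for i, (h, n) in enumerate(units, start=1)]


doc = Node("div", children=[
    Node("h2", "Introduction"),
    Node("p", "First paragraph."),
    Node("section", children=[
        Node("h2", "Method"),
        Node("ul", "step one; step two"),
        Node("p", "Second paragraph."),
    ]),
])
pages = paginate(segment(doc))
for page in pages:
    print(page["heading"], page["page"], "of", page["of"], page["tag"])
```

Emitting a unit without descending into it keeps lists and tables intact on one page. This follows the earlier observation that a segmentation unit that does not match the text structure can mislead the reader.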