<<

SpeedReader: Reader Mode Made Fast and Private

Mohammad Ghasemisharif Peter Snyder University of Illinois at Chicago Software Chicago, IL, USA San Francisco, CA, USA [email protected] [email protected] Andrius Aucinas Benjamin Livshits Brave Software Brave Software & Imperial College London London, UK London, UK [email protected] [email protected]

ABSTRACT ACM Reference Format: Most popular web browsers include “reader modes” that improve Mohammad Ghasemisharif, Peter Snyder, Andrius Aucinas, and Benjamin the user experience by removing un-useful page elements. Reader Livshits. 2019. SpeedReader: Reader Mode Made Fast and Private. In Pro- ceedings of the 2019 Conference (WWW ’19), May 13– modes reformat the page to hide elements that are not related to the 17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 12 pages. page’s main content. Such page elements include site navigation, ://doi.org/10.1145/3308558.3313596 advertising related videos and images, and most JavaScript. The intended end result is that users can enjoy the content they are 1 INTRODUCTION interested in, without distraction. In this work, we consider whether the “reader mode” can be “Web bloat” is a colloquial term that describes the trend in websites widened to also provide performance and privacy improvements. to accumulate size and visual complexity over time. The phenom- Instead of its use as a post-render feature to clean up the clutter on a ena has been measured in many dimensions, including total page page we propose SpeedReader as an alternative multistep pipeline size [7], page load time [5, 44, 45], memory needed [30], number of that is part of the rendering pipeline. Once the tool decides during network requests [16, 28], amount of scripts executed [26, 34, 37, 39] the initial phase of a page load that a page is suitable for reader mode and third parties contacted [25, 26, 28]. This work suggests that use, it directly applies document tree translation before the page is growth in page size and complexity is outpacing improvements in rendered. Based on our measurements, we believe that SpeedReader device hardware. All of this has a predictably negative impact on can be continuously enabled in order to drastically improve end- user experience. user experience, especially on slow mobile connections. Combined Web users and browser vendors have reacted to this “bloat” in a with our approach to predicting which pages should be rendered variety of ways, all partially helpful, but with significant downsides. in reader mode with 91% accuracy, SpeedReader achieves average Ad and tracking blockers are a popular and useful tool for reducing speedups and bandwidth reductions of up to 27× and 84×, respec- the size complexity of sites. Prior work has shown that these tools tively. We further find that our novel “reader mode” approach brings can be effective in reducing privacy leaks [31], network use, and with it significant privacy improvements to users. Our approach extend device memory life. Such tools, which use filter lists, are effectively removes all commonly recognized trackers, issues 115 inherently limited in the scope of improvements they can achieve. fewer requests to third parties, and interacts with 64 fewer trackers While these filter lists are large42 [ ], they are small as a proportion on average, on transformed pages. of all URLs on the web. Similarly, while these lists are updated often, they are updated slowly compared to URL updates on the web. CCS CONCEPTS Similarly, “reader mode” tools, provided in many popular browsers and browser extensions, are an effort to reduce the grow- • Human-centered computing → Web-based interaction; • ing visual complexity of websites. Such tools attempt to extract Information systems → Browsers; Clustering and classifica- the subset of page content useful to users, and remove advertising, tion; Content analysis and feature selection; • Security and privacy animations, boilerplate code, and other non-core content. Current → Privacy protections. “reader modes” do not provide the user with resource savings since KEYWORDS the referenced resources have already been fetched and rendered. The growth and popularity of such tools suggest they are useful to Reader Mode; Boilerplate Removal; Web Document Classification; browser users, looking to address the problem of page clutter and Web Performance; visual “bloat”. In this work, we propose a novel strategy called SpeedReader This paper is published under the Creative Commons Attribution 4.0 International for dealing with resource and bloat on websites. Our technique (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. provides a user experience similar to existing “reader mode” tools, WWW ’19, May 13–17, 2019, San Francisco, CA, USA but with network, performance, and privacy improvements that © 2019 IW3C2 (International World Wide Web Conference Committee), published exceed existing ad and tracking blocking tools, on a significant under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-6674-8/19/05. portion of websites. Significantly, SpeedReader differs from exist- https://doi.org/10.1145/3308558.3313596 ing deployed reader mode tools by operating before page rendering, which allows it to determine which resources are needed for the websites user encounters that are amenable to SpeedReader, under page’s core content before fetching. several browser use scenarios. Section 5 provides some discussion How we achieve speedups. SpeedReader achieves its perfor- for how our findings can inform future readability, privacy and mance improvements through a two-step pipeline: performance work, Section 6 places this work in the context of prior research, and Section 7 concludes. (1) SpeedReader uses a classifier to determine whether there is a readable subset of the initial, fetched page HTML. This classifier is trained on a labeled corpus of 2,833 websites (see 2 BACKGROUND Section 3), and determines whether a page can be display in 2.1 Terminology reader mode with 91% accuracy. This subsection presents several terms that are not standardized. (2) If the classifier has determined that the page is readable, We present them up front, to ease the understanding of the rest of SpeedReader extracts the readable subset of document be- the work. fore rendering, using a variety of heuristics developed in prior research [24] and browser vendors [9, 23], and passes the Reader mode. We use the term “reader mode” to describe any tool simplified, reader mode document to the browser’s render that attempts to extract a useful subset of a website for a simplified layer. This tree translation step is described in Section 4. presentation. These tools can be either included in the by the browser vendor, added by users through browser extensions, Deployment. Combined with a highly accurate classifier of “read- or provided by third parties as a web service. Our use of the term able” pages, the drastic improvements in performance, reduction in “reader mode” is generic to the concept, and should not be confused bandwidth use and elimination of trackers in reader mode make with any specific tool. the approach practical for continuous use. We therefore propose Classification and transduction. Reader mode tools generally SpeedReader as a sticky feature that a user can toggle to be always include both a technique for determining whether a page is readable, on. This approximates the experience of using an e-book reader, which we refer to as “classification”, and a strategy for converting but with strengths of content availability on the web. It is also a the initial HTML tree into a simplified reader mode tree, which we suitable strategy for content prerendering or prefetching that could refer to as “tree transduction”. Though most reader mode tools in- be implemented by web browser vendors, automatically delivering clude both steps within a single tool or library, they are conceptually graceful performance degradation in poor connectivity areas or on distinct. underpowered mobile devices until the rest of the page content can be fetched for a complete render. Readable. We use the term “readable” to describe whether a web Contributions. page contains a subset of content that would be useful to display in a reader mode presentation. Reader mode presentation works best on • Novel approach to Reader Mode - combining machine- pages that are text and image focused, and that are mostly static (i.e. learning driven approach to checking whether content can few interactive page elements). Examples of such readable pages in- be transformed to text-focused representation for end-user clude articles, blog posts, and news sites. Reader mode presentation consumption. does not work well on websites that are highly interactive, or when • Applicability - we demonstrate that 22.0% of web pages a page’s structure is significant to the page’s content. Examples are convertible to reader mode style in a dataset of pages of such non-readable pages include web applications (e.g. reported popular by Alexa. We further demonstrate that Mail, ) or pages that are indexes of other content. 46.27% of pages shared on social networks are readable. • Privacy - we demonstrate that using reader mode in the 2.2 Existing Reader Modes proposed design provides superior privacy protection, ef- fectively removing all trackers from the tested pages, and Several popular web browsers include reader modes designed to dramatically reducing communication with third-parties. simplify a page’s presentation, so that browser users can read the • Ad Blocking - we show that our unique reader mode ap- page’s contents without distraction of visual clutter such as adver- proach blocks ads at least as well as existing ad blocking tools, tisements, page animations, and unnecessary page boilerplate (e.g. blocking 100% of resources labeled as advertising related by footers, page navigation, comments). EasyList in a crawl of 91,439 pages, without the need to use In this section, we give a brief description of several existing hand curated, hard-coded filter lists. reader mode tools, how they’re deployed by their authors, and how • Speed - we find that the lightweight nature of reader mode they are used in the evaluations given in the rest of this paper. content results in huge performance gains, with up to 27× Readability.js. Readability.js [9] is an open source reader mode page load time speedup on average, together with up to 84× library, implemented in JavaScript. It is maintained by Mozilla, and bandwidth and 2.4× memory reduction on average. is used for the reader mode function in . The code is closely Paper organization. The rest of this paper is structured as follows. related to “Readability” [2], an open sourced library developed by Section 2 provides background information to place SpeedReader Arc90 and used for their now-defunct readability.com web service. in context. Section 3 describes the design, evaluation and accuracy Classification works by looking for the element on the page with of the classifying step in the SpeedReader pipeline, and Section the highest density of text and link nodes. If the number and density 4 gives the design and evaluation of the reader mode extraction of text and link nodes in that element exceed a given threshold, step in the SpeedReader pipeline. Section 3.3 measures how many the library treats the page as readable. Tree transduction works 2 by normalizing the contents of the text-and-link dense element (to HTML HTML remove styling and other mark up), looking for near-by images for Page rendering, Extract features executing inclusion, and using text patterns in the document that identify the JavaScript page’s author, source and publication date. No Has readable Fetching subset?(Classifier) Significant to SpeedReader, Readability.js does not consider any resources, trackers, ads, etc. display or presentation information when performing either the Yes Page rendering, classification or tree transduction steps. This means that the page executing Tree JavaScript transduction Extract features does not need to be loaded and rendered to generate a reader mode

Fetching presentation (though in practice Firefox does not use this library in Reader Mode resources, No trackers, ads, etc. Has readable this way). subset?(Classifier) Get necessary resources for Yes Do not show the reader mode reader mode Reader View. Safari Reader View is a JavaScript library button Show the reader that implements the reader mode presentation in Safari. Like Read- mode button ability.js, it is also a from Arc90’s “Readability”, though Apple Tree has changed how the library works in significant ways. In addition transduction to looking for elements with high text and anchor density, Sa- fari Reader View also uses presentation-level heuristics, including Reader Mode where elements appear on the page and what elements are hid- Get necessary SpeedReader resources for den from display by default. Relevant to , this means reader mode that Safari Reader View must load a page and at least some of its Figure 1: Comparison of SpeedReader (left) with other existing resources (e.g. images, CSS, JavaScript) to perform either the clas- reader modes (right) sification or tree transduction level decisions. Because significant portions of Safari Reader View require a document be fetched and small part of DOM Distiller’s strategy. We modified DOM Distiller rendered before being evaluated, we do not consider it further in to remove these display level checks, so that DOM Distiller could this work (for reasons that are detailed in Sections 3 and 4). be applied to prerendered HTML documents. We note that the evaluation of DOM Distiller in this work uses this modified ver- BoilerPipe. BoilerPipe is an academic research project from sion of DOM Distiller, and not the version that Google ships with Kohlschütter et al. [24], and is implemented in Java. BoilerPipe Chrome. We expect this modification to have minimal effects on has not been deployed directly by any browser vendor. BoilerPipe the discussed measurements, but draw the reader’s attention to this does not provide functionality for (readability) classification, and change for completeness. assumes that any HTML document contains a readable subset. For tree transduction, BoilerPipe considers number of words and link 2.3 Comparison to SpeedReader density features. Like Readability.js, it does not require a browser to load and render a page in order to do reader mode extraction. The reader mode functionality shipped with all current major Their analysis reveals a strong correlation between short text and browsers is applied after the document is fully fetched and ren- 1 boilerplate, as well as long text and actual content text (of the tex- dered. This greatly restricts the possible performance, network tual content) on the Web. Using features with low calculation cost and privacy improvements existing reader modes can achieve. In such as number of words enables BoilerPipe to lower the overhead fact, in some reader mode implementations we measured, using while maintaining high accuracy. reader modes increased the amount of network used, as some re- sources were fetched twice, i.e. once for the initial page loading, DOM Distiller. DOM Distiller is a JavaScript and ++ library main- and then again when presenting images in the reader mode display. tained by Google, and used to implement reader mode in recent Most significantly, SpeedReader differs from existing reader versions of Chrome. The project is based on BoilerPipe, though mode techniques in that it is implemented strictly before the display, has been significantly changed by Google. The classification step rendering, and resource-fetching steps in the browser. SpeedReader in DOM Distiller uses a classifier based approach, and considers can therefore be thought of as a function that sits between the features such as whether the page’s URL contains certain keywords browser’s network layer (i.e. takes as input the initial HTML doc- (e.g. “forum”, “.php”, “index”), if the page’s markup contains Face- ument), and returns either the received HTML (when there is no book open graph, Google AMP, identifiers, or the number of “/” readable subset), or a greatly simplified HTML document, represent- characters used in the URL’s path, in addition to the text-and-link ing the reader mode presentation (when there is a readable subset). density measures used by Readability.js. At a high level, the tree Figure 1 provides a high level comparison of how SpeedReader transduction step also looks at text-and-link dense element in the functions, compared to existing reader modes. page, as well as special-cased embedded elements, such as YouTube The fact that SpeedReader only considers features available in or Vimeo videos. the initial HTML and URL enables SpeedReader to achieve perfor- DOM Distiller considers some render-level information in both mance orders of magnitude above existing approaches. Figure 2 the classification and tree transduction steps. For example, any elements that are hidden from display are not included in the text- 1While Readability.js does not require that the page be rendered before making reader and-link density measurements. These render-level checks are a mode evaluations, in practice Firefox does not expose reader mode functionality to users until after the page is fetched and loaded. 3 Figure 3: The example page transformed with each of the evaluated SpeedReader transducers

3.1 Classifier Design The classification step of SpeedReader uses a random forest clas- sifier, trained on a hand-labeled data set of 2,833 websites. Our Figure 2: An example page loaded with browser classifier takes as input a string, depicting an HTML document, with no modifications and returns a boolean label of whether there is a readable subset of the document. We note that the input to the classifier is the initial provides a strawman example of a news page as delivered to a stan- HTML returned by the server, and not the final state of the website dard client: including portal branding and content, but also a range after JavaScript execution. of to different articles, images and trackers, for a total of 2.7MB Our classifier is designed to execute quickly, since document of data transferred and 53 scripts executed. Figure 3 demonstrates rendering is delayed during classification. The classifier is trained the functionality of SpeedReader when applying existing reader using 50 estimators, it expands the nodes until all leaves are pure or mode transducers to just the initial HTML document. Therefore, contain less than 2 samples, and considers 21 features, each selected for documents SpeedReader determines are readable, the sources to be extractable quickly. Selected features include the number of of SpeedReader improvements include: text nodes, number of words, the presence of Facebook open graph or Google AMP markup, and counts for a variety of other tags. • Our classifier considers the following features. We have made Never fetching or executing script or CSS. 2 • Fetching far fewer images or videos (since images and videos the source code for our classifier available publicly as well. not core to the page’s presentation are never retrieved). • Counts of the following tags:

,

    ,
      ,
      , • Performing network requests to far fewer third parties (zero,
      ,
      , ,