Analysing and testing HTML5 parsers

A dissertation submitted to The University of Manchester for the degree of

Master of Science in the Faculty of Engineering and Physical Sciences

2015

Jose Carlos Anaya Bolaños

School of Computer Science

Contents

Contents ...... 2

List of figures ...... 4

List of tables ...... 5

Abstract ...... 6

Declaration ...... 7

Intellectual property statement ...... 8

Acknowledgements ...... 9

The author ...... 10

1. Introduction ...... 11

1.1 Goal and objectives ...... 12

2. Literature review ...... 14

2.1 HTML history ...... 14

2.2 The HTML5 parsing process ...... 16

2.3 Testing HTML5 ...... 20

2.4 HTML5 parsing implementations ...... 22

3. Project architecture ...... 25

3.1 Overview ...... 25

3.2 Tasks distribution ...... 27

3.3 Project evolution ...... 29

4. Project implementation ...... 31

4.1 The MScParser ...... 31

4.1.1 Architecture ...... 31

4.1.2 The custom HTML5 DOM ...... 33

4.1.3 Challenges ...... 36

4.2 The specification tracer ...... 37


4.2.1 Architecture ...... 37

4.2.2 Challenges ...... 40

4.3 The harness for comparison ...... 40

4.3.1 The parser adaptors ...... 41

4.3.2 The script execution ...... 41

4.3.3 The comparison and report generation ...... 42

4.3.4 Challenges ...... 47

4.4 The web application ...... 49

4.4.1 Architecture ...... 50

4.4.2 Parsing and tracing ...... 51

4.4.3 Comparing outputs ...... 53

4.4.4 Reviewing reports ...... 55

4.4.5 Challenges ...... 57

5. Analysis and Results ...... 59

5.1 The html5lib test suite coverage ...... 59

5.2 The MScParser vs. the html5lib test suite ...... 62

5.3 Comparing parsers with the html5lib test suite ...... 63

5.4 Tracing the web ...... 65

6. Conclusions ...... 68

6.1 Reflection ...... 71

Bibliography ...... 73


List of figures

Figure 1 – Flow diagram of the HTML5 parsing process (adapted from [13]) ...... 17
Figure 2 – A cycle through the tokenizer to emit a token ...... 19
Figure 3 – A cycle through the tree constructor to process an empty string ...... 20
Figure 4 – Overview of the product architecture (adapted from [26]) ...... 26
Figure 5 – Class diagram of the parser ...... 32
Figure 6 – Class diagram of the custom HTML5 DOM ...... 35
Figure 7 – Class diagram of the specification tracer ...... 38
Figure 8 – File example of tracerEvents.xml ...... 39
Figure 9 – Comparator flow diagram ...... 43
Figure 10 – XML report sample ...... 45
Figure 11 – Tracer input tab ...... 51
Figure 12 – Tracer exclusion tabs ...... 52
Figure 13 – Tracer output tabs for the input string this is a test ...... 52
Figure 14 – Input form for the multi-parser comparator tool ...... 53
Figure 15 – Comparison details page ...... 54
Figure 16 – Comparison page displaying differences between outputs ...... 54
Figure 17 – Format options tab ...... 55
Figure 18 – Report class diagram ...... 56
Figure 19 – Report details page ...... 57
Figure 20 – Html5lib tokenizer state tests results ...... 62
Figure 21 – Html5lib tree construction tests results ...... 62
Figure 22 – Insertion modes usage by websites ...... 66
Figure 23 – Tokenizer states usage by websites ...... 67


List of tables

Table 1 – Most popular HTML5 parsers in Github ...... 24
Table 2 – Participation of the members in the project ...... 28
Table 3 – Example of inputs that are HTML5 valid but XML invalid ...... 34
Table 4 – Example of diff encoding ...... 46
Table 5 – Code coverage of the tokenizer states by the html5lib test suite ...... 60
Table 6 – Code coverage of the insertion modes by the html5lib test suite ...... 61
Table 7 – Comparison of parsers vs. html5lib expected output ...... 64
Table 8 – Tracing details over websites ...... 65


Abstract

In the early days of the Web, websites contained only plain text and interlinked images. Over time, websites turned into complex web applications offering diverse services such as multimedia streaming, social networking and gaming. HTML parsers have historically been flexible and permissive with user input. Each parser had to define its own way to parse and fix errors but, due to the increasing complexity of inputs, disagreements and inconsistencies between the outputs of different applications have been growing. Those differences might cause missing or misplaced content, or even inconsistent behaviour, because other technologies, such as AJAX, Javascript and CSS, rely on the HTML content.

HTML5 is the latest version of the HTML standard and the specification includes, for the first time, an algorithm for parsing and handling errors. The specification aims to finally achieve full consistency and interoperability between parsing implementations. However, the new parsing algorithm brought challenges for testing HTML parsers.

This dissertation presents a set of tools for analysing and comparing HTML5 parsers. The tool set includes a specification compliant parser, a tracer of the specification sections used when parsing and a harness to parse and compare outputs from different parsers. In addition to the tool set, an analysis of a test suite (html5lib) is included and discussed.


Declaration

No portion of the work referred to in this dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


Intellectual property statement

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trademarks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the dissertation, for example graphs and tables (“Reproductions”), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s Guidance for the Presentation of Dissertations.


Acknowledgements

Foremost, I would like to express my sincere gratitude to my supervisor Dr Bijan Parsia for all his support and guidance in the research and development of my MSc project. His patience and continuous motivation were key to completing this project and dissertation.

I thank my teammates, Jose Armando and Xiao, for all the debates and discussions that contributed to the development and improvement of our project.

My sincere thanks go to the professors of the School of Computer Science of the University of Manchester, whose passion and love for computer science inspired me to never give up on learning and researching.

I am deeply thankful to the CONACyT for the sponsorship of my MSc course and to Dr Alonso and her LAPP team for helping and guiding me on the admission process and other formalities.

Last but not least, I would like to thank my family and friends in México who were in constant communication with me despite the long distance that separates us.


The author

I graduated as a telematics engineer at the National Polytechnic Institute in Mexico City in 2010. Currently I am pursuing the degree of MSc in Advanced Computer Science with specialization in Advanced Web Technologies.

I have experience and knowledge of some web technologies and applications since I have been working as a web system developer for almost three years.

During the MSc, I took a course called Semi-structured Data and the Web. It was mainly related to XML and to technologies and applications for creating, manipulating and querying XML documents. I found the course quite interesting and I enjoyed it because of its closeness to web system development.

One of the reasons I chose this project is that it is highly related to the aforementioned course. Another reason was that the project was suitable for applying an agile-based methodology. I had never worked following agile techniques and this was a great opportunity to gain some experience.

Finally, the main reason I chose this project was that it promised a lot of programming. I love programming. I made a great decision; the project was full of programming.


1. Introduction

The World Wide Web Consortium (W3C) is an international organization that defines standards regarding web technologies. The mission of the W3C is to “develop protocols and guidelines that ensure the growth of the Web” [1]. Since its foundation, in October 1994, several standards have been promoted for creating, interpreting, rendering and displaying web pages.

The Hypertext Markup Language (HTML) is the most widely used language for web pages. It was born between 1989 and 1990, taking the Standard Generalized Markup Language (SGML) as its base [2]. The W3C has been promoting the use of HTML for achieving full compatibility and agreement between different web vendors. Several HTML versions have been created and in October 2014 the latest version, HTML5, reached the status of Recommendation, i.e., the stage of highest maturity of a W3C standard [3].

Some prior versions of HTML, such as XHTML, are based on the Extensible Markup Language (XML), thus the inputs can be easily parsed using an XML parser. A valid XML document can be analysed and tested by using grammars; nevertheless, those documents are restricted by a strict set of rules defined in a schema. When those rules were not completely met, the parsers had to deal with erroneous inputs in order to produce some output to the user.

With the aim of gaining or maintaining user acceptance, parsers increased their flexibility and permissiveness. This caused the parsing and error-handling processes to become more and more complex. Moreover, inconsistencies among different parsers started to appear. Other web technologies such as Javascript, CSS and AJAX rely on the DOM (a tree-based structure that represents a parsed HTML document). Different DOMs for a single input might cause missing or misplaced content, inconsistent behaviour, etc.

The HTML5 specification includes several changes and improvements with respect to its predecessors. One of those changes is that, for the first time, the parsing process is defined as an algorithm and it includes error handling. This new parsing process is a key feature of HTML5 because it ensures that every input stream of data has a well-defined output (DOM). This certainty about the input-output relation is what drives towards full consistency and interoperability of parsing implementations.

In order to be compliant with the HTML5 specification, an HTML5 parser may be implemented with any technology or programming language as long as it guarantees the same output as the parsing algorithm.

HTML5 brought new challenges for testing parsers. The new parsing algorithm is a convoluted process that relies on finite state machines and a large set of complex data structures. XML-based parsing and grammar-based testing cannot be used because HTML5 has removed the XML restrictions and its grammar is not context free [4]. In [5], a testing method based on a process called reachability analysis is proposed. However, that approach is limited to a subset of the specification.

The use of test cases is the most common approach to testing. There are test suites, such as the W3C and html5lib test suites, which contain test cases for specific sections of the HTML5 specification. However, the use of test cases brings challenges such as proneness to errors, specification coverage, complexity of the testing process and uncertainty about expected outputs. Moreover, the Web evolves constantly and HTML5 evolves with it, thus constant maintenance of the test suites is required.

1.1 Goal and objectives

The goal of the project is to compare different HTML5 parsing implementations and analyse the sources of agreement/disagreement.

The objectives are:

• Develop a specification-conformant HTML5 parser following the given pseudo code.
• Create a comparative tool for inspecting and evaluating outputs from different parsers.
• Manufacture an analysis tool for tracking parsing information and finding sources of disagreement.
• Perform a comparative review of some parsing implementations.
• Analyse an HTML5 test suite.


The set of tools would help with the analysis and comparison of different parser implementations for finding useful information such as the level of agreement, causes of disagreement, specification coverage, usage frequency of specification features, etc.

The html5lib [6] test suite was chosen for comparing parsers. The test suite is public, well documented, constantly updated and it includes more than 8000 test cases for the parsing process.

This dissertation is organized as follows: chapter 2 presents background research on the HTML history, the HTML5 parsing process, current parsing implementations and testing methodologies. Chapter 3 describes the project architecture and the distribution of tasks among team members. The implementation of the project tools is described in chapter 4. In the following chapter, results and analysis of some parsers and the html5lib test suite are presented. Finally, the last chapter presents conclusions and areas of opportunity for future work.


2. Literature review

This chapter presents a brief history of the HTML standard followed by a description of the HTML5 parsing process. Next, some strategies for testing HTML5 are discussed. Finally, information related to HTML5 parsers is presented.

2.1 HTML history

The Hypertext Markup Language (HTML) was born between 1989 and 1990 as an application of the Standard Generalized Markup Language (SGML) [2]. The W3C was founded in 1994 with the aim of increasing the Web's potential through standards and rapidly adopted HTML. In 1995 HTML was extended with new tags and a draft called HTML 3.0 appeared. In 1997 a stable HTML specification, named HTML 3.2, was approved by Microsoft and Netscape (the major browser vendors of that time). In spring 1998 HTML 4.0 reached the status of W3C Recommendation.

HTML documents were validated against a DTD schema. A DTD schema describes the structure of a document, the legal names for elements and attributes, etc. If a document follows a schema's rules, it is said to be a valid document. When a document is valid with respect to a DTD, it is guaranteed that the document can be parsed into a unique DOM. A DOM is an interface to a data structure, represented as a tree, that allows applications to access and manipulate the structure, content and style of documents. The W3C defined a specification for the DOM [7].

In 1996 the W3C presented the Extensible Markup Language (XML) specification (a subset of SGML). XML was designed to be generic, extensible and simple to use [8]. The rules of a well-formed XML document are:

• There is exactly one root element.
• Tags are correct (i.e. between “<” and “>” characters) and properly nested.
• Attributes are unique for each tag and attribute values are quoted.
• No comments inside tags.
• Special characters are escaped.

When a non-well-formed XML document is being parsed, it will produce a fatal error (known as a Draconian error), and consequently the document will not be parsed into a DOM tree by the XML parser.

With the arrival of XML, XML Schema appeared as an alternative to DTD schemas. Unlike a DTD schema, XML Schema included new features such as data types, element enumerations, etc. Moreover, an XML Schema follows the XML syntax. “The W3C believed the Web itself would eventually move to XML” [9] and, in January 2000, the XHTML 1.0 spec was adopted as a W3C Recommendation. Version 1.1 became a Recommendation in May 2001. XHTML is defined as an XML application (i.e. a restricted subset of XML). The XHTML spec included three schemas (Strict, Transitional and Frameset) in order to validate a document and guarantee the uniqueness of a DOM tree.

With its schema validation and rules for well-formedness, XHTML went against the permissive and forgiving approach of HTML. A user would prefer an application that produces some output despite a missing closing tag or an unquoted attribute value over an application failing or displaying an error, as XHTML was proposing.

Some of the W3C members were representatives of major browser vendors such as Mozilla, Apple, Google, Opera, etc. According to them, web pages were turning into something more “than text and images interconnected by links” [9]; they were becoming web applications containing dynamic content and multimedia. To cope with those new features, the W3C began to work on XHTML 2.0.

The first draft of the HTML5 spec (born as a proposal from Mozilla and Opera, called Web Forms 2.0) was presented to the W3C in 2004. The draft was put to a vote and rejected (8 in favour vs. 11 against). Despite the rejection, some members agreed to continue working on the project and formed the Web Hypertext Application Technology Working Group (WHATWG).

The W3C continued to work on XHTML 2.0. However, in 2007 they realised that the spec proposed by the WHATWG had a promising future and asked them to work together. The idea of normalising the way to handle errors seemed more plausible than forcing users to write valid, well-formed documents.

The drafts related to HTML were merged and renamed HTML5. The first official draft of HTML5 appeared in January 2008. Currently, the W3C and the WHATWG specifications are slightly different. The divergence began in 2012, when the W3C introduced a group of editors to organise the draft and decide what should be included in the HTML5 spec and what should be put into other specs. In the W3C Recommendation they claim that “The W3C HTML working group actively pursues convergence of the HTML specification with the WHATWG living standard” [3].

The WHATWG spec is a “living standard” named the HTML Standard [10]. Ian Hickson has been (and continues to be) the sole editor of this spec [9]. That decision was taken because web browsers are constantly experimenting with new behaviours and features. According to Hickson, “The reality is that the browser vendors have the ultimate veto on everything in the spec, since if they don’t implement it, the spec is nothing but a work of fiction” [11].

In fact, the major web browsers (Opera, Google Chrome, Apple Safari, Mozilla Firefox and Microsoft Internet Explorer) claim to be conformant with the WHATWG HTML Standard and not the W3C HTML5 Recommendation. David Baron, a distinguished engineer from Mozilla, said “When the W3C’s and WHATWG’s HTML specifications differ, we tend to follow the WHATWG one” [12].

Each vendor defined its own way to parse and fix HTML when invalid or problematic inputs were presented. Although “error handling is quite consistent in browsers” [4], there were inconsistencies amongst them. In order to finally end these disagreements, the HTML5 spec includes a parsing algorithm and error handling. Moreover, an HTML5 document is not an XML document; therefore it is not subject to the rules for being a well-formed document. The algorithm uses finite state machines and ensures that every¹ input stream of data has a well-defined output.

2.2 The HTML5 parsing process

Figure 1 presents a simplified overview of the HTML5 parsing process.

The data input is a stream of octets. The flow of the parsing process begins with the identification of the encoding of the input stream by using the encoding sniffing algorithm. Typically the user agent explicitly defines the encoding. When no character encoding is specified, the algorithm analyses the stream in order to try to determine the encoding. The specification discourages the use of some character encodings and suggests the use of UTF-8 as the default character encoding [3].

¹ There are some unsupported character encodings in the spec, thus such data cannot be parsed.

Figure 1 – Flow diagram of the HTML5 parsing process (adapted from [13])


The next stage is the pre-processing of the input stream. This stage manipulates some characters and raises errors when control characters² are encountered. After the pre-processing, the tokenizer consumes characters from the input data stream and produces tokens. Those tokens are then consumed by the tree constructor. The tree constructor creates and manipulates a DOM tree that will be the output of the parsing process.

The tokenizer state machine is composed of 69 different states and the transitions are mostly triggered by the data input. The execution of scripts may insert new characters into the input stream. The tree constructor phase is defined by 23 states and the transitions are triggered by the tokens produced by the tokenizer. The DOM is created and manipulated by the tree constructor using several algorithms, data structures and flags. Additionally, the tree constructor may also change the current state of the tokenizer.

The tokenizer

There are six different types of tokens: character, comment, DOCTYPE, end of file, end tag and start tag. A cycle through the tokenizer will consume one or more characters and it will end by emitting one or more tokens. Most of the tokenizer states (62 out of 69) will consume and process one character from the input stream. Depending on the character value, it might be ignored, produce or emit a token (or several), cause a state transition and/or be reconsumed.

The default state of the tokenizer is the Data state (i.e. when a token is emitted, the tokenizer will return to this state). Nevertheless, under some circumstances, the tree construction stage may change the default state. Figure 2 presents a worked example of a cycle for emitting a start tag token with one attribute.

² The Unicode control characters have no visual representation and are used to control how text is displayed.

Input

Tokenizer steps:
1) Data state consumes a “<” character. Switches to tag open state.
2) Tag open state consumes an “a” character. Creates a start tag token with value equal to “a”. Switches to tag name state.
3) Tag name state consumes a space. Switches to before attribute name state.
4) Before attribute name state consumes an “h” character. Creates an attribute for the token with name equal to “h”. Switches to attribute name state.
5) Attribute name state consumes an “r” character. Appends the character to the current attribute name. Keeps consuming characters and, when the “=” character is consumed, switches to before attribute value state.
6) The characters are consumed and appended to the current attribute value. When the “>” character is consumed, the current start tag token is emitted and the tokenizer switches back to the data state.

Figure 2 – A cycle through the tokenizer to emit a token

Recalling the previous example, if the input were the same except for the first character, the transition to the tag open state would never have happened and a character token would have been emitted for each character.

The other states will attempt to consume several characters to identify character references, comments, a DOCTYPE declaration or CDATA sections. It is an attempt because the characters are consumed only if they truly represent one of the previously mentioned values. For example, suppose a transition to the markup declaration open state is made. This state will attempt to consume characters matching DOCTYPE or [CDATA[. If there is a match, the characters are consumed and then a transition is made. If there is no match, a transition is made without consuming the characters.

The tree construction

When the tokenizer completes a cycle, one or more tokens have been generated and the tree construction machine will process the token(s). The DOM tree is manipulated in this stage. A pointer to the current node is used (initially null). Character tokens will create text nodes; comment tokens will create comment nodes; start tag tokens will produce element nodes; end tag tokens will be used for closing element nodes (i.e. the pointer to the current node is updated to point to the parent node). The machine has 23 states called insertion modes. The first state is the initial insertion mode. Figure 3 presents the cycle through the tree construction to process an end of file token (i.e. an empty string input).

Input: “” (empty string)

Tree construction steps:
1) Initial insertion mode switches to before HTML insertion mode and reprocesses the current token.
2) Before HTML insertion mode creates an html element and appends it to the document object (DOM tree). It pushes the html element onto the stack of open elements, switches to before head insertion mode and reprocesses the current token.
3) Before head insertion mode creates a head element and appends it to the DOM tree. It pushes the head element onto the stack of open elements, switches to in head insertion mode and reprocesses the current token.
4) In head insertion mode pops the head element from the stack of open elements, switches to after head insertion mode and reprocesses the current token.
5) After head insertion mode creates a body element and appends it to the DOM tree. It pushes the body element onto the stack of open elements, switches to in body insertion mode and reprocesses the current token.
6) In body insertion mode performs some validations and finally stops parsing.

Figure 3 – A cycle through the tree constructor to process an empty string

The worked example depicts the simplest flow of the tree construction stage. It produces the minimal DOM tree, i.e. a DOM tree that contains only an html element (as the root node) and head and body elements (as children of the html node).

The tree construction stage is very complex and it uses several data structures (stacks and lists), flags, pointers and persistent status (the current insertion mode). Additionally, it includes some other smaller algorithms that are used across insertion modes.

2.3 Testing HTML5

Testing HTML5 is a complex task. Some reasons are:

• The specifications are updated constantly due to the evolving nature of HTML5.
• The specifications are prone to errors, omissions, etc.
• Parsers and test suites require constant maintenance to cope with the spec changes.
• HTML5 has removed the XML restrictions and its grammar is not context free [4], thus XML-based parsing and grammar-based testing cannot be used.
• The number of potential inputs is infinite. It is very hard for a document (the specification) to contemplate every possible scenario or combination of inputs.
• The implementation of the parsing algorithm in code is a subjective process that depends on the style and experience of the programmer(s).

The W3C has its own test suite, described as follows: “The Web Platform Tests Project is a W3C-coordinated attempt to build a cross-browser test suite for the Web-platform stack” [14][15]. The project is hosted on Github and it comprises test cases for the complete HTML5 spec (parsing, HTML element interfaces, encodings, fonts, images, media, events, etc.). It is focused on testing browsers rather than standalone parsing implementations. The WHATWG has a test suite as well [16]. It includes test cases from developers and companies: IE, Opera, Mozilla, Ian Hickson, etc. Neither test suite is complete and both are referred to as works in progress.

The html5lib test suite [6] is public, well documented and constantly updated. It contains more than 8000 entries detailing the input, expected output, expected number of errors, etc. It includes tests for parsing (i.e. tokenizer and tree construction stages), encoding, sanitizing, serializing and validating HTML5. Those test cases are generally trusted as reliable and conformant with the WHATWG spec. The html5lib project was started by four developers and it has contributions (test cases) from several users, including developers of WebKit and Mozilla.

In [5], a couple of researchers from the University of Tsukuba (Japan) present another approach for testing HTML5 parsers. Using a method called reachability analysis and conditional pushdown systems, they generated a set of HTML documents that covers a subset of the specification. Then, they used the obtained set of documents for comparing the outputs of different parsers. The limitation of their work is that it cannot generate tests for the entire specification because of the complex behaviour of the formatting elements and the adoption agency algorithm.


HTML5TEST [17] is a web application that tests browser support for HTML5. It runs several tests and assigns a score. The tests cover various sections of HTML5 such as multimedia, parsing rules, device access, connectivity, performance, etc. According to the authors, they test “the official HTML5 specification, specifications that are related to HTML5 and some experimental new features that are extensions of HTML5”.

2.4 HTML5 parsing implementations

Rendering (or layout) engines are the main type of applications that require HTML parsing. Web browsers use those engines not only to parse HTML but also to parse CSS, execute scripts, render and display content, etc. Usually each major browser vendor has its own layout engine implementation. For example, Google Chrome and Opera use Blink, Apple Safari uses WebKit, Mozilla Firefox uses Gecko, Microsoft Internet Explorer uses MSHTML (also known as Trident), etc. [4].

The new implementation of Gecko, Gecko 2 [18], implements an HTML5 parser compliant with the spec. The parsing process is executed in a separate thread from the main UI thread to improve the responsiveness of the browser. It features speculative parsing in order to parallelize the HTML parsing and the script execution³, improving the performance of the rendering process.

In [19] a new browser engine called Servo is presented. It is written in the Rust programming language instead of C++, which is used by the previously discussed rendering engines. It aims to take advantage of parallel hardware and to achieve better performance, power usage and concurrency management than other rendering engines. The authors state that “Servo must be at least as fast as other browsers at similar tasks to succeed, even if it provides additional memory safety”. It is still under development but so far they have managed to make Servo faster than Gecko in the layout stage.

Apart from web browsers, there are other applications that use rendering engines, such as email managers, Integrated Development Environments (IDEs), e-book readers, VoIP and videoconference applications, etc. For example, Microsoft Outlook and Microsoft Visual Studio both use Trident, the former for rendering emails and the latter for its web page designer [20].

³ It is not always possible to parallelize those tasks. Moreover, to take real advantage of speculative parsing, some suggestions have to be followed.

There are other applications that might require only a standalone HTML parser, i.e. those might not need a complex render engine as they will not execute scripts or render/display the HTML content. Among those applications are HTML debuggers, validators, reporters, web crawlers, text-mining tools, sanitizers, pretty-printers, etc.

Live DOM Viewer [21] is an HTML5 parser developed by Ian Hickson, written in Javascript. It can be accessed online and it displays the HTML output, the rendered view and a representation of the DOM generated.

There are several standalone HTML5 parsers, each offering different features and capabilities. Github claims to be the world’s largest code host. A search for “HTML5 parser” returns more than 130 repositories in more than 10 different programming languages. According to the search results on Github, the top language used is Javascript, followed by C and then PHP.

Backed by Google and “tested on over 2.5 billion pages from Google's index”, gumbo-parser [22] is the most popular and the third most forked standalone HTML5 parser available on Github. It is written in C and it claims to be fully conformant with the WHATWG spec. Moreover, it passes all the test cases from the html5lib test suite [6]. Another well positioned implementation is jsoup [23][24]. It is the most forked and the third most popular HTML5 parser on Github. It is written in Java and, in addition to HTML5 parsing, it features XML and CSS parsing, pretty printing and HTML cleaning. It is conformant with the WHATWG spec. Table 1 presents the most popular (highest number of stars) standalone HTML5 parsers on Github.

In [25] a standalone, parallel HTML5 parser is presented. According to the authors, “HPar is the first pipelining and data-level parallel HTML parser”. Parallelization of the parsing algorithm is hard because there are dependencies between the tree construction and the tokenizer. Under some circumstances, a few insertion modes can modify the next tokenizer state. Additionally, there are some elements that can be self-closing (for example the br element); in order to raise errors when a non-self-closing element is self-closed, the tokenizer has to wait for feedback from the tree construction.

Initially, the HPar parser divides the input into chunks and each chunk is processed in parallel, generating tokens and storing them in a buffer. The parsing process is speculative and it is similar to a transaction: a snapshot stores the state of the tokenizer at a given time and a flag for hazard detection is used; when the flag is true (i.e. the tokenizer state was changed by the tree constructor), a rollback has to be made (i.e. discarding some tokens and creating new ones).

To validate their parallel parser, the authors analysed over 1000 websites to find how often the tree construction stage modified the tokenizer state and they found that it happened in less than 0.01% of cases. That means that the probability of a rollback is less than 0.01%. To test their implementation, they compared it against jsoup (discussed previously). HPar achieved a speed improvement of up to 2.4 times (1.73 on average) when parsing websites such as Facebook, YouTube, BBC, etc.

Name                      Stars  Forks (order)  Language     Spec
google/gumbo-parser       3251   399 (3)        C            WHATWG
jhy/jsoup                 1878   646 (1)        Java         WHATWG
inikulin/parse5           685    22 (10)        Javascript   WHATWG
aredridel/html5           479    73 (5)         Javascript   -
html5lib/html5lib-python  329    79 (4)         Python       WHATWG
masterminds/html5-php     269    39 (6)         PHP          WHATWG
FlorianRappl/AngleSharp   207    32 (7)         C#           W3C
servo/html5ever           167    31 (8)         Rust         WHATWG
tracy-e/OCGumbo           150    26 (9)         Objective-C  WHATWG

Table 1 – Most popular HTML5 parsers in Github


3. Project architecture

This chapter presents an overview of the product, the tasks distribution among team members and a brief chronological description of how the project evolved.

3.1 Overview

Figure 4 displays a diagram of the architecture of the product. A brief description of each module is presented below.

Input sources

For testing and comparing the parsers, two input sources were used: the Html5lib test suite and web sites from the Common Crawl corpus. The html5lib test cases were stored as binary files. The Common Crawl module is a sampler for obtaining random web pages from Common Crawl.

Input pre-processing

The inputs to the test harness can be strings, URLs, binary files or WARC files. Depending on the input type, the pre-processing acts as follows:

• When the input is a URL, it accesses it and stores the content (if available) into a binary file. Then, the path to the saved file is sent to the script.
• When the input is a WARC file, it extracts and saves the web pages to disk. Then, the path to the directory is sent to the script.
• When the input is a string or a file path, it is sent directly to the script.

Script execution

The script is used for executing the third-party parsers and the MScParser. Each parser is executed by using an adapter (implementing a defined interface) that accepts as input either a string or a path to a file. The generated DOMs are serialized as formatted strings and then saved temporarily as binary files.


Figure 4 – Overview of the product architecture (adapted from [26])


Comparator

The comparator analyses the outputs generated by the parsers and generates an XML report. The output with the greatest parser consensus is stored. Diff files of the other outputs (if any) are generated and stored.

Web application

This application was developed as a graphical user interface for visualizing the reports generated by the test harness and for analysing and comparing the outputs. Additionally, it allows executing the spec tracer and the test harness for string and URL inputs.

HTML5 Parser - MScParser

This parser is conformant with the W3C HTML5 Recommendation. It was designed as a transliteration of the pseudo code presented in the specification. It will be referred to as the MScParser.

Specification Tracer

The tracer is used for tracking the sections used and the parse errors generated during parsing. Its goal is to produce a log of the parsing process that can be analysed to find useful information. An XML report containing the tracing details and information about the output can be generated. The tracer runs on top of the MScParser.

The code of the project is hosted in Github [27]. The repository contains documentation about the installation and usage of the modules.

3.2 Tasks distribution

This is a team project. Table 2 presents a summary of the activities in which each team member participated. The table includes only the activities related to the product, i.e., the tasks that involved writing code. Other activities such as reading, researching or project discussions are not included.


Module           Activity                   Estimated effort (days)  % completed: Carlos  Jose  Xiao
HTML5 Parser     Pre-processing             1                        100                  0     0
                 Tokenizer                  15                       33                   33    33
                 Tree constructor           15                       25                   25    50
                 Algorithms                 10                       50                   50    0
                 Test harness               4                        25                   75    0
                 Testing                    25                       40                   40    20
                 DOM                        7                        100                  0     0
Adapters         Jsoup                      2                        100                  0     0
                 AngleSharp                 2                        100                  0     0
                 Html5lib                   2                        0                    50    50
                 Parse5                     2                        0                    100   0
                 Validator.nu               2                        0                    100   0
Comparator       Report generator           7                        25                   75    0
                 Algorithm                  3                        0                    100   0
                 Output processing          5                        100                  0     0
Web application  Comparator UI              5                        100                  0     0
                 Tracer UI                  5                        100                  0     0
                 Input form                 5                        100                  0     0
Common crawl     Design and implementation  30                       0                    100   0
Tracer           Design and implementation  15                       100                  0     0

Table 2 – Participation of the members in the project

The estimated effort is measured in days of 8 hours each. Some tasks evolved constantly or were paused and completed or updated some time after their creation, thus the effort measure is an estimate, i.e., the exact time spent on them was impossible to track.


The percentage of task completion per member is also estimated because the code is collective, i.e., several files were constantly manipulated (mainly while testing) by all the team members.

The activities presented were developed from March to July 2015. The project had a short pause during April and May due to the writing of the progress report and the exams season. August and early September involved the writing of this document.

We considered including the number of lines of code written per person (a LOC metric). However, the idea was discarded because there are files that we all manipulated, some files constantly evolved or were refactored, there is some pair-programmed code, the adaptors are in different programming languages, and there are several mapping tables and XML files that could arguably be counted as code. Additionally, LOC is usually used as a predictor of development time rather than a measure of system complexity (due to its dependence on language and style) [28].

3.3 Project evolution

This section presents a brief chronological description of the approach adopted for working on the project and distributing tasks.

The team decided to work using some of the Agile Software Development principles. In February we did two spikes for learning and understanding the HTML5 parsing process. The parser development started in March 2015 with the aim of finishing it within 6 weeks. We planned three sprints of two weeks each.

During the first sprint, we designed the general architecture of the parser and wrote the tokenizer states. Each team member worked individually on 23 states (the tokenizer stage consists of 69 states). The division was made in order to work on related states.

The goal of the second sprint was to complete the tree construction stage. The complexity of that module was higher than the tokenizer's because it includes many algorithms and data structures. Overall, there are dependencies between the insertion modes and the algorithms. It was hard to find a way to divide the work while minimizing code dependencies in order to avoid conflicts. Each member worked on some insertion modes and some algorithms.

The goal of the final sprint was to integrate and test the entire parser prototype. The integration was almost completed during Sprint 2, thus we focused mainly on testing and fixing errors.

At the end of May we started to work on the test harness and the adaptors of the parsers. The goal was to have six third-party parsers working (each team member had to work on two adaptors). Meanwhile, Jose Armando and I started to design the script and comparison algorithm. Xiao committed to keep adding parsers and adaptors but then he abandoned the team in mid-June.

By the end of June, Jose Armando started to focus on a module for getting random samples of web pages (from the Common Crawl corpus) for testing.

On the other hand, I began to build the HTML5 DOM for passing the remaining failing tests of the MScParser. Once the parser was completed, I focussed on the web application and the spec tracer.

Despite our individual commitments, there were some activities in which Jose Armando and I worked together:

• Completing and fixing the parser to pass all the remaining tests. Around 70% of the failures and errors were fixed by me and 30% by Jose Armando.
• The test harness changed constantly; we maintained constant communication for improving and updating it.
• Discussing errors or failures caused by the adaptors or the Comparator.


4. Project implementation

This chapter describes the implementation of the following modules:

• The HTML5 parser (MScParser).
• The specification tracer.
• The harness for comparison of different HTML5 parsers.
• The web application for tracing and comparing.

The Common Crawl sampler is fully covered by my teammate Jose Armando in his dissertation “A testing strategy for HTML5 parsers” [29].

4.1 The MScParser

This section presents the architecture of the MScParser and the custom HTML5 DOM implementation.

4.1.1 Architecture

The MScParser was written in Java. It was designed to be a transliteration of the algorithm in the W3C Recommendation in order to check the usability of the specification.

JUnit was used for testing. The html5lib test suite was chosen due to its simplicity and the quantity of its test cases (more than 8000 for parsing). By using the html5lib test suite, the testing method was dynamic white box. The test cases provide an input and the expected output of the whole tokenizer or tree construction stage, thus they were used as integration tests.

In order to implement a transliteration of the spec that is as clean as possible, a class called ParserContext was developed. That class acts as a container for all the variables, data structures, the DOM and the states (both insertion modes and tokenizer states) required while parsing. In Figure 5 the ParserContext class is presented along with the main classes of the parsing process.


Figure 5 – Class diagram of the parser

The finite state machines for the tokenizer and the tree construction are handled by using factories (applying the factory method pattern). This was the best approach we found to translate the specification and handle the 23 insertion modes and 69 tokenizer states.
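As an illustration only, a minimal sketch of this factory-based approach is shown below; the class and method names (TokenizerState, DataState, TokenizerStateFactory and the stub ParserContext) are hypothetical simplifications rather than the actual MScParser classes.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal stand-in for the container class described above; the real
    // ParserContext holds the DOM, data structures, flags and current states.
    class ParserContext {
        String nextTokenizerState = "DataState";
    }

    // Each tokenizer state is an object with a single processing operation.
    interface TokenizerState {
        void process(ParserContext context);
    }

    class DataState implements TokenizerState {
        public void process(ParserContext context) {
            // Consume one character, possibly emit a token and request a
            // transition, e.g. context.nextTokenizerState = "TagOpenState".
        }
    }

    // Factory that maps state names to state objects (factory method pattern).
    class TokenizerStateFactory {
        private final Map<String, TokenizerState> states = new HashMap<>();

        TokenizerStateFactory() {
            states.put("DataState", new DataState());
            // ... the remaining 68 tokenizer states would be registered here.
        }

        TokenizerState getState(String name) {
            return states.get(name);
        }
    }

The benefit of this arrangement is that the main parsing loop simply asks the factory for the current state and delegates to it, which keeps each spec section in its own class.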


The parser has the following limitations:

• The sniffing algorithm was not implemented and the UTF-8 character encoding was used by default. There are three reasons that led to that decision: first, UTF-8 is the most widely used character encoding for websites (83.7% of websites use it, according to w3techs.com [30]). Second, UTF-8 is the character encoding suggested by the spec. Finally, the sniffing algorithm is a large and complex procedure that does not guarantee 100% confidence in determining the character encoding.
• The execution of scripts was not implemented. That decision was taken for two reasons: first, the script execution is not part of the parsing process (i.e. it is part of the HTML5 spec but outside the parsing section). The second reason is that a script execution engine is a complex and large system. Handling scripts would have increased the complexity of the parser because scripts might insert new data into the tokenizer. That new data could lead to a manipulation of the DOM tree by inserting, removing or modifying elements; furthermore, it could produce a change of character encoding (by inserting or modifying a meta tag).

4.1.2 The custom HTML5 DOM

In order to create and manipulate an HTML5 DOM tree, the initial and naïve approach was to use the org.w3c.dom [31] implementation, which is part of the Java API and provides interfaces for XML processing. It was used because of its simplicity and our previous experience with it.

All the MScParser code was built around the usage of this implementation and it worked well in early tests. When the parser was almost complete, a failing test appeared. An attribute whose name contained a semicolon (;) provoked a fatal error (a Draconian error). Such an attribute name violates the well-formedness requirements of an XML document.

Each Element node object has a property for associating an arbitrary object with the node (userData). That property was used to solve the issue, i.e., the XML-invalid attribute was stored there and then retrieved when serializing.

That was a dirty trick because the invalid attributes were not part of the DOM tree and additional processing was required to retrieve them. Nevertheless, the tests passed and the project continued. Later, more failing tests appeared. A Draconian error was generated when an element name contained a less-than sign (<). The HTML5 specification allows element names with special characters.

In another failing test, the input contained a space as the name of a DOCTYPE. This is valid according to HTML5 but it causes conflicts with the XML DOM. Table 3 summarizes a list of inputs that led to errors and/or failing tests. At this point we realized that an XML DOM was not suitable for an HTML5 DOM due to the restrictions it places. A new DOM implementation was required.

Input    Details
         Invalid attribute name.
         Invalid element name.
         Invalid element name. rdar is considered to be a namespace and the element name is considered to be empty. For example, an input is valid.
         Invalid doctype name (empty).

Table 3 – Example of inputs that are HTML5 valid but XML invalid

JDOM [32] and Dom4j [33] were considered and a few tests were made using them, but those are XML frameworks, thus they are not suitable for storing an HTML5 DOM. The Jsoup parser contains its own DOM implementation (written in Java), thus it was a potential solution for our problems.

After an analysis, the idea of using the Jsoup HTML5 DOM was discarded. It would have required a huge amount of effort to incorporate it into our code due to differences in objects, property names, method implementations, etc. At this point we realized that the only feasible solution was to implement a custom HTML5 DOM.

The custom DOM was designed following the structure and names of org.w3c.dom but removing the XML restrictions. This was done in order to minimize the impact on the parser code. Figure 6 shows a class diagram of the DOM implementation, followed by a description of each class. N.B. for readability, the class methods (operations) are not displayed.


Figure 6 – Class diagram of the custom HTML5 DOM

• Node – Abstract class. It defines generic properties and operations for nodes such as getParentNode, appendChild, etc.
• NodeType – Enumeration that defines the category of a node. Seven types of nodes are used for an HTML5 DOM: Attribute, CDATA Section, Comment, Document, Document Type, Element and Text.
• Attribute – Contains properties for name and value. A local name is available in case the name contains a valid namespace.
• CDATASection, Comment, Text – Nodes with a single property to store the content.
• Document – Class that represents an HTML5 document. It defines methods for creating nodes, e.g. createComment, createElement, etc.
• QuirksMode – Enumeration for the quirks mode status of a Document object.
• DocumentType – Has properties for name, public ID and system ID.


• Element – Represents an element (tag). Includes operations for creating, getting and setting attributes.

The org.w3c.dom package defines interfaces for nodes that are not used in HTML5, such as processing instructions or notations; therefore those implementations were ignored. Moreover, only the required operations and properties were included. For example, the Document node has methods such as getXmlVersion or normalizeDocument that are not used for HTML5 parsing.

The org.w3c.dom implementation does not have a direct way to serialize the DOM; a transformer object is required. The custom HTML5 DOM was designed to serialize directly by using the method getOuterHtml. A flag for pretty printing (indentation and new lines for each element) can be specified.
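A minimal sketch of what such direct serialization might look like is given below; the class is a heavily simplified, hypothetical element type and not the actual MScParser DOM code.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified, hypothetical element node: attribute and element names are
    // stored as given, without enforcing XML naming rules.
    class SimpleElement {
        private final String name;
        private final Map<String, String> attributes = new LinkedHashMap<>();
        private final List<SimpleElement> children = new ArrayList<>();

        SimpleElement(String name) { this.name = name; }

        void setAttribute(String attrName, String value) { attributes.put(attrName, value); }

        void appendChild(SimpleElement child) { children.add(child); }

        // Serializes the subtree directly; prettyPrint adds indentation and new lines.
        String getOuterHtml(boolean prettyPrint, int depth) {
            String indent = prettyPrint ? "  ".repeat(depth) : "";
            StringBuilder sb = new StringBuilder(indent).append('<').append(name);
            for (Map.Entry<String, String> a : attributes.entrySet()) {
                sb.append(' ').append(a.getKey()).append("=\"").append(a.getValue()).append('"');
            }
            sb.append('>');
            if (prettyPrint) sb.append('\n');
            for (SimpleElement child : children) {
                sb.append(child.getOuterHtml(prettyPrint, depth + 1));
            }
            sb.append(indent).append("</").append(name).append('>');
            if (prettyPrint) sb.append('\n');
            return sb.toString();
        }
    }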

Another set of failing tests was related to the quirks mode of the HTML5 document. Depending on some conditions of the DOCTYPE definition in the input, the quirks mode may change. The quirks mode affects the behaviour of the table element. The Document node of the org.w3c.dom library has no quirks mode property, thus the userData property was used for storing it.

To avoid using the userData property, an enumeration with the possible values of the quirks mode was included. Then, the document node was updated to include a property to get and set the quirks mode status. With this modification the code is cleaner and easier to understand.

The impact of using this custom DOM was minimal because the structure and the names of objects, properties and methods were mostly maintained. Besides a few refactors, the only change required was to update the references (imports) of the parser classes.

4.1.3 Challenges

The last tokenizer state (tokenizing character references) has a different behaviour than the other states. That state can read characters of the stream beyond the current character in order to try to find character references. If a reference is found, the characters are consumed. If no reference is found, the characters are not consumed and the original state consumes them.

When we realised the unusual behaviour of that state, it was too late to rebuild the architecture and the other tokenizer states, thus it was hacked to fit into the defined architecture. There are 5 tokenizer states that can consume character references, thus they had to be adjusted as well. This was the only part of the spec that could not be transliterated completely.

4.2 The specification tracer

The goal of the tracer is to log the spec sections used and the parse errors generated while parsing. The log of events can be used for analysing differences between outputs, for calculating spec coverage or for extracting useful information such as the number of parse errors or the existence of certain elements in the output. The MScParser has an option for enabling tracing (disabled by default).

4.2.1 Architecture

Figure 7 shows a class diagram of the tracer implementation. A brief description of the classes is presented below.

• Tracer – Used for logging events generated while parsing. It includes operations for filtering events and generating an XML report.
• TracerSummary – Contains information such as the number of errors, the insertion modes used, the presence of certain types of elements, etc.
• Event – Each event has an associated type, a description and the specification section where the event occurred.
• EventType – Enumeration that defines four event types: algorithm, insertion mode, tokenizer state and parse error.

The MScParser generates a list of parse errors whether tracing was enabled or not. When tracing is enabled the parse errors are registered as trace events, i.e., the location of the errors is logged.

The HTML5 specification defines how to handle errors. However, there is no explicit definition of the types of errors. An enumeration called ParseErrorType was defined to categorise the errors according to their nature and location.


Figure 7 – Class diagram of the specification tracer.

An XML document (tracerEvents.xml) is used for defining the list of sections to be tracked. The root element is called events and every event node has attributes for section, description and type, as shown in Figure 8. Events with no type are treated as informative, i.e., they are not tracked. When the parsing process begins, if the value of the tracing flag is set to true, a new Tracer instance is created and the tracerEvents.xml file is loaded into a hash map.

In order to trace, the places where events are raised have to be specified directly in the code. An operation called addParseEvent, with a required argument (the section number) and an optional argument (event details), was defined. Whenever the addParseEvent operation is executed, it checks whether tracing is enabled before logging the event.
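The following sketch illustrates how such a guarded logging call might look; the signatures and class name are simplified assumptions, not the exact MScParser API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical, simplified tracer: events are recorded only when tracing is enabled.
    class SimpleTracer {
        private final boolean tracingEnabled;
        private final List<String> events = new ArrayList<>();

        SimpleTracer(boolean tracingEnabled) { this.tracingEnabled = tracingEnabled; }

        // The section number is required; the details argument is optional (may be null).
        void addParseEvent(String section, String details) {
            if (!tracingEnabled) {
                return; // tracing disabled: the call is effectively a no-op
            }
            events.add(details == null ? section : section + " - " + details);
        }

        List<String> getEvents() { return events; }
    }

    // A call site inside an insertion mode might then look like:
    //   tracer.addParseEvent("8.2.5.4.7", "anchor element handling");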


Figure 8 – File example of tracerEvents.xml

The task of identifying the location for raising events is relatively trivial because the MScParser was developed as a transliteration of the spec algorithm. This characteristic allows for a high level of tracking granularity. Currently, the tracer granularity is coarse because all the algorithms, insertion modes and tokenizer states register only one event (when they are used).

Nevertheless, it might be the case that a deeper level of tracking is required. Let us consider the InBody insertion mode. It has more than 50 possible paths depending on the token type and value. Furthermore, some of those paths have more divergences depending on other variables or current status.

An event for every path could be raised by just adding a line of code, i.e., a call to the addParseEvent method. To differentiate each path, two options are available. The first option is to use as arguments the InBody section number (8.2.5.4.7) and the path details. The other option is to treat every path as a spec subsection, i.e., to use a unique section number, e.g., 8.2.5.4.7_n where n is the number or name of the path.

For the latter option, the description and section number of every path have to be added as an event node in the tracerEvents.xml file. This option is cleaner with respect to the code because it defines explicit spec subsections.


The TracerSummary is a POJO⁴. It is used only for tracking details such as the number of emitted tokens, the existence of HTML5, SVG or MathML elements, etc. The tracer already has generic methods for tracking emitted tokens and created elements. By using those operations, the task of adding new tracking details only requires a few extra validations and the respective property in the TracerSummary class.

The tracer has the capability to generate an XML report. This functionality was added to complement the test harness and retrieve useful data about the nature of the tested input pages. The Tracer class has a method called toXML for generating a report; it only requires the file path where the report should be saved.

4.2.2 Challenges

These are the challenges faced during the tracer development.

Tracer granularity

Strictly, this was not a challenge during the development. Nevertheless, achieving a finer granularity would have required a substantial amount of time.

XML invalid characters

The XML 1.0 specification defines the ranges and values of valid Unicode code points. Nevertheless, that list is not consistent with the HTML5 specification. When generating an XML report, the tracer may have logged an event containing a character that is invalid in XML, causing an error. To solve this problem, all XML-invalid characters are escaped before generating the report.
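A minimal sketch of that escaping step, using the valid character ranges defined by XML 1.0 (Tab, LF, CR, 0x20–0xD7FF, 0xE000–0xFFFD and 0x10000–0x10FFFF); the textual replacement scheme is an assumption for illustration, not necessarily the one used by the tracer:

// Sketch: replace code points that are not valid in XML 1.0 before writing the report.
// The "\u{...}" replacement form is an illustrative assumption.
public class XmlEscaper {

    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escapeInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("\\u{").append(Integer.toHexString(cp)).append("}");
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x0B (vertical tab) is not a valid XML 1.0 character.
        System.out.println(escapeInvalidXmlChars("bad" + (char) 0x0B + "char"));
    }
}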

4.3 The harness for comparison

The harness goal is to parse, compare and generate a report of the outputs produced by the different parsers. It was designed to handle HTML5 parsers regardless of their programming language and to be horizontally scalable, i.e., to allow plugging in new parsers.

The test harness consists of three main modules: the parser adaptors, the script execution, and the comparison and report generation process.

4 POJO stands for Plain Old Java Object. It is a simple class that does not extend another class, has no special implementation and only contains getters and setters.

4.3.1 The parser adaptors

In order to compare the DOM generated by each parser, serialized outputs were required. The standard serialized outputs from the different parsers varied slightly, as some use formatting or pretty-printing options. A standardization of output formats was therefore required in order to perform a comparison.

The html5lib output format was used because of its simplicity. Moreover, some parser implementations use the html5lib test suite and consequently already have functions that serialize in that format.

The interface for the adaptors was defined to be as simple as possible. It receives only two parameters and outputs the serialized DOM formatted according to the html5lib format. The input parameters are listed below, followed by a sketch of an adaptor skeleton:

 Type of input: file path or string (-f and -s, respectively).
 Input value.
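A minimal sketch of an adaptor's entry point under these assumptions (the parsing and serialization step is only a placeholder, since it differs for every wrapped library):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of a parser adaptor front end: resolve the -f/-s argument pair to an input
// string, then hand it to the wrapped parser. Names here are illustrative assumptions.
public class ParserAdaptor {

    public static void main(String[] args) throws Exception {
        if (args.length != 2 || !(args[0].equals("-f") || args[0].equals("-s"))) {
            System.err.println("Usage: ParserAdaptor -f <filePath> | -s <inputString>");
            System.exit(1);
        }
        String html = args[0].equals("-f")
                ? new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8)
                : args[1];

        System.out.print(toHtml5LibFormat(html));
    }

    // Placeholder: a real adaptor parses 'html' with its library and serializes the
    // resulting DOM in the html5lib tree format ("| <html>", "|   <head>", ...).
    private static String toHtml5LibFormat(String html) {
        return "#document\n| <html>\n|   <head>\n|   <body>\n";
    }
}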

In addition to the MScParser, the project has adaptors for the following parsers:

 AngleSharp – C#
 Html5lib – Python
 Jsoup – Java
 Parse5 – Javascript
 ValidatorNU – Java

4.3.2 The script execution

The script (bash for Linux, batch for Windows) is executed through the system's console. It receives three arguments:

 a path to a directory for saving the outputs
 the input type (string, URL, file path or WARC file)
 the input value

Depending on the input type, the script might do some pre-processing:

 When the input is a string, no pre-processing is performed; the string is fed directly to the parser adaptors.
 When the input is a URL, the web site content is saved into a binary file. The parsers are then executed receiving a file path.
 When the input is a WARC file, the script extracts and saves the contained web pages to disk.

When the input is a path to a directory, the script loops over the files recursively and executes the parsing process for each file.

The script runs all the parsers (by using the adaptors) and saves each output in a binary file (named after the parser). Finally, it executes the process of comparison and report generation.

4.3.3 The comparison and report generation

A Java program called Comparator is used for reading the files generated by the parsers, comparing the outputs, and generating an XML report and diff files (if required). Figure 9 presents a flow diagram of the Comparator.

When the comparator is executed, it requires as a mandatory argument the path where the parser outputs were saved. The process begins with a loop through the directories (each representing an input or test case), followed by a loop through each directory's files. The content of each file is stored in an object along with the parser name.

After reading all the files, the list of objects is processed. The output with the highest consensus among the parsers is referred to as the majority tree or majority output. Then the list is sorted by consensus rate. If there are outputs that differ from the majority tree, a diff file is generated for each of them. Next, the test case is added to the report and the report totals are updated. The XML report is saved to a file and the process ends when there are no more directories to analyse.

The modules for grouping and sorting outputs and for generating diff files are discussed below, followed by the process for adding new parsers.


Figure 9 – Comparator flow diagram

Grouping and sorting outputs

The Comparator analyses and groups the outputs by applying a slight variation of the Boyer-Moore Majority Vote Algorithm [36]. The algorithm proposed by Boyer and Moore considers a candidate the majority when it constitutes at least half of the set. However, in this implementation, the group with the highest consensus (excluding ties) is considered the majority even if it does not constitute half of the set (a minimal sketch of this grouping follows the examples below), e.g.:


 Parsers 1 and 2 had output A, parser 3 had output B, parser 4 had output C and parser 5 had output D. Output A is considered the majority even though it only represents 40% of the set.
 Parsers 1 and 2 had output A, parsers 3 and 4 had output B and parser 5 had output C. Outputs A and B each have 40% consensus; thus, there is no majority.
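A minimal sketch of this grouping step, using plain strings for the outputs and illustrative names (the real Comparator works on richer output objects):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: group identical outputs and pick the largest group as the majority,
// unless the top two groups are tied. Names are assumptions for illustration.
public class MajorityGrouping {

    static List<String> findMajority(Map<String, String> outputsByParser) {
        // Group parser names by identical output.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        outputsByParser.forEach((parser, output) ->
                groups.computeIfAbsent(output, k -> new ArrayList<>()).add(parser));

        List<String> best = new ArrayList<>();
        boolean tie = false;
        for (List<String> group : groups.values()) {
            if (group.size() > best.size()) {
                best = group;
                tie = false;
            } else if (group.size() == best.size()) {
                tie = true; // two groups of equal size: no majority
            }
        }
        return tie ? new ArrayList<>() : best;
    }

    public static void main(String[] args) {
        Map<String, String> outputs = new LinkedHashMap<>();
        outputs.put("parser1", "A");
        outputs.put("parser2", "A");
        outputs.put("parser3", "B");
        outputs.put("parser4", "C");
        outputs.put("parser5", "D");
        System.out.println("Majority group: " + findMajority(outputs)); // [parser1, parser2]
    }
}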

Considering the report presented in Figure 10 , the report values are calculated as follows:

 Test 1 – All the parsers produced the same output. No diffs are generated. All passed the test.
 Test 2 – Parsers 1, 2 and 3 had the same output; they are considered the majority and passed the test. Parser 4 had a different output, thus it failed the test and a diff file is generated.
 Test 3 – Parsers 1 and 3 had the same output; they are considered the majority because parsers 2 and 4 had different outputs. Parsers 2 and 4 failed the test and a diff file is generated for each of their outputs.
 Test 4 – All the parsers failed the test because there are two outputs, each produced by the same number of parsers. The majority attribute of both outputs is set to false. N.B. the first output is named majority; this is for convenience of the system, i.e., although that output is not a majority, diff files of the other outputs are generated with respect to it.
 Test 5 – All parsers had different outputs, thus all failed the test. The majority attribute of all the outputs is set to false. The first output is called majority for convenience of the system and three diff files are generated.

The generalData node tracks the number of tests. The equals attribute represents the number of tests in which all the parsers produced the same output. The testResult node tracks the test results of each parser; the passed attribute represents the number of tests in which the parser was part of the majority group.


Figure 10 – XML report sample

Diffs generation and encoding

When testing URL pages, we found pages ranging in size from a few kilobytes up to a couple of megabytes. Considering m inputs of n kilobytes and p parsers, the disk space required for saving all the outputs would be m × n × p kilobytes plus the size of the XML report. For a large set of inputs this could become a problem.

With the aim of reducing the required disk space, the test harness was designed to store only one complete output plus diff files (if any). This decision was taken considering that the outputs should tend to converge (one of the goals of HTML5).

In the ideal scenario all parsers would produce the same output, thus only one output file would be required to be stored. In the worst scenario all the outputs would be different, therefore storing diff files would be equivalent to storing all the outputs.

This harness uses a library called google-diff-match-patch [34] for generating a list of differences between the majority output and the non-majority outputs. The list of differences is then processed and stored in a file.

A compression method called delta-compression is presented in [35]. The delta-compression is used for "storing or transmitting only the changes relative to a related artefact instead of storing or transmitting the complete content". In order to perform the compression, an encoding format is detailed.

In this implementation, a simple encoding based on the delta-encoding is used. The encoding format is as follows:

 Diff type: char ('+' for insertion, '-' for deletion)
 Index: integer (the index in the majority output where the difference starts)
 Separator: char (',' is the designated separator)
 Diff length: integer (byte length of the difference)
 Separator: char
 Diff content: string
 End of entry: char (';' is the designated char to denote the end of the entry)

An example of two different strings and the diff encoding is presented in Table 4. Two differences were found between the outputs. The first diff is a deletion of one character (x) at index 42. The second diff is at index 59 and represents an insertion of 13 characters (a new line plus the string | "x").

Majority output Different output Formatted output Diffs encoded -42,1,x;+59,13, | "x";

Table 4 – Example of diff encoding

In order to reconstruct the original output, the diff file has to be decoded to obtain a list of diffs, and then the majority output has to be updated by inserting or removing characters as defined in the list of diffs.
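A minimal sketch of that reconstruction step, assuming the indices refer to positions in the majority output (so an offset is kept for already-applied diffs), that the length equals the character count of the content, and that the separators never appear inside the diff content; the real implementation would have to handle those cases:

// Sketch: apply diff entries of the form "<type><index>,<length>,<content>;" to the
// majority output. Simplified and illustrative, not the harness's actual decoder.
public class DiffDecoder {

    static String applyDiffs(String majority, String encoded) {
        StringBuilder result = new StringBuilder(majority);
        int offset = 0; // shift caused by previously applied diffs
        for (String entry : encoded.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            char type = entry.charAt(0);
            int firstComma = entry.indexOf(',');
            int secondComma = entry.indexOf(',', firstComma + 1);
            int index = Integer.parseInt(entry.substring(1, firstComma));
            int length = Integer.parseInt(entry.substring(firstComma + 1, secondComma));
            String content = entry.substring(secondComma + 1);

            if (type == '+') {
                result.insert(index + offset, content);
                offset += length;
            } else { // '-' : deletion of 'length' characters starting at 'index'
                result.delete(index + offset, index + offset + length);
                offset -= length;
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Hypothetical example: delete the 'X' at index 3, then insert '!' at the end.
        System.out.println(applyDiffs("abcXdef", "-3,1,X;+7,1,!;")); // prints abcdef!
    }
}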


Adding a new parser

In order to provide horizontal scalability, i.e., to add new parsers without re-parsing the inputs, a function called restoreOutputsFromDiffs is included in the comparator. As the name suggests, the method restores (temporarily) the outputs generated by the parsers. Then, the comparison and report processes run normally. The comparison process automatically deletes the repeated outputs and generates diff files (if required); hence no further operations are required.

When running the comparator, a parameter –u (for update) should be included. This solution only requires that the outputs of the new parsers are included in the path where the current output files are stored.

4.3.4 Challenges

While developing the test harness, the following difficulties and challenges were faced:

Command line arguments

The original interface for the adaptors was designed to receive a single string parameter. The web page inputs were read by the script and the content was passed (as an argument) to the parsers. However, some web pages produced errors. The cause was that there is a maximum size for command line arguments, and the limit varies depending on the operating system. To avoid potential errors, the interface was modified to include file inputs and each adaptor was updated to read files.

Dynamic website content

A few inconsistencies were detected when comparing the outputs from URLs. The differences were in text nodes or attribute values. We realized that the differences were caused because some websites have dynamic content that changes depending on the region, language, date, time, etc. Moreover, there are websites that generate tokens or session ids every time a page is requested.

To address this kind of problem, a program was developed to save the web page content into a single temporary binary file. Then the script acts as if a file argument was received, i.e., all the parsers parse the same file. Finally, the temporary file is deleted.
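A minimal sketch of that download-once step, assuming a plain HTTP GET without redirect or error handling:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: fetch a URL once and save the raw bytes to a temporary file so that
// every parser adaptor receives exactly the same content.
public class PageSnapshot {

    static Path saveToTempFile(String url) throws Exception {
        Path temp = Files.createTempFile("page-", ".bin");
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        }
        return temp;
    }

    public static void main(String[] args) throws Exception {
        Path snapshot = saveToTempFile("https://example.org/");
        System.out.println("Saved to " + snapshot);
        // ...run every parser adaptor with: -f <snapshot path>, then delete the file.
        Files.delete(snapshot);
    }
}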


Encodings

Issues with encodings were constantly faced while parsing and reading/writing files. Every single adaptor was required to explicitly read and write files using UTF-8 encoding. When outputting the serialized DOM, explicit UTF-8 encoding was required as well.

UTF-8 was also set for compiling and generating executable files inside the IDE, as the IDE by default used the encoding given by the operating system settings.
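For example, a minimal sketch of reading and writing files with an explicit UTF-8 charset in Java, instead of relying on the platform default:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: always pass the charset explicitly so results do not depend on the
// operating system's default encoding.
public class Utf8Io {

    static String readUtf8(Path file) throws Exception {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    static void writeUtf8(Path file, String content) throws Exception {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("utf8-demo-", ".txt");
        writeUtf8(file, "déjà vu – ελληνικά"); // non-ASCII survives the round trip
        System.out.println(readUtf8(file));
        Files.delete(file);
    }
}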

Line breaks

Some differences were found due to differences in line break characters between operating systems (Linux and Windows). A function was included to avoid this kind of problem.
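A minimal sketch of such a normalisation, assuming CRLF and lone CR are both mapped to LF before comparing outputs:

// Sketch: normalise Windows (CRLF) and lone CR line breaks to LF so that outputs
// produced on different operating systems compare equal.
public class LineBreaks {

    static String normalise(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        System.out.println(normalise("line1\r\nline2\rline3\n").equals("line1\nline2\nline3\n")); // true
    }
}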

RCDATA sections and XML documents

To make the comparison, the first approach was to generate an XML document containing the outputs from all the parsers directly from the bash/batch script. The Java program could then read the XML and process it. Each output is a formatted HTML string, so it could not be saved directly in a text node of an XML document because the HTML elements would be considered XML elements, leading to a malformed XML document. N.B. the outputs could be escaped and inserted as text nodes; however, this approach was discarded (the next paragraphs discuss the reasons).

In order to avoid a mal-formed XML document, the outputs were saved as RCDATA nodes. This worked well for several tests until we realised that some inputs were ignored in the report. After an analysis, we found the source of the problem.

If a web page contains an RCDATA node, it will produce an invalid XML document. The RCDATA end tag of the output was considered the end tag of the RCDATA node of the XML document. The remaining content of the output was inserted after the RCDATA node, leading to a malformed XML file.

The first idea for solving this issue was to escape the RCDATA elements. Nevertheless, it was soon discarded because it would require modifying the adaptors. Moreover, the front-end application would need to un-escape the RCDATA elements.

Although performance was not one of the priorities, the escaping/un-escaping process would hurt the performance as find-replace operations would be required for each output over potentially large files (web pages).

Finally, by not using an XML file to save the outputs, a problem that had not yet been faced was avoided. As discussed in the challenges of the tracer (4.2.2), the XML specification defines a range of valid characters. A malformed XML document would have been produced if an output had contained any XML-invalid character.

Adaptors

The algorithm for structuring DOMs into the html5lib format is simple and easy to apply. However, writing an adaptor for a new parser should not be considered a trivial task: setting up the environment might be time-consuming, the documentation of the parser implementation could be scarce, the programming language could be hard to understand, etc.

4.4 The web application

Initially, just a simple tool for visualizing the reports generated by the harness was required. As the report was an XML document, an XSLT style sheet was used to generate an HTML page to display the report highlights.

This approach was enough and useful for reports of a single input, or even a few dozen. A large report was hard to visualize and navigate; pagination and sorting functions were required. Programming those operations with XSLT would have required a substantial amount of work. In the end we agreed on creating a web application.

The web application is a Spring MVC project written in Java. This decision was made because of the experience with the language and because most of the project (parser, tracer and comparator tool) was programmed using java.

As the project continued, more features were included with the aim of improving the visualization of the reports and the comparison and tracing processes. Currently, the web application's functions are:

 Parse a string or URL input and run the specification tracer.
 Analyse and filter tracing events.
 Run the comparer harness for string and URL inputs.
 Review reports generated by the comparer harness.
 Analyse, format and minimize outputs from the comparer.

4.4.1 Architecture

The web application follows the MVC design pattern. The model represents the information of the report, test cases, outputs, etc., and the operations for generating and accessing such information. The model also includes classes for handling user requests, i.e., the inputs for the tracer or the minimizing options for the test harness.

A single controller, called ReportController, is used for handling user requests, executing model operations and interacting with the views for presenting results. Additionally, the controller handles the parsing and tracing processes directly (the MScParser and tracer APIs are bundled in a jar file).

With respect to the views, the web application consists of 5 jsp files:

 Report details – presents the report information.
 Test details – displays the outputs of a given test case. Includes functionality to compare and minimize outputs.
 Tracer form – page for executing the parser and the tracer.
 Input form – page for capturing the input for the comparison harness.
 Layout – also known as the master page. Used for defining a consistent layout throughout the site.

The web application uses a configuration file (WebConfig) for storing paths to the bash/batch script files and reports directory. The values can be edited manually in the mvc-dispatcher-servlet.xml file.

In addition to the classes previously presented, three classes with generic operations are used:

 FormattingUtils – includes operations for escaping, highlighting and searching strings.


 ProcessRunner – used for executing operating system processes. In this case, it executes the script of the comparison harness.
 RequestURL – given a URL, creates a connection and returns the content as a stream.

4.4.2 Parsing and tracing

Figure 11 presents the input tab for the tracer. A drop-down list offers the option to switch between a string and a URL input. A check box allows the output to be presented formatted and highlighted (pretty code). This is done with the help of an open source JavaScript module called prettify [37].

Figure 11 – Tracer input tab

As mentioned in section 4.2, the tracer logs every event generated during the parsing process. Nevertheless, a specific event type or set of spec sections may need to be filtered out. The web application offers the option to exclude both events (by type) and specification sections. Figure 12 shows the tabs that allow the user to define exclusions. N.B. to improve readability, the list of sections is not displayed in full.


Figure 12 – Tracer exclusion tabs

Given the input string ‘this is a test’, Figure 13 shows the tabs of the parser output, the tracer log and the tracer summary, respectively.

Figure 13 – Tracer output tabs for the input string this is a test

The parser output displays the HTML output of the parser. The tracer log shows the events (after exclusions, if any). Events are coloured to improve readability. Finally, the tracer summary displays detailed information about the used algorithms, tokenizer states and insertion modes, the existence of certain types of elements and the number of parse errors and emitted tokens.

When tracing a web page or a long input string, if there are no event exclusions, the log produced by the tracer is very large. In such cases, the page might not be displayed correctly or the browser might behave unexpectedly (even crash). To avoid those scenarios, a maximum log size (number of entries) is included in the WebConfig file. When the log size is exceeded, a message is presented to the user.

4.4.3 Comparing outputs

The web application allows parsing a string (or URL) input and then comparing the outputs produced by all the available parsers. The input page is shown in Figure 14.

Figure 14 – Input form for the multi-parser comparator tool

Figure 15 shows the result of the parsing and comparison process of the input string ‘this is a test’. In this case, all the parsers had the same output; therefore there is only one tab with the name of all the parsers. When there are differences, a tab for each different output is generated as shown in Figure 16. A panel for navigating through the differences (auto-scroll to the next or previous difference) is displayed when the output is too large to fit the screen.

The report name is a number assigned by the system. This is because the output and the xml report files have to be stored in a file directory (specified in the WebConfig).


Figure 15 – Comparison details page

Figure 16 - Comparison page displaying differences between outputs

Figure 17 presents the options for formatting and minimizing the outputs. The removals are applied only when all the outputs present no difference in the specified element. For example, assuming Remove script elements is enabled, if there are five script elements among the inputs and all are the same then the five elements are removed. If one script element is different, only the four that are equal are removed.


The application currently allows removing link, script, style and meta elements. Nevertheless, the removal process is generic as it only requires the name of the element to be removed. The addition of new options for removal of elements is a trivial task.

Finally, the option Show original output allows presenting the reconstructed output tree without displaying the differences, i.e., the output generated by the parser. This option was included because the user may want to check for differences manually or by using specialized software.

Figure 17 – Format options tab

4.4.4 Reviewing reports

The web application offers the opportunity to visualise the report generated by the comparator harness. Considering the XML report presented previously in Figure 10, four classes were designed for handling the report, as shown in Figure 18.

The Report class represents the whole XML document. The TestResult class represents each of the parsers and its number of tests passed and failed. Each input (test input) is mapped to a TestCase object. The set of outputs of a test case is represented by a list of TestOutput objects. A class called ReportGenerator defines an operation for generating a report object given the path to the xml report file.


Figure 18 – Report class diagram

Figure 19 presents the report details page. The general information, test results and the test list are displayed. A jQuery plugin called DataTables [38] is used for the sorting and paging functions. Additionally, the plugin offers options for changing the number of elements displayed and for searching specific entries.

When a test entry is selected, the application redirects to the comparison details page (previously shown in Figure 15).

The report details page requires a parameter (a query string parameter in the URL) which is the name of the report to be displayed. The application assumes a relative path for all the report files (configured in the WebConfig bean – previously discussed).


Figure 19 – Report details page

4.4.5 Challenges

This subsection presents and discusses the issues that were faced during the web application development.

Formatting and highlighting output differences

Initially, an HTML div element was used as the container for the output strings; nevertheless, we realized that it does not maintain the format (indentation and line breaks) of the strings. Other HTML elements such as p or span have the same limitation. After some research, the HTML pre element turned out to be the solution.

The process for highlighting differences (and lines containing them) represented a challenge. The reason is that the reconstruction of the original output (using the diffs and the majority output files) had to be mixed with a formatting process.

The reconstructed output had to be escaped (because it contains HTML elements that might be confused with the web page's own HTML elements). HTML span elements had to be inserted wherever a difference is present. Several indexes had to be used to track the start and end of the lines to highlight, because a difference may involve several lines and a line may have several differences.

Encoding

Although the tracer and the comparison harness were already using UTF-8, making the entire web application use UTF-8 encoding was a real challenge, as several modules had to be configured.

Tomcat 8 is used as the web server for the application. It uses the encoding defined in the operating system settings for encoding the GET request parameters (parameters specified in the URL). This setting had to be changed to use UTF-8 encoding.

The Java web application by default uses the encoding of the browser for handling requests and responses. A filter was developed to override this behaviour and force the use of UTF-8.
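A minimal sketch of such a servlet filter (Spring also provides a ready-made CharacterEncodingFilter; this hand-rolled version only illustrates the idea):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Sketch: force UTF-8 on every request and response before the rest of the
// application handles them. Registered in web.xml like any other filter.
public class Utf8Filter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() { }
}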

Every jsp page can have its own encoding. However, we set in the application configuration file (web.xml) that all the jsp pages must use UTF-8. Finally, the layout page (master page) has a meta tag where the charset attribute is set to UTF-8 as well.


5. Analysis and Results

This chapter presents the analysis and results of the following experiments:

 The spec coverage of the html5lib test suite
 The MScParser test results using the html5lib test suite
 The parsers test results using the html5lib test suite
 Running the tracer over websites

5.1 The html5lib test suite coverage

The code of the MScParser was used to measure the coverage of the spec by the html5lib test suite. The parser was developed as a transliteration of the specification algorithm, thus its granularity is roughly equivalent to that of the spec algorithm. It has to be taken into consideration that the coverage is based on lines of code and not on actual coverage of the spec: a line in the spec could imply a dozen lines of code, or vice versa. EclEmma [39], a code coverage plugin for the Eclipse IDE, was used to perform the following analysis.

The tokenizer comprises 69 states. The html5lib test suite contains 6665 tokenizer test cases covering in total 68.21% of the tokenizer code. As shown in Table 5, there are 38 states with 100% coverage, 12 states that are not fully covered and 19 states with 0% coverage. Of the 19 tokenizer states that are not covered, 18 are related to scripts and one is related to CDATA sections.

An interesting tokenizer state to note is tokenizing character references, which has a code coverage of 74.77%. The five states that can switch to it are:

 Attribute value double quoted state (94.87%)
 Attribute value single quoted state (94.87%)
 Character reference in data state (89.94%)
 Attribute value unquoted state (74.26%)
 Character reference in RCDATA state (51.57%)

As discussed in the section on the MScParser challenges (4.1.3), those states required a few changes to fit into the architecture of the system and are not a reliable transliteration of the spec; hence, those percentages might not be completely accurate.


Code coverage of the tokenizer states:

Coverage 100.00% (38 states): After attribute name, After attribute value quoted, After DOCTYPE name, After DOCTYPE public identifier, After DOCTYPE public keyword, After DOCTYPE system identifier, After DOCTYPE system keyword, Attribute name, Before attribute name, Before attribute value, Before DOCTYPE name, Bogus comment, Bogus DOCTYPE, Comment end bang, Comment end dash, Comment end, Comment start dash, Comment start, Comment, Data, DOCTYPE name, DOCTYPE public identifier double quoted, DOCTYPE public identifier single quoted, DOCTYPE, DOCTYPE system identifier double quoted, DOCTYPE system identifier single quoted, End tag open, RAWTEXT end tag name, RAWTEXT end tag open, RAWTEXT less than sign, RAWTEXT, RCDATA end tag name, RCDATA end tag open, RCDATA less than sign, RCDATA, Self-closing start tag, Tag name, Tag open.

Coverage > 0% and < 100% (12 states): Attribute value double quoted (94.87%), Attribute value single quoted (94.87%), Character reference in data (89.94%), PLAINTEXT (80.36%), Character reference in attribute value (78.07%), Tokenizing character references (74.77%), Attribute value unquoted (74.26%), Between DOCTYPE public and system identifiers (74.23%), Markup declaration open (69.42%), Before DOCTYPE public identifier (67.29%), Before DOCTYPE system identifier (57.01%), Character reference in RCDATA (51.57%).

Coverage 0% (19 states): CData section, Script data double escape end, Script data double escape start, Script data double escaped dash dash, Script data double escaped dash, Script data double escaped less than sign, Script data double escaped, Script data end tag name, Script data end tag open, Script data escape start dash, Script data escape start, Script data escaped dash dash, Script data escaped dash, Script data escaped end tag name, Script data escaped end tag open, Script data escaped less than sign, Script data escaped, Script data less than sign, Script data.

Table 5 – Code coverage of the tokenizer states by the html5lib test suite


With respect to the tree construction stage, the test suite contains 1555 test cases. The total coverage of the 23 insertion modes is 94.5%, as shown in Table 6.

Insertion Mode         Coverage      Insertion Mode         Coverage
After After Body       100.00%       In Column Group        96.67%
After After Frameset   100.00%       After Head             96.23%
After Body             100.00%       In Table               93.28%
After Frameset         100.00%       In Table Body          93.02%
Before Head            100.00%       In Select              91.56%
Before HTML            100.00%       In Caption             91.03%
In Head                100.00%       Initial                88.37%
Text                   100.00%       In Cell                87.24%
In Frameset            97.88%        In Row                 87.24%
In Template            97.87%        In Head No Script      75.60%
In Body                97.60%        In Select In Table     65.32%
In Table Text          97.14%        TOTAL                  94.50%

Table 6 – Code coverage of the insertion modes by the html5lib test suite

The tree construction involves the use of some algorithms that were defined in another package of the project. The coverage of those algorithms is 89.5%. The adoption agency algorithm is the most complex algorithm of the spec: it can manipulate the DOM by removing, adding or moving nodes. The code for that algorithm presents 98.2% coverage. Nevertheless, its coverage should be considered 100%, because a code review detected one line of unreachable code.

The in body insertion mode, which is the largest and most complex section of the spec, has a coverage of 97.52%. A code review showed that 4 very specific paths are not covered, e.g., an unclosed p tag inside an isindex node inside a template node, or misnested dd and dt elements inside some specific nodes.

When using the tree construction test cases, the tokenizer state coverage reached 79%. This is an increase with respect to the tokenizer test cases (68.21%) because the tree construction tests fully exercise the script-related tokenizer states (which had 0% coverage).


When running all the test cases (from both the tokenizer and the tree constructor), the tokenizer code reached a coverage of 94.3%.

The html5lib test suite covers 91% of all the source code of the MScParser. However, the percentage rises to 93.02% when the spec tracer code, which is bundled with the project, is excluded.

5.2 The MScParser vs. the html5lib test suite

The MScParser was developed according to the W3C Recommendation. However, the html5lib test suite is conformant with the WHATWG specification. The differences between the specifications caused 8 failing tests, as described below.

In the test suite there are 13 files containing a total of 6665 test cases for the tokenizer. The MScParser passes all tests except one, as shown in Figure 20.

Figure 20 – Html5lib tokenizer state tests results

The failing test appeared in the test file domjs.test and the test name is CRLFLF in bogus comment state. The input of the test is a comment containing a CR character followed by two LF characters. The expected output is a comment token containing two LF characters. However, the output of our parser is a comment token containing just one LF character.

There is a subtle difference in how the specifications process newline characters. In section 8.2.2.5 (Pre-processing the input stream) the W3C spec states that "any LF characters that immediately follow a CR character must be ignored", while the WHATWG spec states that "any LF character that immediately follows a CR character must be ignored". This is the reason why our parser fails that test.
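The two readings can be made concrete with a small sketch of the pre-processing step (simplified and illustrative, not the actual MScParser code):

// Sketch of the two readings of the CR/LF pre-processing rule.
// W3C wording (plural): all LF characters immediately following a CR are ignored.
// WHATWG wording (singular): only the single LF immediately following a CR is ignored.
// Both then convert the CR itself to LF.
public class NewlinePreprocessing {

    static String w3c(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\r') {
                out.append('\n');
                while (i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++; // skip every LF that immediately follows the CR
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String whatwg(String input) {
        // Skip a single LF after each CR, then convert the CR to LF.
        return input.replace("\r\n", "\r").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String crlflf = "\r\n\n"; // the CRLFLF input of the failing test
        System.out.println(w3c(crlflf).length());    // 1: a single LF remains
        System.out.println(whatwg(crlflf).length()); // 2: two LF characters remain
    }
}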

With respect to the tree construction tests, Figure 21 presents the test results of the parser. There are 48 test files and the total of test cases is 1555.

Figure 21 – Html5lib tree construction tests results


As with the failing tokenizer test, the failing tests of the tree construction stage are due to differences between the specifications:

 Test case 54 (tests1.dat) – There is one difference in the first step of the adoption agency algorithm that causes misnested elements.
 Test case 14 (ruby.dat) – In the in body insertion mode the rtp, rp and rb elements are treated differently, causing misnested content.
 Test case 3 (main-element.dat) – The WHATWG spec considers the main element as foreign content whereas the W3C recommendation does not. This leads to a misnested node.
 Test case 78 (template.dat) – The list of special HTML5 elements is different (menu and menuitem). This could lead to misnested elements (and possibly missing elements) because of the adoption agency algorithm.
 Test cases 1, 2 and 3 (tests11.dat) – The specs define a table for mapping the names of some attributes in the SVG namespace. The attribute list is not fully consistent between the specifications.

Considering 8 failures out of 8220 test cases, the MScParser passes 99.90% of the html5lib test suite.

5.3 Comparing parsers with the html5lib test suite

Five third-party parsers and the MScParser were compared using the html5lib test cases for tree construction. However, of the 1555 test cases available, only 1276 were used. The remaining 279 test cases were excluded for the following reasons:

 HTML fragment – Some parsers provide incomplete or no support for parsing HTML fragments. For example, Jsoup has a parse fragment operation but the context node is set to the body element and cannot be changed.
 Template – The template element uses HTML fragments.
 Scripting flag – According to the spec, the flag is set to true "if scripting was enabled for the Document with which the parser is associated when the parser was created" [3] and set to false otherwise. The Parse5 and Html5lib parsers enable scripting and provide no option to change it. In fact, those parsers lack the spec sections that handle no-script scenarios.


Table 7 presents the results of comparing the outputs of the parsers, considering the html5lib expected output as the gold standard. Of the 1276 test cases, all parsers passed 911, i.e., all had the same output as the html5lib expected output. This is equivalent to an agreement of 71.39%.

Number of tests: 1276    Equals: 911    Different: 365

Parser name     Failed    Passed    Conformance
AngleSharp      17        1259      98.67%
Html5lib        20        1256      98.43%
Html5parser     6         1270      99.53%
Jsoup           354       921       72.18%
Parse5          18        1258      98.59%
ValidatorNU     2         1274      99.84%

Table 7 – Comparison of parsers vs. html5lib expected output

With only two failed tests, ValidatorNU is the parser with the highest conformance with the html5lib test suite. Our parser follows closely; its failing tests are due to the spec differences mentioned in the previous subsection.

Parse5 and Html5lib show high performance as well. Several of their failing tests contain a ruby element (13 failing tests each). Like our parser, they also fail the 3 tests related to SVG attributes. It is highly likely that they have not updated their code recently. One of the tests (number 16, file tests26.dat) causes a stack size exceeded error on Parse5.

AngleSharp is based on the W3C specification. It presents some failures in the entities, domjs-unsafe and plain-text-unsafe test files (12 failing tests); those test cases are related to invalid Unicode characters and character references. This could be caused by the pre-processing stage or the tokenizing character references state. Like our parser, it also fails the 3 tests related to SVG attributes, and the remaining failures are due to tests containing a frameset element.


Jsoup was the parser with the lowest conformance. Several failed tests are due to its lack of namespace support. The HTML5 spec defines two types of elements: foreign elements (in the MathML or SVG namespace) and normal elements (in the HTML namespace). Jsoup does not provide any details of the namespace of element nodes, thus all the tests that contain foreign elements failed.

Excluding Jsoup, the other parsers agree on 1240 of the 1276 tests (97.18%).

5.4 Tracing the web

With the aim of analysing the coverage of the spec and the usage of HTML5 'in the wild', the specification tracer was run over 90 websites (taken from [40]). Table 8 presents a summary of the information collected by the tracer.

Data                       Times Used    Average
Algorithms                 1,870         19.28
Insertion Modes            987           10.18
Parse Errors               1,640         16.91
Tokenizer States           3,408         35.13
MathML Elements            -             -
SVG Elements               9             9%
HTML5 Form Elements        -             -
HTML5 Graphic Elements     1             1%
HTML5 Media Elements       2             2%
HTML5 Semantic Elements    54            56%

Table 8 – Tracing details over websites

On average, each website uses 10.18 insertion modes (out of 23) and 35.13 tokenizer states (out of 69). Each website generates almost 17 parse errors. Nine websites used SVG elements and none used MathML elements.

The HTML5 specification defined new elements for forms, graphics and media content. However, those are rarely used. The new semantic elements are used by 1 in every 2 websites on average.


Figure 22 shows the distribution of usage of insertion modes by the websites. Every input (even an empty string) uses 6 insertion modes by default: initial, before html, before head, in head, after head and in body; the figure corroborates that statement. An interesting case is the text insertion mode, which can only be triggered when script or textarea elements are present. Hence we can say that all the websites used scripts (most likely) or textarea elements.

Figure 22 – Insertion modes usage by websites

The insertion modes related to tables (in table, in table text, in table body, in row, etc.) are used by around 20% of the websites. The frameset-related insertion modes were not used. The in template insertion mode was not used by any website either; this might be due to the novelty of the template element, which was introduced in HTML5.

With respect to the tokenizer states, Figure 23 presents the distribution of usage by websites. The states for processing tags, attributes and comments are widely used.

The states for processing the public id and system id of doctype elements are barely used. Those attributes are usually used for specifying schemas for validating the html document. Some script related tokenizer states are barely used. Those are used for escaping script content. The last bar corresponds to the tokenizing character references state and it is used by more than 80 websites.


Figure 23 – Tokenizer states usage by websites

6. Conclusions

The developed product presents a new approach for analysing and comparing HTML5 parsers. It has two main characteristics: scalability and modularity. It is scalable vertically by allowing multiple inputs to be processed, and horizontally by allowing new parsers to be added. Modularity was achieved because each part of the system is an independent module that can be updated or replaced effortlessly, i.e., it is a low-maintenance system.

Although the main goal of the HTML5 standard is to achieve convergence of outputs between parsers, there are still differences. Using the html5lib test suite, six parsers were analysed and compared, and disagreements between their outputs were found.

Testing the HTML5 standard is a complex task. With our analysis, we found that the more than 8000 test cases of the html5lib test suite are not enough to cover the whole specification. The code coverage is around 93%, but this measure is not 100% reliable because code is prone to errors and there could be redundant or unreachable code. Moreover, a transliteration of an algorithm into code is a subjective process that depends on other factors such as the style and experience of the programmer, the programming language, etc.

A highly valuable contribution, not to the system but to the community, would be improving the html5lib test suite. Test cases can be written for both the tokenizer and the tree construction in order to increase the coverage of the spec. Higher coverage would increase the reliability of the test suite.

Reading and understanding the HTML5 specification represents a challenge. Although it is well organised and the writing style is clear, it is very large and tedious in places. The existence of two specifications (W3C and WHATWG) certainly complicates the testing of HTML5. According to our results, there are spec differences that have direct repercussions on the output DOM. The risk of potential differences is higher because there might be differences in parts of the specs where the html5lib test suite has no coverage. As long as there is no full convergence between the specs, there will be no convergence amongst parsers.


Performance was not a priority when transliterating the algorithm proposed by the W3C specification. However, performance was not an impediment to the correct functioning of the parser in the system. At this point, it is still unclear whether a transliteration of the spec would be an effective approach for building a high-performance parser.

The spec tracer was conceived as an analysis tool. Several ideas were considered for easing the analysis of the parsing process and finding sources of disagreement between outputs. However, due to the limited time available for working on it, its current state does not fulfil that goal completely. The EclEmma code coverage plugin turned out to be a better option for measuring spec coverage than the tracer (because of the tracer's coarse granularity). The tracer nevertheless has the potential to be an extremely useful tool.

Some areas for improvement of the product are discussed below.

MScParser

 Allow parsing according to the W3C or the WHATWG specifications – I consider this option would be particularly useful, as it would allow tracking and analysing spec differences. However, it would require constant maintenance because of the frequent updates of the HTML5 living standard. Moreover, the complexity of spec differences might involve significant code changes.
 Sniffing algorithm – This algorithm is used for parsing inputs with a character encoding different from UTF-8. By implementing this algorithm, more web pages could be parsed without encoding issues.
 Script execution – Although script execution is not part of the parsing section of the specification, it is closely related. A script engine could be useful for analysing the structure of web pages with and without script execution.

HTML5 DOM

 Add operations for navigating through nodes – The HTML5 DOM was developed only for storing and manipulating the DOM tree generated while parsing. It lacks methods for easy navigation and node searching that a user might find useful, e.g. finding elements by tag name, finding attributes by id, etc.
 Document it and publish it as an API – During the development of the parser no independent HTML5 DOM implementations were found. This module could be documented and offered as a public, independent API for potential HTML5 parser developers.

Spec tracer

 Increase the tracer granularity – The current level of tracing is at section level. The granularity could be increased as much as the user requires. Moreover, being ambitious, an option for selecting the granularity level would be extremely useful.
 Trace over substrings – This would work as a zoom-in and zoom-out option where the user could select a block of the input for tracing. It would be particularly useful when analysing large inputs.
 Minimize repeated events – This could be an option for collapsing repeated events into a single event. For example, when the input has text content, every character is emitted as a character token, leading to a run of repeated events. All those events could be merged into a single event for the entire text block.
 Debugger – This option would allow the user to set breakpoints over the input or to track the generated events step by step. This feature could also display the output (and other parsing or context variables) at a given point of the parsing process.
 Include more details in the tracing summary – Currently, the summary is hard-coded to trace certain elements. A generic option allowing the user to select specific elements to track and count could be valuable.

Test harness

 Add more parsers – This could be one of the most valuable improvements to the system. An analysis would be more valued and trusted by adding more parsers and, ideally, including the parsers of the major web browsers.


 Performance and threading – The parsing processes are executed sequentially by the bash/batch script. Running each parser in parallel processes could reduce the response time.

Web application

 Add tables and graphs – The presentation of the report is very simple. Graphs and tables could be included for analysing the tracer and comparison reports in more detail.
 Filter test cases by parser – The web application presents the report details and the list of all the test cases. If the user wants to see the tests that a particular parser passed or failed, they have to search manually through all the tests. An option to filter the test cases would be useful.
 Link output elements to events – In the tracer page, the HTML output and the log of events are displayed. However, there is no way to link or relate elements (in the HTML output) with tracer events. The proposed functionality would highlight events when hovering over (or clicking) elements, or vice versa.

6.1 Reflection

I am really pleased with our project and accomplishments. However, I think that the system could have been better. There are two reasons that make me feel that way: we had trouble working as a team and we never had a specific and well-grounded goal.

I feel that we never worked as a team. Instead we just were a group of individuals working on the same project. In fact, since the early spikes (in February) we realised that working as a team would be a challenge. Moreover, there were a few attempts to dissolve the team and work individually.

Xiao is an easy-going, friendly person. The difficulties in working with Xiao were due to language barriers. On the other hand, although Jose Armando and I are both Mexican and had no language barrier, we could not forge a friendship, or at least a comradeship, for working together. I feel that we both tried but our temperaments are simply not compatible.

During the project, we tried to apply some agile techniques such as pair programming, collective ownership of code, continuous delivery, using a backlog, etc. However, we gave up on most of them. This is probably because of our lack of experience with Agile and mostly, I think, due to our low commitment to teamwork.

For example, when we started to work on the parser, we had two days of discussions about the system architecture with little progress. I feel that there was no engagement from my teammates. I proposed some ideas but they just questioned them without trying to solve anything. In the end I wrote the code base for the parser and simply imposed it so we could start working. A similar situation happened with the harness for comparison.

I feel that we did not have a specific and well-grounded project goal until the very end. We had several discussions with our supervisor and he suggested plenty of ideas to us. Sadly, we as a team never agreed on anything in particular and the project changed constantly. We wandered between a parser in a new language, a high-performance parser, minimizers and debuggers, amongst other ideas.

From late April I expressed my desire to work on a spec tracer and debugger, but I did not manage to convince my partners to work on it; we ended up working on the comparator and parser adaptors. Later I started to work on the spec tracer, but there was not enough time to finish it as I would have liked.

I once read that we have to do what we can, with what we have, where we are. Although we, as a team, did not have adequate communication and did not apply some techniques correctly, the pseudo-agile approach we followed helped us. I feel confident that the agile toolset has real potential for improving software development.

Despite the issues discussed above, our project accomplishments make me feel satisfied. The product is usable and I hope that someone will find it useful. I have gained significant experience with version control software and my programming skills have improved. I have obtained useful knowledge of HTML and related topics. I am really enthusiastic to share and apply my new knowledge and experience back in my home country.


Bibliography

[1] “W3C Mission.” [Online]. Available: http://www.w3.org/Consortium/mission. [Accessed: 27-Apr-2015].

[2] “2 - A history of HTML.” [Online]. Available: http://www.w3.org/People/Raggett/book4/ch02.html. [Accessed: 27-Apr-2015].

[3] “8 The HTML syntax — HTML5.” [Online]. Available: http://www.w3.org/TR/html5/syntax.html. [Accessed: 18-Apr-2015].

[4] “How Browsers Work: Behind the scenes of modern web browsers - HTML5 Rocks.” [Online]. Available: http://www.html5rocks.com/en/tutorials/internals/howbrowserswork/. [Accessed: 18-Apr-2015].

[5] Y. Minamide and S. Mori, “Reachability analysis of the HTML5 parser specification and its application to compatibility testing,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 7436 LNCS, pp. 293–307, 2012.

[6] “html5lib/html5lib-tests.” [Online]. Available: https://github.com/html5lib/html5lib-tests/tree/master. [Accessed: 28-Apr-2015].

[7] “W3C Document Object Model.” [Online]. Available: http://www.w3.org/DOM/. [Accessed: 02-May-2015].

[8] “Extensible Markup Language (XML) 1.0 (Fifth Edition).” [Online]. Available: http://www.w3.org/TR/REC-xml/#sec-origin-goals. [Accessed: 02-May-2015].

[9] L. Stevens, The Truth About HTML5. Apress, 2012.

[10] “HTML Standard.” [Online]. Available: https://html.spec.whatwg.org/multipage/syntax.html#parsing. [Accessed: 23-Apr-2015].

[11] “Interview with Ian Hickson, editor of the HTML 5 specification. - The Project.” [Online]. Available: http://www.webstandards.org/2009/05/13/interview-with-ian-hickson-editor-of-the-html-5-specification/. [Accessed: 02-May-2015].

[12] “W3C vs. WhatWG HTML5 Specs - Differences Documented - Telerik Developer Network.” [Online]. Available: http://developer.telerik.com/featured/w3c-vs-whatwg-html5-specs-differences-documented/. [Accessed: 08-May-2015].

[13] J. Anaya, J. Zamudio, and X. Li, “HTML5 flow diagram - Gliffy Diagram,” 2015. [Online]. Available: https://www.gliffy.com/go/publish/7298487. [Accessed: 08-May-2015].

[14] “w3c/web-platform-tests.” [Online]. Available: https://github.com/w3c/web-platform-tests. [Accessed: 01-May-2015].

[15] “Testing - HTML WG Wiki.” [Online]. Available: http://www.w3.org/html/wg/wiki/Testing. [Accessed: 23-Apr-2015].

[16] “Testsuite - WHATWG Wiki.” [Online]. Available: https://wiki.whatwg.org/wiki/Testsuite. [Accessed: 03-May-2015].

[17] “HTML5test - How well does your browser support HTML5?” [Online]. Available: https://html5test.com/index.html. [Accessed: 03-May-2015].

[18] “HTML5 Parser - Web developer guide | MDN.” [Online]. Available: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/HTML5/HTML5_Parser. [Accessed: 24-Apr-2015].

[19] B. Anderson, J. Moffitt, M. Goregaokar, D. Herman, J. Matthews, K. McAllister, and J. Moffitt, “Experience Report: Developing the Servo Web Browser Engine using Rust,” 2015.

[20] “Trident (layout engine).” [Online]. Available: http://en.wikipedia.org/wiki/Trident_%28layout_engine%29. [Accessed: 28-Apr-2015].

[21] “Live DOM Viewer.” [Online]. Available: http://software.hixie.ch/utilities/js/live-dom-viewer/. [Accessed: 29-Aug-2015].

[22] “google/gumbo-parser.” [Online]. Available: https://github.com/google/gumbo-parser. [Accessed: 23-Apr-2015].

[23] “jsoup Java HTML Parser, with best of DOM, CSS, and jquery.” [Online]. Available: http://jsoup.org/. [Accessed: 02-May-2015].

[24] “jhy/jsoup.” [Online]. Available: https://github.com/jhy/jsoup. [Accessed: 23-Apr-2015].

[25] Z. Zhao, C. William, M. Bebenita, D. Herman, M. Corporation, J. Sun, X. Shen, and C. William, “HPar: A Practical Parallel Parser for HTML – Taming HTML Complexities for Parallel Parsing,” vol. 10, no. 4, 2013.

[26] J. Anaya and J. Zamudio, “Project architecture,” 2015. [Online]. Available: https://drive.google.com/file/d/0B49Wuuqv8y6PejN1anFGdVFHS3c/view?usp=sharing. [Accessed: 27-Aug-2015].

[27] J. Anaya, J. Zamudio, and X. Li, “HTML5MSc,” 2015. [Online]. Available: https://github.com/HTML5MSc. [Accessed: 31-Aug-2015].

[28] “Lines Of Code.” [Online]. Available: http://c2.com/cgi/wiki?LinesOfCode. [Accessed: 02-Sep-2015].

[29] J. Zamudio, “A testing strategy for HTML5 parsers,” The University of Manchester, 2015.

[30] “Usage Statistics of Character Encodings for Websites, April 2015.” [Online]. Available: http://w3techs.com/technologies/overview/character_encoding/all. [Accessed: 24-Apr-2015].

[31] “org.w3c.dom (Java Platform SE 7).” [Online]. Available: https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/package-summary.html. [Accessed: 09-Aug-2015].

[32] “JDOM.” [Online]. Available: http://www.jdom.org/. [Accessed: 09-Aug-2015].

[33] “Dom4j by dom4j.” [Online]. Available: http://dom4j.github.io/. [Accessed: 09-Aug-2015].

[34] “google-diff-match-patch - Diff, Match and Patch libraries for Plain Text - Google Project Hosting.” [Online]. Available: https://code.google.com/p/google-diff-match-patch/. [Accessed: 14-Aug-2015].

[35] “Fossil: Fossil Delta Format.” [Online]. Available: http://fossil-scm.org/xfer/doc/trunk/www/delta_format.wiki. [Accessed: 23-Aug-2015].

[36] W. H. Hesselink, “The Boyer-Moore Majority Vote Algorithm,” vol. 0, no. November, pp. 1–2, 2005.

[37] “google-code-prettify - syntax highlighting of code snippets in a web page - Google Project Hosting.” [Online]. Available: https://code.google.com/p/google-code-prettify/. [Accessed: 11-Aug-2015].

[38] “DataTables | Table plug-in for jQuery.” [Online]. Available: http://www.datatables.net/. [Accessed: 12-Aug-2015].

[39] “EclEmma - Java Code Coverage for Eclipse.” [Online]. Available: http://www.eclemma.org/index.html. [Accessed: 25-Aug-2015].

[40] “5000 Best Websites.” [Online]. Available: http://5000best.com/websites/. [Accessed: 03-Sep-2015].
