Analysing and testing HTML5 parsers

A dissertation submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

2015

Jose Carlos Anaya Bolaños
School of Computer Science

Contents

Contents
List of figures
List of tables
Abstract
Declaration
Intellectual property statement
Acknowledgements
The author
1. Introduction
    1.1 Goal and objectives
2. Literature review
    2.1 HTML history
    2.2 The HTML5 parsing process
    2.3 Testing HTML5
    2.4 HTML5 parsing implementations
3. Project architecture
    3.1 Overview
    3.2 Tasks distribution
    3.3 Project evolution
4. Project implementation
    4.1 The MScParser
        4.1.1 Architecture
        4.1.2 The custom HTML5 DOM
        4.1.3 Challenges
    4.2 The specification tracer
        4.2.1 Architecture
        4.2.2 Challenges
    4.3 The harness for comparison
        4.3.1 The parser adaptors
        4.3.2 The script execution
        4.3.3 The comparison and report generation
        4.3.4 Challenges
    4.4 The web application
        4.4.1 Architecture
        4.4.2 Parsing and tracing
        4.4.3 Comparing outputs
        4.4.4 Reviewing reports
        4.4.5 Challenges
5. Analysis and Results
    5.1 The html5lib test suite coverage
    5.2 The MScParser vs. the html5lib test suite
    5.3 Comparing parsers with the html5lib test suite
    5.4 Tracing the web
6. Conclusions
    6.1 Reflection
Bibliography

List of figures

Figure 1 – Flow diagram of the HTML5 parsing process (adapted from [13])
Figure 2 – A cycle through the tokenizer to emit a token
Figure 3 – A cycle through the tree constructor to process an empty string
Figure 4 – Overview of the product architecture (adapted from [26])
Figure 5 – Class diagram of the parser
Figure 6 – Class diagram of the custom HTML5 DOM
Figure 7 – Class diagram of the specification tracer
Figure 8 – File example of tracerEvents.xml
Figure 9 – Comparator flow diagram
Figure 10 – XML report sample
Figure 11 – Tracer input tab
Figure 12 – Tracer exclusion tabs
Figure 13 – Tracer output tabs for the input string this is a <b>test
Figure 14 – Input form for the multi-parser comparator tool
Figure 15 – Comparison details page
Figure 16 – Comparison page displaying differences between outputs
Figure 17 – Format options tab
Figure 18 – Report class diagram
Figure 19 – Report details page
Figure 20 – Html5lib tokenizer state tests results
Figure 21 – Html5lib tree construction tests results
Figure 22 – Insertion modes usage by websites
Figure 23 – Tokenizer states usage by websites

List of tables

Table 1 – Most popular HTML5 parsers in GitHub
Table 2 – Participation of the members in the project
Table 3 – Example of inputs that are HTML5 valid but XML invalid
Table 4 – Example of diff encoding
Table 5 – Code coverage of the tokenizer states by the html5lib test suite
Table 6 – Code coverage of the insertion modes by the html5lib test suite
Table 7 – Comparison of parsers vs. html5lib expected output
Table 8 – Tracing details over websites

Abstract

In their early days, websites contained only interlinked plain text and images. Over time, they evolved into complex web applications offering diverse services such as multimedia streaming, social networking and gaming. HTML parsers have historically been flexible and permissive with user input. Each parser had to define its own way to parse and fix errors but, due to the