Top 10 Reasons for a Webpage to Fail the HTML Validator
An Experimental Study on Validation Problems with Existing HTML Webpages

Shan Chen, Dan Hong, Vincent Y. Shen
Department of Computer Science
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong, China
{chenshan, csdhong, shen}@cs.ust.hk

Abstract

In this paper we report the results of an experimental study on the validation problem of existing HTML pages on the World Wide Web. We have found that only about 5% of webpages are "valid" according to the HTML standard. An "invalid" webpage may be rendered differently by different browsers; it may not be machine-processable; and it might not be translated correctly into other Web document formats by some of the translation engines available on the Web. By sampling webpages and analyzing the results, we have identified the most common problems that make existing webpages invalid. We hope our findings encourage standards bodies, browser developers, authoring tool developers, and webpage designers to work together so that the number of valid webpages continues to grow and the Web can indeed reach its full potential.

Categories and Subject Descriptors

H.5.2 [Information Interfaces and Presentation]: User Interfaces - Standardization. I.7.5 [Document and Text Processing]: Document Capture - Document analysis.

General Terms

Experimentation, Verification

Keywords

HTML Specification, HTML Validator, HTML tidy tools, W3C, webpage characteristics, Web sampling.

1. Introduction

Invented in the early 1990s, the World Wide Web is now one of the primary media through which people publish and obtain information. As one of the most popular applications on the Internet, the Web continues to grow rapidly: in 1999 there were 800 million webpages in the world [7], and according to the Internet Domain Survey of the Internet Systems Consortium [6], there were already more than 233 million domains by the end of January 2004.

The importance of the Web has attracted many researchers to study its properties. People are interested in the average size of an HTML document, the average number of images in a webpage, the average size of a website, and so on [7][8]. Our concern is the quality of the HTML documents in today's webpages in terms of conformance to public standards.

HTML is an extensible computer language used to instruct the browser on how to render a webpage, which may contain text, images, links, and multimedia content. To gain a competitive advantage in the "browser battle", each tool vendor tried to introduce features that would distinguish its own products. With these new features, webpage designers could easily create exciting webpages, and browser users could enjoy these pages with a brand-new experience. Unfortunately, features introduced by one vendor might not be recognized by another vendor's browser. To create some order in a chaotic situation, the World Wide Web Consortium (W3C) released the HTML (HyperText Markup Language) Recommendation in 1995 to resolve interoperability concerns among browsers from different vendors [12]. Using the "elements" and "attributes" defined in the HTML Recommendation, browsers parse the HTML document, analyze the tags, and render them (hopefully) as the webpage designer intended. This "Recommendation" has become the de facto "standard" in the Web community.

HTML documents that follow the standard, the "valid" HTML files, have a higher chance of being rendered consistently by different browsers and across different platforms than those that do not. An invalid HTML file may not be treated the same way. All the major browser vendors claim to handle valid HTML files correctly. It therefore makes good sense for the developer of a new website to start with the standard and follow it through later revisions.

In addition to possible differences in appearance when rendered, invalid HTML files can be less accessible than valid ones. For example, the standard defines a required attribute "alt" in element "IMG", where "alt" provides some textual information about the image. If the browser does not support images, the value of the "alt" attribute is displayed instead. The textual information can also be read out by a screen reader to assist vision-impaired users who cannot see the image clearly. An invalid webpage missing the "alt" attribute may therefore not be accessible to certain users.

The Web has now become a large resource of information. More and more people use the computer to extract relevant information from the Web for their applications. An invalid HTML file often causes difficulties for the application in understanding the intended structure of the file. Information contained in such files may not be available to these applications, defeating the purpose of publicizing them on the Web. For example, the input interface of a Web application is usually presented as a "FORM" element in the HTML file, and the element "TABLE" is widely used to control the layout of the page. Consequently, webpage designers often use "FORM" and "TABLE" together to provide interactivity as well as to control the layout. However, if the webpage designer does not understand the HTML language well, it is easy to commit errors while using "FORM" with "TABLE". Such errors may prevent the interactive page from being machine-processable.

To assist webpage designers, several organizations have developed "validators" to check whether a webpage conforms to the HTML standard. These validators can be used to analyze a single webpage (for example, the W3C markup validation service [14] and the WDG validator [16]) or a whole website (for example, the CSE HTML Validator [1]). There are also a number of HTML "tidy" tools which can be used to clean up an HTML document by correcting most of the errors in it [10]. However, these tools are often not powerful enough to correct all errors, and sometimes the appearance of a webpage is altered after being tidied up. Proofreading the resulting document after processing by a tidy tool can be a challenge for webpage designers.

We wish to find out the proportion of current webpages that are valid HTML documents, and the most common problems that cause a document to become invalid. This issue was of interest to the Web Design Group, which identified six "common" problems [15] by inviting comments from other validator users. However, there was no indication of how serious each problem was, and no report on the methodology used by various individuals in identifying these problems. It is difficult to draw quantitative conclusions from such experiential results. To understand how serious the validity problem on the Web is today, and to study the statistics of validation problems, we performed an experimental study by sampling existing webpages, checking them with a validator, and analyzing the results. Our study shows that only about 5% of webpages are "valid". We also identified the major validation problems in existing HTML webpages. We hope our findings will alert webpage designers to the validity problem and help them identify the most common errors, which they should try to avoid. We hope our findings will also alert Web application developers so that they can handle the common errors properly in order to provide better service. Finally, we hope our findings are useful to the standards bodies and the authoring tool vendors.

Copyright is held by the authors. WWW 2005, May 10-14, 2005, Chiba, Japan.

… about 40,000 webpages, "crawled" the Web to measure it, and obtained some statistics on the size of the Web, its connectivity, richness, and the distribution of URL links. Bharat and Broder [3] suggested a methodology for finding a page at random from a search engine index in 1998. Their approach was based on queries to the search engine using random words. In 1999, Lawrence and Giles [7] used randomly generated IP addresses to estimate the information available on the Web, such as the characteristics of the hosts and pages. In 2000, Henzinger, Heydon, Mitzenmacher, and Najork [5] suggested performing a random walk of the Web using multiple search engines to extract representative samples of webpages.

In this paper we adopted the methodology of using randomly generated IP addresses as samples because of its simplicity. We further compared the test results from this Random IP List with the results from the Most Popular 10000 Websites List, made available to us by Alexa.com, as additional supporting evidence. Note that both lists enabled us to analyze the properties of only the home pages of the selected websites. Although problems with the home page may be indicative of the problems of other webpages accessible in that website, we created yet another URL List using search engines to confirm our findings. We also repeated the test using a Second Random IP List two months later to see whether the results change over time. The details of the lists are described below.

2.1 Sample Lists

2.1.1 Random IP List

We developed a program to randomly generate an IPv4 address. By setting up an HTTP connection to this address at port 80, which is the default port for HTTP connections, and listening for the response, we could tell whether there was a Web server at that IP address. If there was, we added it to the list.

We encountered some difficulties with this approach. Since most IP addresses belonged to online computers that were not Web servers, most of our probes timed out. After a short while, our service provider received complaints about our activities, since our probes were considered hackers' probes by some organizations. We had to stop the collection process after obtaining a list of only about 1,100 IP addresses.
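The probing procedure described for the Random IP List can be sketched as below. This is a minimal illustration, not the authors' actual program; the timeout value and the decision to draw each octet from 1-254 (skipping broadcast-style values but not reserved ranges) are assumptions for the sketch.

```python
import random
import socket

def random_ipv4():
    # Draw four random octets (assumption: 1-254 each; a real
    # sampler would also skip reserved and private address ranges).
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def is_web_server(ip, timeout=3.0):
    # Try a TCP connection to port 80, the default HTTP port;
    # anything that accepts the connection is counted as a Web server.
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:  # includes timeouts and refused connections
        return False

# Example collection loop (requires network access):
# servers = [ip for ip in (random_ipv4() for _ in range(1000))
#            if is_web_server(ip)]
```

As the paper notes, most probes generated this way time out, so in practice such a crawler spends nearly all of its time waiting on dead addresses.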
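The missing-"alt" problem raised in the introduction is straightforward to detect mechanically. The sketch below is a hypothetical checker, not one of the validators cited in the paper; it uses Python's standard html.parser to report IMG tags that lack an "alt" attribute.

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    # Records the (line, column) position of every IMG tag
    # that lacks the required "alt" attribute.
    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            self.missing_alt.append(self.getpos())

checker = AltChecker()
checker.feed('<p><img src="logo.png" alt="Company logo">'
             '<img src="decor.gif"></p>')
print(len(checker.missing_alt))  # 1: only the second IMG lacks "alt"
```

A full validator checks every required attribute against the DTD; this sketch shows only the single accessibility check discussed in the text.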
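The FORM-with-TABLE pitfall can be illustrated the same way. In HTML 4.01 a FORM may contain a TABLE, but a FORM is not permitted as a direct child of TABLE or TR; the hypothetical checker below flags only that one invalid pattern, as a sketch of how such errors can be caught.

```python
from html.parser import HTMLParser

class FormInTableChecker(HTMLParser):
    # Flags a FORM opened directly inside TABLE or TR, which the
    # HTML 4.01 content model forbids (a common source of invalid pages).
    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.stack and self.stack[-1] in ("table", "tr"):
            self.errors.append("FORM opened directly inside <%s>" % self.stack[-1])
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:  # pop back to the matching open tag
            del self.stack[self.stack.index(tag):]

checker = FormInTableChecker()
checker.feed("<table><form><tr><td>Name</td></tr></form></table>")
print(len(checker.errors))  # 1: the FORM sits directly inside the TABLE
```

Reversing the nesting, i.e. placing the TABLE inside the FORM, produces no error, which is the valid way to combine interactivity with layout.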