Reverse Engineering of Web Pages
Total Page:16
File Type:pdf, Size:1020Kb
Reverse Engineering of Web Pages Joseph Scoccimaro, Spencer Rugaber College of Computing Georgia Institute of Technology [email protected], [email protected] Abstract structure, interfaces, and abstraction are common to both. It is estimated that more than ten billion web When reverse engineering software, we can pages exist on the World Wide Web. Hence, we always fall back on execution semantics to situate argue that there exists a web-page maintenance the various insights we obtain. When reverse engi- problem analogous to the software maintenance neering a web page, however, while there may be problem. That is, there is a need to understand, cor- execution aspects, they are often domain specific rect, enhance, adapt and reuse web pages. There- and provide little help understanding the page fore, it is meaningful to ask, what does it mean to source. Instead, we have found that focusing on the reverse engineer a web page? We examine this “users” of a web page, its visitors, provides the same question and discuss the similarities and differences kind of support for analysis. Presentation, style, from software reverse engineering. We discuss the structuring, and functionality all contribute to enhanc- features and design of a tool, Page Monkey, that we ing the user experience. have built to support web page reverse engineering This paper is organized as follows. In Section 2, and briefly describe user feedback we have received. we survey the different kinds of analyses that can 1. Introduction contribute to understanding web pages. We then look at various technologies that might be used to help Imagine while web surfing that you visit a web automate the reverse engineering of web pages. In page and wonder how the author achieved a particu- Section 4, we discuss the role web page design lar effect. You think that you would like to do some- guidelines can play in understanding web pages. thing similar on a page you are composing. You then Section 5 presents our tool, Page Monkey, and, in view the page source and are bewildered by the Section 6, we discuss the feedback we have complexity of the information you see there. You find received on it. We then look at other efforts under- it difficult to locate the instructions responsible for the taken to understand web pages. In Section 8, future effect that impressed you, and, once found, you have directions for web-page reverse engineering tools to make several attempts before you can extract all are discussed, and we conclude with a discussion of that you need and nothing else. the relationship of web page reverse engineering to Web pages are rich in their expressivity. Besides software reverse engineering. HTML they may contain scripting language and (ref- erences to) style sheets. There are a variety of struc- 2. Characterizing Analyses turing mechanisms used such as tables, frames and As with programs, there are many kinds of analy- headers. Pages may be interactive, employing a vari- ses that can be performed on web pages. We have ety of user interface devices and the accompanying surveyed different types of analyses by looking at the logic. Many pages are automatically generated, lead- features of other related projects and at web page ing to the same sort of comprehension problems design guidelines, by studying the specification of found with generated software. In general, web HTML and by interviewing a web page design expert. pages compare in complexity to software programs The remainder of this section presents a collection of of the same size. interesting analyses. Reverse engineering (of software) is the process of analyzing that software and constructing an Readability Analysis. Readability Analysis predicts abstract representation useful in supporting mainte- the reading skills required to appreciate a text. By nance and reuse. Although web pages have some of obtaining this insight, an analyst can learn the diffi- the aspects of software, they are also different. Con- culty of understanding and the age level of the tent and presentation concerns are much greater for intended audience. The most popular readability web pages while, in most circumstances, logic and measurements are provided by the Flesch-Kincaid semantic precision are reduced. On the other hand, Reading Level and Flesch-Kincaid Reading Ease tests 1. Flesch-Kincaid Reading Level can be com- done to the data provided in the form when the user puted with the following formula, where tw is total submits it. words; ts is total sentences, and ty is total syllables: Universality. Universal design of a web page means that it can accommodate users with disabilities. An .39 * (tw/ts) + 11.8 * (ty/tw) - 15.59 example of testing for universal design is to see if a web page can support people with eyesight problems The Flesch-Kincaid Reading Level test maps the or hearing problems. The analysis is performed by understandability of a piece of writing into the differ- evaluating the page with respect to a list of universal ent US grade levels. Flesch-Kincaid Reading Ease design guidelines. can be calculated with this formula: Style Analysis. Style Analysis allows a person 206.835 - 1.015 * (tw/ts) - 84.6 * (ty/tw) viewing a web page to see how the author organized the presentation of the page. The analysis examines The formula produces a score from zero to one-hun- the style sheets and HTML style options for relevant dred. A lower number means a harder-to-read page, information. The World Wide Web Consortium (W3C) while a higher number means the page is easier to recommends that web page developers separate read. presentation from structure in the composition of web pages. Presentation is defined as the attributes of dif- Structural Analysis. HTML provides various means ferent tags such as color, font size and layout. to hierarchically structure pages, including frames, According to the World Wide Web Consortium, one tables and headers. Frame Analysis looks at the should put presentation elements into style sheets nesting of framesets and frame tags to determine and avoid using tag attributes for that purpose [11]. their structure. A wire frame drawing tool can then Style Analysis looks at the extent to which the page convey a skeleton drawing of the frame layout of a developer has separated structure from presentation. page. Table Hierarchy Analysis examines the differ- Another type of presentation analysis determines ent layers of nested tables. A table hierarchy aids the how cluttered a page is based on the amount of white user in learning how a page is organized and how space, text density, image density, and link density information is nested within different tables. A table of occurring in it. contents presents an outline of the web page using all the header tags contained in the document. They Content Analysis. There are many different types of are then presented based on their importance as analyses that can be done on the content of a web determined by the level of each tag. page. For example, Statistically Improbable Phrases (SIP) Analysis 2, which determines the “best” phrase Link Analysis. Link Analysis examines the hyper- that describes the web page for purposes of index- links embedded in a web page. It looks at how many ing. The converse operation is to try to characterize a links are internal to the page, external to the page, page versus a set of known categories. Examples of external to the web site or are mailto links. The anal- categories include on-line store, news site, and ysis can also determine which links are broken and course descriptions. characterize the targets of the active links. A graphi- cal presentation of this analysis can show external Augmentation Analysis. Augmentation Analysis links having lines leaving a page and internal links looks for features of a web page that are not part of pointing to their targets on the page. Broken links can HTML. Such features include JavaScript, Java be drawn using a wavy line. A similar analysis exam- applets, and Flash content. For example, Augmenta- ines the different image maps contained on the web tion Analysis can be used to determine if JavaScript page. These types the images embed links that can is only employed as a mouse-over feature for links or be examined in the same fashion as textual ones. provides more elaborate functionality. Form Analysis. Form Analysis captures the use of Metadata Analysis. HTML metadata can include the HTML forms on a given page. It looks at the different title of the page, its description, suggested key- types of inputs and fields employed within a form. words for indexing and the date the page was last Form Analysis also looks at hidden fields. These modified. Since the metadata can contain any type of fields contain important information about what is 2. http://en.wikipedia.org/wiki/Statistically_Improb- 1. http://en.wikipedia.org/wiki/Flesch-Kincaid able_Phrases data, any extra information the author has included load the web page into the application or remember can also be extracted. the URL of the page so that the application can download the content. On the other hand, building a Multimedia Analysis. The final type of analysis plug-in enables the user to run the tool while looking examines the multimedia content of a web page. For at the web page. For this reason, we decided to example, the use of animated images and embedded implement our tool, Page Monkey, as a plug-in. Plug- audio and video can be determined. ins, in turn, offer several different implementation approaches. The technologies we researched Table 1 lists the analysis techniques, an estima- included Visual C++ extensions to Internet Explorer, tion of the difficulty of implementing them, an indica- the HotJava browser, the NetClue browser and the tion of whether the necessary information is available 3 Greasemonkey add-on to Firefox.