A Framework for Parallelizing Large-Scale, DOM-Interacting Web Experiments

Sarah Chasins, UC Berkeley, [email protected]
Phitchaya Mangpo Phothilimthana, UC Berkeley, [email protected]

Abstract

Concurrently executing arbitrary JavaScript on many webpages is hard. For free-standing JavaScript — JavaScript that does not interact with a DOM — parallelization is easy. Many of the parallelization approaches that have been applied to other mainstream languages have been applied also to JavaScript. Existing testing frameworks such as Selenium Grid [19] and PhantomJS [16] allow developers to parallelize JavaScript tests that do interact with a DOM. However, these tools frequently limit the user to very constrained programming models. Although it is possible to build testing programs that take arbitrary inputs and yield arbitrary outputs (rather than pass or not pass) on top of these systems, the work required to do so is non-trivial. Controlling inputs — that is, pages — in order to carry out web research on real world pages is even less supported in modern tools. In fact, no existing tool provides same page guarantees. In response to this paucity of flexible and reliable systems, we built a framework designed for running large-scale JavaScript experiments on real webpages. We evaluate the scalability and robustness of our framework. Further, we present the node addressing problem. We evaluate a new node addressing algorithm against preexisting approaches, leveraging our framework's clean programming model to streamline this large-scale experiment.

1. Introduction

We describe the design and implementation of a framework that efficiently runs JavaScript programs over many webpages in parallel. We envision that users testing multiple approaches to some JavaScript task can provide each candidate solution as a separate program, and then test all approaches over a large corpus of webpages in parallel. Our framework thus facilitates a type of large-scale web experiment for which researchers, until now, have had no appropriate tools.

We further introduce an application of our framework, a large-scale web experiment that relies on the framework's guarantees. Specifically, we describe the challenge of robust representation of DOM nodes in live pages. The task is to observe a webpage at time t1, describe a given node n from the webpage, then observe the webpage at a later time t2, and identify the time-t2 node that corresponds to n, based on the time-t1 description. A t2 node corresponds to a t1 node if sending the same event to both nodes has the same effect. We term this the node addressing problem. Several node addressing algorithms have been proposed, but never tested against each other. We test these algorithms side by side on a test set of thousands of nodes by structuring the experiment as a series of stages in our framework.

Finally, we created a DOM evolution simulator for creating synthetic DOM changes from existing webpages. Our simulator takes as input a webpage DOM and produces as output a second webpage DOM, whose structure is similar to the first, but altered by the types of edits that web designers make over time as pages are redesigned. Thus, given a set of DOMs, our simulator can produce a DOM change that approximates the modification a user would see by pulling DOMs from the same URLs at a later point in time.

1.1 Framework

At present, there are few tools for controlled web research. While there is a preponderance of tools targeted at developers testing their own pages, these tools are not easily applied to the more general problem of running arbitrary tests over real world pages. We identified five core desirable properties, properties that are key to making a framework capable of running large-scale experiments on real world webpages. To facilitate large-scale web research, a framework should:

1. Parallelize test execution
2. Allow DOM interaction
3. Run arbitrary JavaScript code
4. Run on live pages from URLs (not only local)
5. Guarantee same input (page) through experiment

No existing tool does all of these. In fact, many do none of the above. Thus, our framework is motivated by the need for a system that offers all the characteristics necessary to carry out controlled, real world web research.

Our system's central abstraction is the stage. In each stage, the framework takes as input a set of JavaScript algorithms, each with n inputs, and an input table: a set of n-field records. In each record, the first field represents the webpage on which to run. Both this webpage argument and the remaining (n − 1) fields are passed as inputs to the JavaScript algorithms. The output table is a set of records. The output records can have an arbitrary number of fields, and there need not be a one-to-one mapping between input and output records.

Within a JavaScript algorithm, the user may create multiple subalgorithms. In fact, an algorithm is a sequence of subalgorithms. If a subalgorithm causes a page reload, then the next subalgorithm will be run on the new page. If no page reload occurs, the next subalgorithm will be run on the same page.

Our framework also offers the concept of a session. A session is a sequence of stages during which our framework offers a same page guarantee. In web experiments, the webpage is frequently considered an input, just as much as, for instance, any value passed to a JavaScript function. It is thus crucial that pages stay the same across data points, to ensure fairness and correctness of the experimental procedure. Our framework facilitates experiments that require stable page inputs by offering the session abstraction.

We implement the framework to take advantage of the programming model described above by parallelizing at the level of input rows. In addition, we use a custom page cache to both offer our same page guarantee and to improve framework performance.
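To make the stage model concrete, the sketch below mimics it in plain JavaScript. All names here (runStage, loadPage, echoAlgorithm) are illustrative stand-ins rather than the framework's actual API, and the "page" is a plain object rather than a loaded webpage.

```javascript
// Sketch of a stage (illustrative names, not the framework's real API).
// Each input record's first field is the page URL; the rest are arguments
// passed to every algorithm. Output rows need not map one-to-one onto
// input rows.
const inputTable = [
  ['http://example.com', 'term1'],
  ['http://example.org', 'term2'],
];

// Toy stand-in for loading a page through the framework's cache.
function loadPage(url) {
  return { url };
}

// Run every algorithm on every record, collecting all output rows.
function runStage(algorithms, table) {
  const output = [];
  for (const [url, ...args] of table) {
    const page = loadPage(url);
    for (const algorithm of algorithms) {
      output.push(...algorithm(page, ...args));
    }
  }
  return output;
}

// A toy algorithm with n = 2 inputs (the page and one argument) that
// emits one output row per input row.
const echoAlgorithm = (page, term) => [[page.url, term.toUpperCase()]];

console.log(runStage([echoAlgorithm], inputTable));
// [['http://example.com', 'TERM1'], ['http://example.org', 'TERM2']]
```

In the real framework the algorithms execute inside live browser instances; the sketch only illustrates the table-in, table-out shape of a stage.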

2014/1/22

1.2 Application

To offer robust record and replay for live web pages, a tool must be able to identify the same node in a web page over time, even if the DOM structure of the page changes between record and replay. A naive xpath, which simply traces the path through the DOM tree to the target node, is too fragile; wrapper nodes around a target node or its predecessors make the xpath break, as do sibling nodes added before the target node or any predecessors. Testing an algorithm for this task is difficult, because the test must be repeated on the webpage many times, to determine whether it works in the face of a changing page, and because the sample size must be very large in order to get a large number of (naturalistic) broken paths before the later test. To complete these tests by hand — for many nodes, many times, and many algorithms — is very tedious. To complete them programmatically would still be very time-consuming, using current state-of-the-art tools. To complete them fairly, upholding a same page guarantee where necessary, would require a substantial amount of infrastructure building.

A user can run these tests in our framework with four stages. A first stage takes URLs as input, and produces fragile xpaths for each node on those pages. A second stage, in the same session so that the xpaths do not break, takes the xpaths as input and determines whether each node is reactive, storing the xpaths and reactions of the reactive nodes. A third stage takes the xpaths of the reactive nodes and runs all node addressing algorithms over each node, producing node addresses. This stage must also be part of the same session. A final stage, not a part of the same session, takes the node addresses as input, and runs each node addressing algorithm to identify a corresponding node. It then identifies the reaction of each candidate corresponding node, comparing the reaction with the stage 2 reaction to evaluate each algorithm's success. Rerunning the final stage at a later time, with the same input, tests the algorithms' robustness over a longer period. This stage can be repeated as often as desired.

1.3 DOM Change Simulator

We offer five DOM modifiers, each of which makes a different type of edit to DOM structure, any or all of which may be useful for a given application, depending on the needs of the user. They may be combined as desired into a single simulator. For the application described above, which examines the robustness of DOM node addresses in the face of changing DOM structures, we observed the effects of all five types of modification on node addressing algorithms.

1.4 Organization

In Section 2, we detail the framework's programming model. In Section 3 we discuss the implementation of the framework, and in Section 4 its scalability. Section 5 presents the node addressing application. In Section 6, we discuss our DOM change simulator. Our evaluation of node addressing algorithms appears in Section 7. Section 8 is a discussion of related work, and Section 9 concludes.

2. Framework Programming Model

2.1 Abstractions

Our framework is built around a core set of user-facing abstractions:

• Session: sequence of stages for which we offer the same page guarantee
• Stage: run of a single (input program, input table) pair
• Input Program: set of algorithms
• Input Table: each row corresponds to a single run of all program algorithms; the first field is the page on which to run them, while the others represent the arguments to all algorithms
• Algorithm: JavaScript code to run on all input table rows; a sequence of subalgorithms
• Subalgorithm: component of an algorithm that takes place on a single page
• Output Table: a table with zero or more rows per input table row

Figure 1: A stage, the central abstraction of our framework. (The input program and input table flow into the framework, which produces the output table.)

During a session, a program can run multiple program-table pairs through our system. We call each program-table pair a stage. During a single session, if the URL in the leftmost column of an input table is the same in multiple rows (even across tables), the page that is loaded for those rows will also be the same.

Each stage is defined by its input program and its input table. See Figure 1 for a visual representation of a single stage. A stage's input program may contain multiple algorithms to run for each row in the input. Each algorithm may also contain subalgorithms, if the algorithm must run over multiple pages. For instance, if the first subalgorithm finds and clicks on a link, the second subalgorithm will run on the page that is loaded by clicking on the link. If clicking on a link does not cause a new page to load, the second subalgorithm will run instead on the original page.

2.2 Stage Output

Recall that any given input row may correspond to multiple output rows. This, combined with the existence of subalgorithms, complicates the design of our programming model. One option is to require that only the final subalgorithm can produce any output. This simplifies the programming model, but for cases in which earlier subalgorithms can access data that the later subalgorithm cannot, this may be undesirable. For instance, consider a test that finds and clicks on a link, and wishes to compare the URLs before and after clicking. To complete this task now, it would need to be split into two stages, with the first stage storing the original URL, while the second clicks the link and stores the second URL. Because this approach simplifies the programming model, and because splitting such tasks across stages is sufficient to make the approach general, this is the design we have adopted.

One plausible improvement is to allow the subalgorithms to pass their output to later subalgorithms, even if they may not directly write to the output table. We believe this modification may indeed be desirable, and if we find applications for which this adjustment would substantially simplify program logic, we may implement it. For now, we find this change unnecessary.

Alternative approaches included allowing all subalgorithms to produce output rows, but requiring that the user's JavaScript tests associate each slice of the row with some sort of row id. By using the id, our system would be able to stitch the row slices together correctly after the completion of the final subalgorithm. Similarly, we could require that the number of output rows be the same across subalgorithms, and use the slices' ordering to produce the correct full rows.

Another possibility would have been to allow each algorithm to produce only one output row for each input row, but we felt this diminished the expressiveness of the model too substantially.
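The adopted only-the-final-subalgorithm output rule can be mimicked in a few lines. This is an illustrative sketch with a simulated click, not the framework's code.

```javascript
// Sketch of the adopted output rule: earlier subalgorithms act on the page
// but return nothing; only the final subalgorithm's rows reach the output
// table. Names and the simulated "click" are illustrative.
function runAlgorithm(subalgorithms, page) {
  let rows = [];
  for (const sub of subalgorithms) {
    rows = sub(page) || []; // non-final subalgorithms return nothing
  }
  return rows; // only the last subalgorithm's rows survive
}

const clickThenRecordUrl = [
  (p) => { p.url = 'http://example.com/after'; }, // subalgorithm 1: "click" triggers a load
  (p) => [[p.url]],                               // subalgorithm 2: record the post-click URL
];

console.log(runAlgorithm(clickThenRecordUrl, { url: 'http://example.com/start' }));
// [['http://example.com/after']]
```

Note that, as the text explains, a test needing the pre-click URL in its output would have to split across two stages, since subalgorithm 1 cannot write rows here.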

Figure 2: Different algorithms may return different numbers of rows for a given input row. When this occurs, our framework stitches the output rows together according to their order.

Figure 4: Our system consists of multiple workers running in parallel. Each worker owns a web driver — essentially a browser instance that can be controlled programmatically. The HTTP traffic of the system goes through the caching proxy server.

We faced a similar design challenge in determining the appropriate way to combine distinct algorithms' outputs. In fact, we considered most of the same solutions, except that all algorithms must be allowed to produce output. Ultimately, we determined that all algorithms should be allowed to produce as many rows of output as desired, even if different algorithms produce different numbers of output rows for the same input row. Output rows are stitched together based on the order in which they are returned by the algorithms, as shown in Figure 2. This approach simplifies users' code for the common cases in which there is only a single algorithm, each algorithm returns only a single row, or each algorithm returns the same number of rows. However, it restricts the model's expressiveness in cases in which algorithms do not know the order in which other algorithms will return their output rows, but still want to achieve a particular lineup with each other. Although we believe this situation is likely to be rare, it would be simple to address such cases by offering the id approach described above in the context of subalgorithm output. We would maintain the ease of programming offered by the current, simpler model by using the current approach by default and switching to the id approach only when the user code signals that it is necessary. The user could switch modes either on a stage-by-stage basis, or even within a single stage.

2.3 Interleaving Framework Processing and User Processing

A user can control framework execution from within a standard Java program, using our simple stage and session abstractions. This allows the user to interleave framework processing with other processing as necessary. Figure 3 shows how a user could use the output of early stages as the input of later stages, and how the user can interleave processing of user-created code with framework processing.

Note that in the Figure 3 example, the user uses the output of stage 1 as the input for stage 2. The user also inserts a call to the function someProcessing to combine the output of stages 2 and 3 into an input for stage 4.

3. Framework Implementation

3.1 Design

As illustrated in the system diagram in Figure 4, the basic architecture of our framework revolves around creating a set of worker threads, each of which controls a browser instance. The browser instances' web traffic goes through our custom proxy server, which controls the caching we use both for performance and to offer same-page guarantees.

3.2 Web Driver

We have implemented versions of our framework on top of three distinct web automation tools: PhantomJS [16], Ghost.py [10], and Selenium [4]. Each web automation tool offers a different type of browser instance that our framework can control programmatically. The first two are headless web browsers, which do not display the webpages with which they interact. The third, Selenium, manipulates an instance of a standard browser. Substituting different web automation tools for each other in our system corresponds to replacing the boxes labeled "WebDriver" in our system diagram, in Figure 4. The WebDriver's central role in our system is to load pages. Unfortunately, not all web automation tools load pages equally well. This led us to test these tools' suitability for use as WebDrivers in our framework.

We built implementations of our framework on top of all three of these web automation tools. For each tool, we created both a serial and at least one parallel version. Here we describe the approach of each implementation.

3.2.1 PhantomJS

PhantomJS is a headless WebKit that is controlled with JavaScript. It is termed headless because rather than start a real browser to load a page and run its JavaScript, the WebKit simulates the browser. This reduces startup time substantially, and eliminates the need to display content. Recall that JavaScript is single-threaded. Our serial PhantomJS implementation waits until each row in the input table is fully processed before moving on to the next. Our async implementation is still serial, and still runs in the single JavaScript thread, but uses async.js to improve performance by leveraging the asynchronous nature of the page retrieval and processing task. Our parallel implementation wraps the asynchronous implementation in another layer. The higher-level layer splits the input table rows between a fixed number of workers, and runs an asynchronous PhantomJS implementation for each slice of the rows.

3.2.2 Ghost.py

Ghost.py is a headless WebKit that is controlled with Python. Our sequential Ghost.py version uses a single worker. Our load balancing implementation uses a shared task queue, and a fixed number of workers. To prevent errors from Ghost's page closing approach, we must create a new Ghost instance for each task.

3.2.3 Selenium

Selenium provides APIs for automating web interactions from within several different programming languages. Our Selenium-based implementations use the Java API. Selenium's APIs can also be used to control several different real browsers. Our Selenium-based implementations use Firefox instances. Our first implementation is the standard serial version. Our second splits the input table rows across a fixed number of workers. Our third uses a shared task queue and a fixed number of workers.
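The split and load-balance strategies used across these implementations differ only in how rows reach workers. A minimal sketch, with async functions standing in for real worker threads or processes (all names illustrative):

```javascript
// "Split": statically partition the input rows across a fixed number of
// workers before any work begins.
function splitRows(rows, numWorkers) {
  const slices = Array.from({ length: numWorkers }, () => []);
  rows.forEach((row, i) => slices[i % numWorkers].push(row));
  return slices;
}

// "Load balance": workers repeatedly pull the next row from a shared task
// queue, so one slow row delays only the worker processing it.
async function runLoadBalanced(rows, numWorkers, processRow) {
  const queue = [...rows];
  const results = [];
  async function worker() {
    while (queue.length > 0) {
      const row = queue.shift(); // pull from the shared queue
      results.push(await processRow(row));
    }
  }
  await Promise.all(Array.from({ length: numWorkers }, worker));
  return results;
}
```

The tradeoff matches the measurements below: static splitting is simpler but lets a slice of slow pages idle its worker's peers, while the shared queue keeps all workers busy until the table is exhausted.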

framework.startSession();
framework.startStage(js1, inputCsv1, outputCsv1);
// Below, the user strings together stages, using the results of stage 1 as the input for stage 2.
framework.startStage(js2, outputCsv1, outputCsv2);
framework.startStage(js3, inputCsv3, outputCsv3);
// Below, the user does some processing to combine results of stages 2 and 3 into input for stage 4.
inputCsv4 = someProcessing(outputCsv2, outputCsv3);
framework.startStage(js4, inputCsv4, outputCsv4);
framework.endSession();

Figure 3: Interleaving user code and framework processing.

3.2.4 Performance Evaluation

To test our implementations, we ran a simple title extraction benchmark over the homepages of the first 500 sites on the Alexa top sites list [5]. This experiment was run on a Core i7-2600 @3.40GHz 8-core machine. We tested the following implementations:

1. Selenium-sequential (s-seq)
2. Selenium-split (s-split)
3. Selenium-loadbalance (s-bal)
4. Ghost-sequential (g-seq)
5. Ghost-loadbalance (g-bal)
6. PhantomJS-sequential (p-seq)
7. PhantomJS-split (p-split)

The non-sequential implementations each used 8 threads (or processes).

Figure 5: Median execution time, across implementations, on a 500-row stage.

Figure 5 displays the median execution time in minutes. As expected, Selenium-loadbalance finishes before Selenium-split, and Selenium-split finishes before Selenium-sequential. However, we only obtained 4.2x speedup from load balancing across 8 threads. We believe this limited speedup stems from the interference of multiple Selenium instances or Selenium-launched Firefox instances with each other. While network bandwidth may play a role, we find this unlikely given the substantial speedup we observe for PhantomJS. Our Ghost.py implementation did not benefit from parallelization. Our PhantomJS implementation did benefit greatly, achieving a 31x speedup from parallelization. This speedup reflects the benefits of the additional workers, but also the benefits of the async implementation — recall that each worker runs not an instance of the serial PhantomJS implementation, but of the faster async PhantomJS implementation. PhantomJS implementations sometimes hung indefinitely, so we could not include those data points in our evaluation.

3.2.5 Reliability

Running the simple title benchmark over the first 500 sites in the Alexa top sites list [5], we measured three kinds of undesirable outcomes: server unreachable errors (Figure 6(a)), timeouts (Figure 6(b)), and wrong outputs (Figure 6(c)). While server unreachability is outside of the web automation tools' control, the tools are sometimes the cause of timeouts, and always the cause of wrong outputs.

Our results revealed that PhantomJS and Ghost.py did not reliably produce the correct (human-identified) outputs. While PhantomJS provided the correct titles for most pages, it handled redirects poorly, often giving the title associated with a redirect, rather than the one associated with the final destination. Further, although this issue did not appear for the simple JavaScript code that retrieves a document title, we also found that for some JavaScript tests, running the code in PhantomJS failed to produce the same effects that it produced in Selenium, Ghost.py, and normal browsers. Since our framework targets arbitrary JavaScript code, this was unacceptable.

Ghost.py handles redirects correctly, but it times out on a large number of pages, and does not support non-English output. Further, although Ghost.py provides an API for accessing new pages loaded by interacting with a page, this API does not apply when the loading interaction is completed by a JavaScript program. This problem appears to be a known issue that the Ghost.py developers have not yet addressed. Because our framework targets arbitrary JavaScript code, including DOM-interacting code, and because it targets all pages, including non-English pages, this was unacceptable.

3.2.6 Web Driver Selection

Because PhantomJS and Ghost.py failed to provide the reliability and robustness so crucial to our goals, we chose to build our system on top of Selenium. Selenium's times, although slow in the sequential version, became competitive with a load balanced implementation. Even more importantly, it produced no incorrect outputs. Because our framework is intended to facilitate web research, it is essential that the results be trustworthy. Ultimately, we preferred seeing more rows without answers to seeing rows with wrong answers. Rows without answers are an unavoidable component of web experiments, servers typically being outside of the experimenters' control. Thus, we privileged wrong outputs as the deciding factor, and consequently selected Selenium as the provider of our framework's web driver.

3.3 Caching Proxy Server

Our framework enforces the same-page guarantee through the use of a caching proxy server. The framework directs all HTTP request-response traffic through the caching proxy server as illustrated in Figure 4. All pages for a given session are served from a single cache.
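An in-memory sketch of such a session cache, including the size-two redirect-cycle rule discussed later in this section (the real server works per session, on disk, over live HTTP; the class and method names here are illustrative):

```javascript
// Sketch of a same-page session cache. All cache-policy headers are
// ignored: once stored, the same URL always yields the same response.
class SessionCache {
  constructor() {
    this.responses = new Map(); // URL -> cached response
    this.redirects = new Map(); // request URL -> redirect target URL
  }

  // Serve from cache when possible, ignoring HTTP cache-policy headers.
  get(url) {
    return this.responses.get(url);
  }

  store(url, response) {
    if (response.redirectTo) {
      // Would caching url -> redirectTo close an X -> Y -> X cycle?
      if (this.redirects.get(response.redirectTo) === url) {
        // Evict the pre-existing entry that causes the cycle, so requests
        // for that URL go back to the web until a real response arrives.
        this.redirects.delete(response.redirectTo);
        this.responses.delete(response.redirectTo);
      }
      this.redirects.set(url, response.redirectTo);
    }
    this.responses.set(url, response);
  }
}
```

The cycle rule here mirrors the prose description only; exactly which entry the real server evicts, and when, is an assumption of this sketch.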

Despite the preponderance of existing caching proxy servers, none proved sufficiently controllable to meet our needs. Squid cache [18, 20] can be configured to ignore some web cache policy parameters, such as 'no-cache,' 'must-revalidate,' and 'expiration.' However, it cannot be configured to ignore others — for instance, 'Vary.' As an example, unmodified Squid will never cache stackoverflow.com, because of the presence of "Vary: *" in the header. Apache Traffic Server [6] provides even less cache configuration control. Polipo [8] is a much smaller-scale caching proxy server that can be configured to ignore all caching policy parameters, but is very fragile. The server crashes after serving our framework's HTTP traffic for less than 10 minutes.

To achieve the control our framework demands, we implemented our own custom caching proxy server. Our server stores every response into its cache before forwarding the response back to Selenium. It ignores all web cache policy parameters in the HTTP header. The same request (URL) from Selenium always elicits the same response from the server. Note that currently the proxy server does not handle HTTPS requests, although this functionality can be built in. The proxy server stores each session's cache in its own directory, with each response in a separate URL-identified file, relying on the OS to load data from disk. A more efficient store would trivially improve the proxy server's performance. Without this addition, however, good performance can be obtained by grouping requests to the same URL.

Our cache had to handle the challenge of cyclic redirects. Some pages use a cyclic redirect process, redirecting a request for URL X to URL Y (with a 'no-cache' policy), and redirecting a request for Y to X, until eventually the originally requested X is ready, and the X response is no longer a redirect. At this point, the response for X contains the final page content that our cache should associate with URL X.

In this scenario, if the proxy server caches everything, and there is no mechanism to clear the cache, our system will loop forever. The proxy server will always return the redirects. Our cache addresses this issue by maintaining a redirect table, mapping request URLs to their redirect URLs. Upon receiving a redirect response, the cache checks whether adding the redirect response to the table will create a size-two redirect cycle. If it will, the cache removes the pre-existing redirect entry that causes the cycle. This technique is limited to redirect cycles of size two, but since we have not yet found a redirect cycle of size greater than two, we feel the performance benefits of avoiding full cycle detection justify this limitation.

Figure 6: Instances of several types of bad outcomes, across implementations, on a 500-row benchmark. (a) Server unreachability. (b) Timeouts. (c) Wrong outputs.

3.4 User Interaction

As discussed in Section 2.3 above, the user controls our framework from a Java program, creating an instance of the framework, then using the session and stage abstractions. She may interleave this processing with processing of her own. The other inputs are all the files that are passed to the stages: a JavaScript file for each stage, with a function for each subalgorithm; a table for each stage, with the URLs and algorithm inputs.

4. Framework Evaluation

To briefly evaluate the scalability of our complete framework, we ran the title extraction benchmark on the first 10,000 sites in the Alexa top sites list [5]. We ran this benchmark on the same machine used in the Section 3.2.4 experiments. We recorded the execution time for every increment of 100 sites. The results appear in Figure 7. The data reveals that both the completion time and the number of timeouts scale linearly with the size of the input. We conclude that our framework is sufficiently stable, and that its performance meets the needs of the large-scale experiments our framework targets.

5. The Node Addressing Problem

As discussed in Section 1.2, web tools — such as record and replay systems — require a node addressing algorithm that keeps nodes addressable even in the face of pages' changing DOMs. A simple xpath that traces the path from the root to the target node breaks when new wrapper nodes are added, or when new sibling nodes are added before any node along that path. This may occur as the page is redesigned, or simply in response to user actions.

For instance, many pages add nodes that make suggestions based on recent user actions, as with an airline site that suggests users repeat recent flight searches. Node ids and classes change as pages are redesigned, or if a page uses obfuscation, and many nodes lack ids and classes in the first place.

Figure 7: Scalability test on the simple title extraction benchmark. (a) Completion time. (b) Timeouts.

Figure 8: A visualization of the DOM representation algorithm testing task, split into stages for our framework. (Session 1: stage 1, "get all xpaths," takes URLs and produces xpaths; stage 2, "filter non-responsive nodes," produces the filtered xpaths; stage 3, "store node addresses," produces node addresses. Session 2: stage 4, "retrieve nodes," produces the node responses.)

Figure 9 illustrates a few of the characteristics that can be used to develop a node representation, including the xpath from root to node, the node type (a, div, p, and so on), the id, the class, and the inner text. Figure 10 illustrates some of the common changes that are made to DOMs over time, the sorts of changes that can derail node addressing algorithms.

Figure 9: An illustration of some of the DOM node characteristics that can be used to construct a DOM 'address.' (Legend: xpath, nodeType, id, class, text.)

Several solutions have been proposed to the node addressing problem, but they have never been tested in a controlled experiment. We consider the following five plausible node addressing algorithms.

1. xpath: Records the path from the root of the DOM tree to the target node. To identify the corresponding node, follows the same path. If there is no matched node, returns null.
2. id: Selects the first node whose id matches the recorded id, otherwise returns null.
3. class: Selects the first node whose class matches the recorded class, otherwise returns null.
4. iMacros: Collects the list of nodes with the same node type and inner text as the target node. Records the target node's position in this list. To identify the corresponding node, constructs the list from the new page, and selects the item at the target node's original index. If there is no such node, returns null.
5. Ringer: Collects the xpath as described above, the id, the class, and the text. To identify the corresponding node, it uses six strategies: the original xpath, xpath suffixes, common variations on the xpath, the class, the id, and the text. Each strategy votes for up to one node. Returns the most voted-for node, or null if no nodes receive votes.

The first three, xpath, id, and class, each use only a single characteristic to identify the corresponding node. They rely on that characteristic to identify exactly one node.
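The single-characteristic strategies can be sketched over a minimal DOM-like structure ({ tag, id, children }); this is illustrative code, not any of the tools' implementations.

```javascript
// Record: the path of child indices from the root to the target node
// (a simplified stand-in for an xpath).
function recordXpath(root, target) {
  if (root === target) return [];
  for (let i = 0; i < (root.children || []).length; i++) {
    const sub = recordXpath(root.children[i], target);
    if (sub !== null) return [i, ...sub];
  }
  return null;
}

// Identify: follow the same path on a (possibly changed) page; null if broken.
function followXpath(root, path) {
  let node = root;
  for (const i of path) {
    node = (node.children || [])[i];
    if (!node) return null;
  }
  return node;
}

// The id strategy: first node whose id matches, else null.
function findById(root, id) {
  if (root.id === id) return root;
  for (const child of root.children || []) {
    const found = findById(child, id);
    if (found) return found;
  }
  return null;
}
```

A quick run shows the fragility the text describes: wrapping the target in a new div silently redirects the recorded path to the wrapper, while the id strategy still lands on the target (provided the id survives the redesign).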

Figure 10: A visualization of the types of changes that are made to DOMs as they are redesigned or obfuscated.

The second two, iMacros and Ringer, are in use by real web tools. The iMacros algorithm comes from the approach used by the iMacros [1] web scripting tool. The Ringer approach is also in use by a real tool, Ringer [3], which is a web record and replay system. We are particularly interested in the Ringer approach, which is being developed for the Ringer tool by Shaon Barman and one of the authors of this paper.

Figure 11: A visualization of the stage 4 input program, showing the first two node addressing algorithms. For each algorithm, the first subalgorithm uses the node address to identify the corresponding node according to the given node addressing algorithm. It then clicks. The second subalgorithm checks the URL of the resultant page.

5.2 Testing Node Addressing Algorithms

Our framework offers a means of testing node addressing algorithms in a fair setting. For testing in our framework, the experiment is split into four stages, as depicted in Figure 8. The first stage traverses the DOM tree, recording an xpath for each node. The second stage, with an input row for each xpath, clicks on the node at the xpath. Recall that finding a node on one page and then again after a reload is extremely difficult; this problem is in fact the entire motivation for this application. Even reloading after a few seconds, the original xpath may break. It is thus crucial that stages 1 and 2 occur in the same session, seeing the same instance of each page during both stages. Stage 2 also compares the pre-click URL with the post-click URL, producing an output row only if the URL has changed. That is, each output row corresponds to a reactive node.
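The stage 2 reactivity filter can be sketched as follows, with a simulated page and click function standing in for the real browser (all names illustrative):

```javascript
// Sketch of the stage 2 filter: emit an output row only when clicking the
// node at the given xpath changed the URL (URL as a proxy for page state).
function reactivityCheck(page, xpath, clickNode) {
  const before = page.url;
  clickNode(page, xpath); // subalgorithm 1: click the node at the xpath
  const after = page.url;
  return before === after ? [] : [[xpath, after]]; // reactive nodes only
}
```

Rows for non-reactive nodes are simply dropped, which is why the output table may have fewer rows than the input table at this stage.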
If clicking on the node has no effect on the state, it will be impossible to check (in stage 4) whether the stage 4 node corresponds to the stage 2 node. Since, for this experiment, we take the URL as a proxy for the state, the URL must change in order for a node to be considered reactive. Stage 3 takes the xpaths of all reactive nodes as input, runs each node addressing algorithm on each node, and produces the node addresses for each node as output.

The fourth stage takes the node addresses as input. It uses the addresses and the corresponding node addressing algorithms to find the appropriate node on the new page, and then clicks on it. It produces the new post-click URL as output. The new post-click URL can be compared with the stage 2 post-click URL to determine whether each algorithm successfully identified the corresponding node. Figure 11 offers a pictorial representation of the stage 4 algorithm, showing the different node addressing approaches as different algorithms, and the clicking and URL-inspecting functions as different subalgorithms.

Thus, this experiment can be cleanly divided into four stages in our framework. Note that the experiment relies on all five of the crucial characteristics identified in Section 1.1:

1. Parallelize test execution — a thorough test demands running this task on many nodes, in order to reveal a sufficiently large number of naturalistically broken addresses to distinguish between approaches, which makes parallel execution highly desirable
2. Allow DOM interaction — the test must be able to click on the algorithm-identified nodes
3. Run arbitrary JavaScript code — the algorithms cannot be limited to, for instance, returning pass or fail as their outputs
4. Run on live pages from URLs (not only local) — the test should run on the real top Alexa sites, not DOMs built up locally
5. Guarantee same input (page) throughout the experiment — stages 1 through 3 require a same page guarantee

In Table 1, we show content descriptions for the input and output files of each stage.

6. DOM Change Simulator

Testing node addressing algorithms requires a suite of DOMs, each with multiple versions. Alternatively, testing can work over DOMs with naturalistic changes introduced to simulate DOM changes over time. To explore the latter approach, we created a small suite of DOM edits that can be combined to create various DOM workload simulators. We implemented five types of DOM edits:

Wrapper: wrap first-level divs
Wrap every div node that is a direct child of the body node with a center node.
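As a concrete illustration (not the simulator's actual implementation), the wrapper edit can be sketched over an ElementTree stand-in for the DOM:

```python
import xml.etree.ElementTree as ET

def wrapper_edit(root):
    """Wrap every div that is a direct child of body in a center node."""
    body = root.find("body")
    # Snapshot the children, since we mutate body while iterating.
    for i, child in enumerate(list(body)):
        if child.tag == "div":
            center = ET.Element("center")
            body.remove(child)
            center.append(child)
            body.insert(i, center)  # keep the div's original position
    return root

dom = ET.fromstring("<html><body><div><a/></div><span/><div/></body></html>")
wrapper_edit(dom)
print(ET.tostring(dom, encoding="unicode"))
# -> <html><body><center><div><a /></div></center><span /><center><div /></center></body></html>
```

The other four edits follow the same pattern: a single traversal that rewrites tags, inserts siblings, reparents nodes, or mutates text.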

7 2014/1/22

input 1:            url
output 1, input 2:  url, url after redirects, xpath
output 2, input 3:  url, url after redirects, xpath, post-click url
output 3, input 4:  url, url after redirects, post-click url, node address (alg 1), node address (alg 2), ...
output 4:           url, url after redirects, post-click url, post-click url (alg 1), post-click url (alg 2), ...

Table 1: Column descriptions for the input and output files, for each stage of the node addressing testing task.
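The column flow in Table 1 can be mirrored in code: each stage consumes the previous stage's output rows and appends columns. The sketch below is illustrative Python with hypothetical stub stage bodies; only the row structure is meant to mirror the table.

```python
# Illustrative sketch of how columns accumulate across the stages
# (column names follow Table 1; the stage logic itself is stubbed out).

def stage1(url):
    # input 1: url  ->  output 1: url, url after redirects, xpath (one row per node)
    url_after_redirects = url  # stub: a real stage would load the page
    return [{"url": url, "url_after_redirects": url_after_redirects, "xpath": xp}
            for xp in ["/html[1]/body[1]/a[1]", "/html[1]/body[1]/div[1]"]]

def stage2(row):
    # output 2 adds a post-click url column; a row is emitted only for
    # reactive nodes (those whose click changes the URL).
    post = ("http://example.com/next"
            if row["xpath"].endswith("a[1]") else row["url"])  # stub click
    if post == row["url"]:
        return None  # not reactive: clicking did not change the URL
    return dict(row, post_click_url=post)

rows = [stage2(r) for r in stage1("http://example.com")]
reactive_rows = [r for r in rows if r is not None]
```

Stages 3 and 4 would extend the surviving rows in the same way, appending one node-address column and one post-click-url column per algorithm.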

Insert: insert many nodes
For every div node, insert as the div's first child a new node with the same tag as the div's original first child.
Type: span to p
Convert every span node into a p node.
Move: become sibling's child
Move every node whose next sibling is a div node such that it becomes that sibling's first child.
Text: modify node text
Add a letter to the inner text of each node.

Figure 12 gives a pictorial representation of the effects of these edits.

Figure 12: A visualization of our five DOM edits.

7. Node Addressing Evaluation

For the purposes of this work, we consider two correctness conditions. The first, a conservative estimate of the success rate, considers a result correct if the stage 4 post-click URL is the same as the stage 2 post-click URL. This produces some false negatives. Consider clicking the 'Random Article' link on wikipedia.org, or the top story on a news site. Clicking on the correct node will lead to different URLs during different runs. However, whenever this correctness condition is met, the node addressing algorithm has definitely succeeded. Thus, this condition offers a lower bound on robustness.

The second correctness condition may overapproximate success rates, considering a result successful if the stage 4 post-click URL is different from the stage 4 pre-click URL. This approach correctly handles the 'Random Article' and top story cases described above, but may also produce false positives: it allows an algorithm to click on any URL-changing node and still succeed, regardless of whether that node is the corresponding node. However, if an algorithm fails by this criterion, it has definitively failed. If no URL effect is produced, the identified node was not the corresponding node. Thus, this condition offers an upper bound on robustness.

7.1 Robustness Over Time

To evaluate the node addressing algorithms' robustness over time, we compared their immediate performance with their performance after a delay. Specifically, we ran them once on the same day that the node addresses were recorded — day 0 — and once almost a week later, on day 6.

We ran our experiment on the top 15 sites in the Alexa top sites list [5], executing the four stages as described in Section 5, and running stage 4 on both day 0 and day 6. The 15 sites yielded 11,373 nodes, 1,106 of which were reactive.

Figure 13: Node addressing algorithm success rates on day 0 (the day the node addresses were recorded) and day 6. In each grouping, the left bar represents day 0 performance, and the right bar represents day 6 performance. Lower bound is indicated with 'lb' and upper bound with 'ub.'

The algorithms' success rates on the day 0 and day 6 runs appear in Figure 13. The lighter colored bars represent the success rate using the first correctness criterion — that is, the lower bound. The darker colored portions of the bars represent the additional

percentage points gained by using the second correctness criterion, the upper bound on the true success rate.

As Figure 13 makes evident, the id and class approaches are largely unsuccessful. In contrast, xpath, iMacros, and Ringer all prove reasonably robust. Ringer definitively outperforms iMacros on day 0, with the lower bound on Ringer's success rate being higher than the upper bound on iMacros'. While Ringer's success region overlaps with xpath's on day 0, both its upper and lower bounds are higher than xpath's respective upper and lower bounds. The success region for xpath on day 0 does overlap with iMacros', but barely — xpath comes quite close to definitively outperforming iMacros.

All three of the successful approaches see their performance degrade with the passage of time. On day 6, Ringer's upper bound is still greater than all other approaches' upper bounds, and its lower bound is still greater than all other approaches' lower bounds; however, it no longer definitively dominates the iMacros approach, there being some overlap in their success regions. Nevertheless, since the lower bound on Ringer's day 6 performance is in fact higher than the lower bound on iMacros' day 0 performance, we consider Ringer's approach largely successful.

Overall, we conclude that both iMacros and Ringer appear to degrade more slowly over time than xpath. However, since Ringer's original performance solidly dominates iMacros', we conclude that the Ringer approach is the more robust.

7.2 Synthetic Benchmark

We evaluated the node addressing algorithms using our synthetic DOM evolution simulators on 4 sites: amazon.com, wordpress.com, bing.com, and ask.com. Each run consists of a single session with 9 stages. The first 3 stages correspond to stages 1-3 of the node addressing task, described in Section 5. The last 6 stages are variations on stage 4, node retrieval. The node retrieval stage runs with 6 different types of DOM modifications: none (the original baseline), wrapper, insert, type, move, and text.

This benchmark identified 63 reactive nodes out of 773 total DOM nodes. Figure 14 shows the percentage of nodes that each algorithm successfully identifies. The results for id and class are excluded, since they identified none of the nodes correctly. The 100% success rate of the xpath algorithm when no DOM changes are present confirms that our framework provides the same page guarantee within a session. Note that iMacros could not correctly identify all nodes even when no changes had been applied to the DOMs, because node type, text, and position did not provide enough information.

Figure 14: Node addressing algorithm success rates, for each DOM edit type. Lower bound is indicated with 'lb' and upper bound with 'ub.'

As evidenced by Figure 14, xpath performs poorly on wrapper, insert, and move. These edits altered the xpaths to most DOM nodes, substantially degrading xpath's effectiveness. It performed well on type because most nodes were not of type span, and it could correctly identify all nodes when only the text content was changed. Ringer consistently performed better than xpath. On insert and move, it benefited greatly from considering xpath suffixes and common variations on the xpath. However, Ringer performed worse than iMacros on insert and move, since Ringer gives xpaths so much weight. iMacros' success rates were relatively consistent across the different types of DOM changes, with the exception of text. As expected, iMacros could not identify any node correctly when the text content was modified.

8. Related Work

8.1 Frameworks

A large number of tools have been developed for the purpose of running JavaScript tests. Almost all are targeted towards web developers who want to test their own pages, or even only their own JavaScript. Below we cover the main subcategories of this class of tools.

Jasmine [9] is one of the most prominent systems for running JavaScript unit tests. While its ease of use makes it a good platform for small-scale experiments, it lacks many of the characteristics we desire for large-scale, general purpose web research. First, its parallelization mechanism is quite limited. Second, it uses a restrictive programming model, tailored to offer pass/fail responses for each test. Third, it only runs on locally constructed DOMs — it cannot be fed a URL for its tests.

These limitations characterize a large portion of the JavaScript testing space, which very heavily tailors platforms to web developers with limited experimental needs. This category of tool includes projects such as QUnit [17], Mocha [15], and YUI Test [14]. In fact, QUnit, Mocha, and YUI Test do not offer even the limited parallelization that Jasmine provides.

Some tools, like Vows [11], offer parallelization, but are aimed only at testing JavaScript. These typically run on Node, which eliminates any DOM-interactive code from their domains, and naturally any URL-loaded code.

Finally, we consider the three web automation tools on which we built our framework implementations: Ghost.py [10], PhantomJS [16], and Selenium [4]. None of these is explicitly a testing framework. Rather, they are generic web automation tools. Clearly it is possible to build a testing framework on top of any of them, as this paper has shown. However, we found that the amount of infrastructure and insight necessary to do so was far from trivial. Ghost.py and PhantomJS have no built-in parallelization. There is a variation on Selenium, Selenium Grid [19], that offers parallelization. We note, however, that it is tailored for users who want to run the same tests on multiple browsers and operating systems, rather than for users with large-scale experiments. In fact, we found the Selenium Grid approach sufficiently unwieldy for our needs that we chose to implement our own parallelization layer.

Ghost.py, PhantomJS, and Selenium do all offer DOM interaction, the ability to run arbitrary JavaScript code, and the ability to load pages from URLs, all characteristics that made them acceptable candidates for serving as our framework's web driver. However, in the case of Ghost.py and PhantomJS, we found the ability to run arbitrary code was sometimes hindered by the tools' limited robustness.

All three of Ghost.py, PhantomJS, and Selenium — like all the projects described here — lack any support for same page guarantees. Ultimately, most tools, being targeted towards developers, are targeted towards users who know their test pages will stay the same, or know how they will change. This makes them generally unsuitable for broader web research.

8.2 Node Addressing

There are several existing web tools that require robust node addressing algorithms. Therefore, despite the paucity of literature on the node addressing problem, several solutions have been put into practice.

We have already described two such solutions, iMacros and Ringer. The iMacros [1] approach uses node type and node text to identify a list of nodes, and then uses the target node's position in that list as an address. The Ringer [3] approach uses an xpath, several variations on the xpath, the id, the class, and the text in a voting scheme to identify corresponding nodes.

CoScripter, another web tool which offers some record and replay functionality, takes an iMacros-style approach [12]. Where possible, it associates a target node with text, whether it be the text within the node or text that precedes it — as when a textbox appears to the right of a label. When no related text is available — for instance, if the node is a search button that displays a magnifying glass icon rather than the word 'search' — CoScripter falls back on the position approach. ActionShot [13] also uses CoScripter's technique.

Some tools take even more fragile approaches, relying on programmers to adjust the node addresses by hand as appropriate. For instance, Selenium IDE offers a record and replay option that describes nodes by id when available, describes links by their inner text contents, and backs off to an approach that describes nodes by the combination of node type and parent node type.

Other tools take wholly different approaches. For instance, Chickenfoot [7] allows users to access and interact with nodes via high-level commands and pattern matching. Sikuli [21] takes a visual approach, using screenshots and image recognition to identify nodes.

8.3 DOM Workloads

To our knowledge, there have been no previous tools for DOM evolution simulation. To this point, researchers conducting web experiments who need realistically changing DOM workloads appear to have had two main options. First, they could collect their own workloads by pulling DOMs at the desired intervals. If a researcher needed control over the amount of change, this was the only option. Alternatively, researchers could use DOMs pulled down by archiving operations such as the Wayback Machine [2]. However, this option gives researchers no control over the intervals between collection points, and there is no guarantee that a researcher's sites of interest will have been archived.

9. Conclusion

Existing tools for running JavaScript tests in parallel are limited. They offer limited parallelization, limited DOM interaction, limited programming models, often run only on locally constructed DOMs, and never provide same page guarantees. Our framework offers a system with none of these limitations. The node addressing application revealed that our framework can be used to cleanly structure large and complicated web experiments. We found that our new node addressing approach is more effective than a prior state-of-the-art algorithm. Finally, we introduced a DOM evolution simulator for generating DOM workloads with realistically modified structures. We obtained interesting insights into node addressing algorithms through their use on our synthetic workloads. Ultimately, we believe that large-scale web research is an increasingly important area, with exciting problems to solve and many crucial insights to uncover. Increased tool support should accelerate the pace of discovery in this burgeoning field of study.

References

[1] Browser scripting, data extraction and web testing by iMacros. http://www.iopus.com/imacros/.
[2] Internet Archive: Wayback Machine.
[3] sbarman/webscript. https://github.com/sbarman/webscript.
[4] Selenium - web browser automation. http://seleniumhq.org/.
[5] Alexa. Alexa top sites. http://www.alexa.com/topsites, October 2013.
[6] Apache. Apache Traffic Server. http://trafficserver.apache.org/.
[7] Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. Automation and customization of rendered web pages. UIST '05.
[8] Juliusz Chroboczek. Polipo: a caching web proxy. http://www.pps.univ-paris-diderot.fr/~jch/software/polipo/.
[9] Jasmine. Jasmine introduction-1.3.1.js. http://pivotal.github.io/jasmine/, October 2013.
[10] Jeanphix. Ghost.py. http://jeanphix.me/Ghost.py/, December 2013.
[11] Vows JS. Vows: asynchronous BDD for Node. http://vowsjs.org/, December 2013.
[12] Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa Lau. CoScripter: automating & sharing how-to knowledge in the enterprise. CHI '08.
[13] Ian Li, Jeffrey Nichols, Tessa Lau, Clemens Drews, and Allen Cypher. Here's what I did: sharing and reusing web activity with ActionShot. CHI '10.
[14] YUI Library. Test - YUI Library. http://yuilibrary.com/yui/docs/test/.
[15] Mocha. Mocha - the fun, simple, flexible test framework. http://visionmedia.github.io/mocha/.
[16] PhantomJS. http://phantomjs.org/, December 2013.
[17] QUnit. http://qunitjs.com/.
[18] SquidCache. Squid: optimising web delivery. http://www.squid-cache.org/, May 2013.
[19] ThoughtWorks. Selenium Grid. http://seleniumgrid.thoughtworks.com/, October 2013.
[20] Duane Wessels and k claffy. ICP and the Squid web cache. IEEE Journal on Selected Areas in Communications, 16:345–357, 1998.
[21] Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller. Sikuli: using GUI screenshots for search and automation. UIST '09.
