<<

A Limit Study of JavaScript Parallelism

Emily Fortuna Owen Anderson Luis Ceze Susan Eggers Computer Science and Engineering, University of Washington {fortuna, owen, luisceze, eggers}@cs.washington.edu http://sampa.cs.washington.edu

Abstract—JavaScript is ubiquitous on the web. At the same [15], [16], [17]. Several of these studies compared the behavior time, the language’s dynamic behavior makes optimizations chal- between JavaScript on the web to standard JavaScript bench- lenging, leading to poor performance. In this paper we conduct marks such as SunSpider and V8. Researchers unanimously a limit study on the potential parallelism of JavaScript appli- cations, including popular web pages and standard JavaScript concluded that the behavior of JavaScript on the web differs benchmarks. We examine dependency types and looping behavior significantly from that of standard JavaScript benchmarks. to better understand the potential for JavaScript parallelization. This paper addresses the slow computation issue by explor- Our results show that the potential speedup is very encouraging— ing the potential of parallelizing JavaScript applications. In averaging 8.9x and as high as 45.5x. Parallelizing functions addition to the obvious program speedups, parallelization can themselves, rather than just loop bodies proves to be more fruitful in increasing JavaScript execution speed. The results also indicate lead to improvements in power usage, battery life in mobile in our JavaScript engine, most of the dependencies manifest via devices, and web page responsiveness. Moreover, improved virtual registers rather than hash table lookups. performance will allow future web developers to consider JavaScript as a platform for more compute-intensive appli- I.INTRODUCTION cations. In an increasingly online world, JavaScript is around every corner. It’s on your search engine and your new word pro- Contributions and Summary of Findings cessing software [1]. It’s on your smart phone browsing the web and possibly running the applications on your phone [2]. We present the first limit study to our knowledge of par- One company has estimated that 99.6% of sites online today allelism within serial JavaScript applications. We analyze a use JavaScript [3]. Dubbed the “ of the variety of applications, including several from Alexa’s top 500 Internet,” JavaScript is used not only for writing applications, websites [18], compute-intensive JavaScript programs, such as but also as the target language for web applications written in a fluid dynamics simulator, and the V8 benchmarks [19]. In other languages, such as Java [4], [5], [6]. contrast to prior JavaScript behavior studies [15], [16], [17], At the same time, the very features that have made this we take a lower-level, data-dependence-driven approach, with dynamically-typed language so flexible and popular make it an eye toward potential parallelism. challenging for writers to efficiently optimize its As a limit study of parallelism, we focused on two funda- execution. A number of websites offer tips and services to mental limitations of parallelism: data dependences and most both manually and automatically optimize JavaScript code control dependences (via task formation; see Section III-A). for download time and execution performance [7], [8], [9]. We impose JavaScript-specific restrictions, such as the event- Further, many web-based companies such as , go as based nature of execution. We believe this model captures all far as optimizing versions of their websites specifically for the first order effects on parallelism. mobile devices. Our results show that JavaScript exhibits great potential for These JavaScript optimizations are of particular importance parallelization—speedups over sequential execution are 8.9x for online mobile devices, where power is rapidly becoming on average and as high as 45.5x. Unlike high-performance the limiting factor in the performance equation. In order computing applications and other successfully parallelized to reduce power consumption, mobile devices are following programs, however, JavaScript applications currently do not in the footsteps of their desktop counterparts by replacing have significant levels of loop-level parallelism. Instead, better a single, higher-power processor with multiple lower-power parallelization opportunities arise between and within func- cores [10]. At the same time, advances in screen technology tions. As a result, current JavaScript applications will require are reducing power consumption in the display [11], leaving different parallelization strategies. actual computation as a larger and larger proportion of total The vast majority of data dependencies manifest themselves power usage for mobile devices. in the form of virtual registers, rather than hash-table lookups, There have been a number of efforts to increase the speed which is promising for employing static analyses to detect par- of JavaScript and browsers in general [12], [13], [14]. More allelism. Finally, our results demonstrate that parallel behavior recently, researchers have sought to characterize the general in JavaScript benchmarks differs from that of real websites, behavior of JavaScript from the language perspective in order although not to the extent found for other JavaScript behaviors to write better interpreters and Just-In-Time (JIT) measured in [15], [16]. The remainder of this paper explains how we reached these A Day in the Life of a JavaScript Program conclusions. Section II provides some background information For the most part, JavaScript is used on the web in a on the JavaScript language. Section III describes the model highly interactive, event-based manner. A JavaScript event is we used for calculating JavaScript parallelism, special con- any action that is detectable by the JavaScript program [25], siderations in measuring parallelism in JavaScript, and how whether initiated by a user or automatically. For example, we performed our measurements. Next, Section IV details when a user loads a web page, an onLoad event occurs. This our findings with respect to JavaScript behavior and potential can trigger the execution of a programmer-defined function speedup. Section V describes related work and its relationship or a series of functions, such as setting or checking browser to our study, and Section VI concludes. cookies. After these functions have completed, the browser waits for the next event before executing more JavaScript code. p II.BACKGROUND Next, in our example a user might click on an HTML element, causing an onClick event to fire. If the programmer JavaScript is a dynamically typed, prototype-based, object- intends additional text to appear after clicking on the p oriented language. It can be both interpreted and JIT’ed by element, the JavaScript function must modify the Document JavaScript engines embedded in browsers, such as SpiderMon- Object Model (DOM), which contains all of the HTML key in [20], SquirrelFish Extreme1 in [21], and elements in the page. The JavaScript function looks up the V8 in Chrome [22]. The current standard for JavaScript does particular object representing the p element in the DOM not directly support concurrency. and modifies the object’s text via its hash table, causing the browser to display the modified page without requiring an Conceptually, JavaScript objects and their properties are entire page reload. Any local variables needed to complete implemented as hash tables. Getting, setting, or adding a this text insertion are stored in virtual registers. property to an object requires the or JIT to either look up or modify that particular object’s hash table. III.METHODOLOGY 2 The majority of production interpreters convert JavaScript Because of JavaScript’s event-based execution model de- syntax into an intermediate representation of JavaScript op- scribed above, we must make special considerations to quan- codes and use one of two main methods to manipulate the tify the amount of potential speedup in JavaScript web appli- opcodes’ operands and intermediate values: stacks or registers cations. Below we explain some terminology for talking about [23]. Stack-based interpreters contain a data stack (separate JavaScript parallelism, how we set up our model (Section from the call stack) for passing operands between opcodes as III-A), and how we made our measurements from the dynamic they are executed. Register-based interpreters contain an array execution traces (Section III-B). of “register slots” (that we call “virtual registers,” which are higher-level than actual machine registers) that are allocated A. Event-driven Execution Model Considerations as needed to pass around operands and intermediate data. For our study, we parallelize each event individually, as Whether an interpreter is stack-based or register-based does described in Section II by dividing them into tasks that can not affect the way that data dependencies manifest; these be potentially executed concurrently. We define a task as any two methods are simply used for storing the operands for consecutive stream of executed opcodes in an event, delimited each opcode. Interpreters such as SpiderMonkey are stack- either by a function call, a function return, an outermost4 loop based, whereas SquirrelFish Extreme is register-based. Some entrance, an outermost loop back-edge, or an outermost loop research suggests that register-based interpreters can slightly exit. A task consists of at least one JavaScript opcode and has outperform stack-based interpreters [24]. no maximum size. Tasks are the finest granularity at which we Memory in a JavaScript program consists of heap-allocated consider parallelism; we do not parallelize instructions within objects created by the program, as well as a pre-existent Doc- tasks. Since the tasks are essentially control independent, we ument Object Model produced by the browser that represents take into account control dependences indirectly. the web page and the current execution environment. All of Figure 1 illustrates a high-level representation of the tasks these are distinct from local variables and temporaries. In this created from the execution of a simple JavaScript function. study we examine a register-based interpreter, which therefore When executing function f, JavaScript code is converted into exposes dependencies via virtual registers for local variables a dynamic trace of opcodes. Our offline analysis divides these and temporaries, and via hash table lookups when accessing opcodes into eight tasks, appearing in boxes on the right of 3 an object’s hash table. Figure 1. We define the critical path as the longest sequence of 1Also known by its marketing name, Nitro. memory and/or virtual register dependencies between tasks 2The notable exception is V8, which compiles JavaScript directly to that occurs within a given event. The critical path is the assembly code. limiting factor that determines the overall execution time for an 3In this study, register slots that directly reference objects for hash table lookups we classify as hash table lookups, distinct from local variables and event; all other data-dependent sequences can run in parallel temporaries stored in the register slots that we classify as being stored in “virtual registers.” 4Not nested inside any surrounding loop. function f() { set i = 0 // start f var i = 0; call g g(); i++; return 0 // return from g for (var j = 1; j < 3; j++) { increment i i += j; set j = 1 } test loop condition j < 3 return i; } i += j function g() { increment j return 0; test loop condition j < 3 } i += j

increment j test loop condition j < 3

return i // return from f

set i = 0 // start f call g function f() { var i = 0; return 0 // return from g g(); increment i i++; set j = 1 for (var j = 1; j < 3; j++) { test loop condition j < 3 i += j; } i += j return i; increment j } test loop condition j < 3

function g() { i += j return 0; increment j } test loop condition j < 3 return i // return from f

Fig. 1. An example JavaScript function f appears on the left. The right shows a high-level, pseudo-intermediate representation of the execution of f, to illustrate how tasks are delineated. Each box represents one task. The i += j tasks are the loop body tasks.

with the critical path. Since a typical JavaScript opcode in Algorithm 1 Critical Path Calculation for Events SquirrelFish executes in approximately 30 machine cycles, {build dependency graph} tasks rather than individual opcodes are a more realistic gran- for each opcode do ularity for parallelization, given the overheads associated with for each argument in opcode do parallelization. We use the procedure detailed in Algorithm if argument is a write then 1 to calculate critical paths for the events observed in the record this opcode address as the most recent writer dynamic execution traces. else if argument is a read then Potential speedup from parallelization is measured by cal- find last writer to the read location culating the length of the critical path for each event that if read and last writer are in different tasks, but in was observed in a dynamic execution trace, summing them the same event then together, and dividing into the total execution time, that is, mark dependency update longest path through this task, given the T Speedup = (1) new dependency’s longest path Pn−1 i=0 ci end if end if where n is the number of events, c is the length of the critical i end for path for event i in cycles, and T is the total number of cycles update longest path through this task with this processed for the entire execution. This optimistically assumes there is opcode’s cycle count no limit in the number of processors available to execute end for independent JavaScript tasks. We conservatively force events to execute in the order {find critical path for each event} that they were observed in the original execution trace. If for each task in tasks do JavaScript were able to run in parallel, one could imagine a event ← this task’s event few JavaScript events occurring and running simultaneously, if the longest path through this task > the event’s current such as a timer firing at the same time as an onClick event critical path then generated by the user clicking on a link. However, from the in- event’s critical path ← longest path through task terpreter’s perspective, it is difficult to distinguish between true end if user events that must be executed sequentially and automatic end for events that could be parallelized, making it very challenging to know whether such events could logically be executed simultaneously. Therefore, we do not allow perturbations in event execution order. tions executed from the interpreter’s perspective. Examples Additionally, while interpreting some JavaScript functions, of this function type include the JavaScript substring, some built-in functions in the language make calls to the parseInt, eval, and alert functions. We examine the native runtime, and therefore we cannot observe these instruc- potential speedup in two cases: the first conservatively assum- ing that all calls to the runtime depend on the previous calls; Benchmark Classification Details ALEXATOP 100 WEBSITES and the second assuming no dependencies ever occur across bing search engine two textual searches on the runtime. We take these elements into consideration in order bing.com and browsed to make our parallelism measurements as realistic as possible. images cnn news read several stories on cnn.com B. Experimental Setup facebook micro data posted on wall, wrote on In order to measure the potential speedup of JavaScript other’s walls, expanded news feed on the web, we instrumented the April 1, 2010 version of flickr photo sharing visited contact’s pages, the interpreter in SquirrelFish Extreme (WebKit’s JavaScript posted comments engine, used in Safari and on the iPhone). We then used Safari gmail productivity archived emails, wrote and sent email with our instrumented interpreter to record dynamic execution google search engine two textual searches and traces of the JavaScript opcodes executed while interacting looked at images on with the websites in our benchmark suite. google.com googleDocs productivity edited and saved a A growing number of recent studies have concluded that spreadsheet, shared with typical JavaScript benchmark suites, such as SunSpider and others V8, are not representative of JavaScript behavior on the googleMaps visualization found a driving route, switched to public transit web [15], [26]. Therefore, we created our own benchmark route, zoomed in to map suite, partly comprised of full interactions with a subset of googleReader micro data read several items, mark the top 100 websites from the Alexa Top 500 Websites [18]. as unread, share item googleWave compute intensive create new wave, add We supplemented our benchmark suite with a few other web people, type to them, applications that push the limits of JavaScript engines today, reply to wave and the V8 version 2 benchmarks for reference. With the nyTimes news read several stories on nyTimes.com exception of the V8 benchmarks that simply run without user twitter micro data posted tweet, clicked input, the length of logged interactions ranged from five to around home page for twenty minutes. Table I explains the details of our benchmark @replies and lists yahoo search perform several searches collection. and look at images Traces were generated on a 2.4 GHz Intel Core 2 Duo youtube video watch several videos, running Mac OS 10.6.3 with 4 GB of RAM. Additionally, we expand comments COMPUTEINTENSIVESUPPLEMENTALBENCHMARKS used RDTSC instructions to access the Time Stamp Counter ballPool data visualization chromeexperiments.com/ to count the number of cycles each JavaScript opcode took detail/ball-pool/ to execute. However, the cycle counts that we measured fluidSim compute intensive fluid dynamics simulator in JavaScript at had several confounding factors, resulting in a variance of nerget.com/fluidSim/ sometimes several orders of magnitude: the JavaScript inter- pacman compute intensive NES emulator written in preter was run in a browser, influencing caching behavior, JavaScript playing Pacman at benfirshman.com/ along with network latency and congestion effects out of our projects/jsnes/ control [27]. Therefore, out of the approximately 400M total V8 BENCHMARKSVERSION 2 dynamic JavaScript opcodes observed, for each type of opcode -crypto V8 Encryption and decryption v8-deltablue V8 One-way constraint solver we selected the minimum cycle count observed and placed v8-earley-boyer V8 Classic Scheme these values in a lookup table we used to calculate the critical benchmarks, translated to paths for events. Additionally, if making repeated XML HTTP JavaScript by Scheme2Js compiler requests is the primary action of a JavaScript program, the v8-raytrace V8 Ray tracer perceived speedup may be less than our measurements because v8-richards V8 OS kernel simulation network latency may come to dominate the time that a user is TABLE I waiting. PROVIDES THE DETAILS AND A GENERAL CLASSIFICATION FOR THE BENCHMARKSUSEDINTHISSTUDY. IV. RESULTS In this section, we first evaluate the overall degree of potential speedup. Then we explore other facets of the par- allelization problem, including which types of tasks are most bodies as task delimiters. As previously mentioned, runtime useful for maximizing speedup and what types of dependences calls are a set of built-in functions implemented natively manifest in typical JavaScript programs. rather than in JavaScript. To reiterate, we conservatively force events to execute in the same order that they were observed A. Total Parallelism in the original execution trace, and we use Equation (1) to The graph in Figure 2 shows the potential speedup, con- calculate the potential speedup. The potential speedups range servatively assuming dependencies across calls to the native from 2.19x and 2.31x for the v8-crypto and google runtime, with both function call boundaries and outermost loop benchmarks respectively, and up to 45.46x on googleWave, iha vrg vraltebnhak f89x fw ex- we If 8.91x. ( of outliers benchmarks maximum and the minimum all the clude over average an with loop and function and dependencies, runtime with speedup delimiters. Potential task 2. Fig. on h oei iue1a n itntlo sequence loop distinct one as we 1 example, Figure an in As code benchmarks. V8 the behavior web-site the looping count “typical” same excluding the the benchmark, exhibit as not each do in which benchmarks, sequence loop each programs. JavaScript write compute-intensive to willing and are complex developers that more perfor- point JavaScript the should to benefits future, improve be mance the may in there parallelization that loop suggest for does This the suite. in benchmarks benchmark speedup scientific-computing-like most in two for tasks, the leaders candidates were delimited two good loop be only The to with parallelization. seem loop yet not automatic do that applications way web a in rewritten be parallelization. could enable example, would index such possible bodies, array is loop the it unparallelizable as currently iterations, these loop of some parallel dependency that JavaScript for If a support task. cause function explicit do parent had its body task. and loop task loop loop the the in between the lookup index to in array variable an induction variable into loop the induction use break that loop related we opcodes opcodes However, the When include not method III-A. modifying do we the Section to boundaries, in using loop at discussed tasks tasks and up into 1 We trace Figure tasks. execution in delimited the up loop divided outermost and function-delimited Loops and Granularity—Functions Task B. 7.42x. v8-crypto h D nFgr lutae h trto onsof counts iteration the illustrates 5 Figure in CDF The JavaScript today’s in alone, loops that indicate graphs The for measurements speedup the illustrate 4 and 3 Figures

Speedup 10x 20x 30x 40x 50x 0x

epciey,teajse aihei)ma is (arithmetic) adjusted the respectively), , bing cnn facebook flickr gmail google googleDocs googleMaps googleReader googleWave

fluidSim nyTimes twitter yahoo youtube ballPool fluidSim googleWave

and , pacman v8-crypto v8-deltablue

ballPool v8-earley-boyer v8-raytrace v8-richards mean and oesr htorlvlo rnlrt sraitcfrmaking for realistic is granularity of level our that ensure to other. loop each such from other average, independent Although on often not iterations. bodies were loop loop bodies smaller of had proportion benchmarks larger a than that suggests largest This and very benchmarks. the the than among all were in size sizes Instead, observed (task) benchmarks. program body loop-parallelizable loop its average less the to we the of relative fact, number executed a a In loops had actually benchmarks. of simulator number dynamics other fluid the to that found compared size program the in ence times. 4k and loop also this sequences then in but loop bimodal; iterations, of few number is a large graph only a execute the loops in most increase benchmark number slope large loops. dramatic a executing the benchmarks: the short of The of most speedup. of typical overall and behavior tasks, largest delimited the loop had only with because speedup benchmarks the two in these iterations of numbers different fluidSim for execute loops which loops few a times. the graph; the 1.4M in in loop tail hot—one long very very were a see find We can sequence. we again,However, loop separate called a were as that f loop the function count If would we iterations. two for executing runtime with tasks, delimited function only for speedup potential dependencies. The 3. Fig. akSize. Task tasks, delimited loop-only for outliers two The with frequency the show 7 and 6 Figures in graphs The ballPool 52% ballPool Speedup 10x 20x 30x 40x 50x 0x flosi h ut trtdol n rtotimes. two or one only iterated suite the in loops of fraction and diinly eeaie h vrg aksize task average the examined we Additionally, bing cnn i o rv ohv napeibediffer- appreciable an have to prove not did ,

googleWave facebook aemr og needn op,rather loops, independent long, more have flickr gmail floseeue eaiet h entire the to relative executed loops of google googleDocs googleMaps

fluidSim googleReader googleWave pacman

ecmrs esnldout singled We benchmarks. nyTimes fluidSim twitter yahoo youtube googleWave ballPool ecmr executed benchmark suuulbecause unusual is fluidSim pacman

a h largest the had v8-crypto googleWave v8-deltablue fluidSim fluidSim v8-earley-boyer v8-raytrace v8-richards smaller

shows mean ii td aueo hswr,admroe,tsscan tasks moreover, the to implementation and overheads. to an parallelization due work, amortize during granularity better appropriate this where this of combined chose be nature we quite finer study still — even limit is an granularity cycles at fine 650 parallelism or a examine opcodes, 17 to However, reason granularity. much little see with we outliers as few such a fluidSim sizes, with task cycles, larger 650 opcode SquirrelFish 17 or around average is instructions, the benchmarks most 8, in Figure size in task shown As worthwhile. parallelization or loop once distinct only each iterate that sequences V8 loop iterations the suite’s for of the except number benchmarks of twice. the the half all About showing from benchmarks. data CDF incorporating took, instance 5. Fig. runtime with tasks, delimited loop only for speedup potential The dependencies. 4. Fig.

Fraction of Distinct Loop Instances Speedup 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0x 1x 2x 3x 0 1

7 poe) eas akszswr hsshort, this were sizes task Because opcodes). (73 bing cnn facebook 10 flickr gmail Number ofIterationsExecuted google

100 googleDocs googleMaps

v8-crypto googleReader googleWave 1000 nyTimes yahoo youtube 10000 ballPool twitter

17ocds and opcodes) (137 fluidSim

100000 pacman v8-crypto v8-deltablue

1000000 v8-earley-boyer v8-raytrace v8-richards mean ubro opisacswt ihnme fieain 4 iterations). (4k iterations large of a number indicating high graph, the a of with given side instances a right loop for the of in executing slope number instances in loop the increase of large in frequency second iterations the showing of CDF number 6. Fig. ayo h 8bnhak,i otat a uhhigher much a had dependencies. contrast, memory in of benchmarks, incidence V8 the poten- of and parallelization. non-speculative Many analysis even and static automatic for some tially promise register- provides were benchmarks) which V8 based, the to opposed (as registers. benchmarks virtual in least variables local At from arise majority runtime the dependencies that about of indicates 9 assumptions Figure JavaScript. our in and dependencies lengths, dependency age Dependences C. a this for of shape executing benchmarks. The JavaScript instances most times. loop of 464 representative executes of very instance is frequency loop CDF the hottest the showing the throughout benchmark, iterations CDF of number given 7. Fig. eo,w ute xlr h ye fdpnece,aver- dependencies, of types the explore further we Below,

Fraction of Distinct Loop Instances Fraction of Distinct Loop Instances 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0 0 84% 1 0 ftedpnece ftewbpage-based web the of dependencies the of 500 1000 Number ofIterationsExecuted Number ofIterationsExecuted fluidSim 1500 10 2000 ecmr.Ti ecmr a a has benchmark This benchmark. googleWave 2500 3000 100 3500 ecmr.I this In benchmark. 4000 n eprr aibe,weessoigdt noobjects into data storing whereas variables, local house temporary virtual registers and virtual for because The unsurprising lookups). opcodes table is virtual hash difference for 50k for 160k general, lengths versus of the dependencies in register (average than that dependences shorter aver- are see memory by the lengths distances We dependency shows dependency hash count. register register 10 or opcode virtual Figure register and JavaScript read. table virtual subsequent hash a age its to and write table a between instructions) register to register virtual are that dependencies all of Percentage dependencies 9. Fig. count, opcode SquirrelFish by benchmarks of size scale. task log average a The on 8. Fig. eednydsac eest h ubro yls(or cycles of number the to refers distance Dependency 100% 60% 70% 80% 90% Instructions 1000 100 Virtual RegisterDependencies 10 1 between bing cnn bing facebook cnn flickr facebook ak esshs-al okp ewe tasks. between lookups hash-table versus tasks gmail flickr google gmail googleDocs google googleMaps googleDocs googleReader googleMaps googleWave googleReader nyTimes googleWave twitter nyTimes yahoo twitter youtube

Memory Dependencies yahoo ballPool youtube fluidSim ballPool pacman fluidSim v8-crypto v8-deltablue pacman v8-earley-boyer v8-crypto v8-raytrace v8-deltablue v8-richards v8-earley-boyer mean v8-raytrace v8-richards mean n hrfr antb aallzd hsdt niae that indicates data This parallelized. googleWave be cannot execution therefore by we that events and If find higher. weighting or we without 90x CDF time, of and similar speedup higher, potential a or a create 45x have of events speedup all a of have that time see execution of We by 12. all frequency Figure for in the length, of illustrating levels Cumulative a (CDF) speedup graphed Function first We Distribution parallelizable. so program the googleWave googleWave Study: Case D. without assumptions may conservative system much. use runtime parallelism the can hurting de- of we state expensive; tracking internal be because the encouraging conservative via also more pendencies their is from This larger num- counterparts. significantly these not dependencies, are runtime bers of assumption the without to 2.31 from for slightly 2.40 that increased shows speedups 11 potential the Figure although measures. parallelism our affect strongly dependencies “fake” the memory amortize checks. easily the dependence more of because can cost we Additionally, large, exploit. is distance to dependence slack since parallelization, is may automatic there an distances hash for dependency and flexibility Large more registers high. provide aver- fairly virtual the is both manifest lookups that for table function indicates distance to graph dependency The function age dependences. from memory referenced as be may that scale. log distances a dependency on register count, and opcode memory SquirrelFish Average by 10. Fig.

5 Bytecode Instructions ial,w npce h aiu peu outlier, speedup maximum the inspected we Finally, adding of assumption conservative more our that find We 100000 ic ecudnttakdpnece aldi aiecode. native in called dependencies track not could we Since 1e+06 1e+07 10000 1000 100 10 1

google bing cnn

facebook Virtual RegisterDependencies nmr ealt etrudrtn htmakes what understand better to detail more in , a ag ubro eysoteet that events short very of number large a has flickr

n rm4.6t 58 for 45.89 to 45.46 from and gmail 46%

5 google

noadoto utm al i not did calls runtime of out and into googleDocs

faleet aeasedpo 1x of speedup a have events all of googleMaps

events googleReader googleWave nyTimes egtdb eilevent serial by weighted , 75% twitter Memory Dependencies yahoo

faleet weighted events all of youtube ballPool fluidSim pacman googleWave v8-crypto

between v8-deltablue v8-earley-boyer v8-raytrace 50%

tasks v8-richards mean oplsJv nootmzdJvSrp.I spsil htthe tool that possible this is introduction, Web It JavaScript. the optimized Google in into Java the mentioned compiles using As written (GWT). Toolkit investigated we benchmark poe 10cce)o oe n thsavr long very a as intensive has web JavaScript most producing it as Today, from reasons applications opcodes. and performance 10k by limited more, contains are developers task or largest cycles) the tail; (180 that find opcodes we 6 13 Figure In program. parallelizable. very are number that significant events a running but long parallelization, of from benefit not will amount 357x. given and a 90x have that that events the of events. fraction in the speedup illustrating CDF of 12. Fig. delimiters task speedup loop Potential and 11. Fig. tsol lob oe that noted be also should It entire the over tasks of size the of CDF a created then We

50% Fraction of all Events Speedup 10x 20x 30x 40x 50x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0x 0 faleet,wihe yeetlnt,hv peu between speedup a have length, event by weighted events, all of 0 bing

googleWave cnn

50 facebook gmail google 100 without googleDocs googleMaps googleReader ecmr,wihe ytelnt fthe of length the by weighted benchmark,

150 googleWave utm eednis u ihfunction with but dependencies, runtime

Speedup nyTimes twitter

googleWave yahoo 200

52% youtube ballPool googleWave flickr 250 ftetsscontain tasks the of fluidSim pacman v8-crypto

300 v8-deltablue

steonly the is v8-earley-boyer v8-raytrace . 350 v8-richards mean ecmrsb ntuetn nentEpoe n char- and SunSpider 8 and Explorer V8 Internet the instrumenting to compared by web benchmarks the on JavaScript of select to for op_get_by_val value decision were cycle our websites) observed minimum supporting for the short, lengths and bench- chain similar the maximum relatively across also chains the (and prototype that marks object indicates provide analysis of ultimately Their lengths JavaScript. to average for order system on in type focused modification a analysis higher-level their This and set [17]. objects small pages their a web of and real length benchmarks of the SunSpider the and in modified, chains are prototype data and properties their higher-level a researchers from Several JavaScript perspective. of topic. language behavior explored the widely explored have more a become parallelization. could from that gained performance would be more applications and such multi-threading web, from benefit new the a on provide ’s To experience OS). Docs, desktop-like Chrome Google and applications, as with Office web-based (such the applications applications of in based number that to opportunities greater believe parallelism googleWave a similar we and see characteristics JavaScript, similar will of we uses future current other automatically JavaScript. GWT or parallelizable (which lookup), JavaScript more calls hashtable in generates function a than short use Java) instead more in might to methods itself accessor as lends (such Java of style coding The more. or forty opcodes over opcodes. 10 find 10096 We are contains Wave. Wave benchmark Google Google in this in opcodes) in observed (in task size tasks largest task all of of CDF percent 13. Fig. aaaoahne l eyetnieymaue h usage the measured extensively very al. et Ratanaworabhan frequently how objects, JavaScript examined al. et Lebresne recently only has web the on behavior its and JavaScript Although

Fraction of all Tasks 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0 1 googleWave hr a enarcn rn oadbrowser- toward trend recent a been has There . . .R V. Task Size(opcodes) ELATED 10 ssmtigo notiramongst outlier an of something is W ORK op_put_by_val 100 10000 and acterizing the behavior of functions, heap-allocated objects we saw in the benchmarks in our study, half of all loops and data, and events and handlers [15]. They conclude that executed in JavaScript programs only execute at most for two the behavior of the “standard” benchmarks is significantly iterations. Ha et al. alleviated the performance problem in different from actual web pages. Their web page traces were [14] by building another trace-based Just-In-Time compiler generally based on short web page interactions (e.g., sign in, that does its hot trace compilation work in a separate thread do one small task, sign out) in comparison to our study’s so that code execution need not halt to compile a code trace longer running logs. Their results corroborate ours in that in [13]. Martinsen and Grahn have investigated implementing web pages there are a small number of longer-running “hot” thread level speculation for scripting languages, particularly, functions (that may be worthy of parallelism). JavaScript [12]. Their algorithm achieves near linear speedup Richards et al. also performed a more extensive analysis for their for-loop intensive test programs. However, as we of the dynamic behavior of JavaScript usage on the web and found in our study, JavaScript as currently used on the web is compared it to that of standard industry JavaScript benchmarks not written in such a way that its iterations are independent. as well as the conventional wisdom of how JavaScript is used [26]. Whereas Ratanaworabhan et al. used short web VI.CONCLUSION interactions, the Richards et al. study recorded longer running Our study is the first exploration of the limits of parallelism traces when other users used the instrumented browser as their within JavaScript applications. We used a collection of traces default browser. The study investigated, among many things, from popular websites and the V8 benchmarks to observe the the frequency of eval, the frequency that properties were data and user-limited dependences in JavaScript. In summary, added to an object after initialization, the variance of the we found: prototype hierarchy, and the frequency of variadic functions. a) JavaScript programs on the web hold promise for They concluded that the behavior on the web and usage of parallelization: Even without allowing parallelization between JavaScript language constructs differed significantly from the JavaScript events, the speedups ranged from 2.19x up to prevailing opinion as well as that of the standard benchmarks. 45.46x, with an average of 8.91x. These numbers are great While our study analyzes the dynamic behavior of JavaScript news for automatic parallelization efforts. on the web and benchmarks arising from low level data b) Programs are best parallelized within or between dependencies in JavaScript opcodes in order to characterize functions, not between loop iterations: Most JavaScript loops the potential speedup through parallelization, Richards et al. on the web today do not iterate frequently enough, and focuses the analysis primarily at the higher level syntactic layer iterations are generally not independent enough to be par- of JavaScript, with a nod to applications of their results to type allelized. Instead, parallelization opportunities arise between systems. tasks formed based on functions. These tasks are large enough Meyerovich and Bod´ık find that at least 40% of Safari’s to amortize reasonable overheads. time is spent in page layout tasks [16]. They also break down c) The vast majority of data dependences come from the length of time for several websites spent on various tasks virtual registers, and the length of these dependences are during page loads, of which JavaScript is a relatively small shorter than hash-table lookup dependences: This finding component (15%-20% of page load time). Given these results, makes static analysis conceivable and argues that disambigua- they detail several algorithms to enable fast web page layout tion for parallelization can likely be made low cost. in parallel. However, after page load time, the proportion These results suggest that JavaScript holds significant of time spent executing JavaScript and making XML HTTP promise for parallelization, reducing execution time and im- requests increases. Indeed, if Meyerovich and Bod´ık succeed proving responsiveness. Parallelizing JavaScript would enable in dramatically reducing the amount of time spent in page us to make better use of lower-power multi-core processors loads, then the time spent after page loads, and therefore in on mobile devices, improving power usage and battery life. JavaScript would increasingly dominate the execution time. Additionally, faster JavaScript execution may enable future Mickens et al. developed the Crom framework in JavaScript programmers to write more compute intensive, richer appli- that allows developers to hide latency on web pages by cations. speculatively executing and fetching content that will then be readily available should it be needed [28]. Developers mark ACKNOWLEDGMENTS specific JavaScript event handlers as speculable, and then the The authors would like to thank Jeff Kilpatrick, Benjamin system fetches data speculatively while waiting on the user, Lerner, Brian Burg, members of the UW sampa group, and the reducing the perceived latency of the page. #squirrelfish IRC channel for their helpful discussions. Meanwhile, some work has been done on the compiler end to improve the execution of JavaScript. Gal et al. im- REFERENCES plemented TraceMonkey, to identify hot traces for the Just- [1] A. Vance, “Revamped Microsoft Office Will Be Free on the In-Time (JIT) compiler, as opposed to the standard method- Web,” New York Times, May 11, 2010. [Online]. Available: based JIT [14]. However, they found that a trace must be http://www.nytimes.com/2010/05/12/technology/12soft.html [2] M. Allen. (2009) Overview of webOS. Palm, Inc. [Online]. executed at least 270 times before they break even from the Available: http://developer.palm.com/index.php?option=com\ content\ additional JIT’ing of traces that they do during execution. As &view=article\&id=1761\&Itemid=42 [3] (2010, Jun) JavaScript Usage Statistics. BuiltWith. [Online]. Available: Dec 2009. [Online]. Available: http://research.microsoft.com/apps/pubs/ http://trends.builtwith.com/javascript default.aspx?id=115687 [4] (2010) . Google. [Online]. Available: http: [16] L. Meyerovich and R. Bodik, “Fast and Parallel Webpage Layout,” in //code.google.com/webtoolkit/ WWW ’10: Proceedings of the Conference on the World Wide Web, Apr [5] F. Loitsch and M. Serrano, “Hop Client-Side Compilation,” in Sympo- 2010. sium on Trends on Functional Languages, 2007. [17] S. Lebresne, G. Richards, J. Ostlund,¨ T. Wrigstad, and J. Vitek, “Un- [6] (2010, June) wave-protocol. [Online]. Available: http://code.google. derstanding the Dynamics of JavaScript,” in STOP ’09: Proceedings for com/p/wave-protocol/ the Workshop on Script to Program Evolution, Jul 2009. [7] (2009) w3compiler. Port80 Software. [Online]. Available: http: [18] (2010, May) Top Sites: The Top 500 Sites on the Web. [Online]. //www.w3compiler.com/ Available: http://www.alexa.com/topsites [8] G. Baker and E. Arvidsson. (2010) Let’s Make the Web Faster. [Online]. [19] (2010, June) V8 Benchmark Suite - version 2. [Online]. Available: Available: http://code.google.com/speed/articles/optimizing-javascript. http://v8.googlecode.com/svn/data/benchmarks/v2/run.html html [20] (2010, June) What Is SpiderMonkey? [Online]. Available: http: [9] D. Crockford. (2003, Dec 4,) The JavaScript Minifier. [Online]. //www..org/js/spidermonkey/ Available: http://www.crockford.com/javascript/jsmin.html [21] M. Stachowiak. (2008, September) Introducing Squir- [10] (2010, June) Qualcomm Ships First Dual-CPU Snapdragon Chipset. relFish Extreme. [Online]. Available: http://webkit.org/blog/214/ [Online]. Available: http://www.qualcomm.com/news/releases/2010/06/ introducing-squirrelfish-extreme/ 01/qualcomm-ships-first-dual-cpu-snapdragon-chipset [22] (2010, June) V8 JavaScript Engine. [Online]. Available: http: [11] (2010) Mirasol Display Technology. [Online]. //code.google.com/p/v8/ Available: http://www.qualcomm.com/products\ services/consumer\ [23] D. Mandelin. (2008, June) SquirrelFish. [Online]. Available: http: electronics/displays/mirasol/index.html //blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/ [12] J. Martinsen and H. Grahn, “Thread-Level Speculation for Web Appli- [24] Y. Shi, K. Casey, M. A. Ertl, and D. Gregg, “ Show- cations,” in Second Swedish Workshop on Multi-Core Computing, Nov down: Stack Versus Registers,” Transactions on Architecture and Code 2009. Optimization, vol. 4, no. 4, pp. 21:1–21:36, January 2008. [13] J. Ha, M. Haghighat, S. Cong, and K. McKinley, “A Concurrent [25] (2010, June) JavaScript Events. [Online]. Available: http://www. Trace-based Just-in-Time Compiler for JavaScript,” in PESPMA ’09: w3schools.com/js/js\ events.asp Workshop on Parallel Execution of Sequential Programs on Multicore [26] G. Richards, G. Lebresne, B. Burg, and J. Vitek, “An Analysis of the Architectures, Jun 2009. Dynamic Behavior of JavaScript Programs,” in PLDI ’10: Proceedings of [14] A. Gal, B. Eich, M. Shaver, D. Anderson, D. Mandelin, M. Haghighat, the Conference on Programming Language Design and Implementation, B. Kaplan, G. Hoare, B. Zbarsky, J. Orendorff, J. Ruderman, E. Smith, Jun 2010. R. Reitmaier, M. Bebenita, M. Chang, and M. Franz, “Trace-based Just- [27] Microsoft, “Measuring Browser Performance: Understanding Issues in-Time Type Specialization for Dynamic Languages,” in PLDI ’09: in Benchmarking and Performance Analysis,” pp. 1–14, 2009. Proceedings of the Conference on Programming Language Design and [Online]. Available: http://www.microsoft.com/downloads/details.aspx? Implementation, Jun 2009. displaylang=en\&FamilyID=cd8932f3-b4be-4e0e-a73b-4a373d85146d [15] P. Ratanaworabhan, B. Livshits, D. Simmons, and B. Zorn, [28] J. Mickens, J. Elson, J. Howell, and J. Lorch, “Crom: Faster Web “JSMeter: Characterizing Real-World Behavior of JavaScript Programs,” Browsing Using Speculative Execution,” in NSDI ’10: Proceedings of Microsoft Research, Microsoft Research Technical Report 2009-173, Symposium on Networked Systems Design and Implementation, Apr 2010.