Poisoning the Event-Driven Architecture: A Full-Stack Look at Node.js and Friends

Anonymous Author(s)

Abstract induced by each additional artificially limits the number of In recent years, both industry and the open-source community have clients each server can handle, requiring the service provider to op- turned to the Event-Driven Architecture (EDA) to provide more erate more servers than is strictly necessary. The EDA significantly scalable web services. Though the Event-Driven Architecture can reduces the overhead paid for each additional client, multiplying offer better scalability than the One Thread Per Client Architecture, the number of clients each server can handle. its use can leave service providers vulnerable to a Denial of Service Figure 1 gives some intuition about how the EDA achieves its attack that we call Event Handler Poisoning (EHP). low per-client overhead. Rather than creating a new thread for each In this work we define EHP attacks and examine the ways in request, which introduces overheads in time (context-switching) which EDA frameworks expose applications to this risk. We also and space (per-thread stack), the EDA simply stores the request as touch on application-level EHP vulnerabilities. We focus on three an Event in a queue to be handled by a fixed set of Event Handlers. In popular EDA frameworks: Node.js (JavaScript), Twisted (Python), other words, the EDA achieves scalability using thread multiplexing. and libuv (C/C++). By multiplexing these unrelated computations onto the same Event Our analysis studies both CPU-bound and I/O-bound approaches Handlers, the EDA amortizes the space and time overhead of each to Event Handler Poisoning. For CPU-bound attacks we focus on additional client. Since each client costs less in the EDA, EDA-based Regular Expression Denial of Service (ReDoS). For I/O-bound ap- services can handle a larger number of clients concurrently. proaches we examine the weaknesses of asynchronous I/O imple- mentations in these frameworks. Using our Constant Worst-Case Execution Time pattern, EDA- based services can become invulnerable to EHP attacks. Guided by this pattern, in our system node.cure we implement two framework- level defenses for Node.js that serve as effective antidotes to the concerns we raise. CCS Concepts • Security and privacy → Denial-of-service attacks; Software security engineering; Web application security; Keywords Figure 1: This is the Event-Driven Architecture. Incoming Event-Driven Architecture; Denial of Service; Node.js; Algorithmic events are handled sequentially by the . The Event Complexity; ReDoS Loop may generate new events in the form of tasks it of- floads to the Worker Pool. We refer to the event loopand the workers in the worker pool as Event Handlers. 1 Introduction Web servers are the lifeblood of the modern Internet. To handle Alas, the scalability promised by the EDA comes at a price. Mul- the demands of millions of clients, service providers replicate their tiplexing unrelated work onto the same thread(s) may reduce over- services across many servers. Since each additional server costs head, but it also moves the burden of time sharing out of the thread time and money, service providers want to maximize the number of library or and into the application itself. Where clients each server can handle. Over the past decade, this goal has OTPCA-based services can rely on the preemptive multitasking led the software community to seriously consider a paradigm shift promised by the underlying system to ensure that resources are in their software architecture — from the One Thread Per Client shared fairly, using the EDA requires the service to enforce its own Architecture (OTPCA) to the Event-Driven Architecture (EDA) (also cooperative multitasking [71]. In other words, thread multiplex- known as the reactor pattern). Though using the EDA may increase ing destroys isolation. For example, if a client’s request takes a the number of clients each server can handle, it also exposes the long time to handle in an OTPCA-based service, preemptive mul- service to a class of Denial of Service (DoS) attack unique to the titasking will ensure that other requests are handled fairly. But if EDA: Event Handler Poisoning (EHP). this same lengthy request is submitted to an EDA-based service, Traditionally, web services have been implemented using the and if the service incorrectly implements cooperative multitasking, One Thread Per Client Architecture (OTPCA), and support more then this request will deny service to all other clients because they clients by increasing the number of threads. In practice the overhead cannot use the Event Handler that is processing the request. We call this state of affairs Event Handler Poisoning. Anonymous Submission to ACM CCS 2017, Due 19 May 2017, Dallas, Texas 2017. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00 We present a general principle, borrowed from the real-time https://doi.org/10.1145/nnnnnnn.nnnnnnn systems community, that EDA-based services can use to defend 1 Anonymous submission #23 to ACM CCS 2017 against EHP. As EHP attacks are possible when one request to an Although EDA approaches have been discussed and implemented EDA-based server doesn’t cooperatively yield to other requests, in the academic and professional communities for decades (e.g. [30, our antidote is simple: the server must ensure that every stage 42, 48, 69, 74, 77]), historically EDA has only seen wide adoption in of the request handling takes no more than constant time before user interface settings like web browsers. How things have changed. it yields to other requests. In applying this principle, EDA-based The EDA is increasingly relevant in systems programming today servers extend input sanitization from the realm of data into the thanks to the explosion of interest in Node.js [12], an event-driven realm of the algorithms used to process it. We evaluate several server-side JavaScript framework. Although the OTPCA and EDA popular EDA frameworks in light of this principle, identifying EHP architectures have the same expressive power [55], in practice EDA vulnerabilities in each. Our node.cure prototype extends Node.js to implementations have been found to offer greater scaling [67]. defend against the vulnerabilities we identified there. Several variations on the server-side EDA have been proposed: This paper is an extension of a workshop paper [28]. There we single-process event-driven (SPED), asymmetric multi-process event- introduced the idea of EHP attacks, argued that some previously- driven (AMPED), and scalable event-driven (SEDA). The AMPED reported security vulnerabilities were actually EHP vulnerabilities, architecture is depicted in Figure 1, while in SPED there is no worker and gave a sketch of our antidote. Here we offer the complete pool [66]. In the SEDA architecture, applications are designed as a package: a refined definition of EHP, a general principle for defense, pipeline composed of stages, each of which resembles Figure 1 [77]. examples of vulnerabilities in the wild, and effective framework- Figure 1 illustrates the widely used AMPED EDA formulation, level defenses. which is common to all mainstream general-purpose EDA frame- In summary, this paper makes the following contributions: works today. In the AMPED EDA (hereafter simply “the EDA”) the operating system or a framework places events in a queue (e.g. (1) We are the first to define Event Handler Poisoning, an attack that through ’s poll system call1), and pending events are handled exploits a general vulnerability in the EDA (§3). Our definition by an event loop. The event loop may offload expensive activity to of EHP is general, and captures attacks targeting either the a small, fixed-size worker pool, which in turn generates an event event loop or the worker pool. to inform the event loop when an offloaded task completes. The (2) Using concepts borrowed from the real-time systems commu- worker pool is implemented using threads or separate processes, nity, we describe a general antidote for Event Handler Poisoning and offers services like “asynchronous” (blocking but concurrent) (§4). Our antidote gives developers precise guidance, replacing access to the or DNS. the dangerous heuristics currently offered in the EDA commu- In applications built using EDA, there is a set of possible Events nity. (e.g. “new network traffic”) for which the developer supplies asetof (3) We identify potential EHP vulnerabilities in software built using Callbacks [46]. During the execution of an EDA-based application, the popular Node.js EDA framework (§5). In doing so we extend events are generated by external activity or internal processing and the state of the art in ReDoS detection. routed to the appropriate Callbacks. Each Callback is executed on (4) Informed by our EHP antidote, we identify weaknesses in one of the Event Handlers; synchronous Callbacks are executed on Node.js itself that expose its community to EHP. In node.cure the event loop, asynchronous Callbacks on a worker. we implement effective solutions to these problems, defending Node.js applications against two forms of EHP: ReDoS (§5.2) 2.2 Popular EDA Frameworks and Large/Slow I/O (§6.3). An EDA approach can be (and has been) implemented in most pro- gramming languages. Table 1 lists the dominant2 general-purpose 2 Background EDA framework for a range of languages: Node.js (JavaScript) [12], In this section we contrast the EDA with the OTPCA (§2.1). We Twisted (Python) [19], libuv (C/C++) [9], Reactor (Java) [16], Event- then explain our choice of EDA frameworks for study (§2.2), and Machine (Ruby) [5], Zotonic (Erlang) [21], and Microsoft’s P# [41]. finish by explaining formative work on Algorithmic Complexity To capture the ongoing investment in each framework, we surveyed (AC) attacks (§2.3). its GitHub [6] repository for the number of commits, contributors, and releases, with results shown in Table 1. 2.1 The OTPCA and the EDA Guided by Table 1, we selected Node.js (JavaScript), Twisted There are two main architectures for scalable web servers, the One (Python), and libuv (C/C++) for closer investigation, weighting our Thread Per Client Architecture and the Event-Driven Architecture. selection by the number of contributors (as a proxy for community The traditional approach is the OTPCA, of which Apache [2] is interest) and the number of commits (as a proxy for maturity). the most prominent. In the OTPCA style, a thread or a process is Node.js is a server-side event-driven framework for JavaScript assigned to each client, with a large maximum number of concur- that uses the architecture shown in Figure 1. The worker pool is used rent clients. The OTPCA model isolates clients from one another, internally to perform asynchronous DNS queries and file system I/O. reducing the risk of interference. However, each additional client In addition, user-defined code can be offloaded to the worker pool. incurs memory and context switching overhead. In contrast, by Born in 2009, Node.js now claims 3.5 million users [27], and its open- multiplexing multiple client connections into a single thread and source package ecosystem, npm, is the largest of any programming offering a small worker pool for asynchronous tasks, an EDAap- 1Per the Linux man page, “poll...waits for one of a set of file descriptors to become proach significantly reduces a server’s per-client resource use. But ready to perform I/O.” 2 this design also reduces isolation between clients, exposing the We identified the dominant framework for each language through a tournament approach, using web searches to identify the frameworks for each language, and the server to vulnerabilities like EHP. same metrics as in Table 1 to choose the most dominant among them. 2 Anonymous submission #23 to ACM CCS 2017

Framework Contributors Commits Releases 3 Event Handler Poisoning Attacks Node.js (JavaScript) 1344 17355 426 Server-side applications implemented using the EDA are vulnerable Twisted (Python) 52 21897 39 to a unique type of denial of service attack: Event Handler Poisoning. libuv (C/C++) 255 3725 185 Key to the EDA is the multiplexing of many client requests onto Reactor (Java) 56 4249 42 a small set of Event Handler threads. An Event Handler Poisoning EventMachine (Ruby) 118 1153 29 attack exploits this design by poisoning one or more Event Handlers, Zotonic (Erlang) 60 6632 69 causing them to block, perhaps indefinitely. This is done through P# 10 1542 2 the submission of evil input, an expensive task that is either CPU- Table 1: Maturity of event-driven programming environ- bound or I/O-bound. ments measured by development activity. Data were ob- AC-style vulnerabilities are thus a form of EHP: if an attacker tained from GitHub on May 6, 2016. identifies an AC-style vulnerability in an EDA-based server, he can submit “evil input” to trigger the WCET of the vulnerable language or framework [10]. Node.js has seen adoption in industry, algorithm. When an Event Handler picks up the request, it will take at companies including LinkedIn [63], PayPal [50], eBay [65], and an inordinately long time to handle it, starving subsequent requests IBM [11]. — an EHP attack. EHP attacks redirect the energies of the victim Twisted is an event-driven networking engine written in Python rather than trying to cause it to crash. that uses the architecture shown in Figure 1. During its 14 years, We take issue with the prevailing wisdom in the EDA community, Twisted has seen use in significant projects like Ubuntu26 One[ ], which is that offloading tasks to the worker pool is a panacea for Apple’s Calendar and Contacts Server [22], and the SageMath Note- expensive work. Suppose an attacker identifies an EHP vulnerability book [24]. in code executed by the worker pool. A pool of size k will jam after C/C++ developers can use the libuv framework to write event- only k poisonous requests; the EHP vulnerability is essentially the driven programs. Node.js’s event-driven runtime relies on libuv, so same whether the attack is against the event loop or the worker pool. both frameworks use the same architecture. Though libuv was first To induce complete DoS, an attacker’s aim is to submit requests developed in 2011 to support Node.js, it has since been adopted by that poison either the event loop or each worker in the worker pool, other frameworks and languages. Examples include the numerical thereby delaying the handling of subsequent requests. computing language Julia [23], early versions of Mozilla’s systems We note that an attack targeting a worker pool can in principle programming language Rust [25], and others [14]. be carried out against any architecture or system with a cap on the amount of resources available for clients, including both EDA and 2.3 Algorithmic Complexity Attacks pool-based OTPCA designs. However, the pool in a OTPCA system Our work was inspired by and generalizes the idea of Algorithmic is orders of magnitude larger than the worker pool in typical EDA Complexity (AC) attacks [37, 60]. In an AC attack, attackers force implementations. While Node.js has a default worker pool of size 4 victims to take longer than expected to perform an operation. Rather with a maximum of 128 workers [13] , Apache’s httpd server has a than overwhelming the victim with volume, the attacker turns the default connection pool of size 256-400, with a maximum size of victim’s vulnerable algorithms against him. AC vulnerabilities can around 20000 [1] . Due to the size of an OTPCA server’s connection therefore be exploited to launch DoS attacks, when the attacker pool, attempts to jam it would presumably be detected by an IDS or forces the victim to spend a long time processing his request at the addressed through rate limiting. By comparison, the small number expense of legitimate clients, or when the attacker increases the of requests needed to jam a vulnerable EDA server’s worker pool cost even of legitimate requests. is too few to draw attention from network-level defenses, placing Regular expression Denial of Service (ReDoS) attacks are a the burden of defense on the application developer or the runtime class of AC attack that has gained attention in recent years [54, 68, system to address it. 70, 73]. Developers often use regular expressions for input sani- An important facet of an EHP attack is its asymmetry. As in a tization, and must produce regular expressions that correctly dif- Distributed DoS (DDoS) attack, a carefully crafted “evil request” ferentiate between valid and invalid input. When the set of valid will be roughly the same size and shape as the requests submitted inputs is varied, the resulting regular expressions can become com- by legitimate clients. However, unlike a DDoS attack, the attacker plex. Unfortunately, a poor choice of regular expression exposes an need only submit one (or k) rather than a flood. This significantly application to ReDoS. reduces the resources required by the attacker, and makes it difficult Though regular expression engines perform well in general, for a network-level IDS system to detect an attack. many popular engines (, Python, V8 JavaScript, etc.) have worst- Figure 2 illustrates the work patterns available in the EDA. (a-c) case exponential-time performance. An ReDoS attack convinces a are vulnerable to EHP, while (d) and (e) are safe (§4). vulnerable regular expression to evaluate evil input that trig- gers this worst-case exponential running time. Such an evaluation 3.1 EHP Threat Model will run effectively forever, at least until it is canceled or timed out. This worst-case exponential-time performance occurs because Our threat model is straightforward: the attacker crafts “evil input”, production-grade “regular expression” engines are misnamed — and convinces the EDA-based victim to feed it to a vulnerable they support Context-Free Language extensions to traditional regu- API. This model is reasonable because our victim is an EDA-based lar expressions, and these extensions may in the worst case require server; servers exist solely to process client input using APIs. If the exponential time. server uses a vulnerable API, and if it invokes it with client input, 3 Anonymous submission #23 to ACM CCS 2017

the EDA model, as shown in Cylon.js [4], processing the incoming data will require computation within an EDA setting. 3.3 Defining I/O-bound Vulnerabilities Web servers are commonly called on to interact with the file sys- tem. An I/O-bound EHP vulnerability exists when the attacker can request the server make a particularly slow I/O. For example, in a file server, a slow I/O might be a read of an unusual file,which can take on average 8x longer to serve [80], or perhaps data stored on slow media, e.g. a network drive. We suggest several ways in which an attacker might exploit an I/O-bound EHP in Section 6. Gaps between theory and practice aside, we have argued that in the EDA the risk of DoS is more general than just the event loop: EHP attacks can be launched against either the event thread or the worker pool. When practitioners incautiously offload tasks to the worker pool, they are guilty of conceptual abuse. In many Figure 2: This figure illustrates various EDA work patterns, EDA frameworks the worker pool is small and fixed-size – there drawing on material from §3 and §4. Figures (a-c) are vul- are far fewer worker threads than clients per server. This means nerable to EHP, while (d) and (e) practice our antidote (§4). that cavalierly offloading potentially long-running tasks tothe (a) shows the event loop performing an expensive opera- worker pool is a reversion from the EDA to the OTPCA. In the EDA, tion synchronously. (b) shows the event loop offloading that developers must take care even in the tasks they offload to the worker operation to the worker pool, where it is performed syn- pool. chronously by a worker. (c) shows the event loop offloading the partitioned operation to the worker pool, but note that 4 An Antidote for Event Handler Poisoning one of the partitions is not C-WCET. (d) shows a C-WCET partitioning of the operation, with each partition offloaded to the worker pool. In (e) the server offloads a request to an When an algorithm’s overall complexity is not O(1), being C-WCET- external service. partitioned gives it two properties that make it ideal for the EDA setting. First, it will be cooperative; it will not be executed atomically, but will instead allow other events to have a turn on the Event the server can be poisoned. We identify vulnerable APIs in EDA Handler. Second, it will be predictable: other events will have a turn frameworks in sections 5 and 6, and elaborate on this model there. on the Event Handler within a fixed amount of time, necessary for high throughput in a context like Node.js (where events of 3.2 Defining CPU-bound Vulnerabilities different types and costs are handled by a single worker pool).A The CPU-bound EHP vulnerabilities in an EDA system are a form of C-WCET-partitioned algorithm can block neither the event thread Algorithmic Complexity (AC) vulnerability (§2.3). In an AC attack, nor the worker pool, and thus an application composed of C-WCET- attackers force victims to take longer than expected to perform an partitioned algorithms is invulnerable to EHP attacks. To the best operation, typically by exploiting the Worst-Case Execution Time of our knowledge we are the first to propose this hardline stance (WCET) of an algorithm used to process requests. If an attacker on C-WCET partitioning for EDA applications. identifies an AC-style vulnerability in an EDA-based server, hecan Partitioning an algorithm to reduce each partition’s WCET is also submit evil input to trigger the WCET of the vulnerable algorithm. recommended by Wandschneider [75], but he suggests partitions When an Event Handler picks up the request, it will take an inor- with O(n)-WCET. Wandschneider’s approach is unsound if the dinately long time to handle it, starving subsequent requests – an cost for each element is non-trivial. True C-WCET partitioning EHP attack. is necessary to make an application invulnerable. And once an CPU-bound EHP vulnerabilities in the EDA are more general algorithm has been partitioned to O(n) − WCET , partitioning it than Crosby and Wallach’s formulation of AC attacks. According to further to O(1) − WCET should not be challenging. Crosby and Wallach [37], the victim of an AC attack is an algorithm C-WCET partitioning is essentially the logical extreme of co- or data structure that performs well in the average case but poorly operative multitasking [71], and interprets EDA systems as real- on carefully chosen “evil input”. In the EDA setting, however, any time (embedded) systems with a need for strict forward progress algorithm with non-constant WCET (non-C-WCET) and uncontrolled guarantees [78]. C-WCET partitioning also makes predictable the input constitutes an EHP vulnerability (§3). worst-case overall completion time of a request, a useful feature While the EDA was originally intended for I/O-bound applica- for application designers. In §5 and §6 we discuss the risk of EHP tions, we argue that the need to perform computation in an EDA when non-C-WCET-partitioned algorithms are used in the EDA. system is growing. Full-featured servers that rely on the EDA must We argue that developers should view C-WCET partition- inevitably deal with the computations needed to implement busi- ing as input sanitization. In the EDA, input sanitization must ness logic, and with the advent of fog computing [32], embedded IoT extend from the realm of data into the realm of the algorithms used devices will increasingly be called on to perform data processing. to process it. In the EDA, C-WCET partitioning is as necessary as As these input-driven devices are most readily programmed using checking user input for directory traversal attacks or code injection. 4 Anonymous submission #23 to ACM CCS 2017

In many cases EHP vulnerabilities are the responsibility of the so ReDoS attacks are EHP attacks that poison this event handler. application developer. If the application performs expensive work The server can stymie an ReDoS attack by avoiding the use of on the event loop or offloads enormous, unpartitioned tasks to vulnerable patterns, by rejecting the evil input before evaluating the worker pool, we cannot help except on a case-by-case basis. the regular expression on it, or by using a regular expression engine Where we can help more generally, however, are cases where the with a timeout mechanism. application makes a reasonable use of the framework APIs, but To obtain evil input, the attacker must derive from the regu- the implementation of these APIs disappoints. Using Node.js as lar expression three strings: the prefix, the pump, and the suffix. a case study, we identify and repair two framework APIs whose Since backtracking behavior is induced by requiring the regular infelicitous implementations expose any Node.js applications using expression to try exponentially many paths through the input be- them to EHP attacks. In §5 we identify and address a solution to fore declaring a mismatch, the attacker uses the prefix to reach the a CPU-bound API, and in §6 we do the same for an I/O-bound vulnerable point in the regular expression, repeats (or “pumps”) the API. By applying C-WCET techniques at the framework level, our pump string to repeatedly double the size of the tree explored by node.cure prototype offers the Node.js community a high-impact, the engine, and adds the suffix to cause a mismatch and trigger the easy-to-adopt solution to some EHP attacks. backtracking. 5 CPU-bound EHP Vulnerabilities in the EDA 5.1.1 Identifying Vulnerable Regexes A seemingly straightforward defense against ReDoS is to avoid the use of vulnerable regular expressions. But how can a developer de- In this section we discuss CPU-bound EHP vulnerabilities. We termine whether a regular expression is vulnerable? In this section focus our attention on ReDoS, describing existing vulnerabilities in we demonstrate that while there are open-source tools that perform npm (§5.1) and demonstrating a defense that incorporates a hybrid this classification, they do not do so accurately. regular expression engine into Node.js (§5.2). We close with an We identified two open-source tools that claim to answer this exploratory manual investigation of 80 npm modules, showing that question: safe-regex [17] and rxxr2 [68]3. We refer to these tools the Node.js community’s practices are vulnerable to EHP because as our ReDoS oracles. safe-regex labels as vulnerable all regular they do not follow our C-WCET partitioning approach (§5.3). expressions with star height [51] greater than one (i.e. nested quan- 5.1 ReDoS Vulnerabilities tifiers), while rxxr2 performs a more complex analysis based on an After explaining what ReDoS vulnerabilities are and describing the idealized backtracking regular expression engine. associated threat model, this section answers two questions: We next located a source of “ground truth” with which to evalu- (1) How can we identify a vulnerable regular expression (§5.1.1)? ate these tools. Security vulnerabilities in npm modules are publi- 4 If developers cannot do this accurately, their applications are cized by the Node Security Platform (NSP) and Snyk.io [18]. These likely to remain vulnerable to ReDoS. organizations report on a range of vulnerabilities including ReDoS. (2) Are any popular npm modules vulnerable to ReDoS (§5.1.2)? As reported in [28], we extracted 43 potentially-vulnerable regu- While our discovery of ReDoS vulnerabilities as we answered lar expressions from ReDoS vulnerabilities reported by NSP and the second question motivated our solution (discussed in §5.2), Snyk.io as of February 1, 2017, and then applied both safe-regex answering the first question informed its design. and rxxr2 to evaluate each regular expression for vulnerability. We have documented and automated each step of the process Untrustworthy Oracles We found that neither safe-regex nor outlined below. Our programs and the resulting data are available rxxr2 is trustworthy, either alone or when used in combination. By at BLIN DED for those interested in reproducing our results or assessing the decision of each oracle for accuracy, we determined applying our techniques to other questions. the kinds of vulnerable regular expressions each was good and bad As discussed in §2.3, ReDoS vulnerabilities require the server at identifying. to evaluate a vulnerable regular expression on evil input designed We evaluated each of the 43 potentially-vulnerable regular ex- to trigger the worst-case performance of its regular expression pressions from NSP and Snyk.io for true vulnerability using three engine. This worst-case performance is exponential in the length popular backtracking engines, those built into Perl, Python, and of the input, so for long input or fast blow-up other clients will be Node.js. We defined a regular expression as vulnerable in an engine denied computational resources and thus service. Common regular if the engine takes more than 500 ms to evaluate an input that expression engines evaluate inputs against patterns synchronously, has been “pumped” at most 20,000 times. This practical means of so in the EDA an ReDoS attack is depicted in Figure 2 (a) or (b). evaluation gives us strong evidence for true vulnerabilities, but it ReDoS Threat Model We further refine the EHP threat model necessarily filters out vulnerabilities that may only manifest inan (§3.1) for ReDoS. Regular expressions are commonly used as a first engine on (absurdly) long input. Note that due to differing engine line of defense when sanitizing client input. If a client identifies implementations, expressions may be vulnerable on some engines a vulnerable regular expression being used by the server, he can and safe on others. compose evil input that triggers the worst-case exponential-time Here is a description of the result of this study: performance of the regular expression engine. He might identify (1) rxxr2 reported no false positives – it was correct for every reg- this regular expression by examining regular expressions used to ular expression it labeled vulnerable. Since rxxr2 recommends validate input on the client side of this server, or by examining 3We are aware of one other potential ReDoS oracle, Microsoft’s SDL Regex Fuzzer [72]. source code or regular expression libraries. In the EDA, regular However, this tool is no longer available. expressions are typically executed synchronously on the event loop, 4See https://nodesecurity.io/ 5 Anonymous submission #23 to ACM CCS 2017

evil input, we were able to use this input to exploit each of the Regular expression Vulnerable in which engine? vulnerable regular expressions it reported. /android.+(\w+)\s+build\1$/ Node.js, Python (2) safe-regex reports false positives – regular expressions that /(\w+)\1$/ Node.js, perl could in principle be vulnerable but which are not in any of the Table 2: These regular expressions contain backreferences, backtracking engines we evaluated. In our dataset, these false and are vulnerable in the indicated engines. The first regu- positives came in two forms: non-dangerous quantifiers and lar expression in the table is used in the ua-parser-js npm small specific quantifiers. The use of the “optional” quantifier module. The second regular expression is a simplified ver- ‘?’, as in /(X+)?$/, does not seem to be problematic for these sion emphasizing the source of the vulnerability. For “evil engines. Regular expressions containing small specific quan- input” to the final expression, we used an empty prefix, the tifiers, like /(X{2})+$/, are handled quickly up to very long pump string “a”, and a suffix of “!” to cause a mismatch and input. trigger backtracking. (3) safe-regex reports true positives that rxxr2 does not. It is not clear whether rxxr2’s unsoundness comes from rxxr2’s design or its implementation. vulnerable. has-exp-features thus has no false negatives but many (4) rxxr2 reports true positives that safe-regex does not. These false positives. take two forms: nested quantifiers with an OR’d inner expres- Armed with three ReDoS oracles, two existing and one novel, sion, and quantified expressions with overlapping OR’d classes. we searched for true ReDoS vulnerabilities in the wild. First, safe-regex does not correctly detect regular expressions 5.1.2 ReDoS Vulnerabilities in npm Modules of the form /(X+|Y)+$/, though this is clearly an example of a In this section we discuss the process by which we identified 27,808 5 star height greater than one . Second, it misses regular expres- regular expressions used in the wild, and elaborate on how we sions with quantified overlapping OR’d classes, like this one: identified 55 as truly vulnerable and an additional 101 as probably /(X|X)+$/. This regular expression is vulnerable because on vulnerable. input of the form “X...X” that does not match, the engine will Choosing The Software to Study To identify ReDoS vulner- backtrack to try each possible division of the X’s between the abilities, we first had to identify regular expressions used in real first and second OR’d classes – there are exponentially many software. To ensure a fair evaluation, we restricted our definition of choices in the length of the input. Note that this vulnerability “real software” to emphasize well-maintained software. We decided is not due to the star height of the expression; the backtracking to investigate software in the Node.js package ecosystem, npm, is incurred using other means. since it is open-source and has a large and active developer com- (5) Neither safe-regex nor rxxr2 identifies another form of vulner- munity. npm as an ecosystem also remains relatively unexplored. able regular expression: adjacent quantified expressions with We searched the npm ecosystem for packages with relatively large overlapping contents, like this one: /\d+\d+$/. This class of numbers of commits (as a proxy for being well-maintained) and expressions is vulnerable for essentially the same reason as the relatively large monthly download rates to maximize the impact of previously-discussed vulnerability. Curiously, safe-regex’s test our findings. Here are the steps we took: suite includes regular expressions of this form in its list of safe regular expressions. (1) We cloned the npm registry on May 3, 2017. This registry con- (6) Though there was only one example in this dataset, neither tains the metadata for each module defined in the npm ecosys- safe-regex nor rxxr2 seems to identify regular expressions con- tem, including the location of its source code repository. The taining exponential-time features. This inaccuracy is partic- registry had information on 447,362 modules. ularly ironic because backreferences and lookahead are the (2) We filtered the registry for the 348,691 modules whose code is only components of standard regular expression languages hosted on GitHub, which has an easy-to-query API from which that actually must require exponential-time evaluation in the we could extract commit numbers. We excluded from this set worst case [36]. Though backreferences are outside the scope an additional 78,890 modules which had no monthly downloads of safe-regex’s investigations, rxxr2’s authors claim it should or for which we did not identify at least one commit (e.g. the consider them [68]. On the other hand, rxxr2’s implementation GitHub repository no longer existed). This left us with 269,801 does not support regular expressions with lookahead. Table 2 modules. contains an assortment of vulnerable backreference-using reg- (3) For each surviving module, we obtained maintenance data (com- ular expressions that safe-regex and rxxr2 incorrectly label mits) from GitHub and popularity data (monthly downloads) safe. from npm. We only counted the commits made by the project’s top 30 contributors due to GitHub’s rate limiting policy. The most notable weakness of safe-regex and rxxr2 is that they We visualized this data in Figure 3, which shows the frequency do not consider exponential-time regular expression features. To of different values on a log-log scale. Onthe minimize the false negatives reported by our ReDoS oracles, we rainbow scale, the darkest red hexagons have the highest frequency, have developed a novel oracle has-exp-features that identifies reg- the deepest violet the lowest. The red lines indicate two standard ular expressions that use exponential-time features and labels them deviations above the arithmetic mean in each dimension. Using this analysis, we were able to select the “best maintained, most popular” 5This is a bug in safe-regex that we have patched. See our pull request at modules – those in the upper-right quadrant, 1,591 in all. See [79] BLINDED. We subsequently refer only to the patched version of safe-regex. for other work related to the npm ecosystem. 6 Anonymous submission #23 to ACM CCS 2017

Vulnerable Regular Expressions Having identified a set of 8 Counts 27,808 real regular expressions from well-maintained modules, we 7149 asked our ReDoS oracles about each of them. Figure 4 illustrates 6702 the varying opinions the oracles gave on the 4,113 regular expres- 6256 sions reported “vulnerable” by at least one oracle. We used rxxr2’s 5809 6 recommendations for evil input to identify 55 as truly vulnerable. 5362 rxxr2 4915 As ’s recommendation format is ill-defined, we were unable 4468 to automatically reach conclusions on the remaining 101 regular 4022 expressions it flagged as vulnerable. Since neither safe-regex nor 4 3575 has-exp-features recommends evil input, we were unable to vali- 3128 date these dynamically. The inaccuracies of the oracles discussed 2682 in §5.1.1, however, suggests that there are likely truly vulnerable Log_10 Monthly Downloads 2235 regular expressions in each area of the figure. The development of 1788 2 improved ReDoS oracles remains an area for future work. 1341 894 448 1

0 1 2 3 4 5 Log_10 Total Contributions

Figure 3: This figure shows the density of the downloads vs. contributions chart. Redder colors are the most dense, so many modules have no more than 10 contributions and 100 downloads per month. The red lines plot two standard devi- ations above the arithmetic mean on each axis, and intersect at 270 commits and 3500 downloads per month. We label as “well maintained and popular” those modules in the upper- right quadrant. In clockwise order from the lower-left, the quadrants contain 250,164, 6,602, 1,591, and 11,444 modules. Figure 4: This figure shows the overlap between the deci- sions of our ReDoS oracles on our set of wild regular ex- pressions. We asked each oracle about the 27,808 regular ex- Extracting Regular Expressions From npm Modules Hav- pressions in our dataset, and 4,113 were labeled vulnerable ing obtained a dataset of popular, highly-downloaded npm modules, by at least one oracle. rxxr2 (RX) was the most conservative, our next step was to extract the regular expressions they used. As while safe-regex (SR) and has-exp-features (HE) flagged sim- there are well-known difficulties inherent in statically analyzing a ilar quantities of expressions. The small amount of over- dynamic language like JavaScript, we opted for a simpler dynamic lap between SR and HE indicates that real regular expres- 6 approach. To this end, we instrumented Node.js v6.10.3 to emit sions are not “too complicated” – few expressions used both regular expressions as they were declared. nested quantifiers and backreferences or lookahead. Wedy- To determine the regular expressions used by our dataset of npm namically demonstrated ReDoS attacks on 55 of these regu- modules, we installed each of them and ran its test suite. Though lar expressions using the evil input recommended by rxxr2, we don’t expect the test suites to offer complete code coverage, but did not investigate whether attackers could reach these regular expressions are typically used to sanitize input, and thus expressions with evil input. we expect a reasonable test suite to exercise most if not all of the regular expressions used by the module. 5.2 Addressing ReDoS in the EDA We were able to successfully run the test suites of 1,182 of our 7 Many of the possibly- and truly-vulnerable expressions found in our 1,591 chosen modules , capturing 2,282,330 regular expression dec- npm study (§5.1) need not be vulnerable in practice. Though regular larations, which we reduced to a set of 27,808 unique patterns. The expressions that use exponential-time features are unavoidably large number of repeated regular expressions was likely due to our vulnerable, the remainder would be safe were they evaluated using use of dynamic analysis; we captured regular expressions declared a linear-time regular expression engine instead of Node.js’s built-in anywhere in the running test suite, whether in a module’s code or exponential-time engine, Irregexp8. The size of this opportunity in the code of one of its dependencies. For example, if modules A is illustrated in Figure 4: those not flagged by has-exp-features and B both depend on module C, and C uses a regular expression, (52%) could be made safe. While this observation has previously this regular expression would appear twice in our unfiltered set. been made by Cox [36], we are not aware of a previous attempt

6We applied our changes to Node.js commit a0bfd4f5 (5 May 2017) 8Regular expression processing is part of the V8 JavaScript engine. V8 actually has two 7Of the remainder, some test suites would not run, and others timed out under our engines, Atom and Irregexp. While Irregexp is used in general, Atom services five minute cap. the simplest regular expressions, those that can be handled with strstr. 7 Anonymous submission #23 to ACM CCS 2017 to implement a hybrid regular expression engine at the language level. We introduced a linear-time regular expression engine into node.cure, modifying the implementation of regular expressions in the underlying V8 JavaScript engine. We used the RE2 regular expression engine, which is the most prominent linear-time regular expression engine and supports perl-compatible regular expressions (PCRE) [15] We found that in normal usage Irregexp is significantly faster than RE2. We evaluated an RE2-enabled version of Node.js using two benchmarks, Li’s benchmarks [3] and the regular expression-related tests from the V8 Octane benchmark suite [7] . Our results are shown in Figure 5. These benchmarks use safe input on generally- Figure 6: Decision tree for using RE2 vs. the existing V8 regu- safe regular expressions, so we see that in normal usage Irregexp lar expression engines. We use RE2 as a last resort because its outperforms RE2. performance on “good path” patterns and inputs is inferior to that of Irregexp according to our benchmarks. The deci- sions are made in the order of increasing cost, culminating in a query to the ReDoS oracles. The number of regular ex- pressions from our dataset that took each path is shown in (parentheses).

This leaves servers using node.cure open to three avenues of attack. First, though RE2 handles many vulnerable regular expres- sions, an attacker can still carry out standard ReDoS attacks on the subset of vulnerable regular expressions that contain exponential- time features (i.e. that cannot be handled by RE2 in linear time). Second, an attacker can identify vulnerable regular expressions that could be handled by RE2 but which our ReDoS oracles do not identify as vulnerable. This threat can be ameliorated by more ac- curate oracles. Lastly, an attacker can take advantage of the “slow path” in Figure 6. If he can convince a server running node.cure to evaluate an uncached regular expression that can be handled with Figure 5: On the Li and V8 Octane regular expression perfor- RE2, the server will repeatedly have to follow the “slow path” and mance benchmarks, Irregexp significantly outperforms RE2. query the ReDoS oracles. Even with the addition of a database of We show here 50 runs of the performance of three versions oracle decisions with which prime the cache, the attacker could of Node.js: vanilla Node.js, a version that checks whether propose novel expressions not in the database. Faster ReDoS oracles RE2 can be used but does not use it, and a version that uses are needed to defend against this threat. RE2 whenever possible. We traced this overhead to the actual Limitations node.cure cannot protect developers from them- evaluation of the regular expression by the two engines. selves. If developers use regular expressions with exponential-time Given this disparity in performance, node.cure only uses RE2 to features, these expressions must be handled by an exponential- evaluate potentially-vulnerable regular expressions, otherwise fa- time regular expression engine. Vulnerable regular expressions like voring Irregexp. node.cure queries the ReDoS oracles we identified those of Table 2 will remain vulnerable when using node.cure. This in §5.1.1 to determine whether a regular expression is vulnerable. is an unavoidable consequence of supporting a “regular expression” Querying oracles is expensive (about 0.2 seconds per query, unopti- language that allows context-free language construction. mized), so after querying a particular pattern we cache the result. RE2 is not completely compatible with Irregexp. Consider the /(?:(a)|(b)|(c))+/ abc9 Irregexp The algorithm for choosing the regular expression engine to use regular expression with input . returns a single match c, while RE2 returns three separate matches is shown in Figure 6. Our implementation, about 800 lines long, is 10 completely within the V8 JavaScript engine used by Node.js. a, b, and c . Disagreements of this kind mean that applications Attacking Our Solution node.cure does not partition regular would need to be ported to node.cure rather than using it as a expression evaluation; as in JavaScript regular expressions are ex- drop-in replacement for Node.js. We are not aware of a previous ecuted synchronously, and in Node.js this is done on the event observation of this disagreement, nor do we understand its cause. loop, node.cure only seeks to reduce the WCET of the evaluation As flags are not the root cause of the exponential-time matching, from exponential-time to linear-time where possible by execut- supporting them is orthogonal to defending against EHP attacks. ing vulnerable compatible regular expressions using RE2. In other Though our implementation honors the case-insensitive flag (i), it words, node.cure follows the pattern of Figure 2 (a), but tries to 9This example is taken from v8/test/mjsunit/regexp-loop-capture.js. exponentially reduce the size of the work. 10Perl and Python also disagree with Irregexp. 8 Anonymous submission #23 to ACM CCS 2017 refers patterns with other flags to V8’s default regular expression (offloaded to the operating system’s asynchronous I/O facilities). engines. We deferred support for the global match flagд ( ) for future libuv offers Type 1 and Type 2 I/O using typical read and write work due to the complexity of V8’s implementation. For JavaScript’s APIs. Node.js also offers Type 1 and Type 2 I/O built onthe libuv multiline (m), unicode (u), and sticky (y) flags, we could not find an offerings. In addition to read/write[Sync], Node.js offers full-file appropriate analogue in RE2. read/writeFile[Sync], and read/writeStream APIs. Twisted does Practicality Despite the limitations discussed above, in its cur- not provide dedicated file I/O APIs, but developers could readily rent state node.cure can be used by real Node.js applications. node.cure implement Type 1 or Type 2 I/O using Python’s file system support. passes V8’s regexp tests, runs our small Node.js applications, and We could not identify a correct npm or Python module offering can also handle more complex Node.js software like npm’s command- Type 3 file I/O. line utility. The use of type 1 I/O is frowned upon in the EDA community for obvious reasons, and developers favor the type 2 APIs offered by 5.3 Other CPU-bound Vulnerabilities their EDA framework. We found that the Type 2 Read APIs offered As argued in Section §4, in the EDA C-WCET-partitioned algorithms by Node.js and libuv expose applications to EHP, as discussed in are safe from EHP. We examined 80 npm modules in detail to deter- the next section. mine how they handle potentially computationally expensive tasks, manually inspecting both documentation and implementation. We 6.2 Attacking EDA File I/O chose the first 20 modules returned by npm searches for string, File I/O Threat Model Consider two kinds of files: Large Files and string manipulation, math, and algorithm, in hopes of striking a Slow Files. Large Files are files in the MB or GB range, or effectively balance between relevance (high usage) and potential algorithmic bottomless ones like /dev/zero. Slow Files are those whose access complexity. times are variable and potentially extremely slow, e.g. /dev/random11 Though space limitations prevent a detailed description of our or files accessed over a network share. findings, they were troubling. The vast majority of the APIsexe- As in §3.1, we assume that attackers can feed “evil input” to cuted synchronously on the event loop. Exactly two of the hun- vulnerable file I/O APIs. If an attacker persuades a Node.js server to dreds of APIs we evaluated were partitioned, and one of these syn- read from a Large or Slow File using Type 1 or Type 2 I/O, he may chronously executed an exponential-time algorithm before yielding poison the underlying event handler. This handler is either the event control. Although the execution time of an API is critical to pre- loop (Type 1) or a worker thread (Type 2); either way, if the handler venting EHP, the documentation for the majority of the APIs did takes a long time to complete the read, it cannot be used by other not state their running times. During our analysis of the source clients. Though some forms of directory traversal vulnerabilities code we identified algorithms with complexities ranging from O(1) could be exploited in this way, this attack is somewhat more general: and O(n) all the way to O(2n). the EHP attacker is not interested in the contents of the file, but These findings were all the more surprising because the modules merely that it be read. This threat model is analogous to that for we selected are widely used, and were downloaded over 200 mil- ReDoS: the attacker must be able to submit his own input to a lion times in December 2016. We suggest, then, that npm module vulnerable API call. developers are not cognizant of the risks of unbounded non-C- When are large files and slow files problematic in theEDA? WCET-partitioned algorithms, and that many Node.js applications Type 1 and Type 2 APIs are unsafe under this threat model, though are therefore likely vulnerable to CPU-bound EHP attacks. These whether they are vulnerable to large or slow files depends on their findings are in line with previous work studying the frequency implementation. Focusing on Node.js, let us consider the high-level of asynchronous callbacks in server-side JavaScript, though we APIs in turn12. The Type 1 APIs are trivially unsafe, because they identified a lower proportion of asynchronous callbacks than they perform a relatively slow task (file I/O) on the event loop. We were did [47]. surprised, however, to find that the Type 2 readFile API is vul- While Node.js application developers could offload these calls nerable to large files, because it is implemented as a single read to the worker pool, e.g. using NAN or the cluster module, doing spanning the entire file. On the other hand, large files will notpoi- so may simply move the poison from the event loop to the worker son readFileStream, because it partitions the read into fixed-size pool §3. tasks. Between each task other pending tasks can be handled. Unfortunately, both readFile and readFileStream can be poi- 6 I/O-bound EHP Vulnerabilities in the EDA soned by slow files. The slow-ness affects each I/O request, so each task poisons the worker that handles it. Partitioning tasks in size This section presents I/O-bound EHP vulnerabilities. We first dis- does not save readFileStream from a slow file’s variability in time. cuss the support for file I/O in EDA frameworks (§6.1). Then, we Using the language of §4, neither readFileSync nor readFile is exploit weaknesses in this support in Node.js to demonstrate I/O- partitioned, and though readFileStream is partitioned, it is not C- bound EHP (§6.2). Finally, we describe how node.cure leverages WCET partitioned. The implementation of readFileSync is shown True Asynchronous I/O to eliminate these vulnerabilities (§6.3). in Figure 2 (a), readFile in (b), and readFileStream in (c). A hypo- thetical Type 3 API is shown in (e) — by offloading the work to 6.1 File I/O in the EDA 11 There are three Types of file I/O in the EDA. File I/O can be done According to “man 4 random”, “reads from /dev/random may block...users will usually want to open it in nonblocking mode (or perform a read with timeout).” (1) synchronously on the event loop, (2) “asynchronously” (syn- 12The read API is vulnerable to large and/or slow files depending on the size of the chronously but on the worker pool), or (3) truly asynchronously read. Its vulnerabilities follow the same argument. 9 Anonymous submission #23 to ACM CCS 2017 the operating system, none of the Event Handlers (event loop or follow the same principle as in our ReDoS solution (see Figure 6): worker) need execute I/O synchronously. we favor the existing fast path unless we have evidence that the Though developers can avoid these issues by writing their own I/O is vulnerable, at which point we drop to the slow-but-secure I/O libraries to use Type 3 I/O, providing safe asynchronous file KAIO path. I/O is fundamentally the responsibility of the EDA frame- True AIO for Node.js We apply a two-tiered approach to ad- work. Developers should be able to trust their framework to cor- dress large and slow files. First, though the existing implementation rectly perform file I/O in an appropriately partitioned (a la read- performs the I/O for readFile in one large request, there is no need FileStream) and truly asynchronous fashion. for this at API level. Thus, we partition readFile requests just like In node.cure, we propose extensions to Node.js to address these those of readFileStream, and buffer the output until the file has issues (§6.3). Because node.cure addresses all such vulnerabilities, been completely read. With this change in place, the remaining we have not attempted to identify specific examples of these vul- concern for I/O is slow files. nerabilities in the wild. Interested parties could expose some of We then add a slow path which performs I/O using KAIO rather them using standard taint analysis techniques, with user input as than by offloading to the worker pool. With the insertion ofstop- sources and file system APIs as sinks. However, even untainted file watch logic in the Node.js C++ library that offloads I/O to libuv, we I/O can be poisonous depending on the APIs used, file sizes, and can route accesses to fast files to the existing logic and defer slow underlying storage performance. files to KAIO. Slow files are those that are outliers in the distribution of I/O times. When an I/O takes a long time, we obtain the number 6.3 Addressing I/O-bound EHP in the EDA of the underlying inode using fstat and add it to an in-memory As discussed in §6.2, the implementation of Node.js’s (libuv’s) built- slow inode list. All subsequent accesses to this inode using any file in I/O APIs exposes applications to EHP vulnerabilities under our descriptor are routed to the slow path, to be performed using KAIO. threat model. We have designed, implemented, and evaluated an We register the corresponding eventfd in the event loop’s epoll alternative approach that resolves these issues. set and respond appropriately when it is ready. Our slow path I/O Node.js’s I/O design takes a portable approach, at the cost of is C-WCET partitioned; it requires a small series of constant-time exposing applications to EHP attacks when doing I/O. On Linux we operations, carried out on the event loop. Our implementation in can defend against this class of attacks by making use of Linux’s node.cure only affects the read path, since a system that exposes ar- support for Kernel Asynchronous I/O (hereafter, KAIO) to offer bitrary writes to an attacker has greater concerns than EHP-induced 13 true asynchronous I/O . At a high level, KAIO maintains a queue DoS. However, it could easily be extended to the write path. of pending I/O requests, handling each in turn, which scales well Our implementation of readFile partitioning was simple because with the number of requests for the same reason that Node.js does. it could re-use the existing support for readFileStream, requiring KAIO can deal with slow files more effectively than can userspace, only 20 lines in the Node.js JavaScript-level file system library. so using KAIO for slow files won’t just move the problem to the Adding the “safe but slow” KAIO path to node.cure required about kernel. Because KAIO scales well, Node.js could in principle use 3500 lines in libuv, of which around 3000 was an existing hash table KAIO to offload I/O requests to the kernel rather than performing implementation [20]. them in a userspace worker pool. Effectiveness We used a microbenchmark on Linux to illustrate Some low-level details about KAIO inform our design. KAIO of- the effectiveness of node.cure’s partitioning and KAIO schemes. fers an io_submit API to submit an AIO request, and an io_getevents Our benchmark launches k read requests to saturate the worker request to test for completed I/O. Though applications can repeat- pool. These requests are made against a 2MB file stored in an artifi- edly query io_getevents to test for completion, Linux will also sig- cially slow file system14. After launching the request, it attempts nal a file descriptor supplied to io_submit for use with epoll. KAIO to complete a CPU-bound workload consisting of 100 zip requests requires callers to use file descriptors opened with the O_DIRECT flag submitted serially. These requests are also offloaded to the worker (i.e. bypass caches), and I/O request offsets, sizes, and buffers must pool, where they compete for worker processing time with the be aligned on sector boundaries. ongoing slow read requests. Performing all I/O with KAIO would improve overall through- We evaluated this benchmark using readFile against a slow 2MB put, by letting node processes saturate the IO layer even with a file. In Figure 7 you can see that node.cure (partitioned and with the small worker pool (no limitation to k concurrent I/O requests). KAIO slow path) detects the slow file, moving the k read requests to However, doing so would come at a cost: it would dramatically KAIO and allowing the zip workload to complete quickly. With only increase the average response time for requests that require I/O. read partitioning, node.cure allows the zip requests to interleave Average response times would rise because KAIO requires use of with the partitioned slow reads, and the zip workload is able to make the O_DIRECT flag, which bypasses caching on both write and, consistent progress. This is the same performance that would be critically, read. In consequence, a KAIO-only solution would of- seen by an application using vanilla Node.js and the readFileStream fer throughput and EHP-proofing but dramatically increase the API. We also evaluated vanilla Node.js on its normal I/O path for response time for requests. We observe, however, that the EHP con- readFile15. Because the read is not partitioned, one of the workers cern arises only when Large or Slow Files are accessed. Addressing Large Files is straightforward. At the libuv level, we can detect and 14We enforced this file system’s 100 KB/second I/O rate by creating a file systemona defer only slow I/Os to KAIO. When deciding to use KAIO, we Network Block Device whose server is gated by the trickle utility [8]. 15To eliminate caching effects, we modified vanilla Node.js to use O_DIRECT. This 13Windows also offers operating system-level asynchronous I/O, but we did not extend ensured that the I/O performance was as slow as the underlying storage system — and our prototype there. also illustrates why using O_DIRECT is not a good idea by default. 10 Anonymous submission #23 to ACM CCS 2017 must complete its entire read before the zip workload’s tasks will 7 Discussion have a turn. At around 15 seconds one of the reads completes, after Insecure Fast Path, Secure Slow Path which the zip workload can complete. The software development community has long favored perfor- Figure 7 teaches us a key lesson of C-WCET partitioning: it is mance over security, so we were unsurprised that in both regular an all or nothing affair. Though the zip code in Node.js is perfect — expression engines and asynchronous file I/O, the “safe” approach C-WCET partitioned and offloaded to the worker pool — a series of was slower than the existing “fast” approach. For practical rea- competing ill-partitioned tasks still blocks its forward progress. sons, in both aspects of node.cure we therefore favored the speedy existing implementation until we knew that we had a potential vulnerability. In the regular expression component of node.cure this decision is made at regular expression declaration time through a query to our ReDoS oracles. In the I/O component, the decision is made based on the file I/O speeds. Disastrous Portability Software is often implemented with portability in mind. While this is an admirable goal, pursuing portability can be myopic. We found that the Node.js and libuv frameworks favor simpler implementa- tions (regex engine) and “lowest common denominator” portability (asynchronous file I/O) to reduce the size of their code bases. This Figure 7: In this figure we see the effect that slow I/O re- is a false economy. quests in Node.js have on other worker pool tasks, and We believe that in the EDA, framework developers must acknowl- the effectiveness of node.cure in combating these requests. edge EHP and provide applications with well-defined WCET where node.cure quickly detects and removes the slow I/O requests possible. Avoiding DoS attacks is well worth the cost of maintaining from the worker pool, allowing the zip workload to com- platform-specific implementations of the same functionality. While plete. Without KAIO, the zip workload suffers because it EHP can be avoided by applications that correctly sanitize user shares the worker pool with the slow I/O. input, they do not always do so. The burden falls to framework developers to address security vulnerabilities where they can. Is C-WCET partitioning practical? Attacking Our Solution node.cure’s defense against EHP can We have argued that C-WCET partitioning is needful in the EDA be attacked in a way similar to that of its defense against ReDoS to defend against EHP attacks. We also believe that it is practical at EHP attacks. In each case, an attacker who can repeatedly force the application level. The EDA community is already familiar with node.cure to “try new things” will trigger expensive “oracle queries”. developing and debugging asynchronous algorithms, and C-WCET In the context of regular expressions, the oracle query was a literal partitioning is simply an extension of this concept. query to our ReDoS oracles. In the file I/O setting, the oracle is the C-WCET partitioning requires three things of the developer: she first slow read of a slow file. The event handler performing theread must identify a partitioning of the algorithm, and then must both will be poisoned until it completes (or times out, though node.cure enforce ordering constraints and ensure her algorithm does not re- does not implement this), just as the event loop is poisoned while quire atomicity across partitions. We believe identifying a C-WCET we query the ReDoS oracles about a new regular expression. There- partitioning for typical algorithms will be straightforward, and the fore, against both of node.cure’s defenses an attacker can level an ease of sharing software using npm means that a single C-WCET ongoing attack stream of novel requests. In doing so he shifts from partitioned algorithm can be readily used in many applications. a one-time request to a “slow DoS” attack [34]. The EDA community is well equipped to enforce ordering con- Closing Thoughts With the changes in node.cure, asynchro- straints using concepts like Promises (Node.js) and Futures (Twisted). nous DNS queries are the only remaining built-in I/O-intensive Ensuring that a non-atomically-executed algorithm works correctly work performed by the Node.js worker pool. If these could be re- can be non-trivial, but JavaScript’s closures feature gives Node.js de- placed with asynchronous kernel APIs, it would free the worker velopers an easy way to privately store ongoing results. When a de- pool threads to focus on (hopefully C-WCET-partitioned) CPU- veloper makes a mistake in her C-WCET partitioning and introduces intensive tasks. Doing so would reduce the EHP attack surface of ordering or atomicity violations, she can draw on EDA-specific de- Node.js applications to CPU-intensive algorithms. Unfortunately, 16 bugging tools that help developers detect race conditions [38, 59]. the Linux kernel offers no asynchronous DNS resolution API , so At the framework level, C-WCET partitioning is harder to achieve, this idea remains only a vision. particularly in complex software like Node.js. In this work we have Our design for True EDA AIO is applicable to any EDA frame- demonstrated two modifications to the Node.js runtime to move it work that currently implements I/O by offloading it to a userspace closer to our goal of C-WCET partitioning. These were non-trivial worker pool. This includes both Twisted and libuv, and likely oth- extensions requiring a deep understanding of libraries on which ers. Since the KAIO portion is in libuv, our implementation actually Node.js relies: the V8 JavaScript engine and the libuv event-driven protects both Node.js- and libuv-based applications. framework. Though we have addressed what we view as two ma- jor sources of EHP vulnerabilities, others remain because other

16libc’s getaddrinfo_a is implemented with a threadpool. framework-level aspects of Node.js still do not provide C-WCET 11 Anonymous submission #23 to ACM CCS 2017 partitioning. These include JavaScript language constructs like ar- code into an AsyncTask, though this is inappropriate for Node.js ray and object accesses, which are O(N ) in the worst case, as well because of Node.js’s fixed-size worker pool. Work on automatic as built-in Node.js modules like OS and Util. loop parallelization [61] and Speculative execution [33] might be At the application level, there is some tension in whether devel- applicable to automatically C-WCET partitioning an algorithm. oping C-WCET algorithms is appropriate – “To write generic code, EDA Community Awareness The EDA community is aware or to write EDA-specific code?” In Node.js the answer is clear, since of the risk of DoS due to EHP on the event thread. A common rule of JavaScript is essentially only used in event-driven contexts (web thumb is “Don’t block the event loop”, advised by many tutorials as browsers and Node.js). This in turn implies that all npm modules well as recent books about EDA programming for Node.js [35, 75]. should be C-WCET partitioned, without exception. For EDA-based Wandschneider suggests worst-case linear-time partitioning on applications in other languages, e.g. Python’s Twisted and C’s libuv, the event loop [75]. Casciaro advises developers to partition any we anticipate that C-WCET-partitioned versions of popular libraries computation on the event loop, and to offload computationally ex- will need to be developed and maintained in tandem with the “nor- pensive tasks to the worker pool [35]. In practice, however, we have mal” version. found many popular modules offering computationally-expensive C-WCET only helps if tasks are of different sizes. Otherwise operations without taking these recommendations to heart ( §5.3), the throughput remains the same but the response time actually and also showed that offloading without C-WCET partitioning is goes up. This can be controlled with the partitioning size – pick or just dumping the bomb to another place. auto-tune towards a good default if requests are generally the same size. Such a size is still fine because we’re protected from poisonous 9 Conclusion outsized requests. The Event-Driven Architecture holds significant promise for scal- ing. However, EDA-based servers are vulnerable to Event Handler 7.1 Alternatives to C-WCET partitioning Poisoning DoS attacks, just like old single-threaded servers were. Alongside C-WCET partitioning, we recommend the introduction of We believe that the EDA community is incautious about this vulner- runtime-generated TimeoutExceptions into EDA implementations. ability. We have implemented two defenses against EHP attacks and A TimeoutException would abort synchronous computation after a assessed their success. The source code for our EHP-safer version specified time limit, allowing developers to use open-source mod- of Node.js can be found at BLIN DED. ules while still ensuring protection against EHP vulnerabilities. Acknowledgments 8 Related Work Here we will acknowledge those who have contributed to our work. In §2.3, we have presented previous work on AC attacks. In this References section we discuss the way in which our work relates to that of [1] Apache httpd Documentation. https://httpd.apache.org/docs/2.4/mod/. (????). others. [2] Apache HTTP Server. https://httpd.apache.org/. (????). [3] Benchmark of Regex Libraries. http://lh3lh3.users.sourceforge.net/reb.shtml. Security in the EDA Though researchers have studied secu- (????). rity vulnerabilities in different EDA frameworks, most prominently [4] Cylon.js. https://cylonjs.com/. (????). client-side applications, the security properties of the EDA itself [5] EventMachine. (????). http://rubyeventmachine.com/ [6] GitHub. https://github.com/. (????). have not been researched in detail. Some examples include security [7] Google V8 Octane benchmark suite. https://developers.google.com/octane/ studies on client-side JavaScript/Web [39, 53, 56, 62] and Java/An- benchmark. (????). droid [31, 43, 44, 52] applications. However, many of them have [8] How to simulate a slow hardisk. http://giallone.blogspot.com/2015/10/ simulate-slow-hardisk-in-linux.html. (????). focused on language issues (e.g., eval in JavaScript) or platform- [9] libuv. (????). https://github.com/libuv/libuvhttp://libuv.org/ specific issues (e.g., DOM in web browsers). We note that EDA com- [10] Module Counts. (????). http://www.modulecounts.com/ [11] Node-RED. http://nodered.org/. (????). plicates static analyses for event-driven Android software because [12] Node.js. http://nodejs.org/. (????). they often require inferring control flows across event callbacks as [13] Node.js Thread Pool Documentation. http://docs.libuv.org/en/v1.x/threadpool. shown in static information flow tracking [29, 45, 49, 76]. html. (????). [14] Projects that use libuv. https://github.com/libuv/libuv/wiki/ To the best of our knowledge, there are only few academic en- Projects-that-use-libuv. (????). gagement with security aspects of server-side EDA frameworks. [15] RE2 Regular Expression Engine. https://github.com/google/re2. (????). Ojamaa et al. [64] performed high-level survey on Node.js security, [16] Reactor. (????). https://github.com/reactor/reactor-corehttp://projectreactor.io/ [17] safe-regex. https://github.com/substack/safe-regex/. (????). and DeGroef et al. [40] studied a secure library integration. We [18] Snyk.io. https://snyk.io/vuln/. (????). believe that more academic involvement in the EDA will increase [19] Twisted. (????). https://twistedmatrix.com/trac/https://github.com/twisted/ twisted the security of many of the web services we use every day. [20] uthash. http://troydhanson.github.io/uthash/. (????). C-WCET Partitioning At their core, operating systems are also [21] Zotonic. http://zotonic.com/. (????). event-driven. As the bridge between userspace and hardware, the [22] 2007. The Calendar and Contacts Server. https://github.com/Apple/ Ccs-calendarserver. (2007). operating system is constantly responding to events from user pro- [23] 2009. The Julia Language. https://github.com/JuliaLang/julia. (2009). grams and devices. An operating system’s interrupt handlers take [24] 2009. Sage Notebook (). https://github.com/sagemath/sagenb. (2009). great care to perform minimal amounts of work in their “top half”, [25] 2010. The Rust Programming Language. https://github.com/rust-lang/rust. (2010). deferring more expensive responses to an asynchronous “bottom [26] 2012. Ubuntu One: Technical Details. https://wiki.ubuntu.com/UbuntuOne/ half” [58]. TechnicalDetails. (2012). [27] 2016. New Node.js Foundation Survey Reports New “Full Stack” In De- C-WCET partitioning in the EDA is analogous to introducing mand Among Enterprise Developers. (2016). https://nodejs.org/en/blog/ threads in a sequential program. [57] automatically moves Android announcements/nodejs-foundation-survey/ 12 Anonymous submission #23 to ACM CCS 2017

[28] Authors anonymized. 2017. The Case of the Poisoned Event Handler: Weaknesses [53] Xing Jin, Xuchao Hu, Kailiang Ying, Wenliang Du, Heng Yin, and Gautam Nagesh in the Node.js Event-Driven Architecture. In European Workshop on Systems Peri. 2014. Code Injection Attacks on HTML5-based Mobile Apps: Characteriza- Security (EuroSec). tion, Detection and Mitigation. In Proceedings of the 2014 ACM SIGSAC Conference [29] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, on Computer and Communications Security (CCS ’14). Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. 2014. Flow- [54] James Kirrage, Asiri Rathnayake, and Hayo Thielecke. 2013. Static Analysis for Droid: Precise Context, Flow, Field, Object-sensitive and Lifecycle-aware Taint Regular Expression Denial-of-Service Attacks. Network and System Security 7873 Analysis for Android Apps. In Proceedings of the 35th ACM SIGPLAN Conference (2013), 35–148. https://doi.org/10.1007/978-3-642-38631-2{_}11 on Programming Language Design and Implementation (PLDI ’14). [55] Hugh C Lauer and Roger M Needham. 1979. On the duality of operating system [30] Bill Atkinson. 1988. Hypercard. Apple Computer. structures. ACM SIGOPS Operating Systems Review 13, 2 (1979), 3–19. https: [31] David Barrera, H. Gunes Kayacik, Paul C. van Oorschot, and Anil Somayaji. 2010. //doi.org/10.1145/850657.850658 A Methodology for Empirical Analysis of Permission-based Security Models [56] Sebastian Lekies, Ben Stock, and Martin Johns. 2013. 25 Million Flows Later: and Its Application to Android. In Proceedings of the 17th ACM Conference on Large-scale Detection of DOM-based XSS. In Proceedings of the 2013 ACM SIGSAC Computer and Communications Security (CCS ’10). Conference on Computer & Communications Security (CCS ’13). [32] Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli. 2012. Fog [57] Yu Lin, Cosmin Radoi, and Danny Dig. 2014. Retrofitting Concurrency for Computing and Its Role in the Internet of Things. In Mobile Cloud Computing Android Applications through Refactoring. In ACM International Symposium (MCC). https://doi.org/10.1145/2342509.2342513 on Foundations of Software Engineering (FSE). https://doi.org/10.1145/2635868. [33] Matthew J Bridges, Thomas Jablin, and David I August. 2007. Revisiting the 2635903 sequential programming model for the mutlticore era. Micro 07 (2007), 12–20. [58] Robert Love. 2005. Linux Kernel Development. Novell Press. https://doi.org/10.1109/MM.2008.13 [59] Magnus Madsen, Frank Tip, and Ondrej Lhotak. 2015. Static analysis of event- [34] Enrico Cambiaso, Gianluca Papaleo, and Maurizio Aiello. 2012. Taxonomy of driven Node. js JavaScript applications. In ACM SIGPLAN International Conference Slow DoS Attacks to Web Applications. Recent Trends in Computer Networks and on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). Distributed Systems Security (2012), 195–204. ACM. [35] Mario Casciaro. 2014. Node.js Design Patterns (1 ed.). 454 pages. https://doi.org/ [60] M D Mcilroy. 1999. Killer adversary for quicksort. Software - Practice and Experi- 10.1002/ejoc.201200111 ence 29, 4 (1999), 341–344. https://doi.org/10.1002/(SICI)1097-024X(19990410)29: [36] Russ Cox. 2007. Regular Expression Matching Can Be Simple And Fast (but is 4<341::AID-SPE237>3.0.CO;2-R slow in Java, Perl, PHP, Python, Ruby, ...). (2007). https://swtch.com/~rsc/regexp/ [61] Mojtaba Mehrara, Po-Chun Hsu, Mehrzad Samadi, and Scott Mahlke. 2011. Dy- regexp1.html namic Parallelization of JavaScript Applications Using an Ultra-lightweight Spec- [37] Scott A Crosby and Dan S Wallach. 2003. Denial of Service via Algorithmic ulation Mechanism. In Proceedings of the 2011 IEEE 17th International Symposium Complexity Attacks. In USENIX Security. http://www.usenix.org/events/sec03/ on High Performance Computer Architecture (HPCA ’11). tech/full_papers/crosby/crosby.pdf [62] Nick Nikiforakis, Luca Invernizzi, Alexandros Kapravelos, Steven Van Acker, [38] James Davis, Arun Thekumparampil, and Dongyoon Lee. 2017. Node.fz: Fuzzing Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. 2012. the Server-Side Event-Driven Architecture. In European Conference on Computer You Are What You Include: Large-scale Evaluation of Remote Javascript Inclu- Systems (EuroSys). sions. In Proceedings of the 2012 ACM Conference on Computer and Communica- [39] Willem De Groef, Dominique Devriese, Nick Nikiforakis, and Frank Piessens. tions Security (CCS ’12). 2012. FlowFox: A Web Browser with Flexible and Precise Information Flow [63] J O’Dell. 2011. Exclusive: How LinkedIn used Node.js and HTML5 to build a Control (CCS ’12). better, faster app. (2011). http://venturebeat.com/2011/08/16/linkedin-node/ [40] Willem De Groef, Fabio Massacci, and Frank Piessens. 2014. NodeSentry: Least- [64] A Ojamaa and K Duuna. 2012. Assessing the security of Node.js platform. In privilege Library Integration for Server-side JavaScript. In Proceedings of the 30th 7th International Conference for Internet Technology and Secured Transactions Annual Computer Security Applications Conference (ACSAC ’14). (ICITST). 348–355. [41] Ankush Desai, Vivek Gupta, Ethan Jackson, Shaz Qadeer, Sriram Rajamani, and [65] Senthil Padmanabhan. 2013. How We Built eBayâĂŹs First Node.js Damien Zufferey. 2013. P: Safe asynchronous event-driven programming. In Application. (2013). http://www.ebaytechblog.com/2013/05/17/ ACM SIGPLAN Conference on Programming Language Design and Implementation how-we-built-ebays-first-node-js-application/ (PLDI). https://doi.org/10.1145/2462156.2462184 [66] Vivek S Pai, Peter Druschel, and Willy Zwaenepoel. 1999. Flash: An Efficient [42] Stephen C. Drye and William C. Wake. 1999. Java Swing reference. Manning and Portable Web Server. In USENIX Annual Technical Conference (ATC). https: PUblications Company. //doi.org/10.1.1.119.6738 [43] William Enck, Damien Octeau, Patrick McDaniel, and Swarat Chaudhuri. 2011. [67] David Pariag, Tim Brecht, Ashif Harji, Peter Buhr, Amol Shukla, and David R A Study of Android Application Security. In Proceedings of the 20th USENIX Cheriton. 2007. Comparing the performance of web server architectures. In Conference on Security (SEC’11). European Conference on Computer Systems (EuroSys), Vol. 41. ACM, 231–243. [44] William Enck, Machigar Ongtang, and Patrick McDaniel. 2009. Understanding [68] Asiri Rathnayake and Hayo Thielecke. 2014. Static Analysis for Regular Ex- Android Security. IEEE Security and Privacy (2009). pression Exponential Runtime via Substructural Logics. CoRR (2014). http: [45] Yu Feng, Saswat Anand, Isil Dillig, and Alex Aiken. 2014. Apposcopy: Semantics- //arxiv.org/abs/1405.7058 based Detection of Android Malware Through Static Analysis. In Proceedings [69] Rick Rogers, John Lombardo, Zigurd Mednieks, and Blake Meike. 2009. Android of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software application development: Programming with the Google SDK (1 ed.). O’Reilly Engineering (FSE 2014). Media, Inc. [46] Steve Ferg. 2006. Event-driven programming: introduction, tutorial, his- [70] Alex Roichman and Adar Weidman. 2009. VAC - ReDoS Regular Expression tory. (2006), 59 pages. http://parallels.nsu.ru/~fat/pthreadppt/event_driven_ Denial Of Service. OWASP (2009). https://www.checkmarx.com/white_papers/ programming.doc redos-regular-expression-denial-of-service/ [47] Keheliya Gallaba, Ali Mesbah, and Ivan Beschastnikh. 2015. Don’t Call Us, We’ll [71] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. 2012. Operating Call You: Characterizing Callbacks in Javascript. In International Symposium System Concepts (9th ed.). Wiley Publishing. on Empirical Software Engineering and Measurement (ESEM). https://doi.org/10. [72] Bryan Sullivan. 2010. New Tool: SDL Regex Fuzzer. (2010). https://blogs. 1109/ESEM.2015.7321196 microsoft.com/microsoftsecure/2010/10/12/new-tool-sdl-regex-fuzzer/ [48] Danny Goodman and Paula Ferguson. 1998. Dynamic HTML: The Definitive [73] Bryan Sullivan. 2010. Security Briefs - Regular Expression Denial of Service Reference (1 ed.). O’Reilly. Attacks and Defenses. (2010). https://msdn.microsoft.com/en-us/magazine/ [49] Michael I. Gordon, Deokhwan Kim, Jeff Perkins, Limei Gilhamy, Nguyen ff646973.aspx Nguyenz, , and Martin Rinard. Information Flow Analysis of Android Applica- [74] Richard E Sweet. 1985. The Mesa programming environment. ACM SIGPLAN tions in DroidSafe.. In Network and Distributed System Security (NDSS) Symposium Notices 20, 7 (1985), 216–229. https://doi.org/10.1145/17919.806843 (NDSS 2015). [75] Marc Wandschneider. 2013. Learning Node. js: A Hands-on Guide to Building Web [50] Jeff Harrell. 2013. Node.js at PayPal. (2013). https://www.paypal-engineering. Applications in JavaScript. Pearson Education. com/2013/11/22/node-js-at-paypal/ [76] Fengguo Wei, Sankardas Roy, Xinming Ou, and Robby. 2014. Amandroid: A [51] Kasaburo Hashiguchi. 1988. Algorithms for determining relative star height Precise and General Inter-component Data Flow Analysis Framework for Security and star height. Information and Computation 78, 2 (1988), 124–169. https: Vetting of Android Apps. In Proceedings of the 2014 ACM SIGSAC Conference on //doi.org/10.1016/0304-3975(91)90269-8 Computer and Communications Security (CCS ’14). [52] Stephan Heuser, Adwait Nadkarni, William Enck, and Ahmad-Reza Sadeghi. 2014. [77] Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA : An Architecture for ASM: A Programmable Interface for Extending Android Security. In Proceedings Well-Conditioned, Scalable Internet Services. In Symposium on Operating Systems of the 23rd USENIX Conference on Security Symposium (SEC’14). Principles (SOSP). https://doi.org/10.1145/502059.502057

13 Anonymous submission #23 to ACM CCS 2017

[78] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heck- mann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenstrom. 2008. The Worst-Case Execution-Time Problem - Overview of Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems (TECS) 7, 3 (2008), 36. https://doi.org/10.1145/1347375.1347389 [79] Erik Wittern, Philippe Suter, and Shriram Rajagopalan. 2016. A look at the dynamics of the JavaScript package ecosystem. In International Conference on Mining Software Repositories (MSR). https://doi.org/10.1145/2901739.2901743 [80] Zhenyu Wu, Mengjun Xie, and Haining Wang. 2011. Energy Attack on Server Systems. In USENIX Workshop on Offensive Technologies (WOOT).

14