contributed articles
DOI:10.1145/3338112
COMMUNICATIONS OF THE ACM | AUGUST 2019 | VOL. 62 | NO. 8

Scaling Static Analyses at Facebook

Key lessons for designing static analyses tools deployed to find bugs in hundreds of millions of lines of code.

BY DINO DISTEFANO, MANUEL FÄHNDRICH, FRANCESCO LOGOZZO, AND PETER W. O’HEARN

IMAGE BY ANDRIJ BORYS ASSOCIATES, USING SHUTTERSTOCK

STATIC ANALYSIS TOOLS are programs that examine, and attempt to draw conclusions about, the source of other programs without running them. At Facebook, we have been investing in advanced static analysis tools that employ reasoning techniques similar to those from program verification. The tools we describe in this article (Infer and Zoncolan) target issues related to crashes and to the security of our services, they perform sometimes complex reasoning spanning many procedures or files, and they are integrated into engineering workflows in a way that attempts to bring value while minimizing friction.

These tools run on code modifications, participating as bots during the code review process. Infer targets our mobile apps as well as our backend C++ code, codebases with 10s of millions of lines; it has seen over 100 thousand reported issues fixed by developers before code reaches production. Zoncolan targets the 100-million lines of Hack code, and is additionally integrated in the workflow used by security engineers. It has led to thousands of fixes of security and privacy bugs, outperforming any other detection method used at Facebook for such vulnerabilities. We will describe the human and technical challenges encountered and lessons we have learned in developing and deploying these analyses.

There has been a tremendous amount of work on static analysis, both in industry and academia, and we will not attempt to survey that material here. Rather, we present our rationale for, and results from, using techniques similar to ones that might be encountered at the edge of the research literature, not only simple techniques that are much easier to make scale. Our goal is to complement other reports on industrial static analysis and formal methods,1,6,13,17 and we hope that such perspectives can provide input both to future research and to further industrial use of static analysis.

Next, we discuss the three dimensions that drive our work: bugs that matter, people, and actioned/missed bugs. The remainder of the article describes our experience developing and deploying the analyses, their impact, and the techniques that underpin our tools.

key insights
- Advanced static analysis techniques performing deep reasoning about source code can scale to large industrial codebases, for example, with 100-million LOC.
- Static analyses should strike a balance between missed bugs (false negatives) and un-actioned reports (false positives).
- A “diff time” deployment, where issues are given to developers promptly as part of code review, is important to catching bugs early and getting high fix rates.

Context for Static Analysis at Facebook

Bugs that Matter. We use static analysis to prevent bugs that would affect our products, and we rely on our engineers’ judgment as well as data from production to tell us the bugs that matter the most.

It is important for a static analysis developer to realize that not all bugs are the same: different bugs can have different levels of importance or severity depending on the context and the nature. A memory leak on a seldom-used service might not be as important as a vulnerability that would allow attackers to gain access to unauthorized information. Additionally, the frequency of a bug type can affect the decision of how important it is to go after. If a certain kind of crash, such as a null pointer error in Java, were happening hourly, then it might be more important to target than a bug of similar severity that occurs only once a year.

We have several means to collect data on the bugs that matter. First of all, Facebook maintains statistics on crashes and other errors that happen in production. Second, we have a “bug bounty” program, where people outside the company can report vulnerabilities on Facebook, or on apps of the Facebook family; for example, Messenger, Instagram, or WhatsApp. Third, we have an internal initiative for tracking the most severe bugs (SEV) that occur.

Our understanding of Bugs that Matter at Facebook drives our focus on advanced analyses. For contrast, a recent paper states: “All of the static analyses deployed widely at Google are relatively simple, although some teams work on project-specific analysis frameworks for limited domains (such as Android apps) that do interprocedural analysis”17 and they give their entirely logical reasons. Here, we explain why Facebook made the decision to deploy interprocedural analysis (spanning multiple procedures) widely.

People and deployments. While not all bugs are the same, neither are all users; therefore, we use different deployment models depending on the intended audience (that is, the people the analysis tool will be deployed to).

For classes of bugs intended for all or a wide variety of engineers on a given platform, we have gravitated toward a “diff time” deployment, where analyzers participate as bots in code review, making automatic comments when an engineer submits a code modification. Later, we recount a striking situation where the diff time deployment saw a 70% fix rate, where a more traditional “offline” or “batch” deployment (where bug lists are presented to engineers, outside their workflow) saw a 0% fix rate.

In case the intended audience is the much smaller collection of domain security experts in the company, we use two additional deployment models. At “diff time,” security-related issues are pushed to the security engineer on-call, so she can comment on an in-progress code change when necessary. Additionally, for finding all instances of a given bug in the codebase or for historical exploration, offline inspection provides a user interface for querying, filtering, and triaging all alarms.

In all cases, our deployments focus on the people our tools serve and the way they work.

Actioned reports and missed bugs. The goal of an industrial static analysis tool is to help people: at Facebook, this means the engineers, directly, and the people who use our products, indirectly. We have seen how the deployment model can influence whether a tool is successful. Two concepts we use to understand this in more detail, and to help us improve our tools, are actioned reports and observable missed bugs.

The kind of action taken as a result of a reported bug depends on the deployment model as well as the type of bug. At diff time an action is an update to the diff that removes a static analysis report. In Zoncolan’s offline deployment a report can trigger the security expert to create a task for the product engineer if the issue is important enough to follow up with the product team. Zoncolan catches more SEVs than either manual security reviews or bug bounty reports. We measured that 43.3% of the severe security bugs are detected via Zoncolan. At press time, Zoncolan’s “action rate” is above 80% and we observed about 11 “missed bugs.”

A missed bug is one that has been observed in some way, but that was not reported by an analysis. The means of observation can depend on the kind of bug. For security vulnerabilities we have bug bounty reports, security reviews, or SEV reviews. For our mobile apps we log crashes and app not-responding events that occur on mobile devices.

The actioned reports and missed bugs are related to the classic concepts of true positives and false negatives from the academic static analysis literature. A true positive is a report of a potential bug that can happen in a run of the program in question (whether or not it will happen in practice); a false positive is one that cannot happen. Common wisdom in static analysis is that it is important to keep control of the false positives because they can negatively impact engineers who use the tools, as they tend to lead to apathy toward reported alarms. This has been emphasized, for instance, in previous Communications’ articles on industrial static analysis.1,17 False negatives, on the other hand, are potentially harmful bugs that may remain undetected for a long time. An undetected bug affecting security or privacy can lead to undetected exploits. In practice, fewer false positives often (though not always) imply more false negatives, and vice versa, fewer false negatives imply more false positives. For instance, one way to rein in false positives is to fail to report when you are less than sure a bug will be real; but silencing an analysis in this way (say, by ignoring paths or by heuristic filtering) has the effect of missing bugs. And, if you want to discover and report more bugs you might also add more spurious behaviors.

The reason we are interested in advanced static analyses at Facebook might be understood in classic terms as saying: false negatives matter to us. However, it is important to note the number of false negatives is notoriously difficult to quantify (how many unknown bugs are there?). Equally, though less recognized, the false positive rate is challenging to measure for a large, rapidly changing codebase: it would be extremely time consuming for humans to judge all reports as false or true as the code is changing.

Although true positives and false negatives are valuable concepts, we don’t make claims about their rates and pay more attention to the action rate and the (observed) missed bugs.

Challenges: Speed, scale, and accuracy. A first challenge is presented by the sheer scale of Facebook’s codebases, and the rate of change they see. For the server side, we have over 100-million lines of Hack code, which Zoncolan can process in less than 30 minutes. Additionally, we have 10s of millions of lines of both mobile (Android and Objective-C) and backend C++ code. Infer processes the code modifications quickly (within 15 minutes on average) in its diff time deployment. All codebases see thousands of code modifications each day and our tools run on each code change. For Zoncolan, this can amount to analyzing one trillion lines of code (LOC) per day.

It is relatively straightforward to scale program analyses that do simple checks on a procedure-local basis only. The simplest form is linters, which give syntactic style advice (for example, “the method you called is to be deprecated, please consider rewriting”). Such simple checks provide value and are in wide deployment in major companies including Facebook; we will not comment on them further in this article. But for reasoning that goes beyond local checks, such as one would find in the academic literature on static analysis, scaling to 10s or 100s of millions of LOC is a challenge, as is the incremental scalability needed to support diff time reporting.

Figure 1. Continuous development. (Diagram labels: Developer, CI System, Code Review, Code Reviewers, CI System, Product; phases: Diff Time, Post Land.)

Infer and Zoncolan both use techniques similar to some of what one might find at the edge of the research literature. Infer, as we will discuss, uses one analysis based on the theory of Separation Logic,16 with a novel theorem prover that implements an inference technique that guesses assumptions.5 Another Infer analysis involves recently published research results on concurrency analysis.2,10 Zoncolan implements a new modular parallel taint analysis algorithm.

But how can Infer and Zoncolan scale? The core technical features they
share are compositionality and carefully crafted abstractions. For most of this article we will concentrate on what one gets from applying Infer and Zoncolan, rather than on their technical properties, but we outline their foundations later and provide more technical details in an online appendix (https://dl.acm.org/citation.cfm?doid=3338112&picked=formats).

The challenge related to accuracy is intimately related to actioned reports and missed bugs. We try to strike a balance between these issues, informed by the desires based on the class of bugs and the intended audience. The more severe a potentially missed issue is, the lower the tolerance for missed bugs. Thus, for issues that indicate a potential crash or performance regression in a mobile app such as Messenger, WhatsApp, Instagram, or Facebook, our tolerance for missed bugs is lower than for stylistic lint suggestions (for example, don’t use a deprecated method). For issues that could affect the security of our infrastructure or the privacy of the people using our products, our tolerance for false positives is higher still.

Software Development at Facebook

Facebook practices continuous software development,9 where a main codebase (master) is altered by thousands of programmers submitting code modifications (diffs). Master and diffs are the analogues of, respectively, a GitHub master branch and pull requests. The developers share access to a codebase and they land, or commit, a diff to the codebase after passing code review. A continuous integration system (CI system) is used to ensure code continues to build and passes certain tests. Analyses run on the code modification and participate by posting their findings as comments directly in the code review tool.

The Facebook website was originally written in PHP, and then ported to Hack, a gradually typed version of PHP developed at Facebook (https://hacklang.org/). The Hack codebase spans over 100 million lines. It includes the Web frontend, the internal web tools, the APIs to access the social graph from first- and third-party apps, the privacy-aware data abstractions, and the privacy control logic for viewers and apps. Mobile apps (for Facebook, Messenger, Instagram and WhatsApp) are mostly written in Objective-C and Java. C++ is the main language of choice for backend services. There are 10s of millions of lines each of mobile and backend code.

While they use the same development models, the website and mobile products are deployed differently. This affects what bugs are considered most important, and the way that bugs can be fixed. For the website, Facebook directly deploys new code to its own datacenters, and bug fixes can be shipped directly to our datacenters frequently, several times daily and immediately when necessary. For the mobile apps, Facebook relies on people to download new versions from the Android or the Apple store; new versions are shipped weekly, but mobile bugs are less under our control because even if a fix is shipped it might not be downloaded to some people’s phones.

Common runtime errors (for example, null pointer exceptions or division by zero) are more difficult to get fixed on mobile than on the server. On the other hand, server-side security and privacy bugs can severely impact both the users of the Web version of Facebook and our mobile users, since the privacy checks are performed on the server side. As a consequence, Facebook invests in tools to make the mobile apps more reliable and server-side code more secure.

Moving Fast with Infer

Infer is a static analysis tool applied to Java, Objective-C, and C++ code at Facebook.4 It reports errors related to memory safety, to concurrency, to security (information flow), and many more specialized errors suggested by Facebook developers. Infer is run internally on the Android and iOS apps for Facebook, Instagram, Messenger, and WhatsApp, as well as on our backend C++ and Java code.

Infer has its roots in academic research on program analysis with separation logic,5 research that led to a startup company (Monoidics Ltd.) that was acquired by Facebook in 2013. Infer was open sourced in 2015 (www.fbinfer.com) and is used at Amazon, Spotify, Mozilla, and other companies.

Diff-time continuous reasoning. Infer’s main deployment model is based on fast incremental analysis of code changes. When a diff is submitted to code review an instance of Infer is run
in Facebook’s internal CI system (Figure 1). Infer does not need to process the entire codebase in order to analyze a diff, and so is fast.

An aim has been for Infer to run in 15–20 minutes on a diff on average, and this includes time to check out the source repository, to build the diff, and to run on base and (possibly) parent commits. It has typically done so, but we constantly monitor performance to detect regressions that make it take longer, in which case we work to bring the running time back down. After running on a diff, Infer then writes comments to the code review system. In the default mode used most often it reports only regressions: new issues introduced by a diff. The “new” issues are calculated using a bug equivalence notion that uses a hash involving the bug type and location-independent information about the error message, and which is insensitive to file moves and line number changes caused by refactoring, deleting, or adding code; the aim is to avoid presenting warnings that developers might regard as pre-existing. Fast reporting is important to keep in tune with the developers’ workflows. In contrast, when Infer is run in whole-program mode it can take more than an hour (depending on the app), which is too slow for diff time at Facebook.

Human factors. The significance of the diff-time reasoning of Infer is best understood by contrast with a failure. The first deployment was batch rather than continuous. In this mode Infer would be run once per night on the entire Facebook Android codebase, and it would generate a list of issues. We manually looked at the issues, and assigned them to the developers we thought best able to resolve them.

The response was stunning: we were greeted by near silence. We assigned 20–30 issues to developers, and almost none of them were acted on. We had worked hard to get the false positive rate down to what we thought was less than 20%, and yet the fix rate (the proportion of reported issues that developers resolved) was near zero.

Next, we switched Infer on at diff time. The response of engineers was just as stunning: the fix rate rocketed to over 70%. The same program analysis, with the same false positive rate, had much greater impact when deployed at diff time.

While this situation was surprising to the static analysis experts on the Infer team, it came as no surprise to Facebook’s developers. Explanations they offered us may be summarized in the following terms:

One problem that diff-time deployment addresses is the mental effort of context switch. If a developer is working on one problem, and they are confronted with a report on a separate problem, then they must swap out the mental context of the first problem and swap in the second, and this can be time consuming and disruptive. By participating as a bot in code review, the context switch problem is largely solved: programmers come to the review tool to discuss their code with human reviewers, with mental context already swapped in. This also illustrates how important timeliness is: if a bot were to run for an hour or more on a diff it could be too late to participate effectively.

A second problem that diff-time deployment addresses is relevance. When an issue is discovered in the codebase, it can be nontrivial to assign it to the right person. In the extreme, somebody who has left the company might have caused the issue. Furthermore, even if you think you have found someone familiar with the codebase, the issue might not be relevant to any of their past or current work. But, if we comment on a diff that introduces an issue then there is a pretty good (but not perfect) chance that it is relevant.

Mental context switch has been the subject of psychological studies,12 and it is, along with the importance of relevance, part of the received collective wisdom impressed upon us by Facebook’s engineers. Note that others have also remarked on the benefits of reporting during code review.17

At Facebook, we are working actively on moving other testing technologies to diff time when possible. We are also supporting academics on researching incremental fuzzing and symbolic execution techniques for diff time reporting.

Interprocedural bugs. Many of the bugs that Infer finds involve reasoning that spans multiple procedures or files. An example from OpenSSL illustrates:

    apps/ca.c:2780: NULL_DEREFERENCE
    pointer ‘revtm’ last assigned on line 2778 could be null
    and is dereferenced at line 2780, column 6
    2778. revtm = X509_gmtime_adj(NULL, 0);
    2779.
    2780. i = revtm->length + 1;

The issue is that the procedure X509_gmtime_adj() can return null in some circumstances. Overall, the error trace found by Infer has 61 steps, and the source of null, the call to X509_gmtime_adj(), goes five procedures deep and it eventually encounters a return of null at call-depth 4. This bug was one of 15 that we reported to OpenSSL which were all fixed.

Infer finds this bug by performing compositional reasoning, which allows covering interprocedural bugs while still scaling to millions of LOC. It deduces a precondition/postcondition specification approximating the behavior of X509_gmtime_adj, and then uses that specification when reasoning about its calls. The specification includes 0 as one of the return values, and this triggers the error.

In 2017, we looked at bug fixes in several categories and found that for some (null dereferences, data races, and security issues) over 50% of the fixes were for bugs with traces that were interprocedural.a The interprocedural bugs would be missed bugs if we only deployed procedure-local analyses.

“Advanced static analyses, like those found in the research literature, can be deployed at scale and deliver value for general code.”

Concurrency. A concurrency capability recently added to Infer, the RacerD analysis, provides an example of the benefit of feedback between program analysis researchers and product engineers.2,15 Development of the analysis started in early 2016, motivated by Concurrent Separation Logic.3 After 10 months of work on the project, engineers from News Feed on Android caught wind of what we were doing and reached out. They were planning to convert part of Facebook’s Android app from a sequential to a multithreaded architecture. Hundreds of classes written for a single-threaded architecture had to be used now in a concurrent context: the transformation could introduce concurrency errors. They asked for interprocedural capabilities because Android UI is arranged in trees with one class per node. Races could happen via interprocedural call chains sometimes spanning several classes, and mutations almost never happened at the top level: procedure-local analysis would miss most races.

We had been planning to launch the proof tool we were working on in a year’s time, but the Android engineers were starting their project and needed help sooner. So we pivoted to a minimum viable product, which would serve the engineers: it had to be fast, with actionable reports, and not too many missed bugs on product code (but not on infrastructure code).2,15 The tool borrowed ideas from concurrent separation logic, but we gave up on the ideal of proving absolute race freedom. Instead, we established a ‘completeness’ theorem saying that, under certain assumptions, a theoretical variant of the analyzer reports only true positives.10

The analysis checks for data races in Java programs: two concurrent memory accesses, one of which is a write. The example in Figure 2 (top) illustrates: If we run Infer on this code it doesn’t find a problem. The unprotected read and the protected write do not race because they are on the same thread. But, if we include additional methods that do conflict, then Infer will report races, as in Figure 2, bottom.

Figure 2. A simple example capturing a common safety pattern used in Android apps. Threading information is used to limit the amount of synchronization required. As a comment from the original code explains: “mCount is written to only by the main thread with the lock held, read from the main thread with no lock held, or read from any other thread with the lock held.” Bottom: unsafe additions to RaceWithMainThread.java.

Impact. Since 2014, Facebook’s developers have resolved over 100,000 issues flagged by Infer. The majority of Infer’s impact comes from the diff-time deployment, but it is also run batch to track issues in master, issues addressed in fixathons and other periodic initiatives.

The RacerD data race detector saw over 2,500 fixes in the year to March 2018. It supported the conversion of Facebook’s Android app from a single-threaded to a multithreaded architecture by searching for potential data races, without the programmers needing to insert annotations for saying which pieces of memory are guarded by what locks. This conversion led to an improvement in scroll performance and, speaking about the role of the analyzer, Benjamin Jaeger, an Android engineer at Facebook, stated:b “without Infer, multithreading in News Feed would not have been tenable.” As of March 2018, no Android data race bugs missed by Infer had been observed in the previous year (modulo 3 analyzer implementation errors).2

The fix rate for the concurrency analysis to March 2018 was roughly 50%, lower than for the previous general diff analysis. Our developers have emphasized that they appreciate the reports because concurrency errors are difficult to debug. This illustrates our earlier points about balancing action rates and bug severity. See Blackshear et al.2 for more discussion on fix rates.

a https://bit.ly/2WloBVj
b https://bit.ly/2xurbMl
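The code for Figure 2 did not survive in this copy, so the following is only a hedged sketch of the pattern its caption describes, not the figure itself. The class name RaceWithMainThread and the field mCount come from the caption; every method name, the assertMainThread helper, and the choice to treat the constructing thread as the "main thread" are assumptions made for illustration. No claim is made about what Infer would print for this exact code.

```java
// Hypothetical sketch of the pattern in the Figure 2 caption; not the figure's code.
final class RaceWithMainThread {
    // Assumption: the thread that constructs the object stands in for the
    // Android main thread. A real app would use a framework-level check.
    private final Thread mainThread = Thread.currentThread();
    private int mCount;

    private void assertMainThread() {
        if (Thread.currentThread() != mainThread) {
            throw new IllegalStateException("expected to run on the main thread");
        }
    }

    // Top of the figure (safe): written only on the main thread, lock held.
    void protectedWriteOnMainThread() {
        assertMainThread();
        synchronized (this) {
            mCount++;
        }
    }

    // Safe: an unlocked read on the main thread cannot overlap a write,
    // because every write above is confined to that same thread.
    int unprotectedReadOnMainThread() {
        assertMainThread();
        return mCount;
    }

    // Safe: reads from any other thread take the lock that guards writes.
    synchronized int protectedReadOffMainThread() {
        return mCount;
    }

    // Bottom of the figure (unsafe addition): a locked write from an arbitrary
    // thread can run concurrently with unprotectedReadOnMainThread().
    synchronized void protectedWriteOffMainThread() {
        mCount = 2;
    }

    // Unsafe addition: an unlocked read from an arbitrary thread can run
    // concurrently with either write.
    int unprotectedReadOffMainThread() {
        return mCount;
    }
}
```

The first three methods follow the discipline quoted in the caption, so thread confinement alone makes the unlocked read safe. The last two break that discipline, which is the kind of conflict the text says would be reported as a race.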
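The regression-only reporting described under diff-time continuous reasoning relies on a bug equivalence hash built from the bug type and location-independent message information. The sketch below shows one way such a key could be computed and used to keep only issues that are new relative to a base commit. It is an illustration under stated assumptions, not Infer's actual scheme: the Report fields, the inclusion of the procedure name, the SHA-256 choice, and the regular expression that blanks out line and column numbers are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a location-independent bug key; not Infer's real hash.
final class BugHash {
    // Assumed report shape, invented for this example.
    static final class Report {
        final String bugType;
        final String procedure;
        final String file;
        final int line;
        final String message;

        Report(String bugType, String procedure, String file, int line, String message) {
            this.bugType = bugType;
            this.procedure = procedure;
            this.file = file;
            this.line = line;
            this.message = message;
        }
    }

    // The key deliberately ignores file and line, and blanks out line/column
    // numbers inside the message, so file moves and line shifts do not change it.
    static String key(Report r) {
        String normalized = r.message.replaceAll("\\b(line|column)\\s+\\d+", "$1 N");
        String material = r.bugType + "|" + r.procedure + "|" + normalized;
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(material.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable", e);
        }
    }

    // Keep only reports on the diff whose key was not already present on the base.
    static List<Report> newIssues(List<Report> base, List<Report> afterDiff) {
        Set<String> known = new HashSet<>();
        for (Report r : base) {
            known.add(key(r));
        }
        List<Report> fresh = new ArrayList<>();
        for (Report r : afterDiff) {
            if (!known.contains(key(r))) {
                fresh.add(r);
            }
        }
        return fresh;
    }
}
```

A key this coarse trades precision for stability: two distinct bugs of the same type, in the same procedure, with the same normalized message would collapse into one, so a production scheme would likely need more location-independent context than this sketch uses.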
Overall, Infer reports on over 30 types of issues, ranging from deep inter-procedural checks to simple procedure-local checks and lint rules. Concurrency support includes checks for deadlocks

to enable more powerful analysis of the core Facebook codebase. Zoncolan is the static analysis tool we built to find code and data paths that may cause a security or a privacy violation

of the user’s account. The root cause of the vulnerability in Figure 3 is that the attacker controls the value of the member_id variable which is used in the action field of the