Defensive Analysis: Soundness in Extreme Conditions

Defensive Analysis: Soundness in Extreme Conditions Yannis Smaragdakis Department of Informatics, University of Athens, 15784, Greece [email protected] Abstract It is therefore rather surprising that practically no im- Virtually no realistic whole-program static analysis is sound, plemented whole-program static may-analysis is sound, i.e., i.e., guaranteed to over-approximate all program behaviors. correctly does what it claims, for realistic programs [16]. For The main cause of such unsoundness is “opaque code” fea- instance, a may-analysis should be over-approximate, but tures such as reflection, dynamic loading, and native calls. no realistic may-point-to analysis truly computes an over- We present a defensive may-point-to analysis approach, approximation of the actual points-to sets that arise during which offers soundness in the presence of opaque code. Our program execution. analysis assumes (unless informed otherwise) that native The main culprit for this apparent paradox is complex code or reflection can have arbitrary effects, and arbitrary language features, such as native (non-analyzable) code, dy- new code can be loaded. namic loading, reflection, as well as instructions such as The analysis represents points-to information only when invokedynamic. In the context of this paper, we will col- it knows it to be a safe over-approximation of dynamic be- lectively refer to these features as opaque code. havior. This allows us to perform a highly precise points-to It is tempting to attempt to dismiss opaque code as an en- analysis (such as a 5-call-site-sensitive, flow-sensitive anal- gineering complication: analyses as published do not support ysis) efficiently. We show that sound handling of opaque these features, but surely it must be a straightforward addi- code cannot be an incremental addition: the analysis logic tion to support them in the implementation? This is far from for other language features changes radically when opaque the case, however. Opaque code cannot be handled soundly code needs to be handled soundly. without a fundamentally different approach to static analy- Despite its conservative nature, our analysis yields sound, sis. Furthermore, the traditional handling of opaque code is actionable results for a large subset of the program code, not merely unsound, but also imprecise, making the analysis achieving (under worst-case assumptions) 34-74% of the much less scalable. program coverage of an unsound state-of-the-art analysis for To see this, consider the conventional handling of reflec- real-world programs. tion in may-point-to analysis algorithms for Java [6, 12– 15, 24]. The approach of past analyses has been to attempt to statically model the result of reflection operations, e.g., by computing a superset of the strings that can be used as 1. Introduction arguments to a Class.forName operation (which accepts a name string and returns a reflection object representing the Static program analysis is an established research area, with class with that name). a history tracing back to several decades and applications in This past approach is a) ineffective; b) inefficient; and c) virtually all modern programming technologies. Static anal- fundamentally hopeless for soundness. First, past analyses ysis attempts to answer questions about program behavior are not truly over-approximate. For instance, even though under all possible inputs. Therefore, virtually all interesting the analysis computes a superset of constant strings that are static program analysis questions are undecidable—indeed used as reflection arguments, when faced with a completely the prototypical undecidable problem, the halting problem, unknown string, the analysis does not assume that any class is a static program analysis question: will a program termi- object can be returned, but rather that none can. The rea- nate under all inputs? son is that over-approximation (assuming any object is re- Since static analysis attempts to solve undecidable prob- turned) would be detrimental to the analysis performance lems, practical analysis algorithms are of one of two cat- and precision. Second, even with an unsound approach, cur- egories: either the analysis under-approximates the precise rent algorithms are heavily burdened by the use of reflection answer, in which case it is called a must-analysis, or it over- analysis. For instance, the documentation of the WALA li- approximates the precise answer, in which case it is called a brary directly blames reflection analysis for scalability short- may-analysis. comings,1 and enabling reflection on the DOOP framework approach is based on a different architecture that is likely slows it down by an order of magnitude on standard bench- to be a blueprint for a variety of other sound analyses. marks [24]. Most importantly, however, no such approach • We show that the analysis, though quite defensive, can possibly ever be sound purely statically. Even if the anal- yields useful coverage. In measurements over large Java ysis could feasibly assume that any class object is returned benchmarks, our analysis computes guaranteed over- by a forName operation, this cannot account for dynami- approximate points-to sets for 34-74% of the local vari- cally generated and loaded classes—a ubiquitous feature in ables of a conventional unsound analysis. (This number is Java enterprise applications. much higher than that of a conventional sound but intra- In response to the above need, we propose defensive anal- procedural analysis.) Similar effectiveness is achieved for ysis: a static analysis architecture that aims for sound han- other metrics (e.g., number of calls de-virtualized), again dling of opaque code. To achieve this goal, defensive analy- with actionable, guaranteed-sound outcomes. Finally, our sis has a strikingly different logical form. Instead of adding analysis is efficient, due to its tight representation of sophisticated handling of opaque code, defensive analysis points-to sets: we evaluate a highly precise flow-sensitive, redefines the analysis logic for all other language features, 5-call-site-sensitive defensive analysis, which would be to defensively protect against the possibility that the analysis too heavyweight for typical past approaches. information potentially depends on opaque code. We designed and implemented a defensive may-point-to • The work brings clarity to the domain of static analysis (henceforth just “points-to”) analysis for Java. The analy- of opaque code. Our approach allows, for the first time, to sis guarantees that any points-to set it computes is over- quantitatively weigh the benefits of sound inter-procedural approximate, i.e., captures all possible dynamic behaviors. analysis against its costs and decide whether unsoundness In order to address the dual issues of efficiency and lack is tolerable, or a fully sound approach is preferable. of information about dynamic loading, the analysis represents points-to sets with an unbounded number of objects 2. Illustration of the Approach (e.g., sets that depend on reflection or dynamic loading) as Before we present our analysis in full detail, it is instructive the empty set. That is, when the analysis infers that a vari- to illustrate its behavior with representative examples. able can point to anything, it assigns it an empty points-to set. Our analysis is a may-point-to analysis based on access This choice may at first seem rather counter-intuitive, but it paths, i.e., expressions of the form “var(.fld)*”. As is com- yields several significant advantages. First, the approach ac- mon, the analysis computes the abstract objects (correspond- quires an intuitive interpretation if one views empty points- ing to allocation sites in the program text) that an access path to sets as signifying “don’t know”/”cannot bound” values: may point to. The analysis is flow-sensitive, hence we will be an empty set simply implies that the analysis has failed to computing points-to information separately for each point in upper-bound its contents. Second, this representation yields the program. very compact points-to sets, since unbounded ones are over- Examples. Consider first the case of merging information approximated implicitly (i.e., without an explicit enumera- from two different control-flow paths. When both points- tion of their contents). Third, the approach allows the core of to sets involved have known upper bounds, the defensive the analysis logic to be monotonic (“if the set includes some analysis behaves just like a traditional analysis, taking the value then ...”), allowing for its efficient implementation with union of the two sets: generic fixpoint machinery, such as a Datalog engine. In terms of applications, soundness is an absolute require- 1 if (complexCondition()) { ment for ambitious analysis clients, such as automatic opti- 2 x = new A();// abstract object a1 3 //x points-to set is{a1} mization or semantics-preserving refactoring. Such clients 4 } else { are simply not feasible with prior analysis approaches. In 5 x = new B();// abstract object b1 particular, inter-procedural automatic optimization has long 6 //x points-to set is{b1} been hindered by the absence of sound points-to analysis in- 7 } 8 //x points-to set is{a1,b1} formation. Thus, our work has significant potential for wide practical application. The difference arises when one of the two control-flow Concretely, our work makes the following contributions: paths employs opaque code. We use an imaginary call • We offer the first realistic, general static may-point-to with reflection

Load more