Dependence analysis in PHP
Master's thesis

Merijn Wijngaard
[email protected]
August 15, 2016, 64 pages

Supervisor: M. Hills & V. Zaytsev
Host organisation: Moxio BV, http://www.moxio.com
Host supervisor: A. Boks

Universiteit van Amsterdam
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering
http://www.software-engineering-amsterdam.nl

Abstract

Dependence analysis evaluates operations of a program to produce execution order constraints, which embody the semantics of the program. Constraints can be 'control' dependences, which imply that an operation controls whether another operation is executed, or 'data' dependences, which imply that two operations access the same resource. By combining all control and data dependences in a single procedure in a program into a graph we get a Program Dependence Graph (PDG). For multi-procedure programs we can combine the PDGs of individual procedures with a call graph to generate a System Dependence Graph (SDG). Common use cases of the PDG and SDG include optimization, slicing, and semantic clone detection.

Dependence analysis has been extensively researched in simple procedural code, but there is currently no PDG and SDG library available for a dynamic language such as PHP. While dynamic features of PHP may cause a loss of precision in the generated dependences, we believe that partially correct results can also be of value. In this thesis we therefore study how we can implement dependence analysis for PHP, and how effectively we can apply it in practice.

We analysed the effects of PHP language features and their practical usage, developed strategies for handling some problematic features, implemented a PDG and SDG library, and analysed its effectiveness. We show that several of PHP's language features are challenging to model in dependence analysis but rarely used.
We demonstrated that in real-world PHP projects the vast majority of the control and data dependences that we studied can be supported, and that the accuracy of the dependences we evaluated is very high. We believe we have shown that in general the partial accuracy of dependences that we are able to achieve in PHP is high enough for dependence analysis to be useful. To further support this claim we demonstrated the practical use of our library by using it to implement an interprocedural slicing tool. In doing so we believe we have shown that dependence analysis is applicable to PHP.

Contents

1 Introduction
  1.1 PHP
  1.2 Problem statement
    1.2.1 Research questions
    1.2.2 Scope
    1.2.3 Research method
  1.3 Corpus
  1.4 Contributions
  1.5 Outline
2 Background
  2.1 Dependence analysis
  2.2 PHP
3 Language analysis
  3.1 Control dependence
  3.2 Data dependence
    3.2.1 Aliasing
    3.2.2 Dynamic features
    3.2.3 Scoping
  3.3 Call dependence
    3.3.1 Object-orientation & type system
    3.3.2 Dynamic features
  3.4 Summary
4 PDG implementation
  4.1 Libraries
    4.1.1 PHP-CFG
    4.1.2 Graph
  4.2 Implementation
    4.2.1 Control dependences
    4.2.2 Data dependences
    4.2.3 Combined
    4.2.4 Extensibility
5 SDG implementation
  5.1 Libraries
    5.1.1 PHP-Types
  5.2 Implementation
    5.2.1 Function calls
    5.2.2 Method calls
    5.2.3 Overloading
    5.2.4 Extensibility
6 Validation
  6.1 Analysis application
  6.2 Results
    6.2.1 Control dependence
    6.2.2 Data dependence
    6.2.3 Call dependence
7 Use case
  7.1 Example
  7.2 Implementation
8 Evaluation
  8.1 Research questions
  8.2 Threats to validity
9 Conclusion
  9.1 Future work
Bibliography
A Graph library
B PDG library
C SDG library

Chapter 1

Introduction

Dependence analysis is a form of static analysis that evaluates operations of a program to produce execution order constraints. These constraints embody the semantics of a program. Two programs consisting of the same operands and operations and identical constraints can be considered semantically equivalent, regardless of the order of their operations in the source code. There are two kinds of constraints, namely 'control' and 'data' dependences.
A control dependence implies that an operation controls whether or not another operation is executed. A data dependence implies that two operations access the same shared resource.

Figure 1.1 shows an example of a control dependence. The echo operation on line 4 has a control dependence on the if operation on line 3, because the if operation controls whether or not the echo operation is executed. If we were to apply a code transformation that moves the execution of the echo operation before the if operation, then the semantics of the program would change, because the echo operation would always be executed instead of being conditional on the value of $i.

    1  <?php
    2
    3  if ($i > 1) {
    4      echo "foo"; // control dependence on line 3
    5  }

    Figure 1.1: Control dependence

Figure 1.2 shows an example of a data dependence. The echo operation on line 4 has a data dependence on the assignment operation on line 3. If we were to reorder the two operations in execution, then the value of $a would most likely be undefined, thus changing the semantics of the program.

    1  <?php
    2
    3  $a = "foo";
    4  echo $a; // data dependence on line 3

    Figure 1.2: Data dependence

Figure 1.3 shows an example of two programs that have different source code (lines 3 and 4 are switched) but the same dependence structure. These programs are semantically equivalent.

    (a) Program 1

    1  <?php
    2
    3  $foo = "foo";
    4  $bar = "bar";
    5  echo $foo . $bar; // 'foobar'

    (b) Program 2

    1  <?php
    2
    3  $bar = "bar";
    4  $foo = "foo";
    5  echo $foo . $bar; // 'foobar'

    Figure 1.3: Programs with identical dependence structure

By combining all control and data dependences in a single procedure in a program into a graph we create a Program Dependence Graph (PDG). Many optimizing transformations can be efficiently applied using a PDG, as they essentially use the dependence structure of a program to determine which transformations retain the semantics [8]. Similarly, the PDG can also be used to determine whether two code fragments are semantic clones, by comparing their dependences [17, 19]. A final use case of the PDG is in program slicing, where the problem of determining which operations affect a slicing criterion can be reduced to a vertex reachability problem [8].

For multi-procedure programs we can combine the PDGs of individual procedures with a call graph and create a System Dependence Graph (SDG). An SDG contains 'call' and 'param' dependences between procedure calls and their definitions. Call dependences are a type of control dependence, and param dependences are a type of data dependence. The SDG can be used for interprocedural slicing [15] and clone detection [23].

Dependence analysis is common in optimizing compilers for statically compiled languages. It has been extensively studied in simple procedural code [8, 22, 15] and in static object-oriented languages like C++ [20] and Java [18, 28, 25, 5]. As far as we know, there is currently no PDG or SDG library available for a dynamic scripting language such as PHP. This is likely because dynamic languages contain features that are difficult to model in static analysis.

An example of a problematic feature of PHP is property overloading, shown in figure 1.4. By defining a __get method, classes are able to intercept property fetches to undefined properties and dynamically return a property value. To determine which operations have a call dependence on the __get method, we need to know if a property fetch refers to a property that has been defined on the object. This is complex because it requires the integration of type inference, class hierarchies, and polymorphism.

    <?php

    class Foo {
        public function __get($name) {
            return 'foo';
        }
    }

    $foo = new Foo();
    echo $foo->bar; // foo

    Figure 1.4: Overloading

While PHP may contain dynamic
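To make the property overloading problem from figure 1.4 more concrete, the following is a minimal sketch that contrasts a declared property with an undefined one on the same object. The class name Config and the property names are hypothetical and chosen only for illustration; they do not come from the thesis or its libraries.

```php
<?php

// Hypothetical class for illustration; not taken from the thesis.
class Config
{
    // Declared and accessible: fetching it never reaches __get().
    public $name = 'declared';

    // Invoked by PHP only when an undefined (or inaccessible)
    // property is read.
    public function __get($property)
    {
        return "overloaded:$property";
    }
}

$config = new Config();

// Plain property fetch: no call dependence on __get().
$declared = $config->name;

// Undefined property: PHP calls __get('missing'), so this fetch
// has a call dependence on the __get() method.
$overloaded = $config->missing;

echo $declared, "\n";
echo $overloaded, "\n";
```

Both fetches look the same syntactically. Telling them apart statically requires knowing the type of $config and which properties its class (and any parent classes) declare, which is why the analysis depends on type inference and the class hierarchy.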
