Dependence analysis in PHP
Master’s thesis

Merijn Wijngaard

August 15, 2016, 64 pages

Supervisor: M. Hills & V. Zaytsev
Host organisation: Moxio BV, http://www.moxio.com
Host supervisor: A. Boks

Universiteit van Amsterdam
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering
http://www.software-engineering-amsterdam.nl

Abstract

Dependence analysis evaluates operations of a program to produce execution order constraints, which embody the semantics of the program. Constraints can be ‘control’ dependences, which imply that an operation controls whether another operation is executed, or ‘data’ dependences, which imply that two operations access the same resource. By combining all control and data dependences in a single procedure in a program into a graph we get a Program Dependence Graph (PDG). For multi-procedure programs we can combine the PDGs of individual procedures with a call graph to generate a System Dependence Graph (SDG). Common use cases of the PDG and SDG include optimization, slicing, and semantic clone detection.

Dependence analysis has been extensively researched in simple procedural code, but there is currently no PDG and SDG library available for a dynamic language such as PHP. While dynamic features of PHP may cause a loss of precision in the generated dependences, we believe that partially correct results can also be of value. In this thesis we therefore study how we can implement dependence analysis for PHP, and how effectively we can apply it in practice.

We analysed the effects of PHP language features and their practical usage, developed strategies for handling some problematic features, implemented a PDG and SDG library, and analysed its effectiveness. We show that several of PHP’s language features are challenging to model in dependence analysis but rarely used. We demonstrated that in real-world PHP projects the vast majority of the control and data dependences that we studied can be supported, and that the accuracy of the dependences we evaluated is very high. We believe we have shown that in general the partial accuracy of dependences that we are able to achieve in PHP is high enough for dependence analysis to be useful. To further support this claim we demonstrated the practical use of our library by using it to implement an interprocedural slicing tool. In doing so we believe we have shown that dependence analysis is applicable to PHP.

Contents

1 Introduction
  1.1 PHP
  1.2 Problem statement
    1.2.1 Research questions
    1.2.2 Scope
    1.2.3 Research method
  1.3 Corpus
  1.4 Contributions
  1.5 Outline

2 Background
  2.1 Dependence analysis
  2.2 PHP

3 Language analysis
  3.1 Control dependence
  3.2 Data dependence
    3.2.1 Aliasing
    3.2.2 Dynamic features
    3.2.3 Scoping
  3.3 Call dependence
    3.3.1 Object-orientation & type system
    3.3.2 Dynamic features
  3.4 Summary

4 PDG implementation
  4.1 Libraries
    4.1.1 PHP-CFG
    4.1.2 Graph
  4.2 Implementation
    4.2.1 Control dependences
    4.2.2 Data dependences
    4.2.3 Combined
    4.2.4 Extensibility

5 SDG implementation
  5.1 Libraries
    5.1.1 PHP-Types
  5.2 Implementation
    5.2.1 Function calls
    5.2.2 Method calls
    5.2.3 Overloading
    5.2.4 Extensibility

6 Validation
  6.1 Analysis application
  6.2 Results
    6.2.1 Control dependence
    6.2.2 Data dependence
    6.2.3 Call dependence

7 Use case
  7.1 Example
  7.2 Implementation

8 Evaluation
  8.1 Research questions
  8.2 Threats to validity

9 Conclusion
  9.1 Future work

Bibliography

A Graph library

B PDG library

C SDG library

Chapter 1

Introduction

Dependence analysis is a form of static analysis that evaluates operations of a program to produce execution order constraints. These constraints embody the semantics of a program. Two programs consisting of the same operands and operations and identical constraints can be considered semantically equivalent, regardless of the order of their operations in the source code. There are two kinds of constraints, namely ‘control’ and ‘data’ dependences. A control dependence implies that an operation controls whether or not another operation is executed. A data dependence implies that two operations access the same shared resource.

Figure 1.1 shows an example of a control dependence. The echo operation on line 4 has a control dependence on the if operation on line 3, because the if operation controls whether or not the echo operation is executed. If we were to apply a code transformation that moves the echo operation before the if operation, then the semantics of the program would change: the echo operation would always be executed, instead of being conditional on the value of $i.

1 <?php
2 ...
3 if ($i > 1) {
4     echo "foo"; // control dependence on line 3
5 }

Figure 1.1: Control dependence

Figure 1.2 shows an example of a data dependence. The echo operation on line 4 has a data dependence on the assignment operation on line 3. If we were to reorder the two operations in execution, then the value of $a would most likely be undefined, thus changing the semantics of the program.


Figure 1.2: Data dependence
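A data dependence of the kind described above can be sketched as follows. This is a minimal illustration, not the original listing: only the variable name $a follows the text, while the value "foo" and the two function wrappers are assumptions made so that both orderings can be compared.

```php
<?php
// Hypothetical example: the value "foo" and the function names are assumptions.

function ordered(): string
{
    $a = "foo";   // definition of $a
    return $a;    // use of $a: data dependence on the assignment above
}

function reordered(): ?string
{
    $result = $a ?? null;   // use before definition: $a is not yet set
    $a = "foo";             // the definition now comes too late
    return $result;
}
```

Calling ordered() yields the assigned value, whereas reordered() reads $a before it is defined. Swapping the two operations therefore changes the observable result, which is exactly what the data dependence forbids.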

Figure 1.3 shows an example of two programs that have different source code (lines 3 and 4 are switched) but the same dependence structure. These programs are semantically equivalent.

(a) Program 1 (b) Program 2

Figure 1.3: Programs with identical dependence structure

By combining all control and data dependences in a single procedure in a program into a graph we create a Program Dependence Graph (PDG). Many optimizing transformations can be efficiently applied using a PDG, as they essentially use the dependence structure of a program to determine which transformations retain the semantics [8]. Similarly, the PDG can also be used to determine whether two code fragments are semantic clones, by comparing their dependences [17, 19]. A final use case of the PDG is in program slicing, where the problem of determining which operations affect a slicing criterion can be reduced to a vertex reachability problem [8].

For multi-procedure programs we can combine the PDGs of individual procedures with a call graph and create a System Dependence Graph (SDG). An SDG contains ‘call’ and ‘param’ dependences between procedure calls and their definitions. Call dependences are a type of control dependence, and param dependences are a type of data dependence. The SDG can be used for interprocedural slicing [15] and clone detection [23].

Dependence analysis is common in optimizing compilers for statically compiled languages. It has been extensively studied in simple procedural code [8, 22, 15] and in static object-oriented languages like C++ [20] and Java [18, 28, 25, 5]. As far as we know, there is currently no PDG or SDG implementation available for a dynamic scripting language such as PHP. This is likely due to the fact that dynamic languages contain features that are difficult to model in static analysis.

An example of a problematic feature of PHP is property overloading, shown in figure 1.4. By defining a __get method, classes are able to intercept property fetches to undefined properties and dynamically return a property value. To determine which operations have a call dependence on the __get method, we need to know whether a property fetch refers to a property that has been defined on the object. This is complex because it requires the integration of type inference, class hierarchies, and polymorphism.

[listing truncated; surviving fragment: ...bar; // foo]

Figure 1.4: Overloading
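Property overloading as described above can be sketched as follows. This is an illustrative reconstruction under assumptions rather than the original listing: the class name Foo, the defined property $baz, and the variable names are hypothetical. Only the use of __get, the fetch of an undefined property bar, and the resulting value "foo" follow the text.

```php
<?php
// Hypothetical class; only the use of __get follows the text.
class Foo
{
    public $baz = "defined";

    // Invoked by PHP when an undefined (or inaccessible) property is read.
    public function __get($name)
    {
        return "foo";
    }
}

$foo = new Foo();
$viaGet = $foo->bar;   // bar is not defined on Foo, so __get is called
$direct = $foo->baz;   // baz is defined, so __get is not called
```

Only the first fetch has a call dependence on __get. Telling the two cases apart statically requires knowing the class of $foo and which properties that class (or its ancestors) defines.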

While PHP may contain dynamic features that can be difficult or impossible to support in all cases, we believe that common analyses using dependence graphs could also be valuable here. Example application areas we currently see include debugging, code comprehension, security, optimization, and semantic clone detection; but there are likely many more that we have not considered. Dynamic features may cause a loss of precision, but we are confident that partially correct results can also be very helpful in these areas. In this thesis we therefore study how we can implement dependence analysis for PHP, and how effectively we can apply it in practice. Our work is intended to be usable with real PHP code and by real PHP developers. For this reason all our development has been done in PHP itself, and we have extensively tested our implementation on a wide variety of publicly available open source PHP projects.

1.1 PHP

PHP is a popular general-purpose scripting language that is especially suited for web development. As of April 2016, it ranks 6th on the TIOBE programming community index¹, and powers 82% of all websites whose server-side language can be determined². PHP originated as a simple CGI scripting language, but has over time acquired several features suitable for more complex application development. Examples of these are a class-based object model, which enables object-oriented programming; namespacing, which reduces name collisions and improves library interoperability; and autoloading, which allows files containing class definitions to be automatically included when a class is first used.

In recent years PHP has developed a solid packaging ecosystem in the form of Composer³ and Packagist⁴, and through the PHP-FIG⁵ a number of community standards for e.g. code style, autoloading, and logging have been created. This all contributes to a general increase in the compatibility and quality of PHP libraries, and makes PHP increasingly suitable for the development of complex (web) applications. As more complex applications are being developed, PHP is also an increasingly interesting target for static analysis tools and techniques, such as the one we discuss in this thesis.

1.2 Problem statement

The problem we study regards the applicability of dependence analysis to PHP.

1.2.1 Research questions

We will investigate the applicability of dependence analysis using the following research questions:

1. How do PHP’s language features affect dependence analysis?
2. How are relevant language features used in practice and how does this impact dependence analysis?
3. Are there strategies that we can adopt to handle problematic features?
4. How can we apply dependence analysis to PHP?
5. How effective is dependence analysis for PHP?

1.2.2 Scope

Dependence analysis is a broad subject. Given the limited amount of time available for this thesis, we have restricted our scope to only certain kinds of dependences. We cover intraprocedural control dependence, intraprocedural data dependence excluding object properties (these would steer our research too much in the direction of alias analysis), and interprocedural call dependence. We consider this to be a good basic feature set for examining dependence analysis in PHP. For brevity we will, throughout this document, use the terms control, data, and call dependence.

To apply dependence analysis to a dynamic language several existing analysis techniques need to be combined. For our work we use simple type inference, class hierarchy construction, and polymorphism computations. The precision of our generated dependences could likely be improved by better implementations of these techniques and by the integration of other techniques such as alias analysis. The development of these techniques is however not the focus of this thesis. Our focus is on implementing and evaluating dependence analysis itself.

¹ http://www.tiobe.com/tiobe_index
² http://w3techs.com/technologies/details/pl-php/all/all
³ https://getcomposer.org/
⁴ https://packagist.org/
⁵ http://www.php-fig.org/psr/

1.2.3 Research method

We first performed a language meta-analysis in which we examined PHP language features and their practical usage in relation to dependence analysis. We then implemented our own PDG/SDG library and applied it to a large corpus of real-world PHP projects. To evaluate its effectiveness we collected metrics such as the number of operands that we were able to associate with data dependences in a PDG, the number of function and method calls with associated call dependences in the SDG, and the number of run-time calls in a function trace of a test suite covered by call dependences in our SDG. All our analyses are available as part of our analysis application described in section 6.1. This ensures that our results are reproducible and verifiable.

1.3 Corpus

To analyse practical use of PHP language features and verify our implementation we have collected a large corpus of popular open-source PHP projects. We based the composition of the corpus on the popularity rankings on Black Duck’s Open Hub site⁶. Most of the projects are among the most popular, but we also chose some of the less popular projects and even a project that has not seen updates since 2011. As constructing a dependence graph requires a lot of memory, and our test system had only 8 GB available, we did not include projects with more than around 400k lines of code. Lastly we also added Fabric, a proprietary application framework created by our sponsor company, Moxio.

Table 1.1: Corpus of popular open-source PHP libraries

| Project | Version | Date | PHP¹ | Files | SLOC² | Description |
|---|---|---|---|---|---|---|
| CakePHP | 3.2.8 | 24-04-2016 | 5.5.9 | 852 | 139,620 | Application Framework |
| CodeIgniter | 3.0.6 | 21-03-2016 | 5.2.4 | 199 | 29,770 | Application Framework |
| Doctrine ORM | 2.5.0 | 02-04-2016 | 5.4.0 | 1,048 | 85,591 | ORM |
| Fabric | 1.6.489 | 08-07-2016 | 5.5.0 | 1,306 | 57,573 | Application Framework |
| Gallery | 3.0.9 | 26-06-2013 | 5.2.3 | 568 | 44,758 | Photo Management |
| Joomla | 3.5.1 | 05-04-2016 | 5.3.10 | 3,221 | 347,423 | CMS |
| Kohana | 3.3.5 | 10-03-2016 | 5.3.3 | 490 | 30,738 | Application Framework |
| MediaWiki | 1.26.2 | 21-12-2015 | 5.3.3 | 2,891 | 400,416 | Wiki |
| osCommerce | 2.3.4 | 06-06-2014 | 4.0.0 | 702 | 60,003 | Online Retail |
| PEAR | 1.10.1 | 17-10-2015 | 5.4.0 | 199 | 48,872 | Component Framework |
| phpBB | 3.1.5 | 03-05-2015 | 5.3.3 | 745 | 183,293 | Bulletin Board |
| phpDocumentor | 2.9.0 | 22-05-2016 | 5.5 | 470 | 22,131 | Documentation Generator |
| phpMyAdmin | 4.6.0 | 17-05-2016 | 5.5.0 | 871 | 202,399 | Database Administration |
| phpUnit | 5.4.8 | 26-07-2016 | 5.6 | 321 | 72,731 | Test Framework |
| Roundcube | 1.2.1 | 24-07-2016 | 5.3.7 | 256 | 106,314 | Webmail |
| SilverStripe | 3.3.1 | 29-02-2016 | 5.3.3 | 769 | 183,293 | CMS |
| Smarty | 3.1.29 | 21-12-2015 | 5.2.0 | 204 | 17,282 | Template Engine |
| Squirrel Mail | 1.4.22 | 12-07-2011 | 4.1.0 | 293 | 26,045 | Webmail |
| Symfony | 3.0.4 | 30-03-2016 | 5.5.9 | 2,978 | 202,757 | Application Framework |
| WordPress | 4.5 | 12-04-2016 | 5.6.0 | 653 | 160,039 | Blog |
| Zend Framework | 2.5.3 | 27-01-2016 | 5.5.0 | 2,639 | 169,447 | Application Framework |

¹ Minimum required version. Determined by composer.json or mention on website.
² Computed using cloc, PHP only.

⁶ https://www.openhub.net/tags?names=php

1.4 Contributions

Language feature analysis. We analysed a significant number of PHP language features with regard to dependence analysis and looked into how problematic features are being used in practice. Our results offer insight into the effects and practical use of these language features. We present our language analysis in chapter 3.

PDG/SDG library. We created a PDG and SDG library that supports a significant subset of the language. It can be used for building PDG/SDG-based static analysis tools and for further research into PHP. Its implementation is described in chapters 4 and 5.

Interprocedural slicing tool. Using our PDG and SDG library we developed an interprocedural slicing tool that can create a backwards slice of a system using a file name and line number as a slicing criterion. It can be seen as a proof of concept of what is possible with dependence analysis in PHP. Its implementation is described in chapter 7.

PHP-CFG improvements. As part of our work, the PHP-CFG library, which implements a control flow graph in static single assignment form, was augmented with support for additional PHP language features and many small bugfixes. These are described in section 4.1.1.

PHP-Types improvements. As part of our work, the PHP-Types library, which implements type reconstruction for PHP-CFG, was augmented with an improved class hierarchy, improved interprocedural type inference, support for additional PHP language features, and many small bugfixes. These are described in section 5.1.1.

1.5 Outline

The rest of this thesis is structured as follows. Chapter 2 describes existing research and its relation to our work. Chapter 3 discusses the impact of PHP language features on dependence analysis, practical use of problematic features, and several strategies for dealing with problematic features. Chapters 4 and 5 present our implementation of a PDG and SDG. Chapter 6 presents our validation of the PDG and SDG implementation, and of our strategies for handling problematic features. In chapter 7 we demonstrate the practical use of our PDG and SDG implementation by using it to implement a slicing tool. Chapter 8 evaluates our work in relation to our research questions. We conclude in chapter 9.

Chapter 2

Background

In this chapter we present existing research and relate it to our work.

2.1 Dependence analysis

Weiser [26] introduced the concept of a program slice as the minimal form of a program that is still guaranteed to produce a certain behavior, and demonstrated its usefulness in the testing, automatic parallelization, maintenance, and debugging of programs. He computed program slices by using data and control flow analysis. We use his algorithm for interprocedural slicing in our slicing tool.

Ferrante et al. [8] combined the dependence graphs created by control and data flow analysis to form the PDG, and demonstrated its qualities in the optimization of imperative languages. They showed how several kinds of optimizations can be efficiently performed using the PDG. We use a variation of their original algorithm for determining control dependence in our work, but a different method for determining data dependence. They also briefly mention that the PDG is useful in the context of program slicing.

Ottenstein & Ottenstein [22] present the PDG as a suitable program representation for a software development environment, and cover program slicing using the PDG more extensively. They also state that the amount of space required for storing the PDG might be a practical consideration. This is something that we confirmed in our work, as we could not support projects of more than around 400k lines of code on our test system.

Horwitz et al. [15] combined the PDGs of all application procedures with a call graph to construct the SDG, and demonstrated its use in interprocedural slicing. They demonstrated the construction of call and param in/out dependence. We have integrated call dependence in our SDG, but left param dependences out of scope, and thus as future work. We also use interprocedural slicing as a use case to demonstrate the practical application of our work.

Larsen & Harrold [20] demonstrated how interprocedural slicing using the SDG could be applied to object-oriented software. They represent method polymorphism by linking a method call to a polymorphic choice vertex that is then linked to all reachable implementations of that method. We use a similar technique to support method polymorphism, but link a method call to all implementations directly, instead of through a polymorphic choice vertex.

Sinha & Harrold [24] and Jo & Chang [16] demonstrated ways in which exception-induced control flow could be integrated into control flow analysis. We initially investigated this for possible implementation in our work, before our language analysis showed that within our current scope modeling exception-induced control flow was not essential for accurate dependence analysis. This could be reevaluated when we add support for additional interprocedural control dependences in the future.

Komondoor & Horwitz [17] and Krinke [19] used the PDG for identifying non-contiguous semantic code clones by identifying isomorphic subgraphs. Shomrat & Feldman [23] extended this to interprocedural semantic clone detection in an attempt to detect refactored clones. Semantic clone detection is one of the main use cases we see for our library.

2.2 PHP

Hills et al. [12] performed an empirical analysis of PHP feature usage and looked specifically at usage of several dynamic features. They discovered that uses of most of these dynamic features were very rare. Hills et al. [13] showed that most instances of dynamic includes can be statically resolved, and Hills [10] showed that this is also true for many instances of variable variables. Their results inspired us to investigate the possibilities for dependence analysis in PHP. Our usage statistics are also collected in a similar fashion.

Hills & Klint [11] present PHP AiR, a framework for analysing PHP software implemented in Rascal. We analysed their implementation when determining how to implement control flow analysis in PHP. We settled on using PHP-CFG instead of porting their implementation, as this saved us considerable effort during implementation.

Van der Hoek & Hage [14] developed an object-sensitive type analysis for PHP that accommodates several of PHP’s dynamic aspects. We investigated their work when determining how to integrate type information into our call dependence determination. We settled on using PHP-Types instead of their implementation, even though its type analysis quality was likely worse, as it already integrated well with PHP-CFG. Implementing type analysis ourselves was beyond the scope of this thesis. We do see their analysis as a promising next step in improving the accuracy of our generated call dependences.

Chapter 3

Language analysis

In this chapter we present our analysis of a significant number of PHP language features in relation to dependence analysis. We look at several features that are problematic, how these are being used in practice, and what strategies we can implement for handling some of them. As per our scope statement of section 1.2.2, we only look at language features that affect control, data, and call dependences. This analysis is not intended to be exhaustive, but we consider it to be a good overview of relevant language features.

All practical usage data presented here has been gathered using our analysis application described in section 6.1. In many cases we show the number of procedures containing a certain feature, both as an absolute number and as a percentage of the total number of procedures. As every procedure is represented by a separate PDG, this indicates how many PDGs in a project could be affected by that feature. Besides functions, methods, and closures, we also consider the top-level scope of a script to be a procedure, as it can also contain logic. We call this the script’s ‘pseudo-main’ procedure. Figure 3.1 shows examples of the different kinds of procedure in PHP.


Figure 3.1: Procedures in PHP
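The four kinds of procedure named above can be sketched in a single script. This is an illustrative sketch rather than the original listing: all names (plainFunction, Greeter, $closure, $results) are hypothetical.

```php
<?php
// Pseudo-main: the top-level statements of the script form a procedure too.
$greeting = "hello";

// Function
function plainFunction(string $s): string
{
    return strtoupper($s);
}

class Greeter
{
    // Method
    public function method(string $s): string
    {
        return $s . "!";
    }
}

// Closure
$closure = function (string $s): string {
    return strrev($s);
};

// Still part of the pseudo-main procedure.
$results = [
    plainFunction($greeting),
    (new Greeter())->method($greeting),
    $closure($greeting),
];
```

Under the counting used in this chapter, this script would contribute four procedures, and hence four PDGs: one function, one method, one closure, and the pseudo-main.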

3.1 Control dependence

Most of PHP’s control structures have counterparts in other imperative programming languages. If, else, elseif/else if, while, do-while, for, foreach, switch, break, continue, and return all work largely as expected. There are some minor differences, such as the fact that a switch statement is considered a looping structure for the purposes of continue, and that we can specify a number of levels for the break and continue statements. This means that if we use a switch in a foreach loop, we have to use ‘continue 2’ from inside the switch to jump to the next iteration of the foreach loop. An example of this is shown in figure 3.2.


Figure 3.2: Break and continue for switch in loop
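The interaction between switch and the level argument of continue can be sketched as follows. This is a minimal illustration, not the original listing: the function name collectOdd and the odd/even filter are assumptions chosen for brevity.

```php
<?php
// Hypothetical helper: returns the odd numbers from the input.
function collectOdd(array $numbers): array
{
    $odd = [];
    foreach ($numbers as $n) {
        switch ($n % 2) {
            case 0:
                // A plain "continue" would target the switch and act like "break";
                // "continue 2" targets the enclosing foreach instead.
                continue 2;
            default:
                break; // leaves the switch only
        }
        $odd[] = $n; // reached only when the switch was left via break
    }
    return $odd;
}
```

For control dependence this means that the append after the switch depends on the switch condition, because one branch of the switch skips it entirely.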

Recent additions to PHP are the goto statement and generators. The goto statement allows jumping to another section of the program. Jumps that would create irreducible control flow, such as jumping into another scope or into a looping construct, are not permitted. While the target of a goto may require some bookkeeping to resolve, as jumps can be forward or backward, it is fundamentally similar to a break or continue statement and so should not pose any problems for determining control dependences. Figure 3.3 shows an example of goto.


Figure 3.3: Goto
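Forward and backward jumps with goto can be sketched as follows. This is an illustrative example under assumptions, not the original listing: the function name countTo and the labels loop and done are hypothetical.

```php
<?php
// Hypothetical helper: builds [0, 1, ..., $limit - 1] using goto instead of a loop.
function countTo(int $limit): array
{
    $seen = [];
    $i = 0;
loop:
    if ($i >= $limit) {
        goto done;      // forward jump, out of the if block
    }
    $seen[] = $i;
    $i++;
    goto loop;          // backward jump, forming a loop
done:
    return $seen;
}
```

Both jumps stay within the same function and neither enters a loop or switch, so the resulting control flow graph has the same shape as an ordinary while loop.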

Generators are special functions that use the yield expression to iteratively transfer control out of the function and return values. They can be used to create lightweight lazy iteration similar to iterator objects, but generators can also be passed values from outside the generator function, which are received as the result of a yield expression. Generators are challenging to model in control flow analysis because once they yield control, control does not necessarily return. If control does return, it resumes at the last executed yield expression. If we ignore the fact that control might not be returned to the generator function, then from the point of view of the generator function the yield expression is similar to a function call. Figure 3.4 shows an example of a generator.

Figure 3.4: Generator

To determine the effect that generators would have on our ability to determine control dependences, we looked at how common the yield expression was in our corpus. Table 3.1 shows the results. We can see that it is very rare, with only 2 projects using yield at all, and even those having very few instances. This suggests that accurate support for generators is not required for usefully determining control dependences. We therefore treat the yield expression as a function call for now, and ignore any other impact this feature has on control dependence.

Table 3.1: Yield usage

| Project | Procs | Yield total | Yield procs | Yield proc % |
|---|---|---|---|---|
| CakePHP | 9,943 | 0 | 0 | 0.00 |
| CodeIgniter | 1,897 | 0 | 0 | 0.00 |
| Doctrine ORM | 7,876 | 0 | 0 | 0.00 |
| Fabric | 7,407 | 0 | 0 | 0.00 |
| Gallery | 3,039 | 0 | 0 | 0.00 |
| Joomla | 18,922 | 0 | 0 | 0.00 |
| Kohana | 2,409 | 0 | 0 | 0.00 |
| MediaWiki | 26,594 | 0 | 0 | 0.00 |
| osCommerce | 3,290 | 0 | 0 | 0.00 |
| PEAR | 1,756 | 0 | 0 | 0.00 |
| phpBB | 5,000 | 0 | 0 | 0.00 |
| phpDocumentor | 2,682 | 0 | 0 | 0.00 |
| phpMyAdmin | 6,614 | 0 | 0 | 0.00 |
| phpUnit | 2,338 | 0 | 0 | 0.00 |
| Roundcube | 2,537 | 0 | 0 | 0.00 |
| SilverStripe | 9,610 | 0 | 0 | 0.00 |
| Smarty | 1,132 | 0 | 0 | 0.00 |
| Squirrel Mail | 1,093 | 0 | 0 | 0.00 |
| Symfony | 20,744 | 1 | 1 | 0.00 |
| WordPress | 6,904 | 0 | 0 | 0.00 |
| Zend Framework | 15,509 | 2 | 1 | 0.01 |

Exceptions in PHP also work similarly to those in other languages. An exception can be thrown using a throw statement and caught, if thrown inside a try block, using a catch statement. A try block can also have a finally block, which contains code that is always executed, whether the try block exits normally or with an exception. Exceptions can be problematic for determining control dependences because they are a form of unstructured jumping. A thrown exception can either immediately exit the current procedure, or be caught by a surrounding catch statement and continue execution from there. However, catch statements can opt to only catch some types of exceptions, and almost any operation can throw any exception type through the automatic invocation of a destructor or error handler.

To determine the effect that exceptions would have on our ability to determine control dependences we looked at how they were used in our corpus. This is shown in table 3.2. We can see that the use of try and throw statements is reasonably common, but very few procedures contain both a try statement and a throw statement. This means that in most cases a throw statement immediately exits the procedure.

Table 3.2: Exception construct usage

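| Project | Procs | Try total | Try procs | Try proc % | Throw total | Throw procs | Throw proc % | Try & Throw procs | Try & Throw proc % |
|---|---|---|---|---|---|---|---|---|---|
| CakePHP | 9,943 | 78 | 72 | 0.72 | 344 | 267 | 2.69 | 19 | 0.19 |
| CodeIgniter | 1,897 | 8 | 8 | 0.42 | 13 | 5 | 0.26 | 0 | 0.00 |
| Doctrine ORM | 7,876 | 127 | 123 | 1.56 | 355 | 222 | 2.82 | 11 | 0.14 |
| Fabric | 7,407 | 154 | 122 | 1.65 | 926 | 506 | 6.83 | 60 | 0.81 |
| Gallery | 3,039 | 156 | 135 | 4.44 | 281 | 208 | 6.84 | 29 | 0.95 |
| Joomla | 18,922 | 576 | 457 | 2.42 | 1,208 | 766 | 4.05 | 87 | 0.46 |
| Kohana | 2,409 | 54 | 50 | 2.08 | 155 | 122 | 5.06 | 30 | 1.25 |
| MediaWiki | 26,594 | 495 | 409 | 1.54 | 1,830 | 1,245 | 4.68 | 109 | 0.41 |
| osCommerce | 3,290 | 10 | 10 | 0.30 | 57 | 40 | 1.22 | 7 | 0.21 |
| PEAR | 1,756 | 5 | 4 | 0.23 | 10 | 8 | 0.46 | 0 | 0.00 |
| phpBB | 5,000 | 37 | 29 | 0.58 | 120 | 73 | 1.46 | 6 | 0.12 |
| phpDocumentor | 2,682 | 10 | 9 | 0.34 | 101 | 88 | 3.28 | 2 | 0.07 |
| phpMyAdmin | 6,614 | 25 | 17 | 0.26 | 20 | 18 | 0.27 | 3 | 0.05 |
| phpUnit | 2,338 | 278 | 262 | 11.21 | 222 | 151 | 6.46 | 10 | 0.43 |
| Roundcube | 2,537 | 36 | 31 | 1.22 | 2 | 2 | 0.08 | 1 | 0.04 |
| SilverStripe | 9,610 | 91 | 72 | 0.75 | 569 | 369 | 3.84 | 32 | 0.33 |
| Smarty | 1,132 | 13 | 13 | 1.15 | 102 | 76 | 6.71 | 9 | 0.80 |
| Squirrel Mail | 1,093 | 0 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 |
| Symfony | 20,744 | 460 | 375 | 1.81 | 1,696 | 1,204 | 5.80 | 131 | 0.63 |
| WordPress | 6,904 | 48 | 42 | 0.61 | 107 | 49 | 0.71 | 21 | 0.30 |
| Zend Framework | 15,509 | 207 | 195 | 1.26 | 2,888 | 2,109 | 13.60 | 128 | 0.83 |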

Several studies have attempted to integrate exception-induced control flow into a control flow graph in Java or C++ [24, 16, 2], but these studies generally cover explicitly thrown exceptions and their strategies are not immediately suitable for implementation in the dynamic context of PHP. As this problem is too complex to be included in our scope, we currently assume that all throw statements immediately exit the procedure.
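The difference between the common case and the case that this assumption does not model can be sketched as follows. This is an illustrative example: the function names parseNumber, parseOrDefault, and sameProcedure are hypothetical and are not taken from our corpus or library.

```php
<?php
// Common case: no enclosing try in this procedure, so the throw exits it
// immediately, which matches the simplifying assumption in the text.
function parseNumber(string $input): int
{
    if (!is_numeric($input)) {
        throw new InvalidArgumentException("not a number: $input");
    }
    return (int) $input;
}

// The exception is thrown in the callee and caught here, so this procedure
// contains a try statement but no throw statement.
function parseOrDefault(string $input, int $default): int
{
    try {
        return parseNumber($input);
    } catch (InvalidArgumentException $e) {
        return $default;
    }
}

// Rare case: try and throw in the same procedure. Control continues in the
// catch block, which the "throw exits the procedure" assumption does not model.
function sameProcedure(bool $fail): string
{
    try {
        if ($fail) {
            throw new RuntimeException("failed");
        }
        return "ok";
    } catch (RuntimeException $e) {
        return "caught";
    }
}
```

Only procedures shaped like sameProcedure fall in the ‘Try & Throw’ columns of table 3.2, which is why the assumption affects relatively few procedures.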

3.2 Data dependence

There are many features in PHP that are relevant for determining data dependences. Some can be used to create non-obvious aliasing, some can create non-obvious or dynamic references, and some influence the determination of reaching definitions, such as the scoping mechanism.

3.2.1 Aliasing

The most common way of creating aliases in PHP is using object references. While the references themselves are copied on assignment or when passed as an argument, the underlying object remains the same. When considering data dependences to object properties it is important to take aliasing using object references into account. As this is not included in our current scope we will not cover it here, but there are several other language features of PHP that can be used to create aliasing which are currently relevant to us.

Predefined variables are provided by PHP to all scripts. They can contain internal or environment information, such as the last error message or the values of GET or POST parameters in the context of an HTTP request. Some of these predefined variables are available in all scopes; these are called superglobals. PHP also allows user-defined global variables, which need to be explicitly imported into procedures. All global variables, including superglobals, are also accessible through the superglobal $GLOBALS array. In the context of data dependences this is problematic because the same resource can be referenced in multiple ways. Figure 3.5a shows an example of how a variable defined in the global scope is automatically aliased through the $GLOBALS array.

Similarly, reference assignments also allow multiple variable names to reference the same resource. Correctly determining data dependence for this requires tracking which variable names correspond to the same resource. Further complicating matters is the fact that individual elements of an array can also be assigned by reference. Figure 3.5b shows an example of aliasing using reference assignment.


(a) GLOBALS (b) Reference assignment

Figure 3.5: Aliasing using GLOBALS and reference assignments
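Both forms of aliasing can be sketched as follows. This is a minimal illustration under assumptions, not the original listings: the function names and the variable names $counter, $a, $b, $list, and $first are hypothetical.

```php
<?php
// (a) $GLOBALS: the imported global $counter and $GLOBALS['counter']
// name the same resource.
function setViaGlobals(): int
{
    global $counter;            // explicit import of the global variable
    $counter = 1;
    $GLOBALS['counter'] = 2;    // write through the superglobal array
    return $counter;            // reads the value written via $GLOBALS
}

// (b) Reference assignment: two names, or a name and an array element,
// share one resource.
function referenceAlias(): array
{
    $a = 1;
    $b = &$a;                   // $b is now an alias of $a
    $b = 2;                     // also changes $a

    $list = [1, 2];
    $first = &$list[0];         // alias of a single array element
    $first = 10;                // also changes $list[0]

    return [$a, $list[0]];
}
```

In both functions the final read has a data dependence on a write that mentions a different name. An analysis that keys definitions by variable name alone would miss these dependences.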

To correctly model multiple ways of accessing the same resource requires alias analysis. This has been extensively researched, particularly in the context of Java [27, 4]. Existing techniques for Java could well be suitable for porting to PHP, but we consider this problem too complex to be included in our current scope. We therefore ignore the possibility that variables might reference the same resource, and leave the implementation of alias analysis in PHP as future work.

To determine the impact that this could have on our ability to correctly determine data dependences we looked at how common possible aliasing features were in our corpus. This is shown in table 3.3. As we can see, these features are rare in many libraries, but some do use them extensively. While many of these uses of possible aliasing features might not actually cause hidden data dependences, this definitely warrants further research into the impact of aliasing on dependence analysis in PHP. This is something that we currently leave as future work.

Table 3.3: Possible aliasing construct usage

                             Ref assign          Global             Any
Project           Procs   Total   Procs   Total   Procs   Procs  Proc %
CakePHP           9,943      49      32       0       0      32    0.32
CodeIgniter       1,897     126      93       0       0      93    4.90
Doctrine ORM      7,876      16      11       0       0      11    0.14
Fabric            7,407      39      25       0       0      25    0.34
Gallery           3,039      18      11       9       9      20    0.66
Joomla           18,922     644     331       3       3     333    1.76
Kohana            2,409      66      52       4       4      56    2.32
MediaWiki        26,594   1,021     553   1,578   1,479   1,884    7.08
osCommerce        3,290      23       9     522     511     519   15.78
PEAR              1,756     503     280       6       6     285   16.23
phpBB             5,000     142      93     974     758     832   16.64
phpDocumentor     2,682       5       4       0       0       4    0.15
phpMyAdmin        6,614     149      58     234     209     263    3.98
phpUnit           2,338       5       3       4       2       5    0.21
Roundcube         2,537      13      11       3       3      14    0.55
SilverStripe      9,610     103      62      39      35      97    1.01
Smarty            1,132      22      13       0       0      13    1.15
Squirrel Mail     1,093      23      18     500     414     427   39.07
Symfony          20,744      72      41       0       0      41    0.20
WordPress         6,904     299     139     911     902   1,004   14.54
Zend Framework   15,509     207     149       1       1     150    0.97

3.2.2 Dynamic features

The eval and dynamic include statements are PHP’s method of evaluating code at run-time, with eval evaluating a string and a dynamic include evaluating a file. In both cases the pseudo-main procedure of the evaluated code is evaluated in the same scope as the eval or dynamic include statement. This means it has access to any variables in the evaluating scope and can thus have hidden data dependences. Whether or not a dynamically included pseudo-main actually references the surrounding scope could in many cases be determined by static analysis if the path to the included file can be resolved. The path of the file to include, however, can itself be the result of an expression, meaning that it may only be known at run-time. Figure 3.6 shows examples of eval and include statements using the evaluating scope.

(a) foo.php

Figure 3.6: Eval and include using evaluating scope

Variable variables are variable references in which the name of the referenced variable is the result of an expression. Since this expression can even depend on user input, it could evaluate to any variable in scope, or even to a variable that is currently undefined. Figure 3.7 shows an example of a variable variable.

Figure 3.7: Variable variable

Theoretically these dynamic features pose a significant problem for dependence analysis. However, Hills et al. [12] showed that in practice the use of eval and variable variables is extremely rare. As shown in table 3.4, we can confirm this for our corpus. Hills et al. [13] showed that most instances of dynamic includes can be statically resolved, and Hills [10] showed that this is also true for many instances of variable variables. Although we currently limit our scope to more traditional methods of generating data dependences, we see their methods as good candidates for integration into our library in the future. This could significantly improve the accuracy of dependence analysis for these features.
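As a minimal sketch of why these features hide data dependences, consider the two hypothetical functions below (evalInScope and readByName are illustrative names, not the listings of figures 3.6 and 3.7). In both cases the variable that is written or read cannot be seen from the syntax of the enclosing procedure alone.

```php
<?php
// Hypothetical example: eval runs in the scope of the calling function.
function evalInScope(): string
{
    $greeting = 'hello';
    // The evaluated string redefines $greeting; nothing in the surrounding
    // code names this definition explicitly.
    eval('$greeting .= " world";');
    return $greeting;
}

// Hypothetical example: which variable is read depends on the run-time value of $name.
function readByName(string $name): string
{
    $foo = 'foo value';
    $bar = 'bar value';
    return $$name;
}
```

A call such as readByName('bar') reads $bar, while readByName('foo') reads $foo, so a conservative analysis has to assume a possible dependence on every variable in scope.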

Table 3.4: Eval, include, and variable variable usage

                                           Eval                 Include                 Var var
Project           Procs   Total   Procs  Proc %   Total   Procs  Proc %   Total   Procs  Proc %
CakePHP           9,943       0       0    0.00      14      13    0.13       0       0    0.00
CodeIgniter       1,897       1       1    0.05      87      34    1.79      12       5    0.26
Doctrine ORM      7,876       0       0    0.00      91      29    0.37       0       0    0.00
Fabric            7,407       3       3    0.04     102      61    0.82       1       1    0.01
Gallery           3,039       1       1    0.03      48      29    0.95       8       4    0.13
Joomla           18,922       4       3    0.02     714     511    2.70       6       4    0.02
Kohana            2,409       1       1    0.04      50      47    1.95       7       3    0.12
MediaWiki        26,594       5       5    0.02     544     304    1.14       6       3    0.01
osCommerce        3,290       5       5    0.15     777     238    7.23     119      39    1.19
PEAR              1,756       2       2    0.11     302     203   11.56       3       3    0.17
phpBB             5,000      11       6    0.12     397     203    4.06      61      17    0.34
phpDocumentor     2,682       0       0    0.00      11       7    0.26       0       0    0.00
phpMyAdmin        6,614       2       2    0.03   1,217     474    7.17      33       9    0.14
phpUnit           2,338       3       3    0.13      34      15    0.64       0       0    0.00
Roundcube         2,537       0       0    0.00      87      74    2.92       3       2    0.08
SilverStripe      9,610      10      10    0.10     558     290    3.02       3       2    0.02
Smarty            1,132       9       8    0.71      44      25    2.21      40      11    0.97
Squirrel Mail     1,093       1       1    0.09     426     139   12.72      24       6    0.55
Symfony          20,744      23      15    0.07     139      99    0.48       0       0    0.00
WordPress         6,904       0       0    0.00     809     272    3.94       2       1    0.01
Zend Framework   15,509       1       1    0.01      58      42    0.27       2       2    0.01

3.2.3 Scoping

PHP has only two kinds of scope: the global scope and the function scope. Although many of PHP’s control structures contain a block of statements, blocks do not have an associated scope as they do in many other languages. Variables defined within a block are defined in the current global or function scope, and thus remain accessible after leaving the block. This can make determining reaching definitions, and by extension data dependences, easier. Figure 3.8 shows an example of the lack of block scoping in PHP.

Figure 3.8: Lack of block scoping
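The following sketch illustrates the absence of block scope. It is an illustrative example rather than the original listing of figure 3.8, and the function name describe is hypothetical.

```php
<?php
// Hypothetical example: variables assigned inside blocks live in the function scope.
function describe(bool $flag): string
{
    if ($flag) {
        $message = 'set inside the if block';
    }
    foreach ([1, 2, 3] as $i) {
        $last = $i;
    }
    // Both $message (if assigned) and $last are still visible here,
    // because neither the if block nor the loop body introduces a scope.
    return (isset($message) ? $message : 'never assigned') . ", last=$last";
}
```

For reaching definitions this means that the definition of $last inside the loop body reaches the return statement directly, without any scope exit to account for.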

3.3 Call dependence

The main challenge in PHP for determining call dependence is its object model and weak dynamic type system, but there are also a number of dynamic features that are relevant.

3.3.1 Object-orientation & type system

PHP has a single-inheritance class-based object model. Its large library of built-in functionality is, for historic reasons, mostly not object-oriented, but most projects in our corpus do use the object model extensively. This is shown in table 3.5.

Table 3.5: Call dependence relevant feature usage

                              Functions                 Methods
Project           Files    Defs   Calls Classes    Defs   Calls
CakePHP             852      21   6,995     843   8,483  48,370
CodeIgniter         199     167   5,087     138   1,531   2,275
Doctrine ORM      1,048       0   2,742   1,500   6,770  32,093
Fabric            1,306      15   3,768   1,133   6,028  18,241
Gallery             568      45   7,176     403   2,426  11,844
Joomla            3,221     258  25,989   2,402  15,383  98,070
Kohana              490      33   2,929     389   1,872   3,972
MediaWiki         2,891     251  28,033   2,745  22,943  85,115
osCommerce          702     404  18,293     273   2,184   2,685
PEAR                199      13   5,237     131   1,544   6,957
phpBB               745     569  18,428     599   3,674  14,568
phpDocumentor       470      11     767     432   2,086   7,975
phpMyAdmin          871     840  18,594     633   4,885  24,358
phpUnit             321     141   1,663     310   1,866   4,384
Roundcube           256     106   5,738     264   2,170   7,842
SilverStripe        769      40  10,648   1,190   8,657  33,470
Smarty              204      59   1,911     175     869   1,064
Squirrel Mail       293     604  10,750      22     196     651
Symfony           2,978      19  11,164   2,754  16,924  63,018
WordPress           653   2,935  52,263     275   3,315  10,058
Zend Framework    2,639       4  13,039   1,946  12,673  20,013

As PHP’s type system is weak and dynamic, determining the type of an operand can be quite difficult. This is problematic in combination with object-orientation, because dynamic dispatch can result in a method name referencing different definitions based on the type of the object. To correctly resolve dependences to polymorphic properties and methods, we therefore require operand type information. Type analysis in PHP is still an active research topic. Van der Hoek & Hage [14] have recently developed an object-sensitive type analysis framework for PHP that could be suitable for integration in our work, but this would stretch our current scope too far beyond dependence analysis. We therefore opt for a simpler implementation that integrates well with other libraries we use. This implementation is described in section 5.1.1.
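To make the role of type information concrete, the following sketch shows a single call site whose target depends on the run-time type of its operand. The classes Logger and NullLogger and the function report are hypothetical.

```php
<?php
// Hypothetical classes: both define a method named write.
class Logger
{
    public function write($message)
    {
        return "log: $message";
    }
}

class NullLogger extends Logger
{
    public function write($message)
    {
        return '';
    }
}

// Without a type for $logger, the call below could refer to either definition.
function report($logger)
{
    return $logger->write('done');
}
```

Without knowing the type of $logger, a call dependence from report has to be drawn to both Logger::write and NullLogger::write; with precise type information it could be narrowed to one of them.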

3.3.2 Dynamic features

PHP’s main mechanism for importing functionality is the include statement. This means that applications could theoretically contain multiple implementations of the same function, class, or method, and vary the run-time implementation by including different files. To determine the impact of this, we looked at how many duplicate class and function names there were in our corpus. This is shown in table 3.6. We can see that in general duplicate names are not very common, but they do exist. Our call dependence generation algorithms therefore all handle multiple definitions of a name by linking to all available implementations, even though only one of them can be active at run-time. Our method for this is described in sections 5.2.1 and 5.2.2.

Interesting outliers are Joomla and osCommerce, which have the highest numbers of duplicate names for both functions and classes. We examined their code to find the reasons for this. In Joomla some were conditionally defined functions that ensured compatibility in case certain extensions were not loaded, but most duplicate names were due to the fact that Joomla conditionally includes certain class definitions based on the type of output it is generating, e.g. HTML or JSON. In osCommerce we found several monkey-patched functions for compatibility with older PHP versions, but the majority of duplicate names seemed to be due to the fact that osCommerce duplicates a lot of functionality between its catalog (front end) and admin (back end) modes. We found several names with identical implementations, but also some for which the implementations varied.

Table 3.6: Duplicate names

                                      Functions                 Classes
Project           Files   Total     Dup   Dup %   Total     Dup   Dup %
CakePHP             852      21       0    0.00     843       0    0.00
CodeIgniter         199     167       0    0.00     138       1    0.72
Doctrine ORM      1,048       0       0    0.00   1,500       0    0.00
Fabric            1,306      15       0    0.00   1,133       2    0.18
Gallery             568      45       1    2.22     403       0    0.00
Joomla            3,221     258      60   23.26   2,402      55    2.29
Kohana              490      33       0    0.00     389       0    0.00
MediaWiki         2,891     251       6    2.39   2,745       0    0.00
osCommerce          702     404     112   27.72     273       9    3.30
PEAR                199      13       1    7.69     131       2    1.53
phpBB               745     569      16    2.81     599       0    0.00
phpDocumentor       470      11       0    0.00     432       2    0.46
phpMyAdmin          871     840       4    0.48     633       0    0.00
phpUnit             321     141       0    0.00     310       1    0.32
Roundcube           256     106       0    0.00     264       0    0.00
SilverStripe        769      40       0    0.00   1,190       1    0.08
Smarty              204      59       1    1.69     175       0    0.00
Squirrel Mail       293     604      11    1.82      22       0    0.00
Symfony           2,978      19       0    0.00   2,754       5    0.18
WordPress           653   2,935      24    0.82     275       2    0.73
Zend Framework    2,639       4       0    0.00   1,946       0    0.00
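The idea of linking a call to every available definition of a duplicated name can be sketched as follows. This is only an illustration under our own assumptions: the function linkCalls, the array layout and the file names are hypothetical and do not reflect the implementation described in chapter 5.

```php
<?php
/**
 * Hypothetical sketch: link each called function name to all files defining it.
 *
 * @param array    $definitions list of ['name' => string, 'file' => string]
 * @param string[] $calledNames function names used at call sites
 * @return array   map from called name to the list of defining files
 */
function linkCalls(array $definitions, array $calledNames): array
{
    $index = [];
    foreach ($definitions as $definition) {
        // Function names are case-insensitive in PHP, so we normalise them.
        $index[strtolower($definition['name'])][] = $definition['file'];
    }
    $links = [];
    foreach ($calledNames as $name) {
        $links[$name] = $index[strtolower($name)] ?? [];
    }
    return $links;
}

// Two conditionally included implementations of the same (made-up) function.
$definitions = [
    ['name' => 'render', 'file' => 'html/render.php'],
    ['name' => 'render', 'file' => 'json/render.php'],
    ['name' => 'helper', 'file' => 'lib/helper.php'],
];
$links = linkCalls($definitions, ['render', 'Helper', 'missing']);
```

The call to render is linked to both definitions, which over-approximates the run-time behaviour but never misses the implementation that is actually included.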

There are also a number of ways in which a call target can be determined using an expression. PHP provides variable function and method calls, and several built-in functions that can invoke other functions or methods. Hills et al. [12] showed that these features are rare but used. As shown in table 3.7, we can confirm this for our corpus. Hills et al. [13] showed that many instances of variable function and method calls cannot easily be resolved using static analysis. This means that these calls could be problematic for the correct determination of call dependences. Figure 3.9 shows examples of several forms of dynamic invocation.

Table 3.7: Dynamic invocation usage

                                  Var func call         Var method call         call_user_func¹
Project           Procs   Total   Procs  Proc %   Total   Procs  Proc %   Total   Procs  Proc %
CakePHP           9,943     140     105    1.06      54      37    0.37      30      28    0.28
CodeIgniter       1,897      17       8    0.42      23      17    0.90       9       7    0.37
Doctrine ORM      7,876       4       3    0.04      13       8    0.10      10       9    0.11
Fabric            7,407       4       2    0.03       0       0    0.00      29      29    0.39
Gallery           3,039       7       3    0.10      27      14    0.46      53      35    1.15
Joomla           18,922      14      12    0.06      52      37    0.20      94      76    0.40
Kohana            2,409      12       8    0.33      18      14    0.58      13      11    0.46
MediaWiki        26,594     224      76    0.29      68      45    0.17     271     232    0.87
osCommerce        3,290       2       1    0.03       2       2    0.06       5       4    0.12
PEAR              1,756       1       1    0.06      20       9    0.51      37      23    1.31
phpBB             5,000      27      12    0.24       6       5    0.10      35      27    0.54
phpDocumentor     2,682      16      16    0.60      10       8    0.30       1       1    0.04
phpMyAdmin        6,614      15      12    0.18      17      13    0.20      10      10    0.15
phpUnit           2,338       1       1    0.04      10       6    0.26     144     143    6.12
Roundcube         2,537       3       3    0.12       4       3    0.12      13      12    0.47
SilverStripe      9,610      46      28    0.29     242     115    1.20      92      69    0.72
Smarty            1,132      14       6    0.53      18      18    1.59      16      12    1.06
Squirrel Mail     1,093      24      24    2.20       0       0    0.00       4       4    0.37
Symfony          20,744      85      58    0.28      38      28    0.13      86      80    0.39
WordPress         6,904      23       9    0.13       7       7    0.10     127     101    1.46
Zend Framework   15,509     327     172    1.11      83      74    0.48      85      78    0.50

¹ Also includes call_user_func_array, call_user_method & call_user_method_array

    $baz(); // baz
    echo call_user_func($foo); // foo
    echo call_user_func([$bar, $baz]); // baz

Figure 3.9: Dynamic invocation

Lastly, overloading in PHP means dynamically creating methods and properties. Classes can define ‘magic’ methods that are called when certain operations attempt to access a method or property that is not defined. For methods these are __call and __callStatic, which are called when a non-existing method is called using an instance or static method call. For properties these are __get, __set, __isset, and __unset, which are called when a non-existing property is retrieved, set, checked for existence, or deleted. All overloading methods can also be called explicitly, which can be used to enable overloading method inheritance. An additional complication for properties is the fact that they can also be defined at run-time. As it is very difficult to determine whether or not a property is defined at run-time, we ignore this possibility and assume that all properties which we cannot resolve on objects that implement overloading hooks are instances of overloading. We accept the fact that this might generate some call dependences that would actually refer to dynamic properties at run-time. Figure 3.10 shows an example of overloading using the __get method, and a dynamic property assignment that overrides the overloading mechanism.

    bar; // foo
    $foo->bar = 'bar';
    echo $foo->bar; // bar

Figure 3.10: Overloading and dynamic properties

Although Hills et al. [12] showed that overloading is not very common in practice, we consider it a good example of a dynamic feature in PHP that should be statically resolvable in most cases. Overloading is similar to other dynamic features in PHP that allow objects to interact with certain built-in features. Other examples include objects that implement the Countable and ArrayAccess interfaces, which can be used with the built-in count function or array syntax. If accurate type information is available, most instances of these features should be statically resolvable. By implementing support for overloading we also intend to demonstrate how these other behaviors could be integrated in the future.
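The interaction between overloading hooks and dynamic properties can be illustrated with a self-contained sketch. It follows the same idea as figure 3.10 but is not the original listing; the class Foo and its return values are hypothetical, and the attribute line is only there to avoid the dynamic-property deprecation notice of recent PHP versions (on older versions it is parsed as a comment).

```php
<?php
#[\AllowDynamicProperties]
class Foo
{
    // Called when a property that is not defined is read.
    public function __get($name)
    {
        return 'foo';
    }

    // Called when a method that is not defined is invoked on an instance.
    public function __call($name, $arguments)
    {
        return "called $name";
    }
}

$foo = new Foo();
$viaHook = $foo->bar;      // resolved through __get
$foo->bar = 'bar';         // no __set is defined, so this creates a dynamic property
$direct = $foo->bar;       // now reads the dynamic property; __get is bypassed
$viaCall = $foo->baz();    // resolved through __call
```

Under the assumption described above, both reads of $foo->bar would be linked to __get, even though the second one refers to the dynamic property at run-time.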

3.4 Summary

In this language analysis we have seen that, with regard to control dependence, PHP is mostly similar to many other imperative languages. Generators are an exception to this, but they are currently rarely used, so we can safely ignore their effect and model them as function calls. Exceptions can be problematic, but within our current scope we can assume that they simply exit the procedure. With regard to data dependence, problems can come from aliasing, which is too complex to be included in our current scope, and from dynamic features, which are rarely used in practice. PHP’s scoping mechanism can actually make determining data dependences easier, as there is only a single scope per procedure. Determining call dependences requires operand type information for method calls, support for multiple definitions of the same name due to conditional inclusion, and knowledge of how objects can interact with certain built-in functionality such as overloading. Dynamic invocation is problematic to support, but rarely used in practice.

Chapter 4

PDG implementation

In this chapter we present our implementation of a PDG for PHP. It is part of our PDG and SDG library called Php-Pdg, and is publicly available¹. We first cover the libraries that we use for our implementation and then present our implementation of the PDG itself.

4.1 Libraries

Our PDG is based on PHP-CFG, originally created by Anthony Ferrara and Nikita Popov, and our own implementation of a graph library.

4.1.1 PHP-CFG

We have based our PDG implementation on PHP-CFG², a pure PHP implementation of a Control Flow Graph (CFG) in Static Single-Assignment (SSA) form [7]. PHP-CFG has its own representation of PHP operations, which is similar to the operations that the PHP engine executes when running a program. An AST ‘If’ node, for example, results in a ‘JumpIf’ operation with 2 possible target blocks and a ‘Jump’ operation at the end of each of those blocks to implement the joining of control flow after the ‘If’. This is shown in figure 4.1, along with the implicit return at the end of each script’s pseudo-main. The operations used by PHP-CFG are more suitable for PDG construction than the original AST nodes. Using the original AST nodes would require us to introduce separate constructs for concepts not represented in the AST, such as the headers of while and for loops or implicit returns, which are already represented by operations in PHP-CFG.

PHP-CFG is a fairly new library, having been created in the summer of 2015, so not all features of PHP are currently supported. For example, exception handling is missing: the current implementation ignores try statements and considers a throw statement to be equivalent to an empty return. As we have seen in section 3.1 of our language analysis, most instances of a throw statement exit the procedure, so this should be satisfactory for our current purposes.

SSA

The SSA form of the CFG is advantageous to us because it makes the process of determining flow dependences trivial. SSA form was developed by Cytron et al. [7] and requires that each variable be assigned exactly once. This means that the use-def chains needed for determining flow data dependences are directly encoded into the CFG. The library constructs SSA form directly from the AST, without going through an intermediate representation. It uses an implementation of the algorithm by Braun et al. [3] for this.

¹ https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/ProgramDependence
² https://github.com/ircmaxell/php-cfg

(b) CFG

Figure 4.1: If-else program

Model

PHP-CFG parses files into a Script object, which contains one or more procedures represented by Func objects. These are the pseudo-main procedure and any other functions, methods, or closures in the file. A Func object contains a Block object representing the start of the CFG, which contains a list of operations represented by Op objects. The last operation of a block can link to another block. Multiple blocks can be linked in the case of a conditional jump operation. Most operations have operands. Operands can be literals, or be defined by other operations. Literals directly contain their value, and are used at most once. Due to the fact that the CFG is in SSA form, non-literals are always defined by exactly one operation, and can have any number of using operations. The phi operation is a special kind of operation that is used to construct SSA form. It is inserted at the beginning of a block that has multiple input paths to ‘select’ the value of an operand depending on which definition reaches the block. Phi operations are only inserted if their result operand is used. Figure 4.2 shows most of the objects mentioned, with operations being printed by their name (e.g. Expr_Assign) and operands being printed as Var#X, along with the name of their original variable.

Contributions

As part of our work for this thesis we made a number of contributions to PHP-CFG, which have been merged upstream. Most of these contributions concern small fixes to the handling of certain language features, so we do not explicitly describe them here.

Goto forward jumps

The goto statement can be used to jump to a named label. Jumps cannot create irreducible control flow, such as when jumping into another scope or loop construct, but can go forward or backward. PHP-CFG’s implementation of goto did not support forward jumps, so we implemented them.

4.1.2 Graph

To facilitate easy use of the PDG for e.g. program slicing or clone detection, the PDG should be represented as a single graph. Some existing PDG libraries³ represent the PDG with separate graphs for control and data dependences, but this means standard graph algorithms, e.g. determining reachability, need to be applied over multiple graphs. We could not find any existing graph library in PHP that would allow us to represent the PDG as a single graph easily, so we implemented our own. It offers the following key functionalities:

Custom nodes: Any object that implements the node interface can be added to the graph. This allows us to use the same graph library for all the different kinds of graphs in the library (e.g. PDG, SDG, etc.).

Edge attributes: Different kinds of relations can be added to the graph using edge attributes. This is useful in e.g. the PDG, for distinguishing between control and data dependences. Two edges are considered equivalent if all their attributes match, and cannot be added to the graph multiple times. The library also allows us to retrieve edges by partial matching on attributes. This allows us to easily retrieve only some edge types.

Custom equality: We consider different objects that represent the same node to be equivalent. This is useful for node objects that wrap other objects, e.g. PHP-CFG operation nodes in the PDG. When comparing a PDG to a CFG, the CFG operations can simply be wrapped in new operation node objects, which will be considered equivalent to any node containing the same operation already in the PDG. Custom equality is implemented by requiring nodes to implement the getHash method, which should return an equivalent hash for equivalent nodes.

Fast edge lookup: For complex programs, the PDG can grow quite large, so near-constant lookup times are required. Our graph library offers amortized O(1) lookups by indexing all edges by the hashes of their nodes. Although the lack of indexing on attributes means that theoretically edge lookup is still O(n), in practice very few graphs have many edges between the same nodes.

The implementation of the main components of our graph library is shown in detail in appendix A.
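The functionalities listed above can be sketched in a few lines. The following is a simplified illustration under our own assumptions, not the code of appendix A: only the getHash method is taken from the description above, while NodeInterface, OpNode, Graph, addEdge and getEdges are hypothetical names.

```php
<?php
interface NodeInterface
{
    public function getHash(): string;
}

// Hypothetical node type: two wrappers around the same label are equivalent.
final class OpNode implements NodeInterface
{
    private $label;

    public function __construct(string $label)
    {
        $this->label = $label;
    }

    public function getHash(): string
    {
        return 'op:' . $this->label;
    }
}

final class Graph
{
    /** @var array attribute sets indexed by [from hash][to hash] */
    private $edges = [];

    public function addEdge(NodeInterface $from, NodeInterface $to, array $attributes = []): void
    {
        ksort($attributes); // make attribute order irrelevant for equivalence
        $f = $from->getHash();
        $t = $to->getHash();
        foreach ($this->edges[$f][$t] ?? [] as $existing) {
            if ($existing === $attributes) {
                return; // an equivalent edge is already present
            }
        }
        $this->edges[$f][$t][] = $attributes;
    }

    public function getEdges(NodeInterface $from, NodeInterface $to, array $filter = []): array
    {
        $matches = [];
        foreach ($this->edges[$from->getHash()][$to->getHash()] ?? [] as $attributes) {
            foreach ($filter as $key => $value) {
                if (!array_key_exists($key, $attributes) || $attributes[$key] !== $value) {
                    continue 2; // partial match failed, try the next edge
                }
            }
            $matches[] = $attributes;
        }
        return $matches;
    }
}

$graph = new Graph();
$jumpIf = new OpNode('Stmt_JumpIf@3');
$assign = new OpNode('Expr_Assign@4');
$graph->addEdge($jumpIf, $assign, ['type' => 'control', 'case' => true]);
$graph->addEdge($jumpIf, $assign, ['case' => true, 'type' => 'control']); // equivalent, ignored
$graph->addEdge($jumpIf, $assign, ['type' => 'data', 'operand' => 'expr']); // only to exercise the filter
$controlEdges = $graph->getEdges(new OpNode('Stmt_JumpIf@3'), new OpNode('Expr_Assign@4'), ['type' => 'control']);
```

Looking up the edges between two nodes is a pair of hash-table accesses, while the attribute filter is a linear scan over the (usually very short) list of edges between those two nodes.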

4.2 Implementation

We construct the PDG of a procedure using a Factory object, for which the input is a PHP-CFG Func object and the output is a Php-Pdg Func object. Like PHP-CFG, we do not distinguish between different kinds of procedures. We traverse the PHP-CFG Func object three times, first initializing the graph by adding all operation nodes, then adding control dependences and then adding data dependences. We demonstrate this using the program of figure 4.2 as a running example. The implementation of the outward facing components of our PDG library is shown in detail in appendix B.

³ https://github.com/grammarware/pdg

(a) Source

    6     $a = 'bar';
    7 }
    8 echo $a;

(b) CFG

    Function {main}():
    Stmt_JumpIf cond: LITERAL(1)
    (assignment, line 4) expr: LITERAL('foo') result: Var#2, followed by Stmt_Jump
    (assignment, line 6) var: Var#3<$a> expr: LITERAL('bar') result: Var#4, followed by Stmt_Jump
    Var#5<$a> = Phi(Var#1<$a>, Var#3<$a>)
    Terminal_Echo expr: Var#5<$a>
    Terminal_Return expr: LITERAL(1)

Figure 4.2: Running example program

4.2.1 Control dependences

To determine control dependences we implement the method of Ferrante et al. [8], with some adaptations to suit our case. We determine control dependences on the level of basic blocks first, before translating them to individual operations. A basic block is a linear sequence of program instructions having one entry point and one exit point [1]. By definition it can thus have at most one operation that is a control dependence of any other operation, and this is always the last operation in the block. This means that determining the control dependences of basic blocks is sufficient to be able to infer the control dependences of individual operations. We use this fact to significantly reduce the size of the graph that needs to be evaluated in the intermediate steps. Our algorithm involves the following high-level steps, which are explained further below:

1. Construct a CFG for the basic blocks of the original CFG.
2. Construct a Post-Dominator Tree (PDT) for the basic block CFG.
3. Construct a Control Dependence Graph (CDG) using the basic block CFG and basic block PDT.
4. Add individual operation control dependences to the PDG using the basic block CDG.

Figure 4.3 shows all intermediate constructs for the program in figure 4.2. To enable us to identify individual blocks we annotate them with the minimum and maximum start line number of the AST nodes that the operations they contain were generated from. For instance, the block containing the Stmt_JumpIf operation that was generated from the if statement on line 3 is shown as Block@3:3. Note that we do not use the end line numbers of AST nodes. This is because the end line number of e.g. an if statement would be the line number of its last closing brace, meaning it would enclose any blocks generated from its ‘then’ or ‘else’ statements. We found this counterintuitive when reasoning about CFG operations, and it made individual blocks more difficult to distinguish.

(a) Block CFG (b) Block PDT (c) Block CDG

Figure 4.3: Control dependence intermediate constructs

Basic block CFG

The CFG created by PHP-CFG already contains Block objects, which are the basic blocks we need, so we need only add them to a new graph. As per the method of Ferrante et al. [8], we augment the new CFG with an entry and a stop node, link them to the appropriate block nodes, and add an edge from the entry node to the stop node. These steps are required to ensure that nodes which have no control dependences on other nodes become control dependent on the entry node. Figure 4.3a shows the resulting basic block CFG for the program in figure 4.2. The block ending in a conditional jump (the if statement) has its outgoing edges annotated to reflect the case that would result in that edge.
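The augmentation step can be sketched as follows, assuming a hypothetical representation of the block CFG as a map from block names to successor lists. The function augmentCfg is an illustrative name and not part of our library.

```php
<?php
/**
 * Hypothetical sketch: add ENTRY and STOP nodes to a block CFG.
 *
 * @param array  $successors map from block name to list of successor block names
 * @param string $first      name of the first block of the procedure
 * @return array augmented successor map
 */
function augmentCfg(array $successors, string $first): array
{
    // ENTRY flows to the first block and directly to STOP.
    $augmented = ['ENTRY' => [$first, 'STOP']];
    foreach ($successors as $block => $targets) {
        // Blocks without successors end the procedure, so they flow to STOP.
        $augmented[$block] = ($targets === []) ? ['STOP'] : $targets;
    }
    $augmented['STOP'] = [];
    return $augmented;
}

// Block CFG of the running example, using the block names of figure 4.3.
$blockCfg = augmentCfg([
    'Block@3:3' => ['Block@4:4', 'Block@6:6'],
    'Block@4:4' => ['Block@8:8'],
    'Block@6:6' => ['Block@8:8'],
    'Block@8:8' => [],
], 'Block@3:3');
```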

Basic block PDT

A PDT encodes post-domination relations between nodes. We use the following definitions [1]:

• A node A is post-dominated by a node B if all paths from A to STOP contain B.

• A node A is strictly post-dominated by a node B if A is post-dominated by B and A is not equal to B.

• A node A is immediately post-dominated by a node B if A is strictly post-dominated by B, but B does not strictly post-dominate any other node that strictly post-dominates A.

We construct the PDT by implementing the method for computing dominance presented by Cooper et al. [6], though without their optimizations. It is not the fastest available option, but its implementation is very simple and its performance is not a determining factor in the overall performance of the library. The algorithm iteratively evaluates the following equations until it has reached a fixpoint:

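$$\mathrm{Dom}(n_0) = \{n_0\}$$

$$\mathrm{Dom}(n) = \Big( \bigcap_{p \in \mathrm{preds}(n)} \mathrm{Dom}(p) \Big) \cup \{n\}$$

Here $n_0$ is the entry node of the graph and $\mathrm{preds}(n)$ is the set of predecessors of $n$. By running this algorithm on an inverted CFG we can compute post-dominance relationships. We then proceed to eliminate non-strict and non-immediate relationships, and are left with a PDT. Figure 4.3b shows the basic block PDT for the example program in figure 4.2.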
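The fixpoint computation and the reduction to immediate post-dominators can be sketched as follows. This is an illustration of the set-based equations above applied to the inverted CFG, not the code of our library; the function names and the array representation of the CFG are hypothetical, and we assume that every node can reach STOP.

```php
<?php
/**
 * Hypothetical sketch: post-dominator sets via the iterative set equations.
 * Predecessors in the inverted CFG are successors in the original CFG.
 *
 * @param array  $successors map from node name to successor names (original CFG)
 * @param string $stop       name of the STOP node
 * @return array map from node name to the sorted list of its post-dominators
 */
function postDominators(array $successors, string $stop): array
{
    $nodes = array_keys($successors);
    sort($nodes);
    $pdom = [];
    foreach ($nodes as $n) {
        // STOP is only post-dominated by itself; all others start with every node.
        $pdom[$n] = ($n === $stop) ? [$stop] : $nodes;
    }
    do {
        $changed = false;
        foreach ($nodes as $n) {
            if ($n === $stop) {
                continue;
            }
            $new = $nodes;
            foreach ($successors[$n] as $s) {
                $new = array_intersect($new, $pdom[$s]);
            }
            $new = array_values(array_unique(array_merge($new, [$n])));
            sort($new);
            if ($new !== $pdom[$n]) {
                $pdom[$n] = $new;
                $changed = true;
            }
        }
    } while ($changed);
    return $pdom;
}

/**
 * The immediate post-dominator of n is the strict post-dominator d whose own
 * post-dominator set equals the set of strict post-dominators of n.
 */
function immediatePostDominators(array $pdom): array
{
    $ipdom = [];
    foreach ($pdom as $n => $set) {
        $strict = array_values(array_diff($set, [$n]));
        foreach ($strict as $d) {
            if (count($pdom[$d]) === count($strict)) {
                $ipdom[$n] = $d;
                break;
            }
        }
    }
    return $ipdom;
}

// Augmented block CFG of the running example (see figure 4.3a).
$cfg = [
    'ENTRY'     => ['Block@3:3', 'STOP'],
    'Block@3:3' => ['Block@4:4', 'Block@6:6'],
    'Block@4:4' => ['Block@8:8'],
    'Block@6:6' => ['Block@8:8'],
    'Block@8:8' => ['STOP'],
    'STOP'      => [],
];
$pdt = immediatePostDominators(postDominators($cfg, 'STOP'));
```

The resulting parent map makes Block@8:8 the parent of the three other blocks, and STOP the parent of both ENTRY and Block@8:8.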

Basic block CDG

Using the basic block CFG and basic block PDT, we can now determine the control dependences between basic blocks and create a basic block CDG. We evaluate all edges A-B in the CFG where B does not post-dominate A according to the PDT, and determine their least common ancestor L in the PDT. There are two possible situations [8]:

1. L is the parent of A
2. L is equal to A (loop dependence)

In case 1, we make all nodes in the PDT between B and L, excluding L, control dependent on A. In case 2, we make all nodes in the PDT between B and L, including L, control dependent on A. Figure 4.3c shows the basic block CDG for the example program in figure 4.2. The block ending in a conditional jump again has its outgoing edges annotated to reflect the case that would result in that edge.
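Both cases can be handled by a single walk up the PDT. The sketch below uses this common formulation instead of computing the least common ancestor explicitly: starting at B, every node up to but excluding the parent of A becomes control dependent on A. When L is equal to A, this walk passes through A itself, which yields the loop dependence. The function name and data layout are hypothetical.

```php
<?php
/**
 * Hypothetical sketch: derive block control dependences from CFG edges and a PDT.
 *
 * @param array $edges list of [from, to, case] CFG edges
 * @param array $ipdom map from node to its parent in the PDT
 * @return array list of [controlling block, dependent block, case]
 */
function controlDependences(array $edges, array $ipdom): array
{
    $cdg = [];
    foreach ($edges as [$a, $b, $case]) {
        // If B is the parent of A in the PDT, B post-dominates A and the
        // loop body is never entered, so the edge is skipped automatically.
        for ($n = $b; $n !== $ipdom[$a]; $n = $ipdom[$n]) {
            $cdg[] = [$a, $n, $case];
        }
    }
    return $cdg;
}

// Block CFG edges and PDT of the running example (see figure 4.3).
$edges = [
    ['ENTRY', 'Block@3:3', null],
    ['ENTRY', 'STOP', null],
    ['Block@3:3', 'Block@4:4', true],
    ['Block@3:3', 'Block@6:6', false],
    ['Block@4:4', 'Block@8:8', null],
    ['Block@6:6', 'Block@8:8', null],
    ['Block@8:8', 'STOP', null],
];
$ipdom = [
    'ENTRY'     => 'STOP',
    'Block@3:3' => 'Block@8:8',
    'Block@4:4' => 'Block@8:8',
    'Block@6:6' => 'Block@8:8',
    'Block@8:8' => 'STOP',
];
$cdg = controlDependences($edges, $ipdom);
```

For the running example this yields four dependences: Block@3:3 and Block@8:8 depend on ENTRY, and Block@4:4 and Block@6:6 depend on Block@3:3 for the true and false case respectively.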

Operation control dependences

Control dependences on an individual operation level are directly inferred from those on the level of basic blocks. We do this by evaluating all edges A-B in the basic block CDG, and making all operations in B control dependent on the last operation of A. Figure 4.4 shows the control dependence subgraph of the PDG for the example program in figure 4.2. It contains operation nodes instead of block nodes. The operation nodes are annotated with the start line number of the AST node that they were constructed from. As the implicit return was not part of the AST, it shows up at line -1. The edges are annotated with the type of dependence, which is always control here, and the case that would result in that edge.

Figure 4.4: PDG control dependence subgraph
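The translation from block level to operation level can be sketched as follows. The function name and the array-based representation of blocks and operations are hypothetical; operation labels follow the notation of figure 4.4.

```php
<?php
/**
 * Hypothetical sketch: translate block control dependences to operations.
 *
 * @param array $blockCdg list of [controlling block, dependent block, case]
 * @param array $blockOps map from block name to its list of operation labels
 * @return array list of [controlling operation, dependent operation, case]
 */
function operationControlDependences(array $blockCdg, array $blockOps): array
{
    $edges = [];
    foreach ($blockCdg as [$a, $b, $case]) {
        // ENTRY contains no operations and acts as its own controlling node;
        // for real blocks the last operation is the controlling one.
        $controller = isset($blockOps[$a]) ? $blockOps[$a][count($blockOps[$a]) - 1] : $a;
        foreach ($blockOps[$b] as $operation) {
            $edges[] = [$controller, $operation, $case];
        }
    }
    return $edges;
}

$blockOps = [
    'Block@3:3' => ['Stmt_JumpIf@3'],
    'Block@4:4' => ['Expr_Assign@4', 'Stmt_Jump@3'],
    'Block@6:6' => ['Expr_Assign@6', 'Stmt_Jump@3'],
    'Block@8:8' => ['Terminal_Echo@8', 'Terminal_Return@-1'],
];
$blockCdg = [
    ['ENTRY', 'Block@3:3', null],
    ['ENTRY', 'Block@8:8', null],
    ['Block@3:3', 'Block@4:4', true],
    ['Block@3:3', 'Block@6:6', false],
];
$opEdges = operationControlDependences($blockCdg, $blockOps);
```

This produces seven edges for the running example: three from ENTRY and four from the Stmt_JumpIf operation, two for each case.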

4.2.2 Data dependences

Ferrante et al. [8] list 4 types of data dependences:

Flow dependence: read after write
Antidependence: write after read
Input dependence: read after read
Output dependence: write after write

As noted in section 4.1.1, the SSA form of the CFG created by PHP-CFG allows us to directly infer flow dependences. PHP-CFG uses operand objects that are referenced from both defining and using operations. Operations using an operand should have data dependences on the operation defining that operand. As the CFG is in SSA form, some operands are defined by phi operations. These phi operations are not part of our PDG, so we recursively resolve them to the operations that define their operands. Figure 4.2 shows that the echo statement on line 8 references variable 5, which is defined by a phi operation. The phi operation itself uses variables 1 and 3, which are defined by the assignments on lines 4 and 6 respectively. The flow dependences of the echo operation on line 8 are therefore the two assignments on lines 4 and 6.

Figure 4.5 shows the data dependence subgraph of the PDG for the example program in figure 4.2. The operation nodes are again annotated with the starting line number of the AST node that they were generated from. The edges are annotated with the type of dependence, which is always data here, and the name of the operand that they were generated from. Adding the operand facilitates easier debugging and verification, but means that there can be multiple data dependence edges between two operations if an operation uses an operand multiple times.

Figure 4.5: PDG data dependence subgraph
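The recursive resolution of phi operations can be sketched as follows. The data layout is a hypothetical simplification in which each operand name maps either to a defining operation or to the inputs of a phi; it does not use the PHP-CFG object model. The guard against revisiting operands is our own addition, as phi operations in loops can refer to each other.

```php
<?php
/**
 * Hypothetical sketch: resolve an operand to its non-phi defining operations.
 *
 * @param string $operand operand name, e.g. 'Var#5'
 * @param array  $defs    map from operand name to ['op' => label] or ['phi' => operand names]
 * @param array  $seen    operands already visited (guards against phi cycles)
 * @return string[] labels of the defining operations
 */
function resolveDefiningOps(string $operand, array $defs, array &$seen = []): array
{
    if (isset($seen[$operand]) || !isset($defs[$operand])) {
        return [];
    }
    $seen[$operand] = true;
    $definition = $defs[$operand];
    if (!isset($definition['phi'])) {
        return [$definition['op']];
    }
    $operations = [];
    foreach ($definition['phi'] as $input) {
        foreach (resolveDefiningOps($input, $defs, $seen) as $operation) {
            $operations[$operation] = $operation; // de-duplicate
        }
    }
    return array_values($operations);
}

// Definitions of the running example (see figure 4.2).
$defs = [
    'Var#1' => ['op' => 'Expr_Assign@4'],
    'Var#3' => ['op' => 'Expr_Assign@6'],
    'Var#5' => ['phi' => ['Var#1', 'Var#3']],
];
$echoDependsOn = resolveDefiningOps('Var#5', $defs);
```

For the echo operation on line 8 this resolves Var#5 to the two assignments, matching the flow dependences described above.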

We have currently limited our scope to supporting flow dependences. The reason for this is that the other types of dependences do not really apply to the SSA context of PHP-CFG. Ferrante et al. [8] showed that they can be useful when applying certain optimizations to the original non-SSA source code, so we see them as a good optional add-on to Php-Pdg. The implementation of this add-on remains future work.

4.2.3 Combined

By combining the control and data dependence subgraphs we get a full PDG. Figure 4.6 shows the PDG for the example program in figure 4.2. We now see the different dependences combined, and the need for annotating edges with their type.

Figure 4.6: Full PDG

4.2.4 Extensibility

Our PDG implementation is modular and can easily be extended with support for additional types of control and data dependences. To demonstrate this we have implemented a data dependence generator that adds ‘maybe data’ dependences for eval statements and variable variables⁴. As these language features can technically refer to any variables in the current scope, this generator can make these possible dependences explicit. We do not use this generator by default, but it can be passed to the factory to create a PDG with ‘maybe data’ dependences.

⁴ https://github.com/mwijngaard/php-pdg/blob/master/src/PhpPdg/ProgramDependence/DataDependence/MaybeGenerator.php

Chapter 5

SDG implementation

In this chapter we present our implementation of an SDG for PHP. It is part of our PDG and SDG library called Php-Pdg, and is publicly available¹. We first cover the libraries that we use for our implementation and then present our implementation of the SDG itself.

5.1 Libraries

The SDG implementation is based on our own PDG implementation, described in chapter 4, and PHP-Types, originally created by Anthony Ferrara.

5.1.1 PHP-Types

We use our own fork of PHP-Types2 for resolving calls. PHP-Types is a type reconstruction library for PHP-CFG. It supports parsing type information from type hints, literals and phpdoc3 annotations (@var, @param, @return). It also supports inferring result types of PHP expressions, and contains a database of type information for PHP’s built-in procedures, allowing type inference to also work for them.

Like PHP-CFG, PHP-Types was created in the summer of 2015, meaning it is also rather immature. This translates to a lot of room for improvement concerning the support and accuracy of type reconstruction. Since we use type information to generate method call dependences, any improvements to type reconstruction will automatically increase the quality of method call dependences in the SDG.

Model PHP-Types operates on the CFG created by PHP-CFG (see 4.1.1). It offers a TypeReconstructor object, which takes as input all the PHP-CFG Script objects of the entire application. It uses these scripts to construct a state of the application that contains a class hierarchy and indexes certain objects that are required for type reconstruction. Classes, for example, are indexed by their name, and their methods are indexed by their class object and method name. It also retrieves all operands in the application. The type reconstructor processes all operands that it can resolve immediately, such as literals and operands with type declarations, and then proceeds to iteratively evaluate the operations defining any remaining operands. It attempts to infer the result types of these operations using the operands it has already resolved, its knowledge of language constructs, and the application state.

The types of resolved operands are described using a Type object. PHP-Types follows the type model of phpDocumentor4, with some additions to accommodate practical usage. A Type object has a type property that contains a constant value representing the actual type, such as TYPE_STRING or TYPE_INT. Some types are compound types, such as TYPE_ARRAY, or represent multiple possible run-time types, such as TYPE_UNION. For these types the Type object has a property subTypes that contains any additional type information in the form of other Type objects. For a TYPE_OBJECT, a Type object also has a property userType, which contains the name of the object class, if it can be determined. Figure 5.1 shows a multi-procedure program and its associated CFG with reconstructed types.

1https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/SystemDependence
2https://github.com/mwijngaard/php-types
3https://www.phpdoc.org/
4https://www.phpdoc.org/docs/latest/references/phpdoc/types.html

Contributions As part of our work for this thesis we made a number of significant changes to PHP-Types, which we intend to eventually merge upstream.

Application state additions Type reconstruction and call dependence generation require a lot of the same information. Examples include indexes of classes and methods, a class hierarchy, and knowledge of built-in procedures and types. We decided to implement some of the structures we required in Php-Pdg into PHP-Types. This allowed us to generate call dependences and simultaneously improve the quality of type reconstruction.

Class hierarchy The class hierarchy consists of two indexes: which names a class resolves, and which names it is resolved by. If A extends B and B extends C, for example, then B is resolved by A and B, and B resolves B and C. The original hierarchy did not correctly support inheritance hierarchies more than one level deep, and did not integrate inheritance of built-in objects. Since we intended to support polymorphism and overloading, we required a correct class hierarchy. We therefore combined the built-in class hierarchy already in PHP-Types with that of the application state, and added the computation of a transitive closure for the ‘resolves’ and ‘resolved by’ indexes.

Polymorphism The original method of PHP-Types for handling polymorphism was to link a method call to all implementations of that method on classes that the class resolved (all superclasses). This is incorrect, as polymorphism should link to the least common implementation of a method on all possible classes that the class is resolved by (all subclasses). Using our new class hierarchy, we implemented this correctly.
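The transitive closure of the ‘resolves’ and ‘resolved by’ indexes can be illustrated with a minimal Python sketch. It is not the code of PHP-Types: the function name build_hierarchy and the shape of its input (a mapping from a class name to the names it directly extends) are assumptions made for illustration only.

```python
from collections import defaultdict


def build_hierarchy(extends):
    """Compute transitive 'resolves' and 'resolved by' indexes.

    `extends` maps a class name to the names it directly extends or
    implements (a hypothetical input shape, not the PHP-Types state).
    """
    resolves = defaultdict(set)
    resolved_by = defaultdict(set)

    # Collect every name that occurs as a child or as a parent.
    names = set(extends)
    for parents in extends.values():
        names.update(parents)

    for name in names:
        # Walk upwards through all ancestors, guarding against cycles.
        seen = set()
        stack = [name]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(extends.get(current, ()))
        for ancestor in seen:
            resolves[name].add(ancestor)      # name can stand in for ancestor
            resolved_by[ancestor].add(name)   # ancestor may really be name

    return dict(resolves), dict(resolved_by)


# The example from the text: A extends B and B extends C.
resolves, resolved_by = build_hierarchy({"A": ["B"], "B": ["C"]})
```

For the example from the text, resolves["B"] contains B and C, and resolved_by["B"] contains A and B. A hierarchy that is only one level deep would miss that A also resolves C.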

5.2 Implementation

We construct the SDG of a system using a Factory object. As the type reconstruction of PHP-Types works on PHP-CFG scripts, the input to our SDG factory is an object containing all scripts indexed by filename. It runs the type reconstructor on all scripts, uses the PDG factory to create PDGs for all procedures in the scripts, and adds the PDGs to a System object. The System object stores all individual PDGs, and the graph representing the SDG. The factory then uses the application state created by the type reconstructor to iterate over all function and method calls in the application, adds them to the SDG, and generates containment and call edges. The implementation of the outward-facing components of our SDG library is shown in detail in appendix C.

The SDG graph contains only procedure and call operation nodes, represented by Func and Op nodes. We were unable to represent the SDG and PDGs as a single graph as this degraded performance too much. Figure 5.2 shows the SDG for the example program in figure 5.1. The function calls are operation nodes as previously seen in the PDG, and so are again annotated with the starting line number of the AST node that they were generated from. The procedure nodes show the file and name of the procedure. There are ‘contains’ edges, indicating that a procedure contains an operation, and ‘call’ edges, indicating that a call operation can call a procedure.

5.2.1 Function calls

Function calls can resolve to either a function defined by the application, or one of PHP’s built-in functions. As we have seen in section 3.3 of our language analysis, conditional inclusion can result in the definition of a function varying at run-time, and projects can monkey-patch functions to provide backwards compatibility. We therefore generate an over-approximation of call dependences by linking to all available application and built-in implementations of a function. If we cannot find any function definitions but we can statically determine the function name, we add an UndefinedFunc node to the graph with the name of the function. This allows us to link all calls to the same undefined function to a single node in the graph.

In PHP, the function name in a function call can be fully qualified, qualified, or unqualified. Fully qualified means the name is specified relative to the root namespace, qualified means the name is specified relative to a currently imported namespace, and unqualified means that the name is not namespaced. PHP-CFG automatically rewrites all qualified names to their fully qualified variants, ensuring we do not have to deal with namespace imports. Unqualified function calls can refer to a function in either the current or the global namespace. To support this dual naming, PHP-CFG represents unqualified function calls as an NsFuncCall object that stores both possible names.

Figure 5.3 shows an example program with various kinds of function calls and its associated SDG. Algorithm 1 shows our steps for generating call dependence edges for fully qualified function calls. The algorithm iterates over all calls and checks if the function name in the call is a literal. If so, we first try to add call edges to all application implementations, and then try to add a call edge to a built-in implementation. Finally, if neither an application nor a built-in implementation has been found, we add a call edge to an UndefinedFunc node. Algorithm 2 shows our steps for generating call dependence edges for unqualified function calls. It is similar to algorithm 1, but first tries to add call edges to all application implementations of its namespaced variant, and adds call edges to UndefinedFunc nodes for both the namespaced and non-namespaced variants if no implementations have been found.

[Figure 5.1: Multi-procedure program. (a) Source, (b) Typed CFG for the functions {main}, dec and fac. Surviving source lines:
17 return $n * fac(dec($n));
18 }
19 return 1;
20 }
21
22 echo fac(5);]

[Figure 5.2: Multi-procedure SDG. Nodes: Func[test.php[{main}]], Op[Expr_FuncCall]@22, Func[test.php[fac]], Op[Expr_FuncCall]@17 (twice), Func[test.php[dec]]. Edges are annotated with type ‘contains’ or ‘call’.]

5.2.2 Method calls

In PHP, any method call can resolve to either a static or a non-static implementation. Although a static call to an instance method is now deprecated, the reverse is still common, for example when working with phpUnit. As both practices are technically still possible, we ignore the distinction between static and instance methods and use the same algorithm for resolving both kinds of calls. Figure 5.4 shows an example program with various kinds of method calls and its associated SDG.

To determine the target of a method call, we first require knowledge of the class of the object that the method is being called on. We obtain this from the type information added to the CFG by PHP-Types. This can result in multiple class names, as a type can be a union of other types.

[Figure 5.3: SDG with function calls. (a) Source, (b) CFG. Nodes: Func[test.php[{main}]], Op[Expr_FuncCall]@19, Op[Expr_FuncCall]@18, Op[Expr_NsFuncCall]@12, Op[Expr_FuncCall]@5, UndefinedFunc[bar\baz], Func[test.php[Bar\foo]], Op[Expr_FuncCall]@10, Func[test.php[foo]]. Edges are annotated with type ‘contains’ or ‘call’.]

Algorithm 1: Generating call dependences for fully qualified function calls
Data: State = an object containing application state
for Call in Calls do
    if Call.name is Literal then
        AppFuncs ← State.functionLookup[Call.name];
        if AppFuncs is not empty then
            addCallEdges(Call, AppFuncs);
        end
        BuiltinFunc ← State.builtins.functionLookup[Call.name];
        if BuiltinFunc is not null then
            addCallEdge(Call, BuiltinFunc);
        end
        if AppFuncs is empty and BuiltinFunc is null then
            addCallEdge(Call, UndefinedFunc(Call.name));
        end
    end
end

Algorithm 2: Generating call dependences for unqualified function calls
Data: State = an object containing application state
for Call in Calls do
    NamespacedAppFuncs ← State.functionLookup[Call.namespacedName];
    if NamespacedAppFuncs is not empty then
        addCallEdges(Call, NamespacedAppFuncs);
    end
    AppFuncs ← State.functionLookup[Call.name];
    if AppFuncs is not empty then
        addCallEdges(Call, AppFuncs);
    end
    BuiltinFunc ← State.builtins.functionLookup[Call.name];
    if BuiltinFunc is not null then
        addCallEdge(Call, BuiltinFunc);
    end
    if NamespacedAppFuncs is empty and AppFuncs is empty and BuiltinFunc is null then
        addCallEdge(Call, UndefinedFunc(Call.namespacedName));
        addCallEdge(Call, UndefinedFunc(Call.name));
    end
end
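Algorithms 1 and 2 can be combined into one small Python sketch. It is a minimal illustration, not the Php-Pdg implementation: the Call class, the lookup tables app_funcs and builtin_funcs, and the tuple-based node labels are hypothetical stand-ins for the application state and the graph nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    # None models a call whose name is not a literal (a variable function call).
    name: Optional[str]
    # Only set for unqualified calls, mirroring the two names of an NsFuncCall.
    namespaced_name: Optional[str] = None


def resolve_function_call(call, app_funcs, builtin_funcs):
    """Return the over-approximated set of call-edge targets for one call."""
    if call.name is None:
        return set()  # cannot be resolved statically

    # Unqualified calls are looked up under the namespaced name first.
    candidates = [call.name]
    if call.namespaced_name is not None:
        candidates.insert(0, call.namespaced_name)

    targets = set()
    for candidate in candidates:
        for definition in app_funcs.get(candidate, ()):
            targets.add(("Func", definition))
    if call.name in builtin_funcs:
        targets.add(("BuiltinFunc", call.name))

    # No implementation at all: link to one undefined node per possible name.
    if not targets:
        targets = {("UndefinedFunc", candidate) for candidate in candidates}
    return targets


# Hypothetical application state for illustration.
app_funcs = {"foo": ["test.php[foo]"], "bar\\foo": ["test.php[Bar\\foo]"]}
builtin_funcs = {"strlen"}

both = resolve_function_call(Call("foo", "bar\\foo"), app_funcs, builtin_funcs)
undefined = resolve_function_call(Call("baz", "bar\\baz"), app_funcs, builtin_funcs)
```

Here the unqualified call to foo is linked to both the namespaced and the global definition, and the unqualified call to baz is linked to two UndefinedFunc targets, one for each possible name.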

[Figure 5.4: SDG with method calls. (a) Source, (b) CFG. Surviving source line:
13 $foo->quux(); // instance call to undefined method
Nodes: Func[test.php[{main}]], Op[Expr_MethodCall]@12, Op[Expr_MethodCall]@13, Op[Expr_StaticCall]@11, Func[test.php[Foo::baz]], UndefinedFunc[foo::quux], Op[Expr_StaticCall]@6, Func[test.php[Foo::bar]]. Edges are annotated with type ‘contains’ or ‘call’.]

Algorithm 3 shows our steps for deducing the class names from an operand. The algorithm starts by calling resolveClassNamesFromOperand with a PHP-CFG operand, which has a property ‘type’ containing a PHP-Types Type object. We first check if the operand is a string and a literal, in which case we can take the value of the literal as the class name. If the operand is not a string, then we try to resolve any class names from the Type object itself by calling resolveClassNamesFromType with the operand’s Type object. This first checks if the type represents an object, in which case we take the value of userType as the class name, and then checks if it is a union, in which case we recursively determine the class names of all subtypes.

Algorithm 3: Resolving operand class names
Data: State = an object containing application state
Function resolveClassNamesFromType(Type)
    ClassNames ← ∅;
    if Type.type = TYPE_OBJECT then
        if Type.userType ≠ null then
            ClassNames ← ClassNames ∪ {Type.userType};
        end
    else if Type.type = TYPE_UNION then
        for SubType ∈ Type.subTypes do
            ClassNames ← ClassNames ∪ resolveClassNamesFromType(SubType);
        end
    end
    return ClassNames;
end
Function resolveClassNamesFromOperand(Operand)
    ClassNames ← ∅;
    if Operand.type.type = TYPE_STRING then
        if Operand is Literal then
            ClassNames ← ClassNames ∪ {Operand.value};
        end
    else
        ClassNames ← ClassNames ∪ resolveClassNamesFromType(Operand.type);
    end
    return ClassNames;
end
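The recursion over union types in algorithm 3 can be written as a short Python sketch. The Type and Operand classes below are simplified stand-ins for the PHP-Types and PHP-CFG objects, and the constant values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical constant values; only the names follow the text.
TYPE_STRING, TYPE_OBJECT, TYPE_UNION = "string", "object", "union"


@dataclass
class Type:
    type: str
    user_type: Optional[str] = None            # class name for object types
    sub_types: List["Type"] = field(default_factory=list)


@dataclass
class Operand:
    type: Type
    value: Optional[str] = None                # set only for literal operands


def class_names_from_type(t):
    """Collect class names from an object type or, recursively, a union."""
    if t.type == TYPE_OBJECT:
        return {t.user_type} if t.user_type is not None else set()
    if t.type == TYPE_UNION:
        names = set()
        for sub_type in t.sub_types:
            names |= class_names_from_type(sub_type)
        return names
    return set()


def class_names_from_operand(operand):
    """A literal string names a class directly; otherwise use the type."""
    if operand.type.type == TYPE_STRING:
        return {operand.value} if operand.value is not None else set()
    return class_names_from_type(operand.type)


union = Type(TYPE_UNION, sub_types=[
    Type(TYPE_OBJECT, user_type="Foo"),
    Type(TYPE_OBJECT, user_type="Bar"),
    Type(TYPE_STRING),
])
names = class_names_from_operand(Operand(union))
```

A union of two object types and a string yields the two class names, while a string operand without a literal value yields no class names at all.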

The method call dependences we generate are again an over-approximation of the possibilities at run-time. Section 3.3 of our language analysis showed that several projects vary the run-time implementation of a class using conditional includes, and that projects can monkey-patch built-in classes. Another reason for the over-approximation is that an object of a certain class can in theory also be an instance of any of its subclasses. These subclasses can provide their own implementations of polymorphic methods, which will be called at run-time. To accommodate this we generate call dependences assuming that the object can be an instance of any possible subclass, and determine for each of these classes which implementation of a method would be called at run-time. This means first looking at the class itself, and recursively looking at any parent classes while the method cannot be found. We add call dependences to each possible target method. As it is common practice to extend built-in classes, for instance to create custom Exceptions, we also look for built-in class definitions when resolving polymorphic methods.

Algorithm 4 shows our full procedure for generating method call dependences. The algorithm starts by calling addMethodCallDependences with all method calls. It iterates over all method calls and checks if the target method name is a literal. If so, we first try to resolve the class names of the target object operand by calling the resolveClassNamesFromOperand function of algorithm 3 with the operand. We then iterate over each of these possible classes to determine which classes it is resolved by (all subclasses, including the class itself), and try to find the polymorphic method implementation that would be called on each of these subclasses at run-time by calling resolveMethodCall with the name of the subclass and method. We resolve both application and built-in implementations of the method. If a class can be found in the application, we check if it has the method defined. If not, we check if it has a parent class and recurse. To resolve a built-in method we have a separate function that only recurses to built-in parent classes. After all method implementations have been resolved for each subclass, we add call edges to each implementation.

An aspect that we have left out of our current implementation is method visibility. Like many other object-oriented languages, PHP allows methods to be declared public, protected, or private. Public methods can be called from anywhere, protected methods can be called from the object’s class or subclasses, and private methods can only be called from the class itself. As PHP raises a fatal error when a class reduces the visibility of a parent method or when an inaccessible method is called, we do not consider this feature to be essential. Omitting it can at most result in some call dependences for calls that would trigger a fatal error at run-time. All relevant information is available to us, however, so its implementation could be an easy extension.

5.2.3 Overloading

Property overloading is implemented by attempting to resolve a property fetch on an object for which we can determine the class. If the property cannot be resolved we try to resolve the appropriate overloading hook. As only a few built-in objects from exotic PHP extensions implement any of the overloading hooks, we do not try to resolve overloading hooks on built-in objects.

Algorithm 5 shows our steps for generating call dependences for property overloading using the __get method. The algorithm starts by calling addPropertyOverloadingGetCallDependences with all property fetches that are not part of an assign, isset, or unset. The initial steps are similar to algorithm 4, as we again try to determine all possible implementations of a class, but instead of resolving a method we check if a property exists on each possible subclass by calling hasProperty with the name of the subclass and property. This function tries to find the property on the subclass itself, and if it cannot find it recurses to any parent classes. If we cannot find the property on the class or any of its superclasses, we assume the property does not exist and we try to detect if the class uses overloading. We do this by trying to resolve a __get method using the resolveMethodCall function of algorithm 4. If this resolves to an implementation, we have found a use of overloading and add a call edge to that implementation.

The implementations of call dependence generation for the other overloading hooks are very similar, so we do not show them here. For the remaining property overloading hooks (__set, __isset and __unset), the only differences are the set of initial property fetches and the method we try to resolve if a property cannot be found. For method overloading (__call and __callStatic) we slightly modify algorithm 4 so that it attempts to resolve the appropriate overloading hook on each subclass for which the original method cannot be resolved.

5.2.4 Extensibility

Our SDG implementation is modular and can easily be extended with support for additional call dependences. We use separate call dependence generators for function calls, method calls, and overloading method calls. This allows us to decouple implementation logic, but it also allows the Factory configuration to be varied according to the use case.

Algorithm 4: Generating call dependences for method calls
Data: State = an object containing application state
Function resolveBuiltinMethodCall(ClassName, MethodName)
    Method ← null;
    BuiltinMethod ← State.builtins.methodLookup[ClassName, MethodName];
    if BuiltinMethod ≠ null then
        Method ← BuiltinMethod;
    else
        Extends ← State.builtins.extends[ClassName];
        if Extends ≠ null then
            Method ← resolveBuiltinMethodCall(Extends, MethodName);
        end
    end
    return Method;
end
Function resolveMethodCall(ClassName, MethodName)
    Methods ← ∅;
    for Class ∈ State.classLookup[ClassName] do
        AppMethods ← State.methodLookup[Class][MethodName];
        if AppMethods ≠ ∅ then
            Methods ← Methods ∪ AppMethods;
        else if Class.extends ≠ null then
            Methods ← Methods ∪ resolveMethodCall(Class.extends, MethodName);
        end
    end
    BuiltinMethod ← resolveBuiltinMethodCall(ClassName, MethodName);
    if BuiltinMethod ≠ null then
        Methods ← Methods ∪ {BuiltinMethod};
    end
    return Methods;
end
Function addMethodCallDependences(Calls)
    for Call ∈ Calls do
        if Call.name is Literal then
            ClassNames ← resolveClassNamesFromOperand(Call.var);
            for ClassName ∈ ClassNames do
                SubclassNames ← State.classResolvedBy[ClassName];
                for SubclassName ∈ SubclassNames do
                    Methods ← resolveMethodCall(SubclassName, Call.name);
                    addCallEdges(Call, Methods);
                end
            end
        end
    end
end
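The following Python sketch mirrors the structure of algorithm 4 on plain dictionaries. The data layout (app["classes"], builtins["methods"], builtins["extends"], and the resolved_by index) is a hypothetical simplification of the application state, and methods are represented as (class, method) pairs rather than graph nodes.

```python
def resolve_builtin_method(class_name, method, builtins):
    """Walk up the built-in hierarchy until the method is found."""
    while class_name is not None:
        if (class_name, method) in builtins["methods"]:
            return (class_name, method)
        class_name = builtins["extends"].get(class_name)
    return None


def resolve_method(class_name, method, app, builtins):
    """Find the implementations that a call on `class_name` would dispatch to."""
    targets = set()
    # A name can have several definitions, e.g. through conditional includes.
    for cls in app["classes"].get(class_name, ()):
        if method in cls["methods"]:
            targets.add((class_name, method))
        elif cls["extends"] is not None:
            targets |= resolve_method(cls["extends"], method, app, builtins)
    builtin = resolve_builtin_method(class_name, method, builtins)
    if builtin is not None:
        targets.add(builtin)
    return targets


def method_call_targets(class_names, method, resolved_by, app, builtins):
    """Over-approximate: the object may be an instance of any subclass."""
    targets = set()
    for class_name in class_names:
        for subclass in resolved_by.get(class_name, {class_name}):
            targets |= resolve_method(subclass, method, app, builtins)
    return targets


# Hypothetical application state for illustration.
app = {"classes": {
    "Base": [{"methods": {"run"}, "extends": None}],
    "Child": [{"methods": set(), "extends": "Base"}],
    "Other": [{"methods": {"run"}, "extends": "Base"}],
    "MyError": [{"methods": set(), "extends": "Exception"}],
}}
builtins = {"methods": {("Exception", "getMessage")}, "extends": {}}
resolved_by = {"Base": {"Base", "Child", "Other"}, "MyError": {"MyError"}}

run_targets = method_call_targets({"Base"}, "run", resolved_by, app, builtins)
msg_targets = method_call_targets({"MyError"}, "getMessage", resolved_by, app, builtins)
```

A call to run on an object typed as Base is linked to Base::run (which Child inherits) and to the override Other::run. The call to getMessage on the application class MyError is resolved through its built-in parent Exception.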

Algorithm 5: Generating call dependences for property overloading using __get
Data: State = an object containing application state
Function hasProperty(ClassName, PropertyName)
    for Class ∈ State.classLookup[ClassName] do
        AppProperties ← State.propertyLookup[Class][PropertyName];
        if AppProperties ≠ ∅ then
            return True;
        else if Class.extends ≠ null then
            if hasProperty(Class.extends, PropertyName) then
                return True;
            end
        end
    end
    return False;
end
Function addPropertyOverloadingGetCallDependences(PropertyFetches)
    for PropertyFetch ∈ PropertyFetches do
        if PropertyFetch.name is Literal then
            ClassNames ← resolveClassNamesFromOperand(PropertyFetch.var);
            for ClassName ∈ ClassNames do
                SubclassNames ← State.classResolvedBy[ClassName];
                for SubclassName ∈ SubclassNames do
                    for Class ∈ State.classLookup[SubclassName] do
                        if ¬hasProperty(SubclassName, PropertyFetch.name) then
                            Methods ← resolveMethodCall(SubclassName, “__get”);
                            addCallEdges(PropertyFetch, Methods);
                        end
                    end
                end
            end
        end
    end
end
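As a rough illustration of algorithm 5, the sketch below detects uses of __get on a toy class table. It assumes, unlike the thesis, that every class name has exactly one application definition, and it resolves the hook only within application classes. All names and data shapes are hypothetical.

```python
def lookup_chain(class_name, classes):
    """Yield the class and its application ancestors, nearest first."""
    while class_name in classes:
        cls = classes[class_name]
        yield class_name, cls
        class_name = cls["extends"]


def has_property(class_name, prop, classes):
    """True if the class or one of its ancestors declares the property."""
    return any(prop in cls["properties"] for _, cls in lookup_chain(class_name, classes))


def resolve_hook(class_name, hook, classes):
    """Return the nearest implementation of an overloading hook, if any."""
    for name, cls in lookup_chain(class_name, classes):
        if hook in cls["methods"]:
            return (name, hook)
    return None


def get_overloading_targets(class_names, prop, resolved_by, classes):
    """Link a property fetch to __get wherever the property is undeclared."""
    targets = set()
    for class_name in class_names:
        for subclass in resolved_by.get(class_name, {class_name}):
            # Built-in classes are absent from `classes` and are skipped.
            if subclass in classes and not has_property(subclass, prop, classes):
                target = resolve_hook(subclass, "__get", classes)
                if target is not None:
                    targets.add(target)
    return targets


# Hypothetical class table for illustration.
classes = {
    "Model": {"properties": {"id"}, "methods": {"__get"}, "extends": None},
    "User": {"properties": {"name"}, "methods": set(), "extends": "Model"},
    "Plain": {"properties": set(), "methods": set(), "extends": None},
}
resolved_by = {"Model": {"Model", "User"}}

email_targets = get_overloading_targets({"Model"}, "email", resolved_by, classes)
id_targets = get_overloading_targets({"Model"}, "id", resolved_by, classes)
```

Fetching the undeclared property email is linked to Model::__get for both Model and User, whereas fetching the declared property id produces no call edge.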

Chapter 6

Validation

This chapter presents the validation of our PDG and SDG implementation in Php-Pdg. We first describe the application which we used to gather our analysis results and then cover our validation results for each of the dependence types we have implemented.

6.1 Analysis application

To enable us to extract analysis data, we developed our own analysis application on top of Php-Pdg. It is capable of iteratively running AST-, CFG-, PDG- and SDG-based analyses on a directory of individual projects. The application is called Php-Pdg-Analysis, and is publicly available1. This ensures that all analyses presented here are easily reproducible. We applied it to the projects in our corpus, and used the results to validate and iteratively improve Php-Pdg. It contains the following functionalities:

Several analysis types Our application supports analyses that work on the project directory itself, its ASTs, CFGs, PDGs, or SDG. This allows us to use the same application to, for example, count language feature occurrences as well as dependence edges in the SDG.

Table generation All tables in this thesis had their data automatically generated and pre-formatted for maximum readability in LaTeX.

Result caching All analysis results are persisted in JSON format. This allows tables to be reformatted or newly created from stored results.

Intermediate caching The ASTs or CFGs used in most analyses take a significant amount of time to generate, but are generated by third-party libraries and so do not change very often. We have therefore implemented a caching mechanism on top of the AST and CFG parsers that saves the serialized version of their output to disk, indexed on the original file name.
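The intermediate caching idea can be sketched in a few lines of Python. This is an illustration of the general technique, not the Php-Pdg-Analysis code: the DiskCache class, the use of pickle for serialization, and the hashing of the file name are all assumptions. Like the mechanism described above, the cache is keyed on the file name only, so it does not notice when a file changes.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class DiskCache:
    """Wrap a slow parser and persist its output, keyed on the file name."""

    def __init__(self, cache_dir, parse):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.parse = parse

    def _path(self, filename):
        # Hash the name so that any file path maps to a flat cache entry.
        digest = hashlib.sha256(str(filename).encode("utf-8")).hexdigest()
        return self.cache_dir / (digest + ".pickle")

    def get(self, filename):
        path = self._path(filename)
        if path.exists():
            with path.open("rb") as handle:
                return pickle.load(handle)
        result = self.parse(filename)
        with path.open("wb") as handle:
            pickle.dump(result, handle)
        return result


# A stand-in parser that records how often it is actually invoked.
parse_calls = []


def fake_parse(filename):
    parse_calls.append(filename)
    return {"file": filename, "nodes": 3}


cache = DiskCache(tempfile.mkdtemp(), fake_parse)
first = cache.get("src/a.php")
second = cache.get("src/a.php")
```

The second lookup is served from disk, so the stand-in parser runs only once for the file.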

6.2 Results

Some of the projects in our corpus contained extremely complex procedures. These procedures caused the block CFG and PDT that we use for generating control dependences to become too big and require an unreasonable amount of time for PDG construction. To detect files containing these procedures we developed a cyclomatic complexity [21] metric that calculates the maximum cyclomatic complexity of all procedures in a given file. The most extreme procedure we found was in Joomla and had a cyclomatic complexity of 810. A procedure with a cyclomatic complexity greater than 50 is already considered unmaintainable [9], but to accommodate a certain amount of complex code we set our limit at 100. If a file contains a procedure with a cyclomatic complexity above this number we exclude it from our analyses. Table 6.1 shows the number of files that we skipped per project. All remaining files are included in our analyses.

1https://github.com/mwijngaard/php-pdg-analysis
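A file filter of this kind can be sketched as follows. The sketch uses McCabe's formula E − N + 2 on a control flow graph with a single entry and a single exit, which is an assumption: the text does not state how the metric is computed in Php-Pdg-Analysis. The graph representation (a mapping from each block to its successors) is also hypothetical.

```python
CC_LIMIT = 100  # the limit chosen in the text


def cyclomatic_complexity(cfg):
    """McCabe's E - N + 2 for one procedure.

    `cfg` maps each block to its successor blocks and is assumed to be
    a single connected graph with one entry and one exit.
    """
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    edges = sum(len(successors) for successors in cfg.values())
    return edges - len(nodes) + 2


def max_file_complexity(procedures):
    """Maximum complexity over all procedures in a file (0 if there are none)."""
    return max((cyclomatic_complexity(cfg) for cfg in procedures.values()), default=0)


def should_skip(procedures, limit=CC_LIMIT):
    """Skip a file if any of its procedures exceeds the limit."""
    return max_file_complexity(procedures) > limit


# One if/else and one straight-line procedure.
if_else = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
straight = {"a": ["b"], "b": []}
file_procedures = {"f": if_else, "g": straight}
```

For this toy file the maximum complexity is 2 (the if/else procedure), so the file would be kept.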

Table 6.1: Skipped files

Project           Files  CC > 100      %
CakePHP             852         0   0.00
CodeIgniter         199         0   0.00
Doctrine ORM      1,048         2   0.19
Fabric            1,306         0   0.00
Gallery             568         4   0.70
Joomla            3,221         2   0.06
Kohana              490         0   0.00
MediaWiki         2,891         5   0.17
osCommerce          702         7   1.00
PEAR                199         1   0.50
phpBB               745        20   2.68
phpDocumentor       470         0   0.00
phpMyAdmin          871         6   0.69
phpUnit             321         2   0.62
Roundcube           256         2   0.78
SilverStripe        769         3   0.39
Smarty              204         2   0.98
Squirrel Mail       293         1   0.34
Symfony           2,978         1   0.03
WordPress           653        16   2.45
Zend Framework    2,639         1   0.04

6.2.1 Control dependence

Our language analysis of chapter 3 revealed several features that impact the generation of control dependences, but those did not significantly impact our scope. We therefore consider the unit test suite of Php-Pdg to be sufficient validation for our algorithms for generating control dependences.

6.2.2 Data dependence

As stated in section 4.2.2, we generate data dependences from the read operands of operations and annotate each data dependence edge with the path from the operation to the operand that generated the edge. We can now use this information to evaluate the quality of our data dependence generation by analysing which operands have associated data dependences in the PDG. Table 6.2 shows the total number of operands and, by subtracting literals and those bound to objects or other scopes, the total number of operands that we expect to have associated data dependences in the PDG. The total number of resolved and unresolved operands can be seen in table 6.3. We can see that in each project the vast majority of operands can be resolved to data dependence edges in the PDG. This suggests that the quality of our data dependence generation is very good. We have verified our data dependence generation algorithms using the unit testing suite of Php-Pdg and so are confident in its quality, but to truly determine the accuracy of our data dependences would require either proof of the correctness of our algorithms or a run-time comparison, which is beyond our current scope.

For those operands that we cannot resolve, we have attempted to determine why they were not in the PDG. We found four possible reasons: the operand referenced a predefined variable (see section 3.2.1), the operand referenced a variable that was undefined in the current scope, the operand was part of one of PHP’s variable features (see section 3.2.2), or the operand referenced a non-literal argument default. PHP-CFG currently fully decouples the control flow defining argument defaults from the normal control flow of a function, but allows the two flows to share operand objects. We only add operations in the normal control flow of the function to our PDG, so any operands shared between the two flows result in missing data dependences. We consider this a missing feature of PHP-CFG, and believe that it should be resolved by adding the argument default flow to the start of the normal flow, guarded by a condition that determines whether it should be executed based on whether the argument has been defined. This is not a problem for literal argument defaults, because they do not have defining operations, and therefore no decoupled control flow.

Table 6.2: Resolvable data dependences

                           Read operands
Project               Ops      Total  Literal   Bound  Expected
CakePHP           211,991    424,299  232,322  32,585   159,392
CodeIgniter        61,468     84,876   35,681   6,414    42,781
Doctrine ORM      140,540    247,450  120,201  26,082   101,167
Fabric             97,859    166,738   81,053  16,379    69,306
Gallery            80,311    121,475   62,387   3,326    55,762
Joomla            579,808    938,613  466,053  65,708   406,852
Kohana             46,062     72,352   36,887   3,432    32,033
MediaWiki         612,865  1,076,873  568,019  60,721   448,133
osCommerce        141,522    210,336   99,348   5,127   105,861
PEAR               94,905    130,206   53,500   7,270    69,436
phpBB             202,060    507,771  335,623  12,487   159,661
phpDocumentor      31,303     51,282   25,786   4,421    21,075
phpMyAdmin        252,433    529,488  312,599  15,001   201,888
phpUnit            25,956     38,640   20,362   2,894    15,384
Roundcube          90,972    131,331   57,423   7,996    65,912
SilverStripe      203,780    319,550  154,896  23,297   141,357
Smarty             38,965     63,697   31,141   3,502    29,054
Squirrel Mail      70,474    112,413   55,584   1,016    55,813
Symfony           303,586    539,362  282,576  40,915   215,871
WordPress         305,531    433,082  204,364  11,559   217,159
Zend Framework    256,142    377,697  180,365  27,604   169,728
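The relation between the columns of tables 6.2 and 6.3 can be checked with a small worked example in Python, using the CakePHP figures from both tables. The function names are illustrative only.

```python
def expected_operands(total, literal, bound):
    """Operands expected to have data dependences (table 6.2)."""
    return total - literal - bound


def resolved_percentage(resolved, expected):
    """Share of expected operands that were resolved (table 6.3)."""
    return round(100 * resolved / expected, 2)


# CakePHP: read operands from table 6.2.
cakephp_expected = expected_operands(424_299, 232_322, 32_585)

# CakePHP: resolved and unresolved operands from table 6.3.
cakephp_resolved = 157_636
cakephp_unresolved = 233 + 873 + 130 + 520 + 0  # Predef, Undef, Dynamic, Default, Rem

cakephp_percentage = resolved_percentage(cakephp_resolved, cakephp_expected)
```

For CakePHP the expected count of 159,392 equals the sum of the resolved and unresolved operands, and the resolved share rounds to the 98.90% reported in table 6.3.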

6.2.3 Call dependence

To validate our algorithms for generating call dependences we looked at the number of function and method calls with associated call dependence edges in the SDG. Table 6.4 shows the results for the function calls in our corpus. We are able to uniquely resolve the vast majority to edges in the SDG. Those that we could not resolve were variable function calls. Most calls have been linked to built-in functions, which demonstrates the value of PHP-Types’ built-in function database. The Undef column contains function calls linked to UndefinedFunc nodes (as described in sections 5.2.1 and 5.2.2). Manual inspection of some of these instances in CakePHP and Fabric revealed that some of our evaluated instances were references to functions defined in dependency packages, and some were truly undefined. We suspect the latter to be programmer errors created during refactoring. Unfortunately it is impossible for us to programmatically determine whether a function call linked to an undefined function is a call to a vendor function or truly undefined, so we cannot make a distinction.

Table 6.5 shows the results for the method calls in our corpus. We can see that we are able to resolve almost all method calls for which we have type information to edges in the SDG. Those that we could not resolve were all variable method calls. Unfortunately, the number of objects for which we have type information leaves a lot to be desired. We believe this is due to the relative immaturity of PHP-Types and could be significantly improved in the future. There are two particular improvements that we see as promising: 1) inferring property types from assignments if @var annotations are missing, and 2) inferring function and method return types from return statements if type declarations and @return annotations are missing. For now we limit our scope to dependence analysis and claim that if type information is available and accurate, our library is able to generate method call dependences very well.

Table 6.3: Resolved and unresolved data dependences

                            Resolved         Unresolved
Project           Expected    Total      %   Predef   Undef  Dynamic  Default  Rem
CakePHP            159,392  157,636  98.90      233     873      130      520    0
CodeIgniter         42,781   41,352  96.66      259   1,068       18       84    0
Doctrine ORM       101,167  100,751  99.59       49     279        4       84    0
Fabric              69,306   68,469  98.79       57     581        5      194    0
Gallery             55,762   53,842  96.56      281   1,592       11       36    0
Joomla             406,852  397,818  97.78      729   7,444       14      847    0
Kohana              32,033   31,437  98.14      119     425       15       37    0
MediaWiki          448,133  440,828  98.37      439   5,678      213      975    0
osCommerce         105,861   98,650  93.19      256   6,855       58       42    0
PEAR                69,436   67,894  97.78      323   1,159        5       55    0
phpBB              159,661  147,885  92.62      195  11,413       62      106    0
phpDocumentor       21,075   20,936  99.34       10      76       16       37    0
phpMyAdmin         201,888  188,227  93.23    9,981   3,437       39      204    0
phpUnit             15,384   15,124  98.31      103     134        1       22    0
Roundcube           65,912   64,349  97.63      429   1,035        5       94    0
SilverStripe       141,357  138,702  98.12      528   1,662       41      424    0
Smarty              29,054   28,769  99.02       17     238       11       19    0
Squirrel Mail       55,813   48,774  87.39       91   6,896       30       22    0
Symfony            215,871  212,977  98.66      138   1,919       74      763    0
WordPress          217,159  207,695  95.64    2,521   6,607       15      321    0
Zend Framework     169,728  167,625  98.76      203   1,078      223      599    0

Table 6.4: Resolved function calls

                  Function  Resolved         Unresolved     Edge targets
Project              calls    Total      %  Dynamic  Rem      App  Built-in  Undef
CakePHP              6,939    6,801  98.01      138    0      305     6,462     76
CodeIgniter          5,078    5,061  99.67       17    0      593     4,475     25
Doctrine ORM         2,681    2,677  99.85        4    0        0     2,642     70
Fabric               3,651    3,647  99.89        4    0       75     3,569      3
Gallery              6,154    6,147  99.89        7    0      158     4,608  1,419
Joomla              24,639   24,625  99.94       14    0    2,446    23,939     95
Kohana               2,881    2,869  99.58       12    0       43     2,792     34
MediaWiki           26,638   26,414  99.16      224    0    4,869    21,513     92
osCommerce          16,882   16,880  99.99        2    0   16,178     8,368    292
PEAR                 5,169    5,168  99.98        1    0       26     5,085     61
phpBB               14,400   14,373  99.81       27    0    3,431    10,214  1,301
phpDocumentor          766      750  97.91       16    0        1       749      0
phpMyAdmin          15,877   15,863  99.91       14    0    8,633     7,972    197
phpUnit              1,545    1,544  99.94        1    0        4     1,525     15
Roundcube            5,416    5,413  99.94        3    0      236     5,140     43
SilverStripe        10,269   10,226  99.58       43    0      806     9,093    368
Smarty               1,803    1,790  99.28       13    0       85     1,725      9
Squirrel Mail       10,572   10,548  99.77       24    0    8,488     4,386     85
Symfony             10,918   10,838  99.27       80    0       35    10,779     56
WordPress           46,312   46,290  99.95       22    0   37,954    15,879  2,775
Zend Framework      12,859   12,545  97.56      314    0        5    12,433    211

Table 6.5: Resolved method calls

| Project | Method calls | Typed total | Typed % | Resolved total | Resolved typed % | Unresolved: Dynamic | Rem | Edge targets: App | Built-in | Undef |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CakePHP | 48,123 | 36,521 | 75.89 | 36,484 | 99.90 | 37 | 0 | 61,308 | 291 | 17,644 |
| CodeIgniter | 2,253 | 1,601 | 71.06 | 1,583 | 98.88 | 18 | 0 | 3,110 | 29 | 126 |
| Doctrine ORM | 31,491 | 16,230 | 51.54 | 16,222 | 99.95 | 8 | 0 | 80,986 | 222 | 8,580 |
| Fabric | 17,551 | 10,081 | 57.44 | 10,081 | 100.00 | 0 | 0 | 11,163 | 172 | 5,329 |
| Gallery | 11,020 | 7,113 | 64.55 | 7,089 | 99.66 | 24 | 0 | 3,573 | 30 | 3,887 |
| Joomla | 94,041 | 67,505 | 71.78 | 67,477 | 99.96 | 28 | 0 | 163,241 | 411 | 14,141 |
| Kohana | 3,791 | 3,236 | 85.36 | 3,219 | 99.47 | 17 | 0 | 3,091 | 67 | 782 |
| MediaWiki | 81,965 | 71,114 | 86.76 | 71,056 | 99.92 | 58 | 0 | 135,168 | 414 | 5,820 |
| osCommerce | 2,523 | 1,913 | 75.82 | 1,911 | 99.90 | 2 | 0 | 2,590 | 14 | 81 |
| PEAR | 6,809 | 4,431 | 65.08 | 4,423 | 99.82 | 8 | 0 | 3,904 | 13 | 910 |
| phpBB | 10,986 | 3,326 | 30.27 | 3,321 | 99.85 | 5 | 0 | 5,384 | 46 | 476 |
| phpDocumentor | 7,928 | 4,697 | 59.25 | 4,690 | 99.85 | 7 | 0 | 2,621 | 258 | 2,296 |
| phpMyAdmin | 21,974 | 15,332 | 69.77 | 15,319 | 99.92 | 13 | 0 | 9,694 | 280 | 6,449 |
| phpUnit | 3,649 | 3,225 | 88.38 | 3,224 | 99.97 | 1 | 0 | 3,910 | 185 | 306 |
| Roundcube | 7,033 | 4,826 | 68.62 | 4,825 | 99.98 | 1 | 0 | 4,676 | 36 | 629 |
| SilverStripe | 32,322 | 29,496 | 91.26 | 29,279 | 99.26 | 217 | 0 | 40,802 | 152 | 7,007 |
| Smarty | 1,035 | 703 | 67.92 | 686 | 97.58 | 17 | 0 | 718 | 9 | 119 |
| Squirrel Mail | 647 | 464 | 71.72 | 464 | 100.00 | 0 | 0 | 443 | 0 | 41 |
| Symfony | 61,405 | 41,628 | 67.79 | 41,602 | 99.94 | 26 | 0 | 35,594 | 1,250 | 15,374 |
| WordPress | 7,848 | 5,841 | 74.43 | 5,835 | 99.90 | 6 | 0 | 6,621 | 38 | 309 |
| Zend Framework | 19,321 | 16,142 | 83.55 | 16,065 | 99.52 | 77 | 0 | 24,094 | 963 | 1,646 |

Run-time comparison

In order to determine the accuracy of the call dependence edges in our SDG we need information about which function and method calls are executed at run-time. To acquire this we used an XDebug function trace² of the Doctrine unit-test suite. We chose Doctrine because it is not too big, and its test suite is currently listed as having a coverage of around 86%³. We parsed the generated function trace, compiled a list of call location and target pairs, and compared it with the call dependences of the SDG. We only considered calls for which the SDG contained call edges and the target was an application or built-in procedure. This allows us to measure the accuracy of our call dependence prediction while ignoring method calls for which object type reconstruction failed.

Unfortunately, XDebug does not allow us to distinguish between calls on the same line number, and calls are reported at the last line number that contains logic required for evaluating the call (i.e. it ignores parentheses). We partially worked around this limitation by automatically reformatting the code using a code style that placed every call on a separate line and placed all closing parentheses on the same line, directly after the last argument. This allowed us to uniquely identify as many function calls as possible, but we still had to filter out calls that had a nested call as the last argument, as these would be registered on the same line. The results were:

• 6,856 call pairs evaluated
• 6,737 found (98.26%)
• 119 missing
  – 66 constructor calls
  – 10 using Countable
  – 1 using ArrayAccess
  – 12 to __toString
  – 5 to __sleep
  – 1 to __wakeup
  – 24 remaining

² https://xdebug.org/docs/execution_trace
³ https://coveralls.io/github/doctrine/doctrine2

Constructor calls are currently not supported, but these could be added quite easily for instantiations with literal class names. Countable and ArrayAccess are built-in interfaces that objects can implement to interact with the built-in ‘count’ function or array syntax, and __toString, __sleep, and __wakeup are magic methods that objects can define to interact with casts to string and the built-in ‘serialize’ and ‘unserialize’ functions. None of these methods is currently supported, but they could be added in the future in a manner similar to how we have implemented overloading. We manually inspected the remaining calls and found some of them to be due to dynamic invocation, but most to be due to incorrect line number reporting by XDebug which we could not readily explain. We suspect this is related to function arguments and constant expressions, but as this is far outside of our research scope we did not investigate it further. In general we believe these results justify saying that the accuracy of the call dependences in the SDG is very high.
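The comparison described above can be sketched as follows. This is a minimal illustration, not the code used for the thesis: the function name `compareCallPairs`, the `[location, target]` pair layout, and the sample data are all assumptions. It only mirrors the stated rule that trace pairs are evaluated when the static graph has at least one call edge at that location.

```php
<?php
/**
 * Compare call pairs observed at run-time with statically predicted call pairs.
 * Each pair is [location, target], e.g. ['a.php:3', 'foo'].
 * Hypothetical helper for illustration; not part of Php-Pdg.
 */
function compareCallPairs(array $tracePairs, array $predictedPairs): array
{
    // Index the predictions as location => set of targets.
    $predicted = [];
    foreach ($predictedPairs as [$location, $target]) {
        $predicted[$location][$target] = true;
    }

    $found = 0;
    $missing = [];
    foreach ($tracePairs as [$location, $target]) {
        if (!isset($predicted[$location])) {
            continue; // no call edges at this location: left out of the evaluation
        }
        if (isset($predicted[$location][$target])) {
            $found++;
        } else {
            $missing[] = [$location, $target];
        }
    }

    $evaluated = $found + count($missing);
    return [
        'evaluated' => $evaluated,
        'found' => $found,
        'missing' => $missing,
        'accuracy' => $evaluated > 0 ? round(100 * $found / $evaluated, 2) : null,
    ];
}

// Made-up sample data: one pair is missing, one location has no predicted edges.
$result = compareCallPairs(
    [['a.php:3', 'foo'], ['a.php:7', 'Bar::__construct'], ['a.php:9', 'baz'], ['b.php:2', 'qux']],
    [['a.php:3', 'foo'], ['a.php:7', 'Bar::make'], ['a.php:9', 'baz']]
);
printf("%d of %d pairs found (%.2f%%)\n", $result['found'], $result['evaluated'], $result['accuracy']);
```

Applied to the reported counts, the same ratio gives 6,737 / 6,856, which rounds to the 98.26% listed above.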

Overloading

Tables 6.6 and 6.7 show the number of call dependences to overloading hooks for our corpus. For each hook we also list the sources of incoming call dependence edges: ‘O’ for overloading and ‘E’ for explicit. We manually inspected some of the instances in Fabric and CakePHP and found that most of the cases we evaluated were correct, but we also found one case which used dynamic property assignment in the constructor and would not actually trigger an overloading hook at run-time. As mentioned in section 3.3.2 of our language analysis, this is something that we currently do not support correctly.
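The problematic case can be illustrated with a small, made-up class. The name `Record`, its property names, and the constructor flag are assumptions chosen for this sketch and do not come from Fabric or CakePHP. A fetch of an undeclared property normally triggers `__get`, but once the constructor has assigned that property dynamically, the same fetch reads the real property and the hook is never called.

```php
<?php
// The attribute silences the PHP 8.2+ deprecation notice for dynamic properties;
// older PHP versions treat this line as a comment.
#[\AllowDynamicProperties]
class Record
{
    /** @var string[] names for which the __get hook was invoked */
    public $hookCalls = [];

    public function __construct(bool $assignInConstructor)
    {
        if ($assignInConstructor) {
            // Creates a real (dynamic) property at run-time.
            $this->title = 'set in constructor';
        }
    }

    public function __get($name)
    {
        $this->hookCalls[] = $name;
        return 'from __get';
    }
}

$overloaded = new Record(false);
echo $overloaded->title, "\n";            // the hook runs
echo count($overloaded->hookCalls), "\n"; // one hook call recorded

$dynamic = new Record(true);
echo $dynamic->title, "\n";               // the dynamic property is read directly
echo count($dynamic->hookCalls), "\n";    // no hook call recorded
```

An analysis that only considers declared properties would predict a call dependence on `__get` for both fetches, which is the kind of imprecision described above.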

Table 6.6: Property overloading

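| Project | __get: Tot | Res | Srcs O | Srcs E | __set: Tot | Res | Srcs O | Srcs E | __isset: Tot | Res | Srcs O | Srcs E | __unset: Tot | Res | Srcs O | Srcs E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CakePHP | 15 | 13 | 365 | 2 | 4 | 3 | 23 | 0 | 7 | 5 | 15 | 0 | 1 | 0 | 0 | 0 |
| CodeIgniter | 7 | 2 | 15 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doctrine ORM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Fabric | 6 | 5 | 57 | 2 | 3 | 3 | 11 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gallery | 16 | 13 | 276 | 17 | 5 | 5 | 142 | 8 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Joomla | 93 | 76 | 270 | 1,243 | 36 | 30 | 74 | 360 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| Kohana | 2 | 1 | 11 | 1 | 2 | 2 | 4 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| MediaWiki | 10 | 5 | 98 | 0 | 6 | 2 | 13 | 0 | 3 | 3 | 9 | 0 | 1 | 1 | 1 | 0 |
| osCommerce | 6 | 2 | 44 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| PEAR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phpBB | 1 | 1 | 63 | 0 | 1 | 1 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phpDocumentor | 2 | 2 | 13 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phpMyAdmin | 2 | 1 | 0 | 184 | 3 | 1 | 0 | 36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phpUnit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Roundcube | 2 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SilverStripe | 10 | 6 | 1,614 | 3 | 8 | 6 | 424 | 0 | 2 | 1 | 4 | 0 | 1 | 0 | 0 | 0 |
| Smarty | 4 | 3 | 186 | 0 | 3 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Squirrel Mail | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Symfony | 5 | 5 | 112 | 3 | 6 | 4 | 72 | 2 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| WordPress | 12 | 3 | 179 | 2 | 8 | 2 | 5 | 0 | 10 | 2 | 2 | 3 | 7 | 0 | 0 | 0 |
| Zend Framework | 34 | 11 | 35 | 4 | 21 | 5 | 9 | 6 | 15 | 7 | 8 | 3 | 15 | 2 | 6 | 1 |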

These results clearly show that our algorithms can generate call dependences for some cases of overloading. Because of the limited quality of the type reconstruction currently offered by PHP-Types, we cannot make any claims about the number of overloading calls that we are able to support: without type information it is impossible to determine whether a property fetch or method call would result in overloading. We believe that with better type reconstruction our algorithms are likely capable of supporting more cases of overloading, but validating this claim is something that we have to leave as future work.

Table 6.7: Method overloading

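| Project | __call: Tot | Res | Srcs O | Srcs E | __callStatic: Tot | Res | Srcs O | Srcs E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CakePHP | 15 | 14 | 1,022 | 0 | 1 | 0 | 0 | 0 |
| CodeIgniter | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doctrine ORM | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| Fabric | 2 | 1 | 8 | 0 | 1 | 0 | 0 | 0 |
| Gallery | 16 | 8 | 74 | 7 | 0 | 0 | 0 | 0 |
| Joomla | 11 | 7 | 641 | 0 | 1 | 0 | 0 | 0 |
| Kohana | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MediaWiki | 15 | 9 | 3,911 | 112 | 1 | 0 | 0 | 0 |
| osCommerce | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| PEAR | 3 | 1 | 265 | 0 | 1 | 0 | 0 | 0 |
| phpBB | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| phpDocumentor | 1 | 1 | 70 | 0 | 0 | 0 | 0 | 0 |
| phpMyAdmin | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| phpUnit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Roundcube | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SilverStripe | 15 | 12 | 2,390 | 8 | 0 | 0 | 0 | 0 |
| Smarty | 4 | 2 | 25 | 2 | 0 | 0 | 0 | 0 |
| Squirrel Mail | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Symfony | 9 | 3 | 32 | 1 | 1 | 0 | 0 | 0 |
| WordPress | 9 | 2 | 6 | 0 | 1 | 0 | 0 | 0 |
| Zend Framework | 39 | 21 | 237 | 28 | 2 | 0 | 0 | 0 |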

Chapter 7

Use case

In this chapter we present a practical use case of our library. To demonstrate its possibilities we used it to develop a slicing tool. The tool creates an interprocedural backwards slice of source code, given a file and line number as the slicing criterion. We have integrated our slicing logic in Php-Pdg¹, and used it from our analysis application, Php-Pdg-Analysis².

7.1 Example

To demonstrate how it works, we use the system of figure 7.1 as an example. Assume that the assertion on line 4 of the foo function is failing, and that we want to know why. To determine this, it is important to know which statements can influence the assertion. We therefore slice the system by line number 4 of the foo function. Figure 7.2 contains the sliced system. The unused baz methods and the contents of the else branch have been stripped out, but the rest of the quux function of the Baz class turns out to be able to call the foo function indirectly. It does so in an obfuscated way and inverts the argument. While this may be a contrived example, the fact remains that by slicing we are able to easily determine indirect call dependences. This is something that is not commonly supported by IDEs.

     1-3  [...]
     4      [...]= 0);
     5      if ($n > 0) {
     6          return $n * foo($n - 1);
     7      }
     8      return 1;
     9  }
    10  class Bar {
    11      public function bar($n) {
    12          return foo($n);
    13      }
    14      public function baz() {
    15          echo 'baz ';
    16      }
    17  }

     1-3  [...]
     4      public function bar($n) {
     5          return foo($n);
     6      }
     7      public function baz() {
     8          echo 'baz ';
     9      }
    10      public function quux($n) {
    11          if ($n % 2 === 0) {
    12              (new self())->bar(-$n);
    13          } else {
    14              echo 'quux ';
    15          }
    16      }
    17  }

Figure 7.1: System to slice

     1-3  [...]
     4      [...]= 0);
     5      if ($n > 0) {
     6          return $n * foo($n - 1);
     7      }
     8      return 1;
     9  }
    10  class Bar {
    11      public function bar($n) {
    12          return foo($n);
    13      }
    14  }

     1-3  [...]
     4      public function bar($n) {
     5          return foo(-$n);
     6      }
     7      public function quux($n) {
     8          if ($n % 2 === 0) {
     9              (new \Baz())->bar(-$n);
    10          } else {
    11          }
    12      }
    13  }

Figure 7.2: Sliced system

Although the current version of our tool may not be very user friendly, we can see a future version being used in practice as a debugging tool. It could even be integrated into a Continuous Integration server and generate a slice of the system based on a test suite assertion that is failing. These slices could be made available to developers automatically when reporting errors.

¹ https://github.com/mwijngaard/php-pdg
² https://github.com/mwijngaard/php-pdg-analysis

7.2 Implementation

Our slicing tool contains an implementation of Weiser’s [26] algorithm for interprocedural slicing. As this algorithm is not context sensitive, it can generate slices that contain some statements that cannot actually affect the slicing criterion at run-time. Horwitz et al. [15] have developed an improved algorithm that is context sensitive and therefore more precise, but this algorithm requires interprocedural data dependences, so we cannot implement it using the current version of our library.

In order to create a sliced system we first parse all files in a system into their ASTs, use the ASTs to construct CFGs, and use the CFGs to construct the dependence structure using our SDG Factory. As described in section 5.2, this returns a System object that contains the SDG and the Func objects of all procedures with their PDG.

Algorithm 6 shows our full procedure for slicing a system. The algorithm starts by calling sliceSystem with the input system and the file name and line number to slice on. It uses FuncCriteria to store the slicing criteria of all procedures, WorkList to store a worklist of all procedures we still have to process, and SlicedPdgs to store the sliced versions of all procedures so far. The addSlicingCriterion helper function is used to calculate the updated slicing criterion and to add functions to the worklist if their criterion has been updated. We also use several helper functions that provide graph operations:

getInvReachable  Returns a subgraph of the given graph that contains only nodes and edges that can reach the given set of nodes.

getCallingNodes  Returns all call nodes in the given graph with a call edge to the given procedure.

getContainingFunc  Returns the procedure in the given graph that has a contains edge to the given node.

getCalledFuncs  Returns all procedures in the given graph that are called from the given node.
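As an illustration of the first helper, backward reachability can be sketched on a plain edge list. This is not the Php-Pdg implementation: the function name `invReachableSubgraph`, the `[from, to]` edge representation, and the node labels are assumptions, and edges are assumed to point from a node towards the nodes that depend on it.

```php
<?php
/**
 * Return the subgraph of all nodes and edges from which the targets are reachable.
 * Hypothetical sketch; nodes are plain string labels.
 *
 * @param array    $edges   list of [from, to] pairs
 * @param string[] $targets nodes that must be reachable
 * @return array ['nodes' => string[], 'edges' => array]
 */
function invReachableSubgraph(array $edges, array $targets): array
{
    // Reverse adjacency: node => list of direct predecessors.
    $predecessors = [];
    foreach ($edges as [$from, $to]) {
        $predecessors[$to][] = $from;
    }

    // Breadth-first search against the edge direction, starting at the targets.
    $keep = array_fill_keys($targets, true);
    $queue = $targets;
    while ($queue !== []) {
        $node = array_shift($queue);
        foreach ($predecessors[$node] ?? [] as $pred) {
            if (!isset($keep[$pred])) {
                $keep[$pred] = true;
                $queue[] = $pred;
            }
        }
    }

    // An edge can reach a target exactly when both of its endpoints were kept.
    $keptEdges = array_values(array_filter($edges, function (array $edge) use ($keep) {
        return isset($keep[$edge[0]], $keep[$edge[1]]);
    }));

    return ['nodes' => array_keys($keep), 'edges' => $keptEdges];
}

$subgraph = invReachableSubgraph(
    [['entry', 'x'], ['x', 'y'], ['entry', 'z'], ['y', 'assert'], ['assert', 'after']],
    ['assert']
);
echo implode(', ', $subgraph['nodes']), "\n";
```

In this example the nodes `z` and `after` are dropped, because neither of them can reach `assert`.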

The algorithm first iterates over all procedures to find the one matching the file name, and then iterates over all PDG nodes of the matching procedure to find the nodes that match the line number. These nodes are then added as the initial slicing criterion for that procedure, which also adds the procedure to the worklist. We then process the worklist until it is empty. For each procedure we process, we slice its PDG based on the current slicing criterion. We then find all calls to the procedure and add them as a new slicing criterion to their containing procedure, thereby possibly adding their containing procedure to the worklist. We also find all calls in the new sliced PDG, and add the return nodes of their target procedures to the slicing criterion of the target procedures, thereby possibly adding their target procedures to the worklist. When the worklist is empty, we have propagated the slice to all procedures, and can store the final sliced PDGs on the system. This step also sets the PDGs of procedures that are not part of our slice to null.

To propagate the sliced PDGs back to the original source code, we implemented a mapping between PHP-CFG operations and AST nodes based on file and line number. While generating operations from an AST, PHP-CFG stores the line numbers of the AST node that it created an operation from. This allows us to maintain a rudimentary association between the AST and CFG. Our algorithm slices all ASTs in the system to retain only nodes that can be associated with operations in the sliced PDG. We then write the sliced ASTs back to disk.

Algorithm 6: Backwards slicing a System

    Function addSlicingCriterion(Func, NewCriterion, FuncCriteria, WorkList)
        ExistingCriterion ← FuncCriteria[Func];
        if ExistingCriterion ≠ null then
            NewCriterion ← ExistingCriterion ∪ NewCriterion;
            if NewCriterion = ExistingCriterion then
                return;
            end
        end
        FuncCriteria[Func] ← NewCriterion;
        WorkList ← WorkList ∪ {Func};
    end

    Function sliceSystem(System, SliceFilePath, SliceLineNr)
        FuncCriteria ← [];
        WorkList ← ∅;
        SlicedPdgs ← [];
        for Func ∈ System.funcs do
            if Func.fileName = SliceFilePath then
                for Node ∈ Func.pdg do
                    if Node is OpNode and Node.op.line = SliceLineNr then
                        addSlicingCriterion(Func, {Node}, FuncCriteria, WorkList);
                    end
                end
            end
        end
        while WorkList ≠ ∅ do
            for Func ∈ WorkList do
                WorkList ← WorkList \ {Func};
                SlicedPdg ← getInvReachable(Func.pdg, FuncCriteria[Func]);
                for CallNode ∈ getCallingNodes(System.sdg, Func) do
                    ContainingFunc ← getContainingFunc(System.sdg, CallNode);
                    addSlicingCriterion(ContainingFunc, {CallNode}, FuncCriteria, WorkList);
                end
                for Node ∈ SlicedPdg do
                    for TargetFunc ∈ getCalledFuncs(System.sdg, Node) do
                        addSlicingCriterion(TargetFunc, TargetFunc.returnNodes, FuncCriteria, WorkList);
                    end
                end
                SlicedPdgs[Func] ← SlicedPdg;
            end
        end
        for Func ∈ System.funcs do
            Func.pdg ← SlicedPdgs[Func];
        end
    end
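The worklist propagation of Algorithm 6 can be tried out on a toy model. The sketch below is a simplification under several assumptions and does not use the Php-Pdg classes: procedures are arrays holding an edge list and their return nodes, call edges are `[callNode, caller, callee]` triples, and the initial criterion is passed in directly instead of being looked up by file name and line number. All names and the sample system are made up, loosely following the example of section 7.1. It uses `array_key_first`, so PHP 7.3 or later is assumed.

```php
<?php
// Node-only backward reachability: returns a set (node => true).
function backwardReachable(array $edges, array $criterion): array
{
    $predecessors = [];
    foreach ($edges as [$from, $to]) {
        $predecessors[$to][] = $from;
    }
    $seen = array_fill_keys($criterion, true);
    $queue = $criterion;
    while ($queue !== []) {
        $node = array_shift($queue);
        foreach ($predecessors[$node] ?? [] as $pred) {
            if (!isset($seen[$pred])) {
                $seen[$pred] = true;
                $queue[] = $pred;
            }
        }
    }
    return $seen;
}

/**
 * @param array $funcs name => ['edges' => [[from, to], ...], 'returns' => [node, ...]]
 * @param array $calls list of [callNode, callerName, calleeName]
 * @return array name => nodes in the slice; procedures outside the slice are absent
 */
function sliceSystem(array $funcs, array $calls, string $startFunc, array $startNodes): array
{
    $criteria = [];
    $workList = [];
    $sliced = [];
    $addCriterion = function (string $func, array $nodes) use (&$criteria, &$workList): void {
        $existing = $criteria[$func] ?? null;
        $merged = ($existing ?? []) + array_fill_keys($nodes, true);
        if ($existing !== null && count($merged) === count($existing)) {
            return; // criterion unchanged: nothing to reprocess
        }
        $criteria[$func] = $merged;
        $workList[$func] = true;
    };

    $addCriterion($startFunc, $startNodes);
    while ($workList !== []) {
        $func = array_key_first($workList);
        unset($workList[$func]);
        $slice = backwardReachable($funcs[$func]['edges'], array_keys($criteria[$func]));
        foreach ($calls as [$callNode, $caller, $callee]) {
            if ($callee === $func) {
                $addCriterion($caller, [$callNode]); // propagate up to callers
            }
            if ($caller === $func && isset($slice[$callNode])) {
                $addCriterion($callee, $funcs[$callee]['returns']); // propagate down to callees
            }
        }
        $sliced[$func] = array_keys($slice);
    }
    return $sliced;
}

$funcs = [
    'foo' => ['edges' => [['foo:entry', 'foo:assert'], ['foo:entry', 'foo:return']], 'returns' => ['foo:return']],
    'Baz::bar' => ['edges' => [['bar:entry', 'bar:call foo']], 'returns' => ['bar:call foo']],
    'Baz::quux' => ['edges' => [['quux:entry', 'quux:if'], ['quux:if', 'quux:call bar'], ['quux:if', 'quux:echo']], 'returns' => []],
    'Baz::baz' => ['edges' => [['baz:entry', 'baz:echo']], 'returns' => []],
];
$calls = [['bar:call foo', 'Baz::bar', 'foo'], ['quux:call bar', 'Baz::quux', 'Baz::bar']];
$sliced = sliceSystem($funcs, $calls, 'foo', ['foo:assert']);
print_r($sliced);
```

In this toy system the slice on the assertion in `foo` retains `Baz::bar` and the call path through `Baz::quux`, while `Baz::baz` and the `echo` in `quux` are dropped. Because the criterion of a procedure only ever grows, the worklist is guaranteed to empty.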

Chapter 8

Evaluation

In this chapter we evaluate our work with regard to the research questions of section 1.2.1 and discuss threats to the validity of our research results.

8.1 Research questions

How do PHP’s language features affect dependence analysis? Our language analysis of chapter 3 discussed several kinds of language features of PHP in relation to dependence analysis. We found that the PHP features that influence control dependence are very similar to those of other languages, but that PHP has several features that can make determining data and call dependences challenging.

How are relevant language features used in practice and how does this impact dependence analysis? We examined the practical use of some relevant language features across a large corpus of mainly popular open-source programs. We demonstrated that some problematic features are rarely used in practice, and so are unlikely to significantly affect dependence analysis. We also showed that feature usage varies widely between projects.

Are there strategies that we can adopt to handle problematic features? Using the data on practical usage of relevant language features we developed strategies for handling several language features. We have implemented handling of names with multiple definitions, and have implemented call dependence generation for property and method overloading using type inference, class hierarchy construction, and polymorphism computations.

How can we apply dependence analysis to PHP? We have implemented a PDG and SDG library for PHP and demonstrated its practical use in an interprocedural slicing tool. In doing so we have demonstrated how dependence analysis can be applied to PHP.

How effective is dependence analysis for PHP? We have extensively examined the quality of the dependences generated by our library for a large corpus of mainly popular open-source PHP projects. We demonstrated that the vast majority of data dependences can be supported, but some data dependences require more sophisticated analysis techniques. These techniques were not available to us, and creating them was beyond the scope of this thesis. We demonstrated that we can resolve the vast majority of function calls and the vast majority of method calls for which object type information is available. We also showed that the call dependences we generate are highly accurate by comparing them to a run-time function trace. Unfortunately, the number of method calls for which we were able to reconstruct object type information varied strongly per project and was generally low. This prevented us from achieving the level of call dependence support that we intended.

8.2 Threats to validity

There are several threats to the validity of our results.

Firstly, there are several possibly problematic PHP language features that we left out of scope. For instance, we ignored the possibility of aliasing, exceptions thrown and caught in the same procedure, and any data dependences to object properties. This was necessary to create a manageable scope for this thesis. In some cases we have attempted to demonstrate using practical usage data that the existence of these features should not significantly impact dependence analysis, but this is something that we cannot state with certainty.

Secondly, we have ignored types of data dependences other than flow dependence, as these do not really apply to our SSA context. Although we are confident that determining these data dependences is not fundamentally more complex than determining flow dependence, we did not prove this by implementing them.

Thirdly, our validation showed that we are able to resolve method call dependences for the vast majority of objects for which we can determine the type, but that the number of objects for which we have type information available varied and was generally low. We believe that with better type reconstruction the quality of our call dependence generation can be vastly improved. However, due to PHP’s dynamic features it is possible that a significant improvement in type reconstruction quality cannot be achieved. This is something that we cannot be certain about and that requires further research.

Lastly, we have seen that feature usage can vary widely between projects. It is possible that a different corpus, or an individual project not in our corpus, could yield results that differ significantly from those we have presented. To mitigate this, our corpus consists of a wide variety of projects of different sizes and from different application domains. This is reflected in the results.

Chapter 9

Conclusion

In this thesis we have looked into the applicability of dependence analysis to PHP. We analysed the effects of PHP language features and their practical usage, developed strategies for handling some problematic features, implemented a PDG and SDG library, and analysed its effectiveness. We have shown that several of PHP’s language features are challenging to model in dependence analysis, but that in practice the use of most of them is not very common. We have demonstrated that in real-world PHP projects the vast majority of intraprocedural control and data dependences between operations can be supported, and that given accurate type information the vast majority of call dependences can also be supported. In addition, we have shown that the accuracy of our generated call dependences is very high. When applying dependence analysis to a dynamic language we encounter a greater amount of imprecision than in static languages, but we believe we have shown that for PHP the partial accuracy that we are able to achieve is high enough for dependence analysis to be useful. To further support this claim we used our PDG and SDG library to develop an interprocedural slicing tool. In doing so, we believe we have demonstrated that dependence analysis is applicable to PHP.

9.1 Future work

Throughout this thesis we have mentioned several topics that we left out of scope. These include the integration of alias analysis, exception-induced control flow, static resolution of dynamic features, object properties, and improved type reconstruction. We consider all of these interesting directions for future work. Algorithms for resolving includes [13] and variable features [10], and object-sensitive type analysis [14], are good candidates for inclusion. Regarding Php-Pdg itself, a primary concern is to reduce its memory usage so that it can be applied to bigger codebases. Support should be added for call dependences to features such as constructors and destructors, additional magic methods, and the Countable and ArrayAccess interfaces. We would also like to see an improved link from operation nodes to the original AST and the implementation of the additional dependence types. We believe this could stimulate the development of tooling in areas other than slicing, such as clone detection or detection of performance and security anti-patterns. Lastly, we would sincerely like to discover the reason for XDebug’s irregular line number reporting in function traces.

Bibliography

[1] Frances E. Allen. “Control flow analysis”. In: ACM Sigplan Notices. Vol. 5. 7. ACM. 1970, pp. 1–19.
[2] Afshin Amighi, Pedro de Carvalho Gomes, Dilian Gurov, and Marieke Huisman. “Sound Control-Flow Graph Extraction for Java Programs with Exceptions”. In: Software Engineering and Formal Methods - 10th International Conference, SEFM 2012, Thessaloniki, Greece, October 1-5, 2012. Proceedings. 2012, pp. 33–47.
[3] Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon, and Andreas Zwinkau. “Simple and Efficient Construction of Static Single Assignment Form”. In: Compiler Construction - 22nd International Conference, CC 2013, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2013, Rome, Italy, March 16-24, 2013. Proceedings. 2013, pp. 102–122.
[4] Martin Bravenboer and Yannis Smaragdakis. “Strictly declarative specification of sophisticated points-to analyses”. In: Proceedings of the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2009, October 25-29, 2009, Orlando, Florida, USA. 2009, pp. 243–262.
[5] Zhengqiang Chen and Baowen Xu. “Slicing Object-Oriented Java Programs”. In: SIGPLAN Notices 36.4 (2001), pp. 33–40.
[6] Keith D Cooper, Timothy J Harvey, and Ken Kennedy. “A simple, fast dominance algorithm”. In: Software Practice & Experience 4.1-10 (2001), pp. 1–8.
[7] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph”. In: ACM Trans. Program. Lang. Syst. 13.4 (1991), pp. 451–490.
[8] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. “The program Dependence Graph and its Use in Optimization”. In: International Symposium on Programming, 6th Colloquium, Toulouse, April 17-19, 1984, Proceedings. 1984, pp. 125–132.
[9] Ilja Heitlager, Tobias Kuipers, and Joost Visser. “A Practical Model for Measuring Maintainability”. In: Quality of Information and Communications Technology, 6th International Conference on the Quality of Information and Communications Technology, QUATIC 2007, Lisbon, Portugal, September 12-14, 2007, Proceedings. 2007, pp. 30–39.
[10] Mark Hills. “Variable Feature Usage Patterns in PHP (T)”. In: 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015. 2015, pp. 563–573.
[11] Mark Hills and Paul Klint. “PHP AiR: Analyzing PHP systems with Rascal”. In: 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, CSMR-WCRE 2014, Antwerp, Belgium, February 3-6, 2014. 2014, pp. 454–457.
[12] Mark Hills, Paul Klint, and Jurgen J. Vinju. “An empirical study of PHP feature usage: a static analysis perspective”. In: International Symposium on Software Testing and Analysis, ISSTA ’13, Lugano, Switzerland, July 15-20, 2013. 2013, pp. 325–335.
[13] Mark Hills, Paul Klint, and Jurgen J. Vinju. “Static, lightweight includes resolution for PHP”. In: ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 2014, pp. 503–514.
[14] Henk Erik Van der Hoek and Jurriaan Hage. “Object-sensitive Type Analysis of PHP”. In: Proceedings of the 2015 Workshop on Partial Evaluation and Program Manipulation, PEPM, Mumbai, India, January 15-17, 2015. 2015, pp. 9–20.
[15] Susan Horwitz, Thomas W. Reps, and David Binkley. “Interprocedural Slicing Using Dependence Graphs”. In: ACM Trans. Program. Lang. Syst. 12.1 (1990), pp. 26–60.
[16] Jang-Wu Jo and Byeong-Mo Chang. “Constructing Control Flow Graph for Java by Decoupling Exception Flow from Normal Flow”. In: Computational Science and Its Applications - ICCSA 2004, International Conference, Assisi, Italy, May 14-17, 2004, Proceedings, Part I. 2004, pp. 106–113.
[17] Raghavan Komondoor and Susan Horwitz. “Using Slicing to Identify Duplication in Source Code”. In: Static Analysis, 8th International Symposium, SAS 2001, Paris, France, July 16-18, 2001, Proceedings. 2001, pp. 40–56.
[18] Gyula Kovács, Ferenc Magyar, and Tibor Gyimóthy. “Static slicing of java programs”. In: University. Citeseer. 1996.
[19] Jens Krinke. “Identifying Similar Code with Program Dependence Graphs”. In: Proceedings of the Eighth Working Conference on Reverse Engineering, WCRE’01, Stuttgart, Germany, October 2-5, 2001. 2001, pp. 301–309.
[20] Loren Larsen and Mary Jean Harrold. “Slicing object-oriented software”. In: Software Engineering, 1996., Proceedings of the 18th International Conference on. IEEE. 1996, pp. 495–505.
[21] Thomas J. McCabe. “A Complexity Measure”. In: IEEE Trans. Software Eng. 2.4 (1976), pp. 308–320.
[22] Karl J. Ottenstein and Linda M. Ottenstein. “The Program Dependence Graph in a Software Development Environment”. In: Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, Pittsburgh, Pennsylvania, USA, April 23-25, 1984. 1984, pp. 177–184.
[23] Mati Shomrat and Yishai A. Feldman. “Detecting Refactored Clones”. In: ECOOP 2013 - Object-Oriented Programming - 27th European Conference, Montpellier, France, July 1-5, 2013. Proceedings. 2013, pp. 502–526.
[24] Saurabh Sinha and Mary Jean Harrold. “Analysis and Testing of Programs with Exception Handling Constructs”. In: IEEE Trans. Software Eng. 26.9 (2000), pp. 849–871.
[25] Neil Walkinshaw, Marc Roper, and Murray Wood. “The Java System Dependence Graph”. In: 3rd IEEE International Workshop on Source Code Analysis and Manipulation (SCAM 2003), 26-27 September 2003, Amsterdam, The Netherlands. 2003, pp. 55–64.
[26] Mark Weiser. “Program Slicing”. In: Proceedings of the 5th International Conference on Software Engineering, San Diego, California, USA, March 9-12, 1981. 1981, pp. 439–449.
[27] John Whaley and Monica S. Lam. “Cloning-based context-sensitive pointer alias analysis using binary decision diagrams”. In: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation 2004, Washington, DC, USA, June 9-11, 2004. 2004, pp. 131–144.
[28] Jianjun Zhao. “Applying program dependence analysis to Java software”. In: Proceedings of Workshop on Software Engineering and Database Systems, 1998 International Computer Symposium. 1998, pp. 162–169.

Appendix A

Graph library

Listings A.1, A.2, and A.3 contain the main components of the graph library. The full implementation has been left out for brevity, but is publicly available as part of Php-Pdg¹.

Listing A.1: Node interface

    interface NodeInterface {
        /**
         * Get a string representation of the node, for printing
         *
         * @return string
         */
        public function toString();

        /**
         * Returns a hash uniquely identifying this object.
         *
         * @return string
         */
        public function getHash();
    }

Listing A.2: Edge

    class Edge {
        private $from_node;
        private $to_node;
        private $attributes;

        public function __construct(NodeInterface $from_node, NodeInterface $to_node, $attributes = []) {
            $this->from_node = $from_node;
            $this->to_node = $to_node;
            $this->attributes = $attributes;
        }

        /**
         * @return NodeInterface
         */
        public function getFromNode() {
            return $this->from_node;
        }

        /**
         * @return NodeInterface
         */
        public function getToNode() {
            return $this->to_node;
        }

        /**
         * @return array
         */
        public function getAttributes() {
            return $this->attributes;
        }
    }

¹ https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/Graph

Listing A.3: Graph interface

    interface GraphInterface {
        /**
         * @param NodeInterface $node
         */
        public function addNode(NodeInterface $node);

        /**
         * @param NodeInterface $from_node
         * @param NodeInterface $to_node
         * @param array $attributes
         */
        public function addEdge(NodeInterface $from_node, NodeInterface $to_node, array $attributes = []);

        /**
         * @param NodeInterface $node
         * @return boolean
         */
        public function hasNode(NodeInterface $node);

        /**
         * Whether or not the graph has any edges that match the arguments (all match by default)
         *
         * @param NodeInterface $from_node
         * @param NodeInterface $to_node
         * @param array $filterAttributes
         * @param boolean $filterAttributesExact Should all attributes match
         * @return boolean
         */
        public function hasEdges(NodeInterface $from_node = null, NodeInterface $to_node = null, array $filterAttributes = [], $filterAttributesExact = false);

        /**
         * @return NodeInterface[]
         */
        public function getNodes();

        /**
         * Get edges that match the arguments (all match by default)
         *
         * @param NodeInterface $from_node
         * @param NodeInterface $to_node
         * @param array $filterAttributes
         * @param boolean $filterAttributesExact Should all attributes match
         * @return Edge[]
         */
        public function getEdges(NodeInterface $from_node = null, NodeInterface $to_node = null, array $filterAttributes = [], $filterAttributesExact = false);

        /**
         * Delete a node and all its connected edges from the graph
         *
         * @param NodeInterface $node
         */
        public function deleteNode(NodeInterface $node);

        /**
         * Delete edges that match the arguments (all match by default)
         *
         * @param NodeInterface $from_node
         * @param NodeInterface $to_node
         * @param array $filterAttributes
         * @param boolean $filterAttributesExact Should all attributes match
         */
        public function deleteEdges(NodeInterface $from_node = null, NodeInterface $to_node = null, array $filterAttributes = [], $filterAttributesExact = false);

        public function clear();
    }

Appendix B

PDG library

Listings B.1 and B.2 show the Func and Factory classes, which together with the Graph library components shown in appendix A are the main outward-facing components of the PDG library. The inward-facing components that we use can be seen in the createDefault method of the Factory class. Their implementation has been left out for brevity, but is publicly available as part of Php-Pdg¹.

Listing B.1: Func

    class Func {
        /** @var string */
        public $name;
        /** @var string|null */
        public $class_name;
        /** @var string|null */
        public $filename;
        /** @var NodeInterface */
        public $entry_node;
        /** @var NodeInterface[] */
        public $param_nodes = [];
        /** @var NodeInterface[] */
        public $return_nodes = [];
        /** @var GraphInterface */
        public $pdg;

        /**
         * Func constructor.
         * @param string $name
         * @param string|null $class_name
         * @param string|null $filename
         * @param NodeInterface $entry_node
         * @param GraphInterface $pdg
         */
        public function __construct($name, $class_name, $filename, NodeInterface $entry_node, GraphInterface $pdg) {
            $this->name = $name;
            $this->class_name = $class_name;
            $this->filename = $filename;
            $this->entry_node = $entry_node;
            $this->pdg = $pdg;
        }

        /**
         * @return string
         */
        public function getScopedName() {
            if ($this->class_name !== null) {
                return $this->class_name . '::' . $this->name;
            }
            return $this->name;
        }

        /**
         * @return string
         */
        public function getId() {
            if ($this->filename !== null) {
                return $this->filename . '[' . $this->getScopedName() . ']';
            }
            return $this->getScopedName();
        }
    }

¹ https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/ProgramDependence

Listing B.2: Factory

class Factory implements FactoryInterface {
    /** @var GraphFactoryInterface */
    private $graph_factory;
    /** @var ControlDependenceGeneratorInterface */
    private $control_dependence_generator;
    /** @var DataDependenceGeneratorInterface */
    private $data_dependence_generator;

    /**
     * Factory constructor.
     * @param GraphFactoryInterface $graph_factory
     * @param ControlDependenceGeneratorInterface $control_dependence_generator
     * @param DataDependenceGeneratorInterface $data_dependence_generator
     */
    public function __construct(GraphFactoryInterface $graph_factory, ControlDependenceGeneratorInterface $control_dependence_generator, DataDependenceGeneratorInterface $data_dependence_generator) {
        $this->graph_factory = $graph_factory;
        $this->control_dependence_generator = $control_dependence_generator;
        $this->data_dependence_generator = $data_dependence_generator;
    }

    /**
     * @param GraphFactoryInterface|null $graph_factory
     * @return Factory
     */
    public static function createDefault(GraphFactoryInterface $graph_factory = null) {
        $graph_factory = $graph_factory !== null ? $graph_factory : new GraphFactory();
        $block_cfg_generator = new BlockCfgGenerator($graph_factory);
        $block_cdg_generator = new BlockCdgGenerator($graph_factory);
        $pdt_generator = new PdgGenerator($graph_factory);
        $control_dependence_generator = new ControlDependenceGenerator($block_cfg_generator, $pdt_generator, $block_cdg_generator);
        $data_dependence_generator = new DataDependenceGenerator();
        return new self($graph_factory, $control_dependence_generator, $data_dependence_generator);
    }

    /**
     * @param CfgFunc $cfg_func
     * @param string $filename
     * @return Func
     */
    public function create(CfgFunc $cfg_func, $filename = null) {
        $pdg = $this->graph_factory->create();
        $entry_node = new EntryNode();
        $pdg->addNode($entry_node);
        $func = new Func($cfg_func->name, $cfg_func->class !== null ? $cfg_func->class->value : null, $filename, $entry_node, $pdg);

        foreach ($cfg_func->params as $param) {
            $param_node = new OpNode($param);
            $func->pdg->addNode($param_node);
            $func->param_nodes[] = $param_node;
        }
        $traverser = new Traverser();
        $traverser->addVisitor(new InitializingVisitor($func));
        $traverser->traverseFunc($cfg_func);

        $this->control_dependence_generator->addFuncControlDependenceEdgesToGraph($cfg_func, $pdg, $entry_node);
        $this->data_dependence_generator->addFuncDataDependenceEdgesToGraph($cfg_func, $pdg);

        return $func;
    }
}

Appendix C

SDG library

Listings C.1 and C.2 show the System and Factory classes, which are the two main outward-facing components of the SDG library, together with the graph library components shown in Appendix A and the PDG library components shown in Appendix B. The inward-facing components that we use can be seen in the createDefault method of the Factory class. Their implementation has been left out for brevity, but is publicly available as part of Php-Pdg.¹

Listing C.1: System

class System {
    /** @var Func[] */
    public $scripts = [];
    /** @var Func[] */
    public $functions = [];
    /** @var Func[] */
    public $methods = [];
    /** @var Func[] */
    public $closures = [];
    /** @var GraphInterface */
    public $sdg;

    /**
     * System constructor.
     * @param GraphInterface $sdg
     */
    public function __construct(GraphInterface $sdg) {
        $this->sdg = $sdg;
    }

    /**
     * @return Func[]
     */
    public function getFuncs() {
        return array_merge($this->scripts, $this->functions, $this->methods, $this->closures);
    }
}

Listing C.2: Factory

class Factory implements FactoryInterface {
    /** @var GraphFactoryInterface */
    private $graph_factory;
    /** @var PdgFactoryInterface */
    private $pdg_factory;
    /** @var CallDependenceGeneratorInterface */
    private $call_dependence_generator;
    /** @var TypeReconstructor */
    private $type_reconstructor;

    /**
     * Factory constructor.
     * @param GraphFactoryInterface $graph_factory
     * @param PdgFactoryInterface $pdg_factory
     * @param CallDependenceGeneratorInterface $call_dependence_generator
     */
    public function __construct(GraphFactoryInterface $graph_factory, PdgFactoryInterface $pdg_factory, CallDependenceGeneratorInterface $call_dependence_generator) {
        $this->graph_factory = $graph_factory;
        $this->pdg_factory = $pdg_factory;
        $this->call_dependence_generator = $call_dependence_generator;
        $this->type_reconstructor = new TypeReconstructor();
    }

    /**
     * @return Factory
     */
    public static function createDefault() {
        $graph_factory = new GraphFactory();
        $operand_class_resolver = new OperandClassResolver();
        $method_resolver = new MethodResolver();
        return new self($graph_factory, PdgFactory::createDefault($graph_factory), new CombiningGenerator([
            new FunctionCallGenerator(),
            new MethodCallGenerator($operand_class_resolver, $method_resolver),
            new OverloadingCallGenerator($operand_class_resolver, $method_resolver),
        ]));
    }

    /**
     * @param CfgSystem $cfg_system
     * @return System
     */
    public function create(CfgSystem $cfg_system) {
        $sdg = $this->graph_factory->create();
        $system = new System($sdg);

        /** @var FuncNode[]|\SplObjectStorage $pdg_func_lookup */
        $pdg_func_lookup = new \SplObjectStorage();
        $cfg_scripts = [];
        /** @var \SplFileInfo $fileinfo */
        foreach ($cfg_system->getFilenames() as $filename) {
            $cfg_scripts[] = $cfg_script = $cfg_system->getScript($filename);

            $pdg_func = $this->pdg_factory->create($cfg_script->main, $filename);
            $system->scripts[$filename] = $pdg_func;
            $func_node = new FuncNode($pdg_func);
            $system->sdg->addNode($func_node);
            $pdg_func_lookup[$cfg_script->main] = $func_node;

            foreach ($cfg_script->functions as $cfg_func) {
                $pdg_func = $this->pdg_factory->create($cfg_func, $filename);
                $scoped_name = $cfg_func->getScopedName();
                if ($cfg_func->class !== null) {
                    $system->methods[$scoped_name] = $pdg_func;
                } else if (strpos($cfg_func->name, '{anonymous}#') === 0) {
                    $system->closures[$scoped_name] = $pdg_func;
                } else {
                    $system->functions[$scoped_name] = $pdg_func;
                }
                $func_node = new FuncNode($pdg_func);
                $system->sdg->addNode($func_node);
                $pdg_func_lookup[$cfg_func] = $func_node;
            }
        }

        $state = new State($cfg_scripts);
        $this->type_reconstructor->resolve($state);
        $this->call_dependence_generator->addCallDependencesToSystem($system, $state, $pdg_func_lookup);
        return $system;
    }
}

¹ https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/SystemDependence
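The create method stores scripts under their filename and functions, methods and closures under their scoped name, and System::getFuncs combines the four arrays with array_merge. The following sketch illustrates how array_merge treats such string keys: they are preserved, and when the same string key occurs more than once the later value replaces the earlier one. The keys and the strings standing in for Func objects are hypothetical sample data.

```php
<?php
// Hypothetical sample data: plain strings stand in for Func objects.
$scripts   = ['src/a.php' => 'script src/a.php'];
$functions = ['foo' => 'function foo'];
$methods   = ['Bar::baz' => 'method Bar::baz'];
$closures  = ['{anonymous}#1' => 'closure 1'];

// Same call shape as System::getFuncs(): string keys are kept as-is.
$funcs = array_merge($scripts, $functions, $methods, $closures);

// If two arrays shared a string key, the later entry would replace the earlier one.
$colliding = array_merge(['foo' => 'first'], ['foo' => 'second']);

print_r(array_keys($funcs));
print_r($colliding);
```

As long as filenames and scoped names do not coincide across the four arrays, the merged result therefore contains every Func exactly once under its original key.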
