The Case for Analysis Preserving Language Transformation

The Case for Analysis Preserving Language Transformation Xiaolan Zhang Larry Koved Marco Pistoia Sam Weber [email protected] [email protected] [email protected] [email protected] Trent Jaeger∗ Guillaume Marceau† Liangzhao Zeng [email protected] [email protected] [email protected] IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532, USA ABSTRACT Categories and Subject Descriptors Static analysis has gained much attention over the past few D.2.4 [Software Engineering]: Software/Program Verifi- years in applications such as bug finding and program verification; D.2.3 [Software Engineering]: Coding Tools and cation. As software becomes more complex and componen- Techniques tized, it is common for software systems and applications to be implemented in multiple languages. There is thus a strong need for developing analysis tools for multi-language General Terms software. We introduce a technique called Analysis Pre- Languages, Security, Verification serving Language Transformation (aplt) that enables the analysis of multi-language software, and also allows analy- Keywords sis tools for one language to be applied to programs written in another. aplt preserves data and control flow informa- Language Transformation, Language Translation, Static Analy- tion needed to perform static analyses, but allows the trans- sis, Verification, Security, Java, C lation to deviate from the original program’s semantics in ways that are not pertinent to the particular analysis. We 1. INTRODUCTION discuss major technical difficulties in building such a translator, using a C-to-Java translator as an example. We demon- Static analysis is often used to build models of software strate the feasibility and effectiveness of aplt using two us- for the purpose of extracting or verifying properties of code. This has proven to have many applications in software en- age cases: analysis of the Java runtime native methods and gineering, including refactoring [34], program comprehen- reuse of Java analysis tools for C. Our preliminary results sion [17], and maintenance [46]. In particular, static analysis show that a control- and data-flow equivalent model for native methods can eliminate unsoundness and produce reli- has been successfully applied to the area of security, where able results, and that aplt enables seamless reuse of analysis the analyses aim to determine whether or not a given piece tools for checking high-level program properties. of software violates a set of security properties [6, 7, 12, 18, 23, 24, 30, 42, 49, 50]. Examples of security properties include complete mediation of mandatory access control mechanisms [49], proper input sanitization [29], absence ∗Work done while at IBM T.J. Watson Research Center. of vulnerabilities such as Time of Check to Time of Use Current address: Department of Computer Science and En- (tocttou) [10], buffer overflow [31] and format string [42], gineering, The Pennsylvania State University, 346A Infor- permission computations [30], and placement of privileged mation Sciences and Technology Building, University Park, calls [38]. PA 16802, USA. † There are a myriad of vendors who provide customized Work done while summer intern at IBM T.J. Watson Re- search Center. Current address: Department of Computer software engineering tools tailored to particular applications Science, Brown University, Providence, RI 02912, USA. and languages. Although this has been sufficient in the past, software and systems are becoming more complex and com- ponentized, and are now frequently implemented in multiple languages. For example, the JavaTMruntime libraries include “native” methods written in other languages, usually Permission to make digital or hard copies of all or part of this work for for performance or compatibility reasons. The reference im- personal or classroom use is granted without fee provided that copies are plementation of the Java 1.4.2 runtime library includes 1338 not made or distributed for profit or commercial advantage and that copies native methods (about 2% of all methods). bear this notice and the full citation on the first page. To copy otherwise, to It is thus desirable to have a multi-lingual tool that can republish, to post on servers or to redistribute to lists, requires prior specific apply the same analysis across all the different languages permission and/or a fee. ISSTA’06, July 17–20, 2006, Portland, Maine, USA. used by a software system. In the case of security analy- Copyright 2006 ACM 1-59593-263-1/06/0007 ...$5.00. sis, this is a necessity rather than a luxury, because making 191 end-to-end security guarantees or assurance statements re- the original program. Implementing this is a difficult task. quires that the security properties are preserved across the The crucial observation that makes our approach feasible is entire software stack. In other words, the soundness of the that, for many analyses, such as ours whose purposes are security analysis depends on analyzing the entire code. For to detect bugs and security violations, preserving the ex- example, in the case of permission analysis [30], where the ecution semantics is not necessary. For example, typical least privileges required to complete a task are computed security-related analyses are mainly concerned with deter- for a given task, it is necessary that the native methods are mining whether certain “bad” code paths are executed, but included in the analysis for the results to be sound. are not concerned about properties like whether certain ex- Another need for general multi-lingual analysis capabili- pressions are constant. Therefore, in our translation, suffi- ties arises in analysis reuse. Many of the high-level prop- cient data and control flow information needed to perform erties we are interested in, such as the complete mediation static analysis are preserved, while the translation is allowed property, are language-independent. It is thus desirable to to deviate from the original program’s semantics in other as- apply an analysis developed for one language to software pects that are not pertinent to our analysis task. We call written in a different language. In addition, programmers this technique Analysis Preserving Language Transforma- often implement novel analysis techniques for their favorite tion (aplt), as opposed to Execution Preserving Language language. It typically takes a long time for similar features Transformation (eplt), where the translated program exe- to be ported to tools for other languages. For example, cutes the same as the original version. the safe Java typestate checker developed at ibm Research Compared to the traditional approach of porting analy- employs advanced techniques, such as adaptive verification sis algorithms, our approach offers two advantages. First, based on the nature of property being verified [20, 48], and because the analysis can be performed on the entire code integrated aliasing and verification, which are lacking in con- base, the analysis is complete. Secondly, the approach is temporary C checkers. On the other hand, the esp [14] more cost effective: once the translator is developed, we checker for C uses advanced path-sensitive analysis tech- can reuse it with existing analyses as well as future ones. niques which are not present in current Java checkers. Thus, Compared with the cir approach, our approach has much techniques that enable the analysis of multiple languages broader applicability, as the target language is a standard can also be used to facilitate tool reuse by allowing analysis language, not an intermediate language that is specific to a written for one language to be applied to software written particular analysis engine. What we give up for this general- in another language. Essentially, we wish to offer users the ity is precision with regard to code generation. However, our freedom to choose a checker most suitable for the proper- translation is precise with respect to a large class of analyses ties to be checked, without worrying about the underlying (with a few exceptions, see Section 6). If the ultimate goal implementation language of the target program. is code analysis, rather than code generation, we believe the One approach to multi-language analysis is to have mul- tradeoff is worthwhile. tiple analyzers, one for each target language, implementing We demonstrate the feasibility of aplt through two usage the same analysis. Results from each analyzer are then com- cases. Based on the aplt technology we built a C to Java bined at the end. This approach does not entirely satisfy translator called fictoj. We perform security analyses [30, the completeness requirement, because each analysis is local 38] on the Java runtime native methods after translating to the specific component, and cross-component interactions them to a Java equivalent form using fictoj. Our prelim- are lost due to the localized analysis scope. As a result, each inary results show that a control- and data-flow equivalent analysis is only partial, thus the soundness of the analysis model for native methods can eliminate unsoundness and can not be guaranteed. In addition, this approach is not cost produce reliable results. Using fictoj,wehavealsosuc- effective: the learning curve is steep for porting an analy- cessfully applied an advanced analysis available to Java (the sis to each different target language. Worse, the effort is safe model checker) to programs written in C, for which no quadratic on the number of analysis, since a new port is tool with equivalent capabilities is publicly available. These required for each new analysis/language pair. results demonstrate that analysis preserving language trans- Another approach to solving the multi-lingual analysis formation is practical. Further, it serves as a powerful foun- problem is to build analysis frameworks by translating mul- dation for solving real world software engineering challenges, tiple front-ends to a Common Intermediate Representation in particular, multi-lingual program analysis and analysis (cir) and performing the analysis on the cir [2, 35].

Load more