Who Changed You? Obfuscator Identification for Android

Yan Wang and Atanas Rountev
The Ohio State University, Columbus, OH, USA
Email: {wang10,rountev}@cse.ohio-state.edu

Abstract—Android developers commonly use app obfuscation to secure their apps and intellectual property. Although obfuscation provides protection, it presents an obstacle for a number of legitimate program analyses such as detection of app cloning and repackaging, malware detection, identification of third-party libraries, provenance analysis for digital forensics, and reverse engineering for test generation and performance analysis. If the obfuscator used to create an app can be identified, and if some details of the obfuscation process can be inferred, subsequent analyses can exploit this knowledge. Thus, it is desirable to be able to automatically analyze a given app and determine (1) whether it was obfuscated, (2) which obfuscator was used, and (3) how the obfuscator was configured.

We have developed novel techniques to identify the obfuscator of an Android app for several widely-used obfuscation tools and for a number of their configuration options. We define the obfuscator identification problem and propose a solution based on machine learning. To the best of our knowledge, this is the first work to formulate and solve this problem. We identify a feature vector that represents the characteristics of the obfuscated code. We then implement a tool that extracts this feature vector from Dalvik bytecode and uses it to identify the obfuscator provenance information. We evaluate the proposed approach on real-world Android apps obfuscated with different obfuscators, under several configurations. Our experiments indicate that the approach identifies the obfuscator with about 97% accuracy and recognizes the configuration with more than 90% accuracy.

Keywords—Android, App analysis, Obfuscation

I. INTRODUCTION

The explosive growth in the use of mobile devices such as smartphones and tablets has led to substantial changes in the computing industry. Android is one of the major platforms for such devices [1]. Because of commercial interests and security concerns, many companies and individual developers are reluctant to make public the Android app source code. But even without source code, these apps can be easily decompiled and pirated. Developers therefore use additional protections to secure their apps and intellectual property.

One widely used approach for app protection is obfuscation. Obfuscation hides informative data in the software and makes it hard to understand for both humans and decompilation tools. Obfuscation is easy to use: a number of obfuscators have been developed and their deployment is fairly simple. For example, the Android IDE provided for free by Google integrates the ProGuard [2] obfuscation tool. Due to this ease of use, obfuscation is becoming more popular. According to a study published in 2014 [3], about 15% of the apps in the Google Play store are obfuscated, for both free and paid apps. In our studies we observed a percentage that is substantially higher.

Although obfuscation provides desirable protection, it presents an obstacle for legitimate program analyses. Consider the detection of app cloning and repackaging. At least 25% of apps in Google Play are clones of other apps [3], and almost 80% of the top 50 free apps have fake versions because of repackaging [4]. Obfuscation may impair clone and repackage detection tools because an obfuscated clone app may appear to be different from the original app [5], [6]. Another consideration is malware detection using static analysis. Analysis of obfuscated malicious code presents a number of challenges [7], [8]. As yet another example, identifying the third-party libraries included in an Android app is a well-known problem important for general static analysis, security analysis, testing, and detection of performance bugs. Obfuscation hinders such identification [9], [10], [11].

In all these examples, one possible approach is to develop obfuscator-tailored techniques. If the specific obfuscator used to create the final app code can be identified, and if some details of the obfuscation process can be inferred, the subsequent analyses can exploit this knowledge. However, a key prerequisite is to be able to analyze a given app and determine (1) whether it was obfuscated, (2) which obfuscator was used, and (3) how the obfuscator was configured.

These questions are also important for provenance analysis. The app is the result of a process that starts with the source code and produces the final distributed APK file. The provenance details of this process and the toolchain that implements it are important for a number of reasons. For example, such details are useful for digital forensics [12] and for reverse engineering of models used for test generation (e.g., [13], [14]) and code instrumentation for performance analysis (e.g., [15]). Knowledge of the obfuscator and the configuration options used by it reveals aspects of the app toolchain provenance, similarly to provenance analysis for binary code [12].

We have developed novel techniques to identify the obfuscator of an Android app for several widely-used obfuscation tools and for a number of their configuration options. The approach relies only on the Dalvik bytecode in the app APK. The identification problem is formulated as a machine learning task, based on a model of various properties of the bytecode (e.g., different categories of strings). This model reflects characteristics of the code that may be modified by the obfuscators—in particular, names of program entities as well

Fig. 1: Workflow of Android compilation (Java source code → .class files → classes.dex → .apk file, together with resource files).

Fig. 2: Example of Dalvik bytecode:

new-instance v4, Landroid/content/ComponentName; // type@0033
move-result-object v1
if-nez v1, 000d // +0004
const/4 v2, #int 0 // #0
return-object v2
invoke-static {v3}, Landroid/support/v4/content/IntentCompat;.makeMainActivity:(Landroid/content/ComponentName;)Landroid/content/Intent; // method@0e03
invoke-direct {v4}, Landroid/content/Intent;.<init>:()V // method@013a

Fig. 3: Workflow of an obfuscator (.class files → obfuscate & optimize → obfuscated .class files → classes.dex; the resource files are configured and a new APK is built).

Fig. 4: Workflow of a packer (the original classes.dex is packed and encrypted; an artificial classes.dex and a modified resource file are placed in the new APK).

as sequences of instructions. The main insight of our approach is that relatively simple code features are sufficient to infer precise provenance information. The specific contributions of this work are:

• We define the obfuscator identification problem for Android and propose a solution based on machine learning techniques. To the best of our knowledge, this is the first work to formulate and solve this problem.
• We identify a feature vector that represents the characteristics of the obfuscated code. We then implement a tool that extracts this feature vector from Dalvik bytecode and uses it to identify the obfuscator provenance information.
• We evaluate the proposed approach on Android apps obfuscated with different obfuscators, under several configurations. Our experiments indicate that the approach identifies the obfuscator with 97% accuracy and recognizes the configuration with more than 90% accuracy.

II. OVERVIEW

This section presents an overview of Android obfuscation and a high-level description of our analysis approach.

A. Software Obfuscation

Software obfuscation is a well-known technique for protecting software from reverse engineering attacks [16]. Obfuscation transforms the input code to generate new target code. At a high level, we have T = f(I), where I is the input code that needs obfuscation; it can be source code, bytecode, or binary code. T is the target code and f is an obfuscating transformation. I and T must have the same observable behavior: for the same valid input, both should terminate and generate the same output. Obfuscation cannot achieve a “virtual black box” [17]—in other words, perfect obfuscation is impossible. Nevertheless, in practice it is still an effective approach to protect the code from humans and from code analysis and reverse engineering tools [18].

B. Android Obfuscation

Obfuscation for Android is both easy to use (e.g., using the ProGuard tool available in Android Studio) and observed for many applications in the Google Play app store [3]. Android apps are released as APK (“Application PacKage”) files. The process of obtaining an APK file is shown in Figure 1. The developer builds the app using Java. The Java source code is compiled into .class files using some Java compiler. The .class files contain standard Java bytecode. Android uses its own format of bytecode called Dalvik bytecode. Figure 2 is a snippet of such bytecode. The Java bytecode is compiled into a file classes.dex (“Dalvik Executable”), which contains all the Dalvik bytecode, together with bytecode from .jar files that may come from third-party libraries. In most cases there is only one classes.dex file, although a large app can be split into multiple dex files. In addition to the dex file(s), the APK also contains resource files (e.g., images). At installation time, the Android runtime will compile classes.dex using the dex2oat tool. The output will be executable code for the target device, which achieves better performance than the Dalvik bytecode.

In Android, obfuscation happens before releasing the app. Without obfuscation, the Dalvik bytecode is relatively easy to reverse engineer, using tools such as Soot [19], Androguard [20], Baksmali [21], Apktool [22], Dex2jar [23], Dexdump [24], etc. If the app is obfuscated, even just using ProGuard (which only provides basic obfuscation), the changes will affect these tools and hinder reverse engineering [25].

Although obfuscation may protect intellectual property, it may also affect other legitimate uses of reverse engineering and app analysis. As discussed earlier, one such example is the detection of cloning and repackaging. Several studies have shown the negative influence of obfuscation on such detection [5], [6]. Obfuscation is also an obstacle for some anti-malware tools [7], [8] because the pattern matching used by these tools requires static analysis that is not resilient to obfuscation changes. Obfuscation also causes problems for the identification of third-party libraries because it may change the content of library classes [9], [10], [11].

public static void initSetting(Editor editor){
  editor.remove("deviceSalt");
  editor.remove("encryptedKeyPass");
  editor.commit();
}
private void goNextStage(){
  gotoStage(this.stage + 1);
}

(a) Original classes.dex.

public static void a(Editor arg0){
  arg0.remove(R.a("o\u0018}\u0014h\u0018X\u001cg\t"));
  arg0.remove(android.support.v7.appcompat.R.a("u,s0i2d't\tu;@#c1"));
  arg0.commit();
}
private void a(){
  a(this.l + 1);
}

(b) Obfuscated classes.dex.

Fig. 5: Example of obfuscation performed by Allatori.

Fig. 6: Example of obfuscation performed by Legu. (a) Original class structure. (b) Legu class structure.

C. Two Categories of Obfuscators

Various obfuscators are available to Android developers. These tools can be broadly classified in two categories. An obfuscator in the first category performs code transformations. In most cases it will transform Java bytecode, and it has almost the same behavior as a Java obfuscator, except for considerations related to Android features. Examples of tools from this category are ProGuard, Allatori, DashO, DexGuard, and Shield4J. Figure 3 illustrates the workflow of this type of obfuscator. One typical technique is name obfuscation, which replaces the names of packages, classes, methods, and fields with meaningless sequences of characters. Sometimes the package structure is also modified, which further obscures the names of packages and classes. Other techniques include flow obfuscation, which modifies code order or the control-flow graph, and string encryption, which encrypts the constant strings in the code. Some tools go further and obfuscate the XML files in the resource part of the APK.

Example: Figure 5b illustrates obfuscation with Allatori, one of the tools used in our experiments. Figure 5a shows the original code. For illustration purposes, the figures show the equivalent Java source code rather than the actual bytecode. Names are obfuscated and the constant strings are encrypted. For example, methods initSetting and goNextStage are renamed to a. The encrypted strings are decrypted using methods R.a and android.support.v7.appcompat.R.a. These methods are created by the obfuscator in the corresponding R resource classes. The tool uses multiple decryption methods because it employs several different encryption options and seems to arbitrarily pick one to encode a string.

The second category of tools contains packers. A packer encrypts the original classes.dex file, then decrypts this file in memory at run time and executes it via reflection using DexClassLoader. Tools such as APKProtect, Bangcle, and Legu belong to this category. They do not perform any bytecode modifications—rather, they hide the entire dex file. Figure 4 shows the corresponding process. The original classes.dex will be packed and an artificial classes.dex file will be generated. The new file loads the original file in memory. The packed classes.dex is encrypted. Figure 6b shows the package and class structure after obfuscation with Legu. The original structure is shown in Figure 6a. New helper classes are created and the original classes are hidden.

A tool may provide choices to its users. For example, both ProGuard and Allatori have configuration settings to specify whether to perform different levels of obfuscation, and/or to provide a user-defined renaming style to replace the default one.

D. Machine Learning Model

The goal of our work is to identify the specific obfuscator that was used to create a given APK file. This information may enable the creation of obfuscation-tailored analysis, testing, and instrumentation, as well as provenance analysis for digital forensics. While the internal details of various obfuscation tools are not public, these tools are developed by unrelated development teams and are likely to differ substantially in low-level details. Our premise is that these differences manifest themselves in the obfuscated code. Thus, we aim to select suitable code features and to apply machine learning techniques based on them. In particular, we use labeled data to train a classifier. The classifier will map a vector of code features to a label representing one of several known obfuscation tools. Details of this approach are presented in Section III.

E. Obfuscators Used in Our Study

There are many obfuscators and most of them are commercial. The cost of these tools is relatively high. As a result, in this study we selected free tools or free versions of paid tools. We used five different tools. ProGuard is provided by Google, Allatori is developed by a Russian company, and DashO is a product from a U.S. company. Legu and Bangcle are packers from China. ProGuard is integrated with Android Studio, the official IDE from Google. Allatori is a commercial tool, but it has a free educational version. Amazon, Fujitsu, Motorola, and hundreds of other companies worldwide have used Allatori to protect their software products from being reverse-engineered by business rivals [26]. The company developing DashO [27] has over 5,000 corporate clients in over 100 countries. Legu is developed by Tencent, one of the largest IT companies in China. Currently there are more than 5 million developers using this company's platform, and apps released by them

will be protected by Legu [28]. Bangcle is developed by a company with the same name that provides security services to individuals, enterprises, and governments. The tool has served seventy thousand companies and has protected more than seven hundred thousand apps, which are installed on about seven hundred million devices [29].

Fig. 7: Workflow for obfuscator identification (an APK file is given to the obfuscator classifier, which outputs ProGuard, Allatori, DashO, Legu, Bangcle, or Other; for ProGuard, Allatori, and DashO, a config identifier then recovers config details).

(a) Dex file structure: header, string_ids, type_ids, proto_ids, field_ids, method_ids, class_defs, data, link_data. (b) Header structure: magic (ubyte[8]), checksum (uint), signature (ubyte[20]), file_size, header_size, endian_tag, link_size, link_off, map_off, string_ids_size, string_ids_off, type_ids_size, type_ids_off, proto_ids_size, proto_ids_off, field_ids_size, field_ids_off, method_ids_size, method_ids_off, class_defs_size, class_defs_off, data_size, data_off (all uint).

Fig. 8: Structure of feature dictionary: file names, package names, class names, method names, field names, external package names, …, others.
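The header fields in Fig. 8b sit at fixed offsets, so locating the string table that later feature extraction relies on takes only a few lines. The following is a minimal sketch (not the paper's parser; `string_ids_location` is a hypothetical helper and the header bytes are synthetic), using Python's struct module:

```python
# Sketch: read string_ids_size/string_ids_off from a dex header.
# Per the dex format, these two uints live at bytes 56-63 of the
# 0x70-byte header. The header built below is fake, for illustration.
import struct

def string_ids_location(dex_bytes):
    """Return (count, offset) of the string_ids table from the header."""
    assert dex_bytes[:4] == b"dex\n"          # magic is ubyte[8] in total
    size, off = struct.unpack_from("<II", dex_bytes, 56)
    return size, off

# Build a minimal fake 0x70-byte header for illustration.
header = bytearray(0x70)
header[:8] = b"dex\n035\0"
struct.pack_into("<II", header, 56, 3, 0x70)  # 3 strings, table after header
assert string_ids_location(bytes(header)) == (3, 0x70)
```

A real parser would continue by reading the offsets stored in the string_ids table and decoding the referenced strings from the data section.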

III. MACHINE LEARNING FOR OBFUSCATOR IDENTIFICATION

We use supervised learning, in which a training set is used to create several classifiers. Later, a given APK (with unknown obfuscator provenance) is classified using these classifiers. The processing of an unknown APK is shown in Figure 7. In the first stage, the APK is classified in one of six categories based on obfuscator type. For the three obfuscators that allow customized configurations (ProGuard, Allatori, and DashO), a second stage gathers information about the configuration under which the APK was obfuscated. This section describes how a classifier is generated for the first stage. The next section describes additional classifiers used in the second stage.

A. Training Set

For training, we obtained open-source apps from the F-Droid repository [30] and attempted to build them with Android Studio, which uses the Gradle build tool. A total of 282 apps were successfully built with Gradle. These apps were then obfuscated by us using various obfuscators and configurations. The source code was required because some obfuscators work on .class files. For each app, we created 6 baseline APKs: (1) without obfuscation, (2) obfuscated with ProGuard's default configuration, (3) obfuscated with Allatori's default configuration, (4) obfuscated with DashO's default configuration, (5) obfuscated with Legu, and (6) obfuscated with Bangcle. Since ProGuard, Allatori, and DashO allow customization, we also created customized APKs. Using different configurations of ProGuard, 6 customized APKs were obtained per app. Similarly, 3 customized APKs per app were obtained using configurations of Allatori, and another 3 customized APKs were built using DashO. Details of the configurations used for training are presented in the next section.

Each APK is labeled with a label from the set Labels = {ProGuard, Allatori, DashO, Legu, Bangcle, Other}, where the last label was used for the unobfuscated APKs.1 The features of each APK (described below) together with its label are used as input to the training phase.

1 If the classifier produces Other (Figure 7), this means that the unknown APK was either unobfuscated or was obfuscated by an unknown tool.

B. Feature Dictionary

Features are obtained from the bytecode in classes.dex. The structure of this file is shown in Figure 8a. Many parts of the file are irrelevant for our purposes. For example, the structure of the header is shown in Figure 8b; it contains the magic number, checksum, file size, etc. This information varies across apps regardless of obfuscation.

In our approach we focus on the data section. We choose the strings in this section as candidates for features. These strings have different roles. For example, some are names of program entities such as classes and methods. For the code shown in Figure 5b, the set of class names is {"Editor"} and the set of method names is {"a", "remove", "commit"}. These distinctions between strings are informative.

The feature dictionary is a collection of 10 sets of strings, as shown in Figure 8. All strings except file names are from the data section in the dex file. Four sets correspond to names of packages, classes, methods, and fields, respectively. These are program entities that are defined by the APK code. We also consider four corresponding sets, but for external entities defined by code outside of the dex file (e.g., by the Android framework). The reason for this separation is that an obfuscator may treat internal and external names differently. Finally, any other string occurring in the code is considered to be in the set of “other” strings. For the code from Figure 5b, this “other” set is {"this", "arg0", "o\u0018}\u0014h\u0018X\u001cg\t", "u,s0i2d't\tu;@#c1"}.

We developed a dex parser to collect this information from a given APK, based on the mapping relationship between ids and the strings in the dex file. The location and size of the ids can be obtained from the header. The file names are obtained from the structure of the APK archive, which can be examined using the standard aapt tool (“Android Asset Packaging Tool”, used for Zip-compatible archives such as APK files). This information may be useful when the obfuscator adds its own helper library files.
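To make the ten string categories concrete, the sketch below groups extracted strings into the dictionary structure of Figure 8. The input format and function name are hypothetical (the authors' parser is not public); the sample strings follow the Figure 5 example:

```python
# Sketch: group strings into the 10 feature-dictionary categories:
# file names; defined package/class/method/field names; the same four
# categories for external entities; and a catch-all "others" set.

def categorize(defined, external, file_names, all_strings):
    """defined/external: dicts with 'package', 'class', 'method', 'field' sets."""
    categories = {"file names": set(file_names)}
    for kind in ("package", "class", "method", "field"):
        categories[kind + " names"] = set(defined[kind])
        categories["external " + kind + " names"] = set(external[kind])
    named = set().union(*categories.values())
    # Any remaining string falls into the catch-all category.
    categories["others"] = set(all_strings) - named
    return categories

cats = categorize(
    defined={"package": {"com/example"}, "class": {"Editor"},
             "method": {"a", "remove", "commit"}, "field": {"l"}},
    external={"package": {"android/content"}, "class": {"ComponentName"},
              "method": set(), "field": set()},
    file_names={"classes.dex"},
    all_strings={"Editor", "a", "remove", "commit", "this", "arg0"},
)
assert len(cats) == 10
assert cats["others"] == {"this", "arg0"}
```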
For each obfuscated APK A_i from the training set, 10 sets S_{i,j} of strings are extracted (1 ≤ j ≤ 10) based on the categories described above (illustrated in Figure 8). The following processing of these sets is performed to obtain the final feature dictionary. First, if several APKs A_{i1}, A_{i2}, ..., A_{ik} are obtained from the same original app a using the same obfuscator l, their corresponding string sets are merged using set intersection. The result is a collection of 10 sets S_j^{a,l} for 1 ≤ j ≤ 10, where S_j^{a,l} = S_{i1,j} ∩ ... ∩ S_{ik,j}. The intent is to use strings that are configuration-independent. After this step, for each pair (a, l), where a is an app and l is an obfuscator, there is a collection of 10 string sets S_j^{a,l}.

The next step is to trim these string sets as follows. For each obfuscator l and each string category j, we consider the union of all S_j^{a,l}. The resulting set F_j^l can be thought of as a feature dictionary specific to this obfuscator l and this string category j (e.g., class names). Each such F_j^l is trimmed by removing rarely-occurring strings. Specifically, if the percentage of sets S_j^{a,l} containing a string s is less than a particular threshold, s is removed from F_j^l. Our experiments indicated that a threshold of 15% produces a substantial reduction in the size of the feature vector without significant reduction in prediction accuracy.

To further reduce dimensionality, features that are covered by other ones are ignored. Specifically, consider two strings s, s′ ∈ F_j^l. If the set of apps {a | s ∈ S_j^{a,l}} is a proper subset of {a | s′ ∈ S_j^{a,l}}, then s is less informative than s′ and is removed from F_j^l.

After these trimming techniques are applied, the feature dictionaries for all obfuscators are reduced from thousands of elements to hundreds of elements. Trimming also reduces the training time for the classifier by about a factor of two. The final feature dictionary is a collection of 10 string sets F̂_j for 1 ≤ j ≤ 10. Each set is computed as

F̂_j = ⋃_l F_j^l − ⋂_l F_j^l

The second term removes strings that always appear regardless of the obfuscator l, since such strings are not useful for distinguishing among obfuscators.

The feature vector used for training and subsequent classification has 10 sub-vectors, each corresponding to one set F̂_j. If a string s ∈ F̂_j occurs in category j inside an APK, the corresponding element of the j-th sub-vector of the feature vector is 1; otherwise the element is 0.

C. Training a Classifier

After constructing the feature vector for each APK in the training set, we build a corresponding classifier. In our scenario, we have a multi-class classification problem—that is, APKs are classified into one of several (more than two) classes. The classes are defined by the set Labels described earlier. We do not consider the case when more than one obfuscator is used for the same APK, because in practice developers typically adopt one obfuscator tool. Therefore, a one-vs-rest classifier is appropriate for our purposes. This strategy trains a single classifier per class. Each class is fitted against all other classes. The concrete type of classifier we use is a linear Support Vector Machine (SVM). An SVM finds a weight vector w that defines a decision boundary in the feature space which best separates two different classes. The distance from a particular example to that boundary is the margin and is defined as w^T x, where x is the feature vector. In such a binary classifier, each instance is assigned to class +1 or −1 depending on the sign of the margin.

For one-vs-rest classification, a binary classifier can be extended to n classes. This standard approach determines n weight vectors {w_1, ..., w_n} by partitioning the data into two groups, one for the current class and the other for everything else. Given these vectors and a new instance with feature vector x, we choose the class that maximizes the margin:

argmax_{1 ≤ k ≤ n} w_k^T x

In addition to linear SVM, we also experimented with other kinds of classifiers; details are presented in the evaluation section. Our experience is that linear SVM provides a good balance between accuracy and running time.
The last two transformations will change the structure vector is 1; otherwise the element is 0. of the package hierarchy and we treat them as one category in Default Configuration

Method-field Classifier Proguard Classifier Opt Configuration rename Classifier Rename on/off Other Classifier Class rename Allatori/DashO Classifier Flow obfuscation

Package Classifier modification String Encryption

(a) ProGuard config identification workflow. (b) Allatori and DashO config identification workflow. Fig. 9: Workflow for configuration characterization. the classification process. All modifications depend on user- Algorithm 1: FeatureDictionaryForAllatoriOrDashO defined values. For example, the tool user can provide a list of Input: APKs = {apk i} : APKs obfuscated by Allatori/DashO with arbitrary strings as candidates for class names, or can specify and without flow obfuscation a custom package name. Input: N 1 featureInstSequences ← ∅ During training we can only observe specific instances of 2 foreach apk ∈ APKs do these user-defined names. The resulting classifiers can subse- 3 foreach class ∈ apk do quently determine, for a given (never-seen-before) ProgGuard- 4 foreach method ∈ class do 5 featureInstSequences ← featureInstSequences ∪ obfuscated APK, which settings have been customized. As all instruction sequences of length N shown in Figure 9a, the approach first identifies whether the configuration is one of the two default ones or is user- customized. In the third case, we determine which specific settings have been changed. The training process for both stages is similar to what was the one used in ProGuard as it does not require used-defined described in the previous section, but with different feature names. If turned on, the identifiers will be obfuscated using vectors. For each app a, we constructed 6 APKs using Prog- logic internal to the tool and the package hierarchy will also be Guard: default, default-opt, customized field/method names, changed. The process of identifying this property is the same customized class names, package flattening, and repackaging. as with ProGuard. String encryption is a setting to determine The labels used to label these APKs are from set {default, whether the constant strings in the code are encrypted. If default-opt, field&method, class, package}. 
The last label is turned on, the tool will identify all string data and encode used for both package-related transformations. In our training it as illustrated in the earlier example from Figure 5b. The set, APKs with the last 3 labels are constructed with arbitrary tool also adds code to decode the strings at run time. Our new names defined by us. The resulting classifiers are used learning cannot handle this case well, because the approach to process future APKs in which the user-specified names are does not model the “naturalness” of constant strings. One different from the ones used during training. simple approach to detect this setting could be to examine We consider a feature dictionary which is a combination the code for calls to tool-specific string decoding methods. of the features of APKs with labels default and default-opt. Another customization category is related to flow obfusca- The remaining APKs are not used because they are depen- tion. If set, the tool will change control-flow features of loops dent on customizable user-defined names. For defining the and branches. Here we need a different feature dictionary, first classifier in Figure 9a, the APK labels are mapped to constructed as described in Algorithm 1. The process is similar the simplified label set {default, default-opt, other}. For the to considering N-grams in natural language processing. remaining three classifiers shown in Figure 9a, the APK labels For each obfuscated APK we compute the set of consecutive are mapped to the two labels true and false, based on whether instruction sequences with length N. Since the obfuscation the particular setting is the default one or not. As before, linear may affect the names of instruction operands, only the operator SVMs are used for the classifiers. is considered. The sequences are obtained using the Andro- guard tool. Consider an app a, its baseline obfuscated APK, B. 
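The collection step of Algorithm 1 amounts to gathering opcode N-grams per method. A sketch in plain Python (the opcode lists below are hypothetical; the paper obtains instruction sequences via Androguard):

```python
# Sketch of Algorithm 1's inner loop: collect all consecutive opcode
# sequences of length N from each method, ignoring operands since
# obfuscation may rename them. Input format is hypothetical.

def ngrams(opcodes, n):
    return {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}

def feature_sequences(apk_methods, n):
    """apk_methods: list of opcode lists, one per method."""
    sequences = set()
    for opcodes in apk_methods:
        sequences |= ngrams(opcodes, n)
    return sequences

methods = [["const/4", "if-nez", "return-object"],
           ["invoke-static", "move-result-object", "if-nez", "return-object"]]
seqs = feature_sequences(methods, 2)
assert ("if-nez", "return-object") in seqs
assert len(seqs) == 4
```

The per-APK sequence sets would then be combined (union minus intersection, followed by the rarity filter) exactly as the surrounding text describes.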
and its 3 APKs obtained from customized configurations. Each APK defines a set of sequences. We consider the union of these 4 sets and remove from it the intersection of the 4 sets. The result is a set S_a of configuration-dependent sequences. Next, rarely-occurring sequences are filtered. If, over the entire set of apps a, the percentage of sets S_a containing a sequence s is less than the threshold from Section III-A, s is removed from all S_a. The union of the final S_a is the feature dictionary.

V. EVALUATION

We evaluated our approach on a number of Android apps. The evaluation considers the following questions: (1) Do the selected features represent the characteristics of these obfuscators so that the classifiers can predict the correct label with high accuracy? (2) Are linear SVMs suitable in terms of accuracy and time cost, compared to other possible classifiers?

A. Data Set

We obtained 282 apps from F-Droid [30] and created several APKs for each app, as described in Section III-A. In total, we successfully generated about 2600 APK files; the total is lower than the theoretical maximum because in some cases the obfuscator(s) failed.

B. Methodology

The data set has to be split into a training set and a testing set. The training set is used to build the parameters of the model, while the testing set is chosen for evaluating the performance. Using standard 10-fold cross validation, we randomly divide the apps into 10 equally-sized subsets Apps_i for 1 ≤ i ≤ 10. The experiment is executed 10 times, once for each i. In each run, set Apps_i is used as the testing set and the remaining apps are used as the training set. The predicted labels are then compared with the actual ones. The machine learning framework used in our evaluation is scikit-learn [31]. This toolkit provides various classifiers, including linear SVMs.

TABLE I: Accuracy and F1 score for obfuscator identification.

       Accuracy  F1 micro  F1 macro
mean   0.975     0.975     0.968
SD     0.007     0.007     0.009

TABLE II: Precision and recall for obfuscator identification.

Precision  ProGuard  Allatori  DashO  Legu   Bangcle  Other
mean       0.977     0.979     0.993  1.000  1.000    1.000
SD         0.005     0.003     0.008  0.000  0.000    0.000

Recall     ProGuard  Allatori  DashO  Legu   Bangcle  Other
mean       0.976     0.977     0.996  1.000  1.000    1.000
SD         0.005     0.004     0.007  0.000  0.000    0.000

Table I shows the accuracy and F1 score measurements obfuscator, for the five types covered by our implementation. obtained using 10-fold cross validation. The table shows the D. ProGuard Configuration Identification mean values and standard deviations of the 10 runs. The accuracy, for one run, is the ratio of number of testing APKs If the obfuscator is identified as ProGuard, the next step is with correctly-predicted labels to the total number of testing to characterize its configuration as illustrated in Figure 9a. In the first stage, we determine how the configuration is APKs. For multi-class classification, two types of F1 scores are mapped to a label from {default, default-opt, other}. For typically computed: micro and macro [32]. The micro F1 score considers the total number of true positives, false negatives, “other”, a second stage determines a true/false label in each of and false positives. The macro metric computes a value for the following 3 categories: field&method, class, and package. each label and uses the unweighted mean. The corresponding Here true means that the category was customized by the equations are shown below. ProGuard user. The accuracy and F1 scores for both stages are shown in Table III. Since there are three labels in the TP TP precision = recall = first stage, we consider both micro and macro F1 score. The TP + FP TP + FN labels in the second stage are binary and there is only one F1 precision ∗ recall score. The results shown in the table indicate that our approach F1 = 2 ∗ precision + recall characterizes the ProGuard configuration with high accuracy. PN F The precision and recall of each stage are shown in Ta- F = i=1 1i 1macro N ble IV. The recall of default-opt is relatively low. By com- PN paring the ProGuard XML configuration files of default and i=1 TPi precisionmicro = default-opt, we observed that the difference between them is PN TP + PN FP i=1 i i=1 i not very significant. 
Configuration default-opt will perform PN TPi further performance optimizations and some related small code recall = i=1 micro PN PN changes. In some scenarios, the optimization may not be i=1 TPi + i=1 FNi available or it cannot cause an apparent difference. This may precisionmicro ∗ recallmicro F1micro = 2 ∗ be the reason why for some apps this configuration is not precisionmicro + recallmicro identified. In addition, some apps are very simple and have few classes. For correctness, ProGuard will preserve some classes. where TP is the number of true positives, FP is the number For example, the app entry class defined in the manifest. of false positives, FN is the number of false negatives, and N file will not be obfuscated. As another example, subclasses of Precision ProGuard stage I ProGuard stage II default default-opt other field&method class package mean 0.984 0.939 0.936 0.995 0.998 0.905 SD 0.048 0.069 0.028 0.014 0.015 0.032 Recall ProGuard stage I ProGuard stage II default default-opt other field&method class package mean 0.940 0.783 0.979 0.942 0.935 0.923 SD 0.073 0.088 0.016 0.056 0.064 0.040 TABLE IV: Precision and recall for ProGuard configuration.
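As an aside, the micro/macro metric definitions from Section V-C can be made concrete with a short sketch. The labels and predictions below are a tiny synthetic example (not taken from our data set); the function mirrors the equations term by term:

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Micro and macro F1 for a multi-class labeling, per Section V-C."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but the true label was t
            fn[t] += 1   # true label t was missed

    def f1(prec, rec):
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

    # Macro: per-label F1, unweighted mean over the N labels.
    per_label = []
    for l in labels:
        prec = tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0
        rec = tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0
        per_label.append(f1(prec, rec))
    macro = sum(per_label) / len(labels)

    # Micro: totals of TP, FP, FN over all labels.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(TP / (TP + FP), TP / (TP + FN))
    return micro, macro

# Hypothetical labels for five "APKs": one ProGuard app is misclassified.
y_true = ["ProGuard", "ProGuard", "Allatori", "DashO", "Other"]
y_pred = ["ProGuard", "Allatori", "Allatori", "DashO", "Other"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

Note that for single-label multi-class classification the micro F1 equals the accuracy, since each misclassification contributes exactly one false positive and one false negative; the macro score instead weights every label equally, regardless of how often it occurs.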

As another example, subclasses of the framework class android.view.View require some class members to be unchanged. If the app is simple, the differences between default and default-opt may be insignificant and unrecognizable, which may cause some misclassification.

ProGuard stage I
        Acc.   F1 micro  F1 macro
mean    0.938  0.938     0.917
SD      0.022  0.022     0.031
ProGuard stage II
        field&method    class           package
        Acc.   F1       Acc.   F1      Acc.   F1
mean    0.988  0.963    0.989  0.967   0.940  0.909
SD      0.012  0.039    0.006  0.020   0.030  0.045
TABLE III: Accuracy and F1 for ProGuard configuration.

ProGuard allows the user to specify custom strings for class names, field names, method names, or package names. In the training process, we use a specific set of random strings (defined by us) for these values. To ensure that the classification can be successfully applied to APKs in which different user-defined obfuscated names were used, we performed the following experiment. We created another set of random strings, different from the string set used for training. During each run of the cross validation, the model is also tested against APKs obfuscated with this second set of strings (i.e., with strings that have not been observed during training). The mean and standard deviation of the accuracy are shown in Table V. The accuracy is high for all three categories of names, indicating that our model is able to detect the change of these configuration settings regardless of the specific string values being used.

        field&method  class  package
mean    0.957         0.960  0.945
SD      0.006         0.002  0.005
TABLE V: Accuracy with never-seen custom names.

E. Allatori and DashO Configurations

The configuration characterization for Allatori and DashO has only one stage. As mentioned earlier, we do not consider string encryption, but focus on renaming and control-flow modifications. For both of these categories, a true/false label is determined to indicate whether the configuration setting has been changed by the user. To compute the feature vector for the control-flow configuration, we define a feature dictionary containing all instruction subsequences of length 2, 3, and 4.

The accuracy and F1 scores are shown in Table VI. The results indicate that with high confidence we can identify whether these two Allatori and DashO configuration options have been modified.

Allatori   Rename             Flow
           Accuracy  F1       Accuracy  F1
mean       0.994     0.987    0.933     0.931
SD         0.007     0.014    0.003     0.033
DashO      Rename             Flow
           Accuracy  F1       Accuracy  F1
mean       0.981     0.981    0.979     0.984
SD         0.014     0.013    0.015     0.015
TABLE VI: Accuracy and F1 for Allatori/DashO config.

The precision and recall are shown in Table VII. The high values indicate that the model typically correctly identifies the Allatori and DashO configuration settings. Misclassifications have the same root causes as with ProGuard. For example, for correctness, the tool has to preserve some elements of the original code. As another example, the tool will change the order of instructions inside a loop; if there are no loops in the code, no control-flow changes will be observed.

Allatori   Precision          Recall
           Rename    Flow     Rename    Flow
mean       0.993     0.936    0.989     0.929
SD         0.015     0.053    0.017     0.035
DashO      Precision          Recall
           Rename    Flow     Rename    Flow
mean       0.977     0.974    0.986     0.976
SD         0.023     0.034    0.021     0.024
TABLE VII: Precision and recall for Allatori/DashO config.

F. Comparison of Classifiers

Linear SVMs are only one of many available approaches that can be used for classification. To compare different classifiers, we considered 8 popular options: k-nearest neighbors, decision trees, linear SVMs, random forests, multi-layer perceptron, AdaBoost, Gaussian naive Bayes, and logistic regression. All are available as part of the scikit-learn machine learning toolkit used in our experiments. The experimental machine uses Ubuntu 16.04 with a Core i7-4770 CPU at 3.40GHz and 16GB RAM.

Fig. 10: Performance of different classifiers. (a) Accuracy of classifiers. (b) Running time of classifiers.
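The comparison protocol for these eight classifiers can be sketched with scikit-learn as follows. The feature matrix below is a random stand-in for the real feature vectors of Section III, and default hyperparameters are used throughout (the exact settings are not specified in the text):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: 150 "APKs", 40 dictionary features, 3 labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(150, 40)).astype(float)
y = rng.randint(0, 3, size=150)

# The eight options compared in this section, all from scikit-learn.
classifiers = {
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "linear SVM": LinearSVC(),
    "random forest": RandomForestClassifier(random_state=0),
    "multi-layer perceptron": MLPClassifier(max_iter=400, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gaussian naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(),
}

# Mean accuracy over standard 10-fold cross validation, per classifier.
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
```

With the random stand-in data the absolute scores are meaningless; the point of the sketch is the protocol, in which the same feature vectors and the same 10 folds are reused for every classifier so that the accuracy and time comparisons are fair.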

The accuracy for different classifiers is shown in Figure 10a. The y-axis is the mean accuracy for each stage, including both obfuscator and configuration identification. The x-axis is the type of classifier. The results indicate that the proposed feature vector works well for most classifiers. Among them, linear SVMs and logistic regression have the best performance. Figure 10b illustrates the total running time of 10-fold cross validation for both obfuscator and configuration identification. This includes the time to create the classifier (in training) and to apply it (in testing), given the feature vectors. The y-axis is the time in seconds. Based on these measurements, linear SVMs have the best accuracy while still exhibiting good efficiency. The time to extract the feature dictionary and feature vectors from the APKs (total for all 10 runs in 10-fold cross validation) is about 12 hours.

G. Case Study: Google Play Apps

We performed a case study of applying our classifiers to a set of popular apps from the Google Play store, which is the largest and most influential app store. We do not have ground truth to measure the quality of our classification, since we do not know the obfuscator provenance of apps from the store. Nevertheless, under the assumption that the classification has high accuracy, this study presents insights into the use of obfuscation techniques in widely-used apps. The apps were crawled from the Play store from May 2016 through August 2016. We collected the top 100 free apps for each store category (e.g., education, finance, etc.) and obtained a total of 2389 APKs; some apps appear in multiple categories.

The analysis of these APKs is summarized in Table VIII. The number in each column is the number of apps classified as being obfuscated by the corresponding tool.

        ProGuard  Allatori  DashO  Legu  Bangcle  Other
Num.    1170      145       0      15    0        1059
TABLE VIII: Obfuscators for Google Play apps.

More than 55% of the APKs (1330) are identified as being obfuscated by one of the tools we consider. The remaining 1059 APKs may be obfuscated by other obfuscators or not obfuscated at all. The number of obfuscated apps is relatively high. One reason may be that most of the top popular apps are developed by companies instead of individuals, and companies pay more attention to security and intellectual property. As for the relative usage of obfuscators, ProGuard is the most widely used tool (about 88% of the obfuscated APKs). This is not surprising, since ProGuard is already integrated with Android Studio and requires little extra effort to configure and use. Moreover, Google provides an official tutorial on ProGuard and recommends that developers use it before releasing their apps. The two Chinese tools Legu and Bangcle are rarely used in the Google Play store. The reason may be that most of their users are from China, since both tools only have interfaces written in Chinese. The audiences of these Chinese developers are mainly from China; however, app users in China cannot access Google Play.

For apps using ProGuard and Allatori, we further analyzed their configurations. The results for ProGuard are shown in Table IX. About 70% of the apps use the default or default-opt configuration. The default configuration is the most frequently used. This may be because the default-opt setting is well known to be unsafe due to code optimizations that are not correct in all situations. Table IX also shows the number of apps that customize the corresponding settings. About 65% of the apps that do not use default/default-opt employ package modifications. This setting is the most convenient one to use, because it does not require a full list of candidate values, but only a customized package name. The other two settings are not frequently customized. From the 145 apps obfuscated with Allatori, 26 change the renaming and 18 use the flow option.

ProGuard stage I
        default  default-opt  other
Num.    555      236          379
ProGuard stage II
        package  class  field&method
Num.    246      41     54
TABLE IX: Apps obfuscated with ProGuard.

H. Limitations

The generality of the experimental observations should be interpreted in the context of several limitations of the proposed techniques. First, we only consider five obfuscators. There are other tools that may use different obfuscation techniques. Second, our training data may have limited generality that does not allow some characteristics of the obfuscators to emerge. Although the open-source F-Droid apps used for training cover a wide range of target domains and developers, they may differ from closed-source apps in some characteristics that affect the accuracy of the approach. Finally, because we focus on features from the string tables of the dex files, other relevant program properties are not represented. For example, obfuscation features such as function merging and string encryption cannot be captured by the proposed approach. Despite these limitations, we consider these experimental results to be promising initial evidence that automated and precise obfuscator provenance analysis for Android is feasible and efficient.

VI. RELATED WORK

As far as we know, there is no prior work on techniques to identify the provenance of obfuscated Android apps. Studies have been performed on the general topic of software obfuscation, and some work focuses on obfuscation for Android.

General software obfuscation. Researchers have explored various questions related to obfuscation. Schrittwieser et al. [33] measured the performance of different program analyses against obfuscation. Barak et al. [17] theoretically proved that a perfect "virtual black box" obfuscation is impossible. Collberg et al. [16] summarized the existing popular techniques used in obfuscation. The follow-up work of Collberg et al. [18] detailed how obfuscation is used to protect software programs, and discussed the differences among obfuscation, watermarking, and tamper-proofing. Some work has been done on evaluating the quality of obfuscation. Ceccato et al. [34] proposed an approach to assess the difficulty attackers have in understanding and modifying obfuscated code, through controlled experiments involving human subjects. Anckaert et al. [35] developed a framework based on software complexity metrics measuring four program properties: code, control flow, data, and data flow. Some studies focus on de-obfuscation. Coogan et al. [36] considered instructions that will impact the interaction between the program and the system and built a program analysis for these important instructions. Madou et al. [37] proposed a graphical and interactive framework for code obfuscation and de-obfuscation. All studies outlined above focus on general software obfuscation rather than programs for the Android platform. Although such general techniques may be applicable to Android apps, Android software has its own characteristics that make it different from "plain" Java and impose some constraints on the use of obfuscation.

Android obfuscation. Several studies have been performed on how to obfuscate Android apps. Kovacheva [38] studied the Android platform and proposed an obfuscator implementation for Android programs by adding native code wrappers, packing numeric variables, and adding bad code. For the evaluation of obfuscation, Freiling et al. [39] evaluated 7 methods on 240 APKs and showed that these methods are idempotent or monotonic. For Android program de-obfuscation, Bichsel et al. [40] built a probabilistic learning model to perform the de-obfuscation and predict the obfuscated items. Even though these studies focus on Android, none of them attempt to explore the differences between Android obfuscators for the purposes of provenance analysis: Freiling et al. consider the basic techniques of one particular tool, while Bichsel et al. only explore ProGuard.

Impact of Android obfuscation. As discussed earlier, obfuscation of Android apps will affect many legitimate program analyses. Work on clone/repackage detection [41], [42], [43], [5], [44], [45], [46], [47], [48], [49], [50], [51], [52], [6], [53] finds that obfuscation impairs detection results. Studies of malware detection [54], [8], [55], [7], [56], [57], [58] also show that obfuscation is an obstacle to malware analysis. Researchers working on detection of third-party libraries [9], [10], [11] discuss the negative impact of obfuscation on their techniques. Although some of these tools claim to still work well in the presence of obfuscation, none could completely eliminate the obfuscation effects in their experimental evaluations. This highlights the importance of understanding obfuscation, and of designing obfuscator-tailored analysis techniques.

Provenance analysis. There is some work on identifying the toolchain used in compiling a given program. Rosenblum et al. [12] propose a machine learning approach to discover the type of compiler, based on properties of the resulting binary code. We also adopt a machine learning method, but for the purpose of obfuscator identification for Android apps. Due to these different targets, the details of the two techniques are significantly different.

VII. CONCLUSIONS

We propose a simple, efficient, and effective approach for provenance analysis of obfuscated Android applications. Using features extracted from APKs, we construct several classifiers that determine which obfuscator was used and how it was configured. Although the approach has some limitations, it exhibits high accuracy on apps from F-Droid, and provides insights about the use of obfuscation in popular Google Play apps. The information extracted with the proposed techniques can potentially be used to improve and refine a variety of existing analyses and tools for Android.

ACKNOWLEDGMENTS

We thank the MobileSoft reviewers for their valuable feedback.
This material is based upon work supported by the U.S. National Science Foundation under awards 1319695 and 1526459, and by a Google Faculty Research Award.

REFERENCES

[1] Gartner, Inc., "Worldwide traditional PC, tablet, ultramobile and mobile phone shipments," Mar. 2014, www.gartner.com/newsroom/id/2692318.
[2] ProGuard, developer.android.com/studio/build/shrink-code.html.
[3] N. Viennot, E. Garcia, and J. Nieh, "A measurement study of Google Play," in SIGMETRICS, 2014, pp. 221–233.
[4] S. Luo and P. Yan, "Fake apps feigning legitimacy," Trend Micro, Tech. Rep., 2014.
[5] M. L. Vásquez, A. Holtzhauer, C. Bernal-Cárdenas, and D. Poshyvanyk, "Revisiting Android reuse studies in the context of code obfuscation and library usages," in MSR, 2014, pp. 242–251.
[6] H. Huang, S. Zhu, P. Liu, and D. Wu, "A framework for evaluating mobile app repackaging detection algorithms," in TRUST, 2013, pp. 169–186.
[7] P. Faruki, A. Bharmal, V. Laxmi, M. S. Gaur, M. Conti, and M. Rajarajan, "Evaluation of Android anti malware techniques against Dalvik bytecode obfuscation," in TrustCom, 2015, pp. 414–421.
[8] Y. Zhou and X. Jiang, "Dissecting Android malware: characterization and evolution," in IEEE S&P, 2012, pp. 95–109.
[9] Z. Ma, H. Wang, Y. Guo, and X. Chen, "LibRadar: Detecting third-party libraries in Android apps," in ICSE, 2016, pp. 641–644.
[10] W. Hu, D. Octeau, and P. Liu, "Duet: Library integrity verification for Android applications," in WiSec, 2014, pp. 141–152.
[11] M. Backes, S. Bugiel, and E. Derr, "Reliable third-party library detection in Android and its security applications," in CCS, 2016, pp. 356–367.
[12] N. Rosenblum, B. P. Miller, and X. Zhu, "Recovering the toolchain provenance of binary code," in ISSTA, 2011, pp. 100–110.
[13] S. Yang, H. Zhang, H. Wu, Y. Wang, D. Yan, and A. Rountev, "Static window transition graphs for Android," in ASE, 2015, pp. 658–668.
[14] H. Zhang, H. Wu, and A. Rountev, "Automated test generation for detection of leaks in Android applications," in AST, 2016, pp. 64–70.
[15] Y. Wang and A. Rountev, "Profiling the responsiveness of Android applications via automated resource amplification," in MobileSoft, 2016, pp. 48–58.
[16] C. Collberg, C. Thomborson, and D. Low, "A taxonomy of obfuscating transformations," U. Auckland, Tech. Rep. TR-148, 1997.
[17] B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. Vadhan, and K. Yang, "On the (im)possibility of obfuscating programs," J. ACM, vol. 59, p. 4, 2012.
[18] C. S. Collberg and C. Thomborson, "Watermarking, tamper-proofing, and obfuscation: Tools for software protection," IEEE Trans. Software Engineering, vol. 28, pp. 735–746, 2002.
[19] Soot Analysis Framework, www.sable.mcgill.ca/soot.
[20] Androguard, github.com/androguard/androguard.
[21] Baksmali, github.com/JesusFreke/smali.
[22] Apktool, ibotpeaches.github.io/Apktool/.
[23] Dex2jar, github.com/pxb1988/dex2jar.
[24] Dexdump source code, android.googlesource.com/platform/dalvik/+/eclair-release/dexdump/DexDump.c.
[25] R. Harrison, "Investigating the effectiveness of obfuscation against Android application reverse engineering," Royal Holloway University of London, Tech. Rep. RHUL-MA-2015-7, 2015.
[26] Allatori, www.allatori.com/.
[27] DashO, www.preemptive.com/company.
[28] Legu, legu.qcloud.com/.
[29] Bangcle, www.bangcle.com/.
[30] F-Droid, f-droid.org/.
[31] Scikit-learn framework, scikit-learn.org/stable/.
[32] A. Özgür, L. Özgür, and T. Güngör, "Text categorization with class-based and corpus-based keyword selection," in ISCIS, 2005, pp. 606–615.
[33] S. Schrittwieser, S. Katzenbeisser, J. Kinder, G. Merzdovnik, and E. Weippl, "Protecting software through obfuscation: can it keep pace with progress in code analysis?" ACM Computing Surveys, vol. 49, 2016.
[34] M. Ceccato, M. D. Penta, J. Nagra, P. Falcarin, F. Ricca, M. Torchiano, and P. Tonella, "Towards experimental evaluation of code obfuscation techniques," in QoP, 2008, pp. 39–46.
[35] B. Anckaert, M. Madou, B. D. Sutter, B. D. Bus, K. D. Bosschere, and B. Preneel, "Program obfuscation: A quantitative approach," in QoP, 2007, pp. 15–20.
[36] K. Coogan, G. Lu, and S. Debray, "Deobfuscation of virtualization-obfuscated software: A semantics-based approach," in CCS, 2011, pp. 275–284.
[37] M. Madou, L. V. Put, and K. D. Bosschere, "LOCO: An interactive code (de)obfuscation tool," in PEPM, 2006, pp. 140–144.
[38] A. Kovacheva, "Efficient code obfuscation for Android," in IAIT, 2013, pp. 104–119.
[39] F. C. Freiling, M. Protsenko, and Y. Zhuang, "An empirical evaluation of software obfuscation techniques applied to Android APKs," in ICST, 2014, pp. 315–328.
[40] B. Bichsel, V. Raychev, P. Tsankov, and M. T. Vechev, "Statistical deobfuscation of Android applications," in CCS, 2016, pp. 343–355.
[41] F. Zhang, H. Huang, S. Zhu, D. Wu, and P. Liu, "ViewDroid: Towards obfuscation-resilient mobile application repackaging detection," in WiSec, 2014, pp. 25–36.
[42] C. Soh, H. B. K. Tan, Y. L. Arnatovich, and L. Wang, "Detecting clones in Android applications through analyzing user interfaces," in ICPC, 2015, pp. 163–173.
[43] Y. Shao, X. Luo, C. Qian, P. Zhu, and L. Zhang, "Towards a scalable resource-driven approach for detecting repackaged Android applications," in ACSAC, 2014, pp. 56–65.
[44] M. Sun, M. Li, and J. C. Lui, "DroidEagle: Seamless detection of visually similar Android apps," in WiSec, 2015, pp. 9:1–9:12.
[45] J. Crussell, C. Gibler, and H. Chen, "AnDarwin: Scalable detection of Android application clones based on semantics," IEEE Trans. Mobile Computing, vol. 14, pp. 2007–2019, 2015.
[46] H. Wang, Y. Guo, Z. Ma, and X. Chen, "WuKong: A scalable and accurate two-phase approach to Android app clone detection," in ISSTA, 2015, pp. 71–82.
[47] K. Chen, P. Liu, and Y. Zhang, "Achieving accuracy and scalability simultaneously in detecting application clones on Android markets," in ICSE, 2014, pp. 175–186.
[48] W. Zhou, Y. Zhou, X. Jiang, and P. Ning, "Detecting repackaged smartphone applications in third-party Android marketplaces," in CODASPY, 2012, pp. 317–326.
[49] J. Crussell, C. Gibler, and H. Chen, "Attack of the clones: detecting cloned applications on Android markets," in ESORICS, 2012, pp. 37–54.
[50] W. Zhou, Y. Zhou, M. C. Grace, X. Jiang, and S. Zou, "Fast, scalable detection of "Piggybacked" mobile applications," in CODASPY, 2013, pp. 185–196.
[51] S. Hanna, L. Huang, E. X. Wu, S. Li, C. Chen, and D. Song, "Juxtapp: A scalable system for detecting code reuse among Android applications," in DIMVA, 2012, pp. 62–81.
[52] Q. Guan, H. Huang, W. Luo, and S. Zhu, "Semantics-based repackaging detection for mobile apps," in ESSoS, 2016, pp. 89–105.
[53] J. Ming, F. Zhang, D. Wu, P. Liu, and S. Zhu, "Deviation-based obfuscation-resilient program equivalence checking with application to software plagiarism detection," IEEE Trans. Reliability, vol. 65, no. 4, pp. 1647–1664, 2016.
[54] T. Shen, Y. Zhongyang, Z. Xin, B. Mao, and H. Huang, "Detect Android malware variants using component based topology graph," in TrustCom, 2014, pp. 406–413.
[55] M. Graa, N. Cuppens-Boulahia, F. Cuppens, and A. R. Cavalli, "Protection against code obfuscation attacks based on control dependencies in Android systems," in SERE, 2014, pp. 149–157.
[56] D. Maiorca, D. Ariu, I. Corona, M. Aresu, and G. Giacinto, "Stealth attacks: An extended insight into the obfuscation effects on Android malware," Computers & Security, vol. 51, pp. 16–31, 2015.
[57] V. Rastogi, Y. Chen, and X. Jiang, "DroidChameleon: Evaluating Android anti-malware against transformation attacks," in CCS, 2013, pp. 329–334.
[58] V. Rastogi, Y. Chen, and X. Jiang, "Catch me if you can: evaluating Android anti-malware against transformation attacks," TIFS, vol. 9, no. 1, pp. 99–108, 2014.