Deducing Similarities in Java Sources from Bytecodes

Total Page:16

File Type:pdf, Size:1020Kb

Deducing Similarities in Java Sources from Bytecodes The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference (NO 98) New Orleans, Louisiana, June 1998 Deducing Similarities in Java Sources from Bytecodes Brenda S. Baker Bell Laboratories, Lucent Technologies Udi Manber University of Arizona For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/ Deducing Similarities in Java Sources from Byteco des Brenda S. Baker Bel l Laboratories Lucent Technologies 700 Mountain Avenue Murray Hil l, NJ 07974 bsb@b ell-labs.com, http://cm.bell-labs.com/~bsb/ Udi Manber Department of Computer Science University of Arizona Tucson, AZ 85721 [email protected], http://glimpse.cs.arizona.edu/udi.html sible, given a large set of byteco de les, to iden- Abstract tify most of those that originate from similar source co de. In addition, an index can b e built from any number of byteco de les, such that given a new Several techniques for detecting similarities of Java byteco de one can very quickly nd all the old ones programs from byteco de les, without access to the that are similar to it. Our to ols also can detail in source, are intro duced in this pap er. These tech- a convenientway just which sections of the les are niques can be used to compare two les, to nd similar. similarities among thousands of les, or to compare one new le to an index of many old ones. Exp eri- Our techniques make it p ossible to identify byte- mental results indicate that these techniques can b e co de les that came from the same source without very e ective. Even changes of 30 to the source knowing the actual source even if signi cant parts le will usually result in byteco de that can be as- of the source les are di erent. The les do not even so ciated with the original. Several applications are havetobe of similar sizes. For example, the orig- discussed. inal source co de les may be di erent versions of the same program or di erent programs containing similar sections. 1 Intro duction Finding similarities from compiled co de is much more dicult than nding similarities in source co de or text, b ecause a small change in the source co de Javabyteco de, particularly in the form of applets, can pro duce completely di erent binaries when they is geared to b ecome the common way to execute are viewed just as sequences of bytes. programs through the web, and not only by tradi- tional computers. Network computers, even appli- To accomplish these goals, we adapt three to ols de- ances, are exp ected to be controlled byJavabyte- signed to nd similarity in source co de and text to co de, which will arrive through the net transpar- 1 work with byteco de les. Si [24] is designed ently and for the most part, automatically. There to analyze large number of text les to nd pairs is obviously a great need to be able to control all that contain a signi cantnumb er of common blo cks these programs. There will b e problems of security, \ ngerprints", where a blo ck corresp onds to a management of up dates, p ortability, handling pref- non-trivial segment of co de, usually a few lines. erences, deletions, and so on. 1 The original name was SIF, but at some unknown p oint This pap er intro duces several techniques to help one during the development an extra f was added to make it more asp ect of these problems. We show that it is p os- natural. vidually at nding similar les while keeping false Dup [5], searches sets of source les to lo ok for p ositives low. We then use dup and di on the sim- suciently long sections that match except for sys- ilar pairs to examine the similarities in more detail. tematic transformations of names, suchasvariable names, and can deal eciently with a few mil- Our programs are fast and can analyze thousands lion lines of source co de. The UNIX utility di of byteco de les in minutes. A new le can b e com- describ ed in [21] uses dynamic programming to pared to an index of thousands of existing les in identify line-by-line changes insertions or deletions a second or two. The number of false p ositives is from one le to another, and is useful for detailed kept to avery small minimum. Our programs are comparison of two les and for automated distribu- written in C and run under UNIX. Detailed exp eri- tion of patches. mental results are given later in the pap er. None of the programs mentioned ab ove was de- We foresee several applications for our to ol. signed to deal with binaries. The only work that we knowof that mayhave applied one of these to program management: When p eople havenumerous binaries is .RTPatch [27], a to ol for creating patches Java classes, and get many more on a regular basis, di erences between les for up dating les. This there will be a great need to organize them, often to ol is claimed to work for arbitrary les, even bina- not according to the original \plan." Being able ries; while no technical description is given of their to tell which classes are similar can be very help- metho ds, it seems likely that it applies the di al- ful. Sometimes a similarity will identify the source. gorithm to arbitrary bytes rather than to hashes of Knowing that a very similar class is already installed lines of text. We are not familiar with anywork that on one's disk may b e helpful in deciding what to do can nd similarities in large numb er of binaries. with the new class. It can also help with version control, etc. In some cases, esp ecially for programs Our adaptations of si and di do not work directly that p erform mostly arithmetic op erations, it may on byteco de les, or even on disassembled byteco de b e p ossible to identify programs that implement the les. Byteco de les contain many indices into ta- same algorithm, even when they are written by in- bles, and values of most indices can b e changed bya dep endent writers. For example, when we ran ex- slight even one character change to a Java source p eriments on thousands of arbitrary class les, we le. Therefore, we enco de disassembled byteco de discovered two MD5 programs with 78 similarity les in a \normal form" which takes into accountpo- in their byteco de les. sitional values that are less a ected by small changes than absolute values of indices. This enco ding, Plagiarism detection: Our approach will identify based on a technique rst used in dup, enables the stolen co de if only minor changes are made to it programs to nd the hidden similarityinbyteco de including any amountofsyntactic changes suchas les. Since this technique is already incorp orated changing variable names. However, the advance in into dup, dup can work on disassembled byteco de sophistication of obfuscators [13,12], esp ecially for les; however, additional simple prepro cessing im- Java, would allow someone to hide any co de segment proves its p erformance. prettywell, and our metho ds and very p ossibly any other metho d will not b e able to identify it. One There are other to ols for nding similarities in text may b e able to identify that obfuscators were used, or source co de, as well. However, to ols based on on the other hand, whichmay b e go o d enough e.g., style metrics, suchas[7,19,25], or data ow graphs in a classro om. [17] would require decompilation of byteco de les in order to b e applied. Some other to ols based on Program reuse and reengineering: It will obviously ngerprints, such as [16,20,10], chunks of text [9, be useful to know which classes are similar when 30,34], or visualization via a graphical user interface trying to reuse them. Identifying versions is a go o d [11] may be adaptable to byte co de les using the example of that. same techniques that we use for si , dup, and di . Uninstal lers: If you nally obtain the exact co de To search for similar les in a large set of byteco de you wanted, then you probably want to get rid of les, we run si on the enco ded disassembled byte- other co de that is very similar. co de les and dup on the prepro cessed disassembled byteco de les. We show that combining the output Security: Detecting that a class is similar to a known from si and dup is more e ective than either indi- bad class can b e very helpful. mo des: all-against-all and one-against-all. The rst mo de nds all groups of similar les in a large le Compression: It may b e p ossible to store only di er- system and gives a rough indication of the similarity. ences among classes that are very similar. But rst The running time is essentially linear in the total one must detect all such similarities. This could b e size of all les, which makes si scalable. A sort, esp ecially useful for low-storage devices. which is not a linear-time routine, is required, but it is p erformed only on ngerprints, and therefore Clustering: Small similarities are quite typical do es not dominate the running time unless the total among programs written by the same p erson.
Recommended publications
  • How Can Java Class Files Be Decompiled?
    HOW CAN JAVA CLASS FILES BE DECOMPILED? David Foster Chamblee High School 11th Grade Many computer science courses feature building a compiler as a final project. I have decided to take a spin on the idea and instead develop a decompiler. Specifically, I decided to write a Java decompiler because: one, I am already familiar with the Java language, two, the language syntax does not suffer from feature-bloat, and three, there do not appear to be many Java decompilers currently on the market. I have approached the task of writing a decompiler as a matter of translating between two different languages. In our case, the source language is Java Virtual Machine (VM) Bytecode, and the target language is the Java Language. As such, our decompiler will attempt to analyze the semantics of the bytecode and convert it into Java source code that has equivalent semantics. A nice sideeffect of this approach is that our decompiler will not be dependent upon the syntax of the bytecodes, the compiler used to generate the bytecodes, or the source language (possibly not Java). It will only care whether a class-file’s semantics can be represented in the Java language. Decompilation is divided up into six phases: 1. Read opcodes 2. Interpret opcode behavior 3. Identify Java-language “idioms” 4. Identify patterns of control-flow 5. Generate Java-language statements 6. Format and output to file The first four phases will be discussed in detail within this paper, as they were the primary focus of my research. Further details pertaining to these phases can be gleaned from the accompanying log book.
    [Show full text]
  • Executable Code Is Not the Proper Subject of Copyright Law a Retrospective Criticism of Technical and Legal Naivete in the Apple V
    Executable Code is Not the Proper Subject of Copyright Law A retrospective criticism of technical and legal naivete in the Apple V. Franklin case Matthew M. Swann, Clark S. Turner, Ph.D., Department of Computer Science Cal Poly State University November 18, 2004 Abstract: Copyright was created by government for a purpose. Its purpose was to be an incentive to produce and disseminate new and useful knowledge to society. Source code is written to express its underlying ideas and is clearly included as a copyrightable artifact. However, since Apple v. Franklin, copyright has been extended to protect an opaque software executable that does not express its underlying ideas. Common commercial practice involves keeping the source code secret, hiding any innovative ideas expressed there, while copyrighting the executable, where the underlying ideas are not exposed. By examining copyright’s historical heritage we can determine whether software copyright for an opaque artifact upholds the bargain between authors and society as intended by our Founding Fathers. This paper first describes the origins of copyright, the nature of software, and the unique problems involved. It then determines whether current copyright protection for the opaque executable realizes the economic model underpinning copyright law. Having found the current legal interpretation insufficient to protect software without compromising its principles, we suggest new legislation which would respect the philosophy on which copyright in this nation was founded. Table of Contents INTRODUCTION................................................................................................. 1 THE ORIGIN OF COPYRIGHT ........................................................................... 1 The Idea is Born 1 A New Beginning 2 The Social Bargain 3 Copyright and the Constitution 4 THE BASICS OF SOFTWARE ..........................................................................
    [Show full text]
  • Advanced Programming for the Java(TM) 2 Platform
    Advanced Programming for the Java(TM) 2 Platform Training Index Advanced Programming for the JavaTM 2 Platform By Calvin Austin and Monica Pawlan November 1999 [CONTENTS] [NEXT>>] [DOWNLOAD] Requires login As an experienced developer on the JavaTM platform, you undoubtedly know how fast moving and comprehensive the Early Access platform is. Its many application programming interfaces (APIs) Downloads provide a wealth of functionality for all aspects of application and system-level programming. Real-world developers never use one Bug Database or two APIs to solve a problem, but bring together key Submit a Bug functionality spanning a number of APIs. Knowing which APIs you View Database need, which parts of which APIs you need, and how the APIs work together to create the best solution can be a daunting task. Newsletters Back Issues To help you navigate the Java APIs and fast-track your project Subscribe development time, this book includes the design, development, test, and deployment phases for an enterprise-worthy auction Learning Centers application. While the example application does not cover every Articles possible programming scenario, it explores many common Bookshelf situations and the discussions leave you with a solid methodology Code Samples for designing and building your own solutions. New to Java Question of the Week This book is for developers with more than a beginning level of Quizzes understanding of writing programs in the Java programming Tech Tips language. The example application is written with the Java® 2 Tutorials platform APIs and explained in terms of functional hows and whys, so if you need help installing the Java platform, setting up your Forums environment, or getting your first application to work, you should first read a more introductory book such as Essentials of the Java Programming Language: A Hands-On Guide or The Java Tutorial.
    [Show full text]
  • The LLVM Instruction Set and Compilation Strategy
    The LLVM Instruction Set and Compilation Strategy Chris Lattner Vikram Adve University of Illinois at Urbana-Champaign lattner,vadve ¡ @cs.uiuc.edu Abstract This document introduces the LLVM compiler infrastructure and instruction set, a simple approach that enables sophisticated code transformations at link time, runtime, and in the field. It is a pragmatic approach to compilation, interfering with programmers and tools as little as possible, while still retaining extensive high-level information from source-level compilers for later stages of an application’s lifetime. We describe the LLVM instruction set, the design of the LLVM system, and some of its key components. 1 Introduction Modern programming languages and software practices aim to support more reliable, flexible, and powerful software applications, increase programmer productivity, and provide higher level semantic information to the compiler. Un- fortunately, traditional approaches to compilation either fail to extract sufficient performance from the program (by not using interprocedural analysis or profile information) or interfere with the build process substantially (by requiring build scripts to be modified for either profiling or interprocedural optimization). Furthermore, they do not support optimization either at runtime or after an application has been installed at an end-user’s site, when the most relevant information about actual usage patterns would be available. The LLVM Compilation Strategy is designed to enable effective multi-stage optimization (at compile-time, link-time, runtime, and offline) and more effective profile-driven optimization, and to do so without changes to the traditional build process or programmer intervention. LLVM (Low Level Virtual Machine) is a compilation strategy that uses a low-level virtual instruction set with rich type information as a common code representation for all phases of compilation.
    [Show full text]
  • Studying the Real World Today's Topics
    Studying the real world Today's topics Free and open source software (FOSS) What is it, who uses it, history Making the most of other people's software Learning from, using, and contributing Learning about your own system Using tools to understand software without source Free and open source software Access to source code Free = freedom to use, modify, copy Some potential benefits Can build for different platforms and needs Development driven by community Different perspectives and ideas More people looking at the code for bugs/security issues Structure Volunteers, sponsored by companies Generally anyone can propose ideas and submit code Different structures in charge of what features/code gets in Free and open source software Tons of FOSS out there Nearly everything on myth Desktop applications (Firefox, Chromium, LibreOffice) Programming tools (compilers, libraries, IDEs) Servers (Apache web server, MySQL) Many companies contribute to FOSS Android core Apple Darwin Microsoft .NET A brief history of FOSS 1960s: Software distributed with hardware Source included, users could fix bugs 1970s: Start of software licensing 1974: Software is copyrightable 1975: First license for UNIX sold 1980s: Popularity of closed-source software Software valued independent of hardware Richard Stallman Started the free software movement (1983) The GNU project GNU = GNU's Not Unix An operating system with unix-like interface GNU General Public License Free software: users have access to source, can modify and redistribute Must share modifications under same
    [Show full text]
  • Chapter 1 Introduction to Computers, Programs, and Java
    Chapter 1 Introduction to Computers, Programs, and Java 1.1 Introduction • The central theme of this book is to learn how to solve problems by writing a program . • This book teaches you how to create programs by using the Java programming languages . • Java is the Internet program language • Why Java? The answer is that Java enables user to deploy applications on the Internet for servers , desktop computers , and small hand-held devices . 1.2 What is a Computer? • A computer is an electronic device that stores and processes data. • A computer includes both hardware and software. o Hardware is the physical aspect of the computer that can be seen. o Software is the invisible instructions that control the hardware and make it work. • Computer programming consists of writing instructions for computers to perform. • A computer consists of the following hardware components o CPU (Central Processing Unit) o Memory (Main memory) o Storage Devices (hard disk, floppy disk, CDs) o Input/Output devices (monitor, printer, keyboard, mouse) o Communication devices (Modem, NIC (Network Interface Card)). Bus Storage Communication Input Output Memory CPU Devices Devices Devices Devices e.g., Disk, CD, e.g., Modem, e.g., Keyboard, e.g., Monitor, and Tape and NIC Mouse Printer FIGURE 1.1 A computer consists of a CPU, memory, Hard disk, floppy disk, monitor, printer, and communication devices. CMPS161 Class Notes (Chap 01) Page 1 / 15 Kuo-pao Yang 1.2.1 Central Processing Unit (CPU) • The central processing unit (CPU) is the brain of a computer. • It retrieves instructions from memory and executes them.
    [Show full text]
  • Some Preliminary Implications of WTO Source Code Proposala INTRODUCTION
    Some preliminary implications of WTO source code proposala INTRODUCTION ............................................................................................................................................... 1 HOW THIS IS TRIMS+ ....................................................................................................................................... 3 HOW THIS IS TRIPS+ ......................................................................................................................................... 3 WHY GOVERNMENTS MAY REQUIRE TRANSFER OF SOURCE CODE .................................................................. 4 TECHNOLOGY TRANSFER ........................................................................................................................................... 4 AS A REMEDY FOR ANTICOMPETITIVE CONDUCT ............................................................................................................. 4 TAX LAW ............................................................................................................................................................... 5 IN GOVERNMENT PROCUREMENT ................................................................................................................................ 5 WHY GOVERNMENTS MAY REQUIRE ACCESS TO SOURCE CODE ...................................................................... 5 COMPETITION LAW .................................................................................................................................................
    [Show full text]
  • Information Retrieval from Java Archive Format
    Eötvös Loránd University Faculty of Informatics Department of Programming Languages and Compilers Information retrieval from Java archive format Supervisor: Author: Dr. Zoltán Porkoláb Bálint Kiss Associate Professor Computer Science MSc Budapest, 2017 Abstract During the course of my work, I contributed to CodeCompass, an open source code comprehension tool made for making codebase of software projects written in C, C++ and Java more understandable through navigation and visualization. I was tasked with the development of a module for recovering code information of Java classes in JAR files. This document details background concepts required for reverse- engineering Java bytecode, creating a prototype JAR file reader and how this solu- tion could be integrated to CodeCompass. First, I studied the structure of JAR format and how class files are stored in it. I looked into the Java Class file structure and how bytecode contained in class gets interpreted by the Java Virtual Machine. I also looked at existing decompilers and what bytecode libraries are. I created a proof-of-concept prototype that reads compiled classes from JAR file and extracts code information. I first showcased the use of Java Reflection API, then the use of Apache Commons Byte Code Engineering Library, a third-party bytecode library used for extracting and representing parts of Java class file as Java objects. Finally, I examined how CodeCompass works, how part of the prototype could be integrated into it and demonstrated the integration through parsing of a simple JAR file. i Acknowledgements I would like to thank Dr. Zoltán Porkoláb, Associate Professor of the Department of Programming Languages and Compilers at the Faculty of Informatics for admit- ting me to the CodeCompass project, supplying the thesis topic and helping with the documentation.
    [Show full text]
  • Toward IFVM Virtual Machine: a Model Driven IFML Interpretation
    Toward IFVM Virtual Machine: A Model Driven IFML Interpretation Sara Gotti and Samir Mbarki MISC Laboratory, Faculty of Sciences, Ibn Tofail University, BP 133, Kenitra, Morocco Keywords: Interaction Flow Modelling Language IFML, Model Execution, Unified Modeling Language (UML), IFML Execution, Model Driven Architecture MDA, Bytecode, Virtual Machine, Model Interpretation, Model Compilation, Platform Independent Model PIM, User Interfaces, Front End. Abstract: UML is the first international modeling language standardized since 1997. It aims at providing a standard way to visualize the design of a system, but it can't model the complex design of user interfaces and interactions. However, according to MDA approach, it is necessary to apply the concept of abstract models to user interfaces too. IFML is the OMG adopted (in March 2013) standard Interaction Flow Modeling Language designed for abstractly expressing the content, user interaction and control behaviour of the software applications front-end. IFML is a platform independent language, it has been designed with an executable semantic and it can be mapped easily into executable applications for various platforms and devices. In this article we present an approach to execute the IFML. We introduce a IFVM virtual machine which translate the IFML models into bytecode that will be interpreted by the java virtual machine. 1 INTRODUCTION a fundamental standard fUML (OMG, 2011), which is a subset of UML that contains the most relevant The software development has been affected by the part of class diagrams for modeling the data apparition of the MDA (OMG, 2015) approach. The structure and activity diagrams to specify system trend of the 21st century (BRAMBILLA et al., behavior; it contains all UML elements that are 2014) which has allowed developers to build their helpful for the execution of the models.
    [Show full text]
  • Opportunities and Open Problems for Static and Dynamic Program Analysis Mark Harman∗, Peter O’Hearn∗ ∗Facebook London and University College London, UK
    1 From Start-ups to Scale-ups: Opportunities and Open Problems for Static and Dynamic Program Analysis Mark Harman∗, Peter O’Hearn∗ ∗Facebook London and University College London, UK Abstract—This paper1 describes some of the challenges and research questions that target the most productive intersection opportunities when deploying static and dynamic analysis at we have yet witnessed: that between exciting, intellectually scale, drawing on the authors’ experience with the Infer and challenging science, and real-world deployment impact. Sapienz Technologies at Facebook, each of which started life as a research-led start-up that was subsequently deployed at scale, Many industrialists have perhaps tended to regard it unlikely impacting billions of people worldwide. that much academic work will prove relevant to their most The paper identifies open problems that have yet to receive pressing industrial concerns. On the other hand, it is not significant attention from the scientific community, yet which uncommon for academic and scientific researchers to believe have potential for profound real world impact, formulating these that most of the problems faced by industrialists are either as research questions that, we believe, are ripe for exploration and that would make excellent topics for research projects. boring, tedious or scientifically uninteresting. This sociological phenomenon has led to a great deal of miscommunication between the academic and industrial sectors. I. INTRODUCTION We hope that we can make a small contribution by focusing on the intersection of challenging and interesting scientific How do we transition research on static and dynamic problems with pressing industrial deployment needs. Our aim analysis techniques from the testing and verification research is to move the debate beyond relatively unhelpful observations communities to industrial practice? Many have asked this we have typically encountered in, for example, conference question, and others related to it.
    [Show full text]
  • Superoptimization of Webassembly Bytecode
    Superoptimization of WebAssembly Bytecode Javier Cabrera Arteaga Shrinish Donde Jian Gu Orestis Floros [email protected] [email protected] [email protected] [email protected] Lucas Satabin Benoit Baudry Martin Monperrus [email protected] [email protected] [email protected] ABSTRACT 2 BACKGROUND Motivated by the fast adoption of WebAssembly, we propose the 2.1 WebAssembly first functional pipeline to support the superoptimization of Web- WebAssembly is a binary instruction format for a stack-based vir- Assembly bytecode. Our pipeline works over LLVM and Souper. tual machine [17]. As described in the WebAssembly Core Specifica- We evaluate our superoptimization pipeline with 12 programs from tion [7], WebAssembly is a portable, low-level code format designed the Rosetta code project. Our pipeline improves the code section for efficient execution and compact representation. WebAssembly size of 8 out of 12 programs. We discuss the challenges faced in has been first announced publicly in 2015. Since 2017, it has been superoptimization of WebAssembly with two case studies. implemented by four major web browsers (Chrome, Edge, Firefox, and Safari). A paper by Haas et al. [11] formalizes the language and 1 INTRODUCTION its type system, and explains the design rationale. The main goal of WebAssembly is to enable high performance After HTML, CSS, and JavaScript, WebAssembly (WASM) has be- applications on the web. WebAssembly can run as a standalone VM come the fourth standard language for web development [7]. This or in other environments such as Arduino [10]. It is independent new language has been designed to be fast, platform-independent, of any specific hardware or languages and can be compiled for and experiments have shown that WebAssembly can have an over- modern architectures or devices, from a wide variety of high-level head as low as 10% compared to native code [11].
    [Show full text]
  • Coqjvm: an Executable Specification of the Java Virtual Machine Using
    CoqJVM: An Executable Specification of the Java Virtual Machine using Dependent Types Robert Atkey LFCS, School of Informatics, University of Edinburgh Mayfield Rd, Edinburgh EH9 3JZ, UK [email protected] Abstract. We describe an executable specification of the Java Virtual Machine (JVM) within the Coq proof assistant. The principal features of the development are that it is executable, meaning that it can be tested against a real JVM to gain confidence in the correctness of the specification; and that it has been written with heavy use of dependent types, this is both to structure the model in a useful way, and to constrain the model to prevent spurious partiality. We describe the structure of the formalisation and the way in which we have used dependent types. 1 Introduction Large scale formalisations of programming languages and systems in mechanised theorem provers have recently become popular [4–6, 9]. In this paper, we describe a formalisation of the Java virtual machine (JVM) [8] in the Coq proof assistant [11]. The principal features of this formalisation are that it is executable, meaning that a purely functional JVM can be extracted from the Coq development and – with some O’Caml glue code – executed on real Java bytecode output from the Java compiler; and that it is structured using dependent types. The motivation for this development is to act as a basis for certified consumer- side Proof-Carrying Code (PCC) [12]. We aim to prove the soundness of program logics and correctness of proof checkers against the model, and extract the proof checkers to produce certified stand-alone tools.
    [Show full text]