Authorship Analysis: Identifying the Author of a Program [1]

Ivan Krsul
The COAST Project
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907-1398
[email protected]

May 3, 1994

Technical Report CSD-TR-94-030

[1] This paper was originally written as a Master's thesis at Purdue University.

Abstract

In this paper we show that it is possible to identify the author of a piece of
software by looking at stylistic characteristics of source code. We also show
that there exists a set of characteristics within a program that are helpful
in the identification of a programmer, and whose computation can be automated
with a reasonable cost.

There are four areas that benefit directly from the findings we present
herein: the legal community can count on empirical evidence to support
authorship claims, the academic community can count on evidence that supports
authorship claims of students, industry can count on identifying the authors
of previously unidentifiable software modules, and real-time intrusion
detection systems can be enhanced to include information regarding the
authorship of all locally compiled programs.

We support this claim by collecting and analyzing eighty-eight programs
written by twenty-nine students, staff and faculty members at Purdue
University.

Chapter 1

Introduction.

There are many occasions on which we would like to identify the source of
some piece of software. For example, if after an attack on a system we are
presented with a piece of the software used for the attack, we might want
to identify the source of the software. Typical examples of such software
are Trojan horses [1], viruses [2], and logic bombs [3].

[1] Trojan horses are defined in [GS92] as programs that appear to have one
function but actually perform another function.
[2] Viruses are defined in [GS92] as programs that modify other programs in
a computer, inserting copies of themselves.
[3] Logic bombs are defined in [GS92] as hidden features in programs that go
off after certain conditions are met.

Other typical circumstances will require that we trace the source of a
program. Proof of code re-engineering, resolution of authorship disputes and
proof of authorship in courts are but a few of the more typical examples of
such circumstances. Often, tracing the origins of the source requires that
we identify the author of the program.

This seems at first an impossible task, and convincing arguments can be
given about the intractability of this problem. Consider, for example, the
following short list of potential problems with the identification of
authors:

1. Given that millions of people write software, it seems unlikely that,
given a piece of software, we will find the programmer who wrote it.

2. Software evolves over time. As time passes, programmers vary their
programming habits and their choice of programming languages. The

development of new software engineering methods, the introduction of
formal methods for program verification, and the development of
user-friendly, graphics-oriented code processing systems and debuggers
all contribute to making programming a highly dynamic field.

3. Software gets reused. In recent years, and with the development of
object-oriented programming methodologies, programmers have come to depend
on reusing large portions of code, such as the code produced by the
GNU/Free Software Foundation [4], much of which is in the public domain.
Commercially available prototypers, like the Builder Xcessory by Integrated
Computer Solutions, Inc., produce thousands of lines of code that are used
to develop Motif interfaces. Similar development tools are available for
hundreds of development platforms.

[4] The Free Software Foundation is a group started by Richard Stallman to
embody his ideas of personal freedom and how software should be produced.

Similar arguments could be given for fingerprinting: fingerprint matching
is an expensive process, and it seems unlikely that government agencies
will ever be able to classify every citizen in their lifetime. It is also
unlikely that, given a fingerprint, we will be able to pick the correct
person from a pool of several million people every time.

The identification process in computer software can be made reliable for
a subset of the programmers and programs. Programmers that are involved
in high-security projects or programmers that have been known to break the
law are attractive candidates for classification.

1.1 Statement of the Problem.

Authorship analysis in literature has been widely debated for hundreds of
years, and a large body of knowledge has been developed [Dau90]. Authorship
analysis of computer software, however, is different from and more
difficult than its counterpart in literature.

Several reasons make this problem difficult. Computer software does not
exhibit the same stylistic characteristics that authorship analysis in
literature relies on. Furthermore, people reuse code, programs are
developed by teams of programmers, and programs can be altered by code
formatters and pretty printers.


Our objective is to classify the programmer and to try to find a set of
characteristics that remain constant for a significant portion of the
programs that this programmer might produce. This is analogous to
attempting to find characteristics in humans that can be used later to
identify a person.

Eye and hair color, height, weight, name and voice pattern are but a few
of the characteristics that we use on a day-to-day basis to identify
persons. It is, of course, possible to alter our appearance to match that
of another person. Hence, more elaborate identification techniques like
fingerprinting, retinal scans and DNA prints are also available, but the
cost of gathering and processing this information in large quantities is
prohibitively expensive. Similarly, we would like to find the set of
characteristics within a program that will be helpful in the identification
of the corresponding programmer, and whose computation can be automated
with a reasonable cost.

What makes us believe that identification of authorship in computer
software is possible? People work within certain frameworks that repeat
themselves. They use those things that they are most comfortable with or
are accustomed to.

Programmers are human. Humans are creatures of habit, and habits tend to
persist. That is why, for example, we have a handwriting style that is
consistent during periods of our life, although the style may vary as we
grow older. Patterns of behavior are all around us.

Likewise for programming, we can ask: which programming constructs does a
programmer use all the time? These are the habits most likely to be
entrenched, the things he consistently and constantly does and that are
likely to become ingrained.

1.2 Motivation.

Four basic areas can benefit considerably from the development of solid
authorship analysis tools:

1. For authorship disputes, the legal community is in need of solid
methodologies that can be used to provide empirical evidence to show that
two or more programs were written by the same person.

2. In the academic community, it is considered unethical to copy
programming assignments. While plagiarism detection can show that two
programs are equivalent, authorship analysis can be used to show that some
code fragment was indeed written by the person who claims authorship of it.

3. In industry, where large software products typically run for years and
span millions of lines of code, it is a common occurrence that authorship
information about programs or program fragments is nonexistent, inaccurate
or misleading. Whenever a particular program or program module needs to be
rewritten, the author may need to be located. It would be convenient to be
able to determine, from a set of several hundred programmers, the name of
the programmer who wrote a particular piece of code, so he can be located
to assist in the upgrade process.

4. Real-time intrusion detection systems could be enhanced to include
authorship information. Dorothy Denning writes in [Den87] about a proposed
real-time intrusion detection system:

    The model is based on the hypothesis that exploitation of a
    system's vulnerabilities involves abnormal use of the system;
    therefore, security violations could be detected from abnormal
    patterns of system usage.

Obviously, a programmer signature constructed from the identifying
characteristics of programs constitutes such a pattern. For example,
consider the student who retrieves a copy of a password cracking program
and compiles it in his university account. Once the compiler collects the
identifying features of the program, the operating system could immediately
recognize the compilation of this program as an abnormal event.

Of course, we realize that it is theoretically possible for a programmer
to fool or bypass the system by altering his programming methods. The
change would have to be gradual, subtle and ingenious for the system not
to record the change. However, we expect that modifying a user's profile
to meet the characteristics of a specific program (a password cracking
program, for example) will be a difficult and time consuming process.

Researchers who seriously consider this issue must also address the
redesign of compilers, interpreters and operating systems to enforce the
use of these metrics at run time. A compiler is just another program.
What is to prevent someone from bringing or building his own compiler
piece by piece?

Operating systems must also prevent the installation of programs compiled
on other systems by potential penetrators. To maintain consistency, all
executable programs in the system must be either compiled locally or
registered with the system administrator. If this step were not enforced,
any programmer might compile source code on a privately owned machine,
importing executable programs with false identification information.

We must also consider that the data that authorized compilers generate
must be protected. If such identifying information is stored in the binary
file, what prevents someone from writing a program to change it?

1.3 Desired Results

Our goal is to show that it is possible to identify the author of a program
by examining its characteristics. Ultimately, we would like to find a
signature for each individual programmer so that at any given point in time
we could identify the ownership of any program.

The implications of being able to find such a signature are enormous.
Cliff Stoll's German hacker [Sto90] never feared prosecution precisely
because of our inability to identify him even after tracking him down (the
hacker had to be caught red-handed to be prosecuted). The author of the
Internet Worm that attacked several hundred machines on the evening of
November 2, 1988 [Spa89] might also have been identified quickly, since he
was a student and many samples of his programming style were readily
available. Authors of computer viruses might also be identified if a piece
of the source code is available.

1.4 Summary.

Identifying the author of a program is a difficult process. We need
powerful tools to discover the identity of programmers who are not
explicitly named or who might simply wish to conceal themselves. These
tools must prove reliable.

In the remainder of this paper, we will put forth a methodology for
identifying the authors of source code. Chapter 2 introduces much of the
background material and explores the work that has been done in related
areas. Chapter 3 introduces the methodology and experimental setup.
Chapter 4 details the experiments performed and outlines the findings.
Finally, Chapter 5 presents the conclusions and details the work that must
be done in the future.

Chapter 2

Authorship Analysis

2.1 Definitions

An author is defined by Webster [MW92] as "one that writes or composes a
literary work," or as "one who originates or creates." In the context of
software development we adapt the definition of author to be: "one that
originates or creates a piece of software." Authorship is then defined as
"the state of being an author." As in literature, a particular work can
have multiple authors. Furthermore, some of these authors can take an
existing work and add things to it, evolving the original creation.

A program is a specification of a computation [Set89] or, alternatively,
a sequence of instructions that permits a computer to perform a particular
task [Spe83]. A programming language is a notation for writing programs
[RR83, Set89].

A programmer is a person who designs, writes and tests computer programs
[Spe83]. In the fullest meaning of the term, a programmer will participate
in the definition and specification of the problem itself, as well as the
algorithms to be used in its solution. He will then design the more
detailed structure of the implementation, select a suitable programming
language and related data structures, and write and debug the necessary
programs [RR83].

Programming style is a distinctive or characteristic manner present in
the programs written by a programmer.

A predicate is a propositional formula in which relational and quantified
expressions can be used in place of propositional variables. To interpret
a predicate, we first interpret each relational and quantified expression,
yielding true or false for each, and then interpret the resulting
propositional formula [And91, page 20].

A complex system may be divided into simpler pieces called modules. While
writing software that is divided into modules, we can either look at the
details of each module in isolation, ignoring other modules until the
module has been tested and completed, or look at the general
characteristics of each module and their relationships in order to
integrate them. If these two phases are executed in this order, then we
say that the system is designed bottom-up; the converse order denotes
top-down design.

Prototypers are programs that allow users to create large portions of code
automatically, generally for the design of user interfaces, relieving the
user of the monotonous production of well understood and largely standard
code.

2.2 Survey of Related Work

2.2.1 Authorship Analysis in Literature

In literature, the question of Shakespeare's identity has engaged the wits
and energy of a wide range of people for more than two hundred years. Such
great figures as Mark Twain, Irving Wall and Sigmund Freud debated this
particular issue at length [HH92]. Mark Twain, for example, made the
following observations about Shakespeare and his writings:

- Shakespeare's parents could not read, write or sign their names.

- He made a will, signed on each of the three pages. He went to great
  lengths to distribute his wealth among the members of his family. This
  will was a businessman's will and not a poet's.

- His will does not mention a single book, nor a poem, nor any scrap of
  manuscript of any kind.

- There is no other known specimen of his penmanship that can be proved
  his, except one poem that he wrote to be engraved on his tomb [Nei63].

Hundreds of books and essays have been written on this topic, some as
early as 1837 [Dis37]. Especially interesting was W. Elliott's attempt to
resolve the authorship of Shakespeare's work with a computer by examining
literary minutiae from word frequency to punctuation and proclivity to use
clauses and compounds.

For three years, the Shakespeare Clinic of Claremont Colleges used
computers to see which of fifty-eight claimed authors of Shakespeare's
works matched Shakespeare's writing style [EV91]. Among the techniques
used was a modal test that divided a text into blocks, fifty-two keywords
in each block, and measured and ranked eigenvalues, or modes. Rather than
representing keyword occurrences, modes measure patterns of deviation from
a writer's rates of word frequency. Other more conventional tests examined
were hyphenated compound words; relative clauses per thousand; grade level
of writing; and percentage of open- and feminine-ended lines [EV91].

Although much controversy surrounds the specific results obtained by
Elliott's computer analysis, it is clear from the results that Shakespeare
fits a narrow and distinctive profile. As W. Elliott and R. Valenza write
in [EV91]:

    Our conclusion from the clinic was that Shakespeare fit within
    a fairly narrow, distinctive profile under our best tests. If his
    poems were written by a committee, it was a remarkably consistent
    committee. If they were written by any of the claimants we
    tested, it was a remarkably inconsistent claimant. If they were
    written by the Earl of Oxford, he must, after taking the name of
    Shakespeare, have undergone several stylistic changes of a type
    and magnitude unknown in Shakespeare's accepted works.

2.2.2 Authorship Analysis With Programming Style

The issue of identifying program authorship was explored by Cook and Oman
in [OC89] as a means for determining instances of software theft and
plagiarism. They briefly explored the use of software complexity metrics
to define a relationship between programs and programmers, concluding that
these are inadequate measures of stylistic factors and domain attributes.
Two other studies, by Berghel and Sallach [BS84] and Evangelist [Eva84],
support this conclusion.

Cook and Oman describe the use of "markers" to record the occurrences of
certain peculiar characteristics, much like the markers used to resolve
authorship disputes of written works. The markers used in their work are
based purely on typographic characteristics.

To collect data to support their claim, they built a Pascal source code
analyzer that generated an array of Boolean measurements based on:

1. Inline comments on the same line as source code.
2. Blocked comments (two or more comments occurring together).
3. Bordered comments (set off by repetitive characters).
4. Keywords followed by comments.
5. One or two space indentation occurred more frequently.
6. Three or four space indentation occurred more frequently.
7. Five spaces or greater indentation occurred more frequently.
8. Lower case characters only (all source code).
9. Upper case characters only (all source code).
10. Case used to distinguish between keywords and identifiers.
11. Underscore used in identifiers.
12. BEGIN followed by a statement on the same line.
13. THEN followed by a statement on the same line.
14. Multiple statements per line.
15. Blank lines in program body.

To test their hypothesis, Cook and Oman collected the metrics mentioned
above for eighteen short programs by six authors, each program implementing
a simple textbook algorithm that fit on one page. The programs were taken
from example code for three tree traversal algorithms (inorder, preorder
and postorder traversal) and one sorting algorithm (bubble sort).

Cook and Oman claim that the results of the experiment were surprisingly
accurate. The results are encouraging, but further reflection shows that
the experiment is fundamentally flawed. It fails to consider that textbook
algorithms are frequently cleaned up by code beautifiers and pretty
printers, and that different problem domains demand different programming
methodologies. The implementations of the three tree traversal algorithms
differ from one another only slightly and hence are likely to be similar.

Their choice of metrics also limits the usefulness of their techniques.
Some of these metrics are useless in the analysis of C code, because the
language is case sensitive and it is a common occurrence that programmers
use upper case for constants and lower case for variables and identifiers.

A Boolean measurement is inadequate for most of the other measurements.
Consider, for example, the metric dealing with the existence of blank lines
in a program. Most programmers tend to organize their programs so that
some structural information is readily available by simply looking at the
program. Programmers thus tend to spread out their code, using blank lines
to separate logically independent blocks. Hence, it is likely that all
programs will have at least some blank lines separating their logical
components.

To test our theory, we constructed a simple program that counts the blank
lines in a program and ran this tool on 996 C files. The smallest program
(or file) was only two lines long; the largest was 6,900 lines long.
Table 2.1 contains the statistics gathered. Notice how the distribution of
percentages of blank lines is clustered around the ten percent mark.
Figure 2.1 shows a graphical representation of this distribution.
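The counting tool itself is trivial to build. A minimal sketch of such a
blank-line counter (our illustration; the actual tool used for the
experiment is not reproduced here, and treating whitespace-only lines as
blank is our assumption) might look as follows:

#include <stdio.h>
#include <ctype.h>

/* Count the blank lines in a C file and report them as a */
/* percentage of the total line count. A line containing  */
/* only whitespace characters is treated as blank.        */
int main(int argc, char *argv[])
{
    FILE *fp;
    int c, blank = 1, blanks = 0, lines = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: blankcount file.c\n");
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            lines++;
            if (blank)
                blanks++;
            blank = 1;
        } else if (!isspace(c)) {
            blank = 0;
        }
    }
    fclose(fp);
    if (lines > 0)
        printf("%d of %d lines blank (%.1f%%)\n",
               blanks, lines, 100.0 * blanks / lines);
    return 0;
}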

Of these, only four programs had no blank lines, and those four files were
shorter than six lines of code. The rest varied significantly, from only
one blank line to 36% of the file being blank lines. Hence, we conclude
that the choice of a Boolean variable for this measurement is
inappropriate. Similar arguments can be made for the "Keywords followed by
comments" and "Multiple statements per line" measurements.

We also analyzed 178 C files for indentation information. The smallest
program (or file) was a few dozen lines long; the largest was 6,900 lines
long. Figure 2.2 shows the distribution of mean indentation values, and
figure 2.3 shows the distribution of maximum indentations per program.
Not surprisingly, most of the programs analyzed have their indentation at
the two, four and eight space marks.

Percent blank lines  Files  Smallest file  Largest file

0 4 2 5

1 4 100 1037

2 3 100 797

3 14 31 1047

4 27 25 1461

5 36 58 6009

6 83 18 3617

7 112 27 2050

8 122 12 3297

9 105 34 2718

10 104 29 2151

11 91 27 2794

12 82 25 5801

13 59 16 1706

14 35 22 1642

15 24 13 719

16 20 19 963

17 12 12 1685

18 22 11 3265

19 6 21 680

20 7 20 92

21 5 29 238

22 3 32 146

23 3 31 212

24 2 38 216

26 4 27 2235

27 1 77 77

28 1 25 25

30 1 23 23

33 2 3 18

36 1 22 22

Table 2.1: Blank Line Count for C Programs

What was surprising was the large variation found. Some programs were
overwhelmingly inconsistent, allowing lines to be indented up to 16 spaces.
A better indentation measurement must include a consistency value.

[Figure 2.1: Blank Line Distribution in C Programs; histogram of the number
of programs against the percentage of blank lines.]

[Figure 2.2: Mean Indentation for C Programs; histogram of the number of
programs against the average indentation on a program.]

[Figure 2.3: Maximum Indentation for C Programs; histogram of the number of
programs against the maximum indentation.]

2.2.3 Software Forensics

Spafford and Weeber suggest that it might be feasible to analyze the
remnants of software, typically the remains of a virus or trojan horse, and
identify its author. They theorize that this technique, called Software
Forensics, could be used to examine and analyze software in any form, be it
source code for any language or executable images [WS93].

The authorship of a program might well be proved beyond a reasonable doubt
by the results of such analysis if there exists a large enough statistical
base to support our comparisons. If not, the results might provide
confirmation of the role of a suspect and indicate where a more serious
investigation should be performed, much like the autopsy of a murder victim
showing that the victim was stabbed by a 20 inch long knife wielded by a
left-handed criminal six feet tall. An author of software cannot be
directly identified by this analysis, much in the same manner as medical
forensics cannot directly identify the author of a crime; the result of the
analysis will be a series of statistics about the characteristics that are
present in the piece of code in question.

This would be similar to the use of handwriting analysis by law enforcement
officials to identify the authors of documents involved in crimes, or to
provide confirmation of the role of a suspect: a feature of the writing, be
it the way the i's are dotted or anything identifiable about the writing,
helps to identify the author. Spafford and Weeber write in [WS93] the
following about the identifying features of handwriting:

    "...Features considered in handwriting analysis today may include
    shape of dots, proportions of lengths, shape of loops, horizontal
    and vertical expansion of writing, slant, regularity, and fluency.
    The features useful in handwriting analysis are the writer-specific
    features. A feature is said to be writer-specific if it shows only
    small variations in the writings of an individual and large
    variations in the writings of different authors. However, most
    writing will contain certain features that set it apart from the
    samples of other authors. A sample that contains i's dotted with a
    single dot probably will not yield much information from that
    feature. However, if all of the o's in the sample have their
    centers filled in, that feature may [...] author."

This has a high degree of correlation with the identification of program
authors using style analysis [OC89]. However, Software Forensics is really
a superset of authorship analysis using style analysis, because the
measurements suggested by Spafford and Weeber [WS93] include, but are not
limited to, the measurements made by Cook and Oman [OC89].

Among the measurements that Spafford and Weeber suggest are the preference
for certain data structures and algorithms, the compiler used, the level
of expertise of the author of a program, the choice of system calls made
by the programmer, the formatting of the code, the use of pragmas or macros
that might not be available on every system, the commenting style used by
the programmer, the variable naming conventions used, and the misspelling
of words inside comments and variable names.

The list of measurements suggested by Spafford and Weeber is comprehensive,
but the derivation of some of them is difficult to automate. Consider, for
example, what they say about spelling and grammar measurements:

    Many programmers have difficulty writing correct prose. Misspelled
    variable names (e.g. TransactoingReciept) and words inside comments
    may be quite telling if the misspelling is consistent. Likewise,
    small grammatical mistakes inside comments or print statements, such
    as misuse or overuse of em-dashes and semicolons, might provide a
    small, additional point of similarity between two programs.

From a small set of programs, we gathered the names of variables and
found, among others, that the following were not recognized by the local
spell checker: talkw_oracle, om_recv, oracle_mssg, collis_service,
result_recv, result_send, net_colls, tcp_chksum, pip, sec_host,
info_my_socket, msgc, remdata, CharToInputType, StateToNextState,
NoOutputChars, ErrorType, proto, addrlen, hname.

We also extracted all the comments and strings from several programs of a
graduate student of Purdue University and found that the following words,
among others, were not recognized by our local spell checker: algo, buf,
cmd, cmdAckGet, cmdAckPut, cmdClr, cmdEnd, cmdErr, cmdGet, cmdPut, co,
connectsock, cont, corrida, Entrando, Esto, Expiro, llego, mandando,
respuesta, resultado, tiempo, tion, ver.

Even when we look at the original words in their context, it is practically
impossible to determine whether a word was misspelled. Some of the words
the spell checker did not recognize are in Spanish [1]. This gives us some
information about the origin or educational background of the programmer,
and it adds to the complexity of automating these measurements.

[1] Some of these words may be in Spanish, or they might be simple
contractions. The words ver and algo fall under this category.
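The gathering step itself, however, is easy to automate even if the
judgment of what constitutes a misspelling is not. The sketch below (our
own illustration, not the tool used in this experiment) extracts
identifier-like tokens from a source file, one per line, so that they can
be piped into a spell checker; it makes no attempt to split mixed-case
names such as CharToInputType into their component words, or to separate
identifiers from keywords and words inside comments:

#include <stdio.h>
#include <ctype.h>

/* Print every identifier-like token in a C source file, */
/* one per line, suitable for piping to a spell checker. */
int main(int argc, char *argv[])
{
    FILE *fp;
    int c, in_word = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: idents file.c\n");
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        if (isalpha(c) || c == '_' || (in_word && isdigit(c))) {
            putchar(c);
            in_word = 1;
        } else if (in_word) {
            putchar('\n');
            in_word = 0;
        }
    }
    if (in_word)
        putchar('\n');
    fclose(fp);
    return 0;
}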

Spafford and Weeber provide no statistical evidence that might support
their theory. The following quote from the concluding remarks of their
paper illustrates the point:

    Further research into this technique, based on the examination of
    large amounts of code, should provide further insight into the
    utility of what we have proposed. In particular, studies are needed
    to determine which characteristics of code are most significant, how
    they vary from programmer to programmer, and how best to measure
    similarities. Different programming languages and systems should be
    studied, to determine environment-specific factors that may
    influence comparisons. And most important, studies should be
    conducted to determine the accuracy of this method; false negatives
    can be tolerated, but false positives would indicate that the method
    is not useful for any but the most obvious of cases.

2.3 Difficulties in Authorship Analysis

We expect the programming characteristics of programmers to change and
evolve. For example, it would be unrealistic for us to expect the
programming style of a programmer to remain unaltered through several
years of graduate work in Computer Science.

Education is only one of many factors that have an effect on the evolution
of programming styles. Not only do software engineering models impose
particular naming conventions, parameter passing methods and commenting
styles; they also impose a planning and development strategy. The waterfall
model [GJM91], for example, encourages the design of precise
specifications, utilization of program modules and extensive module
testing. These have a marked impact on programming style.

The programming style of any given programmer also varies from language to
language, or because of external constraints placed by managers or
tools [2]. Out of the set of measurements that allow our model to identify
the authorship of a program, can we identify those that have been
contaminated and ignore them in our analysis? A good example would be the
analysis of code that has been formatted using a pretty-printer. Would it
be possible for the authorship analysis system to recognize that such a
formatter has been used, identify the pretty-printer, and compensate by
eliminating information about indentation, curly bracket placement and
comment placement? Conceptually similar would be the recognition of tools
that force a particular programming style onto the programmer. For
example, could the authorship analysis tool recognize the usage of Motif
and compensate for the variable naming conventions imposed by the tool?

[2] The use of the Motif, GL, PLOT-10 or GKS libraries generally demands
that the application be structured in some fashion or may impose naming
conventions.

Finally, among the most serious problems that must be resolved in
authorship analysis is the reuse of code. All the work performed to date
on this subject assumes that a significant part of the code being analyzed
was built and developed by a single individual. In commercial development
projects, this is seldom the case.

2.4 What Authorship Analysis is not

2.4.1 Plagiarism Detection

It is important to realize that authorship analysis is markedly different
from plagiarism detection. In [Mor91], D. Moreaux defined software
plagiarism as a general form of software theft, which can in turn be
defined as the complete, partial or modified replication of software, with
or without the permission of the original author.

Notice that, according to this definition, plagiarism detection cannot
tell whether two entirely different programs were written by the same
person. Also, the replication need not maintain the programming style of
the original software.

Consider, for example, a program "X" that is a plagiarized version, by
programmer "A", of an original work done by programmer "B". After
programmer "A" has copied the original program, stylistic changes are
made. Specifically, old comments are removed and new comments are added,
indentation and the placement of braces are changed to match the style of
programmer "A", variables are renamed to something that he feels more
comfortable with, the order of functions is altered, and "for" loops are
changed to "while" loops.

While plagiarism detection needs to detect the similarity between these
two programs, authorship analysis does not. For the purpose of authorship
identification, these two programs have distinct authors.

Many people have devoted time and resources to the development of
plagiarism detection [Ott77, Gri81, Jan88, Wha86], and a comprehensive
analysis of their work is beyond the scope of this paper. We can, however,
give a simple example that illustrates how measurements traditionally used
for plagiarism detection are ill suited to authorship analysis.

Consider the two functions shown in figure 2.4. Both are structurally and
logically equivalent. However, the second function has undergone several
stylistic changes that mask the identity of the original programmer.
Plagiarism detection systems might consider them identical. Traditional
software engineering metrics, commonly used for plagiarism detection, will
yield similar values for both. But authorship analysis should not consider
them identical: if both authors happen to write these functions
independently, and this is a common occurrence, our system should identify
them as distinct.

2.4.2 Personality Profiler

Throughout the development of this document, we realized that it is a
common occurrence to mistake authorship analysis for the classification of
programmers' personalities. A member of the faculty of Purdue University
remarked that it would be ridiculous to try to derive any information
about his persona from the programming examples in his many published
books or developed systems. While this might be possible, it should be
left to researchers in psychology departments.

We also believe that it is possible to determine something about the
background of a programmer by looking at his code [WS93]. Faculty members
and teaching assistants at Purdue University agree that undergraduate
students and electrical engineers are notorious for abusing hashing [3],
that Lisp programmers are notorious for abusing linked-list data
structures, and that native Fortran programmers prefer using short lines.
While all this information might be useful in forensic analysis [WS93],
it is beyond the scope of our study.

[3] Abuse in this context refers to using a data structure that is
inappropriate or unreasonably expensive. Sorting with linked lists, or
hashing techniques where keys have a high degree of collisions, are
examples of such abuses.

2.5 A Simple Example

As a simple example of the differences in programming style among
programmers, consider the programs shown in figures 2.5, 2.6, 2.7 and 2.8.
These are variations on a program written by graduate students at Purdue
University. The specification given for the function was:

    "Write a function that takes three parameters of type pointers to
    strings. The function must concatenate the first two strings into
    the third string. The use of system calls is not permitted."

We notice that the approach taken by all programmers was similar; all
functions work exclusively with pointers, and all loops search for the
null character at the end of each string. Three of the four programs use
the while loop and three of the four programs use auxiliary variables.
All four programs contain two loops, and all loops have the same
functionality.

However, these programs are far from identical. A closer scrutiny of
these programs reveals that:

1. Programmer one prefers the use of "for" loops instead of the "while"
loop.

2. Only programmer two has the function header comments boxed in
completely; programmer one has partial boxing, and the rest have no boxes
at all.

3. Programmers one and three have chosen temporary variables with names
of the form xxx1, xxx2, xxx3, etc. Only programmer two has chosen
temporary variable names that reflect the use that will be given to the
variable.

4. One of the programs has a significant bug: the return of the function
is undefined when the two input strings are empty (i.e. it fails to "do
nothing" gracefully [KP78]).

5. The program for programmer one has an incorrect comment.

6. Only one of the programmers has consistently indented his programs
using three spaces. The rest used only two spaces.

7. The placement of curly braces ({) is distinct for all programmers.

This is not an exhaustive list of differences. It is just an example that
illustrates the types of features that we can examine to distinguish among

programmers.

/********************************************
 * Subroutine for checking a string for TABS
 * Parameters: s1: String to examine
 * Return: Boolean indicating the existence
 *         of a tab character
 ********************************************/
int strchk(char *s1)
{
  char *ptr1;

  ptr1 = s1;
  while(*ptr1 != 0)
    if(*ptr1++ == '\t')
      return(1);
  return(0);
}

#define TRUE 1
#define FALSE 0

/* Checks the existence of \t in string */
int check_for_tab_in_string(string)
char *string;
{
  char *character_pointer;

  /* Loop until found or we reach EOLN */
  for(character_pointer = string; *character_pointer != NULL;)
  {
    /* check to see if we found TAB */
    if(*(character_pointer++) == 9)
    {
      /* Success!! */
      return(TRUE);
    }
  }
  /* No tab */
  return(FALSE);
}

(Two program modules that might be considered similar in plagiarism
detection systems. They would not be considered similar in authorship
analysis.)

Figure 2.4: Plagiarism Detection vs. Authorship Analysis

/***************************************************************
 * This function concatenates the first and second string into
 * the third string.
 ***************************************************************/
void strcat(char *string1, char *string2, char *string3)
{
  char *ptr1, *ptr2;

  ptr2 = string3;
  /*
   * Copy first string
   */
  for(ptr1=string1;*ptr1;ptr1++) {
    *(ptr2++) = *ptr1;
  }
  /*
   * Copy first string
   */
  for(ptr1=string2;*ptr1;ptr1++) {
    *(ptr2++) = *ptr1;
  }
  /*
   * Null terminate the resulting string
   */
  *ptr2 = '\0';
}

Figure 2.5: Style Variations on a Program: Programmer 1

/* ------------------------------------------------------------- */
/* strcat(s1, s2, s3)                                             */
/* Append strings s1 and s2, and copy result into s3.             */
/* Requires that sufficient memory for s3 is already allocated.   */
/* ------------------------------------------------------------- */
void strcat (s1, s2, s3)
char *s1;
char *s2;
char *s3;
{
  char* src = s1;
  char* dest = s3;

  while (*src) {
    *(dest++) = *(src++);
  };

  src = s2;
  while (*src) {
    *(dest++) = *(src++);
  };

  *dest = '\0';
};

Figure 2.6: Style Variations on a Program: Programmer 2

/* str_cat(char *str1, char *str2, char *str3)
   concatenates string str1 & str2 into str3 */
void str_cat(char *str1, char *str2, char *str3)
{
   char *aux1, *aux2;

   aux1 = str3;
   aux2 = str1;

   while (*aux2 != 0)   /* Copy str1 -> str3 */
   {
      *aux1 = *aux2;
      aux1++;
      aux2++;
   }

   aux2 = str2;
   while (*aux2 != 0)   /* Append str2 -> str3 */
   {
      *aux1 = *aux2;
      aux1++;
      aux2++;
   }
   /* End str3 with a null character */
   *aux1 = 0;
}

Figure 2.7: Style Variations on a Program: Programmer 3

/*
 * concatenate s2 to s1 into s3.
 * Enough memory for s3 must already be allocated. No checks !!!!!!
 */
mysc(s1, s2, s3)
char *s1, *s2, *s3;
{
  while (*s1)
    *s3++ = *s1++;

  while (*s2)
    *s3++ = *s2++;
}

Figure 2.8: Style Variations on a Program: Programmer 4

Chapter 3

Experimental Setup

We understand that analysis of source code can be performed at many
levels. Because programmers solve problems with regular patterns, we could
analyze the semantic structure of the code to find repeating patterns and
structures. For example, the actions that might be taken on the discovery
that some fatal error has occurred might vary considerably from program to
program and from programmer to programmer.

Formal methods for defining the semantics, or connotative meaning, of
programs and programming languages, such as axiomatic semantics [1] and
denotational semantics [All86, Set89, Hoa69], could be used to search for
identifying patterns in program logic or semantic structure.

Also, consider a programmer who codes with a top-down approach. Rather
than testing each program module individually, his program must be robust
enough so that when it is time to test the program, errors will be quickly
spotted. In these cases, dynamic error checking might be more frequent
than when each module is tested for correctness using the bottom-up
approach.

We can also search for repeating patterns by analyzing the executable
behavior of the program, searching for data flow patterns, or by looking
at user interfaces, looking for repeating patterns in the look and feel of
the program.

In our particular environment, it is not likely that we will be able to
use these methods. Formal methods like axiomatic semantics are
computationally expensive and difficult to automate, and work only for the
simplest programs. Analyzing the executable behavior of a program or
analyzing user interfaces is possible but, as we shall show in this paper,
inappropriate for our environment.

[1] Axiomatic semantics concentrates on the predicates that must hold at
every step of the program, deriving meaning from those predicates that
hold on termination of the program.

3.1 Choice of Programming Language

In this paper we limit ourselves to the stylistic concerns of C source
code. Programmers are comfortable using C, and the language is commonly
used in the academic community and in industry.

Although theoretically possible, it would be impractical to compare style
metrics across different development platforms. We must place some
restrictions on what must remain constant for the authorship analysis
techniques described in this document to work.

The programming language used for our analysis should remain the same.
Among similar languages like C, Modula and Pascal, the same metrics might
be used successfully with similar results. This might not be true if C++
were compared with LISP, 4th Dimension or Prolog.

Consider, for example, the C program shown in figure 3.2, the Prolog
program shown in figure 3.1 and the Emacs Lisp program shown in figure
3.3. These programs belong to three different programming paradigms
(structured programming, logic programming and functional programming),
and there are large differences among them. Many of the metrics we could
use for identifying authorship in one of these programming languages will
be of no use in the others.

3.2 Definition of Software Metrics

The term "software metric" was defined by Conte, Dunsmore and Shen in
[SCS86] as follows:

    Software metrics are used to characterize the essential features
    of software quantitatively, so that classification, comparison, and
    mathematical analysis can be applied.

What we are trying to measure, in establishing the authorship of a
program, are precisely some of these features. Hence, the term software
metric, or simply metric, is more appropriate to describe these special
characteristics than the term "marker" used by Cook and Oman in [OC89] or
the term "identifying feature" used by Spafford and Weeber [WS93].

apply( eq , ArgVal , Env , Result ) :-
    ArgVal=[FstArg,ScnArg],
    number( FstArg),number( ScnArg),
    FstArg=ScnArg,Result=t,!.
apply( eq , Arg , Env , Result ) :-
    Arg=[A1,A2],A1=[quote,X],A2=[quote,Y],
    atom(X),atom(Y),X=Y,
    apply(eq,[X,Y],Env,Result),!.
apply( eq , Arg , Env , Result ) :-
    Arg=[A1,A2],
    not( number(A1)),not( number(A2)),
    eval(A1,Env,P1),eval(A2,Env,P2),
    apply( eq , [P1,P2] ,Env ,Result ),!.
apply( eq , ArgVal , Env , Result ) :-
    ArgVal=[FstArg,ScndArg],
    FstArg=[cons,A1,A2],ScndArg=[cons,B1,B2] ,
    apply( eq , [A1,A2], Env, R1),R1=t,
    apply( eq , [B1,B2], Env, Result),!.
apply( eq , ArgVal , Env , Result ) :-
    ArgVal=[FstArg,ScndArg],
    FstArg=[H1|T1],ScndArg=[H2|T2],
    apply( eq , [H1,H2],Env, R1),R1=t,
    append( T1, T2, NewArg),
    apply( eq , NewArg ,Env, Result),!.

Figure 3.1: Fragment of a Prolog Program

3.3 Sources for the Collection of Metrics

We can collect metrics for authorship detection from a wide variety of
sources:

- Oman and Cook [OC91] collected a list of 236 style rules that can be
  used as a base for extracting metrics dealing with programming style.

- Conte, Dunsmore and Shen [SCS86] give a comprehensive list of software
  complexity metrics.

- Kernighan and Plauger [KP78] give over seventy programming rules that
  should be part of "good" programming practice.

main()
{
  char i,j;
  fifo_buffer *buf;

  buf = new fifo_buffer();
  for(i=0;i<10;i++)
    if(!buf->enter_buffer('A'+i))
      printf("Error: Buffer should not be full!\n");
  if(buf->enter_buffer('A'+10))
    printf("Error: Buffer should be full!\n");
  for(i=0;i<5;i++)
  {
    if(!buf->leave_buffer(&j))
      printf("Error: Buffer should not be empty!\n");
    else
      printf("( %c ) ",j);
  }
}

Figure 3.2: Fragment of a C Program

- D. Van Tassel [Tas78] devotes a chapter to programming style for
  improving the readability of programs.

- Jay Ranade and Alan Nash [RN93] give more than three hundred pages of
  style rules specifically for the "C" programming language.

- Henry Ledgard gives a comprehensive list of "C" programming proverbs
  that contribute to programming excellence [Led87].

Many other sources have influenced our choice of metrics [LC90, BB89,
OC90b, Coo87] but do not contain a specific set of rules, metrics or
proverbs.

3.4 Software Metrics Categories

All these sources give us ample material from which to select the metrics
we will use. However, because of the large number of rules and metrics
available, we have decided to divide our metrics into three categories.

decided to divide our metrics into three categories.

Programming style, as shown by example by Kernighan and Plauger

[KP78], Oman and Co ok [OC90a ] and Ranade and Nash [RN93], is a broad 29 ; ; miscellaneous short RCS commands ; (defun rcs-rlog-file () "Run rlog on this file." (interactive) (setq this-file (if (dired-mode-p) (dired-get-filename) (buffer-file-name))) (message "rlog %s ..." (file-name-nondirectory this-file)) (sit-for 0) (shell-command (concat "rlog " this-file) nil) (message "rlog %s ... done" (file-name-nondirectory this-file)))

(defun rcs-diff-file (arg) "Run rcsdiff on this file with optional revision" (interactive "p") (setq this-file (if (dired-mode-p) (dired-get-filename) (buffer-file-name))) (setq rcs-diff-rev (rcs-get-revision arg)) (message "rcsdiff -r%s %s ..." rcs-diff-rev (file-name-nondirectory this-file)) (sit-for 0) (shell-command (concat "rcsdiff -r" rcs-diff-rev " " this-file) nil) (message "rcsdiff -r%s %s ... done" rcs-diff-rev

(file-name-nondirectory this-file)))

Figure 3.3: Fragment of a Lisp Program

term that encompasses a much greater universe than the one wehavechosen

in this pap er. Programming style generally considers all our three categories.

3.4.1 Programming Layout Metrics

We would like to examine those metrics that deal specifically with the
layout of the program. In this category we include such metrics as the
ones that measure indentation, placement of comments, placement of
brackets, etc. We will call these metrics "programming layout" metrics.

All these metrics are fragile, because the information they measure can
easily be changed using code formatters and pretty printers. Also, the
choice of editor can significantly change some of the metrics of this
type. Emacs, for example, encourages consistent indentation and curly
bracket placement.

Furthermore, many programmers learn their first programming language in
university courses that impose a rigid and specific set of style rules
regarding indentation, placement of comments and the like [MB93].

3.4.2 Programming Style Metrics

Also useful are the metrics that deal with characteristics that are
difficult to change automatically with pretty printers and code
formatters, yet are still related to the layout of the code. In this
category we include those metrics that measure mean variable length, mean
comment length, etc. We will call these metrics "programming style"
metrics.

3.4.3 Programming Structure Metrics

Finally, we would like to examine metrics that we hypothesize are
dependent on the programming experience and ability of the programmer. In
this category we find such metrics as mean lines of code per function,
usage of data structures, etc. We will call these metrics "programming
structure" metrics.

3.5 Metrics Considered

From all the sources mentioned in Section 3.3, we extracted a series of
potentially useful software metrics. Even though we describe these metrics
as indivisible measurements, in practice we might calculate several values
for each of them. For example, for metric STY1a we might calculate a
median and a standard error. We will examine the exact format in greater
detail in later sections.

Also, unless explicitly stated, all the metrics consider only the text
inside function bodies. We do not examine include files or type
declarations, because there is no way of differentiating between those
declarations that are imported from external modules and those that are
native to the programmer.

3.5.1 Programming Layout Metrics

- Metric STY1: A vector of metrics indicating indentation style [RN93,
  pages 68-69]. For example, consider the styles shown in figure 3.4. Our
  metrics should distinguish among them.

if(a==b) {
  b = 1;
}

if(a==b)
  {b = 1;
  }

if(a==b) {
  b = 1; }

if(a==b) { b = 1; }

if(a==b)
{
  b = 1;
}

if(a==b)
  {
    b = 1;
  }

Figure 3.4: Indentation Styles and Placement of Brackets

  This vector of metrics will be composed of:

  - Metric STY1a: Indentation of C statements within surrounding blocks.
  - Metric STY1b: Percentage of open curly brackets ({) that are alone
    on a line.
  - Metric STY1c: Percentage of open curly brackets ({) that are the
    first character on a line.
  - Metric STY1d: Percentage of open curly brackets ({) that are the
    last character on a line.
  - Metric STY1e: Percentage of close curly brackets (}) that are alone
    on a line.
  - Metric STY1f: Percentage of close curly brackets (}) that are the
    first character on a line.
  - Metric STY1g: Percentage of close curly brackets (}) that are the
    last character on a line.
  - Metric STY1h: Indentation of open curly brackets ({).
  - Metric STY1i: Indentation of close curly brackets (}).

  These metrics must only be calculated inside functions, ignoring
  statements and curly brackets outside them. The formatting of data
  structure declarations and definitions might not represent the
  programmer's programming style. Also, there is no automatic method for
  differentiating those declarations and definitions that were created by
  the user from those that are simply imported from outside modules. (A
  sketch showing how metrics of this kind can be computed appears at the
  end of this subsection.)

- Metric STY2: Indentation of statements starting with the "else"
  keyword.

- Metric STY3: In variable declarations, are variable names indented to
  a fixed column? Figure 3.5 shows an example of variables that are
  indented to a fixed column.

main()                               main()
{                                    {
  int i;                               int          i;
  char *cptr;                          char         *cptr;
  static float number;                 static float number;
  short int j;                         short int    j;
  ...                                  ...
}                                    }

Un-indented variable declarations    Indented variable declarations

Figure 3.5: Variables Indented to a Fixed Column

- Metric STY4: What is the separator between the function name and the
  parameter list in function declarations? Possible values are spaces,
  carriage returns or none.

- Metric STY5: What is the separator between the function return type
  and the function name in function declarations? Possible values are
  spaces or carriage returns.

- Metric STY6: A vector of metrics that will help identify the
  commenting style used by the programmer. The vector will be composed
  of:

  - Metric STY6a: Use of borders to highlight comments.
  - Metric STY6b: Percentage of lines of code with inline comments.
  - Metric STY6c: Ratio of lines of block-style comments to lines of
    code.

- Metric STY7: Ratio of blank lines to lines of code [RN93, pages
  70-71].
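To illustrate how cheaply layout metrics of this kind can be computed, the
following sketch (ours; simplified in that it scans a whole file rather
than only function bodies, as the metrics above require) computes metric
STY1b, the percentage of open curly brackets that appear alone on a line:

#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Compute STY1b: the percentage of open curly brackets   */
/* that are alone on a line. For simplicity this version  */
/* scans the whole file rather than only function bodies. */
int main(int argc, char *argv[])
{
    FILE *fp;
    char line[1024], *p;
    int total = 0, alone = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: sty1b file.c\n");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        int braces = 0, other = 0;
        for (p = line; *p != '\0'; p++) {
            if (*p == '{')
                braces++;
            else if (!isspace((unsigned char)*p))
                other++;
        }
        total += braces;
        if (braces == 1 && other == 0)
            alone++;
    }
    fclose(fp);
    if (total > 0)
        printf("STY1b = %.1f%%\n", 100.0 * alone / total);
    return 0;
}

The other percentages in the STY1 vector can be computed from the same
line-by-line scan.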

3.5.2 Programming Style Metrics

- Metric PRO1: Mean program line length (characters per line) [BM85].

- Metric PRO2: A vector of metrics that will consider name lengths.

  - Metric PRO2a: Mean local variable name length.
  - Metric PRO2b: Mean global variable name length.
  - Metric PRO2c: Mean function name length.
  - Metric PRO2d: Mean function parameter length.

- Metric PRO3: A vector of metrics that will tell us about the naming
  conventions chosen by the programmer. This vector will consist of:

  - Metric PRO3a: Some names use the underscore character.
  - Metric PRO3b: Use of temporary variables [2] that are named XXX1,
    XXX2, etc. [KP78], or "tmp," "temp," "tmpXXX" or "tempXXX" [RN93].
  - Metric PRO3c: Percentage of variable names that start with an
    upper case letter.
  - Metric PRO3d: Percentage of function names that start with an
    upper case letter.

[2] It can be argued that all local variables are temporary and no global
variable is temporary. However, in this paper we will follow the
convention that a variable is temporary if there is no direct association
between its name and its semantic meaning.

- Metric PRO4: Ratio of global variable count to mean local variable
  count. This metric could potentially tell us something about the
  programmer's propensity to use global variables.

- Metric PRO5: Ratio of global variable count to lines of code. This
  variation of the previous metric might give us a better measure of the
  frequency of usage of global variables.

- Metric PRO6: Use of conditional compilation. Here we specifically
  search for the "#ifdef" keyword at the beginning of code lines.

- Metric PRO7: Preference for "while," "for" or "do" loops. All three
  can be used for the same purposes [KR85] (see the sketch at the end of
  this list).

- Metric PRO8: Does the programmer use comments that are nearly an echo
  of the code [KP78, page 143] [RN93, page 82]?

- Metric PRO9: Type of function parameter declaration. Does the
  programmer use the standard format or the ANSI C format? This metric is
  only significant if the programmer has access to an ANSI C compiler.
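To make the interchangeability behind metric PRO7 concrete, the fragment
below (our illustration) computes the same sum with each of the three C
loop constructs; a programmer's habitual choice among these equivalent
forms is exactly what the metric records:

/* The same computation expressed with each of the three C loop     */
/* constructs; metric PRO7 records which form a programmer prefers. */
int sum_while(int n)
{
    int i = 0, s = 0;
    while (i < n)
        s += i++;
    return s;
}

int sum_for(int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++)
        s += i;
    return s;
}

int sum_do(int n)
{
    int i = 0, s = 0;
    if (n > 0)
        do
            s += i++;
        while (i < n);
    return s;
}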

3.5.3 Programming Structure Metrics

 Metric PSM1: Percentage of \int" function de nitions.

 Metric PSM2: Percentage of \void" function de nitions.

3

 Metric PSM3: Program uses a debugging symbol or keyword . We

would sp eci cally b e lo oking at identi ers or constants containing the

words \debug" or \dbg" [RN93, pages38{53]. Figure 3.6 shows three

common debugging styles in C.

3

Debugging is dicult. Many non standard techniques havebeendevelop ed [RN93],

and we cannot hop e to identify all forms of debugging symb ols. However, there are some

techniques that are widely used and we will concentrate on these. 35 #define DBG(x) printf(x) #define DEBUG 1 main(argc, argv) int argc; main() { main() { char *argv[]; DBG("main\n"); #ifdef DEBUG { Parse_Input(); printf("main\n"); if( strcmp(argv[1] == 0 ) DBG("Finished\n"); #endif debug = 1; } Parse_Input(); else #ifdef DEBUG debug = 0; DEBUG("Finished\n"); if(debug) printf("main\n"); #endif Parse_Input(); } if(debug) printf("Finished\n");

}

Figure 3.6: Examples of Debugging Styles

- Metric PSM4: The assert macro is used.

- Metric PSM5: Lines of code per function [KP78, BM85].

- Metric PSM6: Ratio of variable count to lines of code. This metric
  could identify those programmers who tend to avoid reusing variables,
  creating new variables for each loop control variable, etc.
  Alternatively, it could also help identify differences such as

      root = (-1 * b + sqrt (b*b - 4*a*c)) / (2*a)

  versus

      left = -1 * b
      right = b*b - 4*a*c
      sqright = sqrt (right)
      bottom = 2*a
      root = (left + sqright) / bottom

  in which the second programmer has taken 5 lines of code to do what the
  first programmer did in 1 line of code.

  We can define variable count either by looking at each of the
  statements and counting variables as they are used, or simply by
  looking at the variable declarations and counting the number of
  variables declared. The former ignores variables declared but not used
  and includes global variables. The latter counts all local variables
  declared and ignores global variables. We have chosen to gather our
  variable counts by looking at the declarations, ignoring global
  variables. This allows us to exclude from our calculations those global
  variables that are imported from external modules like Motif and GKS.

• Metric PSM7: Percentage of global variables that are declared static.

• Metric PSM8: The ratio of decision count to lines of code. To simplify the computation of this metric, we have chosen to modify the definition of decision count given in [SCS86]. We do not count each logical operator inside a test as a separate decision. Rather, each instance of the if, for, while, do and case statements and the ? operator increases our decision count by one.
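As a hypothetical worked example of this modified definition, the function below has a decision count of four: the for, the if, the while and the ? operator each contribute one, while the && inside the if test contributes nothing. With roughly ten lines of code, metric PSM8 would be about 0.4 for this fragment:

int clamped_total(int a[], int n, int limit)
{
    int i, total = 0;

    for (i = 0; i < n; i++) {             /* decision 1 */
        if (a[i] > 0 && a[i] < limit)     /* decision 2: the "&&" is not counted */
            total += a[i];
        while (total > limit)             /* decision 3 */
            total -= limit;
    }
    return (total >= 0) ? total : -total; /* decision 4 */
}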

• Metric PSM9: Is the goto keyword used? Surprisingly, software designers and programmers still rely on these [BM85].

• Metric PSM10: Simple software complexity metrics offer little information that might be application independent [OC89]. The metrics that we could consider are: cyclomatic complexity number, program volume, complexity of data structures used, mean live variables per statement, and mean variable span [SCS86].

• Metric PSM11: Error detection after system calls that rarely fail. Some programmers tend to ignore the error return values of system calls that are considered reliable [GS92, page 164]. Thus, a metric can be obtained from the percentage of reliable system calls whose error codes are ignored by the programmer. Also, some programmers tend to overlook the error codes returned by system calls that should never have them ignored (like "malloc"). We can define this metric as a vector of the following items, illustrated by the sketch after this list:

– Metric PSM11a: Are error results from memory related system calls ignored? Specifically, we would be looking at malloc(), calloc(), realloc(), memalign(), valloc(), alloca() and free().

– Metric PSM11b: Are error results from I/O routines ignored? Specifically, we would be looking at open(), close(), dup(), lseek(), read(), write(), fopen(), fclose(), fwrite(), fread(), fseek(), getc(), putc(), gets(), puts(), printf() and scanf().

– Metric PSM11c: Are error results from other system calls ignored? We would be looking at chdir(), mkdir(), unlink(), socket(), etc.
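A minimal sketch of what these sub-metrics would flag (the fragment is hypothetical, not drawn from the experimental data): the first call to fwrite() ignores its return value and would count against metric PSM11b, while the second checks it and would not:

#include <stdio.h>

void save_twice(FILE *fp, const char *buf, size_t len)
{
    /* error result ignored: counted by metric PSM11b */
    fwrite(buf, 1, len, fp);

    /* error result checked: not counted */
    if (fwrite(buf, 1, len, fp) != len)
        fprintf(stderr, "write failed\n");
}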

• Metric PSM12: Does the programmer rely on the internal representation of data objects? This metric would check for programmers relying on the size and byte order of integers, the size of floats, etc.

• Metric PSM13: Do functions do "nothing" successfully? Kernighan and Plauger in [KP78, pages 111–114] and Jay Ranade and Alan Nash in [RN93, page 32] emphasize the need to make sure that there are no unexpected side effects in functions when these must "do nothing." In this context, functions that "do nothing" successfully are functions that correctly test for boundary conditions on their input parameters. Consider the case of the function in figure 2.8, where the program has that type of bug: the result of the program is undefined when both input strings are empty. However, we must note that determining the correctness of an arbitrary function is an undecidable problem [HU79].
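The function of figure 2.8 is not reproduced here; the hypothetical function below has an analogous flaw. It does not "do nothing" successfully: when its argument is the empty string it inspects s[-1], and its behavior is undefined:

#include <string.h>

/* flawed: missing the boundary test "n > 0" before indexing s[n - 1] */
void chop_newline(char *s)
{
    size_t n = strlen(s);

    if (s[n - 1] == '\n')
        s[n - 1] = '\0';
}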

• Metric PSM14: Do comments and code agree? Kernighan and Plauger write in [KP78] that "A comment is of zero (or negative) value if it is wrong." Ranade and Nash [RN93, page 89] devote a rule to the truth of every comment. Even if the comments were initially accurate, it is possible that they became inaccurate during the maintenance cycle of a program. Because we cannot determine the stage of development where an incorrect comment was introduced, we will consider all incorrect comments in this metric.⁴

⁴ Deciding that a comment is wrong can only be done manually, by careful examination of the source code. Because it involves the semantic analysis of English sentences, it is unlikely that this process will be automated soon.

• Metric PSM15: More than any other type of software metric, those that deal with the development phase of a project would help to identify the authorship of a program. Consider, for example, whether comments are placed before, during or after the development of a program, the choice of editor, the choice of compiler, the usage of revision control systems, the usage of development tools, etc. Unfortunately, this information is not readily available. Test programs, intermediate versions, debugging code and the like are discarded after the final version of the program is finished.

• Metric PSM16: Quality of software. We could use software metrics that deal with the quality of software to assess the level of experience of the programmer. Typically, software quality metrics are related to software development standards and try to measure the reliability and robustness of software.

These metrics will not be useful. At best, we would be measuring the care that the programmer has taken to develop a piece of code, as well as the level of expertise of the programmer. Furthermore, it is possible for an experienced programmer to get low software quality scores and for a beginner to get high scores (if he followed a textbook algorithm for his program).

3.6 Computation of Metrics

To find the most accurate metrics we need a series of tools that can collect the subset of metrics mentioned in Section 3.5. Because the available tools calculate only software complexity metrics, we developed a series of programs designed specifically for our purpose and ran them through a series of controlled experiments.

3.6.1 C Source Code Analyzer

Although languages such as Awk can be used for extracting all the necessary metrics [BM85], we avoided the Awk and Perl [WS90] programming languages for calculating most of the metrics dealing with programming structure, because existing tools in compiler construction make it easier to write C code, and our programs benefit from the reuse of code that was already tested.

At the heart of the software tools is a software analyzer built for the lcc C compiler front end developed by David R. Hanson of Princeton University [Han91]. An early version of the software analyzer was written by Goran Larsson. This software analyzer collects whatever information it can from the program being analyzed after it has passed through the C preprocessor (cpp), and produces an auxiliary list of semantic indentation levels used by other programs, mainly those that calculate metrics regarding indentation.

This list of semantic indentation levels is needed because some C constructs are indented by users following rules different from what the compiler expects. For example, consider the programs in figure 3.7. The semantic indentation levels generated are the indentation levels according to the compiler. A separate program will later produce from these levels the correct indentation information corresponding to the program as viewed by the user.

3.6.2 Lexical Analyzers

Once the programming structure metrics have been calculated using the modified version of the lcc C compiler, a series of Perl programs is used to collect the metrics that depend on the information discarded by the C preprocessor. Indentation, commenting style and line lengths are examples of the measurements collected by these scripts.
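The scripts themselves are not reproduced in this document. As a rough sketch of the kind of measurement they perform (written here in C rather than Perl, and with a hypothetical program name), the following program computes the mean line length of a source file, one of the layout measurements that must be taken before the preprocessor discards the information:

#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fp;
    int c;
    long chars = 0, lines = 0;

    if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: linelen file.c\n");
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;        /* count lines */
        else
            chars++;        /* count characters, excluding newlines */
    }
    fclose(fp);
    if (lines > 0)
        printf("mean line length: %.2f\n", (double)chars / lines);
    return 0;
}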

3.6.3 Statistics Collection

Because the information produced by the above-mentioned programs is detailed and voluminous, we need a separate series of Perl programs to collect this data, summarize it and generate the statistics mentioned herein. These programs also generate the formatted data necessary to run statistical analysis and graphic visualization tools.

3.7 Experimental Stages

The experimental data for this paper was gathered in three distinct stages: a preliminary stage, a pilot experiment and a full experiment.

3.7.1 Preliminary Stage

The first stage, or preliminary stage, helped us determine the proper methods for calculating the metrics, and coexisted with the tool development phase. For this phase we considered the following subset of the metrics defined in Section 3.5.

We considered the following Programming Layout Metrics defined in Section 3.5.1: mean and standard error for STY1a, STY1b, STY1c, STY1d, STY1e, STY1f and STY1g; mean and standard error for STY1h; mean and standard error for STY1i; mean and standard error for STY2; and STY3, STY4, STY5, STY6a, STY6b, STY6c and STY7.

Also considered were the following Programming Style Metrics defined in Section 3.5.2: mean and standard error for PRO1; mean and standard error for PRO2a; mean and standard error for PRO2b; mean and standard error for PRO2c; mean and standard error for PRO2d; and PRO3a, PRO3b, PRO3c, PRO3d, PRO4, PRO5, PRO6, PRO7, PRO8 and PRO9.

Finally, we considered the following Programming Structure Metrics defined in Section 3.5.3: PSM1, PSM2, PSM3, PSM4, PSM5, PSM6, PSM7, PSM8, PSM9 and PSM14. We chose to exclude all metrics dealing with software complexity (metric PSM10) because there is evidence that these are ill suited for our purpose [OC89].

Metrics PSM11a and PSM11b were excluded because we cannot guarantee that all programs tested use such system calls, and because it is impossible to detect whether the error result of most system calls is being ignored without tracing the execution of the program.

For example, all three program fragments of figure 3.8 check the result of the "malloc" system call for a NULL value (the error code). This can be verified easily for fragment 1. It is more difficult in fragment 2, as we have to assume that the routine "fill_buffer" does indeed check for NULL pointers. However, the verification for fragment 3 is complex. This routine, designed to append nodes to an existing linked list, checks for NULL values returned by "malloc" inside the recursive step.

Metric PSM12 was also excluded from our analysis because the notion of "internal representation of data objects" is not well defined, and because the metric cannot be extracted without extensive data flow analysis. Consider, for example, the programs shown in figure 3.9. Detecting that the first program relies on the internal representation of the character data type (ASCII in this case) is simple. More complex analysis must be performed to detect that the second example relies on a long integer being 32 bits or longer.

For reasons explained in Section 3.5, we have also chosen to exclude metrics PSM13 and PSM15 from our analysis.

The tool that calculates these metrics was applied to a few programs written by graduate students at Purdue University. Other than showing that the metrics were properly calculated, little information can be extracted from the results. The programs were of different lengths, addressed problems in different domains and were sometimes developed over several years. Furthermore, they were not part of a controlled experiment.

3.7.2 Pilot Experiment

To test the metrics chosen in the Preliminary Stage, a pilot experiment was performed with a small number of programmers, each of whom wrote three short and simple programs. To determine the effect of problem domains on our analysis, the programs were oriented to the three areas where we thought we could have the greatest variations in style: computationally intensive programs, I/O intensive programs and data structure intensive programs.

The programmers who volunteered to code these programs were members of a security seminar that met regularly at Purdue University during the summer and fall of 1993. The descriptions of the problems and the resulting programs were distributed and collected electronically. The descriptions of these programs were:

Program 1 This program must fill a one dimensional array (or vector) of size 1000 with random integer numbers in the range 0–100 and then perform the following actions:

1. Sort the elements in ascending order. The sorting method is not important.

2. Compute the sum of the numbers in the array.

3. Compute the median (or mean) of the numbers in the array.

4. Calculate the frequency of each of the 101 possible values (i.e. the number of zeroes in the array, the number of ones in the array, the number of twos in the array, etc.)

5. Print out this information to the screen.

Program 2 The program should store internally the following list:

Age  Weight  Stretch  Muscle      Time
17   140     Neck     Trapezius     5
17   150     Back     Latissimus    6
17   160     Chest    Pectorals     7
17   170     Arm      Tricep        8
17   180     Side     Obliques      9
17   190     Thigh    Quadriceps   10
18   140     Neck     Trapezius     6
18   150     Back     Latissimus    7
18   160     Chest    Pectorals     8
18   170     Arm      Tricep        9
18   180     Side     Obliques     10
18   190     Thigh    Quadriceps   11
19   140     Neck     Trapezius     7
19   150     Back     Latissimus    6
19   160     Chest    Pectorals     7
19   170     Arm      Tricep        8
19   180     Side     Obliques      9
19   190     Thigh    Quadriceps    9
20   140     Neck     Trapezius     4
20   150     Back     Latissimus    6
20   160     Chest    Pectorals     8
20   170     Arm      Tricep        9
20   180     Side     Obliques     10
20   190     Thigh    Quadriceps    9

The program should then proceed to give the user a menu with the following choices:

1.- Enter your name
2.- Enter your weight
3.- Enter your age
4.- Check time
5.- Check muscle
6.- Quit

Option 1 enters the name of the user for future reference. Option 2 enters the user's weight (used as an index to the table). Option 3 enters the user's age (used as an index to the table). Option 4 prompts the user for a body "part" (i.e. Neck, Back, Chest, Arm, Side or Thigh) and prints the time corresponding to the age, weight and body part entered. Option 5 prompts the user for a body "part" (i.e. Neck, Back, Chest, Arm, Side or Thigh) and prints the muscle group corresponding to the body part. All responses from the program should include the name of the user, his age and weight.

Program 3 The program must read a series of integer numbers from a file and print them in reverse order by using an integer stack. The stack must be implemented using a linked list. You must create routines for pushing elements onto the stack, popping elements from the stack, creating a stack and deleting a stack. The program must print out the reverse stack right after every element read. For example, the following is a sample run:

Read element (1)
Reverse stack is:
1
Read element (2)
Reverse stack is:
2
1
Read element (4)
Reverse stack is:
4
2
1
...

3.7.3 Preliminary Analysis

All metrics dealing with global variables were useless in our analysis of the programs, as none of the programs examined made use of such variables. Furthermore, subsequent informal polls and the examination of several hundred system utilities and archived programs revealed that it was not likely that many of the programs we would examine in the next stage of our experiment would make heavy use of global variables. Hence, metrics PRO2b, PRO4, PRO5 and PSM7 were ignored in subsequent analysis.

Not surprisingly, metric PSM5, lines of code per function, shows large variations in all our test cases. We do note that this metric appears to be directly correlated with the problem domain. Program 3 has low values for this metric (two of these are the lowest values for the programmer), and this is to be expected: the routines needed to do the job in this application (create stack, pop, push and delete stack) are naturally short, as can be seen in [Coo87, pages 374–375] and [NS86, page 100]. Program 2 has the highest values for this metric, which is logical because this program deals with I/O and user interaction.

Metrics PSM1 and PSM2 also show large variations. However, they give useful information for one programmer: programmer 3 never uses "void" function definitions; all other programmers do. Hence, metric PSM2 might prove valuable in the future. Metric PSM1 might not be valuable because it might also be correlated with the domain of the application. Consider, for example, program 3: most of the functions required for the program must use a return type different from "int". Program 2, however, is more likely to have "int" or "void" function definitions.

Surprisingly, metric STY6c, the ratio of lines of block-style comments to lines of code, shows large variations, but the magnitude of the numbers observed remains constant. Hence it is likely that this metric can be reformulated to consider ranges (i.e. 0%–20%, 20%–30%, etc.). Variation can also be seen in metric STY6b, and this might prove useful in further analysis.

Because the use of goto statements has been virtually banned from the academic community [Dij68, SCS86, RN93, Set89], and because faculty members oppose their use in programming courses, we have chosen to eliminate metric PSM9 from further analysis. The probability of finding these statements in students' code at Purdue University is negligible.

Metrics STY1b through STY1g, dealing with the placement of curly brackets ({}), are stable. Metric STY1h, however, has proven to be unstable because it is significant only if the curly bracket is consistently the first character on the line. Similar arguments can be made for metric STY1j. Hence, we eliminated these metrics from further analysis.

In this experiment, only a small number of small programs were considered; hence, the statistical base is not large enough to make further observations, and little else can be inferred from the results gathered.

The user's perspective of the indentation of the "else if" clause:

main()
{
    int i;

    for(i=0;i<3;i++) {
        if( i == 0 )
            statement;
        else if( i == 1 )
            statement;
        else if( i == 2 )
            statement;
        else
            statement;
    }
}

The compiler's perspective of the indentation of the "else if" clause:

main()
{
    int i;

    for(i=0;i<3;i++) {
        if( i == 0 )
            statement;
        else
            if( i == 1 )
                statement;
            else
                if( i == 2 )
                    statement;
                else
                    statement;
    }
}

Figure 3.7: Differences in Indentation Levels

Fragment 1:

if ( (ptr = malloc(sizeof(struct test))) == NULL ) {
    fprintf(stderr,"Error allocating memory\n");
}

Fragment 2:

ptr = malloc(MAX_SIZE);
if( fill_buffer(ptr) == SYSERR )
    panic("Can't fill buffer!");

Fragment 3:

int add_to_list(ptr,num)
struct node *ptr;
int num;
{
    struct node *p;

    if( ptr == NULL )
        panic("Null pointer! Panic");
    if( num <= 1 )
        return;
    p = malloc(sizeof(struct node));
    ptr->next = p;
    add_to_list(p,num-1);
}

Figure 3.8: Error Detection After System Calls - An Example

main() {
    char c;

    c = getchar();
    if( (c >= 'A') && (c <= 'Z') )
        c = c + ('a' - 'A');
}

main() {
    long int i,j;

    for(i=0,j=1;i<32;i++)
        j = j<<1;
}

Figure 3.9: Dependency on the Representation of Data Objects

Chapter 4

Full Experiment

4.1 Introduction

Once the preliminary experiment showed that the desired set of metrics could be analyzed, we designed and executed a larger, more formal experiment in which to test our prototype.

4.2 Experiment Setup

For this experiment, a series of programs were collected from a total of 29 students, staff and faculty members at Purdue University. The distribution of the programs is shown in table 4.1.

We included programs from a wide variety of programming styles and for different problem domains. Roughly one third of the student programs were programming assignments from a graduate level networking course, one third were programming assignments from a graduate level compilers course, and one third were from miscellaneous graduate level courses, including databases, numerical analysis and operating systems.

We collected several hundred programs by undergraduate students, but almost all were either too small to be useful in our analysis, relied heavily on tools like LEX and YACC (which our software analyzer rejects because they do not conform to ANSI C specifications), were modifications to existing compilers and operating systems, or provided only one program per programmer.

Table 4.1: Distribution of Programs for the Complete Experiment

Group Identification                                        Number of Programs
Students 1 (Projects for the Fall 1993 term)                        57
Students 2 (Programs developed for other terms)                      6
Pilot 1 (Programs developed by students for the
    pilot experiment)                                               18
Pilot 2 (Programs developed by experienced programmers
    for the pilot experiment)                                        6
Faculty (Miscellaneous programs by faculty members)                  7
TOTAL                                                               88

Of the programs submitted by the faculty members, half are oriented towards numerical analysis and half towards compiler construction and software engineering.

4.3 Statistical Model Used for the Analysis

There are two statistical methods that could be used to analyze the metrics gathered. Cluster analysis, as used by Oman and Cook in [OC89], can only be used if we discretize the values of our metrics. Unfortunately, it is difficult to find ranges for each of the metrics that could be used for any group of programmers without loss of accuracy.

The second statistical analysis method we can use, and the one chosen for our analysis, is discriminant analysis. This method, described in [SAS, JW88], is a multivariate technique concerned with separating observations and with allocating new observations into previously defined groups.
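As a sketch of the underlying rule (this is the standard linear discriminant formulation found in texts such as [JW88]; the notation is ours and is not taken from the tools used), an observation $\mathbf{x}$, here the vector of metric values for a program, is allocated to the group $g$, here a programmer, that maximizes the score

\[
  d_g(\mathbf{x}) \;=\; \boldsymbol{\mu}_g^{T}\,\boldsymbol{\Sigma}^{-1}\mathbf{x}
  \;-\; \tfrac{1}{2}\,\boldsymbol{\mu}_g^{T}\,\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_g
  \;+\; \ln p_g ,
\]

where $\boldsymbol{\mu}_g$ is the mean metric vector for group $g$, $\boldsymbol{\Sigma}$ is the pooled covariance matrix, and $p_g$ is the prior probability of group $g$. Under this rule, a new program is attributed to the programmer whose historical metric profile it most resembles.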

4.4 Preliminary Analysis and Elimination of Metrics

Not all metrics calculated proved to be useful in further analysis. As mentioned in Section 3.7.3, all metrics dealing with global variables were eliminated because most programs submitted did not use these variables. Likewise, those metrics dealing with "goto" statements and "assert" macro use were eliminated because none of the programs submitted used them.

We want to keep those metrics that show little variation between programs (for a specific programmer) and those metrics that show large variations among programmers. Unfortunately, analysis of the metrics collected shows that these two criteria are not necessarily correlated.

Initially, we calculated the standard error by programmer for every metric, and eliminated those that showed large variations, because they identify those style characteristics where the programmer is inconsistent.

Surprisingly, most of the metrics that showed large variations among programmers were eliminated as well. The performance of our statistical analysis with the remaining metrics was discouraging, with only twenty percent of the programs being classified correctly.

The step discrimination tool provided by the SAS program [SAS] should theoretically be capable of eliminating bad metrics from the statistical base. Unfortunately, this tool was not helpful because it failed to eliminate any of the metrics from our set.

To resolve this issue, we decided to build a tool that would help us visualize the metrics collected, using Matlab version 4.1. The resulting tool presented, for each continuous metric (i.e. real valued metric), two graphs that showed the variation of the metric within programs for each programmer and the distribution of values for each metric over all programmers.

For each discrete metric (i.e. boolean metrics and set metrics), the tool produced a graph that showed the consistency of each programmer for that metric. In these figures, vertical lines represent a programmer "jumping" from one value to the next in two consecutive programs. Hence, a good discrete metric is one that shows variations in values and no "jumps."

With this analysis, we chose a small subset of our metrics for the final statistical analysis. Specifically: metric PRO1M; the means for PRO2a, PRO2b and PRO2c; PRO3d, PRO5, PSM1, PSM6; the mean for STY1a; STY1b, STY1c, STY1d, STY1e, STY1f; the mean for STY1i; the mean for STY2; STY6b, STY6c, STY7, PRO8, PSM3, STY4 and STY5.

Table 4.2: Classification by Programmer

Programmer   % Classified      Programmer   % Classified
 1           100.00            16           100.00
 2           100.00            17           100.00
 3            33.00            18             0.00
 4           100.00            19            71.00
 5           100.00            20           100.00
 6            77.00            21           100.00
 7            77.00            22            33.00
 8           100.00            23           100.00
 9           100.00            24            75.00
10           100.00            25            20.00
11            77.00            26             0.00
12           100.00            27           100.00
13           100.00            28           100.00
14           100.00            29            25.00
15            50.00


4.5 Experiment Results

4.5.1 Success Rate

The success rate of our experiment is 73%. This means that of all the programs analyzed, 73% were correctly assigned to their original programmers. Individual percentages of correctly classified programs are shown in table 4.2.

When colleagues were shown this table for the first time, the first question asked was: "Are all the programmers that the system identified correctly 100% of the time related? Are the backgrounds of these programmers similar?" Table 4.3 shows the classification of those programmers whose programs were always identified correctly. Initially we were surprised to see that seasoned programmers (programmers 10 and 13), a faculty member (programmer 16), and graduate students of Computer Science were all mixed in this category. Also, we notice that:

1. The programs for the faculty member (three programs averaging 300 lines of code each) were developed over several years and address different problem domains.

2. Three of the six programmers who helped with the development of the programs for the pilot study were correctly classified 100% of the time. As stated in Section 3.7.2, the three programs each programmer developed addressed different problem domains.

3. The programmers who were correctly classified have different backgrounds. This result was unexpected because, from our experience grading projects, electrical engineering students are less consistent in their programming style than computer science students, and these in turn are less consistent about their programming style than faculty members. One could thus expect to find a greater percentage of matches among faculty members and computer science students.

4.5.2 Success Rate by Programmer

A small number of programmers were misclassified 30% of the time or less (see Table 4.4). In this category we have the programmer who provided the most programs (seven programs: three for the pilot study, a multi-user chatting program, a lexical analyzer, and two database tools). The rest of the programmers in this category had between three and four programs.

For the programmers who were classified correctly less than 50% of the time (see Tables 4.5 and 4.6), we looked at their code to find out why we failed to classify them (two programmers were never classified correctly). We were surprised to find that they had varied their programming style considerably from program to program in a period of only two months.

As an example of how inconsistent these programmers are, consider the indentation patterns for programmer 18, as shown in Figure 4.1. Not only does the indentation style vary wildly, but sometimes the indentation style has no relationship to the semantic indentation levels, as seen in program fragments 4 and 5.

Table 4.3: Programmers Classified with 100% Accuracy

Programmer   Category     Programmer Class
 1           Students 1   Graduate student in Computer Science
 2           Pilot 1      Graduate student in Electrical Engineering
 4           Students 1   Graduate student in Computer Science
 5           Students 1   Graduate student in Computer Science
 8           Students 1   Graduate student in Computer Science
 9           Students 1   Graduate student in Electrical Engineering
10           Pilot 2      System administrator and security guru
12           Students 1   Graduate student in Electrical Engineering
13           Pilot 2      Graduate student in Computer Science
14           Students 1   Graduate student in Computer Science
16           Faculty      Faculty member of the department of computer
                          science; area of research is numerical analysis
17           Students 1   Graduate student in Electrical Engineering
20           Students 1   Graduate student in Electrical Engineering
21           Students 1   Graduate student in Computer Science
23           Students 1   Graduate student in Mathematics
27           Students 1   Graduate student in Computer Science
28           Students 1   Graduate student in Computer Science

Table 4.4: Programmers Classified with 70%–100% Accuracy

Programmer   Category     Programmer Class
 6           Students 1   Graduate student in Electrical Engineering
 7           Students 1   Graduate student in Electrical Engineering
11           Pilot 1      Graduate student in Computer Science
19           Pilot 1,     Graduate student in Computer Science
             Students 2
24           Students 1   Graduate student in Computer Science

Table 4.5: Programmers Classified with 20%–50% Accuracy

Programmer   Category     Programmer Class
 3           Students 1   Graduate student in Electrical Engineering
15           Students 1   Graduate student in Computer Science
22           Students 1   Graduate student in Computer Science
25           Pilot 1,     Graduate student in Computer Science
             Students 2
29           Faculty      Faculty member of the department of computer
                          science; area of research is software engineering
                          and compiler construction

Other misclassified programmers showed a consistent programming style. This fact is a clear indication that the metrics chosen for our experiment were not comprehensive enough to distinguish among them, even though their programs are far from identical, as later inspection of their code revealed. For programmer 26, for example, we could find several characteristics that remained consistent throughout:

1. The programmer used the RCS revision control utilities on all his programs, which of course show his login id.

2. All his comments are unindented one-line comments.

3. For every system call that resulted in an error, an error message was printed to the standard error stream using either the fprintf or perror calls, as in the sketch below.
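A hypothetical fragment in the style just described (this is our sketch, not programmer 26's actual code):

#include <stdio.h>
#include <stdlib.h>

/* every failing call reports to the standard error stream */
FILE *open_config(const char *path)
{
    FILE *fp = fopen(path, "r");

    if (fp == NULL) {
        perror("open_config");   /* report via perror */
        exit(1);
    }
    return fp;
}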

Table 4.6: Programmers Classified with 0% Accuracy

Programmer   Category     Programmer Class
18           Students 1   Graduate student in Computer Science
26           Students 1   Graduate student in Computer Science

4.5.3 Consistency of Classification

Our experiment also helped us predict the performance of the metrics when a program not included in the original database is considered. For each program, we removed it from the database and then asked SAS to classify it. As expected, the results average 73%. However, this stage of our experiment shed some light on the consistency of the misclassification. Mainly, some programmers are misclassified consistently. Programmer 18 was misclassified consistently as programmer 12, programmer 19 as programmer 17, programmer 11 as programmer 18, and programmer 26 as programmer 9. We can conclude that even though the metrics are not good enough to classify these programmers correctly, the misclassification is not random. A more refined set of metrics could help distinguish among these programmers.

4.5.4 Performance of Metrics

The statistical analysis tools used provide little support for ranking the performance of individual metrics. The removal of any one metric from the analysis can have negative or positive effects, independent of the quality of the metric. We can illustrate this point using a simple example. Consider the metrics shown in Figure 4.2, where the results of two metrics are plotted for three programs from each of four users.

The attentive reader might notice that both metrics show large variations for users three and four, and that the general usefulness of these metrics individually is limited. However, the combination of these metrics provides unequivocal information as to the authorship of the programs tested, because both measurements are clustered together, with low or high values, or dispersed.

Figure 4.1: Indentation Style Change for Programmer 18 (five program fragments, each indented in a different style)

Figure 4.2: Interdependency of Metrics (Metric 1 plotted against Metric 2 for the three programs of each of Users 1 through 4)

Chapter 5

Conclusions

5.1 Conclusions

5.1.1 Quality of the Experiment

The first issue that must be resolved is the quality of the data collected for our experiment. Is the data collected representative of the programming methodologies we are likely to see in real production environments? As stated in section 2.2.2, we noted that the experiment performed by Cook and Oman was fundamentally flawed because the programs analyzed were most likely cleaned and beautified, and because the algorithms considered addressed the same problem domain.

5.1.2 Existence of Identifying Patterns

The experiments we have performed for this paper support the theory that it is possible to find a set of metrics that can be used to classify programmers correctly. Close visual examination of the source code provided by all the programmers involved in our experiment reveals that programmers tend to show repeating patterns in their programs.

Clearly it is possible to identify ownership of a program by examining some finite set of metrics. As expected, programmers are skillful with a limited set of constructs, mainly those that are well known to them and that allow them to write programs faster and more reliably. It would be unrealistic to assume that any programmer can develop programs efficiently

and correctly using an unfamiliar programming style. This applies not only to the structure of the programs, but also to their look and feel; metrics such as, for example, the ratio of blank lines to lines of code can indeed remain surprisingly constant. Programmers organize information on the screen such that logically independent portions of the code can be easily recognized.

Consider, for example, the subset of the metrics collected for our experiment that is shown in table 5.1. Even a first glance at these figures will show the repeating patterns that have allowed our statistical analysis to classify most of the programs considered. Programmer 10 has, for example, consistent patterns in metrics STY1aM, STY1d, STY1e, STY6b and STY5. Programmer 11 is also consistent in the same metrics, showing lower values for metrics STY1aM and STY5 and higher values for metric STY6b.

Table 5.1: Experimental Metrics Subset

                                 Programmer
Metric      10    10    10    11    11    11    16    16    16
PRO2b        0     4     0     0     0     0     1     0     0
PRO2c        4   8.3     9     4     4   7.2     8    14   6.5
PRO3d        0     0     0     0     0  0.33   0.5     1     1
PSM1        50     0     0   100   100 66.67   100   100   100
STY1aM       8  7.76     8  1.72  2.41  1.79     4     4  3.97
STY1d      100   100   100     0     0     0 23.81 41.67 93.75
STY1e      100   100   100   100   100   100 23.81 41.67  87.5
STY2M        0     0     0     0   2.5     1     0     0  2.78
STY6b        0     0     0  9.09 16.85  6.32 10.53 10.34  1.83
STY5         2     2     2     1     1     1     1     1     1

5.1.3 Performance of the Metrics Chosen

Even though we are satisfied with our choice of metrics, the results presented in this paper clearly show that we will not be able to classify all possible programmers correctly with this set of metrics. Experience and logic tell us that a small, fixed set of metrics is not sufficient to detect ownership of every program for every programmer.

By no means do we claim that the set of metrics examined is the only one that might yield stable metrics. During the data collection and analysis of the experiment, we noted that the following metrics might be of considerable use in future experiments:

1. Use of revision control system headers. We were surprised to see that a considerable portion of the programmers examined used the automatic identification and log features of the RCS Revision Control System. As an added bonus, such identification strings will provide the login name of the programmer in question.¹

2. Another metric that could have been used successfully is the use of literals in code versus the use of global constants.

3. One programmer's idea of debugging statements was commenting out the print statements. This was done consistently, and it might provide another useful metric.

We have not found any experimental evidence of a relation between a program's data structures, its length of code, and its efficiency. If software metrics that measure this relationship do exist, they are most likely of limited value for our purposes. As stated before, we are convinced that the analysis of software complexity metrics, including the analysis of the complexity of data structures, will not yield interesting results.

5.1.4 Evolution of Programming Style

As mentioned in section 2.3, we do not expect that the metrics calculated for any given programmer would remain an accurate tag for that programmer for a long time, even though in our experiment we have correctly identified the only programmer who provided code developed over a number of years (programmer 16). Further research must be performed to examine the effect that time and experience have on the metrics examined in this document.

¹ It is easy to alter the user name in the RCS automatic identification feature, and as such, excessive confidence must not be placed on its accuracy.

It would be logical to conclude that for the authorship analysis techniques to work, the metrics would have to be gathered continually over time. As mentioned in Section 1.2, compilers and operating systems would have to be enhanced, and significant research in operating system development would have to be done to enforce the use of these metrics.

5.1.5 Classification of Programmers

As mentioned in Section 1.1, we have shown that there exists a mechanism that can be used to ease the recognition of authorship of computer programs. We would like to quote, from Section 1.3, the paragraph that most accurately describes the purpose of this paper:

    Our goal is to show that it is possible to identify the author of a program by examining its programming style characteristics. Ultimately, we would like to find a signature for each individual programmer so that at any given point in time we could identify the ownership of any program.

From the results presented in this paper, it is clear to us that the authorship analysis tools that can be constructed by examining source code with the metrics chosen herein might never give a precise identification of the author of a program.

There is a clear parallel between this and the identification of people based on physical descriptions involving weight, height, hair coloring, facial expressions, etc.; certainly many people will match a specific description. However, much as in courtrooms around the world, a description of a programmer using the metrics presented might be sufficient evidence, or supporting evidence, to prove the identity of a person.

At this stage, we can only classify our metrics as rudimentary, and we are now convinced that it is possible for more than one programmer to have the same basic programming style. Thus, it might not be possible to find a set of metrics that uniquely identifies any one programmer.

The results of this paper, however, support the conclusion that within a closed environment, and for a specific set of programmers, it is possible to identify a particular programmer, and the probability of finding two programmers who share exactly the same characteristics should be small.

5.2 Future Work

In literature, observations are made about a writer's environment to conclude that an author could not have written some literary work. It is undoubtedly true that in computer science such observations would also be useful. Section 2.2.1 mentions that Mark Twain made such observations about Shakespeare. Similar observations could be made about a program if we have previous work by a programmer. For example, the preference for particular revision control systems and integrated development environments, and the educational background of a programmer, might all be used to show that it would be unlikely for that programmer to have developed a specific piece of code.

Also, a larger and more comprehensive set of metrics needs to be examined. Many of the metrics that may provide good statistical evidence of the authorship of a program were not calculated in this paper. It would also be interesting to see whether a system can be built that uses a different set of metrics for each programmer. Most likely, artificial intelligence or expert systems that search for repeating patterns in the calculated metrics will be needed, rather than discriminant analysis.

During the development of this document, it became apparent that other statistical methods can be used for the analysis of some, or all, of the metrics considered herein. In particular, the use of cluster analysis and Bayesian analysis should be investigated, as well as the weighting of metrics and the use of prior probabilities.

Finally, if we would like to use programmer-identifying characteristics to enhance real-time intrusion detection, more work must be performed on compilers and operating systems. The identifying characteristics must be preserved in binary executables, and this information must be protected to prevent alteration.

5.3 Closing Remarks

It was mentioned in Section 1.2 that there are four areas that motivated our research. We recapitulate briefly, with supporting evidence, as follows:

1. We are convinced that the results presented herein demonstrate that authorship analysis techniques could be used in courts as supporting evidence. However, further research must determine whether the metrics considered herein are general enough.

2. In the academic community, the unethical copying of programming assignments is a problem that we might partially solve. In one case in our experiment, the authorship analysis tool we developed failed to identify one of the programmers in our data set. We closely examined the programs of this particular programmer, and we question whether this particular student was the author of all the programs involved.

3. In industry, the techniques presented herein might be used to guarantee that the programmers involved in a project are indeed following a programming methodology.

4. Real-time intrusion detection systems could be enhanced to include authorship information. In environments where rigid programming rules are imposed, the incorporation or compilation of code that was not developed by the user in question might constitute an abnormal use of the system; therefore, security violations could be detected.

Also, if the authorship information can be incorporated into the executable version of the program, and this information can be protected from tampering by applying digital signatures, the author of executable programs could be traced.

Bibliography

[All86] L. Allison. A Practical Introduction to Denotational Semantics. Cambridge University Press, first edition, 1986.

[And91] G. R. Andrews. Concurrent Programming. The Benjamin/Cummings Publishing Co., first edition, 1991.

[BB89] A. Benander and B. Benander. An empirical study of COBOL programs via a style analyzer: The benefits of good programming style. The Journal of Systems and Software, 10(2):271–279, 1989.

[BM85] R. Berry and B. Meekings. A style analysis of C programs. Communications of the ACM, 28(1):80–88, 1985.

[BS84] H. Berghel and D. Sallach. Measurements of program similarity in identical task environments. ACM SIGPLAN Notices, 19(8):65–76, 1984.

[Coo87] Doug Cooper. Condensed Pascal. W. W. Norton and Company, 1987.

[Dau90] K. Dauber. The Idea of Authorship in America. The University of Wisconsin Press, 1990.

[Den87] D. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, 13(2):222–232, 1987.

[Dij68] E. Dijkstra. Go to statement considered harmful. Communications of the ACM, 11(3):147–148, 1968.

[Dis37] B. Disraeli. Venetia. New York and London, 1837.

[EV91] W. Elliot and R. Valenza. Was the Earl of Oxford the true Shakespeare? Notes and Queries, 38:501–506, December 1991.

[Eva84] M. Evangelist. Program complexity and programming style. In Proceedings of the International Conference on Data Engineering, pages 534–541. IEEE, 1984.

[GJM91] C. Ghezzi, M. Jazayeri, and D. Mandrioli. Fundamentals of Software Engineering. Prentice Hall, first edition, 1991.

[Gri81] S. Grier. A tool that detects plagiarism in Pascal programs. ACM SIGCSE Bulletin, 13(1):15–20, 1981.

[GS92] S. Garfinkel and E. Spafford. Practical UNIX Security. O'Reilly & Associates, Inc., 1992.

[Han91] D. Hanson. Code generation interface for ANSI C. Software–Practice and Experience, 38:963–988, September 1991.

[HH92] W. Hope and K. Holston. The Shakespeare Controversy. McFarland & Company, 1992.

[Hoa69] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969.

[HU79] John Hopcroft and Jeffrey Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, first edition, 1979.

[Jan88] H. Jankowitz. Detecting plagiarism in student Pascal programs. Computer Journal, 31(1):1–8, 1988.

[JW88] R. Johnson and D. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, second edition, 1988.

[KP78] B. Kernighan and P. Plauger. The Elements of Programming Style. McGraw-Hill Book Company, second edition, 1978.

[KR85] B. Kernighan and D. Ritchie. The C Programming Language. Prentice Hall, 1985.

[LC90] A. Lake and C. Cook. STYLE: An automated program style analyzer for Pascal. ACM SIGCSE Bulletin, 22(3):29–33, 1990.

[Led87] Henry Ledgard. C With Excellence: Programming Proverbs. Hayden Books, 1987.

[MB93] R. Madison and M. Beaven. FORTRAN For Scientists and Engineers: Laboratory Manual. McGraw-Hill, Inc., 1993.

[Mor91] D. Moreaux. A formalism for the detection and prevention of illicit program derivations. Master's thesis, Dept. of Computer Science, University of Idaho, 1991.

[MW92] Merriam-Webster. Webster's 7th Collegiate Dictionary, 1992.

[Nei63] C. Neider. The Complete Essays of Mark Twain. Doubleday, 1963.

[NS86] T. Naps and B. Singh. Introduction to Data Structures with Pascal. West Publishing Company, 1986.

[OC89] P. Oman and C. Cook. Programming style authorship analysis. In Seventeenth Annual ACM Computer Science Conference Proceedings, pages 320–326. ACM, 1989.

[OC90a] P. Oman and C. Cook. A taxonomy for programming style. In Eighteenth Annual ACM Computer Science Conference Proceedings, pages 244–247. ACM, 1990.

[OC90b] P. Oman and C. Cook. Typographic style is more than cosmetic. Communications of the ACM, 33(5):506–520, 1990.

[OC91] P. Oman and C. Cook. A programming style taxonomy. Journal of Systems Software, 15(4):287–301, 1991.

[Ott77] K. Ottenstein. An algorithmic approach to the detection and prevention of plagiarism. ACM SIGCSE Bulletin, 8(4):30–41, 1977.

[RN93] J. Ranade and A. Nash. The Elements of C Programming Style. McGraw-Hill Inc., 1993.

[RR83] A. Ralston and E. Reilly. Encyclopedia of Computer Science and Engineering. Van Nostrand Reinhold Co., second edition, 1983.

[SAS] The SAS Institute. SAS/STAT User's Guide, Volume 1, ANOVA–FREQ, fourth edition.

[SCS86] S. Conte, H. Dunsmore and V. Shen. Software Engineering Metrics and Models. The Benjamin/Cummings Publishing Company, 1986.

[Set89] R. Sethi. Programming Languages: Concepts and Constructs. Addison-Wesley Publishing Company, 1989.

[Spa89] E. Spafford. The Internet worm program. Technical Report CSD-TR-823, Department of Computer Science, Purdue University, 1989.

[Spe83] D. Spencer. The Illustrated Computer Dictionary. Merrill Publishing Co., first edition, 1983.

[Sto90] C. Stoll. The Cuckoo's Egg. Pocket Books, first edition, 1990.

[Tas78] Dennie Van Tassel. Program Style, Design, Efficiency, Debugging, and Testing. Prentice Hall, 1978.

[Wha86] G. Whale. Plague: Detection of plagiarism using program structure. In Proceedings of the Ninth Australian Computer Science Conference, pages 231–241, 1986.

[WS90] Larry Wall and Randal Schwartz. Programming Perl. O'Reilly & Associates, Inc., first edition, 1990.

[WS93] Stephen A. Weeber and Eugene H. Spafford. Software forensics: Can we track code to its authors? Computers & Security, 12(6):585–595, December 1993.