© 2010 M. Cameron Jones. Some rights reserved. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

REMIX AND REUSE OF SOURCE CODE IN SOFTWARE PRODUCTION

BY M. CAMERON JONES

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Library and Information Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2010

Urbana, Illinois

Doctoral Committee:
Associate Professor J. Stephen Downie, Chair
Professor Michael B. Twidale, Director of Research
Associate Professor Kyratso (Karrie) G. Karahalios
Professor Linda C. Smith

ABSTRACT

The means of producing information and the infrastructure for disseminating it are constantly changing. The web mobilizes information in electronic formats, making it easier to copy, modify, remix, and redistribute. This has changed how information is produced, distributed, and used. People are not just consuming information; they are actively producing, remixing, and sharing information, using the web as a platform for creativity and production. This is true of software development as well. Programmers, and researchers who study software development, frequently remark that programmers copy and paste code. Although this practice is widely acknowledged, it is rarely studied directly or explicitly accounted for in models of software development. However, this attitude is changing as software becomes more ubiquitous and software development practice shifts away from the formal models of software engineering towards a post-modernist perspective. This study explores how source code snippets in programming books and on the web are changing software development practice.
By examining program source code using clone detection algorithms, this study provides a comprehensive view of code copying across 6,190 PHP-language applications. These data are used to explore the concept of a "remix" method of software production, where software and systems are built out of copied and pasted snippets of code. These findings are contrasted against both traditional models of information production coming from informetrics (e.g., authorship, citation analysis), and models from software engineering (e.g., the Lego Hypothesis). Explanations for the observed phenomena borrow metaphors from linguistics, which provide a richer account of copy-paste programming than the Lego Hypothesis offers. The focus and findings of this study ultimately point to a pressing demand for further research centered on the notion of software as information. Software and software repositories hold a large amount of information about how software was produced, and how it is used, adapted, and maintained. Software informatics is proposed as an organizing label to study the science of information, practice, and communication around software. It studies the individual, collaborative, and social aspects of software production and use, spanning multiple representations of software from design, to source code, to application.

For Ari and Jinha

ACKNOWLEDGMENTS

As I sit here making my final revisions, I cannot help but reflect on the many years I spent at the University of Illinois in the Graduate School of Library and Information Science. My advisor, Dr. Michael Twidale, has guided and advised me throughout my studies. He taught my very first GSLIS course when I was an undergraduate, and later encouraged me to pursue my graduate degree. I wish to thank Dr. Twidale for his patience and encouragement over the years, and for challenging me and pushing my research to higher levels.
I would also like to thank the rest of my committee, who have been an invaluable resource. My committee chair, Dr. Downie, has not only advised my dissertation research, but has also been a great mentor; he has shown me how the metaphorical research sausage gets made. Dr. Smith has provided valuable feedback and insights on my research, for which I am eternally grateful. Dr. Karahalios has challenged me to think more deeply about the social aspects of my research, leading to new insights and implications.

I could not have completed this dissertation without the support and encouragement of the GSLIS community. I am thankful for the assistance and financial support offered to me by Dean Unsworth and the Graduate School, and for the feedback, support, and friendship I have received from the faculty and my fellow graduate students.

My research has provided me with many opportunities; one of them was the opportunity to work with Dr. Elizabeth Churchill in the Internet Experiences group at Yahoo! Research. Dr. Churchill gave me the space and direction to explore new avenues of research which ultimately shaped and refined my understanding of software informatics and code sharing. I am grateful for her friendship, patience, and guidance.

My greatest debt of gratitude is owed to my family. My loving wife Dr. Jin Ha Lee and my beautiful daughter Ariana encouraged and believed in me, and their love sustained me through many long days and late nights. Also, my parents and siblings have helped me to keep perspective on my research and my life by challenging me to constantly explain my research and why they should care. Thank you.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Recent Trends
    1.1.1 Authorship in Software Development
    1.1.2 Collaborative Software Development
    1.1.3 Software and Information Seeking
    1.1.4 Programmer as User
    1.1.5 User as Programmer
    1.1.6 Innovation in the Use of Software
  1.2 Remix in Software Production
  1.3 Software Reuse
  1.4 Copy-Paste in Software Development
  1.5 Software Clones
  1.6 Broad Scope of the Research Space
  1.7 Information Production Process
  1.8 Organization of this Study
    1.8.1 Part I: Establishing the Landscape of Software Remix and Reuse
    1.8.2 Part II: Testing the Lego Hypothesis
  1.9 Research Hypotheses

CHAPTER 2 LITERATURE REVIEW
  2.1 Software Informatics
    2.1.1 Comparison to Software Engineering
    2.1.2 Comparison to Social Informatics
    2.1.3 Software as Information
  2.2 Software Cloning
    2.2.1 Type I Clones
    2.2.2 Type II Clones
    2.2.3 Type III Clones
    2.2.4 Type IV Clones
  2.3 Clone Analysis
  2.4 Code Copying
  2.5 The Lego hypothesis and Postmodernism
  2.6 Informetrics
  2.7 Summary

CHAPTER 3 RESEARCH METHOD AND PLAN
  3.1 Collecting Source Code Data
    3.1.1 A Brief Word on the Structure of PHP Source Code
    3.1.2 Collecting the PILOT Dataset
    3.1.3 Collecting the OPENSOURCE Dataset
    3.1.4 Collecting the SNIPPETS Dataset
  3.2 Managing Collections
  3.3 Generalized Clone Detection Procedure
    3.3.1 Parameterized String Matching
    3.3.2 Measuring Cloning Activity
  3.4 Testing the Lego Hypothesis
    3.4.1 Estimating the Power-law Exponent
    3.4.2 Estimating the Power-law Constant
    3.4.3 Determining the Cutoff Value
    3.4.4 The Kolmogorov-Smirnov Test
  3.5 Summary

CHAPTER 4 PATTERNS OF COPY-PASTE PROGRAMMING
  4.1 Pilot Study
  4.2 Copied Code in the OPENSOURCE Dataset
  4.3 Clones as Copied Code: A First Attempt
    4.3.1 File Duplication, Near Duplicates, & Code Inclusion
  4.4 Clones as Copied Code: A Second Attempt
    4.4.1 Filtering Files
  4.5 The Case of the getmicrotime Clones
    4.5.1 The getmicrotime Snippet on the Web
  4.6 Snippets as Copied Code
  4.7 Examples of Likely Copy-Pasted Clones
    4.7.1 implode assoc Clone
    4.7.2 escape data Clone
    4.7.3 Database Connection Clone
    4.7.4 Binary to Decimal
    4.7.5 GZIP File I/O Clone
    4.7.6 Image Resize Clone, with Spanish Translation
    4.7.7 Sorting Clone
  4.8 Examples of Ambiguous Clones
    4.8.1 String Conversion Clone
    4.8.2 File I/O Clone