On the Extent and Nature of Software Reuse in Open Source Java Projects
Lars Heinemann, Florian Deissenboeck, Mario Gleirscher, Benjamin Hummel, Maximilian Irlbeck
Technische Universität München
ICSR 2011, Pohang, Korea
Software Reuse
• Reuse of existing artifacts for constructing new software
• Proven benefits
  • Increased productivity
  • Reduced time to market
  • Improved quality
Software Reuse
• Tremendous reuse opportunities
  • Class libraries (e.g. Apache Commons)
  • Frameworks (e.g. Eclipse: 40 MLOC)
  • Open source code (Google Code Search: several GLOC)
• Internet serves as reuse repository
Research Problem
• Unclear how software projects make use of available reuse opportunities
• Lack of data on amount of reuse in software projects
• Assessing success of software reuse difficult
Contribution
• Empirical knowledge about extent and nature of software reuse in OSS
• Quantitative data on software reuse in 20 open source projects
• Substantiates discussion of success/failure of software reuse
• Provides practitioners with benchmark
Terms
• Software reuse: Using code developed by third parties (excluding OS/platform)
• White-box reuse: Code incorporated in source form (internals exposed, potentially modified)
• Black-box reuse: Code incorporated in binary form (internals hidden, no modifications)
Study Design (GQM)
We analyze open source projects for the purpose of understanding the state of the practice in software reuse with respect to its extent and nature from the viewpoint of the developers and maintainers in the context of Java open source software.
Study Design (GQM)
Question                                       Metric
RQ 1: Do open source projects reuse software?  Existence of software reuse
RQ 2: How much white-box reuse occurs?         White-box reuse rate
RQ 3: How much black-box reuse occurs?         Black-box reuse rate
Reuse Rate
Overall code of software system = Project's own code + Reused code
White-box reuse rate = Reused source code [LOC] / Overall source code [LOC]
Black-box reuse rate = Reused binary code [bytes] / Overall binary code [bytes]
Study Objects
• 20 Java projects from
• Criteria: Production/Stable, standalone app, pure Java, Java SE platform, source download available
• All among 50 most downloaded
• Source code size: 0.4 to 790 kLOC, bytecode size: 17 to 22,761 KB
• Test code excluded with heuristics (e.g. folders named test/tests)
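The two reuse rates from the Reuse Rate slide can be sketched as a single ratio, applied once to source lines and once to bytecode bytes. This is a minimal illustration, not the study's tooling: the function name `reuse_rate` and all project figures below are hypothetical, and the overall size is taken as own code plus reused code, as on the slide.

```python
def reuse_rate(reused: float, overall: float) -> float:
    """Return reused / overall as a fraction between 0 and 1."""
    if overall <= 0:
        raise ValueError("overall size must be positive")
    if not 0 <= reused <= overall:
        raise ValueError("reused size must lie between 0 and overall size")
    return reused / overall


# Hypothetical project: 1,000 of 20,000 source lines were copied from libraries.
white_box = reuse_rate(reused=1_000, overall=20_000)

# Hypothetical project: 3,000 KB own bytecode plus 9,000 KB referenced library bytecode.
own_bytes, library_bytes = 3_000 * 1024, 9_000 * 1024
black_box = reuse_rate(reused=library_bytes, overall=own_bytes + library_bytes)

print(f"white-box reuse rate: {white_box:.1%}")
print(f"black-box reuse rate: {black_box:.1%}")
```

The same function serves both metrics because only the unit of measurement differs (LOC for white-box, bytes for black-box).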
Study Implementation a) Detecting white-box reuse
• White-box reuse = copied code
• Can be detected automatically by clone detectors
• Clone detection against 22 commonly used Java libraries (~6 MLOC)
• Detection of reuse of statement sequences with > 15 statements
Study Implementation a) Detecting white-box reuse
• In addition: manual inspection of source directory tree
  • Clues: file/package names
  • Source of files identified via header comments/web search
• Detection of reuse of whole files/directories, not limited to fixed set of libraries
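The idea of finding copied statement sequences can be illustrated with a deliberately naive sketch. It is not the clone detector used in the study: it assumes one statement per line, matches only exact copies after whitespace normalization, and all function names are hypothetical. The `window` parameter plays the role of the minimum clone length; the slide's threshold of more than 15 statements would correspond to `window=16`.

```python
from typing import Iterable


def normalize(lines: Iterable[str]) -> list[str]:
    """Collapse whitespace and drop blank lines; one line is treated as one statement."""
    return [" ".join(line.split()) for line in lines if line.strip()]


def index_windows(statements: list[str], window: int) -> set[tuple[str, ...]]:
    """Collect every run of `window` consecutive statements."""
    return {
        tuple(statements[i:i + window])
        for i in range(len(statements) - window + 1)
    }


def copied_statements(project: list[str], library: list[str], window: int) -> set[int]:
    """Return indices of project statements covered by a run that also occurs in the library."""
    library_windows = index_windows(library, window)
    covered: set[int] = set()
    for i in range(len(project) - window + 1):
        if tuple(project[i:i + window]) in library_windows:
            covered.update(range(i, i + window))
    return covered


# Toy input (hypothetical); a tiny window keeps the example readable.
library = normalize(["int a = 1;", "int b = 2;", "int c = a + b;", "return c;"])
project = normalize(["int x = 0;", "int a = 1;", "  int b = 2;", "int c = a + b;", "x++;"])
copied = copied_statements(project, library, window=3)
print(sorted(copied))
```

Real clone detectors typically also normalize identifiers and literals, which is why they can find copies with minor modifications that this exact-match sketch would miss.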
Study Implementation b) Detecting black-box reuse
• Bytecode-based static analysis
• Aggregates bytecode size of all library types referenced by project's source code
• Traverses type dependency graph using Java constant pool (type usages and method calls)
• Includes transitive dependencies
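The aggregation over transitive dependencies can be sketched as a breadth-first traversal of a type dependency graph. This is an illustration under stated assumptions: the graph is assumed to be already extracted (the study derives it from the constant pool of the class files), and the function name, type names, and byte sizes are all hypothetical.

```python
from collections import deque


def reused_bytecode_size(project_types, references, sizes, library_types):
    """Sum the byte size of all library types reachable from the project's own types."""
    seen = set(project_types)
    queue = deque(project_types)
    total = 0
    while queue:
        current = queue.popleft()
        for target in references.get(current, ()):
            if target in seen:
                continue  # count every type at most once
            seen.add(target)
            queue.append(target)  # follow transitive dependencies
            if target in library_types:
                total += sizes[target]
    return total


# Hypothetical dependency graph: type name -> referenced type names.
references = {
    "app.Main": ["lib.Parser", "app.Util"],
    "app.Util": ["lib.Strings"],
    "lib.Parser": ["lib.Lexer", "lib.Strings"],
    "lib.Unused": ["lib.Lexer"],
}
# Hypothetical bytecode sizes of the library types, in bytes.
sizes = {"lib.Parser": 4000, "lib.Lexer": 2500, "lib.Strings": 1200, "lib.Unused": 9000}

reused = reused_bytecode_size(
    project_types=["app.Main", "app.Util"],
    references=references,
    sizes=sizes,
    library_types=set(sizes),
)
print(reused)
```

In this toy graph `lib.Unused` ships with the library but is never reached from the project, so it does not contribute to the reused size.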
Study Implementation b) Detecting black-box reuse
• Although not covered by reuse definition, potential variations in use of Java API interesting
• Black-box reuse baseline of empty Java program: 5 MB (2,082 types)
  • Object → Class → ClassLoader ... (Reflection API / Collections API)
Results RQ 1: Do open source projects reuse software?
• 18 of the 20 projects (90%) reuse software from third parties
• Exceptions: HSQLDB (relational database engine), YouTube Downloader (video download utility)
Results RQ 2: How much white-box reuse occurs?
• Clone detection found 791 clones, 11,701 copied LOC in 7 study objects
• Clones found: complete files with minor modifications (e.g. different version)
• Manual inspection additionally found whole copied libraries in 4 study objects
• Overall: white-box reuse found for 9 of 20 projects
• Reuse rates: 0% - 10%
Results RQ 3: How much black-box reuse occurs?
[Figure: Absolute bytecode size distribution (MB) per project; series: own, 3rd party, Java API, Java API baseline. Projects: JEdit, Buddi, DrJava, soapUI, JabRef, RODIN, DavMail, subsonic, HSQLDB, FreeMind, OpenProj, TV-Browser, Azureus/Vuze, Mediathek View, Sweet Home 3D, iReport-Designer, Mobile Atlas Creator, SQuirreL SQL Client, PDF Split and Merge, YouTube Downloader]
• 3rd party: 0 - 42 MB
• Java API: 13 - 17 MB
Results RQ 3: How much black-box reuse occurs?
[Figure: Relative bytecode size distribution (%) per project; series: Java API, 3rd party, own]
• 3rd party: 0 - 62%
• Java API: 23 - 99%
• Combined: 41 - 99%
Results RQ 3: How much black-box reuse occurs?
[Figure: Relative bytecode size distribution (%) without Java API, per project; series: 3rd party, own]
Discussion a) Extent of reuse
• Software reuse common among Java OSS
• On average: high black-box reuse rates
• Expected to have significant impact on development effort
• Black-box reuse rates considerably varying
Discussion b) Influence of project size on reuse rate
• Lee & Litecky found a negative influence of project size on reuse rate (survey of 500 Ada professionals)
• Without Java API: Spearman correlation of 0.05 (two-tailed p-value 0.83) → no significant correlation
• With Java API: Spearman correlation of -0.93 (p-value < 0.0001) → significant and strong negative correlation
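For readers unfamiliar with the statistic, Spearman's coefficient is the Pearson correlation of the rank-transformed values. The sketch below shows that computation in plain Python. The data is invented for illustration and is not the study's measurements, the function names are hypothetical, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented example: larger projects with strictly lower reuse rates.
project_size_kb = [17, 120, 900, 4500, 22000]
reuse_rates = [0.99, 0.95, 0.80, 0.60, 0.41]
rho = spearman(project_size_kb, reuse_rates)
print(rho)
```

Because the statistic depends only on ranks, any strictly decreasing relationship yields a coefficient of -1 regardless of how the sizes are scaled, which makes it suitable for project sizes spanning several orders of magnitude.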
Discussion c) Types of reused functionality
• Categorization of reused libraries (e.g. networking, text/xml, rich client platforms)
• No predominant category found
• Nearly all projects reuse software from more than one category
• No significant insights, except reuse diverse w.r.t. types of functionality
Threats to internal validity a) Overestimation of reuse
• False positives from clone detection
  • Mitigated by manual inspection of results
• Unclear if code was copied into study objects or from them
  • Mitigated by manual inspection
• Black-box analysis considers a whole class as the element of reuse
Threats to internal validity b) Underestimation of reuse
• Fixed set of libraries in clone detection
• False negatives in clone detection
• Manual inspection for copied code inherently incomplete
• Black-box analysis misses calls via reflection and across boundaries formed by Java interfaces
• Other forms of component interaction
Threats to external validity
• Unclear how representative study objects are for all Java OSS
• Transferability to other programming languages (PL) or commercial development unclear
  • Impact of PL is expected to be high
  • Availability of reusable code depends on PL (e.g. Java vs. COBOL)
Conclusions
• Early visions of development by plugging reusable components not realistic
• But: Reuse in form of libraries common in Java OSS
• High black-box reuse rates (9 of 20 projects > 50%)
• Availability of reusable functionality well-established for Java platform
Future Work
• Other programming ecosystems
  • Legacy programming languages, e.g. COBOL
  • Scripting languages, e.g. Python
• Commercial software development environments
Thank you. Questions?