On the Extent and Nature of Software Reuse in Open Source Java Projects

Lars Heinemann, Florian Deissenboeck, Mario Gleirscher, Benjamin Hummel, Maximilian Irlbeck
Technische Universität München

ICSR 2011, Pohang, Korea

Software Reuse
• Reuse of existing artifacts for constructing new software
• Proven benefits: increased productivity, reduced time to market, improved quality

Software Reuse
• Tremendous reuse opportunities: class libraries (e.g. Apache Commons), frameworks (e.g. Eclipse: 40 MLOC), open source code (Google Code Search: several GLOC)
• Internet serves as reuse repository

Research Problem
• Unclear how software projects make use of available reuse opportunities
• Lack of data on amount of reuse in software projects
• Assessing success of software reuse difficult

Contribution
• Empirical knowledge about extent and nature of software reuse in OSS
• Quantitative data on software reuse in 20 open source projects
• Substantiates discussion of success/failure of software reuse
• Provides practitioners with benchmark

Terms
• Software reuse: Using code developed by third parties (excluding OS/platform)
• White-box reuse: Code incorporated in source form (internals exposed, potentially modified)
• Black-box reuse: Code incorporated in binary form (internals hidden, no modifications)

Study Design (GQM)
We analyze open source projects for the purpose of understanding the state of the practice in software reuse with respect to its extent and nature from the viewpoint of the developers and maintainers in the context of Java open source software.

Study Design (GQM)
• RQ 1: Do open source projects reuse software? Metric: existence of software reuse
• RQ 2: How much white-box reuse occurs? Metric: white-box reuse rate
• RQ 3: How much black-box reuse occurs? Metric: black-box reuse rate

Reuse Rate
• Overall code of software system = project's own code + reused code
• White-box reuse rate = reused source code [LOC] / overall source code [LOC]
• Black-box reuse rate = reused binary code [bytes] / overall binary code [bytes]

Study Objects
• 20 Java projects from
• Criteria: Production/Stable, standalone app, pure Java, Java SE platform, source download available
• All among 50 most downloaded
• Source code size: 0.4 to 790 kLOC, bytecode size: 17 to 22,761 KB
• Test code excluded with heuristics (e.g. folders named test/tests)
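The two reuse rates from the Reuse Rate slide are the same ratio applied to different size measures (LOC for white-box, bytes for black-box). A minimal sketch, assuming the overall size is the sum of own and reused code as the slide's diagram suggests; the function name and all numbers are hypothetical and not taken from the study:

```python
def reuse_rate(own: float, reused: float) -> float:
    """Return reused / (own + reused), the share of reused code in the overall system."""
    if own < 0 or reused < 0:
        raise ValueError("sizes must be non-negative")
    overall = own + reused
    if overall == 0:
        raise ValueError("overall size must be positive")
    return reused / overall


# Hypothetical project; the sizes are invented for illustration only.
white_box = reuse_rate(own=95_000, reused=5_000)          # source code, in LOC
black_box = reuse_rate(own=4_000_000, reused=6_000_000)   # binary code, in bytes

print(f"white-box reuse rate: {white_box:.1%}")
print(f"black-box reuse rate: {black_box:.1%}")
```

Keeping one function for both rates makes it explicit that only the unit of measurement differs between the white-box and the black-box metric.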

Study Implementation a) Detecting white-box reuse
• White-box reuse = copied code
• Can be detected automatically by clone detectors
• Clone detection against 22 commonly used Java libraries (~6 MLOC)
• Detection of reuse of statement sequences with > 15 statements

Study Implementation a) Detecting white-box reuse
• In addition: manual inspection of source directory tree
• Clues: file/package names
• Source of files identified via header comments/web search
• Detection of reuse of whole files/directories, not limited to fixed set of libraries
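The idea of finding copied statement sequences by comparing project code against a fixed set of libraries can be sketched as follows. This is a deliberately simplified illustration, not the clone detector used in the study: it assumes one statement per line, normalizes only whitespace, and matches exact windows of 16 statements (the slides state sequences with > 15 statements). All function names and the sample code are hypothetical.

```python
from typing import Dict, List, Set, Tuple

MIN_STATEMENTS = 16  # smallest window that satisfies "> 15 statements"


def normalize(source: str) -> List[str]:
    """Crude statement split: one statement per non-empty line, whitespace collapsed."""
    return [" ".join(line.split()) for line in source.splitlines() if line.strip()]


def index_library(name: str, source: str,
                  window: int = MIN_STATEMENTS) -> Dict[Tuple[str, ...], str]:
    """Map every window of consecutive library statements to the library name."""
    stmts = normalize(source)
    return {tuple(stmts[i:i + window]): name
            for i in range(len(stmts) - window + 1)}


def copied_statements(project_source: str,
                      library_index: Dict[Tuple[str, ...], str],
                      window: int = MIN_STATEMENTS) -> Set[int]:
    """Indices of project statements covered by a window that also occurs in a library."""
    stmts = normalize(project_source)
    hits: Set[int] = set()
    for i in range(len(stmts) - window + 1):
        if tuple(stmts[i:i + window]) in library_index:
            hits.update(range(i, i + window))
    return hits


# Hypothetical library with 20 statements, and a project that copied 18 of them
# with different indentation, surrounded by two statements of its own.
library_src = "\n".join(f"int v{i} = {i};" for i in range(20))
project_src = "\n".join(
    ["int own = 1;"]
    + [f"    int v{i}  =  {i};" for i in range(2, 20)]
    + ["return own;"]
)

index = index_library("hypothetical-lib", library_src)
hits = copied_statements(project_src, index)
print(f"{len(hits)} of {len(normalize(project_src))} project statements look copied")
```

Real clone detectors additionally normalize identifiers and literals and tolerate small edits, which is why the study reports finding files copied "with minor modifications"; exact window matching as above would miss many of those.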

Study Implementation b) Detecting black-box reuse
• Bytecode-based static analysis
• Aggregates bytecode size of all library types referenced by project's source code
• Traverses type dependency graph using Java Constant Pool (type usages and method calls)
• Includes transitive dependencies

Study Implementation b) Detecting black-box reuse
• Although not covered by reuse definition, potential variations in use of Java API interesting
• Black-box reuse baseline of empty Java program: 5 MB (2,082 types)
• Object → Class → ClassLoader ... (Reflection API / Collections API)
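The black-box measurement described above amounts to a reachability computation on the type dependency graph followed by a sum of bytecode sizes. A minimal sketch, assuming the dependency edges and per-type sizes have already been extracted from the class files (the study reads them from the Java constant pool; that step is not shown here). The graph, type names, and sizes below are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def reachable_types(roots: Iterable[str], deps: Dict[str, Set[str]]) -> Set[str]:
    """Breadth-first traversal: all types reachable from the roots, roots included."""
    seen = set(roots)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for target in deps.get(current, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def black_box_reuse_bytes(project_types: Iterable[str],
                          deps: Dict[str, Set[str]],
                          sizes: Dict[str, int]) -> int:
    """Sum the bytecode size of all non-project types reachable from the project."""
    own = set(project_types)
    reused = reachable_types(own, deps) - own
    return sum(sizes[t] for t in reused)


# Hypothetical dependency graph: edges point from a type to the types it references.
deps = {
    "app.Main": {"lib.Parser", "app.Util"},
    "app.Util": {"lib.Strings"},
    "lib.Parser": {"lib.Lexer"},      # transitive dependency of app.Main
    "lib.Lexer": {"lib.Strings"},
    "lib.Unused": {"lib.Strings"},    # shipped with the library but never referenced
}
sizes = {"app.Main": 900, "app.Util": 400, "lib.Parser": 2000,
         "lib.Lexer": 1500, "lib.Strings": 700, "lib.Unused": 5000}

own_types = ["app.Main", "app.Util"]
reused_bytes = black_box_reuse_bytes(own_types, deps, sizes)
own_bytes = sum(sizes[t] for t in own_types)
print(f"reused: {reused_bytes} bytes, own: {own_bytes} bytes, "
      f"rate: {reused_bytes / (reused_bytes + own_bytes):.1%}")
```

Counting only reachable types is what keeps an unreferenced class such as the hypothetical `lib.Unused` out of the total, even though it ships in the same library. The same traversal started from an empty program would yield the Java API baseline mentioned on the slide.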

Results RQ 1: Do open source projects reuse software?
• 18 of the 20 projects (90%) reuse software from third parties
• Exceptions: HSQLDB (relational database engine), YouTube Downloader (video download utility)

Results RQ 2: How much white-box reuse occurs?
• Clone detection found 791 clones, 11,701 copied LOC in 7 study objects
• Clones found: complete files with minor modifications (e.g. different version)
• Manual inspection found additionally whole copied libraries in 4 study objects
• Overall: white-box reuse found for 9 of 20 projects
• Reuse rates: 0% - 10%

Results RQ 3: How much black-box reuse occurs?

[Figure: absolute bytecode size distribution (MB) per study object; series: own, 3rd party, Java API, Java API baseline]
• 3rd party: 0 - 42 MB
• Java API: 13 - 17 MB
• Study objects shown: JEdit, Buddi, DrJava, soapUI, JabRef, RODIN, DavMail, subsonic, HSQLDB, FreeMind, OpenProj, TV-Browser, Azureus/Vuze, Mediathek View, Sweet Home 3D, iReport-Designer, Mobile Atlas Creator, SQuirreL SQL Client, PDF Split and Merge, YouTube Downloader

Results RQ 3: How much black-box reuse occurs?
[Figure: relative bytecode size distribution (%) per study object; series: own, 3rd party, Java API]
• 3rd party: 0 - 62%
• Java API: 23 - 99%
• Combined: 41 - 99%

Results RQ 3: How much black-box reuse occurs?
[Figure: relative bytecode size distribution (%) without Java API per study object; series: own, 3rd party]

Discussion a) Extent of reuse
• Software reuse common among Java OSS
• On average: high black-box reuse rates
• Expected to have significant impact on development effort
• Black-box reuse rates considerably varying

Discussion b) Influence of project size on reuse rate
• Lee & Litecky found a negative influence of project size on reuse rate (survey of 500 Ada professionals)
• Without Java API: Spearman correlation of 0.05 (two-tailed p-value 0.83), i.e. no significant correlation
• With Java API: Spearman correlation of -0.93 (p-value < 0.0001) → significant and strong negative correlation
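For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation of the ranks of the two variables. The following is a minimal pure-Python sketch with average ranks for ties; it does not compute the p-value, and the project sizes and reuse rates are hypothetical values chosen for illustration, not the study's data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: project size in kLOC and black-box reuse rate including Java API.
sizes_kloc = [1, 5, 20, 80, 300, 700]
reuse_rates = [0.95, 0.90, 0.70, 0.75, 0.40, 0.30]

rho = spearman(sizes_kloc, reuse_rates)
print(f"Spearman correlation: {rho:.2f}")
```

A strongly negative value in the "with Java API" case is plausible on structural grounds: the Java API baseline is roughly constant, so it makes up a larger share of the total the smaller the project's own code is.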

Discussion c) Types of reused functionality
• Categorization of reused libraries (e.g. networking, text/XML, rich client platforms)
• No predominant category found
• Nearly all projects reuse software from more than one category
• No significant insights, except that reuse is diverse w.r.t. types of functionality

Threats to internal validity a) Overestimation of reuse
• False positives from clone detection (mitigated by manual inspection of results)
• Unclear if code was copied into study objects or from them (mitigated by manual inspection)
• Black-box analysis considers a whole class as the element of reuse

Threats to internal validity b) Underestimation of reuse
• Fixed set of libraries in clone detection
• False negatives in clone detection
• Manual inspection for copied code inherently incomplete
• Black-box analysis misses calls via reflection and across boundaries formed by Java interfaces
• Other forms of component interaction

Threats to external validity
• Unclear how representative study objects are for all Java OSS
• Transferability to other programming languages (PL) or commercial development unclear
• Impact of PL is expected to be high
• Availability of reusable code depends on PL (e.g. Java vs. COBOL)

Conclusions
• Early visions of development by plugging reusable components not realistic
• But: Reuse in form of libraries common in Java OSS
• High black-box reuse rates (9 of 20 projects > 50%)
• Availability of reusable functionality well-established for Java platform

Future Work
• Other programming ecosystems
• Legacy programming languages, e.g. COBOL
• Scripting languages, e.g. Python
• Commercial software development environments

Thank you. Questions?
