Code siblings: Technical and legal implications of copying code between applications

Daniel German, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano (Giulio) Antoniol

MSR 2009, Vancouver

The Challenge

• Code, as any other artistic production, is regulated by copyright law
• Companies own the property of source code
• Free and open source software (FOSS) model is different
• Copying 27 LOC out of 525 KLOC resulted in a copyright infringement
• Users and companies must be aware of copyright law and ownership

Code Has Preferential Migration Flows

License Types

• Permissive – the MIT/X11 and BSD licenses
  - Minor constraints on the licensee
  - Fragments can be included in a system under a different license
  - BSD-licensed fragments can be included in proprietary systems
  - CAVEAT! There are multiple BSD licenses: the original BSD (4-clause BSD), the new BSD (3-clause BSD), and the 2-clause BSD
  - Code licensed under the original 4-clause BSD cannot be included in systems licensed under the GPL
• Reciprocal – GNU variants
  - Any system that includes the fragments must be licensed under the same license
  - GPL-licensed fragments can only be included in systems licensed under the same version of the GPL
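The inclusion rules on this slide can be sketched as a small lookup. This is only an illustration of the rules as stated above, not legal advice and not a tool used in the study; the `License` enum, the `can_include` function, and the conservative treatment of proprietary fragments are assumptions made for the example.

```python
from enum import Enum


class License(Enum):
    MIT_X11 = "MIT/X11"
    BSD_2 = "2-clause BSD"
    BSD_3 = "3-clause BSD (new BSD)"
    BSD_4 = "4-clause BSD (original BSD)"
    GPL_V2 = "GPL v2"
    GPL_V3 = "GPL v3"
    PROPRIETARY = "proprietary"


# Grouping taken from the slide: permissive vs. reciprocal licenses.
PERMISSIVE = {License.MIT_X11, License.BSD_2, License.BSD_3, License.BSD_4}
RECIPROCAL = {License.GPL_V2, License.GPL_V3}


def can_include(fragment: License, system: License) -> bool:
    """Return True if a fragment under `fragment` may be included in a
    system distributed under `system`, following only the slide's rules."""
    if fragment in RECIPROCAL:
        # Reciprocal: the system must be under the same version of the GPL.
        return system == fragment
    if fragment == License.BSD_4 and system in RECIPROCAL:
        # Caveat from the slide: original 4-clause BSD cannot go into GPL systems.
        return False
    if fragment in PERMISSIVE:
        # Permissive: inclusion in a system under a different license is allowed.
        return True
    # Assumption (not on the slide): proprietary fragments are never reusable.
    return False


if __name__ == "__main__":
    print(can_include(License.BSD_3, License.PROPRIETARY))
    print(can_include(License.BSD_4, License.GPL_V2))
    print(can_include(License.GPL_V2, License.GPL_V3))
```

A table-driven check like this is easy to extend with more license variants, but real compatibility questions depend on details the slide does not cover.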

The Scale of the Problem

• Widely adopted systems are in the range of MLOC and thousands of files
• If 27 LOC in 525 KLOC lead to copyright infringement
  - Implications for companies reusing code
  - Implications for end users
• We are like detectives
  - Help monitoring and detecting license inconsistencies
  - Help monitoring and identifying inconsistent licenses in code fragments

Empirical Study

• Code siblings: code fragments that migrated from one system to another and then evolved following their own paths
• Three *nix kernels
  - Linux ~7 MLOC and 20,000 files
  - FreeBSD ~8 MLOC and 21,000 files
  - OpenBSD ~2 MLOC and 5,500 files
• Overall size as of Jan. 2009: 17 MLOC

Research Questions

• RQ1: What kinds of open source licenses are used in the three kernels?
• RQ2: How many potential siblings exist between the BSD kernels and the Linux kernel?
• RQ3: What licenses are used by siblings and, if different, why?

Technologies and Setup

• Clone detection tool
  - CCFinderX tool
  - Min 100 tokens
  - Parse only . files
  - Concentrate on pairs of files sharing a high percentage of common code fragments, at least ~30%, i.e., ~20 LOC
  - Prune files mapped into more than five siblings
• License detection and identification
  - First comment(s)
  - FOSSology version 1.0.0
  - 78 different license variants
  - Added 5 more licenses
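The two filtering steps applied to the clone detector's output (a minimum share of common code, then pruning of files with too many siblings) can be sketched as follows. This is a minimal sketch, not the authors' scripts: the `FilePair` record, the `shared_fraction` measure, and the choice to count siblings after applying the threshold are all assumptions, and the "more than five" reading of the pruning rule is inferred from the slide.

```python
from collections import Counter
from typing import List, NamedTuple


class FilePair(NamedTuple):
    """Hypothetical record for one BSD/Linux file pair reported by a clone detector."""
    bsd_file: str
    linux_file: str
    shared_fraction: float  # share of the file covered by common cloned code


MIN_SHARED = 0.30   # "at least ~30%" from the slide
MAX_SIBLINGS = 5    # files mapped into more than five siblings are pruned


def filter_siblings(pairs: List[FilePair]) -> List[FilePair]:
    """Keep pairs above the sharing threshold, then drop pairs whose
    files take part in more than MAX_SIBLINGS candidate pairs."""
    kept = [p for p in pairs if p.shared_fraction >= MIN_SHARED]
    # Assumption: sibling counts are taken after the threshold filter.
    bsd_counts = Counter(p.bsd_file for p in kept)
    linux_counts = Counter(p.linux_file for p in kept)
    return [
        p for p in kept
        if bsd_counts[p.bsd_file] <= MAX_SIBLINGS
        and linux_counts[p.linux_file] <= MAX_SIBLINGS
    ]


if __name__ == "__main__":
    candidates = [
        FilePair("aic7xxx.c", "aic7xxx_core.c", 0.62),
        FilePair("util.c", "misc.c", 0.12),
    ]
    for pair in filter_siblings(candidates):
        print(pair)
```

Pruning files with many siblings presumably removes generic or boilerplate files that match widely without indicating a real migration.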

Sibling(s) Origin

• Identify current siblings
• Trace back into past siblings – their code fragments in the same files
• When they disappear, then we have their origins
• Take the oldest of the two as the true origin
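The tracing steps above can be sketched as a walk backwards through each file's history. This is an illustrative sketch under stated assumptions: each history is modeled as a list of `(date, fragment_present)` snapshots, and the names `trace_origin` and `migration_direction` are hypothetical, not part of any tool named in the slides.

```python
from datetime import date
from typing import List, Optional, Tuple

Snapshot = Tuple[date, bool]  # (snapshot date, cloned fragment present?)


def trace_origin(snapshots: List[Snapshot]) -> Optional[date]:
    """Walk back from the newest snapshot until the cloned fragment
    disappears; the last date where it was still present is the origin."""
    origin = None
    for day, present in sorted(snapshots, reverse=True):
        if not present:
            break
        origin = day
    return origin


def migration_direction(sys1: List[Snapshot],
                        sys2: List[Snapshot]) -> Optional[Tuple[str, str]]:
    """Return (source, destination), taking the older sibling as the true
    origin; None if the direction cannot be decided from the snapshots."""
    o1, o2 = trace_origin(sys1), trace_origin(sys2)
    if o1 is None or o2 is None or o1 == o2:
        return None
    return ("sys1", "sys2") if o1 < o2 else ("sys2", "sys1")


if __name__ == "__main__":
    freebsd = [(date(1999, 1, 1), False), (date(2000, 1, 1), True), (date(2005, 1, 1), True)]
    linux = [(date(2000, 1, 1), False), (date(2003, 1, 1), True), (date(2005, 1, 1), True)]
    print(trace_origin(freebsd), trace_origin(linux))
    print(migration_direction(freebsd, linux))
```

Stopping at the first snapshot where the fragment is absent matches the "when they disappear" step; a fragment that was removed and later re-added would be dated from its most recent reappearance.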

[Figure: cloned fragments in Sys 1 – File i and Sys 2 – File j, linked as siblings with a migration direction]

RQ1: Kinds of open source licenses

• Linux … is Linux … 65% of GPL files plus 25% of files “promoted” to GPL by L. Torvalds
  - A few files (35) have two licenses
• FreeBSD: 75% of the files with BSD license
  - 189 files (5%) with no license
  - 179 files with a corporate license ( licenses)
  - 167 files with MIT license
  - A few multiple licenses – 19 BSD and GPL, 15 BSD and Educational, 14 MIT and GPL
• OpenBSD: 76% BSD licenses
  - 295 files (9%) with a MIT license, 179 with an educational license
  - 138 (84%) without license
  - 59 files with BSD and Educational, 25 with MIT and BSD, and 14 with BSD and GPL

RQ2: Siblings between kernels

[Figure: two bar charts comparing FreeBSD vs. Linux and OpenBSD vs. Linux. Chart 1 (y-axis: Siblings, 0 to 2500): Filtered siblings, Clone pairs, Files Linux, Files BSD, File Pairs, File Pairs (same name). Chart 2 (0 to 250): Files Linux, Files BSD, File Pairs, File Pairs (same name).]

RQ3: Code Migration and Licenses

Before Jan 1, 2002; almost nothing after

FreeBSD        Linux          Files
BSD            GPL            8
BSD            MIT            2

OpenBSD        Linux          Files
BSD            None           2
BSD            BSD+GPL        1
Corporate      BSD+GPL        89
BSD            MIT            2
GPL            None           1
Phrase         BSD+GPL        1
BSD            Unknown        1
X.Net+BSD      MIT            1
BSD+GPL        GPL            1
BSD+Phrase     Phrase+GPL     1
MIT            GPL            23

After Jan 1, 2002; nothing before

Linux          FreeBSD        Files
BSD+GPL        Corporate      8
GPL            BSD            17
GPL            BSD+GPL        1
GPL            CPL+BSD+GPL    1
MIT            BSD            1
MIT+GPL        None           2
None           BSD            1
None           BSD            1
Phrase+GPL     MIT            2

AIC7xxx Maintaining Siblings

• 1994: Linux AIC7xxx series SCSI adapters
• 1995: Linux code is incorporated into an OpenBSD driver
• 1996: NetBSD driver is ported to FreeBSD
  - #ifdef to maintain the variants
• 1997: A mailing list is created in FreeBSD to unify the efforts of people in the different kernels
  - The major development of the driver seems to happen in FreeBSD
• 2000: Development propagates to Linux, NetBSD, and OpenBSD
• Today: Development mostly in Linux and FreeBSD

GPL code in FreeBSD

• 2002: Silicon Graphics xfs file system integrated into Linux
• Dec 12, 2005: xfs appears in FreeBSD
  - The license of xfs is GPL
  - FreeBSD is licensed under the 2-clause BSD
  - Including xfs in a BSD kernel requires the kernel to be under the GPL too
• Compiling GPL-licensed code into the kernel makes it “RESTRICTED”
  - It can no longer be distributed in binary form, its be made available for mirroring

License Defects

• FreeBSD rdma_cma.c / Linux cdma.c are siblings
  - In Linux, it appeared on Jun 17, 2006, with 64 changes, including 8 changes after it appeared in FreeBSD
  - The Linux sibling is licensed under the GPL v2 and the 2-clause BSD licenses
  - The FreeBSD sibling is licensed under the terms of the new BSD license, the GPL v2, and the Common Public License
  - Original license still present in FreeBSD
  - Linux license was changed:

    commit a9474917099e007c0f51d5474394b5890111614f
    Author: Sean Hefty
    Date: Mon Jul 14 23:48:43 2008 -0700

        RDMA: Fix license text

        The license text for several files references a third [..]
        that was inadvertently copied in. Update the license to what was
        intended. This update was based on a request from HP.
        [..]

Conclusion

• Code moves and code siblings do exist
• Siblings have a preferential flow
  - Initially from BSD(s) to Linux – frequent
  - Today from Linux to FreeBSD – less frequent
• Companies directly contribute to code in different kernels – see Intel drivers with dual licenses
• Managing siblings is a difficult problem

If you don’t monitor, code may sneak in …

Questions?