<<

Empirical Analysis of Adoption

Leo A. Meyerovich Ariel Rabkin UC Berkeley ∗ Princeton University [email protected] [email protected]

Abstract aid developers in determining when and whether to bet on a Some programming languages become widely popular while new, experimental language. To date, the language adoption others fail to grow beyond their niche or disappear alto- process has not been quantitatively studied in a large scale. gether. This paper uses survey methodology to identify This paper addresses that gap. We use a combination of sur- the factors that lead to language adoption. We analyze vey research and repository mining to investigate large datasets, including over 200,000 SourceForge projects, the factors that influence developer language choice. 590,000 projects tracked by Ohloh, and multiple surveys of Since little is quantified about the programming language 1,000-13,000 . adoption process, we focus on broad research questions: We report several prominent findings. First, language What statistical properties describe language popu- adoption follows a power law; a small number of languages larity? We begin (Section 3) with an empirical analysis account for most language use, but the programming mar- of language use across many open source projects. Such ket supports many languages with niche user bases. Second, a -scale analysis reveals what trajectories languages intrinsic features have only secondary importance in adop- tend to follow. Our analysis includes the overall distribu- tion. Open source libraries, existing code, and experience tion of language use, and how it varies based on the kind strongly influence developers when selecting a language for of project and developer experience. a project. Language features such as performance, reliability, We found that popularity follows a power law, which and simple do not. Third, developers will steadily means that most usage is concentrated in a small number learn and forget languages, and the overall number of lan- of languages, but many unpopular languages will still find guages developers are familiar with is independent of age. a user base. The popular languages are used across a vari- Developers select more varied languages if their education ety of application domains while less popular ones tend to exposed them to different language families. Finally, when be used for niche domains. Even in niche domains, popular considering intrinsic aspects of languages, developers priori- languages are still more typically used. tize expressivity over correctness. They perceive static types Which factors most influence developer decision- as more valuable for properties such as the former rather making for language selection? Section 4 examines the than for correctness checking. subjective motivations of developers when picking lan- guages for specific projects. Knowing what matters to de- 1. Introduction velopers helps language designers and advocates address Some programming languages succeed and others fail. Un- their perceived needs. derstanding this process is a foundational step towards en- Through multiple surveys, we saw that developers value abling language designers and advocates to influence its out- open source libraries as the dominant factor in choosing pro- come and overall language use. Likewise, understanding will gramming languages. Social factors not tied to intrinsic lan- guage features, such as existing personal or team experience, ∗ Research supported by (Award #024263) and Intel (Award also rate highly. #024894) funding and by matching funding by U.. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National How do developers acquire languages? Knowledge Instruments, Nokia, NVIDIA, Oracle, and Samsung. about the learning process is important because developers are much more likely to use a language they already know. In Section 5, we examine how age and education shape lan- guage learning. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed We found that developers rapidly and frequently learn for profit or commercial advantage and that copies bear this notice and the full citation languages. Factors such as age play a smaller role than sug- on the first page. To copy otherwise, to republish, to on servers or to redistribute to lists, requires prior specific permission and/or a fee. gested by media. In contrast, which languages developers OOPSLA ’13, , Indiana, USA. learn is influenced by their education, and in particular, cur- Copyright c 2013 ACM X. . . $15.00 riculum design. 2. Online Poll: Hammer. “The Hammer Principle” is a What language features do developers value? Whereas that invites readers to compare various items, such Section 4 looks at how developers pick languages for spe- as programming languages, based on a multidimensional se- cific projects, Section 6 examines their feelings about in- ries of metrics [13]. Respondents pick a set of languages trinsic features of languages, such as type systems. The re- that they are comfortable with out of a pool of 51 lan- sults can help designers craft better languages and advocate guages. They picked 7 languages on average. Respondents them, and even influence curriculum design due by exposing are then shown a randomly sorted series of statements, such knowledge gaps. as “When I write code in this language I can be very sure We found that developers generally value expressiveness it is correct.” For each statement, the respondent orders the and speed of development over language-enforced correct- languages that they selected based on how well they match ness. They see more value in unit tests than types, both for the statement. The survey includes 111 statements, and re- debugging and overall. Furthermore, how features are pre- spondents sorted languages for an average of 10 statements sented strongly influences developer feelings about them: each before tiring. The survey period was 2010-2012. developers rank class interfaces more highly than static The raw and anonymized data was provided by the site’s types. maintainer, David MacIver. For each statement, we used a variant of the Glicko-2 ranking algorithm [8] to convert the Our paper is organized around the above research ques- sparse data of inconsistent pair-wise comparisons into a total tions. Before addressing them, we first overview our data ranking of languages. A prior publication [14] describes this sets and methodology. We finish the paper by discussing analysis in further detail. threats to the validity of our results (Section 7), related work The is that we treat each statement as a tour- (Section 8), and our conclusions (Section 9). nament between languages. Glicko converts the sparse pair- wise comparison into a total order. The Glicko family of al- 2. Methodology and Data gorithms is used in chess tournaments and rank- ings for producing a complete ranking of players without ev- We now describe our data sets, methodology, and a key ery pairwise comparison: beating a highly ranked language factor in many of our results: demographics. contributes more than beating a low-ranked language. Each This paper is based on several different surveys and data player (i.e., language) has an absolute rank and a statistical sources: we used software repositories, surveys conducted confidence for it. by others, and a survey conducted by us. These were used 3. Course Surveys: MOOC. We gained access to a in a sequence; the results of one analysis informed the de- survey of 1,185 students in a massive online open course sign of the next. We began with pre-existing data from the (MOOC) on software-as-a-service (SaaS). The survey was SourceForge repository and The Hammer Principle, a long- administered at the beginning of the course, so student be- running online survey about programming languages. These liefs were not be altered by the instructors, though the re- led us to preliminary hypotheses, which we investigated with sults do reflect sample bias towards programmers with an surveys. Finally, we cross-validated with the Ohloh project interest in SaaS development. Most respondents were not . traditional undergraduate students. Their median age was 30 and a majority (62%) described themselves as professional 2.1 Data programmers. We now discuss these data sets in more detail. In chronolog- The survey was primarily conducted for pedagogical pur- ical order, we used: poses, not research. However, we advised the instructors on 1. Open Source Repository Metadata: SourceForge. question wording, and were given access to the raw col- We wrote a crawler to download descriptions of 213,471 lected data. Respondents were asked if they consented to projects from SourceForge [2], an online repository for open research use of their responses; 1,142 said yes (96.5% of source software. Of most relevance to our analysis, down- all responses). We only analyze responses from adults who loaded project metadata labels the set of languages used (out agreed to research use. of 100), primary project category (e.g., accounting) from a Respondents were randomly divided into four subsam- set of 223, date of creation, and the project’s owners. ples. Some questions were asked to every respondent (MOOC The years examined are 2000-2010. This data set is a rea- all) while others were only asked to one subsample. We di- sonable proxy for open source behavior because, for most of vided the questions to avoid fatigue in respondents from the analyzed period, SourceForge was the dominant software overly long surveys while still achieving enough responses repository. For example, its modern competitor GitHub was for statistically significant analysis of many questions. Only not even created until mid 2008. Open source community two of the subsamples (MOOC and MOOC d) are relevant behavior is important in its own right, and as our results will to the the questions addressed in this paper. show, commercial developers prioritize open source when making their own adoption decisions. Name Responses Age Degree Pro. Not every data source is applicable to every question we Quartiles ask. Table 3 summarizes how the data was used in various sections. MOOC b 166 25 - 30 - 39 51% 60% MOOC d 415 25 - 30 - 38 51% 58% 2.2 Respondent Demographics MOOC a–d 1,142 25 - 30 - 38 53% 62% Slashdot 1,679 30 - 37 - 46 55% 92% We tracked demographic information on the MOOC and SourceForce 266,452 people and 217,368 projects. Slashdot surveys. The surveyed populations are both pri- Ohloh 590,000 projects. marily professional developers, and a majority in each have Hammer 13,271 science degrees. The MOOC population skews younger than Slashdot and towards fewer professionals (Ta- Table 1: Overview of data sets and populations. Degree = ble 1), though the majority is still adult professionals with percentage with at least a bachelor’s in CS or related field. degrees. Comparing the results from the MOOC and - Pro = Professional programmers. dot surveys helps us tease out population-specific effects and phenomena in survey design. Programmers are diverse. Professional developers are es- Name Date Top 6 Languages timated to be a minority of all individuals who write code MOOC 2012 , SQL, C, C++, at work or as hobbyists [23]. Our research therefore does JavaScript, PHP not exhaust the universe of programmers. However, study- ing professional developers is of particular interest for un- Slashdot 2012 C, Java, C++, Python, derstanding adoption, both in terms of societal impact and SQL, JavaScript understanding the relevance of best-effort practices such as SourceForge 2000-2010 Java, C++, PHP, C, Python, technical education. We therefore emphasize the results for C# professionals. Our surveys are well targeted for this purpose; Ohloh 2000-2013 XML, HTML, CSS, 80.1% of the responses in the Slashdot survey were about JavaScript, Java, Shell projects performed in the workplace. Hammer 2010-2012 Shell, C, Java, JavaScript, We qualitatively validate our sample against that of pre- Python, vious work. Table 2 compares the six most popular lan- TIOBE Index Feb. 2013 Java, C, Obj.-C, C++, C#, guages of our surveys against the TIOBE index, which mea- PHP sures the volume of web search results for programming lan- guages [3]. The tables suggest that our survey samples are Table 2: Most popular languages in surveyed populations. broadly in alignment with one another and with the TIOBE There is substantial, but not total, overlap. survey. Differences appear attributable to details in ques- tion wording. For example, programmers might use CSS and XML regularly, but not think of it as a programming lan- 4. Online Survey: Slashdot. We created interactive visu- guage unless prompted to include it. They might use SQL, alizations of the Hammer data set and put them on a public but not consider themselves experts. Finally, they might use website. These attracted a substantial amount of web traf- a shell , but not consider it important. We fic. Viewers were invited to answer a short survey. Most of performed similar grounding comparisons against other re- the readers arrived via links from popular such as sults throughout our work. Slashdot and Wired; we refer to this as the “Slashdot” sur- vey. Over 97% of the responses were collected during a two- 2.3 Methodology week span in the summer of 2012. To maximize the quality of our surveys, we applied standard 5. Ohloh We cross-validated some results using Ohloh [1]. methodology for iterative gathering of large-scale and cross- Ohloh is a website, bought by SourceForge, that tracks sectional data. over 590,000 open source projects hosted on SourceForge, Throughout the survey design process, we drew upon ex- Github, and elsewhere. We used it for queries such as how ternal sources. At the start, we examined adoption studies in many repositories contain a Java file. various social sciences, such as the diffusion of innovation Raw anonymized data from the MOOC and Slashdot sur- model by Rogers [22]. We also performed a literature sur- veys are available online, as are the visualization and data vey on beliefs of prominent language designers [15]. Finally, exploration tools for the Hammer data (and underlying cor- we engaged in a series of open-ended discussions with pro- relations).1 The accessed SourceForge webpages and Ohloh grammers and language designers. The interviewees were API are publicly accessible at time of writing. primarily visitors to UC Berkeley, attendees at programming

1 Currently hosted at http://www.eecs.berkeley.edu/~lmeyerov/ berkeley.edu/~lmeyerov/projects/socioplt/viz/index., projects/socioplt/data/all.tar.gz and http://www.eecs. respectively Section Question SourceForge Ohloh Hammer MOOC Slashdot §3 What is the macro-behavior of lan- X + X+ §4.1 guages and communities? §4 What factors influence language + §3.2 and 3.3 + §6.2 X choice for projects §5 How do programmers learn and lose XX languages §6 What do programmers believe is im- XX + §4 portant in a language?

Table 3: Summary of where we use each data source. X= data source used in that section, + indicates that the result is cross- validated against a related result elsewhere in the paper. language and software engineering workshops and confer- methodology and the challenges of survey research on pro- ences, Bay Area startup employees, and were academic, in- grammers appeared at PLATEAU 2012 [14]. dustrial, and government programmers and language design- ers. These discussions covered several pre scripted questions 3. Popularity and Niches and were otherwise driven by the participant. The interviews We first examine the macro-level question of how adoption helped hypotheses as well as helped inform us on how of popular languages differs from unpopular ones. We divide to phrase subsequent survey questions neutrally and broadly. the analysis into three sub-questions: What is the overall Overall, we received advice and input from about 30 people. distribution of popularity? What is the relationship between To improve the survey instruments themselves, we ap- languages and application domains? How do developers plied several techniques. move between languages?

• Piloting questions. In developing survey questions, we 3.1 Popularity falls off quickly, then plateaus first designed an initial survey and tested it on several un- The distribution of usage across different languages indi- dergraduate and graduate students. We then held a discus- cates the risk/reward trade-off for creating a language. If the sion with about 10 graduate students in programming lan- median language has significant usage, that makes creating guages about hypotheses. We then repeatedly revised the a language a more promising endeavor than if the median questions, asking undergraduates, graduate students, vis- language accounts for a negligible fraction of usage. (Even iting researchers and professionals how they understood in the latter case, there may still be utility in building a lan- each question. guage, but the creator has fewer grounds to expect usage.) • Free response. Each survey asked respondents if any Figure 1a shows that a small number of languages claim questions were confusing and, after some individual most use on SourceForge, Ohloh, and Slashdot. For exam- questions, whether they had more to add. We do not re- ple, on SourceForge, the top 6 languages account for 75% of port on questions that respondents flagged as confusing. the projects and the top 20 for 95%. General purposes lan- To aid anonymity, we do not release the answers to guages compromise most of the top 20 languages in all three response questions. data sets. We found several notable exceptions to general pur- • Demographic questions. We applied two techniques to pose languages being the most popular. Considering only detect and compensate for sample bias. First, we included a project’s primary language, SQL ranks in the top 20. If we a variety of demographic questions, such as for age and include all languages for a project, SQL, HTML, and CSS education. Second, we compared results from several are in the top 20. We cross-validated with Ohloh’s analysis different surveys and different populations. of overall language use in actual repositories. Ohloh reports that XML, HTML, and CSS are the top 3 languages in terms Our cross-sectional survey methods have known lim- of any use, and the top 20 further includes Make and SQL. itations. For example, while we asked several questions That a small set of general purpose languages dominate lan- about respondent’s early experiences with programming guage use is not surprising. However, none of the language languages, a longitudinal study might assist future work designers we interviewed suggested that a language such as that explores specific hypotheses. Likewise, we focus on CSS, a constraint-based language for webpage layout, would correlations. To support future examination of causation, be more popular than all of the general purpose languages we solicited information on potentially confounding factors such as C and Java. The prominence of domain specific lan- such as developer demographics. Further discussion of our guages among other popular languages is surprising. Language Rank vs. % of Projects (Zipf's Law) 100.0000% y = 0.08e-0.09x 10.0000% ² = 0.98

1.0000% y = 0.52x-1.52 Proportion of R² = 0.95 Projects for Language 0.1000% (Log-scale) 0.0100% y = 0.04e-0.09x R² = 0.95 0.0010%

0.0001% 1 10 100 Language Rank by Decreasing Popularity (Log-scale) Slashdot Sourceforge Ohloh Power (Slashdot) Expon. (Sourceforge) Expon. (Ohloh) (a) Probability Mass Function. Each point represents a language. Language Popularity CDF (# Languages with >= # Projects) 100% y = -0.09ln(x) + 1.18 R² = 0.95 y = -0.101ln(x) + 1.0392 y = 0.90x-0.63 10% CDF of Languages R² = 0.97 R² = 0.93 with >= # of Projects (Log-scale) 1%

0% 1,000,000 100,000 10,000 1,000 100 10 1 # of Individual Projects (Log-scale Binning) Slashdot Sourceforge Ohloh Power (Slashdot) Log. (Sourceforge) Log. (Ohloh) (b) Cumulative Distribution Function. Each point represents the set of languages with the number of projects listed on the x axis.

Figure 1: Language popularity. Slashdot survey data follows a heavy-tailed power law while curated SourceForge and Ohloh data better follow an exponential curve.

We turn now from the most popular to least popular lan- For example, Slashdot respondents report using SAS, and guages, the tail of the popularity distribution. Our initial though manual verification shows SAS being used in repos- analysis of SourceForge shows an exponential decline in itories, Ohloh does not count it. We conclude that program- popularity, as does cross-validation with Ohloh (Figure 1a). ming language popularity has a heavy tail. In contrast, popularity in the distribution’s tail plateaus in the The heavy tail covers many unpopular languages. For ex- Slashdot survey. The rankings seems valid because the aver- ample, the least popular languages (those with only a single age difference in rank for languages in common between the project) in Slashdot cover 6% of the projects (Figure 1b). Slashdot and SourceForge is 6.9. Instead, the disparity seems Just 3 of them cover 0.2% of the projects. Lacking the heavy to stem from SourceForge and Ohloh language counts being tail, SourceForge and Ohloh show an equivalent percentage based on a curated list of languages. The Slashdot survey of less than 0.002% for 3 of the least popular languages. does not suffer from such culling because it records all re- The unpopular languages can be quite domain specific in sponses and thus tail behavior is not filtered out. Contribut- practice. For example, one respondent reported developing ing to this analysis, despite the Slashdot survey measuring in the Linden Scripting Language for restricted program- orders of magnitudes fewer projects, it yields a similar num- matic manipulation in the Second Life virtual world system. ber of languages. The data sets do not include the same lan- The heavy tail in our data shows that the market for lan- guages, however: they diverge in tail languages. Over half of guages supports creating many unpopular languages, not just the languages reported in Slashdot are not tracked by Ohloh. 1 ActionScript Java ASP.NET 0.1 Objective C PL/SQL C# Ruby 0.01 PL/SQL Assembly JSP Assembly Visual Basic .NET Visual Basic 0.001 -2 VBScript y = 0.02x /Kylix Overall Popularity R² = 0.63 Unix Shell 0.0001 Perl JavaScript 0 1 2 3 4 5 C# Dispersion of popularity across project categories Python C (coefficient of variaon = σ / μ) PHP C++ Java Figure 2: Dispersion of language popularity across Project Categories (223) project categories. The less popular the language, the more variable its popularity is across application domains. 0 0.1 0.2 0.3 0.4 0.5 0.6 (SourceForge data set). Percent of projects in a category Figure 3: Fraction of projects in each language for differ- extending the few relatively popular ones such as Java and ent project categories. Y axis is the top 20 (95%) languages Haskell. and X axis is project categories with at least 100 projects written in any language. Dark red cells indicate a high prob- 3.2 Unpopular languages are niche languages ability of using a particular language within a given project Our next question is how the popularity of a language relates category. (SourceForge). to its domain-specificity. To evaluate this, we use the project category labels provided by the SourceForge metadata. For Top 6 Languages each language, we compared its overall popularity (as a frac- Java C++ PHP C PYTHON C# tion of all projects) to its popularity in each category (e.g., accounting). We formalize this comparison as the the coeffi- µ 0.205 0.167 0.119 0.140 0.063 0.062 cient of variation (standard deviation divided by mean). Fig- σ 0.094 0.101 0.117 0.100 0.028 0.029 σ 0.006 0.007 0.008 0.007 0.002 0.002 ure 2 plots the coefficient of variation of languages against x¯ the percentage of projects using them (i.e., popularity). Table Top 25-30 Languages 4 shows several languages in detail. ASP BASIC Obj Pascal Matlab Fortran We find that popular languages receive broad-based sup- µ 0.003 0.003 0.002 0.003 0.002 port while unpopular languages tend to be used in a few par- σ 0.004 0.004 0.003 0.008 0.009 ticular domains. For example, Java is used by 20% of the σx¯ 0.001 0.001 0.000 0.000 0.000 projects, and for 70% of the project categories, Java is sim- ilarly used by 10–30% (so +/-50%) of the projects within Table 4: Mean, variance, and standard error of language that category. In contrast, a relatively unpopular language popularity in different project categories. Only project like Prolog is only used in a handful of categories. For 70% categories with at least 100 projects are considered. We show of categories, Prolog’s popularity within a category will dif- the top six (75% total usage) popular languages and also fer from its overall popularity by +/- 800% rather than Java’s languages 25-30 (1.2% total usage). Order of categories is +/-50%. arbitrary. (SourceForge). A few outliers stand out in our analysis. Assembly (µ = 0.011, σ = 0.03) and Fortran (µ = 0.002, σ = 0.03) vary in popularity across categories 3-6X more than our model We hypothesize that, as VBScript is a scripting language predicts based on their overall popularity. In effect, they packaged with and made for Windows, it attracted a varied are being used as domain specific languages for numeric or base of developers that were performing a range of tasks. low-level programming. That makes them unusual cases of Despite occasional outliers such as Fortran and VBScript, domain-specific languages that are popular overall. A pos- Figure 2 shows a clear curve that relates overall popularity sible explanation is that both used to be viewed as general- to the variation in popularity across niches. purpose languages; they have held onto niches, rather than Even though unpopular languages have their usage con- colonized them for the first time. centrated in a few niches, they are rarely the most pop- VBScript is an outlier of the opposite sort: our model pre- ular languages in those niches: the consistently popular dicts more variability across categories than actually occurs. languages tend to win out. Figure 3 illustrates this with a e iia lseigpteni u data. our in would pattern we clustering similar that C/C++ Java a hypothesized We see as use [12]. likely Perl as also use times to developers five developers that are Java developers report of C/C++ they and fraction (7%), example, a for small – only languages of stick to “clusters” tend developers to that report and projects, source open devel- a next. influence the will for project selection one oper’s for language ask a we develop- precisely, using More how how language. examine to we language so from decision move next, ers selection the of language independent one their be expect of to not course do the We over careers. projects many on work Developers migration Developer 3.3 domains. across 3 vary Figure C++ niches. language and C, across PHP, one varies that which another shows than to popular extent more for is the not Likewise, but general particular. in C in than popular more example, is For C++ niches. specific in languages more-popular out languages. popular most bot- the the containing with rows popularity, tom by sorted are languages across The languages niches. of popularity relative the showing heatmap (SourceForge) elided. are projects 100 than less with probability the guages of the language is the point shows each axis next: X and project previous the project.of previous the of guage 4: Figure !"#$%&'"($)# Q au n alhv nlzdtecmi osfo 22 from logs commit the analyzed have Gall and Karus beat overall languages less-popular where cases are There !'* !'*+,-. *(%2@2$3$#4I%XI)$"Y$&LI@I3@&LE@L0IL$B0&I#;0I)(0B$%E/I%&0 !//01234 5!'67 7

rbblt fpcigalnug ie h lan- the given language a picking of Probability 78 !"#$%"$&'()'#&*+',-(.&/+ Q+R 799 :03);$<=43$> ?%(#(@& A(%%B4 C@B@ C@B@'"($)# Q+S C'* D$/) DE@ F!.D!5 G2H0"#I*@/"@3 G2H0"#$B0I7 Q+T *@/"@3 *0(3 *J* *D<'KD

xssostelanguage the shows axis Y *(%3%L

p *4#;%& Q+U

( ME24 L '";010

0 ."3

= N&$>I';033 O5'"($)#

x O$/E@3I5@/$" Q+V O$/E@3I5@/$"I+,-. | P'D L P'D O$/E@3I5@/$"I+,-. O$/E@3I5@/$" O5'"($)# N&$>I';033 ."3 '";010 ME24 *4#;%& *(%3%L *D<'KD *J* *0(3 *@/"@3 G2H0"#$B0I7 G2H0"#I*@/"@3 F!.D!5 DE@ D$/) C'* C@B@'"($)# C@B@ A(%%B4 ?%(#(@& :03);$<=43$> 799 78 7 5!'67 !//01234 !'*+,-. !'* !"#$%&'"($)# = y Q+W ) Lan- .

!"#$%"$&'()',-&01(%2',-(.&/+ oecmlctdrfieet oor ol nyne to need projects. unexplained only the would of 25% ours address in to refinements effective. decisions is complicated selection it More simplicity, language model’s use our the Despite prior of SourceForge. and 75% popularity over language predicts selection: language for not projects general intermediate in SourceForge. through stronger on reported carry be may may they report SourceForge because we sample them- effects only it we the used Likewise, projects: then next. and the only project on that of one selves developers on cases to used Some due being use. language be language may for reuse proxy language a as participation semantic with language new languages the to to to use similarities trying try who of and instead programmers domain domain, convince that a in on programmers focus convince that should implies This advocates languages. language underlying than the ecosystem of technical similarity such or by factors background external developer’s movement by the developer more as driven Ruby, that is conclude between languages We between or Perl. LISP, of and and probability Python, Scheme low relatively between a switching is There families. as linguistic such Java. with groupings conjunction other in of used found being results XML who the and [12], WSDL in Gall visible also and are also Karus correlations languages Such such These ASP. languages as C#. web-development Microsoft and with VBScript lan- correlate next application as and the such probabilities scripting for guages, Windows high between language have switch different developers to notably, specific Most a project. of choice the the of one 4. to Figure in switch bands languages vertical will the overall-popular they to These correspond – overall. time languages the six top of More time. 52% the – of 18% often language that random, to at keep uniformly will language developers first a same select the we with If most languages. projects for between that, goes shows developers line diagonal languages, the bright language that The the bottom. probability the use on the used will that right, project SourceForge project the next a developer’s on in labeled was developer language a the that given depict, to . date creation next the chronologically for the choice with language project the con- influenced of that project choice a developer the for how language each examined we For projects, multiple project. to tributed a within than projects across rather activity examining when differences find and vrl,w on ipeadrpeettv model representative and simple a found we Overall, project using are we that is result our of limitation One within exploration significant see not do we Notably, with correlated occasionally only language prior given A normalized is row Each 4. Figure in results the present We question, this answer to data SourceForge used again We saw a Open source libs. (M) Extending existing code (E) Already used in group (E) Personal familiarity (E) Team familiarity (E) Performance (M) Portability/platform (E) Development speed (M) Tools (M) Safety/correctness (I) Potential team famil. (E) Particular language feature (I) Overall Commercial libs. (M) 1-100 employees Simplicity (I) 101+ employees 0 10 20 30 40 50 60 70 80 Percent of respondents describing aspect as medium or strong importance

Figure 5: Importance of different factors when picking a language. Self-reported for every respondent’s last project. Bars show standard error. E = Extrinsic factor, I = Intrinsic, M = Mixed. Shows results broken down by company size for respondents describing a work project and who indicated company size. (Slashdot, n = 1679)

4. Decision Making from final respondents suggest that no significant categories Above, we analyzed the overall “macro”-level process of are missing. language adoption, with results such as that most unpopular We wanted to see how these results depend on a devel- languages have their popularity concentrated in a few niches. oper’s work environment. To do this, we broke out those re- We now offer a “micro” view: we focus on how individ- spondents who picked a work project and who indicated an ual developers make decisions. The central research ques- organization size. Of respondents with who marked the size tion being investigated is which factors influence developer of their company, we split them into those in companies with selection of languages? We break this into two parts: how do fewer than 100 employees and those with more than 100. developers weigh features of languages when they are mak- (This threshold is closest to splitting the data set evenly, fa- ing choices, and how do demographics affect the languages cilitating comparison.) The results for these subpopulations that developers pick? are also shown in Figure 5. A wide gap separates the most and least influential fac- 4.1 Extrinsic properties dominate intrinsic ones tors. The most influential factor, the availability of open source libraries, was “strong” or “medium” for over 60% of In the Slashdot survey, we asked respondents to rate the respondents. For the least influential factor, simplicity, only influence of particular factors in picking the language for 25% said the same. As organization size increases, we found their most recent project on a four-point scale from “none” that commercial libraries become more important, and open to “strong”. By asking about their most recent project, we source libraries become less so (although open source is still encourage reflection upon a historical event and deempha- weighted quite highly). Correctness becomes a more influen- size ideal preferences that might not be acted upon in prac- tial factor, while simplicity, platform constraints, and devel- tice. Respondents assigned individual priorities to 14 factors opment speed matter less for large companies. Group experi- (Figure 5). We selected the 14 categories such that the dom- ence and legacy code mattered more in larger organizations. inant choice for picking a language would be represented. Pretesting helped the list, and free response comments Some of the factors, such as the language’s features or 24% simplicity, depend on the language design, not on its user 20% base. Others, such as whether a developer already knows the language or the ease of hiring developers who know it, are 16% extrinsic and depend on the social context of the language. 12% Some factors include a mix of extrinsic and intrinsic aspects. 8% For example, the presence of libraries or good development 4% tools is a combination of social and technical factors. Figure 0%

5 is labeled with our judgement about whether a factor was last project use on of Probability objectivec perl python intrinsic, extrinsic, or mixed. any 1-19 employees 20+ employees age <= 25 age > 25 We emphasize four results from the data: 1. Open source libraries. Open source libraries are the Figure 6: Demographic influences on selecting partic- most influential factor for language choice overall and the ular languages. Self-reported for every respondent’s last most influential factor for commercial projects at small project. Bars show standard error. (Slashdot, nany = 1679, companies. They are an important factor, but not the most n1-19 employees = 290, n20+ employees = 790, nage≤25 = 154, important factor, at large companies. nage>25 = 1382). 2. Social Factors outweigh Intrinsics. Existing code or ex- pertise with the language are four of the top five factors dot respondents, age and company size strongly influence for adoption. In contrast, intrinsic factors, such as a lan- whether a particular language was selected. For example, guage’s simplicity or safety, rank low. Implementation at- employees at small companies are 1.4-3.0 times more likely tributes, like performance and tool quality, have both in- to use Objective-C than the typical respondent. Likewise, trinsic and extrinsic components. (Some languages lend when looking at age, developers under 25 rarely use Perl but themselves more easily than others to a high-performance disproportionately select Python. implementation.) These mixed attributes vary in impor- Compared to the breakdowns of Figure 5, we see that tance. language selection is especially sensitive to demographic 3. Domain specialization. Libraries, developer experience, effects. Notice also that different languages are sensitive to and legacy code are all important in language selection. different demographic variables: Perl and Python are age These factors are often associated with particular appli- sensitive, but appear insensitive to company size; Objective- cation domains. Thus, the developer emphasis on these C is sensitive to company size, not to age. A consequence of attributes helps explain the result in Section 3.2 that less- this is that we should expect different language communities popular languages are more niche-specific. to be different; our observations suggest that it is unsafe 4. Company size matters. Employees at larger companies to generalize about which ways a particular demographic place significantly more value on legacy code and knowl- variable will correlate with language usage. edge than do employees at small companies. 5. Language Acquisition These results help inform language designers seeking adoption where to focus their efforts. Developing high- We now switch focus from the decision to use a language in value open source libraries is likely to have a large influ- a particular project to the process of learning languages. We ence to language adoption, particularly for individuals and examine three related questions: How long does it take de- small companies. Simplicity will not attract many program- velopers to learn languages? When in their careers do they mers. Smaller companies are less constrained by legacy code learn? And how does education affect learning? The previ- and experience. We suspect that they are therefore likely ous section demonstrated that developers prefer to use lan- more willing to adopt new languages that are not backward- guages they already know. Understanding how quickly de- compatible. In contrast, a backward-compatible change to a velopers learn and what induces them to learn helps explain language might be more valuable to a large company. this adoption factor. 4.2 Demographic influences on language selection 5.1 Learning speed We now look at how developer demographics affect the lan- We examined learning speed because it limits how many lan- guages that developers select. Understanding this extrinsic guages a can use. For the language they used factor clarifies the generality of our results and its role for on their most recent project, the Slashdot survey asked re- future empirical analysis or targeting of developers . spondents to estimate how long it took to learn to use the We observed significant correlations when distinguish- language well. To “know a language well” is an intention- ing the different demographics that selected a particular lan- ally imprecise and subjective standard. We showed above guage. Figure 6 shows that, for languages selected by Slash- that when developers pick languages for a project, they are Probability of learning the primary language during a project (by demographic)

Probability of learning during a project 0% 5% 10% 15% 20% 25% 30% 35% 1-2 years overall 6-12 months < 20 employees 3-6 months 20+ employees 1-3 months age: 21-22 0-1 months age: 23-24 age: 25-30 age: 31-40 age: 41+ C Perl Java PHP C# C++ Python Ruby JavaScript Figure 8: Probability of learning the primary language Figure 7: Reported speed of language acquisition. Bars are during a project. Shading denotes demographics and bars standard error. (Slashdot, n = 1679) are standard error. (Slashdot, n = 1536) heavily influenced by the languages they believe they know: learned a language for their most recent project. That rate developers will be using a standard that is also subjective. increases to 29% for 21-to-22-year olds. While a jump for Figure 7 shows the results for all languages for which younger programmers is not surprising due to inexperience we had at least 50 responses. The question was framed or perhaps eagerness, we had expected to see a much higher as multiple choice; the y-axis labels of Figure 7 were the increase in learning rate. The difference quickly tapers off, available options. with developers aged 25-30 behaving similarly to those aged The median learning times for the most challenging lan- 31-40 olds (20% vs. 19%). Relative to age, organization size guage and the most approachable differ by a factor of ten. has minimal influence. Programmers report C++ as the slowest to learn while the Our analysis shows that developers in our sample steadily fastest are PHP, Python, and Ruby. Java and C#, which are learn languages through their career. As a result, they are semantically similar, are between these extremes and have not limited by the languages that were popular when they similar learning times. were young or in school. This means that the time scale of The relative time to learn PHP is noteworthy. PHP is no- language adoption is not driven by the career timelines of torious for its ad-hoc design while Python well-regarded for developers because young developers will keep pick up old its simplicity. Despite their differences, both languages have languages and old developers will pick up new ones. comparable reported learning times of just a few months. We One might think, from these results on learning rates, that infer that developers report that they can “use the language age has an important role in language knowledge. However, well” even if they have not mastered every nuance. Develop- as we show in the next subsection, these differences in lan- ers may only need to learn a subset sufficient for completing guage learning between older and younger developers have routine work. limited effect on the overall distribution of language knowl- More fundamentally, the relative ease of learning PHP edge. suggests that, while complexity is a barrier to adoption [22], what makes a language complicated for developers need not 5.3 Languages over time be what makes it complicated for a designer or researcher The next sub-question we examine is how a developer’s age (e.g., convoluted formal semantics). For the same reason, influences the languages that the developer knows. Some what makes a language simple in a formal sense does not professional recruiters claim that there are large and signifi- make it simple for developers and otherwise adoptable. cant differences in the languages that older and younger de- velopers use [11]. Our results refute this claim. 5.2 On-site language learning Figure 10 shows the median age for programmers who We now examine the relationship between developer demo- claim to know each of the indicated languages, along with graphics and the languages that they learn. Knowing which the 25th and 75th percentiles of age. The distributions are demographics are likely to learn helps language proponents all very close. This is surprising: we would have expected target their outreach at those potential early adopters. changes in education over time to result in large deviations, The Slashdot survey asked respondents whether, during such as Pascal skewing old, and Ruby skewing young. In- their last project, they learned the primary language for it. stead, the 95% confidence interval for every language in- Figure 8 shows the results broken down by age and orga- cludes the overall response mean age (38). The deviations nization size. We found that younger developers are more are not statistically significant. We observed similar age in- prone to learn new languages. Overall, 21% of respondents variance patterns in the results of the MOOC survey. There 18

16 8 14

12 6 10

8 4

6

4 2 ever used know slightly 2 Mean # Langs. known

Mean # Langs. known know well know well 0 0 20 30 40 50 60 20 30 40 50 60 Age Age

Figure 9: Number of languages by age. Developers of Figure 11: Number of languages by age. As with Figure 9; different ages seem to know a similar number of languages. there is no clear trend with age. (MOOC, n = 1142) Lightly shaded rectangles show 25th and 75th percentiles, error bars show standard error of mean. (Slashdot, n = 1679) number of languages known by developers against their age, along with the inter-quartile range. Both lines are remark- ably flat. The mean respondent to our survey claims to have 55 learned ten languages, and lists six that they “know well.” 50

45 Likewise, the upper and lower quartile range does not show

40 any clear trend over time; the gap between the 25th and 75th

35 percentile of number of languages does not change over time 30 in a systematic way. 25 The MOOC survey asked developers to list the languages Age of users 20 they know well and the number they know slightly. The re- 15 sults are shown in Figure 11. Compared to the Slashdot sur- vey, the average developer in the MOOC survey knows fewer C Basic Java PHP Pascal C++ Ruby languages. However, the results are qualitatively similar in JavaScript that there is no age trend. This shows that our result is robust to both varying the detailed wording of the question and to a Figure 10: Mean, standard error, and 25th and 75th per- change in the underlying population. centiles for ages of programmers who claim to know each There is a tension between the flat lines on Figures 9 language. Languages are sorted by creation date. Distribu- and 11 and the fact that developers are steadily learning. tions are nearly the same for every language. (Slashdot, n = Since developers of different ages are similarly likely to have 1679) learned a language for their last project, we might expect the number of languages they know to rise over time. Instead, as well, developers of different ages know a similar number older and younger developers report a similar degree of mul- of languages and a similar mix of languages. tilingualism. It follows that developers are losing languages This shows that differences in learning by age, described as well as gaining them. They are forgetting – or at least, above, get washed out over the course of a career. In our sur- forgetting to mention – some languages. veys, programmers of any age are equally likely to remem- This result shows that adult developers effectively have a ber or newly learn older languages like Pascal. Young pro- limited capacity for languages. They typically maintain skill grammers catch up to old ones, and older ones keep learning. in a limited number of languages, and will forget as many Popular languages do not thrive or wither based on the age languages as they learn. This implies that immediate devel- breakdown of their adherents. Programmers keep learning, oper familiarity is a limited resource for which languages and learn often enough and quickly enough that their age must compete. Such competition is the basis for ecological does not predict which languages they know. theories of adoption [15]. We now look at the overall number of languages a devel- oper knows. The Slashdot survey asked developers to esti- 5.4 Effects of education mate the number of languages they have learned, and also to We also looked to see how a respondent’s computer science list the languages that they know well. These different ques- education affected their subsequent programming language tions capture different levels of knowledge and familiarity knowledge. The Slashdot survey asked respondents to mark and we include both in our results. Figure 9 plots the mean which families of languages they learned while in school Language Examples Overall For CS Non- Correlation of If taught If not Correlation of Family majors majors CS major vs. in school taught learned in knowing school vs knowing Functional Lisp, Scheme, 22% 24% 19% 0.053 40% 15% 0.262 Haskell, ML (0.004 - 0.101) (0.217 - 0.307) Dynamic Perl, Python, 79% 78% 79% -0.008 84% 77% 0.069 Ruby (-0.056 - 0.040) (0.021 - 0.117) Assembly 14% 14% 14% 0.004 20% 10% 0.138 (-0.044 - 0.052) (0.091 - 0.185) Imperative/OO C/C++,Java/C# 94% 97% 90% 0.134 95% 87% 0.133 (0.087 - 0.181) (0.085 - 0.180) Math R, Matlab, Math- 11% 10% 11% -0.020 31% 7% 0.268 ematica, SAS (-0.068 - 0.028) (0.223 - 0.312)

Table 5: Probability of knowing at least one language in the indicated family, overall and grouped by major and specific educational experience. Also shows correlation coefficients between knowing a language in that family and (a) having a CS degree, and (b) having learned a language in that family in school. Whether developers learn a language in that family in school has much more influence than being a CS major. Shown below each correlation is the 95th percentile confidence interval. (Slashdot, n = 1679)

(e.g., assembly, functional languages, dynamic languages). 6. Beliefs about Languages Table 5 shows our results. We now turn from programmer actions to programmer be- The vast majority of respondents know a compiled non- liefs. This section looks at what attributes of languages do functional language, such as C or Java, regardless of major or developers like or dislike? The question is in contrast to curriculum. Likewise, dynamic languages are widely known. that of Section 4, which examines how developers pick lan- Less-popular language families (assembly, functional, and guages within the context of a specific project. mathematical languages) are more sensitive to prior educa- Beliefs help shape developer and manager decisions to tion. Promisingly, developers who learned a functional or learn and advocate languages, and thereby affect the so- math-oriented language in school are more than twice as cial dynamics of adoption. Understanding what developers likely to know one later than those who did not. value, and how developers perceive languages, can help de- Educational intervention has limits, however. For math- signers both to develop better-liked languages and also to ematical, functional and assembly languages, the large ma- advocate more effectively on behalf of their languages. jority of developers that learned a language in that family no longer know any similar language. Consequently, the corre- 6.1 Perceived value of features lation between education and later knowledge is relatively The MOOC-d survey asked developers to rate the impor- modest. tance of different types of language features on a scale from Notably, having been a computer science major does not unimportant to crucial. The results are shown in Figure 12. lead to the linguistic versatility of students who learned dif- As can be seen, libraries are the top-rated feature. Such per- ferent language families as part of the course curriculum: ceived importance of libraries matches our finding that li- there is no measurable correlation between being a CS ma- braries matter in practice (Section 4), and we stress that these jor and knowing particular programming paradigms. Our re- confirming results are largely free of effects such as priming sults suggest that an undergraduate curriculum that does not because the questions were not asked on the same survey. introduce students to a variety of languages is unlikely to re- We included several options on the MOOC-d survey sult in more versatile programmers later in their careers. This that represent related concepts. For example, higher-order demonstrates a weakness of curriculums that only focus on functions are a strictly more powerful language primitive languages such as Java and Python. than inheritance. Interfaces are often used for type systems. One limitation of this finding is that it measures only Threads can be used to implement task parallelism. correlation, and there could be causation in both directions. While these features are similar in many ways from a se- Developers who expect to use distinctive families such as mantics point of view, they had sharply different priorities assembler languages might choose to study them in school. with developers. For example, 72% of developers consid- Demonstrating how much causation flows in each direction ered inheritance important or crucial; only 45% felt that way is beyond the scope of this paper. about higher-order functions. Interfaces likewise were con- 0%# 20%# 40%# 60%# 80%# 100%#

libraries# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% I don't understand Totally unrelated A lile performance# Somewhat Very similar Mathemacally equivalent object#inheritance# classes#w.#interfaces# Figure 13: Higher-order functions and objects. Most de- excep