
Are Software Dependency Supply Chain Metrics Useful in Predicting Change of Popularity of NPM Packages?

Tapajit Dey
University of Tennessee, Knoxville
Knoxville, Tennessee

Audris Mockus
University of Tennessee, Knoxville
Knoxville, Tennessee

ABSTRACT
Background: As software development becomes more interdependent, unique relationships among software packages arise and form complex software ecosystems. Aim: We aim to understand the behavior of these ecosystems better through the lens of software supply chains and to model how the software dependency network affects the change in downloads of JavaScript packages. Method: We analyzed 12,999 popular packages in NPM, between 01-December-2017 and 15-March-2018, using Linear Regression and Random Forest models and examined the effects of predictors representing different aspects of the software dependency supply chain on changes in the number of downloads for a package. Result: Preliminary results suggest that the count and downloads of upstream and downstream runtime dependencies have a strong effect on the change in downloads, with packages having fewer, more popular packages as dependencies (upstream or downstream) likely to see an increase in downloads. This suggests that in order to interpret the number of downloads for a package properly, it is necessary to take into account the peculiarities of the supply chain (both upstream and downstream) of that package. Conclusion: Future work is needed to identify the effects of added, deleted, and unchanged dependencies for different types of packages, e.g. build tools, test tools.

CCS CONCEPTS
• Computing methodologies → Classification and regression trees; Bayesian network models; Learning linear models; • Software and its engineering → Open source model;

KEYWORDS
Software Supply Chain, Open Source, Software Popularity, NPM Packages, Software Dependency

ACM Reference Format:
Tapajit Dey and Audris Mockus. 2018. Are Software Dependency Supply Chain Metrics Useful in Predicting Change of Popularity of NPM Packages? In The 14th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE'18), October 10, 2018, Oulu, Finland. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3273934.3273942

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PROMISE'18, October 10, 2018, Oulu, Finland
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6593-2/18/10...$15.00
https://doi.org/10.1145/3273934.3273942

1 INTRODUCTION
The usefulness and success of software, as for other products in the market, is reflected by its usage. A true measure of the popularity, or in other words the usage, of a piece of software would be its number of active users. Unless software licenses are sold and carefully tracked, the number of active users is virtually impossible to measure directly and accurately. The number of downloads is, perhaps, the closest approximation for the popularity of a piece of software. NPM (node package manager), which is the package manager for JavaScript packages, does track the number of downloads for the packages and makes the data publicly available, unlike most other package managers.

Like most software packages, the JavaScript packages distributed by NPM have a dependency structure, i.e. one package may have other packages as runtime and/or development dependencies (other types of dependencies also exist). This dependency network influences the number of downloads of the individual packages, because when one package is installed by a user, all of its dependencies are also downloaded and installed automatically (unless they are cached). The number of JavaScript packages on NPM exceeds 600K, so analyzing the complete dependency network for the whole NPM ecosystem is a challenge. The dependencies listed for a package are direct dependencies; however, those dependencies might have their own dependencies. This complex interconnection forms a dependency network that can be considered a software supply chain, with the entire set of recursive dependencies of a package being upstream from it, and the packages that are recursively dependent on it being downstream.

In a nutshell, we would like to know whether the supply chain has an effect on downloads:

RQ: Do the number and downloads of upstream and downstream dependencies help predict the downloads for popular JavaScript packages in NPM?

We collected data from npms.io and obtained daily, weekly, monthly, and yearly download numbers for all NPM packages. We focus on predicting monthly downloads, because the daily and weekly downloads for most packages exhibit large variations, which, presumably, are random in nature, and we do not have sufficient history to predict yearly downloads for most packages. We collected snapshots of the statistics on all packages roughly every 2 weeks between 01-December-2017 and 15-March-2018, and used the data from a snapshot to predict the number of downloads for the packages over the next month. To ensure as little overlap as possible, we did not use the data from a snapshot to predict the number of downloads in the next snapshot, which is 2 weeks apart, but looked at the one after that, which is roughly 1 month apart. Also, since the actual number of downloads varies drastically among packages, we use the logarithm of the ratio of the downloads in the next month to the downloads of the previous month as our response variable. To focus on the effect of the supply chain aspects, we consider only packages with at least one upstream and one downstream dependency.

To answer the RQ we used both linear regression and random forest models. Our preliminary findings suggest that both the number and the popularity (measured by the number of downloads) of the dependencies are among the most important predictors for the change in downloads for NPM packages. These models can help interpret the download counts and the reasons that affect the usage of a package. Our findings also suggest that the download number can be affected (or manipulated) through dependencies and frequent updates (probably leading to cache misses).

In Section 2, we present the related work in this field. In Section 3, we describe the data. The data analysis and modeling approaches used for the study, and the results of the analysis, are described in Section 4. In Section 5, the implications of the results are presented and ideas for future work are discussed.

2 RELATED WORK
The term "software supply chain" was used as early as 1995 [8]. The concept has since been used for addressing economic and management issues in software engineering [6], for facilitating the Software Factory development environment introduced by Microsoft [7], and elsewhere. The primary use of the software supply chain has been for identifying and managing risks related to the software development process [4, 9]. In this study, we are trying to use the dependency network (the supply chain) for understanding and predicting the change of popularity of NPM packages.

The topic of software popularity has not seen much attention, possibly due to the lack of reliable popularity measures. Stars for GitHub projects were used to identify the factors impacting the popularity of GitHub projects [3]. The relationship between the popularity of mobile apps and their code repository was studied in, e.g. [...]

[...] packages in NPM meet our criteria across all the snapshots. We used the API to collect data on all NPM packages twice a month, roughly 2 weeks apart, once in the beginning of a month and once in the middle, between 01-December-2017 and 15-March-2018, resulting in 8 different snapshots. The data collection process takes around two days, so the data on all packages are not from the same date; however, we expect the variance to have been evened out across all snapshots. Since we are using the data for predicting the ratio of downloads between the current and the next month, we had 6 usable snapshots (because for the last two we did not have download counts in the next month).

The data collected from npms.io has information on the GitHub repository of the project, e.g. the number of issues, the number of weekly, monthly, quarterly, half-yearly, and yearly commits, the list of contributors and the number of commits by each contributor, as well as the number of forks and stars. The metadata information consists of the list of runtime and development dependencies, the monthly, quarterly, half-yearly, yearly, and total number of releases for the project, the name and email of the author and the publisher, the README text, the list of maintainers and contributors (as listed in NPM), and the number of daily, weekly, monthly, quarterly, half-yearly, and yearly downloads. We used the counts of these quantities (except the README file) as the variables for our analysis. The data also has information about some evaluation metrics calculated by the npms.io website, which we did not use for our analysis. For the number of releases and commits, we only used the monthly and yearly numbers, because a quick PCA (Principal Component Analysis) suggested that most of the variance (∼80%) in all the commit and release variables is explained by these two components.

As mentioned earlier, the response variable for our study is the logarithm of the ratio of downloads in the next month and the down[...]
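
The snapshot pairing and the response variable described in the text can be illustrated with a minimal sketch. It assumes 8 snapshots taken roughly 2 weeks apart, pairs each snapshot with the one two positions later (roughly 1 month apart), and computes the logarithm of the ratio of next-month to previous-month downloads. The function names, the snapshot labels, and the choice of the natural logarithm are assumptions made for illustration; the paper does not state the base of the logarithm or show its code.

```python
import math


def pair_snapshots(snapshots, lag=2):
    """Pair each snapshot with the one `lag` positions later.

    With snapshots about 2 weeks apart, lag=2 skips the adjacent
    snapshot and gives pairs that are roughly 1 month apart.
    """
    return [(snapshots[i], snapshots[i + lag])
            for i in range(len(snapshots) - lag)]


def log_download_ratio(previous_monthly, next_monthly):
    """Log of the ratio of next-month to previous-month downloads.

    Natural log is an assumption; the base is not given in the text.
    """
    if previous_monthly <= 0 or next_monthly <= 0:
        raise ValueError("download counts must be positive")
    return math.log(next_monthly / previous_monthly)


# Hypothetical labels standing in for the 8 collected snapshots.
snapshots = [f"s{i}" for i in range(1, 9)]
pairs = pair_snapshots(snapshots)

# Hypothetical monthly download counts for one package.
response = log_download_ratio(previous_monthly=1000, next_monthly=2000)
```

With 8 snapshots and a lag of two, only the first 6 snapshots have a partner, which matches the 6 usable snapshots mentioned in the text. A response of zero means no change in downloads, while positive and negative values indicate growth and decline on a scale that is comparable across packages with very different download volumes.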