Arxiv:2104.07406V1

A systematic review of Python packages for time series analysis Julien Siebert1[0000−0002−7696−0046], Janek Groß1[0000−0002−6306−711X], and Christof Schroth1[0000−0003−0828−4063] Fraunhofer-Institut fürExperimentelles Software Engineering Abstract. This paper presents a systematic review of Python packages focused on time series analysis. The objective is first to provide an overview of the different time series analysis tasks and preprocessing methods implemented, but also to give an overview of the development characteristics of the packages (e.g., dependencies, community size, etc.). This review is based on a search of literature databases as well as GitHub repositories. After the filtering process, 40 packages were analyzed. We classified the packages accord- ing to the analysis tasks implemented, the methods related to data preparation, and the means to evaluate the results produced (methods and access to evaluation data). We also reviewed the licenses, the packages community size, and the dependencies used. Among other things, our results show that forecasting is by far the most implemented task, that half of the packages provide access to real datasets or allow generating synthetic data, and that many packages depend on a few libraries (the most used ones being numpy, scipy and pandas). One of the lessons learned from this review is that the process of finding a given implementation is not inherently simple, and we hope that this review can help practitioners and researchers navigate the space of Python packages dedicated to time series analysis. 1 Introduction A time series is a set of data points generated from successive measurements over time. The analysis of this type of data has found applications in many fields, from finance to health, including the monitoring of computer networks or the environment. The current trend of decreasing costs of sensors and data storage, the increasing performance of big-data and data analysis technologies such as machine learning or data mining, are opening up more and more possibilities to acquire and analyze temporal data. For example, a simple search on Scopus or Web of Science with the following search string (restricted to documents titles) "time series" AND ("analysis" OR "data mining" OR "machine learning") shows an increasing trend in the number of publications within the last 10 years (see figure 1). Moreover, as the number of time series analysis application cases rises, more and more data scientists, data engineers, analysts, and software engineers have to use dedicated time series analysis libraries. In this article, we systematically review Python packages dedicated to time series analysis. Python is one of the programming languages of choice for Data Scientists1. Data Scientists arXiv:2104.07406v1 [cs.MS] 15 Apr 2021 are not only responsible for analyzing data, their task is also to ensure that services based on these analysis reach a sufficient level of maturity to be deployed and maintained in production. In this context, we review not only the analysis tasks implemented in the packages, but also several factors external to the tasks alone, such as which dependencies are used or how big is the community behind the development of the package in question. Our goal is not to evaluate the quality of the implementations themselves but to provide a structured overview that is useful 1 See the survey performed by Kaggle in 2018: https://www.kaggle.com/kaggle/kaggle-survey-2018. 2 Julien Siebert, Janek Groß, and Christof Schroth 1000 700 Scopus WoS 0.24 600 800 0.16 0.22 500 0.20 0.14 600 0.18 400 0.12 Trend (WoS) Trend (Scopus) 0.16 400 300 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 200 200 100 Time Series Publications (WoS) Time Series Publications (Scopus) 0 0 1940 1960 1980 2000 2020 Year Fig. 1: Number of documents retrieved by Scopus (left axis) and Web Of Science (WoS, right axis) with the search string (restricted to documents titles): "time series" AND ("analysis" OR "data mining" OR "machine learning") over time. The inlet represents the trend (in %) over the last 20 years (the trend is normalized over the total publications in DBLP (the data is accessible in https://dblp.org/statistics/publicationsperyear.html). both to data scientists confronted with time series analysis (and confronted with the choice of which packages to rely on), as well as to the scientific community and to the community of Python developers working in this field. 2 Related Work Time series analysis is a broad research field covering many application domains. The literature contains many reviews, either focusing on the analysis tasks and methods (see for instance the following reviews on forecasting [32,49,57,79], clustering and classification [1,2,7,27,76], anomaly detection [6,15,90], changepoint analysis [66,4,84], pattern recognition [88,83], or dimensional- ity reduction [61]) or focusing on a specific application domain (see for instance the following surveys on finance [65], IoT and Industry 4.0 [54,45,93], or health [92]). Over time, several for- mal definitions and reviews of time series analysis tasks have been published, see for example [25,24,36]. However, existing implementations (software packages or libraries) are often listed - usually in a non-systematic way - in textbooks (like [16,70] for R, or [56] for Python) or grey literature (for example, Towards Data Science2, KDnuggets3 or Machine Learning Mastery4). Some papers review the state of the open source implementations for a given topic. For example, Shen et al. surveyed open source implementations of clinical software available in GitHub [67]. Both Dozmorov and Russel et al. [60,22] use GitHub to survey open source implementations of bioin- formatics software. Few papers actually systematically review packages or libraries in a specific language. For example [35] review packages for analyzing animal movements data in R, and [72] survey R packages for hydrology. With respect to Python, we found several reviews of packages 2 https://towardsdatascience.com/ 3 https://www.kdnuggets.com/ 4 https://machinelearningmastery.com/ A systematic review of Python packages for time series analysis 3 for different domains: social media content scrapping [81], topological data analysis [58], or data mining [75]. For time series analysis in Python, the only related work we could find is [34] where the authors review packages focusing on forecasting. There is, to the best of our knowledge, no systematic review of Python packages for generic time series analysis. 3 Methodology On January 2021, we conducted a systematic literature review with a snowballing strategy following the guidelines from Kitchenham et al. [38,39,37] and Wohlin [89]. However, these guidelines focus on printed literature, not on software packages. We also followed the recommendation from Crowston [17,18] regarding mining software repositories. Our search process is illustrated by the figure 2 and was performed both in literature databases and code repositories (GitHub). The following sections provide more details on the different steps of the search itself. Github search 115 Removing 81 Is a Python 59 Focuses on time (IC:1, 2.1, 2.2) duplicates Package? (IC2.3) series? (IC3) 47 Literature Checking Removing + search (IC:1, 3) 104 Github (IC2.*) 12 duplicates 5 52 Generic vs Snowballing Analysis Domain spe- (IC1, 2.*, 3) 40 cific (IC4) 67 Fig. 2: Search and filtering process overview. Edges labels indicate the number of repositories left after each step. 3.1 Research questions We already stated our goal and the context we set for this review in the introduction. We formalize this context as follows: we want to analyze Python packages dedicated to time series analysis, for the purpose of structuring the available implementations (we explicitly exclude the purpose of evaluating them), with respect to the implemented time series analysis tasks, from the viewpoint of practitioners, in the context of building data-driven services on top of these implementations. Our research questions are: { RQ1 Which time series analysis tasks exist, and which of these are currently implemented in maintained and supported Python packages? { RQ2 How do the packages support the evaluation of the produced results? { RQ3 What insights can we gain to estimate the durability of a given package and make an informed choice about its long-term use? 3.2 Inclusion criteria To guide our review and filter relevant packages, we defined the following inclusion criteria (IC): The package should be open source, written in Python, available on GitHub (IC1). The 4 Julien Siebert, Janek Groß, and Christof Schroth package should be actively maintained (last commit in less than 6 months, i.e., after July 2020) (IC2.1); it should have more than 100 GitHub stars (IC2.2); and it should be listed in PyPI5 and be installable via pip6 or conda7 (IC2.3). The package should target explicitly time series analysis (IC3). This inclusion criteria implies that we exclude packages that allows for time series analysis but whose focus is not primarily time series analysis in itself (for example, generic scientific computing packages such as scipy or numpy, packages dedicated to data manipulation or storage such as pandas, or generic machine learning or data mining packages such as scikit- learn). Finally, we focused our search on packages offering methods that tend to be domain agnostic (IC4). Domain specific packages are packages aiming to solve time series analysis in a specific domain (for example, audio, finance, geoscience, etc.). They usually focus on specific types and formats of time series and domain related analysis tasks. On the contrary, packages classified as generic are those that offer methods that tend to be domain agnostic. 3.3 Searching open source repositories in GitHub In order to filter GitHub repositories, we selected a list of topics8, filtered the results by language (Python, IC1), by number of stars (at least 100, IC2.2), and considered only repositories that were updated after July 2020 (IC2.1).

Load more