Arxiv:2104.07406V1

A systematic review of Python packages for time series analysis

Julien Siebert¹ [0000-0002-7696-0046], Janek Groß¹ [0000-0002-6306-711X], and Christof Schroth¹ [0000-0003-0828-4063]

Fraunhofer-Institut für Experimentelles Software Engineering

Abstract. This paper presents a systematic review of Python packages focused on time series analysis. The objective is first to provide an overview of the different time series analysis tasks and preprocessing methods implemented, but also to give an overview of the development characteristics of the packages (e.g., dependencies, community size, etc.). This review is based on a search of literature databases as well as GitHub repositories. After the filtering process, 40 packages were analyzed. We classified the packages according to the analysis tasks implemented, the methods related to data preparation, and the means to evaluate the results produced (methods and access to evaluation data). We also reviewed the licenses, the packages' community size, and the dependencies used. Among other things, our results show that forecasting is by far the most implemented task, that half of the packages provide access to real datasets or allow generating synthetic data, and that many packages depend on a few libraries (the most used ones being numpy, scipy and pandas). One of the lessons learned from this review is that the process of finding a given implementation is not inherently simple, and we hope that this review can help practitioners and researchers navigate the space of Python packages dedicated to time series analysis.

1 Introduction

A time series is a set of data points generated from successive measurements over time. The analysis of this type of data has found applications in many fields, from finance to health, including the monitoring of computer networks or the environment.
The decreasing costs of sensors and data storage, together with the increasing performance of big-data and data analysis technologies such as machine learning or data mining, are opening up more and more possibilities to acquire and analyze temporal data. For example, a simple search on Scopus or Web of Science with the following search string (restricted to document titles) "time series" AND ("analysis" OR "data mining" OR "machine learning") shows an increasing trend in the number of publications within the last 10 years (see Figure 1). Moreover, as the number of time series analysis application cases rises, more and more data scientists, data engineers, analysts, and software engineers have to use dedicated time series analysis libraries.

In this article, we systematically review Python packages dedicated to time series analysis. Python is one of the programming languages of choice for data scientists¹. Data scientists are not only responsible for analyzing data; their task is also to ensure that services based on these analyses reach a sufficient level of maturity to be deployed and maintained in production. In this context, we review not only the analysis tasks implemented in the packages, but also several factors external to the tasks alone, such as which dependencies are used or how big the community behind the development of the package in question is. Our goal is not to evaluate the quality of the implementations themselves but to provide a structured overview that is useful both to data scientists confronted with time series analysis (and with the choice of which packages to rely on) and to the scientific community and the community of Python developers working in this field.

¹ See the survey performed by Kaggle in 2018: https://www.kaggle.com/kaggle/kaggle-survey-2018.

arXiv:2104.07406v1 [cs.MS] 15 Apr 2021

[Figure 1. Axes: Year (x-axis), Time Series Publications (Scopus) (left axis), Time Series Publications (WoS) (right axis). Inset axes: Trend (Scopus), Trend (WoS).]

Fig. 1: Number of documents retrieved by Scopus (left axis) and Web of Science (WoS, right axis) over time with the search string (restricted to document titles): "time series" AND ("analysis" OR "data mining" OR "machine learning"). The inset shows the trend (in %) over the last 20 years. The trend is normalized over the total publications in DBLP (the data is accessible at https://dblp.org/statistics/publicationsperyear.html).

2 Related Work

Time series analysis is a broad research field covering many application domains. The literature contains many reviews, either focusing on the analysis tasks and methods (see for instance the following reviews on forecasting [32,49,57,79], clustering and classification [1,2,7,27,76], anomaly detection [6,15,90], changepoint analysis [66,4,84], pattern recognition [88,83], or dimensionality reduction [61]) or focusing on a specific application domain (see for instance the following surveys on finance [65], IoT and Industry 4.0 [54,45,93], or health [92]). Over time, several formal definitions and reviews of time series analysis tasks have been published, see for example [25,24,36]. However, existing implementations (software packages or libraries) are often listed, usually in a non-systematic way, in textbooks (like [16,70] for R, or [56] for Python) or grey literature (for example, Towards Data Science², KDnuggets³ or Machine Learning Mastery⁴). Some papers review the state of the open source implementations for a given topic. For example, Shen et al. surveyed open source implementations of clinical software available in GitHub [67]. Both Dozmorov and Russel et al. [60,22] use GitHub to survey open source implementations of bioinformatics software.
Few papers actually systematically review packages or libraries in a specific language. For example, [35] reviews packages for analyzing animal movement data in R, and [72] surveys R packages for hydrology. With respect to Python, we found several reviews of packages for different domains: social media content scraping [81], topological data analysis [58], or data mining [75]. For time series analysis in Python, the only related work we could find is [34], where the authors review packages focusing on forecasting. There is, to the best of our knowledge, no systematic review of Python packages for generic time series analysis.

² https://towardsdatascience.com/
³ https://www.kdnuggets.com/
⁴ https://machinelearningmastery.com/

3 Methodology

In January 2021, we conducted a systematic literature review with a snowballing strategy following the guidelines from Kitchenham et al. [38,39,37] and Wohlin [89]. However, these guidelines focus on printed literature, not on software packages. We also followed the recommendations from Crowston [17,18] regarding mining software repositories. Our search process is illustrated in Figure 2 and was performed both in literature databases and code repositories (GitHub). The following sections provide more details on the different steps of the search itself.

[Figure 2. Flowchart steps: GitHub search (IC1, 2.1, 2.2); Removing duplicates; Is a Python package? (IC2.3); Focuses on time series? (IC3); Literature search (IC1, 3); Checking GitHub (IC2.*); Removing duplicates; Snowballing (IC1, 2.*, 3); Generic vs. domain specific (IC4); Analysis. Edge labels appearing in the figure: 115, 81, 59, 47, 104, 12, 5, 52, 40, 67.]

Fig. 2: Search and filtering process overview. Edge labels indicate the number of repositories left after each step.

3.1 Research questions

We already stated our goal and the context we set for this review in the introduction.
We formalize this context as follows: we want to analyze Python packages dedicated to time series analysis, for the purpose of structuring the available implementations (we explicitly exclude the purpose of evaluating them), with respect to the implemented time series analysis tasks, from the viewpoint of practitioners, in the context of building data-driven services on top of these implementations. Our research questions are:

– RQ1 Which time series analysis tasks exist, and which of these are currently implemented in maintained and supported Python packages?
– RQ2 How do the packages support the evaluation of the produced results?
– RQ3 What insights can we gain to estimate the durability of a given package and make an informed choice about its long-term use?

3.2 Inclusion criteria

To guide our review and filter relevant packages, we defined the following inclusion criteria (IC): The package should be open source, written in Python, and available on GitHub (IC1). The package should be actively maintained (last commit less than 6 months old, i.e., after July 2020) (IC2.1); it should have more than 100 GitHub stars (IC2.2); and it should be listed in PyPI⁵ and be installable via pip⁶ or conda⁷ (IC2.3). The package should explicitly target time series analysis (IC3). This inclusion criterion implies that we exclude packages that allow for time series analysis but whose primary focus is not time series analysis itself (for example, generic scientific computing packages such as scipy or numpy, packages dedicated to data manipulation or storage such as pandas, or generic machine learning or data mining packages such as scikit-learn). Finally, we focused our search on packages offering methods that tend to be domain agnostic (IC4). Domain-specific packages are packages aiming to solve time series analysis in a specific domain (for example, audio, finance, geoscience, etc.).
They usually focus on specific types and formats of time series and on domain-related analysis tasks. In contrast, packages classified as generic are those that offer methods that tend to be domain agnostic.

3.3 Searching open source repositories in GitHub

In order to filter GitHub repositories, we selected a list of topics⁸, filtered the results by language (Python, IC1), by number of stars (at least 100, IC2.2), and considered only repositories that were updated after July 2020 (IC2.1).
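
The filters named in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the review: the topic name `time-series`, the `Repo` class, the repository names, and the reading of "after July 2020" as "after 2020-07-31" are all hypothetical choices, and "updated" is approximated here by GitHub's `pushed:` search qualifier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Repo:
    """Hypothetical, minimal view of a repository's metadata."""
    name: str
    language: str
    stars: int
    last_commit: date


MIN_STARS = 100             # IC2.2, using the "at least 100" wording of Section 3.3
CUTOFF = date(2020, 7, 31)  # IC2.1, assumed reading of "after July 2020"


def build_query(topic: str) -> str:
    """Compose a GitHub repository search query for one topic (IC1, IC2.1, IC2.2)."""
    return (
        f"topic:{topic} language:python "
        f"stars:>={MIN_STARS} pushed:>{CUTOFF.isoformat()}"
    )


def passes_filters(repo: Repo) -> bool:
    """Apply the same three filters locally to already-fetched metadata."""
    return (
        repo.language.lower() == "python"
        and repo.stars >= MIN_STARS
        and repo.last_commit > CUTOFF
    )


# Made-up repositories, used only to exercise the filter.
candidates = [
    Repo("example/ts-lib", "Python", 250, date(2020, 12, 1)),
    Repo("example/stale-lib", "Python", 900, date(2019, 3, 14)),
    Repo("example/small-lib", "Python", 42, date(2021, 1, 5)),
    Repo("example/r-lib", "R", 500, date(2020, 11, 20)),
]
query = build_query("time-series")
kept = [repo.name for repo in candidates if passes_filters(repo)]
```

Keeping the thresholds in named constants makes it easy to rerun the same filter with a different cutoff date or star count. The remaining criteria (IC2.3, IC3, IC4) require checking PyPI and reading each package's scope, so they are deliberately left out of this sketch.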