The challenges of Data Science from a Scientific Computing point of view

José Abílio Matos 24 November 2020

FEP CMUP (Mathematics Research Center U. Porto) https://www.fep.up.pt/docentes/jamatos/ [email protected]

Outline

Introduction
  Scientific Computing
Software projects: two distribution models
  Language package distributions
  Operating system distributions
  Relation between OS and language repositories
A practical example
  Fedora
  Update process for R 4.0
Conclusion

Introduction

Mandatory Formula

Credits: https://xkcd.com/

Scientific Computing

Scientific Computing or Computational Science

"Scientific computing is the collection of tools, techniques and theories required to solve on a computer the mathematical models of problems in science and engineering."

Gene H. Golub and James M. Ortega, "Scientific Computing and Differential Equations – An Introduction to Numerical Methods". Academic Press, 1992.

Usual disclaimer
Not to be confused with Computer Science.

Mandatory Venn diagram

Operational concept

1. Algorithms;
   • numerical and non-numerical;
   • computational models;
   • computer simulations;
   • data processing;
2. Models adapted to the needs of the problems being solved;
   • mathematical models;
   • e.g. complex systems theory;
   • e.g. Quantum Economics, Quantum Finance, Quantum Genomics;
3. Computer hardware;
   • High Performance Computing;
   • Parallel computing;
   • SIMD, SIMT, GPGPU;
   • the software required to make it work;
4. The development process of the low-level software tools;
   • the framework process used to develop the building blocks/pipes (software packages) used in the solution of the problems (e.g. Data Science).

Why do I care?

• Because I want to know how things work and why they are the way they are (an epistemological view);
• In order to study that I will look into some examples of:
  • the ecosystem that has grown around Scientific Computing;
  • the development methodologies.

Why is this important for Data Science?

One of the major principles/tenets of the scientific method is the reproducibility of results.

Reproducibility
Any results should be documented by making all data and code available in such a way that the computations can be executed again with identical results.

Reproducibility

A computation is reproducible if it offers four essential possibilities (Konrad Hinsen):

1. To inspect all the input data and all the source code that can possibly have an impact on the results.
2. To run the code on a suitable computer of one's own choice in order to verify that it indeed produces the claimed results.
3. To explore the behavior of the code, by inspecting intermediate results, by running the code with small modifications, or by subjecting it to code analysis tools.
4. To verify that published executable versions of the computation, proposed as binary files or as services, do indeed correspond to the available source code.
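The first possibility presupposes that one can pin down exactly which input data and which software produced a result. A minimal sketch of that bookkeeping in Python (the file name and record layout are illustrative, not any standard tool):

```python
# Minimal provenance record for a computation: hash every input file and
# note the interpreter version, so a result can later be matched to the
# exact data and software that produced it. (A sketch, not a full tool.)
import hashlib
import json
import sys

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance(input_paths):
    """Build a JSON-serializable record of inputs and environment."""
    return {
        "python": sys.version.split()[0],
        "inputs": {p: sha256_of(p) for p in input_paths},
    }

# Example: record a single (hypothetical) input file.
with open("data.csv", "w") as f:
    f.write("x,y\n1,2\n")
record = provenance(["data.csv"])
print(json.dumps(record, indent=2))
```

Publishing such a record alongside the results lets a reader check, before rerunning anything, that they hold the same inputs the author used.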

The importance of piping in modern societies

• The piping, wiring and sewer systems are of paramount importance in modern societies, even though they are generally invisible.
• The same applies to the general framework present in the software packages used in Scientific Computing.
• In this talk the focus will be on free/open source software and related projects/communities,
• and on the processes and procedures that make them work and the challenges that lie ahead.

Software projects: two distribution models

Software distribution projects

We will focus on software projects/communities that distribute software packages (free/open source software):

• Package distributions for languages;
• Operating system distributions.

There are projects with similar goals and mixed approaches (e.g. conda or the RStudio package manager) that we will not cover here.

Language package distributions

Software languages

There are many examples of programming languages of special interest to computational science:

• Python
• R
• Octave
• Julia
• Lua
• JavaScript
• LaTeX

Language packages/libraries

• The power of each language depends not only on the syntax that it uses but also on the set of functions that it makes available to users.
• Python has the motto of "batteries included" because the standard library that comes with the language has a very broad and diverse set of tools.

Usually those functions are presented together in software packages or bundles.
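As a small illustration of "batteries included", the following computes summary statistics and serializes them using only modules that ship with the interpreter itself, no third-party package required (the data values are made up for the example):

```python
# Everything imported below comes with the standard Python distribution.
import statistics
import json

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = {
    "mean": statistics.mean(data),      # 5.0 for this data
    "stdev": statistics.pstdev(data),   # population standard deviation, 2.0
    "median": statistics.median(data),  # 4.5
}
print(json.dumps(summary))
```

In languages with a thinner standard library, each of these steps would already pull in an external package, which is exactly where the package distributions discussed next come in.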

Language packages

Depending on the language these bundles can have different names, like libraries, modules or packages. For simplicity we will refer to them as packages. The packages are quite diverse in terms of:

• maintenance: controlled by the language core developers, by independent groups of developers, or by single developers;
• licenses: license compliance is an important consideration.

Specific language repositories

Python      PyPI
Octave      Octave Forge, Package extension index
R           CRAN, Bioconductor
Perl        CPAN
LaTeX       CTAN
JavaScript  npm

Language repositories

• The usefulness of software packages in general, and language packages in particular, increases when there is an easy way to install them.
• That led all the programming languages above to create their own ways of dealing with software packages.
• The advantage of having a single, or a sufficiently small number of, repositories is that it makes the discoverability and installation of software simple.

Language repositories

• In turn this allows software packages to be built on top of others.

Isaac Newton
"If I have seen further it is by standing on the sholders [sic] of Giants."

Language repository types

The control of the packages held in a repository can vary:

• from too tight, where the packages have to follow a very strict set of rules (license, building, etc.),
• to too loose, where the repository is essentially a package index where everyone can register their package with almost no control.

Package dependencies (example: R-zoo)

Package: zoo
Version: 1.8-8
Depends: R (>= 3.1.0), stats
Imports: utils, graphics, grDevices, lattice (>= 0.20-27)
Suggests: AER, coda, chron, fts, ggplot2 (>= 3.0.0), mondate, scales, strucchange, timeDate, timeSeries, tis, tseries, xts
License: GPL-2 | GPL-3
MD5sum: a751c37a1b84a342851855cae2f40ac5
NeedsCompilation: yes

Package dependencies (example: R-yuima)

Package: yuima
Version: 1.9.6
Depends: R (>= 2.10.0), methods, zoo, stats4, utils, expm, cubature, mvtnorm
Imports: Rcpp (>= 0.12.1), boot (>= 1.3-2), glassoFast
LinkingTo: Rcpp, RcppArmadillo
License: GPL-2
MD5sum: 651e8ee8412402adce81cd247220cffa
NeedsCompilation: yes
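These fields use the Debian-control-style layout of R DESCRIPTION files and CRAN's PACKAGES index: "Field: value" lines, with indented continuation lines belonging to the previous field. A small illustrative parser (a sketch, not the tooling CRAN or Fedora actually use), fed a trimmed version of the zoo entry:

```python
# Parse Debian-control-style metadata as used by R DESCRIPTION files:
# "Field: value" lines; an indented line continues the previous field.
def parse_description(text):
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:1].isspace() and current:      # continuation line
            fields[current] += " " + line.strip()
        elif ":" in line:
            current, _, value = line.partition(":")
            current = current.strip()
            fields[current] = value.strip()
    return fields

ZOO = """\
Package: zoo
Version: 1.8-8
Depends: R (>= 3.1.0), stats
Imports: utils, graphics, grDevices,
 lattice (>= 0.20-27)
License: GPL-2 | GPL-3
"""

meta = parse_description(ZOO)
# Split a dependency field into bare package names (version constraints dropped):
imports = [d.split("(")[0].strip() for d in meta["Imports"].split(",")]
```

It is exactly this machine-readable dependency metadata that lets both CRAN and the OS distributions compute which packages must be rebuilt when one of them changes.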


Package distribution evolution (begin)

• Centralized repositories have many advantages; in particular, with the overabundance of information, repositories provide a (more or less) curated set of packages (like a museum);
• Historically the advantage of a single point of control was due to the lack of a standardized location/procedure to download a package;

Package distribution evolution (present)

• With the appearance of git in 2005 and the corresponding blossoming of distributed version control systems (git, mercurial, fossil and others), code-sharing web services were created: GitHub, GitLab, Bitbucket, Savannah (for GNU projects), R-Forge (for R packages).
• So, in a sense, GitHub became another repository for languages like R, with all the advantages and drawbacks of such a change.


Operating system distributions

There are projects that aggregate software able to run on a large set of hardware. They are usually built around the Linux kernel or the BSD family of kernels, and have specific repositories built around a common set of packages. Examples are:

• Fedora/CentOS/RHEL
• openSUSE
• OpenBSD
• FreeBSD
• NetBSD

Why focus on free software operating systems?

Because they offer a good way to integrate all the different low-level components behind a common and uniform interface, offering users the advantage of something that works.

Common characteristics of OS distributions (general)

• Distribute a coherent set of packages;
• Have a package system that deals with software inter-dependencies;
• Follow well-established rules regarding the source code;
• Are built from free/open source software;
• Work on multiple platforms (some on even more than 20 hardware platforms!);
• Have a worldwide mirror network to fetch and update packages easily.

Key differences between OS distributions

• Release cycles;
• Community/enterprise relation;
• Target audience;
• Update policies;
• "Support" periods;
• Packaging policies (package names, granularity, boundaries);
• Packaging method/tools.

Package distribution evolution

• Just like the language repositories, the packaging for OS distributions has been evolving.
• In particular, software package formats have appeared that are, more or less, independent of the distribution. The best known are Flatpak and Snap; these formats bundle most of the software in a single package.
• These packages do not offer the same level of integration as the native packages in the repositories.

Relation between OS and language repositories

Common characteristics

• Ease of installation, and usually a unique way to install;
• Follow a set of rules or guidelines, and so distribute a coherent set of packages;
• A curated repository brings extra confidence in the corresponding packages;
• Take care of inter-dependencies between packages.

Differences (technical)

• Language repositories intend to work with different OSes;
  • they want to abstract away, as much as possible, the differences between OSes.
• OS repositories intend to work with different computer languages;
  • they want, as much as possible, all languages to follow the same structure in order to share the system libraries.
• Each community has a different set of interests and culture. The tension is there from the start. There is a need to establish bridges, e.g. the talk "R packages from a Fedora perspective" at UseR! 2008 (Dortmund).

Differences

• There are differences that do not come from technical limitations but from development philosophy.
• This applies to both OS and language distributions.
• Some projects frown upon having several versions of the same package, while other projects encourage, or at least greatly simplify, their use.


A practical example

Fedora

What is Fedora (from Wikipedia)

Fedora is a Linux distribution developed by the community-supported Fedora Project, which is sponsored primarily by Red Hat, a subsidiary of IBM, with additional support from other companies. Fedora contains software distributed under various free and open-source licenses and aims to be on the leading edge of free technologies. Fedora is the upstream source of the commercial Red Hat Enterprise Linux distribution, and subsequently of CentOS as well.

Related Linux distributions

Fedora: integrates the new software first; a community project.
Red Hat Enterprise Linux: does further testing; a commercial distribution (of free software) where what you pay for is the support.
CentOS: takes the software from RHEL, compiles and releases it, adding further repositories.
Scientific Linux: a scientific-computing-oriented distribution (a cooperation between Fermilab, CERN, DESY and ETH Zurich), folded into CentOS with version 8.

Core values

Supported architectures

Main architectures
• x86_64 (64-bit, el)
• i686 (32-bit)*
• ARM-hfp (32-bit, el, ARMv7+)
• ARM AArch64 (64-bit, el, ARMv8+)

Alternative architectures
• s390x (64-bit, for zEC12+)
• PowerPC64le (64-bit, el, POWER8+)
• MIPS-64el (64-bit, el)
• MIPS-el (32-bit, el)
• RISC-V (64-bit, open source ISA)

el: little endian (yes, dry humour, I know!)

General statistics about Fedora (1)

The data here is taken from the presentation by Matthew Miller, Fedora Project Leader, at a virtual conference in August of this year: State of Fedora (2020).

• Contributors (~350 active, with ~200 having more than 2 years of contributions):
  • several paid developers (packaging and infrastructure);
  • several companies involved other than Red Hat, like Amazon;
  • most developers are volunteers (naturally including people from the above companies).

General statistics about Fedora (2)

• ~15 000 project packages available in the current release (Fedora 33). Example on x86_64:
  • 26374 packages common to all architectures;
  • 21420 x86_64-specific;
  • 8511 i686-specific (to support 32-bit binaries).
• A new release every 6 months (April/May and October/November).

More statistics

People involved (roles)

• ambassador;
• artwork;
• documentation;
• infrastructure (networking and building packages);
• packaging:
  • packager;
  • proven packager;
  • sponsor.

Supported languages

PHP, Ruby, Ada, Python, Perl, C, R, Java, C++, Octave, Go, D, Julia, Rust, Fortran, Lua, DotNet, Pascal, LaTeX, Erlang, Haskell, ...

Relation with programming languages

Release of the following projects is aligned with the Fedora release schedule:

• gcc;
• Python (core interpreter and standard library).

Advantage for the languages:

• an immediate test with a large number of packages,
• on a significant number of diverse architectures.

The Fedora process (FTBFS)

• Almost all packages are rebuilt on every release (~every 6 months);
• If they fail to build from source, even if the current version installs and runs perfectly, they are marked as such and given a certain period to be fixed, or else they are removed from the distribution;
• The packaging guidelines encourage all packages to have a %check section that runs the tests defined by the upstream project, so all the packages are checked on every rebuild.
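For illustration, a hypothetical %check fragment from an RPM spec file for an R package (the package name and flags are invented for the example; actual Fedora specs and macros may differ):

```
# Hypothetical excerpt of an RPM spec file for an R package "foo".
%check
# Run the checks shipped by the upstream project; a failure here
# blocks the build just like a compilation error would.
%{_bindir}/R CMD check --no-manual foo
```

Because this section runs on every mass rebuild, an upstream regression surfaces in the distribution long before most users would notice it.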

Update process for R 4.0

Update from R 3.6.2 to 4.0.0(/1)

• Last June R 4.0 was released.
• Due to the changes introduced, all R packages needed to be rebuilt with the new version.
• In Fedora the process was first applied in the development version (now Fedora 33).
• Later it was backported to the then-latest stable release (Fedora 32).

Challenge

• The change was deemed important enough to be released to Fedora 32.
• Usually getting a new R version is as simple as building the new version and the very few packages that directly depend on it (like rpy).

Challenge

• Since all the packages have dependencies on other packages, the resulting network of dependencies is a huge mess (TM);
• The process starts by rebuilding R itself as the 0th stage;
• The solution is then to start with the isolated branches of the resulting dependency tree and rebuild them;
• Trim all those packages from the tree and repeat the process;
• Unfortunately at some point this process stops, because there are directed cycles (packages that, directly or indirectly, depend on each other).

Success

• In order to proceed it is necessary to perform a dual-stage rebuild.
• The rebuild process occurs in two steps:
  1. packages are built with minimal dependencies (e.g. no testing);
  2. after all the packages are rebuilt in the first stage, they are rebuilt again with the full dependencies.
• Even with this approach it took 29 iterations, and more than a week of work, to finish the rebuild of the ~440 R packages present in Fedora.
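The trim-and-repeat ordering just described can be sketched as a topological sort with cycle detection (this is an illustration of the idea, not the actual Fedora rebuild tooling; the toy graph below is invented):

```python
# Repeatedly rebuild the packages whose dependencies are already rebuilt;
# whatever remains at the end belongs to a dependency cycle and needs the
# dual-stage treatment described above.
def rebuild_plan(deps):
    """deps maps each package to the set of packages it depends on.
    Returns (stages, cyclic): stages is a list of sets that can be
    rebuilt together; cyclic holds the packages stuck in cycles."""
    remaining = {pkg: set(d) for pkg, d in deps.items()}
    stages = []
    while remaining:
        # packages whose remaining dependencies are all rebuilt already
        ready = {p for p, d in remaining.items() if not (d & remaining.keys())}
        if not ready:           # only cycles are left
            break
        stages.append(ready)
        for p in ready:
            del remaining[p]
    return stages, set(remaining)

# A toy dependency graph with one cycle (a <-> b):
deps = {
    "zoo": set(),
    "xts": {"zoo"},
    "a": {"b", "zoo"},
    "b": {"a"},
}
stages, cyclic = rebuild_plan(deps)
# stages → [{"zoo"}, {"xts"}]; cyclic → {"a", "b"}
```

The packages left in `cyclic` are exactly the ones that force the two-step minimal-then-full rebuild.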

Opportunities

• In recent years we have been seeing a convergence between the two models, as the metadata information for packages becomes richer.
• Iñaki Ucar created a personal repository (a Copr) that, once a day, asks CRAN for the newest packages and rebuilds them. The packages are rebuilt automatically and so do not get the same care given to other packages. They are also installed in a different path in order not to conflict with the official Fedora packages.
• Installing one R package allows any package-installation command to use the packages in the Copr repository (~15 000 packages) without the need to compile the code locally, thus speeding up the process.
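On a Fedora system the workflow looks roughly like this (a sketch: the Copr name `iucar/cran` and the `R-CRAN-` package prefix follow the project's public documentation, but treat the exact names as illustrative):

```shell
# Enable the CRAN Copr repository (one-time setup, as root):
dnf copr enable iucar/cran

# CRAN packages are then available as prebuilt RPMs; e.g. installing
# the zoo package involves no local compilation at all:
dnf install R-CRAN-zoo
```

From the user's point of view a CRAN package then behaves like any other distribution package: installed, updated and removed by the system package manager.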

Cooperation

• This mixed experiment is supported both by the Fedora project, which supports the building and storage of the packages, and by the R community.
• We have coordinated the public release of the R packages, in order to ensure that the packages were available at the same time.
• At CRAN, in the Download section (Linux) for Fedora and Red Hat, the page is https://cran.r-project.org/bin/linux/fedora/

Conclusion

University Covid model

Conclusions

• I hope that by the end of this seminar you have a better understanding of all the work that goes on behind the maintenance of the piping in the Scientific Computing ecosystem that directly affects Data Science;
• To stress, once again, the importance of implementing tests in software projects that are used as a basis for others, as a method to help ensure the reproducibility of results;
• Work on Python and R is ongoing to ensure that the infrastructure is fit for the task;
• There is future work, in the sense that what was done for R can also be done for Python, and in Fedora we are interested in repeating this for other languages.

Fedora’s Vision and Mission

Vision The Fedora Project envisions a world where everyone benefits from free and open source software built by inclusive, welcoming, and open-minded communities.

Mission Fedora creates an innovative platform for hardware, clouds, and containers that enables software developers and community members to build tailored solutions for their users.