The challenges of Data Science from a Scientific Computing point of view

José Abílio Matos 24 November 2020

FEP CMUP (Mathematics Research Center U. Porto) https://www.fep.up.pt/docentes/jamatos/ [email protected]

Outline

Introduction
  Scientific Computing
Software projects: two distribution models
  Language package distributions
  Operating system distributions
  Relation between OS and language repositories
A practical example
  Fedora
  Update process for R 4.0
Conclusion

Introduction

Mandatory Formula

Credits: https://xkcd.com/

Scientific Computing

Scientific Computing or Computational Science

"Scientific computing is the collection of tools, techniques and theories required to solve on a computer the mathematical models of problems in science and engineering."

Gene H. Golub and James M. Ortega, "Scientific Computing and Differential Equations – An Introduction to Numerical Methods". Academic Press, 1992.

Usual disclaimer
Not to be confused with Computer Science.

Mandatory Venn diagram

Operational concept

1. Algorithms;
   • numerical and non-numerical;
   • computational models;
   • computer simulations;
   • data processing;
2. Models adapted to the needs of the problems being solved;
   • mathematical models;
   • e.g. complex systems theory;
   • e.g. Quantum Economics, Quantum Finance, Quantum Genomics;
3. Computer hardware;
   • High Performance Computing;
   • Parallel computing;
   • SIMD, SIMT, GPGPU;
   • the software required to make it work;
4. The development process of the low-level software tools;
   • the framework process used to develop the building blocks/pipes (software packages) used in the solution of the problems (e.g. Data Science).

Why do I care?

• Because I want to know how things work and why they are the way they are (an epistemological view);
• In order to study that I will look into some examples of:
  • the ecosystem that has grown around Scientific Computing;
  • the development methodologies.

Why is this important for Data Science?

One of the major principles/tenets of the scientific method is the reproducibility of results.

Reproducibility
Any results should be documented by making all data and code available in such a way that the computations can be executed again with identical results.

Reproducibility

A computation is reproducible if it offers four essential possibilities (Konrad Hinsen):

1. To inspect all the input data and all the source code that can possibly have an impact on the results.
2. To run the code on a suitable computer of one's own choice in order to verify that it indeed produces the claimed results.
3. To explore the behavior of the code, by inspecting intermediate results, by running the code with small modifications, or by subjecting it to code analysis tools.
4. To verify that published executable versions of the computation, proposed as binary files or as services, do indeed correspond to the available source code.
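The first possibility presupposes that one can pin down exactly which input data and which software produced a result. A minimal sketch of that bookkeeping in Python (the file name and record layout are illustrative, not any standard tool):

```python
# Minimal provenance record for a computation: hash every input file and
# note the interpreter version, so a result can later be matched to the
# exact data and software that produced it. (A sketch, not a full tool.)
import hashlib
import json
import sys

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance(input_paths):
    """Build a JSON-serializable record of inputs and environment."""
    return {
        "python": sys.version.split()[0],
        "inputs": {p: sha256_of(p) for p in input_paths},
    }

# Example: record a single (hypothetical) input file.
with open("data.csv", "w") as f:
    f.write("x,y\n1,2\n")
record = provenance(["data.csv"])
print(json.dumps(record, indent=2))
```

Publishing such a record alongside the results lets a reader check, before rerunning anything, that they hold the same inputs the author used.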

The importance of piping in modern societies

• The piping, wiring and sewer systems are of paramount importance in modern societies, even though they are generally invisible.
• The same applies to the general framework present in the software packages used in Scientific Computing.
• In this talk the focus will be on free/open source software and related projects/communities,
• and on the processes and procedures that make them work and the challenges that lie ahead.

Software projects: two distribution models

Software distribution projects

We will focus on software projects/communities that distribute software packages (free/open source software):

• Package distributions for languages;
• Operating system distributions.

There are projects with similar goals and mixed approaches (e.g. conda or the RStudio package manager) that we will not cover here.

Language package distributions

Software languages

There are many examples of programming languages of special interest to computational science:

• Python
• R
• Octave
• Julia
• Lua
• JavaScript
• LaTeX

Language packages/libraries

• The power of each language depends not only on the syntax that it uses but also on the set of functions that it makes available to users.
• Python has the motto of "batteries included" because the standard library that comes with the language has a very broad and diverse set of tools.

Usually those functions are presented together in software packages or bundles.
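As a small illustration of "batteries included", the following computes summary statistics and serializes them using only modules that ship with the interpreter itself, no third-party package required (the data values are made up for the example):

```python
# Everything imported below comes with the standard Python distribution.
import statistics
import json

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = {
    "mean": statistics.mean(data),      # 5.0 for this data
    "stdev": statistics.pstdev(data),   # population standard deviation, 2.0
    "median": statistics.median(data),  # 4.5
}
print(json.dumps(summary))
```

In languages with a thinner standard library, each of these steps would already pull in an external package, which is exactly where the package distributions discussed next come in.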

Language packages

Depending on the language these bundles can have different names, like libraries, modules or packages. For simplicity we will refer to them as packages. The packages are quite diverse in terms of:

• maintenance: controlled by the language core developers, by independent groups of developers, or by single developers;
• licenses: license compliance is an important consideration.

Specific language repositories

Python      PyPI
Octave      Octave Forge, Package extension index
R           CRAN, Bioconductor
Perl        CPAN
LaTeX       CTAN
JavaScript  npm

Language repositories

• The usefulness of software packages in general, and language packages in particular, increases when there is an easy way to install them.
• That led all the programming languages above to create their own ways of dealing with software packages.
• The advantage of having a single, or a sufficiently small number of, repositories is that it makes the discoverability and installation of software simple.

Language repositories

• In turn this allows software packages to be built on top of others.

Isaac Newton
"If I have seen further it is by standing on the sholders [sic] of Giants."

Language repository types

The control of the packages held in a repository can vary:

• from too tight, where the packages have to follow a very strict set of rules (license, building, etc.),
• to too loose, where the repository is essentially a package index where everyone can register their package with almost no control.

Package dependencies (example: R-zoo)

Package: zoo
Version: 1.8-8
Depends: R (>= 3.1.0), stats
Imports: utils, graphics, grDevices, lattice (>= 0.20-27)
Suggests: AER, coda, chron, fts, ggplot2 (>= 3.0.0), mondate, scales, strucchange, timeDate, timeSeries, tis, tseries, xts
License: GPL-2 | GPL-3
MD5sum: a751c37a1b84a342851855cae2f40ac5
NeedsCompilation: yes

Package dependencies (example: R-yuima)

Package: yuima
Version: 1.9.6
Depends: R (>= 2.10.0), methods, zoo, stats4, utils, expm, cubature, mvtnorm
Imports: Rcpp (>= 0.12.1), boot (>= 1.3-2), glassoFast
LinkingTo: Rcpp, RcppArmadillo
License: GPL-2
MD5sum: 651e8ee8412402adce81cd247220cffa
NeedsCompilation: yes
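These fields use the Debian-control-style layout of R DESCRIPTION files and CRAN's PACKAGES index: "Field: value" lines, with indented continuation lines belonging to the previous field. A small illustrative parser (a sketch, not the tooling CRAN or Fedora actually use), fed a trimmed version of the zoo entry:

```python
# Parse Debian-control-style metadata as used by R DESCRIPTION files:
# "Field: value" lines; an indented line continues the previous field.
def parse_description(text):
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:1].isspace() and current:      # continuation line
            fields[current] += " " + line.strip()
        elif ":" in line:
            current, _, value = line.partition(":")
            current = current.strip()
            fields[current] = value.strip()
    return fields

ZOO = """\
Package: zoo
Version: 1.8-8
Depends: R (>= 3.1.0), stats
Imports: utils, graphics, grDevices,
 lattice (>= 0.20-27)
License: GPL-2 | GPL-3
"""

meta = parse_description(ZOO)
# Split a dependency field into bare package names (version constraints dropped):
imports = [d.split("(")[0].strip() for d in meta["Imports"].split(",")]
```

It is exactly this machine-readable dependency metadata that lets both CRAN and the OS distributions compute which packages must be rebuilt when one of them changes.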


Package distribution evolution (begin)

• Centralized repositories have many advantages; in particular, with the overabundance of information, repositories provide a (more or less) curated set of packages (like a museum);
• Historically the advantage of a single point of control was due to the lack of a standardized location/procedure to download a package;

Package distribution evolution (present)

• With the appearance of git in 2005 and the corresponding blossoming of distributed version control systems (git, mercurial, fossil and others), code-sharing web services were created: GitHub, GitLab, Bitbucket, Savannah (for GNU projects), R-Forge (for R packages).
• So, in a sense, GitHub became another repository for languages like R, with all the advantages and drawbacks of such a change.


Operating system distributions

There are projects that aggregate software able to run on a large set of hardware. They are usually built around the Linux kernel or the BSD family of kernels, and have specific repositories built around a common set of packages. Examples are:

• Fedora/CentOS/RHEL
• openSUSE
• OpenBSD
• FreeBSD
• NetBSD

Why focus on free software operating systems?

Because they offer a good way to integrate all the different low-level components behind a common and uniform interface, offering users the advantage of something that works.

Common characteristics of OS distributions (general)

• Distribute a coherent set of packages;
• Have a package system that deals with software inter-dependencies;
• Follow well-established rules regarding the source code;
• Are built from free/open source software;
• Work on multiple platforms (some on even more than 20 hardware platforms!);
• Have a worldwide mirror network to fetch and update packages easily.

Key differences between OS distributions

• Release cycles;
• Community/enterprise relation;
• Target audience;
• Update policies;
• "Support" periods;
• Packaging policies (package names, granularity, boundaries);
• Packaging method/tools.

Package distribution evolution

• Just like the language repositories, the packaging for OS distributions has been evolving.
• In particular, software package formats have appeared that are, more or less, independent of the distribution. The best known are Flatpak and Snap; these formats bundle most of the software in a single package.
• These packages do not offer the same level of integration as the native packages in the repositories.

Relation between OS and language repositories

Common characteristics

• Ease of installation, and usually a unique way to install;
• Follow a set of rules or guidelines, and so distribute a coherent set of packages;
• A curated repository brings extra confidence in the corresponding packages;
• Take care of inter-dependencies between packages.

Differences (technical)

• Language repositories intend to work with different OSes;
  • they want to abstract away, as much as possible, the differences between OSes.
• OS repositories intend to work with different computer languages;
  • they want, as much as possible, all languages to follow the same structure in order to share the system libraries.
• Each community has a different set of interests and culture. The tension is there from the start. There is a need to establish bridges, e.g. the talk "R packages from a Fedora perspective" at UseR! 2008 (Dortmund).

Differences

• There are differences that do not come from technical limitations but from development philosophy.
• This applies to both OS and language distributions.
• Some projects frown upon having several versions of the same package, while other projects encourage, or at least greatly simplify, their use.


A practical example

Fedora

What is Fedora (from Wikipedia)

Fedora is a Linux distribution developed by the community-supported Fedora Project, which is sponsored primarily by Red Hat, a subsidiary of IBM, with additional support from other companies. Fedora contains software distributed under various free and open-source licenses and aims to be on the leading edge of free technologies. Fedora is the upstream source of the commercial Red Hat Enterprise Linux distribution, and subsequently of CentOS as well.

Related Linux distributions

Fedora: integrates the new software first; a community project.
Red Hat Enterprise Linux: does further testing; a commercial distribution (of free software) where what you pay for is the support.
CentOS: takes the software from RHEL, compiles and releases it, adding further repositories.
Scientific Linux: a scientific-computing-oriented distribution (a cooperation between Fermilab, CERN, DESY and ETH Zurich), folded into CentOS with version 8.

Core values

Supported architectures

Main architectures
• x86_64 (64-bit, el)
• i686 (32-bit)*
• ARM-hfp (32-bit, el, ARMv7+)
• ARM AArch64 (64-bit, el, ARMv8+)

Alternative architectures
• s390x (64-bit, for zEC12+)
• PowerPC64le (64-bit, el, POWER8+)
• MIPS-64el (64-bit, el)
• MIPS-el (32-bit, el)
• RISC-V (64-bit, open source ISA)

el: little endian (yes, dry humour, I know!)

General statistics about Fedora (1)

The data here is taken from the presentation by Matthew Miller, Fedora Project Leader, at a virtual conference in August of this year: State of Fedora (2020).

• Contributors (~350 active, with ~200 having more than 2 years of contributions):
  • several paid developers (packaging and infrastructure);
  • several companies involved other than Red Hat, like Amazon;
  • most developers are volunteers (naturally including people from the above companies).

General statistics about Fedora (2)

• ~15 000 project packages available in the current release (Fedora 33). Example on x86_64:
  • 26374 packages common to all architectures;
  • 21420 x86_64-specific;
  • 8511 i686-specific (to support 32-bit binaries).
• A new release every 6 months (April/May and October/November).

More statistics

People involved (roles)

• ambassador;
• artwork;
• documentation;
• infrastructure (networking and building packages);
• packaging:
  • packager;
  • proven packager;
  • sponsor.

Supported languages

PHP, Ruby, Ada, Python, Perl, C, R, Java, C++, Octave, Go, D, Julia, Rust, Fortran, Lua, DotNet, Pascal, LaTeX, Erlang, Haskell, ...

Relation with programming languages

Release of the following projects is aligned with the Fedora release schedule:

• gcc;
• Python (core interpreter and standard library).

Advantage for the languages:

• an immediate test with a large number of packages,
• on a significant number of diverse architectures.

The Fedora process (FTBFS)

• Almost all packages are rebuilt on every release (~every 6 months);
• If they fail to build from source, even if the current version installs and runs perfectly, they are marked as such and given a certain period to be fixed, or else they are removed from the distribution;
• The packaging guidelines encourage all packages to have a %check section that runs the tests defined by the upstream project, so all the packages are checked on every rebuild.
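For illustration, a hypothetical %check fragment from an RPM spec file for an R package (the package name and flags are invented for the example; actual Fedora specs and macros may differ):

```
# Hypothetical excerpt of an RPM spec file for an R package "foo".
%check
# Run the checks shipped by the upstream project; a failure here
# blocks the build just like a compilation error would.
%{_bindir}/R CMD check --no-manual foo
```

Because this section runs on every mass rebuild, an upstream regression surfaces in the distribution long before most users would notice it.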

Update process for R 4.0

Update from R 3.6.2 to 4.0.0(/1)

• Last June R 4.0 was released.
• Due to the changes introduced, all R packages needed to be rebuilt with the new version.
• In Fedora the process was first applied in the development version (now Fedora 33).
• Later it was backported to the then-latest stable release (Fedora 32).

Challenge

• The change was deemed important enough to be released to Fedora 32.
• Usually getting a new R version is as simple as building the new version and the very few packages that directly depend on it (like rpy).

Challenge

• Since all the packages have dependencies on other packages, the resulting network of dependencies is a huge mess (TM);
• The process starts by rebuilding R itself as the 0th stage;
• The solution is then to start with the isolated branches of the resulting dependency tree and rebuild them;
• Trim all those packages from the tree and repeat the process;
• Unfortunately at some point this process stops, because there are directed cycles (packages that, directly or indirectly, depend on each other).

Success

• In order to proceed it is necessary to perform a dual-stage rebuild.
• The rebuild process occurs in two steps:
  1. packages are built with minimal dependencies (e.g. no testing);
  2. after all the packages are rebuilt in the first stage, they are rebuilt again with the full dependencies.
• Even with this approach it took 29 iterations, and more than a week of work, to finish the rebuild of the ~440 R packages present in Fedora.
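The trim-and-repeat ordering just described can be sketched as a topological sort with cycle detection (this is an illustration of the idea, not the actual Fedora rebuild tooling; the toy graph below is invented):

```python
# Repeatedly rebuild the packages whose dependencies are already rebuilt;
# whatever remains at the end belongs to a dependency cycle and needs the
# dual-stage treatment described above.
def rebuild_plan(deps):
    """deps maps each package to the set of packages it depends on.
    Returns (stages, cyclic): stages is a list of sets that can be
    rebuilt together; cyclic holds the packages stuck in cycles."""
    remaining = {pkg: set(d) for pkg, d in deps.items()}
    stages = []
    while remaining:
        # packages whose remaining dependencies are all rebuilt already
        ready = {p for p, d in remaining.items() if not (d & remaining.keys())}
        if not ready:           # only cycles are left
            break
        stages.append(ready)
        for p in ready:
            del remaining[p]
    return stages, set(remaining)

# A toy dependency graph with one cycle (a <-> b):
deps = {
    "zoo": set(),
    "xts": {"zoo"},
    "a": {"b", "zoo"},
    "b": {"a"},
}
stages, cyclic = rebuild_plan(deps)
# stages → [{"zoo"}, {"xts"}]; cyclic → {"a", "b"}
```

The packages left in `cyclic` are exactly the ones that force the two-step minimal-then-full rebuild.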

Opportunities

• In recent years we have been seeing a convergence between the two models, as the metadata information for packages becomes richer.
• Iñaki Ucar created a personal repository (a Copr) that, once a day, asks CRAN for the newest packages and rebuilds them. The packages are rebuilt automatically and so do not get the same care given to other packages. They are also installed in a different path in order not to conflict with the official Fedora packages.
• Installing one R package allows any package-installation command to use the packages in the Copr repository (~15 000 packages) without the need to compile the code locally, thus speeding up the process.
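On a Fedora system the workflow looks roughly like this (a sketch: the Copr name `iucar/cran` and the `R-CRAN-` package prefix follow the project's public documentation, but treat the exact names as illustrative):

```shell
# Enable the CRAN Copr repository (one-time setup, as root):
dnf copr enable iucar/cran

# CRAN packages are then available as prebuilt RPMs; e.g. installing
# the zoo package involves no local compilation at all:
dnf install R-CRAN-zoo
```

From the user's point of view a CRAN package then behaves like any other distribution package: installed, updated and removed by the system package manager.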

Cooperation

• This mixed experiment is supported both by the Fedora project, which supports the building and storage of the packages, and by the R community.
• We have coordinated the public release of the R packages, in order to ensure that the packages were available at the same time.
• At CRAN, in the Download section (Linux) for Fedora and Red Hat, the page is https://cran.r-project.org/bin/linux/fedora/

Conclusion

University Covid model

Conclusions

• I hope that by the end of this seminar you have a better understanding of all the work that goes on behind the maintenance of the piping in the Scientific Computing ecosystem that directly affects Data Science;
• To stress, once again, the importance of implementing tests in software projects that are used as a basis for others, as a method to help ensure the reproducibility of results;
• Work on Python and R is ongoing to ensure that the infrastructure is fit for the task;
• There is future work, in the sense that what was done for R can also be done for Python, and in Fedora we are interested in repeating this for other languages.

Fedora’s Vision and Mission

Vision The Fedora Project envisions a world where everyone benefits from free and open source software built by inclusive, welcoming, and open-minded communities.

Mission Fedora creates an innovative platform for hardware, clouds, and containers that enables software developers and community members to build tailored solutions for their users.