E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama doi:10.3850/38WC092019-1402

BRIDGING THE GAP BETWEEN SOFTWARE AND HYDRAULIC ENGINEERING: THE DEVELOPMENT OF A CLIMATIC OPEN-ACCESS TOOL

DEL-ROSAL-SALIDO JUAN(1), MAGAÑA PEDRO(1), & ORTEGA-SÁNCHEZ MIGUEL(1)

[email protected], [email protected], [email protected] (1) Instituto Interuniversitario de Investigación del Sistema Tierra en Andalucía - Universidad de Granada. Edificio CEAMA, Avenida del Mediterráneo, s/n, 18006, Granada, España.

ABSTRACT

Currently, academics use research software as part of their everyday work, and this trend should increase in the future. In many cases, this scientific software is developed by the researchers themselves who make them available to the rest of the scientific community. Despite the fact that the relationship between science and software is unquestionably useful, it is not always carried out in a successful manner. Some of the key challenges faced by scientists are the lack of training in software development, the shortage of time and resources or the difficulty in effective cooperation with other colleagues. This could lead to make scientific software slow, non-reusable, invisible, and in the worst case, faulty and unreliable. A multidisciplinary framework that covers both scientific and software demands is needed to handle this situation successfully. Nevertheless, a multidisciplinary team is not necessarily always enough to solve this problem. It is necessary to have some kind of link between scientifics and developers: software with solid scientific background. This approach is being used in the framework of the PROTOCOL project, and more particularly in the development of its applied software tools. In this paper, a tool for the characterization of the climatic agents is presented. The main guidelines in the development process comprise, among others, modularity, distributed control version, unit testing, profiling, inline documentation and the use of best practices and tools.

Keywords: See level rise; Climate change; Research Software Engineers; Reproducibility

1 INTRODUCTION Modern scientific research is characterized by the application of new technologies. One illustrative example is the recent unveiling of an image of a supermassive containing the same mass as 6.5 billion suns (Drake, 2019). This was made possible by a global collaboration of more than 200 scientists using an array of observatories scattered around the world, acting like a telescope the size of Earth, called the (EHT). The telescope array collected 5 petabytes of data over two weeks. "The task of correlating huge amounts of data, sifting through noise, calibrating information and creating a usable image was a very significant part of the project", according to Dan Marrone, an astronomer and co-investigator of EHT (Chappell, 2019). While the work team was composed of astronomers, physicists, mathematicians and engineers, the development of the that made the image possible was led by Katie Bouman, a computer scientist (BBC, 2019). It is worth mentioning that she had no experience studying black holes when becoming involved in the project almost six years before the generation of the image. Now she is a postdoctoral fellow at the Harvard-Smithsonian Center for Astrophysics (Hess, 2019). Researchers solve problems that are highly specific to their field of expertise. These problems cannot be solved by the straightforward application of off-the-shelf software, so researchers develop tools that are bound to their exact needs (Brett, 2017). Thus, researchers are the prime producers of scientific software. Some decades ago, most of the computing work done by scientists was relatively straightforward. But as computers and programming tools have grown more complex, scientists have hit a steep learning curve. In (Hannay, 2009), a survey of almost 2000 scientists was presented. Some of the main conclusions indicated that 91 percent of respondents said using scientific software is important for their own research, 84 percent said developing scientific software is important for their own research, and 38 percent spend at least one fifth of their time developing software. But the most distinctive finding was that the knowledge required to develop and use scientific software was primarily acquired from peers and through self-study, rather than from formal education and training. This may lead to problems in the future. As an illustration, researchers do not test or document their programs rigorously as a general rule, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists. At best, poorly written programs cause researchers to waste 4137

E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama valuable time and energy. But the coding problems can sometimes cause substantial harm, and have forced some scientists to retract papers (Miller, 2006; Merali, 2010). When a researcher publishes an article with code in a scientific journal, other colleagues may adopt this code and build their own research upon this software. Many of these scientists rely on the fact that the software has appeared in a peer-reviewed article. This is scientifically misplaced, as the software code used to conduct the science is not formally peer-reviewed. This is especially important when a disconnect occurs between equations and algorithms published in peer-reviewed literature and how those are actually implemented in software reportedly used in those papers (Ince, 2012). Despite the fact that these warnings have been sent before by some researchers (Peng, 2011; Goble, 2014; Baker, 2017), they are not having a visible effect. Software is pervasive in research, but its vital role is so often overlooked by funders, universities, assessment committees, and even the research community itself. It needs to be made clear that if the scientific software is incorrect, so will be science (Miller, 2006). While this is a generalized problem in science, some scientific fields are more advanced than others. Some branches of science have spawned sub-disciplines. This is the case of bioinformatics, computational linguistics or computational statistics. They even have their own journals, some of them with great scientific impact, such as the first quartile journal "Journal of Statistical Software". However, this does not seem to be the case of Hydro-Environment Engineering (Hutton, 2016).

Figure 1. The photo of a supermassive black hole captured by the Event Horizon Telescope

2 METHODOLOGY A possible solution to deal with these problems is hiring software engineers to perform the development of the scientific software. While this approach will solve many issues related to the poor quality of scientific software, it usually lacks the physical interpretation or the correct validation of results. Pure software engineers suffer from lack of expert knowledge in the scientific discipline of the software. Furthermore, the availability of massive datasets and the application of cutting-edge technologies, such as data mining or deep learning, does not in itself mean that reliable scientific software is built. To overcome these issues, a more appropriate proposal which tries to solve to these issues is the creation of a new academic professional designation, the Research Software Engineer (RSE), which is dedicated to complementing the existing postdoctoral career structure (Robert, 2012). RSEs are both a part of the scholarly community and are professional software developers, who understand the scientific literature and research questions and have a professional attitude towards software development. Their work should be evaluated by both software and academic metrics. Regardless of the case, developers of scientific software should have both strong scientific and software engineer background. In the following sections, we will briefly describe some of the skills that coastal engineers of the project "Protection of Coastal Urban Fronts against Global Warming (PROTOCOL)" have acquired to conduct a quality scientific software. The main aim of this project is the development of a general technical recommendations to protect the coastal urban fronts against the sea level rise induced by the global warming. A significant part of the project consists in developing a climatic software tool to characterize the different atmospheric, maritime and fluvial agents at case study locations and generate automatic reports with the climate analysis.

4138

E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama

Figure 2. Locations of the case study areas considered in the Protocol project

2.1 The use of free openly available tools While most of team members had experience with scientific platforms like Matlab, one of the major decision was the use of the programming language Python. The choice of programming language has many scientific and practical consequences. In particular, Matlab code can be quick to develop and is relatively easy to read. However, the language is proprietary and the source code is not available, so researchers do not have access to core algorithms making bugs in the core very difficult to find and fix. Many scientific developers prefer to write code that can be freely used on any computer and avoid proprietary languages. Python is an increasingly popular and free recommendation of programming language for scientists. It combines simple syntax, abundant online resources and a rich ecosystem of scientifically focused toolkits with a heavy emphasis on community. Python has built-in support for scientific computing. One of the major downsides in the past for Python was its installation. This is currently solved by Python scientific distributions, such as Anaconda, which includes the SciPy ecosystem and has facilitated the adoption of such a language.

Figure 3. Scientific Python Ecosystem

Anaconda includes, among others, packages to perform data cleaning, aggregations and exploratory analysis (Pandas); numerical computation (NumPy); visualizing data (Matplotlib and Seaborn); domain-specific toolboxes (SciPy); (scikit-learn); organizing large amounts of data (netCDF4, h5py, pygrib); interactive data apps (Flask, Bokeh) or scientific dissemination and divulgation (Jupyter, PyLaTeX).

4139

E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama

2.2 Inline code documentation In the same way that a well-documented experimental protocol makes research methods easier to reproduce, good documentation helps people understand code. This makes the code more reusable and lowers maintenance costs. The best way to create and maintain reference documentation is to embed the documentation for a piece of software in that software. Python makes this task easier by using documentation strings, also known as docstrings. In contrast to comments, reference documentation and descriptions of design decisions are key for improving the understandability of code. Therefore, scientific should use docstrings to document functions, modules or classes (why), not concrete implementations (how).

2.3 Modular programming To put it simply, modularity refers to coding in a way so that the overall task being performed is divided into small, discrete units of work. These parts are called modules in Python and are grouped in packages. This design paradigm promotes reusability and flexibility. The focus for this separation should be to have modules with no or just few dependencies upon other modules. When creating a modular system, several modules are built separately and independently. The final application will be created by putting them together. Many of these modules and packages could also be reused to build other applications.

Figure 4. Scheme of the different modules of the tool

2.4 Quality control process Programs should be thoroughly tested according to the test plans developed in the design phase. Because of the modularity nature of our software, we can define unit tests to subject each piece to a series of tests. Well- designed unit tests may be used to address whether a particular module of code is working properly and allows testing to proceed piecemeal and iteratively throughout the development process. This enables bugs to be identified and handled early so as to avoid major problems during integration and final testing.

2.5 Distributed control version One of the biggest challenges scientists face when coding is keeping track of changes, and being able to revert them if things fail. When the software is built in a collaborative manner, this is even more dramatic. It is difficult to find out which changes are in which versions or to say exactly how particular results were computed at a later date. The standard solution in both industry and open source is to use a version control system. Programmers can modify their working copy of the project at will, then commit changes to the repository when they are satisfied with the results to share them with colleagues.

4140

E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama 2.6 Code performance analysis In scientific software development where computational efficiency is one of the main goals, running-time profiling is a necessary step. The key to speeding up applications is often understanding where the elapsed time is spent, and why. Profiling helps to extract this information and aid program optimization.

3 RESULTS Climate analysis constitutes the first step for the assessment of the climate change impacts and risk management. In the last few years, relevant advances have been done in this field. However, tools and results are often excessively complex and time-consuming for stakeholders and end-users; as a consequence, there is a need for developing simpler tools. The main objective of this work is to develop a climatic open-access and user-friendly tool that simplifies the labor of analyzing the joint behavior of the concomitant climatic drivers. This tool integrates four modules: (1) reading the drivers descriptors time series, (2) basic climate analysis including histograms, kernel density functions and a complete summary of the data (among others), (3) fit of the theoretical distribution function to the data for the medium regime and (4) extremal analysis through "Peak Over Threshold (POT)" approach. Those modules and the respective datasets, can be easily integrated in larger multidisciplinary applications allowing scientists to characterize the maritime, atmospheric and fluvial drivers in the analysis of the climate impacts.

Figure 5. Graphical outputs obtained by the climatic tool. Panels a, b, c, d, e and f show: the time series plot of significant wave height; an histogram of the astronomical tide; a scatter plot of significant wave height and peak period; a wind rose; the variability of river discharge along the year and the empirical cummulative density function of the river discharge respectively.

4141

E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama

Figure 6. Friendly report generated automatically from the tool

This methodology has enabled us to have a single source code to generate multiple products addressing different users. Moreover, a new addition to the source code is immediately available to all the products, and thus to every user. Deliverables currently available are tutorials in Jupyter notebooks, friendly automated reports or relational databases. Some of the potential users include public and private managers, specialized technicians, engineering students, stakeholders or other scientists. Once the application is complete, it will be free available on the project website (gdfa.ugr.es/protocol). The tool has been applied at different locations along the coast of the Iberian Peninsula. The software intends to help managers handle the impact of the environmental drivers on coastal urban fronts through the characterization and simulation of the main drivers and their variability. Therefore, this tool constitutes a preliminary and unavoidable step in the decision-making process of the assessment of the flooding risk.

Figure 7. Diagram of the open climate tool

ACKNOWLEDGEMENTS This project is supported by “Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo”, CYTED (project PROTOCOL 917PTE0538) and the Spanish Ministry of Economy and Competitiveness (project PCIN- 2017-108).

REFERENCES

4142

E-proceedings of the 38th IAHR World Congress September 1-6, 2019, Panama City, Panama Hess, A. (2019). 29-year-old Katie Bouman ‘didn’t know anything about black holes’—then she helped capture the first photo of one. CNBC. Retrieved April 15, 2019, https://www.cnbc.com/2019/04/12/katie-bouman- helped-generate-the-first-ever-photo-of-a-black-hole.html. Baker, M. (2017). Scientific computing: Code alert. Nature, 541, 563-565. BBC. (2019). Katie Bouman: The woman behind the first black hole image. Retrieved April 15, 2019, https://www.bbc.com/news/science-environment-47891902. Chappell, B. (2019). Earth Sees First Image Of A Black Hole. NPR. Retrieved April 15, 2019, https://www.theverge.com/2019/4/10/18303661/first-picture-black-hole-sagittarius-event-horizon-telescope. Brett, A., Croucher, M., Haines, R., Hettrick, S., Hetherington, J., Stillwell, M. et al. (2017). Research Software Engineers: State of the Nation Report 2017. Goble, C. (2014). Better Software, Better Research. IEEE Internet Computing, 18, 4-8. Hannay, J. E., MacLeod, C., Singer, J., Langtangen, H. P., Pfahl, D. and Wilson, G. (2009). How do scientists develop and use scientific software? Proceedings from 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. Hutton, C., Wagener, T., Freer, J., Han, D., Duffy, C. and Arheimer, B. (2016). Most computational hydrology is not reproducible, so is it really science? Water Resources Research, 52, 7548-7555. Ince, D. C., Hatton, L. and Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485-488. Merali, Z. (2010). Computational science.Error. Nature, 467, 775-777. Miller, G. (2006). Scientific publishing. A scientist's nightmare: software problem leads to five retractions. Science (New York, N.Y.), 314, 1856-1857. Drake, N. (2019). First-ever picture of a black hole unveiled. National Geographic. Retrieved April 15, 2019, https://www.nationalgeographic.com/science/2019/04/first-picture-black-hole-revealed-m87-event-horizon- telescope-astrophysics/. Peng, R.D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227. Robert, B., Chue Hong, N., Dirk, G., James, H. and Ilian, T. (2012). The Research Software Engineer. Proceedings from Digital Research 2012, Oxford, United Kingdom.

4143