Technological Feasibility Analysis

Date: 10/6/2020

Team Name: Presto Proxies

Team Members: Melissa Peiffer, Rodgers, Justin Coffey, Colin Taylor

Project Sponsor: Dr. Nicholas McKay

Team Mentor: David Failing


Table of Contents

1. Introduction
2. Technological Challenges
3. Technology Analysis
   a. Reading NetCDF files for Paleoclimate data
   b. Visualization methods for time series data and index reconstruction
   c. Front End Framework
   d. Back End Framework
4. Technology Integration
5. Conclusion


1. Introduction

Climate can be described as the average, long-term weather patterns observed over many years in a specific region. Observations of the climate include metrics such as temperature, precipitation, and air pressure. The climate affects and has always affected everything, from ecosystems to societies. To better understand our climate now and in the future, it is essential we know how it has changed in the past. The study of past climate variations is known as paleoclimatology. Past climate conditions can be deciphered by observing imprints left on the natural environment, such as the isotopes present in coral skeletons or the substances frozen in the layers of an ice core. Quantitative data, known as proxy data, can be collected from these imprints and used to reconstruct models of past climate conditions.

Over the last 30 years, thousands of proxy data sets have been collected. The quantity of the collected data makes it nearly impossible to analyze climate variation patterns by hand. It would be too time-consuming to search the data set for notable information, and difficult to view the information in a meaningful way.

The Paleoclimate Dynamics Laboratory at NAU, along with collaborators at the University of Southern California, is launching a new project called PReSto (Paleoclimate Reconstruction Storehouse) that will streamline the creation of paleoclimate reconstructions. Visual reconstruction of paleoclimate data allows scientists to isolate specific patterns in their data and condense it into meaningful subsets. To accomplish this task, our vision is a modern web application that utilizes the reconstructions being produced by PReSto. The application will present these reconstructions to a wide range of informed end users in the form of interactive maps and graphs, which will allow users to navigate through past climate reconstructions in space and time.
These visualizations will give users insight into climate variation patterns that would otherwise be difficult to obtain. If successful, it is difficult to overstate the impact of this application. Access to past climate data is of great public interest. Thousands of educators, scientists, and policymakers would benefit from the application. Educators would be able to integrate the application into their sustainability curriculum and better inform our society about climate-related issues. The ability to visualize past climate data will allow researchers to better understand current climate variations and to better predict future climate conditions. Finally, the application will inform policymakers seeking to instate laws and protections that mitigate the effects of climate change. It is crucial to address climate change now in order to prevent catastrophic changes to our climate in the future. The application is one of many tools that will provide a way to combat climate change by giving insight into the patterns of climate variation.

In this document, we begin by outlining the major technological challenges we expect to encounter as we develop our project. Once we have identified these challenges, we will closely examine each item in more detail, presenting our methods for analyzing that technology, alternatives to that technology, and how we ultimately decided to integrate that technology into our project. The technological challenges we expect to encounter are listed below.

2. Technological Challenges

● We will need a way to read NetCDF files for paleoclimate data.
● We will need visualization methods for time series data and index reconstruction.
● We will need a front end framework.
● We will need a back end framework.

3. Technology Analysis

a. Reading NetCDF files for Paleoclimate data

Intro to the Issue
In order to generate visualizations, users must choose a NetCDF file containing the proxy data. Several libraries exist for the explicit purpose of manipulating NetCDF data. Our challenge is to select an appropriate library that can handle large NetCDF files efficiently. NetCDF formatted files are made up of plaintext metadata headers and binary compressed scientific data, which must be placed into a data structure in order to be usable. As such, selecting a library to read the NetCDF files is a top priority for our application. We have outlined some of the desired characteristics such a library would have below.

Desired Characteristics
Our library must meet the following criteria. It should have a small library size to allow for efficient loading and manipulation of NetCDF data, be able to place the NetCDF data into a comparable data structure, and be able to return subsets of NetCDF data. The library should be implemented in Python to remain consistent with our chosen programming language. Our analysis of Python libraries for data management has led us to two primary alternatives: NetCDF4 and GDAL.

Alternatives
Below, we have provided a broad overview of NetCDF4 and GDAL.
● NetCDF4 is an open-source Python module used to read NetCDF files into NumPy arrays. It is built on existing C libraries for reading large data structures.
● GDAL, the Geospatial Data Abstraction Library, is a wide-ranging library with bindings for multiple programming languages, such as C# and Python, for reading vector and raster geospatial data.

Analysis
After looking at both GDAL and NetCDF4, we examined each of our criteria to compare the feasibility of each technology. Our comparison of these metrics has led us to the observations below.

Analysis of NetCDF4
● NetCDF4 was built specifically to interface with NetCDF data.
● NetCDF4 has a library size of 3.12 MB.
● NetCDF4 automatically returns data in the form of a NumPy array.
● NetCDF4 can return subsets of specific data.

Analysis of GDAL
● GDAL is a general library that interfaces with a wide variety of data.
● GDAL has a library size of 126 MB.
● GDAL automatically returns data in the form of a NumPy array.
● GDAL can return subsets of data.


Alternative | Technology Type | Concept | Library Size | Includes Conversion to Comparable Data Structure | Able to Return Data Subsets
NetCDF4 | Python library | A Python interface for the NetCDF C library | 3.12 MB | Yes | Yes
GDAL | Python library | A Python library for vector and raster geospatial data | 126 MB | Yes | Yes

Chosen Approach
Our chosen approach is NetCDF4. Both libraries met our criteria well, but NetCDF4 is superior in terms of library size and efficiency. Also, NetCDF4 is designed specifically to interface with NetCDF data, whereas GDAL is a much broader library and would introduce unnecessary overhead.

Proving Feasibility
In order to prove the feasibility of NetCDF4, our team must import the library into Python and test its functionality using a sample NetCDF file. The tests should include reading the NetCDF file, defining dimensions for data visualization, and accessing specific variables and values from the NetCDF file. If the NetCDF4 library successfully handles the data provided, we will be able to use it to access the data for visualization.

b. Visualization methods for time series data and index reconstruction

Intro to the Issue
Our program needs to be able to generate different types of visualizations for both time series data and index reconstruction. The time series data should be displayed in five dimensions: spatial (x, y, z) coordinates, time, and uncertainty. The index reconstruction should be displayed in four dimensions: spatial coordinates and time. Our challenge is to select an appropriate library to construct these visualizations and to give the user the ability to find specific information from their data. Our program also needs to provide users with the ability to export data visualizations and data subsets. We have outlined some of the desired characteristics we would like our visualization library to have below.

Desired Characteristics
Our library must meet the following criteria. It should be able to handle large amounts of data, allow for the creation of interactive plots in multiple file formats, be able to isolate specific data subsets, and plot data in the form of NumPy arrays. Once again, the library should be implemented in Python for consistency. Our research has led us to four alternatives: Matplotlib, Seaborn, ggplot, and Pygal.

Alternatives
● Matplotlib is a standard Python library that utilizes the NumPy library to plot 2D graphs and other plots.
● Seaborn is an extra layer of abstraction over Matplotlib that allows complex plot types to be made with less code than if they were implemented with only the Matplotlib library.
● ggplot is a library originally implemented in R and later ported to Python, used for visualization in data science applications. It implements a high-level API that allows for complex plots with very little code.
● Pygal is an open-source Python library that by default produces SVG images, which helps ensure scalability of images without producing excessive pixelation. It also provides many options for user interaction, which may be useful for our purposes.

Analysis
Below are our observations of each library, with an overview of their benefits and deficits.

Analysis of Matplotlib
● Matplotlib can create both static and animated plots, but it is a low-level plotting library that may be difficult to use.

Analysis of Seaborn
● Seaborn is a wrapper library over Matplotlib which offers many of the same functionalities as Matplotlib with less code and complexity.

Analysis of ggplot
● ggplot is simple and can isolate data by specific periods. However, it is harder to customize.

Analysis of Pygal
● Pygal can create interactive plots but cannot handle large datasets.

Alternative | Technology Type | Concept | Can Handle Large Datasets | Interactive Graphs in Multiple File Formats | Allows for Data Subsets
Matplotlib | Python library | A low-level, comprehensive library for embedding plots into applications | Yes | Yes | Yes
Seaborn | Wrapper library over Matplotlib | An abstraction layer over Matplotlib; allows for the same functionality with less code | Yes | Yes | Yes
ggplot | R library ported to Python | A comprehensive statistical library for data science applications | Yes | Yes | Yes
Pygal | Python library | An interactive data visualization library focused on aesthetic data visualization | No | Yes | No


Chosen Approach
Our chosen approach is ggplot. Pygal did not meet two of our desired criteria, so we eliminated it as an option. While both Matplotlib and Seaborn met our requirements, we decided against using these libraries to reduce overhead. Matplotlib is a large, low-level library that would introduce a steep learning curve to our project. Seaborn is less complex but would introduce a dependency on Matplotlib. The ggplot library met our criteria and offers added simplicity, making it the most desirable option.

Proving Feasibility
In order to prove the feasibility of ggplot, our team must import the library and test its functionality using a sample data set. Specifically, our team must ensure that the chosen library can generate visualizations for specific time periods and regions, and generate NumPy arrays for the provided time series and index data. Once we are able to generate the necessary visualizations, we will need a proper front end framework to display them to users.

c. Front End Framework

Intro to the Issue
For this project, we need a suitable front end framework. We need an intuitive user interface that looks professional and is easy to navigate. The main challenge is to find a front end framework that provides support for user interface creation and is compatible with visualization libraries and with ggplot. There are several desirable characteristics for our front end framework.

Desired Characteristics
The front end framework must be capable of integrating visualization libraries to display the paleoclimate visualizations to the user. It should also provide a template for building a professional user interface that is clean and readable. We have decided to look at Angular and Reactjs as our alternatives for the front end framework.

Alternatives
● Angular is a front end framework that provides pre-formatted templates using TypeScript.
● Reactjs is a front end framework that provides component-based UI templates using JavaScript.

Analysis
We have looked over both Angular and Reactjs.

Analysis of Angular
● Angular is easy to learn and provides several templates for building a professional and clean user interface. It extends HTML with new attributes and seems to work best on simple Single Page Applications (SPAs). Angular is compatible with several visualization libraries.

Analysis of Reactjs
● Reactjs has a well-defined lifecycle, uses a component-based approach, and uses JavaScript, making it very simple to use and learn. Reactjs also uses a special syntax called JSX, which allows you to mix HTML with JavaScript. Furthermore, Reactjs is compatible with several visualization libraries.

Alternative | Technology Type | Concept | Dependencies | Language | Data Binding
Reactjs | JavaScript library | Brings HTML into JavaScript; works with the virtual DOM; server-side rendering | Requires additional tools to manage dependencies | JavaScript + JSX | One-way data binding
Angular | Full-fledged MVC framework | Brings JavaScript into HTML; works with the real DOM; client-side rendering | Manages dependencies automatically | JavaScript + HTML | Two-way data binding

Chosen Approach
Our chosen approach is Reactjs because it offers several visualization libraries (such as Victory and React-Vis) and is implemented in JavaScript and JSX. Working with JavaScript and JSX would be helpful since we are building a web application, and these languages are well suited to that task. Having the visualization libraries will be crucial to the project since we need to create visualizations. Our team members have worked with JavaScript before, whereas they have not worked with TypeScript, so we also have some familiarity with the tools included with Reactjs.

Proving Feasibility
To prove the feasibility of Reactjs, our team must set up a test webpage implemented with the framework. In order to be successful, the webpage we implement must have a foundation for visualization. Our front end framework will also need a working back end framework to support it.

d. Back End Framework

Intro to the Issue
For users to load their data, we need a comprehensive back end framework to support our system. The back end framework should be able to take that data and pass it to a Python script, which will generate visualizations. Our main challenge is to select an appropriate back end framework for our project. In order to do so, we have outlined some desired characteristics below.

Desired Characteristics
We would like a back end framework that can run Python scripts and quickly access static files. It should also be able to connect to the front end and produce the data visualizations. Specifically, it needs to serve NetCDF files from the server running the framework, run Python scripts directly, and take in user input from HTML forms. Below we have listed all of the alternatives.

Alternatives
● Django is an extremely versatile, full-stack, open source back end framework.
● Pyramid is an open source back end framework that works well with both large and small applications.
● TurboGears is a full-stack, open source back end framework that is focused on data-driven web applications.

● Web2Py is a full-stack, open source back end framework that is data-driven but less complex.

Analysis
Below is an overview of our technology alternatives and how they hold up against our criteria.

Analysis of Django
○ Django has a powerful testing interface and many useful features, such as built-in database features. The provided features of Django may reduce overhead. However, Django is complex and would introduce a somewhat difficult learning curve.

Analysis of Pyramid
○ Pyramid includes features such as data documentation, testing, and the ability to select a programming language and database layer.

Analysis of TurboGears
○ TurboGears is built using components of other frameworks, such as libraries and middleware. It provides a quick way to build data-driven applications. However, it may be less reliable.

Analysis of Web2Py
○ Web2Py includes a built-in integrated development environment (IDE), can run on multiple operating systems, and is a powerful tool for data-driven applications. However, there are fewer maintainers of Web2Py than there are of Pyramid or Django, so receiving support would be difficult.

Alternative | Technology Type | Built-in Database Tools | Request Processing | Extension Libraries | Python 3 Support
Django | Full-stack web framework | ORM for communicating with database | Synchronous | Yes | Yes
Pyramid | Full-stack web framework | No specific ORM, but SQLAlchemy is recommended | Synchronous | Yes | Yes
TurboGears | Full-stack web framework | SQLAlchemy ORM | Synchronous | Yes | Yes
Web2Py | Full-stack web framework | Custom DAL that acts like an ORM | Synchronous | Yes | No

Chosen Approach
Our chosen approach is Django because it is versatile and will provide a framework for the necessary features of our project. While all of the back end frameworks we examined could be implemented to suit our needs, having a comprehensive testing environment is desirable. Therefore, we decided to use Django as our framework.

Proving Feasibility
In order to prove the feasibility of Django, our team must set up a test server to verify that the back end framework provides all the necessary features for our project.

4. Technology Integration

In order to implement our final system, we must have a comprehensive understanding of how each technology will interact. Each technology has been selected to work for the purposes of our project, but they cannot work alone. In this section, we propose our envisioned system and detail how the major components of our project will be integrated with one another.


Figure 1: Proposed System Diagram for Technology Integration

The front end will be served from a React-powered front end hosted alongside the Django back end, as shown in the web server portion of Figure 1. Users will then choose different options that will be stored and posted through an HTML form. This can be seen in Figure 1 as a red arrow from the user machine to the Django back end on the web server. Once the required information is received by the server, the required file will be retrieved and passed, along with the form information, to a Python parsing script that will retrieve the necessary data from the file and pass it along to be plotted by a separate script, shown in Figure 1 by arrows that travel through the back end to various scripts and back into the back end. The plot produced will then be added to the front end and displayed to the user when served to their web browser, which again crosses from the web server to the user machine in Figure 1.
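The parsing step in this flow can be sketched with plain NumPy. The function name and window bounds here are hypothetical; in the real pipeline the bounds would come from the posted form and the array from the NetCDF reader:

```python
import numpy as np

# Hypothetical parsing step: given a gridded array (time, lat, lon) and a
# user-selected window, return the subset to hand off to the plotting script.
def select_window(data, t0, t1, lat0, lat1, lon0, lon1):
    return data[t0:t1, lat0:lat1, lon0:lon1]

grid = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
window = select_window(grid, 0, 1, 0, 2, 1, 3)
print(window.shape)  # (1, 2, 2)
```

Because NumPy slicing returns views rather than copies, subsetting like this stays cheap even on large reconstruction grids.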

5. Conclusion

Understanding the past climate is crucial to developing an understanding of the current climate. With visualization software such as PReSto, researchers can easily analyze mass amounts of data to identify common patterns. These patterns can then be used to project the effects of current and future climate variations. The information provided by this software is essential to help researchers develop ways to mitigate the effects of climate change.

Our project must integrate several technologies to provide researchers with the tools to visualize and analyze their data. In order to build the project, we must be able to read NetCDF files, create visualizations with that data, and export the resulting visualizations. To do so, we must implement an appropriate front end framework and an appropriate back end framework. We must also utilize appropriate plotting and visualization libraries.

For our frameworks, we decided to work with Reactjs for our front end and Django for our back end. Reactjs has a variety of visualization libraries for us to work with, all of which can be used from JavaScript or JSX. As for Django, it has many tools that can handle functions we would otherwise have to implement ourselves. The only downside is Django's learning curve, but we believe overcoming it will positively impact the project. We have also decided to use the plotting library ggplot, because we need to isolate and plot specific data for our visualizations, and it seems to fit our needs best. Finally, we decided to use NetCDF4 to read the provided data files, as it was built specifically to handle NetCDF files and will provide us with ways to manipulate the NetCDF data as needed. We have analyzed each technology and summarized how each will be integrated into the final system.
However, in order to truly understand how the selected technologies will work in our system, we must begin to implement and test them. Moving forward, we will begin to analyze the overall design of our system, using the technologies we have discussed thus far to begin prototyping.