Parallelized Pacific Northwest Climate Data Analysis Framework

Niko Simonson
University of Washington, Bothell

Introduction

To enhance the current state of climate model data analysis, the Parallelized Pacific Northwest Climate Data Analysis Framework (PNCA) provides a system for running analytical algorithms scalably and efficiently. It does so by means of a parallelization technique that takes advantage of the fact that most interesting derived measures either are time series measurements or can be applied to multiple time slices of data at once. Using the Multi-Agent Spatial Simulation (MASS) library, this parallelization is accomplished without requiring the developer to map the notional space of the calculations onto the computational resources currently available. [Emau, et al]

Figure 1. Pre-PNCA climate data analysis, using a shell script and the NetCDF Climate Operators. The output shows the time to analyze three months of climate data, performing a simple multiplication of moisture and wind speed along one dimension.

Purpose

By providing a software framework that demonstrates how this is achieved, PNCA offers developers in the climate science community a way to bring their software capabilities in line with the opportunities provided by parallel computing. Until now, the available analytical tools have been hard to use, hard to customize, and parallelized, if at all, only through specific data file access. [Kumar, et al.] Figure 1 shows an example of such analysis in operation, using a shell script. We have demonstrated an approach that accesses multiple files in parallel and could potentially be adapted to use parallel file access at the same time.

Inception

This project was conceived in June of 2013, based on a suggestion from Professor Munehiro Fukuda, who leads the ongoing development of MASS at the University of Washington, Bothell. It began with a desire to work on that project. Dr. Fukuda proposed that MASS be applied to the field of climate science analysis in order to demonstrate its facility for scientific applications. Professor Eric Salathe, of UWB's Climate Science Department, required an application that would perform detailed analysis of U.S. Pacific Northwest climate model data, similar to an earlier project he worked on, which involved the U.S. Southwest region. That was the initial goal of the project.

It became apparent that the best application of MASS in this case would be to allow a proposed algorithm to access multiple time slices of files in parallel, and that this method of implementation did not have to be limited to any single algorithm. In consultation with Professor Hazeline Asuncion of UWB's Computer Science Department, the project team, consisting of myself and Brett Yasutake, decided to implement a system that would not focus on a single set of analyses but would allow for their independent development. Brett focused on ways to make the system approachable to potential developers in the climate science community and on capturing data provenance. I developed the means to parallelize analytical algorithms via the MASS library, read input, produce results, and allow for the independent development of such algorithms. Together, we developed the adapters suggested by Dr. Asuncion, which reduce the system's inter-dependencies and give it great potential to adapt to changing requirements. In this way, our system for analyzing the model data of a specific region became a powerful framework for fast analysis on a broad variety of data.

Figure 2. Excerpt of instructions for the downloading and installation of a popular command-line climate analysis tool, from http://nco.sourceforge.net/

Present Climate Data Analysis

To understand the decisions behind the development of this framework, it is necessary to briefly describe the present state of climate data analysis. To test their understanding of the planetary climate and to predict the effects of future climate change, researchers run large-scale simulations of global climate models. [Gao, et al.] The output from such models is extensive but, due to resource limitations imposed by the multi-year length of the simulations, rather coarse-grained. Climate scientists who wish to conduct finer-grained, regional-scale simulations rely on the data from these large models as boundary inputs.

Also due to resource constraints, the output data produced by the models contains only basic information, such as temperature or pressure. Secondary information derived from analysis, such as the actual moisture transport in the simulation, is left to climate researchers to compute. There are many tools for this purpose, but such tools are for the most part so unwieldy, slow, difficult to learn, and limited [Su, et al] in their capabilities that many researchers are tempted to develop in-house software solutions for particular questions of analysis. [Gao, et al.] Figure 2 shows an excerpt of the installation instructions for one such tool. Such one-off efforts have little application beyond answering the immediate question at hand and rarely incorporate up-to-date computing resources or technologies. The overall result is that, even with a seven-year cycle between generations of global climate data, researchers find themselves unable to keep up with the accumulation of data.

Parallelization Efforts

Tools have been developed to apply parallel programming techniques to climate science analysis. In general, efforts have been made in three areas: parallelizing data access, developing parallel analytical solutions, and utilizing existing parallel computing tools for the purpose of analyzing climate data. Specific examples include the Parallel NetCDF API [Li, et al] and applications that use Hadoop to perform climate analysis with a MapReduce model. [Zhao & Huang]

The standard format for climate model data output is the Network Common Data Form (netCDF). It is a format that allows for the storage and access of multiple data array variables, as well as some descriptive metadata. [Rew & Davis] Tools such as Parallel NetCDF, PIDX [Kumar, et al] and FastQuery [Chou & Prabhat] access those arrays in parallel.

Climate researchers have also turned to independent development of large-scale parallel computing applications. However, the solutions described in the literature suffered from applicability or resource constraints. As an example, one of the most notable in terms of performance was able to process one terabyte of data in under a minute, but it relied on a Cray XT4 supercomputer and had an architecture consisting of thirteen separate processing layers. [Kendall, et al.]

MapReduce is a scheme in which Map() first sorts data into key-value pairs and Reduce() then performs the specified analysis across the data in parallel. [Dean & Ghemawat] The Hadoop application provides an open-source way to run MapReduce on Cloud-based processing. [Park, et al.] However, MapReduce implementations, and Hadoop in particular, are not especially efficient at the message passing that forms the basis of parallelization, instead relying on the Cloud's ability to scale out large resources. [Yoo & Sim] This can and does lead to bottlenecks and race conditions in the inter-node communications. [Ji, et al.] It also leads to long execution times in data partitioning and in the execution of Reduce. [Zhao & Huang]

Methods

The PNCA framework is a hybrid of the second and third approaches: a tool for performing parallel analysis that uses the MASS library, rather than MapReduce, as its means of parallelization. Unlike projects such as GeoModeler or some of the terascale efforts, it is resource-independent and does not have to be extensively reprogrammed to perform a different analysis. [Vance, et al] [Lofstead, et al] PNCA is complementary to the first approach: files are accessed in parallel, while single-file access depends on the Input Adapter module, so utilizing a tool such as Parallel NetCDF in the Input Adapter would potentially allow PNCA to be used in concert with parallel file access. PNCA makes parallelization easily available to an analysis module through the use of simple functions, as suggested by Chiang. [Chiang]

Figure 3. Data flow diagram. Dependencies occur in the direction of data flow. Other than the Analytics Module, which is an example from a package, each data processing location corresponds to a class within PNCA.

Designing the Framework

Available resources and technologies played a constant role in PNCA's design. Many existing analysis tools for climate science were written in FORTRAN or had FORTRAN APIs. To enhance the system's flexibility, a more object-oriented language was desired. The MASS library was available only in Java, a language not known for its overall efficiency. However, Java's ease of use, its high-level, managed nature, and its platform independence all spoke in its favor. As a hedge, a high-speed C++ version of MASS is in development, and the established framework can be rewritten to accommodate it should the need arise.

An important feature of the framework's architecture is the use of modules termed adapters, as shown in Figure 3, because they allow the system to adapt to changes by minimizing interdependence. Considering the system as beholden to the needs of external inputs and the requirements of external outputs, changes to either of those have an impact on the system itself. Adapter modules exist at the system's interfaces with the external world, and also within the system, separating its major portions. Changes effected at these interfaces result in changes to the adapters but not to any class connected to them. This was a major point in the framework's design.

Specifically, for the execution of parallel analysis, the most important modules besides the analytics module itself were the Input Adapter, providing a source of external data, the Results Adapter, to store data, and the Execution Adapter, through which the analytics module is run. An early example of the facility of such a design was a change in the output requirements partway through the project. As originally conceived, results would be stored in a relational database. Based on additional input from Dr. Salathe, it was decided that storing the results in netCDF format would allow them to be easily accessed with existing tools. The Results Adapter was changed accordingly to write to netCDF, but as its interface stayed the same, no further work had to be done anywhere else in the system.
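As a concrete illustration of this decoupling, the following sketch shows what such an adapter boundary might look like; the interface and class names are hypothetical rather than taken from the PNCA source.

// Hypothetical sketch of the Results Adapter boundary described above; the
// interface and class names are illustrative, not the actual PNCA source.
public interface ResultsAdapter {
    void open(String destination);                    // e.g., a netCDF file path
    void writeGrid(String variable, float[][] data);  // one two-dimensional result slice
    void close();
}

// Swapping the storage backend (relational database to netCDF) means providing a
// different implementation; the analytics modules that call the interface are unchanged.
class NetcdfResultsAdapter implements ResultsAdapter {
    public void open(String destination) { /* create the netCDF writer */ }
    public void writeGrid(String variable, float[][] data) { /* write one slab */ }
    public void close() { /* flush and close the file */ }
}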

The Input Adapter, unsurprisingly, reads netCDF-formatted data. A single file of climate model output holds the entirety of the information for the model's particular geographical region at a single slice in time. A time slice is taken every six hours. Wrapped in the Input Adapter is a Java-based netCDF reader.
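The sketch below illustrates the kind of call the Input Adapter wraps, using the Unidata NetCDF-Java reader; the file name, the variable name "T2", and the assumed three-dimensional float layout are illustrative only.

import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

// A minimal sketch of what the Input Adapter wraps: the Unidata NetCDF-Java reader
// opening one time-slice file and pulling a single variable into a Java array.
// The file name, the variable name "T2", and the 3-D float layout are assumptions.
public class InputAdapterSketch {
    public static float[][][] readVariable(String path, String varName) throws Exception {
        NetcdfFile ncfile = NetcdfFile.open(path);
        try {
            Variable var = ncfile.findVariable(varName);
            if (var == null) {
                throw new IllegalArgumentException(varName + " not found in " + path);
            }
            Array data = var.read();                        // read the whole variable
            return (float[][][]) data.copyToNDJavaArray();  // assumed shape: [time][y][x]
        } finally {
            ncfile.close();
        }
    }

    public static void main(String[] args) throws Exception {
        float[][][] t2 = readVariable("pnw_1972010100.nc", "T2");
        System.out.println("First value: " + t2[0][0][0]);
    }
}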

The Execution Adapter's purpose is to load and run any analytics module. As a result, there needed to be a standardized interface contract between the analytics modules and the Execution Adapter. The Execution Adapter itself can be launched from any sort of controller; in the case of PNCA, with an eye to ease of use, that controller is a graphical user interface.
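A hypothetical version of such an interface contract might look as follows; the actual PNCA method names and signatures may differ.

// Hypothetical contract between the Execution Adapter and an analytics module;
// the real PNCA interface may differ in names and signatures.
public interface AnalyticsModule {
    void initialize(String[] inputFiles);   // the files chosen through the GUI
    void run();                             // performs the (parallelized) analysis
    String getStatus();                     // polled for display by the GUI
    void writeResults(String outputPath);   // delegates to the Results Adapter
}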

The centerpiece of the design is the analytics module. Rather than a single class, it is a package, with existing classes providing examples and templates for the independent development of customized data analysis. An analytics module essentially functions like a big operator over the body of climate data. [Park, et al.] Existing analytics modules make use of the Input Adapters to obtain data from variables of interest, use the Results Adapters to output what is calculated, parallelize algorithms with calls to the MASS library, provide status updates, and implement provenance collection. Individual coders can choose to implement all of these features, taking advantage of the full capabilities of the framework, or only what they are specifically interested in. For example, if the result of an algorithm were only a single scalar value, it could be transmitted directly to the GUI using the Status Adapter, while the Results Adapter remained unused.

The analytics modules developed so far perform analyses of immediate interest to the Pacific Northwest region. Moisture Flux measures the amount of moisture transported by the wind. Wind Divergence shows local increases or decreases in the rate and direction of wind speed. One broad class of analysis is known as the Extreme Indices. These useful secondary measures show states of the climate that have notable human impact, such as the number of icy days in a year. By comparing values from the beginning of a climate model's simulation with values near the end, useful statements concerning the impact of predicted climate change can be derived. The indices implemented were chosen to reflect broad categories of potential indices related to temperature and precipitation. For example, once the number of icy days can be calculated, it requires only a trivial change to develop an analytics module that calculates the number of tropical nights.

Figure 4. MASS collects climate data files in parallel, which are then arranged by month and time.
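Returning to the extreme indices described above: switching from one index to another changes only the input series and the threshold comparison, as the illustrative sketch below shows. It is not PNCA code, and daily temperature extremes in Kelvin are assumed.

// Illustrative only: counting threshold-based extreme indices over one year of
// daily values. Switching from "icy days" to "tropical nights" changes only the
// input series and the comparison; units (Kelvin) and thresholds are assumptions.
public class ExtremeIndexSketch {
    // Icy days: days whose maximum temperature stays below freezing.
    static int icyDays(float[] dailyMaxTempK) {
        int count = 0;
        for (float t : dailyMaxTempK) {
            if (t < 273.15f) count++;
        }
        return count;
    }

    // Tropical nights: days whose minimum temperature stays above 20 degrees C.
    static int tropicalNights(float[] dailyMinTempK) {
        int count = 0;
        for (float t : dailyMinTempK) {
            if (t > 293.15f) count++;
        }
        return count;
    }
}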

Developing the Framework

Development utilized an agile approach, due to the short, six-month time frame of the project. Deliverables were presented at the end of two-week sprints. On a larger scale, the product was divided into pre-alpha (design), alpha (feature complete), beta (code complete), and release states. The main deliverable at the end of each sprint was either pre-alpha or a candidate for one of the latter states. Rather than imposing a single hard deadline, this approach cast the target goals in a way that allowed multiple opportunities to reach them. In this fashion, it is hoped that similar results will be replicable. [Jiang & Chen]

Implementation was done through pair programming, a convenient approach given the number of people on the project team. Design and coding work was done with one member programming the section of their responsibility and the other member observing. When the programmer was stuck or the observer had an insight, the pair switched places. The observing partner could also work independently on other aspects of the project or perform research to resolve specific problems as they occurred. Because it requires the resources of two developers, pair programming must, to be effective, average twice the productivity of a single programmer working alone. Although we did not have the means to compare it formally, from an anecdotal point of view pair programming provided an increase in productivity well in excess of the doubling needed to justify its use. It is important to note that part of this was due to the availability of physical resources to support two developers: it would be a fallacy to assume that the observer does not need their own workstation, network connection, and access to the project code and documentation.

An important inclusion in the iterative approach taken was risk management, as an aspect of quality assurance. Using a streamlined methodology, threats were identified and characterized by likelihood and impact, each on a small scale of index values. The resulting risk index was calculated as a function of the two values. By identifying the top risks going into each iterative cycle, the project was able to focus on the major obstacles first.

Early versions of the framework concentrated on implementing MASS's parallelization with simple temperature analyses, such as taking daily high temperatures and regional averages.

MASS allows for the creation of a multidimensional logical array, which it then parallelizes across multiple processes and threads. [Emau, et al] When considering an analysis of multiple time slices and geographical areas, it becomes apparent that the same analysis can be repeated across each time slice. MASS provides two important and easy-to-use functions in this regard: CallAll() and ExchangeAll(). Calling CallAll() and passing it a user-defined function causes that function to be executed across every computational node. Calling ExchangeAll() allows nodes to reveal data, in the form of object-holding messages, to specified neighbors. In both cases, all computational nodes complete before the next step is called, due to barrier synchronization. This results in some decrease in speed but completely avoids the race conditions that often beset attempts at parallelization. [Martha, et al.] [Lofstead, et al.]

Once the first working prototype was developed, project development moved on to determining the most efficient implementation of parallelization. Three choices were considered, all of which involved obtaining inputs in parallel, as shown in Figure 4. In the first case, only input would be parallelized: all data would be delivered to a central controller for sequential analysis. This, along with a fully non-parallel implementation of the algorithm, was considered a form of control. The second case parallelized both the input and the analysis. The third case would have created two parallel spaces: one in which time (input from time slices) was parallelized, and another in which space, the geographical cells, was. Time series analysis could be conducted in the former space and spatial analysis in the latter. Unfortunately, the present inability of the MASS library to communicate across separately instantiated sets of parallelized nodes meant that only the first two cases and the sequential version could be developed. The results confirmed that the fastest of the available methods involved parallelizing input as well as analysis.
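The sketch below gives a rough sense of how an analytics module drives this CallAll()/ExchangeAll() pattern. The class name, function identifiers, and exact MASS signatures are assumptions and should be checked against the MASS release in use.

import java.util.Vector;
import MASS.MASS;      // MASS library (UW Bothell); the package path varies by release
import MASS.Places;

// Rough sketch of driving a parallel analysis with MASS, following the
// callAll()/exchangeAll() pattern described above. The class "TimeSlicePlace",
// the function ids, and the exact signatures are assumptions, not the shipped API.
public class ParallelDriverSketch {
    static final int LOAD_FILE = 0, COMPUTE = 1, SHARE_RESULT = 2;

    public static void main(String[] args) {
        MASS.init(args);                       // initialization arguments depend on the release

        // One logical place per 6-hour time slice; MASS maps the places onto the
        // available processes and threads without the developer's involvement.
        int nSlices = 124;                     // roughly one month of 6-hour slices
        Places slices = new Places(1, "TimeSlicePlace", null, nSlices);

        slices.callAll(LOAD_FILE);             // every place reads its own netCDF file
        slices.callAll(COMPUTE);               // every place runs the same analysis step

        // Barrier-synchronized exchange of partial results with adjacent slices.
        Vector<int[]> neighbors = new Vector<int[]>();
        neighbors.add(new int[]{-1});          // previous time slice
        neighbors.add(new int[]{+1});          // next time slice
        slices.exchangeAll(1, SHARE_RESULT, neighbors);

        MASS.finish();
    }
}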

The next step in development was the decoupling of various aspects of the system, such as input, results reporting, and analysis, making use of the adapters that are presently in place. To minimize complexity, the Input Adapter focuses on wrapping the netCDF reader and the Results Adapter on the writer; they are launched in parallel, as needed, from the analytics modules. The parameters that the analytics modules need to function, such as the files selected for analysis, and also which analytics module is run, are controlled by the Execution Adapter. The only interface that an analytics module has to concern itself with is the Execution Adapter that calls it. Otherwise, the module is free to make use of the interfaces provided by the other adapters without having to accommodate changes to them.

Results display changed the most during the project. Due to the difficulty involved in developing an entire geo-visualization scheme for the framework, the decision was made to incorporate a tool already in use. [Vance, et al.] Panoply, which has a Java implementation, was selected for this purpose. Panoply visualizes results in a geographical context, showing spatial results on a world map. In order for PNCA's results to be displayed in such a manner, geographic metadata needs to be supplied. Interfaces in the Input and Results Adapters make this easy to provide, and all the currently developed analytics modules demonstrate how it is done, but an analytics module's developer is under no obligation to create Panoply-readable results.
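The following sketch suggests the kind of geographic metadata involved, written here with the NetcdfFileWriteable class from the Unidata library; the variable names, grid size, and coordinate values are placeholders rather than PNCA's actual output layout.

import ucar.ma2.Array;
import ucar.ma2.DataType;
import ucar.nc2.Dimension;
import ucar.nc2.NetcdfFileWriteable;

// Sketch of the geographic metadata a Results Adapter can attach so that Panoply
// can place results on a map: latitude/longitude coordinate variables with
// CF-style "units" attributes. Names, grid size, and coordinates are illustrative.
public class GeoMetadataSketch {
    public static void main(String[] args) throws Exception {
        int ny = 95, nx = 110;                 // assumed regional grid size
        NetcdfFileWriteable out = NetcdfFileWriteable.createNew("result.nc", false);

        Dimension latDim = out.addDimension("lat", ny);
        Dimension lonDim = out.addDimension("lon", nx);
        out.addVariable("lat", DataType.FLOAT, new Dimension[]{latDim});
        out.addVariableAttribute("lat", "units", "degrees_north");
        out.addVariable("lon", DataType.FLOAT, new Dimension[]{lonDim});
        out.addVariableAttribute("lon", "units", "degrees_east");
        out.addVariable("moisture_flux", DataType.FLOAT, new Dimension[]{latDim, lonDim});

        out.create();                          // leave define mode

        float[] lats = new float[ny];
        for (int j = 0; j < ny; j++) lats[j] = 41.0f + j * 0.1f;   // placeholder coordinates
        out.write("lat", Array.factory(DataType.FLOAT, new int[]{ny}, lats));
        // ... write "lon" and the result grid in the same way ...
        out.close();
    }
}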

Figure 6. Graphical User Interface for PNCA after completion of a Moisture Flux analysis. Results are stored in netCDF files that can be visualized using Panoply.

Results

PNCA successfully runs a variety of climate analysis algorithms that produce time-based or non-time-based results. It operates quickly, and its parallelization allows it to scale to utilize available computing resources.

PNCA Described

In its current form, the PNCA framework supplies the user with a GUI, shown in Figure 6, that allows the selection of files and the selection of an analytics module. By coding an analytics module in accordance with the function calls of the Execution Adapter, custom analytics modules can be written in Java and placed in the Analytics Package of the system. These modules, when selected, have their interface functions called by the Execution Adapter using Java Reflection. Through the parallelization provided by MASS, multiple climate files can be accessed by the algorithm at once, in synchronized fashion, allowing a broad range of calculations to be performed rapidly.
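A simplified illustration of this reflection-based launch is shown below; the package, class, and method names are assumptions rather than the actual PNCA identifiers.

import java.lang.reflect.Method;

// Simplified illustration of invoking a user-supplied analytics module via Java
// Reflection, as described above. The module and method names are assumptions.
public class ReflectionLaunchSketch {
    public static void main(String[] args) throws Exception {
        String moduleName = "analytics.MoistureFlux";       // chosen through the GUI
        Class<?> moduleClass = Class.forName(moduleName);
        Object module = moduleClass.getDeclaredConstructor().newInstance();

        // Call the agreed-upon interface methods by name.
        Method init = moduleClass.getMethod("initialize", String[].class);
        init.invoke(module, (Object) new String[]{"jan.nc", "feb.nc", "mar.nc"});

        Method run = moduleClass.getMethod("run");
        run.invoke(module);
    }
}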

It is the GUI that calls the Execution Adapter, which then launches the MASS engine and runs the selected algorithm. At the coder's option, an analytics module can produce status updates through calls to the Status Adapter. The Status Adapter maintains a current status object that defaults to a non-null value and can be queried at any time by an object seeking to display it, such as the GUI. The GUI also currently provides an option to launch the Panoply visualization tool, which is able to read and depict any output file produced by the framework's Results Adapter, thanks to the transfer of geographic metadata.

Figure 7. Visualization of the results produced by PNCA using Panoply. Shown here is the annual count of days where the low temperature dropped below freezing, an extreme climate index. Red indicates a higher number of frost days and can be seen in the mountains. Blue indicates little or no frost, as seen over the ocean.

Time-based Detection

With access to multiple data files, but without knowing whether those files contain gaps or are unsorted, it is desirable to perform indexing based on related times, such as all the time slices that belong to the same day. This is similar to the need served by the Map() function of MapReduce. [Dean & Ghemawat] Analytics modules that perform time-based analysis, such as calculating the monthly average of the daily low temperature or the number of frost days depicted in Figure 7, rely on this detection.

To implement it, a form of communication was established using the MASS ExchangeAll() function. Instead of using the messages to perform a reductive calculation, the communication is expanded by passing the function a dynamically generated neighbors array that forces every node to treat all other nodes as its immediate neighbors. This results in large message arrays that contain numerous null values, so an encoding and decoding scheme is necessary to access the correct information, but the result effectively indexes similar times together.
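The sequential sketch below shows only the indexing goal, grouping date-stamped files by day in the manner of Map(); it is not the ExchangeAll()-based implementation, and the file-naming pattern it assumes is illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of the indexing goal described above: grouping 6-hour time slices
// that belong to the same day, much like MapReduce's Map() step. This sequential
// sketch is not the MASS ExchangeAll()-based implementation; the date-stamped file
// naming convention is assumed here to be "pnw_YYYYMMDDHH.nc".
public class TimeIndexSketch {
    static Map<String, List<String>> groupByDay(List<String> files) {
        Map<String, List<String>> byDay = new TreeMap<String, List<String>>();
        for (String file : files) {
            String day = file.substring(4, 12);          // the "YYYYMMDD" portion of the name
            if (!byDay.containsKey(day)) {
                byDay.put(day, new ArrayList<String>());
            }
            byDay.get(day).add(file);                    // tolerates gaps and unsorted input
        }
        return byDay;
    }
}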

Measurements

The following measurements reflect the half-year time frame of project design and development. [Conn] The project itself was conducted using free, online, and cloud-based project management tools, including Google Code, Google Drive, and Trello. Our project management records show five major prototype versions and a total of 44 commits to Google Code, which should provide a process-oriented picture of the development. [Rahman & Devanbu]

The project consists of roughly 6,000 lines of code, of which the ten analytics modules average 400 lines each. As a metric, lines of code is useful for open-source projects as a measure of attractiveness, showing that significant development took place. [Meirelles, et al.]

The framework is capable of handling significant numbers of files in parallel. The file size of climate data is significant: 11 GB are required to store a year's worth of data. The amount of data the framework can handle is limited by the amount of main memory accessible to the Java Virtual Machine, and is therefore hardware dependent. One gigabyte of RAM can hold the multivariate data of one year of climate files for most of the analytics modules. Compared to this, output storage is minimal. A typical output file is only 240 KB, though provenance collection files could grow significantly, depending on the implementation.

In terms of execution times, the framework will scale with the amount of computing resources available, though not necessarily in a linear fashion, as the overhead of message passing across nodes will increase. However, even with only four threads of parallelization, performance is significantly better than that of NCO, the netCDF Operators, the most common climate data analysis tool. Specifically, a moisture flux calculation using PNCA took 13 seconds to complete. The same calculation, but using only one variable pair (PNCA used two), with a commensurate level of provenance storage, took NCO over 15 minutes of wall-clock time. Counting only processing and I/O time, NCO still consumed 156 seconds, making it twelve times slower than our framework, even with minimal parallelization.

Discussion

Creating complex algorithms requires a certain amount of programming skill: the basic ability to code in Java. However, parallelization has generally been the domain of expert programmers, and this framework puts it to work for a coder with minimal effort, with the results immediately runnable from a GUI, allowing researchers to focus on the most important aspect: the algorithm itself.

Current State

PNCA is already performing climate analysis on large climate data sets for the Pacific Northwest region. With one hundred years of model data to work with, and an analysis time of less than two minutes per year, the entire data set should be processed with all available algorithms in a short time frame.

Certain issues with the MASS library are awaiting resolution. The most significant of these is the inability of two different sets of computational nodes to communicate with each other; communication between nodes within a single set functions correctly. Until this issue is resolved, some of the more interesting analyses that transform time-based mapping into spatial-based mapping, such as tiling, will have to wait. [Guo, et al]

The time detection algorithm is effective, but appears to have quadratic complexity. Currently, given the limited main memory available with current computing resources, this is not an issue. However, an alternate implementation using agent-based detection should have near-linear complexity.

Impact

The results collected from the currently ongoing analysis of Pacific Northwest climate data will be useful in predicting the environmental, social, and economic impacts of climate change on the region. [Gao, et al.] This should serve as an example to encourage the adoption of this framework for climate analysis in other regions and with more resources. Parallelized and modularized climate analysis has been an ongoing goal in the climate science community. [Kumar, et al.]

Future Goals

With the release of a C++ version of the MASS library, translating the PNCA framework into C++ presents the opportunity to greatly increase the overall efficiency of the system. However, leaving behind managed code will present its own set of challenges to whoever ports it.

The framework itself is extensible beyond climate analysis. By changing the Input and Results Adapters, parallelized analysis can be performed on any other sort of dataset. For geophysical data, not even the adapters have to be changed.

Iterative analysis can also be performed: because both results and inputs are in netCDF form, results can be used as input. This will require some minor modifications to the Input Adapter, though, as the result variables are not the same as the three-dimensional floats in which all climate model data is stored. Result variables can be of differing dimensionality and are usually doubles or integers.
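The sketch below shows the kind of inspection such a modification would involve, using the Unidata reader to report each variable's rank and type; it is illustrative rather than part of PNCA.

import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

// Sketch of a check an Input Adapter would need in order to accept PNCA result
// files as input: result variables may differ in rank and type from the raw
// three-dimensional float model data. The code is illustrative only.
public class ResultInputSketch {
    public static void describe(String path) throws Exception {
        NetcdfFile ncfile = NetcdfFile.open(path);
        try {
            for (Variable v : ncfile.getVariables()) {
                System.out.println(v.getShortName()
                        + ": rank=" + v.getRank()
                        + ", type=" + v.getDataType());
            }
        } finally {
            ncfile.close();
        }
    }
}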

A major mid-term future goal is the use of agent analysis. As the name implies, MASS supports the deployment of large quantities of mobile agents into its environment. [Emau, et al.] Two immediate uses for such agents would be meteorological phenomena detection, such as locating atmospheric rivers or storms, and an improvement to the basic time detection algorithm, allowing the indexing of similar time slices to proceed more quickly. [Willcock, et al.]

In order to achieve these goals, it is desirable to have more developers and researchers utilize the framework, both extending it and coding analytics modules for the framework as it stands.

Acknowledgments

The successful completion of this framework is due to the active participation and assistance of the following people and groups:

● Brett Yasutake, project partner
● Professor William Erdly, academic project guidance
● Professor Munehiro Fukuda, MASS Project leader
● Professor Eric Salathe, climate science algorithms
● Professor Hazeline Asuncion, adapter concept
● Professor Robin Barnes-Spayde, ongoing usability participation
● Del Davis, provenance collection advice
● The MASS Project Team, MASS advice and guidance
● Student cohort of the UWB Master's program, project peer reviews and advice

References

Chiang, C. (2010). Programming parallel apriori algorithms for mining association rules. 2010 International Conference on System Science and Engineering (ICSSE) (pp. 593-598). doi: 10.1109/ICSSE.2010.5551704

Chou, J., Wu, K. & Prabhat. (2011). FastQuery: a parallel indexing system for scientific data. 2011 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 455-464). doi: 10.1109/CLUSTER.2011.86

Conn, R. (2004). A reusable, academic-strength, metrics-based software engineering process for capstone courses and projects. Proceedings of the 35th SIGCSE technical symposium on Computer science education (SIGCSE '04) (pp. 492-496). doi: 10.1145/1028174.971465

Dean, J. & Ghemawat, S. (2004). MapReduce: simplified data processing on large clusters. Communications of the ACM - 50th anniversary issue: 1958 - 2008, 51(1). 107-113. doi: 10.1145/1327452.1327492 Emau, J., Chuang, T., & Fukuda, M. (2011). A multi-process library for multi-agent and spatial simulation. 2011 IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing (pp. 369-375). doi: 10.1109/PACRIM.2011.6032921

Gao, Y., Leung, L. R., Salathe, E. P., Dominguez, F., Nijssen, B., & Lettenmaier, D. P. (2012). Moisture flux convergence in regional and global climate models: Implications for droughts in the southwestern United States under climate change. Geophysical Research Letters, 39(L09711), 1-5. doi:10.1029/2012GL051560

Guo, J., Bikshandi, G., Fraguela, B. B., Garzaran, M. J. & Padua, D. (2008). Programming with tiles. Proceedings of the 13th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '08 (pp. 111-122). doi: 10.1145/1345206.1345225

Ji, C., Li, Y., Qiu, W., Awada, U. & Li, K. (2012). Big data processing in cloud computing environments. 12th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN), 2012 (pp. 17-23). doi: 10.1109/I-SPAN.2012.9

Jiang, G. and Chen, Y. (2004). Coordinate metrics and process model to manage software project risk. Engineering Management Conference, 2004. Proceedings. 2004 IEEE International. (2), 865-869. doi: 10.1109/IEMC.2004.1407505

Jones, C. (2013). Function points as a universal software metric. ACM SIGSOFT Software Engineering Notes, 38(4), 1-27. doi: 10.1145/2492248.2492268

Kendall, W., Glatter, M., Huang, J., Peterka, T., Latham, R. & Ross, R. (2009). Terascale data organization for discovering multivariate climatic trends. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (Article No. 15). doi: 10.1145/1654059.1654075

Kumar, S., Pascucci, V., Vishwanath, V., Carns, P., Hereld, M., Latham, R., et al. (2010). Towards parallel access of multi-dimensional, multi-resolution scientific data. Petascale Data Storage Workshop (PDSW) (pp. 1-5). doi: 10.1109/PDSW.2010.5668090

Li, J., Liao, W., Choudhary, A., Ross, R., Thakur, R., Gropp, W., et al. (2003). Parallel netCDF: a high-performance scientific I/O interface. 2003 ACM/IEEE Conference on Supercomputing (pp. 39-49). doi: 10.1109/SC.2003.10053

Lofstead, J., Zheng, F., Klasky, S. & Schwan, K. (2008). Input/Output APIs and data organization for high performance scientific computing. Petascale Data Storage Workshop (PDSW) (pp. 1-6). doi: 10.1109/PSDW.2008.4811881

Martha, V., Zhao, W. & Xu, X. (2013). h-MapReduce: a framework for workload balancing in MapReduce. 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA) (pp. 637-644). doi: 10.1109/AINA.2013.48

Meirelles, P., Santos, C., Miranda, J., Kon, F., Terceiro, A. and Chavez, C. (2010). A study of the relationships between source code metrics and attractiveness in free software projects. 2010 Brazilian Symposium on Software Engineering (SBES) (pp. 11-20). doi: 10.1109/SBES.2010.27

Park, C., Steele, G. L. & Tristan, J. (2013). Parallel programming with big operators. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming (pp. 293-294). doi: 10.1145/2442516.2442551

Rahman, F. and Devanbu, P. (2013) How, and why, process metrics are better. Proceedings of the 2013 International Conference on Software Engineering (ICSE '13) (pp. 432-441). Piscataway, NJ: IEEE Press. Retrieved from IEEE database.

Rew, R., & Davis, G. (1990). NetCDF: an interface for scientific data access. IEEE Computer Graphics and Applications, 10(4), 76-2. doi: 10.1109/38.56302

Su, Y., Agrawal, G. & Woodring, J. (2012). Indexing and parallel query processing support for visualizing climate datasets. 41st International Conference on Parallel Processing (ICPP), 2012 (pp. 249-358). doi: 10.1109/ICPP.2012.33

Vance, T. C., Merati, N., Wright, D. J., Mesick, S. M. & Moore, C. W. (2007). GeoModeler: tightly linking spatially-explicit models and data with a GIS for analysis and geovisualization. GIS '07 Proceedings of the 15th annual ACM international symposium on advances in geographic information systems (Article No. 32). doi: 10.1145/1341012.1341054

Willcock, J., Hoefler, T., Edmonds, N. & Lumsdaine, A. (2011). Active pebbles: a programming model for highly parallel fine-grained data-driven computations. Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP '11 (pp. 305-306). doi: 10.1145/1941553.1941601

Yoo, D. & Sim, K. M. (2011). A comparative review of job scheduling for MapReduce. 2011 IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) (pp. 353-358). doi: 10.1109/CCIS.2011.605089

Zhao, D. & Huang, Z. C. (2011). SDPPF - a MapReduce based parallel processing framework for spatial data. 2011 International Conference on Electrical and Control Engineering (ICECE) (pp. 1258-1261). doi: 10.1109/ICECENG.2011.6057775

Appendix I – Condensed Project Plan

● Week of 7/21 - development of QA documents, architectural design

● Week of 8/4 - first prototype and unit tests; simple parallel computation utilizing the MASS library

● Week of 8/18 - second prototype; formal status review; integration and initial validation tests are available; read NetCDF in a parallelized fashion

● Week of 9/1 - third prototype, aim for alpha (feature complete) candidate one; climate analysis algorithm implementation complete

● Week of 9/15 - fourth prototype, alpha candidate two; moisture flux divergence output data set is achieved; user interface displays task status

● Week of 9/29 - fifth prototype, beta (can be used/demoed to stakeholders) candidate one, acceptance testing; user interface displays results (text, summary statistics)

● Week of 10/13 - sixth prototype, beta candidate two; user interface displays graphical results (charts, graphs, maps)

● Week of 10/27 - seventh prototype, beta candidate three; user interface allows user defined algorithms to be created and executed

● Week of 11/10 - eighth prototype, code complete release candidate one; user acceptance testing

● Week of 11/24 - ninth prototype, release candidate two to address acceptance test issues; user acceptance testing

● Week of 12/8 - complete and present project; code complete system is achieved

Appendix II – NetCDF Analysis Summary

Overview

The Pacific Northwest Climate Data Analysis project seeks to provide an efficient framework to perform analysis and collect provenance information on the data of several climate model simulation results. This data is stored in the network common data form (netCDF) format, which is the international standard for climate science data. Integration of a netCDF reader is necessary for this system.

netCDF

The netCDF abstraction consists primarily of dimensions, variables, and attributes. It also includes a certain amount of metadata. netCDF also provides a high-level interface for reads and writes across dimensions, or "hyperslabs." This data is encoded in eXternal Data Representation (XDR) format.

Dimensions

A dimension is the extent of a variable array and is expressed as an integer. For example, a variable x having a dimension of 4 could be thought of as an x[4] array. netCDF supports multi-dimensional access. In addition, one dimension can be unbounded, allowing a variable to grow freely, like a record set. A variable with no dimensions specified is scalar.

Variables

In addition to dimensions (shape), a netCDF variable has a name and a type. The numerical data types are:

● bytes
● 16-bit "short" integers
● 32-bit "long" integers
● 32-bit floating-point numbers
● 64-bit floating-point numbers

Attributes

Variables may have an arbitrary number of properties expressed as attributes. Attributes can be vectors with multiple values, and can also be strings. Attributes provide information that is not easily expressed as numeric dimensions, such as metadata and missing values. Global attributes can also be defined that provide information about the entire file. Attributes with the same value across different variables are used to create "hyperslabs" of multivariate data that can be read by the netCDF interface. Attributes are stored in a header file for ease of access.

Pacific Northwest Climate Data

Each pair of header and data files for the project consists of a complete spatial description of the Pacific Northwest region at a specific time slice. Depending on the model, the slice is 6 hours or 24 hours. Each model produces 100 years of simulated data. One file contains roughly 60 megabytes of data, so the size of one model's entire output is on the order of 10 terabytes. Files are named in a uniform fashion, with the name including the date and time of the time slice.

netCDF Access

The central function of the Pacific Northwest Climate Data Analysis System is to read the climate data contained in several netCDF files in parallel and perform the desired analysis upon them.

Parallel Access

Parallel access is provided by the Multi-Agent Spatial Simulation (MASS) Library. This is a Java library that allows for multi-process, multi-threaded work in an abstracted, user-defined computational space. MASS automatically handles the division of the computational space among the available threads and processes.

One computational node will be created for each netCDF file to be accessed. Each node will maintain a separate netCDF reader. For each analytical task, these nodes will be responsible for accessing and storing the necessary values, as well as collecting data provenance on their actions.

netCDF Reader

For ease of integration into the MASS environment, it is strongly desired that the netCDF reader be implemented in Java. There are serial and parallel netCDF readers, but at this point a parallel-access Java implementation has not been found.

Current Reader Unidata’s “NetCDF Java” Reader Version: 4.3.18 Project Version: as of Review Checkpoint #2 Notes: implements Common Data Model (CDM) interface; serial Appendix III – Stakeholder Communication Management Document

Overview

Communications with project stakeholders have the following major purposes:

● understanding requirements
● obtaining feedback on project progress
● determining acceptance

In order to facilitate such communication, many aspects of such communication must be considered:

● functional requirements gathering
● non-functional/quality requirements gathering
● understanding stakeholder goals and perspective
● tracking the results of communications

Requirements Elicitation

Common Concerns

● Problems of Scope - understanding the boundaries of the project
● Problems of Understanding - is the conception of the requirements correct?
● Problems of Volatility - change in requirements over time

Business and Technical Feasibility

● Domain Constraints
● Ambiguous Requirements - mitigated by prototyping

For each stakeholder:

● Identify the real problem/opportunity/challenge
● What current measures show that the problem is real?
● Identify business deliverables
● Determine goal context
● What are the constraints?
● What are the non-functional or quality requirements?
  ● Operational requirements
  ● Revisional requirements
    ● correcting functionality
    ● adding to functionality
  ● Transitional requirements (adapting to change)
● Obtain definitions
● Determine acceptance criteria
● Determine priorities

Techniques

● In-person interviews
● Email communication
● Video communication
● Group meetings

Scheduling

All known scheduling availabilities and un-availabilities will be posted to the shared project Google Calendar.

Stakeholder Details

See Stakeholder Communication Spreadsheet.

Appendix IV – Computing Resources Document

Overview

The Pacific Northwest Climate Data Analysis Project will provide a framework for efficient analysis of regional climate modeling data. This document provides a summary of the computing resources necessary to implement the project. The three broad categories of resources are computational resources for the data processing system; data resources for access and storage; and technologies that will be used for the project.

Computational Resources

In order to implement the system, the following resources are required:

● Internet - transport of data; communication; access to further technologies
● UW Computing Cluster - access to original data, even if the purpose is just to copy it
● UWB Distributed Computing Lab - storage space for climate data
● UWB Windows Lab - collaborative programming
● UWB Linux Lab - collaborative programming; system testing
● Private computing resources - independent and collaborative project development
● Portable Media Storage - transport of data, as required

Data Resources

Access to Climate Data

The model data is currently stored at the UW Computing Cluster located on the Seattle campus. It consists of several terabytes (~10) of total storage. Our planned access to this information is as follows:

● guest account on the cluster (to be implemented by Professor Eric Salathe, faculty sponsor)
● ftp account for some sample data on the cluster (in use; simulated data pertaining to January 1972)

The main risk for direct access to this data is that an error in our system could corrupt or destroy it. It is desired that we copy this data to storage on the UW Bothell campus and leave the originals intact.

● storage of copied data at UW Bothell on a multi-hard-drive system, hopefully of 60-80 TB storage, SATA connection (to be implemented by Professor Munehiro Fukuda, primary faculty sponsor, with cooperation from Josh Larios and Western Digital)
● SAS connections offer higher access speed but are prohibitively expensive for the size of storage required

Currently, the sample data consists of several gigabytes.

● local storage by Brett Yasutake, project team member (this will not be feasible once the data reaches the ~1 terabyte level)

System Data Storage

The system utilizes two types of data:

● Climate Model Data - discussed above
● Algorithm Data - user-defined algorithms used for climate data analysis, e.g. heat wave detection, moisture flux calculations, etc.; stored in "algorithm storage"

The system will produce three types of data:

● status information - related to runtime operation of the system; reported and not stored
● results - results of data processing; stored in "results storage"
● provenance - data related to the conduct and validation of the data processing; stored in "provenance storage"

Adapters

Data storage will be linked to the data analysis components of the system through "adapters." If the mechanism of data storage or of analysis changes, the adapter will change, but otherwise changes in one portion of the system should not affect other portions.

Technologies

The following is a list of technologies being employed at various stages of the project:

● Collaboration and Communication
  ● Gmail - asynchronous communication
  ● Google Docs - document repository
  ● Skype - synchronous communication
  ● Trello - project management
● Development
  ● Java
  ● MASS library - parallelization
  ● Netbeans - integrated development environment
  ● Unidata NetCDF Java - netCDF reader
  ● Linux OS
  ● Windows OS
  ● Git w/Google Code - version control and storage
  ● Google Code bug tracker - defect tracking database
● Data Storage
  ● mySQL - proposed data storage for algorithms, results, and provenance
  ● netCDF - climate model data storage

Appendix V – Sample Algorithm Specification

Wind Divergence Analysis

1. For each sample file
   1. read all U10 (x-vector wind speed at 10 meters above surface)
   2. read all V10 (y-vector wind speed at 10 meters above surface)
2. For each geographic cell
   1. If the cell is on the boundary (x or y coordinate is 0 or the maximum value), do nothing.
   2. For non-boundary cells:
      1. Subtract V10 at (x, y + 1) from V10 at (x, y - 1) and assign the difference to V
      2. Subtract U10 at (x + 1, y) from U10 at (x - 1, y) and assign the difference to U
      3. Calculate the divergence magnitude as sqrt(U^2 + V^2)
3. Write results to netCDF files
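A direct, illustrative translation of this specification into Java follows; the [y][x] array layout and the absence of grid-spacing scaling are assumptions of the sketch, not part of the specification.

// Illustrative translation of the specification above; array layout [y][x]
// and the lack of grid-spacing scaling are assumptions of this sketch.
public class WindDivergenceSketch {
    static float[][] divergenceMagnitude(float[][] u10, float[][] v10) {
        int ny = u10.length, nx = u10[0].length;
        float[][] result = new float[ny][nx];             // boundary cells stay 0 (step 2.1)
        for (int y = 1; y < ny - 1; y++) {
            for (int x = 1; x < nx - 1; x++) {
                float v = v10[y - 1][x] - v10[y + 1][x];         // step 2.2.1
                float u = u10[y][x - 1] - u10[y][x + 1];         // step 2.2.2
                result[y][x] = (float) Math.sqrt(u * u + v * v); // step 2.2.3
            }
        }
        return result;
    }
}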

Appendix VI – Sample Code

VIa - Execution Adapter, accessing Analytics Module with Reflection

VIb - Sample Analytics Module: Monthly Maximum One Day Precipitation, showing MASS calls that parallelize the algorithm