Developing a ++ interface for netCDF-4

Shanna-Shaye Forbes

Academic Affiliation, Fall 2006: Senior, The University of Texas at Austin

SOARS® Summer 2006

Science Research Mentor: Russ Rew Writing and Communication Mentors: Catherine Shea

ABSTRACT

The network common data form (netCDF) was created by Unidata at the University Corporation for Atmospheric Research to simplify data access and sharing in the atmospheric science community. Even though the current full release of netCDF known as netCDF-3 has proven to be successful, increasing data complexity and user demands have necessitated a new release of netCDF with improved functionality and the ability to store user defined data types. With this in mind Unidata created netCDF-4 which features all the functionality of netCDF-3 as well as more flexible ways to add data and better support for custom data structures. This implementation adopts simplified aspects of another more complex data model know as HDF5. At present netCDF-4 interfaces exist for C, and Fortran, however none exists for C++. This project’s aim was to design and partially implement a C++ interface for netCDF-4. The netCDF-4 C++ interface was implemented as a thin layer on top of the netCDF-4 C interface; its design however allows all the functionality of a full C++ implementation of the interface. When fully implemented the netCDF-4 C++ interface will allow data providers and developers with a preference for C++ to take advantage of the new features netCDF-4 offers for creating portable, self-describing datasets.

The Significant Opportunities in Atmospheric Research and Science (SOARS) Program is managed by the University Corporation for Atmospheric Research (UCAR) with support from participating universities. SOARS is funded by the National Science Foundation, the National Oceanic and Atmospheric Administration (NOAA) Climate Program Office, the NOAA Oceans and Human Health Initiative, and the Cooperative Institute for Research in Environmental Sciences. SOARS also receives funding from the National Center for Atmospheric Research (NCAR) Biogeosciences Initiative and the NCAR Earth Observing Laboratory. SOARS is a partner project with Research Experience in Solid Earth Science for Student (RESESS). Section 1. INTRODUCTION

For many years no standard data model or data format was available within the atmospheric science community. As a result, data users had to have knowledge of the many different formats used by data providers. In addition, users were also required to have knowledge of the format-specific data analysis or visualization tools for each format. These requirements placed an extra burden on data users and researchers and significantly hindered the dissemination of data within the earth science and atmospheric science community. To simplify the interchange problems caused by numerous formats, the Unidata Program Center at the University Corporation for Atmospheric Research produced the Network Common Data Form (netCDF) and the National Center for Supercomputing Applications at the University of Illinois - Urbana Champagne created the Hierarchical Data Format (HDF) in the late 1980s. Over the years both netCDF and HDF have evolved to fix bugs, to add new features, and to implement new aspects of their data models.

The Network Common Data Form (netCDF) is both a data model and a data format. It is designed to be a machine-independent format for the storage of array based scientific data [1]. The Hierarchical Data Format (HDF) now in a release 5, referred to as HDF5, is a “general purpose and for storing scientific data” [2]. Scientific data providers and developers of scientific data access tools have made use of these data models and formats in archives and tools to suit the needs of their users.

In the article “Scientific Data Management in the Coming Decade,” Gray predicts that within “the next decade a common HDF-like format for all the sciences will emerge” [3]. Unidata’s latest release of netCDF will be netCDF-4. The netCDF-4 software is a merger of netCDF-3 and aspects of the HDF5 data models and data formats. For some users of netCDF-4, an object oriented C++ interface, is desirable because netCDF is used in C, C++, Fortran and various scripting languages. This project designs and begins implementation of the C++ interface for netCDF-4.

In five sections this paper provides background information about netCDF and HDF, a discussion of the interface design, an overview of the interface’s implemented methods, a summary of the impacts of the newly created interface, and a discussion of possible future developments.

Section 2. BACKGROUND

This section discusses netCDF, HDF5, netCDF-4, and provides background information on interface design.

2.1 The Network Common Data Form

The Network Common Data Form (netCDF) began as an attempt to provide a machine independent interface between data providers and data users, and was first implemented at Unidata in 1988. Under collaborative efforts with Joe Fahle at SeaScape, Michael Gough of Apple, and Angel Li at the University of Miami, netCDF was developed and released. Later

SOARS® 2006, Shanna-Shaye Forbes, 2 capabilities and modifications to netCDF included companion utilities ncdump and ncgen. In June 1993 netCDF's first C++ interface was released. Prior to this release netCDF featured Fortran and C interfaces [7].

The Network Common Data Form (netCDF) is “a data model for array-oriented scientific data access, and is a package of freely available software that implements the data model” [4]. It is “widely used in the atmospheric sciences,” and backward compatibility is critical to future modifications of netCDF. In addition to being “a platform-independent… representation for self-describing data,” netCDF also allows appending of data “without copying the data set or redefining its structure” [4].

The current full release of netCDF is netCDF-3. It includes a library written in ANSI C, and features that improved the Fortran interface [4]. Even though netCDF-3 encompassed many improvements over netCDF-2, the interface includes “a limited number of external numeric data types, and restrictions on data sharing” [4]. The netCDF data model shown below in Figure 1 is also limited because it “does not support nested data structures such as trees, nested arrays, or other recursive structures” [4].

A netCDF file generally consists of three kinds of objects: Attributes, Variables, and Dimensions. Attributes can either apply to a file or to a variable. File attributes contain information about all the data in a file, and variable attributes describe the contents of each variable. While variables store data, dimensions are used to describe the shape of variables. Variables with the same dimensions are interpreted as having values on the same grid. An example of a netCDF file's metadata generated with the netCDF utility ncdump is shown in Figure 2. In the netCDF-3 release, a netCDF file could contain any number of dimensions but only one unlimited dimension.

Unlimited dimensions are used with model or observational data. For example, if data is being collected from various stations and there is uncertainty about the number of stations that will be in working order from one hour to the next, it is best to store the data using an unlimited dimension. This way all the data can be stored if 20 stations are functional or if 50 stations are functional. Unlimited dimensions generally represent a growing dimension. As another example, a three dimensional variable temperature could be created based on latitude, longitude, and time. While latitude and longitudes are normal dimensions, an unlimited dimension is desirable for the time dimension to allow temperature readings to append to data without the need to remove the current variable and replace it with a new variable containing all the updated data

SOARS® 2006, Shanna-Shaye Forbes, 3

Figure 1. Graphical Representation of the netCDF-3 data model [10]

Figure 2. Metadata for a netCDF file generated with ncdump

SOARS® 2006, Shanna-Shaye Forbes, 4 2.2 Hierarchical Data Format

In this section, we review the aspects of HDF5 relevant to the netCDF-4 release. HDF5 has been primarily funded by NASA and the Department of Energy (DoE). HDF5 is used more for high-performance computing than netCDF because it supports parallel I/O, and it has a richer and more complex data model than netCDF.

HDF5 is not HDF4 with improvements. HDF5, a simplified version of whose data model is shown in Figure 3, is a complete rewrite of the HDF4 library. Instead of simply making modifications to what already existed, the HDF team started from scratch using all the lessons they learned from previous releases of HDF. “HDF5 files are organized in a hierarchical structure” [2] and support a tree structure collection of items similar to a style file directory system. HDF5 supports more than one unlimited dimension because it uses multidimensional tiling both for data access and compression, supports large files, parallel I/O and also allows arrays to consist of compound data structures, and user defined types [8]. Other features of the HDF5 data model will be discussed in the next section on netCDF-4 since some of netCDF-4’s features are merely simplifications of the HDF5 data model.

Figure 3. Graphical Representation of the HDF5 data model [5]

2.3 The Network Common Data Form-4 (netCDF-4)

NetCDF-4 features improvements on netCDF-3 as well as some basic aspects of the HDF5 data format such as groups, multiple unlimited dimensions, more primitive data types, four user defined data types (enum, opaque, compound, and variable length), and storage of several values of possibly different types in an array through the use of the compound data type. The C++ implementation uses strings instead of character arrays, along with other components of the Standard Template Library. The netCDF-4 data model is presented in Figure 4 and should

SOARS® 2006, Shanna-Shaye Forbes, 5 be compared to the netCDF-3 data model in Figure 1 to observe the differences between netCDF-3 and netCDF-4.

Although the HDF5 data model allows the definition of circular groups, where a subgroup at the grandchild level can be the parent of grandparent, to increase comprehension, the netCDF-4 data model will limit group definition to a tree like structure. The netCDF-4 files will all consist of a root group similar to HDF5. Each group will consist of attributes, dimensions, variables, and other subgroups. Unlike netCDF-3, attributes can now either be of a primitive type or a user defined type and a group can contain variables with multiple unlimited dimensions. In addition a netCDF-4 variable can either be a primitive type or a user-defined type. The netCDF-3 primitive types were expanded in netCDF-4 to include the primitive types shown in red in figure 4, and the user-defined types enum, opaque, compound, and variable length mentioned in the paragraph above were adopted from the HDF5 data model.

Figure 4. Graphical Representation of the netCDF-4 data model [10]

2.4 Pre-Interface Design

Before beginning the design of the C++ interface we wanted to avoid many of the pitfalls that generally accompany new software releases. As a result we relied on the concept known as progressive disclosure described in Ken Arnold's article “ are People, Too.” In his article Ken Arnold presents a well-written argument that suggests improvements to interface design. Using proven research from other disciplines, Arnold focuses on one general area of

SOARS® 2006, Shanna-Shaye Forbes, 6 research known as human factors which is the “the study of design practices for all things people use” instead of focusing on a more specific area of Human Computer Interaction which is a subset of human factors research [6].

Because currently human factor research is “rarely applied in API designs”[6],Arnold predicts that the use of human factors would have a large impact on how an interface is designed, and that incorporating human factor research into interface design can make a significant improvement in productivity and detected bug rates.

Arnold recommends one aspect of human factors known as “progressive disclosure.” Progressive disclosure has tasks or information divided into beginning, intermediate, and expert stages. This division ensures that the beginning user is not overwhelmed by all the technical details an expert would welcome. To make his point, Arnold compares progressive disclosure to driving a car. The beginning user only needs to know how to drive the car and that engine work requires the intervention of an expert. The expert in turn knows how to manipulate different parts of the engine to obtain the desired results [6]. Arnold’s analogy brings the complexity of programming to light. Simply imagine how intimidated you would be if to drive your car you had to know all the intricate details of how the engine works; the feeling is probably the same for a beginning user who is given all the possible options when progressive disclosure is not used.

Arnold also implies that when designing a good user interface one should keep in mind which “users are you designing for... [and] what the users expect [of the interface].” This approach enables the designer to anticipate what the user will expect from a method without requiring that the user read all the documentation. Surprisingly, Arnold points out a simple but important fact that few if any programmers read all of their provided documentation. With this knowledge in hand the designer must simply be aware that “[i]nconsistency makes a system harder to learn and harder to use correctly” [6].

Arnold concludes his article by reminding the reader that “programmers are not only human but also notoriously lazy problem solving humans.” Even if you try to use forcing functions to ensure that no errors are made, programmers will generally combine actions or will find workarounds. As a result, he states that an interface is limited not only by its design, but by a user’s own expectations of what the interface should feature. It is also important to remember that while you may have become an expert after designing your interface it is important you keep in mind as a designer that the user just “wants everything to work”[6].

Section 3. METHODOLOGY

Before beginning the netCDF-4 C++ interface, it was necessary that we determine which aspects of the netCDF-3 C++ interface would be useful in netCDF-4, what other characteristics were necessary to make the interface easy to use and also incorporate the desired aspects of HDF5. To determine the desired aspects of the interface, we sorted through netCDF C++ suggestions and requests submitted to the netCDF newsgroup, read through email suggestions submitted directly to Russ Rew, looked over a netCDF-3 C++ wrapper library created by Fei Lui

SOARS® 2006, Shanna-Shaye Forbes, 7 at NOAA, and then performed a critical analysis of aspects of the C++ standard template library which would prove useful to the latest interface.

Three possible implementations of the netCDF-4 library were considered. The first option was to implement the netCDF-4 interface entirely on top of the current netCDF-4 C interface, the second option to implement it on top of the netCDF-3 C++ interface and the netCDF-4 interface, and the third option to implement it entirely on top of the HDF-5 interface. Though implementing directly on top of the HDF5 C interface would have decreased the number of layers used in Figure 5 below, this option was not chosen because backward compatibility would not have been possible. Instead of creating a netCDF file, putting the C++ interface on top of the HDF5 interface would have created an HDF5 file.

The software architecture diagram pictured in Figure 5 is a visual representation of the layers each version of netCDF encounters. Though the boxes have different sizes and colors those were provided only to make following the versions of netCDF easier. At the third level from the bottom in Figure 5, the HDF5 C API is skewed to the left to show that the HDF5 C API uses both Posix and MPI I/O. It may also be confusing that netCDF-2 sits on top of netCDF-3 and that netCDF-3 sits on top of netCDF-4 but this was done to ensure backwards compatibility. It began when netCDF-3 was developed. To ensure backward compatibility with netCDF-2 Unidata chose to have netCDF-2 functions call netCDF-3 functions. This practice has continued and thus the netCDF-3 functions now call netCDF-4 functions.

Figure 5. netCDF Software Architecture

Since backward compatibility is essential, and based on the precedence of other netCDF C++ interfaces, we decided to take the first option to implement the new C++ interface as a thin layer on top of the current netCDF-4 C interface. This option to build on top of the C interface was selected because it has been used in the past and because an extensive set of tests exists for the C

SOARS® 2006, Shanna-Shaye Forbes, 8 interface. An extensive set of tests exists for the C interface but not for the C++ interface. Additionally, based on the suggestions made on the netCDF newsgroup, and in the wrapper library code produced by Fei Lui, we decided to implement the new interface using C++ generic programming and also to take advantage of powerful C++ tools like namespaces, templates, iterators, and exceptions which have now been standardized in C++ but were not standardized before the release of netCDF-3.

Instead of creating the entire C++ interface from scratch, useful aspects of the netCDF-3 interface were adopted. Although not adopted in their entirety the classes NcAtt which is the representation for an attribute, NcVar which is the representation for a variable, NcFile which is the representation for a file, and NcDim the representation for a dimension were adopted to facilitate an easy transition from the netCDF-3 C++ interface to the netCDF-4 C++ interface. During the project, questions of how to represent user-defined variables arose. We considered two different options. The first was to template the NcVar class to manage both primitive and user-defined variables and the second was to create a class for user-defined variables. After careful consideration we decided to create datatypes instead of variables and have the NcVar class handle both primitive types and user-defined types. We created NcUserDefinedType, NcCompoundType, NcEnumType,NcOpaqueType, and NcVLenType classes to handle the creation of userdefined types. The NcVar class in turn has overloaded functions that take parameters of primitive and user defined types when creating a variable.

In addition to the aforementioned classes a NcException class was created to throw exceptions when an error is returned from the C interface. When necessary, functions in each of the classes mentioned above were templated to make the interface more generic and, as requested via the netCDF mailing group, numerous methods in each of the classes were made virtual, and all the classes, variables, and constants used were placed in a netCDF namespace to decrease collisions with other namespaces. The netCDF-4 interface should serve as a good interface for C++ programmers while existing as a thin layer above the C interface.

Even though a NcValues class was present in netCDF-3, netCDF-4’s NcValues class acts a combination of the original class as well as incorporating aspects of Fei Liu’s wrapper library class array.h. Information is still stored in a C style array; however numerous overloaded operators have been added to the interface to facilitate efficient data access. We also determined that it was essential that an iterator exists for groups, dimensions, attributes and variables, thus a nested iterator class was included in the NcGroup, NcDim,NcAtt, and NcVar classes. The classes in the netCDF-4 C++ interface a shown in Figure 6. Classes in blue represent original classes that have been modified to suit the needs of netCDF-4 and classes in red were created from scratch during this project.

SOARS® 2006, Shanna-Shaye Forbes, 9

Figure 6: Graphical representation of the netCDF-4 classes

To represent variable length user defined variables, and ragged arrays, instead of using the array.h class provided in Fei Liu’s wrapper library, or using a single vector or nested vectors from the standard template library, we decided simply to store the minimal amount of information necessary in each class and depend on the C representation of each of the variables.

After completing the initial design we conducted a Design Review and concluded that netCDF variables, groups, dimensions, and attributes should not exist independently of a netCDF file and thus made all constructors except the file constructor and the user defined type constructors private.

Section 4. RESULTS AND DISCUSSION

Because of the time constraints of the ten-week research project we did not complete the necessary full scale test of the netCDF-4 C++ API or the full implementation of user-defined variables. We performed a small scale test to ensure that all aspects of the API were functional. With the implemented compound user-defined type we wrote a small program to create a netCDF-4 file containing synthetic data. In addition the C++ interface unlike the C interface will not feature parallel I/O given its complexity and the time constraints to the project.

A simple netCDF-4 file was created and it shows that groups, variables, and attributes were successfully created.

SOARS® 2006, Shanna-Shaye Forbes, 10 Section 5. CONCLUSIONS AND FUTURE MODIFICATIONS

Since the small scale test proved to be successful we have assumed that an implementation of the three remaining user-defined types can be similarly implemented and tested according to this proven design.

Future modifications of this project include completing the implementation of the three remaining user defined variable classes, modifying the current implementation which is still referred to as a prototype implementation and implementing it only on top of the netCDF-4 interface to make it ready for release. After these modifications are made the netCDF-4 C++ API will be ready for full release to the netCDF community for larger testing and incorporation into commercial software. It is also important to conduct a survey of the users of the API to determine if the collision conflicts they experienced with portions of the netCDF-3 C++ interface were resolved and if they believe backward compatibility is necessary in the new interface or if a netCDF-3 API should exist separately from the netCDF-4 API.

SOARS® 2006, Shanna-Shaye Forbes, 11 References:

[1] R. Rew, G. P. Davis, “Unidata’s netCDF interface for data access: status and plans,” in the Proceedings of the Thirteenth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, pp. 2-7, Anaheim, CA, February 1997.

[2] The National Center for Supercomputing Applications University of Illinois at Urbana- Champaign, “What is HDF5?,” 2006 April 26, [cited 2006 Jun 21], Available HTTP: http://hdf.ncsa.uiuc.edu/whatishdf5.htm.

[3] J Gray, D. T. Liu, M. Nieto-Santisteban, A. Szalay, D. DeWitt, G. Heber. “Scientific Data Management in the Coming Decade,” CT Watch Quarterly Feb. 2005 pp.17-26

[4] R. Rew, G. P. Davis, “Unidata’s netCDF interface for data access: status and plans,” in the Proceedings of the Thirteenth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, pp. 2-7, Anaheim, CA, February 1997.

[5] R. Rew, E. Hartnett, J. Caron,M. Folk, R. McGrath, and Q. Kozial, “NetCDF-4: A New Data Model, Programming Interface, and Format Using HDF5,”[Online document],2005 Aug 9, [cited 2006 Jun 22], Available HTTP: http://www.unidata.ucar.edu/presentations/Rew/netcdf-hdf- final.ppt

[6] K. Arnold, “Programmers are People, Too,” ACM queue: architecting tomorrow’s computing, vol.3, no5, pp.55-59, June 2005.

[7] Unidata Program Center, “The NetCDF User's Guide” 2006 Jan, [cited 2006 Jun 30], Avaliable HTTP: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf.html

[8] J.M. May, Parallel I/O for High Performance Computing. San Diego, CA: Academic Press 2001.

[9] B. Stroustrup, The C++ Programming Language Special Edition. Indianapolis, Indiana (change to tow letter representation later): Addison Wesley 2001.

[10] R. Rew, E. Hartnett, J. Caron, “NetCDF-4: Software Implementing an Enhanced Data Model for the Geosciences," Online document],2006 Jan 31, [cited 2006 Jun 26], Available HTTP: http://www.unidata.ucar.edu/presentations/Rew/nc4-iips.ppt

[11] Unidata Program Center, “NetCDF Utilities”[Online document],2005 Jan 10, [cited 2006 Jun 30], Available HTTP: http://www.unidata.ucar.edu/software/netcdf/guide_12.html

SOARS® 2006, Shanna-Shaye Forbes, 12