Developing a C++ Interface for Netcdf-4
Total Page:16
File Type:pdf, Size:1020Kb
Developing a C++ interface for netCDF-4 Shanna-Shaye Forbes Academic Affiliation, Fall 2006: Senior, The University of Texas at Austin SOARS® Summer 2006 Science Research Mentor: Russ Rew Writing and Communication Mentors: Catherine Shea ABSTRACT The network common data form (netCDF) was created by Unidata at the University Corporation for Atmospheric Research to simplify data access and sharing in the atmospheric science community. Even though the current full release of netCDF known as netCDF-3 has proven to be successful, increasing data complexity and user demands have necessitated a new release of netCDF with improved functionality and the ability to store user defined data types. With this in mind Unidata created netCDF-4 which features all the functionality of netCDF-3 as well as more flexible ways to add data and better support for custom data structures. This implementation adopts simplified aspects of another more complex data model know as HDF5. At present netCDF-4 interfaces exist for C, and Fortran, however none exists for C++. This project’s aim was to design and partially implement a C++ interface for netCDF-4. The netCDF-4 C++ interface was implemented as a thin layer on top of the netCDF-4 C interface; its design however allows all the functionality of a full C++ implementation of the interface. When fully implemented the netCDF-4 C++ interface will allow data providers and developers with a preference for C++ to take advantage of the new features netCDF-4 offers for creating portable, self-describing datasets. The Significant Opportunities in Atmospheric Research and Science (SOARS) Program is managed by the University Corporation for Atmospheric Research (UCAR) with support from participating universities. SOARS is funded by the National Science Foundation, the National Oceanic and Atmospheric Administration (NOAA) Climate Program Office, the NOAA Oceans and Human Health Initiative, and the Cooperative Institute for Research in Environmental Sciences. SOARS also receives funding from the National Center for Atmospheric Research (NCAR) Biogeosciences Initiative and the NCAR Earth Observing Laboratory. SOARS is a partner project with Research Experience in Solid Earth Science for Student (RESESS). Section 1. INTRODUCTION For many years no standard data model or data format was available within the atmospheric science community. As a result, data users had to have knowledge of the many different formats used by data providers. In addition, users were also required to have knowledge of the format-specific data analysis or visualization tools for each format. These requirements placed an extra burden on data users and researchers and significantly hindered the dissemination of data within the earth science and atmospheric science community. To simplify the interchange problems caused by numerous formats, the Unidata Program Center at the University Corporation for Atmospheric Research produced the Network Common Data Form (netCDF) and the National Center for Supercomputing Applications at the University of Illinois - Urbana Champagne created the Hierarchical Data Format (HDF) in the late 1980s. Over the years both netCDF and HDF have evolved to fix bugs, to add new features, and to implement new aspects of their data models. The Network Common Data Form (netCDF) is both a data model and a data format. It is designed to be a machine-independent format for the storage of array based scientific data [1]. The Hierarchical Data Format (HDF) now in a release 5, referred to as HDF5, is a “general purpose library and file format for storing scientific data” [2]. Scientific data providers and developers of scientific data access tools have made use of these data models and formats in archives and tools to suit the needs of their users. In the article “Scientific Data Management in the Coming Decade,” Gray predicts that within “the next decade a common HDF-like format for all the sciences will emerge” [3]. Unidata’s latest release of netCDF will be netCDF-4. The netCDF-4 software is a merger of netCDF-3 and aspects of the HDF5 data models and data formats. For some users of netCDF-4, an object oriented C++ interface, is desirable because netCDF is used in C, C++, Fortran and various scripting languages. This project designs and begins implementation of the C++ interface for netCDF-4. In five sections this paper provides background information about netCDF and HDF, a discussion of the interface design, an overview of the interface’s implemented methods, a summary of the impacts of the newly created interface, and a discussion of possible future developments. Section 2. BACKGROUND This section discusses netCDF, HDF5, netCDF-4, and provides background information on interface design. 2.1 The Network Common Data Form The Network Common Data Form (netCDF) began as an attempt to provide a machine independent interface between data providers and data users, and was first implemented at Unidata in 1988. Under collaborative efforts with Joe Fahle at SeaScape, Michael Gough of Apple, and Angel Li at the University of Miami, netCDF was developed and released. Later SOARS® 2006, Shanna-Shaye Forbes, 2 capabilities and modifications to netCDF included companion utilities ncdump and ncgen. In June 1993 netCDF's first C++ interface was released. Prior to this release netCDF featured Fortran and C interfaces [7]. The Network Common Data Form (netCDF) is “a data model for array-oriented scientific data access, and is a package of freely available software that implements the data model” [4]. It is “widely used in the atmospheric sciences,” and backward compatibility is critical to future modifications of netCDF. In addition to being “a platform-independent… representation for self-describing data,” netCDF also allows appending of data “without copying the data set or redefining its structure” [4]. The current full release of netCDF is netCDF-3. It includes a library written in ANSI C, and features that improved the Fortran interface [4]. Even though netCDF-3 encompassed many improvements over netCDF-2, the interface includes “a limited number of external numeric data types, and restrictions on data sharing” [4]. The netCDF data model shown below in Figure 1 is also limited because it “does not support nested data structures such as trees, nested arrays, or other recursive structures” [4]. A netCDF file generally consists of three kinds of objects: Attributes, Variables, and Dimensions. Attributes can either apply to a file or to a variable. File attributes contain information about all the data in a file, and variable attributes describe the contents of each variable. While variables store data, dimensions are used to describe the shape of variables. Variables with the same dimensions are interpreted as having values on the same grid. An example of a netCDF file's metadata generated with the netCDF utility ncdump is shown in Figure 2. In the netCDF-3 release, a netCDF file could contain any number of dimensions but only one unlimited dimension. Unlimited dimensions are used with model or observational data. For example, if data is being collected from various stations and there is uncertainty about the number of stations that will be in working order from one hour to the next, it is best to store the data using an unlimited dimension. This way all the data can be stored if 20 stations are functional or if 50 stations are functional. Unlimited dimensions generally represent a growing dimension. As another example, a three dimensional variable temperature could be created based on latitude, longitude, and time. While latitude and longitudes are normal dimensions, an unlimited dimension is desirable for the time dimension to allow temperature readings to append to data without the need to remove the current variable and replace it with a new variable containing all the updated data SOARS® 2006, Shanna-Shaye Forbes, 3 Figure 1. Graphical Representation of the netCDF-3 data model [10] Figure 2. Metadata for a netCDF file generated with ncdump SOARS® 2006, Shanna-Shaye Forbes, 4 2.2 Hierarchical Data Format In this section, we review the aspects of HDF5 relevant to the netCDF-4 release. HDF5 has been primarily funded by NASA and the Department of Energy (DoE). HDF5 is used more for high-performance computing than netCDF because it supports parallel I/O, and it has a richer and more complex data model than netCDF. HDF5 is not HDF4 with improvements. HDF5, a simplified version of whose data model is shown in Figure 3, is a complete rewrite of the HDF4 library. Instead of simply making modifications to what already existed, the HDF team started from scratch using all the lessons they learned from previous releases of HDF. “HDF5 files are organized in a hierarchical structure” [2] and support a tree structure collection of items similar to a UNIX style file directory system. HDF5 supports more than one unlimited dimension because it uses multidimensional tiling both for data access and compression, supports large files, parallel I/O and also allows arrays to consist of compound data structures, and user defined types [8]. Other features of the HDF5 data model will be discussed in the next section on netCDF-4 since some of netCDF-4’s features are merely simplifications of the HDF5 data model. Figure 3. Graphical Representation of the HDF5 data model [5] 2.3 The Network Common Data Form-4 (netCDF-4) NetCDF-4 features improvements on netCDF-3 as well as some basic aspects of the HDF5 data format such as groups, multiple unlimited dimensions, more primitive data types, four user defined data types (enum, opaque, compound, and variable length), and storage of several values of possibly different types in an array through the use of the compound data type. The C++ implementation uses strings instead of character arrays, along with other components of the Standard Template Library.