Qt547296qf.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
UC Irvine UC Irvine Previously Published Works Title Analysis of self-describing gridded geoscience data with netCDF Operators (NCO) Permalink https://escholarship.org/uc/item/547296qf Journal Environmental Modelling and Software, 23(10-11) ISSN 1364-8152 Author Zender, CS Publication Date 2008-10-01 DOI 10.1016/j.envsoft.2008.03.004 License https://creativecommons.org/licenses/by/4.0/ 4.0 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Environmental Modelling & Software 23 (2008) 1338–1342 Contents lists available at ScienceDirect Environmental Modelling & Software journal homepage: www.elsevier.com/locate/envsoft Short communication Analysis of self-describing gridded geoscience data with netCDF Operators (NCO) Charles S. Zender* Department of Earth System Science, University of California, Irvine, CA 92697-3100, USA article info abstract Article history: The netCDF Operator (NCO) software facilitates manipulation and analysis of gridded geoscience data Received 22 August 2007 stored in the self-describing netCDF format. NCO is optimized to efficiently analyze large multi-di- Received in revised form 24 February 2008 mensional data sets spanning many files. Researchers and data centers often use NCO to analyze and Accepted 14 March 2008 serve observed and modeled geoscience data including satellite observations and weather, air quality, Available online 28 April 2008 and climate forecasts. NCO’s functionality includes shared memory threading, a message-passing in- terface, network transparency, and an interpreted language parser. NCO treats data files as a high level Keywords: data type whose contents may be simultaneously manipulated by a single command. Institutions and Geoscience data portals often use NCO for middleware to hyperslab and aggregate data set requests, while scientific Time-series Data analysis researchers use NCO to perform three general functions: arithmetic operations, data permutation and Arithmetic compression, and metadata editing. We describe NCO’s design philosophy and primary features, illus- Metadata trate techniques to solve common geoscience and environmental data analysis problems, and suggest Model design ways to design gridded data sets that can ease their subsequent analysis. netCDF Ó 2008 Elsevier Ltd. All rights reserved. Software availability The netCDF Operators have evolved over the past decade to serve research the needs of individual researchers and data centers Name of software: netCDF Operators (NCO) for fast, flexible tools to help manage netCDF-format data sets. The Developer: Charles S. Zender NCO User’s Guide (Zender, 2007) documents NCO’s functionality Operating system: All and calling conventions. Zender and Mangalam (2007) describe the Programming languages: C/C99/Cþþ core NCO arithmetic algorithms and their theoretical and measured Availability and cost: Freely available at http://nco.sf.net scaling with data set size and structure. This paper describes NCO’s design philosophy and primary features, illustrates techniques to solve common geoscience and environmental data analysis prob- 1. Introduction lems, and suggests ways to design gridded data sets that can ease their subsequent analysis. Gridded geoscience model and sensor data sets present an in- We will demonstrate the NCO paradigm and features by ap- teresting set of challenges for researchers and the data portals that plying them to frequently occurring geoscience data reduction serve them (Foster et al., 2002). Many geoscience disciplines have problems taken from the field of climate data analysis. The reader transitioned or are transitioning from data-poor and simulation- will see that these problems are generic to disciplines where large poor to data-rich and simulation-rich (NRC, 2001). A software gridded data sets are regularly produced and analyzed. Modern ecosystem has evolved to help researchers exploit this transition weather, climate, and remote sensing research often require iden- with fast data discovery, aggregation, analysis, and dissemination tical analyses of hundreds of variables in thousands of files. Tradi- techniques (e.g., Domenico et al., 2002; Cornillon et al., 2003). In tional analysis approaches that use low-level, compiled languages this ecosystem are the netCDF Operators (NCO) – software for and most high level, interpreted languages fail to scale well to this manipulation and analysis of gridded geoscience data stored in the problem space (Wang et al., 2007). Re-coding compiled or inter- self-describing netCDF format. NCO is used in several niches in preted data analysis scripts to act on new variables and new data geoscience data analysis workflow (Woolf et al., 2003), because sets is tedious and non-productive when it requires, for example, its functionality is independent of and complementary to data manually changing variable names and loop counters even when discovery, aggregation, and dissemination. the underlying analysis (such as averaging) does not change. NCO helps solve this problem by using the self-describing ca- pability of the netCDF data format (Rew and Davis, 1990) and POSIX * Tel.: þ1 949 824 2987; fax: þ1 949 824 3874. shells (Newham and Rosenblatt, 1998) to define a specific analysis E-mail address: [email protected] of a generic type without user intervention. This flexibility is 1364-8152/$ – see front matter Ó 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.envsoft.2008.03.004 C.S. Zender / Environmental Modelling & Software 23 (2008) 1338–1342 1339 important to geoscience researchers who often analyze and inter- multi-file sequences, the metadata in the first file, along with compare gridded data sets in an open-ended fashion, creating a list of the other files, adequately preserves the processing unique analysis workflows through trial and error. For the same history. By convention, NCO keeps this information in the reasons, many data portals use NCO to fulfill the unpredictable history attribute (Rew et al., 2005). hyperslab requests issued by users on their WWW front-ends, e.g., 4. There is value in maintaining the distinctions and associations the NCAR Community Data Portal (CDP; https://cdp.ucar.edu), and between dimensions, coordinates, and variables (Rew and Davis, the NOAA Climate Diagnostics Center (CDC; http://www.cdc. 1990) during data analysis. Unless otherwise specified, NCO noaa.gov/PublicData). NCO is middleware in that it processes data automatically attaches coordinate data (i.e., dimension values) sets in netCDF format, generated by models or retrieval procedures, to variables it transfers. to new netCDF data sets, more suitable for graphical display, dis- 5. Tools should treat data as generically as possible, and impose semination, or numerical analysis. no software limitations on data dimensionality, size, type, or Geoscience researchers use many toolkits besides NCO to ana- ordering. lyze large volumes of gridded data. These include the Climate Data Analysis Tools (CDAT) (Fiorino and Williams, 2002), the Climate This design philosophy allows users to remain relatively igno- Data Operators (CDO; http://www.mpimet.mpg.de/fileadmin/soft- rant to details of file and variable names, field geometry, and NCO ware/cdo), the Grid Analysis and Display System (GrADS; http:// itself. www.iges.org/grads/grads.html), the Interactive Data Language (IDL; http://www.ittvis.com/idl), MATLAB (http://www.math- 3. Operators works.com), and the NCAR Command Language (NCL; http:// www.ncl.ucar.edu). Of these toolkits, CDO is closest to NCO in that NCO partially fulfills the netCDF designers’ original vision for both use command line operators constructed to perform chainable a follow-on set of generic data operators (Rew and Davis, 1990). operations like traditional UNIX filters. Unlike NCO and CDO, the Presently NCO includes 12 utilities built from a common library CDAT, GrADS, IDL, MATLAB, and NCL toolkits support comprehen- (Table 1). sive integrated visualization capabilities, but their design is not Operator names are acronyms for their functionality, prefixed optimized for batch-driven operations on large number of files. with ‘‘nc’’ to indicate their relationship to netCDF. The 12 operators typically read netCDF files as input, perform some manipulations, 2. Design philosophy then write netCDF files as output. In this sense the operators are filters, or middleware. The NCO User’s Guide (Zender, 2007) doc- Traditional geoscience data processing works with an intra-file uments the functionality and calling conventions for all operators. paradigm where users open one or a few files to read and manip- The primary purpose of the arithmetic operators is to alter ulate one or a few variables at a time. The intra-file paradigm works existing or create new data. The other operators, called metadata well in cases where all the pertinent data reside in a few files, and operators, manipulate metadata or re-arrange (but do not alter) the processing of each variable is unique and requires hand-coding. data. The arithmetic operators can be quite computationally In large geoscience applications data storage requirements may intensive, in contrast to the metadata operators which are mostly I/ dictate that relevant data be spread over multiple files. Level one O-dominated. The amount of data processed varies strongly by satellite data, for example, are often stored in a file-per-day or file- operator type. The multi-file operators (MFOs) are the most data- per-orbit format. Data produced by geophysical time-stepping intensive. Often they are applied to entire