Best Practices for Preparing Environmental Data Sets to Share and Archive
Advertise your data using datacasting tools

To make your data available using standard and open software tools, you should:
- Use standard language and terms to clearly communicate to others that your data are available for reuse and that you expect ethical and appropriate use of your data.
- Use an open-source datacasting service (RSS, GeoRSS, Datacasting RSS, or another type) that enables you to advertise your data and the options for others to obtain access to it.

Description Rationale:
Additional Information:
Examples:
http://www.nsidc.org/libre/
http://datacasting.jpl.nasa.gov/
Tags: access, data services, discover

Assign descriptive file names

File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.
- When choosing a file name, check for any database management system limitations on file name length and use of special characters. In general, lower-case names are less software- and platform-dependent.
- Avoid spaces and special characters in file names, directory paths, and field names; automated processing, URLs, and other systems often use spaces and special characters to parse text strings. Instead, consider using underscores ( _ ) or dashes ( - ) to separate meaningful parts of file names. Avoid $ % ^ & # | : and similar characters.
- If versioning is desired, a date string within the file name is recommended to indicate the version.
- Avoid uninformative file names such as mydata.dat or 1998.dat.

Description Rationale: Clear, descriptive, and unique file names are important when your data file is combined in a directory or on an FTP site with your own other data files or with the data files of other investigators. File names that reflect the contents of the file and uniquely identify the data file enable precise search and discovery of particular files.
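The naming rules above can be sketched as a small Python helper. This is a minimal sketch, not part of any standard tool; the function names clean_file_name and build_file_name are illustrative.

```python
import re

def clean_file_name(name: str) -> str:
    """Lower-case a name, replace spaces and special characters
    ($ % ^ & # | : etc.) with underscores, and collapse repeats."""
    name = name.strip().lower()
    name = re.sub(r"[^\w.\-]+", "_", name)  # keep letters, digits, _ . -
    return re.sub(r"_+", "_", name).strip("_")

def build_file_name(project: str, site: str, year: int,
                    data_type: str, ext: str = "csv") -> str:
    """Assemble a descriptive file name from its meaningful parts."""
    parts = [project, site, str(year), data_type]
    return clean_file_name("_".join(parts)) + "." + ext
```

For example, `clean_file_name("data May2011")` yields `data_may2011`, avoiding the space that would complicate URLs and automated processing.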
Additional Information:
Hook, Les A., Suresh K. Santhana Vannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online (http://daac.ornl.gov/PI/BestPractices-2010.pdf) from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/BestPractices-2010
Borer et al. 2009. Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America 90: 209-214.

Examples:
An example of a good data file name: Sevilleta_LTER_NM_2001_NPP.csv
- Sevilleta_LTER is the project name
- NM is the state abbreviation
- 2001 is the calendar year
- NPP represents Net Primary Productivity data
- csv stands for the file type (ASCII comma-separated values)
Instead of "data May2011", use "data_May2011" or "data-May2011".

Tags: access, describe, discover, format

Backup your data

To avoid accidental loss of data you should:
- Back up your data at regular frequencies:
  - when you complete your data collection activity
  - after you make edits to your data
  - streaming data should be backed up at regularly scheduled points in the collection process
  - high-value data should be backed up daily or more often; automation simplifies frequent backups
- Optimize your backup strategy (e.g., full, incremental, or differential backups) for the data collection process.
- Create, at a minimum, two copies of your data, and place one copy at an "off-site" and "trusted" location:
  - a commercial storage facility
  - a campus file server
  - a cloud file server (e.g., Amazon S3, Carbonite)
- Use a reliable device when making backups:
  - an external USB drive (avoid "light-weight" devices such as floppy disks and USB stick drives, and avoid network drives that are only intermittently accessible)
  - a managed network drive
  - a managed cloud file server (e.g., Amazon S3, Carbonite)
- Ensure backup copies are identical to the original copy:
  - perform differential checks
  - perform "checksum" checks
- Document all procedures to ensure a successful recovery from a backup copy.

Description Rationale: To avoid data loss, and to retain information held in earlier versions, it is valuable to back up your data at frequent intervals and in multiple locations.

Additional Information:
http://en.wikipedia.org/wiki/Backup
http://www.carbonite.com/
http://aws.amazon.com/s3/
http://en.wikipedia.org/wiki/Incremental_backup

Examples:
Tags: access, disaster recovery, preserve

Choose and use standard terminology to enable discovery

Terms and phrases that are used to represent categorical data values or to create content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:
- Identify the relevant descriptive terms used as categorical values in your community prior to the start of the project (e.g., standard terms describing soil horizons, plant taxonomy, sampling methodology, or equipment).
- Identify locations in metadata where standardized terminology should be used, and identify sources for the terms. Terminology should reflect both data type/content and access methods.
- Review existing thesauri, ontologies, and keyword lists before making up new terms. Potential sources include the Semantic Web for Earth and Environmental Terminology (SWEET), the Planetary Ontologies, and the NASA Global Change Master Directory (GCMD).
- Enforce use of standard terminology in your workflow, including:
  - lookup tables in data-entry forms
  - field-level constraints in databases (restrict data import to match accepted domain values)
  - XML validation
  - manual review
- Publish metadata using open standards, for example Z39.50, OGC Catalog Services for the Web (CSW), or Web Accessible Directory (WAD).
- If you must use an unconventional or unique vocabulary, identify it in the metadata and fully define it in the data documentation (attribute name, values, and definitions).
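The "checksum" verification step recommended under "Backup your data" above can be sketched in Python using the standard-library hashlib module. The function names are illustrative, not part of any cited tool.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading in chunks
    so that large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(original: str, backup: str) -> bool:
    """Return True only if the backup copy is byte-identical
    to the original file."""
    return sha256_of(original) == sha256_of(backup)
```

Running `verify_backup` after each copy, and recording the checksums alongside the data, documents that the recovery procedure will reproduce the original files exactly.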
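One enforcement option listed under "Choose and use standard terminology" above, restricting categorical values to an accepted domain, can be sketched as follows. The soil-horizon vocabulary here is a hypothetical stand-in for a community standard such as GCMD or SWEET keywords.

```python
# Hypothetical controlled vocabulary for a categorical column; in practice
# the accepted terms would come from a community standard.
SOIL_HORIZONS = {"O", "A", "E", "B", "C", "R"}

def validate_terms(values, vocabulary):
    """Return the set of values that are not in the controlled
    vocabulary, so they can be corrected before import."""
    return {v for v in values if v not in vocabulary}

rejected = validate_terms(["A", "B", "topsoil"], SOIL_HORIZONS)
# "topsoil" is rejected because it is not an accepted horizon term
```

The same check can be expressed as a lookup table in a data-entry form or a field-level constraint in a database, as the practice recommends.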
Description Rationale: The consistent use of well-defined, referenced terminology to describe data products, their parameters, and access methods improves the ability to discover those products for specific uses and to access them via well-known methods. Determine whether there are controlled vocabulary terms, scientific taxonomies, and ontologies used by your community, and use those when creating metadata for your dataset.

Additional Information:
Controlled vocabulary: http://www.nlm.nih.gov/mesh/meshhome.html
SWEET web site: http://wiki.esipfed.org/index.php/SWEET
Biological ontologies: http://www.obofoundry.org/
Taxonomy: http://www.biodiversitylibrary.org/

Examples:
Ecological Informatics, Volume 2, Issue 3, October 2007, Pages 279-296. Meta-information systems and ontologies. A Special Feature from the 5th International Conference on Ecological Informatics (ISEI5), Santa Barbara, CA, Dec. 4-7, 2006 - Novel Concepts of Ecological Data Management (Special Issue).
Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. 2010. ANSI/NISO Z39.19. (https://www.niso.org/publications/ansiniso-z3919-2005-r2010)
Olsen, L.M., G. Major, K. Shein, J. Scialdone, R. Vogel, S. Leicester, H. Weir, S. Ritz, T. Stevens, M. Meaux, C. Solomon, R. Bilodeau, M. Holland, T. Northcutt, and R.A. Restrepo. 2007. NASA/Global Change Master Directory (GCMD) Earth Science Keywords. Version 6.0.0.0.0. Citing GCMD Keywords (https://earthdata.nasa.gov/about/gcmd/global-change-master-directory-gcm...)

Tags:

Communicate data quality

Information about quality control and quality assurance is an important component of the metadata:
- Qualify (flag) data that have been identified as questionable by including a flagging column next to the column of data values. The two columns should be properly associated through a naming convention such as Temperature, flag_Temperature.
- Describe the quality control methods applied and their assumptions in the metadata.
- Describe any software used when performing the quality analysis, including code where practical.
- Include in the metadata who did the quality control analysis, when it was done, and what changes were made to the dataset.
- Describe standards or test data used for the quality analysis; for instance, include, when practical, the data used to make a calibration curve.
- If data with qualifier flags are summarized to create a derived data set, include the percentage of flagged data and the percentage of missing data in the metadata of the derived data file. High-frequency observations are often downsampled, and it is critical to know how much of the primary data were rejected.

Description Rationale: Data quality and any methods used for quality control should be communicated so that others can assess the data independently.

Additional Information:
Hook, L.A., Beaty, T.W., Santhana-Vannan, S., Baskaran, L., and Cook, R.B. 2007. Best Practices for Preparing Environmental Data Sets to Share and Archive. Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. (daac.ornl.gov/PI/bestprac.html)
Sheldon, W., Henshaw, D., and Ramsey, K. 2007. Final Report: Workshop to Define Quality Management Standards for Data Completeness in Derived Data Products. Long Term Ecological Research Network Document Archive, University of New Mexico, Albuquerque, NM.

Examples:
Tags: assure, flag, quality

Confirm a match between data and their description in metadata

To assure that metadata correctly describe what is actually in a data file, visual inspection or analysis should be done by someone not otherwise familiar with the data and their format. This will assure that the metadata are sufficient to describe the data. For
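The flag-column convention and derived-data summary described under "Communicate data quality" above can be sketched as follows. The column names and flag values are illustrative, not a prescribed standard.

```python
# Paired data and flag columns, associated by the naming convention
# Temperature / flag_Temperature (values here are made up for illustration).
temperature      = [21.4, 22.0, 55.9, 21.8, None]
flag_temperature = ["ok", "ok", "questionable", "ok", "missing"]

n = len(temperature)
pct_flagged = 100.0 * sum(f == "questionable" for f in flag_temperature) / n
pct_missing = 100.0 * sum(f == "missing" for f in flag_temperature) / n

# Report these percentages in the metadata of any derived (e.g., daily mean)
# data file so users know how much of the primary data was rejected.
print(f"flagged: {pct_flagged:.0f}%, missing: {pct_missing:.0f}%")
```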