I’m attaching an abridged version of a letter I received from Dr. Jeremy Wojdak that highlights the need for accurate data collection and helps you understand why I’ve sent you a data file (in Excel) and a metadata file (in Microsoft Word).

Here it is.

J. Wojdak, August 28, 2005

Need for Metadata and Data Standards
Radford University Stormwater Wetland Project

We suggested to NSF that a key benefit of our proposed project was that it would be interdisciplinary and collaborative; students from traditionally very distinct disciplines would share data and knowledge. However, successful data sharing is dependent on very clear and accurate communication of not only the actual data collected by different courses, but of the metadata that corresponds to the data. Metadata is the information typically found scribbled in margins of field or lab notebooks – the when, where, how, what, and why that is necessary to understand and interpret the actual data. Metadata typically includes methods used to collect the data, the date/time/location of collection, who owns the rights to the data, units of measurement, degree of precision of measurements, and who should be contacted with questions regarding the study.

Have you ever opened an old Excel file like this:

plot  site  mass  height
a     1     9.66  40.93
a     2     8.36  20.72
a     3     6.71   2.28
b     1     1.43  22.77
b     2     5.09  27.28
b     3     5.39  42.98
c     1     4.15  40.71
c     2     1.99  17.41

or worse yet like this:

x  11.1  0.08
c  11.2  0.35
v  11.3  2.25
x  21.1  0.38
c  21.2  3.04
v  21.2  4.14
x  11.1  3.91
c  11.2  2.82
v  11.3  3.17

and tried to figure out the units of measurement, methods used, date and time of collection... so you could use the data?

If you have, you already know (deep down) why detailed, structured metadata are important. Especially today as nearly all data are stored electronically, the lack of corresponding electronic metadata cripples efforts to share data. The National Science Foundation is in the process of requiring investigators to distribute and share data more freely – there has been a realization that the Foundation’s money will have greater impact if data are shared and reused. High quality data that are well documented can be used many times for different purposes – data without documentation will likely “die” within several years after collection and will benefit only the original investigator. Another way to consider the issue is that proper documentation will increase one’s own scientific contribution and legacy. Moreover, large-scale synthetic questions will likely only be tackled through data sharing because the scope of the questions precludes any individual or group from collecting sufficient data. Along with NSF’s push for data sharing will come requirements for good metadata, and thus training ourselves and our students in good data practices is valuable.

I am the “database manager” for this project, which means I will be responsible for organizing and locally archiving the data and metadata collected in participating courses, publishing them to the web, and possibly publishing them to a permanent scientific data archive (guaranteed stable web access into perpetuity). To make this task possible, and to allow students from different courses and disciplines to collaborate via data sharing, I will need to receive high quality data and metadata files. Much of the metadata documentation will already be done in preparing lab handouts (that describe methods, for instance). However, descriptions of projects may need to be broadened and freed of subject-specific jargon so a general scientific audience can understand why and how the data were collected.

Below are some guidelines to ensure high-quality data that will be useful to the maximum number of students at RU and elsewhere:

1.) Instill in the students collecting the data that they are responsible for part of a much larger project, and that mistakes hurt themselves, classmates, and students in other courses. Mistakes are inevitable, but can be minimized.

2.) Check for data entry errors, then check for “problem data” using simple descriptive statistics. For example, a quick look at the range of values in a data column might indicate a typing error or a unit of measurement error. If data are supposed to vary from 0.0 to 50.0 cm and the range of data in that column goes from 0 to 137, that might indicate that 13.7 cm was typed in as 137 or that someone measured in mm.

3.) Look over existing files in the database and use the same data names, type, and units for the same kinds of data, where feasible. For example, if you measure the concentration of dissolved oxygen in standing water in various locations, and other classes have done the same thing at other times, naming your data columns the same thing and reporting the data in the same units (% saturation or mg/L or ppm) would be really helpful. It makes it unambiguous that these data are comparable and can be usefully merged.

4.) Provide detailed and jargon-free descriptions of the methods used in data collection. Specifically include these descriptors:
    a. Time, date (including year), spatial location (see appendix)
    b. Principal investigator (instructor), course, module name
    c. Description of each column in data file
        i. Define the data in each column, e.g., “above-ground plant biomass”, “extractable organic compounds”, “water depth”
        ii. Column name, data type (see appendix), precision, units of measurement; if a data column contains codes (treatments, plots, etc.), describe the coding system
        iii. Describe the actual processes and methods used in data collection
    d. Data permissions: can anyone use the data for any purpose? Should people contact the PI (instructor) before using data?

5.) Please follow the submission guidelines (see appendix) when forwarding your data for inclusion in the database.
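The range check described in guideline 2 can be sketched in a few lines of Python. This is only an illustration, not part of the original letter: the function name flag_out_of_range, the column of depths, and the 0.0 to 50.0 cm limits are all hypothetical, chosen to mirror the 13.7 versus 137 example above.

```python
def flag_out_of_range(values, low, high):
    """Return (row_index, value) pairs for entries outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical water-depth column (cm), made up for illustration only.
# The value 137 stands in for a 13.7 typed without its decimal point.
depths_cm = [12.4, 0.0, 48.9, 137, 13.2]

# A quick look at the range is often enough to spot a problem.
print("min:", min(depths_cm), "max:", max(depths_cm))

# List the suspicious rows so they can be checked against the field notebook.
for row, value in flag_out_of_range(depths_cm, 0.0, 50.0):
    print(f"row {row}: {value} is outside the expected 0.0 to 50.0 cm")
```

A check like this only flags candidates. Whether a flagged value is a typing error, a unit error, or a real measurement still has to be decided by going back to the original notes.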

Appendix. Data conventions.

General Guidelines

• Please submit data (except GIS or map files) as Excel files or as tab-delimited text files. Email files to [email protected]; if files are very large, burn them to a CD.
• Data files should be named as such: topic-date-investigator.xls or topic-date-investigator.txt (e.g., waterchemistry-12312005-wojdak.xls)
• Please submit metadata associated with a data file as a similarly named Word document: topic-date-investigator.doc (e.g., waterchemistry-12312005-wojdak.doc)
• Data should be in columns, with the first row being column headers.
    o Do not put text comments or notes in columns with data (text comments shouldn’t be in the data file at all, but instead in the metadata).
• Dates should be reported as mm/dd/year.
• Times should be reported as 00:00 to 24:00 (as in 2:30 am = 2:30 and 2:30 pm = 14:30).
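For anyone preparing files with a script rather than by hand, the naming, date, and time conventions above might be applied as in the following Python sketch. The helper names (data_file_name, format_date, format_time) are hypothetical, and the sketch assumes that the date in a file name is written as mmddyyyy with no separators, as in the waterchemistry-12312005-wojdak.xls example.

```python
from datetime import date, time


def data_file_name(topic, collected, investigator, extension="xls"):
    """Build topic-date-investigator.ext, with the date written as mmddyyyy."""
    return f"{topic}-{collected.strftime('%m%d%Y')}-{investigator}.{extension}"


def format_date(d):
    """Report a date as mm/dd/year."""
    return d.strftime("%m/%d/%Y")


def format_time(t):
    """Report a time on the 24-hour clock (2:30 pm becomes 14:30)."""
    return f"{t.hour}:{t.minute:02d}"


# Data file and its matching metadata document share the same base name.
collected = date(2005, 12, 31)
print(data_file_name("waterchemistry", collected, "wojdak"))
print(data_file_name("waterchemistry", collected, "wojdak", extension="doc"))
print(format_date(date(2002, 11, 30)), format_time(time(14, 30)))
```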

Data types

• Unordered or nominal: unordered categories or text (e.g., Male, Female)
• Ordered or ordinal: ordered categories (e.g., low, medium, high)
• Relative or interval: values from a scale with equidistant points (e.g., 12.2 meters)
• Absolute or ratio: measurement scale with a meaningful zero point (e.g., 273 Kelvin)
• Date-Time: (e.g., 2:30 11/30/2002)

THANKS.

I (Fred) am back. I thought that letter would help you understand, first, why I am sending you two files; second, why it is important to be careful; and third, why the files are structured the way they are.

Here is your assignment for the macroinvertebrates lab.

Answer the following questions.

1. Do richness and abundance of macroinvertebrates vary in relation to the three categories of moisture levels that we established (wet, muddy and dry)?

2. Do richness and abundance of macroinvertebrates vary in relation to the moisture levels (measured in percent) I measured two days after the experiment?

3. Do richness and abundance of macroinvertebrates vary in relation to the temperature readings you measured when you were collecting data?

4. Are there differences in species composition that you observe in relation to our moisture categories, or the measured moisture levels?

I remind you that my website has detailed instructions on how to do ANOVAs and correlations/regressions. I’m also happy to help anyone with anything that doesn’t require lifting. So talk to me if you’re confused, concerned or excited.

Make sure that your answers include means, F-values, p-values, R² values, etc. where appropriate. You will probably want to include graphs or tables to summarize the results for each answer. Please don’t give me any computer output (except for graphs). I will deduct from your grade on this assignment if you give me a bunch of computer output as part of your answer (if you want, you can attach it as an appendix).
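If it helps to see where an F-value and an R² value come from, here is a minimal Python sketch of the arithmetic. It is not the procedure described on my website, just the textbook formulas written out: a one-way ANOVA F statistic for a categorical predictor (such as wet, muddy, dry) and R² for a simple linear relationship (such as richness against percent moisture). All of the numbers are invented for illustration and are not from our data file, and the function names are hypothetical. Getting a p-value from F requires the F distribution, which your statistics software will supply.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def r_squared(x, y):
    """Return R-squared for a simple linear regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Invented richness counts per sample, for illustration only.
wet, muddy, dry = [6, 8, 7, 9], [5, 4, 6, 5], [2, 3, 1, 2]
f_value, df1, df2 = one_way_anova_f([wet, muddy, dry])
print(f"group means: {mean(wet)}, {mean(muddy)}, {mean(dry)}")
print(f"F({df1}, {df2}) = {f_value:.2f}")

# Invented percent-moisture and richness pairs, for illustration only.
moisture = [10, 20, 30, 40, 50]
richness = [2, 3, 5, 6, 9]
print(f"R-squared = {r_squared(moisture, richness):.3f}")
```

In your write-up, report the summary numbers (means, F with its degrees of freedom, p, R²) in a sentence or a small table rather than pasting output like this.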

This is not a formal report, so just number each section as 1, 2, 3 or 4.

Good luck!