Collaborative Session Notes 2017 NOAA EDM Workshop

Session 3A: NOAA PARR Implementation

Highlights:
● NOAA NCEI supports PARR implementation for data
  ○ Data discovery, access, and archiving
  ○ Tools, standards, templates, and value-added stewardship
● NOAA Institutional Repository supports PARR for publications
  ○ https://repository.library.noaa.gov/
  ○ Accepting submissions via web or email [email protected]
  ○ Includes minting DOIs for NOAA tech memos
    ■ Open House at SSMC3 tomorrow at 11am
● NOAA science programs taking various approaches to PARR
  ○ NCCOS: focus on archiving data and web services
  ○ PMEL: new data integration strategy
  ○ NWFSC: enterprise system connects data producers to users
  ○ All are developing guidance, templates, and tracking - need to work together to learn from each other!
● We ALL need to be working together!!

Notes:

Jessica Morgan, Chair - aim is to foster a discussion to hear about what different people in NOAA are doing, and what capabilities exist (NCEI, Library)

Katharine Weathers - PARR: What you need to know. If you are federally funded, YES! Your data should be visible, accessible, and understandable to the public.

NCEI is trying to help people meet PARR - this includes a website for understanding PARR (see presentation slides). There are 3 steps, and NCEI has tools and capabilities to help you meet them.

Step 1: Data discoverability is content-driven. NCEI has tools to make it easier to meet these requirements - Advanced Tracking and Resource tool for Archive Collections (ATRAC), Send2NCEI (S2N), NCEI netCDF templates, etc.

ESRI Geoportal Server then exposes the information in the metadata so the data is discoverable.
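As an illustration of the kind of file the NCEI netCDF templates describe, here is a minimal sketch, assuming Python with the netCDF4 library; the attribute values, variable names, and units are placeholders, not content taken from the templates or slides.

```python
# Minimal sketch: write a netCDF file carrying a few CF/ACDD-style global
# attributes of the kind the NCEI netCDF templates call for.
# All attribute values, variable names, and units are illustrative placeholders.
from netCDF4 import Dataset
import numpy as np

with Dataset("example_sst.nc", "w", format="NETCDF4") as nc:
    # Discovery-level global attributes (a small subset; the templates define the full list)
    nc.title = "Example sea surface temperature time series"
    nc.summary = "Illustrative dataset showing template-style metadata."
    nc.institution = "NOAA (placeholder)"
    nc.Conventions = "CF-1.6, ACDD-1.3"

    # A CF-style time coordinate and one data variable
    nc.createDimension("time", 3)
    time = nc.createVariable("time", "f8", ("time",))
    time.units = "seconds since 1970-01-01T00:00:00Z"
    time.standard_name = "time"

    sst = nc.createVariable("sea_surface_temperature", "f4", ("time",))
    sst.units = "degree_C"
    sst.standard_name = "sea_surface_temperature"

    time[:] = np.arange(3) * 3600.0
    sst[:] = [15.2, 15.4, 15.1]
```

A file carrying this kind of structured metadata is what downstream catalog tools can then harvest for discovery.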

Step 2: Data access

Step 3: Can they understand it? Need a complete metadata record for this - what, why, who, and how the data was collected.

NCEI has tiers of stewardship (a complete description of the tiers is in the slides):
Tier 1: Long-term preservation and basic access - the basic tier; meets all PARR requirements
Tier 2: Enhanced access
Tier 3: Scientific Improvements
Tier 4: Derived products
Tier 5: Authoritative records
Tier 6: National Services and International Leadership

Q: What is the difference between S2N and ATRAC? A: Similar, but it depends on the size of the data and whether it is a one-off. ATRAC is used for larger (>20GB) data collections and supports developing an automated acquisition and transfer process. S2N is optimized for smaller, one-at-a-time data collections and supports creating basic discovery and use metadata and attaching data file(s) in a submission package.

Jessica Morgan - NCCOS Implementing PARR. How do we actually make PARR implementation work? As a “Science Center” for the Line Office, NCCOS puts lots of federal money into science - how do we implement policies to make this most useful?

Will discuss:
- New NCCOS Scientific Data Policy
- Current state
- How they are actually achieving access to NCCOS research results

Scientific Data Policy - meets PARR and all EDMC guidelines, applies to everything NCCOS does. Being pragmatic - what’s the minimum that needs to be done?
- Archive data at NCEI
- Catalog publications
- Make clear the roles and responsibilities

What does it look like now? (see slide!) NCCOS Project Life Cycle - a combination of “Before PARR” existing systems and what’s been done since:
- Already had a website, already using the MERMAID tool for metadata; publications had their own ways of tracking, links to the website, etc., but this hadn’t been done systematically until now.
- About 85 datasets in the NODC, but they could have thousands of publications - not sure where they stand with past data.
- Started a partnership with NCEI to see where they are; constructing a pipeline for data (through S2N) and data management planning.
- Ready to start with the Institutional Repository.
- Simplified archiving process for the PI (see slide); the NCCOS Scientific Coordinator helps step through it - talking with the PI to figure out what they have and “translating it” into what the archive expects and needs to archive; they call it “S2N+” because the SDC makes some tweaks, including inputting the NCCOS project page and keywords.
- PARR compliance is reached at Step 10; Steps 11 and 12 meet internal NOS and NCCOS tracking requirements.
- From a programmatic perspective, PIs could go straight to S2N and the archive, but for tracking purposes this approach provides additional abilities for the program office.
- An optional step they consider - provide data via web service.
- Outline for a publications workflow - has not been implemented yet.
- From the program side, the steps are there, but the TRACKING and the planning (e.g. budgeting / cost model) are difficult.

Eugene Burger - PMEL data integration strategy. 13 research efforts with disparate data streams. Tasked with developing a more integrated data management approach - mission statement in slides.

Goals - interoperable access for scientists and the public, daylighting data, and adding value by leveraging the data management activities with visualization and other tools.

Approach - work from processing and archiving steps, make it a 2-way process

Assess - talk with scientists to explain the approach, assess data assets, and identify where scientists need help with integration and data management.

Pilot - start with 4 projects, “fail in small steps” so that they could iterate along the process

Workflow slide describes the pilot process: start by working with the project to see what formats the files were in; use Ferret to create netCDF; use ERDDAP to provide interoperable access; add LAS visualization and management services.
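To illustrate the interoperable access ERDDAP provides, here is a minimal sketch of querying a tabledap dataset over HTTP, assuming Python with pandas; the server URL, dataset ID, and variable names are placeholders, not actual PMEL datasets.

```python
# Sketch: pull data from an ERDDAP tabledap endpoint as CSV.
# The server URL, dataset ID, and variable names are placeholders; ERDDAP
# answers RESTful requests of the form
#   https://<server>/erddap/tabledap/<datasetID>.<fileType>?<variables>&<constraints>
import pandas as pd
from urllib.parse import quote

server = "https://example-erddap.noaa.gov/erddap"    # placeholder server
dataset_id = "pmel_pilot_sst"                        # placeholder dataset ID

# Request four variables plus a time constraint; ">" must be percent-encoded in the URL.
query = quote(
    "time,latitude,longitude,sea_surface_temperature&time>=2017-01-01T00:00:00Z",
    safe="=&,:",
)

url = f"{server}/tabledap/{dataset_id}.csv?{query}"
df = pd.read_csv(url, skiprows=[1])   # the second row of ERDDAP CSV output holds units
print(df.head())
```

The same URL pattern with a different file-type suffix (e.g. .nc, .json, .htmlTable) returns the same subset in other formats, which is what makes the access interoperable.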

Slides have some examples of daylighted data and visualization services.

Evaluation - data file conversion was more time consuming than originally thought; requested changes to ERDDAP; pretty happy and it seems scalable. Looking at video and sound data next; building out a data inventory tool to provide a data management dashboard to scientists; metadata collection is a serious gap - a priority for the next year.

What about PARR? - by addressing sound data management principles you meet PARR. Do better with the data for the intrinsic benefits; PARR compliance is a side benefit.

Stanley Elswick - NOAA IR update. History, background, and building blocks (CDC partnership) detailed in slides.

What’s included - NOAA series and peer-reviewed journal article manuscripts (details in slides); default one-year embargo (sometimes less, e.g. for open access).

Submission - NOAA authors/offices submit the documents via a submission page (link on slide) or through email. Can work out other methods.
Processing - Library staff creates metadata and mints a DOI (if appropriate; for journal articles, the published version’s DOI will be used), then deposits the metadata and content in the IR.

Details of submission are in the slides. Processing - for NOAA series, the Library will create a DOI using EZID; draft procedures are linked in the presentation slides.

Q: Is the existing Central Library catalog already pushed into the IR? A: No, because many of those items will not qualify for the IR (legacy materials, NOAA history, etc.), but many of the relevant ones have been.

Anna Fiolek - Minting DOIs for NOAA publications. DOI purpose - provide long-term sustainability and access (persistent URL).

It is mandated by NOAA PARR and the Publications Policy that NOAA series (listed on slides) are included in the IR, and DOIs should be minted.

EZID metadata procedures:
● Core and required elements - the required DataCite fields are listed on the slides
● Two DOI types:
  ○ Predictable DOI - each has the same shoulder part, with series-specific information added to create a predictable DOI (structure: T(echnical) Memorandum - Issuing Office Abbreviation - report #) (see example). The issuing office can embed it before the document is submitted to the Library - see the sketch after this list.
  ○ System-generated - documents that don’t have a series name will receive a system-generated DOI and landing page
● Documents can be secured from editing
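The sketch below, assuming Python with requests, shows how a predictable Technical Memorandum suffix of the form described above could be assembled and then registered through the EZID HTTP API. The DOI prefix/shoulder, credentials, landing-page URL, and metadata values are all placeholders and do not reflect the Library's actual draft procedures (those are linked in the slides).

```python
# Sketch: build a predictable DOI suffix of the form
#   TM - Issuing Office Abbreviation - report number
# and register it with EZID. Prefix/shoulder, credentials, landing page,
# and metadata values are placeholders, not the Library's actual procedure.
import requests

def predictable_suffix(series: str, office: str, report_no: str) -> str:
    """Compose a series-based suffix, e.g. TM-NMFS-NWFSC-123 (illustrative)."""
    return f"{series}-{office}-{report_no}"

doi = f"doi:10.XXXX/{predictable_suffix('TM', 'NMFS-NWFSC', '123')}"  # placeholder prefix

# EZID accepts metadata as ANVL text: one "key: value" pair per line.
metadata = {
    "_target": "https://repository.library.noaa.gov/view/noaa/EXAMPLE",  # placeholder landing page
    "datacite.title": "Example NOAA Technical Memorandum",
    "datacite.creator": "Doe, Jane",
    "datacite.publisher": "NOAA",
    "datacite.publicationyear": "2017",
    "datacite.resourcetype": "Text",
}
anvl = "\n".join(f"{k}: {v}" for k, v in metadata.items())

resp = requests.put(
    f"https://ezid.cdlib.org/id/{doi}",            # create this specific identifier
    data=anvl.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=("ezid_user", "ezid_password"),           # placeholder credentials
)
print(resp.status_code, resp.text)
```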

Q: What is the relationship between DOIs provided by the Library and those provided by the journal? A: The IR will have the manuscript available for download, but the DOI will be that of the publisher and will link out.

Q: Is availability of copyrighted material handled case-by-case, or are you looking for agreements with publishers? A: Usually by publisher, sometimes by journal title - generally most publishers have agreements, so going to publishers to find them and determine.

Q: How will you tell what’s missing? A: Ongoing effort to do bibliometric work - searching against Web of Science - over time will notice what is missing. (May not find out for a while.) Haven’t completely worked out that mechanism yet.

Q: Are there waivers in cases where you can’t get the manuscript? A: Will still have a record, but it will have to be handled on a case-by-case basis.

Q: If you had to guess, what’s the budget for data management? A: So far, at NCCOS, the cost is “people time” - need a relationship with the data centers to get time and expertise on archiving, metadata, and other tasks - need a “team,” but made up of parts of people’s time. In-house, need a coordinator and IT/hardware. Right now about 1½ FTEs handle data management and coordination for ~100 NCCOS PIs.

Q: For all, do you track usage? A: Google Analytics for the IR; analytics at NCEI; from the Science Center perspective, it’s a huge question - particularly going from Tier 1 to Tier 2, how do you show if it is worth it?

NMFS PARR Implementation “Grand Tour” - Richard Kang. NMFS is more than just science - also policy and regulation.

Keep in mind - the different parts of the infosphere (diagram) - remember the people are part of these components.

Mission - to deliver the right data, information, and service when and where they are needed AND to effect a cultural change

Note: there is a new OPEN Government Data Act (Dec 2016).

Also - there is a hierarchy translating NOAA PARR guidelines and NMFS requirements into something implementable at the NWFSC level.

Management and User Needs and Center Directions (list on slide) - “What questions are we actually answering? As a government science organization, we need to be able to show that research dollars are going into answering fundamental questions”

Information is valuable - it is a resource and not just a cost

This Science Center deals in small data sets and multiple PIs - 265 data sets - ⅔ in Excel. Many don’t have metadata. 47 MS Access databases!

Delivered so far: very thorough documentation - the slide has links to a number of products (reports, surveys, guides) and a documented PARR workflow lifecycle (flowchart in slides).

NWFSC maintains 4 “source systems,” which target systems for external access (e.g. NWFSC public website, InPort, Noaa.data.gov, etc.; complete list in slides) and feed archive systems (NCEI and NOAA IR).

Working on GenBank as an additional archive (many PIs have already worked with them) - it is a federally maintained system - does that meet NOAA archive requirements? Would like a clear answer from EDMC.

Why web services? - LOTS of reasons (see slide)

Decision - use a very structured, standardized enterprise set of tools. Several next steps identified. Web services can help “connect the dots” and fulfill the NMFS mission.

NMFS Live Demo. The foundation is the Project Planning Database (PPD). The system was adapted to capture discovery-level metadata. There are templates for PIs and a workflow for researchers. There is an NMFS PARR Data Inventory (PARRDI) where data is publicly discoverable and accessible.

From the project summary page, the data sets tab has this functionality. Can fill out a template to generate a DMP or metadata record; there are several tabs under projects to input keywords and discovery metadata.

Researcher collects data and metadata and enters it in PPD; the next step is to capture the detail-level metadata. Work with the PIs to get the right info, then data is pushed to InPort and NCEI.

Demo of the metadata template - a spreadsheet version. Most researchers are familiar with working with these. Each data worksheet gets a corresponding metadata worksheet; ingest it into Oracle, generate detail-level metadata, and generate access points.
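A minimal sketch of the paired data/metadata worksheet idea described above, assuming Python with pandas (and openpyxl for .xlsx); the workbook path, sheet-naming convention, and column names are hypothetical, not the actual NWFSC template.

```python
# Sketch: read a researcher workbook in which each data worksheet has a
# companion metadata worksheet (e.g. "catch" and "catch_metadata").
# Workbook path, naming convention, and columns are hypothetical placeholders.
import pandas as pd

workbook = "nwfsc_project_template.xlsx"            # placeholder path
sheets = pd.read_excel(workbook, sheet_name=None)   # dict of sheet name -> DataFrame

records = []
for name, data in sheets.items():
    meta_name = f"{name}_metadata"
    if name.endswith("_metadata") or meta_name not in sheets:
        continue  # skip the metadata sheets themselves and any unpaired sheets
    meta = sheets[meta_name]
    # Assume the metadata sheet lists one row per column of the data sheet,
    # with hypothetical "field_name" and "description" columns.
    described = set(meta["field_name"])
    missing = [c for c in data.columns if c not in described]
    records.append({"worksheet": name, "rows": len(data), "undocumented_fields": missing})

# A summary like this could then drive ingest into a database and
# generation of detail-level metadata and access points.
print(pd.DataFrame(records))
```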

Final product - a public-facing inventory for NWFSC. The human interface is not just a link to download but a discovery/search tool. Download - you get something that looks like the original workbook.

In addition to the human interface, it is published as a RESTful service.
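A minimal sketch of consuming such a RESTful inventory service, assuming Python with requests; the endpoint URL, query parameters, and response field names are hypothetical, since the actual PARRDI service interface was only shown in the demo.

```python
# Sketch: query a hypothetical RESTful data-inventory endpoint and list results.
# The URL, parameters, and JSON field names are placeholders, not the real PARRDI API.
import requests

BASE_URL = "https://example.noaa.gov/parrdi/api/datasets"   # hypothetical endpoint

resp = requests.get(BASE_URL, params={"keyword": "salmon", "format": "json"}, timeout=30)
resp.raise_for_status()

for dataset in resp.json().get("results", []):   # hypothetical response structure
    print(dataset.get("title"), "-", dataset.get("download_url"))
```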

Trying to automate the process (or at least notifications) for things like checking on waivers, reporting, etc.

1st year - goal of ⅓ of data completed. Staged to have a three-year process to completion. Started with spreadsheets; will need to move to more complicated data types like Access databases.

Comment (Anna, NCEI) - the ability to link a data set to publications would be very helpful. Response: They are looking to link data (one-to-many) to the places it exists. Will take time, some automation, some ‘eye-balling’ by PIs, etc.

Q: Experience with derived products? How do you handle this in the PARR landscape? A: Many-to-many relationships in relational data systems - can be a simple thing to track. That’s how they would do it.