
Version 2, 08/01/2017 09:47:19



Table of Contents

SUMMARY VISION STATEMENT

CROSS-CAPABILITY COSTS
    COS STAFF RESOURCE POOL

KEY CAPABILITIES
    Capability 1. Community Supported FAIR Guidelines and Metrics: Development and Implementation Plan
        Project personnel, contribution levels, and roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Figure 1. Illustration of FAIR Metrics Working Group defining new FAIR metrics
        Figure 2. Attributes that inform metrics for FAIR Principles
        Project management plan
        Task plan
    Capability 2. Global Unique Identifiers (GUID) for FAIR Biomedical Digital Objects
        Project Personnel, Contribution Levels, and Roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Project management plan
        Task plan
    Capability 3. Open Standard APIs
        Project Personnel, Contribution Levels, and Roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Project management plan
        Task plan
    Capability 4. Cloud Agnostic Architecture and Frameworks
        Project Personnel, Contribution Levels, and Roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Project management plan
        Task plan
    Capability 5. Workspaces for Computation
        Project Personnel, Contribution Levels, and Roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Figure 3: An overview of Workspaces for Computation, showing the layers of integration for supporting the full lifecycle of CWL and WDL workflows.
        Project management plan
        Task Plan
    Capability 6. Research Ethics, Privacy, and Security
        Project Personnel, Contribution Levels, and Roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Project management plan
        Task plan
    Capability 7. Indexing and Search
        Project Personnel, Contribution Levels, and Roles
        Objectives for the Key Capability
        How key personnel will accomplish the objectives
        Figure 4. Technology Stack of available services to support Capability 7
        Figure 5: Wireframe of Commons Search and Discovery Interface
        Project management plan
        Task plan

Bibliography

Appendix A: Biography

NIH - RM-17-026 Data Commons - Pilot Phase STAGE 1 Application

1. Project Title - An Open, FAIRified Data Commons

2. Key Capabilities for which application is intended: 1, 2, 3, 4, 5, 6, and 7

3. Project Lead
Brian Nosek, Ph.D.
Executive Director, Co-founder
Center for Open Science
210 Ridge McIntire Road, Suite 500
Charlottesville, VA 22903
434-806-6460
[email protected]

4. Type - Nonprofits with 501(c)(3) IRS Status (Other than Institutions of Higher Education)

5. Name of the applicant organization - Center for Open Science
DUNS #079116611 | SAM registered - Yes | eRA Commons SO - bnosek1972

6. Authorized Official
Brian Nosek, Ph.D.
Executive Director, Co-Founder
Center for Open Science
[email protected]
434-806-6460

7. Approximate budget (direct and total) for the Stage 1 activities: $2,249,276

8. Other key personnel names and organizations:

Erik Schultes, Leiden University Medical Center
Maryann Martone, U. California, San Diego
Barend Mons, Leiden University Medical Center
Albert Mons, Dutch Tech Centre for Life Sciences
Luiz Olavo Bonino, Dutch Tech Centre for Life Sciences
Natalie Meyers, COS/University of Notre Dame
C. Titus Brown, U. California, Davis
Ian Taylor, University of Notre Dame Center for Research Computing
Sandra Gesing, University of Notre Dame
Lee Giles, Pennsylvania State University
Mark Musen, Stanford University
Eric Pugh, OpenSource Connections, LLC
Cynthia Hudson Vitale, Independent Consultant
Rick Johnson, University of Notre Dame
Judy Ruttenberg, Association of Research Libraries
Jeffrey Spies, Center for Open Science (COS)
Matt Spitzer, COS
David Mellor, COS
Nici Pfeiffer, COS
David Litherland, COS
Michael Haselton, COS
Courtney Soderberg, COS
Tim Errington, COS

9. Resources required:
o Are Human Subjects Involved: No
o Are Vertebrate Animals Used: No
o Are Biohazardous Materials Used: No
o Are Select Agents Used: No
o Are Human Embryonic Stem Cells Used: No


SUMMARY VISION STATEMENT

The scholarly community needs a data commons that opens and connects the research lifecycle. On Day 1 of the project period, we will deliver minimum viable products (MVPs) covering substantial portions of the requirements for achieving the Data Commons Pilot Phase Consortium's (DCPPC) objectives by leveraging the Center for Open Science's (COS) Open Science Framework (OSF). For the remaining 179 days, we will:

● Complete a gap analysis and needs assessment of the current MVPs against the DCPPC requirements.
● Build capabilities on the MVPs to achieve as many of the DCPPC objectives as possible in the initial project period.
● Complete a second analysis and needs assessment at the end of the project period to identify gaps, challenges, and opportunities for the next stage.

We are an expert, collaborative team that will deliver on Capabilities 1-7 of the DCPPC. Partners including those at COS, Open Source Connections, Notre Dame’s Center for Research Computing, UC Davis, Penn State, and the Association of Research Libraries add capability to meet and exceed the stated requirements.

Collaboration and community development are part of every capability. Our team has substantial expertise fostering community standards related to FAIR and technical infrastructure (APIs). Team members are already centrally involved in the definition and implementation of FAIR principles and metrics globally, as well as in open reference implementations of infrastructure supporting FAIR data and services. This will achieve optimal interoperability with the European Open Science Cloud (connected to all capabilities) and future developments in the ecosystem. Similarly, team members have fostered community engagement on standards for data sharing and on technical collaborations and implementations.

Implementation is part of every capability. The OSF provides a solution that will dramatically accelerate implementation of the Data Commons as envisioned in NIH's RFP. As an open-source, scalable data commons, the OSF provides a suite of modular, open software tools and services to support scholarly investigation, communication, sharing, and preservation. OSF includes services such as authentication, authorization, file storage, file rendering, search, analytics, commenting, moderation, and documentation; an add-on ecosystem of services such as data repositories, computational platforms, visualization services, and authoring tools, created by abstracting their APIs to a common API; and an open data set of scholarly metadata via SHARE. OSF is public-goods infrastructure with efficient extensibility via modular design and enterprise-level scalability, security, deployment, access, and preservation, proven effective and robust through years of active use by a growing user base.
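To make the common API surface concrete, the sketch below retrieves a public OSF project's metadata through the OSF v2 API. This is a minimal illustration of ours, not project deliverable code, and the GUID shown is a placeholder:

```python
import requests

OSF_API = "https://api.osf.io/v2"

def get_node(guid: str) -> dict:
    """Fetch public metadata for an OSF node (project or component) by GUID."""
    resp = requests.get(f"{OSF_API}/nodes/{guid}/")
    resp.raise_for_status()
    return resp.json()["data"]  # the OSF v2 API follows the JSON API envelope format

node = get_node("abc12")  # "abc12" is a placeholder GUID
print(node["attributes"]["title"])
print(node["links"]["html"])  # resolves to the canonical https://osf.io/<GUID>/ page
```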


The Commons must support the use cases of many stakeholders who need access to scholarly process, content, and outcomes in pursuit of knowledge. Moreover, the Commons must be flexible enough to respect researchers' idiosyncratic workflows, yet specific enough to solve the problems researchers actually face. To meet both demands, a successful Commons will provide core services that are shared across workflows and flexible interfaces that meet the individual needs of stakeholders. By leveraging existing open tools, an expansive community network, and in-depth expertise, this collaborative team is well positioned to contribute to the Data Commons pilot and beyond.


CROSS-CAPABILITY COSTS

Because services already exist that provide interoperability across some of the capabilities, some costs are shared across capabilities. We document those costs in this section. As with the rest of the budget, upon selection to the consortium, these costs would be renegotiated depending on the team’s contribution across capabilities.

A project manager will support the execution of the DCPPC across capabilities, consortium members, and stakeholder communities. This project manager will be the point of contact between this proposing team and the stakeholders involved in executing the project aims: NIH, the pilots, and other consortium members. The PM will manage the "molar" roadmap of the DCPPC (i.e., not the product roadmap), identify dependencies, maintain metrics and project status, identify risk, facilitate resourcing, manage milestones, and facilitate cross-team integration.

Other cross-capability costs are infrastructure hosting, maintenance, and operational costs.

Line item | Cost

TBN: Project Manager (1.0 FTE), salary and fringe benefits | $48,008

10% of hosting and service provider costs for the OSF for the 6-month project period | $7,200

Data backups, software licenses for support/community engagement, and search for the 6-month project period | $15,500

Pre-work before the kick-off meeting (general preparatory effort of documentation, planning, team organizing, and communicating with other consortium members: capability leads and other core project personnel); includes labor, fringe, and overhead for 36 hours of effort from each of the core personnel | $56,902

Travel for 8 COS personnel to participate in 2 stakeholder meetings (kick-off and one additional); costs limited to specific capabilities are included in those budgets | $22,928

COS STAFF RESOURCE POOL

COS maintains a sizable, high-performing development team for implementation and quality assurance. DevOps, Developer, and QA personnel are assigned to a team and sprint depending on expertise, need, and availability. The following are the available full-time staff from these teams.


Developer Pool

● Brian Geiger - Senior Software Engineer - Geiger leads the team responsible for ensuring that the Open Science Framework receives continual maintenance and improvements, and was also responsible for the team that developed the publicly accessible API. He has worked in a wide range of industries, from video games and newspaper publishing to helping to create an autonomous vehicle for the DARPA Urban Challenge.

● Chris Wisecarver - Senior Software Engineer - Wisecarver is responsible for leading a team of developers that develop and maintain the Open Science Framework. His experience includes leading a team that migrated the OSF from TokuMX to PostgreSQL, leading a team through the final stages of developing a database-as-a-service using MongoDB and Elasticsearch, and directing junior developers and interns on architectural best practices. He has over a decade of user-focused full-stack developer experience, including 8 years at The Motley Fool as a Senior Web Developer.

● Steve Loria - Senior Software Engineer - Loria leads one of the teams improving the OSF backend. His team is currently focused on supporting new functionality and optimizing performance of the OSF's public-facing RESTful API. He is an active contributor to the Python and JavaScript open source software communities. He authored TextBlob--a popular Natural Language Processing package--as well as a number of libraries focused on building web APIs, including Marshmallow, Webargs, and APISpec.

● James Davis - Senior Software Engineer - Davis leads the OSF Front-end Team, which is responsible for user-interface development in Ember. He has nearly 15 years of experience designing and building systems for scientific data management. He holds a BS in Computer Science and an MS in Environmental Science, both from Texas A&M University-Corpus Christi.

● Fitz Elliott - Software Engineer - Elliott is the team lead for the Services team at COS, which maintains the WaterButler storage provider service. His past experience includes ten years with the Cell Migration Consortium as the lead developer on the Cell Migration KnowledgeBase. He has also developed plugins for the NIH's ImageJ image processing suite for Dr. Rick Horwitz and in-house tooling for Charlottesville-area biotech startup HemoShear.

● Chris Seto - Software Engineer - Seto leads the SHARE team at COS. He was responsible for much of the initial development of SHARE v2 and WaterButler, as well as components of the OSF. Seto created a database-as-a-service powered by Tornado, MongoDB, and Elasticsearch.


● Sherif Abdelhamid, Ph.D. - Software Engineer - Abdelhamid contributes as a developer on the team building SHARE. He brings experience in modeling and simulation, digital libraries, distributed systems, network science, informatics, and social network analysis. He has a Ph.D. in Computer Science and focused his thesis project on a software system for modeling contagions in networked populations.

● Caner Uguz, Ph.D. - Software Engineer - Uguz has contributed to the development of several aspects of the Open Science Framework and other COS products through front-end development, UI/UX, product development, and building custom JavaScript libraries, including the Treebeard tabular data rendering library. He has experience leading both product development and coding for "dscourse," an innovative online discussion tool developed at the Curry School of Education. Uguz received his Ph.D. in Education/Instructional Technology from the University of Virginia.

● Casey Rollins - Software Engineer - Rollins has a degree in computer science from the University of Virginia. She works on a team that is responsible for developing, maintaining, and improving the Open Science Framework. Her experience includes extensive work with the OSF's public-facing API and maintaining the OSF's developer documentation.

● Saman Ehsan - Software Engineer - Ehsan works on the research and development team that builds prototypes for new COS services. Previously, she worked on features for the Open Science Framework. She graduated with a degree in Chemistry from the University of Virginia and is interested in improving science through software development.

● Dawn Pattison - Software Engineer - Pattison works on one of the teams maintaining the front-end of the Open Science Framework, with a focus on the Preprints and Registries services. Her skills include API development and JavaScript frameworks like Ember. She graduated with a degree in Engineering Science and Mechanics from Virginia Tech and previously worked as a mechanical engineer.

● Lauren Barker - Software Engineer - Barker works on a team dedicated to developing, maintaining, and improving SHARE. She has also contributed to the OSF and other COS products. She graduated with a degree in Computer Science from the University of Virginia and has experience working with undocumented APIs and sensitive data.

● Tom Baxter - Software Engineer - Baxter splits his time between the Services Group and Bugs & Improvements, supporting the Open Science Framework. His past experience includes over 15 years as a Systems Engineer and Product Expert for a leading ERP SaaS company focused on banking and hospital verticals.


● Joshua Bird - Software Engineer - Bird works on experimental projects at COS on the research and development team. He has full-stack experience deploying business applications and developing algorithms to handle business process workflows.

● Cameron Blandford - Software Engineer - Blandford has Bachelor's Degrees in Computer Science and Cognitive Science from the University of Virginia. At COS, he works on researching and building prototypes and proofs-of-concept of new technologies and applications.

● Abram Booth - Software Engineer - Booth works on improving and maintaining SHARE. He received a Master's Degree in Computer Science from the University of Michigan and has experience with full-stack development, algorithm analysis, and contemplation.

● Nan Chen - Software Engineer - Chen is a developer working on maintenance and improvements of the Open Science Framework and has also contributed to other COS products. He has 6 years of research and development experience in web applications. Chen received his degree in Computer Science and Math from the University of Virginia.

● Matt Frazier - Software Engineer - Frazier works on the team focused on optimizing performance of the OSF. He is also in charge of the OSF side of external service integrations and has contributed to the internal abstraction services that stream and render files. He graduated with a BS in Computer Science from the University of Virginia and has experience with full-stack application development and neuromimetic hardware and algorithm design.

● Yuhuai Liu - Software Engineer - Liu works on developing and improving the Open Science Framework. He is interested in modern web technologies, data preservation, and history. He graduated from the University of Virginia with a B.S. degree in Computer Science.

● Ryan Mason - Software Engineer - Mason graduated from the Rochester Institute of Technology with a BS in Information Technology. He has experience with web and mobile development and works on experimental projects and prototyping at COS.

● Alex Schiller, Ph.D. - Software Engineer - Schiller is a developer working on maintenance and improvements for the OSF. He received his Ph.D. in Psychology from the University of Virginia and has over 9 years of experience collecting, cleaning, and analyzing experimental data.

● Lauren Revere - Software Engineer - Revere currently works on maintenance of the OSF. She recently graduated from Washington and Lee University with a B.S. in Computer Science.


● Baylee Swenson - Software Engineer - Swenson is a developer working on the Ember codebase of the OSF. She graduated last December from Valley City State University with a B.S. in Software Engineering and a minor in Computer Science.

● John Tordoff - Software Engineer - Tordoff works primarily on maintaining OSF services: projects resulting from the Integrator grant program and those using WaterButler, the COS storage provider abstraction service. John graduated from Temple University with a BS in Information Science and Technology.

● Erin Braswell - Software Engineer - Braswell is a developer working on the backend of the OSF. Before joining the team at COS, she worked at the National Air and Space Museum and the Harvard-Smithsonian Center for Astrophysics as a science educator. She graduated from Smith College with a BS in Astronomy.

DevOps Pool

● Longze Chen - Software Engineer - Longze works on a team that provides vital services to the Open Science Framework. He is responsible for the Authentication and Authorization System for the OSF and has been fully engaged in improving the security and reliability of the OSF and its services. Longze has 6+ years of research and development experience across a variety of aspects of computer security, with a focus on web applications, data integrity, and cryptography. Longze received his Master's Degree in Computer Science from the University of Virginia, advised by Dr. David Evans.

● Matt Clark - Software Engineer - Matt is a developer for the DevOps team, which provides systems engineering and development for systems operations. Matt has worked as a software engineer for six years and, additionally, as a systems engineer for over two years at COS.

● Barrett Harber - Software Engineer - Barrett is a developer for the DevOps team and has a passion for cloud technologies and web application development. Prior to joining COS last year, Barrett, an Air Force veteran, worked as a developer and technical intelligence analyst for the United States Air Force, the Defense Intelligence Agency, and defense and intelligence contracting firms. Barrett has a BS in Electrical Engineering from Wright State University.

QA Pool

● Amanda Liscouski - Quality Assurance Team Lead - Liscouski coordinates quality assurance (QA) management and leads a team that performs rigorous automated and manual testing procedures on all software produced at COS. She plans and implements process improvements to the QA workflow that advance test coverage and increase the team's capacity. Prior to coming to COS, Liscouski worked for 3 years in technical copywriting and customer service for an e-commerce electronics company.

● Todd Sacco - Quality Assurance Associate - Sacco brings over 10 years of experience working with the federal government on software development for intelligence projects. Formerly a technical writer, he has been involved in requirements development and quality assurance for the past 6 years. At COS, Sacco maintains and contributes to a suite of API tests that measure performance and monitor functionality across the OSF and SHARE projects.

● Allison Schiller - Quality Assurance Associate - Schiller works on front-end test automation at COS, as well as manual testing to ensure functionality of the OSF and related services. Schiller performs testing to support a number of projects and developer teams, including OSF Preprints, WaterButler, and SHARE.

Product Management Pool

● Sara Bowman - Product Manager - Bowman is a Product Manager providing support for requirements gathering, sprint planning, and feedback loops during execution of development tasks, through ongoing communication and status reporting.

● TBN Product Manager - An additional product manager to be added to support sprint execution of development tasks and status reporting.


KEY CAPABILITIES

Capability 1. Community Supported FAIR Guidelines and Metrics: Development and Implementation Plan

Total Cost: $511,603

Project personnel, contribution levels, and roles

Name | Role | Contribution to Capability
Erik Schultes, Ph.D. | Lead | Project planning, community evaluation and engagement, report writing, strategic decision making
Maryann Martone, Ph.D. | Coordinating committee member | Community evaluation and engagement, report writing
Barend Mons, Ph.D. | Coordinating committee member | Community evaluation and engagement, report writing
Luiz Olavo Bonino da Silva Santos, Ph.D. | Coordinating committee member | Community evaluation and engagement, report writing
Albert Mons, MSc | Coordinating committee member | Community evaluation and engagement, report writing
Brian Nosek, Ph.D. | Coordinating committee member | Community evaluation and engagement, report writing
Jeffrey Spies, Ph.D. | Coordinating committee member | Community evaluation and engagement, report writing
Matt Spitzer | Community development | Community engagement with solutions, assessment of alternatives, and achieving alignment
David Mellor, Ph.D. | Community development | Community engagement with solutions and assessment of alternatives
Courtney Soderberg, Ph.D. | Methodologist | Consulting on design, implementation, analysis, and interpretation of survey results
Whitney Wissinger | Events Coordinator | Coordinates planning and implementation of virtual and in-person meetings

Objectives for the Key Capability

Objective 1: Documented Plan for Coordinating FAIR Community Standards

Realizing that community "ownership" of the implementation of FAIR in substantive domains is essential for broad and effective adoption, the primary mission for Capability 1 is to establish a framework for Coordinating FAIR Community Standards (CFCS). This framework will allow the community of FAIR stakeholders to coherently deliberate and efficiently come to consensus on standards and implementation decisions for objective, quantifiable metrics with which to measure the FAIRness of digital resources. The framework for CFCS leverages ongoing work in (among others) the FAIR Metrics Working Group as an organizational scheme by which community stakeholders can be guided to define for themselves standards for FAIR data and metadata. This framework will also engender trust and transparency, allowing visibility and coordination with other communities. In Objective 1, we will formulate this mission and articulate a realistic and appropriately resourced plan for execution. We will identify opportunities to resource stakeholder communities (e.g., disciplinary groups), with the required expertise, to define domain-specific FAIR standards (including data formats, commonly used conceptual data models, and metadata descriptions, in collaboration with consortia such as CEDAR and SHARE). As part of Objective 1, we will distribute the plan for CFCS, collect survey information, and facilitate open commentary. Community feedback will be directed to the project consortium's implementation strategy.

Objective 2: Documented limitations and challenges for implementing the plan for CFCS

There will certainly be limitations and challenges regarding the implementation of the plan for Coordinating FAIR Community Standards. In Objective 2, we will document and, if possible, systematize those limitations and challenges. This will include identifying areas of conflict or ambiguity between proposed community standards and implementations, and devising open and transparent procedures for resolution. Real-world validation of the plan for CFCS will be conducted with consortium members TOPMed, MOD, and GTEx. The outcome of this implementation testing will include a before-and-after gap analysis of the FAIRness of these three applications.

Objective 3: Working prototype

Objectives 1 and 2 will inform revisions and tuning of a prototype framework for Coordinating FAIR Community Standards that will then be opened to any stakeholder with an interest in producing or consuming FAIR data and services. The framework will include a system for defining novel FAIR metrics as stakeholders see fit, as well as an OSF-based system for hosting documentation of community recommendations and standards.

How key personnel will accomplish the objectives

Status of MVP on Day 1

The concept of FAIR Data Stewardship was first discussed at a Lorentz Workshop in January 2014. The original workshop participants represented a broad spectrum of community stakeholders, including public and private organizations and representatives of public-private consortia. By early 2016, workshop participants and others had formulated a set of 15 principles that could serve as architectural requirements for the design of protocols guaranteeing Findability, Accessibility, Interoperability, and Reusability of data by both humans and machines (Nature Scientific Data & PeerJ). This formulation emphasized that the 15 principles were not a standard in themselves, but guidelines against which many possible existing standards could be deployed to achieve increasingly FAIR digital resources.

Since then, the concept of FAIR Data has enjoyed rapid uptake by a variety of academic projects, data repositories, data-intensive private companies, government ministries, and funding organizations. This uptake is reflected in numerous independent attempts to estimate the FAIRness of digital resources. To name a few: the DANS FAIR badge schema; the Horizon 2020 Expert Group on FAIR research data; A Call to Participate in the NIH Commons Framework Working Group; an evaluation of data repositories based on the FAIR Principles (IDCC 2017 practice paper, TU Delft); the Lorentz Workshop: Making Data FAIR (May 15-19, 2017); the Bio-IT World Hackathon (May 22-24, 2017); and the WDS/RDA Assessment of Data Fitness for Use Working Group.

Although the rapid community commitment to the FAIR principles is welcome, it is clear that it brings with it the risk of fragmented interpretation of the FAIR concept across stakeholders. It is necessary to find commitment to a set of community standards and to devise a set of broadly accepted metrics by which compliance with the 15 FAIR principles can be objectively measured. The GO FAIR Initiative has recently emerged as a bottom-up FAIR implementation mechanism aimed at federating the existing 'gems' in Europe and other countries. Any group can start a GO FAIR Implementation Network if it is ready to commit to FAIR principles for implementation, has a reasonable level of critical mass in terms of representing a community, and avoids vendor/provider lock-in. A GO FAIR support and coordination office in the Netherlands, supported by the Dutch and German governments, works with coordinating persons in other countries. Many other countries, in and outside Europe, support the initiative at the national level and have demonstrated interest in joining one or more GO FAIR Implementation Networks.

Members of the GO FAIR support office (Schultes, Bonino, B. Mons) and those deeply involved in its practical embedding and implementation (A. Mons) bring complementary depth of expertise in scientific data sharing, development of the FAIR standards, and community connections across the international research community, as well as optimal alignment with similar efforts in Europe. Such connections will provide credibility for the effort, and the existing experience and work on FAIR standards will accelerate putting the plan into action. In June 2017, GO FAIR members Schultes and Bonino, along with 4 colleagues (Mark Wilkinson, Universidad Politécnica de Madrid; Susanna Sansone, University of Oxford; Michel Dumontier, Maastricht University; Peter Doorn, DANS), created the FAIR Metrics Working Group as a mechanism for coordinating the broad set of stakeholders seeking concrete guidelines and metrics for assessing existing resources and creating new resources (http://fairmetrics.org). The FAIR Metrics Working Group has already self-sponsored two two-day face-to-face hackathons and one virtual hackathon. The FAIR Metrics Working Group has designed and drafted an implementation plan for FAIR metric development that explicitly enables and encourages broad community engagement.

The starting point for the FAIR Metrics Working Group's implementation plan is the realization that discoverability and reusability are not only abstract concepts; they imply concrete behaviors and properties that can be captured with existing information technology and digital communication protocols. Given this, it is possible to precisely define a set of measurable properties that can assess FAIRness effectively. Over the previous 2 months, the FAIR Metrics Working Group has created a cogent framework for developing FAIR metrics. This framework is manifested as a simple form with 8 concrete questions that structure fruitful conversations about proposed metrics. An example metric definition is displayed in Figure 1.

The CFCS, as a pioneering framework for developing core FAIR metrics, recognizes that diversity of opinion must be represented when crafting effective and representative FAIR guidelines. Communities must understand what is meant by FAIR and be able to monitor the FAIRness of their digital resources in a realistic and quantitative manner. What is considered FAIR in one community may be quite different from FAIRness in another community; different community norms and practices make this a certainty. As such, the FAIR Metrics Working Group has developed a mechanism by which metrics can be created by community members themselves, rather than attempting to create a set of one-size-fits-all metrics to apply to every resource.

With this framework in place to design metrics, it is now possible to make metric definition a community-based process. To seed this activity, the Working Group has created several exemplar "core" metrics that we think will be broadly applicable. Additional metrics may be designed and published through our open submission framework, or simply shared within a community through normal communication channels.


Figure 1. Illustration of FAIR Metrics Working Group defining new FAIR metrics


The FAIR Metrics Working Group framework ensures that metrics will be rigorous, unambiguous, reproducible, and algorithmically executable, providing objective standards and quantitative assessment of the degree of FAIRness for arbitrary data resources. This will facilitate structured reporting methods for FAIRness. Through judicious implementation choices, we have selected an approach to publishing FAIR Metrics that is, itself, FAIR. This takes the form of a FAIR Accessor (a kind of Platform Container), which describes a subset of metrics, the community to which they are applicable, other relevant metadata, and links to each of the associated metric metadata documents. These metadata documents are written in YAML and follow the smartAPI annotation patterns for Web Services. As such, each of these documents contains a link to the Metric itself: a Web interface capable of testing a resource's compliance with that metric.
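As a rough illustration of that publishing pattern, the sketch below serializes one hypothetical metric metadata document to YAML. Every field name here is our assumption for exposition, not the Working Group's published schema:

```python
import yaml  # requires PyYAML

# A hypothetical FAIR metric metadata document. smartAPI annotations extend
# Swagger/OpenAPI-style documents, so we mimic that shape; all field names
# below are illustrative placeholders, not the actual smartAPI vocabulary.
metric = {
    "info": {
        "title": "Identifier uniqueness",  # hypothetical core metric for F1
        "description": "Whether the identifier scheme used by the resource "
                       "guarantees global uniqueness.",
        "x-fair-principle": "F1",          # the FAIR principle the metric measures
        "x-community": "core (general purpose)",
    },
    # Link to the Metric itself: a Web interface that tests compliance.
    "host": "metrics.example.org",         # placeholder host
    "basePath": "/tests/identifier-uniqueness",
}

print(yaml.safe_dump(metric, sort_keys=False))
```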

The FAIR Metrics Working Group has already completed definitions for 4 independent metrics regarding attributes for F and is proceeding with core metrics for A, I, and R. Taken together, the Working Group anticipates definitions for 25 to 30 core metrics corresponding to observable (i.e., presence/absence) or quantifiable attributes stated within the 15 principles (indicated in color font in Figure 2 below). However, the upper bound on the number of conceivable metrics is limited only by the imagination of the community stakeholders. It is instructive to note that the 4 completed metrics thus far required a minimally estimated 240 person-hours (5 full days for 6 FAIR Metrics Working Group members, or 60 hours per metric). However, this timeframe also includes development of the metric creation process itself. Assuming additional metrics can be developed at a more rapid pace (20 hours per metric), the remaining 26 metrics will require a minimum of 520 additional hours. The FAIR Metrics Working Group is now engaged in weekly, online creation, deliberation, and publication of the remaining metrics.

Figure 2. Attributes that inform metrics for FAIR Principles

Font color indicates classes of metrics (red: presence/absence; orange: quantitative; blue: still undetermined).


The following is a requirements status list for Day 1 of the project period.

Requirement | Status
15 guiding principles of FAIR (meta)data standards defined | Complete
Framework for the development of core FAIR metrics, which can be opened to the community at large | Complete
Defining ~30 algorithmically executable metrics for implementing the 15 FAIR principles (4 completed, another 6 anticipated before October) | In Progress

Activities for Project Period

We will connect existing international efforts to clarify what it means for data and services producers to follow FAIR principles and how to objectively measure FAIRness. Our approach consists of three activities.

First, a Coordinating Committee of Erik Schultes, Maryann Martone, Barend Mons, Luiz Olavo Bonino, Brian Nosek, Jeffrey Spies, and other leaders in this space will support the objectives of creating a plan, documenting its limitations, and creating the ~30 core metrics corresponding to the existing 15 FAIR Data principles by:

● Fostering a framework for Coordinating FAIR Community Standards and their translation into tools and metrics to measure the FAIRness of digital resources
● Fostering communication, coordination, and collaboration among distributed efforts to advance FAIR principles, driving convergence on shared points of view and commitment to shared standards and best practices
● Identifying opportunities to resource disciplinary groups and stakeholders to define discipline/domain-specific FAIR standards
● Engaging the broader community for review, feedback, and adoption of the FAIR standards and metrics

Second, the Coordinating Committee will host action meetings of stakeholders to define, review, and improve the FAIR standards and metrics. Stakeholder meetings will facilitate development and review of FAIR use cases, setting requirements for interfaces to capture and report FAIR analytics, and assessing usability and utility. Moreover, these meetings will be critical for ensuring that FAIR definitions represent perspectives across stakeholder communities (researchers, funders, societies, institutions) and disciplinary interests. Broad community engagement from the outset will maximize the likelihood of identifying unique disciplinary needs and fostering community buy-in to shared standards.

The FAIR Metrics Working Group is motivated by a sense of urgency, but progress is limited by a lack of official mandate and by existing priorities. Nevertheless, when the DCPPC begins, the FAIR Metrics Working Group will have completed 10 to 15 of the 30 proposed metrics. We therefore will resource the FAIR Metrics Working Group to facilitate monthly face-to-face 2-day 'hackathons' accelerating efforts to complete the 30 core metrics, and we will propose the outcome as a start for a formalized GO FAIR metrics standard and certification scheme.

Third, domain experts will be essential for translating the general FAIR framework to concrete implementations for their communities. The Coordinating Committee and broader stakeholder community will provide resources to disciplinary groups to determine how FAIR standards translate into practice within their domain. Specific use cases and implementation details will emerge from these resourced groups that will facilitate revision and improvement of the general framework and metrics for assessing FAIRness. As such, stakeholder meetings will have representation of leaders from these domain-specific teams. Direct resourcing of domain-specific groups and an iterative feedback cycle between the general framework and specific implementations will facilitate and demonstrate community endorsement.

This will begin in the DCPPC with stakeholders for the pilot databases. We will conduct workshops involving metric development for project consortium members TOPMed, MOD, and GTEx. These workshops will have two principal objectives: (i) road-testing the core metrics on these real-world use cases, and (ii) scoping the need for additional metrics as defined by these stakeholders, independently and as a group. These workshops will be an opportunity to test the framework to develop FAIR metrics and to observe how users will define new community standards for FAIR. For example, it is likely that when defining 'rich' metadata descriptions (Principle F2), different communities will place emphasis on different metadata content. We will guide these discussions, helping each community to define rich metadata models that will become the 'standard' for their community. Once these metadata models are in place, it will then be possible to build tools that automatically assess the degree of compliance and, for instance, feed our findings into ongoing projects such as CEDAR.

Finally, we will engage the broader community for feedback on the development of FAIR standards through ongoing one-on-one discussions, social media invitations for feedback, webinars every three months, direct engagement with key stakeholder groups (e.g., FORCE11), and public sharing of documented progress by the Coordinating Committee and working groups.

Requirement | Status
Establish a framework for Coordinating FAIR Community Standards (CFCS) | In Progress
Community feedback on each proposed core metric | In Progress
Distribute the plan for CFCS, collect survey information, and facilitate open commentary | Not Started
Document and systematize limitations/challenges in the plan for CFCS (gap analysis with TOPMed, MOD, and GTEx) | Not Started
Revise prototype framework for the CFCS | Not Started

Project management plan

Fifteen to twenty core metrics will remain to be defined and implemented when this project begins in October. We will combine the framework used by the FAIR Metrics Working Group with the agile project management system used by software developers at COS. At the beginning of each new 2-week sprint, one or more metrics developed by an individual member are introduced. Each metric is provided to the wider Coordinating Committee for review, which can last for 1 week of the sprint. Each metric will be refined based on received feedback. Any metric can be incorporated into the sprint review process multiple times until consensus is reached by the group, at which point it is incorporated into the shared document that is available for community-wide feedback or discipline-specific application from content experts.

This community engagement includes testing, comments, suggestions, consensus building, and getting communities to define their own standards and metadata models; our algorithmic methods can then accept these as parameters when computing compliance. Proposed metrics will be tested on three pilot databases representing diversity across disciplines. Identification of wider community stakeholders will occur by coordinating with the NIH team and by leveraging existing communities fostered by FORCE11, RDA, COS (e.g., existing advisory panels for the TOP Guidelines, Registered Reports, Preprints, and Registries), and the GO FAIR network. Initial outreach to those communities will occur via email, listservs, and social media, and two webinars will be scheduled during the project period to demonstrate existing metrics and the channels for input.

The Coordinating Committee will meet virtually three times during the six-month project period to assess current status and to prioritize proposed work for the coming two months. Between virtual meetings, communication will occur asynchronously using the 2-week sprint schedule defined above.

Task plan

Task 1: Create a well-documented plan
Labor cost: $79,810

Month | Milestone | Personnel (Hours) | Metric
1 | Identify common interests and define common purpose among stakeholders within and outside of consortium | Erik (16), Brian (8), Jeff (8), Albert (8), Barend (8), Maryann (8), Luiz (8), Whitney (30), David M (40), Matt S (40) | Initial feedback from CC and stakeholders
1 | Design and implement survey to collect community feedback | Erik (32), David M (24), Courtney (32), Jeff (2), Brian (2) | Survey and sampling plan
2 | Virtual meeting of Coordinating Committee and key stakeholders | Erik (8), Brian (2), Jeff (2), Albert (2), Barend (2), Maryann (2), Luiz (2), David M (16), Matt S (8), Whitney (30) | Summary feedback from CC and stakeholders
3 | Initial draft of strategic plan for advancing FAIR principles | Erik (40), Brian (8), Jeff (8), Albert (8), Barend (8), Maryann (8), Luiz (8), David M (8) | Draft strategic plan
4 | Virtual meeting of Coordinating Committee and key stakeholders to initiate revisions | Erik (8), Brian (2), Jeff (2), Albert (2), Barend (2), Luiz (2), Maryann (2), Whitney (4), David M (8), Matt S (4) | Summary feedback from CC and stakeholders
5 | Revised draft of strategic plan | Erik (16), Brian (8), Jeff (8), Albert (8), Barend (8), Maryann (8), Luiz (8), Matt S (16), David M (16) | Draft strategic plan
6 | Completed strategic plan | Erik (16), Brian (4), Jeff (4), Albert (4), Barend (4), Maryann (4), Luiz (4), David M (16) | Completed strategic plan

Task 2: Document limitations and challenges for implementing the plan
Labor cost: $23,040

Month | Milestone | Personnel (Hours) | Metric
2 | List of identified concerns to inform prototype development | Erik (8), Brian (2), Jeff (2), Albert (2), Barend (2), Maryann (2), Luiz (2) | List of concerns identified
3 | Implementation review of FAIR and metrics with TOPMed, MOD, and GTEx investigators | Erik (12), Brian (2), Jeff (2), Albert (2), Barend (2), Maryann (2), Luiz (2) | Feedback received from stakeholders
4 | 2nd list of identified concerns for prototype revision | Erik (8), Brian (2), Jeff (2), Albert (2), Barend (2), Maryann (2), Luiz (2) | 2nd list of concerns
5 | 2nd implementation review | Erik (12), Brian (2), Jeff (2), Albert (2), Barend (2), Maryann (2), Luiz (2) | 2nd feedback
6 | Completed documentation of limitations and challenges | Erik (16), Brian (4), Jeff (4), Albert (4), Barend (4), Maryann (4), Luiz (4) | Delivered plan

Task 3: Implement a working MVP/prototype
Labor cost: $173,102

Month | Milestone | Personnel (Hours) | Metric
1 | Assessment of present status of FAIR metrics definition | Erik (25), David M (16), Matt S (16), Albert (25), Barend (25), Maryann (25), Luiz (25) | Assessment provided
1 | Ongoing community engagement, support, and feedback documentation | Matt S (16), David M (16), Whitney (4) | Delivery of community feedback
2 | Completion of prototype metrics | Erik (25), David M (16), Matt S (16), Albert (25), Barend (25), Maryann (25), Luiz (25), Brian (4), Jeff (4) | Prototype metrics
2 | Ongoing community engagement, support, and feedback documentation | Matt S (16), David M (16), Whitney (4) | Delivery of community feedback
3 | Assess implementation challenges identified by pilot teams | Erik (25), David M (16), Matt S (16), Albert (25), Barend (25), Maryann (25), Luiz (25) | Summary of challenges
3 | Ongoing community engagement, support, and feedback documentation | Matt S (16), David M (16), Whitney (4) | Delivery of community feedback
4 | Revise prototype metrics | Erik (25), David M (16), Matt S (16), Albert (25), Barend (25), Maryann (25), Luiz (25) | Revisions document
4 | Ongoing community engagement, support, and feedback documentation | Matt S (16), David M (16), Whitney (4) | Delivery of community feedback
5 | 2nd assessment of implementation challenges | Erik (25), David M (16), Matt S (16), Albert (25), Barend (25), Maryann (25), Luiz (25), Brian (4), Jeff (4) | Assessment available
5 | Ongoing community engagement, support, and feedback documentation | Matt S (16), David M (16), Whitney (4) | Delivery of community feedback
6 | Deliver completed prototype metrics | Erik (25), David M (25), Matt S (16), Albert (25), Barend (25), Maryann (25), Luiz (25) | Metrics delivered
6 | Ongoing community engagement, support, and feedback documentation | Matt S (16), David M (16), Whitney (4) | Delivery of community feedback


Capability 2. Global Unique Identifiers (GUID) for FAIR Biomedical Digital Objects

Total Cost: $230,814

Project Personnel, Contribution Levels, and Roles

Name | Role | Contribution to Capability
Brian Nosek, Ph.D. | Lead | Product planning and vision, strategic evaluation, report writing, community leadership, team management
Nici Pfeiffer | Product Manager | Gap analysis, requirements gathering and documenting, roadmap and sprint planning, ongoing communication with teams, status reporting
David Litherland | Technical Project Manager | Project planning, project status, metrics, risk identification, resourcing, sequencing, dependencies, milestones, cross-team integration, development process management
Michael Haselton | Technical Lead | Directs technical development, prioritizes work, estimates time, analyzes and recommends solutions, creates and enforces standards and procedures for technical solutions
Jeffrey Spies, Ph.D. | Product Vision and R&D Lead | Technical strategy/vision, product strategy/vision, leading the R&D team, prototyping, technology stack selection, technical partner development
Matt Spitzer | Community Development | Community engagement with solutions, assessment of alternatives, and achieving alignment
Natalie Meyers | Technical Community Development | Community engagement with solutions, assessment of alternatives, and achieving alignment
TBN from COS QA pool | QA Associate(s) | Gather requirements for test planning, create test plans and assertion criteria, coordinate testing activities for the QA team, execute testing, document bugs and improvements, maintain communication with developer leads
TBN from COS Developer pool | Developer(s) | Design/code applications, maintain existing applications, communicate application development progress and escalate roadblocks, ongoing data architecture for the product

Objectives for the Key Capability

Objective 1: Documented Plan

An MVP of the OSF using unique identifiers for digital objects will exist on Day 1 of the project period. Our first step will be a review and gap analysis of the MVP against the MOD, TOPMed, and GTEx databases and other consortium members' needs to roadmap additional development tasks for the project period to meet capability goals for FAIR digital object identifiers. We will also take advantage of the community survey and stakeholder engagement activities (see Capability 1) to inform priorities and gather use cases to serve community needs related to GUIDs. Near the end of the project period, we will document the updated MVP to create a project plan for the next stage of the DCPPC.

Objective 2: Documented limitations and challenges for implementing that plan

During the second review, we will conduct a gap and needs analysis to identify limitations of existing solutions and opportunities to redress them with technical development or alternative approaches.

Objective 3: Working prototype

OSF's GUID approach already meets many of the specified requirements or will in the near future. During the project period, we will extend the existing services to meet the objectives of the DCPPC and the needs of the pilot databases. As development work progresses, we will also maintain engagement with the community through a variety of activities (e.g., email- and virtual-meeting-based feedback, conference attendance) to support community adoption of new standards and features, and to continually gather feedback as the prototype is implemented.

How key personnel will accomplish the objectives

Status of MVP on Day 1

The OSF was designed on the premise that strong, persistent identifiers are critical for enabling long-term persistence, creating unambiguous relationships between objects, and creating accurate provenance records. As such, users of the OSF can mint Digital Object Identifiers (DOIs) and Archival Resource Keys (ARKs) for projects, components, registrations, and preprints. DOIs for files are on the near-term roadmap, along with DOI versioning; existing community standards will be explored and adopted for the latter. Users can sign in with or attach their ORCID to their account, and these identifiers are made available as metadata linked to relevant objects.


The OSF's existing GUID system and its established commitment to identifier persistence will meet the needs of the Data Commons and remain aligned with community standards. Every project, component, registration, preprint, file, and person in the OSF is given a random, five-character, alphanumeric, globally unique identifier (GUID). URLs in the OSF are structured using this identifier (e.g., http://osf.io/GUID). This makes most URLs in the OSF quite short and raises the prominence of the identifier. We ensure that URLs up to and including paths (e.g., http://osf.io/GUID/wiki) are persistent such that they always resolve to what is expected or are redirected if the URL structure changes--something we do very rarely.
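As a minimal sketch of this identifier scheme (our illustration, not OSF's production code; the alphabet is an assumption based on the lowercase GUIDs visible in osf.io URLs), GUIDs can be minted as random five-character strings checked for collisions before issue:

```python
import secrets
import string

# Assumption for illustration: lowercase letters and digits, as seen in osf.io URLs.
ALPHABET = string.ascii_lowercase + string.digits

def mint_guid(issued, length=5):
    """Mint a random alphanumeric GUID, retrying on the rare collision."""
    while True:
        guid = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if guid not in issued:  # a production system would check its database instead
            issued.add(guid)
            return guid

issued = set()
guid = mint_guid(issued)
print(f"http://osf.io/{guid}/")  # the GUID-based, persistent URL
```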

Because of GUID persistence, there is little to no risk in moving digital objects both within the OSF and across other services. If a file or object is deleted, the GUID will resolve to a page that provides metadata about the deleted file including file name and storage provider. If the deletion occurred on the OSF or on an add-on service, the file name, GUID of the user who deleted the file, and timestamp of the deletion will remain when appropriate. The privacy settings of the project, component, or registration dictate who can see this metadata.

The COS team respects the importance of GUIDs and community-based standards for identifiers. We have processes to ensure that features comply with OSF's standards and that changes to URL structures migrate appropriately to preserve persistence. This includes an ecumenical approach to maintaining GUIDs from other sources in metadata, which is particularly important for supporting specialized and diverse communities, their data sets, and their preferred identifiers.

Data stored in OSF Storage is considered persistent (as opposed to connected storage solutions, e.g., Dropbox or GitHub). COS established a $250,000 preservation fund to maintain a static version of the OSF in the event that COS had to curtail or close its offices. This includes the URL structure, information stored in the database, and data stored in OSF Storage. This fund is sufficient for 50+ years of hosting (roughly $5,000 per year) at present cost and usage projections. Further, all code is open source so that other groups could stand up the service in the event of COS failure.

The OSF has a robust backup policy, including regular audits and verification, covering both the database that contains GUIDs and other metadata, and the stored data itself.

The following is a requirements status list for Day 1 of the project period.

Requirement | Status
DOIs and ARKs for projects, components, registrations, and preprints | Complete
GUIDs for projects, files, users, preprints, and comments | Complete
Linking between GUID-minted projects | Complete
Ability to issue GUIDs of arbitrary length | Complete
GUID authentication and authorization | Complete
Backup policy | Complete
Preservation contingency plan and fund | Complete
ORCID connections | Complete
Infrastructure open sourced | Complete

Activities for Project Period

The Data Commons will uniquely identify all FAIR digital objects and provide resolution of those identifiers to cited persistent data. Moreover, OSF's flexible structure and services ecosystem facilitate linking of digital objects stored in different locations, via their identifiers and via functional integrations of services.

OSF will soon provide DOIs for individual files and add DOI versioning for all DOI-assignable assets. The Data Commons will have access to the OSF's DOI and ARK minting capabilities, backup features, and preservation fund as a means of increasing community trust.

Objects that are currently not referenced individually are given GUIDs. If there is a need or desire in the community to raise these objects to a citable form, they can be transitioned to the OSF’s short identifier format and/or be given DOIs and ARKs. Metadata in the OSF and SHARE are also given GUIDs.

With sensitive data, there can also be a need for identifiers at the level of individual participants in human subjects data. For example, the NCATS GRDR program uses GUIDs to link disparate data sets and to remove personally identifying information from files of data joined from multiple sources. This approach can be extended to any data set during an ingestion and/or merging process in order to aid compliance with privacy policies; data combining and meta-analytic approaches are discussed in more detail in Capability 7.
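The sketch below shows the general pattern (our simplified illustration with made-up column names, not the GRDR implementation): records from separate sources are joined on a participant GUID after identifying columns have been dropped:

```python
import pandas as pd

# Two hypothetical source tables that share a participant GUID column.
clinical = pd.DataFrame({
    "guid": ["p001", "p002"],
    "name": ["Alice", "Bob"],        # personally identifying; must not be shared
    "diagnosis": ["A", "B"],
})
genomic = pd.DataFrame({
    "guid": ["p001", "p002"],
    "variant_count": [12, 7],
})

PII_COLUMNS = ["name"]  # assumption: a curated list of identifying fields

# Drop PII first, then merge on the GUID so records stay linked
# across sources without carrying identifiers into the joined file.
merged = clinical.drop(columns=PII_COLUMNS).merge(genomic, on="guid")
print(merged)
```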

COS is actively involved in relevant working groups and is therefore well-positioned to contribute to community discussions about identifiers. Spies’ role as CTO of COS and co-director of SHARE, a partnership with the Association of Research Libraries, puts COS and the SHARE team (Capability 7) in the middle of scholarly discussions and planning for identifiers and related curation, preservation, and accessibility discussions. Further, COS’s community team is actively engaged with many stakeholders to initiate building community endorsement of solutions and can leverage the community engagement activities and survey from Capability 1 to gain additional information concerning community support of and engagement with the proposed solutions. Moreover, COS can facilitate implementation of multiple emerging


standards into the Data Commons through its flexible approach to identifiers in the OSF. Any changes would respect existing guarantees of persistence. Persistence of prior identifiers with addition of new identifier solutions will ensure backwards compatibility and avoid conflicts with earlier versions.

The following are the known requirements for the project period. These will be reviewed and revised as necessary upon announcement of the consortium members and identification of their available tooling.

Requirement | Status
DOIs and ARKs for Files: further enhance community identifiers for common objects | Not started
DOIs and ARKs Versioning: support object-level versioning via community identifiers for projects, components, preprints, and files | Not started
GUID Metadata for Removed Objects: surface object-specific metadata when ethically appropriate | Not started
Disparate Data Set Linking: provide rich metadata of object relationships | Not started
Community engagement to determine standards for FAIR digital object identifiers | Not started
Implementation of community standards identified in needs analysis | Not started

Project management plan

Technical development will follow COS's agile product development process. We work directly with the product owner in two-week development cycles called sprints. At the end of each sprint, the development team and product owner review progress, assess release feasibility, and discuss where process or product can be improved.

Gap analysis and report writing will follow COS's product evaluation process. Two teams--R&D and Product--support assessment and recommendations. R&D will conduct a technical review of the services landscape and an assessment of alternative solutions and opportunities for leveraging existing resources (tools or expertise) to meet DCPPC objectives. Product will conduct a review of the existing MVP against the requirements and user stories to define a roadmap for the prototype phase of the project.

Task plan

Task 1: Create a well-documented plan. Labor cost: $30,372


Month | Milestone | Personnel (Hours) | Metric
1 | Initial gap and needs analysis in concert with other members of consortium | Nici (16), Michael (40), Jeff (16), Brian (8), Matt (8), Developers (16) | Delivered gap/needs analysis
1 | Gather stakeholder feedback on gap/needs analysis and convey findings to product team | Matt (8), Natalie (8), Nici (4) | Delivered feedback
2 | Documentation of roadmap and sprints for 180-day project period | Nici (16), Jeff (8), Brian (8), David (8), Michael (8) | Delivered roadmap
5 | Documentation of project progress (~4 sprints) | Nici (8), David (8), Rebecca (16) | Outcomes of sprints
6 | Completion and delivery of plan | Nici (16), Natalie (16), Jeff (16), Brian (16), David (16), Michael (16), Matt (16) | Delivered plan

Task 2: Document limitations and challenges for implementing the plan. Labor cost: $8,900

Month | Milestone | Personnel (Hours) | Metric
2 | Documentation of anticipated limitations in initial gap and needs analysis | Nici (4), Natalie (12), Jeff (4), Brian (2) | Delivered gap/needs analysis
3 | Documentation of identified limitations and challenges from first half of sprints | Nici (4), Natalie (8), David (8) | Outcomes of sprints
6 | Documentation of limitations and challenges in plan for stage two | Nici (8), Natalie (8), Jeff (8), Brian (8), David (8), Michael (8) | Delivered plan

Task 3: Implement a working MVP/prototype. Labor cost: $144,819

Month | Milestone | Personnel (Hours) | Metric
1 | Gap Analysis: Review product landscape | Nici (16), Jeff (8), Developers (80) | Core App Evaluation
2 | Sprint 1: DOIs and ARKs for Files | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Extend Community Identifiers
2 | Sprint 2: DOIs and ARKs Versioning (Files) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Community Identifier Versioning
2 | Ongoing community engagement, support, and feedback documentation | Matt (32) | Delivery of community feedback
3 | Sprint 3: DOIs and ARKs Versioning (Files) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Community Identifier Versioning
3 | Sprint 4: DOIs and ARKs Versioning (Projects, Components) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Community Identifier Versioning
3 | Ongoing community engagement, support, and feedback documentation | Matt (32) | Delivery of community feedback
4 | Sprint 5: DOIs and ARKs Versioning (Preprints) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Community Identifier Versioning
4 | Sprint 6: GUID Metadata for Removed Objects | Developers (80), QA (8), David (8), Product Manager (8), Michael (4), Jeff (2) | Core App: Provenance
4 | Ongoing community engagement, support, and feedback documentation | Matt (32) | Delivery of community feedback
5 | Sprint 7: Disparate Data Set Linking | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Compliance
5 | Sprint 8: Disparate Data Set Linking | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App: Compliance
5 | Ongoing community engagement, support, and feedback documentation | Matt (32) | Delivery of community feedback
6 | Final: Integration usability testing, quality assurance, and documentation | QA (80), Rebecca (40), Michael (16), David (16), Nici (2), Product Manager (8), Jeff (2) | Test Report
6 | Ongoing community engagement, support, and feedback documentation | Matt (32) | Delivery of community feedback


Capability 3. Open Standard APIs

Total Cost: $179,378

Project Personnel, Contribution Levels, and Roles

Name | Role | Contribution to Capability
Natalie Meyers | Lead | Community development and management, standards development and implementation
Nici Pfeiffer | Product Manager | Gap analysis, requirements gathering, roadmap planning, status reporting
David Litherland | Technical Project Manager | Project planning, project status, metrics, risk identification, resourcing, sequencing, dependencies, milestones, cross-team integration, development process management
Michael Haselton | Technical Lead | Directs technical development, prioritizes work, time estimation, analysis and recommends solutions, creates and enforces standards and procedures for technical solutions
Jeffrey Spies, Ph.D. | Product Vision; R&D Lead | Technical strategy/vision, product strategy/vision, leading R&D team, prototyping, technology stack selection, technical partner development
Brian Nosek, Ph.D. | Product Vision | Product planning and vision, strategic evaluation, report writing, community leadership, team management
Rebecca Rosenblatt | Documentation | Help, tutorial, and documentation specialist
TBN from COS QA pool | QA Associate(s) | Gather requirements for test planning, create test plans and assert criteria, coordinate testing activities for QA team, execute testing, document bugs and improvements, maintain communication with developer leads, documentation
TBN from COS Developer pool | Developer(s) | Designs/codes applications, maintains existing applications, communicates application development progress and escalates roadblocks, ongoing data architecture for product
C. Titus Brown, Ph.D. | Consultant | Reviews features inventory, contributes to gap analysis, reviews OSF API enhancement features, contributes to extending API standards, designs and tests interoperability
Ian Taylor, Ph.D. | Consultant | Reviews features inventory, contributes to gap analysis, reviews OSF API enhancement features, designs and tests interoperability

Objectives for the Key Capability

Objective 1: Documented Plan
API MVPs provided by our team will exist on Day 1 of the project period. Our first step will be a DCPPC API features inventory, preparatory to a proposed FAIR gap analysis treating API metadata, registries, and workflows as they each relate to the MOD, TOPMed, and GTEx data. This will include input from other consortium members' APIs to map out additional DCPPC community API features and API standards development tasks for the project period. Near the end of the project period, we will document the updated API MVP(s) and standards for API metadata, registries, and workflows to create a project plan for the next stage of DCPPC.

Objective 2: Documented limitations and challenges for implementing that plan
During the second review, we will conduct a FAIR-driven gap and needs analysis to identify limitations of existing solutions and opportunities to redress those with extended standards, technical development, or alternative approaches.

Objective 3: Expanded Standards and Working Prototype(s)
OSF provides many of the requirements for Capability 3 already and has about a dozen successful use examples of the API that parallel the present needs. During the project period, we will extend community standards as well as the OSF API to meet the specific objectives of the DCPPC and needs of the pilot databases, such as alignment with GA4GH. The next section identifies known requirements for this technical development.

How key personnel will accomplish the objectives

Status of MVP on Day 1
With its HTTP API, COS facilitates the interoperable services ecosystem on the OSF. The OSF's modular architecture and API minimize user lock-in to any service. Connecting cloud services makes seamless workflows easy. Once a service is connected to the ecosystem, it is also connected to the other services in the ecosystem.

The OSF is built on the Django web framework and the API is built on Django REST Framework. The OSF runs on COS-maintained infrastructure at Rackspace with code available as open source. Developers can efficiently deploy the OSF using resources at https://github.com/CenterForOpenScience/cos-ansible-base and https://github.com/CenterForOpenScience/Docker-library. Continuous integration is used to verify and test automated builds of the codebase. The site undergoes QA review, both manual, provided by an in-house team of Quality Assurance Engineers, and automated, operated by the same engineers using best-practice tools such as Ghost Inspector, Selenium, and BrowserStack. The API undergoes additional testing using the Runscope service for automated testing and Postman for manual testing.

We create abstracted, interoperable APIs when possible. We have done this successfully for storage and maintain 8 integrations with storage APIs to offer a single API to access many services. These include ownCloud, GitHub, Dropbox, figshare, Box.com, Dataverse, Google Drive, and Amazon S3. We expect to release Bitbucket, GitLab, and OneDrive integrations soon, with Minio, Rackspace, Azure, Swift, and Google Cloud Storage integrations coming shortly after.

Access to cloud storage providers is abstracted via WaterButler (WB), an asynchronous web service, which communicates with cloud storage providers through a common interface. A WB storage provider translates the requests it receives into the format understood by each cloud storage provider, then maps the storage provider's response back into a common response format that can be presented to users and programs accessing the OSF.
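The shape of that abstraction can be sketched as follows; this is a minimal illustration of the pattern under our own naming, not WaterButler's actual classes or method names:

```python
# Minimal sketch of a storage-provider abstraction in the style of
# WaterButler: every backend implements one common interface, and a
# translation layer maps provider-specific responses to a common format.
# Class and method names are illustrative, not WaterButler's actual API.
from abc import ABC, abstractmethod

class StorageProvider(ABC):
    """Common interface that each cloud storage backend implements."""

    @abstractmethod
    def download(self, path: str) -> bytes: ...

    @abstractmethod
    def upload(self, path: str, data: bytes) -> dict: ...

class InMemoryProvider(StorageProvider):
    """Stand-in backend; a real S3/GitHub/Dataverse provider would
    translate these calls into that service's own API requests."""

    def __init__(self):
        self._store = {}

    def download(self, path: str) -> bytes:
        return self._store[path]

    def upload(self, path: str, data: bytes) -> dict:
        self._store[path] = data
        # Map the provider's response into the common response format.
        return {"path": path, "size": len(data), "provider": "memory"}

def get_provider(name: str) -> StorageProvider:
    providers = {"memory": InMemoryProvider}  # one entry per connected service
    return providers[name]()

provider = get_provider("memory")
print(provider.upload("/materials/protocol.pdf", b"example bytes"))
print(provider.download("/materials/protocol.pdf"))
```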

Together, the API and WB support many apps and interfaces that serve specific disciplinary communities throughout the research lifecycle such as branded preprint and institutional single sign-on services, meeting services, registries, and other collections. This includes external Single Page Application (SPA) framework implementations with customized web interfaces like the Circuit Realization at Faster Timescales repository (CRAFT).

COS provides support through a staffed helpdesk with a ticket system and a Slack channel for integrators. Our commitment to substantive API documentation facilitates API use by community integrators, supplemented by an OSF API public project where developers can access quick links to the API developer docs, see examples, and view a typical workflow.
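As an illustration of consuming the API, a client can retrieve a public node's metadata in a few lines; this is a minimal sketch, and the node identifier below is a placeholder rather than a real project:

```python
# Minimal sketch of consuming the OSF's JSON API (v2). The node GUID is a
# placeholder; substitute any public OSF project identifier.
import requests

response = requests.get(
    "https://api.osf.io/v2/nodes/abcde/",  # placeholder GUID
    headers={"Accept": "application/vnd.api+json"},
)
response.raise_for_status()
node = response.json()["data"]  # JSON API wraps each resource in "data"

print(node["id"], node["attributes"]["title"])
# Related resources (files, contributors, children) are linked under
# node["relationships"], so clients can navigate the API hypermedia-style.
```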

Beyond storage and computational solutions, the OSF API is used to connect services such as the Mendeley and Zotero citation managers, as well as RMap and a Python command-line client (scheduled for fall releases). Third-party products also connect to the add-on ecosystem, including Open Sesame, JASP Stats, Overleaf, PsychoPy, and an R package. Integrations of services around the lifecycle dramatically enhance reproducibility of workflows and facilitate research efficiency for users.

All this prior effort prepares COS to foster community-based standards for APIs in support of the DCPPC.

Requirement | Status
Standards for API Metadata: met by defining the API with the Swagger specification for communicating to others | Complete
Standards for API Registries: met by publishing the Swagger specification online and registering the API on ProgrammableWeb | Complete
Community-defined API standards: met by using the JSON API standard to communicate in an API-community-driven way | Complete

Activities for Project Period

The existing work provides an advanced baseline for pursuing high interoperability and reusability of APIs. The stage is set to achieve broad adoption and extension of community-defined API standards such as those of GA4GH.

The OSF has an API built on the JSON API open standard specification, widely adopted in the web development community, allowing modern tools to consume it in a known way. Moreover, the system was built to be flexible to the needs and changing standards of multiple communities across biomedicine and beyond, so that OSF can offer other APIs and be based on alternate standards. We appreciate the value of standards to maximize interoperability and reuse, and we avoid unnecessary technical debt and exclusivity by designing to anticipate diversification and change.

Because groups such as the Global Alliance for Genomics and Health (GA4GH) have the same goals to create an interoperable exchange of information for genomic data, we are well-positioned to collaborate with this community and others in biomedicine. We can ensure that community-defined APIs such as those being developed by GA4GH can be efficiently consumed by other components in the Commons (e.g., computational workspaces). One approach COS will investigate is making machine-readable API schemas available that define how the API is to be accessed. If those schemas are ingested by a client that maps to a common interface (e.g., a Python library), then users may interact using the substantive meanings presented by the API and not need to learn the specifics of how to access each API. For example, "data.get(10)" would mean the same thing regardless of the API schema, as the schema would map the retrieval of n rows of data into the format "data.get(n)". We already have a Swagger specification for our API; one purpose of the Swagger spec is to allow machine readability.
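A minimal sketch of that idea follows; the schema format, service URLs, and operation names are illustrative assumptions:

```python
# Minimal sketch of a schema-driven client: the schema (e.g., derived from
# a Swagger/OpenAPI spec) maps a common operation name to each API's
# specific path, so callers use one interface for every service. The
# schemas and endpoints below are hypothetical, not real DCPPC APIs.
import requests

class CommonClient:
    def __init__(self, base_url: str, schema: dict):
        self.base_url = base_url
        self.schema = schema  # maps common operations to API-specific paths

    def get(self, n: int):
        """Retrieve n rows, however this particular API spells that."""
        path = self.schema["get_rows"].format(n=n)
        return requests.get(self.base_url + path).json()

# Two hypothetical services with different URL conventions:
service_a = CommonClient("https://a.example.org", {"get_rows": "/rows?limit={n}"})
service_b = CommonClient("https://b.example.org", {"get_rows": "/data/first/{n}"})

# Identical caller code against both APIs, as in the "data.get(10)" example:
# service_a.get(10); service_b.get(10)
```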

OSF makes it easy to offer multiple APIs or proxy requests from one API to another. For example, with the abstraction of storage APIs, requests are consumed by technology that can deal with high-concurrency, I/O-bound (rather than CPU-bound) requests, transformed as necessary, and passed to the third-party service. If and when that service responds, we consume the response, handle any errors, serialize to the abstracted response format, and return the response back to the user.
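That request flow can be sketched with an asynchronous handler; this is a simplified illustration using aiohttp, and the upstream URL and error format are assumptions:

```python
# Simplified sketch of proxying a request to a third-party service with an
# async, I/O-bound-friendly stack (aiohttp). The upstream URL and the
# error format are illustrative assumptions.
from aiohttp import ClientSession, web

UPSTREAM = "https://storage.example.org"  # hypothetical third-party API

async def proxy(request: web.Request) -> web.Response:
    async with ClientSession() as session:
        async with session.get(f"{UPSTREAM}/{request.match_info['path']}") as upstream:
            if upstream.status >= 400:
                # Handle provider errors and serialize them to a common format.
                return web.json_response(
                    {"error": f"upstream returned {upstream.status}"},
                    status=upstream.status,
                )
            body = await upstream.json()
    # Transform the provider response into the abstracted response format.
    return web.json_response({"data": body})

app = web.Application()
app.add_routes([web.get("/{path:.*}", proxy)])
# web.run_app(app)  # serve the proxy
```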

COS will collaborate with existing communities to define/refine API standards and adopt existing community-defined standards. We maintain rich API documentation (http://developer.osf.io) to facilitate use and commentary by the developer community. Further, Natalie Meyers leads engagement with external teams to assess their needs; she provides consultation on potential interactions and manages the technical requirements processes.

Meyers and C. Titus Brown will coordinate with community leaders like GA4GH and their Data Working Group's Directory and Streaming API Task Team to facilitate interoperability with their API. This will provide for transfer between cloud systems for data files, processing of sequencing reads, and API endpoint access like that provided for genotype-phenotype (G2P) associations in GA4GH datastores. In many use cases, users can search for associations by building queries composed of features, phenotypes, and/or evidence terms. There is great potential for wider interoperability because the G2P API is designed to accommodate search terms specified as either a string, an external identifier, an identifier, or an "entity". Each of these could be accessed, provisioned, and informed by other interoperable scholarly information systems through APIs and linked data tools like OSF's RMap integration, which provides a browsable, queryable research graph view of linked scholarly objects. Queries for widely adopted binary read data formats used in the Genotype-Tissue Expression (GTEx) program and the Trans-Omics for Precision Medicine (TOPMed) program, as well as for data objects in the Model Organism Databases (MODs)/Alliance of Genome Resources (AGR), can be tested for interoperability and compatibility with command line and genome browsers as part of broader scholarly workflows. Protocols developed for bulk streaming of next-generation sequencing file formats (SAM/BAM/CRAM) can be tested within a network of clients and interoperable services.
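To illustrate the query pattern, the following is a hypothetical sketch: the base URL, endpoint path, and payload field names are assumptions modeled on the G2P search style, not a specific GA4GH deployment's API:

```python
# Hypothetical sketch of a G2P association search in the GA4GH style.
# Base URL, endpoint path, and field names are illustrative assumptions.
import requests

G2P_URL = "https://g2p.example.org"  # placeholder datastore

query = {
    # Search terms may be a string, an external identifier, or an entity:
    "feature": {"externalIdentifier": "dbSNP:rs121913227"},
    "phenotype": "Melanoma",
    "evidence": "drug sensitivity",
}
response = requests.post(f"{G2P_URL}/associations/search", json=query)
response.raise_for_status()
for assoc in response.json().get("associations", []):
    print(assoc.get("feature"), "->", assoc.get("phenotype"))
```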

Based on what is presently known about partners for the DCPPC consortium and APIs, these are our forecasted requirements. It is possible that additional requirements will emerge depending on the status of the APIs of other consortium members.

Requirement | Status
Clarify additional community standards development, including FAIR, to interoperate with web-based biomedical APIs | Not started
Maximize interoperability: prototype an API abstraction service similar to the in-production, COS-developed file storage abstraction and proxying service, WaterButler | Not started
Maximize reuse: prototype a cross-API translation software library using an API definition standard such as the Swagger specification | Not started
Standards for API Workflow: develop open, community-driven sample client applications and workflow documentation for the above prototypes | Not started

37 of 89

Project management plan

Technical development will follow COS's agile product development process. We work directly with the product owner in two-week development cycles called sprints. At the end of each sprint, the development team and product owner review progress, assess release feasibility, and discuss where process or product can be improved.

Gap analysis and report writing will follow COS's product evaluation process. Two teams--R&D and Product--support assessment and recommendations. R&D will conduct a technical review of the services landscape and an assessment of alternative solutions and opportunities for leveraging existing resources (tools or expertise) to meet DCPPC objectives. Product will conduct a review of the existing API and web services solution(s) against the requirements and user stories to define a roadmap for the next stage of the project.

Task plan

Task 1: Create a well-documented plan. Labor cost: $53,868

Month | Milestone | Personnel (Hours) | Metric
1 | API features inventory of COS & SHARE OSF APIs and DCPPC member APIs, including Global Alliance's Matchmaker, Genomics/G2P, TES, TRS, WES, & DOS, plus an inventory of other APIs used to interoperate with TOPMed, GTEx, and MODs/AGR | Natalie (24), Titus (16), Ian (16), Jeff (8), Nici (8), David (4), Rebecca (24), Developers (36) | Feature inventory of APIs
1 | Label API features based on: 1) internal to the DCPPC vs. external; 2) open vs. proprietary; and 3) free vs. paid | Natalie (36), Titus (8), Ian (8), Developers (24), Rebecca (24) | API feature classification
2 | DCPPC member feedback solicited and received via survey and/or virtual workshop on API inventory and API features, plus solicitation of DCPPC MVP requirements, resulting in a rank-ordered list (Must Have; Should Have; Could Have; Won't Have) | Natalie (36), Titus (16), Ian (16), Developers (16), Rebecca (36) | All requirements ranked based on member feedback
2 | Match existing user stories to requirements | Nici (12), Developers (12), Titus (8), Ian (8), Jeff (4), Rebecca (24) | Coverage traceable between requirements and user stories
3 | Gap analysis of DCPPC API MVP feature requirements against currently available OSF/DCPPC API features | Natalie (36), Developers (24), Titus (8), Ian (8), Jeff (4), Nici (4) | Documentation of difference between feature inventory and requirements

Task 2: Document limitations and challenges for implementing the plan. Labor cost: $26,771

Month | Milestone | Personnel (Hours) | Metric
2 | Time/effort, talent scheduling, and dependency estimates for fulfilling identified development on API DCPPC MVP requirements | Natalie (8), Nici (4), Developers (4), David (4) | Schedule traceable to requirements
3 | Draft and circulate for comment a roadmap of DCPPC API MVP development tasks | Nici (12), Natalie (8), Titus (6), Developers (12), David (4), Michael (2), Ian (6), Rebecca (16) | Roadmap comment request sent to members
3 | Finalize roadmap of DCPPC API MVP development tasks with funder and DCPPC leadership | Nici (6), Natalie (4), Titus (4), Jeff (2), David (2) | Roadmap updated based on member feedback
3 | Document desired API enhancement requirements against features accepted for roadmap implementation (Must Have; Should Have; Could Have; Won't Have) | Natalie (24), Titus (8), Developers (12), David (4), Nici (12) | All features traced to requirements and categorized
4 | DCPPC enhanced standards put forward to DCPPC and standards communities for review and comment | Natalie (24), Titus (8), Developers (24), David (4), Michael (2), Ian (6), Rebecca (16) | Proposed standards document submitted to communities

Task 3: Implement a working MVP/prototype. Labor cost: $84,171

Month | Milestone | Personnel (Hours) | Metric
3 | Sprint 1: API Abstraction Service prototype development | Developers (128), David (4), Michael (2), Product Manager (8), QA (8), Jeff (2) | First sprint feature implementation
3 | Sprint 2: API Abstraction Service prototype development | Developers (128), David (2), Michael (2), Product Manager (8), QA (8), Jeff (2) | Initial test version of prototype
4 | Sprints 1 and 2: API Translator Library prototype development | Developers (256), David (4), Michael (2), Product Manager (8), QA (8), Jeff (2) | Initial test version of prototype library published to GitHub
4 | Sprint 3: API Abstraction Service prototype QA and documentation | QA (40), Rebecca (16), Developers (44), David (4), Product Manager (2) | Documentation submitted; list of bugs from QA to developers
5 | Sprint 3: API Translator Library initial documentation and sample applications | Rebecca (16), Developers (88), David (2), Michael (2), Product Manager (8), QA (8), Jeff (2) | Documentation and samples published to GitHub
5 | Sprints 4, 5, and 6: API Abstraction Service prototype improvements/polishing | Developers (60), David (6), Michael (4), Product Manager (6), Natalie (8), QA (16), Jeff (2) | Completed service running on test environment
5 | Sprint 4: API Translator Library prototype improvements/polishing | Developers (48), David (2), Product Manager (4), QA (12), Natalie (8), Michael (4) | Updated version published to GitHub
6 | Sprints 5 and 6: API Translator Library prototype maintenance | Developers (40), David (2), Product Manager (4), Natalie (4) | Updated version published to GitHub
6 | Sprints 7 and 8: API Abstraction Service prototype maintenance | Developers (40), David (2), Product Manager (4), Natalie (4) | Updated version running on test environment
6 | Sprints 4 to 8: Communicate to DCPPC accepted MVP API(s) QA interval outcomes against released code test outcomes (e.g., issue code reviews, bug reports, document change requests, success statements) | QA (40), Product Manager (16), Developers (16) | Outcome document sent to DCPPC
6 | Sprints 5 to 8: MVP API(s) and DCPPC standards documentation of features, including mapping against fulfilled DCPPC requirements & supported user stories (e.g., Readthedocs content, requirements checklist updates, user stories cross-referenced to documentation of completed features, updated and/or newly proposed standards) | Developers (12), Natalie (4), Rebecca (36) | Additional reporting to suit needs


Capability 4. Cloud Agnostic Architecture and Frameworks

Total Cost: $377,762

Project Personnel, Contribution Levels, and Roles

Name | Role | Contribution to Capability
Brian Nosek, Ph.D. | Lead | Product planning and vision, strategic evaluation, report writing, community leadership, team management
Nici Pfeiffer | Product Manager | Gap analysis, requirements gathering, roadmap planning, status reporting
David Litherland | Project Manager | Project planning, project status, metrics, risk identification, resourcing, sequencing, dependencies, milestones, cross-team integration, development process management
Michael Haselton | Technical Lead | Directs technical development, prioritizes work, time estimation, analysis and recommends solutions, creates and enforces standards and procedures for technical solutions
Jeffrey Spies, Ph.D. | Product Vision; R&D Lead | Technical strategy/vision, product strategy/vision, leading R&D team, prototyping, technology stack selection, technical partner development
Natalie Meyers | Community Development | Community development and management, standards development and implementation
TBN from COS QA pool | QA Associate(s) | Gather requirements for test planning, create test plans and assert criteria, coordinate testing activities for QA team, execute testing, document bugs and improvements, maintain communication with developer leads
TBN from COS Developer pool | Developer(s) | Designs/codes applications, maintains existing applications, communicates application development progress and escalates roadblocks, ongoing data architecture for product
C. Titus Brown, Ph.D. | Consultant | Lend expertise and problem solving regarding abstracting APIs, technical infrastructure, and high-performance computing
Ian Taylor, Ph.D. | Consultant | Lend expertise and problem solving regarding abstracting APIs, technical infrastructure, and high-performance computing

Objectives for the Key Capability

Objective 1: Documented Plan
A substantial technology base already exists for Capability 4; an MVP will exist on Day 1 of the project period. We will conduct an initial review and gap analysis with members of the consortium to identify areas of collaboration, if any. This review will modify the planned activities below to maximize efficient development of the cloud agnostic architecture. Toward the end of the 180-day project period, we will conduct another review based on progress made to identify a project plan for the next stage of DCPPC.

Objective 2: Documented limitations and challenges for implementing that plan
The second review toward the end of the 180-day period will include a gap and needs analysis. This analysis will identify limitations of the existing solutions, and opportunities to redress those solutions or recommendations for alternative approaches.

Objective 3: Working prototype
OSF provides many of the requirements for Capability 4 and is in production with a substantial user base and support community. The existing infrastructure is the basis for the Day 1 MVP. During the project period, we will extend OSF to meet the specific objectives of the DCPPC specified as requirements in the next section, and based on what emerges from the initial gap analysis.

How key personnel will accomplish the objectives

Status of MVP on Day 1
OSF is a data commons comprising a collection of services, an ecosystem of add-ons, and a database of scholarly activity. With its composable, modular design, OSF successfully provides a cloud agnostic framework for facilitating interoperability across services for data storage. The OSF API for accessing data is an abstraction against the APIs of 8 storage providers so far (ownCloud, GitHub, Dropbox, figshare, Box.com, Dataverse, Google Drive, Amazon S3; with 5 additional--Bitbucket, GitLab, OneDrive, Dryad, and Fedora--nearly complete). The modular approach minimizes user lock-in to any service, including COS's. Connecting services makes it easy for end users to integrate the services they use and switch to alternate services.

The OSF is a platform as a service (PaaS) for developing applications to support research workflows across the research lifecycle. In that way, developers can leverage a collection of composable services to efficiently create new tools and interfaces. These services include a file storage abstraction service (WaterButler), metadata aggregation and search (SHARE), file rendering and conversion (MFR), authentication support for Shibboleth/SAML, CAS (protocol), and OAuth (CAS multi-protocol server), analytics (Keen), ops/deploy/security (Ansible and Kubernetes), UI components (Ember-OSF), an API, administrative controls, commenting, and moderation and review workflows. These services can support virtually any digital service for scholarly communication, including repositories, databases, journals, preprint servers, registries, curation tools, discovery layers, review services, project management, analytics tools, and content libraries.

With its API, COS is facilitating a services ecosystem. The ecosystem provides a substantial efficiency gain for the community of open science service developers. For example, the University of Notre Dame Center for Research Computing integrated their cloud computational platform with the OSF at a point when they were just starting to add Dropbox integration. By creating a single connection to OSF, they immediately were able to push and pull code and data from Dropbox and all the other storage providers supported on the OSF (e.g., AWS S3, Dataverse). As more repositories or other services are connected to the ecosystem, they become available to users immediately and automatically.

The reusability and extensibility of each service layer provides economy of scale. Moreover, modularity between layers leverages domain expertise for collective benefit. Software developers focus their time and expertise on creating robust, scalable, extensible modular tools. Scholarly communications experts leverage the software developers’ expertise and focus on design and development of collections and workflows. Scholars leverage the interfaces and workflows and focus on doing their substantive research. In other words, the layered, abstracted approach lets experts spend their time on their substantive expertise without needing to understand the underlying infrastructure of the OSF data commons.

OSF enables development of a variety of interfaces to facilitate discovery, visualization, analysis, and use of the digital objects stored in the services connected to the commons. As a public goods infrastructure with liberal licensing, organizations and other service providers can build their interfaces to provide services for specific benefits to their communities or users. For example, a cloud computation service provider could leverage OSF and their own services with a unique interface/workspace for particular research applications.

With one interface, http://osf.io/, research is organized into projects. Researchers connect services relevant to that project. For example, a researcher might connect a GitHub repo hosting code, a Box.com folder hosting some dynamic data, a Dataverse archive hosting some preserved data, and use OSF Storage for storing materials, manuscripts, and other supplementary information about the project. The interface facilitates interaction of digital objects across services. If the researcher wants to archive the data stored in Box.com for preservation, OSF can facilitate moving it into an alternative storage solution. The administrator of the project maintains permissions control for which parts of the project are private or publicly accessible. For private components, an administrator can authorize authenticated others to access the content. All digital objects, projects, and components have persistent GUIDs. Interaction with stored content is facilitated with the user interface and the API: https://developer.osf.io/.

OSF uses Central Authentication Service (CAS) software to provide users secure, single sign-on (SSO) to access multiple research applications/interfaces while providing login and password credentials only once. Because CAS supports multiple authentication protocols including SAML, OSF provides institutions (e.g., Duke) and other services (e.g., ORCID) the ability for users to use their institutional credentials to access OSF services. Already, 26 institutions are connected with many more in development. OSF offers two-factor authentication for added account security.

The OSF scholarly commons is in continuous development and improvement. The development priorities for this public goods infrastructure are in COS's Strategic Plan. The OSF codebase is entirely open source.

The Commons can use all of OSF's integrated repository capabilities. Users can add supplementary data, materials, links, or other files. If data, materials, and other digital objects are stored in different clouds, OSF maintains the connections between the services for discoverability. OSF can surface links to external repositories or integrate the external repositories to appear OSF-native via add-ons.

The following is a requirements status list for Day 1 of the project period. The first seven are requirements as described in the RFA--OSF provides MVPs for all of those, much of it quite mature, with enterprise-level code in production supporting a highly scalable service with almost 60,000 users already. Specific requirements supporting the general description from the RFA follow.

Requirement | Status

MVP: Enable interoperability of services and cross-discoverable data commons efforts
SHARE metadata aggregation across research lifecycle and ecosystem | Complete
Composable, multi-instance deploy architecture specification | Complete

MVP: Enable use of multiple clouds
File storage abstraction and common API | Complete
Deployment of OSF and related services (via Kubernetes) | In Progress

MVP: Provide services for authentication, authorization, digital IDs, metadata and data access that span multiple commons platforms/clouds so that data can be accessed transparently across commons platforms/clouds
Central Authentication Service (CAS) multi-protocol authentication server with support for single sign-on | Complete
ORCID single sign-on and integration | Complete
OSF-specific authorization and permissions control (accessible via API) | Complete
Integration with an enterprise access management system (e.g., Internet2's Grouper) | See plan for Capability 6
SHARE metadata aggregation across research lifecycle and ecosystem | Complete

MVP: Services for executing reproducible workflows across clouds so analysis tools can be ported easily across commons platforms/clouds, and so queries and analysis pipelines can be distributed across clouds and the results gathered | See plan for Capability 5

MVP: Minimize ingress and egress charges between clouds; make charges predictable and well-understood
File storage abstraction supports "direct" upload/download | Complete
Centralized logging across services | Complete
Modular storage backend that can support attaching storage that minimizes costs | Complete

MVP: Support resiliency and high availability of services available in the cloud
Redundant database architecture (allowing for immediate fail-over) | Complete
Hot database backups and point-in-time database recovery | Complete
Exercised/audited backups of database | Complete
Zero-downtime deployments and load balancing using HAProxy | Complete
Error logging and aggregation with Sentry | Complete
Task scheduling backed by reliable-delivery message queue (RabbitMQ) | Complete
File parity storage and auditing (beyond cloud-storage redundancy) | Complete
Automatic backup of file storage | Complete

MVP: Create a community-driven process for implementing and promulgating existing standards for data, tools, and semantics
Open source code base for all software components on public GitHub account | Complete

Specific completed requirements for the framework supporting the general requirements
Organizing files | Complete
File storage | Complete
File sharing in multiple file formats | Complete
Integration and interoperability with external file storage providers | Complete
APIs for ingestion and aggregation platforms | Complete
ORCID login | Complete
Open database of normalized metadata for all preprint services with ongoing ingestion | Complete
View files in browser | Complete
File rendering multiple formats in browser | Complete
Open source for community contribution and use | Complete
Open database of metadata content | Complete
Versioning | Complete
Infrastructure scalability for high concurrency | Complete
API for file storage services | Complete
File preservation and archival | Complete
Persistent identifiers | Complete
Attaching supplementary materials, research data, code, protocols | Complete
Integration with research lifecycle | Complete
Attaching co-authors with notification | Complete
Permissions system for co-author admin, revising privileges | Complete
Updating files | Complete
Subjects taxonomy | Complete
Mint DOIs | Complete
Link DOIs | Complete
Add unregistered co-authors with notification | Complete
Add metadata | Complete
Add license | Complete
Download files from browser | Complete
Download multiple files as zip | Complete
Analytics for views and download counts | Complete
Search via SHARE | Complete
Search and sort by provider, date, search relevance | Complete
Email notifications with assignable frequency and for particular actions (e.g., comments, file updates) | Complete
View-only links for ability to share without adding as contributor | Complete
Forks of projects | Complete
Registration of projects | Complete
Help docs/support for users | Complete
User claiming (for being added as author and discover name) | Complete
Login with institution credentials | Complete
Share a file on social media (LinkedIn, Facebook, Twitter, and email) | Complete
Commenting on files | Complete
Activity logs on project | Complete
Edit completed metadata | Complete
SPAM detection | Complete
Google Scholar optimization | Complete

Activities for Project Period

We will connect MODs/AGR, TOPMed, and GTEx databases, and computational and related services (Capability 5), in partnership with members of the computational community, including C. Titus Brown and Ian Taylor, who bring expertise in substantive problem solving, technical infrastructure, and high-performance computing (highly relevant when thinking about abstracting APIs). Abstracting across managed cloud services and high-performance computing centers and their varied APIs requires this additional expertise. While challenging, this is a critical next step to meet the needs of many users across many disciplines by leveraging work already happening in the community. Brown and Taylor have already done significant development using the OSF API.

We have jumpstarted the process of achieving FAIR compliance, and can meet this requirement within the MVP phase. A completed gap analysis of the FAIR Data Point Software Specification (FDP) and its alignment with Open Science Framework (OSF) services identified six action items to achieve this specification. These include:

● Offer SHARE or SHARE-like service as FAIR Data Point Discovery layer--SHARE is maintained by COS and features in Capability 7
● Users can trigger or register their assets to propagate metadata to SHARE
● Enhance file-level metadata feature set, including for FDP Distribution and Data Record levels
● Unify dcterms, foaf, dcat meta info into metadata across OSF object headers (see the sketch after this list)
● Create Catalog view for FAIR data sets
● Enhance granular access control for data sets and individual files, users, groups, and roles beyond the project/component level
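As a minimal sketch of the dcterms/foaf/dcat unification item above (using rdflib; the GUID, ORCID, and property values are placeholders, and this is illustrative rather than OSF's actual metadata implementation):

```python
# Illustrative sketch of unified dcterms/foaf/dcat metadata for an OSF
# object. Requires rdflib 5.0+ for the bundled DCAT namespace. The GUID,
# ORCID, and values are placeholders, not OSF's actual implementation.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, FOAF, RDF

graph = Graph()
dataset = URIRef("https://osf.io/abcde/")  # placeholder OSF GUID

graph.add((dataset, RDF.type, DCAT.Dataset))
graph.add((dataset, DCTERMS.title, Literal("Example FAIR data set")))
graph.add((dataset, DCTERMS.license,
           URIRef("https://creativecommons.org/licenses/by/4.0/")))

creator = URIRef("https://orcid.org/0000-0000-0000-0000")  # placeholder ORCID
graph.add((creator, RDF.type, FOAF.Person))
graph.add((dataset, DCTERMS.creator, creator))

# Serialized Turtle could be surfaced in object headers or a catalog view.
print(graph.serialize(format="turtle"))
```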

Relatedly, we will accelerate progress on meeting relevant data management standards and certifications such as HIPAA, FedRAMP, and FISMA, extending completed initial work from a NIST review.

Note on DCPPC Requirements - Capability 4

There are many requirements that may be desired for the DCPPC in Capability 4 that are not expressed in the RFP. The following summarizes the requirements expressed in the DCPPC as well as a selection of features we anticipate will be desired by NIH for the Data Commons, given our experience with building and maintaining these services, and based on user requests and experience for effective functionality.

We expect that the actual requirements list will be negotiated during the consortium formation phase. As such, we provide these additional requirements as illustrations of anticipated work. These include service improvements that lower barriers to entry for researchers, streamline workflows and user experience, and enhance capabilities of connecting services. Resourcing of Task 3 assumes these requirements or ones of equal effort.

Requirement | Status
Computational Services Integration: allowing commons platform providers to connect resources | Not started
FISMA / FedRAMP / HIPAA Compliance Review for moderate level with NIST process and process documentation requirements | Not started
Meet FAIR Standards and Metrics specified in Capability 1 | Not started
SHARE or SHARE-like service as FAIR Data Point Discovery layer | Not started
Users can trigger or register asset metadata to SHARE-like services | Not started
Enhance file-level metadata feature set, including for FDP Distribution and Data Record levels | Not started
Unify dcterms, foaf, dcat meta info into metadata across OSF object headers | Not started
Create Catalog view for FAIR data sets | Not started
Enhance granular access control for data sets and individual files, users, groups, and roles beyond the project/component level | Not started
Bulk adding content: streamlining the process of contributing bulk content into the Data Commons to support users with large numbers or volume of data objects | Not started
Language internationalization: improve accessibility of the Data Commons for users of the data worldwide | Not started
Institutional SSO: implementation of InCommon to enable single sign-on for researchers using institutional credentials; this will have particular value in relation to identity management for access to sensitive data (see also Capability 6) | In progress
Subjects taxonomy: improving semantics/metadata for digital objects in the Data Commons by providing a subjects taxonomy | In progress
Enhanced file provenance and metadata for integrated third parties: provide universal abstraction layers and identifiers of digital objects | In progress
Internationalization of storage locations: extend OSF's current features to store data in particular geographic regions in order to remain in compliance with local regulations governing the location of data storage and support the automatic distribution of data to minimize bandwidth or transfer costs | In progress
Enhanced commenting system: improve user experience of dynamic discussion and collaboration around objects in the Data Commons | In progress
Submissions system managing multiple submissions of digital objects: universal abstraction layers and identifiers to deposit data and metadata into any service connected to the Data Commons | Not started
Enhanced specification of contributions to digital objects: enables researchers improved means to identify role/contribution to digital objects in the Data Commons | Not started
Expansion of Services Ecosystem: augment universal abstraction layer to further reduce friction for researchers to access and contribute data | In progress


Project management plan

Technical development will follow COS's agile product development process. We work directly with the product owner in two-week development cycles called sprints. At the end of each sprint, the development team and product owner review progress, assess release feasibility, and discuss where process or product can be improved.

Gap analysis and report writing will follow COS's product evaluation process. Two teams--R&D and Product--support assessment and recommendations. R&D will conduct a technical review of the services landscape and an assessment of alternative solutions and opportunities for leveraging existing resources (tools or expertise) to meet DCPPC objectives. Product will conduct a review of the existing solution against the requirements and user stories to define a roadmap for the next stage of the project.

Task plan

Task 1: Create a well-documented plan. Labor cost: $33,826

Month | Milestone | Personnel (Hours) | Metric
1 | Initial gap and needs analysis in concert with other members of consortium | Nici (24), Natalie (40), Jeff (16), Brian (8) | Delivered gap/needs analysis
2 | Documentation of roadmap for 180-day project period | Nici (16), Jeff (8), Brian (8), David (8), Michael (8), Developers (40) | Delivered roadmap
3 | Documentation of first half project progress | Nici (8), David (8), Rebecca (40) | Outcomes of sprints
5 | Collection of assessments by consortium members for next steps | Nici (8), Natalie (16) | Delivered assessments by consortium members
6 | Documentation of second half project progress | Nici (8), David (8), Rebecca (40) | Outcomes of sprints
6 | Completion and delivery of plan for stage two | Nici (16), Natalie (16), Jeff (16), Brian (16), David (16), Michael (16) | Delivered plan

Task 2: Document limitations and challenges for implementing the plan. Labor cost: $8,629

Month | Milestone | Personnel (Hours) | Metric
1 | Documentation of anticipated limitations informed by initial gap and needs analysis | Nici (4), Natalie (12), Jeff (4), Brian (2) | Delivered gap/needs analysis
3 | Documentation of identified limitations and challenges from first half of sprints | Nici (8), David (8) | Outcomes of sprints
6 | Documentation of limitations and challenges in plan for stage two | Nici (8), Natalie (8), Jeff (8), Brian (8), David (8), Michael (8) | Delivered plan

Task 3: Implement a working MVP/prototype. Labor cost: $248,686

Month | Milestone | Personnel (Hours) | Metric
1 | Gap Analysis: Product and R&D review product landscape | Nici (16), Jeff (8), Developers (80) | Core App Evaluation
2 | Sprint 1: MODs/AGR, TOPMed, GTEx database integrations | Developers (480), QA (8), David (24), Product Manager (8), Michael (20), Jeff (2) | Core App: Biomedical Databases
2 | Sprint 2: MODs/AGR, TOPMed, GTEx database integrations | Developers (480), QA (8), David (24), Product Manager (8), Michael (20), Jeff (2) | Core App: Biomedical Databases
3 | Sprint 3: Seamless deposit and discovery services | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Discovery
3 | Sprint 3: Computational service registration | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Computation Registration
3 | Sprint 4: Seamless deposit and discovery services | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Discovery
3 | Sprint 4: Computational service registration | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Computation Registration
4 | Sprint 5: Computation task scheduling and result aggregation | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Computation Tasks & Results
4 | Sprint 5: Meet FAIR Standards and Metrics | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: FAIR Standards
4 | Sprint 6: Computation task scheduling and result aggregation | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Computation Tasks & Results
4 | Sprint 6: Meet FAIR Standards and Metrics | Developers (160), QA (8), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: FAIR Standards
5 | Sprint 7: Computation task scheduling and result aggregation | Developers (160), QA (8), Rebecca (10), David (12), Product Manager (8), Michael (8), Jeff (2) | Core App: Computation Tasks & Results
5 | Sprint 8: FISMA / FedRAMP / HIPAA compliance review | DevOps (160), Developers (80), Michael (80), David (12), Nici (4), Jeff (2) | Core App: Compliance
6 | Final: Integration usability testing, quality assurance & documentation | QA (80), Rebecca (40), Nici (4), Product Manager (8), Michael (16), David (12) | Test Report


Capability 5. Workspaces for Computation

Total Cost: $375,657

In the last 10 years, several scientific methodologies have emerged to aid researchers in describing and developing bioinformatics pipelines. The Common Workflow Language (CWL) focuses on portability and scalability across a variety of software and hardware environments by allowing researchers to describe workflows and tools using a common format and making it easy to incorporate tools as Docker containers within such flows. The Workflow Description Language (WDL), on the other hand, is an emerging workflow language that is designed to be readable and writable by humans. Environments such as Dockstore provide a meeting place where users can share CWL or WDL workflows in a way that makes them machine readable and runnable in a variety of environments. Dockstore also provides links to third-party environments that can execute workflows, e.g., DNAstack for analysis, but support is currently limited to only a small percentage of workflows that offer this capability. There are also several efforts underway to create tooling for CWL and WDL, such as toolkits for executing workflows (e.g., the Rabix Executor for CWL and the Cromwell workflow execution engine for WDL) and editing and visualization tools (e.g., Rabix CWL-SVG for CWL or Pipeline for WDL). However, such efforts are currently fragmented, and what is lacking is a common environment where these tools can be brought together to help researchers realize the full potential of their workflows. A recent study on genomics workflows shows that tools should either be packaged along with the workflow or made available via public repositories to assure reproducibility and track provenance (Kanwal, Zaib Khan, Lonie, & Sinnott, 2017). Genomic data and its analysis are challenging for researchers because of their complexity and size and the distribution of data across diverse platforms. Thus, full-featured solutions with visualization on scalable infrastructures, as well as platforms for sharing data and workflows, are desired to support researchers in focusing on their scientific questions instead of becoming deeply acquainted with the technical details of a workflow language or computing infrastructure. Especially in the area of personalized medicine, easy-to-use solutions for sharing patient data are still missing.
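To make the execution side concrete, a workflow engine consumes a workflow description plus an input object. Below is a minimal sketch of driving a CWL runner from Python (cwltool here, with placeholder workflow and input file names; the Rabix Executor exposes a comparable command-line interface):

```python
# Minimal sketch of executing a CWL workflow by shelling out to a CWL
# runner (cwltool). Workflow and input file names are placeholders.
import json
import subprocess

# A CWL job object binds workflow inputs to concrete files/values.
inputs = {"reads": {"class": "File", "path": "sample.fastq"}}
with open("inputs.json", "w") as fh:
    json.dump(inputs, fh)

# cwltool prints a JSON description of the output objects on stdout.
result = subprocess.run(
    ["cwltool", "alignment-workflow.cwl", "inputs.json"],
    capture_output=True, text=True, check=True,
)
outputs = json.loads(result.stdout)
print(outputs)
```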

Workspaces for computation therefore will address this gap by providing a collaborative workspace to support interaction throughout the full lifecycle of CWL and WDL workflows including the visualization of results. It will provide users with the ability to store, create, and publish digital objects and workflows defined using either CWL or WDL. Users will be able to upload and visualize workflows within an OSF project, which will form the basis for a workflow-oriented collaborative space. The integration will capitalize on the existing Single Page Application (SPA) framework and OSF REST interface to create a customized environment for scientific pipelines, which will utilize all of the existing collaborative OSF project tools, including the ability to fork and publish projects.


Since the data used in TOPMed, MOD, and GTEx require FISMA compliance at the moderate level and adherence to HIPAA guidelines and/or the "Precision Medicine Initiative: Data Security Policy Principles and Framework", we will build on the experience and accreditation with FISMA we achieved in projects such as VectorBase to fulfill the required security in the proposed framework. The security will leverage the audit trails and access controls of OSF to achieve this goal.

Project Personnel, Contribution Levels, and Roles

Name | Role | Contribution to Capability
Ian Taylor, Ph.D. | Technical Lead | Responsible for designing and architecting the integration of the MVP components
Sandra Gesing, Ph.D. | Senior Personnel | Building of example workflows and visualization of results
Natalie Meyers | Community Development | Community development and management, standards development and implementation
Michael Haselton | Technical Lead | Directs technical development, prioritizes work, time estimation, analysis and recommends solutions, creates and enforces standards and procedures for technical solutions
Jeffrey Spies, Ph.D. | Technical Advisor | Advisor to architecture and design for this capability
Nici Pfeiffer | Product Manager | Gap analysis, requirements gathering, roadmap planning, status reporting
TBN from COS Developer pool | Developer(s) | Designs/codes applications, maintains existing applications, communicates application development progress and escalates roadblocks, ongoing data architecture for product

Objectives for the Key Capability

Workspaces for Computation will meet the three outcomes of Stage 1, defined by NIH's RM-17-026. We will provide: (1) a well-documented plan for selecting and defining how CWL and WDL technologies can be integrated, along with a detailed budget for each consortium member involved in this activity, (2) a well-justified description of the limitations and challenges for the plan with recommendations for additional expertise or resources if available, and (3) a fully working MVP to demonstrate the viability of the plan. The MVP is intended to be adapted after feedback from early adopters and extended by further features. The objectives, approach, tasks, and milestones defined below support meeting these objectives.

Objective 1: Documented Plan
We have conducted an initial review of the various technologies that support the CWL and WDL workflow lifecycle, and environments that promote the sharing of them. In this initial phase, we will experiment with such technologies and assess their level of maturity, with the aim of selecting the most suitable tools for development of the MVP. We will also identify potential areas of collaboration with other members of the consortium. Toward the end of the 180-day project period, based on experience gained and lessons learned, we will perform a further review and propose a roadmap for the next stage of DCPPC.

Objective 2: Documented limitations and challenges for implementing that plan
The second review toward the end of the 180-day period will review what was accomplished and outline any limitations of the approach to offer a critical assessment of the implementation process. This analysis will identify limitations of the existing solutions, and opportunities to redress those solutions or recommendations for alternative approaches.

Objective 3: Working prototype
The OSF and WholeTale environments provide the underlying infrastructure to support the development of Capability 5. OSF is already in production with a substantial user base and support community, and WholeTale is releasing a production system for use in September 2017. These systems will provide the basis for the MVP. We will first develop a common EmberJS-based app, which will allow us to connect the systems and provide a common user environment for CWL and WDL workflows. We have developed a similar system for the CRAFT project, and we anticipate a working backbone for the MVP within the first month of the project. After this initial integration effort, work can proceed in parallel, with the concurrent development of both of the Docker-based workflow environments for WDL and CWL for WholeTale, and the visualization tools within the collaborative space. Data access will be handled at a fine-grained level so that users can share their data with all members of a project or with sub-groups, or keep data private. We outline this in more detail below.

How key personnel will accomplish the objectives

The modern use of computation for scientific discovery has fundamentally changed the way scientists interact with data. Researchers now routinely interact with large amounts of data, various computational infrastructures, and sophisticated analysis tools to evaluate their hypotheses and results. Capability 5 is a natural extension to this research evolution and will provide scientists a means of publishing digital research objects and analytical pipelines to enable interactivity and streamlined reproducibility of such experiments. By exposing a store, create, and publish facility for CWL and WDL workflows, using the OSF API, users will be able to publish their research within an OSF project workspace and expose the analysis of diverse data sets and tools for the visualization of results. An overview of the different layers we anticipate building into the MVP is shown in Figure 3 below. The figure shows five layers. The lower layers are the workflow language formats for the MVP. WDL provides a way to describe command-line tools (and CWL extends this to use Docker containers) and their inter-dependencies to create workflows. WDL and CWL are specifications, and we intend to provide a visual means of modelling the development lifecycle by importing, executing, and exporting CWL workflows within an OSF project workspace.

Figure 3: An overview of Workspaces for Computation, showing the layers of integration for supporting the full lifecycle of CWL and WDL workflows.

Above this layer, we show a representative existing mechanism that researchers use for sharing such workflows, namely Dockstore. Dockstore offers an API, and we will assess whether that API can be used within the resulting system to sync Dockstore content with the OSF.
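As a rough illustration of this assessment, the sketch below enumerates Dockstore tools through the GA4GH Tool Registry Service style of API that Dockstore exposes; the exact base path and response fields are assumptions to be verified during Objective 1.

```python
# Hypothetical sketch: enumerate CWL/WDL workflows from Dockstore via its
# GA4GH Tool Registry Service API. The base path and field names are
# assumptions and would be confirmed against the live API.
import requests

TRS_BASE = "https://dockstore.org/api/ga4gh/v2/tools"  # assumed endpoint

def list_tools(descriptor="CWL", limit=10):
    resp = requests.get(TRS_BASE, params={"limit": limit})
    resp.raise_for_status()
    # Keep tools that publish at least one version in the requested language.
    return [t for t in resp.json()
            if any(descriptor in v.get("descriptor_type", [])
                   for v in t.get("versions", []))]
```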

For editing and visualization, we will explore tools such as joint.js, or customized JavaScript renderings that already implement workflow support, such as Rabix CWL-SVG for CWL and Pipeline for WDL. Notre Dame has extensive experience designing with such JavaScript visualization tools; for example, the Craft project's circuit designer flows are implemented in this way and also leverage OSF for the backend.

Underpinning execution will be a binding between an OSF project and WDL/CWL workflows and their associated collections of Docker containers, exposing either a complex workflow or an individual Docker container for an individual research contribution. Building on prior work exposing research containers using the OSF (Taylor & Nabrzyski, 2016) and creating and exposing Tales (data and container pairs) in the WholeTale project, we will provide the capability of exposing CWL and WDL workflows (from services like Dockstore) within an OSF project.


For this aspect, OSF will be connected to the WholeTale environment to enable a computational environment for executing CWL/WDL workflows on datasets in the OSF or other storage locations. Underpinning the computational environment will be workflow-oriented Docker containers, which will be architected and added as so-called frontends to WholeTale. Since WholeTale frontends expose mount points for data access and synchronization with a user's account, users will not only have full access to workflows and their execution environments, but will also be able to import their own datasets, as well as TOPMed, MOD, and GTEx datasets, for analysis, and to use pre-packaged Jupyter notebooks for the visualization of results. Using WholeTale, a user can launch a user-supplied Docker container that exposes a Web front end, such as Jupyter. Jupyter gives us flexibility in the number and kind of tools used to visualize specific data. We will start with widely used tools we have experience with, such as R and Bioconductor, but will also consider novel developments such as CIRCOS. WholeTale will allow combinations of programmable front-end interfaces (R or Jupyter) to be hosted alongside backend support for workflow execution. To this end, we will create two WholeTale containers to support CWL and WDL. We are currently evaluating the Rabix Executor for CWL and the Cromwell workflow execution engine for WDL. Visualization of result files is a complex topic, dependent on the tools used in the workflows and the diverse data formats. We therefore aim to deliver example pre-packaged Jupyter notebooks for widely used tools such as R and Bioconductor, NCBI's BOV (a web-based BLAST output visualization tool), and CIRCOS. Jupyter notebooks provide a highly extensible and scalable environment for visualization, and the community will be able to add their own pre-packaged solutions after the MVP has been delivered.
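As a minimal sketch of the dispatch step inside such a frontend container, the following chooses an engine by descriptor type; the commands follow the public cwltool and Cromwell CLIs, though exact flags vary by version and should be treated as assumptions.

```python
# Minimal sketch: dispatch a workflow to the right execution engine inside a
# WholeTale frontend container. Engine invocation flags are assumptions and
# vary by cwltool/Cromwell version.
import subprocess

def run_workflow(descriptor_path, inputs_path):
    if descriptor_path.endswith(".cwl"):
        cmd = ["cwltool", descriptor_path, inputs_path]
    elif descriptor_path.endswith(".wdl"):
        cmd = ["java", "-jar", "cromwell.jar", "run",
               descriptor_path, "--inputs", inputs_path]
    else:
        raise ValueError("expected a .cwl or .wdl descriptor")
    # Outputs land in the mounted WholeTale folder, so results persist.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```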

Once the containers are running, WholeTale will provide mechanisms for a user to introspect the individual running Docker container instances and, where relevant, their user interfaces. WholeTale also provides a sophisticated Web GUI with a full-featured drag-and-drop file manager for importing data, plus FUSE file-system extensions that allow a folder in WholeTale to be mounted onto the running container. Any changes to the files during execution are preserved and saved within the system so that a researcher can interact with the data within the same environment.

Finally, at the uppermost layer, we have OSF and WholeTale and the connection between them. We will connect OSF to WholeTale in one of two ways: (1) by providing a file-bundling exchange that passes data packages on OSF to WholeTale for execution, similar to how Dockstore provides workflow execution for some of its workflows using DNAstack; or (2) by giving OSF permission to connect to a WholeTale user account with particular authorization functions using OAuth. The former provides a delegation model, whereas the latter would allow OSF to interface with WholeTale services directly. The glue that ties the systems together will be an EmberJS single-page application (SPA). We have used EmberJS previously in the Craft project to interface with OSF, and we also use EmberJS as the front-end toolkit for WholeTale. Because both systems already have EmberJS models, controllers, and views built by our team, we have considerable flexibility in how we combine them. During Objective 1, we will investigate the various options and provide a concrete plan for the MVP implementation.

We anticipate delivering the resulting integration using a hybrid EmberJS and OSF environment, which we used successfully for Craft. Within this architecture, we use EmberJS to implement new or customized interfaces (e.g., CWL/WDL visualization, result visualization in Jupyter) while reusing interfaces that already exist on the OSF by providing a customized OSF instance that works alongside the EmberJS app. We will work closely with the OSF team to identify the most sustainable way of hosting this OSF instance for the project.

The Commons can leverage OSF's API for interfacing with OSF data and the open source WholeTale system for setting up and interacting with running research containers. We will also use open source platforms, such as rabix.io, joint.js, or Node-RED, for the visualization of the CWL workflows, and an open CWL execution engine for running the resulting flows. CRC has a group of 50 research software developers, system administrators, and faculty to support this work. The team has extensive experience with bioinformatics workflows in projects such as VectorBase, and with workflow-enabled science gateways such as Galaxy. The team does not propose using its own computational resources. The CRC has ongoing collaborations with both AWS (Amazon) and GCP (Google Cloud Platform); the solution will be enabled on these public cloud platforms and against the abstracted API developed in Capability 4. A development testbed will be deployed using CRC resources.
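To make the OSF API's role concrete, here is a hedged sketch of reading a project's files through the public OSF API v2; the project id and token are placeholders.

```python
# Sketch: list files in an OSF project's default storage via the OSF API v2.
# `node_id` and `token` are placeholders for a real project and credential.
import requests

API = "https://api.osf.io/v2"

def list_project_files(node_id, token):
    resp = requests.get(
        f"{API}/nodes/{node_id}/files/osfstorage/",
        headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    # Each entry carries its name and, for files, a download link.
    return [(f["attributes"]["name"], f["links"].get("download"))
            for f in resp.json()["data"]]
```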

COS brings considerable expertise in Docker, Docker orchestration (e.g., Kubernetes), and product-level considerations with regard to using, and making usable, computational and analytics tools, and will contribute this expertise to the CRC. CRC and COS have collaborated successfully for two years on applying OSF to support diverse research projects.

Activities for Project Period
The Notre Dame team has already done significant development using the OSF API and has been involved in architecting WholeTale and developing the WholeTale Web dashboard. The following are the known requirements for the project period. These will be reviewed and revised as necessary upon announcement of the consortium members and identification of their available tooling.

Requirement | Status
Workspaces for Computation app instances: create EmberJS app for interfacing with the OSF API and deploy an OSF instance to create the hybrid deployment stack | Not started
Workflow Visualization Support: add WDL and CWL upload/download and editing/visualization interfaces to the Workspaces for Computation app | Not started
Result Visualization Support: create example WDL and CWL workflows and visualization tools in Jupyter | Not started
CWL Image: create R/Jupyter Docker WholeTale image with execution support for CWL | Not started
WDL Image: create R/Jupyter Docker WholeTale image with execution support for WDL | Not started
OSF/WholeTale: connect and interface the OSF and WholeTale systems | Not started

Project management plan
ND's technical development will use the CRC's Scrum development approach. The gap analysis and report writing will follow COS's evaluation process format. The R&D and Product teams will conduct the technical reviews and specifications for the MVP and next-stage roadmaps to meet DCPPC objectives, and will act as Scrum product owners to oversee that development meets requirements. The COS development team will consist of a Scrum master and a team of developers for its contribution to building out the MVP.

Task Plan
Task 1: Create a well-documented plan
Labor cost: $36,635

Month | Milestone | Personnel (Hours) | Metric
1 | Explore architecture | Ian (20), Jeff (20), Michael (20), COS Developers (20) | Initial assessment
1 | Initial experimentation and evaluation of maturity of tools and gap evaluation of technologies | Ian (40), Sandra (40) | Delivered gap/needs analysis
1 | Documentation of roadmap for project period: MVP design, task distribution, and technical project plan | Ian (40), Sandra (20), Jeff (8), Natalie (8), Michael (8), COS Developers (8) | System design document
3 | Status report for first half of project progress | Ian (10), Sandra (10), Nici (8) | Well-documented plan
5 | Collection of assessments by consortium members for next steps | Ian (4), Sandra (4), Natalie (16) | Delivered assessments by consortium members
6 | Status report of second half of project progress | Ian (10), Sandra (10) | Outcomes of sprints
6 | Completion and delivery of plan for stage two | Ian (16), Sandra (8), Natalie (16), Jeff (8), Michael (8), COS Developers (8) | Delivered plan

Task 2: Document limitations and challenges for implementing the plan
Labor cost: $7,612

Month | Milestone | Personnel (Hours) | Metric
1 | Documentation of anticipated limitations in initial gap and needs analysis | Ian (12), Sandra (4), Jeff (2) | Report
2 | Usability design | Ian (10), Sandra (20) | System design document
3 | Identify limitations of approach at halfway point of project | Ian (10), Sandra (8) | Report
3 | Documentation of limitations and challenges in plan for stage two | Ian (8), Sandra (8) | Delivered plan

MVP implementation can start immediately with the implementation of the Ember app, while the supporting infrastructure is deployed concurrently. The integration of visualization tools for data in Jupyter and the creation of workflows can start concurrently with setting up the OSF instance. The project will be organized into two-week sprints, each defined by user stories. The anticipated topics for each sprint are provided below.

Task 3: Implement a working MVP/prototype
Labor cost: $291,606

Month | Milestone | Personnel (Hours) | Metric
1 | Integration of OSF instance with access to Docker, AWS, and GCP | ND Technical Support (20), COS Developers (80) | OSF Integration
1 | Integration with OSF instance and workspace server | ND Programmers (80), COS Developers (80) | OSF Integration
1 | Sprint 1: Core Ember OSF app development | ND Programmers (160), Ian (10), Sandra (4) | Core App
1 | Sprint 2: Ember OSF API integration | ND Programmers (160), Ian (10), Sandra (4), COS Developers (80) | Core App API
2 | Sprint 3: Ember app upload/download for CWL/WDL and CRUD management interfaces | ND Programmers (160), Ian (10), Sandra (4), COS Developer Support (80) | Core App
2 | Sprint 4: Ember app CWL/WDL versioning support | ND Programmers (160), Ian (10), Sandra (4) | Core App Versioning
3 | Sprint 5: Ember app CWL visualization & integration of CWL/WDL workflows in Jupyter | ND Programmers (160), Ian (10), Sandra (16) | Core App Viz CWL
3 | Sprint 6: Ember app WDL visualization & integration of visualization tools in Jupyter | ND Programmers (160), Ian (10), Sandra (16) | Core App Viz WDL
4 | Sprint 7: OSF WholeTale integration and CWL Docker image | ND Programmers (160), Ian (10), Sandra (5), Jeff (1), Michael (1) | Core App Viz WT CWL
4 | Sprint 8: OSF WholeTale integration and CWL Docker image | ND Programmers (160), Ian (10), Sandra (5), Jeff (1), Michael (1), COS Dev (1) | Core App Viz WT CWL
5 | Sprint 9: OSF WholeTale integration and WDL Docker image | ND Programmers (160), Ian (10), Sandra (5), Jeff (1), Michael (1), COS Dev (1) | Core App Viz WT1 WDL
5 | Sprint 10: OSF WholeTale integration and WDL Docker image | ND Programmers (160), Ian (10), Sandra (5), Jeff (1), Michael (1), COS Dev (1) | Core App Viz WT1 WDL
6 | Sprint 11: CWL use case implementation and testing for OSF and WholeTale | ND Programmers (160), Ian (10), Sandra (10), COS Developers (80) | Core App Viz WT1 CWL Integration Testing
6 | Sprint 12: WDL use case implementation and testing for OSF and WholeTale | ND Programmers (160), Ian (10), Sandra (10), COS Developers (80) | Core App Viz WT1 WDL Integration Testing
6 | Leveraging diverse infrastructures for compute (Docker, AWS, GCP) | ND Technical Support (60), COS Developers (80) | MVP with access to Docker, AWS, GCP
6 | Integration and usability tests, quality assurance | Ian (20), Sandra (20) | Test report

Capability 6. Research Ethics, Privacy, and Security

Total Cost: $223,079

Project Personnel, Contribution Levels, and Roles

Name | Role | Contribution to Capability
Brian Nosek, Ph.D. | Lead | Product planning and vision, strategic evaluation, report writing, community leadership, team management
Nici Pfeiffer | Product Manager | Gap analysis, requirements gathering, roadmap planning, status reporting
David Litherland | Project Manager | Project planning, project status, metrics, risk identification, resourcing, sequencing, dependencies, milestones, cross-team integration, development process management
Michael Haselton | Technical Lead | Directs technical development, prioritizes work, time estimation, analysis and recommending solutions, creates and enforces standards and procedures for technical solutions
Jeffrey Spies, Ph.D. | Product Vision, Lead R&D | Technical strategy/vision, product strategy/vision, leading R&D team, prototyping, technology stack selection, technical partner development
Natalie Meyers | Community Engagement | Community engagement, standards review, standards development and implementation plan review for interoperability & compliance, liaison to standards bodies and expert consultants as needed
TBN from COS QA pool | QA Associate(s) | Gather requirements for test planning, create test plans and assert criteria, coordinate testing activities for QA team, execute testing, document bugs and improvements, maintain communication with developer leads
TBN from COS Developer pool | Developer(s) | Designs/codes applications, maintains existing applications, communicates application development progress and escalates roadblocks, ongoing data architecture for product

Objectives for the Key Capability
Objective 1: Documented Plan
Many of the key technologies to support privacy, security, and ethical management of data for the Data Commons are available via the OSF and will support the MVP on Day 1 of the project period. The key activities for the remaining days in Stage 1 will be an analysis of the improvements needed to meet the MODs, TOPMed, and GTEx use cases and to integrate their datasets. At the onset of the project period, we will explore and identify additional technical objectives with the partner repositories and create a project plan for the next stage of the DCPPC.

Objective 2: Documented limitations and challenges for implementing that plan
We will also conduct a gap and needs analysis to identify novel issues that may arise from data joining, drawing in particular on feedback from ethics/human subjects experts. This is important because there are few established examples of ethical management of open, joined data. The analysis will include both a technical and an ethical review addressing human subjects considerations, and a review of relevant solutions for elements of the objective, such as NCATS' GRDR program for de-identification.

Objective 3: Working prototype
OSF currently provides much of the "core" security, privacy, and permissions technology needed to support Capability 6. The focus of the 180-day project period will be enhancing those services and meeting the privacy and ethical needs identified in the use cases of the pilot datasets. Additionally, we will partner with the GA4GH Security Working Group, data ethics experts, and MODs/AGR, TOPMed, and GTEx administrators as needed to plan or add appropriate functionality to the OSF. Based on initial discussions with MODs/AGR, TOPMed, and GTEx administrators, we expect to add requirements to the known list described in the next section.

How key personnel will accomplish the objectives
Status of MVP on Day 1
The OSF uses identified best practices and robust security guidelines in all aspects of its application and web server infrastructure. Following the principle of least privilege, system processes are provisioned with only the permissions needed to perform their specific operation. Ingress and egress traffic is controlled via internal and external firewalls between every server and application. Web traffic is monitored via ModSecurity for malicious attacks and abnormalities. Application and server logs are centralized and provide real-time alerting.


Infrastructure access is controlled via Bastion VPN servers configured with cryptographic port knocking. Vulnerability scanning is performed via Nessus on a weekly basis. OSF uses HackerOne for its vulnerability coordination and bug bounty program.

OSF employees follow strict security policies. Developers’ laptops must use full-disk encryption with strong passwords. Access to logging, unhandled exceptions, database replicas, and servers or containers is extremely limited. Developers are required to show cause before being allowed access to each of these resources. Before access is granted they must prove they understand data privacy and sign a copy of COS’ data privacy policy.

OSF uses Central Authentication Service (CAS) software to provide users secure single sign-on (SSO) across multiple research applications (preprints, branded preprints, registries, meetings, institutions) while requiring login and password credentials only once. Because CAS supports multiple authentication protocols, including OAuth2, CAS, and SAML, OSF lets institutions and other services enable their users to access OSF services with institutional credentials. OSF offers two-factor authentication for added account security.

OSF privacy practices are user-centered in that research administrators control their own projects/data sets. Researchers create projects and add data storage repositories to their projects using their service credentials and tokens. OSF projects are private by default, and project administrators have full control over which authenticated users can access the project and connected services. When administrators authorize users for access, they can set permissions--read, read/write, admin--for each user. Those permissions can apply globally to the entire project or uniquely to individual components. For example, an administrator might approve a user to access code and metadata but deny access to sensitive raw data until the user provides evidence of institutional human subjects approval. Alternatively, administrators can create view-only links that give anyone with the link access to authorized components of projects. Such links are appropriate for providing access to non-sensitive data in delayed-sharing circumstances (e.g., providing links to data and code during peer review), but are not appropriate for controlled access to sensitive data. Administrators can revoke access to authorized users or view-only links as needed. Administrators and other authorized users can opt to receive notifications of additions and changes to the digital objects stored on OSF. Finally, if they have the legal and ethical ability to do so, administrators can make some or all of the project completely open so that even visitors who are not logged into the service can discover and access the digital objects.
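As a hedged sketch of how this permissions model looks through the OSF API v2, the following grants a user read-only access to a project; the node id, user id, and token are placeholders.

```python
# Sketch: add a contributor with read-only permission via the OSF API v2
# (JSON-API payload shape; ids and token are placeholders).
import requests

def add_contributor(node_id, user_id, token, permission="read"):
    payload = {"data": {
        "type": "contributors",
        "attributes": {"permission": permission, "bibliographic": False},
        "relationships": {"users": {"data": {"type": "users", "id": user_id}}},
    }}
    resp = requests.post(
        f"https://api.osf.io/v2/nodes/{node_id}/contributors/",
        json=payload,
        headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return resp.json()
```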

More generally, OSF has legal and ethical responsibilities in its role as a data repository for private or public storage. COS's and users' responsibilities are articulated in OSF's Terms of Service and Privacy Policy. For example, for addressing copyright violations, OSF operates in accordance with the Digital Millennium Copyright Act (17 USC § 512). Notifications of claimed copyright infringement are sent to COS's Designated Agent and are evaluated promptly in accordance with those legal requirements, including, as appropriate, rapid triage, temporarily blocking access, removing content, and occasionally engaging legal counsel for review. Following the full investigation, the Designated Agent documents the process and resolution and informs relevant parties. The terms of use also clarify the conditions under which COS may remove unlawful or offensive items.

Requirement | Status
Internal and external firewalls for all servers/applications for ingress and egress traffic | Complete
System processes provisioned with only necessary permissions (least privilege) | Complete
Attack monitoring via ModSecurity | Complete
Centralized app and server logs | Complete
Bastion VPN with cryptographic port knocking configuration | Complete
Vulnerability scans via Nessus | Complete
Vulnerability/bug bounty program via HackerOne | Complete
Strict employee security policies | Complete
Single sign-on via CAS supporting, e.g., SAML & OAuth2 | Complete
Two-factor authentication | Complete
Terms of Service | Complete
Privacy Policy | Complete
DMCA process | Complete
Project administrators manage access control to data | Complete
Projects private by default | Complete
Enriched permissions system for project administrators to manage data access with read-only, read+write, and admin permission settings | Complete
Public sharing and view-only links | Complete
Revoking access | Complete
Help/Support staff | Complete

Activities for Project Period
Ethics, privacy, and security are central to the FAIR-compliant goals of the Data Commons, with regard both to the privacy and security needs of research users and to sensitive data, particularly data from human participants. OSF uses Central Authentication Service (CAS) software to provide users secure single sign-on (SSO) across multiple research applications/interfaces while requiring login and password credentials only once. Because CAS supports multiple authentication protocols, including SAML, OSF lets institutions and other services (e.g., ORCID) enable their users to access OSF services with institutional credentials.


This ability to connect the identity of users at institutions is the first phase of a two-phase plan to integrate institutional identity and access management (IAM; i.e., third-party authentication and authorization). For example, COS has been in discussion with the University of Notre Dame and New York University to coordinate institutional and user research and feedback in the design of this second phase, which could include, for example, integrations with Internet2’s Grouper service.

At present, administrators create ad hoc mechanisms for others requesting data access--such as publicly posting a standard form from their organization that is submitted via email. We will create a streamlined workflow so that administrators can offer a mechanism for handling controlled access requests. Projects enabling access requests will make metadata about the project contents available for search and discovery. A 'request access' button will initiate a simple workflow for a registered user to submit an access request by providing the information the project administrator needs to evaluate and approve or deny the request. The approval process will integrate directly with OSF's permission system for simple access management by administrators. The access request mechanism will be designed with MODs/AGR, TOPMed, and GTEx administrators and the GA4GH Security Working Group as initial stakeholders for design, testing, and implementation. Ideally, the approach could be presented for review at this fall's International Workshop on Genome Privacy and Security (GenoPri17). The service will also be available for all OSF projects and hosted data sets (already more than 100,000).

The OSF provides an ideal platform for introducing interventions or education to comply with policies regarding research ethics and privacy. Integration with the research workflow--one of the main design considerations of the OSF--increases efficacy by making the action easy, automatic, and relevant to the researcher, rather than requiring extra work for which there exist few tangible incentives. For example, when a user uploads a file recognized as a data set, checks could be run that look for patterns suggesting sensitive data, alerting the user to the potential issue for further review before the data set is released.
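A minimal sketch of the kind of pattern check described above follows; the patterns are illustrative examples, not a complete or validated rule set for de-identification.

```python
# Illustrative sketch: flag text that matches identifier-like patterns so the
# uploader can review before release. Patterns are examples only.
import re

SENSITIVE_PATTERNS = {
    "ssn-like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "dob-like": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def flag_sensitive(text):
    """Return the names of patterns that matched anywhere in the text."""
    return [name for name, pat in SENSITIVE_PATTERNS.items()
            if pat.search(text)]
```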

OSF's user-centered privacy management services are flexible to the unique ethics, privacy, and human subjects requirements of research administrators' home institutions. Using the pilot datasets, COS can provide implementation support for emerging standards in the ethical management of potentially identifiable data, the joining of data sets, and patient/participant-centered consent processes for data sharing. We will work closely with the pilot database administrators, research ethics experts at NIH, relevant institutional IRBs, and the broader community on building principles for legal and ethical data management into the Data Commons infrastructure. Further, some existing technologies, such as NCATS' GRDR program for de-identification, may provide parts of solutions that could be generalized for broader use (see Capability 2, GUIDs). In sum, the team is well prepared to jointly examine and implement the policy considerations of integrating large data sets, facilitating oversight of controlled access, and tracking metadata such as informed consent and other critical ethics and privacy information.


Requirement | Status
2nd Phase Integrating Institutional Identity and Access Management (IAM), such as Internet2 Grouper and cloud services | Not started
Streamlined Workflow for Managing Data Access Requests | Not started
Discovery of Requirements for Supporting Unique Human Subjects/Ethics Requirements for MODs/AGR, TOPMed, and GTEx | Not started
Review of ethical management requirements from research ethics experts on joining MODs/AGR, TOPMed, and GTEx data sets to inform requirements for MVP | Not started
MVP of Supporting Data Access Workflow for MODs/AGR, TOPMed, and GTEx | Not started

Project management plan
Technical development will follow COS's agile product development. We work directly with the product owner in two-week development cycles called sprints. At the end of each sprint, the development team and product owner review progress, assess release feasibility, and discuss where process or product can be improved.

Gap analysis and report writing will follow COS’s product evaluation process. Two teams--R&D and Product--support assessment and recommendations. R&D will conduct a technical review of services landscape and assessment of alternative solutions and opportunities for leveraging existing resources (tools or expertise) to meet DCPPC objectives. Product will conduct a review of the existing solution against the requirements and user stories to define a roadmap for the next stage of project.

Task plan
Task 1: Create a well-documented plan
Labor cost: $35,818

Month | Milestone | Personnel (Hours) | Metric
1 | Initial gap and needs analysis in concert with other members of consortium, including unique Human Subjects/Ethics requirements for MODs, TOPMed, and GTEx | Nici (24), Natalie (40), Jeff (16), Brian (8) | Delivered gap/needs analysis
2 | Documentation of roadmap for 180-day project period | Nici (16), Rebecca (40), Jeff (8), Brian (8), David (8), Michael (8), COS Developers (40) | Delivered roadmap
3 | Documentation of first half project progress (~12 sprints) | Nici (12), David (12), Rebecca (40) | Outcomes of sprints
5 | Collection of assessments by consortium members for next steps | Nici (8), Natalie (16) | Delivered assessments by consortium members
6 | Documentation of second half project progress (~12 sprints) | Nici (12), David (12), Rebecca (40) | Outcomes of sprints
6 | Completion and delivery of plan for stage two | Nici (16), Natalie (16), Jeff (16), Brian (16), David (16), Michael (16) | Delivered plan

Task 2: Document limitations and challenges for implementing the plan
Labor cost: $8,629

Month | Milestone | Personnel (Hours) | Metric
1 | Documentation of anticipated limitations informed by initial gap and needs analysis | Nici (4), Natalie (12), Jeff (4), Brian (2) | Delivered limits/needs analysis
3 | Documentation of identified limitations and challenges from first half of sprints | Nici (8), David (8) | Outcomes of sprints
6 | Documentation of limitations and challenges in plan for stage two | Nici (8), Natalie (8), Jeff (8), Brian (8), David (8), Michael (8) | Delivered plan

Task 3: Implement a working MVP/prototype
Labor cost: $108,171

Month | Milestone | Personnel (Hours) | Metric
2 | Sprint 1: 2nd Phase Integrating Institutional Identity and Access Management (IAM) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App IAM Integration
2 | Sprint 2: 2nd Phase Integrating Institutional Identity and Access Management (IAM) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App IAM Integration
3 | Sprint 3: 2nd Phase Integrating Institutional Identity and Access Management (IAM) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App IAM Integration
3 | Sprint 4: 2nd Phase Integrating Institutional Identity and Access Management (IAM) | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App IAM Integration
4 | Sprint 5: Streamlined Workflow for Managing Data Access Requests | Developers (80), QA (8), David (8), Product Manager (8), Michael (4), Jeff (2) | Core App Data Access Requests
4 | Sprint 6: Streamlined Workflow for Managing Data Access Requests | Developers (80), QA (8), David (8), Product Manager (8), Michael (4), Jeff (2) | Core App Data Access Requests
5 | Sprint 7: MVP of Supporting Data Access Workflow for MODs, TOPMed, and GTEx | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App Data Access Workflows
5 | Sprint 8: MVP of Supporting Data Access Workflow for MODs/AGR, TOPMed, and GTEx | Developers (160), QA (8), David (16), Product Manager (8), Michael (8), Jeff (2) | Core App Data Access Workflows
6 | Sprint 9: Maintenance/Improvements/Polishing | Developers (160), QA (8), David (16), Jeff (2) | Core App Data Access Workflows
6 | Sprint 10: Maintenance/Improvements/Polishing | Developers (160), QA (8), David (16), Jeff (2) | Core App Data Access Workflows


Capability 7. Indexing and Search

Total Cost: $608,482

Project Personnel, Contribution Levels, and Roles

Name | Role | Contribution to Capability
Jeffrey Spies, Ph.D. | Lead | Technical strategy/vision, product strategy/vision, lead R&D team, prototyping, technology stack selection, technical partner development
Brian Nosek, Ph.D. | Product Vision | Product planning and vision, strategic evaluation, report writing, community leadership, team management
Nici Pfeiffer | Product Manager | Gap analysis, requirements gathering, roadmap planning, status reporting
David Litherland | Technical Project Manager | Project planning, project status, metrics, risk identification, resourcing, sequencing, dependencies, milestones, cross-team integration, development process management
Michael Haselton | Technical Lead | Directs technical development, prioritizes work, time estimation, analysis and recommending solutions, creates and enforces standards and procedures for technical solutions
TBN from COS QA pool | QA Associate(s) | Gather requirements for test planning, create test plans and assert criteria, coordinate testing activities for QA team, execute testing, document bugs and improvements, maintain communication with developer leads
TBN from COS Developer pool | Developer(s) | Designs/codes applications, maintains existing applications, communicates application development progress and escalates roadblocks, ongoing data architecture for product
Lee Giles, Ph.D. | Metadata Expert | Metadata extraction and linking lead
Jian Wu, Ph.D. | Metadata Expert | Machine learning-based metadata extraction
Eric Pugh | Search Subject Matter Expert | Search architecture, direct data ingestion approach, recommend & implement data ingestion architecture, ensure scalability of solution, integrate machine learning framework into Elasticsearch
Doug Turnbull | Relevancy Subject Matter Expert | Recommend approaches for machine learning to build better ranking mechanisms, identify KPIs and best avenues for improving quality of search results, generate models using machine learning techniques, build analytics dashboard
Mark Musen, M.D., Ph.D. | Subject Matter and Metadata Expert | Curation interface lead, liaison for biomedical community outreach
Cynthia Hudson-Vitale | Community Development | Community development and outreach, needs analysis
Rick Johnson | Metadata and Technology Consultant | Metadata and technology design, liaise with technical teams and subject matter experts, community outreach
Judy Ruttenberg | Community Liaison | Community outreach and engagement, requirements gathering
Courtney Soderberg, Ph.D. | Statistician | Consult on metadata models, creation of example meta-analysis scripts
Tim Errington, Ph.D. | Consultant | Consult on metadata models, creation of example meta-analysis scripts

Objectives for the Key Capability
Objective 1: Documented Plan
A documented plan for Data Commons indexing and searching will be developed by: 1) gathering and analyzing stakeholder input on faceting and searching requirements for the creation of synthetic cohorts; 2) conducting a gap analysis between available data and metadata quality, the MVP, and the community requirements identified in (1); and 3) assessing existing technical capabilities and tools for alignment with the outcomes of (1) and (2). The latter will include using SHARE infrastructure for harvesting data; CEDAR for creating standardized, ontology-based metadata and for curating datasets; COS orchestration to deploy the Capability 7 infrastructure necessary for indexing; PSU technology for metadata extraction and linking; Open Source Connections (OSC) tooling for capturing stakeholder input on search quality and integrating machine learning models to optimize querying of data; and the combination of SHARE and OSC tooling for discovery and the generation of synthetic cohorts.

Objective 2: Documented limitations and challenges for implementing that plan
Limitations and challenges of implementing the plan will be identified and documented through community and stakeholder feedback and technology reviews. We will leverage the quantitative data collected from the community to provide a baseline for measuring existing search quality in building synthetic cohorts, and to guide what improvements in creating synthetic cohorts can be expected from improved machine-learning-based models for query results.

Objective 3: Working prototype
The Data Commons indexing and searching feature for the creation of synthetic cohorts will leverage existing infrastructure developed and/or commonly used by COS, SHARE, CEDAR, PSU, and OSC to complete a minimum viable product, with examples for use in meta-analytic tools (e.g., R and Python) developed in collaboration with stakeholders from the scientific use-case community. This will include harvesting data from APIs; extracting and processing metadata and indexing data in Elasticsearch; integrating the data with CEDAR for curation purposes; creating a discovery interface for creating synthetic cohorts; and demonstrating use of the API in meta-analytic tools. Lastly, we will demonstrate personalizing search results based on user interactions by gathering user behavior and using Learning to Rank machine learning to improve the quality of search results. We will also demonstrate machine learning for building better connections between data sets to facilitate synthetic cohort creation where ontologies do not exist or are insufficient.

How key personnel will accomplish the objectives
Status of MVP on Day 1
The Commons will use the existing infrastructure of the OSF, SHARE, CEDAR, PSU, and OSC projects to address user needs for biomedical metadata creation, indexing, and searching, and for the creation of synthetic cohorts for meta-analysis.

Currently, the OSF makes metadata about objects in its system available via APIs. Some of this metadata is created by users, and other metadata is generated automatically as a means of tracking object provenance within the research workflow. For example, the OSF maps metadata about preprints to a subset of schema.org in HTML headers to aid search engine discovery.

We are expanding the mapping of metadata to community-developed schemas and engaging researchers in a manner that integrates with their current workflows. We have an advantage because the OSF works across the research workflow--we can access the user when metadata is created or generated in the process of research, rather than appending the request for metadata after the user is done with their work. We are exploring applications of the CEDAR (Center for Expanded Data Annotation and Retrieval) platform with listed key personnel Mark Musen (CEDAR PI) to facilitate metadata entry, management, and processing across projects. CEDAR is a BD2K project that enables the heterogeneous use of domain specific metadata ontologies while providing a unified framework for creating consistent, searchable metadata. Unless datasets are annotated with high-quality metadata, search for online data is greatly encumbered.


Figure 4. Technology Stack of available services to support Capability 7

The OSF transmits metadata about research objects to SHARE, a free, open data set of research activity across the research workflow. This data is mapped to the SHARE schema--a schema that combines elements of other schemas in the ecosystem, focusing on surfacing relationships in a rich, machine-readable format (Hudson-Vitale, Johnson, Ruttenberg, & Spies, 2017). The OSF data, along with data from over 150 other sources (e.g., clinicaltrials.gov, PLOS, PubMed Central, figshare, Mendeley Data, NIH Research Portfolio Online Reporting Tools), is then made available via a number of APIs for public consumption. Unlike some metadata aggregators, SHARE does not require wrappers, nor does it require data providers to modify their metadata schemas to fit specific elements. Instead, the SHARE metadata schema is flexible and extensible and requires no effort on the part of the metadata or data providers. SHARE is exploring the use of CEDAR to edit and curate inconsistent and variable metadata--a common challenge for scholarly output and a problem that must be addressed for effective data indexing and search.

One of SHARE's APIs is a full-text search database implemented with Elasticsearch. This database supports the SHARE discovery portal (http://share.osf.io), the OSF Preprints aggregator (http://osf.io/preprints/discover), and the SHARE Institutional Dashboard (e.g., branded for UCSD at http://tritonshare.ucsd.edu). The database design enables straightforward creation of new interfaces for particular use cases and stakeholder groups to extract relevant metadata.
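To illustrate, a hedged sketch of querying that search database follows; the endpoint path and document fields are assumptions about the SHARE deployment.

```python
# Sketch: query SHARE's Elasticsearch-backed search API. The endpoint path
# and `_source` fields are assumptions about the deployment.
import requests

SHARE_SEARCH = "https://share.osf.io/api/v2/search/creativeworks/_search"

def search_share(term, size=5):
    body = {"query": {"match": {"title": term}}, "size": size}
    resp = requests.post(SHARE_SEARCH, json=body)
    resp.raise_for_status()
    return [hit["_source"].get("title")
            for hit in resp.json()["hits"]["hits"]]
```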

Search is valuable when we understand our users well enough to predict what information they need. Several information retrieval models (e.g., Xie, 2012; Kuhltau, 1991; Bates, 1989) recognize that the search process is iterative and that effective information retrieval tools adapt to users' needs over time. To imbue Elasticsearch with an understanding of what researchers are trying to accomplish, we need to gather additional quantitative information to make the search results more relevant than the out-of-the-box scoring algorithm allows. This will simplify cohort creation and produce better results. We will deploy the Quepid platform (developed by OSC) to allow the project team to gather typical search queries, organize them into cases, and then gather feedback on search quality from the domain experts (i.e., the researchers). This feedback becomes a living dataset--called in search terms a "judgement list"--and is used to measure the quality of searches and provide training data for machine learning models.
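To make the judgement list's role concrete, here is a minimal sketch of scoring one query's results with NDCG, a standard graded-relevance metric; the data shapes are assumptions.

```python
# Minimal sketch: score a ranked result list against a judgement list using
# NDCG. `judgements` maps doc id -> relevance grade (e.g., 0-3).
import math

def dcg(grades):
    # Discounted cumulative gain for grades in rank order (rank 1 first).
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(judgements, ranked_doc_ids, k=10):
    got = [judgements.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(judgements.values(), reverse=True)[:k]
    return dcg(got) / dcg(ideal) if ideal else 0.0

# Example: an ordering that matches the judgements scores 1.0.
print(ndcg({"a": 3, "b": 1}, ["a", "b", "c"]))  # -> 1.0
```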

Learning to Rank is becoming the most common approach for integrating personalized search rankings using machine-learning-generated models. In the Day 1 MVP, the project team will leverage the Learning to Rank plugin for Elasticsearch, created and open sourced by OSC, to refine and optimize search results. While the training data set gathered via Quepid provides a base model, the true value of Learning to Rank is the ongoing collection of user search interactions, letting Elasticsearch personalize the search algorithm per user group and individual researcher. We are able to collect that data through the GoRank platform, which gathers search behavior and provides the source material for the ongoing machine learning process.
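For orientation, the sketch below shows the general shape of a rescore query using the plugin's sltr clause; the model name and feature parameters are placeholders.

```python
# Sketch of the query-body shape for re-ranking with the Learning to Rank
# plugin's `sltr` clause. Model name and params are placeholders.
def ltr_rescore_body(user_query, model="commons_ltr_model", window=100):
    return {
        "query": {"match": {"title": user_query}},   # cheap first-pass match
        "rescore": {
            "window_size": window,                   # re-rank only the top N
            "query": {"rescore_query": {
                "sltr": {"model": model,
                         "params": {"keywords": user_query}}}},
        },
    }
```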

The following is a requirements status list for Day 1 of the project period.

Requirement Status

Data harvesting framework Complete

SHARE metadata schema Complete

Metadata normalization pipeline Complete

Elasticsearch instances Complete

Automated deployment scripts Complete

SHARE discovery interfaces Complete

Metadata curation interface via CEDAR Workbench Complete

PDFMET metadata extraction and ingestion framework Complete

SeerSuite harvesting, metadata extraction, and indexing framework Complete

Learning to Rank (Machine Learning) Plugin for Elasticsearch Complete

Quepid search Judgement List Curation Complete

GoRank search analytics platform Complete

Activities for Project Period
Building useful indexing and searching capabilities begins with understanding how Data Commons users and stakeholders search for, find, and analyze the information they are seeking. Stakeholders are potentially interested in searching both between data sets and within data sets. Thus, (1) gathering community and stakeholder needs and requirements for data searching, indexing, and reuse, (2) conducting a gap analysis, and (3) conducting a technical review are prerequisites for developing a well-documented plan and a refined Data Commons that reflects community needs.

To gather community requirements for the faceting and searching of synthetic cohorts and extend the existing MVP, the project team will take a mixed methods approach that includes a survey, collecting MVP search analytics, user experience assessments, and focus groups. The target audiences include, but are not limited to, biomedical researchers, users of TOPMed, MOD, GTEx, and medical and data librarians. Each community feedback mechanism will identify how people are using or reusing biomedical data, key metadata elements required for general and specialized searching (such as band splicing) and useful automatic, meta-analytic results of the synthetic cohorts (such as applicable descriptive statistics). During the community sessions, as we gather information about typical search use cases, we will build a representative list of search queries for each user population and store them in the Quepid platform.

We will then work with representatives of each user population (including TOPMed, MOD, and GTEx) to score the search results currently returned, to understand what good search means. This collaborative process gives us a chance to understand what is important to each group and to identify options for providing the most relevant, personalized search results. For example, search analytics collect information on what users search and what they do after they search ("everyone pages on the 'myocardial infarction' query"). This feedback will be used to guide the refinement of the learning to rank, or relevancy ranking, algorithm for the Data Commons. The project team will take this data and create the judgement list, which will then be used over the course of phase one to build machine learning models that provide better search results, make more predictive result suggestions, and potentially automate some of the steps required in building "synthetic cohorts" by predicting connections between individual data sets that lack complete ontologies (for example, via identifiers, titles, or authors).

After these models are released for initial feedback in Phase 1, we will gather the results to measure how much improvement there is, and start learning from actual user behavior by integrating the GoRank analytics package into the overall Data Commons web experience.

From all this feedback, user stories and personas will be developed. Personas describe a class of user. User stories describe the tasks they want to accomplish in the application. For example, a user story may be: “As a user of the NIH Data Commons, I want the ability to combine and analyze datasets by data type, such as phenotypic, imaging, and biospecimen data.” These user stories will assist the project team in capturing a description of the software features from an end-user perspective and identify refinements for the Data Commons MVP.

76 of 89

In developing the documented plan, the project team will next evaluate techniques to provide both automatic, high-level analyses of the synthetic cohorts and integrations with analytical tools such as R. Team members Nosek, Spies, Soderberg, and Musen bring methodological and statistical expertise as researchers with strong quantitative experience in general and in conducting meta-analysis in particular (e.g., Forscher, Lai, Axt, Ebersole, Herman, Devine, & Nosek, 2017; Soderberg, Callahan, Kochersberger, Amit, & Ledgerwood, 2015). This team will work to ensure that the proposed metadata models will facilitate meta-analyses on the fly.

Once the community feedback has been documented, a thorough gap analysis will assess the current state of the metadata against researcher searching expectations. Each data provider's (TOPMed, MOD, GTEx) API metadata will be mapped to the SHARE metadata schema. This process will identify gaps in both the current SHARE metadata schema and the data providers' metadata relative to community-identified requirements. The project team anticipates taking multiple approaches to enhance or enrich metadata, through both automated processes and human curation. The project team will use existing methods to go beyond high-level metadata at the data set level and extract metadata at the level of the unit of data, in order to create cohorts along different dimensions for meta-analysis of aggregated data across queries.

The gap analysis and community feedback will uncover the technical modifications needed to provide users with the most relevant search results and to refine the Data Commons MVP. The COS team brings experience in rapidly deploying technologies that are not easy to scale, like full-text search and analytics engines, or to maintain, like robust task queues for harvesting and analytics pipelines; these are already deployed and maintained for the OSF and SHARE (e.g., Elasticsearch, RabbitMQ, Celery workers). The OSC team has deep expertise in taking source data, like what the harvesting pipeline generates, and indexing it into Elasticsearch. Specifically, we will optimize the search to meet the querying needs of the researchers. This will take the form of a few fairly standardized data models that fit general data types, like genomic or tabular data, plus customization for specific data formats to ensure researchers are able to ask the questions they need of the information. The indexing process will adapt as new data formats are added, but will function within an overall framework for bringing data in as a scalable process.

Extracting meaning from the source data is a key function of the indexing pipeline. We will use a set of tools based on the Apache Tika framework for extracting metadata from a wide variety of resources. Tika functions as a "Swiss army knife" for accessing many different file formats through a single API, and is robust and scalable. Building custom metadata extraction functions in Tika will allow us to take feedback from researchers about what they need to work with in their data, and make those features of the data searchable fields.
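As a hedged sketch of this extraction step, the following uses the tika-python wrapper around Apache Tika (which requires a Java runtime); the output field mapping is an illustrative assumption.

```python
# Sketch: extract text and metadata from an arbitrary file with tika-python
# before indexing. The returned field mapping is illustrative only.
from tika import parser

def extract_for_index(path):
    parsed = parser.from_file(path)          # runs against a local Tika server
    meta = parsed.get("metadata", {}) or {}
    return {
        "content": (parsed.get("content") or "").strip(),
        "content_type": meta.get("Content-Type"),
        "title": meta.get("title"),
    }
```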

Giles brings considerable experience in metadata extraction and indexing, data extraction, and data matching and linking--all at scale. The breadth of his experience will be a valuable contribution to handling diverse data sets, data sources, and variable states of metadata. He also has experience in name disambiguation, with algorithms and code that scale to large data sets such as PubMed, Web of Science, and the USPTO patent collection. His experience scaling similar services will help the team consider scale issues early and address them incrementally while avoiding major refactoring of technology. He has built several open source tools that use machine learning methods to extract metadata from PDFs and text for many unique entities (figures, tables, equations, etc.) and has incorporated them into scholarly search engines such as CiteSeerX (Wu et al., 2015a) using an open source ingestion system (Wu et al., 2015b). His recent work has focused on linking data and metadata across databases such as PubMed and the Web of Science.

Pugh and his OpenSource Connections (OSC) team bring deep expertise in Lucene-based search engines (e.g., Elasticsearch) and in search relevancy, enabling powerful matching algorithms that drive better science outcomes. This expertise was distilled in book form in the 2016 publication of Relevant Search (Manning Publications), written by OSC's CTO Doug Turnbull. OSC built the "Learning to Rank" plugin for Elasticsearch, which is used in commercial contexts as well as by the Wikimedia Foundation for the next-generation search behind Wikimedia.org. OSC also built Quepid, a SaaS product for defining what good search is. Elasticsearch in particular--already in use in the OSF and SHARE--is performant when making the types of queries necessary for constructing and aggregating data across synthetic cohorts.

Project metadata--in the Commons or harvested by SHARE via API--will be made accessible in SHARE and indexed by Elasticsearch. With key personnel Eric Pugh, we will index this data to provide rich querying as well as analytic capabilities across disparate data sets, and to balance out the differing quality of the individual data sets using existing methods for data cleaning in partnership with Giles.

When indexing metadata from various FAIR-compliant APIs, Pugh brings expertise in ensuring that large metadata sets do not crowd out small ones and in building a concept mapping that can connect knowledge across multiple data sets. Search-based techniques address the reality that metadata is rarely cleanly mapped between data sets. This will allow users to explore the available data to identify the best matches and to identify data that needs correction. CEDAR will be used for this curation step.

Querying the metadata will be done by providing users a faceted view over the indexed units of data, which makes it easy to filter down the records, as is done currently in SHARE. The new frontier for search-based solutions is to understand the intent of the user making the query in order to return exactly what they are looking for. A Learning to Rank paradigm integrates traditional machine learning for building custom ranking models into an operational search-based query engine.
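A minimal sketch of such a faceted query follows; the field names ("description", "data_type", "source") are assumptions about the eventual index mapping.

```python
# Sketch: Elasticsearch query body combining full-text match with facet
# counts and an optional facet filter. Field names are assumptions.
def faceted_query(term, data_type=None):
    body = {
        "query": {"bool": {"must": [{"match": {"description": term}}]}},
        "aggs": {  # facet counts shown alongside results
            "by_data_type": {"terms": {"field": "data_type"}},
            "by_source": {"terms": {"field": "source"}},
        },
    }
    if data_type:  # user clicked a facet: narrow the result set
        body["query"]["bool"]["filter"] = [{"term": {"data_type": data_type}}]
    return body
```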


This will be critical for pooling the data the user actually intends for meta-analyses. With the breadth of data potentially available through the Data Commons, understanding the specific needs of each user by learning what their goals are--leveraging the data gathered by OSF--promises to deliver significantly more precise search results than traditional statistics-only ranking algorithms.

Figure 5: Wireframe of Data Commons Search and Discovery Interface

The following are the known requirements for the project period. These will be reviewed and revised as necessary upon announcement of the consortium members and identification of their available tooling.

Requirement Status

Requirement gathering from stakeholders Not started

Develop user stories document Not started

Add acceptance criteria Not started


NIH Data Commons Harvester and Normalization Framework Not started

Create Mappings for NIH Data Commons APIs Not started

Data Modeling for Elasticsearch Not started

Processing and Indexing Data in Elasticsearch Not started

Metadata extraction Not started

Create judgement list that defines what good search is to researchers Not started

Initial Learning to Rank Model using Training Data Not started

Deploy GoRank search analytics throughout web platform Not started

Schema development for use with CEDAR Not started

Integrating the data with CEDAR for curation purposes Not started

Creating a discovery interface to create synthetic cohorts Not started

Demonstrating use of the API in meta-analytic tools (e.g., R). Not started

Demonstrating predictive connections between data sets using Learning to Rank driven models Not started

Project management plan Technical development will follow COS’s agile product development. We work directly with the product owner in two-week development cycles called sprints. At the end of each sprint, the development team and product owner review progress, assess release feasibility, and discuss where process or product can be improved.

The SHARE operations team (Spies, Johnson, Hudson-Vitale, Ruttenberg) will coordinate closely with COS, Stanford, PSU, and OSC team members to ensure all team members are collaborating seamlessly to address community needs.

Task plan
Task 1: Create a well-documented plan
Labor cost: $119,010

Month | Milestone | Personnel (Hours) | Metric
1 | Identify common interests and define common purpose among stakeholders within and outside of consortium | Mark (8), Eric (16), Jeff (8), Judy (8), Rick (8), Lee (8), Brian (8), Cynthia (8) | Common vision document delivered
1 | Documentation of roadmap for 180-day project period | Eric (16), Cynthia (20), Jeff (20), Judy (20), Rick (20), Lee (16), Brian (8), Mark (16), Nici (16) | Roadmap for project period
2 | Deploy Quepid to gather user feedback for MVP online, in focus groups, and through user experience testing | Eric (24), Cynthia (80), Judy (30), Rick (20), Mark (20), Jeff (8), Lee (4) | Judgement list produced
3 | Gather examples and requirements of training dataset for ranking | Doug (120), Cynthia (80), Lee (4), Jeff (4) | Training dataset produced
5 | Evaluate metadata models for meta-analysis | Brian (16), Courtney (16), Jeff (16), Mark (16), Lee (4) | Metadata quality evaluation
5 | Develop plan to integrate existing technologies and identify additional development needed | Eric (20), Doug (20), Jeff (20), Rick (20), Lee (4), COS Developers (80) | Documented plan
6 | Develop cohesive plan for Phase II | Eric (16), Jeff (16), Rick (16), Judy (16), Cynthia (16), Lee (16), Mark (16), Nici (16), Brian (16) | Documented plan

Task 2: Document limitations and challenges for implementing the plan
Labor cost: $29,495

Month | Milestone | Personnel (Hours) | Metric
3 | Gap Analysis: Documentation of anticipated limitations in initial gap and needs analysis | OSC Consultants (16), Jeff (8), Michael (8), Rick (40), Cynthia (8), Judy (8), Nici (8), Mark (30), Lee (8) | Delivered gap/needs analysis
3 | Documentation of identified limitations and challenges from first half of sprints | OSC Consultants (16), Nici (4), David (8), Jeff (8), Michael (8), Lee (8) | Outcomes of sprints
6 | Documentation of limitations and challenges in plan for Phase II | OSC Consultants (16), Nici (8), Jeff (8), David (8), Michael (8), Lee (8) | Delivered plan


Task 3: Implement a working MVP/prototype
Labor cost: $379,833

Month Milestone Personnel (Hours) Metric

1 Gap Analysis: Data Modeling for OSC Consultants (160), COS Core App ElasticSearch Developers (40), Rick (16), Evaluation Nici (16), Cynthia (16), Lee (8), Jeff (8)

1 Sprint 1: NIH Data Commons OSC Consultants (80), COS Core App Harvester and Normalization Developers (80), Lee (4), Jian NIH Framework (20), COS QA (8), David (8), Harvester Product Manger (8), Michael Framework (4), Jeff (2)

1 Sprint 2: Processing and Indexing OSC Consultants (160), Lee Documented Data in ElasticSearch (4), Jian (20), COS approach for Developers (80), COS QA metadata (8), David (8), Michael (2), extraction Jeff (2)

2 Sprint 3: Processing and Indexing OSC Consultants (160), Lee Metadata Data in ElasticSearch (4), Jian (20), COS extraction Developers (80), COS QA MVP (8), David (8), Michael (2), implemented Jeff (2)

2 Sprint 4: Processing and Indexing OSC Consultants (160), Lee Metadata Data in ElasticSearch (4), Jian (20), COS extraction Developers (80), David (8), MVP COS QA (8), Michael (2), Jeff implemented (2)

3 Sprint 5: Develop training dataset OSC Consultants (160), Lee LTR Training for Learning to Rank Model, deploy (4), Jian (20), COS dataset, GoRank to capture search analytics Developers (80), David (8), GoRank Michael (2), Jeff (2) system deployed

3 Sprint 6: Build Initial Learning to OSC Consultants (160), Lee LTR model Rank Model using training data (4), Jian (20), COS MVP Developers (80), David (8), COS QA (8), Product Manager (8), Jeff (2), Michael (2)

4 Sprint 7: Schema development for Mark (80), Lee (4), Jian (20), CEDAR ​

82 of 89

use with CEDAR Rick (40), Cynthia (40), Jeff Integration (20), OSC Consultants (20) with Search

4 Sprint 8: Integrating the data with Mark (80), Lee (4), Jian (20), CEDAR ​ CEDAR for curation purposes COS Developers (80), Rick Integration (20), Cynthia (20), David (8), with Search COS QA (8), Product Manager (8), Michael (4), Jeff (4)

5 Sprint 9: Integrating the data with Mark (80), Lee (4), Jian (20), CEDAR ​ CEDAR for curation purposes Rick (30), COS Developers Integration (80), COS QA (8), David (8), with Search Product Manager (8), Michael (4), Jeff (2)

5 Sprint 10: Discovery interface: initial OSC Consultants (80), COS Mockups and ​ development Developers (80), COS QA user stories (8), Cynthia (20), David (8), Product Manager (8), Michael (4), Lee (4), Jeff (2)

6 Sprint 11: Discovery interface: OSC Consultants (80), COS Discovery ​ development Developers (80), COS QA interface (8), David (8), Product MVP Manager (8), Michael (4), Cynthia (4), Lee (2), Jeff (2)

6 Sprint 12: Demonstrating use of the Courtney (24), Tim (24), COS Script API in meta-analytic tools (e.g., R, Developer (40), COS QA (8), demonstratin Python) in collaboration with Cynthia (8), Product Manager g API stakeholders from the scientific (8), Brian (4), Mark (4), Jeff capability use-case community. (4), Lee (4)
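Sprints 1 through 4 cover harvesting NIH Data Commons metadata, normalizing it to a shared model, and indexing it in ElasticSearch. The following minimal Python sketch, using the official ElasticSearch client library, illustrates that pipeline shape; the index name, metadata fields, and normalization rules are placeholder assumptions to be settled in the gap analysis, not a committed design.

    # Minimal harvest -> normalize -> index sketch. The index name and
    # metadata fields are illustrative assumptions, not a committed schema.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")
    INDEX = "commons-datasets"  # hypothetical index name

    def normalize(raw):
        """Map one harvested record onto a shared metadata model."""
        return {
            "title": (raw.get("title") or "").strip(),
            "description": raw.get("description", ""),
            "identifiers": raw.get("ids", []),
            "source": raw.get("source", "unknown"),
        }

    def index_records(raw_records):
        """Bulk-index normalized records; returns (success_count, errors)."""
        actions = ({"_index": INDEX, "_source": normalize(r)} for r in raw_records)
        return bulk(es, actions)

    if __name__ == "__main__":
        sample = [{"title": " Example dataset ", "ids": ["doi:10.1234/example"],
                   "source": "harvester-demo"}]
        print(index_records(sample))

Keeping normalization separate from indexing, as above, lets the harvester framework of Sprint 1 feed many sources through one shared metadata model before anything reaches the search index.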


Bibliography

Nosek, B. A. (2017). Center for Open Science: Strategic Plan. http://osf.io/x2w0h

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., ...& Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. doi:10.1038/s41562-016-0021

McKiernan, E. C., Bourne, P. E., Brown, C. T., Buck, S., Kenall, A., Lin, J., ... & Yarkoni, T. (2016). How open science helps researchers succeed. eLife, doi:10.7554/eLife.16800

McNutt, M., Lehnert, K., Hanson, B., Nosek, B. A., Ellison, A. M., & King, J. L. (2016). Beyond "data/samples available upon request" in the field sciences. Science, 351, 1024-1026. doi:10.1126/science.aad7048

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., ... & Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1422-1425. doi:10.1126/science.aab2374

Errington, T. M., Iorns, E., Gunn, W., Tan, F., Lomax, J., & Nosek, B. A. (2014). An open investigation of the reproducibility of cancer biology research. eLife, 3:e04333. doi:10.7554/eLife.04333

Ioannidis, J. P. A., Munafò, M. R., Fusar-Poli, P., Nosek, B. A., & David, S. P. (2014). Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences, 18, 235-241. doi:10.1016/j.tics.2014.02.010

Kanwal, S., Khan, F. Z., Lonie, A., & Sinnott, R. O. (2017). Investigating reproducibility and tracking provenance - A genomic workflow case study. BMC Bioinformatics, doi:10.1186/s12859-017-1747-0

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., ...& Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, doi:10.1038/sdata.2016.18

Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L. O. B., & Wilkinson, M. D. (2017). Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Information Services & Use, 37, 49-56. doi:10.3233/ISU-170824

Wilkinson, M. D., Verborgh, R., da Silva Santos, L. O. B., ...& Dumontier, M. (2017). Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Computer Science, doi:10.7717/peerj-cs.110

Wu, J., Williams, K. M., Chen, H.-H., Khabsa, M., ...& Giles, C. L. (2015a). CiteSeerX: AI in a Search Engine. AI Magazine, 36(3), 35-48. doi:10.1609/aimag.v36i3.2601

Wu, J., Killian, J., Yang, H., Williams, K., Choudhury, S. R., ...& Giles, C. L. (2015b). PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search. Proceedings of the 8th ACM K-CAP, 13:1-13:8. doi:10.1145/2815833.2815834


Appendix A: Biography

Erik Schultes, Ph.D. - FAIR Data Scientific Projects Lead at the Dutch Techcentre for Life Sciences, and Human Genetics Department, Leiden University Medical Center
Schultes was a research biologist for 15 years, focusing on RNA biochemistry, structure, and evolution. Since 2009, he has focused on the development of tools that help biologists create, find, access, annotate, and share life science data. Schultes is among the original authors of the nanopublication schema (2011) and the FAIR Principles (2016). He is currently developing training and certification programs for the GO FAIR Initiative and, together with five other FAIR innovators, is a member of the FAIR Metrics Working Group.

Maryann Martone, Ph.D. - Professor Emeritus, University of California, San Diego
Martone is an original author of the FAIR principles and a former president of FORCE11, with expertise in neuroscience, imaging, data integration, standards, and open science. She served on the advisory board for the NIH Commons Credits project and was an early contributor to BioCADDIE. She maintains a lab at UCSD and is co-PI of the Neuroscience Information Framework and the NIDDK Information Network (dkNET). She is the Director of Biosciences for hypothes.is, a non-profit technology organization developing tools to annotate the web, and a co-founder of SciCrunch.com, which is developing a sustainable business model to support the Research Resource Identifier (RRID) project.

Barend Mons, Ph.D. - Leader of the GO FAIR initiative in Europe, Professor in Biosemantics at Leiden University Medical Center
Mons is Co-director of the Dutch Techcentre for Life Sciences (also the NL ELIXIR node), life science integrator at the Netherlands eScience Center, and former chair of the European Commission High Level Expert Group for the European Open Science Cloud. He will be a scientific advisor throughout project execution, though not involved on a daily basis, and will structurally ensure that international interoperability and interaction are optimal.

Luiz Olavo Bonino da Silva Santos, Ph.D. - CTO for FAIR Data at the Dutch Techcentre for Life Sciences
Bonino leads a team responsible for the development of a number of FAIR-related solutions, including tools to make, publish, and search FAIR data, as well as to edit FAIR-compliant metadata. Bonino is one of the authors of the original FAIR principles, a member of the GO FAIR Metrics Working Group, and a participant in a number of interest and working groups of the Research Data Alliance.

Albert Mons, M.Sc. - International Project Manager, FAIR Data, for the FAIRdICT project of the Dutch Tech Center for Life Sciences in The Netherlands


Mons works with Phortos Consultants, a consultancy practice serving academic institutions and private companies and specialising in 'Big Data' solutions. Phortos was instrumental in developing the FAIRport approach to Big Data in the Life Sciences. In addition, Mons is a member of the organizational team for the EOSC Implementation plan "GO FAIR".

Brian Nosek, Ph.D. - Co-founder and Executive Director, COS
Nosek is also a Professor of Psychology at the University of Virginia. He co-founded and ran Project Implicit, a web-based behavioral research infrastructure, from 1998 to 2013. Nosek's domain expertise includes implicit social cognition, cognitive biases, ideology, methodology, and reproducibility. At COS, he provides strategic planning, product vision, team coordination, community action, behavioral change strategies, and executive leadership. He also has significant experience with the elements of implementing a Data Commons through roles as a researcher, editor, reviewer, institutional review board member, and technical service provider.

Jeffrey Spies, Ph.D. - Co-founder and Chief Technology Officer, COS; Co-director, SHARE
Spies is responsible for product/technical vision, strategic direction, architecture of COS software including the Open Science Framework and SHARE, and management of COS Labs (R&D). He is also a Visiting Assistant Professor in Engineering and Society at the University of Virginia, where he completed his Ph.D. in Quantitative Psychology. Spies has a background in computer science and expertise in developing scholarly software and scalable web applications. He has conducted research in computational, statistical, and substantive domains and continues to apply his research on scientific incentives, workflow, and reproducibility at COS.

Matt Spitzer - Community Team Manager, COS
Spitzer oversees a team of six that manages all product operations, business development, outreach, support, training, and marketing activities. Spitzer has experience working with academic and private sector research and engineering teams, medical professionals, and research data management professionals on platforms for content management and research communication. He has a proven track record of partnering with target organizations and delivering multi-product programs.

David Mellor, Ph.D. - Project Manager, Journal and Funder Initiatives, COS
Mellor leads the establishment and sustainment of complex initiatives designed to increase rigor and transparency in research, including a substantial and ongoing education campaign for preregistration, the Preregistration Challenge. Mellor engages in community-based education and advocacy to improve policies and actions by scientific societies, publishers, journals, and funding agencies, including advancing the FAIR principles. These activities are summarized in the Transparency and Openness Promotion (TOP) Guidelines and cover actions such as open research data, increased use of reporting guidelines, preregistration, and Registered Reports. Mellor earned a Ph.D. in Ecology & Evolution from Rutgers University, New Brunswick, NJ, in 2011.


Nici Pfeiffer - Product Manager and Team Lead, COS
Pfeiffer is responsible for product management: product requirements, UI specifications, maintenance of the product roadmap, leading the quality assurance team in all testing efforts, training the team in automated API testing, and process design and implementation within the software development life cycle. Pfeiffer's background as a detail-oriented quality assurance engineer, combined with her experience implementing effective quality control processes, has significantly improved the software development life cycle at COS.

David Litherland - Senior Project Manager, Engineering, COS
Litherland leads project management of all technical systems at COS. Previously at ScholarOne and Silverchair Information Systems, he has over 25 years of experience managing technical and operational projects, programs, and project portfolios, and over 10 years of experience working with STM associations and journals.

Michael Haselton - Technical Manager, Engineering, COS
Haselton has significant experience leading development teams and has had a principal role in the development of the backend and frontend of the Open Science Framework (OSF). He has 10 years of experience in Health Information Systems. He was responsible for planning and executing the OSF's scalability, security, backups, and storage (Amazon S3, Rackspace CloudFiles, Dataverse, and Figshare). Haselton also built a unified file-system access layer across many cloud storage providers, including Google Drive, Box, Dropbox, and GitHub.

Timothy Errington, Ph.D. - Metascience Lead, COS
Errington has a broad background in biomedical research, with a Ph.D. in preclinical cancer biology. He has expertise in experimental design and reproducible research, as well as teaching and presentation experience through his current work at COS and during his graduate work. Errington leads COS's Reproducibility Project: Cancer Biology (see: https://osf.io/e81xl). He was also a co-author of the Reproducibility Project: Psychology, the first large-scale effort to reproduce a sampling of the psychological literature. He will consult with the coordinating team on metascience research and project coordination of distributed teams.

Courtney Soderberg, Ph.D. - Training and Consulting Services Team Lead / Statistical and Methodological Consultant, COS Soderberg has a Ph.D. in psychology with a focus on quantitative methods, and leads COS’s training services. She has expertise in a wide range of statistical methods, reproducible research practices, and experimental design. Soderberg has led numerous training programs on using the OSF and using practices that increase research integrity. Soderberg also has substantial interdisciplinary experience with researchers in psychology, biology, political science, and economics.


Rebecca Rosenblatt - User Support and Documentation Specialist, COS
Rosenblatt holds bachelor's and master's degrees in English literature from the University of Virginia. She works on the Product team to help facilitate discussion of feature requests and other user-facing input. She communicates with users on a daily basis to answer questions and work through any issues, and writes and maintains the internal documentation and user-facing help guides for the OSF.

Whitney Wissinger - Events and Facilities Coordinator, COS Wissinger is responsible for all events and meetings hosted in or in partnership with COS. She has a background in hospitality management. She is a graduate of James Madison University where she earned a degree in education.

Natalie Meyers, MA, MLIS - Partnerships and Project Manager, COS / E-Research Librarian, University of Notre Dame
Meyers is responsible for interoperability support and collaborative management of domain data repositories and other services connecting via API to The Commons (e.g., scholarly publishing platforms, authoring services, data repositories). At COS, Meyers facilitates third-party integrators' efforts to interoperate with OSF. She has 20 years of experience in information systems management and data curation/harmonization; example projects include the NSF-funded Herbarium Specimen Management Database (SMASCH), the National Geospatial Digital Archive (NGDA), the National Digital Information Infrastructure (NDIIP), the Vector-borne Disease Network (VecNet), Data and Software Preservation for Open Science (DASPOS), and Data and Software Preservation Quality Tools (PRESQT). MA, University of Wisconsin-Milwaukee, 1989; MLIS, UC Berkeley, 1994.

C. Titus Brown, Ph.D. - Associate Professor, School of Veterinary Medicine / Genome Center / Data Science Initiative, University of California, Davis
Brown is a genomics and bioinformatics professor at UC Davis who works on scaling analyses to large data sets, designing and implementing reproducible workflows, open source/open science and software engineering methodologies, and training biologists and bioinformaticians in those methodologies. He is deeply involved in the open source data science community and has significant expertise in leading, working within, and working across multiple open source software projects. He is also one of the 14 Moore Data Driven Discovery Investigators.

Ian Taylor, Ph.D. - Full Research Professor, University of Notre Dame
Taylor is a research professor at Notre Dame's Center for Research Computing (CRC) and a Reader at Cardiff University. He researched and implemented artificial neural-network types for the determination of musical pitch, constructed the data acquisition system for the GEO 600 gravitational wave detector, and wrote and managed the Triana workflow system. He specializes in Web interfaces and APIs, big data applications, distributed scientific workflows, and data distribution. He has managed 20+ projects, published 150+ papers and books, and won the Alan Berman prize for best research paper three times. He chaired the WORKS workflow workshop at Supercomputing for 10+ years.

Sandra Gesing, Ph.D. - Computational Scientist, CRC, and Research Assistant Professor in Computer Science and Engineering, University of Notre Dame
Gesing has expertise in science gateways, especially for bioinformatics applications, computational workflows, distributed and parallel computing, and data infrastructures. She is a PI for the NSF Science Gateways Community Institute and the founder and lead of the International Workshop on Science Gateways (IWSG) series, with substantial research publication, editing, and workshop-chairing experience. Her industry experience includes heading a system programmer group and serving as project manager and system developer with responsibility for long-term software projects. She earned a CS Ph.D. in a bioinformatics group at the University of Tübingen.

Lee Giles, Ph.D. - Professor, Pennsylvania State University
Giles is a Professor at the College of Information Sciences and Technology at the Pennsylvania State University, University Park, PA, with an appointment in Computer Science and Engineering. He is a Fellow of the ACM, IEEE, and INNS (Gabor prize). He is known for the digital library search engine CiteSeer, which he co-created, developed, and maintains, along with other related search engines. He has expertise in information and metadata extraction and indexing, data extraction, and name matching and linking, all at scale. He has published over 400 papers in these areas with over 30,000 citations.

Jian Wu, Ph.D. - Postdoctoral Researcher, Pennsylvania State University
Wu's portfolio includes using machine learning methods to automatically extract scholarly document metadata at scale and implementing them in an automated ingestion and indexing process. He has published over a dozen papers in these areas.

Eric Pugh - CEO & Founder, OpenSource Connections LLC
Pugh leads OpenSource Connections, which focuses on enabling organizations to build better solutions by leveraging open source technologies, principally the Apache Solr, Apache Lucene, and Elasticsearch search engines. Clients include the United States Patent and Trademark Office and pharmaceutical and science-based organizations. He co-authored Apache Solr Enterprise Search Server and is a well-known speaker (Strata, LuceneRevolution, KMWorld), including the State of the Search Community keynote for the 2016 Open Source Search for Bioinformatics conference (European Bioinformatics Institute). In 2017 he directed Samia Ansari's Analysis of Patient Diversity in Oncology Trials project through the South Big Data Hub program.

Doug Turnbull - CTO, OpenSource Connections LLC
Turnbull sets the technical direction of OpenSource Connections, evaluating new trends in technologies, with a focus on enabling organizations to build better knowledge management solutions by leveraging open source technologies, principally the Apache Solr, Apache Lucene, and Elasticsearch search engines. In 2016 he wrote Relevant Search, published by Manning, the seminal work on the topic.

Mark Musen, M.D., Ph.D. - Stanford University Musen directs CEDAR, one of the 11 centers of excellence founded under the NIH Big Data to Knowledge initiative, dedicated to improving metadata to aid search and discovery of scientific data sets. He has more than 30 years of professional experience in the areas of biomedical informatics, intelligent systems, and human-computer interaction. Musen led the Informatics Topic Advisory Group for WHO’s development of the latest revision of the International Classification of Diseases (ICD-11), and he has advised several groups in the development of major international data standards. He is an elected member of ASCI, AAP, ACMI, and NAM.

Cynthia Hudson-Vitale - Data Services Coordinator, Washington University in St. Louis Hudson-Vitale leads outreach and engagement for the SHARE initiative while serving as a Visiting Program Officer for the Association of Research Libraries. In this role, Hudson-Vitale developed and led community engagement programs that brought together 35 institutions to enhance local scholarly metadata and act as force multipliers for enriching metadata among additional institutions. Additionally, she has worked extensively with the Washington University Institute for Clinical and Translational Sciences to improve metadata and data quality for evolving/dynamic health data. She received her Master’s degree in Library and Information Science in 2009.

Rick Johnson - Co-Director of Digital Initiatives and Scholarship, University of Notre Dame
Johnson directs the design and development of the university's Hesburgh Libraries' data curation and digital library solutions. Johnson is a Visiting Program Officer for the Association of Research Libraries SHARE project, developing partnerships and serving as technical advisor to develop metadata models and technical infrastructure that improve data sharing and metadata alignment across research systems. Johnson contributes to several other collaborations, including DASPOS (Data and Software Preservation for Open Science) and the multi-institutional Samvera (Hydra) collaboration; he steered development of an ORCID plug-in for Hydra and is a participating member of the Research Data Alliance.

Judy Ruttenberg - Program Director for Strategic Initiatives, Association of Research Libraries
Ruttenberg has held this position since 2011. Her portfolio at ARL focuses on key transformations in research library organization and staffing to support member library priorities in areas such as research data management. She is also co-director of the SHARE initiative, a partnership between ARL and COS to aggregate research metadata. Ruttenberg was previously a Program Officer at the Triangle Research Libraries Network.