Data Stewardship and the Decentralized Web

DANIELLE ROBINSON, PhD
Co-Executive Director at Code for Science & Society
@daniellecrobins @codeforsociety

Code for Science & Society

Supporting open source in the public interest

Code for Science & Society

Civic tech + Scholarly research + New media + Open source + Equity, support, inclusion

= CS&S community

Sharing

Bringing: experiences
- Knowledge of decentralized computing, data collection & management

Seeking:
- Better understanding of needs, challenges of your community

What is the future of data stewardship?

- Bringing together leaders, stakeholders

- Design a cooperative data preservation network

- Push for ‘FAIR’ and save libraries money

Image: Adam Brock

1. Data on the web
2. A new model of data stewardship
3. Prototyping decentralized preservation
4. Reimagine data on the web

1. Data on the web

Across domains, data live online

- Early work of a writer
- Government data
- Newspaper archives
- Your family photos
- Scientific data

Data transparency: Inconsistent practices across domains

Many data publishing options

://www.ohsu.edu/xd/education/library/data/share-and-archive/index.cfm

Siloed info, centralized gatekeepers control access

Image: Doc Searls

https://imgflip.com/memegenerator/Picard-Wtf

Distributed beginnings

http://som.csudh.edu/fac/lpress/history/arpamaps/

Image: Clark Boyd

Web centralization

Image courtesy of Beaker Browser

Web centralization

It’s easier to manage and monetize a silo

Image courtesy of Beaker Browser

“We embed values into our technology whether we are aware of it or not”

- Stephen Whitmore (@noffle), Digital Democracy

See also the work of Safiya Noble

https://blog.datproject.org/2018/03/05/css-community-call-03-2018/

In the centralized web

We trust the server to locate, not change objects

Silos are the natural state

Data may be in multiple silos

Today’s web relies upon

URLs to identify location of objects

Ability to change information without changing location

Aggregating content for discovery

Today’s web lacks

Persistent identifiers

Transparent change log

Links between silos
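A minimal sketch of the gap between a location and a persistent identifier, assuming a SHA-256 digest of the bytes serves as the identifier. The URL, the sample bytes, and the `content_id` helper are illustrative assumptions, not part of any system named in this talk.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return an identifier derived from the bytes themselves (SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


URL = "https://example.org/data/report.csv"  # hypothetical location

# Location-based store: the same URL can silently point at new bytes.
by_location = {URL: b"year,value\n2017,1\n"}

# Content-addressed store: the key is computed from the bytes.
by_content = {}
original = by_location[URL]
original_id = content_id(original)
by_content[original_id] = original

# Someone edits the file in place; the URL does not change.
by_location[URL] = b"year,value\n2017,2\n"
edited_id = content_id(by_location[URL])
by_content[edited_id] = by_location[URL]

# The URL now returns different bytes, but the original is still
# retrievable under its own identifier, next to the edited version.
print(original_id != edited_id)
print(by_content[original_id] == original)
```

Under this assumption, changing the information always changes the identifier, so an old identifier can never quietly resolve to new content.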

“The internet is a terribly unstable way to keep information available”

- Laurie Allen, Penn Libraries' Assistant Director for Digital Scholarship

“Federal data ≅ website”

https://www1.ncdc.noaa.gov/pub/data/

Why are federal data ≅ webpages?

To find an object online:

1. Discover the link
2. Link still works
3. Trust the info at the link

https://www.slideshare.net/shefw/save-the-data-the-role-of-librarians-in-datarescue-collaborations

Why are federal data ≅ webpages?

https://www1.ncdc.noaa.gov/pub/data/annualreports

https://www.slideshare.net/shefw/save-the-data-the-role-of-librarians-in-datarescue-collaborations

Link rot: When links fail

Content drift: When referenced content is changed

Link rot + content drift = Reference rot
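The two failure modes can be told apart mechanically if a digest of the content was recorded when the reference was made. The sketch below assumes exactly that. `classify_reference`, the `fetch` callable, and the example URLs are hypothetical, and a byte-level comparison flags every change, including ones a reader would consider trivial.

```python
import hashlib
from typing import Callable, Optional


def classify_reference(
    url: str,
    recorded_digest: str,
    fetch: Callable[[str], Optional[bytes]],
) -> str:
    """Classify a cited URL as 'ok', 'link rot', or 'content drift'.

    `fetch` is a hypothetical callable that returns the bytes at `url`,
    or None when the link no longer resolves.
    """
    body = fetch(url)
    if body is None:
        return "link rot"
    if hashlib.sha256(body).hexdigest() != recorded_digest:
        return "content drift"
    return "ok"


# Toy stand-in for the live web, so the sketch runs offline.
cited = b"annual report, 2016 edition"
digest_at_citation = hashlib.sha256(cited).hexdigest()

live_web = {
    "https://example.org/a": cited,                      # unchanged
    "https://example.org/b": b"annual report, revised",  # changed in place
    # "https://example.org/c" no longer exists
}

statuses = {
    url: classify_reference(url, digest_at_citation, live_web.get)
    for url in (
        "https://example.org/a",
        "https://example.org/b",
        "https://example.org/c",
    )
}
print(statuses)
```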

M. Klein, several papers and talks, links at end

The Internet is broken

and we are using it to access and distribute all of human knowledge

¯\_(ツ)_/¯

The web is being reimagined

Image: its all about Rock (:

What’s important to you?

Image: romana klee

2. A new model of data stewardship

Preservation starts here

“Sharing [and preserving] research data is not well understood, incentivized, or accessible”

- Daniella Lowenberg, Research Data Specialist, Product Manager of @uc3dash, California Digital Library

https://medium.com/@UC3CDL/we-are-talking-loudly-and-no-one-is-listening-a108248693f7 / csv

Screenshot from https://peerj.com/preprints/2588/

Preservation requires custody

Image: seagen

Centralized model requires custody to provide access

Image courtesy of Beaker Browser

Web accessible objects

Image: Via Agency

Is custody required?

Image: #WOCinTech Chat

“Preservation in place… Bring preservation services to the content”

- Stephen Abrams, Preservation without Possession, California Digital Library

https://figshare.com/articles/Preservation_without_possession_Content-addressable_identifiers_for_post-custodial_preservation/5844369

Cooperative of trusted entities

Sharing data and costs

Image courtesy of Beaker Browser

SangyaPundir / www.force11.org/group/fairgroup/fairprinciples

Leverage existing infrastructure

www.force11.org/group/fairgroup/fairprinciples

Visions are nice!

Image: Peter Miller

Now let’s get real

Image: vladeb

3. Prototyping decentralized preservation

Multiple decentralized approaches

- Blockchain
- Peer-to-peer

BTC Keychain / Danilo / http://www.ala.org/tools/future/trends/blockchain / https://gist.github.com/mafintosh/bd9e6d350ebf02441c9707c5f799d05b

Centralized “hub and spoke” model

Data stored at central location, accessed by independent users

Image courtesy of Beaker Browser

Decentralized models

Data persistently identified and networked, with the ability to scale

Image courtesy of Beaker Browser

Peer-to-peer public technology

https://github.com/mafintosh/bws-2017

What’s Dat?

Persistent identifiers

+

Network of peers

https://github.com/datproject/docs/blob/master/papers/dat-paper.pdf

Dat + scholarly data =

- Automate preservation, versioning
- Find data across storage locations
- Spread cost burden across network
- Foundational links between silos
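To picture what automated versioning with a transparent history could look like, here is a simplified sketch of an append-only log in which each entry commits to the previous one by hash. It is an illustration only and not Dat's actual data structures or API. `VersionLog` and its fields are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class VersionLog:
    """Append-only log; each entry commits to the previous entry by hash."""

    entries: List[dict] = field(default_factory=list)

    @staticmethod
    def _hash(record: dict) -> str:
        # Hash every field except the stored hash itself.
        body = {k: v for k, v in record.items() if k != "entry_hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, data: bytes, note: str) -> str:
        """Record a new version of the data and return the entry hash."""
        record = {
            "version": len(self.entries) + 1,
            "content_hash": hashlib.sha256(data).hexdigest(),
            "note": note,
            "prev": self.entries[-1]["entry_hash"] if self.entries else "",
        }
        record["entry_hash"] = self._hash(record)
        self.entries.append(record)
        return record["entry_hash"]

    def verify(self) -> bool:
        """Check that no entry was altered and the chain is unbroken."""
        prev = ""
        for record in self.entries:
            if record["prev"] != prev or record["entry_hash"] != self._hash(record):
                return False
            prev = record["entry_hash"]
        return True


log = VersionLog()
log.append(b"sample,count\nA,1\n", "initial upload")
log.append(b"sample,count\nA,1\nB,4\n", "added sample B")
intact = log.verify()

# Rewriting history after the fact breaks verification.
log.entries[0]["note"] = "rewritten history"
tampered_detected = not log.verify()
```

Anyone holding a copy of such a log can recompute the hashes, which is one way a network of peers could check history without trusting a single server.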

Reimagine data preservation

Image: 俍宏 葉

It’s all about TRUST

Image courtesy of Beaker Browser

… and I trust LIBRARIES

Image courtesy of Beaker Browser

Building a prototype

Image: Eran Sandler

Start with data creation

Dr. Dannise V. Ruiz-Ramos describes sea star genome annotation pipeline

Dat in the Lab lessons:

- Leverage existing workflows
- Automate data versioning, preservation
- Link researchers to library

Now linking libraries to each other

https://blog.datproject.org/tag/science/

Prototype: CDL - IA - SDSC

- CDL’s DASH corpus (<5 TB)
- Copied to IA and SDSC
- Deal with technical hurdles (S3)

Next: Monitoring dynamic information

Every institution contributes

- Storage, bandwidth
- Metadata on their collection
- Commitment to preserve their collection

to the network

Any user can access

- Information on library collections
- History of objects
- Whole or partial data sets

from the network
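As a rough sketch of what the cooperative's shared metadata could support, the example below answers two questions: where does an object live, and which objects have too few copies? The institution names come from the prototype, but the holdings, the object identifiers, and both helper functions are made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical holdings: institution -> identifiers of the objects it stores.
holdings: Dict[str, Set[str]] = {
    "CDL": {"obj-1", "obj-2", "obj-3"},
    "IA": {"obj-1", "obj-2"},
    "SDSC": {"obj-1", "obj-3"},
}


def locations(holdings: Dict[str, Set[str]], object_id: str) -> List[str]:
    """List the institutions that hold a copy of the object."""
    return sorted(name for name, ids in holdings.items() if object_id in ids)


def under_replicated(holdings: Dict[str, Set[str]], min_copies: int) -> List[str]:
    """List objects held by fewer than `min_copies` institutions."""
    all_ids = set().union(*holdings.values())
    return sorted(i for i in all_ids if len(locations(holdings, i)) < min_copies)


print(locations(holdings, "obj-1"))    # ['CDL', 'IA', 'SDSC']
print(under_replicated(holdings, 3))   # ['obj-2', 'obj-3']
```

A real network would also have to verify that each claimed copy is intact, which this sketch leaves out.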

4. Reimagine data on the web

What’s important to you?

www.liveoncelivewild.com

Discussion:

● What are the data types that your organization is responsible for?
● How are those data created, stored, used? When do they come to you?
● Who interacts with data? How do they interact with it?
● How are equity, justice addressed (or not) in data stewardship plans?
● What are your concerns around long-term preservation of data?

Cool project alert!

The Data to Policy Project (D2P) is an initiative to engage students with their community’s needs through course-based assignments, which culminate in data-driven policy proposals to local governments and agencies.

https://library.auraria.edu/d2pproject/about

Thank you to the Western States Government Information Conference Planning Committee
