Data Stewardship and the Decentralized Web
DANIELLE ROBINSON, PhD
Co-Executive Director at Code for Science & Society
@daniellecrobins @codeforsociety

Code for Science & Society
Supporting open source in the public interest

Code for Science & Society
Civic tech + Scholarly research + New media + Open source + Equity, support, inclusion
= CS&S community

Sharing experiences
Bringing:
- Knowledge of decentralized computing, data collection & management
Seeking:
- Better understanding of needs, challenges of your community

What is the future of data stewardship?
- Bringing together leaders, stakeholders
- Design a cooperative data preservation network
- Push for ‘FAIR’ and save libraries money
[Image: Adam Brock]

1. Data on the web
2. A new model of data stewardship
3. Prototyping decentralized preservation
4. Reimagine data on the web

Across domains, data live online
Early work of a writer
Government data
Newspaper archives
Your family photos
Scientific data

Data transparency: Inconsistent practices across domains

Many data publishing options
https://www.ohsu.edu/xd/education/library/data/share-and-archive/index.cfm

Siloed info, centralized gatekeepers control access
[Image: Doc Searls]

[Image: https://imgflip.com/memegenerator/Picard-Wtf]

Distributed beginnings
http://som.csudh.edu/fac/lpress/history/arpamaps/

Web centralization
It’s easier to manage and monetize a silo
[Images: Clark Boyd; courtesy of Beaker Browser]

“We embed values into our technology whether we are aware of it or not”
- Stephen Whitmore (@noffle), Digital Democracy
See also the work of Safiya Noble
https://blog.datproject.org/2018/03/05/css-community-call-03-2018/

In the centralized web
We trust the server to locate, not change objects
Silos are the natural state
Data may be in multiple silos

Today’s web relies upon
URLs to identify location of objects
Ability to change information without changing location
Aggregating content for discovery

Today’s web lacks
Persistent identifiers
Transparent change log
Links between silos
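
One way to picture a persistent identifier is to derive it from the bytes of an object rather than from where the object is stored. The sketch below is only an illustration under that assumption: `content_id` is a hypothetical helper, the sample rows are made up, and nothing here describes the identifier scheme of any particular project.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return an identifier derived only from the bytes themselves."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


# Made-up sample data: two versions of the same small table.
original = b"station,year,temp_c\nKDEN,2017,10.4\n"
edited = b"station,year,temp_c\nKDEN,2017,10.9\n"

# The same bytes give the same identifier wherever they are stored.
same_everywhere = content_id(original) == content_id(bytes(original))

# Any edit gives a different identifier, so a change cannot be silent.
edit_is_visible = content_id(original) != content_id(edited)

print(same_everywhere, edit_is_visible)
```

An identifier like this does not say where a copy lives. It only lets anyone holding a copy confirm it is the object that was cited, which is why it pairs naturally with a separate discovery or peer network.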

“The internet is a terribly unstable way to keep information available”
- Laurie Allen, Penn Libraries' Assistant Director for Digital Scholarship

“Federal data ≅ website”
https://www1.ncdc.noaa.gov/pub/data/

Why are federal data ≅ webpages?
To find an object online:
1. Discover the link
2. Link still works
3. Trust the info at the link
https://www.slideshare.net/shefw/save-the-data-the-role-of-librarians-in-datarescue-collaborations

Why are federal data ≅ webpages?
https://www1.ncdc.noaa.gov/pub/data/annualreports
https://www.slideshare.net/shefw/save-the-data-the-role-of-librarians-in-datarescue-collaborations

Link rot: When links fail
Content drift: When referenced content is changed
Link rot + content drift = Reference rot
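
The three terms can be told apart mechanically if a citation records a hash of the content as it was when cited. A minimal sketch under that assumption: `classify_reference` and the sample bytes are hypothetical, and the network fetch is left out so the example runs offline (`None` stands in for a link that no longer resolves).

```python
import hashlib
from typing import Optional


def classify_reference(cited_sha256: str, fetched: Optional[bytes]) -> str:
    """Classify a reference given what the link returns today."""
    if fetched is None:
        # Nothing came back: the link itself has failed.
        return "link rot"
    if hashlib.sha256(fetched).hexdigest() != cited_sha256:
        # Something came back, but not what was cited.
        return "content drift"
    return "intact"


# Made-up content standing in for a page at the time it was cited.
cited = b"Annual report, 2016 edition"
cited_hash = hashlib.sha256(cited).hexdigest()

print(classify_reference(cited_hash, cited))
print(classify_reference(cited_hash, b"Annual report, revised"))
print(classify_reference(cited_hash, None))
```

Either failure counts as reference rot in the sense used above: the reader can no longer reach what the author cited.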
M. Klein, several papers and talks, links at end

The Internet is broken
and we are using it to access and distribute all of human knowledge
¯\_(ツ)_/¯

The web is being reimagined
[Image: its all about Rock (:]

What’s important to you?
[Image: romana klee]

1. Data on the web
2. A new model of data stewardship
3. Prototyping decentralized preservation
4. Reimagine data on the web

Preservation starts here

“Sharing and preserving research data is not well understood, incentivized, or accessible”
- Daniella Lowenberg, Research Data Specialist, Product Manager of @uc3dash, California Digital Library
https://medium.com/@UC3CDL/we-are-talking-loudly-and-no-one-is-listening-a108248693f7
[Image: csv]

[Image: screenshot from https://peerj.com/preprints/2588/]

Preservation requires custody
[Image: seagen]

Centralized model requires custody to provide access
[Image courtesy of Beaker Browser]

Web accessible objects

Is custody required?
[Images: Via Agency; #WOCinTech Chat]

“Preservation in place… Bring preservation services to the content”
- Stephen Abrams, “Preservation without Possession”, California Digital Library
https://figshare.com/articles/Preservation_without_possession_Content-addressable_identifiers_for_post-custodial_preservation/5844369

Cooperative of trusted entities
Sharing data and costs
[Image courtesy of Beaker Browser]

[Image: SangyaPundir / www.force11.org/group/fairgroup/fairprinciples]

Leverage existing infrastructure
www.force11.org/group/fairgroup/fairprinciples

Visions are nice!
[Image: Peter Miller]

Now let’s get real
[Image: vladeb]

1. Data on the web
2. A new model of data stewardship
3. Prototyping decentralized preservation
4. Reimagine data on the web

Multiple decentralized approaches
Blockchain
Peer-to-peer
[Images: BTC Keychain / Danilo]
http://www.ala.org/tools/future/trends/blockchain
https://gist.github.com/mafintosh/bd9e6d350ebf02441c9707c5f799d05b

Centralized “hub and spoke” model
Data stored at central location, accessed by independent users
[Image courtesy of Beaker Browser]

Decentralized models
Data persistently identified, networked ability to scale
[Image courtesy of Beaker Browser]

Peer-to-peer public technology
https://github.com/mafintosh/bws-2017

What’s Dat?
Persistent identifiers
+
Network of peers
https://github.com/datproject/docs/blob/master/papers/dat-paper.pdf

Dat + scholarly data =
- Automate preservation, versioning
- Find data across storage locations
- Spread cost burden across network
- Foundational links between silos
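
Automated versioning can be pictured as an append-only log in which every entry commits to the one before it, so the history of a data set can be checked rather than taken on trust. The following is a toy illustration of that general idea only. It is not Dat's data structures or API, and `VersionLog`, its methods, and the sample rows are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class VersionLog:
    """Hypothetical append-only history of one data set."""

    entries: List[Dict[str, str]] = field(default_factory=list)

    def append(self, data: bytes) -> int:
        """Record a new version and return its 1-based version number."""
        prev = self.entries[-1]["entry_hash"] if self.entries else ""
        content_hash = hashlib.sha256(data).hexdigest()
        self.entries.append({
            "content_hash": content_hash,
            "prev": prev,
            # Each entry hash covers the previous entry hash as well.
            "entry_hash": _sha256(prev + content_hash),
        })
        return len(self.entries)

    def verify(self) -> bool:
        """Check that no recorded entry was altered or reordered."""
        prev = ""
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["entry_hash"] != _sha256(prev + entry["content_hash"]):
                return False
            prev = entry["entry_hash"]
        return True


log = VersionLog()
log.append(b"sample_id,reads\nA,100\n")
log.append(b"sample_id,reads\nA,100\nB,250\n")
print(len(log.entries), log.verify())
```

Because each entry depends on the previous one, rewriting an old version changes every later hash, which is what makes the change log transparent.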

Reimagine data preservation
[Image: 俍宏 葉]

It’s all about TRUST
[Image courtesy of Beaker Browser]

… and I trust LIBRARIES
[Image courtesy of Beaker Browser]

Building a prototype
[Image: Eran Sandler]

Start with data creation
Dr. Dannise V. Ruiz-Ramos describes sea star genome annotation pipeline

Dat in the Lab lessons:
Leverage existing workflows
Automate data versioning, preservation
Link researchers to library
Now linking libraries to each other
https://blog.datproject.org/tag/science/

Prototype: CDL - IA - SDSC
CDL’s DASH corpus (<5 TB)
Copied to IA and SDSC
Deal with technical hurdles (S3)
Next: Monitoring dynamic information

Every institution contributes
Storage, bandwidth
Metadata on their collection
Commitment to preserve their collection
to the network

Any user can access
Information on library collections
History of objects
Whole or partial data sets
from the network
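
The contribute-and-access model can be sketched as a shared registry of who holds which data set. This is a simplified illustration under stated assumptions, not the prototype's design: the institution abbreviations come from the prototype slide, while the data set identifiers, the minimum-copies threshold, and both helper functions are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical registry: data set identifier -> institutions holding a copy.
holdings: Dict[str, Set[str]] = {
    "dataset-a": {"CDL", "IA", "SDSC"},
    "dataset-b": {"CDL"},
    "dataset-c": {"IA", "SDSC"},
}


def under_replicated(registry: Dict[str, Set[str]], minimum_copies: int) -> List[str]:
    """List data sets held by fewer institutions than the agreed minimum."""
    return sorted(
        dataset_id
        for dataset_id, holders in registry.items()
        if len(holders) < minimum_copies
    )


def where_is(registry: Dict[str, Set[str]], dataset_id: str) -> List[str]:
    """List the institutions a user could fetch this data set from."""
    return sorted(registry.get(dataset_id, set()))


print(under_replicated(holdings, minimum_copies=2))
print(where_is(holdings, "dataset-c"))
```

Institutions would use the first query to decide what to copy next, and users the second to find a copy without caring which silo holds it.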

1. Data on the web
2. A new model of data stewardship
3. Prototyping decentralized preservation
4. Reimagine data on the web

What’s important to you?
[Image: www.liveoncelivewild.com]

Discussion:
● What are the data types that your organization is responsible for?
● How are those data created, stored, used? When do they come to you?
● Who interacts with data? How do they interact with it?
● How are equity, justice addressed (or not) in data stewardship plans?
● What are your concerns around long term preservation of data?

Cool project alert!
The Data to Policy Project (D2P) is an initiative to engage students with their community’s needs through course-based assignments, which culminate in data-driven policy proposals to local governments and agencies.
https://library.auraria.edu/d2pproject/about

Thank you to the Western States Government Information Conference Planning Committee