Challenges in the Curation and Preservation of Social-Scientific Knowledge
Micah Altman Harvard University
Archival Director, Henry A. Murray Research Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Science
E: [email protected] W: http://maltman.hmdc.harvard.edu/
[Presented at the Indo-US Workshop on International Trends in Digital Preservation, 2009]
This Talk
Why Preserve Social Science Data?
Threats and challenges
Emerging trends and technologies
Recommendations & Predictions
Indo-US Digital Preservation (Page 2) Micah Altman, Senior Research Scientist Workshop 2009
Collaborators & Co-conspirators
Margaret Adams, Caroline Arms, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Cavan Capps, Jonathan Crabtree, Darrell Donakowski, Myron Gutmann, Gary King, Patrick King, Jared Lyle, Marc Maynard, Amy Pienta, Lois Timms-Ferrara, Copeland H. Young
Research Support
Thanks to the Library of Congress (PA#NDP03-1), the National Institute on Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
Related Work
M. Altman and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4 (March/April). 2007.
M. Altman, et al., “Data Preservation Alliance for the Social Sciences: A Model for Collaboration”, Proceedings of DigCcurr07, Chapel Hill. April 2007.
G. King, “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, 32, 2 (November, 2007): 173–199.
M. Altman, “A Fingerprint Method for Verification of Scientific Data”, in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. 2008.
Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. “Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences.” The American Archivist. Forthcoming 2009.
Gutmann, M., Abrahamson, M., Adams, M.O., Altman, M., Arms, C., Bollen, K., Carlson, M., Crabtree, J., Donakowski, D., King, G., Lyle, J., Maynard, M., Pienta, A., Rockwell, R., Timms-Ferrara, L., Young, C. “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”. Library Trends. Forthcoming 2009.
Altman, M., “Transformative effects of NDIIPP, the case of the Henry A. Murray Archive”. Library Trends. Forthcoming 2009.
Micah Altman; Bryan Beecher; Jonathan Crabtree; with Leonid Andreev, Ed Bachman, Adam Buchbinder, Steve Burling, Patrick King, Marc Maynard, “A Prototype Platform for Policy-Based Archival Replication”, Against The Grain, (Forthcoming) Winter 2009.
Why Preserve Social Science Data?
Value to society
Value to science
Uniqueness
Value to democracy
What is Digital Social-Science Data?
DIGITAL
Optical: DVD, CD
Magnetic: Tapes, ‘Floppies’
Paper: cards, tapes
SOCIAL SCIENCE
Social: class, crime, social movements, culture, folklore, family
Economic: wealth, prosperity, labor, business, equity
Psychology: cognition, attitudes, stereotypes
Politics: justice, democracy, public policy, public administration, international conflict
DATA
Raw measurements
Numeric tables
Administrative records (& email)
Video and audio interviews, transcripts (& blogs)
Digital objects (web sites, interactive databases)
Data is the Key to Science
Science is not (only) about being scientific
Scientific progress requires community: competition and collaboration in the pursuit of common goals
Without access to the same materials: no community exists
… data is the nucleus of collaboration.
Social Science Research Data often Unique
Public policy: exact replication of findings
Social science concepts/measurements are ambiguous: replication is used for calibration
Much social science data is observational
Unique events
Dynamics of large groups, systems, societies
Behavior embedded in a context: history, geography, government, society
Need to Revisit Social Science Data
Challenges of social science research
Few definitive answers
Complex conceptual primitives
Complex theories of behavior
Reliance on noisy observational data
Specification uncertainty
Changing evidence base
Often theories are provisional
Often predictive power is low
Data may be analyzed for many different purposes, to test different theories
Articles Are not Enough
Scholarly articles are summaries, not the actual research results
The value of an article that can’t be replicated: ?
But:
Data access is spotty by field, finding the data is still hard
Hard for journal editors to verify.
If you find it, how do you know it’s the same?
Replication projects show: most published articles in social science cannot be replicated
… data is necessary for replication and verification
Data is a Key to Democracy
Statistics = state-istics
The state tax authority: counting people, estimating wealth
Reformers use data to assess the performance of the state
Science informs public policy continually
In modern democracy: the public needs a direct source of information
Domain Specific Challenges for Preservation
Preservation is an afterthought – loss ensues
Privacy and confidentiality
Intellectual property
Rapidly changing evidence base
History Minute: Great Moments in the Early History of Digital Preservation in Social Science
1890 – Use of the Hollerith card for the U.S. decennial census
1924 – Foundation of the Odum Institute
1962 – Foundation of the Inter-university Consortium for Political Research (later renamed ICPSR)
Half Time: State of Play
Many (a majority?) of the bits comprising traditional tabular quantitative social science data are archived
Most of the data that is archived is in government archives or large domain archives
Mature DDI “standard” for metadata about tabular social science data
Privacy protected through “enclaves” – primarily physical restrictions on access
Loss of Social Science Research Data
Centrally collected government data often well-preserved, but…
Much social science data produced by municipalities, local governments
Researchers often have reduced incentives for preservation…
Fear of misinterpretation
Data collection processes less systematic
More individual effort put in data production
Confidentiality concerns
Most smaller, researcher-produced data not archived.
How Data Is Lost
Data Intentionally Discarded
“It was just too long ago, I generally keep data for something like 10 years beyond the last time I do something with them.”
“Destroyed, in accord with APA 5-year post-publication rule.”
Unintentional Hardware Problems
“Some data were collected, but the data file was lost in a technical malfunction.”
Destroyed for Confidentiality Reasons
“The material…was considered sensitive data. Institutional review boards.. required us to promise to destroy the data after a certain period of time...”
Acts of Nature
“The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”
Discarded or Lost in a Move
“As I retired …. Unfortunately, I simply didn’t have the room to store these data sets at my house.”
Obsolescence
“Speech recordings stored on a LISP Machine…, an experimental computer which is long obsolete.”
Simply Lost
“For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”
Research by:
Legal Considerations
Technical Confidentiality Challenges
Many challenges to privacy:
The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
The “EZ-Pass Problem”: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman and Pavlik 2008]
The “Facebook Problem”: masked network data can be identified if only a few nodes are controlled [Backstrom et al. 2007]
The “Blog Problem”: pseudonymous communication can be linked through textual analysis [Novak, Raghavan, and Tomkins 2004]
Source: [Calabrese, et al 2007; Real Time Rome Project 2007]
Social Science Privacy Challenges
Informing policy to balance research and confidentiality & privacy
Probing changes in attitudes towards use of personal information
Understanding social consequences of changes in public information
Evidence Base is Shifting…
Collective holdings of all U.S. numeric social science data in all major data archives and government repositories: estimated at tens of TB
vs. 18,000,000 TB of telephone calls…
“Ambient” data is increasingly becoming the subject of social science research.
Radically different data formats (technical & conceptual)
Agent-based modeling
Social network modeling
Automated coding of text, audio, video
“Virtual world” experiments and observation
Examples
Harvesting and analysis of blogs for virtual political opinion surveys [Adamic & Glance 2005]
Continuous collection of video of congressional debates, with automated subject coding
Cell phone data on movement, proximity to others
Participative redistricting – social mapping
Agent-based models of emerging institutions
fMRI analyses of reaction to political and social scenarios
Two Idiosyncratically Selected Emerging Trends
Data Integration
Shared Catalogs
Citations
New institutional models for preservation
Institutional repositories
Web 2.0 (aka `open science’, Google++)
Virtual Archiving
Archival Collaboration
Shared Catalogs
IQSS Dataverse Network
30,000+ studies
Including 20,000+ studies from Data-PASS Alliance
CESSDA
25,000 studies
20 major archives
Enabling technologies
OAI-PMH
DDI
Dataverse Software
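As a rough illustration of how OAI-PMH lets one catalog harvest study metadata from another, the sketch below builds a `ListRecords` request and pulls record identifiers out of a response. The verb and the `metadataPrefix` parameter come from the OAI-PMH protocol; the base URL, the `ddi` prefix value, and the sample response are assumptions made up for this example, not the configuration of any archive named on the slide.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the identifier found in each record header of a response."""
    root = ET.fromstring(response_xml)
    headers = root.iterfind(".//oai:record/oai:header", OAI_NS)
    return [h.findtext("oai:identifier", namespaces=OAI_NS) for h in headers]


# Hypothetical response, trimmed to the parts this sketch reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:study-1</identifier></header></record>
    <record><header><identifier>oai:example.org:study-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    # "ddi" is an assumed prefix; a real repository advertises its own.
    print(list_records_url("http://archive.example.org/oai", "ddi"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also follow resumption tokens and parse the metadata payload of each record; both are left out here to keep the sketch short.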
Data Citations
Citations are a traditional formal mechanism to link together intellectual works
Citations glue together: Regulations, Publications, and Evidence
But, lack of rules for citing numeric data:
No consistency in practice
No fixed rules for copyeditors
Sometimes in the list of references; sometimes a casual mention in the text
Sometimes the archive is noted
Sometimes a version number exists
Sometimes the version number is listed (if it exists)
Archive numbers are sometimes given, if they exist
Sometimes the author is noted
Date of creation is sometimes given
URLs often given, rarely persist
Dates of access: protect the researcher, do not help find the data
The data may not be available publicly
The data may no longer exist
A Unified Citation Standard for Quantitative Data
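The figure for this slide did not survive extraction. As a minimal sketch, assuming the six citation elements proposed in Altman and King (2007, listed under Related Work), namely author, date, title, a unique global identifier, a universal numeric fingerprint, and a bridge service, a data citation could be assembled as follows. The function name, the punctuation, and every value in the example are hypothetical placeholders rather than the published format.

```python
def format_data_citation(authors, year, title, global_id, unf, bridge_url=None):
    """Assemble a data citation string from its elements.

    The ordering and punctuation are illustrative assumptions.
    """
    parts = [
        "; ".join(authors) + ".",
        f"{year}.",
        f'"{title}",',
        global_id,   # persistent identifier, e.g. a handle
        unf,         # fingerprint tying the citation to exact data content
    ]
    if bridge_url:
        parts.append(bridge_url)  # resolver that turns the identifier into a link
    return " ".join(parts)


if __name__ == "__main__":
    # All values below are placeholders, not a real study.
    print(format_data_citation(
        authors=["A. Researcher"],
        year=2009,
        title="Example Survey Data",
        global_id="hdl:0000.0/00000",
        unf="UNF:3:PLACEHOLDER==",
        bridge_url="http://resolver.example.org/hdl:0000.0/00000",
    ))
```

Keeping the identifier and the fingerprint as separate fields addresses two items from the previous slide: the identifier helps a reader find the data after a URL stops working, and the fingerprint lets them check that what they found is the same data.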
Institutional Repositories for Data?
So far, not so much… Why?
Focus on more publications – coin of the realm
Scale differs from publication – storage requirements
Heterogeneity vs. publications
Discipline specific metadata needed to make it discoverable to primary users
Discipline specific formats, norms, workflows
Privacy, confidentiality issues
[but Watch MIT, they may buck this trend…]
Web 2.0 For Data
[Figure: Privacy? + Law? + Preservation? + Analysis? = ?]
* Can you count how many βββ’s are in this picture?
Pitfalls of Web 2.0
Dead soon after (or before) launch…
Google Research Data – dead before it launched
Graphwise – dead in a year
Business model clearly excludes preservation…
Amazon Public Data Sets
Business model?
Swivel
Data360
Freebase
Many Eyes
...[ Insert name of web 2.0 startup here ]
Virtual Archiving: The Dataverse Network*
An Open-Source, Federated, Web 2.0 Data Network
Gateway to over 20,000 social science studies (world’s largest catalog)
Web 2.0 Virtual Hosting Service
Federated access to other networks
Unified access to major U.S. research data archives, government data
Open service – endowed hosting
Open source – GPL-Affero-3
Discovery Services: Simple & fielded search; Virtual collection browsing
Management: Ingest; Curation & review; Virtual Hosting and administration
Metadata delivery: Descriptive and structural; Provenance (chain-of-custody metadata); Human and OAI interfaces
Preservation: Standards based; Reformatting; Universal Numeric Fingerprints
Enhanced Delivery: Replication; Layered analysis services
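The idea behind a numeric fingerprint is that it depends on the values in a dataset, not on the file format that happens to store them, so the same data can still be verified after reformatting. The sketch below is a simplified illustration of that idea and is explicitly not the actual Universal Numeric Fingerprint algorithm: the rounding rule, the canonical text form, the truncated SHA-256 digest, and the `SKETCH:` prefix are all assumptions chosen for brevity.

```python
import base64
import hashlib


def normalize(value, digits=7):
    """Canonical text form of a number, kept to `digits` significant digits."""
    return format(float(value), f".{digits - 1}e")


def fingerprint(values, digits=7):
    """Hash the canonical forms of the values, independent of how they were stored."""
    canonical = "\n".join(normalize(v, digits) for v in values) + "\n"
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Truncate and encode so the result is short enough to print in a citation.
    return "SKETCH:" + base64.b64encode(digest[:16]).decode("ascii")


if __name__ == "__main__":
    as_numbers = [1, 2.5, 3]
    as_text = ["1.0", "2.50", "3e0"]   # same values, as read from a text file
    changed = [1, 2.5, 3.001]          # one value altered

    print(fingerprint(as_numbers) == fingerprint(as_text))
    print(fingerprint(as_numbers) == fingerprint(changed))
```

Normalizing before hashing is the design choice that matters: a plain checksum of the file would change whenever the data were migrated to a new format, even though no value changed.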
Dataverse Network
An Open-Source, Federated Data Network
Unified access to major U.S. research data archives, government data
Web 2.0 Virtual Archiving Service
http://theData.org
DataPASS
Collaboration for Preservation
Partnership Agreements
Agreement to establish good practice
Preservation copies of data collected
Transfer Protocol: in case of archival failure
Joint “Not-bad” practices
Identification & selection
Metadata
Security
Confidentiality
Cooperating Operations
Central database of leads for acquisition
Development of shared procedures
Review of acquisitions
Shared Catalog
Unified Discovery
Content exchange
Layered Services
"Nothing new that is really interesting comes without collaboration" -- James Watson
Replication as Institutional Insurance
External Causes of Preservation Failure
Third party attacks
Institutional funding
Change in legal regimes
Quis custodiet ipsos custodes?
Unintentional curatorial modification
Loss of institutional knowledge & skills
Intentional removal
Change in institutional mission
Data-PASS Syndicated Storage Project
Schema driven: capture inter-archival preservation commitments
Asymmetric: resource commitments proportional to holdings
Versioned: versioned data and citations
Integration: LOCKSS + Archival Replication Schema + DVN technology + archival workflows
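One way to read "schema driven" and "asymmetric" is that the partners write their replication commitments down in a machine-checkable form and audit the actual placement of copies against it. The sketch below assumes a deliberately simple policy: every archive's holdings must be copied to a fixed number of other archives, and each archive pledges storage equal to that number times its own holdings. The policy, the function names, and the figures are hypothetical; they are not the Data-PASS schema.

```python
def commitments(holdings_tb, copies):
    """Storage each archive pledges: `copies` times its own holdings (assumed rule)."""
    return {name: copies * size for name, size in holdings_tb.items()}


def audit(holdings_tb, placements, copies):
    """Compare a replication plan with the policy and return a list of problems.

    `placements` maps each owning archive to the partners holding its copies.
    """
    problems = []
    pledged = commitments(holdings_tb, copies)
    used = {name: 0 for name in holdings_tb}

    for owner in holdings_tb:
        hosts = placements.get(owner, [])
        # Only copies held by known partners other than the owner count.
        external = {h for h in hosts if h in holdings_tb and h != owner}
        if len(external) < copies:
            problems.append(
                f"{owner}: {len(external)} external copies, policy requires {copies}")
        for host in external:
            used[host] += holdings_tb[owner]

    for name in holdings_tb:
        if used[name] > pledged[name]:
            problems.append(
                f"{name}: hosting {used[name]} TB, pledged {pledged[name]} TB")
    return problems


if __name__ == "__main__":
    holdings = {"A": 5, "B": 5, "C": 5}   # TB held by each hypothetical archive
    plan = {"A": ["B"], "B": ["A", "C"], "C": ["A", "B"]}
    for line in audit(holdings, plan, copies=2):
        print(line)
```

Because the audit only reads the declared holdings and placements, any partner or outside party can run it, which speaks to the "Quis custodiet ipsos custodes?" concern on the slide.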
Recommendations
Mind the gaps – capture information early
Build in confidentiality
Policies may catalyze preservation
Mind the Gaps – Capture Early
Common tools cover different parts of the research process: not yet connected
Major gap between data collection/analysis and publication:
Linking data and publication: data citations
Embed analysis in publication where possible: Sweave, statdocs, etc.
Archiving is now a separate step, better to integrate from the beginning:
Provide bit-level replication as a research service, keep it for publication & archiving later
Capture information from workflow tools (e.g. VisTrails), increase usefulness of data with low effort
[Figure: research life-cycle stages (design, collection, processing, integration, analysis, publishing, dissemination, preservation, reuse) mapped to tools: CATI/CAPI; Sweave/statdocs; citations/identifiers; Web 2.0; data archives, hosting, networks; general digital libraries and repositories; workflow systems]
Build in Confidentiality
Licensing and Intellectual Property Protections
Standardize license terms and metadata
Click-through agreements, vetting workflows
Authentication, auditing, logging
Embedding sensitive data in a digital library can improve subject confidentiality:
Authentication, vetting, and access control
Standardized license terms governing analysis (derived from metadata and data characteristics)
Models can be run on-line without access to raw data
Monitoring and auditing of data use
Limit sequence of analyses by a user, in some cases (for promising results, see [Dwork et al. 2006])
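The last two points, running analyses without access to raw data and limiting the sequence of analyses, can be illustrated with the noise-calibration approach of the cited Dwork et al. paper. The sketch below answers counting queries with Laplace noise scaled to the query's sensitivity and refuses further queries once a privacy budget is spent. The class name, the budget values, and the choice to support only counting queries are assumptions for illustration; this is not how any particular archive implements access control.

```python
import random


class PrivateCounter:
    """Answer counting queries with Laplace noise under a fixed privacy budget."""

    def __init__(self, records, total_epsilon, rng=None):
        self._records = list(records)      # raw data never leaves this object
        self._remaining = total_epsilon    # budget shared by all queries
        self._rng = rng or random.Random()

    def noisy_count(self, predicate, epsilon):
        """Return the count of matching records plus Laplace noise."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # A count changes by at most 1 when one record is added or removed,
        # so its sensitivity is 1 and the noise scale is 1 / epsilon.
        return true_count + self._laplace(1.0 / epsilon)

    def _laplace(self, scale):
        # The difference of two independent exponentials is Laplace distributed.
        return (self._rng.expovariate(1.0 / scale)
                - self._rng.expovariate(1.0 / scale))


if __name__ == "__main__":
    ages = [23, 35, 41, 58, 62, 19, 47]    # hypothetical confidential records
    counter = PrivateCounter(ages, total_epsilon=1.0)
    print(counter.noisy_count(lambda a: a >= 40, epsilon=0.5))
    print(counter.noisy_count(lambda a: a < 30, epsilon=0.5))
    try:
        counter.noisy_count(lambda a: a > 60, epsilon=0.5)
    except PermissionError as err:
        print("refused:", err)
```

Smaller `epsilon` values add more noise per answer but allow more queries before the budget runs out, which is the trade-off a curator would tune.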
Policies may catalyze preservation
Include preservation in funding for research infrastructure
Open access mandates
Data citation requirements
By journals
By sponsors of research published in journals
Additional References
L. A. Adamic and N. Glance, 'The Political Blogosphere and the 2004 U.S. Election: Divided They Blog', Annual Workshop on the Weblogging Ecosystem, WWW2005, Japan, 2005.
L. Backstrom, C. Dwork, J. Kleinberg. Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography. Proc. 16th Intl. World Wide Web Conference, 2007.
Calabrese F., Colonna M., Lovisolo P., Parata D., Ratti C., 2007, "Real-Time Urban Monitoring Using Cellular Phones: a Case-Study in Rome", Working paper # 1, SENSEable City Laboratory, MIT, Boston, http://senseable.mit.edu/papers/ [also see the Real Time Rome Project, http://senseable.mit.edu/realtimerome/]
C. Dwork, F. McSherry, K. Nissim, and A. Smith, Calibrating Noise to Sensitivity in Private Data Analysis, Proceedings of the 3rd IACR Theory of Cryptography Conference, 2006.
J. Gibson and D. McKenzie, 2007. Using Global Positioning Systems in Household Surveys for Better Economics and Better Policy, The World Bank Research Observer 22(2): 217-241.
A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, 2007, "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data, 1(1): 1-52.
A. Narayanan and V. Shmatikov, 2008, Robust De-anonymization of Large Sparse Datasets, Proc. of 29th IEEE Symposium on Security and Privacy (Forthcoming).
J. Novak, P. Raghavan, A. Tomkins, 2004. Anti-aliasing on the Web, Proceedings of the 13th International Conference on World Wide Web.
Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, National Research Council, 2007. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. National Academies Press.
D.L. Zimmerman, C. Pavlik, 2008. "Quantifying the Effects of Mask Metadata, Disclosure and Multiple Releases on the Confidentiality of Geographically Masked Health Data", Geographical Analysis 40: 52-76.
Lyman, Peter and Hal R. Varian, "How Much Information", 2003. Retrieved from http://www.sims.berkeley.edu/how-much-info-2003
For More Information
Contact me: http://maltman.hmdc.harvard.edu/
Dataverse Network Project: http://TheData.Org
Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/