Challenges in the Curation and Preservation of Social-Scientific Knowledge

Micah Altman

Archival Director, Henry A. Murray Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Science
E: [email protected]
W: http://maltman.hmdc.harvard.edu/

[Presented at the Indo-US Workshop on International Trends in Digital Preservation, 2009]

This Talk

Why Preserve Social Science Data?
Threats and challenges
Emerging trends and technologies
Recommendations & Predictions

Indo-US Digital Preservation (Page 2) Micah Altman, Senior Research Scientist Workshop 2009 Collaborators & Co-conspirators

• Margaret Adams, Caroline Arms, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Cavan Capps, Jonathan Crabtree, Darrell Donakowski, Myron Gutmann, Gary King, Patrick King, Jared Lyle, Marc Maynard, Amy Pienta, Lois Timms-Ferrara, Copeland H. Young

Research Support

• Thanks to the Library of Congress (PA#NDP03-1), the National Institute on Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.

Related Work

M. Altman and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4 (March/April). 2007.
M. Altman, et al. “Data Preservation Alliance for the Social Sciences: A Model for Collaboration”, Proceedings of DigCCurr07, Chapel Hill. April 2007.
G. King. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, 32, 2 (November, 2007): 173–199.
M. Altman. “A Fingerprint Method for Verification of Scientific Data”, in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. 2008.
Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. “Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences.” The American Archivist. Forthcoming 2009.
Gutmann, M., Abrahamson, M., Adams, M.O., Altman, M., Arms, C., Bollen, K., Carlson, M., Crabtree, J., Donakowski, D., King, G., Lyle, J., Maynard, M., Pienta, A., Rockwell, R., Timms-Ferrara, L., Young, C. (in press). “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”. Library Trends. Forthcoming 2009.
Altman, M. “Transformative effects of NDIIPP, the case of the Henry A. Murray Archive”. Library Trends. Forthcoming 2009.
Micah Altman; Bryan Beecher; Jonathan Crabtree; with Leonid Andreev, Ed Bachman, Adam Buchbinder, Steve Burling, Patrick King, Marc Maynard. “A Prototype Platform for Policy-Based Archival Replication”, Against The Grain, (Forthcoming) Winter 2009.

Why Preserve Social Science Data?

Value to society
Value to science
Uniqueness
Value to democracy

What is Digital Social-Science Data?

DIGITAL
• Optical: DVD, CD
• Magnetic: Tapes, ‘Floppies’
• Paper: cards, tapes

SOCIAL SCIENCE
• Social: class, crime, social movements, culture, folklore, family
• Economic: wealth, prosperity, labor, business, equity
• Psychology: cognition, attitudes, stereotypes
• Politics: justice, democracy, public policy, public administration, international conflict

DATA
• Raw measurements
• Numeric tables
• Administrative records (& email)
• Video and audio interviews, transcripts (& blogs)
• Digital objects (web sites, interactive databases)

Data is the Key to Science

Science is not (only) about being scientific
Scientific progress requires community: competition and collaboration in the pursuit of common goals
Without access to the same materials, no community exists

… data is the nucleus of collaboration.

Social Science Research Data Is Often Unique

Public policy: exact replication of findings
Social science concepts and measurements are ambiguous: replication is used for calibration
Much social science data is observational
• Unique events
• Dynamics of large groups, systems, societies
• Behavior embedded in a context: history, geography, government, society

Need to Revisit Social Science Data

Challenges of social science research
• Few definitive answers
• Complex conceptual primitives
• Complex theories of behavior
• Reliance on noisy observational data
• Specification uncertainty
• Changing evidence base
Often theories are provisional
Often predictive power is low
Data may be analyzed for many different purposes, to test different theories

Articles Are not Enough

Scholarly articles are summaries, not the actual research results
The value of an article that can’t be replicated: ?
But: data access is spotty by field, and finding the data is still hard
Hard for journal editors to verify
If you find it, how do you know it’s the same?
Replication projects show: most published articles in social science cannot be replicated

… data is necessary for replication and verification

Data is a Key to Democracy

Statistics = state-istics
The state tax authority: counting people, estimating wealth
Reformers use data to assess the performance of the state
Science informs public policy continually
In modern democracy: the public needs a direct source of information

Domain Specific Challenges for Preservation

Preservation is an afterthought – loss ensues
Privacy and confidentiality
Intellectual property
Rapidly changing evidence base

History Minute: Great Moments in the Early History of Digital Preservation in Social Science

1890 – Use of the Hollerith card for the U.S. decennial census
1924 – Foundation of the Odum Institute
1962 – Foundation of the Inter-university Consortium for Political Research (later renamed ICPSR)

Half Time: State of Play

Many (a majority?) of the bits comprising traditional tabular quantitative social science data are archived
Most of the data that is archived is in government archives or large domain archives
Mature DDI “standard” for metadata about tabular social science data
Privacy protected through “enclaves” – primarily physical restrictions on access

Loss of Social Science Research Data

Centrally collected government data often well-preserved, but…
Much social science data produced by municipalities, local governments
Researchers often have reduced incentives for preservation…
• Fear of misinterpretation
• Data collection processes less systematic
• More individual effort put in data production
• Confidentiality concerns
Most smaller, researcher-produced data is not archived.

How Data Is Lost

Data Intentionally Discarded
• “It was just too long ago, I generally keep data for something like 10 years beyond the last time I do something with them.”
• “Destroyed, in accord with APA 5-year post-publication rule.”
Unintentional Hardware Problems
• “Some data were collected, but the data file was lost in a technical malfunction.”
Destroyed for Confidentiality Reasons
• “The material…was considered sensitive data. Institutional review boards.. required us to promise to destroy the data after a certain period of time...”
Acts of Nature
• “The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”
Discarded or Lost in a Move
• “As I retired …. Unfortunately, I simply didn’t have the room to store these data sets at my house.”
Obsolescence
• “Speech recordings stored on a LISP Machine…, an experimental computer which is long obsolete.”
Simply Lost
• “For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”

Legal Considerations

Technical Confidentiality Challenges

Many challenges to privacy:
• The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
• The “EZ-Pass Problem”: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman and Pavlik 2008]
• The “Facebook Problem”: masked network data can be identified if only a few nodes are controlled [Backstrom, et al. 2007]
• The “Blog Problem”: pseudonymous communication can be linked through textual analysis [Novak, Raghavan, and Tomkins 2004]

Source: [Calabrese, et al 2007; Real Time Rome Project 2007]
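To make the “Netflix Problem” concrete, here is a minimal sketch of probabilistic record linkage between an anonymized dataset and a small named auxiliary dataset. It is a toy illustration only, not the algorithm of Narayanan and Shmatikov: the function name `link_records`, the data layout, and the rarity weighting are all assumptions made for this example.

```python
import math
from collections import Counter


def link_records(anonymized, auxiliary):
    """Match each named auxiliary record to its most similar anonymized record.

    Both arguments map an identifier to a dict of {item: value}.
    Hypothetical helper for illustration; not taken from the cited paper.
    """
    # Count how many anonymized records contain each item.
    # Items held by few records are more identifying than popular ones.
    support = Counter(item for rec in anonymized.values() for item in rec)

    def score(aux_rec, anon_rec):
        # Add a rarity weight for every (item, value) pair the records share.
        return sum(
            1.0 / math.log(1 + support[item])
            for item, value in aux_rec.items()
            if anon_rec.get(item) == value
        )

    links = {}
    for name, aux_rec in auxiliary.items():
        # Pick the anonymized record with the highest similarity score.
        links[name] = max(anonymized, key=lambda rid: score(aux_rec, anonymized[rid]))
    return links


anonymized = {
    "u1": {"film_a": 5, "film_b": 3, "film_rare": 4},
    "u2": {"film_a": 5, "film_b": 2},
    "u3": {"film_a": 4, "film_c": 1},
}
auxiliary = {
    "alice": {"film_a": 5, "film_rare": 4},
    "bob": {"film_c": 1},
}
links = link_records(anonymized, auxiliary)
```

Even two or three known values can single out one record when the data are sparse, which is why removing names alone does not protect such datasets.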

Social Science Privacy Challenges

Informing policy to balance research and confidentiality & privacy
Probing changes in attitudes towards use of personal information
Understanding social consequences of changes in public information

Evidence Base is Shifting…

Collective holdings of all U.S. numeric social science data in all major data archives and government repositories: estimated at tens of TB
• Vs. 18,000,000 TB of telephone calls…
“Ambient” data is increasingly becoming a subject of social science research.
Radically different data formats (technical & conceptual)
• Agent-based modeling
• Social network modeling
• Automated coding of text, audio, video
• “Virtual world” experiments and observation
Examples
• Harvesting and analysis of blogs for virtual political opinion surveys [Adamic & Glance 2005]
• Continuous collection of video of congressional debates, with automated subject coding
• Cell phone data on movement, proximity to others
• Participative redistricting – social mapping
• Agent-based models of emerging institutions
• fMRI analyses of reaction to political and social scenarios

Two Idiosyncratically Selected Emerging Trends

Data Integration
• Shared Catalogs
• Citations
New institutional models for preservation
• Institutional repositories
• Web 2.0 (aka ‘open science’, Google++)
• Virtual Archiving
• Archival Collaboration

Shared Catalogs

IQSS Dataverse Network
• 30,000+ studies
• Including 20,000+ studies from Data-PASS Alliance
CESSDA
• 25,000 studies
• 20 major archives
Enabling technologies
• OAI-PMH
• DDI
• Dataverse Software
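As a rough sketch of how a shared catalog can harvest study metadata over OAI-PMH: the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 protocol. The base URL `https://archive.example.org/oai` and the prefix `ddi` are assumptions for illustration; each archive advertises its own endpoint and supported prefixes.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build a ListRecords request URL for a harvest or a follow-up page."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    return ids, (token.text if token is not None and token.text else None)


# Hypothetical endpoint and metadata prefix, used only to show the URL shape.
first_page = list_records_url("https://archive.example.org/oai", "ddi")
```

A harvester would fetch `first_page`, parse the response, and keep requesting pages with the returned token until none is supplied.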

Data Citations

Citations are a traditional formal mechanism to link together intellectual works
Citations glue together: Regulations, Publications, and Evidence
But, lack of rules for citing numeric data:
• No consistency in practice
• No fixed rules for copyeditors
• Sometimes in the list of references; sometimes a casual mention in the text
• Sometimes the archive is noted
• Sometimes a version number exists
• Sometimes the version number is listed (if it exists)
• Archive numbers are sometimes given, if they exist
• Sometimes the author is noted
• Date of creation is sometimes given
• URLs often given, rarely persist
• Dates of access: protect the researcher, do not help find the data
• The data may not be available publicly
• The data may no longer exist

A Unified Citation Standard for Quantitative Data

Institutional Repositories for Data?

So far, not so much… Why?
Focus on more publications – coin of the realm
Scale differs from publication – storage requirements
Heterogeneity vs. publications
• Discipline specific metadata needed to make it discoverable to primary users
• Discipline specific formats, norms, workflows
• Privacy, confidentiality issues

[but watch MIT, they may buck this trend…]

Web 2.0 For Data

Privacy? + Law? + Preservation? + Analysis? = ?

* Can you count how many βββ’s are in this picture?

Pitfalls of Web 2.0

Dead soon after (or before) launch…
• Google Research Data – dead before it launched
• Graphwise – dead in a year
Business model clearly excludes preservation…
• Amazon Public Data Sets
Business model?
• Swivel
• Data360
• Freebase
• Many Eyes
• ...[ Insert name of web 2.0 startup here ]

Virtual Archiving: The Dataverse Network*

An Open-Source, Federated, Web 2.0 Data Network
Gateway to over 20,000 social science studies (world’s largest catalog)
Web 2.0 Virtual Hosting Service
Federated access to other networks
Unified access to major U.S. research data archives, government data
Open service – endowed hosting
Open source – GPL-Affero-3

Discovery Services
• Simple & fielded search
• Virtual collection browsing
Management
• Ingest
• Curation & review
• Virtual hosting and administration
Metadata delivery
• Descriptive and structural
• Provenance (chain-of-custody metadata)
• Human and OAI interfaces
Preservation
• Standards based
• Reformatting
• Universal Numeric Fingerprints
Enhanced Delivery
• Replication
• Layered analysis services
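The idea behind a numeric fingerprint can be sketched as follows: normalize each value to a fixed number of significant digits, then hash the normalized form, so that the fingerprint survives reformatting but changes when the data change. This is a simplified illustration and not the actual Universal Numeric Fingerprint specification; the function name `fingerprint`, the choice of SHA-256, the 16-byte truncation, and the default of 7 digits are all assumptions.

```python
import base64
import hashlib


def fingerprint(values, digits=7):
    """Hash a numeric vector after rounding to `digits` significant digits.

    Simplified stand-in for a format-independent data fingerprint.
    """
    h = hashlib.sha256()
    for v in values:
        # Exponential notation with digits-1 decimals keeps `digits` significant
        # digits, independent of how the value was stored or printed.
        token = format(float(v), f".{digits - 1}e")
        # A newline separator keeps adjacent values from running together.
        h.update(token.encode("utf-8") + b"\n")
    # Truncate and base64-encode to get a short printable string.
    return base64.b64encode(h.digest()[:16]).decode("ascii")


original = fingerprint([1, 2.5, 3])
reformatted = fingerprint([1.0, 2.50000000001, 3.0])
altered = fingerprint([1, 2.5, 4])
```

Here `original` and `reformatted` agree because the values match to seven significant digits, while `altered` differs. A fingerprint of this kind, stored in a citation, lets a later reader check that the data in hand are the data that were cited.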

Dataverse Network

An Open-Source, Federated Data Network
Unified access to major U.S. research data archives, government data
Web 2.0 Virtual Archiving Service
http://theData.org

Data-PASS

Collaboration for Preservation

Partnership Agreements
• Agreement to establish good practice
• Preservation copies of data collected
• Transfer protocol: in case of archival failure
Joint “Not-bad” practices
• Identification & selection
• Metadata
• Security
• Confidentiality
Shared Catalog
• Unified Discovery
• Content exchange
• Layered Services
Cooperating Operations
• Central database of leads for acquisition
• Development of shared procedures
• Review of acquisitions

"Nothing new that is really interesting comes without collaboration" -- James Watson

Replication as Institutional Insurance

Data-PASS Syndicated Storage Project

External Causes of Preservation Failure
• Third party attacks
• Institutional funding
• Change in legal regimes
Quis custodiet ipsos custodes?
• Unintentional curatorial modification
• Loss of institutional knowledge & skills
• Intentional removal
• Change in institutional mission

Schema driven: capture inter-archival preservation commitments
Asymmetric: resource commitments proportional to holdings
Versioned: versioned data and citations
Integration: LOCKSS + Archival Replication Schema + DVN technology + archival workflows

Recommendations

Mind the gaps – capture information early
Build in confidentiality
Policies may catalyze preservation

Mind the Gaps – Capture Early

Common tools cover different parts of the research process: not yet connected
Major gap between data collection/analysis and publication:
• Linking data and publication: data citations
• Embed analysis in publication where possible: Sweave, StatDocs, etc.
Archiving is now a separate step, better to integrate from the beginning:
• Provide bit-level replication as a research service, keep it for publication & archiving later
• Capture information from workflow tools (e.g. VisTrails), increase usefulness of data with low effort

[Figure: stages of the research process (design, collection, processing, integration, analysis, publishing, dissemination, preservation, reuse) mapped to the tools that cover them: CATI/CAPI; Sweave/StatDocs; citations/identifiers; Web 2.0; data archives, hosting, networks; general digital libraries and repositories; workflow systems]

Build in Confidentiality

Licensing and Intellectual Property Protections
• Standardize license terms and metadata
• Click-through agreements, vetting workflows
• Authentication, auditing, logging
Embedding sensitive data in a digital library can improve subject confidentiality:
• Authentication, vetting, and access control
• Standardized license terms governing analysis (derived from metadata and data characteristics)
• Models can be run on-line without access to raw data
• Monitoring and auditing of data use
• Limit sequence of analyses by a user, in some cases (for promising results, see [Dwork, et al 2006])
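One way to read "limit sequence of analyses by a user" is as a privacy budget. The sketch below applies the Laplace mechanism described by Dwork et al. (noise with scale sensitivity/epsilon added to a query answer) to counting queries, and refuses further queries once the cumulative epsilon is spent. The class name `PrivateCounter`, its interface, and the budget values are hypothetical, chosen only to illustrate the idea; a production system would need far more care.

```python
import random


class PrivateCounter:
    """Answer counting queries with Laplace noise under a total epsilon budget."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = total_epsilon
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        # Each query spends part of the budget; epsilons add up across queries.
        if epsilon <= 0 or epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # A count changes by at most 1 when one person is added or removed,
        # so its sensitivity is 1 and the Laplace scale is 1 / epsilon.
        scale = 1.0 / epsilon
        # The difference of two independent Exp(1) draws is Laplace(0, 1).
        noise = scale * (self._rng.expovariate(1.0) - self._rng.expovariate(1.0))
        return true_count + noise


# Hypothetical data: ages of survey respondents.
counter = PrivateCounter([23, 35, 41, 58, 62], total_epsilon=1.0, seed=0)
over_40 = counter.noisy_count(lambda age: age > 40, epsilon=0.5)
under_30 = counter.noisy_count(lambda age: age < 30, epsilon=0.5)
```

After these two queries the budget is used up, so a third call raises `PermissionError`. The analyst never sees raw records, only noisy aggregates, which matches the slide's point that models can be run on-line without access to the raw data.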

Policies may catalyze preservation

Include preservation in funding for research infrastructure
Open access mandates
Data citation requirements
• By journals
• By sponsors of research published in journals

Additional References

L. A. Adamic and N. Glance, 2005. "The Political Blogosphere and the 2004 U.S. Election: Divided They Blog", Annual Workshop on the Weblogging Ecosystem, WWW2005, Japan.
L. Backstrom, C. Dwork, J. Kleinberg, 2007. "Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography", Proc. 16th Intl. World Wide Web Conference.
F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, C. Ratti, 2007. "Real-Time Urban Monitoring Using Cellular Phones: a Case-Study in Rome", Working paper #1, SENSEable City Laboratory, MIT, Boston. http://senseable.mit.edu/papers/ (also see the Real Time Rome Project, http://senseable.mit.edu/realtimerome/)
C. Dwork, F. McSherry, K. Nissim, and A. Smith, 2006. "Calibrating Noise to Sensitivity in Private Data Analysis", Proceedings of the 3rd IACR Theory of Cryptography Conference.
J. Gibson and D. McKenzie, 2007. "Using Global Positioning Systems in Household Surveys for Better Economics and Better Policy", The World Bank Research Observer 22(2): 217-241.
A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, 2007. "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data, 1(1): 1-52.
A. Narayanan and V. Shmatikov, 2008. "Robust De-anonymization of Large Sparse Datasets", Proc. of 29th IEEE Symposium on Security and Privacy (Forthcoming).
J. Novak, P. Raghavan, A. Tomkins, 2004. "Anti-aliasing on the Web", Proceedings of the 13th International Conference on World Wide Web.
Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, National Research Council, 2007. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. National Academies Press.
D. L. Zimmerman and C. Pavlik, 2008. "Quantifying the Effects of Mask Metadata, Disclosure and Multiple Releases on the Confidentiality of Geographically Masked Health Data", Geographical Analysis 40: 52-76.
P. Lyman and H. R. Varian, 2003. "How Much Information". Retrieved from http://www.sims.berkeley.edu/how-much-info-2003

For More Information

Contact me: http://maltman.hmdc.harvard.edu/

Dataverse Network Project: http://TheData.Org

Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/
