Facilitating the Aggregation of Dispersed Personal Archives A Proposed Functional, Technical, and Business Model

Christopher J. Prom Assistant University Archivist and Associate Professor, University of Illinois at Urbana-Champaign

Abstract We keep records in archives because such institutions are dedicated to preserving authentic evidence of human activity. ‘Cloud’ services pose a direct challenge to the archival mission. Archivists and all of humanity have a direct interest in building tools that help people aggregate, use, and control records they created. This paper outlines the conceptual model of one such service, which is dubbed “myKive,” and which is currently undergoing proof-of-concept development at the University of Illinois. After describing its necessity, the paper lists the proposed service’s functions, outlines its core architecture, and describes its development/business framework.

Author Christopher J. (Chris) Prom is Assistant University Archivist and Associate Professor of Library Administration at the University of Illinois at Urbana-Champaign. He holds a PhD in history from the University of Illinois and also studied at the University of York (United Kingdom). He is a Fellow of the Society of American Archivists and has received several other research fellowships including most recently a 2009-10 Fulbright Distinguished Scholar Award. He maintains the Practical E-Records Blog and an active publication portfolio. His research describes the ways in which archival users seek information relevant to their needs and assesses methods that archivists can use to efficiently meet those needs. He most recently authored a technical watch report for the Digital Preservation Coalition, “Preserving Email.” Chris is also co-director of the Archon™ project, which developed an open source application for managing archival descriptive information and digital objects, and he is a member of the ArchivesSpace project, which is developing a next-generation archival management system. He has served the Society of American Archivists in several capacities. He is currently a member of the editorial board of The American Archivist.

I would like to begin my paper with a story. The story demonstrates key challenges faced by archives and archivists in what we might term the cloud era, the era of dispersed digital archives. Last November, I boarded a train at Union Station in Chicago, Illinois. I had just left a meeting of the Society of American Archivists’ Fundamental Change Working Group. This group was charged with revising the Fundamentals Series, which comprises the heart of our society’s publishing program.1

1 The six books that comprise this series are available by visiting the publications pages at the Society of American Archivists website, at http://bit.ly/LyVy06 (Accessed June 26, 2012).

Everyone at the meeting was acutely aware of two facts: 1) that newly trained archivists need a sophisticated set of digital skills, and 2) that our new instructional manuals must facilitate these skills. Moving quickly to find a seat on the train, I spotted a person from my University. I’ll call this person “Dr. Important.” After the requisite chit-chat, Dr. Important asked me what I had been working on lately. “Well, I’ve been writing a guide to email preservation.” “Oh, that’s interesting. Maybe you can help me.” Who doesn’t like to be asked for help? Maybe I could tell Dr. Important how to organize email and export it to a preservation-ready format. If lucky, I might even convince Dr. Important to transfer email to the University Archives, where it would become a public research resource. In this way, it would be accessible much like the handwritten or typescript correspondence from many other important people who had worked or studied at the University of Illinois in past years. “You see, I went to look for something I sent back in 2009,” Dr. Important continued. “I’ve been keeping a copy of all of my important emails, one folder for each month. But when I went back to find the message I needed, all the folders were gone.” Dr. Important told me that technical staff could not restore the emails, which likely went missing during a system migration that had taken place several months prior. As an archivist, I mourned the death of the evidence and information that Dr. Important had created and cared for over many years. But I felt helpless, and I let the conversation drift to another topic.
This incident, and many others that I could tell from my time at the University of Illinois, illustrate one of the greatest challenges that archivists face: ensuring the preservation of evidence when people’s communication tools have, in effect, become their unofficial recordkeeping mechanisms. This problem is particularly pressing because, in most institutions, centralized systems to manage correspondence and other communications are dead or at least have one foot firmly in the grave. Given this fact, what can we (as a profession) do to make sure that usable records are fixed into a medium that will facilitate their preservation and use? In order to answer that question, we must understand the ways in which information is dispersed within modern organizations and external social networks. More to the point, we must understand the way in which technology makes information into records that subsist within human social networks. With a better understanding of how records are formed and used within the technologies that facilitate such networks, we will be better positioned to capture and preserve not only information, but also contextual data about how that information was dispersed, used, and reused.

1. ‘Record-ness,’ Archives, and the Need for a Personal Archives Service

In the cloud era, record capture and preservation systems must take three factors into account: 1) the perceived lack of value accorded to preserving digital communications; 2) the communication and information management practices used by individuals; and 3) the specific ways in which contextual data transforms information into evidence, within human social networks and the technologies that support them. Taken as a whole, the implications of these three factors call into question the continued existence of archives in the cloud environment, if by archives we mean a group of records that are maintained as a collective using the principles of provenance, original order, and collective control.

1.1 The perceived lack of value accorded to preserving digital communications

In 1899, the American sociologist Thorstein Veblen wrote that “the cheap, and therefore indecorous, articles of daily consumption in modern industrial communities are commonly machine products.”2 Such articles are much used but little valued, at least in a monetary sense. For that reason, they are easily lost or discarded. Any American who has eaten at a Fourth of July picnic knows how easy it is to throw dirty plastic utensils and plates into the trash, in spite of their utility when the hot dogs and watermelon were being served. In post-industrial societies, digital communications comprise one of the cheap, and therefore indecorous, articles of daily consumption. We are familiar with the forms that these materials take: email messages, blog posts, Facebook updates, tweets, online videos. Each can be inexpensively produced with the help of an electronic device. Each is arguably less decorous than the format for communication that it replaced, such as the handwritten letters, illustrated diaries, or professionally produced films in which archives like to traffic.3 Given this fact, one may expect that the greatest challenge in preserving such materials might consist simply in convincing people that their personal digital communications are important enough to preserve. But this is not the case. In the abstract, many people value digital materials highly and keep everything they send or create. However, most of them do not much concern themselves when a system crash sweeps digital records away as in a flood.4 This points to an important truism: the broader information ecology in which people work makes it very difficult for both organizations and individuals to identify, capture, and preserve the records that have the most long-term archival value, unless extraordinary actions are taken. Let me provide a few examples.
My own institution, the University of Illinois, formerly made extensive use of college and departmental subject files, documenting faculty teaching, research, service, and administration. I say ‘formerly’ because over the past twenty years these paper-based files have largely disappeared. During the same period, most of our distinguished faculty members stopped keeping systematic correspondence files, aside from messages fortuitously retained within active email accounts. Asking administrators or faculty members to keep records outside of their communication applications (either in paper or in digital form) seems like a fruitless task. First, it would require that the institution implement an expensive software and hardware product, such as an Electronic Records Management (ERM) application. More to the point, implementing such a system would require that people make extensive changes to their work habits and procedures, something that is extremely unlikely in the Facebook Era, with its emphasis on immediate communication and response. Where ERM or ‘document management’ systems have been implemented, we see numerous problems follow. For example, staff in the office of our chief administrative officer (the Chancellor) are worried that email messages documenting critical policy decisions never make their way into the document management system, since administrators don’t like to change their work habits and deposit email. Staff members are also worried that the system will not survive the departure of the current records manager.

2 Thorstein Veblen, The Theory of the Leisure Class (New York: Viking Press, 1967), 161, http://www.gutenberg.org/ebooks/833.
3 For one attempt to provide a more decorous platform for personal reminiscence and storytelling in the digital era, see the Cowbird service, founded by Jonathan Harris, at http://cowbird.com (Accessed June 26, 2012).
4 Catherine C. Marshall, “Challenges and Opportunities for Personal Digital Archiving,” in I, Digital: Personal Collections in the Digital Era, ed. Christopher A. Lee (Chicago, Illinois: Society of American Archivists, 2011), 99–100.

Underlying this issue, we see a larger problem. While relatively little attention is being paid to building systems to retain substantive records of archival value, our campus has put many resources into building systems that preserve other types of information. For example, most universities maintain expensive business systems that store transactional data relating to financial affairs, personnel management, and student academic records. Most, if not all, of the data managed by these systems lack long-term administrative and archival value. The maintenance of these systems is certainly necessary for the daily operation of the University. The data within them is used to produce aggregated management information, which is certainly of long-term archival value.5 Similarly, the University of Illinois and many other libraries operate well-designed and successful applications that preserve formal research outputs, such as published scholarly papers. A rich literature has arisen around the development and implementation of these repositories.6 By comparison, relatively little has been written concerning the theory and practice of preserving personal or professional correspondence, informal communications, or social media records. For example, email preservation has not been formally included within the scope of most large-scale digital preservation projects.7 As a result, archivists can easily be left in the position of sweeping up the few crumbs of information from a faculty member or administrator’s computer or closet, much like I did when cleaning out the office of a distinguished chemist, Stanley Smith, in 2011.8 If we want a better future for historical research, we must develop a method for archivists and records creators to work together, so that communications in email systems and social networking technologies can be kept alive long enough to be accessioned to an archives.

1.2 Personal communication and information management practices

An email I received while on sabbatical illustrates how recordkeeping functions have devolved to individual initiative. The author, who wished to remain anonymous, nearly lost his entire email record when his former employer abruptly terminated access.9 The prominent American journalist James Fallows relates a similar story concerning his wife’s Gmail account, which was hacked, leading to the deletion of its entire contents. Ten years’ worth of messages were salvaged only because Fallows took advantage of some personal connections at Google.10 While these stories are anecdotal, they resonate with our experiences as archivists. Furthermore, they show that efforts to preserve personal records in the era of cloud computing must focus on helping people generate archives that subsist outside of their communication utilities. But what types of services are most needed?

5 At the University of Illinois, such datasets are managed by the Division of Management Information, http://www.dmi.illinois.edu/ (Accessed June 7, 2012).
6 Charles W. Bailey, “Institutional Repository Bibliography, Version 4,” Digital-scholarship.org, June 15, 2011, http://digital-scholarship.org/irb/irb.html.
7 Christopher J. Prom, Preserving Email, DPC Technology Watch (Digital Preservation Coalition, December 2011), 4–8, http://dx.doi.org/10.7207/twr11-01.
8 See http://www.library.illinois.edu/archives/archon/index.php?p=collections/controlcard&id=10857 for a description of the Smith Papers, and access to some of his digital files.
9 Prom, Preserving Email, 16.
10 James Fallows, “Hacked!,” The Atlantic, November 2011, http://www.theatlantic.com/magazine/archive/2011/10/hacked/8673/#.

In a two-part article entitled Rethinking Personal Digital Archiving, Microsoft Principal Researcher Cathy Marshall notes that most people exhibit an information archiving instinct.11 At an extreme, this tendency can exhibit itself as compulsive information hoarding, either of hard copy or digital materials.12 Most people exhibit modest self-archiving behaviours, such as the six methods that Marshall describes in the first part of her article.13 Technologies facilitate each of these strategies, as well as other strategies that people have developed since Marshall conducted her research. However, Marshall finds that archiving technologies or strategies are often used very ineffectively, if the goal of ‘archiving’ is the long-term preservation of evidence of human activity. I have also noticed this problem. One faculty member whom I know contracts with a service to back up her computer desktop, which includes photographs, reports, and working documents. The information on this computer may have some long-term value, particularly if the photographs or documents contain embedded metadata (which is doubtful).14 But the ‘saving’ is dependent on her remembering to transfer content to the service. Furthermore, the information on her computer is lifeless until it is communicated or shared with someone else. Such communication or sharing takes place via her three email accounts, her blog, and her Facebook page. What this means is that unless the email, blog posts, and status updates from these sources are saved in a fixed format, along with their metadata, the full impact of her work will be lost. Her career and influence will be less well understood than they might be, because the evidential value of her activities will not have been preserved. We will be left with a set of documents that lie mute on an image of her hard drive, lacking the life blood (context) that makes archives a uniquely useful research resource. Why is saving personal communications important?
In the academic realm, I would argue that it is important because libraries in the United States are currently placing a great deal of emphasis on other things. For example, much attention is being given in libraries to the development of data curation and data management services, often with the aim of encouraging faculty to preserve their research data in an institutional or disciplinary repository.15 Several national programs have been launched, aiming to help

11 Catherine C. Marshall, “Rethinking Personal Digital Archiving Part 1: Four Challenges from the Field,” D-Lib Magazine 14, no. 4 (2008), http://www.dlib.org/dlib/march08/marshall/03marshall-pt1.html; Catherine C. Marshall, “Rethinking Personal Digital Archiving, Part 2: Implications for Services, Applications, Institutions,” D-Lib Magazine 14, no. 3/4 (March 2008), http://www.dlib.org/dlib/march08/marshall/03marshall-pt2.html.
12 Renae Reinardy, “Information Hoarding: The Need to Know and to Remember,” OCD Newsletter (Fall 2006): 14–15; Christopher Mims, “The Internet May Encourage ‘Information Hoarding’ - Technology Review,” Technology Review, September 27, 2010, http://www.technologyreview.com/view/420947/the-internet-may-encourage-information-hoarding/.
13 1) System backups; 2) saving old “My Documents” folders; 3) writing important files to external media such as compact disks or hard drives; 4) emailing themselves important documents and attachments; 5) putting materials on social media sites (e.g., Flickr, Facebook); and 6) saving an image of the entire platform, to be restored in case of system failure.
14 We are well advised to recall David Bearman’s definition for the word “record” as “communicated information.” David Bearman, “Managing Electronic Mail,” in Electronic Evidence: Strategies for Managing Records in Contemporary Organizations, ed. David Bearman (Pittsburgh: Archives and Museum Informatics, 1994), 189–90.
15 Sayeed Choudhury, “Data Curation: An Ecological Perspective,” College and Research Libraries News 71, no. 4 (April 2010): 194–96.

librarians become more ‘data-centric.’16 My institution, for example, launched a “Year of Data Stewardship.”17 Internationally, a rich literature has developed around this issue.18 Those practicing data curation should be regarded as valued partners, to whom we can articulate the archival emphasis on the evidence that makes data useful. The field is open to us. We can develop complementary tools and services, which will preserve the evidence (the communications) that makes research data interpretable. Recent archival literature articulates a conceptual framework for personal digital archives. This framework complements the library community’s emphasis on data curation. For example, the work of Richard Cox, Cathy Marshall, Cal Lee, and others offers a theoretical basis for preserving the evidence found in dispersed personal digital archives.19 Similarly, some scholars and members of the popular media have argued that individuals should establish digital legacy plans or even provide instructions regarding the disposition of their digital assets within their estate plans.20 However useful this literature may be, what people most need are practical tools and services that preserve communications and social sharing. By designing and implementing such tools, archivists can work to build trusted relationships with people. If we help people generate true archives from their currently dispersed digital communications, and if we make that archives useful to them during their lifetimes, they (or their heirs) will have the ability and motivation to deposit those archives in one of our repositories.

1.3 How contextual data transform data, information and knowledge into evidence

John Seely Brown and Paul Duguid open their book The Social Life of Information with a point that may seem trite: “[I]nformation and individuals are always part of rich social networks.”21 However, the implications of this thought are profound for archival practice.22 Elaborating their conception of

16 Council on Library and Information Resources, “Data Curation Initiatives”, 2012, http://www.clir.org/initiatives-partnerships/data-curation; Alan Blatecky and Chris Greer, Data Web Forum Concept Paper, June 2012, http://www.cni.org/wp-content/uploads/2012/06/DataWebForum_Concept_Paper.pdf.
17 http://www.cio.illinois.edu/Data_Stewardship. See the Illinois Research Data Initiative blog at http://blogs.cites.illinois.edu/datasteward/ for more information.
18 Tony Hey and Anne Trefethen, “The Data Deluge: An e-Science Perspective,” in Grid Computing (John Wiley & Sons, Ltd, 2003), 809–824, http://dx.doi.org/10.1002/0470867167.ch36; Karen S Baker and Lynn Yarmey, “Data Stewardship: Environmental Data Curation and a Web-of-Repositories,” International Journal of Digital Curation 4, no. 2 (2009): [online]; and Ixchel Faniel and Ann Zimmerman, “Beyond the Data Deluge: A Research Agenda for Large-Scale Data Sharing and Reuse,” International Journal of Digital Curation 6, no. 1 (2011): [online] provide representative examples of this literature.
19 Richard J Cox, Personal Archives and a New Archival Calling: Readings, Reflections and Ruminations (Duluth, Minn.: Litwin Books, 2008); Marshall, “Rethinking Personal Digital Archiving, Part 2: Implications for Services, Applications, Institutions”; Christopher A. Lee, ed., I, Digital: Personal Collections in the Digital Era (Chicago, Illinois: Society of American Archivists, 2011).
20 Evan Carroll and John Romano, Your Digital Afterlife: When Facebook, Flickr and Twitter Are Your Estate, What’s Your Legacy? (Berkeley CA: New Riders, 2011); Rebecca J. Rosen, “The Government Would Like You to Write a ‘Social Media Will’,” The Atlantic - Technology Blog, May 3, 2012, http://www.theatlantic.com/technology/archive/2012/05/the-government-would-like-you-to-write-a-social-media-will/256700/.
21 John Seely Brown and Paul Duguid, The Social Life of Information (Boston: Harvard Business School Press, 2002), xxv.
22 One example of how this insight is transforming archival practice can be seen in the Social Networks and Archival Context Project, led by Daniel Pitti at the University of Virginia. The project seeks to develop a national archival authorities cooperative. Jennifer Howard, “Archive Watch: Building a National Cooperative for Archival

information’s social life, they argue that “documents . . . help structure society, enabling social groups to form, develop, and maintain a shared sense of identity.”23 Brown and Duguid note that information can easily ‘clot’ within formal organizations or records systems, so that critical information is not communicated to those who most need it. (I recently experienced this phenomenon when a key decision-maker never heard about a report the University Archives developed. As a result, he spent over three months duplicating our work.) On the other hand, data, information and knowledge quickly leak from within organizations to cross-cutting networks of affinity or practice. This is particularly true within an academic setting, but also in business and government, since the influential members of an organization typically work within an interwoven professional network. Facebook and other social media applications accelerate the process of information leakage, particularly when an item ‘goes viral.’ But a video need not be viewed by millions of people for information to have a profound impact outside of formal organizational structures. For example, a professor with whom I am currently working exercises a leadership role on numerous faculty committees that exist within our organization, but she also contributes to professional associations, maintains a blog, and helps mentor current and former students via Facebook. For each of the social networks within which she is engaged, she uses particular tools and services, dispersing evidence across multiple systems, and demonstrating her influence in society. As a profession, we must develop tools and services to capture and preserve something that is richer than mere data, information, or knowledge. We must find a way to aggregate communicated information that is currently held in dispersed systems, in a way that enhances its value as evidence of people’s activities.
As Theo Thomassen noted in his address at the 17th Brazilian Congress on Archival Science,24 the preservation of evidential value has been one of the archival profession’s defining values. As such, it needs constant reemphasis and renewal.25 At this time, it must be refreshed to address the rise of social networking technologies.26

Standards - Wired Campus - The Chronicle of Higher Education,” Wired Campus Blog: The Chronicle of Higher Education, May 22, 2012, http://chronicle.com/blogs/wiredcampus/archive-watch-building-a-national-cooperative-for-archival-standards/36368.
23 Brown and Duguid, The Social Life of Information, 189.
24 Theo Thomassen, “Archivists and the Private Desire To Be or Not To Be Documented,” Proceedings of the 17th Brazilian Congress on Archival Science (Rio de Janeiro: Associação dos Arquivistas Brasileiros), forthcoming.
25 As Adrian Cunningham has pointed out, the concepts behind digital preservation and digital curation (such as the Open Archival Information System Reference Model and the Trustworthy Repositories Audit Checklist) provide necessary tools to help ensure that records are authentic. However, they are insufficient to the task of “total archiving” because they do not put sufficient emphasis on pre-custodial interventions that help maintain evidential value. Brian Lavoie, The Open Archival Information System Reference Model: An Introductory Guide, DPC Technology Watch (Online Computer Library Center, Inc. and Digital Preservation Coalition, 2004), http://www.dpconline.org/component/docman/doc_download/91-introduction-to-oais; Robin Dale and Bruce Ambacher, eds., Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist (Chicago, Illinois: CRL, 2007); Adrian Cunningham, “Digital Curation/Digital Archiving: A View from the National Archives of Australia,” American Archivist 71, no. 2 (2008): 530–543; Adrian Cunningham, “Ghosts in the Machine: Towards a Principles-Based Approach to Making and Keeping Digital Personal Records,” in I, Digital: Personal Collections in the Digital Era (Chicago, Illinois: Society of American Archivists, 2011), 78–89.
26 The desire for active intervention to shape the record and preserve evidence has been a constant theme since the beginning of electronic records work, but relatively less attention has been paid to the importance of evidence within the context of social networks. For example, Bearman emphasized the evidential value of records within business systems. David Bearman, “Archival Principles and the Electronic Office,” in Electronic Evidence: Strategies for Managing Records in Contemporary Organizations, ed. David Bearman (Pittsburgh: Archives and Museum

As Brown and Duguid note, the media that hold information do not merely contain, carry, or convey that information; they also structure its use within communities.27 For example, paper documents are relatively mobile but largely immutable. The 95 Theses of the Augustinian monk Martin Luther were dispersed through Europe on foot, without much modification, in a fixed format, and at a relatively slow pace. On the other hand, electronic communications are both instantly mobile and highly mutable, particularly when they are shared via email, websites, or social networking technologies. A blog post can be changed at a moment’s notice, and many versions of it may be distributed or replicated with a single keystroke or computer command. In order to provide useful research materials to future generations, we must find a way to capture these mobile and mutable records in a fixed format, in a defined location, and in a controlled fashion. This should be done in a manner that preserves enough metadata to make the communications understandable within their original social context. Furthermore, archival systems should also store and manage communications sent via multiple systems and networks, so that no one form of communication is given a privileged status. In this way, materials originating from a common source or records creator (and sharing a common provenance) can be treated in a collective fashion, thus preserving them as an archives, here defined as “[m]aterials created or received by a person, family, or organization, public or private, in the conduct of their affairs and preserved because of the enduring value contained in the information they contain or as evidence of the functions and responsibilities of their creator, especially those materials maintained using the principles of provenance, original order, and collective control.”28
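
To make the idea of ‘fixing’ a mutable communication more concrete, the following is a minimal sketch in Python, assuming that a message is available as raw bytes in standard Internet mail format. It is an illustration only, not a description of any existing system: the function name fix_message, the source label, the field names, and the sample message are all hypothetical. The sketch records a checksum of the exact bytes (so that later alteration can be detected) together with the header metadata that places the message within its social context.

```python
import hashlib
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def fix_message(raw: bytes, source: str) -> dict:
    """Describe one message as a fixed record: checksum plus contextual metadata.

    `source` is a hypothetical label for the system the message came from.
    """
    msg = message_from_bytes(raw)
    sent = msg["Date"]
    return {
        "source": source,
        # Fixity: a checksum of the exact bytes received.
        "sha256": hashlib.sha256(raw).hexdigest(),
        # Context: who communicated with whom, when, and in reply to what.
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "sent": parsedate_to_datetime(sent).isoformat() if sent else None,
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
    }


# An invented sample message, used only to exercise the function.
RAW = (
    b"From: dr.important@example.edu\r\n"
    b"To: colleague@example.org\r\n"
    b"Date: Mon, 02 Nov 2009 09:15:00 -0600\r\n"
    b"Message-ID: <b2@example.edu>\r\n"
    b"In-Reply-To: <a1@example.edu>\r\n"
    b"Subject: Budget decision\r\n"
    b"\r\n"
    b"Approved as discussed.\r\n"
)

record = fix_message(RAW, "university-email")
print(record["sent"], record["in_reply_to"])
```

The reply header is retained because it links one message to another; it is this kind of relationship, rather than the message text alone, that allows a set of communications to be read as evidence.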

2. The Proposed myKive Service

I propose that the objective described above can be accomplished if the archival/memory community develops an extensible self-archiving application. The tool that I have in mind will allow people to aggregate their personal digital communications into a replicated storage location in a fixed, preservation-ready format, where they can control the records. I have given this service the provisional name of myKive (“My Archive”).
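
The general pattern of aggregation into replicated storage can be sketched briefly in Python. What follows is an assumption-laden illustration and not the myKive design: the functions deposit and verify, the manifest layout, and the sample items are all hypothetical, and two temporary directories stand in for storage locations. Items from several sources are gathered under one creator, ordered by date, written to every replica under their checksums, and listed in a manifest that allows each copy to be checked later.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def deposit(creator, items, replicas):
    """Write each item, and a manifest describing all items, to every replica."""
    manifest = {"creator": creator, "items": []}
    # Order by timestamp so no single source is given a privileged position.
    for source, sent, content in sorted(items, key=lambda item: item[1]):
        digest = hashlib.sha256(content).hexdigest()
        manifest["items"].append({"source": source, "sent": sent, "sha256": digest})
        for root in replicas:
            (root / digest).write_bytes(content)
    for root in replicas:
        (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify(replica):
    """Return True if every item in the manifest is present and unaltered."""
    manifest = json.loads((replica / "manifest.json").read_text())
    for item in manifest["items"]:
        path = replica / item["sha256"]
        if not path.is_file():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != item["sha256"]:
            return False
    return True


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    replicas = [Path(a), Path(b)]
    # Invented items: (source, ISO timestamp, content bytes).
    items = [
        ("blog", "2012-03-01T10:00:00", b"Post: notes on email preservation"),
        ("email", "2009-11-02T09:15:00", b"Message: budget decision"),
    ]
    manifest = deposit("Dr. Important", items, replicas)
    order = [item["source"] for item in manifest["items"]]
    all_intact = all(verify(root) for root in replicas)

    # Simulate silent corruption of one copy; the other replica is untouched.
    first = manifest["items"][0]["sha256"]
    (replicas[1] / first).write_bytes(b"altered")
    damage_detected = not verify(replicas[1])
    other_copy_intact = verify(replicas[0])

print(order, all_intact, damage_detected, other_copy_intact)
```

Storing each item under its checksum is one simple way to make alteration detectable; a production service would also need format migration, access controls, and geographically separate replicas, none of which this sketch attempts.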

2.1 Why a New Service?

Traditionally the ‘fixing’ function has taken place when the papers of individuals are deposited in an archives, likely after sitting in storage for a period of time. The analogue fixing function is a method with which archivists are deeply familiar. But whereas every archivist can easily remove records from a closet, relatively few can easily remove them from an email account or a blog. The proposed myKive service will be an open-source software application that archivists can use to help people save their digital communications and other records in a trusted location. It is tempting to think that people already have access to good tools that allow them to back up and preserve their files. This is simply not the case. Many users write files to hard drives or other locations,

Informatics, 1994), 146–75. Reprinted from Information Handling in Offices and Archives, ed. Angelika Menne-Haritz (New York: KG Sauer, 1993), 177–93.
27 Brown and Duguid, The Social Life of Information, 189–200.
28 Richard Pearce Moses, A Glossary of Archival and Records Terminology (Chicago: The Society of American Archivists Press, 2005), 30.

but Cathy Marshall’s research shows that they rarely use the software consistently or correctly.29 One faculty member with whom I am working complains that Apple’s ‘Time Machine’ overflowed the backup disk. (My own Time Machine recently deleted any backups more than six months old, for a similar reason.) Other people who use such backup devices do not know if they are capturing email, or only desktop documents. In any case, backup programs do not create archives, because materials deleted from the source disk may also be deleted from the backup device, unless the user is technically savvy enough to prevent that from happening. Commercial vendors also offer backup services. Carbonite, Mozy, and Crashplan incrementally mirror defined files to an external disk or off-site server. Dropbox can also function as a backup service. While superficially attractive, these services may leave the user extremely vulnerable to data loss or corruption. The end user license agreements (EULAs) for such services typically provide no protection to the user, and the services do not provide the types of digital preservation or migration services that the academic community is ideally placed to provide.30 Similar problems afflict emergent services, such as Backupify and Nuffly, which individuals can use to back up data in cloud-based email or social media.31 Therefore, most backup services do not offer a true preservation option. On the other hand, one might argue that preservation of social media or email is not a problem since the services themselves function as a de facto archives. For example, Facebook’s Timeline feature provides users the ability to present a chronological stream of documents, photographs, comments and other materials.
This is an example of what Eric Freeman and David Gelernter called a Lifestream.32 Superficially, the features provided by Facebook and Google make these services look like an archives, but as Jason Scott of Archiveteam jokes, “Google is a library or archive like a supermarket is a food museum.”33 While there is much value to the services that social media companies provide, their continuance is predicated upon a business model that exploits personal data as an economic commodity.34 If there is no further monetary value to be gained by maintaining the personal data, one can expect that it will be conveniently deleted or lost. This possibility is highlighted by the fact that, in its EULA,

29 Cathy Marshall, “Ownership, Aggregation and Re-use of Personal Data” (presented at the Personal Digital Archiving 2012, San Francisco, California, February 24, 2012), http://archive.org/details/personaldigitalarchiving2012pt2. 30 For example, under Crashplan the user waives the right to anything but minor remedies and the agreement may be terminated at will by the service provider. Furthermore, the license mandates no positive obligation to provide users access to their own data, not only in the case of business failure but even during the course of daily business. Code 42 Software, “End User License for Crashplan”, July 30, 2011, http://support.crashplan.com/doku.php/eula. 31 Rafe Needleman, “Backupify Is More Than a Backup Service,” Rafe’s Radar | CNET News, February 3, 2011, http://news.cnet.com/8301-19882_3-20030614-250.html; “Google Gmail Backup Service | Backupify”, n.d., http://www.backupify.com/gmail/; Nuffly.com, “Services - Nuffly.com”, 2011, http://nuffly.com/services. 32 Eric Freeman and David Gelernter, “Lifestreams: a Storage Model For Personal Data,” ACM SIGMOD Record 25, no. 1 (March 1996): 80–86. 33 Jason Scott, “ArchiveTeam and the Case of the Widespread Recognition,” presentation at Personal Digital Archiving 2012, The Internet Archive, San Francisco, CA, February 24, 2012, http://e-records.chrisprom.com/jason-scott-archive-team/. 34 Facebook, “Statement of Rights and Responsibilities”, April 26, 2011, http://www.facebook.com/legal/terms; Guilbert Gates, “Facebook Privacy: A Bewildering Tangle of Options,” NYTimes.com, May 12, 2010, http://www.nytimes.com/interactive/2010/05/12/business/facebook-privacy.html; Stevie Marshall, “Commodifying Community: How Contributing to Facebook Is Selling Our Soul and Why We Don’t Care,” Online Conference on Networks and Communities, February 24, 2010, http://networkconference.netstudies.org/2010/04/commodifying-community/; James Fallows, “Facebook, Google, and the Future of the Online ‘Commons’,” The Atlantic Blogs, February 3, 2012, http://www.theatlantic.com/technology/archive/2012/02/facebook-google-and-the-future-of-the-online-commons/252522/.

Facebook assumes no positive obligation to preserve data. In spite of its size, the entire Facebook service is as susceptible to business failure as any other company. Facebook does, thankfully, provide a method for users to download all of their data, and a current version of the EULA gives people ownership of their records, but few users are likely to know what to do with the data once it has been downloaded.35 For these reasons alone, it would be best if we helped people find a way to generate true personal archives.

There is a final reason why such a personal archives service is needed. Most people use multiple communication or social media services. By bringing personally created content together in one location and fixing it into defined formats, we can preserve a more complete picture of a person’s life and influence than if we trust preservation to many service providers, holding data in many locations.

2.2 Functional Description of the myKive Service

In its initial stages, the myKive project seeks to develop an open-source software application that University of Illinois faculty members and students can use to collect email messages, social media, blog postings, reports, desktop files, and other fugitive materials. These records will be saved to a replicated server in an encrypted, standardized, and preservation-ready format. Regular integrity checks will be run to ensure materials are maintained in a trustworthy manner. Content will be stored with sufficient technical and structural metadata to permit its long-term preservation. Materials in a person’s myKive will remain wholly under their control and subject to a strict privacy policy, but stored on a central server. The contents of the myKive account will be replicated to an offsite location. Users will be provided visualization or other tools to make their records useful. People will be provided the opportunity to donate materials to an established archives or manuscript repository, based on mutual agreement and negotiation, at any time they or their heirs wish.36 Once materials are donated to a public institution, they will be managed under an access agreement outlining the terms under which archival users can access the files.

In providing these functions, the myKive project does not seek to replace a person’s existing applications or to affect their daily behaviour, but simply to provide a method by which they can aggregate and make useful content that is currently dispersed across multiple services. As part of the myKive pilot project, the University of Illinois will wrap four pieces of software into a web-based dashboard application. Initially, the dashboard will provide access to these tools:

• A social media archiving tool customized for use by libraries and archives to allow the preservation of account data for multiple users.
• An email archiving tool, which will use the SMTP protocol to transfer all sent and/or received mails (or, optionally, a filtered set) to a designated archival store.
• A desktop archiving tool, which will mirror files from a local computer to a synchronized version control repository.
• A communications visualization and mining tool, which will make materials in the myKive account immediately useful to their creator.
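The optional filtering step described for the email archiving tool can be illustrated with a minimal Python sketch. It is not part of the myKive design: the function name `archive_messages` is hypothetical, a local mbox file stands in for the "designated archival store," and the transport step (SMTP in the proposal) is left out entirely. The sketch only shows how a caller-supplied rule might decide which messages are kept.

```python
import mailbox
from email.message import EmailMessage
from typing import Callable, Iterable


def archive_messages(
    messages: Iterable[EmailMessage],
    store_path: str,
    keep: Callable[[EmailMessage], bool] = lambda msg: True,
) -> int:
    """Append the messages accepted by `keep` to an mbox file.

    Returns the number of messages stored. The mbox file is an assumed
    stand-in for the archival store; it is created if it does not exist.
    A production version would also lock the mailbox while writing.
    """
    store = mailbox.mbox(store_path)
    stored = 0
    try:
        for message in messages:
            if keep(message):  # apply the optional filter
                store.add(message)
                stored += 1
        store.flush()  # make sure appended messages reach the disk
    finally:
        store.close()
    return stored
```

A user who wanted to archive only correspondence from one address could pass a rule such as `keep=lambda msg: msg["From"] == "a@example.org"`; omitting `keep` archives every message.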

35 “Facebook,” Archiveteam, April 2, 2012, http://archiveteam.org/index.php?title=Facebook. 36 The materials would be managed under a legal agreement/partnership between the records’ creator (as a donor), myKive.org (as service provider), and the archives/manuscript library (as beneficiary). More information is available at http://www.mykive.org.

By focusing on three critical formats (social media, email, and desktop files), the project seeks to target format types that are most widely used by people in their personal and work lives. The visualization tools will make the records immediately useful. Preservation is provided as a critical, but secondary, benefit to the end user. While users of the system will surely benefit from its preservation aspects, the project is based on the presupposition that the service will be much more likely to be used if it provides its users with an immediately tangible benefit, rather than a distant, disembodied one.

2.3 Proposed Technical Model

During the pilot stage, we have developed a provisional technical model. At this time, we anticipate that the application will use and extend existing open source software, which will be wrapped within middleware to be developed by the University of Illinois. This middleware will link the individual system components into an overall ‘archiving’ application, managed from a web-based dashboard. The service core will consist of an application programming interface (API) that links or glues these components together. Before describing the proposed technical infrastructure in more detail, it is important to note three factors:

1. We propose to build the core application on the LAMP (Linux, Apache, MySQL, PHP) or Ruby on Rails platform, which will ensure the availability of a large community of open-source developers and code contributors.
2. The tool will utilize a service-oriented architecture and micro-services, which will allow the processing load to be spread among several servers, as necessary.
3. Its interface will include internationalization features (for example, the capability for multi-language support), allowing for world-wide adoption.

Within this overall framework, the application will consist of the following elements, shown in schematic form (see figure one):

• Application Core: user and account management, system security and API. The system will use an MVC framework to segregate application control and data views from the object and data model, which stores and manipulates information in user accounts.
• Social Media Archiving: ThinkUp Application. The social media archiving component will consist of the ThinkUp application with appropriate extensions and links to our application core. ThinkUp provides ways to harvest and aggregate tweets, Facebook content, and Google Plus postings (see figure two for a screenshot of a generic ThinkUp installation).37 It is being developed by a very active community of developers, led by Gina Trapani of Expert Labs. While the project has not, to my knowledge, garnered any previous use in the archival community, it represents an excellent starting point for social media archiving, provided that such archival activities can be undertaken within a framework that allows for the co-management of records and their eventual donation to an archives. As we get involved with the project, we plan to

37 Expert Labs, “ThinkUp: Social Media Insights Platform”, 2011, http://thinkupapp.com/; Mark Sample, “Putting Twitter to Work with ThinkUp,” ProfHacker - The Chronicle of Higher Education, October 28, 2010, http://chronicle.com/blogs/profhacker/putting-twitter-to-work-with-thinkup/28161.

contribute plugins or other code to the ThinkUp project, and will investigate ways in which ThinkUp data can be repurposed and reused alongside other components of the myKive system. Alternatively, we will investigate whether ThinkUp itself might be extended to serve as the application’s core.
• Email Archiving: Muse. The open-source ‘Muse’ software is suggested as a candidate technology for incorporation into myKive, since it uses an open storage format that facilitates data reuse and transformation, including visualizing, graphing, searching, and browsing.38 Currently, Muse works via an extension to popular web browsers, such as Chrome, Firefox and Safari, and it installs a Java applet on the user’s computer. A copy of the email is then downloaded to the user’s computer, and Muse provides visualization tools to make the email searchable and more useful, as shown in a screenshot from the application (see figure three). We propose to utilize the Muse core to build a server-based version of the software. As the project website puts it, “[t]he Muse infrastructure is fairly re-usable and provides support for login, caching, attachments, address book and entity resolution, automatic grouping, text indexing, and other functions.”39 If feasible, we will link the ThinkUp and Muse data into visualisation services at the application core level, providing users a unified method to search, retrieve, view and analyse information across services.
• Desktop Mirroring Tool: We propose using standard encryption techniques and secure socket layer technologies to capture and store desktop files from a local computer on a synchronized and replicated server. The SparkleShare application is suggested as a candidate technology (see figure four).40 SparkleShare is an open source alternative to file sharing and backup applications like Dropbox and Carbonite. In a typical scenario, users of such applications download client software and install it on their computers, then upload files to a host machine, perhaps using an automatic file synchronization feature. One of the interesting things about SparkleShare is that the project allows users to upload files to any host machine, provided that the host machine is configured as a git repository.41 We propose that by establishing myKive as a git repository, and then integrating the SparkleShare synchronization tools into our myKive application core/dashboard, we can add automated desktop file backup, synchronization, and even version control to the system. In theory, users will be able to choose whether to keep a complete record of the desktop over time, or only the latest version of the files.
• Dashboard with Visualization Tools: Users will be provided a dashboard application to manage the system components described above. Once they complete initial setup, users will need to perform little or no maintenance, other than keeping passwords current. However, we believe that by providing visualization and data mining tools, such as those integrated into Muse and ThinkUp, we will provide users a reason to remain interested in and use the application. Over

38 Stanford University, Mobisocial Laboratory, “Muse Home Page”, 2011, http://mobisocial.stanford.edu/muse/. 39 Muse help pages, http://pepperjack.stanford.edu:59992/muse/help 40 See the Sparkleshare project website at http://sparkleshare.org/; See also Danny Steiban, “Sparkleshare – A Great Open Source Alternative To Dropbox.” Make Use Of Blog: July 7, 2011. Available: http://www.makeuseof.com/tag/sparkleshare-great-open-source-alternative-dropbox-linux-mac/ 41 See http://en.wikipedia.org/wiki/Git_%28software%29. Git is an open source revision control system, typically used for source code management in open source projects. However, it can be used as part of any distributed file sharing project, and has the advantage of including automatic revision controls, which can be used in myKive to keep a record of—and copies of—changed or deleted files.

time, we propose to integrate other data mining and visualization tools into the service. This work is anticipated to take place after the pilot service has been developed.

Finally, the core API will include a plug-in architecture, similar to that used in WordPress or other web applications, so that other types of material can be preserved via the service. This will allow people from the computer science, archives, library, and digital curation communities to extend the application. Using those extensions and plug-ins, people will be able to harvest records from other services into their myKive. For example, the following record types might be included in an individual’s myKive, once appropriate extensions are developed:

• Blogs, using backup tools such as ArchivePress and WordPress Database Backup;
• Photographs, using capture tools such as parallel-flickr;
• Web pages, using harvesting tools such as wget, warc, and NutchWax; and
• Personal reference/citation libraries, using tools such as Zotero’s application programming interface.
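One way such a plug-in architecture might look is sketched below in Python. This is an illustration under stated assumptions, not the project's design: the paper proposes PHP or Ruby on Rails for the core, and every name here (`register`, `harvest`, `harvest_blog`, the record fields) is hypothetical. The sketch shows only the general idea that extensions register a harvester for a record type and the core dispatches to it.

```python
from typing import Callable, Dict, Iterable, List

# A harvester takes an account identifier and yields records as plain dicts.
Harvester = Callable[[str], Iterable[dict]]

# Hypothetical in-memory registry mapping record type names to harvesters.
_REGISTRY: Dict[str, Harvester] = {}


def register(record_type: str) -> Callable[[Harvester], Harvester]:
    """Decorator that registers a harvester under a record type name."""
    def decorator(func: Harvester) -> Harvester:
        if record_type in _REGISTRY:
            raise ValueError(f"harvester already registered: {record_type}")
        _REGISTRY[record_type] = func
        return func
    return decorator


def harvest(record_type: str, account: str) -> List[dict]:
    """Run the harvester for record_type and tag each record with its type."""
    try:
        harvester = _REGISTRY[record_type]
    except KeyError:
        raise LookupError(f"no harvester for record type: {record_type}") from None
    return [dict(record, record_type=record_type) for record in harvester(account)]


@register("blog")
def harvest_blog(account: str) -> Iterable[dict]:
    # Placeholder data; a real extension would call a blog export tool here.
    yield {"account": account, "title": "First post"}
```

With this arrangement, adding support for photographs or citation libraries would mean writing one more decorated function, without changing the core dispatch code.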

2.4 Business Model and Sustainability

At this time, myKive is a pilot service, in the very early stages of its development at the University of Illinois. If the pilot is successful, we anticipate that the project will garner widespread interest, and may serve as the basis for a collaborative international project. Since the project will use or extend existing open-source projects and tools, it will itself be made freely available via GitHub, a code repository and version control system. This will encourage a collaborative development process. Based on the results from this project, we will consider launching a not-for-profit community resource via the www.myKive.org website, in conjunction with a non-profit partner such as Duracloud/Duraspace and/or the Internet Archive.42 In this way, the service will be useful not only at the University of Illinois, not only in the State of Illinois, not only in the United States of America, but hopefully, worldwide. The proposed myKive service would include the following elements:

• Competitive pricing model vis-à-vis commercial backup services,
• Encrypted data transfer and storage, and
• Integrated preservation management features such as automated checksum generation and monitoring.
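The checksum generation and monitoring mentioned in the last item can be sketched briefly in Python. The sketch assumes SHA-256 as the digest and a simple path-to-checksum mapping as the manifest; the paper specifies neither, and the function names (`sha256_of`, `build_manifest`, `audit`) are hypothetical. It is meant only to show the shape of a periodic integrity check.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare the files under root with a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Files recorded earlier that are no longer present.
        "missing": sorted(set(manifest) - set(current)),
        # Files still present whose contents no longer match.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Files present now that were never recorded.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In use, the manifest would be generated when records are ingested and stored outside the audited directory; a scheduled job could then call `audit` and alert an administrator whenever any of the three lists is non-empty.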

Initially, subscribers would self-register and pay a monthly fee, or their parent institution would pay on their behalf. For people or institutions that subscribe to the service, any records in their myKive account will be stored in an encrypted format and will remain wholly under their control, subject to a strict privacy policy. The service aims to be self-funding once initial research and development have been completed. I can also foresee a long-term support model. Optionally, users will be able to donate materials to an established archives or manuscript repository, based on mutual agreement and negotiation, at any time they wish. The materials will then be managed under a legal agreement/partnership between the records’

42 Duraspace Foundation, “Duracloud Website”, 2012, http://www.duracloud.org/.

creator (as a donor), www.myKive.org (as service provider), and the archives/manuscript library (as donee). Once materials are donated to a public institution, they will be managed under a deed of gift and access agreement outlining the terms under which members of the public can access the files, following standard archival practices.

3. Conclusion

In 1965, the founding archivist of the University of Illinois Archives, Maynard Brichford, wrote that: “Wherever the archivist may be located organizationally, he should be out of his office two-thirds of the time. While processing must be done in the Archives, the archivist should define and standardize processing procedures so that he may spend his time in locating the historical documentation relating to the activities of the university's staff and students. Effective appraisal must be done in offices, storerooms, stockrooms, and basements.”43

This advice is no less valuable today than it was in 1965. The myKive service seeks, ultimately, to provide people the ability to control their own records. But it will also provide the archival profession the ability to spend a good portion of our time ‘outside the office’ and to help people save records that are worth saving, in a way that rescues them from dispersion, collecting them into a real archives that is based on provenance, original order, and collective control. We will still rescue materials from storerooms, closets, and basements, but we will also help people rescue materials from Facebook profiles, email listservs, and blog commenting systems. With a service like myKive, we can help empower individuals to save these materials in a preservation-ready and accessible format, where they will be more useful to them today. And in the future, we can empower them to donate those materials to an archives or manuscript repository at a time of their choosing. If this idea excites you, I invite you to support the project and to lend your own perspective and expertise as it develops over the next year, as an initial project partner.

43 Maynard Brichford, “Appraisal and Processing,” in University Archives, Allerton Park Institute Proceedings 11 (University of Illinois Graduate School of Library Science, 1965), 46, http://www.ideals.illinois.edu/handle/2142/433.
