Provenance: A Future History
Citation: Cheney, James, Stephen Chong, Nate Foster, Margo Seltzer, and Stijn Vansummeren. 2009. Provenance: a future history. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object-Oriented Programming, Systems, Languages, and Applications. New York, NY: ACM. doi:10.1145/1639950.1640064
Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:5346327

James Cheney, University of Edinburgh ([email protected])
Stephen Chong, Harvard University ([email protected])
Nate Foster, University of Pennsylvania ([email protected])
Margo Seltzer, Harvard University ([email protected])
Stijn Vansummeren,∗ Hasselt University / Transnational University of Limburg, Belgium ([email protected])

Abstract

Science, industry, and society are being revolutionized by radical new capabilities for information sharing, distributed computation, and collaboration offered by the World Wide Web. This revolution promises dramatic benefits but also poses serious risks due to the fluid nature of digital information. One important cross-cutting issue is managing and recording provenance, or metadata about the origin, context, or history of data. We posit that provenance will play a central role in emerging advanced digital infrastructures. In this paper, we outline the current state of provenance research and practice, identify hard open research problems involving provenance semantics, formal modeling, and security, and articulate a vision for the future of provenance.

Categories and Subject Descriptors: D.3.1 [Programming Languages]: Formal Definitions and Theory

General Terms: Languages, Security, Reliability

Keywords: provenance, integrity, semantics

A future history of provenance

Transcribed keynote remarks from the 2019 Federated Provenance Conference, by [inaudible]

We won! Provenance is everywhere: it’s part of every storage system and every application; it’s secure; it’s queryable; it’s indexed by standard search engines and can be queried anywhere.

Provenance is pervasive, invisible, and handled automatically by default. Ten years ago people regularly used data without knowing where it was hosted or originated. Today such a thing is unthinkable. Documents whose origins are unknown are highly suspect and quickly attract the attention of a small but enthusiastic anti-plagiarism community.

Both public and confidential provenance are incorporated into the algorithms employed by major search engines, which use the provenance of a page to weigh its reliability and direct searches toward more reputable results. Individuals are able to query where their data is being used elsewhere on the Web, and can derive economic benefit and acclaim by providing useful data. The Semantic Web only became truly compelling once people had an incentive to contribute high-quality metadata. Provenance tracking provided this motivation by ensuring recognition for contributors. Some curators even make a living managing Semantic Web metadata.

The old vertically-integrated media industries (particularly music and print media) loved provenance—for a while. They thought it would revive their dream of complete control over their intellectual property through digital rights management. Indeed, provenance is now used to ensure that authors and creators are recognized and rewarded for their work, whether they are independent or affiliated with a major media label or news agency.
∗ Stijn Vansummeren is a Postdoctoral Fellow of the Research Foundation—Flanders (Belgium).

OOPSLA 2009, October 25–29, 2009, Orlando, Florida, USA. Copyright © 2009 ACM 978-1-60558-768-4/09/10.

The key difference between the success of provenance tracking and the failure of earlier attempts at DRM is that provenance is based on voluntary participation in the online community, not on centralized control imposed by legislation or lawsuits. Creators are free to decide whether to retain or expose provenance, and recipients of such information are free to decide whether to believe or trust it.

But how did we get here?

By 2009, it was widely recognized that provenance was important, but the effects of Moore’s Wall were only beginning to be felt, and 20th-century assumptions about storage and computational overhead still held sway. The cost of storage was still relatively high—around $20 for a terabyte instead of pennies. Likewise, typical personal computers contained only a few cores rather than the hundreds or thousands common today, and reliable, usable concurrent programming languages had not yet become mainstream. Now, of course, the extra storage and computational overhead of instrumenting programs to record and maintain detailed provenance metadata is effectively zero. Indeed, provenance pays for itself in terms of decreased liability for failures—most insurers today won’t even cover businesses that use hardware or software that is not provenance-aware.

A thornier issue was how to manage and search the masses of detailed provenance information that began to be generated. Early work viewed provenance as a simple directed acyclic graph. While rich languages for querying graph-like data had already been researched, they had not yet been integrated into scalable, robust, and industrial-strength systems. However, the development of DBMSs supporting fast graph queries was only a first step towards the rich provenance query languages we have today. Provenance queries essentially query the behavior of programs, and it was a significant challenge to formulate nontrivial provenance queries manually over the raw, low-level representation as a DAG. Instead, the first half of this decade saw the development of advanced provenance query languages that incorporated ideas from reflective and aspect-oriented programming. These languages make it easy for programmers to query provenance using the syntax they already use to write programs. A nice side benefit was that major programming languages and implementations finally took database–programming language integration seriously, solving the “impedance mismatch” problem once and for all.

Some cynics admitted that provenance might be important for critical applications such as banking or medical records, but doubted that it would carry over to general-purpose applications: “as soon as you move an object from one system to another,” they said, “you’re going to lose its provenance—we can’t even do extended attributes in a portable way!” But researchers and developers from a host of different areas banded together and formed a “provenance community”. Much like in the early days of the Internet and World Wide Web, their grass-roots efforts led to the rapid development and implementation of de facto standards for representing and transporting provenance metadata. These were eventually codified by the W3C and other organizations.

[…] an energizing effect on the provenance community similar to that of the Department of Defense’s Orange Book on the security community in the 1970s. Other nations followed suit, legislating stronger standards for transparency for both financial and social policy data.

Security and privacy also had to be completely rethought with the adoption of pervasive provenance technology. Nowadays, users can limit access to both sensitive data and its provenance, at the cost that others may be less willing to believe or trust their data. The tension between privacy and provenance was graphically illustrated by the protests in Iran following the contested election of June 2009. Sites such as Twitter, Facebook, and YouTube played an essential role early on, making it possible for protesters to connect with each other and for journalists to report to the outside world without government interference, but there was also a great deal of unverifiable “noise” and deliberate misinformation. Moreover, this information (along with more traditional forms of surveillance) was later used effectively to single out prominent Facebook and Twitter users for persecution in the brutal repression that followed the protests.

In the rest of this talk, I’ll ask you to think back to the summer of 2009. It was a pivotal time for this field. Many of these ideas were in their infancy, and almost no one had any idea how important provenance would become over the next 10 years. In fact, many computer scientists had either not heard of provenance, or thought it was either trivial or outright impossible. Some early papers on provenance were vague about what they wanted to accomplish or what they were actually proposing as solutions. Many papers proposed solutions to related problems, but did not provide enough detail about either the solution or the problem to make a real comparison possible. It was quite difficult for newcomers to the field to understand what was really essential. General foundational definitions, clear problem statements, and theories of provenance only started to coalesce by the end of the last decade. These ideas played a central role in all of the above developments, because a cohesive provenance community would have been impossible before everyone was able to agree on basic definitions. Perhaps the most important insight was...

[The rest of the recording is inaudible.]

1. Introduction

In an ideal world, software systems would be engineered to the highest standards. Programs would be expressed in