
Clifford Lynch and Lee Dirks, "New Initiatives in Open Research" (2011). Proceedings of the Charleston Library Conference. http://dx.doi.org/10.5703/1288284314874

New Initiatives in Open Research

Clifford Lynch, Executive Director, Coalition for Networked Information
Lee Dirks, Director, Portfolio Strategy, Microsoft Research

Clifford: Lee and I have both been involved in a number of things over the past two or three years that seemed to me to be gaining a considerable amount of momentum and that I think are going to really change the landscape in significant ways. What I want to do quickly is to put a frame, a context, around these developments, and then Lee is going to run through a series of example systems at blinding speed. We are going to try very hard to get through this quickly enough that we've got 10 minutes for questions and comments at the end. I'm at least going to focus my remarks primarily around scientific work. A great deal of what we're discussing applies to other kinds of academic scholarly communication, humanities in particular, but there are enough differences and nuances that I just want to stay primarily focused on science.

Let me suggest that we have got some really serious problems right now with the system of scholarly communication in science. They are not the problems that have been so exhaustively documented and analyzed over the past decade, the ones that have more to do with economics. There are certainly economic problems in the system, but the stress points I want to talk about here are separate from economics, and they occur along several different dimensions.

One dimension is scale. It is easy to forget how the system of scientific publishing is getting bigger and bigger and bigger, and as the scientific enterprise grows globally at an alarming rate, it gets steadily worse. Demands on scientists for ever greater volumes of publishing to gain academic rewards of course don't help either. If you do the math you will find horrifying numbers, something like a scientific paper published every minute or two. It means you're buried. It means everybody's buried.

There's a problem with speed. We are under constant pressure to make scientific research and discovery more effective, to make it move faster, to get more science per dollar, to accelerate medical breakthroughs. And yet in some areas the rate of progress, and the rate of transmission of knowledge, seems to be at best holding constant, or even slowing. Most of the formal journal system is really slow. In some communication channels where we are achieving considerable speed, it is often at high cost. When you just look at the scale and speed challenges together, you realize that we have got huge problems with filtering, with reviewing work, validating work, finding colleagues, finding important papers. We also face a very weird and frustrating situation: at least one of the hopeful prospects for getting some handle on this enormous and ever-expanding scientific literature is through computation upon it. We see the very beginnings of that with things such as Microsoft Academic Search and some of the text mining experiments that have been done, but from a practical basis the obstacles involved in getting access to really substantive segments of the scientific literature are intractable. These are mostly trapped in silos right now, so this is another part of the system that we need to overcome, and I think it is a particularly challenging one for everybody involved.

The last stress point, and this is the one that I think we're going to spend the most time on here, is that there is a growing disconnect between practices and norms in scholarly work and the way the system of scholarly communication operates. There has been a lot written about this. There is a wonderful book that Microsoft Research and Tony Hey over there put together, in memory of Jim Gray, called The Fourth Paradigm, which is available for downloading for free over the net. There is a very nice new book by Michael Nielsen, Reinventing Discovery. There are a mass of reports from the National Science Foundation's Office of Cyberinfrastructure and similar organizations in other nations. The bottom line here is that more and more science is data intensive, is computationally intensive; it relies on big data, it relies on complex software, it relies on large-scale collaborations among dispersed people. And there's a great focus emerging on how data is managed, preserved, discovered, shared, reused and repurposed, in part at the urging of the funding agencies.

Copyright of this contribution remains in the name of the author(s). DOI: http://dx.doi.org/10.5703/1288284314874

Now, this is pretty mainline. I want to note that there are a cadre of people, many of them graduate students and young scholars, who really want to take this farther, who are making an argument for a more fundamental change where data and experimental work is exposed very, very early in its life cycle. They speak of "open science" or "open research". You can see systems like myExperiment, a platform that we don't have time to talk about here in any detail, which is a way of sort of putting up your lab work for inspection. When taken to the limit, these are ideas that I would say are still relatively out on the fringes of scientific practice, but the kinds of things I'm talking about here are much closer to consensus, to the center of scientific practice.

Somehow we've got to get past the point of producing articles that are designed for print and just storing and delivering them electronically. A scholar from a hundred years ago would have no trouble with the genre, the format, the presentation of today's scientific articles. We can do better, just like we can do better than chalk and blackboards. And it's the same old presentation of science when the science has gotten much more complicated, where there are major issues about reproducibility, particularly in the context of very complex data and software platforms, and where there are issues around building on other people's work—both facilitating this, and giving and receiving credit where it's due. We have certainly seen both scholars and funders recognize data as a primary input and primary output of scientific inquiry, and one that needs to be reused and managed systematically. There is a very significant debate now about how to connect data with articles.

Those of you who saw MacKenzie Smith's talk this morning got some taste of that. There are models that go all the way from citation to data, in ways that are not that dissimilar from citation to other articles, all the way through to people who have visions of data literally interpenetrated with the article as a computational whole, with the article accompanied by tools that allow the visualization or analysis of that data. There is very interesting work going on in a number of quarters to realize that, including some quite unexpected ones. For example, if you look at some of the things Wolfram is doing these days with Mathematica, Wolfram Alpha, and large datasets, you may be surprised.

There has been a whole series of workshops with an informal group of people that includes scientists, some librarians, and some information technologists: Beyond the PDF was one such workshop. There is a group called Force 11 in Europe that is in the process of drafting a very nice manifesto piece in this area, and just a couple of weeks ago, Harvard and Microsoft Research did a two-day workshop in Cambridge to look at current developments. Basically, the questions are about how we can get the tools to manage data and scientific workflow, to present these kinds of materials, and to interconnect them with the traditional scholarly literature.

This is not going to be an instantaneous change. As we all know, the Academy is very conservative, and indeed some very interesting issues show up here about identifying important work and about what role the tools, mechanisms, and measures developed to surface important work should play in the evaluation of scholars themselves. There are a number of proposed measures now for trying to gain insight into scholarly impact. Personally I'm still a bit skeptical about these, but it is surely infinitely better than the sort of voodoo misuse of bibliometrics that were designed for absolutely different purposes. At least it is asking the right questions, in my view. We have questions, as I said, about data citation, about bringing together the work of authors as they author in very different media, and about simultaneously disambiguating the work of authors with similar names, so that we can have good underlying datasets to do computations that attempt to identify and bring together significant work.

Those are some of the threads that you will see over the next 10 to 15 minutes. And with that I'm going to pass it over to Lee, with the view that in this area a demonstration is worth a lot of words, and he will at least flash up demonstrations of a number of systems. I want to stress that for every system he will show you there are three or four more we could've picked that are similar in objectives but that differ in interesting ways. There are also a number of significant ones that address other parts of the scholarly communication cycle that we don't have time for. But with that, over to you.


Lee: Thank you very much, Cliff. I appreciate that. So I'll echo that last point. By way of a very brief introduction I wanted to say a little bit about Microsoft Research. The team that I'm reporting to is part of the pure R&D division at Microsoft. It is about 1,000 people distributed in labs around the world. My team is about 50 that report to Tony Hey, the aforementioned Tony Hey. Our team is responsible for collaborative research projects with academia. My specific area of focus is scholarly communication, which is part and parcel of my interest in this specific area. I just wanted to provide a little bit of context there.

There are seven tools that I want to call out today, again, to make reference to Cliff's points. These are simply highlights. There are many tools that are out there, and these are several that I just wanted to point out and give a little specific detail about. Over the course of the last 18 to 24 months, these are tools that have really started to emerge, and the reason I'm highlighting them is because they're starting to get some momentum in this area. I am pointing these specific tools out because they interact in many cases, and are starting to reveal the tools that will build out the scholarly communication lifecycle. There are several common themes related to these: many of them are open source, many of them are available for free, and almost all of them have some sort of API that allows them to communicate and share information back and forth, which is critical. So we are starting to see tools which are interoperable and can be integrated into one another. This is a nice development to see. This is going to be a very quick whistle-stop tour, though.

The first tool that I want to mention is VIVO (http://vivoweb.org). VIVO was actually built and launched at Cornell, I think back in 2003. I've heard the mantle of "a Facebook for scientists"; that mantle has been given many times, but I think that VIVO is probably the closest to actually legitimately holding that title. It was recently expanded based on some NIH funding that was given about two years ago. About seven institutions have started building this out, and they're slowly building a broad network. It is a way for scientists and researchers to log in, build a profile, and be able to find others who are semantically connected via this open-source network—to connect with one another, find each other, and facilitate research in that way. It has been great to see, over the course of the last two or three years, this community really come together. They are having annual meetings nearly every August that have been very successful and well attended. So, it is great to see this sort of tool being developed. I'm going to get through these quickly, so hopefully you'll get an idea and, with the URL shown, can visit each tool yourself.

ORCID (www.orcid.org), which I believe MacKenzie mentioned a little bit earlier today, is an incredibly important development. This is an initiative that originally was started jointly—brainstormed, I think, by Thomson Reuters, Nature Publishing, and several others. The two conveners of the original meeting were Thomson Reuters and Nature. It has rapidly grown to be, I would say, not just an industry-wide but a very community-driven effort. It stands for Open Researcher and Contributor ID, and the effort is to build a system that all publishers, all libraries, and all academics will be able to use for assigning unique IDs to authors and contributors, so that we can find one person. We can disambiguate people from one another and be able to assign citations and other attributions as necessary.

This has grown from an initial meeting where a handful of people got together, to 44 founding sponsors—those who are actually contributing money to see that this development happens—as well as over 250 participating organizations, with several meetings a year to brief the interested parties. I was able to go to the most recent meeting, where they were able to announce some planned availability of the APIs. In a kind of "precompete" way, all of the industry players have gotten together and are going to share information and work cooperatively, which is a good thing to see. This is not unlike the way in which the industry came together to develop CrossRef, but, as you can see, they have their plans: the query and deposit APIs will be tested in the Fall, and I think they're planning for a launch—I'm not sure if it's going to be called an alpha or a beta—anticipated in May of this year. I just went up to their website and grabbed a screenshot (Figure 1). This is a draft, or a prototype, but this is in essence the way researchers would be able to go on and request an ID, and that is the ID that they would use for submissions with multiple publishers, which would be a quick, easy way to find material.

Figure 1
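Since ORCID iDs are meant to be machine-checkable identifiers, a small illustration may help. The minimal Python sketch below validates the final check character of an ORCID iD using the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers; the sample iD is the fictitious-researcher example commonly used in ORCID documentation, not a real person's record.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ORCID check character (ISO 7064 MOD 11-2) for the
    first 15 digits of an ORCID iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Check that a 16-character ORCID iD (hyphens optional) has a
    consistent check character."""
    digits = orcid.replace("-", "")
    return len(digits) == 16 and orcid_check_character(digits[:15]) == digits[15].upper()

# Fictitious-researcher example iD often used in ORCID documentation.
print(is_valid_orcid("0000-0002-1825-0097"))   # True
```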

Another project that I think has been growing and getting some momentum recently is Harvard's Dataverse Network (http://thedata.org). Simply put, it is a repository for data and a way for scientists and researchers who need a place to store their data, version their data, and point to their data. It gives them a method and a protocol for doing that. It actually comes out of the social sciences at Harvard, but it has been a way of propagating the system. As a researcher you're able to get an account, deposit large and small data sets, pull excerpts from them, share them with colleagues, store them there for preservation purposes, or indeed share them with a database or a publisher. The database or publisher then has the ability to say that if you have a printed citation, you can give a citation and it takes you back to a unique identifier and the URL for finding that data later. I think it's a fantastic tool, it's been very widely used across departments at Harvard, and we are starting to see it, again, as an open source project in that they are very willing to share the code. As a result, it is starting to develop and move to other institutions as well. It is critical that even though this came out of the social sciences, it is being used across domains. I think it's fantastic to see this made available for domains that may not have previously had a methodology for sharing or storing data.

Related to that—and I think, again, that MacKenzie might have mentioned this this morning—is DataCite (http://datacite.org), which is an effort that was born a little bit over two years ago. The TIB, the German National Library of Science and Technology, as well as the British Library, came together to say that we really need to establish a protocol for sharing data and citing data, so they decided to leverage the DOI—the digital object identifier—approach. This has been very, very well taken up; we're starting to see it grow. It started originally with just two members, and I think as of the last two years they're up to about 15 members. Over a million DOIs have been assigned to data sets around the world. They're starting to build more metadata specifications, and they're starting to share technical infrastructure and cloud storage. So this is an important development, and it is significant that these tools are being made available to scientists so that they'll be able to move data forward.
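To make the DOI-based approach concrete, here is a minimal Python sketch of what a DOI-backed data citation looks like. The metadata values and the DOI itself are placeholders invented for illustration (no such DOI is registered), and the citation layout only approximates the elements DataCite recommends (creator, year, title, publisher, identifier).

```python
# Hypothetical dataset metadata; in practice this comes from the data
# repository's record for the deposited data set.
record = {
    "creator": "Doe, J.",                      # placeholder values
    "year": 2011,
    "title": "Example survey dataset",
    "publisher": "Example Data Archive",
    "doi": "10.1234/example.dataset",          # placeholder, not a real DOI
}

# Assemble a human-readable citation plus the persistent resolver link.
citation = (f"{record['creator']} ({record['year']}): {record['title']}. "
            f"{record['publisher']}. doi:{record['doi']}")
resolver_url = f"http://dx.doi.org/{record['doi']}"

print(citation)
print("Resolves via:", resolver_url)
```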

A particular fantastic new tool—and this may not be one that you have heard of—that was actually developed at a hackfest about six months ago in the UK is something called Total Impact (http://total-impact.org). The two or three original developers pulled it together within about 24 hours, and they have continued to evolve it over the last four to six months. What this is doing is basically saying we need to have a better view into the actual footprint that a researcher can develop. As Cliff was saying, it's not just a paper; it's not even just a data set. We're starting to talk about blogs; we're starting to talk about code, the images, and all the supporting research objects that might go with that object, as well as even tweets, Facebook pages, etc. that are related, to show the traffic and the generation of communication around that research object. There are also a few other things that are dabbling in this kind of altmetrics space. Total Impact is one that is very cool. I have a screenshot here, and, as you see, there are different artifacts that can be tracked: published papers via various places, from PubMed, from Mendeley, from anything that has a DOI; data sets from various databases; software; slides; etc. You will be able to log in and actually enter your unique IDs—obviously when ORCID is available you would put your ORCID iD in—but anything where you have a profile somewhere on the web, you are able to embed it in here and then over time link to articles, or link to yourself overall as a profile.

This happens to be from Heather Piwowar (Figure 2), and these are just articles on the left and then some references, so you start to see comments associated with that specific article from maybe Facebook, maybe Twitter, or maybe Mendeley. So you see the impact beyond citations, and in a much faster time frame than the traditional two- or three-year lag. You start to see the immediate and total impact of that article, data set, or source code. So, it is a tremendous idea that is up and live, and I encourage you to play around with it and also to consume information via an API.
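To make the "total impact" idea a little more concrete, here is a toy Python sketch that aggregates mention counts for a single research object across several sources. It does not call the real total-impact API (whose endpoints are not described here); the source names, counts, and the DOI are invented purely to illustrate the kind of per-artifact summary the tool produces.

```python
# Hypothetical per-source counts for one artifact (a paper, dataset, or
# code repository), of the kind an altmetrics service would aggregate.
mentions = {
    "citations":        12,
    "mendeley_readers": 85,
    "tweets":           40,
    "facebook_shares":   9,
    "blog_posts":        3,
}

def summarize(artifact_id: str, counts: dict) -> str:
    """Build a one-line impact summary across all tracked sources."""
    parts = ", ".join(f"{name}: {n}" for name, n in sorted(counts.items()))
    return f"{artifact_id} -> total mentions {sum(counts.values())} ({parts})"

print(summarize("doi:10.1234/example.article", mentions))  # placeholder DOI
```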

Figure 2

Just this week DuraCloud (http://duracloud.org), an offering from DuraSpace, was officially launched. It has been in alpha and beta for the better part of the year, but they officially launched on Tuesday. It has 11 pilot partners that have been working with and testing it. I think it is a tremendous idea. My background is actually as a preservation librarian, and I was very pleased to see this initiative started. I've got a schematic of what DuraCloud does (Figure 3), and the idea is that at the very top of the screen would be your repository—your database—and in the middle you see what DuraCloud is doing. Basically, it is saying that if you sign a service level agreement with us, we're going to back up your data in the cloud and provide services around that, and behind that DuraCloud then goes to any number of cloud providers behind the scenes. I think at present they are working with Amazon and Rackspace, and they're in alpha right now with Microsoft Azure. But what happens behind the scenes doesn't necessarily matter to you as the repository owner: you sign an SLA with them that says you're going to keep my data, and you're going to give it back to me whenever I need it. You can even have the ability to say what continent you might want to store in, depending on what the service provider allows. DuraCloud takes that up contractually and gives you preservation. You can have multiple copies in multiple places. This is a fantastic model; I'm very encouraged to see that DuraCloud is moving forward with it, so I encourage you to look into this. I think their plan is, in addition to just plain storage and backing up, to offer additional value-added services related to video and text mining and things like that in the near future.
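The mediation idea—one agreement with DuraCloud, multiple cloud providers behind it—can be sketched in a few lines of Python. This is a conceptual illustration only: the provider names come from the talk, but the classes and upload calls are invented here and are not the DuraCloud or vendor APIs.

```python
class CloudStore:
    """Stand-in for a storage provider; a real system would call the
    provider's own API here."""
    def __init__(self, name: str):
        self.name = name
        self.objects = {}

    def put(self, object_id: str, content: bytes) -> None:
        self.objects[object_id] = content

def replicate(object_id: str, content: bytes, providers: list) -> None:
    """Write one repository object to every provider behind the service,
    so multiple copies end up in multiple places."""
    for store in providers:
        store.put(object_id, content)
        print(f"stored {object_id} with {store.name}")

# Providers mentioned in the talk; the classes above are illustrative only.
backends = [CloudStore("Amazon S3"), CloudStore("Rackspace"), CloudStore("Azure")]
replicate("item-0001", b"...archival content...", backends)
```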

Figure 3

The one tool that I am a little bit more expert in is Microsoft's Academic Search (http://academic.research.microsoft.com). This is a search tool that has actually been around Microsoft for several years, but it was very focused on computer science papers; in the last 18 to 24 months, we have started to expand it to the entire Academy, to all domains of research. We've grown from about 8 million articles, starting last December, to over 37 million publications now. We actually have 70 to 75 million in the queue for processing—it's very data-intensive processing to get them in—so we're adding about 5 or 10 million a month, with the goal of maybe somewhere around 100 to 150 million within 9 to 12 months.

This is a free service; I want to stress that this is free and open. We are working actively with publishers, with aggregators, with scholarly societies. We've signed content arrangements with more than 50 partners. No money is changing hands. This is all about getting access to the content and also providing access to you. Microsoft is actually getting the full text, doing entity extraction and natural language processing on the text, and putting that out in a way that you can use via our interface or actually consume via a free API, so there is a way for anyone who wants to take this data and use it for noncommercial purposes to integrate it with their own data and systems.

I'd like to give you a few screenshots of what Academic Search looks like. If you were to run a search on Michael Nielsen (Figure 4), the author that Cliff was mentioning a moment ago, you would see that this is a dynamic page. We've done all this extraction from these papers and pulled together, in his case, over 106 publications that were crawled and indexed that relate to him. We've pulled that together and built this profile automatically, so you can see his number of publications and citations, and we're calculating the G-index and the H-index for him. We're actually surfacing his photos and building a profile around him that shows you his number of citations over time, and then you have the ability to click and go straight to a visualization. This is a way for you to see who Michael has co-authored with, or, in certain cases, who has cited him, so there is an easy visualization.
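The H-index and G-index mentioned here are simple functions of a list of citation counts, so they are easy to sketch. The Python below uses the standard definitions (h: the largest h such that h papers each have at least h citations; g: the largest g such that the g most-cited papers together have at least g squared citations); the citation counts in the example are made up.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    counts = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(counts, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

papers = [42, 18, 12, 9, 7, 5, 3, 1, 0]   # hypothetical citation counts
print(h_index(papers), g_index(papers))   # -> 5 9
```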


You can actually walk the graph, follow links, and follow other authors who are related to him. You have the ability to manage that profile, so you can actually go in and say, "Wait a minute, you don't have two of my papers, I would like to add that URL in", or, "Actually, the paper that you have assigned to me is not mine". So again, it's done automatically, but you can have us pull a paper out, or merge authors, or add things. You have the ability to edit and maintain a profile. You also have the ability as an author to embed this: we give you a little bit of script so that you can actually put this on your own page or your library's page. When maintaining profiles for faculty at your college, you can actually embed this in your local webpage.

Figure 4

Figure 5 shows what a publication would look like in Academic Search. You can see there's a lot of information that we're surfacing: the abstract, the number of citations over time, and the various locations where we've sourced it. If it's from a publisher or if it's from an open access source, we'll link to that. Down at the bottom, you'll see we're actually putting in the citation context from those papers; we're showing it to you so you don't have to go and track down those 5 or 15 citations. We're surfacing the citation context right there on the page. In the left nav you'll see keywords. You can actually click on a keyword, and that keyword will take you to an abstract that will tell you about the definition of that keyword. The interesting thing on the graph here is that we're actually able to show, based on our analysis of our corpus, when that term has been used over time. So you can actually see new terms, when they came into usage, and when they've ramped up or ramped down. And then we provide a little bit of context again from these papers by starting to give you various definitions from across the literature of what that term might mean.
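The keyword-over-time graph is essentially a per-year count of how often a term appears in the indexed papers. A toy Python version of that computation, over made-up (year, text) records rather than the real corpus:

```python
from collections import Counter

# Hypothetical corpus records: (publication year, text we can search).
records = [
    (1998, "A study of neural networks for control"),
    (2004, "Support vector machines in text classification"),
    (2009, "Deep neural networks for speech"),
    (2009, "A survey of kernel methods"),
]

def term_usage_by_year(term, corpus):
    """Count papers per year whose text mentions the term (case-insensitive)."""
    usage = Counter()
    for year, text in corpus:
        if term.lower() in text.lower():
            usage[year] += 1
    return dict(sorted(usage.items()))

print(term_usage_by_year("neural networks", records))   # {1998: 1, 2009: 1}
```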


Figure 5

Figure 6

Figure 6 shows what a journal would look like, so you can actually look at an all-up version of that journal. This is actually listing all the publications, literally all of the articles that have come from that journal over time, and the citation count against those articles. We give you a year range for what we're able to have in the site, and then you can actually look at the most cited articles throughout the history of that particular journal title.

You can even build profiles around not just authors, but organizations, so this allows you to look at the top publications, the top citations, and the top authors from a particular institution, so your institution would be represented in here. You can run the search and pull that back and find the top cited authors at your institution. You can even compare institutions. I want to stress, again, that all of this information is available via a free API, so you're able to consume this data and do your own rankings, for instance, if you're interested in doing so.
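Doing "your own rankings" from data like this amounts to sorting author records by whatever metric you care about. A small Python sketch over invented records (the field names here are illustrative, not the actual fields returned by the Academic Search API):

```python
# Hypothetical author records for one institution, as an API client
# might assemble them; names and numbers are made up.
authors = [
    {"name": "A. Researcher", "publications": 120, "citations": 4300},
    {"name": "B. Scholar",    "publications":  45, "citations": 5100},
    {"name": "C. Scientist",  "publications":  80, "citations": 2100},
]

def top_cited(records, n=3):
    """Return the n most-cited authors, most cited first."""
    return sorted(records, key=lambda a: a["citations"], reverse=True)[:n]

for rank, author in enumerate(top_cited(authors), start=1):
    print(f"{rank}. {author['name']} ({author['citations']} citations)")
```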


We have some interesting visualizations. This is one way of visualizing the literature (Figure 7). This is literature dating back to 1960, and you're able to see the subdomains of computer science and actually drill down; what you are seeing here is hardware and architecture publications from 1992 to 2009. There were 95,000 publications during that period of time, and at the bottom you'll see the top cited authors in that timeframe in that domain. So it is a very quick and easy way to perhaps find co-authors, to find the top people in your field to maybe come give a talk at a conference, to maybe work on a project with, to give an award to, etc. It's a very interesting way to quickly delve in and find people. This is available publicly for noncommercial purposes, and there's more information about it on our website. One example is the Eigenfactor team at the University of Washington, which is currently using the API and some of these services. They have built a recommender service, a visualization mapping tool based on content that they are getting from our API, and several other services that I encourage you to investigate.

Figure 7

So, in closing, I want to point out that when you think about scholarly communication, you can think about it as working in a lifecycle: originally collecting data, doing your research, doing your analysis; moving to an authoring phase; the publication and dissemination phase, where you're sharing the data and maybe even writing an article, writing a book, giving a talk at a conference, writing your blog; and ideally storing, archiving, and preserving that information so the cycle continues.

I typically augment those four core steps with two other ideas: the need to collaborate with others in and across these four core steps, as well as the need for discoverability. All the services and initiatives that I've shown you during this quick tour map exactly to that. So it is refreshing to see work going on in all of these areas, and, as you can see in several cases, the different tools can map to different sections and different areas. And, as I mentioned, the VIVO team is working with the ORCID folks, and the Dataverse folks are working with the DataCite folks. It's tremendous to see the integration and interoperability that is naturally happening. I thank you very much for listening to us, and I hope that you find some value in exploring these tools further. The landscape is definitely changing, but I am encouraged that we have some fantastic tools for us and for researchers to work with. Thank you very much.
