
The Next Generation Scholarly Communication Ecosystem: Implications for Librarians

Lee Dirks
Director, Education & Scholarly Communication
Microsoft External Research

• A division focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing
• Supporting groundbreaking research to help advance human potential and the wellbeing of our planet
• Developing advanced technologies and services to support every stage of the research process
• Microsoft External Research is committed to interoperability and to providing open tools and open technology

…Thus far we seem to be worse off than before—for we can enormously extend the record; yet even in its present bulk we can hardly consult it. This is a much larger matter than merely the extraction of data for the purposes of scientific research; it involves the entire process by which man profits by his inheritance of acquired knowledge. The prime action of use is selection, and here we are halting indeed. There may be millions of fine thoughts, and the account of the experience on which they are based, all encased within stone walls of acceptable architectural form; but if the scholar can get at only one a week by diligent search, his syntheses are not likely to keep up with the current scene…

“As We May Think” by Vannevar Bush
The Atlantic, July 1945

http://www.theatlantic.com/doc/194507/bush

According to a study called How Much Information by the University of California at San Diego, “…consumption totaled 3.6 zettabytes and 10,845 trillion words, corresponding to 100,500 words and 34 gigabytes for an average person on an average day. A zettabyte is 10 to the 21st power bytes, a million million gigabytes. These estimates are from an analysis of more than 20 different sources of information, from very old (newspapers and books) to very new (portable computer games, satellite radio, and Internet video).”
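The units in the quoted figures are easy to misread, so a short arithmetic check may help. The sketch below uses only the numbers given in the quote and assumes decimal units (1 gigabyte = 10^9 bytes). It confirms that a zettabyte is a million million gigabytes, and that the byte totals and the word totals imply roughly the same number of person-days (about 1.06 × 10^11 and 1.08 × 10^11), which suggests the two sets of figures are internally consistent.

```python
# Unit check for the quoted figures; every number below comes from the quote.
ZETTABYTE = 10**21   # bytes, as defined in the quote
GIGABYTE = 10**9     # bytes (decimal gigabyte, an assumption about the units)

total_bytes = 3.6 * ZETTABYTE            # total consumption in bytes
total_words = 10_845 * 10**12            # 10,845 trillion words
bytes_per_person_day = 34 * GIGABYTE     # 34 gigabytes per person per day
words_per_person_day = 100_500           # words per person per day

# "A zettabyte is ... a million million gigabytes"
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

# Each pair of figures implies a number of person-days of consumption.
person_days_from_bytes = total_bytes / bytes_per_person_day
person_days_from_words = total_words / words_per_person_day

print(f"gigabytes per zettabyte: {gigabytes_per_zettabyte:.0e}")
print(f"person-days implied by bytes: {person_days_from_bytes:.3e}")
print(f"person-days implied by words: {person_days_from_words:.3e}")
```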

[Note: Information at work is not included!]

Data Tidal Wave

Realizing Jim Gray’s Vision for Data-Intensive Scientific Discovery

• Jim Gray = eScience
• A Transformed

Emergence of a Fourth Research Paradigm

1. Thousand years ago – Experimental Science
   • Description of natural phenomena
2. Last few hundred years – Theoretical Science
   • Newton’s Laws, Maxwell’s Equations…
3. Last few decades – Computational Science
   • Simulation of complex phenomena
4. Today – Data-Intensive Science
   • Scientists overwhelmed with data sets from many different sources
     o Data captured by instruments
     o Data generated by simulations
     o Data generated by sensor networks
   • eScience is the set of tools and technologies to support data federation and collaboration
     o For analysis and data mining
     o For data visualization and exploration
     o For scholarly communication and dissemination

Science must move from data to information to knowledge

[Image caption] Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service.

With thanks to Jim Gray

The Fourth Paradigm – the book
Edited by Tony Hey, Stewart Tansley, Kristin Tolle

• Distinguished scientists with computer specialists, either researchers or IT experts, giving their vision of how they see their fields being transformed from being data-poor to data-rich
• The 289-page book, published in October 2009, is available for free under a Creative Commons license, the first from Microsoft Research
• Introductory article from Jim Gray’s last talk, given in January 2007, two weeks before he disappeared, to the National Research Council’s Computer Science and Telecommunications Board
• The book – executed by Dr. Tony Hey – epitomizes Jim’s vision

An edited collection of 26 short technical papers, divided into 4 sections
Authored by 70 leading practitioners from around the world
Free PDF download, or Amazon Kindle version & paperback print-on-demand
http://research.microsoft.com/fourthparadigm/

“The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.” — Chairman, Microsoft Corporation

“One of the greatest challenges for 21st-century science is how we respond to this new era of data-intensive science. This is recognized as a new paradigm beyond experimental and theoretical research and computer simulations of natural phenomena—one that requires new tools, techniques, and ways of working.” — Douglas Kell, University of Manchester

“The contributing authors in this volume have done an extraordinary job of helping to refine an understanding of this new paradigm from a variety of disciplinary perspectives.” — Gordon Bell, Microsoft Research

So, what about libraries?

“It’s not information overload. It’s filter failure.”
Clay Shirky at Web 2.0 Expo 2008

Empower, Inform, Enrich - The modernisation review of public libraries: A consultation
Britain’s Culture Minister – Margaret Hodge (Dec09)

Five significant challenges for the library service:
• How can the library service demonstrate to citizens, commentators and politicians that they are still relevant and vital?
• How can we reverse the current trend of decline in library usage and grow the numbers using their local library?
• How can all libraries respond to a 24/7 culture and respond to changing expectations of people who want immediate access to information?
• How can all libraries grasp the opportunities presented by digitisation?
• How can the library service cope with limited public resource and economic pressures?

“…Sleepwalking into the era of the iPhone, the eBook and the […] without a strategy,” she suggested, “runs the risk of turning the library service into a curiosity of history such as telex machines or typewriters.”

http://www.culture.gov.uk/reference_library/consultations/6488.aspx

UK Prime Minister Gordon Brown
“Building Britain’s Digital Future”
March 22nd, 2010

• I want Britain to be the world leader in the digital economy which will create over a quarter of a million skilled jobs by 2020…
• Underpinning the digital transformation that we are likely to see over the coming decade is the creation of the next generation of the web - what is called the semantic web, or the web of linked data.
• This next generation web is a simple concept, but I believe it has the potential to be just as revolutionary - just as disruptive to existing business and organisational models - as the web was itself, moving us from a web of managing documents and files to a web of managing data and information - and thus opening up the possibility of by-passing current digital bottlenecks and getting direct answers to direct requests for data and information. It will change fundamentally the way we conduct business - with new enterprises by-passing traditional media communications and governmental organisations: new enterprises spun off from the new data, information and knowledge that flows more freely.
• Today I can announce the first funding for the next stage of this research - £30m to support the creation of a new institute, the institute of web science - based here in Britain and working with government and British business to realise the social and economic benefits of advances in the web. It will assemble the best of world scientists and researchers and be headed by Sir Tim Berners Lee, the British inventor of the world wide web - and the leading web science expert Professor Nigel Shadbolt.
• This will help place the UK at the cutting edge of research on the semantic web and other emerging web and internet technologies, and ensure that government is taking the right funding decisions to position the UK as a world leader. And we will invite universities and private sector web developers and companies to join this collaborative project.

http://www.number10.gov.uk/Page22897

Have librarians abdicated?

• Commercial entities have stepped into our space—Yahoo, Google, (and yes) Microsoft, etc.

• Other academic domains are creeping into our traditional role (*.informatics)

• We’re being disintermediated.

So, what can we do?

#1 – Reinvent ourselves with technology. (Again.)

What would Fred do?

Present

The Future: an Explosion of Data

[Figure labels: Experiments, Simulations, Archives, Literature, Instruments; Petabytes]

The Challenge: Enable Discovery. Deliver the capability to mine, search and analyze this data in near real time.

The Cloud
• A model of computation and data storage based on “pay as you go” access to “unlimited” remote data center capabilities
• A cloud infrastructure provides a framework to manage scalable, reliable, on-demand access to applications
• A cloud is the “invisible” backend to many of our mobile applications
• Historical roots in today’s Internet apps
  – Search, email, social networks
  – File storage (Live Mesh, Flickr, …)

Types of Cloud Computing
• Utility computing [infrastructure] – Amazon's success in providing virtual machine instances, storage, and computation at pay-as-you-go utility pricing was the breakthrough in this category, and now everyone wants to play. Developers, not end-users, are the target of this kind of cloud computing.
• Platform as a Service [platform] – One step up from pure utility computing are platforms like Google AppEngine and Salesforce's force.com, which hide machine instances behind higher-level APIs. Porting an application from one of these platforms to another is more like porting from Mac to Windows than from one Linux distribution to another.
• End-user applications [software] – Any web application is a cloud application in the sense that it resides in the cloud. Google, Amazon, Facebook, twitter, flickr, and virtually every other Web 2.0 application is a cloud application in this sense.

From: Tim O'Reilly, O'Reilly Radar (10/26/08), “Web 2.0 and Cloud Computing”

The Rationale for Cloud Computing in eResearch

• We can expect research environments will follow similar trends to the commercial sector
  – Leverage computing and data storage in the cloud
  – Small organizations need access to large scale resources
  – Scientists already experimenting with Amazon S3 and EC2 services
• For many of the same reasons
  – Small, silo’ed research teams
  – Little/no resource-sharing across labs
  – High storage costs
  – Physical space limitations
  – Low resource utilization
  – Excess capacity
  – The high cost of acquiring, operating and reliably maintaining machines is prohibitive
  – Little support for developers, system operators
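The link between low utilization and pay-as-you-go pricing can be made concrete with a little arithmetic. The sketch below is illustrative only: the two hourly prices are hypothetical placeholders, not figures from any provider or from this talk, and the function names are invented for the example. The idea is that an owned machine costs the same whether it is busy or idle, so the effective cost of each useful hour rises as utilization falls.

```python
def owned_cost_per_used_hour(cost_per_hour_owned, utilization):
    """Effective cost of one hour of actual work on an owned machine.

    utilization is the fraction of hours the machine is really in use.
    The machine costs the same when idle, so dividing by utilization
    spreads the full cost over the useful hours only.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_per_hour_owned / utilization


def break_even_utilization(cost_per_hour_owned, cloud_price_per_hour):
    """Utilization below which pay-as-you-go is cheaper than owning."""
    return cost_per_hour_owned / cloud_price_per_hour


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
OWNED_COST_PER_HOUR = 0.10   # amortized purchase, power, space, staff
CLOUD_PRICE_PER_HOUR = 0.40  # paid only for hours actually used

threshold = break_even_utilization(OWNED_COST_PER_HOUR, CLOUD_PRICE_PER_HOUR)
print(f"break-even utilization: {threshold:.0%}")
for u in (0.10, 0.50):
    cost = owned_cost_per_used_hour(OWNED_COST_PER_HOUR, u)
    print(f"utilization {u:.0%}: owned machine costs {cost:.2f} per useful hour")
```

Under these made-up prices, a lab that keeps its machines busy less than a quarter of the time would pay more per useful hour by owning than by renting, which is the situation the "low resource utilization" and "excess capacity" bullets describe.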

Cloud Landscape Still Developing
• Tools are available
  – Flickr, SmugMug, and many others for photos
  – YouTube, SciVee, Viddler, Bioscreencast for video
  – Slideshare for presentations
  – Google Docs for word processing and spreadsheets
• Data Hosting Services & Compute Services
  – Amazon’s S3 and EC2 offerings

Tim Berners-Lee's principles for Linked Data:

• The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.
• Like the web of hypertext, the web of data is constructed with documents on the web. However, unlike the web of hypertext, where links are relationship anchors in hypertext documents written in HTML, for data they are links between arbitrary things described by RDF. The URIs identify any kind of object or concept. But for HTML or RDF, the same expectations apply to make the web grow:

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs so that they can discover more things.

• Simple. In fact, though, a surprising amount of data isn't linked in 2006, because of problems with one or more of the steps. This article discusses solutions to these problems, details of implementation, and factors affecting choices about how you publish your data.
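The four rules can be illustrated with a toy model. The sketch below is an assumption-laden stand-in and not a real Linked Data client: it replaces HTTP requests with a dictionary lookup, and every example.org URI and property name in it is hypothetical. Each key is a URI that names a thing (rules 1 and 2), looking it up returns subject-predicate-object triples describing it (rule 3), and the objects of those triples are further URIs that can be followed (rule 4).

```python
# Toy "web of data". A real client would issue HTTP GET requests and parse
# RDF; here a dictionary stands in for the network. All URIs are hypothetical.
VOCAB = "http://example.org/vocab/"
PAPER = "http://example.org/paper/X"
STAR = "http://example.org/star/Y"
AUTHOR = "http://example.org/person/A"
CLUSTER = "http://example.org/cluster/Pleiades"

WEB = {
    PAPER: [
        (PAPER, VOCAB + "about", STAR),
        (PAPER, VOCAB + "author", AUTHOR),
    ],
    STAR: [
        (STAR, VOCAB + "memberOf", CLUSTER),
    ],
}


def look_up(uri):
    """Stand-in for dereferencing `uri`: return the triples that describe it."""
    return WEB.get(uri, [])


def discover(start, max_hops):
    """Follow links breadth-first from `start`, up to `max_hops` hops.

    Returns the set of every URI reached, including the starting point.
    """
    seen = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for uri in frontier:
            for _subject, _predicate, obj in look_up(uri):
                if obj.startswith("http://") and obj not in seen:
                    seen.add(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return seen


one_hop = discover(PAPER, max_hops=1)
two_hops = discover(PAPER, max_hops=2)
print(sorted(one_hop))
print(sorted(two_hops))
```

Starting from the paper alone, one hop reaches the star and the author, and a second hop reaches the cluster. This is the "when you have some of it, you can find other, related, data" property in miniature, and it breaks as soon as any of the four rules is skipped, for example if the star were named by a local identifier rather than a URI that can be looked up.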

http://www.w3.org/DesignIssues/LinkedData.html

“A ‘Smart’ Cyberinfrastructure for Research”

Why Semantic Computing?

http://cacm.acm.org/magazines/2009/12/52840-a-smart-cyberinfrastructure-for-research

“Semantics-based computing” vs. “Semantic web”
• There is a distinction between the general approach of computing based on semantic technologies (e.g. machine learning, neural networks, ontologies, inference, etc.) and the semantic web, a term used to refer to a specific ecosystem of technologies, like RDF and OWL
• The semantic web is just one of the many tools at our disposal when building semantics-based solutions

Towards a smart cyberinfrastructure?
• Leveraging Collective Intelligence
  – If last.fm can recommend what song to broadcast to me based on what my friends are listening to, the cyberinfrastructure of the future should recommend articles of potential interest based on what the experts in the field that I respect are reading.
  – Examples are emerging but the process is presently more manual, e.g. Connotea, Faculty of 1000, etc.

• Semantic Computing – Automatic correlation of scientific data – Smart composition of services and functionality

• Leverage cloud computing to aggregate, process, analyze and visualize data

Who do you want reading your paper?

OR

A world where all data is linked…

• Data/information is interconnected through machine-interpretable information (e.g. paper X is about star Y)
• Social networks are a special case of ‘data networks’
• Important/key considerations
  – Formats or “well-known” representations of data/information
  – Pervasive access protocols are key (e.g. HTTP)
  – Data/information is uniquely identified (e.g. URIs)
  – Links/associations between data/information

Attribution: Richard Cyganiak; http://linkeddata.org/

…and stored/processed/analyzed in the cloud

Vision of Future Research Environment with both Software + Services

[Figure labels: domain-specific services; visualization and analysis services; scholarly communications; search; books; citations; blogs & social networking; reference management; instant messaging; identity; mail; notification; project management; document store; storage/data services; compute services; virtualization; knowledge management; knowledge discovery]

Joe Hellerstein, UC Berkeley
Blog: “The Commoditization of Massive Data Analysis”

• We’re not even to the Industrial Revolution of Data yet…
  – “…since most of the digital information available today is still individually "handmade": prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation "factories" such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide. Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.”

#2 – Be entrepreneurial.

Value-Added Processes in Information Systems, Ablex, 1986.

© M. Eisenberg 2010

Commercial Data Sharing + Analysis Services

• Swivel
• IBM’s “Many Eyes”
• Gapminder & Google’s Trendalyzer
• Metaweb’s “Freebase”
• CSA’s “Illustrata”

Adding Value to Data

DataCite – http://www.datacite.org/
  – Improving scholarly infrastructure around datasets.
  – Working with data centers and organizations that hold data.
    • “The details of their business models, workflows, and other requirements do not appear to be identical to those of publishers producing traditional journals.”

Dataverse Network Project – http://thedata.org/
  – Via web application software, data citation standards, and statistical methods, the Dataverse Network project increases scholarly recognition and distributed control for authors, journals, archives, teachers, and others who produce or organize data; facilitates data access and analysis for researchers and students; and ensures long-term preservation whether or not the data are in the public domain.

WorldWideScience.org is a global science gateway connecting you to national and international scientific databases and portals. WorldWideScience.org accelerates scientific discovery and progress by providing one-stop searching of global science sources. The WorldWideScience Alliance, a multilateral partnership, consists of participating member countries and provides the governance structure for WorldWideScience.org.

WorldWideScience.org was developed and is maintained by the Office of Scientific and Technical Information (OSTI), an element of the Office of Science within the U.S. Department of Energy. Please contact [email protected] if you represent a national or international science database or portal and would like your source searched by WorldWideScience.org.

#3 – Tackle the tough items, the difficult collections.

Servers are the new shelves.

http://www.data.gov/

• The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government. Although the initial launch of Data.gov provides a limited portion of the rich variety of Federal datasets presently available, we invite you to actively participate in shaping the future of Data.gov by suggesting additional datasets and site enhancements to provide seamless access and use of your Federal data.

• Data.gov includes a searchable data catalog that includes access to data in two ways: through the "raw" data catalog and using tools.

#4 – Focus on sustainability

Preservation Sustainability
• Articulate a compelling value proposition.
• Provide clear incentives.
• Define roles and responsibilities among stakeholders to ensure an ongoing and efficient flow of resources.
http://brtf.sdsc.edu/
Courtesy: DuraCloud

#5 – Leverage the Familiar

Envisioning a New Era of Research Reporting

Imagine…
• Live research reports that had multiple end-user ‘views’ and which could dynamically tailor their presentation to each user
• An authoring environment that absorbs and encapsulates research workflows and outputs from the lab experiments
• A report that can be dropped into an electronic lab workbench in order to reconstitute an entire experiment
• A researcher working with multiple reports on a Surface and having the ability to mash up data and workflows across experiments
• The ability to apply new analyses and visualizations and to perform new in silico experiments

[Figure labels: Reproducible Research; Interactive Data; Collaboration; Dynamic Documents; Reputation & Influence]

Recent developments of interest

Elsevier's Article of the Future
Competition Grand Challenge & Article of the Future contest: an ongoing collaboration between Elsevier and the scientific community to redefine how a scientific article is presented online.

PLoS Currents: Influenza
In conjunction with NIH & Google Knol, a rapid research note service that enables this exchange by providing an open-access online resource for immediate, open communication and discussion of new scientific data, analyses, and ideas in the field of influenza. All content is moderated by an expert group of influenza researchers, but in the interest of timeliness, does not undergo in-depth […].

Precedings
Connects thousands of researchers and provides a platform for sharing new and preliminary findings with colleagues on a global scale – via pre-print manuscripts, posters and presentations. Claim priority and receive feedback on your findings prior to formal publication.

Google Wave
Concurrent rich-text editing; Real-time collaboration; Natural language tools; Extensions with APIs

Mendeley (and Papers)
Called “iTunes” for academic papers; 100,000s of people have already signed up and a staggering 19+ million scientific papers have been uploaded.

The Opportunity Before Us

• Encourage a rich authoring environment
• Unlock documents and move from static summaries to living information vehicles
• Capture and store semantically rich information that can be consumed by machines
• Facilitate reproducible research
• Enable value-added services to be added later

We can all continue to inefficiently architect this after the paper is published, or we can work together to start addressing the issue at the very beginning of the research lifecycle.

Creative Commons Add-in for Office

Intent: Insert Creative Commons licenses from within Office 2007

Services: Integrates with Creative Commons Web API to create new licenses

Relationships: license information stored as RDF XML within the document OOXML

Downloads = 146,000+
Source code + binary: http://ccaddin2007.codeplex.com
This work is licensed under a Creative Commons Attribution 3.0 United States License.

Ontology Add-in for Word

Services: Ontology download web service

• John Wilbanks
• Phil Bourne
• Lynn Fink

Intent: Term recognition & disambiguation

Relationships: Ontology browser

Downloads = 4,000+
Source code + binary: http://research.microsoft.com/ontology/

Chemistry Add-in for Word

Author/edit 1D and 2D chemistry. Change chemical layout styles.

• Peter Murray-Rust
• Joe Townsend
• Jim Downing

Intent: Recognizes chemical dictionary and ontology terms

Relationships: Navigate and link referenced chemistry

Data: Semantics stored in Chemistry Markup Language

Intelligence: Verifies validity of authored chemistry

Downloads = 51,000+
Binary (beta 2): http://research.microsoft.com/chem4word/

Article Authoring Add-in for Word 2007

Services: repository deposit via SWORD

Structure: Read, convert, and author NLM XML documents

Relationships: ORE Resource Map creation

Relationships: Citation lookup and reference management

Structure: Client-side XML validation

Downloads = 4,000+
Binary (version 2.0): http://research.microsoft.com/authoring/

#6 – Reach out across campus. (and beyond)

Software (alone) is not the answer.

#7 – Change the way we educate our field.

iSchools & the iConference

Summary
1. Reinvent ourselves with technology. (Again.)
2. Be entrepreneurial.
3. Tackle the tough items. (Like data.)
4. Focus on sustainability.
5. Leverage the familiar.
6. Reach out across campus—and beyond.
7. Change the way we educate our field.

“If you don't like change, you're going to like irrelevance even less.”

—General Eric Shinseki

Retired United States Army four-star general, currently US Secretary of Veterans Affairs

Questions?

Lee Dirks
Director, Education & Scholarly Communication
Microsoft External Research
[email protected]
http://research.microsoft.com/people/ldirks

URL – http://www.microsoft.com/scholarlycomm/
Facebook: Scholarly Communication at Microsoft

This work is licensed under a Creative Commons Attribution 3.0 United States License.