Semantic Application for Digital Repositories

Fabrizio Gagliardi
EMEA & LATAM Director, Technical Computing
MSR External Research
Microsoft Corporation

Microsoft Research’s Commitment to Science

Putting computing into science…
Applying Microsoft products and research technologies to advance the scientific research and engineering innovation process

Putting science into computing…
Ensuring that research community requirements are factored into future versions of Microsoft

• Advancement of Science
• Global Collaboration
• Technology Excellence
• Interoperability

myGrid

• Semantic relationships between different data
• Semantic descriptions of services
• Annotations
• Provenance
• Repositories
• Ontologies

Research Output Repository Platform

Goals
• A platform for building services and tools for research output repositories
• Papers, Videos, Presentations, Lectures, References, Data, Code, etc.
• Relationships between stored entities
• Enable a tools and services ecosystem for “research output” repositories on MS technologies
• Utilizing OAI-ORE, SWORD, and other community protocols
• In development, deployment within MSR in early Q4
• Beta release to the community in late Q4
• Built on SQL Server 2008 + Entity Framework
• Using WPF and Silverlight for UI

[Diagram: the research output repository platform, surrounded by the labels UIs, Desktop, Search, Execution, Tools, Interop, and Syndication]

Research Output Repository Platform

Goals
• Create a platform for building “research output” repositories
• Engage with the scholarly communications community
• Become the “research output” repository for MSR (RMCr project)
  – Papers, Videos, Presentations, Lectures, References, Data, Code, etc.
• Support an ecosystem of services and tools
• Available to the community for free (we are still considering the open source route)
• Build an easy-to-install collection of basic services and tools

Non-goals
• A generic platform for asset management
• Support the lifecycle of publications
• Compete with existing repository solutions

[Architecture diagram layers: Services/tools; Microsoft.Famulus.Framework; Microsoft.Famulus.Core (based on the Entity Framework Model + extensions); SQL Server 2008, MS data storage technologies, Entity Framework runtime]

Research Output Repository Platform

• A Semantic Computing platform
• A hybrid between a relational and a triple store

Triple stores
- Evolution friendly
- Poor performance
- No need to model everything in advance
- Semantic interpretation at the application level

Relational schema
- Evolution not so easy
- Great opportunities for optimization
- Model everything in advance
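The trade-off above, and the hybrid the platform aims for, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's implementation: `Publication`, `HybridStore`, and the predicate names are hypothetical. Frequently used attributes live in a fixed, typed record (the relational side), while anything not modeled in advance is kept as (subject, predicate, object) triples (the triple-store side).

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """Relational side: frequently used attributes are modeled in advance."""
    id: str
    title: str


@dataclass
class HybridStore:
    """Hypothetical hybrid: typed records plus free-form triples."""
    publications: dict = field(default_factory=dict)
    triples: set = field(default_factory=set)

    def add_publication(self, pub: Publication) -> None:
        self.publications[pub.id] = pub

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        # Triple side: a new kind of relationship needs no schema change.
        self.triples.add((subject, predicate, obj))

    def objects(self, subject: str, predicate: str) -> list:
        # What a predicate means is left to the application.
        return sorted(o for s, p, o in self.triples
                      if s == subject and p == predicate)


store = HybridStore()
store.add_publication(Publication("pub1", "Title1"))
store.add_publication(Publication("pub2", "Title2"))
store.relate("pub1", "cites", "pub2")
store.relate("pub1", "reviewedBy", "tony")  # not modeled in advance

print(store.publications["pub1"].title)
print(store.objects("pub1", "cites"))
```

The typed record gives the store something to index and optimize, while the triple set absorbs relationships that appear later, which is the balance the comparison points at.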

Research Output Repository Platform
- Maintain a balance
- Try to model the frequently used entities in our app domain
- Try to capture the frequently used relationships
- Allow for extensibility (Relationships, Attributes)

An intuitive programming experience

    // Create a person and two publications.
    Person tony = new Person();
    Publication pub1 = new Publication();
    pub1.Title = "Title1";
    Publication pub2 = new Publication();
    pub2.Title = "Title2";

    // Relate the entities: pub1 cites pub2 and is authored by tony.
    pub1.Cites.Add(pub2);
    pub1.Authors.Add(tony);

    // Attach a keyword tag to pub1.
    Tag tag = new Tag();
    tag.Name = "keyword";
    pub1.Tags.Add(tag);

Research Output Repository Platform

[Figure: entity-relationship graph. Nodes: a PDF file, a PowerPoint presentation, a lecture on 2/19/2008, tony, and other people (Elizabeth, Sebastien, Matthew, Norman, Brian, Sarah, George, Roy). Edge labels: "is representation of", "contains", "authored by", "organized by", "presented by".]

An Ecosystem of Research Repositories

Support of harvesting & federation to/from Institutional Repositories
- arXiv.org
- DSpace
- ePrints
- Fedora
- etc.

Entities + Relationships can be synced to cloud storage so that they are:
- Always Available
- Sharable
- Mixable
- Harvestable

Researchers manage their personal research entities (data, citations, documents, workflows, etc.)

Current Project Status

• Limited Tech Preview release due June 2008
• Public Beta targeted for Aug/Sept 2008

For more details
– Contact:
  • Alex Wade (Program Manager) / [email protected]
– Community Forum:
  • http://community.research.microsoft.com/forums/90.aspx

eScience and Semantic Computing meet the Cloud

The cyberinfrastructure for the next generation of researchers

The Future: Software plus Services for Science?
• Expect scientific research environments will follow similar trends to the commercial sector
  – Leverage computing and data storage in the cloud
  – Scientists already experimenting with Amazon S3 and EC2 services, with mixed results
• For many of the same reasons
  – Siloed research teams, no resource sharing across labs
  – High storage costs
  – Low resource utilization
  – Excess capacity
  – High costs of reliably keeping machines up-to-date
  – Little support for developers, system operators

A smart cyberinfrastructure

– If last.fm can recommend what song to broadcast to me based on what my friends are listening to, why can't the cyberinfrastructure of the future recommend articles of potential interest based on what the experts in the field that I respect are reading?
– Examples are already emerging, but the process is manual (Connotea, BioMedCentral Faculty of 1000 ...)
• Automatic correlation of scientific data
• Smart composition of services and functionality
• Cloud computing to aggregate, process, analyze and visualize data

A world where all data is linked…

• Data/information is interconnected through machine-interpretable information (e.g. paper X is about star Y)
• Social networks are a special case of ‘data networks’

• Important/key considerations
  – Formats or “well-known” representations of data/information
  – Pervasive access protocols are key (e.g. HTTP)
  – Data/information is uniquely identified (e.g. URIs)
  – Links/associations between data/information
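The considerations above can be sketched with plain Python: every item is identified by a URI, and each link is a machine-interpretable (subject, predicate, object) statement. The `example.org` URIs and the vocabulary terms are made up for illustration, echoing the slide's "paper X is about star Y"; the output format follows the `<s> <p> <o> .` line shape used by N-Triples.

```python
# Hypothetical URIs; any globally unique HTTP identifier would do.
PAPER_X = "http://example.org/papers/X"
STAR_Y = "http://example.org/stars/Y"
TONY = "http://example.org/people/tony"
ABOUT = "http://example.org/vocab/about"
AUTHOR = "http://example.org/vocab/author"

# Each link is a (subject, predicate, object) statement.
links = [
    (PAPER_X, ABOUT, STAR_Y),
    (PAPER_X, AUTHOR, TONY),
]


def to_ntriples(statements):
    """Serialize statements as one '<s> <p> <o> .' line each."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)


def linked_from(uri, statements):
    """Follow the outgoing links of one identified resource."""
    return [(p, o) for s, p, o in statements if s == uri]


print(to_ntriples(links))
print(linked_from(PAPER_X, links))
```

Because people are identified the same way as papers and stars, the author link is just another statement, which is one way to read the remark that social networks are a special case of data networks.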

Attribution: Richard Cyganiak

…and stored/processed/analyzed in the cloud

Vision of Future Research Environment with both Software + Services

[Diagram labels: visualization and analysis services; scholarly communications; domain-specific services; search; books; citations; blogs & social networking; reference management; instant messaging; identity; project management; mail; notification; document store; storage/data services; knowledge management; compute services; knowledge discovery; virtualization]

Added slides

eScience

Emergence of a New Research Paradigm?

• A thousand years ago – Experimental Science
  – Description of natural phenomena
• Last few hundred years – Theoretical Science
  – Newton’s Laws, Maxwell’s Equations…
• Last few decades – Computational Science
  – Simulation of complex phenomena
• Today – eScience or Data-centric Science
  – Unify theory, experiment, and simulation
  – Using data exploration and
    • Data captured by instruments
    • Data generated by simulations
    • Data generated by sensor networks
  – Scientists overwhelmed with data – and IT companies have technologies that will help

Equation shown on the slide:
$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$$

(With thanks to Jim Gray)

Today

Web users...
• Generate content on the Web
  – Blogs, wikis, podcasts, videocasts, etc.
• Form communities
  – Social networks, virtual worlds
• Interact, collaborate, share
  – Instant messaging, web forums, content sites
• Consume information and services
  – Search, annotate, syndicate

Scientists...
• Annotate, share, discover data
  – Custom, standalone tools
• Conferences, Journals
  – Publication process is long, subscriptions, discoverability issues
• Collaborate on projects, exchange ideas
  – Email, F2F meetings, video-conferences
• Use workflow tools to compose services
  – Domain-specific services/tools

Data can be easily produced

http://ecrystals.chem.soton.ac.uk
Thanks to Jeremy Frey

Data and services can be easily composed

Taverna Workflow

Compose services from the Web

SensorMap
Functionality: Map navigation
Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

Data is easily accessible

With thanks to Catharine van Ingen

Data is easily shareable

Sloan Digital Sky Server/SkyServer
http://cas.sdss.org/dr5/en/

Today…

Computers are great tools for storing, computing, managing, and indexing huge amounts of data.

For example, Google and Microsoft both have copies of the Web for indexing purposes.

Tomorrow…

Computers will still be great tools for storing, computing, managing, and indexing huge amounts of data.

We would like computers to also help with the automatic acquisition, discovery, aggregation, organization, correlation, analysis, interpretation, and inference of the world’s information.

Semantic Computing

What is Semantic Computing?

• Set of concepts and technologies
  – Data modeling
  – Relationships
  – Ontologies
  – Machine learning (entity extraction)
  – Inference, reasoning
  – Data, information, knowledge…
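As one small illustration of "inference, reasoning" over modeled relationships, the sketch below derives facts that were never stated explicitly by following a transitive subclass relationship. The tiny ontology, the instance names, and the function names are all invented for this example, and the hierarchy is assumed to contain no cycles.

```python
# Hypothetical ontology: each class maps to its direct superclass.
# Assumed to be acyclic, so the walk below always terminates.
subclass_of = {
    "JournalArticle": "Publication",
    "Publication": "ResearchOutput",
    "Lecture": "ResearchOutput",
}

# Explicitly stated facts: instance -> class.
instance_of = {"pub1": "JournalArticle", "talk1": "Lecture"}


def superclasses(cls):
    """Return every class reachable by following subclass_of links."""
    result = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        result.append(cls)
    return result


def inferred_types(instance):
    """The stated class plus every class implied by the hierarchy."""
    stated = instance_of[instance]
    return [stated] + superclasses(stated)


print(inferred_types("pub1"))
print(inferred_types("talk1"))
```

Only "pub1 is a JournalArticle" was stated; that it is also a Publication and a ResearchOutput follows from the ontology, which is the kind of step a reasoner automates at scale.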

[Figure: progression from Data to Information to Knowledge to Intelligence to Wisdom, with regions labelled "Current technologies" and "Possibilities for innovation"]

• Term used to refer to the concept of “meaning”
• The linguistics, AI, Natural Language Processing, etc. communities have been working on “meaning” and “knowledge” related technologies for decades
• Pragmatic approach to Semantic Computing
  – Emergence of a new breed of technologies to capture meaning (RDF, OWL, etc.)
  – Combine with the pervasiveness of the Web community technologies such as …

A word about the “

• The term is used to describe a set of technologies used to represent data, concepts, and their relationships
  – Has become a buzzword like Web 2.0

• Prefer to use the term “Semantic Computing”, which is about modeling data in ways that can be automatically processed by computers

Semantic Computing

• Some efforts are driven by the traditional “knowledge engineering” community
  – Engaged in building well-controlled ontologies
  – Important for domain-specific vocabularies with data formats and relationships specific to a community
  – Model does not easily scale to the
• Some efforts are driven by the Web 2.0 community
  – Focus on the pervasiveness of Web protocols/standards
  – Emphasis on (small, flexible, embeddable structures)
  – Exploit evolving and ever-expanding vocabularies such as folksonomies and tag clouds
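The folksonomy idea in the last bullet can be illustrated with a short sketch: users tag freely, with no controlled vocabulary, and the shared vocabulary emerges from how often each tag is used. The user names and tags below are invented for the example.

```python
from collections import Counter

# Hypothetical free-form tags applied by different users to the same paper.
user_tags = {
    "alice": ["semantic-web", "rdf", "escience"],
    "bob": ["rdf", "ontology"],
    "carol": ["escience", "rdf"],
}


def tag_cloud(tags_by_user):
    """Count how often each tag was used; counts can drive tag-cloud weights."""
    counts = Counter()
    for tags in tags_by_user.values():
        counts.update(tags)
    return counts


cloud = tag_cloud(user_tags)
print(cloud.most_common(2))
```

Nothing here was agreed in advance, which is the contrast with the well-controlled ontologies of the knowledge engineering approach.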