Everything you always wanted to know about the Grid and never dared to ask

Tony Hey and Geoffrey Fox

Outline
• Lecture 1: Origins of the Grid – The Past to the Present (TH)
• Lecture 2: Web Services, Globus, OGSA and the Architecture of the Grid (GF)
• Lecture 3: Data Grids, Computing Grids and P2P Grids (GF)
• Lecture 4: Grid Functionalities – Metadata, Workflow and Portals (GF)
• Lecture 5: The Future of the Grid – e-Science to e-Business (TH)

Lecture 1

Origins of the Grid – The Past to the Present

[Grid Computing Book: Chs 1, 2, 3, 4, 5, 6, 36]

Lecture 5

The Future of the Grid – e-Science to e-Business

[Grid Computing Book: Chs 38, 39, 40, 41, 42, 43]

Lecture 5

1. e-Science Research and the Future of Scientific Research
2. Computer Science Research Issues
3. A Business Case for the Grid
4. Concluding Remarks

e-Science and the Future of Scientific Research

‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor, 2001

Integrated e-Science Environment

“Problem Solving Environments”: domain-specific application interfaces for scientists

Computing Grid service; Data discovery Grid service; Data visualisation Grid service; Experiment control Grid service

Grid services middleware

Authentication; Authorisation; Accounting
Computers (local, remote); Data storage (local, remote); Experiments (local, remote)

Framework for distributed scientific computing and experimentation

e-Science Examples

• Particle Physics
• Virtual Observatories
• e-Engineering
• e-Chemistry
• Bioinformatics
• High-Throughput Applications
• e-Health

DAME Project

[Figure labels: In flight data; Global Network (e.g. SITA); Ground Station; Airline; DS&S Engine Health Center; Maintenance Centre; Internet, e-mail, pager; Data centre]

Comb-e-Chem

[Figure labels: Structure + Properties; Knowledge + Prediction; Structures DB; Properties DB; Simulation and calculation]

Combinatorial Chemistry

• Parallel synthetic approach
  – create hundreds of materials
  – screen properties to find those that fit the bill
• Typically requires several passes
  – find chemical structure of the best candidates
  – create new batches of similar materials for subsequent passes
• Leads to explosive growth in:
  – volume of data generated
  – potential to exploit this data

[Figure labels: MeOH, EtOH, PrOH, BuOH; R1COOH, R2COOH, R3COOH, R4COOH; same reaction sequence for all combinations; Monitor & Analysis Data; Interface to Grid]

Array production of different chemical species
Well plate with typically 96 or 384 cells

Library synthesis

[Figure labels: Mass Spec; Raman; x-ray; databases; High throughput systems; Structure and properties analysis]

100,000’s compounds at a time
➢ Produces huge amounts of complex data

Remote equipment, multiple users, few experts

[Figure labels: Users; Data & control links; Access Grid links; Expert; Experiment; Remote (Dark) Laboratory]

➢ Model for National Crystallographic Service NCS

NCS Workflow

1. Send sample material to NCS service
2. Collaborate in e-Lab experiment and obtain structure
3. Search materials database and predict properties using Grid computations
4. Download full data on materials of interest

Structures Computation X-Ray e-Laboratory Database Service NCS Portal Access NCS Experimental Services NCS Lab Service

Samples and UI Status Collaboratory Interface Data Access Interface Schedules Monitor

Proxy Proxy Proxy

Middleware

Control GUI Raw Results Struct Schedules Chat Audio Auth (Filtered Images Access Access VNC)

Backend

Sample Management; Schedule Management; Raw Data (Files); Processed Data (DB); Structure Data (DB)

Admin Scheduling Expt Control HKL Calculation Struct Calc Control

myGrid Project

• Imminent ‘deluge’ of data
• Highly heterogeneous
• Highly complex and inter-related
• Convergence of data and literature archives

myGrid: Generic Technologies

1. Database access from the Grid
2. Process enactment on the Grid
3. Personalisation services
4. Metadata services
5. Development of Agent Services

➢ Ultimate goal is to put Grid Services together with Ontologies to develop ‘Semantic Grid’

Workflow
• Know how.
• Associate base resources with derived data.
• Keep, describe, find, compare, protect.
• Repeat/reuse/re-enact
• Specialise/Customise/Personalise
• Evolution – notification, knowledge
• Quality & best practice
  – Need the workflows to be effective
➢ good experimental practice.

Personalisation

• Dynamic creation of personal data sets
• Personal views over repositories
• Personalisation of workflows
• Personal notification
• Annotation of datasets and workflows
• Personalisation of service descriptions – ‘what I the service does’

Provenance

• Who, what, where, why, when, how?
• The traceability of knowledge as it evolves and as it is derived.
• Identity – the Life Sciences ID
• Lab Books, Methods in papers.
• Immutable Metadata
• Migration – travels with its data but may not be stored with it.
• Private vs Shared provenance records.
• Ownership/credit

Discovery Net Project

In Real Time Scientific Scientific Information Discovery

Real Time Integration Workflow Construction Literature

Databases

Operational Interactive Visual Data Dynamic Application Integration Analysis Using Distributed Resources

Images

Instrument Data

Discovery Process Management

• Workflow = Service Composition + Discovery Pathway

• Towards a Standard Workflow Representation for Discovery Informatics: Discovery Process Markup Language (DPML):

– Discovery Pathway Construction: Recording and managing a collaboratively built discovery process

– Distributed Service Composition: Components organised by the workflow can be executing anywhere

– Discovery Pathway as Key Intellectual Property: Discovery Processes can be stored, reused, audited, refined and deployed in various forms

[Figure: D-Net Workflow for Genome Annotation: 16 services executing across Internet]

Dynamic Integration Services
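The slogan "Workflow = Service Composition + Discovery Pathway" from the Discovery Process Management slide can be pictured with a minimal sketch. This is not DPML or the Discovery Net API: the `Step` and `Workflow` classes and the two toy "services" are hypothetical names invented for illustration, and local Python functions stand in for remote services.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Step:
    """One composed service; `service` is a local stand-in for a remote call."""
    name: str
    service: Callable[[Any], Any]


@dataclass
class Workflow:
    """Service composition plus a recorded discovery pathway (hypothetical model)."""
    steps: List[Step]
    pathway: List[dict] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        # Execute the services in order, recording each step so the
        # process can later be stored, audited or re-run.
        for step in self.steps:
            result = step.service(data)
            self.pathway.append({"step": step.name, "input": data, "output": result})
            data = result
        return data


# Hypothetical toy services, not real annotation tools.
def mask_repeats(sequence: str) -> str:
    return sequence.replace("AAAA", "NNNN")


def count_gc(sequence: str) -> int:
    return sum(1 for base in sequence if base in "GC")


workflow = Workflow([Step("mask_repeats", mask_repeats), Step("count_gc", count_gc)])
result = workflow.run("GGAAAACC")
print(result)
for record in workflow.pathway:
    print(record)
```

The recorded `pathway` list is what makes the process itself an artefact that can be stored and audited, which is the point the slide makes about the discovery pathway as intellectual property.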

• Dynamic Application Integration = On-demand access and composition of remote analysis components

• Towards a Dynamic Component Integration:

– Knowledge Servers: allow users to register, locate and remotely execute components

– Execution Servers: allow users to control the execution of components in distributed environments

– Easy Maintenance: New components can be added through a clean API

[Figure labels: D-NET API; Clustering; Classification; Text analysis; Gene function prediction; Promoter Prediction; Homology Search]

Case Study: SC2002 HPC Challenge

D-Net based Global Collaborative Real-Time Genome Annotation

[Figure: annotation pipeline fed by High Throughput DNA Sequencers, with three levels, each linked to literature references.
Nucleotide-level Annotation: identify organism, chromosomes, genes, tRNAs and rRNAs, non-translated RNAs, gene markers, regulatory regions, repetitive elements, segmental duplication, SNP variations. Labels include genscan, blast, grail, Repeat Masker, E-PCR, EMBL, NCBI, TIGR, SNP.
Protein-level Annotation: identify proteins, classify into protein families, functional characterisation, homologues, motif search, domain, 3-D structure, secondary structure, fold prediction. Labels include blast, 3D-PSSM, InterPro, PFAM, SMART, SWISS-PROT, predator, DSC.
Process-level Annotation: relate to ontologies, pathway maps, metabolism, cell cycle, cell death, embryogenesis, biological process, drugs. Labels include GO, AmiGO, CSNDB, GeneMaps, KEGG, GK, GenNav, virtual chip.]

15 DBs, 21 Applications

How It Works

Interactive Editor & Visualisation
Nucleotide Annotation Workflows

Download sequence from Reference Server
Execute distributed annotation workflow
Save to Distributed Annotation Server

[Figure labels: InterPro, SMART, KEGG, SWISS-PROT, EMBL, NCBI, TIGR, SNP, GO]

➢ 500 Web access
➢ 1800 clicks
➢ 200 copy/paste
➢ 3 weeks work in 1 workflow and few second execution

eDiamond Applications of SMF

• Teleradiology and QC
• Training and VirtualMammo
• Differential Diagnosis: “Find one like it”?
• Advanced CAD: SMF-CAD workstation
• Epidemiology: SMF-computed breast density

Image guided interventions

Images courtesy Derek Hill, Guy’s Hospital

Image guided interventions (2)

Images courtesy Guy’s Hospital

Surgical verification

Accuracy of surgical placement against plan
• Surgeon plans on X-ray or CT, uses database of prostheses
• Operation takes place using plan as guidance
• Post-operative X-ray evaluated for accuracy of placement
• Data stored and used for short term assessment and long term evaluation studies

Courtesy of Ian Revie, Depuy International

Summary

• UK e-Science projects emphasize data federation and integration as much as computation

• Metadata and ontologies key to higher level Grid services

• e-Science projects will produce a deluge of scientific data that will need to be annotated and curated in scientific data ‘digital libraries’

Databases in the Grid

Data Complexity

Computational Complexity

OGSA – DAI Project

• Key middleware project for UK Program - Total Budget £3M (CP £1.5M)

• Three Centres involved: - Edinburgh, Manchester and Newcastle

• Industrial partners: - IBM US, IBM and Oracle UK

➢ Goal is to develop high-quality data-centric middleware

OGSA – DAI Project

• Design Specification completed
  – Papers for GGF WG on Database Access and Integration Services
• Alpha versions delivered:
  – Distributed Query Service
  – XML Database Interface
  – Relational Database Interface
• Beta versions by April 2003
  – Integrate with Globus GT3 release

e-Science and the Future of Scientific Research

‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor, 2001

➢ Need to break down the barriers between the Victorian ‘bastions’ of science – biology, chemistry, physics, ….
➢ Develop ‘permeable’ structures that promote rather than hinder multidisciplinary collaboration
➢ Engage Computing Services and Libraries in developing a new e-Science support service on Campus

e-Science and Computer Science

• The lesson of the Web
• The Semantic Grid
  – The myGrid project
  – The Discovery Net Project
• Computer Science Research and the Grid

Error 404: Page not found

‘If you want the Web to scale, You must allow the links to fail’

Wendy Hall after Tim Berners-Lee

➢ HTML as the ‘’ of Hypertext!

Semantic Web Metadata & Ontologies
• Metadata – computationally accessible data about the services
• Ontologies – the shared and common understanding of a domain
  – A vocabulary of terms
  – Definition of what those terms mean.
  – A shared understanding for people and machines
  – Usually organised into a taxonomy.

Reasoning in DAML+OIL

• Consistency – check if knowledge is meaningful
• Subsumption – structure knowledge, compute classification
• Equivalence – check if two classes denote the same set of instances
• Instantiation – check if an individual is an instance of class C
• Retrieval – retrieve the set of individuals that instantiate C

Computer Science Challenges from e-Science
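The five reasoning tasks listed on the DAML+OIL slide above can be made concrete with a toy taxonomy. The sketch below is only an illustration over an asserted class hierarchy; it is not DAML+OIL syntax and not a description-logic reasoner, and the class names, individuals and disjointness axiom are all invented for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology: each class maps to its direct superclasses.
SUPERCLASSES: Dict[str, Set[str]] = {
    "Thing": set(),
    "Molecule": {"Thing"},
    "SmallMolecule": {"Molecule"},
    "Protein": {"Molecule"},
    "Enzyme": {"Protein"},
}
# Hypothetical disjointness axiom: nothing is both a Protein and a SmallMolecule.
DISJOINT: List[Tuple[str, str]] = [("Protein", "SmallMolecule")]
# Hypothetical individuals with their asserted classes.
INSTANCES: Dict[str, Set[str]] = {
    "lysozyme": {"Enzyme"},
    "insulin": {"Protein"},
    "water": {"SmallMolecule"},
}


def ancestors(cls: str) -> Set[str]:
    """All classes reachable by following superclass links upward."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        for parent in SUPERCLASSES[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumes(general: str, specific: str) -> bool:
    """Subsumption: every instance of `specific` is an instance of `general`."""
    return general == specific or general in ancestors(specific)


def equivalent(a: str, b: str) -> bool:
    """Equivalence: each class subsumes the other."""
    return subsumes(a, b) and subsumes(b, a)


def is_instance(individual: str, cls: str) -> bool:
    """Instantiation: some asserted class of the individual is subsumed by `cls`."""
    return any(subsumes(cls, asserted) for asserted in INSTANCES[individual])


def retrieve(cls: str) -> Set[str]:
    """Retrieval: all individuals that instantiate `cls`."""
    return {name for name in INSTANCES if is_instance(name, cls)}


def consistent() -> bool:
    """Consistency (toy version): no individual falls in two disjoint classes."""
    return not any(
        is_instance(name, a) and is_instance(name, b)
        for name in INSTANCES
        for a, b in DISJOINT
    )


print(subsumes("Molecule", "Enzyme"))
print(sorted(retrieve("Protein")))
print(consistent())
```

A real reasoner derives subsumption from class definitions rather than from an asserted hierarchy, which is where the scalability questions raised later in the lecture come from.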

UK CS Team led by Tom Rodden identified 4 major research challenges arising from e-Science:

- Developing a Semantic Grid
- Trusted Ubiquitous Systems
- Rapid Customized Assembly of Services
- Autonomic Computing

Towards a Semantic Grid

• Trace provenance from initial data to information and knowledge structures
• Techniques to allow scalable reasoning over uncertain/incomplete knowledge
• Tools for design, development and deployment of large-scale ontologies
• Support for semantic-directed knowledge discovery to complement data-mining
• Development of flexible network-based reasoning and decision support services

Trusted Ubiquitous Systems

• New theories to model, specify and analyse trust in distributed ubiquitous systems
• New quality of service and service-based models for ubiquitous systems
• New design guidelines and practices to enable the development of reusable trusted components
• New understanding of the practical engineering trade-offs required to realise trusted ubiquitous systems

Rapid Customised Assembly of Services
• New theories to describe and reason about semantics and behaviour of services and compositional effects
• Agent and service representations that promote adaptability and emergent, opportunistic and implicit arrangement of services
• New tools to support the discovery, composition and use of services based on high-level description of requirements
• Techniques to support directed automatic composition, decomposition and recomposition of services

Autonomic Computing
• Techniques to analyze, describe and reason about adaptive systems
• Management of semi-autonomous systems with policies, services and software agents
• Interoperability and reasoning across and between different autonomous domains
• Modeling and measurement of performance of QoS for autonomic structures
• Techniques to capture and represent history, context and environment

IBM Autonomic Computing Vision

• Self-Configuring: Adapt automatically to the dynamically changing environments
• Self-Healing: Discover, diagnose, and react to disruptions
• Self-Optimizing: Monitor and tune resources automatically
• Self-Protecting: Anticipate, detect, identify, and protect against attacks from anywhere

A Business Case for the Grid

• Total Cost of Ownership – TCO
• Value of Open Standards
• Industrial Applications
• Time to exploitation
• e-Utilities

Current IT Environment
Distributed, Heterogeneous, Complex

[Figure: Typical Financial Subsystem Configuration. Front-end for Web presence for financial services (Presentation, Business Logic, Gateway; Netscape Enterprise Server, WebSphere Application Server, MQ; links labelled HTTP, JDBC, SNA) connected to Back-end Systems (IMS Sysplex, DB2, CICS, TPF, MQ, security, profile database and logging servers).]

Current IT Environment
Distributed, Heterogeneous, Complex

[Figure: the same Typical Financial Subsystem Configuration, annotated with “Complexity, TCO”, “Tech. Cost, Utilization” and zSeries.]

Server / Storage Utilization

              Peak-hour     Prime-shift   24-hour Period
              Utilization   Utilization   Utilization
Mainframes    85-100%       70%           60%
UNIX          50-70%        10-15%        <10%
Intel-based   30%           5-10%         2-5%
Storage       N/A           N/A           52%

Source: IBM Scorpion White Paper: Simplifying the Corporate IT Infrastructure, 2000

Total Cost of Ownership: TCO
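To see why the utilization figures in the preceding table feed into a TCO argument, a back-of-the-envelope calculation helps. The sketch below takes the 24-hour figures from the table, using the upper bound wherever the table gives a range or "<10%". The fleet of 100 servers and the 50% target utilization are hypothetical numbers chosen for illustration, and the calculation assumes workloads can be pooled freely, which ignores peak-hour demand.

```python
# 24-hour utilization in percent, taken from the table above.
# Upper bounds are used where the table gives a range or "<10%".
UTILIZATION_24H = {
    "Mainframes": 60,
    "UNIX": 10,
    "Intel-based": 5,
    "Storage": 52,
}


def idle_percent(utilization: int) -> int:
    """Share of capacity that sits unused over a 24-hour period."""
    return 100 - utilization


def servers_needed(current_servers: int, utilization: int, target: int) -> int:
    """Servers needed for the same total work at a higher target utilization.

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    work = current_servers * utilization
    return -(-work // target)


for platform, used in UTILIZATION_24H.items():
    print(f"{platform}: {idle_percent(used)}% idle over 24 hours")

# Hypothetical fleet: 100 Intel-based servers consolidated to 50% utilization.
print(servers_needed(100, UTILIZATION_24H["Intel-based"], 50))
```

Under these assumptions the Intel-based fleet shrinks by an order of magnitude. In practice the table's peak-hour column (30% for Intel-based) limits how far consolidation can go.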

Much More than Hardware and Software Costs

IT Budgets
Hardware      10%
Software      12%
Personnel     16%
Maintenance   30%
Integration   32%

Grid Computing Sales Pitch

Storage I/O Operating System

Processing Applications Data

Distributed Computing Over a Network, Using Open Standards to Enable Heterogeneous Operations

Grid Technology Enables

• Increased Server Utilization
• Workload Management and Consolidation
• Reduced Cycle Times

• Collaboration and Access to Data
• Federation of Data
• Global Distribution

• Resilient/Highly Available Infrastructure
• Business Continuity
• Recovery and Failover

Supporting Heterogeneous Resources Through Open Standards….

Increased Server Utilization
• Exploit distributed resources to provide capacity for high-demand applications
  – Existing applications that cannot be run effectively on a single processor
  – New large scale applications that provide strategic business advantages
• Reduce infrastructure cost associated with over-provisioned resources
  – Balance workload based on policies
  – Optimize for cost or throughput
• Reduce the cost of manpower to manage and configure resources
  – Fewer resources to manage for the same workload

Collaboration and Access to Data

• Enable collaboration across applications to integrate results
  – Leverage Distributed Data and Resources
• Support large multi-disciplinary collaborations
  – Link Business Processes
  – Federation of Data
• Both within a single organization and between partners
  – Exploit Replication Services Across Enterprises

[Figure labels: Design Analytics; Design; Pricing Design; Simulation]

The Value of Open Standards

Distributed Computing: Grid (Globus -> OGSA)

Applications: Web Services (SOAP, WSDL, UDDI)

Operating System: Linux

Information: World-wide Web (HTML, HTTP, J2EE, XML)

Communications: e-mail (POP3, SMTP, MIME)

Networking: The Internet (TCP/IP)

Sun and the Grid:
‘Grid Computing is one of the three next big things for Sun and our customers’
Ed Zander, COO

Microsoft and the Grid:
‘The alignment of OGSA with XML Web services is important because it will make Internet-scale, distributed Grid Computing possible’
Robert Wahbe, General Manager of Web Services

Industry Applications

Unique by Industry with Common Characteristics

Manufacturing: Product Design; Process Simulation; Finite Element Analysis; Failure Analysis
Financial Services: Derivatives Analysis; Statistical Analysis; Portfolio Risk Analysis; Batch Throughput
LS/Bioinformatics: Cancer Research; Drug Discovery; Protein Folding; Protein Sequencing
Energy: Seismic Analysis; Reservoir Analysis
Telco & Media: Bandwidth Consumption; Digital Rendering; Multiplayer Gaming
Gov’t & Education: Collaborative Research; Weather Analysis; HPC

Grid Infrastructure

Primary Focus Globalization Grid: Butterfly.net

• Unlimited Numbers of Players
• Distributed Artificial Intelligence
• Multiple Concurrent Players
• 1,000 downloads of developer’s kit per week
• Hot-swappable Components
• Developers, Publishers, ESPs

HP, the Grid and e-Utilities

The Grid fabric for e-Utilities will be:
• Soft – malleable, multi-purpose
• Dynamic – resources will be constantly changing
• Federated – global structure not owned by any single authority
• Heterogeneous – from supercomputer clusters to PCs
John Manley, HP Labs

Timescales for Exploitation?

• IBM see ‘early adopters’ of Grid technology coming from pharmaceutical, engineering and petrochemical sectors

➢ UK program confirms this picture (AstraZeneca, GSK, Merck, Pfizer, Rolls Royce, BAESystems, Schlumberger)

• IBM see Grid middleware being adopted by more mainstream commerce and industry in 2003/2004 timeframe

Status of the Grid
• Today - ‘early adoption’ phase - just like the Web in the early days
  – Industry now selling ‘IntraGrid’ solutions
  – Genuine Virtual Organisation ‘InterGrid’ middleware not yet mature
• ‘Tomorrow’ - sophisticated combinations of services to locate information, applications to process it, and computer systems to run them
➢ Autonomic Middleware infrastructure capable of supporting Virtual Organisations, c-Commerce and e-Utilities will take time!

e-Government and the Grid

‘[The Grid] intends to make access to computing power, scientific data repositories and experimental facilities as easy as the Web makes access to information.’
Tony Blair, 2002

Acknowledgements

With thanks to: Gerd Breiter, Phillipe Bricard, David Boyd, Jens Jensen, Daron Green, Mike Brady, Derek Hill, Carole Goble, Yike Guo, Jeremy Frey, Bill Johnston, Ray Browne, Jim Fleming, Anne Trefethen and many others