Everything you always wanted to know about the Grid and never dared to ask
Tony Hey and Geoffrey Fox
Outline
• Lecture 1: Origins of the Grid – The Past to the Present (TH)
• Lecture 2: Web Services, Globus, OGSA and the Architecture of the Grid (GF)
• Lecture 3: Data Grids, Computing Grids and P2P Grids (GF)
• Lecture 4: Grid Functionalities – Metadata, Workflow and Portals (GF)
• Lecture 5: The Future of the Grid - e-Science to e-Business (TH)
Lecture 1
Origins of the Grid – The Past to the Present
[Grid Computing Book: Chs 1,2,3,4,5,6,36] Lecture 5
The Future of the Grid – e-Science to e-Business
[Grid Computing Book: Chs 38, 39, 40,41,42,43] Lecture 5
1. e-Science Research and the Future of Scientific Research
2. Computer Science Research Issues
3. A Business Case for the Grid
4. Concluding Remarks
e-Science and the Future of Scientific Research
‘e-Science will change the dynamic of the way science is undertaken.’ John Taylor, 2001 Integrated e-Science Environment
“Problem Solving Environments” Domain-specific application interfaces for scientists
[Figure: layered e-Science environment.
Grid services: computing, data discovery, data visualisation, experiment control.
Grid services middleware: authentication, authorisation, accounting.
Resources: computers, data storage, experiments (each local and remote).]
Framework for distributed scientific computing and experimentation e-Science Examples
• Particle Physics
• Virtual Observatories
• e-Engineering
• e-Chemistry
• Bioinformatics
• High-Throughput Applications
• e-Health
DAME Project
In flight data
Global Network eg: SITA Ground Airline Station
DS&S Engine Health Center
Maintenance Centre Internet, e-mail, pager
Data centre Comb-e-Chem
Structure + Properties Knowledge + Prediction
Structures Properties DB DB
Simulation and calculation Combinatorial Chemistry
• Parallel synthetic approach
– create hundreds of materials
– screen properties to find those that fit the bill
• Typically requires several passes
– find chemical structure of the best candidates
– create new batches of similar materials for subsequent passes
• Leads to explosive growth in:
– volume of data generated
– potential to exploit this data
MeOH EtOH PrOH BuOH
R1COOH
R2COOH
R3COOH
R4COOH Monitor & Analysis Data
same reaction sequence Interface to Grid for all combinations
Array production of different chemical species
Well plate with typically 96 or 384 cells
Library synthesis
Mass Spec databases
Raman
x-ray
High throughput systems: 100,000's of compounds at a time
Structure and properties analysis
➢ Produces huge amounts of complex data
Remote equipment, multiple users, few experts
Users Users Users
Data & control links
Access Grid links Experiment Expert
Experiment Remote (Dark) Laboratory
➢ Model for National Crystallographic Service NCS
NCS Workflow
Send sample material to NCS service
Collaborate in e-Lab experiment and obtain structure
Search materials database and predict properties using Grid computations
Download full data on materials of interest
[Figure: NCS service architecture. An NCS Portal gives access to the NCS experimental services (structures database, computation service, X-ray e-Laboratory, NCS lab service). Layers shown: user-facing interfaces (samples and status UI, collaboratory interface, data access interface, schedules monitor), proxies, middleware (control GUI over filtered VNC, chat, audio, auth, access to raw images, results, structures and schedules) and backend (sample management, schedule management, raw data files, processed data DB, structure data DB; admin, scheduling, experiment control, HKL calculation, structure calculation control).]
myGrid Project
• Imminent ‘deluge’ of data • Highly heterogeneous • Highly complex and inter-related • Convergence of data and literature archives myGrid: Generic Technologies
1. Database access from the Grid 2. Process enactment on the Grid 3. Personalisation services 4. Metadata services 5. Development of Agent Services
➢ Ultimate goal is to put Grid Services together with Ontologies to develop ‘Semantic Grid’
Workflow
• Know how.
• Associate base resources with derived data.
• Keep, describe, find, compare, protect, share.
• Repeat/reuse/re-enact
• Specialise/Customise/Personalise
• Evolution – notification, knowledge
• Quality & best practice – Need the workflows to be effective
➢ good experimental practice.
Personalisation
• Dynamic creation of personal data sets
• Personal views over repositories
• Personalisation of workflows
• Personal notification
• Annotation of datasets and workflows
• Personalisation of service descriptions – ‘what I think the service does’
Provenance
• Who, what, where, why, when, how?
• The traceability of knowledge as it evolves and as it is derived.
• Identity – the Life Sciences ID
• Lab Books, Methods in papers.
• Immutable Metadata
• Migration – travels with its data but may not be stored with it.
• Private vs Shared provenance records.
• Ownership/credit
Discovery Net Project
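The Provenance bullets above (immutable metadata, traceability of derived knowledge, records that travel with the data but are stored separately) can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `ProvenanceRecord` fields, the `trace` helper and the sample identifiers are assumptions made for illustration, not the myGrid data model.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Dict, List, Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    data_id: str                        # identifier of the data item described
    who: str                            # person or instrument responsible
    what: str                           # short description of the item
    when: str                           # ISO 8601 timestamp, kept as text
    derived_from: Optional[str] = None  # parent data_id, None for source data


def trace(data_id: str, store: Dict[str, ProvenanceRecord]) -> List[str]:
    """Walk derivation links from a data item back to its original source."""
    chain: List[str] = []
    current: Optional[str] = data_id
    while current is not None:
        chain.append(current)
        current = store[current].derived_from
    return chain


# The provenance store is kept apart from the data it describes.
store = {
    "seq-001": ProvenanceRecord("seq-001", "sequencer", "raw read",
                                "2002-11-18T09:00:00"),
    "ann-001": ProvenanceRecord("ann-001", "alice", "gene annotation",
                                "2002-11-18T10:30:00", "seq-001"),
    "rep-001": ProvenanceRecord("rep-001", "bob", "summary report",
                                "2002-11-19T14:00:00", "ann-001"),
}

print(trace("rep-001", store))

try:
    store["ann-001"].who = "mallory"  # attempts to rewrite history are rejected
except FrozenInstanceError:
    print("provenance records are immutable")
```

A real system would also need persistent identifiers and access control for private versus shared records, neither of which this sketch attempts.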
[Figure: Discovery Net overview. Scientific information (literature, databases, operational data, images, instrument data) feeds real-time integration, dynamic application integration, workflow construction and interactive visual analysis, using distributed resources, leading to scientific discovery in real time.]
Discovery Process Management
• Workflow = Service Composition + Discovery Pathway
• Towards a Standard Workflow Representation for Discovery Informatics: Discovery Process Markup Language (DPML):
– Discovery Pathway Construction: Recording and managing a collaboratively-built discovery process

– Distributed Service Composition: Components organised by the workflow can be executing anywhere
– Discovery Pathway as Key Intellectual Property: Discovery Processes can be stored, reused, audited, refined and deployed in various forms
D-Net Workflow for Genome Annotation: 16 services executing across Internet
Dynamic Integration Services
• Dynamic Application Integration = On-demand access and composition of remote analysis components
• Towards a Dynamic Component Integration:
– Knowledge Servers: allow users to register, locate and remotely execute components
– Execution Servers: allow users to control the execution of components in distributed environments
– Easy Maintenance: New components can be added through a clean API
[Figure: D-NET API connecting analysis components: clustering, classification, text analysis, gene function prediction, promoter prediction, homology search]
Case Study: SC2002 HPC Challenge
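The Dynamic Integration Services slide above describes Knowledge Servers (register and locate components) and Execution Servers (control their execution). The following Python sketch shows that division of labour in miniature. It is not the D-NET API: the class names, method names and the two toy components are all hypothetical, and everything runs in one process rather than across remote hosts.

```python
from typing import Callable, Dict, List, Set

Component = Callable[[str], str]  # assumption: components map text to text


class KnowledgeServer:
    """Registry in which components are registered and located by tag."""

    def __init__(self) -> None:
        self._components: Dict[str, Component] = {}
        self._tags: Dict[str, Set[str]] = {}

    def register(self, name: str, component: Component, tags: List[str]) -> None:
        self._components[name] = component
        self._tags[name] = set(tags)

    def locate(self, tag: str) -> List[str]:
        """Return the names of all components carrying the given tag."""
        return sorted(n for n, tags in self._tags.items() if tag in tags)

    def lookup(self, name: str) -> Component:
        return self._components[name]


class ExecutionServer:
    """Runs a named sequence of components, feeding each output to the next."""

    def __init__(self, knowledge: KnowledgeServer) -> None:
        self._knowledge = knowledge

    def run(self, pipeline: List[str], data: str) -> str:
        for name in pipeline:
            data = self._knowledge.lookup(name)(data)
        return data


def normalise(seq: str) -> str:
    """Strip whitespace and upper-case a DNA sequence."""
    return seq.strip().upper()


def reverse_complement(seq: str) -> str:
    """Reverse a DNA sequence and swap each base for its complement."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


knowledge = KnowledgeServer()
knowledge.register("normalise", normalise, ["sequence", "cleanup"])
knowledge.register("reverse_complement", reverse_complement, ["sequence"])

executor = ExecutionServer(knowledge)
print(knowledge.locate("sequence"))
print(executor.run(["normalise", "reverse_complement"], " atgc\n"))
```

New components are added by calling `register`, with no change to the execution code, which is the "clean API" property the slide points to.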
D-Net based Global Collaborative Real-Time Genome Annotation
[Figure: genome annotation pipeline, from high-throughput DNA sequencers to three levels of annotation.
Nucleotide-level annotation: identify organism, chromosomes, genes, tRNAs and rRNAs, non-translated RNAs, gene markers, regulatory regions, repetitive elements, segmental duplication, SNP variations, literature references (genscan, blast, grail, RepeatMasker, E-PCR; EMBL, NCBI, TIGR, SNP).
Protein-level annotation: identify proteins, classify into protein families, functional characterisation, homologues, motif search, domain, 3-D structure, secondary structure, fold prediction, literature references (blast, 3D-PSSM, InterPro, PFAM, SWISS-PROT, SMART, predator, DSC).
Process-level annotation: relate to pathway maps and ontologies, cell cycle, metabolism, biological process, cell death, embryogenesis, drugs, literature references (GO, CSNDB, AmiGO, GeneMaps, KEGG, GK, GenNav, virtual chip).]
15 DBs, 21 Applications
How It Works
Interactive Editor & Visualisation
Nucleotide Annotation Workflows
[Figure: download sequence from Reference Server; execute distributed annotation workflow (InterPro, SMART, KEGG, SWISS-PROT, EMBL, NCBI, TIGR, SNP, GO); save to Distributed Annotation Server]
➢ 500 Web access
➢ 1800 clicks
➢ 200 copy/paste
➢ 3 weeks work in 1 workflow and few second execution
eDiamond
Applications of SMF
Teleradiology and QC Training and VirtualMammo Differential Diagnosis “Find one like it” ?
Advanced CAD SMF-CAD workstation Epidemiology SMF computed breast density
Image guided interventions

Images Courtesy Derek Hill Guy’s Hospital
Image guided interventions (2)
Images Courtesy Guy’s Hospital
Surgical verification
Accuracy of surgical placement against plan
• Surgeon plans on X-ray or CT, uses database of prostheses
• Operation takes place using plan as guidance
• Post operative X-ray evaluated for accuracy of placement
• Data stored and used for short term assessment and long term evaluation studies

Courtesy of Ian Revie Depuy International
Summary
• UK e-Science projects emphasize data federation and integration as much as computation
• Metadata and ontologies key to higher level Grid services
• e-Science projects will produce a deluge of scientific data that will need to be annotated and curated in scientific data ‘digital libraries’ Databases in the Grid
[Figure axes: Data Complexity, Computational Complexity]
OGSA – DAI Project
• Key middleware project for UK Program - Total Budget £3M (CP £1.5M)
• Three Centres involved: - Edinburgh, Manchester and Newcastle
• Industrial partners: - IBM US, IBM Hursley and Oracle UK
➢ Goal is to develop high-quality data-centric middleware
OGSA – DAI Project

• Design Specification completed
– Papers for GGF WG on Database Access and Integration Services
• Alpha versions delivered:
– Distributed Query Service
– XML Database Interface
– Relational Database Interface
• Beta versions by April 2003
– Integrate with Globus GT3 release
e-Science and the Future of Scientific Research
‘e-Science will change the dynamic of the way science is undertaken.’ John Taylor, 2001
➢ Need to break down the barriers between the Victorian ‘bastions’ of science – biology, chemistry, physics, ….
➢ Develop ‘permeable’ structures that promote rather than hinder multidisciplinary collaboration
➢ Engage Computing Services and Libraries in developing a new e-Science support service on Campus
e-Science and Computer Science
• The lesson of the Web • The Semantic Grid – The myGrid project – The Discovery Net Project • Computer Science Research and the Grid Error 404: Page not found
‘If you want the Web to scale, You must allow the links to fail’
Wendy Hall after Tim Berners-Lee
➢ HTML as the ‘Fortran’ of Hypertext!
Semantic Web Metadata & Ontologies
• Metadata – computationally accessible data about the services
• Ontologies – the shared and common understanding of a domain
– A vocabulary of terms
– Definition of what those terms mean.
– A shared understanding for people and machines
– Usually organised into a taxonomy.
Reasoning in DAML+OIL
• Consistency – check if knowledge is meaningful
• Subsumption – structure knowledge, compute classification
• Equivalence – check if two classes denote same set of instances
• Instantiation – check if an individual is an instance of class C
• Retrieval – retrieve set of individuals that instantiate C
Computer Science Challenges from e-Science
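To make the DAML+OIL reasoning tasks listed above more concrete (subsumption, equivalence, instantiation, retrieval), here is a deliberately small Python sketch over a hand-written class hierarchy. It is only an analogy: real DAML+OIL reasoners work on description-logic class expressions, whereas this toy handles named classes with explicit subclass links. The class and individual names are hypothetical, and consistency checking is left out.

```python
from typing import Dict, Set

# Toy taxonomy: each class maps to its direct superclasses (hypothetical names).
SUBCLASS_OF: Dict[str, Set[str]] = {
    "Kinase": {"Enzyme"},
    "Enzyme": {"Protein"},
    "Protein": {"Molecule"},
    "Molecule": set(),
}

# Each individual is asserted to belong to one named class.
INSTANCE_OF: Dict[str, str] = {"p53": "Protein", "CDK2": "Kinase"}


def ancestors(cls: str) -> Set[str]:
    """All superclasses reachable from cls by following subclass links."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumes(general: str, specific: str) -> bool:
    """Subsumption: every instance of `specific` is also one of `general`."""
    return general == specific or general in ancestors(specific)


def equivalent(a: str, b: str) -> bool:
    """Equivalence: each class subsumes the other."""
    return subsumes(a, b) and subsumes(b, a)


def is_instance(individual: str, cls: str) -> bool:
    """Instantiation: the individual's asserted class falls under cls."""
    return subsumes(cls, INSTANCE_OF[individual])


def retrieve(cls: str) -> Set[str]:
    """Retrieval: all individuals that instantiate cls."""
    return {name for name in INSTANCE_OF if is_instance(name, cls)}


print(subsumes("Molecule", "Kinase"))
print(is_instance("CDK2", "Enzyme"))
print(sorted(retrieve("Protein")))
```

Note that `CDK2` is retrieved as a `Protein` even though it was only asserted to be a `Kinase`; deriving such implicit facts is the point of reasoning over an ontology rather than a flat vocabulary.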
UK CS Team led by Tom Rodden identified 4 major research challenges arising from e-Science:
- Developing a Semantic Grid
- Trusted Ubiquitous Systems
- Rapid Customized Assembly of Services
- Autonomic Computing
Towards a Semantic Grid
• Trace provenance from initial data to information and knowledge structures
• Techniques to allow scalable reasoning over uncertain/incomplete knowledge
• Tools for design, development and deployment of large-scale ontologies
• Support for semantic-directed knowledge discovery to complement data-mining
• Development of flexible network-based reasoning and decision support services
Trusted Ubiquitous Systems

• New theories to model, specify and analyse trust in distributed ubiquitous systems
• New quality of service and service-based models for ubiquitous systems
• New design guidelines and practices to enable the development of reusable trusted components
• New understanding of the practical engineering trade-offs required to realise trusted ubiquitous systems
Rapid Customised Assembly of Services
• New theories to describe and reason about semantics and behaviour of services and compositional effects
• Agent and service representations that promote adaptability and emergent, opportunistic and implicit arrangement of services
• New tools to support the discovery, composition and use of services based on high-level description of requirements
• Techniques to support directed automatic composition, decomposition and recomposition of services
Autonomic Computing
• Techniques to analyze, describe and reason about adaptive systems
• Management of semi-autonomous systems with policies, services and software agents
• Interoperability and reasoning across and between different autonomous domains
• Modeling and measurement of performance of QoS for autonomic structures
• Techniques to capture and represent history, context and environment
IBM Autonomic Computing Vision
Self-Configuring: Adapt automatically to the dynamically changing environments
Self-Healing: Discover, diagnose, and react to disruptions
Self-Optimizing: Monitor and tune resources automatically
Self-Protecting: Anticipate, detect, identify, and protect against attacks from anywhere
A Business Case for the Grid
• Total Cost of Ownership – TCO
• Value of Open Standards
• Industrial Applications
• Time to exploitation
• e-Utilities
Current IT Environment
Distributed, Heterogeneous, Complex
[Figure: Typical Financial Subsystem Configuration. A front-end for Web presence for financial services (presentation, business logic and gateway tiers) connects to back-end systems. Components shown include network security, Local Director, Netscape Enterprise Server (HTTP), WebSphere application servers, security and profile database servers, JDBC, SNA and MQ gateways, an MQ hub, DB2, CICS, TPF, IMS Sysplex data, data capture and logging.]
Current IT Environment
Distributed, Heterogeneous, Complex
[Figure: the same Typical Financial Subsystem Configuration, now with zSeries labels and the annotations ‘Complexity, TCO’ and ‘Tech. Cost, Utilization’.]
Server / Storage Utilization
              Peak-hour     Prime-shift   24-hour Period
              Utilization   Utilization   Utilization
Mainframes    85-100%       70%           60%
UNIX          50-70%        10-15%        <10%
Intel-based   30%           5-10%         2-5%
Storage       N/A           N/A           52%

Source: IBM Scorpion White Paper: Simplifying the Corporate IT Infrastructure, 2000
Total Cost of Ownership: TCO
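The 24-hour utilization column of the IBM table above translates directly into idle capacity, which is part of the TCO argument that follows. A small Python calculation makes the point. One assumption is made for illustration: the entries "<10%" and "2-5%" are replaced by their upper bounds, so the idle figures for UNIX and Intel-based servers are lower bounds.

```python
# 24-hour utilization from the table above. For UNIX ("<10%") and
# Intel-based ("2-5%") the upper bound is used, which is an assumption.
UTILIZATION_24H = {
    "Mainframes": 0.60,
    "UNIX": 0.10,
    "Intel-based": 0.05,
    "Storage": 0.52,
}


def idle_fraction(utilization: float) -> float:
    """Share of capacity left unused at the given utilization."""
    return 1.0 - utilization


for platform, used in UTILIZATION_24H.items():
    print(f"{platform:<12} used {used:>4.0%}  idle {idle_fraction(used):>4.0%}")
```

On these figures, distributed UNIX and Intel-based servers sit idle for roughly nine-tenths of the day or more, which is the capacity that workload consolidation aims to recover.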
Much More than Hardware and Software Costs
IT Budgets:
Hardware      10%
Software      12%
Personnel     16%
Maintenance   30%
Integration   32%
Grid Computing Sales Pitch
[Figure labels: Storage, I/O, Operating System, Processing, Applications, Data]

Distributed Computing Over a Network, Using Open Standards to Enable Heterogeneous Operations
Grid Technology Enables
Increased Server Utilization
– Workload Management and Consolidation
– Reduced Cycle Times

Collaboration and Access to Data
– Federation of Data
– Global Distribution

Resilient/Highly Available Infrastructure
– Business Continuity
– Recovery and Failover
Supporting Heterogeneous Resources Through Open Standards….
Increased Server Utilization
• Exploit distributed resources to provide capacity for high-demand applications
– Existing applications that cannot be run effectively on a single processor
– New large scale applications that provide strategic business advantages
• Reduce infrastructure cost associated with over-provisioned resources
– Balance workload based on policies
– Optimize for cost or throughput
• Reduce the cost of manpower to manage and configure resources
– Fewer resources to manage for the same workload
Collaboration and Access to Data
• Enable collaboration across applications to integrate results
– Leverage Distributed Data and Resources
• Support large multi-disciplinary collaborations
– Link Business Processes
– Federation of Data
• Both within a single organization and between partners
– Exploit Replication Services Across Enterprises
[Figure labels: Design, Analytics, Pricing, Simulation]
The Value of Open Standards
Distributed Computing: Grid (Globus -> OGSA)
Applications: Web Services (SOAP, WSDL, UDDI)
Operating System: Linux
Information: World-wide Web (HTML, HTTP, J2EE, XML)
Communications: e-mail (POP3, SMTP, MIME)
Networking: The Internet (TCP/IP)
Sun and the Grid:
‘Grid Computing is one of the three next big things for Sun and our customers’ Ed Zander, COO
Microsoft and the Grid:
‘The alignment of OGSA with XML Web services is important because it will make Internet-scale, distributed Grid Computing possible’ Robert Wahbe, General Manager of Web Services
Industry Applications
Unique by Industry with Common Characteristics
Manufacturing Financial LS/ Services Product Gov’t & Design Bioinformatics Education Energy Derivatives Telco & Analysis Process Simulation Cancer Media Collaborative Seismic Research Research Analysis Statistical Analysis Finite Element Drug Bandwidth Consumption Weather Reservoir Analysis Discovery Analysis Analysis Portfolio Risk Protein Digital Analysis Failure HPC Analysis Folding Rendering Batch Protein Multiplayer Throughput Sequencing Gaming
Grid Infrastructure
Primary Focus Globalization Grid: Butterfly.net
Unlimited Numbers of Players
Distributed Artificial Intelligence
Multiple Concurrent Players
1,000 downloads of developer’s kit per week
Hot-swappable Components
Developers, Publishers, ESPs
HP, the Grid and e-Utilities
The Grid fabric for e-Utilities will be:
• Soft – malleable, multi-purpose
• Dynamic – resources will be constantly changing
• Federated – global structure not owned by any single authority
• Heterogeneous – from supercomputer clusters to PCs
John Manley, HP Labs
Timescales for Exploitation?
• IBM see ‘early adopters’ of Grid technology coming from pharmaceutical, engineering and petrochemical sectors
➢ UK program confirms this picture (AstraZeneca, GSK, Merck, Pfizer, Rolls-Royce, BAE Systems, Schlumberger)
• IBM see Grid middleware being adopted by more mainstream commerce and industry in 2003/2004 timeframe
Status of the Grid
• Today - ‘early adoption’ phase - just like the Web in the early days
– Industry now selling ‘IntraGrid’ solutions
– Genuine Virtual Organisation ‘InterGrid’ middleware not yet mature
• ‘Tomorrow’ - sophisticated combinations of services to locate information, applications to process it, and computer systems to run them
➢ Autonomic Middleware infrastructure capable of supporting Virtual Organisations, c-Commerce and e-Utilities will take time!
e-Government and the Grid
‘[The Grid] intends to make access to computing power, scientific data repositories and experimental facilities as easy as the Web makes access to information.’ Tony Blair, 2002 Acknowledgements
With thanks to: Gerd Breiter, Phillipe Bricard, David Boyd, Jens Jensen, Daron Green, Mike Brady, Derek Hill, Carole Goble, Yike Guo, Jeremy Frey, Bill Johnston, Ray Browne, Jim Fleming, Anne Trefethen and many others