Commercial and Near-commercial Use of Grid in the US: The Good, the Bad and the Ugly

Carl Kesselman
Director, Center of Grid Technologies
University of Southern California

Grid is Enterprise Virtualization

Infrastructure for enabling resource sharing and collaboration across distributed virtual organizations

Enterprise Virtualization

Server Virtualization, Application Virtualization, Storage Virtualization, File Virtualization, Desktop Virtualization

Grid Adoption Lifecycle

Level 6: Inter-Enterprise Grid
Level 5: Enterprise Grid
Level 4: Linked Clusters / Peer-to-Peer Grid
Level 3: Remote Use of Isolated Clusters
Level 2: Isolated Clusters / Grid Silos
Level 1: Single Cluster / Single Application

Commercial Grid Adoption Statistics

• Gartner defines Grid adoption as Early Mainstream
• 5% – 20% of target audience has adopted
• Currently 14% of CIOs rate interest level as high to very high

Grid focus areas:
• Non-dedicated distributed computing
• Highly scalable
• Application centric, metadata later, simplicity focus
• Cost savings / ROI value proposition
• Innovation, differentiation value proposition

Grid Infrastructure Value Progression

[Figure: value progression from Tactical Solution to Strategic Asset, with a "current market transition" marker. Stage goals: accelerate application performance; coordinate data and computation; increase utilization; improve control, increase isolation/stability; improve coordination, enable sharing, service delivery, optimize enterprise-wide. Layers shown per stage: single application (four stages) or multiple applications (final stage), Compute, Data, Virtual machine, Integrated IT islands.]

Copyright © Univa 2006

Grid Value Proposition

[Diagram center: Commercial Grid Computing]

For End Users
• Gain access to increased compute capacity
• Reduce costs
• Get to market faster

For Software Vendors
• Gain access to new markets / customers
• Ensure your applications perform as designed
• Maximize customer satisfaction and retention

For IT Integrators
• Gain a competitive edge with a differentiated offering
• Maximize customer satisfaction and retention

For Service Providers
• Guarantee a high-quality end user experience
• Deliver fast, consistent application response on time, every time
• Reduce operating costs and improve infrastructure utilization

Univa UD Proprietary & Confidential

HPC Lifecycle Management

[Figure: HPC Lifecycle Management cycle. Stage labels: Application Selection, Test, Selection & Purchase, Configure, Deploy, Use, Upgrade, Grow, Add, Support, Migrate, Retire.]

Maximize use of single cluster, e.g. The UniCluster 3.2 Stack

• Integrate best-of-breed open source technologies into a full-featured, mature cluster software stack
• Robust, scalable, open source, high-value
• Multiple uses aligned with business goals
• Application access for remote users

[Figure: UniCluster 3.2 stack, top to bottom]
• UniCluster Monitor Console, Management Service, Bootstrap Service
• Globus: WS-GRAM, GridFTP, RFT, MyProxy/Auto-CA, GSI-OpenSSH
• SGE: sge_qmaster, sge_schedd, sge_execd, ARCO
• Ganglia: gmond, gmetad, RRD
• Operating systems: RHEL 4 and 5, SUSE 9 and 10, CentOS 4 and 5, including x86_64; Windows 2000 or XP (Monitor Console only)
• Hardware: x86, x86_64

Application Domains

Data Analytics and Data Mining Potential Growth Applications

Health Sciences: medical imaging, bio-informatics
Manufacturing: EDA, fluid dynamics, crash test simulations
Financial Services: risk/portfolio analysis, Monte Carlo simulations
Media: digital content creation, animation
Energy: reservoir simulations, seismic processing

Cluster Workloads

• Parallel applications
  • Faster time to completion for single calculation
  • Generally requires application modification
  • Examples: solving differential equations for mechanical design, aeronautics, etc.
• Coarse-grain “High-throughput”
  • Complete as many independent calculations as possible
  • Orders of magnitude increase in exploration of problem space
  • Often accomplished via scripting
  • Examples: Monte Carlo algorithms, parameter space exploration, optimization
• Interactive
  • Single node, start “immediately”
  • Offload work from desktop
• Any of these may be driven from command line, portals, or existing application tools (e.g. Matlab)

Enterprise Grids

• Desktop
• Desktop and Cluster
• Multiple Clusters

Johnson & Johnson

• Goal
  • Provide complete infrastructure suite for central IT managed Grid and HPC services
• Challenges
  • Leverage heterogeneous environment including Clusters, Blades, and high performance workstations
  • Provide solution for meta-scheduling to remove system and application management overhead for researchers so they can focus on analysis
• Solution Highlights
  • Grid Ready Cluster environment consisting of Linux clusters and Windows workstations
  • Fully implemented ‘Grid as Service’
  • Successful solution for accelerating a multitude of critical production applications
• Today
  • Successful deployments across Europe and US (WAN)
  • Multi-site integration work complete
  • Grid and HPC now a managed, central service

“It’s just a better way to invest your money... We can have a single tool that can both do a virtual cluster and also take advantage of CPU harvesting off of the existing equipment that we already have. That was one of the big reasons we picked [Univa UD] over other ways of providing HPC capabilities.”
– Jeff Mathers, Pharma R&D, Johnson & Johnson

Corus

• Goal
  – Optimize its scientific computing infrastructure
  – Enable better / faster engineering
• Challenges
  – Time crunch driven by project to simulate replacing crash barriers along UK highways
  – Required a significant increase in processing capabilities
  – Fully integrated solution had to be in production in 30 days
• Solution Highlights
  – Installed new 48 CPU cluster to increase throughput
  – Integrated LS-Dyna, Nastran, PamStamp, Radioss, Abaqus and ST-ORM
  – Installation began on Jan 15th, 20 engineers were trained and running production jobs by Feb. 15th
• Today
  – Univa UD used for core production jobs across compute center

“The implementation of GX Synergy was, in hindsight, the only solution that gave us both an optimization of the resources of the center as well as an immediate mastery of the tool by our engineers. The simplicity and lightness of the solution enabled us to install it in one week, without having to stop our normal work. We were able to be ready on time and to meet the challenge of the customer work we needed to undertake.”
– Mike Twelves, Manager Knowledge Systems, Corus Automotive Engineering

Novartis

• Goal
  – Accelerate in silico research, value creation
  – Cost containment
• Challenges
  – CPU-constrained in-silico research capability
  – Integration with existing interfaces
  – Security of intellectual property
• Solution Highlights
  – Seamless integration with existing portal interface
  – Passed all end-to-end security tests
• Today
  – To be extended to tens of thousands of nodes
  – Multiple production applications including drug discovery, clinical analysis, and sales & marketing

“The reported work clearly shows that large database docking in conjunction with appropriate scoring and filtering processes can be useful in medicinal chemistry. This approach has reached a maturation stage where it can start contributing to the lead finding process. At the time of this study, nearly one month was necessary to complete such a docking experiment in our laboratory settings. The Grid computing architecture recently developed by [Univa UD] allows us to now perform the same task in less than five working days using the power of hundreds of desktop PC’s. High-throughput docking has therefore acquired the status of a routine screening technique.”
– Journal of Medicinal Chemistry

2007 Univa UD Confidential

Procter & Gamble

• Goal
  • Enhance in-house high performance compute capabilities by taking advantage of underutilized workstations
  • Run a Finite Element Analysis application (Abaqus) on the Grid
  • Bottle and package design
• Challenges
  • Competitive pilot vs. Axcellion and Platform (incumbent supplier)
  • Grid MP grid needed to interface with LSF
  • User community is used to the Platform LSF interface
  • Company did not want to disrupt user community by introducing a new interface
  • Company concerned about: security, unobtrusiveness, scalability
• Solution Highlights
  • Univa UD successfully won competitive pilot
  • Passed all major end user tests and concerns
  • Integration with LSF for Abaqus jobs completed in under a week
  • A grid of 200 high-end workstations / desktops is running Abaqus jobs during off-peak hours
  • Design of Experiments, Monte Carlo-based approach
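The Design of Experiments, Monte Carlo-based approach noted above is an instance of the coarse-grain high-throughput pattern: many independent runs with different inputs, farmed out to whatever nodes are free. The slides do not show how jobs were split or submitted, so the following is only a minimal local sketch of that pattern. The function names (`run_trial`, `run_sweep`, `summarize`) are hypothetical, a thread pool stands in for grid nodes, and a toy estimate of pi stands in for an Abaqus run.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def run_trial(seed: int, samples: int = 20_000) -> float:
    """One independent work unit: a toy Monte Carlo estimate of pi.

    Hypothetical stand-in for a single simulation run; each unit gets its
    own seeded generator, so units share no state and can run anywhere.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples


def run_sweep(seeds: Iterable[int], workers: int = 4) -> List[float]:
    """Dispatch the independent units to a pool and collect results in order.

    Assumption: the thread pool is only a local placeholder for a scheduler
    that would hand each unit to a separate grid node.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_trial, seeds))


def summarize(results: List[float]) -> float:
    """Aggregate step: average the per-unit estimates."""
    return sum(results) / len(results)


if __name__ == "__main__":
    estimates = run_sweep(range(16))
    print(f"{len(estimates)} units, mean estimate {summarize(estimates):.4f}")
```

Because each unit depends only on its seed, the sweep gives the same results regardless of how many workers run it or in what order they finish, which is the property that makes this kind of workload a natural fit for non-dedicated, off-peak machines.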

Children’s Memorial Hospital

• Goal
  – Analyze patient data and medical research literature to differentiate pediatric brain tumor types while gaining unique insights into tumor classification and treatment
• Challenges
  – Too costly and logistically difficult to add compute power to perform necessary analytic work
• Solution Highlights
  – Uses data mining technology, Clementine®, to analyze and classify pediatric brain tumor types
  – Employs LexiQuest Mine™ to discover previously overlooked relationships contained in literature
• Today
  – Routinely able to process 124,000 medical abstracts in less than 1.5 hours
  – Previously the analysis required between 20 and 24 hours, making ad-hoc and what-if queries unfeasible

“Leveraging SPSS predictive analytics and [Univa UD]’s expertise in grid computing, we’ve developed an integrated technology system that can efficiently extract and organize gene relationships from full text articles. We can also correlate this insight with both past and ongoing research on effective pediatric cancer treatments.”
– Dr. Eric Bremer, Director of the Brain Tumor Research Program at Children’s Memorial Research Center
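The before and after figures on this slide can be turned into a speedup and a throughput with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper names (`speedup`, `throughput`) are illustrative, and because the slide says "less than 1.5 hours", 1.5 is treated as an upper bound, so the results are lower bounds.

```python
def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of old to new elapsed time for the same workload."""
    return before_hours / after_hours


def throughput(items: int, hours: float) -> float:
    """Items processed per hour."""
    return items / hours


# Figures quoted on the slide.
ABSTRACTS = 124_000
BEFORE_HOURS = (20.0, 24.0)  # "between 20 and 24 hours"
AFTER_HOURS = 1.5            # "less than 1.5 hours", used as an upper bound

if __name__ == "__main__":
    low, high = (speedup(b, AFTER_HOURS) for b in BEFORE_HOURS)
    print(f"speedup: at least {low:.1f}x to {high:.1f}x")
    rate = throughput(ABSTRACTS, AFTER_HOURS)
    print(f"throughput: at least {rate:,.0f} abstracts/hour")
```

Under these assumptions the grid run is at least roughly 13 to 16 times faster, which is the difference between an overnight batch and a turnaround short enough for ad-hoc and what-if queries.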

GlaxoSmithKline

• Goal
  – Replace internally developed Grid technology with commercially available solution
• Challenges
  – Very knowledgeable about Grid computing having built VCS in-house
  – Existing Platform Computing customer
  – Concerned with integration in their current IT infrastructure, application migration and process standards
• Solution Highlights
  – Rapid initial migration of existing applications
  – No performance issues from non-dedicated nodes
  – Passed all end-to-end security tests
• Today
  – Grid solution is now a production IT service
  – Multiple applications running in production across many functions and departments

“The Grid MP platform keeps track of all the data related to our job runs – where the job was executed, what type of machine, how long it took. So not only does the grid save us time, but in automating this function it allows us to define a validated process for job execution. That goes a long way toward achieving FDA compliance.”
– Mark Sale, Global Director of Research Modeling and Simulation, GlaxoSmithKline

American Diabetes Association & Kaiser Permanente

• Goal
  – Accelerate Archimedes processing dramatically over standalone or server based processing
• Challenges
  – Enabling a Smalltalk application (Archimedes) to run on Grid MP
  – Getting single run times to under 10 minutes
• Solution Highlights
  – On-demand processing was the initial best fit for the ADA’s peak-driven workloads
  – Fully functioning simulation portal
• Today
  – ADA utilized an internal grid of more than 1000 nodes
  – Diabetes PHD in full production and publicly accessible

“…[Grid MP] will revolutionize not only how we do our work, but the accuracy of the decisions people make about the management of diseases. Normally, answering a single question on a PC requires 24 to 48 hours. [Univa UD] is helping us reduce the computing time to minutes.”
– Dr. David Eddy, Sr. Advisor for Health Policy and Management, Kaiser Permanente

Sanofi-Aventis

• Goal
  • Screen entire 1M compound library
  • Invest in latest technology – skip over Clusters to Grids
• Challenges
  • Too costly to expand existing HPC systems (SGI)
  • Plans in place to phase out HPC systems across company
• Solution Highlights
  • Joint effort in partnership with Accelrys to deliver integrated LigandFit solution
  • Within 2 months began routinely screening 300k compounds across multiple locations in France
• Today
  • Plan to expand to thousands of nodes
  • 1M+ library routinely screened

“Ease of deployment and the availability of applications were strong selling points for us. We also needed proven scalability and security and knew from [Univa UD]’s other enterprise deployments and their work with their public Grid that massive scaling and security capabilities were already proven.”
– Olivier Gien, Head of Discovery IT, Sanofi-Aventis

Toyota

• Goal
  – Build a model of all production parts required to build a series of automobiles based on current orders on hand
• Challenges
  – Have to determine the optimal production schedule to ensure that factories are loaded optimally and that all parts are on hand (JIT)
• Solution Highlights
  – Integrated with the existing z/OS Mainframe to minimize current user and workflow changes
• Today
  – Improved utilization of hardware, manufacturing scheduling, inventory and personnel
  – Results returned in minutes, not hours

[Figure labels: Orders; Part 1, Part 2, Part 3 (repeated); Grid MP Services; Device Group Y; Device Group Z]

Grid is gaining traction across Life Insurance and P&C

For underwriting and process, leading carriers are investing in technology to improve risk selection, competitive pricing, and applying automated and consistent underwriting rules.

Deep Computing, Life (Enterprise PC Grids):
• Mortality
• Population Analysis
• Stochastic-based risk modeling
• Principle-based reserving (PBR)
• Enterprise Risk Management (ERM)

Deep Computing, P&C (Enterprise PC Grids):
• Pricing
• Underwriter’s Scorecard
• Catastrophe modeling
• Claims analytics (specifically fraud)
• Analysis of remote data (telematics)

Particularly in the Life and Annuities space, actuaries are working with IT to change the way they model risks and maintain reserving in parallel with equity market fluctuations.

As technology helps reduce costs, freed up budget and resources from maintenance and outsourcing are being used to make strategic technology investments across all areas of insurance.

Maximizing Value from Multiple Clusters

Increased collaboration
Greater aggregate capacity, via peer to peer
Simplified access to distributed, file-based data

[Figure: layered stack. Grid layer: Job Scheduler Virtualization, File Virtualization, Federated Security, Federated Monitoring, Remote Access. Cluster layer: Job Scheduler, File System, Security, Monitoring. Base: Operating System.]

Semiconductor Design

[Figure labels: Design File, RTL, env. scripts, Test Vectors, Log Files, Log File Analysis, Defect Map]

Scale: 10,000’s of test vectors and result log files

Maximizing Value From Shared Utility

“Overflow” compute capacity for multiple applications
Management of all resource types (compute, data, network)
Higher overall service quality with lower administration

[Figure: layered stack with a SHARED UTILITY. Grid layer: Job Scheduler Virtualization, File Virtualization, Federated Security, Federated Monitoring, Remote Access. Cluster layer: Job Scheduler, File System, Security, Monitoring. Base: Operating System.]

The Commercial Grid Ecosystem

[Figure labels: Event Management (Monitors / Actions, Event Detection, License Control); Reporting & Analytics (Capacity, Utilization, Forecasting, Chargeback); Submission, Monitoring, Management, Configuration; Enterprise PC Grids; Cluster Management; Utility / Overflow; Cluster Configuration and Mgmt. interfaces; App Mgmt. & Job Submission/Mgmt. interfaces; Common Application Integration and Bundles; Firewall]

Enterprise PC Grids
• Create Enterprise PC Grids
• Create Internet Grids
• Highly secure
• Over 100+ enterprise deployments
• P2P job forwarding for Grid MP & Cluster
• DataCatalyst

Cluster Management (Open Source)
• Easy install & configure
• Job scheduling
• Monitoring
• Cluster deployment
• Remote access / staging
• P2P job forwarding

• Integrated Insight
• Integrated Response
• System Management
• License Management
• DataCatalyst - Pro
• Insight as a service (fee)
• Virtual Machine Plug-ins

Inter-enterprise Grids

• Resource Outsourcing
• Cloud computing
• Service Value Networks
• Outside in (J.S. Brown)
• True Federated environments
• E.g. Healthcare

Globus MEDICUS

• Medical Imaging and Computing for Unified Information Sharing (MEDICUS)
• Uses the Open Grid Services Architecture (OGSA) standard for Healthcare and Clinical Research
• Vertical integration of existing robust Grid technology
• Addresses Medical Imaging
  • DICOM image sharing within Grids*
  • DICOM image processing (WS)
  • DICOM image archiving/management (Grid PACS)**

*PACS and Imaging Informatics, SPIE Medical Imaging, 6145-32, 2006
**Int Journal of Computer Assisted Radiology and Surgery, 2006, 1:87-105; p100-104, Springer, Heidelberg

Globus MEDICUS Proto-Project @ http://dev.globus.org/wiki/Incubator/MEDICUS

Global Patient Record

MEDICUS Use Cases: Children’s Oncology Group and Neuroblastoma Cancer Foundation Grids

Open Source Communities

Summary

• Increasing traction of Grid in commercial sector
• Need to give time for maturity lifecycle
• Need to look at entire cap-ex/op-ex lifecycle
• Many interesting new areas of opportunity and greenfield for Grid technology
• Grid’s place in infrastructure ecosystem
• Clouds don’t replace Grids, VM management doesn’t replace grids