On Enabling KEPLER to Support the Complete Scientific Process
(Kepler Scientific Process STUDIO?)

Ilkay Altintas
Lead, Scientific Workflow Automation Technologies Laboratory, SDSC
Project Manager, Kepler Scientific Workflow Project

San Diego Supercomputer Center, UCSD

SAN DIEGO SUPERCOMPUTER CENTER

UCSD Ilkay Altintas

SDSC is a leader in Cyberinfrastructure

• SDSC
– 20-year-old NSF-supported center
– Leader: NSF-funded TeraGrid and “Core” Programs
– Access to the highest-end computing and data resources at no cost to academics and non-profit institutions
– ~400 researchers, staff and students (40+ Principal Investigators)
– Leading NSF efforts: NEES, GEON, Protein Data Bank, CAIDA, BIRN, GAMESS, EOL, …

SDSC = HEC + DATA

• SDSC’s Mission:
– Developing and Using State of the Art Technology to Advance Science and Engineering

Changing Needs for Scientific Process

Traditional Scientific Process (before computers)

Observe -> Hypothesize -> Conduct experiment -> Analyze data -> Compare results and Conclude -> Predict

Yesterday…at least for some of us!

What’s different in today’s science?

• Observing / Data: Microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies…
• Analysis, Prediction / Models and model execution: Potentially large computation and visualization

Today’s scientific process (the traditional steps, marked with "+" additions on the slide):

Observe -> Hypothesize -> Conduct experiment -> Analyze data -> Compare results and Conclude -> Predict

More to add to this picture: network, Grid, portals, +++

Mission of Scientific Workflow Systems

• Tools to combine different CI technologies
• My view until yesterday
– Promote “scientific discovery” by providing tools and methods to generate scientific workflows
– Create a generic customizable graphical user interface for scientists from different scientific domains
– Support computational experiment creation, execution, sharing, reuse and …
– Design frameworks which define efficient ways to connect to the existing data and integrate heterogeneous data from multiple resources
– Bring CI onto the user’s monitor!!!

Kepler is a Scientific Workflow System

www.kepler-project.org

• … and collaboration: (cross-)x(project+institution+national)
• Latest alpha release: out last week!

• Builds upon the open-source Ptolemy II framework

Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflows
KEPLER = “Ptolemy II + X” for Scientific Workflows

Kepler is a Team Effort

Contributing projects (logos on the slide): Ptolemy II, Griddles, SKIDL, Resurgence, SRB, Cypres, NLADR, LOOKING

New contributor:
- Chesire (UK Text Mining Center)

Contributor names and funding info are at the Kepler website!!

Actors are the Processing Components

• Actor
– Encapsulation of parameterized actions
– Interface defined by ports and parameters
• Port
– Communication between input and output data
– Without call-return semantics
• Model of computation
– Communication semantics among ports
– Flow of control
– Implementation is a framework
• Examples of Actor-Oriented Design
– Simulink (The MathWorks)
– LabVIEW (from National Instruments)
– Easy 5x (from Boeing)
– ROOM (Real-time object-oriented modeling)
– ADL (Wright)
– …
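The actor and port ideas above can be illustrated with a small sketch. This is not the Ptolemy II or Kepler API: the `Port` and `Scale` classes, their method names, and the `factor` parameter are all hypothetical and chosen only for illustration. It shows an actor as a parameterized action whose only interface is its ports and parameters, and ports that pass tokens one way without returning anything to the sender.

```python
from collections import deque


class Port:
    """Hypothetical port: output side pushes tokens, input side queues them."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()   # tokens waiting to be consumed (input side)
        self.receivers = []    # downstream input ports (output side)

    def connect(self, other):
        self.receivers.append(other)

    def send(self, token):
        # Fire-and-forget: nothing is returned to the sender (no call-return).
        for port in self.receivers:
            port.queue.append(token)

    def has_token(self):
        return bool(self.queue)

    def get(self):
        return self.queue.popleft()


class Scale:
    """Hypothetical actor: multiplies each input token by a parameter."""

    def __init__(self, factor):
        self.factor = factor            # parameter
        self.input = Port("input")      # ports define the interface
        self.output = Port("output")

    def fire(self):
        # Consume one token if available and emit the scaled value.
        if self.input.has_token():
            self.output.send(self.input.get() * self.factor)


# Link two actors via their ports and collect the result in a sink port.
double = Scale(factor=2)
triple = Scale(factor=3)
sink = Port("sink")
double.output.connect(triple.input)
triple.output.connect(sink)

double.input.queue.append(5)
double.fire()
triple.fire()
result = list(sink.queue)
print(result)
```

Note that the actors here are fired by hand. Deciding when and in what order actors fire is exactly what the sketch leaves out, and in the actor-oriented view that decision belongs to the model of computation rather than to the actors.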

Directors are the WF Engines that…

• Implement different computational models
• Define the semantics of
– execution of actors and workflows
– interactions between actors
• Ptolemy and Kepler are unique in combining different execution models in heterogeneous models!
• Kepler is extending Ptolemy directors with specialized ones for web service based workflows and distributed workflows.

Execution models:
• Dataflow
• Process Networks
• Time Triggered
• Rendezvous
• Synchronous/reactive model
• Publish and Subscribe
• Discrete Event
• Continuous Time
• Wireless
• Finite State Machines
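To make the separation between actors and directors concrete, here is a minimal sketch in which the same pipeline is executed under two different execution policies. It is a simplification under stated assumptions, not Ptolemy II or Kepler code: `Actor`, `ScheduledDirector`, `DataDrivenDirector`, and `build` are hypothetical names, and the two policies only loosely mimic a statically scheduled dataflow model and a data-driven one.

```python
from collections import deque


class Actor:
    """Hypothetical actor: applies `func` to one input token per firing."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inbox = deque()
        self.downstream = []   # actors that receive this actor's output

    def ready(self):
        return bool(self.inbox)

    def fire(self):
        token = self.func(self.inbox.popleft())
        for actor in self.downstream:
            actor.inbox.append(token)


class ScheduledDirector:
    """Fires actors in a fixed order, at most once each per iteration."""

    def run(self, schedule, iterations):
        trace = []
        for _ in range(iterations):
            for actor in schedule:
                if actor.ready():
                    actor.fire()
                    trace.append(actor.name)
        return trace


class DataDrivenDirector:
    """Keeps firing any actor that has input until no actor is ready."""

    def run(self, actors):
        trace = []
        progress = True
        while progress:
            progress = False
            for actor in actors:
                while actor.ready():
                    actor.fire()
                    trace.append(actor.name)
                    progress = True
        return trace


def build(results):
    """Build the pipeline add -> square -> sink with three queued tokens."""
    add = Actor("add", lambda x: x + 1)
    square = Actor("square", lambda x: x * x)
    sink = Actor("sink", results.append)   # sink stores tokens in `results`
    add.downstream.append(square)
    square.downstream.append(sink)
    add.inbox.extend([1, 2, 3])
    return [add, square, sink]


r1, r2 = [], []
trace1 = ScheduledDirector().run(build(r1), iterations=3)
trace2 = DataDrivenDirector().run(build(r2))
print(r1, r2)
print(trace1)
print(trace2)
```

Both runs produce the same results, but the firing order differs: the scheduled director interleaves the actors token by token, while the data-driven one drains each actor before moving on. The actors themselves are unchanged, which is the point of keeping execution semantics in the director.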

Vergil is the GUI for Kepler

Actor Search Data Search

• Actor ontology and semantic search for actors
• Search -> Drag and drop -> Link via ports
• Metadata-based search for datasets

Actor Search

• Kepler Actor Ontology
• Used in searching actors and creating conceptual views (= folders)

Currently 160 Kepler actors added!

Some actors in place for…

• Generic Web Service Client and Web Service Harvester
• Customizable RDBMS query and update
• Command Line wrapper tools
• Some Grid actors: Globus Job Runner, GridFTP-based file access, Proxy Certificate Generator
• SRB support
• Native R support
• Interaction with Nimrod and APST
• Communication with ORBs through actors and services
• Imaging, Gridding, Vis Support
• Textual and Graphical Output
• …more generic and domain-oriented actors…

Data Search and Usage of Results

• Kepler DataGrid
– Discovery of data resources through local and remote services: SRB, Grid and Web Services, DB connections
– Registry of datasets on the fly using workflows

Dataset Generation & Registration
A co-development in KEPLER: GEON

% Makefile
$> ant run

Labels on the workflow screenshot:
• SQL database access (JDBC)
• Matt, Chad, Dan et al. (SEEK)
• Ilkay (SDM)
• Efrat (GEON)
• Yang (Ptolemy)
• Xiaowen (SDM)
• Edward et al. (Ptolemy)

Coming soon in Kepler

• MORE INFRASTRUCTURE TO SUPPORT SCIENCE!
• Full support for distributed execution
• Plug-in Kepler archives and better versioning support
• Semantic and hybrid typed actors and workflow construction
• Portal support and registration of products
• Support for process and data provenance
• Standardization of data interfaces
• Integration with SCIRun and SDSC vis modules
• Documentation of generated products in addition to the existing manuals and documentation

How do all these relate to ORION?

OR… HOW DOES ORION RELATE TO SCIENTIFIC WORKFLOWS?

How can SWFs help?

In a few steps in the scientific process…

ORION: Ocean Research Interactive Observatory Networks

Observe -> Hypothesize -> Conduct experiment -> Analyze data -> Compare results and Conclude -> Predict

Additional Missions for Scientific Workflow Systems:
+1: To guide scientists in their observing steps with means to use and control instrument networks and observatories
+2: To enable scientific and statistical analysis and control of the data collected by observatories via customizable systems
+3: To allow simulations as testbeds for possible observatory networks

Observatory Networks are Complex Systems

CHAOS THEORY: Complexity exists at the “edge of chaos” somewhere between too much and too little order.

• Large number of interacting parts
• Difficult to understand behavior
• Hard to comprehend the structure that binds the components
• Nontrivial and complicated interaction between components
• Unpredictable emergent behavior

[Figure: Perrow’s Framework of Complexity. Axes: Interactively Simple to Interactively Complex; Loosely Coupled to Tightly Coupled; regions range from Low Complexity to High Complexity. Source: Communications of the ACM, May 2005, Adaptive Complex Enterprises]

Observatory Network Research is Cross-Disciplinary

Ocean + Computer + Engineering + Control

CONTROL: Find a finite number of parameters over which we can build decision algorithms to control the system.
• State: the condition a system or instrument is in at a given time. State estimation as a quality measure for data!
• Other parameters: variants, data quality, …

DATA Aspects -- Usage:
• for scientific analysis and archiving; meaning of data is important: ontologies to better tell what it is; realtime++
• for statistics over the quality of data; use it as a parameter in the control algorithm

Filtering Techniques to Estimate State

• KALMAN FILTERING:
– State estimation; Optimal filtering; State reconstruction; System identification
– Both for linear and non-linear systems
• To build fault tolerant systems:
– Systems background
– Signal processing
– Filtering techniques
• Some techniques can even be used as actors!

Generic control emerges as a new computation model for these types of complex systems!
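As an illustration of a filtering technique packaged as an actor, the following is a minimal sketch of a scalar (one-dimensional) linear Kalman filter. Everything specific here is an assumption made for the example: the class name `ScalarKalmanFilter`, the `fire` method, the random-walk state model, the noise variances, and the synthetic sensor readings. It is not taken from Kepler or from any ORION system.

```python
import random


class ScalarKalmanFilter:
    """1-D Kalman filter for a slowly varying state observed with noise.

    Assumed model: x_k = x_{k-1} + w (process noise variance q),
                   z_k = x_k + v     (measurement noise variance r).
    """

    def __init__(self, x0, p0, q, r):
        self.x = x0   # current state estimate
        self.p = p0   # variance (uncertainty) of the estimate
        self.q = q    # process noise variance
        self.r = r    # measurement noise variance

    def fire(self, z):
        """Consume one measurement token and return the updated estimate."""
        # Predict: the state is carried over, its uncertainty grows by q.
        p_pred = self.p + self.q
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p_pred / (p_pred + self.r)
        self.x = self.x + gain * (z - self.x)
        self.p = (1.0 - gain) * p_pred
        return self.x


# Synthetic example: a sensor reading a constant value of 10.0 with noise.
random.seed(0)
true_value = 10.0
measurements = [true_value + random.gauss(0.0, 1.0) for _ in range(200)]

kf = ScalarKalmanFilter(x0=0.0, p0=100.0, q=1e-5, r=1.0)
estimates = [kf.fire(z) for z in measurements]
print(estimates[-1], kf.p)
```

The estimate variance `kf.p` is the kind of quantity that could serve as a quality measure for the data, or as an input parameter to a control algorithm. Non-linear systems would need a variant such as the extended Kalman filter, which this sketch does not cover.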

To Sum Up

• Kepler can help at different stages of the scientific process
– It is proven to help in analysis steps including streaming applications.
– Biology, Ecology, Geology, Seismology, Chemistry, Astrophysics, …
• CI aspects of ORION include:
– data
– control
– testbed applications
• Need cross-disciplinary project teams and cross-committee members

CI and ORION Committees

• Multiple committees - separated
• Different CI parts - separated

• Cross-linked committees
• Cross-discipline CI projects

Questions…

Thanks!

Ilkay Altintas [email protected] +1 (858) 822-5453 http://www.sdsc.edu
