DOE Scientific Data Management Center – Scientific Process Automation

Introduction to Scientific Workflow Management and the Kepler System

Ilkay Altintas1, Roselyne Barreto4, Paul Breimyer5, Terence Critchlow2, Daniel Crawl1, Ayla Khan3, David Koop3, Scott Klasky4, Jeff Ligon5, Bertram Ludaescher6, Pierre Mouallem5, Meiyappan Nagappan5, Steve Parker3, Norbert Podhorszki6, Claudio Silva3, Mladen Vouk5

1. San Diego Supercomputer Center 2. Pacific Northwest National Laboratory 3. University of Utah 4. Oak Ridge National Laboratory 5. North Carolina State University 6. University of California at Davis

SC07/Kepler Tutorial/V8/Nov-07 1

General Architecture

[Architecture diagram: Analytics Computations; Local and/or Remote Communications (networks); Control Panel (Dashboard) & Display; Orchestration (Kepler); Data, Databases, Provenance, … Storage]


Overview

Welcome (Vouk)
Introduction and Motivation (Klasky)
Introducing Kepler – Hands-On #1 (Crawl)
Exploring Complex Workflows – Hands-On #2
Provenance in Scientific Workflows and Kepler – Hands-On #3 (Vouk, Crawl)
Dashboard – Hands-On #4 (Klasky, Vouk)
Individualized tutoring – Hands-On #5 (Crawl, Ligon)
Wrap-up (Vouk, Klasky)


Handouts

Hands-on account, access, and help
CD with Kepler and demos
Packages with any additional slides that may be used during the tutorial
A collection of relevant reports/papers/notes
Contact information


Link: http://sc07.supercomputing.org/schedule/event_detail.php?evid=11036

Introduction to Scientific Workflow Management and the Kepler System
Session: S07
Event Type: Tutorial
Time: 8:30am - 5:00pm
Presenter(s): Ilkay Altintas, Mladen Vouk, Scott Klasky, Norbert Podhorszki
Location: A7
Abstract: A scientific workflow is the process of combining data and processes into a configurable, structured set of steps that implement semi-automated computational solutions of a scientific problem. Scientific workflow systems provide graphical user interfaces to combine different technologies along with efficient methods for using them, and thus increase the efficiency of the scientists. This tutorial provides an introduction to scientific workflow construction and management (Part I) and includes a detailed hands-on session (Part II) using the Kepler system. It is intended for an audience with a computational science background. It will cover principles and foundations of scientific workflows, Kepler environment installation, workflow construction out of the available Kepler components, and workflow management that uses Kepler facilities to provide process and data monitoring and provenance information, as well as high-speed data movement solutions. This tutorial also incorporates hands-on exercises and application examples from different scientific disciplines.
Chair/Presenter Details: Ilkay Altintas, San Diego Supercomputer Center; Mladen Vouk, North Carolina State University; Scott Klasky, Oak Ridge National Laboratory; Norbert Podhorszki, University of California, Davis


Introduction and Motivation

Presenter: Scott Klasky


Topics

I. Brief Definition of Scientific Workflows, with Motivational Examples

II. User and Technical Requirements for Scientific Workflows, Existing Systems


DOE Scientific Data Management Center

Goals: integrate and deploy software-based solutions to efficiently and effectively manage large volumes of data generated by scientific applications

The SDM Center focuses on the application of known and emerging data management technologies to scientific applications.

http://sdm.lbl.gov/sdmcenter/

Funded under SciDAC


A Typical SDM Scenario

[Four-layer diagram:
Workflow (Task Flow) Tier: Task A: Generate Time-Steps → Task B: Move TS → Task C: Analyze TS → Task D: Visualize TS
Applications & Software Tools Layer: Simulation Program, Data Mover, Post Processing, Parallel Terascale Browser
I/O System Layer: HDF5 Libraries, Parallel NetCDF, PVFS, AIO, SRM
Storage & Network Resources Layer]


Scientific Process Automation (SPA)

SPA = the Scientific Workflow layer in the SDM Center 3-tier architecture. Part of the Kepler Scientific Workflow System and collaboration.

[SDM Center 3-layer stack:
Scientific Process Automation (SPA) Layer: Workflow Management Engine (Kepler), Scientific Workflow Components, Web Wrapping Tools
Data Mining & Analysis (DMA) Layer: Parallel R Statistical Analysis, Efficient Indexing (Bitmap Index), Data Mining and Feature Identification, Efficient Parallel Visualization (pVTK)
Storage Efficient Access (SEA) Layer: Storage Resource Manager (to HPSS), Parallel I/O (ROMIO/MPI-IO), Parallel Virtual File System, Parallel NetCDF
Hardware, OS, and MSS (HPSS)]

What are Scientific Workflows?

Workflow management systems help in the construction and automation of scientific problem-solving processes that include executable sequences of components and data flows. Scientific workflow systems often need to provide for load balancing, parallelism, and complex data-flow patterns between servers on distributed networks. They aim to solve complex scientific data integration, analysis, management, and visualization tasks. In plainer English: doing hard and/or messy stuff, and making it look easy.


Why use Scientific Workflows?

“It’s too hard to keep moving my data, launching analysis services, and visualizing my data.”
“If I don’t archive my data before the end of my run, I will lose data.”
“I want to analyze my data as soon as it’s generated.”
“I want to keep track of everything in the workflow (provenance) in case there is a problem.”
“I don’t want to learn 10 million technologies just to continue running on the large machines.”
“I write my own analysis routines, and I don’t want to learn new packages just to do my science.”


Two typical types of Workflows for SC

Real-time Monitoring (Server-Side Workflows):
Job submission. File movement. Launch analysis services. Launch visualization services. Launch automatic archiving.

Post Processing (Desktop Workflows):
Read in files from different locations. File movement. Launch analysis services. Launch visualization services. Connect to databases.

Obviously there are other types of workflows. Do you have different types?


A few days in the life of Sim Scientist. Day 1 – morning.

8:00AM Get coffee; check to see if the job is running. Ssh into jaguar.ccs.ornl.gov (job 1). Ssh into seaborg.nersc.gov (job 2) (this one is running, yea!). Run gnuplot to see if the run is going OK on seaborg. This looks OK.
9:00AM Look at data from an old run for processing. Legacy code (IDL, Matlab) to analyze most data. Visualize some of the data to see if there is anything interesting. Is my job running on jaguar? I submitted this 4K-processor job 2 days ago!
10:00AM scp some files from seaborg to my local cluster. Luckily I only have 10 files (which are only 1 GB/file).
10:30AM The first file appears on my local machine for analysis. Visualize data with Matlab. Seems to be OK. ☺
11:30AM See that the second file had trouble coming over. Scp the files over again… Dohhh.


A few days in the life of Sim Scientist. Day 1 – evening.

1:00PM Look at the output from the second file. Oops, I had a mistake in my input parameters. Ssh into seaborg, kill the job. Emacs the input, submit the job. Ssh into jaguar, see status. Cool, it’s running. bbcp 2 files over to my local machine (8 GB/file). Gnuplot the data. This looks OK too, but I still need to see more information.
1:30PM Files are on my cluster. Run on the hdf5 output files. Looks good. Write down some information in my notebook about the run. Visualize some of the data. All looks good. Go to meetings.
4:00PM Return from meetings. Ssh into jaguar. Run gnuplot. Still looks good. Ssh into seaborg. My job still isn’t running……
8:00PM Are my jobs running? Ssh into jaguar. Run gnuplot. Still looks good. Ssh into seaborg. Cool, my job is running. Run gnuplot. Looks good this time!


And Later

4:00AM Yawn… is my job on jaguar done? Ssh into jaguar. Cool, the job is finished. Start bbcp-ing files over to my work machine (2 TB of data).
8:00AM bbcp is having troubles. Resubmit some of my bbcp transfers from jaguar to my local cluster.
8:00AM (next day) Still need to get the rest of my 200 GB of data over to my machine.
3:00PM My data is finally here! Run Matlab. Run EnSight. Oops… something’s wrong! Where did that instability come from?
6:00PM Finish screaming!


And 2 years from now…

Simulations/computers are getting larger and more expensive to operate. In fusion, large runs will use >50K cores for 100 wallclock hours to understand turbulent transport in ITER-size reactors. The cost of such a simulation approaches $0.6M (power, cooling, and system cost averaged over 5 years).
Data sizes are getting larger. Large simulations produce 2 TB/simulation today, heading toward 100 TB/simulation (per week) in the future.
Demand for “real-time” monitoring/analysis of simulations.
Demand for fast, reliable data movement to local machines for post-processing.
Demand to keep data provenance in one location.
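A back-of-the-envelope check of the run cost quoted above. The slide only states the ~$0.6M total for a >50K-core, 100-wallclock-hour run; the per-core-hour rate below is an assumed value chosen for illustration, not a figure from the tutorial.

```python
# Rough cost sketch of one large fusion run (all rates are assumptions).
CORES = 50_000                 # >50K cores, per the slide
WALL_HOURS = 100               # 100 wallclock hours, per the slide
COST_PER_CORE_HOUR = 0.12      # assumed $/core-hour (power, cooling, hardware over 5 years)

core_hours = CORES * WALL_HOURS                  # total core-hours consumed
run_cost = core_hours * COST_PER_CORE_HOUR       # total dollars for the run

print(f"{core_hours:,} core-hours -> ${run_cost / 1e6:.2f}M")
```

At an assumed $0.12/core-hour, 5 million core-hours lands at the ~$0.6M the slide cites, which is why a wasted run ("I missed a minus sign") is so painful.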


Workflows to the rescue!

In our demo section you will see us automate this process:
Job submission starts services on the ORNL IB cluster (ewok).
Files are automatically moved from the Cray XT3 to the ORNL IB cluster.
Files are converted from binary to hdf5.
Files accumulate until they are >= 6 GB; then they are tarred.
hsi commands place the tar files into HPSS (a file describes which files are in which tar files).
Each hdf5 file is read into a SCIRun service which creates a jpeg.
Jpeg files are combined into an mpeg file via an mpeg service.
Jpeg and mpeg files are moved to the web portal.
Hdf5 files are archived to PPPL.
And of course we will keep track of the provenance of the workflow in a database! And we can monitor this on our dashboard.
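One step above is easy to express in code: accumulate converted files until they total at least 6 GB, then emit a tar batch for archiving to HPSS. This is an illustrative sketch of that batching policy only; the file names and the helper function are made up, and the real workflow performs the tarring and hsi calls itself.

```python
# Sketch: group files into >= 6 GB batches for tar + HPSS archiving.
def batch_for_archive(files, threshold=6 * 1024**3):
    """Group (name, size) pairs into batches whose total size reaches threshold.

    The final batch may be smaller; in the real workflow it would be
    flushed when the simulation run ends.
    """
    batches, current, total = [], [], 0
    for name, size in files:
        current.append(name)
        total += size
        if total >= threshold:        # batch is big enough: tar it
            batches.append(current)
            current, total = [], 0
    if current:                       # leftover files at end of run
        batches.append(current)
    return batches

GB = 1024**3
files = [(f"step{i:03d}.h5", 2 * GB) for i in range(7)]  # seven 2 GB files
print(batch_for_archive(files))      # three batches: 3 + 3 + 1 files
```

With seven 2 GB files, the first two batches close as soon as they cross 6 GB (three files each), and the seventh file is flushed as a final partial batch.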


Why do Pflop computing scientists care?

Typical situation for Sim Scientist: we run on 1 – 60K processors, producing lots of data. Typical method of work:
Prepare input data for smaller simulations. Iterate until we come up with the correct parameters for the large run.
Run the large simulation at only a handful of locations (usually <4).
Must archive results; archives on HPSS must be of the correct size.
Must move some data over to our local clusters for analysis after the simulation.
Did we make a mistake with the input parameters? Is something going wrong? Fix the code/input and start the run over again. Wow, I just wasted 100K CPU hours because I missed a minus sign. Duhh.
Where are all of my files? I want to look at the temperature in the 200th time slice; where is it on HPSS?


Post Processing Workflow: A day in the life of Sim Scientist

9:00AM Get Diet Coke; decide which runs/experimental data to analyze.
9:30AM Start to download files from HPSS at NERSC and ORNL.
10:00AM Move files from NERSC/ORNL to the local desktop machine. Smallish data (10 GB/location).
11:00AM Start IDL and compute various post-processing quantities.
11:30AM Look at the data from the simulations, and grab data from a database which has experimental data.
1:00PM Move some more data from ORNL to the local desktop to compare to more experimental data. Save the plot from Matlab to PostScript to include in a paper. Write down results in the notebook; copy the figure into the notebook.
2:00PM Think about results, and decide on new analysis routines to write in the future.
4:00PM Start moving more data from NERSC to the local desktop.


What’s changing in his life?

Collaboration: more clusters, more simulations.
Just analyze the data where we run; don’t move the data. But… what if the network goes down (I’m on a plane, …)? What if the resource is not available for my late-breaking analysis before the BIG conference?
OK, but what about the large data? Large data will be analyzed server-side, not on the DESKTOP. But we can run the workflow on a server.
Data from multiple resources: V&V = data from multiple simulations/experiments.
But can’t we just run VisIt/SCIRun? Yes. But when we need to orchestrate the data movement from different sources, track the provenance, and perhaps use multiple analysis/visualization packages, then a workflow system can help.


How do we help this scientist?

The workflow is the “glue” for the scientists.

The scientist hooks up all of the analysis routines.

The director makes sure that the data movement occurs, and that it is reliable and secure.

All of the tedious steps (ssh in, start this program, …) are removed by workflow automation.

The workflow will be able to keep the provenance information which allows the user to understand how they processed the dataset. This enables the scientist to compare new data with old data.


So what are the requirements?

Must be EASY to use. If you need a manual, then FORGET IT!
Good user support, and long-term DOE support. ☺
The workflow system should work for all of my workflows, NOT just for the Petascale computers. And on multiple platforms!
Must be easy to incorporate my own services into the workflow.
Must be customizable by the users. Users need to easily change the workflow to match the way they work.
Long-term requirements [NOT being worked on yet]: autonomics/user adaptivity; faster data movement in the workflow; a high-quality front end for end-user interaction. You tell us!


SWF Systems Requirements

Design tools, especially for non-expert users
Ease of use: fairly simple for the basic user, with more complex features hidden in the background
Reusable generic features
Generic enough to serve different communities, but specific enough to serve one domain (e.g., geosciences); customizable
Extensibility for the expert user
Registration, publication & provenance of data products and “process products” (= workflows)
Dynamic plug-in of data and processes from registries/repositories
Distributed WF execution (e.g., Web and Grid awareness)
WF deployment as a web site, as a service, as “Power apps”


The Big Picture: Supporting the Scientist

From “Napkin Drawings” … to Executable Workflows

[Figure: Conceptual SWF → Executable SWF]

Here: John Blondin, NC State Astrophysics Terascale Supernova Initiative SciDAC, DOE


CPES Fusion Simulation Workflow

Fusion simulation codes: (a) GTC; (b) XGC with M3D. E.g., (a) currently runs on 4,800 (soon: 9,600) nodes of a Cray XT3; 9.6 TB RAM; 1.5 TB simulation data/run.
GOAL: automate remote simulation job submission and continuous file movement to a secondary analysis cluster for dynamic visualization & simulation control, with runtime-configurable observables.

[Workflow diagram: Submit Simulation Job; Submit FileMover Job; Select JobMgr; Execution Log (=> data provenance)]

Overall architect (& prototypical user): Scott Klasky (ORNL). WF design & implementation: Norbert Podhorszki (UC Davis).

CPES Analysis Workflow

Concurrent analysis pipeline (@ Analysis Cluster): convert; analyze; copy to Web portal. Easy configuration and re-purposing.

[Screenshot annotations: reusable actor “class” vs. specialized actor “instances”; pipelined execution model; inline documentation; inline display; easy-to-edit parameter settings]

Overall architect (& prototypical user): Scott Klasky (ORNL). WF design & implementation: Norbert Podhorszki (UC Davis).

Dashboard integration with Kepler

The dashboard presents information created by the workflow. We have been developing a dashboard for Kepler workflows using Flash, PHP, and MySQL.


Machine Monitoring

• DOE Machine monitoring − Which machines are up? − Which machines have long queues, which are ? − Where can I run my job? − Where am I running jobs? − Where are my running jobs and can I look at my old runs? − Can I monitor a new job, and compare this to an old job?


Dashboards for Simulation Monitoring

Back end: shell scripts, Python scripts, and PHP
– Machine queue commands
– Users’ personal information
– Services to display and manipulate data before display
Dynamic front end:
– Machine monitoring: standard web technology + Ajax
– Simulation monitoring: Flash
Storage: MySQL (queue info, min-max data, users’ notes, …)


Scientific Workflow Systems

A combination of data management, integration, analysis, and visualization steps: a larger, automated “scientific process”.
Mission of scientific workflow systems:
– Promote “scientific discovery” by providing tools and methods to generate scientific workflows
– Provide an extensible and customizable graphical user interface for scientists from different scientific domains
– Support workflow design, execution, sharing, reuse, and provenance
– Design frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources
– Make technology useful through user’s !!!


Discussion

What types of workflow are you most interested in? Which technologies do you use in your workflows? How do you automate your workflows now?


Hands On # 1: Introduction to Kepler User Interface and Entities

Presenter: Daniel Crawl


Kepler is a Scientific Workflow System

http://www.kepler-project.org

Kepler is a cross-project collaboration. The latest release is available from the project website. It builds upon the open-source Ptolemy II framework.

Ptolemy II: a laboratory for investigating design.
Kepler: a problem-solving support environment for scientific workflow development, execution, and maintenance.

KEPLER = “Ptolemy II + X” for Scientific Workflows


Vergil is the GUI for Kepler… but Kepler can also run in batch mode as a command-line engine.

Actor Search: actor ontology and semantic search for actors. Search → drag and drop → link via ports.
Data Search: metadata-based search for datasets.


Actor-Oriented Modeling

Actors:
– a single component or task
– well-defined interface (signature)
– generally a passive entity: given input data, produces output data


Actor-Oriented Modeling

Ports:
– each actor has a set of input and output ports
– ports denote the actor’s signature
– ports produce/consume data (a.k.a. tokens)
– parameters are special “static” ports
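The actor/port/token vocabulary above can be made concrete with a tiny sketch. This is not Kepler or Ptolemy II code; the class and port names are invented for illustration. It wires a constant-producing actor to a display actor through a port, the way the HelloWorld demo does in Kepler.

```python
# Minimal actor-oriented model: actors consume/produce tokens via ports.
from collections import deque

class Port:
    """A FIFO channel carrying tokens between two actors."""
    def __init__(self):
        self.queue = deque()
    def send(self, token):
        self.queue.append(token)
    def get(self):
        return self.queue.popleft()

class Const:
    """Source actor: each firing emits its (parameter-like) value."""
    def __init__(self, value, output):
        self.value, self.output = value, output
    def fire(self):
        self.output.send(self.value)

class Display:
    """Sink actor: each firing consumes one token and records it."""
    def __init__(self, input):
        self.input, self.seen = input, []
    def fire(self):
        self.seen.append(self.input.get())

wire = Port()                              # Const.output -> Display.input
const, display = Const("Hello World", wire), Display(wire)
for _ in range(3):                         # fire each actor three times
    const.fire()
    display.fire()
print(display.seen)
```

The actors are passive, as the slide says: nothing happens until something fires them, which is exactly the role the director plays in Kepler.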


Opening and Running a Workflow

Start Kepler. Open “HelloWorld.xml” under the “demos/sc07” directory in your local Kepler folder.
Two options to run a workflow:
– the PLAY BUTTON in the toolbar
– the RUNTIME WINDOW from the Run menu


Modifying an Existing Workflow and Saving It

GOAL: Modify the HelloWorld workflow to display a parameter-based message.
Step-by-step instructions:
1. Open the HelloWorld workflow as before.
2. From the actor search tab, search for Parameter.
3. Drag and drop the parameter onto the workflow canvas on the right.
4. Double-click the parameter and type your name.
5. Right-click the parameter, select "Customize Name", and type in "name".
6. Double-click the Constant actor and type the following: “Hello “ + name
7. Save.
8. Run the workflow.


Creating a HelloWorld! Workflow

1. Open a new blank workflow canvas. From the toolbar: File → New Workflow → Blank.
2. In the Components tab, search for “Constant” and select the Constant actor. Drag the Constant actor onto the workflow canvas.
3. Configure the Constant actor: right-click the actor and select Configure Actor from the menu, or double-click the actor. Type “Hello World” in the value field and click Commit.
4. In the Components and Data Access area, search for “Display” and select the Display actor found under “Textual Output.” Drag the Display actor to the workflow canvas.
5. Connect the output port of the Constant actor to the input port of the Display actor.
6. In the Components and Data Access area, select the Components tab, then navigate to the “/Components/Director/” directory. Drag the SDF Director to the top of the workflow canvas.
7. Run the model.


Actor Search

• Kepler Actor Ontology
• Used in searching actors and creating conceptual views (= folders)
• Currently 160+ Kepler actors added!

Data Search and Usage of Results

• Kepler DataGrid
– Discovery of data resources through local and remote services: SRB, Grid and Web Services, DB connections
– Registry of datasets on the fly using workflows


Creating Web Service Workflows

GOAL: Execute a Web Service using the generic Web Service actor.
Step-by-step instructions:
1. In the Components and Data Access area, select the Components tab.
2. Search for "web service".
3. Drag "Web Service Actor" onto the canvas.
4. Double-click the actor, enter http://xml.nig.ac.jp/wsdl/DDBJ.wsdl, and commit.
5. Double-click the actor again, select "getXMLEntry" as the method name, and commit.
6. Search for "String Constant" in the Components tab. Drag and drop "String Constant" onto the workflow canvas.
7. Double-click the "String Constant", set AA045112 as the value, and commit.
8. Connect the "String Constant" output with the "Web Service Actor" input.
9. Add a "Display" and connect its input with the "Web Service Actor" "Result" output.
10. Add the SDF director.
11. Run the workflow.


SSH Actor and Including Existing Scripts in a Workflow

GOAL: Use the SSH actor to execute a command on a remote host.
Step-by-step instructions:
1. Search for "ssh" in the Components tab in the left pane.
2. Drag "SSH To Execute" onto the canvas.
3. Double-click the actor: type in a remote host you have access to, and your username.
4. Search for "String Constant" in the Components tab. Drag and drop "String Constant" onto the workflow canvas.
5. Double-click the "String Constant", type "ls", and commit.
6. Connect the "String Constant" output with the "SSH To Execute" command input (lowest).
7. Add a "Display" and connect its input with the "SSH To Execute" stdout output (top).
8. Add the SDF director.
9. Run the workflow.
10. If you have a script deployed on the server, you can replace the "ls" command to invoke the script (e.g., tmp.pl …).


Using Various Displays

GOAL: Use different graphical output actors.
Step-by-step instructions:
1. Open "03-ImageDisplay.xml" under the “demos/getting-started” directory in your local Kepler folder.
2. Run the workflow.
3. Search for "browser" in the Components tab.
4. Drag and drop "Browser Display" onto the canvas.
5. Replace "ImageJ" with "Browser Display" (connect the Image Converter output to the "Browser Display" inputURL).
6. Run the workflow again.
7. Replace "Browser Display" with a textual "Display".
8. Run the workflow.


Using R in Kepler

GOAL: Use the R actor to generate a histogram plot.
Step-by-step instructions:
1. In the “demos/getting-started” directory, open “05-LinearRegression.xml”.
2. Run the workflow to view the linear regression.
3. Add another RExpression actor to the canvas.
4. Double-click the new R actor and enter the following for “R function or script:”
   Mean <- mean(Values)
   hist(Values)
5. Right-click the new R actor and “Configure Ports”: add an input called “Values” and an output called “Mean”.
6. Control-click on the canvas to create a new “Relation” diamond.
7. Connect the T_AIR port from Datos to the diamond.
8. Connect the T_AIR port from R_linear_regression to the diamond.
9. Connect the Values port from the new R actor to the diamond.
10. Place a second ImageJ actor on the canvas (you can copy and paste the existing one).
11. Connect the R actor’s “graphicsFileName” port to the second ImageJ actor’s input.
12. Run the workflow.


Directors define the execution semantics of workflow graphs:
– a director executes the workflow graph (according to some schedule)
– sub-workflows may have different directors
– this enables reusability
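The separation above, graph on one side and execution semantics on the other, can be sketched in a few lines. This is an illustrative toy, not Ptolemy II code: the director owns the schedule, the actors only know how to fire, so swapping the director changes how (and how often) the same graph runs.

```python
# Toy director: execution semantics live outside the actors themselves.
class SDFDirector:
    """Fires actors in a fixed, statically computed order, n iterations.

    A different director (e.g. an event-driven one) could run the same
    actor list under different semantics without changing the actors.
    """
    def run(self, actors, iterations=1):
        for _ in range(iterations):
            for actor in actors:     # static schedule: plain list order
                actor.fire()

class Counter:
    """Stand-in actor that just counts its own firings."""
    def __init__(self):
        self.count = 0
    def fire(self):
        self.count += 1

a, b = Counter(), Counter()
SDFDirector().run([a, b], iterations=3)
print(a.count, b.count)
```

Each actor fires exactly three times, once per scheduled iteration, which mirrors how an SDF director precomputes a firing schedule for a Kepler workflow.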


Anatomy of a Workflow

Presenter: Mladen Vouk


General Architecture

[Architecture diagram: Analytics Computations; Local and/or Remote Communications (networks); Control Panel (Dashboard) & Display; Orchestration (Kepler); Data, Databases, Provenance, … Storage]


A Hierarchical View of the Architecture

[Diagram: Control Plane (light data flows); Provenance, Tracking & Meta-Data (DBs and Portals); Execution Plane (“heavy lifting” computations and flows). Synchronous or asynchronous?]


Scientific Workflow Automation (e.g., Astrophysics)

In conjunction with John Blondin, NC State University. Automate data acquisition, transfer, and visualization of a large-scale simulation at ORNL.

[Workflow diagram: VH1+ produces ~500 input files (50+ GB each, 14+ TB total); aggregate to Logistic Network storage (L-Bone depot) or move via bbcp to local mass storage; archive to HPSS; a 44-processor local data cluster holds data on local nodes for weeks; highly parallel compute produces ~500x500 output files; provenance is recorded; visualization via viz software, web viz, a viz wall, and a viz client]

The Big Picture: Supporting the Scientist

From “Napkin Drawings” … to Executable Workflows

[Figure: Conceptual SWF → (Kepler) → Executable SWF]

Here: John Blondin, NC State Astrophysics Terascale Supernova Initiative SciDAC, DOE


Example – Computational Astrophysics

Dr. Blondin is carrying out research in the field of circumstellar gas dynamics. The numerical hydrodynamics code VH-1 is used on supercomputers to study a vast array of objects observed by astronomers, both from ground-based observatories and from orbiting satellites. The two primary subjects under investigation are interacting binary stars (including normal stars like the Algol binary, and compact-object systems like the high-mass X-ray binary SMC X-1) and supernova remnants (from the very young, like SNR 1987A, to older remnants like the Cygnus Loop). Other astrophysical processes of current interest include radiatively driven winds from hot stars, the interaction of stellar winds with the interstellar medium, the stability of radiative shockwaves, the propagation of jets from young stellar objects, and the formation of globular clusters.


An Extended View of an Actor

[Diagram: Control Plane (light data flows); Provenance, Tracking & Meta-Data (DBs and Portals); Execution Plane (“heavy lifting” computations and flows). Synchronous or asynchronous?]


Actor in a Broader Sense

[Diagram: actor with In and Out ports wrapping a batch job on the network]

bsub < code_run, where code_run is a script:

  #!/bin/csh
  #BSUB -W 5
  #BSUB -n 100
  #BSUB -o /share/vouk/WFLOW/code.out.%J
  #BSUB -e /share/vouk/WFLOW/code.err.%J
  #BSUB -J codevouk
  source /usr/local/lsf/conf/cshrc.lsf
  mpiexec ./code

A TSI Workflow (in Ptolemy II framework)


Virtualization: multiple instances of the workflow using Kepler; also allows asynchronous processing.

Using VCL plus Windows and Mac clients

General System Performance Dashboard

Availability of blades (LSF)

Other Dashboard Elements

LSF queue status on the HPC Linux clusters (updated every 10 minutes or by forced refresh)
Job details


Demo Runs and Workflow Exploration with Discussion of Components, Processes, Behaviors, Data Models, Issues, Solutions …

A Complex Workflow Attendees Will Modify

Summary

Scientific workflows: a series of structured activities and computations that arise in scientific problem solving.
Needed: comprehensive, end-to-end data and workflow management for scientific workflows; an integrated network-based framework that is functional, dependable, fault-tolerant, and supports data and process provenance.
Shift scientists’ effort away from resource management and application development toward scientific research and discovery.
Long history: starting with numerical libraries, through tightly coupled local and distributed problem-solving environments, to a combination of end-to-end synchronous and asynchronous network-based/assisted/integrated computational, storage, and access (e.g., via web) resources.
One such framework is the Ptolemy II-based environment called Kepler.
The Department of Energy Scientific Data Management Center participates in the development of Kepler and its evaluation in a number of high-end scientific problem-solving settings, including the astrophysics and fusion domains.


Hands On #2

Presenter: Norbert Podhorszki

Hands On Overview

Workflow 1:
Submit a simulation job on a remote machine.
Workflow 2:
Transfer simulation output to another machine (directory) on the fly.
Run a conversion program to convert to HDF5.
Archive the HDF5 files.
Create images from the data files.
Transfer the images to Kepler and display them.

Transfer-Convert-Archive-Image Workflow


Provenance

Presenter: Mladen Vouk

General Architecture

Analytics Computations
Control Panel (Dashboard) & Display
Local and/or Remote Communications (networks)
Orchestration (Kepler)
Data, DataBases, Provenance, … Storage

What is Provenance

Provenance is about meta-data (data about data), the history (lineage) of data, code execution and conditions applied to a workflow run. Run-time monitoring may be part of the provenance meta-data, but it also may require collection of additional information and display of that information in a user-friendly format, for example on a “dashboard,” so that run-time tracking, problem determination, computational steering, and other workflow-related feedback may take place.

Why Provenance?

Recreate results and rebuild workflows using the evolution information
Associate the workflow with the results it produced
Create links between data generated in different runs, and compare runs
Checkpoint a workflow and recover from a system failure
Debug and explain results (via lineage tracing, …)
Smart reruns
Other …

Types of Provenance

Process provenance – dynamics of control flows and their progression, execution, etc.
Data provenance – dynamics of data and data flows, file locations, application input/output information, etc.
Workflow provenance – structure, evolution, …
System (or environment) provenance – system information, O/S, versions, etc.
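As an illustration only, one run's records for the four provenance types above might look like the following; all field names and values here are assumptions for the example, not the Kepler schema.

```python
# Illustrative only: one record per provenance type from the list above,
# as plain Python dicts. Field names are assumptions, not Kepler's schema.
run_provenance = {
    "process":  {"workflow": "HelloWorld", "status": "completed"},
    "data":     {"file": "/tmp/out.h5", "produced_by": "Converter"},
    "workflow": {"actors": ["Constant", "Display"], "director": "SDF"},
    "system":   {"os": "Linux", "kepler_version": "1.0"},
}

for kind, record in run_provenance.items():
    print(kind, "provenance:", record)
```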

Other Data Views and Concepts

Raw data
Application/simulation monitoring (input, output, configuration, intermediate states, …)
Data history and location
Machine monitoring
“Shelf life” of data
Auditability
Error and execution logs
Analytics and data: information summation (visual, formulas, smoothing, etc.)
…

Framework

Diagram: Kepler orchestration ties together Storage, Supercomputer (+ Auth), and Analytics.
Rec API and Disp API connect the provenance DB to the Dash(board).
Meta-data about: processes, data, workflows, system & environment.

A Hierarchical View of the Architecture

Control Plane (light data flows)
Execution Plane (“heavy lifting” computations and flows)
Provenance, Tracking & Meta-Data (DBs and portals)
Synchronous or asynchronous?

Implementation

Kepler + Linux + Apache + MySQL + PHP (K-LAMP)
Windows-based solutions
Communications: sockets, XML-RPC, HTTP, files, NFS, synchronous, asynchronous, etc.
Single-node vs. distributed solutions
Service-based solutions
Which information?
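XML-RPC is one of the communication options listed above. A minimal sketch of how a workflow component might ship a provenance event to a collector over XML-RPC follows; the method name and event fields are assumptions for illustration, not the actual Kepler provenance interface.

```python
# Sketch: shipping a provenance event to a collector over XML-RPC, one of
# the communication options listed above. The record_event method and its
# fields are illustrative assumptions, not the Kepler provenance API.
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

events = []  # in-memory stand-in for the provenance store

def record_event(actor, status):
    """Store one provenance event and acknowledge it."""
    events.append((actor, status))
    return "ok"

# Collector side: serve on an ephemeral localhost port in a background thread.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(record_event)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Workflow side: an actor reports its firing to the collector.
proxy = xmlrpc.client.ServerProxy(f"http://127.0.0.1:{port}")
ack = proxy.record_event("HelloWorld", "fired")
server.shutdown()
print(ack, events)
```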

Data Model

Schema                   First-Level Tables                                    Provenance Mapping
Basic User Information   usernames, usernotes, collaborators
System Information       system data, system status                            Process, System
Workflow                 workflow, annotation, entity, actor, director,        Workflow Specification
                         relation, link, port, parameter
Workflow Execution       execution, exec status, exec annotation,              Data, Process
                         actor firing, actor exec status, actor annotation,
                         command exec, cmd annot, token flow
Workflow Evolution       action, operation                                     Workflow Evolution
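As a sketch of how two of these first-level tables might look in SQL, the following creates them in an in-memory SQLite database; the column names are illustrative assumptions, not the actual Kepler provenance schema.

```python
# Sketch: two of the first-level tables above as SQL DDL, created in an
# in-memory SQLite database. Column names are illustrative assumptions,
# not the actual Kepler provenance schema.
import sqlite3

ddl = """
CREATE TABLE workflow (
    wf_id   INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE execution (
    exec_id INTEGER PRIMARY KEY,
    wf_id   INTEGER NOT NULL REFERENCES workflow(wf_id),
    status  TEXT,
    started TEXT
);
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)
con.execute("INSERT INTO workflow VALUES (1, 'HelloWorld')")
con.execute("INSERT INTO execution VALUES (1, 1, 'done', '2007-11-01')")

# Join an execution back to the workflow that produced it.
row = con.execute(
    "SELECT w.name, e.status FROM execution e JOIN workflow w USING (wf_id)"
).fetchone()
print(row)
```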

Kepler Provenance Framework

A Key Issue

It is very important to distinguish between a custom-made workflow-and-provenance solution and a more canonical set of operations, methods, data models, and re-usable components that can be composed into scientific workflow components.
Consider complexity, the skill level needed to implement, usability, maintainability, and “standardization.”
E.g., sort, uniq, grep, ftp, ssh on Unix boxes vs. SAS (which can do sorting), a home-made sort, LORS, SABUL, bbcp (but not standard), etc.


Hands On #3: Provenance in Kepler

Presenter: Daniel Crawl


Saving Provenance

GOAL: Save provenance in a simple workflow.
Step-by-step instructions:
1. Open the Hello World workflow.
2. Add a “ProvenanceRecorder” to the canvas.
3. Double-click it to configure.
4. Run the workflow.
5. Save the workflow as “HelloWorldProv.xml”.

Viewing Database

Use the MySQL Query Browser to view the provenance database contents.

Querying Provenance

GOAL: Query provenance information in a workflow.
Step-by-step instructions:
1. Start with a new, blank workflow.
2. In the Components and Data Access area, select the Components tab.
3. Search for "database".
4. Drag "Open Database Connection" and "Database Query" onto the canvas.
5. Configure "Open Database Connection".
6. Connect the output of "Open Database Connection" to the dbcon input port of "Database Query".
7. Double-click to customize the actor: Query: SELECT * FROM token_flow
8. Add a Display actor and connect the "Database Query" output to the Display input.
9. Add an SDF director.
10. Run the workflow.
11. Save the workflow as “QueryProv.xml”.
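The same SELECT * FROM token_flow query can also be issued outside Kepler; a minimal sketch follows, with SQLite standing in for the MySQL provenance database and the token_flow columns assumed for illustration.

```python
# Sketch: issuing SELECT * FROM token_flow outside Kepler. SQLite stands
# in for the MySQL provenance database here, and the token_flow columns
# (src, dst, token) are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE token_flow (src TEXT, dst TEXT, token TEXT)")
con.execute(
    "INSERT INTO token_flow VALUES ('Constant', 'Display', 'Hello World')"
)

rows = con.execute("SELECT * FROM token_flow").fetchall()
for src, dst, token in rows:
    print(f"{src} -> {dst}: {token}")
```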

Command-line Execution

GOAL: Execute a workflow from the command line.
Step-by-step instructions:
1. Open the “HelloWorldProv.xml” workflow.
2. Replace the Display actor with a LineWriter.
3. Save the workflow as “HelloWorldProv2.xml”.
4. Open a command prompt.
5. Run “ptexecute HelloWorldProv2.xml”.

Querying Provenance 2

Rerun “QueryProv.xml”

Next in the tutorial…

Scott Klasky and Mladen Vouk:
Demonstration of the dashboard from a user’s perspective.
Showing attendees how to use it and view the system information.

Discussion: Your workflows of interest

Daniel Crawl, Jeff Ligon, Norbert Podhorszki, Ayla Khan:
Create small workgroups of interest.
Hands-on work on details of covered concepts.

Discussion and Questions
