arXiv:1011.6332v1 [astro-ph.IM] 29 Nov 2010

To appear in the Proceedings of the 5th Grid Computing Environments Workshop (GCE2009), Portland, Oregon, USA, 2009.

AMP: A Science-driven Web-based Application for the TeraGrid

Matthew Woitaszek, Travis Metcalfe, and Ian Shorrock
National Center for Atmospheric Research
1850 Table Mesa Drive
Boulder, CO 80305

ABSTRACT

The Asteroseismic Modeling Portal (AMP) provides a web-based interface for astronomers to run and view simulations that derive the properties of Sun-like stars from observations of their pulsation frequencies. In this paper, we describe the design principles and the lightweight architecture and implementation of AMP, highlighting the tools used to produce a functional, fully-custom web-based science application in less than a year. Targeted as a TeraGrid science gateway, AMP's architecture and implementation are intended to simplify its orchestration of TeraGrid computational resources. AMP's web-based interface was developed as a traditional standalone database-backed web application using the Python-based Django web development framework, allowing us to leverage the framework's capabilities while cleanly separating the user interface development from the grid interface development. We have found this combination of tools flexible and effective for rapid gateway development and deployment.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services.

1. INTRODUCTION

In March 2009, NASA launched the Kepler satellite as part of a mission to detect extrasolar planets. The mission identifies potentially habitable Earth-like planets by observing transits - brief dips in observed brightness as a planet passes between its star and the satellite - that can be used to determine the size of the planet relative to the size of its star. However, in order to calculate the absolute size of an extrasolar planet, the absolute size of the star must be known. Asteroseismology can also be used to determine the properties of Sun-like stars from observations of their pulsation frequencies, yielding the precise absolute size of a distant star and thus the absolute size of any detected extrasolar planets.
The Asteroseismic Modeling Portal (AMP, http://amp.ucar.edu) presents a web-based interface to the MPIKAIA asteroseismology pipeline [6], facilitating access to the model by a broad international community of researchers, simplifying automated model execution, and supporting data sharing among research groups.

While the MPIKAIA asteroseismology pipeline itself has been available for astronomers to download and run on their own resources for several years, its potential use for processing Kepler data provided compelling motivation to explore presenting the model as a science gateway. The most substantial barriers to an astronomer running MPIKAIA on a local resource are not the model's high computational requirements but its straightforward yet high-maintenance workflow. Running a single MPIKAIA simulation requires propagating several independent batches of MPI jobs over 512 processors, and these simulations can consume over a week of wall-clock time. More importantly, the results of these simulations are of interest to an international community of asteroseismology researchers. Presenting the model via a science gateway allows researchers to run the model without local resources, disseminates model results to the community without repetition, and produces a uniform analysis of asteroseismic data for many stars of interest.

The straightforward workflow implemented by AMP also provided an opportunity to develop a new science gateway while exploring a new architecture, gateway framework, and supporting web application technologies. One of the first steps when designing a new science gateway is to select the collection of technologies, such as frameworks and toolkits, that will be used to construct the gateway. As M. Thomas similarly noted when evaluating gateway development frameworks, gateways for science can be constructed using tools and frameworks that vary greatly in features and complexity, with most feature-rich frameworks introducing substantial development complexity [12]. Indeed, many of the prior science gateway projects at the National Center for Atmospheric Research (NCAR) followed the design pattern typical for many gateways by using service-oriented architectures and highly-extensible Java portals that are complex to implement.

For the design and implementation of AMP, our objective was to create a web-based, science-driven application that used Grid technologies to enable the use of supercomputing resources. Most notably divergent from our prior work [5], AMP does not use an application-specific service-oriented architecture and is not written in Java. We prioritized minimizing back-end development time and complexity while retaining full creative control of the user interface by selecting the Django rapid-development web framework and implementing the Grid functionality with command-line toolkit interfaces.

Due to its computational requirements, AMP has been designed since its inception to target TeraGrid resources. Many of the best practices and procedures for developing and deploying science gateways on the TeraGrid were proposed coincident with our initial exploration of targeting TeraGrid as AMP's computational platform. As such, AMP also provides an example of constructing a new science gateway specifically for TeraGrid cyberinfrastructure rather than the common case of extending an existing gateway to utilize TeraGrid. AMP's architecture separates the web-based user interface and the workflow system performing Grid operations, isolating interactive users both logically and physically from TeraGrid operations. We utilized only components common to all TeraGrid resource providers with the goal of facilitating easy deployment on current TeraGrid-managed resources without any resource provider assistance.

The remainder of this paper is organized as follows. Section 2 describes the asteroseismology model workflow and its computational requirements. Sections 3 and 4 describe the architecture, design, and implementation of AMP. Section 5 discusses our experiences with AMP's implementation, emphasizing the potential usefulness of the design principles for future gateway projects, and the paper concludes with continuing and future work.
2. BACKGROUND

The asteroseismology workflow provided by AMP consists of two components: a forward stellar model and a genetic algorithm (GA) that invokes the forward model as a subroutine. The forward stellar model is the Aarhus Stellar Evolution Code (ASTEC) [4], a single-processor code that takes as input five floating-point physical parameters (mass, metallicity, helium mass fraction, and convective efficiency) and constructs a model of the star's evolution through a specified age. The output of the model includes observable data such as the star's temperature, luminosity, and pulsation frequencies. In addition to the scalar parameter output, ASTEC produces data that can be used to produce basic graphical plots describing the star's characteristics, including a Hertzsprung-Russell diagram showing the star's temperature and luminosity and an Echelle plot summarizing the star's oscillation frequencies.

In practice, however, the reverse problem must be solved: ASTEC models a star with known properties and produces its observable characteristics, while the real research product requires starting with observations and identifying the properties of a star that could produce those observations. In order to derive the properties of distant stars from observations, ASTEC is coupled with the MPIKAIA parallel GA [6] to create an automated stellar processing pipeline [7]. The GA creates a population of candidate stars with a variety of physical parameters, models each star using ASTEC, and then evaluates each candidate star for similarity to the observed data. Over many iterations, the GA converges to identify an optimal candidate star that has the properties most likely to produce the observed data. The candidate star is then subjected to a solution detail run that further refines the star's characteristics at a finer granularity and produces the final model output.

AMP supports both modes of execution from its web-based user interface: running the forward model with specific model parameters (a "direct model run"), and executing the GA to identify model parameters that produce observed data (an "optimization run"). Direct model runs are trivial to configure and execute: they require five floating-point parameters as input, take 10-15 minutes to execute on a single processor, and produce a few kilobytes of output. Optimization runs are both more complex and computationally intensive.

The optimization run workflow consists of an ensemble of independent GA runs, with each run requiring the execution of multiple sequential tasks (see Figure 1). For each optimization run, multiple separate GAs are executed and allowed to converge independently. Each GA (and indeed each task) is started with randomly generated seed parameters to encourage the GA to explore a wide parameter space, avoid local minima, and provide confidence in the optimality of the final result. The GAs can take from hours to days to converge depending on system performance and the number of iterations requested, so a GA may not converge in a single task execution within the target supercomputer's walltime limitations. Thus, each GA run may require several invocations of the executable to converge to a solution. When all of the GA runs in the ensemble are complete, the best solution is evaluated using the forward model to produce detailed output for presentation and analysis.
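To make the coupling between the GA and the forward model concrete, the following toy Python sketch outlines the optimization loop described above. It is only a schematic illustration of the technique: the real pipeline invokes ASTEC as an external executable and compares pulsation frequencies, while the forward model, fitness function, and mutation scheme shown here are trivial stand-ins, not MPIKAIA's actual implementation.

import random

def forward_model(params):
    # Stand-in for ASTEC: map the input parameters to a single "observable".
    return sum(p * p for p in params)

def fitness(modeled, observed):
    # Stand-in similarity metric: smaller difference from the observations is better.
    return -abs(modeled - observed)

def optimize(observed, pop_size=126, iterations=200, n_params=5):
    # Each GA starts from randomly generated seed parameters.
    population = [[random.uniform(0.0, 1.0) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(iterations):
        # Model every candidate star and rank the population by fitness.
        ranked = sorted(population,
                        key=lambda p: fitness(forward_model(p), observed),
                        reverse=True)
        parents = ranked[:pop_size // 2]
        # Refill the population by mutating the fittest candidates.
        children = [[g + random.gauss(0.0, 0.05) for g in random.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda p: fitness(forward_model(p), observed))

best_parameters = optimize(observed=0.5)

In AMP's configuration, several such GA runs are executed independently and the best overall candidate is re-evaluated with the forward model to produce the detailed solution output.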
In the current configuration for the Kepler data analysis, each optimization run consists of four GA runs executed in parallel, and each GA models a population of 126 stars (using 128 processors) for 200 iterations. One interesting artifact of the ASTEC model is that the execution time varies slightly depending on the target star's characteristics. During the first few iterations, some stars in the randomly chosen population may take more time to model than others. Because the iteration is blocked on the completion of all stars in the population, the iteration run time is set by the longest-running component star. However, as the model continues and the population begins to converge, the model run time for each star also converges and the time to run each iteration decreases. Thus, the 200 iterations can be performed in about 160x to 180x of the first iteration's measured time.

As part of the allocation request for TeraGrid resources, the stellar model was benchmarked on four TeraGrid platforms (see Table 1). From the astronomer's perspective, the most important metric is the predicted optimization run (GA) run time. The modern Intel and AMD processors in the NICS and TACC resources can propagate the GA to completion in about 40-60 hours, while the slower processors in NCAR's Frost system can require over 12 days. When considering TeraGrid's service unit (SU) charging factors and the model performance, the TACC systems are the most efficient platforms for this model, but the systems are generally similar in cumulative charging. For our production deployment, we have targeted the NICS Kraken system due to its short solution time and support for WS-GRAM. The TACC systems demonstrated better performance, but the small disk space available on Lonestar and the lack of WS-GRAM on Ranger, combined with the current allocation oversubscription on those systems, discouraged their use for this project. For additional computational volume, we continue to utilize NCAR's Frost system.

Figure 1: AMP asteroseismology workflow. (Diagram omitted in this text version: the input observables feed four parallel GA runs, each composed of a sequence of jobs, followed by evaluation of the best solution.)

Table 1: Measured stellar benchmark run time, and estimated optimization run time and SU charge, for selected TeraGrid systems. An optimization run performs 200 GA iterations and requires about 160x the model benchmark time to complete, and each optimization run executes four 128-processor GA jobs.

                    Stellar Model     Optimization Run (Genetic Algorithm)
  System            Run Time (min)    Run Time (h)   CPUh      SUs/CPUh   TeraGrid SUs
  NCAR Frost            110.0             293.3      150,187     0.558       83,804
  NICS Kraken            23.6              61.9       31,723     1.623       51,486
  TACC Lonestar          15.1              40.4       20,670     1.935       39,996
  TACC Ranger            21.1              56.2       28,771     1.644       47,229
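As a sanity check on Table 1, the estimates follow directly from the stated rules: roughly 160x the single-model benchmark time for 200 iterations, four 128-processor GA jobs per optimization run, and the per-system SU charging factor. A minimal sketch of the arithmetic, using the NCAR Frost row:

def estimate_optimization_run(benchmark_min, su_per_cpuh,
                              scale=160, processors=4 * 128):
    # 200 GA iterations complete in roughly 160x the single-model benchmark.
    run_time_h = benchmark_min * scale / 60.0
    cpu_hours = run_time_h * processors        # four 128-processor GA jobs
    teragrid_sus = cpu_hours * su_per_cpuh     # apply the SU charging factor
    return run_time_h, cpu_hours, teragrid_sus

# NCAR Frost row of Table 1: 110.0 min benchmark, 0.558 SUs/CPUh
print(estimate_optimization_run(110.0, 0.558))
# -> about 293.3 h, 150,200 CPUh, and 83,800 SUs, matching the tabulated values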

3. ARCHITECTURE

The high-level AMP architecture reflects our principal design goals of supporting rapid development and explicitly targeting TeraGrid computational resources. The architecture consists of three main components: the web-based user interface, the "GridAMP" workflow daemon that functions as a grid client, and the remote computational resources running the model (see Figure 2). The separation of these three main components is fundamental to the architecture.

Figure 2: AMP high-level architecture. (Diagram omitted in this text version: the web gateway user interface and database, the GridAMP grid client, and the remote computational resources running CTSS Globus grid services and jobs.)

With respect to supporting rapid development, one advantage of the separation of AMP's functional components is its ability to support specialized labor. This approach generally decouples the tasks of web development, back-end Grid software engineering, and the debugging and maintenance of the science software itself. This is particularly beneficial because it is much easier to find students to work on web-related development (e.g., undergraduates) than to find students that possess a thorough understanding of the intricacies of Grid infrastructure and middleware (e.g., graduate students with several years of experience), to say nothing of trying to find students that can work proficiently (and efficiently) with both. Because the interface and Grid components are not tightly coupled, they can be easily developed and maintained by individuals with complementary skill sets. We have continued the separation concept through to the science code itself by running the code in an environment identical to that used by the astronomy principal investigator and colleagues. Rather than dispatching software engineers or students to maintain the application, the science PI occasionally updates the Grid-executed code personally using sudo on the remote resource.

Separating the user interface from the grid-related processing components also simplifies the administrative responsibilities associated with using TeraGrid computational resources. In particular, one concern often associated with science gateways is their use of a shared credential to submit jobs on behalf of a community of individual gateway users [11]. Gateways that utilize TeraGrid resources are required to maintain user registries and associate every Grid request with a specific gateway user. In order to provide end-to-end user accounting for all gateway jobs and to allow resource providers to disambiguate the real users acting behind community credentials, TeraGrid has developed and deployed the GridShib SAML extensions [8]. However, an underlying risk remains: a science gateway typically runs a publicly accessible web server and also must possess the credentials necessary to access many machines on the TeraGrid.
The AMP architecture addresses this concern by separating users from the community account credential by placing them on distinct servers. The user interacts with a web portal located on one publicly-accessible server, while all back-end processing and remote Grid operations are performed by the GridAMP daemon on another server. All communication between the AMP portal and the GridAMP daemon is performed asynchronously by manipulating a database located on yet another server. Moreover, the roles and privileges of the public web portal and GridAMP daemon are strictly managed and controlled. The public web portal is essentially a database-driven website without any Grid connectivity or Grid software. The server hosting the GridAMP daemon is accessible only to the developers using SSH keys, and only GridFTP is externally exposed to facilitate data staging via the community account credential. All input data from users is marshaled through the SQL database. Incoming user data is parsed by the web server and uploaded to database tables with strict data type constraints. When required, the input files are regenerated from the database by the GridAMP daemon and then staged to TeraGrid systems. It is thus exceptionally difficult to send any data other than a properly formatted asteroseismology input file to a TeraGrid resource, and even a full root compromise of the web server does not provide access to any credentials used for access to any other system. This architectural feature helps AMP comply at the most fundamental level with the TeraGrid science gateway security best practices [10].
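As an illustration of this database-mediated separation (a sketch with assumed names and values, not AMP's actual configuration), the portal and the GridAMP daemon can run on different hosts while pointing their Django settings at the same central database server; neither host needs to hold the other's credentials.

# settings_shared.py -- hypothetical settings fragment imported by both the
# public portal's and the GridAMP daemon's Django settings (Django 1.x style).
# The two hosts share only the database on a third, database-only server.
DATABASE_ENGINE = 'postgresql_psycopg2'   # assumed backend
DATABASE_NAME = 'amp'
DATABASE_HOST = 'db.example.org'          # the separate database server
DATABASE_PORT = '5432'
# Each host uses its own database account, so table-level permissions can
# restrict the public portal to exactly the tables it needs.
DATABASE_USER = 'amp_portal'
DATABASE_PASSWORD = 'changeme'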
4. IMPLEMENTATION

The AMP gateway and the GridAMP daemon are implemented in Python 2.4/2.6 using the Django web development framework [2]. Django's primary intended use is as a web development platform, but over two software engineering iterations, we adopted Django as the underlying framework for both the AMP website and the GridAMP daemon. We were able to perform two complete cycles of a "spiral-model" software engineering process in about one year, completely re-implementing the entire website and processing daemon about 6 months after the initial prototyping commenced.

In our first development prototyping cycle, we perhaps took the separation of components concept too far, as we used Django to implement the website but implemented the GridAMP daemon in Python using manually-coded SQL database calls. This made sense at the time: although Django provides a full-featured object-relational model (ORM) independent of its web server-related features, we were skeptical that the ORM would be sufficiently robust to fulfill our requirements. For example, we demanded direct and explicit control of the database schema and wanted to use database permissions to carefully control access to database tables on a per-user basis. Even the idea of allowing an ORM system to create tables based on Python object definitions seemed irreconcilable with production-quality science gateway implementation. Over the first six months of development, however, it became clear that this was not the case – the Django ORM was more powerful and flexible than we imagined could be possible. We were able to easily redefine our prior manually-specified database schema entirely using Django with perfect table/field/type correspondence, including our desired permissions scheme, all from within Django's ORM. Moreover, the database schema could be reconstructed on demand–including sample data–in test databases when required for development work. The ORM also worked from standalone programs outside of Django's web serving infrastructure.

Thus, the usefulness of the Django "don't repeat yourself" philosophy quickly became apparent and immediately applicable to AMP. While the service separation philosophy can be taken to an extreme – we could have even switched languages between the web server and the GridAMP daemon – maintaining two separate codebases quickly became a mundane waste of time. We therefore maintained the operational separation of the web site and GridAMP daemon but unified the framework for both components. The entire project now uses a single code base to define and manipulate shared data structures across multiple servers.

4.1 Common Components

Software written with the Django framework is organized into "projects" and "applications". A project basically represents a website and consists of a common configuration and a collection of installed, possibly independent, applications. Applications are written using the typical model-view-controller design pattern, better described as model-template-view using Django's terminology. Models use the ORM to abstract database access behind Python objects while providing the opportunity to add custom functionality. When an HTTP request is received, the request is dispatched to the appropriate Python subroutine (a "view") to perform the necessary processing. View routines then usually conclude by rendering final output to the user via Django's template engine.

For AMP, we implemented most of the science gateway functionality in a single core application consisting of ORM models and support routines. For example, the catalog of stars, their identifiers, the simulations, and the constituent supercomputer jobs are all stored in this core application. This effectively makes the most important components of AMP first-class global objects when imported properly. The web interface is then constructed of additional applications that refer to the core application as required. Only this core application's models are shared between the website and the GridAMP daemon.
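A minimal sketch of what such shared core models might look like follows; the class and field names are illustrative assumptions rather than AMP's actual schema. Because the daemon imports the same models, it can regenerate a properly formatted input file directly from the typed database fields, as described in Section 3.

from django.db import models

class Star(models.Model):
    # Catalog identifier (e.g., an HD number); strict typing is enforced here.
    hd_number = models.IntegerField(unique=True)
    name = models.CharField(max_length=64, blank=True)

class Simulation(models.Model):
    MODES = (('direct', 'Direct model run'), ('optimize', 'Optimization run'))
    star = models.ForeignKey(Star)            # Django 1.x style foreign key
    mode = models.CharField(max_length=16, choices=MODES)
    state = models.CharField(max_length=16, default='QUEUED')
    # Illustrative floating-point physical parameters for a direct model run.
    mass = models.FloatField(null=True)
    metallicity = models.FloatField(null=True)
    helium_fraction = models.FloatField(null=True)
    convective_efficiency = models.FloatField(null=True)
    age = models.FloatField(null=True)

    def render_input_file(self):
        # The daemon regenerates the input file from typed fields, so only a
        # well-formed asteroseismology input can reach a TeraGrid resource.
        return "%f %f %f %f %f\n" % (self.mass, self.metallicity,
                                     self.helium_fraction,
                                     self.convective_efficiency, self.age)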
For both the web server and the GridAMP daemon, we also adopted Django's built-in authentication ("auth") framework. The authentication framework provides basic website user management functionality, including common user-initiated account manipulation activities. We extended the Django authentication framework to support additional information required by AMP and TeraGrid, such as data provenance and user authentication metadata.

An additional benefit of using the Django ORM and authentication framework is that Django's built-in development server provides an administrative interface that can manipulate ORM objects, including those created by the authentication framework. The interface is also easily modified to support custom requirements. Thus, administrative tasks such as approving users or adjusting back-end parameters (like allocations and the authorization for a user to submit to a machine using a particular allocation) can easily be performed from a graphical interface without custom development. The interface is available to developers running the Django development server with appropriate database connectivity, so the administrative functionality is not even possible from any publicly accessible web servers.

4.2 User Interface

In addition to the shared Django application that contains the core AMP models, we wrote separate Django applications to implement independent portions of the website functionality. One application allows users to browse and search star catalogs, one allows users to view completed simulation results, and another facilitates simulation submission. These applications don't contain models, so they are useful only within the context of a Django project containing the core AMP application, but the distinction provided a logical separation of site components.

We also wrote additional standalone Django applications containing potentially reusable code. For example, we wished to use a CAPTCHA to reduce the possibility of automated bots requesting AMP accounts. Due to our accessibility requirements, using a typical image-only CAPTCHA was problematic, so we decided to write our own. Our general-purpose question/answer CAPTCHA presents a series of questions with optional links to answers. For AMP, users are asked to enter the HD catalog numbers of popular stars, such as "What is the HD number for Alpha Centauri?" For astronomers that can't remember, we present a link to the page containing the answer. With this, only one real estate agent turned fashion supermodel has requested the ability to submit AMP jobs.

AMP's web interface is quite typical for current database-driven websites in that it combines static and dynamic web technologies to provide its user experience. AMP uses JavaScript-based "Web 2.0" techniques to simplify the user experience where possible, but the site is fully functional without these JavaScript enhancements. For example, the process of searching for a star uses AJAX to suggest stars that have results or are in the Kepler catalog. If no stars are in AMP's catalog, the search is passed to the SIMBAD [3] astronomical database and the target, if found, is added to the local catalog. Finally, AMP uses Django's SSL authentication and session management support to ensure that all activities performed by registered users are encrypted.
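A hypothetical sketch of the AJAX star-search suggestion view is shown below. It assumes the Star model from the earlier core-application sketch, and the SIMBAD fallback helper and JSON payload shape are illustrative assumptions, not AMP's actual implementation; Django 1.x era APIs (simplejson, mimetype) are used to match the versions named above.

from django.http import HttpResponse
from django.utils import simplejson       # JSON support bundled with Django 1.x
from amp.core.models import Star          # hypothetical core application import

def suggest_stars(request):
    query = request.GET.get('q', '').strip()
    # Suggest local catalog entries that match the typed identifier prefix.
    matches = Star.objects.filter(name__istartswith=query)[:10]
    if not matches:
        # Fall back to the SIMBAD astronomical database; if the target is
        # found there, it would be added to the local catalog.
        matches = lookup_simbad_and_cache(query)   # assumed helper function
    payload = [{'id': star.id, 'name': star.name} for star in matches]
    return HttpResponse(simplejson.dumps(payload),
                        mimetype='application/json')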
4.3 Grid Execution

To simplify the deployment of the AMP model on TeraGrid systems, we constructed a workflow that utilizes only basic components provided by the Coordinated TeraGrid Software and Services (CTSS) software stack [9]. Rather than deploying a SOA with services that encapsulate the models, as we have done in the past for other projects, the GridAMP daemon directly formulates and submits GRAM execution requests and GridFTP file transfers. Thus, the model can be deployed on a TeraGrid resource as soon as the community account has been authorized, and no special resource provider dispensations (e.g., custom Globus containers or separate service hosting platforms) are required.

The remote resource execution environment for each AMP job is initialized and finalized using shell scripts invoked by GRAM using the fork job service. The pre-job stage creates a new empty copy of the model runtime directory structure and prepopulates the tree with static input files. The model is then run using GRAM through the scheduler interface, with each model invocation staging in the small input data text file and staging out its restart progress file. The post-job stage uses tar to consolidate output and log files into a single file for transfer back to the GridAMP daemon and eventual delivery to the user via the website. A final cleanup stage ensures that the execution environment has been removed.

4.4 GridAMP Workflow Daemon

The GridAMP daemon manages the workflow of AMP simulations on remote grid resources. It reads simulation information from the centralized database, performs the necessary grid client actions, and updates the database accordingly. The AMP website and the GridAMP daemon thus interact asynchronously through the centralized database.

We wrote a custom Python module to handle the grid client functionality via calls to the Globus command-line interfaces. The module supports generating derivative proxy certificates with GridShib SAML extensions, GridFTP, and GRAM. The primary reasons for using our own module were that we already had such functionality in-house and our familiarity with our grid support module made it seem simpler and more robust than using third-party solutions. The most important operational benefit of wrapping command-line clients is that it provides excellent support for troubleshooting. The daemon produces logs that clearly highlight warnings and errors, with the relevant command lines displayed for failure cases. To troubleshoot, a developer needs only to open a new console on the GridAMP server and copy-paste the line at the shell prompt to retry the failed action. The Grid operations are not hidden behind complex object models but are transparent, so that problems can be investigated and corrected quickly and easily.
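The command-line wrapping approach can be sketched as follows; this is an illustration of the idea rather than AMP's actual grid support module. The key property is that the exact command line is logged, so a failed operation can be retried verbatim at a shell prompt. (globus-url-copy is the standard Globus GridFTP command-line client.)

import logging
import subprocess

log = logging.getLogger('gridamp')

def run_command(argv):
    command_line = ' '.join(argv)
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # The logged line is exactly what a developer would paste to retry.
        log.error('command failed (%d): %s\n%s',
                  proc.returncode, command_line, err)
    return proc.returncode, out

def gridftp_transfer(src_url, dst_url):
    # Stage a file between the GridAMP host and a remote resource.
    return run_command(['globus-url-copy', src_url, dst_url])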
Due to AMP's straightforward processing requirements, we also wrote our own workflow management daemon. The workflow is represented as a list of stages with function pointers that must return True to proceed to the next state (see Listing 1). If the job is in a particular state, all of the functions in the subsequent list are called. If all return True, then the job is set to the indicated next state. In practice, the first function usually checks to see if the prior state has completed, and the last function propagates the job to the next state. This simple encoding can represent arbitrary trees of execution, but for AMP the processing is merely linear. The only coding cleverness is the use of inheritance to support AMP's two job types with a single base class implementing all of the routine functionality. Job queuing, stage-in, and stage-out are all handled by the base class. Only the functions that generate the GRAM job definitions and perform model postprocessing are implemented in the derived classes. Thus, the derived classes are very small and contain only model-specific execution and postprocessing code.

Workflow state management and job status tracking are integrated with AMP's data models as implemented using the Django ORM and stored in the centralized database. We utilized a two-level approach to workflow status management, integrating the simulation status in the application-specific data models while maintaining constituent grid job status in a more generic fashion. To manage the workflow, the daemon first polls the status of each grid job and updates the job records accordingly. This process is identical for all grid jobs regardless of purpose (pre-job, post-job, or simulation) or execution method (fork or queue), and no special callbacks or processing are performed as part of the grid job status update procedure. Once the grid job status has been updated, the workflow management code simply retrieves the last-known status of the appropriate job and waits or proceeds accordingly. One advantage to this approach is that simulation status is integrated at the highest level of the application-specific data model, so the user interface does not need to analyze the state of many individual grid jobs to determine the current state of a simulation.
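To make the state-table encoding concrete: after the daemon has refreshed the grid job status, it can drive each job through a table like the one shown in Listing 1 below with a method along the following lines. This is a sketch of the described behavior, not the actual GridAMP code; self.workflow and self.state are assumed attributes of the same hypothetical job class.

def advance(self):
    # Look up the check functions and the next state for the current state.
    checks, next_state = self.workflow[self.state]
    # Every function must return True before the job may change state; the
    # first check typically verifies that the prior stage completed and the
    # last one submits the work for the following stage.
    for check in checks:
        if not check():
            return            # not ready yet; wait for the next polling pass
    self.state = next_state   # all checks passed: move to the next state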

Listing 1: Example GridAMP workflow definition

self.workflow = {
    'QUEUED':  ([self.check_queued_sim, self.submit_prejob],    'PREJOB'),
    'PREJOB':  ([self.check_prejob,     self.submit_workjob],   'RUNNING'),
    'RUNNING': ([self.check_workjob,    self.submit_postjob],   'POSTJOB'),
    'POSTJOB': ([self.check_postjob,    self.postprocess,
                 self.submit_cleanup],                           'CLEANUP'),
    'CLEANUP': ([self.check_cleanup,    self.close_simulation],  'DONE'),
}

As part of the workflow management process, the GridAMP daemon also handles failures and provides user status notifications. Our error management philosophy completely isolates gateway users from the jargon of grid-related failures and transients. Users are not notified of events that they may not understand and are definitely not capable of correcting. Unless the asteroseismology model fails, the simulation will be completed and returned to the user. Users may opt to receive an e-mail when their simulation completes or to receive e-mails at each state transition.

The GridAMP daemon distinguishes between anticipated transients, model processing failures, and its own failures. Anticipated transients, such as remote systems suddenly becoming unreachable for GRAM or GridFTP requests, are handled silently: administrators are notified, the job's status display is supplemented with a plain-text message describing the situation, and the processing is retried automatically without user or administrator intervention. Model failures, such as the absence of a mandatory output file or the failure of a result line to parse correctly, generally require gateway administrator intervention and occasionally escalate to the science investigators for model development work. In the event of a model failure, the simulation is moved to a special "hold" state and both the user and administrator are notified. The gateway administrators can then debug the problem and retry the failed processing steps interactively. Once the problem has been resolved, the workflow resumes automatically. Finally, failures of the GridAMP daemon itself are monitored externally and immediately brought to the attention of the gateway administrators.

5. DISCUSSION

Perhaps the most fundamental characteristic of AMP is its posture as a grid-enabled science gateway. When considering our earlier grid gateway projects and a small set of existing grid gateway frameworks, we realized that we did not really want to build a "grid gateway" in the sense suggested by these projects and frameworks. Rather, we wanted a science-driven web-based application focused on delivering the required functionality to our user community that happened to use grid resources and technology to perform some of its computationally intensive processing. To that end, AMP completely hides many aspects of its grid nature from users. As most astronomers are familiar with high-performance computing, concepts such as simulations, computational jobs, allocations, and supercomputers remain visible terminology, but the word "certificate" is not even mentioned anywhere on the site.

Our ability to decouple AMP's front-end and back-end components was enabled by AMP's straightforward workflow and lengthy job turnaround time. We recognize that the luxury of asynchronous coupling is not afforded by many science gateways that facilitate interactive analysis and visualizations. The decoupled asynchronous processing is appropriate for AMP's jobs, simplified the implementation, and facilitates operational debugging.
While workflow management is well understood and a variety of robust technologies are available to automate workflows [1], it was indeed quite simple to implement a small-scale custom workflow manager for AMP. In fact, if GRAM ever supports executing pre-job and post-job scripts using the fork service as part of a queued job specification, half of AMP's functionality could be implemented using a single GRAM job submission! For the optimization runs, the most complex portion of the workflow is downloading and interpreting partial result files, which requires custom implementation regardless of the workflow management paradigm. By writing our own simple workflow management daemon, we have retained a single application-defined representation of all state. The Django models used by the website are used for execution management by the GridAMP daemon. This avoids the need to deploy and query middleware to run grid jobs and provides the transparent end-to-end debugging capability that is useful when things go wrong.

We are particularly impressed with the Python-based Django web development framework. For our purposes, Django seemed to perfectly balance framework features and customization, supporting the rapid development of web sites without being a content management system. The programming methodology was intuitive, suggesting but not enforcing a model-view-controller design pattern. The Django framework was useful even for the non-web portions of the project. The self-contained development environment was easy to install and facilitated quick prototyping and debugging. When combined with the Apache web server, the framework was robust enough to function as a production system.

Our use of AJAX and Web 2.0 technologies has been limited to cases where it is clearly beneficial to our user community. For example, the star search functionality suggests stars that are in the Kepler catalog and stars that have results as soon as a user types enough of a catalog identifier to disambiguate possible targets. Given the long job turnaround time, however, opportunities to make the website appear more dynamic are limited. We could do many cool tricks with AJAX and social networking, and it was very tempting to allow astronomers to "share a star" via Facebook or send simulation progress updates using Twitter. More pragmatically, we are currently working on using RSS feeds to allow astronomers to subscribe to stars of interest and adding dynamic links to astronomical catalogs and visualization services such as SIMBAD and Google Sky.

Although AMP was designed as a custom solution for a specific model and workflow, we believe that some AMP components may be a useful foundation for future similar grid gateway development. Of course, the AMP user interface is completely custom, but Django facilitates rapid web development in its own right. The core AMP models that represent jobs and the base classes of the workflow manager are potentially generic enough to support other applications and workflows with minimal changes. Although we have not done so, it would not be particularly difficult to isolate the common job management functionality from the models such that it could be added to new models as desired. The GridAMP daemon already supports this abstraction, as the workflow manager base class itself contains only grid code and all application-specific logic is contained in the workflow-specific derived classes. This level of abstraction would have to be similarly introduced to the data models by using complementary table schemas or inheritance to make a model represent grid jobs using a mechanism other than copying and pasting certain fields into the model definition. In this more generic approach, models would be defined only with application-specific job fields (such as input and results) with the job management fields provided externally. Thus, while AMP and its underlying components are clearly not a framework from which new gateways may easily be constructed, AMP demonstrates how rapid web development frameworks combined with simple grid support libraries can be used to produce useful science gateways.
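The base/derived split described above could look roughly like the following; the class and method names are assumptions for illustration, not the actual GridAMP classes, and submit_gram_job stands in for the generic submission path in the base class.

class BaseWorkflowJob(object):
    """Generic grid handling: queuing, stage-in, stage-out, state table."""

    def submit_prejob(self):
        # Generic: stage input files and invoke the pre-job shell script
        # through the GRAM fork service.
        ...
        return True

    def submit_workjob(self):
        # Generic submission path; only the job description is model-specific.
        return self.submit_gram_job(self.gram_job_definition())

    def gram_job_definition(self):
        raise NotImplementedError   # provided by the derived class

    def postprocess(self):
        raise NotImplementedError   # provided by the derived class


class OptimizationRunJob(BaseWorkflowJob):
    """Model-specific pieces only: the GRAM job definition and result parsing."""

    def gram_job_definition(self):
        ...   # e.g., a 128-processor MPIKAIA GA invocation

    def postprocess(self):
        ...   # download and interpret partial result files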
6. FUTURE WORK

Although AMP is currently being used for friendly user testing and we do not anticipate making any fundamental changes over the next year or two, we have identified several front-end and back-end features that we wish to explore in the future. Again, we are currently investigating the best way to provide simulation progress and star result updates via RSS and refining our use of AJAX techniques to enhance the user experience in subtle yet meaningful ways. As the number of simulations on AMP grows, we anticipate that we will need to revisit the interface used to organize and present the results of the simulations.

One limitation of GridAMP that we intend to examine in the near future is its use of multiple sequential GRAM jobs to propagate optimization runs to completion. Although each GRAM job is set to the target system's walltime limit (usually 6 or 24 hours), subsequent jobs are only submitted once the prior job has finished. Thus, the continuation jobs must wait in the remote system's batch queue before processing can resume. Many schedulers in use at TeraGrid sites support job chaining (or job dependencies) such that multiple jobs can be submitted at once and queued independently but declared eligible to run only after a prior job has completed. This would be perfect for AMP jobs, as the initial simulation submission could include the 4-8 jobs that are always required to perform the simulation, possibly reducing the cumulative queue wait time. We are currently developing a graphical tool that plots job wait vs. execution time on a Gantt chart for each AMP simulation, as well as calculating aggregate execution wait and run time statistics, in order to understand the impact of queue wait time on various systems. We will then investigate Grid-based (but possibly nonstandard) methods to submit chained jobs on the resources at the providers that are the most tolerant of AMP's computational workloads.
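As an illustration of the job chaining discussed above (shown for direct submission to a PBS/Torque-style local scheduler rather than through GRAM, with a hypothetical job script name), continuation jobs can be declared dependent on their predecessors at submission time so they queue immediately but run only after the prior job succeeds.

import subprocess

def submit_chained(job_scripts):
    # Sketch only: submit a chain of continuation jobs where each becomes
    # eligible to run only after the previous one completes successfully.
    previous_id = None
    job_ids = []
    for script in job_scripts:
        cmd = ['qsub']
        if previous_id is not None:
            # PBS/Torque dependency: run only after the prior job succeeds.
            cmd += ['-W', 'depend=afterok:%s' % previous_id]
        cmd.append(script)
        previous_id = subprocess.Popen(
            cmd, stdout=subprocess.PIPE).communicate()[0].strip()
        job_ids.append(previous_id)
    return job_ids

# e.g., submit_chained(['ga_run.pbs'] * 4) for four sequential GA invocations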
7. CONCLUSIONS

AMP has provided an opportunity to develop a new science gateway targeting TeraGrid computational resources. AMP's straightforward workflow provided an ideal project to explore the use of the Python-based Django web framework for rapid prototyping and development of a science gateway. Our separation of the web interface, processing daemon, and science components simplified the system's architecture and implementation. Furthermore, our use of common Django modules for both the web interface and the workflow daemon greatly reduced the complexity of implementation. The entire workflow was easily implemented using manual Globus command-line client calls to remote scripts and executables, further simplifying debugging and allowing AMP to be configured on remote resources without resource provider intervention. AMP is currently available for friendly user testing, and we anticipate the first extensive use of the system to perform new asteroseismology science using Kepler data in October 2009. In the future, we plan to examine possible applications of AMP's architecture and underlying technology choices to other NCAR science gateway projects.

8. ACKNOWLEDGMENTS

We would like to thank Nancy Wilkins-Diehr and the TeraGrid Gateways Program for assistance turning AMP into a TeraGrid science gateway, Stu Martin for assistance with Globus GRAM auditing, and Tom Scavo for assistance with GridShib. Thanks to Margaret Murray for helping us test GridAMP on TACC resources and to Victor Hazlewood and Rick Mohr for assistance with NICS resources. Paul Marshall performed the initial compilation and run time evaluation of MPIKAIA on several TeraGrid resources. Will Baird developed many prototype AMP components and features, including AMP's utilization of the SIMBAD [3] astronomical database. Michael Oberg prepared and manages the NCAR TeraGrid Service Hosting Platform used to host AMP and GridAMP.

Funding to integrate AMP with TeraGrid resources was provided by the TeraGrid Science Gateways program. Computational time at NCAR was provided by NSF MRI Grants CNS-0421498, CNS-0420873, and CNS-0420985; NSF sponsorship of the National Center for Atmospheric Research; the University of Colorado; and a grant from the IBM Shared University Research program.

9. REFERENCES

[1] Condor Directed Acyclic Graph Manager (DAGMan). http://www.cs.wisc.edu/condor/dagman/.
[2] Django. http://www.djangoproject.com.
[3] SIMBAD Astronomical Database, CDS, Strasbourg, France. http://simbad.u-strasbg.fr/simbad/.
[4] J. Christensen-Dalsgaard. ASTEC – the Aarhus STellar Evolution Code. Journal of Astrophysics and Space Science, 316:13–24, 2008.
[5] J. Cope, C. Hartsough, S. McCreary, P. Thornton, H. M. Tufo, N. Wilhelmi, and M. Woitaszek. Experiences from simulating the global carbon cycle in a grid computing environment. In Proceedings of the Fourteenth Global Grid Forum (GGF 14), Chicago, Illinois, June 2005.
[6] T. S. Metcalfe and P. Charbonneau. MPIKAIA – stellar structure modeling using a parallel genetic algorithm for objective global optimization. Journal of Computational Physics, 185:176–193, 2003.
[7] T. S. Metcalfe, O. L. Creevey, and J. Christensen-Dalsgaard. A stellar model-fitting pipeline for asteroseismic data from the Kepler mission. The Astrophysical Journal, 699:373–382, 2009.
[8] T. Scavo and V. Welch. A grid authorization model for science gateways. Concurrency and Computation: Practice and Experience, 2008.
[9] TeraGrid. Coordinated TeraGrid Software and Services (CTSS). http://www.teragrid.org/userinfo/software/ctss.
[10] TeraGrid. Security and Accounting for TeraGrid Science Gateways. http://www.teragrid.org/gateways/developers/security.php.
[11] TeraGrid. TeraGrid Gateway Security Summit. http://www.teragridforum.org/mediawiki/index.php?title=Gateway_Security_Summit, Jan. 2008.
[12] M. Thomas. Using the Pylons web framework for science gateways. In Grid Computing Environments Workshop (GCE '08), pages 1–9, Nov. 2008.