Styx Grid Services: Lightweight middleware for efficient scientific workflows

J.D. Blower (a), A.B. Harrison (b) and K. Haines (a)

(a) Reading e-Science Centre, Environmental Systems Science Centre, University of Reading, Reading RG6 6AL, UK. Tel.: +44 (0)118 3788741; E-mail: {jdb,kh}@mail.nerc-essc.ac.uk
(b) School of Computer Science, Cardiff University, Cardiff CF24 3AA, UK. Tel.: +44 (0)29 20876964; E-mail: [email protected]

Scientific Programming 14 (2006) 209–216, IOS Press
ISSN 1058-9244/06/$17.00 © 2006 – IOS Press and the authors. All rights reserved

Abstract. The service-oriented approach to performing distributed scientific research is potentially very powerful but is not yet widely used in many scientific fields. This is partly due to the technical difficulties involved in creating services and workflows and the inefficiency of many workflow systems with regard to handling large datasets. We present the Styx Grid Service, a simple system that wraps command-line programs and allows them to be run over the Internet exactly as if they were local programs. Styx Grid Services are very easy to create and use and can be composed into powerful workflows with simple shell scripts or more sophisticated graphical tools. An important feature of the system is that data can be streamed directly from service to service, significantly increasing the efficiency of workflows that use large data volumes. The status and progress of Styx Grid Services can be monitored asynchronously using a mechanism that places very few demands on firewalls. We show how Styx Grid Services can interoperate with Web Services and WS-Resources using suitable adapters.

Keywords: Styx, streaming, third-party transfers, WS-RF, Condor, Globus

1. Introduction

The use of service-oriented architectures (SOAs) in scientific computing is increasing. The principal advantage of the SOA approach is that scientists can access resources such as databases, high-end computing resources, laboratory equipment and sensor networks over the Internet without knowledge of the underlying infrastructure. Several independent services can be combined in a distributed application or workflow to solve a particular problem. For example, a scientist might wish to construct a workflow in which several pieces of data are extracted from databases in different locations, analyzed using a distributed computing resource, then finally visualized on his or her local machine.

At present, however, there are very few examples of scientific communities that work routinely in this way. Part of the reason for this is that the creation of such services and workflows is beyond the technical expertise of most scientists, often due to the complexity of the required middleware [14]. Furthermore, many workflow systems suffer from inherent limitations that are important in the scientific domain. These vary from system to system but commonly include:

– A centralized data flow architecture: all data must pass through the workflow engine.
– A focus on SOAP and XML as the data transport format. It is very inefficient to encode anything other than a small amount of data in XML due to the processing time required and the inflating effect of doing so. The use of SOAP attachments gives a smaller data size than XML but still requires data to be encoded and decoded [13,15,16].
– A notification mechanism that requires the client to listen on incoming ports or to poll the server frequently. This is discussed further in Section 2.1 below.

We describe a service type that addresses the above issues: the Styx Grid Service or SGS. A Styx Grid Service is a service that wraps a command-line (i.e. non-graphical) program and allows it to be run remotely.
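The "inflating effect" named in the list of limitations above can be made concrete with a small calculation. The sketch below assumes that binary data is embedded in XML as Base64 text, which is one common choice and not something the paper specifies; the function names are illustrative only. Base64 maps every 3 input bytes to 4 output characters, so the payload grows by about one third before any XML markup is added.

```python
import base64


def encoded_size(n_bytes: int) -> int:
    """Size in bytes of n_bytes of binary data after Base64 encoding."""
    raw = bytes(n_bytes)  # content is irrelevant to the encoded size
    return len(base64.b64encode(raw))


def inflation_ratio(n_bytes: int) -> float:
    """Ratio of encoded size to raw size (assumes n_bytes > 0)."""
    return encoded_size(n_bytes) / n_bytes


if __name__ == "__main__":
    for n in (3_000, 3_000_000):
        print(f"{n} raw bytes -> {encoded_size(n)} encoded bytes "
              f"(ratio {inflation_ratio(n):.3f})")
```

The ratio is independent of the data volume, so the cost in transfer size and in encoding and decoding time scales linearly with the dataset, which is why it matters for the large datasets discussed here.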
Styx Grid Services are very easy to create, deploy and use [8,10,11] and can be composed into workflows using shell scripts or specialized workflow tools. Workflows that are composed from Styx Grid Services work efficiently with large datasets: data are transported in their most compact binary form and can be streamed directly from service to service in a decentralized data flow architecture. Through the use of wrappers and brokers, SGSs can interoperate with tools and services based on Web Services and the Web Services Resource Framework (WSRF [29]).

The details of how Styx Grid Services are created and used from a user's point of view can be found in previous publications [8,10,11] and the project website [7]. In this paper we shall focus on how the SGS system enables the creation of efficient scientific workflows through the use of direct transfer and streaming of data.

1.1. Related work

A number of systems are addressing the need to support data streaming in a variety of domains. In [12], the authors extend the OSIRIS [26] system to enable data stream processing in the field of healthcare. The data stream management sub-system is constructed from a Peer-to-Peer network of Operators deployed on a variety of devices, including sensors, that accept, transform and output data streams. Different classes of Operators perform different transformations (e.g. noise reduction). Special Web Service Operators interface to the outside world, allowing data to be stored or rendered remotely as well as control parameters to be passed into the network of Operators. The concept of Operators is similar to the use of pipes in the Styx Grid Services system.

In [6] the authors address the limitations of business process modelling techniques in handling data streams. The research focusses on how to manage state transitions within workflow components based on traditional request/response style interactions, specifically how to enable iterative re-execution of a component as a result of receiving data streams. SGS does not limit itself to business process modelling techniques. This gives the architecture flexibility, although it does place the burden of understanding the behaviour of workflow components on the workflow designer.

In the context of Grid computing, data streaming is being addressed with regard to several fields, including geospatial data and astronomical data processing. The Cactus Computational Toolkit [4] has supported streaming for a number of years, specifically for remote visualization purposes, including the ability to stream to multiple clients simultaneously. In [19], the authors present data streaming services based on the NaradaBrokering [2] messaging system. Services are presented that stream GPS data from sensors as well as services that extend the capabilities of the Open Geospatial Consortium (OGC) data services to provide streaming of time-dependent data. The authors note the inefficiency of SOAP and provide data filters to transform streamed data. A similar approach is taken by the UniGrids Streaming Framework [5]: WSRF is used specifically to control and monitor the lifetime and parameters of optimized data streams. The AstroGrid-D [1] project is currently designing a data stream management system for handling astronomical data. Again it uses WSRF and, like the NaradaBrokering example, uses the publish/subscribe pattern for receiving streams.

Although the SGS system supports many of the features of the above systems, it does not aim to support them all. A key goal of the SGS system is to be very lightweight and easy to use, requiring the minimum of dependencies and making it as easy as possible for non-technical users to create and use streaming services. Later in this paper we shall demonstrate how Styx Grid Services can interoperate with other frameworks, such as those based on WSRF, through suitable wrappers and brokers.

2. Styx Grid Services: Architecture

The basis of the SGS system is the well-established Styx protocol for distributed systems [25]. Styx is a key component of the Inferno [17] and Plan 9 [24] operating systems (in Plan 9, Styx is known as “9P”: the current version of Styx is equivalent to 9P2000). In Inferno and Plan 9, applications communicate with all resources using Styx, without knowing whether the resources are local or remote. Styx is essentially a file-sharing protocol, similar in some ways to NFS. However, in a Styx system the “files” are not always literal files on a hard disk. They can represent a block of RAM or the interface to a program, database or physical device. Styx can therefore be used as a uniform interface to access diverse resource types. Whereas in Remote Procedure Call (RPC)-style Web Services the resources are accessed through a set of methods, Styx resources are accessed by reading and writing a set of files, which are organized in a hierarchy of virtual files known as a namespace.

A Styx Grid Service wraps a command-line executable and exposes it to the network as a namespace. This namespace contains files that represent the input and output files of the executable and its command-line arguments. It also contains files that allow the service to be controlled and monitored. A full description of the SGS namespace is not given here for reasons of [...]

[...] SGS namespace to get a piece of status information. Having received the reply, the client immediately sends another message to read from the same file. The server will not respond to this message until the status information has changed. This is permitted by the design of the Styx protocol, which allows read requests and their responses to be decoupled. The use of persistent connections is a key difference between the typical usage patterns of the Styx and HTTP protocols.
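The status-monitoring pattern described above, in which the client reads a status file and the server withholds its reply until the status has changed, can be illustrated with a minimal in-process sketch. This is not the Styx wire protocol or the SGS implementation: the `StatusFile` class, the `monitor` function, the version counter and the state names are all hypothetical, and a `threading.Condition` stands in for the deferred reply that a real server would send over a persistent connection.

```python
import threading
from typing import List, Optional, Tuple


class StatusFile:
    """Hypothetical in-process stand-in for one status file in a namespace."""

    def __init__(self, value: str) -> None:
        self._value = value
        self._version = 0  # incremented on every change
        self._changed = threading.Condition()

    def write(self, value: str) -> None:
        """Server side: update the status and wake any blocked readers."""
        with self._changed:
            self._value = value
            self._version += 1
            self._changed.notify_all()

    def read(self, last_seen: int = -1,
             timeout: Optional[float] = None) -> Tuple[int, str]:
        """Reply only once the status differs from the version the client
        last saw, or once the optional timeout expires."""
        with self._changed:
            self._changed.wait_for(lambda: self._version != last_seen, timeout)
            return self._version, self._value


def monitor(status: StatusFile, final_state: str) -> List[str]:
    """Client side: re-issue the read as soon as each reply arrives."""
    seen: List[str] = []
    version = -1  # no version seen yet, so the first read returns at once
    while True:
        version, value = status.read(last_seen=version)
        seen.append(value)
        if value == final_state:
            return seen


if __name__ == "__main__":
    status = StatusFile("created")
    # Simulated service: change state after short delays.
    threading.Timer(0.1, status.write, args=("running",)).start()
    threading.Timer(0.2, status.write, args=("finished",)).start()
    print(monitor(status, "finished"))
```

In this sketch the client neither polls on a timer nor listens on an incoming port: it always has one outstanding read, and each reply arrives only when there is something new to report. A client that is slow to re-issue its read simply receives the latest status, so intermediate states can be skipped.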