A VRE Programming Toolkit for Quantitative E-Social Science

Putting Social Science Applications on the Grid Rob Crouchley1, Ties van Ark1, John Pritchard1 John Kewley2, Rob Allan2, Mark Hayes3 and Lorna Morris3 1e-Social Science Centre of Excellence and Collaboratory for Quantitative e-Social Science, University of Lancaster 2e-Science Centre, CCLRC Daresbury Laboratory 3Cambridge e-Science Centre, University of Cambridge E-mail address of corresponding author: [email protected] Abstract. As e-Social Science develops, there will be a growing user base of social researchers who are keen to share resources and applications in order to tackle some of the large-scale research challenges that confront us. They will be aware of the potential of e-Science technology to provide collaborative tools and provide access to distributed computing resources and data. However social scientists are not ideally catered for by the current Grid middleware and often lack the extensive programming skills to use the current infrastructure to the full and to adapt their existing “heritage” applications. In late 2003, a lightweight client toolkit that is easily installable, yet provides extensible access mechanisms to Grid resources was seen as a possible solution. By implementing a client-side polling strategy, problems associated with institutional firewalls for Grid protocols can be reduced. This is an alternative strategy to a Grid portal, easing the access problems while still sharing the same underlying services and infrastructure. A prototype library called GROWL: Grid Resources on Workstation Library was developed to use a client-server model to interface to existing Grid Services from applications written in C, C++, Fortran and R. Together with its associated wrapper services, GROWL is now being further developed into a number of demonstrators as part of the JISC-funded Virtual Research Environment (VRE) Programme by a collaboration from CCLRC Daresbury Laboratory and the Universities of Lancaster and Cambridge [1]. This paper describes the GROWL library and how it could be used to Grid-enable some computationally demanding statistical models for e-Social Science applications. 1. Background The need for lightweight client toolkits was articulated in late 2003. Chin and Coveney [2] listed a number of problems associated with middleware packages then existing which were seen as a barrier to the uptake of Grid technologies, even in the computational sciences such as chemistry and physics: “We have attempted to make use of Grid technologies for serious scientific work, as part of the EPSRC-funded Reality Grid e-Science pilot project. We have encountered serious middleware-related problems which are hindering scientific progress with the Grid: • The existing toolkits have an excessively heavy set of software and administrative requirements, even for relatively simple demands from applications; • Existing toolkits are painful and difficult to install and maintain, due to excessive reliance on custom-patched libraries, poor package management, and a severe lack of documentation for end-users; • Existing standards bodies and the task forces within the UK e-Science programme are not engaging sufficiently with the applications community, and run a substantial risk of producing and implementing Grid architectures which are irrelevant to the requirements of application scientists.” They in turn cite Gabriel [3] who gives a cogent argument for the benefits for programming projects adopting a design philosophy of simplicity in cases where “the right thing is too complex” and conclude: “We argue that it is important to develop a simple, lightweight Grid middleware which is ‘good enough’ for rapid adoption, rather than taking longer to develop a solution which will, supposedly, suit all needs. Such a toolkit must be: • substantially more portable, lightweight, and modular in design; • produced in very close collaboration with application scientists; • sufficiently well-documented that end-users will be able to port existing codes to use Grid techniques with the minimum of hassle.” It was this that prompted the Grid Technology Group at CCLRC Daresbury Laboratory to write the prototype GROWL: Grid Resources On Workstation Library [4] toolkit. This toolkit provides an easy-to-use client-side interface and is being further extended in the JISC-funded VRE project, in collaboration with the Universities of Cambridge and Lancaster. See project Web site at http://www.growl.org.uk. We note there is discussion in the Applied Computer Science community on End-User Development, i.e. tools that empower users to create their own software solutions. Methods, techniques and tools are discussed which support non-programmers to adapt their applications in an intuitive way [8]. We hope that the GROWL toolkit will be seen to enhance this possibility as far as Grid computing is concerned. It is also of possible interest to note that because of the distributed nature of the team working on this project we are adopting a very modular approach [9]. Modules are mostly independent so that the end user can select which ones are appropriate for compilation and linking into the final application. It also helps considerably with software engineering and management of the project. 2. Introduction to GROWL Our aims are to encourage the uptake of Grid-based computing and distributed data management, focusing on the issues which may hinder or facilitate end-user application development. We refer to the difficulties identified, as the “client problem” and suggest a solution building upon the existing prototype GROWL library to produce a truly lightweight extensible toolkit which complements other solutions. Most developers of scientific software get optimal performance and functionality by linking their bespoke applications to one or more specialised and highly tested libraries, e.g. for numerical algorithms, visualisation or data management. GROWL adopts the same model and extends it to the scenario of Grid computing. GROWL has an easy-to-use client-side application programming interface (API) to a variety of open source libraries and services. GROWL uses basic Web Services Technologies (SOAP, WSDL and UDDI), for communication although this is hidden from the user by the library interface.1 Some of the applications we are considering such as Stata [17] provide facilities for plug-ins to be used to provide additional functionality such as Grid-enabling via GROWL. In this way they can continue to be used as before, but have the capability to call remote high-performance algorithms on parallel computers or access remotely-stored data when necessary. 1 Please see documentation on Web Services for definitions of these abbreviations. 2 End-user Local O/S GROWL passes data and GSI, application e.g. requests across firewalls using MyProxy workstation Web services GridFTP, GROWL client invoked etc. as a function call Globus- GROWL job-run etc. services Remote O/S, e.g. SABRE or parallel computer other remote app Figure 1: GROWL architecture with Globus showing client and server On the server side GROWL accesses application wrappers and services, in the main developed in other projects. Typically these might be wrappers to the Globus v2 toolkit [20] for large-scale Grid resources. Globus itself has a C API, which we are using, but requires holes to be “punched” through firewalls for inter-institution communications with its own protocol and security interface. Containing this on the server side enables system administrators to keep a close control of any potential security risks. The list of required services is still being worked on, but is likely to be derived from software such as HPCPortal, DataPortal, InfoPortal (all from CCLRC), SRB (the Storage Resource Broker from San Diego Supercomputer Centre) [18], Condor [19] and NetSolve, the distributed numerical linear equation solver library from Jack Dongarra’s group in Tennessee. Alongside these associated services is a procedure for enabling a particular service to be linked into the infrastructure, whether it be written in C, C++, Perl, Python, PHP, R or Java. This is illustrated in Figure 1. The basic services in GROWL such as authentication can also be used on the server side simply by installing the GROWL library there too. We term this usage a “GROWL Grid”. To the end user however, GROWL is a purely client side Grid programming environment, it does not help the user create Web or Grid services, nor does it help the user put their computer on the Grid. The software and functionality is kept to a minimum for ease of installation and use on a workstation or PC. For social scientists, particularly those engaged in statistical computing, this will be provided through APIs. Those already familiar with Stata or R can then immediately become part of a large Virtual Research Environment. The graphical user interface R-Commander will also be used by extending its menu capability to include GROWL functions as indicated in Figure 2. Particular requirements of the project are that GROWL be “lightweight” and “extensible”. Lightweight implies that the GROWL should be extremely easy to install, with functionality minimal but sufficient for the user requirements we will identify in target application areas. Extensible means that it should be possible to easily extend GROWL to provide interfaces to additional middleware services or to use additional security mechanisms such as Shibboleth and PERMIS. 3 ± Your heritage application Files View GROWL Tools Help - Job submission - Job progress - Job results GROWL

Load more