
High Throughput Computing for Spatial Information Processing (HIT-SIP) System on Grid Platform

Yong Xue1,2, Yanguang Wang1, Jianqin Wang1, Ying Luo1, Yincui Hu1, Shaobo Zhong1, Jiakui Tang1, Guoyin Cai1, and Yanning Guan1

1 Laboratory of Remote Sensing Information Sciences, Institute of Remote Sensing Applications, Chinese Academy of Sciences, P. O. Box 9718, Beijing 100101, China
2 Department of Computing, London Metropolitan University, 166-220 Holloway Road, London N7 8DB, UK
[email protected], [email protected]

Abstract. For many remote sensing application projects, the quality of the research or the product is heavily dependent upon the quantity of computing cycles available. Middleware is software that connects two or more otherwise separate applications across the Internet or local area networks. In this paper, we present the High Throughput Computing Spatial Information Processing (HIT-SIP) System (Prototype), which was developed at the Institute of Remote Sensing Applications, Chinese Academy of Sciences, China. Several middleware packages developed for the HIT-SIP system are demonstrated. Our experience shows that it is feasible to use our grid computing testbed for remote sensing information analysis.

1 Introduction

Dozens of satellites constantly collect data about our planetary system, 24 hours a day and 365 days a year. For many remote sensing application projects, the quality of the research or the product is heavily dependent upon the quantity of computing cycles available. Scientists and engineers engaged in this sort of work need a computing environment that delivers large amounts of computational power over a long period of time. Large volumes of satellite data need to be processed and stored in near real time. Such processing is difficult, and sometimes impossible, on a single computer. A computing grid, assembled from a series of middleware components, provides a way to solve this problem. It is not uncommon to find problems that require weeks or months of computation to solve. Such an environment is called a High-Throughput Computing (HTC) environment (http://www.cs.wisc.edu/condor/). The key to HTC is to efficiently harness the use of all available resources. Years ago, the engineering and scientific community relied on a large, centralized mainframe or a supercomputer to do computational work. A large number of individuals and groups needed to pool their financial resources to afford such a machine. Users had to wait for their turn on the mainframe, and they had a limited amount of time allocated. While this environment was inconvenient for users, the utilization of the mainframe was high; it was busy nearly all the time.



As computers became smaller, faster, and cheaper, users moved away from centralized mainframes and purchased personal desktop workstations and PCs. An individual or small group could afford a computing resource that was available whenever they wanted it. The personal computer is slower than the large centralized machine, but it provides exclusive access. Now, instead of one giant computer for a large institution, there may be hundreds or thousands of personal computers. This is an environment of distributed ownership, where individuals throughout an organization own their own resources. The total computational power of the institution as a whole may rise dramatically as the result of such a change, but because of distributed ownership, individuals have not been able to capitalize on the institutional growth of computing power. And, while distributed ownership is more convenient for the users, the utilization of the computing power is lower: many personal desktop machines sit idle for very long periods of time while their owners are busy doing other things (such as being away at lunch, in meetings, or at home sleeping).

Grid computing enables the virtualization of distributed computing and data resources, such as processing, network bandwidth and storage capacity, to create a single system image, granting users and applications seamless access to vast IT capabilities (Foster and Kesselman 1998, Foster et al. 2001). Just as an Internet user views a unified instance of content via the Web, a grid user essentially sees a single, large virtual computer. At its core, grid computing is based on an open set of standards and protocols, e.g. the Open Grid Services Architecture (OGSA), that enable communication across heterogeneous, geographically dispersed environments. With grid computing, organizations can optimize computing and data resources, pool them for large-capacity workloads, share them across networks and enable collaboration. In fact, grid can be seen as the latest and most complete evolution of more familiar developments such as distributed computing, the Web, peer-to-peer computing and virtualization technologies.

There are several well-known grid projects today. The Network for Earthquake Engineering Simulation grid (NEESgrid) (www.neesgrid.org) is intended to integrate the computing environments of 20 earthquake engineering laboratories. The Access Grid (www.fp.mcs.anl.gov/fl/access_grid), launched in 1999, mainly supports lectures and meetings among scientists at facilities around the world. The European DataGrid, sponsored by the European Union, mainly addresses data analysis in high-energy physics, environmental science and bioinformatics. Grid projects focused on spatial information include the Chinese Spatial Information Grid (SIG) (http://www.863.gov.cn), ESA's SpaceGRID (http://www.spacegrid.org), and the US Earth System Grid (http://www.earthsystemgrid.org).

ESA's SpaceGRID is an ESA-funded initiative. The SpaceGRID project is run by an international consortium of industry and research centres led by Datamat. Other members include Alcatel Space (France), CS Systemes d'Information (France), Science Systems Plc (UK), QinetiQ (UK) and the Rutherford Appleton Laboratory of the UK Council for the Central Laboratory of the Research Councils.


Two other aspects of this project are of particular importance for ESA: finding a way to ensure that the data processed by the SpaceGRID can be made available to public and educational establishments, and ensuring that SpaceGRID activities are coordinated with other major international initiatives. The SpaceGRID project aims to assess how Grid technology can serve requirements across a large variety of space disciplines, such as space science, Earth observation, space weather and spacecraft engineering, to sketch the design of an ESA-wide Grid infrastructure, and to foster collaboration and enable shared efforts across space applications. It will analyse the highly complicated technical aspects of managing, accessing, exploiting and distributing large amounts of data, and set up test projects to see how well the Grid performs at carrying out specific tasks in Earth observation, space weather, space science and spacecraft engineering.

The Earth System Grid (ESG) is funded by the U.S. Department of Energy (DOE). ESG integrates supercomputers with large-scale data and analysis servers located at numerous national laboratories and research centers to create a powerful environment for next-generation climate research. Its portal is the primary point of entry into the ESG. ESG collaborators include Argonne National Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, the National Center for Atmospheric Research, Oak Ridge National Laboratory and the University of Southern California/Information Sciences Institute.

The Condor Project has performed research in distributed high-throughput computing for the past 18 years, and maintains the Condor High Throughput Computing resource and job management software, originally designed to harness idle CPU cycles on a heterogeneous pool of computers. In essence a workload management system for compute-intensive jobs, it provides the means for users to submit jobs to a local scheduler and manage the remote execution of these jobs on suitably selected resources in a pool. Condor differs from traditional batch scheduling systems in that it does not require the underlying resources to be dedicated: Condor matches jobs (matchmaking) to suitable machines in a pool according to job requirements and community, resource owner and workload distribution policies, and may vacate or migrate jobs when a machine is required by its owner. Boasting features such as checkpointing (the state of a remote job is regularly saved on the client machine), file transfer and I/O redirection (remote system calls performed by the job can be captured and performed on the client machine, so there is no need for a shared file system), and fair-share priority management (users are guaranteed a fair share of the resources according to pre-assigned priorities), Condor is a very complete and sophisticated package. While providing functionality similar to that of any traditional batch queuing system, Condor's architecture allows it to succeed in areas where traditional scheduling systems fail. As a result, Condor can be used to seamlessly combine all the computational power in a community.

In this paper, we present the High Throughput Computing Spatial Information Processing (HIT-SIP) System (Prototype), which was developed at the Institute of Remote Sensing Applications, Chinese Academy of Sciences, China. The system is introduced in Section 2, several middleware packages developed for the HIT-SIP system are demonstrated in Section 3, and conclusions and further development are addressed in Section 4.



2 High Throughput Computing Spatial Information Processing Prototype System

Remote sensing data processing is characterized by large data volumes and long computing times. The HIT-SIP system on a Grid platform has been developed at the Institute of Remote Sensing Applications, Chinese Academy of Sciences. It is an advanced High-Throughput Computing system specialized for remote sensing data analysis using Condor. Heterogeneous computing nodes, including two sets of Linux and Windows 2000 Professional computers and one Windows XP computer, provide stable computing power (Wang et al. 2003, 2004). The grid pool uses the Condor Java universe to hide the heterogeneity of the nodes. The functional structure of the HIT-SIP system is shown in Figure 1. There are three different ways to access the HIT-SIP system: Web service, the local pool and wireless Web service. The functionalities of HIT-SIP are:
• Remote sensing image processing and analysis: thematic information extraction
• Spatial analysis: overlay, buffering, spatial statistics
• Remote sensing applications: aerosol, thermal inertia and heat flux exchange
• System operations: system status, job management
• Visualization

[Figure 1 (block diagram): a User layer sits on top of Applications, Visualization, Spatial Analysis and Image Processing components; labels in the diagram include 3D Immersive, SYNTAM Aerosol, DDV Aerosol, Classification, Thematic Information Extraction, Spatial Query, Buffering and Overlay.]

Fig. 1. The functional structure of the HIT-SIP system

The key technology of HIT-SIP is the integration of Web Services with Condor, i.e. a Grid implementation of Condor. There are several ways to build remote sensing applications on top of Condor: the Condor command-line tools invoked from JSPs and Servlets, Web Services (Xue et al. 2002), the Condor GAHP, etc.


In HIT-SIP, we serve users who access the system with web browsers on PCs or on mobile devices such as PDAs by using Java Servlets, hosted in the Resin application server, to invoke the Condor command-line tools; and we provide programmers with a WSDL description that specifies how to invoke the spatial information processing functionality of our grid. A minimal sketch of the command-line invocation pattern follows.
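As an illustration of this pattern (a minimal sketch under our own naming, not code taken from HIT-SIP), the following Java fragment shows how a server-side component could call the Condor command-line tool condor_submit for a submit description file that has already been prepared:

```java
import java.io.IOException;

// Hypothetical helper: invokes the Condor command-line tool condor_submit for a
// prepared submit description file, as a Servlet handler in HIT-SIP might do.
public class CondorSubmitter {

    /** Submits the job described by submitFile and returns condor_submit's exit code (0 = queued). */
    public static int submit(String submitFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("condor_submit", submitFile)
                .redirectErrorStream(true)   // merge stderr into stdout for simpler logging
                .start();
        return p.waitFor();
    }
}
```

The WSDL description mentioned above exposes the same kind of functionality to remote programmers as a Web Service.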

Fig. 2. The structure of the HIT-SIP system

Fig. 3. The front page of the HIT-SIP system



Web Services were selected because, compared to other distributed computing technologies such as CORBA, RMI and EJB, they are a more suitable candidate for Internet-scale applications. First, Web Services are based on a collection of open standards, such as XML, SOAP, WSDL and UDDI; they are platform independent and programming language independent because they use standard XML. Second, Web Services use HTTP as the communication protocol, which is a big advantage because most of the Internet's proxies and firewalls will not interfere with HTTP traffic. Figure 2 shows the basic structure of HIT-SIP. Figure 3 shows the front page of the HIT-SIP system, which can be accessed as a web service.

3 Middleware Development for the HIT-SIP System

Middleware makes resource sharing seem transparent to the end user, providing consistency, security, privacy, and capabilities. Middleware is software that connects two or more otherwise separate applications across the Internet or local area networks (www.nsf-middleware.org). More specifically, the term refers to an evolving layer of services that resides between the network and more traditional applications for managing security, access and information exchange, in order to:
• let scientists, engineers, and educators transparently use and share distributed resources, such as computers, data, networks, and instruments;
• develop effective collaboration and communication tools, such as Grid technologies, desktop video, and other advanced services, to expedite research and education; and
• develop a working architecture and approach that can be extended to the larger set of Internet and network users.
The HIT-SIP system is a fundamental, bare computing platform and cannot by itself run remote sensing image processing programs, so remote sensing data processing middleware running on the grid pool is indispensable. With such middleware, ordinary users can use the heterogeneous grid and share its computing power to process remote sensing data as if on one supercomputer. The processing algorithms for remote sensing images can be divided into three basic categories according to the image area involved in each operation:
a) Point operators. A point operator changes each pixel of the original image into a new pixel of the output image according to a given mathematical function that transforms a single sample point. The transformation does not refer to other samples, and other samples do not take part in the calculation.
b) Local operators. A local operator transforms a single sample point while its neighbouring samples take part in the calculation; the output value depends on the pixel itself and its neighbours.
c) Global operators. A global operator incorporates all of the image's samples into the calculation when it processes each sample.
A brief sketch contrasting a point operator and a local operator is given below.
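The following minimal Java sketch (ours, not part of HIT-SIP) contrasts a point operator, here a simple threshold, with a local operator, here a 3 x 3 mean filter, on a single-band image stored row-major; a global operator would additionally need quantities computed over the entire image before any output pixel could be produced.

```java
// Illustrative operator examples on a single-band, row-major image of width w and height h.
public final class Operators {

    /** Point operator: each output pixel depends only on the corresponding input pixel. */
    public static float[] threshold(float[] img, float t) {
        float[] out = new float[img.length];
        for (int i = 0; i < img.length; i++) {
            out[i] = img[i] > t ? 1f : 0f;
        }
        return out;
    }

    /** Local operator: each output pixel is the mean of its 3 x 3 neighbourhood (clipped at the borders). */
    public static float[] mean3x3(float[] img, int w, int h) {
        float[] out = new float[img.length];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float sum = 0f;
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int nx = x + dx, ny = y + dy;
                        if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                            sum += img[ny * w + nx];
                            n++;
                        }
                    }
                }
                out[y * w + x] = sum / n;
            }
        }
        return out;
    }
}
```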



There are three approaches to dividing a sequential task into parallel subtasks: data partition, task partition and mixed partition (Cai et al. 2004, Hu et al. 2004a, b). Data partition splits the image data into a number of sub-images, and each subtask processes one sub-image; in this case the function of each subtask is identical and only the data differ. Task partition splits the computation into a number of steps, and efficiency is increased through the parallel execution of these steps. Mixed partition combines the two approaches when designing the parallel algorithm. Grid computing takes the Internet as its computing platform and thus differs from traditional parallel or distributed computing: the bandwidth of the network, its stability, the global distribution of the nodes, security and so on must all be considered, which makes data partition the most suitable approach. If an algorithm cannot be designed by means of data partition, we can try to split it into relatively independent subtasks with little interaction. If that cannot be done either, we conclude that grid computing is not suitable for the algorithm according to the criteria above. Through analyzing all sorts of algorithms, we conclude that algorithms built from point operators or local operators are easy to split into relatively independent subtasks, whereas for global operators only a special data partition, possibly combined with task partition, can adapt the algorithm to the conditions of grid computing. A small sketch of row-wise data partition is given below.
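As a concrete example of data partition (an illustrative sketch with hypothetical names, not the HIT-SIP code), the helper below divides an image into near-equal horizontal strips so that each strip can be handed to an independent Condor job; for a local operator the strips would in practice be extended by a small overlap so that neighbourhoods at strip borders remain complete.

```java
// Hypothetical row-wise data partition: split 'rows' image rows into 'nParts' near-equal strips.
public final class RowPartitioner {

    /** Returns nParts pairs of {startRow, rowCount} covering all rows exactly once. */
    public static int[][] partition(int rows, int nParts) {
        int[][] parts = new int[nParts][2];
        int base = rows / nParts;     // minimum rows per strip
        int extra = rows % nParts;    // the first 'extra' strips get one additional row
        int start = 0;
        for (int i = 0; i < nParts; i++) {
            int count = base + (i < extra ? 1 : 0);
            parts[i][0] = start;
            parts[i][1] = count;
            start += count;
        }
        return parts;
    }
}
```

For example, partition(5000, 4) yields four strips of 1,250 rows each.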

3.1 Middleware for Aerosol Retrieval from MODIS Data

GCP-ARS (Grid Computation Platform for Aerosol Remote Sensing), developed by the Telegeoprocessing Research Group at the Institute of Remote Sensing Applications (IRSA), Chinese Academy of Sciences (CAS), is an advanced high throughput computing grid middleware that supports the execution of aerosol remote sensing retrieval over a geographically distributed, heterogeneous collection of resources (Tang et al. 2004a, b). In GCP-ARS, the Condor system provides the technologies for resource discovery, dynamic task-resource matching, communication, result collection, etc. The LUTGEN module gives users the ability to generate the look-up table (LUT) on the grid computing platform, and the LOOK-UP module looks up and interpolates the LUT, either generated by LUTGEN or supplied by the user, to determine the aerosol optical thickness. GCP-ARS provides a user-friendly Windows-based GUI (Graphical User Interface), automatic partitioning of computation tasks, automatic task submission, execution progress monitoring, sub-result collection and final result assembly, which shields users from the complexities of grid application programming and of the Grid platform. The GCP-ARS architecture mainly consists of three entities: clients, a resource broker, and producers. Aerosol retrieval is launched through a client GUI for execution at producers that share their idle cycles through the resource broker. The resource broker manages application execution, resource management and result collection by means of the Condor system. The MODIS data chosen for our test on GCP-ARS were acquired at 03:20 UTC on 11 August 2003. The image size is 512 x 512 with a spatial resolution of 500 m, covering most of the North China Plain and the Bohai Gulf. Figure 4 shows the final AOT (Aerosol Optical Thickness) image retrieved using GCP-ARS.
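To make the LOOK-UP step concrete, the fragment below is a minimal, illustrative sketch (not the GCP-ARS implementation): given a table of simulated top-of-atmosphere reflectances, one per candidate AOT value, an observed reflectance is inverted to an AOT by linear interpolation between the two bracketing table entries. The array names, and the assumption that the simulated reflectance increases monotonically with AOT, are ours.

```java
// Hypothetical one-dimensional LUT inversion: aot[] and simulatedReflectance[] are
// parallel arrays, with simulatedReflectance assumed to increase monotonically with AOT.
public final class AotLookup {

    /** Returns the AOT whose simulated reflectance matches the observation, by linear interpolation. */
    public static double invert(double[] aot, double[] simulatedReflectance, double observed) {
        int n = aot.length;
        if (observed <= simulatedReflectance[0]) return aot[0];          // below table range
        if (observed >= simulatedReflectance[n - 1]) return aot[n - 1];  // above table range
        for (int i = 1; i < n; i++) {
            if (observed <= simulatedReflectance[i]) {
                double t = (observed - simulatedReflectance[i - 1])
                         / (simulatedReflectance[i] - simulatedReflectance[i - 1]);
                return aot[i - 1] + t * (aot[i] - aot[i - 1]);
            }
        }
        return aot[n - 1]; // unreachable given the range checks above
    }
}
```

This shows only the final inversion along the AOT axis; the generation of the table itself is what the LUTGEN module distributes across the grid platform.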



[Figure image: colour-classed map of AOT (470 nm) from MODIS, with a legend of AOT classes from 0.0 to 1.0 plus a no-data class.]

Fig. 4. AOT image of MODIS Band 3 at 470 nm retrieved using GCP-ARS

3.2 Middleware for NDVI Calculation from MODIS Data

Vegetation indices (VIs) are spectral transformations of two or more bands designed to enhance the contribution of vegetation properties and allow reliable spatial and temporal inter-comparisons of terrestrial photosynthetic activity and canopy structural variations. The Normalized Difference Vegetation Index (NDVI), which is related to the proportion of photosynthetically absorbed radiation, is calculated from atmospherically corrected reflectances in the visible and near-infrared channels as (CH2 - CH1) / (CH2 + CH1), where the reflectance values are the surface bidirectional reflectance factors for MODIS band 1 (620 - 670 nm) (CH1) and band 2 (841 - 876 nm) (CH2); a minimal sketch of this per-pixel calculation is given below. Figure 5 shows the architecture of the Condor pool supported mobile geoprocessing system. Figure 6 shows one of the GUIs.
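A minimal sketch of the per-pixel NDVI calculation (illustrative only; the two input arrays are assumed to hold atmospherically corrected reflectances of equal length):

```java
// NDVI = (CH2 - CH1) / (CH2 + CH1), computed per pixel from MODIS band 1 (ch1) and band 2 (ch2).
public final class Ndvi {

    /** Returns the NDVI image; NaN marks pixels where the band sum is zero. */
    public static float[] compute(float[] ch1, float[] ch2) {
        float[] ndvi = new float[ch1.length];
        for (int i = 0; i < ch1.length; i++) {
            float sum = ch1[i] + ch2[i];
            ndvi[i] = (sum == 0f) ? Float.NaN : (ch2[i] - ch1[i]) / sum;
        }
        return ndvi;
    }
}
```

Because NDVI is a point operation, each sub-image can be processed by a separate Condor job without any communication between jobs.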

[Figure diagram: components include a mobile client, a WAP gateway, the mobile communication network, the Internet, a WAP server with Condor, and the Condor pool.]

Fig. 5. The architecture of the Condor pool supported mobile geoprocessing system

One track of MODIS data (800 x 5000 pixels) used for the calculation of NDVI was acquired on 11 August 2003. We submitted the job from a WAP mobile phone to our Condor pool. Figure 6 shows the interface on the WAP phone. We divided the whole image into 4 parts; in theory, we could divide it into any number of parts. The 4 jobs were submitted to our Condor pool. After each calculation, the results were merged into one image (Wang et al. 2004). All these operations were performed automatically. A small sketch of the merge step is given below.
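The merge step can be as simple as concatenating the row-major strips returned by the sub-jobs; the sketch below (ours, assuming the strips arrive in their original order) illustrates it.

```java
// Hypothetical merge of NDVI strips back into one row-major image.
public final class StripMerger {

    /** Concatenates the strips, in order, into a single pixel array. */
    public static float[] merge(float[][] strips) {
        int total = 0;
        for (float[] strip : strips) total += strip.length;
        float[] image = new float[total];
        int offset = 0;
        for (float[] strip : strips) {
            System.arraycopy(strip, 0, image, offset, strip.length);
            offset += strip.length;
        }
        return image;
    }
}
```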

Fig. 6. The interface on the WAP phone

4 Conclusions

Grid computing and Web Service technologies are effective methods for sharing remote sensing data and processing middleware code. It is feasible to use our grid computing testbed for remote sensing information modelling. Determining the most efficient image size and the best partition of a large image for our testbed will be the subject of further research, together with implementing an algorithm that reads and divides remote sensing images automatically and transfers the results back to the submitting machine as a single data file.

Acknowledgement

This publication is an output from the research projects "CAS Hundred Talents Program" funded by Chinese Academy of Sciences, "Grid platform based aerosol fast monitoring modeling using MODIS data and middlewares development" (40471091) funded by National Natural Science Foundation of China (NSFC), and “Dynamic Monitoring of Beijing Olympic Environment Using Remote Sensing” (2002BA904B07-2) and “863 Plan – Remote Sensing Data Processing and Service Node” funded by the Ministry of Science and Technology, China.

References

[1] I. Foster, "What is the Grid: A Three-Point Checklist", Grid Today, 1 (6), 2002.
[2] I. Foster, C. Kesselman (Eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 1998.



[3] I. Foster, C. Kesselman, S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of Supercomputer Applications, 15(3), 2001, pp. 200-222.
[4] Jianqin Wang, Yong Xue, and Huadong Guo, 2003, A Spatial Information Grid Supported Prototype Telegeoprocessing System. In Proceedings of the 2003 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2003), Toulouse, France, 21-25 July 2003, v 1, pp. 345-347.
[5] Yincui Hu, Yong Xue, Jianqin Wang, Guoyin Cai, Shaobo Zhong, Yanguang Wang, Ying Luo, Jiakui Tang, and Xiaosong Sun, 2004, Feasibility Study of Geo-Spatial Analysis using Grid Computing. Lecture Notes in Computer Science, Vol. 3039, pp. 971-978.
[6] Guoyin Cai, Yong Xue, Jiakui Tang, Jianqin Wang, Yanguang Wang, Ying Luo, Yincui Hu, Shaobo Zhong, and Xiaosong Sun, 2004, Experience of Remote Sensing Information Modelling with Grid Computing. Lecture Notes in Computer Science, Vol. 3039, pp. 1003-1010.
[7] Jianqin Wang, Xiaosong Sun, Yong Xue, Yanguang Wang, Ying Luo, Guoyin Cai, Shaobo Zhong, and Jiakui Tang, 2004, Preliminary Study on Unsupervised Classification of Remotely Sensed Images on the Grid. Lecture Notes in Computer Science, Vol. 3039, pp. 995-1002.
[8] Yanguang Wang, Yong Xue, Jianqin Wang, Ying Luo, Yincui Hu, Guoyin Cai, Shaobo Zhong, Jiakui Tang, and Xiaosong Sun, 2004, Ubiquitous Geocomputation using Grid technology – a Prime. In Proceedings of IEEE/IGARSS 2004, Anchorage, Alaska, 20-24 September 2004.
[9] Yincui Hu, Yong Xue, Jianqin Wang, Guoyin Cai, Yanguang Wang, Ying Luo, Jiakui Tang, Shaobo Zhong, Xiaosong Sun, and Aijun Zhang, 2004, An Experience on Buffer Analyzing in Grid Computing Environments. In Proceedings of IEEE/IGARSS 2004, Anchorage, Alaska, 20-24 September 2004.
[10] Jiakui Tang, Yong Xue, Yanning Guan, Linxiang Liang, Yincui Hu, Ying Luo, Guoyin Cai, Jianqin Wang, Shaobo Zhong, Yanguang Wang, Aijun Zhang, and Xiaosong Sun, 2004, Grid-based Aerosol Retrieval over Land Using Dark Target Algorithm from MODIS Data. In Proceedings of IEEE/IGARSS 2004, Anchorage, Alaska, 20-24 September 2004.
[11] Jiakui Tang, Yong Xue, Yanning Guan, and Linxiang Liang, 2004, A New Approach to Generate the Look-Up Table Using Grid Computing Platform for Aerosol Remote Sensing. In Proceedings of IEEE/IGARSS 2004, Anchorage, Alaska, 20-24 September 2004.
[12] Gang Xue, Matt J. Fairman, Simon J. Cox, 2002, Exploiting Web Technologies for Grid Architecture. In Proceedings of the 2002 Cluster Computing and the Grid, pp. 272-273.