Extending an open source spatial database with geospatial image support: An image mining perspective

March 2009

by

Muhammad Imran

Thesis submitted to the International Institute for Geo-information Science and Earth Observation in partial fulfilment of the requirements for the degree in Master of Science in Geoinformatics.

Degree Assessment Board

Thesis advisors: Dr. Ir. R.A. (Rolf) de By and Dr. Ir. W. (Wietske) Bijker
Thesis examiners: Chair: Prof. Dr. A. Stein; External examiner: Dr. B.G.H. Gorte

INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION

ENSCHEDE, THE NETHERLANDS

Disclaimer

This document describes work undertaken as part of a programme of study at the International Institute for Geo-information Science and Earth Observation (ITC). All views and opinions expressed therein remain the sole responsibility of the author, and do not necessarily represent those of the institute.

Abstract

The nature of vector data is relatively constant, and it is revised less frequently than remotely sensed earth observation data. Remote sensing images are nowadays collected every 15 minutes from satellites such as Meteosat. In the coming years, very high spatial resolution data is expected to be available freely and frequently. Integrated GIS and remote sensing methods can incorporate different data sources to find attribute associations and patterns of change for knowledge discovery and change detection. GIS-based data such as vector data and DEMs are overlaid with image data, and results are taken up in a GIS for further processing and analysis. A platform is required to efficiently store, retrieve and manipulate such image data as layers, just like other GIS data layers, for hybrid GIS/RS analysis. In principle, spatial databases are the most suitable candidates for such a platform. Our work aims to investigate the open source spatial database PostgreSQL/PostGIS (PG/PG) as such a platform, to provide a solution for image support and an overall framework for integrated remote sensing and GIS analysis. This goes well beyond mere storage and retrieval of images in spatial databases. The requirements and available open source libraries were extensively studied to provide such image support. The TerraLib library was proposed and analysed to extend the PG database with image support. To demonstrate the application developed in this study, Meteosat Second Generation (MSG) image data for a larger part of Europe was extracted from the ITC data receiver. An application programme was written to construct a time-series image database for the extracted image data with the PG/PG DBMS. A mining application to detect cloud patterns from time-series image and vector data stored in the PG database was developed using the TerraLib conceptual schema. For this, an extensive study of data mining methods was carried out.

A statistical data mining method based on principal components analysis was adopted to extract cloud features for the Netherlands from the time-series image data. Using this research platform and the cloud pattern detection case application, various image mining scenarios were conducted to provide a framework for integrated image and vector data analysis on top of the DBMS technology. This framework is extremely useful for studying spatio-temporal phenomena with seasonal or long intervals, and for region-based studies where regions on a remote sensing image are extracted by vector data.

Keywords: spatial database image support, integrated remote sensing and GIS analysis, data mining, cloud pattern detection, image analysis


Contents

Abstract

List of Figures

List of Tables

Acknowledgements

1 Introduction
  1.1 Motivation and problem statement
  1.2 Research objective
    1.2.1 Research sub-objectives
    1.2.2 Research questions
    1.2.3 Background
  1.3 Project set-up
    1.3.1 Method 1
    1.3.2 Method 2
  1.4 Thesis structure

2 Literature review
  2.1 Introduction
  2.2 Raster data model
  2.3 Data model for image storage inside PG/PG
    2.3.1 Functions for integrated vector/raster analyses
  2.4 Proposed platform for image mining

3 Data mining methods
  3.1 Introduction
  3.2 Classical data mining
    3.2.1 Statistics
    3.2.2 Database-oriented approaches to data mining
    3.2.3 Machine learning approaches for data mining
  3.3 Spatial data mining
    3.3.1 Spatial statistics
    3.3.2 Spatial database approach to data mining
  3.4 Image mining
    3.4.1 Low-level image analysis to feature extraction


    3.4.2 High-level knowledge discovery
    3.4.3 Image mining from integrated image/GIS data analysis
  3.5 Summary

4 A database application development method using TerraLib
  4.1 Introduction
  4.2 TerraLib, TerraView and PostgreSQL/PostGIS set-up for image mining
    4.2.1 TerraLib dependencies on open source third party libraries
  4.3 TerraLib application development
  4.4 Conceptual data model
    4.4.1 Data model for storage
    4.4.2 Data model for visualization
    4.4.3 Image data handling in TerraLib Database
  4.5 Summary

5 Application scenarios for image mining: Results and Discussions
  5.1 Introduction
  5.2 Image mining guided by GIS data
    5.2.1 Introduction
    5.2.2 The Data preparation
    5.2.3 Method and results
    5.2.4 Discussion
  5.3 Extending TerraView for a temporal image query
    5.3.1 Introduction
    5.3.2 Method and results
  5.4 Database-oriented approaches to data mining
    5.4.1 Introduction
    5.4.2 Method and results
  5.5 Conclusion

6 Conclusions and Recommendations
  6.1 Conclusions
  6.2 Recommendations

A Source code for creating time-series image database in PG

B Source code for image mining application scenario Section 5.2

C Source code for image mining application scenario Section 5.3

D Source code for image mining application scenario Section 5.4

Bibliography

List of Figures

1.1 Metadata and Data types with some important fields
1.2 Flow diagram for providing raster support extending PostGIS with CHIP datatype

2.1 Design levels and associated design issues
2.2 The current open source software and related libraries [32]

4.1 A set-up for cloud detection image mining application
4.2 Singleton design pattern adopted for TerraLib
4.3 Factory design pattern adopted for TerraLib
4.4 Strategy design pattern adopted for TerraLib
4.5 Iterator design pattern adopted for TerraLib
4.6 TerraLib software architecture [79]
4.7 Conceptual data model related to source domain for image and vector data storage in PG, modified from [29]

5.1 Clipping of research area from MSG satellite data with vector data
5.2 An image mining process with integrated image and vector analysis with TerraLib on top of the DBMS technology
5.3 The resulting principal components for two dates at 14:00 hours
5.4 Comparison of image size on disk with size in database using compression
5.5 The views/themes populated as a result of temporal query
5.6 A sequence of steps in a mining process to generate attribute data in the PG database
5.7 Attribute data for PC images as a result of PCA algorithm applied on time-series MSG data
5.8 Time-series cloud patterns analysis for December 13, 2008
5.9 Time-series cloud patterns analysis for December 16, 2008


List of Tables

3.1 Statistical methods for data mining
3.2 Statistical methods for spatial data mining

4.1 Third-party libraries used by TerraLib for image support in PG

5.1 Image size on the disk and in the database


Acknowledgements

I would like to sincerely thank Dr. Ir. R.A. (Rolf) de By and Dr. Ir. W. (Wietske) Bijker for their support and guidance throughout this work.

I would like to cordially thank Prof. Dr. A. Stein for promoting and motivating me from my first day at ITC until today.

I would like to thank Dr. Javier Morales for helping me in handling -specific issues.

I would like to affectionately thank all teachers for the interesting discussions during the GFM course work.

I take this opportunity to appreciate my fellow students in the GFM programme and my friends in Enschede. I truly believe that you all are great and caring. To Adil, Asif, Fatemma, Gufrana, Khalil, Luis, Pramod, Salma, Swati, Tahir, Tuul: many thanks for sharing your warm friendship with me.

I would like to thank my mom and dad for their blessings.


Chapter 1

Introduction

1.1 Motivation and problem statement

Mining patterns of change and association in time-series image and other GIS data in a spatial database is a challenging and active area of research. It is all the more challenging for open source spatial databases, as these do not provide effective image support alongside other GIS data. An open source database platform that allows us to work with vector and image spatial data in an integrated way would be a highly interesting research vehicle. Such a DBMS could be used, for instance, in advanced applications with intensive spatial querying, such as spatial data mining and spatial change detection. Images inside a spatial database have been a topic of research since the mid-1980s; however, geospatial images have not been supported in open source spatial databases. Only recently have open source databases come to support vector data, and spatial analysis based on that data in a declarative way, but they do not support raster image data. Remote sensing image data is space-oriented and provides pixel spectral characteristics for a phenomenon at some location. Vector data is object-oriented and provides the object characteristics at the same location in terms of shape, topology, texture, colour, and so on. Image support in a spatial database requires that images can reside in the database in an integrated way with other spatial and non-spatial data. These images can then be queried, manipulated and analysed in a seamless way. Image storage on disks outside the database is optimal when the objective is only visualization, for instance for static web applications. Migrating image data inside the database is most realistic when the objective is to provide integrated image/vector spatial data analysis on image and vector data stored in a large spatial database. Many desktop GIS applications allow overlaying of remotely sensed image data with GIS data, for instance for fast image classification.
But most current GIS packages follow separate analysis procedures for separate data structures, through a time-consuming monolithic system of functions. Further, these GIS applications can handle a single image at a time and are not suitable for mining a large amount of time-series image data. Integrated spatial data analysis is bound to completely change the development of GIS technology, enabling a transition from the monolithic systems of today to


the generation of spatial information appliances based on seamless, integrated, and generic functions independent of spatial data structures [1]. Geometric co-registration of a remote sensing image and GIS vector data layers provides both spectral and spatial information at a geographic location. Such a layer concept can be implemented with a spatial database to develop advanced database applications, for instance image mining and land use/land cover change detection. The common approach for mining remote sensing databases uses rules created by analysts; however, incorporating GIS information and human expert knowledge with digital image processing improves remote sensing image analysis [2]. Providing image support in an open source spatial database for integrated image analysis, for image mining or other advanced database applications, requires investigation in three prominent and interdependent areas:

1. Image data handling.

• Efficient image storage in and retrieval from a database. Images need to be partitioned into tiles and pyramids and stored in a database table for fast retrieval. These tiles are then accessed by a client on request. Tile retrieval should further benefit from the hierarchical storage structure, query optimization, adequate indexing, compression and partitioning features of the database [3].
• Functions for query and administration of the image data.

2. Image data manipulation.

• Single-image operations, for instance image segmentation and classification.
• Multi-image operations, for instance overlay operations such as spectral ratios [4]. Overlay operations require overlaying one raster layer with another raster or a rasterized vector layer. Such overlay functions are required for seamless and integrated analysis in our proposed integrated image/vector mining application.
• Data merging and fusion functions.

3. Image data visualization.

• Visualization of overlaid image and vector data obtained from a database.
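The tiling-and-pyramid requirement in item 1 can be made concrete with a small sketch. This is pure Python and purely illustrative: the 256-pixel tile size and the 3712 × 3712 image dimensions below are assumptions for the example, not values taken from this thesis.

```python
import math

def tile_grid(width, height, tile=256):
    """Number of tiles per axis when an image is split into fixed-size tiles.

    The last tile in each axis is padded, as is common in DBMS raster schemes.
    """
    return math.ceil(width / tile), math.ceil(height / tile)

def pyramid_levels(width, height, tile=256):
    """Halve the image until it fits in one tile; count the resolution levels.

    Level 0 is the full-resolution image.
    """
    levels = 0
    while width > tile or height > tile:
        width, height = math.ceil(width / 2), math.ceil(height / 2)
        levels += 1
    return levels + 1

# An assumed 3712 x 3712 full-disk image split into 256-pixel tiles:
print(tile_grid(3712, 3712))       # tiles per axis: (15, 15)
print(pyramid_levels(3712, 3712))  # resolution levels including level 0: 5
```

A database would store one row per tile per pyramid level, which is what makes partial retrieval and fast zoom-out possible.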

1.2 Research objective

This project aims to extend the open source DBMS PostgreSQL/PostGIS with image support, to conduct integrated analysis on image and vector data stored in the spatial database, with an illustration of the proposed method performing image data mining. There are two main objectives.


1. To study and analyse state-of-the-art image support from open source libraries and existing raster storage support in an open source database, and to propose and implement a solution that provides image storage inside the open source database in such a way that raster image data can participate in an integrated raster/vector analysis.

2. To provide a general framework for image data mining based on hybrid image and vector data inside open source spatial databases.

1.2.1 Research sub-objectives

1. The proposed solution for image support inside PostgreSQL/PostGIS will provide:

• Efficient image storage in and retrieval from the open source spatial database PostgreSQL/PostGIS, along with other spatial and non-spatial GIS data such as vector data.
• Pyramid, tiling and index support, and other parameters for efficient image retrieval performance.
• Both programming and visualization interfaces that allow inserting and retrieving the image along with other datasets inside PostGIS, to perform overlay, spatio-analytic, statistical and aggregate functions over the intersection of image and vector data layers.

2. The proposed framework for spatial data mining will be based on:

• Integrated image and vector data analysis.
• Scalability, for performance considerations in case of large repositories of image and vector data inside PostGIS.

1.2.2 Research questions

All research questions will be answered in the context of the outcome of the first objective, which will be accomplished after a complete analysis.

1. Image support from PostGIS.

• What would be the high-level conceptual data model for hybrid raster and vector datasets in the database?
• What would be the metadata and actual raster data storage structure and format in the database?
• What would be the interface to insert the image into, and retrieve it from, the spatial database along with other datasets?
• How to get and set (manipulate) the raster data while traversing the image with a reference system?
• How to calculate zonal statistics over the raster image area clipped by a vector feature?


• How to provide overlay operations over the raster and vector layers, like union, intersection, etc.?

2. Image data mining.

• What would be an integrated framework for mining based on both raster and vector data inside PostgreSQL/PostGIS? This will include domain knowledge, image processing techniques, image/vector data retrieval and preparation, hypothesis building and testing, mining algorithm building and performance scalability, etc.
• What would be an interface to visualize and interpret the results?
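One of the research questions above — zonal statistics over a raster area clipped by a vector feature — can be sketched in a few lines of numpy, assuming the clipping polygon has already been rasterized into a boolean mask of the same shape as the image. The rasterization step itself is omitted, and all names here are illustrative, not part of any library discussed in this thesis.

```python
import numpy as np

def zonal_statistics(raster, zone_mask):
    """Aggregate the raster cells that fall inside a rasterized vector feature.

    `zone_mask` is a boolean array of the same shape as `raster`, True where
    the clipping polygon covers the cell.
    """
    cells = raster[zone_mask]
    return {
        "count": int(cells.size),
        "min": float(cells.min()),
        "max": float(cells.max()),
        "mean": float(cells.mean()),
        "std": float(cells.std()),
    }

# Toy 4x4 "image" and a mask covering its upper-left 2x2 corner:
img = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
stats = zonal_statistics(img, mask)
print(stats["count"], stats["mean"])  # 4 2.5
```

In a database setting, the mask would come from rasterizing the vector geometry at the image's resolution, and the aggregates could equally be pushed down into SQL aggregate functions.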

1.2.3 Background

Mattikalli [5] presented a methodology for integrating remotely-sensed raster data with vector data. The approach was based on the mathematical concepts of sets and groups, and was successfully implemented for the analysis of historical land use change from 1931 to 1989 in the River Gken catchment, U.K. Remotely-sensed images were converted into georeferenced vector layers. These resulting layers were then employed with other vector data in a GIS to perform land-use change analysis. It was shown that this approach can be efficiently adopted for operational use, incorporating products derived from both coarse- and fine-resolution remotely-sensed satellite images once these are integrated with the vector-based GIS. In recent years, incorporation of multi-source data (e.g. aerial photographs, TM, SPOT and previous thematic maps) has become an important method for land-use and land-cover (LULC) change detection, especially when the change detection involves long time intervals associated with different data sources, formats and accuracies, or multi-scale land-cover change analysis [6]. Image data is an important component of any large-scale spatial database, since RS (land-use classification, mapping, and so on) and GIS techniques are used to interpret these databases [7]. Weng [10] used the integration of remote sensing, GIS and stochastic modelling to detect land use change in the Zhujiang Delta of China, and indicated that such integration is an effective approach for analyzing the direction, rate and spatial pattern of land-use change. Automated detection of change and anomalies in existing databases using image information can form an essential tool to support quality control and maintenance of spatial information [9].
Yang and Lo [8] used an unsupervised classification approach, GIS-based image spatial reclassification, and post-classification comparison with GIS overlay to map the spatial dynamics of urban land-use/land-cover change in the Atlanta, Georgia metropolitan area. GIS-based image analyses have shown many advantages over traditional change detection methods in multi-source data analyses [11]. More research focussing on the integration of GIS and remote sensing techniques is necessary for better implementation of change detection analysis, and for discovery of patterns from such analysis in image mining processes.


The construction of spatial databases that handle raster data types has been studied in the database literature, and the main approach taken has been to develop specialized data servers, as in the case of PARADISE [12] and RASDAMAN. The chief advantage of this approach is the capacity for performance improvements, especially in the case of large image databases. The main drawback is the need for a specialized, non-standard server, which would greatly increase the management needs for most GIS applications. The approach taken by TerraLib eliminates such drawbacks and includes raster data management in object-relational DBMSs. By means of adequate indexing, compression and retrieval techniques, satisfactory performance can be achieved using a standard DBMS, even with very large satellite images [5]. GeoMiner [14], developed at Simon Fraser University, is a spatial data mining system prototype able to characterize spatial data using rules, compare, associate, classify and group datasets, analyze patterns and perform data mining at different levels. ADaM [15], a NASA project with the University of Alabama in Huntsville, is a toolset to mine images and scientific data. It performs pattern recognition, image processing, optimization and association rule mining, among other operations [16].

1.3 Project set-up

We will investigate two approaches to provide the required indexed and optimized image support, along with other datasets, inside the open source PG/PG for the proposed integrated image/vector data mining perspective.

1.3.1 Method 1

The first approach is to use the TerraLib library to extend the PG/PG DBMS to provide image support. TerraLib is an open source GIS library that builds its conceptual model and related metadata tables to handle raster (image) and vector data in the PG database. The TerraLib library classes are built over these metadata tables in the database. The data and metadata tables in the database are managed by TerraLib when an application programme executes its operations. The TerraLib library uses the built-in geometry types of PostGIS for vector data. For raster data it uses the PG BLOB (Binary Large Object) type, as PG does not provide any complex data type for raster data. The TerraLib library uses the PG DBMS to provide raster/vector storage, indexes, query optimization, persistence, multi-user access, and so on. There are two options to provide the required raster data support in the PG/PG database for an integrated image/vector analysis using the TerraLib library:

• Using the TerraLib programming interface to develop advanced database applications and prototypes through raster/vector integrated analysis. The application developer uses the TerraLib classes built over its conceptual schema in the PG database for database application development. TerraView is an open source interface product built on top of the TerraLib


library that aims to provide a visualization interface for the rapid development of integrated GIS applications based on both raster and vector data in the PG/PG database. The first step in this case would be a complete understanding of the TerraLib conceptual model, classes, and interfaces, in order to develop and implement a data mining application on remote sensing image and GIS vector data stored in PG/PG.

• TerraLib and PostGIS can also be integrated at a lower level, as both are OGC-compliant libraries written in C++. Any effort to bring the TerraLib intermediate-level code down into the PG DBMS for complete integration will require architectural revisions in TerraLib, for instance revising the projection class to work with the reference system metadata table of PostGIS. Complete integration will require creating an ADT (Abstract Data Type) for raster data in PG and extending it with the TerraLib library. This ADT can be further extended with TerraLib-provided functions so that they can be executed from the PG SQL interface. The first step in this case would be the complete integration of the TerraLib library and the PG/PG DBMS prior to developing an image mining application based on remote sensing image and vector data in the PG database.

1.3.2 Method 2

The second approach is to extend the PG DBMS with a complex data type and operators for efficient image handling in the database. The first step in this case is the design and development of that complex data type and its functions, prior to developing an image mining application based on remote sensing image and vector data in the PG database. PostGIS provides a simple abstract data type for raster data called CHIP. This data type does not provide efficient image storage options such as indexing and tiling, image manipulation functions, or overlay functions. It can be improved into a complex type that provides the required image storage, manipulation, and analytic functions for an integrated image and vector data analysis. To provide such a complex type for raster data, a PGRaster metadata type holding a GEOMETRY (BOX3D) for the whole image extent can be defined. As shown in Figure 1.1, the defined metadata type has a one-to-many relation with the CHIP type. An image is divided into blocks or tiles stored as CHIP data. Each tile also has a unique bounding box or extent. The CHIP data type has a header (metadata) and the actual raster data for an image tile or block. This CHIP data type will then be extended with user-defined functions for image data manipulation from the PG SQL interface. PGCHIP is an open source library that provides an interface between a client application and the PG CHIP data type [17]. Based on the improved CHIP data type from the previous step, PGCHIP can also be improved to develop a database loader/dumper utility for raster data import and export.
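The one-to-many relation between the whole-image metadata type and its tiles can be sketched as plain data structures. The field names below (`srid`, `extent`, `tile_size`, and so on) are illustrative assumptions modelled loosely on Figure 1.1, not the actual PostGIS or PGRaster definitions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Chip:
    """One image tile: its own extent plus the raw tile data (header + pixels)."""
    extent: Box
    data: bytes

@dataclass
class PGRasterMeta:
    """Metadata for the whole image; holds the full extent and its tiles."""
    srid: int
    extent: Box
    tile_size: int          # tile side length in pixels
    chips: List[Chip] = field(default_factory=list)

def split_into_chips(meta, pixel_size):
    """Divide the image extent into tile extents (one metadata row, many tiles).

    Each tile gets its own bounding box, which a database would index.
    """
    xmin, ymin, xmax, ymax = meta.extent
    step = meta.tile_size * pixel_size  # tile side length in ground units
    y = ymax
    while y > ymin:
        x = xmin
        while x < xmax:
            tile_box = (x, max(ymin, y - step), min(xmax, x + step), y)
            meta.chips.append(Chip(tile_box, b""))  # pixel payload omitted
            x += step
        y -= step
    return len(meta.chips)

# A 1024-pixel-wide image over a 10 x 10 ground extent, 256-pixel tiles:
meta = PGRasterMeta(srid=4326, extent=(0.0, 0.0, 10.0, 10.0), tile_size=256)
n = split_into_chips(meta, pixel_size=10.0 / 1024)
print(n)  # 4 tiles per axis -> 16 chips
```

The per-tile bounding boxes are exactly what would be stored as geometry alongside each CHIP row, so that a spatial index can prune tiles outside a query window.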


Figure 1.1: Metadata and Data types with some important fields

Figure 1.2: Flow diagram for providing raster support extending PostGIS with CHIP datatype


Starspan is an open source library for integrated raster/vector analysis. This library can also be extended with the CHIP data type to provide integrated raster/vector analysis functions. The complete workflow for the proposed method is shown in Figure 1.2. In the next chapter we will investigate both approaches, keeping data mining and other performance issues in focus. We will accept the best approach based on scientific reasoning and research constraints.

1.4 Thesis structure

This thesis comprises six chapters.

• Chapter 1 is the current chapter. It introduces the problem statement, research objectives, and research questions, and outlines the scope of this study and the general approaches to carry it out.

• Chapter 2 contains an extensive review of the individual efforts that have been carried out to provide image support in the open source PostgreSQL/PostGIS database. The requirements for providing image support in a DBMS are identified, and based on these requirements a method is selected from the proposed methods.

• Chapter 3 reviews various methods for data mining. Data mining is further reviewed for its extension to spatial and image concepts, for spatial and image data mining respectively. These methods are referenced in developing image mining scenarios with the image support provided in the PG database.

• Chapter 4 elaborates the procedure to build and investigate a set-up on the TerraLib and PostgreSQL DBMS technologies, as a method for building advanced database applications such as data mining and change detection with integrated image and GIS data analysis. This chapter also provides the background to work with the TerraLib library, introducing the TerraLib classes, conceptual schema, data models, and image handling.

• Chapter 5 presents various statistical and database-oriented mining techniques that were applied to evaluate the capabilities of the proposed method for integrated analysis based on the TerraLib and PostgreSQL DBMS technologies. Data processing steps are provided with results and discussion.

• Chapter 6 finally presents conclusions and recommendations drawn from the study.

Chapter 2

Literature review

2.1 Introduction

An extensive literature review on providing image support in the open source spatial database PostgreSQL/PostGIS (PG/PG) was carried out. The objective in providing image support in a spatial database is to use image and GIS data for integrated image/vector analysis in an image mining process. A spatial database is the most suitable candidate for this kind of integrated analysis. Attempts to provide image data support in PG/PG go back to 1998. This chapter investigates these efforts and proposes a suitable method that will enable us to fulfil the objectives. Section 2.2 explains general raster/image concepts, the OGC specification for raster implementations, and DBMS considerations for such implementations. Section 2.3 describes the design issues in providing image support in PG/PG and the work done addressing these design issues; it also discusses the functions required for integrated image/vector data analysis and possible solutions for providing these functions in PG. Section 2.4 describes the proposed method and all technologies to provide image support in PG. The proposed set-up will serve as a research platform to develop advanced database applications through integrated image/vector analysis.

2.2 Raster data model

A raster data model divides space into a grid of cells. The position of a cell in the grid is described by cell coordinates (i, j), where i is the row number and j is the column number. The cell width in ground units represents the resolution of the raster data. The cell value has a data type and size, called the cell depth. The cell data type can be a primitive such as an integer or a real number. It can also be a code number referencing an associated table, called a look-up table or value attribute table. The value recorded for a cell can be a discrete value such as land use, a continuous value such as rainfall, or a null value if no data is available at that particular location. Georeferencing transforms the cell coordinate system to a ground coordinate system for the raster data. Different attributes are stored as separate layers. These layers can for instance be the bands of a multi-spectral image or thematic


layers of land use data. Each cell value in a band or layer can be further extended with types, for example a colour map type (to map a thematic cell value to RGB) or a grayscale type. An image is a specialized case of raster data, and a cell in that case is called a pixel. The upper-left corner cell value is the reference value and starting point of a block. These starting cell values are registered as metadata and used to join the blocks when required. For a multi-band image, the axis along the bands is called the band dimension. For a time-series multilayer image (where each layer has a different date or timestamp), the axis along the layers is called the temporal dimension. A layer is a logical concept for storing single or multiple bands of a raster in a database. A block is the smallest logical unit for data storage on disk. When data for multiple bands needs to be stored in a single block, an interleaving technique is commonly adopted to arrange the data of each band. The image is usually divided into multiple tiles, and different pyramid levels are built for fast retrieval [18]. OGC defines the raster type as a coverage, which refers to any data representation that assigns values directly to spatial positions. OGC provides general implementation specifications for querying such data; however, it does not provide specifications for a strict storage model or representation. According to OGC, an essential property of a coverage is the ability to generate a value for any point within its domain. How the raster is implemented internally is not a concern: it can be represented by a set of polygons that exhaustively tile a plane, a grid of values, a mathematical function, or a combination of these, as long as a value can be returned by the coverage for any atomic cell. The OGC specifications for querying raster data allow access to a geospatial coverage for the values or properties of geographic locations.
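The band interleaving mentioned above amounts to different orderings of the same three-dimensional cube of cell values. A minimal numpy sketch of the three common layouts (band-sequential, band-interleaved-by-line, band-interleaved-by-pixel); the tiny array sizes are of course illustrative:

```python
import numpy as np

# A tiny 2-band, 2x3 raster stored band-sequential (BSQ): shape (bands, rows, cols),
# i.e. band 0 is stored in full, then band 1.
bsq = np.arange(12).reshape(2, 2, 3)

# Band-interleaved-by-line (BIL): each row of band 0 is followed by the same
# row of band 1, shape (rows, bands, cols).
bil = bsq.transpose(1, 0, 2)

# Band-interleaved-by-pixel (BIP): all band values of one cell are adjacent,
# shape (rows, cols, bands).
bip = bsq.transpose(1, 2, 0)

# The same cell (row 1, col 2) read from each layout gives the same band values:
print(bsq[:, 1, 2], bil[1, :, 2], bip[1, 2, :])  # band values 5 and 11 each time
```

Which layout is best depends on the access pattern: BSQ favours whole-band reads, BIP favours per-pixel spectral reads, and BIL is a compromise, which is why block storage schemes let the loader choose.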
Set and get methods are provided to read and write a raster cell value through enumerations, like a cursor in an SQL programming interface. A raster coverage is usually provided with a method for interpolating values at spatial positions between the points or within a cell [19]. Images in space are geo-referenced, which distinguishes OGC raster/coverage datasets from SQL MM Part 5: Still Image. An image georeferencing process associates a location in an image with a geographic or local coordinate system. The spatial database associates a referencing system dynamically while performing image operations. A spatial reference system (SRS) is usually implemented through transformations such as the six-parameter affine transformation. The parameters for these transformations are stored in metadata tables [20]. For raster storage, relational databases support variable-length numeric data types with varying precision, and variable-length character data types. These data types have limited storage capacity compared to image requirements. Most relational databases offer support to store such image data as a Binary Large Object (BLOB), but they do not offer fine-grained access control (e.g. to pixel level) for such BLOBs. The disadvantage is that a single-bit change locks the whole BLOB, and a single-bit read loads the whole BLOB into the buffer cache. To have more control over storage than a BLOB offers, modern extendable database technology supports user-defined ADTs (Abstract Data Types) and user-defined functions callable from SQL. “The key feature of OR-DBMS is

10 Chapter 2. Literature review

Figure 2.1: Design levels and associated design issues

that it supports a version of SQL, SQL3/SQL99, that provides the notion of user-defined types (as in Java or C++)” [21]. These extensions provide a powerful mechanism for:

1. Variable-length storage, facilitating efficient use of disk space. Tiling and image pyramids are used for fast retrieval of an image through fine-grained splitting of the image.

2. Declarative support through SQL API interfaces.

3. Extensibility: developers can build their own domain-specific extensions through the DBMS-provided API interfaces.

Raster data is stored as a field whose type depends on the DBMS in which the data is stored. In DBMSs with a spatial extension, the field type is the one provided by the extension; otherwise it is a BLOB [22]. To provide raster/coverage support in the PostgreSQL/PostGIS DBMS, decisions over design issues are required at the three levels shown in Figure 2.1. We will discuss the work done so far at these three design levels and select the most suitable method to provide image support with PG/PG for integrated image/vector analysis.
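The six-parameter affine transformation mentioned earlier, whose parameters are stored in metadata tables, maps pixel positions to world coordinates. A minimal sketch follows; the tuple layout mirrors the common geotransform convention, and the names are illustrative rather than taken from any DBMS:

```python
def pixel_to_world(gt, col, row):
    """Apply a six-parameter affine geotransform to a pixel position.

    gt = (x0, dx, rx, y0, ry, dy): origin, pixel sizes, and rotation terms,
    the kind of parameters a spatial database keeps in its metadata tables.
    """
    x0, dx, rx, y0, ry, dy = gt
    x = x0 + col * dx + row * rx
    y = y0 + col * ry + row * dy
    return x, y

# North-up image: origin (100.0, 500.0), 30 m pixels, no rotation.
gt = (100.0, 30.0, 0.0, 500.0, 0.0, -30.0)
print(pixel_to_world(gt, 0, 0))   # (100.0, 500.0)
print(pixel_to_world(gt, 2, 1))   # (160.0, 470.0)
```

The negative `dy` reflects the usual convention that row indices increase downwards while world y-coordinates increase upwards.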

2.3 Data model for image storage inside PG/PG

For image support from a spatial database, DBMS developers need to design a high-level conceptual data model for image data and metadata based on complex data types. These complex data types need to be extended with functions to access and manipulate the image data through an SQL interface.

In an ORDBMS, non-structured data can be organised using WKB (well-known binary) data types. A straightforward way is to store raster data in a WKT (well-known text) or WKB column, with the associated metadata stored in other columns. For performance, an image can be stored in multiple rows so that each row will
manage a single tile. Other mechanisms such as data compression and pyramid structures can also be used to improve efficiency and performance [23]. The bounding boxes of these tiles are stored as geometry and are used to build the GiST index. Mostly, a unique identifier associated with a tile is used as primary key for queries and for caching in a hierarchical storage structure.

PostgreSQL/PostGIS currently comes with two raster storage models: TOAST (The Oversized-Attribute Storage Technique) and BLOB (Binary Large Object). A table with a column having potentially large entries (> 8 KB) has an associated TOAST table, whose OID is stored as a locator. A locator is used to link a row from a database table to its out-of-line TOASTed field values stored in a TOAST table. To support TOAST, a data type must have a variable-length (varlena) representation like bytea. The parameter TOAST_MAX_CHUNK_SIZE, in number of bytes, is used to divide an out-of-line value, for example an image, into chunks. Each chunk is stored as a separate row in the TOAST table [25]. The advantage of this architecture is that it allows a simple data model for raster storage with some effort, like storing the bounding box of a tile as geometry for GiST index support. We can then store metadata in a metadata table and the blocks/tiles of an image inside a TOAST table. The size of a block/tile can be controlled by TOAST_MAX_CHUNK_SIZE. This could be very useful when:

1. The objective is just to locate and display an image, with selection criteria on attributes rather than on another image;

2. The raster image data is small enough to insert through an SQL insert statement, and applications frequently access attributes other than the chunks of image data.
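The chunking behaviour just described, an out-of-line value split into rows of at most TOAST_MAX_CHUNK_SIZE bytes, can be imitated in a few lines. This is a toy illustration of the idea, not PostgreSQL's internal code:

```python
def toast_chunks(value, max_chunk_size):
    """Split an out-of-line value into (seq, chunk) pairs; TOAST stores each
    chunk as a separate row in the TOAST table, keyed by a sequence number."""
    return [(seq, value[off:off + max_chunk_size])
            for seq, off in enumerate(range(0, len(value), max_chunk_size))]

def detoast(chunks):
    """Reassemble the value. Note that the whole value is materialised at
    once -- the lack of fine-grained (pixel-level) access discussed above."""
    return b"".join(chunk for _, chunk in sorted(chunks))

image = bytes(range(10)) * 100        # a 1000-byte stand-in for raster data
rows = toast_chunks(image, 256)
print(len(rows))                      # 4 chunk rows (3 x 256 bytes + 232 bytes)
assert detoast(rows) == image
```

The round trip shows why TOAST alone is a blunt instrument for raster data: reading one pixel still means reassembling and loading an entire chunk.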

The disadvantage is that image manipulations need to read and write at pixel level while traversing an image inside the database, but PostgreSQL manages TOASTed data itself without providing a cursor handler or an iterator to traverse an image chunk. The whole chunk is loaded into the database memory structure at once, even if only a single pixel needs to be read or written. The PostGIS PGRaster SQL interface “shall” requirements document states: “For a bytea/TOAST storage model, PGRaster data shall be inserted through SQL insert statement and shall retrieve through SQL select statement”. More efficient insertion can be obtained by creating a prepared insert statement, using a low-level interface function to provide the binary data separately from the SQL text. In the PostgreSQL C API, PQexecPrepared and PQexecParams provide this low-level functionality [24].

A second solution that most ORDBMSs provide is the BLOB or Smart BLOB. The application receives a handler to read from a BLOB using a well-known file-system-like interface: open, close, read, write and seek. This allows fine-grained access to the BLOB type, to seek to a specific location and read/write changes using a programming interface, and to write functions to extend SQL. The PostgreSQL client- and server-side programming interface library (libpq) provides manipulation of large objects. The library contains lo functions; SQL is extended to call these functions within SQL commands, and large object manipulation takes place using lo functions within an SQL transaction

block. The difference between TOAST and LOB handling is that TOASTed data is automatically managed by PostgreSQL, while large objects can be randomly modified at the finest atomic level. A disadvantage of large objects is that a trigger needs to be defined to delete the large object when its referencing row is deleted [25].

A PostGIS abstract data type (ADT) and associated functions called CHIP define a raster header and the actual raster data in-line. CHIP provides some basic functions for manipulation, and an external programme can write into or read from CHIP data by using these functions. CHIP does not, however, provide functions for higher-level image handling and manipulation. PGCHIP is an open source driver that uses the GDAL library for raster data read and write operations with the CHIP data type [26]. The PGCHIP external programme supports both the OGC-based well-known text representation of a spatial reference system, i.e., SRTEXT, and the Proj4 text representation, i.e., PROJ4TEXT, to provide coordinate transformation capabilities. PGCHIP is under development and does not yet provide image manipulation and image/vector overlay operations.

The Oracle physical data model for images consists of two object types: a GeoRaster type for the metadata of a whole image (the image header), and a raster object type for each block/tile of actual image data, stored as a BLOB in out-of-line fashion. The footprint of each block is also extracted and stored as a raster type attribute to build spatial indexes. The GeoRaster object type is further extended with an XML object type for metadata storage; only the image footprint is stored as a field of the GeoRaster type outside the XML type. An image can be inserted through the SQL interface by an insert statement, and also by a loader utility through the command prompt [27]. Along similar lines as Oracle GeoRaster, Xing Lin developed a raster model for PostGIS named PGRaster [20].
Metadata is stored as fields of the PGRASTER_METADATA type, with the spatial extent (BBOX) and SRID. The raster value type refers to the scale of measurement, which can be nominal, ordinal, interval or ratio. The model coordinates have the same unit as the specified SRID. Actual image data is stored in the PGRASTER object type as blocks. A Geotiff2pgraster loader creates blocks, the GiST index, and pyramid levels while importing an image into the database. Some basic SQL functions to handle the parameters of the data model are provided; however, many functions for image manipulation are still under development. The source code is distributed under the terms of the GNU General Public Licence by Refractions Research Inc. There is a debate over whether Xing Lin's PGRaster source code and design violate patents held by Oracle Corporation.

INPE's (National Institute for Space Research) TerraLib is an open source software library that extends object-relational DBMS technology to support spatio-temporal models, remote sensing image databases, and integrated spatial analyses [28]. It provides support for the PostgreSQL/PostGIS, MySQL, MS, and Oracle DBMSs. TerraLib provides interfaces for C++, Java, COM and OGIS web service environments for GIS application development. TerraLib creates its own conceptual model by opening the connection through an application programming interface (API) driver. It provides support for handling large image data sets, providing indexes, tiling, pyramiding, and compression techniques. The TerraLib raster storage model follows the OpenGIS implementation specifications for grid coverages, and also fulfills the PostgreSQL “shall” requirements for raster storage. TerraLib vector features are fully OGC compliant.

Figure 2.2: The current open source software and related libraries [32]

TerraLib raster data structures include:

1. Raster: a multi-dimensional raster data structure (Used for images and grids).

2. Cell: a single cell, used for building cell spaces.

Cell spaces can be seen as a generalized raster structure in which each cell stores more than one attribute value, or as a set of polygons that do not intersect each other. TerraLib handles spatio-temporal data types (events, moving objects, cell spaces, modifiable objects) and allows spatial, temporal and attribute queries on the database. TerraLib supports dynamic modeling in generalized cell spaces and spatial data mining, and has a direct runtime link with the R programming language for statistical analysis on spatial data [29, 30].

2.3.1 Functions for integrated vector/raster analyses

All C-based open source projects are built upon reused libraries, which form the basis for integration and interaction between different formats of spatial data sets [32]. GDAL is an open source translator library for raster (GDAL) and vector (OGR) geospatial data formats [33]. As shown in Figure 2.2, almost all open source software packages use GDAL/OGR for manipulating raster and vector data, and conceptually show a tendency to provide functions for integrated raster/vector analyses. The OGR library can read vector data sets and transform these into feature layers. An OGR layer can be further rasterized with the GDAL libraries to intersect any geometry feature in a vector data source with

the GDAL raster layer. Rasterization is the process of burning vector polygons into a raster, and vectorization is the process of converting a raster layer into vector polygons. These processes are often carried out for interaction between raster and vector data in hybrid raster/vector analysis. On this principle, the Starspan utility program was designed to fuse raster and vector layers for spatial analysis. A basic operation performed by Starspan is the extraction of spectral data from raster files whose pixels are geometrically contained in the geometry features (points, lines, polygons) of vector data [34]. Starspan is open source, written in C++ using the GDAL/OGR/GEOS libraries, and works with all the formats supported by the underlying libraries. Various algorithms are used, according to the type of geometry, to find the pixels in a raster R that are contained in the given geometry features of a vector V [35]. TerraLib libraries also provide functions to perform zonal operations over a region of a raster layer clipped by vector features. A zonal operation calculates a set of statistical measures (for example sum, mean, and variance) over the raster layer region that is inside the polygon representing the region of interest. The result is returned in a data structure provided by TerraLib.
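The zonal-statistics idea, pixels geometrically contained in a polygon feeding a statistical summary, can be sketched as follows. This is a naive illustration using even-odd ray casting on pixel centres, not Starspan's or TerraLib's implementation:

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        # Edge crosses the horizontal ray from (x, y) towards -infinity?
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def zonal_mean(raster, poly):
    """Mean of the raster cells whose centres fall inside the polygon.

    raster is a row-major list of rows; cell (col, row) has its centre at
    (col + 0.5, row + 0.5) in the same (toy) coordinate system as poly.
    """
    values = [v for row, line in enumerate(raster)
                for col, v in enumerate(line)
                if point_in_polygon(col + 0.5, row + 0.5, poly)]
    return sum(values) / len(values) if values else None

raster = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]  # covers cells (0,0)..(1,1)
print(zonal_mean(raster, square))   # (1 + 2 + 5 + 6) / 4 = 3.5
```

Real implementations differ mainly in scale: they rasterize the polygon once and visit only the tiles intersecting its bounding box, rather than testing every cell.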

2.4 Proposed platform for image mining

The first proposed method is adopted: to use the open source TerraLib/TerraView library to provide image support in the PG/PG DBMS. This will provide a research platform to develop advanced applications, such as image mining, through integrated image/vector analysis. The TerraLib conceptual schema will be created in PG for image and vector data storage and retrieval for the image mining application. The TerraLib library will be used for spatial data analysis based on hybrid raster and vector operations, and for image processing. The TerraLib C++ programming interface will be used for algorithm development in image mining. The TerraView visualization interface will be used for visual interpretation, training the image data, and analyzing the results in our image mining application development. This method selection is based on a comprehensive study of existing libraries from previous work and the following requirements:

1. Efficient Image Support
Advanced database applications such as image mining require image support from a DBMS with high image retrieval performance. TerraLib uses indexing and compression of images for storage inside the database. Multi-resolution pyramids store the raster data at various sizes and degrees of resolution. Each resolution level is divided into tiles. A tile has a unique bounding box and a unique spatial resolution level, which are used to index it. TerraLib stores multi-resolution pyramids up to seven levels of resolution. This avoids unnecessary data access at the client side; however, it requires extra storage. To compensate for the extra storage requirement, TerraLib applies lossless compression to the individual tiles. When retrieving a section of data, only the relevant tiles are accessed and decompressed [28]. TerraLib also supports the BSQ, BIL, and BIP interleaving methods. Raster image data can be shared in different formats such as GeoTIFF, TIFF, JPEG, RAW, ASCII-Grid and ASCII Spring.

2. Hybrid Image/Vector Data Analysis
TerraLib/TerraView manages image data with a PG database and allows its visualization and manipulation together with vector data. Overlay functions provided by TerraLib can operate on image and vector layers. A vector polygon and the image pixels inside that polygon will point to the same geographical location, provided the corresponding layers are projected in the same reference system. An operator can easily extract the pixels coinciding with a specific polygon area and perform statistical analysis. TerraLib also offers rich overlay operations, for instance difference, union, and intersection. A decoder class can perform raster conversion between 52 raster formats.

3. Statistical Analysis
Any platform for image mining that has no spatial statistical capabilities is insufficient. TerraLib has a basic spatial statistical package, including local and global autocorrelation indexes, non-parametric kernel estimators, and regionalization methods [36]. Additionally, TerraLib provides a direct link with the R programming language using the aRT package [37]. R is an open source language and environment for statistical computing and graphics. Packages in R relevant to GIS include geoR for geostatistics, splancs for the analysis of point processes, and sp for general spatial analysis. TerraLib developers can quickly develop wrappers for R without knowing R internals, and TerraLib application users can use these wrappers in spatial statistical analysis without knowing R syntax. Using the aRT package inside a TerraLib application, an operator queries spatial data stored in the database, conducts spatial statistical analysis, and can store the results back in the database. These results can be further displayed with TerraView [28].

4. Image Processing Algorithms
Another important requirement for image mining support is a set of image processing algorithms. These algorithms extract features from images as a precursor to image mining. Image processing algorithms typically take one or more images as input and produce one or more images as output. The TerraLib libraries come with decoder translation utilities that convert to and from 52 different raster formats, including the popular ones. TerraLib includes basic operations for changing the size, orientation, scale, and other properties of an image. It provides a large set of algorithms for classification, mixture models, and geometric transformations.
Scaling or resampling enlarges or reduces a raster object by changing its geometry. TerraLib supports various resampling and interpolation methods, including interpolation by nearest neighbour, the average of the K-nearest neighbour values, and the simple and weighted average of elements in a box. It also supports sub-setting, by clipping an image feature using a polygon and by difference, intersection, union, and so on, of various image bands.
Filters are used for image enhancement, brightness and contrast adjustment, edge sharpening, smoothing, distortion correction, and reducing the salt-and-pepper effect. Filters have a very important role in image data preparation. TerraLib implements convolution filters, linear filters, a border detection filter, buffer-based filters, morphological filters, radar filters for contrast enhancement through image interpolation, the radar Kuan filter for reducing speckle noise in SAR images, the statistical Lee and Frost filters, and geometric ASF filters. Wavelet denoising filters are implemented to remove additive Gaussian noise by thresholding the wavelet coefficients.
Re-mapping is the process of changing an image's geometry to fit adjacent images. It uses background estimation, mask generation, parameter estimation and dynamic range re-mapping of the original image to provide an optimal output image. TerraLib dynamic range re-mapping includes an additive algorithm, a multiplicative algorithm, or a combination of both. Other state-of-the-art image re-mapping algorithms implemented by TerraLib are colour-based and principal-component-based.
Segmentation refers to the process of partitioning a digital image into multiple regions (sets of pixels). Segmentation is very important in image mining and is normally performed before classification. The purpose is to simplify the representation of an image into something that is more meaningful and easier to analyze. The result of image segmentation is a set of regions that collectively cover the entire image, or a set of contours extracted from the image (edge detection). The pixels in a region are similar with respect to some characteristic or computed property, for instance colour, intensity, or texture.
All state-of-the-art segmentation algorithms, such as the region-growing algorithm, clustering and K-means classification, edge detection filters, the snake algorithm, and so on, are implemented by TerraLib for image segmentation.
Image fusion is the process of combining relevant information from two or more images into a single image. The resulting image will be more informative than any of the input images. In remote sensing applications, the increasing availability of space-borne sensors gives a motivation for different image fusion algorithms. Image fusion techniques allow the integration of different information sources. TerraLib implements the IHS, Wavelets and Enhanced Wavelets algorithms for image fusion. TerraLib allows the creation of mosaics, or the composition of different data files into a single representation. TerraLib image transparency allows making an image theme active over a vector theme, so that the vector data can be seen under the image theme.

5. Temporal Support
TerraLib supports two basic containers for spatial data: spatio-temporal objects and layers. A spatio-temporal object is an atomic feature whose identity is unique and persists over time. A layer is a collection of
spatio-temporal objects that share the same geographical projection and the same set of attributes over a temporal period [39]. Information at a geographic location can be generated from different layers projected with the same geographic coordinates. Layers can be bands of a hyperspectral image, thematic raster data such as soil maps, vector data, or a temporal dimension stored in a spatial database.

6. Iterators over spatio-temporal data structures
TerraLib uses generic iterators over spatial data structures. These iterators allow sequential traversal of an entire image, or of the elements of a portion of an image delimited by a polygon. A request to the raster class is handled by a decoder, so as to work with different image formats seamlessly. Iterators provide an abstraction over image data types and bridge image processing algorithms with data structures [31].

7. TerraLib Programming Interface
TerraLib provides C++, Java, COM, and Open Geo-Services (OGIS) environments for spatio-temporal application development. There are many different types of applications and user requirements that require different components of TerraLib in different orders. A complete general-purpose interface incorporating all TerraLib components to satisfy all applications would potentially be very complex. An alternative approach is to construct simple application interfaces around specific workflows [38]. Using the TerraLib library, such applications and prototypes can be rapidly developed according to different requirements, for instance the ordering of operations, conversion of files, connecting the input of one operation to the output of another, and so on.

8. TerraLib Visualization Interface
Geographic visualization is the use of visual geographical displays to explore data. Such exploration helps to generate hypotheses, to develop problem solutions, and to construct knowledge [40]. The TerraView visualization interface provided by INPE facilitates training on the image data and creating hypotheses through visualization of various spatio-temporal data sets at the same location. The results of image processing algorithms and image mining can also be visually analysed and interpreted.

9. Flexibility, Extensibility, and Scalability
The TerraLib architecture is highly flexible. A front-end application works seamlessly with heterogeneous database data sets, as it can keep several connections to different databases at the same time. A new algorithm can be added by registering it through the kernel, a new data format by adding a decoder class, and a new database by writing a driver, without affecting any other code.
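The nearest-neighbour resampling mentioned under requirement 4 above is the simplest of the interpolation methods: each output cell takes the value of the closest input cell. A short sketch, assumption-laden and illustrative rather than TerraLib code:

```python
def resample_nearest(raster, new_w, new_h):
    """Nearest-neighbour resampling of a row-major grid: each output cell
    takes the value of the input cell closest to its mapped position,
    changing the raster geometry without computing new values."""
    old_h, old_w = len(raster), len(raster[0])
    return [[raster[min(old_h - 1, int((r + 0.5) * old_h / new_h))]
                   [min(old_w - 1, int((c + 0.5) * old_w / new_w))]
             for c in range(new_w)]
            for r in range(new_h)]

grid = [[1, 2],
        [3, 4]]
print(resample_nearest(grid, 4, 4))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Nearest neighbour preserves the original cell values exactly, which is why it is preferred for thematic (nominal) rasters, where averaging neighbouring class codes would be meaningless.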

Chapter 3

Data mining methods

3.1 Introduction

Image mining is a multi-disciplinary technology that borrows tools and techniques from DBMS technology, statistics, machine learning, and image processing. In this chapter, state-of-the-art tools and techniques for data mining are explored. These tools and techniques are used in Chapter 5 of this thesis for developing various image mining application scenarios. Section 3.2 explores tools and techniques for classical data mining that provide the basis for the overall mining process. The tools and techniques of classical data mining are extended with the introduction of spatial concepts for spatial data mining, and further with image processing concepts for image data mining. Section 3.3 discusses the spatial concepts added to classical data mining for spatial data mining. Section 3.4 discusses the image processing concepts added to spatial data mining that provide the basis for an image mining process.

3.2 Classical data mining

High-dimensional data in large databases leads to the so-called “information gap”: we have a huge amount of data and little information. This requires techniques to handle the dimensionality of the data and to discover information that has been hidden in large, complex and high-dimensional data sources. Data mining is a technique to reduce the information gap and to discover patterns, associations, or relationships in the data.

A complete data mining process is a collection of various sub-processes. It starts from understanding the domain, defining problems, and making assumptions about the data, followed by data preparation, data transformation, mining patterns, and evaluating the results to extract knowledge [77]. Data preparation involves data selection and data reduction. Data reduction is a process to remove noise and irrelevant items from the data. The data transformation phase transforms a standard database into a form that is suitable for use by mining algorithms. The actual mining, applying tools and techniques for discovering patterns, is one sub-process of the whole data mining process. Evaluation is performed to discover knowledge from the patterns and to make suggestions to
the problem for which the mining activity was carried out.

Data mining techniques are built using computer science and statistics. Databases, artificial intelligence, and machine learning are important areas of computer science that play a role in developing techniques for data mining. Inductive learning is a process of learning from the environment. According to Hand [41], two kinds of structures are sought during learning in a data mining activity: models and patterns. A model is a global summary of the relationships between variables that helps to understand a phenomenon. Here, tools and techniques are supervised and mostly driven by prior knowledge and simplifying theory. In contrast to the global description given by a model, a pattern is often defined as a characteristic structure exhibited by a small number of data points, for instance a small group of customers with high risk. Here, tools and techniques are mostly self-organizing or unsupervised [42]. Data analysis techniques in data mining are mostly based on apriori or self-organizing learning:

• Apriori learning (supervised induction) This is learning from examples, where an operator helps the system to construct a model by defining classes and by providing examples of each class. The nature of such learning leads to predictive modelling, or verification-driven data mining. Predictive models make predictions about values of the data using a set of known rules called apriori information, “truth data” or a “training set” [44]. Classification, regression, and time-series analysis are examples of predictive mining algorithms. These models may take a hypothesis from the user and test its validity against the data. The emphasis is with the user, who is responsible for formulating the hypothesis and for issuing the query on the data to confirm or negate the hypothesis. Data classification has been widely studied in machine learning, statistics, neural networks, and expert systems.

• Self-organised learning (unsupervised induction) This is learning from observation and discovery. The nature of such learning leads to descriptive mining algorithms that are exploratory in nature and identify the patterns and relations in the data itself [44]. The user supplies data to the mining system without defining classes or providing any prior information. The mining algorithm observes the data, reduces the dimensions, and recognizes patterns by itself. The mining algorithm then produces a set of class descriptions for each class discovered in the environment. Examples of descriptive algorithms include statistical and mathematical methods like clustering and association rule mining. This approach of learning is called discovery-driven data mining: rather than validating a hypothesis postulated by the user, the system automatically discovers important information hidden in the data [42]. Data clustering has been studied in statistics, machine learning, spatial databases, image processing and the data mining area.
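Clustering illustrates self-organised learning well: the algorithm receives only the data, with no classes or training examples, and partitions it by itself. A minimal k-means sketch (illustrative only; real mining systems add scalability measures discussed below):

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points. No classes or
    training examples are supplied, only the data itself."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)),
                       key=lambda i: (p[0] - centroids[i][0]) ** 2
                                   + (p[1] - centroids[i][1]) ** 2)
            clusters[best].append(p)
        # Empty clusters keep their old centroid.
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),   # one compact group
        (8.0, 8.0), (8.3, 7.9), (7.8, 8.2)]   # a second compact group
centroids, clusters = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
print(sorted(len(c) for c in clusters))   # [3, 3]
```

The algorithm discovers the two groups without being told they exist, which is exactly the descriptive, system-driven behaviour described above.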

Two main tasks must be carried out in a data mining process: developing

a data analysis technique or algorithm, and scaling the algorithm over large data sets in terms of computational cost [43]. Any technique developed for data mining depends upon the nature of the application domain and its associated problem, the nature of the data (such as type, volume, and dimensionality), computational scalability (such as setup and execution cost), and the stakeholders. The stakeholders include a domain expert, a data miner, or any intended user of the data mining results. We will discuss techniques for data mining borrowed from three major areas, i.e., statistics, databases and machine learning. The techniques of classical data mining are adjusted for spatial data mining and image data mining respectively, according to domain-specific considerations.

3.2.1 Statistics

Statistical learning through model construction and pattern recognition from the data is an approach to machine intelligence. Once a statistical model of the data is recognized, probability theory and decision theory can be applied to obtain an algorithm for estimation, description and prediction [21]. In descriptive learning, the statistical significance of a user hypothesis about the entire data is determined. It is difficult to conduct a data mining activity without proper consideration of its fundamental statistical nature.

The statistical approaches adopted for data mining are different from standard classical statistics. In classical statistics, data is normally collected with a question in mind, and the statistical models are based on theory about the relationships among variables or a description of the data. In contrast, in data mining one may simply carry out a stepwise search over a set of potentially explanatory variables to obtain a model or pattern. The model or pattern obtained should possess good predictive power to confirm the assumptions about the data. In statistical inference, a statement about the population is made after observing a sample, whereas data mining algorithms are executed over the entire population of data. The nature of statistical analysis in classical statistics is confirmatory, to best fit the model to the observed data; therefore, precise sample data collection, with the method or model in mind, is primary. Statistical analysis in data mining, by contrast, is exploratory, aiming to discover an unexpected model or pattern in the data, and precise data collection is secondary. Data in data mining is voluminous, and many classical statistical tools may fail in such circumstances; for instance, a scatter plot of one million points can easily display as a solid black line [41].

The role of statistical techniques in data mining starts from the data preparation phase.
Data cleaning refers to the identification of anomalies in the data that need to be removed or separately addressed in an analysis. Outlier analysis searches for data items that deviate unexpectedly from some norm. Variable-by-variable data cleaning is a straightforward process, but often anomalies only appear when many attributes of a data point are considered simultaneously [45]. Detecting and removing outliers is important in data mining, both for cleaning the data and for the investigation of unusual events. Table 3.1 shows statistical methods that can be used in various stages of a data
mining activity.

Segmentation involves partitioning selected data into meaningful groups. The segmentation process is either classification or clustering. Classification is normally supervised and based upon predefined rules derived from training data. A parametric classifier assumes that the data follow some known parameterized probability density function (PDF). A nonparametric classifier is typically used when there is insufficient knowledge about the type of underlying PDF for the domain. Self-organizing classifier models, such as certain kinds of neural networks, are also considered nonparametric classifiers when they make no apriori assumptions about the PDF. Clustering is an unsupervised process in which different statistical techniques are employed for partitioning the data into groups. Cluster analysis methods for data mining must accommodate large data volumes and high-dimensional data; this usually requires statistical approximations or heuristics [46]. Many machine learning techniques are also modified for statistical learning, in which decisions are made using a statistical criterion over attribute values. Table 3.1 shows various statistical techniques available for segmentation.

Principal component analysis (PCA) is a statistical clustering technique based on statistical correlation, as mentioned in Table 3.1. This technique was used in building the cloud detection application scenarios in Chapter 5. PCA is a non-parametric method to extract information from high-dimensional data sets. The eigenvectors of the covariance matrix computed from an image represent the principal components of that image. These eigenvectors are orthonormal to each other. PCA extracts relevant features from the data by performing an orthogonal transformation, in this way reducing a complex data set to a lower dimension [85]. In our cloud detection application scenarios for image mining, developed in Chapter 5, the feature space is a space of clouds.
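For the two-dimensional case, the PCA steps described above — centre the data, form the covariance matrix, and take its eigenvectors — can be written out directly. This is an illustrative sketch, not the implementation used in Chapter 5:

```python
import math

def pca_2d(points):
    """PCA for 2-D points: centre the data, build the 2 x 2 covariance matrix,
    and return its eigenvalues together with the unit eigenvector of the
    largest one (the first principal component)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Eigenvalues of the symmetric matrix [[cxx, cxy], [cxy, cyy]].
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    root = math.sqrt(max(0.0, tr * tr / 4.0 - det))
    eig1, eig2 = tr / 2.0 + root, tr / 2.0 - root
    # (cxy, eig1 - cxx) solves (A - eig1*I) v = 0 when cxy is non-zero.
    vx, vy = (cxy, eig1 - cxx) if abs(cxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    return (eig1, eig2), (vx / norm, vy / norm)

# Points spread along y = x: the first component is close to (0.707, 0.707).
pts = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.05)]
(eig1, eig2), pc1 = pca_2d(pts)
print(eig1 > eig2, abs(pc1[0]), abs(pc1[1]))
```

The dominant eigenvalue captures almost all of the variance here, so projecting onto the first component reduces the data to one dimension with little loss, which is the dimensionality reduction exploited in the cloud detection scenarios.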

3.2.2 Database-oriented approaches to data mining

Database-oriented methods do not search for a model, as the machine learning or the statistical methods do. Instead, data modelling or other database-specific heuristics are used to exploit the characteristics of the data at hand. Implicit associations in the data can reveal hidden patterns when they are made explicit through database modelling and design.

A mining activity requires a data preparation and data transformation phase to transform the database into an algorithm-compatible format. Indeed, it would not be feasible to reorganize the data in a database each time for a new mining algorithm. It is more practical to provide minimal database design arrangements that fulfil the format compatibility criteria of many mining algorithms. This is a reason why data warehouses are designed with many data mining and other knowledge discovery tools and techniques in view, and vice versa.

There are two approaches to providing a mining algorithm for large databases. A mining algorithm can be external to the DBMS, in the form of libraries, or it can be integrated with the DBMS, in the form of stored procedures. In the latter approach, the data does not leave the database during the mining process. This inte-

22 Chapter 3. Data mining methods

Table 3.1: Statistical methods for data mining.

Data preparation — Outlier Detection
• Clustering approaches: partitioning of data into a number of clusters, where each data point can be assigned a degree of membership to each of the clusters [47].
• Distribution-based: based on deviation from some probability distribution, such as the Normal, Poisson, or Gaussian distribution.
• Distance-based: an object O in a dataset T is an outlier DB(p, D) if at least a fraction p of the objects in T lies at a distance greater than D from O [47].
• Density-based: assigns each object a degree of being an outlier, called the local outlier factor (LOF), which depends on how isolated the object is with respect to its surrounding neighbourhood [48].

Segmentation — Classification
• Decision Trees: a predictive classification model, borrowed from machine learning, that represents a set of decisions which generate rules for the classification of a dataset. A decision or test on attribute values is prioritized through some statistical criterion, such as entropy, information gain, the Gini index, the chi-square test, measurement error, or classification rate. Discussed further under machine learning.
• Linear Regression: a predictive model for a continuous target variable. Its disadvantage is that the assumed normal distribution of the response variable is sometimes violated.
• Bayesian Inference: statistical inference in which evidence or observations are used to update, or to newly infer, the probability that a hypothesis is true. The name Bayesian comes from the frequent use of the Bayes theorem in the inference process. The most widely used methods are naive Bayesian classification, which assumes that all attributes are independent, and Bayesian belief networks, which assume that dependencies exist among subsets of attributes. Representing dependencies among random variables by a graph, in which each random variable is a node and the edges between the nodes represent conditional dependencies, is the essence of the graphical models that play an increasingly important role in machine learning and data mining [49].

Segmentation — Clustering
• Statistical Correlation: finds associations between fields in the data. The most common methods are principal component analysis and the Mahalanobis metric.
• K-means: a distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). The k-means algorithm works only with numerical attributes. Distance-based algorithms rely on a distance metric to measure the similarity between data points.

Dependency Analysis (finding rules to predict the values of some attribute based on the values of other attributes)
• Association Rule Mining: associations between independent instances of datasets that form rules. Mostly used as a database transaction-based approach for revealing interesting patterns in business transactional data. Discussed further in the database section.
• Bayesian Belief Networks: discussed under Bayesian inference above.
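Several of the clustering methods in Table 3.1 are distance-based. As a concrete illustration, the following is a minimal k-means sketch in plain Python (a toy: it seeds the centres with the first k points rather than a random sample, and uses a fixed iteration count).

```python
import math

def kmeans(points, k, iters=50):
    """Distance-based clustering: assign each point to its nearest centre,
    then recompute each centre as the mean of its assigned points."""
    centres = [tuple(p) for p in points[:k]]       # naive seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # mean of each cluster; keep the old centre if a cluster emptied
        centres = [tuple(sum(c) / len(ms) for c in zip(*ms)) if ms else centres[i]
                   for i, ms in enumerate(clusters)]
    return centres
```

On two well-separated point clouds this converges to their means within a few iterations.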


grated approach has been adopted by Oracle for data mining, and its advantage is declarativeness. The former approach is adopted by TerraLib, and its advantage is flexibility: a data miner has more control over data inputs and outputs and can rapidly develop prototypes on various DBMS platforms using different programming interfaces.

Data mining algorithms deal with a large number of variables and dimensions. Especially for very large databases, a mining algorithm must be scaled for computational efficiency and performance. An algorithm is said to be scalable if, for a given amount of main memory, its runtime increases linearly with the size of the input [60]. Existing data mining activities are much driven by practical computational concerns; in many cases, data mining algorithms for large databases do not scale linearly.

Scalability of the mining algorithms then becomes one of the major activities under database management. The database is designed in a way that reduces the execution steps of a mining algorithm, and it is tuned to make effective use of DBMS-provided performance features such as indices, query optimization, and hierarchical storage structures. Further, external arrangements to improve performance can also be provided, for instance parallel processing or increased computational power through grid computing. We will discuss some techniques that have been extensively applied in databases during various stages of data mining.

Data preparation, reduction and cleaning

SQL and user-defined functions offer a great degree of flexibility and efficiency in extracting information from very large databases holding heterogeneous datasets [51]. The information is extracted using these functions in the selection, projection, and joining of database records. Useful SQL capabilities include mathematical and analytical functions [52]. In an ideal situation, a data mining algorithm would be designed to access the database directly using query tools; normally, however, a time-consuming procedure is involved in transforming the database into an algorithm-compatible format [21].

During the data preparation phase, datasets are reduced or arranged at different levels of information according to the dimensions. Selection is performed to determine a subset of records from the database, to clean noise and duplication, and to handle missing values. A database join is performed to integrate data sets and to provide flexibility for data manipulation. Data summaries are generated from aggregate and analytic functions. Traditional SQL queries can help in the data preparation phase; however, they are inadequate for exploring patterns.
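The selection, duplicate removal, joining, and aggregation described above can all be pushed into the DBMS with plain SQL. The sketch below uses Python's built-in sqlite3 module on a hypothetical observation table; all table and column names are invented for illustration.

```python
import sqlite3

# In-memory toy database: hypothetical station observations plus a lookup table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE observation (station TEXT, day TEXT, temp REAL);
    CREATE TABLE station (name TEXT, region TEXT);
    INSERT INTO observation VALUES
        ('S1', '2009-03-01', 11.5), ('S1', '2009-03-01', 11.5),  -- duplicate
        ('S1', '2009-03-02', 13.0), ('S2', '2009-03-01', NULL),  -- missing value
        ('S2', '2009-03-02', 9.0);
    INSERT INTO station VALUES ('S1', 'north'), ('S2', 'south');
""")

# Selection (drop missing values), duplicate removal (DISTINCT), a join, and
# an aggregate summary per region -- all performed inside the DBMS.
rows = con.execute("""
    SELECT s.region, COUNT(*) AS n, AVG(o.temp) AS mean_temp
    FROM (SELECT DISTINCT station, day, temp FROM observation
          WHERE temp IS NOT NULL) AS o
    JOIN station s ON s.name = o.station
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
```

The result is a small, cleaned summary table that a mining algorithm can consume directly.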

Multilevel data generalization, summarization, and characterization

Data generalization is a process that abstracts a large set of relevant low-level data in the database to higher-level concepts. The task of characterization is to find a compact description for a selected subset of data in the

database [53]. Data generalization is important because most databases have been designed for on-line transaction processing (OLTP), where the aim is to serve small transactions. A mining process, however, involves scans over large search spaces that require better access structures, optimized disk I/O, and other performance options.

A first approach to data generalization is to generate summary rules, normally implemented in the form of a data warehouse (DW). The general idea is to materialize certain expensive computations that are frequently requested, especially those involved in aggregate functions for knowledge discovery. The DW integrates different types of data and reduces the dimensionality of the data inside the databases. It reduces runtime computational costs while scaling a mining algorithm.

This database-oriented technique for data mining was used to build a scenario for our cloud patterns detection image mining application. The summary statistics for the principal components were generated and stored in a specially designed schema. The vector data summaries were also calculated and stored. This summary data can be used to construct spatio-temporal queries while minimising the run-time computational costs.

The most popular data mining and analysis tools associated with data warehouses carry several alternative names, for instance on-line analytical processing (OLAP), multi-dimensional databases, multi-scale databases, data cubes, and materialized views [54]. A well-designed DW supports all such knowledge discovery tools and techniques. Structured OLAP tools include roll-up (increasing the level of aggregation), drill-down (decreasing the level of aggregation), slice and dice (selection and projection), and pivot (re-orientation of a multidimensional data view) [55].
OLAP tools provide the most powerful query techniques that can be effectively employed in a knowledge discovery process, but they are not able to discover patterns by themselves [56]. A powerful and commonly applied OLAP tool for higher-dimensional data aggregations is the data cube. The data cube is an N-dimensional generalization of the more commonly known SQL aggregation functions and GROUP BY clause [57]. The schema for image and vector data summaries that was populated for the cloud patterns detection image mining application in Section 5.4.2 can also serve such OLAP tools.

The second approach is attribute-oriented induction, which, in contrast with the DW, is an on-line, generalization-based data analysis technique. This technique is based on learning-by-examples algorithms from AI and machine learning, integrated with database operations (i.e., GROUP BY) [58]. The main goal is to generalize low-level data into high-level concepts using conceptual hierarchies derived from the data itself in a self-organizing way, rather than designed and stored explicitly by a domain expert.
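The data cube idea can be sketched by computing a GROUP BY aggregate for every subset of the dimension columns. The example below mimics the SQL CUBE operator in plain Python; the cloud-cover summaries and all column names are invented for illustration.

```python
from itertools import combinations

def data_cube(rows, dims, measure):
    """Aggregate `measure` over every subset of the dimension columns,
    mimicking the SQL CUBE operator (an N-dimensional GROUP BY)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):          # each roll-up level
            totals = {}
            for row in rows:
                key = tuple((d, row[d]) for d in group)
                totals[key] = totals.get(key, 0) + row[measure]
            cube[group] = totals
    return cube

# Hypothetical cloud-cover summaries by month and region.
rows = [
    {"month": "Jan", "region": "N", "cloudy_px": 120},
    {"month": "Jan", "region": "S", "cloudy_px": 80},
    {"month": "Feb", "region": "N", "cloudy_px": 60},
]
cube = data_cube(rows, ("month", "region"), "cloudy_px")
```

Materializing such pre-aggregated totals is exactly what a data warehouse does to avoid recomputing expensive aggregates at query time.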

Association rule discovery

Another kind of pattern that can be extracted from a database is the association rule between independent instances of the database. An iterative database method is employed to search for frequent item sets


in a transactional database. The association rules are then derived from the discovered frequent item sets [50]. Association rule discovery can be formally defined as follows: “An association rule is an expression X => Y (c%, r%), where X and Y are disjoint sets of items from the database; c% is the confidence, the proportion of database transactions containing X that also contain Y, in other words the conditional probability P(Y | X); and r% is the support, the proportion of database transactions that contain both X and Y, i.e., the union of X and Y, P(X ∪ Y)” [59].

This database-oriented technique for data mining can also be applied to the attribute data produced for the vector and image data in Section 5.4.2 of this thesis, for our cloud patterns detection image mining application. A high computational cost is involved, as each database item is visited in the rule discovery process. Existing association rule mining algorithms are mostly Apriori-like approaches.
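The Apriori-style search for frequent item sets, together with the confidence measure P(Y | X) from the definition above, can be sketched as follows (a toy illustration, not a scalable implementation):

```python
def apriori(transactions, min_support):
    """Frequent item sets via the Apriori principle: every subset of a
    frequent item set must itself be frequent."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted({i for t in sets for i in t})
    frequent, k_sets = {}, [frozenset([i]) for i in items]
    while k_sets:
        # count the support of the current candidates
        counts = {c: sum(1 for t in sets if c <= t) for c in k_sets}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        # candidate generation: join surviving k-sets into (k+1)-sets
        k_sets = list({a | b for a in survivors for b in survivors
                       if len(a | b) == len(a) + 1})
    return frequent

def confidence(frequent, x, y):
    """Confidence of rule X => Y: P(Y | X) = support(X and Y) / support(X)."""
    return frequent[x | y] / frequent[x]
```

Because candidates are only grown from surviving item sets, infrequent branches of the search space are pruned without visiting them.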

3.2.3 Machine learning approaches for data mining

Machine learning (ML) is a broad subfield of artificial intelligence (AI) concerned with the development of algorithms and techniques that allow computer systems to “learn” from data. ML and data mining share the same broad goal of finding novel and useful knowledge in data, and therefore have techniques and processes in common. The fundamental difference between ML and data mining lies in the volume of the data being processed [60]. Like statistical methods, machine learning methods search for the model that best matches the test data. Unlike statistical methods, the search space is a cognitive space of n attributes instead of a vector space of n dimensions [61]. Machine learning algorithms are modified when they are employed for data mining, according to the range of data types, tasks, and methods involved. We will discuss some machine learning techniques that have been used extensively in data mining:

1. Decision trees. A decision tree is a flowchart-like structure consisting of internal nodes, leaf nodes, and branches. Each internal node represents a decision or a test on a data attribute, and each outgoing branch corresponds to a possible outcome of the test. Each leaf node represents a class [62]. Decision trees are extensively used in mining applications when a decision for a class is based on many different data sources. Five common tree algorithms are CART, CHAID, ID3, C4.5, and C5.0. Decision trees are used in data mining for pattern discovery in highly multivariate data; this applies to categorical output, when the target variable is discrete.
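The statistical split criteria named above are easy to make concrete. The sketch below computes entropy and information gain, the criterion ID3 uses to choose the test at each internal node; the attribute and class names are invented for illustration.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(rows, attr, target):
    """Entropy reduction achieved by splitting `rows` on attribute `attr`:
    the higher the gain, the better the test separates the classes."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder
```

A split that separates the classes perfectly has gain equal to the entropy of the full set; an uninformative split has gain zero.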

2. Neural networks are information-processing devices that consist of a large number of simple nonlinear processing modules called neurons. Neurons are connected by weighted links that carry out the information storage and programming functions. Decision boundaries are calculated from rules built during the training process; the decision rule amounts to finding proper values for the weights associated with the interconnections of the neurons. There is no fixed decision rule, as it is evaluated iteratively by


minimizing some error criterion on the labeling of the training data. Neural networks have been widely used as a classification technique in supervised data mining.
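The iterative weight adjustment described above can be illustrated with the simplest possible neural network: a single perceptron trained on the logical AND function (a toy sketch, not a practical classifier).

```python
def train_perceptron(samples, epochs=20, rate=0.1):
    """Single neuron with a threshold activation, trained by iteratively
    adjusting its weights to reduce labeling errors on the training data."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, target in samples:                    # target is 0 or 1
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out                       # error criterion
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
            b += rate * err
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

The learned weights and bias define the decision boundary; for linearly separable data the update loop is guaranteed to converge.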

3. Rule induction produces a set of if-then-else rules from a database. Unlike decision-tree methods, which employ a divide-and-conquer strategy, rule induction methods typically learn one rule at a time, covering part of the data. Some popular methods are CN2, IREP, RIPPER, and LUPC.

4. Hidden Markov models are statistical models in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from observations. Recent finite-state-machine methods, including maximum entropy Markov models (MEMMs) and conditional random fields (CRFs), have shown high performance in various structured prediction problems [60].

5. Kernel methods. A kernel is a function that transforms the input data to a high-dimensional space in which a problem becomes easier to solve. Kernel functions can be linear or nonlinear. Support vector machines (SVMs) have been widely used as kernel methods in data mining for both structured and unstructured data, such as text and image data.
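Two common kernel functions can be written directly from their definitions. Each returns the inner product of its arguments in a higher-dimensional feature space without ever computing the mapping explicitly (a sketch for illustration only):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: the inner product of x and y in an implicit
    infinite-dimensional feature space, computed in the input space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, degree=2):
    """Polynomial kernel: inner product in the space of all monomials
    up to the given degree."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** degree
```

An SVM only ever needs the matrix of pairwise kernel values (the Gram matrix), which is why the explicit high-dimensional mapping can be avoided.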

3.3 Spatial data mining

Spatial data mining extends data mining techniques with spatial concepts. Primitives of spatial data mining are based on the concept of neighbourhood spatial relations. Spatial operators, such as topological, distance, and direction operators, are combined with logical operators to express more complex neighbourhood relations. Such neighbourhood considerations make the data mining process more complex [53]. The differences between classical and spatial data mining are:

1. Classical data mining deals with numbers and categories. In contrast, the spatial data is complex and includes extended objects such as points, lines, and polygons.

2. Classical data mining works with explicit inputs, whereas the spatial predicates are often implicit in terms of topological relations such as over- lap and within.

3. Data samples for statistical procedures in classical data mining are high-dimensional and independent, whereas the data in spatial data mining are high-dimensional as well as auto-correlated.

4. The spatial data varies across several scales of resolution. Spatial dependencies (geographic dependencies in which some objects are always related to other objects) on a small scale turn into random variation when analysed using broader units of measurement. Irwin [13] performed

spatial analysis at several spatial scales to examine substantive scale dependencies in the underlying processes that influence urban land-use patterns. It was observed that the estimated parameters of the regression model vary significantly across the different spatial scales of analysis.

Making the implicit relations between geographic objects explicit is vital in any spatial knowledge discovery process. Topological relations, distance, and dimension are implicit in geographical data. They are made explicit through identifying objects, defining ontologies, and measuring spatial autocorrelation. “Ontology is a content theory which contains a general set of facts to be shared and whose main contribution is to identify specific classes of objects and relations that exist in some domain” [63]. Spatial autocorrelation is a measure of the dependencies among neighbouring objects in space.

Many analogous terms are used for the identification of objects: filtering, feature extraction, finding dependencies, or using the spatial dimension in the discovery process. Feature extraction is performed either prior to or during the mining stage. Identified objects and the relationships between them are represented as attributes and one-to-one or one-to-many relationships in the conceptual schema.
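Spatial autocorrelation can be made concrete with Moran's I, a standard measure of the dependency among neighbouring objects. The sketch below is a minimal illustration; the binary chain-neighbourhood weights are an assumption of the example.

```python
def morans_i(values, weights):
    """Moran's I spatial autocorrelation statistic.
    `values`  -- one attribute value per spatial object;
    `weights` -- weights[i][j] > 0 when objects i and j are neighbours."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # cross-products of deviations for every neighbouring pair
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / den)
```

Values near +1 indicate that similar values cluster in space, values near −1 that dissimilar values are neighbours, and values near 0 spatial randomness.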

3.3.1 Spatial statistics

The difference between classical and spatial statistics is similar to the difference between classical and spatial data mining. Data samples in spatial statistics are not independent, but highly correlated. Spatial autocorrelation is the area of statistics devoted to the analysis of such dependency in spatial data: it is a measure of the dependency between spatial data variables. The spatial distribution of values or classes of a certain attribute sometimes shows distinct local trends that contradict the global trend. Measures of dispersion such as the range, standard deviation, and variance include a weight of proximity when calculated from geographic distributions [21]. Various spatial statistical methods for spatial analysis in spatial data mining are listed in Table 3.2.

Spatial data quality, in terms of accuracy, also contributes significantly to spatial statistical results. Environmental interference, limitations of acquisition devices, the transmission process, etc., induce errors in spatial data. The fuzziness of geographic phenomena and vague object definitions further affect spatial data quality [64]. Another influencing factor, especially in spatial statistical aggregations, is the choice of resolution, which raises the modifiable areal unit problem, comprising the scale and zoning effects. The scale effect may lead to different statistical results if information is grouped at different levels of spatial resolution. The zoning effect refers to the variability of statistical results when the borders of spatial units are chosen differently at a given scale of resolution. Both effects need to be handled carefully when aggregating spatial data in a spatial statistical analysis [65].

Table 3.2: Statistical methods for spatial data mining.

Data preparation — Spatial Outlier Detection: a spatial outlier is a spatially referenced object whose non-spatial attribute values differ significantly from those of the other spatially referenced objects in its spatial neighbourhood [67].
• Graphical tests: based on visualization of spatial data, such as variogram clouds and Moran scatterplots [66].
• Quantitative methods: provide a precise test to distinguish spatial outliers from the remainder of the data.
• Factor-based methods: efficient local outlier discovery methods that assign each object a spatial outlier factor (SOF), a degree to which the object deviates from its neighbours [67].
• Interpolation and Estimation: statistical tools for modelling spatial variability and for interpolation (prediction) of attributes at unsampled locations.

Segmentation — Classification: classification in spatial statistics considers the spatial correlation between objects when modelling the spatial dependencies between them. The simplest way to model spatial dependencies is through spatial covariance. The two methods that have largely been employed to model spatial dependencies in classification/prediction problems are:
• Spatial Autoregression Model (SAR): linear regression models assume that the variables are independent, whereas spatial autoregression models incorporate spatial autocorrelation and neighbourhood relationships into the classical linear regression model. If the spatial autocorrelation coefficient is statistically significant, SAR also quantifies the spatial autocorrelation [66].
• Markov Random Fields (MRF): MRF-based Bayesian classifiers estimate a model using MRFs and Bayes' rule. A set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighbourhood matrix) is called a Markov random field. The Markov property specifies that a variable depends only on its neighbours and is independent of all other variables [21].

Segmentation — Spatial Clustering: spatial clustering is based on the fact that objects in space are grouped together and exhibit patterns; however, the statistical significance of spatial clusters should be measured by testing the assumptions in the data. One such measure is based on quadrats, well-defined areas, often rectangular in shape; statistics are derived from counts calculated from the location and orientation of the quadrats. After verification of the statistical significance of the spatial clustering, classical statistical algorithms can be used to discover interesting clusters [66]. Spatial clustering methods incorporate both the spatial and the non-spatial attributes of an object.
• Hierarchical Clustering Methods: start with all patterns as a single cluster and successively perform splitting and merging until a stopping criterion is met.
• Partitional Clustering: data points are allocated iteratively to find clusters of spherical shape; squared error is the most frequently used criterion function. K-means and k-medoids are commonly used partitional algorithms. A k-medoids algorithm can also be scaled by taking advantage of a spatial index structure such as the R-tree for clustering a large spatial database. Some recent algorithms in this category are partitioning around medoids (PAM), clustering large applications (CLARA), clustering large applications based on randomized search (CLARANS), and expectation-maximization (EM) [21].
• Density-based: clustering algorithms that find clusters based on the density of data points in a region. Examples are density-based spatial clustering of applications with noise (DBSCAN) and density-based clustering (DENCLUE) [21].
• Grid-based: first quantize the clustering space into a finite number of cells and then perform the required operations on the quantized space. Examples are the statistical information grid-based method (STING), WaveCluster, BANG-clustering, and clustering in quest (CLIQUE) [21].

Dependency Analysis
• Spatial Co-location: the co-location pattern discovery process finds frequently co-located subsets of spatial event types, given a map of their locations. It measures spatial correlation to characterize the relationship between different types of spatial features. The measures of spatial correlation include the cross K-function with Monte Carlo simulation, mean nearest-neighbour distance, and spatial regression models [66].
• Spatial Association Rule Mining: discussed under the spatial database-oriented approach.


3.3.2 Spatial database approach to data mining

A spatial object is described by its geometric properties, such as shape, location, and area, and by the topology in which it is embedded: the relationships between it and other objects. Spatial objects are stored as geometric attributes along with non-spatial attributes in a spatially extended database. Topological relationships are not stored explicitly but are calculated at run time. Spatial objects can be queried and analysed along with other non-spatial attributes through SQL in a declarative way.

The SQL for a spatial database supports topological operators, such as within, touches, and intersects, and spatial functions for measurement, management, transformation, etc. Spatial joins are based on topological relationships. SQL operators and functions also take advantage of the performance and fast-retrieval measures of the spatially extended database, such as spatial indexing, hierarchical storage structures, and query optimization. Spatial data can also be pre-processed and materialized within the spatial database. This makes spatial databases the most suitable platform for spatial data mining, as compared to the databases used for classical data mining. Spatial data aggregation and spatial association rule discovery are the most exciting spatial database-oriented techniques in spatial data mining.
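As an illustration of how a topological operator such as within can be evaluated at run time, the sketch below implements the classic ray-casting point-in-polygon test in plain Python. This is a simplification: real spatial SQL implementations such as PostGIS additionally handle boundary cases, spatial indexes, and more general geometries.

```python
def within(point, polygon):
    """Ray-casting test for the topological predicate `point within polygon`.
    `polygon` is a list of (x, y) vertices; edges wrap around to the start."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # does this edge cross the horizontal ray cast to the right of the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside    # an odd number of crossings means inside
    return inside
```

Because such predicates are computed rather than stored, the database only needs the geometry itself plus an index to prune candidate pairs before the exact test.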

Spatial data model

A spatially extended database is a repository of integrated spatial and non-spatial data. Implicit geographic objects are made explicit through conceptual and logical data models that identify and define objects and the relationships among them. Modelling of spatial data involves modelling the shape and location of objects, modelling dependencies and topological relationships among the spatial objects, and modelling continuous geographic data such as elevation. Unlike statistical methods, the search space is a cognitive space of N spatial objects with spatial and non-spatial attributes instead of a vector space of N dimensions; for continuous geographical phenomena, however, statistical methods are used extensively to quantify the dependencies in terms of autocorrelation.

Modelling space, feature identification, and feature extraction are analogous terminologies used in the literature. A spatial feature has some spatial and non-spatial characteristics of a geographic object. A layer represents a distinct set of geographic features in a particular application domain. Both aim to extract implicitly encoded information on spatial relationships. Various models have been proposed and implemented in spatial databases. We will discuss the two modelling techniques most widely adopted by the spatial database community.

The first approach is the object model, which has OGC consensus and is the most widely used; it has been implemented in PostGIS and Oracle. Things in the object model are treated as objects. Spatial geometry is realized through spatial data types such as point, line, and polygon. Spatial relationships are described by topological, directional, and metric units.

The second approach is that of neighbourhood paths and graphs, proposed by Ester [53]. A neighbourhood relationship is represented by a graph having N

nodes and E edges. Each object is a node, and two nodes are connected by an edge. An edge represents length and direction; a node represents topological relationships. This is a highly effective method for modelling dependencies, but it calculates and stores some unnecessary relationships, resulting in high computational costs.

A graph-based approach is extremely suitable for explicitly storing spatial object relationships in non-structured data such as images, to support similarity searches for object retrieval from an image database. Different graph models, such as attributed relational graphs (ARGs) and region adjacency graphs (RAGs), are employed to aggregate image regions.

Multilevel spatial data generalization and aggregation

We have already discussed, for classical data mining, that a data warehouse integrates different types of data, reduces the dimensionality of the data, and reduces runtime computational costs; it therefore helps in scaling the mining algorithms that find patterns in the database.

Elzbieta [68] defines terminology for a spatial data warehouse (SDW). A spatial level is a level for which the application needs to store spatial characteristics, such as country, state, and city. A spatial attribute is an attribute that has a geometric type, such as point, line, or polygon. A hierarchy defines the navigation path for roll-up and drill-down along a dimension; all attributes in a hierarchy belong to the same dimension. A dimension groups the same category of information: for instance, year, month, week, and day form a hierarchy in the time dimension. A spatial hierarchy is a hierarchy that includes at least one spatial level, and a spatial dimension is a dimension that includes at least one spatial hierarchy. Non-spatial dimensions, hierarchies, and levels are usually called thematic. The related spatial levels in a spatial hierarchy exhibit a topological relationship. A spatial fact relationship is a relationship that requires a spatial join between two spatial levels; the spatial join is based on topological relationships. These definitions give a brief overview of how a designer thinks when designing a spatial data warehouse.

Generalization involves the construction of hierarchies of spatial objects based on spatial and non-spatial attributes. A spatial hierarchy represents a successive merging of neighbouring regions into larger regions. In data warehousing and OLAP, these spatial hierarchies allow both a detailed view and a general view of the data, using roll-up and drill-down operations.

Aggregation is the process of grouping multiple individual objects based on some criteria to form a new composite object.
Aggregation is an important operation, given its ability to reduce spatial complexity while retaining most of the thematic information of the component objects. Aggregation can be based on the similarity of the objects involved or on the functional relationships between them. Similarity of objects is assessed through comparison based on the scale of measurement (i.e., nominal, ordinal, interval, or ratio), the geometric spatial reference such as a coordinate system, and the semantic spatial refer-
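A roll-up along a spatial hierarchy can be sketched as a grouped aggregation over the hierarchy's navigation path. The city, state, and country names and the measure below are invented for illustration:

```python
def roll_up(facts, hierarchy, level):
    """Aggregate city-level measures up a spatial hierarchy.
    `facts`     -- {city: measure}
    `hierarchy` -- {city: (state, country)}, the navigation path
    `level`     -- 0 rolls up to state, 1 rolls up to country."""
    totals = {}
    for city, value in facts.items():
        key = hierarchy[city][level]
        totals[key] = totals.get(key, 0) + value
    return totals

# Hypothetical cloud-free observation counts per city.
facts = {"Enschede": 10, "Amsterdam": 25, "Antwerp": 7}
hierarchy = {"Enschede": ("Overijssel", "NL"),
             "Amsterdam": ("Noord-Holland", "NL"),
             "Antwerp": ("Antwerpen", "BE")}
```

Drill-down is the inverse operation: returning from the country totals to the finer state or city levels, which is why a DW materializes the lower levels rather than discarding them.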


ence such as a place name. Classification (categorizing things into classes) based on such similarity at different levels leads to a classification hierarchy, or taxonomy. Functional relationships lead to the formalization of geo-ontologies, or partonomies, which can be either user-defined or derived from statistical properties of the dataset itself through quantification of the spatial autocorrelation. Defining ontologies on the abstraction of spatial entities is an active area of research [69].

Spatial data warehousing and spatial data mining are two complementary techniques for knowledge discovery. The first mainly focuses on the development of multidimensional spatial data models to support spatial data aggregation and spatial navigation. In contrast, spatial data mining deals with the development of algorithms for the discovery of complex spatial knowledge, such as spatial clustering and spatial association rules. Currently, a most exciting research area is the integration of multidimensional data modelling, OLAP, and data mining in one comprehensive system. Such integration will address different aspects of data modelling, user interaction, and complex knowledge extraction [70].

In our cloud patterns detection application in Chapter 5, summary data for all vector polygons in the study area was calculated through spatial aggregate functions. These summaries were then materialized in the spatial database as a spatial data warehouse. For image data mining, the principal component analysis algorithm was applied to the time-series images of the study area, and statistics over the resultant PCs were materialized in the database. The vector summary data, such as the total area of the study area, was joined with the time-series image statistics to construct a spatio-temporal query for pattern analysis.
This integration of vector data or any other non-spatial data with remote sensing image analysis results can be extremely useful for constructing knowledge in an image mining process. It also presents a method for combining spatial data warehousing and image mining, integrating various sources of information at different levels. This approach can be effectively applied to spatio-temporal pattern analysis, for instance of a specific phenomenon or a land cover type at various geographical levels.

Spatial association rule discovery

The concept of an association rule from classical data mining is extended to spatial association rule discovery by defining associations between objects based on spatial neighbourhood relations, expressed as spatial predicates rather than items. Spatial association rules consist of an implication of the form X => Y, where X and Y are sets of predicates and at least one element in X or Y is a spatial predicate [71]. At least three steps are involved in spatial association rule discovery [65]:

• Compute spatial relationships between objects as a data pre-processing step, to reduce the otherwise high computational cost.

• Find frequent sets of predicates. A predicate set is frequent if its support is at least equal to the minimum support.


• Generate strong association rules. A rule is strong if it reaches the minimum support and its confidence is at least equal to a given threshold.

3.4 Image mining

The knowledge gap between much data and little information is further widened by the semantic gap between low-level image feature representations and high-level application concepts. The vital and most challenging difference between spatial and image data mining is that the latter involves identification and extraction of spatial features or objects from the pixels of an image through image processing techniques. Spatial features can be extracted from the image using techniques for feature extraction, segmentation, and image classification. Implicit relationships between the extracted objects are made explicit by providing semantics when identifying meaningful regions, for instance, identifying certain geometric features in a series of images as deforestation patterns. Knowledge is discovered at a high level after associating these deforestation patterns with application-specific concepts, such as deforestation due to certain human activities: small and large farms constructed by farmers, for example, represent different deforestation mechanisms that result in deforested areas of different sizes and shapes. The application concepts are different classes of spatial objects associated with a specific application domain. Silva and Camara [74] define image mining as associating the spatial patterns in the image domain with the application concepts in the application domain. In general, the four stages in image mining are:

• Pre-processing the image data for error corrections, analogous to data preparation in classical data mining.

• Feature extraction through image processing techniques, for instance extracting buildings from a remote sensing image through segmentation.

• Associating extracted features with application concepts, resulting in spatial configurations, for instance, residential buildings and factory buildings.

• Analyzing spatial configurations to obtain useful knowledge, for instance, that the factory area has shrunk over the last 30 years.

Silva and Camara [74] performed image mining for Amazon deforestation patterns in a time-series image database in three phases:

• In the first phase, they defined landscape objects for the application domain; here, a landscape object is a deforested area. They selected corridor, diffuse, and geometric shape patterns of deforestation, recognized on the basis of factors such as shape and spatial arrangement in Amazon deforestation.

• In the second phase, these landscape objects were identified and extracted from a set of sample images in order to build a reference set of spatial patterns. This reference set was obtained through segmentation of sample images under a cognitive assessment process, in which a human specialist associates landscape objects (from the image) with the spatial patterns typology (corridor, diffuse, geometric). Once the reference set of spatial patterns was built, the next phase was to use it to mine spatial configurations from the time series of images in an image database.

• In the third phase, geometric structures emerging from clustering of an image were mapped to the spatial patterns of deforested areas from the training data. This mapping of deforested areas for a time series of images t1, t2, t3, ..., tn was stored as spatial configurations in the form of database records. A spatio-temporal pattern would then be the set of all spatial configurations mapped as deforested area for an image time series over a long period such as 30 years. Using these spatial configurations stored as database records, analysis was performed to answer questions such as “Did the area of large farms increase during ten years?”

In our cloud patterns detection application scenario, using database-oriented techniques for data mining, vector polygons representing the boundary of the study area were used to identify the pixels within the polygons. Principal component analysis was then applied to the time series of images and the statistics over the output were stored in the database as records. A time-series pattern analysis was then performed over the attribute data. In the following sections we discuss image mining at a low level for feature extraction, and the provision of semantics to the extracted features at a high(er) level.

3.4.1 Low-level image analysis for feature extraction

At a low level, the important activities are object identification and extraction from the pixels of a single image. A pattern can be considered a unique structure that can be extracted from an image and that describes a specific phenomenon such as a land cover type. In the context of remote sensing, a pattern is a spectral signature: a set of spectral radiances measured in different bands of a multispectral image. Pattern or feature extraction based on the pixels of a single image is a matter of image analysis, image processing, and recognition. Classification and segmentation are two widely used image analysis techniques for pattern extraction from images.

There are two approaches to pixel classification in remote sensing. One of them attempts to relate pixel groups with actual earth-surface cover types such as vegetation, soil, urban area, and water. These groups of pixels are called information classes. The other approach determines the characteristics of non-overlapping groups of pixels in terms of their spectral band values. The groups of pixels in this case are known as spectral classes [75]. The former approach, where samples from the information classes (training data) are used for learning and then for classifying unknown pixels to find patterns, is called supervised classification. The latter approach, where the spectral classes are first found without a priori knowledge of the information classes and their relationship with the information classes is then established using a map or ground truth, is called unsupervised classification. Unsupervised techniques generally imply the use of statistical methods to decide on decision boundaries [76].

Image data quality is one of the concerns when dealing with the image at the low level. At this level, many uncertainties arise that cause limitations in land cover/land use classification during the image mining process. The pixel in remote sensing has its own inherent problems, such as the physical process of image creation and the mixed-pixel problem, that need to be considered before the image mining process. Both classification and segmentation have supervised and unsupervised techniques in the image domain. Classification can be performed both at pixel level and at object level.

Pixel-level image analysis for pattern discovery

The image analysis at this level is based on characteristics of the single pixel. Pixel classification techniques attempt to assign a class label to an individual pixel based on its spectral value or feature space. This space is a multidimensional space created from the different spectral bands of a remote sensing image. Unsupervised classification, which includes various clustering algorithms such as ISODATA and k-means, considers only spectral distance measures in a statistical analysis for spectral grouping. Supervised classification, such as the maximum likelihood classifier, incorporates prior knowledge along with the spectral distance measures. Rule-based classifiers such as neural networks and expert classifiers also rely on the spectral characteristics of the single pixel in defining rules.

Traditional statistical classifiers rely exclusively on the spectral characteristics, but thematic classes are often spectrally overlapping [72]. Mining activity in a spatial context extracts knowledge about:

• Patterns associated with land use classes and their evolution in space and time.

• Identification/extraction of objects and their implicit relationships in terms of dependencies.

It is quite difficult to extract objects using individual pixel values without considering the context, i.e., a pixel and its relationships to the surrounding pixels [73].

Object-level image analysis for pattern discovery

Image analysis at the object level does not attempt to classify a single pixel in isolation, but rather uses a pixel and its relationships to surrounding pixels to formulate an object or segment. Pixels have relations to surrounding pixels that provide the basis for grouping them into objects. Object- or segment-based classification approaches try to combine neighbouring pixels with similar properties into image segments. The segments have spatial characteristics like shape, size, texture, colour, etc., as well as spectral characteristics and spatial relationships. After segmentation, this information over the segments is used for classifying the objects. Once an image is transformed into objects having crisp boundaries (i.e., polygons), approaches from spatial data mining can also be directly applied to image mining. Therefore, segmentation is normally performed in any image mining activity. A platform that provides support for image mining must have at least one segmentation algorithm.

Image segmentation methods are either knowledge-driven or data-driven. Knowledge-driven (top-down) methods are of the supervised type and apply prior knowledge in the segmentation process. Data-driven (bottom-up) segmentation algorithms are based either on discontinuity or on similarity of the intensity values in an image. Discontinuity-based approaches partition the image using abrupt changes in intensity, as in edge detection. Similarity-based approaches partition the image into regions of similar intensity. Thresholding, the wavelet transform, mathematical morphology, fuzzy clustering, region growing, and region splitting and merging are examples of similarity-based segmentation methods. Recent surveys indicate that region-growing approaches are well suited for producing closed and homogenous regions [74]. TerraLib provides a region-growing segmentation algorithm.

While transforming an image into objects, the quality of image analysis depends on the relation between scale (abstraction) and spatial resolution. Spatial resolution is the area on the ground that is covered by a single pixel, whereas scale is the magnitude or level of abstraction at which a certain phenomenon can be described. Merging of objects during segmentation is controlled by a scale parameter. The same objects appear differently at different scales. The choice of an appropriate scale is made by domain experts and is directly related to the image semantics.
Once the objects are correctly identified during segmentation, the next step is to classify them. Objects are assigned to classes based on class descriptions. A class description can be based on spectral information such as image bands, and on spatial information such as texture and colour over the segments of a segmented image. Classifiers that take into account spectral, spatial, and structural information while distinguishing different segments are called contextual or structural classifiers. Contextual classifiers try to simulate the behaviour of a photo-interpreter in recognizing homogenous areas in the image, based on the spectral and spatial properties of the images. Spatial contextual classifiers consider spatial relationships and dependencies between objects. The ISOSEG classifier is available in TerraLib to classify segments or regions of a segmented image. It is a non-supervised grouping algorithm applied to a set of regions, which are characterized by their statistical attributes of mean and covariance matrix, and also by area [78].

Another important activity, performed after transforming an image into objects through segmentation, is to link the objects to form a topological network. Each object in this topological network knows its neighbours, sub-objects, and super-objects. Silva and Camara [16] propose a graph mining approach for a topological network of the objects that uses segmentation by region growing, followed by hierarchical region organization using a graph model. Each region of lower scale is contained in a region of higher scale, generating hierarchical region adjacency graphs (hRAGs), which represent each region as a vertex with attributes, with edges describing the topological scale of regions. The topological network and region hierarchy form the framework of a knowledge base that defines image semantics. For example, “Water” close to “Building” defines a water body as a lake located in some urban area. In this way, implicit relationships between the objects are made explicit.

3.4.2 High-level knowledge discovery

At the high level, the major activities are:

• To provide application-specific semantic concepts for the objects extracted at the low level.

• To relate these semantically annotated objects over the time series of images from the image database for knowledge discovery.

Other GIS data, such as vector and non-spatial data from a spatial database, is also employed; therefore the classifiers at this stage must be able to utilize many different sources of data over time. Some structural classifiers associate the spatial patterns discovered in the image with high-level application concepts for knowledge discovery. The spatial patterns are features or objects extracted from an image, and the application concepts can be any attributes that categorize those patterns. The most frequently used structural classifiers are machine learning (ML) classifiers, such as C4.5 and CART, and Bayesian contextual classifiers.

3.4.3 Image mining from integrated image/GIS data analysis

Rogan and Miller [81] summarized four ways in which GIS and remote sensing data can be integrated:

1. GIS can be used to manage multiple data types.

2. GIS analysis and processing methods can be used for manipulation and analysis of remotely sensed data (e.g. neighbourhood and reclassification operations).

3. Remotely sensed data can be manipulated to derive GIS data.

4. GIS data can be used to guide image analysis to extract more complete and accurate information from spectral data.

Image mining, deriving knowledge and patterns through image processing and analysis, is the area where GIS data can potentially be used to guide the overall process. This technique was adopted in the development of our cloud patterns detection image mining application, where vector data was used to identify the study area in the image and so guide the image mining process.


3.5 Summary

Data mining tools and techniques work on an explicit attribute space or vector space of n dimensions, whereas spatial predicates are often implicit, expressed in topological relations such as overlap, within, etc. These are made explicit through identifying objects, defining ontologies, and measuring spatial autocorrelation. Image mining introduces the additional step of image processing to identify objects from the pixels of an image before making them explicit.

In this chapter, the methods, tools, and techniques involved in mining technology were reviewed, and a selection was made for the cloud detection image mining application. These methods are referenced in Chapter 5 in developing the image mining scenario applications using the TerraLib library and PG DBMS technology.

Chapter 4

A database application development method using TerraLib

4.1 Introduction

The TerraLib GIS software library (TL) was adopted as the means to provide image support in the PostgreSQL/PostGIS database and to develop an image mining application based on integrated image/vector data analysis. This chapter explains the classes, the conceptual model, and the remote sensing image handling techniques provided by the TL library that were involved in the development of an image mining database application. Section 4.2 describes the set-up that has been built for image mining, using the TL library and the PostgreSQL (PG) database technology. Section 4.3 explains various design patterns adopted in the development of the TL GIS library. These design patterns were identified during an in-depth study of the TL library. Section 4.4 explains the conceptual model for vector and image data handling for storage, analysis, and visualization. Section 4.5 discusses various image storage options with illustrations and potential applications. This section also describes the image data handling and manipulation options adopted for our cloud patterns detection image mining application.

4.2 TerraLib, TerraView and PostgreSQL/PostGIS set-up for image mining

A set-up for the cloud patterns detection image mining application based on image and GIS data in a spatial database was built, as shown in Figure 4.1. Image and vector data were stored in a PostgreSQL/PostGIS (PG/PG) database. The DBMS PostgreSQL and the spatial extension PostGIS were installed on the Linux Debian operating system. The TerraLib library (TL) was used for algorithm development in our mining application. The visualization interface TerraView (TV) was used for analyzing results generated by the TL application. Both the TL and the TV were installed on the same Linux database server, extending the PostgreSQL DBMS. The whole set-up can be accessed by an application developer through the ITC local area network (LAN).

Figure 4.1: A set-up for the cloud detection image mining application

4.2.1 TerraLib dependencies on open source third-party libraries

The TL library code relies on a number of open source software packages that are used to support some of its kernel functions and to provide image support in the PG. These third-party software packages are usually provided with the TL library; however, a TL application developer should be familiar with these packages while developing and compiling the TL library or application code that uses the TL library. The application developer needs to consider compatibility issues between the TL library and the versions of its dependent software packages. A list of the third-party software packages used by TL is presented in Table 4.1.

4.3 TerraLib application development

The TL library follows a generic programming paradigm for developing reusable libraries, through the object-oriented programming language C++. Generic programming focuses on finding commonality among similar implementations of an algorithm and providing suitable abstractions, so that a single generic algorithm can be used to realize many concrete implementations. Such a programming style is extremely important, for instance, in GIS, where an operator needs to handle various data sets and procedures to perform a single task. For example, various procedures to measure spatial autocorrelation can be adopted for a set of points, a set of polygons, a TIN, a grid or a remote sensing image. Ideally, a spatial autocorrelation algorithm should be independent of the data structures on which it operates, and vice versa. However, many GIS applications provide a large number of monolithic functions, roughly the number of data structures times the number of algorithms.

Generic data structures such as list, set, and map, and generic algorithms such as sort and search are provided in the generic Standard Template Library. The algorithms are completely separated from the data structures. A generalized iterator is provided to connect an algorithm to a data structure, traversing the data structure for selection and manipulation.

Table 4.1: Third-party libraries used by TerraLib for image support in PG

zlib — To compress/uncompress image data when storing it in and retrieving it from a TerraLib database.
libjpeg — A JPEG image compression library, used in two contexts: to compress raster data before storing it in a TerraLib database, and to decode/encode image data in JPEG format.
tiff — To decode/encode raster data in TIFF/GeoTIFF format.
shapelib — To decode/encode vector data in shapefile format.
libltidsdk.a — To decode/encode image data in MrSID format.
libpq.a — The PG client-side development package; it provides the environment for database application development. A front-end application handles BLOBs in the PG database through this client-side package. A normal PG DBMS server installation does not install this client-side development package, so the developer needs to install a version-specific package to connect to the PG server.
GCC compiler — Used to compile the TL application code on Linux Debian. Compatibility between the GCC compiler version and the TL library version needs to be considered.

The TL library applies the generic programming paradigm to the development of a generic GIS library in a four-step process [31]:

• Finding regularities in the spatial data handling algorithms.

• Generalizing the regularities in these algorithms to provide requirements for traversal over data structures.

• Providing iterators based on the requirements for traversal over data structures.

• Designing algorithms that use provided iterators.

A class in object orientation is a template to create an object as an instance of that class. A class has variables to encapsulate the data and state of an object, and methods to encapsulate the behaviour of an object. A class interacts with other classes through the variables and methods that are provided to define an interface for that class. Other classes implement (or realize) the interface of a class by providing structure (i.e., data and state) and concrete method implementations (i.e., code that specifies how the methods work).

A design pattern is a description or template for solving a geo-computational problem that can be adopted in developing a GIS library. Various design patterns have been adopted for generic programming in the development of the TL library. The library follows the design patterns described by Gamma and Helm [80]. It is important to understand the design patterns adopted by the TL classes, as well as the collaboration between different classes in a particular design pattern, before using them in TL application development. As discussed below, various design patterns adopted by the TL library classes were identified. The generic diagrams for the design patterns provided by Gamma and Helm were modified for the TL classes.

1. Template Class
A template class is a feature of the C++ programming language that allows classes to operate on generic types. The type parameters of a class template are supplied when the template is instantiated, at compile time. This allows a class to work on many different data types without being re-written for each one. The best example is the Standard Template Library, a fundamental part of C++ that generalizes data structures such as list and vector, and iterators, so that they can be used with any data type or algorithm. A similar approach has been adopted by the TL library to define, for instance, a generic iterator that can traverse a data structure, such as a generic image data structure, independent of format. The TL library also uses class templates to define subtypes of generic containers. For instance, the TeSingle class template is defined in TeComposite for a parameterised implementation handling an object built of a single atomic element, such as a line object used to construct a polygon.

2. Template Method
Function overloading is a concept of object orientation in which the code for a function is repeated, with only subtle changes, for parameters of different data types. In the generic programming approach, we define a template method's functionality so that it adapts to the data type(s) of its parameter(s), removing the need to repeat common code. A template method further generalizes the behaviour of a function or algorithm through function overriding. Function overriding is a concept of object orientation in which a new version of a method is introduced by extending a class, with only subtle changes in the behaviour of the method from the parent class. Gamma and Helm [80] define a template method as: “Define the skeleton of an algorithm in an operation, deferring some steps to subclasses. Template method lets subclasses redefine certain steps of an algorithm without changing the algorithm's structure.”

42 Chapter 4. A database application development method using TerraLib

In this way the notion of a generic template in generic programming is combined with object orientation (inheritance) to define reusable design patterns in the TL library. In this library, a design pattern as a template is defined after analyzing a geo-computational problem, and is implemented in two steps:

• A base class is the realization of an algorithm as a generic template. The invariant parts of the algorithm are implemented in the base class.
• A specialised class is defined by extending the base class, and implements the behaviours that can vary.

3. Singleton Design Pattern
The singleton design pattern ensures that only one instance of a class can ever be created for an application programme. That single instance is then used by all operations within the application programme. Gamma and Helm [80] define the singleton design pattern as: “Ensure a class has only one instance, and provide a global point of access to it.” The example class diagram in Figure 4.2 shows how a unique precision is set globally for all geometries in an application. That unique precision for all geometric operations can be set at the entry point of an application. For example, in this set-precision call,

TePrecision::instance().setPrecision(TeGetPrecision(region->Projection()))

the precision derived from the region geometry is set globally for all forthcoming geometric operations.

4. Factory Design Pattern
A factory method is designed as an interface to create different objects of the same category. A subclass then implements that interface and overrides the factory method to create an object of a particular category at run-time. The factory method lets a class defer instantiation to its subclasses. Gamma and Helm [80] define the factory design pattern as: “Provide an interface for creating families of related or dependent objects without specifying their concrete classes, and let subclasses decide which class to instantiate.” The factory design pattern is adopted by TerraLib to define many GIS classes and algorithms. For demonstration, the factory design pattern for handling projections in TerraLib is shown in Figure 4.3. As shown, there are four actor classes and interfaces in a factory design pattern: the product, the creator, the concrete product, and the concrete creator. These four actors are explained in the context of a projection factory as follows:


Figure 4.2: Singleton design pattern adopted for TerraLib

Figure 4.3: Factory design pattern adopted for TerraLib


• Product The product interface in the factory design pattern defines an interface for the objects created by the factory method. In this example, the product is an interface for creating a projection object at run-time.
• Creator The creator class in the factory design pattern declares the factory method, which returns an object of the product type. In this example, the TeProjection class is a creator class that can deliver multiple projections as products to the client. The factory method in the creator class is provided to define a high-level map projection definition and the geo-referencing of a satellite image.
• Concrete Creator The concrete creator of the factory design pattern extends the creator class by overriding its factory method with a specific implementation. In this example, the concrete creator TeUTM extends the TeProjection creator super class to override its factory method with the UTM projection-specific details. The creator class realizes any of its extending classes through the following technique: the TeFactory template class is defined with a list that realizes an individual family of factories, for instance, the factory of projections, the factory of decoders, and the factory of databases. TeProjection is the abstract base class for the factory of projections. The list of all concrete creators extending TeProjection (for instance, the TeUTM concrete creator) is passed to TeFactory to realize the factory of projections at run-time. This list of all concrete creators of a factory is initialized at run-time thanks to a static member in each concrete creator, as all static members are initialized at the start of a programme. The advantage is that any new concrete projection class can be added by extending the projection class without recompiling existing classes or explicitly registering them with the kernel.
• Concrete Product The concrete product of the factory design pattern is the product that is returned by its concrete creator. In this example, the concrete creator TeUTM implements the product interface to return an instance of the appropriate concrete product, which is the UTM projection.

5. Strategy Design Pattern
In spatial data handling, there are normally different ways to perform the same function; for instance, there is a range of algorithms to measure spatial autocorrelation for a set of points, a set of polygons, a TIN, a grid or an image. The strategy design pattern generalizes such algorithms so that a particular algorithm can be selected implicitly at run-time depending upon the context. Gamma and Helm [80] define the strategy design pattern as: “Define a family of algorithms, encapsulate each one, and make them interchangeable.” The strategy design pattern is adopted by TerraLib to develop many GIS classes and algorithms. For instance, a large number of algorithms exist for encoding the various raster formats. A strategy design pattern generalizing these algorithms for the various raster formats is shown in Figure 4.4.

Figure 4.4: Strategy design pattern adopted for TerraLib

As shown, there are three actors in a strategy design pattern: the context, the strategy, and the concrete strategy. In the TerraLib implementation, these three actors handle all encoding algorithms for the different raster formats at run-time. Their roles are as follows:

• Context The TeRaster class acts as the context. A context class maintains a reference to a strategy object and calls the concrete strategy through the strategy interface. Any request for a particular raster-format decoding algorithm is made through the TeRaster context class.

• Strategy The abstract class TeDecoder acts as the strategy, declaring an interface common to all supported encoding algorithms for raster formats. The TeRaster class uses this decoder interface to call an algorithm defined by a concrete strategy.

• Concrete Strategy A concrete strategy implements an algorithm behind the strategy interface. The class TeDecoderTIFF acts as a concrete strategy that implements the TIFF decoding algorithm through the strategy interface TeDecoder. A concrete decoder such as TeDecoderTIFF knows how to return a value (as a double) for each pixel of a particular raster format such as TIFF. The concrete decoder implementations vary with the raster format. A concrete decoder for a new raster format can be developed by extending the TeDecoder strategy class, without recompiling existing classes.

Figure 4.5: Iterator design pattern adopted for TerraLib

The context class TeRaster can perform raster operations without knowing the low-level details of each raster format. The client requests the context class to create the raster geometry for a particular image format. The context class then passes all required arguments, such as the string identifier ".tiff", to the strategy interface. The strategy interface obtains an instance of the TeDecoderFactory, which implements the factory design pattern to select the appropriate concrete decoder according to the string identifier. The concrete decoder TeDecoderTIFF returns a value (as a double) for each pixel of the TIFF raster format. The values returned by any concrete decoder are populated with only one instance of the TeCoord2D geometry class.

6. Iterator Design Pattern
The iterator design pattern generalizes iterators for data-structure traversal by encapsulating their internal differences. An iterator is a generalized pointer that can point to the objects in a container, whatever their type. Gamma and Helm [80] define the iterator design pattern as "Provide a way to access the elements of an aggregate object sequentially without exposing its underlying representation."
The C++ Standard Template Library (STL) defines generalized iterators over container types such as vector, tree, and array for algorithms such as search and sort. Instead of developing algorithms for specific container types, they are developed for a generalized iterator category relevant to the various container types. TerraLib geometries adopt the iterator design pattern to give external algorithms a generic iterator for traversing their internal structures. The iterator design pattern as adopted by the TL library is shown in Figure 4.5. As shown, there are four actors in an iterator design pattern: the iterator, the aggregate, the concrete iterator, and the concrete aggregate. These four actors, as used in the TerraLib implementation to develop an iterator over the raster geometry, are explained as follows:

• Aggregate An aggregate defines an interface for creating an iterator object. The TeRaster class acts as an aggregate, providing an interface for creating a generalized iterator object at run-time so that an algorithm can traverse a raster format.

• Iterator An iterator defines an interface for accessing and traversing raster elements. TeRaster::Iterator is an iterator interface that allows traversal over the raster elements (pixels) in the same way as the STL iterators traverse data structures such as a list. The iterator can walk the raster elements independently of any raster format or algorithm.

• Concrete Iterator The TeRaster::iteratorPoly class acts as a concrete iterator that implements the iterator interface with an area restriction, covering only the raster elements (pixels) that lie inside (or outside) a specific region (polygon).

• Concrete Aggregate A concrete aggregate implements the iterator-creation interface to return an instance of the proper concrete iterator. The TeRaster class also acts as a concrete aggregate, providing begin and end methods for creating an iteratorPoly iterator object.

7. Bridge Design Pattern
The bridge design pattern decouples an abstraction from its implementation so that the two can vary independently. For example, TeGeomComposit is a template class provided as the abstraction for handling a hierarchy of geometries in the TL library. This class is used for instantiating the different composite geometries. The TePolygonSet class extends the TeGeomComposit class for storing multiple polygons; similarly, the TeLineSet class extends it for storing multiple lines.
Another advantage is efficient memory management, as an instance of the abstraction class maintains a reference to an object of the implementer type. Any new object points to the same memory area as the first instantiated class object of a similar type. For example, by adopting the bridge design pattern, multiple copies of a polygon in an instance of TePolygonSet are allowed to share the same memory area.

4.4 Conceptual data model

The conceptual model for TerraLib is built in the database when the TeDatabase class is initialized in memory. This initialization requires an instance of a specific driver class for a DBMS, such as TePostGIS for PG/PG.

Figure 4.6: TerraLib software architecture [79]

The data model creates spatio-temporal data structures for storage and visualization. The TL library distinguishes between data sources, for the storage of spatial data in the database, and data targets, for the visualization of spatial objects in TerraView. The spatio-temporal data structures for data sources and data targets are maintained by the TL kernel in the database and are the core of the TL library. The kernel includes classes for the storage, retrieval, maintenance, manipulation, and visualization of spatio-temporal objects in the PostGIS DBMS. On top of the kernel structures, the TerraLib functions for spatio-temporal analysis and image processing are built. These functions are then called from the various interfaces of a TL application. The architecture of the open source TerraLib software is shown in Figure 4.6.

4.4.1 Data model for storage
The abstractions database, layer, representation, and projection belong to the source domain, which governs the storage organization and the hierarchy of spatio-temporal objects in the spatial database, as shown in Figure 4.7. These abstractions for the source domain are explained below:

• Database The te_database table describes the metadata and the layers of spatial data in the database.

• Layer The te_layer table stores layer information in the database. A layer is a container of spatial objects that share a set of attributes. A layer is represented in memory as an object of the TeLayer class and in the TL database as a record of the te_layer table. A layer can hold vector data with point, line, and polygon objects, or raster data such as elevation data or an image. A layer is created in the database when a vector data file in an interchange format such as shapefile or MID/MIF, or raster data such as a TIFF image, is imported into the database. Layers can also be created by processing other layers in the database; for example, in our cloud-pattern detection image mining application in the next chapter, the overlay intersection of the Netherlands boundary layer and an image layer is stored as a separate layer in the database.

Figure 4.7: Conceptual data model of the source domain for image and vector data storage in PG, modified from [29]

• Representation The te_representation table manages the representations of all geographical objects, such as point, line, polygon, and raster, in a layer. Each representation of a layer has its own table inside the database, whose name is the geometry name with the layer id appended, as shown in Figure 4.7. For example, if a vector shapefile with layer id 152 has the three geometries line, polygon, and point, TerraLib creates three individual representation records in the te_representation table and the three corresponding tables polygon_152, line_152, and point_152 in the database. Each representation table has a geometry field spatial_data whose "GEOMETRY" type is the spatial type provided by the spatial extension PostGIS.
The TL library also supports multiple representations of a geometry. For example, a district can be represented by a polygon or by a point, depending on the scale; a centroid is calculated when a point representation of a polygon is required. The library also supports complex representations such as cell spaces and networks. For the raster type, TerraLib supports multi-dimensional regular grids.

• Projection As mentioned above, each representation table of a layer has a geometry field spatial_data whose "GEOMETRY" type is a spatial type provided by the spatial extension PostGIS. The representation table name, along with the geometry column name, is recorded in the geometry_columns PG metadata table to assign a PG data type to that TL representation. However, the TL geometries do not use the projections provided by PG, leaving the SRID column value at "-1". The TL library instead uses its own cartographic projections, defined in the kernel. A projection is represented in memory as an object of the TeProjection class and in the database as a record of the te_projection table. Metadata for all attribute tables associated with a layer are recorded in the te_layer table. For raster data such as satellite imagery, no complex data type is provided by the spatial extension PostGIS; therefore, a BLOB is used for the actual image data storage and retrieval in the database.

4.4.2 Data model for visualization

The abstractions view, theme, visual, and legend belong to the target domain for the visualization of spatio-temporal objects. These visualization abstractions describe data retrieval and presentation for front-end applications such as TerraView. The abstractions for the target domain are:


• View Just as the database organizes layers of spatial data, a view organizes one or more themes containing spatial objects. A theme in a view represents a layer in the database. A view aggregates all layers that should be presented and handled simultaneously.

• Theme A layer in the database is added as a theme in a view when it is required as input for a GIS analysis or for other functionality provided by TerraView. A TL query retrieves tuples from layers, converts these tuples into a set of objects, and groups the objects in a theme. The operator can add any number of themes to a view and select some or all of them as required to perform a function. A selected theme is called a visible theme. A view can have any number of visible themes and one active theme. The active theme is the theme whose descriptive attributes are shown in the grid area of TerraView. All visible themes in a view except the active theme are drawn from bottom to top; the active theme is drawn last. Themes are also used to make selections over the geographical objects of layers. A restriction can be defined on conventional attributes, or on the spatial or temporal properties of the objects. A spatial query is based on a spatial relation, such as within or touches, among the object geometries of one or two themes.

• Visual and Legend The visual and the legend represent a set of attributes for the grouping and visual representation of the data.

All these visualization abstractions create tables in the database and objects in memory. The TL library provides a visualization interface called TerraView for image and vector data, and for spatial analysis based on such data. However, the TL library also offers other options for developing a spatial analysis application on top of the TL library and providing a visualization interface for it:

• An application developer can develop a visualization interface around his prototype rather than using the TerraView visualization interface. In this case, the TeApplication class provides a simple interface for the visualization of GIS data. It provides the run() and show() methods to construct a user interface for displaying data, which may be a simple geometrical structure or a more complex layer.

• The second option is to build a GIS application on top of the TL library and add it to TerraView as a plug-in to provide the visualization interface. The added plug-in works seamlessly with the other functions provided by TerraView.

Our pattern-detection image mining application in Chapter 5 was developed using the TL library. The results are analysed through the TerraView visualization interface. The application will later be added as a plug-in to TerraView, to behave like the other functionality TerraView provides.


4.4.3 Image data handling in the TerraLib database
There are three main TerraLib classes for handling raster data inside a PG database:

• The TeRaster class, a generic raster data structure. An instance of this class is obtained for generic raster operations.

• The TeRasterParams class to set or get raster parameters such as tiling type and block size.

• The TeDecoder class, the parent class of all decoders that handle the various image formats such as TIFF and JPEG. A decoder class for each image format is defined by extending the TeDecoder class, for instance the TeDecoderTIFF class for the TIFF format.

A time-series of images can be imported into the database for a geographic location to build a land use/land cover change detection application, for instance for urban growth or flooding. Each new image in the time-series creates a separate layer that can be overlayed with the existing raster and vector layers in the database.
When a single image is imported, a record is inserted as metadata in each of the te_layer and te_representation tables. Furthermore, three tables are created for storing metadata and data at each image import. This means that importing 100 images creates 300 tables in the database. The tables in this case are almost static, so there is no performance issue, since PostgreSQL places no practical limit on the number of tables or on the database size. As the schema constantly grows in this case, the database administrator must adopt some repository management measures. Alternatively, images can be imported as bands of a single raster layer; the number of tables then does not grow, but the tables themselves grow as each new image is inserted as a raster band. This is useful when an algorithm needs to process pixels from many images of the same geographical area through a precise overlay of all image layers. For our cloud-pattern detection image mining application, an application program was developed that takes the directory path of the images on the file system as input and creates a time-series database, inserting each image as a separate layer.
Another option is to import the images of a whole geographic area, such as a country or a continent. An image mosaic is used to import all images of a larger area into a single raster layer in the database. An image mosaic creates only three tables inside the database for any number of images imported into that layer. The tables in this case are dynamic and grow at each image import into the mosaic layer.
Some performance considerations must then be adopted by the database administrator. TerraAmazon, a real-time Amazon deforestation monitoring system developed by INPE using the open source TerraLib and PostgreSQL, created a raster layer for the whole Amazon region. Its PostgreSQL database stores approximately 2 million complex polygons, and 20 gigabytes of full-resolution satellite images are added every year, using the TerraLib pyramidal resolution schema [82].


For our cloud-pattern detection image mining application scenario, in which image analysis is guided by vector data, all images needed to be stored as separate layers. The search algorithm clipped the image pixels for the study area from each image using the vector data of that area. This was achieved by overlaying each image layer with the vector layer, using the same projection system. Image processing algorithms were then applied to the clipped pixels in the next step.
An image can be queried either through a direct SQL statement on the image data tables, or through the set and get methods of an iterator that the TL library provides over an image data structure. In the former case, a developer can access an image at the block or tile level, the size of which is decided when the image is inserted; the query then uses the spatial indexes. In the latter case, a developer has finer-grained access at the atomic level, and a single pixel can be locked and manipulated explicitly. The generic iterator traverses the image while setting or getting pixel values for an algorithm, independently of the image format.
The image query criteria can be pixel values, attributes, or tiles of an image. Another option is to clip an object from the image layer with a vector layer through an overlay operation; the query in this case returns the pixels of the clipped object. The vector data can either be already stored in the database or result from spatial predicates such as touches and within, or from set operations such as union, difference, intersection, and overlay on geographical features. For our cloud-pattern detection image mining application, the image pixel values for the study area were queried through the overlay of the vector and image data layers already stored in the database.
The raster data resulting from a query can be sent directly as input to an image processing or statistical algorithm. The output of an algorithm can be handled in memory:

• To provide as input to another algorithm.

• To store in the form of an image on the disk or in the database.

• To store as a database record.

Any of these outputs can be accessed for further analysis in an image mining process or provided to front-end visualization applications. The output of an image mining algorithm can also be vector data, such as vector polygons created by raster-to-vector conversion or region-growing segmentation. For our cloud-pattern detection image mining application, the image resulting from the vector/image overlay operation was provided as input to the principal component analysis algorithm in the cloud detection process. The output of the principal component analysis algorithm was stored as PC images in the database. The PC images were then accessed from the TerraView interface for analysis. In another application scenario, the PC images output by the PCA algorithm were not stored in the database but handled on the fly as input to another algorithm for statistical results. These results were then stored in the PG database as records.


When dealing with individual pixels, it is important to consider efficiency. For computational efficiency, TerraLib implements a block cache mechanism to reduce the cost of fetch operations from the database. The tiling and multi-resolution scheme provides spatial indexing. The unique identifier associated with each tile is used as a primary key for queries and for the cache system. The cache system manages its memory with least-recently-used queues. The maximum number of tiles in the cache can be set by the application developer according to the application requirements; the oldest tiles are aged out by the system when the cache is full.
Most of the time, the whole image is not required by the visualization interface, due to display size limitations or to provide a zooming effect from coarser to finer resolution. For computational efficiency, TerraLib allows pre-built, down-sampled versions of the data to be stored during the image import; this is called a multi-resolution pyramid. When the user requests the display of the data, TerraView selects the smallest pyramid level whose resolution is closest to the resolution required by the display area.
The web is another important output interface for the dissemination and visualization of geographic data. The open source TerraPHP is an extension of PHP built with the TL GIS library. TerraLib code can be embedded inside PHP scripts to facilitate the development of web applications for image visualization using pyramidal resolution, queries on geographic databases, and Web Map Server (WMS) access to the image data. The raster data from the database is delivered to the WMS in PNG or JPEG format.

4.5 Summary

An integrated GIS and remote sensing analysis requires that the analysis procedures be independent of the data structures. Most current GIS packages follow separate procedures for different data structures. The TerraLib approach of generic programming makes spatial analysis independent of the data structures, and the design patterns provide the basis of such generic programming. The conceptual schema provided by TerraLib allows querying both image and vector data and performing overlay operations for integrated analysis on both data types. Furthermore, the visualization interface provided by TerraLib offers the facility to analyse the results. The TerraLib approach to integrated data analysis was studied in detail and is applied to build the image mining application scenarios in the next chapter.


Chapter 5

Application scenarios for image mining: Results and Discussions

5.1 Introduction

Chapter 3 discussed various methods for data mining. These methods were further reviewed for extended spatial and image concepts, for spatial and image data mining respectively. In Chapter 4, we built and investigated a research platform using the TerraLib and PostgreSQL DBMS technologies as a method for developing advanced database applications with integrated image/vector analysis. In this chapter, image mining is selected as an advanced application to investigate the proposed method. Various statistical and database-oriented data mining techniques were applied to develop image mining application scenarios, ranging from low to relatively high complexity. Through executing these scenarios, we also examined the TerraLib implementation as an image data-handling scheme in a generic object-relational database system, and investigated different approaches for handling image and vector data for manipulation and knowledge extraction from a large spatial database.
Section 5.2 explores the technique in which GIS vector data stored in the database is used to guide an image processing process for knowledge discovery from a time-series image database. A statistical data mining algorithm based on principal component analysis (PCA) was applied for cloud-pattern detection from a time-series image database of the study area. Various advantages and performance issues of the TerraLib approach are also discussed. Section 5.3 provides the method by which the TerraView (TV) visualization interface was extended to support a temporal query on time-series image data stored in the database. Section 5.4 provides a more complex image mining application scenario developed with the output of many image processing algorithms in a sequence; the results were transformed to an attribute space in the database in order to apply database-oriented approaches for data mining.


5.2 Image mining guided by GIS data

5.2.1 Introduction

One of the important facilities that a DBMS provides to an integrated RS and GIS analysis is to store remote sensing data as a layer just like any other GIS layer. A remote sensing image layer can then be overlayed with GIS data covering the same geographical area, provided that both layers share the same projection; if the two layers do not share the same projection, the overlay operations perform on-the-fly transformations. The GIS layers can be used to guide remote sensing image analysis algorithms. For instance, remote sensing imagery normally covers a large area on the ground, while in many cases the researcher needs the spectral characteristics of a particular area. Selecting those regions of interest in an image during image analysis is time-consuming, particularly when executing an image mining process on large repositories of image data. GIS data can be used to identify regions of interest within an image accurately.
The second facility that a DBMS provides is to store and manipulate a large amount of time-series image and GIS data seamlessly. Image analysis algorithms can access and manipulate large repositories of image data in a batch process. A data miner can write a batch process that executes image mining steps on the image data of many years. For a particular image mining application, the miner defines various image, statistical, and hybrid image/GIS algorithms in a sequence; the output of one task or algorithm is provided to the next algorithm in the sequence. The results of such batch processing can be obtained either in the form of database records or as images, and are further analysed to answer the research questions. This approach is illustrated with an application to detect cloud cover over the study area, in which vector data was used to guide the principal component analysis algorithm in an image mining process.
The objective is to illustrate a case in which an image mining process is guided by GIS data, not to detect the clouds precisely or to provide accuracy assessments.

5.2.2 Data preparation

The Spinning Enhanced Visible and Infrared Imager (SEVIRI) sensor on board the MSG satellite provides data to the EUMETSAT station (Germany), where it is processed and then uplinked to HOTBIRD-6 in wavelet-compressed format [82]. MSG data for a large part of Europe was extracted from the ITC data receiver, as shown in Figure 5.1(b). The images were retrieved for daytime on December 13, 2008 and December 16, 2008, with a temporal frequency of 60 minutes. We also have the prior knowledge that the study area was clear of cloud cover on the former date, and that there were clouds and fog over the study area on the latter date. In this study, only band 1 (0.6 μm), band 3 (0.8 μm), and band 9 (10.8 μm) were selected, as clouds are easy to recognize in these bands. The visible band 1 can be used for visual cloud validation from the principal components of that band. The area of the Netherlands was selected as the study area, and vector data for the Netherlands boundaries was obtained in the WGS84 LATLONG projection, as shown in Figure 5.1(a).

Figure 5.1: Clipping of the research area from MSG satellite data with vector data

The next task was to construct a time-series image database with the PG DBMS and to upload the images. It is inconvenient to upload images into a database manually, one by one, especially when the image data is massive and covers a long time span. It was also required that the image exposure date and time be recorded as the event time, to build a true time-series image database. A utility programme was therefore developed that builds the time-series image database with the PostgreSQL DBMS and uploads the whole image data set from disk to the database automatically. The code for this utility program is provided as Appendix A. The code was written using the TerraLib C++ language interface. The application programme is a command-line utility that takes the directory path of the image data as input.

5.2.3 Method and results

There are many computational methods for cloud detection in which physical properties such as reflectance, temperature, and radiance are used as thresholds to mark an individual pixel as cloudy. These physical characteristics are sometimes very difficult to estimate; they are estimated and provided as data products, to which different cloud masks are applied to identify cloudy pixels.
Another option is to mine the image data itself for cloud patterns. An approach suggested by Elbert [83] involves pattern recognition based on large-scale texture analysis. Principal component analysis (PCA) is another statistical approach; it follows explanatory learning and identifies patterns and relations in the data itself. Since spectral signatures contain the physical characteristics of the land cover type (clouds, land, etc.), PCA can identify patterns for different land cover types by reducing the number of dimensions to a set of principal components, expressing the data in a way that highlights their similarities and differences. The details of the PCA method are provided in Chapter 3. The PCA technique was analysed thoroughly on SEVIRI multi-spectral image data for cloud-pattern detection by Amato et al. [84], and was found to be very effective in detecting clouds. The objective of this research is not merely to test the PCA method for cloud detection, but to apply it within a framework for integrated remote sensing/GIS analysis in an image mining process on top of the DBMS technology, so as to develop more advanced image mining scenarios.

Figure 5.2: An image mining process with integrated image and vector analysis with TerraLib on top of the DBMS technology

An image mining batch process was written using the TerraLib C++ programming interface; it retrieves the images one by one from the time-series image database, performs the sequence of steps shown in Figure 5.2 iteratively, and stores the results in the database as images. These images were then accessed and visually analysed with TerraView, as shown in Figure 5.3. The code of the program is provided as Appendix B.

Figure 5.3: The resulting principal components for two dates at 14:00 hours

The GIS vector data for the Netherlands was used to select the pixels of that area from an MSG image, as shown in Figure 5.1(c,d). The selected pixels for the study area were handled in memory and provided as input to the PCA algorithm to generate the principal components of the three bands. These principal components were then stored in the PG database as time-series PC images for visual interpretation using TerraView. Figure 5.3 shows the PC images generated by the principal component analysis algorithm for the clear day, December 13, 2008, and the cloudy day, December 16, 2008, at 14:00 hours. The presence of clouds can be noticed through visual interpretation of the first principal components.

5.2.4 Discussion

A data mining approach based on the integrated image and GIS data analysis on top of the database technology is illustrated here. This approach requires that the GIS data (vector model), and remote sensing data (raster model) stored in the database are able to perform overlay operations. The results of such integrated overlay operations should also allow to display for visual analysis, and should storable in the database for access by another analytical application. During the data preparation phase ZLIB compression mode was used for storing images in the PG database. It was observed that compression tech- niques can reduce considerable amount of data in the database as shown in Figure 5.4 based on statistics in Table 5.1, but actual benefit of compression along with tiling and index support provided by the TerraLib is to reduce the search space and time, and to increase the performances at acceptable level to construct large remote sensing image databases with a standard DBMS. The performances include: improved physical disk I/O, faster scan operations, im- proved network bandwidth in a client-server environment, and reduced mem- ory requirement. The compression is such a powerful option that it is becoming the default for any high performance DBMS. There is some performance over- head when compressing at write and decompressing at read but this overhead is tolerable as compared to achieved performances. The TerraLib further re- duces the performance overhead in providing the compression/decompression at tile level. The size of tiles can be adjusted according to read/write activity of the image data. The image retrieval speed from the database can not beat that from the disk in some cases, however, the seamless overlay operations for integrated remote sensing and GIS analysis with adequate image retrieval performance from a spatial database has an extremely wide scope in advanced GIS applications de- velopment. 
A GIS application based on integrated image and vector data analysis can be rapidly developed on a large repository of satellite images and GIS data stored in the database. Storing images on disk as external files is optimal only if the purpose is merely to retrieve an image for dissemination or visualization.

62 Chapter 5. Application scenarios for image mining: Results and Discussions

Table 5.1: Image size on the disk and in the database

Images                                        1      2      3
Image size on disk                          721   2651   6645
Image size in database without compression  840   3032   8392
Image size in database with compression     696   2032   3200
% increase in size without compression       16     14     26
% decrease in size with compression           4     13    107

Figure 5.4: Comparison of image size on disk with size in database using compression
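The "% increase" row of Table 5.1 is the difference between the uncompressed in-database size and the on-disk size, relative to the on-disk size and truncated to a whole percent. A small check of those three values (units as listed in the table); the compressed-size row does not reduce to the same single formula and is not reproduced here.

```cpp
// Percentage increase of the in-database image size over the on-disk
// size, truncated to a whole percent as in Table 5.1.
static int pctIncrease(double onDisk, double inDb) {
    return static_cast<int>((inDb - onDisk) / onDisk * 100.0);
}
```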


5.3 Extending TerraView for a temporal image query

5.3.1 Introduction

Remote sensing image databases can provide frequent access to image data for a particular area of interest and time, which can be used in developing land cover/land use change detection applications. Such applications frequently access time-series image data at different temporal frequencies. A temporal image query provides the result set either as images, for time-series visual analysis, or as database records populated with visual or statistical image characteristics, for time-series statistical analysis.

In the previous scenario, a time-series image database was created through an application programme, and the temporal information was stored in the database as metadata. The TerraView interface, however, does not provide the facility to perform a temporal image query for visualizing a time-series of images. An operator needs to construct the visualization interface for a time-series of images for a particular time interval manually. This is difficult because the TerraView visualization interface has no option to view the metadata of the image data stored in the PG database. Therefore, the objective was to provide an option to execute a temporal image query from the visualization interface, to retrieve stored PC images from the database for a specified period of time.

In the previous section, images were retrieved from a time-series image database as input for a mining algorithm. The GIS vector data was stored along with the image data, and was retrieved to define boundaries over the land cover for our study area. The image date and time were stored as metadata attributes. In this section, we illustrate how to perform temporal queries on a time-series image database, and how results are populated as TerraView views and themes for visualization.

5.3.2 Method and results

To visualize an image in TerraView for a specific date and time, an operator first needs to query the database and obtain prior knowledge before manually creating views and themes on image layers for visualization. The TerraView visualization interface does not allow querying a time-series image database before creating views and themes on image layers.

An application programme was developed using TerraLib and C++ on top of the PG DBMS, with the objective to extend TerraView with the facility to query a time-series image database for a defined time period. This TerraLib application takes the date and the time as input to define a time period. The script formulates a temporal query for the remote sensing data layers in the database, and the images retrieved for that time span are populated programmatically in TerraView as views and themes for visual analysis. The code for the application is provided in Appendix C.

The code was executed to perform a temporal query for December 13, 2008, for the time interval from 09:00:00 to 12:00:00. The view was created with the


name displaying the time interval, and themes for the retrieved images were populated displaying date, time and band information, as shown in Figure 5.5.

Figure 5.5: The views/themes populated as a result of temporal query
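The essential step of such an application, turning a user-specified interval into a query over the stored image metadata, can be sketched as follows. The table and column names here are illustrative only; they are not the exact schema used by the Appendix C code.

```cpp
#include <string>

// Build a temporal query over the image metadata for a closed time
// interval. "image_metadata" and its columns are hypothetical names.
static std::string temporalQuery(const std::string& from,
                                 const std::string& to) {
    return "SELECT layer_id, name FROM image_metadata "
           "WHERE image_datetime BETWEEN '" + from + "' AND '" + to +
           "' ORDER BY image_datetime";
}
```

The application then iterates over the rows of the result set, creating one TerraView theme per retrieved image layer.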

5.4 Database-oriented approaches to data mining

5.4.1 Introduction

Data modeling or database-specific heuristics are used to exploit the characteristics of attribute data. In image mining, such attribute data can be generated by applying a mining algorithm to image data stored in the database and storing the results as database records. The image mining algorithm is required to access the database for time-series image data through SQL statements, to provide this data as input to a sequence of image processing steps in an iterative way, and to store the results as database records. Further analysis can then be performed on the attribute space of those records. Database-oriented data mining algorithms that work on the attribute space, such as association rule discovery, can also be applied. Sections 3.2.2 and 3.3.2 of this thesis discuss the database-oriented approaches for classical and spatial data mining. Association rule discovery is a database-oriented data mining approach to find patterns in attribute data. The attribute data can also be used by the machine learning algorithms for data mining discussed in Section 3.2.3, among which C4.5 is extensively used. C4.5 tries to find patterns in a multivariate attribute space by formulating decision trees over attribute vectors during the classification of attribute data in a data mining process. Such classifiers associate attributes such as mean, max, entropy, etc., resulting from an image processing algorithm, with high-level application concepts for pattern or knowledge discovery. The application concepts can be any categorical attributes or GIS data that can categorize the results of image processing in an image mining application.

Database approaches normally materialize frequently asked, expensive computations and aggregates in the database to scale data mining and OLAP (Online Analytical Processing) applications to the high computational costs such applications require. The summary data generated by aggregate functions are stored in specially designed database schemas, such as the star schema, for knowledge discovery. These database techniques also help to reduce data dimensionality and to integrate various sources of data. Such techniques can be adopted in image mining by storing the most common spatial, temporal and statistical data for images along with other image attributes. For instance, the NDVI for a time-series of images can be calculated and stored along with image date and time, image statistics, spatial characteristics, and derived GIS data in a database schema designed to serve one or more data mining applications.

In classical data mining, algorithms are applied to an attribute space whose attributes are either explicitly stored or derived. In spatial data mining, these attributes are implicit and need to be derived from spatial operations and spatial analysis. For instance, the Netherlands vector data used in our cloud detection application has thirteen polygons. A derived attribute for total area can be the sum of the geometry area of each polygon. The result can be materialized in the database.
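As a minimal illustration of a classifier over such an attribute space, the sketch below applies a single decision-tree-style rule to image summary records: an image is labelled cloudy when its PC-1 mean falls below a threshold. The record layout and the threshold are hypothetical; in practice the rule would be induced from labelled records, for example by C4.5.

```cpp
#include <string>

// An image summary record as it might be stored in the attribute table.
// This layout is illustrative, not the schema used in Appendix D.
struct ImageRecord {
    std::string name;
    double pc1Mean;   // mean of the first principal component image
};

// One decision-tree-style rule: a drop in the PC-1 mean is taken as
// evidence of cloud cover. The threshold is a hypothetical value.
static std::string label(const ImageRecord& r, double threshold) {
    return r.pc1Mean < threshold ? "cloudy" : "clear";
}
```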

5.4.2 Method and results

To illustrate the above approach, a database schema was created to store the result of an image mining algorithm as database records. In Section 5.2, time-series image data was generated from the intersection of MSG image data with vector data of the study area. In the next step, a PCA-based statistical mining algorithm was applied to these images for cloud pattern detection, and the resulting time-series PC images were stored in the PG database. In the present scenario, the PC images output by the PCA algorithm were not stored in the database but handled on-the-fly as input to another algorithm producing statistical results. These results were then stored in the PG database as attribute data. The sequence of steps shown in Figure 5.6 was carried out in an iterative way on the time-series image data stored in the PG database, and the results were stored as database records along with the image name, band id, and date and time of the image, as shown in Figure 5.7. The code for this image mining application scenario is provided in Appendix D.

The statistical results for the time-series PC images generated by the PCA-based image mining algorithm were stored as records in a PG database table. These records were accessed from MS Excel by establishing a connection to the PG database table for time-series pattern analysis. Statistical mean values generated from the PC-1 time-series images were plotted for December 13, 2008 and December 16, 2008 at a temporal frequency of one hour, as shown in Figure 5.8 and Figure 5.9 respectively.


Figure 5.6: A sequence of steps in a mining process to generate attribute data in the PG database


Figure 5.7: Attribute data for PC images as a result of PCA algorithm applied on time-series MSG data

By comparing these results with the visual analysis from TerraView in Figure 5.3, the changing mean value at 14:00 hours can be interpreted as the presence of clouds. This value was used for the space-time annotation of an image as cloudy for the Netherlands area at a particular time in our cloud search engine application. The efficient storage and retrieval of time-series images in the PG database enabled the derivation of statistical data over time, as multi-temporal mean and variance, which was then included in the subsequent time-series pattern analysis. The vector data for the study area was also processed to generate attribute data, which was likewise stored in a database table as records. Spatio-temporal queries joining the vector and image attribute data were then formulated, for instance:

select i.name, i.study_area_id, i.image_datetime,
       v.geom_area, i.variance, i.correlation
from   image_summary_data i, vector_summary_data v
where  i.study_area_id = v.study_area_id
  and  band_id = 0
  and  to_timestamp(i.image_datetime, 'YYYYMMDDHH24MISS')
       between to_timestamp('20081213100000', 'YYYYMMDDHH24MISS')
       and     to_timestamp('20081216100000', 'YYYYMMDDHH24MISS');

As shown, this spatio-temporal query provides information about the study area derived from the vector data, along with the statistical results of the PCA image mining algorithm as information on a land cover class, i.e., clouds. This integration of GIS data with the results of a time-series image analysis is extremely useful in constructing knowledge from many spatio-temporal analyses. It also presents a method for the integration of spatial data warehousing and image mining, integrating various sources of information at different levels.


Figure 5.8: Time-series cloud patterns analysis for December 13, 2008

Figure 5.9: Time-series cloud patterns analysis for December 16, 2008


This approach can also be effectively implemented for spatio-temporal pattern analysis, such as for a specific phenomenon or a land cover type at various geographical levels. It is important to note that this attribute data can be accessed through the PG SQL interface without any format conversion. A well-designed PG database schema for vector and image summaries can be a primary source for front-end OLAP (Online Analytical Processing) and visualization applications, for instance via ROLLUP and CUBE aggregation. OLAP tools provide powerful query techniques that can be effectively implemented in a knowledge discovery process [56]. The TerraLib script to generate such vector and image data summaries was written, and the code is provided in Appendix D. The script was executed as an ETL (Extraction, Transformation, Loading) step to generate summary data for a data warehouse. The objective here was not to construct a data warehouse, but to illustrate a potential use of the attribute space of spatio-temporal image and vector data.

5.5 Conclusion

The use of DBMS technology for time-series image databases, to develop advanced applications from integrated image and GIS data analysis, was investigated. Image databases to store and manipulate remote sensing data have enormous scope, and provide an efficient and flexible platform for information fusion from various RS and GIS data sources in the database. Further, the TerraLib technology built on top of the database provides rich overlay and integrated analysis operations, which shifts RS and GIS analysis from individual monolithic systems into a single, incorporated and simple system. This approach also makes it possible to perform spatio-temporal analysis on very large repositories of image and vector data spanning many years. A researcher can write a batch programme for a specific research problem, incorporating different data types and various spatial, temporal, statistical and image analysis functions as a sequence of steps in a single routine. This routine can then be executed over the large amount of data in the database, and results can be stored in the database as images, records, or spatio-temporal summaries to serve a wide range of front-end visualization and analytical applications.

Chapter 6

Conclusions and Recommendations

6.1 Conclusions

The first objective of this project was to study and analyse state-of-the-art image handling and manipulation facilities from open source libraries, to extend the open source PostgreSQL/PostGIS (PG/PG) DBMS with image support. PostgreSQL has been extended through PostGIS to provide spatial data types and topological operators for vector data, but it does not support remote sensing image data. This image support is required to perform integrated image and vector data analysis in space and time with DBMS technology. The requirements for this image support, listed below, include:

• Efficient image storage in and retrieval from a database.

• Seamless and integrated analysis through overlay of images with vector data in a database.

• Visualization of overlayed images and vector data obtained from a database.

A comparative study of existing open source libraries was carried out to provide such image support in PostgreSQL. These libraries were analysed on the basis of efficient image support, capability to perform hybrid image/vector analysis, provision of image processing algorithms, ability to conduct statistical analysis, temporal support, flexibility to deal with various spatial data types, extensibility, and availability of a visualization interface. The following conclusions were drawn.

• Provision of image support in a PG/PG database, through extension with complex data types and overlay operations for integrated image/vector analysis, is far from trivial, and can easily go beyond the scope and time of an MSc thesis. For instance, such an innovation by the National Institute for Space Research, Brazil (INPE), in developing TerraLib version 3.0, released in May 2004, required an investment of 40 man-years of work. That software package comprises 95,000 lines of C++ code developed by


INPE, and 195,000 lines of code from third-party libraries. INPE released version 3.2.1 at the end of 2008. Further, we analysed many other individual efforts towards such image support, but all of these were incomplete and not capable of supporting applications for integrated image/vector analysis. Therefore, the first approach was adopted: to use the TerraLib library at an intermediate level to extend the PostgreSQL/PostGIS DBMS with image support and overlay operations. It was found that INPE's open source TerraLib library provides the most efficient solution for image handling in a PG/PG database so far, and has rich functionality for integrated spatial analysis.

• Various architectural components and the conceptual model of TerraLib were analysed at the database, programming interface, and visualization levels. For this, a research platform was constructed running the PostgreSQL/PostGIS DBMS, the TerraLib library, and the TerraView visualization interface on a Linux server machine. Both the TerraView visualization and TerraLib C++ programming interfaces were used to insert and retrieve images in the spatial database along with other datasets. TerraLib provides generic iterators for the manipulation of spatial data types in the database. These iterators were extensively studied and applied for image manipulation in the image mining application development. The TerraLib conceptual schema for image data storage, retrieval, and manipulation, along with pyramid, tiling, index and compression support, was found to be sufficient for developing advanced database applications on large repositories of remote sensing image and GIS data.

• The requirement to provide overlay operations on image and vector data layers is fulfilled through the overlay functions provided by TerraLib. These functions were found to be extremely useful in developing image mining application scenarios through integrated image/vector analysis on top of the DBMS technology.
• Storing images on disk as external files is optimal in some cases, for instance when the objective is only visualization, as in static web applications. But seamless overlay operations on remote sensing and GIS data layers for integrated spatial analysis, with adequate image retrieval performance from very large spatial databases, have an extremely wide scope in advanced GIS application development.

• GIS and remote sensing integrated spatial analysis requires that the analysis procedures be independent of the data structures. Most current GIS packages follow separate analysis procedures for separate data structures. The approach of generic programming, through the implementation of generic design patterns in the development of the TerraLib GIS library, has been adopted by INPE to make spatial analysis independent of data structures. Various design patterns adopted by the TerraLib library were identified and documented, to develop a GIS database application featuring integrated image/vector data analysis. It was found that such a programming style allows seamless integrated spatial analysis for various


data types provided by a spatial DBMS and allows rapid development of a GIS application or research prototype for a specific research problem.

The second objective was to develop an advanced application featuring the image support provided in the open source spatial database PG/PG. In this context, an image mining application was developed with integrated analysis over image and vector data stored in the database. An extensive review was carried out to discover tools and techniques for data mining, and to apply them effectively in an image mining application. The statistical, database-oriented, and machine-learning approaches to data mining were studied, and then applied in developing various image mining application scenarios.

The TerraLib and PostgreSQL DBMS technology adopted for hybrid analysis was applied to develop a cloud detection application, to mine cloud patterns without using any physical parameters such as sea surface temperature or moisture content in the air. The Spinning Enhanced Visible and Infrared Imager (SEVIRI) sensor image data on board the MSG satellite, for a larger part of Europe, was extracted from the ITC data receiver. An application programme was written to construct a time-series image database for the extracted image data with the PG/PG DBMS, using the TerraLib conceptual schema. A statistical data mining method based on principal components analysis was adopted to extract cloud features for the area of the Netherlands from the time-series image data. The following results were obtained and conclusions were drawn:

• Vector data stored in the database can be used for rapid delineation of image features from the overlay of image and vector data layers. This is extremely useful in region-based studies, to extract raster statistics for regions guided by GIS data in a remote sensing image analysis process. The image pixel values for the Netherlands were extracted from the MSG time-series image data for a larger part of Europe using the boundary vector data for the Netherlands. The TerraLib overlay functions for the intersection of time-series image and vector layers were used in a spatio-temporal analysis, in which the images in a time-series image database were analysed for cloud cover in space and time. This approach is also useful for spatio-temporal analyses in which a geometry changes over time, for instance in flooding studies.

• One of the important advantages of the adopted spatial database approach for spatio-temporal analysis is the ability to write a batch process incorporating a large amount of time-series, multi-source data and integrated spatial analysis functions. The overlay, spatio-analytical, statistical, image analysis and aggregate functions on time-series image and vector data layers can be applied in a sequence of steps. Such batch processes for our cloud detection image mining application were written and executed over time-series image data, and the results of the overall process were stored as images and attributes in the PG database. This is an important method for detecting transitions in land cover/land use where seasonal or


long time intervals are associated with different data sources in the spatial database. For instance, analysing maize growth in various parts of Europe during a season needs an integrated analysis using time-series image data for spectral analysis, and vector data for delineating pixels for the various regions.

• The results of the PCA algorithm were stored as PC images for visual analysis, and PC statistics were stored as database records along with date and time for time-series pattern analysis. The TerraView interface for visual analysis was extended to provide temporal image query support, to retrieve PC images from the database for a specific time interval. The PC statistics stored as PG database records were retrieved from the MS Excel interface for time-series pattern analysis. A decreasing mean value for band-1 at a point on the time axis indicates the presence of cloud cover over the study area at that time. This result was also compared with the visual analysis of PC images from the TerraView interface; the agreement between the two interfaces was quite satisfactory. This value was used for the space-time annotation of an image as cloudy for the Netherlands area at a particular time in our cloud search engine application.

• Database techniques for data mining were adopted in developing a cloud pattern detection image mining application scenario, and the results of vector aggregate and image analysis functions were stored as attribute data in a database table. This integration of vector data, or any non-spatial data, with the results of remote sensing image analysis in the attribute space is extremely useful in constructing knowledge with a spatial data mining process. This also presents a method for the integration of spatial data warehousing and image mining, integrating various sources of information at different levels, which is currently an active area of research, as identified in Section 3.3.2.
Spatial data warehousing provides spatial data aggregation and spatial navigation, whereas image mining provides discovery of knowledge through image processing techniques. This approach can be effectively used for spatio-temporal pattern analysis, such as for a specific phenomenon or a land cover type at various geographical levels.

6.2 Recommendations

Some recommendations for further research are as follows:

• The quality of integrated remote sensing and GIS data analysis, in terms of accuracy assessment of the results, needs to be addressed. Both raster and vector data depend on scale factors, and previous studies documented in Section 3.3.1 show that scale variation in spatial data can cause variability in the results of an integrated spatial analysis, for instance during the overlay of many image and vector data layers. The data preparation phase for an integrated analysis needs to carefully consider the spatial data scale and acquisition methods, for instance those determined by


various sensors in the case of remotely sensed data. Any subsequent result should be documented with accuracy indications and verifications.

• The TerraLib library was used, via an intermediate TerraLib driver for PG, to extend the PostgreSQL/PostGIS database with image support. This provided the ability to perform integrated image/vector analysis for image mining over large amounts of image/vector data in the PG database. TerraLib also provides drivers for many other databases. This approach of keeping the TerraLib library outside the DBMS gives flexibility to work with various database systems and programming interfaces. An application developer has more control over data inputs and outputs, incorporating various data formats and analysis procedures. A future innovation to bring the TerraLib library into the DBMS, without using the intermediate driver, would be useful to obtain the declarativeness and user-friendly environment provided by, for example, Oracle GeoRaster. The application developer could then develop algorithms as PG stored procedures, using TerraLib functions at the PG SQL interface, and these stored procedures could be made callable from the PG SQL interface. This will require revising the voluminous TerraLib library structure and integration at a lower level with PostGIS.


Appendix A

Source code for creating time-series image database in PG

// The header names were lost in extraction. The standard headers below
// are restored from usage; the remaining includes are TerraLib headers
// (for TeDatabase, TePostGIS, TeLayer, TeRaster, TeImportRaster, ...).
#include <iostream>
#include <string>
#include <vector>
#include <cerrno>
#include <cstdio>
#include <dirent.h>

using namespace std;

int main()
{
    TeDatabase* db;

    //################################################################
    //################ File system reading for images ################
    //################################################################

    string dir;
    DIR* dp;
    struct dirent* dirp;
    vector<string> files;   // element type lost in extraction

    cout << "Enter your image directory: ";
    cin >> dir;
    if ((dp = opendir(dir.c_str())) == NULL) {
        cout << "Error: opening the directory";
        cout << "Press Enter\n";
        getchar();
        return errno;
    }
    while ((dirp = readdir(dp)) != NULL) {
        string file(dirp->d_name);
        if (file == "." || file == "..")
            continue;
        files.push_back(file);
    }

    //################################################################
    //########## Connecting PostgreSQL\PostGIS database ##############
    //################################################################

    db = new TePostGIS();
    if (!db->connect("172.16.33.170", "imran", "password",
                     "tvdatabase", 5432)) {
        cout << "Error: " << db->errorMessage() << endl << endl;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }
    TeInitRasterDecoders();

    //#################################################################
    //############ Accessing and initializing input image #############
    //#################################################################

    // "." and ".." are already filtered above, so iteration starts at 0
    // (the extracted listing started at i = 2).
    for (unsigned int i = 0; i < files.size(); i++) {
        string in = files[i];
        string input = dir + "/" + in;
        TeRaster img(input);
        if (!img.init()) {
            cout << "Cannot access the input image!" << endl << endl;
            cout << "Press Enter\n";
            getchar();


            return 1;
        }

    //#################################################################
    //#### Creating image layer and setting projection for layer ######
    //#################################################################

string layerName = in;

    if (db->layerExist(layerName)) {
        db->close();
        cout << "The database already has an infolayer with this name \"";
        cout << layerName << "\"!" << endl << endl;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

    TeDatum mDatum = TeDatumFactory::make("WGS84");
    TeProjection* pUTM = new TeLatLong(mDatum);

    TeLayer* layer = new TeLayer(layerName, db, pUTM);
    if (layer->id() <= 0) {
        db->close();
        cout << "The destination layer could not be created!\n"
             << db->errorMessage() << endl;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

    //#################################################################
    //#### Importing the image to the layer in database ###############
    //#################################################################

    // 256x256 tiles with ZLIB compression.
    if (!TeImportRaster(layer, &img, 256, 256, TeRasterParams::TeZLib, "",
                        255, true, TeRasterParams::TeExpansible)) {
        db->close();
        cout << "Fail to import the image\n\n!";
        cout << "Press Enter\n";
        getchar();
        return 1;
    }
    else

        cout << "The image was imported successfully!\n\n";

        delete layer;
        layer = 0;
    }

    //#################################################################
    //#################### Closing database ###########################
    //#################################################################

    db->close();
    cout << "\nPress enter...";
    getchar();
    return 0;
}

Appendix B

Source code for image mining application scenario Section 5.2

using namespace std;

// The header names were lost in extraction. Besides the standard headers,
// the program includes TerraLib and TePDI headers (for TePostGIS,
// TeDatabasePortal, TeLayer, TeRaster, TeRasterClipping,
// TePDIPrincipalComponents, TePDIUtils, ...).
#include <iostream>
#include <string>
#include <vector>
#include <cstdlib>

int main()
{
    TEAGN_LOGMSG("Process started.");

    try {
        TeStdIOProgress pi;
        TeProgress::setProgressInterf(dynamic_cast<TeProgressBase*>(&pi));
        TeInitRasterDecoders();

    //#############################################################
    //###### Connecting PostgreSQL\PostGIS database ###############
    //#############################################################

    // The declaration of db was lost in the extracted listing.
    TeDatabase* db = new TePostGIS();
    if (!db->connect("172.16.33.170", "imran", "password", "test", 5432)) {
        cout << "Error: " << db->errorMessage() << endl << endl;
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

    //###############################################################
    //####### Getting Portal to query images from database ##########
    //###############################################################

TeDatabasePortal* portal = db->getPortal();

    if (!portal)
        return -1;

    string sql = "SELECT layer_id, name from te_layer WHERE ";
    sql += " layer_id in (SELECT layer_id FROM te_representation";
    sql += " WHERE geom_type=512)";

    // The statement executing the query was lost in extraction;
    // TeDatabasePortal::query() performs it.
    if (!portal->query(sql))
        return -1;

while (portal->fetchRow()) {

unsigned int i=0;

string layer_name= portal->getData("name");

    //################################################################
    //###### Getting GIS vector data to select features from image ###
    //################################################################

    TeLayer* layer = new TeLayer("arcmap", db);
    TeProjection* geomProj = layer->projection();

TeRepresentation* rep = layer->getRepresentation(TePOLYGONS);


    if (!rep) {
        cout << "Layer has no polygons representation!" << endl;
        db->close();
        cout << "Press Enter\n";
        getchar();
        return 1;
    }

string geomTableName = rep->tableName_;

    TePolygonSet ps;
    db->loadPolygonSet(geomTableName, "0", ps);

    //################################################################
    //# Clipping image data by vector data and handling it in memory #
    //################################################################

    TeLayer* imgLayer = new TeLayer(layer_name, db);
    if (imgLayer->id() < 1) {
        cout << "Cannot access image layer" << endl;
        cout << endl << "Press Enter\n";
        getchar();
        return 1;
    }

    string rasterTable = imgLayer->tableName(TeRASTER);
    int layerId = atoi(portal->getData("layer_id"));
    TeRaster* img = db->loadLayerRaster(layerId);

    TeRaster* clip = TeRasterClipping(img, ps, geomProj, "clip", 0.0, "DB");

if (!clip->init())
{
    cout << "clip was not initialized" << endl;
    cout << "Press Enter\n";
    getchar();
    return 1;
}

//####################################################################
//#### Applying principal components algorithm developed by INPE #####
//####################################################################

TePDIPrincipalComponents::TePDIPCAType analysis_type =
    TePDIPrincipalComponents::TePDIPCADirect;

TePDIParameters params_direct;

TePDITypes::TePDIRasterPtrType inRaster1(clip, 'r');
TEAGN_TRUE_OR_THROW(inRaster1->init(), "Unable to init inRaster1");

TePDITypes::TePDIRasterPtrType inRaster2(clip, 'r');
TEAGN_TRUE_OR_THROW(inRaster2->init(), "Unable to init inRaster2");

TePDITypes::TePDIRasterPtrType inRaster3(clip, 'r');
TEAGN_TRUE_OR_THROW(inRaster3->init(), "Unable to init inRaster3");

TePDITypes::TePDIRasterVectorType input_rasters;
input_rasters.push_back(inRaster1);
input_rasters.push_back(inRaster2);
input_rasters.push_back(inRaster3);

std::vector<int> bands_direct;
bands_direct.push_back(0);
bands_direct.push_back(1);
bands_direct.push_back(2);

TePDITypes::TePDIRasterPtrType outRaster1_direct;
TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster1_direct,
    1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 1 Alloc error");

TePDITypes::TePDIRasterPtrType outRaster2_direct;
TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster2_direct,
    1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 2 Alloc error");

TePDITypes::TePDIRasterPtrType outRaster3_direct;
TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster3_direct,
    1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 3 Alloc error");

TePDITypes::TePDIRasterVectorType output_rasters_direct;
output_rasters_direct.push_back(outRaster1_direct);
output_rasters_direct.push_back(outRaster2_direct);
output_rasters_direct.push_back(outRaster3_direct);


TeSharedPtr<TeMatrix> covariance_matrix(new TeMatrix);

params_direct.SetParameter("analysis_type", analysis_type);
params_direct.SetParameter("input_rasters", input_rasters);
params_direct.SetParameter("bands", bands_direct);
params_direct.SetParameter("output_rasters", output_rasters_direct);
params_direct.SetParameter("covariance_matrix", covariance_matrix);

TePDIPrincipalComponents pc_direct;
TEAGN_TRUE_OR_THROW(pc_direct.Reset(params_direct), "Invalid Parameters");
TEAGN_TRUE_OR_THROW(pc_direct.Apply(), "Apply error");
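The principal-components step above is driven entirely by the band covariance matrix that TePDIPrincipalComponents estimates internally. As a self-contained illustration of that computation (plain C++ with no TerraLib dependency; the helper functions and the two-band toy data are ours, not library API), the covariance of two bands and the variances carried by the principal components follow from the symmetric 2x2 eigenproblem:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Population covariance of two equally long band samples.
double cov(const std::vector<double>& a, const std::vector<double>& b) {
    double ma = 0.0, mb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) { ma += a[i]; mb += b[i]; }
    ma /= a.size();
    mb /= b.size();
    double c = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) c += (a[i] - ma) * (b[i] - mb);
    return c / a.size();
}

// Eigenvalues of the symmetric 2x2 covariance matrix [[sxx, sxy], [sxy, syy]];
// the larger one is the variance captured by the first principal component.
void eigen2x2(double sxx, double syy, double sxy, double& l1, double& l2) {
    double half_tr = (sxx + syy) / 2.0;
    double det = sxx * syy - sxy * sxy;
    double d = std::sqrt(half_tr * half_tr - det);
    l1 = half_tr + d;
    l2 = half_tr - d;
}
```

For strongly correlated bands the second eigenvalue collapses towards zero, which is exactly why the transform is applied here: it concentrates the information of the three input rasters into fewer components.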

//###################################################################
//# Writing output principal components back into PostgreSQL database
//###################################################################

TeImportRaster(layer_name, output_rasters_direct[0].nakedPointer(), db);
TeImportRaster(layer_name, output_rasters_direct[1].nakedPointer(), db);
TeImportRaster(layer_name, output_rasters_direct[2].nakedPointer(), db);

++i;
}

}
catch (const TeException& e)
{
    TEAGN_LOGERR("Test Failed-" + e.message());
    return EXIT_FAILURE;
}

db->close();
TEAGN_LOGMSG("Test OK.");
cout << "\nPress enter...";
getchar();
return EXIT_SUCCESS;
}

Appendix C

Source code for image mining application scenario Section 5.3

#include <iostream>
#include <string>
// TerraLib headers for the database, portal, layer, view and theme classes

using namespace std;

int main()
{
string host;
string dbname;
string user = "imran";
string password;
int port;
string init_date;
string fin_date;

cout << "Enter initial date in YYYYMMDDHH24MISS format: ";
cin >> init_date;

cout << "Enter final date in YYYYMMDDHH24MISS format: ";
cin >> fin_date;

TeDatabase* db = new TePostGIS();
if (!db->connect("172.16.33.170", "imran", "password", "test", 5432))
{
    cout << "Error: " << db->errorMessage() << endl << endl;
    cout << "Press Enter\n";
    getchar();
    return 1;
}

TeDatabasePortal* portal = db->getPortal();

if (!portal) return -1;

//###################################################################
//# Querying the image database for images in the requested interval
//###################################################################

string sql = "SELECT l.name as name, to_char(r.initial_time, ";
sql += " 'YYYYMMDDHH24MISS') as inidatetime, ";
sql += " to_char(r.final_time, 'YYYYMMDDHH24MISS') as findatetime";
sql += " FROM te_layer l, (SELECT layer_id, initial_time, final_time";
sql += " FROM te_representation WHERE geom_type=512 and initial_time";
sql += " between to_timestamp('" + init_date + "','YYYYMMDDHH24MISS') ";
sql += " AND to_timestamp('" + fin_date + "','YYYYMMDDHH24MISS')) r ";
sql += " WHERE l.layer_id=r.layer_id order by r.initial_time desc";
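The dates typed by the user are concatenated straight into the SQL text, so a malformed value reaches the database unchecked. A minimal guard, sketched here as our own helper rather than anything TerraLib provides, is to verify the YYYYMMDDHH24MISS shape before the query string is built:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True when ts is exactly 14 digits, the YYYYMMDDHH24MISS shape expected by
// to_timestamp() in the query above; rejects anything else, including
// attempts to smuggle quotes into the concatenated SQL.
bool isTimestamp14(const std::string& ts) {
    if (ts.size() != 14) return false;
    for (std::string::size_type i = 0; i < ts.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(ts[i]))) return false;
    return true;
}
```

Called right after the two `cin >>` reads, this keeps invalid input out of the query without changing the rest of the program.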


//###################################################################
//# Extending TerraView with images retrieved through views and themes
//###################################################################

string viewName = init_date + "To" + fin_date;

TeView* view = new TeView(viewName, user);
TeProjection* proj = new TeNoProjection();
view->projection(proj);

if (!db->insertView(view))
{
    cout << "Failed to insert the view into the database: "
         << db->errorMessage() << endl;
    db->close();
    delete db;
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}


if (!portal->query(sql))
{
    cout << "Could not execute..." << db->errorMessage();
    delete portal;
    return -1;
}

while (portal->fetchRow())
{
unsigned int it = 0;
string layer_name = portal->getData("name");
string inidatetime = portal->getData("inidatetime");
string findatetime = portal->getData("findatetime");

TeLayer* imgLayer = new TeLayer(layer_name, db);
view->projection(imgLayer->projection());
if (imgLayer->id() < 1)
{
    cout << "Cannot access image layer" << endl;
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}

TeTheme* theme = new TeTheme(layer_name, imgLayer);
theme->visibleRep(TeRASTER | 0x40000000);
view->add(theme);
if (!theme->save())
{
    cout << "Failed to save the theme in the database: "
         << db->errorMessage() << endl;
    db->close();
    delete db;
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}

TeGrouping* group1 = new TeGrouping();

++it;
}

db->close();
cout << "\nPress enter...";
getchar();
return 0;
}

Appendix D

Source code for image mining application scenario Section 5.4

#include <iostream>
#include <string>
#include <vector>
// TerraLib and TerraPDI headers for the database, layer, raster,
// statistics and principal components classes

using namespace std;

TeDatabase* db;
int i, j;

//####################################################################
//############## Method to generate vector summaries #################
//####################################################################

int generate_vector_summary()
{
TeAttributeList attListVec;
string tableNameVec = "vector_summary_data";

db->deleteTable(tableNameVec);

TeAttribute atid;
atid.rep_.name_ = "study_area_id";
atid.rep_.type_ = TeREAL;
atid.rep_.numChar_ = 3;

TeAttribute atname;
atname.rep_.name_ = "study_area_name";
atname.rep_.type_ = TeSTRING;
atname.rep_.numChar_ = 30;

TeAttribute atgeom;
atgeom.rep_.name_ = "geom_id";
atgeom.rep_.type_ = TeREAL;
atgeom.rep_.numChar_ = 30;
atgeom.rep_.decimals_ = 4;

TeAttribute atarea;
atarea.rep_.name_ = "geom_area";
atarea.rep_.type_ = TeREAL;
atarea.rep_.numChar_ = 30;
atarea.rep_.decimals_ = 4;

attListVec.push_back(atid);
attListVec.push_back(atname);
attListVec.push_back(atgeom);
attListVec.push_back(atarea);

if (!db->createTable(tableNameVec, attListVec))
{
    cout << "Failed to create the table: " << db->errorMessage() << endl;
    db->close();
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}

TeLayer* layer = new TeLayer("arcmap", db);


TeRepresentation* rep = layer->getRepresentation(TePOLYGONS);
string geomTableName = rep->tableName_;
string objId = "0";
string q;

TeDatabasePortal* portalgeom = db->getPortal();
q = "SELECT * FROM " + geomTableName;
q += " WHERE object_id = '" + objId + "'";
if (!portalgeom->query(q))
{
    cout << "geometry of the mask was not fetched" << endl;
    delete portalgeom;
    cout << "Press Enter\n";
    getchar();
    return 1;
}

TePolygon poly;
portalgeom->fetchGeometry(poly);
string sumGeom = Te2String(TeGeometryArea(poly));
delete portalgeom;

string attrTableName = "arcmap";
string qr;
TeDatabasePortal* portalattr = db->getPortal();
qr = "SELECT nation, cntryname, object_id_159 FROM " + attrTableName;
if (!portalattr->query(qr))
{
    cout << "could not get attributes of study area" << endl;
    delete portalattr;
    cout << "Press Enter\n";
    getchar();
    return 1;
}

while (portalattr->fetchRow())
{
unsigned int itt = 0;
string id = portalattr->getData("nation");
string name = portalattr->getData("cntryname");
string geom_id = portalattr->getData("object_id_159");

TeTable table(tableNameVec, attListVec, "");

TeTableRow row;
row.push_back(id);
row.push_back(name);
row.push_back(geom_id);
row.push_back(sumGeom);

table.add(row);
if (!db->insertTable(table))
{
    cout << "Failed to save the table: " << db->errorMessage() << endl;
    db->close();
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}

++itt;
}
delete portalattr;
return 0;
}

//####################################################################
//############### Method to generate image summaries #################
//####################################################################

int generate_raster_summary()
{
TeAttributeList attList;
string tableName = "image_summary_data";

db->deleteTable(tableName);

TeAttribute at;
at.rep_.name_ = "name";
at.rep_.type_ = TeSTRING;
at.rep_.numChar_ = 250;

TeAttribute atrr;
atrr.rep_.name_ = "study_area_id";
atrr.rep_.type_ = TeREAL;
atrr.rep_.numChar_ = 3;

TeAttribute atrb;
atrb.rep_.name_ = "band_id";
atrb.rep_.type_ = TeREAL;
atrb.rep_.numChar_ = 2;

TeAttribute atr;
atr.rep_.name_ = "image_datetime";
atr.rep_.type_ = TeSTRING;
atr.rep_.numChar_ = 30;

TeAttribute atr1;
atr1.rep_.name_ = "sum";
atr1.rep_.type_ = TeREAL;
atr1.rep_.numChar_ = 15;
atr1.rep_.decimals_ = 4;

TeAttribute atr2;
atr2.rep_.name_ = "mean";
atr2.rep_.type_ = TeREAL;
atr2.rep_.numChar_ = 15;
atr2.rep_.decimals_ = 4;

TeAttribute atr3;
atr3.rep_.name_ = "Entropy";
atr3.rep_.type_ = TeREAL;
atr3.rep_.numChar_ = 15;
atr3.rep_.decimals_ = 4;

TeAttribute atr4;
atr4.rep_.name_ = "Correlation";
atr4.rep_.type_ = TeREAL;
atr4.rep_.numChar_ = 30;
atr4.rep_.decimals_ = 4;

TeAttribute atr5;
atr5.rep_.name_ = "Min";
atr5.rep_.type_ = TeREAL;
atr5.rep_.numChar_ = 15;
atr5.rep_.decimals_ = 4;

TeAttribute atr6;
atr6.rep_.name_ = "Max";
atr6.rep_.type_ = TeREAL;
atr6.rep_.numChar_ = 15;
atr6.rep_.decimals_ = 4;

TeAttribute atr7;
atr7.rep_.name_ = "variance";
atr7.rep_.type_ = TeREAL;
atr7.rep_.numChar_ = 15;
atr7.rep_.decimals_ = 4;

TeAttribute atr8;
atr8.rep_.name_ = "StdDev";
atr8.rep_.type_ = TeREAL;
atr8.rep_.numChar_ = 15;
atr8.rep_.decimals_ = 4;

TeAttribute atr9;
atr9.rep_.name_ = "Mode";
atr9.rep_.type_ = TeREAL;
atr9.rep_.numChar_ = 15;
atr9.rep_.decimals_ = 4;

attList.push_back(at);
attList.push_back(atrr);
attList.push_back(atrb);
attList.push_back(atr);
attList.push_back(atr1);
attList.push_back(atr2);
attList.push_back(atr3);
attList.push_back(atr4);
attList.push_back(atr5);
attList.push_back(atr6);
attList.push_back(atr7);
attList.push_back(atr8);
attList.push_back(atr9);

if (!db->createTable(tableName, attList))
{
    cout << "Failed to create the table: " << db->errorMessage() << endl;
    db->close();
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}

TeDatabasePortal* portal = db->getPortal();

if (!portal) return -1;

string sql = "SELECT l.layer_id as layer_id, l.name as name, ";
sql += " to_char(r.initial_time, 'YYYYMMDDHH24MISS') as ";
sql += " inidatetime, to_char(r.final_time, 'YYYYMMDDHH24MISS')";
sql += " as findatetime from te_layer l, (SELECT layer_id, ";
sql += " initial_time, final_time FROM te_representation ";
sql += " WHERE geom_type=512) as r where l.layer_id=r.layer_id ";
sql += " and r.initial_time is not null order by r.initial_time desc";

if (!portal->query(sql))
{
    cout << "Could not execute..." << db->errorMessage();
    delete portal;
    return -1;
}

while (portal->fetchRow())
{
unsigned int it = 0;


string layer_name = portal->getData("name");

string inidatetime= portal->getData("inidatetime");

TeLayer* layer = new TeLayer("arcmap", db);
TeProjection* geomProj = layer->projection();

TeRepresentation* rep = layer->getRepresentation(TePOLYGONS);
if (!rep)
{
    cout << "Layer has no polygons representation!";
    db->close();
    cout << "Press Enter\n";
    getchar();
    return 1;
}

string geomTableName = rep->tableName_;

TePolygonSet ps;
db->loadPolygonSet(geomTableName, "0", ps);

TeLayer* imgLayer = new TeLayer(layer_name, db);
if (imgLayer->id() < 1)
{
    cout << "Cannot access image layer" << endl;
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}

string rasterTable = imgLayer->tableName(TeRASTER);
int layerId = atoi(portal->getData("layer_id"));
TeRaster* img = db->loadLayerRaster(layerId);

TeRaster* clip = TeRasterClipping(img, ps, geomProj, "clip", 0.0, "DB");

if (!clip->init())
{
    cout << "clip was not initialized";
    cout << "Press Enter\n";
    getchar();
    return 1;
}

//####################################################################
//#### Applying principal components algorithm developed by INPE #####
//####################################################################

TePDIPrincipalComponents::TePDIPCAType analysis_type =
    TePDIPrincipalComponents::TePDIPCADirect;

TePDIParameters params_direct;

TePDITypes::TePDIRasterPtrType inRaster1(clip, 'r');
TEAGN_TRUE_OR_THROW(inRaster1->init(), "Unable to init inRaster1");
TePDITypes::TePDIRasterPtrType inRaster2(clip, 'r');
TEAGN_TRUE_OR_THROW(inRaster2->init(), "Unable to init inRaster2");
TePDITypes::TePDIRasterPtrType inRaster3(clip, 'r');
TEAGN_TRUE_OR_THROW(inRaster3->init(), "Unable to init inRaster3");

TePDITypes::TePDIRasterVectorType input_rasters;
input_rasters.push_back(inRaster1);
input_rasters.push_back(inRaster2);
input_rasters.push_back(inRaster3);

std::vector<int> bands_direct;
bands_direct.push_back(0);
bands_direct.push_back(1);
bands_direct.push_back(2);

TePDITypes::TePDIRasterPtrType outRaster1_direct;
TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster1_direct,
    1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 1 Alloc error");

TePDITypes::TePDIRasterPtrType outRaster2_direct;
TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster2_direct,
    1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 2 Alloc error");

TePDITypes::TePDIRasterPtrType outRaster3_direct;
TEAGN_TRUE_OR_THROW(TePDIUtils::TeAllocRAMRaster(outRaster3_direct,
    1, 1, 1, false, TeDOUBLE, 0), "RAM Raster 3 Alloc error");

TePDITypes::TePDIRasterVectorType output_rasters_direct;
output_rasters_direct.push_back(outRaster1_direct);
output_rasters_direct.push_back(outRaster2_direct);
output_rasters_direct.push_back(outRaster3_direct);

TeSharedPtr<TeMatrix> covariance_matrix(new TeMatrix);

params_direct.SetParameter("analysis_type", analysis_type);
params_direct.SetParameter("input_rasters", input_rasters);
params_direct.SetParameter("bands", bands_direct);
params_direct.SetParameter("output_rasters", output_rasters_direct);
params_direct.SetParameter("covariance_matrix", covariance_matrix);

TePDIPrincipalComponents pc_direct;
TEAGN_TRUE_OR_THROW(pc_direct.Reset(params_direct), "Invalid Parameters");
TEAGN_TRUE_OR_THROW(pc_direct.Apply(), "Apply error");

//####################################################################
//# Output of the PCA algorithm is provided for statistical calculation
//####################################################################

for (unsigned int iter = 0; iter < output_rasters_direct.size(); iter++)
{
TePDITypes::TePDIRasterPtrType inRaster = output_rasters_direct[iter];
TEAGN_TRUE_OR_THROW(inRaster->init(), "Unable to init inRaster");

TePDIParameters pars;
TePDITypes::TePDIRasterVectorType rasters;
rasters.push_back(inRaster);
pars.SetParameter("rasters", rasters);
std::vector<int> bands;
bands.push_back(0);
pars.SetParameter("bands", bands);

TeBox box = inRaster->params().boundingBox();
TePolygon pol = polygonFromBox(box);
TePDITypes::TePDIPolygonSetPtrType polset(new TePolygonSet);
polset->add(pol);
pars.SetParameter("polygonset", polset);

TePDIStatistic stat;

TEAGN_TRUE_OR_THROW( stat.Reset( pars ), "Reset error" );

string band = Te2String(iter);
string sum = Te2String(stat.getSum(0));
string mean = Te2String(stat.getMean(0));
string variance = Te2String(stat.getVariance(0));
string StdDev = Te2String(stat.getStdDev(0));
string getEntropy = Te2String(stat.getEntropy(0));
string getMin = Te2String(stat.getMin(0));
string getMax = Te2String(stat.getMax(0));
string getMode = Te2String(stat.getMode(0));
string getCorrelation = Te2String(stat.getCorrelation(0, 0));

TeTable table(tableName, attList, "");

TeTableRow row;
row.push_back(layer_name);
row.push_back("31");
row.push_back(band);
row.push_back(inidatetime);
row.push_back(sum);
row.push_back(mean);
row.push_back(getEntropy);
row.push_back(getCorrelation);
row.push_back(getMin);
row.push_back(getMax);
row.push_back(variance);
row.push_back(StdDev);
row.push_back(getMode);

table.add(row);
if (!db->insertTable(table))
{
    cout << "Failed to save the table: " << db->errorMessage() << endl;
    db->close();
    cout << endl << "Press Enter\n";
    getchar();
    return 1;
}
}
++it;
}
return 0;
}
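The TePDIStatistic getters used above hide the arithmetic behind the image_summary_data columns. As a plain-C++ restatement of the simplest of them (struct name and sample data are ours; population variance is assumed, one of the two common conventions), sum, mean, variance and standard deviation reduce to:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-band summary values in the spirit of the image_summary_data columns.
struct BandSummary {
    double sum;
    double mean;
    double variance;
    double stddev;
};

// One pass for the sum/mean, a second pass for the spread (population form).
BandSummary summarize(const std::vector<double>& px) {
    BandSummary s = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < px.size(); ++i) s.sum += px[i];
    s.mean = s.sum / px.size();
    for (std::size_t i = 0; i < px.size(); ++i)
        s.variance += (px[i] - s.mean) * (px[i] - s.mean);
    s.variance /= px.size();
    s.stddev = std::sqrt(s.variance);
    return s;
}
```

Entropy, mode and correlation follow the same pattern over the pixel histogram and over pairs of bands, respectively.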

int main()
{
TEAGN_LOGMSG("Test started.");
try
{
TeStdIOProgress pi;
TeProgress::setProgressInterf(dynamic_cast<TeProgressBase*>(&pi));
TeInitRasterDecoders();

db = new TePostGIS();
if (!db->connect("172.16.33.170", "imran", "password", "test", 5432))
{
    cout << "Error: " << db->errorMessage() << endl << endl;
    cout << "Press Enter\n";
    getchar();
    return 1;
}

generate_vector_summary();
generate_raster_summary();
}
catch (const TeException& e)
{
    TEAGN_LOGERR("Test Failed-" + e.message());
    return EXIT_FAILURE;
}

db->close();
TEAGN_LOGMSG("Test OK.");
cout << "\nPress enter...";
getchar();
return EXIT_SUCCESS;
}

101 102 Bibliography

[1] M. Egenhofer, Spatial Information Appliances: A next generation of geo- graphic information systems, First Brazilian Workshop on Geoinformatics, 1999, Campinas, Brazil.

[2] Selim Aksoy, Krzysztof Koperski, Carsten Tusk, and Giovanni Marchisio, Interactive Training of Advanced Classifiers for Mining Remote Sensing Image Archives, In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, 773–782, Seat- tle, WA, USA.

[3] Lubia´ Vinhas, Richardo Cartexo Modesto De Souza, and Gilberto Camara,ˆ Image data handling in spatial databases, V Brazilian Symposium on Geoinformatics, GeoInfo2003, 2003, Campos do Jordao,˜ SP, Brazil.

[4] Vittorio Castelli, and Lawrence D. Bergman, Image Databases: Search and Retrieval of Digital Imagery, John Wily & Sons Inc., 2002, New York, USA.

[5] N.M. Mattikalli, Integration of remotely-sensed raster data with vector- based geographical information system for land-use change detection, In the International Journal of Remote Sensing, 1995, 16, 15, 2813–2828.

[6] C.C. Petit, and E.F. Lambin, Integration of multi-source remote sensing data for land cover change detection, In the International Journal of Geo- graphical Information Science, 2001, 15, 785–803.

[7] Alexander S. Perepechko, Jessica K. Graybill, Craig ZumBrunnen, and Dmitry Sharkov, Spatial database development for Russian urban areas: A new conceptual framework, In the Journal of GIScience & Remote Sens- ing, 2005, 42, 2, 144–170.

[8] X. Yang, and C.P. Lo, Using a time series of satellite imagery to detect land use and land cover changes in the Atlanta, Georgia metropolitan area, In the International Journal of Remote Sensing, 2002, 23, 1775–1798.

[9] S. Gautama, J.D. Haeyer, and W. Philips, Graph-based change detection in geographic information using VHR satellite images, In the International Journal of Remote Sensing, 2006, 27, 9, 1809–8124.

103 Bibliography

[10] Qihao Weng, Land use change analysis in the Zhujiang Delta of China us- ing satellite remote sensing, GIS and stochastic modelling, In the Journal of Environmental Management, 2002, 64, 273–284.

[11] D. Lu, P. Mausel, E. Brondizio, and E. Moran, Change detection tech- niques, In the International Journal of Remote Sensing, 2004, 25, 12, 2365–2407.

[12] Jignesh Patel, JieBing Yu, Navin Kabra, Kristin Tufte, Biswadeep Nag, Josef Burger, Nancy Hall, Karthikeyan Ramasamy, Roger Lueder, Curt Ellmann, Jim Kupsch, Shelly Guo, Johan Larson, David DeWitt, and Jef- frey Naughton, Building a Scalable Geo-Spatial DBMS: Technology, Im- plementation, and Evaluation, In Proceedings of the ACM SIGMOD Con- ference, 1997, 336–347.

[13] Elena G. Irwin, Nancy E. Bockstael, and Hyun Jin Cho, Measuring and modeling urban sprawl: Data, scale and spatial dependencies, In the Ur- ban Economics Sessions, 53rd Annual North American Regional Science Association Meetings of the Regional Science Association International, November 16-18, 2006, Toronto, Canada. [14] Jiawei Han, Krzysztof Koperski, and Nebojsa Stefanovic, GeoMiner: A System Prototype for Spatial Data Mining,InProceedings of the ACM SIGMOD Conference, 1997, 553–556. [15] University of Alabama in Huntsville, ADaM 4.0.0 Documentation, http://datamining.itsc.uah.edu/adam/documentation.html, Accessed on September 14, 2008. [16] Marcelino Pereira Dos Santos Silva, and Gilberto Camara,ˆ Remote Sens- ing Image Mining Using Ontologies, Technical Report, DPI/INPE- Image Processing Division, National Institute of Space Research, 2005, Brazil. [17] MapServer documentation manuals, http://mapserver.gis.umn.edu/, Ac- cessed on September 14, 2008. [18] Oracle Spatial Documentation, http://www.oracle.com/, Accessed on September 14, 2008. [19] Open Geospatial Consortium, OpenGIS implementation specification: Grid coverage, Technical report, Open Geospatial Consortium, 2001. [20] Xing Lin, and Timothy H. Keitt, Goeraster a coverage/raster model and operations for PostGIS, Google Summer of Code Project, 2007, http://lists.refractions.net/, Accessed on October 04, 2008. [21] Shashi Shekhar, and Sanjay Chawala, Spatial Databases: A Tour, Pearson Education Inc., 2003, New Jersey, USA. [22] Vittorio Castelli, and Lawrence D. Bergman, Image Databases: Search and Retrieval of Digital Imagery, John Wily & Sons Inc., 2002, New York, USA.

104 Bibliography

[23] Liu Yu, Wang Yinghui, Zhang Yi, Lin Xing, and Qin Shi, GSQL-R: A query language supporting raster data, In the Geoscience and Remote Sensing Symposium IGARSS ’04’, 2004, 7, 4414–4417.

[24] PgRaster SQL interface requirement document for raster data, http://postgis.refractions.net/, Accessed on October 11, 2008.

[25] PostgreSQL documentation of manuals, http://www.postgresql.org/, Ac- cessed on October 11, 2008.

[26] PGCHIP documentation of manuals, http://simon.benjamin.free.fr/pgchip/, Accessed on October 11, 2008.

[27] Oracle Spatial Documentation, http://www.oracle.com/, Accessed on Oc- tober 11, 2008.

[28] Gilberto Camara,ˆ Lubia´ Vinhas, Karine Reis Ferreira1, Gilberto Ribeiro de Queiroz, Ricardo Cartaxo Modesto de Souza, Antonioˆ Miguel Vieira Mon- teiro, Marcelo T´ılio de Carvalho, Marco Antonio Casanova, and Ubirajara Moura de Freitas, TerraLib: An Open Source GIS Library for Large-Scale Environmental and Socio-economic Applications, In the book G. Brent Hall and Michael G. Leahy (edt), Open Source Approaches in Spatial Data Han- dling, Springer, 2008, 247–270, Berlin.

[29] TerraLib programming tutorial, http://www.dpi.inpe.br/terralib/docs/, Accessed on October 14, 2008.

[30] Gilberto Camara,ˆ Marcos Correaˆ Neves, Antonioˆ Miguel Vieira Monteiro and Lubia´ Vinhas, Spring and TerraLib: Integrating Spatial Analysis and GIS, Technical Report, DPI/INPE- Image Processing Division, National Institute of Space Research, 2002, Campos do Jordao,˜ SP, Brazil.

[31] Lubia´ Vinhas, Gilberto Camara,ˆ and Ricardo Cartaxo Modesto de Souza, TerraLib: An open source GIS library for spatio-temporal databases, Tech- nical Report, DPI/INPE- Image Processing Division, National Institute of Space Research, 2004, Brazil.

[32] Paul Ramsey, The state of open source GIS, Technical Report, Refractions Research Inc., 2007.

[33] GDAL – geospatial data abstraction library, http://www.gdal.org/, Ac- cessed on October 17, 2008.

[34] J.A Greenberg, C.A Rueda, and S.L Ustin, Starspan: A tool for fast selec- tive pixel extraction from remotely sensed data, Center for Spatial Tech- nologies and Remote Sensing (CSTARS), University of California at Davis, 2005, Davis, CA.

[35] Starspan documentation manuals, http://starspan.casil.ucdavis.edu/, Accessed on October 17, 2008.

105 Bibliography

[36] Renato Martins Assuncao,˜ Marcos Correaˆ Neves, Gilberto Camara,ˆ and Corina Da Costa Freitas, Efficient regionalization techniques for socio- economic geographical units using minimum spaning trees, In the Inter- national Journal of Geographical Information Science, 2006, 20, 797–812.

[37] Pedro Ribeiro de Andrade Neto, and Paulo Justiniano Ribeiro Junior, A Process and Environment for Embedding The R Software into TerraLib, In the VII Brazilian Symposium on Geoinformatics, GeoInfo2005, 2005, Campos do Jordao,˜ SP, Brazil.

[38] John Rushing, Rahul Ramachandran, Udaysankar Nair, Sara Graves, Ron Wetch, and Hong Lin, ADaM: a data mining toolkit for scientists and engineers, In the Computers & Geosciences, 2005, 31, 607–618.

[39] Lubia´ Vinhas, Gilberto Camara,ˆ and Ricardo Cartaxo Modesto de Souza, TerraLib: An open source GIS library for spatio-temporal databases, Tech- nical Report, DPI/INPE- Image Processing Division, National Institute of Space Research, 2004, Brazil.

[40] A.M. MacEachren, An evolving cognitive-semiotic approach to geographic visualization and knowledge construction, In the Information Design Jour- nal, 2001, 10, 49–72.

[41] David J. Hand, Statistics and Data Mining: Intersecting Disciplines, In the SIGKDD Explanations, 1999, 1, 1, 16–19.

[42] S. Sumathi, and S.N. Sivanandam, Statistical Themes and Lessons for Data mining, In the Studies in Computational Intelligence (SCI), 2006, 29, 243–263.

[43] Surajit Chaudhuri, Data Mining and Database Systems: Where is the Intersection?, In the IEEE Data Eng. Bull., 1998, 21, 1, 4–8.

[44] A.K. Sinha, Geoinformatics: Data to knowledge. The Geological Society of America (GSA), 2006, Boulder, Colorado, USA.

[45] David M. Rocke, and David L. Woodruff, Some Statistical Tools for Data Mining Applications, Center for Image Processing and Integrated Com- puting University of California, 1998, Davis, USA.

[46] Fredrik Farnstrom, James Lewis, and Charles Elkan, Scalability for clus- tering algorithms revisited, In the SIGKDD Explorations, 2000, 2, 7–51.

[47] M.O. Mansur, and Mohd. Noor Md. Sap, Outlier Detection Technique in Data Mining: A Research Perspective, In the Postgraduate Annual Re- search Seminar, 2005, Brazil.

[48] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Joorg¨ Sander, LOF: Identifying Density-Based Local Outliers, In the Proceed- ings of ACM SIGMOD 2000 Int. Conf. on Management of Data, 2000, 29, 2, 93–104.

106 Bibliography

[49] Tu Bao Ho, Saori Kawasaki1, and Janusz Granat, Knowledge Acquisition by Machine Learning and Data Mining, In the Studies in Computational Intelligence, Springer, 2007, 59, 69–91.

[50] Pauray S.M. Tsai, and Chien-Ming Chen, Mining interesting association rules from customer databases and transaction databases. In the Informa- tion Systems, 2004, 29, 8, 685–696.

[51] S. Sumathi, and S.N. Sivanandam, Data Mining Tasks, Techniques, and Applications, In the Studies in Computational Intelligence (SCI), 2006, 29, 195–216.

[52] Boriana L. Milenova, and Marcos M. Campos, Mining high-Dimensional Data for International Fusion: A Database-Centric Approach, In the 8th International Conference on Information Fusion, 2005, 1, 7 pp-.

[53] Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, and Joorg¨ Sander, Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support, In the Data Mining and Knowledge Discovery, 2000, 4, 193–216. [54] Ming-Syan Chen, Jiawei Han, and Philip S. Yu, Data Mining: An overview from database prospective, In the IEEE Transactions on Knowledge and Data Engineering, 1996, 8, 886–883. [55] Surajit Chaudhuri, and Umeshwar Dayal, An overview of data warehous- ing and OLAP technology, In the SIGMOD Record, 1997, 26, 65–74. [56] P. Adrians, and D. Zantinge, Data Mining, Addison-Wesley, 1996, Harlow, U.K. [57] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Re- ichart, and Murali Venkatrao, Data Cube: A Relational Aggregation Op- erator Generalizing Group-By, Cross-Tab, and Sub-Totals, In the Data mining and Knowledge Discover, 1997, 1, 29–53. [58] Monica Wachowicz, GeoInsight: An approach for developing a knowledge construction process based on the integration of GVis and KDD methods, In the book Harvey J.Mliiter, Jiawei Han (Eds.) Geographic data mining and knowledge discovery, Taylor & Frances Inc., 2001, New York, USA. [59] Jochen Hipp, Ulrich Guntzer,¨ and Gholamreza Nakhaeizadeh, Algorithms for association rule mining: a general survey and comparison, In the ACM SIGKDD Explorations, 2000, 2, 1, 58–64. [60] Tu Bao Ho, Saori Kawasaki, and Janusz Granat, Knowledge Acquisition by Machine Learning and Data Mining, In the Studies in Computational Intelligence, Springer, 2007, 59, 69–91. [61] S. Sumathi, and S.N. Sivanandam, Data Mining Tasks, Techniques, and Applications, In the Studies in Computational Intelligence (SCI), 2006, 29, 195-216.

107 Bibliography

[62] Jong Gyu Han, Keun Ho Ryu, Kwang Hoon Chi, and Yeon Kwang Yeon, Statistics Based Predictive Geo-spatial Data Mining: Forest Fire Haz- ardous Area Mapping Application, In the X. Zhou, Y. Zhang, and M.E. Or- lowska (Eds.): APWeb 2003, Springer, 2003, LNCS 2642, 370–381, Berlin, Heidelberg.

[63] Gilberto Camara,ˆ Max J. Egenhofer, Frederico Fonseca, and Antonioˆ Miguel Vieira Monteiro, What is an image? In the International Confer- ence on Spatial Information Theory, Springer, 2001, LNCS 2205, 474–488.

[64] Yong Ge, Bai Hexiang, and Sanping Li, Geo-spatial Data Analysis, Quality Assessment, and Visualization, In the Proceedings of International Confer- ence on Computational Science and Applications ICCSA 2008, Springer, 2008, 5072, 258-267.

[65] S. Rinzivillo, F. Turini, V. Bogorny, C. Korner,¨ B. Kuijpers, and M. May, Knowledge Discovery from Geographical Data, In the book Mobility, Data Mining, and Privacy, Springer, 2008, 243–265, Berlin, Heidelberg.

[66] Vijay Gandhi, James M. Kang, and Shashi Shekhar, Spatial databases: Technical Report, Department of Computer Science and Engineering Uni- versity of Minnesota, 2007, USA.

[67] Tianqiang Huang, and Ziaolin Qin, Detecting Outliers in spatial database, In the International Conference on Image and Graphics, IEEE Computer Society, 2004, 556–559.

[68] Elzbieta Malinowski, and Esteban Zimanyi. Spatial data Warehouses. In the book Advanced Data Warehouse Design, Springer, 2008, 133–179, Berlin, Heidelberg.

[69] Sebastien Mustiere,` and John Van Smaalen. Database requirements for generalization and multiple representation, In the book Generaliza- tion of geographic information: cartographic modelling and applications, Springer, 2007, 113–136, Berlin, Heidelberg.

[70] Elisa Bertino, and Maria Luisa Damiani. Spatial Knowledge-Based Ap- plications and Technologies: Research Issues, In the Proceedings of 9th International Conference, KES 2005, Part IV, Springer, 2005, LNCS 3684, 324–328, Berlin, Heidelberg.

[71] Krzysztof Koperski, and Jiawei Han, Discovery of spatial association rules in geographic information databases, In the Proc. 4th Int. Symp. Advances in Spatial Databases, SSD, Springer-Verlag, 1995, LNCS 951, 47–66, Berlin, Heidelberg.

[72] Alfred Stein, Modern developments in image mining, In the journal Science in China Series E: Technological Sciences, 2008, 51, 13–25.

[73] Ranga Raju Vatsavai, Shashi Shekhar, Thomas E. Burk, and Budhendra Bhaduri, *Miner: A Suite of Classifiers for Spatial, Temporal, Ancillary, and Remote Sensing Data Mining, In the Fifth International Conference on Information Technology: New Generations, IEEE, 2008, 801–806.

[74] Marcelino Pereira S. Silva, Gilberto Câmara, Ricardo Cartaxo M. Souza, Dalton M. Valeriano, and Maria Isabel S. Escada, Mining Patterns of Change in Remote Sensing Image Databases, In the IEEE International Conference on Data Mining, IEEE, 2005, 362–369, Los Alamitos, CA, USA.

[75] R.L. Kettig, and D.A. Landgrebe, Computer classification of remotely sensed multispectral image data by extraction and classification of homogeneous objects, In the IEEE Trans. Geoscience Electronics, 1976, GE-14, 1, 19–26.

[76] B. Uma Shankar, Novel Classification and Segmentation Techniques with Application to Remotely Sensed Images, In J.F. Peters et al. (Eds.): Transactions on Rough Sets VII, Springer-Verlag, 2007, LNCS 4400, 295–380, Berlin, Heidelberg.

[77] Wynne Hsu, Mong Li Lee, and Ji Zhang, Image Mining: Trends and Developments, In the Journal of Intelligent Information Systems, 2002, 19, 1, 7–23.

[78] Gilberto Câmara, Ricardo Cartaxo Modesto Souza, Ubirajara Moura Freitas, Juan Garrido, and Fernando Mitsuo, SPRING: Integrating remote sensing and GIS by object-oriented data modelling, In the Journal of Computers & Graphics, 1996, 20, 3, 395–403.

[79] Gilberto Câmara, Lúbia Vinhas, Karine Reis Ferreira, Gilberto Ribeiro de Queiroz, Ricardo Cartaxo Modesto de Souza, Antônio Miguel Vieira Monteiro, Marcelo Tílio de Carvalho, Marco Antonio Casanova, and Ubirajara Moura de Freitas, TerraLib: An Open Source GIS Library for Large-Scale Environmental and Socio-economic Applications, In the book G. Brent Hall and Michael G. Leahy (Eds.), Open Source Approaches in Spatial Data Handling, Springer, 2008, 247–270, Berlin.

[80] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995, NJ.

[81] J. Rogan, and J. Miller, Integrating GIS and remote sensing for mapping forest distribution and change, In the book Understanding Forest Distribution and Spatial Pattern: Remote Sensing and GIS Approaches, CRC Press (Taylor & Francis), 2006, FL, USA.

[82] A.S.M. Gieske, J. Hendrikse, V. Retsios, B. van Leeuwen, B.H.P. Maathuis, M. Romaguera, J.A. Sobrino, W.J. Timmermans, and Z. Su, Processing of MSG-1 SEVIRI data in the thermal infrared algorithm development with the use of the SPARC2004 data set, In the ESA Proceedings WPP-250: SPARC final workshop, 2005, 8, Enschede, Netherlands.

[83] E. Ebert, A pattern recognition technique for distinguishing surface and cloud types in the polar regions, In the Journal of Climate and Applied Meteorology, 1987, 26, 1412–1427.

[84] U. Amato, A. Antoniadis, V. Cuomo, L. Cutillo, M. Franzese, L. Murino, and C. Serio, Statistical cloud detection from SEVIRI multispectral images, In the journal Remote Sensing of Environment, 2008, 112, 750–766.

[85] Cristina Conde, Antonio Ruiz, and Enrique Cabello, PCA vs. low resolution images in face verification, In Proceedings of the 12th International Conference on Image Analysis and Processing, IEEE Computer Society, 2003, 63–67.
