<<

A Prototype Infrastructure for Sentinel Earth Observation Data Relative to Portugal

Supporting OGC Standards for Raster Data Management

Diogo Filipe Pimenta Andrade

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisor(s): Prof. Bruno Emanuel da Grac¸a Martins Prof. Mario´ Jorge Costa Gaspar da Silva

Examinatiom Committee

Chairperson: Prof. Daniel Jorge Viegas Gonc¸alves Supervisor: Prof. Prof. Bruno Emanuel da Grac¸a Martins Members of the Committe: Prof. Armanda Rodrigues

May 2017

The only way to achieve the impossible is to believe it is possible. Charles Lutwidge Dodgson

Acknowledgments

I would like to thank those who have been present and have in a way or another contributed to this thesis. First, I would like to thank my advisors, Prof. Bruno Martins and Prof. Mario´ Gaspar, for all the guidance, patience, and support. I would also like to thank Eng. Bruno Anjos and Eng. Marco Silva from Instituto Portuguesˆ do Mar e da Atmosfera (IPMA) for the technical support. I also thank the future mother of my children, Barbara´ Agua,´ for always being kind and supportive. I can not thank her enough for all the encouragement and dedication. I thank my good friend Joao˜ Vieira, for the inspiration and motivation that brought me to pursue higher education. I would like to thank my brother, Joao˜ Andrade, for being the father that I never had. Finally, I would like to thank the taxpayers of Portugal for providing me with financial support, through a scholarship from Governo Regional da Madeira, through the SASUL scholarship, and through the researh grant from the Fundac¸ao˜ para a Cienciaˆ e Tecnologia (FCT) within project DataS- torm - EXCL/EEI-ESS/0257/2012. I also would like to express my gratitude to the Republica´ ”A Desordem dos Engenheiros”, for all the moments, for helping me to grow as a person, for being my second family, and for facilitating my life in an economic way. I will never forget and will be always grateful.

iii

Abstract

This document describes an architecture for the IPSentinel infrastructure for managing Earth ob- servation data together with its implementation. This infrastructure was developed using the DHuS software from the European Space Agency together with the RasDaMan array manage- ment system, to catalogue, disseminate and process Sentinel Earth observation products for the Portuguese territory. RasDaMan implements standards from the Open Geospatial Consortium, such as and Web Coverage Processing Service, which provide access and pro- cessing of the rasters encoding Earth observation data, through the Internet. The reported experiments show that the prototype system meets the functional requirements. This dissertation also provides measurements of the used computational resources, in terms of storage space and response times.

Keywords

Remote Sensing Products, Earth Observation Data, Raster Data, OGC Standards, Geospatial Data Infrastructures, Big Data Management

v

Resumo

Este documento descreve a arquitectura e a implementac¸ao˜ de um prototipo´ da infraestrutura IP Sentinel para a gestao˜ de produtos de observac¸ao˜ da Terra. Esta infraestrutura foi desenvolvida usando o software DHuS da Agenciaˆ Espacial Europeia, juntamente com o sistema de gestao˜ de base de dados de arrays RasDaMan. Esta infraestrutura tem como objectivo catalogar, disseminar e processar produtos de observac¸ao˜ da Terra Sentinel relativos ao territorio´ Portugues.ˆ O RasDaMan implementa normas do Open Geospatial Consortium, tais como as normas Web Coverage Service and Web Coverage Processing Service, que disponibilizam o acesso e o processamento de rasters que codificam os dados de obvservac¸ao˜ da Terra, atraves´ da Internet. As experienciasˆ relizadas mostram que o prototipo´ cumpre os requisitos funcionais. Esta disserta- c¸ao˜ fornece tambem´ algumas medic¸oes˜ dos recursos computacionais utilizados, em termos do espac¸o de armazenamento e do tempo de resposta.

Palavras Chave

Detecc¸ao˜ Remota, Dados de Observac¸ao˜ da Terra, Dados Raster, Especificac¸oes˜ OGC, Infras- truturas para Dados Geoespaciais, Gestao˜ de Megadados

vii

Contents

1 Introduction 1 1.1 Thesis Proposal ...... 2 1.2 Summary of Contributions ...... 3 1.3 Document Organization ...... 3

2 Concepts and Related Work 5 2.1 Concepts ...... 6 2.1.1 The Sentinel Programme ...... 6 2.1.2 Computational Storage of EO Rasters ...... 7 2.1.3 Array Database Managing Services ...... 9 2.1.3.A Array Algebra ...... 10 2.1.4 Open Geospatial Consortium Standards ...... 11 2.1.4.A OGC Web Coverage Service ...... 13 2.1.4.B OGC Web Coverage Processing Service ...... 14 2.1.4.C OGC ...... 15 2.2 Related Work ...... 16 2.2.1 The ESA Data Hub Service ...... 17 2.2.2 The RasDaMan Array Database Management System ...... 21 2.2.3 The SciDB Array Database Management System ...... 23 2.2.4 Handing Raster data in MonetDB and the TELEIOS Infrastructure ...... 26 2.3 Summary and Discussion ...... 28

3 Prototype Infrastructure 31 3.1 Overview ...... 32 3.2 System Assumptions and Functional Requirements ...... 33 3.3 Architecture ...... 34 3.4 Implementation ...... 39 3.4.1 Integration of Petascope with the Data Hub Service (DHuS) software ...... 39 3.4.2 Modification of Petascope and DHuS Web Clients ...... 39 3.4.3 Modification of the DHuS Core to Automate the Process of Ingesting Coverages in RasDaMan ...... 41 3.5 Summary ...... 42

ix 4 Validation 45 4.1 Requirements Compliance ...... 46 4.2 Measurement of Required Computational Resources ...... 48 4.2.1 Storage Space ...... 48 4.2.2 Response Time ...... 49 4.3 Summary ...... 54

5 Conclusions and Future Work 57 5.1 Conclusions ...... 58 5.2 Future Work ...... 58

Bibliography 61

x List of Figures

2.1 The two GeoTIFF raster coordinate systems [Ritter and Ruth, 1997]...... 7 2.2 The GeoTIFF raster, map, and world space [Ritter and Ruth, 1997]...... 8 2.3 Example of GeoTIFF metadata parameter set [Ritter and Ruth, 1997]...... 9 2.4 Constituents of an array...... 10 2.5 Examples of 2D rectified (1 and 2) and referenceable (3 and 4) grids...... 12 2.6 Dissemination of Earth Observation (EO) data from a centralised Data Hub...... 17 2.7 Related view on the DHuS architecture centered around the Spring Framework [Fred´ eric´ Pi- dancier and Mbaye, 2014]...... 18 2.8 The DHuS graphical user interface...... 20 2.9 The DHuS service architecture [Fred´ eric´ Pidancier and Mbaye, 2014]...... 21 2.10 RasDaMan infrastructure and its Petascope and SECORE components for exposing arrays through OGC web services...... 23 2.11 Some SciQL array operations...... 26 2.12 TELEIOS Infrascture Overview...... 28

3.1 Major functionalities of the IPSentinel prototype...... 32 3.2 Overview of interactions of the prototype with ESA SciHub and users...... 34 3.3 Overview of the modified DHuS project...... 35 3.4 Available APIs in the DHuS...... 35 3.5 Component & connector allocated-to files view over the Tomcat and web applications. . 36 3.6 IPSentinel context diagram...... 36 3.7 Component & connector view from the DHuS core...... 37 3.8 Component & connector view with pipe and filter style from the Rasdaman Feeder. . . . 37 3.9 Petascope web client with GetMap tab open...... 40 3.10 A product listed with Open Geospatial Consortium (OGC) services availability checked. 40 3.11 Product details screen displaying OGC service buttons...... 41 3.12 Convention structure for the Sentinel1 products name ...... 42 3.13 Diagram of involved java classes in the implementation...... 43

4.1 Result of executing a getCoverage of Sentinel 1 image with only VV band selected. . . . 47 4.2 Result of executing a false colouring processing through the Web Coverage Processing Service (WCPS) query language...... 47

xi 4.3 Result of executing multiples getMap requests...... 47 4.4 Scenario 2: Regions selected by the 4 clients...... 50 4.5 Results produced by the queries in the 4.8...... 53 4.6 Result produced by the query in the Listing 4.7...... 54

xii List of Tables

2.1 The parameters of a GetCapabilities request...... 14 2.2 The parameters of a DescribeCoverage request...... 15 2.3 The parameters of a GetCoverage request...... 15 2.4 The parameters of a GetCapabilities request...... 15 2.5 The parameters of a GetMap request...... 16 2.6 Examples of operations supported by RasQL ...... 22 2.7 Two dimensional SciDB array...... 24 2.8 The matrix stored as three BATs...... 27

4.1 Space occupied by Sentinel 1 products on disk...... 48 4.2 Space occupied by GRD type products in RasDaMan...... 49 4.3 Storage space required for different rolling archive plans...... 49 4.4 Scenario 1: Response time ...... 50 4.5 Scenario 2: Response time ...... 50 4.6 Scenario 3: Response time ...... 50 4.7 Resume of response times of the WCS operations...... 51 4.8 Summary of response time from different queries...... 52

xiii

Abbreviations

AFL Array Functional Language

AQL Array Query Language

BAT Binary Association Table

DAOs Data Access Objects

DBMS Database Management System

DGT Direcc¸ao˜ Geral do Territorio´

DHuS Data Hub Service

CRS Coordinate Reference Systems

EO Earth Observation

EPSG European Petroleum Survey Group

ESA European Space Agency

EW Extra Wide swath

GIS Geographic Information Systems

GML Geography Markup Language

GRD Ground Range Detected

GUI graphical user interface

HSQLDB HyperSQL Database

IPMA Instituto Portuguesˆ do Mar e da Atmosfera

IW Interferometric Wide swath

MDD multidimensional discrete data

OData Open Data Protocol

OGC Open Geospatial Consortium

xv POM Project Object Model

SLC Single Look Complex

URL Uniform Resource Locator

UTM Universal Transverse Mercator

WAR Web application ARchive

WCPS Web Coverage Processing Service

WCS Web Coverage Service

WMS Web Map Service

xvi 1 Introduction

Contents 1.1 Thesis Proposal ...... 2 1.2 Summary of Contributions ...... 3 1.3 Document Organization ...... 3

1 Thanks to the Copernicus Earth Observation Programme, a wide range of Earth Observation (EO) data is being captured everyday for several applications, such as the monitoring of landslides, or maritime monitoring and control. The acquired data is stored and made available in the European Space Agency (ESA) centralized Data Hub Service (DHuS) for all interested users. However, given that the volume of data acquired daily reaches the Terabytes, the ESA DHuS does not ensure long- term data preservation or fast access to a set of EO data. These are in fact, two of the limitations of the current ESA infrastructure that Wagner [2015] considers that can be resolved by changing how the EO community is organized. For this, one must keep in mind that one of the central paradigms of Big Data is to ”bring the software to the data” rather than vice versa. The first implication of this is that we will see the emergence of an increasing number of Big Data infrastructures in the coming years, covering different infrastructures such as individual big data centres to distributed cloud computing environments [Wagner, 2015]. The infrastructures must be able to attract a large EO data user community, making the infrastructures ”decentralized nodes” in an increasingly complex EO infrastructure network. These are clear benefits in having decentralized nodes cooperating with each other in terms of data storage. But there are other benefits in having decentralized nodes, such as the cooperation between users who work on the same infrastructure. Such cooperation may involve sharing their experiences when working with certain types of data and software, or to directly share data and code [Wagner, 2015]. Many organizations around the world have already identified the need to move the EO data pro- cessing into the cloud. Consequently, several organizations and initiatives worldwide have already started the uptake of Sentinel data into their existing or planned Big Data infrastructures. In the pri- vate sector, companies like Google1 and Amazon2 already are cooperating by storing Sentinel data in their own infrastructures. Regarding public sector initiatives and focusing only on a national level, European countries, such as, Germany, France, United Kingdom, Norway, Finland and Austria al- ready have their own collaborative infrastructure for storing Sentinel data. Each of these countries has followed its own approach, with different levels of involvement of the private industry. Portugal is also interested in having its own collaborative infrastructure to store all Sentinel data relative to the Portuguese geographic area. In this way, the Portuguese users interested in the data related to Portugal would not only have the data closer to themselves, as a whole new community would be created. Currently, Instituto Portuguesˆ do Mar e da Atmosfera (IPMA) and Direcc¸ao˜ Geral do Territorio´ (DGT), the institutions responsible for the Portuguese collaborative infrastructure, are already involved in setting up the Sentinel ground segment in Portugal.

1.1 Thesis Proposal

The present thesis addresses the creation of a prototype for the Portuguese infrastructure, named IPSentinel, to catalogue, to disseminate, and to process Sentinel EO data for the national community. It should be noted that the infrastructure intends to support large volumes of data, corresponding to

1https://earthengine.google.org/ 2http://sentinel-pds.s3-website.eu-central-1.amazonaws.com/

2 time-series of geographic high-resolution images, and also support data access based on geographic, temporal, and thematic criteria. For this prototype, early on it was decided to adopt the DHuS software provided by ESA, as the base of IPSentinel, to catalogue and to disseminate the EO data transferred from the ESA Scientific Data Hub, also known as ESA SciHub. The use of the Open Geospatial Consortium (OGC) services was also explored, related to raster data access and processing. Particularly, the Web Coverage Service (WCS), Web Coverage Processing Service (WCPS), and Web Map Service (WMS) stan- dards, implemented by Petascope on top of the RasDaMan array database, were considered. The integration of these services into DHuS allows IPSentinel to provide the ability to process EO data.

1.2 Summary of Contributions

The main contributions of this work are:

1. An extension to DHuS that makes Sentinel products available through WCS and WMS inter- faces, using RasDaMan. Provides flexibility by allowing the creation of new applications that use these interfaces to consume Sentinel products, and since these provide HTTP interfaces, using them is straightforward and does not limit the technology to be used by these applications (can be mobile applications, web-based applications, etc).

2. Technique for the reduction of Internet communication overhead when accessing EO data. Since that with WCS the processing (e.g., data slicing on the temporal or geospatial dimen- sions) can be done directly on the server with just the final result being sent over the wire to the client. With product processing on the server side, clients no longer need large computational requirements. In addition, RasDaMan uses optimization algorithms to reuse previously made requests to respond to new clients (if the search criteria maintained), reducing processing time.

3. Generic procedure for creation of derivative products, which run on the server and do not re- quire installation of specific software, and which guarantees flexibility in the creation of new EO products using only the WCPS language. Also, it provides flexibility in creating new Geographic Information Systems (GIS) applications with features expressed by WCPS language.

1.3 Document Organization

The remainder of this document is organized as follows:

• Chapter 2 describes fundamental concepts about the Sentinel Programme, OGC standards, and raster . This chapter also presents related work, namely software packages that implement the aforementioned standards and raster database technology.

• Chapter 3 presents the architecture and implementation for the IPSentinel prototype.

3 • Chapter 4 presents the experimental evaluation of the prototype implementing the proposed ar- chitecture. Firstly, the chapter presents the prototype requirements compliance. Then, presents measurements regarding the usage of computational resources.

• Finally, Chapter 5 concludes this work with some last thoughts, also leaving some ideas for future work in the area.

4 2 Concepts and Related Work

Contents 2.1 Concepts ...... 6 2.2 Related Work ...... 16 2.3 Summary and Discussion ...... 28

5 This chapter presents, in Section 2.1, fundamental concepts required for understanding the elabo- rated work. In Section 2.2, the chapter presents essential related work for the development of the IPSentinel prototype infrastructure.

2.1 Concepts

This section describes the Sentinel Programme and introduces concepts related to rasters and as- sociated management systems. The section also briefly explains the concept of Open Geospatial Consortium (OGC) standards and describes the Web Coverage Service (WCS), Web Coverage Pro- cessing Service (WCPS), and Web Map Service (WMS) standards used in this thesis.

2.1.1 The Sentinel Programme

The Sentinel Programme1 is the Earth Observation (EO) programme inserted in the largest Coper- nicus Programme2 that the European Union is financing. This programme is coordinated by the European Space Agency (ESA)3, in cooperation with the EU member states and Canada. The programme involves launching a family of Sentinel satellites that are equipped with sensors to remotely capture distinct data types, covering a broad range of applications [Balzter et al., 2015]. Each Sentinel mission has a different purpose, and each of these missions is based on a constellation of two satellites collecting data in parallel, to increase coverage and operational data availability.

Sentinel-1 is a day-and-night radar imaging mission for maritime monitoring, land monitoring, and emergency management. Practical applications are, for example, land subsidence control and soil moisture control [Geudtner et al., 2014; Wagner et al., 2012]. The first Sentinel-1 was launched on 3 April 2014 and the second was launched on 25 April 2016.

Sentinel-2 is a high-resolution optical imaging mission for land monitoring, aiming to provide, for example, imagery of vegetation, soil and urban areas [Drusch et al., 2012]. The first Sentinel-2 was launched on 22 June 2015 and the second launch is scheduled for April 2017.

Sentinel-3 is a multi-instrument mission to measure sea-surface topography, sea and land-surface temperature, ocean colour and land colour with very high level of availability, accuracy and reliability. The mission will support, for example, aerosols characterisation and climate studies. The first Sentinel-3 was launched on 16 February 2016 and the second launch is scheduled for 2017.

Sentinel-4 is dedicated to air-quality in near real-time applications, air-quality pro- tocol monitoring and climate protocol monitoring over Europe. Sentinel-4 will be launched in 2020.

Sentinel-5 Precursor is a satellite mission planned to launch in 2017 to reduce data gaps between Envisat and Sentinel-5. This mission is dedicated to atmo- spheric measurements, with high spatio-temporal resolution, relating to air quality, climate, ozone and UV radiation monitoring. 1https://sentinels.copernicus.eu 2http://www.copernicus.eu/ 3http://www.esa.int

6 Sentinel-5 aims to provide a wide global coverage data to monitor air quality around the world. It is expected be launched in 2020.

Sentinel-6 carries a radar altimeter to measure global sea-surface height and to complement ocean information from Sentinel-3. The date of launch is not yet known.

2.1.2 Computational Storage of EO Rasters

Raster data, also known as bitmaps, are images that contain a description of each pixel, as opposed to vector graphics which use points, lines, curves and polygons to encode the information. Raster data can be stored, compressed or uncompressed, in image files with varying formats, such as PNG or TIFF. In the context of EO applications and geographical information systems in general, the rasters are also georeferenced, in the sense that each pixel is known to be associated to a particular geo- graphical region. Some of the most popular raster formats for encoding EO data are GeoTIFF, JPEG [Christopoulos et al., 2000], netCDF [Rew and Davis, 1990], HDF44 and HDF55. The rest of this section will focus on the GeoTIFF standard. The GeoTIFF format [Ritter et al., 2000] can been seen as an extension of the TIFF 6.0 file format, which uses a set of reserved TIFF tags to store georeferencing information. The additional informa- tion includes map projection, coordinate systems, ellipsoids, datums, and other relevant information necessary to establish the exact spatial reference in the file. In particular, the GeoTIFF approach accepts three different coordinate systems:

– Raster space, referencing the pixels by (column, row);

– Model space of the projection, typically (easting, northing);

– World space, 3-D (X,Y,Z).

As described by Ritter and Ruth [1997], raster spaces are continuous planar spaces in which pixel values are visually possible. In the GeoTIFF framework, one can choose one of two pixel coordinate systems depending upon the nature of the data (see Figure 2.1). RasterPixelIsArea defines that a pixel represents an area in the real world, while RasterPixelIsPoint defines a pixel to represent a point in the real world.

4https://support.hdfgroup.org/products/hdf4/ 5https://support.hdfgroup.org/HDF5/

Figure 2.1: The two GeoTIFF raster coordinate systems [Ritter and Ruth, 1997].

7 The world space is based on the 3-D coordinate space on which the Earth is embedded, with origin at the centre of mass of the Earth. Model space represents the projection of the spherical shape of the Earth’s surface to a flat two-dimensional plane. Figure 2.2 shows graphically the relations between the three spaces. GeoTIFF uses a meta-tag (GeoKey) approach to encode dozens of information elements into just 6 tags [Ritter et al., 2000], which can be defined in reference to the concepts of georeferencing (i.e., the relationship between raster space and model space) and geocoding (i.e., the relationship between model space and world space). In terms of geocoding, the GeoTIFF GeoKeyDirectoryTag element uses a tag-indexing scheme to point to tag values stored in GeoDoubleParamsTag and GeoAsciiParamsTag, which in turn contain the cartographic parameters that define the geometry of the raster image. The georeferencing informa- tion is, in turn, defined by various combinations of the ModelTiepointTag, ModelPixelScaleTag and ModelTransformationTag tags. The ModelTiePointTag tag stores raster-to-model tiepoint pairs in the order (I, J, K, X, Y, Z), where (I, J, K) is the point at the location (I, J) in raster space with pixel-value K, and (X, Y, Z) is a vector in model space. The ModelPixelScaleTag tag consists of three real values in the double format, in the form (ScaleX, ScaleY, ScaleZ), where ScaleX and Scale Y give the horizontal and vertical spacing of the raster pixels. The ScaleZ is primarily used to map the pixel value of a digital elevation model into the correct Z-scale, and so for most other purposes this value should be zero. The ModelTransformationTag tag is used to define the exact linear (affine) transformations be- tween raster and model space. The ModelTransformationTag tag may be used to specify the 3-D transformation matrix and offset between the raster space (and its dependent pixel-value space) and the (possibly 3-D) model space. For a better understanding, the listing in Figure 2.3 shows us a typical GeoTIFF set of metadata elements for a georeferenced image. The Keyed Information section provides all of the geocoding parameters for the image. The input image is assumed to be already ortho-corrected to the Universal Transverse Mercator (UTM) projection [Langley, 1998]. Since UTM is based on a projection, rather than a geographic coordinate system, it was specified that the ModelType is ModelTypeProjected. The ProjectedCSTypeGeoKey code named PCS NAD27 UTM zone 17N includes all references to

Figure 2.2: The GeoTIFF raster, map, and world space [Ritter and Ruth, 1997].

8 Figure 2.3: Example of GeoTIFF metadata parameter set [Ritter and Ruth, 1997]. datums, ellipsoids, units, cooordinate transpormation parameters, and prime meridians necessary to exactly define the model-to-Earth transformation. The ModelTiePointTag tag shows that the pixel at location 0,0 has easting and northing values of 543,994 m and 4,187,280m, respectively. To Raster- Type was assigned RasterPixelArea, although it might be more precise to say that the upper-left cor- ner of the grid-cell associated with the first pixels is at that location. To finish, ModelPixelSacaleTag shows that pixel-cell dimension is 10 m x 10 m (in Model space) with default to 0 for Z. As we can see, all this information, together with the tiepoint, completely determines the relation- ship between pixel locations and easting-northings, which is to say raster to model space. Since the model to world space relationship is determined by the projected coordinate system, the geographic definition of this raster file is complete [Ritter and Ruth, 1997].

2.1.3 Array Database Managing Services

An array Database Management System (DBMS) stores and manages arrays, also called raster data, time-series of rasters, or multidimensional discrete data (MDD) [Baumann, 1994]. In Geographic Infor- mation Systems (GIS) and EO applications, the nature of raster image data is often multidimensional: it can include 3-D image time series (x/y/t), 3-D exploration data (x/y/z), and 4-D climate models (x/y/z/t) [Gutierrez´ and Baumann, 2008]. Array databases are based on an array algebra model, and are designed to provide flexible and scalable storage, data retrieval, and data manipulation over large volumes, through a declarative query language similar to SQL. Pratical implementations of array alge- bra include AML [Marathe and Salem, 2002], RAM [van Ballegooij, 2004] and RasDaMan [Baumann, 1999]. Typically, these systems decompose multidimensional arrays into sub-arrays which form the unit of access, which are efficiently indexed and stored in a database. Regular decomposing, where multidimensional arrays are subdivided into disjoint sub-arrays with the same size, is referred to as chunking. A generalization where sub-arrays do not have to be the same size is tilling [Gutierrez´ and Baumann, 2008].

9 Figure 2.4: Constituents of an array.

2.1.3.A Array Algebra

Formally, an array A is given by a total function A : X → V where X, the domain, is a d-dimensional integer interval for some d > 0, and where V is a non-empty set referred to as the range. This function can also be written as A(x) = v for x ∈ X, v ∈ V . The x ∈ X are referred to as cells, and A(x) gives the values associated to the cell x. In a typical array algebra there are two distinct operation categories, namely the m-interval (for multi-dimensional interval) and the array operations [Baumann and Holsten, 2012]. The m-interval operations are functions that act on the domain of an array, in which some of them d are simply probing functions. Let a domain X = [l1 : h1, ..., ld : hd] ⊆ Z be given for some two vectors l and h of some dimension d > 0. Let also D and N denote, respectively, the set of all domains and the set of non-negative integers, and let Z denote the set spanned by two vectors l and h, representing the lower and upper diagonal corner points of an axis-parallel hypercube. Then,

– dim : D → N, dim(X) = d is the dimension of X.

– lo : D → Z, lo(X) = (l1, ..., ld) denotes the low bound corner of X.

– hi : D → Z, hi(X) = (h1, ..., hd) denotes the high bound corner of X.

Qd – card : D → N, card(X) = i=1(hii(X) − loi(X) + 1) is the extent of X.

The extraction of a sub-array is done through subsetting operations, that can be subdivided into trimming and slicing. Trimming extracts some subinterval from an m-interval, maintaining its dimen- sion. Slicing cuts out a hyperplane, consequently reducing the array dimensionality by 1. Let X be an m-interval of dimension d > 0, spanned by d-dimensional vectors l and h. For some integer i with 1 ≤ i ≤ d and a one-dimensional interval I = [m : n] with li ≤ m ≤ n ≤ hi the trim of X to I in dimension i is defined as follows:

trim(X, i, I) := [l1 : h1, ..., m : n, ..., ld : hd] = {x ∈ X : m ≤ xi ≤ n}

For some m-interval X as above, an integer i with 1 ≤ i ≤ d and an integer s with loi(X) ≤ s ≤ hii(X), the slice of X at position s in dimension i is given by

slice(X, i, s)

10 := [lo1(X) : hi1(X), ..., loi−1(X) : hi1−1(X), loi+1(X) : hii+1(X), ..., lod(X) : hid(X)]

d−1 = {x ∈ Z |x = (x1, ..., xi−1; xi+1, ..., xd), (x1, ..., xi−1s, xi+1, ..., xd) ∈ X}

The array operations, are constructor functions of arrays that constitute the core of algebra. These operators are defined by means of probing functions derived from the m-interval probing functions. Let A be a V -valued array over domain X. Then,

– dom : V x → D, dom(A) = X denotes the domain of A.

– dim : V x → N, dim(A) = dim(dom(X)) is the dimension of A.

An array constructor, is used to create an array and initialize its cell values through the evaluation of some given expression for all cells. The allowed expressions can be classified into cell-type and index operations. Cell-type operations result in assigning a value to a particular cell. Index operations are those whose result is used for indexing an array. These operations always return integer values. The condense operator, turns an array to a scalar value by combining the array cell values using some aggregating function. The last core operator is the array sorter operator, which along a selected dimension proceeds with the hyperslices reordering. It does so through some expression of order generation that allows one to rank the slices. At the end of the process, the sorted array has the same dimensionality and extent as the original one. An example to demonstrate the practical use of some operators can be the application of a filter kernel, used in EO data for edge detection. The edge detection is used in risk management in situations such as oil slicks, hurricanes, floods, and volcanic eruptions. The filter kernel consists in a quadratic matrix which iterates over an image to determine a new value through the combination of each old value and its neighborhood values. In the kernel matrix each value represents a weight factor for each pixel in the neighborhood which is applied before being added all values. In array algebra, the application of the kernel K on image A can be expressed as

MARRAY(dom(A), x, COND(+, dom(K), y, A(x + y) ∗ K(y))), where MARRAY() is a constructor, and where COND() is the condense operator receiving the follow- ing parameters in the following order: a commutative and associative operation, a free identifier, a m-interval of the array, and an expression possibly containing occurrences of the array and the free identifier.

2.1.4 Open Geospatial Consortium Standards

The Open Geospatial Consortium (OGC) is an international industry consortium which has more than 520 commercial, governmental, nonprofit and research organizations, collaborating to make quality open standards for the global geospatial community [Lupp, 2008]. The standards are submitted to a consensus process and then they are freely available to the community, to help share geospatial data.

11 Figure 2.5: Examples of 2D rectified (1 and 2) and referenceable (3 and 4) grids.

Several of these standards are relevant in the context of infrastructures dedicated to the storage and processing of EO data. EO data are often stored as rasters and the previous section already described the GeoTIFF standard, which has been adopted by the OGC. There are nonetheless several other OGC standards and recommendations relevant in the context of raster data processing. The Geography Markup Language (GML) [Portele, 2007] is the XML grammar created by the OGC to serve as a core modeling language for GIS, as well as a format for geographical transactions across the web. In the context of GML and other OGC standards, the term coverage comes from the data repre- sentation that assigns values directly to spatial positions, typical of raster approaches to GIS. In other words a coverage is a function from a spatial, temporal or spatiotemporal domain to a range [OGC, 2006]. Formally, the GML Application Scheme for Coverages (GMLCOV6) standard [Baumann, 2010] is an extension of the basic GML coverage primitive, which contains the following constituents:

– the coverage domain: the extent where valid values are available.

– the rangeSet: the set of values (i.e., pixels or voxels) which compose the coverage, in conjunc- tion with their locations.

– the rangeType: a type definition of the range set values.

– the metadata: a slot where any kind of metadata can be added for extensions or application profiles.

Rasters can be represented mainly by the following coverage models, illustrated on Figure 2.5.

RectifiedGridCoverage: Coverage whose geometry is represented by a rectified grid. A grid is (geo)rectified when the transformation between grid coordinates and the external Coordinate Reference Systems (CRS) is affine; we also call them rectilinear aligned, or rectilinear non- aligned grids [Portele, 2007]. A transformation is affine if it preserves the collinearity between points and the ratios of distances along a line.

ReferenceableGridCoverage: Coverage whose geometry is represented by a referenceable grid. A grid is (geo)referenceable when there is a non-affine transformation between the grid coor-

6http://www.opengis.net/doc/GML/GMLCOV/1.0.1

12 dinates and the external CRS. This can be the case of rectilinear irregularly-spaced grids, or curvilinear (“warped”) grids [Portele, 2012].

Also in the context of OGC standards, coverages can be accessed through the WCS suite of services and the WCPS coverage query language.

2.1.4.A OGC Web Coverage Service

The Web Coverage Service (WCS)7 is a service interface, based in the client/server architecture, which provides access to raster sources of geospatial images in forms that are useful for client-side rendering, as input into scientific models, and for other clients. The access is made through a server request, for instance in the form of a Uniform Resource Locator (URL). The WCS specification offers capabilities to extract portions of a coverage, as well as more complex and precise querying [Bau- mann, 2012b]. Furthermore, a WCS can return valuable metadata that allows deep analysis, and also supports many export formats. WCS uses the aforementioned coverage model of the GML Ap- plication Schema for Coverages [OGC, 2006], which has been developed to facilitate the interchange of coverages between OGC services. WCS implementations typically also support the GeoTIFF and netCDF formats.

The WCS standard supports three kinds of operations that a WCS client can invoke, namely:

GetCapabilities: allows a client to request information about the server’s capabilities, as well as valid WCS operations and parameters.

DescribeCoverage: allows a client to request a full description of a coverage in particular.

GetCoverage: allows a client to request a coverage comprised of a selected range properties at a selected set of spatio-temporal locations, in a chosen format.

During a sequence of WCS requests, a client should first issue a GetCapabilities request to the server, to obtain an up-to date listing of available data. Then, it may issue a DescribeCoverage request to find out more details about particular coverages being offered. In order to retrieve a coverage, or a part of a coverage, a client must then issue a GetCoverage request. A GetCapabilities request consists of a GetCapabilities structure with the service component assigned the value of WCS. The allowed parameter list on a GetCapabilities request is described in Table 2.1. The GetCapabilities response consists of an XML document with a service metadata section and an optional contents section. Service metadata provides service details and information about the concrete service capabilities of the WCS service. The contents section provides details about the coverages offered by the service [Baumann, 2012b]. A DescribeCoverage request accepts the parameters described on the Table 2.2 and provides a list of coverage identifiers. It prompts the server to return, for each identifier, a description of the

7http://www.opengeospatial.org/standards/wcs

13 corresponding coverage. A WCS deployment typically offers a set of coverage objects, that can be empty. The response to a successful DescribeCoverage request contains a list of coverage metadata, one for each coverage identifier passed in the request. Finally, the GetCoverage request prompts a WCS service to process a particular coverage se- lected from the service’s offerings and return a derived coverage [Baumann, 2012a]. The WCS Core standard defines the domain subsetting operation which delivers all data from a coverage inside a specified request bounding box. Domain subsetting is subdivided into trimming and slicing. The parameters involved in this request are listed on Table 2.3.

2.1.4.B OGC Web Coverage Processing Service

The Web Coverage Processing Service (WCPS)8 specification defines a language for the extraction, processing, and analysis of multi-dimensional raster coverages. While WCS focuses on simple data access operations, the WCPS query language makes it possible to do more powerful queries. The WCPS is an expression language similar to XQuery, which is formed by primitives plus nesting capabilities, and which is independent from any particular request and response encoding, since there is no concrete request/response protocol specified by WCPS [Baumann, 2009a]. This means that it is possible to embed WCPS into different service frameworks. One example of one such framework is OGC WCS. Resorting to the definition of request type [Baumann, 2009b], WCPS forms an extension of the WCS specification. WCPS clients can request the processing of one or more coverages by sending a query string. This query string can be expressed in either Abstract Syntax or XML. When the WCPS service re- ceives a request, it evaluates the expression contained in the request and returns an appropriate response to the client. The evaluation procedure can be seen as a nested finite loop, because the language is safe in evaluation, i.e., every request must terminate after a finite number of steps. The language allows one to express algorithms like classification, filter kernels and general convolutions, histograms, and discrete Fourier transformation. The query syntax is best explained by means of an example, as shown next: for c in ( S1B EW GRDM 1SDV 20161005T071556 002368 00400A 4B0D VV) return encode (c[Lat(37.2773492196), Long( −18.0532750027)], ”csv”)

The above example consists in inspecting a Sentinel 1 VV band coverage, and returning the point value represented by the coordinate pair (37.2773492196,-18.0532750027) encoded in the CSV format. The response of a successful request is an ordered sequence of one or more coverages or

8http://www.opengeospatial.org/standards/wcps

Table 2.1: The parameters of a GetCapabilities request.

Request parameter Mandatory/Optional Description VERSION=version O Request version. SERVICE=WCS M Service type. REQUEST=GetCapabilities M Request name.

14 Table 2.2: The parameters of a DescribeCoverage request.

Request parameter Mandatory/Optional Description VERSION=version O Request version. SERVICE=WCS M Service type. REQUEST=DescribeCoverage M Request name. EXTENSION=extension O Any ancillary information to be sent from client to server COVERAGEID=NCName M Coverage identifier.

Table 2.3: The parameters of a GetCoverage request.

Request parameter Mandatory/Optional Description VERSION=version O Request version. SERVICE=WCS M Service type. REQUEST=DescribeCoverage M Request name. EXTENSION=extension O Any ancillary information to be sent from client to server. COVERAGEID=NCName M Coverage identifier. FORMAT=MIME type O Output format of service meta- data. MEDIATYPE=anyURI O Enforces a multipart encoding. SUBSET=DimeSubset O Subsetting specifications, one per subsetting dimension. scalar values. The previous example returns a scalar.

2.1.4.C OGC Web Map Service

The Web Map Service (WMS)9 standard provides an HTTP interface to request georeferenced data as images from one or more distributed geospatial databases. This norm standardizes the way that maps are requested by clients and the way that servers describe the geodata that they are holding [de la Beaujardiere, 2006]. What distinguishes the WCS from the WMS specification is the fact that WMS just returns images, typically PNG files, and there is no way to get any metadata. The geographic information held by WMS is classified by layers, which can be displayed through predefined styles. A WMS supports the following main operations, among optional others:

GetCapabilities: allows a client to request information about the server’s capabilities, and as well as valid WMS operations and parameters.

9http://www.opengeospatial.org/standards/wms

Table 2.4: The parameters of a GetCapabilities request.

Request parameter Mandatory/Optional Description VERSION=version O Request version. SERVICE=WMS M Service type. REQUEST=GetCapabilities M Request name. FORMAT=MIME type O Output format of service meta- data. UPDATESEQUENCE=string O Sequence number or string for cache control.

15 Table 2.5: The parameters of a GetMap request.

Request parameter Mandatory/Optional Description VERSION=1.3.0 M Request version. REQUEST=GetMap M Request name. LAYERS=layer list M Comma-separated list of one or more map layers. STYLES=style list M Comma-separated list of one rendering style per requested- layer. CRS=namespace:identifier M Coordinate reference system. BBOX=minx,miny,maxx,maxy M Bounding box corners (lower left, upper right) in CRS units. WIDTH=output width M Width in pixels of map picture. HEIGHT=output height M Height in pixels of map picture. FORMAT=output format M Output format of map. TRANSPARENT=TRUE—FALSE O Background transparency of map (default=FALSE). BGCOLOR=color value O Hexadecimal red-green-blue colour value for the background color (default=0xFFFFFF). EXCEPTIONS=exception format O The format in which exceptions are to be reported by the WMS (default=XML). TIME=yyyy-MM-ddThh:mm:ss.SSSZ O Time/Date value of layer de- sired. ELEVATION=elevation O Elevation of layer desired.

GetMap: allows a client to request a map image for a specified area and content.

To issue a GetCapabilities request the client has to assign the SERVICE parameter to the value WMS. The GetCapabilities response consists of an XML document containing service metadata. Table 2.4 summarizes all the parameters allowed in a GetCapabilities request. A GetMap request accepts the parameters listed on Table 2.5. The response of a valid GetMap request is a map of the spatially referenced information layer requested, in the ordered style, with the specified coordinate reference system, bounding box, size, format, and transparency level [de la Beaujardiere, 2006].

2.2 Related Work

This section describes the ESA Data Hub Service (DHuS) and its components, also presenting an array Database Management System called RasDaMan, that implements the WCS, WCPS and WMS standards. Then, the section also describes SciDB and MonetDB/SciQL, two more array Database Management Systems.

16 Figure 2.6: Dissemination of EO data from a centralised Data Hub.

2.2.1 The ESA Data Hub Service

The DHuS10 is an open source11, GPLv3 licensed, web based system that establishes the connection between the core ground segment and the users interested in accessing EO data from the Sentinel programme. The ground segment is the infrastructure that contains all EO data, according to different timelines, ranging from near real-time to non time-critical, and available typically within 3-24 hours of being sensed by the satellite. Figure 2.6 illustrates the dissemination of EO data from a centralised Data Hub, where some users may be potentially replicated Data Hubs. EO data can be seen as different products. These are categorized by the following levels: raw data is a Level-0 product, processed data is a Level-1 product, and derived data is a Level-2 product. The products contained in these categories are varied and depend on the type of satellite that captured the data. This means that one category can have different types of products, unrelated between themselves. At present, only Sentinel-1 and Sentinel-2 products are available. Sentinel-1 Level-0 products consist of the sequence of Flexible Dynamic Block Adaptive Quantization (FDBAQ) compressed and unfocused Synthetic Aperture Radar (SAR) raw data and they include noise, internal calibration and echo source packets, as well as orbit and attitude information. Sentinel-1 Level-1 products are focused SAR data decompressed and processed from Level-0. Sentinel-1 Level-1 products are produced as Single Look Complex (SLC) and Ground Range Detected (GRD). SLC products consist of focused SAR data geo-referenced using orbit and attitude data from the satellite. GRD products consist of focused SAR data that has been detected, multi-looked and projected to ground range using an Earth ellipsoid model. Sentinel-1 Level-2 products consist of geolocated geophysical products derived from Level-1. The Level-2 Ocean (OCN) product is the only product derived from SAR data available by the ESA. For Sentinel-2, products available for users are just Level-1B, Level-1C and Level-2A. The Level- 1B product provides radiometrically corrected imagery in Top-Of-Atmosphere (TOA) radiance values.

10https://scihub.copernicus.eu/dhus 11The system can be downloaded in https://github.com/SentinelDataHub/DataHubSystem

17 Figure 2.7: Related view on the DHuS architecture centered around the Spring Framework [Fred´ eric´ Pidancier and Mbaye, 2014].

The Level-1C product consists of 100 km2 tiles in UTM/WGS84 projection with radiometric measure- ment in each pixel. The Level-2A product provides Bottom Of Atmosphere (BOA) reflectance images derived from the Level-1C products, using the Sentinel-2 Toolbox. Different products can be stored in a preconfigured disk space managed by the DHuS Data Store mechanism. Thanks to this mechanism, an incoming directory path is automatically managed, al- lowing different products to be added. For each product, an archive entry is registered in the DHuS database to define the data store parameters, total space and expected maximum margin before evic- tion. An eviction mechanism is periodically dispatched by a scheduler, to clean-up the old data and to keep enough disk space to upload newer data. The DHuS also offers upload features to authorized users. The upload is possible via the FTP/S protocol (i.e., using a file scanner) or via the graphical user interface (GUI). The file scanner recur- sively scans the remote directory to retrieve known product types that are upload candidates. The file scanner also supports a finer selection of products to upload with regular expressions, and it uses the DHuS database records to prevent reloading data already in the data store. The DHuS system reuses some third party components. The diagram in Figure 2.7 shows logical relations between these components. The DHuS architecture is based on the Spring Framework12 that manages third party libraries. This module is responsible for injecting and wiring DHuS Services

12http://projects.spring.io/spring-framework/

18 in a consistent way from startup to shutdown. The Business API is an internal DHuS API in charge of DHuS functionalities. This API manages products organized in collections within an archive and accessed by users. Spring embeds a set of useful features that facilitates the implementation:

– Model/view/Controler (MVC) embedded design pattern;

– Spring security that manages a security stack for authentication and access rights;

– Schedulers that manage periodical execution;

– Thread pools that manage multithreaded executions;

– Application life-cycle management and event handling;

– Injection, annotations and auto-wiring facilities;

– Easy configuration via an XML file.

To keep product metadata persistent, DHuS embeds an HyperSQL Database (HSQLDB)13 and uses the Hibernate14 object relational mapping, and Data Access Objects (DAOs) to manage persis- tence. The Hibernate layer enables a smooth synchronization of Java classes and the database, with almost no SQL programming. The Tomcat application server15 is used to manage the following client interfaces:

• The DHuS GUI is a Java Web application Archive (WAR) accessible via HTTP.The service is de- veloped in the HTML5, CSS3, and JavaScript technologies, and it uses the OpenLayers16 library to display the products on a geo Web Map. This service can only be accessed by authenticated users. Figure 2.8 provides an illustration for the DHuS GUI.

• The Open Data Protocol (OData)17 interface is implemented in order to manage batch interac- tions with the DHuS system according to the OData protocol. DHuS adds to the general OData specification interfaces dedicated to interacting and browsing online products. Through OData, authentication and downloading of products is possible, via a browser or through command line tools like wget18.

• The Solr interface is built on top of Solr19, an open source platform from the Apache Lucene project. Solr provides powerful full-text search, faceted search, dynamic clustering, database integration, and rich document handling. DHuS filters the user searches and uses Solr to re- trieve results ordered by relevance. DHuS also interacts with a gazetteer (nominatim20) to check and retrieve the bounding box that corresponds to a given place name, if a user request contains

13http://hsqldb.org/ 14http://hibernate.org/ 15http://tomcat.apache.org/ 16http://openlayers.org/ 17http://odata.org 18https://gnu.org/software/wget/ 19https://lucene.apache.org/solr/5_3_1/ 20http://open.mapquestapi.com/nominatim/

19 Figure 2.8: The DHuS graphical user interface.

geographical terms. The Solr API also provides a basic implementation of an OpenSearch21 in- terface, which is a collection of technologies that allow publishing of search results in a standard and accessible format.

The GeoTools22 library is used in the Solr component to manage geolocation intersection com- putations. GeoTools is an open source Java library which provides methods for the manipulation of geospatial data. The GeoTools library data structures are based on OGC specifications. Finally, the Data Request Broker (DRB23) is a critical element of the DHuS service to support a priori unlimited number of data types. DRB has an engine to extract information for feeding the search index, to provide metadata, geographical footprints in GML, to export XML fragments, or to extract series for further plotting. The results of these extractions are indexed into Solr indexes to increase the search capability. The Figure 2.9 resumes the DHuS architecture. DHuS can handle virtually any data type, even those not usually found in EO dissemination sys- tems, mostly due to the DRB API that acts as an abstract layer that rids the system of any concept tied to the EO domain. Since the DHuS offers a selection of EO products through full text search, no specific knowledge about the data types, the acquisition platform, or sensors is required. Moreover, it is possible to look for data based on geographical, temporal, and on thematic criteria. In its current version, DHuS does not support the extraction of portions of a product or processing on the data, and we believe that this constitutes an important limitation. This limitation requires users to transfer large volumes of data, which take some time to transfer, and also requires users to have large computational resources to process the transferred data.

21http://opensearch.org 22http://geotools.org/ 23http://gael.fr/drb/

20 Figure 2.9: The DHuS service architecture [Fred´ eric´ Pidancier and Mbaye, 2014].

2.2.2 The RasDaMan Array Database Management System

RasDaMan (Raster Data Manager) is an Array DBMS with capabilities for the storage and retrieval of massive multi-dimensional arrays. Data management in RasDaMan can be made through an SQL-style query language with two parts: the RasDaMan array definition language (RasDL) and the RasDaMan query language (RasQL). Moreover, RasDaMan also supports OGC standards, such GeoTIFF, WCS, WCPS and WMS. The RasDaMan implementation of an Array DBMS employs a middleware architecture where mul- tidimensional arrays are decomposed into arbitrary smaller units called tiles, which are stored in BLOBs inside a relational database, such as PostgreSQL or SQLite. Thanks to a spatial index, one can efficiently determine the tiles affected by a query, and transfer only a sub-set of large multi- dimensional data from the database. Query processing relies on tile streaming: physical query operators follow the open-next-close (ONC) protocol for reading their inputs tile by tile, and likewise they deliver their results in units of tiles [Baumann and Holsten, 2012]. Thus, the RasDaMan presents a scalable architecture for the processing of data volumes exceeding the main memory in the supporting hardware infrastructure. On the logical level, RasDaMan applies a specialized rewriting heuristic based on about 150 alge- braic transformation rules derived from multidimensional discrete data (MDD) operations, relational operations, and their combinations to construct optimized expressions [Baumann et al., 1999]. Some of the operations are trimming (rectangular cutout), section (extraction of a lower-dimensional hyper- plane), induced operations which apply cell operations simultaneously to the whole array, generalized array aggregation of some cells and all cells, and format converters to accept and deliver arrays [Baumann et al., 1998]. RasDaMan also allows array creation with arbitrary dimensions over primitive

21 Table 2.6: Examples of operations supported by RasQL

selection & sectioning select c [ ∗ : ∗ , 100:200 , ∗ : ∗ , 42 ] from ClimateSimulations as c

result processing select img ∗ ( img . green > 130) from LandsatArchive as img select mri search & aggregation from MRI as img , masks as am where some cells ( mri > 250 and m )

data format conversion select png ( c [ ∗ : ∗ , ∗ : ∗ , 100 , 42 ] ) from ClimateSimulations as c and user-defined cell types. The RasDaMan manipulation language provides declarative query functionality over collections (i.e., bags of arrays) of MDD stored in a database. The query structure is as follows: select r e s u l t L i s t from collName [ as collIterator ] [, collName [ as collIterator ] ] ... [ where booleanExp ]

In the from clause one specifies the working collection on which all evaluation will take place. In the where clause a condition is phrased, and in the select clause, elements in the query result set are post-processed. To consolidate the ideas around the supported operations, some examples are given on Table 2.6. Based on the client/server architecture, the rasnet communication protocol connects clients to the DBMS server. Incoming queries are dispatched among the RasDaMan server processes that are running. Each server process receives queries and parses, optimizes, and executes them. As seen in Subsection 2.1.2, arrays require metadata to map them to the world of real georefer- enced phenomena. Thus, the RasDaMan suite has the following components which are indispensable to the handling of the geospatial interface of RasDaMan arrays: wcst import: used to facilitate the ingestion of georeferenced rasters into RasDaMan, possibly stacked to compose 3-D spatial or spatio-temporal cubes, by expressing the intent through a recipe.

Petascope: a Java Web Application that implements OGC services, such WCS, WCPS and WMS. It relies on its own database of metadata and exposes RasDaMan array data to the outside world.

SECORE: the official OGC resolver for Coordinate Reference Systems (CRS24). Petascope trusts on this component to resolve CRS URLs into full CRS definitions, on which the coverages are defined.

The wcst import tool, used for image ingestion, handles synchronously the ingestion/update of both multidimensional arrays (also known as marrays) in RasDaMan, and (geo)metadata in Petas- cope. The argument passed to this tool is a JSON file called recipe. This recipe is composed of

24http://opengis.net/def/crs/

22 Figure 2.10: RasDaMan infrastructure and its Petascope and SECORE components for exposing arrays through OGC web services. several ingredients that together constitute the conditions under which the coverage will be created and ingested. As shown in Figure 2.10, RasDaMan also supports a distributed family of SECORE resolvers over the Internet, where each one is free to offer some CRSs definitions. Petascope is the component of RasDaMan that implements the OGC standard interfaces, namely WCS (and its extensions), WCPS and WMS. In order to maintain additional metadata (such as georef- erencing information), Petascope uses a separate relational database. Petascope is implemented as a WAR file of servlets, which support access to the coverages stored in RasDaMan. Internally, when it receives requests for a coverage evaluation, they are translated into RasQL. Next, these queries are passed to RasDaMan to be processed. Finally, results returned from RasDaMan are forwarded to the client. Petascope currently supports aligned grids and irregular aligned grids, whose geometries are, respectively, a subtype of rectified grid and a subtype of referenceable grid, already explained in Subsection 2.1.4.

2.2.3 The SciDB Array Database Management System

SciDB [Stonebraker et al., 2011] is an array DBMS built from scratch for storage, processing and analysis very large (petabyte) scale array data from scientific applications, such as astronomy, re- mote sensing and climate modeling, bio-science information management, as well as commercial applications such as risk management systems in the financial services sector, and the analysis of web log data. In brief, SciDB is built to support an array data model, having a query language with the possibility of extension with new scalar data types and array operators by the users. The system was design as a massively parallel storage manager that is able to parallelize large scale array processing algorithms. So far, there is no implementation of OGC services supported by the SciDB creators, Paradigm4. However, it is possible to find on github25 some prototype implementations of WMS and WCS inter- faces created by the SciDB community.

25https://github.com/e-sensing/tws and https://github.com/appelmar/scidb-wcs

23 Table 2.7: Two dimensional SciDB array.

J\I [0] [1] [2] [3] [4] [0] (2, 0.25) (4, 0.5) (3, 0.4) (5, 0.5) (4, 0.5) [1] (3, 0.4) (3, 0.3) (1, 0.3) (3, 0.75) (1, 0.6) [2] (2, 0.3) (2, 0.1) (1, 0.2) (4, 0.1) (3, 0.4) [3] (3, 0.3) (2, 0.2) (7, 0.1) (2, 0.15) (3, 0.7) [4] (6, 0.15) (5, 0.4) (2, 0.35) (3, 0.2) (0, 0.3)

SciDB databases are organized as collections of n-dimensional arrays. Each cell in the SciDB array contains an arbitrary number of attributes of any of the expected numerical, variable length string or user-defined data types. The individual attribute values are associated to a distinguishing attribute name. The arrays are uniform in which all cells in a given array have the same collection of values. Hence, to create an array in SciDB one uses the following instruction: CREATE ARRAY Example ( A:: INTEGER,B:: FLOAT ) [ I=0:4, J=0:4];

The instruction shows the creation of an array with attributes M and N along with dimensions I and J. Table 2.7 illustrates how the SciDB array resulting from the above instruction might look like. SciDB supports both functional and SQL-like query language. The functional language is called AFL for array functional language, and the SQL-like language is called AQL [Lim et al., 2013] for array query languages. AQL is compiled into AFL. All of the operators in SciDB algebra are composable, i.e., operators can be combined. For example, if A and B are arrays with dimensions I and J, and c is an attribute of A, then the following expression would be legal: temp = filter (A, c = value) result = join (B, temp: I, J)

Alternatively, one can have the composite expression: result = join (B, filter (A, c = value), I, J) For users more comfortable with SQL, SciDB supports the AQL, which looks as much like SQL as possible. Thus, the above example can be expressed as: select ∗ from A,B where A. I = B. I and A. J = B. J and A. c = value

In addition to the typical operations that any array algebra supports, such as, array creation, sub- sampling, and slicing, SciDB supports the transform operation. The transform operation changes the dimension values, and has the following uses cases:

– Bulk changes in dimensions - e.g. push all dimension values up one to make a slot for new data,

– Reshape an array - e.g. change it from 100 by 100 to 1000 by 10,

– Flip dimensions for attributes - e.g. replace dimension I in array A with a dimension made up from d,

– Transform one or more dimensions - e.g. change I and J into polar coordinates.

24 In terms of its overall system architecture SciDB adopts a shared nothing design, wherein a SciDB instance is deployed over a network of computers, where each has its own local storage [Brown, 2010]. Each compute/storage node runs a semi-autonomous instance of a SciDB engine, providing communications, query processing, and a local storage manager. SciDB implements a distributed, no-overwrite storage manager. Consequently, data in a SciDB array is not update-able. New array data can only be appended to the database, or the results of a query can be written back to the storage manager. SciDB decomposes storage into multi-dimensional chunks of equal size, which may overlap. In SciDB chunks are the physical unit of I/O, processing, and inter-node communication. Chunks are fixed (logical) size around of 64 megabytes. Each is stored in a file on disk that can be efficiently addressed. The addressing is done using Postgres as a system catalog repository manager, which stores for each chunk the corresponding location as a range of dimensional indices, containing the logical array. Postgres was elected because it allows to use R-Trees [Guttman, 1984] to quickly identify which chunks contain data relevant to a particular subsample of the array. The segmentation of the arrays into overlapping chunks, and the choice of the right overlap extent, makes SciDB capable of parallelizing operations like nearest neighbor searches, on a collection of computing nodes, with each node performing the same calculation on its data. However, some array operations can only be run in parallel if chunks overlap by a minimum amount. This overlap should be the size of the largest feature that will be searched for, and its specified in the create array command at array creation time. The SciDB storage manager also supports a two level chunk/tile scheme that economizes CPU time by splitting a chunk internally into tiles. In this way, subset queries are able to examine only a portion of a chunk. As scientists never want to throw old data away, SciDB was conceived with version control. All SciDB arrays are versioned. Data is loaded into the array at the time indicated in the loading process. Subsequent updates, inserts or bulk loads add new data at the time they are run, without discarding the previous information. Hence, for a given cell, a query can scan particular versions referenced by timestamp or version number. The previous versions of a given chunk are available as a chain of deltas referenced from the base chunk. In other words, the storage manager stores the chunk as a base plus a chain of backwards deltas. The physical organization of each chunk contains a reserved area, for example 20% additional space, to maintain the delta chain. In order to reduce storage space, all arrays are aggressively compressed with an appropriate compression scheme on a chunk- by-chunk basis. Among those compression schemes are the delta encoding, run-length encoding, subtracting off an average value, and LZ encoding. Lastly, a key requirement for most science data is support for provenance. Therefore, SciDB has the ability to trace the derivation of the data. A common use case is when a scientist needs to trace backwards to find the actual source of the error over a given data that looks wrong. Then, the scientist wants to trace forward to find all data values that are derived from the incorrect one, so they can also be repaired. The amount of space to allocate for provenance data is managed by the database administrators.

25 Figure 2.11: Some SciQL array operations.

2.2.4 Handing Raster data in MonetDB and the TELEIOS Infrastructure

MonetDB is an open source column-oriented DBMS designed to provide storage for large volumes of data represented as tables, and to provide high performance on complex queries. SciQL is an SQL-based query language for science applications that allows MonetDB to effectively function as an array database [Zhang et al., 2013]. Arrays in SciQL are defined around the syntax of TABLE with a few minor additions. One of those additions was the introduction of an attribute tagged with the DIMENSION constraint, which describes its value range. The allowed data type for a dimension can be any of the basic scalar data types, such FLOAT, VARCHAR, and TIMESTAMP. The query below creates the 4 × 4 matrix shown in Figure 2.11(a). CREATE ARRAY matrix ( x INT DIMENSION [ 0 : 1 : 4 ] , y INT DIMENSION [ 0 : 1 : 4 ] , v INT DEFAULT 1 ) ;

Semantically, the difference between a TABLE and an ARRAY lies in the fact that a TABLE denotes a set of tuples, while an ARRAY denotes indexed tuples, also indicated as cells. The switch from an ARRAY to a TABLE is made by ignoring the attribute that stands for the DIMENSION constraint. Thus, the relational algebra that queries an ARRAY is exactly the same which queries a TABLE. For example, the array from the example above becomes a table just using the expression SELECT x, y, v FROM matrix. In regard to the basic array manipulation operations that an array DBMS should offer, SciQL uses the SQL UPDATE, INSERT and DELETE statements to assign a new value to an array cell. UPDATE matrix SET v = CASE WHEN x > y THEN x + y WHEN x < y THEN x − y ELSE 0 END; INSERT INTO matrix SELECT [ x ] , [ y ] , x ∗ y FROM matrix WHERE x = y ; DELETE FROM matrix WHERE x > y ;

The SQL ALTER statement was redesigned to manipulate the array dimensions. ALTER ARRAY matrix ALTER DIMENSION x SET RANGE [ − 1 : 1 : 5 ] ; ALTER ARRAY matrix ALTER DIMENSION y SET RANGE [ − 1 : 1 : 5 ] ;

In the list of operations over arrays supported by MonetDB/SciQL are included cell selection, array slicing, array composition, among others. Using the primitive operations provided by SciQL, users

26 Table 2.8: The matrix stored as three BATs.

x y v void int void int void int 0 0 0 0 0 1 1 0 1 1 1 1 2 0 2 2 2 1 3 0 3 3 3 1 4 1 4 0 4 1 5 1 5 1 5 1 6 1 6 2 6 1 7 1 7 3 7 1 8 2 8 0 8 1 9 2 9 1 9 1 10 2 10 2 10 1 11 2 11 3 11 1 12 3 12 0 12 1 13 3 13 1 13 1 14 3 14 2 14 1 15 3 15 3 15 1 can define new more complex operations. As in RasDaMan and SciDB, large arrays in SciQL are also broken into smaller pieces before being aggregated or overlaid with a structure to calculate [Kersten et al., 2011], e.g. a filter kernel function. For that purpose, SciQL uses a slight variation of the SQL GROUP BY clause semantics to support fine-grained control to break the arrays into tiles. The following query tiles the 4 × 4 array matrix in Figure 2.11(a) with a 2 × 2 matrix: SELECT [ x ] , [ y ] , AVG( v ) FROM matrix GROUP BY matrix[x:x+2][y:y+2] HAVING x MOD 2 = 1 AND y MOD 2 = 1;

Tiling starts with the identification of an anchor point through the array dimensional values (e.g., matrix[x][y]), which is extended with a list of cell denotations relative to the anchor point (e.g., matrix[x+1][y], matrix[x][y+1], and matrix[x+1][y+1]). Undesired groups are filtered by speci- fying constraints in the HAVING clause. Figure 2.11(b) shows the four tiles created and Figure 2.11(c) shows the tiling result. Holes (null values) and cells outside the array dimension ranges are ignored by the aggregation functions. With the adoption of the vertically decomposed storage model for relational tables [Monet, 2002], MonetDB/SciQL stores arrays in BATs, which are physically represented as consecutive C arrays. Each Binary Association Table (BAT) is a table with an object-identifier and value columns, represent- ing a single column in the database. Per array, it is used one BAT for each dimension and one BAT for each non-dimensional attribute. Table 2.8 shows how the matrix created above is stored in BATs. MonetDB is currently being used to support prototype infrastructures for EO data. For instance, TELEIOS26 is an European project that focuses on the development of technology to overcome the need for scalable access to petabytes of EO data, and to support the discovery and exploitation of useful information that is hidden in these data [Koubarakis et al., 2012]. TELEIOS is implemented on

26http://earthobservatory.eu/

27 Figure 2.12: TELEIOS Infrascture Overview. top of the MonetDB and uses a variety of techniques from the areas Scientific Databases, Semantic Web, and Image Information Mining to the management of EO data. As we can observe on Figure 2.12, TELEIOS are organized in four tiers: ingestion tier, database tier, processing service tier, and application tier. In the ingestion tier are the components that perform data ingestion and content extraction. In the database tier are the components that provide access to data, metadata and semantic annotations. The processing service tier consists of rapid mapping services, data mining services, and services for automatic or interactive semantic annotation. Finally, in the application tier reside the application and services that provide domain specific support to the end user community.

2.3 Summary and Discussion

This chapter started by describing relevant concepts to facilitate the understanding of the related work previously developed in the area of Geographic Information Systems (GIS) and which is important in the context of this dissertation. Some of the concepts presented are OGC Web Services standards, such as Web Coverage Service (WCS), Web Coverage Processing Service (WCPS) and Web Map Service (WMS), which are important in facilitating the access to the rasters, stored in a remote server, over the Internet. Then, the chapter presented related work, including the DHuS architecture and its components, and also three of the most popular array DBMS in the literature, namely RasDaMan, SciDB and MonetDB/SciQL. For each array DBMS the chapter describes the array operations sup- ported, the used query language, the array storage management, and the implementations of the OGC Web Services described in the concepts section. The DHuS is a good choice for the base of the IPSentinel prototype infrastructure, since is a system from ESA well designed to manage, catalog and disseminate large volumes of Sentinel Earth Observation (EO) products, through the Internet. The system integrates several ways of making Sentinel products available to users. However it does not allows the direct access to the raster data

28 and does not provide processing over the raster data, for the most experienced users or scientists. This constitutes an important limitation that is addressable with the use of the OGC services, such as WCS and WCPS. What regards the array DBMS, each of the systems that has its individual merits and has sound formal arguments about its query language and its array storage manager. RasDaMan is an array DBMS. However, it is implemented as an application layer that uses Postgres or SQLite for blob storage. Similarly MonetDB has an array layer, implemented on top of its columns store table system. Thus, SciQL simulates arrays on top of a table data model. The performance loss in such a simulation layer may be extreme. In contrast, SciDB is an array DBMS built from scratch oriented toward scientific applications that supports version control, uncertainty and provenance. Both RasDaman and SciDB implement multi-dimension chunking, which avoids unnecessary CPU consumption and consequently improves the response time of a query. The SciQL query language was design to provide a true symbiosis of the relational and array paradigms in compliance with the SQL:2003 standard. However, due the restrictions imposed by a strict declarative language almost the entire processing is done through user-defined functions [Rusu and Cheng, 2013]. This raises serious doubts about the expressiveness of SciQL. Baumann and Holsten [2012] demonstrate that RasQL is highly expressive comparing with Array Query Language (AQL) from the SciDB, and also against a few other alternatives. Still, SciDB also implements the Array Functional Language (AFL) for the users not familiarized with SQL. Finally, taking into account also the available OGC Web Services that each array DBMS has at its disposal, only RasDaMan implements the WCPS beyond the WCS and the WMS. Thus, it becomes clear that RasDaMan is the more complete array DBMS to be integrated in the IPSentinel prototype infrastructure, in order to overcome the limitations identified in the DHuS.

29 30 3 Prototype Infrastructure

Contents 3.1 Overview ...... 32 3.2 System Assumptions and Functional Requirements ...... 33 3.3 Architecture ...... 34 3.4 Implementation ...... 39 3.5 Summary ...... 42

31 This section presents the major design decisions and the rationale for making IPSentinel prototype the way it was made. Firstly, Section 3.1 presents an overview of the infrastructure prototype. Then, Section 3.2 describes system assumptions and requirements. Section 3.3 presents the software architecture. Finally, Section 3.4 describes the details of the implemented solution, and Section 3.5 summarizes the contents of this chapter.

3.1 Overview

The IPSentinel is a Portuguese prototype infrastructure developed for the purpose of supporting the Portuguese community in the storage and access of Earth Observation (EO) data concerning the Portuguese territory. The IPSentinel provides a simple web interface to allow interactive data discovery, processing and download, and multiple Application Programming Interfaces (API) that allows users to access the data via programs, scripts or client applications. The major functionalities of IPSentinel are schematically represented in Figure 3.1, and also de- scribed next.

User Interface

This functionality is in charge of providing the user with an interface for the discovery, processing, and downloading of products and for the visualization of the relevant metadata. It consists of two set of interfaces: a set of Graphical User Interfaces (two web applications) and a set of Application Programming Interfaces (mainly used for machine to machine interactions and for client application

Figure 3.1: Major functionalities of the IPSentinel prototype.

32 development).

Product Harvesting

The product harvester is the service responsible for collecting products from Payload Data Ground Segment data sources (ingestion) or from a DHuS Network (synchronization).

Product Cataloging

The product cataloging is responsible for the products management. The product catalog is man- aged as a rolling archive, with configurable eviction strategies and rules.

Product Search & Dissemination

This functionality is in charge of providing users with the possibility to perform search and dis- semination via standardized API protocols (OData, Opensearch, Web Map Service (WMS) and Web Coverage Service (WCS)) and via the graphical user interface.

Product Processing

This functionality is responsible of providing users the capability of processing the available prod- ucts in the catalog using the standardized Web Coverage Processing Service (WCPS) query lan- guage.

3.2 System Assumptions and Functional Requirements

IPSentinel runs the Data Hub Service (DHuS) in order to provide most of the aforementioned function- alities. RasDaMan was adopted to sustain the functionality of product processing in the IPSentinel. Moreover, the implementations of RasDaMan (Petascope) for WMS and WCS were adopted to extend the dissemination service, and WCPS to support the processing service. In this phase of the prototype only level 1 products were considered for the processing, and we assumed only products of Single Look Complex (SLC) type on Interferometric Wide swath (IW) mode, and Ground Range Detected (GRD) type on IW and Extra Wide swath (EW) mode are captured in the Portuguese geographic area.

The following functional requirements were taken into account for the development of the prototype:

1. Rolling archive of relevant products

This first requirement is related with the capacity of the system to support a rolling archive of relevant products that is automatically downloaded by Open Data Protocol (OData) synchroniz- ers.

2. Possibility to search products by region, temporal period, and type.

This requirement concerns the possibility of finding products by specifying a region, selecting the acquisition temporal period, or even the product type, where these filters can be used separately

33 or simultaneously.

3. The access to Sentinel products through the OGC service interfaces.

This requirement relates to supporting the access to the product coverages using the WCS and WMS interfaces, and through the use of the WCPS query language.

4. Automatic ingestion of products as coverages, making them immediately available through OGC service interfaces.

This last requirement concerns the automation of the provision of products as coverages in Ras- DaMan, making them accessible through the Open Geospatial Consortium (OGC) standards mentioned on the third requirement.

3.3 Architecture

In the IPSentinel prototype, product discovery and acquisition is performed automatically by the OData product synchronizer service, which is being extended in the context of a separate M. Sc. project by my colleague Francisco Silva [2016]. This service is configured by the system administrator through a web application, where in the future it will also be possible to specify the following parameters: region of interest, satellite mission, product level, product type, capture mode, capture intervals, among others. The detailed explanation of the OData product synchronizer is out of scope of this dissertation. Succinctly, the OData synchronizer works by calling the OData API, exposed in the third-party infras- tructures that support it, with the parameters that respect the OData protocol. Figure 3.2 illustrates the interaction of the IPSentinel prototype with ESA SciHub1 in the discovery and acquisition of relevant

1https://scihub.copernicus.eu

Figure 3.2: Overview of interactions of the prototype with ESA SciHub and users.

34 products through the OData synchronizer, and the interaction of users with IPSentinel. The products transferred from the SciHub are stored in an directory specified by the system administrator, during the configuration of the OData synchronizer. Petascope is a client module of RasDaMan, integrated to provide IPSentinel users with a dis- tributable service of mirror archives and, processing and dissemination means for EO products. Petascope can be deployed on a standard servlets container as an independent web application. However, it is distributed as a module integrated in the DHuS software to simplify the deployment (see Figure 3.3). Hereupon users can access IPSentinel functionalities, by means of the following APIs presented in Figure 3.4. In the figure, the black lines show the APIs already brought by the DHuS and the red lines show the APIs implemented by Petascope. The Petascope module also has an AngularJS2 Web Client, in which offers a graphic user interface that allows users to create the requests and call the

2https://angularjs.org/

Figure 3.3: Overview of the modified DHuS project.

Figure 3.4: Available APIs in the DHuS.

35 WCS and WMS APIs. Figure 3.5 shows another view of the aforementioned APIs being deployed in the Tomcat embed- ded in the DHuS system. The IPSentinel context diagram is reported in the Figure 3.6, showing how IPSentinel users:

• access DHuS functionalities by means of DHuS core API, used to access DHuS data storage

• access Petascope specific functionalities by means of OGC services, used to:

– access directly to OGC Metadata, containing data specific to OGC services

– access/process RasDaMan data

– insert/delete RasDaMan data, by means of DHuS core API

Figure 3.5: Component & connector allocated-to files view over the Tomcat and web applications.

Figure 3.6: IPSentinel context diagram.

36 Figure 3.7: Component & connector view from the DHuS core.

Figure 3.8: Component & connector view with pipe and filter style from the Rasdaman Feeder.

In order to automatically populate RasDaMan with EO data, some components in the DHuS Core had to be created and others changed. Figure 3.7 illustrates these components and their connections. The Job Scheduler acts as a job manager, in which triggers OData Synchronizers and FileScanners based on periodicity defined by the system administrator, through the Web Client graphical user interface (GUI). Each time the FileScanners are called, the directory (represented by the Product Storage cylinder) is scanned. For each product found, the system invokes the ProductService to proceed with the product metadata ingestion in the DHuS database, which in turn invokes the RasDaMan Feeder to process and import the product to RasDaMan through the Petascope compo- nent. Figure 3.8 describes the information processing pipeline of the Rasdaman Feeder, where the role of each component is as follows:

– The Infer component is responsible to identify the product mission, level, and type, in order to

37 be handled with the correct handler.

– The Unzip component deals with file decompression.

– The Extractor component extracts all the image absolute paths forming part of the product, generates the coverage id and creates the JSON recipe.

– The gdalwarp, gdal merge and gdal translate tools process the images in order to be support- able by RasDaMan.

– Lastly, WCST import performs the ingestion, using the recipe that was previously generated.

38 3.4 Implementation

The implementation was divided in three parts, which consisted in integrating Petascope with the DHuS software, in modifying the Petascope and DHuS Web Clients, and in modifying the DHuS core to automate the process of coverages ingestion in RasDaMan.

3.4.1 Integration of Petascope with the DHuS software

This first part of the implementation consisted in decoupling Petascope from the RasDaMan suite. The Petascope component comes as an Ant3 project and is dependent on some libraries that are part of the RasDaMan project. This decoupling phase was important, since the idea was to use the OGC Service implementations offered by RasDaMan, and to reuse the Web Client example that also comes with the Petascope component, in the IPSentinel infrastructure. After decoupling Petascope, its structure was reorganized into a Maven project4, to be subsequently integrated in the European Space Agency (ESA) DHuS Maven project. The protobuf5 library referenced in the Project Object Model (POM) file of the DHuS core had to be commented, since the Petascope component also depends on a modified version of protobuf, and this was causing conflicts during class loading. Although DHuS is designed as a Maven project, the approach that it uses to launch the servlet container and deploy the webapps is not the easy and conventional configuration done through the POM file. Here, the Tomcat launch is programmed explicitly on the DHuS core component where the deployment is made, through the iteration of a list of WebApplication objects that encaplsulate the Web application ARchive (WAR). The integration of the Petascope component with DHuS was thus not trivial, since it was necessary to define a Spring Bean with the context for the Petascope web application, and create a new class called PetascopeWebApp that extends from the WebApplication class. This step was essential to integrate Petascope with the DHuS, making the two components totally integrated and functional. The fact that DHuS has a Tomcat embedded led me to do this first step in order to put all web applications working in the same container.

3.4.2 Modification of Petascope and DHuS Web Clients

The Web Client that comes with the Petascope component was modified in three aspects. Aspect 1: The first aspect was in terms of the layout style. The style sheet was edited in order to have the same look and feel of the DHuS Web Client (Figure 3.9). The reason that the new panels were not done directly in the DHuS Web Client is because one of the challenges of this prototype is to complete the objectives with the minimum possible changes in the source code of the ESA DHuS, so that the merging of future versions from the original remote repository can be simple. Aspect 2: The second aspect was the modification of the Angular code in order to support the invocation of functions accepting the passing of parameters through URLs while, at the same time, allowing the control of the tabs views. In that sense, to support the desired behavior, the route

3http://http://ant.apache.org/ 4https://maven.apache.org 5https://github.com/google/protobuf

39 Figure 3.9: Petascope web client with GetMap tab open. controller was changed in the AngularConfig main function. Instead of using the stateProvider function from the ui.route library, I used the routeProvider function from the angular-route library, to dispatch which view tab to be loaded. The gain is in the fact that routeProvider offers a menu field that can be mapped directly to the visibility of the menu buttons. With this field it is possible to know which tab is being activated, and to control the visibility of the respective menu button when a user access the Website through a Uniform Resource Locator (URL) that contains the request to load a specific view tab and the associated Angular controller. This part was crucial to navigate directly from the DHuS Web Client to the Petascope Web Client, when listing a product with its available OGC services. Aspect 3: The third aspect focused on the creation of a new tab called GetMap, with OpenLayers6 integrated, to perform WMS requests and display the response results on the map. The results are one or more PNG image slices that belong to a Sentinel product. Figure 3.9 illustrates the GetMap tab open after a getMap request. In the graphical user interface of the DHuS Web Client, one icon was introduced (Figure 3.10) that allows the user to check the availability of OGC Services for each product, as well as buttons (Figure 3.11) that allow the user to navigate to the service page of interest, to process the currently displayed

6https://openlayers.org/

Figure 3.10: A product listed with OGC services availability checked.

40 Figure 3.11: Product details screen displaying OGC service buttons. product. The logic behind these additions was implemented on the client side (Angular), since this way the implementation process is simpler and does not imply the change of the response structure when products are requested to the back-end. The OGC Service availability is achieved by checking during the page loading if the currenttly displayed product has the level and type that combines with the products that are supposed to be loaded in RasDaMan. The same logic was applied for the buttons on the construction of the URLs with product ID for accessing the DescribeCoverage and GetMap tabs of the Petascope Web Client. In this implementation only level 1 and GRD type products from Sentinel 1 were considered.

3.4.3 Modification of the DHuS Core to Automate the Process of Ingesting Coverages in RasDaMan

The Sentinel product ingestion into RasDaMan was implemented in the Core component of the DHuS project, using the same service (ProductService) called by the Quartz7 scheduler to scan the de- fined directory that contains products and ingests them in the DHuS database. The implementation involved creating a new Spring Component called RasdamanFeeder which, in turn, is instantiated and called in the ProductService, in order to do the job of uncompressing the product, inferring the product level and type, processing the image or images inside the product, generating the recipe, and handling the invocation of the WCST Import tool with the recipe to realize the ingestion in Ras- DaMan. The inference of the level and the type of a product is done using the name convention in the folder adopted by ESA. As Figure 3.12 shows, the first set of letters indicate the sentinel mis- sion, the first three letters of the second set indicate the type, and the first letter of the fourth set indicates the processing level. Since each mission has a distinct organization for the product internal

7http://www.quartz-scheduler.org/

41 Figure 3.12: Convention structure for the Sentinel1 products name . structure, the Front Controller Pattern was used to create a dispatcher that allows dispatching the inferred products to the corresponding handler. The Front Controller Pattern supports the addition of new handlers for future missions and types. The aforementioned handler is an object with the necessary functions implemented for extracting the image absolute paths from the manifest.safe file, generating the recipes and the processing images. The image processing before the ingestion con- sists of assigning the Coordinate Reference Systems (CRS) corresponding to European Petroleum Survey Group (EPSG):4326 where pixels should be projected, changing the image data type from UInt16 to Byte, and compress it without losing quality using the gdal tool. The data conversion is necessary since Java does not allow unsigned data types, and the compression helps to slow down the time of ingestion and reduce the space occupied. With the images processed and the recipes generated, the prototype invokes the WCST import tool with the recipe as argument, to construct the Geography Markup Language (GML) as well as the HTTP request to the RasDaMan server. Both gdal and WCST Import are external programs that are executed by a java process. To conclude, during the ingestion process, if for any reason an error occurs, an exception is thrown by the RasdamanFeeder component, in order to be caught in the ProductService for the product to be removed from the catalogue, so that in the future we may try again the ingestion by the whole system, next time the file scanners are triggered by the Quartz scheduler. To finish, Figure 3.13 shows a diagram containing the java classes modified and created, where the Front Controller Pattern is also represented.

3.5 Summary

This chapter presented an overview of major functionalities of the IPSentinel prototype, as well as the system assumptions and the functional requirements. The chapter also described the architecture and implementation of the IPSentinel infrastructure, including all the intermediate steps and decisions takes during the development of the prototype. In summary, the IPSentinel was built using the DHuS software as base, on top of RasDaMan in

42 Figure 3.13: Diagram of involved java classes in the implementation. order to extend his features regarding the support of the WCS, WCPS, and WMS standards for the access and processing of EO data.

43 44 4 Validation

Contents 4.1 Requirements Compliance ...... 46 4.2 Measurement of Required Computational Resources ...... 48 4.3 Summary ...... 54

45 This chapter describes the variables taken into account in the prototype validation, coupled with the results and with some brief conclusions. Section 4.1 presents the requirements compliance, while Section 4.2 presents the measurements that were taken, in regards to the usage of computational resources. Section 4.3 summarizes the main aspects introduced in the chapter.

4.1 Requirements Compliance

One of the methodologies considered for validating the prototype involved assessing whether the functional requirements are in accordance with the proposal. In that sense, the assessment took into account the functional requirements announced in Subchapter 3.2 of Chapter 3.

Requirement 1: Rolling archive of relevant products This requirement is partially fulfilled, since it is possible to archive all the transferred products with the respective metadata to be properly stored in the database and indexed in the search engine. The requirement will be completely accomplished when the work of Francisco Silva [2016] is finished. His work focuses on the automation of product downloading, supporting filters such as temporal periods, product type or product mission, among others.

Requirement 2: Possibility to search products by region, temporal period, and type Without any change to the original DHuS source code this requirement is in conformance because the software originally supported these product search types. Currently, the software version provided by ESA only admits the search filter by product type for missions Sentinel 1 and Sentinel 2.

Requirement 3: The access to sentinel products through the OGC service interfaces When selecting the option to view details of a particular product, in the graphical interface, users are presented with two buttons that allow the navigation to pages of OGC Services. These buttons are only provided if the describeCoverage operation call return success, which means that the specific product is available as a coverage. At the moment the buttons only appear in products of GRD type from Sentinel 1 (see Chapter 3). The graphic interface of the aforementioned pages abstracts the necessary parameters to perform operations over the WCS and WMS services. Thus, the execution of operations, such as coverage slice, trim, scale, range subset or image displaying can be done by clicking buttons. Figure 4.1, presents the result of a getCoverage with parameter rangesubset assigned with VV and scalefactor assigned with 8. Figure 4.3 shows in OpenLayers the result of performing multiple getMap operations on the geographical area of the Madeira island. Regarding to the WCPS query language, it is possible to express, and process the EO rasters using simple to complex queries. In Figure 4.2, is present the result from the assessment of the following lines of code: for vv in (CoverageID VV ) , vh in (CoverdadeID VH) return encode (scale({ red : vv ∗ 0. 99; green : vh ∗ 0. 99; blue : ( abs ( vv ) / abs ( vh ) ) ∗ 0.99

46 Figure 4.1: Result of executing a getCoverage of Figure 4.2: Result of executing a false colouring pro- Sentinel 1 image with only VV band selected. cessing through the WCPS query language.

Figure 4.3: Result of executing multiples getMap requests.

} , { Lat: ”CRS:1”(1:500), Long: ”CRS:1”(1:500) } , {} ), ” png ” )

Through the cases demonstrated by the images below, this requirement is fully in conformance.

Requirement 4: Automatic ingestion of products as coverages, making them immediately available through the OGC Services interfaces This requirement is assured thanks to the new Rasdaman Feeder component, which has the role of processing the image properly to be ingested by RasDaMan. As explained on Chapter 3 the automation is guaranteed by the Job Scheduler, that dispatches the Product Service that uses the Rasdaman Feeder.

47 4.2 Measurement of Required Computational Resources

Another dimension of the prototype validation involved measuring the computational resources re- quired for it to run, namely (1) the expected storage space to have the prototype playing his role as a rolling archive, and (2) the time that the system takes to process and return a result from the main operations of WCS, WMS, and also through the use of the WCPS query language. All the measure- ments were made on a virtual machine with CentOS 7, 16GB RAM, 100GB of disk and an Intel dual core 2.10GHz processor.

4.2.1 Storage Space

The evaluation of the storage space considered all level-1 and level-2 products captured over the Portuguese territory from March 4 to 11, 2016 (8 days) by the mission Sentinel 1. All products were transferred from the ESA SciHub. Table 4.1 presents various information about the transferred data collection. The space occupied by the metadata of the respective products in the database were not taken into account. From the table it is possible to observe that, on average, a product of SLC type on IW mode approximately occupies on disk 3.6Gb, a product of GRD type on IW mode approximately occupies 805Mb and on EW mode occupies 222Mb, and a product of OCN type occupies approximately 7Mb. It can also be observed that in a period of 8 days 148 products were captured, which occupy a total of 270Gb (i.e. the equivalent of 33Gb per day). Roughly speaking we are counting with around 1Tb of products per month, without including level 0 products or other missions. The products in the data collection mentioned above were imported into RasDaMan, specifically all products of GRD type in EW mode, and 14 products in IW mode. This corresponded to a total of 30 products. Table 4.2 summarizes information taken about the space occupied by these products in RasDaMan. Looking at Tables 4.2 and 4.1 we can see that the S1* EW GRD products in RasDaMan occupy an average of 17% more than the average of the size of the original files in the file system, and S1* IW GRD products occupy an average of 27% more. This means that to store also all GRD type products in RasDaMan, the infrastructure requires 117% more of disk space for EW mode and 127%

Table 4.1: Space occupied by Sentinel 1 products on disk.

Product Sensor Files Total Sum Avg Max Min Satellite Resolution Type Mode # of File Size (Mb) (Mb) (Mb) (Mb) Sentinel1A GRD H IW 23 18838 819 998 714 Sentinel1B GRD H IW 41 32556 794 957 536 Sentinel1A GRD M EW 9 2024 225 236 213 Sentinel1B GRD M EW 7 1539 220 236 204 Sentinel1A SLC n/a IW 31 101600 3629 4600 3300 Sentinel1B SLC n/a IW 28 114500 3694 4300 2400 Sentinel1A OCN n/a IW 6 39,3 6,6 6,9 6,1 Sentinel1B OCN n/a IW 3 19,1 6,4 6,7 6,2 Total 148 271115

48 Table 4.2: Space occupied by GRD type products in RasDaMan.

Files Total Sum Avg Max Min Product # of File Size (Mb) (Mb) (Mb) (Mb) S1A EW GRD 9 2349 261 299 243 S1B EW GRD 7 1819 260 318 222 S1A IW GRD 7 7290 1041 1131 993 S1B IW GRD 7 7054 1008 1113 899 Total 30 18512 643 1131 222

Table 4.3: Storage space required for different rolling archive plans.

15 Days 1 Months 3 Months 1 Year Archive 500 Gb 1 Tb 3 Tb 12 Tb RasDaMan 285 Gb 570 Gb 1.71 Tb 6.84 Tb Total 785 Gb 1.57 4.71 Tb 18.84 Tb Storage Space for IW mode. Taking into account the above data and considering that in one month 36% of the products are of type GRD in IW mode and 10% are in EW mode, to store them all the infrastructure needs to have 57% more of its storage capacity, i.e, 570Gb more. In conclusion, with this informative data we argue that it is viable to have an infrastructure with 1.57Tb of storage space, so that for one month it may be possible to archive several products under the conditions described above. However, considering the periodic eviction at the end of 15 days, the storage space required can be halved. Table 4.3 summarizes the storage space required according to the rolling archive plan to store all Level-1 and Level-2 products from the Sentinel 1 mission, captured for the Portuguese territory.

4.2.2 Response Time

Regarding the validation of the prototype response time, several scenarios were tested executing different operations. The Apache JMeter1 tool, was used to record the relevant information of each operation of each scenario, in order to draw conclusions. All scenarios were run only once so that response times were not influenced by existing caches.

Scenario 1: Searching for products by a region filter The goal of this scenario is to test the response time of searching for products by a region criterion. This scenario was tested with 4 clients in parallel requesting products for the regions selected in Figure 4.4. Contrary to what one might think, selecting a larger area is not synonymous of longer response time. As we can see in the Table 4.4 and Figure 4.4 Client 1 selected an area smaller than Client 3. However, the response time of Client 3 is less than the response time of Client 1. The response time is mostly influenced by the number of products a given region has. This statement can be proved by the response times and products found in Clients 2 and 4.

1http://jmeter.apache.org

49 Figure 4.4: Scenario 2: Regions selected by the 4 clients.

Table 4.4: Scenario 1: Response time

Region Criterion Selected Region 1 2 3 4 Response Time 870 ms 547 ms 368 ms 1.12 s Products Found 13 8 2 29

Table 4.5: Scenario 2: Response time

Temporal Criterion 2015/12/31 2017/01/01 2017/03/07 2015/12/31 Temporal Period 2016/12/31 2017/03/07 2017/04/20 2017/04/20 Response Time 471 ms 506 ms 649 ms 868 ms Products Found 8 10 16 29

Table 4.6: Scenario 3: Response time

Product Type Criterion Product Type RAW GRD SLC ALL Response Time 346 ms 861 ms 344 ms 884 ms Products Found 1 24 1 26

Scenario 2: Searching for products by a temporal filter The goal of this scenario is to test the response time of searching for products by a temporal criterion. This scenario was tested with 4 clients in parallel requesting products of different temporal periods. The same conclusions drawn earlier apply to this scenario, i.e., the response time is mostly influ- enced by the number of products found in in a given temporal interval. As we can see in Figure 4.5 the response time increases with the number of products found. Comparing Table 4.5 with the table from Scenario 1, for the same number of products found the response time is higher in Table 4.4. This difference is due to the fact that the product search by region is heavier. The operation consists of scanning the product list for products that have at least one pair of coordinates contained in the selected region. The difference becomes increasingly higher as the number of products found grows, as can be seen.

Scenario 3: Searching for products by type The goal of this scenario is to test the response time of searching for products by a product type criterion. This scenario was tested with 4 clients in parallel requesting products of different product types. In addition to the conclusions already made in the previous two scenarios that also apply to this

50 Table 4.7: Resume of response times of the WCS operations.

Coverage S1A IW GRD S1B IW GRD S1A EW GRD S1B EW GRD describeCoverage Response Time 338 ms 342 ms 339 ms 369 ms getCoverage Response Time 2.7 min 1.7 min 1.0 min 1.1 min Processing Time 33 s 26 s 4.98 s 3.89 s Original Size ≈ 1 Gb ≈ 1 Gb ≈ 260 Mb ≈ 260 Mb Final Size 10.7 Mb 9.1 Mb 2.5 Mb 2.4 Mb scenario, it is possible to see in Figure 4.6 the system speed to scan the product list in order to create the response payload with the products that satisfy the condition. The first three scenarios are not intended to show how fast the system is, but rather how it behaves as the number of products increases.

Scenario 4: Performing WCS operations The goal of this scenario is to test the response time of describeCoverage and getCoverage operations from the WCS interface. The scenario was tested with 4 clients in parallel performing the same operations on 4 different coverages, as shown in Table 4.7. The getCoverage operation used the following parameters: &RANGESUBSET=VV&SCALEFACTOR=5&FORMAT=image/png As expected, the describeCoverage response times shown in the table are similar and low since only the metadata database (Petascope DB) was queried. With only the data presented in Table 4.7 it is not possible to conclude how the processing system behaves with the increase of getCoverage requests. However, it is possible to have a clear idea of how fast the Infrastructure prototype is pro- cessing 4 equal operations on different data at the same time. The S1A IW GRD and S1B IW GRD coverages are both about 1Gb and have been processed almost at the same time, i.e. around 30 seconds. The same is observed for the S1A EW GRD and S1B EW GRD coverages, which were processed within 5 seconds. As expected, the response time is highly affected by the size of the processed image to be transferred. The response time increases with the size of the images to be transferred and also tends to increase with the number of connected clients transferring images.

Scenario 5: Performing getMap operations from the WMS This scenario, which was intended to test the response time of multiple getMap requests executed by OpenLayers, was not done because it was verified that the version of Petascope used in the prototype contained severe bugs that had a big impact on memory, making the whole system run out of physical memory available when addressing multiple getMap operations.

Scenario 6: Processing a query using the WCPS query language The goal of this scenario is to test the response time of executing a processing query by Ras- DaMan, using the WCPS query language. The scenario was tested with 7 different queries executed in sequence with different operations on the same coverage. The coverage tested is a 2D Level-1 product from the Sentinel 1 mission, with dimensions 30032 x 19272 and size 777 Mb. In order to reduce the download time of the resulting images, in all the queries, a scale factor was used to reduce

51 Table 4.8: Summary of response time from different queries.

Query 1 2 3 4 5 6 7 Response Time 3.14 s 4.95 s 4.61 s 58.35 s 1.4 min 7.10 s 50.3 min Result Size 283 Kb 125 Kb 155 Kb 8 Kb 442 Kb 385 Kb 2 Kb the dimensions and consequently the size. The first query (Listing 4.1) requested only to return the original image so that we can see how each query affects the original image. Queries 2 and 3 (Listing 4.2 and Listing 4.3), return respectively the VV and VH bands of the original image. Query 4 (Listing 4.4) return the NDVI (Normalized Difference Vegetation Index), a measure for the probability of vegetation in . This query does not make much sense in this data, but it has been tested because it is one of the queries most used by scientists. Queries 5 and 6 (Listing 4.5 and 4.6) are basically the same in the sense that they return a falsely colored image. However, in the second, a subsetting is done around the Madeira island. Finally, Query 7 (Listing 4.7) summarizes the frequency with which each color value appears in the original image. The result of this query is a 1D array. The results obtained from each query can be seen in Figure 4.5 and Firgure 4.6. Table 4.8 summarizes the response times for each of the queries. The times shown are only processing times, i.e., the transfer time is not included. From the response times we can observe that the more arithmetic operations are involved in the queries, the longer the processing time. This statement is supported by comparing the first three queries with Queries 4 and 5. It is also possible to observe from Queries 5 and 6 that RasDaMan performs query optimization to retrieve results faster. The order in which operations are processed matters. Both queries perform the false colouring but in the Query 6 there are more things to do, like performing a subsetting and a different scaling. This leads us to think that Query 5 should be faster, however it is not true. RasDaMan first performs the scaling, then the subsetting and finally the false colouring. That is why Query 6 is much faster than Query 5. Query 7, at first glance, seems to be too time consuming and suggests that RasDaMan is not as fast as they say, but as we look at the Figure 4.6 we realized why there was 50.3 minutes delay. In this query a 30032 x 19272 matrix is scanned where each element contains two color values. We can see that only for the color value 0 were made 160592341 counts. Thus, the reason for this delay has to do with the high number of pixels that are traversed, with each pixel having two color channels that have to be counted. Finally, looking at the sizes of the files originated from each of the queries it becomes evident the advantage of having server-side image processing. Instead of having been transferred 777 Mb of data were only transferred in the total 1.38Mb.

Listing 4.1: Query 1 - Returns original image for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 ) return encode (scale (c, { Lat: ”CRS:1”(1:500), Long: ”CRS:1”(1:500) } , {} ) , ” png ” )

Listing 4.2: Query 2 - Returns VV band for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 )

52 Figure 4.5: Results produced by the queries in the Table 4.8. return encode (scale (c.VV, { Lat: ”CRS:1”(1:500), Long: ”CRS:1”(1:500) } , {} ) , ” png ” )

Listing 4.3: Query 3 - Returns VH band for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 ) return encode (scale (c.VH, { Lat: ”CRS:1”(1:500), Long: ”CRS:1”(1:500) } , {} ) , ” png ” )

Listing 4.4: Query 4 - NDVI algorithm for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 ) return encode (scale ( ( ( ( ( c .VV − c.VH) / ( c.VV + c.VH )) > 0 ) ∗ 255) ,{ Lat: ”CRS:1”(1:500), Long: ”CRS:1”(1:500) } , {} ) , ” png ” )

Listing 4.5: Query 5 - Returns false colouring image for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 ) return encode (scale ({ red : c .VV; green: c.VH; blue: c.VV + c.VH } ,{ Lat: ”CRS:1”(1:500), Long: ”CRS:1”(1:500) } , {} ) , ” png ” )

53 Figure 4.6: Result produced by the query in the Listing 4.7.

Listing 4.6: Query 6 - Returns a subset of a false coloured image for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 ) return encode (scale ({ red : c .VV; green: c.VH; blue: c.VV + c.VH } [Lat(32.60000:32.95),Long( −17.30000: −16.61324649070722)], { Lat: ”CRS:1”(1:300), Long: ”CRS:1”(1:500) } , {} ) , ” png ” )

Listing 4.7: Query 7 - Summarizes the frequency of each color value for c in ( S1A IW GRDH 1SDV 20170310T190611 20170310T190636...019B65 50B0 ) return encode ( coverage histogram over $n x(0:255) values count( c.VV = $n ), ” csv ” )

4.3 Summary

This chapter described the validation of the functional requirements and performance measurements taken on the prototype that was developed. Regarding functional requirements, the chapter described each requirement and detailed how the system responds to each one. In the case of the measure- ments regarding the usage of computational resources, the chapter presented some values with respect to the size of level-1 and level-2 products from Sentinel 1 mission, and concluded the viability of storing products captured for the Portuguese territory using different rolling archive plans. Besides

54 measurement of storage space, the chapter also presented the response times in searches for prod- ucts using region, temporal interval and product type criteria, as well as the response times of the integrated OGC services.

55 56 5 Conclusions and Future Work

Contents 5.1 Conclusions ...... 58 5.2 Future Work ...... 58

57 This dissertation described the development and implementation of a prototype for the IP Sentinel infrastructure, whose function is to catalogue, disseminate and process PetaBytes of EO products. The implementation used well known software, namely the DHuS from ESA and RasDaMan. The developed prototype allows searching products by region, temporal interval, and product type, and supports standardized services such as WCS, WMS and WCPS, which allow the processing of raster data. This infrastructure is intended to mirror products from the ESA SciHub, of interest to Portugal and only about the Portuguese territory. Section 5.1 presents final conclusions regarding the development activities that were conducted. Section 5.2 presents some ideas for future work.

5.1 Conclusions

Regarding the implementation, it has been shown that the symbiosis of DHuS and RasDaMan can be achieved with some minor modifications in the DHuS core, allowing full automation of making available the DHuS ingested products in the RasDaMan as coverages. This symbiosis makes DHuS richer in that it provides server-side processing of the catalogued products through the use of OGC services. The measurements that were made allow us to evaluate the behavior of the system regarding the response time in the search of products using different types of searches. From the measurements it was also possible to perceive the viable size of storage space needed to store the Sentinel 1 products related to the geographic area of Portugal. Also through the measurements it was possible to con- clude that the processing on the server side allows to reduce substantially the amount of data to be transferred to the clients. This allows reducing the transfer time and disk space of the infrastructure clients. Although it has not been approached in the prototype validation it is also possible to con- clude that with this type of infrastructure it no longer makes sense for the clients to have advanced computers since the infrastructure is responsible for the supply of computing power. Overall this dissertation demonstrated the benefits of an infrastructure with embedded processing of EO data while at the same time supporting Wagner [2015] perspective on decentralized collabora- tive infrastructure for the dissemination of sentinel products.

5.2 Future Work

The developed prototype is still at an early stage, and much more needs to be done before it can support an operational service. I believe it will be necessary to add a feature that allows users to add the products processed by RasDaMan to the DHuS catalogue. It will also be interesting to study a strategy so that the products ingested by RasDaMan are not also kept in the data space managed by DHuS, thus avoiding data duplication. It is also necessary to extend the range of product types that can be ingested by RasDaMan. It also seems reasonable, after my colleague Francisco Silva [2016] finishes his part of the work, to run the rolling archive for one month, in order to observe whether the infrastructure can operate without going down and, at the same time, to record the transfer speed for each product, building knowledge about the best time of day to download EO data.

Bibliography

Balzter, H., Cole, B., Thiel, C., and Schmullius, C. (2015). Mapping CORINE Land Cover from Sentinel-1A SAR and SRTM Digital Elevation Model Data using Random Forests. Remote Sensing, 7(11).

Baumann, P. (1994). Management of multidimensional discrete data. The Very Large Data Bases Journal, 3(4).

Baumann, P. (1999). A database array algebra for spatio-temporal data and beyond. In Proceedings of International Workshops on Next Generation Information Technologies and Systems.

Baumann, P. (2009a). OpenGIS Web Coverage Processing Service (WCPS) Language Interface Standard. OGC 08-068r2, Open Geospatial Consortium.

Baumann, P. (2009b). Web Coverage Service (WCS) - ProcessCoverages Extension. OGC 08-059r3, Open Geospatial Consortium.

Baumann, P. (2010). GML Application Schema for Coverages. OGC 09-146, Open Geospatial Consortium.

Baumann, P. (2012a). OGC Implementation Schema for Coverages. OGC 09-146r2, Open Geospatial Consortium.

Baumann, P. (2012b). OGC WCS 2.0 Interface Standard - Core. OGC 09-110r4, Open Geospatial Consortium.

Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., and Widmann, N. (1998). The Multidimensional Database System RasDaMan. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., and Widmann, N. (1999). Spatio-temporal retrieval with RasDaMan. In Proceedings of the International Conference on Very Large Data Bases.

Baumann, P. and Holsten, S. (2012). A comparative analysis of array models for databases. International Journal of Database Theory and Application, 5(1).

Brown, P. G. (2010). Overview of SciDB: Large Scale Array Storage, Processing and Analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Christopoulos, C., Skodras, A., and Ebrahimi, T. (2000). The JPEG2000 still image coding system: an overview. IEEE Transactions on Consumer Electronics, 46(4).

de la Beaujardiere, J. (2006). OpenGIS Web Map Service (WMS) Implementation Specification. OGC 06-042, Open Geospatial Consortium.

Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al. (2012). Sentinel-2: ESA’s optical high-resolution mission for GMES operational services. Remote Sensing of Environment, 120(SP-1322/2).

Frédéric Pidancier, Nicolas Valette, M. M. and Mbaye, S. (2014). Data Hub Service (DHuS) Architectural Design Document. GAEL Systems, 2nd edition.

Geudtner, D., Torres, R., Snoeij, P., Davidson, M., and Rommen, B. (2014). Sentinel-1 System capabilities and applications. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium.

Gutiérrez, A. G. and Baumann, P. (2008). Computing aggregate queries in raster image databases using pre-aggregated data. In Proceedings of the International Conference on Computational Science and Applications.

Guttman, A. (1984). R-trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Kersten, M., Zhang, Y., Ivanova, M., and Nes, N. (2011). SciQL, a query language for science applications. In Proceedings of the EDBT/ICDT Workshop on Array Databases.

Koubarakis, M., Datcu, M., Kontoes, C., Di Giammatteo, U., Manegold, S., and Klien, E. (2012). TELEIOS: a database-powered virtual earth observatory. Proceedings of the Very Large Data Bases Endowment, 5(12).

Langley, R. B. (1998). The UTM grid system. GPS world, 9(2):46–50.

Lim, K., Maier, D., Becla, J., Kersten, M., Zhang, Y., and Stonebraker, M. (2013). Array QL syntax.

Lupp, M. (2008). Open geospatial consortium. In Encyclopedia of GIS, pages 815–815. Springer.

Marathe, A. P. and Salem, K. (2002). Query processing techniques for arrays. The Very Large Data Bases Journal, 11(1).

Boncz, P. A. (2002). Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications. PhD thesis, CWI Amsterdam.

OGC (2006). The OpenGIS Abstract Specification - Topic 6: Schema for coverage geometry and functions. OGC 07-011, Open Geospatial Consortium.

Portele, C. (2007). OpenGIS Geography Markup Language (GML) Encoding Standard. OGC 07-036, Open Geospatial Consortium.

Portele, C. (2012). OGC Geography Markup Language (GML) - Extended Schemas and Encoding Rules. OGC 10-129r1, Open Geospatial Consortium.

Rew, R. and Davis, G. (1990). NetCDF: an interface for scientific data access. IEEE Computer Graphics and Applications, 10(4).

Ritter, N. and Ruth, M. (1997). The GeoTIFF data interchange standard for raster geographic images. International Journal of Remote Sensing, 18(7).

Ritter, N., Ruth, M., Grissom, B. B., Galang, G., Haller, J., Stephenson, G., Covington, S., Nagy, T., Moyers, J., Stickley, J., et al. (2000). GeoTIFF format specification GeoTIFF revision 1.0. SPOT Image Corp.

Rusu, F. and Cheng, Y. (2013). A survey on array storage, query languages, and systems. arXiv preprint arXiv:1302.0103.

Silva, F. (2016). IPSentinel - Sentinel Earth Observations Rolling Archive Downloader. Technical report, Instituto Superior Técnico.

Stonebraker, M., Brown, P., Poliakov, A., and Raman, S. (2011). The Architecture of SciDB. In Proceedings of the International Conference on Scientific and Statistical Database Management.

van Ballegooij, A. (2004). RAM: A Multidimensional Array DBMS. In Proceedings of the Extending Database Technology Workshops.

Wagner, W. (2015). Big data infrastructures for processing Sentinel data. In Proceedings of the Photogrammetric Week.

Wagner, W., Sabel, D., Doubkova, M., Hornáček, M., Schlaffer, S., and Bartsch, A. (2012). Prospects of Sentinel-1 for land applications. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium.

Zhang, Y., Kersten, M., and Manegold, S. (2013). SciQL: array data processing inside an RDBMS. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
