Querying Spatiotemporal XML Using DataFoX ∗
Yi Chen Peter Revesz Computer Science and Engineering Department University of Nebraska-Lincoln Lincoln, NE 68588, USA {ychen,revesz}@cse.unl.edu
Abstract and a straight-line object in GML have identical struc- tures. There is no automatic translation algorithm that We describe DataFoX, which is a new query language can maintain the XML schema information. for XML documents and extends Datalog with support for trees as the domain of the variables. We also introduce for • Tree algebras do not allow grammar heterogeneity of DataFoX a layer algebra, which supports data heterogene- the XML data. For example, a single spatial object ity at the language level, and several algebra-based evalu- with a rectangle shape can also be encoded as the com- ation techniques. bination of two triangles. The structure and the content of the GML data that result from these different combi- nations are different and tree algebras may erroneously regard them as two different objects. 1. Introduction • Many XML standards provide a strict structure for part XML is a standard data model for data representation of the data. For example, spatial attribute data in GML and exchange on the Internet. Several XML-based lan- are well structured. Unfortunately, most previously guages can also encode geographic information and spa- proposed XML query languages and XML algebras do tiotemporal data. For example, the Vector Markup Lan- not provide sufficient support for well-structured data guage (VML) can represent vector objects, the Geogra- in a semi-structured environment. For example, pre- phy Markup Language (GML) can represent coordinates of viously proposed XML query languages do not sup- OpenGIS features, and the Parametric Rectangle Markup port spatiotemporal queries on GML although the spa- Language (PRML), which we introduce in this paper, can tiotemporal attributes are well structured in GML. represent parametric rectangles[11, 2]. While XML is successful for data representation, the To overcome the above problems, we present a rule- current XML query languages, including XQuery[4], based XML query language, Datalog For XML (DataFoX), Quilt[5], Lorel[1], XML-QL[6], and XPathLog[10], do which combines the simplicity of Datalog with the support not fully support querying geographic and spatiotemporal for spatiotemporal data in constraint databases[8, 11]. We XML documents. Querying based on relational database also introduce a Layer Algebra, which is a novel exten- systems[3, 9, 7, 13, 12] also does not support spatiotempo- sion of relational algebra for XML and provides the basis ral queries. Previous XML query language proposals also of query evaluation and optimization for XML queries. The have the following common problems: primary challenges we address are: (i) how to identify and represent trees in the query language, and (ii) how to evalu- • The method of querying XML using relational ate the tree-based query language. database systems includes both the translation of XML schemas into relational schemas and the translation of 1.1. The DataFoX System Architecture tree-structured XML data into relational tables. This method may loose some semantic information that is The architecture of the DataFoX system is outlined in encoded in the tree structure. For example, a rectangle Figure 1. XML Data sources in different formats are ∗This research was supported in part by NSF grant EIA-0091530 and a wrapped into a uniform representation called Layer Con- Gallup Research Professorship. straint Databases, which are powerful for capturing the DataFoX Query
Query Translator
Datalog Query
Layer Algebra
Query Evaluation
Layer Constraint Database
GML Wrapper VML Wrapper PRML Wrapper XML Wrapper
GML Data Source VML Data Source PRML Data Source XML Data Source
Figure 1. The DataFoX Architecture
XML tree structure and GML, VML, and PRML spatiotem- 2.1. Layer Model poral information. The DataFoX query language is trans- lated into a Datalog query, which is further translated into In XML, the type of each node is defined as an “ele- the layer algebra. The query is evaluated and the query re- ment”. We regard nodes as the counterpart of tuples in re- sult may be translated back to XML using wrappers. lational databases. The difference is that the data type of Our architecture not only supports spatiotemporal the field in a certain element could be another element. All queries but also enables integration of heterogeneous data of the nodes of the same element type belong to a layer, sources. For example, Figure 2 shows an integrated univer- which is the counterpart of a relation in relational databases. sity campus map, whose description is composed of three The layer name is the same as the element name defined for different data sources: the classroom buildings represented these nodes. We say layer A is the parent layer of layer B if in GML, the stadium and the square in front of it repre- the element B is the sub-element of element A in the XML sented in VML, and a moving bus and a helicopter repre- document. sented in PRML. We define inferred layers of a layer L as the set of layers, The following are some typical queries on this map: which are child layers of L. Similarly, the inferred layer of a node is the subtree that is rooted from this node. Thus, the Query 1 Will the bus pass the east gate of the stadium? inferred nodes of a certain node n and the edges between Query 2 When will the helicopter fly over the stadium? these nodes form an XML tree with root n. We assume that each node has a virtual attribute nid as the identifier of the Query 3 When will the helicopter fly above the bus? node and a hidden attribute pid as a pointer to the parent node. We may use the nid to access a node, from which we 2. Data Model may access its child nodes. In DataFoX, we also use this nid to identify the subtree that is rooted at the current node. The DataFoX system translates the different data sources (GML, VML, and PRML) into a layer model. y
VML document PRML document
Figure 2. A campus map integrated from GML, VML, and PRML documents We also assume nid is ordered, to maintain the order in- in DTD. The book layer and author layer are called the in- formation of the XML tree model. For those elements de- ferred layers of the library layer. The layers with primitive fined as primitive data types, we assign an internal attribute data types like “PCDATA” are merged into their parent layer named data. The value of this attribute is the content of the for simplicity. element. We apply the same notation used to represent relational 2.2. Spatiotemporal Data Model schema to represent the layer schema. However, the domain of variables for the layer schema is “tree”. For example, With the self-description feature of XML, every spa- the layer schema LAYERNAME(L1,...,Ln) means that tiotemporal data can be easily encoded in XML. Typi- there is an element named “LAYERNAME” defined in the cally, GML defines a class of spatial features to repre- XML document, which has n child elements, with the layer sent OpenGIS data, and VML encodes vector information names L1, L2, ..., Ln respectively. The data type of the in XML. Parametric Rectangle data model, used to model child layers can be atomic (e.g.: string, integer etc.) or a moving objects, can be also encoded in XML, which we tree type defined by another layer schema. An instance of a refer to as PRML. layer and all of its inferred layers is a tree, whose root is an These data models share a common characteristic, instance of the root layer. namely, the structures of the spatiotemporal attribute are The layer definition can be generated from the element well-formatted, and generally form a well-formatted sub- definition of the XML document. We assume that every tree in the document (although the structure could be very XML document comes with a Data Type Definition (DTD). complex.) Applying the layer data model, we may use the XML Schema is a stronger schema definition of XML, but upper layer to aggregate the object using constraints. The DTD is adequate (note: we can always generate DTD from wrapper performs this work. Figure 5 shows the aggrega- the XML schema to create the layer data model.) With the tion operation on spatiotemporal data. layer data model, an XML document is treated as a collec- tion of layers, and the traversal of the XML document is 3. The Operators actually the traversal between the layers.
A layer in XML can be regarded as a relation in relational defined as child layers. This definition allows us to define the operators on layers, which is very similar to that of re- lational databases. In this section, we introduce the Layer Algebra.
Figure 3. Library Document We define the projection operation as an orthogonal op- eration to the selection operation. It is a “vertical” operation on XML in the sense that the traverse between layers is in- Example 2.1 (Layer Model) The XML document shown in volved in the operation. The input of projection is a layer Figure 3 can be modeled in the layer model shown in Fig- L (and all of its inferred layers), and a set of projection list ure 4. Every layer corresponds to an “element” definition P L as projection parameters. The projection list is a set of library nid pid
book book book nid pid
2 1 Landscape 1997 author author title author 3 1 Portrait 2001 year title year
Lanscape 1997 Steve Portrait 2001 John Maggie author nid pid data
4 2 Steve 5 3 John 6 3 Maggie
Figure 4. Layer Model for Library Document
vehicle vehicle vehicle nid