Structured Parallel Computation in

Structured Documents

D.B. Skillicorn

[email protected]

March 1995

External Technical Report

ISSN-0836-0227-95-379

Department of Computing and Information Science

Queen's University

Kingston, Ontario K7L 3N6

Document prepared March 6, 1995

© Copyright 1995 D.B. Skillicorn

Abstract

Document archives contain large amounts of data to which sophisticated queries are applied. The size of archives and the complexity of evaluating queries make the use of parallelism attractive. The use of semantically-based markup such as SGML makes it possible to represent documents and document archives as data types.

We present a theory of trees and homomorphisms, modelling structured text archives and operations on them, from which it can be seen that:

- many apparently-unrelated tree operations are homomorphisms;

- homomorphisms can be described in a simple parameterised way that gives standard sequential and parallel implementations for them;

- special classes of homomorphisms have parallel implementations of practical interest. In particular, we develop an implementation for path expression search, a novel and powerful query facility for structured text, that takes time logarithmic in the text size.

Keywords: structured text, categorical data type, software development methodology, parallel algorithms, query evaluation.

1 Structured Parallel Computation

Computations on structured documents are a natural application domain for parallel processing. This is partly because of the sheer size of the data involved. Document archives contain terabytes of data, and analysis of this data often requires examining large parts of it. Also, advances in parallel computer design have made it possible to build systems in which each processor has sizeable storage associated with it, and which are therefore ideal hosts for document archives. Such machines will become common in the next decade.

The use of parallelism has been suggested for document applications for some time. Some of the drawbacks have been pointed out by Stone [24] and Salton and Buckley [17]; these centre around the need of most parallel applications to examine the entire text database where sequential algorithms examine only a small portion, and the consequent performance degradation in accessing secondary storage. While this point is important, it has been to some extent overtaken by developments in parallel computer architecture, particularly the storage of data in disk arrays, with some disk storage local to each processor. As we shall show, the use of parallelism allows such an increase in the power of query operations that it will be useful even if performance is not significantly increased.

Parallel computation is difficult in any application domain for the following reasons:

- There are many degrees of freedom in the design space of the software, because partitioning computations among processors strongly influences communication and synchronisation patterns, which in turn have a strong effect on performance. Hence finding good algorithms requires extensive searching in the absence of other information;

- Parallelism in an algorithm is only useful if it can be harnessed by some available parallel architecture, and harnessed in an efficient way;

- Expressing algorithms in a way that is abstract enough to survive the replacement of underlying parallel hardware every few years is difficult;

- It is hard to predict the performance of software on parallel machines without actually developing it and trying it out.

Making parallelism the workhorse of structured document processing requires finding solutions to these difficulties.

The extensive use of semantically-based markup, and particularly the use of SGML [13], means that most documents have a de facto tree structure. This makes it possible to model them by a data type with enough formality that useful theory can be applied. We will use the theory of categorical data types [19], a particular approach to initiality, emphasising its ability to hide those aspects of a computation that are most difficult in a parallel setting.

Categorical data types generalise abstract data types by encapsulating not only the representation of the type, but also the implementation of homomorphisms on it. In object-oriented terms, the only methods available on constructed types are homomorphisms. As a parallel programming model this is ideal, since the partitioning of the data objects across processors, and the communication patterns required to evaluate homomorphisms, can remain invisible to programmers. Programs can be written as compositions of homomorphisms without any necessary awareness that the implementations of these homomorphisms might contain substantial parallelism.

Recall that a homomorphism is a function that respects the structure of its arguments. If an argument object is a member of a data type with constructor ⋈, then h is a homomorphism if there exists an operation ⊛ such that

    h (a ⋈ b) = h a ⊛ h b

This equation describes two different ways of computing the result of applying h to the argument a ⋈ b. The left hand side computes it by building the argument object completely and then just applying h to this object. The right hand side, however, applies h to the component objects from which the argument was built, and then applies the operation ⊛ to the results of these two applications.

There are two things to note about the computation strategy implied by the right hand side. The first is that it is recursive. If b is itself an object built up from smaller objects, say b = c ⋈ d, then

    h b = h c ⊛ h d

and so

    h (a ⋈ b) = h a ⊛ (h c ⊛ h d)

The structure of the computation follows the structure of the argument. Second, the evaluations of h on the right hand sides are independent and can therefore be evaluated in parallel if the architecture permits it. These simple ideas lead to a rich approach to parallel computation.
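For concreteness, the homomorphism equation can be checked on a familiar type. The sketch below uses Python, with lists as the data type, concatenation as the constructor ⋈, len as h, and addition as the combining operation ⊛; the encoding is illustrative only.

```python
# h = len is a homomorphism on lists: len(a + b) == len(a) + len(b),
# i.e. h (a ⋈ b) = h a ⊛ h b with ⋈ = concatenation and ⊛ = +.

def h(xs):
    return len(xs)

a = [1, 2, 3]
b = [4, 5]

lhs = h(a + b)       # build the whole argument, then apply h
rhs = h(a) + h(b)    # apply h to the components, combine with ⊛
assert lhs == rhs == 5
```

The two independent calls on the right hand side are exactly the ones that could proceed in parallel.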

Homomorphisms include many of the interesting functions on constructed data types. In particular, all injective functions are homomorphisms. Furthermore, all functions can be expressed as almost-homomorphisms [2], the composition of a homomorphism with a projection, and this is often of practical importance.

In the next section we introduce the construction of a type for trees to represent structured text. We show that the construction reduces algorithm design for homomorphisms to the simpler problem of finding component functions, and illustrate a recursive parallel schema for computing homomorphisms on trees. In Section 3, we distinguish four special kinds of homomorphisms that capture common patterns of information flow in tree algorithms, and for which it is worth optimising the standard schema implementation. These are: tree maps, tree reductions, and upwards and downwards accumulations. In the subsequent five sections we illustrate tree homomorphisms of increasing sophistication, beginning with the computation of global document properties, then search problems (that is, query evaluation), and finally problems that involve communicating information throughout documents.

2 Parallel Operations on Trees

We build the type of homogeneous binary trees, that is, trees in which internal nodes and leaves are all of the same type. Binary trees are too simple to represent the full structure of tagged texts, since any tagged entity may contain many subordinate entities, but they simplify the exposition without affecting any of the complexity results.

In SGML an entity is delimited by start and end tags. The region between the start and end tags is either `raw' text or is itself tagged. The structure is

Figure 1: Modelling a Document as a Tree (a document root, with chapter nodes below it, section nodes below those, and para leaves)

hierarchical, and can be naturally represented by a tree whose internal nodes represent tags, and whose leaves represent `raw' data. All nodes may contain values for attributes. In particular, tags will often have associated text; for example, chapter tags contain the text of the chapter heading as an attribute, figure tags contain figure captions, and so on. Thus a typical document can be represented as a tree of the kind shown in Figure 1. Document archives can be modelled by trees as well, in which nodes near the root represent document classifications, as shown in Figure 2. We build trees over some base type A that can model entities and their attributes. For our purposes this is a tuple consisting of an entity name and a set of attributes. In our particular examples it will suffice to have one attribute, the string of text associated with each node of the document.

Trees have two constructors:

    Leaf : A → Tree A
    Join : Tree A × A × Tree A → Tree A

The first constructor, Leaf, takes a value of type A and makes it into a tree consisting of a single leaf. The second constructor, Join, takes two trees and a value of type A and makes them into a new tree by joining the subtrees and

Figure 2: Modelling a Document Archive (an archive root whose children are classifications such as manuals and novels, above individual documents and their chapters, sections, and paragraphs)

Figure 3: A Constructor Expression and the Tree it Describes — the expression Join(Join(Leaf d, b, Leaf e), a, Leaf c) describes the tree with root a, a left subtree rooted at b with leaves d and e, and a right leaf c

putting the value of type A at the internal node generated. Thus a tree is either a single leaf, or a tree obtained by joining two subtrees together with a new value at the join. A constructor expression describing a tree and the tree itself are shown in Figure 3.
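These constructors can be encoded directly. The following sketch uses tagged Python tuples; the encoding is illustrative, not prescribed by the theory.

```python
# Hypothetical encoding of the two tree constructors as tagged tuples.
def Leaf(a):
    return ('leaf', a)

def Join(t1, a, t2):
    return ('join', t1, a, t2)

# The tree of Figure 3: Join(Join(Leaf d, b, Leaf e), a, Leaf c)
t = Join(Join(Leaf('d'), 'b', Leaf('e')), 'a', Leaf('c'))
assert t[0] == 'join' and t[2] == 'a'
```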

Definition 1 A homomorphism h on trees is a function that respects tree structure; that is, there must exist functions f1 and f2 such that

    h (Leaf a) = f1 a

and

    h (Join (t1, a, t2)) = f2 (h t1, a, h t2)

Call f1 and f2 the component functions of the homomorphism h.

Here f2 is the glueing function that relates the effect of h on pieces of the argument to its effect on the whole argument. Notice that if h has type

    h : Tree A → P

then the types of f1 and f2 are

    f1 : A → P
    f2 : P × A × P → P

Thus finding a homomorphism amounts to finding such a pair of functions. However, finding h means being aware of the structure of Tree A, which is usually complex, whereas f1 and f2 are operations on usually simpler types. This is a considerable practical advantage.

Result 2 [19] Every homomorphism on trees is actually a function between an algebra (that is, a set with some operations defined on it) (Tree A, Leaf, Join) and a related algebra (P, f1, f2) for some type P and a pair of functions

    f1 : A → P
    f2 : P × A × P → P

Note that the type signatures of f1 and f2 are obtained from those of Leaf and Join by replacing occurrences of Tree A by P.

Furthermore, the homomorphism to each such algebra is unique, and there is therefore a one-to-one correspondence between algebras and homomorphisms. Each homomorphism is completely determined by the functions f1 and f2. This correspondence justifies the notation

    Hom (f1, f2)

for the necessarily unique tree homomorphism from (Tree A, Leaf, Join) to the algebra (P, f1, f2).

Using the fact that h is a homomorphism and that it corresponds to a pair of component functions, we can now make formal the recursive computation of h that we saw earlier. The recursive schema for evaluating all tree homomorphisms is shown in pseudocode in Figure 4. It is clear that the two recursive calls to evaluate tree homomorphisms can be executed in parallel. Using this simple approach to parallelism, all tree homomorphisms on trees can be evaluated in parallel time proportional to the height of the tree, assuming that the evaluations of f1 and f2 take only constant time.

Here are some simple examples of tree homomorphisms:

evaluate_tree_homomorphism (f1, f2, t) =
  case t of
    Leaf(a) :
      return f1 (a) ;
    Join(Tree(t1), a, Tree(t2)) :
      return f2 (
        evaluate_tree_homomorphism (f1, f2, t1),
        a,
        evaluate_tree_homomorphism (f1, f2, t2) )
  end case
end

Figure 4: Tree Homomorphism Evaluation Schema

Example 3 The tree homomorphism that computes the number of leaves in a tree replaces the value at each leaf by the constant 1, and then sums these values over the tree. It is given by

    number-of-leaves = Hom (K1, f2)

where K1 : A → N is the constant 1 function and f2 : N × A × N → N is f2 (t1, a, t2) = t1 + t2. The recursive schema specialises to:

evaluate_tree_homomorphism (t) =
  case t of
    Leaf(a) :
      return K_1 (a) ;  { = 1 }
    Join(Tree(t1), a, Tree(t2)) :
      return
        evaluate_tree_homomorphism (t1)
        +
        evaluate_tree_homomorphism (t2)
  end case
end

Example 4 The tree homomorphism to compute the number of internal nodes in a tree replaces each leaf by the value 0 and then adds 1 to the sum for each internal node encountered. It is

    number-of-internal-nodes = Hom (K0, f2)

where K0 : A → N is the constant 0 function, and f2 : N × A × N → N is f2 (t1, a, t2) = t1 + t2 + 1.

Example 5 The tree homomorphism that finds the maximum value in a tree does nothing at the leaves and at internal nodes selects the maximum of the three values from its two subtrees and the node itself. It is

    treemax = Hom (id, f2)

where f2 (t1, a, t2) = ↑(t1, a, t2) and ↑ is ternary maximum.
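The three examples above can all be run through one generic evaluation schema. The sketch below encodes trees as tagged Python tuples and Hom (f1, f2) as a recursive function; the helper names are ours, for illustration only.

```python
def Leaf(a): return ('leaf', a)
def Join(t1, a, t2): return ('join', t1, a, t2)

def hom(f1, f2, t):
    """Evaluate Hom(f1, f2) by the recursive schema of Figure 4."""
    if t[0] == 'leaf':
        return f1(t[1])
    _, t1, a, t2 = t
    return f2(hom(f1, f2, t1), a, hom(f1, f2, t2))

t = Join(Join(Leaf(4), 7, Leaf(2)), 9, Leaf(5))

number_of_leaves = lambda t: hom(lambda a: 1, lambda l, a, r: l + r, t)
number_of_internal = lambda t: hom(lambda a: 0, lambda l, a, r: l + r + 1, t)
treemax = lambda t: hom(lambda a: a, lambda l, a, r: max(l, a, r), t)

assert number_of_leaves(t) == 3      # leaves 4, 2, 5
assert number_of_internal(t) == 2    # internal nodes 7, 9
assert treemax(t) == 9
```

Only the component functions change between the three homomorphisms; the traversal is shared.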

The fact that all tree homomorphisms are the same in some sense is useful in two ways. First, finding tree homomorphisms reduces to finding component functions. Since component functions describe local actions at the nodes of trees, they are usually simpler conceptually than homomorphisms. There is a separation between common global tree structure and detailed node computations. Second, there is a common mechanism for computing any homomorphism, so it makes sense to spend time optimising it, rather than developing a variety of ad hoc techniques.

Nevertheless, there are some classes of homomorphisms for which better implementations are possible. These classes have many members that are of practical interest.

3 Four Classes of Tree Homomorphisms

There are four special classes of homomorphisms that are paradigms for common tree computations. They are: tree maps, tree reductions, and two forms of tree accumulations, upwards and downwards.

In our discussion of implementations, we will assume either the EREW PRAM or a distributed-memory hypercube architecture as the target computer. The EREW PRAM does not charge for communication, and our implementations can all be arranged on the hypercube so that communication is always with nearest neighbours, except for the tree contraction algorithm used in several places. This enables us to include communication costs in our complexity measures. We use a result of Mayr and Werchner [16] to justify the complexity of tree contraction on the hypercube. We will also assume that a tree of n nodes is processed by an n-processor system, so that there is a processor per tree node. We return to this very unrealistic assumption later.

A tree map is a tree homomorphism that applies a function to all the nodes of a tree, but leaves its structure unchanged. If f is some function f : A → B, then TreeMap f is the function

    TreeMap f = Hom (Leaf ∘ f, Join ∘ (id × f × id)) : Tree A → Tree B

The recursive schema becomes

evaluate_tree_homomorphism (t) =
  case t of
    Leaf(a) :
      return Leaf (f (a)) ;
    Join(Tree(t1), a, Tree(t2)) :
      return Join (
        evaluate_tree_homomorphism (t1),
        f (a),
        evaluate_tree_homomorphism (t2) )
  end case
end

The recursion "unpacks" the argument tree into a set of leaves and internal nodes. These are then joined back together in exactly the same structure, except that the function f is applied to each value before it is placed back in the tree. The resulting tree has exactly the same shape as the argument tree; only the values and types at the leaves and internal nodes have changed.

Using the recursive schema as an implementation is therefore unnecessarily cumbersome. A tree map can be computed directly by skipping the disassembly and subsequent reassembly and just applying f to each leaf and internal node. In a parallel implementation a tree map can be computed without any communication.

Let t_i(f) denote the time complexity of f on i processors. The sequential complexity of a tree map is given by

    t_1(TreeMap f) = n × t_1(f)

and the parallel complexity by

    t_n(TreeMap f) = t_1(f)
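A direct, communication-free tree map can be sketched as follows, using the same illustrative tuple encoding as before.

```python
def tree_map(f, t):
    # Apply f at every node. The shape is unchanged, and the two
    # recursive calls are independent, so in a parallel setting each
    # processor can apply f to its own node with no communication.
    if t[0] == 'leaf':
        return ('leaf', f(t[1]))
    _, t1, a, t2 = t
    return ('join', tree_map(f, t1), f(a), tree_map(f, t2))

t = ('join', ('leaf', 1), 2, ('leaf', 3))
assert tree_map(lambda x: x * 10, t) == ('join', ('leaf', 10), 20, ('leaf', 30))
```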

A tree reduction is a tree homomorphism that replaces structure without explicitly manipulating values. Applied to a tree, a tree reduction most often produces a single value. Formally, a tree reduction is a tree homomorphism

    TreeReduce g = Hom (id, g) : Tree A → A

where the type of g is

    g : A × A × A → A

The recursive schema becomes

evaluate_tree_homomorphism (t) =
  case t of
    Leaf(a) :
      return a ;
    Join(Tree(t1), a, Tree(t2)) :
      return g (
        evaluate_tree_homomorphism (t1),
        a,
        evaluate_tree_homomorphism (t2) )
  end case
end

Figure 5 shows a tree and the result of applying a reduction to it.

A tree reduction can be computed in parallel in time proportional to the height of the tree, assuming that each application of the function g takes

Figure 5: A Tree Contraction — the tree Join(Join(Leaf d, b, Leaf e), a, Leaf c) reduces to the value g(g(d, b, e), a, c)

constant time. Knowing the results of the reduction on subtrees, the result of the reduction on the tree formed by joining them requires a further application of g, so the critical path is the chain of applications of g along the longest branch. The sequential complexity of tree reduction is

    t_1(TreeReduce g) = n × t_1(g)

and the parallel complexity is

    t_n(TreeReduce g) = ht × t_1(g)

where ht is the height of the tree.
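The naive, height-proportional reduction corresponds to the following sketch (tree contraction, which achieves the logarithmic bound discussed below, is not shown here).

```python
def tree_reduce(g, t):
    # TreeReduce g = Hom(id, g): leaves pass their value through,
    # internal nodes combine with g, so work follows the tree's shape.
    if t[0] == 'leaf':
        return t[1]
    _, t1, a, t2 = t
    return g(tree_reduce(g, t1), a, tree_reduce(g, t2))

# As in Figure 5, Join(Join(Leaf d, b, Leaf e), a, Leaf c)
# reduces to g(g(d, b, e), a, c); here g is ternary addition.
t = ('join', ('join', ('leaf', 3), 1, ('leaf', 4)), 1, ('leaf', 5))
assert tree_reduce(lambda l, a, r: l + a + r, t) == 14
```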

Somewhat surprisingly, tree reduction can be computed much faster for many kinds of functions g. For theoretical models such as the EREW PRAM and more practical architectures such as the hypercube, tree reduction can be computed in time logarithmic in the number of nodes of the tree, subject to some mild conditions on the function g [1, 16], described in Appendix A. This is a big improvement over the method suggested above, since a completely left- or right-branching tree with n nodes requires time proportional to n to reduce directly, whereas the faster algorithm takes time log n.

The key to fast tree reduction is making some useful progress towards the eventual result at many nodes of the tree, whether or not the reductions for the subtrees of which they are the root have been completed. Naive reduction applies g only at those nodes both of whose children are leaves. For an unbalanced tree, this creates a chain of reductions whose length is proportional to the height of the tree. However, for well-behaved reduction operations it is also possible to carry out partial reductions for nodes only one of whose descendants is a leaf. This brings the time complexity of the tree reduction down to logarithmic in the size of the tree, no matter how unbalanced it is. The details are given in Appendix A.

Thus we have a parallel time complexity for tree reduction of

    t_n(TreeReduce g)[tree contraction] = log n

since t_1(g) must be O(1).

The third useful family of tree homomorphisms are the upwards accumulations. Upwards accumulations are operations in which data can be regarded as flowing upwards in the tree, and where the computations that take place at each node depend on the results of computations at lower nodes. These are like tree reductions that leave all their partial results behind. The final result is a tree of the same shape as the argument tree, in which each node is the result of a tree reduction rooted at that node. This is illustrated in Figure 6.

Upwards accumulation is characterised as follows. Let subtrees be the function that replaces each node of a tree by the subtree rooted at that node. Hence it takes a tree and produces a tree of trees. Although subtrees is a messy and expensive function, it is still a homomorphism.

Upwards accumulations are those functions that can be expressed as the composition of subtrees with a tree homomorphism mapped over the nodes of the intermediate tree:

    upwards accumulation = TreeMap (Hom (f1, f2)) ∘ subtrees

If Hom (f1, f2) : Tree A → X then the upwards accumulation has type signature Tree A → Tree X.

Computing an upwards accumulation in the way implied by this definition is expensive. The point of defining upwards accumulations like this is that the result at any node shares large common expressions with the results at its

Figure 6: An Upwards Accumulation — in the tree of Figure 3, node b is replaced by f2(f1 d, b, f1 e), and the root a by f2(f2(f1 d, b, f1 e), a, f1 c)

immediate descendants. Therefore, the number of partial results that must be computed in an upwards accumulation is linear in the number of tree nodes. Thus it is possible that there is a fast parallel algorithm, provided the dependencies introduced by the communication involved are not too constraining; and indeed there is. The algorithm is an extension of tree contraction; when a node u is removed, it is stacked by its remaining child. When this child receives its final value, it unstacks u and computes its final value. Details may be found in [7].

The sequential time complexity of an upwards accumulation is

    t_1(UpAccum (f1, f2)) = n × (t_1(f1) + t_1(f2))

its parallel time complexity is

    t_n(UpAccum (f1, f2)) = t_1(f1) + ht × t_1(f2)

and its parallel time complexity using extended tree contraction is

    t_n(UpAccum (f1, f2))[tree contraction] = log n

because f1 and f2 must be O(1).
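A sequential, linear-work sketch of upwards accumulation follows: each child's already-computed root value is reused, rather than recomputing whole subtree reductions. (The parallel, contraction-based algorithm of [7] is not shown.)

```python
def up_accum(f1, f2, t):
    # Replace each node by Hom(f1, f2) of the subtree rooted there.
    # Reusing the children's root results keeps the work linear.
    if t[0] == 'leaf':
        return ('leaf', f1(t[1]))
    _, t1, a, t2 = t
    u1, u2 = up_accum(f1, f2, t1), up_accum(f1, f2, t2)
    root = lambda u: u[1] if u[0] == 'leaf' else u[2]
    return ('join', u1, f2(root(u1), a, root(u2)), u2)

# Subtree sums: node 2's subtree sums to 6, the whole tree to 15.
t = ('join', ('join', ('leaf', 1), 2, ('leaf', 3)), 4, ('leaf', 5))
add3 = lambda l, a, r: l + a + r
assert up_accum(lambda a: a, add3, t) == \
    ('join', ('join', ('leaf', 1), 6, ('leaf', 3)), 15, ('leaf', 5))
```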

The fourth useful family of tree homomorphisms are the downwards accumulations. Downwards accumulations replace each node of a tree by some function of the nodes on the path between it and the root. This models functions in which the flow of information is broadcast down through the tree.

We first define the type of non-empty paths, which are like lists except that they have two concatenation constructors, a left-turn constructor and a right-turn constructor. This type models the paths that occur between the root of a tree and any other node, where it is important to remember whether the path turns towards a left or right descendant at each step. The two constructors are mutually associative, that is

    a ⋆ (b ⋆⋆ c) = (a ⋆ b) ⋆⋆ c

where ⋆ and ⋆⋆ are either of the constructors. Homomorphisms on paths are the unique arrows to algebraic structures

    (P, f : A → P, ⊕ : P × P → P, ⊗ : P × P → P)

where ⊕ and ⊗ satisfy the same mutual associativity property. Define paths to be the function that replaces each node of a tree by the path between the root and that node. A downwards accumulation is a tree homomorphism that can be expressed as the composition of paths with a path homomorphism mapped over the nodes of the intermediate tree:

    downwards accumulation = TreeMap (PathHom (⊕, ⊗)) ∘ paths

If PathHom (⊕, ⊗) : Path A → X then the downwards accumulation has type signature Tree A → Tree X.

A downwards accumulation is shown in Figure 7. The value that results at each node of the tree depends on the values of the nodes that appear in the path between the node and the root, together with their structural arrangement, that is, the combination of left and right turns that appear along the path. Clearly it is expensive to compute a downwards accumulation by computing all of the paths and then mapping a path reduction over them. However, there are again large common expressions between each node and its parent, so that

Figure 7: A Downwards Accumulation — in the tree of Figure 3, the root is replaced by a, node b by a ⊕ b, leaf c by a ⊗ c, leaf d by (a ⊕ b) ⊕ d, and leaf e by (a ⊕ b) ⊗ e

the total number of expressions to be computed is linear in the size of the tree. As a result, there is an algorithm that runs in parallel time proportional to the logarithm of the number of nodes in the tree [7].

The sequential time complexity of a downwards accumulation is

    t_1(DownAccum (⊕, ⊗)) = n × (t_1(⊕) + t_1(⊗))

its parallel time complexity is

    t_n(DownAccum (⊕, ⊗)) = t_1(⊕) + ht × t_1(⊗)

and its parallel time complexity using extended tree contraction is

    t_n(DownAccum (⊕, ⊗))[tree contraction] = log n

because ⊕ and ⊗ must be O(1).
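A sequential sketch of downwards accumulation follows, threading each node's accumulated path value down the tree; one operation is applied on left turns and the other on right turns. The encoding is illustrative only.

```python
def down_accum(oplus, otimes, t, acc=lambda v: v):
    # acc maps this node's value to its path result; children extend
    # the path with oplus (left turn) or otimes (right turn).
    if t[0] == 'leaf':
        return ('leaf', acc(t[1]))
    _, t1, a, t2 = t
    v = acc(a)
    return ('join',
            down_accum(oplus, otimes, t1, lambda x: oplus(v, x)),
            v,
            down_accum(oplus, otimes, t2, lambda x: otimes(v, x)))

# Path sums with both operations as +: the root keeps 4, its left
# child gets 4+2, the leaves get the sums along their root paths.
t = ('join', ('join', ('leaf', 1), 2, ('leaf', 3)), 4, ('leaf', 5))
add = lambda p, x: p + x
assert down_accum(add, add, t) == \
    ('join', ('join', ('leaf', 7), 6, ('leaf', 9)), 4, ('leaf', 9))
```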

These complexity results are based on the assumption of one processor per document node. It is possible to implement all of these homomorphisms using fewer processors by allocating a region of each tree to a processor, which then applies each homomorphism to that region. The actual partitioning is complex, because a partition of a tree into regions is not normally itself a tree. However, with some care it is possible to get implementations of the fast parallel operations above in time complexity

    n/p + log p

for trees of size n using p processors [20]. For small p, this gives almost linear speed-up over sequential implementations, while for large p it gives almost logarithmic execution times.

Binary trees are easily extended to trees in which each internal node has a list of subtrees (so-called Rose trees [6]). Rose trees much more naturally model SGML tagged text. The complexity of the algorithms we will present only changes by a constant factor, since any Rose tree can be replaced by a binary tree without changing the order of magnitude of the number of nodes.

4 Global Properties of Documents

We now have a set of homomorphic tree operations that can be evaluated in parallel effectively. We now begin to show how these operations can be applied to structured text.

We begin with operations that are tree maps or tree reductions. A broad class of such operations are those that count the number of occurrences of some text or entity in a document. In such tree homomorphisms, the pair of functions used are of the general form

    f1 (a) = if a = entity name then 1 else 0
    f2 (t1, a, t2) = t1 + t2 + (if a = entity name then 1 else 0)

with types

    f1 : A → N
    f2 : N × A × N → N

Some tree operations can be factored into the composition of a tree map and a tree reduction. This is often an easy way to understand and compute the complexity of a homomorphism. Notice that f2 above can be written as

    f2 = ⊕ ∘ (id × f1 × id)

where ⊕ is ternary addition. We can express the `count entities' homomorphism above as a composition like this:

    count entities = TreeReduce (⊕) ∘ TreeMap (f1)

The tree map can be computed in a single parallel step, taking time t_1(f1). The tree reduction step can be computed using tree contraction in a logarithmic number of steps, each involving an application of functions related to f2, taking time log n. Thus the total parallel time complexity of evaluating this tree homomorphism is

    t_n(total time) = log n + t_1(f1)    (1)
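A fused sketch of the count-entities homomorphism follows; the tree map and the reduction are combined into one recursive pass, with entity names stored directly at the nodes for brevity (an illustrative encoding).

```python
def count_entities(name, t):
    # count entities = TreeReduce(+) ∘ TreeMap(f1), fused into one pass:
    # each node contributes 1 if its entity name matches, 0 otherwise.
    if t[0] == 'leaf':
        return 1 if t[1] == name else 0
    _, t1, a, t2 = t
    return (count_entities(name, t1) + count_entities(name, t2)
            + (1 if a == name else 0))

doc = ('join',
       ('join', ('leaf', 'para'), 'section', ('leaf', 'para')),
       'chapter',
       ('join', ('leaf', 'para'), 'section', ('leaf', 'figure')))
assert count_entities('para', doc) == 3
assert count_entities('section', doc) == 2
```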

The following document properties of frequency can be determined in the parallel time given by Equation 1:

- number of occurrences of a word (that is, a delimited terminal string),

- number of occurrences of a structure or entity (section, subsection, paragraph, figure),

- number of reference points (labels).

An extension of these counting tree homomorphisms produces simple data types as results. For example, to produce a list of the different entity names used in a document, we use a tree homomorphism with component functions

    f1 (a) = {entity name (a)}
    f2 (t1, a, t2) = t1 ∪ t2 ∪ {entity name (a)}

that produces a set containing the entity names that are present. Each leaf is replaced by the name of the entity it represents. Internal nodes merge the sets of entity names of their descendants with the name of the entity they represent. Using sets means that we record each entity name only once in the final set. To determine the number of different entities present in a document, we need only compute the size of the set produced by this tree homomorphism.

Changing the set used to a list, we define a tree homomorphism to produce a table of contents. It is

    f1 (a) = [a.string]
    f2 (t1, a, t2) = [a.string] ++ t1 ++ t2

where ++ is list concatenation, and a.string extracts the string attribute associated with node a. This homomorphism produces a list of all of the tag instances in the document in level order. The parallel time complexity of these tree homomorphisms is not straightforward to compute, for two reasons: the operation of concatenation is not necessarily constant time, so the computation of f2 will not be; and the size of lists grows with the distance from the leaves, creating a communication cost that must be accounted for on real computers [21].
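The table-of-contents homomorphism can be sketched as follows, with node attributes represented directly as strings for brevity (an illustrative encoding).

```python
def toc(t):
    # f1(a) = [a.string];  f2(t1, a, t2) = [a.string] ++ t1 ++ t2
    if t[0] == 'leaf':
        return [t[1]]
    _, t1, a, t2 = t
    return [a] + toc(t1) + toc(t2)

doc = ('join', ('leaf', 'Intro'), 'Chapter 1', ('leaf', 'Summary'))
assert toc(doc) == ['Chapter 1', 'Intro', 'Summary']
```

Note that list concatenation is not constant time, which is exactly why the complexity of this homomorphism is harder to state than Equation 1.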

Another useful class of tree homomorphisms computing global properties are those that compute properties of extent. The most obvious example computes the length of a document in characters. It is

    f1 (a) = length (a.string)
    f2 (t1, a, t2) = t1 + t2 + length (a.string)

Similarly, the tree homomorphism that computes the deepest nesting depth of structures in a document is

    f1 (a) = 0
    f2 (t1, a, t2) = ↑(t1, t2) + 1

Again their parallel time complexity is given by Equation 1.

Software is an example of structured text. Some tree homomorphisms that apply to software are: computing the number of statements, and building a simple (that is, unscoped) symbol table.

5 Search Problems

Another important class of operations on structured text involves searching it for data matching some key. Four levels of search can be distinguished:

1. Search on index terms. This requires preprocessing of the document to allocate search terms, with the attendant problems of choice varying between individuals. It is usually implemented using indexing. Parallelism can be used by partitioning the index.

2. Search on full text. This is implemented using a signature file technique [4, 22, 23] or special-purpose hardware [10, 11]. Parallelism can be used by partitioning the signature file.

3. Search on non-hierarchical tagged regions. This is a more expressive variant of full text search in which non-hierarchical tags are present in the text. Searches may include references to tags as well as to content. This approach is used in the PAT system [8] for searching the Oxford English Dictionary. The descriptive markup of historical documents is typically too ad hoc to be captured by SGML-style tags, but is nevertheless an important part of the organisation of the document. The PAT index contains the strings beginning at each new word or tag position in the document, stored in a Patricia tree.

Fulcrum use a similar idea with zone tags, associating a set of zones with each range of the text. Zones are added to the document index, allowing searches to be based on both content and zone. This allows content references to be modified by limited context information (e.g. "dog" within a section heading) [12].

4. Search on full text and structure. This allows search to include information about both contents and tags, and uses regular expressions, so that content references can be modified by context information (e.g. "dog" within the third section heading), and truncated term expansion is trivially available. No existing system has this capability, but the path expressions query language [14] allows such queries to be expressed, and we show in subsequent sections how such searches may be implemented efficiently.

Fortunately, this fourth level of search is no more expensive to implement in parallel than the previous three. Thus even if parallelism does not provide absolute performance improvements because of limits on disk speeds, it can provide improvements in functionality. We explore this style of search further in the next two sections.

6 Parallel Search of Flat Text

There is a well-known parallel algorithm for recognising whether a given string is a member of a regular language in time logarithmic in the length of the string [5, 9]. This algorithm is naturally parallel and readily adapted to document search, even on quite modest parallel computers. Although it was described for the Connection Machine [9], it never seems to have been used on any real parallel system. It is related to the technique used by Hollaar [10, 11], but is more expressive, since the use of special-purpose hardware limits the flexibility of Hollaar's searches. We describe the parallel algorithm and show how it

s 1 c

a

s 0 s 3

c a

b

b

s 2

Figure 8: A Simple Finite State Automaton

allows queries that are regular expressions, and hence can search for patterns

involving b oth content and tags.

Suppose that we want to determine if a given string is a member of a regular language over some alphabet {a, b, c}. The regular language can be defined by a finite state automaton whose transitions are labelled by elements of the alphabet. This automaton is preprocessed to give a set of tables, each one labelled with a symbol of the alphabet and consisting of pairs of states, such that the labelling symbol labels a transition from the first state of each pair to the second. For example, given the automaton in Figure 8, the tables are:

       a           b           c
    (s0, s1)    (s0, s2)    (s2, s1)
    (s1, s2)    (s2, s3)    (s1, s3)

The homomorphism on lists consists of two functions:

    f1(a) = Table(a) = { (si, sj) | transition si -a-> sj }
    f2(t1, t2) = { (si, sk) | (si, sj) in t1, (sj, sk) in t2 }

It will be convenient to write f2 as an infix binary operation, ⊛, on tables. The function f1 replaces each symbol in the input string with the table defining how that symbol maps states to states. The function f2 then composes tables to reflect the state-to-state mapping of longer and longer strings. Consider ⊛ applied to the tables for a and b:

       a           b           ab
    (s0, s1) ⊛ (s0, s2)  =  (s1, s3)
    (s1, s2)    (s2, s3)

At the end of the reduction, the single resulting table expresses the effect of the entire input string on states. If it contains a pair whose first element is the initial state and whose second element is a final state, the string is in the regular language.

All of the tables are of finite size, no larger than the number of transitions in the automaton. Each table composition takes time no longer than linear in the number of states of the automaton. The reduction itself takes time logarithmic in the size of the string being searched on a variety of parallel architectures [18].
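This reduction can be sketched sequentially in Python. This is a sketch only: the set-of-pairs representation of tables and all names are ours, and `functools.reduce` stands in for the parallel reduction.

```python
from functools import reduce

# Transition tables for the automaton of Figure 8, one per symbol;
# each table is the set of (state, state) pairs the symbol maps between.
TABLES = {
    "a": {("s0", "s1"), ("s1", "s2")},
    "b": {("s0", "s2"), ("s2", "s3")},
    "c": {("s2", "s1"), ("s1", "s3")},
}

def compose(t1, t2):
    """Binary table composition (the infix operation on tables):
    all (si, sk) such that (si, sj) is in t1 and (sj, sk) is in t2."""
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

def accepts(string, initial, finals):
    """f1: replace each symbol by its table; then reduce with compose.
    The string is in the language iff the resulting table relates the
    initial state to a final state."""
    table = reduce(compose, (TABLES[sym] for sym in string))
    return any(s == initial and t in finals for (s, t) in table)

ab_table = compose(TABLES["a"], TABLES["b"])  # the table for the string "ab"
```

In a parallel setting the same `compose` would be applied pairwise across processors; associativity of composition is what makes the logarithmic-depth reduction valid.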

The regular language recognition problem is easily adapted for query processing. Suppose we wish to determine if some regular expression, RE, is present in an input string. This regular expression defines a language, L(RE), that is then extended to allow for the existence of other symbols and for the string described by the regular expression to appear in a longer string. Call this extended language L(RE)'. Then the search problem becomes: is the text s in the language L(RE)'? There is a finite state automaton corresponding to L(RE)'. It is based on the finite state automaton for L(RE) with the addition of extra transitions labelled with symbols from the extended alphabet. It can be as large as exponential in the size of the regular expression, but is independent of the size of the string being searched. An example of the automaton corresponding to the search for the word "cat" is shown in Figure 9, and the resulting algorithm on an input string "bcdabcat" in Figure 10.
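A sketch of this search in Python; the transition function below is our reading of the Figure 9 automaton (not part of the original text), and `reduce` again stands in for the parallel reduction.

```python
from functools import reduce

STATES = range(4)

def delta(state, symbol):
    """Our reading of the Figure 9 automaton for "cat": state 3 is
    absorbing; c starts (or restarts) a potential match; a and t extend
    a partial match; anything else falls back to state 0."""
    if state == 3:
        return 3
    if symbol == "c":
        return 1
    if state == 1 and symbol == "a":
        return 2
    if state == 2 and symbol == "t":
        return 3
    return 0

def table(symbol):
    # One state-to-state table per input symbol, as in Figure 10.
    return {(s, delta(s, symbol)) for s in STATES}

def compose(t1, t2):
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

# The reduction over the input string of Figure 10:
final = reduce(compose, (table(sym) for sym in "bcdabcat"))
# The pair (0, 3) in the final table shows the text contains "cat".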

Queries that are boolean expressions of simpler regular language queries could be evaluated by evaluating the simple queries independently and then carrying out the required boolean operations on the results. However, the closure of the class of regular languages under union, intersection, complementation, and concatenation means that such complex queries can be evaluated in a single pass through the data by constructing the appropriate deterministic finite state automaton. For example, a query of the form "are x and y in the text" amounts to asking if the text is a string of the augmented language of the intersection of the regular languages of the strings x and y.

Figure 9: Finite State Automaton for "cat". (States 0-3, with 0 -c-> 1, 1 -a-> 2, 2 -t-> 3; the symbol c leads to state 1 from states 0, 1, and 2; state 3 absorbs every symbol; all other symbols return to state 0.)

      b       c       d       a       b       c       a       t
    (0,0)   (0,1)   (0,0)   (0,0)   (0,0)   (0,1)   (0,0)   (0,0)
    (1,0)   (1,1)   (1,0)   (1,2)   (1,0)   (1,1)   (1,2)   (1,0)
    (2,0)   (2,1)   (2,0)   (2,0)   (2,0)   (2,1)   (2,0)   (2,3)
    (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)

        bc      da      bc      at
      (0,1)   (0,0)   (0,1)   (0,0)
      (1,1)   (1,0)   (1,1)   (1,3)
      (2,1)   (2,0)   (2,1)   (2,0)
      (3,3)   (3,3)   (3,3)   (3,3)

         bcda    bcat
        (0,0)   (0,3)
        (1,0)   (1,3)
        (2,0)   (2,3)
        (3,3)   (3,3)

          bcdabcat
           (0,3)
           (1,3)
           (2,3)
           (3,3)

Figure 10: Algorithm Progress for the Search

Regular language recognition for strings is easily generalized to trees. The first step applies a tree map that replaces each node of the tree by a table mapping states to states; that is, the tree homomorphism:

    Hom(f1, f2) with f1(a) = Leaf(Table(a)),  f2 = Join . (id x f1 x id)

The second step is a tree reduction, in which the tables of a node and its two subtrees are composed using a ternary generalisation of the composition operator used in the string algorithm above:

    Hom(id, f2)(t1, t2, t3) = ⊛(t2, t1, t3)

where table composition has been generalized to a ternary operation:

    ⊛(t2, t1, t3) = { (si, sl) | (si, sj) in t2, (sj, sk) in t1, (sk, sl) in t3 }

Notice that the composition of tables composes the table belonging to the internal node first, followed by the tables corresponding to the subtrees in order. This represents the case where the text in the internal node represents some kind of heading. The order may or may not be significant in an application, but care needs to be taken to get it right so that strings that cross entity boundaries are properly detected. The extension of the search in a linear string to a tree is shown in Figure 11.
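The tree version can be sketched in the same sequential style. The tuple encoding of trees and all names are ours, chosen for illustration; the automaton is the one for "cat".

```python
from functools import reduce

STATES = range(4)

def delta(state, symbol):
    # Transitions of the "cat" automaton (state 3 absorbing).
    if state == 3:
        return 3
    if symbol == "c":
        return 1
    if state == 1 and symbol == "a":
        return 2
    if state == 2 and symbol == "t":
        return 3
    return 0

def compose(t1, t2):
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

def str_table(text):
    # Table for a non-empty string: compose the per-symbol tables.
    return reduce(compose, ({(s, delta(s, sym)) for s in STATES}
                            for sym in text))

def compose3(t2, t1, t3):
    """Ternary composition: the internal node's table first, then the
    left subtree's table, then the right subtree's table."""
    return compose(compose(t2, t1), t3)

def tree_table(node):
    # A node is either ("leaf", text) or ("join", left, text, right).
    if node[0] == "leaf":
        return str_table(node[1])
    _, left, text, right = node
    return compose3(str_table(text), tree_table(left), tree_table(right))

# The node text is read before the left subtree, then the right one,
# so this tree spells out "c" + "a" + "t" = "cat":
tree = ("join", ("leaf", "a"), "c", ("leaf", "t"))
```

The recursion here plays the role of the tree reduction; the parallel version replaces it with tree contraction, as described in the appendix.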

Figure 11: Algorithm Progress for Tree Search. (Each node of the tree is replaced by its state-to-state table; tree contraction then repeatedly composes the table of an internal node with the tables of its subtrees. A final table containing the pair (0,3) indicates that "cat" occurs in the tree.)

This algorithm takes parallel time logarithmic in the number of nodes of the tree, using the tree contraction algorithm discussed in the previous section. The automaton can be extended as before to solve query problems, giving a fast parallel query evaluator for a useful class of queries.

Query evaluation problems that are of this kind include: word and phrase search, and boolean expressions involving phrase search. Such queries are of about the complexity of those permitted by the PAT system.

Because the tree structure of structured text encodes useful information, we now turn to extending this search algorithm to a new kind of search in which not only the presence of data but also its relationships can be expressed in the query. This increases the power of the query language substantially. It turns out that such structured queries can still be computed in parallel within logarithmic time bounds, making structured queries of practical importance.

7 Accumulations and Information Transfer

Accumulations allow nodes to find out information about all their neighbours in a particular direction. Upwards accumulations allow each node of a tree to accumulate information about the nodes below it. Downwards accumulations allow each node to accumulate information about those nodes that lie between it and the root. Powerful combination operations are formed when an upwards accumulation is followed by a downwards accumulation, because this makes arbitrary information flow between nodes of the tree possible. As we have seen, both kinds of accumulations can be computed fast in parallel if their component functions are well-behaved.

Some examples of upwards accumulations are: computing the length in characters of each object in the document; and computing the offset from each segment to the next similar segment, for example the offset from each section heading to the next section heading.
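The first of these can be sketched sequentially as follows; the tree encoding and names are ours, for illustration only. Each node receives the total character length of the subtree it roots.

```python
# A sequential sketch of an upwards accumulation: annotate each node
# with the total character length of everything at or below it.
# A node is (text, children).

def upwards_lengths(node):
    """Return (annotated_tree, length), where the annotated tree pairs
    each node's text with the length of its whole subtree."""
    text, children = node
    annotated, total = [], len(text)
    for child in children:
        sub, n = upwards_lengths(child)
        annotated.append(sub)
        total += n
    return ((text, total, annotated), total)

doc = ("Heading", [("para one", []), ("para two", [("note", [])])])
tree, n = upwards_lengths(doc)
```

In the parallel setting the same computation is an upwards accumulation whose combining function (addition) is associative and constant-time, so it meets the conditions for fast parallel evaluation.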

Some examples of downwards accumulations are: structured search problems, that is, searching for a part of a document based on its content and its structural properties; and finding all references to a single label. Structured search is so important that we investigate its implementation further in the next section.

Gibbons gives a number of detailed examples of the usefulness of upwards followed by downwards accumulations in [6]. Some examples that are important for structured text are: evaluating attributes in attribute grammars, which can represent almost any property of a document; generating a symbol table for a program written in a language with scopes; rendering trees, which is important for navigating document archives; determining which page each object would fall on if the document were produced on some device; determining which node the i-th word in a document is in; and determining the font of each object in systems where font is relative, not absolute, such as LaTeX.

Operations such as resolving all cross references and determining all the references to a particular point can also be cast as upwards followed by downwards accumulations, but the volume of data that might be moved on some steps makes this expensive.

8 Parallel Search of Structured Text

Queries on structured text involve finding nodes in the tree based on information about the content of the node (its text and other attributes) and on its context in the tree, particularly its relative context, such as "the third section". Here are some examples, based on [14]:

Example 6  document where `database' in document

This returns those documents that contain the word `database'.

Example 7  document where `Smith' in Author

This returns those documents where the word `Smith' occurs as part of the Author structure within each document. This query depends partly on structural information (that an Author substructure exists) as well as text. Notice that the object returned depends on a condition on the structure below it in the tree.

Example 8  Section of document where `database' in document

This returns all sections of documents that contain the word `database'. The object returned depends on a condition of the structure above it in the tree. Notice that there is a natural way to regard this as a two-step operation: first select the documents, then select the sections.

Example 9  Section of document where `database' in SectionHeading

This returns those sections whose headings contain the word `database'. Notice that the object returned depends on a condition of a structure that is neither above nor below it in the tree, but is nevertheless related to it.

All of these queries describe patterns in the tree; patterns may include "don't care"s in both the nodes and the branch structure. Queries are relative to a particular node in the tree (usually the root) and return a bag of nodes corresponding to the roots of trees containing the pattern (bags are sets in which repetitions count; we want to know all the places where a pattern is present, so the result must allow for more than one solution). Allowing queries to take a bag of nodes as their inputs, so that searches begin from all of these nodes, allows queries to be composed. Note that a node is precisely identified by the path between itself and the root, so we might as well think of node identifiers as paths.

There are two different kinds of bag operations in the query examples above. The first are filters that take a bag of nodes and return those elements of the bag that satisfy some condition. The second are moves that take a bag of nodes and return a bag in which each node has been replaced either by a node related to it (its ancestor, its descendant, its sibling to the right) or by one of its attributes. A simple query is usually a filter followed by the value of an attribute at all the nodes that have passed the filter. More complex queries such as the last example above require more complex moves.

This insight is the critical one in the design of path expressions [15], a general query language for structured text applications. The crucial property of path expressions that we require is that filters can be broken up into searches for patterns that are single paths, and are therefore expressible as regular expressions over paths.

Such filters can be computed by replacing each node of the structured text tree by the path from it to the root, and then applying the regular string recognition algorithm, extended to include left and right turns, to each of these paths. Those nodes for which the string recognition algorithm returns True are the nodes selected by the filter.

For example, a query about the presence of a chapter whose first section contains the word "About" is equivalent to asking if there is a path in the tree that contains the following:

    (entity = chapter) (left turn) (entity = section and "About" in section.string)

Figure 12: A Tree Annotated with Turn Information. (The root a has a left child b and a right child a; that node has a left child b and a right child c; c has a left child a and a right child t. Each edge is labelled L or R according to its direction.)

A tree annotated with left and right turns is shown in Figure 12, and the result of applying the paths function to it is shown in Figure 13.
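Under our reading of the tree of Figure 12, the paths function can be sketched as a simple downwards traversal; this is a sequential stand-in for the downwards accumulation, and the tuple encoding of the tree is ours.

```python
# The paths function, sketched sequentially: each node is replaced by
# the path from the root to it, recording L or R at each turn.
# A node is (label, left_child, right_child), with None for no child.

def paths(node, prefix=""):
    label, left, right = node
    here = prefix + label
    result = [here]
    if left is not None:
        result += paths(left, here + "L")
    if right is not None:
        result += paths(right, here + "R")
    return result

# The tree of Figure 12:
fig12 = ("a",
         ("b", None, None),
         ("a",
          ("b", None, None),
          ("c", ("a", None, None), ("t", None, None))))
```

Calling `paths(fig12)` yields the node-by-node path strings that Figure 13 displays.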

The regular language recognition algorithm for strings can be extended to paths straightforwardly. The query string may contain references to right and left turns, and so may the extended regular language. Otherwise the operations of table construction and binary table composition ⊛ behave just as before.

Filters can be expressed as:

    filter = TreeMap (PathHom (Table, ⊛)) . paths

The right hand side replaces the text tree with the tree in which each node contains the path between itself and the root. The TreeMap then maps the regular language recognition algorithm for paths over each of these nodes. Those that find an instance of the string are the roots of subtrees that correspond to the query pattern.

Figure 13: The Result of Applying the paths Function. (From the root down, the nodes become: a; aLb, aRa; aRaLb, aRaRc; aRaRcLa, aRaRcRt.)

But we have already seen that operations that can be expressed as maps over paths are downward accumulations, and hence can be computed efficiently in parallel when the component functions are well-behaved. We have also seen that table composition is a constant-time, associative function and so satisfies the requirements for tree contraction. Thus there is a logarithmic time parallel algorithm for computing such filters:

    t_n(filter) = O(log n)

Path expression queries may therefore be included in structured text systems without additional cost, dramatically improving the sophistication of the way in which such resources can be queried.

9 Conclusions

The SGML approach of structural tagging of entities makes it possible to model structured text as a data type. This in turn makes it possible to use the machinery of categorical data types to examine trees and tree homomorphisms. This reveals the underlying similarities between apparently different operations on structured text, suggests sequential and parallel implementations for them, and shows how their construction can be reduced to the construction of component functions.

In particular we have shown how tree contraction and its extensions can be used to build fast parallel implementations of standard search techniques. A major contribution of the paper is the discovery of a fast parallel implementation for searches based on path expressions. These allow much more sophisticated searching than existing query languages, with the same performance.

A The Tree Reduction Algorithm

The tree reduction algorithm is well-known in the parallel algorithms literature. We follow the presentation in [7].

We begin by defining a contraction operation that applies to any node and its two descendants, if at least one descendant is a leaf. Suppose that each internal node u contains a pointer u.p to its parent, a boolean flag u.left that is true if it is a left descendant, a boolean flag u.internal that is true if the node is an internal node, a variable u.g describing an auxiliary function of type A -> A, and, for internal nodes, two pointers, u.l and u.r, pointing to their left and right descendants respectively.

We describe an operation that replaces u, u.l, and u.r, where u.l is a leaf, by a single node u.r, as shown in Figure 14. A symmetric operation contracts in the other direction when u.r is a leaf. The operations required are:

    u.r.g <- λx. u.g(f2(u.l.g(u.l.a), u.a, u.r.g(x)))
    u.r.p <- u.p
    if u.left then u.p.l <- u.r else u.p.r <- u.r
    u.r.left <- u.left
    if u = root then root <- u.r

The first step is the most important one. It `folds in' an application of f2 so that the function computed at node u.r after this contraction operation is the one that would have been computed at u before the operation. The contraction algorithm will only be efficient if this step is done quickly and the resulting expression does not grow long. We will return to this point below.

Figure 14: A Single Tree Contraction Step: Nodes u and u.l are deleted from the tree, and u.r is updated.

The contraction operations must be applied to about half of the leaves on each step if the entire contraction is to be completed in about a logarithmic number of steps. The algorithm for deciding where to apply the contraction operations is the following:

1. Number the leaves left to right beginning at 0; this can be done in O(log n) time using O(n / log n) processors [3].

2. For every u such that u.l is an even numbered leaf, perform the contraction operation.

3. For every u that was not involved in the previous step, and for which u.r is an even numbered leaf, perform the contraction operation.

4. Renumber the leaves by dividing their positions by two and taking the integer part.

Sufficient conditions for preventing the lambda expressions in the u.g's from growing large are the following [1]:

1. For all nodes u, the sectioned function f2(., a, .), of type A x A -> A, is drawn from an indexed set of functions F, and the function u.g is drawn from an indexed set of functions G. Both F and G contain the identity function.

2. All functions in F and G can be applied in constant time.

3. For all f in F, g_i in G, and a in A, the functions λx. f(g_i(x), a) and λx. f(a, g_i(x)) are in G, and their indices can be computed from a and the indices of f and g_i in constant time.

4. For all g_i, g_j in G, the composition g_i . g_j is in G and its index can be computed from i and j in constant time.
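These closure conditions can be illustrated with a classic indexed family. The example is ours, not from the paper: take A to be numbers, let f2(x, a, y) = x + a + y, and let G be the affine functions x -> p*x + q, each represented by its index (p, q). Application, composition, and folding in a section of f2 are then all constant-time operations on indices.

```python
# G is the family of affine functions x -> p*x + q, indexed by (p, q).

def apply_g(g, x):
    # Constant-time application from the index.
    p, q = g
    return p * x + q

def compose_g(g1, g2):
    """Index of g1 . g2: (g1 . g2)(x) = p1*(p2*x + q2) + q1,
    computed in constant time, and still an affine function."""
    (p1, q1), (p2, q2) = g1, g2
    return (p1 * p2, p1 * q2 + q1)

IDENT = (1, 0)  # index of the identity function, required to be in G

def fold_f2(g, a, c):
    """Folding a section of f2(x, a, y) = x + a + y into g = (p, q):
    x -> f2(g(x), a, c) = p*x + (q + a + c), again affine."""
    p, q = g
    return (p, q + a + c)
```

Because every operation here produces another index in constant time, the lambda expressions built during contraction never grow, which is exactly what the four conditions demand.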

These conditions ensure three properties: that no function built at a node takes more than constant time to derive from the functions of the three nodes it is replacing, that no such function takes more than constant time to evaluate, and that no such function requires more than a constant amount of storage. Together these three properties guarantee that the tree contraction algorithm executes in logarithmic parallel time.

References

[1] K. Abrahamson, N. Dadoun, D.G. Kirkpatrick, and T. Przytycka. A simple parallel tree contraction algorithm. In Proceedings of the Twenty-Fifth Allerton Conference on Communication, Control and Computing, pages 624-633, September 1987.

[2] M. Cole. Parallel programming, list homomorphisms and the maximum segment sum problem. In D. Trystram, editor, Proceedings of Parco 93. Elsevier Series in Advances in Parallel Computing, 1993.

[3] R. Cole and U. Vishkin. Faster optimal parallel prefix sums and list ranking. Information and Control, 81:334-352, 1989.

[4] C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2:267-288, April 1984.

[5] Charles N. Fischer. On Parsing Context-Free Languages in Parallel Environments. PhD thesis, Cornell University, 1975.

[6] J. Gibbons. Algebras for Tree Algorithms. D.Phil. thesis, Programming Research Group, University of Oxford, 1991.

[7] J. Gibbons, W. Cai, and D.B. Skillicorn. Efficient parallel algorithms for tree accumulations. Science of Computer Programming, 23:1-14, 1994.

[8] G.H. Gonnet, R.A. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 66-82. Prentice-Hall, 1992.

[9] W. Daniel Hillis and G.L. Steele. Data parallel algorithms. Communications of the ACM, 29(12):1170-1183, December 1986.

[10] L. Hollaar. The Utah Text Search Engine: Implementation experiences and future plans. In Database Machines: Fourth International Workshop, pages 367-376. Springer-Verlag, 1985.

[11] L. Hollaar. Special-purpose hardware for text searching: Past experience, future potential. Information Processing and Management, 27:371-378, 1991.

[12] Fulcrum Technologies Inc. Ful/Text reference manual. Fulcrum Technologies, Ottawa, Ontario, Canada, 1986.

[13] Information processing - Text and office systems - Standard Generalized Markup Language (SGML), 1986.

[14] I.A. Macleod. A query language for retrieving information from hierarchical text structures. The Computer Journal, 34(3):254-264, 1991.

[15] I.A. Macleod. Path expressions as selectors for non-linear text. Preprint, 1993.

[16] E.W. Mayr and R. Werchner. Optimal routing of parentheses on the hypercube. In Proceedings of the Symposium on Parallel Architectures and Algorithms, June 1993.

[17] G. Salton and C. Buckley. Parallel text search methods. Communications of the ACM, 31:203-215, 1988.

[18] D.B. Skillicorn. Architecture-independent parallel computation. IEEE Computer, 23(12):38-51, December 1990.

[19] D.B. Skillicorn. Foundations of Parallel Computing. Cambridge Series in Parallel Computation. Cambridge University Press, 1994.

[20] D.B. Skillicorn. Parallel implementation of tree skeletons. Technical Report 95-380, Queen's University, Department of Computing and Information Science, March 1995.

[21] D.B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. Journal of Parallel and Distributed Computing, to appear. An early version appears as Department of Computer Science Technical Report 92-329.

[22] C. Stanfill and B. Kahle. Parallel free-text search on the Connection Machine. Communications of the ACM, 29:1229-1239, 1986.

[23] C. Stanfill and R. Thau. Information retrieval on the Connection Machine. Information Processing and Management, 27:285-310, 1991.

[24] H. Stone. Parallel querying of large databases. IEEE Computer, 20(10):11-21, October 1987.