Structured Parallel Computation in

Structured Documents

D.B. Skillicorn

[email protected]

March 1995

External Technical Report

ISSN-0836-0227-95-379

Department of Computing and Information Science

Queen's University

Kingston, Ontario K7L 3N6

Document prepared March 6, 1995

© Copyright 1995 D.B. Skillicorn

Abstract

Document archives contain large amounts of data to which sophisticated queries are applied. The size of archives and the complexity of evaluating queries make the use of parallelism attractive. The use of semantically-based markup such as SGML makes it possible to represent documents and document archives as data types.

We present a theory of trees and homomorphisms, modelling structured text archives and operations on them, from which it can be seen that:

- many apparently-unrelated tree operations are homomorphisms;

- homomorphisms can be described in a simple parameterised way that gives standard sequential and parallel implementations for them;

- special classes of homomorphisms have parallel implementations of practical interest. In particular, we develop an implementation for path expression search, a novel and powerful query facility for structured text, that takes time logarithmic in the text size.

Keywords: structured text, categorical data type, software development methodology, parallel algorithms, query evaluation.

1 Structured Parallel Computation

Computations on structured documents are a natural application domain for parallel processing. This is partly because of the sheer size of the data involved. Document archives contain terabytes of data, and analysis of this data often requires examining large parts of it. Also, advances in parallel computer design have made it possible to build systems in which each processor has sizeable storage associated with it, and which are therefore ideal hosts for document archives. Such machines will become common in the next decade.

The use of parallelism has been suggested for document applications for some time. Some of the drawbacks have been pointed out by Stone [24] and Salton and Buckley [17]; these centre around the need of most parallel applications to examine the entire text database where sequential algorithms examine only a small portion, and the consequent performance degradation in accessing secondary storage. While this point is important, it has been to some extent overtaken by developments in parallel computer architecture, particularly the storage of data in disk arrays, with some disk storage local to each processor. As we shall show, the use of parallelism allows such an increase in the power of query operations that it will be useful even if performance is not significantly increased.

Parallel computation is difficult in any application domain for the following reasons:

- There are many degrees of freedom in the design space of the software, because partitioning computations among processors strongly influences communication and synchronisation patterns, which in turn have a strong effect on performance. Hence finding good algorithms requires extensive searching in the absence of other information;

- Parallelism in an algorithm is only useful if it can be harnessed by some available parallel architecture, and harnessed in an efficient way;

- Expressing algorithms in a way that is abstract enough to survive the replacement of underlying parallel hardware every few years is difficult;

- It is hard to predict the performance of software on parallel machines without actually developing it and trying it out.

Making parallelism the workhorse of structured document processing requires finding solutions to these difficulties.

The extensive use of semantically-based markup, and particularly the use of SGML [13], means that most documents have a de facto tree structure. This makes it possible to model them by a data type with enough formality that useful theory can be applied. We will use the theory of categorical data types [19], a particular approach to initiality, emphasising its ability to hide those aspects of a computation that are most difficult in a parallel setting.

Categorical data types generalise abstract data types by encapsulating not only the representation of the type, but also the implementation of homomorphisms on it. In object-oriented terms, the only methods available on constructed types are homomorphisms. As a parallel programming model this is ideal, since the partitioning of the data objects across processors, and the communication patterns required to evaluate homomorphisms, can remain invisible to programmers. Programs can be written as compositions of homomorphisms without any necessary awareness that the implementations of these homomorphisms might contain substantial parallelism.

Recall that a homomorphism is a function that respects the structure of its arguments. If an argument object is a member of a data type with constructor ⋈, then h is a homomorphism if there exists an operation ⊛ such that

    h (a ⋈ b) = h a ⊛ h b

This equation describes two different ways of computing the result of applying h to the argument a ⋈ b. The left hand side computes it by building the argument object completely and then just applying h to this object. The right hand side, however, applies h to the component objects from which the argument was built, and then applies the operation ⊛ to the results of these two applications.

There are two things to note about the computation strategy implied by the right hand side. The first is that it is recursive. If b is itself an object built up from smaller objects, say b = c ⋈ d, then

    h b = h c ⊛ h d

and so

    h (a ⋈ b) = h a ⊛ (h c ⊛ h d)

The structure of the computation follows the structure of the argument. Second, the evaluations of h on the right hand sides are independent and can therefore be evaluated in parallel if the architecture permits it. These simple ideas lead to a rich approach to parallel computation.
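For concreteness, the homomorphism equation can be checked on a familiar type. The sketch below uses Python, with lists as the data type, concatenation as the constructor ⋈, len as h, and addition as the combining operation ⊛; the encoding is illustrative only.

```python
# h = len is a homomorphism on lists: len(a + b) == len(a) + len(b),
# i.e. h (a ⋈ b) = h a ⊛ h b with ⋈ = concatenation and ⊛ = +.

def h(xs):
    return len(xs)

a = [1, 2, 3]
b = [4, 5]

lhs = h(a + b)       # build the whole argument, then apply h
rhs = h(a) + h(b)    # apply h to the components, combine with ⊛
assert lhs == rhs == 5
```

The two independent calls on the right hand side are exactly the ones that could proceed in parallel.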

Homomorphisms include many of the interesting functions on constructed data types. In particular, all injective functions are homomorphisms. Furthermore, all functions can be expressed as almost-homomorphisms [2], the composition of a homomorphism with a projection, and this is often of practical importance.

In the next section we introduce the construction of a type for trees to represent structured text. We show that the construction reduces algorithm design for homomorphisms to the simpler problem of finding component functions, and illustrate a recursive parallel schema for computing homomorphisms on trees. In Section 3, we distinguish four special kinds of homomorphisms that capture common patterns of information flow in tree algorithms, and for which it is worth optimising the standard schema implementation. These are: tree maps, tree reductions, and upwards and downwards accumulations. In the subsequent five sections we illustrate tree homomorphisms of increasing sophistication, beginning with the computation of global document properties, then search problems (that is, query evaluation), and finally problems that involve communicating information throughout documents.

2 Parallel Operations on Trees

We build the type of homogeneous binary trees, that is, trees in which internal nodes and leaves are all of the same type. Binary trees are too simple to represent the full structure of tagged texts, since any tagged entity may contain many subordinate entities, but they simplify the exposition without affecting any of the complexity results.

In SGML an entity is delimited by start and end tags. The region between the start and end tags is either `raw' text or is itself tagged. The structure is

Figure 1: Modelling a Document as a Tree (a document root, with chapter nodes below it, section nodes below those, and para leaves)

hierarchical, and can be naturally represented by a tree whose internal nodes represent tags, and whose leaves represent `raw' data. All nodes may contain values for attributes. In particular, tags will often have associated text; for example, chapter tags contain the text of the chapter heading as an attribute, figure tags contain figure captions, and so on. Thus a typical document can be represented as a tree of the kind shown in Figure 1. Document archives can be modelled by trees as well, in which nodes near the root represent document classifications, as shown in Figure 2. We build trees over some base type A that can model entities and their attributes. For our purposes this is a tuple consisting of an entity name and a set of attributes. In our particular examples it will suffice to have one attribute, the string of text associated with each node of the document.

Trees have two constructors:

    Leaf : A → Tree A
    Join : Tree A × A × Tree A → Tree A

The first constructor, Leaf, takes a value of type A and makes it into a tree consisting of a single leaf. The second constructor, Join, takes two trees and a value of type A and makes them into a new tree by joining the subtrees and

Figure 2: Modelling a Document Archive (an archive root whose children are classifications such as manuals and novels, above individual documents and their chapters, sections, and paragraphs)

Figure 3: A Constructor Expression and the Tree it Describes — the expression Join(Join(Leaf d, b, Leaf e), a, Leaf c) describes the tree with root a, a left subtree rooted at b with leaves d and e, and a right leaf c

putting the value of type A at the internal node generated. Thus a tree is either a single leaf, or a tree obtained by joining two subtrees together with a new value at the join. A constructor expression describing a tree and the tree itself are shown in Figure 3.
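These constructors can be encoded directly. The following sketch uses tagged Python tuples; the encoding is illustrative, not prescribed by the theory.

```python
# Hypothetical encoding of the two tree constructors as tagged tuples.
def Leaf(a):
    return ('leaf', a)

def Join(t1, a, t2):
    return ('join', t1, a, t2)

# The tree of Figure 3: Join(Join(Leaf d, b, Leaf e), a, Leaf c)
t = Join(Join(Leaf('d'), 'b', Leaf('e')), 'a', Leaf('c'))
assert t[0] == 'join' and t[2] == 'a'
```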

Definition 1 A homomorphism h on trees is a function that respects tree structure; that is, there must exist functions f1 and f2 such that

    h (Leaf a) = f1 a

and

    h (Join (t1, a, t2)) = f2 (h t1, a, h t2)

Call f1 and f2 the component functions of the homomorphism h.

Here f2 is the glueing function that relates the effect of h on pieces of the argument to its effect on the whole argument. Notice that if h has type

    h : Tree A → P

then the types of f1 and f2 are

    f1 : A → P
    f2 : P × A × P → P

Thus finding a homomorphism amounts to finding such a pair of functions. However, finding h means being aware of the structure of Tree A, which is usually complex, whereas f1 and f2 are operations on usually simpler types. This is a considerable practical advantage.

Result 2 [19] Every homomorphism on trees is actually a function between an algebra (that is, a set with some operations defined on it) (Tree A, Leaf, Join) and a related algebra (P, f1, f2) for some type P and a pair of functions

    f1 : A → P
    f2 : P × A × P → P

Note that the type signatures of f1 and f2 are obtained from those of Leaf and Join by replacing occurrences of Tree A by P.

Furthermore, the homomorphism to each such algebra is unique, and there is therefore a one-to-one correspondence between algebras and homomorphisms. Each homomorphism is completely determined by the functions f1 and f2. This correspondence justifies the notation

    Hom (f1, f2)

for the necessarily unique tree homomorphism from (Tree A, Leaf, Join) to the algebra (P, f1, f2).

Using the fact that h is a homomorphism and that it corresponds to a pair of component functions, we can now make formal the recursive computation of h that we saw earlier. The recursive schema for evaluating all tree homomorphisms is shown in pseudocode in Figure 4. It is clear that the two recursive calls to evaluate tree homomorphisms can be executed in parallel. Using this simple approach to parallelism, all tree homomorphisms on trees can be evaluated in parallel time proportional to the height of the tree, assuming that the evaluations of f1 and f2 take only constant time.

Here are some simple examples of tree homomorphisms:

evaluate_tree_homomorphism (f1, f2, t) =
  case t of
    Leaf(a) :
      return f1 (a) ;
    Join(Tree(t1), a, Tree(t2)) :
      return f2 (
        evaluate_tree_homomorphism (f1, f2, t1),
        a,
        evaluate_tree_homomorphism (f1, f2, t2) )
  end case
end

Figure 4: Tree Homomorphism Evaluation Schema

Example 3 The tree homomorphism that computes the number of leaves in a tree replaces the value at each leaf by the constant 1, and then sums these values over the tree. It is given by

    number-of-leaves = Hom (K1, f2)

where K1 : A → N is the constant 1 function and f2 : N × A × N → N is f2 (t1, a, t2) = t1 + t2. The recursive schema specialises to:

evaluate_tree_homomorphism (t) =
  case t of
    Leaf(a) :
      return K_1 (a) ;  { = 1 }
    Join(Tree(t1), a, Tree(t2)) :
      return
        evaluate_tree_homomorphism (t1)
        +
        evaluate_tree_homomorphism (t2)
  end case
end

Example 4 The tree homomorphism to compute the number of internal nodes in a tree replaces each leaf by the value 0 and then adds 1 to the sum for each internal node encountered. It is

    number-of-internal-nodes = Hom (K0, f2)

where K0 : A → N is the constant 0 function, and f2 : N × A × N → N is f2 (t1, a, t2) = t1 + t2 + 1.

Example 5 The tree homomorphism that finds the maximum value in a tree does nothing at the leaves and at internal nodes selects the maximum of the three values from its two subtrees and the node itself. It is

    treemax = Hom (id, f2)

where f2 (t1, a, t2) = ↑(t1, a, t2) and ↑ is ternary maximum.
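The three examples above can all be run through one generic evaluation schema. The sketch below encodes trees as tagged Python tuples and Hom (f1, f2) as a recursive function; the helper names are ours, for illustration only.

```python
def Leaf(a): return ('leaf', a)
def Join(t1, a, t2): return ('join', t1, a, t2)

def hom(f1, f2, t):
    """Evaluate Hom(f1, f2) by the recursive schema of Figure 4."""
    if t[0] == 'leaf':
        return f1(t[1])
    _, t1, a, t2 = t
    return f2(hom(f1, f2, t1), a, hom(f1, f2, t2))

t = Join(Join(Leaf(4), 7, Leaf(2)), 9, Leaf(5))

number_of_leaves = lambda t: hom(lambda a: 1, lambda l, a, r: l + r, t)
number_of_internal = lambda t: hom(lambda a: 0, lambda l, a, r: l + r + 1, t)
treemax = lambda t: hom(lambda a: a, lambda l, a, r: max(l, a, r), t)

assert number_of_leaves(t) == 3      # leaves 4, 2, 5
assert number_of_internal(t) == 2    # internal nodes 7, 9
assert treemax(t) == 9
```

Only the component functions change between the three homomorphisms; the traversal is shared.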

The fact that all tree homomorphisms are the same in some sense is useful in two ways. First, finding tree homomorphisms reduces to finding component functions. Since component functions describe local actions at the nodes of trees, they are usually simpler conceptually than homomorphisms. There is a separation between common global tree structure and detailed node computations. Second, there is a common mechanism for computing any homomorphism, so it makes sense to spend time optimising it, rather than developing a variety of ad hoc techniques.

Nevertheless, there are some classes of homomorphisms for which better implementations are possible. These classes have many members that are of practical interest.

3 Four Classes of Tree Homomorphisms

There are four special classes of homomorphisms that are paradigms for common tree computations. They are: tree maps, tree reductions, and two forms of tree accumulations, upwards and downwards.

In our discussion of implementations, we will assume either the EREW PRAM or a distributed-memory hypercube architecture as the target computer. The EREW PRAM does not charge for communication, and our implementations can all be arranged on the hypercube so that communication is always with nearest neighbours, except for the tree contraction algorithm used in several places. This enables us to include communication costs in our complexity measures. We use a result of Mayr and Werchner [16] to justify the complexity of tree contraction on the hypercube. We will also assume that a tree of n nodes is processed by an n-processor system, so that there is a processor per tree node. We return to this very unrealistic assumption later.

A tree map is a tree homomorphism that applies a function to all the nodes of a tree, but leaves its structure unchanged. If f is some function f : A → B, then TreeMap f is the function

    TreeMap f = Hom (Leaf ∘ f, Join ∘ (id × f × id)) : Tree A → Tree B

The recursive schema becomes

evaluate_tree_homomorphism (t) =
  case t of
    Leaf(a) :
      return Leaf (f (a)) ;
    Join(Tree(t1), a, Tree(t2)) :
      return Join (
        evaluate_tree_homomorphism (t1),
        f (a),
        evaluate_tree_homomorphism (t2) )
  end case
end

The recursion "unpacks" the argument tree into a set of leaves and internal nodes. These are then joined back together in exactly the same structure, except that the function f is applied to each value before it is placed back in the tree. The resulting tree has exactly the same shape as the argument tree; only the values and types at the leaves and internal nodes have changed.

Using the recursive schema as an implementation is therefore unnecessarily cumbersome. A tree map can be computed directly by skipping the disassembly and subsequent reassembly and just applying f to each leaf and internal node. In a parallel implementation a tree map can be computed without any communication.

Let t_i(f) denote the time complexity of f on i processors. The sequential complexity of a tree map is given by

    t_1(TreeMap f) = n × t_1(f)

and the parallel complexity by

    t_n(TreeMap f) = t_1(f)
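A direct, communication-free tree map can be sketched as follows, using the same illustrative tuple encoding as before.

```python
def tree_map(f, t):
    # Apply f at every node. The shape is unchanged, and the two
    # recursive calls are independent, so in a parallel setting each
    # processor can apply f to its own node with no communication.
    if t[0] == 'leaf':
        return ('leaf', f(t[1]))
    _, t1, a, t2 = t
    return ('join', tree_map(f, t1), f(a), tree_map(f, t2))

t = ('join', ('leaf', 1), 2, ('leaf', 3))
assert tree_map(lambda x: x * 10, t) == ('join', ('leaf', 10), 20, ('leaf', 30))
```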

A tree reduction is a tree homomorphism that replaces structure without explicitly manipulating values. Applied to a tree, a tree reduction most often produces a single value. Formally, a tree reduction is a tree homomorphism

    TreeReduce g = Hom (id, g) : Tree A → A

where the type of g is

    g : A × A × A → A

The recursive schema becomes

evaluate_tree_homomorphism (t) =
  case t of
    Leaf(a) :
      return a ;
    Join(Tree(t1), a, Tree(t2)) :
      return g (
        evaluate_tree_homomorphism (t1),
        a,
        evaluate_tree_homomorphism (t2) )
  end case
end

Figure 5 shows a tree and the result of applying a reduction to it.

A tree reduction can be computed in parallel in time proportional to the height of the tree, assuming that each application of the function g takes

Figure 5: A Tree Contraction — the tree Join(Join(Leaf d, b, Leaf e), a, Leaf c) reduces to the value g(g(d, b, e), a, c)

constant time. Knowing the results of the reduction on subtrees, the result of the reduction on the tree formed by joining them requires a further application of g, so the critical path is the chain of applications of g along the longest branch. The sequential complexity of tree reduction is

    t_1(TreeReduce g) = n × t_1(g)

and the parallel complexity is

    t_n(TreeReduce g) = ht × t_1(g)

where ht is the height of the tree.
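The naive, height-proportional reduction corresponds to the following sketch (tree contraction, which achieves the logarithmic bound discussed below, is not shown here).

```python
def tree_reduce(g, t):
    # TreeReduce g = Hom(id, g): leaves pass their value through,
    # internal nodes combine with g, so work follows the tree's shape.
    if t[0] == 'leaf':
        return t[1]
    _, t1, a, t2 = t
    return g(tree_reduce(g, t1), a, tree_reduce(g, t2))

# As in Figure 5, Join(Join(Leaf d, b, Leaf e), a, Leaf c)
# reduces to g(g(d, b, e), a, c); here g is ternary addition.
t = ('join', ('join', ('leaf', 3), 1, ('leaf', 4)), 1, ('leaf', 5))
assert tree_reduce(lambda l, a, r: l + a + r, t) == 14
```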

Somewhat surprisingly, tree reduction can be computed much faster for many kinds of functions g. For theoretical models such as the EREW PRAM and more practical architectures such as the hypercube, tree reduction can be computed in time logarithmic in the number of nodes of the tree, subject to some mild conditions on the function g [1, 16], described in Appendix A. This is a big improvement over the method suggested above, since a completely left- or right-branching tree with n nodes requires time proportional to n to reduce directly, whereas the faster algorithm takes time log n.

The key to fast tree reduction is making some useful progress towards the eventual result at many nodes of the tree, whether or not the reductions for the subtrees of which they are the root have been completed. Naive reduction applies g only at those nodes both of whose children are leaves. For an unbalanced tree, this creates a chain of reductions whose length is proportional to the height of the tree. However, for well-behaved reduction operations it is also possible to carry out partial reductions for nodes only one of whose descendants is a leaf. This brings the time complexity of the tree reduction down to logarithmic in the size of the tree, no matter how unbalanced it is. The details are given in Appendix A.

Thus we have a parallel time complexity for tree reduction of

    t_n(TreeReduce g)[tree contraction] = log n

since t_1(g) must be O(1).

The third useful family of tree homomorphisms are the upwards accumulations. Upwards accumulations are operations in which data can be regarded as flowing upwards in the tree, and where the computations that take place at each node depend on the results of computations at lower nodes. These are like tree reductions that leave all their partial results behind. The final result is a tree of the same shape as the argument tree, in which each node is the result of a tree reduction rooted at that node. This is illustrated in Figure 6.

Upwards accumulation is characterised as follows. Let subtrees be the function that replaces each node of a tree by the subtree rooted at that node. Hence it takes a tree and produces a tree of trees. Although subtrees is a messy and expensive function, it is still a homomorphism.

Upwards accumulations are those functions that can be expressed as the composition of subtrees with a tree homomorphism mapped over the nodes of the intermediate tree:

    upwards accumulation = TreeMap (Hom (f1, f2)) ∘ subtrees

If Hom (f1, f2) : Tree A → X then the upwards accumulation has type signature Tree A → Tree X.

Computing an upwards accumulation in the way implied by this definition is expensive. The point of defining upwards accumulations like this is that the result at any node shares large common expressions with the results at its

Figure 6: An Upwards Accumulation — in the tree of Figure 3, node b is replaced by f2(f1 d, b, f1 e), and the root a by f2(f2(f1 d, b, f1 e), a, f1 c)

immediate descendants. Therefore, the number of partial results that must be computed in an upwards accumulation is linear in the number of tree nodes. Thus it is possible that there is a fast parallel algorithm, provided the dependencies introduced by the communication involved are not too constraining; and indeed there is. The algorithm is an extension of tree contraction; when a node u is removed, it is stacked by its remaining child. When this child receives its final value, it unstacks u and computes its final value. Details may be found in [7].

The sequential time complexity of an upwards accumulation is

    t_1(UpAccum (f1, f2)) = n × (t_1(f1) + t_1(f2))

its parallel time complexity is

    t_n(UpAccum (f1, f2)) = t_1(f1) + ht × t_1(f2)

and its parallel time complexity using extended tree contraction is

    t_n(UpAccum (f1, f2))[tree contraction] = log n

because f1 and f2 must be O(1).
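A sequential, linear-work sketch of upwards accumulation follows: each child's already-computed root value is reused, rather than recomputing whole subtree reductions. (The parallel, contraction-based algorithm of [7] is not shown.)

```python
def up_accum(f1, f2, t):
    # Replace each node by Hom(f1, f2) of the subtree rooted there.
    # Reusing the children's root results keeps the work linear.
    if t[0] == 'leaf':
        return ('leaf', f1(t[1]))
    _, t1, a, t2 = t
    u1, u2 = up_accum(f1, f2, t1), up_accum(f1, f2, t2)
    root = lambda u: u[1] if u[0] == 'leaf' else u[2]
    return ('join', u1, f2(root(u1), a, root(u2)), u2)

# Subtree sums: node 2's subtree sums to 6, the whole tree to 15.
t = ('join', ('join', ('leaf', 1), 2, ('leaf', 3)), 4, ('leaf', 5))
add3 = lambda l, a, r: l + a + r
assert up_accum(lambda a: a, add3, t) == \
    ('join', ('join', ('leaf', 1), 6, ('leaf', 3)), 15, ('leaf', 5))
```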

The fourth useful family of tree homomorphisms are the downwards accumulations. Downwards accumulations replace each node of a tree by some function of the nodes on the path between it and the root. This models functions in which the flow of information is broadcast down through the tree.

We first define the type of non-empty paths, which are like lists except that they have two concatenation constructors, a left-turn constructor and a right-turn constructor. This type models the paths that occur between the root of a tree and any other node, where it is important to remember whether the path turns towards a left or right descendant at each step. The two constructors are mutually associative, that is

    a ⋆ (b ⋆⋆ c) = (a ⋆ b) ⋆⋆ c

where ⋆ and ⋆⋆ are either of the constructors. Homomorphisms on paths are the unique arrows to algebraic structures

    (P, f : A → P, ⊕ : P × P → P, ⊗ : P × P → P)

where ⊕ and ⊗ satisfy the same mutual associativity property. Define paths to be the function that replaces each node of a tree by the path between the root and that node. A downwards accumulation is a tree homomorphism that can be expressed as the composition of paths with a path homomorphism mapped over the nodes of the intermediate tree:

    downwards accumulation = TreeMap (PathHom (⊕, ⊗)) ∘ paths

If PathHom (⊕, ⊗) : Path A → X then the downwards accumulation has type signature Tree A → Tree X.

A downwards accumulation is shown in Figure 7. The value that results at each node of the tree depends on the values of the nodes that appear in the path between the node and the root, together with their structural arrangement, that is, the combination of left and right turns that appear along the path. Clearly it is expensive to compute a downwards accumulation by computing all of the paths and then mapping a path reduction over them. However, there are again large common expressions between each node and its parent, so that

Figure 7: A Downwards Accumulation — in the tree of Figure 3, the root is replaced by a, node b by a ⊕ b, leaf c by a ⊗ c, leaf d by (a ⊕ b) ⊕ d, and leaf e by (a ⊕ b) ⊗ e

the total number of expressions to be computed is linear in the size of the tree. As a result, there is an algorithm that runs in parallel time proportional to the logarithm of the number of nodes in the tree [7].

The sequential time complexity of a downwards accumulation is

    t_1(DownAccum (⊕, ⊗)) = n × (t_1(⊕) + t_1(⊗))

its parallel time complexity is

    t_n(DownAccum (⊕, ⊗)) = t_1(⊕) + ht × t_1(⊗)

and its parallel time complexity using extended tree contraction is

    t_n(DownAccum (⊕, ⊗))[tree contraction] = log n

because ⊕ and ⊗ must be O(1).
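A sequential sketch of downwards accumulation follows, threading each node's accumulated path value down the tree; one operation is applied on left turns and the other on right turns. The encoding is illustrative only.

```python
def down_accum(oplus, otimes, t, acc=lambda v: v):
    # acc maps this node's value to its path result; children extend
    # the path with oplus (left turn) or otimes (right turn).
    if t[0] == 'leaf':
        return ('leaf', acc(t[1]))
    _, t1, a, t2 = t
    v = acc(a)
    return ('join',
            down_accum(oplus, otimes, t1, lambda x: oplus(v, x)),
            v,
            down_accum(oplus, otimes, t2, lambda x: otimes(v, x)))

# Path sums with both operations as +: the root keeps 4, its left
# child gets 4+2, the leaves get the sums along their root paths.
t = ('join', ('join', ('leaf', 1), 2, ('leaf', 3)), 4, ('leaf', 5))
add = lambda p, x: p + x
assert down_accum(add, add, t) == \
    ('join', ('join', ('leaf', 7), 6, ('leaf', 9)), 4, ('leaf', 9))
```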

These complexity results are based on the assumption of one processor per document node. It is possible to implement all of these homomorphisms using fewer processors by allocating a region of each tree to a processor, which then applies each homomorphism to that region. The actual partitioning is complex, because a partition of a tree into regions is not normally itself a tree. However, with some care it is possible to get implementations of the fast parallel operations above in time complexity

    n/p + log p

for trees of size n using p processors [20]. For small p, this gives almost linear speed-up over sequential implementations, while for large p it gives almost logarithmic execution times.

Binary trees are easily extended to trees in which each internal node has a list of subtrees (so-called Rose trees [6]). Rose trees much more naturally model SGML tagged text. The complexity of the algorithms we will present only changes by a constant factor, since any Rose tree can be replaced by a binary tree without changing the order of magnitude of the number of nodes.

4 Global Properties of Documents

We now have a set of homomorphic tree operations that can be evaluated in parallel effectively. We now begin to show how these operations can be applied to structured text.

We begin with operations that are tree maps or tree reductions. A broad class of such operations are those that count the number of occurrences of some text or entity in a document. In such tree homomorphisms, the pair of functions used are of the general form

    f1 (a) = if a = entity name then 1 else 0
    f2 (t1, a, t2) = t1 + t2 + (if a = entity name then 1 else 0)

with types

    f1 : A → N
    f2 : N × A × N → N

Some tree operations can be factored into the composition of a tree map and a tree reduction. This is often an easy way to understand and compute the complexity of a homomorphism. Notice that f2 above can be written as

    f2 = ⊕ ∘ (id × f1 × id)

where ⊕ is ternary addition. We can express the `count entities' homomorphism above as a composition like this:

    count entities = TreeReduce (⊕) ∘ TreeMap (f1)

The tree map can be computed in a single parallel step, taking time t_1(f1). The tree reduction step can be computed using tree contraction in a logarithmic number of steps, each involving an application of functions related to f2, taking time log n. Thus the total parallel time complexity of evaluating this tree homomorphism is

    t_n(total time) = log n + t_1(f1)    (1)
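A fused sketch of the count-entities homomorphism follows; the tree map and the reduction are combined into one recursive pass, with entity names stored directly at the nodes for brevity (an illustrative encoding).

```python
def count_entities(name, t):
    # count entities = TreeReduce(+) ∘ TreeMap(f1), fused into one pass:
    # each node contributes 1 if its entity name matches, 0 otherwise.
    if t[0] == 'leaf':
        return 1 if t[1] == name else 0
    _, t1, a, t2 = t
    return (count_entities(name, t1) + count_entities(name, t2)
            + (1 if a == name else 0))

doc = ('join',
       ('join', ('leaf', 'para'), 'section', ('leaf', 'para')),
       'chapter',
       ('join', ('leaf', 'para'), 'section', ('leaf', 'figure')))
assert count_entities('para', doc) == 3
assert count_entities('section', doc) == 2
```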

The following document properties of frequency can be determined in the parallel time given by Equation 1:

- number of occurrences of a word (that is, a delimited terminal string),

- number of occurrences of a structure or entity (section, subsection, paragraph, figure),

- number of reference points (labels).

An extension of these counting tree homomorphisms produces simple data types as results. For example, to produce a list of the different entity names used in a document, we use a tree homomorphism with component functions

    f1 (a) = {entity name (a)}
    f2 (t1, a, t2) = t1 ∪ t2 ∪ {entity name (a)}

that produces a set containing the entity names that are present. Each leaf is replaced by the name of the entity it represents. Internal nodes merge the sets of entity names of their descendants with the name of the entity they represent. Using sets means that we record each entity name only once in the final set. To determine the number of different entities present in a document, we need only compute the size of the set produced by this tree homomorphism.

Changing the set used to a list, we define a tree homomorphism to produce a table of contents. It is

    f1 (a) = [a.string]
    f2 (t1, a, t2) = [a.string] ++ t1 ++ t2

where ++ is list concatenation, and a.string extracts the string attribute associated with node a. This homomorphism produces a list of all of the tag instances in the document in level order. The parallel time complexity of these tree homomorphisms is not straightforward to compute, for two reasons: the operation of concatenation is not necessarily constant time, so the computation of f2 will not be; and the size of lists grows with the distance from the leaves, creating a communication cost that must be accounted for on real computers [21].
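The table-of-contents homomorphism can be sketched as follows, with node attributes represented directly as strings for brevity (an illustrative encoding).

```python
def toc(t):
    # f1(a) = [a.string];  f2(t1, a, t2) = [a.string] ++ t1 ++ t2
    if t[0] == 'leaf':
        return [t[1]]
    _, t1, a, t2 = t
    return [a] + toc(t1) + toc(t2)

doc = ('join', ('leaf', 'Intro'), 'Chapter 1', ('leaf', 'Summary'))
assert toc(doc) == ['Chapter 1', 'Intro', 'Summary']
```

Note that list concatenation is not constant time, which is exactly why the complexity of this homomorphism is harder to state than Equation 1.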

Another useful class of tree homomorphisms computing global properties are those that compute properties of extent. The most obvious example computes the length of a document in characters. It is

    f1 (a) = length (a.string)
    f2 (t1, a, t2) = t1 + t2 + length (a.string)

Similarly, the tree homomorphism that computes the deepest nesting depth of structures in a document is

    f1 (a) = 0
    f2 (t1, a, t2) = ↑(t1, t2) + 1

Again their parallel time complexity is given by Equation 1.

Software is an example of structured text. Some tree homomorphisms that apply to software are: computing the number of statements, and building a simple (that is, unscoped) symbol table.

5 Search Problems

Another important class of operations on structured text involves searching it for data matching some key. Four levels of search can be distinguished:

1. Search on index terms. This requires preprocessing of the document to allocate search terms, with the attendant problems of choice varying between individuals. It is usually implemented using indexing. Parallelism can be used by partitioning the index.

2. Search on full text. This is implemented using a signature file technique [4, 22, 23] or special-purpose hardware [10, 11]. Parallelism can be used by partitioning the signature file.

3. Search on non-hierarchical tagged regions. This is a more expressive variant of full text search in which non-hierarchical tags are present in the text. Searches may include references to tags as well as to content. This approach is used in the PAT system [8] for searching the Oxford English Dictionary. The descriptive markup of historical documents is typically too ad hoc to be captured by SGML-style tags, but is nevertheless an important part of the organisation of the document. The PAT index contains the strings beginning at each new word or tag position in the document, stored in a Patricia tree.

Fulcrum use a similar idea with zone tags, associating a set of zones with each range of the text. Zones are added to the document index, allowing searches to be based on both content and zone. This allows content references to be modified by limited context information (e.g. "dog" within a section heading) [12].

4. Search on full text and structure. This allows search to include information about both contents and tags, and uses regular expressions, so that content references can be modified by context information (e.g. "dog" within the third section heading), and truncated term expansion is trivially available. No existing system has this capability, but the path expressions query language [14] allows such queries to be expressed, and we show in subsequent sections how such searches may be implemented efficiently.

Fortunately, this fourth level of search is no more expensive to implement in parallel than the previous three. Thus even if parallelism does not provide absolute performance improvements because of limits on disk speeds, it can provide improvements in functionality. We explore this style of search further in the next two sections.

6 Parallel Search of Flat Text

There is a well-known parallel algorithm for recognising whether a given string is a member of a regular language in time logarithmic in the length of the string [5, 9]. This algorithm is naturally parallel and readily adapted to document search, even on quite modest parallel computers. Although it was described for the Connection Machine [9], it never seems to have been used on any real parallel system. It is related to the technique used by Hollaar [10, 11], but is more expressive, since the use of special-purpose hardware limits the flexibility of Hollaar's searches. We describe the parallel algorithm and show how it

s 1 c

a

s 0 s 3

c a

b

b

s 2

Figure 8: A Simple Finite State Automaton

allows queries that are regular expressions, and hence can search for patterns

involving b oth content and tags.

Suppose that we want to determine if a given string is a member of a regular language over some alphabet {a, b, c}. The regular language can be defined by a finite state automaton whose transitions are labelled by elements of the alphabet. This automaton is preprocessed to give a set of tables, each one labelled with a symbol of the alphabet and consisting of pairs of states, such that the labelling symbol labels a transition from the first state of each pair to the second. For example, given the automaton in Figure 8, the tables are:

       a           b           c
    (s0, s1)    (s0, s2)    (s2, s1)
    (s1, s2)    (s2, s3)    (s1, s3)

The homomorphism on lists consists of two functions:

    f1(a) = Table(a) = { (si, sj) | transition si -a-> sj }
    f2(t1, t2) = { (si, sk) | (si, sj) in t1, (sj, sk) in t2 }

It will be convenient to write f2 as an infix binary operation, ⊛, on tables. The function f1 replaces each symbol in the input string with the table defining how that symbol maps states to states. The function f2 then composes tables to reflect the state-to-state mapping of longer and longer strings. Consider ⊛ applied to the tables for a and b:

       a           b           ab
    (s0, s1) ⊛ (s0, s2)  =  (s1, s3)
    (s1, s2)    (s2, s3)

At the end of the reduction, the single resulting table expresses the effect of the entire input string on states. If it contains a pair whose first element is the initial state and whose second element is a final state, the string is in the regular language.

All of the tables are of finite size, no larger than the number of transitions in the automaton. Each table composition takes time no longer than linear in the number of states of the automaton. The reduction itself takes time logarithmic in the size of the string being searched on a variety of parallel architectures [18].
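This reduction can be sketched sequentially in Python. This is a sketch only: the set-of-pairs representation of tables and all names are ours, and `functools.reduce` stands in for the parallel reduction.

```python
from functools import reduce

# Transition tables for the automaton of Figure 8, one per symbol;
# each table is the set of (state, state) pairs the symbol maps between.
TABLES = {
    "a": {("s0", "s1"), ("s1", "s2")},
    "b": {("s0", "s2"), ("s2", "s3")},
    "c": {("s2", "s1"), ("s1", "s3")},
}

def compose(t1, t2):
    """Binary table composition (the infix operation on tables):
    all (si, sk) such that (si, sj) is in t1 and (sj, sk) is in t2."""
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

def accepts(string, initial, finals):
    """f1: replace each symbol by its table; then reduce with compose.
    The string is in the language iff the resulting table relates the
    initial state to a final state."""
    table = reduce(compose, (TABLES[sym] for sym in string))
    return any(s == initial and t in finals for (s, t) in table)

ab_table = compose(TABLES["a"], TABLES["b"])  # the table for the string "ab"
```

In a parallel setting the same `compose` would be applied pairwise across processors; associativity of composition is what makes the logarithmic-depth reduction valid.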

The regular language recognition problem is easily adapted for query processing. Suppose we wish to determine if some regular expression, RE, is present in an input string. This regular expression defines a language, L(RE), that is then extended to allow for the existence of other symbols and for the string described by the regular expression to appear in a longer string. Call this extended language L(RE)'. Then the search problem becomes: is the text s in the language L(RE)'? There is a finite state automaton corresponding to L(RE)'. It is based on the finite state automaton for L(RE) with the addition of extra transitions labelled with symbols from the extended alphabet. It can be as large as exponential in the size of the regular expression, but is independent of the size of the string being searched. An example of the automaton corresponding to the search for the word "cat" is shown in Figure 9, and the resulting algorithm on an input string "bcdabcat" in Figure 10.
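A sketch of this search in Python; the transition function below is our reading of the Figure 9 automaton (not part of the original text), and `reduce` again stands in for the parallel reduction.

```python
from functools import reduce

STATES = range(4)

def delta(state, symbol):
    """Our reading of the Figure 9 automaton for "cat": state 3 is
    absorbing; c starts (or restarts) a potential match; a and t extend
    a partial match; anything else falls back to state 0."""
    if state == 3:
        return 3
    if symbol == "c":
        return 1
    if state == 1 and symbol == "a":
        return 2
    if state == 2 and symbol == "t":
        return 3
    return 0

def table(symbol):
    # One state-to-state table per input symbol, as in Figure 10.
    return {(s, delta(s, symbol)) for s in STATES}

def compose(t1, t2):
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

# The reduction over the input string of Figure 10:
final = reduce(compose, (table(sym) for sym in "bcdabcat"))
# The pair (0, 3) in the final table shows the text contains "cat".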

Queries that are boolean expressions of simpler regular language queries could be evaluated by evaluating the simple queries independently and then carrying out the required boolean operations on the results. However, the closure of the class of regular languages under union, intersection, complementation, and concatenation means that such complex queries can be evaluated in a single pass through the data by constructing the appropriate deterministic finite state automaton. For example, a query of the form "are x and y in the text" amounts to asking if the text is a string of the augmented language of the intersection of the regular languages of the strings x and y.

Figure 9: Finite State Automaton for "cat". (States 0-3, with 0 -c-> 1, 1 -a-> 2, 2 -t-> 3; the symbol c leads to state 1 from states 0, 1, and 2; state 3 absorbs every symbol; all other symbols return to state 0.)

      b       c       d       a       b       c       a       t
    (0,0)   (0,1)   (0,0)   (0,0)   (0,0)   (0,1)   (0,0)   (0,0)
    (1,0)   (1,1)   (1,0)   (1,2)   (1,0)   (1,1)   (1,2)   (1,0)
    (2,0)   (2,1)   (2,0)   (2,0)   (2,0)   (2,1)   (2,0)   (2,3)
    (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)

        bc      da      bc      at
      (0,1)   (0,0)   (0,1)   (0,0)
      (1,1)   (1,0)   (1,1)   (1,3)
      (2,1)   (2,0)   (2,1)   (2,0)
      (3,3)   (3,3)   (3,3)   (3,3)

         bcda    bcat
        (0,0)   (0,3)
        (1,0)   (1,3)
        (2,0)   (2,3)
        (3,3)   (3,3)

          bcdabcat
           (0,3)
           (1,3)
           (2,3)
           (3,3)

Figure 10: Algorithm Progress for the Search

Regular language recognition for strings is easily generalized to trees. The first step applies a tree map that replaces each node of the tree by a table mapping states to states; that is, the tree homomorphism:

    Hom(f1, f2) with f1(a) = Leaf(Table(a)),  f2 = Join . (id x f1 x id)

The second step is a tree reduction, in which the tables of a node and its two subtrees are composed using a ternary generalisation of the composition operator used in the string algorithm above:

    Hom(id, f2)(t1, t2, t3) = ⊛(t2, t1, t3)

where table composition has been generalized to a ternary operation:

    ⊛(t2, t1, t3) = { (si, sl) | (si, sj) in t2, (sj, sk) in t1, (sk, sl) in t3 }

Notice that the composition of tables composes the table belonging to the internal node first, followed by the tables corresponding to the subtrees in order. This represents the case where the text in the internal node represents some kind of heading. The order may or may not be significant in an application, but care needs to be taken to get it right so that strings that cross entity boundaries are properly detected. The extension of the search in a linear string to a tree is shown in Figure 11.
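The tree version can be sketched in the same sequential style. The tuple encoding of trees and all names are ours, chosen for illustration; the automaton is the one for "cat".

```python
from functools import reduce

STATES = range(4)

def delta(state, symbol):
    # Transitions of the "cat" automaton (state 3 absorbing).
    if state == 3:
        return 3
    if symbol == "c":
        return 1
    if state == 1 and symbol == "a":
        return 2
    if state == 2 and symbol == "t":
        return 3
    return 0

def compose(t1, t2):
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

def str_table(text):
    # Table for a non-empty string: compose the per-symbol tables.
    return reduce(compose, ({(s, delta(s, sym)) for s in STATES}
                            for sym in text))

def compose3(t2, t1, t3):
    """Ternary composition: the internal node's table first, then the
    left subtree's table, then the right subtree's table."""
    return compose(compose(t2, t1), t3)

def tree_table(node):
    # A node is either ("leaf", text) or ("join", left, text, right).
    if node[0] == "leaf":
        return str_table(node[1])
    _, left, text, right = node
    return compose3(str_table(text), tree_table(left), tree_table(right))

# The node text is read before the left subtree, then the right one,
# so this tree spells out "c" + "a" + "t" = "cat":
tree = ("join", ("leaf", "a"), "c", ("leaf", "t"))
```

The recursion here plays the role of the tree reduction; the parallel version replaces it with tree contraction, as described in the appendix.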

Figure 11: Algorithm Progress for Tree Search. (Each node of the tree is replaced by its state-to-state table; tree contraction then repeatedly composes the table of an internal node with the tables of its subtrees. A final table containing the pair (0,3) indicates that "cat" occurs in the tree.)

This algorithm takes parallel time logarithmic in the number of nodes of the tree, using the tree contraction algorithm discussed in the previous section. The automaton can be extended as before to solve query problems, giving a fast parallel query evaluator for a useful class of queries.

Query evaluation problems that are of this kind include: word and phrase search, and boolean expressions involving phrase search. Such queries are of about the complexity of those permitted by the PAT system.

Because the tree structure of structured text encodes useful information, we now turn to extending this search algorithm to a new kind of search in which not only the presence of data but also its relationships can be expressed in the query. This increases the power of the query language substantially. It turns out that such structured queries can still be computed in parallel within logarithmic time bounds, making structured queries of practical importance.

7 Accumulations and Information Transfer

Accumulations allow nodes to find out information about all their neighbours in a particular direction. Upwards accumulations allow each node of a tree to accumulate information about the nodes below it. Downwards accumulations allow each node to accumulate information about those nodes that lie between it and the root. Powerful combination operations are formed when an upwards accumulation is followed by a downwards accumulation, because this makes arbitrary information flow between nodes of the tree possible. As we have seen, both kinds of accumulations can be computed fast in parallel if their component functions are well-behaved.

Some examples of upwards accumulations are: computing the length in characters of each object in the document; and computing the offset from each segment to the next similar segment, for example the offset from each section heading to the next section heading.
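The first of these can be sketched sequentially as follows; the tree encoding and names are ours, for illustration only. Each node receives the total character length of the subtree it roots.

```python
# A sequential sketch of an upwards accumulation: annotate each node
# with the total character length of everything at or below it.
# A node is (text, children).

def upwards_lengths(node):
    """Return (annotated_tree, length), where the annotated tree pairs
    each node's text with the length of its whole subtree."""
    text, children = node
    annotated, total = [], len(text)
    for child in children:
        sub, n = upwards_lengths(child)
        annotated.append(sub)
        total += n
    return ((text, total, annotated), total)

doc = ("Heading", [("para one", []), ("para two", [("note", [])])])
tree, n = upwards_lengths(doc)
```

In the parallel setting the same computation is an upwards accumulation whose combining function (addition) is associative and constant-time, so it meets the conditions for fast parallel evaluation.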

Some examples of downwards accumulations are: structured search problems, that is, searching for a part of a document based on its content and its structural properties; and finding all references to a single label. Structured search is so important that we investigate its implementation further in the next section.

Gibbons gives a number of detailed examples of the usefulness of upwards followed by downwards accumulations in [6]. Some examples that are important for structured text are: evaluating attributes in attribute grammars, which can represent almost any property of a document; generating a symbol table for a program written in a language with scopes; rendering trees, which is important for navigating document archives; determining which page each object would fall on if the document were produced on some device; determining which node the i-th word in a document is in; and determining the font of each object in systems where font is relative, not absolute, such as LaTeX.

Operations such as resolving all cross references and determining all the references to a particular point can also be cast as upwards followed by downwards accumulations, but the volume of data that might be moved on some steps makes this expensive.

8 Parallel Search of Structured Text

Queries on structured text involve finding nodes in the tree based on information about the content of the node (its text and other attributes) and on its context in the tree, particularly its relative context, such as "the third section". Here are some examples, based on [14]:

Example 6  document where `database' in document

This returns those documents that contain the word `database'.

Example 7  document where `Smith' in Author

This returns those documents where the word `Smith' occurs as part of the Author structure within each document. This query depends partly on structural information (that an Author substructure exists) as well as text. Notice that the object returned depends on a condition on the structure below it in the tree.

Example 8  Section of document where `database' in document

This returns all sections of documents that contain the word `database'. The object returned depends on a condition of the structure above it in the tree. Notice that there is a natural way to regard this as a two-step operation: first select the documents, then select the sections.

Example 9  Section of document where `database' in SectionHeading

This returns those sections whose headings contain the word `database'. Notice that the object returned depends on a condition of a structure that is neither above nor below it in the tree, but is nevertheless related to it.

All of these queries describe patterns in the tree; patterns may include "don't care"s in both the nodes and the branch structure. Queries are relative to a particular node in the tree (usually the root) and return a bag of nodes corresponding to the roots of trees containing the pattern (bags are sets in which repetitions count; we want to know all the places where a pattern is present, so the result must allow for more than one solution). Allowing queries to take a bag of nodes as their inputs, so that searches begin from all of these nodes, allows queries to be composed. Note that a node is precisely identified by the path between itself and the root, so we might as well think of node identifiers as paths.

There are two different kinds of bag operations in the query examples above. The first are filters that take a bag of nodes and return those elements of the bag that satisfy some condition. The second are moves that take a bag of nodes and return a bag in which each node has been replaced either by a node related to it (its ancestor, its descendant, its sibling to the right) or by one of its attributes. A simple query is usually a filter followed by the value of an attribute at all the nodes that have passed the filter. More complex queries such as the last example above require more complex moves.

This insight is the critical one in the design of path expressions [15], a general query language for structured text applications. The crucial property of path expressions that we require is that filters can be broken up into searches for patterns that are single paths, and are therefore expressible as regular expressions over paths.

Such filters can be computed by replacing each node of the structured text tree by the path from it to the root, and then applying the regular string recognition algorithm, extended to include left and right turns, to each of these paths. Those nodes for which the string recognition algorithm returns True are the nodes selected by the filter.

For example, a query about the presence of a chapter whose first section contains the word "About" is equivalent to asking if there is a path in the tree that contains the following:

    (entity = chapter) (left turn) (entity = section and "About" in section.string)

Figure 12: A Tree Annotated with Turn Information. (The root a has a left child b and a right child a; that node has a left child b and a right child c; c has a left child a and a right child t. Each edge is labelled L or R according to its direction.)

A tree annotated with left and right turns is shown in Figure 12, and the result of applying the paths function to it is shown in Figure 13.
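Under our reading of the tree of Figure 12, the paths function can be sketched as a simple downwards traversal; this is a sequential stand-in for the downwards accumulation, and the tuple encoding of the tree is ours.

```python
# The paths function, sketched sequentially: each node is replaced by
# the path from the root to it, recording L or R at each turn.
# A node is (label, left_child, right_child), with None for no child.

def paths(node, prefix=""):
    label, left, right = node
    here = prefix + label
    result = [here]
    if left is not None:
        result += paths(left, here + "L")
    if right is not None:
        result += paths(right, here + "R")
    return result

# The tree of Figure 12:
fig12 = ("a",
         ("b", None, None),
         ("a",
          ("b", None, None),
          ("c", ("a", None, None), ("t", None, None))))
```

Calling `paths(fig12)` yields the node-by-node path strings that Figure 13 displays.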

The regular language recognition algorithm for strings can be extended to paths straightforwardly. The query string may contain references to right and left turns, and so may the extended regular language. Otherwise the operations of table construction and binary table composition ⊛ behave just as before.

Filters can be expressed as:

    filter = TreeMap (PathHom (Table, ⊛)) . paths

The right hand side replaces the text tree with the tree in which each node contains the path between itself and the root. The TreeMap then maps the regular language recognition algorithm for paths over each of these nodes. Those that find an instance of the string are the roots of subtrees that correspond to the query pattern.

Figure 13: The Result of Applying the paths Function. (From the root down, the nodes become: a; aLb, aRa; aRaLb, aRaRc; aRaRcLa, aRaRcRt.)

But we have already seen that operations that can be expressed as maps over paths are downward accumulations, and hence can be computed efficiently in parallel when the component functions are well-behaved. We have also seen that table composition is a constant-time, associative function and so satisfies the requirements for tree contraction. Thus there is a logarithmic time parallel algorithm for computing such filters:

    t_n(filter) = O(log n)

Path expression queries may therefore be included in structured text systems without additional cost, dramatically improving the sophistication of the way in which such resources can be queried.

9 Conclusions

The SGML approach of structural tagging of entities makes it possible to model structured text as a data type. This in turn makes it possible to use the machinery of categorical data types to examine trees and tree homomorphisms. This reveals the underlying similarities between apparently different operations on structured text, suggests sequential and parallel implementations for them, and shows how their construction can be reduced to the construction of component functions.

In particular we have shown how tree contraction and its extensions can be used to build fast parallel implementations of standard search techniques. A major contribution of the paper is the discovery of a fast parallel implementation for searches based on path expressions. These allow much more sophisticated searching than existing query languages, with the same performance.

A The Tree Reduction Algorithm

The tree reduction algorithm is well-known in the parallel algorithms literature. We follow the presentation in [7].

We begin by defining a contraction operation that applies to any node and its two descendants, if at least one descendant is a leaf. Suppose that each internal node u contains a pointer u.p to its parent, a boolean flag u.left that is true if it is a left descendant, a boolean flag u.internal that is true if the node is an internal node, a variable u.g describing an auxiliary function of type A -> A, and, for internal nodes, two pointers, u.l and u.r, pointing to their left and right descendants respectively.

We describe an operation that replaces u, u.l, and u.r, where u.l is a leaf, by a single node u.r, as shown in Figure 14. A symmetric operation contracts in the other direction when u.r is a leaf. The operations required are:

    u.r.g <- λx. u.g(f2(u.l.g(u.l.a), u.a, u.r.g(x)))
    u.r.p <- u.p
    if u.left then u.p.l <- u.r else u.p.r <- u.r
    u.r.left <- u.left
    if u = root then root <- u.r

The first step is the most important one. It `folds in' an application of f2 so that the function computed at node u.r after this contraction operation is the one that would have been computed at u before the operation. The contraction algorithm will only be efficient if this step is done quickly and the resulting expression does not grow long. We will return to this point below.

Figure 14: A Single Tree Contraction Step: Nodes u and u.l are deleted from the tree, and u.r is updated.

The contraction operations must be applied to about half of the leaves on each step if the entire contraction is to be completed in about a logarithmic number of steps. The algorithm for deciding where to apply the contraction operations is the following:

1. Number the leaves left to right beginning at 0; this can be done in O(log n) time using O(n / log n) processors [3].

2. For every u such that u.l is an even numbered leaf, perform the contraction operation.

3. For every u that was not involved in the previous step, and for which u.r is an even numbered leaf, perform the contraction operation.

4. Renumber the leaves by dividing their positions by two and taking the integer part.

Sufficient conditions for preventing the lambda expressions in the u.g's from growing large are the following [1]:

1. For all nodes u, the sectioned function f2(., a, .), of type A x A -> A, is drawn from an indexed set of functions F, and the function u.g is drawn from an indexed set of functions G. Both F and G contain the identity function.

2. All functions in F and G can be applied in constant time.

3. For all f in F, g_i in G, and a in A, the functions λx. f(g_i(x), a) and λx. f(a, g_i(x)) are in G, and their indices can be computed from a and the indices of f and g_i in constant time.

4. For all g_i, g_j in G, the composition g_i . g_j is in G and its index can be computed from i and j in constant time.
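These closure conditions can be illustrated with a classic indexed family. The example is ours, not from the paper: take A to be numbers, let f2(x, a, y) = x + a + y, and let G be the affine functions x -> p*x + q, each represented by its index (p, q). Application, composition, and folding in a section of f2 are then all constant-time operations on indices.

```python
# G is the family of affine functions x -> p*x + q, indexed by (p, q).

def apply_g(g, x):
    # Constant-time application from the index.
    p, q = g
    return p * x + q

def compose_g(g1, g2):
    """Index of g1 . g2: (g1 . g2)(x) = p1*(p2*x + q2) + q1,
    computed in constant time, and still an affine function."""
    (p1, q1), (p2, q2) = g1, g2
    return (p1 * p2, p1 * q2 + q1)

IDENT = (1, 0)  # index of the identity function, required to be in G

def fold_f2(g, a, c):
    """Folding a section of f2(x, a, y) = x + a + y into g = (p, q):
    x -> f2(g(x), a, c) = p*x + (q + a + c), again affine."""
    p, q = g
    return (p, q + a + c)
```

Because every operation here produces another index in constant time, the lambda expressions built during contraction never grow, which is exactly what the four conditions demand.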

These conditions ensure three properties: that no function built at a node takes more than constant time to derive from the functions of the three nodes it is replacing, that no such function takes more than constant time to evaluate, and that no such function requires more than a constant amount of storage. Together these three properties guarantee that the tree contraction algorithm executes in logarithmic parallel time.

References

[1] K. Abrahamson, N. Dadoun, D.G. Kirkpatrick, and T. Przytycka. A simple parallel tree contraction algorithm. In Proceedings of the Twenty-Fifth Allerton Conference on Communication, Control and Computing, pages 624-633, September 1987.

[2] M. Cole. Parallel programming, list homomorphisms and the maximum segment sum problem. In D. Trystram, editor, Proceedings of Parco 93. Elsevier Series in Advances in Parallel Computing, 1993.

[3] R. Cole and U. Vishkin. Faster optimal parallel prefix sums and list ranking. Information and Control, 81:334-352, 1989.

[4] C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2:267-288, April 1984.

[5] Charles N. Fischer. On Parsing Context-Free Languages in Parallel Environments. PhD thesis, Cornell University, 1975.

[6] J. Gibbons. Algebras for Tree Algorithms. D.Phil. thesis, Programming Research Group, University of Oxford, 1991.

[7] J. Gibbons, W. Cai, and D.B. Skillicorn. Efficient parallel algorithms for tree accumulations. Science of Computer Programming, 23:1-14, 1994.

[8] G.H. Gonnet, R.A. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 66-82. Prentice-Hall, 1992.

[9] W. Daniel Hillis and G.L. Steele. Data parallel algorithms. Communications of the ACM, 29(12):1170-1183, December 1986.

[10] L. Hollaar. The Utah Text Search Engine: Implementation experiences and future plans. In Database Machines: Fourth International Workshop, pages 367-376. Springer-Verlag, 1985.

[11] L. Hollaar. Special-purpose hardware for text searching: Past experience, future potential. Information Processing and Management, 27:371-378, 1991.

[12] Fulcrum Technologies Inc. Ful/Text reference manual. Fulcrum Technologies, Ottawa, Ontario, Canada, 1986.

[13] Information processing - Text and office systems - Standard Generalized Markup Language (SGML), 1986.

[14] I.A. Macleod. A query language for retrieving information from hierarchical text structures. The Computer Journal, 34(3):254-264, 1991.

[15] I.A. Macleod. Path expressions as selectors for non-linear text. Preprint, 1993.

[16] E.W. Mayr and R. Werchner. Optimal routing of parentheses on the hypercube. In Proceedings of the Symposium on Parallel Architectures and Algorithms, June 1993.

[17] G. Salton and C. Buckley. Parallel text search methods. Communications of the ACM, 31:203-215, 1988.

[18] D.B. Skillicorn. Architecture-independent parallel computation. IEEE Computer, 23(12):38-51, December 1990.

[19] D.B. Skillicorn. Foundations of Parallel Computing. Cambridge Series in Parallel Computation. Cambridge University Press, 1994.

[20] D.B. Skillicorn. Parallel implementation of tree skeletons. Technical Report 95-380, Queen's University, Department of Computing and Information Science, March 1995.

[21] D.B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. Journal of Parallel and Distributed Computing, to appear. An early version appears as Department of Computer Science Technical Report 92-329.

[22] C. Stanfill and B. Kahle. Parallel free-text search on the Connection Machine. Communications of the ACM, 29:1229-1239, 1986.

[23] C. Stanfill and R. Thau. Information retrieval on the Connection Machine. Information Processing and Management, 27:285-310, 1991.

[24] H. Stone. Parallel querying of large databases. IEEE Computer, 20(10):11-21, October 1987.