Structured Parallel Computation in Structured Documents

D.B. Skillicorn
March 1995

External Technical Report
ISSN-0836-0227-95-379

Department of Computing and Information Science
Queen's University
Kingston, Ontario K7L 3N6

Document prepared March 6, 1995
Copyright 1995 D.B. Skillicorn
Abstract

Document archives contain large amounts of data to which sophisticated queries are applied. The size of archives and the complexity of evaluating queries make the use of parallelism attractive. The use of semantically-based markup such as SGML makes it possible to represent documents and document archives as data types. We present a theory of trees and tree homomorphisms, modelling structured text archives and operations on them, from which it can be seen that:

- many apparently-unrelated tree operations are homomorphisms;
- homomorphisms can be described in a simple parameterised way that gives standard sequential and parallel implementations for them;
- special classes of homomorphisms have parallel implementations of practical interest. In particular, we develop an implementation for path expression search, a novel and powerful query facility for structured text, that takes time logarithmic in the text size.

Keywords: structured text, categorical data type, software development methodology, parallel algorithms, query evaluation.
1 Structured Parallel Computation

Computations on structured documents are a natural application domain for parallel processing. This is partly because of the sheer size of the data involved. Document archives contain terabytes of data, and analysis of this data often requires examining large parts of it. Also, advances in parallel computer design have made it possible to build systems in which each processor has sizeable storage associated with it, and which are therefore ideal hosts for document archives. Such machines will become common in the next decade.

The use of parallelism has been suggested for document applications for some time. Some of the drawbacks have been pointed out by Stone [24] and Salton and Buckley [17] – these centre around the need of most parallel applications to examine the entire text database where sequential algorithms examine only a small portion, and the consequent performance degradation in accessing secondary storage. While this point is important, it has been to some extent overtaken by developments in parallel computer architecture, particularly the storage of data in disk arrays, with some disk storage local to each processor. As we shall show, the use of parallelism allows such an increase in the power of query operations that it will be useful even if performance is not significantly increased.
Parallel computation is difficult in any application domain, for the following reasons:

- There are many degrees of freedom in the design space of the software, because partitioning computations among processors strongly influences communication and synchronisation patterns, which in turn have a strong effect on performance. Hence finding good algorithms requires extensive searching in the absence of other information;
- Parallelism in an algorithm is only useful if it can be harnessed by some available parallel architectures, and harnessed in an efficient way;
- Expressing algorithms in a way that is abstract enough to survive the replacement of underlying parallel hardware every few years is difficult;
- It is hard to predict the performance of software on parallel machines without actually developing it and trying it out.

Making parallelism the workhorse of structured document processing requires finding solutions to these difficulties.
The extensive use of semantically-based markup, and particularly the use of SGML [13], means that most documents have a de facto tree structure. This makes it possible to model them by a data type with enough formality that useful theory can be applied. We will use the theory of categorical data types [19], a particular approach to initiality, emphasising its ability to hide those aspects of a computation that are most difficult in a parallel setting.

Categorical data types generalise abstract data types by encapsulating not only the representation of the type, but also the implementation of homomorphisms on it. In object-oriented terms, the only methods available on constructed types are homomorphisms. As a parallel programming model this is ideal, since the partitioning of the data objects across processors, and the communication patterns required to evaluate homomorphisms, can remain invisible to programmers. Programs can be written as compositions of homomorphisms without any necessary awareness that the implementations of these homomorphisms might contain substantial parallelism.
Recall that a homomorphism is a function that respects the structure of its arguments. If an argument object is a member of a data type with constructor ./, then h is a homomorphism if there exists an operation ~ such that

h (a ./ b) = h a ~ h b

This equation describes two different ways of computing the result of applying h to the argument a ./ b. The left hand side computes it by building the argument object completely and then just applying h to this object. The right hand side, however, applies h to the component objects from which the argument was built, and then applies the operation ~ to the results of these two applications.

There are two things to note about the computation strategy implied by the right hand side. The first is that it is recursive. If b is itself an object built up from smaller objects, say b = c ./ d, then

h b = h c ~ h d

and so

h (a ./ b) = h a ~ (h c ~ h d)

The structure of the computation follows the structure of the argument. Second, the evaluations of h on the right hand sides are independent and can therefore be evaluated in parallel if the architecture permits it. These simple ideas lead to a rich approach to parallel computation.
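For concreteness, the two evaluation strategies described by the homomorphism equation can be checked on a small instance. The sketch below is ours, not from the report: it uses Python lists with concatenation as the constructor ./, length as h, and addition as ~.

```python
# Lists with concatenation as the constructor (./), h = len, ~ = +.
# h is a homomorphism because h (a ./ b) = h a ~ h b.
a, b = [1, 2, 3], [4, 5]

lhs = len(a + b)        # build the whole argument, then apply h
rhs = len(a) + len(b)   # apply h to the pieces, then combine with ~

assert lhs == rhs == 5
```

The right-hand-side computation is the one that admits parallelism: len(a) and len(b) are independent.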
Homomorphisms include many of the interesting functions on constructed data types. In particular, all injective functions are homomorphisms. Furthermore, all functions can be expressed as almost-homomorphisms [2], the composition of a homomorphism with a projection, and this is often of practical importance.

In the next section we introduce the construction of a type for trees to represent structured text. We show that the construction reduces algorithm design for homomorphisms to the simpler problem of finding component functions, and illustrate a recursive parallel schema for computing homomorphisms on trees. In Section 3, we distinguish four special kinds of homomorphisms that capture common patterns of information flow in tree algorithms, and for which it is worth optimizing the standard schema implementation. These are: tree maps, tree reductions, and upwards and downwards accumulations. In the subsequent five sections we illustrate tree homomorphisms of increasing sophistication, beginning with the computation of global document properties, then search problems, that is, query evaluation, and finally problems that involve communicating information throughout documents.
2 Parallel Operations on Trees

We build the type of homogeneous binary trees, that is, trees in which internal nodes and leaves are all of the same type. Binary trees are too simple to represent the full structure of tagged texts, since any tagged entity may contain many subordinate entities, but they simplify the exposition without affecting any of the complexity results.

In SGML an entity is delimited by start and end tags. The region between the start and end tags is either `raw' text or is itself tagged. The structure is hierarchical, and can be naturally represented by a tree whose internal nodes represent tags, and whose leaves represent `raw' data. All nodes may contain values for attributes. In particular, tags will often have associated text – for example, chapter tags contain the text of the chapter heading as an attribute, figure tags contain figure captions, and so on. Thus a typical document can be represented as a tree of the kind shown in Figure 1. Document archives can be modelled by trees as well, in which nodes near the root represent document classifications, as shown in Figure 2.

Figure 1: Modelling a Document as a Tree (a document node whose children are chapters, which contain sections, which contain paragraphs)

We build trees over some base type A that can model entities and their attributes. For our purposes this is a tuple consisting of an entity name and a set of attributes. In our particular examples it will suffice to have one attribute, the string of text associated with each node of the document.
Trees have two constructors:

Leaf : A → Tree A
Join : Tree A × A × Tree A → Tree A

The first constructor, Leaf, takes a value of type A and makes it into a tree consisting of a single leaf. The second constructor, Join, takes two trees and a value of type A and makes them into a new tree by joining the subtrees and putting the value of type A at the internal node generated. Thus a tree is either a single leaf, or a tree obtained by joining two subtrees together with a new value at the join. A constructor expression describing a tree and the tree itself are shown in Figure 3.

Figure 2: Modelling a Document Archive (an archive node whose children classify documents, for example into manuals and novels, with documents, chapters, sections, and paragraphs below)

Figure 3: A Constructor Expression and the Tree it Describes – the expression Join (Join (Leaf d, b, Leaf e), a, Leaf c) describes the tree with root a, left child b (with leaf children d and e), and right leaf c
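As a concrete rendering of the two constructors, they can be modelled directly in a programming language; the Python sketch below is our own illustration (the report itself uses no particular language), with the tree of Figure 3 built as a nested value.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Leaf:
    value: Any   # a value of the base type A

@dataclass
class Join:
    left: Any    # left subtree
    value: Any   # the value placed at the internal node
    right: Any   # right subtree

# The tree of Figure 3: Join(Join(Leaf d, b, Leaf e), a, Leaf c)
t = Join(Join(Leaf("d"), "b", Leaf("e")), "a", Leaf("c"))
```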
Definition 1. A homomorphism, h, on trees is a function that respects tree structure; that is, there must exist functions f1 and f2 such that

h (Leaf a) = f1 a

and

h (Join (t1, a, t2)) = f2 (h t1, a, h t2)

Call f1 and f2 the component functions of the homomorphism h.

Here f2 is the glueing function that relates the effect of h on pieces of the argument to its effect on the whole argument. Notice that if h has type

h : Tree A → P

then the types of f1 and f2 are

f1 : A → P
f2 : P × A × P → P

Thus finding a homomorphism amounts to finding such a pair of functions. However, finding h means being aware of the structure of Tree A, which is usually complex, whereas f1 and f2 are operations on usually simpler types. This is a considerable practical advantage.
Result 2 [19]. Every homomorphism on trees is actually a function between an algebra (that is, a set with some operations defined on it) (Tree A, Leaf, Join) and a related algebra (P, f1, f2) for some type P and a pair of functions

f1 : A → P
f2 : P × A × P → P

Note that the type signatures of f1 and f2 are obtained from those of Leaf and Join by replacing occurrences of Tree A by P.

Furthermore, the homomorphism to each such algebra is unique, and there is therefore a one-to-one correspondence between algebras and homomorphisms. Each homomorphism is completely determined by the functions f1 and f2. This correspondence justifies the notation

Hom (f1, f2)

for the necessarily unique tree homomorphism from (Tree A, Leaf, Join) to the algebra (P, f1, f2).
Using the fact that h is a homomorphism and that it corresponds to a pair of component functions, we can now make formal the recursive computation of h that we saw earlier. The recursive schema for evaluating all tree homomorphisms is shown in pseudocode in Figure 4. It is clear that the two recursive calls to evaluate tree homomorphisms can be executed in parallel. Using this simple approach to parallelism, all tree homomorphisms can be evaluated in parallel time proportional to the height of the tree, assuming that the evaluations of f1 and f2 take only constant time.

Here are some simple examples of tree homomorphisms:
evaluate_tree_homomorphism (f1, f2, t)
  case t of
    Leaf(a) :
      return f1(a);
    Join(Tree t1, a, Tree t2) :
      return f2 (evaluate_tree_homomorphism (f1, f2, t1),
                 a,
                 evaluate_tree_homomorphism (f1, f2, t2))
  end case
end

Figure 4: Tree Homomorphism Evaluation Schema
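The schema of Figure 4 translates directly into executable form. In the Python sketch below (ours, not from the report), trees are encoded as nested tuples, ("leaf", a) or ("join", t1, a, t2) – an encoding chosen purely for brevity of illustration.

```python
def evaluate_tree_homomorphism(f1, f2, t):
    # Trees are ("leaf", a) or ("join", t1, a, t2).
    if t[0] == "leaf":
        return f1(t[1])
    _, t1, a, t2 = t
    # The two recursive calls are independent and could run in parallel.
    return f2(evaluate_tree_homomorphism(f1, f2, t1),
              a,
              evaluate_tree_homomorphism(f1, f2, t2))

# Example: sum all node values, with f1 = identity and f2 ternary addition.
t = ("join", ("join", ("leaf", 1), 2, ("leaf", 3)), 4, ("leaf", 5))
total = evaluate_tree_homomorphism(lambda a: a,
                                   lambda l, a, r: l + a + r, t)
assert total == 15
```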
Example 3. The tree homomorphism that computes the number of leaves in a tree replaces the value at each leaf by the constant 1, and then sums these values over the tree. It is given by

number-of-leaves = Hom (K1, f2)

where K1 : A → N is the constant 1 function and f2 : N × A × N → N is f2 (t1, a, t2) = t1 + t2. The recursive schema specialises to:

evaluate_tree_homomorphism (t)
  case t of
    Leaf(a) :
      return K_1(a);  { = 1 }
    Join(Tree t1, a, Tree t2) :
      return evaluate_tree_homomorphism (t1)
             + evaluate_tree_homomorphism (t2)
  end case
end
Example 4. The tree homomorphism to compute the number of internal nodes in a tree replaces each leaf by the value 0 and then adds 1 to the sum at each internal node encountered. It is

number-of-internal-nodes = Hom (K0, f2)

where K0 : A → N is the constant 0 function, and f2 : N × A × N → N is f2 (t1, a, t2) = t1 + t2 + 1.
Example 5. The tree homomorphism that finds the maximum value in a tree does nothing at the leaves and at internal nodes selects the maximum of the three values from its two subtrees and the node itself. It is

treemax = Hom (id, f2)

where f2 (t1, a, t2) = ↑(t1, a, t2) and ↑ is ternary maximum.
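The three examples can be run through the evaluation schema directly; the tuple encoding in the sketch below is our own illustration, not taken from the report.

```python
def hom(f1, f2, t):
    # Trees are ("leaf", a) or ("join", t1, a, t2).
    if t[0] == "leaf":
        return f1(t[1])
    _, t1, a, t2 = t
    return f2(hom(f1, f2, t1), a, hom(f1, f2, t2))

t = ("join", ("join", ("leaf", 4), 7, ("leaf", 2)), 9, ("leaf", 5))

number_of_leaves = hom(lambda a: 1, lambda l, a, r: l + r, t)       # Example 3
number_of_internal = hom(lambda a: 0, lambda l, a, r: l + r + 1, t)  # Example 4
treemax = hom(lambda a: a, lambda l, a, r: max(l, a, r), t)          # Example 5

assert number_of_leaves == 3
assert number_of_internal == 2
assert treemax == 9
```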
The fact that all tree homomorphisms are, in some sense, the same is useful in two ways. First, finding tree homomorphisms reduces to finding component functions. Since component functions describe local actions at the nodes of trees, they are usually simpler conceptually than homomorphisms. There is a separation between common global tree structure and detailed node computations. Second, there is a common mechanism for computing any homomorphism, so it makes sense to spend time optimising it, rather than developing a variety of ad hoc techniques.

Nevertheless, there are some classes of homomorphisms for which better implementations are possible. These classes have many members that are of practical interest.
3 Four Classes of Tree Homomorphisms

There are four special classes of homomorphisms that are paradigms for common tree computations. They are: tree maps, tree reductions, and two forms of tree accumulations, upwards and downwards.

In our discussion of implementations, we will assume either the EREW PRAM or a distributed-memory hypercube architecture as the target computer. The EREW PRAM does not charge for communication, and our implementations can all be arranged on the hypercube so that communication is always with nearest neighbours, except for the tree contraction algorithm used in several places. This enables us to include communication costs in our complexity measures. We use a result of Mayr and Werchner [16] to justify the complexity of tree contraction on the hypercube. We will also assume that a tree of n nodes is processed by an n-processor system, so that there is a processor per tree node. We return to this very unrealistic assumption later.
A tree map is a tree homomorphism that applies a function to all the nodes of a tree, but leaves its structure unchanged. If f is some function f : A → B, then TreeMap f is the function

TreeMap f = Hom (Leaf ∘ f, Join ∘ (id × f × id)) : Tree A → Tree B
The recursive schema becomes

evaluate_tree_homomorphism (t)
  case t of
    Leaf(a) :
      return Leaf (f(a));
    Join(Tree t1, a, Tree t2) :
      return Join (evaluate_tree_homomorphism (t1),
                   f(a),
                   evaluate_tree_homomorphism (t2))
  end case
end
The recursion "unpacks" the argument tree into a set of leaves and internal nodes. These are then joined back together in exactly the same structure, except that the function f is applied to each value before it is placed back in the tree. The resulting tree has exactly the same shape as the argument tree; only the values and types at the leaves and internal nodes have changed.

Using the recursive schema as an implementation is therefore unnecessarily cumbersome. A tree map can be computed directly by skipping the disassembly and subsequent reassembly and just applying f to each leaf and internal node. In a parallel implementation a tree map can be computed without any communication.
Let ti (f) denote the time complexity of f on i processors. The sequential complexity of a tree map is given by

t1 (TreeMap f) = n × t1 (f)

and the parallel complexity by

tn (TreeMap f) = t1 (f)
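A direct sequential sketch of tree map over the same tuple encoding used earlier makes the "shape-preserving" property concrete; the encoding and function names are ours, not the report's.

```python
def tree_map(f, t):
    # Apply f at every node, leaving the tree's shape unchanged.
    if t[0] == "leaf":
        return ("leaf", f(t[1]))
    _, t1, a, t2 = t
    return ("join", tree_map(f, t1), f(a), tree_map(f, t2))

t = ("join", ("leaf", 1), 2, ("leaf", 3))
assert tree_map(lambda x: x * 10, t) == ("join", ("leaf", 10), 20, ("leaf", 30))
```

In a parallel setting each node applies f independently, which is why no communication is needed.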
A tree reduction is a tree homomorphism that replaces structure without explicitly manipulating values. Applied to a tree, a tree reduction most often produces a single value. Formally, a tree reduction is a tree homomorphism

TreeReduce g = Hom (id, g) : Tree A → A

where the type of g is

g : A × A × A → A

The recursive schema becomes

evaluate_tree_homomorphism (t)
  case t of
    Leaf(a) :
      return a;
    Join(Tree t1, a, Tree t2) :
      return g (evaluate_tree_homomorphism (t1),
                a,
                evaluate_tree_homomorphism (t2))
  end case
end
Figure 5 shows a tree and the result of applying a reduction to it.
A tree reduction can be computed in parallel in time proportional to the height of the tree, assuming that each application of the function g takes constant time. Knowing the result of the reduction on subtrees, the result of the reduction on the tree formed by joining them requires a further application of g, so the critical path is the chain of applications of g along the longest branch.

Figure 5: A Tree Contraction (the tree of Figure 3 reduces to the single value g (g (d, b, e), a, c))

The sequential complexity of tree reduction is

t1 (TreeReduce g) = n × t1 (g)

and the parallel complexity is

tn (TreeReduce g) = ht × t1 (g)

where ht is the height of the tree.
Somewhat surprisingly, tree reduction can be computed much faster for many kinds of functions g. For theoretical models such as the EREW PRAM and more practical architectures such as the hypercube, tree reduction can be computed in time logarithmic in the number of nodes of the tree, subject to some mild conditions on the function g [1, 16], described in Appendix A. This is a big improvement over the method suggested above, since a completely left- or right-branching tree with n nodes requires time proportional to n to reduce directly, whereas the faster algorithm takes time log n.

The key to fast tree reduction is making some useful progress towards the eventual result at many nodes of the tree, whether or not the reductions for the subtrees of which they are the root have been completed. Naive reduction applies g only at those nodes both of whose children are leaves. For an unbalanced tree, this creates a chain of reductions whose length is proportional to the height of the tree. However, for well-behaved reduction operations it is also possible to carry out partial reductions at nodes only one of whose descendants is a leaf. This brings the time complexity of the tree reduction down to logarithmic in the size of the tree, no matter how unbalanced it is. The details are given in Appendix A.

Thus we have a parallel time complexity for tree reduction of

tn (TreeReduce g) [tree contraction] = O(log n)

since t1 (g) must be O(1).
The third useful family of tree homomorphisms are the upwards accumulations. Upwards accumulations are operations in which data can be regarded as flowing upwards in the tree, and where the computations that take place at each node depend on the results of computations at lower nodes. These are like tree reductions that leave all their partial results behind. The final result is a tree of the same shape as the argument tree, in which each node is the result of a tree reduction rooted at that node. This is illustrated in Figure 6.

Upwards accumulation is characterized as follows. Let subtrees be the function that replaces each node of a tree by the subtree rooted at that node. Hence it takes a tree and produces a tree of trees. Although subtrees is a messy and expensive function, it is still a homomorphism.

Upwards accumulations are those functions that can be expressed as the composition of subtrees with a tree homomorphism mapped over the nodes of the intermediate tree:

upwards accumulation = TreeMap (Hom (f1, f2)) ∘ subtrees

If Hom (f1, f2) : Tree A → X then the upwards accumulation has type signature Tree A → Tree X.
Computing an upwards accumulation in the way implied by this definition is expensive. The point of defining upwards accumulations like this is that the result at any node shares large common expressions with the results at its immediate descendants. Therefore, the number of partial results that must be computed in an upwards accumulation is linear in the number of tree nodes. Thus it is possible that there is a fast parallel algorithm, provided the dependencies introduced by the communication involved are not too constraining; and indeed there is. The algorithm is an extension of tree contraction; when a node u is removed, it is stacked by its remaining child. When this child receives its final value, it unstacks u and computes its final value. Details may be found in [7].

Figure 6: An Upwards Accumulation (for the tree of Figure 3, the root becomes f2 (f2 (f1 d, b, f1 e), a, f1 c), the internal node b becomes f2 (f1 d, b, f1 e), and each leaf x becomes f1 x)
The sequential time complexity of an upwards accumulation is

t1 (UpAccum (f1, f2)) = n × (t1 (f1) + t1 (f2))

its parallel time complexity is

tn (UpAccum (f1, f2)) = t1 (f1) + ht × t1 (f2)

and its parallel time complexity using extended tree contraction is

tn (UpAccum (f1, f2)) [tree contraction] = O(log n)

because f1 and f2 must be O(1).
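A naive sequential sketch of upwards accumulation (ours; it runs in linear time and makes no attempt at the tree-contraction optimisation) makes the specification concrete: each node is replaced by the reduction of the subtree rooted there.

```python
def up_accum(f1, f2, t):
    # Trees are ("leaf", a) or ("join", t1, a, t2).  Each node of the
    # result holds the reduction Hom(f1, f2) of the subtree rooted there.
    if t[0] == "leaf":
        return ("leaf", f1(t[1]))
    _, t1, a, t2 = t
    u1, u2 = up_accum(f1, f2, t1), up_accum(f1, f2, t2)
    root = lambda s: s[1] if s[0] == "leaf" else s[2]
    return ("join", u1, f2(root(u1), a, root(u2)), u2)

t = ("join", ("join", ("leaf", 1), 2, ("leaf", 3)), 4, ("leaf", 5))
r = up_accum(lambda a: a, lambda l, a, r: l + a + r, t)  # subtree sums
assert r[2] == 15      # the root holds the sum of the whole tree
assert r[1][2] == 6    # the left internal node holds 1 + 2 + 3
```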
The fourth useful family of tree homomorphisms are the downwards accumulations. Downwards accumulations replace each node of a tree by some function of the nodes on the path between it and the root. This models functions in which the flow of information is broadcast down through the tree.

We first define the type of non-empty paths, which are like lists except that they have two concatenation constructors, left turn, ^, and right turn, _. This type models the paths that occur between the root of a tree and any other node, where it is important to remember whether the path turns towards a left or a right descendant at each step. The two constructors are mutually associative, that is

a ? (b ?? c) = (a ? b) ?? c

where ? and ?? stand for either of the constructors. Homomorphisms on paths are the unique arrows to algebraic structures

(P, f : A → P, ⊕ : P × P → P, ⊗ : P × P → P)

where ⊕ and ⊗ satisfy the same mutual associativity property. Define paths to be the function that replaces each node of a tree by the path between the root and that node. A downwards accumulation is a tree homomorphism that can be expressed as the composition of paths with a path homomorphism mapped over the nodes of the intermediate tree:

downwards accumulation = TreeMap (PathHom (⊕, ⊗)) ∘ paths

If PathHom (⊕, ⊗) : Path A → X then the downwards accumulation has type signature Tree A → Tree X.
A downwards accumulation is shown in Figure 7. The value that results at each node of the tree depends on the values of the nodes that appear on the path between the node and the root, together with their structural arrangement, that is, the combination of left and right turns that appear along the path. Clearly it is expensive to compute a downwards accumulation by computing all of the paths and then mapping a path reduction over them. However, there are again large common expressions between each node and its parent, so that the total number of expressions to be computed is linear in the size of the tree. As a result, there is an algorithm that runs in parallel time proportional to the logarithm of the number of nodes in the tree [7].

Figure 7: A Downwards Accumulation (each node of the tree of Figure 3 is replaced by a combination of the values on the path from the root to that node)
The sequential time complexity of a downwards accumulation is

t1 (DownAccum (⊕, ⊗)) = n × (t1 (⊕) + t1 (⊗))

its parallel time complexity is

tn (DownAccum (⊕, ⊗)) = t1 (⊕) + ht × t1 (⊗)

and its parallel time complexity using extended tree contraction is

tn (DownAccum (⊕, ⊗)) [tree contraction] = O(log n)

because ⊕ and ⊗ must be O(1).
These complexity results are based on the assumption of one processor per document node. It is possible to implement all of these homomorphisms using fewer processors by allocating a region of each tree to a processor, which then applies each homomorphism to that region. The actual partitioning is complex, because a partition of a tree into regions is not normally itself a tree. However, with some care it is possible to get implementations of the fast parallel operations above in time complexity

O(n/p + log p)

for trees of size n using p processors [20]. For small p, this gives almost linear speed-up over sequential implementations, while for large p it gives almost logarithmic execution times.

Binary trees are easily extended to trees in which each internal node has a list of subtrees, so-called Rose trees [6]. Rose trees much more naturally model SGML tagged text. The complexity of the algorithms we will present changes only by a constant factor, since any Rose tree can be replaced by a binary tree without changing the order of magnitude of the number of nodes.
4 Global Properties of Documents

We now have a set of homomorphic tree operations that can be evaluated in parallel effectively. We now begin to show how these operations can be applied to structured text.

We begin with operations that are tree maps or tree reductions. A broad class of such operations are those that count the number of occurrences of some text or entity in a document. In such tree homomorphisms, the pair of functions used are of the general form

f1 a = if a = entityname then 1 else 0
f2 (t1, a, t2) = t1 + t2 + (if a = entityname then 1 else 0)

with types

f1 : A → N
f2 : N × A × N → N
Some tree operations can be factored into the composition of a tree map and a tree reduction. This is often an easy way to understand and compute the complexity of a homomorphism. Notice that f2 above can be written as

f2 = ⊕ ∘ (id × f1 × id)

where ⊕ is ternary addition. We can express the `count entities' homomorphism above as a composition like this:

count entities = TreeReduce ⊕ ∘ TreeMap f1

The tree map can be computed in a single parallel step, taking time t1 (f1). The tree reduction step can be computed using tree contraction in a logarithmic number of steps, each involving an application of functions related to f2, taking time O(log n). Thus the total parallel time complexity of evaluating this tree homomorphism is

tn (total time) = O(log n) + t1 (f1)     (1)

The following document properties of frequency can be determined in the parallel time given by Equation 1:

- the number of occurrences of a word (that is, a delimited terminal string),
- the number of occurrences of a structure or entity (section, subsection, paragraph, figure),
- the number of reference points (labels).
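This counting pattern can be sketched directly; the encoding below is ours, with each node carrying an (entity name, attributes) pair as described earlier.

```python
def hom(f1, f2, t):
    # Trees are ("leaf", a) or ("join", t1, a, t2).
    if t[0] == "leaf":
        return f1(t[1])
    _, t1, a, t2 = t
    return f2(hom(f1, f2, t1), a, hom(f1, f2, t2))

def count_entity(name, t):
    # f1 tests a leaf's entity name; f2 adds the subtree counts and
    # tests the internal node's entity name.
    f1 = lambda a: 1 if a[0] == name else 0
    f2 = lambda l, a, r: l + r + (1 if a[0] == name else 0)
    return hom(f1, f2, t)

doc = ("join",
       ("join", ("leaf", ("para", {})), ("section", {}), ("leaf", ("para", {}))),
       ("chapter", {}),
       ("leaf", ("para", {})))
assert count_entity("para", doc) == 3
assert count_entity("section", doc) == 1
```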
An extension of these counting tree homomorphisms produces simple data types as results. For example, to produce a list of the different entity names used in a document, we use a tree homomorphism with component functions

f1 a = { entityname a }
f2 (t1, a, t2) = t1 ∪ t2 ∪ { entityname a }

that produces a set containing the entity names that are present. Each leaf is replaced by the name of the entity it represents. Internal nodes merge the sets of entity names of their descendants with the name of the entity they represent. Using sets means that we record each entity name only once in the final set. To determine the number of different entities present in a document, we need only compute the size of the set produced by this tree homomorphism.
Changing the set used to a list, we can define a tree homomorphism to produce a table of contents. It is

f1 a = [a.string]
f2 (t1, a, t2) = [a.string] ++ t1 ++ t2

where ++ is list concatenation, and a.string extracts the string attribute associated with node a. This homomorphism produces a list of all of the tag instances in the document in level order. The parallel time complexity of these tree homomorphisms is not straightforward to compute, for two reasons: the operation of concatenation is not necessarily constant time, so the computation of f2 will not be; and the size of lists grows with the distance from the leaves, creating a communication cost that must be accounted for on real computers [21].
Another useful class of tree homomorphisms computing global properties are those that compute properties of extent. The most obvious example computes the length of a document in characters. It is

f1 a = length (a.string)
f2 (t1, a, t2) = t1 + t2 + length (a.string)

Similarly, the tree homomorphism that computes the deepest nesting depth of structures in a document is

f1 a = 0
f2 (t1, a, t2) = ↑(t1, t2) + 1

Again their parallel time complexity is given by Equation 1.

Software is an example of structured text. Some tree homomorphisms that apply to software are: computing the number of statements, and building a simple (that is, unscoped) symbol table.
5 Search Problems

Another important class of operations on structured text involves searching it for data matching some key. Four levels of search can be distinguished:

1. Search on index terms. This requires preprocessing of the document to allocate search terms, with the attendant problems of choices varying between individuals. It is usually implemented using indexing. Parallelism can be used by partitioning the index.

2. Search on full text. This is implemented using a signature file technique [4, 22, 23] or special-purpose hardware [10, 11]. Parallelism can be used by partitioning the signature file.

3. Search on non-hierarchical tagged regions. This is a more expressive variant of full text search in which non-hierarchical tags are present in the text. Searches may include references to tags as well as to content. This approach is used in the PAT system [8] for searching the Oxford English Dictionary. The descriptive markup of historical documents is typically too ad hoc to be captured by SGML-style tags, but is nevertheless an important part of the organisation of the document. The PAT index contains the strings beginning at each new word or tag position in the document, stored in a Patricia trie.

Fulcrum use a similar idea with zone tags, associating a set of zones with each range of the text. Zones are added to the document index, allowing searches to be based on both content and zone. This allows content references to be modified by limited context information, e.g. "dog" within a section heading [12].

4. Search on full text and structure. This allows search to include information about both contents and tags, and uses regular expressions, so that content references can be modified by context information, e.g. "dog" within the third section heading, and truncated term expansion is trivially available. No existing system has this capability, but the path expressions query language [14] allows such queries to be expressed, and we show in subsequent sections how such searches may be implemented efficiently.

Fortunately, this fourth level of search is no more expensive to implement in parallel than the previous three. Thus even if parallelism does not provide absolute performance improvements because of limits on disk speeds, it can provide improvements in functionality. We explore this style of search further in the next two sections.
6 Parallel Search of Flat Text
There is a well-known parallel algorithm for recognizing whether a given string
is a memb er of a regular language in time logarithmic in the length of the string
[5, 9]. This algorithm is naturally parallel and readily adapted to do cument
search, even on quite mo dest parallel computers. Although it was describ ed
for the Connection Machine [9] it never seems to have b een used on any real
parallel system. It is related to the technique used by Hollaar [10, 11], but is
more expressive, since the use of sp ecial-purp ose hardware limits the exibility
of Hollaar's searches. We describ e the parallel algorithm and showhowit 22
s 1 c
a
s 0 s 3
c a
b
b
s 2
Figure 8: A Simple Finite State Automaton
allows queries that are regular expressions, and hence can search for patterns
involving b oth content and tags.
Suppose that we want to determine if a given string is a member of a regular language over some alphabet {a, b, c}. The regular language can be defined by a finite state automaton, whose transitions are labelled by elements of the alphabet. This automaton is preprocessed to give a set of tables, each one labelled with a symbol of the alphabet, and consisting of pairs of states such that the labelling symbol labels a transition from the first state in each pair to the second. For example, given the automaton in Figure 8, the tables are:

a: (s0, s1), (s1, s2)
b: (s0, s2), (s2, s3)
c: (s2, s1), (s1, s3)
The homomorphism on lists consists of two functions:

    f1(a) = Table(a) = {(si, sj) | si →a sj is a transition}
    f2(t1, t2) = {(si, sk) | (si, sj) ∈ t1, (sj, sk) ∈ t2}

It will be convenient to write f2 as an infix binary operation, ~, on tables. The function f1 replaces each symbol in the input string with the table defining how that symbol maps states to states. The function f2 then composes tables to reflect the state-to-state mapping of longer and longer strings. Consider ~ applied to the tables for a and b:

    a: (s0, s1), (s1, s2)  ~  b: (s0, s2), (s2, s3)  =  ab: (s1, s3)

At the end of the reduction, the single resulting table expresses the effect of the entire input string on states. If it contains a pair whose first element is the initial state and whose second element is a final state, the string is in the regular language.
All of the tables are of finite size, no larger than the number of transitions in the automaton. Each table composition takes time no more than linear in the number of states of the automaton. The reduction itself takes time logarithmic in the size of the string being searched, on a variety of parallel architectures [18].
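As a concrete illustration, the table-building and reduction steps can be sketched in a few lines of Python (taking s0 as the initial state and s3 as the final state is an assumption, since Figure 8 does not mark them):

```python
from functools import reduce

# Per-symbol transition tables for the automaton of Figure 8.
TABLES = {
    "a": {("s0", "s1"), ("s1", "s2")},
    "b": {("s0", "s2"), ("s2", "s3")},
    "c": {("s2", "s1"), ("s1", "s3")},
}

def compose(t1, t2):
    """The infix operation ~: relational composition of two tables."""
    return {(si, sk)
            for (si, sj) in t1
            for (sj2, sk) in t2
            if sj == sj2}

def accepts(string, initial="s0", final="s3"):
    """f1 maps each symbol to its table; reducing with ~ gives the
    state-to-state mapping of the whole string."""
    overall = reduce(compose, [TABLES[sym] for sym in string])
    return (initial, final) in overall
```

Because ~ is associative, the reduce may be bracketed arbitrarily, which is exactly what permits the logarithmic-depth parallel evaluation; composing the tables for a and b reproduces the table ab of the worked example.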
The regular language recognition problem is easily adapted for query processing. Suppose we wish to determine if some regular expression, RE, is present in an input string. This regular expression defines a language, L(RE), that is then extended to allow for the existence of other symbols and for the string described by the regular expression to appear in a longer string. Call this extended language L(RE)'. Then the search problem becomes: is the text s in the language L(RE)'? There is a finite state automaton corresponding to L(RE)'. It is based on the finite state automaton for L(RE) with the addition of extra transitions labelled with symbols from the extended alphabet. It can be as large as exponential in the size of the regular expression, but is independent of the size of the string being searched. An example of the automaton corresponding to the search for the word "cat" is shown in Figure 9 and the resulting algorithm on an input string "bcdabcat" in Figure 10.
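The tables of Figure 10 can be generated mechanically. The following sketch builds the per-symbol table of the extended "cat" automaton (every symbol other than c, a, t behaves as the wildcard) and reduces over the input; the state numbering 0 to 3 follows Figure 9:

```python
from functools import reduce

def cat_table(sym):
    """State-to-state table for one symbol of the extended 'cat' automaton.
    States: 0 = nothing useful seen, 1 = seen 'c', 2 = seen 'ca',
    3 = 'cat' found (absorbing)."""
    def step(state):
        if state == 3:
            return 3          # once found, stay found
        if sym == "c":
            return 1          # a c always (re)starts a match
        if sym == "a" and state == 1:
            return 2
        if sym == "t" and state == 2:
            return 3
        return 0              # anything else resets the match
    return {(s, step(s)) for s in range(4)}

def compose(t1, t2):
    """The binary composition ~ on tables."""
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

def contains_cat(text):
    overall = reduce(compose, [cat_table(sym) for sym in text])
    return (0, 3) in overall
```

The per-symbol tables this produces for b, c, a, and t agree with the first row of Figure 10.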
Queries that are boolean expressions of simpler regular language queries could be evaluated by evaluating the simple queries independently and then carrying out the required boolean operations on the results. However, the closure of the class of regular languages under union, intersection, complementation, and concatenation means that such complex queries can be evaluated in a single pass through the data by constructing the appropriate deterministic finite state automaton. For example, a query of the form "are x and y in the
Figure 9: Finite State Automaton for "cat" (states 0 to 3, with state 3 accepting and absorbing: c takes any of states 0, 1, 2 to state 1; a takes 1 to 2; t takes 2 to 3; any other symbol, including the wildcard ?, returns to state 0)
      b       c       d       a       b       c       a       t
    (0,0)   (0,1)   (0,0)   (0,0)   (0,0)   (0,1)   (0,0)   (0,0)
    (1,0)   (1,1)   (1,0)   (1,2)   (1,0)   (1,1)   (1,2)   (1,0)
    (2,0)   (2,1)   (2,0)   (2,0)   (2,0)   (2,1)   (2,0)   (2,3)
    (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)   (3,3)

      bc              da              bc              at
    (0,1)           (0,0)           (0,1)           (0,0)
    (1,1)           (1,0)           (1,1)           (1,3)
    (2,1)           (2,0)           (2,1)           (2,0)
    (3,3)           (3,3)           (3,3)           (3,3)

      bcda                            bcat
    (0,0)                           (0,3)
    (1,0)                           (1,3)
    (2,0)                           (2,3)
    (3,3)                           (3,3)

      bcdabcat
    (0,3)
    (1,3)
    (2,3)
    (3,3)

Figure 10: Algorithm Progress for the Search
text" amounts to asking if the text is a string of the augmented language of the intersection of the regular languages of the strings x and y.
Regular language recognition for strings is easily generalized to trees. The first step applies a tree map that replaces each node of the tree by a table mapping states to states, that is, the tree homomorphism:

    f1(a) = Leaf(Table(a)),    f2 = Join(id, f1, id)

The second step is a tree reduction, in which the tables of a node and its two subtrees are composed using a ternary generalisation of the composition operator used in the string algorithm above:

    Hom(id, f2)    where    f2(t1, t2, t3) = ~(t2, t1, t3)

where table composition has been generalized to a ternary operation

    ~(t2, t1, t3) = {(si, sl) | (si, sj) ∈ t2, (sj, sk) ∈ t1, (sk, sl) ∈ t3}

Notice that the composition of tables composes the table belonging to the internal node first, followed by the tables corresponding to the subtrees in order. This represents the case where the text in the internal node represents some kind of heading. The order may or may not be significant in an application, but care needs to be taken to get it right, so that strings that cross entity boundaries are properly detected. The extension of the search in a linear string to a tree is shown in Figure 11.
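A sequential sketch of the tree version, reusing the tables of Figure 8 (the nested-tuple tree representation is an assumption):

```python
# Trees are ("leaf", symbol) or ("join", symbol, left, right).
TABLES = {
    "a": {("s0", "s1"), ("s1", "s2")},
    "b": {("s0", "s2"), ("s2", "s3")},
    "c": {("s2", "s1"), ("s1", "s3")},
}

def compose2(t1, t2):
    """Binary table composition ~."""
    return {(si, sk) for (si, sj) in t1 for (sj2, sk) in t2 if sj == sj2}

def compose3(tnode, tleft, tright):
    """Ternary ~: the internal node's table is composed first, then the
    left subtree's, then the right's, matching the heading-first order."""
    return compose2(compose2(tnode, tleft), tright)

def tree_reduce(tree):
    """Tree map (symbol replaced by its table) fused with the tree
    reduction that uses the ternary composition."""
    if tree[0] == "leaf":
        return TABLES[tree[1]]
    _, sym, left, right = tree
    return compose3(TABLES[sym], tree_reduce(left), tree_reduce(right))
```

Because the node's text precedes its subtrees, the result agrees with the string reduction of the node-first flattening of the tree.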
This algorithm takes parallel time logarithmic in the number of nodes of the tree, using the tree contraction algorithm discussed in the previous section. The automaton can be extended as before to solve query problems, giving a fast parallel query evaluator for a useful class of queries.

Query evaluation problems that are of this kind include word and phrase search, and boolean expressions involving phrase search. Such queries are of about the complexity of those permitted by the PAT system.

Because the tree structure of structured text encodes useful information, we now turn to extending this search algorithm to a new kind of search in which not only the presence of data but also its relationships can be expressed in the query. This increases the power of the query language substantially. It
Figure 11: Algorithm Progress for Tree Search (each node of the tree is replaced by its state-to-state table, and the ternary composition contracts the tree, step by step, to the single table for the whole document)
turns out that such structured queries can still be computed in parallel within logarithmic time bounds, making structured queries of practical importance.
7 Accumulations and Information Transfer
Accumulations allow nodes to find out information about all their neighbours in a particular direction. Upwards accumulations allow each node of a tree to accumulate information about the nodes below it. Downwards accumulations allow each node to accumulate information about those nodes that lie between it and the root. Powerful combined operations are formed when an upwards accumulation is followed by a downwards accumulation, because this makes arbitrary information flow between nodes of the tree possible. As we have seen, both kinds of accumulations can be computed fast in parallel if their component functions are well-behaved.

Some examples of upwards accumulations are: computing the length in characters of each object in the document; and computing the offset from each segment to the next similar segment, for example the offset from each section heading to the next section heading.
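The first of these can be sketched directly: an upwards accumulation that labels every node with the character count of its whole subtree (the tree representation, with each node carrying its own text, is an assumption):

```python
def upwards_lengths(tree):
    """tree is ("leaf", text) or ("join", text, left, right). Returns a
    tree of the same shape in which each node's text is replaced by the
    total character count of the subtree rooted there."""
    if tree[0] == "leaf":
        return ("leaf", len(tree[1]))
    _, text, left, right = tree
    l2 = upwards_lengths(left)
    r2 = upwards_lengths(right)
    # The count at an internal node combines its own text with the
    # already-accumulated counts of its two subtrees.
    return ("join", len(text) + l2[1] + r2[1], l2, r2)
```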
Some examples of downwards accumulations are: structured search problems, that is, searching for a part of a document based on its content and its structural properties; and finding all references to a single label. Structured search is so important that we investigate its implementation further in the next section.
Gibbons gives a number of detailed examples of the usefulness of upwards followed by downwards accumulations in [6]. Some examples that are important for structured text are: evaluating attributes in attribute grammars, which can represent almost any property of a document; generating a symbol table for a program written in a language with scopes; rendering trees, which is important for navigating document archives; determining which page each object would fall on if the document were produced on some device; determining which node the i-th word in a document is in; and determining the font of each object in systems where font is relative, not absolute, such as LaTeX.

Operations such as resolving all cross references and determining all the references to a particular point can also be cast as upwards followed by downwards accumulations, but the volume of data that might be moved on some steps makes this expensive.
8 Parallel Search of Structured Text
Queries on structured text involve finding nodes in the tree based on information about the content of the node (its text and other attributes) and on its context in the tree, particularly its relative context, such as "the third section". Here are some examples, based on [14]:

Example 6 document where `database' in document

This returns those documents that contain the word `database'.

Example 7 document where `Smith' in Author

This returns those documents where the word `Smith' occurs as part of the Author structure within each document. This query depends partly on structural information (that an Author substructure exists) as well as text. Notice that the object returned depends on a condition on the structure below it in the tree.
Example 8 Section of document where `database' in document

This returns all sections of documents that contain the word `database'. The object returned depends on a condition of the structure above it in the tree. Notice that there is a natural way to regard this as a two-step operation: first select the documents, then select the sections.

Example 9 Section of document where `database' in SectionHeading

This returns those sections whose headings contain the word `database'. Notice that the object returned depends on a condition of a structure that is neither above nor below it in the tree, but is nevertheless related to it.
All of these queries describe patterns in the tree; patterns may include "don't care"s in both the nodes and the branch structure. Queries are relative to a particular node in the tree (usually the root) and return a bag of nodes corresponding to the roots of trees containing the pattern (bags are sets in which repetitions count; we want to know all the places where a pattern is present, so the result must allow for more than one solution). Allowing queries to take a bag of nodes as their inputs, so that searches begin from all of these nodes, allows queries to be composed. Note that a node is precisely identified by the path between itself and the root, so we might as well think of node identifiers as paths.

There are two different kinds of bag operations in the query examples above. The first are filters that take a bag of nodes and return those elements of the bag that satisfy some condition. The second are moves that take a bag of nodes and return a bag in which each node has been replaced either by a node related to it (its ancestor, its descendant, its sibling to the right) or by one of its attributes. A simple query is usually a filter followed by the value of an attribute at all the nodes that have passed the filter. More complex queries such as the last example above require more complex moves.
This insight is the critical one in the design of path expressions [15], a general query language for structured text applications. The crucial property of path expressions that we require is that filters can be broken up into searches for patterns that are single paths, and are therefore expressible as regular expressions over paths.

Such filters can be computed by replacing each node of the structured text tree by the path from it to the root, and then applying the regular string recognition algorithm, extended to include left and right turns, to each of these paths. Those nodes for which the string recognition algorithm returns True are the nodes selected by the filter.

For example, a query about the presence of a chapter whose first section contains the word "About" is equivalent to asking if there is a path in the tree that contains the following:

    (entity = chapter) (left turn) (entity = section ∧ "About" ∈ section.string)
Figure 12: A Tree Annotated with Turn Information (the root a has left child b and right child a; that a has left child b and right child c; c has left child a and right child t; each edge is labelled L or R for the turn it represents)
A tree annotated with left and right turns is shown in Figure 12, and the result of applying the paths function to it is shown in Figure 13.

The regular language recognition algorithm for strings can be extended to paths straightforwardly. The query string may contain references to right and left turns, and so may the extended regular language. Otherwise the operations of table construction and binary table composition (~) behave just as before.
Filters can be expressed as

    filter = TreeMap (PathHom (Table, ~)) ∘ paths

The right hand side replaces the text tree with the tree in which each node contains the path between itself and the root. The TreeMap then maps the regular language recognition algorithm for paths over each of these nodes. Those that find an instance of the string are the roots of subtrees that correspond to the query pattern.
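A minimal sketch of such a filter, substituting Python's re engine for the table-based path recognizer (the tables are what give the parallel implementation; the structure of the filter is the same). Each node is identified with its root-to-node path, as in Figure 13, and the filter keeps the paths the regular expression accepts:

```python
import re

# Trees are ("leaf", label) or ("join", label, left, right).
def paths(tree, prefix=""):
    """Downwards accumulation: yields the root-to-node path of every
    node, with L and R recording left and right turns (as in Figure 13)."""
    here = prefix + tree[1]
    yield here
    if tree[0] == "join":
        yield from paths(tree[2], here + "L")
        yield from paths(tree[3], here + "R")

def filter_nodes(tree, pattern):
    """Keep the node identifiers (paths) matched by the regular expression."""
    rx = re.compile(pattern)
    return [p for p in paths(tree) if rx.fullmatch(p)]

# The tree of Figure 12:
doc_tree = ("join", "a",
            ("leaf", "b"),
            ("join", "a",
             ("leaf", "b"),
             ("join", "c", ("leaf", "a"), ("leaf", "t"))))
```

Applying paths to doc_tree reproduces Figure 13, and filter_nodes(doc_tree, r".*c") selects the single node whose path ends at a c node.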
But we have already seen that operations that can be expressed as maps over paths are downwards accumulations, and hence can be computed efficiently in parallel when the component functions are well-behaved. We have also seen that table composition is a constant-time, associative function and so satisfies
a
aLb aRa
aRaLb aRaRc
aRaRcLa aRaRcRt
Figure 13: The Result of Applying the paths Function
the requirements for tree contraction. Thus there is a logarithmic time parallel algorithm for computing such filters:

    t(filter) = O(log n)

Path expression queries may therefore be included in structured text systems without additional cost, dramatically improving the sophistication of the way in which such resources can be queried.
9 Conclusions
The SGML approach of structural tagging of entities makes it possible to model structured text as a data type. This in turn makes it possible to use the machinery of categorical data types to examine trees and tree homomorphisms. This reveals the underlying similarities between apparently different operations on structured text, suggests sequential and parallel implementations for them, and shows how their construction can be reduced to the construction of component functions.

In particular we have shown how tree contraction and its extensions can be used to build fast parallel implementations of standard search techniques.

A major contribution of the paper is the discovery of a fast parallel implementation for searches based on path expressions. These allow much more sophisticated searching than existing query languages with the same performance.
A The Tree Reduction Algorithm
The tree reduction algorithm is well-known in the parallel algorithms literature. We follow the presentation in [7].

We begin by defining a contraction operation that applies to any node and its two descendants, if at least one descendant is a leaf. Suppose that each internal node u contains a pointer u.p to its parent, a boolean flag u.left that is true if it is a left descendant, a boolean flag u.internal that is true if the node is an internal node, a variable u.g describing an auxiliary function of type A → A, and, for internal nodes, two pointers, u.l and u.r, pointing to their left and right descendants respectively.

We describe an operation that replaces u, u.l, and u.r, where u.l is a leaf, by a single node u.r, as shown in Figure 14. A symmetric operation contracts in the other direction when u.r is a leaf. The operations required are:

    u.r.g ← λx. u.g(f2(u.l.g(u.l.a), u.a, u.r.g(x)))
    u.r.p ← u.p
    if u.left then u.p.l ← u.r else u.p.r ← u.r
    u.r.left ← u.left
    if u = root then root ← u.r
The first step is the most important one. It `folds' in an application of f2, so that the function computed at node u.r after this contraction operation is the one that would have been computed at u before the operation. The contraction algorithm will only be efficient if this step is done quickly and the resulting expression does not grow long. We will return to this point below.

The contraction operations must be applied to about half of the leaves on each step if the entire contraction is to be completed in about a logarithmic
Figure 14: A Single Tree Contraction Step: Nodes u and u.l are deleted from the tree, and u.r is updated
number of steps. The algorithm for deciding where to apply the contraction operations is the following:

1. Number the leaves left to right beginning at 0; this can be done in O(log n) time using O(n / log n) processors [3].

2. For every u such that u.l is an even-numbered leaf, perform the contraction operation.

3. For every u that was not involved in the previous step, and for which u.r is an even-numbered leaf, perform the contraction operation.

4. Renumber the leaves by dividing their positions by two and taking the integer part.
Sufficient conditions for preventing the lambda expressions in the u.g's from growing large are the following [1]:

1. For all nodes u, the sectioned function f2(·, a, ·), of type A × A → A, is drawn from an indexed set of functions F, and the function u.g is drawn from an indexed set of functions G. Both F and G contain the identity function.

2. All functions in F and G can be applied in constant time.

3. For all fi in F, g in G, and a in A, the functions λx. fi(g(x), a) and λx. fi(a, g(x)) are in G, and their indices can be computed from a and the indices of fi and g in constant time.

4. For all gi, gj in G, the composition gi ∘ gj is in G and its index can be computed from i and j in constant time.
These conditions ensure three properties: that no function built at a node takes more than constant time to derive from the functions of the three nodes it is replacing, that no such function takes more than constant time to evaluate, and that no such function requires more than a constant amount of storage. Together these three properties guarantee that the tree contraction algorithm executes in logarithmic parallel time.
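A sequential sketch of the contraction operation for the concrete reduction f2(l, a, r) = l + a + r (subtree sum). Each u.g is kept in indexed form as a single offset, so every function built during contraction is constant-size, as the conditions require; the even-numbered-leaf scheduling is elided, with contractions applied one at a time, which changes the running time but not the result:

```python
class Node:
    def __init__(self, a, l=None, r=None):
        self.a, self.l, self.r = a, l, r
        self.p = None          # parent pointer (u.p)
        self.left = False      # u.left: true if a left descendant
        self.g = 0             # offset encoding u.g = lambda x: x + g
        if l is not None:
            l.p, l.left = self, True
            r.p, r.left = self, False

    def is_leaf(self):
        return self.l is None

def value(u):
    """u.g applied to the value stored at u."""
    return u.a + u.g

def contract(u):
    """Delete u and its leaf child, updating the remaining child's g so
    that it now computes what u would have computed (the 'fold' step)."""
    leaf, keep = (u.l, u.r) if u.l.is_leaf() else (u.r, u.l)
    # u.r.g <- lambda x. u.g(f2(u.l.g(u.l.a), u.a, u.r.g(x))); since
    # addition is associative and commutative, this collapses to one offset:
    keep.g += value(leaf) + u.a + u.g
    keep.p, keep.left = u.p, u.left
    if u.p is not None:
        if u.left:
            u.p.l = keep
        else:
            u.p.r = keep
    return keep

def nodes(root):
    yield root
    if not root.is_leaf():
        yield from nodes(root.l)
        yield from nodes(root.r)

def contract_all(root):
    """Contract until a single leaf remains; its value is the tree sum."""
    while not root.is_leaf():
        u = next(n for n in nodes(root)
                 if not n.is_leaf() and (n.l.is_leaf() or n.r.is_leaf()))
        keep = contract(u)
        if u.p is None:
            root = keep
    return value(root)
```

In the parallel algorithm many such contractions are applied simultaneously to non-adjacent even-numbered leaves; the fold step itself is identical.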
References
[1] K. Abrahamson, N. Dadoun, D.G. Kirkpatrick, and T. Przytycka. A simple parallel tree contraction algorithm. In Proceedings of the Twenty-Fifth Allerton Conference on Communication, Control and Computing, pages 624-633, September 1987.

[2] M. Cole. Parallel programming, list homomorphisms and the maximum segment sum problem. In D. Trystram, editor, Proceedings of Parco 93. Elsevier Series in Advances in Parallel Computing, 1993.

[3] R. Cole and U. Vishkin. Faster optimal parallel prefix sums and list ranking. Information and Control, 81:334-352, 1989.

[4] C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2:267-288, April 1984.

[5] Charles N. Fischer. On Parsing Context-Free Languages in Parallel Environments. PhD thesis, Cornell University, 1975.

[6] J. Gibbons. Algebras for Tree Algorithms. D.Phil. thesis, Programming Research Group, University of Oxford, 1991.

[7] J. Gibbons, W. Cai, and D.B. Skillicorn. Efficient parallel algorithms for tree accumulations. Science of Computer Programming, 23:1-14, 1994.

[8] G.H. Gonnet, R.A. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 66-82. Prentice-Hall, 1992.

[9] W. Daniel Hillis and G.L. Steele. Data parallel algorithms. Communications of the ACM, 29(12):1170-1183, December 1986.

[10] L. Hollaar. The Utah Text Search Engine: Implementation experiences and future plans. In Database Machines: Fourth International Workshop, pages 367-376. Springer-Verlag, 1985.

[11] L. Hollaar. Special-purpose hardware for text searching: Past experience, future potential. Information Processing and Management, 27:371-378, 1991.

[12] Fulcrum Technologies Inc. Ful/Text reference manual. Fulcrum Technologies, Ottawa, Ontario, Canada, 1986.

[13] Information processing - Text and office systems - Standard Generalized Markup Language (SGML), 1986.

[14] I.A. Macleod. A query language for retrieving information from hierarchical text structures. The Computer Journal, 34(3):254-264, 1991.

[15] I.A. Macleod. Path expressions as selectors for non-linear text. Preprint, 1993.

[16] E.W. Mayr and R. Werchner. Optimal routing of parentheses on the hypercube. In Proceedings of the Symposium on Parallel Architectures and Algorithms, June 1993.

[17] G. Salton and C. Buckley. Parallel text search methods. Communications of the ACM, 31:203-215, 1988.

[18] D.B. Skillicorn. Architecture-independent parallel computation. IEEE Computer, 23(12):38-51, December 1990.

[19] D.B. Skillicorn. Foundations of Parallel Computing. Cambridge Series in Parallel Computation. Cambridge University Press, 1994.

[20] D.B. Skillicorn. Parallel implementation of tree skeletons. Technical Report 95-380, Queen's University, Department of Computing and Information Science, March 1995.

[21] D.B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. Journal of Parallel and Distributed Computing, to appear. An early version appears as Department of Computer Science Technical Report 92-329.

[22] C. Stanfill and B. Kahle. Parallel free-text search on the Connection Machine. Communications of the ACM, 29:1229-1239, 1986.

[23] C. Stanfill and R. Thau. Information retrieval on the Connection Machine. Information Processing and Management, 27:285-310, 1991.

[24] H. Stone. Parallel querying of large databases. IEEE Computer, 20(10):11-21, October 1987.