Query By Templates: A Generalized Approach for Visual Query
Formulation for Text Dominated Databases
y
Arijit Sengupta Andrew Dillon
Computer Science Department Scho ol of Library & Information Science
Indiana University Indiana University
Blo omington, IN 47405 Blo omington, IN 47405
[email protected] [email protected]
structured do cuments not only for the traditional pur- Abstract
p oses of reading, browsing, and printing, but also for
The WWW has a great potential of evolving into
searching and querying. Automated searches in nor-
aglobal ly distributed digital document library.The pri-
mal word-pro cessor do cuments are usually restricted
mary use of such a library is to retrieve information
to linear word searches. However, users can use do cu-
quickly and easily. Because of the size of these li-
ments enco ded in SGML to p ose much more complex
braries, simple keyword searches often result in too
searches involving b oth text content and structure.
many matches. More complex searches involving
boolean expressions are dicult to formulate and un-
derstand. This paper describes QBT (Query By Tem-
1.1 SGML, Databases and Do cuments
plates), a visual method for formulating queries for
structured document databases modeled with SGML.
Traditionally, do cuments in electronic formats like
Based on Zloof 's QBE (Query By Example), this
text or word-pro cessor do cuments are used mainly
method incorporates the structure of the documents
for word pro cessing, editing, publishing, and p ossible
for composing powerful queries. The goal of this tech-
reuse into other do cuments. The problem with these
nique is to design an interface for querying struc-
do cuments is that they can not b e easily used for pur-
tured documents without prior know ledge of the in-
p oses other than what they are primarily designed for.
ternal structure. This paper describes the rationale
The SGML standard extends the use of electronic do c-
behind QBT, il lustrates the query formulation princi-
uments b eyond word pro cessing and towards a means
ples using QBT, and describes results obtained from
for information interchange. SGML was originally de-
a usability analysis on a prototype implementation of
signed as a p ortable do cument enco ding format for
TM
QBT on the Web using the Java programming lan-
easy interchange b etween various platforms and sys-
guage.
tems. As stated in the standard [11], SGML was to b e
used \for publishing in its broadest de nition, ranging
1 Intro duction
from single medium conventional publishing to multi-
The SGML Standard [11]hasintro duced a metho d
media data base publishing." SGML achieves this goal
for using do cuments as a means for information in-
by involving a mo deling phase in do cument design
terchange and retrieval. We can now use prop erly
and by completely separating layout and structure.
Copyright 1997 IEEE. Published in the Pro ceedings of
The motivation in SGML is to de ne the structure of
ADL'97, May 7-9, 1997 in Washington D.C. Personal use of this
the do cument; and leave the layout to the pro cess-
material is p ermitted. However, p ermission to reprint/republish
this material for advertising or promotional purp oses or for cre-
ing application. The pro cess for the design of do cu-
ating new collectiveworks for resale or redistribution to servers
ment structure is conceptually similar to schema de-
or lists, or to reuse any copyrighted comp onentofthiswork in
sign in traditional databases. Applications pro cessing
other works, must b e obtained from the IEEE. Contact: Man-
an SGML do cument can then use this structural con-
ager, Copyright and Permissions / IEEE Service Center / 445
Ho es Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA.
tent as a database schema to solve queries (searches)
Telephone: +Intl. 908-562-3966
on the do cument. This makes the treatmentofdocu-
y
Partially supp orted by U.S. Dept. of Education award num-
ments enco ded in SGML as databases for the purp ose
b er P200A502367 and NSF Research and Infrastructure grant,
award number NSF CDA-9303189. of querying a fairly straightforward task.
for relational databases have the capability of using 1.2 Searching, querying and browsing
the schema information e ectively. Since the basic
Traditionally, the most common metho d for retriev-
structure of relational databases consists of at tables,
ing information from do cuments is by perusing. For
query languages for relational databases do not usu-
simple word-pro cessor do cuments, p erusing involves
ally deal with complex hierarchical structures. How-
sequentially scanning or paging through the do cu-
ever, query languages for do cument databases need to
ment to nd the information of interest. This form
e ectively use the do cument hierarchy. Research on
of searching, whichwe technically call \browsing", de-
query languages for do cuments has unearthed a num-
p ends completely up on the reader's p erseverance and
ber of such languages. Twovery useful surveys of such
visual attention. The success of the search dep ends
languages and systems can b e obtained from [13, 3].
up on the reader's knowledge of the sub ject matter,
Most database systems implement sp eci c database
the organization of the text, and similar factors. The
mo dels, such as the relational mo del and Ob ject-
recentadventofhyp ertext do cuments has changed the
Oriented (OO) mo del. These database mo dels are
traditional sequential metho d of browsing. Brows-
based on strong theoretical foundations and have for-
ing in hyp ertext do cuments is considered non-linear
mal pro cedural languages to manipulate data. In
[14,21] { usually p erformed by following links in the
the case of relational databases, the most commonly
do cument to other related areas of the same do cument
used query languages are SQL(Structured Query Lan-
or entirely di erent do cuments.
guage) [2] and QBE (Query By Example)[22]. SQL
Although hyp ertext do cuments mayprovide a more
is a textual language with a simple English-like syn-
exible wayofbrowsing, nding information in hyp er-
tax, and is widely implemented in most commercial
text do cuments by browsing still requires human in-
database systems. QBE provides a visual metho d for
telligence, knowledge and exp erience. Whether read-
p osing queries and is suitable even for inexp erienced
ers are more comfortable with screen reading or pap er
users. This section describ es some previous work done
reading is a debatable question[8]. Research shows
in the area of visual query formulation, and derives the
strong evidence of pap er as a more natural means of
motivation b ehind the current research.
reading [7, 9] as well as a means to rectify the problems
with reading from a screen [10,16].
2.1 Visual languages and interfaces
One ma jor advantage of electronic do cuments over
This pap er presents the design of a visual inter-
their pap er counterpart is the capability of automated
face for the purp ose of querying do cument databases.
searching. Many information retrieval techniques have
Although many di erent typ es of pro cedural query
b een prop osed [20] that can quickly extract p ortions of
languages have b een prop osed so far, not enough at-
do cuments matching given keywords. Similar searches
tention has b een devoted towards providing visual
can b e quite dicult in pap er do cuments[8]. However,
metho ds for query formulation. Current implementa-
keyword searches do not often retrieve the target do c-
tions primarily use form-based approaches for building
uments. For large do cuments combining text with the
queries. In this section, wetake a closer lo ok at QBE
logical structure of do cuments is often necessary. In
and form-based querying.
manycases,many comp onentsearch conditions need
Query By Example (QBE). Query By Exam-
to be combined using b o olean expressions to build a
ple [22] is a visual language for querying relational
narrower search. For example, a reader might b e lo ok-
databases. This language has a simple interface com-
ing for pap ers in a journal authored by \Jane Do e"
p osed of tabular skeletons representing tables in the
with a title containing \creature". These typ es of
database. Users sp ecify queries by entering sample
searches are less common in traditional information
values (or examples) in appropriate areas of the table
retrieval systems which lackthe capabilityofinvolv-
skeleton. These values can b e either search strings or
ing structure in searches. Searches involving structure
variables for the purp ose of join and other op erations
and complex op erations such as join have been the
that require variables. QBE can also handle complex
ob ject of recent research[4,1]. In this pap er, we will
b o olean combinations of such search expressions us-
refer to such searches as \queries".
ing a sp ecial section of the screen (termed condition
2 Query languages for structured do c-
box). Aggregation op erations such as sum, count and
uments
average can also b e p erformed by indicating appropri-
ately in the tabular skeleton. Figure 1 shows a simple The ob ject of developing languages for the pur-
join op eration using QBE as implemented in a p opular pose of querying is to attain the capability of in-
database system. The main idea b ehind this metho d is volving complex op erations involving text (data) and
that the user provides an example of outputs that she structure (meta-data). Traditional query languages
exp ects from her query, and the query engine lo oks in 2.2 Other query visualization approaches
the database for data that matches the given exam-
Here we describ e two exp erimental systems that use
ple. This works nicely for relational databases, pri-
interesting visualization techniques for the purp ose of
marily b ecause the tabular structure of the database
information retrieval on complex structures. Although
ts quite well with tabular skeletons used in the inter-
the approaches here are directed primarily towards vi-
face. Some useful prop erties of QBE are:
sualization of data, they do provide useful incentivein
Simplicity The core QBE do es not require the
this work.
knowledge of anysyntactic constructs or the internal
structure of the database to use.
TEXTVIZ. TEXTVIZ[6] is a system for intuitively
visualizing, searching, and querying the contents of
Equivalence relational databases are stored in ta-
large text databases under development at TASC.
bles, and in QBE, the queries are entered into equiv-
This metho d uses a map-like representation of do cu-
alent table skeletons.
ments that shows users how comp onents of do cuments
are related to each other in the collection. It pro-
Closure Users give examples of the typ e of values
vides twolevels of text database information { (1) a
that they are lo oking for, and the result of the queries
macro p ersp ective, giving a top-level view of the whole
shows up as tables in the same format as the data.
database using the top ographic map, and (2) a micro
Completeness The core QBE, along with some
p ersp ective fo cusing up on the conceptual content of
additional commands, condition b oxes and other con-
each do cument comp onent. TEXTVIZ uses a natu-
structs, can construct the same class of queries as re-
ral language or text pro cessing front-end to extract
lational algebra.
meaning ab out the contents of a do cument database.
It then generates and displays a text map p ortraying
These prop erties make QBE a complete language
these contents. This provides the user the ability to
that can adapt to any relational schema and make
comprehend the entire database at once, giving the
querying almost indep endent of the users' knowledge
user a global p ersp ective of the organization of the
of the structure of the data. In a later section we
contents of the database.
will show how our approach keeps all these prop er-
ties in a generalized interface designed primarily for
AIR and SCALIR. AIR (Adaptive Information Re-
do cuments, but applicable to any complex structured
trieval) and SCALIR (Symbolic and Connectionist
data.
Approach to Legal Information Retrieval) are two sys-
tems describ ed by Rose et al.[18] that use a connec-
Query By Forms. Although QBE is a formally ac-
tionist approach to the information retrieval problem.
cepted direct-manipulation visual language for rela-
AIR's goal is to build a representation that will re-
tional algebra, most relational database systems still
trieve do cuments that are more likely to be relevant
do not universally use or implement it. Some com-
to particular queries. In order to achieve this, users
monly used relational database systems implement dif-
b egin a session with AIR by describing their informa-
ferent variations of QBE, but application develop ers
tion need using a very simple query language. Sub-
designing query interfaces for a sp eci c purp ose sel-
sequently users can re ne their queries by sp ecifying
dom use the QBE metho d directly in their implemen-
multiple clauses or bynavigating links shown bythe
tations. In these cases, develop ers use form-based in-
system. SCALIR is an IR system for retrieving cases
terfaces.
ab out copyright law. Internally, SCALIR contains a
In form-based interfaces, the user is presented with
network of no des representing terms, court decisions,
alistofsearchable elds, each with an entry area that
and statute sections. A query consists of the acti-
can b e used to indicate the search string. To p ose a
vation of a few selected no des. This activation is
query, the user needs to ll in the areas of the form
spread throughout the network, building the retrieval
relevant to her search. A more general form interface
set from the activated no des.
would have options of sp ecifying b o olean op erations.
Although the metho ds describ ed in this section pri- As part of the comparison pro cess in the currentwork,
marily apply to data visualization, the direct manip- we implemented a version of the form interface that
ulation of the data in its natural form and constant can work with currentworld-wide-web technology [see
mo di cation of the display based on the currentsta- Figure 8]. Search interfaces employed by most web
tus of the query has signi cant in uence in the design search engines also use interfaces similar to the one
of QBT. shown here.
Figure 1: A sample QBE screen from Paradox for Windows
Query By Templates (QBT)
3 Poems by Felicia Dorothea Hemans 1808 (Early Eighteenth Century)
current work generalizes QBE for databases The Casabianca
Collection Historical Age
taining complex structured data. QBE is very con by Felicia Hemans Poem Title suitable for relational databases, since it uses tabular Poet Name
The boy stood on the burning deck
skeletons(analogous to tables in the relational mo del)
Whence all but he had fled,
In other words, as a means for constructing queries. First Line The flames that lit the battle’s wreck
Shone round him o’er the dead. Any line
the template for presenting queries in QBE is simi-
ternal structure of the database. Weuse lartothein Stanza Yet beautiful and bright he stood,
As born to rule the Storm,
ytyp e of database in this idea to generalize QBE for an The Creatures of Heroic blood
A proud, though child-like form.
whicheach data instance has a simple visual template.
In this generalized metho d (that we term \Query By
Templates",) the basis of the interface is a visual tem-
Figure 2: A simple template for p o ems, with its logical
plate representing an instance of the database. Simple
regions
examples of templates are: a small poem for p o etry
databases, a table for relational databases, a sample
word de nition for a dictionary database, and a sam-
we indicate a prominent logical region of the p o em by
ple entry in a bibliography database. Any database
circling it and lab eling it with the corresp onding re-
that has a simple visual representation of its content
gion name. Physically the QBT interface consists of
can b e used with QBT. For databases that do not have
a small template image divided into areas corresp ond-
a general visual content, we can always revert to tables
ing to di erent logical regions in the database, as in
for the template.
Figure 2. Dep ending on the layout of the regions, the
The main goal for using QBT is to retain all the
templates can b e of di erenttyp es, and these di erent
prominent prop erties of QBE. The intended prop er-
typ es are describ ed in the next sections.
ties of QBT that are analogous to those of QBE (as
discussed in Section 2.1) are (i) simplicity, (ii) equiva-
3.2 Flat templates
lence, (iii) closure and (iv) completeness. This section
As describ ed in the previous section, QBT relies on
describ es templates in greater detail to explain to ba-
the presence of a simple visual template for the in-
sic design of QBT.
stances in the database. In most cases, this template
3.1 QBT: the basic design
could b e planar or at. This means that all logical re-
At the simplest level, a QBT interface displays a gions of the template can b e displayed simultaneously
template for a representative entry of the database. in a two dimensional image without overlapping(like
The user sees an example of the typ e of data she would in Figure 2). We call these templates \ at templates".
exp ect to nd in the database, such as a poem in a Flat templates are usually easier to display and use,as
p o etry database. She sp eci es a query byentering ex- the structural regions can b e simultaneously displayed
amples of what she is searching for in the appropriate in a plane, p ossibly byshowing multiple instances of
areas of the template, and the system retrieves all the some regions. For example, in Figure 2, the First Line
database entries that match the example she provided. and Any Line regions are subregions in Stanza. Todis-
To illustrate the interface, we use a simple template play these subregions, the template needs to include a
for a p o etry database, as in Figure 2. In this gure, second stanza which is broken into its comp onents. Casabianca Casabianca Stanza by Felicia Hemans by Felicia Hemans
The boy stood on the burning deck The boy stood on the burning deck Whence all but he had fled, First line Whence all but he had fled, The boy stood on the burning deck The flames that lit the battle’s wreck The flames that lit the battle’s wreck Shone round him o’er the dead. Shone round Whence him o’er all the but dead. he had fled, Any line The flames that lit the battle’s wreck Yet beautiful and bright he stood, Stanza Yet beautiful and Shone bright round he stood, him o’er the dead. As born to rule the Storm, As born to rule the Storm, The Creatures of Heroic blood The Creatures of Heroic blood First line A proud, though child-like form. A proud, though child-like form. Any line
(a) (b)
Figure 3: Templates with (a) Emb edded Regions and (b)Recursive regions
internal structures for the same p o em example. 3.3 Nested templates
3.4 Structure templates
Although at templates are easy to display and
Structures, particularly large ones, may get to o
navigate, they cannot mo del structures with deep lev-
complex to use nested templates. In these cases, it is
els of nesting. In this case, we use templates that can
often necessary to displaytheinternal structure simul-
be nested. In nested templates, regions are allowed
taneously with a template that displays the relative
to overlap. In particular, certain regions can b e com-
p osition of the current region. As mentioned earlier,
pletely inside other regions to represent subregions.
most do cuments can b e thoughtof as having a hier-
To display emb edded logical regions, we use one of
archical structure that can b e conveniently visualized
the following metho ds:
as a tree. Showing a template simultaneously with a
Embedded Regions In this metho d, subregions
hierarchy of logical regions depicting the context sim-
are displayed inside the parent region. As in at tem-
pli es the nested structure visualization. An example
plates, all regions are displayed simultaneously in the
of the structure template is shown in Figure 7, which
same plane of the image. Comp onent regions no longer
is a screen-shot from the prototyp e implementation of
need to b e mutually exclusive. This metho d is a sim-
QBT, describ ed in Section 5.
ple extension of at templates, but it makes templates
much more powerful while retaining the simplicity.
4 Querying with QBT
However, this metho d is again limited to structures
Normal keyword searches b ounded by structural re-
in which the nesting level is not very deep and the
gions are simple in the QBT interface. As describ ed
top-level region is physically large enough to include
earlier, users express their queries by indicating the
all the nested regions without completely obscuring it-
searchkeywords in the appropriate regions of the tem-
self. An example of this typ e of nesting in shown in
plate. In this section, weshow all the di erenttyp es
Figure 3(a).
of p ossible searches that can b e p erformed with QBT.
One can treat QBE[22] as a sp ecial case of QBT Recursive regions This is the most general metho d
where the templates used are table skeletons that in- of nesting regions. In this metho d, a region with sub-
stantiate tables in the database. In QBE, queries regions can b e recursively expanded. During traversal,
are sp eci ed by entering values in prop er p ositions the user may \zo om in" to a parent region to display
of the tables. These values may be either constants its subregions. The magni ed p ortion of the template
(i.e., strings or numb ers), variables (or examples, usu- can be an indep endent template, and can be subse-
ally di erentiated from the constants by underlining), quently magni ed to get to additional levels of nest-
or expressions involving constants and variables com- ing. Although this metho d can capture any general
bined with arithmetic and comparison op erators. The structure, the templates have to b e cleverly designed
output of the query is sp eci ed by marking the regions so that users are not disoriented by the nested tem-
that need to b e presented in the output. QBT uses the plates. Figure 3(b) shows this metho d of displaying Casabiancahate CasabiancaNOT "hate" by Felicia Hemans by Felicia Hemans OR The boylove stood on the burning deck The boylove stood on the burning deck Whence all but he had fled, Whence all but he had fled, The flames that lit the battle’s wreck The flames that lit the battle’s wreck Shone round him o’er the dead. Shone round him o’er the dead.
(a) (b)
Figure 4: Query formulation with QBT: (a) Simple selections and (b)Logically combined selections
same basic principle, with the extension that the tem- p o em titles and p o ets of all the p o ems that havethe
plates are not restricted to table skeletons but can b e word hate in the title and the word love in the rst
any visual representation of the database instances. line." Note that unlike QBE, searches are substring
matches instead of exact string matches. So, entering
The primary di erence b etween the metho d of ex-
the word \love" in the region \ rst line" matches all
pressing queries in QBT and that in QBE lies in the
p o ems with the rst line containing the word \love"
fact that the templates in QBE are essentially one-
anywhere in the rst line. In QBE, this is done by in-
dimensional. Although QBE uses a 2D tables for
dicating examples b efore and after the search string.
querying, the meta-data (attributes of the relations)
In case of do cuments, substring matches make more
only app ear along the horizontal axis as column head-
sense since exact searches are far less common.
ings of the tables. QBE uses the rows to sp ecify mul-
tiple search conditions and logical op erations b etween
4.2 Selections with multiple conditions
the search conditions (see examples in [22].) In QBT,
the regions (meta-data) are distributed along b oth di-
We have just seen that if multiple conditions are
mensions of the template, utilizing whole template
sp eci ed in di erent regions, they are combined us-
plane for visualizing the structure. Logical op erations
ing logical conjunctions, implying that the results re-
between regions can be expressed by physically con-
turned from the query will satisfy all the sp eci ed
necting two or more regions via a logical op erator.
search conditions. If this is not desired, search con-
Logical op erations within regions can b e formed using
ditions can b e combined using logical op erators AND,
logical expressions within the scop e of that region. In
OR, NOT. Negation of individual conditions are done
the rest of this section, we discuss how di erenttyp es
by placing the keyword \NOT" in front of the string.
of queries are p erformed using QBT.
Implementations of the interface may use some visual
mechanisms to place this negation keyword. Con-
4.1 Simple selections
necting various search strings using the binary op er-
Simple selections include searching for constant
ators \AND" and \OR" involves simply connecting
strings or numbers within logical regions of the do c-
the two strings using a p ointing device and selecting
ument (the whole do cument itself b eing one region).
the prop er op eration typ e for that connection. Fig-
In QBT, such searches are p erformed by simply en-
ure 4(b) demonstrates how this is done by expressing
tering the search string in the corresp onding region of
the query: \Find the p o em titles and p o ets of all p o-
the template. As a result of suchasearch, database
ems that either do not havetheword `hate' in the title
instances ro oted at a default region that match all
or havetheword `love' in the rst line." Notice the in-
the sp eci ed conditions are returned. In other words,
tro duction of the negation and the \OR" connection.
the given search criteria are combined using a logical
conjunction op eration. The result of the query is by
Note that if multiple clauses are connected using
default ro oted at a preselected region de ned bythe
logical constructs, the order in which the expressions
template. However, users can mark the regions that
are evaluated dep ends on the direction of the arrow.
they want returned by placing a print-marker on them.
However,itispossibletooverride this order of evalua-
tion by placing parentheses in appropriate places in a In the illustrations (see Figure 4), the small tick-
p
condition b ox, whichwe will demonstrate shortly (see mark ( ) is used as a print indicator. In the exam-
Section 4.4). ples, Figure 4(a) denotes the simple query: \Find the Casabiancaexample1 Casabiancaexample1 by Felicia Hemans by Felicia Hemans
The boy stood on the burning deck The boy stood on the burning deck Whence all but he had fled, Whence all but he had fled, The flames that lit the battle’s wreck The flames that lit the battle’s wreck
Shone round him o’er the dead. Shone round him o’er the dead.
Figure 5: Query formulation with QBT: Joins
Casabiancahate TITLE AND POET OR FLINE by Feliciashakespeare Hemans OR (a) The boylove stood on the burning deck Whence all but he had fled, TITLE AND (POET OR FLINE) The flames that lit the battle’s wreck Shone round him o’er the dead.
(b)
Figure 6: Changing precedence of op erations with Condition b oxes
tors to sp ecify joins other than \equi-joins" (joins that 4.3 Joins and multiple templates
use equality on the join attributes). Once again, in the
In this section, we lo ok at a sp ecial class of query
case of asymmetric comparison op erations, the prece-
called \join". A join is an op eration using whichmul-
dence of the op erators is determined by the direction
tiple fragments of a database are combined together
of the arrow. Although we are still using underlined
based on some common prop erty. Joins are indisp ens-
examples to highlight the joins, they are redeemed un-
able in relational database, since the relational de-
necessary by the use of the joining arrow.
sign involves \normalizing" a schema by breaking it
4.4 Queries with complex conditions
into at tabular fragments. This fragmentation neces-
sitates combining the individual fragments together
Like QBE, QBT cannot always enable visualiza-
at the time of query pro cessing using the join op er-
tion of complex conditions involving more than two
ation. However, in do cument databases, the struc-
regions. These conditions need to sp eci ed using a
ture is not normalized into planar fragments, but al-
condition b ox. The condition b ox can also b e used to
lowed to grow hierarchically, so joins are not required
override the precedence of op erators. As search strings
to combine fragments. However, joins are still useful
and examples are sp eci ed, the condition b oxisauto-
to solve queries that requires comparison of di erent
matically up dated. The user can then insert parenthe-
parts of a database or di erent instantiations of the
ses as necessary to change the default precedence. For
same database.
example, in Figure 6, if the default precedence is used,
the query evaluates to: \Find the p o em titles and p o- For example, one may try to \ nd the pairs of p o ets
ets of the p o ems in which either the title contains who have at least one p o em with a common title" (as
`hate' and the poet is Shakesp eare, or the rst line in Figure 5). In this case, we need to generate twoin-
contains `love'." The default condition b oxis shown stances of the p o etry database and run the query com-
in Figure 6(a). However, this default can b e changed paring the titles of the two p o ems. This is achieved in
to: \Find the p o em titles and p o ets of the p o ems in QBT by using multiple templates. In the case of the
which the title contains `hate', and either the poet ab ovequery, the same template is instantiated twice,
is Shakesp eare, or the rst line contains `love'." The and the join attributes are connected together. The
condition box of this query is shown in Figure 6(b). connection can b e augmented with comparison op era-
The condition b ox can also b e used for sp ecifying more
complex conditions involving more than twovariables
in an expression. The condition b ox functions in the
same manner as the condition b oxinQBE.
5 A prototyp e implementation of QBT
1
We built a prototyp e of the QBT interface using
TM
the Java programming language. The prototyp e
implements most of the features describ ed here includ-
ing the emb edded template (without recursivemagni-
cation) and the structure template. Wehavenotyet
incorp orated the condition b ox in this prototyp e, but
it will b e added in the next release. We also included
an exp erimental version of an SQL language transla-
Figure 8: The form implementation of the query in-
tor from the QBT query. Figure 7 shows two parts of
terface used in the usability analysis
the screen { one showing the template screen and the
other showing the structure template.
As an exp eriment, we used the Chadwyck-Healey
among which we prepared nine, and left the tenth
English Po etry database with templates similar to
question op en to the sub ject's imagination. All sub-
those describ ed in this pap er for p erforming the
jects were given the same set of questions (see Ap-
TM
queries. We used the Java language to build the
p endix A.) The questions varied in complexity and
user interface. Queries generated using the interface
were designed so that all except one returned some
are sent to a query engine by the HTTP (Hyp erText
matches. The sub jects were to answer the questions
Transfer Proto col) proto col, which is run from a web
using the searchinterface and to write down the num-
server as a CGI (Common Gateway Interface) exe-
ber of matches returned by the search, after making
cutable. The engine generates its output in HTML
sure that the question was interpreted prop erly bythe
which is displayed by the clients. We wrote the engine
searching program. The indep endent and dep endent
in C/C++, using the API (Application Programming
variables for the exp eriment are outlined b elow:
Interface) provided by the Pat [17] software. However,
Indep endentvariables:
due to limitations of the pro cessing capabilities of Pat,
A. Interface type, (1) New Java-based QBT inter-
this engine do es not supp ort variables or joins. Apart
face, and (2) generic form-based interface;
of the currentresearch is aimed at building a database
engine that will supp ort queries such as join, grouping,
B. Subject type, (1) exp ert, and (2) novice.
and nesting, without converting the do cuments into a
Dep endentvariables:
di erent database format.
1. Eciency: The amount of time in seconds the
sub jects take to answer each question.
6 Usability analysis
We p erformed an initial phase of usability analy-
2. Accuracy: The degree of accuracy of the answers.
sis on the prototyp e interface. The main goal of this
3. Satisfaction: How satis ed the users were after
analysis was to detect di erences between this inter-
using the interface (measured by self-rep orts in
face and a standard forms-based interface with similar
written debrie ngs.)
search capabilities. In particular, we were interested
Twenty sub jects were chosen for the exp eriment.
in di erences with resp ect to (1) accuracy, (2) e-
The exp eriment was conducted using a \b etween-
ciency and (3) satisfaction. This section describ es the
users" strategy[19], where two distinct groups of users
metho d used during the exp erimental evaluation of the
use the two platforms. In our exp eriment, ten sub-
new Java-based (QBT) interface.
jects were given the Java-based interface (see Fig-
6.1 Exp erimental design
ure 7), while another ten ( ve novices, ve exp erts)
The exp eriment consisted of two primary parts. In
were given the form-based interface (see Figure 8).
the rst part, we gave the sub jects ten questions {
After the sub jects nished the questions, we asked
1
The current version of the prototyp e can be seen
them to complete a small survey of their exp erience
at: http://blesmol.cs.indiana.edu:7890/pro jects/SGMLQuery.
with databases and query engines. The survey also
Note that only the interface is accessible from outside Indiana
asked them to compare the interface with a standard
University. The results of the queries cannot b e viewed from a
web-searchinterface that they haveused. remote lo cation b ecause of copyright restrictions.
Figure 7: A screen shot-from the prototyp e showing the template and structure
questions). The sub jects were timed automatically by 6.2 Sub jects
the server and the query engine that was actually ex-
Sub jects were students who volunteered to par-
ecuting the queries.
ticipate in the research. There were no ma jor re-
quirements from the sub jects except that they all
Basic pro cedure. The sub jects were intro duced to
be Indiana University aliates, b ecause of the copy-
the exp eriment and the target interface. After an ini-
righted nature of the database on whichwe p erformed
tial intro duction, the sub jects were given the exp eri-
the exp eriment. We divided the sub jects into two
mental queries and asked to input them sequentially,
groups based on their exp erience with computers and
and to note down the numb er of matches returned by
databases. The sub jects classi ed as \novices" had
the database for every query. They were also asked to
minimal computer exp ertise { generally limited to
verify their results to eliminate typ ographical errors
only e-mail and o ccasional World Wide Web access.
bychecking the resp onse from the database and lo ok-
The sub jects classi ed as \exp erts" were p eople ac-
ing at sample results from their search. After they
customed to using databases and the web as well as
nished the queries, they were given a set of survey
designing and programming graphical user interfaces.
questions which they were to answer. They were also
We made no distinctions between male and female
asked to verbally describ e their feelings and general
sub jects, since it was not one of the indep endentvari-
reactions from their use of the systems.
ables in this analysis (eleven female and nine male
Exp erimental queries. A set of nine queries (see
sub jects participated in this study.)
App endix A) were given to each sub ject. For the tenth
6.3 Equipment - software and hardware
query,theywere asked to search for something of their
We p erformed all the exp eriments using Netscap e
own interest. The rst and the easiest query was pri-
2.0 for the Java-based interface and either Netscap e
marily meant for the sub jects to get acquainted with
2.0 or 1.1 for the form-based version. For the Java-
the system, and the last query was mainly to see what
based interface, we restricted exp eriments to ma-
typ es of questions users often search for. The queries
chines having 16MB or more system memory, since
ranged from very simple searches involving a single
Netscap e's Java p erformance is sub-standard with less
clause in a eld to complex searches involving up to
memory. However, for the other interface, no memory
four clauses combined together.
restriction was enforced since the HTML forms do not
Timing technique. The sub jects were timed byelec-
have additional memory requirements.
tronic means. Whenever a user submitted a query us-
6.4 Data collection
ing either interface, the server logged the access time.
We collected twotyp es of written data: (1) the sub- Also, the query engine that we designed logged timing
jects' resp onses to the survey questions and (2) the and other detailed information ab out the queries sent
sub jects' resp onses to the number of matches for each bythe users. We also designed the Javainterface so
search problem (please see App endix A for the actual that it would send log messages to the server. This al-
7.2 Eciency lowed the server to keep track of all the actions (such
as button press and query selection) that the user to ok
For the eciency measure, we used the time (in sec-
over the course of submitting the queries.
onds) b etween two successive submissions of queries.
The absolute times at which (1) the system was rst
Survey questions. In addition to the queries, we
accessed and (2) the queries were submitted, were
gave the users a small set of questions to express their
logged by the query pro cessing system, and we cal-
exp erience, preferences, and the degree of satisfaction
culated the di erence b etween these times to get the
with the interface. They were also asked to p oint out
time taken for each task by the sub jects. For the rst
features that they liked or disliked in the system that
task, we used the time di erence b etween the rst ac-
they used.
cess of the search page and the submission of the rst
General feedback. After the exp erimentwas over,
task. This turned out to b e a problem (as indicated
the sub jects were asked to comment on their general
by the results,) since the Javainterface page did not
feelings ab out the pro ject; and their comments and
haveany other text b esides the searchinterface itself,
suggestions were noted. This data was primarily used
while the form interface had some instructions in it
for the purp ose of designing improved features for the
which most of the sub jects sp ent time reading, be-
currentinterface.
fore comp osing the rst query. Table 2 shows that ex-
p erts were signi cantly more ecient than the novices
7 Results
on b oth the typ es of interfaces. However, except for
This section describ es in detail the results that we
Task 1 and Task 7, there was no signi cant eciency
obtained from the usability analysis. We divide the
di erence b etween the twointerfaces.
results in three di erent sections, one each for the de-
For Task 1, the sub jects using the Java interface
pendent variables | accuracy, eciency, and satis-
p erformed signi cantly b etter than the sub jects us-
faction. For each of the measures, we p erformed a
ing the form interface, and the possible explanation
multivariate ANOVA test with a :05 signi cance level.
is given ab ove. For Task 7, the users of the form in-
In addition, for the accuracy and eciency measures,
terface p erformed signi cantly better than the users
we show the results for all ten questions, and make
of the Java interface. On a detailed analysis of the
appropriate inferences based on the results.
sub jects' actions, we found that this task required the
In the following analysis, for the indep endentvari-
users to switch to a di erent screen for the Java in-
able \Interface typ e", the Javainterface (Figure 7) is
terface, and most of the users could not understand
given a value of 1 and the form interface (Figure 8)
the necessity for this action. This will be recti ed
is given a value of 2. For the indep endent variable
when all three screens are displayed simultaneously;
\Sub ject typ e", the values of 1 and 2 are assigned to
as users will not have to discover something new in
exp erienced and novice users, resp ectively. The tasks
order to p erform this query.
are denoted as \qns1" through \qns10".
Table 2 displays the results we obtained from the
multivariate tests of signi cance. The values show
7.1 Accuracy
clearly that the combined e ect of b oth the interface
We to ok the accuracy measures byevaluating the
and sub ject typ e was signi cant only for Task 1. The
answers to each question in 0 5 scale. Perfect answers
e ect of sub ject typ e was de nitely signi cant on al-
were given 5 p oints and completely wrong answers (of
most all the tasks - the exp erts were signi cantly more
course, there were none in the exp eriment) were given
ecient on b oth the interfaces. However, the interface
0points. Based on the typ e of mistake the users com-
di erence did not haveany signi cant e ect on all the
mitted, their answers were given a value between 1
tasks except Task 1 and 7; and the reasons are already
and 4. For the other questions, there was no signif-
explained ab ove.
icant di erence between either the two interfaces or
the two groups of users. The multivariate ANOVA
7.3 Satisfaction
analysis did not show any signi cant e ect of either
For the satisfaction measure, the users were asked
the interface typ e or the user typ e, or a combination
to grade the interface that they used in a scale of ve
of both with resp ect to accuracy. See Table 1 for a
qualitative values: Much better, Little better, About
summary of the signi cance values for the ten ques-
the same, Worse, Absolutely worse. These ve classes
2
tions.
were given the ranks 5,4,3,2 and 1 resp ectively. This
data was taken after all the actual tasks were per-
2
For questions 1,2,4 and 10, all the participants got the cor-
formed, and was not calculated on a task-by-task ba-
rect result, so there was no variance in the outcome of these
sis. As b efore, a Multivariate ANOVA measure was questions.
E ectnTrial no. Qns1 Qns2 Qns3 Qns4 Qns5 Qns6 Qns7 Qns8 Qns9 Qns10
Sub ject - - 0.332 - 0.624 0.332 0.743 0.332 0.811 -
Interface - - 0.332 - 0.153 0.332 0.332 0.332 0.477 -
Interaction: sub ject & interface - - 0.332 - 0.624 0.332 0.201 0.332 0.477 -
Table 1: Signi cance values for Accuracy
E ectnTrial No. Qns1 Qns2 Qns3 Qns4 Qns5 Qns6 Qns7 Qns8 Qns9 Qns10
Sub ject 0.003 0.085 0.029 0.220 0.000 0.004 0.001 0.000 0.018 0.004
Interface 0.000 0.726 0.696 0.522 0.420 0.660 0.001 0.452 0.897 0.927
Interaction: sub ject & interface 0.006 0.379 0.446 0.494 0.830 0.054 0.066 0.101 0.934 0.912
Table 2: Signi cance Values for Eciency
taken on this data. This clearly shows a signi cant common, so the assumption of a graphics-capable ter-
e ect of interface on the satisfaction of the sub jects. minal is not very demanding.
However, the satisfaction was indep endent of sub ject The implementation of QBT in this work is in an
typ e or the combined e ect of interface and sub ject
early developmental stage, and has substantial p oten-
typ es (see Table 3).
tial for improvement. The exp erimentwe p erformed
clearly indicated some of the ways it could be im-
8 Discussion and conclusion
proved. However, in spite of b eing a prototyp e in-
QBE and forms are b oth quite p opular means for
terface, this QBT implementation demonstrates that
querying in the relational domain. The main advan-
QBT is suitable for querying textual databases using
tage of the form interface is that it is very simple
a simple graphical interface. Moreover, QBT is at
to implement and easy to use for small databases.
least as ecient and accurate as the general form-
However, forms do not adapt very well for databases
based approach, and signi cantly more satisfying to
with a very complex structure, and most text-based
the users. We b elieve that the idea b ehind QBT will
databases tend to haveavery complicated structure
give us a starting p oint for query interfaces in future
(for instance, the Chadwyck-Healey database used in
text database systems. A signi cant p ortion of the
the prototyp e contains over fty logical regions.) A
current research is aimed towards the theoretical sta-
form interface that can search on only a few of these
bility and soundness of the QBT concept, and once
areas is easy to construct, but if the numb er of search-
established, this metho d has the potential of b eing
able regions is increased, the usability gets question-
the standard querying mechanism for text databases.
able. On the other hand, a structure template can b e
Acknowledgments
easily used to navigate through complex hierarchical
Wearevery thankful to the sub jects of our usability
structures. Even for complex hierarchies, the fo cus
analysis for their participation and co op eration. We
can be concentrated in the regions of interest using
are also grateful to the LETRS (Library Electronic
advanced metho ds like di erential magni cation [12].
Text Resource Service) sub division of Indiana Univer-
Another advantage of the template metho d is its
sity Library, and esp ecially ex-directors Richard Ellis
direct relationship to the internal structure of the
and Mark Day for allowing us to use the Chadwyck-
database. Forms always lo ok the same, whether the
Healey Database for this research. Wewould also like
underlying database is a p o em, a dictionary, a quo-
to thank Prof. Dirk Van Gucht, Computer Science,
tation collection, or even a relational database. How-
Indiana University, for his careful reading and useful
ever, templates can be custom-designed for di erent
comments.
typ es of databases. This way, templates can pro-
A Tasks used in the usability testing
vide a direct re ection of the users' mental mo dels
1. Find the p o ems written by \Shakesp eare".
[5, Ch.6], a signi cant factor in the design of good
2. Howmany p o ems were written in the Middle En-
user-interfaces. Moreover, templates use the principle
glish Perio d age (MEP)?
of familiarity[15], which is demonstrated to work well
for novice users. The only disadvantage of templates 3. Find all the p o ems written in the Early 19th Cen-
tury p erio d (C19A) that havetheword \burning" in
is that go o d templates require exp ensive graphics ter-
the rst line.
minals, while forms work quite well with terminals
without graphics capabilities. However, with the ad- 4. Find the p o ems that havethe word \hate" in the
vance in technology, non-graphics terminals are less title and the word \love" in the rst line.
E ect Interface and Sub ject Typ e Sub ject Typ e Only Interface Only
Signi cance values 1.000 0.503 0.014
Table 3: Signi cance Values for Satisfaction
5. Find the p o ems not written by \Hemans" that have [10] J.D. Gould, L. Alfaro, R. Finn, B. Haupt, and A. Min-
uto. Reading from crt displays can b e as fast as read-
the word \wreck" somewhere in a stanza.
ing from pap er. Human Factors, 29(5):497{517, 1987.
6. Find the p o ems written during the Early 18th Cen-
tury (C18A) whichhave the word \love" in the collec-
[11] International Organization for Standardization,
tion title, as well as in the p o em title, but not in the Geneva, Switzerland. ISO 8879: Information
Processing { Text and Oce Systems { Standard
rst line.
GeneralizedMarkupLanguage (SGML),1986.
7. Find the p o ems that have the phrase \exp ostula-
tion and reply" anywhere in the b o dy of the p o em.
[12] T. Alan Keahey and Edward L. Rob ertson. Tech-
8. Find the p o ems written by Keats that do not have niques for non-linear magni cation transformations.
In Proceedings, Visualisation '96 Information Visual-
the word \mortal" in any of the stanzas.
ization Symposium. IEEE, Octob er 1996.
9. Find the p o ems written by Shakesp eare that has
the phrase \to b e or not to b e" somewhere in the p o em
[13] Arjan Lo e en. Text databases: Asurvey of text mo d-
body. els and systems. SIGMOD RECORD, March1994.
10. Write a query of your own from your interest in
[14] Ted Nelson. Literary Machines,volume Version 87.1.
p o ems, and indicate the numb er of matches you found
Published by author, 1987.
for that query.
[15] Donald Norman. The Design of Everyday things.
Doubleday Currency, 1990.
References
[1] Serge Abiteb oul, Sophie Cluet, and Tova Milo. Query-
[16] D. Ob orne and D. Holton. Reading from screen versas
ing and up dating the le. Proceedings, 19th Intl. Con-
pap er: there is no di erence. International Journal of
ferenceonVery Large Data Bases, pages 73{84, 1993.
Man-Machine Studies, 28(1):1{9, 1988.
[2] American National Standards Institute, New
[17] Op en Text Corp oration, Waterlo o, Ontario, Canada.
York. ANSI X3.135-1992 and ISO/IEC 9075:1992,
Open Text 5.0, 1994.
Database Language SQL, 1992.
[18] Daniel E. Rose and Richard K. Belew. Toward a
[3] Ricardo Baeza-Yates and Gonzalo Navarro. Integrat-
direct-manipulation interface for conceptual informa-
ing contents and structures in text retrieval. SIGMOD
tion retrieval systems. Interfaces for Information Re-
Record, 25(4):67{79, March 1996.
trieval and Online Systems - the state of the art, 1991.
[4] Ricardo A. Baeza-Yates and Gaston H. Gonnet. Ef-
[19] Je rey Rubin. Handbook of Usability Testing: How to
cient text searching of regular expressions. Proceed-
plan, design and conduct e ective tests. John Wiley
ings, 16th International Col loquium on Automata,
& Sons, Inc., 1994.
Languages, and Programming, pages 46{62, 1989.
[20] Gerard Salton. Developments in automatic text re-
[5] Paul Bo oth. AnIntroduction to Human-computer In-
trieval. Science, 253:974{980, 1991.
teraction. Laurence ErlBaum Asso ciates Publishers,
1989.
[21] Ben Shneiderman and Greg Kearsley. Hypertext
Hands-on! An Introduction to a New Way of Orga-
[6] Nathan Combs. Large text database visualization.
nizing and accessing Information. Addison-Wesley,
Advances in Classi cation Research, Proceedings of
1989.
the 3rd ASIS SIG/CR Classi cation Research Work-
shop, 3, Oct 1992.
[22] M. M. Zlo of. Query by example: A database lan-
guage. IBM Systems Journal, 16(4), 1977.
[7] A. Creed, I. Dennis, and S. Newstead. Pro of-reading
on VDUs. Behavior and Information Technology,
6(1):3{13, 1987.
[8] Andrew Dillon. Designing Usable Electronic Text: Er-
gonomic Aspects of Human Information Usage. Tay-
lor & Francis, London; Bristol, PA, 1994.
[9] J.D. Gould, L. Alfaro, V. Barnes, R. Finn,
N. Grischkowsky, and A. Minuto. Reading is slower
from crt displays than from pap er: Attempts to iso-
late a single variable explanation. Human Factors, 29(3):269{299, 1987.