Query By Templates: A Generalized Approach for Visual Query Formulation for Text Dominated Databases†

Arijit Sengupta
Computer Science Department
Indiana University
Bloomington, IN 47405
[email protected]

Andrew Dillon
School of Library & Information Science
Indiana University
Bloomington, IN 47405
[email protected]

Abstract

The WWW has a great potential of evolving into a globally distributed digital document library. The primary use of such a library is to retrieve information quickly and easily. Because of the size of these libraries, simple keyword searches often result in too many matches. More complex searches involving boolean expressions are difficult to formulate and understand. This paper describes QBT (Query By Templates), a visual method for formulating queries for structured document databases modeled with SGML. Based on Zloof's QBE (Query By Example), this method incorporates the structure of the documents for composing powerful queries. The goal of this technique is to design an interface for querying structured documents without prior knowledge of the internal structure. This paper describes the rationale behind QBT, illustrates the query formulation principles using QBT, and describes results obtained from a usability analysis on a prototype implementation of QBT on the Web using the Java(TM) programming language.

Copyright 1997 IEEE. Published in the Proceedings of ADL'97, May 7-9, 1997 in Washington D.C. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyright and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: +Intl. 908-562-3966

† Partially supported by U.S. Dept. of Education award number P200A502367 and NSF Research and Infrastructure grant, award number NSF CDA-9303189.

1 Introduction

The SGML Standard [11] has introduced a method for using documents as a means for information interchange and retrieval. We can now use properly structured documents not only for the traditional purposes of reading, browsing, and printing, but also for searching and querying. Automated searches in normal word-processor documents are usually restricted to linear word searches. However, users can use documents encoded in SGML to pose much more complex searches involving both text content and structure.

1.1 SGML, Databases and Documents

Traditionally, documents in electronic formats like text or word-processor documents are used mainly for word processing, editing, publishing, and possible reuse into other documents. The problem with these documents is that they cannot be easily used for purposes other than what they are primarily designed for. The SGML standard extends the use of electronic documents beyond word processing and towards a means for information interchange. SGML was originally designed as a portable document encoding format for easy interchange between various platforms and systems. As stated in the standard [11], SGML was to be used "for publishing in its broadest definition, ranging from single medium conventional publishing to multimedia data base publishing." SGML achieves this goal by involving a modeling phase in document design and by completely separating layout and structure. The motivation in SGML is to define the structure of the document, and leave the layout to the processing application. The process for the design of document structure is conceptually similar to schema design in traditional databases. Applications processing an SGML document can then use this structural content as a schema to solve queries (searches) on the document. This makes the treatment of documents encoded in SGML as databases for the purpose of querying a fairly straightforward task.
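To make the idea of "structure as schema" concrete, the following minimal sketch models a document as a tree of named regions and restricts a keyword search to one region type. It is an illustration only, not the system described in this paper: the `Element` class, the `search` function, and the region names `poem`, `title`, `poet`, `stanza` and `line` are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node of a structured document: a region name plus text or children."""
    name: str
    text: str = ""
    children: list["Element"] = field(default_factory=list)

    def descendants(self, name):
        """Yield every element (including self) whose region name matches."""
        if self.name == name:
            yield self
        for child in self.children:
            yield from child.descendants(name)

    def full_text(self):
        """Concatenate this element's own text with the text of its subtree."""
        parts = [self.text] + [child.full_text() for child in self.children]
        return " ".join(parts).strip()


def search(doc, region, keyword):
    """Return elements of type `region` containing `keyword` (case-insensitive)."""
    return [e for e in doc.descendants(region)
            if keyword.lower() in e.full_text().lower()]


# Hypothetical encoding of the start of a poem; region names are illustrative.
poem = Element("poem", children=[
    Element("title", "Casabianca"),
    Element("poet", "Felicia Hemans"),
    Element("stanza", children=[
        Element("line", "The boy stood on the burning deck"),
        Element("line", "Whence all but he had fled,"),
    ]),
])

title_hits = search(poem, "title", "deck")  # bounded by structure: no title matches
line_hits = search(poem, "line", "deck")    # exactly one line matches
```

A plain linear word search would report a match for "deck" somewhere in the document; the structured search can additionally say that the match lies in a line and not in the title.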

1.2 Searching, querying and browsing

Traditionally, the most common method for retrieving information from documents is by perusing. For simple word-processor documents, perusing involves sequentially scanning or paging through the document to find the information of interest. This form of searching, which we technically call "browsing", depends completely upon the reader's perseverance and visual attention. The success of the search depends upon the reader's knowledge of the subject matter, the organization of the text, and similar factors. The recent advent of hypertext documents has changed the traditional sequential method of browsing. Browsing in hypertext documents is considered non-linear [14, 21]: it is usually performed by following links in the document to other related areas of the same document or entirely different documents.

Although hypertext documents may provide a more flexible way of browsing, finding information in hypertext documents by browsing still requires human intelligence, knowledge and experience. Whether readers are more comfortable with screen reading or paper reading is a debatable question [8]. Research shows strong evidence of paper as a more natural means of reading [7, 9] as well as a means to rectify the problems with reading from a screen [10, 16].

One major advantage of electronic documents over their paper counterpart is the capability of automated searching. Many information retrieval techniques have been proposed [20] that can quickly extract portions of documents matching given keywords. Similar searches can be quite difficult in paper documents [8]. However, keyword searches do not often retrieve the target documents. For large documents, combining text with the logical structure of documents is often necessary. In many cases, many component search conditions need to be combined using boolean expressions to build a narrower search. For example, a reader might be looking for papers in a journal authored by "Jane Doe" with a title containing "creature". These types of searches are less common in traditional information retrieval systems, which lack the capability of involving structure in searches. Searches involving structure and complex operations such as join have been the object of recent research [4, 1]. In this paper, we will refer to such searches as "queries".

2 Query languages for structured documents

The object of developing languages for the purpose of querying is to attain the capability of involving complex operations involving text (data) and structure (meta-data). Traditional query languages for relational databases have the capability of using the schema information effectively. Since the basic structure of relational databases consists of flat tables, query languages for relational databases do not usually deal with complex hierarchical structures. However, query languages for document databases need to effectively use the document hierarchy. Research on query languages for documents has unearthed a number of such languages. Two very useful surveys of such languages and systems can be obtained from [13, 3].

Most database systems implement specific database models, such as the relational model and Object-Oriented (OO) model. These database models are based on strong theoretical foundations and have formal procedural languages to manipulate data. In the case of relational databases, the most commonly used query languages are SQL (Structured Query Language) [2] and QBE (Query By Example) [22]. SQL is a textual language with a simple English-like syntax, and is widely implemented in most commercial database systems. QBE provides a visual method for posing queries and is suitable even for inexperienced users. This section describes some previous work done in the area of visual query formulation, and derives the motivation behind the current research.

2.1 Visual languages and interfaces

This paper presents the design of a visual interface for the purpose of querying document databases. Although many different types of procedural query languages have been proposed so far, not enough attention has been devoted towards providing visual methods for query formulation. Current implementations primarily use form-based approaches for building queries. In this section, we take a closer look at QBE and form-based querying.

Query By Example (QBE). Query By Example [22] is a visual language for querying relational databases. This language has a simple interface composed of tabular skeletons representing tables in the database. Users specify queries by entering sample values (or examples) in appropriate areas of the table skeleton. These values can be either search strings or variables for the purpose of join and other operations that require variables. QBE can also handle complex boolean combinations of such search expressions using a special section of the screen (termed condition box). Aggregation operations such as sum, count and average can also be performed by indicating appropriately in the tabular skeleton. Figure 1 shows a simple join operation using QBE as implemented in a popular database system. The main idea behind this method is that the user provides an example of outputs that she

expects from her query, and the query engine looks in the database for data that matches the given example. This works nicely for relational databases, primarily because the tabular structure of the database fits quite well with tabular skeletons used in the interface. Some useful properties of QBE are:

Simplicity: The core QBE does not require the knowledge of any syntactic constructs or the internal structure of the database to use.

Equivalence: Relational databases are stored in tables, and in QBE, the queries are entered into equivalent table skeletons.

Closure: Users give examples of the type of values that they are looking for, and the result of the queries shows up as tables in the same format as the data.

Completeness: The core QBE, along with some additional commands, condition boxes and other constructs, can construct the same class of queries as relational algebra.

These properties make QBE a complete language that can adapt to any relational schema and make querying almost independent of the users' knowledge of the structure of the data. In a later section we will show how our approach keeps all these properties in a generalized interface designed primarily for documents, but applicable to any complex structured data.

Query By Forms. Although QBE is a formally accepted direct-manipulation visual language for relational algebra, most relational database systems still do not universally use or implement it. Some commonly used relational database systems implement different variations of QBE, but application developers designing query interfaces for a specific purpose seldom use the QBE method directly in their implementations. In these cases, developers use form-based interfaces.

In form-based interfaces, the user is presented with a list of searchable fields, each with an entry area that can be used to indicate the search string. To pose a query, the user needs to fill in the areas of the form relevant to her search. A more general form interface would have options of specifying boolean operations. As part of the comparison process in the current work, we implemented a version of the form interface that can work with current world-wide-web technology [see Figure 8]. Search interfaces employed by most web search engines also use interfaces similar to the one shown here.

2.2 Other query visualization approaches

Here we describe two experimental systems that use interesting visualization techniques for the purpose of information retrieval on complex structures. Although the approaches here are directed primarily towards visualization of data, they do provide useful incentive in this work.

TEXTVIZ. TEXTVIZ [6] is a system for intuitively visualizing, searching, and querying the contents of large text databases under development at TASC. This method uses a map-like representation of documents that shows users how components of documents are related to each other in the collection. It provides two levels of text database information: (1) a macro perspective, giving a top-level view of the whole database using the topographic map, and (2) a micro perspective focusing upon the conceptual content of each document component. TEXTVIZ uses a natural language or text processing front-end to extract meaning about the contents of a document database. It then generates and displays a text map portraying these contents. This provides the user the ability to comprehend the entire database at once, giving the user a global perspective of the organization of the contents of the database.

AIR and SCALIR. AIR (Adaptive Information Retrieval) and SCALIR (Symbolic and Connectionist Approach to Legal Information Retrieval) are two systems described by Rose et al. [18] that use a connectionist approach to the information retrieval problem. AIR's goal is to build a representation that will retrieve documents that are more likely to be relevant to particular queries. In order to achieve this, users begin a session with AIR by describing their information need using a very simple . Subsequently users can refine their queries by specifying multiple clauses or by navigating links shown by the system. SCALIR is an IR system for retrieving cases about copyright law. Internally, SCALIR contains a network of nodes representing terms, court decisions, and statute sections. A query consists of the activation of a few selected nodes. This activation is spread throughout the network, building the retrieval set from the activated nodes.

Although the methods described in this section primarily apply to data visualization, the direct manipulation of the data in its natural form and constant modification of the display based on the current status of the query has significant influence in the design of QBT.
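The retrieval-by-activation idea mentioned above for SCALIR can be illustrated with a generic spreading-activation sketch. This is not SCALIR's actual algorithm, which is not detailed here; the decay factor, the threshold, the step limit, and the node labels below are all assumptions made for the example.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Spread activation outward from the seed nodes.

    `graph` maps each node to the nodes it links to. A node passes on
    `decay` times its own activation; amounts below `threshold` are dropped.
    Each node keeps the strongest activation it has received.
    """
    activation = dict.fromkeys(seeds, 1.0)
    frontier = dict(activation)
    for _ in range(max_steps):
        next_frontier = {}
        for node, level in frontier.items():
            passed = level * decay
            if passed < threshold:
                continue
            for neighbour in graph.get(node, ()):
                if passed > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed
                    next_frontier[neighbour] = passed
        frontier = next_frontier
    return activation


# Hypothetical network of terms, statute sections and court decisions.
graph = {
    "term:copyright": ["case:A", "statute:S1"],
    "case:A": ["case:B"],
    "statute:S1": ["case:B"],
    "case:B": [],
}

result = spread_activation(graph, ["term:copyright"])
retrieved = sorted(node for node in result if node.startswith("case:"))
```

The query here is the single activated term node; the retrieval set is read off from the case nodes that received any activation, with nodes closer to the query holding higher values.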

Figure 1: A sample QBE screen from Paradox for Windows

3 Query By Templates (QBT)

The current work generalizes QBE for databases containing complex structured data. QBE is very suitable for relational databases, since it uses tabular skeletons (analogous to tables in the relational model) as a means for constructing queries. In other words, the template for presenting queries in QBE is similar to the internal structure of the database. We use this idea to generalize QBE for any type of database in which each data instance has a simple visual template. In this generalized method (that we term "Query By Templates"), the basis of the interface is a visual template representing an instance of the database. Simple examples of templates are: a small poem for poetry databases, a table for relational databases, a sample word definition for a dictionary database, and a sample entry in a bibliography database. Any database that has a simple visual representation of its content can be used with QBT. For databases that do not have a general visual content, we can always revert to tables for the template.

The main goal for using QBT is to retain all the prominent properties of QBE. The intended properties of QBT that are analogous to those of QBE (as discussed in Section 2.1) are (i) simplicity, (ii) equivalence, (iii) closure and (iv) completeness. This section describes templates in greater detail to explain the basic design of QBT.

3.1 QBT: the basic design

At the simplest level, a QBT interface displays a template for a representative entry of the database. The user sees an example of the type of data she would expect to find in the database, such as a poem in a poetry database. She specifies a query by entering examples of what she is searching for in the appropriate areas of the template, and the system retrieves all the database entries that match the example she provided. To illustrate the interface, we use a simple template for a poetry database, as in Figure 2. In this figure, we indicate a prominent logical region of the poem by circling it and labeling it with the corresponding region name. Physically the QBT interface consists of a small template image divided into areas corresponding to different logical regions in the database, as in Figure 2. Depending on the layout of the regions, the templates can be of different types, and these different types are described in the next sections.

[Figure 2 shows the poem "Casabianca" by Felicia Hemans with its regions circled and labeled: Collection, Historical Age, Poem Title, Poet Name, Stanza, First Line, and Any Line.]

Figure 2: A simple template for poems, with its logical regions

3.2 Flat templates

As described in the previous section, QBT relies on the presence of a simple visual template for the instances in the database. In most cases, this template could be planar or flat. This means that all logical regions of the template can be displayed simultaneously in a two dimensional image without overlapping (like in Figure 2). We call these templates "flat templates". Flat templates are usually easier to display and use, as the structural regions can be simultaneously displayed in a plane, possibly by showing multiple instances of some regions. For example, in Figure 2, the First Line and Any Line regions are subregions in Stanza. To display these subregions, the template needs to include a second stanza which is broken into its components.
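The defining property of a flat template, that no two logical regions overlap in the template image, can be checked mechanically. The sketch below is an illustration under assumed conventions: regions are axis-aligned rectangles in pixel coordinates, and the `Region` class, the helper functions, and all coordinates are hypothetical, not taken from the prototype.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    """A named rectangular area of the template image (pixel coordinates)."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def overlaps(self, other):
        """True if the two rectangles share interior area (touching edges do not count)."""
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


def is_flat(regions):
    """A template is flat when no pair of its regions overlaps."""
    return not any(a.overlaps(b) for a, b in combinations(regions, 2))


def region_at(regions, x, y):
    """Name of the first region containing point (x, y), or None."""
    for r in regions:
        if r.left <= x < r.right and r.top <= y < r.bottom:
            return r.name
    return None


# Flat layout: the second stanza is drawn as separate First Line / Any Line areas.
flat_template = [
    Region("Poem Title", 0, 0, 200, 20),
    Region("Poet Name", 0, 20, 200, 40),
    Region("Stanza", 0, 40, 200, 120),
    Region("First Line", 0, 120, 200, 140),
    Region("Any Line", 0, 140, 200, 200),
]

# Placing First Line inside Stanza makes the regions overlap, so it is not flat.
nested_template = flat_template[:3] + [Region("First Line", 0, 40, 200, 60)]
```

With this layout, `region_at(flat_template, 10, 130)` resolves a click to "First Line", which is how an interface could decide which region a typed search string belongs to.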

[Figure 3 shows the "Casabianca" template with the regions Stanza, First line, and Any line in two layouts, (a) and (b).]

Figure 3: Templates with (a) Embedded regions and (b) Recursive regions

3.3 Nested templates

Although flat templates are easy to display and navigate, they cannot model structures with deep levels of nesting. In this case, we use templates that can be nested. In nested templates, regions are allowed to overlap. In particular, certain regions can be completely inside other regions to represent subregions. To display embedded logical regions, we use one of the following methods:

Embedded regions: In this method, subregions are displayed inside the parent region. As in flat templates, all regions are displayed simultaneously in the same plane of the image. Component regions no longer need to be mutually exclusive. This method is a simple extension of flat templates, but it makes templates much more powerful while retaining the simplicity. However, this method is again limited to structures in which the nesting level is not very deep and the top-level region is physically large enough to include all the nested regions without completely obscuring itself. An example of this type of nesting is shown in Figure 3(a).

Recursive regions: This is the most general method of nesting regions. In this method, a region with subregions can be recursively expanded. During traversal, the user may "zoom in" to a parent region to display its subregions. The magnified portion of the template can be an independent template, and can be subsequently magnified to get to additional levels of nesting. Although this method can capture any general structure, the templates have to be cleverly designed so that users are not disoriented by the nested templates. Figure 3(b) shows this method of displaying internal structures for the same poem example.

3.4 Structure templates

Structures, particularly large ones, may get too complex to use nested templates. In these cases, it is often necessary to display the internal structure simultaneously with a template that displays the relative position of the current region. As mentioned earlier, most documents can be thought of as having a hierarchical structure that can be conveniently visualized as a tree. Showing a template simultaneously with a hierarchy of logical regions depicting the context simplifies the nested structure visualization. An example of the structure template is shown in Figure 7, which is a screen-shot from the prototype implementation of QBT, described in Section 5.

4 Querying with QBT

Normal keyword searches bounded by structural regions are simple in the QBT interface. As described earlier, users express their queries by indicating the search keywords in the appropriate regions of the template. In this section, we show all the different types of possible searches that can be performed with QBT.

One can treat QBE [22] as a special case of QBT where the templates used are table skeletons that instantiate tables in the database. In QBE, queries are specified by entering values in proper positions of the tables. These values may be either constants (i.e., strings or numbers), variables (or examples, usually differentiated from the constants by underlining), or expressions involving constants and variables combined with arithmetic and comparison operators. The output of the query is specified by marking the regions that need to be presented in the output. QBT uses the

same basic principle, with the extension that the templates are not restricted to table skeletons but can be any visual representation of the database instances.

The primary difference between the method of expressing queries in QBT and that in QBE lies in the fact that the templates in QBE are essentially one-dimensional. Although QBE uses 2D tables for querying, the meta-data (attributes of the relations) only appear along the horizontal axis as column headings of the tables. QBE uses the rows to specify multiple search conditions and logical operations between the search conditions (see examples in [22]). In QBT, the regions (meta-data) are distributed along both dimensions of the template, utilizing the whole template plane for visualizing the structure. Logical operations between regions can be expressed by physically connecting two or more regions via a logical operator. Logical operations within regions can be formed using logical expressions within the scope of that region. In the rest of this section, we discuss how different types of queries are performed using QBT.

4.1 Simple selections

Simple selections include searching for constant strings or numbers within logical regions of the document (the whole document itself being one region). In QBT, such searches are performed by simply entering the search string in the corresponding region of the template. As a result of such a search, database instances rooted at a default region that match all the specified conditions are returned. In other words, the given search criteria are combined using a logical conjunction operation. The result of the query is by default rooted at a preselected region defined by the template. However, users can mark the regions that they want returned by placing a print-marker on them. In the illustrations (see Figure 4), the small tick-mark (√) is used as a print indicator. In the examples, Figure 4(a) denotes the simple query: "Find the poem titles and poets of all the poems that have the word hate in the title and the word love in the first line." Note that unlike QBE, searches are substring matches instead of exact string matches. So, entering the word "love" in the region "first line" matches all poems with the first line containing the word "love" anywhere in the first line. In QBE, this is done by indicating examples before and after the search string. In the case of documents, substring matches make more sense since exact searches are far less common.

Figure 4: Query formulation with QBT: (a) Simple selections and (b) Logically combined selections

4.2 Selections with multiple conditions

We have just seen that if multiple conditions are specified in different regions, they are combined using logical conjunctions, implying that the results returned from the query will satisfy all the specified search conditions. If this is not desired, search conditions can be combined using the logical operators AND, OR, and NOT. Negation of an individual condition is done by placing the keyword "NOT" in front of the string. Implementations of the interface may use some visual mechanisms to place this negation keyword. Connecting various search strings using the binary operators "AND" and "OR" involves simply connecting the two strings using a pointing device and selecting the proper operation type for that connection. Figure 4(b) demonstrates how this is done by expressing the query: "Find the poem titles and poets of all poems that either do not have the word 'hate' in the title or have the word 'love' in the first line." Notice the introduction of the negation and the "OR" connection.

Note that if multiple clauses are connected using logical constructs, the order in which the expressions are evaluated depends on the direction of the arrow. However, it is possible to override this order of evaluation by placing parentheses in appropriate places in a condition box, which we will demonstrate shortly (see Section 4.4).

[Figure 5 shows two instances of the poem template, each with the example value "example1" entered in the title region.]

Figure 5: Query formulation with QBT: Joins

[Figure 6 shows a template with "hate" in the title region, "shakespeare" in the poet region, and "love" in the first-line region, together with the condition box contents: (a) TITLE AND POET OR FLINE, and (b) TITLE AND (POET OR FLINE).]

Figure 6: Changing precedence of operations with Condition boxes

4.3 Joins and multiple templates

In this section, we look at a special class of query called "join". A join is an operation by which multiple fragments of a database are combined together based on some common property. Joins are indispensable in relational databases, since the relational design involves "normalizing" a schema by breaking it into flat tabular fragments. This fragmentation necessitates combining the individual fragments together at the time of query processing using the join operation. However, in document databases, the structure is not normalized into planar fragments, but allowed to grow hierarchically, so joins are not required to combine fragments. However, joins are still useful to solve queries that require comparison of different parts of a database or different instantiations of the same database.

For example, one may try to "find the pairs of poets who have at least one poem with a common title" (as in Figure 5). In this case, we need to generate two instances of the poetry database and run the query comparing the titles of the two poems. This is achieved in QBT by using multiple templates. In the case of the above query, the same template is instantiated twice, and the join attributes are connected together. The connection can be augmented with comparison operators to specify joins other than "equi-joins" (joins that use equality on the join attributes). Once again, in the case of asymmetric comparison operations, the precedence of the operators is determined by the direction of the arrow. Although we are still using underlined examples to highlight the joins, they are rendered unnecessary by the use of the joining arrow.

4.4 Queries with complex conditions

Like QBE, QBT cannot always enable visualization of complex conditions involving more than two regions. These conditions need to be specified using a condition box. The condition box can also be used to override the precedence of operators. As search strings and examples are specified, the condition box is automatically updated. The user can then insert parentheses as necessary to change the default precedence. For example, in Figure 6, if the default precedence is used, the query evaluates to: "Find the poem titles and poets of the poems in which either the title contains 'hate' and the poet is Shakespeare, or the first line contains 'love'." The default condition box is shown in Figure 6(a). However, this default can be changed to: "Find the poem titles and poets of the poems in which the title contains 'hate', and either the poet is Shakespeare, or the first line contains 'love'." The condition box of this query is shown in Figure 6(b). The condition box can also be used for specifying more complex conditions involving more than two variables in an expression. The condition box functions in the same manner as the condition box in QBE.
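The effect of parentheses in the condition box can be sketched with a small parser. The prototype described in this paper does not yet include a condition box, so everything below is an assumption made for illustration: the grammar, the rule that AND binds more tightly than OR (chosen because it reproduces the default reading given for Figure 6(a)), and the names `parse` and `evaluate`.

```python
import re


def parse(expression):
    """Parse a condition-box string into a nested tuple tree.

    Assumed grammar: OR has the lowest precedence, AND binds tighter,
    and parentheses override both.
    """
    tokens = re.findall(r"\(|\)|\w+", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        node = parse_and()
        while peek() == "OR":
            take()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_atom()
        while peek() == "AND":
            take()
            node = ("AND", node, parse_atom())
        return node

    def parse_atom():
        token = take()
        if token == "(":
            node = parse_or()
            take()  # consume the closing parenthesis
            return node
        return token  # a region label such as TITLE

    return parse_or()


def evaluate(tree, truth):
    """Evaluate a parsed tree given the truth value of each region's condition."""
    if isinstance(tree, str):
        return truth[tree]
    op, left, right = tree
    if op == "AND":
        return evaluate(left, truth) and evaluate(right, truth)
    return evaluate(left, truth) or evaluate(right, truth)


default_box = parse("TITLE AND POET OR FLINE")          # as in Figure 6(a)
parenthesized_box = parse("TITLE AND (POET OR FLINE)")  # as in Figure 6(b)

# A poem whose first line contains "love" but whose title lacks "hate".
truth = {"TITLE": False, "POET": False, "FLINE": True}
```

For this poem the default box evaluates to true and the parenthesized box to false, which is exactly the difference between the two English readings given above.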

5 A prototyp e implementation of QBT

1

We built a prototyp e of the QBT interface using

TM

the Java programming language. The prototyp e

implements most of the features describ ed here includ-

ing the emb edded template (without recursivemagni-

cation) and the structure template. Wehavenotyet

incorp orated the condition b ox in this prototyp e, but

it will b e added in the next release. We also included

an exp erimental version of an SQL language transla-

Figure 8: The form implementation of the query in-

tor from the QBT query. Figure 7 shows two parts of

terface used in the usability analysis

the screen { one showing the template screen and the

other showing the structure template.

As an experiment, we used the Chadwyck-Healey English Poetry database with templates similar to those described in this paper for performing the queries. We used the Java™ language to build the user interface. Queries generated using the interface are sent by HTTP (HyperText Transfer Protocol) to a query engine, which is run from a web server as a CGI (Common Gateway Interface) executable. The engine generates its output in HTML, which is displayed by the clients. We wrote the engine in C/C++, using the API (Application Programming Interface) provided by the Pat [17] software. However, due to limitations of the processing capabilities of Pat, this engine does not support variables or joins. A part of the current research is aimed at building a database engine that will support queries such as join, grouping, and nesting, without converting the documents into a different database format.

¹ The current version of the prototype can be seen at: http://blesmol.cs.indiana.edu:7890/projects/SGMLQuery. Note that only the interface is accessible from outside Indiana University. The results of the queries cannot be viewed from a remote location because of copyright restrictions.

6 Usability analysis

We performed an initial phase of usability analysis on the prototype interface. The main goal of this analysis was to detect differences between this interface and a standard forms-based interface with similar search capabilities. In particular, we were interested in differences with respect to (1) accuracy, (2) efficiency and (3) satisfaction. This section describes the method used during the experimental evaluation of the new Java-based (QBT) interface.

6.1 Experimental design

The experiment consisted of two primary parts. In the first part, we gave the subjects ten questions, among which we prepared nine, and left the tenth question open to the subject's imagination. All subjects were given the same set of questions (see Appendix A). The questions varied in complexity and were designed so that all except one returned some matches. The subjects were to answer the questions using the search interface and to write down the number of matches returned by the search, after making sure that the question was interpreted properly by the searching program. The independent and dependent variables for the experiment are outlined below:

Independent variables:

A. Interface type: (1) new Java-based QBT interface, and (2) generic form-based interface;

B. Subject type: (1) expert, and (2) novice.

Dependent variables:

1. Efficiency: The amount of time in seconds the subjects take to answer each question.

2. Accuracy: The degree of accuracy of the answers.

3. Satisfaction: How satisfied the users were after using the interface (measured by self-reports in written debriefings).

Twenty subjects were chosen for the experiment. The experiment was conducted using a "between-users" strategy [19], where two distinct groups of users use the two platforms. In our experiment, ten subjects were given the Java-based interface (see Figure 7), while another ten (five novices, five experts) were given the form-based interface (see Figure 8). After the subjects finished the questions, we asked them to complete a small survey of their experience with databases and query engines. The survey also asked them to compare the interface with a standard web-search interface that they have used.
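The design described above crosses two independent variables, with each subject assigned to exactly one combination. The following Python sketch lays out that structure. The cell size of five is an assumption: the text states the five/five split only for the form-based group. The labels `java_qbt` and `form` and the subject identifiers are hypothetical.

```python
from itertools import product

INTERFACES = ("java_qbt", "form")     # independent variable A
SUBJECT_TYPES = ("expert", "novice")  # independent variable B
CELL_SIZE = 5  # assumption: stated in the text only for the form-based group

# Between-users design: every subject appears in exactly one cell.
design = {cell: [] for cell in product(INTERFACES, SUBJECT_TYPES)}
subject_id = 1
for cell in design:
    for _ in range(CELL_SIZE):
        design[cell].append(f"S{subject_id:02d}")  # hypothetical subject label
        subject_id += 1

total = sum(len(ids) for ids in design.values())
per_interface = {
    name: sum(len(ids) for (iface, _), ids in design.items() if iface == name)
    for name in INTERFACES
}
```

Under these assumptions the four cells hold twenty subjects in total, ten per interface, which matches the group sizes reported in the text.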

Figure 7: A screen shot from the prototype showing the template and structure

6.2 Subjects

Subjects were students who volunteered to participate in the research. There were no major requirements from the subjects except that they all be Indiana University affiliates, because of the copyrighted nature of the database on which we performed the experiment. We divided the subjects into two groups based on their experience with computers and databases. The subjects classified as "novices" had minimal computer expertise, generally limited to only e-mail and occasional World Wide Web access. The subjects classified as "experts" were people accustomed to using databases and the web as well as designing and programming graphical user interfaces. We made no distinctions between male and female subjects, since it was not one of the independent variables in this analysis (eleven female and nine male subjects participated in this study).

6.3 Equipment - software and hardware

We performed all the experiments using Netscape 2.0 for the Java-based interface and either Netscape 2.0 or 1.1 for the form-based version. For the Java-based interface, we restricted experiments to machines having 16MB or more system memory, since Netscape's Java performance is sub-standard with less memory. However, for the other interface, no memory restriction was enforced since the HTML forms do not have additional memory requirements.

6.4 Data collection

We collected two types of written data: (1) the subjects' responses to the survey questions and (2) the subjects' responses to the number of matches for each search problem (please see Appendix A for the actual questions). The subjects were timed automatically by the server and the query engine that was actually executing the queries.

Basic procedure. The subjects were introduced to the experiment and the target interface. After an initial introduction, the subjects were given the experimental queries and asked to input them sequentially, and to note down the number of matches returned by the database for every query. They were also asked to verify their results to eliminate typographical errors by checking the response from the database and looking at sample results from their search. After they finished the queries, they were given a set of survey questions which they were to answer. They were also asked to verbally describe their feelings and general reactions from their use of the systems.

Experimental queries. A set of nine queries (see Appendix A) was given to each subject. For the tenth query, they were asked to search for something of their own interest. The first and the easiest query was primarily meant for the subjects to get acquainted with the system, and the last query was mainly to see what types of questions users often search for. The queries ranged from very simple searches involving a single clause in a field to complex searches involving up to four clauses combined together.

Timing technique. The subjects were timed by electronic means. Whenever a user submitted a query using either interface, the server logged the access time. Also, the query engine that we designed logged timing and other detailed information about the queries sent by the users. We also designed the Java interface so that it would send log messages to the server. This allowed the server to keep track of all the actions (such as button press and query selection) that the user took over the course of submitting the queries.

Survey questions. In addition to the queries, we gave the users a small set of questions to express their experience, preferences, and the degree of satisfaction with the interface. They were also asked to point out features that they liked or disliked in the system that they used.

General feedback. After the experiment was over, the subjects were asked to comment on their general feelings about the project, and their comments and suggestions were noted. This data was primarily used for the purpose of designing improved features for the current interface.

7 Results

This section describes in detail the results that we obtained from the usability analysis. We divide the results into three sections, one each for the dependent variables: accuracy, efficiency, and satisfaction. For each of the measures, we performed a multivariate ANOVA test with a .05 significance level. In addition, for the accuracy and efficiency measures, we show the results for all ten questions, and make appropriate inferences based on the results.

In the following analysis, for the independent variable "Interface type", the Java interface (Figure 7) is given a value of 1 and the form interface (Figure 8) is given a value of 2. For the independent variable "Subject type", the values of 1 and 2 are assigned to experienced and novice users, respectively. The tasks are denoted as "qns1" through "qns10".

7.1 Accuracy

We took the accuracy measures by evaluating the answers to each question on a 0-5 scale. Perfect answers were given 5 points and completely wrong answers (of course, there were none in the experiment) were given 0 points. Based on the type of mistake the users committed, their answers were given a value between 1 and 4. For the other questions, there was no significant difference between either the two interfaces or the two groups of users. The multivariate ANOVA analysis did not show any significant effect of either the interface type or the user type, or a combination of both with respect to accuracy. See Table 1 for a summary of the significance values for the ten questions.²

² For questions 1, 2, 4 and 10, all the participants got the correct result, so there was no variance in the outcome of these questions.

7.2 Efficiency

For the efficiency measure, we used the time (in seconds) between two successive submissions of queries. The absolute times at which (1) the system was first accessed and (2) the queries were submitted were logged by the query processing system, and we calculated the difference between these times to get the time taken for each task by the subjects. For the first task, we used the time difference between the first access of the search page and the submission of the first task. This turned out to be a problem (as indicated by the results), since the Java interface page did not have any other text besides the search interface itself, while the form interface had some instructions in it which most of the subjects spent time reading before composing the first query. Table 2 shows that experts were significantly more efficient than the novices on both types of interface. However, except for Task 1 and Task 7, there was no significant efficiency difference between the two interfaces.

For Task 1, the subjects using the Java interface performed significantly better than the subjects using the form interface, and the possible explanation is given above. For Task 7, the users of the form interface performed significantly better than the users of the Java interface. On a detailed analysis of the subjects' actions, we found that this task required the users to switch to a different screen for the Java interface, and most of the users could not understand the necessity for this action. This will be rectified when all three screens are displayed simultaneously, as users will not have to discover something new in order to perform this query.

Table 2 displays the results we obtained from the multivariate tests of significance. The values show clearly that the combined effect of both the interface and subject type was significant only for Task 1. The effect of subject type was definitely significant on almost all the tasks: the experts were significantly more efficient on both the interfaces. However, the interface difference did not have any significant effect on any of the tasks except Tasks 1 and 7, and the reasons are already explained above.

7.3 Satisfaction

For the satisfaction measure, the users were asked to grade the interface that they used on a scale of five qualitative values: Much better, Little better, About the same, Worse, Absolutely worse. These five classes were given the ranks 5, 4, 3, 2 and 1 respectively. This data was taken after all the actual tasks were performed, and was not calculated on a task-by-task basis. As before, a Multivariate ANOVA measure was

Effect \ Trial no.                Qns1   Qns2   Qns3   Qns4   Qns5   Qns6   Qns7   Qns8   Qns9   Qns10
Subject                           -      -      0.332  -      0.624  0.332  0.743  0.332  0.811  -
Interface                         -      -      0.332  -      0.153  0.332  0.332  0.332  0.477  -
Interaction: subject & interface  -      -      0.332  -      0.624  0.332  0.201  0.332  0.477  -

Table 1: Significance values for Accuracy

Effect \ Trial no.                Qns1   Qns2   Qns3   Qns4   Qns5   Qns6   Qns7   Qns8   Qns9   Qns10
Subject                           0.003  0.085  0.029  0.220  0.000  0.004  0.001  0.000  0.018  0.004
Interface                         0.000  0.726  0.696  0.522  0.420  0.660  0.001  0.452  0.897  0.927
Interaction: subject & interface  0.006  0.379  0.446  0.494  0.830  0.054  0.066  0.101  0.934  0.912

Table 2: Significance Values for Efficiency
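Reading Table 2 against the .05 significance level used in the analysis can be done mechanically. The sketch below copies the values from Table 2 and lists, for each effect, the tasks whose significance value falls below that level. The dictionary layout and the function name `significant_tasks` are illustrative choices, not part of the original analysis.

```python
ALPHA = 0.05  # significance level used for the ANOVA tests

# Significance values copied from Table 2 (efficiency), tasks qns1..qns10.
table2 = {
    "subject": [0.003, 0.085, 0.029, 0.220, 0.000, 0.004, 0.001, 0.000, 0.018, 0.004],
    "interface": [0.000, 0.726, 0.696, 0.522, 0.420, 0.660, 0.001, 0.452, 0.897, 0.927],
    "interaction": [0.006, 0.379, 0.446, 0.494, 0.830, 0.054, 0.066, 0.101, 0.934, 0.912],
}


def significant_tasks(values, alpha=ALPHA):
    """Return the 1-based task numbers whose significance value is below alpha."""
    return [task for task, p in enumerate(values, start=1) if p < alpha]


flags = {effect: significant_tasks(values) for effect, values in table2.items()}
```

With these values, `flags` contains tasks 1 and 7 for the interface effect, task 1 alone for the interaction, and every task except 2 and 4 for the subject effect.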

taken on this data. This clearly shows a significant effect of interface on the satisfaction of the subjects. However, the satisfaction was independent of subject type or the combined effect of interface and subject types (see Table 3).

8 Discussion and conclusion

QBE and forms are both quite popular means for querying in the relational domain. The main advantage of the form interface is that it is very simple to implement and easy to use for small databases. However, forms do not adapt very well for databases with a very complex structure, and most text-based databases tend to have a very complicated structure (for instance, the Chadwyck-Healey database used in the prototype contains over fifty logical regions). A form interface that can search on only a few of these areas is easy to construct, but if the number of searchable regions is increased, the usability gets questionable. On the other hand, a structure template can be easily used to navigate through complex hierarchical structures. Even for complex hierarchies, the focus can be concentrated in the regions of interest using advanced methods like differential magnification [12].

Another advantage of the template method is its direct relationship to the internal structure of the database. Forms always look the same, whether the underlying database is a poem, a dictionary, a quotation collection, or even a relational database. However, templates can be custom-designed for different types of databases. This way, templates can provide a direct reflection of the users' mental models [5, Ch.6], a significant factor in the design of good user-interfaces. Moreover, templates use the principle of familiarity [15], which is demonstrated to work well for novice users. The only disadvantage of templates is that good templates require expensive graphics terminals, while forms work quite well with terminals without graphics capabilities. However, with the advance in technology, non-graphics terminals are less common, so the assumption of a graphics-capable terminal is not very demanding.

The implementation of QBT in this work is in an early developmental stage, and has substantial potential for improvement. The experiment we performed clearly indicated some of the ways it could be improved. However, in spite of being a prototype interface, this QBT implementation demonstrates that QBT is suitable for querying textual databases using a simple graphical interface. Moreover, QBT is at least as efficient and accurate as the general form-based approach, and significantly more satisfying to the users. We believe that the idea behind QBT will give us a starting point for query interfaces in future text database systems. A significant portion of the current research is aimed towards the theoretical stability and soundness of the QBT concept, and once established, this method has the potential of being the standard querying mechanism for text databases.

Acknowledgments

We are very thankful to the subjects of our usability analysis for their participation and cooperation. We are also grateful to the LETRS (Library Electronic Text Resource Service) subdivision of Indiana University Library, and especially ex-directors Richard Ellis and Mark Day for allowing us to use the Chadwyck-Healey Database for this research. We would also like to thank Prof. Dirk Van Gucht, Computer Science, Indiana University, for his careful reading and useful comments.

A Tasks used in the usability testing

1. Find the poems written by "Shakespeare".

2. How many poems were written in the Middle English Period age (MEP)?

3. Find all the poems written in the Early 19th Century period (C19A) that have the word "burning" in the first line.

4. Find the poems that have the word "hate" in the title and the word "love" in the first line.

Effect               Interface and Subject Type   Subject Type Only   Interface Only
Significance values  1.000                        0.503               0.014

Table 3: Significance Values for Satisfaction

5. Find the poems not written by "Hemans" that have the word "wreck" somewhere in a stanza.

6. Find the poems written during the Early 18th Century (C18A) which have the word "love" in the collection title, as well as in the poem title, but not in the first line.

7. Find the poems that have the phrase "expostulation and reply" anywhere in the body of the poem.

8. Find the poems written by Keats that do not have the word "mortal" in any of the stanzas.

9. Find the poems written by Shakespeare that have the phrase "to be or not to be" somewhere in the poem body.

10. Write a query of your own from your interest in poems, and indicate the number of matches you found for that query.

References

[1] Serge Abiteboul, Sophie Cluet, and Tova Milo. Querying and updating the file. Proceedings, 19th Intl. Conference on Very Large Data Bases, pages 73-84, 1993.

[2] American National Standards Institute, New York. ANSI X3.135-1992 and ISO/IEC 9075:1992, Database Language SQL, 1992.

[3] Ricardo Baeza-Yates and Gonzalo Navarro. Integrating contents and structures in text retrieval. SIGMOD Record, 25(4):67-79, March 1996.

[4] Ricardo A. Baeza-Yates and Gaston H. Gonnet. Efficient text searching of regular expressions. Proceedings, 16th International Colloquium on Automata, Languages, and Programming, pages 46-62, 1989.

[5] Paul Booth. An Introduction to Human-computer Interaction. Lawrence Erlbaum Associates Publishers, 1989.

[6] Nathan Combs. Large text database visualization. Advances in Classification Research, Proceedings of the 3rd ASIS SIG/CR Classification Research Workshop, 3, Oct 1992.

[7] A. Creed, I. Dennis, and S. Newstead. Proof-reading on VDUs. Behavior and Information Technology, 6(1):3-13, 1987.

[8] Andrew Dillon. Designing Usable Electronic Text: Ergonomic Aspects of Human Information Usage. Taylor & Francis, London; Bristol, PA, 1994.

[9] J.D. Gould, L. Alfaro, V. Barnes, R. Finn, N. Grischkowsky, and A. Minuto. Reading is slower from CRT displays than from paper: Attempts to isolate a single variable explanation. Human Factors, 29(3):269-299, 1987.

[10] J.D. Gould, L. Alfaro, R. Finn, B. Haupt, and A. Minuto. Reading from CRT displays can be as fast as reading from paper. Human Factors, 29(5):497-517, 1987.

[11] International Organization for Standardization, Geneva, Switzerland. ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), 1986.

[12] T. Alan Keahey and Edward L. Robertson. Techniques for non-linear magnification transformations. In Proceedings, Visualisation '96 Information Visualization Symposium. IEEE, October 1996.

[13] Arjan Loeffen. Text databases: A survey of text models and systems. SIGMOD RECORD, March 1994.

[14] Ted Nelson. Literary Machines, volume Version 87.1. Published by author, 1987.

[15] Donald Norman. The Design of Everyday Things. Doubleday Currency, 1990.

[16] D. Oborne and D. Holton. Reading from screen versus paper: there is no difference. International Journal of Man-Machine Studies, 28(1):1-9, 1988.

[17] Open Text Corporation, Waterloo, Ontario, Canada. Open Text 5.0, 1994.

[18] Daniel E. Rose and Richard K. Belew. Toward a direct-manipulation interface for conceptual information retrieval systems. Interfaces for Information Retrieval and Online Systems - the state of the art, 1991.

[19] Jeffrey Rubin. Handbook of Usability Testing: How to plan, design and conduct effective tests. John Wiley & Sons, Inc., 1994.

[20] Gerard Salton. Developments in automatic text retrieval. Science, 253:974-980, 1991.

[21] Ben Shneiderman and Greg Kearsley. Hypertext Hands-on! An Introduction to a New Way of Organizing and accessing Information. Addison-Wesley, 1989.

[22] M. M. Zloof. Query by example: A database language. IBM Systems Journal, 16(4), 1977.