Query By Templates: A Generalized Approach for Visual Query Formulation for Text Dominated Databases †

Arijit Sengupta
Computer Science Department
Indiana University
Bloomington, IN 47405
[email protected]

Andrew Dillon
School of Library & Information Science
Indiana University
Bloomington, IN 47405
[email protected]

Abstract

The WWW has a great potential of evolving into a globally distributed digital document library. The primary use of such a library is to retrieve information quickly and easily. Because of the size of these libraries, simple keyword searches often result in too many matches. More complex searches involving boolean expressions are difficult to formulate and understand. This paper describes QBT (Query By Templates), a visual method for formulating queries for structured document databases modeled with SGML. Based on Zloof's QBE (Query By Example), this method incorporates the structure of the documents for composing powerful queries. The goal of this technique is to design an interface for querying structured documents without prior knowledge of the internal structure. This paper describes the rationale behind QBT, illustrates the query formulation principles using QBT, and describes results obtained from a usability analysis on a prototype implementation of QBT on the Web using the Java(TM) programming language.

Copyright 1997 IEEE. Published in the Proceedings of ADL'97, May 7-9, 1997 in Washington D.C. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyright and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: +Intl. 908-562-3966

† Partially supported by U.S. Dept. of Education award number P200A502367 and NSF Research and Infrastructure grant, award number NSF CDA-9303189.

1 Introduction

The SGML Standard [11] has introduced a method for using documents as a means for information interchange and retrieval. We can now use properly structured documents not only for the traditional purposes of reading, browsing, and printing, but also for searching and querying. Automated searches in normal word-processor documents are usually restricted to linear word searches. However, users can use documents encoded in SGML to pose much more complex searches involving both text content and structure.

1.1 SGML, Databases and Documents

Traditionally, documents in electronic formats like text or word-processor documents are used mainly for word processing, editing, publishing, and possible reuse in other documents. The problem with these documents is that they cannot be easily used for purposes other than what they are primarily designed for. The SGML standard extends the use of electronic documents beyond word processing and towards a means for information interchange. SGML was originally designed as a portable document encoding format for easy interchange between various platforms and systems. As stated in the standard [11], SGML was to be used "for publishing in its broadest definition, ranging from single medium conventional publishing to multimedia data base publishing." SGML achieves this goal by involving a modeling phase in document design and by completely separating layout and structure. The motivation in SGML is to define the structure of the document and leave the layout to the processing application. The process for the design of document structure is conceptually similar to schema design in traditional databases. Applications processing an SGML document can then use this structural content as a database schema to solve queries (searches) on the document. This makes the treatment of documents encoded in SGML as databases for the purpose of querying a fairly straightforward task.

1.2 Searching, querying and browsing

Traditionally, the most common method for retrieving information from documents is by perusing. For simple word-processor documents, perusing involves sequentially scanning or paging through the document to find the information of interest. This form of searching, which we technically call "browsing", depends completely upon the reader's perseverance and visual attention. The success of the search depends upon the reader's knowledge of the subject matter, the organization of the text, and similar factors. The recent advent of hypertext documents has changed the traditional sequential method of browsing. Browsing in hypertext documents is considered non-linear [14, 21], and is usually performed by following links in the document to other related areas of the same document or to entirely different documents.

Although hypertext documents may provide a more flexible way of browsing, finding information in hypertext documents by browsing still requires human intelligence, knowledge and experience. Whether readers are more comfortable with screen reading or paper reading is a debatable question [8]. Research shows strong evidence of paper as a more natural means of reading [7, 9] as well as a means to rectify the problems with reading from a screen [10, 16].

One major advantage of electronic documents over their paper counterparts is the capability of automated searching. Many information retrieval techniques have been proposed [20] that can quickly extract portions of documents matching given keywords. Similar searches can be quite difficult in paper documents [8]. However, keyword searches do not often retrieve the target documents. For large documents, combining text with the logical structure of documents is often necessary. In many cases, many component search conditions need to be combined using boolean expressions to build a narrower search. For example, a reader might be looking for papers in a journal authored by "Jane Doe" with a title containing "creature". These types of searches are less common in traditional information retrieval systems, which lack the capability of involving structure in searches. Searches involving structure and complex operations such as join have been the object of recent research [4, 1]. In this paper, we will refer to such searches as "queries".

2 Query languages for structured documents

The object of developing languages for the purpose of querying is to attain the capability of performing complex operations involving text (data) and structure (meta-data). Traditional query languages for relational databases have the capability of using the schema information effectively. Since the basic structure of relational databases consists of flat tables, query languages for relational databases do not usually deal with complex hierarchical structures. However, query languages for document databases need to effectively use the document hierarchy. Research on query languages for documents has unearthed a number of such languages. Two very useful surveys of such languages and systems can be obtained from [13, 3].

Most database systems implement specific database models, such as the relational model and the Object-Oriented (OO) model. These database models are based on strong theoretical foundations and have formal procedural languages to manipulate data. In the case of relational databases, the most commonly used query languages are SQL (Structured Query Language) [2] and QBE (Query By Example) [22]. SQL is a textual language with a simple English-like syntax, and is widely implemented in most commercial database systems. QBE provides a visual method for posing queries and is suitable even for inexperienced users. This section describes some previous work done in the area of visual query formulation, and derives the motivation behind the current research.

2.1 Visual languages and interfaces

This paper presents the design of a visual interface for the purpose of querying document databases. Although many different types of procedural query languages have been proposed so far, not enough attention has been devoted towards providing visual methods for query formulation. Current implementations primarily use form-based approaches for building queries. In this section, we take a closer look at QBE and form-based querying.

Query By Example (QBE). Query By Example [22] is a visual language for querying relational databases. This language has a simple interface composed of tabular skeletons representing tables in the database. Users specify queries by entering sample values (or examples) in appropriate areas of the table skeleton. These values can be either search strings or variables for the purpose of join and other operations that require variables. QBE can also handle complex boolean combinations of such search expressions using a special section of the screen (termed condition box). Aggregation operations such as sum, count and average can also be performed by indicating them appropriately in the tabular skeleton. Figure 1 shows a simple join operation using QBE as implemented in a popular database system. The main idea behind this method is that the user provides an example of the output that she expects from her query, and the query engine looks in the database for data that matches the given example.

2.2 Other query visualization approaches

Here we describe two experimental systems that use