Information Retrieval To ols and Techniques

A Accomazzi

Smithsonian Astrophysical Observatory Cambridge MA USA

F Murtagh

SpaceTelescope European Coordinating Facility European Southern

Observatory Garching near Munich Germany AliatedtoAstrophysics

Division Space ScienceDepartment European SpaceAgency

BF Rasmussen

SpaceTelescope European Coordinating Facility European Southern

Observatory Garching near Munich Germany

Abstract

Retrieval of data and information is something whichevery librarian scientist

and technologist do es many times over in eachworking dayMuch progress has

b een made in recentyears and weoverview search engines and distributed to ol

sets However there are still some problems to b e overcome b efore heterogeneous

data can b e integrated for the user in a seamless waychiey in regard to seamless

integration of image data into current practices We present a short stateoftheart

overview of the outstanding achievements of recentyears and of some of the more

challenging and p otentially fruitful op en issues

Intro duction

Database management systems DBMSs are concerned with retrieval based

on exact and partial match searching IR on the other

hand is concerned with b est matchsearching For DBMSs the problem b e

comes one of structuring the data and providing user views on the data For

IR indexing is a necessary rst step followed by querying which supp orts

greater or lesser expressiveness See for intro ductory readings

A topical area of IR research and developmentwork relates to multimedia and

multityp e data Astronomy oers lush pastures for suchwork in particular as

regards text and image data

Preprint submitted to Elsevier Science July

We b egin here with a review of astronomical textual and bibliographic to ols

and techniques sections and This will b e broadened section to dis

cuss what is now b eing done to integrate text and image data Section

addresses nearfuture issues in regard to treatment of images to use program

ming language parlance should images b e compiled b efore returning their

informational result or instead interpreted As with programming languages

b oth approaches have an imp ortant role to play

Free Text Retrieval Supp orting Phrases Using lqtext

This section reviews a simple text search utility

Lqtext authored by L Quin available byanonymous ftp from ftpcstorontoedu

is a text information retrieval utility using indexing which supp orts com

p ound terms and keywordincontext Lewis and Jones review avail

able technologies for handling comp ound terms which they see as aiding in

semanticallyoriented retrieval It do es not supp ort the Z or related pro

to cols and is not immediatelyinterfaceable to WAIS to b e discussed in section

Advantages of lqtext include comp ound term retrieval is supp orted including

phrases which are broken by ends of lines dashes are ignored Lymanalpha

Lyman alpha and in the default mo de case is ignored Disadvantages

include no sp ecic catering for astronomical semantics in the default setting

words are b etween and characters in length and thus UV is ignored

b o olean queries are supp orted but awkwardlyby ranking and intersecting and

strings which consist entirely of numeric data or start with numeric characters

are excluded It is however a to ol which can b e used as the basis of a larger

retrieval system

The following Unix linemo de example the system prompt is commandshows

comp ound term supp ort

command lqphrase v IRAS galaxies

Word IRAS Iras matches

Word galaxies galaxies matches

hstprop

Although there were matches with the word IRAS and with galax

ies only in do cument hstprop did these come together as a phrase

We are not interested here in the remaining numeric information returned re

lating to osets in the text of the word found A KWIC keyword in context

option follows

command lqkwik lqphrase IRAS galaxies

that ultraluminous IRAS galaxies are hstprop

Another supp ort option is to show more of the context of what has b een found

A screen with up to default lines b efore and after the comp ound term or

word for each hit is presented using command lqshow

Multiple phrase supp ort IRAS starburst galaxies sp ectral energy distri

butions massive galaxies elliptical galaxies protogalaxies is also sup

p orted We get information on the numb er of hits asso ciated with the individ

ual words in the phrase

WAIS and Friends

This section reviews networkbased retrieval of text and increasingly data in

other forms

The mo dern computing system is constructed from many lo cal and remote

machines with the result that the clientserver computing mo del has b ecome

a central concept in such systems In IR clientserver systems are widely used

and some imp ortant recent and current developments in regard to to ols and

techniques are nowlooked at

WAIS Wide Area Information Servers is a widely used clientserver infor

mation retrieval system WAIS originated in a joint research pro ject b etween

Thinking Machines Inc recently exp eriencing nancial diculties Apple

Inc Dow Jones Company and KPMG Peat Marwick and was rst re

leased in The formation of WAISIncbyWAIS principal B Kahle led

to supp ort of a freelyavailable version b eing assumed by CNIDR Clearing

house for Networked Information Discovery and Retrieval lo cated at MCNC

ResearchTriangle Park North Carolina

Z is an information search and retrieval proto col Z Version

is used byWAIS This standard is ANSINISO American National Stan

dards Institute National Information Standards Organization Z which

was successfully balloted in WAIS as develop ed by CNIDR freeWAIS

changed its name in to ZDist and later imp orted into Isite to signal the

move from supp ort of Z to Z

Z Version is widely used in libraries and information organizations

This is an ANSINISO standard This version involved alignment with the ISO

International Organization for Standardization SR Search and Retrieval

proto col ISO among other changes Z and Z

are not compatible Supp ort for Z is provided by Isite see b elow

Z Version was balloted in recentmonths by the Z Implementors

Group ZIG whichworks closely with the standards maintenance agencythe

Library of Congress To b e informed ab out this groups activities subscrib e

to list ziw at address listservnervmnerdcuedu

Here are a few widelyused WAIS implementations

WAIS release b eta minor release WAIS b released in Mayby

Thinking Machines Corp ftp to thinkcom An Indiana Universityversion

IUBio ftp to ftpbioindianaedu supp orted b o olean search

freeWAIS version b eta the currentversion of this series from CNIDR

supp ort for Isite has taken over from freeWAIS Bo olean search and sup

p ort for multityp e les are available CNIDR can b e accessed at URL

ftpftpcnidrorgpubNIDRto ols

Isite from CNIDR supp orts Z version See httpvincacnidrorg

softwareIsiteIsitehtml Recent up dates have included elded searching

and supp ort for SGML tag parsing

freeWAISsf from the UniversityofDortmund was based on freeWAIS

Further information may b e obtained at httplswwwinformatikuni

dortmunddefreeWAISsf Supp orted features include full b o olean search

text date and numeric eld structures congurable headlines bit char

acter supp ort stemming and phonetic co ding the latter using the soundex

and phonix approaches see

Various WAISWWW gateways or scripts are available For further informa

tion refer to a numb er of items accessible from httpwwwncsauiuceduSDG

SoftwareMosaicDo csDcomplexhtml in turn accessible under Manual

in the NCSA Mosaic standard help pulldown menu

Other similar to ols can also b e set up as scripts As an example Glimpse

GLobal IMPlicit SEarch is an indexing and querying system whichallows

les in p ossibly nested directories to b e searched through It is based on a

storageecient index and on agrep which is an extension approximate

grep of the wellknown Unix command which supp orts approximate match

ing missp ellings etc GlimpseHTTP is an automatic HTML indexer Glimpse

is available from the University of Arizona Department of Computer Science

for source co de and do cumentation refer to ftpcsarizonaeduglimpse

We note that freeWAISsf has also b een up dated in a thirdparty patch to

handle proximitybased queries

Another interesting development relates to a Spatial WAIS standard This is

under the guidance of the FGDC Federal Geospatial Data Clearinghouse

The aim is to incorp orate a spatial standard into WAIS Although

oriented towards GIS geographic information systems the inspirational and

technical relevance for astronomy is clear Supp ort is available for such spa

tial extensions as at a p oint lo cation in a b ounding b ox and in a

region In the indexing phase some additional index les are created to sup

p ort these spatial queries Discussion of converging such spatial prototyp es

into the Z Version standard are conducted on list zmap subscribable

to at address listservvincacnidrorg Discussions in other communities eg

British GIS are ongoing a useful summarization of spatial data standards

issues can b e found in

One nal p ointerhereistoSTAS the scientic and technical attribute and el

ement set which seeks to dene standard identiers for referring to searchable

and retrievable elds within scientic technical and related data collections

These standards use the Z proto col Other than CNIDR sp onsors include

the American Chemical So cietyACS more strictly their Chemical Abstracts

Service CAS and commercial information providers such as Dialog Informa

tion Services and FIZ Fachinformationszentrum Karlsruhe

Integration of Text and Image

In this section we describ e recentlyinstalled search tra jectories including

supp ort for free text preview images fullsize images image ancillary infor

mation and bibliographic data

Technical supp ort for queries consisting of text chunks observing prop osal

abstracts sets of keywords bibliographic abstracts etc is straightforward

with WAIS since each term is considered by default to have a b o olean OR

connective with the following term This allows supp ort for queries such as

give me all abstracts which are similar to a given abstract

Clustering of text chunks can b e carried out along these lines Murtagh

detected terms based on the IAU Thesaurus therebyproviding a controlled

vo cabulary in around HST observing prop osal abstracts and clustered

them on the basis of their shared termset A Kohonen selforganizing map

approachwas used

Handling of multityp e data eg images and accompanying texts maybe

achieved bymultityp e supp ort in the Thus freeWAISsf can

return a reference to the text providing hits as well as accompanying images

and either can b e lo oked at indep endently by the user

An alternative is to structure text which is returned in resp onse to hits such

that links to accompanying images are emb edded In fact the images them

selves can b e emb edded also

A search tra jectoryemb o dying free text capability image quickview and

bibliographic literature search was set up along the following lines for the

HST image see httpwwwesoorghstpropab ssearchhtml

Using free text on prop osal abstracts andor on elds asso ciated with

principal investigator PI or prop osal identier numb er the prop osal title

can b e obtained together with its approximate matching score 

From the prop osal title by clicking one gets further prop osal informa

tion PI details observing cycle numb er full abstract and list of exp osures



From the list of exp osures one obtains the asso ciated information on the

instrument a link to background information on the instrument and an

indication of availability of preview images ie compressed versions of the

real images 

From the preview images one can view these or mark them for batch

retrieval of the real images

From the abstract or the authors or the keywords one can pro ceed to a

search of all relevantentries in the ADS Bibliographic Service 

From the ADS Bibliographic Service one can nd other similar abstracts

or one can obtain copies of full pap ers in some cases  and

The searchmay b e terminated with published pap ers and data tables

One sees here that most linkages are technically feasible Enabling tech

nologies for these browsing paths other than freeWAISsf include the WDB

WebtoSQL database interface to ol and the ADS Bibliographic Service

search engine Browsing tra jectories as illustrated here are limited far

less than in the past by the particular datatyp e images texts To a greater

extent than heretofore the user is allowed to express their information needs

in something which approaches natural language Wehaveeven seen with

soundex and phonix section ab ove that vo cal querying is not far o

Two issues are still unresolved how b est to organize suchbrowse and search

tra jectories an issue whichwe will not pursue further here and what ab out

direct querying of image contents

Adding Intelligence to Images

In this section we describ e ongoing work in contentbased image IR and in

making images activeandinteractive

One approach to contentbased image retrieval is through image indexing

ie through creation of an ob ject catalog or inventoryFor building up an

inventory of the images contents traditional approachestoobjectinventory

have used thresholding A more thorough approachtoadaptive thresholding

is preferable which relativizes background and incorp orates a vision mo del

relating to the typeofobjectwhich is sought as a priority Since a vision

mo del is based on multiresolution analysis and mathematical morphological

op erations Bijaoui and Rue present a comprehensive view of work in this

direction

Multiresolution analysis is motivated by the fact that the human visual system

deals with visual scenes at diering resolution scales It handles these reso

lution scales simultaneously Using wavelet transforms or other approaches

the rst phase is to arrive at a set of resolution scales representing

dierent levels of information in the original image Multiresolution trans

forms havebeenshown to b e an excellent basis for noise suppression in the

image and for image compression The multiresolution supp ort is a

datastructure whichmay b e derived from an image it is the signicant or

interesting part of the image at a given resolution level The multiresolution

supp ort is a multilevel b o olean image where contiguous sets of valued pixels

demarcate astronomical images of interest A priori knowledge of ob jects or

detector defects may b e incorp orated by mathematical morphology op erations

on the multiresolution supp ort An inventory is built up of the ob jects found

Insofar as ob ject templates are imp osed on the given image in such an approach

cf sp ecication of the multiresolution transform used or incorp oration of a

priori information through erosion and dilation op erations etc and insofar

as idealized astronomical ob jects are output at the end of the pro cessing in

the form of inventory tables what we are really doing here is transforming

the input image into generalized icons The icons are if successful our desired

astronomical ob jects p oint sources spiral galaxies etc

An alternativeapproach to image understanding is to start with the image

itself and make its indexation interactive This is in contrast to starting with

our exp ert judgement ab out what it contains which is then encapsulated in

amultiresolution analysis system or some other alternative analysis system

Making the images indexes active is an ob jectivewhich is currently b eing

addressed in Strasb ourg Observatorys ALADIN pro ject Clicking on

the interactive atlas will provide links to various sources of information as

tronomical catalogs bibliographic data

Chang and Hsu already claim to see b eyond image mo deling generalized

icons and active indexes and to p erceive the contours of smart images These

are images with asso ciated knowledge structures which can b e thoughtofas

generalizing current practice of asso ciating descriptors and audit trails with

images Images would have asso ciated contextdep endentactive metho ds for

display cf pyramidal data structures preview functionality or iconization

as discussed ab ove transmission cf progressive transmission using http and

other proto cols hyp erlinking cf ALADIN data structures for supp ort of

pro cessing algorithms and interrelationships with other images in the same

and in other image databases

Intelligent search agents which op erate networkwide are describ ed in

Active agents are also central in current computing research directions in

particular as regards biological computing What is describ ed as smart

images in are active vision agents whichplay their role in future comput

ing environments

Conclusions

How b est to link not just data but information in whatever form it presents

itself has b een at issue in this article Ma jor problems have b een domesticated

in recentyears eg handling distributed multiformat information Wehave

pinp ointed some op en issues chiey in regard to contentbased image retrieval

Such a problem comes within the scop e of a year international consortium

funded by the Europ ean Science Foundation to study Converging Computing

Metho dologies in Astronomy and to provide a p olicy recommendation at the

end of this p erio d Further information on the consortium can b e found at

URL httpwwwesoorgconvcomphtml

References

A Accomazzi G Eichhorn CS Grant SS Murray and MJ Kurtz The

Astrophysics Data System Abstract and Article Services Vistas in Astronomy

sp ecial issue pro ceedings of WAW Conference April in press

A Accomazzi Notes on the use of search engines at ESO http

wwwesoorgwaisdo cumentationfwsfnotesaaccomazaprhtml

HM Adorf Resource discovery on the Internet this meeting

A Bijaoui and F Rue A multiscale vision mo del adapted to astronomical

images pp submitted

SK Chang and A Hsu Image information systems where do we go from

here Handbook of Pattern Recognition and Computer Vision eds CH Chen

LF Pau and PSPWang World Scientic Singap ore

WB Frakes and R BaezaYates Eds Information Retrieval Data Structures

and Algorithms PrenticeHall

A Kay Apple Computer Inc When can we take the training wheels o

plenary presentation Third International WWW Conference Darmstadt

DD Lewis and KS Jones Natural language pro cessing for information

retrieval preprint ATT Bell Lab oratories and Computer Lab oratory

University of Cambridge pp

D MedyckyjScott C Monckton and P Burnhill Progress towards standards

for spatial metadata pp draft do cument GENIE Pro ject Dept Comp

Sci Loughb orough Univ Tech and RRL Scotland and Data Library Univ

Edinburgh

JJ Michael and M Hinnebusch From A to Z A Networking Primer

Mecklermedia Westp ort CT

F Murtagh Free text information retrieval an assessment of publicly available

Unixbased systems technical note pp

F Murtagh W Zeilinger JL Starck and A Bijaoui Ob ject detection

using multiresolution analysis in D Shaw H Payne and J Hayes Eds

Astronomical Data Analysis Software and Systems IV Astron So c Pac

in press

Ph Paillou F Bonnarel F Ochsenb ein and M Creze ALADIN Atlas

Interactif du Ciel Profond CDS Observatoire Astronomique de Strasb ourg

rep ort pp Version

U Pfeifer T Po ersch and N Fuhr Searching prop er names in

databases to app ear in HIM Proc Conf HypertextInformation Retrieval

Multimedia Available at ftplswwwinformatikunidortmunddepub

do crep ortsPfeiferetalahtml

U Pfeifer N Fuhr and T Huynh Searching structured do cuments with

the enhanced retrieval functionality of freeWAISsf and SFgate Computer

Networks and ISDN Systems

BF Rasmussen WDB software and do cumentation

httparchhttphqesoorgbfrasmuswdb

G Salton Developments in automatic text retrieval Science

G Salton J Allan C Buckley and A Singhai Automatic analysis theme

generation and summarization of machinereadable texts Science

JL Starck F Murtagh and A Bijaoui Multiresolution supp ort applied

to image ltering and restoration Computer Vision Graphics and Image

Processing Graphical Models and Image Processing in press