
Journal of Computing and Information Technology - CIT 12, 2004, 3, 175–194

A Taxonomy of Information Retrieval Models and Tools

Gerardo Canfora and Luigi Cerulo

RCOST – Research Centre on Software Technology, University of Sannio, Benevento, Italy

Information retrieval is attracting significant attention due to the exponential growth of the amount of information available in digital format. The proliferation of information retrieval objects, including algorithms, methods, technologies, and tools, makes it difficult to assess their capabilities and features and to understand the relationships that exist among them. In addition, the terminology is often confusing and misleading, as different terms are used to denote the same, or similar, tasks.

This paper proposes a taxonomy of information retrieval models and tools and provides precise definitions for the key terms. The taxonomy consists of superimposing two views: a vertical taxonomy, which classifies IR models with respect to a set of basic features, and a horizontal taxonomy, which classifies IR systems and services with respect to the tasks they support.

The aim is to provide a framework for classifying existing information retrieval models and tools and a solid reference point for assessing future developments in the field.

Keywords: information retrieval, taxonomy, tools, models.

1. Introduction

In recent years information retrieval has become an important subject of much research, because the amount of information available in digital formats has grown exponentially and the need for retrieving relevant information has become crucial. The Web and digital libraries have shown to a large audience the importance of effective mechanisms and tools to retrieve documents from a very large document collection based on user information needs.

Information Retrieval (IR) is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format [58].

In this paper we focus on text document retrieval, in which the information is represented by text documents. Therefore, for the purposes of this paper, the terms information and documents are used interchangeably. Text document retrieval is the most traditional subfield of IR; however, IR comprises other subfields, such as speech retrieval, information generation, query answering, and text summarization, that we do not cover in this paper.

A key feature of a text IR system is retrieving the documents that can satisfy the information needs of a user from a large collection of documents. Such systems, especially in the context of the web, are usually known as search engines, so in the rest of the paper we will consider search engine as a synonym of information retrieval system. IR systems prepare the collection of documents for retrieval through an indexing step. User information needs are usually represented by keywords or phrases, which are themselves indexed, although more complex representation languages are available. This representation, which inevitably causes a loss of information, is usually known as a query. Indexing can assume different forms according to the model adopted to represent both the documents in the collection and the user information needs. Many current IR systems exploit ranked IR methods, i.e. they rank the documents in the collection based on a measure of their relevance with respect to the user information needs as represented by a query.

The proliferation of information retrieval algorithms, methods, technologies, and tools is making it more difficult to assess the features and the characteristics of each IR aspect and to understand the relationships that exist among them. The terminology is often confusing; for example, terms such as crawling, indexing, and spidering are often used to denote similar tasks, with no clear distinction of the differences.

In this paper we propose a classification of IR models and tools and provide definitions for the key terms. The classification consists of superimposing two views: one for the IR models and one for the IR objects, either tools or services. A vertical taxonomy classifies IR models with respect to a set of basic features, and a horizontal taxonomy classifies IR objects with respect to their tasks, form, and context. The vertical taxonomy is built by exploding two basic features of any IR model: the representation, that is, the model adopted to represent both the documents and the user queries; and the reasoning, which refers to the framework adopted to resolve a representation similarity problem. The horizontal taxonomy is derived from an analysis of the application areas of IR.

1.1. Related Works

In the literature, several studies have been proposed that outline classifications of IR models and tools. However, most of these studies do not cover the entire spectrum of IR objects; the reasons can be found either in the age of the papers or in the specific objectives of the studies. For example, in 1984 Smith and Warner [69] published a document representation taxonomy with the aim of relating new research works to previous works and suggesting new areas of research. Nowadays, this taxonomy is largely incomplete, because it does not consider, for example, the representation of structured documents. In 1987 Belkin and Croft [20] published a classification of the most important retrieval techniques in which no reference is made to the relevance feedback model, because, as the authors explicitly state, relevance feedback is not considered a retrieval technique, but rather a help to refine the retrieval model.

In a more recent work, Paijmans [54] made an interesting analysis of the most important retrieval models. The approach adopted to construct a taxonomy of IR models consists of identifying a generic model that forms a basis for a variety of more specific models. Paijmans identified the vector document model as the basis for building the classification and showed how the vector model can subsume other popular models. Whilst this constitutes a concise style of classification, it is unable to classify IR techniques that are not derived from the vector based model, such as the logic-based techniques.

Our approach is different, as we start from a classification of the basic features of IR models and proceed with a classification of the objects produced in the various fields of information retrieval in terms of tools and services. The flexibility of this faceted view is evident when we consider that different information retrieval objects can be based on the same information retrieval model, and the same information retrieval model can be exploited to implement different information retrieval objects. For example, the classic vector model, generally presented as a retrieval technique, can be used for building information filtering and document clustering tools, too. The latter are different information retrieval objects that exploit the same information retrieval model.

1.2. Content and Structure of the Paper

There are two main viewpoints that characterize information retrieval: we call these two viewpoints information retrieval objects and information retrieval models. The former is generally an artifact that exists in the form of a tool or a service and responds to the “what” question; the latter is a set of theories on which the information retrieval object is based and responds to the “how” question. The two aspects are related, as one object can be based on more than one model and one model can be the basis for more than one object. On this framework we have built a horizontal taxonomy and a vertical taxonomy. The horizontal taxonomy refers to IR objects, while the vertical one considers IR models.

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the vertical and the horizontal taxonomies, together with examples of their application. Section 4 superimposes the vertical and horizontal taxonomies and shows how this can be used to obtain a mapping of the object’s features on the underlying models.

2. Vertical Taxonomy

Modeling the process of information retrieval is complex, because many parts are, by their nature, vague and difficult to formalize. The human component assumes an important role and many concepts, such as relevance and information needs, are subjective. Therefore, information retrieval models can be very complex and, consequently, their classification can be hard. However, in the definition of any IR model we can identify some common aspects. Generally, the first step is the representation of documents and information needs. From these representations a reasoning strategy is defined that solves a representation similarity problem to compute the relevance of documents with respect to queries. Various strategies have been introduced with the aim of improving the retrieval process: we classify these methodologies under the reasoning component.

Representation and Reasoning can be used to characterize an information retrieval model. For example, in [52] an information retrieval model is characterized as a quadruple $\{D, Q, F, R(q,d)\}$ where:

D is a set of logical views for the documents in the collection (Representation component);

Q is a set of logical views for the user information needs (Representation component);

F is a framework for modeling document representation, queries and their relationships (Reasoning component);

$R(q,d)$ is a ranking function which associates a real number with a query $q \in Q$ and a document $d \in D$ (Reasoning component).
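To make the quadruple more concrete, the following minimal Python sketch encodes D, F and R(q, d) as plain data structures, with queries from Q supplied at retrieval time. It is an illustration only: the class name `IRModel`, its field names, and the toy ranking function `overlap` are assumptions introduced here and do not come from [52], and logical views are simplified to sets of keywords.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Assumption: a logical view is simplified to a set of keywords.
DocView = FrozenSet[str]    # an element of D
QueryView = FrozenSet[str]  # an element of Q


@dataclass(frozen=True)
class IRModel:
    """Hypothetical container for the quadruple {D, Q, F, R(q, d)}."""
    documents: List[DocView]                     # D: logical views of the documents
    framework: str                               # F: only named here, not formalized
    rank: Callable[[QueryView, DocView], float]  # R(q, d): real-valued ranking function

    def retrieve(self, query: QueryView) -> List[int]:
        """Return indices of documents with a positive score, best first."""
        scored = [(self.rank(query, doc), i) for i, doc in enumerate(self.documents)]
        ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [i for score, i in ordered if score > 0]


def overlap(query: QueryView, doc: DocView) -> float:
    """Toy ranking function: number of keywords shared by query and document."""
    return float(len(query & doc))


model = IRModel(
    documents=[
        frozenset({"information", "retrieval", "model"}),
        frozenset({"neural", "network"}),
        frozenset({"retrieval", "system"}),
    ],
    framework="set overlap",
    rank=overlap,
)
print(model.retrieve(frozenset({"retrieval", "model"})))
```

Swapping `overlap` for another function changes only the Reasoning component and leaves the Representation component untouched, which is the separation that the taxonomy relies on.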

Fig. 1. Vertical taxonomy.

An information retrieval model can be modeled as a couple $\langle Rp, Rs \rangle$ where Rp is the representation model of documents and queries, and Rs is a framework for modeling the relationship between document and query representations, which is the reasoning strategy. Every component can be divided into subcomponents and for every subcomponent we can build a tree of possible approaches and solutions presented in the literature, as shown in Fig. 1.

Defining the approaches used for each component identifies an IR model. For example, the couple $\langle Rp, Rs \rangle$:

$Rp_{\text{query}} = \{\text{keyword-based}\}$

$Rp_{\text{document}} = \{\text{weighted vector}\}$

$Rs_{\text{with logic}} = \{\text{vector algebra}\}$

identifies the well-known vector model, as we will discuss later. We will now go into each of these components.

2.1. Representation

A fundamental component of an IR system is the representation of the information itself: information can be processed if it is represented in some way.

In text information retrieval, representation means representing documents and queries. A document is the representation of the information the author wished to encode; it is the unit of information that can be retrieved by an IR system. Queries are the representation of information needs of a user.

Any text can be characterized by using four attributes: syntax, structure, semantics, and style. A text has a given syntax and a structure, which are usually dictated by the application or by the person who created it. Text also has a semantics, specified by the author of the document. Additionally, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. In many approaches to text representation the style is coupled with the document syntax and structure (see for example the LaTeX document preparation system [40]). Modern representations, such as XML [80], separate the representation of syntax and structures, which are defined either by a DTD or an XSD, and style, which is captured by XSL.

Whilst documents are characterized by syntax, structure, semantics and style, the structure and semantics of text are generally sufficient to characterize queries.

Query Representation

A query is the representation of a user's information need. The information need originates from a problem that the user must resolve; it is implicit in the user's mind, and its purpose is to bridge a knowledge gap. An information need can be of three types [50]: known item information need, conscious information need, and confused information need. The first is when users search for, or verify the existence of, documents they know. The second is when users search for documents they do not know, but which regard a subject they know. The third is when users know neither the documents nor the subject. The following classes of query representations can be identified:

Keyword-based. This is the simplest form for a query. It is composed of keywords, and the documents containing such keywords are searched for. Keyword-based queries are popular, because they are intuitive and easy to express. Usually, a keyword query is a single word, but, in general, it can be a more complex combination of Boolean operations applied to several words.

- Single word. It is the most elementary query that can be formulated in a text retrieval system. Depending on the reasoning component, the result of a single word query is generally the set of documents containing at least one occurrence of the searched word.

- Boolean. It is the oldest and still widely used form of combining the keywords in a query. A Boolean query is an expression whose elements are keywords, Boolean operators and a precedence notation. In addition to classical Boolean operators, several new operators have been proposed, such as the NEAR operator, which allows context search capabilities, and the fuzzy Boolean operator, which relaxes the meaning of canonical AND and OR.

Pattern-based. It is a more specific query formulation, which allows the specification
of text having some properties. A pattern is a set of syntactic features that must occur in a text segment. The segments satisfying the pattern specification are said to match the pattern.

Structural. Structural queries are a mechanism to improve the retrieval quality of structured information. This mechanism is generally built on top of the basic queries with the addition of structural constraints expressed using containment, proximity, or other restrictions on the structural elements in the documents. Structural queries can be categorized into three main categories: fixed structure, hypertext, and hierarchical structure. The first is the simplest form and, for this reason, it is more restrictive. The documents are divided into a set of fields, each of which contains some text. A fixed structural query restricts the search to text contained in certain document fields. The hypertext is probably the most flexible form of structuring. It is a directed graph where the nodes hold some text and the links represent connections between the nodes. However, it is not possible to query the hypertext structural connectivity, but only the text content of the nodes. This transforms the retrieval activity into a navigational activity (browsing task). The hierarchical structure is an intermediate structuring model and represents a natural decomposition for many text collections (books, articles, structural programs, etc.). For example, XML is the most prominent structural representation model and XPath [81] is a query language for addressing pieces of content in the hierarchical structure.

Document Representation

A document is a retrievable element of the document space of an information retrieval system. It can be considered as the minimal resource that an information retrieval system can retrieve. Historically, documents have been represented by a set of terms called keywords, which are usually extracted from the text or inserted by the author. The following are the most significant types of document representation:

Stream of characters. Text is represented as a stream of characters and no interpretation is made on its structure or semantic content.

Vector space. The basic principle of this text representation model is to consider that each document is described by a vector of components that are representative of the semantic content of the document. Traditional vector space approaches use a set of keywords, called index terms, but other types of representative components, such as n-grams, are used. An index term is a word whose semantics helps in identifying the document's main themes. Of course, not all terms of a document are useful for describing the document content. In fact, there are index terms which are vaguer than others. Deciding the importance of terms is not a trivial task. In a large collection of documents a word which appears in each document is useless as an index term, because it does not discriminate between documents. On the other hand, a term that appears in one document will likely describe the content of this document [45], [83]. Vector representations can be further categorized as follows.

- Binary. The text document is represented as a binary vector of terms. Each element of the vector represents a term and its value is ‘1’ if the term appears in the document, ‘0’ otherwise.

- Weighted. In this case element values are real numbers between 0 and 1, called term weights, and represent the affinity of the term with respect to the document. A widespread method to compute the term weights exploits two factors [58]: Term Frequency (TF) and Inverse Document Frequency (IDF). The first provides a measure of how well the term describes the document contents (intra-cluster similarity); the second measures how well the term can discriminate documents among the collections (cluster dissimilarity). A well-known term weighting scheme, valid for generic collections, is the product between the TF and IDF factors. Several variations are described by Salton and Buckley [66].

  - Latent semantic. In the traditional vector space approach each document is represented by a vector of n components, where n is the number of terms occurring in the collection, that is, the dimension
of the document space. Latent Semantic Indexing (LSI) [27] reduces the dimension of the document space by capturing term-to-term statistical relationships. The document space is then represented by a new coordinate system of dimension $k < n$, called k-space (or LSI space), in which each of the k dimensions is a derived concept often called LSI factor or LSI feature. LSI features are identified by using a method for matrix decomposition called Singular Value Decomposition (SVD). The derived concepts may be thought of as artificial concepts; they represent extracted common meaning components of many different words and documents.

  - Fuzzy subset. Fuzzy set theories deal with the representation of classes whose boundaries are not well defined. Each element of the class is associated with a membership function that defines the membership degree of the element in the class. In many fuzzy representation approaches the TF-IDF function of the weighted vector model is used as the fuzzy membership function [35], [37].

- N-Gram. The n-gram approach is in some respects an evolution of vector space approaches. In the traditional vector space approaches the dimensions of the document space for a given collection of documents are the words (or sometimes phrases) that occur in the collection. By contrast, in the n-gram approach, the dimensions of the document space are n-grams: strings of n consecutive characters extracted from the text without considering word lengths, and even word boundaries. Hence, the n-gram is a remarkably pure statistical approach, one that measures the statistical properties of strings of text in the given collection and does not consider the vocabulary, lexical, or semantic properties of the natural language in which the documents are written. The n-gram length (n) and the method for extracting n-grams from documents vary from one author to another. In [22] Damashek uses n-grams of length 5 and 6 for clustering text by language and topic. He uses a sliding window approach in which n-grams are obtained by moving a window of n characters through a document or a query, one character at a time. Some authors [82] also use n-grams that cross word boundaries, i.e., that start within one word, end in another word, and include the space characters that separate consecutive words.

Structural. Structural documents, similarly to structural queries, are a mechanism to improve the retrieval quality. The main idea is to enrich documents with additional information that allows a computer to make part of the semantic content explicit. XML is the most prominent standard for modeling these aspects of information.

2.2. Reasoning

With the term reasoning we refer to the set of methods, models, and technologies used to match document and query representations in a retrieval task. Closely related to the reasoning component is the concept of relevance. The primary goal of an information retrieval system is to retrieve the documents relevant to a query. The reasoning component defines the framework to measure the relevance between documents and queries using their representations.

A key question to address in order to understand the reasoning component of an IR system is to find a precise definition for relevance. This is still an open problem within the IR community; the literature reports different definitions, but a widespread definition is [67]:

Relevance is the (A) of a (B) existing between a (C) and a (D) as determined by an (E).

Where:

(A). measure, estimate, judgment

(B). utility, matching, satisfaction

(C). document, document representation, information provided

(D). question, question representation, information need

(E). requester, intermediary, expert
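As an illustration of how a representation from Section 2.1 combines with a reasoning strategy, the following minimal Python sketch builds TF-IDF weighted vectors and ranks documents by the cosine measure discussed under the vector algebra item of this section. It also shows sliding-window extraction of character n-grams, which could replace words as vector dimensions. All function names are assumptions introduced here. The weighting (term frequency normalized by the most frequent term, multiplied by log(N/df)) is only one of the variants in the literature, and the weights are not rescaled to the interval between 0 and 1.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Slide a window of n characters through text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def idf_table(docs):
    """Inverse document frequency log(N / df) for every token in the collection."""
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(len(docs) / count) for tok, count in df.items()}


def weighted_vector(tokens, idf):
    """TF-IDF weights; tokens unknown to the collection are ignored."""
    tf = Counter(tok for tok in tokens if tok in idf)
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {tok: (count / max_tf) * idf[tok] for tok, count in tf.items()}


def cosine(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    idf = idf_table(docs)
    query_vec = weighted_vector(query_tokens, idf)
    scores = [cosine(query_vec, weighted_vector(doc, idf)) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: (-scores[i], i))


docs = [
    "fuzzy retrieval model".split(),
    "vector retrieval model for text".split(),
    "neural network learning".split(),
]
print(char_ngrams("text", 3))              # character trigrams of "text"
print(rank("vector model".split(), docs))  # document indices, best match first
```

In this sketch a token that occurs in every document receives an IDF of zero and therefore contributes nothing to the ranking, which mirrors the observation in Section 2.1 that such a word is useless as an index term.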

An attempt to clarify this definition has been proposed by Mizzaro [51]. Starting from an accurate analysis of the interactions between the users and the system, the paper identifies various types of relevance on which it is possible to define an order relation.

An information retrieval reasoning strategy can be one or any combination of: reasoning with logic, reasoning with uncertainty, and reasoning with learning. A reasoning with logic approach deals especially with models developed as logical-mathematical theories. A reasoning with uncertainty approach is useful whenever the system is unable to assess the truth of all the aspects of the environment in which it operates. In these cases its behavior is affected by uncertainty. This is due to many reasons: it does not understand the environment properties; there are many variables to process and not enough time available, etc. Reasoning with learning approaches apply inductive machine learning techniques. Machine learning is concerned with systems that learn from experience. In a classical system, the system designer inserts all the knowledge. Whenever the designer does not possess complete knowledge of the system’s application domain, a learning mechanism is the only way to acquire new knowledge. Learning mechanisms are used both to fulfil an objective and to improve it. In IR the primary goal is to improve retrieval effectiveness, for example, in terms of precision and recall.

Most of the classical information retrieval models deal with the reasoning with logic and reasoning with uncertainty strategies. In the first, for example, fall methods based on first order logic [47], [8], [6], and methods based on Boolean and vector algebra [74], [64], [25], [78], [77]. In the second fall methods in which the vagueness and uncertainty aspects of IR are treated in terms of probabilistic and fuzzy set approaches. Since many information retrieval aspects are affected by vagueness and uncertainty, many reasoning processes based on uncertainty have been proposed [59], [13], [14], [76], [10], [53], [49], [48], [63], [70]. Machine learning techniques gained a growing popularity in the past ten years [23], [16], [43].

Recently, several novel approaches have been proposed, based on either graph theories [12], [24], [33], [55] or formal ontology [31].

Reasoning with Logic

Logic. The logical approach to information retrieval can be formulated in terms of the logical formula $P(d \rightarrow n)$, where the arrow is the conditional connective formalized by a logic to be chosen and P is the predicate: “the representation of document d is relevant to the representation of information need n”. The central problem is selecting the right implication connective, i.e. selecting the logic whose implication connective best mirrors relevance. An overview of the role of logic in information retrieval is reported in [68].

Algebra. Algebra calculus is the most common approach. Under this item we include the reasoning strategies which are based on a set of operations defined in an algebraic field.

- Boolean algebra. In the conventional Boolean algebra reasoning strategy the query Boolean expression is computed to verify whether a document either satisfies a query (is relevant) or does not satisfy it (is non-relevant). No ranking is possible, and this is a significant limitation. A number of extended Boolean models have been developed to provide ranked output. These extended Boolean models employ extended Boolean operators (also called soft Boolean operators) [42].

- Vector algebra. Using a weighting scheme for document and query representations, the vector algebra approach computes a numeric similarity between the query and each document. The documents can then be ranked according to how similar they are to the query. The usual similarity measure exploited in document vector space is the inner product between the query vector and a given document vector [65]. If both vectors have been cosine normalized, then the inner product represents the cosine of the angle between the two vectors; hence this similarity measure is often called cosine similarity. Other well-known variants of similarity functions are Dice’s coefficient and Jaccard’s coefficient [58].

Graph theories. Graph theories deal with structures formed by vertices and edges. The
application of graph algorithms to information retrieval becomes more interesting with the advent of the web. Web resources can be well modelled with a graph structure in which documents represent vertices and hyperlinks represent edges. In [24] a Maximum Flow method is introduced to identify web communities. Previous graph-based approaches were applied to bibliographic documents and were principally based on bibliometric methods such as co-citation and bibliographic coupling. Some of these are used in the web context, too. Such algorithms include: the PageRank algorithm [12], on which the Google [104] web search engine is based, the HITS algorithm [33], and the SAE algorithm [55].

Reasoning with Uncertainty

Probability theories. Probabilistic theories were introduced by Robertson and Sparck Jones [59]. The fundamental reasoning approach is based on the following assumption: given a user query and a document in the collection, the probabilistic reasoning process tries to estimate the probability that the user will find the document interesting. There exist some alternative approaches based on Bayesian networks. In particular, the inference network [71] model has been used in the INQUERY system [13], while reference [57] introduces a generalization called belief network.

Fuzzy set theories. Fuzzy IR models have been defined to overcome the limitations of the crisp Boolean IR models, in particular to manage the vagueness and incompleteness of users in query formulation. Fuzzy extended Boolean models are a superstructure of the Boolean model by means of which existing Boolean IR systems can be extended without redesigning them completely. The standard Boolean models apply an exact match between the query and the document representations, and then partition the document base into two sets: the retrieved documents and the rejected ones. As a consequence of this crisp behavior, they are liable to reject useful items as a result of too restrictive queries, and to retrieve useless material in reply to excessively general queries. Thus, softening the retrieval activity to rank the retrieved items in decreasing order of relevance to a user query can greatly improve the effectiveness of such systems. This objective can be reached by extending the Boolean model in several ways [35]. In the fuzzy extensions of document representations the aim is to provide more specific and exhaustive representations of the documents’ information content, in order to reduce the imprecision and incompleteness of the Boolean indexing. For example, a document can be represented as a fuzzy set of terms. In the fuzzy generalization of the Boolean query language the objective is to have a more expressive query language, in order to capture the vagueness of the user needs as well as to simplify the user-system interaction. Various approaches have been proposed. One of these introduces soft connectives of selection criteria [11], characterized by a parametric behavior which can be set between the two extremes “AND” and “OR”. In other approaches, the Boolean query language has been generalized by defining aggregation operators as linguistic quantifiers, such as “at least k” or “about k”.

Reasoning with Learning

Several authors have proposed the use of machine learning approaches in IR. The most frequently used techniques include [16]: multiple layered and feed-forward neural networks such as back propagation networks [62], symbolic and inductive learning algorithms such as ID3 [56] and ID5R [72], and evolution-based algorithms such as genetic algorithms [34].

Neural networks. Neural network computing seems to fit well with conventional retrieval models such as the vector space model and the probabilistic model. One of the first applications in IR comes from Belew [7]. He developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its user to change its representation of authors, index terms, and documents over time. An evolution of this application has been introduced by Kwok [39], who uses a modified Hebbian learning rule to reformulate probabilistic information retrieval. In other applications the
Neural Network approach has been used for more specific tasks. For example, in [44], a Kohonen’s self-organizing feature map was applied to construct a self-organizing representation of the semantic relationships between documents. A Neural Network document clustering algorithm was developed in [46]. The Hopfield neural network’s parallel relaxation method was used in [17] for concept-based information retrieval and exploration.

Symbolic learning. In IR the use of symbolic learning is more limited with respect to other learning techniques. In [9] a symbolic learning technique is used for automatic text classification. The symbolic learning process represents the numeric classification results in terms of IF-THEN rules. In [26] a regression method and ID3 were used to implement a feature-based indexing technique. In [18] ID3 and the incremental ID5R algorithm were adopted for information retrieval. Both algorithms were able to use user-supplied samples of desired documents to construct decision trees of important keywords which could represent the user’s query.

Genetic algorithms. Several genetic algorithm implementations have been developed in the context of IR. Reference [29] presents a genetic algorithm-based approach to document indexing, in which competing document descriptions (binary vectors of terms) are associated with a document and altered over time by using genetic mutation and crossover operators. In this design, a keyword represents a gene (bit pattern), a document, which is a vector of keywords (bit string), represents an individual, and a collection of documents, initially judged relevant by a user, represents the initial population. Based on Jaccard’s matching function, the initial population evolves through generations and eventually converges to an optimal, improved population. In [30] a similar approach is adopted.

2.3. An Example

As an example of application of the vertical taxonomy, we have taken some relevant works from the IR models field and tried to classify them using the vertical taxonomy. We identify each information retrieval model in relation to the representation and reasoning components described above. This is shown in Tab. 1. A notable aspect is that many models contain the weighted vector as a representation component; this is why Paijmans [54] introduced the vector document model.

Table 1. Vertical taxonomy of a set of Information Retrieval Models.

3. Horizontal Taxonomy

The vertical taxonomy alone is not sufficient to take into account all the objects that have been produced under the IR umbrella. Users do not interact with a model, but generally they use a software tool that is able to solve an information retrieval problem. This calls for the introduction of a further dimension, a new viewpoint that we call horizontal taxonomy. Through the horizontal taxonomy we classify information retrieval objects. An information retrieval object is an artifact that solves a more or less general IR problem. An information retrieval object is identified by three components, as illustrated in Fig. 2: Tasks, Form, and Context.

3.1. Tasks

Information retrieval tasks are concerned with a particular aspect of information retrieval derived from a user point of view and should not be confused with the tasks in an information retrieval process, such as query formulation, comparison, ranking, and document presentation. An information retrieval object can support one or more tasks and a task can be stand-alone or it can be integrated in a process to perform a larger task. We have identified the following tasks: ad hoc retrieval, known item search, interactive retrieval, filtering, browsing, clustering, mining, gathering and crawling. Sometimes they are known by different names because they are inherited from various research areas.

Ad Hoc Retrieval

An ad hoc retrieval task is characterized by an

arbitrary subject of the search and a short du-  ration  73 . It is typically performed by a re- searcher doing a literature search in a library. In this environment the retrieval system knows the set of documents to be searched, but cannot

anticipate the particular topic that will be inves-  tigated  73 . A retrieval system’s response to an ad hoc search is generally a list of documents ranked by decreasing similarity to the query. The internet search engines are examples of in- formation retrieval objects from which one can perform ad hoc search.

Known Item Search

A known item search is similar to an ad hoc

search, but the target of the search is a partic-  ular document or a small set of documents

that the searcher knows to exist in the collec-  tion and wants to find it  73 . An information retrieval object that performs this task usually

implements a precise query language for ex-

ample, structural query language with which a searcher can reach parts of a document with known structure and semantics. For example, in the library environment, a researcher that will Fig. 2. Horizontal taxonomy. retrieve all articles by an author. A Taxonomy of Information Retrieval Models and Tools 185
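The difference between the ad hoc and known item tasks can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Article` record, its fields, and the helper names `ad_hoc_search` and `known_item_search` are hypothetical and are not taken from any system cited in this paper, and cosine similarity over raw term counts merely stands in for whatever similarity measure a real system would use.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical record with a known structure (author, title, text).
    author: str
    title: str
    text: str


def _term_counts(text: str) -> Counter:
    # Naive tokenization: lowercase and split on whitespace.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ad_hoc_search(query: str, collection: list[Article]) -> list[Article]:
    """Ad hoc task: articles ranked by decreasing similarity to the query."""
    q = _term_counts(query)
    scored = [(_cosine(q, _term_counts(a.text)), a) for a in collection]
    matching = [s for s in scored if s[0] > 0]
    ranked = sorted(matching, key=lambda s: s[0], reverse=True)
    return [a for _, a in ranked]


def known_item_search(author: str, collection: list[Article]) -> list[Article]:
    """Known item task: exact match on a structured field (the author)."""
    return [a for a in collection if a.author == author]


collection = [
    Article("Salton", "Term weighting", "term weighting for automatic text retrieval"),
    Article("Rocchio", "Relevance feedback", "relevance feedback in information retrieval"),
    Article("Salton", "SMART", "experiments in automatic document processing"),
]
ranked = ad_hoc_search("automatic text retrieval", collection)
by_author = known_item_search("Salton", collection)
```

In this sketch the ad hoc search returns every article that shares a term with the query, best match first, whereas the known item search returns exactly the articles whose author field equals the requested name, with no ranking involved.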

Interactive Retrieval

A user's judgment of the usefulness of a document may vary during an information seeking activity [38]; this can be captured by the system through an interactive information retrieval task. During the interactive task the system attempts to perceive how the user interacts with it and, as a consequence, it can modify the current search strategy [60]. Classical relevance feedback approaches [61] can be seen as early techniques for interactive retrieval; the user interaction is captured as yes/no judgments of document relevance. The system uses these judgments to expand and/or reweigh the query [32].

Filtering

Also known as selective dissemination of information, or text routing, filtering combines aspects of text retrieval and text categorization. Like text categorization, a text filtering system processes documents in real time and assigns them to zero or more classes. However, like text retrieval, each class is typically associated with the information needs of one or a small group of users. Each user, or user group, can typically add, remove, or modify the queries, or profiles, according to their needs. Examples include: NewsSieve [100], a client-server USENET news filtering system that can be used in a desktop environment; NewsWeeder [87], an experimental USENET news filtering service; and SIFT (the Stanford Information Filtering Tool) [86], which includes two selective dissemination services, one for technical reports and one for USENET news articles.

Browsing

When users are not interested in posing a specific query to the system, but invest some time in exploring the document space, looking for interesting references, then they are browsing the space, instead of searching. There are three types of browsing, namely, flat, structure-guided and hypertext. In flat browsing the idea is that the user explores a document space which has a flat organization; for example, files in a directory. In structure-guided browsing the user is generally guided by a hierarchical structure in which documents are organized in categories and subcategories. The hypertext model introduces a navigational structure which allows a user to browse text in a non-sequential manner. The web is the best-known example of hypertext structure.

Clustering

The term emerges from the statistics community, where it is well known as classification analysis and discriminant analysis [3]. In the machine learning community, the task is often called concept learning. Clustering is the automatic recognition and generation of categories of entities, which can be text documents. It is usually based on some similarity measure between documents, as well as an explicit or implicit definition of what distinguishing characteristics the groups of documents should have. It is generally used to improve the retrieval process, because the search can be restricted to a set of categories of interest. Related to clustering is categorizing, which is the recognition and assignment of a document to one or more pre-existing categories. An example of a categorization tool is CORA (Computer Science Research Paper Search Engine) [84], an automatic categorizing tool for scientific papers. An example of a categorizing service is the Yahoo Directory [99]; in this case the categorization is performed manually, by human experts.

Mining

Mining is the process of automatically extracting key information from text documents. Such information can be: language identification, feature extraction, terminology extraction, predominant themes extraction, abbreviation extraction and relation extraction. LEXA [89] is an example of a corpus processing software, while the IBM text miner [91] is a mining tool integrated with the homonymous text search engine.
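As a rough illustration of one of the mining activities listed above, terminology extraction, the following sketch ranks the words of a document by raw frequency after discarding a small stopword list. The function name `extract_terms`, the stopword list, and the frequency criterion are assumptions made for this example; they are not taken from LEXA or from the IBM text miner, whose internals are not described here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools use larger ones).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}


def extract_terms(text: str, top_k: int = 3) -> list[str]:
    """Return the top_k most frequent non-stopword terms in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by decreasing frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


sample = (
    "Filtering is a retrieval task. Retrieval and filtering of documents "
    "is the focus of retrieval research on documents."
)
terms = extract_terms(sample)
```

Raw frequency is the simplest possible criterion; it is used here only because it keeps the example self-contained.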

Gathering

This is an activity involving pro-active acquisition of information from possibly heterogeneous sources. The metasearch engines exemplify a particular type of gathering task. Metacrawler [92] and InFind [116] are some examples. They combine outputs of several search engines and present the results as if produced by a single search engine.

Crawling

Crawling is concerned with the activity of selecting new, or updating the existing, sources of information that will be processed by successive activities, for example mining and/or gathering. It is also known as indexing process and, especially in the Web context, as spidering. Well known examples are: Scooter [94], ArchitextSpider [110], Sidewinder [112], Slurp [102] and Guliver [114], the spiders of Altavista [93], Excite [109], Infoseek [111], Inktomi [101] and Northernlight [113], respectively.

3.2. Form

The form refers to the way in which the object is supplied to the final user. It can be supplied in the form of tool or service. When the object is implemented as a software product, then it is a tool. It exists because, for example, a company has produced it to make business. It can be distributed, installed, sold, etc. When the object exists only in one, or a few, instances used to deliver some information retrieval services, then it is a service. Examples are search engines on the web.

3.3. Context

The context of an information retrieval object regards its domain of application. It can be general or specific. A general purpose information retrieval object operates on heterogeneous domains and contents, unlike a context specific system that operates on document collections belonging to a specific domain, such as legal and business documents, technical papers etc. Notable examples are web search engines, where the high heterogeneity of the information calls for a very general purpose approach. Google [104], Altavista [93], and Infoseek [111] are some general purpose engines that currently operate on the web. A specialized retrieval system is one that is developed with a particular application domain in mind. For instance, the LEXIS-NEXIS [119] retrieval system is a specialized retrieval system that provides access to a very large collection of legal and business documents. Similarly, the ResearchIndex service [105] provides free access to a large collection of scientific papers.

3.4. An Example

As we did with the vertical taxonomy, here we apply the horizontal taxonomy to a set of information retrieval objects. We have chosen 31 objects from various sources: research labs, companies, and institutions. The main classification scheme consists of identifying, for each object, its horizontal components included in Fig. 2. This is done by analyzing the object as a black box and trying to fetch information about what it does. The result is shown in the Appendix, in which information retrieval objects are listed with some informative notes and references. The presence of a cross establishes that the corresponding horizontal component is supported by the information retrieval object.

4. Concluding Remarks

For the purpose of simplicity, we have conducted the classification on two separate paths: a horizontal taxonomy and a vertical taxonomy. In reality, these taxonomies are not disjoint and in this concluding section we show how these two important aspects of information retrieval can be combined. We have already remarked that an information retrieval object can be based on more than one model and an information retrieval model can be the basis for more than one object.

The vertical dimension classifies information retrieval models based on a two-component view, namely representation and reasoning. The horizontal dimension classifies information retrieval objects with respect to the application areas. Indeed, objects can themselves be classified with respect to the vertical components, namely representation and reasoning. We call this further classification of an IR object the vertical projection of the object; Tab. 2 shows the vertical projection for the IR objects referred to in the Appendix. Note that a few rows in the table are left blank, as we were not able to access the information needed to produce the vertical projections of the related objects.

Table 2. Vertical projections.

In recent years, information retrieval has assumed an increasing importance because of the dramatic growth of the amount of information available in digital formats. The proliferation of information retrieval algorithms, methods,
technologies, and tools calls for the definition of basic concepts and terminology; this is useful to assess the features and the characteristics of each IR object and to understand the relationships that exist between the objects. In this paper we have proposed a taxonomy of IR objects, accompanied with definitions for the key terms. This taxonomy is a tentative first step in classifying IR models and tools, since it does not cover all aspects of IR. The market and the development of IR technologies are still evolving and this evolution will make some observations contained in this paper obsolete. As a result, this work will need to be updated incrementally as the technology develops. However, we think that the taxonomy presented in this paper provides a good starting point for such a continuous updating.

One of the main limitations of the taxonomy presented in this paper is the fact that it covers only text information retrieval. Indeed, current information needs require more and more integrated retrieval models and tools that combine the traditional retrieval of text documents with the retrieval of multimedia content, such as images and speech, and even structured data from databases. Therefore, there is room for improvement of the proposed taxonomy and we are currently working on extending it in order to include other important aspects of IR not covered here, primarily the retrieval of multimedia content.

5. Acknowledgment

The work described in this paper has been supported by the EUREKA Project E!2235, IKF – Information and Knowledge Fusion.

References

[1] AGOSTI, M., CRESTANI, F., TACHIR: a Tool for the Automated Construction of Hypertexts in Information Retrieval, Proceedings of RIAO, Rockefeller University, (1994), New York (USA).
[2] ANANDEEP S., SYCARA, P.K., A Learning Personal Agent for Text Filtering and Notification, Proceedings of the International Conference of Knowledge Based Systems, (1996), http wwwricmuedupubspub html.
[3] ANDERBERG, M.R., Cluster analysis for applications, Academic Press, New York, 1973.
[4] BAEZA-YATES, R., GONNET, G., Efficient text searching of regular expressions, Proceedings of the 16th International Colloquium on Automata, Languages and Programming, LNCS 372, (1989), pp. 46–62, Berlin (Germany).
[5] BAEZA-YATES, R., NAVARRO, G., Fast approximate string matching, Algorithmica, 23(2), (1999), pp. 127–158.
[6] BEERI, C., KORNATZKY, Y., A logical query language for hypertext systems, Proceedings of the European Conference on Hypertext, (1990), pp. 67–80, Versailles (France).
[7] BELEW, R.K., Adaptive information retrieval, Proceedings of the 12th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 11–20, Cambridge (MA).
[8] BERND T., Logic Programs for Intelligent Web Search, Proceedings of the 11th International Symposium on Methodologies for Intelligent Systems, (1999), LNAI 1609, Warsaw (Poland).
[9] BLOSSEVILLE, M.J., HEBRAIL, G., MONTEIL, M.G., PENOT, N., Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together, Proceedings of the 15th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 51–57, Copenhagen (Denmark).
[10] BOOKSTEIN A., Fuzzy request: an approach to weighted Boolean searches, Journal of the American Society for Information Science, 31, (1980), pp. 240–247.
[11] BORDOGNA, G., PASI, G., A Fuzzy Linguistic Approach Generalizing Boolean Information Retrieval; a Model and Its Evaluation, Journal of the American Society for Information Science, 44, (1993), pp. 70–82.
[12] BRIN, S., PAGE, L., MOTWANI, R., WINOGRAD, T., The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford University, 1998.
[13] BROGLIO, J., CALLAN, J.P., CROFT, W.B., NACHBAR, D.W., Document retrieval and routing using INQUERY system, Proceedings of the 3rd Text Retrieval Conference (TREC), (1995), pp. 29–38, Gaithersburg (Maryland).
[14] CALLAN, J., Document filtering with inference network, Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), pp. 262–269, Zurich (Switzerland).
[15] CHANG, S.J., RICE, R.E., Browsing: a multidimensional framework, Annual Review of Information Science and Technology, 28, (1993), pp. 231–276.

[16] CHEN, H., Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms, Journal of the American Society for Information Science, 46(3), (1995), pp. 194–216.
[17] CHEN, H., LYNCH, K.J., BASU, K., NG, D.T., Generating, integrating, and activating thesauri for concept-based document retrieval, IEEE EXPERT, Special Series on Artificial Intelligence in Text-based Information Systems, 8(2), (1993), pp. 25–34.
[18] CHEN, H., SHE, L., Inductive query by examples (IQBE): A machine learning approach, Proceedings of the 27th Annual International Conference on System Sciences, Information Sharing and Knowledge Discovery Track, (1994), Maui (Hawaii).
[19] COOPER, W.S., GEY, F.C., DABNEY, D.P., Probabilistic retrieval based on staged logistic regression, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 198–210, Copenhagen (Denmark).
[20] CROFT, W.B., Approaches to intelligent information retrieval, Information Processing and Management, 23(4), (1987), pp. 249–254.
[21] CUTTING, D.R., PEDERSEN, J.O., KARGER, D., TUKEY, J.W., Scatter/gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1992), pp. 318–329, Copenhagen (Denmark).
[22] DAMASHEK, M., Gauging similarity with n-grams: Language-independent categorization of text, Science, 267, (1995), pp. 843–848.
[23] DOSZKOCS, T.E., REGGIA, J., LIN, X., Connectionist models and information retrieval, Annual Review of Information Science and Technology, 25, (1990), pp. 209–260.
[24] FLAKE, G.W., LAWRENCE, S., GILES, C.L., COETZEE, F.M., Self Organization and Identification of Web Communities, Journal of the IEEE Computer Society, 35(3), (2002), pp. 66–71.
[25] FOX, E.A., Extending the Boolean and vector space models of information retrieval with P-norm queries and multiple concept types, PhD thesis, Cornell University, 1983.
[26] FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., TZERAS, K., AIR/X – a rule-based multistage indexing system for large subject fields, Proceedings of the 8th National Conference on Artificial Intelligence, (1990), pp. 789–895, Boston (MA).
[27] FURNAS, G.W., DEERWESTER, S., DUMAIS, S.T., LANDAUER, T.K., HARSHMAN, R.A., STREETER, L.A., LOCHBAUM, K.E., Information retrieval using a singular value decomposition model of latent semantic structure, Proceedings of the 11th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1998), pp. 257–265, Grenoble (France).
[28] GARFIELD, E., Citation Indexing: Its Theory and Application in Science, John Wiley & Sons, New York, 1979.
[29] GORDON, M., Probabilistic and genetic algorithms for document retrieval, Communications of the ACM, 31(10), (1988), pp. 1208–1218.
[30] GORDON, M.D., User-based document clustering by redescribing subject descriptions with a genetic algorithm, Journal of the American Society for Information Science, 42(5), (1991), pp. 311–322.
[31] GUARINO, N., MASOLO, C., VETERE, G., Ontoseek: Content-Based access to the web, IEEE Intelligent Systems, 14(3), (1999), pp. 70–80.
[32] HAINES, D., CROFT, W.B., Relevance feedback and inference networks, Proceedings of the 16th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1993), pp. 2–11, Pittsburgh (USA).
[33] KLEINBERG, J.M., Authoritative Sources in a Hyperlinked Environment, Proceedings of the 9th Annual Int. ACM SIAM Symposium on Discrete Algorithms, (1998), pp. 668–677, New York (USA).
[34] KOZA, J.R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge, MA, 1992.
[35] KRAFT, D., BUELL, D.A., Fuzzy sets and generalized Boolean retrieval systems, International Journal of Man-machine Studies, 19, (1983), pp. 45–56.
[36] KRAFT, D., PETRY, F.E., BUCKLES, B.P., SADASIVAN, T., The use of genetic programming to build queries for information retrieval, IEEE Symposium on Evolutionary Computation, (1994), pp. 468–473, Orlando (USA).
[37] KRAFT, D.H., BORDOGNA, G., PASI, G., Fuzzy set techniques in information retrieval, in J. Bezdek, D. Dubois and H. Prade (eds), Fuzzy Sets in Approximate Reasoning and Information Systems, 3(8), (1999), pp. 469–510, Kluwer Academic Publishers.
[38] KUHLTHAU, C.C., Inside the search process: Information seeking from the user's perspective, Journal of the American Society for Information Science, 42(5), (1991), pp. 361–371.
[39] KWOK, K.L., A neural network for probabilistic information retrieval, Proceedings of the 12th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1989), pp. 202–210, Cambridge (USA).
[40] LAMPORT, L., LaTeX: A document Preparation System, User's guide and Reference manual; 2nd edition, Prentice Hall, 1994.
[41] LAYAIDA, R., BOUGHANEM, M., CARON, A., Constructing an information retrieval system with neural networks, Lecture Notes in Computer Science, 856, (1994), pp. 561–570.

[42] LEE, J.H., Properties of extended boolean models in information retrieval, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (1994), pp. 182–190.
[43] LEWIS, D.D., Learning in intelligent information retrieval, Proceedings of the 8th International Workshop on Machine Learning, (1991), pp. 235–239, Morgan Kaufmann.
[44] LIN, X., SOERGEL, D., MARCHIONINI, G., A self-organizing semantic map for information retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 262–269, Chicago (IL).
[45] LUHN, H.P., A statistical approach to mechanized encoding and searching of library information, IBM Journal of Research and Development, 1, (1957), pp. 309–317.
[46] MACLEOD, K.J., ROBERTSON, W., A neural algorithm for document clustering, Information Processing & Management, 27(4), (1991), pp. 337–346.
[47] MCCUNE, B., TONG, R., DEAN, J.S., SHAPIRO, D., Rubric: a system for rule-based information retrieval, IEEE Transaction on Software Engineering, 1985, 11(9).
[48] MIYAMOTO, S., NAKAYAMA, K., Fuzzy information retrieval based on a fuzzy pseudo thesaurus, IEEE Transactions on Systems and Man Cybernetics, 1986, 16(2), pp. 278–282.
[49] MIYAMOTO, S., TERUHISA, M., KAZUHIKO, N., Generation of a Pseudothesaurus for Information Retrieval based on co-occurrences and fuzzy set operations, IEEE Transaction Systems, Man and Cybernetics, 13(1), (1983), pp. 62–69.
[50] MIZZARO, S., A cognitive analysis of information retrieval, Proceedings of CoLIS2, (1996), pp. 233–250, Copenhagen (Denmark).
[51] MIZZARO, S., How many relevancies in information retrieval?, Interacting with Computers, 10(3), (1998), pp. 305–322.
[52] BAEZA-YATES, R., RIBEIRO-NETO, B., Modern Information Retrieval, Addison Wesley, New York, 1999.
[53] OGAWA, Y., MORITA, T., KOBAYASHI, K., A fuzzy document retrieval system using the keyword connection matrix and a learning method, Fuzzy Sets and Systems, 39, (1991), pp. 163–179.
[54] PAIJMANS, H., Explorations in the document vector model of information retrieval, Dissertation, Tilburg University, 1999. http PaaiBibliogr pikubnl
[55] PIROLLI, P., PITKOW, J., RAO, R., Silk from Sow's Ear: Extracting Usable Structures from the web, Proceedings of the ACM Conference on Human Factors in Computing Systems, (1996), pp. 118–125, New York (USA).
[56] QUINLAN, J.R., Learning efficient classification procedures and their application to chess and games, Machine Learning, an Artificial Intelligence Approach, (1983), pp. 463–482, Tioga Publishing company, Palo Alto, CA.
[57] RIBEIRO-NETO, B.A., MUNTZ, R., A Belief network model for IR, Proceedings of the 19th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), pp. 253–260, Zurich (Switzerland).
[58] RIJSBERGEN, C.J., Information Retrieval, Butterworths, London, 1979.
[59] ROBERTSON, S.E., SPARCK JONES, K., Relevance weighting of search terms, Journal of the American Society for Information Sciences, 27(3), (1976), pp. 129–146.
[60] ROBINS, D., Interactive Information Retrieval: Context and Basic Notions, Information Science, 3(2), (2000), pp. 57–61.
[61] ROCCHIO, J.J., Relevance Feedback in Information Retrieval, Prentice Hall, 1971.
[62] RUMELHART, D.E., HINTON, G.E., WILLIAMS, R.J., Learning Internal Representations by Error Propagation, Parallel Distributed Processing, (1986), pp. 318–362, The MIT Press, Cambridge, MA.
[63] SACHS W.M., An approach to associative retrieval through the theory of fuzzy sets, Journal of the American Society for Information Sciences, (1976), pp. 85–87.
[64] SALTON, G., The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, New York, 1971.
[65] SALTON, G., Automatic text processing: The transformation, analysis, and retrieval of information by computer, Addison-Wesley, 1989.
[66] SALTON, G., BUCKLEY C., Term weighting approaches in automatic retrieval, Information Processing and Management, 24(5), (1988), pp. 513–523.
[67] SARACEVIC, T., RELEVANCE: A Review of and a Framework for the thinking of the notion in information science, Journal of the American Society for Information Science, 26(6), (1975), pp. 321–343.
[68] SEBASTIANI, F., On the Role of Logic in Information Retrieval, Information Processing & Management, 34(1), (1998), pp. 1–18.
[69] SMITH, L.C., WARNER, A.J., A taxonomy of representation in information retrieval design, Journal of Information Science, 8, (1984), pp. 113–121.
[70] TAHANI, V.A., A fuzzy model of document retrieval systems, Information Processing and Management, 12, (1976), pp. 177–187.

[71] TURTLE, H., CROFT, W.B., Inference networks for document retrieval, Proceedings of the 13th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1990), pp. 1–24, Brussels (Belgium).
[72] UTGOFF, P.E., Incremental induction of decision trees, Machine Learning, 4, (1989), pp. 161–186.
[73] VOORHEES, E.M., HARMAN, D., Overview of TREC 2001, National Institute of Standards and Technology, 2001.
[74] WALLER, W.G., KRAFT, D.H., A Mathematical Model of a Weighted Boolean Retrieval System, Information Processing & Management, 15(5), (1979), pp. 235–245.
[75] WILKINSON, R., HINGSTON, P., Using the cosine measure in neural network for document retrieval, Proceedings of the 14th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1991), pp. 202–210, Chicago (USA).
[76] WONG, S.K.M., YAO, Y.Y., On modeling information retrieval with probabilistic inference, ACM Transactions on Information Systems, 13(1), (1995), pp. 39–68.
[77] WONG, S.K.M., ZIARKO, W., RAGHAVAN, V.V., WONG, P.C.N., On Extending the Vector Space Model for Boolean Query Processing, Proceedings of the 9th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1986), pp. 175–185, Pisa (Italy).
[78] WONG, S.K.M., ZIARKO, W., WONG, P.C.N., Generalized vector space model in information retrieval, Proceedings of the 8th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, (1985), pp. 18–25, New York (USA).
[79] WU, S., MANBER, U., Agrep: a fast approximate pattern matching tool, Proceedings of USENIX Technical Conference, (1992), pp. 153–162, San Francisco (USA).
[80] XML (eXtensible Markup Language) 1.0 (Second Edition), W3C Recommendation 6 October 2000. http://www.w3.org/XML
[81] XPath (XML Path Language) 1.0, W3C Recommendation 16 November 1999. http://www.w3.org/TR/xpath
[82] YANNAKOUDAKIS, E.J., GOYAL, P., HUGGIL, J.A., The generation and use of text fragments for data compression, Information Processing and Management, 18, (1982), pp. 15–21.
[83] ZIPF, G.K., Human Behaviour and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.
[84] CORA. http://cora.whizbang.com
[85] TACHIR. httpwwwdeiunipditims tachirhtml
[86] SIFT. ftpdbstanfordedupubsift siftnetnewstarZ
[87] NewsWeeder. httpantherlearning cscmueduifhom ehtml
[88] Grep. http://www.gnu.org
[89] LEXA. httpnorahduibnolexainfhtml
[90] OCP. httpinfooxacukctitext resguideresourcesohtml
[91] IBM text miner. http://www.ibm.com
[92] Metacrawler. http://www.metacrawler.com
[93] Altavista. http://www.altavista.com
[94] Scooter. http://www.altavista.com
[95] INQUERY. httpwwwciircsumassedu
[96] SMART. ftp://ftp.cs.cornell.edu/pub/smart
[97] ILA. Internet Learning Agent. httpwwwcs washingtoneduhomesmapilahtml
[98] WebLearner. httpwwwicsu ciedupazzani Coldlisthtml
[99] Yahoo Directory. http://www.yahoo.com
[100] NewsSieve. http://www.newssieve.com
[101] Inktomi. http://www.inktomi.com
[102] Slurp. http://www.inktomi.com
[103] Isearch. httpwwwcnidrorgisearchhtml
[104] Google. http://www.google.com
[105] ResearchIndex. http://www.researchindex.com
[106] Agrep, Glimpse. http://glimpse.cs.arizona.edu
[107] Scatter Gather: httpwwwsims berkeleyeduhearstsgoverviewhtml
[108] Amalthaea. httplcswwwmediamitedu mouxpapersPAAMPAAMhtml
[109] Excite. http://www.excite.com
[110] ArchitextSpider. http://www.excite.com
[111] Infoseek. http://www.infoseek.com
[112] Sidewinder. http://www.infoseek.com
[113] Northern Light. http://www.northernlight.com
[114] Guliver. http://www.northernlight.com
[115] WEBSOM. http://websom.hut.fi/websom
[116] Infind. http://www.infind.com
[117] Lycos. http://www.lycos.com
[118] GeoSearch. http://www.northernlight.com
[119] LEXIS-NEXIS. http://www.lexis-nexis.com

Appendix: Horizontal Taxonomy of a Set of Information Retrieval Objects

Received: September, 2002
Revised: January, 2004
Accepted: May, 2004

Contact address:
Gerardo Canfora
Research Centre on Software Technology
Department of Engineering
University of Sannio
Palazzo ex Poste – Via Traiano
82100 Benevento
ITALY
e-mail: gerardocanforaunisann ioi t

GERARDO CANFORA received the Laurea degree in electronic engineering from the University of Naples, Federico II, Italy, in 1989. He is currently a full professor of computer science at the Faculty of Engineering and the Director of the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. From 1990 to 1991, he was with the Italian National Research Council (CNR). During 1992, he was at the Department of Informatica e Sistemistica of the University of Naples, Federico II, Italy. From 1992 to 1993, he was a visiting researcher at the Centre for of the University of Durham, UK. In 1993, he joined the Faculty of Engineering of the University of Sannio in Benevento, Italy. He has served on the program committees of a number of international conferences. He was a program co-chair of the 1997 International Workshop on Program Comprehension and of the 2001 International Conference and the General Chair of the 2003 European Conference on Software Maintenance and Reengineering. His research interests include software maintenance, program comprehension, reverse engineering, workflow management, document and knowledge management, and information retrieval. He serves on the Editorial Board of the IEEE Transactions on . He is a member of the IEEE and the IEEE Computer Society.

LUIGI CERULO received the Laurea degree in computer engineering from the University of Sannio, Italy, in 2001. He is currently an assistant researcher at the Research Centre on Software Technology (RCOST) of the University of Sannio in Benevento, Italy. His research interests include information retrieval, fuzzy logic, and visual languages.