Journal of Computing and Information Technology - CIT 12, 2004, 3, 175–194
A Taxonomy of Information Retrieval Models and Tools
Gerardo Canfora and Luigi Cerulo
RCOST – Research Centre on Software Technology, University of Sannio, Benevento, Italy
Information retrieval is attracting significant attention due to the exponential growth of the amount of information available in digital format. The proliferation of information retrieval objects, including algorithms, methods, technologies, and tools, makes it difficult to assess their capabilities and features and to understand the relationships that exist among them. In addition, the terminology is often confusing and misleading, as different terms are used to denote the same, or similar, tasks.

This paper proposes a taxonomy of information retrieval models and tools and provides precise definitions for the key terms. The taxonomy consists of superimposing two views: a vertical taxonomy, which classifies IR models with respect to a set of basic features, and a horizontal taxonomy, which classifies IR systems and services with respect to the tasks they support.

The aim is to provide a framework for classifying existing information retrieval models and tools and a solid basis for assessing future developments in the field.

Keywords: information retrieval, taxonomy, tools, models.

1. Introduction

In recent years information retrieval has become an important subject of much research, because the amount of information available in digital formats has grown exponentially and the need for retrieving relevant information has assumed crucial importance. The World Wide Web and digital libraries have shown a large audience the importance of effective mechanisms and tools to retrieve documents from a very large document collection based on user information needs.

Information Retrieval (IR) is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format [58].

In this paper we focus on text document retrieval, in which the information is represented by text documents. Therefore, for the purposes of this paper, the terms information and documents are used interchangeably. Text document retrieval is the most traditional subfield of IR; however, IR comprises other subfields, such as image retrieval, speech retrieval, information generation, query answering, and text summarization, that we do not cover in this paper.

A key feature of a text IR system is retrieving the documents that can satisfy the information needs of a user from a large collection of documents. Such systems, especially in the context of the web, are usually known as search engines, so in the rest of the paper we will consider search engine as a synonym of information retrieval system. IR systems prepare the collection of documents for retrieval through an indexing step. User information needs are usually represented by keywords or phrases, which are themselves indexed, although more complex representation languages are available. This representation, which inevitably causes a loss of information, is usually known as a query. Indexing can assume different forms according to the model adopted to represent both the documents in the collection and the user information needs. Many current IR systems exploit ranked IR methods, i.e. they rank the documents in the collection based on a measure of their relevance with respect to the user information needs as represented by a query.

The proliferation of information retrieval algorithms, methods, technologies, and tools is making it more difficult to assess the features and the characteristics of each IR aspect and to understand the relationships that exist among them. The terminology is often confusing; for example, terms such as crawling, indexing, and spidering are often used to denote similar tasks, with no clear distinction of the differences.

In this paper we propose a classification of IR models and tools and provide definitions for the key terms. The classification consists of superimposing two views: one for the IR models and one for the IR objects, either tools or services. A vertical taxonomy classifies IR models with respect to a set of basic features, and a horizontal taxonomy classifies IR objects with respect to their tasks, form, and context. The vertical taxonomy is built by exploding two basic features of any IR model: the representation, that is, the model adopted to represent both the documents and the user queries; and the reasoning, which refers to the framework adopted to resolve a representation similarity problem. The horizontal taxonomy is derived from an analysis of the application areas of IR.

1.1. Related Works

In the literature, several studies have been proposed that outline classifications of IR models and tools. However, most of these studies do not cover the entire spectrum of IR objects; the reasons can be found either in the age of the papers or in the specific objectives of the studies. For example, in 1984 Smith and Warner [69] published a document representation taxonomy with the aim of relating new research works to previous works and suggesting new areas of research. Nowadays, this taxonomy is largely incomplete, because it does not consider, for example, the representation of structured documents. In 1987 Belkin and Croft [20] published a classification of the most important retrieval techniques in which no reference is made to the relevance feedback model, because, as the authors explicitly state, relevance feedback is not considered a retrieval technique, but rather an aid to refining the retrieval model.

In a more recent work, Paijmans [54] made an interesting analysis of the most important retrieval models. The approach adopted to construct a taxonomy of IR models consists of identifying a generic model that forms a basis for a variety of more specific models. Paijmans identified the vector document model as the basis for building the classification and showed how the vector model can subsume other popular models. Whilst this constitutes a concise style of classification, it is unable to classify IR techniques that are not derived from the vector-based model, such as the logic-based techniques.

Our approach is different, as we start from a classification of the basic features of IR models and proceed with a classification of the objects produced in the various fields of information retrieval in terms of tools and services. The flexibility of this faceted view is evident when we consider that different information retrieval objects can be based on the same information retrieval model, and the same information retrieval model can be exploited to implement different information retrieval objects. For example, the classic vector model, generally presented as a retrieval technique, can be used for building information filtering and document clustering tools, too. The latter are different information retrieval objects that exploit the same information retrieval model.

1.2. Content and Structure of the Paper

There are two main viewpoints that characterize information retrieval: we call these two viewpoints information retrieval objects and information retrieval models. The former is generally an artifact that exists in the form of a tool or a service and responds to the “what” question; the latter is a set of theories on which the information retrieval object is based and responds to the “how” question. The two aspects are related, as one object can be based on more than one model and one model can be the basis for more than one object. On this framework we have built a horizontal taxonomy and a vertical taxonomy. The horizontal taxonomy refers to IR objects, while the vertical one considers IR models.

The remainder of the paper is organized as follows. Sections 2 and 3 introduce the vertical and the horizontal taxonomies, together with examples of their application. Section 4 superimposes the vertical and horizontal taxonomies and shows how this can be used to obtain a mapping of the object’s features on the underlying models.
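The many-to-many relation between IR objects and IR models described above can be pictured as a pair of mappings, one read from the object side and one from the model side. The following is only a minimal sketch under stated assumptions: the names `OBJECT_TO_MODELS` and `invert` are hypothetical, the vector-model entries echo the example given in Section 1.1, and the "probabilistic model" entry is an illustrative placeholder rather than a classification taken from this paper.

```python
from collections import defaultdict

# Hypothetical object -> models assignments, for illustration only.
# The vector-model rows follow the example in Section 1.1; the
# "probabilistic model" entry is an assumed placeholder.
OBJECT_TO_MODELS = {
    "document retrieval": {"vector model", "probabilistic model"},
    "information filtering": {"vector model"},
    "document clustering": {"vector model"},
}


def invert(object_to_models):
    """Build the reverse view: for each model, the set of objects based on it."""
    model_to_objects = defaultdict(set)
    for obj, models in object_to_models.items():
        for model in models:
            model_to_objects[model].add(obj)
    return dict(model_to_objects)


MODEL_TO_OBJECTS = invert(OBJECT_TO_MODELS)

# One model can be the basis for several objects ...
print(sorted(MODEL_TO_OBJECTS["vector model"]))
# ... and one object can be based on more than one model.
print(sorted(OBJECT_TO_MODELS["document retrieval"]))
```

Keeping both directions makes the "what" view (objects) and the "how" view (models) equally easy to query, which is the property the two superimposed taxonomies rely on.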
2. Vertical Taxonomy

Modeling the process of information retrieval is complex, because many parts are, by their

gies under the reasoning component. Representation and Reasoning can be used to characterize an information retrieval model. For example, in [52] an information retrieval model