Automatic Indexing of News Articles

by

Yunus J Mansuri

18 credit project report

for the degree of Master of Information Science

The School of Computer Science and Engineering

The University of New South Wales

August, 1996

This thesis is dedicated to

God, the most beneficent and merciful

Acknowledgements

During my course of study many people have helped me. However, I would especially like to thank my supervisor John Shepherd for his patience and perseverance with my thesis. His assistance, suggestions, and constructive criticism in the development of this thesis are worthy of special praise.

My brother Yusuf Mansuri, my sister Raisa, and baby Hannan deserve special thanks; they gave me their endless love and support, without which this thesis could not have appeared.

I also thank my friends Sadik, Hassan and Banchong for their help and the time they shared with me, and I thank them for their friendship.

Finally, I must thank my parents above all. Their contribution to my life makes everything else pale into insignificance.

Abstract

Information has become an essential currency in the "Information Age".

With the growth of network technology and connectivity, the desire to share ideas via Usenet has grown exponentially, and huge amounts of data flow through Usenet daily. Our overall aim is to minimise the effort required by the reader to handle the large volume of news passing through Usenet. Achieving this involves two major tasks: automatic indexing, which dives into the ocean of data, and retrieval of relevant articles, which fetches a glass of information to satisfy the thirst of the user, based on a user profile.

Contents

1 Introduction 6

1.1 Filtering . 7

1.1.1 Information filtering system 9

1.2 Objectives of research ...... 10

1.3 Area of research ...... 11

1.4 Organisation ...... 11

2 Literature Review 13

2.1 Information Retrieval ...... 13

2.2 Components of IR system ...... 14

2.2.1 Selection of information ...... 14

2.2.2 Text analysis and representation ...... 19

2.2.3 Searching strategy ...... 43

2.3 Summary ...... 46

3 Literature Review - Available Systems 47

3.1 History of newsreaders ...... 47

3.1.1 Popular screen-oriented news reading interfaces 50

3.2 Related work 53

3.2.1 SMART 53

3.2.2 SIFT-Stanford Information Filtering Tool 56

3.2.3 Tapestry . 57

3.2.4 URN ... 58

3.2.5 INFOSCOPE 59

3.2.6 Deja News .. 60

4 IAN - Intelligent Assistant for News reading 62

4.1 Preview of IAN 62

4.2 Introduction .. 64

4.2.1 How does IAN work 64

4.3 IAN system ..... 66

4.3.1 Fetch articles 66

4.3.2 Automatic indexing of news articles . 68

4.3.3 Retrieval of relevant news articles based on user profile 80

5 Experiments 87

5.1 Evaluation methods . 87

5.2 Evaluation procedure 89

5.2.1 System performance method . 90

5.2.2 Comparison method 91

5.2.3 Discussion ...... 92

6 Conclusions 97

6.1 Review of research objectives ...... 97

6.2 Conclusion . 98

6.3 Future work . 100

A Outline of the program 102

A Filtering Statistics 114

List of Figures

2.1 Change in Document space after assignment of good discriminator ...... 29

4.1 Modules and flow of data . 63

4.2 Major Functions of our part of IAN 66

4.3 Different modules and flow of data 67

4.4 Modules involved in text analysis 70

4.5 Keyword structure during search ...... 83

4.6 Query tree: representing A .and. B ...... 84

4.7 Query tree: representing A .and. B .or. C .and. D ...... 85

A.1 Rate of posting of articles . . . . 116

A.2 Increase in unique words per hour . 117

List of Tables

2.1 Term weighting formulae depending on within-document frequency ...... 36

2.2 Term-weighting formulae depending on term importance within an entire collection ...... 37

2.3 Term weighting formulae depending on Document frequency 37

5.1 Results from system performance method ...... 91

5.2 Results from comparison method - An average of 3 days 93

5.3 Result: Evaluation of performance ...... 93

A.1 Statistical information: Information of data in MB ...... 115

A.2 Statistical information: Information of data in terms of words 115

A.3 Results from comparison method - Day 1 . 118

A.4 Result: Evaluation of performance for Day 1 . 118

A.5 Results from comparison method - Day 2 . 119

A.6 Result: Evaluation of performance for Day 2 . 119

A.7 Results from comparison method - Day 3 . 120

A.8 Result: Evaluation of performance for Day 3 . 120

Chapter 1

Introduction

Information has become an essential currency of this "Information Age". With the advancement of network technology, the information resources of the world can be accessed from a desktop, and with the growth of this connectivity has also grown a desire to share ideas and information. For that purpose a system already exists which enables millions of people around the world to send and receive information: the Internet. The Internet supports many styles of communication; one of them is Usenet News, a global bulletin board system. Usenet is a collaborative system with no barriers to access and no requirement of computer literacy beyond basic word processing skill, and it works in the most democratic way, without any restriction on the content or dissemination of information. Anyone can make a posting about any topic, anyone can read what anyone else has to say about a topic, and anyone can share his/her own view. With the explosive growth of the system in terms of the number of hosts and users connected, the information flow has increased manyfold. Every day around 30000 messages (90 MB) of text on a wide range of topics arrive at each site on Usenet.1

Given this amount and diversity of information, the question arises of how one can actually make use of it by getting information on his/her topics of interest, and on those topics only. Users generally have a small number of specific interests, but most of the material found on Usenet is irrelevant to these interests and often of low quality. One solution to this information overload problem is controlling the information, either by charging for posting or by having editors filter out low quality information. At the moment it is not practical to implement either of these alternatives. On the other hand it is not clear that such restrictions are desirable; if these sorts of restrictions had been applied from the outset, they would have hampered the growth of Usenet. The only feasible solution, then, is for the user to filter the relevant information out of the incoming stream.

1.1 Filtering

Filtering is not a new concept. We use information filtering in our day to day life. For instance, when we go to look for books of interest in the library, we do not start reading all the books to find relevant topics but apply filtering, whereby we use a catalogue to limit our search to books containing specific topics of interest. Once we have found a potentially useful book, we first look at the index to determine whether it covers any interesting material. The same ideas are applied in filtering.

1 A typical example of Usenet traffic volume: in 2 weeks, 990258 articles totalling 2512.5 MB were submitted from 53566 Usenet sites by 196093 different users to 11099 different newsgroups, for an average of 180 MB per day [Com95].

In Usenet news, the first layer of filtering is provided by newsgroups.

Here the generated messages are partitioned over the newsgroups, where each newsgroup carries articles on specific topics. Newsgroup-based filtering is achieved by simply subscribing only to relevant newsgroups (i.e. only subscribing to a small set of newsgroups that carry articles relevant to one's interests). This sort of filtering is not entirely satisfactory, as quite often newsgroups cross boundaries and one is not certain that the un-subscribed newsgroups contain no relevant articles of interest. The task of selecting relevant newsgroups is also a problem; looking at all 3000 newsgroups and what each contains and then selecting is a non-trivial task. Large numbers of articles (hundreds per day) are posted in the most active newsgroups, and many of these are not of interest to all users who read that newsgroup.

Currently the burden of selecting relevant articles lies with the user and is achieved by looking at the subject line of each article. This functionality (displaying the subject line) is provided by almost all news reading programs (e.g. "rn", "trn", "nn", etc). However, simply looking at the subject heading is not a viable way to find relevant articles; sometimes articles do not have any subject but, more frequently, they have a subject heading that is not directly related to the content of the article. Some news reading programs (e.g. "nn", "GNUS") support the idea of a "kill-file" which, depending upon the criteria supplied by the user, removes articles from consideration before their subject lines are displayed (e.g. kill all articles posted from site "x" or posted by "author y"). This filtering mechanism puts the burden of filtering onto the user, who has to choose between subscribing to only a small number of newsgroups and potentially missing out on interesting items, or subscribing to more newsgroups and manually filtering out a large number of uninteresting articles. Clearly more automatic assistance is required in the information filtering task.

1.1.1 Information filtering system

An information filtering system is an information system designed for unstructured or semistructured data [BC92]. The only definite structure that Usenet articles have is a "header" and a "body", where the header conveys general information (e.g. source, distribution, kind of article) about the article while the body contains the actual message (i.e. the content of the article). Another characteristic of an information filtering system is that it is able to process a large volume of data. This suits our application (news filtering), where we receive around 90 MB of data each day on the local server.

The task of an information filtering system is to filter information based on a description of individual or group information preferences, called a "user profile", by removing irrelevant data from an incoming stream of data.

The first stage of filtering involves identifying useless (non-informative) words in the data. This can be achieved by analysing the data for linguistic cues and separating out the informative words. It is often done by using a stopword list, which contains a list of common non-informative words; these words are then removed from the data. The remaining words are then put into canonical form by removing suffixes and performing other transformations which help to isolate the root words. The next step involves determining a weight for each of the remaining words, to represent its importance.

These keywords can then be compared with the user profile to determine the relevance of the article.
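As an illustration, a minimal sketch of this front-end pipeline in Python (the stopword list, suffix rules, and tf*idf weighting shown here are simplified placeholders for the techniques reviewed in chapter 2, not IAN's exact components):

    import math
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that"}  # tiny placeholder list

    def stem(word):
        # crude suffix removal; a real system would use a full stemming algorithm
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def index_article(body, n_docs, doc_freq):
        # return {term: weight} for one article, given the collection size
        # n_docs (N) and doc_freq, a map from term to document frequency (n_i)
        words = re.findall(r"[a-z]+", body.lower())
        terms = [stem(w) for w in words if w not in STOPWORDS]
        tf = Counter(terms)
        return {t: f * math.log(n_docs / max(doc_freq.get(t, 1), 1))
                for t, f in tf.items()}

The weighted terms produced by such a pipeline are what the matcher later compares against the user profile.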

Since a user's interests change over time, it is also desirable that a news filter can adapt to these changes. A news filtering system can be separated into a number of components: a front-end to parse articles, extract information content and select hopefully relevant articles, and a back-end to monitor the user and adapt. In this research project we consider only the front-end.

1.2 Objectives of research

The objectives of this research include:

• Reviewing the current text analysis techniques and searching strategies that are used in Information Retrieval.

• Reviewing existing systems which provide some sort of filtering mechanism when dealing with Usenet news articles.

• Checking different available ranking algorithms and selecting one which suits our application.

• Implementing an experimental prototype news filtering system.

• Evaluating the system's performance and comparing it to existing systems.

1.3 Area of research

In this research project an experimental prototype news filtering system is implemented which fetches articles from a news network, analyses these articles, generates weighted index terms for each article, and compares these terms with a user profile. Along the way, different text analysis approaches are reviewed.

Experiments are carried out to evaluate the performance of our news filtering system. A comparison is made between the results of using single words as index terms and using phrases as index terms. Finally, a comparison is made with SIFT (the Stanford Information Filtering Tool) on the usefulness of using phrases as index terms.

1.4 Organisation

The organisation of the report is as follows.

In chapter 2, the different approaches to text analysis and searching in Information Retrieval (IR) are reviewed.

In chapter 3, the existing systems and different experimental prototypes which work in the Usenet environment are reviewed.

In chapter 4, the implementation of an experimental prototype news filtering system, called Intelligent Assistant for News reading (IAN), is described.

In chapter 5, the details of the experiments are explained and an analysis of the experimental results is presented.

Chapter 6 gives the conclusions and further research directions.

Chapter 2

Literature Review

2.1 Information Retrieval

The phrase Information Retrieval covers a very wide scope of activities. It is generally taken to mean the retrieval of references to documents in response to a request for information [Rob90]. IR is synonymous with the representation, storage, organisation and accessing of information. Strictly speaking there is no restriction on the type of items1 it can handle. In the case of text information retrieval, the items are analysed to determine their information content and to find the role of each term2 in providing information about the content that is needed to satisfy the user's queries. An information retrieval system can be said to be a set of rules and procedures, whether performed by humans or machines, for operations like indexing, search formulation, searching, feedback etc. In earlier days, items processed by an IR system included books, articles, abstracts, bibliographies, etc. With the evolution of computerisation, the scope as well as the role of IR has expanded and it is now needed in every walk of life. The amount of available information and the need for information retrieval are mutually interdependent: as information increases, the need to access it also increases, and thus new ways of doing IR evolve.

1 By "items" we mean various kinds of data.
2 The word "term" is a generic word used to represent a content identifier, also called a "keyword".

2.2 Components of IR system

An IR system comprises three major components:

1. Selection of information

2. Text analysis and representation

3. Searching strategy

2.2.1 Selection of information

In any IR system the selection of the domain plays the most important role. In order to provide a service, the system must contain information resources that may be of potential use to its users, depending on the purpose of the system. For example, an IR system designed for a library should contain information on all the relevant books, journals, reports etc. Similarly, an IR system providing assistance for Usenet must contain all the information available on Usenet for the client (user). The domain of our data selection is the day to day news posted on Usenet, so let us examine Usenet in more detail.

Usenet

Introduction - Usenet

Usenet is a collection of computers which allow users to exchange public messages on many different topics. It is a world-wide distributed conferencing and discussion system available at low cost to universities, schools, libraries and home users. Because of easy3 and unrestricted access, the amount of information has grown rapidly. There is much useful information on Usenet, and new areas of interest are continually being added. The sheer volume of new information arriving every day prevents even a cursory skimming to look for new articles. Searching for interesting or useful information has become extremely time consuming and is frequently futile. The problem of effective information access and retrieval is not a result of disorganisation, but rather due to the fact that Usenet receives tens of thousands of new articles, divided among thousands of topics (called newsgroups), generated and distributed to thousands of sites serving millions of users. This is the real cause of the problem: information overload. To overcome this problem many systems have appeared in the market, and we will examine different approaches that tackle it. Prior to looking at these systems, we examine Usenet news in some more detail.

3 Easily available in developed countries.

Background of Usenet

Usenet (USEr's NETwork) consists of networks of computers that exchange "netnews". Any user on a Usenet node can post and receive articles to and from Usenet by simply typing in some text and submitting it to a program on a local machine. This local computer then forwards the article to nearby Usenet nodes (the posting can be restricted as well), which in turn forward it to other nodes. In this manner news is propagated around the world.

Usenet started in 1979 with only a few nodes, but since then it has been growing incredibly. Estimates show that there are 76,000 Usenet sites [Rei93] with millions of users4. The estimated number of new articles that arrive every day (on each site) is around 30,000 to 35,000; the total amount of data is around 90 MB5.

Structure of Usenet

Articles are categorised into around 11,000 newsgroups [Com95]. The subject areas are diverse, ranging from sports to sex and religion to rec.pet.dog.

Most of the articles are plain ASCII text, but some also contain encoded versions of binary files, pictures, sound, application software etc.

An article consists of a header and body. The header contains information about the source, distribution and kind of article. The body contains actual

4 3 million according to Hauben [Hau93] and 4.2 million in Nov 93 as estimated by DEC Network Systems Laboratory.
5 Based on our survey, conducted in March 1995, of the articles arriving on our local NNTP server: nntp.unsw.edu.au.

information content. A sample of a header is shown below:

Path: unsw.edu.au!metro!OzEmail!carisma

From: carisma@ozemail.com.au (marcus scholz)

Newsgroups: alt.astrology

Subject: Re: Virgo & Gemini

Date: 8 Mar 1995 12:52:03 GMT

Organization: OzEmail Pty Ltd - Australia

Lines: 20

Distribution: world

Message-ID: <3jk99$j7mp@oznet03.ozemail.com.au>

References: <2088351.ensmtp©newc.com>

NNTP-Posting-Host: shell01.ozemail.com.au

X-Newsreader: TIN [version 1.2 PL2]

Some of the fields vary depending on the local configuration and the newsreader used for posting. The importance of each field depends on the type of approach we take for analysing and retrieving the articles. For example, the field "Newsgroups:" is important if the user is using a normal newsreader like rn, trn, nn, etc, so that when a user gives a newsgroup name it can directly show only that newsgroup. The field "References:" is important if we use a threaded newsreader. In our case the "Message-ID:" and "Subject:" fields are important, because the Message-ID is the unique Usenet identifying string for the article, which can be used to fetch that article. The

Subject field gives some idea of the topic of an article.
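As a sketch of how such a header can be split into fields (a simplified illustration rather than IAN's actual parser; real headers can also fold long values across continuation lines, which this ignores):

    def parse_header(article_text):
        # split a Usenet article into a header dict and a body string;
        # the header ends at the first blank line
        header_part, _, body = article_text.partition("\n\n")
        header = {}
        for line in header_part.splitlines():
            if ":" in line:
                field, _, value = line.partition(":")
                header[field.strip()] = value.strip()
        return header, body

Given the sample above, header["Message-ID"] identifies the article uniquely and header["Subject"] hints at its topic.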

The body contains the content of the article in a semi-structured manner.

The word "semi-structured" is used because while articles are generally too short6 to make use of sectioning or more formal document structures, they do make use of various informal structuring devices.

As one of the uses of Usenet is for discussion, the body of an article often contains quotes from other articles. These quotes are often called comments and give an idea of the whole discussion. Optionally, a footer is found at the end of the article; this does not contain any information related to the topic of the article, only the author's name and other personal details.

Semantic Structure of Usenet

Usenet is used for many different purposes. The most common are

1. Question and Answer.

2. Discussion.

3. Dissemination of information.

Newsgroups contain articles. These articles contain questions, answers, information, etc. As new people subscribe to newsgroups, they tend to ask a set of common questions about the area of interest of the newsgroup. In order to alleviate the traffic caused by this, many newsgroups have collected together a list of frequently asked questions (also known as an FAQ) along

6 Longer articles often do have structure. The reason most articles don't is that they are too short (e.g. on average 1500 bytes, of which 500 bytes is header).

with answers to these questions. Such a list helps in reducing the posting of general question-answers and ultimately reduces the volume of data.

Another use of Usenet is for discussion. Here initially a user posts a point or a topic, other interested users reply and counter reply, thereby creating a

"thread" of discussion. Some newsreaders like trn, GNUS, etc use threads as a basis for displaying and selecting articles via subject lines. All articles from a thread are grouped together. Often these discussions last for a long time and at times lose track of the initial topic. For this reason we do not consider this to be a viable option in determining the similarity of articles in our approach.

A further use of Usenet is the dissemination of current events or 'hot topics'. With the vast diversity of newsgroups, information on regional as well as global events can be found on Usenet. At times, new newsgroups are created specifically for discussion of a current major event (e.g. the Gulf war). Such newsgroups tend to be relatively short-lived, being removed after several months.

2.2.2 Text analysis and representation

A document is composed of a stream of words in natural language. The words can be informative, common, grammatical etc. To extract meaningful words, we need to analyse the text. The text analysis process aims to identify important words (content carrying words, also called "terms") which can be used to represent the document. The assignment of terms to a document is designed to achieve three related purposes [Kee77]:

1. To allow easy identification of documents with topics of interest (in our case via a user-profile).

2. To identify different documents dealing with similar topics.

3. To predict the relevance of individual documents to a specific information requirement through the use of index terms.7

To achieve these objectives, many different methods have been proposed.

These methods depend on the particular indexing environment. The two basic choices are:

1. Automatic or manual indexing.

2. Use of controlled or uncontrolled vocabulary.

From an analysis point of view, the four combinations represent four different approaches to extracting features from a raw sequence of characters in text. Historically, the analysis operations were carried out manually. In most situations a controlled vocabulary was used, in which a single standard term or phrase represents a wide variety of related terms and descriptions. Usually these terms were binary8 or weighted with grades of subject importance such as major, minor, absent etc. Sometimes an uncontrolled vocabulary was also used, but most often a controlled vocabulary was used.

7 Index term is also referred to as "term", "keyword", or "content identifier".
8 Whether the term is relevant or not.

Automatic indexing using terms drawn from the full text means that some automated feature extraction procedure is used. This typically involves identifying words or phrases, where each corresponds to a set of all distinct clues encountered, under some definition of distinctness (i.e. by using some set of rules, words and phrases are identified). There has been much controversy about the best approach. The recommendations range from simple frequency counts to complicated rule-based approaches. The main debate is about whether automatic indexing is as good as or better than manual indexing. "It has been claimed that automatic, free language products are necessarily inferior to a manual system because automatic systems only use words which appear in the text and that these products will fail to pass any relational test carried out by independent human observers" [SM83]. In answer to this, Salton mentioned that manual indexing is also influenced by the terminology used in the individual text. We can also say that the expert uses his own terminology to refer to a term, which again may differ (i.e. different experts choose different terms to represent the same document, as each may find different things to be important in the same document). However it can be said that adding complexity to either of the approaches does not really affect the actual outcome. This has been clearly demonstrated by the results obtained by Cleverdon [CMK66], Aitchison [vR75] and Keen [vR75]. From their conclusions it also appears that systems using uncontrolled vocabularies are better than systems using controlled vocabularies [vR75].

Automatic Indexing

The indexing task consists of assigning to each stored document a list of keywords which act as content identifiers for the document. Additionally, a weight may be assigned to each keyword reflecting its presumed importance for the purpose of content identification. Two standard methodologies have developed over the years for tagging or identifying the appropriate words to be used as content identifiers: the statistical approach and the linguistic approach.

Statistical Analysis

This is the original methodology used in automatic indexing and it has evolved significantly over the last four decades. This methodology starts with the observation that the frequency of occurrence of individual words in natural language text has something to do with the importance of these words for the purpose of content representation. H. P. Luhn in 1958 [Luh58] mentioned that "The justification of measuring word significance by use-frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject". Actually the whole idea of frequency theory is based on Zipf's distribution law [Zip49], which relates the frequency of occurrence of terms in a document to their capacity to carry information. His work has been used in many experiments with automatic indexing, where the words in the documents are counted and repetitions of terms are calculated using a variety of constraints. Formulae of various kinds have been developed in an attempt to find meaningful relationships between term frequency counts within a document or a whole collection. Salton, Sparck Jones, Bookstein, Blair, Van Rijsbergen and many others have explored this field and provide models for analysing the text on a statistical basis, as explained below.

The vector space model

The vector space model represents one approach to quantifying information retrieval operations. Here both the stored documents and search request terms are represented by a vector of terms (term vectors). A vector similarity coefficient is computed for the search request and each document in the collection. The coefficient values are then used to rank the documents according to their relevance to the query. A term weighting system assigns larger weights to terms that occur frequently in a particular document but rarely overall.

The original similarity coefficient was the "cosine measure" (so called because it measures the "angle" between the term vectors, treated as vectors in an n-dimensional term space). The documents were ranked on the basis of this similarity:

SIM(D_j, Q_k) = Σ_i (td_ij * tq_ik) / ( sqrt(Σ_i td_ij^2) * sqrt(Σ_i tq_ik^2) )

where
td_ij = the ith term in the vector for document j
tq_ik = the ith term in the vector for query k
n = the number of unique terms in the data set (the sums run over i = 1..n)

This model has been used in many ranking retrieval experiments, in particular the SMART system experiments under Salton et al. [Sal71, SY73]. These experiments initially tested an overlap similarity function against the cosine correlation measure and tried simple term-weighting using the frequency of terms within the documents. Here they considered term importance with respect to individual documents only.

In [SY73] Salton et al. described further experimental results, found that term frequency alone cannot ensure acceptable retrieval performance, and changed the term weighting scheme. They incorporated the Inverse Document Frequency measure IDF = log(N/n_i) proposed by Sparck Jones [SJ72], where N is the total number of documents in the collection and n_i is the total number of documents in which term i is found. The IDF factor varies inversely with the number of documents n_i to which a term is assigned in a collection of N documents. The new term-weighting scheme relied on term importance within an entire collection rather than within individual documents. They demonstrated empirically that this new approach enhanced retrieval performance.

Salton and Buckley [SB88] devised a new term weighting formula which incorporated three factors which they thought important for enhancing recall (the proportion of relevant documents that are retrieved) as well as precision (the proportion of retrieved documents that are relevant):

1. Term Frequency

2. Inverse Document Frequency

3. Normalisation Factor.

They considered term frequency to be a recall enhancing device. But if a term occurs in many documents then it affects the overall retrieval; for that they incorporated the Inverse Document Frequency (IDF) factor, where IDF varies inversely with the number of documents n to which a term is assigned in a collection of N documents. The third consideration was to tackle the problem of non-uniformity in vector length, as the length of each document is not the same (for example, if one document is 50 KB and another is 1 KB, then there is more chance for a term to occur more than once in the 50 KB document than in the 1 KB one). For that they incorporated a normalisation factor which normalises the weights, taking into account the different vector lengths; cosine normalisation was used, where each term weight is divided by a factor representing the Euclidean vector length.

Using different combinations of the term frequency, inverse document frequency, and length normalisation components, they derived different term-weighting formulae, both for calculating weights for documents and for queries. They conducted a series of experiments on six document collections of varying size and covering different subject areas. Their experimental results showed that a term-weighting method which takes all three factors into consideration produces the best results, especially when dealing with large documents and large document collections. The following combination of formulae was ranked the best:

w_ij = ( tf_ij * log(N/n_i) ) / sqrt( Σ_vector (tf_ij * log(N/n_i))^2 )

where

w_ij = weight of term i in document j
tf_ij = term frequency of term i in document j
N = total number of documents in the collection
n_i = total number of documents in which term i is found

This formula was used to calculate the term weight for each term in each document, while the following formula was used to calculate the weight of each query term:

w_ik = (0.5 + 0.5 * tf_ik / maxtf_k) * log(N/n_i)

where

w_ik = weight of term i in query k
tf_ik = term frequency of term i in query k
maxtf_k = maximum frequency of any term in query k

While calculating the weight of a query term, they used the tf factor normalised by the maximum tf in the vector. They added 0.5 so that the value lies between 0.5 and 1.

The similarity between a document and the query was calculated using the well-known cosine vector similarity formula. This vector matching system performs a global comparison between query and document vectors, and provides ranked retrieval output in decreasing order of the computed similarity between query Q_k and document D_j:

SIM(D_j, Q_k) = Σ_i (w_ij * w_ik) / sqrt( Σ_i (w_ij)^2 * Σ_i (w_ik)^2 )
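To make the weighting concrete, here is a small Python sketch of the Salton-Buckley scheme above (document weights use tf * idf with cosine length normalisation, query weights use the 0.5-augmented tf; the function names are illustrative):

    import math

    def doc_weights(tf, N, df):
        # tf: {term: frequency in this document}; df: {term: document frequency}
        raw = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # Euclidean length
        return {t: w / norm for t, w in raw.items()}

    def query_weights(tf, N, df):
        max_tf = max(tf.values())
        return {t: (0.5 + 0.5 * f / max_tf) * math.log(N / df[t])
                for t, f in tf.items()}

    def cosine(d, q):
        # cosine similarity between two sparse term-weight vectors
        num = sum(w * q[t] for t, w in d.items() if t in q)
        den = math.sqrt(sum(w * w for w in d.values())) * \
              math.sqrt(sum(w * w for w in q.values()))
        return num / den if den else 0.0

Articles can then be ranked for a query by sorting on cosine(doc_weights(...), query_weights(...)) in decreasing order.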

In [SB91] Salton outlined another approach for the retrieval of natural language text in response to a search request. The main aim was to come up with an approach for dealing with large text collections, since in a large collection the overall occurrence of a term is much higher than in a small collection. Based on his initial approach, he proposed another formula intended to assign better weights to terms:

w_ij = ( (0.5 + 0.5 * tf_ij / maxtf_i) * log(N/n_i) ) / sqrt( Σ_{i=1..t} (0.5 + 0.5 * tf_ij / maxtf_i)^2 * (log(N/n_i))^2 )

where

w_ij = weight of term i in document j
tf_ij = term frequency of term i in document j
N = total number of documents in the collection
n_i = total number of documents in which term i is found
maxtf_i = maximum frequency of term i in the collection

Salton's experimental results showed the effectiveness of the above formula when used for calculating term weights in unrestricted text environments on arbitrary subject matter.

27 The Term discrimination model

Salton, Yang and Yu [SYY75] proposed an automatic indexing model based on the idea of term discrimination, where a given index term is rated in accordance with its usefulness as a discriminator among the documents of a collection. Term discrimination offers a reasonable physical interpretation for the indexing process [SYY74]. This model leads to a distinction among possible index terms in accordance with their ability to spread out the document space when assigned to the documents in the collection. The assessment of a good or poor discriminator depends on the ability of the discriminator to isolate the documents in which it is found. A discrimination value (DV) is computed for each potential index term as the difference in space densities before and after assignment of that term. The greater the difference in space density, the better that term will function as a discriminator [SWY76].

The studies show that the discrimination value of an individual term depends on the frequency of that term (keyword) in the document and the number of documents in which it is found. Terms with excessively high document frequency (i.e. terms that occur in most of the documents) are the worst discriminators. Terms with medium document frequency, that is, document frequencies between n/100 and n/10 where n is the number of documents, comprise the vast majority of good discriminators. Terms with excessively high frequency cannot be used directly, as they produce unacceptable precision losses. To overcome this, such terms are transformed into low frequency units by using them as components of appropriate indexing phrases. Terms with low document frequency produce unacceptable recall performance and so were transformed into higher frequency units by including them in a class of related terms using a thesaurus [SWY76]. Yu and Salton [YS77] proved that certain transformations suggested by the term discrimination model will improve retrieval effectiveness. However extensive assumptions are necessary for the proof; in particular, the assumptions that query terms with higher document frequency occur in proportionally more non-relevant documents and that it is somehow known which terms to combine into thesaurus classes and indexing phrases.

x ------ 0   distance between original document x and centroid 0
x --------- 0   distance following assignment of good discriminator

Figure 2.1: Change in Document space after assignment of good discriminator

This model suggests that low document frequency terms should be transformed into high frequency units by including them in classes of related terms using a thesaurus, but it does not give a clear picture as to how this would be done (i.e. no selection criteria are mentioned) [Lew92]. "Overall this model has been criticised because it does not exhibit well substantiated theoretical properties" [SB88].
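Operationally, the discrimination value is simple to compute. A small Python sketch (the density measure used here, average document-to-centroid cosine similarity, is one common choice; cosine() is the helper defined in the earlier sketch):

    def density(docs):
        # space density: average similarity of the documents to their centroid
        centroid = {}
        for d in docs:
            for t, w in d.items():
                centroid[t] = centroid.get(t, 0.0) + w / len(docs)
        return sum(cosine(d, centroid) for d in docs) / len(docs)

    def discrimination_value(docs, term):
        # DV = density with the term removed minus density with it present;
        # a positive DV means the term spreads the documents apart,
        # i.e. it is a good discriminator
        without = [{t: w for t, w in d.items() if t != term} for d in docs]
        return density(without) - density(docs)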

Probabilistic Model

The initial probabilistic model was proposed by Maron and Kuhn [MK60].

Later on, Salton [Sal75] and Sparck Jones [SJ72, SJ73] used probabilistic techniques for improving retrieval performance. Barkla [Bar69] and Miller [Mil71] used a probabilistic technique for calculating a weight function:

w = log( (r/R) / (n/N) )

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t

Barkla [Bar69] used the above formula in an SDI9 server for obtaining relevance information via feedback. Miller used it in devising a probabilistic search strategy for Medlars10. It was also used by Sparck Jones [SJ75] in devising an optimal performance yardstick for test collections. Yu and Salton [YS76]

9 SDI (Selective Dissemination of Information) is a system designed to make available network-based information resources.
10 Medlars (MEDical Literature Analysis and Retrieval System) is the information retrieval system developed by the U.S. National Library of Medicine. It contains over 30 databases that provide references to worldwide publications in medicine.

derived another formula:

w = log( (r / (R - r)) / ((n - r) / (N - n - R + r)) )

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t

They used this for modifying the output of a simple co-ordination level matching scheme. From there, they went on to prove that this modification of co-ordination level matching can be expected to improve performance.

Robertson and Sparck Jones [RSJ76] proposed a probabilistic model based on the notion that terms appearing in previously retrieved relevant documents for a given query should have higher weights. They presented a table showing the document distribution of a term t and gave four formulae for calculating weights:

                         Document relevance
                         +                -                 Total
Document indexing   +    r                n - r             n
                    -    R - r            N - n - R + r     N - n
Total                    R                N - R             N

w = log( (r/R) / (n/N) )                                 (2.1)

w = log( (r/R) / ((n - r)/(N - R)) )                     (2.2)

w = log( (r/(R - r)) / (n/(N - n)) )                     (2.3)

w = log( (r/(R - r)) / ((n - r)/(N - n - R + r)) )       (2.4)

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t.

Formula 2.1 represents the ratio of the proportion of relevant documents in which t occurs to the proportion of the entire collection in which it occurs.

Formula 2.2 represents the ratio of the proportion of relevant documents to that of non-relevant documents.

Formula 2.3 represents the ratio between the "relevance odds" for the term (i.e. the ratio between the number of relevant documents in which it does occur and the number in which it does not occur) and the "collection odds" for t.

Formula 2.4 represents the ratio between the term's relevance odds and its "non-relevance odds".

Robertson and Sparck Jones [RSJ76] used formula 2.4 for a series of experiments with the manually indexed Cranfield collection. The experiments were done using all relevance judgements to weight the terms, to see what the optimal performance would be. Using different sections of the collection they showed that formula 2.4 was better. Many experiments were carried out by Sparck Jones using formula 2.4 [SJ79] under more realistic conditions, and these demonstrated the same good retrieval performance.
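Formula 2.4 translates directly into code. A hedged sketch (the 0.5 added to each cell is the usual correction for zero counts; it is an assumption here, as the text above does not discuss it):

    import math

    def rsj_weight(N, R, n, r, c=0.5):
        # relevance weight of a term, formula 2.4
        # N: documents in collection, R: relevant documents for the query,
        # n: documents containing the term, r: relevant documents containing it;
        # c keeps the logarithm defined when any cell of the table is zero
        relevance_odds = (r + c) / (R - r + c)
        non_relevance_odds = (n - r + c) / (N - n - R + r + c)
        return math.log(relevance_odds / non_relevance_odds)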

Croft and Harper [CH79] conducted experiments in an environment where no relevance information was available. A probabilistic model of document retrieval was applied to the two searches which occur before relevance feedback, i.e. the initial search and the intermediate search. They used the Cranfield collection of 1400 documents. They modified formula 2.4 under the assumption that all query terms have an equal probability of occurring in relevant documents, and derived another formula which combines a new weight, based on the number of matching terms, with a term weighting similar to the IDF measure. The combination match was:

SIM(Q, D_j) = Σ_i r_i * x_i * ( C + log(N/n_i) )

where

C = a constant for tuning the similarity function
x_i = document description for term i
r_i = binary query description for term i
N = the number of documents in the data set
n_i = the number of documents having term i in the data set

With this combination match model they went on to prove that it performed better than the simple match, the match using IDF weights, and matching using the cosine correlation. Croft [Cro83] made a further modification of his proposed combination match formula by incorporating within-document frequency weights, tuned using another constant K:

SIM(Q_k, D_j) = Σ_{i=1..Q} ( C + log(N/n_i) ) * ( K + (1 - K) * tf_ij / maxtf_j )

where

K = a constant for adjusting the relative importance of the two weighting schemes
tf_ij = the frequency of term i in document j
maxtf_j = the maximum frequency of any term in document j
Q = the number of matching terms between document j and query k

The results showed significant improvement over both the IDF weighting alone and the earlier approach to combination weighting.

Model Based on Decision Theory

Bookstein and Swanson [BS75] proposed a new method for indexing based on decision theory. They developed a utility estimation model based on statistical assumptions about the distribution of terms. They assumed that a collection of documents is divided into two or more groups, each of which is about the concept specified by the indexing term to a differing degree [Lew92].

The terms are assumed to have values equal to their "within-document" frequency. An approach to assigning terms to documents in binary fashion had been developed for a special case of the above model where each term was characterised by exactly two Poisson distributions [Har75]. In criticising this approach, Losee [Los88] mentioned that computing term values in this way was inferior to a simple binary model of term occurrence.

Another model based on decision theory was proposed by Maron [Mar79], where each term was evaluated according to the probability that users mentioning that term in their request will want a particular document. However, in explaining this model he did not propose any method for estimating the relevant probabilities. The model also mentions a threshold point that can be varied depending on the cost of retrieving a non-relevant document and of missing relevant documents. A series of indexing methods based on decision theory was presented by Cooper and Maron [CM78]. However, these models have had considerably more success from a theoretical point of view than in terms of text classification effectiveness [Lew92].

Apart from the models described above, researchers have also proposed models based on fuzzy set theory [Boo85], but these have been very hard to implement in practice [FBY92]. Srinivasan [Sri89] and Das-Gupta [DG88] applied rough set theory. The model proposed by Das-Gupta incorporated boolean logic for retrieval, a term weighting approach, and ranking of documents. It also considered the use of relevance feedback, and he did some experiments comparing his rough set model to boolean and fuzzy models. This theory has not been developed to the stage where experiments can be performed [FBY92].

FORMULA                         REFERENCE

f_ij                            Absolute frequency [Luh58]

log f_ij                        Sparck Jones, K [SJ73]

f_ij / f_j                      Weinberg [Wei81]

f_ij / log f_j                  Noreault et al. [NMK81]

f_ij / (f_j / 1000) = a         Artandi [Art69]
  if a <= 1, then w_ij = 1
  if 1 < a < 3, then w_ij = 2
  if a >= 3, then w_ij = 3

Table 2.1: Term weighting formulae depending on within-document frequency

Other algorithms have also been proposed and implemented; a selection of them is listed in tables 2.1, 2.2 and 2.3, where

f_ij = the frequency of term i in document j
f_j = the total number of terms in document j
N = the number of documents in the collection
f_i = the frequency of term i in the collection
CL = the total number of terms in the collection
n = the number of documents in which term i occurs

Jung Soon Ro [sR88] performed some experiments on the full text of a journal document collection. In all, 29 algorithms (some of which are shown in tables

FORMULA                              REFERENCE

I- :-r                               Caroll and Roeloffs [CR69]

f_ij / f_i                           Sparck Jones, K [SJ73]

(f_ij / f_j) / (f_ij / f_i)          Sparck Jones, K [SJ73]

f_ij * log(CL / f_i)                 Noreault et al. [NMK81]

f_ij / log f_i                       Noreault et al. [NMK81]

f_ij / log(f_j / f_i)                Noreault et al. [NMK81]

Table 2.2: Term-weighting formulae depending on term importance within an entire collection

FORMULA                              REFERENCE

1/n                                  Noreault et al. [NMK81] and Sager et al. [SL76]

log(N/n)                             Williams and Perriens [WP68]

log(N/n + 1)                         Williams and Perriens [WP68]

f_ij * (f_i / n)                     Noreault et al. [NMK81] and Sager et al. [SL76]

f_ij * n / (f_ij - f_i)              Noreault et al. [NMK81] and Sager et al. [SL76]

(f_i / n)                            Salton et al. [SYY75]

f_ij * (log N - log n + 1)           Salton et al. [SY73]

Table 2.3: Term weighting formulae depending on document frequency

2.1, 2.2, 2.3) were tested. The aim was to measure the effectiveness of full-text retrieval and to find a way to improve the low precision of full-text searching with a minimum decrease in recall. He reported that with the use of ranking algorithms, precision improved 2 times at the same level of recall, and 2.6 times with a 0.27 times decrease in recall. However, he stated that the relative effectiveness of the algorithms depends on the level of recall at which precision is measured and on the Boolean strategy employed.

Linguistic approach to text analysis

Various attempts have been made in recent years to use syntactic analysis methods for the generation of complex constructions such as noun phrases, prepositional phrases, etc, which can be used as indexing terms. Various automatic indexing systems based on syntactic analysis have been proposed [BP86, Thu86, Sme86, SvR88, JR88]. The outcomes of these systems confirm that systems based solely on syntactic methodologies are not sufficient to produce a complete analysis of diverse11 full text. The experiments done using the syntactic analysis method often use known text samples [Sme86] or use a skimming-type parser which looks only at certain portions of the text while totally ignoring other parts [JR88]. Fagan [Fag87] carried out a comparison between the statistical and syntactic approaches on the same collection of documents and request terms. The results indicated a preference for the statistical approach over the syntactic approach. Salton et al. [SBS90], in order

11 By "diverse" we mean coming from many different domains and having a very broad vocabulary.

to evaluate the effectiveness of the syntactic approach, conducted a series of experiments. The analysis system used was the PLNLP English Grammar (PEG) developed by IBM12 [SBS90]. This could analyse complete sentences as well as fragments of a sentence and produce one or more parses for each sentence, ranked in decreasing order of presumed correctness. The output was given to a term phrase generator designed at Cornell which generated single terms or phrases suitable for indexing purposes. The evaluation was done by comparing the output of the above procedure with that of the statistical approach, using the same data and query collection. The overall conclusion was that "The proportion of acceptable indexing phrases obtained by both methodologies was approximately the same" [SBS90]. The final conclusion was that the statistical approach was better than the syntactic approach, because the syntactic approach involved too much computational complexity and because our current knowledge of language representation does not allow us to solve significant problems such as ambiguity.

Another approach [FNA+88, IAKM86, CDBK86] suggested the use of a thesaurus, thereby using dictionary information to elaborate the meaning of words. The experimental results of Fox et al. [FNA+88] showed that dictionary information cannot be easily used in general text analysis.

Another approach that comes from the domain of linguistics for dealing with automatic text analysis is the knowledge based approach. Many researchers have attempted to incorporate manually constructed knowledge bases for specific subject areas [MCT87]. Mauldin et al. [MCT87] have explained, giving examples, the working of a knowledge base for text analysis. It seems that, due to the diverse ways of expression in a large collection, the matching possibilities between arbitrary inputs and an existing knowledge base will always be limited. Also, matching is confined to the knowledge represented in rules; these rules are general and not specific enough to tackle all the ambiguities which come up in input text. According to Salton [SBS90] the knowledge based approach has not proved effective in uncontrolled full text retrieval environments.

12 IBM Research Lab in Yorktown Heights.

Different approaches for improving text analysis and representation

From the earliest days of Information Retrieval it has been found that many of the most frequently occurring words in English are worthless as index terms. These words account for 20% to 30% [FBY92] of the total text; according to Salton they account for 30% to 50% [vR75] of the entire collection. Removing these words not only reduces the number of indexing terms but also increases precision. Van Rijsbergen [vR75] used a list of 250 "stopwords" for filtering out the common (non-informative) words. In the literature this "stoplist" is called by different names depending on the words in the list. Van Rijsbergen [vR79] talks about "the removal of high frequency words, 'stopwords' or 'fluff' words". Salton et al. [SM83] call them "high frequency function words". Vickery et al. [VV87] call them "very frequent non-significant words". We call them non-informative high frequency words. The selection of these words depends on the collection environment in which they are to be used. In any case we can say that this process of removing stopwords is a positive one for improving the performance of IR: the total number of words representing the document decreases, which also improves system speed.

Another technique for improving IR performance is the use of stemming.

It is basically used to remove suffixes from words, whereby the size of the indexing file can be reduced, and also to increase recall. There are several approaches to stemming, such as:

suffix removal: Suffix removal algorithms remove suffixes from terms, leaving the stem.

successor variety: Here the stemmer is based on work in structural linguistics. It uses the frequencies of letter sequences in a body of text as the basis of stemming. Once the successor varieties for a given word are derived, this information is used to segment the word. The word can be segmented in different ways, and one of the segments is then selected as the stem.

n-gram method: In this approach, association measures are calculated between pairs of terms based on shared unique digrams. Once the unique digrams for the word pair have been identified and counted, a similarity measure based on them is computed (commonly Dice's coefficient S = 2C/(A+B), where A and B are the numbers of unique digrams in the two words and C is the number of digrams shared by both) and a similarity matrix is generated. Once this matrix is available, terms are clustered using the single link clustering method.

lookup table: Here terms and their corresponding stems are stored in a table. The terms of the documents are compared with this table; when a match is found, the appropriate stem replaces the original term.

Overall we can say stemming affects retrieval performance: studies [Sal68, Hw74, KMT+82, Har91] show that stemming does not negatively affect retrieval performance with respect to precision and recall, and in most cases it rather has a positive impact on performance. Stemming has a marked effect on the size of indexing files, which ultimately increases system speed.
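As an illustration of the n-gram method described above, a small Python sketch of digram-based similarity using Dice's coefficient (pair_similarity is an illustrative helper name; the clustering step is omitted):

    def digrams(word):
        # set of unique adjacent letter pairs in a word
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def pair_similarity(w1, w2):
        # Dice's coefficient S = 2C / (A + B) over unique digrams
        d1, d2 = digrams(w1), digrams(w2)
        if not d1 or not d2:
            return 0.0
        return 2 * len(d1 & d2) / (len(d1) + len(d2))

    # e.g. pair_similarity("statistics", "statistical") is high, so the two
    # terms would fall into the same cluster and share a stem.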

Another approach to improving IR looks at text from a semantic point of view and focusses on the meaning of words. Thesauri are knowledge sources that specify semantic relationships between words. With the use of a thesaurus, a single word is replaced with a cluster of words specified as being related. Some experiments on thesaurus-based feature extraction have shown improvements in retrieval effectiveness [Fox83]. The problem with the use of a thesaurus in our case is the scope of our collection.

Another approach to improving IR performance is to incorporate phrases as indexing terms. An indexing phrase is an indexing term that corresponds to two or more single word indexing terms. Here the statistical and syntactic approaches are mixed: a phrase is generated using syntactic analysis, and statistical analysis is carried out to assign a weight to it. Many phrase formulation techniques are available in the literature. A simple method is to take the conjunction of two or more existing terms. A more complex method is used by Salton [SM83] but gives no real improvement despite adding considerable computational complexity.

2.2.3 Searching strategy

The purpose of IR is to develop techniques that provide effective access to large collections of text, with the aim of satisfying a user's stated information need. The selection should be such that documents the requester will evaluate as sufficiently relevant are retrieved [Rad88]. The two most common approaches are to use boolean operators in specifying a query, or to give a query in natural language.

Boolean Search

In the standard conventional boolean retrieval model, a query is stated in terms of conjunctions, disjunctions, and negations. This searching strategy retrieves those documents which are "true" for the query, and every document retrieved is perceived as being of identical usefulness in satisfying the user's information need. The drawback of this is that it does not take into account the ranking of documents. To tackle this problem many researchers have outlined a number of non-traditional approaches; in particular, attempts have been made to incorporate a document ranking mechanism into a conventional boolean search mechanism. One of the earliest efforts to achieve this objective was by Noreault et al. [NMK77]. Here a boolean search request is processed against the collection and the documents that contain the query terms are identified. Thereafter the cosine measure is used to compute a similarity between the query and the documents, and the output documents are presented in descending order of the cosine measure. With commercial use of the boolean approach the need for ranking has increased. Salton et al. [SFW83], Heine [Hei82], Radecki [Rad82] and many others came up with rigorous methodologies for enhancing the conventional approach. One of the most thoroughly studied and experimentally verified models is the P-model proposed by Fox [FBY92]. Here both document index terms and query terms have weights. If D is a document with weights d_A1, d_A2, ..., d_An with respect to keywords A_1, A_2, ..., A_n, it is considered to be a point with coordinates (d_A1, d_A2, ..., d_An) in an n-dimensional space. The generalised queries are of the form:

Q_or^p = (A_1, a_1) or^p (A_2, a_2) or^p ... or^p (A_n, a_n)
Q_and^p = (A_1, a_1) and^p (A_2, a_2) and^p ... and^p (A_n, a_n)

Heine [Hei82], Radecki [Rad82] and many others came up with rigourous methodologies for enhancing the conventional approach. One of the most thoroughly studied and experimentally verified models is the P-model pro­ posed by Fox [FBY92]. Here both document index terms and query terms have weights. If "D" is the document with weight dAl, dA2, ... , dAn with respect to keywords A1, A 2 , •.• , An it is considered to be a point with coordi­ nates (dAl, dA2, ... , dAn) in an n-dimensional space. The generalised queries are of the form:

where "P" indicates the degree of strictness ranging from 1 for least strict to oo for most strict of the operator.The similarity between a query Q and document D is calculated by:

p dP+P dP+ p dp P al * Al a2 * A2 ... an * An SIM(Qorp,D) = alP+ a2P+ ... anP

where

a_i = weight of term i in the query
d_Ai = weight of keyword A_i in document D

Various experiments were conducted by Fox et al. [SFW83, FS86, LF88], reporting that this model is effective, as it retains the simplicity of conventional boolean searching while also having a good ranking scheme.
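A small Python sketch of the p-norm operators described above (the AND form is the standard mirror-image companion of the OR formula given in the text; the query weights a and document weights d are parallel lists):

    def pnorm_or(a, d, p):
        # p-norm OR similarity: high if any heavily weighted keyword matches
        num = sum(ai ** p * di ** p for ai, di in zip(a, d))
        den = sum(ai ** p for ai in a)
        return (num / den) ** (1.0 / p)

    def pnorm_and(a, d, p):
        # p-norm AND similarity: penalises keywords that are missing
        num = sum(ai ** p * (1 - di) ** p for ai, di in zip(a, d))
        den = sum(ai ** p for ai in a)
        return 1 - (num / den) ** (1.0 / p)

With p = 1 the operators behave like weighted averages (least strict); as p grows they approach strict boolean OR and AND.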

Paice [Pai84] proposed a model for ranking documents in a conventional boolean environment using fuzzy set theory. He assumed that there is a fuzzy set associated with each keyword and that the weight of a document with respect to a keyword represents the degree of membership of the document in the keyword's fuzzy set. In a document D with keywords A_1, A_2, A_3, ..., A_n and keyword weights d_A1, d_A2, ..., d_An, the similarity between a query Q and the document D is expressed by:

SIM(Q, D) = Σ_{i=1..n} r^(i-1) * d_i / Σ_{i=1..n} r^(i-1)

where r is a constant, and the d_i's are considered in descending order for queries of the form Q_or = (A_1 or A_2 or ... or A_n) and in ascending order for queries of the form Q_and = (A_1 and A_2 and ... and A_n). Lee et al. [LF88] conducted experiments with this model and concluded that its high computational cost, due primarily to the requirement for sorting keywords, was prohibitive.

Losee and Bookstein, as mentioned in [FBY92], introduced a document retrieval model that allows conventional boolean query structure to be incorporated into a probabilistic model. Experiments were conducted using this model, where weights were calculated using a probabilistic ranking algorithm.

Apart from boolean search, queries can also be represented in natural language form. In linguistic search, the query is expressed in natural language without the use of boolean operators. Here it is not necessary for each query term to match an index term of a document for the document to be ranked as important; partial matching can also rank a document as important, depending on the importance of the matched query terms.

There are also other approaches, such as knowledge-based systems, which attempt to interpret the text of the query. Many researchers have also given attention to modelling systems that can handle elliptical queries, where instead of supplying a whole query one can simply say "find more documents like this one".

2.3 Summary

This chapter introduced and defined basic IR concepts and explained the different components of an IR system. Along the way it looked at different aspects of domain selection, text analysis, and searching strategy. It presented a survey of statistical and syntactical ranking models and experiments, and also examined the different searching strategies that appear in the literature. Overall, this gives an overview of the work done in IR.

Chapter 3

Literature Review - Available Systems

3.1 History of newsreaders

Usenet came into existence in 1979, when two graduates at Duke University hooked their computers together to exchange information with the Unix community. Steve Bellovin [SM94], a graduate at the University of North Carolina, wrote the first news software in shell script. The first public distribution software was written by Steve Daniel; this was modified again to become the "A" news release. With the increase in the volume of news, in 1981 Mark Horton et al. at Berkeley rewrote "A" news and named it "B" news [SM94]. New versions of "B" news were released which incorporated a mechanism for moderated newsgroups, new naming structures for newsgroups, enhanced batching and compression, etc. With time the limitations of "B" news brought it to the stage where, according to Rick Adams, it is 'dead' and is unlikely to be upgraded any further.

The influx of data onto Usenet forced researchers to come up with new ways to tackle this overload problem. Consequently, a new version of news known as "C News" was developed at the University of Toronto. The goals of this new version were to increase processing speed, speed up article expiry, and improve the reliability of news systems with better locking. This version has been publicly available since 1987.

Another Usenet system, known as InterNetNews, or INN, was written by Rich Salz; it was designed to run on Unix hosts that have a socket interface. It provides both NNTP¹ and UUCP support. It is very fast, and its integration with NNTP makes it very easy to use.

Until now we have talked about transporters like A, B and C news and INN, which basically assist in propagating the news. Their main task is to take posted articles and deliver them to other nodes (servers connected to Usenet). As the number of people reading news increased, the total load increased. To reduce this, the Usenet news system provided a method whereby all news articles are stored on a centralised host. A subscriber can connect to a local host and send/request/post specific articles. For that purpose a package was released in 1986 implementing news transmission, posting and reading using the Network News Transfer Protocol (NNTP). NNTP has commands which provide a straightforward method of exchanging articles between cooperating hosts.

¹ Network News Transfer Protocol

Geoff Collyer implemented NOV (News Over View), a database that stores the important headers of all news articles as they arrive. This was designed so that with its use news readers could provide fast article presentation by sorting and "threading" the article headers.

Geoff Huston at the Australian National University wrote a news package called ANU-NEWS for VMS systems. ANU-NEWS is a complete news system that allows reading, posting, direct replies, moderated newsgroups, etc., in a fashion closely related to regular news. The interface is very simple, as it is designed for VMS systems.

For reading the news, the general approach is to use a news reading interface. Broadly speaking, news reading interfaces can be classified into basic news reading systems and graphical direct manipulation systems.

Basic news readers have a text-based interface. They supply all of the needed functionality, such as sending, receiving, deleting, and saving articles. This basic interface presents users with a succession of text lines containing the names of newsgroups and the number of unread messages in each group. The list of all subscribed newsgroups is stored in a file; this file also determines the order in which newsgroups are presented to the user. Many of the most popular news reading interfaces fall into this category, because these news readers can run on any terminal. Graphical direct manipulation systems include all the functionality of basic news reading systems with the additional use of graphical representations of newsgroups/directories and a direct manipulation interface for carrying out the basic operations.

3.1.1 Popular screen-oriented news reading interfaces

rn: "rn" was developed by Larry Wall in 1984. It provides a full-screen display. The interface also includes reading, discarding, and/or processing of articles based on user-definable patterns, and the ability for users to develop customised macros for display and keyboard interaction. In our initial approach we used rn for fetching articles from our local NNTP server, but this approach was not effective because of slow access speed.

trn: With the increase in the scope and range of discussions, it became hard to track followups using rn. To address this, Wayne Davison came up with a new interface called trn. As well as providing all the functionality available in "rn", he added the ability to follow "threads of discussion" in newsgroups. To do so, "trn" uses a reference-line database that allows the user to take advantage of the "discussion tree" formed by articles and their followups. Trn is also capable of "menu-based" selection of articles, which gives the user extra ability in manipulating certain newsgroups.

nn: "nn" was developed in 1988 by Kim F. Storm of Texas Instruments. It presents a menu of article subjects and sender names; the display is very fast because it keeps an on-line database of article headers. Its macro language is relatively easy, which helps in customisation. It merges related newsgroups, tagging similar articles; this is necessary as it does not support threading. The serious drawback of "nn" is the limitation of the search area: the search is limited to the subject and/or author name. This limits the use of "nn", as articles may not have a precise subject, and in some cases no subject at all. Also, the author name is a "real name", which may be an alias rather than a full name.

tin: "tin" is a full-screen, easy-to-use news reader. It can read news locally or remotely via an NNTP server. It operates with threads, and has a distinct article organisation method. It has a built-in "indexing" system: once a particular newsgroup has been read, this system keeps track of all articles, their topics and the thread discussions of that newsgroup. Its special features include: keeping a history of the posting of articles; batch-mode support; and provision of kill files.

WinVN: "WinVN" is a Microsoft Windows and Windows NT-based news reader developed by Mark Riordan, a systems programmer at Michigan State University. It offers a more visual approach to Usenet News than traditional text-based news readers, allowing users to easily navigate amongst newsgroups and articles via its point-and-click interface. In normal operation, WinVN displays three types of windows: the main group-list window (main_window), which displays a list of all newsgroups; one or more group article-list windows (group_window), each of which displays a list of the articles in a newsgroup; and one or more article windows (article_window), each of which displays an article. Double-clicking on a newsgroup or article name causes that item to be displayed in a separate window. Its special features include a point-and-click interface and a well-organised display of articles.

GNUS & Gnews: These are macro packages that can be used with the GNU Emacs text editor. They are implemented in Lisp, and interaction with the news takes place inside the Emacs text editor. They use their own ".gnewsrc" file to determine read/unread articles. The command structure is similar to "rn" [UME]. They provide a user-programmable filtering mechanism called "hook kill" for the user to filter out unwanted topics, people, cross-postings, etc. Searching is restricted to a particular header field as selected by the user [UME].

SLNR: Simple Local News Reader is an off-line news reader. It is intended for users who want to connect to a host, download new articles onto a local machine, and read them there. It gets all the unread articles from the subscribed newsgroups; the articles are lumped together in a packet, which can then be viewed with the SLNR interface. Posting and replying are not in real time and can be done later. Packets are read and written using the Simple Off-line Usenet Package, SOUP. The limitation of this reader is that it does not provide a search mechanism. It can fetch a maximum of 5000 articles from 1000 newsgroups.

There are numerous other newsreaders available; most of them have a similar set of facilities to those described above (e.g. kill files, threading, menu-based interfaces, off-line reading, etc.). However, none of them effectively solves the fundamental problem of information overload. For example, none of them provides ranking to assist the user in determining the importance of one article over another. Another problem is the use of a ".newsrc" or similar file which needs to be maintained. The question of subscribing to interesting newsgroups creates a dilemma, as we cannot be 100% certain that the newsgroups we have not subscribed to contain no articles of interest. The threading approach is good when followups (replies) actually follow the original topic closely; however, discussions in a single thread can often shift considerably from the original topic.

3.2 Related work

3.2.1 SMART

SMART is an implementation of the vector space model of IR proposed by Salton in the 1960s. This system was designed from a very different perspective to our system, and incorporates such ideas as:

• Use of a fully automatic indexing method to assign index terms to a document.

• Correlation of related documents into common subject classes, making it possible to start with specific items in a particular subject area.

• Computation of similarity between query and documents, and also between documents, using cosine similarity measures.

• Relevance feedback depending on the judgement of the user about the relevance of the retrieved documents.

These basic ideas in SMART have not changed over the years; however their implementation has been considerably refined. The following descrip­ tion gives an overview of how the SMART system works.

SMART does not have any fetching mechanism, as it was meant for use with a static data collection. The first stage in using SMART is to index the documents: first, non-informative words are removed, then the suffix is removed from each remaining word, and the unique words are sorted and used as keywords. These keywords are assigned to the documents in the collection. A weight is assigned to each keyword using a formula described in section 2.2.2. The user types the request in natural language; this request is formulated internally and weights are assigned to each request term using the same formula used to calculate the weight of each index term representing a document. In some versions of SMART, document term weights are computed using the formula

$$w_{ij} = \frac{tf_{ij} \cdot \log \frac{N}{n_i}}{\sqrt{\sum_{\text{vector}} \left( tf_{ij} \cdot \log \frac{N}{n_i} \right)^2}}$$

where $w_{ij}$ = weight of term $i$ in document $j$, $tf_i$ = total term frequency of term $i$ in the entire collection, $tf_{ij}$ = term frequency of term $i$ in document $j$, $N$ = total number of documents in the collection, and $n_i$ = total number of documents in which term $i$ is found; and query term weights are computed using

$$w_{ik} = \left( 0.5 + \frac{0.5 \cdot tf_{ik}}{\max tf_k} \right) \cdot \log \frac{N}{n_i}$$

where $w_{ik}$ = weight of term $i$ in query $k$, $tf_{ik}$ = term frequency of term $i$ in query $k$, and $\max tf_k$ = maximum term frequency in query $k$ [SB88].

After that, a similarity measure is computed between each document and the query:

$$\mathrm{SIM}(D_j, Q_k) = \sum_{i=1}^{n} w_{ij} \cdot w_{ik}$$

where

$w_{ij}$ = the weight of term $i$ in a document $j$,

$w_{ik}$ = the weight of term $i$ in a query $k$,

$n$ = the number of unique terms in the data set.
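As a sketch (ours, not SMART's code), with a document and a query held as term-to-weight mappings, the match reduces to a sum of weight products over shared terms:

def similarity(doc_vec, query_vec):
    # Sum w_ij * w_ik over the terms common to document and query;
    # terms missing from either side contribute nothing.
    return sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)

doc = {"retriev": 0.41, "index": 0.23, "filter": 0.12}   # illustrative weights
query = {"retriev": 0.62, "filter": 0.31}
print(similarity(doc, query))  # 0.41*0.62 + 0.12*0.31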

During the initial development of SMART, term frequency was the primary criterion for determining similarity. With time the criteria have changed: term frequency, inverse document frequency, normalisation factors, etc., have been incorporated.

The documents matching a query are presented in order of similarity to the query (in descending order). There is an option whereby a user can give a judgement on the relevance of the retrieved documents. Depending on that information, a new query is generated which in turn retrieves more documents. This process is known as "relevance feedback".

We introduced SMART here primarily because it is by far the most popular IR system in the academic world. Let us now look at some systems dealing with Usenet data.

3.2.2 SIFT - Stanford Information Filtering Tool

SIFT (Stanford Information Filtering Tool) is a system for performing wide-area information dissemination. It provides two dissemination services: one that delivers USENET News (Netnews) articles, and another for Computer Science technical reports. We consider SIFT from the point of view of a Usenet News article service provider. From that point of view, SIFT is a tool to help provide news articles from Usenet depending on a submitted profile. It supports full-text filtering using a vector space model. The filtering engine uses an indexing technique proposed by Yan et al. [YGM94] which is capable of processing large volumes of information against a large number of profiles. It can be used as a clearinghouse server that gathers large amounts of information and selectively disseminates it to a large population of users.

The mechanism is simple. First a user subscribes to a SIFT server with one or more subscriptions. A subscription includes an interest profile along with some additional parameters like frequency of updates, threshold for the number of articles, length of subscription, etc. The interest profile can be expressed in either a vector space model or a boolean model. When expressed using the vector space model, a weight should be specified along with each term, for example ((information, 50) (retrieval, 70)). These query terms are then used to compute the similarity between each document and the query. The articles are retrieved depending upon the query terms and the index terms of the articles, and the number of retrieved articles depends on the threshold provided by the requester. The output is then mailed to the user by the system. SIFT also supports a boolean query model whereby a user can use conventional boolean operators in the query.

One point to mention here is that this system does not support phrases, as indexing is done only on single terms. In our system we support two- and three-word phrases as document terms.

3.2.3 Tapestry

Tapestry is an experimental system designed to receive, filter, and browse electronic documents. It was developed at the Xerox Palo Alto Research Center [GNOT92]. It uses a relational model for matching user interests and documents, and collaborative filtering whereby users are encouraged to annotate documents; these annotations can then be used for filtering. A Tapestry user can write his own queries using a complex query language called TQL², which is based on SQL³. The filters can access annotations. Documents matching a filter are returned as soon as the document receives the specified annotation. The "Sybase" database is used to store documents, annotations and filter queries. In Tapestry, documents are received, stored and then processed in a batch to be sent to users. Storing all incoming documents clearly creates problems for the storage subsystem. One problem that may arise under this scheme is that since each query runs on the entire collection, it may retrieve previously retrieved articles. To avoid this, a tagging mechanism is used which only allows unread articles to be selected.

² Tapestry Query Language
³ Structured Query Language

Articles can be examined using a Tapestry browser or forwarded via email. Goldberg et al. [GNOT92] mention that articles sent via email are only prioritised for display in the last step of the Tapestry process. Also, when receiving articles via email one cannot use full TQL to search for other relevant articles. This clearly makes the use of a Tapestry browser the best option.

3.2.4 URN

URN is a multi-user, collaborative, hyper-textual Usenet reader [BJ94]. It provides a weighting mechanism that explicitly represents both the level of interest and the confidence associated with it. It has a voting mechanism which is used to grade articles; this information is used in turn to weight the articles.

Initially the keyword list (i.e. the special features of an article) is displayed to the user. The user changes the list according to his interests, and this information is then used to recalculate the weights of the articles which contain the added/deleted terms. The user can then order articles by interest level.

In URN, the work is done collectively by all users, so the overhead load is shared. After reading the articles, the user grades each article as interesting, ambivalent, or uninteresting, and the vote is recorded. After the user votes on articles, a background process takes the votes and turns them into weights. The process reads a list of all keywords that have been manually added by the users. The bodies of all articles in the database are scanned for these words; if any are found, then those keywords are promoted to the keywords field of the article in which they were found.

Once all individual weighting functions are generated, the new weights of all articles in the database are computed for each user. For each article, the list of keywords is compared with the list of weighting functions for a user. If there is a match, then that weighting function's weight is added to the user's weight for the article.

URN was basically an experimental system. The scope of the article collection was limited to only one newsgroup, and the results were based on 10 days' usage of the system. The results are fascinating but do not appear to scale to the real-world news-reading context, where there are approximately 11000 newsgroups.

3.2.5 INFOSCOPE

INFOSCOPE is a news reading tool which allows the creation of virtual newsgroups via filtering [Ste92][FS91]. Articles from different newsgroups are selected and tagged together depending on a series of patterns. The filters are generated by background programs which keep track of user behaviour and in turn give suggestions to the user. If a suggestion is rejected, this action is recorded for future use, so that in a similar situation the same filter will not be suggested. It is not a collaborative support system but rather permits filtering on an individual level. It provides a good graphical interface for browsing, using the Usenet hierarchy as a tree structure. The limitations of this system lie firstly in the approach taken for constructing filters, where the user has to construct his own filters. Secondly, the search is limited to the newsgroups specified in the filter and confined to the headers of the articles. Thirdly, there is no provision for searching the entire collection of posted articles. Finally, messages are tagged and displayed in hierarchical order, and there is no mechanism to reflect the importance of one article over another.

3.2.6 Deja News

Deja News is a Usenet news archiving service with the largest collection of indexed archived Usenet news. The articles posted to newsgroups are catalogued, and these often include references to resources all over the Internet.

A request is placed via a "Query Form" where the keywords are separated using boolean operators. It supports partial as well as full search. A filter can be created whereby the search can be limited by Author, Subject, Newsgroup, or Creation date. The result displays the date, score and subject of each article, the newsgroup it was found in, and the author. The list is in descending order of article score. The score is calculated based on the number of matches to the given keyword(s), the ratio of keyword(s) to the total words in the article's body, and the posting date of the article. An article can be viewed by selecting it from the displayed list.

So it can be said that Deja News provides good searching options, but the user has to create his own filter, which has to be changed explicitly as needs change. The system has no mechanism to give the user feedback while constructing the filter. Duplicate articles are displayed when the same article is posted in different newsgroups, which often becomes frustrating for the user.

Chapter 4

IAN - Intelligent Assistant for News reading

4.1 Preview of IAN

The Intelligent Assistant for News reading (IAN) system is an experimental prototype of an adaptive news filtering system. It automatically selects articles from Usenet news based on a user profile and the weights of keywords in the articles, observes how the user reacts to these articles, and on the basis of the user's behaviour adjusts the profile to more accurately reflect the user's interests.

IAN is composed of four main modules: classifier, selector, presenter and adapter.

Our work is concerned primarily with the first two modules. The aim is to efficiently and effectively extract information from incoming articles and use this information to filter out incoming news articles which are not of interest to the user.

Figure 4.1: Modules and flow of data

This chapter deals with the implementation of an Intelligent Assistant for News reading (IAN) [NS93]. We will describe the system design, including data acquisition, the automatic indexing method, the retrieval mechanism, query formulation and the searching process. The IAN system carries out 3 major tasks, namely: Fetch_Articles, for fetching articles from a local NNTP server; Automatic indexing of news articles, for selecting content identifiers for each article; and Retrieval of relevant news articles based on user profile. These tasks are carried out by different modules. Overall the IAN system consists of 7 modules, namely: Article_Fetcher; Data_Cleaner, which removes unwanted parts of the header and footer; Stopper, which removes stop-words; Stemmer, which removes suffixes from words; Phrase_Generator, which generates phrases; Weight_Assigner, which assigns a weight to each index word representing an article; and Search, which takes an ad hoc user query or the user profile as a query, executes it and finally displays the results.

4.2 Introduction

IAN is an attempt to explore approaches to effective information management for Usenet news [NS93].

With the advent of better graphical user interface technology, most newsreaders have moved towards simpler, more intuitive interfaces. This has opened up Usenet to a much larger number of people, with the attendant problems of a larger number of low-quality articles, inappropriate use of newsgroups, and so on. Overall this has worsened the information overload problem.

To fully utilise the information available on Usenet, we need to look at it from a different perspective and explore different theories which appear in the literature. After reviewing the literature and examining various approaches and experimental conclusions, we have developed a system called IAN (Intelligent Assistant for News reading) which works outside the newsgroup classification hierarchy and determines relevance based on content.

4.2.1 How does IAN work

• Takes date and time information to put a constraint on posting time.

• Connects to the NNTP server, collects articles as per the specification provided by the user, and stores all the articles in a directory.

• Cleans up the articles. Non-useful information contained in the header and footer of an article is eliminated.

• A stop-word list of 514 stop words is used to delete the high-frequency function words that are insufficiently specific for content representation [Sal86]. This decreases the size of the text by 30%-50% [vR75].

• A suffix-stripping routine based on Porter's algorithm is used to reduce the remaining words to word stems. This recall-enhancing transformation broadens the scope of the terms [Sal86].

• As well as single keywords, two- and three-word phrases are generated by considering adjacent words in the text.

• Weights are calculated for each keyword, giving the importance of the keyword in the document.

• Each keyword has a link to the articles in which it is found. This represents a list of articles linked to a keyword, along with the weight of that keyword in each article.

• The search mechanism builds the structure of keywords in RAM and displays the list of articles and their subjects according to the terms in the search request. The display is in descending order of weights.

• Search requests are submitted in boolean form. They are supplied by the user in the user profile or as an ad hoc query, and are used by the search mechanism for retrieving articles.

4.3 IAN system

Here we will talk about the design and implementation of our part of IAN. The IAN system has 3 major functions:

• Fetch articles.

• Automatic indexing of news articles.

• Retrieval of relevant news articles based on user profile.

Figure 4.2: Major Functions of our part of IAN

The above functions find the articles which are posted each day on Usenet and determine the relevance of each article based on content, according to the user profile. Detailed descriptions of each of these functions follow. Figure 4.3 gives an overview of the whole process.

4.3.1 Fetch articles

Generally, Usenet articles are stored on the local NNTP server. To extract relevant articles from that huge collection we need a fetching mechanism. Our initial approach¹ was very slow, so we decided to fetch articles directly from the NNTP server. We designed a new module called the Article Fetcher.

¹ A shell script which invoked rn (read news) to collect news articles.

Figure 4.3: Different modules and flow of data

Article fetcher

The aim of the Article Fetcher is to collect those articles from Usenet which are posted after a certain time. This task is carried out by a program called ARTICLE_FETCHER. This program connects to the local NNTP server and collects articles posted after a certain time. This time is either provided by the user or defaults to the time when the program was last run.

The following explanation will give a clearer picture.

• Connect to the NNTP server. Our ARTICLE_FETCHER client connects to the local NNTP server.

• A request is sent to the NNTP server to get the list of message_id(s), from all newsgroups, of articles that have arrived since a specified date and time. The command schema is:

NEWNEWS newsgroups date time [GMT]

e.g. NEWNEWS * 950221 230000, where * means every newsgroup.

• The articles are obtained by passing the message_id(s), one at a time, as a parameter to the NNTP command "ARTICLE message_id". The retrieved articles are stored in individual files in a sequential manner.

With the implementation of this module, articles are fetched at very high speed. It takes only 17 minutes to fetch 10000 articles, which we consider quite reasonable for fetching 20 to 25 megabytes of data.
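For illustration only, the same NEWNEWS/ARTICLE exchange can be sketched with Python's nntplib (part of the standard library up to Python 3.11). The thesis's ARTICLE_FETCHER was its own client; the server name and output layout here are hypothetical:

from datetime import datetime, timedelta
from nntplib import NNTP

with NNTP("news.example.com") as server:        # hypothetical local server
    since = datetime.now() - timedelta(days=1)
    # NEWNEWS: message-ids of articles posted to any group since `since`
    _resp, ids = server.newnews("*", since)
    for i, msg_id in enumerate(ids):
        # ARTICLE: fetch one article by message-id
        _resp, info = server.article(msg_id)
        with open(f"articles/{i}.txt", "wb") as out:   # store sequentially
            out.write(b"\n".join(info.lines))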

4.3.2 Automatic indexing of news articles

After collecting the articles, we need to find what information they contain and which articles contain information of interest. For that we need to correlate the user's needs with the article content. But, as we know, there are around 30000 to 35000 new articles posted every day, which contain 5 million words; on average there are about 200 words per article². These words can be informative, grammatical or common words. The grammatical and common words generally do not contain information, so we need to identify the informative words, which we can then associate with the articles for later use.

² Total 200 words per article.

To extract meaningful words we need to analyse the text. The text analysis process identifies important words which can be used to represent the articles. We use a bottom-up approach to identifying informative words. First of all, the known unnecessary parts of the articles are removed; we know that fields like NNTP-Posting-Host:, X-Newsreader:, Distribution:, etc. do not contain any relevant information that can be used as content identifiers. Next, common words are identified and removed, because these words are worthless as index terms [FBY92]. Careful thought was given to selecting stop-words, as careless selection may lead to information loss. Then each word is converted to its root (using Porter's algorithm). The remaining words in the articles can be taken as index words. In this list there will be some words whose meaning depends on the context (for example potato chips, processor chips, rail network, neural network, etc.). If we can formulate phrases using the selected index terms then we can narrow the scope of these words.

The final aspect we consider while indexing is assigning weights to the index terms³. This task is most important, as it evaluates the importance of each term representing an article. Many approaches appear in the literature⁴. The reason we use

$$w_{ij} = \frac{tf_{ij} \cdot \log \frac{N}{n_i}}{\sqrt{\sum_i \left( tf_{ij} \cdot \log \frac{N}{n_i} \right)^2}}$$

is that it incorporates the three essential factors that affect the overall ranking⁵. The above-mentioned procedures are carried out by individual modules. Figure 4.4 gives an idea of the involvement of the different modules in achieving this goal.

³ The need for weighting index terms is discussed in full in chapter 2.
⁴ Explained in detail in chapter 2.
⁵ How it helps in ranking is discussed in the explanation of the Weight_Assigner module.


Figure 4.4: Modules involved in text analysis

Data_Cleaner

The task of the Data_Cleaner is to remove any text which is not directly relevant to the information content of the article. Its tasks include:

• Removing all fields from the article header, except for the subject line.

• Removing the footer. Some articles have a footer, which may contain information like name, address, proverbs, icons, symbols and other things. This information represents the author of the article, not the article itself; for that reason we remove it. We recognise footers by searching the last lines for patterns like "--" or words like "Regards" or "Thanks", and removing those lines.

• Placing a marker at the end of the subject and at the beginning of the message (body). This is done to distinguish between the subject heading and the body of the article. We started out by adopting the simplest approach to article partitioning; in future, this could be extended so as to mark various "regions" of the article for the purpose of weighting these regions separately.

This module sends the output to the next module called Stopper.
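To make the cleaning step concrete, here is a minimal Python sketch of the behaviour just described; the "<SUBJECT-END>" marker, the 5-line footer window and the function name are our illustrative choices, not the thesis code:

def clean_article(lines):
    # Split header from body at the first blank line.
    blank = lines.index("") if "" in lines else 0
    subject = next((l for l in lines[:blank] if l.startswith("Subject:")), "")
    body = lines[blank + 1:]
    # Footer heuristic: signature delimiters or closing words near the end.
    for i in range(max(0, len(body) - 5), len(body)):
        stripped = body[i].strip().lower()
        if stripped in ("--", "- -") or stripped.startswith(("regards", "thanks")):
            body = body[:i]
            break
    # Marker separating subject heading from message body.
    return [subject, "<SUBJECT-END>"] + body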

Stopper

In a collection of articles there are many grammatical function words (for example a, and, in, of, is, he, that, what, would, has, to, you, for, on, not, but, was, were, will, could, etc.) and a number of common words (for example some, like, just, more, less, any, many, do, all, from, by, etc.). We looked into the frequency of occurrence of these and a few other similar words in a collection of 30000 articles. We found that the frequency of this type of word is very high: for example, in a single day's collection of articles, "the" occurred on average 200000 times, "to" 120000 times, "and" 90000 times, "in" 75000 times, "is" 60000 times, "that" 50000 times, and so on. Words of this type do not contain any information about the content of the articles in which they appear.

To remove these nonsignificant terms we used the stop-list source code⁶ of William Frakes and Ricardo Baeza-Yates [FBY92], implemented by C. Fox and J.U. Madison [FM] and found in the public domain. This program was written to demonstrate and test stop-list filtering [FM]. It takes a single filename on the command line and lists the unfiltered terms on standard output [FM].

⁶ The source code appears in their book Information Retrieval: Data Structures & Algorithms.

We modified the source code. The original code removed stop-words, digits, and punctuation. We thought that removing digits was not a good idea: if, for example, the user wants information on "os/2" and digits are removed, then the query term will retrieve all articles containing information on "os/2", "os/9", and so on. We also made changes so as not to filter out all punctuation. Punctuation characters like ".", ",", "/" and "-" were retained, as each conveys some meaning. For example, "." and "," help to identify the end of a sentence, and "." carries special meaning in phrases such as "bsd 4.3". "/" can also be found in a number of common computer terms and carries useful information, for example "os/2".

The overall tasks of this module are:

• Convert all upper case to lower case. This is done because string searching and pattern matching during frequency calculation would otherwise not identify upper- and lower-case forms as the same word.

• Keep punctuation like ".", ",", "/" and "-" while removing the rest. The advantage of doing so is explained in the previous paragraphs.

• Compare each word in the article to the list of stop-words. Remove any words that appear in the stop-word list and store the remaining words in a file.

With the removal of common words, the list of keywords representing an article decreases. This helps in manipulating large numbers of articles without overloading the system.
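As an illustration of the Stopper's behaviour, the following sketch (ours, not the Frakes/Fox code) uses a tiny stand-in for the 514-word stop list and keeps the retained punctuation:

import re

STOP_WORDS = {"the", "to", "and", "in", "is", "that", "of", "on", "a", "for"}

def stop_filter(text):
    # Lower-case everything, keep only word characters plus the retained
    # punctuation (. , / -), then drop stop-words.
    tokens = re.findall(r"[a-z0-9.,/-]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_filter("The kernel runs on BSD 4.3 and OS/2"))
# ['kernel', 'runs', 'bsd', '4.3', 'os/2']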

Stemmer

The next stage after removing stop words is suffix stripping. The question which arises here is why we need stemming. In answer, we put forward several points which justify the need for stemming.

• Irregularity: The purpose of this sub-module is to reduce words to their root form by removing suffixes. This helps in identifying words having a similar root: for example, words like "acceptance", "accept" and "acceptable" are all changed to "accept". This gets rid of irregularities and allows us to count variant forms as instances of the root form; the three words "acceptance", "accept" and "acceptable" would be counted as one word, "accept", with a frequency of 3.

• Size and complexity reduction: Each article is described by words, and words with a common stem will usually have similar meanings. To improve the performance of IR, we conflate words with a common stem into a single term. In addition, the suffix stripping process reduces the total number of distinct words in the IR system, thereby reducing the size and complexity of the data in the system.

• Eliminating the need for wild cards: A stemmer also helps to simplify query entry by eliminating the need for wild-card specifiers to indicate word roots.

Over the years many approaches to stemming have been reported in the literature [Lov68, PL69, Daw74, Por80]. Usually suffix stripping programs are given an explicit list of suffixes and, for each suffix, the criterion under which it may be removed from a word to leave a valid stem [Por80].

In our system, we used a version of the Porter algorithm implemented by William Frakes [FT]. Our stemming module does the following:

• Strip the suffixes of words without removing alphanumeric characters.

• Take each article, stem it, and store the stemmed words in an output file for further use.

After stripping suffixes, we generate phrases. This is handled by the next module called the Phrase_Generator.

Phrase_Generator

In recent years researchers have become increasingly convinced that the performance of IR systems can be greatly enhanced by the use of phrases for automatic document indexing and retrieval [SM83, SBS90]. In practice, it is not feasible to remove high-frequency non-common words (for example information, computer, network, chip, etc.), so it is necessary to develop a different approach. The most obvious method is to generate phrases. This narrows the scope of matching without hampering the wider scope: for example, if a user asks for "network" then he/she will get articles on every type of network, but if he/she asks for "neural network" then he/she will get articles only on "neural network". With the above point in mind we put forward a simple phrase-formation process called the Phrase_Generator. The task carried out by this process is described below.

Initially the subject of an article is identified and subject phrases are generated⁷. The criterion used here is to join adjacent words in the subject; it is used because it is simple. There are more complex methods that deal with syntactic and semantic characteristics, but they do not produce better results [SM83, SBS90]. The following example shows how we generate phrases: an article with the subject "phrase generation process" produces the following terms:

phrase
generation
phrase_generation
process
generation_process
phrase_generation_process

⁷ The subject is easily identified, as the Data_Cleaner module puts markers at the start and end of the subject line.

Phrases are also generated for the message (body) of an article. Here single⁸ and double⁹ terms are generated by concatenating adjacent words, respecting sentence boundaries (i.e. the end of a sentence marks the end of a phrase). Sentence boundaries are identified by "." and ","; thus a "." or "," prevents the last word of the current sentence from being joined to the first word of the next, and a new phrase starts at the beginning of each new sentence.

The above process runs on each article individually, but the output is appended to a single file which contains the article numbers and keyword lists of all articles.

The output of the Phrase_Generator is a sorted list of the single words and phrases in the article, with frequency counts attached. This list, along with the article itself, is passed to the next phase.
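A minimal sketch of the body-phrase step described above (single and double terms only; subjects additionally get three-word phrases). The function name is illustrative:

def generate_phrases(words):
    # "." and "," end a sentence, so no phrase may cross them.
    terms, sentence = [], []
    for w in words + [","]:          # trailing delimiter flushes the last sentence
        if w in (".", ","):
            terms.extend(sentence)                                    # single terms
            terms.extend(a + "_" + b for a, b in zip(sentence, sentence[1:]))
            sentence = []
        else:
            sentence.append(w)
    return terms

print(generate_phrases(["phrase", "generation", "process"]))
# ['phrase', 'generation', 'process', 'phrase_generation', 'generation_process']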

The next job is to rank the keywords in each article. This is done by calculating weights, as shown in our next module, the Weight_Assigner.

Weight_Assigner

In conventional retrieval systems it is not customary to assign weights to terms to show their importance; the general strategy is limited to binary indexing [SW79]. It is possible to assign a binary weight to each term in the global vocabulary: 1 indicates that the term appears in the article, while 0 indicates that it does not.

⁸ Single terms.
⁹ Phrases generated by joining two adjacent words.

The difficulties of using binary indexing are:

• All terms are assigned the same importance.

• During a search under the binary scheme, we are likely to retrieve fewer articles if the query is a conjunction and more articles if the query is a disjunction.

To overcome this, we need another approach: weighting the terms. Our Weight_Assigner is theoretically based on the term-weighting system proposed by Salton [SW79, SM83, Sal70].

In principle, a system that produces both high recall and high precision is preferred [SB88]. In practice this is not possible, so we have to reach a compromise. Three main considerations appear to be important in this regard, as proposed by Salton and Buckley [SB88]. We explain each of them briefly:

• A term that is frequently mentioned in an individual article is considered to be useful as a recall-enhancing term. Thus, while calculating weights, the term frequency factor should be considered. According to [SB88], Salton and Buckley considered the frequency of occurrence of words in both the articles and the query, but we consider the frequency of occurrence of terms only in the articles. This is because Salton and Buckley's queries were either short pieces of English text or whole articles (e.g. a query which asks "give articles similar to this one"); in our case, queries are lists of keywords joined by boolean operators.

• The second consideration proposed by Salton and Buckley [SB88] is a collection-dependent factor known as Inverse Document Frequency (IDF). The IDF system postulates that a term is a good discriminator if it has a high occurrence frequency in a particular document while its overall occurrence is low; if a term occurs in many documents in a collection, its use in a query will not help us to discriminate between documents. In our case we take a collection to be all articles posted each day. The size of the collection is around 30,000 to 35,000 articles, the number of articles posted daily on Usenet. We have used the same approach in identifying good discriminators and assigned weights accordingly. The IDF factor varies inversely with the number of documents $n$ to which a term is assigned in a collection of $N$ documents; in our case $N$ is the total number of articles arriving at our local server daily. The IDF factor can be computed as $\log \frac{N}{n}$.

• The third useful factor proposed by Salton and Buckley [SB88] is normalisation. They argue that in most situations a small article has few words, so there are few keywords, with even fewer words having high frequency, compared with large articles. In such a case the chance of matching queries also differs, and larger articles have a better chance of retrieval. In our case we have the same problem, where some articles may contain only a one-line reply while others contain very lengthy comments, questions or responses. To compensate for this difference in article length we have incorporated the normalisation factor, defined as $\frac{1}{\sqrt{\sum_i w_i^2}}$.

Taking into consideration the type of collection, we need to incorporate all the above factors. We decided to use the formula proposed by Salton and Buckley [SB88] for calculating the weights of the keywords representing an article. The formula is:

$$w_{ij} = \frac{tf_{ij} \cdot \log \frac{N}{n_i}}{\sqrt{\sum_i \left( tf_{ij} \cdot \log \frac{N}{n_i} \right)^2}}$$

where
$w_{ij}$ = weight of term $i$ in document $j$,
$tf_i$ = total term frequency of term $i$ in the entire collection,
$tf_{ij}$ = term frequency of term $i$ in document $j$,
$N$ = total number of documents in the collection,
$n_i$ = total number of documents in which term $i$ is found.
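The weighting computation can be sketched as follows (our illustration, operating on one day's collection held in memory; the thesis implementation writes its results to the INDEX_FILE described later):

import math
from collections import Counter

def assign_weights(articles):
    # articles: article_no -> list of index terms (after stopping,
    # stemming and phrase generation)
    N = len(articles)
    n = Counter(t for terms in articles.values() for t in set(terms))  # doc freq
    weights = {}
    for art, terms in articles.items():
        tf = Counter(terms)
        raw = {t: tf[t] * math.log(N / n[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # length norm
        weights[art] = {t: w / norm for t, w in raw.items()}
    return weights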

The keywords and their corresponding weights are stored in a structured manner, where each keyword is tied to the article numbers in which it appears and the corresponding weight of the keyword in each article. This structured file is used as input by the searching mechanism when building the search from the user profile.

Until now we have talked about how data is fetched and processed so that it can be used to achieve the final functionality, i.e. Retrieval of relevant news articles based on user profile. The next sub-section will explain how we achieve the last functionality provided by IAN.

4.3.3 Retrieval of relevant news articles based on user profile

Until now we have discussed how we retrieved articles and generated index terms. Now we consider how to use this information to select a subset of the news articles and to rank them according to their relevance to the user's interests.

All searching strategies are based on some kind of comparison between the query terms and the stored article terms. The differences between them lie in how queries are formulated, how they are compared (i.e. whether all terms in the query must match or partial matching is allowed), and how the results are presented (i.e. whether only the subject, the article numbers and subject, or the whole article is displayed).

The search strategy used in IAN for retrieval of relevant articles is based on the weighted boolean approach. We have incorporated features like:

• Accommodating the traditional boolean search with which most users are familiar.

• Ranking output in descending order of the weight(s) of the matched keyword(s) in the articles.

• Supplying the rank value (weight) to the user, to indicate the strength with which the respective article has been retrieved.

• Setting a threshold to control the size of the retrieved set.

• Placing a list of queries in a user profile.

• Queries need not be written as root words but can be written as normal words, with incorporation of phrases.

• Request terms (queries) are stemmed before being passed to the system (this is required because the index terms are stemmed).

To achieve the retrieval of relevant information, our system needs to carry out two tasks:

• Query formulation.

• Search mechanism.

To achieve these two tasks we have formulated two modules:

• UTOS (User TO System).

• SEARCH.

UTOS

As discussed above, the user provides each query in boolean form (e.g. neural network .and. query optimisation). Each term in a query may be written in natural-language form, and a term can be a multi-word phrase. In order to get the list of relevant articles we need to match the query terms with the index terms, so queries are pre-processed in the same way as articles: stopwords are removed, words are stemmed, and phrases are formed. The final query (e.g. neural_network .and. query_optim) is stored in a file called system_profile (for the queries in the system_profile file, see Appendix A), which is one of the inputs to the search procedure. (On-line search is also supported, where the user can submit a request at the prompt; the query is pre-processed as described above.)

The main objective here is to read queries from the user profile (see Appendix A) so that the user does not spend his precious time looking around for news. By default this process runs at night (the user can make it run at any time he/she likes, but by default it runs each night at 1am).

SEARCH

The main tasks of this module are:

• Represent the index terms in such a way that the matching process becomes efficient.

• Take a query as input.

• Match the query terms with the index terms.

• Display the result.

Represent index terms: For efficient matching of request terms and index terms we need an efficient data structure. After calculating the weights, the list of keywords is stored in a structured manner in a file called INDEX_FILE. Each keyword is tied to the article numbers in which it appears and its corresponding weights. The format is:

Term1 Article_no11 Weight11 Article_no12 Weight12 ... delimiter.

e.g.

yellow_page 4186 56694 3847 56694 4172 37796 5637 18898 4161 18898 0 0.

The INDEX_FILE is used as input to the search program, where each term, along with its article numbers and weights, is stored in the structure shown in Figure 4.5.

Figure 4.5: Keyword structure during search

The data structure used here is a linked list. Hashing is used to index keywords on their first character. The subject heading list contains the article number and the subject heading. The subject headings are hashed separately to avoid data redundancy (i.e. if we had included the subject heading in the linked-list structure, the same subject heading line would be repeated for each keyword in an article).

This provides a simple and efficient way of searching. One advantage is that the structure is built in memory, and is efficient to build and search. Hashing reduces the searching time, and the separate hashing of subject headings avoids data redundancy.
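A sketch of how such a structure can be loaded (a Python dict stands in for the first-character hashing and linked lists; the file layout follows the INDEX_FILE format above):

from collections import defaultdict

def load_index(path):
    # Each line: term art_no weight art_no weight ... ending with the
    # "0 0" delimiter (cf. the yellow_page example above).
    postings = defaultdict(list)
    with open(path) as f:
        for line in f:
            term, *rest = line.split()
            nums = [float(tok.rstrip(".")) for tok in rest]
            for art, w in zip(nums[0::2], nums[1::2]):
                if art == 0 and w == 0:     # delimiter: end of posting list
                    break
                postings[term].append((int(art), w))
    return postings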

Matching: Prior to matching we need to break the query into tokens, separating boolean operators and query terms. We use Lex and Yacc for this: the Yacc parser builds a query tree from the query terms and operators. The algorithm is as follows:

• The search request is broken into a sequence of instructions. This sequence of instructions produces a query tree for comparison with the list of article index terms.

• The output of this comparison is a list of article numbers with their weights and subject lines.

The parser generates a query tree as follows: for each term in the query a node is created. If a boolean operator is found between query terms then a boolean node is created, which contains pointers to the nodes of the adjacent query terms. The more complex the query, the more elaborate the tree constructed. For example, for "A .and. B" the query tree would be as shown in Figure 4.6, whereas for "A .and. B .or. C .and. D" the query tree would be as shown in Figure 4.7.

Figure 4.6: Query tree: representing A .and. B

Figure 4.7: Query tree: representing A .and. B .or. C .and. D

With this approach we can handle large queries. It also helps in speeding up the comparison process.

The comparison is carried out by comparing the query terms at the leaves of the query tree with the keyword list. If a match is found, the article is set aside. Traversal continues until the root is reached. At the end, depending on the list of selected articles (i.e. articles which contain the query terms as keywords and also satisfy the boolean conditions specified in the query), the final result is displayed.
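The evaluation of such a tree can be sketched as follows (tuple-based trees stand in for the Yacc-built nodes; matched-keyword weights are summed for the final ranking):

def evaluate(node, postings):
    # A leaf is a term: look up its (article, weight) posting list.
    if isinstance(node, str):
        return dict(postings.get(node, []))
    # An internal node: ('and' | 'or', left subtree, right subtree).
    op, left, right = node
    l, r = evaluate(left, postings), evaluate(right, postings)
    arts = l.keys() & r.keys() if op == "and" else l.keys() | r.keys()
    # Sum matched-keyword weights, used later to rank the display.
    return {a: l.get(a, 0.0) + r.get(a, 0.0) for a in arts}

# "A .and. B .or. C .and. D", cf. Figure 4.7:
tree = ("or", ("and", "A", "B"), ("and", "C", "D"))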

Display the result: The display of results plays an important role in evaluating the system. As explained earlier, we need to display the results so that the user can get an idea of the importance of each article. We display the output in descending order of the weights of the keywords in an article. The output contains the subject of the article, the sum of the weights of the keywords (those matching the query terms) in that article, and the article number. The actual format is:

Article number   Weight   Subject heading
157              73.379   Subject: Importing Format from Windows Phone Lo ...
3133             66.166   Subject: Job listing

Chapter 5

Experiments

This chapter reports on experiments to evaluate the classification and selection methods used in the IAN system, as well as the use of phrases as index terms.

5.1 Evaluation methods

Since the inception of information retrieval systems in the 1960s, there has been considerable interest in trying to test and evaluate the performance of information retrieval systems. So far, however, there is no clear guidance about how to go about this [Su92].

Several criteria and a great number of measures have been proposed and used for evaluating information retrieval performance. However, there is a lack of agreement as to what constitutes successful information retrieval performance or which are the best existing evaluation measures. In general there are three major approaches or criteria underlying existing performance measures: Relevance; Utility; User satisfaction.

Among these measures, Precision (the proportion of retrieved documents that are relevant) and Recall (the proportion of relevant documents that are retrieved) are the best known, most commonly accepted and most widely applied.
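In symbols, using the standard definitions:

$$\mathrm{Precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|}, \qquad \mathrm{Recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}$$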

However, there are several objections to the use of these measures for evaluation. First of all, both recall and precision are needed in any evaluation, and they have been found to have an inverse relationship with each other. Secondly, even though precision is easy to apply, recall is very difficult to implement because it requires knowledge of all the relevant documents in the collection for each query. Also, the word "relevance" carries different meanings depending on the context in which it is used. When calculating recall and precision using a small known data set (where a very small number of articles are retrieved), relevance is judged directly, without considering the usefulness of the documents retrieved for the stated request. Cooper [Coo76] argues that the recall measure, which depends on the relevance of items that have not been retrieved, may be inappropriate in all but the rare cases when a user is interested in a truly exhaustive search that catches everything that may relate to the query. This is not the case for a user looking for news articles in a very large collection.

In situations where recall is very difficult to calculate, one is left with little option but to interpret relevance in a way that is appropriate to one's concern. In such situations we equate relevance with usefulness, from the perspective of the user's view of the problem or information need. Relevance is then determined by the user's relevance judgement: if the user says that a document is relevant then, at that time, in that context, the document is relevant [Asl66]. More recently there has been increasing acceptance that stated requests are not the same as information needs, and that consequently relevance should be judged in relation to needs rather than stated requests [RHB92].

In our case we cannot explicitly calculate recall, as it is impossible to identify all the relevant articles in the collection for each query. This is due to the fact that each day we handle around 30,000 to 35,000 articles; to find all relevant articles for a given query we would need to manually examine each article, which is practically impossible.

5.2 Evaluation procedure

In the text processing environment, performance is often measured using recall and precision values, where recall measures the ability of the system to retrieve useful documents, while precision conversely measures the ability to reject useless material. Theoretically a system is said to be good if it has high recall as well as high precision. To calculate recall and precision we need to know how many articles in the whole collection are relevant to a given query; from that one can find out how many relevant articles were retrieved and how many were missed by the system. To find all relevant documents, the data collection should be either a small sample or a known static data set, where one can easily find the total number of articles relevant to a particular query. In the case of IAN the data collection is very large and dynamic (it changes every day). Thus it is impossible to know the total number of relevant articles, which makes it hard to evaluate the system on the basis of recall and precision.

Two approaches were used for evaluating the IAN system. The first was to evaluate the system performance using user-determined relevance, whereby an interest profile is taken from the user and run on the system. The retrieved articles are given to the user to judge their relevance; here relevance is based solely on the usefulness of the articles retrieved for the given query. In the second approach, we compared IAN's performance with an existing system (the Stanford Information Filtering Tool, which uses the WAIS similarity matching index). We considered not only the percentage of relevant articles retrieved by each system, but also the percentage of relevant articles missed by each system.

5.2.1 System performance method

The experimental evaluation of the IAN system consisted of using the system for a one-week period¹, filtering articles from all Usenet newsgroups (see the newsgroup list in Appendix A). Five students and a staff member were asked to provide their user profiles (see user profiles in Appendix A). Each day, all user profiles were submitted as queries to the system. The output was given to each user to evaluate the relevance of the articles; they were asked to mark the relevance of each article in terms of percentage importance. Table 5.1 shows the results from the experiment using this method.

¹ From 24th March 1995 to 30th March 1995.

Profile     No of Queries   Matched Articles   Relevance
Profile 1   2               338                61.66
Profile 2   9               101                47.12
Profile 3   2               50                 64.65
Profile 4   6               151                53.00
Profile 5   5               66                 51.10
Profile 6   7               67                 52.55

Table 5.1: Results from system performance method

The table indicates the number of queries each profile contains and the total number of articles found for those queries in one week. The relevance figure is the mean relevance rating given by each user to the articles they received for their queries.

5.2.2 Comparison method

We compared SIFT (Stanford Information Filtering Tool, version 1.1) with our IAN system. The procedure was to obtain the source code of SIFT (version 1.1), which was available from Stanford University, and install it on our local machine. This was done so that the same articles could be given as input to both systems at the same time.² The same set of queries³, provided by 5 students and a staff member, was given to both SIFT and IAN. The output from both systems was given to the users to rate the relevance of each article retrieved.

We went further and submitted phrase queries (formed out of the same single-term boolean queries) to IAN, and the output was again given to the users for evaluation.

Furthermore, we checked for relevant articles that were missed. We assume that the total number of relevant articles for a query is the number of distinct articles retrieved by IAN and SIFT for that query (considering the phrase queries as well), i.e. the union of the result sets, less the duplications.
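In terms of the column headings of Table 5.2, where S, I and P denote the sets of articles retrieved by SIFT, by IAN, and by IAN using phrases respectively, this assumed pool of relevant articles is the union of the three result sets, computed by inclusion-exclusion:

$$|S \cup I \cup P| = |S| + |I| + |P| - |S \cap I| - |S \cap P| - |I \cap P| + |S \cap I \cap P|$$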

This experiment was run for 3 days⁴ using the same user profiles on the same set of newsgroups (both listed in Appendix A). The appendix also contains the results for each individual day. Table 5.2 shows the average over the 3 days, obtained by averaging the daily values.

5.2.3 Discussion

From the above experiments, we found that the SIFT system produced more articles (444) than IAN (301); the mean relevance rating of the articles found by SIFT was 41.29%, while for IAN it was 51.13%. Also, the omission of relevant

²Initially we subscribed to the SIFT service at Stanford and submitted queries to compare the output. However, the set of newsgroups at Stanford is quite different to those on the local server, and so it was necessary to install SIFT locally.
³For the list of queries, see Appendix A.
⁴30th March, 4th April and 6th April 1995.

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1     18   11    -    8    -    -    -      43.6   66.6    -     45     22.76   -
  2     13    7    -    6    -    -    -      16     38      -     40      7      -
  3     10    9    -    7    -    -    -      48.05  52.05   -     35     23      -
  4     32   17   15   13   10   14   10      45.3   65.83  71.2   53.66  42.76  32.93
  5      5    6    -    4    -    -    -      68.66  66.66   -     46.6   30      -
  6      2    2    -    2    -    -    -      20     40      -     15     15      -
  7     16   10    1    8    1    1    1      39.7   69.3   90     26.1   10     58
  8     96   58   33   10    6   10    3      37.1   45.29  61.1   47.25  40.41  35.29
  9     84   39   13   30    8    9    7      49.86  68.03  84.66  46.5   45.43  54.5
 10      6    8    -    6    -    -    -      66.1   55.43   -     26.66   0      -
 11      5    4    -    4    -    -    -      33.95  38.3    -      0      0      -
 12     16    9    -    5    -    -    -      20.66  40      -     41.66  19.33   -
 13      8   14    -    8    -    -    -      36.66  38      -     23.33   0      -
 14     12   11    1    9    1    1    1      40     48.56  50     33.33  16     39
 15     11    8    -    5    -    -    -      28.1   36.66   -     20      4      -
 16     42   21    3   16    3    3    3      26.77  41.2   36     29.66  16.5    0
 17      3    4    -    3    -    -    -      77.5   78.3    -     40      0      -
 18     11   15    1    5    1    1    1      35.66  31     30     27.33  12.66   -
 19     28   32   25   19   17   18   16      49.76  51.26  58.03  57.7   54.33  37.43
 20     26   16    -   14    -    -    -      35.76  47.23   -     33.33  18      -

Total  444  301   91  182   47   57   42      41.29  51.13  65.22  33.15  18.98  41.21

Table 5.2: Results from comparison method - an average of 3 days

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total  444  301   91  182   47   57   42      41.29  51.13  65.22  33.15  18.98  41.21

Table 5.3: Result: Evaluation of performance

S       Total articles fetched by SIFT
I       Total articles fetched by IAN
P       Total articles fetched by IAN* (IAN*: here the query is formed using phrases)
S∩I     Total articles common to both SIFT and IAN
S∩P     Total articles common to both SIFT and IAN*
I∩P     Total articles common to both IAN and IAN*
S∩I∩P   Total articles common to SIFT, IAN and IAN*
%S      Mean relevance rating of articles found by SIFT
%I      Mean relevance rating of articles found by IAN
%P      Mean relevance rating of articles found by IAN* using phrases
%~S     Mean relevance rating of articles missed by SIFT
%~I     Mean relevance rating of articles missed by IAN
%~P     Mean relevance rating of articles missed by IAN* using phrases

articles in IAN was 19%, as compared to 33% for SIFT. It appears that the SIFT system focuses mainly on achieving high recall. This can be seen from the fact that SIFT uses a primitive weighting scheme which ignores many of the factors considered in Section 2.2.2. In SIFT, the weight depends solely on the frequency of occurrence of a word within an article: the first time a word occurs it receives a weight of 5, each subsequent occurrence adds a weight of 1, and words in the headline are worth an extra 10 points [Kah]. There is no consideration of the overall frequency of occurrence of a word across the collection when calculating the weight; this leads to assigning high weights to common words. SIFT also overlooks the length factor, which means that larger articles have a better chance of retrieval; this ultimately leads to missing small but relevant articles.
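To make the contrast concrete, the following is a minimal sketch of SIFT's scheme as described in [Kah]. It is an illustration only, not SIFT's actual code; the representation of an article as a word array and the headline test are our assumptions:

    #include <string.h>

    /* SIFT-style weight as described in [Kah]: 5 points for the first
     * occurrence of a word in an article, 1 point for each repetition,
     * plus 10 points if the word appears in the headline.  Note the
     * absence of any collection-frequency or article-length factor. */
    int sift_weight(const char *word, const char **article_words, int n,
                    int in_headline)
    {
        int weight = 0, seen = 0, i;

        for (i = 0; i < n; i++) {
            if (strcmp(word, article_words[i]) == 0) {
                weight += seen ? 1 : 5;   /* first occurrence scores 5 */
                seen = 1;
            }
        }
        if (seen && in_headline)
            weight += 10;                 /* headline bonus */
        return weight;
    }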

In the case of IAN, the weight calculation considers not only the term frequency of a word within an article but also the overall frequency of occurrence of that word in the collection. The size of the article is also considered when calculating the weight of each keyword. This helps to reduce the fetching of irrelevant articles.
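Concretely, the scheme we settled on (the default weighting program, WEIGHT_ASSIGNER89 in Appendix A, implementing [SB88]) is tf-idf with cosine length normalisation: for term $j$ in article $i$,

$$w_{ij} = \frac{tf_{ij} \cdot \log(N/n_j)}{\sqrt{\sum_{k \in \mathrm{vector}} \bigl(tf_{ik} \cdot \log(N/n_k)\bigr)^2}}$$

where $tf_{ij}$ is the frequency of the term within the article, $N$ is the total number of articles, and $n_j$ is the number of articles containing term $j$; the denominator supplies the length normalisation.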

The above observations are supported by the feedback we got from the users: even though SIFT generated more articles than IAN, it still missed more relevant articles. This suggests that IAN is a more effective information filter than SIFT.

The results of the experimentation on phrases show that they discriminate heavily. The mean relevance of retrieval using phrases was high (65.22%); however, the omission of relevant articles was large as well (41.21%). The results suggest that phrases should be used with caution. In particular, phrases should only be used in situations where a query term has a very broad meaning (e.g. a query term like "algorithm" or "network" carries very broad scope; to narrow the scope, the use of phrases like "linear algorithm" or "neural network" is desirable).

Chapter 6

Conclusions

In this chapter, we review our research objectives, describe how well they have been achieved, and draw conclusions from our experiments conducted on the IAN and SIFT systems. Finally, we discuss future work that could be done to improve the current system.

6.1 Review of research objectives

The research carried out in this project satisfies all of the objectives stated in Section 1.2:

• Different text analysis techniques and searching strategies used in IR

were thoroughly reviewed. We were then able to make judgements on

the best techniques and strategies to incorporate in our research.

• Various filtering mechanisms were analysed, and the most suitable filtering scheme for our needs was implemented.

• Various ranking algorithms were tried, and we finally settled on the algorithm proposed by Salton et al. [SB88].

• A prototype news fetching program was developed along with the news

filtering system.

• Performance Evaluation of IAN was carried out.

• The quality of retrieval using SIFT and IAN (using single terms as well

as phrases) was compared to determine their suitability in the domain

of Usenet news.

6.2 Conclusion

From our work we conclude the following:

• A huge amount and variety of information passes through Usenet news each day. An automatic filtering system and a searching strategy are crucial for making better use of this data.

• Removing stopwords significantly helps in reducing the number of indexed terms in an article, which ultimately helps search algorithms to identify relevant documents. However, care is necessary in selecting the proper stopwords. Stopwords should not be selected solely on the basis of frequency, for some content-bearing words are very frequent in the English language. The domain should also be considered in the selection of stopwords, as certain index terms that are uncommon in general use may be exceedingly common in a specific context, which might merit their removal.

• Stemming helps in conflating the various offshoots of a root stem. However, in some cases stemming changes the meaning of the intended word. The stemming algorithm should therefore be selected according to the nature of the vocabulary used in the collection.

• Identifying footers is difficult as there is no standard footer structure

in Usenet articles.

• Removal of some header lines and of the footer enhances precision significantly, because non-relevant words are found in headers and footers.

• The most appropriate weighting formula is one which considers not only the term frequency when calculating the weight of a keyword, but also the overall frequency and the length factor. Care should be taken in choosing a weighting scheme.

• Hashing of keywords, building a query tree for boolean expressions, and efficient use of RAM significantly improve the search time.

• The text analysis of the IAN system is appropriate for Usenet data. We found that although the mean relevance rating was low, it did not miss many¹ relevant articles.

¹The mean relevance rating of missed articles is 19%.

• Phrases are too discriminatory. Although the mean relevance rating is high (65.22%), the use of phrases tends to cause many relevant articles to be missed (41.21%).

• The IAN system compares favourably against SIFT: the relevance was higher overall² and the misses were lower.

6.3 Future work

A lot can be done in this area, as no single algorithm provides the correct solution for all situations and domains. Work can be done to overcome the drawbacks of stemming.

Work can also be done in the area of adaptive user profiles, where the user profile changes automatically depending on the success history of user queries. User actions can be monitored and the user profile updated accordingly.

Various AI techniques could be used to determine the precise location of footers, which could then be eliminated. It is impossible to recognise every footer with simple pattern matching, since not all editors put a recognisable pattern at the beginning of the footer.

Different techniques could be used to assign weights to index terms depending on their location in the article: cue words, exclamations, upper/lower case, etc.

²The mean relevance rating for IAN was 51.13%.

Research can be done in the area of article creation, where provision may be given to the author to mark words as index terms during article creation.

A thesaurus could provide a mapping between phrases and terms to improve recall. However, this area needs further exploration.

Appendix A

Outline of the program

List of programs and files:

ARTICLE_FETCHER Executable program generated by ARTICLE_FETCHER.c. It reads input from a file called "inputfile", which contains the list of selected newsgroups and the date and time from which you want to collect posted articles.

Function: It connects to the local NNTP server and fetches articles from the specified newsgroups which were posted after the specified date and time. The output is stored in a directory called "ARTICLES".

DATA_CLEANER Executable program generated by DATA_CLEANER.c. It reads articles as input (from the ARTICLES directory) and pipes the output to the stopper program.

Function: Removes a few non-informative lines from the articles, identifies the footer and removes it, and puts markers to identify the end of the subject and the beginning of the body of an article.

stopper An executable program which needs the following files:

stop_makefile Builds object code and test drivers
stop.wrd A list of 461 stop words used for testing
stop.c Code for building and running lexical analysers that stop words
stop.h Header for the lexical analyser module
strlist.c Code for a utility package for handling lists of strings
strlist.h Header for the string list utility module
stopper.c A test program

Function: stopper (executable program) is used for stopword removal for automatic indexing and query processing.

Contributed by: C. Fox, James Madison University; modified by C. Fox, July 1991; remodified by Y. Mansuri, May 1994. This source code is from the book Information Retrieval: Data Structures and Algorithms, edited by William B. Frakes and Ricardo Baeza-Yates, Prentice-Hall, 1992, ISBN 0-13-463837-9.

stemmer An executable program which needs the following files:

stem_makefile Builds object code and test drivers
stem.c Module containing the Porter stemmer
stem.h Header for the stemmer module
stemmer.c Stemmer test program

Function: stemmer (executable program) is used to remove suffixes, implementing the Porter stemming algorithm.

Written by: C. Fox, 1990; contributed by William B. Frakes; modified by Y. Mansuri, May 1994. This source code is from the book Information Retrieval: Data Structures and Algorithms, edited by William B. Frakes and Ricardo Baeza-Yates, Prentice-Hall, 1992, ISBN 0-13-463837-9.

PHRASE_MAKER Executable program generated by PHRASE_MAKER.c. It takes input from standard input (the output of stemmer) and appends its output to a file called "keyword".

Function: This program generates single keywords as well as phrases, which are used as index words.
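As an illustration only - a minimal sketch of the behaviour, not the actual PHRASE_MAKER.c - the following assumes, from the system_profile examples later in this appendix (e.g. "neural_network"), that a phrase is a pair of adjacent stemmed words joined by an underscore:

    #include <stdio.h>
    #include <string.h>

    /* Sketch: read stemmed words one per line from stdin; emit each
     * single word as an index term and each adjacent pair joined by
     * '_' as a phrase (cf. "neural_network" in system_profile). */
    int main(void)
    {
        char prev[128] = "", word[128];

        while (scanf("%127s", word) == 1) {
            printf("%s\n", word);              /* single-word index term */
            if (prev[0] != '\0')
                printf("%s_%s\n", prev, word); /* two-word phrase */
            strcpy(prev, word);
        }
        return 0;
    }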

WEIGHT_ASSIGNER89 Executable program generated by WEIGHT_ASSIGNER.c. It takes the appended keyword file as input and generates an output file called "INDEX_FILE". This output is structured.

Function: Assigns a weight to each index term using the following formula:

$$w_{ij} = \frac{tf_{ij} \cdot \log(N/n_j)}{\sqrt{\sum_{j \in \mathrm{vector}} \bigl(tf_{ij} \cdot \log(N/n_j)\bigr)^2}}$$

Comment: proposed by Salton, G. in 1989.

WEIGHT_ASSIGNER91 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the following formula:

$$w_{ik} = \frac{\bigl(0.5 + 0.5\,\frac{tf_{ik}}{\mathit{maxtf}_i}\bigr) \cdot \log(N/n_k)}{\sqrt{\sum_{k=1}^{t} \Bigl(\bigl(0.5 + 0.5\,\frac{tf_{ik}}{\mathit{maxtf}_i}\bigr) \cdot \log(N/n_k)\Bigr)^2}}$$

Comment: proposed by Salton, G. in 1991.

WEIGHT_ASSIGNER76 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the formula proposed by Sager, W.K.H. in 1976 [SL76] and used in IBM's STAIRS system.

WEIGHT_ASSIGNER22 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the following formula:

$$w_t = f_{in} \cdot d_i \cdot \frac{F_i}{l_{in}}$$

Comment: formula used in [SL76, NMK81].

WEIGHT_ASSIGNER25 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the formula used in [NMK81].

WEIGHTSAME This is another program for calculating and assigning weights to keywords; the difference is in the weighting formula. The source is generated by WEIGHTSAME.c.

Function: Assigns a weight to each index term using the following formula:

$$w_t = \frac{f_{in}}{f_{max}}$$

Comment: formula used in [SB88].

search An executable program which needs the following files:

search.c The main program, which takes INDEX_FILE as input. It builds a structure in RAM to facilitate matching of index words against the query. It also reads the "subject_list" file to build the hash table for subject lines.
gv.l A LEX program which generates tokens from the query, separating the boolean operators from the query terms.
gv.y A YACC program which first builds a parse tree for the query terms, then compares it with the index terms. After resolving the boolean logic of the query, it displays the article number, weight and subject line of the selected articles.
search.h Header file.
makefile First cleans, then makes the object code.

Function: The overall function is to provide the searching mechanism.
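As an illustration of the query tree that gv.y builds (cf. Figures 4.6 and 4.7), here is a minimal sketch in C; the node layout and function names are our assumptions, not the actual gv.y code:

    /* Boolean query tree: leaves hold index terms, internal nodes
     * hold .and. / .or. operators (cf. Figures 4.6-4.7). */
    typedef enum { TERM, AND, OR } NodeType;

    typedef struct QNode {
        NodeType type;
        const char *term;            /* used when type == TERM  */
        struct QNode *left, *right;  /* used for AND / OR nodes */
    } QNode;

    /* Stand-in for the RAM hash-table lookup built by search.c. */
    extern int term_in_article(const char *term, int article_no);

    /* Does the given article satisfy the query rooted at q? */
    int matches(const QNode *q, int article_no)
    {
        switch (q->type) {
        case TERM: return term_in_article(q->term, article_no);
        case AND:  return matches(q->left, article_no) &&
                          matches(q->right, article_no);
        case OR:   return matches(q->left, article_no) ||
                          matches(q->right, article_no);
        }
        return 0;
    }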

UTOS An executable program which needs the following files:
utos.l A lexical analyser which generates tokens for stemming the query words and passes the tokens to the stemming function.
utos.c Contains the Porter stemming module.

user_profile Contains the list of user queries, constructed as phrases and/or using boolean operators. The list of queries used in our experiments is given below:

Profile 1:
learning .and. algorithm .and. neural .and. network
heterogeneous .and. database
regression .and. linear .or. non-linear
neural network
neural .and. network
backpropagation .and. network
query optimisation .or. query optimization

Profile 2:
ingres database
ingres .and. database

Profile 3:
information .and. systems
information systems
oracle database
oracle .and. database

Profile 4:
spatial .and. database
temporal .and. spatial
multimedia .and. object
symbolic .and. representation

Profile 5:
object recognition
object .and. recognition
model .and. acquisition
model learning
model .and. learning

Profile 6:
australian football .and. rules .or. aussie rules
information filtering
information .and. filtering
functional programming
functional .and. programming
computer .and. science .and. education

system_profile After a query is stemmed, the output is stored in this file. It is the modified version of user_profile which goes as input to the search program, shown below:

Profile 1:
learn.and.algorithm.and.neural.and.network
heterogen.and.databas
regress.and.linear.or.non-linear
neural_network
neural.and.network
backpropag.and.network
queri_optimis.or.queri_optim

Profile 2:
ingr_databas
ingr.and.databas

Profile 3:
inform.and.system
inform_system
oracl_databas
oracl.and.databas

Profile 4:
spatial.and.databas
tempor.and.spatial
multimedia.and.object
symbol.and.represent

Profile 5:
object.and.recognit
object_recognit
model.and.acquisit
model_learn
model.and.learn

Profile 6:
australian_football.and.rule.or.aussi_rule
inform_filter
inform.and.filter
function_program
function.and.program
comput.and.scienc.and.educ

Function: To formulate the query.

List of shell scripts and files:

AF To run the ARTICLE_FETCHER program.
ALL To run all programs, i.e. when executed it fetches articles, analyses them, generates keywords, assigns weights, formulates the queries given in user_profile, and searches for relevant articles as per user_profile.
MAKE_CLEAN Runs the programs which clean the fetched articles, analyse them, and generate keywords.
CW_SG89, CW_SG91, CW_SG19, CW_SG22, CW_SJ25 Can be run to calculate weights; each one runs a different weight-assigning program. One can be selected (by default CW_SG89). These scripts can be included in the ALL or MAKE_CLEAN scripts, or run separately.
runutos This script runs the utos program, which modifies the query.

subject_list Contains the list of subjects along with article numbers. This file is read by search.c while generating the hash table for subjects.

List of newsgroups: alt.*, aus.*, bionet.*, bit.*, biz.*, comp.*, ddn.*, gnu.*, ieee.*, k12.*, misc.*, news.*, rec.*, sci.*, soc.*, system*, talk.*, unsw.*, vmsnet.*, !*binaries*

The newsgroups which contain binary files were not selected.

Appendix A

Filtering Statistics

Different experiments were conducted to see the effect of filtering at each step, i.e. from the collection of articles to the display of results. For this, we collected news articles, analysed them, and collected the results at each step of filtering to see how each step helps in the selection of index words. Tables A.1 and A.2 show some statistical information; they give average results for a 31-day data collection.

Figure A.1 shows the flow of articles per day. The average flow of articles is around 2000 to 2300² every two hours. It should be mentioned that the number of articles posted is not constant; it varies from around 18000 to 40000 articles per day¹, but the posting rate per two hours is almost constant.

¹March 18: 30924 articles; March 24: 27356 articles; March 26: 36278 articles.
²This flow is with respect to articles arriving on our NNTP server, nntp.unsw.edu.au.

DATA                                                                   Size after processing

Total number of articles                                                       30518
Total amount in MB                                                              92.8
Total data in MB after removing some header fields, comments, footers         50.49
Total data in MB after removing stop words                                      27.4
  (and keeping only those words having some alphabetic characters)
Total data in MB after stemming                                                24.46
Total data in MB after generating index terms (single)                         24.46
Total data in MB after generating phrases                                      62.56
Total data in MB after assigning weights                                       59.37

Table A.1: Statistical information: information of data in MB

DATA                                                                   Size after processing

Total number of articles                                                       30518
Total amount in MB                                                              92.8
Total number of words after removing some header, comment, footer fields    5163919
Total number of unique words after removing some header fields               874395
Total number of unique words after removing stop words                       863314
  (keeping all words having some alphabetic characters)
Total number of unique words after stemming and removing numbers             530406
Total number of index words (single)                                         530406
Total number of index words (double)                                        1145710

Table A.2: Statistical information: information of data in terms of words

[Figure A.1 is a scatter plot, labelled 'Time_V_N_articles', of the number of articles posted (0-35000) against time (0-25 hours).]

Figure A.1: Rate of posting of articles

Figure A.2 shows the change in vocabulary every two hours, i.e. the number of unique words generated every two hours.

[Figure A.2 is a scatter plot, labelled 'Uniq_Word_V_Articles', of the cumulative number of unique words (40000-220000) against time (0-35).]

Figure A.2: Increase in unique words per hour

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1     10    6    -    4    -    -    -      42     53.3    -     65     38.3    -
  2      5    5    -    5    -    -    -      16     16      -      0      0      -
  3      7    6    -    4    -    -    -      42.1   49.1    -     35     23      -
  4      7    4    3    3    3    3    3      49     62.5   66     50     35     38
  5      3    3    -    2    -    -    -      56.6   50      -     79     90      -
  6      2    1    -    1    -    -    -      40     50      -      0     30      -
  7      6    6    1    4    1    1    1      56     65     90     45     20     58
  8     34   25    3    4    1    1    1      51.6   50.4   73.3   52.6   31.29  39.64
  9     30   18    5   13    2    3    2      44     55.5   84     70     47     46.5
 10      2    3    -    2    -    -    -      55     53      -     50      0      -
 11      1    1    -    1    -    -    -      40     40      -      0      0      -
 12      8    5    -    4    -    -    -      31     40      -     40     30      -
 13      5   10    -    5    -    -    -      40     44      -     40      0      -
 14     10    7    1    6    1    1    1      35     45.7   50     50     33     39
 15      3    4    -    2    -    -    -      33.3   40      -     30      0      -
 16     15    6    3    6    3    3    3      26     41.6   36      0      0      0
 17      2    3    -    2    -    -    -      75     76.6    -     80      0      -
 18      6    7    1    1    1    1    1      24     27     30     47     18      -
 19      8   12    8    6    5    5    4      61     50     67.5   41.6   67.5   25.8
 20     14    5    -    4    -    -    -      24     54      -     60     14      -

Total  178  137   24   79   17   18   16      42.08  48.17  62.1   40.3   23.85  41.15

Table A.3: Results from comparison method - Day 1

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total  178  137   24   79   17   18   16      42.08  48.17  62.1   40.3   23.85  41.15

Table A.4: Result: Evaluation of performance for Day 1

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1      5    3    -    3    -    -    -      46     76.6    -      0      0      -
  2      8    2    -    1    -    -    -      16     60      -     40      7      -
  3      3    3    -    3    -    -    -      55     55      -      0      0      -
  4     16    5    5    4    4    5    4      44.3   72     72     70     35.8   35.8
  5      1    2    -    1    -    -    -      70     70      -     70      0      -
  6      -    -    -    -    -    -    -       -      -      -      -      -      -
  7      4    1    -    1    -    -    -      25     90      -      0     10      -
  8     33   17   23    1    1    6    1      25.7   46.47  54.3   48     41     33
  9     29   12    5    8    3    3    2      44     72     80     69.5   33.7   58
 10      3    3    -    3    -    -    -      53.3   53.3    -      0      0      -
 11      4    3    -    3    -    -    -      27.1   36.6    -      0      0      -
 12      7    2    -    1    -    -    -      11     45      -     50      8      -
 13      2    2    -    2    -    -    -      40     40      -      0      0      -
 14      4    3    -    2    -    -    -      25     40      -     50     15      -
 15      7    3    -    2    -    -    -      21     40      -     30     12      -
 16     20   10    -    7    -    -    -      30.33  40      -      0      0      -
 17      -    -    -    -    -    -    -       -      -      -      -      -      -
 18      4    5    -    3    -    -    -      53     46      -     40     20      -
 19      8    7    6    2    1    2    1      35     50     53     51.5   45.5   36.5
 20     11    9    -    9    -    -    -      46.3   47.7    -      0     40      -

Total  169   92   39   56    9   16    8      37.3   54.48  64.82  30     16.66  40.7

Table A.5: Results from comparison method - Day 2

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total  169   92   39   56    9   16    8      37.3   54.48  64.82  30     16.66  40.7

Table A.6: Result: Evaluation of performance for Day 2

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1      3    2    -    1    -    -    -      43     17      -     70     30      -
  2      -    -    -    -    -    -    -       -      -      -      -      -      -
  3      -    -    -    -    -    -    -       -      -      -      -      -      -
  4      9    8    7    6    3    6    3      43     63     75.7   41.25  57.5   25
  5      1    1    -    1    -    -    -      80     80      -      0      0      -
  6      0    1    -    1    -    -    -       0     30      -     30      0      -
  7      6    3    0    3    -    -    -      38.3   53      -     33.3    0      -
  8     29   16    7    5    4    3    1      34.1   39     55.7   41.15  58.95  33.25
  9     25    9    3    9    3    3    3      61.6   76.6   90      0     55.6   59
 10      1    2    -    1    -    -    -      90     60      -     30      0      -
 11      -    -    -    -    -    -    -       -      -      -      -      -      -
 12      1    2    -    0    -    -    -      20     35      -     35     20      -
 13      1    2    -    1    -    -    -      30     30      -     30      0      -
 14      1    1    -    1    -    -    -      60     60      -      0      0      -
 15      1    1    -    1    -    -    -      30     30      -      0      0      -
 16      7    5    -    3    -    -    -      24     42      -     50     17.5    -
 17      1    1    -    1    -    -    -      80     80      -      0      0      -
 18      1    3    -    1    -    -    -      30     20      -     15      0      -
 19     12   13   11   11   11   11   11      53.3   53.8   53.6   80     50     50
 20      1    2    -    1    -    -    -      40     40      -      0     40      -

Total   97   72   28   47   21   23   18      44.5   50.75  68.75  29.15  16.44  41.8

Table A.7: Results from comparison method - Day 3

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total   97   72   28   47   21   23   18      44.5   50.75  68.75  29.15  16.44  41.8

Table A.8: Result: Evaluation of performance for Day 3

S       Total articles fetched by SIFT
I       Total articles fetched by IAN
P       Total articles fetched by IAN* (IAN*: here the query is formed using phrases)
S∩I     Total articles common to both SIFT and IAN
S∩P     Total articles common to both SIFT and IAN*
I∩P     Total articles common to both IAN and IAN*
S∩I∩P   Total articles common to SIFT, IAN and IAN*
%S      Mean relevance rating of articles found by SIFT
%I      Mean relevance rating of articles found by IAN
%P      Mean relevance rating of articles found by IAN* using phrases
%~S     Mean relevance rating of articles missed by SIFT
%~I     Mean relevance rating of articles missed by IAN
%~P     Mean relevance rating of articles missed by IAN* using phrases

Bibliography

[Art69] S Artandi. Computer indexing of medical articles. Journal of Documentation, 20:227-233, 1969.

[Asl66] Aslib Proceedings. The Relevance of Relevance to the Testing and Evaluation of Document Retrieval Systems, volume 18/11. Rees, A.M, 1966.

[Bar69] J.K Barkla, editor. Construction of Weighted Term Profiles by Measuring Frequency and Specificity in Relevant Items, Cranfield, Bedford, 1969. Cranfield Conference on Mechanized Information Storage and Retrieval Systems.

[BC92] N.J Belkin and W.B Croft. Information filtering and information retrieval: Two sides of the same coin? Communication of the ACM, 35(12):29-38, December 1992.

[BJ94] R.S Brewer and P.M Johnson. Toward collaborative knowledge management within large, dynamically structured information systems. Technical report, Collaborative Software Development Laboratory, Department of Information and Computer Sciences, University of Hawaii, 1994. Found from Unified Computer Science Tech Report Index search with keyword: Usenet.

[Boo85] A Bookstein. Probability and fuzzy-set application to information retrieval. Annual Review of Information Science and Technology, pages 117-151, 1985. Williams, M.

[BP86] C Berrut and P Palmer. Solving grammatical ambiguities within a surface syntactical parser for automatic indexing. In F Rabitti, editor, Proc. of Ninth International Conference on Research and Development in Information Retrieval, pages 123-130, Pisa, Italy, September 1986.

[BS75] A Bookstein and D.R Swanson. A decision theoretic foundation for indexing. Journal of the American Society for Information Science, pages 45-50, Jan-Feb 1975.

[CDBK86] Y Chiaramella, B Defude, M.F Bruanet, and D Kerkouba. A full text information retrieval system. In F Rabitti, editor, Proc. of the Ninth International Conference on Research and Development in Information Retrieval, pages 207-213, Pisa, Italy, September 1986.

[CH79] W.B Croft and D.J Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, December 1979.

[CM78] W.S Cooper and M.E Maron. Foundations of probabilistic and utility-theoretic indexing. Journal of the Association for Computing Machinery, 25(1):67-80, January 1978.

[CMK66] C.W Cleverdon, J Mills, and M Keen. Factors determining the performance of indexing systems. Technical report, ASLIB Cranfield Project, 1966.

[Com95] UUNET Communications. Total traffic through uunet for the last 2 weeks. Newsgroup: news.lists, January 1995. Sender: newsuunet.uu.net.

[Coo76] W.S Cooper. The paradoxical role of unexamined documents in the evaluation of retrieval effectiveness. Information Processing and Management, 12:367-375, 1976.

[CR69] J.M Carroll and R Roeloffs. Computer selection of keywords using word frequency analysis. American Documentation, 20:227-233, 1969.

[Cro83] W.B Croft. Experiments with representation in a document retrieval system. Information Technology: Research and Development, 2(1):1-21, 1983.

[Daw74] J.L Dawson. Suffix removal and word conflation. ALLC Bulletin, pages 33-46, 1974.

[DG88] P Das-Gupta. Rough sets and information retrieval. In Proc. Eleventh Int'l. Conf. on Res. and Development in Information Retrieval, Set Oriented Models, page 567, 1988.

[Fag87] J Fagan. Experiments in automatic phrase indexing for document retrieval: A comparison of syntactic and non-syntactic methods. Ph.D thesis, available as Report TR 87-868, Department of Computer Science, Cornell University, Ithaca, N.Y, September 1987.

[FBY92] W.B Frakes and R Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

[FM] C Fox (James Madison University). stopper. This source code was obtained from the public domain.

[FNA+88] E.A Fox, J.T Nutter, T Ahlswede, M Even, and Markowitz, editors. Building a Large Thesaurus for Information Retrieval, Austin, TX, February 1988. Association for Computational Linguistics. Pages 101-108.

[Fox83] E.A Fox. Extending the boolean and vector space models of information retrieval with p-norm queries and multiple concept types. Technical report, Cornell University, August 1983.

[FS86] E.A Fox and S Sharat. A comparison of two methods for soft boolean interpretation in information retrieval. Technical Report TR-86-1, Virginia Tech, Department of Computer Science, 1986.

[FS91] G Fischer and C Stevens, editors. Information access in complex, poorly structured information spaces. Human Factors in Computing Systems CHI'91, April 1991. Pages 63-70.

[FT] W Frakes (Virginia Tech). stemmer. This source code was obtained from the public domain.

[GNOT92] D Goldberg, D Nichols, B.M Oki, and D Terry. Using collaborative filtering to weave an information tapestry. Communication of the ACM, 35(12):61-69, December 1992.

[Har75] S.P Harter. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. Journal of the American Society for Information Science, pages 280-289, September-October 1975.

[Har91] D Harman. How effective is suffixing? Journal of the American Society for Information Science, 42(1):7-15, 1991.

[Hau93] R Hauben. The evolution of usenet news: poor man's arpanet. Newsgroup: comp.society, March 1993. Speech presented at the Michigan Association of Computer Users in Learning.

[Hei82] M.H Heine. Intelligent front end for information retrieval systems with boolean logic. Information Technology: Research and Development, 1(4):247-260, 1982.

[Hw74] M Hafer and S Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.

[IAKM86] M Isoda, H Aiso, N Kamibayshi, and Y Matsunaga. Model for lexical knowledge base. In Proc. Eleventh International Conference on Computational Linguistics (COLING 86), pages 451-453, University of Bonn, August 1986.

[JR88] P.S Jacobs and L.F Rau. Natural language techniques for intelligent information retrieval. In Y Chiaramella, editor, Proc. of Eleventh International Conference on Research and Development in Information Retrieval, pages 85-99, Grenoble, France, June 1988.

[Kah] Brewster Kahle. sift-1.1. Obtained from the public domain via ftp from db.stanford.edu. The description is given in the DESIGN file found in the sift-1.1 software. Copyright (c) Tak W. Yan.

[Kee77] E.M Keen. On the generation and searching of entries in printed subject indexes. Journal of Documentation, 33(1):15-45, March 1977.

[KMT+82] J Katzer, M McGill, J Tessier, W Frakes, and P Das-Gupta. A study of the overlaps among document representations. Information Technology: Research and Development, 1:261-273, 1982.

[Lew92] D.D Lewis. Representation and Learning in Information Retrieval. Ph.D thesis, University of Massachusetts, Massachusetts, February 1992.

[LF88] W.C Lee and E.A Fox. Experimental comparison of schemes for interpreting boolean queries. M.S thesis, Technical Report TR-86-1, Virginia Tech, Department of Computer Science, 1988.

[Los88] R.M Losee. Parameter estimation for probabilistic document-retrieval models. Journal of the American Society for Information Science, 39(1):8-16, 1988.

[Lov68] J.B Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1):22-31, March 1968.

[Luh58] H.P Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159-165, 1958.

[Mar79] M.E Maron. Automatic indexing: An experimental inquiry. Journal of the American Society for Information Science, pages 224-228, July 1979.

[MCT87] M Mauldin, J Carbonell, and R Thomason. Beyond the keyword barrier: Knowledge-based information retrieval. In Proc. 29th Annual Conference of National Federation of Abstracting and Information Services. Elsevier Press, 1987.

[Mil71] W.L Miller. A probabilistic search strategy for MEDLARS. Journal of Documentation, 27:254-266, 1971.

[MK60] M.E Maron and J.L Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery, 7(3):216-244, 1960.

[NMK77] T Noreault, M McGill, and M Koll. Automatic ranking of output from boolean searches in SIRE. Journal of the American Society for Information Science, 28(6):333-339, 1977.

[NMK81] T Noreault, M McGill, and M Koll. A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. R.N Oddy, London: Butterworths, 1981. Pages 57-76.

[NS93] A.H.H Ngu and J Shepherd. How to deal with 10,000 news articles per day: An intelligent assistant for newsreading. Proc. of Inaugural Australian & NZ Intelligent Information Systems Conference, December 1993.

[Pai84] C.D Paice. Soft evaluation of boolean search queries in information retrieval systems. Information Technology: Research, Development, Application, 3(1):33-41, 1984.

[PL69] A.E Petrarca and W.M Lay. Use of an automatically generated authority list to eliminate scattering caused by some singular and plural main index terms. Proceedings of the American Society for Information Science, 6:277-282, 1969.

[Por80] M.F Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.

[Rad82] T Radecki. A probabilistic approach to information retrieval in systems with boolean search request formulations. Journal of the American Society for Information Science, 33(6):365-370, 1982.

[Rad88] T Radecki. Trends in research on information retrieval - the potential for improvements in conventional boolean retrieval systems. Information Processing and Management, 24(3):219-227, 1988.

[Rei93] Brian Reid. Usenet statistics: periodic postings to usenet newsgroups. Newsgroup: news.lists, 1993. Found from Unified Computer Science Tech Report Index search with keyword: Usenet.

[RHB92] S.E Robertson and M.M Hancock-Beaulieu. On the evaluation of IR systems. Information Processing and Management, 28(4):457-466, 1992.

[Rob90] S.E Robertson. The methodology of information retrieval experiment, chapter 2, page 9. Elsevier Science, first edition, 1990.

[RSJ76] S.E Robertson and K Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, pages 129-146, May-June 1976.

[Sal68] G Salton. Automatic Information Organization and Retrieval. McGraw-Hill, 1968.

[Sal70] G Salton. A Theory of Indexing. McGraw-Hill Book Company, 1970.

[Sal71] G Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, N.J: Prentice Hall, 1971.

[Sal75] G Salton. Theory of indexing. Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1975.

[Sal86] G Salton. Another look at automatic text-retrieval systems. Communication of the ACM, 29(7):648-656, July 1986.

[SB88] G Salton and C Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.

[SB91] G Salton and C Buckley. Global text matching for information retrieval. Science, 253:1012-1015, August 1991.

[SBS90] G Salton, C Buckley, and M Smith. On the application of syntactic methodologies in automatic text analysis. Information Processing and Management, 26(1):73-92, 1990.

[SFW83] G Salton, E.A Fox, and H Wu. Extended boolean information retrieval. Communications of the Association for Computing Machinery, 26(11):1022-1036, 1983.

[SJ72] K Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972.

[SJ73] K Sparck Jones. Index term weighting. Information Storage and Retrieval, 9:619-633, 1973.

[SJ75] K Sparck Jones. A performance yardstick for test collections. Journal of Documentation, 31:266-272, 1975.

[SJ79] K Sparck Jones. Experiments in relevance weighting of search terms. Information Processing and Management, 15(3):133-144, 1979.

[SL76] W.K.H Sager and P.C Lockemann. Classification of ranking algorithms. International Forum on Information and Documentation, 1(4):12-25, 1976.

[SM83] G Salton and M.J McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, first edition, 1983.

[SM94] G Spafford and M Moraes. usenet-software-part1. USENET Software: History and Sources, October 1994.

[Sme86] A.F Smeaton. Incorporating syntactic information into a document retrieval strategy: An investigation. In F Rabitti, editor, Proc. of Ninth International Conference on Research and Development in Information Retrieval, pages 103-113, Pisa, Italy, September 1986.

[sR88] Jung soon Ro. An evaluation of the applicability of ranking algorithms to improve the effectiveness of full-text retrieval. II. On the effectiveness of ranking algorithms on full-text retrieval. Journal of the American Society for Information Science, 39(3):147-160, 1988.

[Sri89] P Srinivasan. Intelligent information retrieval using rough set approximations. Information Processing and Management, 25(4):347-361, 1989.

[Ste92] C Stevens. Automating the creation of information filters. Communication of the ACM, 35(12):48, December 1992.

[Su92] L.T Su. Evaluation measures for interactive information retrieval. Information Processing and Management, 28(4):503-516, 1992.

[SvR88] A.F Smeaton and C.J van Rijsbergen. Experiments on incorporating syntactic processing of user queries into a document retrieval strategy. In Y Chiaramella, editor, Proc. of Eleventh International Conference on Research and Development in Information Retrieval, pages 31-51, Grenoble, France, June 1988.

[SW79] G Salton and H Wu. A term weighting model based on utility theory, chapter 2, page 9. Butterworths, 2nd edition, 1979.

[SWY76] G Salton, A Wang, and C.T Yu. Automatic indexing using term discrimination and term precision measurements. Information Processing and Management, 12:43-51, 1976.

[SY73] G Salton and C.S Yang. On the specification of term values in automatic indexing. Journal of Documentation, 29(4):351-372, 1973.

[SYY74] G Salton, C.S Yang, and C.T Yu. Contributions to the theory of indexing. Information Processing, 74:584-590, 1974.

[SYY75] G Salton, C.S Yang, and C.T Yu. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, pages 33-44, 1975.

[Thu86] G Thurmair. A common architecture for different text processing techniques in an information retrieval environment. In F Rabitti, editor, Proc. of Ninth International Conference on Research and Development in Information Retrieval, pages 138-143, Pisa, Italy, September 1986.

[UME] Masanobu UMEDA. Gnews help. Available from ftp.cs.titech.ac.jp.

[vR75] C.J van Rijsbergen. Information Retrieval. Butterworths, first edition, 1975. Ch. 2, page 14.

[vR79] C.J van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979.

[VV87] B Vickery and A Vickery. Information Science in Theory and Practice. Butterworths, London, 1987. Page 122.

[Wei81] B.H Weinberg. Word Frequency and Automatic Indexing. PhD thesis, Columbia University, 1981.

[WP68] J.H Williams and M.P Perriens. Automatic full text indexing and searching system. In Proceedings of the Information System Symposium, Washington, DC, September 1968. IBM, Gaithersburg.

[YGM94] T.W Yan and H Garcia-Molina. Index structures for information filtering under the vector space model. In Proc. International Conference on Data Engineering, pages 337-347, 1994.

[YS76] C.T Yu and G Salton. Precision weighting - an effective automatic indexing method. Journal of the Association for Computing Machinery, 23:76-88, 1976.

[YS77] C.T Yu and G Salton. Effective information retrieval using term accuracy. Communication of the ACM, 20(3):135-142, 1977.

[Zip49] G.K Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, Massachusetts, 1949.