Automatic Indexing of News Articles

by

Yunus J Mansuri

18 credit project report

for the degree of Master of Information Science

The School of Computer Science and Engineering

The University of New South Wales

August, 1996

This thesis is dedicated to

God, the most beneficent and merciful

Acknowledgements

During my course of study many people have helped me. However, I would especially like to thank my supervisor John Shepherd for his patience and perseverance with my thesis. His assistance, suggestions, and constructive criticism in the development of this thesis are worthy of special praise.

My brother Yusuf Mansuri, my sister Raisa, and baby Hannan deserve special thanks; they gave me their endless love and support, without which this thesis could not have appeared.

I also thank my friends Sadik, Hassan and Banchong for their help and the time they shared with me, and I thank them for their friendship.

Finally, I must thank my parents above all. Their contribution to my life makes everything else pale into insignificance.

Abstract

Information has become an essential currency in the "Information Age".

With the growth of network technology and connectivity, the desire to share ideas via Usenet has grown exponentially, and huge amounts of data flow through Usenet daily. Our overall aim is to minimise the effort required by the reader to handle the large volume of news passing through Usenet. Achieving this involves two major tasks: automatic indexing, which dives into the ocean of data, and retrieval of relevant articles, which fetches a glass of information to satisfy the thirst of the user, based on a user profile.

Contents

1 Introduction 6

1.1 Filtering . 7

1.1.1 Information filtering system 9

1.2 Objectives of research ...... 10

1.3 Area of research ...... 11

1.4 Organisation ...... 11

2 Literature Review 13

2.1 Information Retrieval ...... 13

2.2 Components of IR system ...... 14

2.2.1 Selection of information ...... 14

2.2.2 Text analysis and representation ...... 19

2.2.3 Searching strategy ...... 43

2.3 Summary ...... 46

3 Literature Review - Available Systems 47

3.1 History of newsreaders ...... 47

3.1.1 Popular screen-oriented news reading interfaces 50

3.2 Related work 53

3.2.1 SMART 53

3.2.2 SIFT-Stanford Information Filtering Tool 56

3.2.3 Tapestry . 57

3.2.4 URN ... 58

3.2.5 INFOSCOPE 59

3.2.6 Deja News .. 60

4 IAN - Intelligent Assistant for News reading 62

4.1 Preview of IAN 62

4.2 Introduction .. 64

4.2.1 How does IAN work 64

4.3 IAN system ..... 66

4.3.1 Fetch articles 66

4.3.2 Automatic indexing of news articles . 68

4.3.3 Retrieval of relevant news articles based on user profile 80

5 Experiments 87

5.1 Evaluation methods . 87

5.2 Evaluation procedure 89

5.2.1 System performance method . 90

5.2.2 Comparison method 91

5.2.3 Discussion ...... 92

6 Conclusions 97

6.1 Review of research objectives ...... 97

6.2 Conclusion . 98

6.3 Future work . 100

A Outline of the program 102

A Filtering Statistics 114

List of Figures

2.1 Change in Document space after assignment of good discriminator ...... 29

4.1 Modules and flow of data . 63

4.2 Major Functions of our part of IAN 66

4.3 Different modules and flow of data 67

4.4 Modules involved in text analysis 70

4.5 Keyword structure during search ...... 83

4.6 Query tree: representing A .and. B ...... 84

4.7 Query tree: representing A .and. B .or. C .and. D ...... 85

A.1 Rate of posting of articles . . . . 116

A.2 Increase in unique words per hour . 117

List of Tables

2.1 Term weighting formulae depending on within-document frequency ...... 36

2.2 Term-weighting formulae depending on term importance within an entire collection ...... 37

2.3 Term weighting formulae depending on Document frequency 37

5.1 Results from system performance method ...... 91

5.2 Results from comparison method - An average of 3 days 93

5.3 Result: Evaluation of performance ...... 93

A.1 Statistical information: Information of data in MB ...... 115

A.2 Statistical information: Information of data in terms of words 115

A.3 Results from comparison method - Day 1 . 118

A.4 Result: Evaluation of performance for Day 1 . 118

A.5 Results from comparison method - Day 2 . 119

A.6 Result: Evaluation of performance for Day 2 . 119

A.7 Results from comparison method - Day 3 . 120

A.8 Result: Evaluation of performance for Day 3 . 120

Chapter 1

Introduction

Information has become an essential currency of this "Information Age". With the advancement of network technology, the information resources of the world can be accessed from a desktop, and with the growth of this connectivity has also grown a desire to share ideas and information. For that purpose a system already exists which enables millions of people around the world to send and receive information: the Internet. The Internet supports many styles of communication; one of them is Usenet News, a global bulletin board system. Usenet is a collaborative system with no barriers to access and no requirement of computer literacy beyond basic word processing skill, and it works in the most democratic way, without any restriction on the content or dissemination of information. Anyone can make a posting about any topic, anyone can read what anyone else has to say about a topic, and anyone can share his/her own view. With the explosive growth of the system in terms of the number of hosts and users connected, the information flow has increased manyfold. Every day around 30000 messages (90 MB) of text on a wide range of topics arrive at each site on Usenet.1

Given this amount and diversity of information, the question arises of how one can actually make use of it by getting information on his/her topics of interest, and on those topics only. Users generally have a small number of specific interests, but most of the material found on Usenet is irrelevant to these interests and often of low quality. One solution to this information overload problem is controlling the information, either by charging for posting or by having editors filter out low quality information. At the moment it is not practical to implement either of these alternatives. On the other hand it is not clear that such restrictions are desirable; if these sorts of restrictions had been applied from the outset, they would have hampered the growth of Usenet. The only feasible solution, then, is for the user to filter the relevant information out of the incoming stream.

1.1 Filtering

Filtering is not a new concept. We use information filtering in our day to day life. For instance, when we go to look for books of interest in the library, we do not start reading all the books to find relevant topics but apply filtering, whereby we use a catalogue to limit our search to books containing specific topics of interest. Once we have found a potentially useful book, we first look at the index to determine whether it covers any interesting material. The same ideas are applied in filtering.

1 A typical example of Usenet traffic volume: in 2 weeks, 990258 articles totalling 2512.5 MB were submitted from 53566 Usenet sites by 196093 different users to 11099 different newsgroups, for an average of 180 MB per day [Com95].

In Usenet news, the first layer of filtering is provided by newsgroups.

Here the generated messages are partitioned over the newsgroups, where each newsgroup carries articles on specific topics. Newsgroup-based filtering is achieved by simply subscribing only to relevant newsgroups (i.e. only subscribing to a small set of newsgroups that carry articles relevant to one's interests). This sort of filtering is not entirely satisfactory, as quite often newsgroups cross boundaries and one is not certain that the un-subscribed newsgroups contain no relevant articles of interest. The task of selecting relevant newsgroups is also a problem; looking at all 3000 newsgroups and what each contains and then selecting is a non-trivial task. Large numbers of articles (hundreds per day) are posted in the most active newsgroups, and many of these are not of interest to all users who read that newsgroup.

Currently the burden of selecting relevant articles lies with the user and is achieved by looking at the subject line of each article. This functionality (displaying the subject line) is provided by almost all news reading programs (e.g. "rn", "trn", "nn", etc). However, simply looking at the subject heading is not a viable way to find relevant articles; sometimes articles do not have any subject but, more frequently, they have a subject heading that is not directly related to the content of the article. Some news reading programs (e.g. "nn", "GNUS") support the idea of a "kill-file" which, depending upon the criteria supplied by the user, removes articles from consideration before their subject lines are displayed (e.g. kill all articles posted from site "x" or posted by "author y"). This filtering mechanism puts the burden of filtering onto the user, who has to choose between subscribing to only a small number of newsgroups and potentially missing out on interesting items, or subscribing to more newsgroups and manually filtering out a large number of uninteresting articles. Clearly more automatic assistance is required in the information filtering task.

1.1.1 Information filtering system

An information filtering system is an information system designed for unstructured or semistructured data [BC92]. The only definite structure that Usenet articles have is a "header" and a "body", where the header conveys general information (e.g. source, distribution, kind of article) about the article while the body contains the actual message (i.e. the content of the article). Another characteristic of an information filtering system is that it is able to process a large volume of data. This suits our application (news filtering), where we receive around 90 MB of data each day on the local server.

The task of an information filtering system is to filter information based on a description of individual or group information preferences, called a "user profile", by removing irrelevant data from an incoming stream of data.

The first stage of filtering involves identifying useless (non-informative) words in the data. This can be achieved by analysing the data for linguistic cues and separating out the informative words. It is often done by using a stopword list, which contains a list of common non-informative words; these words are then removed from the data. The remaining words are then put into canonical form by removing suffixes and performing other transformations which help to isolate the root words. The next step involves determining a weight for each of the remaining words, to represent its importance.

These keywords can then be compared with the user profile to determine the relevance of the article.
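As an illustration, a minimal sketch of this front-end pipeline in Python (the stopword list, suffix rules, and tf*idf weighting shown here are simplified placeholders for the techniques reviewed in chapter 2, not IAN's exact components):

    import math
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that"}  # tiny placeholder list

    def stem(word):
        # crude suffix removal; a real system would use a full stemming algorithm
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def index_article(body, n_docs, doc_freq):
        # return {term: weight} for one article, given the collection size
        # n_docs (N) and doc_freq, a map from term to document frequency (n_i)
        words = re.findall(r"[a-z]+", body.lower())
        terms = [stem(w) for w in words if w not in STOPWORDS]
        tf = Counter(terms)
        return {t: f * math.log(n_docs / max(doc_freq.get(t, 1), 1))
                for t, f in tf.items()}

The weighted terms produced by such a pipeline are what the matcher later compares against the user profile.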

Since a user's interests change over time, it is also desirable that a news filter can adapt to these changes. A news filtering system can be separated into a number of components: a front-end to parse articles, extract information content and select hopefully relevant articles, and a back-end to monitor the user and adapt. In this research project we consider only the front-end.

1.2 Objectives of research

The objectives of this research include:

• Reviewing the current text analysis techniques and searching strategies that are used in Information Retrieval.

• Reviewing existing systems which provide some sort of filtering mechanism when dealing with Usenet news articles.

• Checking different available ranking algorithms and selecting one which suits our application.

• Implementing an experimental prototype news filtering system.

• Evaluating the system's performance and comparing it to existing systems.

1.3 Area of research

In this research project an experimental prototype news filtering system is implemented which fetches articles from a news network, analyses these articles, generates weighted index terms for each article, and compares these terms with a user profile. Along the way, different text analysis approaches are reviewed.

Experiments are carried out to evaluate the performance of our news filtering system. A comparison is made between the results of using single words as index terms and using phrases as index terms. Finally, a comparison is made with SIFT (the Stanford Information Filtering Tool) on the usefulness of using phrases as index terms.

1.4 Organisation

The organisation of the report is as follows.

In chapter 2, the different approaches to text analysis and searching in Information Retrieval (IR) are reviewed.

In chapter 3, the existing systems and different experimental prototypes which work in the Usenet environment are reviewed.

In chapter 4, the implementation of an experimental prototype news filtering system, called Intelligent Assistant for News reading (IAN), is described.

In chapter 5, the details of the experiments are explained and an analysis of the experimental results is presented.

Chapter 6 gives the conclusions and further research directions.

Chapter 2

Literature Review

2.1 Information Retrieval

The phrase Information Retrieval covers a very wide scope of activities. It is generally taken to mean the retrieval of references to documents in response to a request for information [Rob90]. IR is synonymous with the representation, storage, organisation and accessing of information. Strictly speaking there is no restriction on the type of items1 it can handle. In the case of text information retrieval, the items are analysed to determine their information content and to find the role of each term2 in providing information about the content that is needed to satisfy the user's queries. An information retrieval system can be said to be a set of rules and procedures, whether performed by humans or machines, for operations like indexing, search formulation, searching, feedback etc. In earlier days, items processed by an IR system included books, articles, abstracts, bibliographies, etc. With the evolution of computerisation, the scope as well as the role of IR has expanded and it is now needed in every walk of life. The amount of available information and the need for information retrieval are mutually interdependent: as information increases, the need to access it also increases, and thus new ways of doing IR evolve.

1 By "items" we mean various kinds of data.
2 The word "term" is a generic word used to represent a content identifier, also called a "keyword".

2.2 Components of IR system

An IR system comprises three major components:

1. Selection of information

2. Text analysis and representation

3. Searching strategy

2.2.1 Selection of information

In any IR system the selection of the domain plays the most important role. In order to provide a service, the system must contain information resources that may be of potential use to its users, depending on the purpose of the system. For example, an IR system designed for a library should contain information on all the relevant books, journals, reports etc. Similarly, an IR system providing assistance for Usenet must contain all the information available on Usenet for the client (user). The domain of our data selection is the day to day news posted on Usenet, so let us examine Usenet in more detail.

Usenet

Introduction - Usenet

Usenet is a collection of computers which allow users to exchange public messages on many different topics. It is a world-wide distributed conferencing and discussion system available at low cost to universities, schools, libraries and home users. Because of easy3 and unrestricted access, the amount of information has grown rapidly. There is much useful information on Usenet, and new areas of interest are continually being added. The sheer volume of new information arriving every day prevents even a cursory skimming to look for new articles. Searching for interesting or useful information has become extremely time consuming and is frequently futile. The problem of effective information access and retrieval is not a result of disorganisation, but rather due to the fact that Usenet receives tens of thousands of new articles, divided among thousands of topics (called newsgroups), generated and distributed to thousands of sites serving millions of users. This is the real cause of the problem: information overload. To overcome this problem many systems have appeared in the market, and we will examine different approaches that tackle it. Prior to looking at these systems, we examine Usenet news in some more detail.

3 Easily available in developed countries.

Background of Usenet

Usenet (USEr's NETwork) consists of networks of computers that exchange "netnews". Any user on a Usenet node can post and receive articles to and from Usenet by simply typing in some text and submitting it to a program on a local machine. This local computer then forwards the article to nearby Usenet nodes (the posting can be restricted as well), which in turn forward it to other nodes. In this manner news is propagated around the world.

Usenet started in 1979 with only a few nodes, but since then it has been growing incredibly. Estimates show that there are 76,000 Usenet sites [Rei93] with millions of users4. The estimated number of new articles that arrive every day (on each site) is around 30,000 to 35,000; the total amount of data is around 90 MB5.

Structure of Usenet

Articles are categorised into around 11,000 newsgroups [Com95]. The subject areas are diverse, ranging from sports to sex and religion to rec.pet.dog.

Most of the articles are plain ASCII text, but some also contain encoded versions of binary files, pictures, sound, application software etc.

An article consists of a header and body. The header contains information about the source, distribution and kind of article. The body contains actual

4 3 million according to Hauben [Hau93] and 4.2 million in Nov 93 as estimated by DEC Network Systems Laboratory.
5 Based on our survey, conducted in March 1995, of the articles arriving on our local NNTP server: nntp.unsw.edu.au.

information content. A sample of a header is shown below:

Path: unsw.edu.au!metro!OzEmail!carisma

From: carisma@ozemail.com.au (marcus scholz)

Newsgroups: alt.astrology

Subject: Re: Virgo & Gemini

Date: 8 Mar 1995 12:52:03 GMT

Organization: OzEmail Pty Ltd - Australia

Lines: 20

Distribution: world

Message-ID: <3jk99$j7mp@oznet03.ozemail.com.au>

References: <2088351.ensmtp©newc.com>

NNTP-Posting-Host: shell01.ozemail.com.au

X-Newsreader: TIN [version 1.2 PL2]

Some of the fields vary depending on the local configuration and the newsreader used for posting. The importance of each field depends on the type of approach we take for analysing and retrieving the articles. For example, the field "Newsgroups:" is important if the user is using a normal newsreader like rn, trn, nn, etc, so that when a user gives a newsgroup name it can directly show only that newsgroup. The field "References:" is important if we use a threaded newsreader. In our case the "Message-ID:" and "Subject:" fields are important, because the Message-ID is the unique Usenet identifying string for the article, which can be used to fetch that article. The

Subject field gives some idea of the topic of an article.
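As a sketch of how such a header can be split into fields (a simplified illustration rather than IAN's actual parser; real headers can also fold long values across continuation lines, which this ignores):

    def parse_header(article_text):
        # split a Usenet article into a header dict and a body string;
        # the header ends at the first blank line
        header_part, _, body = article_text.partition("\n\n")
        header = {}
        for line in header_part.splitlines():
            if ":" in line:
                field, _, value = line.partition(":")
                header[field.strip()] = value.strip()
        return header, body

Given the sample above, header["Message-ID"] identifies the article uniquely and header["Subject"] hints at its topic.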

The body contains the content of the article in a semi-structured manner.

The word "semi-structured" is used because while articles are generally too short6 to make use of sectioning or more formal document structures, they do make use of various informal structuring devices.

As one of the uses of Usenet is for discussion, the body of an article often contains quotes from other articles. These quotes are often called comments and give an idea of the whole discussion. Optionally, a footer is found at the end of the article; this does not contain any information related to the topic of the article, only the author's name and other personal details.

Semantic Structure of Usenet

Usenet is used for many different purposes. The most common are

1. Question and Answer.

2. Discussion.

3. Dissemination of information.

Newsgroups contain articles. These articles contain questions, answers, information, etc. As new people subscribe to newsgroups, they tend to ask a set of common questions about the area of interest of the newsgroup. In order to alleviate the traffic caused by this, many newsgroups have collected together a list of frequently asked questions (also known as an FAQ) along

6 Longer articles often do have structure. The reason most articles don't is that they are too short (e.g. on average 1500 bytes, of which 500 bytes is header).

with answers to these questions. Such a list helps in reducing the posting of general question-answers and ultimately reduces the volume of data.

Another use of Usenet is for discussion. Here initially a user posts a point or a topic, other interested users reply and counter reply, thereby creating a

"thread" of discussion. Some newsreaders like trn, GNUS, etc use threads as a basis for displaying and selecting articles via subject lines. All articles from a thread are grouped together. Often these discussions last for a long time and at times lose track of the initial topic. For this reason we do not consider this to be a viable option in determining the similarity of articles in our approach.

A further use of Usenet is the dissemination of current events or 'hot topics'. With the vast diversity of newsgroups, information on regional as well as global events can be found on Usenet. At times, new newsgroups are created specifically for discussion of a current major event (e.g. the Gulf war). Such newsgroups tend to be relatively short-lived, being removed after several months.

2.2.2 Text analysis and representation

A document is composed of a stream of words in natural language. The words can be informative, common, grammatical etc. To extract meaningful words, we need to analyse the text. The text analysis process aims to identify important words (content carrying words, also called "terms") which can be used to represent the document. The assignment of terms to a document is designed to achieve three related purposes [Kee77]:

1. To allow easy identification of documents with topics of interest (in our case via a user-profile).

2. To identify different documents dealing with similar topics.

3. To predict the relevance of individual documents to a specific information requirement through the use of index terms.7

To achieve these objectives, many different methods have been proposed.

These methods depend on the particular indexing environment. The two basic choices are:

1. Automatic or manual indexing.

2. Use of controlled or uncontrolled vocabulary.

From an analysis point of view, the four combinations represent four different approaches to extracting features from a raw sequence of characters in text. Historically, the analysis operations were carried out manually. In most situations a controlled vocabulary was used, in which a single standard term or phrase represents a wide variety of related terms and descriptions. Usually these terms were binary8 or weighted with grades of subject importance such as major, minor, absent etc. Sometimes an uncontrolled vocabulary was also used, but most often a controlled vocabulary was used.

7 Index term is also referred to as "term", "keyword", or "content identifier".
8 Whether the term is relevant or not.

Automatic indexing using terms drawn from the full text means that some automated feature extraction procedure is used. This typically involves identifying words or phrases, where each corresponds to a set of all distinct clues encountered, under some definition of distinctness (i.e. by using some set of rules, words and phrases are identified). There has been much controversy about the best approach. The recommendations range from simple frequency counts to complicated rule-based approaches. The main debate is about whether automatic indexing is as good as or better than manual indexing. "It has been claimed that automatic, free language products are necessarily inferior to a manual system because automatic systems only use words which appear in the text and that these products will fail to pass any relational test carried out by independent human observers" [SM83]. In answer to this, Salton mentioned that manual indexing is also influenced by the terminology used in the individual text. We can also say that the expert uses his own terminology to refer to a term, which again may differ (i.e. different experts choose different terms to represent the same document, as each may find different things to be important in the same document). However it can be said that adding complexity to either of the approaches does not really affect the actual outcome. This has been clearly demonstrated by the results obtained by Cleverdon [CMK66], Aitchison [vR75] and Keen [vR75]. From their conclusions it also appears that systems using uncontrolled vocabularies are better than systems using controlled vocabularies [vR75].

Automatic Indexing

The indexing task consists of assigning to each stored document a list of keywords which act as content identifiers for the document. Additionally, a weight may be assigned to each keyword reflecting its presumed importance for the purpose of content identification. Two standard methodologies have developed over the years for tagging or identifying the appropriate words to be used as content identifiers: the statistical approach and the linguistic approach.

Statistical Analysis

This is the original methodology used in automatic indexing and it has evolved significantly over the last four decades. This methodology starts with the observation that the frequency of occurrence of individual words in natural language text has something to do with the importance of these words for the purpose of content representation. H. P. Luhn in 1958 [Luh58] mentioned that "The justification of measuring word significance by use-frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject". Actually the whole idea of frequency theory is based on Zipf's distribution law [Zip49], which relates the frequency of occurrence of terms in a document to their capacity to carry information. His work has been used in many experiments with automatic indexing, where the words in the documents are counted and repetitions of terms are calculated using a variety of constraints. Formulae of various kinds have been developed in an attempt to find meaningful relationships between term frequency counts within a document or a whole collection. Salton, Sparck Jones, Bookstein, Blair, Van Rijsbergen and many others have explored this field and provide models for analysing the text on a statistical basis, as explained below.

The vector space model

The vector space model represents one approach to quantifying information retrieval operations. Here both the stored documents and search request terms are represented by a vector of terms (term vectors). A vector similarity coefficient is computed for the search request and each document in the collection. The coefficient values are then used to rank the documents according to their relevance to the query. A term weighting system assigns larger weights to terms that occur frequently in a particular document but rarely overall.

The original similarity coefficient was the "cosine measure" (so called because it measures the "angle" between the term vectors, treated as vectors in an n-dimensional term space). The documents were ranked on the basis of this similarity:

SIM(D_j, Q_k) = Σ_i (td_ij * tq_ik) / ( sqrt(Σ_i td_ij^2) * sqrt(Σ_i tq_ik^2) )

where
td_ij = the ith term in the vector for document j
tq_ik = the ith term in the vector for query k
n = the number of unique terms in the data set (the sums run over i = 1..n)

This model has been used in many ranking retrieval experiments, in particular the SMART system experiments under Salton et al. [Sal71, SY73]. These experiments initially tested an overlap similarity function against the cosine correlation measure and tried simple term-weighting using the frequency of terms within the documents. Here they considered term importance with respect to individual documents only.

In [SY73] Salton et al. described further experimental results, found that term frequency alone cannot ensure acceptable retrieval performance, and changed the term weighting scheme. They incorporated the Inverse Document Frequency measure IDF = log(N/n_i) proposed by Sparck Jones [SJ72], where N is the total number of documents in the collection and n_i is the total number of documents in which term i is found. The IDF factor varies inversely with the number of documents n_i to which a term is assigned in a collection of N documents. The new term-weighting scheme relied on term importance within an entire collection rather than within individual documents. They demonstrated empirically that this new approach enhanced retrieval performance.

Salton and Buckley [SB88] devised a new term weighting formula which incorporated three factors which they thought important for enhancing recall (the proportion of relevant documents that are retrieved) as well as precision (the proportion of retrieved documents that are relevant):

1. Term Frequency

2. Inverse Document Frequency

3. Normalisation Factor.

They considered term frequency to be a recall enhancing device. But if a term occurs in many documents then it affects the overall retrieval; for that they incorporated the Inverse Document Frequency (IDF) factor, where IDF varies inversely with the number of documents n to which a term is assigned in a collection of N documents. The third consideration was to tackle the problem of non-uniformity in vector length, as the length of each document is not the same (for example, if one document is 50 KB and another is 1 KB, then there is more chance for a term to occur more than once in the 50 KB document than in the 1 KB one). For that they incorporated a normalisation factor which normalises the weights, taking into account the different vector lengths; cosine normalisation was used, where each term weight is divided by a factor representing the Euclidean vector length.

Using different combinations of the term frequency, inverse document frequency, and length normalisation components, they derived different term-weighting formulae, both for calculating weights for documents and for queries. They conducted a series of experiments on six document collections of varying size and covering different subject areas. Their experimental results showed that a term-weighting method which takes all three factors into consideration produces the best results, especially when dealing with large documents and large document collections. The following combination of formulae was ranked the best:

w_ij = ( tf_ij * log(N/n_i) ) / sqrt( Σ_vector (tf_ij * log(N/n_i))^2 )

where

w_ij = weight of term i in document j
tf_ij = term frequency of term i in document j
N = total number of documents in the collection
n_i = total number of documents in which term i is found

This formula was used to calculate the term weight for each term in each document, while the following formula was used to calculate the weight of each query term:

w_ik = (0.5 + 0.5 * tf_ik / maxtf_k) * log(N/n_i)

where

w_ik = weight of term i in query k
tf_ik = term frequency of term i in query k
maxtf_k = maximum frequency of any term in query k

While calculating the weight of a query term, they used the tf factor normalised by the maximum tf in the vector. They added 0.5 so that the value lies between 0.5 and 1.

The similarity between a document and the query was calculated using the well-known cosine vector similarity formula. This vector matching system performs a global comparison between query and document vectors, and provides ranked retrieval output in decreasing order of the computed similarity between query Q_k and document D_j:

SIM(D_j, Q_k) = Σ_i (w_ij * w_ik) / sqrt( Σ_i (w_ij)^2 * Σ_i (w_ik)^2 )
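To make the weighting concrete, here is a small Python sketch of the Salton-Buckley scheme above (document weights use tf * idf with cosine length normalisation, query weights use the 0.5-augmented tf; the function names are illustrative):

    import math

    def doc_weights(tf, N, df):
        # tf: {term: frequency in this document}; df: {term: document frequency}
        raw = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # Euclidean length
        return {t: w / norm for t, w in raw.items()}

    def query_weights(tf, N, df):
        max_tf = max(tf.values())
        return {t: (0.5 + 0.5 * f / max_tf) * math.log(N / df[t])
                for t, f in tf.items()}

    def cosine(d, q):
        # cosine similarity between two sparse term-weight vectors
        num = sum(w * q[t] for t, w in d.items() if t in q)
        den = math.sqrt(sum(w * w for w in d.values())) * \
              math.sqrt(sum(w * w for w in q.values()))
        return num / den if den else 0.0

Articles can then be ranked for a query by sorting on cosine(doc_weights(...), query_weights(...)) in decreasing order.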

In [SB91] Salton outlined another approach for the retrieval of natural language text in response to a search request. The main aim was to come up with an approach for dealing with large text collections, since in a large collection the overall occurrence of a term is much higher than in a small collection. Based on his initial approach, he proposed another formula intended to assign better weights to terms:

w_ij = ( (0.5 + 0.5 * tf_ij / maxtf_i) * log(N/n_i) ) / sqrt( Σ_{i=1..t} (0.5 + 0.5 * tf_ij / maxtf_i)^2 * (log(N/n_i))^2 )

where

w_ij = weight of term i in document j
tf_ij = term frequency of term i in document j
N = total number of documents in the collection
n_i = total number of documents in which term i is found
maxtf_i = maximum frequency of term i in the collection

Salton's experimental results showed the effectiveness of the above formula when used for calculating term weights in unrestricted text environments on arbitrary subject matter.

27 The Term discrimination model

Salton, Yang and Yu [SYY75] proposed an automatic indexing model based on the idea of term discrimination, where a given index term is rated in accordance with its usefulness as a discriminator among the documents of a collection. Term discrimination offers a reasonable physical interpretation for the indexing process [SYY74]. This model leads to a distinction among possible index terms in accordance with their ability to spread out the document space when assigned to the documents in the collection. The assessment of a good or poor discriminator depends on the ability of the discriminator to isolate the documents in which it is found. A discrimination value (DV) is computed for each potential index term as the difference in space densities before and after assignment of that term. The greater the difference in space density, the better that term will function as a discriminator [SWY76].

The studies show that the discrimination value of an individual term depends on the frequency of that term (keyword) in the document and the number of documents in which it is found. Terms with excessively high document frequency (i.e. terms that occur in most of the documents) are the worst discriminators. Terms with medium document frequency, that is, document frequencies between n/100 and n/10 where n is the number of documents, comprise the vast majority of good discriminators. Terms with excessively high frequency cannot be used directly, as they produce unacceptable precision losses. To overcome this, such terms are transformed into low frequency units by using them as components of appropriate indexing phrases. Terms with low document frequency produce unacceptable recall performance and so were transformed into higher frequency units by including them in a class of related terms using a thesaurus [SWY76]. Yu and Salton [YS77] proved that certain transformations suggested by the term discrimination model will improve retrieval effectiveness. However extensive assumptions are necessary for the proof; in particular, the assumptions that query terms with higher document frequency occur in proportionally more non-relevant documents and that it is somehow known which terms to combine into thesaurus classes and indexing phrases.

x ------ 0   distance between original document x and centroid 0
x --------- 0   distance following assignment of good discriminator

Figure 2.1: Change in Document space after assignment of good discriminator

This model suggests that low document frequency terms should be transformed into high frequency units by including them in classes of related terms using a thesaurus, but it does not give a clear picture as to how this would be done (i.e. no selection criteria are mentioned) [Lew92]. "Overall this model has been criticised because it does not exhibit well substantiated theoretical properties" [SB88].
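Operationally, the discrimination value is simple to compute. A small Python sketch (the density measure used here, average document-to-centroid cosine similarity, is one common choice; cosine() is the helper defined in the earlier sketch):

    def density(docs):
        # space density: average similarity of the documents to their centroid
        centroid = {}
        for d in docs:
            for t, w in d.items():
                centroid[t] = centroid.get(t, 0.0) + w / len(docs)
        return sum(cosine(d, centroid) for d in docs) / len(docs)

    def discrimination_value(docs, term):
        # DV = density with the term removed minus density with it present;
        # a positive DV means the term spreads the documents apart,
        # i.e. it is a good discriminator
        without = [{t: w for t, w in d.items() if t != term} for d in docs]
        return density(without) - density(docs)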

Probabilistic Model

The initial probabilistic model was proposed by Maron and Kuhn [MK60].

Later on, Salton [Sal75] and Sparck Jones [SJ72, SJ73] used probabilistic techniques for improving retrieval performance. Barkla [Bar69] and Miller [Mil71] used a probabilistic technique for calculating a weight function:

w = log( (r/R) / (n/N) )

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t

Barkla [Bar69] used the above formula in an SDI9 server for obtaining relevance information via feedback. Miller used it in devising a probabilistic search strategy for Medlars10. It was also used by Sparck Jones [SJ75] in devising an optimal performance yardstick for test collections. Yu and Salton [YS76]

9 SDI (Selective Dissemination of Information) is a system designed to make available network-based information resources.
10 Medlars (MEDical Literature Analysis and Retrieval System) is the information retrieval system developed by the U.S. National Library of Medicine. It contains over 30 databases that provide references to worldwide publications in medicine.

derived another formula:

w = log( (r / (R - r)) / ((n - r) / (N - n - R + r)) )

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t

They used this for modifying the output of a simple co-ordination level matching scheme. From there, they went on to prove that this modification of co-ordination level matching can be expected to improve performance.

Robertson and Sparck Jones [RSJ76] proposed a probabilistic model based on the notion that terms appearing in previously retrieved relevant documents for a given query should have higher weights. They presented a table showing the document distribution of a term t and gave four formulae for calculating weights:

                         Document relevance
                         +                -                 Total
Document indexing   +    r                n - r             n
                    -    R - r            N - n - R + r     N - n
Total                    R                N - R             N

w = log( (r/R) / (n/N) )                                 (2.1)

w = log( (r/R) / ((n - r)/(N - R)) )                     (2.2)

w = log( (r/(R - r)) / (n/(N - n)) )                     (2.3)

w = log( (r/(R - r)) / ((n - r)/(N - n - R + r)) )       (2.4)

where

N = the number of documents in the collection
R = the number of relevant documents for query q
n = the number of documents having term t
r = the number of relevant documents having term t.

Formula 2.1 represents the ratio of the proportion of relevant documents in which t occurs to the proportion of the entire collection in which it occurs.

Formula 2.2 represents the ratio of the proportion of relevant documents to that of non-relevant documents.

Formula 2.3 represents the ratio between the "relevance odds" for the term (i.e. the ratio between the number of relevant documents in which it does occur and the number in which it does not occur) and the "collection odds" for t.

Formula 2.4 represents the ratio between the term's relevance odds and its "non-relevance odds".

Robertson and Sparck Jones [RSJ76] used formula 2.4 for a series of experiments with the manually indexed Cranfield collection. The experiments were done using all relevance judgements to weight the terms, to see what the optimal performance would be. Using different sections of the collection they showed that formula 2.4 was better. Many experiments were carried out by Sparck Jones using formula 2.4 [SJ79] under more realistic conditions, and these demonstrated the same good retrieval performance.
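Formula 2.4 translates directly into code. A hedged sketch (the 0.5 added to each cell is the usual correction for zero counts; it is an assumption here, as the text above does not discuss it):

    import math

    def rsj_weight(N, R, n, r, c=0.5):
        # relevance weight of a term, formula 2.4
        # N: documents in collection, R: relevant documents for the query,
        # n: documents containing the term, r: relevant documents containing it;
        # c keeps the logarithm defined when any cell of the table is zero
        relevance_odds = (r + c) / (R - r + c)
        non_relevance_odds = (n - r + c) / (N - n - R + r + c)
        return math.log(relevance_odds / non_relevance_odds)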

Croft and Harper [CH79] conducted experiments in an environment where no relevance information was available. A probabilistic model of document retrieval was applied to the two searches which occur before relevance feedback, i.e. the initial search and the intermediate search. They used the Cranfield collection of 1400 documents. They modified formula 2.4 under the assumption that all query terms have an equal probability of occurring in relevant documents, and derived another formula which combines a new weight, based on the number of matching terms, with a term weighting similar to the IDF measure. The combination match was:

SIM(Q, D_j) = Σ_i r_i * x_i * ( C + log(N/n_i) )

where

C = a constant for tuning the similarity function
x_i = document description for term i
r_i = binary query description for term i
N = the number of documents in the data set
n_i = the number of documents having term i in the data set

With this combination match model they went on to prove that it performed better than the simple match, the match using IDF weights, and matching using the cosine correlation. Croft [Cro83] made a further modification of his proposed combination match formula by incorporating within-document frequency weights, tuned using another constant K:

SIM(Q_k, D_j) = Σ_{i=1..Q} ( C + log(N/n_i) ) * ( K + (1 - K) * tf_ij / maxtf_j )

where

K = a constant for adjusting the relative importance of the two weighting schemes
tf_ij = the frequency of term i in document j
maxtf_j = the maximum frequency of any term in document j
Q = the number of matching terms between document j and query k

The results showed significant improvement over both the IDF weighting alone and the earlier approach to combination weighting.

Model Based on Decision Theory

Bookstein and Swanson [BS75] proposed a new method for indexing based on decision theory. They developed a utility estimation model based on statistical assumptions about the distribution of terms. They assumed that a collection of documents is divided into two or more groups, each of which is about the concept specified by the indexing term to a differing degree [Lew92].

The terms are assumed to have values equal to their "within-document" frequency. An approach to assigning terms to documents in binary fashion had been developed for a special case of the above model where each term was characterised by exactly two Poisson distributions [Har75]. In criticising this approach, Losee [Los88] mentioned that computing term values in this way was inferior to a simple binary model of term occurrence.

Another model based on decision theory was proposed by Maron [Mar79], where each term was evaluated according to the probability that users mentioning that term in their request will want a particular document. However, in explaining this model he did not propose any method for estimating the relevant probabilities. The model also mentions a threshold point that can be varied depending on the cost of retrieving a non-relevant document and of missing relevant documents. A series of indexing methods based on decision theory was presented by Cooper and Maron [CM78]. However, these models have had considerably more success from a theoretical point of view than in terms of text classification effectiveness [Lew92].

Apart from the models described above, researchers have also proposed models based on fuzzy set theory [Boo85], but these have been very hard to implement in practice [FBY92]. Srinivasan [Sri89] and Das-Gupta [DG88] applied rough set theory. The model proposed by Das-Gupta incorporated boolean logic for retrieval, a term weighting approach, and ranking of documents. It also considered the use of relevance feedback, and he did some experiments comparing his rough set model to boolean and fuzzy models. This theory has not been developed to the stage where experiments can be performed [FBY92].

FORMULA                         REFERENCE

f_ij                            Absolute frequency [Luh58]

log f_ij                        Sparck Jones, K [SJ73]

f_ij / f_j                      Weinberg [Wei81]

f_ij / log f_j                  Noreault et al. [NMK81]

f_ij / (f_j / 1000) = a         Artandi [Art69]
  if a <= 1, then w_ij = 1
  if 1 < a < 3, then w_ij = 2
  if a >= 3, then w_ij = 3

Table 2.1: Term weighting formulae depending on within-document frequency

Other algorithms have also been proposed and implemented; a selection of them is listed in tables 2.1, 2.2 and 2.3, where

f_ij = the frequency of term i in document j
f_j = the total number of terms in document j
N = the number of documents in the collection
f_i = the frequency of term i in the collection
CL = the total number of terms in the collection
n = the number of documents in which term i occurs

Jung Soon Ro [sR88] performed some experiments on the full text of a journal document collection. In all, 29 algorithms (some of which are shown in tables

FORMULA                              REFERENCE

I- :-r                               Caroll and Roeloffs [CR69]

f_ij / f_i                           Sparck Jones, K [SJ73]

(f_ij / f_j) / (f_ij / f_i)          Sparck Jones, K [SJ73]

f_ij * log(CL / f_i)                 Noreault et al. [NMK81]

f_ij / log f_i                       Noreault et al. [NMK81]

f_ij / log(f_j / f_i)                Noreault et al. [NMK81]

Table 2.2: Term-weighting formulae depending on term importance within an entire collection

FORMULA                              REFERENCE

1/n                                  Noreault et al. [NMK81] and Sager et al. [SL76]

log(N/n)                             Williams and Perriens [WP68]

log(N/n + 1)                         Williams and Perriens [WP68]

f_ij * (f_i / n)                     Noreault et al. [NMK81] and Sager et al. [SL76]

f_ij * n / (f_ij - f_i)              Noreault et al. [NMK81] and Sager et al. [SL76]

(f_i / n)                            Salton et al. [SYY75]

f_ij * (log N - log n + 1)           Salton et al. [SY73]

Table 2.3: Term weighting formulae depending on document frequency

2.1, 2.2, 2.3) were tested. The aim was to measure the effectiveness of full-text retrieval and to find a way to improve the low precision of full-text searching with a minimum decrease in recall. He reported that with the use of ranking algorithms, precision improved 2 times at the same level of recall, and 2.6 times with a 0.27 times decrease in recall. However, he stated that the relative effectiveness of the algorithms depends on the level of recall at which precision is measured and on the Boolean strategy employed.

Linguistic approach to text analysis

Various attempts have been made in recent years to use syntactic analysis methods for the generation of complex constructions such as noun phrases, prepositional phrases, etc, which can be used as indexing terms. Various automatic indexing systems based on syntactic analysis have been proposed [BP86, Thu86, Sme86, SvR88, JR88]. The outcomes of these systems confirm that systems based solely on syntactic methodologies are not sufficient to produce a complete analysis of diverse11 full text. The experiments done using the syntactic analysis method often use known text samples [Sme86] or use a skimming-type parser which looks only at certain portions of the text while totally ignoring other parts [JR88]. Fagan [Fag87] carried out a comparison between the statistical and syntactic approaches on the same collection of documents and request terms. The results indicated a preference for the statistical approach over the syntactic approach. Salton et al. [SBS90], in order

11 By "diverse" we mean coming from many different domains and having a very broad vocabulary.

to evaluate the effectiveness of the syntactic approach, conducted a series of experiments. The analysis system used was the PLNLP English Grammar (PEG) developed by IBM12 [SBS90]. This could analyse complete sentences as well as fragments of a sentence and produce one or more parses for each sentence, ranked in decreasing order of presumed correctness. The output was given to a term phrase generator designed at Cornell which generated single terms or phrases suitable for indexing purposes. The evaluation was done by comparing the output of the above procedure with that of the statistical approach, using the same data and query collection. The overall conclusion was that "The proportion of acceptable indexing phrases obtained by both methodologies was approximately the same" [SBS90]. The final conclusion was that the statistical approach was better than the syntactic approach, because the syntactic approach involved too much computational complexity and because our current knowledge of language representation does not allow us to solve significant problems such as ambiguity.

Another approach [FNA+88, IAKM86, CDBK86] suggested the use of a thesaurus, thereby using dictionary information to elaborate the meaning of words. The experimental results of Fox et al. [FNA+88] showed that dictionary information cannot be easily used in general text analysis.

Another approach that comes from the domain of linguistics for dealing with automatic text analysis is the knowledge based approach. Many researchers have attempted to incorporate manually constructed knowledge bases for specific subject areas [MCT87]. Mauldin et al. [MCT87] have explained, giving examples, the working of a knowledge base for text analysis. It seems that, due to the diverse ways of expression in a large collection, the matching possibilities between arbitrary inputs and an existing knowledge base will always be limited. Also, matching is confined to the knowledge represented in rules; these rules are general and not specific enough to tackle all the ambiguities which come up in input text. According to Salton [SBS90] the knowledge based approach has not proved effective in uncontrolled full text retrieval environments.

12 IBM Research Lab in Yorktown Heights.

Different approaches for improving text analysis and representation

From the earliest days of Information Retrieval it has been found that many of the most frequently occurring words in English are worthless as index terms. These words account for 20% to 30% [FBY92] of the total text; according to Salton they account for 30% to 50% [vR75] of the entire collection. Removing these words not only reduces the number of indexing terms but also increases precision. Van Rijsbergen [vR75] used a list of 250 "stopwords" for filtering out the common (non-informative) words. In the literature this "stoplist" is called by different names depending on the words in the list. Van Rijsbergen [vR79] talks about "the removal of high frequency words, 'stopwords' or 'fluff' words". Salton et al. [SM83] call them "high frequency function words". Vickery et al. [VV87] call them "very frequent non-significant words". We call them non-informative high frequency words. The selection of these words depends on the collection environment in which they are to be used. In any case we can say that this process of removing stopwords is a positive one for improving the performance of IR: the total number of words representing the document decreases, which also improves system speed.

Another technique for improving IR performance is the use of stemming.

It is basically used to remove suffixes from words, whereby the size of the indexing file can be reduced, and also to increase recall. There are several approaches to stemming, such as:

suffix removal: Suffix removal algorithms remove suffixes from terms, leaving the stem.

successor variety: Here the stemmer is based on work in structural linguistics. It uses the frequencies of letter sequences in a body of text as the basis of stemming. Once the successor varieties for a given word are derived, this information is used to segment the word. The word can be segmented in different ways, and one of the segments is then selected as the stem.

n-gram method: In this approach, association measures are calculated between pairs of terms based on shared unique digrams. Once the unique digrams for the word pair have been identified and counted, a similarity measure based on them is computed (commonly Dice's coefficient S = 2C/(A+B), where A and B are the numbers of unique digrams in the two words and C is the number of digrams shared by both) and a similarity matrix is generated. Once this matrix is available, terms are clustered using the single link clustering method.

lookup table: Here terms and their corresponding stems are stored in a table. The terms of the documents are compared with this table; when a match is found, the appropriate stem replaces the original term.

Overall we can say stemming affects retrieval performance: studies [Sal68, Hw74, KMT+82, Har91] show that stemming does not negatively affect retrieval performance with respect to precision and recall, and in most cases it rather has a positive impact on performance. Stemming has a marked effect on the size of indexing files, which ultimately increases system speed.
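As an illustration of the n-gram method described above, a small Python sketch of digram-based similarity using Dice's coefficient (pair_similarity is an illustrative helper name; the clustering step is omitted):

    def digrams(word):
        # set of unique adjacent letter pairs in a word
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def pair_similarity(w1, w2):
        # Dice's coefficient S = 2C / (A + B) over unique digrams
        d1, d2 = digrams(w1), digrams(w2)
        if not d1 or not d2:
            return 0.0
        return 2 * len(d1 & d2) / (len(d1) + len(d2))

    # e.g. pair_similarity("statistics", "statistical") is high, so the two
    # terms would fall into the same cluster and share a stem.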

Another approach to improving IR looks at text from a semantic point of view and focusses on the meaning of words. Thesauri are knowledge sources that specify semantic relationships between words. With the use of a thesaurus, a single word is replaced with a cluster of words specified as being related. Some experiments on thesaurus-based feature extraction have shown improvements in retrieval effectiveness [Fox83]. The problem with the use of a thesaurus in our case is the scope of our collection.

Another approach to improving IR performance is to incorporate phrases as indexing terms. An indexing phrase is an indexing term that corresponds to two or more single word indexing terms. Here the statistical and syntactic approaches are mixed: a phrase is generated using syntactic analysis, and statistical analysis is carried out to assign a weight to it. Many phrase formulation techniques are available in the literature. A simple method is to take the conjunction of two or more existing terms. A more complex method is used by Salton [SM83] but gives no real improvement despite adding considerable computational complexity.

2.2.3 Searching strategy

The purpose of IR is to develop techniques that provide effective access to large collections of text, with the aim of satisfying a user's stated information need. The selection should be such that documents the requester will evaluate as sufficiently relevant are retrieved [Rad88]. The two most common approaches are to use boolean operators in specifying a query, or to give a query in natural language.

Boolean Search

In the standard conventional boolean retrieval model, a query is stated in terms of conjunctions, disjunctions, and negations. This searching strategy retrieves those documents which are "true" for the query, and every document retrieved is perceived as being of identical usefulness in satisfying the user's information need. The drawback of this is that it does not take into account the ranking of documents. To tackle this problem many researchers have outlined a number of non-traditional approaches; in particular, attempts have been made to incorporate a document ranking mechanism into a conventional boolean search mechanism. One of the earliest efforts to achieve this objective was by Noreault et al. [NMK77]. Here a boolean search request is processed against the collection and the documents that contain the query terms are identified. Thereafter the cosine measure is used to compute a similarity between the query and the documents, and the output documents are presented in descending order of the cosine measure. With commercial use of the boolean approach the need for ranking has increased. Salton et al. [SFW83], Heine [Hei82], Radecki [Rad82] and many others came up with rigorous methodologies for enhancing the conventional approach. One of the most thoroughly studied and experimentally verified models is the P-model proposed by Fox [FBY92]. Here both document index terms and query terms have weights. If D is a document with weights d_A1, d_A2, ..., d_An with respect to keywords A_1, A_2, ..., A_n, it is considered to be a point with coordinates (d_A1, d_A2, ..., d_An) in an n-dimensional space. The generalised queries are of the form:

Q_or^p = (A_1, a_1) or^p (A_2, a_2) or^p ... or^p (A_n, a_n)
Q_and^p = (A_1, a_1) and^p (A_2, a_2) and^p ... and^p (A_n, a_n)

Heine [Hei82], Radecki [Rad82] and many others came up with rigourous methodologies for enhancing the conventional approach. One of the most thoroughly studied and experimentally verified models is the P-model pro­ posed by Fox [FBY92]. Here both document index terms and query terms have weights. If "D" is the document with weight dAl, dA2, ... , dAn with respect to keywords A1, A 2 , •.• , An it is considered to be a point with coordi­ nates (dAl, dA2, ... , dAn) in an n-dimensional space. The generalised queries are of the form:

where "P" indicates the degree of strictness ranging from 1 for least strict to oo for most strict of the operator.The similarity between a query Q and document D is calculated by:

p dP+P dP+ p dp P al * Al a2 * A2 ... an * An SIM(Qorp,D) = alP+ a2P+ ... anP

where

a_i = weight of term i in the query
d_Ai = weight of keyword A_i in document D

Various experiments were conducted by Fox et al. [SFW83, FS86, LF88], reporting that this model is effective, as it retains the simplicity of conventional boolean searching while also having a good ranking scheme.
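A small Python sketch of the p-norm operators described above (the AND form is the standard mirror-image companion of the OR formula given in the text; the query weights a and document weights d are parallel lists):

    def pnorm_or(a, d, p):
        # p-norm OR similarity: high if any heavily weighted keyword matches
        num = sum(ai ** p * di ** p for ai, di in zip(a, d))
        den = sum(ai ** p for ai in a)
        return (num / den) ** (1.0 / p)

    def pnorm_and(a, d, p):
        # p-norm AND similarity: penalises keywords that are missing
        num = sum(ai ** p * (1 - di) ** p for ai, di in zip(a, d))
        den = sum(ai ** p for ai in a)
        return 1 - (num / den) ** (1.0 / p)

With p = 1 the operators behave like weighted averages (least strict); as p grows they approach strict boolean OR and AND.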

Paice [Pai84] proposed a model for ranking documents in a conventional boolean environment using fuzzy set theory. He assumed that there is a fuzzy set associated with each keyword and that the weight of a document with respect to a keyword represents the degree of membership of the document in the keyword's fuzzy set. In a document D with keywords A_1, A_2, A_3, ..., A_n and keyword weights d_A1, d_A2, ..., d_An, the similarity between a query Q and the document D is expressed by:

SIM(Q, D) = Σ_{i=1..n} r^(i-1) * d_i / Σ_{i=1..n} r^(i-1)

where r is a constant, and the d_i's are considered in descending order for queries of the form Q_or = (A_1 or A_2 or ... or A_n) and in ascending order for queries of the form Q_and = (A_1 and A_2 and ... and A_n). Lee et al. [LF88] conducted experiments with this model and concluded that its high computational cost, due primarily to the requirement for sorting keywords, was prohibitive.

Losee and Bookstein, as mentioned in [FBY92], introduced a document retrieval model that allows conventional boolean query structure to be incorporated into a probabilistic model. Experiments were conducted using this model, where weights were calculated using a probabilistic ranking algorithm.

Apart from boolean search, queries can also be represented in natural language form. In linguistic search, the query is expressed in natural language without the use of boolean operators. Here it is not necessary for each query term to match an index term of a document for the document to be ranked as important; partial matching can also rank a document as important, depending on the importance of the matched query terms.

There are also other approaches, such as knowledge-based systems, which attempt to interpret the text of the query. Many researchers have also given attention to modelling systems that can handle elliptical queries, where instead of supplying a whole query one can simply say "find more documents like this one".

2.3 Summary

This chapter introduced and defined basic IR concepts and explained the different components of an IR system. Along the way it looked at different aspects of domain selection, text analysis, and searching strategy. It presented a survey of statistical and syntactical ranking models and experiments, and also examined the different searching strategies that appear in the literature. Overall, this gives an overview of the work done in IR.

Chapter 3

Literature Review - Available Systems

3.1 History of newsreaders

Usenet came into existence in 1979, when two graduates at Duke University hooked their computers together to exchange information with the Unix community. Steve Bellovin [SM94], a graduate at the University of North Carolina, wrote the first news software in shell script. The first public distribution software was written by Steve Daniel; this was modified again to become the "A" news release. With the increase in the volume of news, in 1981 Mark Horton et al. at Berkeley rewrote "A" news and named it "B" news [SM94]. New versions of "B" news were released which incorporated a mechanism for moderated newsgroups, new naming structures for newsgroups, enhanced batching and compression, etc. With time the limitations of "B" news brought it to the stage where, according to Rick Adams, it is 'dead' and is unlikely to be upgraded any further.

The influx of data onto Usenet forced researchers to come up with new ways to tackle this overload problem. Consequently, a new version of news known as "C News" was developed at the University of Toronto. The goals of this new version were to increase processing speed, speed up article expiry, and improve the reliability of news systems with better locking. This version has been publicly available since 1987.

Another Usenet system, known as InterNetNews, or INN, was written by Rich Salz; it was designed to run on Unix hosts that have a socket interface. It provides both NNTP¹ and UUCP support. It is very fast, and its integration with NNTP makes it very easy to use.

Until now we have talked about transporters like A, B and C news and INN, which basically assist in propagating the news. Their main task is to take posted articles and deliver them to other nodes (servers connected to Usenet). As the number of people reading news increased, the total load increased. To reduce this, the Usenet news system provided a method whereby all news articles are stored on a centralised host. A subscriber can connect to a local host and send/request/post specific articles. For that purpose a package was released in 1986 implementing news transmission, posting and reading using the Network News Transfer Protocol (NNTP). NNTP has commands which provide a straightforward method of exchanging articles between cooperating hosts.

¹ Network News Transfer Protocol

Geoff Collyer implemented NOV (News Over View), a database that stores the important headers of all news articles as they arrive. This was designed so that with its use news readers could provide fast article presentation by sorting and "threading" the article headers.

Geoff Huston at the Australian National University wrote a news package called ANU-NEWS for VMS systems. ANU-NEWS is a complete news system that allows reading, posting, direct replies, moderated newsgroups, etc., in a fashion closely related to regular news. The interface is very simple, as it is designed for VMS systems.

For reading the news, the general approach is to use a news reading interface. Broadly speaking, news reading interfaces can be classified into basic news reading systems and graphical direct manipulation systems.

Basic news readers have a text-based interface. They supply all of the needed functionality, such as sending, receiving, deleting, and saving articles. This basic interface presents users with a succession of text lines containing the names of newsgroups and the number of unread messages in each group. The list of all subscribed newsgroups is stored in a file; this file also determines the order in which newsgroups are presented to the user. Many of the most popular news reading interfaces fall into this category, because these news readers can run on any terminal. Graphical direct manipulation systems include all the functionality of basic news reading systems with the additional use of graphical representations of newsgroups/directories and a direct manipulation interface for carrying out the basic operations.

3.1.1 Popular screen-oriented news reading interfaces

rn: "rn" was developed by Larry Wall in 1984. It provides a full-screen display. The interface also includes reading, discarding, and/or processing of articles based on user-definable patterns, and the ability for users to develop customised macros for display and keyboard interaction. In our initial approach we used rn for fetching articles from our local NNTP server, but this approach was not effective because of slow access speed.

trn: With the increase in the scope and range of discussions, it became hard to track followups using rn. To address this, Wayne Davison came up with a new interface called trn. As well as providing all the functionality available in "rn", he added the ability to follow "threads of discussion" in newsgroups. To do so, "trn" uses a reference-line database that allows the user to take advantage of the "discussion tree" formed by articles and their followups. Trn is also capable of "menu-based" selection of articles, which gives the user extra ability in manipulating certain newsgroups.

nn: "nn" was developed in 1988 by Kim F. Storm of Texas Instruments. It presents a menu of article subjects and sender names; the display is very fast because it keeps an on-line database of article headers. Its macro language is relatively easy, which helps in customisation. It merges related newsgroups, tagging similar articles; this is necessary as it does not support threading. The serious drawback of "nn" is the limitation of the search area: the search is limited to the subject and/or author name. This limits the use of "nn", as articles may not have a precise subject, and in some cases no subject at all. Also, the author name is a "real name", which may be an alias rather than a full name.

tin: "tin" is a full-screen, easy-to-use news reader. It can read news locally or remotely via an NNTP server. It operates with threads, and has a distinct article organisation method. It has a built-in "indexing" system: once a particular newsgroup has been read, this system keeps track of all articles, their topics and the thread discussions of that newsgroup. Its special features include: keeping a history of the posting of articles; batch-mode support; and provision of kill files.

WinVN: "WinVN" is a Microsoft Windows and Windows NT-based news reader developed by Mark Riordan, a systems programmer at Michigan State University. It offers a more visual approach to Usenet News than traditional text-based news readers, allowing users to easily navigate amongst newsgroups and articles via its point-and-click interface. In normal operation, WinVN displays three types of windows: the main group-list window (main_window), which displays a list of all newsgroups; one or more group article-list windows (group_window), each of which displays a list of the articles in a newsgroup; and one or more article windows (article_window), each of which displays an article. Double-clicking on a newsgroup or article name causes that item to be displayed in a separate window. Its special features include a point-and-click interface and a well-organised display of articles.

GNUS & Gnews: These are macro packages that can be used with the GNU Emacs text editor. They are implemented in Lisp, and interaction with the news takes place inside the Emacs text editor. They use their own ".gnewsrc" file to determine read/unread articles. The command structure is similar to "rn" [UME]. They provide a user-programmable filtering mechanism called "hook kill" for the user to filter out unwanted topics, people, cross-postings, etc. Searching is restricted to a particular header field as selected by the user [UME].

SLNR: Simple Local News Reader is an off-line news reader. It is intended for users who want to connect to a host, download new articles onto a local machine, and read them there. It gets all the unread articles from the subscribed newsgroups; the articles are lumped together in a packet, which can then be viewed with the SLNR interface. Posting and replying are not in real time and can be done later. Packets are read and written using the Simple Off-line Usenet Package, SOUP. The limitation of this reader is that it does not provide a search mechanism. It can fetch a maximum of 5000 articles from 1000 newsgroups.

There are numerous other newsreaders available; most of them have a similar set of facilities to those described above (e.g. kill files, threading, menu-based interfaces, off-line reading, etc.). However, none of them effectively solves the fundamental problem of information overload. For example, none of them provides ranking to assist the user in determining the importance of one article over another. Another problem is the use of a ".newsrc" or similar file which needs to be maintained. The question of subscribing to interesting newsgroups creates a dilemma, as we cannot be 100% certain that the newsgroups we have not subscribed to contain no articles of interest. The threading approach is good when followups (replies) actually follow the original topic closely; however, discussions in a single thread can often shift considerably from the original topic.

3.2 Related work

3.2.1 SMART

SMART is an implementation of the vector space model of IR proposed by Salton in the 1960s. This system was designed from a very different perspective to our system, and incorporates such ideas as:

• Use of a fully automatic indexing method to assign index terms to a document.

• Correlation of related documents into common subject classes, making it possible to start with specific items in a particular subject area.

• Computation of similarity between query and documents, and also between documents, using cosine similarity measures.

• Relevance feedback depending on the judgement of the user about the relevance of the retrieved documents.

These basic ideas in SMART have not changed over the years; however their implementation has been considerably refined. The following descrip­ tion gives an overview of how the SMART system works.

SMART does not have any fetching mechanism, as it was meant for use with a static data collection. The first stage in using SMART is to index the documents: first, non-informative words are removed, then the suffix is removed from each remaining word, and the unique words are sorted and used as keywords. These keywords are assigned to the documents in the collection. A weight is assigned to each keyword using a formula described in section 2.2.2. The user types the request in natural language; this request is formulated internally and weights are assigned to each request term using the same formula used to calculate the weight of each index term representing a document. In some versions of SMART, document term weights are computed using the formula

$$w_{ij} = \frac{tf_{ij} \cdot \log \frac{N}{n_i}}{\sqrt{\sum_{\text{vector}} \left( tf_{ij} \cdot \log \frac{N}{n_i} \right)^2}}$$

where $w_{ij}$ = weight of term $i$ in document $j$, $tf_i$ = total term frequency of term $i$ in the entire collection, $tf_{ij}$ = term frequency of term $i$ in document $j$, $N$ = total number of documents in the collection, and $n_i$ = total number of documents in which term $i$ is found; and query term weights are computed using

$$w_{ik} = \left( 0.5 + \frac{0.5 \cdot tf_{ik}}{\max tf_k} \right) \cdot \log \frac{N}{n_i}$$

where $w_{ik}$ = weight of term $i$ in query $k$, $tf_{ik}$ = term frequency of term $i$ in query $k$, and $\max tf_k$ = maximum term frequency in query $k$ [SB88].

After that, a similarity measure is computed between each document and the query:

$$\mathrm{SIM}(D_j, Q_k) = \sum_{i=1}^{n} w_{ij} \cdot w_{ik}$$

where

$w_{ij}$ = the weight of term $i$ in a document $j$,

$w_{ik}$ = the weight of term $i$ in a query $k$,

$n$ = the number of unique terms in the data set.
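As a sketch (ours, not SMART's code), with a document and a query held as term-to-weight mappings, the match reduces to a sum of weight products over shared terms:

def similarity(doc_vec, query_vec):
    # Sum w_ij * w_ik over the terms common to document and query;
    # terms missing from either side contribute nothing.
    return sum(w * query_vec[t] for t, w in doc_vec.items() if t in query_vec)

doc = {"retriev": 0.41, "index": 0.23, "filter": 0.12}   # illustrative weights
query = {"retriev": 0.62, "filter": 0.31}
print(similarity(doc, query))  # 0.41*0.62 + 0.12*0.31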

During the initial development of SMART, term frequency was the primary criterion for determining similarity. With time the criteria have changed: term frequency, inverse document frequency, normalisation factors, etc., have been incorporated.

The documents matching a query are presented in order of similarity to the query (in descending order). There is an option whereby a user can give a judgement on the relevance of the retrieved documents. Depending on that information, a new query is generated which in turn retrieves more documents. This process is known as "relevance feedback".

We introduced SMART here primarily because it is by far the most popular IR system in the academic world. Let us now look at some systems dealing with Usenet data.

3.2.2 SIFT - Stanford Information Filtering Tool

SIFT (Stanford Information Filtering Tool) is a system for performing wide-area information dissemination. It provides two dissemination services: one that delivers USENET News (Netnews) articles, and another for Computer Science technical reports. We consider SIFT from the point of view of a Usenet News article service provider. From that point of view, SIFT is a tool to help provide news articles from Usenet depending on a submitted profile. It supports full-text filtering using a vector space model. The filtering engine uses an indexing technique proposed by Yan et al. [YGM94] which is capable of processing large volumes of information against a large number of profiles. It can be used as a clearinghouse server that gathers large amounts of information and selectively disseminates it to a large population of users.

The mechanism is simple. First a user subscribes to a SIFT server with one or more subscriptions. A subscription includes an interest profile along with some additional parameters like frequency of updates, threshold for the number of articles, length of subscription, etc. The interest profile can be expressed in either a vector space model or a boolean model. When expressed using the vector space model, a weight should be specified along with each term, for example ((information, 50) (retrieval, 70)). These query terms are then used to compute the similarity between each document and the query. The articles are retrieved depending upon the query terms and the index terms of the articles, and the number of retrieved articles depends on the threshold provided by the requester. The output is then mailed to the user by the system. SIFT also supports a boolean query model whereby a user can use conventional boolean operators in the query.

One point to mention here is that this system does not support phrases, as indexing is done only on single terms. In our system we support two- and three-word phrases as document terms.

3.2.3 Tapestry

Tapestry is an experimental system designed to receive, filter, and browse electronic documents. It was developed at the Xerox Palo Alto Research Center [GNOT92]. It uses a relational model for matching user interests and documents, and collaborative filtering whereby users are encouraged to annotate documents; these annotations can then be used for filtering. A Tapestry user can write his own queries using a complex query language called TQL², which is based on SQL³. The filters can access annotations. Documents matching a filter are returned as soon as the document receives the specified annotation. The "Sybase" database is used to store documents, annotations and filter queries. In Tapestry, documents are received, stored and then processed in a batch to be sent to users. Storing all incoming documents clearly creates problems for the storage subsystem. One problem that may arise under this scheme is that since each query runs on the entire collection, it may retrieve previously retrieved articles. To avoid this, a tagging mechanism is used which only allows unread articles to be selected.

² Tapestry Query Language
³ Structured Query Language

Articles can be examined using a Tapestry browser or forwarded via email. Goldberg et al. [GNOT92] mention that articles sent via email are only prioritised for display in the last step of the Tapestry process. Also, when receiving articles via email one cannot use full TQL to search for other relevant articles. This clearly makes the use of a Tapestry browser the best option.

3.2.4 URN

URN is a multi-user, collaborative, hyper-textual Usenet reader [BJ94]. It provides a weighting mechanism that explicitly represents both the level of interest and the confidence associated with it. It has a voting mechanism which is used to grade articles; this information is used in turn to weight the articles.

Initially the keyword list (i.e. the special features of an article) is displayed to the user. The user changes the list according to his interests, and this information is then used to recalculate the weights of the articles which contain the added/deleted terms. The user can then order articles by interest level.

In URN, the work is done collectively by all users, so the overhead load is shared. After reading the articles, the user grades each article as interesting, ambivalent, or uninteresting, and the vote is recorded. After the user votes on articles, a background process takes the votes and turns them into weights. The process reads a list of all keywords that have been manually added by the users. The bodies of all articles in the database are scanned for these words; if any are found, then those keywords are promoted to the keywords field of the article in which they were found.

Once all individual weighting functions are generated, the new weights of all articles in the database are computed for each user. For each article, the list of keywords is compared with the list of weighting functions for a user. If there is a match, then that weighting function's weight is added to the user's weight for the article.

URN was basically an experimental system. The scope of the article collection was limited to only one newsgroup, and the results were based on 10 days' usage of the system. The results are fascinating but do not appear to scale to the real-world news-reading context, where there are approximately 11000 newsgroups.

3.2.5 INFOSCOPE

INFOSCOPE is a news reading tool which allows the creation of virtual newsgroups via filtering [Ste92][FS91]. Articles from different newsgroups are selected and tagged together depending on a series of patterns. The filters are generated by background programs which keep track of user behaviour and in turn give suggestions to the user. If a suggestion is rejected, this action is recorded for future use, so that in a similar situation the same filter will not be suggested. It is not a collaborative support system but rather permits filtering on an individual level. It provides a good graphical interface for browsing, using the Usenet hierarchy as a tree structure. The limitations of this system lie firstly in the approach taken for constructing filters, where the user has to construct his own filters. Secondly, the search is limited to the newsgroups specified in the filter and confined to the headers of the articles. Thirdly, there is no provision for searching the entire collection of posted articles. Finally, messages are tagged and displayed in hierarchical order, and there is no mechanism to reflect the importance of one article over another.

3.2.6 Deja News

Deja News is a Usenet news archiving service with the largest collection of indexed archived Usenet news. The articles posted to newsgroups are catalogued, and these often include references to resources all over the Internet.

A request is placed via a "Query Form" where the keywords are separated using boolean operators. It supports partial as well as full search. A filter can be created whereby the search can be limited by Author, Subject, Newsgroup, or Creation date. The result displays the date, score and subject of each article, the newsgroup it was found in, and the author. The list is in descending order of article score. The score is calculated based on the number of matches to the given keyword(s), the ratio of keyword(s) to the total words in the article's body, and the posting date of the article. An article can be viewed by selecting it from the displayed list.

So it can be said that Deja News provides good searching options, but the user has to create his own filter, which has to be changed explicitly as needs change. The system has no mechanism to give the user feedback while constructing the filter. Duplicate articles are displayed when the same article is posted in different newsgroups, which often becomes frustrating for the user.

Chapter 4

IAN - Intelligent Assistant for News reading

4.1 Preview of IAN

The Intelligent Assistant for News reading (IAN) system is an experimental prototype of an adaptive news filtering system. It automatically selects articles from Usenet news based on a user profile and the weights of keywords in the articles, observes how the user reacts to these articles, and on the basis of the user's behaviour adjusts the profile to more accurately reflect the user's interests.

IAN is composed of four main modules: classifier, selector, presenter and adapter.

Our work is concerned primarily with the first two modules. The aim is to efficiently and effectively extract information from incoming articles and use this information to filter out incoming news articles which are not of interest to the user.

Figure 4.1: Modules and flow of data

This chapter deals with the implementation of an Intelligent Assistant for News reading (IAN) [NS93]. We will describe the system design, including data acquisition, the automatic indexing method, the retrieval mechanism, query formulation and the searching process. The IAN system carries out 3 major tasks, namely: Fetch_Articles, for fetching articles from a local NNTP server; Automatic indexing of news articles, for selecting content identifiers for each article; and Retrieval of relevant news articles based on user profile. These tasks are carried out by different modules. Overall the IAN system consists of 7 modules, namely: Article_Fetcher; Data_Cleaner, which removes unwanted parts of the header and footer; Stopper, which removes stop-words; Stemmer, which removes suffixes from words; Phrase_Generator, which generates phrases; Weight_Assigner, which assigns a weight to each index word representing an article; and Search, which takes an ad hoc user query or the user profile as a query, executes it and finally displays the results.

4.2 Introduction

IAN is an attempt to explore approaches to effective information management for Usenet news [NS93].

With the advent of better graphical user interface technology, most newsreaders have moved towards simpler, more intuitive interfaces. This has opened up Usenet to a much larger number of people, with the attendant problems of a larger number of low-quality articles, inappropriate use of newsgroups, and so on. Overall this has worsened the information overload problem.

To fully utilise the information available on Usenet, we need to look at it from a different perspective and explore different theories which appear in the literature. After reviewing the literature and examining various approaches and experimental conclusions, we have developed a system called IAN (Intelligent Assistant for News reading) which works outside the newsgroup classification hierarchy and determines relevance based on content.

4.2.1 How does IAN work

• Takes date and time information to put a constraint on posting time.

• Connects to the NNTP server, collects articles as per the specification provided by the user, and stores all the articles in a directory.

• Cleans up the articles. Non-useful information contained in the header and footer of an article is eliminated.

• A stop-word list of 514 stop words is used to delete the high-frequency function words that are insufficiently specific for content representation [Sal86]. This decreases the size of the text by 30%-50% [vR75].

• A suffix-stripping routine based on Porter's algorithm is used to reduce the remaining words to word stems. This recall-enhancing transformation broadens the scope of the terms [Sal86].

• As well as single keywords, two- and three-word phrases are generated by considering adjacent words in the text.

• Weights are calculated for each keyword, giving the importance of the keyword in the document.

• Each keyword has a link to the articles in which it is found. This represents a list of articles linked to a keyword, along with the weight of that keyword in each article.

• The search mechanism builds the structure of keywords in RAM and displays the list of articles and their subjects according to the terms in the search request. The display is in descending order of weights.

• Search requests are submitted in boolean form. They are supplied by the user in the user profile or as an ad hoc query, and are used by the search mechanism for retrieving articles.

4.3 IAN system

Here we will talk about the design and implementation of our part of IAN. The IAN system has 3 major functions:

• Fetch articles.

• Automatic indexing of news articles.

• Retrieval of relevant news articles based on user profile.

Figure 4.2: Major Functions of our part of IAN

The above functions find the articles which are posted each day on Usenet and determine the relevance of each article based on content, according to the user profile. Detailed descriptions of each of these functions follow. Figure 4.3 gives an overview of the whole process.

4.3.1 Fetch articles

Generally, Usenet articles are stored on the local NNTP server. To extract relevant articles from that huge collection we need a fetching mechanism. Our initial approach¹ was very slow, so we decided to fetch articles directly from the NNTP server. We designed a new module called the Article Fetcher.

¹ A shell script which invoked rn (read news) to collect news articles.

Figure 4.3: Different modules and flow of data

Article fetcher

The aim of the Article Fetcher is to collect those articles from Usenet which are posted after a certain time. This task is carried out by a program called ARTICLE_FETCHER. This program connects to the local NNTP server and collects articles posted after a certain time. This time is either provided by the user or defaults to the time when the program was last run.

The following explanation will give a clearer picture.

• Connect to the NNTP server. Our ARTICLE_FETCHER client connects to the local NNTP server.

• A request is sent to the NNTP server to get the list of message_id(s), from all newsgroups, of articles that have arrived since a specified date and time. The command schema is:

NEWNEWS newsgroups date time [GMT]

e.g. NEWNEWS * 950221 230000, where * means every newsgroup.

• The articles are obtained by passing the message_id(s), one at a time, as a parameter to the NNTP command "ARTICLE message_id". The retrieved articles are stored in individual files in a sequential manner.

With the implementation of this module, articles are fetched at very high speed. It takes only 17 minutes to fetch 10000 articles, which we consider quite reasonable for fetching 20 to 25 megabytes of data.
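For illustration only, the same NEWNEWS/ARTICLE exchange can be sketched with Python's nntplib (part of the standard library up to Python 3.11). The thesis's ARTICLE_FETCHER was its own client; the server name and output layout here are hypothetical:

from datetime import datetime, timedelta
from nntplib import NNTP

with NNTP("news.example.com") as server:        # hypothetical local server
    since = datetime.now() - timedelta(days=1)
    # NEWNEWS: message-ids of articles posted to any group since `since`
    _resp, ids = server.newnews("*", since)
    for i, msg_id in enumerate(ids):
        # ARTICLE: fetch one article by message-id
        _resp, info = server.article(msg_id)
        with open(f"articles/{i}.txt", "wb") as out:   # store sequentially
            out.write(b"\n".join(info.lines))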

4.3.2 Automatic indexing of news articles

After collecting the articles, we need to find what information they contain and which articles contain information of interest. For that we need to correlate the user's needs with the article content. But, as we know, there are around 30000 to 35000 new articles posted every day, which contain 5 million words; on average there are about 200 words per article². These words can be informative, grammatical or common words. The grammatical and common words generally do not contain information, so we need to identify the informative words, which we can then associate with the articles for later use.

² Total 200 words per article.

To extract meaningful words we need to analyse the text. The text analysis process identifies important words which can be used to represent the articles. We use a bottom-up approach to identifying informative words. First of all, the known unnecessary parts of the articles are removed; we know that fields like NNTP-Posting-Host:, X-Newsreader:, Distribution:, etc. do not contain any relevant information that can be used as content identifiers. Next, common words are identified and removed, because these words are worthless as index terms [FBY92]. Careful thought was given to selecting stop-words, as careless selection may lead to information loss. Then each word is converted to its root (using Porter's algorithm). The remaining words in the articles can be taken as index words. In this list there will be some words whose meaning depends on the context (for example potato chips, processor chips, rail network, neural network, etc.). If we can formulate phrases using the selected index terms then we can narrow the scope of these words.

The final aspect we consider while indexing is assigning weights to the index terms³. This task is most important, as it evaluates the importance of each term representing an article. Many approaches appear in the literature⁴. The reason we use

$$w_{ij} = \frac{tf_{ij} \cdot \log \frac{N}{n_i}}{\sqrt{\sum_i \left( tf_{ij} \cdot \log \frac{N}{n_i} \right)^2}}$$

is that it incorporates the three essential factors that affect the overall ranking⁵. The above-mentioned procedures are carried out by individual modules. Figure 4.4 gives an idea of the involvement of the different modules in achieving this goal.

³ The need for weighting index terms is discussed in full in chapter 2.
⁴ Explained in detail in chapter 2.
⁵ How it helps in ranking is discussed in the explanation of the Weight_Assigner module.


Figure 4.4: Modules involved in text analysis

Data_Cleaner

The task of the Data_Cleaner is to remove any text which is not directly relevant to the information content of the article. Its tasks include:

• Removing all fields from the article header, except for the subject line.

• Removing the footer. Some articles have a footer, which may contain information like name, address, proverbs, icons, symbols and other things. This information represents the author of the article, not the article itself; for that reason we remove it. We recognise footers by searching the last lines for patterns like "--" or words like "Regards" or "Thanks", and removing those lines.

• Placing a marker at the end of the subject and at the beginning of the message (body). This is done to distinguish between the subject heading and the body of the article. We started out by adopting the simplest approach to article partitioning; in future, this could be extended so as to mark various "regions" of the article for the purpose of weighting these regions separately.

This module sends the output to the next module called Stopper.
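To make the cleaning step concrete, here is a minimal Python sketch of the behaviour just described; the "<SUBJECT-END>" marker, the 5-line footer window and the function name are our illustrative choices, not the thesis code:

def clean_article(lines):
    # Split header from body at the first blank line.
    blank = lines.index("") if "" in lines else 0
    subject = next((l for l in lines[:blank] if l.startswith("Subject:")), "")
    body = lines[blank + 1:]
    # Footer heuristic: signature delimiters or closing words near the end.
    for i in range(max(0, len(body) - 5), len(body)):
        stripped = body[i].strip().lower()
        if stripped in ("--", "- -") or stripped.startswith(("regards", "thanks")):
            body = body[:i]
            break
    # Marker separating subject heading from message body.
    return [subject, "<SUBJECT-END>"] + body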

Stopper

In a collection of articles there are many grammatical function words (for example a, and, in, of, is, he, that, what, would, has, to, you, for, on, not, but, was, were, will, could, etc.) and a number of common words (for example some, like, just, more, less, any, many, do, all, from, by, etc.). We looked into the frequency of occurrence of these and a few other similar words in a collection of 30000 articles. We found that the frequency of this type of word is very high: for example, in a single day's collection of articles, "the" occurred on average 200000 times, "to" 120000 times, "and" 90000 times, "in" 75000 times, "is" 60000 times, "that" 50000 times, and so on. Words of this type do not contain any information about the content of the articles in which they appear.

To remove these nonsignificant terms we used the stop-list source code⁶ of William Frakes and Ricardo Baeza-Yates [FBY92], implemented by C. Fox and J.U. Madison [FM] and found in the public domain. This program was written to demonstrate and test stop-list filtering [FM]. It takes a single filename on the command line and lists the unfiltered terms on standard output [FM].

⁶ The source code appears in their book Information Retrieval: Data Structures & Algorithms.

We modified the source code. The original code removed stop-words, digits, and punctuation. We thought that removing digits was not a good idea: if, for example, the user wants information on "os/2" and digits are removed, then the query term will retrieve all articles containing information on "os/2", "os/9", and so on. We also made changes so as not to filter out all punctuation. Punctuation characters like ".", ",", "/" and "-" were retained, as each conveys some meaning. For example, "." and "," help to identify the end of a sentence, and "." carries special meaning in phrases such as "bsd 4.3". "/" can also be found in a number of common computer terms and carries useful information, for example "os/2".

The overall tasks of this module are:

• Convert all upper case to lower case. This is done because string searching and pattern matching during frequency calculation would otherwise not identify upper- and lower-case forms as the same word.

• Keep punctuation like ".", ",", "/" and "-" while removing the rest. The advantage of doing so is explained in the previous paragraphs.

• Compare each word in the article to the list of stop-words. Remove any words that appear in the stop-word list and store the remaining words in a file.

With the removal of common words, the list of keywords representing an article decreases. This helps in manipulating large numbers of articles without overloading the system.
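As an illustration of the Stopper's behaviour, the following sketch (ours, not the Frakes/Fox code) uses a tiny stand-in for the 514-word stop list and keeps the retained punctuation:

import re

STOP_WORDS = {"the", "to", "and", "in", "is", "that", "of", "on", "a", "for"}

def stop_filter(text):
    # Lower-case everything, keep only word characters plus the retained
    # punctuation (. , / -), then drop stop-words.
    tokens = re.findall(r"[a-z0-9.,/-]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_filter("The kernel runs on BSD 4.3 and OS/2"))
# ['kernel', 'runs', 'bsd', '4.3', 'os/2']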

Stemmer

The next stage after removing stop words is suffix stripping. The question which arises here is why we need stemming. In answer, we put forward several points which justify the need for stemming.

• Irregularity: The purpose of this sub-module is to reduce words to their root form by removing suffixes. This helps in identifying words having a similar root: for example, words like "acceptance", "accept" and "acceptable" are all changed to "accept". This gets rid of irregularities and allows us to count variant forms as instances of the root form; the three words "acceptance", "accept" and "acceptable" would be counted as one word, "accept", with a frequency of 3.

• Size and complexity reduction: Each article is described by words, and words with a common stem will usually have similar meanings. To improve the performance of IR, we conflate words with a common stem into a single term. In addition, the suffix stripping process reduces the total number of distinct words in the IR system, thereby reducing the size and complexity of the data in the system.

• Eliminating the need for wild cards: A stemmer also helps to simplify query entry by eliminating the need for wild-card specifiers to indicate word roots.

Over the years many approaches to stemming have been reported in the literature [Lov68, PL69, Daw74, Por80]. Usually suffix stripping programs are given an explicit list of suffixes and, for each suffix, the criterion under which it may be removed from a word to leave a valid stem [Por80].

In our system, we used a version of the Porter algorithm implemented by William Frakes [FT]. Our stemming module does the following:

• Strip the suffixes of words without removing alphanumeric characters.

• Take each article, stem it, and store the stemmed words in an output file for further use.

After stripping suffixes, we generate phrases. This is handled by the next module called the Phrase_Generator.

Phrase_Generator

In recent years researchers have become increasingly convinced that the performance of IR systems can be greatly enhanced by the use of phrases for automatic document indexing and retrieval [SM83, SBS90]. In practice, it is not feasible to remove high-frequency non-common words (for example information, computer, network, chip, etc.), so it is necessary to develop a different approach. The most obvious method is to generate phrases. This narrows the scope of matching without hampering the wider scope: for example, if a user asks for "network" then he/she will get articles on every type of network, but if he/she asks for "neural network" then he/she will get articles only on "neural network". With the above point in mind we put forward a simple phrase-formation process called the Phrase_Generator. The task carried out by this process is described below.

Initially the subject of an article is identified and subject phrases are generated⁷. The criterion used here is to join adjacent words in the subject; it is used because it is simple. There are more complex methods that deal with syntactic and semantic characteristics, but they do not produce better results [SM83, SBS90]. The following example shows how we generate phrases: an article with the subject "phrase generation process" produces the following terms:

phrase
generation
phrase_generation
process
generation_process
phrase_generation_process

⁷ The subject is easily identified, as the Data_Cleaner module puts markers at the start and end of the subject line.

Phrases are also generated for the message (body) of an article. Here single⁸ and double⁹ terms are generated by concatenating adjacent words, respecting sentence boundaries (i.e. the end of a sentence marks the end of a phrase). Sentence boundaries are identified by "." and ","; thus a "." or "," prevents the last word of the current sentence from being joined to the first word of the next, and a new phrase starts at the beginning of each new sentence.

The above process runs on each article individually, but the output is appended to a single file which contains the article numbers and keyword lists of all articles.

The output of the Phrase_Generator is a sorted list of the single words and phrases in the article, with frequency counts attached. This list, along with the article itself, is passed to the next phase.
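A minimal sketch of the body-phrase step described above (single and double terms only; subjects additionally get three-word phrases). The function name is illustrative:

def generate_phrases(words):
    # "." and "," end a sentence, so no phrase may cross them.
    terms, sentence = [], []
    for w in words + [","]:          # trailing delimiter flushes the last sentence
        if w in (".", ","):
            terms.extend(sentence)                                    # single terms
            terms.extend(a + "_" + b for a, b in zip(sentence, sentence[1:]))
            sentence = []
        else:
            sentence.append(w)
    return terms

print(generate_phrases(["phrase", "generation", "process"]))
# ['phrase', 'generation', 'process', 'phrase_generation', 'generation_process']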

The next job is to rank the keywords in each article. This is done by calculating weights, as shown in our next module, the Weight_Assigner.

Weight_Assigner

In conventional retrieval systems it is not customary to assign weights to terms to show their importance; the general strategy is limited to binary indexing [SW79]. It is possible to assign a binary weight to each term in the global vocabulary: 1 indicates that the term appears in the article, while 0 indicates that it does not.

⁸ Single terms.
⁹ Phrases generated by joining two adjacent words.

The difficulties of using binary indexing are:

• All terms are assigned the same importance.

• During a search under the binary scheme, we are likely to retrieve fewer articles if the query is a conjunction and more articles if the query is a disjunction.

To overcome this, we need another approach: weighting the terms. Our Weight_Assigner is theoretically based on the term-weighting system proposed by Salton [SW79, SM83, Sal70].

In principle, a system that produces both high recall and high precision is preferred [SB88]. In practice this is not possible, so we have to reach a compromise. Three main considerations appear to be important in this regard, as proposed by Salton and Buckley [SB88]. We explain each of them briefly:

• A term that is frequently mentioned in an individual article is considered to be useful as a recall-enhancing term. Thus, while calculating weights, the term frequency factor should be considered. According to [SB88], Salton and Buckley considered the frequency of occurrence of words in both the articles and the query, but we consider the frequency of occurrence of terms only in the articles. This is because Salton and Buckley's queries were either short pieces of English text or whole articles (e.g. a query which asks "give articles similar to this one"); in our case, queries are lists of keywords joined by boolean operators.

• The second consideration proposed by Salton and Buckley [SB88] is a collection-dependent factor known as Inverse Document Frequency (IDF). The IDF system postulates that a term is a good discriminator if it has a high occurrence frequency in a particular document while its overall occurrence is low; if a term occurs in many documents in a collection, its use in a query will not help us to discriminate between documents. In our case we take a collection to be all articles posted each day. The size of the collection is around 30,000 to 35,000 articles, the number of articles posted daily on Usenet. We have used the same approach in identifying good discriminators and assigned weights accordingly. The IDF factor varies inversely with the number of documents $n$ to which a term is assigned in a collection of $N$ documents; in our case $N$ is the total number of articles arriving at our local server daily. The IDF factor can be computed as $\log \frac{N}{n}$.

• The third useful factor proposed by Salton and Buckley [SB88] is normalisation. They argue that in most situations a small article has few words, so there are few keywords, with even fewer words having high frequency, compared with large articles. In such a case the chance of matching queries also differs, and larger articles have a better chance of retrieval. In our case we have the same problem, where some articles may contain only a one-line reply while others contain very lengthy comments, questions or responses. To compensate for this difference in article length we have incorporated the normalisation factor, defined as $\frac{1}{\sqrt{\sum_i w_i^2}}$.

Taking into consideration the type of collection, we need to incorporate all the above factors. We decided to use the formula proposed by Salton and Buckley [SB88] for calculating the weights of the keywords representing an article. The formula is:

$$w_{ij} = \frac{tf_{ij} \cdot \log \frac{N}{n_i}}{\sqrt{\sum_i \left( tf_{ij} \cdot \log \frac{N}{n_i} \right)^2}}$$

where
$w_{ij}$ = weight of term $i$ in document $j$,
$tf_i$ = total term frequency of term $i$ in the entire collection,
$tf_{ij}$ = term frequency of term $i$ in document $j$,
$N$ = total number of documents in the collection,
$n_i$ = total number of documents in which term $i$ is found.
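The weighting computation can be sketched as follows (our illustration, operating on one day's collection held in memory; the thesis implementation writes its results to the INDEX_FILE described later):

import math
from collections import Counter

def assign_weights(articles):
    # articles: article_no -> list of index terms (after stopping,
    # stemming and phrase generation)
    N = len(articles)
    n = Counter(t for terms in articles.values() for t in set(terms))  # doc freq
    weights = {}
    for art, terms in articles.items():
        tf = Counter(terms)
        raw = {t: tf[t] * math.log(N / n[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # length norm
        weights[art] = {t: w / norm for t, w in raw.items()}
    return weights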

The keywords and their corresponding weights are stored in a structured manner, where each keyword is tied to the article numbers in which it appears and the corresponding weight of the keyword in each article. This structured file is used as input by the searching mechanism when building the search from the user profile.

Until now we have talked about how data is fetched and processed so that it can be used to achieve the final functionality, i.e. Retrieval of relevant news articles based on user profile. The next sub-section will explain how we achieve the last functionality provided by IAN.

4.3.3 Retrieval of relevant news articles based on user profile

Until now we have discussed how we retrieved articles and generated index terms. Now we consider how to use this information to select a subset of the news articles and to rank them according to their relevance to the user's interests.

All searching strategies are based on some kind of comparison between the query terms and the stored article terms. The differences between them lie in how queries are formulated, how they are compared (i.e. whether all terms in the query must match or partial matching is allowed), and how the results are presented (i.e. whether only the subject, the article numbers and subject, or the whole article is displayed).

The search strategy used in IAN for retrieval of relevant articles is based on the weighted boolean approach. We have incorporated features like:

• Accommodating the traditional boolean search with which most users are familiar.

• Ranking output in descending order of the weight(s) of the matched keyword(s) in the articles.

• Supplying the rank value (weight) to the user, to indicate the strength with which the respective article has been retrieved.

• Setting a threshold to control the size of the retrieved set.

• Placing a list of queries in a user profile.

• Queries need not be written as root words but can be written as normal words, with incorporation of phrases.

• Request terms (queries) are stemmed before being passed to the system (this is required because the index terms are stemmed).

To achieve the retrieval of relevant information, our system needs to carry out two tasks:

• Query formulation.

• Search mechanism.

To achieve these two tasks we have formulated two modules:

• UTOS (User TO System).

• SEARCH.

UTOS

As discussed above, the user provides each query in boolean form (e.g. neural network .and. query optimisation). Each term in a query may be written in natural-language form, and a term can be a multi-word phrase. In order to get the list of relevant articles we need to match the query terms with the index terms, so queries are pre-processed in the same way as articles: stopwords are removed, words are stemmed, and phrases are formed. The final query (e.g. neural_network .and. query_optim) is stored in a file called system_profile (for the queries in the system_profile file, see Appendix A), which is one of the inputs to the search procedure. (On-line search is also supported, where the user can submit a request at the prompt; the query is pre-processed as described above.)

The main objective here is to read queries from the user profile (see Appendix A) so that the user does not spend his precious time looking around for news. By default this process runs at night (the user can make it run at any time he/she likes, but by default it runs each night at 1am).

SEARCH

The main tasks of this module are:

• Represent the index terms in such a way that the matching process becomes efficient.

• Take a query as input.

• Match the query terms with the index terms.

• Display the result.

Represent index terms: For efficient matching of request terms and index terms we need an efficient data structure. After calculating the weights, the list of keywords is stored in a structured manner in a file called INDEX_FILE. Each keyword is tied to the article numbers in which it appears and its corresponding weights. The format is:

Term1 Article_no11 Weight11 Article_no12 Weight12 ... delimiter.

e.g.

yellow_page 4186 56694 3847 56694 4172 37796 5637 18898 4161 18898 0 0.

The INDEX_FILE is used as input to the search program, where each term, along with its article numbers and weights, is stored in the structure shown in Figure 4.5.

Figure 4.5: Keyword structure during search

The data structure used here is a linked list. Hashing is used to index keywords on their first character. The subject heading list contains the article number and the subject heading. The subject headings are hashed separately to avoid data redundancy (i.e. if we had included the subject heading in the linked-list structure, the same subject heading line would be repeated for each keyword in an article).

This provides a simple and efficient way of searching. One advantage is that the structure is built in memory, and is efficient to build and search. Hashing reduces the searching time, and the separate hashing of subject headings avoids data redundancy.
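A sketch of how such a structure can be loaded (a Python dict stands in for the first-character hashing and linked lists; the file layout follows the INDEX_FILE format above):

from collections import defaultdict

def load_index(path):
    # Each line: term art_no weight art_no weight ... ending with the
    # "0 0" delimiter (cf. the yellow_page example above).
    postings = defaultdict(list)
    with open(path) as f:
        for line in f:
            term, *rest = line.split()
            nums = [float(tok.rstrip(".")) for tok in rest]
            for art, w in zip(nums[0::2], nums[1::2]):
                if art == 0 and w == 0:     # delimiter: end of posting list
                    break
                postings[term].append((int(art), w))
    return postings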

Matching: Prior to matching we need to break the query into tokens, separating boolean operators and query terms. We use Lex and Yacc for this: the Yacc parser builds a query tree from the query terms and operators. The algorithm is as follows:

• The search request is broken into a sequence of instructions. This sequence of instructions produces a query tree for comparison with the list of article index terms.

• The output of this comparison is a list of article numbers with their weights and subject lines.

The parser generates a query tree as follows: for each term in the query a node is created. If a boolean operator is found between query terms then a boolean node is created, which contains pointers to the nodes of the adjacent query terms. The more complex the query, the more elaborate the tree constructed. For example, for "A .and. B" the query tree would be as shown in Figure 4.6, whereas for "A .and. B .or. C .and. D" the query tree would be as shown in Figure 4.7.

Figure 4.6: Query tree: representing A .and. B

Figure 4.7: Query tree: representing A .and. B .or. C .and. D

With this approach we can handle large queries. It also helps in speeding up the comparison process.

The comparison is carried out by comparing the query terms at the leaves of the query tree with the keyword list. If a match is found, the article is set aside. Traversal continues until the root is reached. At the end, depending on the list of selected articles (i.e. articles which contain the query terms as keywords and also satisfy the boolean conditions specified in the query), the final result is displayed.
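The evaluation of such a tree can be sketched as follows (tuple-based trees stand in for the Yacc-built nodes; matched-keyword weights are summed for the final ranking):

def evaluate(node, postings):
    # A leaf is a term: look up its (article, weight) posting list.
    if isinstance(node, str):
        return dict(postings.get(node, []))
    # An internal node: ('and' | 'or', left subtree, right subtree).
    op, left, right = node
    l, r = evaluate(left, postings), evaluate(right, postings)
    arts = l.keys() & r.keys() if op == "and" else l.keys() | r.keys()
    # Sum matched-keyword weights, used later to rank the display.
    return {a: l.get(a, 0.0) + r.get(a, 0.0) for a in arts}

# "A .and. B .or. C .and. D", cf. Figure 4.7:
tree = ("or", ("and", "A", "B"), ("and", "C", "D"))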

Display the result: The display of results plays an important role in evaluating the system. As explained earlier, we need to display the results so that the user can get an idea of the importance of each article. We display the output in descending order of the weights of the keywords in an article. The output contains the subject of the article, the sum of the weights of the keywords (those matching the query terms) in that article, and the article number. The actual format is:

Article number   Weight   Subject heading
157              73.379   Subject: Importing Format from Windows Phone Lo ...
3133             66.166   Subject: Job listing

Chapter 5

Experiments

This chapter reports on experiments to evaluate the classification and selection methods used in the IAN system, as well as the use of phrases as index terms.

5.1 Evaluation methods

Since the inception of information retrieval systems in the 1960s, there has been considerable interest in trying to test and evaluate the performance of information retrieval systems. So far, however, there is no clear guidance about how to go about this [Su92].

Several criteria and a great number of measures have been proposed and used for evaluating information retrieval performance. However, there is a lack of agreement as to what constitutes successful information retrieval performance or which are the best existing evaluation measures. In general there are three major approaches or criteria underlying existing performance measures: Relevance; Utility; User satisfaction.

Among these measures, Precision (the proportion of retrieved documents that are relevant) and Recall (the proportion of relevant documents that are retrieved) are the best known, most commonly accepted and most widely applied.
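In symbols, using the standard definitions:

$$\mathrm{Precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|}, \qquad \mathrm{Recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}$$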

However, there are several objections to the use of these measures for evaluation. First of all, both recall and precision are needed in any evaluation, and they have been found to have an inverse relationship with each other. Secondly, even though precision is easy to apply, recall is very difficult to implement because it requires knowledge of all the relevant documents in the collection for each query. Also, the word "relevance" carries different meanings depending on the context in which it is used. When calculating recall and precision using a small known data set (where a very small number of articles are retrieved), relevance is judged directly, without considering the usefulness of the documents retrieved for the stated request. Cooper [Coo76] argues that the recall measure, which depends on the relevance of items that have not been retrieved, may be inappropriate in all but the rare cases when a user is interested in a truly exhaustive search that catches everything that may relate to the query. This is not the case for a user looking for news articles in a very large collection.

In situations where recall is very difficult to calculate, one is left with little option but to interpret relevance in a way that is appropriate to one's concern. In such situations we equate relevance with usefulness, from the perspective of the user's view of the problem or information need. Relevance is then determined by the user's relevance judgement: if the user says that a document is relevant then, at that time, in that context, the document is relevant [Asl66]. More recently there has been increasing acceptance that stated requests are not the same as information needs, and that consequently relevance should be judged in relation to needs rather than stated requests [RHB92].

In our case we cannot explicitly calculate recall, as it is impossible to identify all the relevant articles in the collection for each query. This is due to the fact that each day we handle around 30,000 to 35,000 articles; to find all relevant articles for a given query we would need to manually examine each article, which is practically impossible.

5.2 Evaluation procedure

In the text processing environment, performance is often measured using recall and precision values, where recall measures the ability of the system to retrieve useful documents, while precision conversely measures the ability to reject useless material. Theoretically a system is said to be good if it has high recall as well as high precision. To calculate recall and precision we need to know how many articles in the whole collection are relevant to a given query; from that one can find out how many relevant articles were retrieved and how many were missed by the system. To find all relevant documents, the data collection should be either a small sample or a known static data set, where one can easily find the total number of articles relevant to a particular query. In the case of IAN the data collection is very large and dynamic (it changes every day). Thus it is impossible to know the total number of relevant articles, which makes it hard to evaluate the system on the basis of recall and precision.

Two approaches were used for evaluating the IAN system. The first was to evaluate the system performance using user-determined relevance, whereby an interest profile is taken from the user and run on the system. The retrieved articles are given to the user to judge their relevance; here relevance is based solely on the usefulness of the articles retrieved for the given query. In the second approach, we compared IAN's performance with an existing system (the Stanford Information Filtering Tool, which uses the WAIS similarity matching index). We considered not only the percentage of relevant articles retrieved by each system, but also the percentage of relevant articles missed by each system.

5.2.1 System performance method

The experimental evaluation of the IAN system consisted of using the system for a one-week period¹, filtering articles from all Usenet newsgroups (see the newsgroup list in Appendix A). Five students and a staff member were asked to provide their user profiles (see user profiles in Appendix A). Each day, all user profiles were submitted as queries to the system. The output was given to each user to evaluate the relevance of the articles; they were asked to mark the relevance of each article in terms of percentage importance. Table 5.1 shows the results from the experiment using this method.

¹ From 24th March 1995 to 30th March 1995.

Profile     No of Queries   Matched Articles   Relevance
Profile 1   2               338                61.66
Profile 2   9               101                47.12
Profile 3   2               50                 64.65
Profile 4   6               151                53.00
Profile 5   5               66                 51.10
Profile 6   7               67                 52.55

Table 5.1: Results from system performance method

The table indicates the number of queries each profile contains and the total number of articles found for those queries in one week. The relevance figure is the mean relevance rating given by each user to the articles they received for their queries.

5.2.2 Comparison method

We compared SIFT (Stanford Information Filtering Tool, version 1.1) with our IAN system. The procedure was to obtain the source code of SIFT (version 1.1), which was available from Stanford University, and install it on our local machine. This was done so that the same articles could be given as input to both systems at the same time.² The same set of queries³, provided by 5 students and a staff member, was given to both SIFT and IAN. The output from both systems was given to the users to rate the relevance of each article retrieved.

We went further and submitted phrase queries (formed out of the same single-term boolean queries) to IAN, and the output was again given to the users for evaluation.

Furthermore, we checked for relevant articles that were missed. We assume that the total number of relevant articles for a query is the number of distinct articles retrieved by IAN and SIFT for that query (considering the phrase queries as well), i.e. the union of the result sets, less the duplications.
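In terms of the column headings of Table 5.2, where S, I and P denote the sets of articles retrieved by SIFT, by IAN, and by IAN using phrases respectively, this assumed pool of relevant articles is the union of the three result sets, computed by inclusion-exclusion:

$$|S \cup I \cup P| = |S| + |I| + |P| - |S \cap I| - |S \cap P| - |I \cap P| + |S \cap I \cap P|$$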

This experiment was run for 3 days⁴ using the same user profiles on the same set of newsgroups (both listed in Appendix A). The appendix also contains the results for each individual day. Table 5.2 shows the average over the 3 days, obtained by averaging the daily values.

5.2.3 Discussion

From the above experiments, we found that the SIFT system produced more articles (444) than IAN (301); the mean relevance rating of the articles found by SIFT was 41.29%, while for IAN it was 51.13%. Also, the omission of relevant

²Initially we subscribed to the SIFT service at Stanford and submitted queries to compare the output. However, the set of newsgroups at Stanford is quite different to those on the local server, and so it was necessary to install SIFT locally.
³For the list of queries, see Appendix A.
⁴30th March, 4th April and 6th April 1995.

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1     18   11    -    8    -    -    -      43.6   66.6    -     45     22.76   -
  2     13    7    -    6    -    -    -      16     38      -     40      7      -
  3     10    9    -    7    -    -    -      48.05  52.05   -     35     23      -
  4     32   17   15   13   10   14   10      45.3   65.83  71.2   53.66  42.76  32.93
  5      5    6    -    4    -    -    -      68.66  66.66   -     46.6   30      -
  6      2    2    -    2    -    -    -      20     40      -     15     15      -
  7     16   10    1    8    1    1    1      39.7   69.3   90     26.1   10     58
  8     96   58   33   10    6   10    3      37.1   45.29  61.1   47.25  40.41  35.29
  9     84   39   13   30    8    9    7      49.86  68.03  84.66  46.5   45.43  54.5
 10      6    8    -    6    -    -    -      66.1   55.43   -     26.66   0      -
 11      5    4    -    4    -    -    -      33.95  38.3    -      0      0      -
 12     16    9    -    5    -    -    -      20.66  40      -     41.66  19.33   -
 13      8   14    -    8    -    -    -      36.66  38      -     23.33   0      -
 14     12   11    1    9    1    1    1      40     48.56  50     33.33  16     39
 15     11    8    -    5    -    -    -      28.1   36.66   -     20      4      -
 16     42   21    3   16    3    3    3      26.77  41.2   36     29.66  16.5    0
 17      3    4    -    3    -    -    -      77.5   78.3    -     40      0      -
 18     11   15    1    5    1    1    1      35.66  31     30     27.33  12.66   -
 19     28   32   25   19   17   18   16      49.76  51.26  58.03  57.7   54.33  37.43
 20     26   16    -   14    -    -    -      35.76  47.23   -     33.33  18      -

Total  444  301   91  182   47   57   42      41.29  51.13  65.22  33.15  18.98  41.21

Table 5.2: Results from comparison method - an average of 3 days

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total  444  301   91  182   47   57   42      41.29  51.13  65.22  33.15  18.98  41.21

Table 5.3: Result: Evaluation of performance

S       Total articles fetched by SIFT
I       Total articles fetched by IAN
P       Total articles fetched by IAN* (IAN*: here the query is formed using phrases)
S∩I     Total articles common to both SIFT and IAN
S∩P     Total articles common to both SIFT and IAN*
I∩P     Total articles common to both IAN and IAN*
S∩I∩P   Total articles common to SIFT, IAN and IAN*
%S      Mean relevance rating of articles found by SIFT
%I      Mean relevance rating of articles found by IAN
%P      Mean relevance rating of articles found by IAN* using phrases
%~S     Mean relevance rating of articles missed by SIFT
%~I     Mean relevance rating of articles missed by IAN
%~P     Mean relevance rating of articles missed by IAN* using phrases

articles in IAN was 19%, as compared to 33% for SIFT. It appears that the SIFT system focuses mainly on achieving high recall. This can be seen from the fact that SIFT uses a primitive weighting scheme which ignores many of the factors considered in Section 2.2.2. In SIFT, the weight depends solely on the frequency of occurrence of a word within an article: the first time a word occurs it receives a weight of 5, each subsequent occurrence adds a weight of 1, and words in the headline are worth an extra 10 points [Kah]. There is no consideration of the overall frequency of occurrence of a word across the collection when calculating the weight; this leads to assigning high weights to common words. SIFT also overlooks the length factor, which means that larger articles have a better chance of retrieval; this ultimately leads to missing small but relevant articles.
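To make the contrast concrete, the following is a minimal sketch of SIFT's scheme as described in [Kah]. It is an illustration only, not SIFT's actual code; the representation of an article as a word array and the headline test are our assumptions:

    #include <string.h>

    /* SIFT-style weight as described in [Kah]: 5 points for the first
     * occurrence of a word in an article, 1 point for each repetition,
     * plus 10 points if the word appears in the headline.  Note the
     * absence of any collection-frequency or article-length factor. */
    int sift_weight(const char *word, const char **article_words, int n,
                    int in_headline)
    {
        int weight = 0, seen = 0, i;

        for (i = 0; i < n; i++) {
            if (strcmp(word, article_words[i]) == 0) {
                weight += seen ? 1 : 5;   /* first occurrence scores 5 */
                seen = 1;
            }
        }
        if (seen && in_headline)
            weight += 10;                 /* headline bonus */
        return weight;
    }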

In the case of IAN, the weight calculation considers not only the term frequency of a word within an article but also the overall frequency of occurrence of that word in the collection. The size of the article is also considered when calculating the weight of each keyword. This helps to reduce the fetching of irrelevant articles.
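Concretely, the scheme we settled on (the default weighting program, WEIGHT_ASSIGNER89 in Appendix A, implementing [SB88]) is tf-idf with cosine length normalisation: for term $j$ in article $i$,

$$w_{ij} = \frac{tf_{ij} \cdot \log(N/n_j)}{\sqrt{\sum_{k \in \mathrm{vector}} \bigl(tf_{ik} \cdot \log(N/n_k)\bigr)^2}}$$

where $tf_{ij}$ is the frequency of the term within the article, $N$ is the total number of articles, and $n_j$ is the number of articles containing term $j$; the denominator supplies the length normalisation.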

The above observations are supported by the feedback we got from the users: even though SIFT generated more articles than IAN, it still missed more relevant articles. This suggests that IAN is a more effective information filter than SIFT.

The results of the experimentation on phrases show that they discriminate heavily. The mean relevance of retrieval using phrases was high (65.22%); however, the omission of relevant articles was large as well (41.21%). The results suggest that phrases should be used with caution. In particular, phrases should only be used in situations where a query term has a very broad meaning (e.g. a query term like "algorithm" or "network" carries very broad scope; to narrow the scope, the use of phrases like "linear algorithm" or "neural network" is desirable).

Chapter 6

Conclusions

In this chapter, we review our research objectives, describe how well they have been achieved, and draw conclusions from our experiments conducted on the IAN and SIFT systems. Finally, we discuss future work that could be done to improve the current system.

6.1 Review of research objectives

The research carried out in this project satisfies all of the objectives stated in Section 1.2:

• Different text analysis techniques and searching strategies used in IR

were thoroughly reviewed. We were then able to make judgements on

the best techniques and strategies to incorporate in our research.

• Various filtering mechanisms were analysed, and the most suitable filtering scheme for our needs was implemented.

• Various ranking algorithms were tried, and we finally settled on the algorithm proposed by Salton et al. [SB88].

• A prototype news fetching program was developed along with the news

filtering system.

• Performance Evaluation of IAN was carried out.

• The quality of retrieval using SIFT and IAN (using single terms as well

as phrases) was compared to determine their suitability in the domain

of Usenet news.

6.2 Conclusion

From our work we conclude the following:

• A huge amount and variety of information passes through Usenet news each day. An automatic filtering system and a searching strategy are crucial for making better use of this data.

• Removing stopwords significantly helps in reducing the number of indexed terms in an article, which ultimately helps search algorithms to identify relevant documents. However, care is necessary in selecting the proper stopwords. Stopwords should not be selected solely on the basis of frequency, for some content-bearing words are very frequent in the English language. The domain should also be considered in the selection of stopwords, as certain index terms that are uncommon in general use may be exceedingly common in a specific context, which might merit their removal.

• Stemming helps in conflating the various offshoots of a root stem. However, in some cases stemming changes the meaning of the intended word. The stemming algorithm should therefore be selected according to the nature of the vocabulary used in the collection.

• Identifying footers is difficult as there is no standard footer structure

in Usenet articles.

• Removal of some header lines and of the footer enhances precision significantly, because non-relevant words are found in headers and footers.

• The most appropriate weighting formula is one which considers not only the term frequency when calculating the weight of a keyword, but also the overall frequency and the length factor. Care should be taken in choosing a weighting scheme.

• Hashing of keywords, building a query tree for boolean expressions, and efficient use of RAM significantly improve the search time.

• The text analysis of the IAN system is appropriate for Usenet data. We found that although the mean relevance rating was low, it did not miss many¹ relevant articles.

¹The mean relevance rating of missed articles is 19%.

• Phrases are too discriminatory. Although the mean relevance rating is high (65.22%), the use of phrases tends to cause many relevant articles to be missed (41.21%).

• The IAN system compares favourably against SIFT: the relevance was higher overall² and the misses were lower.

6.3 Future work

A lot can be done in this area, as no single algorithm provides the correct solution for all situations and domains. Work can be done to overcome the drawbacks of stemming.

Work can also be done in the area of adaptive user profiles, where the user profile changes automatically depending on the success history of user queries. User actions can be monitored and the user profile updated accordingly.

Various AI techniques could be used to determine the precise location of footers, which could then be eliminated. It is impossible to recognise every footer with simple pattern matching, since not all editors put a recognisable pattern at the beginning of the footer.

Different techniques could be used to assign weights to index terms depending on their location in the article: cue words, exclamations, upper/lower case, etc.

²The mean relevance rating for IAN was 51.13%.

Research can be done in the area of article creation, where provision may be given to the author to mark words as index terms during article creation.

A thesaurus could provide a mapping between phrases and terms to improve recall. However, this area needs further exploration.

Appendix A

Outline of the program

List of programs and files:

ARTICLE_FETCHER Executable program generated by ARTICLE_FETCHER.c. It reads input from a file called "inputfile", which contains the list of selected newsgroups and the date and time from which you want to collect posted articles.

Function: It connects to the local NNTP server and fetches articles from the specified newsgroups which were posted after the specified date and time. The output is stored in a directory called "ARTICLES".

DATA_CLEANER Executable program generated by DATA_CLEANER.c. It reads articles as input (from the ARTICLES directory) and pipes the output to the stopper program.

Function: Removes a few non-informative lines from the articles, identifies the footer and removes it, and puts markers to identify the end of the subject and the beginning of the body of an article.

stopper An executable program which needs the following files:

stop_makefile Builds object code and test drivers
stop.wrd A list of 461 stop words used for testing
stop.c Code for building and running lexical analysers that stop words
stop.h Header for the lexical analyser module
strlist.c Code for a utility package for handling lists of strings
strlist.h Header for the string list utility module
stopper.c A test program

Function: stopper (executable program) is used for stopword removal for automatic indexing and query processing.

Contributed by: C. Fox, James Madison University; modified by C. Fox, July 1991; remodified by Y. Mansuri, May 1994. This source code is from the book Information Retrieval: Data Structures and Algorithms, edited by William B. Frakes and Ricardo Baeza-Yates, Prentice-Hall, 1992, ISBN 0-13-463837-9.

stemmer An executable program which needs the following files:

stem_makefile Builds object code and test drivers
stem.c Module containing the Porter stemmer
stem.h Header for the stemmer module
stemmer.c Stemmer test program

Function: stemmer (executable program) is used to remove suffixes, implementing the Porter stemming algorithm.

Written by: C. Fox, 1990; contributed by William B. Frakes; modified by Y. Mansuri, May 1994. This source code is from the book Information Retrieval: Data Structures and Algorithms, edited by William B. Frakes and Ricardo Baeza-Yates, Prentice-Hall, 1992, ISBN 0-13-463837-9.

PHRASE_MAKER Executable program generated by PHRASE_MAKER.c. It takes input from standard input (the output of stemmer) and appends its output to a file called "keyword".

Function: This program generates single keywords as well as phrases, which are used as index words.
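As an illustration only - a minimal sketch of the behaviour, not the actual PHRASE_MAKER.c - the following assumes, from the system_profile examples later in this appendix (e.g. "neural_network"), that a phrase is a pair of adjacent stemmed words joined by an underscore:

    #include <stdio.h>
    #include <string.h>

    /* Sketch: read stemmed words one per line from stdin; emit each
     * single word as an index term and each adjacent pair joined by
     * '_' as a phrase (cf. "neural_network" in system_profile). */
    int main(void)
    {
        char prev[128] = "", word[128];

        while (scanf("%127s", word) == 1) {
            printf("%s\n", word);              /* single-word index term */
            if (prev[0] != '\0')
                printf("%s_%s\n", prev, word); /* two-word phrase */
            strcpy(prev, word);
        }
        return 0;
    }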

WEIGHT_ASSIGNER89 Executable program generated by WEIGHT_ASSIGNER.c. It takes the appended keyword file as input and generates an output file called "INDEX_FILE". This output is structured.

Function: Assigns a weight to each index term using the following formula:

$$w_{ij} = \frac{tf_{ij} \cdot \log(N/n_j)}{\sqrt{\sum_{j \in \mathrm{vector}} \bigl(tf_{ij} \cdot \log(N/n_j)\bigr)^2}}$$

Comment: proposed by Salton, G. in 1989.

WEIGHT_ASSIGNER91 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the following formula:

$$w_{ik} = \frac{\bigl(0.5 + 0.5\,\frac{tf_{ik}}{\mathit{maxtf}_i}\bigr) \cdot \log(N/n_k)}{\sqrt{\sum_{k=1}^{t} \Bigl(\bigl(0.5 + 0.5\,\frac{tf_{ik}}{\mathit{maxtf}_i}\bigr) \cdot \log(N/n_k)\Bigr)^2}}$$

Comment: proposed by Salton, G. in 1991.

WEIGHT_ASSIGNER76 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the formula proposed by Sager, W.K.H. in 1976 [SL76] and used in IBM's STAIRS system.

WEIGHT_ASSIGNER22 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the following formula:

$$w_t = f_{in} \cdot d_i \cdot \frac{F_i}{l_{in}}$$

Comment: formula used in [SL76, NMK81].

WEIGHT_ASSIGNER25 This is another program for calculating and assigning weights to keywords. The difference is in the formula used to calculate the weight.

Function: Assigns a weight to each index term using the formula used in [NMK81].

WEIGHTSAME This is another program for calculating and assigning weights to keywords; the difference is in the weighting formula. The source is generated by WEIGHTSAME.c.

Function: Assigns a weight to each index term using the following formula:

$$w_t = \frac{f_{in}}{f_{max}}$$

Comment: formula used in [SB88].

search An executable program which needs the following files:

search.c The main program, which takes INDEX_FILE as input. It builds a structure in RAM to facilitate matching of index words against the query. It also reads the "subject_list" file to build the hash table for subject lines.
gv.l A LEX program which generates tokens from the query, separating the boolean operators from the query terms.
gv.y A YACC program which first builds a parse tree for the query terms, then compares it with the index terms. After resolving the boolean logic of the query, it displays the article number, weight and subject line of the selected articles.
search.h Header file.
makefile First cleans, then makes the object code.

Function: The overall function is to provide the searching mechanism.
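As an illustration of the query tree that gv.y builds (cf. Figures 4.6 and 4.7), here is a minimal sketch in C; the node layout and function names are our assumptions, not the actual gv.y code:

    /* Boolean query tree: leaves hold index terms, internal nodes
     * hold .and. / .or. operators (cf. Figures 4.6-4.7). */
    typedef enum { TERM, AND, OR } NodeType;

    typedef struct QNode {
        NodeType type;
        const char *term;            /* used when type == TERM  */
        struct QNode *left, *right;  /* used for AND / OR nodes */
    } QNode;

    /* Stand-in for the RAM hash-table lookup built by search.c. */
    extern int term_in_article(const char *term, int article_no);

    /* Does the given article satisfy the query rooted at q? */
    int matches(const QNode *q, int article_no)
    {
        switch (q->type) {
        case TERM: return term_in_article(q->term, article_no);
        case AND:  return matches(q->left, article_no) &&
                          matches(q->right, article_no);
        case OR:   return matches(q->left, article_no) ||
                          matches(q->right, article_no);
        }
        return 0;
    }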

UTOS An executable program which needs the following files:
utos.l A lexical analyser which generates tokens for stemming the query words and passes the tokens to the stemming function.
utos.c Contains the Porter stemming module.

user_profile Contains the list of user queries, constructed as phrases and/or using boolean operators. The list of queries used in our experiments is given below:

Profile 1:
learning .and. algorithm .and. neural .and. network
heterogeneous .and. database
regression .and. linear .or. non-linear
neural network
neural .and. network
backpropagation .and. network
query optimisation .or. query optimization

Profile 2:
ingres database
ingres .and. database

Profile 3:
information .and. systems
information systems
oracle database
oracle .and. database

Profile 4:
spatial .and. database
temporal .and. spatial
multimedia .and. object
symbolic .and. representation

Profile 5:
object recognition
object .and. recognition
model .and. acquisition
model learning
model .and. learning

Profile 6:
australian football .and. rules .or. aussie rules
information filtering
information .and. filtering
functional programming
functional .and. programming
computer .and. science .and. education

system_profile After a query is stemmed, the output is stored in this file. It is the modified version of user_profile which goes as input to the search program, shown below:

Profile 1:
learn.and.algorithm.and.neural.and.network
heterogen.and.databas
regress.and.linear.or.non-linear
neural_network
neural.and.network
backpropag.and.network
queri_optimis.or.queri_optim

Profile 2:
ingr_databas
ingr.and.databas

Profile 3:
inform.and.system
inform_system
oracl_databas
oracl.and.databas

Profile 4:
spatial.and.databas
tempor.and.spatial
multimedia.and.object
symbol.and.represent

Profile 5:
object.and.recognit
object_recognit
model.and.acquisit
model_learn
model.and.learn

Profile 6:
australian_football.and.rule.or.aussi_rule
inform_filter
inform.and.filter
function_program
function.and.program
comput.and.scienc.and.educ

Function: To formulate the query.

List of shell scripts and files:

AF To run the ARTICLE_FETCHER program.
ALL To run all programs, i.e. when executed it fetches articles, analyses them, generates keywords, assigns weights, formulates the queries given in user_profile, and searches for relevant articles as per user_profile.
MAKE_CLEAN Runs the programs which clean the fetched articles, analyse them, and generate keywords.
CW_SG89, CW_SG91, CW_SG19, CW_SG22, CW_SJ25 Can be run to calculate weights; each one runs a different weight-assigning program. One can be selected (by default CW_SG89). These scripts can be included in the ALL or MAKE_CLEAN scripts, or run separately.
runutos This script runs the utos program, which modifies the query.

subject_list Contains the list of subjects along with article numbers. This file is read by search.c while generating the hash table for subjects.

List of newsgroups: alt.*, aus.*, bionet.*, bit.*, biz.*, comp.*, ddn.*, gnu.*, ieee.*, k12.*, misc.*, news.*, rec.*, sci.*, soc.*, system*, talk.*, unsw.*, vmsnet.*, !*binaries*

The newsgroups which contain binary files were not selected.

Appendix A

Filtering Statistics

Different experiments were conducted to see the effect of filtering at each step, i.e. from the collection of articles to the display of results. For this, we collected news articles, analysed them, and collected the results at each step of filtering to see how each step helps in the selection of index words. Tables A.1 and A.2 show some statistical information; they give average results for a 31-day data collection.

Figure A.1 shows the flow of articles per day. The average flow of articles is around 2000 to 2300² every two hours. It should be mentioned that the number of articles posted is not constant; it varies from around 18000 to 40000 articles per day¹, but the posting rate per two hours is almost constant.

¹March 18: 30924 articles; March 24: 27356 articles; March 26: 36278 articles.
²This flow is with respect to articles arriving on our NNTP server, nntp.unsw.edu.au.

DATA                                                                   Size after processing

Total number of articles                                                       30518
Total amount in MB                                                              92.8
Total data in MB after removing some header fields, comments, footers         50.49
Total data in MB after removing stop words                                      27.4
  (and keeping only those words having some alphabetic characters)
Total data in MB after stemming                                                24.46
Total data in MB after generating index terms (single)                         24.46
Total data in MB after generating phrases                                      62.56
Total data in MB after assigning weights                                       59.37

Table A.1: Statistical information: information of data in MB

DATA                                                                   Size after processing

Total number of articles                                                       30518
Total amount in MB                                                              92.8
Total number of words after removing some header, comment, footer fields    5163919
Total number of unique words after removing some header fields               874395
Total number of unique words after removing stop words                       863314
  (keeping all words having some alphabetic characters)
Total number of unique words after stemming and removing numbers             530406
Total number of index words (single)                                         530406
Total number of index words (double)                                        1145710

Table A.2: Statistical information: information of data in terms of words

[Figure A.1 is a scatter plot, labelled 'Time_V_N_articles', of the number of articles posted (0-35000) against time (0-25 hours).]

Figure A.1: Rate of posting of articles

Figure A.2 shows the change in vocabulary every two hours, i.e. the number of unique words generated every two hours.

[Figure A.2 is a scatter plot, labelled 'Uniq_Word_V_Articles', of the cumulative number of unique words (40000-220000) against time (0-35).]

Figure A.2: Increase in unique words per hour

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1     10    6    -    4    -    -    -      42     53.3    -     65     38.3    -
  2      5    5    -    5    -    -    -      16     16      -      0      0      -
  3      7    6    -    4    -    -    -      42.1   49.1    -     35     23      -
  4      7    4    3    3    3    3    3      49     62.5   66     50     35     38
  5      3    3    -    2    -    -    -      56.6   50      -     79     90      -
  6      2    1    -    1    -    -    -      40     50      -      0     30      -
  7      6    6    1    4    1    1    1      56     65     90     45     20     58
  8     34   25    3    4    1    1    1      51.6   50.4   73.3   52.6   31.29  39.64
  9     30   18    5   13    2    3    2      44     55.5   84     70     47     46.5
 10      2    3    -    2    -    -    -      55     53      -     50      0      -
 11      1    1    -    1    -    -    -      40     40      -      0      0      -
 12      8    5    -    4    -    -    -      31     40      -     40     30      -
 13      5   10    -    5    -    -    -      40     44      -     40      0      -
 14     10    7    1    6    1    1    1      35     45.7   50     50     33     39
 15      3    4    -    2    -    -    -      33.3   40      -     30      0      -
 16     15    6    3    6    3    3    3      26     41.6   36      0      0      0
 17      2    3    -    2    -    -    -      75     76.6    -     80      0      -
 18      6    7    1    1    1    1    1      24     27     30     47     18      -
 19      8   12    8    6    5    5    4      61     50     67.5   41.6   67.5   25.8
 20     14    5    -    4    -    -    -      24     54      -     60     14      -

Total  178  137   24   79   17   18   16      42.08  48.17  62.1   40.3   23.85  41.15

Table A.3: Results from comparison method - Day 1

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total  178  137   24   79   17   18   16      42.08  48.17  62.1   40.3   23.85  41.15

Table A.4: Result: Evaluation of performance for Day 1

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1      5    3    -    3    -    -    -      46     76.6    -      0      0      -
  2      8    2    -    1    -    -    -      16     60      -     40      7      -
  3      3    3    -    3    -    -    -      55     55      -      0      0      -
  4     16    5    5    4    4    5    4      44.3   72     72     70     35.8   35.8
  5      1    2    -    1    -    -    -      70     70      -     70      0      -
  6      -    -    -    -    -    -    -       -      -      -      -      -      -
  7      4    1    -    1    -    -    -      25     90      -      0     10      -
  8     33   17   23    1    1    6    1      25.7   46.47  54.3   48     41     33
  9     29   12    5    8    3    3    2      44     72     80     69.5   33.7   58
 10      3    3    -    3    -    -    -      53.3   53.3    -      0      0      -
 11      4    3    -    3    -    -    -      27.1   36.6    -      0      0      -
 12      7    2    -    1    -    -    -      11     45      -     50      8      -
 13      2    2    -    2    -    -    -      40     40      -      0      0      -
 14      4    3    -    2    -    -    -      25     40      -     50     15      -
 15      7    3    -    2    -    -    -      21     40      -     30     12      -
 16     20   10    -    7    -    -    -      30.33  40      -      0      0      -
 17      -    -    -    -    -    -    -       -      -      -      -      -      -
 18      4    5    -    3    -    -    -      53     46      -     40     20      -
 19      8    7    6    2    1    2    1      35     50     53     51.5   45.5   36.5
 20     11    9    -    9    -    -    -      46.3   47.7    -      0     40      -

Total  169   92   39   56    9   16    8      37.3   54.48  64.82  30     16.66  40.7

Table A.5: Results from comparison method - Day 2

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total  169   92   39   56    9   16    8      37.3   54.48  64.82  30     16.66  40.7

Table A.6: Result: Evaluation of performance for Day 2

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P

  1      3    2    -    1    -    -    -      43     17      -     70     30      -
  2      -    -    -    -    -    -    -       -      -      -      -      -      -
  3      -    -    -    -    -    -    -       -      -      -      -      -      -
  4      9    8    7    6    3    6    3      43     63     75.7   41.25  57.5   25
  5      1    1    -    1    -    -    -      80     80      -      0      0      -
  6      0    1    -    1    -    -    -       0     30      -     30      0      -
  7      6    3    0    3    -    -    -      38.3   53      -     33.3    0      -
  8     29   16    7    5    4    3    1      34.1   39     55.7   41.15  58.95  33.25
  9     25    9    3    9    3    3    3      61.6   76.6   90      0     55.6   59
 10      1    2    -    1    -    -    -      90     60      -     30      0      -
 11      -    -    -    -    -    -    -       -      -      -      -      -      -
 12      1    2    -    0    -    -    -      20     35      -     35     20      -
 13      1    2    -    1    -    -    -      30     30      -     30      0      -
 14      1    1    -    1    -    -    -      60     60      -      0      0      -
 15      1    1    -    1    -    -    -      30     30      -      0      0      -
 16      7    5    -    3    -    -    -      24     42      -     50     17.5    -
 17      1    1    -    1    -    -    -      80     80      -      0      0      -
 18      1    3    -    1    -    -    -      30     20      -     15      0      -
 19     12   13   11   11   11   11   11      53.3   53.8   53.6   80     50     50
 20      1    2    -    1    -    -    -      40     40      -      0     40      -

Total   97   72   28   47   21   23   18      44.5   50.75  68.75  29.15  16.44  41.8

Table A.7: Results from comparison method - Day 3

Query    S    I    P   S∩I  S∩P  I∩P  S∩I∩P     %S     %I     %P    %~S    %~I    %~P
Total   97   72   28   47   21   23   18      44.5   50.75  68.75  29.15  16.44  41.8

Table A.8: Result: Evaluation of performance for Day 3

S       Total articles fetched by SIFT
I       Total articles fetched by IAN
P       Total articles fetched by IAN* (IAN*: here the query is formed using phrases)
S∩I     Total articles common to both SIFT and IAN
S∩P     Total articles common to both SIFT and IAN*
I∩P     Total articles common to both IAN and IAN*
S∩I∩P   Total articles common to SIFT, IAN and IAN*
%S      Mean relevance rating of articles found by SIFT
%I      Mean relevance rating of articles found by IAN
%P      Mean relevance rating of articles found by IAN* using phrases
%~S     Mean relevance rating of articles missed by SIFT
%~I     Mean relevance rating of articles missed by IAN
%~P     Mean relevance rating of articles missed by IAN* using phrases

Bibliography

[Art69] S Artandi. Computer indexing of medical articles. Journal of Documentation, 20:227-233, 1969.

[Asl66] Aslib Proceedings. The Relevance of Relevance to the Testing and Evaluation of Document Retrieval Systems, volume 18/11. Rees, A.M, 1966.

[Bar69] J.K Barkla, editor. Construction of Weighted Term Profiles by Measuring Frequency and Specificity in Relevant Items, Cranfield, Bedford, 1969. Cranfield Conference on Mechanized Information Storage and Retrieval Systems.

[BC92] N.J Belkin and W.B Croft. Information filtering and information retrieval: Two sides of the same coin? Communication of the ACM, 35(12):29-38, December 1992.

[BJ94] R.S Brewer and P.M Johnson. Toward collaborative knowledge management within large, dynamically structured information systems. Technical report, Collaborative Software Development Laboratory, Department of Information and Computer Sciences, University of Hawaii, 1994. Found from Unified Computer Science Tech Report Index search with keyword: Usenet.

[Boo85] A Bookstein. Probability and fuzzy-set application to information retrieval. Annual Review of Information Science and Technology, pages 117-151, 1985. Williams, M.

[BP86] C Berrut and P Palmer. Solving grammatical ambiguities within a surface syntactical parser for automatic indexing. In F Rabitti, editor, Proc. of Ninth International Conference on Research and Development in Information Retrieval, pages 123-130, Pisa, Italy, September 1986.

[BS75] A Bookstein and D.R Swanson. A decision theoretic foundation for indexing. Journal of the American Society for Information Science, pages 45-50, Jan-Feb 1975.

[CDBK86] Y Chiaramella, B Defude, M.F Bruanet, and D Kerkouba. A full text information retrieval system. In F Rabitti, editor, Proc. of the Ninth International Conference on Research and Development in Information Retrieval, pages 207-213, Pisa, Italy, September 1986.

[CH79] W.B Croft and D.J Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, December 1979.

[CM78] W.S Cooper and M.E Maron. Foundations of probabilistic and utility-theoretic indexing. Journal of the Association for Computing Machinery, 25(1):67-80, January 1978.

[CMK66] C.W Cleverdon, J Mills, and M Keen. Factors determining the performance of indexing systems. Technical report, ASLIB Cranfield Project, 1966.

[Com95] UUNET Communications. Total traffic through uunet for the last 2 weeks. Newsgroup: news.lists, January 1995. Sender: newsuunet.uu.net.

[Coo76] W.S Cooper. The paradoxical role of unexamined documents in the evaluation of retrieval effectiveness. Information Processing and Management, 12:367-375, 1976.

[CR69] J.M Carroll and R Roeloffs. Computer selection of keywords using word frequency analysis. American Documentation, 20:227-233, 1969.

[Cro83] W.B Croft. Experiments with representation in a document retrieval system. Information Technology: Research and Development, 2(1):1-21, 1983.

[Daw74] J.L Dawson. Suffix removal and word conflation. ALLC Bulletin, pages 33-46, 1974.

[DG88] P Das-Gupta. Rough sets and information retrieval. In Proc. Eleventh Int'l. Conf. on Res. and Development in Information Retrieval, Set Oriented Models, page 567, 1988.

[Fag87] J Fagan. Experiments in automatic phrase indexing for document retrieval: A comparison of syntactic and non-syntactic methods. Ph.D thesis, available as Report TR 87-868, Department of Computer Science, Cornell University, Ithaca, N.Y, September 1987.

[FBY92] W.B Frakes and R Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

[FM] C Fox (James Madison University). stopper. This source code was obtained from the public domain.

[FNA+88] E.A Fox, J.T Nutter, T Ahlswede, M Even, and Markowitz, editors. Building a Large Thesaurus for Information Retrieval, Austin, TX, February 1988. Association for Computational Linguistics. Pages 101-108.

[Fox83] E.A Fox. Extending the boolean and vector space models of information retrieval with p-norm queries and multiple concept types. Technical report, Cornell University, August 1983.

[FS86] E.A Fox and S Sharat. A comparison of two methods for soft boolean interpretation in information retrieval. Technical Report TR-86-1, Virginia Tech, Department of Computer Science, 1986.

[FS91] G Fischer and C Stevens, editors. Information access in complex, poorly structured information spaces. Human Factors in Computing Systems CHI'91, April 1991. Pages 63-70.

[FT] W Frakes (Virginia Tech). stemmer. This source code was obtained from the public domain.

[GNOT92] D Goldberg, D Nichols, B.M Oki, and D Terry. Using collaborative filtering to weave an information tapestry. Communication of the ACM, 35(12):61-69, December 1992.

[Har75] S.P Harter. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. Journal of the American Society for Information Science, pages 280-289, September-October 1975.

[Har91] D Harman. How effective is suffixing? Journal of the American Society for Information Science, 42(1):7-15, 1991.

[Hau93] R Hauben. The evolution of usenet news: poor man's arpanet. Newsgroup: comp.society, March 1993. Speech presented at the Michigan Association of Computer Users in Learning.

[Hei82] M.H Heine. Intelligent front end for information retrieval systems with boolean logic. Information Technology: Research and Development, 1(4):247-260, 1982.

[Hw74] M Hafer and S Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.

[IAKM86] M Isoda, H Aiso, N Kamibayshi, and Y Matsunaga. Model for lexical knowledge base. In Proc. Eleventh International Conference on Computational Linguistics (COLING 86), pages 451-453, University of Bonn, August 1986.

[JR88] P.S Jacobs and L.F Rau. Natural language techniques for intelligent information retrieval. In Y Chiaramella, editor, Proc. of Eleventh International Conference on Research and Development in Information Retrieval, pages 85-99, Grenoble, France, June 1988.

[Kah] Brewster Kahle. sift-1.1. Obtained from the public domain via ftp from db.stanford.edu. The description is given in the DESIGN file found in the sift-1.1 software. Copyright (c) Tak W. Yan.

[Kee77] E.M Keen. On the generation and searching of entries in printed subject indexes. Journal of Documentation, 33(1):15-45, March 1977.

[KMT+82] J Katzer, M McGill, J Tessier, W Frakes, and P Das-Gupta. A study of the overlaps among document representations. Information Technology: Research and Development, 1:261-273, 1982.

[Lew92] D.D Lewis. Representation and Learning in Information Retrieval. Ph.D thesis, University of Massachusetts, Massachusetts, February 1992.

[LF88] W.C Lee and E.A Fox. Experimental comparison of schemes for interpreting boolean queries. M.S thesis, Technical Report TR-86-1, Virginia Tech, Department of Computer Science, 1988.

[Los88] R.M Losee. Parameter estimation for probabilistic document-retrieval models. Journal of the American Society for Information Science, 39(1):8-16, 1988.

[Lov68] J.B Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1):22-31, March 1968.

[Luh58] H.P Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159-165, 1958.

[Mar79] M.E Maron. Automatic indexing: An experimental inquiry. Journal of the American Society for Information Science, pages 224-228, July 1979.

[MCT87] M Mauldin, J Carbonell, and R Thomason. Beyond the keyword barrier: Knowledge-based information retrieval. In Proc. 29th Annual Conference of National Federation of Abstracting and Information Services. Elsevier Press, 1987.

[Mil71] W.L Miller. A probabilistic search strategy for MEDLARS. Journal of Documentation, 27:254-266, 1971.

[MK60] M.E Maron and J.L Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery, 7(3):216-244, 1960.

[NMK77] T Noreault, M McGill, and M Koll. Automatic ranking of output from boolean searches in SIRE. Journal of the American Society for Information Science, 28(6):333-339, 1977.

[NMK81] T Noreault, M McGill, and M Koll. A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. R.N Oddy, London: Butterworths, 1981. Pages 57-76.

[NS93] A.H.H Ngu and J Shepherd. How to deal with 10,000 news articles per day: An intelligent assistant for newsreading. Proc. of Inaugural Australian & NZ Intelligent Information Systems Conference, December 1993.

[Pai84] C.D Paice. Soft evaluation of boolean search queries in information retrieval systems. Information Technology: Research, Development, Application, 3(1):33-41, 1984.

[PL69] A.E Petrarca and W.M Lay. Use of an automatically generated authority list to eliminate scattering caused by some singular and plural main index terms. Proceedings of the American Society for Information Science, 6:277-282, 1969.

[Por80] M.F Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.

[Rad82] T Radecki. A probabilistic approach to information retrieval in systems with boolean search request formulations. Journal of the American Society for Information Science, 33(6):365-370, 1982.

[Rad88] T Radecki. Trends in research on information retrieval - the potential for improvements in conventional boolean retrieval systems. Information Processing and Management, 24(3):219-227, 1988.

[Rei93] Brian Reid. Usenet statistics: periodic postings to usenet newsgroups. Newsgroup: news.lists, 1993. Found from Unified Computer Science Tech Report Index search with keyword: Usenet.

[RHB92] S.E Robertson and M.M Hancock-Beaulieu. On the evaluation of IR systems. Information Processing and Management, 28(4):457-466, 1992.

[Rob90] S.E Robertson. The methodology of information retrieval experiment, chapter 2, page 9. Elsevier Science, first edition, 1990.

[RSJ76] S.E Robertson and K Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, pages 129-146, May-June 1976.

[Sal68] G Salton. Automatic Information Organization and Retrieval. McGraw-Hill, 1968.

[Sal70] G Salton. A Theory of Indexing. McGraw-Hill Book Company, 1970.

[Sal71] G Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, N.J: Prentice Hall, 1971.

[Sal75] G Salton. Theory of indexing. Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1975.

[Sal86] G Salton. Another look at automatic text-retrieval systems. Communication of the ACM, 29(7):648-656, July 1986.

[SB88] G Salton and C Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.

[SB91] G Salton and C Buckley. Global text matching for information retrieval. Science, 253:1012-1015, August 1991.

[SBS90] G Salton, C Buckley, and M Smith. On the application of syntactic methodologies in automatic text analysis. Information Processing and Management, 26(1):73-92, 1990.

[SFW83] G Salton, E.A Fox, and H Wu. Extended boolean information retrieval. Communications of the Association for Computing Machinery, 26(11):1022-1036, 1983.

[SJ72] K Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972.

[SJ73] K Sparck Jones. Index term weighting. Information Storage and Retrieval, 9:619-633, 1973.

[SJ75] K Sparck Jones. A performance yardstick for test collections. Journal of Documentation, 31:266-272, 1975.

[SJ79] K Sparck Jones. Experiments in relevance weighting of search terms. Information Processing and Management, 15(3):133-144, 1979.

[SL76] W.K.H Sager and P.C Lockemann. Classification of ranking algorithms. International Forum on Information and Documentation, 1(4):12-25, 1976.

[SM83] G Salton and M.J McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, first edition, 1983.

[SM94] G Spafford and M Moraes. usenet-software-part1. USENET Software: History and Sources, October 1994.

[Sme86] A.F Smeaton. Incorporating syntactic information into a document retrieval strategy: An investigation. In F Rabitti, editor, Proc. of Ninth International Conference on Research and Development in Information Retrieval, pages 103-113, Pisa, Italy, September 1986.

[sR88] Jung soon Ro. An evaluation of the applicability of ranking algorithms to improve the effectiveness of full-text retrieval. II. On the effectiveness of ranking algorithms on full-text retrieval. Journal of the American Society for Information Science, 39(3):147-160, 1988.

[Sri89] P Srinivasan. Intelligent information retrieval using rough set approximations. Information Processing and Management, 25(4):347-361, 1989.

[Ste92] C Stevens. Automating the creation of information filters. Communication of the ACM, 35(12):48, December 1992.

[Su92] L.T Su. Evaluation measures for interactive information retrieval. Information Processing and Management, 28(4):503-516, 1992.

[SvR88] A.F Smeaton and C.J van Rijsbergen. Experiments on incorporating syntactic processing of user queries into a document retrieval strategy. In Y Chiaramella, editor, Proc. of Eleventh International Conference on Research and Development in Information Retrieval, pages 31-51, Grenoble, France, June 1988.

[SW79] G Salton and H Wu. A term weighting model based on utility theory, chapter 2, page 9. Butterworths, 2nd edition, 1979.

[SWY76] G Salton, A Wang, and C.T Yu. Automatic indexing using term discrimination and term precision measurements. Information Processing and Management, 12:43-51, 1976.

[SY73] G Salton and C.S Yang. On the specification of term values in automatic indexing. Journal of Documentation, 29(4):351-372, 1973.

[SYY74] G Salton, C.S Yang, and C.T Yu. Contributions to the theory of indexing. Information Processing, 74:584-590, 1974.

[SYY75] G Salton, C.S Yang, and C.T Yu. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, pages 33-44, 1975.

[Thu86] G Thurmair. A common architecture for different text processing techniques in an information retrieval environment. In F Rabitti, editor, Proc. of Ninth International Conference on Research and Development in Information Retrieval, pages 138-143, Pisa, Italy, September 1986.

[UME] Masanobu UMEDA. Gnews help. Available from ftp.cs.titech.ac.jp.

[vR75] C.J van Rijsbergen. Information Retrieval. Butterworths, first edition, 1975. Ch. 2, page 14.

[vR79] C.J van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979.

[VV87] B Vickery and A Vickery. Information Science in Theory and Practice. Butterworths, London, 1987. Page 122.

[Wei81] B.H Weinberg. Word Frequency and Automatic Indexing. PhD thesis, Columbia University, 1981.

[WP68] J.H Williams and M.P Perriens. Automatic full text indexing and searching system. In Proceedings of the Information System Symposium, Washington, DC, September 1968. IBM, Gaithersburg.

[YGM94] T.W Yan and H Garcia-Molina. Index structures for information filtering under the vector space model. In Proc. International Conference on Data Engineering, pages 337-347, 1994.

[YS76] C.T Yu and G Salton. Precision weighting - an effective automatic indexing method. Journal of the Association for Computing Machinery, 23:76-88, 1976.

[YS77] C.T Yu and G Salton. Effective information retrieval using term accuracy. Communication of the ACM, 20(3):135-142, 1977.

[Zip49] G.K Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, Massachusetts, 1949.