Assigning Context to Documents in a Collection to Improve Information Retrieval Effectiveness: Methodology and Experiences

VIVEK CHANANA, School of Computing and IT, University of Western Sydney, Sydney, AUSTRALIA
ATHULA GINIGE, School of Computing and IT, University of Western Sydney, Sydney, AUSTRALIA
SAN MURUGESAN, School of Multimedia and IT, Southern Cross University, Coffs Harbour, AUSTRALIA

Abstract: - A recently proposed context-based information retrieval system offers better retrieval effectiveness than traditional keyword-based systems. In this system, each document is assigned context(s) based on the type of information it contains. This paper presents context categories for documents in the computing and information technology domain and offers a new methodology for assigning contexts to documents in a collection. The methodology showed promising results, with inter-assigner consistency comparable to results reported in the literature.

Key-Words: - context, context assigning, inter-assigner consistency, context category, information retrieval, test document collection

1 Context-based Information Retrieval System
We developed a new context-based information retrieval system, detailed in [3]. It uses a new notion of context - context based on the type of information rather than the subject topic of the information - in an effort to improve information retrieval effectiveness. Our studies on the new context-based information retrieval system have demonstrated improvements in retrieval effectiveness - the relevance of the documents retrieved in response to a query. Further information on the new information retrieval system is given in [3].
Some examples of context categories based on type of information are: specification, principle, procedure, etc., whereas topic-based categories based on subject topics could be: computer, software, programming, etc. While groupings/classifications for subject topics exist in a given domain, a pre-defined context category set/hierarchy based on type of information is not currently available.
In this paper, we present context categories for documents in the computing and information technology domain and present a methodology for assigning contexts to documents in a collection. We also describe the Web-based context assignment system that we used for this purpose and highlight our experiences in assigning contexts to articles published in IEEE Computer magazine. The methodology we developed could also be applied to other domains.

2 Document Collection
For evaluating an information retrieval system in a laboratory-type environment, a document collection or test document set is required. The experimental approach of using standard test collections has a long history in information retrieval evaluation. It started in 1963, when the first laboratory testing of an information retrieval system was done in the Cranfield II experiment [4]. Although this approach has remained popular over the years and is still very much in use, the evaluation methodology based on standard test collections suffers from some limitations. It assumes that relevance can be approximated by the topical similarity of the query to the document, and hence that the user is not required to make his/her own relevance assessment of retrieved documents; that relevance is binary - relevant or non-relevant; and that recall for a query is always known.
The characteristics of the context-based information retrieval system and the way we planned to evaluate the system after implementation determined the specifications of the test document collection that we could use. The influencing factors were: user-centred relevance judgement, non-binary relevance judgement and comparative evaluation of two information retrieval systems.
For this study, we used a set of documents in the computing and information technology domain. It is an area we are familiar with and also an area of interest and familiarity to students from our school - the School of Computing and Information Technology - who were participants in the two stages of our experiment: first for assigning contexts to documents in a collection, and then for assessing the relevance of retrieved documents at the searching (information retrieval) stage. At both these stages the participants were expected to read the content of the documents. The other considerations were that the documents in the collection should be full-text documents and not abstracts, recent and not very old, and of medium length (number of pages), and that the collection size (number of documents) should be reasonable.
We could not use any of the traditional standard test collections like CACM, CISI, and CRAN [13]. These collections are old, not very big in size, and contain mainly abstracts and short documents. More importantly, the relevance judgements in these collections are static and binary, which does not satisfy the requirements of our context-based information retrieval system. While the TREC collections [17] have become very popular over the years, they were not attractive to us as they are not specific to a particular domain (for example, computing and IT). Further, with these collections we could not have made use of the continuous relevance (0 to 1) judgement which our system is capable of.
In view of these shortcomings and limitations, we created our own test collection. We selected documents from one magazine to restrict the domain. Keeping in view that the articles should not be very difficult to understand and should have wide acceptability, we selected articles published in IEEE Computer magazine, which is a widely read and respected magazine. Also, as our library has an agreement with IEEE, we could download entire articles for research purposes. In this collection we included articles from the Computer issues from 1995 to 2000; in all, 950 articles were inducted into our test collection.

3 Set of Contexts
The first task was to define a set of contexts that could fairly represent the documents in the collection. Keeping in mind the scope and nature of 'context' being considered in this test collection, a set of 32 context categories was arrived at after examining all the documents in the collection. This set of context categories tried to cover all possible context categories, based on type of information, contained in the selected document collection. This initial set of contexts served as a basic set for starting the context assigning activity. A provision was kept for some augmentation or refinement that might be required during the course of the context assigning activity.
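As an illustration only (the full list of 32 categories is not reproduced here), the initial context set and the provision for proposing additions can be thought of as follows; the entries and names below are placeholders seeded from the examples given in Section 1, not the actual set.

# Illustrative sketch: placeholder categories taken from the examples in Section 1.
initial_contexts = {"specification", "principle", "procedure"}   # ... up to the 32 categories
proposed_contexts = {}   # proposed name -> (proposer, document id), pending coordinator review

def propose_context(name, proposer, doc_id):
    # Record a suggested context so the coordinator can later accept or reject it.
    if name not in initial_contexts and name not in proposed_contexts:
        proposed_contexts[name] = (proposer, doc_id)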
4 Methodology for Assigning Contexts to Documents
The context assigning activity involved assigning one or more context categories to a document from a pre-defined set of context categories. This activity was performed using the Context Allocation (CA) System (described later in Section 5). It involved reading/examining of documents by 'context assigners' and then assigning the context(s) that best describe the contents of each document. This activity was very significant, as the documents assigned contexts at this stage were to be used later in the development of the context-based information retrieval system. Hence, it was performed in a consistent and systematic way.

4.1 Selection of Assigners
Assigning of documents traditionally involves analysis of their contents and then assignment of descriptors that best capture those contents. The analysis of document contents can be done either by machines, called automatic assigning, or by humans, called manual assigning.
Although automatic assigning is considerably faster, more consistent and cheaper than manual assigning, manual assigning is more accurate, albeit slower and more expensive. Despite growing automation in assigning category labels to documents, manual assigning is still in practice [1]. We decided to assign contexts manually, as no automatic text categorization system exists that can work with context categories based on type of information. Moreover, such data (compiled by manual assigning) is required to assess the performance of an automatic categorization system whenever one becomes available.
We decided to select assigners from a group of potential users of the context-based information retrieval system. We recruited five postgraduate students from computing and information technology areas for this activity. None of the selected students was involved at any stage with the development of the project.
4.2 Document Allocation
The documents in the collection were evenly divided among the assigners. Since manual assigning is prone to inconsistency, it was decided to assign each document to three assigners so that inter-assigner consistency could be evaluated.
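The exact allocation procedure is not prescribed here, so the following sketch shows one way, under our own assumptions, to divide the documents evenly while giving each document to three of the five assigners: cycle through every three-assigner group. All names in the sketch are hypothetical.

from itertools import combinations, cycle

def allocate(documents, assigners, copies=3):
    # Assign each document to `copies` assigners, spreading the load evenly
    # by cycling through every possible assigner group of that size.
    groups = cycle(list(combinations(assigners, copies)))
    allocation = {assigner: [] for assigner in assigners}
    for doc, group in zip(documents, groups):
        for assigner in group:
            allocation[assigner].append(doc)
    return allocation

# With 950 documents and five assigners, each assigner receives about 950 * 3/5 = 570 documents.
documents = [f"doc{i:04d}" for i in range(1, 951)]
shares = allocate(documents, ["A1", "A2", "A3", "A4", "A5"])
print({name: len(assigned) for name, assigned in shares.items()})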

4.3 Relevance Criteria of Context
Relevance judgements in most experiments for the evaluation of information retrieval systems are assumed to have a binary value - relevant or non-relevant. In actual practice, users' relevance judgements span a continuum of relevance regions, from highly relevant, through partially relevant, to non-relevant [16]. We adopted this concept of relevance and allowed our assigners to give a relevance score to a context in the complete range from 0 to 1 in steps of 0.1.

4.4 Evaluation
After deciding to use human assigners for assigning contexts to documents, the next important task was to establish methods for the evaluation of the assigning scheme. For achieving effective retrieval, assigning should be correct and free from assigning errors. Assigning errors can be found if we know which descriptors should and should not be assigned to each document [15]. Assigning correctness can be determined through the consensus of several good assigners, even if there are differences between assigners.
Context assigning consistency is defined as the degree of agreement in the representation of the essential information content of a document by certain sets of assigning terms selected individually and independently by each of the assigners in the group [15]. It is the extent to which different assigners (inter-assigner consistency) will choose the same terms for a given document.
Even though inter-assigner consistency may not be important in itself, it is used as an indicator of assigning correctness and hence correlates with retrieval effectiveness. Past work on inter-assigner consistency has been motivated by the perception that high levels of consistency should improve information retrieval effectiveness [6]: a better consistency will lead to an improvement in retrieval effectiveness. Leonard [10] attempted to investigate this relationship in his doctoral dissertation and found that inter-assigner consistency and retrieval effectiveness exhibit a tendency toward a direct, positive relationship.
Studies on assigning consistency have shown that consistency values vary a great deal between assigners [2, 8, 14]. Though these studies are not large enough to allow one to draw firm conclusions, they are suggestive in their results and give us a general idea of the sort of consistency that might be expected between assigners.

5 Context Allocation System
We developed a Context Allocation (CA) system for the purpose of assigning contexts to a document collection. It is a Web-based system and its code was written in Perl.
It operates in two modes:
(i) Normal user mode
(ii) Super user mode

5.1 Normal User Mode
This mode is used by assigners to assign contexts to documents. When an assigner logs on to the CA system, the main page shows the documents and their IDs (as hyperlinks) allotted to him/her for assigning contexts. The assigner clicks on a document link to review its contents; the complete document appears in a separate window. The assigner reviews the contents of the document and assigns appropriate context(s) on the main page. We allowed assigning of multiple contexts to a document: after examining the contents, the assigner could assign a maximum of three contexts that would suitably represent the full contents of the document. For each context assigned, the assigner is also expected to indicate a relevance score indicating the degree to which that context is associated with the document. Once the context(s) and relevance of a document had been assigned, it was removed from the list of documents yet to be visited by that assigner, and the procedure continued until all documents in the list had been assigned contexts.
For doing this, three drop-down menus are located on the main page for selecting contexts. Next to each context selection menu is a relevance score drop-down menu. The scale of the relevance score is from 0.1 to 1.0 with a step of 0.1.
The initial set of contexts could by no means be exhaustive and complete. Keeping this in mind, a provision was made for an assigner to suggest a new context, similar in nature to the existing ones, if he/she strongly felt that a document was not being fully represented by the existing contexts. A proposed context was visible to its proposer as well as the other assigners, and they were free to use it for assigning to documents. The project coordinator (in super-user mode, discussed below) could either accept or reject the recommendation for inclusion of the proposed new context into the system.
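To make the submission rules above concrete (at most three contexts per document, each with a relevance score from 0.1 to 1.0 in steps of 0.1), the following is a minimal validation sketch; it is our own illustration, not the Perl code of the CA system, and all names are hypothetical.

def valid_score(score):
    # True if the score is one of 0.1, 0.2, ..., 1.0 (within floating-point tolerance).
    tenths = score * 10
    return 1 <= tenths <= 10 and abs(tenths - round(tenths)) < 1e-9

def validate_assignment(assignment, known_contexts):
    # Return a list of problems with a submitted list of (context, score) pairs;
    # an empty list means the assignment satisfies the rules described above.
    problems = []
    if not 1 <= len(assignment) <= 3:
        problems.append("a document must receive between one and three contexts")
    for context, score in assignment:
        if context not in known_contexts:
            problems.append("unknown context: " + context)
        if not valid_score(score):
            problems.append("relevance score %s is not on the 0.1-1.0 scale" % score)
    return problems

# Example: two contexts, one with a score outside the drop-down scale.
print(validate_assignment([("procedure", 0.8), ("principle", 0.0)],
                          {"procedure", "principle", "specification"}))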
5.2 Super User Mode
Only the coordinator of the context assigning activity could use the system in this mode. At logon, the coordinator can see: all new contexts proposed and yet to be evaluated; the proposer of each context; the documents for which they were proposed; and the time and date of each proposal. The coordinator can examine each proposed context, review the contents of the document for which it was proposed, and decide about its induction into the system. When the coordinator accepts a proposed context, it gets inducted into the CA system; if it is not accepted, it is removed from the system.

6 Conducting Context Assigning Activity
A briefing session with all the participants was conducted before the start of the CA activity. The purpose was to make the participants completely aware of the experiment and our expectations of them. The session involved explaining the objective of the CA activity, the features and limitations of the CA system, the procedure to be followed, and the expectations from the participants. They were given a handout detailing the definition and description of each context. Some areas where confusion could possibly arise were discussed. The participants were allowed to refer to the handout during the activity.
Each participant practiced the CA activity during the training session so that they understood the whole activity very well and felt comfortable using the system and performing the activity. As rightly put by Mulvany [12], assigning "involves a way of thinking that can only be guided but not taught. Assigning cannot be reduced to steps that can be followed." We made our best effort to guide them. The training session was important, as the participants performed the CA activity online, at a time and place convenient to them, using the Web-based system remotely over the Internet.
The participants were informed that they were expected to get a feel of the subject matter discussed in the complete document before assigning an appropriate context. They were advised to assign the appropriate context(s) to a document soon after reading it. They could perform the activity at any time after the training session and were urged to meet the set deadline.
The participants were asked to note down the reasons for each selection of a context on the supplied forms. They were required to submit these forms to the coordinator after completing their share of documents. Each participant was also asked to fill in a questionnaire after finishing the CA activity. Interviews were held with each participant individually after the completion of the activity regarding their experience.

7 Evaluation
All assigners successfully completed their share of assigning contexts to documents. All assigner interactions were logged on our server. Logged data for each document included: document number, name of the assigner, assigned contexts, relevance score of each assigned context, proposed context (if any), time spent, and date and time of assigning.

7.1 Evaluation Metrics
We used both objective and subjective measures of evaluation for the context assigning activity. We measured the inter-assigner consistency across context assigners as an objective measure. We evaluated the assigners' subjective assessment of the context assigning activity through a short questionnaire and interviews.

7.1.1 Inter-assigner consistency
A number of approaches have been used for the assessment of inter-assigner consistency on a nominal or categorical scale. These include indices such as the percentage of agreement, based on the ratio of the number of agreements on the considered categories to the total number of possible agreements. Such measures ignore chance agreement: the observed agreement between two assigners includes some element of agreement by chance, and they are therefore not a reliable measure of consistency. Free from this shortcoming is the kappa coefficient (κ) proposed by Cohen [5], which calculates the proportion of agreement across assigners beyond the agreement due to chance. The kappa coefficient is defined as:

kappa (κ) = (Po - Pc) / (1 - Pc)

where Po is the observed proportion of cases in which the assigners agree and Pc is the proportion of cases in which agreement is expected due to chance.
If there is complete agreement among the assigners, then κ = 1, whereas if there is no agreement (other than the agreement which would be expected to occur by chance) among the assigners, then κ = 0.
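To make the arithmetic concrete, the small sketch below computes kappa from given Po and Pc; the numbers are illustrative only and are not values from our experiment.

def kappa(p_o, p_c):
    # Cohen's kappa: agreement beyond chance, normalised by the maximum
    # achievable agreement beyond chance.
    return (p_o - p_c) / (1.0 - p_c)

# Illustrative values only: observed agreement 0.8 against chance agreement 0.5.
print(round(kappa(0.8, 0.5), 2))   # 0.6; complete agreement gives 1.0, chance-level agreement gives 0.0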
The kappa coefficient was originally developed to assess inter-assigner consistency, or agreement, in the basic situation in which two assigners assign one category each to each case. In our experiment, the assigners assign multiple context categories to each document and we have multiple assigners assigning contexts to one document, so the original formulation of the kappa statistic is not directly applicable to our situation. Multiple assignments present a new challenge to the assessment of inter-assigner agreement.
The original kappa statistic was extended to deal with situations where multiple assigners assign multiple categories to each case [7, 11] - or, in our situation, each document. We used the formulation shown below:

Po = A / (A + J + K)

where Po is the proportion of agreement, A is the number of categories assigned to the document by both assigners (J and K), J is the number of categories assigned to the document by assigner J only, and K is the number of categories assigned to the document by assigner K only.
We used three assigners for assigning contexts to each document. Agreement among these assigners for a document was measured by averaging the proportions of agreement obtained for all three combinations of pairs of context assignments for the document. The overall observed proportion of agreement Po for the entire document collection is the average of the mean proportions of agreement obtained for each document. To calculate the kappa coefficient, a proportion of chance agreement Pc needs to be determined. This is done by computing the proportions of chance agreement between all combinations of assigners across all documents and then averaging them. We estimated the agreement due to chance using the method described by Fleiss [7].
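The sketch below puts this formulation together: the pairwise proportion of agreement A / (A + J + K), its average over the three assigner pairs for a document and over the collection, and the final chance correction. It is our own illustration rather than the code used in the experiment; the estimate of Pc (following Fleiss [7]) is assumed to be computed separately and is simply passed in, and the example data are hypothetical.

from itertools import combinations
from statistics import mean

def pairwise_agreement(cats_j, cats_k):
    # Po = A / (A + J + K) for one pair of assigners on one document: A contexts
    # chosen by both, J chosen only by the first, K chosen only by the second.
    a = len(cats_j & cats_k)
    j = len(cats_j - cats_k)
    k = len(cats_k - cats_j)
    return a / (a + j + k)

def document_agreement(assignments):
    # Average the pairwise proportions over all pairs of assigners for one document.
    return mean(pairwise_agreement(x, y) for x, y in combinations(assignments, 2))

def overall_kappa(collection, p_c):
    # Average the per-document agreement over the collection, then correct for
    # the chance agreement p_c (estimated separately, e.g. following Fleiss).
    p_o = mean(document_agreement(doc) for doc in collection)
    return (p_o - p_c) / (1.0 - p_c)

# Hypothetical data: context sets chosen by three assigners for two documents.
collection = [
    [{"procedure", "specification"}, {"procedure"}, {"procedure", "principle"}],
    [{"principle"}, {"principle"}, {"principle", "specification"}],
]
print(round(overall_kappa(collection, p_c=0.2), 2))   # 0.44 for this toy data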
7.2 Results
The overall κ calculated using the equations shown earlier was found to be 0.48. To interpret the kappa statistic, and to determine the consistency across the assigners, we used the benchmarks defined by Landis and Koch [9], as shown in Table 1. The assigners in our experiment were not professional assigners; thus, we did not expect to get perfect agreement for all contexts. We sought to achieve moderate agreement (kappa values greater than 0.40) overall. We believe that this level of agreement across novice context assigners provides evidence for a reproducible context assigning scheme.

Range          Interpretation
< 0.00         Poor
0.00 - 0.20    Slight
0.21 - 0.40    Fair
0.41 - 0.60    Moderate
0.61 - 0.80    Substantial
0.81 - 1.00    Almost perfect

Table 1: Interpretation of the kappa coefficient.
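For completeness, Table 1 can be expressed as a simple lookup; this is a convenience sketch of ours, not part of the reported methodology.

def interpret_kappa(k):
    # Map a kappa value onto the Landis and Koch benchmarks of Table 1.
    if k < 0.00:
        return "Poor"
    if k <= 0.20:
        return "Slight"
    if k <= 0.40:
        return "Fair"
    if k <= 0.60:
        return "Moderate"
    if k <= 0.80:
        return "Substantial"
    return "Almost perfect"

print(interpret_kappa(0.48))   # "Moderate", the band into which our overall value falls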
We made a subjective assessment of the context categories, the system for assigning contexts and the overall context assigning activity through a short questionnaire. We asked each assigner to rate statements about the context assigning activity, using a 1-to-4 scale to indicate their responses.
Their responses indicated their satisfaction with the whole activity. Assigners did not have a great deal of difficulty in choosing the appropriate contexts from the given contexts. They had no problem finding, in the given set, the contexts they were looking for when assigning a document. We could safely infer that the initial set of contexts was by and large able to cover all possible context categories that documents could be assigned to. The small number of contexts proposed for addition to the initial set (only 3) further supports this claim.
The assigners found the CA system adequate for the intended purpose and did not have any difficulty performing the activity. This satisfaction with the system was further confirmed during the interviews conducted with the assigners after the assigning activity.

8 Conclusion
We presented a methodology for assigning context(s) to documents in a collection and discussed the philosophy behind the context allocation activity as well as the activity itself. We obtained satisfactory inter-assigner consistency in our objective measures. Considering the subjective nature of the category assigning activity, and the fact that the assigners in our activity were novices and not professional assigners, we were satisfied with and encouraged by the results achieved for the objective measures. The results of the subjective measures were equally encouraging and satisfying. The context-assigned document collection served as the basis for the development of our context-based information retrieval system. The complete implementation and evaluation of the context-based information retrieval system is proposed for presentation in another publication.

References:
[1] Anderson, J. D. and Perez-Carballo, J., The nature of indexing: how humans and machines analyze messages and texts for retrieval. Part I: Research, and the nature of human indexing, Information Processing & Management, Vol. 37, No. 2, 2001, pp. 231-254.
[2] Chan, L. M., Inter-indexer consistency in subject cataloging, Information Technology and Libraries, Vol. 8, No. 4, 1989, pp. 349-358.
[3] Chanana, V., Ginige, A. and Murugesan, S., A new context-based information retrieval system, Accepted for the 3rd WSEAS International Conference on Artificial Intelligence, Knowledge Engineering, Data Bases (AIKED 2004), Salzburg, Austria, February 13-15, 2004.
[4] Cleverdon, C. W., The Cranfield tests on index language devices, In Readings in Information Retrieval, K. S. Jones and P. Willett, Eds. San Francisco, Calif.: Morgan Kaufmann, 1998, pp. 47-59.
[5] Cohen, J., A coefficient of agreement for nominal scales, Educational and Psychological Measurement, Vol. 20, No. 1, 1960, pp. 37-46.
[6] Ellis, D., Furner-Hines, J. and Willett, P., On the measurement of inter-linker consistency and retrieval effectiveness in hypertext databases, In SIGIR 1994: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994.
[7] Fleiss, J. L., Measuring nominal scale agreement among many raters, Psychological Bulletin, Vol. 76, No. 5, 1971, pp. 378-382.
[8] Funk, M. E. and Reid, C. A., Indexing consistency in MEDLINE, Bulletin of the Medical Library Association, Vol. 71, No. 2, 1983, pp. 176-183.
[9] Landis, J. R. and Koch, G. G., The measurement of observer agreement for categorical data, Biometrics, Vol. 33, 1977, pp. 159-174.
[10] Leonard, L. E., Inter-indexer consistency and retrieval effectiveness: measurement of relationships, Ph.D. Thesis, Graduate School of Library Science, University of Illinois, Urbana-Champaign, IL, 1975.
[11] Mezzich, J. E., Kraemer, H. C., Worthington, D. R. L. and Coffman, G. A., Assessment of agreement among several raters formulating multiple diagnoses, Journal of Psychiatric Research, Vol. 16, No. 1, 1981, pp. 29-39.
[12] Mulvany, N. C., Indexing Books, Chicago: University of Chicago Press, 1994.
[13] Sanderson, M., Word sense disambiguation and information retrieval, Ph.D. Thesis, Department of Computing Science, University of Glasgow, 1996.
[14] Sievert, M. C. and Andrews, M. J., Indexing consistency in Information Science Abstracts, Journal of the American Society for Information Science, Vol. 42, No. 1, 1991, pp. 1-6.
[15] Soergel, D., Indexing and retrieval performance: the logical evidence, Journal of the American Society for Information Science, Vol. 45, No. 8, 1994, pp. 589-599.
[16] Spink, A., Greisdorf, H. and Bateman, J., From highly relevant to not relevant: examining different regions of relevance, Information Processing & Management, Vol. 34, No. 5, 1998, pp. 599-621.
[17] Voorhees, E. M. and Harman, D. K., Overview of TREC 2001, In NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), 2001.