ADAC Automatic Document Analyzer and Classifier
A. Guitouni A.-C. Boury-Brisset DRDC Valcartier
L. Belfares K. Tiliki Université Laval
C. Poirier Intellaxiom Inc.
Defence R&D Canada – Valcartier Technical Report DRDC Valcartier TR 2004-265 October 2006
Author
A. Guitouni, A.-C. Boury-Brisset, L. Belfares, K. Tiliki and C. Poirier
Approved by
Dr. E. Bossé Section Head / Decision Support System Section
Approved for release by
G. Bérubé Chief Scientist
© Her Majesty the Queen as represented by the Minister of National Defence, 2006 © Sa majesté la reine, représentée par le ministre de la Défense nationale, 2006
Abstract
Military organizations have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mails, electronic documents, etc.). The documents have to be screened, analyzed and categorized in order to interpret their contents and gain situation awareness. These documents should be categorized according to their contents to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process, which is currently partially manual. Integrating the recently acquired knowledge from different fields into a system for analyzing, diagnosing, filtering, classifying and clustering documents with limited human intervention would efficiently improve the quality of information management with reduced human resources. A better categorization and management of information would facilitate the correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes. DRDC Valcartier's ADAC system (Automatic Document Analyzer and Classifier) incorporates several techniques and tools for document summarization and semantic analysis based on an ontology of a given domain (e.g. terrorism), and algorithms for diagnosis, classification and clustering. In this document, we describe the architecture of the system and the techniques and tools used at each step of the document processing. For the first prototype implementation, we focused on the terrorism domain to develop the document corpus and related ontology.

Résumé
Military organizations face a notable increase in the number of documents coming from different sources in various formats (paper, fax, e-mails, electronic documents, etc.). These documents must be screened, analyzed and categorized in order to interpret their content and understand the situation. They must therefore be categorized according to their content for better archiving and more efficient later retrieval. In this context, advanced techniques and tools will have to be developed to support and carry out this information management process, which is currently performed essentially by hand. Integrating new knowledge from different fields into a single document management system covering the analysis, diagnosis, filtering, classification and organization of documents should considerably improve its efficiency, with a minimum of human intervention. Better management should facilitate the integration of information from various sources, eliminate redundancy, improve access to relevant information and thus, ultimately, provide better support to the decision-making process. The ADAC system (Automatic Document Analyzer and Classifier) designed at DRDC Valcartier incorporates various techniques and tools for summarization and semantic analysis based on the ontology of a particular domain (e.g. terrorism), together with algorithms for the diagnosis, classification and organization of documents. In this report, we describe the architecture of the system, as well as the techniques and tools used at each step of the processing of a document. For the prototype implementation, the emphasis was placed on the terrorism domain to develop an ontology and an adapted document collection.
Executive summary
Military organizations, in particular intelligence or command centres, have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mail messages, electronic documents, etc.). These documents must be analyzed in order to interpret their contents and gain situation awareness, and they should be categorized according to their content to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process, which is currently partly manual.
Automatic, intelligent processing of documents is at the intersection of many fields of research, especially linguistics and artificial intelligence, including natural language processing, pattern recognition, semantic analysis and ontology. Integrating the recently acquired knowledge from these fields into a system for analyzing, diagnosing, filtering, classifying and clustering documents with limited human intervention would efficiently improve the quality of information management with reduced human resources. A better categorization and management of information would facilitate the correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes.
This is the purpose of the work we have undertaken at DRDC Valcartier as part of the Common Operational Picture for 21st Century Technology Demonstration project. The ADAC system (Automatic Document Analyzer and Classifier) incorporates several techniques and tools for document summarization and semantic analysis based on the ontology of a given domain (e.g. terrorism), and algorithms for diagnosis, classification and clustering. A document is processed through the following steps: i) Summarization: large documents are summarized to provide a synthesized view of their content; ii) Statistical and semantic analysis: the document is indexed by identifying the attributes that best characterize it. Both statistical analysis and semantic processing exploiting the domain ontology are carried out at this stage; iii) Diagnosis: intercept relevant documents matching criteria provided by the user (e.g. documents on a particular subject) in order to execute an appropriate action (e.g. an alert); iv) Filtering/classification: classify/categorize the document in predefined hierarchical classes; and v) Clustering: assign the document to the most similar group of previously processed documents. External actions can then be triggered on specific classes of documents (e.g. alerts, visualization and data mining). Using a launching agent, ADAC periodically checks for new documents and processes them. The diagnostic and filtering/classification tests may be re-run on previously analyzed documents if new directives require it.
In this report, we describe the architecture of the system and the techniques and tools used at each step of the document processing. For the first prototype implementation, we have chosen to focus our document corpus and related ontology on the terrorism domain.
Guitouni, A., Boury-Brisset, A.-C., Belfares, L., Tiliki, K., Poirier, C. 2006. ADAC: Automatic Document Analyzer and Classifier. DRDC Valcartier TR 2004-265, Defence R&D Canada.
Sommaire
Military organizations, in particular intelligence cells and command centres, must process an ever-increasing amount of information coming from different sources in various formats (paper, fax, e-mails, electronic documents, etc.). These documents must be screened and analyzed in order to interpret their content for better situation management. They must therefore be categorized by subject to allow, on the one hand, efficient archiving and, on the other hand, easier later retrieval. In this context, advanced techniques and tools will have to be developed to support and carry out this information management process, which at present is performed essentially by hand.
Automatic document understanding is a multidisciplinary research area touching in particular computational linguistics and artificial intelligence, notably natural language processing, pattern recognition, and semantic and ontological analysis. Integrating into a single system the results of recent research in the various fields of knowledge related to document management, covering in particular the analysis, diagnosis, filtering and classification of documents, should considerably improve its efficiency with a minimum of human intervention. Better categorization and adequate management of information should facilitate the aggregation of information from various sources, eliminate redundancy, improve access to relevant information and thus provide better support to the decision-making process.
This is the objective of the work we have undertaken at DRDC Valcartier within the COP 21 Technology Demonstration project. The ADAC system (Automatic Document Analyzer and Classifier) incorporates various techniques and tools for text summarization and semantic analysis based on the ontology of a particular domain (e.g. terrorism), combined with algorithms for the diagnosis, classification and organization of documents. A document is processed in ADAC through the following steps: i) Summarization: large documents are summarized to produce a synthesis; ii) Statistical and semantic analysis: the document is indexed by identifying the attributes that best characterize it; to this end, both statistical and semantic processing (exploiting the ontology) is carried out; iii) Diagnosis: the document is intercepted if it meets selection criteria provided by the user (e.g. a document dealing with a particular subject) and an associated action is triggered (e.g. an alert); iv) Classification: the document is categorized into predefined hierarchical classes (according to a taxonomy of the domain); and v) Clustering: the document is assigned to the semantically closest group among the groups formed by the documents already processed.
In this report, we describe the architecture of the system, as well as the techniques and tools used at each step of the processing of a document. For the prototype implementation, the emphasis was placed on the terrorism domain to develop an ontology and an associated document collection.
Guitouni, A., Boury-Brisset, A.-C., Belfares, L., Tiliki, K., Poirier, C. 2006. ADAC: Automatic Document Analyzer and Classifier. DRDC Valcartier TR 2004-265, Defence R&D Canada.
Table of contents
Abstract / Résumé
Executive summary
Sommaire
Table of contents
List of figures
Acknowledgements
1. Introduction
2. Automated Document Processing
   2.1 Introduction
   2.2 Information Retrieval
   2.3 Document classification (or document categorization)
       2.3.1 The Expert System approach
       2.3.2 The Machine Learning approach
             2.3.2.1 Document representation and preprocessing
             2.3.2.2 Classification methods
       2.3.3 Application to document categorization
   2.4 Document clustering
   2.5 Commercial solutions
   2.6 Awaited solutions
3. The ontology-based document processing approach
   3.1 About the rationality
   3.2 Ontologies: definitions and roles
       3.2.1 From controlled vocabulary to ontologies
       3.2.2 Role of ontologies in information systems and knowledge management
   3.3 Exploitation of Ontologies for document processing
       3.3.1 Content-based indexing
       3.3.2 Ontology-based search and retrieval
       3.3.3 Ontologies in enterprise portals
       3.3.4 Ontology-based document categorization and clustering
   3.4 Combining statistics and semantics for document categorization
       3.4.1 Approach
       3.4.2 Operationalization
4. The processing algorithms
   4.1 Representation of the document's DNA
   4.2 The diagnostic module
   4.3 Classification/Filtering
   4.4 The Clustering module
       4.4.1 Problem formulation
       4.4.2 Clustering using genetic algorithms
       4.4.3 Clustering using variable neighborhood search method
5. Empirical tests
   5.1 Metrics for performance assessment
   5.2 The simulation tool: TestBench
   5.3 Tests results
       5.3.1 Filtering/Categorization algorithm
       5.3.2 Clustering algorithms
6. ADAC prototype functional architecture
7. Conclusions
References
Annex A: Concepts of Weights in the Ontology
Annex B: Clustering using Genetic Algorithms
Annex C: Similarity Index Computation
Annex D: Non-parametric approaches
Annex E: COTS Product Evaluation
List of symbols/abbreviations/acronyms/initialisms
Distribution list
List of figures
Figure 1. ADAC's document processing
Figure 2. Hierarchy of concepts in the terrorism ontology
Figure 3. Document DNA
Figure 4. Classification process
Figure 5. Example of a 3-level hierarchy
Figure 6. TestBench's interface for the categorization simulations
Figure 7. TestBench's interface for the genetic clustering algorithm
Figure 8. TestBench's interface for the VNS clustering algorithm
Figure 9. ADAC retained configuration
Figure 10. ADAC processing and analyzing scenario
Figure 11. ADAC recovery agent
Figure 12. ADAC implementation concept
Figure 13. ADAC architecture
Figure 14. ADAC Interfaces (example)
Figure 15. Other ADAC configurations
Figure 16. Diagram for concordance index measurement (DAC-01-C-P)
Figure 17. Diagram for discordance index measurement (DAC-01-C-P)
Figure 18. Admissibility for comparison for 0% overlapping
Figure 19. Admissibility for comparison in 100% overlapping
Figure 20. Discordance index measurement (CAD-02-I-I)
Figure 21. Diagram for concordance index calculation in case 1 (CAD-04-C-I)
Figure 22. Diagram for concordance index calculation in case 2 (CAD-04-C-I)
Figure 23. Autonomy's IDOL architecture and technical components
Figure 24. Stratify discovery system architecture
Figure 25. The Stratify classification process
Figure 26. RetrievalWare Searching Process
Figure 27. RetrievalWare Architecture
Figure 28. Term "bears witness" (Applied Semantics)
Figure 29. Applied Semantics Concept Server: implementation architecture

List of tables

Table 1. Factors of severity/admissibility values according to the DM Attitude
Table 2. Partial discordance variations D_j^{μh}(d, p_i) (CAD-04-C-I)
Table 3. Autonomy Architecture
Table 4. RetrievalWare Searching Process

Acknowledgements

The authors would like to thank the Common Operational Picture 21 (COP 21) Project Team for their constructive ideas.

1. Introduction

In May 1999, the new National Defence Command Centre (NDCC) was commissioned. The mission of the NDCC is to provide a 24/7 secure command and control facility through which the Command staff can plan, mount and direct operations and training activities at the strategic level. Since September 11th, 2001, message traffic at the NDCC has reached unpredictable peaks. The operators of the Centre are overloaded with information that they must handle in real time. The information "digested" by the NDCC represents vital stakes for several other users.

Military organizations, and particularly intelligence or command centres, have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mails and electronic documents). These documents must be analyzed in order to interpret their contents and gain situation awareness. They should be diagnosed and categorized according to their content to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process, which is currently partly manual.

Automatic, intelligent processing of documents is at the intersection of many fields of research, especially linguistics and artificial intelligence, including natural language processing, pattern recognition, semantic analysis and ontology. Natural language understanding has been a major research domain for decades. Integrating the recently acquired knowledge from these fields into a system for analyzing, diagnosing, filtering, classifying and clustering documents with limited human intervention would efficiently improve the quality of information management with reduced human resources. A better categorization and management of information would facilitate the correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes. The ADAC system (Automatic Document Analyzer and Classifier) has been developed at DRDC Valcartier as a concept demonstrator and a test bed.
The objective of the targeted system is to provide an environment in which documents of various types and formats can be automatically processed, with minimal human intervention, going from document summarization and statistical/semantic analysis for content extraction to diagnosis and classification. Some document processing modules may eventually trigger external actions. Consequently, this environment incorporates several techniques and tools for document summarization, semantic analysis based on an ontology of a given domain (e.g. terrorism), and algorithms for automated diagnosis, classification and clustering.

ADAC is composed of a set of agents, each being responsible for a specific document-processing module. ADAC's launching agents automatically intercept any new document. The document is then processed through the following steps:

• Summarization: provide a synthesized view of the document's contents;
• Statistical and semantic analysis: index the document by identifying the attributes that best characterize it. Both statistical analysis and semantic processing exploiting the domain ontology are carried out at this stage. This produces the document DNA;
• Diagnosis: intercept relevant documents matching criteria provided by the user (e.g. documents on a particular subject) in order to apply an appropriate action (e.g. an alert);
• Filtering/classification: classify/categorize the document in predefined hierarchical classes;
• Clustering: assign the document to the most similar group of previously processed documents.

External actions can then be triggered on specific classes of documents (e.g. alerts, visualization and data mining); the overall flow is sketched below.

Figure 1. ADAC's document processing
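To make the flow in Figure 1 concrete, the following minimal Python sketch strings the five steps together as a single pipeline. All function names and bodies (summarize, extract_dna, diagnose, classify, cluster) are illustrative placeholders, not the actual ADAC agent interfaces described later in this report.

```python
# Illustrative sketch of ADAC's document-processing pipeline (Figure 1).
# Every function body is a placeholder; the real modules are the agents
# described in the remainder of this report.

def summarize(text: str) -> str:
    """Return a synthesized view of a large document (step i)."""
    return text[:500]  # placeholder: the real module produces a true summary

def extract_dna(summary: str) -> dict:
    """Statistical + semantic analysis producing the document DNA (step ii)."""
    words = summary.lower().split()
    return {w: words.count(w) for w in set(words)}

def diagnose(dna: dict, watch_terms: set) -> bool:
    """Intercept documents matching user-supplied criteria (step iii)."""
    return any(term in dna for term in watch_terms)

def classify(dna: dict) -> str:
    """Assign the document to a predefined hierarchical class (step iv)."""
    return "terrorism/bombing" if "bomb" in dna else "unclassified"

def cluster(dna: dict, groups: list) -> int:
    """Assign the document to the most similar group (step v)."""
    return 0  # placeholder: the real module computes similarity to each group

def process(document: str) -> None:
    summary = summarize(document)
    dna = extract_dna(summary)
    if diagnose(dna, watch_terms={"bomb", "attack"}):
        print("ALERT: document intercepted")   # external action
    label = classify(dna)
    group = cluster(dna, groups=[])
    print(f"class={label}, cluster={group}")

process("Intelligence report: a bomb threat was reported near the port ...")
```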
This work has been carried out under the Common Operational Picture for 21st Century Technology Demonstration Project. It was motivated by face-to-face interviews with NDCC operators, which allowed the DRDC team to capture requirements for a third-generation search and retrieval engine. This report captures the important contributions of this work. It is organized in two parts: the first part introduces the problem of automatic document processing and categorization and presents the main theoretical foundations of this work. In the second part, the approaches and algorithms used at the different steps of document processing are presented. A description of the implementation of the first ADAC prototype and preliminary results are provided to illustrate this study. Finally, our conclusions are presented, along with a brief discussion of ongoing work and future development ideas.

2. Automated Document Processing

2.1 Introduction

The increasing amount of digital information exchanged among people, and the resulting information overload that workers have to face, have accentuated the need for more innovative knowledge management tools dedicated to information processing, for example text summarization, text extraction, text retrieval and text classification/categorization.

Information comes in many forms, and can be either structured (relational databases, tagged messages) or unstructured (electronic documents, Web pages, e-mails, etc.). While structured information is well handled by database management systems, the management of unstructured information/documents needs further research to facilitate both the structuring and organization of information and its exploitation (e.g. effectively retrieving relevant information). For several years, US-sponsored conferences such as MUC (Message Understanding Conference, www.itl.nist.gov/iaui/894.02/related_projects/muc/) and TREC (Text Retrieval Conference, trec.nist.gov), devoted to automatic text processing, have contributed to significant advances in the domain.

In the wide research area of automatic document processing, one can distinguish three important fields: information extraction, information retrieval, and text classification or categorization; see for example Salton [1989a, b], Maybury [1993], and Mani and Maybury [1999] for a tutorial on the subject. A clarification has to be made at this stage in order to define what is meant by these terms.

• Information extraction (see http://www.dcs.shef.ac.uk/research/groups/nlp/extraction/) is the process of extracting information from text in order to identify specific semantic elements within a text (e.g. entities, properties, relations) that populate a template. The goal is to understand the document semantics in order to extract relevant content by means of Natural Language Processing (NLP) techniques. The process takes place in several stages: tokenisation, morphological and lexical analysis, etc. Information extraction is not information retrieval: it does not recover from a collection a subset of documents that are hopefully relevant to a query, based on keyword searching (perhaps augmented by a thesaurus). Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about predefined types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyze the data for trends, to give a natural language summary, or simply to serve for on-line access (a toy sketch of this template-filling view follows below).

• Information retrieval consists in retrieving the documents that best match a user query. Usually, documents are indexed by word occurrences to facilitate the process.

• Document categorization consists in assigning documents to predefined categories. Document clustering is the process of detecting topics within a document collection, assigning documents to those topics and labelling the topic clusters.

Information management of large amounts of documents requires addressing two inter-related problems: the classification of information from heterogeneous information sources, and effective and efficient access to relevant information (i.e. information retrieval). The problem of text classification is less complex than that of full text understanding, because it consists in extracting the most relevant concepts from the document, not in interpreting the text (as is required, for example, for text summarization).
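As a toy illustration of the template-filling view of information extraction (not the full NLP pipeline an operational system would use), the sketch below fills a predefined event template from free text using regular expressions; the template slots and patterns are invented for the example.

```python
import re

# A toy event template: what happened, where, when. Real IE systems use
# tokenisation, morphological/lexical analysis and named-entity
# recognition instead of hand-written patterns.
TEMPLATE_PATTERNS = {
    "event":    r"\b(bombing|hijacking|kidnapping)\b",
    "location": r"\bin ([A-Z][a-z]+)\b",
    "date":     r"\bon (\d{1,2} \w+ \d{4})\b",
}

def extract(text: str) -> dict:
    """Fill the event template from raw text; missing slots stay None."""
    record = {}
    for slot, pattern in TEMPLATE_PATTERNS.items():
        match = re.search(pattern, text)
        record[slot] = match.group(1) if match else None
    return record

print(extract("A bombing was reported in Madrid on 11 March 2004."))
# {'event': 'bombing', 'location': 'Madrid', 'date': '11 March 2004'}
```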
In large organizations that do not have content-based management tools such as automatic document categorization tools, electronic documents are scanned manually by information managers for content assessment and then classified in predefined folders. Even if people are better than machines at understanding the meaning of documents, this manual process is labor-intensive and expensive to maintain. Furthermore, it is subject to inconsistency, because different people cannot classify large volumes of information in a uniform way. Folder categories are not totally disjoint, and documents may end up duplicated across categories. Automatic document classification and clustering techniques aim at providing solutions for organizing tremendous numbers of text documents by topic, based on their contents. Classification within document repositories differs from relational database management, where data are organized as attribute-value pairs. In the following sections, after introducing the domain of information retrieval, which is relevant for classification purposes, we present the approaches proposed in the literature to address the problems of document classification and document clustering.

2.2 Information Retrieval

The main objective of information retrieval (IR) is to find desired information in a collection of textual documents. This field, already relatively old (more than 30 years), is centered on document access problems in response to various types of queries. One of the basic tasks approached in this field consists in providing a user with a list of relevant documents in response to a previously formulated query. Traditionally, the basic object in IR is a text (or a portion of text) represented by a term vector. More recently, a representation in the form of word groups has been proposed.

The traditional IR task consists in matching a query against a document collection and returning the relevant documents to the user. The success of such systems depends partly on the quality and quantity of information associated with the request. Indeed, with a greater quantity of information defining document relevance, systems can use more advanced techniques to identify relevant and nonrelevant documents. Most IR systems rely on a statistical approach rather than on methods from computational linguistics (Natural Language Processing). Several reasons have been advanced to explain this state of affairs, which can at first appear counter-intuitive, because language knowledge should be a requirement in the development of an intelligent text-retrieval system [Amini 2001].

Classically, an IR system is composed of two large components [Amini 2001]:

1. An indexing process, which leads to a representation of the documents, the queries and the class representatives (prototypes). Documents and queries are described as vectors in the same semantic vector space, that of the ontology's concepts, which is structured as a hierarchy; this space has the dimension of the ontology. We denote by d, q and p the vector representations of a document d, a query q and a class prototype p, respectively.

2. A similarity measure between each document and each query, and between documents and class prototypes. The most traditional method consists in calculating the cosine of the angle between the vectors d and q, or between d and p.
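A minimal sketch of this cosine measure, assuming documents, queries and prototypes are given as sparse term-weight dictionaries over the same concept space:

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

doc   = {"attack": 3, "port": 1, "vehicle": 2}   # document vector d
query = {"attack": 1, "vehicle": 1}              # query vector q
print(round(cosine(doc, query), 3))              # higher = more relevant
```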
Besançon [2002] presents in his thesis an interesting outline of the various methods used in the literature to calculate this similarity. The documents are ranked on the relevance scale according to their similarity measure.

The traditional search strategies used in IR are based on Boolean, vectorial and probabilistic models. These models take their names from the three possible document representations. Boolean models are characterized by a representation of documents based on the presence or absence of terms in the document. The majority of rule-based systems use a Boolean approach [Apte et al. 1994, Cohen 1996]; some disadvantages are discussed in [Hull 1994]. The vectorial models encompass a great number of search methods. The query and documents are indexed in two stages. First, relevant terms are extracted from the query q and/or from the document d (we assume here that queries and documents are written in a natural language). Then each term is assigned a weight that reflects its importance. A score is generated by a similarity function applied to the query and document representations. Salton and Buckley [1991] have tested several vectorial search models. Models based on probabilistic approaches try to capture word distributions in documents in order to use them for later inference. The first studies of these models date back to the early 1960s with Maron and Kuhns [1960]. Since then, they have been enriched by many other models. The score used in these models is the probability of relevance to a particular query. One of the justifications advanced for the use of probabilistic models is the "probability ranking principle" [Robertson and Sparck Jones 1976]. Amini [2001] states this principle as follows: "the optimal search performances are obtained when documents are provided in an ordered way according to their relevance probability for a certain query"; in this probabilistic context, the concepts of "relevance", "optimality" and "performance" can be defined in an exact way.

IR can also use document structuring (by enrichment of the representation vector) as a preprocessing phase. This structuring is performed by statistical/linguistic analysis tools that provide richer representations, since documents are no longer represented only in a space of words, but also in a semantic space of concepts. These representations allow the user to apprehend the informative content of a document more intuitively.

Text retrieval techniques are measured using two parameters, namely precision and recall. Precision is the percentage of retrieved documents that are relevant to a query (correct responses). Recall is the percentage of documents relevant to a query that were actually retrieved.
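Both measures reduce to simple set arithmetic; a sketch, assuming the sets of retrieved and relevant document identifiers are available:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 documents returned are relevant; 3 of 6 relevant docs found.
p, r = precision_recall({"d1", "d2", "d3", "d9"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.75, recall=0.50
```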
2.3 Document classification (or document categorization)

In this report, document classification and document categorization are used as synonyms. Automatic document classification has a long history in the literature and has been an active research area for a few decades [Sebastiani 1999]. Several approaches and algorithms have been proposed, and new enhancements are still emerging to obtain better results. We present hereafter the main concepts underlying these techniques and describe the most popular algorithms for document classification.

2.3.1 The Expert System approach

The first methods for the creation of automatic document classifiers were based on a knowledge engineering approach: the design of an expert system dedicated to the task of classifying documents into predefined categories. The technique consisted in the manual definition of a classifier by domain experts, who defined a set of rules encoding expert knowledge on how to classify documents under the predefined categories. The drawback of this approach is the knowledge acquisition bottleneck, well known from the expert systems literature: the rules must be manually defined by a knowledge engineer with the aid of a domain expert, and if the set of categories is updated, the system must be modified to take the new categories into account.

2.3.2 The Machine Learning approach

Nowadays, the dominant approach to document categorization relies on the machine learning (ML) paradigm, according to which a general inductive process automatically builds a text classifier by learning (i.e. the computer system discovers the classification rules) from a set of preclassified documents and the characteristics of the categories of interest. This approach requires an existing set of classes with associated training data.

Text classification is thus a two-step process: training and classification. In the first step, training, the system is given a set of preclassified documents (provided by human experts) and uses them to learn the features that represent each of the concepts. In the classification phase, the classifier uses the knowledge gained in the training phase to assign a new document to one or more of the categories. In this approach, the general inductive process (also called the learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci (positive examples) or not (negative examples) by a domain expert; from these characteristics, the inductive process gleans the characteristics that a new unseen document should have in order to be classified under ci. In ML terminology, the classification problem is an activity of supervised learning, since the learning process is "supervised" by the knowledge of the categories and of the training instances that belong to them.

The engineering effort goes toward the construction not of a classifier, but of an automatic builder of classifiers (the learner). This means that if a learner is (as it often is) available off the shelf, all that is needed is the inductive, automatic construction of a classifier from a set of manually classified documents. In the ML approach, the preclassified documents are thus the key resource. In the most favorable case, they are already available; this typically happens in organizations that have previously carried out the same categorization activity manually and decide to automate the process. A subset of the preclassified documents from the global corpus serves as the training set, and the remaining documents (called the test set) are used to test the accuracy of the classifier.

It must be noted that the machine learning approach to classifier construction relies on techniques from information retrieval, because both document categorization and information retrieval are content-based document management tasks. Common processes include document indexing, and document-request matching or query expansion, which are used in the inductive construction of the classifier.
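The two-step process can be summarized in a short harness. In the sketch below, a learner is any function from labeled training documents to a classifier; the toy majority-class learner and the tiny corpus are stand-ins for the real inductive process and training data.

```python
import random
from collections import Counter

def toy_learner(training_set):
    """Stand-in inductive process: learns only the majority category."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda document: majority   # the induced "classifier"

# Preclassified corpus (document text, category) provided by human experts.
corpus = [("truck bomb at checkpoint", "terrorism"),
          ("hijacking of cargo vessel", "terrorism"),
          ("budget meeting minutes", "admin"),
          ("suspicious package found", "terrorism"),
          ("annual leave policy update", "admin"),
          ("IED attack on convoy", "terrorism")]

random.seed(0)
random.shuffle(corpus)
training_set, test_set = corpus[:4], corpus[2:]   # train/test split
training_set, test_set = corpus[:4], corpus[4:]

classifier = toy_learner(training_set)            # step 1: training
correct = sum(classifier(doc) == label for doc, label in test_set)
print(f"accuracy on test set: {correct}/{len(test_set)}")  # step 2: evaluation
```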
2.3.2.1 Document representation and preprocessing

Digital documents, which are typically composed of strings of characters, must be converted into a representation suitable for the classification task. Each document in the corpus is represented as a vector of n weighted index terms (e.g. numbers of occurrences of words); weights usually range between 0 and 1. This representation of documents is called the bag of words. Any indexing technique that represents a document as a vector of weighted terms may be used. Before indexing, preprocessing is usually performed. It consists in removing stopwords, i.e. words that carry no information such as prepositions, and performing word stemming, i.e. suffix removal.

Because of the high dimensionality of the term space, dimensionality reduction is often employed, using one of two distinct techniques: feature selection or feature extraction. For the latter, Latent Semantic Indexing, a technique used in Information Retrieval to address problems deriving from the use of synonymous, near-synonymous, and polysemous words as dimensions of document representations, can be exploited for dimensionality reduction in this context. This technique compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence.
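A sketch of this preprocessing chain (stopword removal, crude suffix stripping, normalized term weights); the stopword list and the stemming rule are deliberately minimal stand-ins for real linguistic resources such as a Porter stemmer.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was"}

def stem(word: str) -> str:
    """Crude suffix removal; a real system would use e.g. Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text: str) -> dict:
    """Represent a document as a vector of term weights in [0, 1]."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [stem(t) for t in tokens if t not in STOPWORDS]
    counts = Counter(terms)
    if not counts:
        return {}
    top = counts.most_common(1)[0][1]
    return {term: n / top for term, n in counts.items()}  # normalized weights

print(bag_of_words("The bombing of the embassy was reported in the capital."))
```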
For deciding whether a test document dj should be classified under DRDC Valcartier TR 2004-265 9 category ci or not, k-NN looks at whether the k training documents most similar to dj also are in ci; if the answer is positive for a large enough proportion of them, a positive decision is taken, and a negative decision is taken otherwise. • Decision trees: (Sebastiani) In this approach, the test document is matched against a decision tree, constructed from the training examples, to determine whether the document is relevant to the user or not. A decision tree (DT) text classifier (see Mitchell [1996]) is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leafs are labeled by categories. Such a classifier categorizes a test document dj by recursively testing for the weights that the terms labeling the internal nodes have in vector E dj , until a leaf node is reached; the label of this node is then assigned to dj . Most such classifiers use binary document representations, and thus consist of binary trees. • Support Vector Machines: This method tries to find a boundary that achieves the best separation between the groups of documents. The system is trained using positive and negative examples of each category and the boundaries between the categories are calculated. A new document is categorized by determining the partition of the space to which the vector belongs. In geometrical terms, it may be seen as the attempt to find, among all the surfaces σ1, σ2, …in |T| -dimensional space that separate the positive from the negative training examples (decision surfaces), the σi that separates the positives from the negatives by the widest possible margin, that is, such that the separation property is invariant with respect to the widest possible translation of si.. • Neural networks: [Ruiz 99] In this method, a neural network takes training sets as inputs and calculates the topics inferred from these words as the output. In the approaches, one can distinguish flat text classification from hierarchical text classification. With Flat text classification, categories are treated in isolation of each other and there is no structure defining the relationships among them. A single huge classifier is trained, which categorizes each new document as belonging to one of the possible basic classes. They lose accuracy because the categories are treated independently and relationship among the categories is not exploited. With hierarchical text classification, topics that are close to each other in hierarchy have more in common with each other. Thus, the problem is addressed using a divide-and- conquer approach [Koller 97] that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. At each level in the category hierarchy, a document can be first classified into one or more subcategories using some flat classification methods. We can use features from both the current level as well as its children to train this classifier. By treating problem hierarchically, the problem can be decomposed into several problems, each involving a smaller number of categories. 
Among category 10 DRDC Valcartier TR 2004-265 structures for hierarchical classification, category tree allows documents to be assigned into both internal categories and leaf categories, and directed acyclic category graph categories are organized as a Directed Acyclic Graph (DAG). This is perhaps the most commonly used structure in the popular web directory services such as Yahoo! and Open Directory Project. Documents can be assigned to both internal and leaf categories. 2.3.3 Application to document categorization Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer. A filtering system may also further classify the documents deemed relevant to the consumer into thematic categories. Similarly, an e-mail filter might be trained to discard “junk” mail and further classify nonjunk mail into topical categories of interest to the user. A filtering system may be installed at the producer end, in which case it must route the documents to the interested consumers only, or at the consumer end, in which case it must block the delivery of documents deemed uninteresting to the consumer. In the former case, the system builds and updates a “profile” for each consumer, while in the latter case (which is the more common, and to which we will refer in the rest of this chapter) a single profile is needed. A profile may be initially specified by the user, thereby resembling a standing IR query, and is updated by the system by using feedback information provided (either implicitly or explicitly) by the user on the relevance or nonrelevance of the delivered messages. In theTREC community, this is called adaptive filtering. Automatic categorization of Web pages. TC has recently aroused a lot of interest also for its possible application to automatically classifying Web pages, or sites, under the hierarchical catalogues hosted by popular Internet portals. This way, it is easier for a search engine to first navigate in the hierarchy of categories and then restrict the search to a particular category of interest. Automatic Web page categorization has two essential peculiarities: The hypertextual nature of the documents where links between pages can be exploited for categorization, and the hierarchical structure of the category set (This may be used, for example, by decomposing the classification problem into a number of smaller classification problems, each corresponding to a branching decision at an internal node). Text Mining [Hearst 2003] consists in analyzing large text collections, detecting usage patterns, trying to extract implicit information and discovering new knowledge that is useful for a particular purpose. It is a variation of data mining but the difference is that the patterns are extracted from a natural language text rather than from structured databases of facts. It is becoming an active research area applied to the Web where the goal is to extract and discover knowledge from large sets of Web pages (Web mining). 2.4 Document clustering Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated DRDC Valcartier TR 2004-265 11 for improving the precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbors of a document (by preclustering the entire corpus). 
More recently, clustering has been proposed for use in browsing a collection of documents or in organizing the results returned by a search engine in response to a user's query. Document-clustering systems create groups of documents based on associations among the documents. They use an unsupervised algorithm to create the clusters: since automatic clustering does not require training data, it is an example of unsupervised learning. Such systems take documents as input, extract or select the features of the documents, and form clusters based on a calculation of similarity between individual documents, or between an individual document and a representation of the clusters formed so far. The similarity calculation is based only on the features selected for that document collection. To determine the degree of association among documents, clustering systems require a similarity metric to measure the distance between document vectors, such as the number of words that the documents have in common.

Clustering algorithms can be either hierarchical, forming a tree-like organization of documents, or nonhierarchical, forming a flat set of document groups (disjoint clusters). Consequently, there are two main approaches to document clustering, namely agglomerative hierarchical clustering (AHC) and partitional techniques (e.g. K-means). There are two basic approaches to generating a hierarchical clustering:

1. Agglomerative (bottom-up): start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

2. Divisive (top-down): start with one all-inclusive cluster and, at each successive iteration, split a cluster into smaller clusters. In this case, we need to decide, at each step, which cluster to split and how to perform the split.

Most of the work on document clustering has concentrated on hierarchical agglomerative clustering methods. Agglomerative algorithms find the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a certain stopping criterion is met. A number of different methods have been proposed for determining the next pair of clusters to merge. Hierarchical algorithms produce a clustering that forms a dendrogram, with a single all-inclusive cluster at the top and single-point clusters at the leaves. Partitional algorithms, on the other hand, such as K-means or K-medoids, find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular algorithm, a k-way clustering solution can be obtained either directly or through a sequence of repeated bisections. In the former case, there is in general no relation between the clustering solutions produced at different levels of granularity, whereas the latter case gives rise to hierarchical solutions.

The main advantage of clustering over classification is that it may reveal previously hidden but meaningful themes among documents. However, clustering techniques provide no clear way to convey the meaning of the clusters.
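A sketch of the partitional approach (K-means) on small dense vectors; a real document-clustering system would operate on the sparse term vectors described earlier, but the iteration (assign each point to the nearest centroid, then recompute the centroids) is the same.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Partition points (lists of floats) into k disjoint clusters."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        for i, members in enumerate(clusters):  # update step
            if members:
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return clusters

# Two obvious groups of 2-d "documents".
points = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
          [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
for i, c in enumerate(kmeans(points, k=2)):
    print(f"cluster {i}: {c}")
```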
2.5 Commercial solutions

Many projects and commercial off-the-shelf tools have been proposed to deal with problems like those addressed in this project (see Witten [2001]). Annex E provides a description and evaluation of some of these commercial tools, namely Autonomy, Delphes, Stratify, Convera, Applied Semantics and Diagnos. In particular, it presents the characteristics of the tools and their information management functions, and describes how well they meet ADAC's requirements.

2.6 Awaited solutions

In the literature presented above, several approaches have been proposed for automatic document classification and clustering, but few have been devoted to exploiting both the documents' contents and the semantics underlying the domain of interest. In this context, we have experimented with candidate methods and tools to deal with the problem of document categorization within military organizations such as the NDCC. In the following chapters, we describe the different techniques we have proposed and implemented within the ADAC environment to support automatic document processing functions exploiting an ontology of the domain. The main efforts deal with document categorization and clustering.

3. The ontology-based document processing approach

Structuring a decision problem always begins with the specification of the decision framework: attribute description, alternative generation, and assessment of consequences in terms of multiple defined criteria. It is a step that requires significant human intervention through the initial creation of the rational framework, or domain ontology. The latter consists, first, in identifying all relevant concepts of the given application domain and, then, in organizing them into an ontology. Once such an application ontology is written, it can be applied to unstructured documents from a wide variety of sources, as long as these documents correspond to the given application domain. Because our approach is ontology-based, we argue that it is resilient to changes in source-document formats.

3.1 About the rationality

The essence of a cooperative information system is to achieve interoperation among distributed and heterogeneous information sources or agents. One way to do so is by providing a unique base of rationality in the form of an ontology. To support the sharing and reuse of formally represented knowledge among AI systems, it is indeed useful to define the common vocabulary in which shared knowledge is represented [Studer et al. 1998].

AI is fundamentally about rational agents. Roughly, according to the classical conception, AI is considered an enterprise (or an organization) devoted to the mechanization of rational thinking (a conception rooted in Turing's work [Turing 1947, 1948, 1950, 1954]). Hampton [1998] stated that the action having the highest expected value is the rational action. The notion of rationality [Michael 1994, 1998; Frank 1994; Horvitz et al. 1988] is considered by many authors and specialists to be of crucial importance in multicriteria decision-making (MCDM) processes; this is why we give a large place to this concept in our discussions. Rationality has empirical and testable content once we specify a utility function (relevance function) and a domain (ontology) to which this notion is applied. Rational decision making is the action of choosing among alternatives in a way that "properly" agrees with the preferences of the decision maker or those of a group making a joint decision [Doyle 1998].
The aim is to treat unanalyzed alternatives (actions, situations, documents) with respect to preferences that reflect the desirability of the alternatives and certain rationality criteria. For example, in the case of documentary task management, this desirability corresponds to the utility function of alternatives with respect to a certain relevance structure (preference structure). The main factors influencing decision-making rationality are explained in the work of Papadakis et al. [1998] and Rajagopalan et al. [1993]. These authors proposed three kinds of factors: internal factors, decisional factors and external factors. Internal factors can be controlled and even directed by the managers of the firm, which gives them the opportunity to design the decision process (or framework) needed for every decision. Decisional factors are those that characterize the decision and are related to its strategic relevance for the firm. Although the firm has no control over external factors, it can react to and even anticipate those factors by modifying the organizational contextual rationality. An ontology-based model offers precisely this opportunity to modify the contextual rationality. Other studies related to decision rationality can be found in [Eisenhardt and Bourgeois 1988] or [Dean and Sharfman 1993a, 1993b], for example; an interesting analysis of the influence of decision rationality on the results of firm processes, or their global performance, was published by Goll and Rasheed [1997].

Most MCDM problems lie within the scope of the following approaches [Bell et al. 1988, Roy 1990, Dias and Tsoukias 2003]: the descriptive approach, the prescriptive approach, the constructive approach and the normative approach. The normative approach, of interest in this work, consists in defining principles and rules that a group of persons could follow. This analysis is coherent and rational, in the sense that these well-specified rules constitute an axiom set with a precise logic and implications [Bell et al. 1988], a logic that the agents involved can in no case go against, unless specifically formulated. Roy [1990] has pointed out that the classic normative theory confers on these axioms the value of an unquestionable truth: they represent ideal rules that the DM must rationally follow. Axiomatic analysis allows the characterization of multicriteria procedures; the axiomatic characterization of a procedure is not unique. On this subject, Pirlot [1994], cited by Othmani [1998], speaks of normative axioms, which translate rules of rational behaviour, and descriptive axioms, which describe the way in which a procedure works. This analysis stimulates the production of new methods with well-defined fields of application [Pirlot 1994]. To help the analyst choose among existing multicriteria procedures, Arrow and Raynaud [1986] and Pasquier-Dorthe and Raynaud [1990] proposed building an axiom pool for a set of situations and a coherent axiom system that translates the basic assumptions of such situations. Then, a procedure fulfilling the axioms for a given situation is selected or algorithmically built.
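In the documentary context, this normative reading boils down to selecting the alternative that maximizes a utility (here, relevance) function. A schematic sketch, with invented relevance scores standing in for a real preference structure:

```python
def rational_choice(alternatives, utility):
    """Choose the alternative with the highest utility (relevance),
    following the normative reading of rational decision making."""
    return max(alternatives, key=utility)

# Alternatives are documents; utility is a relevance score computed
# against a hypothetical preference (relevance) structure.
relevance = {"doc_a": 0.42, "doc_b": 0.87, "doc_c": 0.15}
print(rational_choice(relevance.keys(), utility=relevance.get))  # doc_b
```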
In an ontology, definitions associate the names of entities in a universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text DRDC Valcartier TR 2004-265 15 describing what the names means, and formal axioms that constrain the interpretation [rationality] as well-formed use of these terms [concepts].” A body of formally represented knowledge is based on a conceptualization: the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them [Genesereth and Nilsson 1987]. A conceptualization is an abstract, simplified, and united view of the world that we wish to represent for some specific purpose. Every knowledge-based system or knowledge- level agent is committed to some conceptualization, explicitly and implicitly. Ontologies have received increasing interest in the computer science community and their benefits are recognized as they provide a foundation for the representation of domain knowledge. They explicitly encode a shared understanding of a domain that can be communicated between people and application programs. Gruber [1993] defines an ontology as « an explicit specification of a shared conceptualization ». In the literature, ontologies range from controlled vocabularies to highly expressive domain models [McGuinness, 2002]: integrated data dictionaries designed for human understanding, taxonomies organizing concepts of a domain into inheritance hierarchies, structured data models suitable for data management, and finally highly expressive computational ontologies. A controlled vocabulary is a finite set of terms with unambiguous definitions. Usually, if multiple terms are used to mean the same thing, a preferred term is identified and the other terms are listed as variants or synonyms. A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical structure. Taxonomies have been built manually in libraries for hundred of years. They structure a domain into categories/subcategories that can be used to organize a document collection. Relationships between terms in a taxonomy usually consist of “is-a” (generalization-specialization) relations, but there may be other types of relationships, e.g. membership, or metonymy (whole-part). A thesaurus is a networked collection of controlled vocabulary terms. Thesauri provide some semantics in their relations between terms (e.g. synonym relationship). However, the relations between terms in thesaurus hierarchy are implicit (interpreted as narrower-broader relations). Furthermore, thesauri contain associative relationships between terms that are most often expressed as “related to term”. Even if taxonomies, thesauri and ontologies have commonalities in their definitions, ontologies add more expressiveness in the specification of relationships between concepts. Formal ontologies use a representation language to specify properties and constraints of concepts that can be exploited for automated reasoning (inferencing). 3.2.2 Role of ontologies in information systems and knowledge management Ontologies can be exploited in a wide range of applications including, natural language processing, intelligent search engines, information retrieval, or as a means to 16 DRDC Valcartier TR 2004-265 facilitate semantic interoperability among heterogeneous knowledge sources at a high level of abstraction. 
3.2.2 Role of ontologies in information systems and knowledge management

Ontologies can be exploited in a wide range of applications, including natural language processing, intelligent search engines and information retrieval, or as a means to facilitate semantic interoperability among heterogeneous knowledge sources at a high level of abstraction. In particular, ontologies can be used for document indexing and annotation, information organization, and search and retrieval.

The use of an ontology or a taxonomy of terms has been identified as potentially useful for supporting information extraction from texts or automated document indexing. For example, controlled vocabularies (unlike keywords) can be exploited as document metadata (document indexing, semantic tagging). WordNet, a large, publicly available electronic lexical database [Fellbaum 1998], has been used to support information extraction or query formulation in different contexts. Ontological models can be utilized for categorizing documents by their contents, for example in [Labrou and Finin 1999], where Yahoo topics are used as descriptors. The taxonomy serves as a navigational as well as an organizational tool. Furthermore, relationships between concepts explicitly specified within an ontology can be exploited to enhance search and retrieval as well as automatic categorization tools.

Traditionally, the semantic analysis of a given domain starts with intellectual efforts of knowledge identification and acquisition, such as analyzing document indices that contain relevant concepts. Spyns et al. [2002] have proposed that well-known measures for significant collocations can be used to extract relevant relations between concepts from text; the general notion of relatedness that results from such statistical analysis is adequate as input for ontology design.
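As an illustration of the kind of collocation measure referred to above, the sketch below computes pointwise mutual information (one common choice among several) between word pairs in a toy corpus of our own invention; strongly associated pairs become candidate related concepts for ontology design.

import math
from collections import Counter
from itertools import combinations

# Sketch of collocation-based relation discovery: word pairs that co-occur
# in the same sentence more often than chance (high pointwise mutual
# information) are proposed as candidate related concepts. Toy corpus only.

sentences = [
    "the terrorist group claimed the attack",
    "the group used a chemical agent in the attack",
    "a nerve agent is a chemical agent",
]

word_counts = Counter()
pair_counts = Counter()
for s in sentences:
    words = set(s.split())
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

n = len(sentences)

def pmi(w1: str, w2: str) -> float:
    """PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_joint = pair_counts[frozenset((w1, w2))] / n
    return math.log(p_joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

# "chemical" and "agent" always co-occur, so their PMI is positive.
print(round(pmi("chemical", "agent"), 2))  # log(1.5) = 0.41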
Ontological engineering encompasses a set of activities that are conducted during the conceptualization, design, implementation and deployment of ontologies. A large range of topics and issues is covered, such as the basics (philosophical and metaphysical issues and knowledge representation formalisms), development methodology, knowledge sharing and reuse, knowledge management, business process modeling, commonsense knowledge, systematization of domain knowledge, information retrieval, interpretation and decision standardization. It also gives a design rationale for a knowledge base, allowing one to define the essential concepts of the world of interest for a more disciplined design of knowledge bases, and enables the gathering of knowledge about it [Guarino 1995, Gómez-Pérez and Benjamin 1999; Benjamin et al. 1998; Gómez-Pérez and Rojas-Amaya 1999; Gómez-Pérez et al. 1996; Levesque and Brachman 1985, Winkels et al. 2000, Gómez-Pérez 1995, Guarino and Giaretta 1995, Gruninger and Fox 1995, Fernandes et al. 1997].

An ontology-based approach, like a compression-based approach [Von Luxburg et al. 2002], offers a promising alternative for categorization with several potential advantages. Among these advantages, Mahalingam and Huhns [1997, 1998] have identified the following:

• With respect to the physical and functional structure of the ontology: i) Provision for value mapping: value mapping is a useful property for unstructured, text-based information spaces. Because the mapping process is a major problem in distributed and heterogeneous environments, this advantage represents a very desirable feature for such environments. ii) Suitability for graphical representation: as the ontology supports the information structure, the latter can easily be represented graphically as an Entity-Relationship diagram. Graphical representations are much easier for users to understand than textual representations. An ontology can also be used to eliminate the confusion and redundancy inherent in unstructured plain-text representations. In addition, in a graphical display, a user can form queries by simple mouse clicks, whereas in a textual representation the user is expected to type the query.

• With respect to the multicriteria paradigm: the ability to view the information space at various abstraction levels and to scale: the ontology can grow or shrink as necessary based on the context in which it is being used. Parts of the ontology can be hidden or made visible, so that a new view of the same information space can be generated efficiently and quickly to suit a certain audience, a common procedure in large databases. In addition, ontologies created by experts from a variety of fields can be merged to create a super-ontology.

Other advantages of ontologies are reported in the literature. For example, Studer et al. [1998] compared them to knowledge bases. They noted that ontologies are suitable for formal or machine representation, have a complete and explicitly described vocabulary, can be used as a full model of some domain, capture a common understanding of a domain (consensus knowledge), and are easy to share and reuse.

3.3 Exploitation of Ontologies for document processing

Extracting relevant descriptors from free-text electronic documents is a problem that requires the use of natural language processing techniques. Statistical analysis of documents consists of extracting a set of concepts or attributes that characterize the text content based on statistical parameters (e.g. the number of occurrences of words). Different statistical methods have been proposed in the domain (e.g. Latent Semantic Indexing). However, purely statistical methods may lead to text descriptors that do not really reflect the semantics of the processed documents. Whereas traditional information extraction systems are based on shallow natural language techniques and statistical algorithms, more recent approaches try to take into account the semantics incorporated in ontologies to obtain more precise results.

Ontologies can be exploited for unstructured document processing at different levels: for semantic annotation, for content-based indexing and retrieval, or for text classification and clustering. In this section, we present approaches where ontologies are exploited to provide enhanced knowledge management services, in particular for intelligent document processing: content-based indexing, semantic search, information integration within portals, and automatic classification or clustering.

3.3.1 Content-based indexing

Semantic tagging of unstructured information consists in identifying terms that are descriptive of a document and that can be used for the indexing and retrieval of that document. This process can benefit from the exploitation of ontological knowledge. When texts are analyzed, ontologies can be used for word sense disambiguation by utilizing known semantic relationships between concepts to boost the probability of a particular sense of a word in context. Furthermore, the analysis of surrounding words in a text adds semantics that should be taken into account. For example, the word "tank" when surrounded by words such as "military" and "vehicle" is more likely to be a fighting vehicle and less likely to be a container for holding fuel. This helps identify relevant indexes or metadata.
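The "tank" example can be approximated by a simple Lesk-style overlap between the document context and the terms an ontology associates with each candidate sense; the sense inventories below are hypothetical, chosen only to illustrate the idea:

# Minimal Lesk-style sketch of ontology-assisted word sense disambiguation:
# pick the sense whose associated ontology terms overlap most with the
# words surrounding the ambiguous word. Sense inventories are illustrative.

SENSES = {
    "tank/fighting_vehicle": {"military", "vehicle", "armoured", "gun", "crew"},
    "tank/container": {"fuel", "water", "storage", "gas", "litres"},
}

def disambiguate(context_words: set[str]) -> str:
    """Return the sense with the largest overlap with the context."""
    return max(SENSES, key=lambda s: len(SENSES[s] & context_words))

context = set("the military moved the tank a heavy armoured vehicle".split())
print(disambiguate(context))  # tank/fighting_vehicle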
3.3.2 Ontology-based search and retrieval

With the explosion of the Web and the increasing popularity of intranets and enterprise portals in organizations, the exploitation of ontologies has been considered for building semantic search engines that provide enhanced search results (better precision and recall) compared to keyword search. The reason is that standard search engines on the Web (e.g. Google) do not take into account the multiple ambiguous meanings of words. For example, a search for the word "chip" would retrieve information about both food and electronics. Thus, keyword search engines result in poor precision. On the other hand, they do not consider synonyms, acronyms or more specific terms when searching for a word, and thus they also result in poor recall. Moreover, they do not take into account the context of a search. For example, a search for the words "air" and "defence" should retrieve information in the military context by relating the two terms, and not return results in the legal context, as occurs when the two words are treated separately without context.

Ontology-based search engines exploit the inheritance hierarchy of concepts to look for more specific or more general terms. They also make use of lexical knowledge, such as the synonyms and acronyms contained in ontological models. In this way, they retrieve information that would have been missed by keyword search, and they exclude terms that are not in the context of a search by limiting the search to a subset of relevant topics. By doing so, they improve both the recall and the precision of the results. In [McGuinness 1998], D. McGuinness shows that formal description-logic-oriented ontologies can also improve search results under particular conditions.
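A minimal sketch of the hierarchy-based expansion just described (the hierarchy and synonym table are hypothetical): the query term is expanded with its synonyms and the more specific concepts below it before being handed to a keyword engine, which is how recall is improved.

# Sketch of ontology-based query expansion: a query term is expanded with
# its synonyms and all more specific concepts (transitive "is-a" children)
# before keyword retrieval. Data is illustrative.

IS_A_CHILDREN = {
    "weapon": ["firearm", "explosive"],
    "explosive": ["dynamite", "car bomb"],
    "firearm": ["rifle", "handgun"],
}
SYNONYMS = {"explosive": ["explosive device"]}

def expand(term: str) -> set[str]:
    """Return the term plus its synonyms and all transitive sub-concepts."""
    expanded = {term} | set(SYNONYMS.get(term, []))
    for child in IS_A_CHILDREN.get(term, []):
        expanded |= expand(child)
    return expanded

print(sorted(expand("explosive")))
# ['car bomb', 'dynamite', 'explosive', 'explosive device']

Precision is improved by the converse move: restricting the search to terms that fall under the topics selected as the context of the query.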
3.3.3 Ontologies in enterprise portals

There is increasing interest in the exploitation of ontologies to structure information sources and integrate heterogeneous sources in enterprise portals (KAON, SEAL and OntoKnowledge are representative initiatives). In this context, ontologies are considered a key component of knowledge management systems and knowledge portals. Maedche et al. [2001], as part of the KAON project, have proposed an ontology management infrastructure for semantics-driven applications based on the RDF Schema formalism. Within the Semantic Portal project [Hotho 2001a], they proposed a framework for a semantic portal relying on an ontology basis for semantic integration, web site management and presentation. On-To-Knowledge is another initiative to build an ontology-based environment enabling Semantic Web knowledge management [Fensel et al. 2002]. It exploits newly proposed ontology web formalisms, such as RDF Schema and DAML+OIL, to facilitate access to heterogeneous information from an intranet and the World Wide Web. As part of the COP21 TD, Gauvin et al. [2002] propose the concept of a situational awareness knowledge portal that exploits military ontologies to organize, filter and search for information within a user's portfolios. The Carnot project addresses the problem of logically unifying physically distributed, enterprise-wide, heterogeneous information. The Model Integration Software Tool (MIST) project (www.cse.ogi.edu/DISC/projects/mist), developed as part of this project, is a graphical user interface that assists a user in the integration of different databases via a common ontology that serves as an enterprise model. The Cyc knowledge server (www.cyc.com) is a very large, multi-contextual knowledge base and inference engine developed by Cycorp. Cyc is intended to provide a "deep" layer of understanding that can be used by other programs to make them more flexible.

3.3.4 Ontology-based document categorization and clustering

The objective of automatic document classification is to automatically organize documents into categories that are meaningful to users. For this purpose, predefined domain taxonomies may be used as categories. An alternative approach is to partition the set of documents into clusters that are generated through an unsupervised process according to similarities between documents.

In the automatic document categorization/diagnosis task, the limitations identified in the normative approach [Pirlot 1994] become strengths, since rationality is imposed by an ontology built on the organization's beliefs. Indeed, the axioms, even those that seem intelligible, appealing or evident, i.e., those that reflect common sense in line with the organization's philosophy, must be accepted by all the agents of the organization without critical examination. When the decision framework is well specified, as is the case with a domain ontology, the agents of the system can consistently distinguish relevant from irrelevant information through well-defined rules during document processing.

From a user's perspective, a good categorization/clustering tool is as important as a good search engine, because the user browses through categories in order to retrieve or discover relevant information. Thus, different approaches are proposed in the literature, and commercial tools using various methods are now available on the market. Some relevant aspects of commercial tools and research proposals are presented below.

The market of commercial tools dedicated to unstructured information management (indexing, search and retrieval, filtering/categorization and clustering) is particularly dynamic. Many vendors offer categorization solutions (as reported in [Delphi 2002, Letson 2001, Adams 2001]). They use different methods, sometimes in combination: pattern matching and statistical algorithms, machine learning (using Bayesian probability, Support Vector Machines, or neural networks), rule-based approaches, and linguistic-semantic approaches. Among them, Autonomy uses pattern matching based on Bayesian theory, Stratify uses multiple classifiers, and Semio proposes a hybrid solution combining linguistic algorithms and statistical clustering techniques.

Tools using a semantic approach are worth mentioning. Convera's RetrievalWare technology includes specific semantic networks that support concept-based categorization and search (using a concept-based search engine). The Applied Semantics approach is based on the utilization of a vast ontology composed of half a million tokens (individual words), two million terms (sequences of words), half a million concepts (or meanings), and relationships between these concepts (similar to those represented in the lexical semantic network WordNet). The ontology drives the processing of texts at a semantic level and provides a foundation for document categorization, meta-tagging, and summarization. In particular, concepts and semantic relationships from the ontology are used during word sense disambiguation in order to identify globally relevant concepts. This tool uses a unique approach in that it relies solely on semantic analysis and does not make use of machine learning techniques.
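As a minimal illustration of the machine-learning family of methods mentioned above (our own sketch, not any vendor's actual implementation), the following trains a multinomial naive Bayes classifier on a few hypothetical labelled snippets using scikit-learn:

# Sketch of Bayesian text categorization: a multinomial naive Bayes
# classifier estimates P(category | words) from labelled examples.
# Training snippets and category labels are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "bomb attack against the embassy",
    "group claimed responsibility for the bombing",
    "funds transferred through front companies",
    "accounts frozen to cut terrorist financing",
]
train_labels = ["terrorist act", "terrorist act",
                "financial assets", "financial assets"]

# Bag-of-words features feeding a naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["suicide bombing at the airport"]))  # likely ['terrorist act']

In practice such statistical classifiers need far more training data than this; the approaches below try to compensate by injecting ontological background knowledge.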
In a particular domain of interest, ontologies can give additional power by providing background knowledge. There have been some attempts to combine ontologies or background knowledge with automatic classification or clustering algorithms; these are described below.

In [Iwasume 1996], the authors propose IICA (Intelligent Information Collector and Analyzer), a system for gathering and categorizing information from resources on the Web that classifies documents by combining a keyword vector model with an ontology. It creates an initial classification of documents from the ontology and repeats a cycle of computing a representative characteristic vector for each category and classifying documents using the characteristic vectors until convergence is achieved.

Hotho et al. [2001b] exploit ontologies for text clustering. They utilize background knowledge, in the form of a hierarchy of concepts, for document preprocessing in order to generate different clustering views onto a set of documents. Their approach, COSA (Concept Selection and Aggregation), uses natural language processing techniques to build concept vectors in place of the standard term vectors that simply constitute "bags of words" representing the presence/occurrence of words in a text. The objective is to restrict the set of relevant document features and to automatically propose good aggregations by exploiting the concept hierarchy.

According to Tsuhan Chen (www.ece.cmu.edu/~tsuhan), classification techniques should take the user's context into account; his techniques incorporate user-specific context into the categorization process and exploit both lexical (WordNet) and contextual knowledge through an ontology. Ontology-based algorithms may also exploit the notion of strength of the relationships between concepts, i.e. concepts directly linked by an "is-a" relationship are strongly related compared to those that are indirectly linked.

3.4 Combining statistics and semantics for document categorization

3.4.1 Approach

Ontologies or taxonomies can be exploited as a support for automatic document categorization. On the one hand, they organize concepts in a hierarchical structure that can be utilized for categorization (e.g., the categories of Yahoo). On the other hand, they provide the semantics of a domain that can be exploited to improve traditional classification methods based on statistics. For example, as mentioned above, WordNet, a large, publicly available electronic lexical database, may be used to support document categorization. Natural language processing techniques supported by a domain ontology can be used to extract semantic meaning from unstructured text and provide semantic indices from the ontology. Recently, some preliminary experiments have been conducted to combine statistics and semantics for information extraction [Termier et al. 2001, Faure and Poibeau 2000] using different methods.

An ontology-based semantic analysis consists in analyzing, from a semantic perspective, the candidate concepts resulting from the statistical analysis by exploiting a domain ontology, in order to restrict the document descriptors to the attributes that semantically characterize the text, for example by removing poorly meaningful words, or by replacing terms that are semantically closely related by a concept that represents them. At each level of the ontology structure, specific semantic expressions are attached to concepts to guide the semantic processing.
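A minimal sketch, under our own assumptions, of the term-to-concept replacement step just described: statistically counted terms are mapped to the taxonomy concept that represents them, and poorly meaningful words are dropped, so that the document descriptor is expressed in ontology concepts rather than raw words (in the spirit of COSA's concept vectors). The mapping table and stopword list are hypothetical.

from collections import Counter

# Sketch of combining statistics and semantics: raw term frequencies are
# computed first, then terms are replaced by the taxonomy concept that
# represents them, and poorly meaningful words are discarded.
# Taxonomy fragment and stopword list are illustrative.

TERM_TO_CONCEPT = {
    "sarin": "nerve agent",
    "vx": "nerve agent",
    "tabun": "nerve agent",
    "rifle": "firearm",
}
STOPWORDS = {"the", "a", "and", "of"}

def concept_vector(text: str) -> Counter:
    """Drop stopwords, map each remaining term to its concept, and count."""
    terms = [t for t in text.lower().split() if t not in STOPWORDS]
    return Counter(TERM_TO_CONCEPT.get(t, t) for t in terms)

print(concept_vector("the sarin and vx attack"))
# Counter({'nerve agent': 2, 'attack': 1})

Two documents mentioning sarin and VX respectively thus share the descriptor "nerve agent", which a purely statistical term vector would miss.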
3.4.2 Operationalization

To demonstrate our approach and techniques, we have restricted our experiment to a specific domain: terrorism. In this context, we have chosen a document corpus from various open sources about terrorist events and built a baseline ontology about terrorism that organizes the concepts of the domain in a hierarchy (taxonomy). Figure 2 illustrates the baseline ontology we built, for the purpose of our experiment, from resources about terrorism found on the Web (e.g. [NATO, 2002]). This taxonomy aims at encapsulating important terms from the terrorism domain, organized into categories, from general terms to more specific terms. At the first level, it contains terms such as terrorist organizations, countries (or national hosts), terrorist acts (activities), tactics, weapons, financial assets, etc. Level 2 refines the concepts of level 1 by providing more specific concepts in the hierarchy, most of them being linked by an "is-a" relationship. At a certain level in the hierarchy of concepts, terms in a category are considered to be similar from a categorization perspective because they are sub-concepts of that category (e.g. nerve agent and blister agent are two types of chemical agents in the Weapon of Mass Destruction category).
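The sketch below illustrates this categorization rule on a tiny hypothetical fragment of such a taxonomy (not the actual ADAC ontology): two terms are treated as similar when they are sub-concepts of the same category.

# Sketch of the rule described above: two terms are similar for
# categorization purposes when they fall under the same parent category.
# This taxonomy fragment is a hypothetical illustration.

PARENT = {
    "nerve agent": "chemical agent",
    "blister agent": "chemical agent",
    "chemical agent": "weapon of mass destruction",
    "dirty bomb": "radiological weapon",
    "radiological weapon": "weapon of mass destruction",
}

def ancestors(term: str) -> list[str]:
    """Chain of categories from the term up to the taxonomy root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def similar(t1: str, t2: str) -> bool:
    """Similar if the terms share their immediate parent category."""
    return PARENT.get(t1) is not None and PARENT.get(t1) == PARENT.get(t2)

print(similar("nerve agent", "blister agent"))  # True
print(similar("nerve agent", "dirty bomb"))     # False
print(ancestors("nerve agent"))
# ['chemical agent', 'weapon of mass destruction']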