ADAC Automatic Document Analyzer and Classifier

A. Guitouni A.-C. Boury-Brisset DRDC Valcartier

L. Belfares K. Tiliki Université Laval

C. Poirier Intellaxiom Inc.

Defence R&D Canada – Valcartier Technical Report DRDC Valcartier TR 2004-265 October 2006

Author

A. Guitouni, A.-C. Boury-Brisset, L. Belfares, K. Tiliki and C. Poirier

Approved by

Dr. E. Bossé Section Head / Decision Support System Section

Approved for release by

G. Bérubé Chief Scientist

© Her Majesty the Queen as represented by the Minister of National Defence, 2006 © Sa majesté la reine, représentée par le ministre de la Défense nationale, 2006

Abstract

Military organizations have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mail, electronic documents, etc.). These documents have to be screened, analyzed and categorized in order to interpret their contents and gain situation awareness. They should be categorized according to their contents to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process, which is currently partly manual. Integrating recently acquired knowledge from different fields into a system for analyzing, diagnosing, filtering, classifying and clustering documents with limited human intervention would markedly improve the quality of information management while reducing the human resources required. Better categorization and management of information would facilitate the correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes. DRDC Valcartier's ADAC system (Automatic Document Analyzer and Classifier) incorporates several techniques and tools for document summarization and semantic analysis based on an ontology of a given domain (e.g. terrorism), together with algorithms for diagnosis, classification and clustering. In this document, we describe the architecture of the system and the techniques and tools used at each step of document processing. For the first prototype implementation, we focused on the terrorism domain to develop the document corpus and related ontology.

Résumé

Military organizations face a notable increase in the number of documents coming from different sources in various formats (paper, fax, e-mail, electronic documents, etc.). These documents must be screened, analyzed and categorized in order to interpret their content and understand the situation. They must therefore be categorized by content for better archiving and more efficient later retrieval. In this context, advanced techniques and tools will have to be developed to support and carry out this information management process, which is currently performed essentially by hand. Integrating new knowledge from different fields into a single document management system covering the analysis, diagnosis, filtering, classification and clustering of documents should improve its efficiency considerably, with a minimum of human intervention. Better management should facilitate the integration of information from various sources, eliminate redundancy, improve access to relevant information and thus, ultimately, provide better support to the decision-making process. The ADAC (Automatic Document Analyzer and Classifier) system designed at DRDC Valcartier incorporates various techniques and tools for summarization and semantic analysis based on the ontology of a particular domain (e.g. terrorism), together with algorithms for diagnosis, classification and clustering of documents. In this report, we describe the architecture of the system and the techniques and tools used at each step of document processing. For the prototype implementation, the focus was placed on the terrorism domain to develop an ontology and a suitable document collection.

DRDC Valcartier TR 2004-265 i



Executive summary

Military organizations, in particular intelligence or command centres, have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mail messages, electronic documents, etc.). These documents must be analyzed in order to interpret their contents and gain situation awareness. They should be categorized according to their content to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process, which is currently partly manual.

Automatic, intelligent processing of documents is at the intersection of many fields of research, especially linguistics and artificial intelligence, including natural language processing, pattern recognition, semantic analysis and ontology. Integrating the recently acquired knowledge in these fields into a system for analyzing, diagnosing, filtering, classifying and clustering documents with limited human intervention would markedly improve the quality of information management while reducing the human resources required. Better categorization and management of information would facilitate the correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes.

This is the purpose of the work we have undertaken at DRDC Valcartier as part of the Common Operational Picture for the 21st Century Technology Demonstration project. The ADAC system (Automatic Document Analyzer and Classifier) incorporates several techniques and tools for document summarization and semantic analysis based on the ontology of a given domain (e.g. terrorism), together with algorithms for diagnosis, classification and clustering. A document is processed through the following steps: i) Summarization: large documents are summarized to provide a synthesized view of their content; ii) Statistical and semantic analysis: the document is indexed by identifying the attributes that best characterize it; both statistical analysis and semantic processing exploiting the domain ontology are carried out at this stage; iii) Diagnosis: relevant documents matching criteria provided by the user (e.g. documents on a particular subject) are intercepted in order to execute an appropriate action (e.g. an alert); iv) Filtering/classification: the document is classified/categorized into predefined hierarchical classes; and v) Clustering: the document is assigned to the most similar group of previously processed documents. External actions can then be triggered on specific classes of documents (e.g. alerts, visualization and data mining). Using a launching agent, ADAC periodically checks for the presence of new documents and processes them. The diagnosis and filtering/classification tests may also be run on previously analyzed documents if new directives require it.

In this report, we describe the architecture of the system and the techniques and tools used at each step of the document processing. For the first prototype implementation, we have chosen to focus our document corpus and related ontology on the terrorism domain.

Guitouni, A., Boury-Brisset, A.-C., Belfares, L., Tiliki, K., Poirier, C., 2006. ADAC: Automatic Document Analyzer and Classifier, DRDC Valcartier TR 2004-265, Defence R&D Canada.


Sommaire

Military organizations, in particular intelligence cells and command centres, must process an ever-growing amount of information coming from different sources in various formats (paper, fax, e-mail, electronic documents, etc.). These documents must be screened and analyzed in order to interpret their content for better situation management. They must therefore be categorized by subject to allow, on the one hand, efficient archiving and, on the other hand, easier later retrieval. In this context, advanced techniques and tools will have to be developed to support and carry out this information management process, which at present is performed essentially by hand.

Automatic document understanding is a multidisciplinary research area touching in particular computational linguistics and artificial intelligence, notably natural language processing, pattern recognition, and semantic and ontological analysis. Integrating into a single system the results of recent research from different fields of knowledge related to document management, covering in particular the analysis, diagnosis, filtering and classification of documents, should improve its efficiency considerably, with a minimum of human intervention. Better categorization and adequate management of information should facilitate the aggregation of information from various sources, eliminate redundancy, improve access to relevant information and thus provide better support to the decision-making process.

This is the objective of the work we have undertaken at DRDC Valcartier as part of the COP 21 Technology Demonstration project. The ADAC (Automatic Document Analyzer and Classifier) system incorporates various techniques and tools for text summarization and semantic analysis based on the ontology of a particular domain (e.g. terrorism), combined with algorithms for diagnosis, classification and clustering of documents. A document is processed in ADAC through the following steps: i) Summarization: large documents are summarized to produce a synthesis; ii) Statistical and semantic analysis: the document is indexed by identifying the attributes that best characterize it; to this end, both statistical and semantic processing (exploiting the ontology) is carried out; iii) Diagnosis: the document is intercepted if it meets selection criteria provided by the user (e.g. a document on a particular subject) and an associated action is triggered (e.g. an alert); iv) Classification: the document is categorized into predefined hierarchical classes (according to a domain taxonomy); and v) Clustering: the document is assigned to the semantically closest group among the groups formed by the documents already processed.

In this report, we describe the architecture of the system and the techniques and tools used at each step of document processing. For the prototype implementation, the focus was placed on the terrorism domain to develop an ontology and an associated document collection.

Guitouni, A., Boury-Brisset, A.-C., Belfares, L., Tiliki, K., Poirier, C., 2006. ADAC: Automatic Document Analyzer and Classifier, DRDC Valcartier TR 2004-265, Defence R&D Canada.

Table of contents

Abstract / Résumé...... i

Executive summary ...... iii

Sommaire...... iv

Table of contents ...... v

List of figures ...... viii

Acknowledgements ...... x

1. Introduction ...... 1

2. Automated Document Processing ...... 4
2.1 Introduction ...... 4
2.2 Information Retrieval ...... 5
2.3 Document classification (or document categorization) ...... 7
2.3.1 The Expert System approach ...... 7
2.3.2 The Machine Learning approach ...... 7
2.3.2.1 Document representation and preprocessing ...... 8
2.3.2.2 Classification methods ...... 9
2.3.3 Application to document categorization ...... 11
2.4 Document clustering ...... 11
2.5 Commercial solutions ...... 13
2.6 Awaited solutions ...... 13

3. The ontology-based document processing approach ...... 14
3.1 About the rationality ...... 14
3.2 Ontologies: definitions and roles ...... 15
3.2.1 From controlled vocabulary to ontologies ...... 15
3.2.2 Role of ontologies in information systems and knowledge management ...... 16
3.3 Exploitation of Ontologies for document processing ...... 18


3.3.1 Content-based indexing ...... 19
3.3.2 Ontology-based search and retrieval ...... 19
3.3.3 Ontologies in enterprise portals ...... 19
3.3.4 Ontology-based document categorization and clustering ...... 20
3.4 Combining statistics and semantics for document categorization ...... 22
3.4.1 Approach ...... 22
3.4.2 Operationalization ...... 22

4. The processing algorithms ...... 25
4.1 Representation of the document's DNA ...... 25
4.2 The diagnostic module ...... 26
4.3 Classification/Filtering ...... 27
4.4 The Clustering module ...... 32
4.4.1 Problem formulation ...... 33
4.4.2 Clustering using genetic algorithms ...... 35
4.4.3 Clustering using variable neighborhood search method ...... 36

5. Empirical tests ...... 40
5.1 Metrics for performance assessment ...... 40
5.2 The simulation tool: TestBench ...... 42
5.3 Tests results ...... 44
5.3.1 Filtering/Categorization algorithm ...... 44
5.3.2 Clustering algorithms ...... 44

6. ADAC prototype functional architecture ...... 45

7. Conclusions ...... 50

References ...... 51

Annex A: Concepts of Weights in the Ontology...... 61

Annex B: Clustering using Genetic Algorithms...... 67

Annex C: Similarity Index Computation...... 74

Annex D: Non-parametric approaches ...... 79


Annex E: COTS Product Evaluation...... 97

List of symbols/abbreviations/acronyms/initialisms ...... 131

Distribution list...... 132


List of figures

Figure 1. ADAC’s document processing...... 2

Figure 2. Hierarchy of concepts in the terrorism ontology...... 23

Figure 3. Document DNA ...... 25

Figure 4. Classification process...... 28

Figure 5. Example of a 3-level hierarchy ...... 29

Figure 6. TestBench’s interface for the categorization simulations ...... 42

Figure 7. TestBench’s interface for the genetic clustering algorithm...... 43

Figure 8. TestBench’s interface for the VNS clustering algorithm ...... 43

Figure 9. ADAC retained configuration ...... 45

Figure 10. ADAC processing and analyzing scenario...... 45

Figure 11. ADAC recovery agent...... 46

Figure 12. ADAC implementation concept...... 46

Figure 13. ADAC architecture...... 47

Figure 14. ADAC Interfaces (example) ...... 48

Figure 15. Other ADAC configurations ...... 49

Figure 16. Diagram for concordance index measurement (DAC-01-C-P)...... 80

Figure 17. Diagram for discordance index measurement (DAC-01-C-P) ...... 84

Figure 18. Admissibility for comparison for 0% overlapping ......

Figure 19. Admissibility for comparison in 100% overlapping ...... 88

Figure 20. Discordance index measurement (CAD-02-I-I)...... 91

Figure 21. Diagram for concordance index calculation in case 1 (CAD-04-C-I)...... 93

Figure 22. Diagram for concordance index calculation in case 2 (CAD-04-C-I)...... 94

Figure 23. Autonomy’s IDOL architecture and technical components...... 99


Figure 24. Stratify discovery system architecture ...... 111

Figure 25. The Stratify classification process...... 113

Figure 26. RetrievalWare Searching Process ...... 118

Figure 27. RetrievalWare Architecture...... 119

Figure 28. Term "bears witness" (Applied Semantics)...... 123

Figure 29. Applied Semantics Concept Server: implementation architecture...... 127

List of tables

Table 1. Factors of severity/admissibility values according to the DM Attitude ...... 83

Table 2. Partial discordance variations D_j^{μ_h}(d, p_i) (CAD-04-C-I) ...... 95

Table 3. Autonomy Architecture...... 98

Table 4. RetrievalWare Searching Process...... 117


Acknowledgements

The authors would like to thank the Common Operational Picture 21 Project Team for their constructive ideas.


1. Introduction

In May 1999, the new National Defence Command Centre (NDCC) was commissioned. The mission of the NDCC is to provide a secure, 24/7 command and control facility through which Command staff can plan, mount and direct operations and training activities at the strategic level. Since September 11, 2001, message traffic at the NDCC has reached unprecedented peaks. The Centre's operators are overloaded with information that they must handle in real time. The information "digested" by the NDCC represents vital stakes for several other users.

Military organizations and particularly intelligence or command centres have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mails and electronic documents). These documents must be analyzed in order to interpret their contents and gain situation awareness. These documents should be diagnosed and categorized according to their content to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process that is currently partly manual.

Automatic, intelligent processing of documents is at the intersection of many fields of research, especially linguistics and artificial intelligence, including natural language processing, pattern recognition, semantic analysis and ontology. Natural language understanding has been a major research domain for decades. Integrating the recently acquired knowledge in these fields into a system for analyzing, diagnosing, filtering, classifying and clustering documents with limited human intervention would markedly improve the quality of information management while reducing the human resources required. Better categorization and management of information would facilitate the correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes.

The ADAC system (Automatic Document Analyzer and Classifier) has been developed at DRDC Valcartier as a concept demonstrator and a test bed. The objective of the targeted system is to provide an environment in which documents of various types and formats can be automatically processed, with minimal human intervention, from document summarization and statistical/semantic analysis for content extraction through to diagnosis and classification. Some document processing modules may eventually trigger external actions. Consequently, this environment incorporates several techniques and tools for document summarization, semantic analysis based on an ontology of a given domain (e.g. terrorism), and algorithms for automated diagnosis, classification and clustering.

ADAC is composed of a set of agents, each responsible for a specific document-processing module. ADAC's launching agents automatically intercept any new document, which is then processed through the following steps:


• Summarization: provide a synthesized view of the document’s contents;

• Statistical and semantic analysis: index the document by identifying the attributes that best characterize it. Both statistical analysis and semantic processing exploiting domain ontology are carried out at this stage. This produces the document DNA;

• Diagnosis: intercept relevant documents matching criteria provided by the user (e.g. documents on a particular subject) in order to apply an appropriate action (e.g. an alert);

• Filtering/classification: classify/categorize the document in predefined hierarchical classes;

• Clustering: assign the document to the most similar group of previously processed documents.

External actions can then be triggered on specific classes of documents (e.g. alerts, visualization and data mining).
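As a toy illustration only, the five steps above can be sketched as a chain of functions. All names and the deliberately naive logic below are assumptions for illustration, not ADAC's actual implementation:

```python
# Toy sketch of ADAC's per-document chain (steps i-v above); all names
# and the naive logic are illustrative assumptions, not ADAC code.

def summarize(text, n=25):
    # i) Summarization: keep the first n words as a crude synthesis
    return " ".join(text.split()[:n])

def analyze(text, ontology):
    # ii) Statistical/semantic analysis: count ontology concepts -> "DNA"
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    return {c: words.count(c) for c in ontology if c in words}

def diagnose(dna, criteria):
    # iii) Diagnosis: intercept documents matching user-provided criteria
    return any(c in dna for c in criteria)

def classify(dna, classes):
    # iv) Filtering/classification: class with most keyword overlap
    return max(classes, key=lambda c: len(classes[c] & dna.keys()))

def cluster(dna, groups):
    # v) Clustering: most similar group of previously processed documents
    return max(groups, key=lambda g: len(groups[g] & dna.keys()))

ontology = {"attack", "bomb", "airport", "election"}
doc = "Police report a bomb threat at the airport; the attack was averted."
dna = analyze(summarize(doc), ontology)
print(diagnose(dna, {"bomb"}))                       # True -> trigger alert
print(classify(dna, {"terrorism": {"bomb", "attack"},
                     "politics": {"election"}}))     # terrorism
print(cluster(dna, {"G1": {"election", "vote"},
                    "G2": {"bomb", "airport"}}))     # G2
```

In the real system each step is, of course, far richer (summarization tools, ontology-based semantic analysis, and the diagnosis, classification and clustering algorithms described later in the report).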

[Figure 1 (diagram): electronic documents, images and sound files enter the system through OCR and speech-to-text (StoT) converters; the Summarizer, statistical tools and ontology-based semantic tools produce the document's TEXT DNA and descriptive attributes, which feed the Diagnostic (Observer), Filtering/Classification and Clustering functions, supported by a document management system; external actions include alerts, visualization and data mining.]

Figure 1. ADAC’s document processing

This work has been carried out under the Common Operational Picture for the 21st Century Technology Demonstration Project. It was motivated by face-to-face interviews with NDCC operators, which allowed the DRDC team to capture requirements for a third-generation search and retrieval engine. This report documents the main contributions of this work. It is organized in two parts: the first part introduces the problem of automatic document processing and categorization and presents the main theoretical


foundations of this work. In the second part, the approaches and algorithms used at the different steps of document processing are presented. A description of the implementation of the first ADAC prototype and preliminary results are provided to illustrate this study. Finally, our conclusions are presented, together with a brief discussion of ongoing work and future development ideas.


2. Automated Document Processing

2.1 Introduction

The increasing amount of digital information exchanged among people, and the resulting information overload that workers face, have accentuated the need for more innovative knowledge management tools dedicated to information processing, for example text summarization, text extraction, text retrieval or text classification/categorization.

Information comes in many forms, and can be either structured (relational databases, tagged messages) or unstructured (electronic documents, Web pages, e-mails, etc.). While structured information is well handled by database management systems, the management of unstructured information and documents needs further research to facilitate both the structuring and organization of information and its exploitation (e.g. effectively retrieving relevant information).

For several years, US-sponsored conferences such as MUC1 (Message Understanding Conference) and TREC2 (Text Retrieval Conference), devoted to automatic text processing, have contributed largely to significant advances in the domain.

In the wide research area of automatic document processing, one can distinguish three important fields: information extraction, information retrieval and text classification or categorization - see for example Salton [1989a, b], Maybury [1993], and Mani and Maybury [1999] - for a tutorial on the subject. A clarification has to be made at this stage in order to define what is meant by these terms.

• Information extraction3 is the process of extracting information from text in order to identify specific semantic elements within a text (e.g. entities, properties, relations) that populate a template. The goal is to understand the document semantics in order to extract relevant content by means of Natural Language Processing (NLP) techniques. The process takes place in several stages: tokenisation, morphological and lexical analysis, etc. Information extraction is not information retrieval: information extraction differs from traditional retrieval techniques in that it does not recover from a collection a subset of documents that are hopefully relevant to a query, based on key-word searching (perhaps augmented by a thesaurus). Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about predefined types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyze the data for trends, to give a natural language summary, or simply to serve for on-line access.

1 See at www.itl.nist.gov/iaui/894.02/related_projects/muc/ 2 See at trec.nist.gov/ 3 http://www.dcs.shef.ac.uk/research/groups/nlp/extraction/


• Information retrieval consists in retrieving documents that best match a user query. Usually, documents are indexed by word occurrences to facilitate the process.

• Document categorization consists in assigning documents to predefined categories. Document clustering is the process of detecting topics within a document collection, assigning documents to those topics and labelling the resulting topic clusters.
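The template-filling idea behind information extraction can be illustrated, very schematically, with regular expressions; the text, fields and patterns below are invented for illustration and are far simpler than real NLP pipelines:

```python
import re

# Toy "template filling": pull predefined event fields out of free text.
# Fields and patterns are illustrative assumptions, not a real extractor.
text = "A car bomb exploded in Madrid on 11 March 2004."
template = {
    "event": r"(bomb|attack|kidnapping)",
    "location": r"in ([A-Z][a-z]+)",
    "date": r"on (\d{1,2} [A-Z][a-z]+ \d{4})",
}
# each field keeps the first match of its pattern, or None if absent
record = {field: (m.group(1) if (m := re.search(pat, text)) else None)
          for field, pat in template.items()}
print(record)  # {'event': 'bomb', 'location': 'Madrid', 'date': '11 March 2004'}
```

A real extraction system would replace each regular expression with the tokenisation, morphological and lexical analysis stages mentioned above.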

Information management of large amounts of documents requires addressing two inter-related problems: the problem of information classification from heterogeneous information sources, and the problem of effective and efficient access to relevant information (i.e. information retrieval). The problem of text classification is less complex than that of full text understanding, because it consists in extracting the most relevant concepts from the document rather than interpreting the text (as is done, for example, in text summarization).

In large organizations that do not have content-based management tools such as automatic document categorization tools, electronic documents are scanned manually by information managers for content assessment and then classified in predefined folders. Even if people are better than machines at understanding the meaning of documents, this manual process is labor-intensive and expensive to maintain. Furthermore, it is subject to inconsistency, because different people cannot classify large volumes of information in a uniform way. Folder categories are not totally disjoint, and documents may end up duplicated in different categories.

Automatic document classification and clustering techniques aim at providing solutions for organizing tremendous numbers of text documents by topic based on their contents. Classification within document repositories differs from relational database management, where data are organized as attribute-value pairs.

In the following sections, after introducing the domain of information retrieval, which is relevant for classification purposes, we present the approaches proposed in the literature to address the problems of document classification and document clustering.

2.2 Information Retrieval

The main objective of information retrieval (IR) is to find desired information in a collection of textual documents. This relatively old field (more than 30 years) is centred on document access problems in response to various types of queries. One of the basic tasks addressed in this field consists in providing a user with a list of relevant documents in response to a previously formulated query. Traditionally, the basic object in IR is a text (or a portion of text) represented by a term vector. More recently, a representation in the form of word groups was proposed.

The traditional IR task consists in matching a query against a document collection and returning the relevant documents to the user. The success of such systems depends partly on the


quality and quantity of information associated with the query. Indeed, with a greater quantity of information defining document relevance, systems can use more advanced techniques to separate relevant from non-relevant documents.

Most IR systems rely on a statistical approach rather than on methods from computational linguistics (Natural Language Processing). Several reasons have been advanced to explain this state of affairs, which may at first appear counter-intuitive, since linguistic knowledge would seem to be a prerequisite for developing an intelligent text retrieval system [Amini 2001].

Classically, an IR system is composed of two main components [Amini, 2001]:

1. An indexing process, which leads to a representation of the documents, the queries and the class representatives (prototypes). Documents and queries are described as vectors in the same semantic vector space, that of the ontology's concepts, which is structured as a hierarchy; this space has the dimension of the ontology. We denote by d⃗, q⃗ and p⃗ the vector descriptions of a document d, a query q and a class prototype p, respectively.

2. A similarity measure between each document and each query, and between each document and the class prototypes. The most traditional method consists in computing the cosine of the angle between d⃗ and q⃗, or between d⃗ and p⃗.

Besançon [2002] presents in his thesis an interesting overview of the various methods used in the literature to compute this similarity. The documents are then ranked on the relevance scale according to their similarity measure.
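The cosine measure between a document vector and a query (or prototype) vector can be sketched in a few lines; the toy term-count vectors below are illustrative:

```python
import math

def cosine(d, q):
    # cosine of the angle between two sparse term-count vectors
    dot = sum(d[t] * q.get(t, 0) for t in d)
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

d = {"bomb": 2, "airport": 1, "threat": 1}   # document vector
q = {"bomb": 1, "threat": 1}                 # query (or prototype) vector
print(round(cosine(d, q), 3))  # 0.866
```

Because the measure normalizes by vector length, a long document is not favoured over a short one merely for repeating terms.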

The traditional retrieval strategies used in IR are based on Boolean, vector-space and probabilistic models. These models take their names from the three possible document representations.

The Boolean models are characterized by a representation of documents based on the presence or absence of terms in the document. The majority of rule-based systems use a Boolean approach [Apte et al. 1994, Cohen 1996]. Some disadvantages are discussed in [Hull 1994].
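A minimal sketch of this representation, where each document reduces to the set of terms it contains and a query is a Boolean condition (document IDs and terms invented):

```python
# Boolean retrieval sketch: documents as term-presence sets.
docs = {
    1: {"bomb", "airport", "threat"},
    2: {"election", "vote"},
    3: {"bomb", "election"},
}

def boolean_and(query_terms, docs):
    # documents containing ALL query terms (a conjunctive Boolean query)
    return {doc_id for doc_id, terms in docs.items() if query_terms <= terms}

print(sorted(boolean_and({"bomb", "election"}, docs)))  # [3]
```

Note that, unlike the vector-space models below, this representation gives no ranking: a document either satisfies the query or it does not, which is one of the disadvantages discussed in the literature cited above.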

The vector-space models encompass a great number of search methods. The query and the documents are indexed in two stages. First, relevant terms are extracted from the query4 q and/or from the document d. Then each term is assigned a weight reflecting its importance. A score is generated by a similarity function applied to the query and document representations. Salton and Buckley [1991] have tested several vector-space search models.
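One standard, automatic way of producing such term weights is TF-IDF (term frequency × inverse document frequency). It is offered here as a common vector-space example, not necessarily the weighting used in the systems cited:

```python
import math

def tfidf(term, doc, corpus):
    # weight grows with frequency in the document and rarity in the corpus
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["bomb", "threat", "airport"],
          ["election", "vote", "threat"],
          ["bomb", "attack", "bomb"]]
# "bomb" occurs twice in corpus[2], so it outweighs a single "threat"
print(tfidf("bomb", corpus[2], corpus) > tfidf("threat", corpus[0], corpus))  # True
```

The resulting weighted vectors are then compared with a similarity function such as the cosine measure described earlier.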

The models based on probabilistic approaches try to capture word distributions in documents in order to use them for later inference. The first studies of these models go back to the beginning of the sixties, with Maron and Kuhns [1960]. Since that time,

4 We suppose here that queries and documents are written in a natural language


these studies have been enriched by many other models. The score used in these models is the probability of relevance with respect to a particular query.

One of the justifications advanced for the use of probabilistic models is the "probability ranking principle" [Robertson and Sparck Jones 1976]. Amini [2001] states this principle as follows: "optimal retrieval performance is obtained when documents are returned ordered according to their probability of relevance for a given query"; in this probabilistic context, the concepts of "relevance", "optimality" and "performance" can then be defined in an exact way.

IR can also use document structuring (by enriching the representation vector) as a preprocessing phase. This structuring is performed by statistical/linguistic analysis tools that provide richer representations, since documents are no longer represented in a space of words only, but also in a semantic space of concepts. These representations allow the user to apprehend the informative content of a document more intuitively.

Text retrieval techniques are evaluated using two measures, namely precision and recall. Precision is the percentage of retrieved documents that are relevant to the query (correct responses). Recall is the percentage of the documents relevant to the query that were actually retrieved.
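The two measures can be written directly as set operations over retrieved and relevant document identifiers (toy IDs below):

```python
def precision(retrieved, relevant):
    # fraction of retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
print(precision(retrieved, relevant))  # 0.5 (2 of the 4 retrieved are relevant)
print(recall(retrieved, relevant))     # 2 of the 3 relevant were retrieved
```

The two measures pull in opposite directions: retrieving everything maximizes recall at the cost of precision, and vice versa, which is why systems are usually judged on both.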

2.3 Document classification (or document categorization5)

Automatic document classification has a long history and has been an active research area for a few decades [Sebastiani 1999]. Several approaches and algorithms have been proposed, and new enhancements are still emerging to achieve better results. We present hereafter the main concepts underlying these techniques and describe the most popular algorithms for document classification.

2.3.1 The Expert System approach

First methods for the creation of automatic document classifiers were based on a knowledge engineering approach: the design of an expert system dedicated to the task of classifying documents into predefined categories. The technique consisted in the manual definition of a classifier by domain experts, through a set of rules encoding expert knowledge on how to classify documents under predefined categories. The drawback of this approach is the knowledge acquisition bottleneck, well known from the expert systems literature: the rules must be manually defined by a knowledge engineer with the aid of a domain expert, and if the set of categories is updated, the system must be modified to take the new categories into account.

2.3.2 The Machine Learning approach

Nowadays, the dominant approach for document categorization relies on the machine learning (ML) paradigm, according to which a general inductive process automatically builds an automatic text classifier by learning (i.e. the computer system discovers the classification rules) from a set of preclassified documents and the characteristics of the categories of interest. This approach requires an existing set of classes with associated training data.

5 In this report, document classification and document categorization are synonyms.

DRDC Valcartier TR 2004-265 7

Text classification is a two-step process: training and classification. In the first step, training, the system is given a set of preclassified documents (provided by human experts). It uses these to learn the features that represent each of the concepts. In the classification phase, the classifier uses the knowledge that it has already gained in the training phase to assign a new document to one or more of the categories.

In this approach, a general inductive process (also called the learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci (positive examples) or not (negative examples) by a domain expert; from these characteristics, the inductive process gleans the characteristics that a new unseen document should have in order to be classified under ci. In ML terminology, the classification problem is an activity of supervised learning, since the learning process is "supervised" by knowledge of the categories and of the training instances that belong to them.

The engineering effort goes toward the construction not of a classifier, but of an automatic builder of classifiers (the learner). This means that if a learner is (as is often the case) available off the shelf, all that is needed is the inductive, automatic construction of a classifier from a set of manually classified documents.

In the ML approach, the preclassified documents are thus the key resource. In the most favorable case, they are already available; this typically happens for organizations that have previously carried out the same categorization activity manually and decide to automate the process. A set of preclassified documents from the global corpus serves as the training set, while the other documents (called the test set) are used to test the accuracy of the classifier.

It must be noted that the machine learning approach for classifier construction relies on techniques for information retrieval because both document categorization and information retrieval are document content-based management tasks. Common processes include document indexing, and document request-matching or query expansion that are used in the inductive construction of the classifier.

2.3.2.1 Document representation and preprocessing

Digital documents, which are typically strings of characters, must be converted into a representation suitable for the classification task. Each document in the corpus is represented as a vector of words (based on the number of occurrences of each word), i.e. as a vector of n weighted index terms. Weights usually range between 0 and 1. This representation of documents is called the bag of words.
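As an illustration, the following Python sketch builds such a bag-of-words vector over a hypothetical vocabulary; the normalization by the maximum term count is one simple way (among several, e.g. tf-idf) to obtain weights between 0 and 1.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a document as a vector of term weights in [0, 1].

    Weights here are raw term frequencies divided by the count of the
    most frequent vocabulary term (a simple normalization; real systems
    often use tf-idf instead).
    """
    counts = Counter(text.lower().split())
    max_count = max((counts[t] for t in vocabulary), default=0)
    if max_count == 0:
        return [0.0] * len(vocabulary)
    return [counts[t] / max_count for t in vocabulary]

vocab = ["attack", "vehicle", "report"]
vec = bag_of_words("report of a vehicle attack the attack used a vehicle", vocab)
# vec = [1.0, 1.0, 0.5]
```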

Any indexing technique that represents a document as a vector of weighted terms may be used. Before indexing, a preprocessing step is usually performed. It consists in removing stopwords, i.e. words that carry no information such as prepositions, and performing word stemming, i.e. suffix removal. Because of the high dimensionality of the term space, dimensionality reduction is often employed using one of two distinct techniques: feature selection or feature extraction. For the latter, Latent Semantic Indexing, a technique used in Information Retrieval to address problems deriving from the use of synonymous, near-synonymous, and polysemous words as dimensions of document representations, can be exploited for dimensionality reduction in this context. This technique compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence.
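The compression performed by Latent Semantic Indexing can be sketched with a truncated singular value decomposition; the toy term-document matrix below is hypothetical, and a real system would of course work at a much larger scale.

```python
import numpy as np

# Term-document matrix: rows = terms, columns = documents (toy counts).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])

# Truncated SVD: keep only the k strongest latent "concept" dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Each document is now a k-dimensional concept vector instead of a
# |terms|-dimensional word vector.
doc_concepts = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape: (3 documents, k concepts)
```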

2.3.2.2 Classification methods

The inductive construction of a classifier for a category Ci usually consists of two phases:

• The definition of a function CSV (Categorization Status Value) that, given a document d, returns a value representing the evidence that d should be categorized under Ci.

• The definition of a threshold Ti above which d is considered to belong to Ci.
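The two phases can be sketched as follows; the dot-product CSV and the profile vector shown here are illustrative choices (in the spirit of a profile-based classifier), not the only possible definitions.

```python
def csv_score(doc_vec, profile_vec):
    """Categorization Status Value: here simply the dot product between the
    document vector and the category's profile (prototype) vector."""
    return sum(d * p for d, p in zip(doc_vec, profile_vec))

def categorize(doc_vec, profile_vec, threshold):
    """Assign the document to the category iff its CSV reaches the threshold."""
    return csv_score(doc_vec, profile_vec) >= threshold

profile = [0.9, 0.1, 0.4]        # hypothetical learned profile for category Ci
assigned = categorize([1.0, 0.0, 0.5], profile, threshold=1.0)
```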

Several methods for text classification have been developed, differing in the way in which they compare the new document with the reference set. A comparison of these methods is presented in [Yang 1999] and [Sebastiani 2002]. We present hereafter an outline of the most important methods.

• Naïve Bayesian (probabilistic classifier): This approach uses the joint probabilities of words co-occurring in the category training set and in the document to be classified to calculate the probability that the document belongs to each category (using Bayes' theorem). The document is assigned to the most probable category (or categories). The naïve assumption in this method is that word occurrences are independent of each other given the category.

• Linear (profile-based) classifier (e.g. Rocchio method): Linear classifiers embody an explicit profile (or prototype vector) of the category. The Rocchio method, rooted in the Information Retrieval tradition, is used for inducing linear, profile-style classifiers. It rewards the closeness of a test document to the centroid of the positive training examples, and its distance from the centroid of the negative training examples.

• K-Nearest Neighbor (k-NN algorithm): This method, applied to text categorization by Yang [1994], is a popular instance of example-based classifiers, which do not build an explicit, declarative representation of the categories but rely on the category labels attached to the training documents similar to the test document. Such methods are called lazy learners, as they have no true training phase and thus defer all the computation to classification time. For deciding whether a test document dj should be classified under category ci or not, k-NN looks at whether the k training documents most similar to dj are also in ci; if the answer is positive for a large enough proportion of them, a positive decision is taken, and a negative decision is taken otherwise.

• Decision trees [Sebastiani 2002]: In this approach, the test document is matched against a decision tree, constructed from the training examples, to determine whether the document is relevant to the user or not. A decision tree (DT) text classifier (see Mitchell [1996]) is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. Such a classifier categorizes a test document dj by recursively testing the weights that the terms labeling the internal nodes have in the vector representing dj, until a leaf node is reached; the label of this node is then assigned to dj. Most such classifiers use binary document representations, and thus consist of binary trees.

• Support Vector Machines: This method tries to find a boundary that achieves the best separation between the groups of documents. The system is trained using positive and negative examples of each category, and the boundaries between the categories are calculated. A new document is categorized by determining the partition of the space to which its vector belongs. In geometrical terms, it may be seen as the attempt to find, among all the surfaces σ1, σ2, … in the |T|-dimensional space that separate the positive from the negative training examples (decision surfaces), the σi that separates the positives from the negatives by the widest possible margin, that is, such that the separation property is invariant with respect to the widest possible translation of σi.

• Neural networks [Ruiz 1999]: In this method, a neural network is trained on the preclassified documents; given the words of a new document as input, it outputs the topics inferred from these words.
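As an illustration of the first method above, the following self-contained Python sketch implements a multinomial Naïve Bayes classifier with Laplace smoothing; the tiny training set is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Train a multinomial Naive Bayes classifier.

    labeled_docs: list of (list_of_words, category) pairs.
    Returns per-category document counts, word counts, and the vocabulary.
    """
    cat_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in labeled_docs:
        cat_docs[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return cat_docs, word_counts, vocab

def classify_nb(words, cat_docs, word_counts, vocab):
    """Pick the category maximizing log P(c) + sum over words of log P(w|c)."""
    total_docs = sum(cat_docs.values())
    best_cat, best_lp = None, -math.inf
    for cat in cat_docs:
        lp = math.log(cat_docs[cat] / total_docs)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[cat][w] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best_cat, best_lp = cat, lp
    return best_cat

train = [("bomb attack embassy".split(), "terrorism"),
         ("attack bomb threat".split(), "terrorism"),
         ("budget finance report".split(), "finance")]
model = train_nb(train)
label = classify_nb("bomb threat".split(), *model)
```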

Among these approaches, one can distinguish flat text classification from hierarchical text classification. With flat text classification, categories are treated in isolation from each other and there is no structure defining the relationships among them. A single huge classifier is trained, which categorizes each new document as belonging to one of the possible basic classes. Such classifiers lose accuracy because the categories are treated independently and the relationships among them are not exploited. With hierarchical text classification, topics that are close to each other in the hierarchy have more in common with each other. The problem is thus addressed using a divide-and-conquer approach [Koller 1997] that exploits the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. At each level of the category hierarchy, a document is first classified into one or more subcategories using some flat classification method; features from both the current level and its children can be used to train this classifier. By treating the problem hierarchically, it can be decomposed into several subproblems, each involving a smaller number of categories. Among category structures for hierarchical classification, the category tree allows documents to be assigned to both internal and leaf categories, while in the directed acyclic category graph, categories are organized as a Directed Acyclic Graph (DAG). The latter is perhaps the most commonly used structure in popular web directory services such as Yahoo! and the Open Directory Project; there too, documents can be assigned to both internal and leaf categories.
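The divide-and-conquer idea can be sketched as follows; the category tree and the keyword-overlap "flat classifiers" used at each node are deliberately simplistic, hypothetical stand-ins for real trained classifiers.

```python
# Hypothetical category tree: each internal node lists its children.
tree = {"root": ["military", "civil"],
        "military": ["air_defence", "ground_forces"],
        "civil": []}
# Each candidate category has a tiny keyword set standing in for a flat classifier.
keywords = {"military": {"attack", "forces", "defence"},
            "civil": {"budget", "election"},
            "air_defence": {"air", "defence", "missile"},
            "ground_forces": {"tank", "infantry", "forces"}}

def classify_hierarchical(words, node="root"):
    """Route a document (a set of words) down the tree, one flat decision per level."""
    children = tree.get(node, [])
    if not children:
        return node
    # Flat decision at this node: child whose keyword set overlaps the document most.
    best = max(children, key=lambda c: len(keywords[c] & words))
    return classify_hierarchical(words, best)

label = classify_hierarchical({"air", "defence", "attack"})
```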

2.3.3 Applications of document categorization

Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer. A filtering system may also further classify the documents deemed relevant to the consumer into thematic categories. For example, an e-mail filter might be trained to discard "junk" mail and further classify non-junk mail into topical categories of interest to the user. A filtering system may be installed at the producer end, in which case it must route the documents to the interested consumers only, or at the consumer end, in which case it must block the delivery of documents deemed uninteresting to the consumer. In the former case, the system builds and updates a "profile" for each consumer, while in the latter case (which is the more common, and to which we refer hereafter) a single profile is needed. A profile may be initially specified by the user, thereby resembling a standing IR query, and is updated by the system using feedback information provided (either implicitly or explicitly) by the user on the relevance or nonrelevance of the delivered messages. In the TREC community, this is called adaptive filtering.

Automatic categorization of Web pages. Text categorization has recently aroused a lot of interest for its possible application to automatically classifying Web pages, or sites, under the hierarchical catalogues hosted by popular Internet portals. This way, a search engine can first navigate the hierarchy of categories and then restrict the search to a particular category of interest. Automatic Web page categorization has two essential peculiarities: the hypertextual nature of the documents, where links between pages can be exploited for categorization, and the hierarchical structure of the category set (this may be used, for example, by decomposing the classification problem into a number of smaller classification problems, each corresponding to a branching decision at an internal node).

Text Mining [Hearst 2003] consists in analyzing large text collections, detecting usage patterns, trying to extract implicit information and discovering new knowledge that is useful for a particular purpose. It is a variation of data mining, the difference being that the patterns are extracted from natural language text rather than from structured databases of facts. It is becoming an active research area applied to the Web, where the goal is to extract and discover knowledge from large sets of Web pages (Web mining).

2.4 Document clustering

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbors of a document (by preclustering the entire corpus). More recently, clustering has been proposed for use in browsing a collection of documents or in organizing the results returned by a search engine in response to a user's query.

Document-clustering systems create groups of documents based on associations among the documents, using an unsupervised algorithm to create the clusters. Since automatic clustering does not require training data, it is an example of unsupervised learning. Such systems take documents as input, extract or select the features of the documents, and form clusters based on a calculation of similarity between individual documents, or between an individual document and a representation of the clusters formed so far. The similarity calculation is based only on the selected features for that document collection. To determine the degree of association among documents, clustering systems require a similarity metric to measure the distance between document vectors, such as the number of words that the documents have in common.
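A common choice for such a similarity metric is the cosine of the angle between two document vectors, sketched here in Python:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors: 1.0 means the same
    term proportions, 0.0 means no weighted terms in common."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A distance can then be derived as 1 minus the similarity, which is what the clustering algorithms below would operate on.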

Clustering algorithms can be either hierarchical and form a tree-like organization of documents, or they can be nonhierarchical and form a flat set of document groups (disjoint clusters). Consequently, there are two main approaches to document clustering, namely agglomerative hierarchical clustering (AHC) and partitional (e.g. K-means) techniques. There are two basic approaches to generating a hierarchical clustering:

1. Agglomerative (bottom-up): Start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

2. Divisive (top-down): Start with one, all-inclusive cluster and, at each step, split a cluster into smaller clusters. In this case, we need to decide, at each step, which cluster to split and how to perform the split.

Most of the work on document clustering has concentrated on the hierarchical agglomerative clustering methods. Agglomerative algorithms find the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until a certain stopping criterion is met. A number of different methods have been proposed for determining the next pair of clusters to be merged. Hierarchical algorithms produce a clustering that forms a dendrogram, with a single all-inclusive cluster at the top and single-point clusters at the leaves. On the other hand, partitional algorithms, such as K-means or K-medoids, find the clusters by partitioning the entire dataset into either a predetermined or an automatically derived number of clusters. Depending on the particular algorithm, a k-way clustering solution can be obtained either directly or through a sequence of repeated bisections. In the former case, there is in general no relation between the clustering solutions produced at different levels of granularity, whereas the latter case gives rise to hierarchical solutions.
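The partitional approach can be illustrated with a minimal K-means implementation; the deterministic seeding and the toy two-dimensional points are simplifications for the sake of the sketch.

```python
import math

def kmeans(points, k, iters=20):
    """Minimal K-means: partition `points` into k disjoint clusters.

    Deterministic seeding (the first k points) keeps the sketch simple; real
    implementations seed randomly and restart several times.
    """
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = kmeans(points, k=2)
# labels = [0, 0, 1, 1]: the two well-separated groups are recovered.
```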


The main advantage of clustering over classification is that it may reveal previously hidden but meaningful themes among documents. However, clustering techniques provide no clear way to convey the meaning of the clusters.

2.5 Commercial solutions

Many projects and commercial off-the-shelf tools have been proposed to deal with problems like those addressed in this project (see Witten [2001]). Appendix D provides a description and evaluation of some of these commercial tools, namely Autonomy, Delphes, Stratify, Convera, Applied Semantics and Diagnos. In particular, it presents the characteristics of the tools and their information management functions, and describes how well they meet ADAC's requirements.

2.6 Expected solutions

In the literature presented above, several approaches have been proposed for automatic document classification and clustering, but few have been devoted to exploiting both the documents' contents and the semantics underlying the domain of interest.

In this context, we have experimented with candidate methods and tools to deal with the problem of document categorization within military organizations such as NDCC. In the following chapters, we describe the different techniques we have proposed and implemented within the ADAC environment to support automatic document processing functions exploiting an ontology of the domain. Main efforts deal with document categorization and clustering.


3. The ontology-based document processing approach

Structuring a decision problem always begins with the specification of the decision framework: attribute description, alternative generation, and assessment of consequences in terms of multiple defined criteria. This step requires significant human intervention for the initial creation of the rational framework, or domain ontology. Creating the latter consists, first, in identifying all the relevant concepts of the given application domain and, then, in organizing them into an ontology. Once such an application ontology is written, it can be applied to unstructured documents from a wide variety of sources, as long as these documents correspond to the given application domain. Because our approach is ontology-based, we argue that it is resilient to changes in source-document formats.

3.1 About the rationality

The essence of a cooperative information system is to achieve interoperation among distributed and heterogeneous information sources or agents. One way to do so is by providing a unique base of rationality in the form of an ontology. To support the sharing and reuse of formally represented knowledge among AI systems, it is indeed useful to define the common vocabulary in which shared knowledge is represented [Studer et al. 1998].

AI is fundamentally about rational agents. Roughly, according to the classical conception, AI is an enterprise devoted to the mechanization of rational thinking (a conception rooted in Turing's work [Turing 1947, 1948, 1950, 1954]). Hampton [1998] stated that the rational action is the one having the highest expected value.

The notion of rationality [Michael 1994, 1998; Frank 1994; Horvitz et al. 1988] is considered by many authors and specialists to be of crucial importance in multicriteria decision-making (MCDM). This is why we give a large place to this concept in our discussions. Rationality has empirical and testable content once we specify a utility function (relevance function) and a domain (ontology) to which the notion is applied. Rational decision-making is the action of choosing among alternatives in a way that "properly" agrees with the preferences of the decision maker or those of a group making a joint decision [Doyle 1998]. The task is to treat unanalyzed alternatives (actions, situations, documents) with respect to preferences that reflect the desirability of the alternatives and certain rationality criteria. For example, in the case of documentary task management, this desirability corresponds to the utility function of the alternatives with respect to a certain relevance structure (preference structure).

The main factors influencing decision-making rationality are explained in the work of Papadakis et al. [1998] and Rajagopolan et al. [1993]. These authors proposed three kinds of factors: internal factors, decisional factors and external factors. Internal factors can be controlled and even directed by the managers of the firm, which gives them the opportunity to design the decision process (or framework) needed for every decision. Decisional factors are those that characterize the decision, and are related to its strategic relevance for the firm. Although the firm has no control over external factors, it can react to and even anticipate them by modifying the organizational contextual rationality. An ontology-based model offers this opportunity to modify the contextual rationality.

Other studies related to decision rationality can be found in [Eisenhardt and Bourgeois, 1988] or [Dean and Sharfman 1993a, 1993b] for example; an interesting analysis on the influence of decision rationality on the results of the firm processes or their global performance was published by Goll and Rasheed [1997].

Most MCDM problems lie within the scope of the following approaches [Bell et al. 1988; Roy 1990; Dias and Tsoukias 2003]: the descriptive approach, the prescriptive approach, the constructive approach and the normative approach. The normative approach, the subject of interest in this work, consists in defining principles and rules that a group of persons could follow. This analysis is coherent and rational, in the sense that these well-specified rules constitute an axiom set with a precise logic and implications [Bell et al. 1988], a logic that the agents involved can in no case go against, unless specifically formulated. Roy [1990] has pointed out that the classic normative theory confers to these axioms the value of an unquestionable truth. They represent ideal rules that the DM must rationally follow.

Axiomatic analysis allows the characterization of multicriteria procedures. The axiomatic characterization of a procedure is not unique. On this subject, Pirlot [1994], cited by Othmani [1998], distinguishes normative axioms, which translate rules of rational behaviour, from descriptive axioms, which describe the way in which a procedure works. This analysis stimulates the production of new methods with well-defined fields of application [Pirlot 1994]. To help the analyst choose among existing multicriteria procedures, Arrow and Raynaud [1986] and Pasquier-Dorthe and Raynaud [1990] proposed to build an axiom pool for a set of situations and a coherent axiom system translating the basic assumptions of such situations. Then, a procedure fulfilling the axioms for a given situation is selected or algorithmically built.

3.2 Ontologies: definitions and roles

3.2.1 From controlled vocabulary to ontologies

From an AI viewpoint, an ontology is defined as follows [Gruber 1993]:

“An ontology is a model of some portion of the world and is described by defining a set of representational terms [concepts]. In an ontology, definitions associate the names of entities in a universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text describing what the names mean, and formal axioms that constrain the interpretation [rationality] and well-formed use of these terms [concepts].”

A body of formally represented knowledge is based on a conceptualization: the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them [Genesereth and Nilsson 1987]. A conceptualization is an abstract, simplified, and unified view of the world that we wish to represent for some specific purpose. Every knowledge-based system or knowledge-level agent is committed to some conceptualization, explicitly or implicitly.

Ontologies have received increasing interest in the computer science community and their benefits are recognized as they provide a foundation for the representation of domain knowledge. They explicitly encode a shared understanding of a domain that can be communicated between people and application programs. Gruber [1993] defines an ontology as "an explicit specification of a shared conceptualization". In the literature, ontologies range from controlled vocabularies to highly expressive domain models [McGuinness, 2002]: integrated data dictionaries designed for human understanding, taxonomies organizing concepts of a domain into inheritance hierarchies, structured data models suitable for data management, and finally highly expressive computational ontologies.

A controlled vocabulary is a finite set of terms with unambiguous definitions. Usually, if multiple terms are used to mean the same thing, a preferred term is identified and the other terms are listed as variants or synonyms.

A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical structure. Taxonomies have been built manually in libraries for hundreds of years. They structure a domain into categories/subcategories that can be used to organize a document collection. Relationships between terms in a taxonomy usually consist of "is-a" (generalization-specialization) relations, but there may be other types of relationships, e.g. membership, or metonymy (whole-part).

A thesaurus is a networked collection of controlled vocabulary terms. Thesauri provide some semantics in their relations between terms (e.g. the synonym relationship). However, the relations between terms in a thesaurus hierarchy are implicit (interpreted as narrower-broader relations). Furthermore, thesauri contain associative relationships between terms that are most often expressed as "related to term".

Even if taxonomies, thesauri and ontologies have commonalities in their definitions, ontologies add more expressiveness in the specification of relationships between concepts. Formal ontologies use a representation language to specify properties and constraints of concepts that can be exploited for automated reasoning (inferencing).
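A minimal way to represent the "is-a" backbone shared by taxonomies and ontologies is a child-to-parent map; the terrorism-domain concepts below are hypothetical, and a formal ontology would add typed relations, properties and constraints on top of this skeleton.

```python
# Hypothetical fragment of a domain taxonomy: child -> parent ("is-a") links.
is_a = {"car_bombing": "bombing",
        "bombing": "terrorist_attack",
        "hijacking": "terrorist_attack",
        "terrorist_attack": "event"}

def ancestors(concept):
    """Walk the is-a links up to the root, returning all broader concepts."""
    result = []
    while concept in is_a:
        concept = is_a[concept]
        result.append(concept)
    return result

def subsumes(broad, narrow):
    """True iff `broad` is the concept itself or one of its broader concepts,
    e.g. a document indexed under car_bombing also matches terrorist_attack."""
    return broad == narrow or broad in ancestors(narrow)
```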

3.2.2 Role of ontologies in information systems and knowledge management

Ontologies can be exploited in a wide range of applications including natural language processing, intelligent search engines and information retrieval, or as a means to facilitate semantic interoperability among heterogeneous knowledge sources at a high level of abstraction. In particular, ontologies can be used for document indexing and annotation, information organization, and search and retrieval. The utilization of an ontology or taxonomy of terms has been identified as potentially useful to support information extraction from texts or automated document indexing. For example, controlled vocabularies (unlike free keywords) can be exploited as document metadata (document indexing, semantic tagging). Similarly, WordNet, a large, publicly available electronic lexical database [Fellbaum 1998], has been used to support information extraction or query formulation in different contexts.

Ontological models can be utilized for categorizing documents by their contents, as for example in [Labrou and Finin 1999], where Yahoo! topics are used as descriptors. The taxonomy serves as a navigational as well as an organizational tool. Furthermore, the relationships between concepts explicitly specified within an ontology can be exploited to enhance search and retrieval as well as automatic categorization tools.

Traditionally, the semantic analysis of a given domain starts with intellectual efforts for knowledge identification and acquisition, such as analyzing document indices that contain relevant concepts. Spyns et al. [2002] have proposed that well-known measures for significant collocations can be used to extract relevant relations between concepts from text, and that the general notion of relatedness resulting from such statistical analysis is adequate as input information for ontology design.

Ontological engineering encompasses a set of activities conducted during the conceptualization, design, implementation and deployment of ontologies. A large range of topics and issues is covered, such as the basics (philosophical and metaphysical issues and knowledge representation formalisms), development methodology, knowledge sharing and reuse, knowledge management, business process modeling, commonsense knowledge, systematization of domain knowledge, information retrieval, and interpretation and decision standardization. Ontological engineering also provides a design rationale for a knowledge base, allowing one to define the essential concepts of the world of interest for a more disciplined design of the knowledge base, and enables the gathering of knowledge about it [Guarino 1995; Gòmez-Pérez and Benjamin 1999; Benjamin et al. 1998; Gòmez-Pérez and Rojas-Amaya 1999; Gòmez-Pérez et al. 1996; Levesque and Brachman 1985; Winkels et al. 2000; Gòmez-Pérez 1995; Guarino and Giaretta 1995; Gruninger and Fox 1995; Fernandes et al. 1997].

An ontology-based approach, just like a compression approach [Von Luxburg et al. 2002], offers a promising alternative for categorization, with several potential advantages. Among these advantages, Mahalingam and Huhns [1997, 1998] have identified the following:

• With respect to the physical and functional ontology structure: i) Provision for value mapping: value mapping is a useful property for unstructured text-based information spaces. Because the mapping process is a big problem in distributed and heterogeneous environments, this represents a very desirable feature for those types of environments. ii) Suitability for graphical representation: As the ontology supports the information structure, the latter can easily be represented graphically as an Entity-Relationship diagram. Graphical representations are much easier for any user to understand than textual representations. An ontology can also be used to eliminate the confusion and redundancy inherent to unstructured plain textual representations. In addition, in a graphical display, a user can form queries by simple mouse clicks, whereas in a textual representation the user is expected to type the query.

• With respect to the multicriteria paradigm: the ability to view the ontology at various abstraction levels and to scale it: the ontology can grow or shrink as necessary, based on the context in which it is being used. Parts of the ontology can be hidden or made visible, so that a new view of the same information space can be generated efficiently and quickly to suit a certain audience, as is common practice in large databases. In addition, ontologies created by experts from a variety of fields can be merged to create a super-ontology.

Other advantages of ontologies are reported in the literature. For example, Studer et al. [1998] compared them to knowledge bases. They noted that ontologies are suitable for formal or machine representation, have a full and explicitly described vocabulary, can be used as a full model of some domain, capture a common understanding of a domain (consensus knowledge), and are easy to share and reuse.

3.3 Exploitation of ontologies for document processing

Extracting relevant descriptors from free-text electronic documents is a problem that requires the use of natural language processing techniques. Statistical analysis of documents consists in extracting a set of concepts or attributes that characterize the text content based on statistical parameters (e.g. the number of occurrences of words). Different statistical methods have been proposed in the domain (e.g. Latent Semantic Indexing). However, purely statistical methods may lead to text descriptors that do not really reflect the semantics of the processed documents.

Whereas traditional information extraction systems are based on shallow natural language techniques and statistical algorithms, more recent approaches try to take into account semantics incorporated in ontologies to obtain more precise results. Ontologies can be exploited for unstructured document processing at different levels: for semantic annotation, for content-based indexing and retrieval, or for text classification and clustering. In this section, we present approaches where ontologies are exploited to provide enhanced knowledge management services, in particular for intelligent document processing: content-based indexing, semantic search, information integration within portals, and automatic classification or clustering.

18 DRDC Valcartier TR 2004-265

3.3.1 Content-based indexing

Semantic tagging of unstructured information consists in identifying terms that are descriptive of a document and that can be used for the indexing and retrieval of that document. This process could benefit from the exploitation of ontological knowledge.

When texts are analyzed, ontologies can be used for word sense disambiguation by utilizing known semantic relationships between concepts to boost the probability of a particular sense of a word in context.

Furthermore, the analysis of surrounding words in a text adds semantics that should be taken into account. For example, the word tank, when surrounded by words such as military and vehicle, is more likely to denote a fighting vehicle and less likely a container for holding fuel. This helps identify relevant indexes or meta-data.
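The tank example above can be sketched in a few lines. This is an illustrative sketch only, not the ADAC implementation; the sense inventory and clue words are invented for the example:

```python
# Toy sense inventory: each (word, sense) pair lists ontology terms that
# signal that sense (invented clue sets, for illustration only).
SENSE_CLUES = {
    ("tank", "fighting_vehicle"): {"military", "vehicle", "armour", "gun"},
    ("tank", "fuel_container"): {"fuel", "storage", "litre", "pump"},
}

def disambiguate(word, context_words):
    """Pick the sense whose related terms best overlap the context window."""
    context = {w.lower() for w in context_words}
    scores = {sense: len(clues & context)
              for (w, sense), clues in SENSE_CLUES.items() if w == word}
    return max(scores, key=scores.get)

print(disambiguate("tank", ["the", "military", "vehicle", "advanced"]))
# fighting_vehicle: "military" and "vehicle" match its clue set
```

A real system would weight clues by semantic distance in the ontology rather than counting raw overlaps.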

3.3.2 Ontology-based search and retrieval

With the explosion of the Web and the increasing popularity of intranets and enterprise portals in organizations, the exploitation of ontologies has been considered for building semantic search engines that provide enhanced search results (better precision and recall) compared to keyword search. The reason is that standard search engines on the Web (e.g. Google) do not take into account the multiple ambiguous meanings of words. For example, a search for the word chip would retrieve information about both food and electronics. Thus, keyword search engines yield poor precision. On the other hand, they do not consider synonyms, acronyms or more specific terms when searching for a word, and thus they also yield poor recall. Moreover, they do not take into account the context of a search. For example, a search for the words air and defence should retrieve information in the military context by relating the two terms, rather than returning results in the juridical context, as happens when the two words are considered separately without context.

Ontology-based search engines exploit the inheritance hierarchy of concepts to look for more specific or more general terms. They also make use of lexical knowledge, such as synonyms and acronyms, contained in ontological models. This way, they retrieve information that would have been missed with keyword search, and they exclude terms that are out of the context of a search by limiting the search to a subset of relevant topics. In doing so, they improve the recall and precision of results. In [McGuinness 1998], D. McGuinness shows that formal, description-logic-oriented ontologies can also improve search results under particular conditions.
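As a concrete illustration of synonym and hierarchy expansion, the following sketch (with a toy ontology; the terms are ours, not taken from the report's ontology) collects a query term, its synonyms and all of its transitive sub-concepts before keyword matching:

```python
# Toy lexical and taxonomic knowledge (illustrative content only).
SYNONYMS = {"aircraft": ["airplane", "plane"]}
CHILDREN = {"aircraft": ["fighter", "bomber"], "fighter": ["interceptor"]}

def expand(term):
    """Collect the term, its synonyms, and all transitive sub-concepts."""
    terms = {term} | set(SYNONYMS.get(term, []))
    stack = list(CHILDREN.get(term, []))
    while stack:
        t = stack.pop()
        if t not in terms:
            terms.add(t)
            stack.extend(CHILDREN.get(t, []))
    return terms

print(sorted(expand("aircraft")))
# ['aircraft', 'airplane', 'bomber', 'fighter', 'interceptor', 'plane']
```

Matching documents against the expanded set rather than the raw query term is what improves recall; restricting expansion to a topic subtree is what preserves precision.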

3.3.3 Ontologies in enterprise portals

There is an increasing interest in the exploitation of ontologies to structure information sources and integrate heterogeneous sources in enterprise portals (e.g. KAON, SEAL and OntoKnowledge are representative initiatives). In this context, ontologies are considered a key component of knowledge management systems and knowledge portals. Maedche et al. [2001], as part of the KAON project, have proposed an ontology management infrastructure for semantics-driven applications based on the RDF Schema formalism. Within the Semantic Portal Project [Hotho 2001a], they proposed a framework for a semantic portal relying on an ontology basis, for semantic integration, web site management and presentation. On-To-Knowledge is another initiative to build an ontology-based environment to enable Semantic Web knowledge management [Fensel et al. 2002]. It exploits newly proposed ontology web formalisms such as RDF Schemas and DAML+OIL to facilitate access to heterogeneous information from an intranet and the World Wide Web. As part of the COP21 TD, [Gauvin et al. 2002] propose the concept of a situational awareness knowledge portal that exploits military ontologies to organize, filter and search for information within a user’s portfolios.

The Carnot Project addresses the problem of logically unifying physically distributed, enterprise-wide, heterogeneous information. The Model Integration Software Tool (MIST) Project6, developed as part of this project, is a graphical user interface that assists a user in integrating different databases via a common ontology that serves as an enterprise model. The Cyc7 knowledge server is a very large, multi-contextual knowledge base and inference engine developed by Cycorp. Cyc is intended to provide a "deep" layer of understanding that can be used by other programs to make them more flexible.

3.3.4 Ontology-based document categorization and clustering

The objective of automatic document classification is to automatically organize documents in categories that are meaningful to users. For this purpose, predefined domain taxonomies may be used as categories. An alternative approach is to partition the set of documents into clusters that are generated through an unsupervised process according to similarities between documents. For the automatic document categorization/diagnostic task, the limitations identified in Pirlot’s normative approach [Pirlot 1994] become strengths, since rationality is imposed by an ontology built on the organization’s beliefs. Indeed, axioms, even those that seem intelligible, appealing or evident, i.e., those that belong to common sense in the organization’s philosophy, must be accepted by all agents of the organization without critical examination. When the decision framework is well specified, as is the case with a domain ontology, the agents of the system can consistently distinguish relevant from irrelevant information through well-defined rules during document processing.

From a user’s perspective, a good categorization/clustering tool is as important as a good search engine, because the user browses through categories to retrieve or discover relevant information. Different approaches have therefore been proposed in the literature, and commercial tools using various methods are now available on the market. Some relevant aspects of commercial tools and research proposals are presented below.

6 www.cse.ogi.edu/DISC/projects/mist
7 www.cyc.com


The market for commercial tools dedicated to unstructured information management (indexing, search and retrieval, filtering/categorization and clustering) is particularly dynamic. Many vendors offer categorization solutions (as reported in [Delphi 2002, Letson 2001, Adams 2001]). They use different methods, sometimes in combination: pattern matching and statistical algorithms, machine learning (using Bayesian probability, support vector machines, or neural networks), rule-based systems, and linguistic-semantic approaches. Among them, Autonomy uses pattern matching based on Bayesian theory, Stratify uses multiple classifiers, and Semio proposes a hybrid solution combining linguistic algorithms and statistical clustering techniques. Tools using a semantic approach are worth mentioning. Convera’s RetrievalWare technology includes specific semantic networks that support concept-based categorization and search (using a concept-based search engine). Applied Semantics’ approach is based on a vast ontology composed of half a million tokens (individual words), two million terms (sequences of words), half a million concepts (or meanings), and relationships between these concepts (similar to those represented in the WordNet lexical database). The ontology drives the processing of texts at a semantic level and provides a foundation for document categorization, meta-tagging, and summarization. In particular, concepts and semantic relationships from the ontology are used during word sense disambiguation in order to identify globally relevant concepts. This tool is unique in that it relies solely on semantic analysis and does not make use of machine learning techniques.

In a particular domain of interest, ontologies can provide additional power by supplying background knowledge. There have been some attempts to combine ontologies or background knowledge with automatic classification or clustering algorithms; these are described below.

In [Iwasume 1996], the authors propose IICA (Intelligent Information Collector and Analyzer), a system for gathering and categorizing information from resources on the Web that classifies documents by combining a keyword vector model with an ontology. It creates an initial classification of documents from the ontology, then repeats a cycle of computing a representative characteristic vector for each category and reclassifying documents using these characteristic vectors, until convergence is achieved.
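Our reading of this cycle can be summarized by a small sketch (our illustration, not the IICA code; the document vectors and seed assignment are invented): alternate between computing a mean characteristic vector per category and reassigning each document to the category whose vector it matches best, until assignments stabilize.

```python
# IICA-style refinement cycle on toy keyword vectors (illustrative only).
def mean_vec(vecs):
    # Characteristic vector of a category: component-wise mean.
    return [sum(xs) / len(vecs) for xs in zip(*vecs)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def iterate(docs, assign):
    while True:
        groups = {c: [docs[i] for i, a in enumerate(assign) if a == c]
                  for c in set(assign)}
        centers = {c: mean_vec(v) for c, v in groups.items()}
        new = [max(centers, key=lambda c: dot(d, centers[c])) for d in docs]
        if new == assign:       # converged: no document changes category
            return assign
        assign = new

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(iterate(docs, ["A", "B", "B", "B"]))  # converges to ['A', 'A', 'B', 'B']
```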

[Hotho et al. 2001b] exploit ontologies for text clustering. They utilize background knowledge, in the form of a hierarchy of concepts, for document preprocessing in order to generate different clustering views of a set of documents. Their approach, COSA (Concept Selection and Aggregation), makes use of concept vectors, built using natural language processing techniques, in place of the standard term vectors that simply constitute “bags of words” representing the presence/occurrence of words in a text. The objective is to restrict the set of relevant document features and to automatically propose good aggregations by exploiting the concept hierarchy.


According to Tsuhan Chen8, classification techniques should take the user’s context into account. His approach incorporates user-specific context into the categorization process and exploits both lexical (WordNet) and contextual knowledge through an ontology.

Ontology-based algorithms may also exploit the notion of strength of the relationships between concepts, e.g., concepts directly linked by an “is-a” relationship are more strongly related than those linked indirectly.
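One simple way to realize this notion (an assumed formulation for illustration, not a prescription from this report) is to let strength decay with the number of “is-a” edges separating two concepts:

```python
# Toy is-a hierarchy (child -> parent), loosely inspired by the WMD branch
# of the terrorism taxonomy; the strength formula 1/(1+path) is an assumption.
PARENT = {
    "nerve_agent": "chemical_agent", "blister_agent": "chemical_agent",
    "chemical_agent": "wmd", "anthrax": "biological_agent",
    "biological_agent": "wmd",
}

def ancestors(c):
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def path_length(a, b):
    # Edges to the lowest common ancestor, counted from both sides.
    pa, pb = ancestors(a), ancestors(b)
    common = next(c for c in pa if c in pb)
    return pa.index(common) + pb.index(common)

def strength(a, b):
    return 1.0 / (1 + path_length(a, b))

print(strength("nerve_agent", "blister_agent"))  # siblings: 1/3
print(strength("nerve_agent", "anthrax"))        # across branches: 1/5
```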

3.4 Combining statistics and semantics for document categorization

3.4.1 Approach

Ontologies or taxonomies can be exploited to support automatic document categorization. On the one hand, they organize concepts in a hierarchical structure that can be utilized for categorization (e.g., the categories of Yahoo). On the other hand, they provide the semantics of a domain that can be exploited to improve traditional classification methods based on statistics. For example, as mentioned above, WordNet, a large, publicly available electronic lexical database, may be used to support document categorization.

Natural language processing techniques supported by a domain ontology can be used to extract semantic meaning from unstructured text and provide semantic indices from the ontology. Recently, preliminary experiments have been conducted to combine statistics and semantics for information extraction [Termier et al. 2001, Faure and Poibeau 2000] using different methods.

An ontology-based semantic analysis examines, from a semantic perspective, the candidate concepts resulting from the statistical analysis, exploiting a domain ontology to restrict the document descriptors to the attributes that semantically characterize the text, for example by removing poorly meaningful words, or by replacing semantically close terms with a concept that represents them. At each level of the ontology structure, specific semantic expressions are attached to concepts to guide the semantic processing.

3.4.2 Operationalization

To demonstrate our approach and techniques, we restricted our experiment to a specific domain: terrorism. In this context, we chose a document corpus from various open sources about terrorist events and built a baseline ontology about terrorism that organizes concepts of the domain in a hierarchy of concepts (taxonomy). Figure 2 illustrates the baseline ontology we built, for the purpose of our experiment, from resources about terrorism found on the Web (e.g. [NATO, 2002]). This taxonomy aims at encapsulating important terms from the terrorism domain, organized into categories, from general terms to more specific terms. At the first level, it contains terms such as terrorist organizations, countries (or national hosts), terrorist acts (activities), tactics, weapons, financial assets, etc. Level 2 refines the concepts of level 1 by providing more specific concepts in the hierarchy, most of them linked by an “is-a” relationship. At a certain level in the hierarchy of concepts, terms in a category are considered similar from a categorization perspective because they are sub-concepts of that category (e.g. nerve agent and blister agent are two types of chemical agents in the Weapon of Mass Destruction category).

8 www.ece.cmu.edu/~tsuhan

[Figure 2, not reproduced here, depicts the terrorism ontology as a tree whose top-level branches include organizations/groups/cells, tactics, attacks/raids, weapons (including weapons of mass destruction: chemical, biological, nuclear), targets, motives, resources/financial assets, anti-terrorism and intelligence agencies, and geographic regions.]

Figure 2. Hierarchy of concepts in the terrorism ontology

At each level, specific semantic expressions are attached to concepts (e.g. acquire weapon of mass destruction, plan an attack, build a bomb), similar to sub-categorization frames (e.g. in the ASIUM system [Faure and Poibeau 2000]), to guide the semantic processing. A semantic search engine is used to search for semantically similar expressions in the documents being processed. This search engine exploits the background knowledge contained in the ontology (e.g. synonyms). This allows the system to refine the semantic analysis of the document and thus provides a fine-grained document categorization. For example, from the concept bomb extracted from a document, we could deduce through this process whether the document is about “bomb construction” or “bomb explosion”.

The taxonomy was first built manually using the MindManager tool and then exported in XML format to the ADAC environment. It contains about 200 concepts and relations. This structure is directly exploitable by the ADAC agents (parsers and algorithms).
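A minimal sketch of loading such an exported hierarchy is given below. The element and attribute names are assumptions for illustration; the actual MindManager export schema may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML export of a (tiny) slice of the taxonomy.
XML = """<concept name="terrorism">
  <concept name="weapons">
    <concept name="WMD">
      <concept name="chemical"/><concept name="biological"/>
    </concept>
  </concept>
  <concept name="tactics"/>
</concept>"""

def load(elem, level=1, out=None):
    """Flatten the concept tree into (name, level) pairs, depth-first."""
    out = [] if out is None else out
    out.append((elem.get("name"), level))
    for child in elem:
        load(child, level + 1, out)
    return out

concepts = load(ET.fromstring(XML))
print(concepts)  # [('terrorism', 1), ('weapons', 2), ('WMD', 3), ...]
```

The resulting (concept, level) pairs are exactly what the level-based weighting of Section 4.3 consumes.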


While our first prototype focuses on the restricted domain of terrorism, existing military ontologies or taxonomies could be exploited in order to validate our approach on documents covering various military topics (e.g. the US CALL thesaurus from the Center for Army Lessons Learned, which contains 17,000 terms from the military domain).

However, given that ontology building is a time-consuming activity, automatic or semi-automatic enrichment of the domain ontology would be of interest, in order to learn new concepts using machine learning techniques. This is a very active research domain. Moreover, validating our approach with larger ontologies that better reflect the domain knowledge (as well as with larger sets of documents) is the objective of future work.


4. The processing algorithms

As mentioned in the introduction, three specific tools are implemented in ADAC to diagnose and categorize a document. The diagnostic module intercepts relevant documents matching criteria provided by the user in order to execute an appropriate action; the filtering/classification module classifies a document into pre-specified hierarchical classes; and the clustering module assigns a document to the most similar group of previously grouped documents. In the following sections, we describe how a document is represented before being processed by these tools, and we present the developed algorithms.

4.1 Representation of the document’s DNA

Any document intercepted by the ADAC agents is summarized and its most pertinent concepts are extracted. Then, using the ontology and its related semantic networks, we extract the document’s DNA, which consists of statistical measurements on the document. A document’s DNA (text DNA) can be represented as shown in Figure 3. For each concept, we measure a confidence level, indicating the degree to which the concept is represented within the document, along with a standard error. This matrix is then attached to the document throughout its journey within ADAC. It is also possible to export the data to other applications.

A document is represented by a matrix of statistical measurements on the extracted concepts, based on the ontology and its related semantic networks. This output of the ontology/semantic tools can be schematized as shown in Figure 3:

Concept              Confidence Level (CL)    Standard Deviation
Concept 1 (C1)       μ_C1(d)                  σ_C1(d)
Concept 2 (C2)       μ_C2(d)                  σ_C2(d)
...                  ...                      ...
Concept i (Ci)       μ_Ci(d)                  σ_Ci(d)
...                  ...                      ...
Concept n-1 (Cn-1)   μ_Cn-1(d)                σ_Cn-1(d)
Concept n (Cn)       μ_Cn(d)                  σ_Cn(d)

Figure 3. Document DNA

Where:


• Concept i (C_i): represents the i-th concept in the ontology. We define C_ij as the j-th synonym of concept i;

• μ(C_j): average confidence degree (DC) over all requests on concept j (and eventually on its synonyms);

• σ(C_j): standard deviation of the confidence degrees (DC) over all requests on concept j (and eventually on its synonyms).

We also define the following notation:

• R_q: the q-th user request;

• DC_q(C_j): confidence degree measurement returned by the semantic/ontology tool (DELPHES) for concept j upon request R_q (and eventually on its synonyms).

Thus, the formatted representation of a document, used as the input of the diagnostic/classification algorithms, is given by μ(C_j) and σ(C_j) for all the concepts. When synonyms are considered, the confidence degree and the standard deviation are averaged over the requests made on each synonym of the concept.
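The construction of this representation can be sketched as follows (an illustrative sketch: the confidence values are invented, and the population standard deviation stands in for the report's deviation measure):

```python
from statistics import mean, pstdev

def dna(confidences_per_concept):
    """Map each concept C_j to (mu(C_j), sigma(C_j)) computed over the
    confidence degrees DC_q(C_j) returned for the concept and its synonyms."""
    return {c: (mean(vals), pstdev(vals))
            for c, vals in confidences_per_concept.items()}

# Invented confidence degrees for two concepts (one value per request):
doc = {"bomb": [0.9, 0.7], "hostage": [0.2, 0.0, 0.1]}
print(dna(doc))  # e.g. bomb -> (mu ~0.8, sigma ~0.1)
```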

4.2 The diagnostic module

The goal of the diagnostic module is to intercept documents of particular interest, where that interest is expressed by a request. Every document is analyzed by the diagnostic tool to determine its relevance relative to a specific query (or a set of queries). The diagnostic is based on the idea of a continuous query. Documents and queries are represented using the DNA matrix-space model. Using the diagnosis interface, the user specifies the keywords or concepts to be retrieved in documents. The diagnostic is based on the similarity between the query (q) and the processed document (d). This similarity is computed by comparing the two DNAs. A fuzzy similarity index is then computed as follows [Belacel, 2000]:

I(d,q) = Σ_{j=1}^{m} I_j(d,q) = Σ_{j=1}^{m} w_j × C_j(d,q) × (1 − D_j(d,q))

The concordance and discordance indexes (C_j(d, p_i^h) and D_j(d, p_i^h)) are computed at the concept level, by comparing the scores between the prototype i (of a certain class h) and the processed document d. Comparisons are made on the averages of the confidence degrees for the concordance, and on the standard errors for the discordance. For more details, consult [Tiliki, 2003] and Appendix E. C_j(d,q) and D_j(d,q) take values between 0 and 1. The w_j are the weight coefficients associated with the query concepts or keywords, reflecting the relative importance of each concept retrieved in the document; 0 ≤ w_j ≤ 1 and Σ_j w_j = 1.

The diagnostic task is implemented like the classification, with the difference that the DNA of the processed document is compared not to the DNAs of class prototypes but to that of a request prototype. In other words, the diagnosis is based on the similarity between the query (q) and the processed document (d), computed by comparing the two DNAs (document and prototype of a specific request) using a fuzzy similarity index. It is then easy to use the α-cut concept to validate the diagnosis. For example, it is possible to consider a fixed α-cut threshold (e.g., 60%): each time the fuzzy similarity index exceeds this threshold, an appropriate action (specified by the user) is triggered.
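The diagnostic decision can be sketched directly from the index above (the concordance and discordance values below are toy inputs; in ADAC they come from the PROAFTN-style comparison of the two DNAs):

```python
def fuzzy_index(weights, concord, discord):
    """I(d,q) = sum_j w_j * C_j(d,q) * (1 - D_j(d,q))."""
    return sum(w * c * (1.0 - D)
               for w, c, D in zip(weights, concord, discord))

def diagnose(weights, concord, discord, alpha=0.6):
    """Alpha-cut: trigger the user-specified action when I(d,q) >= alpha."""
    return fuzzy_index(weights, concord, discord) >= alpha

# Three query concepts; weights sum to 1:
w, C, D = [0.5, 0.3, 0.2], [0.9, 0.8, 0.4], [0.0, 0.1, 0.5]
print(fuzzy_index(w, C, D))  # 0.706 (up to float rounding)
print(diagnose(w, C, D))     # True: 0.706 >= 0.6
```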

4.3 Classification/Filtering

The classification problem can be seen as the assignment of a document to a predefined category or class. Fuzzy classification is the process of assigning a document to a set of predefined categories, where a document can belong to several categories or to none. In this section, we focus on text categorization, the process of organizing a set of documents into categories.

A number of automatic classification methods have been proposed and evaluated [Goller et al. 2000]. These methods range from rule-based systems to learning approaches such as Bayesian probability, support vector machines, neural networks, or k-nearest neighbours, to name a few. The classes can be characterized by prototypes, and can be linked hierarchically or defined without any order relationship. For example, in classifying documents relating to a specific topic like terrorism, pages about Afghanistan may be grouped into a specific category called “Sub-region Asia Zone”. This category could belong to a higher-level category called "Asia Zone", which in turn is part of "Terrorist Training Camp", and so on. This type of hierarchical classification can generally be viewed as a tree diagram, with a root node at the top level and more fine-grained categories lower down. Such a tree formalizes the domain knowledge and the interrelations between the concepts that characterize it.

The proposed algorithm handles both options (hierarchy or no relationship). The user can fix some prototypes for each class using well-known examples whose classification he is satisfied with. The system learns from those past examples and automatically builds the prototypes’ profiles. Each prototype’s profile is represented by an appropriate DNA. Any incoming document’s DNA is then compared with each prototype’s DNA of each class, as shown in Figure 4.


[Figure 4, not reproduced here, depicts the classification process: the DNA matrix of an incoming electronic document d (concepts C1..Cn with confidence levels μ_Ci(d) and standard deviations σ_Ci(d)) is compared against the DNA matrix of each prototype p_i^h of each class h.]

Figure 4. Classification process

Let us consider that:

• 1 ≤ j ≤ n : n is the number of concepts present in the ontology (hierarchy) H;

• p_j : the number of synonyms of concept j (j = 1, 2, ..., n).

When the document is processed for diagnostic, classification and clustering, its similarity to a prototype or a specific query is examined. To compute a similarity measure, three parameters are necessary: μ(C_j) and σ(C_j), defined above, and w_j, which represents the weight (importance) associated with concept j in the user’s request or in the ontology’s hierarchy.

The classification/filtering process is inspired by the PROAFTN9 method [Belacel, 2000]. According to this assignment process, a fuzzy similarity index is computed as follows:

At the local level (i.e., at the level of concept j):

I_j(d, p_i^h) = w_j × C_j(d, p_i^h) × (1 − D_j(d, p_i^h))

At the global level (i.e., for the whole document):

I(d, p_i^h) = Σ_{j=1}^{m} I_j(d, p_i^h) = Σ_{j=1}^{m} w_j × C_j(d, p_i^h) × (1 − D_j(d, p_i^h))

with m the total number of concepts retrieved in the document, and w_j the weight assigned to concept j according to its position in the hierarchy. Consider the following hierarchy H, containing 7 elements (concepts) distributed over three levels (Figure 5):

9 PROAFTN: PROcédure d’Affectation Floue pour le Tri Nominal

Figure 5. Example of a 3-level hierarchy

Two weighting methods are considered. In the first one, the decision maker (the user) expresses his preferences with respect to the importance attached to each concept of the hierarchy. This assignment of the weights is done when formulating the requests for extracting confidence degrees of concepts. The second method, called the objective method, helps the filtering/classification module classify the objects (documents) into the right categories or classes, and works as follows.

Principle of the weight assignment: only the concepts present simultaneously in the document and in the prototype are considered to evaluate the weights, and the following notation is used. Suppose the following weight set for a given prototype p_i^h:

W^{p_i^h} = { w_1^{p_i^h}, w_2^{p_i^h}, ..., w_j^{p_i^h}, ..., w_{n-1}^{p_i^h}, w_n^{p_i^h} }

with Σ_{j=1}^{n} w_j^{p_i^h} = 1.

Here n represents the number of concepts present in the prototype. Thus, for a given concept j:

w_j^{p_i^h} = Pres_Concept_j_prot × (1 / Nb_Levels_prot) × (1 / Nb_Concepts_Level_j)

Where:

• w_j^{p_i^h} is the weight assigned to concept j in the class of the prototype p_i^h;

• Pres_Concept_j_prot is a binary variable (1 if concept j is present in the prototype p_i^h, 0 otherwise);

• Nb_Concepts_Level_j is the number of concepts located on the same level as concept j in the hierarchy;

• Nb_Levels_prot is the number of levels actually present in the prototype: Nb_Levels_prot = (Total_Nb_Levels_Hier) − (Level_First_Concept_prot);

• Total_Nb_Levels_Hier is the total number of levels of the hierarchy (ontology);

• Level_First_Concept_prot is the level of the first concept met in the prototype p_i^h.

An example is presented below to illustrate how this second weighting method works. It consists of two phases: an identification, enumeration and localization phase, and a distribution (repartition) phase. Consider the following prototype p_i^h, used as reference:

Prototype p_i^h → objective weighting (P1):

C1 present (first level of the prototype, 1 concept on level):  w_1 = 1 × (1/3) / 1 = 0.33333
C2 present (second level, 2 concepts on level):                 w_2 = 1 × (1/3) / 2 = 0.16665
C3 absent (second level):                                       w_3 = 0 × (1/3) / 2 = 0.00000
C4 present (third level, 4 concepts on level):                  w_4 = 1 × (1/3) / 4 = 0.08333
C5 present (third level, 4 concepts on level):                  w_5 = 1 × (1/3) / 4 = 0.08333
C6 present (third level, 4 concepts on level):                  w_6 = 1 × (1/3) / 4 = 0.08333
C7 present (third level, 4 concepts on level):                  w_7 = 1 × (1/3) / 4 = 0.08333

The method takes into account only the concepts that are present at the same time in the prototype and in the processed document; this allows a new repartition of the w_j^{p_i^h} values. In our example, only six concepts are considered (C1, C2, C4, C5, C6 and C7), and the readjustment to 100% is done in the following way:

C1 → w_1 = 0.33333 → 0.33333·a
C2 → w_2 = 0.16665 → 0.16665·a
C3 → w_3 = 0.00000 → 0.00000·a
C4 → w_4 = 0.08333 → 0.08333·a
C5 → w_5 = 0.08333 → 0.08333·a
C6 → w_6 = 0.08333 → 0.08333·a
C7 → w_7 = 0.08333 → 0.08333·a
----------------------------------
Σ_{j=1}^{m} w_j^{p_i^h} × a = 0.8333·a = 100 ⇒ a = 120

where a represents a repartition multiplier.

The results after distribution are:

C1 → w_1 = 0.33333 × 120 = 40.00%
C2 → w_2 = 0.16665 × 120 = 20.00%
C3 → w_3 = 0.00000 × 120 = 00.00%
C4 → w_4 = 0.08333 × 120 = 10.00%
C5 → w_5 = 0.08333 × 120 = 10.00%
C6 → w_6 = 0.08333 × 120 = 10.00%
C7 → w_7 = 0.08333 × 120 = 10.00%
----------------------------------
Σ_{j=1}^{m} w_j^{p_i^h} = 100.00%

This weighting method confers high scores on the concepts located at higher levels of the hierarchy, and gives the same weight to those located on the same level. The logic behind this objective weighting method is that, for a given class C^h, it is significant to give a greater importance (i.e., a larger weight) to the first, most general concept of the prototype, which identifies the class to which this prototype belongs. It then assigns weights that decrease as we go down towards more specific concepts in the hierarchy. This approach respects the hierarchy logic and considers, during the document-prototype comparison process, only the concepts that are simultaneously present in both. In this way, the prototype plays its full role of reference.
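The worked example above can be checked mechanically. The sketch below recomputes the objective weights w_j = Pres_j × (1/Nb_Levels) / Nb_Concepts_Level_j and the renormalization over the concepts present in both prototype and document:

```python
def objective_weights(present, level_of, nb_levels, concepts_per_level):
    """present[j] in {0,1}; level_of[j] = hierarchy level of concept j."""
    return [p * (1.0 / nb_levels) / concepts_per_level[level_of[j]]
            for j, p in enumerate(present)]

def renormalize(weights):
    """Rescale the retained weights so they sum to 100%."""
    total = sum(weights)  # 0.8333... in the example, so a = 100/total ~ 120
    return [100.0 * w / total for w in weights]

present = [1, 1, 0, 1, 1, 1, 1]          # C1..C7, with C3 absent
level_of = [1, 2, 2, 3, 3, 3, 3]
w = objective_weights(present, level_of, 3, {1: 1, 2: 2, 3: 4})
print([round(x, 2) for x in renormalize(w)])
# [40.0, 20.0, 0.0, 10.0, 10.0, 10.0, 10.0]
```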

As C_j(d,q) and D_j(d,q) take values between 0 and 1, the local indifference index I_j(d, p_i^h) also takes values between 0 and 1. Thus, the global indifference index varies as 0 ≤ I(d, p_i^h) ≤ m. In order to express I(d, p_i^h) as a percentage score, we divide it by the total number m of concepts retrieved in the document. The score characterizing the membership of a document in a given class is then defined as the membership degree, given by:

I(d, p_i^h) = ( I(d, p_i^h) / m ) × 100 (%)     (5.6)

If many prototypes describe a class, the membership function of a document to a specific class is obtained by considering the maximum similarity to any of these prototypes. This is given by:

m(d, C^h) = max{ I(d, p_1^h), I(d, p_2^h), ..., I(d, p_{L_h}^h) }     (5.7)

and the decision to assign exclusively a document to a specific class is based on the following rule:

m(d, C^h) = max{ m(d, C^1), m(d, C^2), ..., m(d, C^l), ..., m(d, C^k) } ⇔ d ∈ C^h     (5.8)
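Rules (5.7) and (5.8) amount to two nested maxima, as in the sketch below (the index values are toy numbers standing in for the I(d, p_i^h) scores):

```python
def membership(proto_scores):
    """(5.7): membership of d in class h = max over the class's prototypes."""
    return max(proto_scores)

def assign(class_protos):
    """(5.8): assign d to the class with maximal membership degree."""
    return max(class_protos, key=lambda c: membership(class_protos[c]))

scores = {"training_camp": [42.0, 55.0],   # two prototypes for this class
          "financing": [61.0, 30.0],
          "attack": [48.0]}
print(assign(scores))  # financing (membership 61.0 beats 55.0 and 48.0)
```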

At the end of the classification/filtering process, a forced classification is carried out to link the document to the upper classes (or concepts) in the hierarchy. This facilitates correlation of information between the document’s content and other logically related concepts.

The various approaches proposed in this work differ in the manner of considering the variable μ(C_j) (the degree of confidence of a concept) and the prototypes. Some of these approaches treat the case where the model variable μ(C_j) is considered continuous, interval-valued or discrete. Another approach deals with the case where the prototype of the class is pure, i.e., the confidence degrees of all the concepts present in the prototype are 100%. However, the majority of approaches consider prototypes with intermediate values (between 0% and 100%). Certain approaches are also distinguished by the way in which the averages of the confidence degrees and their respective standard deviations are interpreted: the parametric approach, from the statistical point of view, considers the average and standard deviation of the confidence degrees as parameters of a well-known distribution, while the non-parametric approach considers them as representing an interval.

4.4 The Clustering module

Clustering is an unsupervised learning method of classification that seeks to identify similar objects in a multi-dimensional space [Everitt, 1993]. The difference between clustering and classification lies in whether the categories or classes are predefined. The clustering problem consists of finding the partition of a set of documents into subsets (categories) that best groups the documents. The problem can be decomposed into a partitioning problem (creating categories) followed by an assignment problem of documents to those categories. By definition, the clustering algorithm requires a database containing many documents; it could be executed each time the number of intercepted documents reaches a given threshold.

32 DRDC Valcartier TR 2004-265

Two algorithms are proposed in this work: the first is based on genetic algorithms [Goldberg, 1989], and the second on the variable neighborhood search approach [Hansen and Mladenović, 2001]. Both use the agglomerative hierarchical clustering technique, which starts with the documents as individual clusters and at each step merges the most similar or closest pair of clusters. Generally, an objective function is optimized at each step. The approach used in this study is described below.

4.4.1 Problem formulation

The goal of a clustering process is to produce a partition where the documents within each cluster are very similar and the documents across different clusters share very few concepts. Let C_l represent cluster l with cardinality n_l, i.e., |C_l| = n_l. We define the similarity within a cluster as the minimum similarity between any pair of its elements (intra-similarity). This can be formulated as Sim(C_l) = Š(C_l) = min I(d_i, d_j), ∀ d_i, d_j ∈ C_l. Let α_0 be a similarity threshold fixed by the user if desired, and let Δ_jh be the distance (or inter-similarity) between two distinct clusters C_j and C_h.

The multi-objective constrained problem is stated as follows:

(Pcl):
    max Š(C_j)
    min k
    max Δ_jh, over j, h
    s.t.
    Š(C_j) ≥ α_0,  j = 1, …, k
    n(C_j) = n_j ≥ 0,  j = 1, …, k
    k ≤ k_0
    Δ_jh ≥ Δ_0,  j, h = 1, …, k

where k_0 is a user-defined maximum number of clusters and Δ_0 a minimum inter-cluster distance threshold.
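The constraint set of (Pcl) can be checked for a candidate partition with a short sketch; the function and parameter names below are hypothetical, and the intra-similarity and inter-distance computations are passed in as callbacks:

```python
def feasible(partition, intra_sim, inter_dist, alpha0, delta0, k0):
    """Check the constraints of (Pcl) for a candidate partition:
    at most k0 non-empty clusters, each compact enough (>= alpha0),
    and every pair of clusters distant enough (>= delta0)."""
    k = len(partition)
    if k > k0 or any(len(c) == 0 for c in partition):
        return False
    if any(intra_sim(c) < alpha0 for c in partition):
        return False
    return all(inter_dist(partition[j], partition[h]) >= delta0
               for j in range(k) for h in range(j + 1, k))
```

The objectives themselves (maximizing Š(C_j), minimizing k, maximizing Δ_jh) are then compared only among feasible partitions.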

The originality of the two clustering algorithms proposed in this work lies in their multicriteria formulation: the number of clusters, unknown a priori, is to be minimized, while the intra-cluster similarities and the inter-cluster distances are to be maximized. The clustering process applied in both algorithms is a hierarchical agglomerative technique.


We define the intra-cluster similarity Š(C_l), or similarity of the objects within a cluster, as the minimum similarity between any two elements of the cluster; it characterizes the cluster's compactness and is expressed as:

Version 1:

Š (Cl) = min I(di,dj); ∀ di , dj ∈Cl (5.9)

Version 2:

Each document d_j in cluster C_i, containing n documents, is assigned a global similarity:

SG(d_j) = (1 / (n − 1)) Σ_{d_k ∈ C_i, k ≠ j} I(d_j, d_k)

And the cluster’s evaluation is the lowest similarity:

Š(C_i) = min{ SG(d_1), SG(d_2), …, SG(d_n) },  d_j ∈ C_i

where I(d_i, d_j) is the indifference index between documents d_i and d_j. This index is calculated as in the classification/filtering module, except that two documents are compared instead of a document and a prototype. The higher this coefficient, the better the clustering, as expressed in problem (Pcl) above.
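Both versions of the intra-cluster similarity can be sketched as follows; the documents and indifference values in the example are illustrative only, with the indifference index supplied as a callback:

```python
from itertools import combinations

def intra_sim_v1(cluster, indiff):
    """Version 1 (eq. 5.9): minimum pairwise indifference index."""
    return min(indiff(a, b) for a, b in combinations(cluster, 2))

def intra_sim_v2(cluster, indiff):
    """Version 2: each document gets a global similarity SG (its mean
    indifference to the other documents of the cluster); the cluster
    score is the lowest SG."""
    n = len(cluster)
    def sg(j):
        return sum(indiff(cluster[j], cluster[k])
                   for k in range(n) if k != j) / (n - 1)
    return min(sg(j) for j in range(n))

# Symmetric toy indifference index between three documents.
I = {frozenset({"d1", "d2"}): 0.8,
     frozenset({"d1", "d3"}): 0.4,
     frozenset({"d2", "d3"}): 0.6}
indiff = lambda a, b: I[frozenset({a, b})]
print(intra_sim_v1(["d1", "d2", "d3"], indiff))  # 0.4
print(intra_sim_v2(["d1", "d2", "d3"], indiff))  # 0.5
```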

We introduce the inter-cluster dissimilarity between clusters to reflect how different the clusters are from each other (they contain documents whose pairwise similarities differ). This can be expressed by a “distance” between the clusters. Three kinds of distance defined in the literature are tested in this work:

• The Jaccard coefficient, also called the Tanimoto coefficient:

Δ_lh = J(C_l, C_h) = |C_l ∩ C_h| / ( |C_l| + |C_h| − |C_l ∩ C_h| )

This distance estimates the number of documents two clusters have in common relative to the total number of documents. It operates on sets.

• The cosine function [Salton and McGill, 1983]:

Δ_lh = |C_l ∩ C_h| / √( |C_l| × |C_h| )


This measure is equivalent to the cosine measurement between the vectors representing the clusters.

• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) [Jain and Dubes, 1988]:

UPGMA(C_j, C_k) = (1 / (|C_j| × |C_k|)) Σ_{d_i ∈ C_j, d_z ∈ C_k} I(d_i, d_z)

This coefficient estimates the similarity between two clusters not by counting the number of documents in common but by comparing the pair-wise similarity of the documents from each cluster.
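The three inter-cluster measures can be sketched on clusters represented as sets of document identifiers (for UPGMA, the pairwise indifference index is supplied as a callback). This is an illustrative reading of the formulas; in particular, the set form of the cosine coefficient is taken here with the square root of the product of the cluster sizes:

```python
import math

def jaccard(A, B):
    """Jaccard / Tanimoto coefficient between two clusters seen as sets."""
    inter = len(A & B)
    return inter / (len(A) + len(B) - inter)

def cosine(A, B):
    """Set form of the cosine coefficient [Salton and McGill, 1983]."""
    return len(A & B) / math.sqrt(len(A) * len(B))

def upgma(A, B, indiff):
    """UPGMA [Jain and Dubes, 1988]: average pairwise indifference
    between the documents of the two clusters."""
    return sum(indiff(a, b) for a in A for b in B) / (len(A) * len(B))

A, B = {"d1", "d2", "d3"}, {"d2", "d3", "d4"}
print(jaccard(A, B))             # 0.5
print(round(cosine(A, B), 3))    # 0.667
```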

4.4.2 Clustering using genetic algorithms

The clustering algorithm consists of three stages: initialization, clustering improvement using genetic algorithms to maximize Š(C_l), and merging clusters to minimize k and maximize Δ_lh.

• Stage 1: Initial clustering

This is done by considering each document d_i as an initial cluster; the closest documents d_j are assigned to this cluster if their similarity I(d_i, d_j) is greater than a threshold α_0. Let us denote this initial number of clusters k(t=0) = N.

• Stage 2: Clustering improvement using genetic algorithms (GA)

GA is a meta-heuristic that works with a population of solutions encoded as chromosomes, each characterized by a fitness that returns a quantitative measure of its “goodness” [Goldberg, 1989]. In this work, a cluster is a chromosome encoding the membership of each document: a binary N-bit string where the ith bit is set to one if the ith document is present in the cluster and to zero otherwise. The fitness of each cluster, defined by Š(C_l), must be maximized. A linear static scaling of this objective function is used to ensure that the population does not become dominated early by the descendants of a single super-fit chromosome, and that later, when the population's average fitness is close to that of the best chromosome, competition among chromosomes remains driven by the principle of the survival of the fittest rather than becoming purely stochastic [De Jong, 1976].

Evolution of the population, i.e. improvement of the fitness of solutions, is achieved using two principal operators. The first is crossover, which combines parts of the fittest chromosomes of the previous generation to generate new individuals. Two types of crossover are used alternately: the uniform crossover [Syswerda, 1989] and the partially mapped crossover [Goldberg and Lingle, 1985]. The second is mutation, which introduces new information at random with a low probability by flipping the value of a single bit (from one to zero or vice versa) at a randomly chosen position in the string.

The selection of clusters for reproduction, crossover and mutation is done by using the stochastic remainder selection without replacement [De Jong 1976; Goldberg, 1989]. The probability of selection is calculated by:

p_s(C^l) = S̃(C^l) / Σ_{l=1}^{k(t)} S̃(C^l), where S̃ denotes the scaled fitness.
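The selection rule above can be sketched as follows; this is one common reading of stochastic remainder selection without replacement, and the scaled fitness values are taken here as an already-computed list:

```python
import random

def selection_probabilities(fitnesses):
    """p_s(C^l) = S(C^l) / sum of S(C^l) over all clusters."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def stochastic_remainder_selection(fitnesses, rng=None):
    """Stochastic remainder selection without replacement [De Jong 1976;
    Goldberg, 1989]: each chromosome receives int(expected) copies
    deterministically, then one extra copy with probability equal to the
    fractional remainder of its expected count."""
    rng = rng or random.Random(0)
    n = len(fitnesses)
    expected = [p * n for p in selection_probabilities(fitnesses)]
    mating_pool = []
    for i, e in enumerate(expected):
        mating_pool.extend([i] * int(e))      # deterministic part
        if rng.random() < e - int(e):         # fractional remainder
            mating_pool.append(i)
    return mating_pool
```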

• Stage 3: Merging clusters

At this stage of the procedure, we try to reduce the number of clusters obtained from the second stage while maximizing the distance between clusters. This is done by evaluating the inter-cluster dissimilarity Δ_lh between all pairs of clusters and merging the closest ones (those with the lowest Δ_lh). The new number of clusters is then k, and the procedure is repeated from stage 2.

The clustering is stopped if the number of iterations t reaches the maximum or if the number of clusters cannot be reduced without degrading the intra-cluster similarity.

4.4.3 Clustering using variable neighborhood search method

A new local search heuristic called INTER-CLUSTER is proposed, with a neighborhood structure defined both by cluster merging and by document interchange. This INTER-CLUSTER local search is embedded in a variable neighborhood search (VNS) metaheuristic to alleviate the difficulty of being stuck in local minima of poor value. The clustering process is similar to the previous genetic-algorithm-based method: a first step of initial clustering, two stages of clustering improvement and cluster merging processed by INTER-CLUSTER, and a stage of solution improvement processed by the VNS metaheuristic.

For notation convenience, let us define the following variables used by this algorithm:

• D = {d_1, d_2, …, d_n} is the set of n documents or texts in a z-dimensional space;

• C = {c^1, c^2, …, c^m} is the set of clusters to be determined by INTER-CLUSTER;

• I(d_i, d_j) = I_ij is the indifference index between documents i and j;

• c^l represents the cluster l with cardinality n_l, i.e., |c^l| = n_l;

• α_0 is the similarity threshold between two documents.


The INTER-CLUSTER heuristic is based on a local neighborhood search. The different steps of the local search are as follows:

• Step 1 (Initialization):

The initial solution is a set of clusters S_opt = {C_opt, f_opt}, with C_opt = {c^1, c^2, …, c^m} constructed as presented in Stage 1 of subsection 4.4.2, and f_opt the evaluation of this solution: f_opt = { min_{l ∈ C} Š(c^l), |C|, min_{l,h ∈ C} Δ_lh }.

• Step 2 (Interchange the documents):

Find the documents from different clusters to be interchanged with another document as follows: sort the clusters in C so that the first cluster has the lowest Š and the last cluster the highest Š.

• Step 2.1 (candidate cluster):

Select the first cluster as the candidate cluster c^l*, such that c^l* = arg min {Š_l}, i.e. the cluster with the lowest Š_l.

• Step 2.2 (Candidate document)

Find the document d_i* in this candidate cluster c^l* corresponding to:

d_i* = arg min_{i = 1, …, n_l*} Š(d_i), where Š(d_i) = Š(c^l* − {d_i})

i.e. the document corresponding to the minimum value of the function Š.

If d_i* is present in another cluster c^h, delete it from c^l* and go to Step 3; otherwise go to Step 2.3.

• Step 2.3 (search for the host cluster)

Rank the clusters in C from the nearest to c^l* to the most distant, so that if c^k is the nearest one then Δ_kl* = min Δ_ij over all c^i, c^j in C.

Add d_i* to c^k if the following inequality is respected:

Sim(c^k ∪ {d_i*}) ≥ Š_k, and go to Step 3.

Otherwise find the next nearest cluster to c^l* that respects this constraint, add d_i* to it, and go to Step 3. If no host candidate is found, create a new cluster with this document, select a new candidate cluster new_c^l* with the next worst similarity Š, and go to Step 2.2.


• Step 3 (Local improvement)

Apply merge(C°, f); update the corresponding value of the objective function f and the solution S° = {C°, f}, where C° is the new set of clusters.

• Step 4 (Stop or move)

If f < f_opt, save the new current best solution and its value, i.e., set S_opt = S° and f_opt = f, and return to Step 2; otherwise stop with a local minimum (S_opt, f_opt).

There is no guarantee that the final solution obtained by the INTER-CLUSTER heuristic is globally optimal. We alleviate this difficulty by using the variable neighborhood search meta-heuristic to further improve the solution found. This procedure is presented below.

The basic idea of variable neighborhood search (VNS) is to proceed to a systematic change of neighborhood within the INTER-CLUSTER local search algorithm. The algorithm remains at the same locally optimal solution, exploring increasingly distant neighborhoods of it by random generation of a point followed by descent, until a solution better than the incumbent is found. The neighborhood structures we use are obtained by interchanging k documents of cluster l with k documents of cluster h, both chosen at random. We denote by N_k (k = 1, …, k_max) the set of such neighborhood structures and by N_k(C*) the set of solutions forming neighborhood N_k of a current solution C*.

The steps of VNS for documents clustering are presented as follows:

• Step 1 (Initialization)

Let C* and f be the set of clusters and the current objective function value. Set C_opt := C* and f_opt := f. Choose a stopping condition and a value for the parameter k_max.

• Step 2 (Termination)

If the stopping condition is met, stop.

• Step 3 (First neighbourhood)

Set k =1

• Step 4 (Inner loop)

If k > kmax, return to Step 2.

• Step 5 (Perturbation)


Generate a solution C′ at random from N_k(C) (C′ ∈ N_k(C)), i.e., select k documents of cluster l and k documents of cluster h at random (the clusters l and h can be chosen at random or using some criterion, for example the inter-cluster distance or similarity), then interchange these documents.

• Step 6 (Local search)

Apply the INTER-CLUSTER local search (with C’ as initial solution); denote the resulting solution and objective function value with C” and f”, respectively.

• Step 7 (Move or not)

If f” < fopt, then re-center the search around the better solution found (fopt := f” and Copt := C”) and go to Step 3; otherwise set k := k + 1 and go to Step 4.

The stopping criterion may be a maximum number of iterations without improvement of f_opt, or reaching a prespecified number of clusters. In Step 5 the incumbent solution C is perturbed so that the distance between the two solutions C and C′ equals k. We then use C′ as the initial solution for the INTER-CLUSTER local search in Step 6. If a better solution C″ is obtained, we move there and start again with small perturbations of this new best solution (i.e., C := C″, k := 1). Otherwise we increase the distance between C and the new randomly generated point, i.e., we set k := k + 1. If k becomes greater than k_max, we return to Step 2, and the inner-loop steps iterate as long as the stopping criterion is not met.
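The VNS loop (Steps 1 to 7) can be sketched as a generic driver; `f`, `shake` and `local_search` are problem-specific callbacks whose signatures here are assumptions rather than the report's exact implementation, with `shake` drawing a random solution from neighborhood N_k and `local_search` standing in for INTER-CLUSTER:

```python
import random

def vns_clustering(C0, f, shake, local_search, k_max, max_no_improve, rng=None):
    """Generic VNS driver: shake the incumbent within neighborhood N_k,
    run the local search, and re-center on improvement (Steps 1-7)."""
    rng = rng or random.Random(0)
    best, f_best = C0, f(C0)                 # Step 1: initialization
    stall = 0
    while stall < max_no_improve:            # Step 2: stopping condition
        k = 1                                # Step 3: first neighbourhood
        improved = False
        while k <= k_max:                    # Step 4: inner loop
            C1 = shake(best, k, rng)         # Step 5: perturbation
            C2 = local_search(C1)            # Step 6: local search
            if f(C2) < f_best:               # Step 7: move or not
                best, f_best = C2, f(C2)
                k, improved = 1, True
            else:
                k += 1
        stall = 0 if improved else stall + 1
    return best, f_best
```

With a toy objective that simply counts clusters and a shake that merges two random clusters, the driver reduces an initial set of singleton clusters to a single cluster before stopping.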


5. Empirical tests

The performance of the algorithms proposed in ADAC is examined in this chapter. To perform the empirical tests, we used a simulation tool developed by Intell@xiom inc. to test the algorithms separately, collections of documents from the web, and performance indicators from the literature. These tools for performance assessment and the empirical test results are presented below.

5.1 Metrics for performance assessment

To estimate the efficiency of the proposed algorithms in this work, several metrics from the literature are used.

For the classification, the most suitable metric is simply the percentage of correctly classified documents, computed on a database with known categories.

To evaluate the quality of the clusters obtained from the clustering algorithms, three metrics from the literature are chosen: the entropy, the purity [Zhao and Karypis 2001] and the F-measure [Larsen and Aone 1999].

Calculation of the entropy: The entropy quantifies how the various classes of documents are distributed within each cluster. First, the class distribution of the documents is computed: for each cluster j, the probability p_ij = n_j^i / n_j that a member of cluster j belongs to class i. Then, using this class distribution, the entropy of each cluster is calculated using the standard formula:

E(C_j) = − (1 / log q) Σ_{i=1}^{q} (n_j^i / n_j) log (n_j^i / n_j)

The total entropy for a set of clusters, the output of a clustering algorithm, is calculated by summing all these entropies weighted by the size of each cluster as given by the equation:

Entropy = Σ_{j=1}^{k} (n_j / n) E(C_j)

where q is the number of classes, n_j^i is the number of documents of the ith class assigned to the jth cluster, n_j is the number of documents in the jth cluster, n is the total number of documents, and k is the number of clusters obtained by the clustering method.


The smaller the entropy values, the better the clustering solution. A perfect clustering solution leads to clusters containing documents from a single class only, in which case the entropy is zero.

Calculation of the purity: The purity measures how many documents belonging to one class are contained in a cluster. It is simply the fraction of the cluster occupied by its largest class:

P(C_j) = (1 / n_j) max_i (n_j^i)

And the overall purity of the clustering solution is the weighted sum of all these individual purities:

Purity = Σ_{j=1}^{k} (n_j / n) P(C_j)

The notations are the same as those used for the calculation of the entropy. In general, the larger the purity value, the better the clustering solution.

Calculation of the F-measure: The F-measure evaluates the complete hierarchical tree (hierarchical agglomerative clustering) produced by the algorithms. The idea behind this measure is to treat every class i as a query and every cluster j as an answer to that query. Given a particular class i and a particular cluster j, their F-measure is defined as:

F(i, j) = 2 · Recall(i, j) · Precision(i, j) / ( Recall(i, j) + Precision(i, j) )

where Recall(i, j) = n_ij / n_i and Precision(i, j) = n_ij / n_j for class i and cluster j; n_ij is the number of documents of class i contained in cluster j, n_j the number of documents in cluster j, and n_i the number in class i.

This measure is calculated for every class with regard to all the clusters. Then for the complete clustering, we have:

F = Σ_{i=1}^{c} (n_i / n) max_j { F(i, j) }

with n the total number of documents and c the total number of classes. A perfect clustering solution produces clusters corresponding exactly to the classes in the hierarchical tree (containing exactly the same documents), in which case the F-measure is one. In general, the higher the F-measure value, the better the clustering solution.
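The three metrics can be sketched from a cluster-by-class contingency table `counts`, where `counts[j][i]` is the number of documents of class i in cluster j (the table in the example is hypothetical):

```python
import math

def cluster_entropy(counts_j, q):
    """E(C_j) = -(1/log q) * sum_i p_ij log p_ij, with p_ij = n_j^i / n_j."""
    n_j = sum(counts_j)
    e = 0.0
    for c in counts_j:
        if c:
            p = c / n_j
            e -= p * math.log(p)
    return e / math.log(q)

def cluster_purity(counts_j):
    """P(C_j): fraction of the cluster taken by its largest class."""
    return max(counts_j) / sum(counts_j)

def overall(metric, counts):
    """Size-weighted sum over all clusters (total entropy or purity)."""
    n = sum(sum(row) for row in counts)
    return sum((sum(row) / n) * metric(row) for row in counts)

def f_measure(counts):
    """F = sum_i (n_i / n) max_j F(i, j), with F(i, j) the harmonic mean
    of recall n_ij / n_i and precision n_ij / n_j."""
    q = len(counts[0])
    class_totals = [sum(row[i] for row in counts) for i in range(q)]
    n = sum(class_totals)
    total = 0.0
    for i, n_i in enumerate(class_totals):
        best = 0.0
        for row in counts:
            n_ij, n_j = row[i], sum(row)
            if n_ij:
                r, p = n_ij / n_i, n_ij / n_j
                best = max(best, 2 * r * p / (r + p))
        total += (n_i / n) * best
    return total

counts = [[3, 0], [0, 2]]  # a perfect two-class clustering
print(overall(lambda row: cluster_entropy(row, 2), counts))  # 0.0
print(overall(cluster_purity, counts))                       # 1.0
print(f_measure(counts))                                     # 1.0
```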


5.2 The simulation tool: TestBench

To study the efficiency of the classification and clustering algorithms, a simulation tool, TestBench, was developed by Intell@xiom inc. A user-friendly interface (Figures 6 to 8) allows selecting the data collection, the algorithm, and the associated parameters to carry out a simulation. Before presenting the test results, a brief introduction to each tool component is given below.

In Figure 6, the External data section of the window is used to select the collection for the tests. In the File type combo box, one can select the data type, for example frequency measurements associated with term or word labels, or simply frequency measurements. In the Test section, for classification, one can choose the number of prototypes per class, the name of the HTML file in which the results will be stored, and the method used to calculate the indifference index. Coef Concordance is not used for classification. At the end of the simulation, the results are shown in the Tests Results section.

Figure 6. TestBench’s interface for the categorization simulations


Figure 7. TestBench’s interface for the genetic clustering algorithm

Figure 8. TestBench’s interface for the VNS clustering algorithm

For the genetic clustering tests (Figure 7), the Genetic tab shows the parameters used by this algorithm: Nb. Clusters (a constraint of the problem), Loop Max (number of iterations), Nb copy (for the reproduction genetic operator), Min diff % (distance threshold to merge two clusters), Crossover, Mutation, Nbr Voisin (closest documents to be merged as initial clusters), and Coeff concordance (similarity threshold to group two documents) are the parameters of the clustering algorithms presented in section 5. Merging Method is the inter-cluster dissimilarity function used to evaluate the distance between two clusters. View log allows storing the results of the simulation in a text file.

In the Inter-cluster tab shown in Figure 8, one can select the parameters associated with the VNS clustering algorithm. Loop in inter is the number of neighborhood spaces. The Documents informations tab allows examining the detailed contents of documents and the basis on which the concordance between them is computed.

5.3 Tests results

5.3.1 Filtering/Categorization algorithm

Unfortunately, not all the proposed classification methods were tested. Indeed, we did not find a reference database with an appropriate ontology that would have allowed us to exploit the potential of the hierarchical representation of categories considered in these methods. Preliminary tests were performed for the classification algorithm using a document collection database12. These collections contain documents classified into categories and represented by term-space vectors; each vector contains the label of a term found in the document and its frequency, all the terms of the collection being labeled beforehand. Depending on the number of prototypes selected to represent a category and on the collection chosen, the percentage of correctly classified documents varies between 30 and 95%. Although these collections are not well adapted to the classification methods proposed for ADAC, the interesting conclusion we can draw from these tests is that the success of assigning documents to categories depends not only on the method used to evaluate similarity between a document and a prototype, but also on the choice of the prototypes representing the categories.

5.3.2 Clustering algorithms

The same database as for the classification tests is used for the clustering. The preliminary results show an entropy between 18 and 30%, a purity around 75-85%, and an F-measure between 0.5 and 0.75 for the VNS clustering algorithm. This performance is comparable to that reported in the literature [Zhao and Karypis 2001]. The genetic clustering algorithm performs less well: the mean entropy and purity measures are around 50% and the F-measure about 40%. Several features of these clustering algorithms could be improved; we recommend revising the solution encoding of the genetic algorithm, the intra-similarity measure used to evaluate a clustering solution, and the clustering initialization.

12 http://www.users.cs.umn.edu/~karypis/cluto/download.html


6. ADAC prototype functional architecture

The ADAC concept could be implemented in different configurations. The configuration we favoured in this work is based on the concept of a desktop knowledge management and decision support tool. This configuration is represented in Figure 9.


Figure 9. ADAC retained configuration

In this configuration, ADAC agents continually monitor the flow of information on different media (e.g. e-mail, Intranet, Internet, voice/electronic communications). Once a document has been intercepted by the recovery agent (Figure 11), it is copied into the ADAC database and processed through the different steps shown in Figure 10.


Figure 10. ADAC processing and analyzing scenario

DRDC Valcartier TR 2004-265 45

The ADAC recovery agent could be configured as represented in Figure 11. In this configuration, the agent is able to deal with documents produced in different formats (e.g. audio, image, text). The document is then processed according to the approaches/algorithms described previously.


Figure 11. ADAC recovery agent

ADAC has been implemented by the Quebec-based company Intell@xiom inc. It was developed within the IntellStudio environment and integrates several external services and algorithms, as shown in Figure 12. IntellStudio allows the interoperability of heterogeneous systems, COTS and GOTS, and supports all major communication protocols. It also offers tools to easily integrate knowledge-based and analytical decision rules into the document processing flow.


Figure 12. ADAC implementation concept

46 DRDC Valcartier TR 2004-265

The Copernic Summarizer is used to generate a comprehensive summary along with the ten most representative concepts of the document. The SIPINA Diagnos tool is used as a statistical tool; it offers several data mining and knowledge extraction tools (for structured or unstructured data). DioWeb Delphes is used within ADAC as an ontology-based semantic search engine over documents. The Algorithms external service invokes the three algorithms: diagnosis, classification and clustering. The architecture shown in Figure 13 is a detailed view of the implementation concept of Figure 12.


Figure 13. ADAC architecture

Figure 14 shows an example of the user interface for accessing ADAC in the desktop knowledge management and decision support configuration. In this configuration, the user can access multiple documents as well as the classification and clustering results. Advanced search engines are available to browse the database and create persistent queries for diagnosis. Alerts are displayed and external actions can be pre-programmed. A map viewer allows associating documents with specific geographical areas of the world.


Figure 14. ADAC Interfaces (example)

The ADAC prototype demonstrates how combining knowledge management and decision analysis concepts can lead to an advanced tool that helps military organizations cope with information overload while minimizing risks. ADAC proposes advanced concept demonstrators for structured and unstructured information management and offers several decision support tools to help understand information content and gain situation awareness.

The different algorithms developed require more fine-tuning and validation. We plan to perform several experiments to evaluate the algorithms' performance, faithfulness, effectiveness and efficiency. Other clustering and classification algorithms will also be developed and tested.

The ontology and the semantic networks will be reviewed with subject-matter experts at the NDCC and adapted to reflect their needs. A more advanced semantic search engine will be used to generate the document's DNA; the DNA itself will be improved and its measurement enhanced.

We are also looking to showcase ADAC in other configurations to support other R&D projects. Figure 15 shows two such configurations: Figure 15.A represents a LAN/WAN filtering system that monitors all traffic, structures the information and triggers alerts; Figure 15.B shows ADAC used as an input to a document management system (e.g. JIIMS, developed at DRDC Valcartier).



Figure 15. Other ADAC configurations


7. Conclusions

ADAC is a prototype that demonstrated the feasibility of an advanced automated document manager, analyzer, classifier and diagnosis tool. ADAC uses the concept of a document DNA, a matrix extracted using advanced ontology/semantic networking tools. Ontologies and semantic networks should therefore be carefully developed and validated with subject-matter experts.

Exploiting the relationships between the concepts in the ontology could enhance the semantic processing of documents. Building an ontology for a specific domain is a time-consuming task; our approach should be enriched so as to be able to learn new concepts (ontology learning).

We have proposed several original algorithms to perform diagnosis, classification and clustering. These algorithms should be extensively tested and validated. The IntellStudio development environment is a powerful tool for developing decision support systems: it offers flexible and easy access to external services (integration of COTS and GOTS), user-friendly interfaces supported by graphical modeling, and easy integration of knowledge-based and analytical rules.

It is important to integrate automated learning algorithms to improve ADAC's flexibility and its usefulness in effectively supporting end-users. It is also important to integrate services that process synonyms, advanced semantic networks, image processing, speech-to-text, and intelligent character recognition. We will also investigate ways to improve external actions, such as information visualization and generating alerts in different formats, including wireless messages.

Humans always put together pieces of information from different sources to learn new concepts and understand documents’ contents. It is recommended to pursue research into advanced ways to network documents and improve learning by putting together pieces of information from different sources. Bayesian networks are seen as a potential way to implement such concepts.


References

1. [Abiteboul and Vianu 1997] S. Abiteboul and V. Vianu, Queries and Computation on the Web. In F. Afrati and P. Kolaitis, editors, Database Theory – ICDT'97, pages 262-275, 1997.

2. [Abiteboul et al 1993] S. Abiteboul, S. Cluet, and T. Milo, Querying and updating the file. In Proceedings of the 19th International Conference on Very Large Databases, pages 73-84, Dublin, Ireland, 1993.

3. [Abiteboul et al 1996] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener, The Lorel Query Language for Semi-structured Data, Journal of Digital Libraries, 1(1): 68-88, 1996.

4. [Abiteboul et al 1997] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, and J. Siméon, Querying documents in object databases, Journal of Digital Libraries, 1(1), 1997.

5. [Adams 2001] C. Adams, Word Wanglers: Automatic Classification Tools Transform Enterprise Documents from Bags of Words to Knowledge Resources, Intelligent KM, January 2001.

6. [Amini 2001] M.-R. Amini, Apprentissage Automatique et Recherche d'Information : application à l'Extraction d'Information de surface et au Résumé de Texte, Thèse de Doctorat, Université Paris 6, juillet 2001.

7. [Apte et al 1994] C. Apte, F. Damerau, S. M. Weiss, Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Office Information Systems, 12(3), 1994.

8. [Arrow and Raynaud 1986] K.J. Arrow, H. Raynaud, Social Choice and Multicriterion Decision Making, MIT Press, Cambridge, 1986.

9. [Belacel 2000] N. Belacel, Multicriteria assignment method PROAFTN: methodology and medical assignment, Eur. J. Oper. Res., 125: 175-183, 2000.

10. [Belacel 2000a] N. Belacel, Méthodes de classification multicritères : méthodologie et applications à l'aide du diagnostic médical, PhD thesis, Université Libre de Bruxelles, Belgium, 2000.

11. [Bell et al. 1988] D.E. Bell, H. Raiffa, A. Tversky (editors), Decision Making: Descriptive, Normative, and Prescriptive Interactions, Cambridge University Press, Great Britain, third edition, 1988.

12. [Benjamins et al. 1998] V.R. Benjamins, D. Fensel, A. Gómez-Pérez, Knowledge Management through Ontologies. In PAKM 1998.

13. [Besançon 2002] R. Besançon, Intégration de connaissances syntaxiques et sémantiques dans les représentations vectorielles de textes, Thèse de doctorat, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne (Suisse), 2002.

DRDC Valcartier TR 2004-265 51

14. [Birch 2003] S. Birch, Statistical text modeling – Towards modeling of matching problems, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, 2003.
15. [Broomhead and Kirby 2000] D.S. Broomhead, M. Kirby, A New Approach to Dimensionality Reduction: Theory and Algorithms, SIAM Journal on Applied Mathematics, 60(6): 2114-2142, 2000.
16. [Buneman 1997] P. Buneman, Semi-structured Data, in ACM Symposium on Principles of Database Systems, pp. 117-121, Tucson, Arizona, June 1997.
17. [Buneman et al 1995a] P. Buneman, S.B. Davidson, K. Hart, C. Overton, and L. Wong, A data transformation system for biological data sources, in Proceedings of VLDB, September 1995.
18. [Buneman et al 1995b] P. Buneman, S. B. Davidson, and D. Suciu, Programming constructs for unstructured data, in Proc. Workshop on Database Programming Languages (DBPL), 1995.
19. [Buneman et al 1996] P. Buneman, S. Davidson, D. Suciu, and G. Hillebrand, A Query Language and Optimization Techniques for Unstructured Data, in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 505-516, 1996.
20. [Cattell 1994] R. G. G. Cattell (ed.), The Object Database Standard: ODMG-93, Morgan Kaufmann, San Francisco, California, 1994.
21. [Chen and Pan 2003] C. Chen, X. S. Pan, F. Kurfess, Ontology-based Semantic Classification of Unstructured Documents, in Proceedings of the First International Workshop on Adaptive Multimedia Retrieval, Germany, September 2003.
22. [Christophides et al 1994] V. Christophides, S. Abiteboul, S. Cluet and M. Scholl, From Structured Documents to Novel Query Facilities, in Proc. ACM SIGMOD Symp. on the Management of Data, pp. 313-324, 1994.
23. [Cluet 1997] S. Cluet, Modeling and querying semi-structured data, in SCIE 1997, pp. 192-213, 1997.
24. [Codd 1970] E. F. Codd, A relational model for large shared data banks, Communications of the ACM, 13(6): 377-387, June 1970.
25. [Cohen 1996] W. W. Cohen, Learning trees and rules with set-valued features, in Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996.
26. [Davidson et al 1996] S. B. Davidson, C. Overton, V. Tannen and L. Wong, BioKleisli: a digital library for biomedical researchers, Journal of Digital Libraries, 1(1), November 1996. See http://www.cis.upenn.edu/db.
27. [De Jong 1976] K.A. De Jong, An Analysis of the Behavior of a Class of Genetic Adaptive Systems, PhD Thesis, University of Michigan, USA, 1976.
28. [Dean and Sharfman 1993a] J. Dean, M. Sharfman, The Relationship between Procedural Rationality and Political Behavior in Strategic Decision Making, Decision Sciences, 24(6): 1069-1083, 1993.


29. [Dean and Sharfman 1993b] J. Dean, M. Sharfman, Procedural Rationality in the Strategic Decision-Making Process, Journal of Management Studies, 30(4): 587-610, 1993.
30. [Delphi 2002] Taxonomy and Content Classification, Delphi Group White Paper, 2002.
31. [Dias and Tsoukias 2003] L. Dias, A. Tsoukias, On Constructive and Other Approaches in Decision Aiding, in Proceedings of the 57th Meeting of the EURO MCDA Working Group (J. Figueira, C. Antunes, eds.), 2003.
32. [Doyle 1998] J. Doyle, Rational decision making, in The MIT Encyclopedia of the Cognitive Sciences, R. Wilson and F. Keil (eds.), Cambridge, Massachusetts: MIT Press, 1998.
33. [Dumais 1998] S. T. Dumais, Using SVMs for text categorization, IEEE Intelligent Systems, 13(4), 1998.
34. [Eisenhardt and Bourgeois 1988] K. Eisenhardt, J. Bourgeois, Politics of strategic decision making in high-velocity environments: toward a midrange theory, Academy of Management Journal, 31: 737-770, 1988.
35. [Everitt 1993] B.S. Everitt, Cluster Analysis, London: Edward Arnold, 1993.
36. [Faure and Poibeau 2000] D. Faure, T. Poibeau, First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX, in Proceedings of the 14th European Conference on Artificial Intelligence (ECAI'2000), Berlin, Germany, 2000.
37. [Felbaum, 1998] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge, Massachusetts and London, England, The MIT Press, 1998.
38. [Fensel et al 2002] D. Fensel, F. Van Harmelen, Y. Ding, M. Klein, H. Akkermans, J. Broekstra, A. Kampman, J. Van der Meer, Y. Sure, R. Studer, U. Krohn, J. Davies, R. Engels, V. Iosif, A. Kiryakov, T. Lau, U. Reimer, I. Horrocks, On-To-Knowledge in a nutshell, IEEE Computer, 2002. http://www.ontoknowledge.org.
39. [Fernandez et al. 1997] M. Fernandez, A. Gómez-Pérez, N. Juristo, METHONTOLOGY: From Ontological Art Towards Ontological Engineering, in AAAI-97 Spring Symposium on Ontological Engineering, pp. 33-40, 1997.
40. [Frank 1994] M.P. Frank, Advances in decision-theoretic AI: Limited rationality and abstract search, Master's thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, May 1994. Also available at http://www.ai.mit.edu/~mpf/papers/Frank-94/Frank-94.html.
41. [Friese et al. 1998] T. Friese, P. Ulbig, S. Schulz, Use of evolutionary algorithms for the calculation of group contribution parameters in order to predict thermodynamic properties. Part I: Genetic algorithms, Computers Chem. Engng, 22(11): 1559-1572, 1998.
42. [Furuta et al 1988] R. Furuta, P. David Stotts, Specifying Structured Document Transformations, in Document Manipulation and Typography, J. C. van Vliet (ed.), pp. 109-120, Cambridge University Press, 1988.


43. [Gao et al. 2003] J. Gao, M. Li, C.-N. Huang, Improved source-channel models for Chinese word segmentation, in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 272-279, July 2003.
44. [Gauvin et al. 2002] M. Gauvin, A.-C. Boury-Brisset, and F. Garnier-Waddell, Contextual User-Centric, Mission-Oriented Knowledge Portal: Principles, Framework and Illustration, 7th International Command and Control Research and Technology Symposium, Quebec City, 16-20 September 2002.
45. [Genesereth and Nilsson 1987] M. Genesereth, N. Nilsson, Logical Foundations of Artificial Intelligence, Morgan Kaufmann, 1987.
46. [Gingrande 2002] A. Gingrande, Recognizing Unstructured Documents: Invoice Processing Challenges and Solutions, Journal of Work Process Improvement, December 2002.
47. [Goldberg 1989] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
48. [Goldberg and Lingle 1985] D.E. Goldberg, R. Lingle, Alleles, loci and the traveling salesman problem, in Proceedings of the First International Conference on Genetic Algorithms, J. Grefenstette (ed.), Lawrence Erlbaum Associates, Hillsdale, NJ, 1985.
49. [Goll and Rasheed 1997] I. Goll, A. Rasheed, Rational Decision Making and Firm Performance: The Moderating Role of Environment, Strategic Management Journal, 18(7): 583-591, 1997.
50. [Goller et al 2000] C. Goller, J. Löning, T. Will, W. Wolff, Automatic document classification: A thorough evaluation of various methods, in Proceedings of ISI'2000.
51. [Gomez-Perez 1995] A. Gómez-Pérez, Some ideas and examples to evaluate ontologies, in Proceedings of the Eleventh Conference on Artificial Intelligence Applications, p. 50, IEEE Computer Society Press, 1995.
52. [Gomez-Perez and Rojas-Amaya 1999] A. Gómez-Pérez, D. Rojas-Amaya, Ontological Reengineering for Reuse, EKAW 1999: 139-156.
53. [Gomez-Perez et al. 1996] A. Gómez-Pérez, M. Fernandez, A. de Vicente, Towards a method to conceptualize domain ontologies, in Working Notes of the Workshop on Ontological Engineering, ECAI'96, pp. 41-52, 1996.
54. [Gomez-Perez and Benjamin 1996] A. Gómez-Pérez, V.R. Benjamins, Applications of Ontologies and Problem-Solving Methods, AI Magazine, 20(1): 119-122, 1999.
55. [Gruber, 1993] T. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition, 5(2): 199-220, 1993.
56. [Gruninger and Fox 1995] M. Gruninger, M.S. Fox, Methodology for the design and evaluation of ontologies, in Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95, Montreal, Canada, 1995.


57. [Guarino 1995] N. Guarino, Formal Ontology, Conceptual Analysis and Knowledge Representation, International Journal of Human-Computer Studies, 43(5/6): 625-640, 1995.
58. [Guarino and Giaretta 1995] N. Guarino, P. Giaretta, Ontologies and knowledge bases: Towards a terminological clarification, in Towards Very Large Knowledge Bases – Knowledge Building and Knowledge Sharing, N.J. Mars (ed.), pp. 25-32, IOS Press, Amsterdam, 1995.
59. [Hampton 1998] J.E. Hampton, The Authority of Reason, Cambridge University Press, 1998.
60. [Hansen and Mladenovic 2001] P. Hansen and N. Mladenovic, Variable neighbourhood search: principles and applications, Eur. J. Oper. Res., 130: 449-467, 2001.
61. [Hearst 2003] M. Hearst, What Is Text Mining?, Essay, SIMS, UC Berkeley, October 17, 2003.
62. [Henriet 2000] L. Henriet, Systèmes d'évaluation et de classification multicritères pour l'aide à la décision : construction de modèles et de procédures d'affectation, PhD thesis, Université Paris Dauphine, France, 2000.
63. [Horvitz et al. 1988] E. Horvitz, J. Breese, M. Henrion, Decision Theory in Expert Systems and Artificial Intelligence, Journal of Approximate Reasoning, Special Issue on Uncertainty in Artificial Intelligence, 2: 247-302, 1988. Also Stanford CS Technical Report KSL-88-13.
64. [Hotho et al. 2001a] A. Hotho, A. Maedche, S. Staab, R. Studer, SEAL-II – the soft spot between richly structured and unstructured knowledge, Journal of Universal Computer Science (J.UCS), 7(7): 566-590, 2001.
65. [Hotho 2001b] A. Hotho, A. Maedche, S. Staab, Ontology-based text clustering, in Proceedings of the IJCAI-01 Workshop "Text Learning: Beyond Supervision", Seattle, USA, August 2001.
66. [Hull 1994] D. Hull, Information Retrieval using Statistical Classification, PhD thesis, Stanford University, Computer Science Department, 1994.
67. [Iwasume et al. 1996] M. Iwazume, K. Shirakami, K. Hatadani, IICA: An Ontology-based Internet Navigation System, in Proceedings of AAAI 96.
68. [Jain and Dubes 1988] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
69. [Joachims 1997] T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Technical Report, LS8-Report, Universitaet Dortmund, 1997.
70. [Karypis and Han 2000] G. Karypis, E.-H. Han, Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization, Technical Report TR-00-0016, University of Minnesota, 2000.


71. [Kulyukin et al 1998] V. Kulyukin, K. Hammond and R. Burke, Automated Processing of Structured Online Documents, Intelligent Information Laboratory, Computer Science Department, The University of Chicago, Chicago, February 1998.
72. [Labrou and Finin 1999] Y. Labrou, T. Finin, Yahoo! as an Ontology – Using Yahoo! Categories to Describe Documents, ACM Conference on Information and Knowledge Management (CIKM'99), Kansas City, November 1999.
73. [Larsen and Aone 1999] B. Larsen and C. Aone, Fast and effective text mining using linear-time document clustering, in Proc. of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16-22, 1999.
74. [Lebart et al 1998] L. Lebart, A. Salem, and L. Berry, Exploring Textual Data, Kluwer Academic Publishers, ISBN 0-7923-4840-0, 1998.
75. [Letson 2001] R. Letson, Taxonomies Put Content in Context, Transform Magazine, December 2001.
76. [Levesque and Brachman 1985] H.J. Levesque, R.J. Brachman, A fundamental tradeoff in knowledge representation and reasoning, in Readings in Knowledge Representation, H.J. Levesque and R.J. Brachman (eds.), pp. 41-70, Morgan Kaufmann, 1985.
77. [Lewis et al 1996] D. D. Lewis, R. E. Schapire, J. P. Callan, R. Papka, Training Algorithms for Linear Text Classifiers, in Proceedings of ACM SIGIR, 1996.
78. [Maedche et al 2001] A. Maedche, S. Staab, N. Stojanovic, R. Studer, and Y. Sure, SEmantic PortAL – The SEAL Approach, in Creating the Semantic Web, D. Fensel, J. Hendler, H. Lieberman, W. Wahlster (eds.), MIT Press, Cambridge, MA, 2001.
79. [Mahaligam and Huhns 1997] K. Mahalingam, M.N. Huhns, An Ontology Tool for Query Formulation in an Agent-Based Context, CoopIS 1997: 170-178.
80. [Mahaligam and Huhns 1998] K. Mahalingam, M.N. Huhns, Ontology Tools for Semantic Reconciliation in Distributed Heterogeneous Information Environments, Intelligent Automation and Soft Computing: An International Journal (special issue on Distributed Intelligent Systems, Mohamed Kamel and Mohammad Jamshidi, eds.), 1998.
81. [Mani and Maybury 1999] I. Mani and M. T. Maybury, Advances in Automatic Text Summarization, The MIT Press, ISBN 0-262-13359-8, 1999.
82. [Marcoux 1993] Y. Marcoux, Les formats de documents électroniques, Actes de la Journée d'échange sur les formats normalisés de documents, Québec, ministère des Communications, 1993.
83. [Marcoux 1994] Y. Marcoux, Les formats normalisés de documents électroniques, ICO Québec, numéro thématique sur la gestion de l'information textuelle, vol. 6, nos 1 et 2, printemps 1994, pp. 56-65, 1994.
84. [Marcoux et al 1996] Y. Marcoux, F. P. Bélanger and C. Dufour, Réflexions sur une expérience de construction abstraite d'un hypertexte, Argus, vol. 25, no 3, septembre-décembre 1996, pp. 14-22.


85. [Maron and Kuhns 1960] M. E. Maron, J. L. Kuhns, On Relevance, Probabilistic Indexing and Information Retrieval, Journal of the ACM, 7: 216-244, 1960.
86. [Maybury 1993] M. T. Maybury (ed.), Intelligent Multimedia Interfaces, Menlo Park and Cambridge: AAAI Press/MIT Press, 1993.
87. [McGuinness 1998] D. L. McGuinness, Ontological Issues for Knowledge-Enhanced Search, in Proceedings of the International Conference on Formal Ontology in Information Systems, 1998.
88. [McGuinness 2002] D. L. McGuinness, Ontologies Come of Age, in D. Fensel, J. Hendler, H. Lieberman, and W. Wahlster (eds.), Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press, 2002.
89. [Michael 1988] T. Michael, Rationality and Revolutionary Collective Action, in Rationality and Revolution, Chapter 2, Michael Taylor (ed.), Cambridge University Press, 1988.
90. [Michael 1994] T. Michael, Structure, Culture, and Action in the Explanation of Social Change, in Politics and Rationality, Chapter 4, William James Booth, Patrick James and Hudson Meadwell (eds.), Cambridge University Press, 1994.
91. [Mothe 2000] J. Mothe, Recherche et Exploration d'Informations. Découvertes de connaissances pour l'accès à l'information, Habilitation à diriger des recherches, Institut Universitaire de Formation des Maîtres, 2000.
92. [NATO, 2002] NATO/RTO, RTO Combating Terrorism Workshop Report, April 2002, http://www.rta.nato.int/ctreport/CTWSReport.pdf.
93. [Othmani 1998] I. Othmani, Optimisation Multicritère : Fondements et Concepts, PhD Thesis, Université Joseph Fourier, Grenoble, France, 20 mai 1998.
94. [Papadakis et al. 1998] V. Papadakis, S. Lioukas, D. Chambers, Strategic Decision-Making Processes: The Role of Management and Context, Strategic Management Journal, 19: 115-147, 1998.
95. [Papakonstantinou et al 1995] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, Object Exchange across Heterogeneous Information Sources, in Proc. of the 11th Int'l Conf. on Data Engineering, pp. 251-260, 1995.
96. [Parsaye and Chignell 1993] K. Parsaye and M. Chignell, Intelligent Databases: Object-Oriented, Deductive Hypermedia Technologies, John Wiley, NY, 1993.
97. [Pasquier-Dorthe and Raynaud 1990] J. Pasquier-Dorthe, H. Raynaud, Un outil d'aide à la décision multicritère, Rapport technique, Université Joseph Fourier, 1990.
98. [Pirlot 1994] M. Pirlot, Why trying to characterize the procedures used in multi-criteria decision aid, Cahiers de CERO, 36: 283-292, 1994.
99. [Poźivil and Źd'ánský 2001] J. Poźivil, M. Źd'ánský, Application of genetic algorithms to chemical flowshop sequencing, Chem. Eng. Technol., 24(4): 327-333, 2001.


100. [Rajagopolan et al. 1993] N. Rajagopolan, A. Rasheed, D. Datta, Strategic Decision Processes: Critical Review and Future Directions, Journal of Management, 19(2): 349-384, 1993.
101. [Reerwester et al 1990] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6): 391-407, 1990.
102. [Rillof 1993] E. Riloff, Automatically constructing a dictionary for information extraction tasks, in Proceedings of the 11th National Conference on Artificial Intelligence, pp. 811-816, 1993.
103. [Robertson and Spark-Jones 1976] S. E. Robertson, K. Sparck Jones, Relevance Weighting of Search Terms, Journal of the American Society for Information Science, 27: 129-146, 1976.
104. [Rocchio 1971] J. Rocchio, Relevance Feedback in Information Retrieval, in G. Salton (ed.), The SMART Retrieval System – Experiments in Automatic Document Processing, Chapter 14, pp. 313-323, Prentice Hall, 1971.
105. [Roy 1990] B. Roy, Science de la décision ou science de l'aide à la décision ?, Cahier du Lamsade 97, Université Paris-Dauphine, 1990.
106. [Ruiz, Srinivasan 1998] M.E. Ruiz, P. Srinivasan, Crosslingual information retrieval using the UMLS Metathesaurus, in Proceedings of the 61st Annual Meeting of the American Society for Information Science, Pittsburgh, PA, October 24-29, 1998.
107. [Sahami 1998] M. Sahami, Using Machine Learning to Improve Information Access, PhD thesis, Stanford University, Computer Science Department, 1998.
108. [Salton 1989a] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley, 1989.
109. [Salton 1989b] G. Salton, Automatic Text Processing, Addison-Wesley, 1989.
110. [Salton and Buckley 1991] G. Salton and C. Buckley, Automatic text structuring and retrieval – experiments in automatic encyclopedia searching, in Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1991.
111. [Salton and McGill 1983] G. Salton, M.J. McGill, Introduction to Modern Information Retrieval, NY: McGraw-Hill, 1983.
112. [Salton et al 1983] G. Salton, C. Buckley, and E. A. Fox, Automatic query formulations in information retrieval, Journal of the American Society for Information Science, 34(4): 262-280, July 1983.
113. [Schuettze et al 1995] H. Schütze, D. Hull, J. Pedersen, A Comparison of Document Representations and Classifiers for the Routing Problem, in Proceedings of the 18th Annual ACM SIGIR Conference, pp. 229-237, 1995.


114. [Sebastiani 1999] F. Sebastiani, Machine Learning in Automated Text Categorisation, Technical Report B4-31, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, 1999. http://faure.iei.pi.cnr.it/~fabrizio.
115. [Spryns et al. 2002] P. Spyns, R. Meersman, M. Jarrar, Data Modelling versus Ontology Engineering, SIGMOD Record, 31(4), special section on Semantic Web and data management, pp. 12-17, December 2002.
116. [Studer et al. 1998] R. Studer, V.R. Benjamins, D. Fensel, Knowledge Engineering: Principles and Methods, Data and Knowledge Engineering, 25: 161-197, 1998.
117. [Syswerda 1989] G. Syswerda, Uniform crossover in genetic algorithms, in Proceedings of the Third International Conference on Genetic Algorithms, J. Schaffer (ed.), Morgan Kaufmann, San Mateo, CA, 1989.
118. [Termier et al., 2001] A. Termier, M.-C. Rousset, M. Sebag, Combining Statistics and Semantics for Word and Document Clustering, in Proceedings of the IJCAI'2001 Workshop on Ontology Learning, 2001.
119. [Tiliki 2003] K. Tiliki, Développement d'un module de classification automatique de documents basé sur la logique floue et l'ontologie, Master's thesis, Université Laval, Québec, Canada, 2003.
120. [Trang et al. 2003] H. Trang, L. Denoyer, P. Gallinari, Un modèle statistique pour la classification de documents structurés, Journées francophones d'Extraction et de Gestion des Connaissances (EGC2003), Lyon, France, janvier 2003.
121. [TREC] Text REtrieval Conference, http://trec.nist.gov.
122. [Tsoukias and Vincke 1998] A. Tsoukias, Ph. Vincke, A Generalization of Interval Orders, preprint submitted to Elsevier Science, November 19, 1998.
123. [Turing 1947] A.M. Turing, Lecture to the London Mathematical Society on 20 February 1947, in A. M. Turing's ACE Report of 1946 and Other Papers, B. E. Carpenter and R. W. Doran (eds.), Cambridge, Mass.: MIT Press, 1986.
124. [Turing 1948] A.M. Turing, Intelligent machinery, Report for the National Physical Laboratory, reprinted in Machine Intelligence 7, B. Meltzer and D. Michie (eds.), 1969.
125. [Turing 1950] A.M. Turing, Computing machinery and intelligence, Mind, 59: 433-460, 1950.
126. [Turing 1954] A.M. Turing, Solvable and unsolvable problems, Science News, 31: 7-23, 1954.
127. [Veber et al 1999] M. Veber, A. Horák, R. Julinek and P. Smrz, Automatic Structuring of Written Texts, TSD 1999: 101-194.
128. [Von Luxburg et al. 2002] U. Von Luxburg, O. Bousquet, B. Schölkopf, A Compression Approach to Support Vector Model Selection, Technical Report No. TR-101, Max Planck Institute for Biological Cybernetics, 2002.
129. [Weiner et al 1995] E. Wiener, J. O. Pedersen, A. S. Weigend, A Neural Network Approach to Topic Spotting, in Symposium on Document Analysis and Information Retrieval, pp. 317-332, 1995.


130. [Winkels et al. 2000] R. Winkels, D. Bossher, A. Boer, R. Hoekstra, Extended conceptual retrieval, in Legal Knowledge and Information Systems 2000, J.A. Breuker et al. (eds.), pp. 85-98, IOS Press, Amsterdam, The Netherlands, 2000.
131. [Witten 2001] I.H. Witten, Visions of the digital library, in Proceedings of the International Conference of Asian Digital Libraries, pp. 3-15, Bangalore, India, December 2001.
132. [Yang 1999] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval, 1(1-2): 69-90, 1999.
133. [Zhao and Karypis 2001] Y. Zhao and G. Karypis, Criterion Functions for Document Clustering: Experiments and Analysis, Technical Report #01-40, http://www.cs.umn.edu/~karypis, 2001.


Annex A: Concepts of Weights in the Ontology

Objective weighting method (PondOb)

BEGIN PondOb

For a given prototype p_i^h:

1. First phase P-1 (identification, enumeration and localization phase)

Consider the part (sub-tree) whose root is the first concept met in the prototype.

1.1. Determine the total number of levels of the hierarchy (ontology): Total_Nb_Levels_Hier

1.2. Determine the level of the first concept met in the prototype p_i^h: Level_First_Concept_prot

1.3. Determine the number of levels present in the sub-tree whose root is the first concept met in the prototype p_i^h:

Nb_Levels_prot ← Total_Nb_Levels_Hier − Level_First_Concept_prot

1.4. Determine the presence of concept j in the prototype p_i^h:

Pres_Concept_j_prot ← 1 if present

Pres_Concept_j_prot ← 0 if not

1.5. Determine the total number of concepts (present and absent) per level of the tree defining the prototype p_i^h: Nb_Concepts_Level_j

1.6. Compute the weight of concept j in the prototype by applying the formula:

w_j^{p_i^h} ← Pres_Concept_j_prot × (1 / Nb_Levels_prot) / Nb_Concepts_Level_j

2. Second phase P-2 (repartition of the weights)

To ensure a distribution of weights over all the concepts present in the prototype p_i^h:

if Σ_{j=1}^{m} w_j^{p_i^h} = 100%, then end of PondOb (the weights are already distributed);

if not, go to 2.1.

2.1. Distribution

2.1.1. Compute the repartition multiplier:

Rep_Multip ← 100 / Σ_{j=1}^{m} w_j^{p_i^h}

2.1.2. For each concept present in the prototype, compute:

w_j^{p_i^h} ← w_j^{p_i^h} × Rep_Multip

END of PondOb
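The two phases of PondOb can be sketched in Python. This is a sketch only: the ontology is reduced to hypothetical concept-to-level and concepts-per-level tables, which stand in for the hierarchy described in the report, and weights are kept as fractions of 1 rather than percentages.

```python
def pond_ob(prototype_concepts, concept_level, total_levels, concepts_per_level):
    """Objective weighting (PondOb) sketch.

    prototype_concepts: set of concept ids present in the prototype
    concept_level: dict concept id -> level in the ontology hierarchy
    total_levels: Total_Nb_Levels_Hier
    concepts_per_level: dict level -> Nb_Concepts_Level (present and absent)
    """
    # Phase P-1: weights from the sub-tree rooted at the first (most general) concept
    level_first = min(concept_level[c] for c in prototype_concepts)
    nb_levels_prot = total_levels - level_first
    weights = {
        c: (1.0 / nb_levels_prot) / concepts_per_level[concept_level[c]]
        for c in prototype_concepts  # Pres_Concept = 1 only for present concepts
    }
    # Phase P-2: redistribute so the weights sum to 1 (i.e. 100%)
    total = sum(weights.values())
    if total != 1.0:
        rep_multip = 1.0 / total          # repartition multiplier
        weights = {c: w * rep_multip for c, w in weights.items()}
    return weights
```

For instance, with a four-level hierarchy and a prototype whose concepts sit at levels 1 and 2, the raw weights sum to less than 100% and phase P-2 rescales them.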

Algorithm of classification using the method DAC-01-C-P

BEGIN

1. Initialization

- Determine the severity factors n1 and n2
- Determine the admissibility threshold (on confidence degrees) of a concept: S^+(μ_j^d)
- Determine the standard deviation thresholds S1^+(σ_j^d) and S2^+(σ_j^d)

2. As long as all documents are not treated, do (for a given document d):

  Initialize I(d, p_i^h) = 0
  As long as all classes are not treated, do (for a given class h):
    As long as all prototypes are not treated, do (for a given prototype i):
      Call the PondOb algorithm (to determine the concept weights)
      As long as all concepts are not treated, do (for a given concept j, among the concepts appearing both in the processed document and in the prototype):
        - Compute C_j(d, p_i^h)
        - Compute D_j(d, p_i^h)
        - Compute I_j(d, p_i^h)
        - Compute I(d, p_i^h) = I(d, p_i^h) + I_j(d, p_i^h)
        Go to the next concept
      App(d, p_i^h) = I(d, p_i^h)
      Go to the next prototype
    App(d, h) = Max_i {App(d, p_i^h)}
    Go to the next class
  App(d, C) = Max_h {App(d, h)}
  Treat the next document

END

Note: the repartition multiplier Rep_Multip of PondOb is the parameter a in the method presented in Section 5.3.
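The nested loop structure of DAC-01-C-P can be sketched as follows. The callables `concordance`, `discordance` and `weights_of` are hypothetical stand-ins for C_j, D_j and PondOb, and the combination of the indices into I_j is assumed here to be a weighted product, which the annex leaves unspecified.

```python
def dac01_classify(doc, classes, concordance, discordance, weights_of):
    """Loop structure of DAC-01-C-P (sketch).

    doc: set of concepts detected in the document d
    classes: dict class label h -> list of prototypes (sets of concepts)
    concordance, discordance: callables standing in for C_j and D_j
    weights_of: callable standing in for PondOb (prototype -> concept weights)
    """
    app = {}
    for h, prototypes in classes.items():
        scores = [0.0]
        for proto in prototypes:
            w = weights_of(proto)                 # PondOb concept weights
            score = 0.0
            for j in doc & proto:                 # concepts in both d and p_i^h
                c_j = concordance(doc, proto, j)  # C_j(d, p_i^h)
                d_j = discordance(doc, proto, j)  # D_j(d, p_i^h)
                score += w[j] * c_j * d_j         # I_j contribution (assumed form)
            scores.append(score)                  # I(d, p_i^h) -> App(d, p_i^h)
        app[h] = max(scores)                      # App(d, h)
    return max(app, key=app.get)                  # App(d, C): best class
```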


Algorithm of classification using the method DAC-02-I-I

BEGIN

1. Initialization

Fix the parameters V_j (the concept admissibility threshold) and RapMax (the maximum ratio for the discordance measure); these parameters are valid for a specific classification task.

For a given document d and a prototype i of a certain class h:

For the first concept j (the most general one met in the prototype):

2. Determine the intervals

I_d = [μ_j^d − σ_j^d ; μ_j^d + σ_j^d] = [l(d); r(d)]

I_p = [μ_j^p − σ_j^p ; μ_j^p + σ_j^p] = [l(p); r(p)]

a. Test the concept admissibility (for the concordance or similarity measure)

If { l(d) < l(p) and r(d) < r(p) } and if V_j ≤ l(p): the concept is admissible; go to b.1.
If not, go to the next concept.

If { l(d) > l(p) and r(d) > r(p) } and if V_j ≤ l(d): the concept is admissible; go to b.2.
If not, go to the next concept.

If { l(d) = l(p) and r(d) = r(p) } and if { V_j ≤ l(d) or V_j ≤ l(p) }: the concept is admissible; go to b.3.
If not, go to the next concept.

If { l(d) < l(p) and r(p) < r(d) } and if V_j ≤ l(p): the concept is admissible; go to b.3.
If not, go to the next concept.

If { l(p) < l(d) and r(d) < r(p) } and if V_j ≤ l(d): the concept is admissible; go to b.3.
If not, go to the next concept.

(The first two cases correspond to partial overlaps of I_d and I_p; the last three to identical intervals and to one interval containing the other.)

b. Compute the local concordance index C_j(d, p_i^h) (or overlapping)

b.1 If { l(p) > l(d) and r(p) > r(d) }:

C_j(d, p_i^h) = (r(d) − l(p)) / (r(p) − l(d))

b.2 If { l(d) > l(p) and r(d) > r(p) }:

C_j(d, p_i^h) = (r(p) − l(d)) / (r(d) − l(p))

b.3 If { l(d) ≤ l(p) and r(p) ≤ r(d) } or { l(p) ≤ l(d) and r(d) ≤ r(p) }:

C_j(d, p_i^h) = 1

c. Compute the interval ratio

Rap1 = Ampl(I_d) / Ampl(I_p) = (r(d) − l(d)) / (r(p) − l(p)) if Ampl(I_d) ≤ Ampl(I_p), or

Rap2 = Ampl(I_p) / Ampl(I_d) = (r(p) − l(p)) / (r(d) − l(d)) if Ampl(I_d) ≥ Ampl(I_p)

d. Compute the local discordance index D_j(d, p_i^h)

d.1 If Rap1 ≥ RapMax or if Rap2 ≥ RapMax:

D_j(d, p_i^h) = 1

d.2 If Ampl(I_p) > Ampl(I_d) and Rap1 < RapMax:

D_j(d, p_i^h) = Rap1 / RapMax

d.3 If Ampl(I_d) > Ampl(I_p) and Rap2 < RapMax:

D_j(d, p_i^h) = Rap2 / RapMax

e. Compute the similarity index Sim(d, p_i^h)

Initialization: Sim(d, p_i^h) = 0

e.1 If there is no hierarchy:

Sim(d, p_i^h) = Sim(d, p_i^h) + C_j(d, p_i^h) × D_j(d, p_i^h) / m

Go to the next concept. Once all concepts are treated:

Sim(d, p_i^h) = Σ_{j=1}^{m} C_j(d, p_i^h) × D_j(d, p_i^h) / m

e.2 If there is a hierarchy:

Call the PondOb algorithm.

Sim_j(d, p_i^h) = Sim_j(d, p_i^h) + w_j^{p_i^h} × C_j(d, p_i^h) × (D_j(d, p_i^h))^{w_j^{p_i^h}}

Go to the next concept. Once all concepts are treated:

Sim(d, p_i^h) = Σ_{j=1}^{m} Sim_j(d, p_i^h)

Go to the next prototype i + 1 of the class h.

f. Membership degree to the class h

App(d, h) = Max_i {Sim(d, p_i^h)}

g. Assignment to a single class C

App(d, C) = Max_h {App(d, h)}

END
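Steps 2, b, c and d above can be condensed into a small function. This is a sketch under the stated definitions: the admissibility test (step a) is omitted, and the intervals are assumed non-degenerate so the divisions are defined.

```python
def interval_indices(mu_d, sig_d, mu_p, sig_p, rap_max):
    """Local concordance/discordance indices for one concept (DAC-02-I-I sketch)."""
    l_d, r_d = mu_d - sig_d, mu_d + sig_d      # I_d = [l(d); r(d)]
    l_p, r_p = mu_p - sig_p, mu_p + sig_p      # I_p = [l(p); r(p)]
    # b. local concordance (overlap) index
    if (l_d <= l_p and r_p <= r_d) or (l_p <= l_d and r_d <= r_p):
        c = 1.0                                # b.3: one interval contains the other
    elif l_p > l_d and r_p > r_d:
        c = (r_d - l_p) / (r_p - l_d)          # b.1
    else:
        c = (r_p - l_d) / (r_d - l_p)          # b.2
    # c./d. local discordance from the ratio of interval widths
    amp_d, amp_p = r_d - l_d, r_p - l_p
    rap = min(amp_d, amp_p) / max(amp_d, amp_p)
    d = 1.0 if rap >= rap_max else rap / rap_max
    return c, d
```

For example, a document interval [4; 6] fully contained in a prototype interval [3; 7] gives full concordance (C = 1) while the width ratio 0.5 yields the discordance 0.5 / RapMax.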


Annex B: Clustering using Genetic Algorithms

Algorithm of the procedure:

Fix values for the parameters Δ0 (minimum distance between two clusters), k0 (number of clusters) and n0 (minimum number of documents per cluster). In one version of the algorithm, a similarity threshold α0 is used to merge the documents initially. Initial clustering:

- Initialize the generation counter t = 1

- Evaluate the pairwise similarity I(i,j) of the N documents in the collection (an N×N matrix), each document being an initial cluster Ci, i = 1, ..., N.

- Merge the nb closest documents (neighbours) into initial clusters, and let M(t) denote the resulting number of initial clusters.

Step 1: Clustering improvement:

A cluster encodes the membership of each document. It is a binary string of N elements in which the i-th element is set to one if the corresponding document is present in the cluster and to zero if not: cluster(i) = [d1, d2, ..., dN], i = 1 to M(t), with dj = 1 if document j is included in cluster(i) and dj = 0 otherwise.
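As a small illustration of this encoding, with N = 5 documents a cluster containing documents 1, 3 and 4 (1-based) is represented by the following membership string:

```python
# Membership encoding sketch for one cluster with N = 5 documents.
N = 5
members = {1, 3, 4}                                    # documents in the cluster
cluster = [1 if i in members else 0 for i in range(1, N + 1)]
# cluster == [1, 0, 1, 1, 0]
```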

Version 1:

The cluster's evaluation is the lowest similarity between two of its documents. This objective function (fitness) must be maximized:

S(i) = Min [I(j,k)] with dj, dk ∈ cluster(i) (1)

Version 2:

Each document dj included in cluster(i), containing n documents, is indexed by a global similarity:

S_G(dj) = (1 / (n − 1)) Σ_{dk ∈ Ci, k ≠ j} I(dj, dk)

And the cluster’s evaluation is the lowest similarity:

S(i) = min_{dj ∈ Ci} {S_G(d1), S_G(d2), ..., S_G(dn)} (2)

A linear static scaling of this objective function (eq. 2) is used to ensure that the population does not become dominated by the descendants of a single super-fit cluster and, later, when the population average fitness comes close to the fitness of the best cluster, that competition among clusters is not purely stochastic but based on the principle of the survival of the fittest.

The scaled function is defined as follows:

Fun(i) = a · S(i) + b (3)

a and b can be computed, as proposed for example by Friese et al. [1998] and Poźivil and Źd'ánský [2001], in such a way that Fun_ave = S_ave and Fun_max = D · S_ave, where D is the desired number of copies of the best solution, usually set between 1.2 and 2. The following expressions can be derived:

a = S_ave (D − 1) / (S_max − S_ave), b = S_ave (S_max − D · S_ave) / (S_max − S_ave) (3a)

with

S_ave = (1 / M(t)) Σ_i S(i); S_max = Max [S(i)]; S_min = Min [S(i)] (3b)
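Equations 3-3b can be sketched directly in Python (a sketch only: it assumes S_max > S_ave, i.e. the population is not uniform, so the denominators are non-zero):

```python
def scale_fitness(s, d_factor):
    """Linear static scaling Fun = a*S + b with Fun_ave = S_ave and
    Fun_max = D * S_ave (eqs. 3, 3a, 3b)."""
    s_ave = sum(s) / len(s)
    s_max = max(s)
    a = s_ave * (d_factor - 1) / (s_max - s_ave)          # eq. 3a
    b = s_ave * (s_max - d_factor * s_ave) / (s_max - s_ave)
    return [a * si + b for si in s]                       # eq. 3
```

One can check that the scaled values keep the same average while the best cluster receives D times the average fitness.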

Selection of clusters for reproduction, crossover and mutation:

For i = 1 to M(t):

Selection: The selection operator used in this algorithm to select a new population is stochastic remainder selection without replacement, based on the concept of expected value (De Jong 1976; Goldberg 1989).

- Calculate the probability of selection of each cluster:

p_s(i) = Fun(i) / Σ_{i=1}^{M(t)} Fun(i) (4)

- Calculate the expected number of copies of a cluster:

e(i) = p_s(i) · M(t) (4a)

- Complete the population to M(t) individuals by using the fractional remainder r(i) = e(i) − Int(e(i)) of each cluster, one by one, as the probability of a further duplication with a weighted coin toss.
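The selection step can be sketched as follows. This is a sketch of stochastic remainder selection: the final top-up loop (used when the coin tosses do not fill the population) is an implementation choice, not part of the report.

```python
import random

def stochastic_remainder_selection(fun, rng=random.random):
    """Stochastic remainder selection without replacement (sketch).
    fun: scaled fitness values; returns the indices of the selected parents."""
    m = len(fun)
    total = sum(fun)
    expected = [m * f / total for f in fun]      # e(i) = p_s(i) * M(t)
    selected = []
    for i, e in enumerate(expected):             # integer part: certain copies
        selected.extend([i] * int(e))
    for i, e in enumerate(expected):             # fractional remainder: coin toss
        if len(selected) >= m:
            break
        if rng() < e - int(e):
            selected.append(i)
    while len(selected) < m:                     # top up if the tosses fell short
        selected.append(random.randrange(m))
    return selected[:m]
```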

Crossover: Two crossover procedures are proposed, to be used alternately, in order to explore a greater number of search spaces. The first one is the uniform crossover operator (Syswerda, 1989), which has been shown to be superior to traditional crossover strategies for combinatorial problems. The second operator is the partially mapped crossover (PMX) proposed by Goldberg and Lingle (1985). The crossover is performed on two selected clusters with a probability p_c. The offspring resulting from crossover are evaluated using eqs. 1 or 2-3b.

Mutation: A mutation is carried out on a cluster by changing the value of a randomly selected element (1 to 0 or vice versa), with a probability p_m = 0.01. The offspring resulting from mutation are evaluated using eqs. 1 or 2-3b.

Step 2: Cluster merging. In the second step of the procedure, we try to reduce the number of clusters obtained from the first step.

- Evaluate the distance (or similarity) between all the clusters, as described in Chapter 5:
  For i = 1 to M(t), for j = 1 to M(t): Δ_ij = eq. 5.13, 5.14 or 5.15

- Version 1: merge into new clusters all the clusters such that Δ_ij ≥ Δ_0 → new M(t).
- Version 2: merge the two closest clusters → new M(t).

Step 3: if M(t) ≤ k_0 and Δ_ij ≥ Δ_0 and min{Card_C(i)} ≥ n_0, Stop; if t > t_max, Stop; else t = t + 1, go to Step 1.
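Version 1 of the merging step can be sketched in Python. This is a sketch, not the report's code: clusters are modeled as sets of document ids, and `similarity(c1, c2)` is a placeholder for eqs. 5.13-5.15 of Chapter 5.

```python
def merge_similar_clusters(clusters, similarity, delta0=0.6):
    """Merging step, version 1: repeatedly merge any pair of clusters whose
    similarity meets the threshold delta0, until no pair qualifies."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if similarity(clusters[i], clusters[j]) >= delta0:
                    clusters[i] = clusters[i] | clusters[j]   # absorb cluster j
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With a Jaccard similarity as a stand-in, clusters {1,2} and {2,3} merge at a threshold of 0.3 while a disjoint cluster {5} survives.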

Algorithm of the procedure

Initialization

(Generation counter) t = 1;
(Genetic operator parameters) zc1 = 1; z1 = 0; z2 = 0; z3 = 0; pc = 0.6; pm = 0.01;
(Crossover and mutation count accumulators) ncross = 0; nmut = 0;
(Minimum distance between two clusters) Δ0 = 0.6;
(Similarity threshold to group two documents) α0 = 0.55;
(Scaled function parameter, eq. 3a) D = 1.8;
(Final number of clusters chosen by the user) Read k0 (example: k0 = 10)

for i = 1 to N
    for j = 1 to N
        evaluate the matrix of similarities between documents: I(i,j);
    end
end

Clustering:

- version 1:
for i = 1 to M(t)
    for j = 1 to N
        if I(i,j) ≥ θ then d_cluster(i,j) = 1;
    end
end

- version 2:
for i = 1 to M(t)
    sort I(i,j) for j = 1 to N
    while Card_C(i) < nd, d_cluster(i,j) = 1 with j such that I(i,j) is the smallest.
end

- evaluate the fitness of the clusters using subroutine: Evaluation (D, M, cluster, I, Fun, Fmax, Fmin, Fave)

Step 1: Selection

- t = t + 1
- for i = 1 to M(t)
      tempcluster(i) = 0
  end

- for i = 1 to M(t)
      compute the probability of selection of a cluster ps (eq. 4)
      compute the expected value of each cluster: Int(e(i)) (eq. 4a)
      compute Int(e(i)) ≤ 2
      compute Rem(e(i))
  end

- compute the cumulative probability of selection qs(k) for each cluster k:
  qs(0) = 0
  for k = 1 to M(t)
      qs(k) = qs(k-1) + ps(k)
  end

- ii = 0
  for i = 1 to M(t)
      (make Int(e(i)) copies of cluster(i))
      Int_e = Int(e(i))
      while Int_e > 0 do
          ii = ii + 1
          Int_e = Int_e - 1
          tempcluster(ii) = cluster(i)
      wend
  end

- for i = 1 to M(t)
      (make a copy of cluster(i) with a probability of Rem(e(i)))
      while ii ≤ M(t) do
          compute a random number rand from [0,1]
          if rand ≤ Rem(e(i)) then
              ii = ii + 1
              tempcluster(ii) = cluster(i)
          end
      wend
  end

- for i = 1 to M(t)
      cluster(i) = tempcluster(i)
  end

Crossover (select crossover operator)

if t = zc1 then
    (use the uniform crossover every two generations)
    i = 1
    repeat
        (select two parents for crossover using subroutine select)
        mate1 = select(M(t), qs)
        mate2 = select(M(t), qs)
        apply UNIFORM_CROSSOVER(cluster(mate1), cluster(mate2), newcluster(i), newcluster(i+1), N, pc, ncross)
        parent1 = mate1
        parent2 = mate2
        evaluate newcluster(i), newcluster(i+1)
        i = i + 2
    until i > M(t)
    zc1 = zc1 + 2
else
    i = 1
    repeat
        (select two parents for crossover using subroutine select)
        mate1 = select(M(t), qs)
        mate2 = select(M(t), qs)
        apply PMX_CROSSOVER(cluster(mate1), cluster(mate2), newcluster(i), newcluster(i+1), N, pc, ncross)
        parent1 = mate1
        parent2 = mate2
        evaluate newcluster(i), newcluster(i+1)
        i = i + 2
    until i > M(t)
end

Mutation

i = 1
repeat
    (select a cluster for mutation)
    compute a random number rk from [0,1]
    if rk ≤ pm then
        (select randomly the gene to be mutated)
        k = Int(random * N)
        (assign to the new cluster the mutated gene)
        for j = 1 to N
            newd_cluster(i,j) = d_cluster(i,j)
            if j = k then newd_cluster(i,j) = not d_cluster(i,k) end
        end
        nmut = nmut + 1
    else
        for j = 1 to N
            newd_cluster(i,j) = d_cluster(i,j)
        end
    end
    i = i + 1
until i > M(t)

- store pertinent information about the population (t, parent1, parent2, cluster, Fun, Funmax, Funmin, Funave, crossover operator (zc), ncross, nmut)

- cluster = newcluster
- go to Step 2

Step 2: Cluster merging

Step 3: if M(t) ≤ k_0 and Δ_ij ≥ Δ_0 and min{Card_C(i)} ≥ n_0, Stop; if t > t_max, Stop; else t = t + 1, go to Step 1.

SUBROUTINE Evaluation(D, popsize, cluster, I, Fun, Fmax, Fmin, Fave)
    for i = 1 to popsize
        S(i) = eq. 1 or eq. 2, over j, k such that d_cluster(i,j) ≠ 0 and d_cluster(i,k) ≠ 0
    end
    compute Smax, Smin, Save
    (evaluate the scaled function of the clusters)
    for i = 1 to popsize
        compute Fun(i) using eqs. 3, 3a and 3b
    end
    compute Fmax, Fmin, Fave
ENDSUB

SUBROUTINE select(maxpop, probability)
    compute a random number rand from [0,1]
    k = 1
    while (k < maxpop) and (probability(k) < rand) do
        k = k + 1
    wend
    select = k
ENDSUB

SUBROUTINE UNIFORM_CROSSOVER(parent1, parent2, child1, child2, N, pc, ncross)
    compute a random number rk from [0,1]
    if rk ≤ pc then
        ncross = ncross + 1
        (generate randomly a binary string with the same size as the chromosome: N bits)
        for j = 1 to N
            rand(j) = random number from [0,1]
            if rand(j) ≤ 0.5 then bit(j) = 0 else bit(j) = 1 end
        end
        (exchange between parent1 and parent2 the genes corresponding to bit positions set to 1)
        for j = 1 to N
            if bit(j) = 0 then
                child1(j) = parent1(j)
                child2(j) = parent2(j)
            else
                child1(j) = parent2(j)
                child2(j) = parent1(j)
            end
        end
    else
        for j = 1 to N
            child1(j) = parent1(j)
            child2(j) = parent2(j)
        end
    end
ENDSUB

SUBROUTINE PMX_CROSSOVER(parent1, parent2, child1, child2, N, pc, ncross)
    compute a random number rk from [0,1]
    if rk ≤ pc then
        ncross = ncross + 1
        (select randomly a first position j1 from [1, N-1])
        (select randomly a second position j2 from [j1, N])
        for j = 1 to j1
            child1(j) = parent1(j)
            child2(j) = parent2(j)
        end
        for j = j1 to j2
            child1(j) = parent2(j)
            child2(j) = parent1(j)
        end
        for j = j2 to N
            child1(j) = parent1(j)
            child2(j) = parent2(j)
        end
    else
        for j = 1 to N
            child1(j) = parent1(j)
            child2(j) = parent2(j)
        end
    end
ENDSUB
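The two crossover subroutines above can be sketched in Python. Note that, as specified here, the PMX variant reduces to a two-point segment exchange on the binary membership string. This is an illustrative sketch; function names and the list-of-genes representation are assumptions, not from the report.

```python
import random

def uniform_crossover(parent1, parent2, pc=0.6, rng=random):
    """Uniform crossover (Syswerda, 1989): each gene position is swapped
    or kept according to a random binary mask, applied with probability pc."""
    if rng.random() > pc:
        return list(parent1), list(parent2)          # no crossover this time
    child1, child2 = [], []
    for g1, g2 in zip(parent1, parent2):
        if rng.random() <= 0.5:                      # mask bit 0: keep genes
            child1.append(g1); child2.append(g2)
        else:                                        # mask bit 1: swap genes
            child1.append(g2); child2.append(g1)
    return child1, child2

def two_point_crossover(parent1, parent2, pc=0.6, rng=random):
    """Segment-exchange crossover as specified in SUBROUTINE PMX_CROSSOVER:
    genes between two random cut points j1 <= j2 are exchanged."""
    n = len(parent1)
    if rng.random() > pc:
        return list(parent1), list(parent2)
    j1 = rng.randint(0, n - 2)
    j2 = rng.randint(j1, n - 1)
    child1 = parent1[:j1] + parent2[j1:j2] + parent1[j2:]
    child2 = parent2[:j1] + parent1[j1:j2] + parent2[j2:]
    return child1, child2
```

In both operators, every gene position of the children holds exactly the pair of values that the parents held at that position, so cluster memberships are recombined but never invented.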


Annex C: Similarity Index Computation

Codification of the categorization methods

For a better identification of these various methods and their associated algorithms, we adopt the following coding:

DAC - 0X - Y - Z Algorithm

DAC : Document Automatic Classification

X : Serial number DomX = {1, 2, 3, 4, 5}

Y : Variable type considered DomY = {C, I, D}

with C = continuous variable15

I = interval variable

D = discrete variable

Z : Prototype type DomZ = {P, N, I}

with

P = the degrees of confidence of the concepts in the prototype (called prototype variables) are 100% and the standard deviations are nil (σ(C_j) = 0, ∀ C_j in a prototype)

N = the prototype variables take intermediary values (between 0% and 100%), and the standard deviations ≠ 0

I = the prototype variables are of interval type (non-parametric)

Hypothesis

All our methods are based on the hypothesis that the initialization is done, i.e., that the ontology of the domain has been defined, the various classes and their prototypes have been specified, and the documents to be categorized and the prototypes are represented by their DNA matrix.

The PROAFTN procedure [Belacel 2000] is used as the starting point for most of our work, with some modifications to take into account the characteristics of the text categorization problems considered here. The proposed classification methods are described below.

15 Variable = confidence degrees of concepts (or terms) of documents (model objects)


Initialization

Let us denote Ω = {C^1, …, C^h, …, C^k} the set of predefined classes corresponding to the different concepts of the hierarchy.

For each class C^h, h = 1, …, k, we determine a set of L_h prototypes B^h = {p_1^h, p_2^h, …, p_i^h, …, p_Lh^h}, by combining available knowledge with the data set or by applying an automatic prototype-building algorithm. In this set, each prototype is considered a good representation of C^h. The documents and the prototypes are compared using their DNA matrices, after removing the concepts that are not simultaneously present in both16:

Document d                          Prototype i of class h

C_1       μ_1^d       σ_1^d         C_1       μ_1^{p_i^h}       σ_1^{p_i^h}
C_2       μ_2^d       σ_2^d         C_2       μ_2^{p_i^h}       σ_2^{p_i^h}
…         …           …             …         …                 …
C_j       μ_j^d       σ_j^d         C_j       μ_j^{p_i^h}       σ_j^{p_i^h}
…         …           …             …         …                 …
C_{m-1}   μ_{m-1}^d   σ_{m-1}^d     C_{m-1}   μ_{m-1}^{p_i^h}   σ_{m-1}^{p_i^h}
C_m       μ_m^d       σ_m^d         C_m       μ_m^{p_i^h}       σ_m^{p_i^h}

with

C_j: the concept j

μ_j^d: the average of DC_q(C_j), the confidence degrees of concept j in the document d

σ_j^d: the standard deviation of DC_q(C_j), the confidence degrees of concept j in the document d

μ_j^{p_i^h}: the confidence degree of the concept j for the prototype i of the class h

16 It is important to recall that there is a difference between n, the total number of concepts (of the hierarchy), and m, the number of concepts remaining after the reduction of the document or prototype vectors.


σ_j^{p_i^h}: the standard deviation of the confidence degree of the concept j for the prototype i of the class h

Note: if μ_j^{p_i^h} of a prototype equals 100% (= 1), then the corresponding standard deviation σ_j^{p_i^h} is null.

Definition of the various parameters of the models

The PROAFTN procedure [Belacel 2000], from which the ADAC classification/filtering module is inspired, requires the determination of certain parameters (weights and thresholds). Moreover, it uses concordance, discordance and indifference indexes to evaluate the comparison between the involved objects. Other parameters are introduced due to the objects' representation used in this study. Three kinds of parameters are involved: weights, thresholds, and factors of admissibility/severity. All these parameters are necessary for the measurement of the similarity between a document d and the prototype p_i^h. This similarity is first expressed locally at the level of the concept, and then at a more global level, for the whole document.
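The concept-vector reduction described above (keeping only the concepts simultaneously present in the document and the prototype) can be sketched in Python. The dict layout, mapping each concept to its (μ, σ) pair, is an illustrative assumption, not the report's actual data structure.

```python
def reduce_to_common_concepts(doc, proto):
    """Keep only the concepts present in both the document and the prototype
    DNA representations, as required before any comparison.

    Each argument is a dict mapping concept -> (mu, sigma).
    """
    common = doc.keys() & proto.keys()           # set intersection of concepts
    return ({c: doc[c] for c in common},
            {c: proto[c] for c in common})
```

The m concepts that survive this reduction are the ones over which the concordance, discordance and indifference indexes are then computed.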

In the following sections, we describe these parameters and how they are used to calculate the total indifference index. The Concordance, Discordance, and Indifference parameters represent the keystones of our models.

Thresholds

The thresholds are fixed by the decision maker according to a given assignment preferential profile.

Factors of admissibility/severity

Different values are associated with the admissibility/severity factors, depending on the attitude of the decision maker.

Concordance, discordance and Indifference indexes

When comparing a document d to a prototype p_i^h, positive and negative explanations can be derived justifying their indifference (similarity) level. The comparison at the local level is done for the concept j, and the comparison at the global level is done over all the relevant concepts in the document.

- Local concordance index C_j(d, p_i^h)

The concordance at the local level for the concept j provides positive explanation regarding the indifference degree between the document d and the prototype i of the class h [Henriet 2000].

The concordance is nothing more than a measure of similarity (a distance). It expresses the degree of proximity between the two elements being compared. Its measurement is based on the idea of a membership function.


- Local discordance index D_j(d, p_i^h)

The discordance at the local level provides a negative explanation. It is the expression of the divergence that prohibits indifference between an object (document d) and a prototype p_i^h [Henriet 2000]. It attenuates the similarity degree measured by the concordance (at a local level).

It is important to mention that in this work, contrary to the original PROAFTN model, the discordance is applied only at a local level and does not have any veto prerogative. Its effect is restricted to the concept level.

- Local indifference I_j(d, p_i^h) and global indifference I(d, p_i^h)

At the local level, the indifference index between the document d and the prototype p_i^h for the concept j is computed using the concordance index moderated by the discordance index. These local similarity measurements are then aggregated to provide a measurement that translates the proximity between the document d and the prototype p_i^h. This is called the indifference index I(d, p_i^h) and provides the membership degree of the document d to the sub-class of the prototype p_i^h. This index can be interpreted as "the document d and the prototype p_i^h are indifferent, or strictly equivalent" [Belacel 2000a]. Affectation rules are then used to drive the assignment of the document d to one or more classes.

Affectation rules: membership degree I(d, C^h) to a class h

The membership degree of a document to a given class is defined by comparing the indexes measured on the set of all prototypes B^h = {p_1^h, p_2^h, …, p_i^h, …, p_Lh^h} of this class. The decision rule to choose the degree of membership is:

I(d, C^h) = max{ I(d, p_1^h), I(d, p_2^h), …, I(d, p_i^h), …, I(d, p_Lh^h) },   h = 1, …, k

At the end of the classification process, a score, the membership degree, is associated with the document for every class. The assignment of this document to one or several classes is done according to the following rule:

1. Use a λ-cut threshold, λ ≥ 1/2 (or any other value chosen by the decision maker); then:

d ∈ C^h ⇔ I(d, C^h) ≥ λ

or

2. d ∈ C^h ⇔ I(d, C^h) = max{ I(d, C^1), I(d, C^2), …, I(d, C^l), …, I(d, C^k) }

Note that the first rule allows the assignment of a document to several different classes, whereas the second rule restricts this assignment to only one class.
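The affectation rules above can be sketched in Python. This is an illustrative sketch: the function signature, the dict of prototypes per class, and the `indiff` callback standing in for I(d, p_i^h) are all assumptions, not from the report.

```python
def classify(document, prototypes, indiff, lam=0.5):
    """Affectation rules: membership of a document in class h is the maximum
    indifference over the prototypes of h; classes are then selected either
    by the lambda-cut rule (possibly several classes) or by the best-class
    rule (exactly one class).
    """
    membership = {h: max(indiff(document, p) for p in protos)
                  for h, protos in prototypes.items()}
    lambda_cut = [h for h, m in membership.items() if m >= lam]   # rule 1
    best_class = max(membership, key=membership.get)              # rule 2
    return membership, lambda_cut, best_class
```

In the toy usage below, the "prototypes" are pre-scored numbers and `indiff` simply returns them, which is enough to exercise both rules.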


Procedure

The classification methods presented below follow an identical procedure, with two processing levels: one local, for each concept, and one global, over the entire document concept vector.

Step 1. Once the initialization is done, the user determines the values of the various parameters of the model: the weights, the threshold values, and the admissibility/severity factors characterizing the method.

Step 2. For a given document and each prototype of a specific class, a concordance measurement is computed at the local level for each concept that is simultaneously present in the document and the prototype.

Step 3. Still at the local level, a discordance measurement, aiming at attenuating the concordance calculated in Step 2, is evaluated. It intervenes by confirming, cancelling or attenuating an already established concordance and has, as mentioned previously, no veto prerogative on the global similarity measurement.

Step 4. A local indifference measurement is computed using the concordance index measured in Step 2 and the discordance index measured in Step 3. This local similarity measure is adjusted by the weighting procedure according to the position of the concept in the hierarchy (ontology).

Step 5. Local indifference measurements are aggregated to give a total indifference measurement between the document and the considered prototype (over the whole set of concepts).

Step 6. The classification of the documents into the various classes of the ontology, or the assignment of a document to a specific class of this same ontology, is made according to the affectation rules.


Annex D: Non-parametric approaches

First method (DAC-01-C-P)

In this method, the confidence degree averages μ_j^{p_i^h} of all the concepts j in a prototype are considered equal to 100% and their corresponding standard deviations σ_j^{p_i^h} are nil, as shown in the example below:

Prototype i of class h                 Prototype k of class h

          μ_j^{p_i^h}   σ_j^{p_i^h}             μ_j^{p_k^h}   σ_j^{p_k^h}
C_1       100%          0              C_1      0             0
C_2       100%          0              C_2      100%          0
…         …             …              …        …             …
C_j       100%          0              C_j      100%          0
…         …             …              …        …             …
C_{m-1}   100%          0              C_{m-1}  100%          0
C_m       100%          0              C_m      0             0

A prototype including all the concepts of the ontology (i.e., all the concepts are present) actually represents the class of the concept field (root). Therefore, a document that contains all the concepts of the ontology would be classified in the most general class, i.e., the one that corresponds to the root (the field itself). As previously mentioned, the first concept met in the prototype is used to identify the class to which it belongs.

Measure of the concordance index C_j(d, p_i^h)

This index evaluates the similarity between a document d and a prototype p_i^h of a given class h. The parameters S^−(μ_j^d) and S^+(μ_j^d) and the factors n_1 and n_2, used to calculate this concordance index, are defined below:

n_1·σ_j^d: zone of no concordance for the concept j in the document d

n_2·σ_j^d: zone of hesitation for perfect concordance of the concept j in the document d

S^−(μ_j^d) + n_1·σ_j^d: upper value of the zone n_1·σ_j^d

S^+(μ_j^d) − n_2·σ_j^d: lower value of the zone n_2·σ_j^d, equal to μ_j^{p_i^h} − n_2·σ_j^d

S^−(μ_j^d): lower threshold (user defined) of the degree of confidence, lower value of the zone n_1·σ_j^d

S^+(μ_j^d): higher threshold (user defined) of the degree of confidence, higher value of the zone n_2·σ_j^d; in this work S^+(μ_j^d) = μ_j^{p_i^h}

σ_j^d: standard deviation of the degrees of confidence of the concept j in the document d

μ_j^d: average of the degrees of confidence of the concept j in the document d

μ_j^{p_i^h}: average of the degrees of confidence of the concept j in the prototype p_i^h

n_1, n_2: factors of admissibility/severity to take into account the importance of σ_j^d

Figure 16. Diagram for concordance index measurement (DAC-01-C-P)

Figure 16 is interpreted as follows:

Null concordance (no agreement) (= 0)

if μ_j^d ≤ S^−(μ_j^d) + n_1·σ_j^d then C_j(d, p_i^h) = 0

Intermediary concordance [0, 1]

if S^−(μ_j^d) + n_1·σ_j^d ≤ μ_j^d ≤ S^+(μ_j^d) − n_2·σ_j^d, with S^+(μ_j^d) − n_2·σ_j^d = μ_j^{p_i^h} − n_2·σ_j^d,

then C_j(d, p_i^h) = [μ_j^d − (S^−(μ_j^d) + n_1·σ_j^d)] / [S^+(μ_j^d) − S^−(μ_j^d) − n_2·σ_j^d − n_1·σ_j^d]

Perfect concordance (= 1)

if μ_j^d ≥ S^+(μ_j^d) − n_2·σ_j^d then C_j(d, p_i^h) = 1
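The piecewise concordance above can be sketched in Python. This is an illustrative sketch; the function name and the flat parameter list (thresholds passed in directly) are assumptions, not from the report.

```python
def local_concordance(mu_d, sigma_d, s_minus, s_plus, n1=1.0, n2=1.0):
    """Local concordance C_j(d, p_i^h) for DAC-01-C-P.

    Null below S^-(mu) + n1*sigma, perfect above S^+(mu) - n2*sigma,
    and linear in between; n1, n2 are the admissibility/severity factors.
    """
    low = s_minus + n1 * sigma_d      # end of the no-concordance zone
    high = s_plus - n2 * sigma_d      # start of the perfect-concordance zone
    if mu_d <= low:
        return 0.0
    if mu_d >= high:
        return 1.0
    return (mu_d - low) / (high - low)
```

The denominator (high − low) is exactly the fuzzy zone S^+ − S^− − n_2·σ − n_1·σ of the intermediary case.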

Determination of the factors of admissibility/severity: n1 and n2

According to the importance given to the variability of the concordance measurements, n_1 and n_2 can take binary values (1 if variability is taken into account, 0 if not). The decision maker can also choose real values between 0 and 1 for these two factors. Let us consider a scale from 0% to 100% divided into three distinct zones, A, B and C, delimited by various values of the standard deviation σ_j^d. The thresholds of 15% and 20% are selected here only to illustrate the example of Figure 16.

Zone of high homogeneity

Zone A: ]0%, S_1^+(σ_j^d)], with S_1^+(σ_j^d) = 15%, i.e., ]0%, 15%]

In this interval, the distribution is considered very homogeneous; we can then affirm that there is no variability. The comparison between the document and the prototype is then done using only the average of the degrees of confidence, without any consideration of the standard deviations. The factors n_1 and n_2 are in this case: n_1 = n_2 = 0.

Zone of relative homogeneity

Zone B: ]S_1^+(σ_j^d), S_2^+(σ_j^d)], with S_2^+(σ_j^d) = 20%, i.e., ]15%, 20%]

In this zone, the variability of the distribution of the degrees of confidence in the document regarding a concept j starts to become apparent. It is then necessary to take into account the influence of the variability σ_j^d in the calculation of the similarity index between the document and the prototype for this concept. The factors n_1 and n_2 take the value 1 (or any other real value between 0 and 1).

Zone of no homogeneity

Zone C: ]S_2^+(σ_j^d), 100%], i.e., ]20%, 100%]

There is no homogeneity in the distribution of the degrees of confidence, so the variability is considered in the similarity calculation and the factors n_1 and n_2 are equal to 1.

Discussion

For the measurement of the concordance, these homogeneity zones, which in fact correspond to what we have previously called hesitation zones, represent the hesitation level (n_1) and/or the benefit of the doubt (n_2) that one wants to give to a certain concept, considering the average and the standard deviation of its degrees of confidence.


We consider that for any value of μ_j^d below the threshold S^−(μ_j^d) + n_1·σ_j^d, we have C_j(d, p_i^h) = 0, while if μ_j^d is higher than S^+(μ_j^d) − n_2·σ_j^d, the concordance should be perfect, i.e., C_j(d, p_i^h) = 1. In other words, if the average μ_j^d of the degrees of confidence is located in the interval [S^−(μ_j^d), S^−(μ_j^d) + n_1·σ_j^d], a pessimistic attitude will lead to the attribution of a null concordance, i.e., C_j(d, p_i^h) = 0. This attitude translates all the hesitation and doubt induced by the variability of these degrees of confidence.

On the other hand, if μ_j^d is located in the interval [S^+(μ_j^d) − n_2·σ_j^d, S^+(μ_j^d)], an optimistic attitude will lead to the attribution of a perfect concordance, i.e., C_j(d, p_i^h) = 1. For the considered concept j, this perfect attribution translates the benefit of the doubt regarding the variability of the degrees of confidence.

It appears clearly that the larger the variability of the degrees of confidence, the more doubtful the assignment to a given class. For this reason, we consider a hesitation zone both for the strict concordance, i.e., C_j(d, p_i^h) = 1, and for the null concordance, i.e., C_j(d, p_i^h) = 0. Various cases can then arise according to the importance of this variability. This results in the variation of the intervals representing the hesitation and the benefit of the doubt, reflecting in this way the severity in the determination of the concordance and/or discordance indices.

Using this approach in the concordance index measurement, the decision maker can adopt mixed attitudes and express a greater or lesser hesitation, or a greater or lesser optimism, according to the situation under study.

Zone of null concordance

- Pessimistic attitude: increase of the hesitation zone n_1
- Optimistic attitude: reduction or suppression of the hesitation zone n_1
- Consequence: a reduction (pessimistic case) or an increase (optimistic case) of the fuzzy concordance zone [S^−(μ_j^d) + n_1·σ_j^d, S^+(μ_j^d) − n_2·σ_j^d]

Zone of perfect concordance

- Optimistic attitude: increase of the zone of reasonable doubt (benefit of the doubt) n_2
- Pessimistic attitude: reduction or suppression of the zone of reasonable doubt n_2
- Consequence: a reduction (optimistic case) or an increase (pessimistic case) of the fuzzy concordance zone [S^−(μ_j^d) + n_1·σ_j^d, S^+(μ_j^d) − n_2·σ_j^d]

Various cases arise according to the decision maker's attitude towards risk; they are summarized in Table 1.


Table 1. Factors of severity/admissibility values according to the DM attitude

              DECISION MAKER ATTITUDE    HESITATION ZONE    AFFECTATION CHARACTER
Situation 1   Pessimistic                n_1 = 1            Severe
              Optimistic                 n_2 = 1            Large
Situation 2   Optimistic                 n_1 = 0            Large
              Optimistic                 n_2 = 1            Large
Situation 3   Optimistic                 n_1 = 0            Large
              Pessimistic                n_2 = 0            Severe
Situation 4   Pessimistic                n_1 = 1            Severe
              Pessimistic                n_2 = 0            Severe

Measure of the discordance index D_j(d, p_i^h)

The measurement of this index is conditioned by the thresholds on the standard deviation, S_1^+(σ_j^d) and S_2^+(σ_j^d), as shown in Figure 17.

Statistical signification of the thresholds S_1^+(σ_j^d) and S_2^+(σ_j^d)

S_1^+(σ_j^d): threshold below which the variability of the degrees of confidence is considered negligible; below this threshold the discordance is null (D_j(d, p_i^h) = 0).

S_2^+(σ_j^d): threshold above which one concludes to a perfect discordance, D_j(d, p_i^h) = 1, because the degrees of confidence considered have too great a variability to be taken into account in the concordance measurement.

Note that the thresholds S_1^+(σ_j^d) and S_2^+(σ_j^d) are in fact coefficients of variation (CV) according to the terminology used in statistics. A coefficient of variation is the quotient of the standard deviation of a given distribution by its average (S_i^+(σ_j^d) = σ_j^d / μ_j^d, i = 1, 2), and by definition it provides information about its homogeneity.

The measurement of the discordance index proposed here differs from that published in the work of Belacel [2000]. This measure is taken at a local level and is evaluated on the standard deviation instead of the average of the degrees of confidence used to determine the concordance index. This approach allows us to take into account all the uncertainty aspects characterizing a concept, which gives a more rigorous measurement tool.

Figure 17. Diagram for discordance index measurement (DAC-01-C-P)

Null discordance (= 0)

if σ_j^d ≤ S_1^+(σ_j^d) × μ_j^d then D_j(d, p_i^h) = 0

Intermediate discordance ]0, 1[

if S_1^+(σ_j^d) × μ_j^d < σ_j^d < S_2^+(σ_j^d) × μ_j^d then 0 < D_j(d, p_i^h) < 1

and D_j(d, p_i^h) = [σ_j^d − (S_1^+(σ_j^d) × μ_j^d)] / [(S_2^+(σ_j^d) × μ_j^d) − (S_1^+(σ_j^d) × μ_j^d)]

Perfect discordance (= 1)

if σ_j^d ≥ S_2^+(σ_j^d) × μ_j^d then D_j(d, p_i^h) = 1
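The piecewise discordance above can be sketched in Python along the same lines as the concordance. An illustrative sketch; the function name and default thresholds (the 15% and 20% of the running example) are assumptions.

```python
def local_discordance(mu_d, sigma_d, s1=0.15, s2=0.20):
    """Local discordance D_j(d, p_i^h), driven by the coefficient of
    variation of the document's confidence degrees for concept j.

    s1 and s2 are the variability thresholds S1^+ and S2^+.
    """
    low = s1 * mu_d                   # below: variability is negligible
    high = s2 * mu_d                  # above: perfect discordance
    if sigma_d <= low:
        return 0.0
    if sigma_d >= high:
        return 1.0
    return (sigma_d - low) / (high - low)
```

Because low and high are proportional to μ_j^d, the test is effectively on the coefficient of variation σ_j^d / μ_j^d, as the text notes.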

Then the following three cases can occur:

Case 1:

If for a given concept j the discordance is null, i.e., D_j(d, p_i^h) = 0, the term (1 − D_j(d, p_i^h)) is equal to 1 and therefore has no influence on the value of the local indifference index I_j(d, p_i^h), leaving this role only to the concordance index.

Case 2:

If for a given concept j the discordance takes values between 0 and 1, i.e., 0 < D_j(d, p_i^h) < 1, the term (1 − D_j(d, p_i^h)) will take values between 1 and 0, and comes to moderate the measure of the concordance index.

Case 3:

If for a given concept j the discordance is equal to 1, i.e., D_j(d, p_i^h) = 1, the term (1 − D_j(d, p_i^h)) will be equal to 0, thus cancelling the local indifference index I_j(d, p_i^h).

Measure of the similarity indices I_j(d, p_i^h) and I(d, p_i^h)

The similarity between a document and a prototype is evaluated, first at a local level, for a concept j, then globally by aggregating these measurements.

At a local level, (i.e., concept j level)

I_j(d, p_i^h) = w_j × C_j(d, p_i^h) × (1 − D_j(d, p_i^h))^{w_j}

At a global level, (i.e., whole document)

I(d, p_i^h) = Σ_{j=1..m} I_j(d, p_i^h) = Σ_{j=1..m} [ w_j × C_j(d, p_i^h) × (1 − D_j(d, p_i^h))^{w_j} ]

with m the number of relevant concepts in the document d.
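The local and global aggregation above can be sketched in Python. An illustrative sketch; the function name and the parallel-sequence layout (one entry per relevant concept) are assumptions.

```python
def global_indifference(concordance, discordance, weights):
    """Global indifference I(d, p_i^h): the sum over the m relevant concepts
    of the local terms w_j * C_j * (1 - D_j)^{w_j}.

    The three arguments are equal-length sequences indexed by concept j.
    """
    return sum(w * c * (1.0 - d) ** w
               for w, c, d in zip(weights, concordance, discordance))
```

A concept with perfect discordance (D_j = 1) contributes nothing, while a concept with null discordance contributes its full weighted concordance, matching the three cases discussed above.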

Note: the indices C_j(d, p_i^h) and D_j(d, p_i^h) are not fuzzy numbers in this method.

Second method (DAC-02-I-I)

This method is based on the confidence interval. In this case, the expert cannot determine a specific measure, but rather an interval in which the required value is undoubtedly located.

One of the most effective strategies to analyze whether the differences between averages are important or not is to compare them using their standard deviations. This analysis uses the statistical concept of the confidence interval. A confidence interval is established by adding and subtracting the value of the standard deviation from the average.

As is the case in many studies, intervals are built from values taken "normally" by a variable of interest (in this work, the average of the degrees of confidence for a concept j). The term "normally" refers to values that are likely to be observed with a given probability, under conditions known as "normal" and for cases well identified by the prototypes used as references.

The non-parametric approach described in this section consists in comparing the interval Int_d to the reference interval Int_{p_i^h}, whose values are considered as a reference. It consists in computing a distance measurement between the document and the prototype by determining a rate of overlapping at the concept level. The higher this overlapping, the smaller the distance.

According to Tsoukias and Vincke [1998], the comparison of elements evaluated by intervals leads to the preference structure <P, Q, I>. Formally, a preference structure on a finite set A is called an interval order P-Q-I if and only if there are two real-valued functions l and r such that for all x and y in A, x different from y, r(x) ≥ l(x) and r(y) ≥ l(y) (l, for left, representing the lower bound and r, for right, representing the upper bound).

This interval comparison consists in distinguishing among the three following situations:

a) Strict indifference (I):

If one of the intervals to be compared is completely included in the other, the result of the intersection is the included interval:

I(x, y): r(x) > r(y) > l(y) > l(x) or r(y) > r(x) > l(x) > l(y)

b) Weak preference or weak indifference (Q):

If an intersection exists between the two intervals, but neither of them is included in the other, we have:

Q(x, y): r(x) > r(y) > l(x) > l(y) or
Q(y, x): r(y) > r(x) > l(y) > l(x)

c) Strict preference (P):

If no intersection exists between the two intervals (i.e., one of these intervals is either completely to the right or completely to the left of the other):

P(x, y): r(x) > l(x) > r(y) > l(y) or
P(y, x): r(y) > l(y) > r(x) > l(x)
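The three situations of the <P, Q, I> structure can be sketched in Python. An illustrative sketch; the tuple representation of intervals and the single-letter return codes are assumptions.

```python
def interval_relation(x, y):
    """Classify two intervals x = (l, r) and y = (l, r) into the <P, Q, I>
    structure of Tsoukias and Vincke [1998]: strict indifference I when one
    interval is included in the other, weak indifference Q on partial
    overlap, strict preference P when the intervals are disjoint."""
    lx, rx = x
    ly, ry = y
    if rx < ly or ry < lx:
        return 'P'                    # disjoint intervals
    if (lx <= ly and ry <= rx) or (ly <= lx and rx <= ry):
        return 'I'                    # one interval included in the other
    return 'Q'                        # partial overlap only
```

As the text notes, only the P situation corresponds to a null indifference index; I and Q correspond to total and partial similarity respectively.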

In the present method, we are interested in evaluating an indifference index between two intervals; this index synthesizes the distances between the document and the prototype at the concept level of the ontology. Thus, a strict preference situation corresponds to a null indifference index, which translates the absence of any similarity between the compared objects. For better comprehension, we use the term indifference instead of preference relation, which corresponds better to the concordance measurement and is closer to the notion of indifference than to that of preference.

Eligibility for a concept comparison


To establish the indifference relation between a document and a prototype, certain conditions must be met by these two elements regarding the intervals representing the confidence degree measurements of their concepts. The admissibility of a concept j to the measurement of the local indifference index I_j(d, p_i^h), i.e., to the determination of the different values that constitute it (concordance and discordance), is conditioned by the threshold V_j, which must be lower than or equal to the lower limit of the interval overlapping between the document and the prototype.

Various cases can occur:

a) 0% < Overlapping < 100%

1) If V j ≤ l( p) and l(d) < l( p) and r(d) < r( p) ⇒ admissibility of concept j

2) if V j ≤ l(d) and l(d) > l( p) and r(d) > r( p) ⇒ admissibility of concept j

Figure 18. Admissibility for comparison for 0% < Overlapping < 100% (Case 1 and Case 2)

b) Overlapping = 100%

1) If V j ≤ l(d) and V j ≤ l( p) and l(d) = l( p) and r( p) = r(d) ⇒ admissibility of concept j

2) If V j ≤ l( p) and l(d) < l( p) and r( p) < r(d) ⇒ admissibility of concept j

3) If V j ≤ l(d) and l(d) > l( p) and r( p) > r(d) ⇒ admissibility of concept j


Figure 19. Admissibility for comparison for 100% overlapping (Cases 1-3)

How to calculate the indifference index

Once the indifference relation is established, it is appropriate to evaluate its intensity. This is done by the measurement of a concordance index, which estimates the consensus degree between the two intervals (document interval and prototype interval). This index is then attenuated by another one, the discordance index, which is applied at the local level of the concept.

The overlapping rate

It is the intersection between the intervals Int_d = [μ_j^d − σ_j^d; μ_j^d + σ_j^d] for the document d and Int_{p_i^h} = [μ_j^p − σ_j^p; μ_j^p + σ_j^p] for the prototype i of a certain class h.

Notation: For reading convenience, this notation is adopted to describe the intervals:

For a document: l(d) = S_d^− = μ_d − σ_d; r(d) = S_d^+ = μ_d + σ_d; I_d = [l(d), r(d)];
Ampl(I_d) = r(d) − l(d)


For a prototype: l(p) = S_p^− = μ_p − σ_p; r(p) = S_p^+ = μ_p + σ_p; I_p = [l(p), r(p)];
Ampl(I_p) = r(p) − l(p)

The critical values of covering are:

1) Null or minimal distance: 100% overlapping

This corresponds to the case where one of the intervals is completely included in the other. The intersection results in this interval.

Overlapping = 100% if Id ∩ Ip ≠ ∅ and (Id ∩ Ip = Ip or Id ∩ Ip = Id), i.e., Id ⊇ Ip or Id ⊆ Ip

⇒ Situation where the indifference is strict (total similarity)

2) Maximal distance: 0% overlapping

This corresponds to the case where there is no intersection between the intervals.

Overlapping = 0% if Id ∩ Ip = ∅

⇒ Situation where the preference is strict (null similarity)

3) Intermediary distance: overlapping between 0% and 100%

The intermediate overlapping values are calculated by taking into account the importance of the intersection between the two intervals.

0% < Overlapping < 100% if Id ∩ Ip ≠ ∅ and Ampl(Id ∩ Ip) < Ampl(Id) and Ampl(Id ∩ Ip) < Ampl(Ip)

⇒ Situation where the indifference is weak (similarity is neither null nor total)
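The three critical overlapping situations can be sketched as a small classifier (an illustrative sketch; the function name and string labels are ours):

```python
def overlap_situation(l_d, r_d, l_p, r_p):
    """Classify the overlap between a document interval [l_d, r_d]
    and a prototype interval [l_p, r_p] into the three critical
    situations: 0% overlap, 100% overlap, or partial overlap."""
    lo, hi = max(l_d, l_p), min(r_d, r_p)
    if hi <= lo:
        # empty intersection: strict preference, null similarity
        return "strict preference (0% overlap)"
    if (l_d <= l_p and r_p <= r_d) or (l_p <= l_d and r_d <= r_p):
        # one interval included in the other: strict indifference
        return "strict indifference (100% overlap)"
    # the intersection is smaller than both intervals
    return "weak indifference (partial overlap)"
```

For example, [0, 1] against [2, 3] yields strict preference, [0, 1] against [0.2, 0.8] yields strict indifference, and [0, 1] against [0.5, 1.5] yields weak indifference.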

Overlapping measurement17

Rec(Id, Ip) = Ampl(Id ∩ Ip) / Ampl(Id ∪ Ip) = Ampl([l(d), r(d)] ∩ [l(p), r(p)]) / Ampl([l(d), r(d)] ∪ [l(p), r(p)])

a) If l(p) > l(d) and r(p) > r(d):

Rec(Id, Ip) = Rec1 = (r(d) − l(p)) / (r(p) − l(d))

b) If l(d) > l(p) and r(d) > r(p):

Rec(Id, Ip) = Rec2 = (r(p) − l(d)) / (r(d) − l(p))

c) If l(d) ≤ l(p) and r(p) ≤ r(d), or l(p) ≤ l(d) and r(d) ≤ r(p):

Rec(Id, Ip) = Rec3 = 100%

17 Rec(Id, Ip) denotes the overlapping rate between Id and Ip.
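The three cases of the overlapping rate can be sketched in Python (a sketch; the function name is ours, and inclusion is treated as 100% per case c):

```python
def rec(l_d, r_d, l_p, r_p):
    """Overlapping rate Rec(Id, Ip): amplitude of the intersection of
    the two intervals over the amplitude of their union, with 100%
    when one interval is included in the other (Rec3)."""
    # Case c: inclusion of one interval in the other -> Rec3 = 100%
    if (l_d <= l_p and r_p <= r_d) or (l_p <= l_d and r_d <= r_p):
        return 1.0
    inter = min(r_d, r_p) - max(l_d, l_p)   # amplitude of Id ∩ Ip
    if inter <= 0:                          # disjoint intervals
        return 0.0
    union = max(r_d, r_p) - min(l_d, l_p)   # amplitude of Id ∪ Ip
    return inter / union                    # cases a and b (Rec1, Rec2)
```

For instance, [0, 1] against [0.5, 1.5] gives (1 − 0.5) / (1.5 − 0) = 1/3, while [0.2, 0.8] nested in [0, 1] gives 100%.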

Measure of the concordance index C_j(d, p_i^h)

The concordance measurement C_j(d, p_i^h) is a function of the overlapping rate Rec(Id, Ip) between the document d and the prototype p_i^h. Its determination is based on the following rules:

C_j(d, p_i^h) = Rec1 if l(p) > l(d) and r(p) > r(d)
C_j(d, p_i^h) = Rec2 if l(d) > l(p) and r(d) > r(p)
C_j(d, p_i^h) = Rec3 = 100% if Id ⊆ Ip or Ip ⊆ Id

Measure of the discordance index D_j(d, p_i^h)

The discordance index aims at attenuating the concordance between Id and Ip according to the difference in amplitude between the two intervals. This is expressed by a ratio function Rap_i (i = 1 or 2) that quantifies the comparison between these two intervals with values ranging between 0 and 100%: Rap1 = Ampl(Id) / Ampl(Ip) or Rap2 = Ampl(Ip) / Ampl(Id) (according to the case). A threshold RapMax expresses the ratio above which the two intervals are considered of comparable amplitude. The discordance index is then calculated by:

D_j(d, p_i^h) = 1 if Rap1 ≥ RapMax or Rap2 ≥ RapMax
D_j(d, p_i^h) = Rap1 / RapMax if Ampl(Ip) > Ampl(Id) and Rap1 < RapMax
D_j(d, p_i^h) = Rap2 / RapMax if Ampl(Id) > Ampl(Ip) and Rap2 < RapMax

Figure 20 illustrates the case where RapMax = 0.40.


Figure 20. Discordance index measurement (CAD-02-I-I)
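These discordance rules can be sketched as follows (a sketch, assuming the linear ramp Rap/RapMax below the threshold; the function name is ours):

```python
def discordance(ampl_d, ampl_p, rap_max=0.40):
    """Discordance index D_j from the amplitudes of the document and
    prototype intervals, with rap_max = 0.40 as in Figure 20.

    Returns 1 (no attenuation) when the amplitudes are comparable,
    and a value below 1 when one interval is much narrower."""
    if ampl_p > ampl_d:
        rap = ampl_d / ampl_p   # Rap1
    else:
        rap = ampl_p / ampl_d   # Rap2
    if rap >= rap_max:
        return 1.0
    return rap / rap_max        # linear ramp up to the threshold
```

With equal amplitudes the index is 1; with amplitudes 0.1 and 1.0 the ratio 0.1 falls below RapMax = 0.40 and the index drops to 0.25.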

Measure of the indifference index I_j(d, p_i^h)

At a local level (concept j)

1. If there is no hierarchy of concepts

We suppose that all the concepts have the same weight.

I_j(d, p_i^h) = C_j(d, p_i^h) × D_j(d, p_i^h) × (1/m)

2. If there is a hierarchy of concepts

I_j(d, p_i^h) = w_j^p × C_j(d, p_i^h) × (D_j(d, p_i^h))^(w_j^p)

At a global level (whole document d)

1. If there is no hierarchy of concepts

I(d, p_i^h) = Σ_{j=1}^{n} [C_j(d, p_i^h) × D_j(d, p_i^h)] / m

m is the number of relevant concepts extracted from the document and present in the prototype.
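The global aggregation (without a hierarchy of concepts, and the weighted variant given below for the hierarchical case) can be sketched as follows (an illustrative sketch: the function names and list-based interface are ours, not part of ADAC):

```python
def global_indifference(concordances, discordances, m):
    """Global indifference I(d, p_i^h) without a hierarchy of
    concepts: sum of C_j * D_j over the concepts, divided by the
    number m of relevant concepts."""
    return sum(c * d for c, d in zip(concordances, discordances)) / m


def global_indifference_weighted(weights, concordances, discordances):
    """Global indifference with a hierarchy of concepts:
    sum of w_j * C_j * (D_j ** w_j)."""
    return sum(w * c * d ** w
               for w, c, d in zip(weights, concordances, discordances))
```

For two concepts with concordances (1.0, 0.5) and discordances (1.0, 1.0), the unweighted global index is (1.0 + 0.5) / 2 = 0.75.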

2. If there is a hierarchy of concepts

I(d, p_i^h) = Σ_{j=1}^{n} [w_j^p × C_j(d, p_i^h) × (D_j(d, p_i^h))^(w_j^p)]

Third method (CAD-04-C-I)


This method, noted CAD-04-C-I, is another one proposed in the context of the ADAC project. It consists in using thresholds and parameterization factors. As in the interval approach, the thresholds are used to check the admissibility of the concepts for similarity comparison between a document and a prototype at the local level. The parameterization factors are introduced here to reflect the preferential attitude of the decision maker. We define two types of factors: the admissibility factors α (for the document) and β (for the prototype), and the severity factors n1 (for the document) and n2 (for the prototype). These factors are applied to the standard deviations of the confidence degree averages of the concepts involved.

Admissibility factors ( β andα )

These factors are used in the concept eligibility test by considering, besides the confidence degree average μ_j^d or μ_j^{p_i^h}, its standard deviation σ_j^d or σ_j^{p_i^h}, respectively. They take the value −1 for a pessimistic attitude, 0 for a neutral attitude and +1 for an optimistic attitude.

Severity factors ( n1 and n2 )

These factors are used to calculate the concordance index.

Measure of the concordance index C_j(d, p_i^h)

The concordance index C_j(d, p_i^h) is a function of μ_j^d and μ_j^{p_i^h}. The symmetry property of this measure requires that the result of the comparison between the document and the prototype be exactly the same as that of the prototype with the document: we must have C_j(d, p_i^h) = C_j(p_i^h, d). This enables the concordance to be calculated by considering the following admissibility cases:

Case 1: μ_j^d + ασ_j^d < μ_j^{p_i^h} + βσ_j^{p_i^h}

Case 2: μ_j^d + ασ_j^d > μ_j^{p_i^h} + βσ_j^{p_i^h}

The function C_j(d, p_i^h) has a trapezoidal form. The diagrams depicted in Figures 21 and 22 illustrate this function for cases 1 and 2, respectively, even if its trapezoidal form does not clearly appear.

The concordance index calculation is then:

Case 1: μ_j^d + ασ_j^d < μ_j^{p_i^h} + βσ_j^{p_i^h}

If μ_j^d − n1σ_j^d ≤ S_μ − n2σ_j^{p_i^h} or μ_j^d + n1σ_j^d ≥ μ_j^{p_i^h} + n2σ_j^{p_i^h}, then C_j(d, p_i^h) = 0 (A zones)


If S_μ − n2σ_j^{p_i^h} < μ_j^d + ασ_j^d < μ_j^{p_i^h} − n2σ_j^{p_i^h}, then C_j(d, p_i^h) = (μ_j^d + ασ_j^d + n2σ_j^{p_i^h} − S_μ) / (μ_j^{p_i^h} − S_μ) (B zone)

If μ_j^{p_i^h} − n2σ_j^{p_i^h} ≤ μ_j^d + ασ_j^d ≤ μ_j^{p_i^h} + n2σ_j^{p_i^h}, then C_j(d, p_i^h) = 1 (C zone)

Figure 21. Diagram for concordance index calculation in case 1 (CAD-04-C-I)

Case 2: μ_j^d + ασ_j^d > μ_j^{p_i^h} + βσ_j^{p_i^h}

If μ_j^{p_i^h} − n2σ_j^{p_i^h} ≤ S_μ − n1σ_j^d or μ_j^{p_i^h} + n2σ_j^{p_i^h} ≥ μ_j^d + n1σ_j^d, then C_j(d, p_i^h) = 0 (A zones)

If S_μ − n1σ_j^d ≤ μ_j^{p_i^h} + βσ_j^{p_i^h} ≤ μ_j^d − n1σ_j^d, then C_j(d, p_i^h) = (μ_j^{p_i^h} + βσ_j^{p_i^h} + n1σ_j^d − S_μ) / (μ_j^d − S_μ) (B zone)

If μ_j^d − n1σ_j^d ≤ μ_j^{p_i^h} + βσ_j^{p_i^h} ≤ μ_j^d + n1σ_j^d, then C_j(d, p_i^h) = 1 (C zone)


Figure 22. Diagram for concordance index calculation in case 2 (CAD-04-C-I)
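The trapezoidal concordance of CAD-04-C-I can be sketched as follows (a simplified sketch: `s_mu` stands for the support bound S_μ of the diagrams, the function name is ours, and the zone boundaries follow the B and C zone inequalities above):

```python
def concordance_cad04(mu_d, sd_d, mu_p, sd_p, s_mu,
                      alpha=0, beta=0, n1=1, n2=1):
    """Trapezoidal concordance index of CAD-04-C-I.

    alpha/beta are the admissibility factors (-1, 0 or +1) and
    n1/n2 the severity factors; s_mu is the support bound S_mu."""
    if mu_d + alpha * sd_d < mu_p + beta * sd_p:   # case 1
        x = mu_d + alpha * sd_d
        if x <= s_mu - n2 * sd_p:                  # A zone (left)
            return 0.0
        if x < mu_p - n2 * sd_p:                   # B zone (ramp)
            return (x + n2 * sd_p - s_mu) / (mu_p - s_mu)
        if x <= mu_p + n2 * sd_p:                  # C zone (plateau)
            return 1.0
        return 0.0                                 # A zone (right)
    else:                                          # case 2 (symmetric)
        x = mu_p + beta * sd_p
        if x <= s_mu - n1 * sd_d:
            return 0.0
        if x < mu_d - n1 * sd_d:
            return (x + n1 * sd_d - s_mu) / (mu_d - s_mu)
        if x <= mu_d + n1 * sd_d:
            return 1.0
        return 0.0
```

For equal averages and standard deviations the index is 1 (C zone); with μ^d = 0.5, σ^d = 0.1, μ^p = 1.0, σ^p = 0.2 and S_μ = 0, the document falls in the B zone of case 1.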

Measure of the discordance index D_j(d, p_i^h)

The symmetry property also applies to the discordance index measurement, so D_j(d, p_i^h) = D_j(p_i^h, d). As mentioned previously, this index represents a correction factor of the concordance and depends on the importance of the parameters characterizing concept j. These parameters are, on the one hand, the standard deviation σ_j^d or σ_j^{p_i^h} (according to case 1 or 2) and, on the other hand, the difference E_j^μ between the averages μ_j^d and μ_j^{p_i^h}.

Partial discordance index D_j^μ(d, p_i^h)

Part of the discordance is induced by the importance of the difference E_j^μ = |μ_j^d − μ_j^{p_i^h}|. We want to build a function of E_j^μ (the variation between the averages) that varies slowly at the beginning and takes higher values as E_j^μ increases. An example of this function's variation is summarized in Table 2.


Table 2. Partial discordance variations D_j^μ(d, p_i^h) (CAD-04-C-I)

E^μ      D_j^μ(d, p_i^h)   ΔD_j^μ(d, p_i^h)   Δ factor (slope)
≤ 0.20   0.00              -                  -
0.30     0.10              0.10               2
0.40     0.30              0.20               4
0.50     0.60              0.30               6
≥ 0.60   1.00              0.40               8

Using the data from Table 2, a function is determined by quadratic regression (curve fitting with SPSS). This function is expressed in the following way:

D_j^μ(d, p_i^h) = 0 if E^μ ≤ 0.20
D_j^μ(d, p_i^h) = 5(E^μ)² − 1.5E^μ + 0.1 if 0.20 < E^μ < 0.60
D_j^μ(d, p_i^h) = 1 if E^μ ≥ 0.60
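The piecewise function can be checked against the points of Table 2 with a short sketch (the helper name is ours):

```python
def partial_discordance_mu(e_mu):
    """Partial discordance as a function of E_mu = |mu_d - mu_p|:
    zero up to 0.20, one from 0.60 on, and the quadratic fitted
    through the points of Table 2 in between."""
    if e_mu <= 0.20:
        return 0.0
    if e_mu >= 0.60:
        return 1.0
    return 5 * e_mu ** 2 - 1.5 * e_mu + 0.1
```

The quadratic passes exactly through the tabulated points: 0.1 at E^μ = 0.30, 0.3 at 0.40 and 0.6 at 0.50, and it joins the flat pieces continuously at 0.20 and 0.60.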

Partial discordance index D_j^σ(d, p_i^h)

It is a measure induced by the importance of the variability of the confidence degree measurements of concept j. The standard deviation expresses how precise the concept extraction task is. This partial discordance index D_j^σ(d, p_i^h) is calculated by linear extrapolation.

Here, we consider that a standard deviation of 30% of the average μ_j^d (the ratio S_j^d) is a significant threshold over which this parameter is considered too important and the distribution of the confidence degree measurements presents too great a variability.18

D_j^σ(d, p_i^h) is a function of S_j^d and is given by:

18 It is generally admitted that 30% of the average is a threshold over which data present too great a variability.


D_j^σ(d, p_i^h) = S_j^d / 0.30 if 0.00 ≤ S_j^d < 0.30
D_j^σ(d, p_i^h) = 1 if S_j^d ≥ 0.30

Total discordance index D_j(d, p_i^h)

It is the discordance index used in the calculation of the partial indifference index I_j(d, p_i^h). It is determined by the following relation:

D_j(d, p_i^h) = Max[D_j^σ(d, p_i^h), D_j^μ(d, p_i^h)]
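The two partial indices and their combination can be sketched as (helper names are ours):

```python
def partial_discordance_sigma(s_j):
    """Partial discordance from the ratio S_j = sigma / mu of
    concept j: linear up to the 30% threshold, then saturated."""
    return s_j / 0.30 if s_j < 0.30 else 1.0


def total_discordance(d_sigma, d_mu):
    """Total discordance: the more severe of the two partial
    indices (the maximum)."""
    return max(d_sigma, d_mu)
```

For example, a coefficient of variation of 15% yields a partial index of 0.5; combined with a μ-based partial index of 0.3, the total discordance is 0.5.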

Measure of the indifference index I_j(d, p_i^h)

The local and global indifference indices are computed as for the method DAC-01-C-P, as described in the section Measure of the Similarity Indexes.


Annex E: COTS Product Evaluation

Autonomy

Autonomy technology is fundamentally an SDK (Software Development Kit)19 with a sophisticated indexing engine around which Autonomy has built many document management solutions.

Main purposes

Autonomy has built and commercialized a document management system around their engine. The documents can be in many formats (200+), languages (56) and types (text, voice20, images21), and can come from many sources (Internet, intranet, audio and video streams, e-mails, document repositories (e.g. Lotus Notes), etc.).

The systems created by Autonomy are complete and have many features related to document management, user profiling, classification, clustering, taxonomy generation and other useful capabilities.

Global features and architecture

Autonomy technology is based on a combination of three tightly integrated servers:

• Dynamic Reasoning Engine (DRE): the heart of the Autonomy technology. It is an indexing engine based on a statistical approach; the mathematical foundation behind the indexing comes from the theories of Bayes and Shannon.

• Classification server: in conjunction with the DRE's ability to understand information contextually, this server brings the functionalities of clustering, categorization and taxonomy generation.

• User agent server: again in conjunction with the DRE, this server manages user agents. A user agent can be seen as a document prototype of interest to the user; in return, the user is alerted when a new document matching this prototype enters the system. The user agent server automatically manages the links between users having the same interests (matching user agents), and a collaboration feature lets users easily find other users with the same interests or expertise.

These three servers are supported by a complete API for building document solutions. Autonomy, with their partners, have created a range of document management solutions and bring their clients many modules covering a wide variety of needs.

The following table and figure show the architecture of the Autonomy products.

19 Autonomy corporation, “Global Infrastructure White Paper (AutonomyGlobalInfrastructureWP_0501.pdf)” 20 The audio and sound in video are converted in text by a speech to text engine. 21 The images are converted in text by using an OCR using the same technology of pattern matching in the audio converter.


Table 3. Autonomy Architecture

Term Definition

DRE™ Dynamic Reasoning Engine™, a massively scalable, multithreaded process that performs the essential tasks of analysis and conceptual profiling of content; conceptual pattern matching; relevance ranking; delivery of results; and interaction with other components.

AXE Autonomy XML Engine, which enables organizations to eliminate the inefficiencies introduced by many of the manual issues associated with creating XML tags, by understanding the content and purpose of either the tag itself, or related information, or both.

DQH Distributed Query Handler for high performance, reliable location transparency.

DIH Distributed Index Handler for high performance, reliable content routing and categorization.

DiSH Distributed Service Handler for a single point of control, monitoring and configuration of all Autonomy components.

Classification Server™ In conjunction with the Dynamic Reasoning Engine™, Classification Server fuses together the ability to categorize and cluster data and to generate deep hierarchical taxonomies completely automatically.

UAServer™ User Agent Server is a highly scalable server that handles requests for user administration functionality, including user information, user agents, user profiles, user roles and user authentication/ security.

ACI/API Compliant DRE™ 4 uses a new client API, the ACI (Autonomy Content Infrastructure) API, which enables easy communication between custom-built applications that retrieve data using HTTP commands and the Autonomy ACI servers (DRE™ 4/AXE™), as well as simple manipulation of the returned results set.


Figure 23. Autonomy’s IDOL architecture and technical components

• Indexing engine: Autonomy calls their indexing engine the Dynamic Reasoning Engine (DRE). This engine takes, as input, data in XML format and indexes it in its server. The data submitted to the DRE contains the necessary metadata and the raw data (text) to exploit it.

• Automatic hyperlinking: Autonomy has built functionality for automatic creation of hyperlinks to other documents with the most important concepts in the document displayed. This feature helps users to navigate in the knowledge database formed by the repositories of documents.

• User profiling22: The user profiling is associated with the management of user agents.

• Clustering23: The clustering uses the metadata obtained from the DRE server and is performed by the Classification server. No information has been found or obtained on the mathematical foundation used for the

22 Autonomy corporation, “Autonomy Updatetm Technical Brief (Agent_Update TB.pdf)” 23 Autonomy corporation, “Classification Server™ Technical Brief (Classification server.pdf)”


clustering algorithms. The visualization of the results is very good and gives users a picture of how the documents are organized and linked. There are many ways of displaying the clustering results: 2D cluster mapping, 3D cluster mapping and spectrograph.

• Classification23: The classification can be achieved in three ways:

• By creating a hierarchy of user agents and the classification server will associate corresponding documents to them;

• By using the taxonomy generation functionality. This functionality requires enough documents to build a good taxonomy. After it has been created, the user accesses the metadata from that taxonomy to create the user agent hierarchy;

• By using the Categorizer product. The user can manually define a hierarchy of categories, and the product can use the generated taxonomy to create the base of the hierarchy. It seems easy to use, but the definition of a category is more complex and requires some statistical knowledge.

• Taxonomy generation23: The classification server can generate a taxonomy based on documents. This taxonomy contains all the metadata necessary for use by other functionalities such as classification.

• Audio and Video24,25,26,27: The information extracted from the audio and video is the speech. The speech-to-text engine behind the audio module does not need to be trained to convert speech to text. The technology of this engine was originally developed at Cambridge University where Dr. Tony Robinson (Managing Director) led a research team for over ten years. SoftSound licensed, commercialized and further developed this technology to provide a broad base of audio and speech processing algorithms. The key features are:

• Speaker independent operation - works out of the box without manual training;

• Very large vocabularies - no arbitrary limit in size;

• Patented search technology to drastically reduce CPU and memory requirements;

• Automatic customisation through additional text material;

24 Autonomy corporation, “Audio and Broadcast White Paper (Autonomy Audi Broadcast WP.pdf)” 25 Dremedia limited, “Technology white paper (Dmtech-wp2.pdf)” 26 Dremedia limited, “www.dremedia.com” 27 Softsound limited, “www.softsound.com”


• Incorporates both state-of-the-art acoustic models (Hidden Markov models and Recurrent Neural Networks);

• Tailored for information retrieval use (not just dictation);

• Support for multiple languages (current builds include English, French, Greek, Italian and Spanish).

In May 2000, SoftSound received substantial investments from Autonomy. Autonomy's technology not only indexes the text that comes from the conversion, but also indexes the position of the text in the audio or video file. So, the user can fetch the original audio or video file or stream at the frame or position where the searched concepts are. The source of audio or video can be an Internet audio or video stream or an audio or video file.

Dremedia is a company founded to create a fully automated software platform for digital television production and interactivity, increasing productivity and reducing time-to-air. Dremedia was founded in 2001 by leading broadcast and technology executives; its principal external investor is Autonomy. Dremedia uses the SoftSound technology in their products. This company is responsible for the video management system found in the Autonomy products.

The DRE™ accepts a piece of content and returns a summary containing the most salient concepts of the content. In addition, summaries can be generated that relate to the context of the original inquiry, allowing the most applicable dynamic summary to be provided in the results of a given inquiry28.

Specific features related to ADAC

Globally, Autonomy seems to offer many features that ADAC intends to offer to its clients, and it would be easy to integrate Autonomy’s features into ADAC, but the opposite would be very difficult. For example, using the clustering features with ADAC’s indexing metadata would not be possible because, by the nature of the algorithms, they do not treat the same metadata.

• Recovery agent: Autonomy offers various connectors to retrieve documents from any source (Internet, intranet, document repositories, document management systems, e-mail, other), of many types (text, audio, video) and in many languages (56). If no connector is satisfying, the API can be used to build the one needed. The integration with ADAC or other products can be done with the API.

• Summarizing: Autonomy’s summarizing technology can produce a summary of a document where the most important concepts (based on statistics) are extracted, or a summary where the most important concepts related to predefined concepts (a prototype or user agent) are extracted. In both cases, Autonomy extracts the phrases

28 Autonomy corporation, “Technology white paper (Autonomy TechnologyWP.pdf)”, 19p


where the concepts are present. The integration to ADAC can be done easily by using the API.

• Indexing: We can compare the indexing process to the creation of ADAC’s text DNA, with the exception that Autonomy’s process uses the mathematical theories of Bayes and Shannon. The concepts extracted from documents are based on this mathematical foundation. Indexing is done by sending the DRE a file formatted in XML, which has to include some mandatory fields to index the document. The integration with ADAC can be done easily by using the API.

• Diagnosis: The diagnosis is not really implemented, but part of it is already there. As with ADAC, the user has to define the diagnosis (a document prototype). Autonomy brings this functionality through user agent creation. Automatic alerting by e-mail to new content based on the user’s interests can be added to the user agent definition. More complex actions and the integration with ADAC can be added easily by using the API.

• Classification: The term used in the Autonomy documentation is category. The categorization is made automatically by the Classification server. The server categorizes the document based on its content, which is analyzed by the DRE. The categorization hierarchy is built automatically without human intervention; no information is available on the algorithms used to build it. The hierarchy can be modified by the user with a specialized user interface. This interface is relatively simple, but the associated parameters require specific knowledge. The user interface for visualizing the hierarchy and its documents is already present within Autonomy’s products. The integration can easily be done by using the API.

• Clustering: Autonomy’s clustering functionalities answer most of the needs of ADAC’s clustering. The clustering can be performed on an entire document repository, on user agents or on profiles. No information is available on the clustering algorithms used in their products. The visualization of clusters can be made in many ways: the clusters can be presented graphically in 2D and 3D, and when the user points to the 2D surface or into the 3D space, the related concepts are shown and the documents present in the cluster can be browsed. These visualization methods are very useful and could be integrated into ADAC. The clustering features can be easily integrated into ADAC, but the opposite would be rather difficult: as mentioned earlier, the metadata used by the algorithms are not the same as ADAC’s.

• Visualization: Based on the documents read and the demonstration in December 2002, Autonomy provides many of the user interfaces needed in ADAC. Autonomy’s user interfaces can be modified, or new interfaces can be created with Autonomy’s functionalities, to meet ADAC’s needs.


Strength and weakness

Autonomy answers a lot of ADAC’s requirements with the functionalities developed around their principal servers: the DRE, the UA server and the Classification server.

Autonomy’s products are very well provided with features of document recovery from many sources, formats and types. The other features (classification, categorization, clustering, user profiling, visualization, environment administration, security) are well implemented and can be modified or adapted to ADAC’s requirements. The API covers the use of the three servers and the other parts of Autonomy’s products. Since Autonomy has built all their solutions around the provided API, many other solutions can be imagined and implemented with it.

Autonomy has one main weakness, and it is very important: the understanding of the concepts in documents is based on statistical techniques (a combination of the Bayes and Shannon theories). We have not found information about the accuracy of these techniques and thus cannot estimate their precision. More research on the accuracy of these methods should be done before deciding to use them.

Integration, deployment and scalability

The integration of Autonomy’s products is made through the provided API. It is complete and can be used in many situations. The architecture of these products is designed with the flexibility required to implement complex document management solutions.

The user interfaces created with these products are generally implemented to be displayed in an Internet browser. The deployment of the solutions therefore requires the three servers to be connected to a web server, and the components of the solutions must obviously be connected on a TCP/IP network. No documentation was available on how the HTML components are used and whether there are active components (Java applets, ActiveX, etc.). Given that the Department of National Defence network imposes certain limitations at this level, this aspect will have to be analyzed. The deployment can be centralized or distributed (replication of servers). Maintenance is very simple in a centralized deployment; in a distributed deployment, automatic updates of the distributed servers will have to be considered.

Autonomy’s products are very strong from the scalability point of view. Autonomy relies on the characteristics listed below to ensure scalability.

• Multi-tier: Autonomy’s multi-tier architecture gives administrators the flexibility to independently address user load requirements, processing requirements, data volume requirements and interface functionality/appearance. By allowing these to be independently scaled, only the part of the system that requires growth needs to be increased, thereby allowing platform cost savings in areas that are not growing.


• Modular: Autonomy consists of a number of modules designed to perform particular functionalities, including the core DRE™ engine, Classification Server, UAServer, Connectors etc.

• Distribution: Autonomy’s architecture supports distribution of all modules. Such distribution of modules across machines allows for:

o Linear scaling: e.g. to double performance/capacity - simply replicate the existing machine;

o True parallelism: allowing multiple machines to work in parallel to bring the solution to the users;

o Robustness: even hardware failure can be catered for using a distributed (and sometimes replicated) system;

o Redundancy: multiple replicated modules can run on independent machines ensuring backup systems are up to date and always live;

o Geoefficiency: geoefficient systems can be created. Geoefficiency refers to the ability of the distributed architecture to provide a system that has the right components in the right geographic locations;

o Future Proofing: future hardware platforms can be integrated in parallel to existing systems, offering almost seamless cutover to new hardware.

• Caching: Autonomy uses multi-tier caching, ensuring that the minimum of operations is performed to provide the functionality required.

• High Performance: Autonomy has fully multithreaded products, allowing parallel transactions to take place in the core engines and distribution components.

• Cross-platform: Autonomy’s products can be deployed on many platforms, from Windows servers to almost any POSIX-compliant system.

• Replication: Replication allows redundancy to be built into systems.

• Reliable/Monitored: Autonomy provides this capability through fail-over mechanisms built into the distribution components. DQH allows 100% uptime from a pair of servers, while DIH ensures data integrity across the pair. The Distributed Service Handler (DiSH) component allows effective auditing, monitoring and alerting of ALL other Autonomy components.


Delphes

Delphes technology is essentially an SDK with a sophisticated indexing and search engine around which Delphes built their products. The search engine uses all linguistic aspects (English, French and Spanish) to retrieve the most relevant documents.

Main purposes

Delphes built around their SDK a set of modules necessary for the use of a search engine. There is an indexing module, whose parameters are set in the administration module, and a searching module that can be integrated in a Web site. All these modules are Web applications.

Technology foundation29

At the heart of the informative system integrated by Delphes are linguistic modules based on the concept of universality. The main characteristics of universality reduce to a small number of abstract relations common to all languages.

The concept of universality explains the variety of languages and supplies a means of achieving uniformity when processing data conveyed by any language of the world. In spite of the variety of languages, the capacity to denote referents and to express valid propositions is universal. This universality is incorporated into the informative system integrated by Delphes.

The informative system integrated by Delphes incorporates the principles and parameters of universal grammar. The principles determine both the morphological form and the syntactic constitution of linguistic expressions.

The universal engine

At the core of Delphes' integrated information system is the Universal Axiomatic Engine (UNAX™). UNAX™ is based on advanced principles and parameter scanning technology that models high-performance human properties. The UNAX™ performs four main functions:

Configuration detecting

UNAX™ is designed to detect information in terms of the identification of abstract structural identities, called "configurations" (Di Sciullo, 1996 and related works). These range from structured sets of characters to structured sets of morphemes, of words, of phrases and of texts.

There is a basic qualitative difference between configuration detecting and common practice in search engines. The latter are typically designed to identify singular entities: singular characters, singular morphemes, singular words, and so on.

The specificity of the configuration-detecting engine is that it mimics the natural, even though implicit, procedures used by humans to search for and extract information. In

29 Delphes International inc., “Livre blanc Système informationnel intégré (white_paper.pdf)”


effect, recent research in cognitive science reveals the centrality of structured relations in conceptual activities, such as identification, grouping, displacement and linking. These activities are brought about in terms of abstract categories, a large subclass of which are not visible to the human perceptual system but nevertheless are part of the cognitive algebra that enables humans to manipulate information.

UNAX™ mimics a fundamental feature of the human cognitive system: the ability to process information supported by natural language in terms of the manipulation of abstract configurations and categories.

Relation preserving (transformational facilities)

UNAX™ also has the ability to maintain the relations between the query and equivalent expressions, which are meaning preserving. This is done through a limited set of transformations.

Let us illustrate with a simple case: for the query "the portrait of Mona Lisa by Da Vinci", UNAX™ will identify the set of texts, as well as the set of paragraphs therein, that include equivalent expressions.

In the case at hand, the equivalent expressions are: "Mona Lisa's portrait by Da Vinci", "Da Vinci's portrait of Mona Lisa", "the portrait of Mona Lisa that Da Vinci painted". The set of equivalent configurations will not include the following: "the portrait of Da Vinci", "the portrait of Mona Lisa", "Da Vinci's portrait by Mona Lisa", which can also be obtained by transformations but are not meaning preserving.

The transformational facilities incorporated in UNAX™ sharpen and enhance the precision and the recall of information retrieval, and make the extraction of information from documents more precise.

Concept expanding

UNAX™ is designed to process concept bearing configurations. This facility is incorporated in the natural language processing modules. Thus, the operations of the morphological module identify the different dimensions of the conceptual information supported by morphological objects. The conceptual information bears on the nature of the referent (object of the search) denoted by a derived word, whether it is an entity, a property or an event. It also bears on the nature of the denoted referent with respect to its temporal localization as well as its singular or generic nature.

UNAX™ derives conceptual expansion from the relation between a root and a derivational affix, as well as from the relation between a root and an inflectional affix. This aspect of the engine contributes to the originality of the system, as current information retrieval and extraction systems do not provide a fine-grained morpho-conceptual analysis. The concept-expanding facility of the morphological module enhances the accuracy of information processing. This facility is an original feature of Delphes' integrated information system.

106 DRDC Valcartier TR 2004-265

Another important aspect to morphology is the analysis of compound words. Words like "shiftclick", "doormat" and "ringleader" are a conjoining of distinct root words. Delphes' integrated information system uses a lexical map to identify such words and determine their contextual function. With considerable innovation, it goes much further than other linguistic engines by detecting certain compound words that are not graphically conjoined or hyphenated but can still be treated as a single concept. Many common terms like "human resources" are contained in Delphes' expansive indexing dictionary, while other terms can be detected based largely on grammatical principles. For example, the system will determine that the query "Delphes' human resources" contains a compound word, "human resources", and therefore a more precise search will be made for this relevant term.
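The lexical-map idea described above can be sketched very simply: scan a token stream and merge adjacent tokens that appear in a compound dictionary, so that a phrase like "human resources" is treated as a single concept. The dictionary entries and function below are hypothetical, not Delphes' actual data structures.

```python
# Assumed compound lexicon; the real "lexical map" is far larger and also
# applies grammatical principles to detect compounds not in the dictionary.
COMPOUNDS = {("human", "resources"), ("door", "mat")}

def merge_compounds(tokens):
    """Merge adjacent tokens found in the compound lexicon into one unit."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in COMPOUNDS:
            out.append(tokens[i] + " " + tokens[i + 1])  # treat as one concept
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

assert merge_compounds(["Delphes'", "human", "resources"]) == ["Delphes'", "human resources"]
```

Once merged, the compound can be indexed and searched as a single term, which is what makes the query "Delphes' human resources" more precise than a search for the two words independently.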

This advanced treatment of compound words and the extensive stemming programmed into Delphes' integrated information system makes it a morphological powerhouse. This strength contributes in turn to the unparalleled efficiency of the syntax analysis performed with the system.

UNAX™ derives conceptual expansions from the relations between syntactic constituents. The identification of the conceptual relations supported by nominal expressions is central in the system, as the referent (object of a search) is supported mainly by nominal expressions in natural languages. The ability to associate nominal configurations with conceptual configurations reduces the set of possible interpretations for a given expression on the one hand, and, on the other hand, renders the retrieval and extraction processes more precise. For example, the association of a conceptual configuration with the expression "the military inventions of Da Vinci" ensures that the referents of the search are entities of the "invention" type, that these entities are restricted by the descriptive modifier "military", and that they are further restricted by an alienable possession relation holding between the restricted entity "military inventions" and its creator, "Da Vinci".

Evolved text search

UNAX™ is an axiomatic system that performs evolved text search, the procedure and results of which are well ahead of current search systems.

Thus, UNAX™ performs evolved information-processing tasks, fundamentally different from the familiar but very limited tasks performed by keyword and Boolean search. Moreover, UNAX™ is more refined than systems that incorporate natural language techniques, including NP detection and shallow parsing, as it dynamically brings out the conceptual dimension of the derived configurations.

Global features and architecture

Delphes DioWEB is an indexing and search engine in which both indexing and search are based on natural language processing. At this time, Delphes is developing a module to categorize documents using an ontology or a taxonomy. The SDK supplied with the product allows the creation of complex solutions; the integration with ADAC was done using this SDK.


• Indexing engine30: DioWEB includes an indexing engine capable of extracting the semantic sense of the text. This engine builds the metadata linked to a document and stores it in its database. DioWEB supports more than 200 document formats (Word, PDF, etc.). Indexing is organized as jobs that the user (administrator) must define. The job parameters are: source of documents, schedule, category to be associated, language, and global parameters for processing the document source. With the SDK, programmers can enable job creation by any user. It is similar to the creation of a User Agent in Autonomy's products.

• Search30,31: DioWEB includes a searching feature based on the semantic sense of the query. The results are presented in XML format with the following fields: original document address (URL), document title, number of hits in the document, score32 of each hit, the paragraph associated with the hit (the most relevant paragraph is presented first), and other useful metadata.

• Summarization31: DioWEB can deliver a summary of a document based on the searched subject. It extracts the most relevant paragraphs linked to the search query. At this time, Delphes does not support the creation of a summary based on the most relevant subject of a document. The summary is always related to a search query. DioWEB provides a summary of all documents where a search result is positive. This functionality is well presented and could be formatted in HTML or PDF, and can be sent by email.

• Visualization31: The visualization of the source document and of the search results is well presented and easy to access. A new user interface presenting the same results can be created, or the existing interfaces can be adapted to ADAC's requirements.

Specific features related to ADAC

• Recovery agent: The recovery agent is included in the indexing engine through the ability to specify the document source and the schedule when the user creates an indexing job. The indexing job can work in conjunction with the recovery agent currently in place in ADAC.

• Summarizing: The summarizing process in ADAC cannot be replaced by Delphes' summarizing process because of its nature: Delphes builds a summary for a search query, not for the most relevant subject of a document. If Delphes, the ADAC team, or another company provides a means of extracting the most relevant subject from a text, a summarizing process similar to the one provided by Copernic Summarizer will become possible.

30 Delphes International inc., "Technologie Diogène Livre blanc (white_paper_diogène.pdf)" 31 Delphes International inc., "www.delphes.com" 32 The score is calculated with a complex algorithm that takes into account the semantic value, its position in the text and other retrieved information.

• Indexing: Delphes indexes documents according to its scientific foundations. The metadata resulting from this indexing has not been analyzed yet; therefore, we cannot give an opinion on this issue. Presently, ADAC's indexing process uses Delphes to retrieve the search-query results used to build the text DNA.

• Searching: Delphes provides a search interface that could be modified to satisfy ADAC’s requirements. The search results interface can also be modified to better reflect ADAC’s needs.

• Visualization: In the search results, DioWEB provides a link to the original document and presents the most relevant paragraph linked to the search query. The summary can also be presented and/or sent to another user if the summarizing module is activated. In ADAC, the search results help us build the visualization of the document by placing highlights where concepts of the ontology are present.

Strength and weakness

The main strength of this product, and it is an important one, is its scientific foundation. Its ability to understand text is unique and very precise. However, the development of solutions based on this product is relatively recent, and we did not find many ready-made features useful to ADAC. Delphes is developing more functionalities around its products, but they are not yet available; in other cases, we can develop the features we need using the API. Delphes does not supply a processing-distribution module for configurations with a large number of users. For a large-scale deployment, the customer will have to use other products that allow distribution of resources (load balancing, replication of databases) across a network.

Integration, deployment and scalability30,33

The integration of DioWEB can easily be done using the API, which supports all the features. The API is available in C++ and Perl; the COM version will be available soon.

The deployment of solutions created with DioWEB is the same as for client/server solutions. The server and the document sources have to be connected to the network. Delphes' products are Microsoft-centric: the indexing server needs Windows NT or Windows 2000 Server with MS SQL Server activated, and the user interfaces are implemented in ASP code with the Microsoft IIS server. Thus, the servers have to be implemented with Microsoft products. Delphes is working on porting its servers to other platforms (Solaris 2.7 and Linux), but no time frame is available; based on customers' needs, Delphes may deliver these versions earlier than forecast. Delphes is currently developing a proprietary database system to better answer its applications' needs and reduce the cost of ownership for clients.

33 Delphes International inc., verbal information


Delphes products run on standard clustering environments, such as Microsoft's load-balancing and clustering solutions, including Microsoft Application Center. This allows the load to be distributed over several servers, provides a fail-over solution, or a combination of both. Delphes is also working on a module, transparent to the user, that will distribute searches simultaneously over several servers and merge the results. It allows creating multiple indexes and presenting them as a single index. This module will provide better performance for very large indices.
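The "distribute searches and merge the results" idea can be sketched as merging per-server hit lists that are each already sorted by descending score. This is a hypothetical illustration, not Delphes' actual module.

```python
import heapq

def merged_search(shard_results):
    """Merge hit lists from several index servers into one ranked list.

    Each element of shard_results is a list of (score, doc_id) pairs,
    already sorted by descending score, as a search server would return.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return [doc for score, doc in merged]

shard_a = [(0.9, "a1"), (0.4, "a2")]  # hits from server A
shard_b = [(0.8, "b1"), (0.5, "b2")]  # hits from server B
assert merged_search([shard_a, shard_b]) == ["a1", "b1", "b2", "a2"]
```

Because each shard is already sorted, the merge is linear in the total number of hits, which is what makes this kind of transparent fan-out practical for very large indices.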

Stratify

Stratify's technology is more an SDK with content-management functionalities than a complete content-management solution in itself. Its functionalities are based on a statistical approach.

Main purposes

Stratify has built and commercialized a document classification solution. Stratify provides user interfaces with the SDK to help users maintain the classification hierarchy.

Global features and architecture

The features of their products are supported by an engine that extracts the concepts of documents using a statistical approach, which exploits the weight of words in documents.

• Taxonomy Building34,35,36: Builds a customized taxonomy automatically from corporate documents, producing a single consistent view of information important to your business; Refines and optimizes taxonomies and training sets automatically, ensuring that taxonomies and classification models stay relevant and accurate with changes to document corpora and corporate objectives; Imports industry-standard or customized taxonomies in XML, or directly from an organized file system or Web site; Provides a simplified, distributed workflow to manage the total taxonomy lifecycle using the single, integrated Taxonomy Manager interface.

• Classification37: Provides a parallel classification architecture that uses multiple classifiers to provide the most accurate classification of your documents; provides a Boolean classifier in addition to statistical, keyword and source classifiers, enabling the Discovery System to seamlessly import existing taxonomies with predefined rules; allows as much or as little human oversight of taxonomy creation and document classification as desired through the integrated Taxonomy Manager interface; collects and classifies corporate documents and external information in over 200 formats automatically from file servers, the Internet, intranet sites, Lotus Notes, FileNET, Open Text, Documentum, and Microsoft Exchange, eliminating the need for manual tagging.

34 Stratify, inc., "Discovery SystemTM (StratifyDiscoverySystemDatasheet.pdf)" 35 Stratify, inc., "Taxonomy Manager (StratifyTaxonomyManager2.0.pdf)" 36 Stratify, inc., "Technical White Paper (StratifyTechWhitePaper2.0.pdf)" 37 Stratify, inc., "Classification Server (StratifyClassificationServerDatasheet.pdf)"

• Interfaces34,36: Provides browser and Windows based interfaces that enable users to intuitively navigate information by browsing the taxonomy and binning search results into topics; Retrieves information proactively by suggesting topics and documents related to content that a user is browsing or work in progress; Incorporates advanced personalization features that learn users’ interests automatically and provide them with matching documents.

The architecture of the product is shown in the following figure.

Figure 24. Stratify discovery system architecture

• Crawler — Crawlers find documents of any format in internal repositories and on the Web and extract text and metadata from them.

• Taxonomy Builder — Using patent-pending technology based on clustering and pattern matching algorithms, the Taxonomy Builder automatically creates a structured taxonomy from unstructured documents. Alternatively, it allows administrators to import and extend an existing taxonomy.

• Metadata Server — This J2EE compliant middle tier presents metadata to Stratify user interfaces and third party applications.


• Taxonomy Manager — Included with all Stratify software products, the Taxonomy Manager supports complete editorial control of the taxonomy building and classification processes for total taxonomy lifecycle management.

• Classifier — The Classifier identifies the main ideas in text and classifies documents into a taxonomy defined uniquely for each enterprise.

• Metadata Repository — This open, SQL relational database stores document metadata and classifications for access by any application.

• Stratify APIs — A robust set of Java, Web services, and WebDAV-compatible APIs allows easy integration.

• Stratify Web Access™ — This browser based interface allows users to navigate the taxonomy, find related documents, and view search results organized into the taxonomy.

• Stratify Windows Client™ — This Windows application learns user interests automatically, delivers related documents, classifies Microsoft Word, PowerPoint, Adobe Acrobat and HTML documents in real time, and provides offline access to selected documents.

Specific features related to ADAC

• Taxonomy builder: In the process of taxonomy creation, Stratify uses a clustering algorithm and other algorithms to identify groups and extract the concepts of each node. The result of this process is an automatic classification of documents, and the Stratify Web Access interface can be provided to browse it. Unlike the clustering implemented in ADAC, which produces a flat list of clusters, the taxonomy builder can provide a hierarchy of groups of documents; for users, a hierarchy is more effective than a flat list of groups. The taxonomy builder uses two algorithms to create the hierarchy: K-Means clustering and Hierarchical Agglomerative Clustering. The former is a fairly basic algorithm described in most data analysis texts; the latter has an extensive literature on the web: http://www.bci.gb.com/products/programs/hac_pub.htm, http://www.cs.umd.edu/hcil/multi-cluster/. Stratify has made extensive changes to the algorithms, mainly to improve robustness in the face of different kinds of data sets and to produce more user-friendly output. Their algorithms have been tested on many real-world data sets obtained under NDAs from their customers and prospects. The taxonomy builder can import a taxonomy defined in a standard language; the taxonomy can be created from the classification structure definition in ADAC.

• Classification: The classification is based on the taxonomy created by the taxonomy builder. If the taxonomy comes from ADAC’s exported


definition, then the classifier module can be used to classify documents in ADAC. The classification process can be described as follows. Using patent-pending advances in information theory, artificial intelligence and machine learning, the Classifier analyzes the text extracted by the Crawler. As shown in Figure 25, the Classifier architecture is modular, applying multiple classification methods to each document. The Combiner module within the Classifier aggregates the results of each method and chooses the best classifications, based on its knowledge of the strengths of each classifier and the degree of certainty each classifier has about a particular document. By using multiple complementary techniques, the Classifier achieves much greater accuracy than single-method classifiers; it has achieved precision and recall rates of more than ninety percent on certain document corpora.
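The combiner idea, aggregating per-category confidences from several classifiers and weighting each by its known reliability, can be sketched as a weighted vote. The actual Stratify Combiner logic is proprietary; the function, weights and categories below are hypothetical.

```python
def combine(predictions, reliabilities):
    """Pick the best category from several classifiers' outputs.

    predictions:   {classifier_name: {category: confidence}}
    reliabilities: {classifier_name: weight}  (assumed prior strengths)
    """
    scores = {}
    for name, cats in predictions.items():
        weight = reliabilities.get(name, 1.0)
        for cat, conf in cats.items():
            scores[cat] = scores.get(cat, 0.0) + weight * conf
    return max(scores, key=scores.get)

preds = {
    "statistical": {"terrorism": 0.6, "finance": 0.4},
    "boolean":     {"terrorism": 1.0},
    "source":      {"finance": 0.7},
}
weights = {"statistical": 1.0, "boolean": 0.5, "source": 0.8}
assert combine(preds, weights) == "terrorism"
```

The appeal of this design is that complementary classifiers cover each other's weaknesses: a rule-based (Boolean) classifier can be decisive where a statistical one is uncertain, and vice versa.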

Figure 25. The Stratify classification process

Strength and weakness

The strength of their products is the complete access to processes and results via an API. ADAC can use the taxonomy builder to present clusters of documents in the clustering process, and it can also use the topic information to enhance the ontology definition.
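The first of the taxonomy builder's two clustering algorithms, K-Means, can be sketched in a few lines. This is a minimal stdlib-only illustration over 2-D points standing in for document vectors; Stratify's modified production algorithms are far more elaborate.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster tuples of floats into k groups; returns (centroids, clusters)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)          # pick k distinct starting points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to nearest centroid
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of "document vectors" (2-D for illustration):
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
_, clusters = kmeans(pts, 2)
assert sorted(len(c) for c in clusters) == [2, 2]
```

In the taxonomy builder, the resulting flat clusters would then be refined and organized hierarchically (the role of the agglomerative algorithm), which is what yields a browsable tree rather than a flat list.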

Stratify provides a good way to classify documents, and classification accuracy should improve as a result.

Stratify's products are limited in functionality, and the classification process is already well covered in ADAC. However, the taxonomy builder can help ADAC suggest a classification structure to the user.

Integration, deployment and scalability34


Stratify's product is designed with an open, flexible architecture and robust Java, Web services and WebDAV APIs for simple, streamlined integration with existing applications and IT computing environments.

The deployment of solutions created with Stratify is the same as for client/server solutions. The server and the document sources have to be connected to the network. The server runs on Windows 2000 Server, and the database can be MS SQL Server or Oracle 8.1.7 on Sun Solaris 8. The client applications (Taxonomy Manager and others) run on Windows 2000 Professional and higher.

Scalability is one of the key design goals of the architecture of the Discovery System. Each main function of the system (taxonomy building, document acquisition and classification) has been implemented in a separate module to allow users to distribute their applications effectively on multiple machines. The Taxonomy Server can be duplicated on multiple machines, so that each machine can be dedicated to the taxonomy building task for different corpora. Similarly, if documents are being acquired for classification from different sources, crawlers can be set up on different machines and each machine assigned a set of sources to crawl.

The classification server can also be duplicated on many machines. Each server instance can be set up to display the same taxonomy or have its own taxonomy. Classification requests from Stratify's crawlers as well as third-party crawlers/API clients are all handled by the classification server, rather than the main application server, thus insulating the database from high volumes of classification requests. This feature also allows users to implement host-based authentication to the different taxonomies. Because the Classification Server communicates via HTTP, it is easy to control access to a given taxonomy simply by configuring the underlying web server (e.g., IIS, Apache) appropriately.

Each classification server can be configured to refresh its metadata cache from the database at different intervals. This allows the system to stagger the download requests going to the database from each classification server. Also, the classification servers do not maintain any state; this means that the response time on requests does not depend on the history of previous requests, and that classification servers can easily be shut down and migrated to other machines.

The internal design of the Stratify Crawler is also meant to increase the scalability of the system. The Crawler has multiple threads processing documents at different stages (acquisition, classification, indexing and insertion into the database). This maximizes the use of the different components of the system and allows graceful degradation if any of the components (the indexing engine, say) were to malfunction.

This complex architecture is made easy to maintain and monitor through the Stratify Console, which allows easy access to Crawler and Classification Server duplication, migration and configuration.


Convera

Convera's RetrievalWare product is essentially an indexing and search engine similar to Delphes' DioWEB, except that indexing and search are based on semantic networks.

Main purposes

RetrievalWare is an advanced knowledge-retrieval solution for indexing and searching a wide range of distributed assets, all from a common user interface. RetrievalWare incorporates powerful techniques known as concept search and pattern search, enabled by a mature Semantic Network™ and Adaptive Pattern Recognition Processing (APRP™), to deliver accurate and relevant results.

RetrievalWare performs natural language processing and search-term expansion to paraphrase queries, enabling retrieval of documents that contain the specific concepts requested rather than just the words typed in the query, while also taking advantage of its semantic richness to rank documents in result lists. RetrievalWare's pattern-search abilities overcome common errors in both content and queries, resulting in greater recall and user satisfaction.

RetrievalWare achieves high levels of both recall and precision through advanced search methods and close attention to every detail of the search process. RetrievalWare provides concept, pattern and Boolean searching methods that can be used independently or interactively to enable the highest levels of accuracy. RetrievalWare’s powerful index and query pipelines use sophisticated formatting and linguistic processing components to achieve these results.

• Technology foundation38: RetrievalWare offers an alternative approach to knowledge retrieval compared with traditional information retrieval systems. Convera's Semantic Network and APRP™ (Adaptive Pattern Recognition Processing) technologies provide high levels of accuracy, flexibility and ease of use for retrieving all kinds of digital information.

• Semantic Networks: Convera's Semantic Networks leverage true natural language processing, incorporating syntax, morphology and, most importantly, the actual meaning of words as defined by published dictionaries and other reference sources. Semantic Networks provide many benefits, including:

• Multiple Lexical Sources: Convera's baseline Semantic Network, created from complete dictionaries, a thesaurus and other semantic resources, gives users a built-in knowledgebase of 500,000 word meanings and over 1.6 million word relationships.

• Natural Language Processing: Users can simply enter straightforward, plain-English queries, which are then automatically enhanced by a rich set of related terms and concepts to find information targeted to their specific context.

38 Convera, "Accurate search What a concept (WP-RWT-020517.PDF)"

• Morphology: Recognizes words at the root level, a much more accurate approach than the simple stemming techniques characteristic of other text retrieval software. Does not miss words because of irregular or variant spellings.

• Idioms: Recognizes idioms for more accurate searches. Processes phrases like "real estate" and "kangaroo court" as single units of meaning, not as individual words.

• Semantics: Recognizes multiple meanings of words. Users can simply point and click to choose the meaning appropriate to their queries.

• Multi-layered Dictionary: Convera's baseline Semantic Network supports multi-layered dictionary structures that add even greater depth and flexibility. Enables integration of specialized reference works for legal, medical, finance, engineering and other disciplines. End users can also add personalized definitions and concepts without affecting the integrity of the baseline knowledgebase.

• APRP™ (Adaptive Pattern Recognition Processing): Modeled on the way biological systems use neural networks to process information, APRP™ acts as a self-organizing system that automatically indexes the binary patterns in digital information, creating a pattern-based memory that is self-optimized for the native content of the data. This unique capability provides a number of powerful advantages for building retrieval applications for virtually any type of digital information, including text, images, video and sounds. In text applications, APRP™'s unique capabilities provide many benefits, including:

• Automated, self-organizing pattern indexes: Eliminates the costly labour of manually defining keywords, building topic trees, establishing expert rules, and sorting and labelling information in database fields. Avoids the inherent subjective biases of categorical indexes.

• Fuzzy searching: Provides the ability to retrieve approximations of search queries. Has a natural tolerance for errors in both input data and query terms. Eliminates the need for OCR clean up, which is especially useful in applications that handle large volumes of scanned documents.

• High precision and recall: Gives end-users a high level of confidence that their queries will return all of the requested information regardless of errors in spelling or in "dirty data" they may be searching.

• High performance: Small pattern indices and binary pattern matching deliver high-speed retrieval and efficient use of computing resources.
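The fuzzy-searching benefit described above can be illustrated with character trigrams: two strings that differ only by an OCR-style error still share most of their trigrams and therefore score as a close match. APRP™ itself is proprietary and pattern-based at the binary level; this sketch only illustrates the error tolerance, not the actual mechanism.

```python
def trigrams(word):
    """Set of character trigrams of a padded, lower-cased word."""
    w = f"  {word.lower()} "
    return {w[i:i + 3] for i in range(len(w) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two words' trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# A scanning error ("rn" misread as "m") still yields a close match,
# while an unrelated word does not:
assert similarity("international", "intemational") > similarity("international", "domestic")
```

This is why pattern-style matching can eliminate much OCR clean-up: the index tolerates noisy input on both the document and the query side.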


Global features and architecture

• Synchronizers (recovery agents)39: It is the Synchronizer that enables RetrievalWare to securely "reach out" to one or more instances of a source of documents or data and pull their contents into a common index space. A Synchronizer is also responsible for keeping its related index up to date by determining whether anything has changed for the assets in a repository. The overall RetrievalWare architecture includes an Access Filter Module (AFM) that allows new Synchronizers to be easily plugged into a solution without programming. RetrievalWare comes with a Synchronizer for file systems (Unix, NT) and one for Relational Database Management Systems (RDBMS) out of the box. There are additional Synchronizers for document management systems (Documentum, FileNET), groupware servers (Microsoft Exchange, Lotus Notes) and Teradata. If an organization's solution requires access to a repository or file type not currently supported or not available from a partner, the RetrievalWare SDK includes an AFM Toolkit that allows building the necessary Synchronizers and filters, either in-house or through integrators such as Convera's Integration Services Group.
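One common way to "keep a related index up to date", which is all the report says the Synchronizer does, is to compare content checksums against the previous run and re-index only what changed. The mechanism below is a hypothetical sketch, not Convera's actual implementation.

```python
import hashlib

def detect_changes(previous, current_docs):
    """Return (changed, removed, new_checksums).

    previous:     {doc_id: checksum} recorded on the last run
    current_docs: {doc_id: bytes} as fetched from the repository now
    """
    changed, checksums = [], {}
    for doc_id, body in current_docs.items():
        digest = hashlib.sha256(body).hexdigest()
        checksums[doc_id] = digest
        if previous.get(doc_id) != digest:
            changed.append(doc_id)           # new or modified: re-index
    removed = [d for d in previous if d not in current_docs]
    return changed, removed, checksums

prev = {"a.txt": hashlib.sha256(b"old").hexdigest()}
changed, removed, _ = detect_changes(prev, {"a.txt": b"new", "b.txt": b"fresh"})
assert set(changed) == {"a.txt", "b.txt"} and removed == []
```

The returned checksum map becomes the `previous` state for the next scheduled run, so each pass touches only modified or deleted assets.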

• Indexing: The indexing process captures all documents from the synchronizers and extracts all information from the document source. The document’s metadata are stored in a database. No information is available at the present time about the accessibility of metadata created by the indexing process.

• Classification40: Convera provides categorization of content into a hierarchy of subjects by organizing search queries into categories defined by the users. There is no automation of the classification process.

• Search: The following table shows the process of searching:

Table 4. RetrievalWare Searching Process

1. Tokenizing identifies strings of characters as words, dates or numbers and determines how to handle special characters.
2. Stop words such as "the", "a" and "and" are removed from the query so their hits don't artificially inflate the document's rank.
3. Morphology reduces query words to their root forms, removing suffixes and verifying the existence of the root words in the dictionary.
4. Pattern matching expands the list of query words to include similarly spelled terms in the indices, if pattern mode is enabled.

39 Convera, “Synchronizers for RetrievalWare (RW_SYNC.pdf)” 40 Convera, “C API Toolkit Guide”


5. Term grouping identifies words enclosed in parentheses, which allows several alternative words to be treated as a single search term.
6. Exact phrases, words enclosed in quotation marks (" "), indicate the importance of word order and proximity.
7. Idioms (such as "real estate" or "ice cream") are identified, so that those phrase hits are ranked higher than occurrences of the individual words that make up the idioms.
8. Numbers or dates, including open-ended ranges such as greater than and less than, are normalized so users can search for them in the body of the document.
9. Query words containing one or more wildcard characters are expanded to include words beginning with or containing a certain string.
10. If the Power Query option is used, users can select specific meanings of query words and/or modify their weight.
11. Using Semantic Network expansion, words related to the concepts of query words are added to the query word list.
12. Documents are ranked by relevance and displayed to the user.
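The first few pipeline steps of Table 4 can be sketched in miniature. The suffix list below is a crude stand-in: RetrievalWare's real morphology is dictionary-verified and far richer than naive suffix stripping.

```python
import re

STOP = {"the", "a", "and"}          # step 2: stop-word list (abbreviated)
SUFFIXES = ("ing", "es", "s")       # step 3: naive stand-in for real morphology

def process_query(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())   # step 1: tokenize
    tokens = [t for t in tokens if t not in STOP]      # step 2: drop stop words
    roots = []
    for t in tokens:                                   # step 3: strip one suffix
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        roots.append(t)
    return roots

assert process_query("the paintings of a master") == ["painting", "of", "master"]
```

Later steps (pattern matching, idiom detection, semantic expansion) would widen this root list before documents are scored and ranked.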

The following figure shows the searching process.

Figure 26. RetrievalWare Searching Process

• Visualization41: The visualization of the source document and of the search results is well presented and easy to navigate. RetrievalWare's search results show the hit score, provide a link to the original document, and present the most relevant paragraph linked to the search query.

• Audio and video: Convera has applied two decades of experience in intelligent, high performance search and retrieval technology to develop an easy-to-use video asset management system. The result is the Screening Room, which gives today’s video producers a fast, accurate, fully integrated, modular system to browse, search and preview all of their

41 Convera, “www.convera.com”


video source material (either analog or digital) directly from their desktops. Video producers can effortlessly search vast tape archives for supplemental footage; automatically capture video; browse storyboards; catalog content using annotations, closed-captioned text, voice soundtracks, and metadata; search for precise video clips using text and image clues; create rough cuts and edit decision lists (EDLs) for further production; and publish those video assets to the Web for streaming.

Architecture

Figure 27. RetrievalWare Architecture

Specific features related to ADAC

• Synchronizers (recovery agent): The synchronizers work like ADAC's recovery agent. They have parameters to configure the frequency, the source and other settings, and they can work in conjunction with ADAC's recovery agent.

• Search: RetrievalWare provides a search interface that could be modified to satisfy ADAC's requirements. The search-results interface can also be modified to better reflect ADAC's requirements, and the search results can be used for purposes other than display.


• Visualization: A new user interface presenting the same results can be created, or the existing interfaces can be adapted to ADAC's requirements.

Strength and weakness

The major strength of RetrievalWare is its searching process: it is unique and provides very good accuracy in the results. The semantic network, continuously updated by Convera, provides users with a very good knowledge base, and the search expansion based on the semantic network is controlled by parameters. The product supports 26 languages. Integration is very well supported through an SDK, a Web Toolkit, and APIs (C, Java, XML).

The weakness of this product is its restricted number of features. There are no features such as clustering or hyperlinking concepts in a text to other documents; these would have to be developed with the SDK, Web Toolkit, and APIs.

Integration, deployment and scalability42

RetrievalWare's high-level APIs for C, Java and XML can be used to create direct integrations between applications and the RetrievalWare engine. The APIs can be used to:

• Manipulate more than one query at a time through the use of query threads

• Set properties for the query such as what languages to use, libraries to access, etc.

• Remove word senses and terms from pattern matching or wildcarding before semantic expansion

• Control the semantic expansion of the query terms by bypassing the expansion level and expansion word limit

• Access the retrieved document’s body text and meta data fields

• Analyze the strength and location of hits within a document

• Step through hits and move around in the document

• Download a file from a server to a local disk

• Integrate RetrievalWare with a relational database

• Access the retrieved document’s text in its original “raw” state (i.e., just as it enters RetrievalWare with all formatting characters, nulls, white space, etc.)

42 Convera, “RetrievalWare SDK (RW_SDK.pdf)”


• Log users into secure systems via username/password or a proxy login ticket
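The query-control capabilities listed above (languages, libraries, semantic expansion level and word limit) suggest the shape of an XML query request. The sketch below is purely illustrative: the element and attribute names are invented for this report, not the actual RetrievalWare XML schema.

```python
# Hypothetical sketch of an XML query request of the kind RetrievalWare's
# XML API might accept; all element and attribute names are invented.
import xml.etree.ElementTree as ET

def build_query(text, languages=("english",), libraries=("news",),
                expansion_level=2, expansion_word_limit=20):
    """Assemble a query document exposing the controls listed above."""
    root = ET.Element("query")
    ET.SubElement(root, "text").text = text
    opts = ET.SubElement(root, "options")
    for lang in languages:
        ET.SubElement(opts, "language").text = lang
    for lib in libraries:
        ET.SubElement(opts, "library").text = lib
    exp = ET.SubElement(opts, "semantic-expansion")
    exp.set("level", str(expansion_level))        # bypass default expansion level
    exp.set("word-limit", str(expansion_word_limit))
    return ET.tostring(root, encoding="unicode")

print(build_query("truck bomb attack", expansion_level=1))
```

Requests built this way could be submitted over the engine's RPC transport; the point here is only the kind of per-query control the APIs expose.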

The RetrievalWare product is available for many platforms: Sun Solaris 7 and 8 (32-bit), HP-UX 11 (32- and 64-bit), DEC Unix 4.0f, SGI IRIX 6.5.0 (32-bit), Red Hat 6.1 (kernel 2.2), IBM AIX 4.3.3 (32-bit), Intel NT 4.0 and Windows 2000. RetrievalWare is deployed like a typical client/server application.

RetrievalWare's architecture provides the ability to build scalable systems that can be distributed across an organization's computing infrastructure. Because single or multiple instances of a process can be distributed across multiple processors and/or servers anywhere on the enterprise's local or wide area network, a solution can be scaled to meet most requirements. The RetrievalWare architecture is Web-based, constructed on a client/server structure with distributed processes and remote procedure calls (RPC) over a TCP/IP transport layer.

Applied Semantics

Applied Semantics provides a product that can satisfy ADAC's requirements: the Concept Server. Applied Semantics Concept Server bundles powerful categorization, concept tagging, summarization and information extraction software with industry-standard taxonomies. This solution also comes with Taxonomy Administration tools for easy, secure management of a company's category or topic sets. The supported taxonomies are:

• International Press Telecommunications Council (IPTC) Subject Codes: a taxonomy of 925 general subject topics established to categorize news content;

• Medical Subject Headings (MeSH): a taxonomy of 20 000 categories used by the National Library of Medicine to tag medical research;

• Open Directory Project: a 250 000+ category taxonomy established by Netscape to organize content on the Web;

• Universal Standards Products and Services Categories (UNSPSC): a 13 000+ category taxonomy used to classify products and services;

• Standard Industrial Classification (SIC) Codes: 1000+ four-level category taxonomy used to classify all business establishments by the types of products or services they offer;

• ISO 3166 Geography: an 800-category taxonomy of countries and cities throughout the world.

Main purposes


Concept Server uses a huge semantic network (1.2 million terms, 500 000 concepts and tens of millions of relationships) to extract information from documents. It is not a search engine: it extracts information (summary, concepts, other metadata), categorizes documents using standard industry taxonomies, and can issue event notifications when a document matches given criteria.

• Technology foundation:

• Summary43: The acronym CIRCA stands for "Conceptual Information Retrieval and Communication Architecture". Applied Semantics' systems are built upon a foundation of CIRCA technology. This technology uses the proprietary ASI Ontology, which consists of over 420,000 concepts and many millions of relationships between them. At the linguistic level, concepts are manifested as terms (i.e. one or more word tokens). For instance, the concept of car is represented by several terms, such as car, automobile, auto, motor car, or ride. At the semantic level, concepts are defined by their relationships to other concepts. For example, car is a kind of motor vehicle; car has kind BMW Z3; car is bound to car dealership, etc. The ASI ontology aims at representing semantic relationships that are fundamental to structuring human knowledge. These include synonymy, hypernymy, membership, metonymy, causation, and entailment. In addition, the strength of each relationship is stored. Thus, the location of a concept within semantic space is determined both by its relationships and by the strength (or distance) of these bonds to other concepts. Other information associated with concepts includes frequencies of occurrence of terms, frequencies of occurrence of terms with certain meanings, etc. CIRCA technology uses the ASI ontology to process queries or documents at a semantic level. The ontology is primarily used to disambiguate words, which in turn provides the basis for document categorization, metatagging, and summarization.

• Details44: At its core, the Applied Semantics ontology consists of meanings, or concepts, and relationships between those meanings. But in order to utilize meanings and their relationships while processing text, we must also provide some link to the manifestation of those concepts in text, in terms of linguistic expressions such as words or phrases. The ontology is therefore characterized by three main representational levels: Tokens: corresponding to individual word forms; Meanings: concepts; Terms: sequences of one or more tokens that stand as meaningful units. Each term is associated with one or more meanings. Conversely, each meaning is linked to one or more terms (which can be considered synonyms with respect to that meaning). Currently, the Ontology consists of close to half a million distinct tokens, over two million unique terms, and approximately half a million distinct meanings. To illustrate the difference between the three levels, let us consider the phrase “bears witness.” This is

43 Applied Semantics, Inc., “Circadia User’s Manual – CIRCA (circ_manual.htm)” 44 Applied Semantics, Inc., “Ontology Usage and Applications (ontology_whitepaper.pdf)”


an expression that consists of two tokens together comprising a single term, since, as a unit, these tokens have a specific usage/meaning that is not strictly a function of the meaning of the parts.

Figure 28. Term "bears witness" (Applied Semantics)

In fact, this term is associated with two distinct meanings: to establish the validity of something, or be shown or found to be (''This behaviour bears witness to his true nature''); and to give testimony in a court of law.

Meanings are represented in the system both directly, in terms of dictionary-style glosses as shown above, and indirectly, in terms of their relationships to terms and to other meanings. That is, a concept is defined by the sets of terms that are used to express that concept, and by its location in the semantic space established through the specification of relationships among concepts.

The types of relationships between concepts that we have chosen to represent correspond to those relationships that are fundamental to structuring human knowledge, and enabling reasoning over that knowledge. We represent:

• Synonymy/antonymy (“good” is an antonym of “bad”)

• Similarity (“gluttonous” is similar to “greedy”)

• Hypernymy (is a kind of / has kind) (“horse” has kind “Arabian”)

• Membership (“commissioner” is a member of “commission”)

• Metonymy (whole/part relations) (“motor vehicle” has part “clutch pedal”)

• Substance (e.g. “lumber” has substance “wood”)

• Product (e.g. “Microsoft Corporation” produces “Microsoft Access”)

• Attribute (“past”, “preceding” are attributes of “timing”)

• Causation (e.g. travel causes displacement/motion)

• Entailment (e.g. buying entails paying)


• Lateral bonds (concepts closely related to one another, but not in one of the other relationships, e.g. “dog” and “dog collar”)

Each relationship is associated with a strength indicating how close the relationship is. For instance, “dog” is a kind of “pet” as well as a kind of “species.” However, the relationship between “dog” and “pet” is stronger (closer) than between “dog” and “species” and this is reflected in a larger strength value.
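The typed, weighted relationships described above can be pictured as a small labelled graph. The sketch below is illustrative only; the relation labels and strength values are invented, not Applied Semantics' actual data.

```python
# Minimal sketch of typed, weighted concept relationships: each edge carries
# a relationship type and a strength indicating how close the relation is.
from collections import defaultdict

class Ontology:
    def __init__(self):
        # concept -> list of (relation type, related concept, strength)
        self.relations = defaultdict(list)

    def relate(self, a, rel_type, b, strength):
        self.relations[a].append((rel_type, b, strength))

    def strongest(self, concept, rel_type):
        """Return the closest concept linked by the given relationship type."""
        candidates = [(s, b) for t, b, s in self.relations[concept] if t == rel_type]
        return max(candidates)[1] if candidates else None

onto = Ontology()
onto.relate("dog", "kind_of", "pet", 0.9)      # close relationship
onto.relate("dog", "kind_of", "species", 0.4)  # weaker, more distant
print(onto.strongest("dog", "kind_of"))
```

The strength value is what lets "dog is a kind of pet" outrank "dog is a kind of species", exactly as described in the text.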

Linguistic information such as syntactic category (part of speech) and inflectional morphology (for instance, word endings indicating plurality or past tense) is associated with terms and tokens. In addition, certain meta-level classifications of tokens that indicate how a token is used, rather than specifying relationships for its meaning, are specified. One example is identifying the language that the token is in; this identification is necessary because the ontology is organized by meanings, which are independent of, or outside of, language. Other examples include identification of first names, trademarks, locations, abbreviations, particles, and function words.

The Applied Semantics ontology aims at being a dynamic representation of words, their usage, and their relationships. To achieve this goal, various statistics have been incorporated into the representation of tokens, terms, and meanings, which are derived from observation of how particular words are used over a range of contexts, and with what meaning. The probability of a specific term being used with a specific meaning, relative frequencies of different tokens and terms, the frequency of a particular multi- token sequence being used as a cohesive term, and other such statistics are gathered and used during subsequent processing. A bootstrapping methodology is followed to acquire this data, in which initial term analysis and meaning disambiguation are done on the basis of human-estimated probabilities and conceptual relationships provided in the ontology. Statistics are gathered over this initial processing and fed back into the ontological database to be used for subsequent runs.
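The role of these usage probabilities in disambiguation can be sketched with a toy example: each term carries candidate meanings with prior probabilities (the "human-estimated" seed statistics), and overlap with the surrounding context nudges the choice. All words, senses and numbers below are illustrative.

```python
# Toy sketch of probability-guided sense selection.
def disambiguate(term, context, senses):
    """senses: {meaning: (prior_prob, related_words)}; pick the best meaning."""
    def score(meaning):
        prior, related = senses[meaning]
        overlap = len(set(context) & set(related))  # contextual evidence
        return prior * (1 + overlap)
    return max(senses, key=score)

senses = {
    "financial_bank": (0.6, {"money", "loan", "deposit"}),
    "river_bank":     (0.4, {"river", "shore", "water"}),
}
# Context words pull the less probable sense ahead of the prior.
print(disambiguate("bank", {"fish", "river", "water"}, senses))
```

In the bootstrapping loop described above, the counts gathered from such runs would be fed back to refine the priors for subsequent processing.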

In addition, mechanisms for automatically generating new relationships from those represented in the basic ontology have been implemented. These mechanisms roughly correspond to logical reasoning algorithms that infer new relationships on the basis of existing ones. For instance, given the relationships “Dalmatians are dogs” and “dogs are animals”, we can infer that “Dalmatians are animals.” Thus, relationships that are more distant are inferred from relationships that are more immediate. Using the strengths and types of the relationships on the path through the ontology from one meaning to another, we assign a value to the strength of the newly inferred relationship.
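The inference step just described can be sketched as a transitive closure over "is-a" links, with the inferred link's strength derived from the strengths along the path. Taking the product of strengths is one plausible combination rule; the actual formula used by Applied Semantics is not documented here, and the strength values are illustrative.

```python
# Sketch of inferring distant relationships from immediate ones:
# chain is-a edges and combine strengths along the path (product used here).
def infer_is_a(edges):
    """edges: {(child, parent): strength}; return the transitive closure."""
    inferred = dict(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), s1 in list(inferred.items()):
            for (c, d), s2 in list(inferred.items()):
                if b == c and (a, d) not in inferred:
                    inferred[(a, d)] = s1 * s2  # strength decays with distance
                    changed = True
    return inferred

edges = {("dalmatian", "dog"): 0.95, ("dog", "animal"): 0.9}
closure = infer_is_a(edges)
print(closure[("dalmatian", "animal")])  # ≈ 0.855, weaker than either direct link
```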

To make use of enterprise-specific or otherwise pre-existing categories or domain- specific taxonomies, the system supports the linking of external terms with the meanings in the ontology. This allows the results of any semantic analysis done by the system to be mapped into proprietary or pre-existing classifications, essentially providing the external terms with hooks into the massive knowledge base represented by the ontology and giving those terms meaning, independent of the specific context they were developed for.
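The linking of external terms to ontology meanings could look like the following sketch, where each external category label is hooked to ontology concept IDs so that analysis results map back onto the proprietary scheme. The labels and concept IDs are invented for illustration.

```python
# Hedged sketch of hooking an external, domain-specific taxonomy into the
# ontology: external labels are linked to ontology concept IDs (made up here).
external_links = {
    "Improvised Explosive Devices": [10421, 10422],  # hypothetical concept IDs
    "Maritime Security": [20817],
}

def map_to_external(extracted_concepts, links):
    """Return external categories whose linked concepts were extracted."""
    hits = set(extracted_concepts)
    return [label for label, ids in links.items() if hits & set(ids)]

# Semantic analysis extracted concepts 10422 and 30001 from a document:
print(map_to_external([10422, 30001], external_links))
```

This is what gives pre-existing enterprise categories "hooks" into the larger knowledge base, independent of the context they were developed for.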


The architecture as presented here is related to a well-known semantic network called WordNet (Fellbaum, 1998), which was designed to reflect psycholinguistic and computational theories of human lexical memory. Many of the relationship types in the Applied Semantics ontology are the same as those represented in WordNet, because those relationship types are foundational to the structure of the human lexicon. However, the WordNet network does not include any probabilistic data, which is critical for utilizing the knowledge embodied by the network in any realistic text processing application. In addition, WordNet does not include the lateral relationships that help to organize concepts into coherent groups. This lateral data is central to establishing contexts that can be used to recognize particular meanings of words, since words that "go together" often do not stand in a hierarchical relationship.

Global features and architecture

• Categorizer: CIRCA views the process of categorization as one of conceptual matching. On one side, there are documents, which can be in any of the 52 formats supported by Circadia45, can be of varying length (up to 700 KB), and can be on virtually any topic. These documents are processed (see above), and a gist is extracted. On the other side, there are categories in a given taxonomy. These categories are also automatically "sensed", i.e. gists are assigned to them. For example, a category named "Java Classes in British Columbia" might have the following unique concepts associated with it: Java (concept ID 85142), Class (concept ID 3591) and British Columbia (concept ID 40127), along with their weightings. Each concept is fully documented in the ontology, with synonyms, variant spellings, morphological variants and relationships to other concepts. Using the page senses extracted from the document, Circadia performs a search for all categories whose concepts are semantically related to those extracted.
The categories that are potential matches are then scored using our proprietary scoring algorithm, and only the most relevant ones are presented in the result set. Thus, because the Auto-Categorizer is already equipped with knowledge of language (through the ASI ontology), there is no need for a training period in order to perform categorization.
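The conceptual matching just described can be sketched as scoring the overlap between a document's weighted concept gist and each category's gist. The simple dot-product scoring below is a stand-in for Applied Semantics' proprietary algorithm; the concept IDs are taken from the example above, and the weights are illustrative.

```python
# Illustrative sketch of categorization as conceptual matching: document and
# categories are reduced to weighted concept sets ("gists") and compared.
def score_categories(doc_gist, categories, top_n=1):
    """doc_gist: {concept_id: weight}; categories: {name: {concept_id: weight}}."""
    scores = {}
    for name, gist in categories.items():
        shared = set(doc_gist) & set(gist)
        scores[name] = sum(doc_gist[c] * gist[c] for c in shared)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, s in ranked[:top_n] if s > 0]  # only relevant matches

categories = {
    "Java Classes in British Columbia": {85142: 0.9, 3591: 0.7, 40127: 0.8},
    "Coffee Production":                {85143: 0.9},
}
doc = {85142: 0.6, 3591: 0.5}  # weighted concepts extracted from a document
print(score_categories(doc, categories))
```

Because the matching operates on ontology concepts rather than raw words, no training corpus is needed, which is the point made in the text.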

• Summarizing: Page Summarizer is a customizable system that processes documents and returns extracted summaries. This functionality improves the ability of searchers to quickly scan a list of results and select the documents most appropriate to their information needs (e.g. a doctor looking through the results of a search in a medical database). CIRCA mimics the way that humans summarize documents. First, the document is read and understood in the way described above. The gist of the document is extracted. Next, each sentence in the document is reviewed and analyzed to see if it is a representative sentence. CIRCA determines the most important concepts in the document, and then uses the density of

45 Circadia is either the name of the core module within the Concept Server product or the former name of Concept Server; no definitive information was available while this document was being written.


these concepts in the sentences contained in the document to rank their contribution to the overall subject of the document. The best sentences according to these criteria are chosen to represent the summary of the document.
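The sentence-ranking step described above can be sketched as scoring each sentence by the density of the document's key concepts and keeping the top-ranked sentences. The word-matching below is a toy stand-in for CIRCA's concept-level analysis.

```python
# Sketch of extractive summarization: rank sentences by the density of the
# document's most important concepts (toy word matching, illustrative data).
def summarize(sentences, key_concepts, n=1):
    """Pick the n sentences densest in key concepts."""
    def density(sentence):
        words = sentence.lower().split()
        hits = sum(1 for w in words if w.strip(".,") in key_concepts)
        return hits / len(words)
    return sorted(sentences, key=density, reverse=True)[:n]

doc = [
    "The committee met on Tuesday.",
    "The report analyzes terrorism financing networks.",
    "Lunch was served at noon.",
]
print(summarize(doc, {"terrorism", "financing", "networks"}))
```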

• Indexing (metadata extraction): Metadata Creator processes documents and returns semantically representative metatags of documents, according to user-specified preferences. Metadata Creator improves the ability of search technologies to locate and return relevant material to a business user. The process of metatagging a document involves processing it through CIRCA (see above). For each document processed, a set of representative concepts is extracted and sorted in descending order of strength. Based on this set of concepts, synonyms, semantic relatives (whether broader, narrower, or equivalent), and morphological variants are extracted. For instance, if the most important meaning on a page is movie, then all its synonyms (e.g. film, motion picture, flick, moving picture), all its semantic relatives, whether broader (e.g. entertainment), narrower (e.g. documentary) or lateral (e.g. movie trailer), and all its morphological variants (e.g. films, motion pictures, flicks, movies, moving pictures) are placed in a pool of possible metatags. Thereafter, metatags are chosen based on a number of parameters: presence (whether or not the candidate metatag is present in the text being analyzed), focus (the degree of specificity of the candidate metatag, whether broad or narrow), frequency (how common the term is), and variety (the degree of semantic coverage needed in the metatag set). In addition, a named entity recognition algorithm is used to identify persons' names, company names, etc. The results of this component of CIRCA are returned as separate blocks of metatags. Lastly, Metadata Creator can be customized to handle any set of controlled vocabularies (e.g. public companies, industrial codes, corporate vocabularies) and return tags that are most relevant to a company's information needs. Multiple vocabularies may be employed simultaneously, and use of specific vocabularies can be restricted on a per-document basis.
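The metatag pool described above (top concept plus synonyms and semantic relatives, filtered by selection parameters) can be sketched as follows. The tiny lexicon reuses the "movie" example from the text, and the code applies only the "presence" parameter; the other parameters (focus, frequency, variety) would be further filters.

```python
# Toy sketch of the metatag pool: expand the top concept with synonyms and
# semantic relatives, then filter candidates by presence in the text.
lexicon = {
    "movie": {
        "synonyms": ["film", "flick", "motion picture"],
        "broader":  ["entertainment"],
        "narrower": ["documentary"],
    },
}

def candidate_metatags(top_concept, text):
    entry = lexicon[top_concept]
    pool = [top_concept] + entry["synonyms"] + entry["broader"] + entry["narrower"]
    present = [t for t in pool if t in text.lower()]  # "presence" parameter
    return pool, present

pool, present = candidate_metatags("movie", "A new documentary film opens today.")
print(present)
```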

• Taxonomy editing46: Applied Semantics provides an easy-to-use tool to edit taxonomies. The tool is simple, and users can build a taxonomy much as the ontology was built in the first ADAC prototype: the user defines the hierarchy of folders and specifies which concepts are associated with each folder. The concepts are chosen from the provided semantic network (the Applied Semantics ontology).

• Ontology (semantic network) editing: Applied Semantics provides an ontology editing tool to enrich the ontology with additional terms, concepts and/or relationships. However, as no detailed information on this tool was available, its ease of use and the logic of updating the ontology cannot be evaluated here.

46 Applied Semantics, Inc., online Web demonstration


Architecture47:

Figure 29. Applied Semantics Concept Server: implementation architecture

As we can see in Figure 29, Concept Server does not provide a turnkey solution but rather tools to help build solutions.

Specific features related to ADAC

• Categorizer: The Categorizer module performs an intelligent and logical categorization (see the description of this feature in the Global features and architecture section above). Since metadata can be extracted from documents, ADAC could also use another categorization algorithm exploiting this metadata. The categorization process can be carried out through the provided API.

• Summarizing: The summary results from the extraction of the most important sentences in the document. Each sentence has a score representing how closely it matches the core idea of the document. This feature is available through the API.

• Indexing (metadata extraction): The metadata extraction can be associated with the indexing process in ADAC. This metadata can be accessed through the API and used by other applications (search engine, clustering, etc.).

47 Applied Semantics, Inc., “Applied Semantics Concept Server (concept_server_datasheet.pdf)”


• Taxonomy editing: The taxonomy editing can be associated with the ontology editing in ADAC and could replace the tool currently provided in ADAC. It is a Windows application.

Strengths and weaknesses

The major strength of this product is its technology foundation. Other strengths are:

• The ability to access the metadata to use in other application (search engine, clustering, etc.);

• No requirement for a training set or manual rule creation;

• The easy integration process;

• The ability to set up parameters to exploit the technology as needed;

• The ability to augment the ontology (semantic network);

• The ability to define and update taxonomy with the concepts from the ontology.

Applied Semantics uses an algorithm for augmenting the ontology semi-automatically. The algorithm uses statistics and probabilities to suggest enhancements to the ontology, which the user has to accept or decline. It is not provided with the product, but access could be negotiated at extra cost.

A major weakness of this product is that it supports only two languages, English and Spanish. Upon formal request, the company can develop support for other languages, but at this time it has no plans to do so. Another weakness is that the product does not provide a search engine; to alleviate this disadvantage, the extracted metadata can be used with the many search engines that Applied Semantics has already experimented with (Verity, AltaVista, Inktomi and Hummingbird).

Integration, deployment and scalability

The integration should be easy and fast: all the APIs use the HTTP protocol with XML payloads, and the documentation is well done and easy to consult. It is a client/server application. As the application does not store documents in a database or other storage system, Applied Semantics did not implement a security management feature; applications using the metadata from Concept Server have to implement their own security management.
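The HTTP/XML integration style mentioned above can be sketched by building an XML request for the server. The element names, operations and URL below are invented for illustration; the actual Concept Server request schema was not available when this report was written.

```python
# Hedged sketch of wrapping a document in an XML request for a hypothetical
# Concept Server endpoint (names and URL are invented, not the real schema).
import xml.etree.ElementTree as ET

SERVER_URL = "http://conceptserver.example/process"  # hypothetical endpoint

def build_request(doc_id, text, operations=("metatags", "summary")):
    root = ET.Element("request")
    doc = ET.SubElement(root, "document", id=doc_id)
    doc.text = text
    for op in operations:
        ET.SubElement(root, "operation", name=op)
    return ET.tostring(root, encoding="unicode")

payload = build_request("msg-001", "Suspicious cargo was seized at the port.")
print(payload)
```

In a real integration, the payload would be POSTed to the server over HTTP and the XML response parsed for the returned metatags and summary.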

As the application is client/server, the deployment follows the usual guidelines for this kind of application. Depending on the document-processing performance required, the server can be deployed on a few machines, with applications installed on other servers accessing the services provided by Concept Server. This product requires 2 GB of free memory and 20 GB of disk space on Linux 2.2, Solaris 8, and Windows 2000 Professional and up.


Applied Semantics does not provide tools or modules to extend document-processing performance; all scaling has to be handled by other tools.

Diagnos

Diagnos' product SIPINA Enterprise is a classification engine. It analyzes structured or unstructured data that has previously been classified and determines pattern-matching rules. These rules are afterwards used in an operational environment by another engine, which provides online data classification.

Main purposes

Sipina offers an extensive library of data mining methods. In total, 11 different and complementary analysis techniques are integrated into the application, representing the four major families of data mining methods: descriptive, predictive, classification and association.

Using a Processing Workflow concept, Sipina is an intuitive, simple and visual tool to implement and combine different data mining analysis techniques. Constructed step by step, the Processing Workflow helps you to become familiar with the analysis methods; it is also the optimum way to combine the data mining techniques for a tenfold increase in Sipina's data exploration power.

Sipina allows you to benefit fully from your analysis work: you can save your data after processing in the application's internal database or apply your newly created prediction models to new datasets.

The prediction models can be exported and used by an engine running in an operational environment. Each unit of data (record, document or image) is processed with the predictive model, and the engine extracts the results, which can be used by other applications that understand them.

Global features and architecture

Classification: To classify structured or unstructured data, the user builds the predictive model based on previously classified data; many methods can be used. Once this stage is done, the user exports the model to an engine that takes the predictive model and data as inputs and classifies the data accordingly. The structured data could come from databases, and the unstructured data could be documents or images.

Specific features related to ADAC

Classification: The classification process is not simple and requires knowledge of statistics, but the ability to classify images is interesting.

Strengths and weaknesses

The product's strength is its ability to classify images. Its weaknesses are the complexity of creating a predictive model and the limited set of features around the product.


Integration, deployment and scalability

SIPINA is the main module for creating a predictive model based on previously classified data. The model is exported in XML and can be used by the online processing of predictive models (OPPM) module. This module has an API that developers can control. The integration is relatively easy.
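The pattern described here (a model exported as XML, then loaded and applied by a runtime engine) can be sketched as follows. The XML format and the rule-based model are invented stand-ins; SIPINA's actual export schema was not available for this report.

```python
# Sketch of the export/deploy pattern: a predictive model serialized as XML
# is loaded by a runtime engine and applied to new records.
import xml.etree.ElementTree as ET

MODEL_XML = """
<model>
  <rule attribute="word_count" op="gt" value="500" label="report"/>
  <rule attribute="word_count" op="le" value="500" label="message"/>
</model>
"""

def load_rules(xml_text):
    """Parse the (hypothetical) exported model into (attr, op, value, label)."""
    root = ET.fromstring(xml_text)
    return [(r.get("attribute"), r.get("op"), float(r.get("value")),
             r.get("label")) for r in root.findall("rule")]

def classify(record, rules):
    """Apply the first matching rule to a record, as an OPPM-style engine would."""
    for attr, op, value, label in rules:
        v = record[attr]
        if (op == "gt" and v > value) or (op == "le" and v <= value):
            return label

rules = load_rules(MODEL_XML)
print(classify({"word_count": 1200}, rules))
```

Shipping the model as a standalone XML file is what allows the same rules to be loaded by every OPPM instance on the network.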

The deployment of solutions using the OPPM engine is like that of a client/server application. The input data has to be sent to the engine over the network or locally. When a predictive model is modified, it needs to be updated on each OPPM instance on the network.

No information was available on the scalability of this product.


List of symbols/abbreviations/acronyms/initialisms

DND Department of National Defence

NDCC National Defence Command Center

ADAC Automatic Document Analyzer and Classifier

NLP Natural Language Processing

IE Information Extraction

IR Information Retrieval

TREC Text Retrieval Conference

MUC Message Understanding Conference

SDK Software Development Kit


Distribution list

INTERNAL DISTRIBUTION

DRDC Valcartier TR-2004-265

1 – Director General

3 – Document Library

1 – Head Decision Support System Section

1 – Head Information and Knowledge Management Section

1 – Head Systems of Systems Section

1 – M. Allouche

1 – A. Auger

1 – M. Bélanger

1 – A. Benaskeur

1 – J. Berger

1 – M. Blanchette

1 – A.-C. Boury-Brisset (Author)

1 – R. Breton

1 – C. Daigle

1 – Maj B. Deschênes

1 – Lcol M. Gareau

1 – M. Gauvin

1 – D. Gouin

1 – A. Guitouni (Author)

1 – H. Irandoust


1 – A.-L. Jousselme

1 – R. Lecocq

1 – P. Maupin

1 – S. Paradis

1 – F. Rhéaume

1 – J. Roy

1 – A. Sahi

1 – G. Thibault

1 – LCdr E. Tremblay

1 – P. Valin

1 – LCdr E. Woodliffe


EXTERNAL DISTRIBUTION

DRDC Valcartier TR-2004-265

1 – DRDKIM (PDF file)

1 – Director Joint Capability Production

101 Col. By Drive, Ottawa, ON, K1A 0K2

1 – Canadian Forces Experimentation Center

Shirley’s Bay Campus, 3701 Carling Ave, Ottawa, ON, K1A 0K2

1 – Director Aerospace Requirements

101 Col. By Drive, Ottawa, ON, K1A 0K2

1 – Director Land Requirements

101 Col. By Drive, Ottawa, ON, K1A 0K2

1 – Director Maritime Requirements

101 Col. By Drive, Ottawa, ON, K1A 0K2

1 – Director Science and Technology (Air)

Constitution Building, 305 Rideau St. Ottawa, ON, K1A 0K2

1 – Director Science and Technology (C4ISR)

Constitution Building, 305 Rideau St. Ottawa, ON, K1A 0K2

1 – Director Science and Technology (Land)

Constitution Building, 305 Rideau St. Ottawa, ON, K1A 0K2

1 – Director Science and Technology (Navy)

Constitution Building, 305 Rideau St. Ottawa, ON, K1A 0K2

1 – Information Management Group

101 Col. By Drive, Ottawa, ON, K1A 0K2


UNCLASSIFIED SECURITY CLASSIFICATION OF FORM (Highest Classification of Title, Abstract, Keywords)

DOCUMENT CONTROL DATA

1. ORIGINATOR (name and address): Defence R&D Canada – Valcartier, 2459 Pie-XI Blvd. North, Québec, QC G3J 1X8

2. SECURITY CLASSIFICATION (including special warning terms if applicable): Unclassified

3. TITLE (its classification should be indicated by the appropriate abbreviation (S, C, R or U)): ADAC: Automatic Document Analyzer and Classifier (U)

4. AUTHORS (Last name, first name, middle initial. If military, show rank, e.g. Doe, Maj. John E.) Guitouni, A., Boury-Brisset, A.-C., Belfares, L., Tiliki, K. and Poirier, C.

5. DATE OF PUBLICATION (month and year): 2006

6a. NO. OF PAGES: 134

6b. NO. OF REFERENCES: 133

7. DESCRIPTIVE NOTES (the category of the document, e.g. technical report, technical note or memorandum. Give the inclusive dates when a specific reporting period is covered.) Technical Report

8. SPONSORING ACTIVITY (name and address)

9a. PROJECT OR GRANT NO. (please specify whether project or grant): COP 21 TD

9b. CONTRACT NO.

10a. ORIGINATOR'S DOCUMENT NUMBER: TR 2004-265

10b. OTHER DOCUMENT NOS: N/A

11. DOCUMENT AVAILABILITY (any limitations on further dissemination of the document, other than those imposed by security classification)

Unlimited distribution
Restricted to contractors in approved countries (specify)
Restricted to Canadian contractors (with need-to-know)
Restricted to Government (with need-to-know)
Restricted to Defence departments
Others

12. DOCUMENT ANNOUNCEMENT (any limitation to the bibliographic announcement of this document. This will normally correspond to the Document Availability (11). However, where further distribution (beyond the audience specified in 11) is possible, a wider announcement audience may be selected.)


13. ABSTRACT (a brief and factual summary of the document. It may also appear elsewhere in the body of the document itself. It is highly desirable that the abstract of classified documents be unclassified. Each paragraph of the abstract shall begin with an indication of the security classification of the information in the paragraph (unless the document itself is unclassified) represented as (S), (C), (R), or (U). It is not necessary to include here abstracts in both official languages unless the text is bilingual). Military organizations have to deal with an increasing number of documents coming from different sources and in various formats (paper, fax, e-mails, electronic documents, etc.) The documents have to be screened, analyzed and categorized in order to interpret their contents and gain situation awareness. These documents should be categorized according to their contents to enable efficient storage and retrieval. In this context, intelligent techniques and tools should be provided to support this information management process that is currently partially manual. Integrating the recently acquired knowledge in different fields in a system for analyzing, diagnosing, filtering, classifying and clustering documents with a limited human intervention would improve efficiently the quality of information management with reduced human resources. A better categorization and management of information would facilitate correlation of information from different sources, avoid information redundancy, improve access to relevant information, and thus better support decision-making processes. DRDC Valcartier’s ADAC system (Automatic Document Analyzer and Classifier) incorporates several techniques and tools for document summarization and semantic analysis based on ontology of a certain domain (e.g. terrorism), and algorithms of diagnosis, classification and clustering. 
In this document, we describe the architecture of the system and the techniques and tools used at each step of the document processing. For the first prototype implementation, we focused on the terrorism domain to develop the document corpus and related ontology.

14. KEYWORDS, DESCRIPTORS or IDENTIFIERS (technically meaningful terms or short phrases that characterize a document and could be helpful in cataloguing the document. They should be selected so that no security classification is required. Identifiers, such as equipment model designation, trade name, military project code name, geographic location may also be included. If possible keywords should be selected from a published thesaurus, e.g. Thesaurus of Engineering and Scientific Terms (TEST) and that thesaurus-identified. If it is not possible to select indexing terms which are Unclassified, the classification of each should be indicated as with the title.) Knowledge management, automatic text processing, text classification, text clustering, ontology.



Defence R&D Canada R & D pour la défense Canada

Canada's Leader in Defence and National Security Science and Technology / Chef de file au Canada en matière de science et de technologie pour la défense et la sécurité nationale

www.drdc-rddc.gc.ca