Hybrid approach for the assistance in the events extraction in great textual data bases

Ismail BISKRI (1)(2) & Rim FAIZ (3)

(1) LANCI – UQAM, CP 8888, succursale Centre-Ville, Montréal, Québec, H3C 3P8, Canada
(2) Département de Mathématiques et Informatique – UQTR, CP 500, Trois-Rivières, Québec, G9A 5H7, Canada
[email protected]
(3) LARODEC, Institut des Hautes Etudes Commerciales, 2016 Carthage-Présidence, Tunisie
[email protected]

Authors' names order is unimportant. This paper is the result of genuine collaborative work between both authors.

Abstract -- Numerical classification tools are generally quite robust but only provide coarse-granularity results; on the other hand, such tools can handle very large inputs. Several computational linguistic tools (in this case, events extraction ones) are able to provide fine-granularity results but are less robust; such tools usually handle relatively short inputs. A synergistic combination of both types of tools is the basis of our hybrid system. The system is validated by extracting event information from press articles. First, a connectionist classifier is used to locate potentially interesting press articles according to user interests. Second, the user forwards the selected press articles to the linguistic processor in order to extract events. We present the main characteristics of our approach.

Keywords: Events extraction, contextual exploration, classification, n-grams.

I. INTRODUCTION

The Press is one of the most used documentary sources. It distributes various information and presents a huge amount of data conveying a very particular knowledge type, which is the event.

Our interest is to provide appropriate overview and analysis functionality that allows a user to keep track of the key content of a potentially huge amount of relevant publications.

The objective of our research is to filter out excessive, useless information from electronic press documents by means of filtering and extraction, in order to emphasize the event information type. As a result, the reader will be guided toward relevant information, and the journalist will be helped in developing article surveys representing the main events.

The representation of information signaling the presence of "an event" is an important task in Artificial Intelligence as well as in natural language processing. Indeed, just like reasoning from the information presented in a text, the understanding process must also allow the rebuilding of the structure of event information.

Our approach uses a fine, specific syntactic description of events and is at the same time based on the methodology of event classes and on the contextual exploration method [8]. The contextual exploration approach uses a priori knowledge in order to process input texts; the nature of this a priori knowledge is morphological, lexical, syntactic and semantic, which makes it more or less difficult to feed into the system, especially when the input is a large corpus built from more than a few press articles.

In this paper, we propose to look at events extraction in a different way. The extraction process must be performed in such a way that a full linguistic analysis of huge collections of press articles is not necessary. Such an analysis would simply not be a practical solution, especially since most of the time only a few parts of the press articles are relevant for the user. That is why we argue that a sensible strategy is to first apply a cost-reasonable numerical method (in our case, the Gramexco software [3], [4], [5]), and then a more expensive contextual exploration method (in our case, the EXEV system [10], [11]). In the first phase, a "rough" numerical method helps (quickly) select the press articles which, according to the user's needs, deserve more "refined" (time-consuming) processing that will, in the end, make the extraction process a reality.

II. NUMERICAL CLASSIFICATION

The first stage in our approach is the numerical analysis [19]. Our approach is based on the notion of N-grams of characters. This notion has been in use for many years, mainly in the field of speech processing. Fairly recently, it has attracted even more interest in other fields of natural language processing, as illustrated by the work of Greffenstette [13] on language identification and that of Damashek [6] on the processing of written text. Amongst other things, these researchers have shown that the use of N-grams instead of words as the basic unit of information does not lead to information loss and is not limited by the presence of spelling mistakes, which we sometimes find in press articles. Examples of recent applications of N-grams include the work of Mayfield & McNamee [20] on indexation, the work of Halleb & Lelu [14] on automated multilingual hypertextualization (in which they construct hypertextual navigational interfaces based on a language-independent text classification method), and the work of Lelu et al. [18] on multidimensional exploratory analysis oriented towards information retrieval in texts.

Now, what exactly is an N-gram of characters? Quite simply, we define an N-gram of characters as a sequence of N characters. Sequences of two characters (N=2) are called bigrams, sequences of three characters (N=3) are called trigrams, and sequences of four characters (N=4) are called quadrigrams. Notice that the definition of N-grams of characters does not explicitly or implicitly require the specification of a separator, as is necessary for words. Consequently, analyzing a text in terms of N-grams of characters, whatever the value of N might be, constitutes a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator. Clearly, this is a significant advantage over the problematic notion of what a word is.

The use of N-grams of characters instead of words offers another important advantage: it provides a means by which to control the size of the lexicon used by the processor. Until recently, the size of the lexicon was a controversial issue, often considered an intrinsic limit of processing techniques based on the comparison of character strings. Indeed, splitting up a text into words normally implies that the larger the text, the larger the lexicon. This constraint persists even if special processing is performed for functional words and hapaxes, and even if morphological and lexical analysis is performed on words. For instance, Lelu et al. [18] managed to reduce the size of the lexicon to 13 087 quadrigrams for a text containing 173 000 characters.

In addition, if N-grams of characters are used as the basic unit of information instead of words, there is simply no need for morphological and lexical analysis. Not only can these types of processing be computationally demanding but, most of all, they are specific to each individual language. Thus, when using the word as the basic unit of information, language-specific processors must be developed for every language of interest. This is a potentially very costly constraint, both in terms of development time and in terms of required expertise for each language, not to mention the problem that texts written in unforeseen languages might cause at processing time.

But even when lexical analyzers are available, many have trouble correctly handling words and their derivations. For instance, the French words informatisation, informatique and informatiser all refer to the concept of informatique (informatics). So if, in a given corpus, we have these segments, which have similar informational contents: "l'informatisation de l'école", "informatiser l'école" and "introduire l'informatique à l'école", many word-based processors will not be able to reliably detect such similarities. However, the analysis of the above three short segments in terms of trigrams (N=3) of characters is sufficient to classify these segments in the same class. Indeed, not only does 'école' (school) appear in all three, but the trigrams inf, nfo, for, orm, rma, mat and ati allow the computation of a similarity measure supporting the conclusion that informatique is the common topic in these three segments. Of course, since these same trigrams also appear in information and informationnel, this could appropriately be considered as noise, unless a higher-level interpretation is invoked, such as informatique being a subfield of information.
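
To make the notion concrete, the following short sketch (ours, not the authors' code) extracts character trigrams from the three French segments above and compares them with a simple Jaccard overlap; the function names and the similarity measure are illustrative only, not the measure actually used by the system.

```python
# Minimal illustration of character N-grams: trigrams of the three segments
# discussed above, compared with a simple set-overlap (Jaccard) similarity.

def char_ngrams(text, n=3):
    """Return the set of character N-grams of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Proportion of shared N-grams between two N-gram sets."""
    return len(a & b) / len(a | b)

segments = [
    "l'informatisation de l'école",
    "informatiser l'école",
    "introduire l'informatique à l'école",
]

grams = [char_ngrams(s) for s in segments]
for i in range(len(segments)):
    for j in range(i + 1, len(segments)):
        print(segments[i], "|", segments[j], "->", round(jaccard(grams[i], grams[j]), 2))
```

All three pairs share the trigrams of "informat-" and "école", which is enough to group the segments even though the surface words differ.
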

III. THE GRAMEXCO SOFTWARE

Our software, called GRAMEXCO, has been developed for the numerical classification of large documents in order to extract knowledge from them. Classification is performed with an ART neural network such as the one used in [2]. The basic information unit is the N-gram of characters, N being a parameter. A primary design goal of GRAMEXCO during its development was to offer a standard flow of processing, regardless of the specific language being processed in the input documents. Another important design feature is that GRAMEXCO is semi-automatic, allowing the user to set certain parameters on the fly, according to her own subjective goals or her interpretation of the results produced by the software.

Starting from the input text, a simple ASCII text file built from press articles, three main phases follow, in which the user may get involved as necessary:

1. The list of N-grams is constructed (with N determined by the user) and the text is partitioned into segments (in our case, one segment corresponds to one press article). These operations are performed simultaneously, producing a matrix in which N-gram frequencies have been computed for every segment.
2. An ART neural network computes similarities between (co-occurring N-grams in) the segments produced in the previous step. Similar press articles, according to a certain similarity function, are grouped together. This is the result of GRAMEXCO's classification process.
3. At this stage, N-grams have served their purpose: they have helped produce the classification of press articles. Now that we have the classes of segments, we can get at the words they contain. The words a class of segments contains are referred to as its lexicon. The user can now apply several operations (e.g. union, intersection, difference, frequency threshold) on the segments' words in order to determine, for instance, a common theme (assuming she understands the language). Results interpretation is up to the user.

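
As an illustration of phase 1, here is a minimal sketch, under our own naming conventions, of how the segment-by-N-gram frequency matrix could be built; the paper does not publish GRAMEXCO's code, so this only approximates the described behaviour, with one segment per press article.

```python
# Sketch of phase 1: build the segment-by-N-gram frequency matrix.
# Helper names are ours; one segment corresponds to one press article.
from collections import Counter

def ngram_frequencies(segment, n):
    """Frequency of each character N-gram in one segment."""
    return Counter(segment[i:i + n] for i in range(len(segment) - n + 1))

def build_matrix(segments, n=4):
    """Return (vocabulary, matrix) where matrix[i][j] is the frequency of
    vocabulary[j] in segments[i]."""
    counts = [ngram_frequencies(seg, n) for seg in segments]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[g] for g in vocabulary] for c in counts]
    return vocabulary, matrix

articles = ["first press article ...", "second press article ..."]
vocab, freq = build_matrix(articles, n=4)
```

Each row of the matrix is the vector that the classifier (an ART neural network in GRAMEXCO) receives for one article in phase 2.
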
Depending on the parameters set by the user, and on her choices during the three phases above, the results produced by GRAMEXCO can help identify similar classes of text segments and their main theme. The results can also help determine word usages and meanings for specific words. These are important tasks in knowledge extraction systems, especially events extraction systems.

We now present some results obtained with the GRAMEXCO software. These results have been obtained from a 100-page corpus constructed from a random selection of English and French newspaper articles on various subjects. This corpus was submitted to our classifier in order to obtain classes of articles on the same topic and, also, to help the user, normally an expert in her own domain, identify and study the themes of these classes of articles, thanks to the lexicon automatically associated with each of these classes.

The two main parameters we have used for this experiment are the following. First, we used quadrigrams (N-grams of size 4), taking into account various practical factors. Second, we discarded N-grams containing a space and those having a frequency of one: this is done in order to minimize the size of the vectors submitted as input to the classifier and, thus, to reduce the work to be performed by the classifier. Interestingly, the removal of these N-grams has no significant impact on the quality of the results produced by the classifier.

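
The filtering just described can be sketched as follows. Only the stated parameters (quadrigrams, no N-grams containing a space, no N-grams of total frequency one) come from the paper; the helper names and the toy corpus are ours.

```python
# Sketch of the reported parameter choices: keep quadrigrams (N=4), drop
# N-grams containing a space and N-grams whose corpus frequency is 1,
# so that the vectors given to the classifier stay small.
from collections import Counter

def filter_ngrams(segment_counts):
    """segment_counts: one Counter of quadrigram frequencies per article."""
    total = Counter()
    for c in segment_counts:
        total.update(c)
    kept = {g for g, f in total.items() if " " not in g and f > 1}
    return [{g: c[g] for g in kept if g in c} for c in segment_counts]

docs = ["the flood hit the valley", "floods hit the valley again"]
counts = [Counter(d[i:i + 4] for i in range(len(d) - 3)) for d in docs]
filtered = filter_ngrams(counts)
```
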
The first noticeable result obtained from the classifier is the perfect separation of English and French articles. Qualitative analysis of the classes also allows us to observe that articles belonging to the same class are either similar or share a common theme. For instance:

- Class 100 (articles 137 and 157). The common lexicon of these two articles consists of {bourse, francs, marchés, millions, mobile, pdg, prix}. The shared theme of these articles appears to be related to the financial domain. An analysis of the full articles allowed us to confirm this interpretation.
- Class 54 (articles 141 and 143). The common lexicon of these two articles consists of {appel, cour, décidé, juge}. The shared theme of these articles appears to be related to lawsuits.
- Class 64 (articles 166 and 167). The lexicon of these two articles consists of {chance, dernière, dire, match, stade, supporters, vélodrome}. The shared theme of these articles appears to be related to supporters of the Olympique de Marseille.
- Class 13 (articles 32, 35, 41 and 48). The lexicon of these four articles consists of {conservateur, socialisme, marxiste, révolutionnaire, Dostoievski, doctrine, impérial, slavophile}. The shared theme of these articles appears to be related to the Slavophiles and the Russian political culture of the 19th century.

IV. THE EXEV SYSTEM

The second principal stage concerns the extraction of events with the use of contextual exploration rules [7]. As we mentioned above, the analyzed corpus is made up of press articles. The EXEV system aims at the automatic filtering of significant sentences bearing factual knowledge from press articles, as well as at identifying the agent, the location, and the temporal setting of those events. The system uses two main modules:

1. The first module allows us to pick out markers in order to identify the distinctive sentences which represent events.
2. The second module allows us to interpret the extracted sentences in order to identify "Who did what?", "to whom?" and "where?".

A. Extraction of Factual Sentences

The extraction process is based on the results of the morpho-syntactic analysis (for more details cf. [9]). These results, which are a translation of the sentences into morpho-syntactic categories, are skimmed in order to identify factual markers. We have decided to keep the sentences which present one of the markers, the latter being themselves sequences of morpho-syntactic categories.

We classify the linguistic markers into classes. For example:

1. The calendar term class, e.g.: Prp_inf stands for preposition + infinitive + preposition (from, starting from, to deduct of); Cal_num_cal stands for calendar + number + calendar (Wednesday 10 February); Ver_prt stands for verb + temporal preposition (comes after, occurs before, creates since).
2. The occurrence indicator class, e.g.: Adj_occ stands for adjective + occurrence (another time, last time, first time); Adt_det_occ stands for tense adverb + determiner + occurrence (once again).
3. The relative pronoun class, e.g.: Prr_aux_ppa stands for relative pronoun + auxiliary + past participle (which hit); Prr_aux_adv_ppa stands for relative pronoun + auxiliary + adverb + past participle (who drank too much).
4. The transitive verb class, e.g.: Aux_ppa_prp stands for auxiliary + past participle + preposition (are exposed to, were loaded with, have led to).

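
One possible machine-readable encoding of these marker classes (our illustration, not EXEV's internal format) stores each marker as a sequence of morpho-syntactic category tags and tests whether a tagged sentence contains it:

```python
# Marker classes from the text, encoded as tag sequences (illustrative only).
MARKER_CLASSES = {
    "calendar_term": {
        "Prp_inf": ["preposition", "infinitive", "preposition"],
        "Cal_num_cal": ["calendar", "number", "calendar"],
        "Ver_prt": ["verb", "temporal_preposition"],
    },
    "occurrence_indicator": {
        "Adj_occ": ["adjective", "occurrence"],
        "Adt_det_occ": ["tense_adverb", "determiner", "occurrence"],
    },
    "relative_pronoun": {
        "Prr_aux_ppa": ["relative_pronoun", "auxiliary", "past_participle"],
        "Prr_aux_adv_ppa": ["relative_pronoun", "auxiliary", "adverb", "past_participle"],
    },
    "transitive_verb": {
        "Aux_ppa_prp": ["auxiliary", "past_participle", "preposition"],
    },
}

def contains_marker(tag_sequence, pattern):
    """True if the pattern occurs as a contiguous subsequence of the tags."""
    k = len(pattern)
    return any(tag_sequence[i:i + k] == pattern for i in range(len(tag_sequence) - k + 1))

tags = ["determiner", "noun", "relative_pronoun", "auxiliary", "past_participle", "noun"]
print(contains_marker(tags, MARKER_CLASSES["relative_pronoun"]["Prr_aux_ppa"]))  # True
```
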
In addition to these structure indicators (morpho-syntactic indicators), we added a list of verbs which illustrate some event classes as defined by Foucou [12], for example: the class of natural catastrophes (take place): floods, earthquakes, landslides, ...; the class of meteorological phenomena (occur): fog, snow, storm, ...

This list helps us extract all factual sentences, because we may find sentences which do not contain any of the defined markers based on the formal structure of the sentence.

Because its analysis modules and its chosen markers are independent from the documentary source, the EXEV system allows its users to apply it to other types of texts, such as medical literature. It also gives us the possibility to extract information relevant to fields other than event extraction, such as the causality notion. This can be done by inputting the markers related to the field; for example, for the causality notion we must introduce into the base the following markers: to result of, to be provoked by, to be due to, to cause, to provoke, ...

On analyzing sentences from the press articles, the system detects that they may have one of the following forms:

1. Occurrence indicator followed by an event. Example: For the first time an authorized demonstration seemed to be out of their control.
2. Preposition followed by a calendar term. Example: This concise inventory has helped since 1982 the development of exposure schemes for the prevention of natural disasters.
3. Event followed by a calendar term. Example: Blood washed in Syria on Wednesday 10 February.
4. An event1 followed by a relative pronoun, followed by an action verb, followed by a transitive verb, followed by an event2. Example: The murderous avalanche which hit the valley of Chamonix will urge the authorities to re-assess the local safety system.
5. Subject followed by a transitive verb, followed by an event. Example: About 200 lodgings run the risk of landslides, floods or avalanches.

The above examples help us show the shift from natural language text to syntactic structures with a representation of the event type.

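
These five forms can be sketched as patterns over coarse constituent labels; the following fragment is illustrative only and does not reproduce EXEV's actual matching rules. The label names are ours.

```python
# The five sentence forms as sequences of coarse constituent labels; a sentence
# is kept as "factual" when its constituent sequence contains one of the forms.
SENTENCE_FORMS = {
    1: ["occurrence_indicator", "event"],
    2: ["preposition", "calendar_term"],
    3: ["event", "calendar_term"],
    4: ["event", "relative_pronoun", "action_verb", "transitive_verb", "event"],
    5: ["subject", "transitive_verb", "event"],
}

def matches_form(constituents, form):
    """True if `form` occurs as a contiguous subsequence of `constituents`."""
    k = len(form)
    return any(constituents[i:i + k] == form for i in range(len(constituents) - k + 1))

def factual_forms(constituents):
    """Return the identifiers of the forms found in an analyzed sentence."""
    return [fid for fid, form in SENTENCE_FORMS.items() if matches_form(constituents, form)]

# "The murderous avalanche which hit the valley ... will urge ... to re-assess ..."
example = ["event", "relative_pronoun", "action_verb", "transitive_verb", "event"]
print(factual_forms(example))  # -> [4]
```
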

B. Interpretation of Factual Sentences

After extracting the sentences which bear factual information, we now try to answer a question that is classical in the field of extraction but of major importance: who does what, to whom, when and where?

The answer to this question can be of great help, especially if we want to extend our work and add a text generation module.
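
A minimal event record for this interpretation step might look as follows; the field names are our own illustration of the "who does what, to whom, when and where" template, not the system's actual output format.

```python
# Illustrative event record filled by the interpretation module.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EventRecord:
    agent: Optional[str] = None       # who
    action: Optional[str] = None      # does what
    patient: Optional[str] = None     # to whom / to what
    time: Optional[str] = None        # when
    location: Optional[str] = None    # where
    event_type: Optional[str] = None  # "past" or "future"

avalanche = EventRecord(agent="the murderous avalanche", action="hit",
                        patient="the valley of Chamonix", event_type="past")
```
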

We also thought about a module for the enrichment and consultation of the list of markers which have been defined. It is quite useful to allow the user to add other markers or to define another list of markers, so that each user of the system is free to adapt it to his or her own needs.

This is not an easy task: it implies that the user knows perfectly well how the modules of the system function, especially the morphological analysis module. Once the user inputs the markers he wishes to add, the system suggests a morpho-syntactic structure.

V. THE METHODOLOGY ASSOCIATED WITH THE GRAMEXCO AND EXEV APPROACH

The implemented model is supported by a methodology that guides the user through the events extraction process. This methodology is organised in four major steps, which are now described.

1. Partitioning of the initial corpus into its different domains with Gramexco. If a corpus contains articles about different domains, written in different languages, these can easily be separated from each other with the help of Gramexco. Indeed, these different domains (classes) will normally be described with different words. Yet, Gramexco's results do not correspond to an automatic interpretation of the corpus. At this stage, what we obtain is a coarse partitioning of the corpus into the main themes or topics of the various articles, each of these comprising one or more classes of words. The user is the only one who can associate a meaningful interpretation to the partitions produced by Gramexco.
2. Exploration of the classes with regard to the user's needs. This second step is mainly manual, with, however, some automatic support. At this point, the user has assigned a theme to every class, or at least to most of them. So he or she is able to select the themes that best correspond to his or her information and knowledge needs. The selected themes indicate which articles belong to the classes: these articles share similarities and thus deserve to undergo further processing with the events extraction subsystem.
3. Events extraction processing of the selected articles. This third step is the one in which the articles kept by the user, after the filtering operation performed in the two previous steps, finally undergo detailed linguistic analysis in order to extract events information.

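
The whole methodology can be summarised by the following sketch, in which classify_articles and extract_events are placeholders standing in for GRAMEXCO and EXEV respectively; all names and the toy logic are ours, not the actual tool APIs.

```python
# End-to-end sketch of the methodology: classify articles into thematic classes,
# let the user pick classes, then run (placeholder) event extraction only on
# the retained articles.

def classify_articles(articles):
    """Placeholder for the GRAMEXCO step: {class_id: list of article indices}."""
    return {0: list(range(len(articles)))}   # toy: one class containing everything

def select_classes(classes, wanted_ids):
    """Step 2: the user keeps only the classes matching her interests."""
    return [i for cid in wanted_ids for i in classes.get(cid, [])]

def extract_events(article):
    """Placeholder for the EXEV step: detailed linguistic analysis of one article."""
    return [s for s in article.split(".") if "avalanche" in s.lower()]

articles = ["An avalanche hit the valley of Chamonix on Wednesday. Officials reacted."]
classes = classify_articles(articles)           # step 1: partition the corpus
kept = select_classes(classes, wanted_ids=[0])  # step 2: user selection
events = [extract_events(articles[i]) for i in kept]  # step 3: event extraction
print(events)
```
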
VI. CONCLUSION

Thanks to the synergistic association of the Gramexco and Exev software, the extraction of events in large databases of press articles becomes possible. Indeed, since Gramexco gathers all the articles sharing a common topic, the user first selects certain classes of articles according to his own point of view; the extraction of events is then applied, in a last stage, to these textual classes. This methodology, in addition to the economy in working time, makes it possible to seriously consider a true linguistic engineering for large corpora [4].

Furthermore, thanks to our morphological sensor based on inflectional morphology, we were able to directly extract type information as well as interpret the type of event, i.e., future event or past event.

The system can be improved in two ways: on the one hand, we can increase the linguistic database (with the help of Gramexco); on the other hand, we can improve the interface for result consultation.

VII. REFERENCES

[1] Balpe J.P., A. Lelu & F. Papy (1996). Techniques avancées pour l'hypertexte. Paris, Hermes.
[2] Benhadid I., J.G. Meunier, S. Hamidi, Z. Remaki & M. Nyongwa (1998). "Étude expérimentale comparative des méthodes statistiques pour la classification des données textuelles", Proceedings of JADT-98, Nice, France.
[3] Biskri I. & J.G. Meunier (2002). "SATIM : Système d'Analyse et de Traitement de l'Information Multidimensionnelle", Proceedings of JADT 2002, St-Malo, France, 185-196.
[4] Biskri I. & S. Delisle (1999). "Un modèle hybride pour le textual data mining : un mariage de raison entre le numérique et le linguistique", Proceedings of TALN-99, Cargèse, France, 55-64.
[5] Biskri I. & S. Delisle (2001). "Les n-grams de caractères pour l'aide à l'extraction de connaissances dans des bases de données textuelles multilingues", Proceedings of TALN-2001, Tours, France, Tome 1, 93-102.
[6] Damashek M. (1995). "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, 267, 843-848.
[7] Desclés J.P. (1993). "L'exploration contextuelle : une méthode linguistique et informatique pour l'analyse automatique de texte", ILN'93, 339-351.
[8] Desclés J.P., E. Cartier, A. Jackiewicz & J.L. Minel (1997). "Textual Processing and Contextual Exploration Method", Proceedings of Context'97, Universidade Federal do Rio de Janeiro, 189-197.
[9] Faïz R. (1998). "Filtrage automatique de phrases temporelles d'un texte", Actes de la Rencontre Internationale sur l'Extraction, le Filtrage et le Résumé Automatiques (RIFRA'98), Sfax, Tunisie, 11-14 novembre, 55-63.
[10] Faïz R. (2001). "Automatically Extracting Textual Information from Electronic Documents of Press", IASTED International Conference on Intelligent Systems and Control (ISC 2001), Floride, États-Unis, 19-22 novembre.
[11] Faïz R. (2002). "Exev: Extracting Events from News Reports", Proceedings of JADT 2002, St-Malo, France, 257-264.
[12] Foucou P.Y. (1998). "Classes d'événements et synthèse de services Web d'actualité", Actes de la Rencontre Internationale sur l'Extraction, le Filtrage et le Résumé Automatiques (RIFRA'98), Sfax, Tunisie, 11-14 novembre, 154-163.
[13] Greffenstette G. (1995). "Comparing Two Language Identification Schemes", Proceedings of JADT-95, 85-96.
[14] Halleb M. & A. Lelu (1998). "Hypertextualisation automatique multilingue à partir des fréquences de n-grammes", Proceedings of JADT-98, Nice, France.
[15] Halteren H. van (ed.) (1999). Syntactic Wordclass Tagging, Kluwer Academic Publishers.
[16] Jacobs P.S. & L.F. Rau (1990). "SCISOR: Extracting Information from On-line News", Communications of the ACM, 33 (11), 88-97.
[17] Jurafsky D. & J.H. Martin (2000). Speech and Language Processing (An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition), Prentice Hall.
[18] Lelu A., M. Halleb & B. Delprat (1998). "Recherche d'information et cartographie dans des corpus textuels à partir des fréquences de n-grammes", Proceedings of JADT-98, Nice, France.
[19] Manning C.D. & H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press.
[20] Mayfield J. & P. McNamee (1998). "Indexing Using Both n-Grams and Words", NIST Special Publication 500-242: TREC 7, 419-424.
