Natural Language Techniques and Text Mining Applications
Total Page:16
File Type:pdf, Size:1020Kb
Text Mining Natural Language techniques and Text Mining applications M Rajman R Besancon Articial Intel ligence Laboratory Computer Science Department Federal Institute of Technology Swiss CH Lausanne Switzerland rajmanliadiepch romaricliadiepch Abstract In the general framework of knowledge discovery Data Mining techniques are usually dedicated to information extraction from structured databases Text Mining techniques on the other hand are dedicated to information extrac tion from unstructured textual data and Natural Language Pro cessing NLP can then b e seen as an interesting to ol for the enhancement of information cedures In this pap er we present two examples of Text Min extraction pro ing tasks asso ciation extraction and prototypical do cument extraction along with several related NLP techniques Keywords Text Mining Knowledge Discovery Natural Language Pro cessing INTRODUCTION The always increasing imp ortance of the problem of analyzing the large amounts of data collected by companies and organizations has led to imp ortant devel opments in the elds of automated Knowledge Discovery in Databases KDD and Data Mining DM Typically only a small fraction of the col lected data is ever analyzed Furthermore as the volume of available data grows decisionmaking directly from the content of the databases is not fea sible anymore KDD and DM techniques are concerned with the pro cessing of Standard structured databases Text Mining techniques are dedicated to the automated information extraction form unstructured textual data we present the dierences b etween the traditional Data Mining In Section and the more sp ecic Text Mining approaches and in the subsequent sections we describ e two examples of Text Mining applications along with the related NLP techniques c IFIP Published by Chapman Hall Text Mining Natural Language techniques and Text Mining applications TEXT MINING VS DATA MINING to Fayyad PiatetskyShapiro and Smyth Knowledge Dis According the nontrivial process of identifying valid novel po covery in Databases is tential ly useful and ultimately understandable patterns in data and therefore refers to the overall pro cess of discovering informations from data However as the usual techniques inductive or statistical metho ds for building decision trees rule bases nonlinear regression for classication explicitly rely on the structuring of the data into predened elds Data Mining is essentially concerned with information extraction from structured databases Table shows an example of Inductive Logic Programming based learn ing from an attributevalue database D zerovski The presented tables contain the database and the rules induced by the mining pro cess otential Customer Table P Person Age Sex Income Customer Ann Smith F yes Joan Gray F yes Mary Blythe F no Jane Brown F yes Bob Smith M yes Jack Brown M yes MarriedTo Table Husband Wife Bob Smith Ann Smith Jack Brown Jane Brown induced Rules if IncomePerson then PotentialCustomerPerson if SexPerson F and AgePerson then PotentialCustomerPerson if MarriedPerson Sp ouse and IncomePerson then PotentialCustomerSp ouse if MarriedPerson Sp ouse and PotentialCustomerPerson then PotentialCustomerSp ouse Table An example of Data Mining using ILP techniques Association extraction from indexed data This example illustrates how strongly the rule generation pro cess relies on the explicit structure of the relational database presence of welldened elds explicit identication of attributevalue pairs In reality however a large p ortion of the available information app ears in textual and hence unstructured form or more precisely in an implicitly structured form Sp ecialized techniques sp ecically op erating on textual data then b ecome necessary to extract information from such kind of collections of texts These techniques are gathered under the name of Text Mining and in grammatical structure order to discover and use the implicit structure eg of the texts they may integrate some sp ecic Natural Language Pro cessing used for example to prepro cess the textual data Text Mining applications imp ose strong constraints on the usual NLP to ols For instance as they involve large volumes of textual data they do not al low to integrate complex treatments which would lead to exp onential and hence non tractable algorithms Furthermore semantic mo dels for the appli cation domains are rarely available and this implies strong limitations on the sophistication of the semantic and pragmatic levels of the linguistic mo dels In fact a working hyp othesis Feldman and Hirsh build up on the al assumes that shallow exp erience gained in the domain of Information Retriev representations of textual information often provides sucient supp ort for a range of information access tasks ASSOCIATION EXTRACTION FROM INDEXED DATA If the textual data is indexed either manually or automatically with the help of NLP techniques such as the ones describ ed in section the indexing structures can b e used as a basis for the actual knowledge discovery pro cess In this section we present a way of nding information in a collection of indexed do cuments by automatically retrieving relevant asso ciations b etween keywords Asso ciations denition Lets consider a set of keywords A fw w w g and a collection of 1 2 m t t indexed do cuments T ft g ie each t is asso ciated with a subset 1 2 n i of A denoted t A i A b e a set of keywords the set of all do cuments t in T such that Let W W tA will b e called the covering set for W and denoted W Any pair W w where W A is a set of keywords and w AnW will W w b e called an asso ciation rule and denoted R Given an asso ciation rule R W w Text Mining Natural Language techniques and Text Mining applications S R T jW fw gj is called the supp ort of R with resp ect to the j denotes the size of X collection T jX j[W fw g]j is called the condence R T of R with resp ect to the C j[W ]j collection T Notice that C R T is an approximation maximum likeliho o d estimate of the conditional probability for a text of b eing indexed by the keyword w if it is already indexed by the keyword set W An asso ciation rule R generated from a collection of texts T is said to satisfy supp ort and condence constraints and if S R T and C R T To simplify notations W fw g will b e often written W w and a rule R W w satisfying given supp ort and condence constraints will b e simply written as W w S R T C R T Mining for asso ciations eriments of asso ciation extraction have b een carried out by Feldman et al Exp with the KDT Knowledge Discovery in Texts system on the Reuter corpus The Reuter corpus is a set of do cuments that app eared on The do cuments were assembled and manually the Reuter newswire in indexed by Reuters Ltd and Carnegie Group Inc in Further formatting and data le pro duction was done in and by David D Lewis and Peter Sho emaker The do cuments were indexed with categories in the Economics domain The mining was p erformed on the indexed do cuments only ie exclusively on the keyword sets representing the real do cuments All known algorithms for generating asso ciation rules op erate in two phases Given a set of keywords A fw w w g and a collection of indexed 1 2 m do cuments T ft t t g the extraction of asso ciations satisfying given 1 2 n supp ort and condence constraints and is p erformed by rst generating all the keyword sets with supp ort at least equal to ie all the keyword sets W such that jW j The generated keyword t sets or covers sets are called the frequen then by generating all the asso ciation rules that can b e derived from the pro duced frequent sets and that satisfy the condence constraint Association extraction from indexed data a Generating the frequent sets The set of candidate covers frequent sets is built incrementally by starting from singleton covers and progressively adding elements to a cover as long as it satises the condence constraint The frequent set generation is the most computationally exp ensive step exp onential in the worse case Heuristic and incremental approaches are currently investigated A basic algorithm for generating frequent sets is indicated in Algorithm ffw g jfw gj g wher i C and e w are keywords i C and do while i C and fS S j S S C and i+1 1 2 1 2 i and jS S j i 1 2 and S S S jS S j i S C and 1 2 1 2 i and jS S j g 1 2 i i endw Algorithm Generating the frequent sets b Generating the asso ciations Once the maximal frequent sets have b een pro duced the generation of the asso ciations is quite easy A basic algorithm is presented in Algorithm foreach W maximal frequent set do generate all the rules W nfw g fw g where w W such that j[W nf w g]j j[W ]j endfch Algorithm Generating the asso ciations c Examples Concrete examples of asso ciations rules found by KDT on the Reuter Corpus are provided in Table These asso ciations were extracted with resp ect to sp ecic queries expressed by p otential users Text Mining Natural Language techniques and Text Mining applications query nd al l associations between a set of countries including Iran and any person Reagan result Iran Nicaragua Usa query nd al l associations between a set of topics including Gold and any country result gold copp er Canada gold silver USA Table Examples of asso ciations found by KDT techniques for asso ciation extraction Automated NLP Indexing In the case of the Reuter Corpus do cument indexing has b een done manually