<<

Text Mining

Natural Language techniques

and Text Mining applications

M Rajman R Besancon

Articial Intel ligence Laboratory Computer Science Department

Federal Institute of Technology Swiss

CH Lausanne Switzerland

rajmanliadiepch romaricliadiepch

Abstract

In the general framework of knowledge discovery techniques are

usually dedicated to extraction from structured Text

Mining techniques on the other hand are dedicated to information extrac

tion from unstructured textual data and Natural Language Pro cessing NLP

can then b e seen as an interesting to ol for the enhancement of information

cedures In this pap er we present two examples of Text Min extraction pro

ing tasks asso ciation extraction and prototypical do cument extraction along

with several related NLP techniques

Keywords

Text Mining Knowledge Discovery Natural Language Pro cessing

INTRODUCTION

The always increasing imp ortance of the problem of analyzing the large amounts

of data collected by companies and organizations has led to imp ortant devel

opments in the elds of automated Knowledge Discovery in Databases KDD

and Data Mining DM Typically only a small fraction of the col

lected data is ever analyzed Furthermore as the volume of available data

grows decisionmaking directly from the content of the databases is not fea

sible anymore

KDD and DM techniques are concerned with the pro cessing of Standard

structured databases Text Mining techniques are dedicated to the automated

form unstructured textual data

we present the dierences b etween the traditional Data Mining In Section

and the more sp ecic Text Mining approaches and in the subsequent sections

we describ e two examples of Text Mining applications along with the related

NLP techniques

c

IFIP Published by Chapman Hall

Text Mining Natural Language techniques and Text Mining applications

TEXT MINING VS DATA MINING

to Fayyad PiatetskyShapiro and Smyth Knowledge Dis According

the nontrivial process of identifying valid novel po covery in Databases is

tential ly useful and ultimately understandable patterns in data and therefore

refers to the overall pro cess of discovering from data However

as the usual techniques inductive or statistical metho ds for building decision

trees rule bases nonlinear regression for classication explicitly rely on

the structuring of the data into predened elds Data Mining is essentially

concerned with information extraction from structured databases

Table shows an example of Inductive Logic Programming based learn

ing from an attributevalue D zerovski The presented tables

contain the database and the rules induced by the mining pro cess

otential Customer Table P

Person Age Sex Income Customer

Ann Smith F yes

Joan Gray F yes

Mary Blythe F no

Jane Brown F yes

Bob Smith M yes

Jack Brown M yes

MarriedTo Table

Husband Wife

Bob Smith Ann Smith

Jack Brown Jane Brown

induced Rules

if IncomePerson then PotentialCustomerPerson

if SexPerson F and AgePerson

then PotentialCustomerPerson

if MarriedPerson Sp ouse and IncomePerson

then PotentialCustomerSp ouse

if MarriedPerson Sp ouse and PotentialCustomerPerson

then PotentialCustomerSp ouse

Table An example of Data Mining using ILP techniques

Association extraction from indexed data

This example illustrates how strongly the rule generation pro cess relies on

the explicit structure of the relational database presence of welldened elds

explicit identication of attributevalue pairs

In reality however a large p ortion of the available information app ears

in textual and hence unstructured form or more precisely in an implicitly

structured form Sp ecialized techniques sp ecically op erating on textual data

then b ecome necessary to extract information from such kind of collections of

texts These techniques are gathered under the name of Text Mining and in

grammatical structure order to discover and use the implicit structure eg

of the texts they may integrate some sp ecic Natural Language Pro cessing

used for example to prepro cess the textual data

Text Mining applications imp ose strong constraints on the usual NLP to ols

For instance as they involve large volumes of textual data they do not al

low to integrate complex treatments which would lead to exp onential and

hence non tractable Furthermore semantic mo dels for the appli

cation domains are rarely available and this implies strong limitations on the

sophistication of the semantic and pragmatic levels of the linguistic mo dels

In fact a working hyp othesis Feldman and Hirsh build up on the

al assumes that shallow exp erience gained in the domain of Information Retriev

representations of textual information often provides sucient supp ort for a

range of tasks

ASSOCIATION EXTRACTION FROM INDEXED DATA

If the textual data is indexed either manually or automatically with the help

of NLP techniques such as the ones describ ed in section the indexing

structures can b e used as a basis for the actual knowledge discovery pro cess

In this section we present a way of nding information in a collection of

indexed do cuments by automatically retrieving relevant asso ciations b etween

key

Asso ciations denition

Lets consider a set of keywords A fw w w g and a collection of

1 2 m

t t indexed do cuments T ft g ie each t is asso ciated with a subset

1 2 n i

of A denoted t A

i

A b e a set of keywords the set of all do cuments t in T such that Let W

W tA will b e called the covering set for W and denoted W

Any pair W w where W A is a set of keywords and w AnW will

W w b e called an asso ciation rule and denoted R

Given an asso ciation rule R W w

Text Mining Natural Language techniques and Text Mining applications

 S R T jW fw gj is called the supp ort of R with resp ect to the

j denotes the size of X collection T jX

j[W fw g]j

is called the condence R T of R with resp ect to the  C

j[W ]j

collection T

Notice that C R T is an approximation maximum likeliho o d estimate

of the conditional probability for a text of b eing indexed by the key

w if it is already indexed by the keyword set W

An asso ciation rule R generated from a collection of texts T is said to satisfy

supp ort and condence constraints and if

S R T and C R T

To simplify notations W fw g will b e often written W w and a rule

R W w satisfying given supp ort and condence constraints will b e

simply written as

W w S R T C R T

Mining for asso ciations

eriments of asso ciation extraction have b een carried out by Feldman et al Exp

with the KDT Knowledge Discovery in Texts system on the Reuter

corpus The Reuter corpus is a set of do cuments that app eared on

The do cuments were assembled and manually the Reuter newswire in

indexed by Reuters Ltd and Carnegie Group Inc in Further formatting

and data le pro duction was done in and by David D Lewis and

Peter Sho emaker

The do cuments were indexed with categories in the Economics domain

The mining was p erformed on the indexed do cuments only ie exclusively on

the keyword sets representing the real do cuments

All known algorithms for generating asso ciation rules op erate in two phases

Given a set of keywords A fw w w g and a collection of indexed

1 2 m

do cuments T ft t t g the extraction of asso ciations satisfying given

1 2 n

supp ort and condence constraints and is p erformed

 by rst generating all the keyword sets with supp ort at least equal to

ie all the keyword sets W such that jW j The generated keyword

t sets or covers sets are called the frequen

 then by generating all the asso ciation rules that can b e derived from the

pro duced frequent sets and that satisfy the condence constraint

Association extraction from indexed data

a Generating the frequent sets

The set of candidate covers frequent sets is built incrementally by starting

from singleton covers and progressively adding elements to a cover as long

as it satises the condence constraint

The frequent set generation is the most computationally exp ensive step

exp onential in the worse case Heuristic and incremental approaches are

currently investigated

A basic for generating frequent sets is indicated in Algorithm

ffw g jfw gj g wher i C and e w are keywords

i

C and do while

i

C and fS S j S S C and

i+1 1 2 1 2 i

and jS S j i

1 2

and S S S jS S j i S C and

1 2 1 2 i

and jS S j g

1 2

i i

endw

Algorithm Generating the frequent sets

b Generating the asso ciations

Once the maximal frequent sets have b een pro duced the generation of the

asso ciations is quite easy A basic algorithm is presented in Algorithm

foreach W maximal frequent set do

generate all the rules W nfw g fw g where w W such that

j[W nf w g]j

j[W ]j

endfch

Algorithm Generating the asso ciations

c Examples

Concrete examples of asso ciations rules found by KDT on the Reuter Corpus

are provided in Table These asso ciations were extracted with resp ect to

sp ecic queries expressed by p otential users

Text Mining Natural Language techniques and Text Mining applications

query nd al l associations between a set of countries

including Iran and any person

Reagan result Iran Nicaragua Usa

query nd al l associations between a set of topics

including Gold and any country

result gold copp er Canada

gold silver USA

Table Examples of asso ciations found by KDT

techniques for asso ciation extraction Automated NLP

Indexing

In the case of the Reuter Corpus do cument indexing has b een done manually

but as manual indexing is a very timeconsuming task it is not realistic

to assume that such a pro cessing could systematically b e p erformed in the

general case Automated indexing of the textual do cument base p erformed

for example in a prepro cessing phase has to b e considered in order to allow

the use of asso ciation extraction techniques on a large scale

Techniques for automated pro duction of indexes asso ciated with do cuments

can b e b orrowed from the eld In this case they usually

rely on frequencybased weighting schemes Salton and Buckley Several

examples of such weighting schemes are provided in the SMART Information

Retrieval system Formula presents the SMART atc weighting scheme

p

ij

N

if p log

ij

max (p ) n

j

il l

w

ij

otherwise

where w is the weight of word w in do cument t p is the relative

ij j i ij

P

do cument frequency of w in t p f f where f is the numb er

ij j i ij ik ij

k

of o ccurrences of w in t N is the numb er of do cuments in the collection

j i

and n is the numb er of do cuments containing w

j j

Once a weighting scheme has b een selected automated indexing can b e

p erformed by simply selecting for each do cument the words satisfying given

weight constraints

advantage of automated indexing pro cedures is that they dras The ma jor

tically reduce the cost of the indexing stepOne of their main drawbacks is

however that when applied without additional knowledge such as a the

saurus they pro duce indexes with extremely reduced generalization p ower

Prototypical document extraction from ful l text

keywords have to b e explicitly present in the do cuments and do not always

provide a go o d thematic description

Additional issues

a Integration of background knowledge

If background knowledge is available for example some factual knowledge

ab out the application domain additional constraints can b e integrated in

the asso ciation generation pro cedure either in the frequent set generation or

directly in the asso ciation extraction An example of a system using back

ground knowledge for asso ciation generation is the FACT system develop ed

by Feldman and Hirsh

b Generalization of the notion of asso ciation

Several generalizations are p ossible for the notion of asso ciation rule

 rules with more than one keyword in their righthand side which can

express more complex implications

 more general attributes ie not only restricted to keywords presence ab

sence discrete and continuous variables

 non implicative relations such as pseudoequivalences

 dierent quality measures providing alternative approaches for condence

evaluation

An example of system integrating such kinds of generalizations is the GUHA

system develop ed at the Institute of Computer and in

Prague

PROTOTYPICAL DOCUMENT EXTRACTION FROM FULL

TEXT

The asso ciation extraction presented in the previous section exclusively op er

ates on the do cument indexes and therefore do es not directly take advantage

of the textual content of the do cuments Approaches based on full text mining

for information extraction can then b e considered

con Our initial exp eriments on the Reuter Corpus Ra jman and Besan

were dedicated to the implementation and evaluation of asso ciation extraction

techniques op erating on all the words contained in the do cuments instead of

only considering the asso ciated keywords The obtained results showed how

ever that asso ciation extraction based on full text do cuments do es not pro

vide eectively exploitable results Indeed the asso ciation extraction pro cess

either just detected comp ounds ie domaindep endent terms such as wal l

Text Mining Natural Language techniques and Text Mining applications

street or treasury secretary james baker which cannot b e considered

as potential ly useful referring to the KDD denition given in section or

extracted uninterpretable asso ciations such as dol lars shares exchange total

commission stake securities that could not b e considered as ultimately

understandable

We therefore had to seek for a new TM task that would b e more adequate

for full text information extraction out of large collections of textual data

We decided to concentrate on the extraction of prototypical do cuments

where prototypical is informally dened as corresp onding to an information

that o ccurs in a rep etitive fashion in the do cument collection The underlying

working hyp othesis is that rep etitive do cument structures provide signicant

information ab out the textual base that is pro cessed

Basically the metho d presented in this section relies on the identication of

frequent sequences of terms in the do cuments and uses NLP techniques such

as automated PartofSp eech Tagging and Term Extraction to prepro cess the

textual data

The NLP techniques can b e considered as an automated generalized in

dexing pro cedure that extracts from the full textual content of the do cuments

linguistically signicant structures that will constitute a new basis for frequent

set extraction

do cument NLP Prepro cessing for prototypical

extraction

a PartOfSp eech tagging

The ob jective of the PartOfSp eech tagging POSTagging is to automat

ically assign PartofSp eech tags ie morphosyntactic categories such as

noun verb adjective to words in context For instance a sentence as a

computational pro cess executes programs should b e tagged as aDET com

programsN The main diculty of putationalADJ pro cessN executesV

such a task is the lexical ambiguities that exist in all natural languages For

instance in the previous sentence b oth words pro cess and programs could

b e either nounsN or verbsV

Several techniques have b een designed for POStagging

 Hidden Markov Mo del based approaches Cutting et al

 Rulebased approaches Brill

If a large lexicon providing go o d coverage of the application domain and

some manually handtagged text are available such metho ds p erform auto

mated POStagging in a computationally very ecient way linear complex

ity and with a very satisfying p erformance on the average accuracy

Prototypical document extraction from ful l text

One of the imp ortant advantage of POStagging is to allow automated

ltering of nonsignicant words on the basis of their morphosyntactic cate

gory For instance in our exp eriments where we used the E Brills rulebased

tagger Brill we decided to lter out articles prep ositions conjunc

tions therefore restricting the eective mining pro cess to nouns adjectives

and verbs

b term extraction

In order to automatically detect domaindep endent comp ounds a term ex

traction pro cedure has b een integrated in the prepro cessing step

Automated term extraction is indeed one of the critical NLP tasks for vari

ous applications such as enhanced indexing in the

domain of textual

Term extraction metho ds are often decomp osed into two distinct steps

Daille

 extraction of term candidates on the basis of structural linguistic informa

tion for example term candidates can b e selected on the basis of relevant

morphosyntactic patterns such as N Prep N b oard of directors Secre

tary of State Adj N White House annual rate etc

 ltering of the term candidates on the basis of some statistical relevance

2

scoring schemes such as frequency mutual information co ecient log

like co ecient in fact the actual lters often consist of combinations

of dierent scoring schemes asso ciated with exp erimentally dened thresh

olds

In our exp eriments we used morphosyntactic patterns to extract the

term candidates Noun Noun Noun of Noun Adj Noun Adj

Verbal In order to extract more complex comp ounds such as Secretary

of State George Shultz the term candidate extraction was applied in an it

erative way where terms identied at step n were used as atomic elements

for step n until no new terms were detected For example the sequence

SecretaryN ofprep StateN GeorgeN ShultzN was rst transformed into

SecretaryofStateN GeorgeShultzN patterns and and then com

bined into a unique term SecretaryofStateGeorgeShultzN pattern A

purely frequencybased scoring scheme was then used for ltering

The prototyp e integrating POStagging and term extraction that we used

for our exp eriments was designed in collab oration with R Feldmans team at

Bar Ilan University

Text Mining Natural Language techniques and Text Mining applications

Mining for prototypical do cuments

a The extraction pro cess

The extraction pro cess can b e decomp osed into four steps

 NLP prepro cessing POStagging and term extraction as describ ed in the

previous section

 frequent term sets generation using an algorithm globally similar to the one

describ ed in Algorithm with some minor changes particularly concerning

the data representation

 clustering of the term sets based on a similarity measure derived from the

numb er of common terms in the sets

 actual pro duction of the prototypical do cuments asso ciated with the ob

tained clusters

The whole pro cess is describ ed in more detail in subsection b on the basis

of a concrete example

As we already mentioned earlier asso ciation extraction from full text do cu

ments provided uninterpretable results indicating that asso ciations constitute

an inadequate representation for the frequent sets in the case of full text min

ing In this sense the prototypical do cuments are meant to corresp ond to

more op erational structures giving a b etter representation of the rep etitive

do cuments in the text collection and therefore providing a p otentially useful

basis for a partial synthesis of the information content hidden in the textual

base

b example

Figure presents an example of a SGML tagged do cument from the Reuter

Corpus

Figure presents for the same do cument the result of the NLP prepro cess

ing step POStagging and term extraction the extracted terms are printed

in b oldface

During the pro duction of term sets asso ciated with the do cuments ltering

of nonsignicant terms is p erformed on the basis of

 morphosyntactic information we only keep nouns verbs and adjectives

 frequency criteria we only keep terms with frequency greater than a given

minimal supp ort

 empiric knowledge we remove some frequent but nonsignicant verbs is

has b een

After this treatment the following indexing structure term set is obtained

for the do cument and will serve as a basis for the frequent set generation

Prototypical document extraction from ful l text

Nissan Motor Co Ltd is issuing a billion yen eurob ond

due March paying p ct and priced at Nikko Securities Co

Europ e Ltd said

ailable in denominations of one mln Yen and will b e The noncallable issue is av

listed in Luxemb ourg

The payment date is March

The selling concession is p ct while management and underwriting combined

pays p ct

> Nikko said it was still completing the syndicate <BODY><TEXT

<REUTERS>

Figure An example of Reuter Do cument

Nissan Motor Co LtdN NSANN TN isV issuingV aDET

billion yenCD eurob ondV dueADJ March CD CD pay

ingV p ercentCD andCC pricedV atPR ADJ

o Securities CoN Europ eN SYM LtdN saidV Nikk

TheDET noncallable issueN isV availableADJ inPR denominationsN

ofPR one millionCD YenCD andCC willMD b eV listedV inPR Lux

emb ourgN

dateN isV March CD TheDET payment

TheDET selling concessionN isV p ercentCD whilePR manage

mentN andCC underwritingN combinedV paysV p ercentCD

NikkoN saidV itPRP wasV stillRB completingV theDET syndicateN

Figure A tagged Reuter Do cument

availableadj combinedv denominationsn dueadj europ en issuingv listedv f

luxemb ourgn managementn payingv payment daten paysv pricedv sell

ing concessionn syndicaten underwritingng

The frequent sets generation step of course op erating on the whole do c

ument collection then pro duces among others the following frequent term

sets POStags have b een removed to increase

fdue available management priced issuing paying denominations underwritingg

dateg fdue available management priced issuing denominations payment

fdue available management priced issuing denominations underwriting luxemb ourgg

fdue management selling priced issuing listedg

dateg fdue priced issuing combined denominations payment

fmanagement issuing combined underwriting payment dateg

where the numeric values corresp ond to the frequency of the sets in the

collection

In order to reduce the imp ortant information redundancy due to partial

overlapping b etween the sets clustering was p erformed to gather some of the

term sets into classes clusters represented by the union of the sets

Text Mining Natural Language techniques and Text Mining applications

fdue available management priced issuing combined denominations listed underwriting

date payingg luxemb ourg payment

To reduce the p ossible meaning shifts linked to non corresp onding word

sequences the term sets representing identied clusters were split into sets of

distinct terms sequences asso ciated with paragraph b oundaries in the original

do cuments The most frequent sequential decomp ositions of the clusters are

then computed and some of the corresp onding do cument excerpts extracted

These do cument excerpts are by denition the prototypical do cuments corre

sp onding to the output of the mining pro cess

Figure presents b oth the most frequent sequential decomp osition for the

previous set and the asso ciated prototypical do cument

issuing due paying priced available denominations listed luxemb ourg pay

ment date management underwriting combined

>

Nissan Motor Co Ltd NSANT is issuing a billion yen eurob ond due

March paying p ercent and priced at Nikko Securities Co

Europ e Ltd said

issue is available in denominations of one million Yen and will The noncallable

b e listed in Luxemb ourg

date is March The payment

The selling concession is p ercent while management and underwriting

combined pays p ercent

Nikko said it was still completing the syndicate

Figure An example of prototypical do cument

F uture Work

a Name entity tagging

We p erformed syntactic PartOfSp eech tagging on the do cument base Sim

ilar techniques can also b e used for semantic tagging

For instance the Alembic environment develop ed by the MITRE Natural

Language Pro cessing Group MITRE NLP Group corresp ond to a set

of techniques allowing rule based name entity tagging The rules used by the

system have b een automatically learned from examples

Figure presents the prototypical do cument given in the previous section

as tagged by Alembic This tagging has b een provided by Christopher Clifton

from MITRE and used two rule bases trained to resp ectively recognize p er

sonlo cationorganization and datetimemoneynumb ers

This kind of semantic tagging will b e undoubtfully useful for the generaliza

tion of the variable parts in prototypical do cuments and could b e considered

Prototypical document extraction from ful l text

Nissan Motor Co Ltd<ENAMEX>

NSAN<ENAMEX><s>T is is

suing a <NUMBER> billion<NUMBER> yen

eurob ond due March <TIMEX> paying

<NUMBER><NUMBER> p ercent and priced

at <NUMBER><NUMBER>

TYPEORGANIZATION>Nikko Securities Co<ENAMEX>

TYPELOCATION>Europ e<ENAMEX> Ltd said<s>

The noncallable issue is available in denominations of

one<NUMBER> million<NUMBER>

TYPEORGANIZATION>Yen<ENAMEX> and will b e listed in

TYPELOCATION>Luxemb ourg<ENAMEX><s>

The payment date is March <TIMEX><s>

The selling concession is <NUMBER>

<NUMBER> p ercent while management and underwriting

combined pays <NUMBER> p ercent<s>

Nikko<ENAMEX> said it was still

completing the syndicate<s>

Figure Name entity tagging of a Prototypical Do cument

as an abstraction pro cess that will provide a b etter representation of the syn

thetic information extracted from the base

b Implicit user mo deling

In any information extraction pro cess it is of great interest to try to take into

account an interaction with the user Exp eriments in Information Retrieval

IR have shown for instance that b etter relevance results can b e obtained

by using relevance feedback techniques techniques that allow to integrate

relevance evaluation by the user of the retrieved do cuments

In our mo del such an approach could lead to integrate b oth a posteriori

and a priori information ab out the user and therefore corresp ond to the

integration of an implicit mo del of the user

 A p osteriori information could b e obtained with a similar pro cedure as

in classical IR pro cesses through the analysis of the reactions of the user

concerning the results provided by the TM system relevance or usefulness

of extracted prototypical do cuments

 A priori information could b e derived for example from any preclassica

tion of the data often present in the real data for example users often

classify their les in directories or folders This user prepartitioning of the

do cument base contain interesting information ab out the user and could

serve as a basis for deriving more adequate parameters for the similarity

measures for instance the parameters could b e tuned in order to minimize

interclass similarity and maximize intraclass similarity

Text Mining Natural Language techniques and Text Mining applications

CONCLUSION

The general goal of Data Mining is to automatically extract information from

databases Text Mining corresp onds to the same global task but sp ecically

applied on unstructured textual data In this pap er we have presented two

dierent TM tasks asso ciation extraction from a collection of indexed do cu

ments designed to answer sp ecic queries expressed by the users and proto

typical do cument extraction from a collection of fulltext do cuments designed

to automatically nd information ab out classes of rep etitive do cument struc

tures that could b e used for automated synthesis of the information content

of the textual base

REFERENCES

oc of the Brill E A Simple RuleBased PartofSp eech Tagger In Pr

rd Conf on Applied Natural Language Processing

Cutting D et al A Practical PartofSp eech Tagger In Proc of the

rd Conf on Applied Natural Language Processing

Daille B Study and Implementation of Combined Techniques for Auto

matic Extraction of Terminology In Proc of the nd Annual Meeting

of the Association for Computational

Dzerovski S Inductive logic programming and Knowledge Discovery

ledge Discovery and Data Mining in Databases In Advances in Know

AAAI Press The MIT Press

Fayyad UM PiatetskyShapiro G and Smyth P From Data Min

ing to Knowledge Discovery An Overview In Advances in Know ledge

Discovery and Data Mining AAAI Press The MIT Press

Feldman R Dagan I and Klo egsen W Ecient Algorithm for Mining

th

European Meeting on and Manipulating Asso ciations in Texts

Cybernetics and

Feldman R and Hirsh H Mining Asso ciations in Text in the Presence

of Background Knowledge In Proc of the nd Int Conf on Know ledge

Discovery

Feldman R and Hirsh H Finding Asso ciations in Collections of Text

In Michalski RS Bratko I and Kubat M edts Machine Learn

ing Data Mining and Know ledge Discovery Methods and Application

John Wiley and sons Ltd

MITRE NLP Group Alembic Language Pro cessing for Intelligence Ap

plications At URL

tersadvanced infoghnlindexhtml httpwwwmitreorgresourcescen

Ra jman M and Besancon R A Lattice Based Algorithm for Text

Mining Technical Rep ort TRLIALN Swiss Federal Institute of

Technology

Salton G and Buckley C Term Weighting Approaches in Automatic

BIOGRAPHY

Text Retrieval Information Processing and Management

BIOGRAPHY

Born Martin RAJMAN graduated from the Ecole Nationale Supe

rieure des Telecommunications ENST Paris where he also obtained a PhD

in Computer Science In March M RAJMAN joined the p ermanent

teaching and research sta of the ENST as memb er of the Articial Intelli

gence Group resp onsible for the Natural Language activity Since Septemb er

he is memb er of the Articial Intelligence Lab oratory of the Ecole Poly

technique Federale de Lausanne EPFL where he is in charge of the Natural

Language Pro cessing NLP group

Born Romaric BESANC ON graduated from the Institut dInfor

matique dEntreprise I IE Evry and then obtained a DEA from the Univer

Orsay He is currently research assistant at the EPFL NLP sity ParisXI

group where he is working in the domain of TextMining