
Automatic multilingual indexing using the EuroVoc thesaurus

Ralf Steinberger

European Commission – Joint Research Centre (JRC) http://langtech.jrc.ec.europa.eu/

VdB: Jenseits der Verbundkataloge. Die Zukunft der Recherche, Munich, 27.09.2012

Agenda

• EC-Joint Research Centre (JRC) – Who we are
• The JRC Eurovoc Indexer software JEX
• EuroVoc thesaurus
• Motivation for automatic indexing
• How automatic indexing works
• Performance evaluation
• Download and use JEX
• Future of the librarian profession
• Brainstorming

JRC - Who we are

(scientific-technical arm of public administration)
• Non-commercial
• Multi-disciplinary / multilingual
• Main product: Europe Media Monitor (EMM)

Europe Media Monitor EMM – A few facts

• ~ 150,000 online articles / day in ~ 50 languages
• ~ 3600 sources (world-wide, with focus on Europe)

• In-depth analysis in 20 languages (NewsExplorer) • 24/7, updated every 10 minutes • Freely accessible via http://emm.newsbrief.eu/overview.html

• Four main EMM applications:

What is JEX

JEX = JRC Eurovoc IndeXer:

• Automatic indexing software
• Developed at the JRC
• Using the controlled vocabulary from EuroVoc
• Readily trained for 22 languages, and re-trainable
• Freely downloadable from http://langtech.jrc.ec.europa.eu/

The EuroVoc Thesaurus

• > 6,000 subject domains
• Exists in one-to-one translation in 22 official EU languages plus (at least) in Basque, Catalan, Croatian, Russian and Serbian

• Used by many European for manual classification

→ We use results to train automatic classifiers

• Indexing documents in one language allows search and retrieval in other languages!
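A minimal sketch of why this works, assuming each document is stored with language-independent descriptor identifiers and each identifier has one label per language. The identifiers (`D1`, `D2`), the labels, the documents and the function names below are hypothetical illustrations, not actual EuroVoc entries or JEX code.

```python
# Hypothetical descriptor IDs with one label per language.
# IDs and translations are illustrative only, not real EuroVoc data.
LABELS = {
    "D1": {"en": "fishery management", "de": "Fischereimanagement",
           "es": "gestión de la pesca"},
    "D2": {"en": "nuclear accident", "de": "Nuklearunfall",
           "es": "accidente nuclear"},
}

# Documents written in different languages, indexed with descriptor IDs.
DOCUMENTS = [
    {"id": "doc-en-1", "lang": "en", "descriptors": {"D1"}},
    {"id": "doc-de-1", "lang": "de", "descriptors": {"D1", "D2"}},
    {"id": "doc-es-1", "lang": "es", "descriptors": {"D2"}},
]


def find_descriptor(label, lang):
    """Return the descriptor ID whose label in `lang` matches, or None."""
    for descriptor_id, labels in LABELS.items():
        if labels.get(lang, "").lower() == label.lower():
            return descriptor_id
    return None


def search(label, lang):
    """Return IDs of all documents, in any language, carrying the descriptor."""
    descriptor_id = find_descriptor(label, lang)
    if descriptor_id is None:
        return []
    return [doc["id"] for doc in DOCUMENTS
            if descriptor_id in doc["descriptors"]]


# A Spanish query retrieves the German and the Spanish document.
print(search("accidente nuclear", "es"))
```

Because the query is resolved to an identifier before the lookup, the language of the query and the language of the retrieved documents are independent of each other.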

EuroVoc Indexing Challenges

• Classes are concepts rather than words, e.g. protection of minorities, construction and town planning, …
• Classes are very unevenly distributed (the most frequent class is used 4262 times)
• Various text types (heterogeneous training set)
• Multi-label categorisation (both for training and assignment)

Automatic indexing Motivation

• Manual indexing is very valuable, but difficult, slow, often inconsistent between indexers and inconsistent over time.

• Automatic indexing works less well, but is very fast and consistent; even old documents can be re-indexed.
• The manual process cannot be replaced, but the automatic process has its uses.
• Interactive indexing combines the advantages of both:
  • speed and consistency of the automatic process;
  • quality and control of manual indexing;
  • hopefully speeding up the overall process.
• Used by the Spanish Congress of Deputies since 2006.

Automatic indexing Motivation (2)

http://emm.newsexplorer.eu/

Automatic indexing How does it work?

• How would you do it if you had to develop such a system?
• What happens in your heads while indexing, e.g. for the EuroVoc descriptor NUCLEAR ACCIDENT?
• Presumably: looking for the occurrence of specific words or Boolean combinations or weighted word lists, e.g.
  (nuclear OR radioactive OR …) AND (accident OR leak OR …) → NUCLEAR ACCIDENT
• Writing such rules is a time-consuming task.
• Rules have to be written separately for each language.
• >6000 descriptors, 22+ languages.
• Rules would need to be updated continuously.
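To make the cost of hand-written rules concrete, here is a minimal sketch of the Boolean rule above for a single descriptor in a single language. The rule table, the tokenizer and the function names are assumptions for illustration; the alternatives elided with "…" in the rule are left out.

```python
import re

# One hand-written rule: a descriptor fires if every group has at least
# one of its words in the text (OR within a group, AND between groups).
RULES = {
    "NUCLEAR ACCIDENT": [
        {"nuclear", "radioactive"},
        {"accident", "leak"},
    ],
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matching_descriptors(text):
    """Return all descriptors whose Boolean rule is satisfied by the text."""
    words = tokenize(text)
    return [descriptor for descriptor, groups in RULES.items()
            if all(words & group for group in groups)]


print(matching_descriptors("A radioactive leak was reported at the plant."))
print(matching_descriptors("A nuclear power plant opened."))
```

A rule like this covers only English and only one of more than 6000 descriptors, which is the maintenance problem the slide points to.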

How JEX works Overview

• Method: profile-based category ranking
• E.g. result for a document with the title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the common fisheries policy
• E.g. profile for the EuroVoc category FISHERY MANAGEMENT

How JEX works How does it work? (2)

• JRC’s system learns the rules from manually indexed documents:
  • taking large collections of previously manually indexed documents;
  • using statistics to see which words frequently occur with a certain class (e.g. nuclear, leak) and which ones are found with all classes (e.g. the, of, for, paragraph, decision, …).
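One simple way to capture this idea, shown only as an illustration and not as the statistic JEX actually uses, is to multiply a word's count within a class by the logarithm of the inverse class frequency. Words that occur with every class then receive a weight of zero. The toy training data and the function name are hypothetical.

```python
import math
from collections import Counter

# Toy training data (hypothetical): descriptor -> words of its documents.
CLASS_WORDS = {
    "NUCLEAR ACCIDENT": "the nuclear leak of the reactor decision".split(),
    "FISHERY MANAGEMENT": "the fishing quota of the fleet decision".split(),
    "TOWN PLANNING": "the building permit of the town decision".split(),
}


def class_word_weights(class_words):
    """Weight words by in-class count times log(classes / classes with word)."""
    n_classes = len(class_words)
    class_freq = Counter()
    for words in class_words.values():
        class_freq.update(set(words))  # count each word once per class
    weights = {}
    for descriptor, words in class_words.items():
        counts = Counter(words)
        weights[descriptor] = {
            word: count * math.log(n_classes / class_freq[word])
            for word, count in counts.items()
        }
    return weights


weights = class_word_weights(CLASS_WORDS)
for word, weight in sorted(weights["NUCLEAR ACCIDENT"].items()):
    print(f"{word:10s} {weight:.2f}")
```

In this toy data "the", "of" and "decision" occur with all three classes and drop to zero, while "nuclear" and "leak" keep a positive weight.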

• For technical details, see, e.g.:

How JEX works Learning the profiles

• Using a large collection of manually indexed documents (training corpus):
• For each descriptor Di, take all documents Tj indexed with Di.
• Identify the statistically salient words in each of these texts.
• Join these lists of statistically salient words and take the most frequently occurring words as associates.

E.g. descriptor RADIOACTIVE MATERIALS (simplified!)

[Figure: the salient-word lists of three texts are added together (T1 + T2 + T3). The merged list, with the number of texts each word occurs in, is: radioactive (3), plutonium (3), nuclear (2), deuterium (2), Illegal_traffic (1), chernobyl (1), ... Other words in the individual lists include resolution, assembly, ukrainian, schmidt, parliament, lithium, dangerous, blottnitz, iaea and mox.]

• Normalise the weights according to a number of different (statistical) criteria.

→ Result of training: weighted lists of associated words for each of the descriptors.

Sample profile 1 RADIOACTIVE MATERIALS
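As a rough sketch of how such a list of associated words could be produced for the simplified RADIOACTIVE MATERIALS example, the code below joins the salient-word lists of three texts and counts in how many texts each word occurs. The assignment of words to T1, T2 and T3 is an assumption, chosen so that the top counts agree with the merged list in the example; the normalisation step is omitted, and the function name is hypothetical.

```python
from collections import Counter

# Salient words per text. Which word belongs to which text is assumed.
T1 = ["radioactive", "deuterium", "resolution", "plutonium",
      "parliament", "nuclear", "blottnitz"]
T2 = ["plutonium", "assembly", "nuclear", "deuterium",
      "schmidt", "radioactive", "dangerous", "iaea"]
T3 = ["illegal_traffic", "radioactive", "ukrainian", "plutonium",
      "lithium", "chernobyl", "mox"]


def build_profile(salient_word_lists, top_n=4):
    """Join the lists and keep the words found in the most texts."""
    counts = Counter()
    for words in salient_word_lists:
        counts.update(set(words))  # each word counts once per text
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


profile = build_profile([T1, T2, T3])
for word, count in profile:
    print(f"{word} ({count})")
```

Words that are salient in only one text, such as names of speakers, fall to the bottom of the joined list, which is why taking the most frequent words filters them out.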

Sample profile 2 FISHERY MANAGEMENT

[Figure: profile words, grouped as fishery-related and management-related]

How JEX works Assigning descriptors

1. Produce word list (excluding stop words)

2. Calculate the similarity between the word frequency list and the descriptor profiles, using statistical similarity measures, e.g. cosine:

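$$
\mathrm{COSINE}(d,t) = \frac{\sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}}{\sqrt{\left(\sum_{l \in d} \mathrm{TFIDF}_{l,d}^{2}\right) \cdot \left(\sum_{l \in t} \mathrm{TFIDF}_{l,t}^{2}\right)}}
$$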
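The cosine measure can be sketched in a few lines, assuming the document d and each descriptor profile t are available as dictionaries that map words l to TF-IDF weights. The toy weights, the profile contents and the function names are hypothetical; this is not the JEX implementation.

```python
import math


def cosine(doc_weights, profile_weights):
    """Cosine similarity between two sparse word -> weight dictionaries."""
    shared = doc_weights.keys() & profile_weights.keys()
    numerator = sum(doc_weights[w] * profile_weights[w] for w in shared)
    denominator = math.sqrt(
        sum(v * v for v in doc_weights.values())
        * sum(v * v for v in profile_weights.values())
    )
    return numerator / denominator if denominator else 0.0


def rank_descriptors(doc_weights, profiles):
    """Return (descriptor, similarity) pairs, most similar first."""
    scores = {name: cosine(doc_weights, profile)
              for name, profile in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy weights for illustration only.
PROFILES = {
    "FISHERY MANAGEMENT": {"fishery": 3.0, "fishing": 2.0, "management": 1.0},
    "RADIOACTIVE MATERIALS": {"radioactive": 3.0, "plutonium": 3.0},
}
document = {"fishery": 2.0, "control": 1.0, "regulation": 1.0}

for name, score in rank_descriptors(document, PROFILES):
    print(f"{name}: {score:.2f}")
```

Only words shared by document and profile contribute to the numerator, so a profile with no overlapping words scores zero regardless of its size.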

How JEX works Sample result

Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/0146(CNS)) (Consultation procedure)

• Ranked list, not a set
• Assignment weight, not a yes/no answer
• How many descriptors to assign or display? The first 6? Those with a weight > 0.20?
• Human average is 5.6 descriptors per document, varying from 2 to 20 per document
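Because the indexer returns a ranked list with weights rather than a set, the number of descriptors to assign is a separate decision. The sketch below shows the two cut-offs raised for the sample result, a fixed rank (first 6) and a minimum weight (> 0.20). The descriptor names and weights are invented placeholders, not JEX output, and the function name is hypothetical.

```python
def select_descriptors(ranked, max_rank=None, min_weight=None):
    """Cut a ranked (descriptor, weight) list by rank and/or weight."""
    selected = ranked if max_rank is None else ranked[:max_rank]
    if min_weight is not None:
        selected = [(name, weight) for name, weight in selected
                    if weight > min_weight]
    return selected


# Placeholder ranking, sorted by descending weight.
ranked = [("D1", 0.41), ("D2", 0.33), ("D3", 0.27), ("D4", 0.22),
          ("D5", 0.18), ("D6", 0.12), ("D7", 0.09), ("D8", 0.05)]

first_six = select_descriptors(ranked, max_rank=6)
above_threshold = select_descriptors(ranked, min_weight=0.20)

print(len(first_six), "descriptors at rank cut-off 6")
print(len(above_threshold), "descriptors with weight > 0.20")
```

A fixed rank always yields the same number of descriptors, whereas a weight threshold lets the number vary per document, which is closer to how human indexers behave.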

JEX Automatic evaluation

Automatic evaluation at rank 6, comparing against the previous manual annotation
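Comparing the top-ranked automatic descriptors with a manual annotation can be scored with precision, recall and F1 at a given rank. The sketch below assumes the standard set-based definitions; the function name and the example descriptors are hypothetical.

```python
def precision_recall_f1_at_rank(ranked, manual, k=6):
    """Score the top-k automatic descriptors against the manual ones."""
    top_k = set(ranked[:k])
    manual = set(manual)
    correct = len(top_k & manual)
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(manual) if manual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder data: automatic ranking and manual annotation of one document.
automatic = ["A", "B", "C", "D", "E", "F", "G"]
manual = {"A", "C", "G", "H"}

p, r, f1 = precision_recall_f1_at_rank(automatic, manual, k=6)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Here two of the six top-ranked descriptors are in the manual annotation, so "G" at rank 7 does not count even though the indexer proposed it.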

Evaluation: precision (P), recall (R) and F1 at rank 6.

JEX Manual evaluation

Manual, by indexing professionals (at least top ten descriptors/document):
• English (by Swedish Riksdag)
• Spanish (by Spanish Congress of Deputies)

Manual Evaluation Good result

Manual Evaluation Bad result

Little lexical evidence in text …

Manual evaluation Results

• Large-scale manual evaluation for English and Spanish: 75% and 82% agreement with the top ten automatically assigned descriptors.
• Agreement of a 2nd annotator with the 1st (also professionally trained) indexer for English and Spanish: 74% and 84%.
• What is the best result that automatic software can achieve?

JEX GUI Index documents

• Download software from http://langtech.jrc.ec.europa.eu/Eurovoc.html
• Graphical user interface (GUI), command line option (and an API)

JEX GUI View one document

[Screenshot labels: descriptors assigned; profile words found in document; document indexed; profile of selected descriptor; manually correct results; save result as XML]

JEX Summary

• Software available for free, trained for 22 languages on legislative text.
  • It will only perform well if applied to the same text type (but can be retrained).
• Results could be better, but are good enough to be used interactively by professionals.
• Various fully automatic uses, e.g. cross-lingual linking of documents.
• Major advantage: the same categories across 22 languages, for search, retrieval and display.
  • Web-like full-text search across languages is difficult, but is possible with JEX.

Profession: librarians Little survey (1)

Opinions on the (rather provocative) questions:
• Search in unstructured data (e.g. full-text search) is getting better.
• Will we still (also) need to search in structured data?
• Do we still need to train librarians to search?
• Can’t we find everything even using a badly formulated search query?

We “would […], however, like to have a talk that gives, from a computer science perspective, an assessment of the areas in which improved searching of unstructured data (e.g. full texts of scientific publications) could in future replace or substantially complement searching of structured metadata. The background is that there are ongoing discourses claiming that, in the long run, there will be no need either to actively teach search skills or to create metadata manually, because search algorithms will become so good that everything wanted will be found (no matter how poor the search query or how unstructured the data being searched)”.

Small survey among:
• Publications Office (PO) of the European institutions
  • Librarians and EuroVoc maintenance group
  • IT group
• Spanish Congress of Deputies
• Various researchers in Computational Linguistics

But any errors are mine!

Profession: librarians Little survey (2)

• All agreed that librarians will also be needed in the future.

• “Human intervention is not dead, but it is strongly challenged by the new technologies.”

• Even the best software is not useful if not combined with detailed subject knowledge.

• Search gets easier, but the amount of data also increases → not less work.

• Full-text search does not easily replace indexing in a multilingual context.

• The profession of the librarian is changing a lot.
• Collaboration with IT is unavoidable.

Profession: librarians Little survey (3)

• There are new tasks for librarians:

• Conception of data models (ontology, meta-data), e.g. to store and retrieve metadata and documents
  • Example: Meta-Data Repository (MDR) at http://publications.europa.eu/mdr/
• Produce linked data, e.g. to inter-connect different media: (multilingual) text, images, bibliographic descriptions, …
• This involves mapping thesaurus concepts:
  • EU thesaurus EuroVoc
  • European Environment Agency thesaurus GEMET
  • AgroVoc
  • Umwelt-Thesaurus UmThes
  (example concepts: degradation of the environment, fire)
• Automatic tools can be used for support (e.g. Thesaurus Alignment Environment TAE), but librarians are needed.

Profession: librarians Little survey (4)

• Expert knowledge of library professionals is also needed

• to train and tune software tools (e.g. thesaurus alignment, indexing)
• to validate the results
• to define ranking rules
• to specify an indexing methodology
• …

Profession: librarians Little survey (5)

• “No algorithm can substitute human thinking when we talk about understanding the world, which is necessary when annotating and searching for textual data.”

• “Experts will be needed when producing high-quality metadata, for example, in the context of a library.”

• “There will always be work for librarians and meta-data should be produced with the help of humans. However, terminology extraction tools can be of great help for annotators.”