Automatic Multilingual Indexing Using the EuroVoc Thesaurus
Ralf Steinberger
European Commission – Joint Research Centre (JRC)
http://langtech.jrc.ec.europa.eu/
VdB: Jenseits der Verbundkataloge. Die Zukunft der Recherche
Munich, Germany, 27.09.2012

Agenda
• EC-Joint Research Centre (JRC) – Who we are
• The JRC Eurovoc Indexer software JEX
• EuroVoc thesaurus
• Motivation for automatic indexing
• How automatic indexing works
• Performance evaluation
• Download and use JEX
• Future of the librarian profession
• Brainstorming

JRC - Who we are
• European Commission (scientific-technical arm of public administration)
• Non-commercial
• Multi-disciplinary / multilingual
• Main product: Europe Media Monitor (EMM)

Europe Media Monitor EMM – A few facts
• ~ 150,000 online news articles / day in ~ 50 languages
• ~ 3600 sources (world-wide, with focus on Europe)
• In-depth analysis in 20 languages (NewsExplorer)
• 24/7, updated every 10 minutes
• Freely accessible via http://emm.newsbrief.eu/overview.html
• Four main EMM applications:

What is JEX
JEX = JRC Eurovoc IndeXer:
• Automatic indexing software
• Developed at the JRC
• Using the controlled vocabulary from EuroVoc
• Readily trained for 22 languages, and re-trainable
• Freely downloadable from http://langtech.jrc.ec.europa.eu/

The EuroVoc Thesaurus
• > 6,000 subject domains
• Exists in one-to-one translation in 22 official EU languages plus (at least) in Basque, Catalan, Croatian, Russian and Serbian
• Used by many European parliaments for manual classification → We use the results to train automatic classifiers
• Indexing documents in one language allows search and retrieval in other languages!
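The last point, that indexing in one language enables retrieval in another, follows from the one-to-one translation of descriptors: a document is tagged with a language-independent concept, and each language only supplies a label for it. The sketch below illustrates the idea under loud assumptions: the descriptor IDs (`D1`, `D2`), the document names and the translated labels are all made up for illustration and are not taken from the EuroVoc files or from JEX.

```python
# Hypothetical descriptor IDs with illustrative labels per language.
# The translations are NOT checked against the real EuroVoc thesaurus.
LABELS = {
    "D1": {"en": "nuclear accident", "de": "Atomunfall", "es": "accidente nuclear"},
    "D2": {"en": "protection of minorities", "de": "Minderheitenschutz",
           "es": "protección de las minorías"},
}

# Hypothetical index: each document, whatever its language, maps to descriptor IDs.
INDEX = {
    "doc_en_1": {"D1"},
    "doc_es_7": {"D1", "D2"},
    "doc_de_3": {"D2"},
}


def search(label, lang):
    """Return all documents indexed with the descriptor whose label in `lang` matches."""
    wanted = {
        desc_id
        for desc_id, names in LABELS.items()
        if names.get(lang, "").lower() == label.lower()
    }
    return sorted(doc for doc, ids in INDEX.items() if ids & wanted)


# A German query finds the English and Spanish documents.
print(search("Atomunfall", "de"))
```

Because the match happens on the descriptor ID rather than on the words of the text, no translation of the documents themselves is needed.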
EuroVoc Indexing Challenges
• Classes are concepts rather than words, e.g.
  • Protection of minorities, construction and town planning, …
• Very unevenly distributed (most frequent class is used 4262 times)
• Various text types (heterogeneous training set)
• Multi-label categorisation (both for training and assignment)

Automatic indexing - Motivation
• Manual indexing is very valuable, but difficult, slow, often inconsistent between indexers and inconsistent over time.
• Automatic indexing works less well, but is very fast and consistent; even old documents can be re-indexed.
• Manual process cannot be replaced, but automatic process has its uses.
• Interactive indexing combines the advantages of both:
  • Speed and consistency of automatic process;
  • Quality and control of manual indexing;
  • Hopefully speeding up the overall process.
• Used by the Spanish Congress of Deputies since 2006.

Automatic indexing - Motivation (2)
http://emm.newsexplorer.eu/

Automatic indexing - How does it work?
• How would you do it if you had to develop such a system?
• What happens in your heads while indexing?
  • e.g. for Eurovoc descriptor NUCLEAR ACCIDENT
• Presumably: looking for the occurrence of specific words or Boolean combinations or weighted word lists, e.g.
  • (nuclear OR radioactive OR …) AND (accident OR leak OR …) → NUCLEAR ACCIDENT
• Writing such rules is a time-consuming task
• Rules have to be written separately for each language
• >6000 descriptors, 22+ languages
• Rules would need to be updated continuously.

How JEX works - Overview
• Method: Profile-based category-ranking
• E.g. Result for a document with the title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the common fisheries policy
• E.g. profile for the EuroVoc category FISHERY MANAGEMENT

How JEX works - How does it work? (2)
• JRC’s system learns the rules from manually indexed documents
  • Taking large collections of previously manually indexed documents
  • Using statistics to see which words frequently occur with a certain class (e.g. nuclear, leak) and which ones are found with all classes (e.g. the, of, for, paragraph, decision, …)
• For technical details, see, e.g.:

How JEX works - Learning the profiles
• Using a large collection of manually indexed documents (training corpus),
  • For each descriptor Di, take all documents Tj indexed with Di
  • Identify the statistically salient words in each of these texts
  • Join these lists of statistically salient words and take the most frequently occurring words as associates.

E.g. descriptor RADIOACTIVE MATERIALS (simplified!): T1 + T2 + T3 = associates

T1            T2            T3                 Associates (number of texts)
radioactive   plutonium     Illegal_traffic    radioactive (3)
ukraine       deuterium     chernobyl          plutonium (3)
resolution    assembly      radioactive        nuclear (2)
plutonium     nuclear       ukrainian          deuterium (2)
deuterium     schmidt       plutonium          Illegal_traffic (1)
parliament    radioactive   lithium            chernobyl (1)
nuclear       korea         dangerous          ...
blottnitz     iaea          mox
...           ...           ...

• Normalise the weight according to a number of different (statistical) criteria
→ Result of training: Weighted lists of associated words for each of the descriptors.
Sample profile 1 - RADIOACTIVE MATERIALS

Sample profile 2 - FISHERY MANAGEMENT
(figure labels: fishery-related, management-related)

How JEX works - Assigning descriptors
1. Produce word frequency list (excluding stop words) ...
2. Calculate similarity between word frequency list and descriptor profiles, using statistical similarity measures, e.g. cosine

\[
\mathrm{COSINE}(d,t) = \frac{\sum_{l \in d \cap t} \mathrm{TFIDF}_{l,d} \cdot \mathrm{TFIDF}_{l,t}}{\sqrt{\left(\sum_{l \in d} \mathrm{TFIDF}_{l,d}^{2}\right) \cdot \left(\sum_{l \in t} \mathrm{TFIDF}_{l,t}^{2}\right)}}
\]

How JEX works - Sample result
Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation No 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/0146(CNS)) (Consultation procedure)
• Ranked list, not a set
• Assignment weight, not a yes/no answer
• How many descriptors to assign or display? First 6? > 0.20?
  • Human average is 5.6 descriptors per document
  • Varying from 2 to 20 per document

JEX - Automatic evaluation
• Automatic, at rank 6, comparing to previous manual annotation
• Evaluation: P, R, F1 at rank 6.

JEX - Manual evaluation
• Manual, by indexing professionals (at least top ten descriptors/document)
  • English (by Swedish Riksdag) and
  • Spanish (by Spanish Congreso)

Manual Evaluation - Good result

Manual Evaluation - Bad result
• Little lexical evidence in text …

Manual evaluation - Results
• Large-scale manual evaluation for English and Spanish:
  • 75% and 82% agreement with top ten automatically assigned descriptors.
• Agreement of 2nd annotator with 1st for English and Spanish:
  • 74% and 84% agreement with first (also professionally trained) indexer
• What is the best result that automatic software can achieve?
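The two assignment steps under "Assigning descriptors" (build a word frequency list, then compare it with every descriptor profile by cosine similarity) can be sketched as below. This is an illustration under stated assumptions, not the JEX code: raw word counts stand in for the TF-IDF weights of the document, the stop-word list is a tiny placeholder, and the profile weights in `PROFILES` are invented. The `max_rank` and `threshold` parameters mirror the two cut-off options raised on the sample-result slide (a fixed rank or a minimum weight).

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "for", "a", "on", "to", "and", "in"}  # placeholder list


def term_weights(text):
    """Word frequency list without stop words (raw counts stand in for TF-IDF)."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return dict(Counter(tokens))


def cosine(doc, profile):
    """Cosine similarity between two sparse word-weight dictionaries."""
    shared = set(doc) & set(profile)
    dot = sum(doc[w] * profile[w] for w in shared)
    norm = math.sqrt(
        sum(v * v for v in doc.values()) * sum(v * v for v in profile.values())
    )
    return dot / norm if norm else 0.0


def rank_descriptors(text, profiles, max_rank=6, threshold=0.0):
    """Return up to `max_rank` (descriptor, score) pairs scoring above `threshold`."""
    doc = term_weights(text)
    scored = [(name, cosine(doc, prof)) for name, prof in profiles.items()]
    scored = [pair for pair in scored if pair[1] > threshold]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_rank]


# Hypothetical profiles; the words and weights are invented for illustration.
PROFILES = {
    "FISHERY MANAGEMENT": {"fisheries": 3.0, "fishing": 2.5, "control": 1.0, "policy": 0.5},
    "RADIOACTIVE MATERIALS": {"radioactive": 3.0, "plutonium": 3.0, "nuclear": 2.0},
}

ranking = rank_descriptors(
    "control system applicable to the common fisheries policy", PROFILES
)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

The output is a ranked list with weights rather than a yes/no decision per descriptor, so the caller still has to choose where to cut it, for example at rank 6 or at a minimum weight.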
JEX GUI - Index documents
• Download software from http://langtech.jrc.ec.europa.eu/Eurovoc.html .
• Graphical user interface (GUI), command line option (and an API).

JEX GUI - View one document
(screenshot labels)
• Descriptors assigned
• Profile words found in document
• Document indexed
• Profile of selected descriptor
• Manually correct results
• Save result as XML

JEX - Summary
• Software available for free, trained for 22 languages on legislative text.
• Will only perform well if applied to the same text type (can be retrained)
• Results
  • could be better, but are
  • good enough to use interactively by professionals.
• Various fully automatic uses, e.g.
  • cross-lingual linking of documents.
• Major advantage:
  • same categories across 22 languages: for search and retrieval and display
  • Web-like full-text search across languages is difficult, but is possible with JEX.

Profession: librarians - Little survey (1)
Opinion on the - rather provocative - questions:
• Search in unstructured data (e.g. full-text search) is getting better.
• Will we still (also) need to search in structured data?
• Do we still need to train librarians to search?
• Can’t we find everything even using a badly formulated search query?

We “would […], however, like a talk that gives, from a computer science perspective, an assessment of the areas in which improved search in unstructured data (e.g. full texts of scientific