D4.1 PRELIMINARY VERSION OF THE TEXT ANALYSIS COMPONENT, INCLUDING: NER, EVENT DETECTION AND SENTIMENT ANALYSIS

Grant Agreement nr: 611057
Project acronym: EUMSSI
Start date of project (dur.): December 1st 2013 (36 months)
Document due date: November 30th 2014 (+60 days) (12 months)
Actual date of delivery: December 2nd 2014
Leader: GFAI
Reply to: [email protected]
Document status: Submitted

Project co-funded by ICT-7th Framework Programme from the European Commission


Project ref. no.: 611057
Project acronym: EUMSSI
Project full title: Event Understanding through Multimodal Social Stream Interpretation
Document name: EUMSSI_D4.1_Preliminary version of the text analysis component.pdf
Security (distribution level): PU – Public

Contractual date of delivery: November 30th 2014 (+60 days)
Actual date of delivery: December 2nd 2014
Deliverable name: Preliminary Version of the Text Analysis Component, including: NER, event detection and sentiment analysis
Type: P – Prototype
Status: Submitted
Version number: v1
Number of pages: 60
WP/Task responsible: GFAI / GFAI & UPF
Author(s): Susanne Preuss (GFAI), Maite Melero (UPF)
Other contributors: Mahmoud Gindiyeh (GFAI), Eelco Herder (LUH), Giang Tran Binh (LUH), Jens Grivolla (UPF)
EC Project Officer: Mrs. Aleksandra WESOLOWSKA, [email protected]
Abstract: The deliverable reports on the resources and tools that have been gathered and installed for the preliminary version of the text analysis component.
Keywords: Text analysis component, Named Entity Recognition, Named Entity Linking, Keyphrase Extraction, Relation Extraction, Topic modelling, Sentiment Analysis
Circulated to partners: Yes
Peer review completed: Yes
Peer-reviewed by: Eelco Herder (L3S)
Coordinator approval: Yes


Table of Contents

1. BACKGROUND
2. INTRODUCTION
   2.1. Requirements for tools used in EUMSSI
   2.2. Summary of the first year results
3. TEXT CORPUS DESCRIPTION
   3.1. Text cleaning
   3.2. Next steps
4. ARCHITECTURE OF THE TEXT ANALYSIS COMPONENT
   4.1. UIMA Platform
   4.2. DKPro Linguistic Repository and Type System
   4.3. Linguistic processing chains or pipelines
      4.3.1. UIMA-based pipeline for Text Analysis
      4.3.2. Text Analysis pipelines in EUMSSI
5. NAMED ENTITY RECOGNITION AND DISAMBIGUATION
   5.1. Named Entity Recognition (NER)
      5.1.1. Named entity types
      5.1.2. Available models and annotated training corpora
      5.1.3. Dates and Times
      5.1.4. Impact
      5.1.5. Next steps
   5.2. Named Entity Disambiguation and Linking
      5.2.1. Comparison with other tools
      5.2.2. Input parameters
      5.2.3. Output features
      5.2.4. Program availability
      5.2.5. DKPro integration
      5.2.6. DBpedia Spotlight and Stanford NER
      5.2.7. Impact
      5.2.8. Next steps
6. EXTRACTION OF EVENTS, RELATIONS AND TOPICS
   6.1. Introduction
   6.2. Keyphrase extraction
      6.2.1. Model training


      6.2.2. Experiment with large number of extracted keyphrases
      6.2.3. Sample visualization of keyphrases
      6.2.1. Granularity of scores
      6.2.2. Experiments with different training corpus size
      6.2.3. Author-annotated keyphrases versus tool-based keyphrase annotations
      6.2.4. Topic diversity
      6.2.5. GFAI tools for linguistic analysis and keyword extraction
   6.3. Relation extraction
   6.4. Event extraction
   6.5. Topic detection
      6.5.1. Impact
      6.5.2. Next steps
7. SENTIMENT ANALYSIS COMPONENT
   7.1. Introduction
   7.2. Approaches to OM: Rule-based and Corpus-based
      7.2.1. Rule-based approaches
      7.2.2. Corpus-based approach
   7.3. Scope of polarity and type of text in EUMSSI
      7.3.1. News text
      7.3.2. User-generated content from Social Media
   7.4. Preliminary work: Benchmarking of Polar Lexicons
      7.4.1. OpinionFinder
      7.4.2. SenticNet
      7.4.3. LIWC
      7.4.4. Computation of polarity
      7.4.5. Evaluation framework
      7.4.6. Analysis of results
   7.5. Next steps
8. ANALYSIS OF OTHER TEXT TYPES: OCR, ASR
   8.1. Optical Character Recognition (OCR)
      8.1.1. Impact
      8.1.2. Next steps
   8.2. Automatic Speech Recognition (ASR)
      8.2.1. Impact
      8.2.2. Next steps


9. CONCLUSIONS AND NEXT STEPS
10. REFERENCES
11. GLOSSARY

Tables

Table 1: Data size per source and language in MB and sample number
Table 2: Number of articles crawled from different newspapers
Table 3: Paragraph tags used in DW corpus and their frequency
Table 4: Subheader tags used in DW corpus and their frequencies
Table 5: Score distribution for sample text
Table 6: Precision, recall, F-score and score distribution relative to training corpus size using DW corpus
Table 7: Precision, recall, F-score and score distribution relative to training corpus size using Guardian corpus
Table 8: Precision, recall, F-score and score distribution relative to training corpus size using Guardian as training corpus and DW as test corpus
Table 9: Precision, recall and F-score for author-annotated and GFAI-tool annotated training data
Table 10: Terms extracted per language
Table 11: Relevant and irrelevant terms extracted per language
Table 12: Detection of opinionated tweets
Table 13: Precision and Recall for polarity classification (positive/negative) for opinionated tweets


Figures

Figure 1: Distribution of news articles per month
Figure 2: Distribution of news articles by sources
Figure 3: DW document with markup
Figure 4: DW document without markup
Figure 5: UIMA general pipeline
Figure 6: Standard text analysis pipeline
Figure 7: Stanford NER output
Figure 8: DBpediaSpotlight output (annotate operation)
Figure 9: DBpediaSpotlight output (candidates operation)
Figure 10: DBpediaSpotlight annotation
Figure 11: DBpediaSpotlight annotation for “Michael”
Figure 12: StanfordNER output for “Michael Beckereit”
Figure 13: Highest ranking keyphrases
Figure 14: Highest ranking multi-word keyphrases
Figure 15: Highest ranking keyphrases containing “fracking”
Figure 16: Document “Fracking evokes ‘angst’ in Germany”
Figure 17: Tag cloud of single-word keyphrases
Figure 18: Tag cloud of multi-word keyphrases
Figure 19: Tag cloud with keyphrases containing the term fracking
Figure 20: Filtering keyphrases
Figure 21: Heuristic polarity rule
Figure 22: OpinionFinder lexical entry
Figure 23: Mapping of OpinionFinder polarity information onto numeric values
Figure 24: SenticNet lexical entry
Figure 25: LIWC lexical entry
Figure 26: Mapping of SenticNet polarity information onto numeric values
Figure 27: Confusion matrix
Figure 28: Precision / recall / f-measure for detection of opinionated tweets
Figure 29: Candidate hypotheses of OCR
Figure 30: DBpediaSpotlight links for lower case input
Figure 31: DBpediaSpotlight output for “york”
Figure 32: DBpediaSpotlight links for trueCased output
Figure 33: DBpediaSpotlight output for “New York”


1. BACKGROUND

Deliverable D4.1 is described as follows in the DoW:

A preliminary version of the text analysis component for integration in the multimodal platform, including preliminary versions of (1) a NER system that is capable of detecting names of all sorts in all text sorts and all languages the project deals with, (2) a component detecting / extracting events, relations and topics, for four languages and (3) a sentiment analysis module, for four languages.

D4.1 is a running prototype of the text analysis component to be integrated in the multimodal EUMSSI platform. This prototype is the result of the work performed within tasks 4.1, 4.2 and 4.3, during the first 12 months.

According to the DoW, Task 4.1 aims at implementing and testing a NER system that is capable of detecting names in free text in English, French, Spanish and German. The goal of task 4.2 is the implementation of a processing component capable of detecting and extracting events, through the detection and extraction of topics, keyphrases and relations. Task 4.3 aims at implementing and testing a sentiment analysis module.

Within the EUMSSI project, the output of the text analysis component (WP4) feeds into the cross-modal semantic representation framework (WP5; LUH, VSN, UPF) which serves as the basis for the contextualization and recommendation tools (WP6; UPF) whose specifications are guided by the user specifications developed in WP2.1 (DW, LUH, VSN). The text analysis component takes as input news-related data such as news articles provided by Deutsche Welle (DW), or news articles crawled from the web by LUH (WP5). Other text types such as the output of OCR (Optical Character Recognition) and ASR (Automatic Speech Recognition) are provided by IDIAP and LIUM respectively (WP3). Processing pipelines and annotation data formats are determined in consultation with all partners (WP2.2 System architecture, WP2.3 Data infrastructure and representation definition).

D4.1 is complemented by deliverable D4.2 which describes a preliminary version of the content extraction and analysis module for social media content.

D4.1 is followed by two further development stages: D4.3 (month 24) is the first functional version of the text analysis component, and D4.5 (month 32) is the final version of the text analysis component.


2. INTRODUCTION

Within the EUMSSI architecture, the main task of the text analysis component is to extract structured information from unstructured text sources; this information is then semantically enriched and aggregated in the multimodal interpretation platform developed in WP5. The enriched semantic representation will then be used to build the two demonstrators in WP6. As specified in the DoW, the main goal of the text analysis component in EUMSSI is to provide a system able to extract named entities (persons, locations, organisations and temporal expressions), relations (or events), topics and opinions. The DoW also specifies that “the implementation of this part consists mainly in taking available tools - owned by project partners or available from public domain - and enhance them with new features and domain adaptation.”

In the first year, a first version of a working text analysis module has been built by putting into place the core text analysis components, namely named entity recognition (NER), named entity linking (NEL) and key phrase extraction, as well as a preliminary sentiment analysis component. The focus was on creating information structures that fulfill two purposes:

1. They can be used to improve the cross-media analysis results, beyond the potential of the individual modules.
2. They serve as input for the contextualization and recommendation functionalities of the EUMSSI platform and demonstrators.

In this vein, initial experiments have been conducted with different text types such as (i) regular edited text, (ii) the output of optical character recognition (OCR), (iii) the output of automatic speech recognition (ASR) and (iv) user-generated text from social media. For sentiment analysis, lexical resources have been gathered for inclusion in the system. Fine-tuning of models, enhancement of resources and extension of language coverage are the goals of later project phases.

2.1. Requirements for tools used in EUMSSI

Given the multilingual and multimodal nature of EUMSSI and the architectural choices (use of UIMA), the tools used for text analysis have to fulfill several requirements:

1. Language availability: tools for all four EUMSSI languages (DE, EN, ES, FR) are needed. Thus, there are two options:
   a) One tool is available for all four languages or it can be trained for all four languages
   b) Different tools are used for different languages
   Generally the first option is preferred.
2. Rich output features: Tools that provide rich output features in the form of n-best candidate lists and scores are preferred, since the output features can be used by the inference mechanisms for cross-media enhancement of the text analysis results and serve as input for the contextualisation and recommendation system.
3. Rich parameter settings: Tools with rich parameter settings that can be fine-tuned for different purposes are preferred.


4. Program availability: a downloadable version of the tool should exist, because web services do not guarantee continuous availability in the future.
5. Integration into UIMA: The text analysis is directly integrated into UIMA, which is written in Java. DKPro is a platform that provides many open-source UIMA-conformant tools for NLP (see 4.2). Hence, tools that already have a wrapper for DKPro or are written in Java are preferred.
6. Adaptation to different text types: ASR provides lower-cased output without any punctuation marks, OCR provides short, mostly sub-sentential text strings, and social media streams provide highly colloquial and idiosyncratic text. Tools that can be applied to or trained for various text types are preferred.
7. Output quality: reasonable to excellent output quality is expected.

Given these requirements, the most suitable tools are selected.

2.2. Summary of the first year results

In this first year, a survey of available tools has been performed and the following tools have been selected so far, according to the requirements expressed above: Stanford NER for Named Entity Recognition (NER), DBpediaSpotlight for named entity disambiguation (NED) and linking (NEL), and KEA for key phrase extraction. Stanford NER, DBpediaSpotlight and KEA key phrase extraction have been set up for all four EUMSSI languages, and wrappers for integration into DKPro have been written where not yet available. DBpedia Spotlight and OCR provide as output n-best candidate lists with scores, which can be used for cross-media and cross-tool enhancement. KEA also provides scores that can be used by the contextualisation and recommendation components. For sentiment analysis, an initial detector for polarity has been set up, and a benchmarking of existing lexical resources has been carried out.

As described in section 4.3.2, the text analysis processing chain currently contains segmentation/tokenization, lemmatization, part-of-speech tagging, NER, NED, NEL, key phrase extraction and detection of polarity. For ASR, a special processing chain is used that trueCases the ASR output, which is all lower case. As a first approximation, StanfordNER is used as trueCaser.

In the following sections, we address: the description of the text corpus, the architecture of the text analysis component, named entity recognition, named entity disambiguation and linking, extraction of events, relations and topics via keyphrase extraction, sentiment analysis, analysis of other text types such as the output of OCR and ASR, and, finally, conclusions and next steps.


3. TEXT CORPUS DESCRIPTION

In addition to the multimedia corpus provided by Deutsche Welle (DW), corpora from major newspapers in the four languages covered by the project (The Guardian, El País, Le Monde and Die Zeit) and social media outlets (YouTube, Twitter) have been crawled, with a focus on the topic of fracking and the energy transition in general. These corpora are used both for model training and as a potential search space for the EUMSSI contextualization and recommendation system. Data crawling will be directly plugged into the platform so that collections are always up-to-date. From the DW data, only articles with CLOBTEXT have been used. The Automatic Speech Recognition (ASR) data has been provided by LIUM, the Optical Character Recognition (OCR) data by IDIAP.

Source        EN Ori          EN Clean        DE Ori   DE Clean   FR Ori   FR Clean   ES Ori   ES Clean
DW            574.0           245.0           812.0    415.0      52.0     31.0       255.0    144.0
ASR output    3.6
OCR output    ~2500 samples   ~2100 samples

Table 1: Data size per source and language in MB and sample number

For The Guardian, Die Zeit, Le Monde and El País, a total of 179.5 MB has been downloaded so far. The data contains the following numbers of news articles:

Newspaper            Number of articles
The Guardian (EN)    844
Die Zeit (DE)        536
Le Monde (FR)        112
El Pais (ES)         449

Table 2: Number of articles crawled from different newspapers

The news articles are collected from the corresponding newspapers' newsfeeds using a crawler that has been running since June 2014. The newsfeeds largely contain very recent articles, with a small number of older articles in less active feeds. The distribution of news articles per month is as follows:


Figure 1: Distribution of news articles per month

The distribution of news articles by sources is as follows:

Figure 2: Distribution of news articles by sources

For more figures on the DW corpus see also deliverable D2.3.

3.1. Text cleaning

The text analysis tools expect clean text as input. Since the different news sources use different markup, text cleaning routines have to be written for each news source, with additional attention to possible variations for different languages. In a first step, basic text cleaning routines have been set up that simply remove all markup, thus yielding an unstructured text body. (The title information is commonly stored in a separate text field.)


In a second step, more advanced text cleaning routines are developed that preserve paragraph structure and recognise subtitles/subheaders within the text body by inserting linebreaks. The internal structure of the text body and the subtitle information can be used when analysing longer texts and when visualizing results. The recognition of subtitles also improves key phrase assignment of the linguistically based GFAI tool (see section 6.2.5). For removal of all markup, the Jericho parser1 is used. For recognition of paragraph structure and subtitles, Perl scripts are developed. Currently, a Perl script has been written that recognizes paragraph structure and subtitles/subheaders within the text body of the DW data for all four EUMSSI languages. The DW text body is marked by the meta tag CLOBTEXT. In the DW markup, the main paragraph tags and their frequencies are:

[Tag strings not preserved in this text version; frequencies: 376628, 382075, 1433894, 3579]

Table 3: Paragraph tags used in DW corpus and their frequency

The main tags for subheaders/subtitles and their frequencies are:

[Tag strings not preserved in this text version; frequencies: 18, 27418, 222, 19144, 54912, 52, 34274, 59420]

Table 4: Subheader tags used in DW corpus and their frequencies

1 http://jericho.htmlparser.net/docs/index.html


Example of a DW document fragment with markup:

Figure 3: DW document with markup

The same document fragment after clean-up:

"Restoring the Church Pope John Paul II's election to the papacy was largely the result of disunity among the Italian cardinals who were unable to agree on a candidate from among their own ranks and thus turned to the Pole who had already made a name for himself with his radiant charisma, intellect, and not last, his robust health. Peace and freedom In other areas, John Paul II was ahead of his time, and even began to use his position in the Church to speak out on world affairs. His relentless fight against human rights breaches, oppression, bondage and war gained him not only admirers within the Church, but worldwide. "

Figure 4: DW document without markup

Furthermore, generic titles such as “Documentaries and Reports” have been identified. They are not passed on to the text analysis components; this improves, for example, key phrase training and extraction. For the DW corpus, titles that appear more than 100 times have been marked as generic titles. There are 67 generic titles in the English DW corpus. For the Guardian texts, the Jericho parser has been used to remove all markup from the text body, which is contained in the meta tag article-body-blocks. The contents of the meta tags title and description are used as title information. Currently, no subtitle or paragraph structure is recognized. The author-annotated keywords are available in the meta tag keywords.
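For illustration, a minimal sketch of the markup-removal step using the Jericho parser is given below; the example input and the whitespace normalisation are simplifications and do not reflect the exact EUMSSI cleaning scripts.

import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

public class MarkupRemover {

    // Strips all markup from an HTML/XML fragment (e.g. the contents of the DW
    // CLOBTEXT field or the Guardian article-body-blocks field) and returns an
    // unstructured plain-text body.
    public static String toPlainText(String markedUpText) {
        Source source = new Source(markedUpText);
        TextExtractor extractor = source.getTextExtractor();
        // Collapse the whitespace left behind by removed tags.
        return extractor.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Fracking evokes <b>'angst'</b> in Germany</p>"));
    }
}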


3.2. Next steps

Next steps are:
1. clean text versions for the crawled newspaper corpora of El País, Le Monde and Die Zeit;
2. improved clean text versions with paragraph structure and subtitle/subheader information for The Guardian, El País, Le Monde and Die Zeit.


4. ARCHITECTURE OF THE TEXT ANALYSIS COMPONENT

4.1. UIMA Platform

As explained in D5.32, the EUMSSI platform relies heavily on Apache UIMA (Unstructured Information Management Architecture)3 and its native data structure, the CAS (Common Analysis Structure). Thus, on the one hand the UIMA architecture is used to regulate the workflow between the different multimodal components, and, on the other, the CAS is used to store the aligned layers of analysis results. The Text Analysis component is completely implemented within the UIMA framework; in the case of the audio and video components, only their input/output workflow is managed by UIMA. The UIMA CAS representation is a good fit for the needs of the EUMSSI project as it has a number of interesting characteristics:

• Annotations are stored “stand-off”, meaning that the original content is not modified in any way by adding annotations. Rather, the annotations are entirely separate and reference the original content by offsets.
• Annotations can be defined freely by defining a “type system” that specifies the types of annotations (such as Person, Keyphrase, Face, etc.) and the corresponding attributes (e.g. dbpediaUrl, canonicalRepresentation, ...).
• Source content can be included in the CAS (particularly for text content) or referenced as external content via URIs (e.g. for multimedia content).

Linguistic analysis components, or annotators, basically take two forms:

• Native “Analysis Engines”, directly written for UIMA, or
• External existing analysis components integrated in UIMA by using “wrappers” that translate their inputs/outputs.

4.2. DKPro Linguistic Repository and Type System

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. Many NLP tools are already freely available in the NLP research community. DKPro Core provides UIMA components wrapping these tools (and some original tools) so they can be used interchangeably in UIMA processing pipelines. DKPro Core builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines, for wrapping existing tools and for creating original UIMA components. Many of the tools used in the EUMSSI Text Analysis module come from the DKPro repository. The type system used for text analysis in EUMSSI is also based on the DKPro type system, in order to ensure compatibility with third-party components.

2 EUMSSI D5.3 Preliminary version of data integrated platform.
3 http://uima.apache.org


Types not covered by DKPro are defined in a separate type system for the EUMSSI project, aligned with the metadata schema described in D2.34.

4.3. Linguistic processing chains or pipelines

4.3.1. UIMA-based pipeline for Text Analysis

Figure 5 shows the functional diagram of a UIMA pipeline, from input (using a Collection Reader), going through several processing components (or Analysis Engines), to the CAS Consumers which take the annotated data (stored in the CAS) and output it in the desired format (in the case of EUMSSI, databases and full-text indices for search).

Figure 5: UIMA general pipeline

The UIMA-based linguistic pipeline comprises basic linguistic processing, such as sentence splitting, tokenization, part-of-speech tagging, and lemmatization, as well as semantic and sentiment annotation. Available open-source analysis modules, such as those provided by the OpenNLP project, have been used in conjunction with in-house developed language resources, and adapted to the task. The UIMA representation in CAS objects allows preserving the documents' integrity, since annotations are added as stand-off metadata. Thus, the original information is not modified while the metadata is enriched with each processing iteration. In a nutshell, the idea is to progressively enrich each annotation using different tools. These tools employ a variety of methodologies, such as controlled annotation based on patterns and lists of names, statistically-based annotators built on annotated corpora, etc. The main goal of the linguistic processing is to explore the unstructured information found in text by detecting, extracting and classifying elements and their relevant correlations.
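As an illustration, the following is a minimal uimaFIT sketch of such a pipeline. The DKPro Core component names used here (OpenNlpSegmenter, OpenNlpPosTagger, StanfordNamedEntityRecognizer) are assumed to be available on the classpath; the actual EUMSSI pipelines add further analysis engines and CAS consumers.

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.JCasFactory.createJCas;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer;

public class MinimalTextAnalysisPipeline {

    public static void main(String[] args) throws Exception {
        // The CAS holds the original text plus all stand-off annotation layers.
        JCas jcas = createJCas();
        jcas.setDocumentLanguage("en");
        jcas.setDocumentText("Fracking evokes 'angst' in Germany.");

        // Each analysis engine adds one annotation layer on top of the previous ones.
        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),                // sentences and tokens
                createEngineDescription(OpenNlpPosTagger.class),                // part-of-speech tags
                createEngineDescription(StanfordNamedEntityRecognizer.class));  // named entity annotations
    }
}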

4 EUMSSI D2.3 Data Infrastructure and representation definition


4.3.2. Text Analysis pipelines in EUMSSI

An annotation pipeline is a chain or sequence of analysis engines, each of which outputs a specific annotation after analysing the input object. The engines proceed in a sequential way, and some of them build on the results of the preceding engines. Specific analysis pipelines are foreseen for each particular language (EN, ES, DE and FR) and for each type of source text (edited (e.g. news), user-generated (e.g. tweets), OCR and ASR), that is, 16 distinct pipelines. The motivation for having separate pipelines for different text types is that each type of text requires specific components: for example, as explained in section 8.2, ASR input may need to include a trueCaser; certain tools may require specifically trained models, e.g. for user-generated content; or, finally, certain engines cannot be applied to all text types, e.g. a POS tagger does not work on context-free OCR input. Figure 6 models one such pipeline, with the analysis engines that have been developed so far in EUMSSI.

Figure 6: Standard text analysis pipeline

The Analysis Engines that are currently integrated in the Text Analysis pipeline are listed in the sections below.

4.3.2.1. Sentence Segmenter

The Sentence Detector module uses algorithms such as Maximum Entropy in order to decide whether a punctuation mark denotes the end of a sentence or not, thus segmenting the text into sentences.


4.3.2.2. Tokenizer

Tokens are the smallest units of linguistic information and, generally speaking, can be either words or punctuation. Segmentation of sentences into tokens provides units suitable for lexical processing, such as lemmatizing or part-of-speech tagging. Consequently, proper operation of subsequent modules crucially depends upon the correct identification of words.

4.3.2.3. Part of Speech Tagger

The part-of-speech annotation tool identifies the basic syntactic category of a word or lexical unit (i.e. noun, verb, adjective, etc.) and the morpho-syntactic features encoded in the inflected wordform (i.e. number, gender, tense, etc.). Most POS taggers are trained on syntactically annotated corpora.

4.3.2.4. Lemmatizer

Being able to assign a lemma to a word allows for clustering of a diversity of lexical forms that share a core meaning but whose morphological shape varies as a function of tense, mood, gender and number. Thus, a Spanish verbal form such as dijo (i.e. he said) can be straightforwardly clustered with other forms of the same verb decir (e.g. dirá, dicho, decía, dice, ...), thus reducing the dimension of the vocabulary and the search space. Most lemmatizers use dictionaries of lemmas and wordforms. In a first step, all possible lemmas a given wordform may have are assigned. Then, a disambiguation module removes lemmas that do not fit the hypothesized POS tag.
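The following toy sketch illustrates this two-step procedure; the miniature full-form dictionary and the class name are purely hypothetical and stand in for the actual lexical resources used in the pipeline.

import java.util.Map;

public class DictionaryLemmatizer {

    // Hypothetical full-form dictionary: wordform -> (POS -> lemma).
    private static final Map<String, Map<String, String>> DICTIONARY = Map.of(
            "dijo",  Map.of("VERB", "decir"),
            "dicho", Map.of("VERB", "decir", "NOUN", "dicho"));

    // Step 1: look up all candidate lemmas for the wordform.
    // Step 2: discard lemmas that do not fit the hypothesized POS tag.
    public static String lemmatize(String wordform, String posTag) {
        Map<String, String> candidates = DICTIONARY.getOrDefault(wordform.toLowerCase(), Map.of());
        return candidates.getOrDefault(posTag, wordform);
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("dijo", "VERB"));   // -> decir
        System.out.println(lemmatize("dicho", "NOUN"));  // -> dicho
    }
}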

4.3.2.5. Polarity Word Annotator

This module, based on the UIMA Concept Mapper, has the function of annotating polar terms that are obtained from word lists and dictionaries, as described in section 7.4.

4.3.2.6. Named Entity Recognizer and Classifier

NER operates on tokenized text, annotating names of persons, organisations and locations, as described in section 5.1.

4.3.2.7. DBpediaSpotlight

As explained in section 5.2, DBpediaSpotlight annotates entities and concepts with links to DBpedia pages and with resource types. The types are not restricted to person, location and organisation but encompass all of the 272 classes of the DBpedia Ontology.

4.3.2.8. Keyphrase Extraction

Section 6.2 describes the keyphrase extraction tool that has been integrated into the pipeline. It detects relevant keyphrases within the text and provides their offsets as stand-off metadata.


5. NAMED ENTITY RECOGNITION AND DISAMBIGUATION

Named Entity Recognition (NER) tools determine mentions of names (such as “George Washington” or “Washington”) and the type of name (such as person, location, organisation). Named entity disambiguation, also called identity resolution or entity linking, disambiguates named entity mentions by relating them to a uniform resource identifier (URI) such as a Wikipedia or DBpedia entry. Thus, if mentions of “George Washington” and “Washington” are both disambiguated as an entity of type person, they are linked to the same URI. Named entity disambiguation and named entity identification are needed in EUMSSI in order to relate information about entities across different text passages and across different media sources.

5.1. Named Entity Recognition (NER)

In EUMSSI, the Stanford NER tool5 [Finkel et al 2005] is used. Illinois NER6 has also been experimented with, but since Stanford NER provides very good results and is already integrated in DKPro, it has been chosen over other tools. Stanford NER detects named entities and categorises them into types or classes, e.g. “New York” is a named entity of type location. Stanford NER uses a combination of conditional random field (CRF) sequence taggers trained on various corpora and a few additional features such as gazetteers to recognize named entities. For more details see [Manning et al 2014] and [Finkel et al 2005].

The advantages of Stanford NER are:
1. It is already integrated in DKPro.
2. Models for the four EUMSSI languages are either available or annotated training data is available.
3. Appropriate choice of gazetteers improves the system.

Disadvantages of Stanford NER:
1. The models are case-sensitive. Therefore the regular models do not provide satisfactory results for the lower case output of automatic speech recognition (ASR).
2. For the analysis of ASR, separate lower-case models or other solutions need to be used.

5.1.1. Named entity types

The goal of NER within EUMSSI is to detect names of persons, organisations and locations, but also dates and times. Most available models and annotated training data only provide the first three classes, namely persons, organisations and locations. Models or annotated training data for dates and times are not available for Stanford NER for FR, ES or DE.

5 http://nlp.stanford.edu/software/CRF-NER.shtml
6 http://cogcomp.cs.illinois.edu/page/software_view/4


Therefore we use Stanford NER only for detecting names of persons, organisations and locations, and we plan to use a different tool for the detection of dates and times. The chosen tool will probably be Heideltime7 [Strötgen et al 2013].

5.1.2. Available models and annotated training corpora

More specifically, the following models and annotated training corpora are available for Stanford NER for the different EUMSSI languages:

For English, Stanford provides a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets. Currently the 3 class model is used, which recognizes person, location and organisation. The 7 class model also recognizes names of dates and times.

For German, Stanford recommends a model by Sebastian Pado8 [Faruqui & Pado 2010], which is currently used. It detects person, location and organisation.

For French, WikiNER data provided by schwa lab9 [Nothman et al 2012] has been used to train a three class model which recognizes person, location and organisation.

For Spanish, the Ancora corpus10, tagged by UPF, has been used to train a 3 class model covering person, location and organisation.

The Stanford NER system has been enhanced for all four languages by using DBpedia names as gazetteers and by parameter tuning. Example output for “Fracking evokes ‘angst’ in Germany”:11

Fracking O
evokes O
' O
angst O
' O
in O
Germany LOC

Figure 7: Stanford NER output

StanfordNER annotates the types PER, LOC and ORG. O stands for no NE.
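For illustration, a minimal sketch of calling Stanford NER directly via its CRFClassifier API follows; the model path is an assumption (the English 3 class model), the printed label set depends on the model used (e.g. PERSON / LOCATION / ORGANIZATION for the English models), and in EUMSSI the tool is in fact invoked through its DKPro wrapper rather than directly.

import java.util.List;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordNerSketch {

    public static void main(String[] args) {
        // Serialized CRF model; the English 3 class model path is an assumption.
        String modelPath = "classifiers/english.all.3class.distsim.crf.ser.gz";
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(modelPath);

        // classify() returns one list of labelled tokens per sentence; the NE tag
        // is stored in each token's answer annotation ("O" marks non-entity tokens).
        for (List<CoreLabel> sentence : classifier.classify("Fracking evokes 'angst' in Germany")) {
            for (CoreLabel token : sentence) {
                System.out.println(token.word() + "\t" + token.get(CoreAnnotations.AnswerAnnotation.class));
            }
        }
    }
}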

5.1.3. Dates and Times

In the EUMSSI context, dates and times are important for anchoring events on a time line. The publication date of the text narrows down the potential temporal anchor and often coincides with it, especially in the news domain; however, this cannot be taken for granted. Therefore, additional NER of date and time expressions is necessary. It is planned to use Heideltime to recognise dates and times.

7 https://code.google.com/p/heideltime/
8 http://www.nlpado.de/~sebastian/software/ner_german.shtml
9 http://schwa.org/projects/resources/wiki/Wikiner
10 http://clic.ub.edu/corpus/en
11 DKpro-internally, the entity model that has been developed in Deliverable 2.3 is used.


Heideltime provides a version that can be integrated into UIMA and is available for all four EUMSSI languages.
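A minimal sketch of the intended usage follows, assuming the HeidelTime standalone API; class and enum names may differ slightly between versions, and the example text and reference date are illustrative.

import java.util.Date;

import de.unihd.dbs.heideltime.standalone.DocumentType;
import de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone;
import de.unihd.dbs.heideltime.standalone.OutputType;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class DateTimeTaggingSketch {

    public static void main(String[] args) throws Exception {
        // For news documents the publication date serves as reference time for
        // resolving relative expressions such as "last Friday".
        HeidelTimeStandalone heidelTime =
                new HeidelTimeStandalone(Language.ENGLISH, DocumentType.NEWS, OutputType.TIMEML);

        String text = "Thousands took to the streets last Friday against fracking plans.";
        Date publicationDate = new Date();  // normally taken from the article metadata

        // Returns the text with TIMEX3 annotations in TimeML format.
        System.out.println(heidelTime.process(text, publicationDate));
    }
}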

5.1.4. Impact

1. A first version of Named Entity Recognition is running for all four EUMSSI languages.
2. A first version of Named Entity Recognition in lower-cased ASR output is running for EN.

5.1.5. Next steps

1. Training of lower case models for FR, ES, and DE.
2. Installing an additional tool for recognition of dates and times (Heideltime).

5.2. Named Entity Disambiguation and Linking

For named entity disambiguation and linking, DBpediaSpotlight12 [Daiber et al 2013, Mendes et al. 2011] is used. DBpediaSpotlight provides links to DBpedia pages and annotations of resource types. The types are not restricted to person, location and organisation but encompass all of the 272 classes of the DBpedia Ontology. Thus a name such as “Washington” is disambiguated not only with respect to location and person, but more fine-grained types are provided that distinguish whether the state or the city of Washington is referred to. The overlap with Stanford NER is discussed in section 5.2.6 below. DBpediaSpotlight uses a disambiguation algorithm that is based on cosine similarities and a modification of the TF-IDF weights, which are used to compute the relevance of a given word for disambiguating the appropriate DBpedia concept. For a more complete description see [Mendes et al 2011] and [Daiber et al 2013].

The main advantages of DBpediaSpotlight are:

1. DBpediaSpotlight is available for all four EUMSSI languages.
2. The program code is downloadable.
3. DBpediaSpotlight is written in Java and a provisional wrapper for DKPro exists.
4. DBpediaSpotlight can be configured with respect to several parameters.
5. DBpediaSpotlight provides as output additional scores and candidate lists.
6. DBpediaSpotlight delivers competitive results.
7. DBpediaSpotlight also processes lower case text from ASR.

A major disadvantage is:

1. DBpediaSpotlight results for lower case text are degraded, especially for multi-word names.

12 http://dbpedia-spotlight.github.io/demo/


5.2.1. Comparison with other tools

There are several studies that provide comparisons of several tools with respect to NER and NEL functionalities. In [Mendes et al 2011], a comparative evaluation of NER tools is conducted with an optimised confidence parameter for DBpediaSpotlight. The highest F-score is reached with a confidence parameter of 0.7, which achieves an F-score of 56.0, whereas the default configuration yields an F-score of 45.2. In their study, only the Wiki Machine13 achieves a higher F-score, whereas Zemanta14, Alchemy15, Open Calais16 and Ontos17 achieve lower F-scores. In the comparative evaluation of [Gangemi 2013], DBpediaSpotlight as NER tool achieves an F-score of 0.33 and as NE resolution (linking) tool an F-score of 0.40 for a short English text containing 1491 characters and 58 named entities (with the default confidence setting). Other tools such as AIDA18 and Wikimeta19 achieve better F-scores; however, they are not available for all EUMSSI languages and (partly) the source code is not downloadable. [Daiber et al 2013] also report competitive results for DBpediaSpotlight. They report the following F-scores: German: 46.32, French: 45.02, Spanish: 44.10. (For English, no score is reported.)

5.2.2. Input parameters

DBpediaSpotlight can be configured with respect to several parameters:

1. confidence: The confidence configuration applies two checks:
   a. The similarity score of the first ranked entity must be bigger than a threshold.20
   b. The gap between the similarity score of the first and second ranked entity must be bigger than a relative threshold.
   The parameter range is 0.0 to 1.0, where 0.0 is liberal and 1.0 is restrictive. The default is 0.1.
2. support: number of in-links
3. type: allowed or forbidden URIs or types, works in combination with the policy setting
4. policy: whitelist (allowed) (default) or blacklist (forbidden)
5. coreferenceResolution: a heuristic that seeks coreference in the whole text and infers the surface form. When it is true, no other filter is applied (default: true).

13 http://www.thewikimachine.fbk.eu
14 http://www.zemanta.com/
15 http://www.alchemyapi.com/company/
16 http://www.opencalais.com/
17 http://www.ontos.com/products/
18 http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/
19 http://www.wikimeta.com/wapi/semtag.pl
20 http://wiki.dbpedia.org/spotlight/technicaldocumentation?show_files=0 For a description of similarity score see output features in section 5.2.3.


In EUMSSI the confidence parameter is set to 0.0 and the support parameter is also set to 0. Coreference resolution is not used because it blocks the usage of other filters. Whitelists and blacklists are currently not used because the goal is to provide as much information as possible.21
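For illustration, a minimal sketch of querying the Spotlight annotate endpoint with these settings follows; the public demo endpoint URL used here is an assumption and would be replaced by the locally installed service in EUMSSI.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class SpotlightClientSketch {

    public static String annotate(String text) throws Exception {
        // confidence=0.0 and support=0 keep the output as inclusive as possible;
        // filtering is left to later inference steps.
        String query = "text=" + URLEncoder.encode(text, "UTF-8") + "&confidence=0.0&support=0";
        URL url = new URL("http://spotlight.dbpedia.org/rest/annotate?" + query);

        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("Accept", "application/json");

        try (InputStream in = connection.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(annotate("Fracking evokes 'angst' in Germany"));
    }
}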

5.2.3. Output features

A major advantage of DBpediaSpotlight is that it provides rich output information, e.g. resource prominence (support feature) and topic pertinence (similarity score) of the entities specified. These features can be used in the inference mechanisms that build on the text analysis. It is also possible to output the n-best candidate list. This feature can be used to improve the Named Entity Disambiguation by using cross-media information and information stemming from other tools such as StanfordNER to re-rank the n-best candidates. Technically, DBpediaSpotlight provides two different operations which provide different outputs: the annotate operation annotates the input tokens with maximally one (= the best) type annotation and URI (web link); the candidates operation annotates tokens with n-best candidate lists that also contain the respective additional scores. Output features provided by the annotate operation:22

1. similarity score: the topical relevance of the annotated resource for the given context.
2. support: how prominent is this entity, i.e. number of inlinks in Wikipedia
3. types: types from DBpedia, Schema, Freebase, foaf
4. URI: link to DBpedia web page

Example: “Washington” disambiguated as Washington State:

JCasResource
sofa: _InitialView
begin: 0
end: 10
similarityScore: 0.7550103129997501
support: 24525
types: "Schema:Place, DBpedia:Place, DBpedia:PopulatedPlace, DBpedia:Region, Schema:AdministrativeArea, DBpedia:AdministrativeRegion"
URI: http://dbpedia.org/resource/Washington_(state)

Figure 8: DBpediaSpotlight output (annotate operation)

21 Blacklists do not work properly, see: https://github.com/dbpedia-spotlight/dbpedia-spotlight/issues/251.
22 Notice that the confidence value is not an explicit output feature, see discussion in http://sourceforge.net/p/dbp-spotlight/mailman/message/31849395/


Output features provided by the candidate operation:

N-best candidate list including the following features for each candidate:
1. similarity score: the topical relevance of the annotated resource for the given context.
2. support: how prominent is this entity, i.e. number of inlinks in Wikipedia
3. types: types from DBpedia and Schema
4. URI: link to DBpedia web page

Additional features are:23

5. priorScore: normalized support
6. contextualScore: score from comparing the context representation of an entity with the text (e.g. cosine similarity with tf-icf weights)
7. percentageOfSecondRank: measures by how much the winning entity has won, taking contextualScore_2ndRank / contextualScore_1stRank, which means the lower this score, the further the first ranked entity was "in the lead"
8. finalScore: a combination of all of them

Example: n-best candidate list for “Washington” consisting of Washington State, Washington D.C. and George Washington:

23 see https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service


uri="George_Washington" contextualScore="0.09308415672842982" percentageOfSecondRank="0.12614828083179355” support="7902" priorScore="6.572901151302445E-5" finalScore="0.009084671858297526" types="DBpedia:Agent, Schema:Person, http://xmlns.com/foaf/0.1/Person, DBpedia:Person, DBpedia:OfficeHolder"/>

Figure 9: DBpediaSpotlight output (candidates operation)

The n-best candidate list can be used for cross-media and cross-tool enhancement of the text analysis results, especially if the text type does not provide enough context to properly disambiguate the entities, as is often the case with OCR output, but possibly also with other text types (e.g. article titles, short tweets, etc.).

5.2.4. Program availability

DBpediaSpotlight has been tested as a web service for all EUMSSI languages. In addition, the English module has been locally installed and tested. There are two different versions:24 a Lucene-based version and a version with a statistical backend. The currently downloaded version is the Lucene version 0.6.5, together with the full version of the latest models (version 0.5).

5.2.5. DKPro integration

There is a provisional wrapper for DBpediaSpotlight that allows calling the annotate operation but does not work for the candidates operation.

5.2.6. DBpedia Spotlight and Stanford NER

There is considerable overlap between StanfordNER and DBpediaSpotlight, since DBpediaSpotlight provides not only entity linking, but also name recognition and recognition of types of names. The number of types recognised by DBpediaSpotlight is much larger than the ones recognised by Stanford NER. Therefore the question arises whether both tools are necessary or whether DBpediaSpotlight suffices for Named Entity Recognition, Classification and Disambiguation. Keeping Stanford NER makes sense if there are cases in which Stanford NER performs better than DBpediaSpotlight. Then, the results of both can be combined to provide better output. E.g. [Rizzo et al 2014] provide an evaluation of NER tools that shows that Stanford NER has very high scores for precision and recall, which are higher than the ones obtained by DBpediaSpotlight (they do not provide figures but only graphs). In their experiment, they train a support vector machine (SVM) with the annotations of several NER tools and a gold standard in order to optimize the NER annotations.

24 See https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Installation


The resulting combined SVM is not much better than StanfordNER alone. Hence, as a first approximation, we have decided to use Stanford NER to filter out ambiguous DBpediaSpotlight links. The next step would be to train a classifier. Manual inspection of the output of both tools has also led to the assessment that Stanford NER is more precise in assigning types and that it also recognises names of lesser known entities such as inhabitants of villages, journalists and the like, which could be relevant as opinion holders and which are often incorrectly linked to some more famous entity by DBpediaSpotlight. This is illustrated by the following example taken from the document in Figure 16. Here, DBpediaSpotlight does not properly recognize the name “Michael Beckereit”:

Figure 10: DBpediaSpotlight annotation

DBpediaSpotlight does not properly recognize the end of the name and incorrectly links “Michael” to Michael Jackson with a high similarity score of 0.9:

JCasResource
sofa: _InitialView
begin: 257
end: 264
similarityScore: 0.939562459194219
support: 6694
types: "Schema:MusicGroup, DBpedia:Agent, Schema:Person, Http://xmlns.com/foaf/0.1/Person, DBpedia:Person, DBpedia:Artist, DBpedia:MusicalArtist"
URI: "http://dbpedia.org/resource/Michael_Jackson"

Figure 11: DBpediaSpotlight annotation for “Michael”

Stanford NER correctly recognizes the beginning and end of the name:

Michael PER
Beckereit PER

Figure 12: StanfordNER output for “Michael Beckereit”

Thus, the Stanford NER name offsets can be taken as an indication that the link provided by DBpediaSpotlight is incorrect. Another advantage of Stanford NER is that it seems to perform better on ASR output (with lower case models) than DBpediaSpotlight. However, more systematic evaluation is needed.
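A sketch of this offset-based filtering heuristic follows; the simplified Span record stands in for the actual UIMA annotation types of the EUMSSI type system, and the end offset of the NER span in the example is an illustrative value.

import java.util.List;

public class SpotlightLinkFilter {

    // Simplified stand-off annotation: character offsets into the original text.
    record Span(int begin, int end, String label) {}

    // A DBpediaSpotlight link is kept only if it does not cut a Stanford NER name
    // in half, i.e. it either covers the whole NER span or does not overlap it at all.
    static boolean isConsistent(Span spotlightLink, List<Span> nerSpans) {
        for (Span ner : nerSpans) {
            boolean overlaps = spotlightLink.begin() < ner.end() && ner.begin() < spotlightLink.end();
            boolean covers = spotlightLink.begin() <= ner.begin() && ner.end() <= spotlightLink.end();
            if (overlaps && !covers) {
                return false;  // e.g. "Michael" linked to Michael Jackson inside "Michael Beckereit"
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stanford NER span for "Michael Beckereit" (end offset chosen for illustration).
        List<Span> nerSpans = List.of(new Span(257, 274, "PER"));
        Span link = new Span(257, 264, "http://dbpedia.org/resource/Michael_Jackson");
        System.out.println(isConsistent(link, nerSpans));  // false -> discard the link
    }
}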


5.2.7. Impact

A first version of Named Entity Disambiguation and Linking is running for all four EUMSSI languages.

5.2.8. Next steps

1. Modify the DKPro wrapper to allow calling the candidates operation.
2. Use the DBpediaSpotlight n-best candidate list to enhance the Named Entity Linking of e.g. OCR output with cross-media inferencing.
3. Possibly training of a classifier that combines StanfordNER and DBpediaSpotlight NER.
4. Systematic evaluation and comparison of StanfordNER and DBpediaSpotlight annotations for regular text and lower-case ASR output.


6. EXTRACTION OF EVENTS, RELATIONS AND TOPICS

6.1. Introduction

In order to provide information about events, relations and topics, the first step has been to install a keyphrase extraction tool which provides a variable range of keyphrases for all four EUMSSI languages. Longer keyphrases provide information about relations and events if they contain verbal expressions. Shorter, nominal keyphrases provide topic information and can be used to weight sentences for text summarization functionalities. Keyphrases can also be used to compute text similarity which is a prerequisite for text clustering. Keyphrases, unlike the entities extracted by DBpediaSpotlight, are weighted according to TFxIDF which reflects their relevance for a given document relative to a document collection. This information is important for document retrieval and search functionalities. For later project phases it is planned to add tools that are more specific for the different purposes (event, relation and topic extraction) if this improves the contextualisation and recommendation tools.

6.2. Keyphrase extraction

For keyphrase extraction, KEA (Keyphrase Extraction Algorithm)25 [Medelyan et al 2005, 2006] is used. As [Witten et al. 2000] point out, KEA keyphrase extraction does not use a controlled vocabulary but chooses keyphrases from the text itself. Thus, KEA keyphrase extraction is not limited to the topics or domains of the training data. KEA is a Naïve Bayes classifier trained with the following features:

1. TFxIDF: Term Frequency times Inverse Document Frequency
2. First Occurrence

TFxIDF is a measure of a term’s frequency in a document compared to its frequency in a document collection. First Occurrence measures the distance into the document of the term’s first occurrence relative to the document length. Candidate keyphrases are case-folded bags-of-stems derived from token sequences found in the text. There are special treatments for punctuation marks, numbers and stopwords. As a result, token sequences such as “ban fracking”, “ban of fracking” and “fracking ban” are associated with the same bag-of-stems, namely “ban frack”. For each bag-of-stem keyphrase, the most frequent surface form per document is retained for presentation to the user. KEA provides as output a list of keyphrases containing for each keyphrase the most frequent surface form, the bag-of-stems and a score. The order of keyphrases in the list represents an additional ranking of the keyphrases. KEA uses the WEKA26 workbench which provides a range of machine learning (ML) techniques, see [Hall et al 2009].
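For reference, the two features as formulated in [Witten et al. 2000] can be written as follows, where freq(P,D) is the number of occurrences of phrase P in document D, size(D) the number of words in D, df(P) the number of training documents containing P, N the number of training documents, and pos_first(P,D) the number of words preceding the first occurrence of P:

\[
\mathrm{TF{\times}IDF}(P,D) \;=\; \frac{\mathrm{freq}(P,D)}{\mathrm{size}(D)} \times \left(-\log_2 \frac{\mathrm{df}(P)}{N}\right)
\]

\[
\mathrm{first\_occurrence}(P,D) \;=\; \frac{\mathrm{pos_{first}}(P,D)}{\mathrm{size}(D)}
\]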

25 http://www.nzdl.org/Kea/description.html
26 http://www.cs.waikato.ac.nz/ml/index.html


The advantages of KEA are:

1. KEA keyphrases are not limited to a controlled vocabulary.
2. KEA can be trained for all four EUMSSI languages.
3. Its Java source code is available under GNU public license.
4. KEA provides scores for key phrases.
5. KEA provides an implicit complete ranking of keyphrases.
6. There are several parameters such as:
   a. Parameter for the number of extracted key phrases
   b. Parameter for the length of key phrases
   c. Parameter for ignoring proper nouns
   d. Stopword list
   e. Parameter for in-domain training corpora
   f. Parameter for using ontology as keyphrase candidates
7. KEA applies case-folding, therefore ASR output can be processed without further ado.

Further advantages of KEA which are reported by [Witten et al 2000] and confirmed by preliminary experiments are:

8. A small training set of 50 documents suffices to produce adequate keyphrase extraction.
9. KEA is fairly robust with regard to key phrase extraction out of different domains even though training with an in-domain corpus improves scores.
10. KEA is also fairly robust with respect to text length.

Disadvantages of KEA:

1. KEA needs annotated training data.
2. The scores are not diverse enough to provide interesting differentiations for search requests and visualisations with tag clouds.
3. The complete ranking of key phrases is only implicitly provided by the order in the output list.
4. The top-ranked keyphrases are predominantly one-word keyphrases. For applications such as tag clouds two-word keyphrases are preferable.
5. KEA lists only the most frequent surface form for each key phrase per document. Thus the offsets which are computed on the surface forms might not be complete.
6. The first occurrence feature is inappropriate if texts contain several divergent but equally important topics as is the case in interviews and speeches.

The disadvantages of KEA can be remedied by the following measures:

Ad 1) Keyphrase annotations are often available, e.g. many of the crawled news items contain author-annotated keyphrases. Where such author-annotated keyphrases do not exist, a tool for keyphrase annotation is available at GFAI, see section 6.2.5.
Ad 2) Increasing the training data has led to more diverse score distributions.


Ad 3) The implicit ranking is converted into a separate output feature called rank.
Ad 4) It is possible to restrict keyphrases to two-stem units; however, then relevant one-word keyphrases such as “fracking” are lost. Another option is to interpolate high-ranked one-word and two-word keyphrases in an inference step to determine an appropriate mix of one-word and two-word keyphrases.
Ad 5) Case-folding and lemmatization of keyphrases reduce the problem of listing all surface forms somewhat, but do not account for different word orders within keyphrases or deleted stopwords. Therefore, a better solution would be to output all surface forms instead of only the most frequent surface form. This requires going deep into the code since the surface forms are discarded early on. This hasn't been done yet.
Ad 6) Divergent topics and sub-topics are captured either by increasing the number of extracted keyphrases per document, by applying keyphrase extraction to smaller text units such as paragraphs, or by training a separate topic model, see section 6.5.

6.2.1. Model training

KEA provides models, stemmers and stopword lists for EN, ES and FR. For DE it provides a stemmer and a stopword list but no model. Therefore, the first measure has been to close the language gap and to train a model for German using annotated DW data generated by the GFAI keyword extraction tool. For unsupervised extraction of keyphrases with the GFAI tool see section 6.2.5.

We are currently exploring the following issues:

1. Impact of large numbers of extracted keyphrases
2. Impact of training corpus size on F-score and score diversity and keyphrase length
3. Comparison of models trained with author-annotated keyphrases and automatically annotated keyphrases (GFAI tool)
4. Impact of applying KEA to smaller text units such as paragraphs and sentences
5. Impact of different training corpora (in-domain versus out of domain)

6.2.2. Experiment with large number of extracted keyphrases

In this experiment, KEA was configured to output 500 keyphrases for each DW article, thus pushing the number of extracted keyphrases to an extreme.27 The idea behind this experiment is to provide a rich pool of data on which different search filters can operate and provide different sets of keyphrases determined by the search and data mining requests, possibly also influenced by the user. Additional filters could output only top-ranking multi-word keyphrases, keyphrases with or without verbs, keyphrases containing a certain keyword such as fracking, and so on, as sketched below.
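The following small sketch illustrates such post-filters over the extracted keyphrase pool; the Keyphrase record mirrors the KEA output described in section 6.2 (surface form, bag-of-stems, score) but is a simplification.

import java.util.List;
import java.util.stream.Collectors;

public class KeyphraseFilters {

    // Simplified KEA output record: most frequent surface form, bag-of-stems and score.
    record Keyphrase(String surfaceForm, String stems, double score) {}

    // Keep only multi-word keyphrases, highest scores first, at most n of them.
    static List<Keyphrase> topMultiWord(List<Keyphrase> pool, int n) {
        return pool.stream()
                .filter(k -> k.surfaceForm().trim().contains(" "))
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .limit(n)
                .collect(Collectors.toList());
    }

    // Keep only keyphrases whose bag-of-stems contains a given stem, e.g. "frack".
    static List<Keyphrase> containingStem(List<Keyphrase> pool, String stem) {
        return pool.stream()
                .filter(k -> k.stems().contains(stem))
                .collect(Collectors.toList());
    }
}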

27 The number of keyphrases is too high given that the text only contains 936 words. A specific ratio of number of extracted keyphrases relative to text length needs to be established so that the size of the extracted keyphrases is considerably smaller than the text size.


Example 1: the 20 highest ranking KEA key phrases for the DW text entitled “Fracking evokes ‘angst’ in Germany” (see Figure 16 for the complete text).28

Token:        Stem:        Score:
Fracking      frack        0.0502
Germany       germani      0.034
experts       expert       0.034
risks         risk         0.034
Germans       german       0.034
Europe        europ        0.034
Merkel        merkel       0.034
water         water        0.034
pressure      pressur      0.034
streets       street       0.034
Thousands     thousand     0.034
plans         plan         0.034
parts         part         0.034
gas           gas          0.0339
method        method       0.0339
extract       extract      0.0339
soil          soil         0.0339
chemicals     chemic       0.0339
concerned     concern      0.0339
Photo         photo        0.033929

Figure 13: Highest ranking keyphrases

They contain named entities such as Merkel, terms such as “risks”, “gas” and “water”, and also attitude verbs such as “concerned”, but out of context it is unclear what the risks are or who is concerned.

Example 2: The 20 highest ranking multi-word keyphrases for the same newspaper article are:

Token:                      Stem:                        Score:
environmental risks         environment risk             0.0338
environmental experts       environment expert           0.0338
Germans are concerned       concern german               0.0338
extract oil                 extract oil                  0.0338
high pressure               high pressur                 0.0338
Chancellor Angela Merkel    angela chancellor merkel     0.0338
Angela Merkel               angela merkel                0.0338
nuclear waste               nuclear wast                 0.0338
natural gas                 gas natur                    0.0338

28 In comparison, the author-assigned keyphrases for the same article are: "CDU, FDP, fracking, environment protection, law, Bundesrat, Angela Merkel, Oettinger, gas production”.
29 The term ‘Photo’ is part of a copyright notice under a picture. The html markup is not fine-grained enough yet to detect such descriptions.


large scale                 larg scale          0.0338
Martin Faulstich            faulstich martin    0.0338
party CDU                   cdu parti           0.0338
fracking here in Germany    frack germani       0.0338
method to extract           extract method      0.0338
risks of fracking           frack risk          0.0338
fracking activities         activ frack         0.0338
stop all fracking           frack stop          0.0338
Recommendation to forego    forego recommend    0.0338
Forego fracking             forego frack        0.0338
need additional gas         addit gas need      0.0338

Figure 14: Highest ranking multi-word keyphrases

The multi-word keyphrases are more expressive than the single-word keyphrases; e.g. “environmental risks” is more expressive than “risks”. More named entities appear: “Martin Faulstich” and “party CDU”. Multi-word keyphrases also contain rudimentary relations such as “Germans are concerned”, “extract oil”, “method to extract”, “stop all fracking”, “Forego fracking” and “need additional gas”.

The 20 highest-ranking keyphrases containing the string [Ff]racking for the same article are:

Token:                                      Stem:                            Score:
Fracking                                    frack                            0.0502
fracking here in Germany                    frack germani                    0.0338
risks of fracking                           frack risk                       0.0338
fracking activities                         activ frack                      0.0338
stop all fracking                           frack stop                       0.0338
Forego fracking                             forego frack                     0.0338
ban fracking                                ban frack                        0.0338
Fracking evokes                             evok frack                       0.0338
fracking is on the rise                     frack rise                       0.0338
conference on Fracking                      confer frack                     0.0338
conference on Fracking Photo                confer frack photo               0.0338
Fracking Photo                              frack photo                      0.0338
active initiatives against fracking         activ frack initi                0.0338
initiatives against fracking                frack initi                      0.0338
environmental experts warn that fracking    environment expert frack warn    0.0338
experts warn that fracking                  expert frack warn                0.0338
warn that fracking                          frack warn                       0.0338
warn that fracking could pollute            frack pollut warn                0.0338
fracking could pollute                      frack pollut                     0.0338
fracking could pollute the groundwater      frack groundwat pollut           0.0338
Fracking has experienced a boom             boom experienc frack             0.0338

Figure 15: Highest ranking keyphrases containing “fracking”

The original document can be found in the following figure:


Fracking evokes 'angst' in Germany

Germans are concerned about the risks of fracking, a method to extract oil and gas: Chancellor Angela Merkel sends out words of caution, environmental experts say it’s dispensable. But in Europe, fracking is on the rise.

There is hardly anything that Germans are as much afraid of as environmental risks in the soil. Thousands take to the streets regularly against plans to store nuclear waste deep underground. And lately, there have been active initiatives against fracking - despite the technique not yet being commercially used on a large scale. At least not in Germany.

Hydraulic fracturing, or fracking, as it is more commonly known, is a method to release petroleum or natural gas from rock formations by injecting a mixture of water, sand and chemicals deep underground at high pressure. Oil and gas can then be extracted; but parts of the chemical cocktail remain in the soil. That's why environmental experts warn that fracking could pollute the groundwater.

Fracking has experienced a boom over recent years in the United States. Experts have said the US could soon even overtake Russia as the world's top gas producer. There are calculations that suggest Germany could satisfy its own gas demand for 13 years if it decided to use the controversial extraction method. Fracking was done to a limited degree in Germany until US fracking activities triggered a controversial debate. Companies had to stop all fracking activities in Germany as a result.

Recommendation to forego Dr. Martin Faulstich, Vorsitzender des Umweltrates für die Berichterstattung zum Fracking-Gutachten Photo: private

Environmental advisor Martin Faulstich: Forego fracking

The fracking stop came at a time when Germany is actually seeking out new sources of energy. After the Fukushima tsunami and nuclear reactor meltdown, Berlin decided to phase out all nuclear energy by the mid-2020s.

But the government's advisory council on the environment (SRU) warned of risks fracking may pose. The experts said their concerns outweighed the potential benefits of fracking. They also said Germany didn't need additional gas supplies from complicated and costly sources. Instead of fracking, the SRU experts said they still favored wind and solar power.

"That's why we've recommended that we forego fracking here in Germany," the council's chairman Martin Faulstich told DW.

Chancellor Angela Merkel has not openly shown interest in pursuing fracking in Germany. In a phone consultation with fellow party members she said, "We have to do everything we can to avoid environmental risks." Her government, she added, aimed to "limit [fracking] and to make its use more complicated and make the method more environmentally friendly in comparison to its current legal status."

Under current law, fracking falls under the statutes established for the mining industry. Those regulations provide owners of a mine with a great deal of latitude on their operations. (cont. on next page)


Unsuitable topic for general elections

Germany's ruling coalition of Christian conservatives and junior partner free market liberal FDP have now put forward a draft law which would ban fracking in water protection areas. In other areas, strict environmental compatibility studies would have to be carried out, which in practice would amount to a ban.

"Under current circumstances, we can rule out fracking," was how Hermann Gröhe, the secretary general of governing party CDU, said of the topic.

Many in Merkel's Christian Democratic Union want the draft law to go even further. They are worried that the fear of chemicals in the soil could benefit the Greens party in the election campaign. That's why they're now campaigning for a proper ban - as are the Greens. Many German states that are governed by opposition parties SPD and Greens also want to see fracking banned. They have the majority in the German Bundesrat, Germany's upper house of parliament. That's why the fracking law would not stand a chance at the moment anyway. Demonstrators protesting against fracking in Germany Photo: picture-alliance/dpa

Campaigners are concerned about potential groundwater pollution

Beer brewers have also put up resistance in Bavaria. They are concerned about the purity of the groundwater. That's why the CSU, the CDU's sister party that governs the state of Bavaria, is also against fracking. Currently, the FDP - with Economy Minister Philipp Rösler at the forefront - is the only political party which speaks out in favor of at least assessing the possibilities of the extraction method in Germany.

Pressure from Europe

In Europe, Germany is comparatively isolated in its opposition to fracking. Others frown when they're confronted with the German concerns - even German nationals. EU Energy Commissioner Günther Oettinger this week complained that the EU needed a radical overhaul.

He said the bloc was in a "truly abysmal state," and he criticized the German government for its policies. Oettinger, a top member of Angela Merkel's CDU party himself, said Germany was strong, but it couldn't get any stronger because the government was setting the wrong agenda, by focusing on "childcare benefits, a women's quota, minimum wages; and then they say no to fracking."

In many EU countries, fracking is popular, particularly in central and eastern Europe, where many - like Poland - still rely on climate-unfriendly sources of energy like coal. They would prefer to switch to less harmful shale gas. Hungary, Poland, Romania, but also Spain and Britain are in favor of fracking.

Amid ongoing anti-fracking protests, Romania's head of government Victor Ponta called on his people to "not be deterred from the right path which the US have already embarked on". It will take a long time before Germany embarks on that path - if at all.

Figure 16: Document “Fracking evokes ‘angst’ in Germany”


The listing of keyphrases in Figure 15 that contain the term fracking conveys that fracking is highly controversial. Negative expressions such as “risks of fracking”, “stop all fracking”, “Forego fracking”, “ban fracking”, “active initiatives against fracking”, “environmental experts warn that fracking” and “fracking could pollute the groundwater” can be seen alongside positive expressions such as “fracking is on the rise” and “Fracking has experienced a boom”, with the negative expressions outnumbering the positive ones.

6.2.3. Sample visualization of keyphrases

Keyphrases of differing length provide information of different granularity. This can be used to provide scalable search filters to the user. Users determine the level of granularity, possibly also by combining several search criteria such as keyword search, named entity spotting and sentiment/attitude analysis. For example, the user can determine the keyphrase length, require that keyphrases contain verbs or certain named entities, and zoom in to visualize families of keyphrases (e.g. keyphrases containing the term fracking, or keyphrases containing attitude verbs such as warn, concerned and so on). To take the above article about fracking as an example, the one-word keyphrases give an overview of prominent named entities (Germany, Germans, Europe, Merkel), terms (risks, gas, water, soil, chemicals) and attitudes (concerned). They can be visualized as a tag cloud:

Figure 17: Tag cloud of single-word keyphrases

The multi-word keyphrases convey more specific information: e.g. they specify the type of risks (environmental risks) and they bring up new relevant named entities and terms such as Martin Faulstich, CDU and “nuclear waste”. They also specify rudimentary relations such as “Germans are concerned” and “extract oil”.


They can be visualized as in the following tag cloud:

Figure 18: Tag cloud of multi-word keyphrases

Selecting the keyphrases that contain the term fracking provides even more detailed information about the risks of fracking (fracking could pollute the groundwater) and the attitudes towards fracking (environmental experts warn that fracking...), as illustrated in the following tag cloud:

Figure 19: Tag cloud with keyphrases containing the term fracking

In order to obtain the word-clouds presented above, we have taken the outputs shown in Figure 13, Figure 14 and Figure 15, and filtered them according to the following criteria:


1. Overlap between keyphrases is reduced by ignoring keyphrases that are a substring of a higher-ranked keyphrase; thus, longer keyphrases that convey more content are preferred.30
2. Smaller groups of keyphrases: Since the scores are not diverse enough to define small groups of keyphrases, rank information, which is encoded in the order of the keyphrase list, is used to define smaller groups of two or three keyphrases.

These filters are exemplified in Figure 20. Grey letters indicate keyphrases that are ignored because they are substrings of higher-ranked keyphrases. The keyphrases with score 0.0338 are grouped into sets of two or three keyphrases, indicated by substituting the last digit in the score with a lower digit:

Token:                                      Stem:                            Score:
Fracking                                    frack                            0.0502
fracking here in Germany                    frack germani                    0.0338
risks of fracking                           frack risk                       0.0338/6
fracking activities                         activ frack                      0.0338/6
stop all fracking                           frack stop                       0.0338/5
Forego fracking                             forego frack                     0.0338/5
ban fracking                                ban frack                        0.0338/5
Fracking evokes                             evok frack                       0.0338/4
fracking is on the rise                     frack rise                       0.0338/4
conference on Fracking                      confer frack                     0.0338/3
active initiatives against fracking         activ frack initi                0.0338/3
initiatives against fracking                frack initi                      0.0338
environmental experts warn that fracking    environment expert frack warn    0.0338/2
experts warn that fracking                  expert frack warn                0.0338
warn that fracking                          frack warn                       0.0338
warn that fracking could pollute            frack pollut warn                0.0338/2
fracking could pollute                      frack pollut                     0.0338
fracking could pollute the groundwater      frack groundwat pollut           0.0338/1
Fracking has experienced a boom             boom experienc frack             0.0338/1

Figure 20: Filtering keyphrases
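The two filters can be sketched as follows in Python; the input format, the toy data and the group size of three are assumptions made for illustration only, not the actual implementation.

```python
# Sketch of the two word-cloud filters: (1) drop keyphrases that are a
# substring of a higher-ranked keyphrase, (2) split runs of equally scored
# keyphrases into small rank-based groups. Data and group size are assumptions.

ranked = [  # (keyphrase, score), already in KEA rank order
    ("Fracking", 0.0502),
    ("active initiatives against fracking", 0.0338),
    ("initiatives against fracking", 0.0338),
    ("risks of fracking", 0.0338),
    ("stop all fracking", 0.0338),
]

def drop_substrings(ranked):
    kept = []
    for phrase, score in ranked:
        if any(phrase.lower() in higher.lower() for higher, _ in kept):
            continue  # substring of a higher-ranked keyphrase: ignore it
        kept.append((phrase, score))
    return kept

def group_by_rank(ranked, group_size=3):
    """Assign each keyphrase a group index derived from its rank."""
    return [(phrase, score, i // group_size)
            for i, (phrase, score) in enumerate(ranked)]

for phrase, score, group in group_by_rank(drop_substrings(ranked)):
    print(f"{phrase}\t{score}\tgroup {group}")
```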

These sample visualisations are only a preliminary illustration of the possibilities generated by the keyphrases extracted from one document. Additional ways of scaling keyphrases that stem from a search query over a document set need to be explored further, as does the question whether keyphrases provide interesting text snippets that can be used for visualisations, text summaries and so on.

30 According to [Witten et al 2000] the substring deletion should be provided by KEA; this might be a programming bug in KEA.


6.2.1. Granularity of scores

The article on fracking contains 936 words; the score distribution of the 500 keyphrases is the following:

Score     Number of keyphrases
0.0502    1
0.034     12
0.0339    40
0.0338    447

Table 5: Score distribution for sample text

The granularity of the scores has been examined further since LUH has pointed out that too coarse-grained scores do not provide adequate differentiation for tag clouds. Therefore, experiments have been conducted that measure the score diversity and precision/recall/F-score for increasing training corpus sizes and different training corpora.

6.2.2. Experiments with different training corpus size

In the first set of experiments, we have used the DW corpus EN and keyphrase annotations stemming from the GFAI tool for keyphrase extraction. The training corpus size is increased from 10 documents to 1418 documents. The test corpus contains 500 documents; the maximum number of extracted keyphrases per document is 50. Since the number of keyphrases per document varies greatly in the gold standard, the number of keyphrases extracted by KEA has been reduced to the number annotated in the gold standard for the calculation of precision, recall and F-score. Thus each document in the evaluation has the same number of keyphrases in the test set and in the gold standard annotation. The reduction yields a more balanced precision/recall. Score frequency has been computed without keyphrase reduction.

Table 6 lists average precision, recall and F-score as well as the scores of the keyphrases and their average frequency per document. It can be seen that the training corpus size does not influence precision, recall and F-score much; the highest F-score is achieved with training corpus size 40. With larger training corpora the F-score decreases slightly. Even a training corpus of 10 documents achieves a comparable F-score. The granularity of the scores is poor. It increases with training corpus size, but only the training corpus with 1418 documents yields a score distribution with three different scores. Since this is still a rather poor score distribution, it has been decided to include the rank of the keyphrases, as expressed by the order of the keyphrase list, as an additional sorting criterion in the UIMA annotations.


Train corp    Precision     Recall        F-score       Scores    Av score freq per doc
10            0.39898324    0.3343996     0.36384773    0.3588    36.392
40            0.4025773     0.3378866     0.36740607    0.3376    36.462
50            0.39717245    0.33246785    0.36195114    0.3303    36.452
100           0.39592987    0.3313358     0.36076427    0.3013    18.044
                                                        0.2834    18.444
500           0.38470188    0.32007608    0.34942600    0.2759    5.99
                                                        0.2576    30.534
1000          0.38342398    0.31875992    0.34811450    0.2798    8.86
                                                        0.2662    27.664
1418          0.38355327    0.31890193    0.34825248    0.2864    8.384
                                                        0.2745    16.326
                                                        0.2714    11.798

Table 6: Precision, recall, F-score and score distribution relative to training corpus size using DW corpus
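The evaluation protocol described above (truncating the ranked KEA output to the number of gold-standard keyphrases per document before averaging precision, recall and F-score) can be sketched as follows. Exact matching on stemmed strings and the sample data are simplifying assumptions of this illustration.

```python
# Sketch of the per-document evaluation: truncate the ranked KEA output to
# the size of the gold-standard keyphrase set, then average precision,
# recall and F-score over documents. Sample data are made up.

def prf(extracted_ranked, gold):
    extracted = set(extracted_ranked[: len(gold)])  # truncation step
    gold = set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

docs = [
    (["frack", "environment risk", "gas", "merkel"], ["frack", "gas"]),
    (["cdu", "election", "frack"], ["cdu", "frack", "energi"]),
]
scores = [prf(extracted, gold) for extracted, gold in docs]
avg = [sum(col) / len(scores) for col in zip(*scores)]
print("avg precision/recall/F:", avg)
```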

In the second experiment, we have used the Guardian EN corpus. Here the author-assigned keywords are used as annotations, without checking whether they occur in the text body. Thus the annotated keyphrases can be considered very noisy. Otherwise the same settings are used as in the DW experiment: the test corpus contains 500 documents, and the maximum number of extracted keyphrases per document is 50. The training corpus size is increased from 40 documents to 1000 documents. Again, the F-score is best with a smaller training set (50 documents). The score diversity seems to increase with increasing training corpus size (even though the 50-document training corpus is an outlier). Interestingly, the score diversity achieved with the Guardian training corpus is higher than the one achieved with the DW corpus. The reasons for this have not been fully investigated yet.


Train corp    Precision      Recall        F-score       Scores    Av score freq per doc
50            0.11718188     0.13437416    0.12519053    0.1759    8.47
                                                         0.0330    1.126
                                                         0.0289    33.68
                                                         0.0239    1.284
100           0.110152654    0.12645356    0.11774159    0.1193    9.732
                                                         0.0287    34.828
1000          0.099848000    0.11449652    0.10667171    0.1025    6.132
                                                         0.0527    5.176
                                                         0.0480    1.756
                                                         0.0449    0.598
                                                         0.0332    1.326
                                                         0.0320    0.414
                                                         0.0309    0.006
                                                         0.0271    0.002
                                                         0.0265    26.834
                                                         0.0255    2.226

Table 7: Precision, recall, F-score and score distribution relative to training corpus size using Guardian corpus

In the third experiment the models trained with the Guardian corpus are used to perform keyphrase extraction in the DW test set and evaluate it against the gold standard used in the first experiment. Interestingly, the F-scores are comparable to the ones achieved with DW training data while maintaining the higher score diversity. Thus, KEA seems quite robust regarding the quality of the keyphrase annotations in the training data.

Train corp    Precision     Recall        F-score       Scores    Av score freq per doc
50            0.3823838     0.31759396    0.34699040    0.1759    4.942
                                                        0.0330    0.302
                                                        0.0289    30.044
                                                        0.0239    1.126
100           0.37888196    0.31406780    0.34344375    0.1193    5.304
                                                        0.0287    31.108
1000          Data deleted by mistake

Table 8: Precision, recall, F-score and score distribution relative to training corpus size using Guardian as training corpus and DW as test corpus

6.2.3. Author-annotated keyphrases versus tool-based keyphrase annotations

The impact of author-annotated keyphrases versus GFAI-tool-based annotations for training has been measured with the help of the English DW corpus. In the first experiment the author-annotated keyphrases have been used for training and as gold standard, and in the second experiment the same DW corpus, split into test and training set, has been used with keyphrases annotated by the GFAI tool. Table 9 lists precision, recall and F-score for the two experiments:

                        Precision     Recall        F-Score
author annotation       0.26121905    0.26162530    0.2614220
GFAI tool annotation    0.33884310    0.33680326    0.3378201

Table 9: Precision, recall and F-score for author-annotated and GFAI-tool annotated training data

The experiment using the GFAI tool for keyphrase annotation yields higher precision, recall and F-score. The reasons for the difference have not been fully investigated yet. The comparison is taken as a further indication that KEA is fairly robust regarding the choice of keyphrase annotations. Thus, if no author-annotated training data is available, the GFAI tool can be used to annotate the training data.

6.2.4. Topic diversity

A cursory look at the English DW corpus and the Guardian corpus has shown that some text types such as speeches and interviews contain highly divergent topics. To put a number to it: out of the 120 Guardian documents that contain the term fracking, only 51 documents are assigned the keyphrase fracking when the number of extracted keyphrases is set to 40. More studies are needed to determine whether the documents that contain the term fracking but are not assigned the keyphrase fracking are interesting for a search on fracking or not. Even if fracking is not the major topic in these texts, it might still be interesting for journalists to explore all contexts in which fracking occurs. In this context, a search-request based text summarization tool could be built, see e.g. [Jones et al. 2002] for a discussion of interactive text summarization tools.

6.2.5. GFAI tools for linguistic analysis and keyword extraction

The DE and EN DW corpora have been analysed with linguistic tools available at GFAI. The goal was to automatically assign keyphrases to the documents in the DW corpus. The GFAI keyphrase extraction tool is based on terminology extraction routines which, in turn, are based on a linguistic analysis of the input text, using a morphological analyser to determine word forms31 and shallow methods to achieve disambiguation32. As the result of terminology extraction, a set of terms is obtained, with each term being assigned its basic statistical values: term frequency and the number of documents in which the term occurs.

31 Morphological analysis is performed by the MPRO tool developed at IAI, see [Maas et al 2009].
32 Disambiguation is performed using the FRED (formerly KURD) formalism, see [Carl et al. 1998].


The following table shows the overall number of documents and of terms extracted per language:

             EN           DE
Documents    225.816      457.377
Terms        1.050.000    1.265.000

Table 10: Terms extracted per language

Based on its frequency in a given document (tf), each term is assigned a local weight normalised on a scale of 0 to 100. In addition, the inverse document frequency (idf) is computed per term, thus putting the number of documents in which a term occurs in relation to the overall number of documents in the document collection33. Based on tf and idf, each term in a given document is assigned a global weight, again normalised on a scale of 0 to 100. The global weight is used to define a threshold for the display of the most prominent terms of a given document. With respect to the overall document collection, the global weight is used to separate relevant from irrelevant terms by specifying a respective threshold. Terms whose global weight is below the defined threshold in every document of the document collection are considered irrelevant. These are typically terms that either occur very rarely (just once) in the document collection or otherwise occur in very many documents.
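The exact normalisation used by the GFAI tool is not reproduced here; the following sketch merely illustrates the general idea of a tf.idf-based weight normalised to a 0 to 100 scale. All formulas, data and normalisation choices in the sketch are assumptions for illustration.

```python
# Illustrative tf.idf weighting normalised to 0..100. The concrete
# normalisation used by the GFAI tool is not specified here; the formulas
# below are assumptions that only mirror the general idea.
import math

docs = [
    ["fracking", "gas", "fracking", "risk"],
    ["election", "cdu", "merkel"],
    ["gas", "energy", "fracking"],
]
n_docs = len(docs)
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1  # document frequency per term

def weights(doc):
    tf = {t: doc.count(t) for t in set(doc)}
    max_tf = max(tf.values())
    out = {}
    for term, freq in tf.items():
        local = 100.0 * freq / max_tf                 # local weight, 0..100
        idf = math.log(n_docs / df[term])
        global_w = local * idf / math.log(n_docs)     # global weight, 0..100
        out[term] = (round(local, 1), round(global_w, 1))
    return out

print(weights(docs[0]))
```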

Table 11 shows the split between relevant and irrelevant terms that is achieved by requiring a relevant term to occur at least twice in the document collection with a global weight of at least 30 and at least 50:

              dfi≥2 & wglob≥30                    dfi≥2 & wglob≥50
              EN              DE                  EN             DE
relevant      279.026         459.658             259.259        411.000
              (ca. 26,6%)     (ca. 36,3%)         (ca. 24,7%)    (ca. 32,5%)
              (11.110 Verbs)  (10.867 Verbs)
irrelevant    770974          805342              790741         854000
              (ca. 73,4%)     (ca. 63,7%)         (ca. 75,3%)    (ca. 66,5%)

Table 11: Relevant and irrelevant terms extracted per language

Discarding all irrelevant terms, each relevant term is finally assigned a relevance weight that is, once more, normalised on a scale of 0 to 100. The relevance weight takes into account the average global weight of a term as well as the ratio between relevant occurrences of a term and its overall number of occurrences. Terms with a wglob≥30 have been used as keyphrase annotations for KEA model training. For tag clouds the higher threshold of wglob≥50 is better.

33 See [Salton et al. 1988]

6.3. Relation extraction

Currently KEA is used for rudimentary relation extraction. The multi-word keyphrases that contain verbs specify rudimentary relations (without role labels). The advantage of using KEA keyphrase extraction is that it can be trained for all four EUMSSI languages on unannotated data, such as the data that we are already collecting within EUMSSI. No additional data is necessary. In later project stages, we plan to complement the keyphrase strategy with more sophisticated tools. A potential candidate tool for relation extraction is the semanticRoleLabeler34 from ClearNLP, see [Choi 2012], which has a UIMA integration35. This tool needs a dictionary, a part-of-speech tagging model, a dependency parsing model, and a semantic role labelling model. Such resources may not be readily available for all four languages.

6.4. Event extraction

In event semantics, events take individuals as arguments and the argument relations can be labelled with semantic roles. Hence, this concept of event extraction is close to relation extraction. The term event detection is used in the context of information streams. Here the challenge is to detect when a new event, such as a disease outbreak, starts. This aspect of event detection is closer to trend detection, which is part of the Social Media analysis performed in tasks 4.4 and 4.5.

6.5. Topic detection

Currently the KEA keyphrases are used to determine document-specific topics. More specific topic models are useful for analysing the contents of large document collections. In addition, advanced topic models such as Latent Dirichlet Allocation (LDA) [Blei et al 2003] can be used to determine multiple topics per document, and the Pachinko Allocation Model (PAM) has the added benefit of providing relations between topics, see [Li et al 2006]. These topic models are also available in a public domain toolkit: the MALLET toolkit36 contains implementations of Latent Dirichlet Allocation, Pachinko Allocation and Hierarchical LDA.

34 http://clearnlp.wikispaces.com/semLabeler. For a Question/Answering application using the semanticRoleLabeler see [Maqsud et al 2014].
35 UIMA integration: http://mvnrepository.com/artifact/de.tudarmstadt.ukp.dkpro.core/de.tudarmstadt.ukp.dkpro.core.clearnlp-asl/1.6.2
36 McCallum, Andrew Kachites (2002). MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu


The disadvantage of topic models such as LDA and PAM is that they are computed relative to a document collection and do not adapt to new topics. An exception is Relevance Modelling (RM), see [Lavrenko et al 2001]. RM does not need any training data but builds on the query alone. See also [Yi et al 2009] for a comparison of the different topic models when used in an information retrieval task. Their finding is that in document retrieval Relevance Modelling performs best. Within the EUMSSI context, LDA and PAM are promising as well, since they provide fine-grained topic information for individual documents which supports text mining tasks.
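Purely as an illustration of what such topic models deliver (a set of word distributions plus a per-document topic distribution), the following sketch trains a small LDA model. Note that it uses the gensim library rather than the MALLET toolkit referred to above, and that the toy documents and parameter values are assumptions of this example.

```python
# Illustration of LDA topic modelling on a toy corpus. gensim is used here
# only for the sake of a compact example; the text refers to MALLET.
from gensim import corpora, models

docs = [
    ["fracking", "gas", "groundwater", "risk"],
    ["merkel", "cdu", "election", "campaign"],
    ["gas", "energy", "fracking", "shale"],
    ["election", "party", "merkel", "bundesrat"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=1)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# per-document topic distribution, i.e. fine-grained topic information
# for an individual document
print(lda.get_document_topics(corpus[0]))
```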

6.5.1. Impact

A first version of keyphrase extraction is running for all four EUMSSI languages. Keyphrases provide the basis for relation extraction, event extraction, topic modelling, text summarisation and text similarity calculations.

6.5.2. Next steps

Next steps are to improve, extend and evaluate the usage of KEA and to explore other, more specific tools for topic modelling and relation extraction:

1. Improving, extending and evaluating KEA:
   a. Output all surface forms instead of only the most frequent surface form.
   b. Use the GFAI keyphrase extraction tool to annotate more corpora.
   c. Create in-domain corpora.
   d. Train models on different corpora.
   e. Examine the optimal ratio of the number of assigned keyphrases per document in relation to document length.
   f. Examine how many paragraphs are assigned how many keyphrases per document under which parameter settings. This is interesting with regard to interviews and speeches that contain divergent topics. Paragraphs that are not assigned any keyphrases (or other labels) are invisible to the search algorithm.
   g. Possibly: the usage of ontologies as a KEA parameter.
2. Explore whether an interactive, search-request based text summarization tool is of interest.
3. Explore topic models that are provided by the MALLET toolkit, such as LDA.
4. Explore semantic role labelling as provided by ClearNLP for relation extraction.


7. SENTIMENT ANALYSIS COMPONENT

7.1. Introduction

Sentiment or opinion corresponds to the mental or emotional state of the writer or speaker or some other entity referenced in the discourse. News articles, for example, often report emotional responses to a story in addition to the facts. Editorials, reviews, weblogs, and political speeches, as well as comments to news, posts to on-line forums, product reviews or tweets, convey the opinions, beliefs, or intentions of the writer or speaker. Accordingly, Sentiment Analysis, or Opinion Mining, as it is also known, is a set of computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in news sources, social media comments, and other user-generated content.

One of the most common sub-tasks of opinion mining is polarity classification: given an opinionated piece of text about one single topic, classify the opinion as falling under one of two opposing sentiment polarities, or locate its position on the continuum between these two polarities. Opinion Mining (OM) is generally regarded as being a very complex task even at its more basic level of polarity classification. The reason is that sentiment and subjectivity are expressed through subtle linguistic resources, including irony. Moreover, they tend to be very domain and context dependent, to the point that even the exact same word or phrasing can indicate different sentiments in different contexts. An example that has become popular in the literature on Opinion Mining is “go read the book”, which most likely indicates positive sentiment for book reviews, but negative sentiment for movie reviews.

As the scenarios described in D2.1 and D7.1 indicate, detection of opinion may be a useful feature of the demonstrators that EUMSSI intends to build. In one of these scenarios, the journalist needs to consider all perspectives and opinions around a controversial topic such as fracking.

7.2. Approaches to OM: Rule-based and Corpus-based

The literature agrees on two main approaches for classifying opinion expressions: (i) use of supervised learning methods and (ii) application of lexical or rule-based knowledge (see [Liu, 2012] for an overview). In principle, the latter yields better precision while the former is able to discover unseen examples and thus enhances recall. [Rodriguez-Penagos et al, 2013] have experimented with a combination of both on noisy data, arriving at the conclusion that, used in isolation, lexicon-based methods outperform machine learning, and that, used in combination, they contribute the most.

7.2.1. Rule-based approaches

7.2.1.1. Keyword spotting

Keyword spotting is the most naïve approach and probably also the most popular because of its accessibility and cost efficiency. In this technique, text is classified into affect categories, based on the presence of fairly unambiguous affect words like ‘happy’, ‘sad’, ‘afraid’, and ‘bored’.


The simplest keyword-spotting approach consists of annotating text by using a language-specific polarity lexicon. Given this lexicon, the polarity of a document is annotated by combining the polarity of its constituent sentences, where in turn the polarity of a sentence is determined as a summation of the polarity of the words found in the sentence. An alternative to summation is counting: a number of negative words above a particular threshold renders the document negative, whereas a majority of positive words triggers a positive classification. One of the main weaknesses of this approach is related to the use of negation, which can actually reverse the polarity of the opinion. A second weakness is its reliance on the presence of obvious affect words in the surface textual form.
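A minimal sketch of the two aggregation strategies just described (summation versus counting against a threshold) is given below; the toy lexicon, the threshold value and the tie-breaking behaviour are assumptions made for illustration only.

```python
# Sketch of simple keyword spotting with a polarity lexicon: document-level
# polarity by summation of word polarities, or by counting positive vs.
# negative words. The toy lexicon and the threshold are assumptions.

LEXICON = {"happy": 1.0, "good": 0.5, "sad": -1.0, "afraid": -1.0, "bored": -0.5}

def polarity_by_sum(tokens):
    score = sum(LEXICON.get(t.lower(), 0.0) for t in tokens)
    return "POSITIVE" if score > 0 else "NEGATIVE" if score < 0 else "NEUTRAL"

def polarity_by_count(tokens, threshold=1):
    pos = sum(1 for t in tokens if LEXICON.get(t.lower(), 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t.lower(), 0) < 0)
    if neg >= pos + threshold:
        return "NEGATIVE"
    if pos >= neg + threshold:
        return "POSITIVE"
    return "NEUTRAL"

print(polarity_by_sum("I am happy but a bit bored".split()))
print(polarity_by_count("sad and afraid but good".split()))
```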

7.2.1.2. Heuristic rules for Polarity detection

In order to overcome the limitations of simple keyword spotting, most rule-based systems use some kind of heuristics to predict polarity. Handling negation is an important concern in opinion and sentiment related analysis, as it can reverse the meaning of a statement. This task, however, is not trivial, as not all appearances of explicit negation terms reverse the polarity of the enclosing sentence, and negation can often be expressed in rather subtle ways, e.g., sarcasm and irony, which are quite difficult to detect.

To supplement the lexical information provided by a polarity lexicon, the following clues are commonly used to detect sentiment in a text: special punctuation, complete upper-case words, cross-linguistic onomatopoeias, exclamation words, degree adverbs, and emoticons. Recent studies have underlined that position is particularly relevant in the context of sentiment summarisation. In particular, in contrast to topic-based text summarisation, where the incipits of articles usually serve as a strong baseline, the last n sentences of a review have been shown to serve as a much better summary of the overall sentiment of the document, and to be almost as good as the n (automatically computed) most subjective sentences.

Below is an example of a heuristic rule taken from [Rodriguez-Penagos et al, 2013] that takes into account not only the polarity value of individual words, but also the effect of quantifiers and negation:

if has polar word(CUE) then
    polarity = lex(P) - 0.5*lex(QP) - lex(N) + 0.5*lex(QN)
    if polarity > 0 then
        if has negation(CUE) then negative else positive end if
    else if polarity < 0 then
        if has negation(CUE) then positive else negative end if
    else
        if has negation(CUE) then positive else negative end if
    end if
else if has negation(CUE) then
    negative
else
    polarity = tlex(P) - 0.5*tlex(QP) - tlex(N) + 0.5*tlex(QN)
    if polarity < 0 then negative
    else if tlex(NEU) > 0 then neutral
    else if polarity > 0 then positive
    else if has negemo(CUE) then negative
    else if has posemo(CUE) then positive
    else unknown
    end if
end if

Figure 21: Heuristic polarity rule

7.2.2. Corpus-based approach

The most straightforward approach for corpus-based document annotation is to train a machine learning classifier, assuming that a set of annotated data already exists. Experiments demonstrate that if enough training data are available, it is relatively easy to build more or less accurate sentiment classifiers. The main drawback of this method is the scarcity of such a resource, or its total lack. One usual way to overcome this difficulty is to use rule-based methods to annotate a corpus in an unsupervised manner and then use the annotated corpus to train the model.
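The bootstrap idea of feeding rule-based (silver) annotations into a supervised classifier can be sketched as follows; the use of scikit-learn, the naive labelling rule and the toy data are assumptions made for illustration, not the EUMSSI implementation.

```python
# Sketch: use a rule-based annotator to label an unlabelled corpus, then
# train a supervised classifier on those silver labels. scikit-learn,
# the toy rule and the toy data are used only for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rule_label(text):
    """Very naive keyword-spotting rule producing silver labels."""
    return "POSITIVE" if "good" in text or "great" in text else "NEGATIVE"

unlabelled = [
    "fracking is a great opportunity for cheap gas",
    "fracking could pollute the groundwater",
    "good news for shale gas producers",
    "experts warn of serious environmental risks",
]
silver_labels = [rule_label(t) for t in unlabelled]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(unlabelled)
classifier = LogisticRegression().fit(X, silver_labels)

print(classifier.predict(vectorizer.transform(["cheap gas is good for industry"])))
```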

7.3. Scope of polarity and type of text in EUMSSI

Opinions and sentiments do not occur only at document level, nor are they limited to a single valence or target. Contrary or complementary attitudes toward the same topic or multiple topics can be present across the span of a document. However, a certain degree of continuity in the subjectivity labels of adjacent sentences may be assumed, as an author usually does not switch too frequently between being subjective and being objective, particularly in texts of limited length such as the ones we plan to tackle within EUMSSI. More specifically, within EUMSSI, we aim at extracting sentiment from statements expressed by an entity or person with respect to a given topic, both in news text and in user-generated content from Social Media.

7.3.1. News text

We want to detect the author's position in news text. For this, we are going to classify quoted statements, i.e. quotations that are attributable to a given person or organisation, for polarity. E.g.:

In his State of the Union address tonight President Obama said: “We produce more natural gas than ever before – and nearly everyone’s energy bill is lower because of it.”.


7.3.2. User-generated content from Social Media

We plan to analyse opinion from two main Social Media sources: Twitter and YouTube. That is, the following text types:

a. tweets, i.e. short messages, not longer than 140 characters, sent via the micro-blogging platform. E.g.:

#Fracking? Not in my #backyard!

b. comments on YouTube

For the sake of simplicity, we are going to assume that both single tweets and single quotations encompass just one opinionated unit, with one single value of polarity that refers to the pre-defined topic of the tweet or quotation.

7.4. Preliminary work: Benchmarking of Polar Lexicons

Much of the research work to date on sentiment and subjectivity analysis has been applied to English, with far less effort being invested in other languages. However, migrating resources from one language to another has proven feasible, as explained by [Banea et al, 2011]. In EUMSSI, we plan to start developing a system for English and then migrate strategies (and if necessary resources) to the other three languages. Our first step will be to perform a comparative evaluation of three of the most popular existing Polar Lexicons in English, using a simple Keyword Spotting approach, against a Gold Standard corpus. The results of this exercise will be threefold:

1. It will allow us to build an analysis pipeline or processing chain and integrate it into the EUMSSI platform.
2. It will also help us build an initial baseline, as well as an evaluation framework able to monitor progress of the EUMSSI Sentiment Analysis component.
3. It may potentially set the basis for an improved polarity lexicon resulting from a weighted merging of the existing resources.

We have used three different lexicons: OpinionFinder, SenticNet and LIWC. Each lexicon has its own way to express polarity. We have made an effort to map the different encodings into a common framework: a three value system (POSITIVE, NEGATIVE, NEUTRAL), such as the one used by RepLab (see section 7.4.5).

7.4.1. OpinionFinder

OpinionFinder [Wiebe and Riloff, 2005] is a lexicon compiled from a manually annotated corpus. It contains 8222 entries. Verbs are lemmatized but nouns are given in full form. The entries in the lexicon have been labelled for part of speech, as well as for reliability: those that appear most often in subjective contexts are strong clues of subjectivity, while those that appear less often, but still more often than expected by chance, are labelled weak. Each entry is also associated with a polarity label, indicating whether the corresponding word or phrase is positive, negative, or neutral. As an illustration, consider the following entry from the OpinionFinder lexicon:

type=strongsubj len=1 word1=agree pos1=verb stemmed1=y priorpolarity=positive

Figure 22: OpinionFinder lexical entry

This lexical entry indicates that the word “agree”, when used as a verb, is a strong clue of subjectivity and has a positive polarity. These values have been mapped, as a first step towards convergence to a common evaluation framework, into numeric values, following this rationale:

Positive + strongsubj = 1
Positive + weaksubj = 0.5
Neutral = 0
Both = 0
Negative + strongsubj = -1
Negative + weaksubj = -0.5

Figure 23: Mapping of OpinionFinder polarity information onto numeric values

7.4.2. SenticNet

SenticNet [Cambria et al, 2010] is a resource for opinion mining that aims to create a collection of commonly used ‘polarity concepts’, that is, concepts extracted from ConceptNet [Havasi et al, 2007] with relatively strong positive or negative polarity. It is encoded in RDF/XML format on the basis of HEO (Human Emotions Ontology), a high-level ontology for human emotions that supplies the most significant concepts and properties constituting the centrepiece for the description of every human emotion. SenticNet was inspired by SentiWordNet37, a lexical resource in which each WordNet synset is associated with three numerical scores describing how objective, positive and negative the terms contained in the synset are. However, differently from SentiWordNet (which also includes null-polarity terms), SenticNet does not contain concepts with neutral or almost neutral polarity, i.e., concepts with polarity magnitude close to zero. SenticNet is freely available online and currently contains more than 5,700 polarity concepts (nearly 40% of the Open Mind corpus [Stork, 1999]), many of which are multiword, such as ‘accomplish goal’ or ‘bad feeling’. This is the format of the entry “agree” in SenticNet:

agree

37 http://sentiwordnet.isti.cnr.it/


0.723 0 0 0.98 0.568

Figure 24: SenticNet lexical entry

For computation of polarity, we consider only the floating number value of the ‘polarity’ feature.

7.4.3. LIWC

The LIWC2007 Dictionary [Pennebaker et al., 2001] is composed of 2,290 words and word stems. Each word or word stem belongs to one or more word categories or subdictionaries. For example, the word 'cried' is part of four word categories: sadness, negative emotion, overall affect, and past tense verbs. Hence, if it is found in the target text, each of these four subdictionary scale scores will be incremented. As in this example, many of the LIWC2007 categories are arranged hierarchically. All anger words, by definition, will be categorized as ‘negative emotion’ and overall ‘emotion’ words. Each of the 74 preset LIWC2007 categories is composed of a list of dictionary words that define that scale. Below is the entry “agree” for illustration, together with the labels for the relevant category values:

agree 125 126 462    [125: affec; 126: posemo; 462: assent]

Figure 25: LIWC lexical entry

The categories from the table that have been considered for polarity are the following:
● Posemo, Posfeel and Optim are mapped into the numerical value 1.
● Negemo, Anx, Anger, Sad, Death and Swear are mapped into -1.
● The rest are considered neutral and therefore mapped into 0.


The LIWC dictionary is also available in Spanish and German. French, Italian and Dutch versions are currently under development.

7.4.4. Computation of polarity

For this experiment, we try two different methods of computing the polarity of a given textual unit (i.e. a tweet or a quotation):

● summation of the values of all the polar words and mapping to the three-value system, e.g. NEGATIVE<0; NEUTRAL=0; POSITIVE>0. However, given that SenticNet has continuous values, the following mapping seems more accurate:

 NEGATIVE < -0.5
 POSITIVE > 0.5
 -0.5 < NEUTRAL < 0.5

Figure 26: Mapping of SenticNet polarity information onto numeric values

● computation of the ratio of positive words vs negative words.

The two methods give identical results for LIWC and the second performs slightly worse for the other two.

7.4.5. Evaluation framework

We have compared the performance of the three resources on Twitter text, using the RepLab 2013 dataset as Gold Standard [Amigó et al., 2013]. Filtering for language and for the number of opinionated units per tweet (we keep only those with a unique polarity), as well as discarding those that cannot be crawled because they have disappeared from the network, we end up with a total of 74888 tweets in English, annotated for polarity. For this experiment we have used a simple strategy: keyword spotting and sum of polarities, with no use of heuristic rules. For each one of the results we compute its confusion matrix and, based on these values, the corresponding metrics. The confusion matrix reflects the number of times that the predicted class and the known class (according to a groundtruth or gold standard) are confused or mistaken by the system.

tpPS: true positives for POSITIVE
tpNG: true positives for NEGATIVE
tpNT: true positives for NEUTRAL
ePSNG: errors where POSITIVES have been classified as NEGATIVES
ePSNT: errors where POSITIVES have been classified as NEUTRALS
eNGPS: errors where NEGATIVES have been classified as POSITIVES
eNGNT: errors where NEGATIVES have been classified as NEUTRALS
eNTPS: errors where NEUTRALS have been classified as POSITIVES
eNTNG: errors where NEUTRALS have been classified as NEGATIVES

Figure 27: Confusion matrix


This is a case of three-class classification: positive, negative and neutral. However, in the real task, we are actually more interested in two of these classes, positive and negative, which together constitute a class by themselves, namely opinionated. With the information that the confusion matrix provides, we are going to assess the three systems in two respects:

1. How well does the system detect opinionated tweets, in terms of precision, recall and f-measure. With this purpose, we are going to split our search space into two classes NEUTRAL and OPINIONATED, which contains both the POSITIVE and NEGATIVE class.

Precision in Detection = tpOP / (tpOP + fpOP)
Recall in Detection = tpOP / (tpOP + fnOP)
F-measure in Detection = 2 x ((Precision x Recall) / (Precision + Recall))

where:
True Positives of OPINIONATED: tpOP = tpPS + tpNG + ePSNG + eNGPS
False Positives of OPINIONATED: fpOP = eNTPS + eNTNG
False Negatives of OPINIONATED: fnOP = ePSNT + eNGNT

(A code sketch of these computations is given after this list.)

Figure 28: Precision / recall / f-measure for detection of Opinionated

2. How accurate is the system in deciding whether a certain opinionated tweet has negative or positive polarity.
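The detection metrics defined in Figure 28 can be computed directly from the nine confusion-matrix cells, as the following sketch shows; the cell counts used here are made-up placeholders, not actual evaluation results.

```python
# Sketch of the opinionated-detection metrics from Figure 28, computed
# from the nine confusion-matrix cells. The counts are placeholders.
cm = {
    "tpPS": 120, "tpNG": 80, "tpNT": 300,
    "ePSNG": 15, "ePSNT": 40,
    "eNGPS": 20, "eNGNT": 35,
    "eNTPS": 50, "eNTNG": 45,
}

tpOP = cm["tpPS"] + cm["tpNG"] + cm["ePSNG"] + cm["eNGPS"]
fpOP = cm["eNTPS"] + cm["eNTNG"]
fnOP = cm["ePSNT"] + cm["eNGNT"]

precision = tpOP / (tpOP + fpOP)
recall = tpOP / (tpOP + fnOP)
f_measure = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2%}  R={recall:.2%}  F={f_measure:.2%}")
```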

7.4.6. Analysis of results

The following tables summarize the results obtained. Table 12 shows the results obtained by the three systems for precision, recall and F-measure in the task of detecting opinionated tweets. As can be seen, precision is quite similar for the three systems, but recall is much lower for two of them, with OpinionFinder being the system that obtains the best value for the combined F-measure metric.

                            OpinionFinder    SenticNet    LIWC 2007
Precision in detection %    73.33            75.29        75.18
Recall in detection %       68.50            37.02        33.66
F-measure in detection %    70.83            49.64        46.50

Table 12: Detection of opinionated tweets


Table 13 shows precision and recall in classifying positive and negative tweets within the group of opinionated tweets. Here the results are less clear-cut. What can be observed is that all systems have more trouble classifying NEGATIVE tweets than POSITIVE ones, as deduced from the low F-measures obtained for this class. Interestingly, LIWC2007's precision for negative beats its precision for positive, although recall is still very low.

                     OpinionFinder    SenticNet    LIWC 2007
Precision %    POS   76.56            91.36        56.93
               NEG   50.34            25.71        65.50
Recall %       POS   86.92            85.67        86.39
               NEG   33.25            37.99        28.38
F-measure %    POS   81.41            88.42        68.67
               NEG   40.05            30.67        39.60

Table 13: Precision and Recall for polarity classification (positive/negative) for opinionated tweets

7.5. Next steps

These are the next issues that we plan to tackle:

 Implement an extractor of quotations (direct and certain cases of indirect speech).
 Improve our baseline Sentiment Analysis system for English, by:
   o Merging existing lexicons.
   o Including heuristic rules to deal with negation and quantifiers.
   o Using our enhanced rule-based system to automatically annotate a training corpus for a corpus-based approach.
 Test the enhanced Sentiment Analysis on other text types: YouTube comments, quotations extracted from news.
 Build Sentiment Pipelines for ES, DE and FR, based on the research results on EN, using native resources38 and, if necessary, expanded with resources transferred from English, following the methodologies presented by [Banea et al, 2011].

38 Existing lexical resources in languages other than English: SentiMerge (DE) and LIWC (FR, ES). Existing Gold Standard corpora: Semeval (EN), RepLab (EN, ES) and SenTube (EN, ES).


8. ANALYSIS OF OTHER TEXT TYPES: OCR, ASR

Preliminary experiments with other text types such as the output of optical character recognition (OCR) and automatic speech recognition (ASR) have been conducted.

8.1. Optical Character Recognition (OCR)

OCR provides an n-best candidate list with attached scores. Hence, cross-media inference mechanisms can use information from other text types to enhance the OCR output by re-ranking the candidate list.

Example: four hypotheses for the input “Fracking Angst”. The fourth, lowest-ranking hypothesis is correct. Here cross-media inferences can improve text analysis.

Fucking Angst
I- racking Angst
F- racking Angst
Fracking Angst

Figure 29: Candidate hypotheses of OCR
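The re-ranking idea can be sketched as follows: OCR hypotheses that contain terms already known from other modalities (e.g. keyphrases of the surrounding article) are boosted. The candidate scores, the set of context terms and the boost factor in this sketch are assumptions made for illustration.

```python
# Sketch of cross-media re-ranking of an OCR n-best list: hypotheses that
# contain terms known from other modalities (e.g. KEA keyphrases of the
# accompanying article) are boosted. Scores and boost factor are assumptions.

candidates = [  # (hypothesis, OCR confidence), best-first
    ("Fucking Angst", 0.61),
    ("I- racking Angst", 0.58),
    ("F- racking Angst", 0.55),
    ("Fracking Angst", 0.52),
]
context_terms = {"fracking", "angst", "germany"}  # e.g. from keyphrase extraction

def rerank(candidates, context_terms, boost=0.2):
    def score(item):
        text, confidence = item
        overlap = sum(1 for tok in text.lower().split() if tok in context_terms)
        return confidence + boost * overlap
    return sorted(candidates, key=score, reverse=True)

for text, confidence in rerank(candidates, context_terms):
    print(text, confidence)
```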

A major feature of OCR output is that it does not contain full sentences and that names occur with minimal or no context. These properties do not pose any major problems for the statistical tools used in EUMSSI. A cursory analysis shows that Stanford NER performs reasonably well on OCR output if the capitalization is correct. Applying KEA leads to keyphrases that are possibly too small, e.g. the string “New York” is assigned three keyphrases: “New”, “York” and “New York”.

8.1.1. Impact

N-best lists of OCR output and DBpediaSpotlight analysis allow for cross-media enhancement of OCR output and Named Entity Linking.

8.1.2. Next steps

1. Setting up the process chain. 2. Closer examination of OCR output.


8.2. Automatic Speech Recognition (ASR)

The ASR output has three properties that are potential problems for tools:

1. ASR output lacks punctuation marks.
2. ASR provides noisy, possibly ungrammatical text.
3. ASR output lacks upper case letters, i.e. all letters are in lower case.

Some preliminary text analysis experiments indicate that the lack of punctuation marks does not cause any serious problems, probably because the text processing tools chosen in EUMSSI are predominantly based on statistical methods which do not depend on sentence boundaries. ASR provides chunks of continuous text separated by line breaks. These chunks can be processed by the text analysis tools. Noisy, possibly ungrammatical text does not pose serious problems either.

The lack of upper case letters is more of a problem. E.g. the regular Stanford NER does not provide any acceptable results for lower case text because the models are case-sensitive. Here, training separate lower case models is necessary. For English, Stanford already provides a lower case model. For the other EUMSSI languages DE, FR and ES such models need to be trained with lower case training data, which can be generated by lower-casing the existing annotated training data. DBpediaSpotlight provides reasonable results for lower case text; however, a cursory look at the output suggests that multi-word names that contain regular words, such as the “new” in “new york”, are problematic in that these words are not recognized as being part of the name. Here using Stanford NER as a TrueCaser might help in improving the results. KEA keyphrase extraction handles text in lower case since it uses case folding.
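Generating lower case training data from the existing annotated data only requires lower-casing the token column while keeping the labels untouched. A minimal sketch for a CoNLL-style "token TAB label" file is given below; the file names and the exact column layout are assumptions of this illustration.

```python
# Sketch: derive lower case NER training data from existing annotated data
# by lower-casing the token column and keeping the labels untouched.
# The CoNLL-like "token<TAB>label" layout and the file names are assumptions.

def lowercase_conll(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if not line:                      # empty line = sentence boundary
                fout.write("\n")
                continue
            token, label = line.split("\t")
            fout.write(f"{token.lower()}\t{label}\n")

# Example call (hypothetical file names):
# lowercase_conll("ner_train_de.tsv", "ner_train_de.lower.tsv")
```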

Example of DBpediaSpotlight demo analysis with new york in lower case letters:

Figure 30: DBpediaSpotlight links for lower case input

Internal output for “york”:

JCasResource
sofa: _InitialView
begin: 85
end: 89
similarityScore: 0.8622683393136671
support: 10565
types: "Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,DBpedia:Settlement"
URI: http://dbpedia.org/resource/York

Figure 31: DBpediaSpotlight output for “york”


Example of DBpediaSpotlight demo analysis with New York in trueCased upper case letters: (All names recognized by StanfordNER are capitalized.)

Figure 32: DBpediaSpotlight links for trueCased output

Internal output for “New York”:

JCasResource
sofa: _InitialView
begin: 81
end: 89
similarityScore: 0.8882072410057962
support: 135979
types: "Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,DBpedia:Region,Schema:AdministrativeArea,DBpedia:AdministrativeRegion"
URI: http://dbpedia.org/resource/New_York

Figure 33: DBpediaSpotlight output for “New York”

In summary, the strategies for processing lower case input text are:

1. Stanford NER: use lower case models.
2. DBpediaSpotlight: precede it with a TrueCaser to improve results.
3. KEA: no measures needed.

The experiments conducted so far have been done with English. It is expected that ES and FR can be treated in a similar way, since here capitalization is used in a similar way, i.e. it is mostly used for names. DE is different in that all nouns are capitalized. Here it has to be seen whether using Stanford NER as a TrueCaser is successful or whether other strategies need to be developed.

8.2.1. Impact

Using Stanford NER with lower case models as a TrueCaser improves the text analysis output for ASR for EN.

8.2.2. Next steps

1. Training lower case models for DE, FR, ES.
2. Developing special trueCasing strategies for DE if needed.


9. CONCLUSIONS AND NEXT STEPS

So far, a first prototype for text analysis with basic components for Named Entity Recognition, Named Entity Disambiguation, Named Entity Linking, Keyphrase Extraction and Sentiment Analysis has been set up for some or all four EUMSSI languages. Furthermore, the applicability of the tools to other text types such as the output of OCR and ASR has been explored, and remedies for potential problems such as lower case text have been outlined.

The workplan for the upcoming period includes:
 enhancement of the current modules following the lines expressed in the previous sections;
 cross-media and cross-tool information enhancement (WP5), i.e.:
   o what useful information can we provide to audio, video or social media analysis?
   o in which way can the analysis results of different tools enhance each other?
 as more detailed application scenarios and visualisation devices (timelines, tag-clouds, search filters...) get developed in WP2 and WP6, be ready to provide whatever text annotations are necessary for these scenarios (e.g. text snippets, more fine-grained topic information, semantic role labelling, ...).


10. REFERENCES

Amigó, E., J. Carrillo-de-Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martín, et al. (2013). Overview of RepLab 2013: Evaluating Online Reputation Monitoring Systems. CLEF '13, pp 333–352.
Banea, C., R. Mihalcea, J. Wiebe (2011). Multilingual sentiment and subjective analysis. In: Multilingual Natural Language Processing.
Blei, D., A. Ng, M. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp 993–1022.
Cambria, E., R. Speer, C. Havasi, A. Hussain (2010). SenticNet: A publicly available semantic resource for opinion mining. In: AAAI CSK, pp 14–18, Arlington.
Carl, Michael, Antje Schmidt-Wigger (1998). Shallow Post Morphological Processing with KURD. In: 3rd International Conference on New Methods in Language Processing (NeMLaP'98), Sydney, Australia.
Choi, Jinho D. (2012). Optimization of Natural Language Processing Components for Robustness and Scalability. Ph.D. Thesis, University of Colorado Boulder, Computer Science and Cognitive Science.
Daiber, Joachim, Max Jakob, Chris Hokamp, Pablo N. Mendes (2013). Improving Efficiency and Accuracy in Multilingual Entity Extraction. In: Proceedings of the 9th International Conference on Semantic Systems (I-Semantics). Graz, Austria, pp 4–6.
Eckart de Castilho, R., I. Gurevych (2014). A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In: Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, to be published, Dublin, Ireland.
Faruqui, Manaal, Sebastian Pado (2010). Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In: Proceedings of KONVENS.
Finkel, Jenny Rose, Trond Grenager, Christopher Manning (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: Proceedings of the 43rd Annual Meeting of the ACL, pp 363–370, Ann Arbor.
Frank, E., G. W. Paynter, I. H. Witten, C. Gutwin, C. G. Nevill-Manning (1999). Domain-specific keyphrase extraction. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp 668–673, San Francisco, USA: Morgan Kaufmann Publishers.
Gangemi, Aldo (2013). A Comparison of Knowledge Extraction Tools for the Semantic Web. In: The Semantic Web: Semantics and Big Data, Proceedings of the 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, Springer, pp 351–366.
Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
Havasi, Catherine, Rob Speer, Jason Alonso (2007). ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge. In: Proceedings of Recent Advances in Natural Language Processing.
Jones, Steve, Stephen Lundy, Gordon W. Paynter (2002). Interactive document summarisation using automatically extracted keyphrases. In: Proceedings of the 35th Hawaii International Conference on System Sciences.
Lavrenko, V., W. B. Croft (2001). Relevance-based language models. In: Proceedings of ACM SIGIR, pp 120–127.
Li, W., A. McCallum (2006). Pachinko Allocation: DAG-structured mixture models of topic correlations. In: Proceedings of ICML, Pittsburgh, PA, pp 577–584.
Liu, B., L. Zhang (2012). A Survey of Opinion Mining and Sentiment Analysis. In: C. C. Aggarwal, C. Zhai (eds.). Mining Text Data. Springer US, pp 415–463.
Maas, Heinz-Dieter, Christoph Rösener, Axel Theofilidis (2009). Morphosyntactic and Semantic Analysis of Text: the MPRO Tagging Procedure. In: Mahlow, C., M. Piotrowski (eds.). State of the Art in Computational Morphology. Workshop on Systems and Frameworks for Computational Morphology, SFCM 2009, Zurich, Switzerland, Proceedings. Springer, Berlin, Heidelberg, New York.
Manning, Christopher D., Mihai Surdeanu, John Bauer (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In: ACL 2014.
Maqsud, Umar, Sebastian Arnold, Michael Hülfenhaus, Alan Akbik (2014). Nerdle: Topic-Specific Question Answering Using Wikia Seeds. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pp 81–85.
Medelyan, O., I. H. Witten (2005). Thesaurus-based index term extraction for agricultural documents. In: Proceedings of the 6th Agricultural Ontology Service (AOS) workshop at EFITA/WCCA 2005, Vila Real, Portugal.
Medelyan, O., I. H. Witten (2006). Thesaurus Based Automatic Keyphrase Indexing. In: Proceedings of the Joint Conference on Digital Libraries 2006, Chapel Hill, NC, USA, pp 296–297.
Mendes, Pablo N., Max Jakob, Andrés García-Silva, Christian Bizer (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria.
Nothman, Joel, Nicky Ringland, Will Radford, Tara Murphy, James R. Curran (2012). Learning multilingual named entity recognition from Wikipedia. In: Artificial Intelligence 194, Elsevier, pp 151–175.
Pennebaker, James W., Martha E. Francis, Roger J. Booth (2001). Linguistic Inquiry and Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates.
Rizzo, Giuseppe, Marieke van Erp, Raphael Troncy (2014). Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web. In: LREC 2014.
Rodríguez-Penagos, Carlos, Jordi Atserias, Joan Codina-Filbà, David García-Narbona, Jens Grivolla, Patrik Lambert, Roser Saurí (2013). FBM: Combining lexicon-based ML and heuristics for Social Media Polarities. In: Second Joint Conference on Lexical and Computational Semantics, Volume 2: Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp 483–489.
Salton, Gerard, Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. In: Information Processing and Management: an International Journal, Volume 24, Issue 5.
Stork, D. G. (1999). The Open Mind Initiative. In: IEEE Intelligent Systems & their Applications 14(3), pp 19–20.
Strötgen, Jannik, Michael Gertz (2013). Multilingual and Cross-domain Temporal Tagging. In: Language Resources and Evaluation.
Wiebe, J., E. Riloff (2005). Creating subjective and objective sentence classifiers from unannotated texts. In: Proceedings of the 6th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2005), invited paper, Mexico City, Mexico.
Witten, Ian H., Gordon W. Paynter, Eibe Frank, Carl Gutwin, Craig G. Nevill-Manning (2000). KEA: Practical Automatic Keyphrase Extraction. Working paper series, The University of Waikato, ISSN 1170-487X.
Yi, Xing, James Allan (2009). A Comparative Study of Utilizing Topic Models for Information Retrieval. In: M. Boughanem et al. (eds.). Advances in Information Retrieval, 31st ECIR, LNCS 5478, Springer-Verlag Berlin Heidelberg, pp 29–41.

Web references have been listed as footnotes throughout the document.


11. GLOSSARY

ASR: Automatic Speech Recognition
DKPro: Darmstadt Knowledge Processing Repository
DOW: Description of Work document
DW: Deutsche Welle
EUMSSI: Event Understanding through Multimodal Social Stream Interpretation
GFAI: Gesellschaft zur Förderung Angewandter Informatik (Society for the Promotion of Applied Computer Science)
IDIAP: The Idiap Research Institute, an independent, non-profit research foundation affiliated with the Ecole Polytechnique Fédérale de Lausanne
KEA: Keyphrase Extraction Algorithm
LDA: Latent Dirichlet Allocation
LIUM: Laboratoire d'Informatique de l'Université du Maine (Le Mans)
LUH: Leibniz Universität Hannover
NED: Named Entity Disambiguation
NEL: Named Entity Linking
NER: Named Entity Recognition
OCR: Optical Character Recognition
UIMA: Unstructured Information Management Architecture
UPF: Universitat Pompeu Fabra
VSN: Video Stream Networks
