Improving Performance Quality and User Experience in the PULS News Mining System
Total Page:16
File Type:pdf, Size:1020Kb
Date of acceptance Grade Instructor Improving performance quality and user experience in the PULS News Mining system Mian Du Helsinki September 17, 2012 UNIVERSITY OF HELSINKI Department of Computer Science HELSINGIN YLIOPISTO — HELSINGFORS UNIVERSITET — UNIVERSITY OF HELSINKI Tiedekunta — Fakultet — Faculty Laitos — Institution — Department Faculty of Science Department of Computer Science Tekijä — Författare — Author Mian Du Työn nimi — Arbetets titel — Title Improving performance quality and user experience in the PULS News Mining system Oppiaine — Läroämne — Subject Computer Science Työn laji — Arbetets art — Level Aika — Datum — Month and year Sivumäärä — Sidoantal — Number of pages September 17, 2012 85 pages + 48 appendices Tiivistelmä — Referat — Abstract Pattern-based Understanding and Learning System (PULS) can be considered as one key component of a large distributed news surveillance system. It is formed by the following three parts, 1. an Information Extraction (IE) system running on the back-end, which receives news articles as plain text in RSS feeds arriving continuously in real-time from several partner systems, processes and extracts information from these feeds and stores the information into the database; 2. a Web-based decision support (DS) system running on the front-end, which visualizes the information for decision-making and user evaluation; 3. both of them share the central database, which stores the structured information extracted by IE system and visualized by decision support system. In the IE system, there is an increasing need to extend the capability of extracting information from only English articles in medical and business domain to be able to handle articles in other languages like French, Russian and in other domains. In the decision support system, several new ways of Information Visualization and user evaluation interfaces are required by users for getting better decision support. In order to achieve these new features, a number of approaches including Information Extrac- tion, machine learning, Information Visualization, evolutionary delivery model, requirements elic- itation, modelling, database design approach and a variety of evaluation approaches have been investigated and adopted. Besides, various programming languages such as Lisp, Java, Python, JavaScript/Jquery, etc. have been used. More importantly, appropriate development process has been followed. This thesis reports on the whole process followed to achieve the required improve- ments made to PULS. ACM Computing Classification System (CCS): A.1 [Introductory and Survey], I.7.m [Document and text processing] Avainsanat — Nyckelord — Keywords Information Extraction, Information Visualization, Machine Learning, Decision Support Säilytyspaikka — Förvaringsställe — Where deposited Muita tietoja — övriga uppgifter — Additional information ii Contents 1 Introduction 1 1.1 Project Objectives . 3 1.2 Project Motivations . 4 1.2.1 Features to support both development and end-users experiences 4 1.2.2 Benefits for research purpose . 4 1.3 Approaches . 5 1.4 Overview . 6 2 Background 8 2.1 Information Extraction . 8 2.1.1 Common structure of IE systems . 10 2.1.2 Methods of extraction . 11 2.2 Information Visualization and Decision Support System . 13 2.2.1 Information Visualization . 13 2.2.2 Decision Support System . 16 2.3 Programming languages . 18 2.3.1 Lisp . 18 2.4 Chapter summary . 19 3 Requirements 20 3.1 Requirements Elicitation . 20 3.1.1 PULS changes . 20 3.1.2 Literature resources . 21 3.1.3 Users’ specific requirements . 22 3.1.4 Improving requirements from researchers’ point of view . 23 3.2 Requirements Specification . 24 3.2.1 Functional requirements . 24 3.2.2 Non-functional requirements . 25 iii 3.3 Requirements priority . 26 3.4 Chapter summary . 26 4 Design 27 4.1 Functional Design . 27 4.1.1 UML Introduction . 27 4.1.2 Static Structure Diagram . 28 4.1.3 Use Case Diagram . 28 4.1.4 Class Diagram . 32 4.1.5 Activity Diagram . 32 4.2 Interface design . 32 4.2.1 Layout design . 34 4.2.2 Navigation design . 34 4.2.3 Content organization design . 36 4.2.4 Detailed content display design . 37 4.3 Database Design . 39 4.3.1 ER Diagram . 39 4.3.2 ER Diagram to relational database mapping . 39 4.4 Chapter summary . 41 5 Implementation 42 5.1 Implementation technologies and environment . 42 5.1.1 Programming environment . 42 5.1.2 XML used in PULS . 43 5.1.3 Other technologies . 43 5.2 Functionalities implementation . 44 5.2.1 PULS IE system . 44 5.2.2 PULS relevance classifiers . 46 5.2.3 PULS DS system . 48 iv 5.3 Database implementation . 56 5.4 Chapter summary . 56 6 Evaluation and Testing 57 6.1 Internal testing . 57 6.1.1 Unit testing . 57 6.1.2 Integrity testing . 59 6.2 External testing . 61 6.2.1 Feature checklist . 61 6.2.2 Usability testing . 61 6.3 Evaluation . 63 6.3.1 PULS IE system . 63 6.3.2 Relevance Classifier . 67 6.4 Chapter summary . 68 7 Conclusion & Further work 70 7.1 Conclusion . 70 7.2 Suggestion for further work . 72 References 73 Appendices 1a A brief history of IE systems 1b History of Information Visualization 1c Example figures of IV 1d Development History of DSS 1e Introduction to XML v 1f What made Lisp different? 2 PULS detailed requirements specifications 3 Implementation details 4 Testing examples 1 1 Introduction Along with the development of web and web technologies, the amount of electroni- cally accessible documents has been growing exponentially. Specifically to newspa- per articles, a huge number of newspaper websites, which cover all domains globally, have been constructed providing online daily news. Some systems then try to gather related articles from a specific domain in order to serve the specific interests. In the business area, Esmerk [Esm11], for example, monitors thousands of newspapers, magazines, trade journals, web sources and press releases from 130 countries in 30 different languages for retrieving articles related to business information every day mostly by manual selection. Europe Media Monitor [EMM11] is another example which gathers over 40,000 reports every day from news portals world-wide in 43 languages, classifies the articles, analyzes the news texts by extracting information from them, aggregates the information, issues alerts and produces intuitive visual presentations of the information found. The service provided by EMM is known as document retrieval, which generally applies a keywords matching process to filter out irrelevant news articles that do not contain the combination of the keywords and return the possibly related ones. For the purpose of disease outbreak surveillance, the keywords could be any disease names or their synonyms. However, only key- words matching process performs problematically when handling natural-language text and the following two types of wrong results would be easily returned, 1. a large number of irrelevant documents that contain the disease name but de- scribe something else rather than an outbreak event, such as medical research news discussing about some disease; 2. adding more keywords like "outbreak", "epidemics" or "pandemics" to the query in order to increase the precision, on the other hand, would eliminate many relevant articles which do not contain those specific keywords mentioned above since the number of different terms or sentences to describe an outbreak in plain text is almost infinite. Moreover, these article level events returned are in fact not sufficient for the purpose of outbreak surveillance because the real information we need in these articles about what (disease), where (location), who (human or animal), when (time) and other related information like number and status of victims is still unknown. Aiming to find all disease outbreaks in a specific country reported in the newspaper articles 2 in a day, one might try to read these possibly relevant articles and acquire really valuable information from them, the manual work done to find them could often be tedious and inefficient. Considering the overwhelming number of newspaper articles every day, there is an increasing need for an automated method to process a large number of articles efficiently and precisely analyze them and extract the valuable information from them. Such method is generally known as Information Extraction. Information Extraction was introduced in early 70s for extracting specific data from natural language text such as literature resources and newspaper articles. It has been widely used in many areas. There are a number of Information Extraction systems developed in recent years to support news surveillance system to better track disease outbreaks in medical domain, daily events in business area, etc. Pattern-base Understanding and Learning System (PULS), developed by University of Helsinki, provides such support for news surveillance. It is formed by the following three main components, 1. an Information Extraction (IE) system running on the back-end, which receives news articles as plain text in RSS feeds arriving continuously in real-time from several partner systems, processes and extracts information from these feeds and stores the information into the database; 2. a Web-based decision support (DS) system running on the front-end, which visualizes the information for decision-making and user evaluation; 3. both of them share the central database (DB), which stores the structured information extracted by IE system and visualized by decision support system. Originally, PULS was focusing on Information Extraction from English newspaper articles in disease outbreak and business domains. Recently, the focus of PULS has been extended to be able to handle articles written in other languages and cross- border security domain in order to meet users’ requirements. Besides, in the DS system, several new ways of Information Visualization and user evaluation inter- faces are needed for better decision support. This thesis reports the whole work progress made to develop new functionalities for improving performance quality and user experience in the PULS News Mining system. It presents a detailed, in-depth account of a body of work that has been published in several research papers, in- cluding, 3 1.