
Durham E-Theses

Financial information extraction using pre-defined and user-definable templates in the Lolita system

Costantino, Marco

How to cite: Costantino, Marco (1997) Financial information extraction using pre-defined and user-definable templates in the Lolita system, Durham theses, Durham University. Available at Durham E-Theses Online: http://etheses.dur.ac.uk/5030/

Use policy

The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or charge, for personal research or study, educational, or not-for-profit purposes provided that:

• a full bibliographic reference is made to the original source

• a link is made to the metadata record in Durham E-Theses

• the full-text is not changed in any way

The full-text must not be sold in any format or medium without the formal permission of the copyright holders.

Please consult the full Durham E-Theses policy for further details. Academic Support Office, Durham University, University Office, Old Elvet, Durham DH1 3HP e-mail: [email protected] Tel: +44 0191 334 6107 http://etheses.dur.ac.uk

University of Durham

Financial Information Extraction using pre-defined and user-definable

Templates in the LOLITA System.

Marco Costantino

Laboratory for Natural Language Engineering,

Department of Computer Science.

Submitted for the degree of

Doctor of Philosophy

©1997, Marco Costantino

The copyright of this thesis rests with the author. No quotation from it should be published without the written consent of the author and information derived from it should be acknowledged.

Abstract

Financial operators today have access to an extremely large amount of data, both quantitative and qualitative, real-time or historical, and can use this information to support their decision-making process.

Quantitative data are largely processed by automatic computer programs, often based on artificial intelligence techniques, that produce quantitative analysis, such as historical price analysis or technical analysis of price behaviour. In contrast, little progress has been made in the processing of qualitative data, which mainly consists of financial news articles from financial newspapers or on-line news providers. As a result, financial market players are overloaded with qualitative information which is potentially extremely useful but, due to the lack of time, is often ignored.

The goal of this work is to reduce the qualitative data-overload of the financial operators. The research involves the identification of the information in the source financial articles which is relevant for the financial operators' investment decision-making process, and the implementation of the associated templates in the LOLITA system.

The system should process a large number of source articles and extract specific templates according to the relevant information located in the source articles.

The project also involves the design and implementation in LOLITA of a user-definable template interface for allowing the users to easily design new templates using sentences in natural language. This allows user-defined information extraction from source texts. This differs from most existing information extraction systems, which require the developers to code the templates directly in the system.

The results of the research have shown that the system performed well in the extraction of financial templates from source articles, which would allow the financial operator to reduce his qualitative data-overload. The results have also shown that the user-definable template interface is a viable approach to user-defined information extraction. A trade-off has been identified between the ease of use of the user-definable template interface and the loss of performance compared to hand-coded templates.

Acknowledgements

I would like to thank my supervisor Richard G. Morgan for his advice and support throughout the three years over which this research has been conducted. I would also like to thank Russell J. Collingham for his help and support and Stephen Eckett for his invaluable advice on the financial aspects of the thesis.

I am also grateful to all past and present members of the Laboratory for Natural Language Engineering who have helped provide such a pleasant research and social environment. Thank you to all those people who commented on the various drafts of this thesis, particularly Luisa Mich and Luigi Colazzo.

Financial assistance for this project was received from the Opera Universitaria Society of the University of Trento. I would also like to thank the Department of Economics of the University of Trento for awarding me the University of Trento studentship for postgraduate studies abroad.

I would also like to thank all my friends who supported me during these three years and, particularly, Edy, Mika, Marianna, Paolo, Wendy, Janice, Giacomo, Max, Xavier and Sergio.

Finally I would like to thank my Mum for her support during my time at the University.

Declaration

The material contained within this thesis has not previously been submitted for a degree at the University of Durham or any other university. The research reported within this thesis has been conducted by the author unless indicated otherwise.

The copyright of this thesis rests with the author. No quotation from it should be published without his prior written consent and information derived from it should be acknowledged.

Contents

1 Methodological Introduction 1

1.1 Introduction 1

1.2 Traditional Natural Language Processing Approaches 4

1.2.1 Cognitive Science 4

1.2.2 Artificial Intelligence 5

1.2.3 Computational Linguistics 5

1.3 Natural Language Engineering 6

1.3.1 The General Philosophy of Natural Language Engineering . 6

1.3.2 Scale 7

1.3.3 Robustness 7

1.3.4 Maintainability 8

1.3.5 Flexibility 8

1.3.6 Integration 8

1.3.7 Feasibility 9

1.3.8 Usability 9

1.3.9 The use of a full range of techniques 10

1.3.10 Cost-Benefit Analysis 10

1.3.11 Motivation for Adopting the NLE Approach in finance ... 10

1.4 Methodological criteria for success 11

1.5 Context of this Work: the LOLITA project 12

1.6 Logical Progression of the Thesis 12

2 Finance and Financial Tools 14

2.1 Introduction 14

2.2 Quantitative tools 19

2.2.1 Statistical and probabilistic techniques 20

2.2.2 Artificial Intelligence techniques 26

2.3 Integrating conventional and intelligent systems 34

2.4 Qualitative tools 35

3 Information retrieval and information extraction 40

3.1 Introduction 40

3.2 Information Retrieval 40

3.2.1 Statistical and Probabilistic approaches to the Information Retrieval 42

3.2.2 The Linguistic approach to information retrieval 47

3.2.3 Knowledge-based approaches (weak methods) 48

3.3 The TREC competitions 50

3.3.1 Tasks 51

3.3.2 Techniques 53

3.3.3 Evaluation metrics 53

3.4 Information Extraction 56

3.4.1 The scripts / frames systems 59

3.5 The MUC competitions 64

3.5.1 The MUC Tasks 65

3.5.2 The main techniques employed 74

3.5.3 The evaluation of the MUC results 92

3.6 Information extraction in the financial domain 95

3.6.1 Automatic extraction of templates from source texts .... 99

3.7 User-definable template interfaces 99

3.8 Conclusions 103

4 The LOLITA System 104

4.1 Introduction 104

4.2 Architecture of the system 105

4.2.1 The Semantic Network 105

4.2.2 Syntactic analysis 109

4.2.3 Analysis of meaning 111

4.2.4 Inference 116

4.2.5 Generation 117

4.3 LOLITA Applications 117

4.3.1 Analysis of Text 118

4.3.2 Query 118

4.3.3 Translation 118

4.3.4 Database front-end 118

4.3.5 Chinese tutoring 119

4.3.6 Extraction of Objects 119

4.3.7 Dialogue 119

4.3.8 Information Extraction 120

5 Information extraction in the LOLITA System 121

5.1 Introduction 121

5.2 Architecture of the information extraction application 121

5.3 The LOLITA templates 123

5.3.1 Types of slots 125

5.4 Types of templates available in the system 128

5.4.1 Concept-based templates 128

5.4.2 Summary templates 130.

5.4.3 Hyper templates 131

5.4.4 The MUC-6 templates 131

5.5 Advantages and disadvantages of the implementation 133

6 Definition of the financial templates 134

6.1 Introduction 134

6.2 Designing a financial information extraction system 134

6.2.1 The type of system: real-time / batch 135

6.2.2 The information to be extracted (Task definition) 136

6.2.3 The user interface 140

6.3 The LOLITA financial information extraction system 140

6.3.1 The type of system 141

6.3.2 The information to be extracted 142

7 Implementation of the financial templates 148

7.1 Introduction 148

7.2 Defining the rules for the financial templates 148

7.2.1 Prototypes 150

7.2.2 Domain-specific knowledge 150

7.2.3 Unification 151

7.3 The takeover financial template 152

7.3.1 The takeover main-event 153

7.3.2 The takeover templates slots 161

8 Design of the user-definable template interface 171

8.1 Introduction 171

8.2 The user-definable template interface 171

8.2.1 The template interface environment 179

8.2.2 The template structure 180

8.2.3 The fill rules 184

9 Implementation of the user-definable template interface 191

9.1 Introduction 191

9.2 The user-definable template interface 191

9.2.1 Processing the user's template definitions 191

9.2.2 Producing templates using the inference system 199

10 Results and Evaluation . 210

10.1 Introduction 210

10.2 Evaluation of the definition and implementation of the financial templates 212

10.2.1 Evaluation of the financial templates 212

10.2.2 Evaluation of the performance of the financial templates 213

10.2.3 Evaluation techniques: the MUC-6 evaluation measures . . . 214

10.2.4 Evaluation of the results using the evaluation metrics .... 215

10.3 Evaluation of the user-definable template interface 219

10.3.1 Evaluation of the design and usability of the user-definable interface: evaluation task and data 220

10.3.2 Evaluation metrics and results 224

10.3.3 Evaluation of the performance of the user-defined templates 237

10.3.4 Evaluation of the takeover user-defined template against the hand-coded version 239

10.3.5 Evaluation of the non-interactive user-defined template against its hand-coded version 244

10.4 Definition of the other financial templates 246

10.5 Conclusions 250

11 Conclusions and future work 253

11.1 Aim 1: implementation of templates for extracting relevant information from financial articles 254

11.2 Aim 2: design and implementation of a user-definable template interface 255

11.3 Impact on the field of information extraction 256

11.4 Impact on the field of finance 257

11.5 Project Limitations and Suggestions for Further Work 257

11.5.1 Implementation of the other financial templates 258

11.5.2 Automatic extraction of templates 258

A Glossary 260

B User-Defined takeover templates 265

B.1 Sample data for evaluation of the pre-defined takeover template 265

B.2 Data for the evaluation of the user-definable takeover template 267

List of Figures

1.1 An article from The Financial Times 3

2.1 The kind of input and output of a quantitative tool 19

2.2 Different kinds of quantitative tools 20

2.3 A MACD indicator histogram 23

2.4 The main components of an expert system 29

2.5 The basic structure of a neural network 31

2.6 The kind of input and output of a qualitative tool 36

2.7 Different kinds of qualitative tools 36

3.1 A TREC-2 topic 51

3.2 The recall/precision curve 54

3.3 The recall/fallout curve 55

3.4 Information Extraction using fragments from the original document. 57

3.5 Information extraction using templates 58

3.6 A MUC-3 document in the terrorist domain 66

3.7 A MUC-3 terrorist template 67

3.8 A MUC-5 document in the joint-venture domain 69

3.9 The MUC-5 joint-venture template schema 70

3.10 A MUC-6 organisation template 72

3.11 The MUC-6 management scenario template 73

3.12 The generic information extraction system 74

3.13 SHOGUN's architecture 78

3.14 PLUM'S architecture 80

3.15 FASTUS's architecture 83

3.16 The Hasten MUC-6 system 87

3.17 The extraction by example in the Hasten System 88

3.18 The Alembic System 89

3.19 The NLToolset System 91

3.20 The MUC-5 evaluation measures 94

3.21 A typical article from an on-line news provider 96

3.22 The difference between articles from newspapers and on-line agencies. 98

3.23 A Hasten pattern (egraph) 102

4.1 The LOLITA core 105

4.2 Representation for a typical financial event in the semantic network. 107

4.3 The SemNet representation for 100 billion dollars 108

4.4 Example of parsing 112

4.5 Example of semantic analysis 113

4.6 Example of semantic disambiguation 114

4.7 Example of pragmatic analysis 115

4.8 Example of question handled by the inference engine 116

4.9 The LOLITA applications 117

5.1 Information extraction within the LOLITA system 122

5.2 The structure of a LOLITA template 124

5.3 The structure of the template slots 125

5.4 Information extraction within the LOLITA system . . 126

5.5 The slots in a concept-based template refer to the main-event. . . . 129

5.6 The slots in a summary template do not refer to the same concept. 131

6.1 A non-relevant article for a generic financial operator 139

7.1 The node "acquisition" in the Semantic Network 157

7.2 The node "acquire" in the Semantic Network . . .• 158

7.3 The acquisition prototype 159

7.4 The representation of the semantic structure 164

8.1 The fill-rules of the takeover templates of the first group of users. . 175

8.2 The fill-rules of the takeover templates supplied by the second group of users 176

8.3 The definition of a LOLITA template [BNF notation] 181

8.4 The stages of the definition of a template 183

8.5 The takeover template as defined for the template user-interface. . . 190

9.1 The representation of the template-name 192

9.2 The processing of a user-defined variable 194

9.3 The processing of the main-event 195

9.4 The representation of a slot-rule which refers to a variable ..... 198

9.5 The representation of a slot-rule which refers to the template name 198

9.6 The representation of a slot-rule which refers to other slots 199

9.7 Identification of candidate main-events by the inference system. . . 201

9.8 The modification of a slot-rule which refers to the template name . 205

9.9 The modification of a slot-rule which refers to a variable 205

9.10 A candidate event for the S=BANK-ADVISER-PREDATOR slot-rule. 206

10.1 The final score report for the takeover financial template 216

10.2 The recall for each of the slots of the pre-defined takeover template. 218

10.3 The precision for each of the slots of the pre-defined takeover template. 218

10.4 An example article where the slot "VALUE" has not been filled by the system 219

10.5 The description of the evaluation task available on the WWW server 221

10.6 The user-defined merger template available on the WWW server . . 222

10.7 An example submission of a user-definable template 222

10.8 The reply e-mail for a user-defined template 223

10.9 An example of natural language error 226

10.10 An example of error in the use of the variables 227

10.11 An example of error in the use of the variables 228

10.12 The time spent by a user for defining a template 232

10.13 Time spent reading the instructions (B) 233

10.14 Time spent entering the definition (A+C) 233

10.15 Total time spent defining the template (TT) 234

10.16 The non-interactive user-defined takeover template selected for the evaluation. 236

10.17 The final score report for the user-defined takeover financial template. 240

10.18 The recall for each of the slots of the user-defined takeover template. 241

10.19 The precision for each of the slots of the user-defined takeover template. 242

10.20 The final score report for the non-interactive user-defined takeover financial template 245

10.21 The final score report for the non-interactive takeover template hand-coded in the system 245

10.22 An example relevant article and merger template extracted by the system 251

Chapter 1

Methodological Introduction

1.1 Introduction

The goal of information extraction is to extract specific kinds of information from a source document [Riloff and Lehnert, 1994]. In other words, the input to the system is a document (e.g. a newspaper article), while the output consists of a summary of the contents of the document. For example, the source article could consist of the following text:

BELL ATLANTIC announced it will acquire Tele-Communications Inc with 18 billion dollars. The deal will radically change the US communications industry. It will also be one of the country's biggest ever takeovers. The deal - which is bound to face strong regulatory scrutiny - would be the first full merger between a US telephone company and a cable business at a time when the two industries are converging to create a single, multi-media inter-active entertainment and information business.

An information extraction system will try to produce a template of the original text according to a specification of the information defined during the design of the system. For example, a,sensible summary of the article shown above could be: Chapter 1: Methodological Introduction

Template: TAKEOVER
COMPANY_TARGET: Tele-Communications Inc.
COMPANY_PREDATOR: BELL ATLANTIC
TYPE_TAKEOVER: FRIENDLY
VALUE: 18 billion dollars.
ATTRIBUTION: BELL ATLANTIC

The summary presented above is called a template, which is a structure with a predefined number of slots, where the slots are filled with the information extracted from the source article.
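To make this structure concrete, the following minimal sketch (in Python, and purely illustrative: it is not the LOLITA implementation, whose template machinery is described in Chapters 5 and 7) represents such a takeover template as a record of named slots and fills it from the example sentence above with a single hypothetical pattern:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TakeoverTemplate:
    """A template: a fixed set of named slots filled from the source text."""
    company_target: Optional[str] = None
    company_predator: Optional[str] = None
    type_takeover: Optional[str] = None
    value: Optional[str] = None
    attribution: Optional[str] = None

def fill_takeover_template(text: str) -> TakeoverTemplate:
    # Toy rule: "<PREDATOR> announced it will acquire <TARGET> with <VALUE>".
    # A real system would use full semantic analysis, not a regular expression.
    match = re.search(r"(?P<pred>[A-Z][\w ]+?) announced it will acquire "
                      r"(?P<target>[\w.\- ]+?) with (?P<value>[\d.]+ \w+ dollars)",
                      text)
    template = TakeoverTemplate()
    if match:
        template.company_predator = match.group("pred")
        template.company_target = match.group("target")
        template.value = match.group("value")
        template.attribution = match.group("pred")  # the announcing company
        template.type_takeover = "FRIENDLY"         # crude default for this toy example
    return template

article = ("BELL ATLANTIC announced it will acquire Tele-Communications Inc "
           "with 18 billion dollars.")
print(fill_takeover_template(article))
```

In the LOLITA system the fill rules operate on the semantic network built from the text (Chapters 4 and 7), so the regular expression above stands in for that analysis only for the sake of brevity.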

Players in the financial market today have access to an extremely large amount of data, both quantitative and qualitative, real-time or historical, and can use this information to support their decision-making process.

Quantitative data, such as historical price databases or real-time price information, are largely processed by automatic computer programs, often based on artificial intelligence techniques, that produce quantitative analysis, such as historical price analysis, price forecasts or technical analysis of price behaviour.

In contrast, little progress has been made in the processing of qualitative data, which mainly consists of financial news articles either from financial newspapers, such as The Financial Times, or from on-line news services, such as Dow Jones or Bloomberg. As a result, financial market players are overloaded with qualitative information which is potentially extremely useful but, due to the lack of time, is often ignored. In fact, the time needed for reading a financial article (e.g. the article shown in figure 1.1) is quite significant and the operator is often unable to assimilate the relevant information before an investment decision has to be taken.

The Argentine government is stepping up pressure on the nation's Congress to approve key economic legislation early in the new year. Legislators failed to vote on two important measures in extraordinary sessions held this week. Debate on the measures - a presidential request for additional taxing and budget-trimming powers and a bill to advance reform of federal-provincial government relations - was again suspended late on Thursday evening after opposition members of the lower house of Congress withdrew from the chamber. Their walk-out denied the chamber a quorum and pushed consideration of the bills into January. The government is now insisting that extraordinary congressional sessions be extended as far as February to win approval of the measures it argues are critical to repairing the nation's finances in 1996. Argentina's Congress normally takes a long southern summer recess over the months of December, January and February. The measure considered to be the most important is President Carlos Menem's request for extraordinary powers to cut expenditure and raise taxes, without congressional approval. The government argues that the proposed 'superpowers' are necessary to ensure that the 1996 budget is balanced and International Monetary Fund fiscal targets are met. The second major measure, the ambitiously titled Reform of the State legislation, aims to streamline financial relations between the national government in Buenos Aires and the country's 23 provinces. The federal government argues that elimination of duplication of services in the two tiers of government could save up to Dollars 1.5bn a year but is reluctant to confirm that efficiency measures may see the loss of up to 20,000 state sector jobs. The congressional backlog was cleared somewhat earlier this week when the upper house of Congress, the Senate, approved the 1996 budget law and a hotly debated measure to increase taxes on tobacco products. Argentine financial markets have maintained a firm tone in thin holiday-season trade this week, and investors appear content to give Congress some additional leeway to approve the government-sponsored measures. Economist Raul Buonuome of ING Barings said the markets were willing to accept that the Senate might not sign-off on the measures until March but had expected lower house approval this week.

Figure 1.1: An article from The Financial Times.

The goal of this work is to reduce the qualitative data-overload of the financial operators by extracting the relevant information from a large collection of source financial documents. In particular, the work has two main aims.

• Aim 1: implementation of templates for extracting relevant information from financial articles. The first goal of the research is to implement a set of financial templates in the LOLITA System. The system should process a large number of source articles, extract a number of templates with the relevant information for the financial operator identified in the source texts, and eliminate the non-relevant information.

• Aim 2: design and implementation of a user-definable template interface. The second aim of this research is to design and implement a user-definable template interface for allowing the end-user of the system to easily extract additional information from the source articles by designing new templates. The user-definable template interface should allow the user to easily define new templates using natural language sentences. The system should be able to process a large number of source articles and extract a number of templates containing the information corresponding to the definitions provided by the user.
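Chapters 8 and 9 describe the actual interface and its notation; purely to illustrate the idea of defining a template with natural language sentences, the hypothetical sketch below (not the LOLITA syntax) turns a small user definition into a main-event description plus a set of named slot rules:

```python
# Hypothetical user definition: the first sentence describes the main event,
# each "Slot" sentence describes how one slot should be filled.
user_definition = """\
The template is about a company acquiring another company.
Slot predator: the company that acquires.
Slot target: the company that is acquired.
Slot value: the amount paid for the acquisition.
"""

def parse_definition(definition: str) -> dict:
    """Split a user definition into a main-event sentence and named slot rules."""
    parsed = {"main_event": None, "slots": {}}
    for line in definition.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.lower().startswith("slot "):
            name, _, rule = line[len("slot "):].partition(":")
            parsed["slots"][name.strip()] = rule.strip().rstrip(".")
        else:
            parsed["main_event"] = line.rstrip(".")
    return parsed

print(parse_definition(user_definition))
```

In the real interface each of these sentences would be analysed by the core natural language components rather than by string matching; the sketch only shows the overall data flow from a textual definition to a template structure.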

This chapter discusses some important background methodological issues, focusing on the methodology chosen for this work: Natural Language Engineering (NLE).

1.2 Traditional Natural Language Processing Approaches

Natural Language Processing (NLP) has been influenced by many disciplines concerned with the automatic processing of natural language by computers. This section will briefly outline the three main disciplines to which researchers in this field tend to belong.

1.2.1 Cognitive Science

The aim of cognitive scientists is generally to emulate the processes of the brain in a computational model. The target of researchers in cognitive science specialising in natural language processing is therefore to try to model the brain's communication processes. The work in this area is normally founded on psychological and sometimes physiological experiments on how humans process language, normally performed on children or on people with language disabilities. Unlike other methodologies, the goal of cognitive scientists specialising in natural language processing is therefore not only to design systems which have the same characteristics as humans do, but also to model the processes that govern this behaviour. This research does not always result in a computer program but, often, only in a computational model which could potentially be implemented. However, even if these models were implemented, they may be able to work only in very restricted areas and domains.

1.2.2 Artificial Intelligence

The aim of researchers with a background in artificial intelligence is to simulate human behaviour, without the restriction that the processes that lead to this behaviour have to be the same as those of the human brain. Researchers in artificial intelligence therefore use whatever means are available to produce human-like behaviour. Unlike cognitive scientists, researchers in artificial intelligence tend to cover a wider scope than the isolated processes studied by cognitive scientists. Artificial intelligence is the background approach adopted in the work presented in this thesis.

1.2.3 Computational Linguistics

The term computational linguistics was originally concerned with "the application of a computational paradigm to the scientific study of human language" [Ballard and Jones, 1990]. The advent of computers provided a tool for linguistics to test and then develop linguistic theories. More recently, computational linguistics has expanded to include the "engineering of systems that process or analyse written or spoken language" [Ballard and Jones, 1990]. In practice, the term computational linguistics is now used by a variety of researchers working on different aspects of linguistics.

1.3 Natural Language Engineering

As mentioned above, this work is concerned with the automatic extraction of information from financial articles. The methodology chosen for this work is Natural Language Engineering (NLE).

Natural Language Engineering is a recent endeavour which applies ideas and practices of other engineering disciplines to the field of NLP. The EEC LRE programme [Ire, 1992] defines Linguistic Engineering as follows:

"Linguistic Engineering (LE) is an engineering endeavour, which is to combine scientific and technological knowledge in a number of relevant domains (descriptive and computational linguistics, lexicology and terminology, formal languages, computer science, software engineering techniques etc.). LE can be seen as a rather pragmatic approach to computerised language processing, given the current inadequacies of theoretical Computational Linguistics."

1.3.1 The General Philosophy of Natural Language Engineering

The goal of traditional approaches to Natural Language Processing, including cognitive science, computational linguistics and artificial intelligence, has been either to formulate universal theories able to cover all aspects of language or to develop very restricted theories and prototypes to be used in extremely limited domains.

The implementation of systems covering larger domains based on these ideas has proved a difficult task.

The main idea behind Natural Language Engineering (NLE) is therefore to adopt a set of engineering criteria in the field of Natural Language Processing.

Natural Language Engineering is concerned with the utilisation of existing technology in order to produce useful systems. As new technologies and theories become available from more traditional sciences, they can be implemented in a natural language engineering system.

The following subsections will detail important aspects of Natural Language Engineering.

1.3.2 Scale

The size of NLE systems must be sufficient for realistic large-scale applications [Smith, 1996]. For the design of a financial application, this aspect is crucial. Financial operators must be sure that the system is able to process any sort of input data: skipping some data can lead to a loss of extremely important information, which is essential for the decision-making process. The size of the system must therefore be sufficient to allow the processing of any sort of financial article. Properties such as the vocabulary size, grammar coverage and the number of word senses are critical.

For an information extraction system, scale also concerns the information resulting from the extraction process which is supplied to the user. This information must be sufficient in scale for representing all the relevant information in the source data. Properties such as the number of templates and slots are therefore essential.

1.3.3 Robustness

The success of a financial NLP application is directly dependent on robustness. The application must carry on and try its best to produce some sort of output from the input or, at least, inform the user of the failure. For a real-time use of the financial application, crashes or failures of the system must be avoided. In case of a crash, the operator would probably waste a considerable amount of time trying to find out what went wrong and where. If the system carries on without informing the user, the operator could lose information which is extremely relevant for the trading decision-making process without realising it.
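As a rough illustration of this requirement (again a hypothetical sketch, not the LOLITA code), a batch driver can isolate failures per article so that a single problematic input neither stops the whole run nor disappears silently:

```python
import logging

logging.basicConfig(level=logging.INFO)

def extract_templates(article: str) -> list:
    """Stand-in for the real extraction step, assumed to be provided elsewhere."""
    if not article.strip():
        raise ValueError("empty article")
    return [{"template": "TAKEOVER", "source": article[:40]}]

def process_batch(articles: list) -> list:
    """Process every article; report failures instead of crashing or skipping silently."""
    extracted = []
    for number, article in enumerate(articles):
        try:
            extracted.extend(extract_templates(article))
        except Exception as error:
            # Robustness requirement: the operator must be told exactly what failed.
            logging.error("article %d could not be processed: %s", number, error)
    return extracted

print(process_batch(["BELL ATLANTIC announced it will acquire ...", "   "]))
```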

1.3.4 Maintainability

Maintainability is a measure of how useful the system is over a long period of time. From a financial point of view, the application must be usable over a sufficiently long period of time to justify the large amount of money which would be needed for the development of a complete application. The system should allow the user to change and update the configuration - i.e. adding new templates to the existing ones, or adding domain-specific knowledge (new terms, new meanings of existing ones or domain-specific rules).
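For example (an illustrative sketch only; the actual configuration mechanism is not specified here), such user-level maintenance could amount to editing a declarative configuration listing the active templates and the domain-specific vocabulary, without touching the core system:

```python
# Hypothetical configuration, extended by the user rather than by the developers.
CONFIG = {
    "templates": ["takeover", "merger"],          # add new template names here
    "domain_terms": {
        "predator": "the company attempting an acquisition",
        "white knight": "a friendly alternative bidder",
    },
    "domain_rules": [
        "a takeover is hostile if the target company rejects the offer",
    ],
}

def register_template(name: str) -> None:
    """Activate an additional template if it is not already in the configuration."""
    if name not in CONFIG["templates"]:
        CONFIG["templates"].append(name)

register_template("privatisation")
print(CONFIG["templates"])      # ['takeover', 'merger', 'privatisation']
```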

1.3.5 Flexibility

Flexibility or portability is a measure of the ability to modify the system for different tasks in different domains [Smith, 1996]. For a complete financial application based on NLP, flexibility measures the ability of the system to perform tasks which are not necessarily restricted to one particular operator (e.g. a broker), but to process different kinds of data and produce different kinds of output information according to the type of user.

1.3.6 Integration

There are two related aspects of integration:

• System components should not make unreasonable assumptions about other components. Similarly, components should only perform tasks for which they have specifically been designed [Smith, 1996]. In the development of a financial information extraction system, the template-extraction module should be kept separate from the NLP core system. This separation would allow easier upgrades of the financial application by modifying the different modules of the system.

• System components should be designed and built to assist other components.

For example, the financial information extraction module should allow hyper-

Financial Information Extraction using pre-defined and user-definable

Templates in the LOLITA System.

Marco Costantino

Laboratory for Natural Language Engineering,

Department of Computer Science.

Submitted for the degree of

Doctor of Philosophy

©1997, Marco Costantino Abstract

Financial operators have today access to an extremely lairge amount of data,

both quantitative and- quahtative, real-time or historical and can use this informa• tion to support their decision-making process.

Quantitative data are largely processed by automatic computer programs, often based on artificial intelligence techniques, that produce quantitative analysis, such as historical price analysis or technical analysis of price behaviour. Differently, little progress has been made in the processing of qualitative data, which mainly consists of financial news articles from financial newspapers or on-line news providers. As a result the financial market players are overloaded with qualitative information which is potentially extremely useful but, due to the lack of time, is often ignored.

The goal of this work is to reduce the quahtative data-overload of the financial operators. The research involves the identification of the information in the source financial articles which is relevant for the financial operators' investment decision making process and to implement the associated templates in the LOLITA system.

The system should process a large number of source articles and extract specific templates according to the relevant information located in the source articles.

The project also involves the design and implementation in LOLITA of a user- definable template interface for allowing the users to easily design new templates using sentences in natural language. This allows user-defined information extrac• tion from source texts. This differs from most of existing information extraction systems which require the developers to code the templates directly in the system.

The results of the research have shown that the system performed well in the extraction of financial templates from source articles which would allow the finan• cial operator to reduce his qualitative data-overload. The results have also shown that the user-definable template interface is a viable approach to user-defined in• formation extraction. A trade-off has been identified between the ease of use of the user-definable template interface and the loss of performance compared to hand- coded templates. Acknowledgements

I would like to thank my supervisor Richard G. Morgan for his advice and support throughout the three years over which this research has been conducted. I would also like to thank Russell J. Collingham for his help and support and Stephen

Eckett for his invaluable advice on the financial aspects of the thesis.

I am also grateful for all past and present members of the Laboratory for Natural

Language Engineering who have helped provide such a pleasant research and social environment. Thank you to all those people who commented on the various drafts of this thesis, particularly Luisa Mich and Luigi Colazzo.

Financial assistance for this project was received from the Opera Universitaria

Society of the University of Trento. I would also like to thank the department of

Economics of the University of Trento for awarding me the University of Trento studentship for postgraduate studies abroad.

I would also like to thank you all my friends who supported me during the these three years and, particularly, Edy, Mika, Marianna, Paolo, Wendy, Janice,

Giacomo, Max, Xavier and Sergio.

Finally I would like to thank my Mum for her support during my time at the

University. Declaration

The material contained within this thesis has not previously been submitted for a degree at the University of Durham or any other university. The research reported within this thesis has been conducted by the author unless indicated otherwise.

The copyright of this thesis rests with the author. No quotation from it should be published without his prior written consent and information derived from it should be acknowledged. Contents

1 Methodological Introduction 1

1.1 Introduction . . . ' 1

1.2 Traditional Natural Language Processing Approaches 4

1.2.1 Cognitive Science ' 4

1.2.2 Artificial InteUigence 5

1.2.3 Computational Linguistics 5

1.3 Natural Language Engineering 6

1.3.1 The General Philosophy of Natural Language Engineering . 6

1.3.2 Scale 7

1.3.3 Robustness 7

1.3.4 Maintainabihty 8

1.3.5 Flexibility 8

1.3.6 Integration . ' 8

1.3.7 Feasibility 9

1.3.8 Usability 9

1.3.9 The use of a full range of techniques 10

1.3.10 Cost-Benefit Analysis 10

1.3.11 Motivation for Adopting the NLE Approach in finance ... 10

1.4 Methodological criteria for success 11

1.5 Context of this Work: the LOLITA project 12

1.6 Logical Progression of the Thesis 12

2 Finance and Financial Tools 14

2.1 Introduction 14 CONTENTS

2.2 Quantitative tools 19

2.2.1 Statistical and probabilistic techniques 20

2.2.2 Artificial Intelligence techniques 26

2.3 Integrating conventional and intelligent systems 34

2.4 Qualitative tools 35

3 Information retrieval and information extraction 40

3.1 Introduction 40

3.2 Information Retrieval 40

3.2.1 Statistical and Probabilistic approaches to the Information Retrieval • . 42

3.2.2 The Linguistic approach to information retrieval 47

3.2.3 Knowledge-based approaches (weak methods) 48

3.3 The TREC competitions 50

3.3.1 Tasks 51

3.3.2 Techniques 53

3.3.3 Evaluation metrics 53

3.4 Information Extraction 56

3.4.1 The scripts / frames systems 59

3.5 The MUG competitions 64

3.5.1 The MUG Tasks 65

3.5.2 The main techniques employed 74

3.5.3 The evaluation of the MUG results 92

3.6 Information extraction in the financial domain 95

3.6.1 Automatic extraction of templates from source texts .... 99

3.7 User-definable template interfaces 99

3.8 Gonclusions 103

4 The LOLITA System 104

4.1 Introduction 104

4.2 Architecture of the system 105

4.2.1 The Semantic Network 105 CONTENTS vi

4.2.2 Syntactic analysis 109

4.2.3 Anallysis of meaning Ill

4.2.4 Inference 116

4.2.5 Generation 117

4.3 LOLITA Apphcations 117

4.3.1 Analysis of Text 118

4.3.2 Query ' 118

4.3.3 Translation 118

4.3.4 Database front-end 118

4.3.5 Chinese tutoring 119

4.3.6 Extraction of Objects 119

4.3.7 Dialogue 119

4.3.8 Information Extraction 120

5 Information extraction in the LOLITA System 121

5.1 Introduction 121

5.2 Architecture of the information extraction application 121

5.3 The LOLITA templates 123

. 5.3.1 Types of slots 125

5.4 Types of templates available in the system 128

5.4.1 Concept-based templates 128

5.4.2 Summary templates 130

5.4.3 Hyper templates 131

5.4.4 The MUC-6 templates 131

5.5 Advantages and disadvantages of the implementation 133

6 Definition of the financial templates 134

6.1 Introduction 134

6.2 Designing a financial information extraction system . , 134

6.2.1 The type of system: real-time / batch 135

6.2.2 The information to be extracted (Task definition) 136

6.2.3 The user interface 140 CONTENTS vii

6.3 The LOLITA financial information extraction system 140

6.3.1 The type of system 141

6.3.2 The information to be extracted 142

7 Implementation of the financial templates 148

7.1 Introduction . 148

7.2 Defining the rules for the financial templates 148

7.2.1 Prototypes 150

7.2.2 Domain-specific knowledge 150

7.2.3 Unification 151

7.3 The takeover financial template 152

7.3.1 The takeover main-event 153

7.3.2 The takeover templates slots 161

8 Design of the user-definable template interface 171

8.1 Introduction 171

8.2 The user-definable template interface 171

8.2.1 The template interface environment 179

8.2.2 The template structure 180

8.2.3 The fill rules 184

9 Implementation of the user-definable template interface 191

9.1 Introduction 191

9.2 The user-definable template interface 191

9.2.1 Processing the user's template definitions 191

9.2.2 Producing templates using the inference system 199

10 Results and Evaluation 210

10.1 Introduction 210

10.2 Evaluation of the definition and implementation of the financial tem• plates 212

10.2.1 Evaluation of the financial temjalates " 212

10.2.2 Evaluation of the performance of the financial templates . . 213 CONTENTS viii

10.2.3 Evaluation techniques: the MUC-6 evaluation measures . . . 214

10.2.4 Evaluation of the results using the evaluation metrics .... 215

10.3 Evaluation of the user-definable template interface 219

10.3.1 Evaluation of the design and usability of the user-definable interface: evaluation task and data 220

10.3.2 Evaluation metrics and results 224

10.3.3 Evaluation of the performance of the user-defined templates 237

10.3.4 Evaluation of the takeover user-defined template against the hand-coded version 239

10.3.5 Evaluation of the non-interactive user-defined template against its hand-coded version 244

10.4 Definition of the other financial templates 246

10.5 Conclusions 250

11 Conclusions and future work 253

11.1 Aim 1: implementation of templates for extracting relevant infor• mation from financial articles 254

11.2 Aim ,2: design and implementation of a user-definable template in• terface 255

11.3 Impact on the field of information extraction 256

11.4 Impact on the field of finance 257

11.5 Project Limitations and Suggestions for Further Work . 257

11.5.1 Implementation of the other financial templates 258

11.5.2 Automatic extraction of templates 258

A Glossary 260

B User-Defined takeover templates 265

B.l Sample data for evaluation of the pre-defined takeover template . . 265

B.2 Data for the evaluation of the user-definable takeover template . . . 267 List of Figures

1.1 An article from The Financial Times 3

2.1 The kind of input and output of a quantitative tool 19

2.2 Different kinds of quantitative tools 20

2.3 A MAGD indicator histogram. 23

2.4 The main components of an expert system 29

2.5 The basic structure of a neural network 31

2.6 The kind of input and output of a qualitative tool 36

2.7 Different kinds of qualitative tools 36

3.1 A TREG-2 topic 51

3.2 The recall/precision curve 54

3.3 The recall/fallout curve 55

3.4 Information Extraction using fragments from the original document. 57

3.5 Information extraction using templates 58

3.6 A MUG-3 document in the terrorist domain 66

3.7 A MUC-3 terrorist template 67

3.8 A MUC-5 document in the joint-venture domain 69

3.9 The MUG-5 joint-venture template schema 70 LIST OF FIGURES x

3.10 A MUC-6 organisation template 72

3.11 The MUC-6 management scenario template 73

3.12 The generic information extraction system 74

3.13 SHOGUN's architecture 78

3.14 PLUM'S architecture 80

3.15 FASTUS's architecture 83

3.16 The Hasten MUC-6 system 87

3.17 The extraction by example in the Hasten System 88

3.18 The Alembic System 89

3.19 The NLToolset System 91

3.20 The MUC-5 evaluation measures 94

3.21 A typical article from an on-Hne news provider 96

3.22 The difference between articles from newspapers and on-line agencies. 98

3.23 A Hasten pattern (egraph). 102

4.1 The LOLITA core 105

4.2 Representation for a typical financial event in the semantic network. 107

4.3 The SemNet Tepresenta-tion for 100 billion dollars 108

4.4 Example of parsing 112

4.5 Example of semantic analysis 113

4.6 Example of semantic disambiguation 114

4.7 Example of pragmatic analysis 115

4.8 Example of question handled by the inference engine 116

4.9 The LOLITA applications . . . 117 LIST OF FIGURES xi

5.1 Information extraction within the LOLITA system 122

5.2 The structure of a LOLITA template : .124

5.3 The structure of the template slots 125.

5.4 Information extraction within the LOLITA system 126

5.5 The slots in a concept-based template refer to the main-event. ... 129

5.6 The slots in a summary template do not refer to the same concept." 131

6.1 An non-relevant article for a generic financial operator 139

7.1 The node "acquisition" in the Semantic Network 157

7.2 The node "acquire" in the Semantic Network 158

7.3 The acquisition prototype 159

7.4 The representation of the semantic structure 164

8.1 The fill-rules of the takeover templates of the first group of users. . 175

8.2 The fill-rules of the takeover templates supplied by the second group

of users 176

8.3 The definition of a LOLITA template [BNF notation] 181

8.4 The stages of the definition of a template 183

8.5 The takeover template as defined for the template user-interface. . . 190

9.1 The representation of the template-name 192

9.2 The processing of a user-defined variable 194

9.3 The processing of the main-event 195

9.4 The representation of a slot-rule which refers to a variable 198

.9.5 The representation of a slot-rule which refers to the template name 198 LIST OF FIGURES xii

9.6 The representation of a slot-rule which refers to other slots 199

9.7 Identification of candidate main-events by the inference system. . . 201

9.8 The modification of a slot-rule which refers to the template name . 205

9.9 The modification of a slot-rule which refers to a variable 205

9.10 A candidate event for the S=BANK-ADVISER-PREDATOR slot-rule.206

10.1 The final score report for the takeover financial template 216

10.2 The recall for each of the slots of the pre-defined takeover template. 218

10.3 The precision for each of the slots of the pre-defined takeover template. 218

10.4 An example article where the slot "VALUE' has not been filled by the system. : 219

10.5 The description of the evaluation task available on the WWW server 221

10.6 The user-defined merger template available on the WWW server . . 222

10.7 An example submission of a user-definable template 222

10.8 The reply e-mail for a user-defined template 223

10.9 An example of natural language error 226

10.10 An example of error in the use of the variables 227

10.11 An example of error in the use of the variables 228

10.12The time spent by a user for defining a template 232

10.13Time spent reading the instructions (B) . 233

lO.HTime spent entering the definition (A-|-C) 233

lO.lSTotal tiine spent defining the template (TT). 234

10.16The non-interactive user-defined takeover template selected for the

evaluation 236 LIST OF FIGURES xiii

10.17The final score report for the user-defined takeover financial template.240

lO.lSThe recall for each of the slots of the user-defined takeover template. 241

10.19The precision for each of the slots of the user-defined takeover template.242

10.20The final score report for the non-interactive user-defined takeover financial template 245

10.21 The final score report for the non-interactive takeover template hand- coded in the system 245

10.22An example relevant article and merger template extracted by the

system 251 Chapter 1

Methodological Introduction

1.1 Introduction

The goal of information extraction is to extract specific kinds of information from a source document [Riloff and Lehnert, 1994]. In other words, the input to the system is a document (e.g. a newspaper article), while the output consists of a summary of the contents of the document. For example, the source article could consist of the following text:

BELL ATLANTIC announced it will acquire Tele-Communications Inc with 18 billion dollars. The deal will radically change the US communications industry. It will also be one of the country's biggest ever takeovers. The deal - which is bound to face strong regulatory scrutiny - would be the first full merger between a US telephone company and a cable business at a time when the two industries are con• verging to create a single, multi-media inter-active entertainment and information business.

An information extraction system will try to produce a template of the original text according to a specification of the information defined during the design of the system. For example, a sensible summary of the article shown above could be: Chapter 1: Methodological Introduction

Template: TAKEOVER COMPANY_TARGET: Tele-Communications Inc. COMPANY_PREDATOR: BELL ATLANTIC TYPE.TAKEOVER: FRIENDLY VALUE: 18 billion dollars. ATTRIBUTION: BELL ATLANTIC

The summary presented above is called a template which is a structure with a predefined number of slots, where the slots are filled with the information extracted from the source article.

Players in the financial market have today access to an extremely large amount of data, both quantitative and qualitative, real-time or historical and can use this information to support their decision-making process.

Quantitative data, such as historical price databases or real-time price informa• tion are largely processed by automatic computer programs, often based on artificial intelligence techniques, that produce quantitative analysis, such as historical price analysis, price forecasts or technical analysis of price behaviour.

Differently, little progress has been made in the processing of qualitative data, which mainly consists of financial news articles either from financial newspapers, such as The Financial Times or from on-line news services, such as Dow Jones or Bloomberg. As a result, financial market players are overloaded with qualitative information which is potentially extremely useful but, due to the lack of time, is often ignored. In fact, the time needed for reading a financial article (e.g. the article shown in figure 1.1) is quite significant and the operator is often unable to assimilate the relevant information before an investment decision has to be taken.

The goal of this work is to reduce the qualitative data-overload of the financial operators by extracting the relevant information from a large collection of source financial documents. In particular, the work has two main aims.

• Aim 1: implementation of templates for extracting relevant infor• mation from financial articles The first goal of the research is to imple• ment a set of financial templates in the LOLITA System. The system should Chapter 1: Methodological Introduction

The Argentine government Is stepping up pressure on the nation's Congress to approve key economic leglslatfon early In the new year. Legislators failed to vote on two important measures In extraordinary sessions held this week. Debate on the measures - a presidential request for additional taxing and budget-trlmmlng powers and a bill to advance reform of federal-provincial government relations - was again suspended late on Thursday evening after opposition members of the lower house of Congress withdrew from the chamber. Their walk-out denied the chamber a quorum and pushed consideration of the bills into January. The government Is now Insisting that extraordinary congressional sessions be extended as far as February to win approval of the measures It argues are crltfcal to repairing the nation's finances In 1996. Argentina's Congress normally takes a long southern summer recess over the months of December, January and February. The measure considered to be the most Important Is President Carlos Menem's request for extraordinary powers to cut expenditure and raise taxes without congressional approval. The government argues that the proposed 'superpowers' are necessary to ensure that the' 1996 budget Is balanced and International Monetary Fund fiscal targets are met. The second major measure, the ambitiously titled Reform of the State legislation, aims to streamline financial relations between the national government in Buenos Aires and the country's 23 provinces. The federal government argues that elimination of duplication of services In the two tiers of government could save up to Dollars 1.5bn a year but Is reluctant to confirm that efficiency measures may see the loss of up to 20,000 state sector jobs. The congressional backlog was cleared somewhat earlier this week when the upper house of Congress, the Senate, approved the 1996 budget law and a hotly debated measure to Increase taxes on tobacco products. Argentine financial markets have maintained a firm tone In thin holiday-season trade this week, and Investors appear content to give Congress some additional leeway to approve the government-sponsored measures. Economist Raul Buonuome of ING Barings said the markets were willing to accept that the Senate might not sign-off on the measures until March but had expected lower house approval this week.

Figure 1.1: An article from The Financial Times. Chapter 1: Methodological Introduction

process a large number of source articles and extract a number of templates with the relevant information for the financial operator identified in the source texts and eliminate the non-relevant information.

• Aim 2: design and implementation of a user-definable template interface. The second aim of this research is to design and implement a user-definable template interface for allowing the end-user of the system to easily extract additional information from the source articles by designing new templates. The user-definable template interface should allow the user to easily define new templates using natural language sentences. The system should be able to process a large number of source articles and extract a num• ber of templates containing the information corresponding to the definitions provided by the user.

This chapter discusses some important background methodological issues, fo• cusing on the methodology chosen for this work: Natural Language Engineering (NLE).

1.2 Traditional Natural Language Processing Ap• proaches

Natural Language Processing (NLP) has been influenced by many disciplines con• cerned with the automatic processing of natural language by computers. This section will briefly outline the main three disciplines in which researchers in this field tend to belong.

1.2.1 Cognitive Science

The aim of cognitive scientists is generally to emulate the processes of the brain into a computational model. The target of the researchers in cognitive science specialising in natural language processing is therefore to try to model the brain's Chapter 1: Methodological Introduction

communication processes. The work in this area is normally founded on psycho• logical and sometimes physiological experiments on how humans process language, normally performed on children or on people with language disabilities. Differently from other methodologies, the goal of cognitive scientists specialising in natural language processing is therefore not only to design systems which have the same characteristics as humans do, but also to model the processes that govern this be• haviours. This research does not always result in a computer program but, often, only in a computational model which could potentially be implemented. However, even if these models were implemented, they may be able to work only in very restricted areas and domains.

1.2.2 Artificial Intelligence

The aim of researchers with a background in artificial intelhgence is to simulate human behaviour, without the restriction that the processes that lead to this be• haviour have to be the same as those of the human brain. Researchers in artificial intelligence use therefore whatever means to produce human-hke behaviour. Dif• ferently from cognitive scientists, researchers in artificial intelligence tend to cover a wider scope than the isolated processes studied by cognitive scientists. Artifi• cial intelligence is the background approach adopted in the work presented in this thesis.

1.2.3 Computational Linguistics

The term computational hnguistics was originally concerned with the application of a computational paradigm to the scientific study of human language" [Ballard and Jones, 1990]. The advent of computers provided a tool for hnguistics to test and then develop linguistic theories. More recently, computational linguistics has expanded to include ^^engineering of systems that process or analyse written or spoken language" [Ballard and Jones, 1990]. In practice, the terms computational linguistics is now used by a various number of researchers working on different Chapter 1: Methodological Introduction 6

aspects of linguistics.

1.3 Natural Language Engineering

As mentioned above, this work is concerned with the automatic extraction of in• formation from financial articles. The methodology chosen for this work is Natural Language Engineering (NLE).

Natural Language Engineering is a recent endeavour which applies ideas and practices of other engineering disciplines to the field of NLP. The EEC LRE pro• gramme [Ire, 1992] defines Linguistic Engineering as follows:

"Linguistic Engineering (LE) is an engineering endeavour, which is to combine scientific and technological knowledge in a number of relevant domains (descriptive and computational linguistics, lexicology and terminology, formal languages, com• puter science, software engineering techniques etc.). LE can be seen as a rather pragmatic approach to computerised language processing, given the current inade• quacies of theoretical Computational Linguistics."

1.3.1 The General Philosophy of Natural Language Engineering

The goal of traditional approaches to Natural Language Processing, including cognitive science, computational linguistics and artificial intelligence, has been either to formulate universal theories able to cover all aspects of language or to develop very restricted theories and prototypes to be used in extremely limited domains.

The implementation of systems covering larger domains based on these ideas has proved to be a difficult task.

The main idea behind Natural Language Engineering (NLE) is therefore to adopt a set of engineering criteria in the field of Natural Language Processing. Natural Language Engineering is concerned with the utilisation of existing technology in order to produce useful systems. As new technologies and theories become available from more traditional sciences, they can be implemented in a natural language engineering system.

The following subsections will detail important aspects of Natural Language Engineering.

1.3.2 Scale

The size of NLE systems must be sufficient for realistic large-scale applications [Smith, 1996]. For the design of a financial application, this aspect is crucial. Financial operators, in fact, must be sure that the system is able to process any sort of input data. Skipping some data can lead to a loss of extremely important information, which is essential for the decision-making process. The size of the system must therefore be enough to allow the processing of any sort of financial article. Properties such as the vocabulary size, grammar coverage and the number of word senses are critical.

For an information extraction system, scale also concerns the information resulting from the extraction process which is supplied to the user. This information must be sufficient in scale to represent all the relevant information in the source data. Properties such as the number of templates and slots are therefore essential.

1.3.3 Robustness

The success of a financial NLP application is directly dependent on robustness. The application must carry on and try its best to produce some sort of output from the input or, at least, inform the user of the failure. For real-time use of the financial application, crashes or failures of the system must be avoided. In the case of crashes, the operator would probably waste a considerable amount of time trying to find what went wrong and where. In case the system carries on without informing the user, the operator could lose information which is extremely relevant for the trading decision-making process without realizing it.

1.3.4 Maintainability

Maintainability is a measure of how useful the system is over a long period of time. From a financial point of view, the application must be usable over a sufficiently long period of time to justify the large amount of money which would be needed for the development of a complete application. The system should allow the user to change and update the configuration - i.e. adding new templates to the existing ones, or adding domain-specific knowledge (new terms, new meanings of existing ones or domain-specific rules).

1.3.5 Flexibility

Flexibility or portability is a measure of the ability to modify the system for different tasks in different domains [Smith, 1996]. For a complete financial application based on NLP, flexibility measures the ability of the system to perform tasks which are not necessarily restricted to one particular operator (e.g. a broker), but to process different kinds of data and produce different kinds of output information according to the type of user.

1.3.6 Integration

There are two related aspects of integration:

• System components should not make unreasonable assumptions about other components. Similarly, components should only perform tasks for which they have specifically been designed [Smith, 1996]. In the development of a financial information extraction system, the template-extraction module should be kept separate from the NLP core system. This separation would allow easier upgrades of the financial application by modifying the different modules of the system.

• System components should be designed and built to assist other components. For example, the financial information extraction module should allow hyperlinks to other sources of information even if, initially, no information is linked.

1.3.7 Feasibility

This aspect concerns ensuring that constraints on the running of the system are acceptable [Smith, 1996]. Parameters such as hardware requirements should be taken into account. For real applications (e.g. a financial application), this aspect is crucial: the application must be able to run on an existing and affordable machine. Feasibility also comprises efficiency. A financial application must be efficient in the sense that the execution time must be acceptable for the operator's needs. In the development of a financial information extraction system this aspect is crucial: the application must be able to run on an existing machine at an acceptable speed to be useful to the financial operator.

1.3.8 Usability

The system must satisfy a need - i.e. there must be a set of users in the market who can benefit from using the system [Garigliano, 1995]. For a real financial application this aspect is by far the most important. Most of the existing information extraction tools that have been developed and tested within government agencies and scientific environments, in fact, lack usability, in the sense that they would not be particularly useful for the financial operator. For example, the tasks of the MUC-5 [DAR, 1993] and MUC-6 [DAR, 1995] competitions would not be immediately applicable in a financial application.

The design of a financial application must therefore take into consideration first of all the reaction of the potential customers and how the innovation will be accepted. The technology behind NLP systems, in fact, is today not yet completely understood within the financial community and operators are not prepared to deal with it. Even a simple concept such as "template" can be obscure. Therefore, the key point must be to design a product which can first of all be understood by the potential customers and to find a relevant market sector that needs it. The danger would otherwise be to produce an extremely sophisticated tool which, however, does not satisfy a user need. As an obvious consideration, the system should be user-friendly.

1.3.9 The use of a full range of techniques

NL engineered systems should use a full range of AI techniques [Smith, 1996]. The designers of the financial application should therefore use any technique available which can improve the system's performance.

1.3.10 Cost-Benefit Analysis

The designer of a financial tool based on Natural Language Processing should reach a balance between two or more aspects of Natural Language Engineering. If, for example, a simple algorithm does not provide the same robustness as a more complicated one but improves the feasibility of the application (for example in terms of speed), then the simpler one can be chosen.

1.3.11 Motivation for Adopting the NLE Approach in Finance

The Natural Language Engineering approach has been studied for the design of large NLP systems [Garigliano, 1995]. In our view, however, the NLE principles can be used for the successful design of a real financial application based on natural language processing. Principles such as usability, feasibility, scale and robustness are essential not only for an NLP system, but for any financial application which must be designed to work in real situations and can be successfully used as a general support, explanatory or forecasting financial tool. The design of the LOLITA financial information extraction system follows these principles, which have been employed in the design of the system as well as in its implementation.

1.4 Methodological criteria for success

The goal of this work is to provide evidence for or against the following tasks:

• implement a set of templates which represent relevant information for the financial operator in the LOLITA System.

• design and implement a user-definable template interface for allowing the end-user of the system to easily define and extract additional information from the source articles in the form of templates.

Three main criteria will be employed to evaluate the success of the work de• scribed in this thesis:

• the performance of the system in the extraction of the pre-defined and user-defined templates will be evaluated using the MUC-6 standard evaluation measures [Chinchor and Dungca, 1995]: precision, recall and the 'F' measure (summarised after this list). In particular, the performance of the templates defined using the user-definable template interface will be compared with the performance of the pre-defined templates. The performance of the system in extracting pre-defined templates from a relevant number of financial articles should be reasonably high to be useful for the financial operator, while the drop in performance of user-defined templates compared to pre-defined templates should be low.

• the ease of use of the user-definable template interface will be evaluated using specific measures giving an indication of the difficulties encountered by potential users in defining new templates and the time taken for entering new template definitions. The potential users should be able to enter new template definitions in a very limited time and with a very limited number of errors in the template definitions.

• the identification of the relationship between the ease of use of the user-definable templates and the possible loss of performance compared to hand-coded template definitions.
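For reference, the standard evaluation measures mentioned in the first criterion can be written as follows; the notation (correct, actual and possible fills, and the weighting parameter β, usually set to 1) is the conventional one from the MUC scoring literature rather than something defined in this chapter:

```latex
% COR = correct fills produced by the system,
% ACT = total fills produced, POS = fills present in the answer key.
P = \frac{COR}{ACT}, \qquad
R = \frac{COR}{POS}, \qquad
F = \frac{(\beta^{2} + 1)\, P \, R}{\beta^{2}\, P + R}
```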

1.5 Context of this Work: the LOLITA project

The work will be carried out as part of the LOLITA (Large-Scale Object-Oriented Interactor, Translator and Analyser) System. LOLITA is a large-scale Natural Language Processing System which has been under development at the University of Durham for the past 9 years. The LOLITA System will be described in detail in chapter 4.

1.6 Logical Progression of the Thesis

This thesis is organised according to the following plan:

Chapter 1: methodological introduction (this chapter), provides important methodological information about the work presented in the thesis. This chapter describes in detail the problem and aims of the thesis. It also explains in detail the Natural Language Engineering methodology adopted and provides the criteria for success for these methods;

Chapter 2: finance and financial tools. This chapter analyses the most important financial tools and techniques currently used in finance, based on conventional techniques or artificial intelligence;

Chapter 3: information retrieval and information extraction. This chapter discusses the most important techniques, systems and evaluation techniques in the field of information retrieval and information extraction in relation to this work. In particular, the chapter focuses on the TREC (information retrieval) and MUC (information extraction) competitions;

Chapter 4: the LOLITA System. This chapter analyses in more detail the LOLITA System, outlining the main techniques, modules and applications available in the system;

Chapter 5: information extraction in the LOLITA System. This chapter analyses in detail the way in which the information extraction tasks are currently handled in the LOLITA System;

Chapter 6: definition of the financial templates. This chapter presents the solution for problem aim 1. The relevant financial information in the source financial articles is identified and represented in specific financial templates;

Chapter 7: implementation of the financial templates in the LOLITA System. This chapter presents the implementation issues of the pre-defined templates defined in chapter 6;

Chapter 8: design of the user-definable template interface. This chapter presents the solution for problem aim 2 and presents the design of the user-definable template interface;

Chapter 9: implementation of the user-definable template interface. This chapter presents the implementation of the user-definable interface described in chapter 8;

Chapter 10: results and evaluation. This chapter reports the results of the evaluation of this work referring to the project aims 1 and 2;

Chapter 11: conclusions. This chapter presents the final conclusions for the project aims and discusses the project's shortcomings and directions for future research.

Chapter 2

Finance and Financial Tools

2.1 Introduction

Finance is an extremely broad discipline, including many different areas: personal finance, public finance, financial economics, corporate finance, trading, international finance etc. Finance is present at any level of our society. Drawing its exact borders is therefore extremely difficult.

In the context of this work, we will only concentrate on one extremely important area of finance, the financial market, which consists of the capital market and the money market.

The Capital Market is the market for financial assets other than money or near-monies [Polakoff and Durkin, 1981]. Such a market is often called an "exchange", for the essential function of facilitating the purchase and sale or exchange of long-term claims or securities. Since exchange presupposes some specialization and standardization, the capital market, in fact, consists of many separate, but interrelated, markets of varying degrees of organization, such as the Stock Exchange and the over-the-counter market (which is not, in fact, an organized exchange). It is customary to consider the Money Market quite separate from the Capital Market.

The Money Market could be named the "near-money market", since it is the market for near-money financial assets or claims. It is also usual to draw the line between Capital and Money markets at a maturity of one year for the securities [Polakoff and Durkin, 1981].^ The bulk of trading in the capital market is in outstanding securities, the so-called secondary market. The secondary market is the market for securities that have been previously issued, and the securities traded in the secondary market have already been sold once. The primary market, instead, consists of new issues of securities. The secondary market is much more important than the primary market for two reasons: firstly, because ongoing markets are indispensable for the effective marketing of new issues; secondly, trading in outstanding securities is important in its own right because the ease with which an asset can be sold facilitates wealth-holders' constant efforts to optimize the satisfactions derived from their portfolio [Kaufman, 1989] and, therefore, supply and demand can be equated.

Although the financial market comprises numerous different markets, we will only focus on the secondary market and, more precisely, on securities issued by companies: equities, or shares, and company bonds. We are therefore not interested in securities issued by governments (bonds) or the currency market. This choice has been made because most financial tools, as we will explain later in this chapter, are aimed at these kinds of securities.

Equities are, in some respects, the most important kind of securities that are dealt in the stock exchange, mainly because they form the basis of the capitalist society. The owners of ordinary shares are entitled to receive dividends when they are declared, and to be paid a proportion of the company's assets after payment of the creditors if the company is wound up [Dine, 1994].

The price movements of the shares quoted at the Stock exchanges can be explained in different ways, according to the observation time period: long-term market moves are affected by fundamental factors like economics; in the medium term technical supply and demand factors can be decisive; while investor psychology can be the most powerful short-term influence. In all cases, the share price reflects the perception that the market has of the company. The price of equities will therefore be affected by the firm's own fortunes, but also by numerous other kinds of information, such as forecasts for an increase or a decrease in the interest rates and, in general, by all sorts of relevant economic news. Therefore, the important component is not the news itself, but how the news and data are perceived by the market.

^The trading of shares is therefore performed in the capital market, since shares do not have a fixed expiry date.

The decision-making process that leads to investment decisions is the crucial aspect for the investors, who need to support their decisions with the largest possible amount of relevant information. This data can be either quantitative, such as historical price data, prices of raw materials, data on inflation, currency exchanges, or qualitative, for example a news report stating that "there are fears of an increase in the German interest rates". Another distinction is between real-time information and historical information, with the former extremely important for real-time decision-making processes.

It is important to point out that we here consider a situation in which the market is not perfect. In the case of a perfect market situation,^ the operators would have free and complete access to the information needed, which would allow them to make correct decisions.

Investors nowadays have access to a huge amount of information, provided by agency press news providers, historical archives, government agencies and private organizations which collect, organize, process and distribute different types of data. Investors are usually connected in real-time to the news providers, drastically reducing the lag between when news happens and when it is assimilated by the operator. The cost of this information is decreasing rapidly, making it more accessible to small or private investors.

This "information revolution", as in many other "information-based" sectors started in the late 70s, with the development of the first commercial fast computer

^In a perfect market operators have a perfect and costless knowledge of the market and are able to perform optimum decisions. In reality, this situation does not exist, since operators do not have access to all possible information. Moreover, the information is not freely available. The operators therefore base their decisions on imperfect knowledge which can lead to partial or incorrect solutions. Chapter 2: Finance and Financial Tools 17

networks and continued in the 80s. The first target was to provide as much infor• mation as possible in real-time. Quantity was the target at that time. The quantity of the services and information available has been growing exponentially, leading to the current paradoxical situation, where financial operators suffer of "data over• load". Operators have access to information, some of which irrelevant, which they cannot profitable use because of the time they have to spend in selecting and un• derstanding the appropriate data for their needs. Operators try to select the most important information, leaving for further and deeper analysis the less important, for example analysis of price behaviour. In the 90s, therefore, the emphasis has shifted from the quantity of the information provided to the quality.

Quality can be improved by providing analysis and automatic selection of this large amount of information. The financial operator, by using this new quality information, can drastically reduce the time he needs to select the appropriate data and "summarize" it.

Financial tools are instruments used to perform such tasks and to produce new inferred data. We can define a "financial tool" as a tool that can be used to support the decision-making process of the investors, by summarizing, selecting or producing some analysis of the original source data or by suggesting operative decisions. Financial tools are therefore instruments used to perform economic analysis of the stock market.

Many different kinds of financial tools have been developed. In the context of this work, we only concentrate on tools used in decision-making processes dealing with the trading of securities issued by companies and quoted in the stock exchange.

We will not therefore consider tools regarding forecast analysis of other kinds of prices, profit predictions, economic forecasts etc.

Financial tools can be classified according to three different criteria:

• the kind of technique on which the tool is based;

• the kind of input/output of the system;

• the kind of task that the tool performs for the financial operator.

Depending on the kind of technique on which they are based, financial tools can be grouped into conventional tools and innovative tools [Costantino et al., 1996c]. Conventional tools are based on mathematical, statistical and probabilistic algorithms. These techniques are extremely well-known, highly optimized and effective on particular sets of data and have been experimented with for a long time. Conventional tools can be further classified into: systems based on statistics, systems based on econometrics and systems based on heuristics [Fornasini and Bertotti, 1989].

Systems based on statistics perform analysis and calculations on data consisting of historical time series of share prices. They can be either descriptive, if they provide a simple and effective representation of the phenomena observed, or interpretative, if they try to identify the variables able to explain the phenomena or the link between the explicative variables and the phenomena analysed.

Systems based on heuristics try to explain economic and financial events without an underlying economic, econometric or mathematical relation, relying instead on "heuristics" and knowledge available within the financial community.

Systems based on econometrics try to identify the mathematical / economic relations and models which are able to interpret and explain the evolution of economic phenomena [Fornasini and Bertotti, 1989]. This kind of analysis is called fundamental analysis.

Innovative tools (or tools based on AI techniques) have been introduced in the last 20 years and are mainly based on techniques borrowed from Artificial Intelligence, such as Neural Networks and Expert Systems.

According to the input and output of the system, financial tools can be classified as quantitative or qualitative tools [Costantino et al., 1996c]. Quantitative tools are tools used to produce quantitative or qualitative information from quantitative data. Qualitative tools, instead, are tools which process qualitative data and produce qualitative or quantitative information.

Figure 2.1: The kind of input and output of a quantitative tool (quantitative data is turned into quantitative or qualitative information).

Finally, we can distinguish between explanatory tools (e.g. historical price analysis), forecasting tools (e.g. linear prediction, which falls into the category of tools based on conventional techniques, and neural networks for the prediction of prices, which fall into the AI category) and general support tools. Explanatory tools can be defined as financial tools which are used for explaining the movements of shares. However, they cannot normally be used for the prediction of future prices. Forecasting tools are, on the contrary, tools used to predict future prices of the stock. Finally, general support tools are tools which are used as support for the analysis of the financial operator or to pre-process the information, but cannot be used for explaining movements of share prices or for predictions.

2.2 Quantitative tools

We can define quantitative tools as the tools that are used to produce quantitative or qualitative information from quantitative data (figure 2.1). For example, a quantitative tool can process the price history of a particular share, taking into account any selected variable that is relevant, and show the main periodical cycles and trends. Other tools can be used to predict the future prices of the same shares or to suggest to the financial operator the best time to sell or buy a particular share.

Figure 2.2: Different kinds of quantitative tools, classified as explanatory or forecasting and as conventional (moving average, MACD / divergence indicator, stochastic index, trading bands, linear and non-linear prediction, regression, applied logic) or innovative (expert systems, neural networks).

Quantitative tools have historically been based on conventional techniques. More recently, tools based on innovative techniques have been introduced. For example, Neural Networks are mostly used in finance as quantitative tools (prediction of shares). Expert Systems, instead, have been applied to both quantitative and qualitative problems. Figure 2.2 shows tools based on different techniques.

One of the drawbacks of quantitative tools is that the information produced consists of "numbers" and has to be interpreted by the financial operators, thus slowing down their investment decision-making process. Even most of the tools based on the latest Artificial Intelligence techniques present similar problems. For example, the input and output of neural networks is strictly numeric [Walczak, 1995]. The output of such systems is often difficult or impossible for humans to understand. Therefore, before a tool can be used, the real-world data must be translated into a numeric quantitative form and re-translated after the elaboration into a qualitative form. Such tasks are extremely time-consuming and, in most cases, cannot be automated.

2.2.1 Statistical and probabilistic techniques

In this section we will discuss some of the main conventional tools based on mathematics, statistics and probabilistic techniques, adapted and enhanced for use in finance. The mathematical disciplines have been studied for centuries and therefore some techniques and algorithms are extremely well known and easily applicable in the financial domain. Mathematics and, more specifically, statistics are extremely widespread in the financial domain and, nowadays, they are still the most used tools within the financial community.

One of the main disadvantages of most of these indexes and techniques is that they do not take into account any "intelligent" mechanism for analyzing data - they use only statistics and mathematics. Therefore, they fail when the facts are unpredictable or "illogical". Differently, financial tools based on artificial intelligence have been designed following the assumption that prices in the stock exchange are not originated by mathematical or statistical equations. They are determined by how the people involved in the buying and selling process think in a particular moment and how they are influenced by facts. Therefore, trying to explain people's behaviour, which is reflected in trading decisions, using statistical and mathematical indexes can often lead to wrong analysis.

However, some of the quantitative indexes based on conventional techniques, such as the moving average or the average of the quotation of a share over a certain amount of time, are extremely important for the operators' decision-making process and they are still among the most frequently used techniques in the financial community.

MA, MACD, stochastic index and trading bands

In this section we will briefly cover some of the most frequently used indexes in the financial community: the moving average, the moving average convergence/divergence (MACD) indicator, the stochastic index and the trading bands. These indexes are used to identify trend reversals - changes in the current global direction (upward or downward) of the price of a share. Knowing the exact time at which the trend changes is extremely important to maximize the value of the investment, since it is usually the time at which an operator decides whether to buy or sell a particular share. We can therefore consider them to belong to the group of forecasting indexes. However, the basic moving average index is also often used as support for other techniques and, therefore, it might also be included in the category of explanatory tools.

Trends are typically generated by institutional investors, which represent about 70 per cent of the total trading of a stock [Brown and Bentley, 1995]. Theoretically, once a trend-direction is in place, it is not easily changed to the opposite direction. The trend therefore continues in the same direction until there is either an over-bought situation - where buyers dry up, creating a downward correction of the trend - or, for a downward trend, an over-sold situation, in which sellers dry up and an upward trend takes over.

Moving average and MACD are two techniques that help identify the point at which the trend is likely to change direction.

Moving average (MA) is one of the basic statistical indexes and is often the basis of other indexes. Basically, it is used to smooth out the fluctuations in the stock prices (over a significant number of observations), obtaining the underlying trend. The result is normally plotted on a graph, with intervals usually of 30 days.

The way in which the MA is computed is very straightforward. Basically, to obtain a simple 30-day moving average, the total closing price of a stock over 30 days is computed and divided by 30. Once the main average has been obtained, to compute the subsequent points it is only necessary to add the most recent closing price to the total closing price of the stock, remove the oldest and divide by 30.
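As an illustration of the running-sum computation just described, the following sketch (in Python, used here purely for illustration and not the implementation language of this work) computes a simple 30-day moving average from a list of closing prices; the names are illustrative only:

```python
def simple_moving_average(closing_prices, window=30):
    """Simple moving average computed with a running total.

    The first value is the mean of the first `window` closes; each
    subsequent value is obtained by adding the newest close to the
    running total and removing the oldest, as described in the text.
    """
    if len(closing_prices) < window:
        return []
    total = sum(closing_prices[:window])
    averages = [total / window]
    for i in range(window, len(closing_prices)):
        total += closing_prices[i] - closing_prices[i - window]
        averages.append(total / window)
    return averages
```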

From the operator's point of view, when a stock breaks consistently through its 30-day moving average (either upwards or downwards) a change in the trend is more likely and the operator must decide whether to buy or sell a particular share.

One of the main problems of the basic moving average is the fact that the indication can be misleading, suggesting changes in trend which are only temporary or totally random. An improvement of the basic MA index is the moving average convergence/divergence indicator (MACD).

Figure 2.3: A MACD indicator histogram.

MACD

The target of the Moving Average Convergence / Divergence (MACD) indicator is to "measure the intensity and direction of the market's mood and confirm a trend reversal." [Brown and Bentley, 1995].

The MACD indicator uses three exponential moving averages: a short or fast average, a long or slow average, and an exponential average of the difference between the short and long moving averages, which is used as a signal line. An exponential moving average is similar to a normal moving average but gives more weight to the most recent observations.

The MACD indicator is usually plotted on a graph. In figure 2.3, an example of a MACD histogram is shown. The blue bars represent the difference between the short (fast) and long (slow) exponential moving averages. The outer edge of the bars forms the MACD line. The red line (or signal line) that traverses the blue area is the exponential moving average of the difference between the short and long averages. It measures the convergence or divergence of the short and long moving averages.

Changes in the direction of the trend are signaled by the convergence and divergence of these indexes. When the indexes converge going downwards, a "negative breakthrough" is signaled and the operator should sell his stocks, while a "positive breakthrough" occurs when the indexes converge in an upward direction.

In more detail, the MACD graph can be interpreted as follows:

• A buy signal is suggested when the market is in an over-sold situation and, thus, the MACD line crosses above the signal line. This event is called a positive breakthrough.

• A sell signal is suggested when the market is in an over-bought condition and, thus, the MACD line crosses the signal line downwards. This event is called a negative breakthrough.

In the lower part of figure 2.3 the MACD histogram is shown. The histogram is another way of representing the MACD indicator and buy/sell signals are given when the histogram crosses the zero line.
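The calculation behind these signals can be sketched as follows; the smoothing spans of 12, 26 and 9 periods are the parameters commonly quoted in the technical-analysis literature and are an assumption here, since the text does not fix them:

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    result = [values[0]]
    for v in values[1:]:
        result.append(alpha * v + (1 - alpha) * result[-1])
    return result


def macd(closes, fast=12, slow=26, signal_span=9):
    """Return (macd_line, signal_line, histogram) for a list of closing prices."""
    macd_line = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    signal_line = ema(macd_line, signal_span)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram
```

A positive breakthrough corresponds to the MACD line crossing above the signal line (the histogram turning positive), a negative breakthrough to the opposite crossing.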

Stochastic index

The stochastic index is not as widespread as the moving average or the MACD and can be defined as the percentage of the difference between the low and high quotations of a stock. Similarly to the other two indexes, it is mainly used for identifying situations in which a particular stock is over-sold or over-bought and, therefore, the trend will soon be reversed.
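The text does not give an explicit formula; the usual formulation of the stochastic index (%K) over the last n trading periods, which matches the description above, is:

```latex
\%K = 100 \times \frac{C - L_{n}}{H_{n} - L_{n}}
```

where C is the latest closing price and L_n and H_n are the lowest low and highest high over the last n periods; readings close to 100 suggest an over-bought stock and readings close to 0 an over-sold one.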

Trading bands

Trading bands are often used to support the indications given by the MACD index and are also considered to be a "forecasting" tool. The technique is based on the assumption that a stock will fluctuate within a predictable range around its moving average. Two additional bands are thus drawn on the graph, perfectly symmetrically and at a fixed and equivalent distance, in opposite directions, from the moving average. When the price of a stock breaks through the upper or lower band then, probably, there will be a change in the current trend direction.
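A minimal sketch of the technique as described - two bands at a fixed, symmetric distance from the moving average - is given below; the band width `distance` is a parameter the operator would choose, not a value given in the text:

```python
def trading_bands(closes, window=30, distance=5.0):
    """Upper and lower bands drawn at a fixed distance around the moving average."""
    ma = [sum(closes[i - window + 1:i + 1]) / window
          for i in range(window - 1, len(closes))]
    upper = [m + distance for m in ma]
    lower = [m - distance for m in ma]
    return upper, lower


def band_breakouts(closes, window=30, distance=5.0):
    """Indices of `closes` at which the price breaks through a band."""
    upper, lower = trading_bands(closes, window, distance)
    return [i for i, price in enumerate(closes[window - 1:], start=window - 1)
            if price > upper[i - window + 1] or price < lower[i - window + 1]]
```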

Many other tools and technical indicators which are not described here can be used for making decisions regarding trends or in support of them - e.g. trend-setters, resistance and support lines, basing-period break-outs, etc.

Linear and non-linear prediction

Linear prediction, as the name suggests, is a tool which is used to predict future prices of a stock given a history of prices over a certain amount of time (time series). The term "linear" emphasizes the fact that the equations used are linear [Hatamian, 1995]. The next price is calculated using a linear equation containing a certain number of variables (often called the memory) corresponding to the most recent observations, thus:

y(n+1) = \sum_{i=n-N}^{n} b_i \, y_i

where N is the memory (the number of most recent observations used) and the b_i are the coefficients to be fitted.

The historical price data (time series) is used in the training of the system to obtain the best-matching equation, which fits the weights of the coefficients applied to the specific variables. The coefficients are determined using the least-squares method, thus finding the minimum, over all possible parameter values, of the sum of the squares of the differences between the real observations and the predicted ones.

The chosen model will be the best-matched explanatory model of the past data and will be used to produce forecasts.
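A sketch of the least-squares fitting and next-day prediction described above is given below, using NumPy; the memory length is a free parameter and the code is purely illustrative, not the tool discussed in the text:

```python
import numpy as np

def fit_linear_predictor(series, memory):
    """Fit the coefficients b_i of a linear predictor by least squares.

    Each row of X holds the `memory` most recent observations preceding
    a known target value; the fitted coefficients minimise the sum of
    squared differences between real and predicted observations.
    """
    series = np.asarray(series, dtype=float)
    X = np.array([series[t - memory:t] for t in range(memory, len(series))])
    y = series[memory:]
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients


def predict_next(series, coefficients):
    """One-step-ahead ("next-day") prediction from the fitted coefficients."""
    memory = len(coefficients)
    recent = np.asarray(series[-memory:], dtype=float)
    return float(np.dot(coefficients, recent))
```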

The accuracy of the predictions drastically decreases moving from the next-day prediction to the following ones. At a certain stage, determined by the number of variables taken into consideration in the process (the memory), the new predicted observations will only depend on observations which have themselves been forecast, and the possibility of error will be extremely high.

Linear predictions are therefore mainly used to predict "next-day" price movements, rather than long-term predictions, for which different kinds of tools can be used. Linear prediction is also often used to predict values which are not directly related to stocks - e.g. commodity prices, inflation etc.

Differently from linear prediction, non-linear prediction is based on non-linear equations which can include more variables. The equation of the model is:

y(n+1) = \sum_{i=n-N}^{n} b_i \, y_i \;+\; \sum_{i,j=n-N}^{n} b_{ij} \, y_i \, y_j

The introduction of a new set of parameters will lead to an increase in the number of variables to be fitted to match the historical data (time series) and, consequently, in the time and complexity of the calculations. The predictions produced by this model are much better than those of the simple linear model, since it is able to capture some complex and more irregular components of the trend. However, the intrinsic statistical nature of the technique makes it impossible to predict a totally unexpected change in the behaviour of a particular stock. In fact, as we already pointed out earlier, the behaviour of the price of a particular stock is influenced by the decisions made by the operators, who do not necessarily follow a particular statistical or mathematical reasoning in their decision-making process.

2.2.2 Artificial Intelligence techniques

The decision-making environment for businesses is becoming more and more complex on a daily basis. This is partly caused by the exponential growth of the global marketplace, which leads to an even greater increase in the information to be processed by the financial operators [Walczak, 1995]. Financial companies like Fidelity, Bloomberg, Dow Jones and Dreyfus are offering all kinds of data and services, such as stock quotes (historical and real-time), electronic trading, research reports and analysis of companies, prices etc. These data are available on cheap and efficient networks such as, for example, the internet.

Financial operators must therefore process a larger amount of information in a shorter period of time. Decision making in this complex and global environment therefore requires high-quality and appropriate knowledge.

As we have already seen in the previous sections, statistical and probabilistic techniques are relatively simple and efficient methods for analyzing and forecasting quantitative data. However, these techniques are unable to work in a highly complex and "irregular" market where decisions are taken not on the basis of previously observed patterns or cycles, but are mainly guided by emotive and psychological factors.

Towards the end of the 70s, the first financial tools based on new techniques borrowed from artificial intelligence were introduced. These techniques offered the possibility of explaining and predicting data which did not necessarily follow a certain pattern or cycle. They were able to infer knowledge from a collection of input data. Nowadays, most current AI financial tools are based on neural networks, which are example-based tools, and expert systems, which are rule-based tools [Walczak, 1995].

Over the past 15 years, the financial community and financial institutions have become the primary driver of artificial intelligence and its longest-term user [Newquist, 1995]. This situation was not what the researchers expected when the first artificial intelligence techniques were studied. When the first companies specializing in AI systems appeared on the market, they were directly related to university labs and they were working for and within that environment. The second generation of AI companies (around 1985) understood that one of the biggest potential sectors was constituted by the financial community and they started studying specific applications and techniques.

Nowadays, the financial industry is the primary commercial adopter of artificial intelligence techniques worldwide. The financial companies are the first to try new AI techniques in real situations and the quickest to replace conventional techniques with the new ones. Several reasons can be suggested, but the two most obvious are: access to cash, and the direct and immediate impact of the new technologies on the results [Newquist, 1995]. Financial institutions, in fact, are extremely keen on investing in and testing new AI techniques that can potentially result in an exponential and immediate increase in profits. Using AI techniques, financial operators can quickly get better information from data which they already own, and improvements can be quickly measured in terms of monetary returns. Given the fact that a financial tool will normally be used to support the decision-making process for buying/selling decisions, the performance of these decisions can be measured within a short period of time, if not in real-time. In comparison, an industrial company is not able to obtain such a clear overview of the impact of a new technology. This is because the introduction of the new technique and the results are not directly related but have to be related to the impact on production, quality control or diagnoses, faster customer services, predictive maintenance and, in general, the "stage" at which the new technology is introduced. The financial community is therefore one of the biggest real users of Artificial Intelligence techniques, compared to other sectors, which are likely to make use of more known and conventional technologies.

The first artificial intelligence technique to be introduced in the financial market was expert systems. Later, systems based on neural networks started to be used and nowadays new systems are introduced continuously, while others begin to be analyzed and tested, such as natural language processing, the subject of this thesis.

Expert Systems

An expert system can be defined as a "knowledge-based system that emulates expert thought to solve significant problems in a particular domain of expertise" [Sell, 1986].

The main characteristic of expert systems compared to neural networks is that they are rule-based. This means that the expert system contains a predefined set of knowledge which is used for all decisions. The system uses the predefined rules to produce results by using inference rules which are coded into the system.

Figure 2.4: The main components of an expert system (input data is processed by the inference engine using the knowledge base to produce output information).

Depending on the kind of input and the rules used, expert systems can be used as either quantitative or qualitative tools.

A generic expert system will consist of two main modules: the knowledge base and the inference engine.

The knowledge base contains the knowledge of the system regarding the specific domain or area for which it is designed to solve problems or make recommendations. For example, if the system has to work in the financial domain (domain knowledge), the knowledge base will include the specific rules that the system contains - e.g. decisions concerning shares. The knowledge base is coded into the system according to a specific notation which is usually found in the following forms:

• Rules

• Predicates

• Semantic nets

• Frames

• Objects

The inference engine processes and combines the facts related to the particular problem, case and question, using the part of the knowledge base which is relevant. The selection of the appropriate data in the knowledge base is performed according to searching criteria. The way in which inference rules are written and applied to the information in the knowledge base varies greatly from system to system and can follow different paths. Among others, two methods of reasoning are often employed: forward chaining and backward chaining [Luconi et al., 1994].
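As an illustration of forward chaining only, a minimal rule engine can be sketched as follows; the rules below are invented examples and are not taken from any system described in this thesis:

```python
# Each rule is (set of conditions, conclusion): if all conditions are known
# facts, the conclusion is added to the fact base.
RULES = [
    ({"profit warning", "high debt"}, "negative outlook"),
    ({"negative outlook"}, "recommend sell"),
    ({"takeover rumour"}, "expect sharp price movement"),
]


def forward_chain(facts, rules=RULES):
    """Repeatedly apply the rules until no new facts can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


# forward_chain({"profit warning", "high debt"}) derives "negative outlook"
# and then "recommend sell". Backward chaining would instead start from a
# goal (e.g. "recommend sell") and work back to the facts that support it.
```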

Several other modules are usually present in an expert system (e.g. meta-knowledge)^.

The most important step in the development of financial tools based on expert systems is the acquisition of the domain-specific knowledge, consisting of the methods that would be used by a domain expert for making appropriate decisions [Walczak, 1995]. This knowledge will normally consist of heuristics which, unfortunately, are extremely difficult to verbalize, and the interview process to identify and collect these heuristics can last for weeks.

We will not describe the architecture of expert systems in further detail^, but we will here concentrate on their applications in finance. Unfortunately, information about such systems is generally limited, since disclosure of successful approaches by the financial operators could lead to the loss of competitive advantage and large sums of money. As a general point, financial operators today tend to prefer neural networks for real-time forecasting, while expert systems now tend to be used more in other financial fields, where the outcome of the system must be a clear decision - e.g. validating users' credit-card accesses. Expert systems are used in accounting [O'Keefe et al., 1993], auditing [McCarthy et al., 1992], decisions in insurance companies [Wright and Rowe, 1993], etc. In general, present expert systems in finance are normally used to support the operator's decision, rather than as decision makers [Gilbert, 1995].

^For additional details on expert systems in general see [Walczak, 1995].
^Additional information can be found in [Zahedi, 1993], [Watkins and Eliot, 1993] or [Sell, 1986].

Figure 2.5: The basic structure of a neural network (neurons 1, 2, 3, ..., i, ..., n connected by weighted links).

Neural Networks

Today neural networks are the most common system for the prediction of share prices, excluding conventional techniques. Although a few expert systems are still in use, neural networks have proven to be effective tools for supporting the financial operator's trading decisions, once they have been correctly trained. Neural networks are therefore forecasting tools, rather than explanatory tools. Neural networks in finance are mostly used to produce quantitative information (e.g. expected prices of shares). However, they can potentially be used for qualitative analysis.

Neural networks have been created with the purpose of duplicating and simulating the components and functionalities of the brain.

The main component of a neural network is an element modeled after a neuron and called a neural processing element or just element. Neurons are connected to one another with a network of paths which carry the output of a neuron to the next one (figure 2.5). The fact that the communication on a single path is uni-directional does not prevent the creation of a two-way connection, since it is possible to create an additional link from the destination to the source neuron. Each neuron, although connected to many other neurons, produces a single output impulse each time, which is sent to other neurons (figure 2.5).

The most important parameter of the neural network is the strength or weight with which the neurons are linked. If two neurons, i and j, are connected with two uni-directional links, the links will have two separate weights, w_ij and w_ji. The weights are extremely important, since they are the elements where the domain knowledge is stored and represent the way in which the neural network is able to learn new knowledge (through the training of the network). Associated with each neuron is also a state, which is usually implemented as an extra weight and must therefore also be estimated [Refenes, 1996].

The neurons are connected using two different types of connection [Zahedi, 1993]:

• excitatory: the output of a neuron increases the potential action of the connected neuron;

• inhibitory: the connected node reduces its activity on receiving the output of the source neuron.

In most neural networks, however, all connections are excitatory. The neurons are then grouped into layers or slabs which are interconnected. Each layer performs a specific operation - e.g. an input layer will consist of neurons which receive input from the external environment.
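As a minimal illustration of this structure (weighted links between neurons organised in layers), the sketch below computes the output of a small feedforward network with one hidden layer; the weights would normally be obtained through training, which is discussed next and omitted here, and the layer sizes and activation function are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    """A common squashing activation function."""
    return 1.0 / (1.0 + np.exp(-x))


def feedforward(inputs, w_hidden, w_output):
    """Propagate `inputs` through one hidden layer and one output layer.

    w_hidden has shape (n_hidden, n_inputs) and w_output (n_outputs, n_hidden);
    each weight w[i, j] is the strength of the link from neuron j to neuron i.
    """
    hidden = sigmoid(w_hidden @ np.asarray(inputs, dtype=float))
    return sigmoid(w_output @ hidden)


# Example with random (untrained) weights: 3 inputs, 4 hidden neurons, 1 output.
rng = np.random.default_rng(0)
output = feedforward([0.2, 0.5, 0.1],
                     rng.normal(size=(4, 3)),
                     rng.normal(size=(1, 4)))
```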

We will not further discuss the analytic representation and structure of a neural network, since it is not relevant in the context of this work^. The important aspect to notice here is that the neural network, to be of any use, must learn from the external world, and this process is called training. The knowledge, as we already observed, will be stored as weights given to the links between the neurons. The learning method is the most relevant distinguishing factor among neural networks and, in the case of financial neural networks, represents the most important aspect. The learning method can be supervised, if the output of the net is compared with results which are already known during the training of the net, or unsupervised. More importantly, it can be off-line or on-line. In the former case, the neural network is trained as a separate process and cannot be updated during its use. Vice versa, an on-line neural network can learn while it is being used and, in the case of financial tools, this possibility is extremely important. Training can be a long and expensive process and is a time-consuming operation. In the case of financial forecasting, the operator will usually be required to input a relevant number of items (usually consisting of prices of stocks) into the system and to periodically update the network, before this can be used in real-time prediction.

Two different sets of information are needed to perform the training process of the network. First, the designers need to acquire knowledge regarding the aspects of the domain (e.g. the share price of a particular company) that can influence the outcome of the decision-making process [Walczak, 1995]. Second, the designer must collect a huge amount of data that will be used to train the network according to the criteria identified in the first set of information. This data will typically consist of historical price databases regarding share prices connected to the specific criteria identified. An important point to notice is that the data to be collected must be free of noise or errors, which would otherwise cause the production of inaccurate solutions/forecasts. The training process can last from hours to weeks, depending on the amount of data that is used for the training. The training of neural networks is becoming a very important issue in the financial community and various techniques for improving the process are being studied^.

^Additional information can be found in many sources, for example [Zahedi, 1993].
^For example: [Klimasauskas, 1995]

Trading systems based on neural networks are nowadays extremely widespread. Neural networks are commonly used for the prediction of share prices. They have also been applied to currency price forecasts, futures forecasts etc. [Azoff, 1994]. The performance of such systems in terms of predictions compares well with that provided by conventional methods, for example regression [Refenes et al., 1994].

Any financial institution has at least one system working in real-time which supports the decision-making process of the operators.

Neural networks and expert systems can be integrated to overcome the limitations of each technique. Although most of the best systems developed by big financial operators are not publicly available, an enormous quantity of neural-network systems are available even to the small investor. Many small AI firms specialize in Neural Network systems which are sold to end-users^. Neural networks have been used for other purposes in finance, including forecasting foreign exchange rates [Steurer, 1994], [Levitt, 1994], bond price forecasting [Kingdon, 1994] and the prediction of corporate mergers [Sen and Oliver, 1994].

2.3 Integrating conventional and intelligent systems

As we have seen in the previous section, Artificial Intelligence is widely used in finance and is increasingly replacing conventional statistical and probabilistic techniques in many fields of application. Although in the future tools based on AI techniques may replace most conventional techniques, many of the well-known and efficient conventional techniques will survive and, probably, be integrated with the new technologies. For example, the Moving Average indicator, described in section 2.2.1, will continue to be used as a basis for subsequent analysis.

^A brief list of AI companies specialising in Neural Network products can be found in [Azoff, 1994].

An important issue, therefore, is studying how to integrate the two different worlds and the different kinds of artificial intelligence techniques.

We will here briefly analyze some potential points of interest. First of all, Expert Systems (and potentially Neural Networks) can be integrated with databases [Zahedi, 1993]. Expert systems, in fact, can help recognise patterns and errors that would be impossible for a conventional database to recognize and would be left as a task for the user.

Another possibility is to integrate expert systems and neural networks with statistical and probabilistic techniques. For example, we can imagine a tool that, given a particular set of input data (e.g. a historical price database of a share), will first process the data using conventional (and fast) techniques and, if the user is then interested, process the data using Expert Systems and Neural Networks to obtain additional results, reporting the differences in the forecasts. Moreover, statistical and probabilistic techniques can be used in particular situations where AI techniques fail to produce a sensible result but an answer must be produced anyway. Statistical and probabilistic techniques can also be used to pre-process the data given as input to the Expert System / Neural Network. For example, the seasonal component of a stock's price time series can be eliminated using conventional techniques before further processing using a forecasting tool based on Neural Networks.

Finally, Expert Systems and Neural Networks can be integrated with each other, using one or the other according to the specific situation. The two techniques, in fact, present advantages and disadvantages according to the particular situation in which they are used.

2.4 Qualitative tools

We can define qualitative tools as tools which process qualitative data and produce qualitative or quantitative information (figure 2.6) [Costantino et al., 1996d].

Figure 2.6: The kind of input and output of a qualitative tool (qualitative data is turned into quantitative or qualitative information).

Figure 2.7: Different kinds of qualitative tools, classified as explanatory, forecasting or general support and as conventional (information retrieval) or AI-based (natural language processing systems, expert systems, neural networks).

While quantitative data is easily identifiable - prices of stocks, historical time series of shares, bonds, inflation, interest rates and all sorts of relevant numerical data - qualitative data is more difficult to define. Quantitative data can easily be used in mathematical or statistical equations, which does not normally apply to qualitative data.

Qualitative data, instead, is difficult to express in a numeric format. For example, data regarding rumors, fears, brokers' recommendations, takeovers etc. are all qualitative data. A sentence such as:

"there are rumors of a possible takeover of Apple, the troubled computer manufacturer"

represents information which is extremely relevant to the financial operators, since it will probably immediately cause a marked movement in the quotation of Apple's shares as well as those of the possible buyers. However, taking it into account in a mathematical/statistical equation would be extremely difficult. These kinds of news are extremely important because they affect the expectations of the operators regarding a particular share. The way in which the operators are influenced depends on how an operator perceives the information. Even if, in theory, it could be possible to produce a complete econometric model which takes into account all possible variables and expected behaviors of the players, the complexity of the financial world makes it impossible to produce such a model, which would anyway be extremely expensive in terms of the necessary computing power.

A similar situation can be found in macro-economics. The advanced econometric macroeconomic models employed by the central banks often fail to predict the development of economic cycles, crises and expansions. Only very few macroeconomic relations (e.g. interest-rate/investments) are actually widely used and the effect of a change in one of the variables is easily predictable.

Qualitative data is much more difficult to process than quantitative data. Therefore, while all sorts of quantitative financial tools are nowadays available on the market, very little progress has been made in the processing of qualitative information, which is usually left to the financial operators.

In the financial community, news, rumors and facts are among the most important factors that determine the operators' behavior. Operators, in fact, are much more influenced by news than by analysts' forecasts or historical price analysis of shares. When news such as "the inflation rate is expected to increase next month" arrives, the consequences are immediately visible and operators base their decisions on their personal experience and on other people's behavior, rather than on expensive and complex forecasts produced by complicated neural-network financial forecasting systems. Quantitative methods work fine and help the operators' decision-making process by suggesting a "normal" path of the prices but, in the end, what is important is the news and people's reaction to it.

The financial operators and information providers understood the importance of qualitative data as the key point in the trading decision-making process long ago. Therefore, the emphasis has been on providing as much relevant qualitative information as possible. Financial operators receive in real-time news regarding companies (announcements, rumors, profit forecasts etc.), macroeconomics (movements of the inflation rate, unemployment etc.) and politics (changes in the general macroeconomic policy of the government, tax policies etc.). They also have access to huge quantities of past information.

The ideal situation would be a system which is able to process the qualitative information, take into account any possible factor and produce a response such as "buy/sell". Unfortunately, excluding conventional mathematical and statistical techniques, which are unlikely to be usable for such analysis, current artificial intelligence techniques are not sophisticated enough to produce such an output and, therefore, the decisions are still mainly taken by the operators.

Current development of qualitative tools is therefore concerned with "reducing", summarizing or partitioning the news according to specific criteria, rather than inferring decisions from them. This is equivalent to performing simple analysis on quantitative data (e.g. the Moving-Average index), rather than producing a clear operative decision. Explanatory or forecasting qualitative tools have yet to be developed.

Expert Systems have already been analysed as quantitative tools in section 2.2.2. However, they can also be successfully used with qualitative input and qualitative output, such as suggesting buy/sell decisions. In the same way, Neural Networks can be used as qualitative tools. For example, they could be used to produce direct buy/sell suggestions rather than expected prices of stocks.

One technique that can be considered strictly qualitative is natural language processing, which will be discussed in detail in the following chapter.

Chapter 3

Information retrieval and information extraction

3.1 Introduction

In this chapter we introduce the field of information extraction and the most relevant techniques, approaches and systems. Many of the statistical and probabilistic techniques used in information extraction are borrowed from information retrieval.

We will therefore briefly analyse information retrieval focusing on the most important techniques which are applied to information extraction. We will later focus on information extraction and the most important systems that have been developed so far (specifically those developed for the MUC-5 and MUC-6 competitions).

3.2 Information Retrieval

Information retrieval is the extraction of relevant information from a large collection of texts. Typically, information retrieval techniques are used to retrieve relevant documents from a large collection of various kinds of documents (newspaper articles, legal documents etc.). The relevant documents can be identified according to specific criteria supplied by the user. These criteria are generally called a query, which represents the user's requirements for the selection of the relevant documents. This first kind of information retrieval is sometimes called ad-hoc information retrieval (e.g. in the TREC conferences [Harman, 1994c]). Another aspect of information retrieval is the routing of a stream of incoming documents, which are discriminated between relevant and non-relevant documents given specific criteria. Some authors [Lewis and Jones, 1996] consider information retrieval as a global term covering any sort of extraction or identification of data in source articles, and document retrieval (DR) as the identification of relevant documents in a collection. We here prefer to consider information retrieval as the process used for the identification of relevant documents among a collection of documents, rather than as a global term.

An important aspect of the information retrieval area is efficiency and effectiveness. Efficiency is usually quantified in terms of the computer resources used by the program, such as memory and CPU time. Efficiency is very difficult to measure as a machine-independent value. Moreover, it should be measured taking into consideration the effectiveness of obtaining the information requested.

Effectiveness is usually measured in terms of precision (accuracy) and recall (completeness). Precision can be thought of as the ratio of the number of relevant documents retrieved to the total number of documents retrieved:

Precision = number of relevant items retrieved / total number of items retrieved

Recall is the ratio between the number of relevant documents retrieved and the total number of relevant documents (both retrieved and not retrieved) [Rijsbergen, 1979]:

Recall = number of relevant items retrieved / total number of relevant items in the collection

Another measure, sometimes used, is fallout, which is defined as the ratio between the number of non-relevant items retrieved and the total number of non-relevant items in the collection [Harman, 1994c]:

Fallout = number of non-relevant items retrieved / total number of non-relevant items in the collection
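As a minimal illustration of these three measures (a Python sketch; the function and variable names are hypothetical and not part of any TREC scoring program), they can be computed directly from the sets of retrieved and relevant document identifiers:

def retrieval_measures(retrieved, relevant, collection_size):
    """Compute precision, recall and fallout for a single query.

    retrieved       -- set of document ids returned by the system
    relevant        -- set of document ids judged relevant
    collection_size -- total number of documents in the collection
    """
    hits = retrieved & relevant                        # relevant items retrieved
    nonrelevant_total = collection_size - len(relevant)

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fallout = ((len(retrieved) - len(hits)) / nonrelevant_total
               if nonrelevant_total else 0.0)
    return precision, recall, fallout

# Example: 8 of the 10 retrieved documents are relevant, 20 documents are
# relevant overall, in a collection of 1,000 documents.
retrieved = set(range(10))
relevant = set(range(8)) | set(range(100, 112))
print(retrieval_measures(retrieved, relevant, 1000))   # (0.8, 0.4, ~0.002)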

The techniques that are available for extracting relevant information from a text are numerous. It is useful to classify them into the statistical and probabilistic approach, the linguistic approach and the combination of the two approaches. While the statistical and probabilistic approach makes wide use of these techniques to find the relevant information in the text, with the linguistic approach the goal is to perform a deeper analysis of the source text, involving the analysis of the meaning of the sentences in order to select the information that is needed. The third approach combines the two: the techniques based on statistics and probability and the linguistic ones.

3.2.1 Statistical and Probabilistic approaches to Information Retrieval

The first approaches to information retrieval use keyword searches and statistical techniques in order to retrieve the relevant documents. These approaches achieve good results in terms of speed, but have many disadvantages, mainly related to the fact that the meaning of the text is not understood by the information retrieval application.

The Boolean retrieval

A very simple retrieval method, which also forms the basis of many commercial retrieval services, is Boolean retrieval. This method, unfortunately, provides low performance in terms of precision and recall. The term "Boolean" is used because the query specifications are expressed as words or phrases combined using operators coming from Boolean logic. In the Boolean method the documents retrieved are those which match exactly the terms that form the query and, thus, there is no distinction among the documents retrieved. The main problem of this method is that it does not allow any ranking of the documents of the collection. In fact, it is quite unlikely that every document will be relevant in the same way as all the others. Moreover, excluding documents from those retrieved because they do not exactly match the query can be reductive. Boolean retrieval is an exact-matching retrieval method, thus it only allows retrieval of documents that match the query exactly. This is different from best-matching methods, which retrieve documents that match the query in the best way.
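A minimal sketch of exact-match Boolean retrieval over a small inverted index (Python; all names and the toy documents are purely illustrative):

from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Exact matching: return the documents containing *all* query terms.
    Every matching document is returned and there is no ranking."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "takeover rumors hit Apple shares",
    2: "Apple announces quarterly earnings",
    3: "takeover talks between two banks",
}
index = build_index(docs)
print(boolean_and(index, ["apple", "takeover"]))   # {1}

A document that contains only one of the two query terms is excluded entirely, which is exactly the limitation discussed above.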

The Vector-model

Another approach is based on the frequency of a word in a document and, thus, on the identification of key terms in the document (this process is known as "indexing") which can be matched with the queries supplied by the user. In fact, a term appearing often in the text may be more important for the identification of relevant information than a term appearing rarely. On the other hand, if the same term occurs in many documents, it will probably be irrelevant for finding the relevant information. Therefore, the specificity of a given term as applied to a given document can be measured by a combination of its frequency of occurrence inside that particular document (the term frequency or tf) and an inverse function of the number of documents in the collection to which it is assigned (the inverse document frequency or idf). The idf factor can be computed as 1 divided by the document frequency. A possible weighting function for a generic term i appearing in document j can be [Salton, 1986]:

w_ij = tf_ij · idf_i

The second factor shown in the formula above, the inverse document frequency of a term, can be obtained in advance from a collection analysis, while the term frequencies can be computed from the individual documents, as needed.

Using the weighting formula, the documents of the collection can be ranked in connection with the query of the user. The first documents will be those where the specific terms occur frequently in the document, and very rarely in the rest of the collection. Such terms will in fact distinguish the relevant documents from the non-relevant ones.
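The weighting scheme can be sketched as follows (Python; the idf form used here, the logarithm of the collection size over the document frequency, is only one common variant of the inverse function mentioned above, and all names are illustrative):

import math
from collections import Counter

def tf_idf_weights(docs):
    """Return a {doc_id: {term: weight}} table using w = tf * idf."""
    n_docs = len(docs)
    doc_terms = {d: Counter(text.lower().split()) for d, text in docs.items()}

    df = Counter()                      # document frequency of each term
    for terms in doc_terms.values():
        df.update(terms.keys())         # one count per document, not per occurrence

    weights = {}
    for d, terms in doc_terms.items():
        weights[d] = {t: tf * math.log(n_docs / df[t]) for t, tf in terms.items()}
    return weights

A term that occurs in every document gets idf = log(1) = 0 and therefore contributes nothing to the ranking, which matches the intuition that such terms cannot distinguish relevant from non-relevant documents.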

The main disadvantage of the simple word-based approach (sometimes also called the vector method [Croft and Turtle, 1992]) considered above is that single words are rarely specific enough for accurately discriminating documents, and their grouping is often accidental [Strzalkowski, 1994]. A better method is to identify groups of words that create meaningful phrases, thus considering not just words but phrases when looking for relevant information inside a document. A good example [Strzalkowski, 1994] is the phrase "Joint Venture", which can be much more important in a financial article than either "Joint" or "Venture". In large databases comprising hundreds of thousands of documents, the use of phrasal terms is not just desirable, but it becomes necessary.

The terms that will be used for extracting information from a text are usually pre-processed through four main techniques: removal of high frequency words, suffix stripping, detecting equivalent stems and addition of synonyms. High frequency words are eliminated because they are supposed to be too common and, thus, not significant for the identification of relevant information in a document. This phase is normally implemented by comparing the input text with a "stop list" of words that are supposed to be common. Common words can be, for example, about, into, cannot, our, etc. The advantages of the process are not only that non-significant words are removed and will not interfere with the searching process, but also that the size of the original text will be reduced considerably [Rijsbergen, 1979]. The second process, suffix stripping, is more complicated. The words involved in the searching process are checked and common suffixes are removed from the word, obtaining the stem. For example, verbs are reduced to their stem (e.g. "moved" is reduced to "mov"). A standard approach to do this is to have a complete list of suffixes and to remove the longest possible one. Unfortunately, suffix stripping can lead to errors in the searching process, and the suffix must be removed taking into account the context. Many words, even if they look equivalent, are not, and special algorithms must be used, as well as lists of irregular words. Moreover, the suffix-stripping algorithms are often different from one language to another. The most important advantage of the suffix-stripping algorithms is that words sharing the same stem should represent the same concept and, thus, the number of words to be used in the search process can be reduced.

Another process that can be useful to improve the recall performance of information retrieval applications is to add to the query synonyms of the original terms or broader terms. The last step that can be useful in order to reach a better performance in the searching process is normalisation. This process is normally lexicon-based and, thus, a dictionary is needed in order to construct the proper word. For example, the word "retrieval" is reduced to "retriev" by the suffix-stripping algorithm and mapped back to "retrieve" by the normalisation algorithm.
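The whole pre-processing chain can be illustrated with a deliberately crude sketch (Python). The stop list, suffix list, lexicon and synonym table below are tiny illustrative stand-ins, and the suffix stripping is far simpler than a real Porter-style stemmer:

STOP_WORDS = {"about", "into", "cannot", "our", "the", "a", "of"}   # tiny stop list
SUFFIXES = ("ation", "ing", "ed", "al", "s")                        # longest removed first
LEXICON = {"retriev": "retrieve", "mov": "move"}                    # stem -> normalised word
SYNONYMS = {"share": ["stock", "equity"]}                           # query-expansion table

def strip_suffix(word):
    """Very crude suffix stripping; real systems use context-aware rules."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text, expand_synonyms=False):
    terms = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue                                  # remove high-frequency words
        stem = strip_suffix(word)                     # suffix stripping
        term = LEXICON.get(stem, stem)                # lexicon-based normalisation
        terms.append(term)
        if expand_synonyms:
            terms.extend(SYNONYMS.get(term, []))      # recall-enhancing expansion
    return terms

print(preprocess("retrieval of the moved shares", expand_synonyms=True))
# ['retrieve', 'move', 'share', 'stock', 'equity']

Here "retrieval" is stripped to "retriev" and then normalised to "retrieve", mirroring the example in the text.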

Techniques such as suffix removal, addition of synonyms and addition of related or broader terms can be seen as recall-enhancing strategies but, unfortunately, the use of these techniques can also lead to a loss of precision. There is often a trade-off between recall and precision: every operation that broadens the terms to search will generally lead to an increase in recall, while a narrowing of the terms will lead to an improvement in precision. It usually happens that the harder a system tries to extract all the relevant information (i.e., the more aggressively it is configured), the more likely it is to extract erroneous information [Sundheim, 1993].

The probabilistic approach

Another approach to information retrieval is based on the use of probability and, thus, the model is called the probabilistic retrieval model. Probabilistic information retrieval models are based on the Probability Ranking Principle [Belkin and Croft, 1992]. This states that a way to discriminate between the documents of a collection is to rank the documents in order of their probability of matching a certain query. Clearly, the ranking process takes into consideration the limited information available at the time it is made. The probability ranking principle, thus, assumes that it is possible to calculate the probability of relevance of a document, P(relevance|document), and, also, to estimate it accurately. The problem that arises in trying to calculate this probability is that the percentage of relevant documents among all the others is unknown and, therefore, must be estimated. Following the probabilistic retrieval model, the most valuable documents for retrieval purposes are those whose probability of relevance to a query is largest [Salton, 1986] and, in this, the approach differs quite a lot from the simple word-matching search: the relevance property of a document is estimated by taking into consideration the relevance of the individual terms in the document. Various different probabilistic formulations have been proposed and they differ mainly in the way in which they estimate the probability of relevance of a document. A possible relevance weighting function is:

tr_i = log((N - n_i) / n_i) + constants

where N is the collection size and n_i represents the number of documents in the collection containing term i [Salton, 1986].
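Assuming the collection-based form of this weight shown above (a reconstruction; the additive constants are omitted), a Python sketch of the computation is:

import math

def relevance_weight(collection_size, doc_freq):
    """Collection-based term relevance weight, log((N - n_i) / n_i).
    This is one common approximation used when no relevance judgements are
    available; the additive constants of the full formulation are omitted."""
    return math.log((collection_size - doc_freq) / doc_freq)

# A term appearing in 50 of 10,000 documents receives a much higher weight
# than one appearing in 5,000 of them.
print(relevance_weight(10_000, 50))      # ~5.29
print(relevance_weight(10_000, 5_000))   # 0.0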


The Term-Discrimination approach

A different approach to information retrieval is the Term-Discrimination Model. This model assumes that the most useful terms for finding relevant information in the collection of documents are those which can distinguish the documents of a collection from each other. Thus, the value of a term should be measured by calculating the decrease in the "density" of the document collection that results when a given term is assigned to the collection [Salton, 1986]. The "density" of the documents will be high when they are indexed by many of the same terms. With the Term-Discrimination approach the terms that occur often in all the documents of the collection become the least useful for searching for relevant information, while the terms that make it possible to distinguish between the documents will be preferred. In other words, the best words to be used should be those which appear neither too often in the documents, nor too rarely. The Term-Discrimination Model, thus, is based on the assumption that the terms that are able to distinguish a document among the other documents of the collection are those which are specific enough but not so specific as to be rare.

Other non-linguistic approaches to information retrieval

In recent years, researchers have moved towards new directions in information retrieval. The common thought is that better performance in information retrieval tasks can be obtained if the algorithms are able to "understand" in some way the meaning of the text in order to extract the relevant embedded information. In this view [Croft and Turtle, 1992], information retrieval is an inference or evidential reasoning process in which the goal is to estimate the probability that the information needed by the user is available given a document as "evidence". The retrieval method based on inference networks follows this approach [Croft and Turtle, 1992] and is based on Bayesian inference networks [Belkin and Croft, 1992], which are directed, acyclic dependency graphs in which nodes represent propositional variables or constants and edges represent dependence relations between propositions. The basic document retrieval inference network consists of two component networks: a document network and a query network. The document network represents the collection of documents through a variety of representation schemes. The document network is built at the beginning for a particular document collection, and it does not change while the search process takes place. While the document network consists of many nodes, the query network consists of a single node, which represents the query supplied by the user (that is, the information needed). Moreover, while the document network is built when a particular collection of documents is given, the query network is built each time the user supplies a query and is modified during the search processing as existing queries are redefined or new queries are added to perform a better search. A complete description of the algorithms can be found in [Croft and Turtle, 1992]. However, at the moment the inference networks method does not seem to be widely used in information retrieval applications.

3.2.2 The Linguistic approach to information retrieval

The Linguistic approach to information retrieval is based on the assumption that the statistical and probabilistic techniques are limited in the sense that they are not able to understand the meaning of the text. Other authors [Jones, 1992] refer to the linguistic approach as a meaning-oriented approach. The linguistic approach is based on the idea that a technique that could deeply understand the text could be much better than statistical and probabilistic techniques. Once the meaning of the text has been understood, the user could retrieve information by simply providing queries in natural language to the system, obtaining the information needed in the desired form. In other words [Stanfill, 1992], an information retrieval application based on the Linguistic approach should be able to process the collection of documents with a natural-language understanding system and to extract the meaning of the documents. The user's requests would be processed by the same natural-language understanding system and the information could be easily identified since the system has already understood the meaning of the text.

Unfortunately, the systems that have been built until now are still not able to cope with the free text of a general domain and usually work only in limited domains. In addition, linguistic approaches tend to be slower than systems based on statistics and probability and require very high computing power. Current research in information retrieval is therefore oriented towards the improvement of techniques based on statistics and probability through the use of linguistic methodologies; these approaches are usually called knowledge-based approaches.

3.2.3 Knowledge-based approaches (weak methods)

Knowledge-based approaches make use of natural language processing techniques to improve the performance of statistical or probabilistic methods in information retrieval. In this group of techniques it is also useful to include the weak methods [Jacobs, 1992], since they make use of NLP techniques but not in order to completely understand the meaning of the text. Many of the advanced approaches to information retrieval can be included in this group. For example, some of the applications used for TREC-2 (the Second Text REtrieval Conference) (see section 3.3) are based on statistical and probabilistic techniques supported by NLP methodologies.

The use of natural language processing techniques combined with methods based on statistics and probability can follow different paths. One possibility [Hobbs et al., 1992] is to ignore certain information that is present in the text, focusing only on part of the information. In the development of TACITUS [Hobbs et al., 1992] the choice was to use a statistical keyword filter to first select the information that would later be processed. In this way, it should be possible to reach a good level of speed, without incurring the typical problems of parsers. Normal approaches to information retrieval are able to convert natural language queries into a generic query. For example:

I'm interested in algorithms for parallel computers.

into

f(x) = c1 * parallel + c2 * algorithms + c3 * computers

DR-Link [Liddy and Myaeng, 1994], for example, is able to construct a logical query from a natural language query supplied by the user by using a sub-language grammar.
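A minimal illustration of this kind of conversion from a natural-language request into a weighted term query (Python; the stop list and the uniform weights are purely illustrative, and this is not the actual mechanism of TACITUS or DR-Link):

STOP_WORDS = {"i'm", "interested", "in", "for", "a", "the", "of"}

def to_weighted_query(natural_language_query, default_weight=1.0):
    """Turn a natural-language request into (term, weight) pairs,
    i.e. the f(x) = c1*t1 + c2*t2 + ... form shown above."""
    words = [w.strip(".,!?").lower() for w in natural_language_query.split()]
    content_terms = [w for w in words if w and w not in STOP_WORDS]
    return [(term, default_weight) for term in content_terms]

print(to_weighted_query("I'm interested in algorithms for parallel computers."))
# [('algorithms', 1.0), ('parallel', 1.0), ('computers', 1.0)]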

An interesting approach that shows a deeper use of NLP techniques in information retrieval is proposed by [Strzalkowski, 1994]. The system proposed is still based on a pure statistical core but is assisted by various natural language modules, such as a query translator for the information requested by the user and the improvement of indexing by using natural language generated compound terms. The authors emphasise the fact that the system should be a careful compromise between purely statistical non-linguistic approaches and those requiring deep, expensive semantic analysis. The first step taken by the system is to process the database with a parser. Furthermore, some sentences are extracted from the parse-tree and used as compound indexing terms in addition to single-word terms. The parsing algorithm initially attempts to generate a complete analysis for each sentence. However, unlike many other parsers, after a certain amount of time spent trying to parse the full sentence, the parser enters the skip-and-fit mode in which it tries to "fit" the parse. In this mode, the parser attempts to reduce incomplete constituents [Strzalkowski, 1994], usually skipping portions of input and starting the process again with less information. The skipped fragments are later analysed by a simple phrasal parser that looks for noun phrases and relative clauses and then connects these to the main parse structure coming from the information successfully parsed. The system also processes user queries by parsing them and recognising all the indexing terms. After the final query is constructed, the collection of documents is searched. Another system that makes use of a parser for performing the analysis of a text is the Clarit system [Evans and Lefferts, 1994].

The ConQuest system [Nelson, 1994] makes deeper use of NLP technology and is based on a dictionary including a semantic network for both indexing and query analysis. The dictionary is composed of a list of words, each of which can have multiple meanings. Moreover, each meaning contains syntactic information and a dictionary definition. The semantic network is composed of nodes which correspond to meanings of words and are linked to all other related nodes. The system uses the information in the semantic network to increase the performance of indexing. The natural language queries provided by the user are also processed by the system. Queries, however, are not understood in a real sense. ConQuest, in fact, attempts to understand only the meaning of single words and their importance. The final query used to retrieve the relevant documents from the collection using statistical techniques is built by looking up the meanings and the related terms of the words in the semantic network. ConQuest also includes modules for suffix stripping and normalising.

3.3 The TREC competitions

A particularly important group of systems performing information retrieval tasks are those that competed in the TREC 1-4 competitions [Harman, 1994c], [Harman, 1994b], [Harman, 1995]. The systems that entered the competitions had to perform specific tasks for each of the conferences, which were evaluated using the TREC evaluation metrics.

(desc) Description: Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.

(narr) Narrative: A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's products.

(con) Concept(s): 1. natural language processing 2. translation, language, dictionary, font 3. software applications

Figure 3.1: A TREC-2 topic.

3.3.1 Tasks

The target of the TREC (Text REtrieval Conferences) is to encourage research in information retrieval using large (several gigabytes) data collections [Harman, 1994a]. The first relevant TREC competition is TREC-2 (the Second Text REtrieval Conference) [Harman, 1994c], which formed the basis for the following TREC competitions. TREC-2 comprised two different types of information retrieval: ad-hoc and routing.

Another important point of the competitions was the construction of the queries, which could be done automatically, manually or semi-automatically for a particular topic. A topic consisted of information regarding the kinds of documents to be extracted from the collection. An example of a TREC-2 topic is shown in figure 3.1.

In the case of automatic query construction, a TREC-2 system was supposed to extract the relevant information from the topic and to produce the query without human intervention. There were several collections of documents for training and evaluating the systems, as well as associated topics for the extraction of proper queries.

The TREC-3 competition was held one year after the TREC-2 competition. The tasks of the competition were rather similar to those of the TREC-2 competition and included a "routing" and an "ad-hoc" task. In the routing task the questions asked of the systems remained the same each time while the data changed. In the ad-hoc task, instead, new questions were asked against a static set of data [Harman, 1994b]. In addition to the English language used in the TREC-2 competition, TREC-3 introduced a new language, Spanish, for testing the portability of the existing information retrieval systems.

The topics for the automatic, manual and interactive construction of queries in TREC-3 were different from those used in TREC-2. The topics were in fact much shorter and lacked the complex structure of the TREC-2 topics. The list of relevant concepts was also eliminated and the topics included a mini-knowledge about a topic such as a real user of the systems might possess. Moreover, the topics were written by the same group of users. The TREC-4 competition [Harman, 1995] was held in December 1995. The main tasks of the competition were similar to those performed in TREC-3 with the addition of specific "focussed" tasks. The five specific tasks, called tracks, were: a multilingual track, an interactive track, a database merging track, a "confusion" track and a filtering track. The multilingual track was designed to evaluate the performance and the difficulties in porting the systems towards new languages, Spanish in this particular case. The results show that the developers were able to quickly adapt their systems and the performance was similar to that for the English language. The interactive track was designed to study the possibility of having better interactions with the system during the retrieval of the information, rather than a predefined batch.

The database merging track was designed to check the performance of the systems on 10 different sub-databases to improve the flexibility of the systems.

The "confusion" track regarded the processing of news which was partially corrupted or imprecise. The task was intended to test the systems on possible real data, such as output from OCR programs, speech recognition systems etc. The TREC-4 conference was therefore intended to improve the systems' flexibility and portability across new domains, rather than improving them in standardised tasks.

The TREC-5 competition [Harman, 1996b] was held in November 1996. The main tasks of the competition were similar to those performed in the TREC-4 competition with the addition of a new set of documents and topics to support the Chinese language [Harman, 1996a; A. Smeaton, 1996]. In addition, a new document corpus and new topics were used for the Spanish tasks, focusing on European Spanish texts.

3.3.2 Techniques

The main techniques employed in the TREC competitions have already been described in the previous sections. These include statistical, probabilistic, linguistic and knowledge-based approaches to information retrieval. We will not describe here in detail each of the specific systems that participated in the TREC competitions.

3.3.3 Evaluation metrics

An important aspect of the TREC conferences is the evaluation of the results. The TREC-2 evaluation metrics were mainly based on the classical information retrieval measures: precision and recall. The evaluation measures used in TREC-2 are particularly relevant because of the high number of systems that competed and, also, because of the high number of documents tested. The evaluation of the subsequent TREC-3 and TREC-4 competitions was also based on the TREC-2 evaluation metrics. The first choice that had to be made to evaluate the performance of the TREC-2 systems was the number and kind of documents to be used for the evaluation of the systems. One solution was the evaluation of all the documents. However, this possibility was not chosen because of the large number of documents. Another possibility was to extract a sample of articles from the collection and use those for the evaluation of the systems. The solution chosen for TREC-2 was to evaluate the systems on the first 1000 articles for each topic.

Figure 3.2: The recall/precision curve (precision plotted against recall for two systems, A and B).

The recall/precision and recall/fallout curves

The main measures used in TREC-2 were the standard precision, recall and fallout. However, these measures were combined in order to obtain the recall/precision and recall/fallout curves.

The recall/precision curve is plotted as follows: the x axis plots the recall values at fixed levels of recall, while the y axis plots the average precision values at those recall values. In TREC-2 the curves were plotted over 50 topics. The use of the curves assumes a ranked output from a system; systems which do not rank documents were not tested in the TREC-2 competition.

In figure 3.2 the recall/precision curves of two systems are shown. System A shows higher precision at the low recall end of the graph and, therefore, is more accurate, while system B shows higher precision at the high recall end of the graph and, thus, will give a more complete set of relevant documents.
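The points of such a curve can be derived from a ranked result list roughly as follows (a Python sketch; interpolation conventions differ between evaluations, and the function names are illustrative):

def precision_at_recall_levels(ranked_docs, relevant, levels=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """For a ranked list, return the best precision achieved at or beyond each
    recall level (a simple interpolated recall/precision curve)."""
    if not relevant:
        return [(level, 0.0) for level in levels]

    points, hits = [], 0                 # (recall, precision) after each retrieved doc
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))

    curve = []
    for level in levels:
        precisions = [p for r, p in points if r >= level]
        curve.append((level, max(precisions) if precisions else 0.0))
    return curve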

The recall/fallout curve is another way to show the results and, normally, shows the same order of performance as does the recall/precision curve.

Figure 3.3: The recall/fallout curve (recall plotted against fallout for two systems, A and B).

The recall/precision curves show the retrieval results as a user would read them, while the recall/fallout curve is more useful for understanding the ability of a system to retrieve non-relevant material. In particular, the fallout measure shows the "discrimination power" of a particular system on a large document collection. For example, system A in figure 3.3 has a fallout of 0.02 at a recall of about 0.48; this means that this system has found almost 50 per cent of the relevant documents, while retrieving 2 per cent of the non-relevant documents.

Single-value measures

In addition to the recall/precision and recall/fallout curves, other measures were used in TREC-2. The first one is called non-interpolated average precision and represents the area under an ideal recall/precision curve. The index is computed by first computing the precision average for each topic, which is obtained as follows:

Precision average for each topic = sum of the precision values obtained after each relevant document is retrieved / number of relevant documents for the topic

Finally, the non-interpolated average precision index is obtained by averaging the precision averages obtained for each topic over the total number of topics in the document set.

The second "single-value" measure is an average of the precision for each topic after 100 documents have been retrieved for that topic.
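A minimal sketch of these two single-value measures (Python; names are illustrative and the exact TREC scoring program differs in detail):

def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one topic: the mean of the
    precision values observed at each relevant document retrieved."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def precision_at(ranked_docs, relevant, cutoff=100):
    """Precision after a fixed number of documents have been retrieved."""
    retrieved = ranked_docs[:cutoff]
    return (sum(1 for doc in retrieved if doc in relevant) / len(retrieved)
            if retrieved else 0.0)

def mean_over_topics(per_topic_values):
    """Average a per-topic measure over all topics of the document set."""
    return sum(per_topic_values) / len(per_topic_values)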

The TREC competitions have been particularly useful because they involved the evaluation of information retrieval systems in the identification of relevant documents in very large collections of source documents. These collections included several gigabytes of source documents and, therefore, the speed of the information retrieval engines had to be taken into account by the participants.

3.4 Information Extraction

The goal of information extraction is to extract specific kinds of information from a document [Riloff and Lehnert, 1994]. In other words, the input of the system is a stream of text while the output is some representation of the information to be extracted from the texts [Hobbs and Israel, 1994]. The main difference between information retrieval and information extraction is that while the main objective of information retrieval is to identify relevant documents among a generic collection of documents, information extraction tries to identify relevant information inside documents and produce a representation of the information. Thus, information extraction systems must not only be able to judge the relevance of a particular document in the collection but, also, identify relevant information inside the documents. Finally, the output of the two systems is rather different. The output of a classical information retrieval system will simply be the collection of relevant documents retrieved, while the output of an information extraction system can be of different types. One kind of output can be thought of simply as the fragments of relevant text found in the documents of the collection (see figure 3.4).

Source: Dow Vision, a service of Dow Jones. Date: Jan 19, 1995 Time: 9:34 am Bankers Trust New York Corp. (BT) said its 1994 earnings were affected by persistently difficult market conditions, which hurt trading revenue. The company said trading revenue was 49 dlrs million in the latest fourth quarter, down from 449 dlrs million in the year-ago quarter, while trading-related net interest revenue fell to 50 dlrs million in the latest fourth quarter from 187 dlrs million a year earlier. Bankers Trust said many of its proprietary trading businesses, principally fixed income instruments, recorded lower revenues during the latest fourth quarter as market conditions remained generally unsettled. The company said trading results also declined significantly in the emerging markets of Asia and Latin America, and said the volume of traditional risk management products slowed. The company said revenue increased from its client-related businesses that provide financing, advisory and transaction processing services. Bankers Trust said it reclassified 423 dlrs million of leveraged derivative contracts as receivable in the loan account and placed them on a cash basis during the last fourth quarter. Of the amount, the company said 72 dlrs million was subsequently charged off to the allowance for credit losses. About half of the remainder related to transactions with Procter and Gamble Co. (PG). With the transfers and charge-offs, the company said it has taken action on the leveraged derivative transactions that likely will not perform according to the contract and has charged off the balances deemed to be uncollectible. Bankers Trust said net charge-offs for the latest fourth quarter were 85 dlrs million, compared with 184 dlrs million a year ago.

The system could identify the relevant information in the sentences:

"Bankers Trust New York Corp. (BT) said its 1994 earnings were affected by persistently difficult market conditions, which hurt trading revenue. The company said trading revenue was 49 dlrs million in the latest fourth quarter, down from 449 dlrs million in the year-ago quarter, while trading-related net interest revenue fell to 50 dlrs million in the latest fourth quarter from 187 dlrs million a year earlier."

Figure 3.4: Information Extraction using fragments from the original document.

A car bomb exploded outside the Cabinet Office in Whitehall last night, 100 yards from 10 Downing Street. Nobody was injured in the explosion which happened just after 9 pm on the corner of Downing Street and Whitehall. Police evacuated the area. First reports suggested that the bomb went off in a black taxi after the driver had been forced to drive to Whitehall. The taxi was later reported to be burning fiercely. (THE DAILY TELEGRAPH 31/10/92)

Template produced:

Incident: The bomb explosion outside the Cabinet Office and outside 10 Downing Street in a black taxi.

Where: Outside the Cabinet Office and outside 10 Downing Street in the black taxi that a driver drove to Whitehall.

When: last night (30 October). When a forceful person forced a driver to drive to Whitehall a black taxi that fiercely burned.

Responsible: Unknown.

Target: Cabinet Office. 10 Downing Street.

Damage: Human: nobody. Thing: the black taxi that a driver drove to Whitehall.

Source: Telegraph.

Source-date: 31 October 1992.

Certainty: facts.

Other relevant information: Police evacuated Downing Street.

Figure 3.5: Information extraction using templates.

Alternatively, the output of an information extraction system may consist of templates (figure 3.5). A template is a structure with a predefined set of slots, one for each type of information to be extracted from the text [Riloff and Lehnert, 1994].
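A template of this kind can be represented very simply as a record with named slots. The sketch below (Python) is only an illustration of the idea using the example of figure 3.5; it is not the representation used by any particular MUC system or by LOLITA:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IncidentTemplate:
    """A predefined set of slots, one per type of information to extract."""
    incident: Optional[str] = None
    where: Optional[str] = None
    when: Optional[str] = None
    responsible: Optional[str] = None
    targets: List[str] = field(default_factory=list)
    damage: Optional[str] = None
    source: Optional[str] = None
    certainty: Optional[str] = None

# A filled instance corresponding to the example in figure 3.5:
bomb_report = IncidentTemplate(
    incident="bomb explosion outside the Cabinet Office",
    where="Whitehall, 100 yards from 10 Downing Street",
    when="last night (30 October)",
    responsible="unknown",
    targets=["Cabinet Office", "10 Downing Street"],
    damage="nobody injured; the black taxi",
    source="The Daily Telegraph, 31 October 1992",
    certainty="facts",
)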

Clearly, the way the template is presented to the end user is crucial for the effective usefulness of an information extraction system. In [Hobbs and Israel, 1994] three different but interacting considerations are indicated for an effective template:

• the template as representational device;

• the template as generated from input;

• the template as input to further processing, by humans or programs or both.

The design of templates has to take into consideration many different aspects, often in competition with each other [Onyshkevych, 1993]:

descriptive adequacy: the template should present all the relevant information for a particular task or application. The relevant information should include all supporting information such as measurement units etc.;

clarity: the information included in the template should not be ambiguous, both for the human end-user of the template and for further processing of the information by computer applications;

determinacy: there should be only one way of representing the extracted information inside a template;

perspicuity: "the degree to which the design is conceptually clear to the human analysts who will input or edit information in the template or work with the results" [Onyshkevych, 1993];

monotonicity: when new information is added to an already filled template, no new objects should appear in the template and none of the existing slots should be removed;

reusability: templates can be used in other domains and, therefore, must be designed to be easily converted.

3.4.1 The scripts / frames systems

The first attempts to understand the meaning of a text and to produce summaries or templates from it were mainly based on the concepts of frames or scripts. A frame can be defined [Tait, 1982] as "networks of nodes and relations the highest level of which are things which are always true about the situation, the lowest level being slots which are filled by the details of a particular instance of the situation." In a similar way, a script can be defined [Tait, 1982] as "a predetermined, stereotyped, sequence of actions that define a well-known situation." The systems based on scripts and frames were thus based on a collection of pre-defined stories and their task was therefore to identify which one was suitable for the text analysed. Various attempts have been made using the script/frame approach. The most important systems based on these approaches are analysed here. However, the analysis will not be very deep, because more advanced techniques have been introduced in recent years.

SAM

SAM [Schank and Riesbeck, 1981] is a system based on scripts. Each script provides pre-stored expectations about what will be read, based upon what has already been seen. The main problem of SAM, as of other systems based on scripts (or frames), is that it cannot be used for analysing texts for which it does not have a pre-stored script. SAM, in fact, does not deeply understand the meaning of the text and bases its knowledge on the scripts. Moreover, SAM, unlike other programs based on scripts or frames, is not prepared to deal with unknown information in a text. This means that, if a particular text does not exactly match any script, the program will fail to classify and predict its contents and, therefore, will not be able to produce any output for it. The main module in SAM is the script applier, whose task is to introduce the largest script for a particular story. Once the script has been applied, predictions are made about what should be the subsequent input. If the following text no longer matches the predictions contained in the script, the script applier will look for another script. Clearly, the main problem arises when the script applier is not able to match any script to the text.

PAM

A different approach is followed by PAM [Schank and Riesbeck, 1981], which identifies two main problems of the script/frames approach. The first is that it is not always possible to identify a single frame which classifies the entire text. The second is that the selection of an appropriate frame for a specific text can be ambiguous. Some texts, in fact, can potentially fit in more than one frame. Moreover, the authors of PAM consider the fact that it is very unlikely that one single story (text) is based on a single goal and it is more likely that in a single story different goals will appear together. PAM is thus based on the analysis of the goals that can be found in a particular story [Schank and Riesbeck, 1981]:

goal subsumption: a situation in which many recurrences of a goal are planned at the same time;

goal conflict: a story can include goals in conflict between themselves;

goal competition: different goals can be in competition.

The behaviour of PAM is therefore directed towards understanding the goals of a particular text and, thus, predicting the subsequent text.

FRUMP

Another script-based system is FRUMP by DeJong [DeJong, 1982]. The program, like others, tries to match a particular sketchy script to the text being analysed and tries to make predictions about the further contents of the text. The main characteristic of this program is that it simply skips the parts of the text that do not satisfy its predictions.

Scrabble

Another script/frame-based system considered here is Scrabble by Tait [Tait, 1982]. One observation made by Tait in his analysis is that a system should not skip over sections of text which do not follow the pre-loaded scripts. Scrabble is composed of five main modules: an English semantic parser, a combined stereotype management module, a stereotype application module, a text representation summariser and an English generator. The interesting modules in this context are the stereotype management module, the stereotype application module and the English summariser.

The stereotype management module analyses the text coming from the parser and suggests a possible stereotype suitable for the text. It is important to notice that, unlike in other systems, several instances (scripts) can be activated at the same time. Thus, the stereotype management module must also be able to decide whether an instance has to be deactivated or a new instance activated. An instance that is not used in a particular situation but has not been removed yet is called "suspended".

The stereotype application module waits for the incoming text and checks whether the predictions available in the currently activated instances, supplied by the stereotype management module, are satisfied.

The task of the summary generator is to produce the summary of the original text and supply it to the English generator for the final output of the program. The summary generator receives three kinds of input. The first is the list of all activated and suspended stereotype-instances. The second is a list of all parsed sentences that were not expected by the activated instances (this second input is given because Scrabble does not skip unexpected parts of a text like FRUMP does). Finally, the third input supplied to the summary generator is a "data-structure which represents the textual interrelationships of the propositions in the input text which gave rise to the elements of the other two" [Tait, 1982]. The first step taken by the summary generator is to attempt to choose between related instances used to analyse the text by further processing them. The summary generator attempts, in fact, to reduce the number of instances. This means that some instances may be killed or suspended for a particular text. The process in this step differs from the equivalent one made in the stereotype management module in that the assumptions made in the summary generator algorithm are much stronger than those in the stereotype management module, so that fewer instances are analysed. Once the relevant instances have been chosen, the elements that will later be included in the summary can also be identified. However, the content of the summary is not yet known and, therefore, each stereotype-instance is processed to produce the data-structure coming from the input and this is combined with the part of the text that was unexpected and, thus, impossible to process. The approach suggested by [Tait, 1982] seems to be better than others in the sense that it also considers the parts of the text that were unexpected and does not just skip them.

ATRANS

The first commercially successful program based on the script/frames approach was ATRANS, from Cognitive Systems [Hederman, 1992]. The task that ATRANS (Automatic Funds TRANSfer Telex Reader) performs is to extract information from Telex messages [Lytinen and Gershman, 1986]. More precisely, the messages that ATRANS processes are requests for money transfers that banks send to each other. ATRANS first extracts the relevant information and then uses it to produce an output that can later be used for an automatic execution of the transfer. The form of the output can be viewed as a template containing information such as the amount of money to be transferred, the date, the beneficiary and so on. The Telex messages given as input to ATRANS can vary considerably due to the fact that a money transfer can be sent from any part of the world. In fact, the English form of the messages will vary in spelling, sentence construction, standard abbreviations, amounts, date conventions and so on. ATRANS is able to extract correctly 80 per cent of the relevant information in the Telex messages, while about 15 per cent of the relevant information is lost and 5 per cent is identified incorrectly. The evaluation, however, was not carried out within a formal framework and was influenced by the very restricted domain.

ATRANS is based on the script-frames approach. This means that, like other systems based on this approach, it will first try to fit an appropriate script to identify the relevant information in the text. There are various scripts available in the system, according to the different kinds of transfers that can be processed, such as multiple intermediary banks, different methods of payment, more than one beneficiary and so on.

The architecture of ATRANS is similar to that of other script-based systems. ATRANS consists of four parts: the message classifier, the text analyser, the message interpreter and the output formatter.

The message classification module is similar to the script-applier modules found in other systems and determines the type of message currently processed, choosing the appropriate script from those available in the system. However, more than one script can be applied to the same Telex. This mainly happens in the case of multiple transfers and the message classification module identifies the common portions of the transfer (using one script) and composes several single transfer messages.

The output of the message classification module is given as input to the text analyser module, which is defined as the "heart of the system" by the authors. This module uses the chosen script and the dictionaries available and tries to identify the frames being referred to in the text, following the predictions of the script, such as, for example, payment, test and cover.

The analyser does not try to verify the information extracted or check its consistency. This task is performed by the message interpreter module which "verifies and consolidates the extracted information items". However, the output of the message interpreter is still in an internal representation form.

The conversion from the internal representation to the final output form is made by the output generator module. The form varies according to the particular user of the system (e.g. the Swift and Chips banking networks).

Systems based on the script/frame approach were the first attempt to produce summaries and templates from a text. However, many other approaches have been developed in subsequent years and, thus, they will not be analysed here in more detail.

3.5 The MUC competitions

A particularly interesting group of systems are those developed for the MUC 1-6 (Message Understanding Conference) tasks [Grishman and Sterling, 1989], [DAR, 1991], [DAR, 1992], [DAR, 1993], [DAR, 1995]. The target of the MUC competitions was to improve the technology of information extraction systems performing specific tasks set for each of the MUC conferences.

The procedure was equivalent for all MUC competitions. The first step was the definition of the tasks to be performed by the participating systems. Once the tasks had been defined, a set of training documents, the associated keys and an automatic scoring program were made available to the participants. The participants could use the set of training documents for developing and training the systems and carrying out an initial unofficial evaluation of the systems. After a specific amount of time allowed for the development and training of the systems, the actual evaluation was carried out. A final set of evaluation documents was released to the participants, who were asked to submit the results obtained using this set of documents. The final evaluation of the results was carried out by matching the results produced by the systems against a set of pre-defined keys for the set of evaluation documents. In the MUC-6 competition one of the evaluation tasks was released to the participants only one month prior to the final evaluation to emphasise the portability of the systems.

We will briefly analyse each of the tasks of the MUC conferences, and in particular the MUC-5 and MUC-6 conferences, the main techniques employed by the systems that participated and the measures used for the evaluation.

3.5.1 The MUC Tasks

The first two MUC competitions, known as MUCK-I and MUCK-II [Grishman and Sterling, 1989], represented initial experiments related to the evaluation of systems performing rather simple tasks on source texts. However, the evaluation set was extremely limited (just 5 texts for the MUCK-II competition) and, therefore, the results themselves are not particularly useful. The MUC-3 evaluation, instead, was significantly more relevant, both in terms of the tasks and the systems which entered the competition. However, the most important MUC competitions in terms of the kind of tasks, the number and quality of the systems that entered the competition and the results were MUC-5 [DAR, 1993] and MUC-6 [DAR, 1995], which will be described here in more detail.

TST1-MUC3-0080 BOGOTA, 3 APR 90 (INRAVISION TELEVISION CADENA 1) - [REPORT] [JORGE ALONSO SIERRA VALENCIA] [TEXT] LIBERAL SENATOR FEDERICO ESTRADA VELEZ WAS KIDNAPPED ON 3 APRIL AT THE CORNER OF 60TH AND 48TH STREETS IN WESTERN MEDELLIN, ONLY 100 METERS FROM A METROPOLITAN POLICE CAI [IMMEDIATE ATTENTION CENTER]. THE ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER HAD LEFT HIS HOUSE WITHOUT ANY BODYGUARDS ONLY MINUTES EARLIER. AS HE WAITED FOR THE TRAFFIC LIGHT TO CHANGE, THREE HEAVILY ARMED MEN FORCED HIM TO GET OUT OF HIS CAR AND GET INTO A BLUE RENAULT. HOURS LATER, THROUGH ANONYMOUS TELEPHONE CALLS TO THE METROPOLITAN POLICE AND TO THE MEDIA, THE EXTRADITABLES CLAIMED RESPONSIBILITY FOR THE KIDNAPPING. IN THE CALLS, THEY ANNOUNCED THAT THEY WILL RELEASE THE SENATOR WITH A NEW MESSAGE FOR THE NATIONAL GOVERNMENT. LAST WEEK, FEDERICO ESTRADA VELEZ HAD REJECTED TALKS BETWEEN THE GOVERNMENT AND THE DRUG TRAFFICKERS.

Figure 3.6: A MUC-3 document in the terrorist domain.

The MUC-3 Tasks

The MUC-3 task was to extract information about terrorist attacks from articles from newspapers, TV and radio news, speech and interview transcripts, rebel communications etc. The systems had to extract the relevant information and represent it in a predefined template. Differently from MUCK-I and MUCK-II, the MUC-3 competition required the systems to discriminate between relevant and irrelevant information (information filtering). In figure 3.6 an example MUC-3 article is shown, while in figure 3.7 an example MUC-3 template produced from the same article is shown.

The systems that entered the competition were based on various techniques such as statistical, key-word and finite-state analysis, deep natural language processing etc. The specific systems will not be analysed here, since most of them participated in the subsequent editions of the MUC competitions.

0. MESSAGE ID: TST1-MUC3-0080
1. TEMPLATE ID: 1
2. DATE OF INCIDENT: 03 APR 90
3. TYPE OF INCIDENT: KIDNAPPING
4. CATEGORY OF INCIDENT: TERRORIST ACT
5. PERPETRATOR: ID OF INDIV(S): "THREE HEAVILY ARMED MEN"
6. PERPETRATOR: ID OF ORG(S): "THE EXTRADITABLES" / "EXTRADITABLES"
7. PERPETRATOR: CONFIDENCE: CLAIMED OR ADMITTED: "THE EXTRADITABLES" / "EXTRADITABLES"
8. PHYSICAL TARGET: ID(S): *
9. PHYSICAL TARGET: TOTAL NUM: *
10. PHYSICAL TARGET: TYPE(S): *
11. HUMAN TARGET: ID(S): "FEDERICO ESTRADA VELEZ" ("LIBERAL SENATOR" / "ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER" / "SENATOR" / "LIBERAL PARTY LEADER" / "PARTY LEADER")
12. HUMAN TARGET: TOTAL NUM: 1
13. HUMAN TARGET: TYPE(S): GOVERNMENT OFFICIAL / POLITICAL FIGURE: "FEDERICO ESTRADA VELEZ"
14. TARGET: FOREIGN NATION(S): -
15. INSTRUMENT: TYPE(S): *
16. LOCATION OF INCIDENT: COLOMBIA: MEDELLIN (CITY)
17. EFFECT ON PHYSICAL TARGET(S): *
18. EFFECT ON HUMAN TARGET(S): -

Figure 3.7: A MUC-3 terrorist template.

The MUC-4 Tasks

The MUC-4 competition [DAR, 1992] was held one year after the MUC-3 competition. The tasks for the MUC-4 competition were rather similar to the MUC-3 ones. A few differences can be found in the templates, where some slots which combined two pieces of information were split into two separate slots. For example, the MUC-3 slot (TYPE OF INCIDENT) became two MUC-4 slots (INCIDENT: TYPE and INCIDENT: STAGE OF EXECUTION). Similarly, other slots were separated and a few other changes were made to the template definition.

The main change in MUC-4 concerned the evaluation metrics [Chinchor, 1992] which, for the first time, included the F-measure for a combined evaluation of precision and recall.
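For reference, the F-measure combines precision P and recall R into a single figure. In the usual van Rijsbergen formulation, with a parameter β weighting the relative importance of recall against precision, it can be written as

F_{\beta} = \frac{(\beta^{2} + 1)\,PR}{\beta^{2}P + R}, \qquad F_{1} = \frac{2PR}{P + R}

so that, in the balanced case β = 1, a system scores highly only when both precision and recall are high. The exact weighting conventions used in the official MUC scoring are those defined in [Chinchor, 1992].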

A greater number of systems entered the MUC-4 competition. We will not describe these systems and the techniques on which they were based, since most of them entered the subsequent MUC-5 competition.

The MUC-5 Tasks

The MUC-5 competition comprised two domains (joint ventures and microelectronics) and two languages, English and Japanese, thus giving four language/domain pairs. Each pair was associated with about 1200 to 1600 articles. In the case of the joint-venture domain for the English language (the only domain considered here) the articles were extracted from more than 200 sources, including the Wall Street Journal, the Jiji Press, the New York Times, the Financial Times, the Kyodo Services and a variety of other technical publications in fields such as communications, airline transportation etc. The articles were mainly selected using statistical and probabilistic information retrieval techniques, although manual filtering techniques were also used. A number of irrelevant documents, about 5 per cent of the total, were also included to test the systems' ability to discriminate between relevant and irrelevant documents in the collection. Each of the documents in the collection was associated with a hand-made template to be used for the subsequent evaluation by the scoring program.

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalised at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, Bridgestone Sports officials said. The new company, based in Kaohsiung, southern Taiwan, is owned 75 pct by Bridgestone Sports, 15 pct by Union Precision Casting Co. of Taiwan and the remainder by Taga Co., a company active in trading with Taiwan, the officials said.

With the establishment of the Taiwan unit, the Japanese sports goods maker plans to increase production of luxury clubs in Japan.

Figure 3.8: A MUC-5 document in the joint-venture domain.

The developers had access to numerous sources of different kinds of data, such as the English-language Gazetteer, provided to regularise geographic location information, lists of currency names and abbreviations, national adjectives, lists of countries and their abbreviations, and the hierarchical classification of all the industry or business types in the U.S.

The joint-venture domain of MUC-5 is relevant in the context of this work because it can be considered a subset of the financial domain. However, the joint-venture domain was extremely narrow, since it covered only a very limited partition of the financial domain. The MUC-5 systems were in fact built to skip over any kind of information not regarding joint ventures and, even if the overall results of the competition were significant in absolute terms, the restricted domain in which they were achieved has to be taken into account. In figure 3.8 a typical MUC-5 joint-venture article is shown.

A MUC-5 template was supposed to be able to identify the joint venture, the participants, the capital of the new company and all the other relevant information related to the joint venture (figure 3.9). The MUC-5 task was therefore limited not by the kind of template used for the joint venture, which was in fact sufficient to capture all the important information related to it, but by the extremely restricted partition of the financial domain considered: the joint-venture domain.

[Figure 3.9 shows the MUC-5 joint-venture template schema instantiated for document 0592: a TIE-UP RELATION (status: existing; site: Taiwan) linking the parent entities Bridgestone Sports Co., Union Precision Casting and Taga Co. to the joint-venture entity Bridgestone Sports Taiwan Co., together with an associated ACTIVITY (industry: production of "golf clubs"), a start TIME (during 01/90) and an OWNERSHIP object (capital: 20,000,000 TWD).]

Figure 3.9: The MUC-5 joint-venture template schema.
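Purely as an illustration of how the nested objects of figure 3.9 fit together (the field names below are hypothetical and this is neither the official MUC-5 notation nor LOLITA code), the filled template for the Bridgestone article could be represented as nested records along the following lines:

# Illustrative nesting of the MUC-5 joint-venture objects for document 0592.
joint_venture_template = {
    "template_id": "TEMPLATE-1",
    "document": "0592",
    "tie_up_relation": {
        "status": "existing",
        "site": "Taiwan",
        "parent_entities": ["Bridgestone Sports Co.",
                            "Union Precision Casting",
                            "Taga Co."],
        "joint_venture_entity": "Bridgestone Sports Taiwan Co.",
        "activity": {"industry": "production", "product": "golf clubs"},
        "start_time": "during 01/90",
        "ownership": {"capital": 20_000_000, "currency": "TWD"},
    },
}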

The MUC-6 Tasks

Differently from MUC-5, the MUC-6 competition [DAR, 1995] consisted of four different tasks:

• named entity recognition: this task involved the recognition of entity names (for people and organizations), temporal expressions, place names and numeric values such as money and percentages. The task cannot be regarded as a proper information extraction task; however, it is usually a necessary step towards producing a template or a summary of the original text (a minimal sketch of this style of annotation is given after this list);

• coreference: this task involved the identification of coreference relations. Different kinds of links had to be identified; for example, pronouns like "it" had to be linked to their corresponding entity, such as the name of a company;

• information extraction "mini-MUC" (template filling): this task involved the extraction of information about a specified class of events and the filling of a template for each instance of such an event. The information extraction task included the identification of people, organizations and artifacts. An example of a MUC-6 organisation template can be found in figure 3.10^;

• scenario templates: these templates had to be created following the guidelines contained in a "scenario" which was released only one month before the final evaluation of the systems. A sample training scenario was released together with the training data but was substantially different from the final one. The late release of the scenario forced the systems to be more flexible and increased their portability towards new domains. Differently from the MUC-5 or the "mini-MUC" templates, the scenario templates included references to other templates. The final evaluation scenario template, called the management scenario, concerned changes in the management of companies and is shown in figure 3.11.
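Returning to the first of these tasks, the following is a minimal sketch of the style of annotation used for named entity recognition. The ENAMEX/TIMEX/NUMEX tag names follow the MUC-6 conventions, but the annotated sentence and the extraction code are purely illustrative and are not taken from the official evaluation material:

import re

# A sentence annotated in the MUC-6 named-entity style (illustrative only;
# the official guidelines define the exact SGML tags and attributes).
annotated = ('<ENAMEX TYPE="ORGANIZATION">Bridgestone Sports Co.</ENAMEX> said '
             '<TIMEX TYPE="DATE">Friday</TIMEX> it has set up a joint venture in '
             '<ENAMEX TYPE="LOCATION">Taiwan</ENAMEX> capitalised at '
             '<NUMEX TYPE="MONEY">20 million new Taiwan dollars</NUMEX>.')

# Extract (tag, type, text) triples from the markup.
pattern = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">([^<]+)</\1>')
for tag, entity_type, text in pattern.findall(annotated):
    print(f'{tag:6} {entity_type:12} {text}')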

^The text of the article can be found in figure 3.8.

<ORGANIZATION-0592-1> :=
    ORG-NAME: "BRIDGESTONE SPORTS CO."
    ORG-ALIAS: "BRIDGESTONE SPORTS" "BRIDGESTON SPORTS"
    ORG-DESCRIPTOR: "SPORTS GOODS MAKER"
    ORG-TYPE: COMPANY
    ORG-NATIONALITY: JAPAN
<ORGANIZATION-0592-2> :=
    ORG-NAME: "UNION PRECISION CASTING CO."
    ORG-ALIAS: "UNION PRECISION CASTING"
    ORG-DESCRIPTOR: "A LOCAL CONCERN" "CONCERN"
    ORG-TYPE: COMPANY
    ORG-LOCALE: "TAIWAN" COUNTRY
    COMMENT: "uninformative descriptor"
<ORGANIZATION-0592-3> :=
    ORG-NAME: "TAGA CO."
    ORG-DESCRIPTOR: "TRADING HOUSE"
    ORG-TYPE: COMPANY
    ORG-NATIONALITY: JAPAN
<ORGANIZATION-0592-4> :=
    ORG-NAME: "BRIDGESTONE SPORTS TAIWAN CO."
    ORG-TYPE: COMPANY
    ORG-DESCRIPTOR: "A JOINT VENTURE"
    ORG-LOCALE: "KAOHSIUNG" CITY "KAOHSIUNG" PROVINCE
    ORG-COUNTRY: TAIWAN
    COMMENT: "A JOINT VENTURE is the most substantive descriptor"
<ARTIFACT-0592-1> :=
    ART-DESCRIPTOR: "GOLF CLUBS"
    COMMENT: "ART-TYPE not specifiable without rest of task"

Figure 3.10: A MUC-6 organisation template.