Indirect Supervised Learning of Strategic Generation Logic

Indirect Supervised Learning of Strategic Generation Logic Pablo Ariel Duboue Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2005 c 2005 Pablo Ariel Duboue All Rights Reserved Abstract Indirect Supervised Learning of Strategic Generation Logic Pablo Ariel Duboue The Strategic Component in a Natural Language Generation (NLG) system is responsible for determining content and structure of the generated output. It takes a knowledge base and communicative goals as input and provides a document plan as output. The Strategic Generation process is normally divided into two subtasks: Content Selection and Document Structuring. An implementation for the Strategic Component uses Content Selection rules to select the relevant knowledge and Document Structuring schemata to guide the construction of the document plan. This implementation is better suited for descriptive texts with a strong topical structure and little intentional content. In such domains, special communicative knowledge is required to structure the text, a type of knowledge referred as Domain Communicative Knowledge. Therefore, the task of building such rules and schemata is normally recognized as tightly coupled with the semantics and idiosyncrasies of each particular domain. In this thesis, I investigate the automatic acquisition of Content Selection rules and the automatic construction of Doc- ument Structuring schemata from an aligned Text-Knowledge corpus. These corpora are a collection of human-produced texts together with the knowledge data a generation system is expected to use to construct texts that fulfill the same communicative goals as the human texts. They are increasingly popular in learning for NLG because they are readily available and do not require expensive hand labelling. To facilitate learning I further focus on domains where texts are also abundant in anchors (pieces of information directly copied from the input knowledge base). In two such domains, medical reports and biographical descriptions, I have found aligned Text-Knowledge corpus for my learning task. While aligned Text-Knowledge corpora are relatively easy to find, they only provide indirect information about the selected or omitted status of each piece of knowledge and their relative placement. My methods, therefore, involve Indirect Su- pervised Learning (ISL), as my solution to this problem, a solution common to other learning from Text-Knowledge corpora problems in NLG. ISL has two steps; in the first step, the Text-Knowledge corpus is transformed into a dataset for supervised learning, in the form of matched texts. In the second step, supervised learning machinery acquires the CS rules and schemata from this dataset. My main contribution is to define empirical metrics over rulesets or schemata based on the training material. These metrics enable learning Strategic Generation logic from positive examples only (where each example contains indirect evidence for the task). Contents List of Figures v List of Tables ix Chapter 1 Introduction 1 1.1 Problem Definition . 6 1.1.1 Assumed NLG Architecture . 6 1.1.2 Content Selection Rules . 9 1.1.3 Document Structuring Schemata . 10 1.2 Research Hypothesis . 12 1.3 Methods . 13 1.3.1 Technical Approach . 18 1.4 Contributions . 21 1.5 Domains . 24 1.5.1 Medical Reports . 25 1.5.2 Person Descriptions . 25 1.6 Structure of this Dissertation . 26 Chapter 2 Related Work 28 2.1 Related Work in Content Selection . 28 2.1.1 ILEX Content Selection Algorithm . 29 i 2.1.2 STOP Content Selection Knowledge Acquisition . 30 2.1.3 Separated vs. Integrated Content Selection . 32 2.2 Document Structuring . 33 2.2.1 Schemata-based Document Structuring . 37 2.2.2 RST-based planning . 44 2.3 Related Work in Learning in NLG . 47 2.4 Related Work in Other Areas . 50 2.4.1 Related Work in Dialog Systems . 50 2.4.2 Related Work in Summarization . 51 2.4.3 Related Work in Assorted Areas . 55 2.5 Conclusions . 58 Chapter 3 Indirect Supervised Learning 59 3.1 Definitions . 60 3.2 Indirect Supervised Learning . 63 3.2.1 Evaluation . 65 3.3 Unsupervised Construction of Matched Texts . 68 3.3.1 Dictionary Induction . 68 3.3.2 Verbalize-and-search . 73 3.4 Data . 77 3.4.1 biography.com . 78 3.4.2 s9.com . 79 3.4.3 imdb.com . 79 3.4.4 wikipedia.org . 80 3.5 Experiments . 81 3.6 Conclusions . 96 ii Chapter 4 Learning of Content Selection Rules 98 4.1 Definitions . 100 4.2 Supervised Learning . 104 4.2.1 Learning Rules . 106 4.2.2 Traditional ML . 111 4.2.3 Baselines . 114 4.3 Experiments . 115 4.4 Conclusions . 118 Chapter 5 Learning of Document Structuring Schemata 120 5.1 Definitions . 121 5.2 Training Material . 124 5.3 Order Constraints . 127 5.3.1 Learning Order Constraints . 129 5.3.2 Using Order Constraints . 134 5.4 Supervised Learning . 137 5.4.1 GAL (Genetic Automaton Learner) . 138 5.4.2 Fitness Function . 141 5.5 Variants . 146 5.6 Evaluation Methods . 148 5.7 Conclusion . 148 Chapter 6 Experiments in the Medical Domain 150 6.1 Data . 152 6.2 Learning Order Constraints . 154 6.3 Learning Document Structuring Schemata . 162 6.4 Conclusions . 168 iii Chapter 7 Experiments in the Biographical Domain 171 7.1 Data . 172 7.2 Learning Order Constraints . 178 7.3 Learning Document Structuring Schemata . 178 7.4 Conclusions . 184 Chapter 8 Limitations 187 8.1 General Limitations . 187 8.2 Limitations of the matched text construction process . 190 8.3 Limitations of the learning of Content Selection rules . 192 8.4 Limitations of the learning of Document Structuring schemata . 193 8.5 Conclusions . 196 Chapter 9 Conclusions 197 9.1 Contributions . 198 9.1.1 Deliverables . 200 9.2 Possible Extensions . 201 9.3 Other Possible Domains . 203 9.3.1 Museum Exhibit Descriptions: M-PIRO . 203 9.3.2 Biology: KNIGHT . 204 9.3.3 Geographic Information Systems: Country Descriptions . 205 9.3.4 Financial Market: Stock Reports . 205 9.3.5 Role Playing Games: Character Descriptions . 205 Appendix A Additional Tables 231 iv List of Figures 1.1 Content Planning Task Example . 3 1.2 Graph Rendering of my Knowledge Representation . 4 1.3 Assumed NLG Architecture . 8 1.4 Example Rules . 10 1.5 Example Predicate . 11 1.6 Example Message . 12 1.7 Selected Items Example . 16 1.8 Learning Architecture . 19 1.9 System Architecture . 22 1.10 MAGIC Example . 25 1.11 Example of Biographies Training Data . 26 2.1 ILEX Predicate Definition . 30 2.2 Planning Example . 37 2.3 McKeown Original Schemata . 38 2.4 McKeown’s Attributive Schema (ATN) . 40 2.5 MAGIC Schema-like DS Tree Example . 43 3.1 A Frame-based Knowledge Representation . 61 3.2 An Example of a Matched Text (excerpt) . 64 3.3 Learning Architecture . 65 v 3.4 Dictionary Induction . 70 3.5 Pseudo-code Hypothesis Testing . 72 3.6 Extracted Words . 73 3.7 Verbalize-and-Search . 74 3.8 Pseudo-code Disambiguation in Verbalize-and-Search . 76 3.9 Iteration Curves, Variation 1 . 92 3.10 Iteration Curves, Variation 2 . 93 3.11 Impact of the training size for the matched text construction, Variant 2 . 94 3.12 Impact of the training size for the matched text construction, Variant 3 . 95 4.1 Input to the Learning System . 99 4.2 Content Selection Example . 100 4.3 Relatives Example . ..

Load more