The Computer Generation of Cryptic Crossword Clues
Total Page:16
File Type:pdf, Size:1020Kb
Riddle posed by computer (6) : The Computer Generation of Cryptic Crossword Clues David Hardcastle Birkbeck, London University 2007 Thesis submitted in fulfilment of requirements for degree of PhD The copyright of this thesis rests with the author. Due acknowledgement must always be made of any quotation or information derived from it. 1 Abstract This thesis describes the development of a system (ENIGMA) that generates a wide variety of cryptic crossword clues for any given word. A valid cryptic clue must appear to make sense as a fragment of natural language, but it must also be possible for the reader to interpret it using the syntax and semantics of crossword convention and solve the puzzle that it sets. It should also have a fluent surface reading that conflicts with the crossword clue interpretation of the same text. Consider, for example, the following clue for the word THESIS: Article with spies’ view (6) This clue sets a puzzle in which the (the definite article) must be combined with SIS (Special Intelligence Service) to create the word thesis (a view), but the reader is distracted by the apparent meaning and structure of the natural language fragment when attempting to solve it. I introduce the term Natural Language Creation (NLC) to describe the process through which ENIGMA generates a text with two layers of meaning. It starts with a representation of the puzzle meaning of the clue, and generates both a layered text that communicates this puzzle and a fluent surface meaning for that same text with a shallow, lexically-bound semantic representation. This parallel generation process reflects my intuition of the creative process through which cryptic clues are written, and domain expert commentary on clue-setting. The system first determines all the ways in which a clue might be formed for the input word (which is known as the “light”). For example, a light might be an anagram of some other word, or it might be possible to form the light by embedding one word inside another. There are typically a great many of these clue-forming possibilities, each of which forms a different puzzle reading, and the system constructs a data message for each one and annotates it with lexical, syntactic and semantic information. As the puzzle reading is lexicalised a hybrid language understanding process locates syntactic and semantic fits between the elements of the clue text and constructs the surface reading of the clue as it emerges. 2 The thesis describes the construction of the datasets required by the system including a lexicon, thesaurus, lemmatiser, anagrammer, rubric builder and, in particular, a system that scores pairs of words for thematic association based on distributional similarity in the BNC, and a system that determines whether a typed dependency is semantically valid using data extracted from the BNC with a parser and generalized using WordNet. It then describes the hybridised, NLC generation process, which is implemented using an object-oriented grammar and semantic constraint system and briefly touches on the software engineering considerations behind the ENIGMA application. The literature review and evaluation of each separate research strand is described within the relevant chapter. I also present a quantitative and qualitative end-to-end evaluation in the final chapter in which 60 subjects participated in a Turing-style test to evaluate the generated clues and eight domain experts provided detailed commentary on the quality of ENIGMA’s output. 3 Contents Abstract..........................................................................................................................2 Contents .........................................................................................................................4 Declaration.....................................................................................................................6 Acknowledgements........................................................................................................7 Introduction....................................................................................................................8 Cryptic Clues as Multi-Layered Text ........................................................................9 Computing Cryptic Clues ........................................................................................10 A Sample Generated Clue........................................................................................11 Natural Language Creation......................................................................................13 Structure of the Thesis .............................................................................................15 Sample Output .........................................................................................................21 Chapter 1 Requirements Analysis ............................................................................23 1.1 An Analysis of the Domain..........................................................................23 1.2 Scoping ENIGMA...........................................................................................39 1.3 Determining Resources with Sample Clues.................................................42 1.4 Quality Control ............................................................................................47 1.5 Language Specification................................................................................52 Chapter 2 Building Clue Plans.................................................................................58 2.1 A Paragrammer ............................................................................................59 2.2 Orthographic Distance Measures.................................................................67 2.3 A Lexicon.....................................................................................................70 2.4 A Homophone Dictionary............................................................................71 2.5 A Lemmatiser/Inflector................................................................................73 2.6 Stop List.......................................................................................................78 2.7 A Thesaurus.................................................................................................78 2.8 Cryptic keywords.........................................................................................89 2.9 Word phrase information .............................................................................89 2.10 Syntactic and Semantic Role Addenda........................................................90 2.11 WordNet.......................................................................................................90 2.12 Sample Input/Output....................................................................................91 Chapter 3 Finding Thematic Context .......................................................................93 3.1 Introduction..................................................................................................93 3.2 Measuring Word Association......................................................................95 3.3 Five Approaches to the Problem..................................................................97 3.4 Algorithmic Details......................................................................................99 3.5 Evaluating and Tuning the Algorithms......................................................106 3.6 Extracting and Organising the Data...........................................................124 3.7 Further Research........................................................................................128 3.8 Conclusion .................................................................................................132 3.9 Sample Output...........................................................................................133 Chapter 4 Finding Syntactic Collocations..............................................................134 4.1 Lexical Choice in ENIGMA .........................................................................134 4.2 What is wrong with distributional data? ....................................................135 4.3 Extracting Syntactic Collocates .................................................................137 4.4 Generalizing the dataset.............................................................................143 4.5 Evaluation ..................................................................................................156 4.6 Baseline Comparison.................................................................................164 4 4.7 Discussion..................................................................................................166 4.8 Sample Output...........................................................................................171 Chapter 5 Creating Clues........................................................................................174 5.1 Clue Plans as Input.....................................................................................176 5.2 Lexicalization Options...............................................................................179 5.3 Analysing the Data.....................................................................................180 5.4 Bottom-up Processing with Chunks..........................................................183 5.5 Chunks as Text Fragments.........................................................................186