Adam Kilgarriff's Legacy to Computational Linguistics And
Total Page:16
File Type:pdf, Size:1020Kb
Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond Roger Evans?, Alexander Gelbukhy, Gregory Grefenstettez, Patrick Hanks}, Miloš Jakubícekˇ ♧, Diana McCarthy⊕, Martha Palmer~, Ted Pedersen/, Michael Rundell], Pavel Rychlý1, Serge Sharoff♤, David Tugwell ? University of Brighton, [email protected], y CIC, Instituto Politécnico Nacional, Mexico, [email protected], z Inria Saclay, [email protected], } University of Wolverhampton, [email protected], ♧ Lexical Computing and Masaryk University, [email protected], ⊕ DTAL University of Cambridge, [email protected], ~ University of Colorado, [email protected], / University of Minnesota, [email protected] ] Lexicography MasterClass, [email protected] 1 Lexical Computing and Masaryk University, pary@fi.muni.cz, ♤ University of Leeds, [email protected], Independent Researcher, [email protected] Abstract. This year, the CICLing conference is dedicated to the memory of Adam Kilgarriff who died last year. Adam leaves behind a tremendous scientific legacy and those working in computational linguistics, other fields of linguistics and lexicography are indebted to him. This paper is a summary review of some of Adam’s main scientific contributions. It is not and cannot be exhaustive. It is writ- ten by only a small selection of his large network of collaborators. Nevertheless we hope this will provide a useful summary for readers wanting to know more about the origins of work, events and software that are so widely relied upon by scientists today, and undoubtedly will continue to be so in the foreseeable future. 1 Introduction Last year was marred by the loss of Adam Kilgarriff who during the last 27 years contributed greatly to the field of computational linguistics1, as well as to other fields of linguistics and to lexicography. This paper provides a review of some of the key scientific contributions he made. His legacy is impressive, not simply in terms of the numerous academic papers, which are widely cited in many fields, but also the many scientific events and communities he founded and fostered and the commercial Sketch Engine software. The Sketch Engine has provided computational linguistics tools and corpora to scientists in other fields, notably lexicography for example [61,50,17], as 1 In this paper, natural language processing (NLP) is used synonymously with computational linguistics. well as facilitating research in other areas of linguistics [56,12,11,54] and our own subfield of computational linguistics [60,74]. Adam was hugely interested in lexicography from the very inception of his post- graduate career. His DPhil2 on polysemy and subsequent interest in word sense disam- biguation (WSD) and its evaluation was firmly rooted in examining corpus data and dictionary senses with a keen eye on the lexicographic process [20]. After his DPhil, Adam spent several years as a computational linguist advising Longman Dictionaries on use of language engineering for the development of lexical databases, and he contin- ued this line of knowledge transfer in consultancies with other publishers until realizing the potential of computational linguistics with the development of his commercial soft- ware, the Sketch Engine. The origins of this software lay in his earlier ideas of using computational linguistics tools for providing word profiles from corpus data. For Adam, data was key. He fully appreciated the need for empirical approaches to both computational linguistics and lexicography. In computational linguistics from the 90s onwards there was a huge swing from symbolic to statistical approaches, however the choice of input data, in composition and size, was often overlooked in favor of a focus on algorithms. Furthermore, early on in this statistical tsunami, issues of repli- cability were not always appreciated. A large portion of his work was devoted to these issues; in his work on WSD evaluation and in his work on building and comparing corpora. His signature company slogan was ‘corpora for all’. This paper has come together from a small sample of the very large pool of Adam’s collaborators. The sections have been written by different subsets of the authors and with different perspectives on Adam’s work and on his ideas. We hope that this approach will give the reader an overview of some of Adam’s main scientific contributions to both academia and the commercial world, while not detracting too greatly from the coherence of the article. The article is structured as follows. Section 2 outlines Adam’s thesis and origins of his thoughts on word senses and lexicography. Section 3 continues with his subsequent work on WSD evaluation in the Senseval series as well as discussing his qualms about the adoption of dictionary senses in computational linguistics as an act of faith without a specific purpose in mind. Section 4 summarizes his early work on using corpus data to provide word profiles in a project known as the WASP-bench, the precursor to his company’s3 commercial software, the Sketch Engine. Corpus data lay at the very heart of these word profiles, and indeed just about all of Computational Linguistics from the mid 90s on. Section 5 discuss Adam’s ideas for building and comparing corpus data, while section 6 describes the Sketch Engine itself. Finally section 7 details some of the impact Adam has had transferring ideas from computational and corpus linguistics to the field of lexicography. 2 Like Oxford, the University of Sussex, where Adam undertook his doctoral training, uses DPhil rather than PhD as the abbreviation for its doctoral degrees. 3 The company he founded is Lexical Computing Ltd. He was also a partner – with Sue Atkins and Michael Rundell – in another company, Lexicography MasterClass, which provides con- sultancy and training and runs the Lexicom workshops in lexicography and lexical comput- ing; http://www.lexmasterclass.com/. 2 Adam’s Doctoral Research To lay the foundation for an understanding of Adam’s contribution to our field, an ob- vious place to start is his DPhil thesis [19]. But let us first sketch out the background in which he undertook his doctoral research. Having obtained first class honours in Philsophy and Engineering at Cambridge in 1982, Adam had spent a few years away from academia before arriving at Sussex in 1987 to undertake the Masters in Intelligent Knowledge-Based Systems, a programme which aimed to give non-computer scientists a grounding in Cognitive Science and Artificial Intelligence. This course introduced him to Natural Language Processing (NLP) and lexical semantics, and in 1988 he enrolled on the DPhil program, supervised by Gerald Gazdar and Roger Evans. At that time, NLP had moved away from its roots in Artificial Intelligence towards more formal approaches, with increasing interest in formal lexical issues and more elaborate models of lexical structure, such as Copestake’s LKB [6] and Evans & Gazdar’s DATR [9]. In addition, the idea of improving lexical coverage by exploiting digitized versions of dictionaries was gaining currency, although the advent of large-scale corpus-based ap- proaches was still some way off. In this context, Adam set out to explore Polysemy, or as he put it himself: What does it mean to say a word has several meanings? On what grounds do lexicographers make their judgments about the number of meanings a word has? How do the senses a dictionary lists relate to the full range of ways a word might get used? How might NLP systems deal with multiple meanings? [19, p. i] Two further quotes from Adam’s thesis neatly summarize the broad interdisciplinar- ity which characterized his approach to his thesis, and throughout his research career. The first is from the Preface: There are four kinds of thesis in cognitive science: formal, empirical, program- based and discursive. What sort was mine to be? . I look round in delight to find [my thesis] does a little bit of all the things a cognitive science thesis might do! [19, p. vi] while the second is in the introduction to his discussion of methodology: We take the study of the lexicon to be intimately related to the study of the mind . For an understanding of the lexicon, the contributing disciplines are lexicography, psycholinguistics and theoretical, computational and corpus linguistics. [19, p. 4] The distinctions made in the first of these quotes provide a neat framework to dis- cuss the content of the thesis in more detail. Adam’s own starting point was empirical: in two studies he demonstrated first that the range and type of sense distinctions found in a typical dictionary defied any simple systematic classification, and second that the so-called bank model of word senses (where senses from dictionaries were considered to be distinct and easy to enumerate and match to textual instances) did not in general reflect actual dictionary sense distinctions (which tend to overlap). A key practical con- sequence of this is that the then-current NLP WSD systems which assumed the bank model could never achieve the highest levels of performance in sense matching tasks. From this practical exploration, Adam moved to more discursive territory. He ex- plored the basis on which lexicographers decide which sense distinctions appear in dic- tionaries, and introduced an informal criterion to characterize it – the Sufficiently Fre- quent and Insufficiently Predicable (SFIP) condition, which essentially favors senses which are both common and non-obvious. However he noted that while this criterion had empirical validity as a way of circumscribing polysemy in dictionaries, it did not offer any clear understanding of the nature of polysemy itself. He argued that this is because polysemy is not a ‘natural kind’ but rather a cover term for several other more specific but distinct phenomena: homonymy (the bank model), alternation (systematic usage differences), collocation (lexically contextualized usage) and analogy. This characterization led into the formal/program-based contribution (which in the spirit of logic-based programming paradigms collapse into one) of his thesis, for which he developed two formal descriptions of lexical alternations using the inheritance-based lexical description language DATR.