Computer Science and Artificial Intelligence Laboratory Technical Report

MIT-CSAIL-TR-2016-014 November 8, 2016

Report on the 2015 NSF Workshop on Unified Annotation Tooling
Finlayson, Mark Alan

massachusetts institute of technology, cambridge, ma 02139 usa — www.csail.mit.edu

Report on the 2015 NSF Workshop on Unified Annotation Tooling

Mark A. Finlayson

Florida International University & Massachusetts Institute of Technology

November 8, 2016

Abstract

On March 30 & 31, 2015, an international group of twenty-three researchers with expertise in linguistic annotation convened in Sunny Isles Beach, Florida to discuss problems with and potential solutions for the state of linguistic annotation tooling. The participants comprised 14 researchers from the U.S. and 9 from outside the U.S., with 7 countries and 4 continents represented, and hailed from fields and specialties including computational linguistics, artificial intelligence, speech processing, multi-modal data processing, clinical & medical natural language processing, linguistics, documentary linguistics, sign-language linguistics, and the digital humanities. The motivating problem of the workshop was the balkanization of annotation tooling, namely, that even though linguistic annotation requires sophisticated tool support to efficiently generate high-quality data, the landscape of tools for the field is fractured, incompatible, inconsistent, and lacks key capabilities. The overall goal of the workshop was to chart the way forward, centering on five key questions: (1) What are the problems with the current tool landscape? (2) What are the possible benefits of solving some or all of these problems? (3) What capabilities are most needed? (4) How should we go about implementing these capabilities? And (5) How should we ensure longevity and sustainability of the solution? I surveyed the participants before their arrival, which provided significant raw material for ideas, and the workshop discussion itself resulted in the identification of ten specific classes of problems and five sets of most-needed capabilities. Importantly, we identified annotation project managers in computational linguistics as the key recipients and users of any solution, thereby succinctly addressing questions about the scope and audience of potential solutions. We discussed management and sustainability of potential solutions at length. The participants agreed on sixteen recommendations for future work. This technical report contains a detailed discussion of all these topics, a point-by-point review of the discussion in the workshop as it unfolded, detailed information on the participants and their expertise, and the summarized data from the surveys.


Table of Contents

Abstract
Table of Contents
1. Problem
  1.1. An Analogy
  1.2. Detrimental Effects
  1.3. Definition of Terms
  1.4. Summary of Results
2. Context
  2.1. Scientific Context
  2.2. Funding Context
3. Goals
  3.1. Original Goals
  3.2. Revised Goals & Questions
4. Logistics
5. Discussion
  5.1. Summary of Survey Answers
  5.2. Summary of Workshop Discussion
6. Conclusions
7. Acknowledgements
8. References
A. Workshop Participants & Demographics
B. Participant Biographical Sketches
C. Pre-Workshop Survey Questions
D. Collated List of Survey Answers
E. Original Workshop Agenda


1. Problem

Computational linguistics, and especially its sub-area of statistical natural language processing (NLP), is a field bursting with new, important, and influential scientific and technical work. Importantly, much of this work has been enabled by linguistically annotated corpora: collections of linguistic artifacts (such as text, speech, or video) that have been marked up for some language phenomenon. Indeed, in just the past five years, there have been a number of prominent technological advances and innovations that have had broad impact on society that would not have been possible without numerous linguistically annotated corpora. As just three examples, witness Apple’s Siri or Microsoft’s Cortana for speech understanding, Google Translate for automatic language translation, and IBM’s Watson for playing the Jeopardy game show. Large annotated corpora are a key resource that enables these advances, and they are fundamental to progress in the field.

But despite the widely-recognized importance of annotated corpora, the field has a major problem: it lacks coherent and functional software tool support. Collecting linguistically annotated data is a complex, labor-intensive, and difficult endeavor, and sophisticated software tools are required at every stage of the process. While there are hundreds of tools available right now, they suffer from three related problems:

Problem 1: Functionality. Tools do not provide the needed functionality, or do not provide it in an easily usable form. There are a number of tasks common across almost every linguistic annotation project, and many of these tasks have no or inadequate software support. Individual annotation projects usually involve more specialized tasks which often have even less software support. Even if tools do provide the needed functionality, the functionality is often hard to use or not documented properly.

Problem 2: Interoperability. Tools do not work together well. No tool can do everything, and so any linguistic annotation project must assemble tools together into pipelines, tool chains, and workflows. Despite this inescapable fact, tools often do not share input or output formats, do not use the same linguistic conceptual schemes, do not use the same terminology to describe their operation, and do not support the full range of annotation project types.

Problem 3: Reusability. Tools and resources cannot be easily applied to new problems. Tools are not extensible in a way that allows new functionality to be added, and the conceptual and theoretical schemes underlying the linguistic annotation itself are usually not specified precisely enough to allow other researchers to build upon them. Use and reuse of annotated resources are enabled by clear specifications and functional software tooling; missing, poorly designed, or poorly specified tools block reuse.

In general, I have called these three problems together the balkanization of annotation tooling. Faced with this problem, many language researchers create their own tools from scratch, at significant cost. These tools are usually hastily designed, not released for general use, not maintained, and often redundant with capabilities implemented by other tools. Researchers end up building tools that do the same things time and again. This adds up to a great waste of time, money, and effort, and blocks progress in the field. The goal of the workshop described here is to plan an approach to addressing these problems.

1.1. An Analogy

To drive home the idea that problems with tool support have a detrimental effect on work in linguistic annotation, consider an analogy to a domain with which most of us are familiar: writing an electronic document. If one wants to write an electronic document right now, what does one need to do? You open up a common word processing application (e.g., Microsoft Word, Google Docs, OpenOffice) and start typing. It’s simple. If you don’t like your word processor, you can download and install a different one in a minute from a small handful of options, and they are all about as easy to use and support much of the same basic functionality.

Now, imagine that writing an electronic document were like linguistic annotation, and a particular document were like the raw language data. What would it be like if the state of word processing were the same as linguistic annotation today? First, your machine would not have any word processors installed, so you would first try to find a ready-made word processor to write the document. You would have a hundred or more different options, all poorly documented and in various states of disrepair and availability. You would read a bunch of academic articles, trying to find the one word processor that supported the type of document you wanted to write. Some word processing applications would sound promising, only for you to find that they are no longer available, or do not run on modern operating systems.

Other word processors you would download and try to run, but there would be an inexplicable error not covered in the documentation, and so you would spend some hours trying to get them to run by installing other support programs or adjusting configuration files. Finally, in most cases, you would give up, and sit down to write your own word processor. You would spend quite a bit of time figuring out exactly which features of the document you needed to support: Italic font? Headers? Bulleted lists? Perhaps your tool is tailored to the specific kind of document you are writing, say, a job application letter, or a romance novel, or a technical manual. Days, weeks, or months of design, implementation, and bug fixes later, you have a tool which allows you to create your document. Once you have created the document you tuck your newly-created tool away somewhere on your hard drive and never think about it again. When the next poor researcher comes along who wants to create a similar sort of document, they have to repeat the whole process, and often without the benefit of the insights you gained when designing and building your tool, because, for the most part, people don’t publish papers describing the ad hoc one-off tools they build.

Think about this scenario for a minute. What effect would it have on your writing habits? Or on your productivity for tasks involving writing documents? Would writing electronic documents be something in which you would casually engage? Would you be able to easily collaborate on documents with others? Indeed, would writing documents be a task that could be readily repeated by others?

1.2. Detrimental Effects

The lack of functional, interoperable, and reusable tool support for linguistic annotation results in a constellation of major detrimental effects. It reduces or blocks the ability of researchers to build upon and replicate others’ work. If a language resource requires sophisticated software support to build and use, then difficulties with or absence of that software support will naturally impede use or replication of that resource. The harder a tool is to use, and the more researchers must fight with poorly designed tools or create their own tools from scratch, the less likely it is they will pursue any particular research project. This leads to lost opportunities, as researchers forego projects that present too many difficulties in tool design. This means that many scientifically valid and worthwhile projects are not pursued, or their pursuit is delayed or diminished. Researchers spend significant amounts of time, money, and effort understanding, using, and implementing annotation tools. These are resources that could be better spent on advancing the field directly, rather than revisiting already trodden ground. This duplication of effort represents a significant waste of resources.

1.3. Definition of Terms

Throughout this report I use a number of terms that I will define here. Annotation spans a spectrum along several dimensions, and many fields of research use methods that could be described accurately using this term. In this report I focus on annotation that supports research and development on the computational understanding and use of language. In this context, one can define annotation to be “any descriptive or analytic notations applied to raw language data” [1].
An annotation scheme is a specification that defines the syntax and semantics of the linguistic phenomenon of interest by specifying appropriate labels and features for the annotation objects and relations among them. A new annotation project may use an existing scheme, either “as is” or with modification, or it may require development of a new scheme for phenomena that have not been previously annotated. Example annotation schemes include the Treebank II scheme [2], PropBank [3], and TimeML [4]. An annotation scheme can be split into the logical description of notations and how they are indexed to the raw language data, and a human-readable annotation guide that describes how the annotations should be applied in practice, such as [5].

An annotation tool supports annotation efforts and is, in its most narrow interpretation, a piece of computer software that facilitates identification of segments of the linguistic data along with the means to associate these segments with the relevant labels. Example annotation tools include GATE [6], Ellogon [7], or Brat [8]. Annotation projects also require all sorts of other software to support creation of the annotated data; to the degree this software is specialized to language annotation I will refer to it as annotation tooling as well, where the context does not cause confusion.
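To make these definitions concrete, the following is a minimal sketch, in Python, of an annotation scheme and a standoff annotation indexed to raw language data by character offsets. The names and structure are hypothetical and chosen purely for illustration; they are not drawn from any of the tools or schemes discussed in this report.

    # Minimal, hypothetical sketch of the concepts defined above: a scheme that
    # names labels and their features, and an annotation that indexes a segment
    # of the raw text by character offsets ("standoff" annotation).
    from dataclasses import dataclass, field

    @dataclass
    class AnnotationScheme:
        """Names the labels of a scheme and, per label, its allowed features."""
        name: str
        labels: dict[str, list[str]]  # label -> allowed feature names

    @dataclass
    class Annotation:
        """One notation applied to a span of the raw data, per the scheme."""
        label: str
        start: int    # character offset where the span begins
        end: int      # character offset where the span ends (exclusive)
        features: dict[str, str] = field(default_factory=dict)

    # A toy named-entity scheme and a single annotation over a raw text.
    scheme = AnnotationScheme(name="toy-ne", labels={"PERSON": ["gender"], "ORG": []})
    text = "Mary founded Acme Corp."
    ann = Annotation(label="PERSON", start=0, end=4, features={"gender": "female"})

    assert text[ann.start:ann.end] == "Mary"   # the segment the label describes
    assert ann.label in scheme.labels          # the annotation conforms to the scheme

Note that a sketch of this kind captures only the logical side of a scheme; the human-readable annotation guide remains prose for the annotators.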


An annotated corpus, then, is a collection of linguistic artifacts (text, speech, multi-modal video, etc.) that has had linguistic annotations associated with it conforming to one or more annotation schemes. Example annotated corpora include the Penn Treebank [2], the OANC1, MASC2, and OntoNotes [9].

1.4. Summary of Results

With these problems in mind, I organized a workshop on unified annotation tooling (henceforth, the UAT workshop) to discuss and plan for future work. I start by giving detail on the scientific context (§2.1) and the funding context (§2.2). I discuss the original goals of the workshop when planning began (§3.1), which transformed into revised goals and questions after many discussions with potential participants (§3.2). The driving questions of the workshop were to identify the problems in detail and to chart a way forward, centered on answering five key questions: (1) What are the problems with the current tool landscape? (2) What are the possible benefits of solving some or all of these problems? (3) What capabilities are most needed? (4) How should we go about implementing these capabilities? And (5) How should we ensure longevity and sustainability of the solution?

Because the discussion and eventual recommendations of the workshop—which problems are considered important, what approaches are considered feasible, and what theoretical considerations are valid—are heavily influenced by the participants, I gave careful consideration to whom to invite so as to provide a representative and diverse set of views (§4). I surveyed these participants in a pre-workshop survey, and this set the stage for discussion, generating a significant amount of raw material to stimulate the participants (§5.1). I give a detailed, point-by-point explanation of the discussion (§5.2), with the goal of reflecting, as accurately as possible, how it unfolded during the workshop. To facilitate the accuracy of this explanation, I recorded the workshop discussions and made transcripts after the fact. I extracted the summary from the transcripts, guided by the notes that I took at that time, which hopefully will allow readers of this report to see the depth of the discussion on various issues, and to potentially make up their own minds as to what conclusions should be drawn from the discussion.

The participants identified ten classes of problems for the annotation tool landscape (§5.2.1). We further prioritized these problems and fleshed them out in detail (§5.2.2). Because annotation is important to many fields of inquiry, it was important to address which target audience or discipline would take precedence (if any), so as to properly scope the recommendations. The target audience identified, after much discussion, was linguistic annotation project managers (§5.2.3). To flesh out what capabilities were needed to overcome the various specific issues arising from specific annotation projects, the participants organized into breakout groups to brainstorm possible use cases (§5.2.4). This discussion produced several interesting ideas that lent depth and richness to later points and helped the group prioritize capabilities and audiences. At the beginning of the second day, Erhard and Marie Hinrichs from CLARIN in Europe gave some context on the European approaches (§5.2.5). The discussion then turned to capabilities, and we identified five general classes of capabilities that were motivated by the problems (§5.2.6).
In general, with regard to implementation, it was agreed that no one tool solves all problems and that we must focus on reusing existing tools, adapting them appropriately, filling in the gaps, and facilitating interoperability. One of the strongest recommendations of the workshop, which merited its own extensive discussion, was to propose a separate effort and technical workshop to promote interoperability among the various linguistic pipelines (§5.2.8). The discussion also produced many good ideas with regard to sustainability and partnerships to ensure longevity of any effort to solve these problems (§5.2.7). However, participants acknowledged that this was a problem which did not have a clear solution, and depended to a large degree on whether researchers as a whole adopt particular solutions and lend their weight.

1 http://www.anc.org/OANC
2 http://www.anc.org/MASC


In the end we generated sixteen recommendations for moving the field forward in a positive direction (§5.2.9). These recommendations covered planning, describing & cataloging, implementation, and other activities, and are the major deliverable of this report.

2. Context

2.1. Scientific Context

2.1.1. Importance of Corpus Annotation

In just the past decade computational linguistics has produced a slate of new technologies that have changed what people have come to expect from their interactions with computers and language data. By way of example, think of the following three technologies: Siri, Apple’s voice-recognition interface, which was a merging of statistical voice-recognition and question-answering technologies; Watson, IBM’s Jeopardy-playing computer, which handily beat two Jeopardy world-champions, a feat that was considered impossible not five years ago; and Google Translate, which now provides a service, for free, that was once available only for a fee from highly-trained experts.

These widely-hailed advances are driven by the near universal hunger for the analysis and use of more and more language data. The internet, and especially the web, is making electronically-accessible language data ever more voluminous: think of Google, Wikipedia, online newspapers, and the fact that nearly all written work produced in the present day is made available in electronic form. The now near-universal availability of powerful computers at people’s fingertips makes this data more available and easy to analyze: nearly everyone has multiple computing devices, including a smartphone in their pocket and a laptop at home; businesses are equipped with personal workstations for nearly every employee and leverage huge datacenters full of processing power and storage space. All of these realities have led to the expectation, and the demand, that language data be used to perform broadly useful tasks such as finding information in huge databases, analyzing hidden statistical relationships, and enabling more natural human-computer interactions.

Underlying all of these advances in using language are major advances in statistical natural language processing (NLP); and underlying all the advances in NLP are linguistically annotated corpora. Annotated corpora are a key resource for computational linguistics, a sine qua non for nearly every language technology developed in the last thirty years. Starting with early resources such as [10] and the Penn Treebank [2], and coupled with advances in computational power to perform sophisticated machine learning, annotated corpora have served as the foundation for investigations of scientific hypotheses regarding language, for discovering important regularities in language use, and for training statistical models for automatic language understanding. Without annotated corpora, we would be unable to talk to our computers and our phones, we would be unable to search huge databases for needles in the language haystack, and we would be unable to browse webpages, articles, and books written in different languages with the ease of clicking a button to automatically translate them.

2.1.2. A Missing Capability

Despite the critical importance of annotated corpora, we lack a critical capability: we have no consistent, interoperable, generalizable, easy-to-use tool support for performing linguistic annotation. There are plenty of tools, of course. A brief survey of papers in recent computational linguistics conferences will reveal that a large fraction of the papers use annotated corpora to either train systems or evaluate results. Of those papers that use annotated corpora, a number will have created a corpus especially for that work, adding to the already rich set of corpora available. What tools do they use to create the corpora? Often they do not use an off-the-shelf tool. Rather, many—if not most—write their own tools from scratch. Because researchers in general do not have expertise in annotation tool design, the tools tend to be hastily designed and ad hoc. They are rarely released for public use, even when the corpus created with the tool is released. Even if the tools are released, they often do not interoperate with existing tools, do not conform to standards or best practices, lack documentation, and are not maintained in the future. All of this adds up to a significant waste of resources spent on building tools that accomplish a specific, one-off task and do not serve the field in the long term.

The lack of rational tool support for annotation leads to another problem: difficulty in visualizing, analyzing, and building upon existing annotated corpora. If a corpus is released without tool or API support, then accessing that data requires that a researcher write their own software for that task. If a researcher wants to replicate or augment an
existing corpus, lack of access to the original tool (or the tool’s limitations) makes this a daunting endeavor. If all a researcher wants to do is a simple annotation task, perhaps with a straightforward modification not supported by the available tools, having to build a tool just to create the corpus can have a chilling effect on the research.

2.1.3. A Selection of Related Efforts

That there are significant problems with the linguistic annotation tool landscape is not a new observation. A number of projects have attempted to tackle this or related problems, and while each has had success to a certain degree, the landscape still remains full of problems. I list below a (non-representative) selection of projects that have tried to tackle problems in linguistic annotation tooling. Importantly, I identified this initial selection early in planning for the workshop, and this provided a set of researchers who were potential attendees to the workshop. The projects selected show a range of different approaches:

• Comprehensive tools that attempt to solve a class or multiple classes of linguistic annotation problems, or were envisioned as “one size fits all” tools (e.g., GATE, Ellogon, the Story Workbench)
• Streamlined, open-source tools that seek to make annotation more like using a word processor or website, and put the annotator first (e.g., the Story Workbench, Brat)
• Architectures that seek to provide a way of connecting together existing tools (e.g., GATE, UIMA, the LAPPS Grid)
• Research federations that seek to streamline sharing, chaining, and reuse of resources, including tools (e.g., the LAPPS Grid, CLARIN)

GATE: General Architecture for Text Engineering [6]
Sometimes referred to as the “MATLAB of NLP,” GATE is a closely integrated set of tools for solving several related sets of problems in linguistic annotation. Its core strength is in setting up complex chains of automatic analyzers, via the GATE Developer Integrated Development Environment (IDE). It has additional tools for managing other aspects of the annotation process, such as manual annotation [11], crowdsourcing [12], or embedding GATE into other applications [6].

Ellogon [7]
Developed in Greece, Ellogon is much like GATE in that it provides a general architecture to support natural language processing applications. It focuses on the text engineering and processing associated with annotated texts, and not on the annotator.

Story Workbench [13], [14]
The Story Workbench is a tool for doing semi-automatic linguistic annotation on short texts. In contrast to GATE or Ellogon, the Story Workbench puts the annotator first, making ease-of-use for the annotator a key feature. The Story Workbench is installed locally, documents are managed and modified locally, and a source-control repository is used to share work with the team. It is focused on a “double-annotated plus adjudication” workflow for annotation. Like GATE and Ellogon, it is built on a plugin model, in an effort to provide ease of extensibility.

Brat Rapid Annotation Tool [8]
Brat also focuses on the annotator, providing a simple, easy-to-use web-based interface. In contrast to the Story Workbench, Brat is installed on a webserver, and the annotator interacts exclusively with the tool through a web browser. Processing in Brat happens in the web browser, and documents are stored remotely. Because of its remote nature, it requires no installation for the individual annotator, has a straightforward and easy-to-use user interface, and is well documented. On the other hand, Brat does not provide a sophisticated backend programming model, is not easily extended (it is not a plugin-based architecture), and does not support local storage or modification of documents, or any workflow that might require such a capability. Its graphical interface is intuitive and easy for sparse, local annotations.
UIMA: Unstructured Information Management Architecture3
Originally developed by IBM, and now spun off into an Apache project, UIMA is a general framework for constructing chains of automatic analyzers. It is sophisticated, but does not directly address the problem of annotation by human annotators. UIMA implements several interesting features, in particular, a robust type system for describing the formal syntax of linguistic annotations and a general API for indexing annotations to documents of almost any nature. Many linguistic annotation projects require annotations to be applied not only to text, but to non-textual objects that themselves may contain language, such as images, scanned document pages, movies, and audio.
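The notion of indexing annotations to documents of almost any nature can be pictured with a short sketch. This is not UIMA’s actual type system or API; it is a hypothetical illustration, in Python, of how one kind of annotation might be anchored to text by character offsets, to audio or video by time spans, or to images by regions.

    # Hypothetical sketch of anchoring annotations to different media; it does
    # not reproduce UIMA's CAS, type system, or API.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class CharSpan:        # anchor into a text document
        start: int
        end: int

    @dataclass
    class TimeSpan:        # anchor into an audio or video document
        start_ms: int
        end_ms: int

    @dataclass
    class ImageRegion:     # anchor into a scanned page or still image
        x: int
        y: int
        width: int
        height: int

    Anchor = Union[CharSpan, TimeSpan, ImageRegion]

    @dataclass
    class Annotation:
        label: str         # e.g., "NamedEntity", "Utterance", "Sign"
        anchor: Anchor     # where in the artifact the label applies

    # The same kind of label can be applied to text, speech, or video frames.
    annotations = [
        Annotation("NamedEntity", CharSpan(0, 4)),
        Annotation("Utterance", TimeSpan(12_000, 15_500)),
        Annotation("Sign", ImageRegion(40, 80, 120, 90)),
    ]
    for a in annotations:
        print(a.label, type(a.anchor).__name__)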

3 https://uima.apache.org/


LAPPS: The Language Application Grid [15]
The LAPPS project, funded by NSF, sought to create “an interoperable software infrastructure to support natural language processing (NLP) research and development.” LAPPS built on the prior SILT project, the stated goal of which was “to turn existing, fragmented technology and resources developed to support language processing technology into accessible, stable, and interoperable resources that can be readily reused across several fields.” The focus of this project was on standards for interoperability, and LAPPS established a large international collaborative effort involving key players from the U.S., Europe, Asia, and Australia. The project developed an open, web-based infrastructure through which distributed language resources can be easily accessed, in whole or in part, and within which tailored language services can be efficiently composed, disseminated, and consumed by researchers, developers, and students.

CLARIN: Common Language Resources and Technology Infrastructure4
The goal of the EU-funded CLARIN project is to provide easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video, or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyze, or combine them, wherever they are located. CLARIN is building a networked federation of language data repositories, service centers, and centers of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centers are interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work.

In addition to these efforts identified before the workshop, a number of other efforts were identified during the discussion, as noted in §5.

4 http://clarin.eu/

2.2. Funding Context

While the scientific context clearly indicates a problem and a need, it remained to be determined how the need should be addressed. As in much of science today, progress depends on securing funding to pay for the work. I noted that there exists a funding program—CISE Research Infrastructure, or CRI—within the U.S. National Science Foundation (NSF), which seemed to be a good fit to the problem at hand.

2.2.1. NSF CISE Research Infrastructure Program

The CISE Research Infrastructure (CRI) program5 is a directorate-level program situated in the Computer & Information Science & Engineering Directorate (CISE). The goal of the program is to drive discovery and learning in the core CISE disciplines by supporting the creation and enhancement of world-class computing research infrastructure. The intention of these infrastructures is to enable CISE researchers to advance the frontiers of CISE research. At the time when this work began, the CRI program supported two broad types of award: Institutional Infrastructure (II) and Community Infrastructure (CI). Institutional Infrastructure awards were intended for the creation or enhancement of research infrastructure at specific institutions. Community Infrastructure (CI) awards were intended to support the creation of new infrastructure that supports a whole research community. CI grants included (a) planning for new CISE community research infrastructure (CI-P grant), (b) the creation of new CISE research infrastructure (CI-New grant), or (c) the enhancement of existing CISE infrastructure (CI-EN grant).

5 http://www.nsf.gov/pubs/2013/nsf13585/nsf13585.htm

2.2.2. Proposal to CRI Program

I identified the CI portion of the CRI program as an appropriate funding vehicle to potentially address the infrastructure problems related to linguistic annotation. I submitted a planning proposal (CI-P) in late 2013 while a research scientist at MIT, and this small proposal was awarded $100,000 for the period May 1, 2014 to April 20, 2015 (later extended to August 31, 2015 after the grant was moved to FIU). The ultimate goal of the planning proposal was to hold a workshop on the issues at hand, charting a way forward. I proposed three tasks in the original proposal:

Task 1: Comprehensive Review. The first task to be carried out, in advance of the actual workshop, was a comprehensive review of the domain of text annotation. The output of this task was to be the identification of the dimensions which define the space of linguistic annotation, including (but not limited to) types of document
collections, workflows, and annotation types. The plan was to also produce relatively comprehensive lists of corpora, tools, and annotation schemes. The dimensions of linguistic annotation, indexed with concrete examples, were to allow me to identify what types of annotation projects are supported by extant tools. It was also to allow me to specify precisely what types of annotation projects are most critical to the field. This information would, finally, allow me to prioritize implementation decisions, while keeping the grand vision of capabilities in mind. This review was also to serve as a seed for the workshop, providing a list of names to mine for participants to invite, as well as the raw material that would form the basis of the workshop discussions, allowing us to situate them in context.

In addition to reviewing the published work on annotation tools, I proposed to create a relatively comprehensive repository of extant annotation tools. I proposed to actually run the tools so as to evaluate them directly as to their ease of use, and to mine them for desirable features. Having such a repository in hand, I imagined it would be a short step to actually providing these tools for download from a single source, which is not currently available. Unfortunately, this turned out to be an infeasible amount of work for the time budgeted.

The review was not to be limited only to reviewing published work, or only to tools downloadable from the internet. I also proposed to mine insights and ideas from other researchers, by first sending personal emails to researchers I know to ask them to identify the key tools, corpora, and collaborators for their own work, and then introducing myself to these identified collaborators directly to repeat the inquiries. I proposed to attempt to segment and cluster the field into areas of interest, each potentially requiring different sets of features and capabilities. The goal of this exercise was to produce a comprehensive view of what the community needs, but also to have on hand a large body of researchers that we could draw on to make the workshop (and any later work) a success.

While each of these goals was noble and worthy of pursuit, the task turned out to be much too large for the manpower and time budgeted. A survey of linguistic annotation tooling resulted in a bibliography covering approximately 120 tools, formats, and frameworks, containing over 220 articles. It was clear that this represented only a small fraction (perhaps 10%) of the work related to linguistic annotation, and that significant effort would have to be expended to truly map the space. Nevertheless, the survey did allow me to identify a number of important key players, tools, and approaches and bring together a reasonably representative sample of researchers to the actual workshop.

Task 2: Preliminary Design Work. As originally conceived, the proposal sought to explore the feasibility of designing a “Unified Annotation Workbench” (UAW), which would serve as a “one size fits all” tool for linguistic annotation. As discussed in the next section, discussions with researchers in the field during the course of planning the workshop quickly led to the conclusion that a UAW was infeasible and not desirable. Nevertheless, I envisioned in the original proposal that we would put some effort into preliminary design work, to sketch out a possible solution to the general problem of constructing a UAW.
I envisioned that this would first entail an identification of a range of potential implementation technologies, including application frameworks, interchange formats, server architectures, and hardware requirements. After this the goal was to produce a block diagram of a proposed UAW architecture, as well as an implementation timeline for development, including when certain features would be available. I also envisioned the creation of initial rough mock-ups of UAW user interfaces, websites, and administrative screens, to give people a sense of what the tool would look and feel like. None of this work was carried out because the premise of the task—a need for a UAW—was eliminated early in the project.

Task 3: Workshop. The highlight of the planning grant was to be a workshop to engage the relevant communities; this technical report describes in detail the workshop that was held. The goal of the workshop was to define the problems, identify the solutions and their benefits, produce recommendations, and get buy-in from key players for future work. While reviewers of the proposal expressed skepticism that a unified approach would be feasible, they noted that the problems described were real and urgent, and that some sort of community infrastructure solution was needed. It was on this basis that the proposal was awarded, and thus the survey work and workshop planning began.


3. Goals

3.1. Original Goals

As already noted, the original goal of the proposal for the workshop was to lay the groundwork for a Unified Annotation Workbench (UAW). The vision of this workbench—conceived as a closely-federated set of tools—was to be a one-size-fits-all solution to linguistic annotation, where users could quickly download and install the tool, and quickly visualize, manipulate, and create linguistically annotated data. When the proposal was funded, I turned to surveying the state of the field, as proposed in Task 1, and also started organizing the workshop, identifying key players in the linguistic annotation space, for example, those who had designed annotation tools, built annotated corpora, or done other significant work on the theory or practice of linguistic annotation. In my discussions with these potential participants, they nearly universally rejected the idea that a one-size-fits-all tool was feasible, possible, or even desirable. The arguments against a UAW could be grouped into approximately three objections:

First, a number of researchers noted that there is a fundamental trade-off between the generality of a tool and its efficiency for a specific annotation task. A tool specifically designed to support a particular annotation task, with a particular type of source material, specific annotation scheme, and particular linguistic judgments, has a huge advantage over more general tools in terms of speed, efficiency, and effectiveness. On the other hand, specific tools were less able to be adapted to new tasks. Projects needed the ability to trade off generality and specificity in their tool choices.

Second, each annotation project has its own set of practical and theoretical constraints which restrict the kinds of tooling that can be used. For example, a project which requires annotators to produce annotations in a low- or no-network environment will not be able to use a tool that requires network connectivity. Similar restrictions apply for particular choices of operating system or computing platform. Some projects have legal or ethical restrictions on how data must be managed or protected, which restricts the kinds of tooling that can be used.

Third, as a matter of efficiency and cost, many noted that a lot of good tools already existed, and while many fell short in certain ways, it certainly made more sense to build on top of those tools rather than tear them down and replace them with a fully new solution. There is a certain amount of accumulated wisdom and practical elimination of bugs and poor practices represented by a vetted tool that is hard to capture in documentation or any one individual’s understanding of a tool, and this hard-won quality would be sacrificed in a new implementation.

3.2. Revised Goals & Questions

Because I realized early on that a workshop focused on building a “Unified Annotation Workbench” was ill-advised and not what was needed by the field, I quickly set about revising the goals for the project such that they took into account these objections. I noted that the original motivating problems were still valid, namely, that the existing tool landscape was “balkanized,” with a lack of functionality, interoperability, and reusability that was blocking progress. Thus the goals for the workshop were reformulated in a more general fashion which did not assume the solution, but rather allowed the participants to work on defining the problem and the appropriate approach to solving it.
Therefore the workshop goal was reformulated as “trying to chart our way forward out of the current chaos of annotation tooling,” with the sub-goals of (1) agreeing on the problems with the current tool landscape, (2) identifying possible benefits of solving some or all of the problems, (3) identifying which capabilities are most needed and by whom, (4) making high-level suggestions on how these capabilities should be implemented, and (5) discussing how we should go about ensuring longevity and sustainability of the implementation. Thus the actual agenda of the workshop was driven by five major questions:

1. What are the problems with the current tool landscape?
2. What are the possible benefits of solving some or all of these problems?
3. What capabilities are most needed?
4. How should we go about implementing them?
5. How should we manage our solution?

In accordance with this revised approach, I renamed the workshop to the “Workshop on Unified Annotation Tooling,” to express our desire to achieve rational annotation tooling without prescribing a particular solution.


4. Logistics

The next section contains a point-by-point summary of the workshop discussion. Before that detailed explication, however, the reader may find it informative to understand the logistics of the workshop, namely, where the workshop was held, how it was organized, and how the particular participants were identified.

First, the workshop was held on its own in a relatively secluded location: Sunny Isles Beach, Florida. This is not near any university (it is about 45 minutes from FIU), and indeed is not near any tourist sites in the southern Florida area. It is a peaceful stretch of hotels and shops next to the beach. I had considered the possibility of holding the workshop in conjunction with a main conference such as the Association for Computational Linguistics (ACL), the North American ACL conference (NAACL), or possibly the Conference on Intelligent Text Processing and Computational Linguistics (CICLing).6 Co-location with a main conference would have had the advantage of reducing travel costs because many of the participants would have already been present. However, I was concerned with the possibility that workshop participants would be distracted by other pressing duties related to the main conference and thus, because the funding afforded by the NSF allowed paying for participants’ travel, I decided to hold the workshop in a remote location to minimize distraction and maximize the amount of time and attention participants would pay to the topic at hand. Southern Florida is a beautiful place in March, and it seemed reasonable to have the workshop nearby for practical organization purposes, as well as being an enticing place for participants to attend. I chose Sunny Isles because there are a multitude of affordable options for a small meeting, allowing us to have a beach-side venue that I hoped would help participants relax and focus on the topic.

My choice of participants was governed by several principles. First, I began with the premise that anyone who attended should have some demonstrated accomplishment in linguistic annotation: either having created a significant tool or corpus, run a significant annotation project, or otherwise contributed notably to the practice or theory of linguistic annotation. I generated a list of potential participants that met these criteria from the survey described in the proposal Task 2, and I discussed the possibility of attending with various potential participants, querying them as to their availability and interest in the topic. I also asked each of them to suggest anyone whom they thought particularly suited to participation. Second, it was important to have significant international participation, especially from Europe, which has produced the majority of widely-used annotation tools. On the other hand, the NSF-funded nature of the workshop required that U.S. participants be in the majority, because although we would expect that any infrastructure so planned would be globally useful, the NSF is by necessity focused on U.S. researchers. Third, while text annotation was primary (as it represents the largest share of linguistic annotation done in the field so far), there should be a diversity of representation of researchers who do linguistic annotation in other modalities, such as speech, gesture, video, multi-modal, and so forth. This would allow the workshop conclusions to be more general and widely-applicable than if the participants were only concerned with text annotation.
In the course of identifying potential participants I held face-to-face meetings, video calls, voice calls, and email discussions with 40-50 researchers in linguistic annotation. I extended approximately 30 rolling invitations. Some invitees developed scheduling conflicts and ended up not coming. The final list of participants and their biographies can be found in Appendices A and B. A photo of the participants is shown in Figure 1.

In the weeks before arriving at the workshop itself, I surveyed the participants. The exact survey is shown in Appendix C, and the full collated (unattributed) list of survey answers is given in Appendix D. This collection was used as raw material to seed the discussion, and was handed out in this exact form to the participants at the beginning of the workshop. The original agenda of the workshop is shown in Appendix E, and was designed to be primarily group discussion, anchored by two presentations. The first presentation was my own context-setting presentation. The second presentation, on the morning of the second day, was by Erhard and Marie Hinrichs and was intended to introduce the major European initiative in this area, CLARIN, to those unfamiliar with it. During the first session of the workshop on Monday morning I asked the participants to suggest reworkings of the schedule. As discussed below in the summary, this resulted in adjustment of the agenda and the addition of breakout sessions as shown in the summary.

6 The Language Resources and Evaluation Conference (LREC) would have been ideal in this regard, but it is only held every two years and was not scheduled to be held in 2015.


Figure 1: Photograph of the Participants. Back Row (L to R): M. Kipp (dark blue shirt); H. Sloetjes; P. Stenetorp; T. Hanke; J. Pustejovsky; E. Nyberg; Middle Row: B. MacWhinney (red shirt); E. Hinrichs (partially obscured); M. Verhagen; S. Strassel; M. Dickinson; G. Simons; B. South (partially obscured); J. Good; Front Row: A. Rumshisky; W. Chapman; G. Petasis; N. Ide; D. Maynard; M. Hinrichs; C. Bonial; Kneeling: M. Finlayson (organizer)

5. Discussion

This section contains a detailed, point-by-point explanation of the discussion as it unfolded, as captured by audio recordings. I transcribed these recordings and then extracted all relevant information into the summary below.

5.1. Summary of Survey Answers

Before the workshop began, I distributed a survey to the participants to gather raw data to guide the discussion. The survey was invaluable in collecting unbiased ideas from all participants before the workshop began and before discussion had resulted in any convergence of views. The exact survey distributed is shown in Appendix C. I analyzed the answers using a clustering exercise, where for each broad question I grouped the answers into coherent categories. Each of these categories of answers is given below in Table 1 (in alphabetical order), and the detailed list of collated answers is given in Appendix D. These categories allowed us to think carefully about what was missing in our approach to the problem. The full collated list was distributed to the participants the night before the workshop began and was available throughout to seed and guide the discussion.

5.2. Summary of Workshop Discussion

5.2.1. Monday, Session 1: Context, Agenda, Problems

I began the workshop with a presentation that reviewed our goals and the context under which I organized the workshop. The goals, as stated previously, revolved around addressing the problem of the balkanization of linguistic annotation tooling. The context covered the grant from the NSF CISE Research Infrastructure program.


With regard to the goals I emphasized that we were not discussing “yet another annotation tool.” In the surveys and pre-workshop calls participants noted, with minor qualifications, that “the solution to the balkanization of annotation tooling is not to build another tool.”7 I presented this in a generalized form for the overall goal of the workshop, namely, “we are trying to chart our way forward out of the current chaos of annotation tooling.”

Next in my presentation I reviewed the results of the pre-workshop survey, which are outlined above in §5.1, and listed in detail in Appendix D. Importantly, there were two points on which there was substantial agreement, in that these two points were made in the survey by nearly every participant in one form or another:

1. Any solution should build links and reuse existing tools as much as possible, and
2. Participants universally desired better capabilities to control one’s annotation workflow.

What are the problems with existing tools?
• Difficult to learn
• Difficulty running or installing
• Inadequate importing, exporting, or conversion
• Lack of documentation or support
• Missing functionality
• Poor user interface
• Problems with extensibility
• Unstable, slow, or buggy

What are our motivations to solve the problem?
• Achieve greater access to data
• Avoid reduplication and reduce cost
• Ease annotation
• Facilitate archiving
• Improve the range of annotations
• Increase quality of results
• Obtain interoperability
• Provide extensibility

How should we implement a solution?
• Build links and reuse existing tools
• Tools are difficult to reuse
• No one tool is possible
• Miscellaneous

What capabilities are desired?
• Import/Export capabilities
• Annotation on various media
• Supported schemes
• Workflows
• Miscellaneous

What disciplines should be prioritized?
• No prioritization
• Some prioritization needed

How should we organize, fund, and manage the solution?
• Funding
• Implementation
• Open Source
• Organization
• Partnerships
• Review

Table 1: Broad categories of answers to core questions, as extracted from the survey data.

After the presentation, the first item for discussion was the agenda of the workshop itself, and whether there were any suggestions for modifying the proposed order of discussion, including whether we had a complete list of problems from which to start. This resulted in the following observations and suggestions:

1. The surveys indicated little interest in prioritizing disciplines, and so I dropped that from the agenda. (As it happens, this was discussed later under the guise of “audiences,” in §5.2.3.) In particular, one participant noted that disciplinary distinctions are not as useful as the question of “what is the role and impact of language in a particular annotation task?”
2. One participant observed that we should make sure to discuss things that had already been tried in solving the problem of balkanization, so that we may avoid repeating previous mistakes.
3. Several participants suggested that we discuss use cases to motivate capabilities, and the idea of breakout groups was suggested to generate use cases focused on particular problems.
4. One participant forwarded the idea of creating a classification scheme for annotation tasks and tool capabilities. This was also offered as a way of organizing potential breakout groups.
5. Another participant suggested the idea of creating a roadmap, to lay out disciplines, tasks, and use cases so we know both where we are and what direction the field is headed.

7 Within this summary section (§5.2), quotes are drawn directly from the transcript of the discussion, but are sometimes left unattributed at the request of the workshop participants.


We also treated in more detail the overall purpose of the workshop. It was reemphasized that we need to go beyond individual tools and corpora, and that one of the most important questions relating to balkanization is how we encourage and enable semantic interoperability between tools. This was situated in a three-level understanding:

1. Syntactic: The “file format,” which lays out how the data model is written to disk.
2. Logical: The “data model,” which lays out the logical organization of the annotations and how it maps to the conceptual model and annotation scheme. This is also often called the annotation specification.
3. Semantic: The “annotation scheme,” which is covered in the annotation guide and finds its fullest expression in the conceptual models held in the minds of the annotators and investigators.

Participants emphasized that the main concern is not at the level of syntactic interoperability, namely, making sure that file formats are the same: there has already been quite a bit of work on this in terms of LAF, GrAF, MASC, and so forth [16], [17]. Logical interoperability, furthermore, was also fairly well treated, in that type systems like those in UIMA and GrAF had made inroads on this problem. Rather, existing linguistic annotation tooling did not enable semantic interoperability, and this lack is a critical barrier to effectively reusing annotation schemes, annotated corpora, and linguistic annotation tools. Solving this problem would also enable and assist tool writers for the next generation of tools.

It was in this discussion that the idea of a central repository first started to take shape. The idea was to start by capturing the state of the field, organizing a central repository that lists all available tools and annotated data, how they interoperate, and how different data formats can be converted into one another. One participant, however, sounded a note of caution on focusing only on building a central repository. They noted that the current funding climate was heavily biased toward “visionary” research that takes risks and promises large rewards. They noted that to be visionary you need a visionary use case, which led to the idea of outlining a major challenge problem that can motivate the infrastructure. They noted that it is better to shoot for something that the NSF knows no company can do. Other participants, however, countered that really high-risk and visionary challenge tasks are more suited to other types of funding, such as DoD and DARPA.

At this point I turned the discussion back to considering the problems we are having as a field. I asked again whether the list of problems generated from the survey was complete. After some quiet thought and consideration of the existing list of problems, one participant noted that “discovery” is a major problem category not represented, where discovery was defined as “the problem is there is a tool out there, but I don’t know what it is, and I don’t know where it is.” Thus the problem of discovery is the difficulty in finding the tools that one needs. The participant noted that there are “actually some very simplistic infrastructure solutions” that can address this need. This observation returned the discussion to the idea of a central repository for tools and data. Some noted that there had been efforts to catalog tools, but others countered that these efforts have fallen short. A number of out-of-date or incomplete webpages were mentioned, such as those at LDC8 or in Germany9.
OLAC10 was mentioned in this context, because it includes some tools and is, generally, a catalog for language-related things. Participants admitted, however, that OLAC did not solve the problem at hand, and any tools it included were included in an ad hoc and inconsistent manner. The out-of-date nature of prior efforts led participants to a short discussion of the problem of sustainability, namely, keeping any catalog up to date and current requires significant continuing effort. At this point one participant proposed the idea of adding formal metadata describing tools. They argued, and others agreed, that formal metadata over tools would add major value to the repository, and would be something no repository (incomplete or not) has done before. The participants connected this idea of formal metadata back to the idea of enabling semantic interoperability, with the metadata ideally being specific and precise enough to capture how and in what way tools and annotation schemes were semantically interoperable. As the end of the session approached, I asked for final brainstorming on problems that should be discussed. The additional problem categories suggested were: 1. No support for annotating in low-resource environments, such as where you have no network access 2. Lack of workflow control and manipulation

8 https://www.ldc.upenn.edu/language-resources/tools 9 http://annotation.exmaralda.org 10 http://www.language-archives.org/


As the end of the session approached, I asked for final brainstorming on problems that should be discussed. The additional problem categories suggested were:

1. No support for annotating in low-resource environments, such as where you have no network access
2. Lack of workflow control and manipulation
3. No clear direction for evolving tools for the next generation
4. Lack of support for developing annotation guidelines in a formal way
5. Difficulty managing large volumes of data or annotators, in particular, crowd-sourcing
6. No support for embedding annotation in other tasks (e.g., Games with a Purpose, or GWAP)
7. Limited ability to support human-in-the-loop for annotation

5.2.2. Monday, Session 2: Prioritizing the Problems

After the coffee break the discussion resumed, beginning with a round of introductions. The goal of this session was to prioritize the problem categories identified in the survey and augmented in the first session. As each problem was prioritized, it was discussed and often reworded. In the end, with various merging and splitting, ten problem categories emerged, organized into three priority levels (high, medium, low). Within each category the problems are unordered. The summary below of the discussion is organized according to the final prioritized list.

High-priority Problems

There were five high-priority problems, and they are listed here in no particular order.

Difficulty finding and evaluating appropriate tools and resources for the task at hand. One high-priority problem was simple: the difficulty people have in merely finding tools and data for their particular tasks. This was described earlier as the problem of discovery. This is bound up with the difficulty of evaluating the appropriateness of a tool for the job: once you have found a tool, it is hard to know if it will work for you. One participant noted that "you have to get pretty deep into a tool to know if it's going to work for you, and knowing that, people, including us, often opt to just build their own tool out of the box," because the time investment needed to evaluate the appropriateness of a tool is significant. Participants discussed how they approach this problem now, and several suggested that word of mouth is important, namely, asking other researchers "have you used it, and how did it work for you?" Continuing with the theme of a central repository, it was noted that some sort of standard vocabulary would be incredibly useful: if all tool providers that register their tools in the repository use the same set of concepts to describe their tools, this could be quite helpful.

Lack of support for development, documentation, reuse, naming, and formalization of annotation schemes and guidelines. This problem revolves around support for formal ways of describing the semantics and interoperability of annotation schemes. Annotation schemes and guidelines have a lifecycle that runs from development, to testing, to documentation, to reuse, and the ability to formally name and describe the contents of annotation schemes and their related guidelines would be a significant step forward. Analogies were made with the utility of, for example, the three-letter ISO scheme for naming languages, or the notion of a MIME type. Other related tasks that are now difficult to achieve, and that such formalization would support, are versioning of annotation schemes, customizing existing annotation schemes, standardizing approaches, and merging disparate approaches to annotation. In the context of documentation, one participant mentioned that Pandoc11 was an exemplar project and could be considered as a model.

Inadequate Importing, Exporting, Conversion. With lots of different tools have come lots of different formats, with varying compatibility across different levels (syntactic, logical, and semantic). While semantic interoperability has already been noted as a separate challenge, there are interoperability problems even when annotation schemes are logically compatible. A missing piece for the field is the ability to easily import into one's tool of choice logically compatible annotations stored in different syntactic formats. It was suggested that a stand-alone service that can convert data from one format into any other compatible format would be of great use in this context.

Lack of Workflow Support. Workflow is the overall structure of the work of an annotation project, including chaining tasks, pipelining tools, and assuring quality. Right now, there is little established workflow support, and we have no ability to describe workflows in uniform terms. Some participants noted that there are existing solutions for workflow management, for example, Taverna12, but that these were either not used in linguistic annotation or not appropriate for one reason or another.
The problem of workflow support has several different aspects. First, one of the critical difficulties is in pipelining tasks, especially across several annotation tools. Most annotation tools do not naturally speak to each other, and so the addition of each tool to a workflow requires additional steps for importing and exporting data, with consequent possibilities for errors and problems.

11 http://pandoc.org/ 12 http://www.taverna.org.uk/


A specialized case of this is when a human is introduced into the annotation pipeline, as they often are: despite the centrality of human annotators, there is little formal support for human-in-the-loop in a controlled way. Second, there are no unified mechanisms for ensuring quality across human and machine annotations. There has been some research on checking for and finding errors in annotated data [18], but these approaches are not yet generally used. A proper workflow solution would enable and encourage quality assurance steps. Third, workflows developed for one task, if they are clearly described and portable at all, are not easily adapted to different, but related, tasks. This generally is a problem of extensibility: if workflow solutions are not extensible, then this significantly impairs their reusability. Finally, lack of workflow support results in major difficulties for non-technical researchers (e.g., Digital Humanists, Sociolinguists) when they want to do linguistic annotation: they lack the significant technical knowledge required to chain together tools in a way that accomplishes their goals.

Lack of User, Project, and Data Management. A problem that cuts across all types of tools is the inability to scale to large volumes of annotators or data. Existing tools are particularly unsuited to crowd-sourcing, which is becoming more and more popular. This was considered by a number of participants as the number one problem affecting many annotation projects. With regard to data, there were several specific problems. First, searching data—especially multi-modal data—is difficult, meaning it is difficult to find pieces of data in your raw source or annotations that meet certain criteria. Currently any search system must be custom made. Second, versioning is a problem: it is difficult to inspect the history of changes applied to one's data, and it is difficult to roll back to previous versions or to merge two different versions from different annotators into a new version (or gold standard). Code versioning systems (such as SVN or git) were suggested as potential solutions or models, but it was clear that the field needs capabilities specific to linguistic annotation.

Workflow support and user, project, and data management are clearly related, and to conceptualize the relationship participants generated a multi-level arrangement:

1. Task Intention: defining the workflow from the point of view of the requirements of the analysis
2. Logical Parts: breaking the intention down into logical subtasks and identifying who will perform what
3. Implementation: designing how to implement the subtasks with tools, selecting and provisioning the actual resources, and then effecting execution.

As with workflow support, extensibility of any eventual solution was seen as a necessary feature to facilitate reuse.

Medium-priority Problems

There were two medium-priority problems, listed here in no particular order.

Learning Curve Too Difficult. Linguistic annotation tools are almost uniformly hard to use and hard to learn how to use. While a tool may in theory be functional in a particular way, it is often too difficult to learn how to use it to actually achieve that functionality, and using it to achieve that functionality is clunky and difficult.
One participant phrased this as answering the questions, "how do you download this tool, how do you install it, how do you run it, what do all the buttons do?" This problem is a function of several related sub-problems, in particular, the design of the user interface or the existence of comprehensive, up-to-date documentation. Importantly, for many tools, the learning curve is too shallow, primarily because the user interface is really clunky: more investment of effort in using and learning the tool does not result in major annotation performance gains. This problem also results in some people who should be using a tool not being able to use that tool, such as non-technical researchers interested in linguistic annotation.

There was quite a bit of vigorous debate about whether to put this problem in the "high" or "medium" categories. Some participants thought the learning curve was a serious problem (see the example below), while others saw it as more a software engineering problem that was a sad fact of academic life, that either would be solved through application of known techniques or was not our job to solve (see low-priority problems, below). Stephanie Strassel offered a specific example to support putting the problem in the high-priority category. The Alembic workbench is an older but still very powerful and flexible annotation tool in use by the Linguistic Data Consortium (LDC). Unfortunately, the power and flexibility of Alembic also means it is easy to do an annotation incorrectly, and difficult to learn to do it right. Even for technically minded people it's hard to learn, and for non-technically minded people (annotators), it can be a major blocker. From the LDC's perspective as a large corpus developer, they may have a team of 50 annotators distributed around the world, and they have difficulty finding people who speak the language being annotated, let alone also have strong computer skills.



In this case, the difficulty of the learning curve is a serious concern. Discussion of this problem led to two interesting suggestions. First, one participant suggested that a series of usability studies could be useful, with different target groups (e.g., annotators, project managers, researchers), to try to understand the problem of usability of linguistic annotation tools. They noted that the whole question of what is a usable tool is not a simple question, and that there are likely separable questions for different populations: there's a learning curve for the annotator, and there's a learning curve for the researchers, etc. Others noted that usability also was related to the tension between theory-specific and generic tools, with specific tools being more difficult to learn, but more efficient in the long run, while generic tools are easier to learn but less useful for specific tasks. The second suggestion related to the lack of community-wide UI or task idioms for linguistic annotation. Jeff Good noted the following:

ELAN [the annotation tool] might call something a "symbolic subdivision". I know ELAN's word for that. But then I go to another tool and know that I need a symbolic subdivision, but I'll spend half an hour to find what they call a symbolic subdivision. ... That's part of the problem. You already know what tasks you want, you just don't know what they're called. If you look at word processors, right, you have File and Edit, which is what allows the learning curve. The first word processor is [difficult], but the second one is easier. There is a lack of a community-wide idiom.

A community-wide idiom for annotation tooling may help ease knowledge transfer between applications.

Lack of Tool-building Tools. Another much discussed and desired capability, but one that was ultimately downgraded to "medium" priority, was a set of tools that would allow researchers to quickly and easily build functional, extensible annotation tools with good user interfaces. This problem was downgraded in the end because participants felt that certain parts of this problem would be solved (especially with regard to user interface) when we make progress on the high-priority items, especially those related to managing workflow. The specific problem was conceived to be the inability to perform a number of common software design tasks using pre-built components. For example, one participant proposed the idea of a wire-framing interface to design the layout that integrates existing UI components. Another participant regretted our inability to easily build different types of interfaces or visualizations based on the annotation scheme. These ideas led naturally to the idea of plug-and-play components: for example, if a researcher needs to visualize trees, or another type of annotation data structure, there should be components that you can drop into the application to perform these functions. This capability, the participants felt, would naturally lead to better user interfaces, would allow tools to be easily retrofitted with missing functionality (including annotation schemes), and would address problems with extensibility. But it was also noted that having plug-and-play tool-building tools assumes that you have solved some of the higher-level problems, such as data management, scheme design, document interoperability, as well as understood more precisely how to manage your annotation workflow.
Therefore, participants agreed to mark this as a medium-priority problem, whose solution would be partially dependent on higher-priority issues.

Low-priority Problems

There were three low-priority problems, listed here in no particular order.

Inability to do Opportunistic Annotation. One problem that was suggested was that we have difficulty making annotation "fun", or, more generally, we lack the ability to embed annotation opportunistically in other non-annotation tasks. This is related to the idea of "Games with a Purpose" (GWAP), where there has been significant recent work and several examples within the linguistic annotation space. Another related problem that one participant brought up was that it is difficult to do purely opportunistic annotation as a researcher: for example, if you are reading some text and note an interesting linguistic fact, there is no uniform way to reliably and reproducibly annotate it. The best you can do, right now, is to keep notes. While participants agreed that both of these problems were interesting and solving them could be useful, the general feeling was that this was perhaps somewhat on the edge of the scope of the central problems of the workshop. This problem was thus relegated to the "low-priority" category.

Difficulty Running or Installing. A second low-priority problem, although a common one that frustrates many researchers, is the difficulty in merely installing and running existing annotation tools.


There may be an annotation tool that provides some needed functionality, but it may not work on a researcher's platform of choice (which includes, for example, the distinction between desktop and mobile). While participants noted this was a frustrating problem, they believed that solving the higher-level problem of workflow management would go a long way toward solving this problem. Further, participants noted that this was a specific, pernicious example of a more general class of problems revolving around poor software engineering, which was placed in the low-priority category for reasons explained next.

Poor Software Engineering. By extension of the previous problem, many existing annotation tools suffer from poor software engineering at one or more levels: tools may be unstable, slow, or buggy; they may lack documentation or support from their developer; and the code may be difficult to understand or use, if it is available at all. While participants all agreed that these were all serious problems, they were not unique to annotation tooling but rather were characteristic of academic code as a whole, and generally were a function of the incentives of the academic world (jobs and promotions being awarded on the basis of publications, and not cleanly written code). It was admitted that, no matter how frustrating the low quality of software and the lack of general software engineering hygiene, this was just not something we could tackle at the level of the field, and it is not really an infrastructure question. This problem was thus placed in the "low-priority" category.

Orphaned Ideas

There were two interesting ideas that came out of the problem prioritization discussion, and which did not fit cleanly in other categories. I list them here as two "orphaned" ideas.

Annotation Skill Ecosystem: One participant suggested that it could be quite useful to create an annotation task ecosystem, where specific skill sets related to annotation could be advertised and recruited, like Amazon Mechanical Turk. This idea developed as an extension of the idea of how to manage interoperability between schemes and data: if one talks about independent sources of data and schemes, why not independent sources of task management and even code? A specific example was offered of this service: Suppose there is a person in the U.S. who is a great annotation task manager and would like to work as a freelancer in their spare time. A researcher in Germany has a bunch of students who can do annotation, but doesn't know how to manage the annotation task. An ecosystem portal could potentially bring these two together. While this idea was interesting and exciting, it was unclear to the participants that the lack of such a capability was a blocking problem, and in the end it was decided that this was somewhat out of scope for the workshop.

Data Repository: Another interesting idea was the idea of building a generic data repository for annotated data. Such a data repository would be accessed via a simple API which supports several straightforward calls. This would then allow annotated data to be moved off the local machine in many cases and also allow annotations from many different sources around the world to be brought together. It was suggested that it could be a repository only in a very general sense, like OLAC, and on the backend be distributed. Importantly, however, to the end user it should look like one thing, so there is one single point of entry.
While this idea was intriguing, and was eventually integrated into the final recommendations, it was unclear to the participants if this was a blocking problem, and therefore the lack of such a facility was not promoted to the status of an independent “problem.”
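The workshop did not specify what those API calls would be; purely as a hedged sketch (the interface, method names, and backend below are invented), a handful of operations along the following lines would already cover the scenario described, whether the backend is a single server or a federation of archives behind one point of entry:

from abc import ABC, abstractmethod
from typing import Iterable

class AnnotationRepository(ABC):
    """Hypothetical minimal interface for a generic annotated-data repository."""

    @abstractmethod
    def put(self, item_id: str, data: bytes, metadata: dict) -> None:
        """Store a source artifact or an annotation layer under an identifier."""

    @abstractmethod
    def get(self, item_id: str) -> bytes:
        """Retrieve a stored item by its identifier."""

    @abstractmethod
    def search(self, **criteria) -> Iterable[str]:
        """Return identifiers of items whose metadata matches the criteria."""

class InMemoryRepository(AnnotationRepository):
    """Toy backend; a real service might federate many archives behind this API."""

    def __init__(self):
        self._data, self._meta = {}, {}

    def put(self, item_id, data, metadata):
        self._data[item_id], self._meta[item_id] = data, metadata

    def get(self, item_id):
        return self._data[item_id]

    def search(self, **criteria):
        return [i for i, m in self._meta.items()
                if all(m.get(k) == v for k, v in criteria.items())]

repo = InMemoryRepository()
repo.put("doc1.ann", b"...", {"language": "en", "scheme": "POS-schemeX"})
print(list(repo.search(language="en")))  # ['doc1.ann']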

5.2.3. Monday, Session 2: Identifying the Audience

During the discussion to prioritize the problems the participants noted that to place problems in their appropriate category (high, medium, low), we needed to know who our target audience was, that is, who is the population our solutions are designed to serve? Some problems were higher priority for different audiences: e.g., user interface problems should be higher priority if our audience is the annotators themselves, while workflow management should be higher priority if the audience is project managers. This led to an extended sidebar where we discussed the appropriate target audience for the solutions proposed at the workshop. We started by listing possible audiences, and they included (in no particular order):

• Social Scientists: Social scientists do a lot of annotation-like activities, often supporting measurements like content analysis [19]. Further, social scientists are often non-technical and have great difficulty using extant annotation tooling.
• Digital Humanists: Even more than social scientists, digital humanists use a lot of annotation in their work, and it is almost exclusively language data. They have less severe problems with lack of technical expertise, but this is still a challenge for them.



• Computer Scientists: This category was meant to include computationally savvy researchers (who are not computational linguists) who use language data in their work.
• Computational Linguists: While there is a thriving subfield of computational linguistics dealing with the production of annotated data, most computational linguists merely use the annotated data and potentially would benefit from being able to more easily generate the annotations themselves.
• Annotation Tool Developers: Cutting across different fields, annotation tool developers are those who actually build the annotation tools used by other researchers. These developers may or may not actually have annotation projects of their own (it is often the case that they do).
• Annotation Project Managers: These are skilled workers—often graduate students, but also perhaps paid employees—who actually handle the day-to-day details of running an annotation project: they recruit and train annotators, manage the data, check the results, and ensure that the gold standard is properly constructed.
• Annotators: These are the workers who apply annotations to text. Their overriding common feature is knowledge of the language being annotated. They may be trained or untrained, computer savvy or not, and are the ones who spend the most time in front of an annotation tool UI.
• Linguists: This category is meant to include non-computational linguists, including field linguists, grammarians, historical linguists, comparative linguists, and so forth. These researchers have a great need to create and use linguistic annotations, but are usually not computationally savvy enough to use the extant tools. Importantly, they do tend to share a common idiom for linguistic ideas, which makes them a much more unified audience.
• Speech & Hearing Science Researchers: These are researchers who specifically focus on speech and hearing, and would benefit from significant improvement in the tools they use to produce and manage their annotations.
• Other Researchers: This category was meant to capture anyone who has a really clear idea of what they want to do with regard to linguistic annotation, whether they are computationally savvy or not, and who really needs help to manage the tooling.

It was first noted that among these classes of audience, there was a major two-way sub-division: (1) On the one hand, there are researchers whose only experience with computers is through commodity end-user applications. These include humanities or social science researchers, where to be successful the annotation tools really need to be as easy to use as Microsoft Word or Excel; (2) On the other hand are researchers who are able to program and develop their own tools to a certain degree. In this case you want more of a sandbox or toolkit solution, where the developer can make their own tools. By way of example of this distinction, Steve Cassidy mentioned the Galaxy platform [20]. Galaxy is a platform for standardization, reuse, and sharing of biological data, resources, and tools. Its users are researchers much like the developers who work on natural language processing, computational linguistics, corpora development, and so forth. Galaxy is a separate community similar to linguistic institutions like LDC and ELRA. Galaxy's main goal is to wrap tools in ways that enable everyone to use them. The paradigm as described allows a division of labor between (1) technically proficient researchers who write extremely functional but difficult-to-use tools, (2) somewhat technically proficient researchers who are able to understand those tools well enough to write a Galaxy wrapper for them, and (3) other researchers who only use the tools through the wrappers.
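As a hedged sketch of that wrapping paradigm (the tool name, its command-line flags, and the descriptor fields below are all invented, and this is not Galaxy's actual wrapper format), a wrapper essentially pairs a declarative description of a tool's inputs and outputs with the command line needed to run it, so that less technical users never have to construct that command line themselves:

from dataclasses import dataclass

@dataclass
class ToolWrapper:
    """Hypothetical declarative wrapper around a command-line annotation tool."""
    name: str
    command_template: str   # how a wrapper-aware platform would invoke the tool
    input_format: str
    output_format: str

    def build_command(self, input_path: str, output_path: str) -> str:
        # The platform, not the end user, fills in the concrete paths.
        return self.command_template.format(input=input_path, output=output_path)

# An invented tool, for illustration only.
tokenizer = ToolWrapper(
    name="mytokenizer",
    command_template="mytokenizer --in {input} --out {output}",
    input_format="plain-text",
    output_format="json-standoff",
)

print(tokenizer.build_command("interview.txt", "interview.tokens.json"))
# mytokenizer --in interview.txt --out interview.tokens.json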
After some discussion, the participants were able to come up with a four-way prioritized abstraction into which each of the above types of people could be placed:

1. Specialized Corpus Creators: This includes annotation task managers and tool developers. These researchers are experienced in linguistic annotation, can conceive an annotation task, know what tools need to be built, hire and manage annotators, and so forth.
2. Non-specialized Corpus Creators: This category includes those for whom building a corpus is a secondary concern to their primary work. This includes, for example, theoretical linguists, social scientists, and digital humanists to a certain degree.
3. Annotators: The people who actually apply annotations to language artifacts.
4. Annotation Users: Researchers who use annotations to do science.

The above list is prioritized, in that the participants emphasized that the point of the workshop and the project at hand was to provide infrastructure to help execute annotation projects. The participants noted that if specialized corpus creators are not well supported in their task, then the other categories of people have little hope of finding relief for their problems.



The sentiment was that if those with deep expertise in annotation and annotation tool design cannot accomplish their work and do annotation, then everyone else with less technical proficiency will have a harder time. Thus, the priority target audience was identified as "corpus creators, annotation task managers, and developers." Once this decision had been made, we went back and checked that all the problems and their priorities settled on prior to that discussion still made sense in the context of this more specific audience focus. This concluded the first morning of the workshop, and the participants broke for lunch.

5.2.4. Monday, Sessions 3 & 4: Breakout Groups

After lunch, and in accordance with suggestions made in the first session, our goal was to split into breakout groups and try to come up with motivating use cases for the high-priority problems that were identified in the previous session. We spent significant time discussing how to organize the breakout groups, and there were at least three proposals for aligning the breakout groups (listed here in order of discussion):

1. Align the breakout groups directly with the high-priority problems identified in the last session. The goal for each group would be to come up with a high-priority use case which is blocked by that problem.
2. Align the breakouts with a preliminary classification of kinds of annotation tasks. The classification dimensions proposed were based on a presentation by Rumshisky & Pustejovsky13 and included:
a. Entities vs. Relations: Annotation of entity types and their attributes corresponds to linguistically local annotation, versus annotation of relations that span widely across text and connect entities.
b. Overt vs. Covert: Annotation that captures elements that are overtly mentioned in the linguistic artifact (e.g., are lexicalized), versus annotation of elements that must be inferred by the annotator.
c. Span-specific vs. Span-independent: Annotations that are anchored to specific spans or regions in a linguistic artifact, versus annotations that are independent of any particular region or span.
3. Align the breakouts with target media of annotations. It was proposed that this would be orthogonal to the last proposal.

Ultimately the group decided on the last alignment, along the dimension of media targets, and we conducted an exercise where each participant suggested two corpora that could be placed in one of the categories. The final types of media included:

a. Text: Annotation over characters, including transcriptions of speech or other audio, or annotating translations. Examples: PropBank, BioNLP, NUCL, FactBank, EuroParl, BNC, GNC, OntoNotes, TüBa-D/Z, Shared Clinical Reports, i2b2, ACE, MASC, OANC, TimeBank.
b. Speech: Annotation on the audio stream itself, like phonetic annotation. This sort of annotation is distinguished from text because it is time-aligned rather than character-aligned. Examples: AusTalk, CHILDES.
c. Multimodal: Annotation on video or images, or anything that includes both text and non-text data (e.g., audio, images, video). Examples: CNGT, YoutubeParaphrases, TrecVID, Mumin, Gemep, TUNA, Madcat, Confluence Project.
d. Metadata: Annotation of features that are independent of particular spans or regions in the artifacts. An example of this type of annotation would be language types and the types of content in a corpus. Examples: GermaNet, OLAC, DoBeS, SemLink, VLO.

Other media types that were proposed but ultimately rejected were:

1. "Naked media" or "Naked video": This is raw media with no transcription. This was rejected on the premise that these do not contain annotations and so are out of scope.
2. Aligned Translations: While interesting, this was not considered general enough to support a breakout group.
3. Lexical resources: One participant suggested that if we are guided by the principle that annotation is "application of human judgments to language data," then lexical resources are a form of annotation. But the general feeling of the other participants was that, while that principle is reasonable, this broadened the scope of the project in an infeasible way.
Several issues that were identified as being relevant to all the areas included:

1. Relations across documents
2. Resource linking

13 http://annotation.co


Once the alignment was decided, participants volunteered for various groups, with an eye toward balancing the size of the groups and making sure that experts in the media type in question were present in each group. After the breakout groups met (for approximately an hour and a half), all participants reconvened to summarize the use cases they produced:

Text Breakout Group

This group considered three main categories of use cases blocked by one or more of the high-priority problems: (1) Layered annotation, such as adding named entity detection and coreference on top of existing layers, potentially from a corpus you did not yourself create. (2) Corpus organization, where you have several different levels of analysis, including, say, annotations within documents, about documents, or across documents. Here a set of medical records for a single person was offered as a compelling example. (3) Textual generation, which includes tasks like textual entailment or error annotation, where annotators actually generate new text (such as propositions or corrections) that is then annotated. For all three use cases, the group noted that it is extremely hard to figure out what tools exist, what tools can be used for the task, what tools are best under what circumstances, and how tools can be hooked together. The group emphatically expressed that interoperability was the biggest obstacle to being able to take different tools and put them together in a pipeline or a workflow. This problem extends beyond the first task of being able to "plug-and-play" different tools (and thereby see what works better); it extends to defining the semantics for your own annotations or overall project and judging how well your semantics maps to the semantics of someone else's project. These semantic considerations are critical to whether you can and how you must combine different parts of your annotation workflow.

Speech Breakout Group

This group considered two use cases. (1) Transcribing overlapping speech, such as dialog or multi-party conversation. The group noted that this relates strongly to the layering use cases identified by the text group. The usual solution for transcribing overlapping speech is to do it in Word, with overlap between speakers indicated by orthographic alignment of two lines. The group noted that tools like ELAN can do this already [21], but then further noted that researchers, especially non-technical researchers, don't use ELAN for two reasons: First, they don't know about it, which speaks to the problem of not being able to find the right tools. Second, the data in ELAN is hard to transform into a publishable format, which is an example of the problem of lack of conversion from the format of the tool to a desired external format, namely, a publication-ready format. (2) The second use case involved the field linguist's task of collecting recordings and annotating them, with the goal of constructing a full grammar for the language. In this use case, the researcher is developing and debugging the annotation scheme at the same time they are developing and debugging the annotations. Our inability to co-evolve the scheme and the annotations themselves is a blocker here, which is part of the problem of lack of support for annotation scheme and guideline development. Several examples were offered of this problem.
Steve Cassidy noted that he had observed researchers working with ELAN, for example, who change their annotation scheme halfway through the annotation of a corpus. In these cases they may, for example, change the tiers they use to annotate the data, but then the annotators forget to—or only incompletely—go back to adjust the previously annotated data. This oversight comes from the lack of a validation phase that can confirm that all of the data conforms to some specified scheme. Jeff Good provided another example that could happen even when there is only a single annotator. He noted that in the case of a long-term field worker, they may be generating annotations over an extended period of time (say, twenty years), and their annotation scheme is slowly changing over that time, often implicitly. So in this case, the annotator is working with previous versions of themselves, and must ensure that the scheme at the end applies to all the data generated. That annotator may want to know, for example, that eight years prior they applied a specific analysis to generate annotations but that now, in the present, their theoretical model has changed and they must update their eight-year-old annotations to comply with the new model. This group emphasized that workflow management is really a critical blocker for non-technical users. When a tool is not a stand-alone application they download onto the computer and double-click to install, this can cause major difficulties for non-technical users. It was also noted that even having detailed instructions does not necessarily solve the problem: the MAUS system [22] from Munich is quite easy to use and clearly described online in a FAQ, but its developers still get a lot of queries such as, "I want to use it on my data, how do I do it?" One of the fundamental problems here is that, for non-technical researchers, the learning curve for any new tool—no matter how well documented—is difficult and thus makes for a significant barrier to entry.



Multimodal Breakout Group

This group focused on the so-called ALADDIN14 use case, which combined human and automatic annotation of a video dataset. The human annotations provide gold-standard data, and the team is simultaneously developing tools to perform automatic annotation across different modalities. One aspect that makes this use case unique is that querying does not happen textually: because the data is multimodal, the only way you can do a query is by example. This requires you to have a sophisticated tool to actually construct queries, which currently does not exist. For multimodal data, it was further noted, the sheer variety of types of anchors used in the raw data is challenging in itself, and with lots of different modalities it is hard for a single researcher or even a small team to integrate—or even be aware of—the best-of-breed tool for each of these modalities. Thus the group noted that this use case was blocked by several major problems. First, the core goal of the use case was blocked by the lack of some sort of "meta tool" which allows one to link annotations across modalities. Second, the use case is blocked by our inability to find best-of-breed tools easily, and understand how they can work together quickly, which speaks to a lack of interoperability. This is a function, too, of a lack of formal ways of describing annotation guidelines and annotation schemes which allow you to formally specify how the semantics of different schemes map to each other. Third, this use case is also blocked by lack of pipelining support and workflow support, which would allow one to chain together automatic analysis or annotation tooling and to quickly run big experiments. Finally, because this use case involved video data, bandwidth constraints become critically important for the end user. This led the group to think about how to deal with large data sets, where one must think carefully about the data management facilities and the compute and bandwidth capacities at the annotator's end. In response to these problems, the group began to brainstorm on potential solutions. One idea that the group floated was having every annotation tool publish minimal constraints and anchors for its annotation layers, using unique identifiers to refer to each layer of annotation, which would allow another tool to relate its own annotations to that layer in a loose way. This idea of loose coupling was extended to having tools expose simple functions or configuration information in a standard way. If the community could encourage developers to offer scripting languages (or a set of external function calls or configuration parameters) to allow remote control of their tools, this would significantly enhance our ability to create meta tools or workflow management solutions.
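As a hedged sketch of what such loose coupling could look like in practice (every identifier and field below is invented; no existing tool publishes exactly this), a tool might publish, for each annotation layer it produces, a globally unique identifier, the kind of anchor the layer uses, and some minimal constraints, so that a second tool can attach its own annotations to that layer without knowing anything else about the first tool:

import json

# Hypothetical descriptor a video-annotation tool might publish for one layer.
layer_descriptor = {
    "layer_id": "urn:example:toolA:gesture-layer:v2",  # invented unique identifier
    "anchor_type": "time-interval",                    # anchors are millisecond spans
    "constraints": {"min_duration_ms": 40, "non_overlapping": True},
}

# Another tool can now refer to that layer loosely, by its identifier alone.
linked_annotation = {
    "type": "speech-gesture-alignment",
    "links_to_layer": layer_descriptor["layer_id"],
    "anchor": {"start_ms": 1200, "end_ms": 1480},
}

print(json.dumps(linked_annotation, indent=2))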
Metadata Breakout Group

The final group focused on two different specific use cases: tools that would allow (1) orthography normalization for German; and (2) detection of vocal fry. The group noted that if we want to create, publish, and find tools that help with extremely specific use cases (and the vast number of equally specific analogs), the first thing that the field needs is some kind of metadata that distinguishes the types of tools and the types of annotation those tools allow, including additional information such as what import and export formats they support, etc. This led to a discussion of the level of granularity needed in tool metadata. According to the group reporter, the primary takeaway from this group is that the field is missing a (coarse-grained) taxonomy of annotation tasks in which to describe the abilities of tools. Such a taxonomy would allow us to design functional pipelines and work, step by step, down to the micro-level of what import and output formats are supported by what tools. This would allow us to either convert between these different input and output types as allowed, or write appropriate data wrappers that fill the gaps. But this can only be done after we can establish all of the metadata that would be required for a particular tool.

Conclusion of Day 1

After the breakout group report there was a brief discussion about whether or not the schedule for the next day should be altered. The original schedule can be seen in Appendix E, and contained the following order of sessions: (1) introduction to CLARIN; (2) discussion of needed capabilities; (3) discussion of how the solution should be funded, built, and maintained; and (4) summary of the recommendations. I emphasized that it was important that we map the problems we have identified into capabilities that are needed or wanted, resulting in a relatively concrete list that can be used to focus future work.

14 http://www.aladdinproject.org/


Again I emphasized that the solution was likely not a single piece of software, but could be any number of things, including websites, catalogs, metadata standards, descriptions of the contents of corpora, and so forth. The participants eventually agreed that we should retain our goal of matching each of the high-priority problems on our list with specific capabilities that might solve them. The participants also agreed that we should retain a discussion of how any solution would be built, maintained, and managed: namely, the social infrastructure supporting the research infrastructure. One concrete suggestion in this vein was to focus on how to secure funding, how to encourage collaboration, and what future meetings should occur. Eric Nyberg noted, in this context, that it was important to involve anyone from industry who was trying to surmount these specific problems, because those types of people were incredibly helpful when forming the UIMA standards and specifications. He suggested that we may want to perform—either at this workshop or a future meeting—some sort of "roadmapping" exercise. Many participants considered this idea insightful, but all agreed that this was probably too ambitious for a single day, and should perhaps be deferred to its own future meeting. In the end, the schedule for the second day was retained mostly as proposed, and in practice the discussion naturally expanded or contracted to emphasize the points of most concern to the participants, as reflected in the discussion below.

5.2.5. Tuesday, Session 1: CLARIN

Tuesday began with a presentation led by Erhard Hinrichs, Project Coordinator and Head of the German Section of CLARIN and Member of the CLARIN Executive Board. The goal of his presentation was to orient the non-European participants to this important potential collaborator and stimulate discussion about existing tools that may fulfill needed capabilities. Up-to-date details on CLARIN may be found through their own website15. Here I give only a brief summary of the presentation, to retain the flavor (and some context) of what was discussed at the workshop.

CLARIN stands for "Common Language Resources and Technology Infrastructure", and is a European-funded effort that is intended to support researchers from the social sciences and humanities, where "humanities" is defined as the complement of the hard sciences. The specific goal of the project is to construct and then use an integrated, interoperable, and scalable research infrastructure. An important part of the project is a network of centers, distributed across Europe (and potentially the rest of the world). Many of these CLARIN participants, centers or not, existed previously as data repositories or as providers of tools and corpora (including both spoken and multimodal speech corpora). CLARIN is a way of joining forces and pursuing common goals. CLARIN is part of the ESFRI (European Strategy Forum on Research Infrastructures) roadmap for research infrastructures. This roadmap comprises three phases: preparation, construction, and exploitation. There are 55 research infrastructures on this roadmap, spanning all the way from medicine and the life sciences to social sciences and the humanities. This roadmap was significant, in that it involved a new pan-European legal framework. Funding for the infrastructures is a mixture of national and European funding. Importantly, an institution can become a CLARIN center even if it is not in Europe: Carnegie Mellon University is currently a CLARIN center, and CLARIN is in negotiations with Tufts University (via the PERSEUS16 project with Gregory Crane) to become one as well.

What do CLARIN centers do? They provide central services, like persistent identifier services, but also provide various search tools for their holdings. CLARIN as a whole provides a federated content search where you can search the holdings of different centers. Centers also provide training and dissemination and usually support a particular user community. CLARIN as a whole collaborates with so-called "discipline-specific working groups," of which there are ten right now, including fields as diverse as psychology, linguistics, history, and political science. One of the major contributions of CLARIN so far is WebLicht (Web-based Linguistic Chaining Tool), an environment for constructing and executing linguistic annotation pipelines. WebLicht is built upon Service Oriented Architecture (SOA) principles, which means that annotation tools are implemented as web services that are hosted on servers at CLARIN centers. WebLicht allows researchers to mix-and-match tools from different centers, which is made possible by use of a common data exchange format, the Text Corpus Format (TCF), and by metadata concerning each individual tool, most importantly input/output specifications. A shared data exchange format allows the output of one tool in the pipeline to be used as input to the next, and tool metadata is used to guide the user to build only valid pipelines.

15 http://clarin.eu/ 16 http://www.perseus.tufts.edu/


Currently WebLicht provides approximately 112 tools, most of which perform standard NLP tasks such as tokenization, part of speech tagging, detection of named entities, and morphological analysis. There is an authentication and authorization interface, called Shibboleth, which manages access control, because some of the corpora and tools have legal restrictions. CLARIN also provides WebAnno [23], which allows both manual and automatic annotation of data.
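The following is not WebLicht's actual mechanism, only a schematic sketch (tool names and annotation-type labels are invented) of the general idea described above: if every tool advertises the annotation types it requires and the types it produces, a chaining environment can check automatically whether a proposed pipeline fits together before running it.

# Each entry maps a tool name to (annotation types required, annotation types produced).
# The names are invented; real tool metadata is considerably richer than this.
TOOLS = {
    "tokenizer": (set(), {"token"}),
    "pos-tagger": ({"token"}, {"token", "pos"}),
    "ner": ({"token", "pos"}, {"token", "pos", "named-entity"}),
}

def pipeline_is_valid(steps):
    """Check that every step's required types are produced by some earlier step."""
    have = set()
    for step in steps:
        required, produced = TOOLS[step]
        if not required <= have:
            return False, "%s is missing %s" % (step, required - have)
        have |= produced
    return True, "ok"

print(pipeline_is_valid(["tokenizer", "pos-tagger", "ner"]))  # (True, 'ok')
print(pipeline_is_valid(["ner"]))  # (False, "ner is missing ...")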

5.2.6. Tuesday, Session 1: Brainstorming Capabilities

The first goal for the discussion of the second day was to develop lists of potential capabilities that could solve the high-priority problems identified on the first day. There was no restriction on the kinds of capabilities that could be proposed: they were encouraged to be easy or hard, pedantic or creative, incremental or visionary. The only question of concern at this stage was: Is this something that would solve the problem? Participants were encouraged to think expansively, without too much regard to our current ability to implement. The idea behind this brainstorming session was to get all the good ideas on the table; they could be pruned later if necessary by practical constraints. The only caution that I provided was that we should not be too specific in describing the capabilities, as this would overly constrain us going forward and potentially mire the discussion in disagreements over feasibility. For each high-priority problem, I have organized and condensed the discussion into an unordered list of potential capabilities.

Difficulty finding and evaluating appropriate tools and resources for the task at hand. Participants emphasized that the problem of evaluation was of equal or greater importance than the problem of discovery.

• Registry or Catalog of Tools. There have been many attempted catalogs of annotation tools, and there are a number of existing resources that include lists of tools; for example, DiRT at Berkeley17, the LRT inventory18 (in Europe; CLARIN/MetaShare), or the LRE Map19. But these fall short in a number of ways. Steve Cassidy pointed out the example of Orbit20, a registry for biomedical tools, which is based on a content management solution. That, he suggested, may be a slightly better model than just websites with lists. By extension, and by analogy with previously established type registries such as that in UIMA, several participants noted that we need more formal and precise descriptions of the features and capabilities of the tools. It was also noted that the best registries have built-in automatic completeness checks for the metadata, and an established quality assurance (QA) process when curating the data.
• Linguistic Travel Planner. By analogy with popular travel planning websites like Orbitz or Kayak, it was suggested we need a "linguistic" travel planner to plan a route between a set of desired inputs and a set of desired outputs, preferably allowing specific waypoints. It was noted that this would require standard metadata for tools.
• Freeform Comments. For whatever list, catalog, or registry of linguistic annotation artifacts is implemented, Michael Kipp suggested that it was critical to have the ability for the community to leave freeform comments. By observation of popular e-commerce websites like Amazon.com, he noted that the best way to evaluate unknown products is through user reviews, and the same principle applies in the linguistic annotation space. The ability to capture unstructured information opens the possibility of eliciting a large amount of useful, otherwise inaccessible, information. It would be great, too, if developers could comment back, and this facility would encourage tool improvement and maintenance.
• Standardized Vocabulary. Finally, the capability to describe tools in a standard vocabulary could prove useful for understanding what tasks a tool applies to.
This capability was conceived as different from, and less stringent than, the ability to formally describe tools and schemes. What is intended here is the development of a common conceptual model of the tasks and sub-tasks of linguistic annotation, such that those seeking to do linguistic annotation can communicate clearly. There has already been some progress on this, in the form of the MATTER conceptualization [24].

Lack of support for development, documentation, reuse, naming, and formalization of annotation schemes and guidelines.

• Formal Type Registries & Vocabularies. Participants vocally supported a capability that allowed formal definition of the "types" covered by particular annotation schemes. Reference was made to the now defunct ISOcat effort21, which attempted to provide a uniform vocabulary of linguistic types that could be captured in annotation schemes.

17 http://dirtdirectory.org/ 18 https://www.clarin.eu/content/language-resource-inventory 19 http://www.resourcebook.eu/ 20 http://maveric.org/orbit.php


Also referenced was the ongoing LAPPS Grid effort and its associated web-service exchange vocabulary, which allows it to federate with language grids throughout the world. In general, the capability was described as an abstract type registry, with formal descriptions and a formal vocabulary for describing the semantics of types, which would allow the description, derivation, and implementation of mappings between types. In general, participants focused on the capability to formally encode the semantics of annotation schemes, which naturally relates to a definition of semantic categories underlying particular annotations.
• Support for MAMA. While this capability remained somewhat vague, it was noted that some sort of clear support for the so-called MAMA or the "babbling" phase of annotation design is a critical missing capability. The MAMA phase refers to Model-Annotate-Model-Annotate [24], where the annotation designers are iterating between developing an annotation model and applying that model to data, thereby converging on a model that can be reliably applied to the data. The problem in this case is managing all the different versions and their differences and updating annotations created under a prior version. Related to this was the ability to use an updated version of an annotation scheme to verify an annotated corpus. Related efforts that were mentioned in this context included NVivo22, the Web Service Exchange Vocabulary [25], and RDA23.
• Registry of Type Conversions. Finally, it was proposed that, before a formal type registry was created, progress could be made by providing a registry of available type converters. For example, various researchers have written stand-alone converters between one scheme and another, or particular tools are known to be able to import and export in two different schemes. Collecting these conversion facilities into a single catalog would begin to relieve some of the burden of moving between different annotation schemes, without going to the difficulty of describing exactly the formalities of the mappings.

Inadequate Importing, Exporting, Conversion.

• Standalone Converter. By analogy with type conversion, syntactic conversion is also a problem, and some participants wondered if it wasn't possible to create a standalone service (either as a downloadable program, or a web service) that could convert between different syntactic formats. Such a service would significantly extend the applicability of existing tools. The data exchange library SaltnPepper24 was mentioned in this context, as well as the ANC2Go25 facility, which converts between the GrAF format and many other formats.
• Adapters from Specific Tools to Data Formats. A quicker path than building an all-to-all converter would be to identify key tools where adapters could have a great effect in allowing that specific tool to use a previously incompatible data or repository format.
• Cluster Map of Tools. Even more simply, it was proposed to merely map out sets of tools that have (or should have) conversions between them: in effect, answering the question "what tools work together right now?" (A minimal sketch of route-finding over such a map of converters follows this list.)
• Publication Quality Output. Jeff Good mentioned that a capability that would be of use to many non-technical users of annotations would be the ability to output "publication-quality" formats, which could be directly inserted in manuscripts. Because of the difficulty and importance of formatting data for publication, he noted, many non-technical researchers choose highly inadequate tools (such as Microsoft Word) for the initial annotation because they make this final step easier.
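Both the "Linguistic Travel Planner" and the "Cluster Map of Tools" ideas reduce, at their simplest, to path-finding over a registry of known converters. A minimal sketch of that reduction follows; the formats and the conversion edges listed here are invented placeholders, not an inventory of real converters.

from collections import deque

# Invented registry: which formats can currently be converted directly into which others.
CONVERTERS = {
    "tool-A-xml": ["graf"],
    "graf": ["json-standoff", "tool-B-tsv"],
    "json-standoff": ["publication-latex"],
    "tool-B-tsv": [],
    "publication-latex": [],
}

def conversion_route(source, target):
    """Breadth-first search for a shortest chain of converters, or None if there is none."""
    queue, seen = deque([[source]]), {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in CONVERTERS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(conversion_route("tool-A-xml", "publication-latex"))
# ['tool-A-xml', 'graf', 'json-standoff', 'publication-latex']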
Lack of Workflow Support. Also related to the idea of supporting workflows was the more specific idea of supporting chaining or composing linguistic processing tools.

• Standard Vocabulary for Workflows. Several participants emphasized that managing the workflow of annotation projects was one of the most critical components to ensure efficient and effective production of data. And yet, we have no standard vocabulary for describing annotation workflows, either to find tools that apply to particular parts of one's workflow, or to communicate to other researchers how one accomplished one's own workflow.

21 http://www.isocat.org/ 22 http://www.qsrinternational.com/ 23 https://rd-alliance.org/ 24 http://corpus-tools.org/pepper/ 25 http://www.anc.org/software/anc2go/


A standard vocabulary for describing workflows—such as the common stages, tasks, and capabilities of annotation projects—would be a simple yet significant capability.
• Standalone Facility for Combining Annotation Layers. With any normal annotation project involving multiple annotation tools, several participants suggested building a capability to integrate annotation layers generated from different tools. The facility would be tool agnostic and would allow stand-off annotations to be integrated together without regard to their source. In this context was mentioned WebLicht, CLARIN's web-based tool for combining annotation layers [26]. Also mentioned was ANNIS-3, which has the notion of multiple layers in the final product [27], and MMAX, which has interesting visualization capabilities in this regard [28].
• Linguistic Pipeline Interoperability. A significant amount of discussion revolved around achieving interoperability between the many different linguistic pipeline systems that are currently available. This discussion was so extensive that I have broken it out into its own section (§5.2.7). For example, the LAPPS Grid, which is NSF-funded via Vassar, Brandeis, CMU, and the LDC [15], uses JSON-LD and RDF for inter-module communication and has federated or will federate with six other grids throughout the world (e.g., MetaShare). However, at the time of the workshop it did not yet communicate with other pipeline solutions like UIMA or CLARIN's WebLicht. I defer discussion of this capability to the next section.
• Supporting Human-in-the-Loop. While this facility remained underspecified, it was noted that while humans and machines could and should profitably work together in generating annotations, there is very little support for human-in-the-loop workflows. I pointed out that the Story Workbench [13], [14] was designed from the start to incorporate a tight feedback loop between automatic and manual annotation and could potentially be a model in this regard.
• Generic Workflow Support. Finally, many participants desired a generic ability to precisely describe and easily implement annotation workflows. Right now, any annotation project of even mild complexity involves a number of people (designers, managers, adjudicators, annotators, etc.) operating over several sets of tools. The ability to quickly set up—or port from a pre-existing project—a workflow design for such a collaborative manual annotation project across such human and technical resources would be a major boon. Several participants noted that any workflow management system must include some ability to manipulate multimedia and multimodal data if it is to have wide applicability.
• Portable Workflow Definitions. In the absence of a generic workflow solution, and in the presence of multiple individual workflow management suites, perhaps an intermediate step would be to provide a way of porting workflows across different sites. It was suggested that the LAMUS26 tools, part of the Language Archive Technology Suite, would be applicable to this capability.

Lack of User, Project, and Data Management. For many of the capabilities listed against this problem, several participants noted that both WebAnno [23] and WebAnn [29] already implement them to some degree. Both allow some modicum of UI design, allow assigning of annotation artifacts to users, and control access. WebAnno works primarily with text, and WebAnn is more agnostic with regard to medium.
No matter what kind of annotation is being performed, usually the ability to search those annotations is a useful facility. Most annotation tools, however, do not provide this capability, and those coming to the data later are even more poorly supported.  Generic Facilities for Versioning Annotation Schemes. As noted above in “Support for MAMA”, annotation schemes and annotated data usually go through several versions before being released to the public. We need tools and facilities that ease the management and control of these versions. A related tool in this regard is COMA, which is part of ExMARaLDa [30], which provides some version management capabilities.  Link between UI and Formal Scheme Definitions. Related to the idea of “tool-building tools,” which was a medium-priority problem, was some way of having UI functions and capabilities derive automatically from a formal definition for a scheme. One participant made the point that there are only a small number of UI ways to annotate, for example, a Token, and if an annotation scheme annotates tokens perhaps a good start to a UI could be automatically generated from the scheme itself.  Bridge between Standard Annotation and Crowdsourced Annotation. Finally, it was suggested that some capability to automatically bridge between or port over standard annotation tasks to crowdsourced annotation would be of great value across the linguistic annotation space. Current models of local

26 https://tla.mpi.nl/tools/tla-tools/lamus/


Current models of local annotation don't scale to large volumes of annotators or data; right now, typically when you want to crowd-source an annotation task, you must custom-make a lot of software. While there has been some discussion of best practices for crowd-sourced annotation [31], there is still no generalized ability to distribute standard annotation tasks easily to crowdsourcing platforms.
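To make the "Standalone Facility for Combining Annotation Layers" idea above concrete, here is a minimal sketch of a tool-agnostic merge of stand-off layers, assuming each tool exports its annotations as character offsets into the same source text. The Annotation fields, layer names, and tool names are hypothetical illustrations, not the format of any existing tool; a real facility would also have to reconcile differing tokenizations, media types, and metadata.

```python
"""Minimal sketch: tool-agnostic merging of stand-off annotation layers.

Assumes every tool exports annotations as character offsets into the same
source text. All field, layer, and tool names below are hypothetical.
"""
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int        # character offset, inclusive
    end: int          # character offset, exclusive
    layer: str        # e.g. "pos", "ner"
    label: str        # value assigned by the producing tool
    source_tool: str  # which tool produced this annotation


def merge_layers(*layers):
    """Combine layers from different tools into one stand-off set,
    indexed by text span so overlapping annotations are easy to inspect."""
    by_span = defaultdict(list)
    for layer in layers:
        for ann in layer:
            by_span[(ann.start, ann.end)].append(ann)
    return dict(sorted(by_span.items()))


if __name__ == "__main__":
    text = "Mary saw the dog."
    pos = [Annotation(0, 4, "pos", "NNP", "tagger-A"),
           Annotation(5, 8, "pos", "VBD", "tagger-A")]
    ner = [Annotation(0, 4, "ner", "PERSON", "recognizer-B")]
    for (start, end), anns in merge_layers(pos, ner).items():
        print(text[start:end], [(a.layer, a.label, a.source_tool) for a in anns])
```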

5.2.7. Tuesday, Session 2: Linguistic Pipeline Interoperability Workshop

During the discussion of capabilities, there was a significant sidebar on enabling interoperability between the various linguistic pipeline solutions. Some major solutions available include the LAPPS Grid, WebLicht, UIMA, and LingPipe27, plus a host of more focused efforts such as the Stanford CoreNLP suite or OpenNLP28. All of these platforms do similar things, but don't work together. The problem that was noted was that once a researcher commits to a particular pipeline platform, it is extremely difficult to integrate tools from the other pipeline solutions, and extremely difficult to move to a different platform later. Several participants suggested a focused effort to enable interoperability, as a sub-effort or parallel effort to any result of the UAT Workshop. Several suggested that a pipeline interoperability effort would be best embodied in a week-long technical workshop, where implementers of the major pipelines came together to agree on interoperability standards and interchange protocols, and implement those solutions for their pipelines, such that components could be shared across pipelines and pipelines could speak with one another. This workshop would be a true "working" workshop, where the expectation is that people would actually write code and have working systems at the end of the meeting. Several participants suggested that finding funding for such a workshop should be easy, and in any case many of the pipeline implementers might be willing to come on their own funds.

Three general ideas about organization were discussed in relation to this specific effort. First, perhaps such a linguistic pipeline interoperability effort would be well served by a steering committee that could meet remotely (via video chat) in advance of the workshop, to hammer out the main specifications and write a guiding document. Eric Nyberg gave the example of the UIMA steering committee, which never met in person, but had bi-weekly phone calls, and all the members were committed to writing a document together. The second idea was that perhaps such an effort should be conducted under the auspices of an existing umbrella organization, such as RDA29, SIGANN30, or OASIS31 (which is where UIMA started). A standards body like ISO was considered to be a poor fit, given the closed nature of the standards produced. But Eric Nyberg did suggest that interacting with a body like OASIS, which has experience in creating these standards, can be extremely useful, in that they can provide a lot of advice about how to write open specifications, something academics are normally completely ignorant of. Finally, the idea of a roadmap was suggested for organizing implementation and progress on such an effort. Such a roadmap would need to be in place before any implementation workshop, to guide the work and assess how much progress was made.

In the end, it was agreed by most participants that this was an excellent idea that should be pursued in parallel with, and not necessarily as a sub-part of, the related UAT workshop goals.
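Since the LAPPS Grid already moves data between modules as JSON-LD [15], one way to picture cross-pipeline interoperability is as agreement on a common interchange envelope. The sketch below is a minimal, hypothetical envelope in that spirit; the field names, the placeholder @context URL, and the example annotations are illustrative assumptions, not the actual LAPPS interchange format or any other pipeline's wire format.

```python
import json

def wrap_for_exchange(text, annotations, producer):
    """Wrap tool output in a hypothetical JSON envelope so another pipeline
    could consume it; the field names here are illustrative assumptions."""
    return {
        "@context": "http://example.org/annotation-context",  # placeholder
        "text": {"@value": text, "@language": "en"},
        "views": [{
            "metadata": {
                "producer": producer,
                "contains": sorted({a["type"] for a in annotations}),
            },
            "annotations": annotations,
        }],
    }

if __name__ == "__main__":
    tokens = [{"id": "tok_0", "type": "Token", "start": 0, "end": 4},
              {"id": "tok_1", "type": "Token", "start": 5, "end": 8}]
    envelope = wrap_for_exchange("Mary saw the dog.", tokens, "tokenizer-X")
    print(json.dumps(envelope, indent=2))
```

With something like this in place, each pipeline would only need one bridge, to and from the shared envelope, rather than pairwise converters to every other pipeline.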

5.2.8. Tuesday, Sessions 2 & 3: Management and Sustainability

After discussion of the capabilities needed for the infrastructure, we proceeded to consider how such an infrastructure would be funded, managed, and sustained, and what partnerships would be necessary or appropriate. First, I noted that the workshop itself was funded under NSF infrastructure planning funding and that I had promised in the original proposal that we would pursue the full Community Research Infrastructure grant to actually build the initial implementation. The real question of this session, then, was how to sustain it over the long term: funding cycles are three years, but the useful life of this infrastructure, if done right, could be 10-20 years or more. How do we support and manage that across changing funding priorities and individual researcher interests?

27 http://alias-i.com/lingpipe/
28 https://opennlp.apache.org/
29 https://rd-alliance.org/
30 http://www.cs.vassar.edu/sigann/
31 https://www.oasis-open.org/


Discussion at first kept returning to the problem of finding initial funding to build the infrastructure, presumably because this is something all the researchers know well: how one pitches an idea and works the system to get a three-year project funded. Discussion at one point turned to how much initial funding is obtainable or reasonable. At this point I interjected and refocused the discussion on the key question: "…it is not so much about the amount of funding. We don't need $5 million up front. It's more about the sustainability. If we had $100,000 for ten years that would almost be better than $5 million for two years. I'm actually not that concerned about finding funding for the next three years. I think someone who formulates a good proposal will be able to find someone to give them money. The question is how to keep it alive over ten years. [The program manager who gave you the money initially] is not going to be there in four years. We're talking about sustainability over ten years, where people change, supporters change, programs move away. I agree if you have a single champion to go to the mat and spend a lot of time looking for funding at every new funding cycle, then, yeah, you're probably going to be able to keep it alive. But individual research priorities change. This is one of the problems that we were talking about yesterday: I get a tool that a graduate student built, and that student graduates, so now that tool is not supported and there's no one on the back end to ask about it, and it dies. How do we prevent this from happening with the infrastructure?"

This refocused participants' thoughts, and it was noted that the first question to ask in this regard was: how "heavy" is the infrastructure? Is it lightweight, able to be managed by a few people? Or is it heavyweight, requiring a lot of centralized institutional support? Participants clearly preferred the second option. I noted that this is exactly where we need to begin for sustaining over the long term, and that we also need to start thinking about how many people, how many servers, and what institutional support will need to be sustained for ten years. At this point, the discussion turned to a number of funding agencies and how their strategies matched up with long-term sustainment funding.

NSF. It was agreed that the U.S. National Science Foundation would be able to supply initial implementation funds over at least a three-year period. But the worry, then, was that the money would run out. How feasible is it to move the support of the infrastructure around within NSF, or within the CRI program itself? At this point the idea of "multiple pillars" began to take shape: supporting the infrastructure with a number of different, complementary sources. This would additionally require a number of different partnerships. The idea was also floated of seeking a second planning grant before going for the implementation phase of the CRI.

NEH. The U.S. National Endowment for the Humanities and its Office of Digital Humanities (ODH) were mentioned as possible supporters. It was noted that even though the humanities are not quite the core of the target domain, they still use significant language resources. It was concluded that, because the NEH and ODH have so few resources, they would make a good partner but perhaps not the ideal source of primary funds.

NIH. One participant thought that it was very difficult to get infrastructure funding from the U.S. National Institutes of Health.
In the NIH model, you must apply to a specific institute, but the institutes are disease-oriented, and you have to show how your project is needed specifically to solve a particular disease problem. Another participant, however, noted that the NIH has funded infrastructure projects in the past, bringing up the examples of various bioinformatics databases and the Databrary project32. It was admitted that the NIH does sometimes fund projects of this sort, especially through the National Library of Medicine (NLM), which is the only institute that is somewhat broad and funds informatics-related projects. On the downside, the NLM has only a very small amount of money, and in the past year it funded only 13 grants. I noted that perhaps a cross-institute call like the Early Stage Development of Science and Technology R0133 might be appropriate here, but again, the disease-specific nature of the funding was considered to be a major barrier.

DARPA. Stephanie Strassel, who has deep experience with DARPA funding, noted emphatically that DARPA is not going to directly fund an infrastructure development project. The only way to obtain DARPA money in support of an infrastructure effort is to make a case that an enhancement of the infrastructure is critical to the success of an existing DARPA program, probably with regard to data collection. The case would need to propose, for example, that developing a module for an already well-established infrastructure would allow us to perform better some task that DARPA is already funding: the new module would allow us to do a particular kind of human or automatic annotation more efficiently, accurately, cost-effectively, or in more languages, or make it more useful to do research in general.

32 https://databrary.org/
33 http://grants.nih.gov/grants/guide/pa-files/PA-14-155.html


DARPA has occasionally been willing to fund these kinds of infrastructure enhancements in the past, but they must be tightly linked to the program's technology goals. Any benefits for the community are side effects. As an example of how you might leverage current DARPA interests, Eric Nyberg mentioned the LORELEI program34. LORELEI stands for Low Resource Languages for Emergent Incidents, and the Program Manager, who is a friend to computational linguistics, might be open to giving some partial seed funding, especially if matched with NSF.

IARPA. The U.S. Intelligence Advanced Research Projects Agency was almost immediately eliminated by several participants as a potential funder. The reasons for this elimination are unclear.

Cross-Atlantic Coordination. It was suggested that perhaps, given the strong American and European presence at the workshop, there was a way of stimulating coordination between U.S. and European funding agencies for a joint program, with the interest of each agency being reinforced by the interest of the other. It was noted by Nancy Ide that some had tried in the late 90s to get the NSF and EU to come together for a joint trans-Atlantic program around language technologies, but the effort stalled. An easier path, it was suggested, was to pursue project-level funding simultaneously: the RDA, for example, has managed to get funding from both the European Commission and the United States.

Overall, the discussion of potential U.S. funding agencies resulted in participants noting that there is a general model that has been effective in sustaining similar efforts in the past: organizations like the NSF fund the core infrastructure, backend, and long-term maintenance, while specific technology-driven programs (such as from DARPA or the NIH) fund extensions to address particular needs. The discussion then moved to non-Federal funding opportunities.

Existing Non-Profits. It was agreed that such an infrastructure would clearly need to partner with existing institutions like the Linguistic Data Consortium (LDC), SIL, and other non-profit organizations already in the business of providing infrastructure and tools. However, it was determined that these organizations would not be in a position to provide any significant sources of funding, except perhaps exchanges in kind.

Corporate Collaboration. Taking inspiration from the UIMA model, I suggested that perhaps there was an opportunity for corporate funding from a place like IBM, Google, or Microsoft. Google provided significant funding for a long time for Mozilla35, and IBM developed and continues to support the Eclipse project36. For IBM, Eric Nyberg noted that the argument for funding would be similar to that for DARPA: the infrastructure would have to be critical for something they were already doing, and they probably would not fund the initial development. Several participants thought that Google would probably not be interested in funding any community development, and would instead prefer to build such a thing itself and own it for internal use. Mention was made of Google.org, the philanthropic arm, and its prior efforts to make endangered languages more accessible; however, it was noted that they were difficult to work with and generally didn't play well with others.
Non-Government Sources. Non-government funders would be one potential source, participants thought, but there were few specific proposals as to whom to approach. The Mellon Foundation was mentioned, but it was noted that if you were not part of the prestige research university club then they were not generally interested in interacting with you. That meant that partnership with a prestige research university (e.g., CMU for the Mellon Foundation) was critical to securing this sort of funding.

An idea raised during the closing dinner is worth mentioning here. It was suggested that, in order to keep inventories of tools up-to-date, an alternative to long-term funding is to make sure tool providers see value in keeping their artifacts well represented. This is the idea behind ISLRN and, even closer to this case, ResearchGate. One way to encourage this is to push the community to value a resource description in a catalog as another kind of publication. The idea here is that academics are forced to invest in what provides academic reputation, and sustainable software engineering does not (yet) fall into this category, so increasing the reputational value of these types of products might make sustainability an easier task.

The discussion then moved to considering different potential models for organizing the infrastructure, focusing on previously successful examples.

34 http://www.darpa.mil/program/low-resource-languages-for-emergent-incidents
35 https://www.mozilla.org/
36 http://www.eclipse.org/


Galaxy. A project that had come up several times already was the Galaxy project, a workflow engine in bioinformatics. After initial funding to establish the infrastructure, it has since grown to be indispensable for that community, and the infrastructure has become somewhat self-sustaining: several institutions have decided that it is critical to their work, and they are each pitching in a small amount of money. Steve Cassidy noted that even the Australian funding body is pushing money into Galaxy.

CMU QA Consortium. Eric Nyberg mentioned his CMU QA consortium as a potential model, albeit one that is closely tied to a particular researcher and institution. In this model, all the services provided by the consortium are free for everybody, but sponsors provide funds in exchange for the CMU team doing some work to build out part of the infrastructure to their requirements, or teaching them how to use it. These funds are usually sufficient to cover a graduate student, and sponsors have included institutions like Rosch, IBM, Boeing, and the Government of Singapore. He noted that this was basically the RedHat model37. The model, however, is tricky: you must find the right sponsors, who like open source software and have a real need you can fill. If you can do this, you can sometimes even fund permanent people like staff programmers. But the funding is term-by-term: the sponsors could all decide not to renew, and then four or five students suddenly lack funding. It's risky, but it works for this particular project.

Linguistic Data Consortium. The LDC was offered by several participants, including Stephanie Strassel, as a potential model. The LDC has a hybrid sustainment model: there is a very small core group that has been sustained over twenty-five years, accompanied by a scalable staff that grows or shrinks in response to project funding. At times the LDC has been able to sustain two or three developers; at other times, there are none. This goes back to the model mentioned above, where there is low-level, sustained, long-term funding for the core infrastructure, and then opportunistic funding, based on particular project or corporate support, for extensions to the infrastructure: tool modules, new import-export capabilities, or other kinds of extensions. It was also noted that the LDC would make a perfect partner, especially as it has already demonstrated its longevity (which is a good predictor of future success).

Subscription. I floated the idea of a subscription model, like that of many open source projects, which solicit small, one-off payments from users. For example, the project could ask every user for a voluntary, non-obligatory contribution of, say, $30. In general, however, the participants were opposed to this, noting that the current trend is toward completely open and free tools.

LINGUIST List. Finally, Jeff Good mentioned the LINGUIST List38 as a potential model. Although primarily an information distribution organization, it has a foundation, a board of directors, and a fund drive. Its funding comes partly from universities, partly from the corporate fund drive, and partly from individual donations, but it shows how, with application and general appeal, a small funding base can be sustained over the long term.

The discussion of management and sustainability concluded with three final ideas. First was an interesting idea from Jeff Good, who noted that Language Science Press39 made the problem of sustainability a research question that was deferred to the next stage.
To do this, they brought an economist on staff to look at the long-term sustainability of open access publishing. This was thought to be a potentially viable option, because it would be easy to get a substantial pot of money for the first three years.

Second, one participant wondered whether the word "platform"—rather than infrastructure—was perhaps a better way to describe the solution we are conceiving: a set of interrelated and mutually supportive pieces that come together to allow others to surmount the problems of balkanization. This idea was well received but not discussed in depth.

Third, we discussed the question of whether the infrastructure should have its own "umbrella" organization that would host it, among other things, or whether it should become an organizational sub-part of an existing organization. The general feeling of the participants was that creating yet another umbrella organization was not needed or advised, and would just complicate matters. Further, if the infrastructure were explicitly tied to an existing country- or region-specific project (e.g., CLARIN), it might take on the flavor of that region, discouraging global use. While participants were agnostic about subsumption under existing global umbrella organizations (e.g., RDA), the general feeling was that we shouldn't create bureaucracy or administration where it is not absolutely needed, and that the question of the umbrella could be safely deferred until something had been built.

37 https://www.redhat.com/
38 http://linguistlist.org/
39 http://langsci-press.org/


5.2.9. Tuesday, Sessions 3 & 4: Final Recommendations

This segment of the workshop was devoted to crystallizing the whole discussion of the previous two days into a set of clear, actionable recommendations. The idea of these statements is that they could be seen as the explicit recommendations of an international group of experts on what to accomplish next. Before brainstorming on specific statements, I put on the screen a graphic provided by Nancy Ide and James Pustejovsky (reproduced in Figure 2), from their recent LAPPS-Grid-related NSF CRI proposal. I noted that each of the uncolored boxes in the figure represents a major capability we have been discussing, and that the figure further shows how they all might plausibly work together.

Figure 2: Component diagram of the Linguistic Annotation Service Architecture, proposed by Ide & Pustejovsky to NSF in 2014.

After consideration of this diagram, the discussion moved to brainstorming recommendations, of which the group produced 16. For each of the recommendations below, the bold-faced text was typed on the screen during the workshop and the wording debated and agreed to (or at least, not objected to) by a majority of the participants. The list is organized by general topic, and not by priority: no priority was explicitly attached to the recommendations, but a priority can be roughly inferred by matching recommendations against problem priorities. I created the topics as an organizing framework while writing this report.

Planning

There were six recommendations revolving around planning and information gathering activities.

1. Conduct a survey of practitioners as to what they are actually using: how many are using tool X, how many people are not using any tool at all? Recipients of the surveys could be identified by extracting emails from research papers involving linguistic annotation.

2. Identify gaps in the current landscape of tools: what tools are missing, what features are missing from existing tools? This information could perhaps be collected via questionnaires to both practitioners and tool-builders, as in the previous recommendation.



3. Develop clear use cases that illustrate our target stakeholders, demonstrate what will be gained by working on these solutions, and consider combinations of domains, media types, and schemes. The use cases should ideally also involve broadly scoped, shared task evaluations.

4. Consider which emerging best practices for data rights management are most applicable. Data rights management is critically important when sharing annotated data, of which there may be quite a bit if solutions involve cataloging annotation schemes and tools and their links to the actual data sets. There are also a number of emerging data rights management practices (e.g., Creative Commons40, among many others). Which of these emerging best practices are most applicable to linguistic annotation?

5. Develop a roadmap. Develop an explicit plan (beyond this technical report) for moving forward to solve the balkanization problem: what the tasks are, what their interrelationships are, and by when they should be accomplished. Developing a roadmap is an involved process, and may need the attention of a working group over some period of time.

6. Hold a standard workshop series. There are researchers who are working on solving problems in this area, as well as those who have insights and other recommendations. This suggests that perhaps an academic workshop with submitted or invited papers would be appropriate. The hosting venue would be critical, because researchers are unlikely to travel just to attend a workshop. The Language Resources and Evaluation Conference (LREC) would be highly appropriate in this regard. One possible existing venue at which this topic could be discussed would be the OIAF4HLT workshop41.

Describing & Cataloging

There were four recommendations revolving around describing and cataloging annotation resources. While "describing" is similar to the information gathering recommendations proposed in the previous category, the set of recommendations below was so strongly distinctive, moving well beyond collecting information via surveys, that it seemed to warrant its own category.

7. Assemble a catalog of the annotation tools, with metadata and features, coupled with a tool wiki; send out questionnaires for new resources/tools to add them to the catalog. The key feature of this recommendation is a catalog listing annotation tools. It should have extensive, detailed metadata about tools, in a format tailored to annotation tools. It should also involve a wiki, or some other user-editable review system, that allows free-text feedback and evaluation of tools. Another feature that was thought to be important is the ability to compare tools side-by-side with respect to a selection of features or capabilities: thus, some sort of table visualization which allows you to see tools on the left side and features/capabilities along the top. One problem that we discussed was that of making sure that the catalog is complete and stays up-to-date. To this end, it was suggested that a system for sending out questionnaires for new resources/tools found in the literature could help. A number of existing resources could be mined to provide an initial catalog population, including the LREC language resource roadmap42 and ELRA data43.

8. Provide a catalog of the annotation schemes, with metadata, coupled with a wiki; send out questionnaires for new schemes to add them to the catalog.
Similar to the catalog of annotation tools, the participants recommended that a catalog of annotation schemes be assembled, with similar functionality: free-text wiki or review capabilities, comparison capabilities, and links to the datasets and tools that support each scheme.

9. Define an inter-lingua for terms like "annotation", "annotator", "artifact", "tool", "application", "workflow", "pipeline", etc. Give examples. This recommendation covered several levels of application. In its most general application, participants recommended coming up with common definitions of common linguistic annotation terms, such as those listed in boldface above. Having a "dictionary" of these terms, of sorts, would enable researchers across disciplines who use linguistic annotation to speak clearly to each other about their needs. At a more specific level, there was some discussion of the need to be able to describe experimental setups in a way that was understandable (to encourage validation and reproducibility).

40 https://creativecommons.org/
41 http://glicom.upf.edu/OIAF4HLT/
42 http://www.resourcebook.eu/
43 http://catalog.elra.info/


In its more precise interpretation, this recommendation asks for a language that allows one to port workflows between tools and annotation projects: a way of writing down precisely the workflow of a whole annotation project, including the tool configuration, user assignments, text assignments, and so forth, so that the exact same workflow can be applied again in a reproducible way.

10. Develop a formal language for writing down the operational or mappable semantics of annotation schemes (contrast with the UIMA type system: inclusion of semantics and the ability to automatically reason; this should include differences between versions). This recommendation embodies the desire for a way of describing, in formal, precise, and computer-checkable ways, the semantics of annotation schemes from both an operational and a mapping standpoint. Operational here means "what the scheme means in practice when applied to text"; mapping means "how the objects described by different schemes equate to one another". There was a clear acknowledgement that this would require not only some encoding of the annotation guide, description, or protocol used by human annotators, but also a logical-level description of a scheme (e.g., this annotation can be any of five types, and they must have these X attributes to be well-formed). A start to such a language would allow one to express how a scheme maps into a common annotation type system like UIMA or GATE, or how one scheme maps into another. Several participants believed that such a language should probably not be a once-and-for-all ontology of linguistic types, such as ISOcat or the Ontologies of Linguistic Annotation (OLiA)44, and that building and maintaining such an ontology in the long term was probably infeasible. The participants then proposed alternative implementation models, including using concrete examples of annotations on language coupled with mappings between specific type systems, with either formal or informal descriptions of the information loss that accompanies the transformation. Importantly, having a formally implemented type mapping system allows the system to be checkable: if you have a mapping F that maps type system A to type system B, then you can get an expert in B to check the mapping F(A). (A minimal sketch of such a checkable mapping appears after this list of recommendations.) Another comment was that such a system would naturally allow you to manage versioning of annotation schemes, which is a known problem: if you have type system N, but new information forces you to develop type system N+1, such a mapping facility would allow you to document the relationship. This was described as formalizing the annotation design MAMA or "babbling" process.

Implementation

Five recommendations focused on various aspects of implementing solutions.

11. Hold a narrowly focused technical working group, ideally starting with a week-long meeting, driven by a roadmap, to solve the problem of data interchange between annotation pipelines like WebLicht, the LAPPS Grid, Galaxy, UIMA, V3NLP, and GATE. Currently there are numerous linguistic processing pipeline solutions, and they all work with different or overlapping (but not identical) sets of tools. Switching to another pipeline is not a simple task, and when you choose one pipeline solution, you do not have access to tools that use the other pipelines. Participants highly recommended trying to find a way to bridge between different pipelines so all the underlying tools could be used everywhere. They emphasized that this was not necessarily a file format, but more generally some method of moving data between the pipelines.
Participants focused on having this effort run in parallel with other recommendations and cast it as a straightforward way to get something done quickly and immediately. If such cross-pipeline bridges are eventually subsumed under some sort of formal type-mapping system, this would be a great bonus, but the development of the type-mapping system should not be required before constructing pipeline interconnects.

12. Develop best practice guidelines for overall annotation project management, calculation of IAA, creation of gold standard data, etc. Several participants noted during the workshop that there are numerous pockets of expertise on best practices in annotation, but they are not accessible to many researchers. Topics that could be treated by best practice guides include: annotation scheme development; annotation project management; training of human annotators; calculation of inter-annotator agreement; methods for automatic and manual annotation; creation of gold standard data; evaluation metrics for both automatic and manual annotation; and, more generally, resources and methods.

13. Ensure that tools and formats are "stable" with respect to uninterpretable information. Moving data between multiple tools and pipelines raises the question of what those systems should do when they encounter information that they cannot use or interpret. This recommendation speaks to establishing a standard expectation or best practice that systems do not modify information that they cannot interpret.

44 http://nachhalt.sfb632.uni-potsdam.de/owl/


By analogy with a stable sort algorithm, if the system does not have reason to touch a part of the data, it should not, and should maintain the integrity of that part.

14. Explore the possibility of defining an abstract API for extracting, referencing, and depositing annotations from/in data and annotation stores and archives that covers existing APIs like Alveo or SRU (it should have a web-based interface). Unifying ways of accessing data, regardless of its physical location, was seen as a particularly valuable potential implementation. An example use case: Researcher A has a license to a corpus from the LDC. An access API would allow him to download the corpus programmatically, rather than by hand. Further, if he creates a set of new annotations over that corpus, the API would allow him to store those new annotations as linked to the original corpus data, but not actually on an LDC server with the original files. These cross-indexed annotations would then be available anywhere in the world through the API, to those with the proper access rights. (A sketch of what such an interface might look like appears after this list of recommendations.)

15. Ensure that any solution accommodates humans-in-the-loop (including crowdsourcing) in a flexible, integrated way. This recommendation speaks to establishing best practices that properly incorporate humans-in-the-loop. Integrating humans and machines together in annotation workflows is currently awkward: current tools do not manage this transition well.

Other

16. Encourage funding agencies to find ways to generate trans-oceanic partnerships between themselves to fund coordinated activities. The goal here would be to leverage the international nature of such an infrastructure, and find a way for funding agencies across continents to support each other. While some participants were skeptical this could ever even get started, Nancy Ide noted that this actually did happen in 1998, when U.S. and EU program managers got together and created a framework to fund such trans-Atlantic projects; unfortunately, the framework hit some snags and never got off the ground.
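Recommendation 10 asks for machine-checkable descriptions of how one annotation type system maps onto another. The following is a minimal sketch of what such a checkable mapping could look like, assuming two small, invented type systems; the type inventories, attribute names, and information-loss note are illustrative and not drawn from any published scheme.

```python
# Two hypothetical annotation type systems, each mapping a type name to its
# required attributes; these inventories are illustrative assumptions.
TYPE_SYSTEM_A = {"Token": {"pos"}, "NamedEntity": {"category"}}
TYPE_SYSTEM_B = {"Word": {"partOfSpeech"}, "Entity": {"kind"}}

# The mapping F: A -> B, with an explicit (informal) note on information loss.
MAPPING_F = {
    ("Token", "pos"): ("Word", "partOfSpeech"),
    ("NamedEntity", "category"): ("Entity", "kind"),
}
INFORMATION_LOSS = "Fine-grained POS distinctions in A may be coarsened in B."


def check_mapping(source, target, mapping):
    """Verify that every mapped type/attribute pair exists in both systems,
    so that an expert in the target system can review F(A) for correctness."""
    problems = []
    for (s_type, s_attr), (t_type, t_attr) in mapping.items():
        if s_attr not in source.get(s_type, set()):
            problems.append(f"unknown source attribute {s_type}.{s_attr}")
        if t_attr not in target.get(t_type, set()):
            problems.append(f"unknown target attribute {t_type}.{t_attr}")
    return problems


if __name__ == "__main__":
    issues = check_mapping(TYPE_SYSTEM_A, TYPE_SYSTEM_B, MAPPING_F)
    print("mapping is well-formed" if not issues else issues)
```

Versioning (type system N to N+1) could reuse the same machinery, with the mapping itself documenting exactly what changed between versions.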
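Recommendation 14's abstract API might, as a first approximation, reduce to a handful of operations over corpora and stand-off annotation layers. The sketch below shows one possible shape for such an interface, together with a toy in-memory implementation used only to exercise it; the method names and semantics are assumptions and do not reflect the actual Alveo, SRU, or LDC APIs.

```python
from abc import ABC, abstractmethod


class AnnotationStore(ABC):
    """One possible abstract API over data and annotation stores; the method
    names and semantics here are hypothetical."""

    @abstractmethod
    def fetch_corpus(self, corpus_id: str) -> dict:
        """Download a corpus the caller is licensed for, as structured data."""

    @abstractmethod
    def deposit_annotations(self, corpus_id: str, layer: str, annotations: list) -> str:
        """Store stand-off annotations linked to (but held separately from)
        the original corpus files; return an identifier for the new layer."""

    @abstractmethod
    def list_annotation_layers(self, corpus_id: str) -> list:
        """List annotation layers cross-indexed to this corpus, subject to
        the caller's access rights."""


class InMemoryStore(AnnotationStore):
    """Toy implementation, used only to show how the interface would be used."""

    def __init__(self):
        self._corpora, self._layers = {}, {}

    def add_corpus(self, corpus_id, documents):
        self._corpora[corpus_id] = documents

    def fetch_corpus(self, corpus_id):
        return self._corpora[corpus_id]

    def deposit_annotations(self, corpus_id, layer, annotations):
        key = f"{corpus_id}/{layer}"
        self._layers.setdefault(key, []).extend(annotations)
        return key

    def list_annotation_layers(self, corpus_id):
        return [k for k in self._layers if k.startswith(corpus_id + "/")]


if __name__ == "__main__":
    store = InMemoryStore()
    store.add_corpus("corpus-1", {"doc-1": "Mary saw the dog."})
    store.deposit_annotations("corpus-1", "ner", [{"start": 0, "end": 4, "label": "PERSON"}])
    print(store.list_annotation_layers("corpus-1"))
```

The access-rights and licensing checks central to the LDC use case above would sit behind each of these calls in any real implementation.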

6. Conclusions

The problem of the balkanization of annotation tooling is a major barrier to human language technologies and natural language processing today. The inability of researchers to quickly and easily identify, use, and combine the tools they need to accomplish linguistic annotation results in blocked research progress, lost opportunities for new discoveries, delayed or diminished results, and significant waste of resources. The workshop produced quite a bit of excellent discussion centered on these problems, and this technical report has laid out the step-by-step progress of that discussion. The workshop participants identified ten major problem classes, of which five were deemed high-priority. Across the five high-priority problems, participants identified twenty-one desired capabilities that would help mitigate or eliminate the corresponding problem. Finally, participants proposed sixteen detailed recommendations (shown in §5.2.9) for next steps. It is clear that there is much work to be done. The workshop generated a large number of very good ideas, and it will require many years and international cooperation to put them into effect. The first step (already underway) is to seek funding to start an effort and begin to implement these recommendations. Hopefully, in years to come, we will see the vision of this workshop come to fruition and make real progress in solving—or possibly even eliminating—the balkanization of annotation tooling.

7. Acknowledgements

The work described in this report was funded under NSF Community Infrastructure Planning (CI-P) Grant #1405863 to MIT and #1536043 to FIU. Thanks to Sharon Mozgai, my research assistant, who helped assemble the bibliography of annotation tools for the first task of the grant. Thanks to the many unnamed researchers who provided excellent feedback during the workshop planning stage. Finally, thanks to my students Victor Yarlott and Joshua Eisenberg for proofreading and feedback on earlier drafts of this report.



8. References

[1] S. Bird and M. Liberman, “A Formal Framework for Linguistic Annotation,” Speech Commun., vol. 33, no. 1, pp. 23–60, 2000.
[2] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, “The Penn Treebank: Annotating Predicate Argument Structure,” ARPA Hum. Lang. Technol. Work., 1994.
[3] M. Palmer, P. Kingsbury, and D. Gildea, “The Proposition Bank: An annotated corpus of semantic roles,” Comput. Linguist., 2005.
[4] J. Pustejovsky, J. Castano, R. Ingria, R. Sauri, R. Gaizauskas, A. Setzer, and G. Katz, “TimeML: Robust Specification of Event and Temporal Expressions in Text,” in Proceedings of the Fifth International Workshop on Computational Semantics (IWCS-5), 2003, vol. 5.
[5] R. Saurí, J. Littman, B. Knippen, R. Gaizauskas, A. Setzer, and J. Pustejovsky, “TimeML Annotation Guidelines, Version 1.2.1,” 2006.
[6] H. Cunningham, D. Maynard, and K. Bontcheva, Text Processing with GATE (Version 6). London, UK: University of Sheffield, 2011.
[7] G. Petasis and V. Karkaletsis, “Ellogon: A new text engineering platform,” in Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), 2002, pp. 72–78.
[8] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii, “brat: a Web-based Tool for NLP-Assisted Text Annotation,” in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012): Demonstrations, 2012, pp. 102–107.
[9] S. S. Pradhan, E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel, “OntoNotes: A Unified Relational Semantic Representation,” in Proceedings of the International Conference on Semantic Computing, 2007, pp. 517–526.
[10] W. N. Francis and H. Kucera, “Brown Corpus Manual,” Providence, RI, 1979.
[11] K. Bontcheva, H. Cunningham, I. Roberts, A. Roberts, V. Tablan, N. Aswani, and G. Gorrell, “GATE Teamware: a web-based, collaborative text annotation framework,” Lang. Resour. Eval., vol. 47, no. 4, pp. 1007–1029, 2013.
[12] K. Bontcheva, I. Roberts, L. Derczynski, and D. Rout, “The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy,” in Proceedings of Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014, pp. 97–100.
[13] M. A. Finlayson, “Collecting Semantics in the Wild: The Story Workbench,” in Proceedings of the AAAI Fall Symposium on Naturally Inspired Artificial Intelligence (published as Technical Report FS-08-06, Papers from the AAAI Fall Symposium), 2008, vol. 1, pp. 46–53.
[14] M. A. Finlayson, “The Story Workbench: An Extensible Semi-Automatic Text Annotation Tool,” in Proceedings of the 4th Workshop on Intelligent Narrative Technologies (INT4), 2011, pp. 21–24.
[15] N. Ide, J. Pustejovsky, C. Cieri, E. Nyberg, D. Wang, K. Suderman, M. Verhagen, and J. Wright, “The Language Application Grid,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014.
[16] N. Ide and K. Suderman, “GrAF: A Graph-based Format for Linguistic Annotations,” in Proceedings of the Linguistic Annotation Workshop, 2007, pp. 1–8.
[17] N. Ide and L. Romary, “International standard for a linguistic annotation framework,” Nat. Lang. Eng., vol. 10, no. 3–4, pp. 211–225, 2004.
[18] M. Dickinson, “Detection of Annotation Errors in Corpora,” Lang. Linguist. Compass, vol. 9, no. 3, pp. 119–138, 2015.
[19] K. Krippendorff, Content Analysis: An Introduction to its Methodology. 2004.



[20] E. Afgan, D. Baker, M. van den Beek, D. Blankenberg, D. Bouvier, M. Čech, J. Chilton, D. Clements, N. Coraor, C. Eberhard, B. Grüning, A. Guerler, J. Hillman-Jackson, G. Von Kuster, E. Rasche, N. Soranzo, N. Turaga, J. Taylor, A. Nekrutenko, and J. Goecks, “The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update,” Nucleic Acids Res., May 2016.
[21] A. L. Berez, “Review of EUDICO Linguistic Annotator (ELAN),” Lang. Doc. Conserv., vol. 1, no. 2, pp. 283–289, 2007.
[22] F. Schiel, “MAUS Goes Iterative,” in Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC’04), 2004, pp. 1015–1018.
[23] S. M. Yimam, I. Gurevych, R. E. de Castilho, and C. Biemann, “WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013): System Demonstrations, 2013, pp. 1–6.
[24] J. Pustejovsky and A. Stubbs, Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. Sebastopol, CA: O’Reilly, 2013.
[25] N. Ide, J. Pustejovsky, K. Suderman, and M. Verhagen, “The Language Application Grid Web Service Exchange Vocabulary,” in Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT), 2014.
[26] E. W. Hinrichs, M. Hinrichs, and T. Zastrow, “WebLicht: Web-Based LRT Services for German,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010): System Demonstrations, 2010, pp. 25–29.
[27] A. Zeldes, J. Ritz, A. Lüdeling, and C. Chiarcos, “ANNIS: a search tool for multi-layer annotated corpora,” in Proceedings of Corpus Linguistics 2009, 2009.
[28] C. Müller and M. Strube, “Multi-level annotation of linguistic data with MMAX2,” in Corpus technology and language pedagogy: New resources, new tools, new methods 3, 2006, pp. 197–214.
[29] C. Cieri, D. DiPersio, M. Liberman, A. Mazzucchi, S. Strassel, and J. Wright, “New Directions for Language Resource Development and Distribution,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014.
[30] C. Meißner and A. Slavcheva, “EXMARaLDA review,” Lang. Doc. Conserv., vol. 7, pp. 31–40, 2013.
[31] M. Sabou, K. Bontcheva, L. Derczynski, and A. Scharl, “Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014, pp. 859–866.



Appendices

A. Workshop Participants & Demographics

The workshop was attended by 23 researchers from around the globe.

1. Dr. Claire Bonial, UC Boulder, U.S., https://www.researchgate.net/profile/Claire_Bonial
2. Prof. Steve Cassidy, Macquarie U., Australia, http://comp.mq.edu.au/~cassidy
3. Prof. Wendy Chapman, U. Utah, U.S., http://medicine.utah.edu/faculty/mddetail.php?facultyID=u0073209
4. Prof. Markus Dickinson, Indiana U., U.S., http://cl.indiana.edu/~md7
5. Prof. Mark Finlayson*, Florida Int. U., U.S., http://cs.fiu.edu/~markaf
6. Prof. Jeff Good, U. Buffalo, U.S., http://www.acsu.buffalo.edu/~jcgood
7. Dr. Thomas Hanke, U. Hamburg, Germany, http://www.sign-lang.uni-hamburg.de/dgs-korpus/index.php/project-staff.html
8. Prof. Erhard Hinrichs, U. Tübingen, Germany, http://www.sfs.uni-tuebingen.de/~eh
9. Marie Hinrichs, U. Tübingen, Germany, http://www.sfs.uni-tuebingen.de/ascl/mitarbeiter/mitarbeiter-detail/hinrichs-marie.html
10. Prof. Nancy Ide, Vassar College, U.S., http://www.cs.vassar.edu/~ide
11. Prof. Michael Kipp, Augs. U. Appl. Sci., Germany, http://www.michaelkipp.de
12. Prof. Brian MacWhinney, CMU, U.S., http://psyling.psy.cmu.edu
13. Dr. Diana Maynard, Sheffield, U.K., http://www.dcs.shef.ac.uk/~diana
14. Prof. Eric Nyberg, CMU, U.S., http://www.cs.cmu.edu/~ehn
15. Dr. Georgios Petasis, NCSR Demokritos, Greece, http://www.ellogon.org/petasis
16. Prof. James Pustejovsky, Brandeis, U.S., http://jamespusto.com
17. Prof. Anna Rumshisky, U. Mass. Lowell, U.S., http://www.cs.uml.edu/~arum
18. Dr. Gary Simons, SIL International, U.S., http://www.sil.org/~simonsg
19. Han Sloetjes, MPI Nijmegen, The Netherlands, http://www.mpi.nl/people/sloetjes-han
20. Dr. Brett South, U. Utah, U.S., http://medicine.utah.edu/bmi/people/staff-technical/index.php
21. Dr. Pontus Stenetorp, U. Tokyo, Japan, http://pontus.stenetorp.se
22. Stephanie Strassel, LDC, U.S., https://www.ldc.upenn.edu/staff/stephanie-strassel
23. Dr. Marc Verhagen, Brandeis, U.S., http://www.cs.brandeis.edu/~marc
Table 2: List of workshop participants. * = workshop organizer

Approximately 3/5 of the workshop attendees worked at institutions in the United States.

Region: Number (Fraction)
U.S.: 14 (61%)
Non-U.S.: 9 (39%)
  Europe: 7 (30%)
  Asia & Pacific: 2 (9%)
Total: 23
Table 3: Breakdown of participants by region of institutional affiliation

Approximately 90% of the workshop attendees held the Ph.D. degree. Approximately 50% were professors.

Position: Number (Fraction)
Professor: 12 (52%)
Doctoral-Level Researcher: 8 (35%)
Non-doctoral-level Researcher: 3 (13%)
Total: 23
Table 4: Breakdown of participants by position.



B. Participant Biographical Sketches

Mark Finlayson (Organizer)
Assistant Professor, School of Computing and Information Sciences
Florida International University, United States
[email protected]
http://cs.fiu.edu/~markaf

Professor Mark Finlayson received his Ph.D. in 2012 from MIT in Artificial Intelligence and Cognitive Science. His research focuses on the science of narrative, including understanding the relationship between narrative, cognition, and culture; developing new methods and techniques for investigating questions related to language and narrative; and endowing machines with the ability to understand and use narratives for a variety of applications. He has worked on linguistic annotation in service of this research, building the Story Workbench, an annotation tool designed to allow the simultaneous manual, automatic, or semi-automatic annotation of over 20 different layers of syntax and semantics onto text by non-technical annotators. Using the Story Workbench he has collected and annotated a number of corpora of narratives, including the UMIREC corpus (UCM/MIT Indications, Referring Expressions, and Co-Reference), the N2 Corpus, and a deeply annotated corpus of Russian folktales. With his students he has built a variety of widely used NLP tools in Java, including JWI, jMWE, jVerbnet, and jSemcor.

Claire Bonial
Research Associate, Department of Linguistics
University of Colorado at Boulder, United States
[email protected]
https://www.researchgate.net/profile/Claire_Bonial

Dr. Claire Bonial completed her Ph.D. in Linguistics and Cognitive Science at the University of Colorado, Boulder in 2014. Since 2007 Dr. Bonial has worked with Dr. Martha Palmer as a Research Associate in CU's Center for Language Education and Research (CLEAR) Lab. During this time Dr. Bonial has focused on the development, maintenance, and expansion of PropBank, VerbNet, SemLink, and the Abstract Meaning Representation (AMR) project. Dr. Bonial also assisted in the development of the Jubilee annotation tool, used for PropBank annotation, as well as Cornerstone, the tool used for creating and editing the PropBank lexicon of frame files (i.e., the sense inventory).

Steve Cassidy
Associate Professor, Department of Computing
Macquarie University, Australia
[email protected]
http://web.science.mq.edu.au/~cassidy

Professor Steve Cassidy is a computer scientist (with a Ph.D. in Cognitive Science) who has worked on various areas relating to speech and language technology over the last 30 years. With Jonathan Harrington he developed the Emu Speech Database System to support corpus-based research in speech and acoustic phonetics. Emu supports a flexible hierarchical annotation system and provides a query language and analysis environment based on the R statistical environment. Emu is widely used to support research on small and large scale speech corpora and includes tools to support every stage of the corpus collection and analysis lifecycle; it is now maintained by a team of developers in Munich. Professor Cassidy was also recently involved in the development and collection of an audio-visual corpus of Australian English from around 1000 speakers around Australia; he built the software for data capture and a server-based system for data upload and publishing. His most recent work has been on the Alveo Virtual Laboratory, which is both a repository for language resources and a platform to support tools for exploration and analysis of language data. Alveo currently holds around 20 collections including audio, video, and text resources, and is working on new acquisitions of data and tools.



Wendy Chapman
Chair & Professor, Department of Biomedical Informatics
University of Utah, United States
[email protected]
http://medicine.utah.edu/faculty/mddetail.php?facultyID=u0073209

Professor Wendy Chapman has a B.S. in Linguistics and earned her Ph.D. in Medical Informatics from the University of Utah in 2000. From 2000-2010 she was an NLM postdoctoral fellow and then faculty at the University of Pittsburgh, after which she joined the Division of Biomedical Informatics at UC San Diego in 2010. In 2013, she became chair of the Department of Biomedical Informatics at the University of Utah. Professor Chapman's research focuses on developing and disseminating resources for modeling and understanding information described in narrative clinical reports. She is interested not only in better algorithms for extracting information from clinical text with NLP, but also in generating resources for improving the NLP development process (such as shareable annotations and open source toolkits) and in developing user applications to help non-NLP experts apply NLP in informatics-based tasks like clinical research and decision support. She has led development of several openly available clinical corpora and annotated corpora, including the Pitt NLP Corpus (unannotated clinical reports from 13 hospitals available for NLP research) and the ShARe Corpus (clinical reports annotated with a multi-layer syntactic and semantic schema used in the CLEF/ShARe Shared Task 2013-2014 and SemEval Challenge 2015). She has helped develop a variety of annotation schemas that have culminated in the Schema Ontology and Modifier Ontology for annotating clinical text. She has also been involved in the development of several annotation tools, including e-HOST for entities, attributes, and relations in clinical text and Chart Review for annotating complete patient records. She helped design a system for performing distributed annotation over private corpora (i.e., clinical reports) called Annotation Admin. Also, she has helped develop tools for visualizing NLP output, comparing automated annotations with manual annotations, drilling into errors using a visual interface, and providing user feedback to the NLP system based on identified errors.

Markus Dickinson
Associate Professor, Department of Linguistics
Indiana University, United States
[email protected]
http://cl.indiana.edu/~md7

Professor Markus Dickinson has worked extensively in two areas: 1) developing techniques to automatically detect and correct errors in different kinds of linguistic annotation; and 2) linguistically annotating corpora containing second language learner data. The former work, DECCA (Detection of Errors and Correction in Corpus Annotation), was an NSF-funded project and has led to a recent orthogonal project, DAPS (Detect Anomalous Parse Structures), with the goal of being able to build very large annotated corpora. The latter work is best exemplified by the SALLE (Syntactically Annotating Learner Language of English) project, an ongoing effort adding multiple layers of linguistic annotation.

Jeff Good
Associate Professor, Department of Linguistics
State University of New York at Buffalo, United States
[email protected]
http://buffalo.edu/~jcgood

Professor Jeff Good's current research interests include the linguistic typology of linear relations, the comparative morphosyntax of Niger-Congo, the documentation and description of Bantoid languages of the Lower Fungom region of Northwest Cameroon, and the role of emerging digital methods in the documentation of endangered and other low-resource languages. In this last area, he has been especially interested in the development of standards for encoding and annotating lexical and grammatical data and in tools to facilitate linguistic fieldwork. He has served as co-PI on the Lexicon Enhancement via the GOLD Ontology project, where he directed the conversion of thousands of wordlists stored in a legacy format into an XML format designed for interoperability, and as PI of the pilot Pangloss project, which explored the possibility of building an annotation tool within a word processing system.



He has additionally served as PI on a number of projects involving the documentation of endangered languages of Cameroon, and his current work in this area, also funded by NSF, has a collaborative component with a specialist in databases to build tools to support the management of data collected in the field. In 2014, he organized the ComputEL workshop to explore how computational linguists and endangered language linguists could more effectively collaborate (http://buffalo.edu/~jcgood/ComputEL.html). In his work in linguistic typology, he has explored how graph-based descriptions of linguistic constructions can be rigorously compared with each other and has developed prototype tools to support this. Within the linguistics community, he often serves as an informal liaison between the language documentation community and those developing digital standards for language resources.

Thomas Hanke
Researcher, Institute for German Sign Language and Communication of the Deaf
University of Hamburg, Germany
[email protected]
http://dgs-korpus.de

Dr. Thomas Hanke works on language resources for sign languages and corresponding tools. He developed syncWRITER (1990), a first approach to annotating digital video. He also worked on iLex, a full-fledged team annotation environment for sign languages integrating image processing. He is currently managing a long-term research project working towards a reasonably-sized corpus of German Sign Language, funded by the German Academies of Sciences program.

Erhard Hinrichs
Professor, General and Computational Linguistics
Tübingen University, Germany
[email protected]
http://www.sfs.uni-tuebingen.de/~eh

Professor Erhard Hinrichs is director of the Computational Linguistics research group at the University of Tübingen, Germany. He obtained a Ph.D. in Linguistics from The Ohio State University. His previous positions include Research Fellow at the Beckman Institute for Advanced Science and Technology; Assistant Professor at UIUC; and Research Scientist at BBN. His research interests include the computational modeling of language comprehension (particularly of syntax and semantics) and of language variation, with special emphasis on the use of machine learning approaches to dialectology. Dr. Hinrichs has extensive experience in project coordination and leadership, e.g., as scientific coordinator of the D-SPIN project, member of the executive board of the ESFRI project CLARIN, and co-director and member of the CLARIN-ERIC Board of Directors. He has done key work on the web-based linguistic annotation tool WebLicht and on the Tübingen Treebanks (TüBa) of Spoken (German, English, Japanese) and Written (German) Language, with the following annotation layers: tokenization, lemmatization, morphology, part-of-speech, syntax (constituency and dependency), word senses, anaphora, named entity classification, and discourse relations.

Marie Hinrichs Research Scientist Tübingen University, Germany [email protected] Marie Hinrichs has a B.S. in Computer Science from the Ohio State University. She is currently working on the CLARIN-D project in the Department of Computational Linguistics at the University of Tübingen, Germany. She has a leading role in the development of WebLicht, an environment for the construction and execution of NLP processing chains. She has also recently become involved in the technical aspects of maintenance and release of the TüBa-D/Z, a treebank derived from German newspaper articles. In addition, she is involved in developing a generic environment for executing workflows close to the data within the European infrastructure project EUDAT.

Nancy Ide Chair & Professor Department of Computer Science Vassar College, United States [email protected] http://www.cs.vassar.edu/~ide Professor Nancy Ide has worked in the field of computational linguistics for over 30 years and made significant contributions to research in word sense disambiguation, computational lexicography, discourse analysis, and the use of semantic web technologies for language data. She has been involved in the development of annotation standards throughout her career, first as the founder of the Text Encoding Initiative (TEI), the first major standard for representing electronic language data. She later developed the XML Corpus Encoding Standard (XCES) and, most recently, the ISO LAF/GrAF representation format for linguistically annotated data. She is the convener of ISO TC 37 SC4 WG1 on Basic Mechanisms for Language Resource Management, and has participated in the development of several ISO standards for language data. Professor Ide has managed the development of several major linguistically-annotated corpora, including the EU-funded MULTEXT and MULTEXT-EAST corpora, and, more recently, the Open American National Corpus (OANC) and the Manually Annotated Sub-Corpus (MASC). She has been a pioneer in efforts toward open data and resources, publishing the OANC and MASC as the first linguistically-annotated corpora freely available for any use. She is Co-Editor-in-Chief of the journal Language Resources and Evaluation and Editor of the Springer book series Text, Speech, and Language Technology. She has been the Principal Investigator (PI) or co-PI on multiple major US National Science Foundation and EU-funded projects; currently, she is co-PI of the LAPPS Grid project, in which context she is developing standards for web service exchange of linguistically-annotated data.

Michael Kipp Professor Department of Computer Science Augsburg University of Applied Sciences, Germany [email protected] http://michaelkipp.de/j15/index.php?lang=en Professor Michael Kipp is a full professor at Augsburg University of Applied Sciences, Germany. Before that, he was head of the Embodied Agents research group at the Cluster of Excellence "Multimodal Computing and Interaction" at Saarland University and a senior researcher at the German Research Center for Artificial Intelligence (DFKI). He has co-authored more than 70 peer-reviewed publications in the areas of human-computer interaction, virtual characters, multimodality research and video annotation. He created the ANVIL video annotation tool for his research on the automatic synthesis of co-verbal gestures and has since further developed it with the help of his students, especially in the direction of motion capture visualization.

Brian MacWhinney Professor Department of Psychology Carnegie Mellon University, United States [email protected] http://talkbank.org Brian MacWhinney is Professor of Psychology, Computational Linguistics, and Modern Languages at Carnegie Mellon University. He has developed a model of first and second language processing and acquisition based on competition between item-based patterns. In 1984, he and Catherine Snow co-founded the CHILDES (Child Language Data Exchange System) Project for the computational study of child language transcript data. He is now extending this system to six additional research areas in the form of the TalkBank Project. MacWhinney’s recent work includes studies of online learning of second language vocabulary and grammar, neural network modeling of lexical development, fMRI studies of children with focal brain lesions, and ERP studies of between-language competition. He is also exploring the role of grammatical constructions in the marking of perspective shifting and the construction of mental models in scientific reasoning. In the area of annotation, his work has focused on the enrichment of the CHILDES and TalkBank corpora with phonological, morphological, and syntactic coding. His research group has also developed coding systems for gesture, speech acts, sign language, and other areas, as specified in the manual for the CHAT transcription and coding system. They have also developed methods for converting between CHAT format and 8 other transcript formats, based on a detailed XML Schema for CHAT that includes the format used by the Phon program for detailed phonological analysis. The CHILDES database includes corpora from 30 languages, and his group has developed morphological and syntactic taggers for 8 of these languages. He is particularly interested in developing methods for increased linkage of the CHAT format to other annotation tools.

Diana Maynard Research Fellow Department of Computer Science University of Sheffield, United Kingdom [email protected] http://gate.ac.uk Dr. Diana Maynard has been a Research Fellow at the University of Sheffield, UK since February 2000, after receiving her Ph.D. in Natural Language Processing from Manchester Metropolitan University. Her main interests are in information extraction, opinion mining, social media analysis, terminology and semantic web technologies. She is the chief developer of Sheffield University's open-source multilingual Information Extraction tools, and currently leads the work on Information Extraction and Opinion Mining on the EU DecarboNet project. Dr. Maynard is a senior member of the current GATE team of 12 researchers, has been involved in developing the GATE architecture and toolkit since its inception (in its current form) in 2000, and has been heavily involved in the design of both manual and automatic annotation tools, in particular from the user's point of view. She developed many of the linguistic processing resources in GATE, in particular the core Information Extraction system ANNIE, and was responsible for the development of many of the multilingual tools in GATE. She led the GATE team's work on the NIST ACE evaluations, and on the TIDES Surprise Language Evaluation, both of which met with great success and led to the development of tools for a variety of new languages in GATE (Arabic, Chinese, Cebuano and Hindi) in addition to the English components. Currently, Dr. Maynard is best known for her work on opinion mining and sentiment analysis (and particularly for some recent work on sarcasm detection) and more generally for her work on social media analysis (for example, adapting core IE tools to deal with Twitter and other noisy data).

Eric Nyberg Professor School of Computer Science Carnegie Mellon University, United States [email protected] http://www.cs.cmu.edu/~ehn Professor Eric Nyberg was a member of the Unstructured Information Management Architecture (UIMA) steering committee and has many years of experience with annotation systems for information extraction, semantic retrieval, and question answering, including work on annotation type systems in the IARPA AQUAINT, DARPA GALE and DARPA MRP programs.

George Petasis Researcher Institute of Informatics and Telecommunications NCSR Demokritos, Greece [email protected] http://www.ellogon.org/petasis/ Dr. Georgios Petasis holds a Ph.D. in Computer Science from the University of Athens on machine learning for natural language processing. His research interests lie in the areas of natural language processing, knowledge representation and machine learning, including information extraction, ontology learning, linguistic resources, grammatical inference, speech synthesis and natural language processing infrastructures. He is the author of the Ellogon natural language engineering platform. He is a member of the program committees of several international conferences and he has been involved in more than 15 European and national research projects. As a visiting professor at the University of Patras he has taught both undergraduate and postgraduate courses. His work has been published in more than 50 papers in international journals, conferences and books. He is the treasurer and a member of the board of the Greek Artificial Intelligence Society (EETN). Finally, he is co-founder of “Intellitech,” a Greek company specializing in natural language processing.

James Pustejovsky Professor & Chair Department of Computer Science Brandeis University, United States [email protected] http://jamespusto.com Professor James Pustejovsky is the TJX Feldberg Professor of Computer Science at Brandeis University, and chair of both the Linguistics Program and the Computational Linguistics MA Program. He first started working seriously on annotation in 2002, when he led the creation and development of TimeML and TimeBank, in the context of a six-month IARPA (ARDA) workshop. This was then incorporated into ISO-TimeML, which has been adopted as an ISO standard. His involvement with ISO began in 2006, and he is the sub-chair responsible for SC 4 within TC 37. In 2008, he started the working group on spatial annotation in language, whose specification has recently been adopted as the ISO standard ISOspace. TimeML has been used as the reference annotation specification, and TimeBank as the reference gold standard, for all of the SemEval TempEval shared task challenges and their affiliates. Likewise, this year at SemEval 2015, SpaceBank and ISOspace were adopted as the corpus and associated annotation standard for Task 8, SpaceEval. In 2012, Professor Pustejovsky and Amber Stubbs released an O'Reilly book on "Natural Language Annotation for Machine Learning", which is intended as a guide to the ins and outs of MATTER, the development cycle for modeling, annotating, training, and testing with ML algorithms. He is, along with Nancy Ide, finishing up a comprehensive "Handbook of Natural Language Annotation", to be published later this year by Springer. Other annotation projects he has been involved with include: factuality and veridicality (FactBank), with Roser Sauri; semantics of images (ImageML), with Julia Bosque Gil; and temporal annotation for the clinical domain (THYME), with Guergana Savova and Martha Palmer.

Anna Rumshisky Assistant Professor Department of Computer Science University of Massachusetts at Lowell, United States [email protected] http://www.cs.uml.edu/~arum/ Professor Anna Rumshisky received her Ph.D. in Computer Science from Brandeis University in 2009, followed by postdoctoral training at the MIT Computer Science and Artificial Intelligence Lab. Her research primarily concerns natural language processing applications in clinical informatics, computational lexical semantics, temporal reasoning, and digital humanities and social science. She has been directly involved in several large-scale annotation initiatives, including Corpus Pattern Analysis and TimeML. She co-organized the 2012 i2b2 Workshop on Challenges in NLP for Clinical Data, overseeing the development and release of the first large-scale corpus of temporally annotated narrative provider notes. She co-developed the Generative Lexicon Markup Language (GLML) for the annotation of compositional operations in text and co-organized SemEval-2010 Task 7 on Argument Selection and Coercion, based on GLML. She has developed methods for the decomposition of complex annotation tasks into pairwise similarity judgment tasks for Amazon Mechanical Turk. She presented this methodology in a 2014 tutorial on Deep Semantic Annotation with Shallow Methods given at the International Conference on Language Resources and Evaluation.

Gary Simons Chief Research Officer SIL International, United States [email protected] http://www.sil.org/~simonsg Gary F. Simons is currently the Chief Research Officer for SIL International in Dallas, Texas and Executive Editor of the Ethnologue (http://www.ethnologue.com). He has contributed to the development of cyberinfrastructure for linguistics as co-founder of the Open Language Archives Community (http://www.language-archives.org), co-developer of the ISO 639-3 standard of three-letter identifiers for the known languages of the world, and co-developer of linguistic markup for the Text Encoding Initiative (TEI). He was formerly Director of Academic Computing for SIL International in which role he oversaw the development of linguistic annotation tools like IT (Interlinear Text Processor), CELLAR (Computing Environment for Linguistic, Literary, and Anthropological Computing), LinguaLinks, and the beginnings of FLEx (FieldWorks Language Explorer). He holds a Ph.D. in general linguistics (with minor emphases in computer science and classics) from Cornell University.

Han Sloetjes Software Developer Max Planck Institute for Psycholinguistics, The Netherlands [email protected] http://www.mpi.nl/people/sloetjes-han Han Sloetjes has been working as a software developer at the Max Planck Institute for Psycholinguistics since 2003. In 2004 he joined the group of developers working on the multimedia annotation tool ELAN, a tool that emerged at the end of the 1990s. Between 2006 and 2007 he became the main developer responsible for maintaining and extending ELAN, as well as the main provider of support and training.

Brett South Research Scientist Department of Biomedical Informatics University of Utah, United States [email protected] http://medicine.utah.edu/bmi/people/staff-technical/index.php Dr. Brett South has extensive experience leading annotation projects and manual human review efforts that support the development of clinical NLP systems. He has 18 years of academic and professional experience in various research, leadership and operational roles. He is currently a Senior Scientist and Post-doc working under the primary mentorship of Professor Wendy Chapman in the Department of Biomedical Informatics, University of Utah. Previously he was a Senior NLP Research Engineer for the Nuance Clinical Language Understanding Group, where he helped lead a group of 80 clinical language analysts tasked with large-scale semantic annotation of clinical corpora to support development of a computer-assisted coding module. Before that, he was with the Division of Epidemiology and the VA Salt Lake City IDEAS Center as a Senior Research Scientist, serving in several investigative and co-investigative roles on the VA CHIR/VINCI collaborations. Prior to completing his Ph.D. in Biomedical Informatics at the University of Utah, he obtained his Master's degree in Health Systems Management from the University of Maryland, Baltimore. His research interests include clinical NLP; improving the efficiency of manual human review tasks via better tools, workflow modifications, or distributed review; human cognition; and data analysis.

Pontus Stenetorp Researcher University of Tokyo, Japan [email protected] http://pontus.stenetorp.se Dr. Pontus Stenetorp is a post-doctoral researcher in Natural Language Processing. He is mainly interested in Natural Language Processing and Machine Learning, more specifically, parsing, information extraction, annotation tooling, deep learning, and representation learning. He is co-creator of the annotation tool Brat.

Stephanie Strassel Senior Associate Director Linguistic Data Consortium University of Pennsylvania, United States [email protected] https://www.ldc.upenn.edu Stephanie Strassel oversees the Linguistic Data Consortium’s Annotation and Collection Groups and is responsible for directing all aspects of data collection and creation, human subject research and linguistic annotation for a diverse set of externally sponsored human language technology research programs and evaluation campaigns. She has acted as PI or Co-PI on multiple efforts including most recently DARPA DEFT, BOLT, RATS, MADCAT, GALE and Machine Reading; IARPA ALADDIN; NIST OpenMT, LRE, SRE, OpenHaRT, TAC KBP, Rich Transcription, VAST and HAVIC. Previous projects include TIDES, EARS, ACE, Phanotics and TRECVid. In her work on sponsored projects Ms. Strassel has overseen all aspects of corpus creation including linguistic annotation for a diverse set of technologies including machine translation, speech recognition, question answering, handwriting recognition, document classification, topic detection, information extraction, knowledge base population, natural language understanding, speaker identification, dialect recognition, multimodal event detection and related areas. To date Ms. Strassel has co-authored over 100 corpora published in LDC’s catalog, with several dozen additional corpora pending publication. Ms. Strassel has experience with virtually every type of linguistic annotation and has led LDC efforts to define dozens of new annotation and data collection protocols, many of which are widely used in resource creation elsewhere. She has experience with resource creation and annotation for over 40 linguistic varieties including several low resource languages. At LDC Ms. Strassel directs a staff of 13 full-time and 75+ part-time linguists, managers, researchers and programmers and maintains a large network of independent contractors for a variety of languages. She provides leadership for LDC senior staff, developing and disseminating infrastructure for activities that support multiple functional areas. She frequently serves as an external expert on language resource creation activities via review panels and oversight committees.

Marc Verhagen Senior Research Scientist Department of Computer Science Brandeis University, United States [email protected] http://www.cs.brandeis.edu/~marc Dr. Marc Verhagen is a Senior Research Scientist at Brandeis University, prior to which he was co-founder and master toolbuilder at LingoMotors. He holds a Ph.D. in computer science, an M.A. in computational linguistics, and a B.A. in geography. Recently, he has worked on temporal and spatial processing, relation extraction from PubMed abstracts, technology extraction from technical texts, and interoperable web services for Natural Language Processing. Verhagen is the main developer of the Tarsqi Toolkit for temporal processing and the creator of various simple annotation tools as well as the web-based Brandeis Annotation Tool (BAT). He participated in the creation of the TimeML and ISO-Space annotation languages. In the context of organizing the TempEval-1 shared task, he oversaw annotation of the task data, creating ad hoc annotation tools in the process.

C. Pre-Workshop Survey Questions

The following survey was distributed to the workshop participants two weeks before the meeting. The survey was designed to elicit information relevant to the workshop topic, as well as to stimulate participants to begin thinking about the issues involved. 8.1. Welcome Message Welcome to the Pre-Workshop Questionnaire for the NSF-Sponsored Workshop on Unified Annotation Tooling! This questionnaire collects information about your experiences with developing and working with annotation tooling, as well as with the process of annotation itself. Your answers are critical to an effective and productive workshop, as well as to later research efforts based on the workshop's recommendations. Please be as verbose, clear, and complete as possible in your responses. I anticipate this questionnaire will take 1-2 hours to complete. Please complete it by Wednesday, March 25, 2015 (next week), so that I have enough time to analyze the results and return them to the participants before the workshop begins. When filling it out, please don’t look at any of the materials I have sent; this will help us maintain our collective creativity and avoid getting stuck in topical ruts. Throughout the questionnaire I refer to "UAT" (without an article), which should be read as "Unified Annotation Tooling". "UAT" is a catch-all term to describe any approach which seeks to solve the problem of inadequate, inconsistent, and incompatible tool support for annotation. I make no commitments in this questionnaire about whether UAT will be one tool or many; developed or supported by one group, country or field; or whether it will be a new tool, an improvement of an existing tool, or somewhere in between. I very much appreciate the time you are taking to answer these questions and attend the workshop. Your participation is extremely valuable! Sincerely, Mark 8.2. Basic Information

8.2.1. Biosketch Please provide a short biography of yourself suitable for use on the workshop webpage and in the NSF report. Point out your key work in annotation, including corpora you have helped create, projects you have directed, tools you have written, and so forth. Make sure to indicate explicitly what you are known for.

8.2.2. Website Please provide your preferred webpage for linking on the workshop website.

8.2.3. Affiliated Fields List the fields or disciplines with which you are most closely affiliated.

8.2.4. Venues Give a sample of the most important conferences and journals where you normally publish.

8.2.5. Quoting Questionnaire May I quote your responses to this questionnaire in future reports and proposals?

8.2.6. Quoting Discussion The discussions at the workshop will be recorded to facilitate capturing and summarizing the recommendations. May I quote your statements from the discussions in future reports and proposals? 8.3. Motivations & Scope

8.3.1. Motivations Give your top 3 reasons for the creation of UAT, in order of importance.

8.3.2. More Motivation Are there any additional motivations that you think are very important, beyond the first three?

8.3.3. Disciplinary Scope Many fields and sub-fields use annotation or annotation-like approaches in their work. While we probably cannot create a tool that is all things to all people, we can strive to achieve a relatively comprehensive solution. Which fields, sub-fields, and disciplines should have their needs prioritized when designing and implementing UAT and why? 8.4. Annotation Tools

8.4.1. Tools Used List the tools that you have used in your annotation work. Note where you created tools to fill a certain gap (explain the gap).

8.4.2. Tool Problems Thinking back to your experience with existing annotation tools (other than tools you built yourself), what are the top 3 problems you encountered that prevented you from using those tools for your work?

8.4.3. Additional Problems Are there any additional problems you encountered that you think are very important, beyond the first three?

8.4.4. Best Tools What do you think are the most functional and useful annotation tools available for your work, and why? 8.5. Workflows

8.5.1. Workflows Used Thinking back on any annotation projects you have managed, describe the workflow(s) you have used to create annotated corpora. By “workflow” I mean the step-by-step procedure used to go from nothing to an annotated corpus that is downloaded to a user’s machine. Provide every step you can think of from start to finish, especially steps involving any interaction with a computer. Be as detailed and complete as possible. More detail is better than less.

8.5.2. Workflows Prevented What workflows have you considered using in past projects, but found the available tooling inadequate, causing you to change your project design?

8.5.3. Important Workflows Aside from the types of workflows you have mentioned above, are there any additional workflows you think should be supported by a UAT?

8.5.4. Workflow Priority Please prioritize in a ranked list all the workflows you have mentioned so far. 8.6. Annotation Schemes

8.6.1. Schemes Used List as many annotation schemes you have used in your work as you can. Give a small amount of detail to disambiguate the scheme from others with which it may be confused.

8.6.2. Schemes Prevented What annotation schemes have you considered using in past projects, but found the available tooling inadequate, causing you to change your project design?

8.6.3. Important Schemes Aside from the types of annotation schemes you have mentioned above, what sorts of file formats or annotation schemes do you think absolutely must be supported by a UAT to be of broad use?

8.6.4. Scheme Priority To the extent possible or meaningful, please prioritize all the annotation schemes you have mentioned so far. If there are too many schemes, concatenate them into the last entry in an ordered list. 8.7. Management & the Future

8.7.1. Implementation UAT might be designed and built from scratch, might be a modification of an existing tool, or might take a hybrid approach, leveraging certain parts of existing infrastructure while adding completely new parts. Do you have opinions on the feasibility or preferability of these various approaches?

8.7.2. Management One pernicious problem is the sustainability of UAT. Funding, engagement of researchers, and developer interest wax and wane. How would you propose to manage UAT such that it is supported and developed over a period of, say, 7-10 years?

8.7.3. Collaborations If the NSF and/or the EU decides to fund an effort to develop UAT, describe what you think you might be willing and able to contribute to such an effort.

8.7.4. Other Do you have any other comments that you think would be useful, on any topic related to the workshop (even the structure of the workshop itself)? 8.8. End Message Thank you very much for your answers. I will send out a synopsis of the results of this questionnaire a few days before the workshop, to help workshop participants get into the right frame of mind. I look forward to seeing you in Sunny Isles! Best regards, Mark

D. Collated List of Survey Answers

The answers to the questions on the survey were collated (indicated by major headings below), and similar answers were grouped (indicated by minor headings). The lists below contain the verbatim answers given by the workshop participants to the survey.

Motivations to Solve the Problem

Achieve Greater Access to Data  Increased accessibility of annotated data / corpora  To streamline the process through which "ordinary working linguists" (a term usually used to encompass field linguists with relatively little computational sophistication) can make useful language resources on the basis of data collected in the field.  To make it easier for the general public (including speaker communities) to make use of language resources created by scholars. (This is of central importance to many endangered language linguists, but less important to me personally.)  There seems to be a split sometimes between annotating and searching through the annotation, and by keeping the parts more uniform UAT may provide a way to better integrate searching. This is related to my second point, in that flexible UAT would provide different views of the same data to assist in annotating (e.g., sentence-by-sentence, construction-by-construction, annotation inconsistencies, interesting connections between layers, etc.), and such views are likely useful for users who simply want to search.  Encouraging cross-disciplinary interaction. The fact that different disciplines use completely different tools really shuts off a lot of data from a lot of people. There's a lot of scope in opening up the kind of tools that we use to broader use and promoting interoperability of tools for collecting and analyzing data.  Enable groups who lack substantial computing expertise and resources to better exploit linguistically annotated data  The need to embed viewing and annotation of text in a variety of different applications

Avoid Reduplication and Reduce Cost  Reduplication of work across different projects is, of course, the most obvious issue. Stemming from this, though, seems to be a slight discouragement of developing innovative tools because time needs to be spent on getting the core annotations working and thus cannot be spent figuring out ways to improve interfaces, consistency checking, IAA modules, etc.  All-in-one solution: The re-use of generic infrastructure for e.g. annotator management, agreement computation, and project workflows.  To make it easier for people to get started in collecting language resources, and annotating in such a way that they remain useful throughout their life. So we can spend less time developing the same old tools and more time actually studying language.  More effective use of available funds - better return on investment  sustained maintenance of tools  Reduce (or better eliminate) the fragmentation caused by the many tools available  Concentrate the development effort to a single tool, which will result in more capabilities for the tool, robustness, etc.  Less bugs, as widespread usage will allow easier detection/bug fixing  Reduce overall cost/time required to produce linguistic resources  There will always be a need for annotations  current segmentation of effort  a “software waste cycle”, in which tools with substantially overlapping functionality are repeatedly created from scratch by researchers and teams all over the world, causes enormous waste of funding and human resources;  reduction of marginal cost for developing new systems / lowering the barrier to entry by non-developers  Reducing the labor currently expended on continuously re-creating annotation tools for different annotation teams.

Ease Annotation  Facilitation of users' workflow, from coding to analysis  Improved access to annotation tools  Ease of starting and completing a wide variety of annotation tasks  Classification of annotation tasks (annotation objects and annotation work flows).  To facilitate workflows where different individuals are adding different kinds of annotations to the same primary linguistic data and where these annotations can be straightforwardly integrated to create richer language resources.  Ability to use a variety of taggers on CHILDES and TalkBank data  Ability to analyze web pages in a variety of languages for second language learners  Linkage to systems for automatic segmentation, pause detection, and perhaps ASR  Efficient expansion of annotated corpora that are free of errors, which can be allowed to grow in scale with relatively easy maintenance of the data.  Efficient presentation of annotation tasks to human annotators, which take advantage of annotator strengths and speed annotation with automatic pre-processing where possible.  Distributed annotation: The collaborative annotation tool will be used in a distributed fashion with the only requirement being internet connectivity. Annotators can work at any time and from anywhere, without concerns for data losses and continuous intervention to save the data.  Unlocking a larger workforce: The main goal of an annotation tool is to generate large annotated corpora. Similar to crowdsourcing platforms, it is possible to generate larger amounts of annotated corpora more quickly by making them accessible to a larger workforce.  To make it easier for people to get started in collecting language resources, and annotating in such a way that they remain useful throughout their life. So we can spend less time developing the same old tools and more time actually studying language.  Chaining processing steps from different providers is very attractive to us  Easy combinability of existing functionality is the key for the average user, with highly sophisticated functionality available in some tools  For making it Unified: That annotation of under-resourced languages can become a collaborative enterprise involving members of the language community itself  there is very little support for crowdsourcing annotations; collaborative, distributed annotation; and semi-automatic annotation involving a “human in the loop”, which are becoming increasingly common.  the advent of linked data is demanding that consistency among annotation models be achieved, in order to ultimately enable the exploitation of all kinds of annotated data in the Semantic Web.  Promoting awareness of the entire life-cycle / process associated with the creation of type systems, annotations, corpora etc. from completely human linguistic analysis to automatic computer processing

Facilitate Archiving  To facilitate the archiving and long-term preservation of linguistic annotations. (This is something quite important to me, but it falls out more or less naturally from the other issues listed above. So, I've made it a lower priority here.)  Improve sustainability situation

Improve the Range of Annotations  Improving the range of annotation layers that can be easily annotated.  Multi-language situations being researched pose a specific problem to tools designed to support one (specific) language only. It seems much more feasible to mix small components in as needed than to wait for all the necessary tools to be upgraded and made compatible.  For annotation tooling: To adequately document and describe the thousands of languages of the world

Increase Quality of Results  Promoting best practices in annotation  Encourage replicability/comparison/evaluation of results (like IAA)  we need means to consistently document annotated data and evaluate annotation quality; a well-conceived infrastructure for annotation can support this

 Creating best practices/methods for visualizing annotations and making annotations quickly and easily  development of global requirements and shared best practices (processes associated with use of UAT);  How best practices / requirements from both realms should better inform each other and the creation of UAT  Establishing best practices for making annotation tasks as straightforward and as cognitively light as possible for the annotators.

Obtain Interoperability  Interoperability of annotations and tooling  Greater ease of exchanging annotations (even with the current widely adopted standards)  To allow all kinds of linguistic data to be more easily and opportunistically used across diverse tools.  To facilitate the interoperation of language resources developed for low-resource languages, especially those created by field linguists working on endangered languages. (This has been an important aim of many working in the area of digital standards for endangered languages, and it is of interest to me, but lower priority, in personal terms, than the first three points listed above).  To make it easier for those working in other disciplines to use language data since they would only need to learn one annotation scheme (ideally).  Interoperability. Annotation resources each have different strengths and weaknesses, as they generally represent different types and levels of detail in their representations. To take advantage of strengths and overcome weaknesses, and to ideally combine independent annotations into a larger, more diverse training corpus, these resources should be interoperable.  To make sharing data between different disciplines easier and therefore make resources, once they have been created, more widely useful.  Improvement of the decision process for users - trust building will be less difficult - promise of increased interoperability  ability to process annotations created by other tools  standardization  For making it Unified: That the data might interoperate smoothly across a whole ecology of tooling  Encourage interoperability of linguistic resources  lack of interoperability  the development of linguistically-annotated corpora is currently inconsistent; the resulting proliferation of vastly varying annotation models and practices prohibits interoperability and reuse of these resources, which in turn prevents the exploitation of high-quality annotated language data by groups that lack substantial computing expertise and resources;  Standardization and interoperability;

Provide Extensibility  Increased probability of a community process for extending the tool  Individual tools do not generally account for various possibilities of (future) annotation needs, and UAT would ideally address this by attempting to develop flexible infrastructure which addresses a range of annotation types.  Enhanced flexibility: A general-purpose annotation tool provides better flexibility, in such a way that any type of annotation layers can be created, depending on the data collection need of the target application.  Creating a universal set of recommended tools that would be easy to adapt or extend to handle most common (and not-so-common) types of annotation tasks.

What Disciplines Should Be Prioritized?

No Prioritization Needed  I don't think one can really [prioritize disciplines]. A better approach is probably to "cluster" the various disciplines to see which ones are closer together and which ones are further apart.  I don't think it makes sense to identify fields and/or disciplines where the tools and procedures should be applicable. I think we need to abstract away from that. What should be identified is what kinds of annotation should be made possible and what kind of work flows are supported.

 If "should have their needs prioritized" means that they should agree on what to forget about, that's difficult from the beginning on. I believe that progress can only be achieved by providing extra value to specific disciplines.  No single discipline is more identifiable than other, since data markup and annotation of language-related or linguistic information can be a goal within any field. That said, we can measure the linguistic annotation needs of a discipline by asking: how significant a component of the data analytics problem in your field is natural language and text? For humanities and the languages, it is obviously a pressing need, given that is the natural domain of discourse. For the social and behavioral sciences, including economics, annotation can be an enormously critical source of additional metric for trends, sentiment, and other domains. Even in GIS and epidemiology, there are needs involving language annotation.  I do not feel that any field needs to be prioritized  I see it not as a matter of priority, but as a matter of classification. The same types of annotation tasks may be required by different subfields, while the same subfield will require different annotation types, depending on the task. In the tutorial we gave at LREC last year, we developed a mini-typology of annotation tasks that classified different types of annotation along several axes, including distinguishing surface spans / attribute annotation from relation annotation, separating annotations with span-specific vs span-dependent label sets, and annotation of overt vs. covert phenomena and elements.

Proffered Prioritizations  Endangered languages are disappearing rapidly, so a priority for the needs of language documentalists and field linguists would make sense.  Fields and disciplines which could achieve significant impact (e.g. biomedical QA, legal document analysis) through more effective annotation, corpora creation and automatic processing are obvious targets.  Documentary linguistics -- Because there are thousands of languages to document and describe, and the general consensus is that half of them are in danger of dying by the end of the century  social media/blogs; clinical reports; biomedical literature  1. Natural Language Processing; 2. Corpus Linguistics; 3. Semantic Web; 4. Language Documentation; 5. Psycholinguistics; 6. Education  NLP, biomedical text mining, digital humanities, machine translation, social media mining NLP  1. computational linguistics and language technology; 2. corpus linguistics; 3. literary studies; 4. History; 5. political science; 6. sociology  Linguistics, Sociolinguistics, Language Learning  Semantic annotation; Annotation with ontologies on segments; Sentiment analysis; Argumentation mining  computational linguistics; corpus linguistics; social sciences (historical, politics, etc.,...)  Speaking entirely from the perspective of LDC and our sponsored project needs, I see the greatest need in the area of UAT for semantically annotated and lexical resources.

Problems with Existing Tools

Difficult to Learn  too-steep learning curve, inadequate documentation/tutorials available  Steep learning curve  Opacity of functionality and ease of adoption

Difficulty Running or Installing  Don't run on particular operating systems (e.g., OSX)  Often what they do is so simple that we just reimplement them in CLAN.  Difficulty accessing and running the tools. Some annotation tools that I've worked with require annotators, who may not be familiar with a Unix environment, to launch their tools within a Unix environment by giving a command. Although they are trained on this process, this can be a frustrating barrier to annotators who may have difficulty navigating a Unix environment and giving the correct command.  Difficulties installing and setting up tools, and finding out how they work and what they do.  Installation complexity  tool requires computing environment we can't easily support (e.g. special server, security concerns)

Inadequate Importing, Exporting, or Conversion  Difficulties in going from the annotation tool to data formatted for publications. Linguistics publications tend to require data in a visually appealing interlinear-glossed format. Most of the tools that I use don't make it easier to explore such a format. So, to make a publication, one has to do a lot of the work by hand. Sometimes, rather than annotate properly, I just start with the publication-ready format.  Only partial support for conversion between annotation formats  Lack of a standardized data interchange format that would allow the new tool to interoperate nicely with the other tools I was using  data output format is too inflexible and/or doesn't meet our requirements  "Academic code" and built-in platform assumptions  lack of means to input/output annotations in a desired format  no ability to read in or perform pre-annotations  tools don't always handle schemas with a large number of types in a graceful manner  mapping between type systems created by different tools / schemas  They mostly failed to interoperate with CHAT format.

Lack of Documentation or Support  Support for the tool  For syntactic dependency taggers, the problem has been the poor documentation of the range of tags, bad interoperability, etc.  some tools lack enough documentation to know exactly what to do  Problems with maintenance of tools, finding tools that look useful but that nobody is taking care of  Lack of documentation, tutorials, examples that were adequate to figure out the tool within the timeframe I was willing to spend on evaluating it  Lack of documentation, at best, a text file. It is difficult to even get the tool running.  Lack of support, people leave academia and suddenly there is no one to ask about how things work.  lack of consistency of formats and scheme design;

Missing Functionality  Limited functionality: no “coding scheme”  Often the tool only provides a subset of the features desired.  Lack of support for languages written where diacritic marks are used to encode tonal contrasts. This is a concern that is somewhat technical but, at its heart, relates to how some transcription systems are based around a model where "accents" encode tonal contrasts, but no tool that I know of is aware of this, which causes problems for automated parsing and searching. (I can give more details if you'd like. This is a particular problem for Africanists.)  For morphological analysis, there are not really any tools out there that get to the level we need. The Xerox/PARC FST framework has computational problems and is not open enough for general use.  Lack of easy search functionality for "homegrown" types of searches, i.e., searches that are specific to our particular needs. For example, we wanted to find mismatches between annotation layers that might have helped reveal something about the annotation scheme we would have wanted to revise.  Visualization is thus sometimes a challenge  while dedicated video and image annotation tools exist, their support for text annotation is poor.  Many lack support for collaborative annotation over the web, which is now a key requirement for the projects we work on  Lack of sufficient flexibility in supporting different annotation flows and mixing with automatic bootstrapping and adjudication methods  Many tools handle video now, but don't give it a central role  many tools target either single researchers or languages with an established writing system with an orthography, both are not the case for our research.  The tool was missing key functionality I was looking for  lack of workflow support (inability to manage users & assignments, track progress and perform other management functions)  lack of some key functionality

 lack of means for quality control and assessment  can't be embedded in other applications - is a standalone  not able to do instance-level annotation as well as document- or corpus-level annotations  cannot create shortcuts for users (e.g., mark all instances of "pain" as Symptom - don't make me re-annotate the same words over and over.)  Lack of necessary functionality for the task at hand.  Lack of “coverage” e.g. inability to handle relations between either overt or covert entities.

Poor User Interface  Bad user interface: you want coding to be efficient because it is very time-consuming (and boring!)  My main beef, including on tools I wrote, is usually the ease of use of the interface.  Tediousness of performing some tasks (For example, the Alembic Workbench required a long sequence of clicks to add a relation.)  General usability. Most tools produced by scholars have clunky user interfaces and, in particular, they don't follow the standards used by professionally-made software. This requires learning new interfaces and workflows for each of these tools, and I choose carefully when it is worth learning them.  Usability  Annotator fatigue  poor design makes things cumbersome or error-prone or unnecessarily tedious for annotators  too many clicks - burdensome to use  Uncomfortable, difficult, or needlessly silly interface.  Too much functionality, too much configurability (MATE workbench)  Complexity and lack of transparency. Some annotation tools benefit the annotation process by providing built-in reference tools. While this is advantageous, it can also create an interface that is intimidating to annotators, especially when the functionality of the tool is not totally transparent to the annotator  Poor workflow design, i.e. sequences of actions required to create an annotation were too long and/or seemingly unmotivated.

Problems with Extensibility  I found integrating rhetorical relations difficult in ANVIL.  Lack of customization of data formats and annotation categories  Rigid restrictions on annotation types - e.g., not (easily) allowing for two separate part-of-speech fields.  Many tools were annotation task specific and/or closed source, which made them hard to modify and customize to new corpus annotation needs/tasks  Inability to adapt tools to established practices in sub-fields  Customization of annotation schemas  too difficult/time-consuming to modify compared to building our own custom tool  "Academic code" and built-in platform assumptions  difficult to customize  Workflows that allow for partial annotation or annotation of one "layer" or component of a problem have been hard to develop.

Unstable, Slow, or Buggy  Stability and reliability of the tool.  Speed. Ideally, the annotation tool should work as quickly as the annotator. Some web-based tools and those accessed via a secure shell can become very slow outside of the ideal connectivity contexts.  Tools with bugs. We cannot take for granted that a tool simply saves data properly, instead of crashing or saving data with small errors.  Some tools are not well-suited for large data files,  tool is buggy  Bugs. The more complicated the annotation task and the corresponding tool, the more likely one finds bugs in the way the tool functions.

Desired Capabilities

Import/Export Capabilities  good export functionality is a must: easy to export such that the data can be used in a publication  JSON needs to be supported for handling tweets  Both inline and standoff markup needs to be supported  Reading any text format and XML would be a necessary capability.  importing and exporting data and annotations  transferring between formats

Media to Support  core ability to manipulate time-aligned data (heavy data, such as multi-HD video streams)  ability to annotate data that cannot easily be moved to different places because of bandwidth limitations or legal reasons  Support for images and, even possibly, video annotation, when mixed with text: social media and other digital content is no longer just textual and while dedicated video and image annotation tools exist, their support for text annotation is poor. Ideally, we want something unified and capable of doing all those to at least some degree.  It is expected that in many fields the use of and availability of multimedia recordings will increase

Supported Schemes  should provide tools to do syntactic and semantic role annotation in many different languages  Providing information on coreference should also be supported by UAT. Coreference information can span documents, and therefore a tool that allows users to visualize coreference chains, and connect entity mentions, can be difficult to develop.  semantic annotations need to be enriched with causal/temporal chains that go beyond a single clause or sentence.  I think it will be important for a UAT to support some sort of Linked Data friendly format (e.g., RDF)  UAT should support a tier-based, multilevel, hierarchical annotation scheme. And probably a syntax or treebank type of scheme as well.  The file format should be XML based.  linking transcripts to audio  Support for relevant linguistic annotation standards, such as ISO/TC 37/SC 4  Support as many import/export formats as possible.  it is clear that annotation schemes are not a closed set, so it is important that UAT be defined in terms of extensible schemes that can represent new schemes (such as RDF and the Linked Data framework).

Workflows  Creation of dependency parses - a web-based, fast, user-friendly tool to create dependency parses for a variety of languages would be very useful, providing syntactic information and semantic dependencies for languages that may not be suited to a phrase structure analysis.  I believe that it would be good to support more "distributed" annotation, in an "assembly line" process where people with different skills perform different kinds of annotation.  work flows should be customizable  there needs to be support for crowdsourcing and distributed (collaborative) annotation and annotation involving “human-in-the-loop”.  Multi-role support, including user groups, access privileges, annotator training, quality control, and corresponding user interfaces.  Shared, efficient data storage to store and access text corpora and annotations.  Support for automatic pre-annotation services and their configuration, to help achieve time and cost savings.  Flexible workflow engine to model complex annotation methodologies and interactions.  I think it would be fine if a UAT just does the annotation part of the workflow. Collection of raw data and (possibly) conversion to usable formats, project management and metadata creation, uploading to an archiving / long term storage system: those parts might best be dealt with by other tools. Where possible smooth integration with tools that perform these other tasks should be provided.  Distributed annotation approaches  A workflow that incorporated interactive annotation in a way that was scalable to larger corpora would be good to investigate  Every successful annotation project has been iterative. [support for this is a must.]  possibly interactions with MTurk  we need a very flexible and powerful workflow system that allows us to arbitrarily combine manual and automatic annotation stages, to manage data sets and user roles, and to prioritize the various components  supporting multiple annotators  adjudication  assignment of annotators to annotation tasks  ability for annotators to work online and offline  A UAT should be as versatile as possible: offer a set of workflows, but not impose them, allowing the annotation within a custom workflow  We can enumerate annotation tasks. Workflow is the general problem of how we orchestrate the flow in which the output of one task is the input to the next. What is needed is not a set of workflows that are supported, but a system for orchestrating workflow among tasks

Miscellaneous Functionality  the tool should be designed to work on all languages, especially low-resource ones, from the start, rather than a more usual approach where a tool is designed to work on well-known languages first, with other languages only supported after  Open source: An open source annotation tool can be extended with new functionalities, and is thus subject to a collaborative (programming) process. This flexibility makes a tool more attractive for people that conduct and oversee annotation projects.  Annotator analysis (IAA, agreement with adjudicator, speed)  Built-in logging and commenting  It should include a web-based version.  Annotation tools (or composers of multiple tools) should facilitate different layers of annotation and abstraction, in an easy to use and simple to understand fashion. This layered annotation should be easily integrated or composed with other levels and domains of annotation.  The domain user or expert (e.g., clinical or legal texts) should have a gentle learning curve and an intuitive experience for annotating the material in text that they are most familiar with.  An annotation tool should be integratable on top of any document format (doc, pdf, txt), and any basic application program; e.g., MS Word, Adobe, Excel, iPhoto, Latex, etc.  Keeping the learning curve for tools very low [from the point of view of the annotator] is critically important.  UAT should be modular in design and implementation.

How to Implement the Solution?

Build Links and Reuse Existing
• In my opinion the most promising avenue is to build bridges between tools, either in the form of additional software, or in the form of tool extensions, and/or in the form of interface and file format conventions that people agree to. I am not speaking of an all-encompassing super file format, though. We tried this once by using Annotation Graphs, and it turned out that a file format always has limitations with respect to a tool, which have to be fixed with special notation, which in turn makes a unifying format meaningless.
• I want methods for linking existing tools.
• I think that we should focus on creating a web-based application that allows users to easily access a variety of different annotation tools that can be used in a modular fashion to efficiently complete the desired steps of annotation, with consistent input/output formats that ensure interoperability of annotated corpora.
• Maximal compatibility with existing major tools is going to be one of the biggest problems. This should perhaps be the primary concern when designing a new tool.
• Starting from scratch always has the danger of repeating all the mistakes that existing tools have already made and fixed.
• Also, how can one claim to unify annotation tooling while creating yet another tool? Will it be better than the old ones? And most of all: when will it be ready for use?
• It should, wherever possible, re-use existing tools and workflow engines.
• I probably lean towards trying to leverage as much as possible.
• I think it is important to build on what we have now that is working well. Previous efforts to build the one true tool have not worked, but there are many tools that have a solid following and support base. Establishing new tools is hard.
• My feeling is that we need to make use of the emerging interoperability standards and look at how to adapt existing tools with a good following to work with them, and so work with each other.
• Building on existing tools has the advantage of potentially being more cost effective because of the re-use of existing implementations. But adopting existing code might turn out to be harder and more time consuming than anticipated.
• Existing best practices must be exploited and reused.
• Ideally, code that works as expected (and is expandable) must be reused as much as possible.
• I would suggest leveraging an existing generic infrastructure (e.g., the LAPPS framework), providing support for annotation with existing annotation infrastructures such as GATE and UIMA, and, for other purposes, building on and/or adapting existing tools such as LDC's WebAnn and Darmstadt's WebAnno. This would include customized interfaces for different users/applications, and even the ability for a user to specify and control the features of the interface.
• I strongly suggest that any such tool should at first be an integration or modification of current tools.
• The best of present tools should be taken as a "must have" for any redesign into a new functional specification for a new tool.
• There is something to be said, though, for selecting a popular tool and tinkering with it until it better fits UAT principles.

Difficulty in Reuse
• Extending an existing tool can be quite painful given the evolutionary growth of most tools and the ensuing chaos in the code base.
• I generally don't ever favor a "build from scratch" approach, but that doesn't mean it's not the right approach here. I don't think any existing tool is good enough to form the basis for a UAT.
• Designing and building from scratch has many advantages: you start with a clean slate, with no legacy code that might not be flawless. It might be the most expensive option, re-implementing functionality that has already been implemented by others many times.
• Especially when attempting to build on multiple tools, the question of supported platform(s) and applied programming language(s) becomes manifest.
• With all kinds of OS dependencies, a fresh start might make life much easier.
• While it is often a good idea to re-use code, sometimes it is more practical to start over.
• I always like the idea of building on what has already been done, but one reason there are so many tools is that existing tools aren't usually easily adapted to meet the needs of another project.

No One Tool is Possible
• Best to avoid something like GATE by focusing on interoperability.
• I think we need to view UAT not as an ideal tool to be developed, but as an ecology of tools that interoperate over common formats and schemes.
• I suspect that building an all-encompassing single tool from scratch would be doomed from the start. Clearly we have a variety of groups that are willing to create annotation tools; the problem lies in the lack of an incentive to work together and to create tools that work well with each other. Thus it would most likely be a better idea to somehow coordinate these groups. Perhaps towards a unified API of some sort? [see the sketch after this list]
• I do not believe there can be a UNIVERSAL tool, and it is difficult to address this question without narrowing the types and usages of the annotations that are potentially being produced.
• I do not think there is any likelihood of a single annotation tool for all tasks and domains; rather, all major annotation tools or environments should be interoperable, both in functionality and in content produced.
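Several of the responses in this group presuppose a common representation that otherwise independent tools could exchange (a "unified API of some sort", interoperability "in content produced"). Purely as an illustration of the kind of lowest common denominator involved, the Python sketch below serializes a standoff-style annotation record to JSON; the field names are assumptions made for this example and are not drawn from any existing standard or tool.

    import json

    # A hypothetical minimal standoff annotation record: the text is referenced
    # by character offsets rather than modified in place, so any number of tools
    # can add layers without touching each other's output.
    annotation = {
        "document": "doc-0001",
        "layer": "named-entities",
        "annotations": [
            {"id": "a1", "start": 0, "end": 4, "label": "PERSON",
             "creator": "annotator-3", "created": "2015-03-30T10:15:00Z"},
        ],
    }
    print(json.dumps(annotation, indent=2))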

Miscellaneous
• My hunch would be the hybrid approach. It is important to incorporate formats and other elements of existing workflows, whether embodied in a tool or not. But I think the core would be implemented from scratch.
• What I've always thought would be a good approach, though admittedly I am not a developer, is one where the core of a UAT would be a kind of "engine" (adapted from the notion of a browser engine) designed to perform basic operations over a standardized annotation model (e.g., Annotation Graphs, just to pick one), such as "add annotation", "delete annotation", and "modify annotation", with different projects building new tools on top of this engine. [see the sketch after this list]
• More important, maybe, is the decision on what kinds of devices the UAT should run on and, related to that, whether it should be an online or offline tool (or both). It will not always be possible to port a desktop application to an app for tablet or phone (or other new devices of the future). The decision on this might determine whether it is at all possible to build on existing tools, or at least on which of them.
• Re-engineering with maintainability in mind might be a good solution. This of course does mean that everything needs to be reinvented.
• WebLicht gives online access to annotation tools, eliminating the need to download and install the tools. It uses a building-block approach, which allows users to mix and match tools for different annotation layers. This approach is made possible by using a common data exchange format, so that one tool can process the output of another.
• As I mentioned, I think we need a set of tools, specialized for different (general) types of annotation tasks; cf. the mini-taxonomy of annotation tasks.
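The "engine" suggestion above can be made concrete with a very small interface: a core that knows only how to add, delete, and modify annotations over a shared model, with display, import/export, and workflow built on top. The Python sketch below is a minimal illustration under those assumptions; the class names AnnotationEngine and Annotation are invented for the example and do not describe any existing system.

    import itertools
    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class Annotation:
        start: int                 # character offset where the span begins
        end: int                   # character offset where the span ends
        label: str                 # e.g., a part-of-speech tag or entity type
        features: Dict[str, str] = field(default_factory=dict)

    class AnnotationEngine:
        """A hypothetical core exposing only the basic operations; GUIs,
        converters, and workflow layers would be built on top of it."""

        def __init__(self, text: str):
            self.text = text
            self._annotations: Dict[int, Annotation] = {}
            self._ids = itertools.count(1)

        def add(self, ann: Annotation) -> int:
            ann_id = next(self._ids)
            self._annotations[ann_id] = ann
            return ann_id

        def delete(self, ann_id: int) -> None:
            self._annotations.pop(ann_id, None)

        def modify(self, ann_id: int, label: Optional[str] = None, **features) -> None:
            ann = self._annotations[ann_id]
            if label is not None:
                ann.label = label
            ann.features.update(features)

    # Example usage: a front-end tool would call these operations in response
    # to user actions in its interface.
    engine = AnnotationEngine("Mark visited Miami.")
    ann_id = engine.add(Annotation(0, 4, "PERSON"))
    engine.modify(ann_id, confidence="high")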

How to Manage/Fund/Organize the Solution?

Funding
• If it is possible to acquire funds over a longer period of time to guarantee development, maintenance, and support, that should of course be preferred, but I don't know of such funds.
• At least with the DFG, the German Science Foundation, it has become possible to get support for creating "research infrastructure". Having such an initial grant to set up a project in a way that it can be maintained by the user community seems one way to go.
• I would imagine funding coming from two sources:
  - core infrastructure and sustained maintenance via NSF CRI or similar programs
  - customized "modules" for particular annotation tasks via individual project funding
  Any group who wanted to access the core infrastructure & services could be required to contribute funding to customize the module(s) that would support their particular task(s).
• Subscription services can bring in some amount of support. The problem with this is how to bill, and how to differentiate users. Ideally, a platform or tool should be as freely available as possible, so as to encourage adoption and use by as many people as possible. Silver and Gold levels of support might be considered for corporate sponsors, once they see a monetizable payoff.
• I would propose to NSF (or whoever) a project that would work on a UAT as one component and also develop a sustainability model as another component. This latter component would involve workshops, expert consultation, etc., and, by the end of, say, two years, I'd hope for a proposal which could be tested. In other words, I'd treat sustainability as a research question like any other. I'd also ask around for what disciplines have pulled this sort of thing off, probably starting with biology, since my impression is that they have tools which have been supported for a while (though this is helped by the fact that they have a lot of money).
• By building a tool that will attract a community that understands how to extend it. If the community needs the tool, it may provide resources to help sustain the tool.

Implementation
• Having an excellent method of distribution of code & versions is also critical to getting it going.
• To me, simplicity and ease of use is often the thing that puts me in a good mood with respect to a tool.
• A shift in focus to developing reusable, library-based code with task-specific modules and customizations would greatly improve the overall quality and cohesiveness of annotation systems, to the benefit of the entire research community.
• Development of such an industrial-strength toolkit will only be possible with sustained support for a robust linguistic annotation infrastructure that can marry the "science of annotation" with well-understood principles of software engineering.
• We shouldn't imagine that we can start from scratch and impose "the one true UAT". A hybrid approach seems the most likely to succeed.
• The most important management strategy is to develop the most compelling and universal capabilities first, so that early adoption is widespread.
• To be successful, there needs to be an application for the UAT from early on.

Open Source
• After a startup period, all code and other materials should be available to everybody, and contributions by people outside of the core should be welcomed.
• My first inclination would be to suggest an open-source development strategy. However, this doesn't always result in production-level code.
• I think that an open-source code base promotes active development.

Organization
• I think it would be simpler to maintain one site that houses and chains together different modules for annotation, which could be independently maintained. If one particular annotation tool is not well maintained, then such a site would offer users other competing tools that are being maintained. That being said, it would also be important for independent tool creators to provide living guidelines on the shared site that would communicate developments and updates.
• For most approaches, it seems rather feasible to keep peripheral functionality pluggable in some way, so the core functionality is more of a problem. None of the current systems that I know of have been developed in a way that would allow them to be maintained by a group should the main developer no longer be available. So that certainly has to change, which definitely means quite some overhead.
• The secret to sustainability is to build a starfish rather than a spider, as in "The Starfish and the Spider: The Unstoppable Power of Leaderless Organizations".
• We should conceive of UAT as an ecology of interoperating tools, data repositories, and automated services; this builds on the starfish model and allows the whole to thrive, even as the individual members wax and wane. The Open Language Archives Community offers an example in which a small amount of grant funding helped to build the infrastructure for interoperation; now, 15 years later, there are 56 participating institutions, each of which funds its own participation.
• Perhaps by creating a working group that would oversee the development / integration of different types of tools, maintaining a common project website and code repository.

Partnerships
• I guess there must be at least two larger funded projects, one on the US side and one on the EU side, that run in parallel and are coordinated with each other.
• Join forces with existing initiatives (e.g., LDC, Linguist List, SIL) and research infrastructures.
• Getting buy-in from different developers seems critical (which probably means leveraging some currently existing tools & formats).
• It might make sense to partner with industry on current product development, such as clinical coding modules (e.g., Nuance/IBM). Development based around common research/product goals ensures sustainability over the long run.
• If UAT is perceived as the best tool for building / managing large-scale resources (like those disseminated by the Linguistic Data Consortium), then seeking partnerships with large-scale resource creators would be another strategic move to consider.

Review
• I think it is important to look at what is working now.
• We need a good survey of the current status: what is available, and what are its advantages/disadvantages.

Other
• It would be great if, apart from content and engineering, we could discuss the potential for project proposals. It may also make sense to discuss this both for the whole group and for the US and EU separately (since funding structures are quite different, I suppose).
• As I have no idea whether I am a complete outlier with my approach to annotation, I am not sure what to say. A good idea might be to start [the workshop] with what you had in mind, and then have a discussion at the end of the first day about priorities for the second.
• The workshop should accomplish the following things:
  1. Everyone in attendance agrees on what the problem is. This requires a statement of unanimous support for what our needs are.
  2. Everyone buys into one or two avenues or directions for how to solve the problem(s) mentioned in 1. This may entail some breakout sessions, to really get at what the architecture for a UAT would be, if that's what we want/need as a community.
  3. Solicit help at the highest levels possible from members of the workshop to make the chances of success as great as possible.
  4. People should be excited by the prospects of what is being proposed.

E. Original Workshop Agenda

The following table presents the original workshop agenda, as proposed by the organizer to the participants. The actual workshop schedule differed somewhat, because the proposed schedule was adjusted dynamically in response to participant feedback.

Sunday, March 29, 2015
Welcome reception on the Caracol Terrace, 8–10 PM

Monday, March 30, 2015 (workshop held in the Ocean Terrace Room)

Start     End       Activity
9:00 AM   9:30 AM   Introductory Remarks by Mark Finlayson
9:30 AM   10:30 AM  Motivations: What are the problems we are trying to solve?
10:30 AM  11:00 AM  Coffee Break
11:00 AM  12:30 PM  Motivations II
12:30 PM  2:00 PM   Lunch in Caracol Restaurant
2:00 PM   3:30 PM   Capabilities: What functionalities should we target, with what priority?
3:30 PM   4:00 PM   Coffee Break
4:00 PM   5:15 PM   Capabilities II
5:15 PM   5:30 PM   Discussion: Should we re-work tomorrow's agenda?
5:30 PM   6:15 PM   Free time
6:15 PM             Meet in lobby to go to dinner at Timo (17624 Collins Ave.)

Tuesday, March 31, 2015

Start     End       Activity
9:00 AM   9:30 AM   Overview of European Projects & Funding by Erhard Hinrichs
9:30 AM   10:30 AM  Form: What form should a solution take?
10:30 AM  11:00 AM  Coffee Break
11:00 AM  12:30 PM  Incentives: How do we ensure adoption and long-term viability?
12:30 PM  2:00 PM   Lunch in Caracol Restaurant
2:00 PM   3:30 PM   Management: How should the solution be funded, built, and maintained?
3:30 PM   4:00 PM   Coffee Break
4:00 PM   5:30 PM   Discussion: Summarizing the recommendations
5:30 PM   6:15 PM   Free time
6:15 PM             Meet in lobby to go to dinner at Tony Roma's (18050 Collins Ave.)

Table 5: Original workshop agenda
