SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora

Mats Wirén
Department of Linguistics, Stockholm University, Sweden
[email protected]

Arild Matsson
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
[email protected]

Dan Rosén
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
[email protected]

Elena Volodina
Språkbanken, Department of Swedish, University of Gothenburg, Sweden
[email protected]

Abstract

Annotation of second-language learner text is a cumbersome manual task which in turn requires interpretation to postulate the intended meaning of the learner’s language. This paper describes SVALA, a tool which separates the logical steps in this process while providing rich visual support for each of them. The first step is to pseudonymize the learner text to fulfil the legal and ethical requirements for a distributable learner corpus. The second step is to correct the text, which is carried out in the simplest possible way by text editing. During the editing, SVALA automatically maintains a parallel corpus with alignments between words in the learner source text and corrected text, while the annotator may repair inconsistent word alignments. Finally, the actual labelling of the corrections (the postulated errors) is performed. We describe the objectives, design and workflow of SVALA, and our plans for further development.

1 Introduction

Corpus annotation, whether manual or automatic, is typically performed in a pipeline that includes tokenization, morphosyntactic tagging, lemmatization and syntactic parsing. Because of the deviations from the standard language, however, learner data puts special demands on annotation.