Predicting the Level of Text Standardness in User-generated Content

Nikola Ljubešić∗‡  Darja Fišer†  Tomaž Erjavec∗  Jaka Čibej†  Dafne Marko†  Senja Pollak∗  Iza Škrjanec†

∗ Dept. of Knowledge Technologies, Jožef Stefan Institute, [email protected]
† Dept. of Translation Studies, Faculty of Arts, University of Ljubljana, [email protected]
‡ Dept. of Inf. Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

Proceedings of Recent Advances in Natural Language Processing, pages 371–378, Hissar, Bulgaria, Sep 7–9 2015.

Abstract

Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best-performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required to perform good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.

1 Introduction

User-generated content (UGC) is becoming an increasingly frequent and important source of human knowledge and people's opinions (Crystal, 2011). Language use in social media is characterised by special technical and social circumstances, and as such deviates from the norm of traditional text production. Researching the language of social media is not only of great value to (socio)linguists, but also beneficial for improving automatic processing of UGC, which has proven to be quite difficult (Sproat, 2001). Consistent decreases in performance on noisy texts have been recorded in the entire text processing chain, from PoS-tagging, where the state-of-the-art Stanford tagger achieves 97% accuracy on Wall Street Journal texts, but only 85% accuracy on Twitter data (Gimpel et al., 2011), to parsing, where double-digit decreases in accuracy have been recorded for 4 state-of-the-art parsers on social media texts (Petrov and McDonald, 2012).

Non-standard linguistic features have been analysed both qualitatively and quantitatively (Eisenstein, 2013; Hu et al., 2013; Baldwin et al., 2013) and they have been taken into account in automatic text processing applications, which either strive to normalise non-standard features before submitting them to standard text processing tools (Han et al., 2012), adapt standard processing tools to work on non-standard data (Gimpel et al., 2011) or, in task-oriented applications, use a series of simple pre-processing steps to tackle the most frequent UGC-specific phenomena (Foster et al., 2011).

However, to the best of our knowledge, the level of (non-)standardness of UGC has not yet been measured to improve the corpus pre-processing pipeline or added to the corpus as an annotation layer in comprehensive corpus-linguistic analyses. In this paper, we present an experiment in which we manually annotated and analysed the (non-)standardness level of Slovene tweets, forum messages and news comments. The findings were then used to train a regression model that automatically predicts the level of text standardness in the entire corpus. We believe this information will be highly useful in linguistic analyses as well as in all stages of text processing, from more accurate sampling to build representative corpora to choosing the best tools for processing the collected documents, either with tools trained on standard language or with tools specially adapted for non-standard language varieties.

The paper is organised as follows. Section 2 presents the dataset. Section 3 introduces the features used in subsequent experiments, while Section 4 describes the actual experiments and their results, with an emphasis on feature evaluation, the gain when using external resources, and an analysis of the performance on specific subcorpora. The paper concludes with a discussion of the results and plans for future work.

2 The Dataset

This section presents the dataset used in subsequent experiments, starting with our corpus of user-generated Slovene and the sampling used to extract the dataset for manual annotation. We then explain the motivation behind having two dimensions of standardness, and describe the process of manual dataset annotation.

2.1 The Corpus of User-generated Slovene

The dataset for the reported experiments is taken from our corpus of user-generated Slovene, which currently contains three types of text: tweets, forum posts, and news site comments. The complete corpus contains just over 120 million tokens.

Tweets were collected with the TweetCaT tool (Ljubešić et al., 2014b), which was constructed specifically for compiling Twitter corpora of smaller languages. The tool uses the Twitter API and a small lexicon of language-specific Slovene words to first identify the users that predominantly tweet in Slovene, as well as their friends and followers. TweetCaT continuously collected the users' tweets for a period of almost two years, also updating the list of users. This resulted in the Slovene tweet subcorpus, which contains 61 million tokens. Currently, most of the collected tweets were written between 2013 and 2014. It should be noted that the majority of these tweets did not turn out to be user-generated content, but rather news feeds, advertisements, and similar material produced by professional authors.

For forum posts and news site comments, six popular Slovene sources were chosen as they were the most widely used and contained the most texts. The selected forums focus on the topics of motoring, health, and science, respectively. The selected news sites pertain to the national Slovene broadcaster RTV Slovenija, and the most popular left-wing and right-wing weekly magazines. Because the crawled pages differ in terms of structure, separate text extractors were developed using the Beautiful Soup¹ module, which enables writing targeted structure extractors for HTML documents. This allowed us to avoid compromising corpus content with large amounts of noise typically present in these types of sources, e.g. adverts and irrelevant links. It also enabled us to structure the texts and extract relevant metadata from them. The forum posts contribute 47 million tokens to the corpus, while the news site comments amount to 15 million tokens. As with tweets, the majority of the collected comments were posted between 2013 and 2014. The forum posts cover a wider time span, with similar portions of text coming from each of the years between 2006 and 2014.

The corpus is also automatically annotated. The texts were first tokenised and the word tokens normalised (standardised) using the method of Ljubešić et al. (2014a), which employs character-based statistical machine translation. The CSMT translation model was trained on 1000 keywords taken from the Slovene tweet corpus (compared to a corpus of standard Slovene) and their manually determined standard equivalents. Then, using the models for standard Slovene, the standardised word tokens were PoS-tagged and lemmatised.

¹ http://www.crummy.com/software/BeautifulSoup/

2.2 Samples for Manual Annotation

For the experiments reported in this paper, we constructed a dataset containing individual texts that were semi-randomly sampled from the corpus of tweets, forum posts and comments. The dataset was then manually annotated.

To guarantee a balanced dataset, we selected equal proportions (one third) of texts for each text type. For forum posts and comments, we included equal proportions of each of their six sources.

In order to obtain a balanced dataset in terms of language (non-)standardness from a corpus heavily skewed towards standard language, we used a heuristic to roughly estimate the degree of text (non-)standardness, which makes use of the corpus normalisation procedure. For each text, we computed the ratio between the number of word tokens that have been changed by the automatic normalisation, and its overall length in words. If this ratio was 0.1 or less, the text was considered as standard, otherwise it was considered as non-standard. The dataset was then constructed so that it contained an equal number of "standard" and "non-standard" texts. It should be noted that this is only an estimate, and that the presented method does not depend on an exact balance. Different rough measures of standardness could also be taken, e.g.

…ods, and relevant enough to linguists for filtering relevant texts when researching non-standard language.

2.4 Manual Annotation and Resulting Dataset

The annotators, who were postgraduate students of linguistics, were presented with annotation guidelines and criteria for annotating the two dimensions of non-standardness. Each given text was to be annotated in terms of both dimensions using a score between 1 (standard) and 3 (very non-standard), with 2 marking slightly non-standard texts. We used a three-score system as the task is not (and can hardly be) very precisely defined. Using this scale would also allow us to better observe inter-annotator agreement. In addition, for learning standardness models, we used a regression model, which returns a degree of standardness, rather than a classification one. In this
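The targeted Beautiful Soup extractors mentioned in Section 2.1 can be pictured with a small sketch. The markup below is invented for illustration (the actual forum and news sites each required their own extractor, since every source structures its HTML differently); only noise-free post content is ever selected.

```python
from bs4 import BeautifulSoup

# Hypothetical forum markup; class names ("post", "author", "text")
# are illustrative, not taken from any of the crawled sites.
html = """
<html><body>
  <div class="ad">Buy now!</div>
  <div class="post">
    <span class="author">miha123</span>
    <span class="date">2014-03-01</span>
    <p class="text">Tole je moj komentar.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
posts = []
for post in soup.find_all("div", class_="post"):
    # Extract the text plus the metadata the corpus keeps alongside it.
    posts.append({
        "author": post.find("span", class_="author").get_text(strip=True),
        "date": post.find("span", class_="date").get_text(strip=True),
        "text": post.find("p", class_="text").get_text(strip=True),
    })

# Adverts and navigation (e.g. div.ad) are simply never selected,
# which keeps the extracted corpus free of that kind of noise.
```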
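The balancing heuristic of Section 2.2 amounts to a single token-level ratio. A minimal sketch, assuming token-aligned output from the normaliser (function names are ours; the paper's actual normaliser is the CSMT model of Ljubešić et al., 2014a):

```python
def standardness_estimate(original_tokens, normalised_tokens):
    """Fraction of word tokens changed by automatic normalisation.

    Assumes the two lists are aligned one-to-one, as produced by a
    token-level normaliser.
    """
    if not original_tokens:
        return 0.0
    changed = sum(1 for o, n in zip(original_tokens, normalised_tokens)
                  if o != n)
    return changed / len(original_tokens)


def is_roughly_standard(original_tokens, normalised_tokens, threshold=0.1):
    # Texts with at most 10% changed tokens are treated as "standard"
    # for the purpose of balancing the annotation sample.
    return standardness_estimate(original_tokens, normalised_tokens) <= threshold
```

For example, a three-token text in which one token is rewritten by the normaliser gets a ratio of 1/3 and is sampled into the "non-standard" half of the dataset.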
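The abstract reports results as mean absolute error on the 1–3 annotation scale (0.38 for technical, 0.42 for linguistic standardness). MAE is just the average absolute difference between gold scores and the regressor's real-valued predictions; a sketch with made-up numbers (not the paper's data):

```python
def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted scores."""
    assert len(gold) == len(predicted) and gold
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


# Illustrative only: gold annotations on the 1-3 standardness scale
# and real-valued regression outputs for four texts.
gold = [1, 3, 2, 1]
pred = [1.2, 2.5, 2.0, 1.1]
mae = mean_absolute_error(gold, pred)  # (0.2 + 0.5 + 0.0 + 0.1) / 4 = 0.2
```

Because the regressor returns a degree of standardness rather than a class label, MAE rewards near-misses (predicting 1.2 for a gold 1) in a way that plain classification accuracy would not.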