Predicting the Level of Text Standardness in User-Generated Content

Total Page:16

File Type:pdf, Size:1020Kb

Predicting the Level of Text Standardness in User-Generated Content Predicting the Level of Text Standardness in User-generated Content Nikola Ljubešic´∗‡ Darja Fišer† Tomaž Erjavec∗ Jaka Cibejˇ † Dafne Marko† Senja Pollak∗ Iza Škrjanec† Dept. of Knowledge Technologies, Jožef Stefan Institute ∗ [email protected] Dept. of Translation studies, Faculty of Arts, University of Ljubljana † [email protected] Dept. of Inf. Sciences, Faculty of Humanities and Social Sciences, University of Zagreb ‡ Abstract man knowledge and people’s opinions (Crystal, 2011). Language use in social media is char- Non-standard language as it appears in acterised by special technical and social circum- user-generated content has recently at- stances, and as such deviates from the norm of tracted much attention. This paper pro- traditional text production. Researching the lan- poses that non-standardness comes in two guage of social media is not only of great value to basic varieties, technical and linguistic, (socio)linguists, but also beneficial for improving and develops a machine-learning method automatic processing of UGC, which has proven to discriminate between standard and non- to be quite difficult (Sproat, 2001). Consistent de- standard texts in these two dimensions. creases in performance on noisy texts have been We describe the manual annotation of a recorded in the entire text processing chain, from dataset of Slovene user-generated content PoS-tagging, where the state-of-the-art Stanford and the features used to build our re- tagger achieves 97% accuracy on Wall Street Jour- gression models. We evaluate and dis- nal texts, but only 85% accuracy on Twitter data cuss the results, where the mean abso- (Gimpel et al., 2011), to parsing, where double- lute error of the best performing method digit decreases in accuracy have been recorded on a three-point scale is 0.38 for tech- for 4 state-of-the-art parsers on social media texts nical and 0.42 for linguistic standard- (Petrov and McDonald, 2012). ness prediction. Even when using no Non-standard linguistic features have been language-dependent information sources, analysed both qualitatively and quantitatively our predictor still outperforms an OOV- (Eisenstein, 2013; Hu et al., 2013; Baldwin et al., ratio baseline by a wide margin. In addi- 2013) and they have been taken into account in tion, we show that very little manually an- automatic text processing applications, which ei- notated training data is required to perform ther strive to normalise non-standard features be- good prediction. Predicting standardness fore submitting them to standard text processing can help decide when to attempt to nor- tools (Han et al., 2012), adapt standard process- malise the data to achieve better annota- ing tools to work on non-standard data (Gimpel et tion results with standard tools, and pro- al., 2011) or, in task-oriented applications, use a vide linguists who are interested in non- series of simple pre-processing steps to tackle the standard language with a simple way of most frequent UGC-specific phenomena (Foster et selecting only such texts for their research. al., 2011). However, to the best of our knowledge, the level 1 Introduction of (non-)standardness of UGC has not yet been User-generated content (UGC) is becoming an in- measured to improve the corpus pre-processing creasingly frequent and important source of hu- pipeline or added to the corpus as an annotation 371 Proceedings of Recent Advances in Natural Language Processing, pages 371–378, Hissar, Bulgaria, Sep 7–9 2015. layer in comprehensive corpus-linguistic analy- million tokens. Currently, most of the collected ses. In this paper, we present an experiment in tweets were written between 2013 and 2014. It which we manually annotated and analysed the should be noted that the majority of these tweets (non-)standardness level of Slovene tweets, forum did not turn out to be user-generated content, but messages and news comments. The findings were rather news feeds, advertisements, and similar ma- then used to train a regression model that auto- terial produced by professional authors. matically predicts the level of text standardness For forum posts and news site comments, six in the entire corpus. We believe this information popular Slovene sources were chosen as they were will be highly useful in linguistic analyses as well the most widely used and contained the most texts. as in all stages of text processing, from more ac- The selected forums focus on the topics of mo- curate sampling to build representative corpora to toring, health, and science, respectively. The se- choosing the best tools for processing the collected lected news sites pertain to the national Slovene documents, either with tools trained on standard broadcaster RTV Slovenija, and the most popular language or with tools specially adapted for non- left-wing and right-wing weekly magazines. Be- standard language varieties. cause the crawled pages differ in terms of struc- The paper is organised as follows. Section 2 ture, separate text extractors were developed using presents the dataset. Section 3 introduces the fea- the Beautiful Soup1 module, which enables writ- tures used in subsequent experiments, while Sec- ing targeted structure extractors from HTML doc- tion 4 describes the actual experiments and their uments. This allowed us to avoid compromising results, with an emphasis on feature evaluation, corpus content with large amounts of noise typi- the gain when using external resources, and an cally present in these types of sources, e.g. adverts analysis of the performance on specific subcor- and irrelevant links. It also enabled us to structure pora. The paper concludes with a discussion of the texts and extract relevant metadata from them. the results and plans for future work. The forum posts contribute 47 million tokens to the corpus, while the news site comments amount 2 The Dataset to 15 million tokens. As with tweets, the majority of the collected comments were posted between This section presents the dataset used in subse- 2013 and 2014. The forum posts cover a wider quent experiments, starting with our corpus of time span, with similar portions of text coming user-generated Slovene and the sampling used to from each of the years between 2006 and 2014. extract the dataset for manual annotation. We then The corpus is also automatically annotated. The explain the motivation behind having two dimen- texts were first tokenised and the word tokens sions of standardness, and describe the process of normalised (standardised) using the method of manual dataset annotation. Ljubešic´ et al. (2014a), which employs character- 2.1 The Corpus of User-generated Slovene based statistical machine translation. The CSMT translation model was trained on 1000 keywords The dataset for the reported experiments is taken taken from the Slovene tweet corpus (compared from our corpus of user-generated Slovene, which to a corpus of standard Slovene) and their manu- currently contains three types of text: tweets, fo- ally determined standard equivalents. Then, using rum posts, and news site comments. The complete the models for standard Slovene the standardised corpus contains just over 120 million tokens. word tokens were PoS-tagged and lemmatised. Tweets were collected with the TweetCaT tool (Ljubešic´ et al., 2014b), which was constructed 2.2 Samples for Manual Annotation specifically for compiling Twitter corpora of For the experiments reported in this paper, we con- smaller languages. The tool uses the Twitter API structed a dataset containing individual texts that and a small lexicon of language specific Slovene were semi-randomly sampled from the corpus of words to first identify the users that predominantly tweets, forum posts and comments. The dataset tweet in Slovene, as well as their friends and was then manually annotated. followers. TweetCaT continuously collected the users’ tweets for a period of almost two years, To guarantee a balanced dataset, we selected also updating the list of users. This resulted in 1http://www.crummy.com/software/ the Slovene tweet subcorpus, which contains 61 BeautifulSoup/ 372 equal proportions (one third) of texts for each text ods, and relevant enough to linguists for filtering type. For forum posts and comments, we included relevant texts when researching non-standard lan- equal proportions of each of their six sources. guage. In order to obtain a balanced dataset in terms of language (non-)standardness from a corpus heav- 2.4 Manual Annotation and Resulting ily skewed towards standard language, we used Dataset a heuristic to roughly estimate the degree of text The annotators, who were postgraduate students (non-)standardness, which makes use of the cor- of linguistics, were presented with annotation pus normalisation procedure. For each text, we guidelines and criteria for annotating the two di- computed the ratio between the number of word mensions of non-standardness. Each given text tokens that have been changed by the automatic was to be annotated in terms of both dimen- normalisation, and its overall length in words. If sions using a score between 1 (standard) and 3 this ratio was 0.1 or less, the text was consid- (very non-standard), with 2 marking slightly non- ered as standard, otherwise it was considered as standard texts. We used a three-score system as non-standard. The dataset was then constructed the task is not (and can hardly be) very precisely so that it contained an equal number of ”standard” defined. Using this scale would also allow us to and ”non-standard” texts. It should be noted that better observe inter-annotator agreement. In addi- this is only an estimate, and that the presented tion, for learning standardness models, we used a method does not depend on an exact balance. Dif- regression model, which returns a degree of stan- ferent rough measures of standardness could also dardness, rather than a classification one. In this be taken, e.g.
Recommended publications
  • Acta Neophilologica 40
    ACTA NEOPHILOLOGICA 40. 1-2 (2007) Ljubljana MIRKOJURAK JAKOB KELEMINA ON SHAKESPEARE' S PLAYS SANDROJUNG WORDSWORTH' S "TINTERN ABBEY" AND THE TRADITION OF THE " HYMNAL" ODE JANES STANONIK MARCUS ANTONIUS KAPPUS: A REEVALUATION DARJA MAZI- LESKOVAR THE FIRST TRANSLATIONS OF LEATHERSTOCKING TALES JERNEJA PETRIC LOUIS ADAMJC'S "OLD ALIEN" AS A RELIC OF ETHNIC DIFFERENTIATION IN THE U.S.A. TATJANA VUKELIC UNDERSTANDING ZORA NEALE HURSTON' S THEIR EYES WERE WATCHING GOD BRANKA KALOGJERA "OLD" VS. "NEW" ETHNICITIES AND MULTIPLE IDENTITIES IN SANDRA CISNEROS' CARAMELO MAJDASAVLE INDIRECT NARRATION: ON CONRAD' S HEART OF DARKNESS AND FITZGERALD' S THEGREATGATSBY DARJA MARINSEK FEMALE GENITAL MUTILATION IN AFRICAN AND AFRICAN-AMERICAN WOMEN' S LITERATURE BRIGITA PAYSIC TIME AS THE FIFTH ELEMENT IN MARGARET LAURENCE'S MANAWAKA CYCLE JOHANN GEORG LUGHOFER EINE ANNAHERUNG AN HAND DES FALLBEISPIELS BERTHA VON SUTTNER MIHA PINTARIC THE ROLE OF VIOLENCE IN THE ROMANCES OF CHRETIEN DE TROYES SPELA ZAKELJ L' IRONIE DANS L' ALLEGORIE CHEZ RUTEBEUF PATRIZIA FARINELLI SUL FANTASTICO NELLA NARRATIVA Dl TABUCCHI UROS MOZETIC FROM DOUBLEVALANT TO MONOVALANT DISCOURSE: THE ROLE OF THE TRANSLATOR MIRKOJURAK BERNARD HICKEY. IN MEMORIA M ACTA NEOPHILOLOGICA 40. 1-2 (2007) Ljubljana MIRKOJURAK JAKOB KELEMINA ON SHAKESPEARE'S PLAYS ........... .... .. .. ... ........... .... ....... ..... ......... .. 5 SANDROJUNG WORDSWORTH'S "TINTERN ABBEY" AND THE TRADITION OF THE "HYMNAL:' ODE ....... 51 JANES STANONIK MARCUS ANTONIUS KAPPUS: A REEVALUATION .......................................................... 61 DARJA MAZI- LESKOVAR THE FIRST TRANSLATIONS OF LEATHERSTOCKING TALES .............................................. 75 JERNEjA PETRIC LOUIS ADAMIC'S "OLD ALIEN" AS A RELIC OF ETHNIC DIFFERENTIATION IN THE U.S .A. 89 TATJANA VUKELIC UNDERSTANDING ZORA NEALE HURSTON'S THEIR EYES WERE WATCHING GOD ............
    [Show full text]
  • EBU Tech 3232-1982 Displayable Character Set for Broadcast Teletext
    DISPLAYABLE CHARACTER SETS FOR BROADCAST TELETEXT Tech. 3232-E Second edition - June 1982 CONTENTS 1 Introduction........................................................................................................................................3 2. EBU inquiry....................................................................................................................................4 3. Analysis of the replies from EBU members...................................................................................4 3.1 Minimum number of characters required ..................................................................................4 3.2 Two or more languages on one page .......................................................................................6 4. Establishment of multilingual character repertoires.......................................................................8 4.1 Constraints that have to be taken into account.........................................................................8 4.2 Significance of the repertoires ..................................................................................................8 4.3 Character repertoires for languages written in Latin-based-alphabets.....................................9 4.4 The combination of non-Latin-based alphabets together with a limited selection of Latin- based characters - ..................................................................................................................11 5. Conclusion ...................................................................................................................................14
    [Show full text]
  • Country Report on France ______
    DIRECTORATE-GENERAL FOR INTERNAL POLICIES POLICY DEPARTMENT FOR CITIZENS' RIGHTS AND CONSTITUTIONAL AFFAIRS LEGAL AFFAIRS Life in cross-border situations in the EU A Comparative Study on Civil Status ANNEX III - COUNTRY REPORT B FRANCE This Report has been prepared by Lukas Rass-Masson for Milieu Ltd, under contract to the European Parliament, Directorate-General Internal Policies, Directorate C for Citizens Rights and Constitutional Affairs (Contract No IP/C/JURI/IC/2012-019). The views expressed herein are those of the consultants alone and do not represent the official views of the European Parliament. Milieu Ltd (Belgium), 15 rue Blanche, B-1050 Brussels, tel: 32 2 506 1000; fax 32 2 514 3603. PE 474.395 EN Policy Department for Citizens’ Rights and Constitutional Affairs _________________________________________________________________________________________ CONTENTS LIST OF ABBREVIATIONS.......................................................................... 3 1. GENERAL RULES .................................................................................. 4 1.1 Civil Status Registration System ........................................................ 4 1.1.1 Type of civil status registration system.......................................... 4 1.1.2 Organisation.............................................................................. 4 1.1.3 Accessibility............................................................................... 6 1.1.4 Notion of civil status ..................................................................
    [Show full text]
  • International Register of Coded Character Sets to Be Used with Escape Sequences for Information Interchange in Data Processing
    INTERNATIONAL REGISTER OF CODED CHARACTER SETS TO BE USED WITH ESCAPE SEQUENCES 1 Introduction 1.1 General This document is the ISO International Register of Coded Character Sets To Be Used With Escape Sequences for information interchange in data processing. It is compiled in accordance with the provisions of ISO/IEC 2022, "Code Extension Technique" and of ISO 2375 "Procedure for Registration of Escape Sequences". This International Register contains coded character sets which have been registered in accordance with procedures given in ISO 2375. Its purpose is to identify widely used coded character sets and associate with each a unique escape sequence by means of which it can be designated according to ISO/IEC 2022 and ISO/IEC 4873. The publication of this International Register should promote compatibility in international information interchange and avoid duplication of effort in developing application-oriented coded character sets. Registration provides an identification for a coded character set but implies nothing about its status; it may or may not be part of a standard of an international, national or a corporate body. However, if such a standard is published subsequently to the registration, it would be appropriate for the escape sequence identifying the character set to be specified in the standard. If it is desired to register a set, application should be made to the Registration Authority through an appropriate Sponsoring Authority as specified in ISO 2375. Any character set can be a candidate for registration if it meets the requirements of ISO 2375. The Registration Authority ascertains that the proposals received are formally in accordance with this International Standard, technically in accordance with ISO/IEC 2022, and, where applicable, with ISO/IEC 646 and ISO/IEC 4873, and meet the presentation practice of the Registration Authority.
    [Show full text]
  • Of Basic Slovene
    7 DAYS OF BASIC SLOVENE W E E K 1 A big thanks to all my readers around the globe, for learning Slovene with me! And a special thanks to my cousin, Jeannie Banh, for creating the illustrations for my blog! Learning Slovene can be fun and simple, it all depends on your mindset. Trust in yourself and practice as much as you can! VENE SL UT O B A The official language of Slovenia is Slovene. Worldwide, there is about 2.5 millions speakers, and soon you can become one too! Slovene is categorized as a Western South Slavic language and has its own distinct characteristics. In Slovenia, there are around 50 dialects, which are divided into seven regions. The Slovene alphabet has three extra letters: č, š, ž but four letters Slovene is a very precise less: q, w, x and y. and detailed language. It has three genders: The letter č sounds like ch - chocolate masculine, feminine and (čokolada). The letter š sounds like sh neuter. - shawl (šal). And the letter ž is pronounced as g, like giraffe (žirafa). And three grammatical numbers: singular, dual In Slovene, every syllable is important, and plural. This applies especially those super scary long to nouns, verb words with many š, č. conjugation, adjectives and so on. slaščičarna "sla-sh-chi-char-na" (confectionery) Slovene has six declensions : nominative, genitive, dative, accusative, locative and instrumental. They are often referred by numbers, from 1 to 6. Declensions are probably one of the most confusing concept of the language. Fortunately, you can easily get by with knowing a few basics.
    [Show full text]
  • United Nations Group of Experts on Geographical Names Working
    United Nations Group of Experts on Working Paper Geographical Names No. 49 Eighteenth Session Geneva, 12-23 August 1996 Item 5 of the Provisional Agenda REPORTS OF THE DIVISIONS REPORT OF THE EAST CENTRAL AND SOUTH-EAST EUROPE DIVISION Submitted by B&A Pokoly, Chairman of the Division Contents 1. Institutional News on Bodies of standardization of Geo- graphical Names in the Member Countries. Short Report of the Activities of These Bodies Since September, 1993. 2. Toponymic Data Files and Gazetteers 3. Romanization Systems 4. Country Names 5. Toponymic Guidelines *for Map and Other Editors 0.r Exonyms 7. Standardization in Multilingual Areas 8. Implementation of Resolutions and the Aims and Functions of UNGEGN Introduction Hungary has been serving as Chairman of the Division since the 16th Session of UNGEGN (4 December, 1992, New York). In its function it convened the Division for its 12th session in Sep- tember, 1993. Reporting on information received on that event was contained in WP No. 6. of the 17th Session of UNGEGN. The 13th session of the Division was held in Budapest on 2-4 April, 1996. Six countries of the division, apart from the host country, were represented: Albania, Czech Republic, Po- land, Slovakia, Slovenia and Ukraine. This report contains in- formation based first of all on those presented at this Divi- sion session. Bulgaria sent written information to the Chair- man following the session. As a matter of course the word *session' in itself thereinafter always refers to this event. 1. Institutional News on Bodies of Standardization of Geo- graphical Names in the Member Countries.
    [Show full text]
  • Evaluation of Finite State Morphological Analyzers Based on Paradigm Extraction from Wiktionary
    Evaluation of Finite State Morphological Analyzers Based on Paradigm Extraction from Wiktionary Ling Liu and Mans Hulden Department of Linguistics University of Colorado [email protected] Abstract network (RNN) encoder-decoder model to gener- ate an inflected form of a lemma for a target mor- Wiktionary provides lexical information phological tag combination. The SIGMORPHON for an increasing number of languages, in- 2016 shared task (Cotterell et al., 2016) of mor- cluding morphological inflection tables. It phological reinflection received 11 systems which is a good resource for automatically learn- used various approaches such as conditional ran- ing rule-based analysis of the inflectional dom fields (CRF), RNNs, and other linguistics- morphology of a language. This paper per- inspired heuristics. Among all the methods, one forms an extensive evaluation of a method standard technology is to use finite-state transduc- to extract generalized paradigms from ers, which are more interpretable and manually morphological inflection tables, which can modifiable, and thus more easily incorporated into be converted to weighted and unweighted and made to assist linguists’ work. Hulden (2014) finite transducers for morphological pars- presents a method to generalize inflection tables ing and generation. The inflection tables into paradigms with finite state implementations of 55 languages from the English edition and Forsberg and Hulden (2016) subsequently in- of Wiktionary are converted to such gen- troduce how to transform morphological inflection eral paradigms, and the performance of tables into both unweighted and weighted finite the probabilistic parsers based on these transducers and apply the transducers to parsing paradigms are tested. and generation, the result of which is very promis- ing, especially for facilitating and assisting lin- 1 Introduction guists’ work in addition to applications to mor- phological parsing and generation for downstream Morphological inflection is used in many lan- NLP tasks.
    [Show full text]
  • Drago Jančar; to Write in the Language of a Small Nation
    KOSMOPOLIS. International Festival of Literature 14 to 19 September 2004 Barcelona - CCCB CAFÉ EUROPE – HAVANA Literary Café Drago Jančar; To Write in the Language of a Small Nation Translated by Tamara M. Soban When giving a reading abroad, I am often asked by my foreign readers what it actually means to write in the language of a small nation. How does it feel, for an author, to write in a language that is spoken and read at best by two and a half million? In recent years, since it became clear that quite a few new “small” languages would appear in the European Union, and more notably, in European culture, this cute question was joined by another, a favorite of the members of the press: Aren’t you concerned about the survival of Slovene culture, in particular the Slovene language? In Slovenia too, I have heard not just anybody, but a Slovene author say: What is the point of writing in Slovene, since these works might only survive in some library or other, to be studied by eccentric scholars, similarly as dinosaurs are studied today? Somewhat irritated, I replied to my fellow writer that all my books have been translated into so- called “big” languages, so he need not worry about me. But that is of course no answer. Librarians will tell you that the paper used in book-printing today is very short-lived, it disintegrates, turning some books into dust in a matter of decades. Just as we can only guess what treasures of ancient Greek drama and philosophy disappeared in the flames that engulfed the library of Alexandria, it may be up to conjecture in a hundred years’ time what Slovene literature used to be like, even if it is translated into the languages of big nations.
    [Show full text]
  • International Review 33
    Statisti čni letopis Republike Slovenije 2009 Mednarodni pregled Statistical Yearbook of the Republic of Slovenia 2009 International review 33 33. MEDNARODNI PREGLED INTERNATIONAL REVIEW METODOLOŠKA POJASNILA METHODOLOGICAL EXPLANATIONS Angleška imena držav in pripadajo če kode smo povzeli po English country names and their codes were taken over from the mednarodnem standardu ISO 3166, ki ga izdaja Mednarodna international standard ISO 3166, which is published by the organizacija za standardizacijo, po stanju 18. 08. 2009 1) . Slovenska International Organisation for Standardisation, as of 18 Avgust 2009 1) . imena so povzeta po standardu ISO 3166, ki ga je objavil tedanji Urad Slovene country names were taken over from the ISO 3166 standard, Republike Slovenije za standardizacijo in meroslovje. Spremembam pa which was publishe d by the then Standards and Metrology Institution of sledimo na podlagi dopolnil, ki nam jih posreduje Slovenski inštitut za the Republic of Slovenia. Changes are followed on the basis of standardizacijo. V sezn amu so uporabljena uradna kratka imena držav. amendments transmitted by the Slovenian Institute for Standardization. Države so prikazane po celinah ter razvrš čene po abecednem redu v Official short names of countries are used. Countries are shown by slovenskem jeziku. Vzporedno sta navedena tudi uradno angleško co ntinents and in alphabetical order of the Slovene alphabet. Official kratko ime države in pripadajo ča dvomestna črkovna koda. English short names of countries and their two-figure alphabet codes are also stated. V tabelarnem pregledu so države navedene le v slovenš čini. Glede na In tables only Slovene country names are stated. The selection of dostopnost podatkov in zasnovo poglavja je v posameznih tabelah countries in individual tables is narrowed depending on the availability izbor držav zožen.
    [Show full text]
  • EBU Tech 3232A-1982 Displayable Character Set for Teletext, Apps
    EBU Tech 3232: Displayable character sets for broadcast teletext 17 APPENDIX 1 Example of the forms used in the EBU Inquiry Organisation: . Language: . Proposed basic character set 1. Lower case letters a b c d e f g h i j k l m n o p q r s t u v w x y z 2. Upper case letters A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 3. Punctuations/abbreviations ! " # $ % ' ( ) * £ , - . / : ; < > ? — & 4. Accented letters 5. Symbols 1 2 3 4 5 6 7 8 9 0 6. Mathematical symbols + ÷ = 7. Pictograms ? ? ? 8. Others Additional characters Romance languages (French, Italian, Spanish, Portuguese) 1. Lower case letters ã õ ñ ç œ 2. Upper case letters Ã Õ Ñ 3. Punctuations/abbreviations ¡ ¿ \ | ¯ _ 4. Accented letters à á â è é ê ë ì í î ï ò ó ô ù ú û 5. Symbols 6. Mathematical symbols ± ° 7. Pictograms ? 8. Others Other languages that are likely to be required on the same broadcast page: . Notes: · Please complete a separate form for each language likely to be required for teletext broadcasts by your organisation. · Delete all redundant characters in both tables by a diagonal line. · Insert all additional characters required in the appropriate section of the lower table, taking care to position them accurately within the rectangles. · Only non-accented letters should be inserted in rows 1 and 2. · No accented upper-case letters have been included in the tables. If any of these are considered to be essential, they should be inserted in row 4.
    [Show full text]
  • The Slovene Language in the Digital Age Slovenski Jezik V
    White Paper Series Zbirka Bela knjiga THE SLOVENE SLOVENSKI LANGUAGE IN JEZIK V THE DIGITAL DIGITALNI AGE DOBI Simon Krek White Paper Series Zbirka Bela knjiga THE SLOVENE SLOVENSKI LANGUAGE IN JEZIK V THE DIGITAL DIGITALNI AGE DOBI Simon Krek “Jožef Stefan” Institute, Amebis, d. o. o. Georg Rehm, Hans Uszkoreit (urednika, editors) PREDGOVOR PREFACE Bela knjiga je del zbirke, s katero širimo zavedanje o is white paper is part of a series that promotes jezikovnih tehnologijah in o možnostih, ki jih ponu- knowledge about language technology and its poten- jajo. Namenjena je izobraževalcem, novinarjem, poli- tial. It addresses journalists, politicians, language com- tikom, jezikovnim skupnostim in vsem ostalim, ki jih munities, educators and others. e availability and zanima jezik. Dostopnost in raba jezikovnih tehnologij use of language technology in Europe varies between v Evropi se razlikuje od jezika do jezika. V skladu languages. Consequently, the actions that are required s tem se dejanja, potrebna za podporo raziskovanju to further support research and development of lan- in razvoju, med seboj razlikujejo in so odvisna od ra- guage technologies also differs. e required actions zličnih dejavnikov, na primer od zahtevnosti jezikov depend on many factors, such as the complexity of a ali velikosti njihovih skupnosti. given language and the size of its community. V projektu META-NET, mreži odličnosti, ki jo fi- META-NET, a Network of Excellence funded by the nancira Evropska komisija, smo analizirali obstoječe European Commission, has conducted an analysis of stanje na področju jezikovnih virov in tehnologij (glej current language resources and technologies in this str. 79).
    [Show full text]
  • ATLANT and SLOVENE NATIONAL CONSCIOUSNESS in the SECOND HALF of the 19Th CENTURY ATLANT in SLOVENSKA NACIONALNA ZAVEST V 2
    acta46-2.qxd 1.12.2006 12:19 Page 251 Acta geographica Slovenica, 46-2, 2006, 251–283 ATLANT AND SLOVENE NATIONAL CONSCIOUSNESS IN THE SECOND HALF OF THE 19th CENTURY ATLANT IN SLOVENSKA NACIONALNA ZAVEST V 2. POLOVICI 19. STOLETJA Mimi Urbanc, Jerneja Fridl, Drago Kladnik, Drago Perko The folder where the original maps of Atlant were found (photograph: Milan Oro`en Adami~). Mapa, v kateri so bili najdeni originalni zemljevidi Atlanta (fotografija: Milan Oro`en Adami~). acta46-2.qxd 1.12.2006 12:19 Page 252 Mimi Urbanc, Jerneja Fridl, Drago Kladnik, Drago Perko, Atlant and Slovene national consciousness in the second half … Atlant and Slovene national consciousness in the second half of the 19th century UDC: 912.44(497.4)"19" COBISS: 1.01 ABSTRACT: The geographic literature made an important contribution to the development of national consciousness among Slovenes in the 19th century, as well as to the reinforcement of Slovene identity after Slovenia's independence in 1991. A typical example of this kind of geographic publication is Atlant, the first atlas of the world in the Slovene language, which was published in installments from 1869 to 1877. Atlant followed the concept of a Unified Slovenia, and this idea boosted the confidence of Slovenes, the bulk of whom were incorporated into Austria-Hungary as they entered the stage of European politics as a nation for the first time. With the publication of Atlant, a number of geographical names were trans- lated into Slovene or Slovenized for the first time. The geographic, linguistic, and political conditions in which Slovenes lived, as well as the relations at the time between Slovene, German, and Slavic languages, are reflect- ed in the way foreign place names were adapted.
    [Show full text]