Computational Approaches to the Comparison of Regional Variety Corpora – Prototyping a Semi-automatic System for German

Dissertation approved by the Philosophisch-Historische Fakultät (Faculty of Philosophy and History) of the Universität Stuttgart for the award of the degree of Doctor of Philosophy (Dr. phil.)

Submitted by Stefanie Anstein from Rottweil

Main referee: Prof. Dr. phil. habil. Ulrich Heid
First co-referee: Prof. Dr. phil. habil. Achim Stein
Second co-referee: Univ.-Prof. Mag. Dr. Gerhard Budin

Date of the oral examination: 31 January 2013

Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart

2013

Declaration

I hereby declare that I have written this dissertation independently, using the sources listed and under academic supervision.

(Stefanie Anstein)

Acknowledgements

I would like to express my sincere thanks to everyone who has accompanied me over the past years, from and in the most varied directions, and who has supported me in so many different ways.

Special thanks go to my main supervisor Ulrich Heid, who guided me excellently with his inexhaustible wealth of knowledge and experience. I greatly appreciate his extremely valuable feedback and advice, for which he always took a great deal of time. I warmly thank the co-referees Achim Stein and Gerhard Budin for their willingness to review and examine this thesis and for their helpful comments, and I thank Rainer Bäuerle for stepping in at short notice.

This thesis was written during my time at the Institut für Fachkommunikation und Mehrsprachigkeit of the EURAC in Bozen, to whose coordinator Andrea Abel I am also very grateful, both for her suggestions on content and for her organisational flexibility.

I thank all the other professors, lecturers, colleagues, and friends at the IMS, at the EURAC, and elsewhere who taught me a great deal, helped me, and gave me much to take along on my way, in particular Heidi Abfalterer, the C4 group, Chris Culy, Henrik Dittmann, Grzegorz Dogil, Hans Drumbl, Stefan Evert, Peter Farbridge, Arne Fitschen, Hannah Kermes, Adam Kilgarriff, Jonas Kuhn, Anke Lüdeling, Margit Oberhammer, Sebastian Padó, Uwe Reyle, Helmut Schmid, Sabine Schulte im Walde, Marcello Soffritti, Egon Stemle, Barbara Taferner, Renata Zanin, and Heike Zinsmeister.

Thank you for the comforting environment and the ever refreshing balance to Anke & Micha, Anne & Nat, the Bayer family, Britta, Fabienne, Mr Fischl, Frank & Anna, Franzi, Goenkaji, Gotte & Katharina, Harald, Katrin, Lionel, Magdalena & Michi, Monika, Nadi & Diana, Omar & Smail, Regi & Sims, Sandra, Simone, and Stef.

And simply my most heartfelt thanks for everything to my parents and to Gerhard, Kati, and Verena.

Publications

Aspects of the research described here also appear in the following peer-reviewed publications:

Abel, Andrea & Anstein, Stefanie (2008): ‘Approaches to Computational Lexicography for German Varieties’. In: Proceedings of the 13th EURALEX International Conference; pp. 251–260; Barcelona.

Abel, Andrea & Anstein, Stefanie (2011): ‘Korpus Südtirol – Varietätenlinguistische Untersuchungen’. In: Korpusinstrumente in Lehre und Forschung, ed. by Abel, Andrea & Zanin, Renata; pp. 29–53; Bolzano: Bolzano University Press.

Abel, Andrea; Anstein, Stefanie & Ties, Isabella (2008): ‘Ansätze einer intralingualen kontrastiven Korpuslinguistik – aufgezeigt am Beispiel administrativer Rechtstexte aus Deutschland, Österreich und Südtirol’. In: Formulierungsmuster in deutscher und italienischer Fachkommunikation. Intra- und interlinguale Perspektiven, ed. by Heller, Dorothee; Linguistic Insights; pp. 243–270; Bern: Peter Lang.

Anstein, Stefanie (2007): ‘Korpuslinguistische Fallstudien zum Südtiroler Standardschriftdeutsch - das Projekt “Korpus Südtirol”’. Linguistik online; vol. 32. http://www.linguistik-online.org/32 07/anstein.pdf, last accessed 2012-10-14.

Anstein, Stefanie (2009a): ‘Vis-À-Vis – a System for the Comparison of Linguistic Varieties on the Basis of Corpora’. In: Proceedings of the 2nd Colloquium on Lesser Used Languages and Computer Linguistics (LULCL); pp. 59–64. http://www.eurac.edu/Org/LanguageLaw/Multilingualism/Projects/ LULCL II proceedings.htm, last accessed 2012-10-16.

Anstein, Stefanie (2009b): ‘Vis-À-Vis - a System to Compare Variety Corpora’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/cl2009, last accessed 2012-10-16.

Anstein, Stefanie (2012): ‘Comparing Variety Corpora with Vis-À-Vis – a Prototype System Presentation’. In: Proceedings of the 11th Conference on Natural Language Processing (KONVENS); pp. 243–247; Vienna. http://www.oegai.at/konvens2012/proceedings/35 anstein12p, last accessed 2012-10-16.

Anstein, Stefanie & Glaznieks, Aivars (2011): ‘Comparing Geographical and Learner Varieties on the Basis of Corpora’. In: Comparative Methods and Analysis in the Language Science. Proceedings of the 3rd edition of JéTou; pp. 179–188; Toulouse. http://jetou2011.free.fr/ARTICLES/S4A2.pdf, last accessed 2012-10-17.

Contents

List of abbreviations
List of figures
List of tables
German abstract
English abstract

1 Introduction and background
  1.1 Objectives and research questions
    1.1.1 Aims and scope of this work
    1.1.2 Research questions
    1.1.3 Structure of this thesis
  1.2 Background in the relevant research areas
    1.2.1 Linguistics and language variation
      1.2.1.1 Language phenomena and their investigation
      1.2.1.2 Linguistic variation
    1.2.2 Language in South Tyrol
      1.2.2.1 History and current situation
      1.2.2.2 Standards and norms
    1.2.3 Computational approaches to corpus studies
      1.2.3.1 Inter-disciplinarity
      1.2.3.2 Corpora and their linguistic annotation
      1.2.3.3 Data extraction from corpora
      1.2.3.4 Comparative corpus linguistics
      1.2.3.5 Statistics for comparing corpora
      1.2.3.6 Evaluation of corpus processing tools

2 Related work and research desiderata
  2.1 Resources and methods for corpus comparison
    2.1.1 Variety corpora and dictionaries
    2.1.2 Comparative corpus studies
      2.1.2.1 Studies on corpus comparability
      2.1.2.2 General variety studies
    2.1.3 Computational systems for corpus studies
      2.1.3.1 Corpus annotation tools
      2.1.3.2 Corpus analysis and comparison tools
  2.2 Investigations on South Tyrolean German
    2.2.1 South Tyrolean German variety linguistics
    2.2.2 Linguistic characteristics of South Tyrolean German
  2.3 Research desiderata derived from the state of the art

3 The system Vis-À-Vis
  3.1 Requirements for a corpus comparison system
  3.2 Methodology and system architecture
    3.2.1 Approaches and methods
    3.2.2 Workflow and modules
  3.3 System functionalities and usage modes
    3.3.1 Technical and functional specification
      3.3.1.1 Technical system features
      3.3.1.2 Input verification
      3.3.1.3 Comparability check
      3.3.1.4 Annotation
      3.3.1.5 Analysis levels
      3.3.1.6 Linguistic filtering
      3.3.1.7 Statistical comparison
      3.3.1.8 Output presentation
    3.3.2 Coverage and limitations of the system
    3.3.3 System usage scenarios
  3.4 System output
    3.4.1 General corpus comparison output
    3.4.2 Output by analysis levels

4 Quantitative and qualitative system evaluation
  4.1 Quantitative system performance
    4.1.1 Evaluation procedures
    4.1.2 Evaluation data and gold standard
    4.1.3 Quantitative evaluation results
  4.2 Qualitative case studies
    4.2.1 Newspaper corpora
    4.2.2 Web corpora
    4.2.3 Learner corpora
  4.3 Discussion of evaluation results

5 Outlook and conclusion
  5.1 Potential further research
    5.1.1 General resource and system enhancements
    5.1.2 Refinement of analysis levels
  5.2 Summary
    5.2.1 Principal findings of this work
    5.2.2 Contributions to the relevant research areas

A System documentation

B Gold standard list of Südtirolisms
  B.1 Primary Südtirolisms
  B.2 Extract of secondary Südtirolisms

C Online resources

Bibliography

List of abbreviations

ADJ     adjective
ADV     adverb
AM      association measure
AT      Austria
BNC     British National Corpus¹
CH      Switzerland
CQP     Corpus Query Processor
CWB     Corpus Workbench
DE      Germany
DOLO    Dolomiten newspaper corpus
DWDS    Digitales Wörterbuch der Deutschen Sprache; Digital dictionary of the German language
f       frequency
FR      Frankfurter Rundschau newspaper corpus
GUI     graphical user interface
ICE     International Corpus of English
IT      Italy
KWIC    keyword in context
LNRE    large number of rare events
LL      log-likelihood
MT      machine translation
MWE     multi-word expression
NE      named entity
NER     named entity recognition
N       noun
NLP     natural language processing
OBJ     object
ÖWB     Österreichisches Wörterbuch; Austrian dictionary
PHEN    phenomenon
PoS     part of speech
ppm     parts per million
PRED    predicate
PREP    preposition
ST      South Tyrol
SUBJ    subject
TEI     Text Encoding Initiative
TSV     tab-separated values
TTR     type-token ratio
V       verb
VWB     Variantenwörterbuch des Deutschen; Variant dictionary of German
XML     extensible markup language

¹ Names are marked by italics.

List of figures

1.1 Examples of bilingual public signs in South Tyrol

3.1 Overall architecture of Vis-À-Vis
3.2 Vis-À-Vis GUI – corpus upload or specification
3.3 Vis-À-Vis GUI – corpus-external knowledge upload
3.4 Vis-À-Vis GUI – selection of analysis level
3.5 Vis-À-Vis GUI – results page
3.6 Vis-À-Vis GUI – view and download window
3.7 Vis-À-Vis progress report on the command line
3.8 Vis-À-Vis command line output for the uni-gram level
3.9 Vis-À-Vis command line output for non-lemmatised words
3.10 Vis-À-Vis result table for the uni-gram level
3.11 Vis-À-Vis command line output for the bi-gram level
3.12 Vis-À-Vis result table for the bi-gram level (ADJ+N)
3.13 Vis-À-Vis command line output for the syntactic level
3.14 Vis-À-Vis result table for the syntactic level
3.15 Vis-À-Vis syntactic phenomenon extraction – KWIC results

4.1 DOLO-FR precision-recall – baseline vs. Vis-À-Vis
4.2 DOLO-FR F-score – baseline vs. Vis-À-Vis
4.3 DOLO-FR precision-recall – Sketch Engine vs. Vis-À-Vis
4.4 DOLO-FR F-score – Sketch Engine vs. Vis-À-Vis
4.5 DOLO-FR uni-gram result classification – histogram
4.6 Vis-À-Vis syntactic phenomena in the Kolipsi L2 learner corpus
4.7 Vis-À-Vis syntactic phenomena in the KoKo-TH learner corpus

List of tables

2.1 Key for source indications in example tables
2.2 Examples for inflection
2.3 Examples for word formation – compounding
2.4 Examples for lexical entities
2.5 Examples for loanwords and loan formations
2.6 Examples for cross-level equivalent constructions
2.7 Examples for morpho-syntactic phenomena
2.8 Examples for ADJ+N co-occurrences
2.9 Examples for (N+)PREP+N co-occurrences
2.10 Examples for V+PREP co-occurrences
2.11 Examples for PRED+OBJ co-occurrences
2.12 Examples for semantic differentiations
2.13 Example for idiomatic expressions

3.1 Vis-À-Vis runtime
3.2 Linguistic filter tags used by Vis-À-Vis

4.1 DOLO-FR precision % – baseline vs. Vis-À-Vis
4.2 DOLO-FR recall % – baseline vs. Vis-À-Vis
4.3 DOLO-FR F-score % – baseline vs. Vis-À-Vis
4.4 Comparison of result ranks yielded by AMs and ppm
4.5 DOLO-FR frequencies for lexicographically relevant findings
4.6 DOLO-FR uni-gram result classification

German abstract

Summary of the thesis

In the following, the aims and research questions of this thesis and its results are presented in an overview.

Introduction and background

Regional varieties of pluri-centric languages such as German generally have a great deal in common with respect to their structure and the linguistic phenomena that occur. It is all the more important to capture the differences, for instance for the documentation of varieties, for applications in lexicography or in language teaching, and more generally for fostering language awareness.

The aims of this thesis are to explore ways of supporting and automating manual research in variety linguistics with computational methods. To this end, a feasibility study was carried out in which the prototype of a system for the semi-automatic comparison of variety corpora of written language was developed: Vis-À-Vis supports variety linguists² in the manual analysis of regional varieties and allows a more efficient and more objective comparison. It is based on collections of authentic language data, which are being prepared for computational linguistic use in a growing number of initiatives, such as Korpus Südtirol, which prompted the development of Vis-À-Vis.

The research questions to be answered concern the possible extent of this support: To what extent can the comparison of regional varieties on the basis of corpora be automated with empirical computational methods? Which levels of linguistic description can be covered? Can the methods be transferred from lower to higher levels, given appropriate annotation?

The following related questions were taken into account: For which input do the methods yield useful results? Which corpus features (e. g. comparability or size) influence the quality of the results, and to what degree? How useful can these results be, and can they be sorted reliably by an automatically computed relevance value? In addition, the thesis examined how far existing corpus processing tools can be used and integrated and how much adaptation to regional varieties is necessary. The final research question concerns the transferability of the system to other pluri-centric languages.

² Plural forms include both feminine and masculine forms.

Relevant previous work and research desiderata

In variety linguistics, there is a large number of manual individual studies on regional phenomena, some of them very specific, mostly at the lexical level. In recent years, automated individual studies have increasingly been carried out as well, based on corpora that are being compiled for pluri-centric languages.

In computational linguistics, a number of semi-automatic systems have been developed that analyse and compare linguistic phenomena on various levels of description in corpora, e. g. for different text genres or years of publication.

The resulting desiderata for variety linguistic research comprise

• comprehensive systematic studies on several levels of linguistic description,
• in direct comparison of variety corpora,
• with state-of-the-art computational methods and tools.

The system Vis-À-Vis

Vis-À-Vis analyses and compares linguistic phenomena from variety corpora in order to produce lists of potentially relevant differences as a basis for detailed manual analyses. Existing linguistic knowledge is taken into account, for example in the form of regionalisms that are already known and lexicographically recorded, or of proper names typical of a variety.

The modular system combines pattern-based with statistical approaches and integrates available standard tools, for instance for part-of-speech annotation and corpus querying, which have been linked with adapted and newly developed tools.

After the input corpora have been annotated at the word level and indexed for querying, phenomena are extracted on three levels of linguistic description. The uni-grams, bi-grams, and exemplary syntactic phenomena collected in this way are then filtered with respect to their relevance and sorted by means of statistical association measures in order to identify likely candidates for regionalisms not yet recorded. The output of the system consists of general information about the corpora used and their comparability as well as of concrete phenomenon tables, filtered and sorted by potential relevance, with quantitative data from the two corpora.

Vis-À-Vis is available both through a user-friendly graphical interface on the website of the initiative Korpus Südtirol and for downloading. In addition to the graphical interface, the downloadable version allows Linux users to run the system on the command line.

Quantitative and qualitative system evaluation

The main results of the evaluation of Vis-À-Vis consist, on the one hand, of quantitative data on precision and recall, which for the uni-gram level in newspaper corpora show a clear positive difference compared with the baseline and also with well-known commercial systems. On the other hand, first applied studies with three different types of corpora (newspaper corpora, web corpora, and learner corpora) yielded qualitative results that make it possible to check and supplement lexicon entries for variant dictionaries and thus substantiate the usefulness of Vis-À-Vis.
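As an illustration of the quantitative measures named above, the following minimal sketch computes precision, recall, and F-score for the top n entries of a ranked candidate list against a gold-standard set. It is not taken from Vis-À-Vis: the function name, the cut-off parameter n, and the toy items are assumptions made purely for illustration.

```python
def precision_recall_f(ranked_candidates, gold, n):
    """Score the top-n ranked candidates against a gold-standard set.

    ranked_candidates: list of items, best candidate first (hypothetical input).
    gold: set of items regarded as correct (e.g. known regionalisms).
    n: cut-off rank; only the first n candidates are evaluated.
    """
    top = ranked_candidates[:n]
    hits = sum(1 for item in top if item in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data with placeholder items, not real evaluation data.
candidates = ["a", "b", "c", "d"]
gold_standard = {"a", "c", "e"}
p, r, f = precision_recall_f(candidates, gold_standard, n=2)
print(p, r, f)
```

Evaluating at several cut-off ranks in this way is one common route to a precision-recall curve for ranked output; whether the thesis uses exactly this procedure is not stated in this summary.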

Outlook and conclusion

Resources and tools for computational variety linguistics can be increasingly refined, for instance through results from further studies with systems such as Vis-À-Vis. A concrete further development of the system is possible both on the technical level and with respect to content.

With regard to the research questions mentioned above, it can be confirmed that supporting manual variety corpus comparison is very feasible on the lower levels of linguistic description, whereas it becomes considerably more complex on higher levels. The main prerequisite for useful results is the comparability of the input corpora. Filtering and sorting the results with statistical association measures according to their probable relevance for variety differences is possible and worthwhile. Available standard tools could be integrated with an acceptable amount of adaptation; moreover, owing to its modularity, the system is directly transferable to other pluri-centric languages with a similar structure.

With respect to the research desiderata, it can be confirmed in summary that variety linguistic research can be supported by this system prototype, which is specifically tailored to it. The results of the system for semi-automatic corpus comparison serve variety linguistics as a filtered basis for detailed qualitative analyses, which result e. g. in concrete additions to dictionary entries. The authentic language data increasingly available in the form of corpora are used efficiently, which contributes substantially to improving concrete applications in variety linguistics. In a further step, this serves variety documentation, lexicography, and didactics as well as initiatives to foster language awareness.

In contrast to other corpus comparison systems, Vis-À-Vis is specifically tailored to regional varieties. It is easily accessible and user-friendly, and it can be adapted directly with respect to several aspects.

Pursuing and refining the approach presented in this thesis will yield extremely useful results for the investigation and description not only of regional but also of other linguistic varieties, which makes a substantial contribution to linguistic research.

English abstract

Introduction and related work

Regional varieties of pluri-centric languages such as German are generally very similar with respect to their structure and the linguistic phenomena that occur. The extraction of differences is thus crucial e. g. for variety documentation, lexicography, or didactics.

In this thesis, computational approaches to the comparison of regional variety corpora are explored in order to support manual analyses by variety linguists. A feasibility study on semi-automatic corpus comparison has been conducted by developing a prototype system, in order to determine on which levels of linguistic description such automation is possible and to what extent. Further research questions concern which features of the input corpora produce the best results and how reliable the ‘relevance ranking’ of the output is. In addition, the potential of integrating available standard tools as well as the transferability of the system to other languages have been explored.

Written corpora, which have been made increasingly available through initiatives such as Korpus Südtirol, are used as an empirical basis to extract differences semi-automatically, which is more efficient and more objective than a purely manual approach. The results yielded by the prototype system Vis-À-Vis assist variety linguists in their detailed qualitative analyses by reducing corpus comparison output to presumably relevant phenomena.

In regional variety linguistics, numerous manual approaches have been applied and various individual studies have been carried out, followed more recently by an increasing number of automated studies on the basis of corpora being developed for pluri-centric languages. In computational linguistics, the analysis and comparison of corpora through automated systems, in order to find differences on various levels of linguistic description, has been conducted for a considerable time (e. g. for register studies), yielding promising results.

System description and evaluation

Vis-À-Vis applies linguistic pattern extraction as well as statistical output comparison, combining existing standard tools with adapted or newly developed tools. It is a modular system that offers a user-friendly graphical interface, available online and for downloading.

The processing of the corpus input consists of data annotation and the extraction of phenomena on different levels of linguistic description (i. e. at the uni-gram level, the bi-gram level, and selected aspects of the syntactic level). Vis-À-Vis produces ranked ‘candidate’ lists of variety peculiarities by filtering through corpus-external linguistic knowledge and by applying statistical association measures to identify significantly different phenomena in the two input corpora, in order to reduce the output to probably relevant phenomena.

The quantitative evaluation showed that the system performs clearly better than a baseline approach and that it outperforms well-known commercial systems. Furthermore, first qualitative results produced by Vis-À-Vis led to suggestions for refining and enhancing variant dictionary entries.
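The filtering and ranking step described above can be pictured with a small sketch. It assumes a two-corpus log-likelihood score (LL appears in the list of abbreviations) in its common two-cell form and a plain set of already known items as the corpus-external filter; the exact formulas, thresholds, and data structures used by Vis-À-Vis are not given in this abstract, so all names and toy counts below are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """Two-cell log-likelihood for an item seen a times in corpus 1
    (n1 tokens) and b times in corpus 2 (n2 tokens)."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected frequency in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def ppm(freq, corpus_size):
    """Relative frequency in parts per million."""
    return freq * 1_000_000 / corpus_size


def rank_candidates(freq1, freq2, known=frozenset()):
    """Rank items that are relatively more frequent in corpus 1,
    skipping items already covered by external knowledge."""
    n1, n2 = sum(freq1.values()), sum(freq2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both corpora must be non-empty")
    rows = []
    for item in set(freq1) | set(freq2):
        if item in known:  # filter: known regionalisms, proper names, ...
            continue
        a, b = freq1.get(item, 0), freq2.get(item, 0)
        if a / n1 <= b / n2:  # keep only items over-represented in corpus 1
            continue
        rows.append((item, a, b, log_likelihood(a, b, n1, n2)))
    rows.sort(key=lambda row: (-row[3], row[0]))  # highest score first
    return rows


# Toy frequency lists with placeholder items, not real corpus data.
variety = Counter({"x": 20, "y": 10, "z": 70})
reference = Counter({"y": 10, "z": 90})
for item, a, b, score in rank_candidates(variety, reference):
    print(item, a, b, round(score, 2), ppm(a, sum(variety.values())))
```

Passing a set such as `known={"x"}` removes the corresponding item before ranking, which mirrors the idea of reducing the output to candidates that are not yet recorded.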

Conclusion

The overall conclusion of this work is that a semi-automatic approach to variety comparison is clearly promising for lower levels of linguistic description, and, with further refinements, for more complex levels as well. The comparability of the input corpora turned out to be crucial for usable results, and the association measures used for relevance ranking proved to be valuable. Standard corpus processing tools have been integrated, and the transferability to other pluri-centric languages is ensured by the system’s modular architecture.

Complying with the research desiderata identified, comprehensive methods for systematic regional variety studies have been assessed and made available. This work has contributed to applied variety linguistic research, resulting in benefits for general variety description. Regional variety linguists as well as lexicographers, teachers and learners, and the interested public all benefit from the results of such a specifically tailored tool; they can use these results as a compact empirical basis, extracted from large amounts of authentic data, for their detailed qualitative analyses. Through an easily accessible user-friendly application of a comprehensive computational system, they are supported in efficiently extracting differences between varieties of pluri-centric languages.

Bootstrapping processes will further enhance the input data and the methods to provide increasingly better results of variety corpus comparison. Such comprehensive tools can also serve in fields outside of regional variety linguistics, wherever corpora are being compared, contributing to further general linguistic research.

1 Introduction and background

Human language comprises numerous dimensions: it interacts with cultural, economic, historical, political, and social phenomena, and it involves communicative, functional, geographical, personal, and situational aspects (see e. g. Bloomfield, 1933; Weisgerber, 1933, in German). The work conducted in the framework of this thesis is related to many of these contexts, since it is concerned with ‘variety linguistics’. In general terms, this research field investigates all of the factors mentioned and their inter-relations (see e. g. Chambers et al., 2002; Baker, 2010).

More precisely, this work deals with comparing regional varieties of pluri-centric languages (see e. g. Clyne, 1992; Ammon, 2005). For that purpose, corpus-linguistic methods (see e. g. Lüdeling & Kytö, 2009) are applied, which have been developed in the field of natural language processing (NLP), the general framework of this thesis.

In the first introductory section, the general aims and research questions of this study will be presented. In section 1.2, the relevant scientific fields will be introduced in order to provide a background in the related topics.

1.1 Objectives and research questions

Regional varieties of a language obviously share many characteristics, both with respect to their structure and regarding the linguistic phenomena that occur. Despite this fact, and also because of it, it is crucial to identify the differences, e. g. for variety documentation (see e. g. Bender et al., 2004), for variant lexicography (see e. g. Ammon et al., 2004), and to foster and standardise lesser-used varieties (see e. g. Muhr, 1987). Such an investigation of peculiarities is likewise essential for didactics (see e. g. Saxalber-Tetter, 1989), for socio-linguistic studies (see e. g. Romaine, 2009), and to raise general awareness of language varieties (see e. g. Clyne, 1992, for regional varieties). Scientifically well-grounded empirical insights into both the similarities and the peculiarities of varieties are an important basis for systematic and concrete actions in educational policy and in general language planning.

In the following, the overall objectives and the scope of this study as well as its specific research questions will be specified.

1.1.1 Aims and scope of this work

In this section, the general aims of this study – including the applications of its outcome – will be elaborated on.

Research objectives The overall objective of this project – after the identification of concrete research gaps in computational regional variety linguistics – is to investigate how far it is possible to semi-automatically compare linguistic varieties on the basis of corpora, by developing a prototype for computational corpus-based variety comparison. This is especially relevant since the amount of electronically available au- thentic data – also for regional variety research – constantly increases and can no longer be handled purely manually. The benefit of supportive tools from the field of NLP for empirical analyses as opposed to traditional manual linguistic investigation is clearly evident in this scenario. The comparison of varieties, up to now, has mostly been based on introspection or has been conducted in various single specific studies with rather small corpora, if any. An automated comparison of linguistic phenomena for variety description, which appears to have received little attention in the past, is urgently needed. By filtering out ‘trivial’ characteristics of a variety or knowledge already confirmed (e. g. collections of proper names or regionalisms), the system Vis-A-Vis` , which has been developed in the framework of the research presented here, presents mainly 1.1 Objectives and research questions 3

material which is likely to be relevant as possible ‘candidates’ for differences between varieties. This enables experts to systematically investigate consider- ably more data than they could handle purely manually. They gather empirical evidence along the way, which allows them to especially concentrate on the detailed contents-related interpretation and evaluation of new phenomena. The work conducted is of much timeliness not only because variety corpus data is nowadays available to be processed and the opportunity to use large corpora as a basis for systematic comparisons is given, but also because the NLP tools developed in this field are ready to be integrated. The present investigation and development extends the classical work in this area by applying comprehensive semi-automatic, easily accessible, and user-friendly computational tools. The comparison task is not trivial – and the solutions are non-obvious – since differences between regional varieties are rare and subtle.1 Furthermore, the still undefined notion of corpus comparability2 and its complex way of being measured are challenging prerequisites for corpus comparison. Since corpora are taken as basis for such variety description attempts, the assumption is held that ideal corpora represent a variety in general and that their comparability can be measured. Since however, most corpora are obviously neither completely representative nor large enough to draw conclusions for general language usage, the interpretation and evaluation of the results can only be exemplary. It can be helpful to see if the method is valuable to provide a starting point – with a pre-selection of data to be investigated manually in more detail – and if it is worth being developed further. The feasibility study carried out in this thesis yielded the prototype system Vis-A-Vis` with a focus on German3 varieties. It is supposed to show which direction of research and development is most promising for the mentioned task. 
With this approach, combining qualitative with quantitative extraction methods and symbolic with statistical approaches, manual work can be supported

1As also Grzega (2000) points out, exclusive features are hard to find. 2Krenˇ & Hlava´covˇ a´ (2008) e. g. assume that this fact is the reason for the lack of fully automated tools for variety corpus comparison. 3The system has first been evaluated with German varieties, but it is adaptable to other pluri-centric languages. 4 1 Introduction and background

and reduced substantially, with automatically produced lists of objectively observable differences4 providing a basis for further manual investigation and classification. Obviously, the analysis of the output material will always have to be done manually – depending on the respective study purpose. The German language in South Tyrol5 in Northern Italy is taken into special consideration as a starting point for this study, since it is interesting for comparative linguistic investigations from various perspectives. On the one hand, research demands come from South Tyrol, being a semi-centre (see section 1.2.1) which needs descriptive documentation compared to the other varieties of German. On the other hand, South Tyrol’s language contact to Italian and its di-glossic situation with respect to dialects6 influence the use and the development of the language. Additionally, it has not yet been comprehensively investigated, partly because no electronic collections of authentic data have been available on a large scale until the start of the initiative Korpus S¨udtirol (see section 2.1.1). This initiative is developing a balanced corpus of written standard South Tyrolean texts for language archiving, documentation, and awareness raising – the framework in which the development of Vis-A-Vis` originated. Especially in the umbrella initiative C4 (see also section 2.1.1), where four comparable corpora of German varieties in central Europe are being elaborated, a strong demand for systematic and automated comparative studies on variety corpora has emerged. In addition to interactively run single queries in the C4 corpora, variety linguists needed a tool to exploratively and empirically analyse and quantitatively compare their corpora on the desired levels of linguistic description. For the special case of South Tyrol, the survey of related work presented in section 2.2.1 leads to the following research desiderata: Comprehensive

⁴ The methods applied also allow for investigating common features of varieties, e. g. if peculiarities which are shared by two varieties are to be contrasted with a third variety.
⁵ ‘South Tyrolean German’ is the variety used there as one of the official languages (see e. g. Egger & Lanthaler, 2001, and also section 1.2.2).
⁶ The term ‘dialect’ is used here generally for all spoken varieties differing from the written standard varieties. Dialects usually do not have their own norms or an official orthography (see e. g. Besch et al., 1983).
1.1 Objectives and research questions

investigations on the peculiarities of the German standard language in South Tyrol are required, from the lexical through the syntactic to the textual level. Such studies have to be conducted by an intra-lingual comparison between varieties using empirical state-of-the-art methods from the field of computational linguistics. No comprehensive system for the analysis and the comparison of regional variety corpora has been developed yet; implementing such a system is thus the aim of the presented feasibility study, with the mentioned variety corpus initiatives providing a most promising basis. The most distinguishing feature of Vis-À-Vis with respect to related tools is that it is specifically tailored to regional variety linguistic needs, e. g. by integrating existing linguistic knowledge about regional lexical items for filtering quantitative corpus comparison output. Moreover, it offers an easily accessible and user-friendly application to be used without computational expert knowledge. Its dissemination to potential user groups such as the scientific communities of variety linguistics and computational linguistics as well as the general interested public is ensured by its free availability both online and for downloading. This work presents empirical approaches to the analysis and comparison of regional varieties, contributing to the methodology in computational variety research. Many research scenarios and scientific communities will benefit from developments in variety corpus comparison and from their concrete results, which reveal trends in linguistic development and change – one reason to regularly and systematically observe and investigate how a language community is ‘acting’. The outcome of a further pursuit of this work will e. g. contribute to concrete applications such as variety documentation (see e. g. Bender et al., 2004).
More prevalent varieties of languages are usually far better investigated and described than their lesser-used counterparts, which also often have considerably less NLP coverage (see e. g. Ostler, 2009). The term ‘lesser-used’ is not always used according to the number of speakers, but partly also according to the critical mass for developing NLP tools (see e. g. Streiter & De Luca, 2003;

Streiter et al., 2006)⁷, as can e. g. be seen in the Lesser Used Languages and Computer Linguistics Conferences (LULCL, see Ties, 2006; Lyding, 2009), whose variety of topics shows how differently lesser-used languages can be defined. Results of regional variety comparison will further contribute to computational variant lexicography (see e. g. Ammon et al., 2004; Nelson, 2006; Abfalterer, 2007; Abel & Anstein, 2008), also on higher levels of linguistic description, e. g. in the field of phraseology (see e. g. Hofer & Schmidlin, 2003).⁸ The question of standards and norms (see e. g. Muhr, 1987; Riehl, 1994) for non-dominant varieties (see e. g. Clyne, 1992, p. 459) can be pursued further with computational variety comparison, basing a prescriptive approach on a descriptive one, e. g. by concretely using the results for variety-specific spell-checking. Both first and second language acquisition as well as didactics of regional varieties are another important field of application. Variety research results are especially relevant as a basis for recommendations of educational policies at schools and universities as well as for the advanced education of teachers. Concrete output of regional variety studies can be used for aiding language teaching by developing adapted material (see e. g. Nagy, 1993; Kirk, 1996) and for language learning in variety regions (see e. g. Ebner, 1987; Saxalber-Tetter, 1989; Muhr, 1993b; Hägi, 2006; De Cillia, 2006; Abel et al., 2006, pp. 233-319). Specifically tailored language didactics can e. g. help language learners in cases where they are unsure about ‘correct’ language usage (e. g. the use of accusative vs. dative in South Tyrolean German; see section 2.2.2). For such applications, it is crucial to identify linguistic units

⁷ As a related initiative, the Basic Language Resource Kit (BLARK; http://www.blark.org, last accessed 2012-10-21) concept was defined by the European Network of Excellence in Language and Speech (ELSNET) and the European Language Resources Association (ELRA). BLARK aims at facilitating a cooperative framework by fostering the interoperability of the results of national projects to provide language resources of various types for the different languages.
⁸ Heid (2011) also suggests further systematic studies with more data and more effective extraction methods, e. g. along the lines of his approach for the extension of variant dictionaries by adding collocation entries (see section 2.1.2.2).

that may contain errors which might be caused by linguistic variation. A further purpose of variety description is to identify probable difficulties for foreigners who study German in a variety region (see e. g. Colleselli et al., 2009). As a related area of application, the comparison of learner texts is possible – both diachronically with longitudinal studies and from a synchronic perspective, for example comparing different proficiency levels (see e. g. Abel et al., 2010) or different regional varieties (see e. g. Anstein & Glaznieks, 2011). Regional variety corpus comparison can also be a valuable basis for socio-linguistic investigations, e. g. of gender or age differences in language (see e. g. Romaine, 2009). Abfalterer (2007, p. 7) points out that the documentation of regionalisms is furthermore very important to enhance people’s awareness of linguistic variation – of the differences and peculiarities of varieties as well as of their similarities. Abel et al. (2006, pp. 337-412) similarly elaborate on bringing forward linguistic awareness.⁹ The findings on regionalisms can additionally be used in further variety corpus preparation – resulting in increasingly better resources and tools for general computational variety research.

Scope of this work
Several basic assumptions and premises have to be fulfilled for the presented approach to be most useful. Variety corpora have to be available, which should be relatively comparable¹⁰ and large enough to allow for making generalisations. Furthermore, it is assumed that the quality of the automated annotation is high enough for correct results. The ubiquitous variation inside varieties as in the whole of language, referring e. g. to spelling, style, or tense usage, cannot be covered in this project.

⁹ This is one of the reasons why Vis-À-Vis is integrated directly in the query interface of Korpus Südtirol – to be accessible for the interested public and to allow for its use in language awareness initiatives.
¹⁰ The currently available corpus comparability measures are not sophisticated enough – developing better measures is a task which cannot be covered in the framework of this project.

Generally, language is ever-changing in parallel to other – e. g. historical or political – developments.¹¹ Linguists have to make sure not to interpret variety phenomena too ‘easily’; they need to take general language variation and change (also related to productive word formation, creating new abbreviations, trend words, new loanwords, new terminology, etc.) into account, which cannot be straightforwardly filtered out by computational methods. Corpora of spoken language are not considered in the framework of this study, since their treatment and analysis differ considerably from the processing of written text corpora – thus Vis-À-Vis cannot be used for phonological and phonetic investigations. Levels of linguistic description higher than the syntactic level, e. g. automatic deep semantic analyses, are equally beyond the scope of this project, since more sophisticated NLP methods would be necessary. Especially for the case where one concept is expressed on different levels of linguistic description in two varieties, the full inventory of possible alternative structures with functional equivalency would have to be considered when comparing corpora. Such differences in formulation have to be treated specifically, which is where the limits of automatability and computational handling become apparent.

1.1.2 Research questions

In the following, the general questions to be approached in this work are specified. These questions are relevant because they contribute to linguistic research both for regional varieties and in the field of NLP. They are non-trivial because variant phenomena are very rare and partly rather subtle and because the comparison of corpora is not straightforward on all levels of linguistic description. Therefore, their answers are not obvious – they need thorough scientific investigation.

• To what extent can manual variety corpus comparison be supported and automated with empirical computational methods? Which levels of

¹¹ In South Tyrol in particular, decisions on the use of loanwords, especially Italian ones, depend on its historical-political development (see section 1.2.2).

linguistic description can be covered? Are the methods transferable across these levels with the appropriate annotation?

– For which input do the methods work well? Which corpus features (comparability, size¹², etc.) influence the quality of results to what extent?
– How useful can the output be, i. e. probable candidates for variety peculiarities? Can an automatic ‘relevance ranking’¹³ of this output work reliably?
– How far can available standard tools be integrated, adapted, and combined? How suited for treating a certain language variety are e. g. annotation tools which have been developed for a different variety? How useful are general corpus processing tools?

• How easily transferable to other pluri-centric languages is such a system for variety corpus comparison?

1.1.3 Structure of this thesis

In the following section (1.2), an introduction to the relevant research areas related to the contents of this study will be given as a general background. In section 2, the influential previous work which has been done in these fields will be described in order to set the context for this work. Section 3 presents the prototype system developed in the framework of this research, elaborating on its methodology, its functional and technical specification, and its output according to the general system development life cycle. In section 4, the quantitative and qualitative evaluation of the system is shown.

¹² The question regarding corpus size is especially relevant because of the lesser-used language situation for many regional varieties.
¹³ The notion of relevance ranking for linguistic purposes is taken as the possibility to rank corpus occurrences with respect to their appropriateness for illustrating special properties of a given variety.

The thesis is concluded with an outlook section (5.1) pointing to possible further lines of research to be followed and with a general summary of the findings and contributions of this work to the respective research areas (5.2).

1.2 Background in the relevant research areas

In this section, the central notions from three related areas of research are addressed in order to place this work in a theoretical context. First, the general regional variety linguistic area as the basic background will be introduced briefly, then the situation of the South Tyrolean variety which is investigated as an exemplary variety in this work will be described, and third, the relevant NLP methods regarding the field of corpus research will be elaborated on as an introduction to the relatively new field of computational variety linguistics.

1.2.1 Linguistics and language variation

The field of linguistics studies human language, which is divided into diverse sub-fields according to its structure and can be investigated on different levels of linguistic description (see e. g. Halliday & Webster, 2006). In the following, the most relevant linguistic notions and sub-fields are presented as a general basis for the summary of the related previous work. After that introduction, central aspects of linguistic variation are shown – including language political features which may point to the reasons for variety phenomena and for possible hypotheses about their ‘position’.

1.2.1.1 Language phenomena and their investigation

This section introduces the three levels of linguistic description which have been investigated and covered as analysis levels in the present feasibility study. Linguistic phenomena, e. g. lexical, phrasal, or syntactic phenomena, become manifest in concrete instances in texts (‘occurrences’) and in relations between

such occurrences.¹⁴ In the technical part of this work, the term ‘phenomenon’ is also used on the instance level.

Lexical level
Lexical entities consisting of one or more words are analysed and described in the field of lexicology (see e. g. Cruse et al., 2005). To give an example of a sub-type of lexical entities, a named entity (NE) is a proper noun referring e. g. to persons. The scientific approach of lexicography is involved in the development of dictionaries (for a detailed general account see Hausmann et al., 1989-1991). Hausmann (1985b) provides an overview (in German), a comprehensive bibliography (pp. 399-410), and a dictionary typology (pp. 379-381). Relevant types of dictionaries discussed by Hausmann (1985b) are e. g. synchronic vs. diachronic, philological vs. linguistic, or general language vs. language for specific purposes. Regionalism dictionaries (see Hausmann, 1985b, p. 391) are a sub-group of specialised dictionaries containing region-specific lexical items. With respect to regional language variety dictionaries, full / cumulative / integrative / inclusive dictionaries (e. g. Ammon et al., 2004) are contrasted with distinctive / differentiating / supplemental variant dictionaries (examples being Ebner, 1980; Meyer, 1989).

Phrasal level
Phraseological units are classified into referential, structural, and communicative units – and on a lower level into collocations and (partial) idioms (see e. g. Fleischer, 1997; Burger, 2007; Heid, 2008). Especially collocations, which have also been chosen to be investigated in the present study, aroused interest as a lexical resource for NLP (see Sinclair, 1991, pp. 109ff.).¹⁵ Heid (2001) covers their treatment in lexicography for specialised languages, e. g. the important differentiation between ‘lexical’ and ‘conceptual collocations’ (p. 795). Also in the field of didactics, collocations

¹⁴ On higher levels of linguistic description, however, the phenomena are not necessarily straightforwardly noticeable and mappable to concrete (lexical or structural) occurrences; such phenomena are rather subtle and hidden, and therefore difficult to extract computationally.
¹⁵ For a discussion on the term collocation in (computational) linguistics and lexicography, see also e. g. Steyer (2004a).

are suggested as an enrichment for learner dictionaries (see e. g. Cowie, 1978; Reder, 2006). According to Firth (1957, p. 106), who originally coined the term ‘collocation’, the characteristics of a word can be identified best by its linguistic context: “You shall know a word by the company it keeps.” (p. 11). Sinclair (2004, p. 198) suggests that one should “trust collocation as the control mechanism for meaning” and promotes the “nine word window of collocative power” (Sinclair, 1991). Hausmann (1979) refers to the parts of collocations as ‘basis’ (argument) and ‘collocator’ (predicate). This terminology later develops from ‘semantically autonomous basis’ and ‘syn-semantic collocator’ (Hausmann, 1985a) to ‘auto-semanticon’ vs. ‘syn-semanticon’ (Hausmann, 2004). Hausmann (1984) defines collocations from a lexicographic view as ‘non-fixed word combinations with restricted combinability’. They are seen as phraseological constructions of at least two lexemes which are conventionalised in a language community and which are remarkably common (Hausmann, 1984). Usually they are semantically or syntactically closely related and are semi-compositional (going back to Frege, 1892). Their elements are semantically not equally weighted and show an asymmetry: the basis has the same meaning as outside of the collocation, whereas the collocator has a specific meaning only within the collocation (Hausmann, 1979). Criteria for collocations are lexical selection (including morpho-syntactic preferences and formal fixedness), grammatical relation (see Evert, 2005a, on relational co-occurrence), idiomatisation, and semantic transparency (see e. g. Heid, 2001; Hausmann, 2004). Manning & Schütze (1999, pp. 172f.) describe collocations as characterised by limited compositionality, substitutability, and modifiability.
Forkl (2010) adds that collocations are uni-directionally compositional: they can be understood but not produced without specific lexical knowledge.¹⁶ Possible forms of collocations are i.) adjacent (e. g. adverb+adjective) and ii.)

¹⁶ This is the case for collocations in contrast to idioms, which can neither be produced nor understood without such knowledge.

non-adjacent co-occurrences, the latter of which are e. g. subject/object+predicate co-occurrences, where the parts are linked by functional relations¹⁷. The exact definitions of the term ‘collocation’ depend on the focus of the research areas.¹⁸ A rough distinction between two concepts (see e. g. Evert et al., 2000, pp. 215f.) can be observed:

1. From the linguistic, lexicographic, and language didactics perspective, collocations are seen as partly idiomatised combinations of two lexemes with restrictions in compositionality or morpho-syntactic modifiability.

2. In corpus-linguistic approaches, e. g. for collocation extraction from corpora (see section 1.2.3.3), a broader understanding and statistical definition is assumed, with a focus on the frequency of a combination of lexical entities, not necessarily with semantic or morpho-syntactic peculiarities.
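The second, statistical understanding can be illustrated with a minimal sketch. It counts word pairs that co-occur within a fixed window and scores them with pointwise mutual information (PMI). PMI is chosen here only as one common association measure and is an assumption of this illustration, not necessarily the measure used in the present study; the function name `cooccurrence_pmi` and its parameters are likewise hypothetical. A window of four tokens to each side corresponds to a nine-word span around the node word.

```python
import math
from collections import Counter


def cooccurrence_pmi(tokens, window=4, min_count=2):
    """Score unordered word pairs by pointwise mutual information (PMI).

    A pair is counted once for every two token positions that lie at most
    `window` positions apart. Pairs seen fewer than `min_count` times are
    dropped, since PMI is unreliable for very rare events.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, left in enumerate(tokens):
        # Look only to the right so that each position pair is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            pair_freq[tuple(sorted((left, right)))] += 1

    n_tokens = len(tokens)
    n_pairs = sum(pair_freq.values())
    scores = {}
    for (w1, w2), count in pair_freq.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs          # observed pair probability
        p_w1 = word_freq[w1] / n_tokens   # marginal word probabilities
        p_w2 = word_freq[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

Such a purely frequency-based score ranks recurrent combinations highly regardless of whether they are idiomatised, which is exactly the difference from the first, lexicographic understanding of the term.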

The middle-road working definition of Bartsch (2004, which is a good overview of ‘competing’ definitions of collocations and their historical development) is

“Collocations are lexically and / or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other.” (p. 76).

In the present work, the interpretation of Hausmann’s (2004) notion of collocation and the derived working definition are taken as a basis.

Syntactic level
In syntax, the study of the order of elements in a higher-level linguistic construction, the topological field model has been developed for the inflecting language German (see e. g. Drach, 1937; Höhle, 1986; Weinrich, 1993; Wöllstein-Leisten et al., 1997). A sentence is separated into several

¹⁷ Such co-occurrences with different grammatical roles are referred to as ‘syntagmatic associations’, whereas ‘paradigmatic associations’ denote linguistic entities which can substitute each other such as synonyms or antonyms (de Saussure, 1916).
¹⁸ In the field of specialised language, for example, there is additionally the problem of discrimination between collocation and terminological unit (Evert et al., 2000, pp. 215f.). Furthermore, the phenomenon of ‘localisation’ of collocations exists for specialised language, especially for legal terms (see e. g. Wiesmann, 2004).

parts which can contain only certain constituents according to different sentence patterns. One specific syntactic phenomenon which has been chosen to be analysed in an exemplary manner in the present study is introduced in the following. In written language, the phenomenon of verb-second word order in certain subordinate clauses (causal and concessive clauses) is marked as an ‘anacoluthon’. This points to a ‘sentence break’, where the syntactic structure is abruptly changed within a sentence (Drosdowski & Augst, 1984, pp. 716f.). This phenomenon is increasingly frequent and regular in spoken language; the present study tests whether its frequency in written texts differs among regional varieties.

1.2.1.2 Linguistic variation

With respect to language use, generally uniformity, standardisation, and binding norms are important for understanding and communicating. However, linguistic reality includes variations and changes (see e. g. Chambers et al., 2002; Mattheier, 1997, in German).¹⁹ The dia-system of language comprises multiple variations, e. g. diachronic (referring to different time spans), dia-stratic (with respect to different social layers), and dia-topic (regarding different regions) varieties (see e. g. Pulgram, 1964). Changes in language usage and language in general are part of the inner dynamics of languages. Linguistic norms are more stable, but they also change over time, which results in changing judgements about linguistic correctness. Languages are e. g. influenced by dialects or by contact languages (see Goebl et al., 1996; Hickey, 2010, who give a general account of contact linguistics). The field of socio-linguistics deals with the variation in language use among speakers, e. g. in pronunciation (accent), word choice (lexicon), or even preferences for particular grammatical patterns. Häcki Buhofer (2000) refers to linguistic variation from a socio-linguistic perspective as a prerequisite for change. She differentiates between

¹⁹ Fandrych & Salverda (2007) also describe standard, variation, and change, especially in Germanic languages.

i.) internal (natural) change due to production economy and the characteristics of the speech apparatus and
ii.) external (non-natural) change, where contacts between varieties lead to the integration of new elements.

Pluri-centric languages
Languages with several national centres and with specific national forms are called pluri-centric²⁰ languages (see e. g. Clyne, 1992). The linguistic and communication-related differences between varieties are due to different life situations in the specific social realities of the single countries, also influenced by the social identity of their speakers (see e. g. de Cillia & Wodak, 2006; Abfalterer, 2007, pp. 25-55), which is – among other things – expressed by language.²¹ The varieties of pluri-centric languages usually include one economically more important ‘normative centre’. Ammon (1995a) describes an asymmetry among the standard varieties of German – the language which is looked at more closely as an example in this work²² – because of country size and economic power. He distinguishes between two kinds of ‘centres’: the varieties of German used

²⁰ Scheuringer (1996) proposed the more neutral name ‘pluri-areal’ against the ‘centre’ notion, referring to geographical instead of national linguistic uniformity. In this thesis, the term pluri-centric is used in a neutral way, being aware of the fact – as also Grzega (2000) points out – that “regional varieties and national varieties seldom coincide.” Within single varieties, there is often a second level of pluri-centricity because of the variance within countries, which depends on political and social factors, e. g. in Germany: North vs. East vs. South, or in Austria and in Switzerland: East vs. West.
²¹ As Clyne (1992) puts it, “[p]luricentric languages are both unifiers and dividers of peoples”.
²² Other pluri-centric languages include e. g. Arabic, Chinese, Dutch, English, French, Portuguese, or Spanish, some of which are covered in more detail in section 2.1.1. For an overview on German, see Clyne (1984, 1992, pp. 56-72), Ammon (1995a, which is a comprehensive systematic presentation including all relevant literature), Ammon (1995b, 1997, 2001), as well as Abfalterer (2007, pp. 15-24). von Polenz (1987, 1988, 1999) elaborates on the general history of the German language and its national varieties and the norm influence of Germany, and Földes (2005) is an overview article about the concept and architecture of the German language with respect to regionality, variation, and heterogeneity.

in Austria²³, Germany, and Switzerland²⁴ are considered as the ‘full centres’ because they each have their own declaration of norms and the definition of a codification (Österreichisches Wörterbuch (ÖWB; ‘Austrian dictionary’), Duden, and Schweizer Schülerduden). The varieties which are in use in Liechtenstein, Luxembourg²⁵, East Belgium, and South Tyrol are referred to as ‘semi-centres’, since they do not have their own codification. In these semi-centres, German is either a co-official language as in Luxembourg or an additional official language in a part of a country as in East Belgium or in South Tyrol. A ‘standard variety’ (see e. g. Muhr, 1995, who provides a definition of standard language within pluri-centric languages) can roughly be characterised by the fact that it constitutes the norm in public situations and is widely taught at schools of the general education system. To qualify as a full centre, a variety has to have an official codification that has been developed and published within its centre (Ammon, 1986, 1995a). A standard variety is furthermore defined as “the historically legitimated, panregional, oral and written language form of the social middle or upper class” by Bussmann (1996, pp. 451f.), who also states that it is “subject to extensive normalization (especially in the realm of grammar, pronunciation, and spelling)”.²⁶ Huesmann (1998) adds the following features of a standard: codified, supra-regional, of overt prestige,

²³ Muhr et al. (1995) is a collection of linguistic, psychological, and political aspects of Austrian German, and Schrodt (1995) also attempts to define Austrian German ‘between grammar and pragmatics’. Muhr & Schrodt (1997) describe Austrian German in the context of the other German varieties. Furthermore, in the Internet portal Österreichisches Deutsch including a comprehensive bibliography on Austrian German from 1995 to 2005 (http://www-oedt.kfunigraz.ac.at, last accessed 2012-10-19), various information on Austrian German is made available.
²⁴ Haas (2000) gives a general account of Swiss German.
²⁵ For example Magenau (1964) describes peculiarities of written German in Luxembourg and Belgium.
²⁶ Clyne (1992, p. 459) uses the differing terminology of ‘dominant’ and ‘other’ / ‘non-dominant’ varieties. He adds that speakers of the non-dominant varieties have a good passive competence in the dominant variety but not vice versa. Speakers of the dominant variety also tend to regard the non-dominant varieties as dialectal. Moreover, speakers of the non-dominant varieties (especially the cultural elites) tend to take over norms of the dominant variety because these norms are thought to be ‘better’ since they are not regionally marked. Finally, speakers of the non-dominant varieties are insecure about their standard since it is not always explicit and conscious.

group-specific, prescriptive, multi-functional, and used in written language. Muhr (1987, p. 20) – from the Austrian perspective – refers to an ‘inner standard’ being informal and natural vs. an ‘outer standard’, which is formal and non-natural. The research presented in this thesis is related to minority, lesser-used, and lesser-resourced languages, since often, semi-centres are located in multilingual countries and their varieties are used by a rather small number of speakers.

Varieties and variants
The terminology used in research on pluri-centric languages varies, but it has become established (see also Ammon, 1995a, pp. 61-94, 96f.) to use the term ‘variety’ on the language level and the term ‘variant’ for single occurrences of linguistic units. Language varieties²⁷ are different instances of pluri-centric languages in different regions consisting of a system of variants (see below). ‘South Tyrolean German’, for example, is the German variety used as an official language in the Autonomous Province of Bolzano / South Tyrol in Northern Italy (see e. g. Egger & Lanthaler, 2001). For linguistic variants²⁸, also called regionalisms, Ammon (1995a) differentiates additionally between the terms ‘onomasiological variants’, where the same object is referred to by different words (Marille in Austria (AT) and South Tyrol (ST) vs. Aprikose in Germany (DE) and Switzerland (CH) for ‘apricot’), and ‘semasiological variants’, which are the same words in several varieties with varying meanings (Bäckerei (bread shop) in AT includes a Konditorei (pastry shop), whereas in DE, a Bäckerei does not include a Konditorei). More example phenomena for the South Tyrolean variety on various levels of linguistic description can be found in section 2.2.2. ‘Südtirolisms’ are the specific variants used in South Tyrolean German. The terms ‘primäre Südtirolismen’ (primary Südtirolisms; used exclusively in South Tyrol) and ‘sekundäre Südtirolismen’ (secondary Südtirolisms; shared with other varieties) have been coined by Abfalterer (2007, see also section 2.2.1 and

²⁷ Grzega (2000) also provides a description of national varieties with a literature review of the field including concrete examples from English and German.
²⁸ An example for English is ‘pavement’ in British English vs. ‘sidewalk’ in American English.

appendix B). Correspondingly, ‘Austriacisms’, ‘Helvetisms’, and ‘Teutonisms’ are the variants used in Austria, Switzerland, and Germany, respectively. Varieties usually differ to a certain extent on different levels of linguistic description, mostly on the lexical level. Ammon et al. (2004, p. xxxii) describe varieties as languages which hardly differ in grammar and partly differ in lexis and pronunciation²⁹; they state that varieties can be best differentiated according to these latter peculiarities. Following Nelson (2006), it can be expected that a core vocabulary is shared and that the peripheral vocabulary with lower frequency differs across varieties.³⁰ Also in the tradition of dialectology (see e. g. Besch et al., 1983; Anderwald & Szmrecsanyi, 2009, the latter from a computational-linguistic view), variety linguistics deals especially with lexis and pronunciation.

Loan influence
Loanwords and loan formations, which are words borrowed and taken over without or with minor changes from one language and incorporated into another language (see e. g. Weinreich, 1953), are a frequent phenomenon in multilingual and language contact situations (see e. g. Goebl et al., 1996; Hickey, 2010). ‘Interference’ is a related phenomenon, which is defined as the influence of linguistic elements of one language on linguistic elements of another, mostly concerning lexical phenomena. Moser & Putzer (1980, pp. 153-157) classify such phenomena into the following groups; examples are added from South Tyrolean German:

• interferences without equivalents in the target language (e. g. Pizza),
• interferences with names for unique facts of life (e. g. Option, ‘choice to leave South Tyrol during fascism’), and
• interferences with an equivalent in the other language (Aranciata and Limonade, ‘lemonade’).

29 In general, for spoken language, rather descriptive measures are taken, whereas for written language, more prescriptive measures exist.
30 It cannot be assumed, however, that all phenomena in the dominant varieties are shared with the other varieties.
1.2 Background in the relevant research areas 19

In addition to such obvious lexical interferences, interference phenomena can be found on higher levels of linguistic description as well. This results in structural borrowing, e. g. of grammatical aspect such as the progressive form. Higher-level interferences point to different types and degrees of language contact (see e. g. Thomason & Kaufman, 1988; Heine & Kuteva, 2005).

1.2.2 Language in South Tyrol

The Autonomous Province of Bolzano / South Tyrol is a multilingual environment with strong language contact between German and Italian. Furthermore, the German-speaking community is in a diglossic situation with respect to South Tyrolean dialects vs. the German standard variety (see e. g. Abel, 2009).31 In South Tyrol, about two thirds of the around 500 000 inhabitants are mother-tongue German speakers, with German being an official language besides Italian (25 %) and in some valleys Ladin with 4 % (ASTAT, 2002, p. 118). German and Italian are equally used in almost all areas of public life – in the media and for public announcements, in the educational system, and in public administration (see e. g. Lanthaler & Saxalber, 1995; Abel, 2009). This results in multilingual publications of all kinds, e. g. laws and official letters, as well as in bilingual street signs (see figure 1.1 with ‘retrieve parking ticket’ and ‘Attention, train!’ as examples). The South Tyrolean dialects are predominantly used by the German-speaking population in both spoken and written inter-personal communication. Regarding the diglossic situation, the standard variety is restricted to the educational system during class, to communication with foreigners, and to written texts that appear in public. The South Tyrolean written standard language shows several differences compared to the other varieties of German (see section 2.2.2), partly caused by the South

31 The Austrian tribüne. zeitschrift für sprache und schreibung 2/2007 contains contributions about dialect in dictionaries, about written dialect, and about the fear of language decay that gives way to investigating interesting peculiarities of South Tyrolean German. Another edition (2/2008) of the tribüne is especially dedicated to the current German language in South Tyrol; topics include variant lexicography, corpus initiatives, literature, as well as dialect and variation in literary texts (e. g. Lanthaler, 2008).

Figure 1.1: Examples of bilingual public signs in South Tyrol

Tyrolean dialect and by the contact language Italian. Noticeable phenomena on the lexical level are often a result of the political situation of South Tyrol being part of Italy – it was necessary to borrow and translate Italian expressions or to extend meanings of German words in order to transfer Italian legal and administrative terms. The peculiarities of the South Tyrolean variety on other levels of linguistic description have not been comprehensively investigated (see section 2.2.1); unique aspects on the morphological and syntactic level, for example, have often been considered as individual mistakes.

1.2.2.1 History and current situation

South Tyrol’s language history is strongly influenced by its political history (see e. g. Eichinger, 1996; Egger & Heller, 1997; Steininger, 1999). According to Abel (2009), a compact overview, the historical background also strongly influences legal, political, economic, sociological, cultural, and linguistic areas. The see-saw changes described in the following (see e. g. also Mall & Plagg, 1990, pp. 218ff.) resulted in a weak position of the German language for a long time after World War II. From 1919 on, after the separation from Austria, South Tyrol belonged to Italy – a development leading to a deep social and linguistic change. Caused by the subsequent assimilation policy, the process of ‘italianisation’ (see e. g. Abel, 2009) left few rights for the German minority: the use of the German language was prohibited in schools and in public. In 1946, the Pariser Vertrag or Gruber-De Gasperi-Abkommen between Italy and Austria ensured equal rights for the German and the Italian language, a special protection of the German language community, and autonomous legislation and jurisdiction for the provinces of Trento and Bolzano (see e. g. Feiler, 1997; Magliana,

2000). Due to certain insufficiencies, in 1948, the Erste Autonomiestatut (first autonomy statute) was released. After on-going protests between 1956 and 1969, the Zweite Autonomiestatut (second autonomy statute)32 in 1972 ensured rights with respect to public services and to the educational system and also rights for the Ladin population. The so-called Proporz, in effect from 1976 on, promotes a proportionality of jobs or funding according to the size of the language groups. Furthermore, rights include especially

• mother-tongue classes by mother-tongue teachers for children (Art. 19),
• equality of languages (Art. 99) with German as an official language,
• obligatory bilingualism in public offices (Art. 100),
and other important regulations with respect to language rights (the declaration of language affiliation, the current bilingualism exam, etc.).

In the 1980s, German gained a stronger position in the media (German radio and television), and South Tyrolean German received the status of an official regional variety (see e. g. Moser, 1982; Egger & Lanthaler, 2001). Concerning the bilingual situation in South Tyrol (see also Egger, 2001; Abel et al., 2007), Langer (1983) sees positive prerequisites because of the comparable social prestige of both languages, the similar legal status, and the obligatory public bilingualism, but he also describes political barriers to it: the fear that the ethnic-political unity is endangered – which aims at two monolingual groups – and the ‘policy of separation’. Cavagnoli & Nardin (1999) list three aspects for the lack of true bilingualism:

i.) political-institutional relations (protection of mother-tongue rights, ‘separatism’, compulsory bilingualism, education in mother tongue),
ii.) socio-educational relations (Italian being ‘normal’ vs. German being ‘special’, prestige questions, teaching material for South Tyrolean German), and
iii.) social relations (self-identification and differentiation).

32 For both statutes, see http://www.landtag-bz.org/de/datenbanken-sammlungen/autonomiestatut.asp#anc1956 (last accessed 2012-10-21).

As to the interactions between the language groups, Baur (2006) also describes vicious circles and historical barriers because of the sensitivity and negative connotations on both sides, proposing an active ‘pedagogy of encountering’. With respect to literature history (see also Anstein et al., 2011), since the literary and scientific centre at the beginning of the 20th century was Innsbruck and since almost no written German sources were created during fascism, the consequences until the 1960s were that politics, media, and literature were closely interlaced and publishers and press were monopolised. After the autonomy and the economic boom in the 1980s, however, German regained prestige and domains of use. The current written sources show a problem awareness and an interest in the critical examination of the past and present, even though more from the historical and sociological viewpoint than from a linguistic one.

1.2.2.2 Standards and norms

When discussing language in South Tyrol, prevailing topics are the issue of ‘good’ and ‘correct’ German, the question ‘which German to use in South Tyrol’33, and discussions about norm instances (see e. g. Lanthaler & Saxalber, 1995; Ammon, 2001; Lanthaler, 2001). A simple and ‘naive’ norm understanding with a clear-cut judgement and a strict viewpoint as to right or wrong (interpreting any interferences as errors) is not in line with the linguistic reality, which always accepts certain variation. Therefore, the awareness of linguistic variation has to be raised, e. g. by the documentation of language varieties, in order to change this norm understanding. Regarding the question of what counts as standard, colloquial, or dialectal, Abfalterer (2007, p. 22) mentions a general tendency towards their ‘centre’, the colloquial language. As background to the norm question in South Tyrol, it has to be kept in mind that – from the 1960s to the 1980s – contact phenomena were described as an impairment of the system and that the fear of language decay was prevalent (e. g. Riedmann, 1972, which is the first ‘scientific’ study; see section 2.2.1).

33 This question is relevant for all minority languages and non-dominant varieties.

Lanthaler (1997, pp. 293f.) also states that the uncertainty with respect to the correct usage of language partly seems to lead to language purism and an orientation towards a strict norm. In addition, regional peculiarities are rejected, which is similarly attributed to the attempts to preserve the German language after fascism. This language criticism led to a low linguistic self-confidence in South Tyrol, also since no specific reference work existed. Linguistic insecurity kept growing, followed by a call for clear norms and norm authorities (the general public, the schools, language experts, etc.). Since the 1990s, specific educational material for South Tyrol has been developed – but still, the Duden is consulted more often than the ÖWB (Abfalterer, 2007, p. 34). The recommendation is to use both of them: the ÖWB as a complement to the Duden. Also regarding the question of which German to use in South Tyrol, Abfalterer (2007, p. 50) states that in many cases, variants from Germany prevail over Austrian ones. Lanthaler & Saxalber (1995, pp. 293f.) observe that there are fewer Austriacisms in the South Tyrolean standard than assumed, because after fascism the standard language was ‘protected’. People sought security by consulting the Duden from Germany, and purists fought against all Italian influences34, but also against dialectal and regional influence. Lanthaler (1997, p. 364) observes that the spoken trend goes towards Austrian, the written trend towards Germany (see also Lanthaler, 2001, p. 148, who confirms that the South Tyrolean German language community follows the common dictionaries and grammar books from Germany). Riehl (1994), describing particularly South Tyrolean German, distinguishes between different forms of language from colloquial as the one extreme to specialised language as the other – and between two kinds of norms: systematic and pragmatic norms (p. 154).
She reports that school books from Germany are used in South Tyrol and that e. g. the training of South Tyrolean public speakers for radio or television takes place in Germany. She further explains the strong punishment, the high normative consciousness, and the suppression

34 According to Abfalterer (2007, p. 49), research on the influence of German on Italian is rare.

of creative potential by the fact that, for minorities, the deviance from a norm is a threat for their language. Institutions founded to ‘take care’ of South Tyrolean German are e. g. the Sprachstelle im Südtiroler Kulturinstitut35, the Institute for Specialised Communication and Multilingualism36 of the EURAC37 Bolzano, and for legal terminology, the Paritätische Terminologiekommission38. In addition to the scientific examination of the South Tyrolean German peculiarities, the every-day experience of language criticism expressed e. g. in the print media is very important. The press has a strong influence both with respect to model writing and, according to Abfalterer (2007, pp. 35-37), on the public opinion on language. Especially for the Dolomiten, which is the most widespread newspaper, a study about language criticism from 1982 to 2004 by Abfalterer (2007, pp. 37-45) shows the use of emotional keywords such as Sprachverfall (language decay) and the fear of losing one’s mother tongue. It can also be noticed that the newspaper keeps using the term Hochsprache (‘high level’ language) instead of the more neutral term Standardsprache (standard language). Abfalterer (2007, pp. 42-44) even mentions incorrect reports of linguistic scientific studies by the Dolomiten newspaper. The lack of linguistic self-confidence, the lack of language mastery, and the role of language contact with its emotional aspects are demonstrated, with the conclusion that this leads to high norm fostering, which in turn is followed by a rejection of South Tyrolean peculiarities (Abfalterer, 2007, p. 41). In the last decade, however, a widespread awareness and neutral acceptance of linguistic differences to other varieties can be clearly observed in South Tyrol.

35 ‘Language section of the South Tyrolean institute for cultural issues’; http://www.kulturinstitut.org/hauptnavigation/sprachstelle.html, last accessed 2012-10-25.
36 http://www.eurac.edu/en/research/institutes/Multilingualism/default.html, last accessed 2012-10-25.
37 http://www.eurac.edu, last accessed 2012-10-25.
38 TerKom, ‘Joint Commission for Terminology’, http://www.provinz.bz.it/anwaltschaft/themen/terkom.asp, last accessed 2012-10-25. As examples, the terms Quästur (‘police department’, Italy (IT): questura) or zivile Motorisierung (‘motoring’, IT: motorizzazione civile) have been dismissed and substituted by the normed terms Polizeidirektion and Kraftfahrzeugwesen by the TerKom.

1.2.3 Computational approaches to corpus studies

This research is embedded in the multi-disciplinary field of natural language processing (NLP) or computational linguistics, which combines and builds bridges between the two specialised domains of linguistics and computer science (for comprehensive overviews see Jurafsky & Martin, 2000; Carstensen et al., 2004, in German). It is concerned with the interaction between natural languages and computers (see e. g. Charniak & McDermott, 1985). Fundamentals, methods and tools, as well as applications are thoroughly explained in the handbook by Mitkov (2003). Another useful work containing resources, tools, and an extensive bibliography is Cramer & Schulte im Walde (2006, in German) including a comprehensive website39 containing basics, methods, resources, and applications.40 In the following, the relevant areas from the whole field of NLP will be presented, starting with the inherent aspect of inter-disciplinarity followed by a description of linguistic corpora and their annotation. Furthermore, the section will show extraction methods on different levels of linguistic description, corpus comparison methods, as well as relevant statistical notions and evaluation approaches for the field of corpus linguistics.

1.2.3.1 Inter-disciplinarity

Applying computational methods, linguists inevitably encounter inter-disciplinary tasks: electronic text collections and tools are increasingly available since data processing and storage space are becoming cheaper. To decide on the best way to solve a computational-linguistic task, a cost-benefit analysis is needed, integrating the following criteria: complexity of the task, expected error rate and tolerance, costs (manual vs. automated), and sustainability by the development of re-usable tools (see e. g. Lyding et al., 2008). As to the combination of social and natural sciences (e. g. variety linguistics and NLP), the following

39 http://www.coli.uni-saarland.de/projects/stud-bib, last accessed 2012-10-19.
40 An annotated list of resources for statistical natural language processing and corpus-based computational linguistics by Chris Manning can be found at http://www-nlp.stanford.edu/links/statnlp.html, last accessed 2012-10-19.

aspects can lead to practical difficulties of inter-disciplinary work (see also Klein, 1990): the terminology used in the two fields needs mutual understanding, the respective methods need mutual understanding and acceptance, there is often a spatial separation of members of the two groups, there are cognitive limits which prevent people from being perfect experts in both fields, and there are reservations towards the approaches and methods of the respective other discipline, partly because of the lack of knowledge and of competence in it (see also Lyding et al., 2008). In the 1990s, Abney (1996) describes – especially in linguistics – a traditional scepticism towards automated methods and quantitative approaches because no real insights in the complete nature of language are thought to be possible.41 Another inter-disciplinary task within computational linguistics is human-machine interaction, especially the development of intelligent user interfaces.

1.2.3.2 Corpora and their linguistic annotation

While the first linguistic studies on systematic collections of authentic text were done manually, nowadays mostly electronic corpora42 are taken into consideration. Thus a current definition of the term is:

“[...] a corpus is taken to be a computerised collection of authentic texts, amenable to automatic or semi-automatic processing or analysis. The texts are selected according to explicit criteria in order to capture the regularities of a language, a language variety or a sub-language.” (Tognini-Bonelli, 2001, p. 55).

41 As a side note, Jannidis (1999, pp. 47f.) comments on statistical methods for linguistics stating that quantitative approaches are not always accepted in literary studies: if they are used to verify common insights, they are said to be superfluous; if they contradict such insights, they are said not to be trustworthy because of statistical measures. This fact bears the danger of missing the verification of knowledge and the discovery of new insights.
42 Practical hints and best practices for corpus creation can be found in Wynne (2005), which also goes into detail on the annotation of corpora on various levels (ch. 2) and is available online at http://www.ahds.ac.uk/guides/linguistic-corpora/index.htm, last accessed 2012-10-19. For German, a practical introduction to corpus linguistics can be found at http://bubenhofer.com/korpuslinguistik/kurs, last accessed 2012-10-19.

The computer has become an inherent part of corpus linguistics43:

“The computer, which has come to be associated with Corpus Linguistics [...] is used to process, in real time, a quantity of information that could hardly be envisaged by a team of informants working over decades even 50 years ago; [...] it has affected the methodological frame of the enquiry by speeding it up, systematising it, and making it applicable in real time to ever larger amounts of data44.” (Tognini-Bonelli, 2001, p. 5).

The use of electronic corpora made existing strategies more efficient, e. g. by query engine indexing for fast searches and analyses, and changed and further developed the linguistic methodology. Automated corpus-based language observation enables researchers to systematically investigate large text collections instead of conducting manual analyses of single documents as before. The systematic access to authentic language examples makes it possible to evaluate (and possibly adapt) linguistic theories on an empirical basis, enhancing introspective methods. Linguistic phenomena can be quantified, which is one of the main reasons for the work with corpora (see Lemnitzer & Zinsmeister, 2006, pp. 182-186). McEnery et al. (2006, pp. 7f.) define corpus linguistics as a “whole system of methods and principles of how to apply corpora”. According to Gries (2009c) in his ‘dialogue-style’ article about central assumptions, notions, and methods of corpus linguistics, it is defined as a ‘method(ology)’, an ‘approach’. He sets a high value on the fact that corpora are used to extract mainly quantitative

43 A comprehensive overview of the field of computational linguistics dealing with corpora is given by Lemnitzer & Zinsmeister (2006, in German) and by Lüdeling & Kytö (2009). The latter is a recent handbook describing the origins, the history, the philosophical, socio-, and psycho-linguistic aspects of corpus linguistics, as well as symbolic vs. statistical methods.
44 An introduction to ‘data-intensive linguistics’ is given in Brew & Moens (2000) including its history, corpus search and processing tools, as well as information on concordances, collocations, corpus design, annotation, statistics, and applications.

information and lists diverse examples of corpus application in various fields such as language acquisition.45 External influences on corpus linguistics (according to Atwell, 2009, p. 505) come from

i.) lexicography, which is responsible for a large part of the current corpus-building work and innovation, and from
ii.) NLP / computational linguistics (see also Church & Mercer, 1993).

Regarding the latter field of influence, since the computational revolution of the 1980s, a paradigm shift due to NLP methods can furthermore be observed: more focus is put onto larger linguistic units like texts and discourse in addition to the word or sentence level.46 Tognini-Bonelli (2001, pp. 177-179) and Xiao (2009a) go into detail on corpus-based vs. corpus-driven methods (see also Fellbaum, 2007, p. 4). The former apply pre-processing and use hypotheses, the latter rely exclusively on distributional data (however, also with the basic assumptions that categories exist, that certain patterns can be expected, and that frequencies can be interpreted). McEnery et al. (2006)47 describe the criticism of corpus methodology because of its ‘skewedness’48, which is in line with Chomsky (1962), who represents the generative approach (see also Karlsson, 2009) and argues against “accumulating huge masses of unanalyzed data and trying to draw some generalizations from them” (Andor, 2004, p. 97). In contrast, Sinclair

45 Taylor (2008) deepens the discussion about corpus linguistics as a discipline, paradigm, framework, theory, method, or philosophical approach by conducting a concrete corpus analysis with the collection of definitions. She does not get to a definite answer, but states that multiple views exist without much explicit discussion of the topic.
46 On the ‘scope and limits’ of corpus linguistics, see e. g. Gast (2006).
47 They present an overview on corpus-based language studies and resources including a range of concrete examples, e. g. language variation studies on pp. 160-194 as well as contrastive and diachronic studies (pp. 178ff.).
48 Since such concerns have been found to be disproven by various valuable corpus studies – a rich field which cannot be discussed in detail here – in this work, the structural theory applying corpus-based methods is taken as a general premise.

(1991, esp. p. 6) gives strong arguments against introspective and for empirical linguistics. Regarding the history49 of the field of corpus linguistics, before 1960, ‘early corpus linguistics’ conducted manual, pre-electronic linguistic enquiries based on observations of language in use, e. g. for lexicography or language acquisition. The Brown Corpus of American English50 developed in the 1960s was the first machine-readable linguistic corpus containing one million running words (see e. g. Johansson, 2009). With the enormous progress in computer science since the 1980s, corpus linguistics became an increasingly important method in many linguistic fields. Corpora of the 1990s and later include the British National Corpus (BNC)51 with 100 million words of written and spoken language to be used for lexicography, the International Corpus of English (ICE)52 containing different varieties of English, or the Bank of English53 corpus comprising 450 million written and spoken words. In recent years, web corpora have become increasingly prominent (see e. g. Baroni & Kilgarriff, 2006). More and more ‘Web as Corpus’ initiatives offer corpora of around 1 billion tokens; the Netspeak54 web service, for example, assists authors by using the Internet as a source of common language (see also Stein et al., 2010). Xiao (2009b) provides a list and description of corpora of various kinds; more examples for available corpora can e. g. be found in the BYU55 corpus search interface. For German, the first large corpus (containing ca. 11 million words) which has been manually analysed for an early frequency dictionary is described in Käding (1897). Recent German examples are: Deutsches Referenzkorpus; German Reference Corpus (DeReKo)56, the corpora of the project Digitales Wörterbuch der Deutschen Sprache; Digital dictionary of the German language

49 The origin and history of corpus linguistics is e. g. presented by Meyer (2009).
50 http://khnt.aksis.uib.no/icame/manuals/brown, last accessed 2012-10-24.
51 Burnard & Aston (1998); http://www.natcorp.ox.ac.uk, last accessed 2012-10-19.
52 http://ice-corpora.net/ice, last accessed 2012-10-19.
53 http://www.titania.bham.ac.uk, last accessed 2012-10-19.
54 http://www.webis.de/research/projects/netspeak, last accessed 2012-10-19.
55 http://www.americancorpus.org, last accessed 2012-10-19.
56 Kupietz et al. (2010); http://www1.ids-mannheim.de/kl/projekte/korpora, last accessed 2013-12-07.

(DWDS) (see section 2.1.1), available digital text databases of the Leipzig Corpora Collection57, or the TIGER58 corpora.59 A growing field of interest is language teaching and learning on the basis of corpora (see e. g. Kettemann & Marko (2000, on data-driven learning and discovery learning) or Aston (2001); Römer (2009); Abel & Zanin (2011)).60 Gries (2006) – in his article “Towards a More Rigorous Corpus Linguistics” – gave general recommendations on future perspectives for further development because the methodology is ‘not technically perfected’. He elaborates on data gathering, dispersion, and quantitative evaluation; and he recommends developing the existing methods further and, in addition, learning from general computational linguistics, psycholinguistics, and psychology. The following sections go into detail on different types of corpora and on their possible annotations.

Corpus typology There are various types of corpora (Hundt, 2009; Hunston, 2009)61 with different perspectives of classification, such as general vs. specialised language corpora or diachronic vs. synchronic corpora.62 In the context of this study, especially the notions of reference63 corpora (Sinclair,

57 http://corpora.informatik.uni-leipzig.de/download.html, last accessed 2012-10-19.
58 http://www.ims.uni-stuttgart.de/projekte/TIGER, last accessed 2012-10-19.
59 These corpora are e. g. contrasted by Duffner & Näf (2006), who make clear that the respective corpus choice always depends on the specific aims of a study.
60 In an application for language learning, Ludewig (2005) uses corpus-based collocation analysis. An overview of further possible applications is e. g. given by McEnery & Wilson (2001, pp. 103-132).
61 As to the appropriateness of a certain corpus to investigate a certain object of research, see McEnery & Wilson (2001, pp. 171-173). Also the overview article by Heid (2009) describes corpora, their composition, their authenticity, as well as their annotation – and lists large national text corpora as well as specialised, parallel, and web corpora.
62 A further category are learner corpora (Granger, 1994), which collect texts of language learners of different proficiency levels or in (non-)mother tongues, ‘L1’ indicating first language, ‘L2’ second language.
63 In the technical part of this thesis, the term ‘reference corpus’ is not used in an ‘absolute’ sense. There, ‘reference’ does not refer to ‘standard’ or ‘dominant’ variety, but is used neutrally and can be any other variety.

1996), of representative corpora, and of comparable corpora are worth some special attention:

“A reference corpus is one that is designed to provide comprehensive information about a language. It aims to be large enough to represent all the relevant varieties of the language, and the characteristic vocabulary [...]. Questions of balance and representativeness recur in the discussion of reference corpora. They are extremely difficult to define [...]. While it is not normally claimed that there is a core variety of a language, there appear to be a large number of heavily overlapping varieties, sharing the bulk of their vocabularies and almost all the syntactic rules. Marginal vocabulary items differentiate them and slight individuality of phraseology.”64

According to Biber (1993, p. 243), “a corpus must be representative in order to be appropriately used as the basis for generalizations concerning a language as a whole” or representative of a domain (e. g. specialised corpora), a time range, etc. Representative means containing samples of all major text types proportional to their usage in every-day language. For observations on the question of balance and representativeness in a corpus, see also Lemnitzer & Zinsmeister (2006, pp. 50-54). Regarding ‘comparable corpora’, the EAGLES guideline gives the following definition:

“A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora.”65

It continues:

64 Expert Advisory Group on Language Engineering Standards Guidelines (EAGLES); http://www.ilc.cnr.it/EAGLES/corpustyp/node18.html, last accessed 2012-10-19.
65 http://www.ilc.cnr.it/EAGLES/corpustyp/node21.html, last accessed 2012-10-25.

“The possibilities of a comparable corpus are to compare different languages or varieties in similar circumstances of communication, but avoiding the inevitable distortion introduced by the translations of a parallel corpus.”

The research community is faced with the inherent vagueness of the concept of corpus similarity and comparability (see also section 1.2.3.4). Besides this, the notion of similarity across languages, cultures, text varieties or genres, and circumstances of communication has to be pursued. Aijmer (2009) presents a comparison of the applications and designs of parallel and comparable corpora, parallel corpora being translated and comparable corpora being originals, written by native speakers on similar topics. Bilingual terminology extraction, for example, uses comparable corpora if parallel corpora are scarce.66

Corpus annotation The annotation of corpora can consist of various types of additional information about linguistic entities on different levels of linguistic description. A recent introductory book to corpus annotation is Wilcock (2009), and an overview article for the description of annotation tools is given by Lehmberg & Wörner (2009), who refer especially to standards such as the extensible markup language (XML)67, to the Text Encoding Initiative (TEI)68, or to MATE69. Ide (2007) is a further article on the Linguistic Annotation Framework (LAF), XML, and TEI. More details on corpus preparation and annotation can be found in Lemnitzer & Zinsmeister (2006, pp. 60-98). Bergenholtz & Tarp (1995, p. 34) point to the fact that annotations can obfuscate or inhibit possible interpretations if the latter do not conform to the theory the annotation is based on.70
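As a minimal sketch of what XML-based token annotation can look like, the following example serialises a short German sentence with lemma and part-of-speech information. The element and attribute names (`s`, `w`, `id`, `lemma`, `pos`), the tag values, and the example sentence are illustrative assumptions; they are not taken from the TEI, MATE, or LAF specifications.

```python
import xml.etree.ElementTree as ET

def annotate_sentence(tokens):
    """Build an XML sentence element from (word, lemma, pos) triples.

    The schema used here is hypothetical and only meant as illustration.
    """
    sentence = ET.Element("s")
    for index, (word, lemma, pos) in enumerate(tokens, start=1):
        token = ET.SubElement(
            sentence, "w", {"id": f"w{index}", "lemma": lemma, "pos": pos}
        )
        token.text = word
    return sentence

# Hypothetical, manually annotated example sentence.
example = [
    ("Die", "die", "ART"),
    ("Häuser", "Haus", "NN"),
    ("stehen", "stehen", "VVFIN"),
]

xml_string = ET.tostring(annotate_sentence(example), encoding="unicode")
print(xml_string)

# Reading the annotation back: all lemmas of the sentence.
lemmas = [w.get("lemma") for w in ET.fromstring(xml_string)]
print(lemmas)
```

Keeping the annotation in attributes leaves the running text recoverable from the element content, which is one common design choice among several.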

66 In the present variety study, no truly parallel corpora are available either (except e. g. for laws); therefore, comparable corpora are used.
67 http://www.w3.org/TR/REC-xml, last accessed 2012-10-19.
68 http://www.tei-c.org, last accessed 2012-10-19.
69 Multilevel Annotation / Tools Engineering; http://www.ims.uni-stuttgart.de/projekte/mate, last accessed 2012-10-19.
70 Also for learner corpora, error annotations are a special category, since they partly contain interpretations. Lüdeling (2008) goes into details on problems with learner corpus annotation.

In the following, possible annotations starting from the lower levels of linguistic description will be described. On the token level, single linguistic units are separated, resulting in ‘vertical files’ with one token per line. Grefenstette & Tapanainen (1994) is an early paper on problems in tokenisation, which they list as acronyms and abbreviations, numbers and dates – suggesting a flexible modular approach for their solution. Atwell (2009), in a case study on multi-word expressions (MWEs), goes into detail on their criteria and problems, especially in tokenisation, and Schmid (2009) explains the procedure and the problems of word-level annotation. Furthermore, Fitschen & Gupta (2009) focus on tagging and lemmatising, especially on the case of disambiguation, and Rayson & Stevenson (2009) elaborate on the methods of disambiguation in semantic tagging as well as on its evaluation. Regarding higher levels of linguistic description, linguistic chunks can be annotated (see e. g. Kermes, 2009), and partial or complete parsing of sentences can be done (see e. g. Bohnet, 2009, 2010). Annotations are also possible on the levels of pragmatics and discourse linguistics; however, these are still mostly done manually. The more annotation is added (e. g. syntactic chunking, parsing, or semantic information), the less complex the queries for linguistic phenomena (e. g. grammatical relations) need to be, especially on higher levels of linguistic description. This relates to choosing the adequate level(s) of annotation for a specific study and the possible trade-off between pre-processing investment and querying complexity or the quality of results. Corpus annotation for the validation of linguistic hypotheses has both advantages, since it is an empirical method, and disadvantages, since annotations are never really comparable. Annotation both depends on and influences analysis results.
Regarding regional varieties, the quality of automated comparison results for varieties depends substantially on the quality of the linguistic annotation, which is a challenge for tools that have in most cases been created for more frequently used dominant varieties in contrast to lesser-used varieties. As a related topic, extra-linguistic metadata – annotation on even higher 34 1 Introduction and background

levels such as documents or texts – can be added as a further enhancement of the pure text, e. g. on authors or publication time, place, type etc.

1.2.3.3 Data extraction from corpora The basis for comparing instances of linguistic phenomena is their correct extraction from running text. For the quality of results, especially with quantitative methods, the size of the corpora is crucial according to Church & Mercer (1993), who claim that ‘only more data is better data’.71 A fundamental premise for quantitative corpus-based research, according to Biber & Jones (2009), is “[...] determining the ‘unit of analysis’ and the appropriate research design required for a particular research question”. In the following, extraction approaches and methods on several levels of linguistic description are presented.

Uni-gram level For single-word term extraction for dictionaries, which is the most prominent field of application, e. g. Ahmad et al. (1992) define the ‘weirdness ratio’72 as a measurement to specify how much more frequent a phenomenon is in e. g. a specialised than in a reference corpus, see (1.1). In the case of term extraction, the highly ranked phenomena are good terminology candidates.

\[
\text{weirdness ratio} = \frac{\text{relative frequency in specialised corpus}}{\text{relative frequency in reference corpus}} \tag{1.1}
\]
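The ratio in (1.1) can be illustrated with a minimal Python sketch. The function names, the toy token lists, and the treatment of words that are absent from the reference corpus (returning infinity) are assumptions made for this illustration only; they are not taken from Ahmad et al. (1992).

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its relative frequency in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def weirdness_ratio(word, specialised_tokens, reference_tokens):
    """Relative frequency in the specialised corpus divided by
    relative frequency in the reference corpus, as in (1.1)."""
    spec = relative_frequencies(specialised_tokens)
    ref = relative_frequencies(reference_tokens)
    if word not in ref:
        # Assumption of this sketch: a word unseen in the reference
        # corpus gets an infinite ratio if it occurs in the specialised one.
        return float("inf") if word in spec else 0.0
    return spec.get(word, 0.0) / ref[word]


# Toy data (hypothetical): 'corpus' is over-represented in the specialised text.
specialised = ["corpus", "corpus", "the", "term"]
reference = ["the", "the", "corpus", "a", "a", "a", "a", "a"]
ratio = weirdness_ratio("corpus", specialised, reference)
```

In practice, candidates would be ranked by this ratio, and a smoothing scheme would probably replace the infinite value for unseen words.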

Kilgarriff (1997a) reports on the first approaches of corpus lexicography73: he suggests using data extracted from authentic texts and discusses adding frequencies from corpora to lexicons, including a detailed description of the methodology and evaluation. In the ‘fourth age of corpus lexicography’ described by Kilgarriff & Tugwell (2002) (where the first was pre-computer, the second (around 1980)

71 A general minimum corpus size for corpus analyses cannot be specified. This is only possible for specific tasks where experiments can be conducted with different corpus sizes, in order to find the lowest threshold yielding useful results.
72 Such a ratio is mentioned in Kilgarriff (2009) as well.
73 Another early introduction to computational lexicography is Atkins & Zampolli (1994).

based on concordances (see e. g. the COBUILD project), and the third dealt with statistical summarising over data), corpora are said to be an indispensable basis for lexicographic work.74 Teubert (2004, pp. 16-18) also promotes the corpus approach for lexicography:

“A corpus-driven perspective could give rise to an entirely new (and perhaps utopian) generation of dictionaries. It would list what the corpus analysis reveals as units of meaning and not just words, and the meaning it assigns to these units of meaning would be based on their collocation profiles.” (p. 17).

He continues: “Corpus linguistics and lexicography belong together. They are the winning combination for the future of dictionary making.” (p. 18).75 The overview article of Heid (2009) describes the extraction of data for lexicographic use according to user needs and presents tools76 for the selection of raw material for lexicography. He further elaborates on the two-way interaction between lexicography and corpus linguistics: lexical data are used in corpus annotation, the production of dictionary entries is supported, and combined resources (dictionary and corpus) are built. He also reports on the selection of raw material for the lexicographer according to qualitative and quantitative criteria, on the units of a dictionary, as well as on the challenges of homonymy and polysemy.

Bi-gram level Based on the ‘remarkable commonness’ of collocations (Hausmann, 1984), statistical association measures (see section 1.2.3.5) can be used

74 An early project was conducted e. g. by Johansson & Hofland (1989), who collected frequencies of English vocabulary and word (class) combinations from the 1-million-token corpus LOB (Lancaster-Oslo-Bergen Corpus, http://khnt.hit.uib.no/icame/manuals/lobman, last accessed 2012-10-25) to be used for teaching.
75 Halliday et al. (2004) also describe the relation between lexicology and corpus linguistics.
76 These tools are semi-automatic: the user interprets the output, e. g. Heid et al. (2004), which is a system to interactively compare dictionary entries with corpus examples. Evert et al. (2004a) also show statistics and tools to support corpus-based dictionary updating.

to find word pairs occurring proportionally more frequently than statistically expected. Especially with respect to collocations, Church & Hanks (1990) show how to extract MWEs from corpora with simple methods, an approach to be used for lexicography. Smadja (1993) describes one of the earliest well-documented collocation extraction systems, Xtract, as a comprehensive lexicographic toolkit for English which combines statistical (to find all significant phenomena) and symbolic (filtering them with part of speech (PoS) tags) approaches. It uses methods to separate out irrelevant data, which are pairs of words supposedly not used consistently within a single syntactic structure, reaching a precision of 80 % and a recall of 94 %. For German, it is done the other way round: collocation extraction by PoS is followed by association measure (AM) ranking, e. g. implemented by Krenn (2000), Evert (2005a), or Heid (2011), and Kilgarriff et al. (2008) also use this symbolic and statistical approach. Manning & Schütze (1999, pp. 162f.) cover MWEs and collocations with hypothesis testing and introduce AMs, the reason for using likelihood ratios being their clear and intuitive interpretation. Prescher & Heid (2000) use probabilistic clustering methods for German77 verb-noun collocations taking selection classes into account. Fraas (2001), for example, presents a (positive) evaluation of statistical results for collocations as being representative for semantic knowledge. A human interpretation of the resulting co-occurrence partners is still always necessary. Approaches to collocation extraction including morpho-syntactic properties are shown in Evert et al. (2004b), who extracted number and case preferences

77 According to the topological field model (see section 1.2.1), in German there is quite free word order in the middle field and case ambiguities exist, which is why the extraction of grammatical relations such as subject/object+predicate is not trivial (see e. g. Ritz & Heid, 2006).

of adjective+noun combinations to be used in lexicography.78 A comprehensive state-of-the-art overview of procedures, approaches, and methods for statistical collocation extraction is given by Evert (2005a), offering concrete available software. He furthermore confirms a correlation with linguistic-lexicographic intuition: as a first hint, AMs are useful, but word frequency effects can be problematic. A corpus-driven approach for the measure of collocationality (the tendency of words to occur with particular other words) based on entropy is presented by Kilgarriff (2006), who uses grammatical relations and states that items with low entropy but a long and diverse list of collocates are most useful for lexicographical work. Fazly & Stevenson (2006) extract morphologically fixed collocations to find collocations with strong morphological preferences. The COSMAS co-occurrence analysis (see Keibel & Belica, 2007; Brunner & Steyer, 2007) shows significant regularities in the use of word combinations, which depend on the chosen corpus and the parameters such as window size or linguistic pre-processing. Todirascu et al. (2008) present another combination of statistical and symbolic approaches to verb-noun collocation extraction for lexicography, and Heid & Prinsloo (2008) worked on collocational false friends in bilingual dictionaries. Heid (2008) gives an overview of current approaches in computational phraseology and concludes that more annotated corpora on higher levels of linguistic description are needed. He elaborates on computational support for human-use

78 This is within a series of studies starting with Heid (1998) on noun-verb collocations for lexicography, followed by e. g. Zinsmeister & Heid (2002, 2004) for collocations of complex words and by Kermes & Heid (2003) using chunked corpora additionally for the extraction of idiomatic expressions by sub-categorisation and combinatory preferences. Along the same lines, Heid & Ritz (2005), Ritz (2006), and Ritz & Heid (2006) are hybrid approaches for noun-verb collocation and idiom extraction, followed e. g. by Heid & Weller (2008), who reach 92 % precision in the identification of passives using morpho-syntactic features for the indication of degrees of idiomatisation, e. g. by restricted variance. Heid et al. (2010) present the latest application of this extraction of significant word pairs from parsed text as web service via D-SPIN (http://weblicht.sfs.uni-tuebingen.de, last accessed 2012-10-20), which is a distributed resource infrastructure (see also section 2.1.1).

tools, on resources, and on NLP with automatic treatment of phraseological units (e. g. in information retrieval or machine translation (MT)), for example using methods to semi-automatically identify and classify phraseological units in texts. The necessary steps are pre-processing, collocation extraction according to patterns (symbolic, statistical, and hybrid), and selection according to association measures on the co-occurrence frequency of candidates (where enough data needs to be available for the methods to work properly). Idiomatic MWEs are identified using distributional semantics, since similar contexts can point to shared meaning components (see below), with the precondition that the phenomena have the same linguistic structure in the corpora to be compared (see Heid, 2011, p. 553). He further discusses statistical and symbolic approaches, especially for multi-word term extraction by typical PoS patterns. According to Evert (2009), the notion of collocations is controversial; he provides a summary of AMs (effect-size vs. significance measures) and gives hints on finding the right measure, but concludes that “no definitive recommendation can be made” (p. 1243), that there is “room for improvement”, and that it is “desirable to develop measures with novel properties” (p. 1245). A major caveat he mentions is that the assumptions of the underlying statistical models are not met, e. g. the randomness of corpora (see section 1.2.3.5). Furthermore, he considers the null hypothesis very unrealistic for co-occurrence data (see also section 1.2.3.5). In their overview article, Stefanowitsch & Gries (2009) summarise the broad collocational framework up to the level of collostructional analysis, which measures the association of words to syntactic patterns or constructions. Fritzinger & Heid (2009) take whole collocation paradigms into account by grouping morphologically related collocations and corresponding compounds.
They work on parsed data and first extract all collocations to later group and count them according to their base forms yielded with SMOR (see section 2.1.3), finding that the combinations are unevenly distributed across the patterns.

Higher levels of linguistic description Syntactic patterns can be extracted from corpora with PoS queries, e. g. for German according to the

topological field model (see section 1.2.1.1). Studies can also be carried out on higher-level annotations created by a chunker or a parser. Meurers & Müller (2009) give a recent overview of syntax and corpus research, where e. g. different treebanks to be used are presented (see also Nivre, 2009). As far as the extraction of semantic phenomena is concerned, Stubbs (1996) elaborates on understanding meaning by analysing patterns of words and grammar. The distributional hypothesis – which assumes that words with similar meanings tend to occur in similar contexts (see Firth, 1957) – is exploited in distributional semantics (Harris, 1954).79 Lin (1998) presents clustering methods to construct a thesaurus using a parsed corpus by detecting similar words in large monolingual corpora based on similarity measures according to their distributional patterns. Lee (1999) compares measures of distributional similarity, classifying similarity functions according to the information they incorporate and additionally presenting a novel function. Related to distributional semantics, unsupervised methods are applied with word space models. The calculated vectors contain frequency counts for context words and are mostly computed on corpora which have not been pre-processed. A general introduction to and overview of vector space models for the semantic processing of text in the fields of artificial intelligence or information retrieval is Turney & Pantel (2010). It is a collection of literature on vector space models, which comprise three classes:

i.) term-document (similarity of documents),
ii.) word-context80 (similarity of words), and
iii.) pair-pattern matrices (similarity of relations).
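As an illustration of class ii.), the following minimal Python sketch builds word-context count vectors from a token sequence and compares two words by the cosine of their vectors. The symmetric window, the cosine measure, the function names, and the toy sentence are assumptions made for this example; the literature cited above discusses many alternative weightings and similarity functions.

```python
import math
from collections import Counter


def context_vectors(tokens, window=2):
    """Count, for every word, the words occurring within +/- window tokens."""
    vectors = {}
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        contexts = vectors.setdefault(word, Counter())
        for j in range(lo, hi):
            if j != i:
                contexts[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (hypothetical): 'cat' and 'dog' share exactly the same contexts.
tokens = "the cat drinks milk the dog drinks milk".split()
vectors = context_vectors(tokens, window=1)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
```

With this toy input, the pair cat / dog receives a higher similarity than cat / milk, since the first two words occur in identical contexts.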

Also related to distributional semantics, in Latent Semantic Analysis, relations between a set of documents and the terms they contain are analysed by producing a set of concepts related to the documents and terms. A theory description

79 An information page on lexical semantics and distributional semantic models can be found at http://www.wordspace.collocations.de/doku.php (last accessed 2012-10-19).
80 It has to be noted that synonym detection by context finds antonyms as well.

and a very thorough explanation of the background, the functions, the effects, and a comparison to human learning81 is provided by Landauer & Dumais (1997), who define it as ‘acquired similarity and knowledge representation’ to simulate learning and to analyse psycho-linguistic phenomena. Global knowledge can be induced indirectly from local co-occurrence data. Latent Semantic Analysis is a reconstruction of a system of multiple similarity relations in a high-dimensional space. Random Indexing methods are an efficient, scalable, and incremental alternative to standard word space models, additionally handling multilingual data (see e. g. Sahlgren, 2005). They generate high-dimensional vector spaces in which words are represented by context vectors whose relative directions are assumed to indicate semantic similarity.

1.2.3.4 Comparative corpus linguistics In the field of computational comparative linguistics on the basis of corpora, different language varieties are compared and described semi-automatically. The following sections relate to the general notion of comparing corpora and to corpus comparability. Concrete studies and systems applied in this area will be presented in section 2.1.2.

Comparison of corpora The notion of comparing corpora can refer to many features – to their contents, their composition, or to phenomena on different levels of linguistic description.82 In the present context, ‘comparison of varieties’ means a parallel analysis of corpora of comparable size and contents. The ‘tertium comparationis’, i. e. the common quality of the two variety text collections to be compared (see e. g. McEnery et al., 2006, p. 179), is here

81 The question “Does Latent Semantic Analysis reflect free human associations” is posed by Wandmacher et al. (2008), who report a weak correlation. They further note that Latent Semantic Analysis estimates for weakly associated terms are much closer to those of humans than for strongly associated terms.
82 Regarding comparative vs. contrastive corpus linguistics, the former compares varieties of one language whereas the latter compares different languages. Cross-linguistically applicable categories are crucial for contrastive, but less relevant for comparative approaches.

the shared general linguistic structure and phenomena on different levels of linguistic description. Biber (1990) elaborates on methodological issues for corpus linguistics in the statistical study of variation (focusing as well on the overall size and the composition of the corpora used). In this context, Biber & Jones (2009) differentiate

• ‘type A designs’: corpus-based studies of a linguistic feature (e. g. active / passive),
• ‘type B designs’: corpus-based studies of texts and text categories (e. g. registers), and
• ‘type C designs’: corpus-based studies with sub-corpora as the unit of analysis.

For automated extraction and comparison of phenomena in corpora, the results can be categorised83 as

• linguistically irrelevant phenomena which are due to actual facts (e. g. currency names such as Lire in older South Tyrolean texts) vs.
• interesting categories, i. e. variety-specific peculiarities, e. g. Notspur (ST vs. Standstreifen DE; ‘emergency lane’).

Following Heid (2011), for collocations, the results of automated extraction can be differentiated into three categories, the first two of which are not specific to a regional variety in the linguistic sense of the term:

i.) linguistically unremarkable co-occurrences (‘trivial differences’ / ‘reality descriptions’) which are due to actual concepts or facts or to the general situation in variety regions (e. g. Europäische Akademie (‘European Academy’) for South Tyrol),84
ii.) co-occurrences containing lexical peculiarities (which are ‘trivial’ as well, e. g. heurige Saison with the regionalism heurig; ‘this year’s season’), and

83 Errors caused e. g. by wrong annotations are not taken into account in this categorisation.
84 See also Heid (2011, p. 549), who additionally refers to differences that reflect the selection of topics in the respective corpora.

iii.) the most interesting category being variety-specific phrasal peculiarities (e. g. weißer Stimmzettel being a direct translation from IT scheda bianca, ‘void ballot’; vs. DE leerer / ungültiger Stimmzettel).

Corpus comparability For contrasting phenomenon occurrences in different corpora, it is crucial to verify the comparability of these corpora in order to obtain useful results. The notion of comparability and similarity is important because of the interpretation of the findings and the generalisations to be made, especially since language and linguistic behaviour are very variable – influenced by a multitude of factors. The question of the appropriate comparability of corpora relates to their homogeneity as a prerequisite, while no established measures exist for either of them. Stubbs (1996, p. 152) already pointed out the importance of corpus homogeneity in order to understand results correctly and not to interpret differences within single corpora as inter-corpora differences (see also Kilgarriff, 1997b, who takes a close look at homogeneity and similarity of corpora). Similarly, in Křen & Hlaváčová (2008), “comparability of corpora is shown to play a key role” for their studies. Gries (2009a) claims that corpora are very crude samples of language, because – in contrast to language – they are never infinite, representative, balanced, or complete. Furthermore, corpora themselves are variable, e. g. in homogeneity. The definition of the term comparability for a corpus is still under construction. Comparable corpora (see section 1.2.3.2) are defined as pairs (or other tuples) of monolingual corpora which are not necessarily translations of each other but share some characteristics (domain, genre, topic, etc.). The degree of comparability is perceived as the amount of these common characteristics: at one extreme there are parallel corpora and at the other independent corpora which have nothing in common (Prochasson, 2010). A very recent conclusion on corpus comparability is given by Su & Babych (2012):

“However, so far there is no widely accepted definition of comparability. For example, there is no agreement on the degree of similarity that documents in comparable corpora should have or on the criteria for measuring comparability.”.

Kilgarriff (2001) and Gries (2007) go deeply into detail on this topic, both from the theoretical and from the practical perspective (see also section 2.1.2). Kilgarriff (2001) in his comprehensive overview states that no objective strategies exist for describing and comparing corpora. He gives a critical review of statistical methods used. An assessment of the homogeneity of a corpus (according to text types, grammar, lexis, and contents within a text type) is preliminary to a quantitative approach to similarity. Kilgarriff (2001) observes that corpora with high within-corpus variability usually contain general language, and corpora with low within-corpus variability mainly comprise specialised texts. Gries (2007) elaborates on the variability within and between corpora (at different levels of granularity) in order to be able to interpret and generalise comparison results; above all, methodological considerations are described. The basic question is how to quantify the degree of variation (by multi-variate exploratory data analysis techniques) and the degree of homogeneity – where no satisfying definition can be given. Gries (2007) presents a sophisticated technique for measuring corpus homogeneity based on re-sampling methods. He additionally recommends exploratory data analysis to provide internal estimates, in order to show how superficially different results can reflect similar underlying tendencies. Homogeneity has to be quantified with respect to the phenomenon studied, and an unreliability estimate or confidence measure for extracted data is needed. More traditional measures for corpus comparability are their ‘complexity’ – as also investigated for example in learner corpus studies – e. g. type-token ratio (TTR)85 for vocabulary richness and lexical variability (see Herdan, 1964; Oakes, 2009, p. 1073, who presents further alternatives) and ‘lexical

85 Weitzman (1971) recommends logarithmic TTR since it remains constant for samples of different size.

density’ (Stubbs, 1986, p. 33), defined as the ‘information load’, which indicates the proportion of content-bearing words to the total number of words in a text (see also section 3.3.1). For concrete studies on corpus comparability, see section 2.1.2.
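The two traditional measures mentioned above can be sketched in a few lines of Python. The function names and the toy function-word list are hypothetical; in particular, the reading of ‘logarithmic TTR’ as log(types) / log(tokens) is an assumption of this sketch, and lexical density is approximated here by treating every token that is not in a given function-word list as content-bearing.

```python
import math


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def log_type_token_ratio(tokens):
    """Assumed reading of 'logarithmic TTR': log(types) / log(tokens)."""
    if len(tokens) < 2:
        raise ValueError("at least two tokens are needed")
    return math.log(len(set(tokens))) / math.log(len(tokens))


def lexical_density(tokens, function_words):
    """Proportion of tokens that are not in the given function-word list."""
    content = [t for t in tokens if t not in function_words]
    return len(content) / len(tokens)


# Toy data (hypothetical): one repeated function word in five tokens.
sample = ["the", "dog", "chased", "the", "cat"]
ttr = type_token_ratio(sample)
density = lexical_density(sample, function_words={"the"})
```

Both values depend on tokenisation and, for lexical density, on how content words are identified (e. g. via PoS tags rather than a word list).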

1.2.3.5 Statistics for comparing corpora In the following, central statistical characteristics of linguistic data and association measures (AMs) to be used for measuring statistical significance of phenomena are presented and compared. In the context of this study, the null hypothesis86 is that there is no difference between phenomenon frequencies in the variety corpora. This hypothesis has to be rejected with the help of AMs, which determine the significance of the frequency difference of phenomena. A cut-off value has to be determined according to the chosen degree of confidence. Baroni & Evert (2009)87 go into detail on the statistics in a two-sample setting with two corpora and the question of whether they are significantly different with respect to a certain property. The notion of ‘contingency tables’88 is introduced there, which show observed and expected frequencies of a phenomenon for the different corpora. In general, they remark that the

“key to the successful application of statistical techniques to linguistic problems lies in being able to frame interesting linguistic questions in operational terms that lead to meaningful significance testing”.

The translation of the research question into a meaningful null hypothesis is thus a basic and crucial point. Further important features are the unit

86 A null hypothesis is described as an ‘uninteresting’ hypothesis that is hoped to be rejected.
87 For a further ‘gentle’ introduction to statistics for linguists including basic notions as well as many applied examples, tools, and other resources, see Krenn & Samuelsson (1997). More technical and applied books are Agresti (1996) on categorical response data e. g. from bio-medicine or social sciences containing mathematical details or Dalgaard (2002) and Gries (2009b), who discuss implementational aspects using the programming language R.
88 An easy-to-use tool for log-likelihood (LL) calculation using contingency tables can be found on http://ucrel.lancs.ac.uk/llwizard.html, last accessed 2012-10-21.

of measurement, the assumptions of a test, and the meaning of the p-value. The latter is defined as the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The null hypothesis is rejected if the p-value is less than or equal to the significance level. A significance level of 0.01 % (p < 0.0001), for example, corresponds to the 99.99th percentile of the distribution of the test statistic under the null hypothesis, and analogously for other percentiles. According to Biber & Jones (2009),

“[i]nferential statistical tests help to identify meaningful differences, as opposed to differences that occur just due to random chance. However, we would argue that inferential statistics should be used and interpreted with caution in corpus-based research. Tests of statistical significance depend on the sample size: as the sample size becomes larger, the difference among groups required to achieve significance becomes smaller. [...] With very large samples, it is easy to find small linguistic differences that are statistically significant but not strong; we would argue that these differences are often not interesting, because they do not reflect the important differences across text categories. By also considering measures of strength, researchers can identify the linguistic differences that are important and therefore more interesting for interpretation.”.

Inter-related issues regarding the statistical comparison of phenomenon frequencies across corpora according to Rayson et al. (2004) are

• the corpus representativeness,
• the homogeneity within the corpora,
• the comparability of the corpora (see section 1.2.3.4), and
• the reliability of the chosen statistical tests.

Statistical AMs may be argued to be superior to frequency alone due to their higher degree of sophistication, but this assumption has been called into question, for example by Kilgarriff (2005). Sorting according to pure

frequencies puts grammatical or function words at the top; statistical AMs single out the content-bearing lexical words. A good overview in German is given by Langer (2009), who summarises that for collocations, all AMs calculate the association of word pairs from the difference between the actual frequency of occurrence and the frequency that would be expected if the words were not associated (the latter of which is calculated differently in each test). The most important difference between the measures is the significance they assume for word pairs with different frequencies. Log-likelihood (LL), for example, is said to be very correct, but complex and costly to implement and calculate. A practical decision has to be taken whether the effort is worth it, and a thorough cost-benefit calculation is necessary.
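To give an impression of the implementation effort, the following minimal Python sketch computes the log-likelihood statistic G² = 2 Σ O ln(O / E) for one phenomenon in a two-corpus setting, over all four cells of the 2x2 contingency table. The function name, the argument layout, and the example frequencies are assumptions made for this illustration; cells with an observed frequency of zero contribute nothing to the sum.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """G2 statistic for a phenomenon occurring freq1 times in a corpus of
    size1 tokens and freq2 times in a corpus of size2 tokens."""
    observed = [
        [freq1, size1 - freq1],  # corpus 1: phenomenon / everything else
        [freq2, size2 - freq2],  # corpus 2: phenomenon / everything else
    ]
    total = size1 + size2
    row_totals = [size1, size2]
    col_totals = [freq1 + freq2, total - freq1 - freq2]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total  # expected under H0
            if o > 0:  # by convention, 0 * ln(0) is taken as 0
                g2 += o * math.log(o / e)
    return 2 * g2


# Hypothetical frequencies: 30 vs. 10 occurrences in two 1,000-token corpora.
ll_different = log_likelihood(30, 1000, 10, 1000)
# Identical relative frequencies yield a value of (approximately) zero.
ll_equal = log_likelihood(10, 1000, 20, 2000)
```

The resulting value is then compared with a critical value of the χ² distribution with one degree of freedom (e. g. 3.84 for p < 0.05).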

Large number of rare events The phenomenon of large number of rare events (LNRE), which describes the fact that the majority of words in a natural language text is very rare (entailing that the product of the rank order and the frequency is constant; Zipf, 1932), has the effect that the different ends of the range show very different statistical behaviour. Thus it is difficult to find one measure for both, and a comparison is not really possible (Kilgarriff, 2001, p. 245). Baayen (2001) gives an introduction to the theory, the application, and the statistical analysis of word frequency distributions and presents suitable models. According to Evert (2006), “[t]he variation of observed frequencies for the different possible random samples can be predicted mathematically. This leads to a bi-nominal distribution [...] for low-frequency data [...] and a normal distribution [...] for higher frequencies.”. As to co-occurrences, the rareness of most words has the consequence, for example, that random word pairs of not extremely frequent words have to be very rare. Baroni (2009) deals especially with practical implications of LNRE distributions of word frequencies.

Randomness The ‘randomness’ of corpora and the applicability of statistical measures for corpus analysis are discussed e. g. in Kilgarriff (2005). According

to this article, the null hypothesis posits randomness and can thus never be true – with enough data, it is always possible to reject it. For example, the common words in two corpora always defeat the null hypothesis. He states that “the fact that a relation between two phenomena is demonstrably non-random does not support the inference that it is not arbitrary”. Hypothesis testing leads to unhelpful or misleading results and inappropriate inferences when results seem more significant than they are. His proposed solution is to use more data – then tendencies are supposed to be evident even without using statistics. However, the follow-up by Gries (2005), which includes tested solutions and shortcomings for null hypothesis testing, presents suggestions for methodological consequences, e. g. post-hoc testing and effect sizes (as the most relevant measure). He further suggests taking more research and proposals from other disciplines (medicine, psychology, etc.) into account. Evert (2006) introduces the library metaphor to show that, even though the intuitive assessment would be that language is not random because of its rules, it is the selection of the documents for a corpus that makes it (almost) random. He also presents practicable methods for identifying and quantifying non-randomness, but in conclusion, non-randomness is still acknowledged to be a problem.

Relative frequency Especially for co-occurrences, frequency alone is already a good indicator of significance, since joint occurrence by chance is very improbable for rare words. This is why almost every pair of infrequent words which occurs several times is not coincidental (see e. g. Langer, 2009). In order to normalise the absolute frequency, it is divided by the corpus size to obtain the relative frequency. However, the values for different word pairs are not comparable, since they are not normalised with respect to the frequencies of their parts. For word pairs with the same relative frequency, the association is weaker for frequent words than it is for rare words.

Mutual information Since Church & Hanks (1990), the ‘pointwise mutual information’ measure has been widely used. It comes from the field of information theory and is a measure of the dependence of two random variables, with the risk of over-estimating rare items (see Kilgarriff, 2001). For corpus comparison, it is the ratio of one word’s relative frequency in one corpus to its relative frequency in the joint corpus, stating how much information a word provides about a corpus.
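The corpus-comparison reading of the measure can be sketched as follows. This is a minimal illustration under the assumption that the ratio is taken on a base-2 logarithmic scale, as is common for mutual information; the function name and its arguments are hypothetical.

```python
import math


def corpus_mutual_information(freq_a, size_a, freq_b, size_b):
    """log2 of a word's relative frequency in corpus A over that in the joint corpus A+B.

    Assumes freq_a > 0; a word absent from corpus A has no defined score here.
    """
    rel_a = freq_a / size_a
    rel_joint = (freq_a + freq_b) / (size_a + size_b)
    return math.log2(rel_a / rel_joint)


if __name__ == "__main__":
    # Same relative frequency in both corpora: the word says nothing about corpus A.
    print(corpus_mutual_information(10, 1000, 10, 1000))
    # A single occurrence that happens to fall into corpus A already gets a high score.
    print(corpus_mutual_information(1, 1000, 0, 1000))
```

The second call illustrates the risk noted above: with two equally sized corpora, one isolated occurrence receives the highest score that is possible at all, so rare items tend to be over-estimated.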

Chi-square test The χ² test is a hypothesis test which is especially useful for comparing expected frequencies with observed frequencies by means of a 2x2 contingency table. It measures the significance of the deviations of a set of random variables from a hypothetically assumed value, e. g. bi-gram frequencies under the assumption that the distribution of words is random. According to Kilgarriff (2001), this test of whether corpora are ‘drawn from the same population’ is unreliable when the expected frequency is too small, and it is likely to over-emphasise common terms.
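As an illustration, the following sketch computes the χ² statistic for one word in two corpora. The layout of the 2x2 table (the word vs. all other tokens, corpus A vs. corpus B) is an assumption following common practice in corpus comparison, and the function name is hypothetical.

```python
def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Chi-square statistic for a word occurring freq_a times in corpus A
    (size_a tokens) and freq_b times in corpus B (size_b tokens).

    Assumes the word occurs at least once overall and does not make up a whole corpus,
    so that no expected frequency is zero.
    """
    observed = [[freq_a, size_a - freq_a],
                [freq_b, size_b - freq_b]]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    print(chi_square_2x2(30, 1000, 10, 1000))
```

The expected frequencies computed inside the loop are the values to which the reliability caveat above applies.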

Fisher’s exact test This test is based on the assumption of independence of two items in a 2x2 contingency table. It is a very elaborate measure of which LL is an approximation, even though the two diverge for low frequencies.
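A minimal sketch of a one-sided variant of this test is given below. It sums hypergeometric probabilities for a word in two corpora, using the same 2x2 layout (word vs. other tokens, corpus A vs. corpus B) that is assumed here for illustration; the function name is hypothetical, and the exact integer arithmetic makes it suitable for small to medium-sized counts only.

```python
from math import comb


def fisher_exact_one_sided(freq_a, size_a, freq_b, size_b):
    """P(X >= freq_a) under the hypergeometric null model of independence.

    X is the number of the word's occurrences that fall into corpus A when
    all freq_a + freq_b occurrences are distributed at random over both corpora.
    """
    total = size_a + size_b
    joint = freq_a + freq_b
    upper = min(joint, size_a)
    favourable = sum(
        comb(joint, x) * comb(total - joint, size_a - x)
        for x in range(freq_a, upper + 1)
    )
    return favourable / comb(total, size_a)


if __name__ == "__main__":
    # All three occurrences of a word fall into the first of two 4-token samples.
    print(fisher_exact_one_sided(3, 4, 0, 4))
```

Because the p-value is computed exactly, no minimum expected frequency is required, which is the main difference from the asymptotic tests discussed in this section.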

Log-likelihood Due to weaknesses in calculating the association between words by mutual information or χ², Dunning (1993) suggested the log-likelihood (LL) measure. The reason is that most of the phenomena investigated with corpora are rare events, but previous measures do not work correctly with rare events, and especially not when comparing rare with frequent events. It is recommended to rely on bi-/multinomial distributions for smaller texts instead of on the assumption of a normal distribution. LL compares the probability of two hypotheses.89 In collocation analysis, the values for rare pairs are weighted less; only for word pairs with a frequency < 5 does this lead to significant down-grading compared to other AMs. The value is high for non-associated word pairs as well: their failure to occur together is highly significant, thus upward and downward deviations need to be differentiated. LL compares observed frequency with statistically expected frequency, including data on further combinations of the two collocation parts. It is a hypothesis test of how probable it is that the collocation is not a random combination. Thus it quantifies how ‘surprising’ events are, and according to Kilgarriff (2001), it “corresponds reasonably well to human judgements of distinctiveness”. As an enhancement, Kilgarriff & Tugwell (2002, p. 130) suggest multiplying LL values by the logarithm of the phenomenon frequency for measuring e. g. lexicographic relevance, since they claim that LL over-estimates the significance of low-frequency items.

89 An LL value of >= 10.83 is e. g. significant at the 0.1 % level and an LL value of >= 15.13 is significant at the 0.01 % level.
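The comparison of observed with expected frequencies can be sketched for one word in two corpora as follows. The four-cell form of the statistic and the table layout (word vs. other tokens, corpus A vs. corpus B) are assumptions made for illustration; the two critical values are the ones quoted in footnote 89, and the function names are hypothetical.

```python
import math

# Critical values as quoted in footnote 89, strictest level first.
CRITICAL_VALUES = [(15.13, "0.01 %"), (10.83, "0.1 %")]


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """LL statistic 2 * sum(O * ln(O / E)) over the four cells of the 2x2 table."""
    total = size_a + size_b
    joint = freq_a + freq_b
    observed = [freq_a, size_a - freq_a, freq_b, size_b - freq_b]
    expected = [size_a * joint / total, size_a * (total - joint) / total,
                size_b * joint / total, size_b * (total - joint) / total]
    # Cells with an observed frequency of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def strictest_level(ll_value):
    """Return the strictest of the two quoted levels that is reached, or None."""
    for critical, level in CRITICAL_VALUES:
        if ll_value >= critical:
            return level
    return None


if __name__ == "__main__":
    value = log_likelihood(30, 1000, 10, 1000)
    print(value, strictest_level(value))
```

Note that the statistic itself does not indicate the direction of the deviation; comparing the observed with the expected frequency of the word in corpus A distinguishes over-use from under-use.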

Cochran rule According to Cochran (1954), for the results of statistical testing to be reliable, all expected frequencies should be > 1 and less than 20 % of the expected frequencies should be < 5. Rayson et al. (2004) extend the Cochran rule for comparisons of corpus word frequencies (f). They test the reliability of statistical tests (χ² and LL) in simulation experiments using differently sized corpora and different probabilities of words occurring in texts. Their conclusion is that the Cochran rule predicts the accuracy of both measures well, but in some cases it needs to be extended by using higher cut-off values. As a trade-off for corpus linguistics, they propose a new critical value of 15.13. Their test of whether the Cochran rule is reliable for large variations in corpus size and in word frequencies leads to the conclusion that it can even be made stricter and that LL values are more precise than χ² values, although both are quite accurate. They report no problems in comparing corpora of different sizes as long as low expected values in the contingency table can be avoided, and suggest the following extensions: on the 5 % level, expected f >= 13; on the 1 % level, >= 11; on the 0.1 % level, >= 8. They furthermore confirm that it is possible to “safely lower the Cochran rule at the 0.01 % level for the LL test to expected values of 1 or more. The trade-off is that the critical value is higher than at the usual 5 % level at 15.13”.
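The original rule as stated at the beginning of this paragraph can be expressed as a simple check on the expected frequencies of a contingency table. The sketch below is only an illustration of that statement; the function name is hypothetical, and the extended cut-off values of Rayson et al. (2004) are not implemented.

```python
def satisfies_cochran_rule(expected_frequencies):
    """True if all expected frequencies are > 1 and fewer than 20 % of them are < 5."""
    cells = list(expected_frequencies)
    all_above_one = all(value > 1 for value in cells)
    share_below_five = sum(1 for value in cells if value < 5) / len(cells)
    return all_above_one and share_below_five < 0.20


if __name__ == "__main__":
    print(satisfies_cochran_rule([20, 980, 20, 980]))      # unproblematic table
    print(satisfies_cochran_rule([0.5, 999.5, 0.5, 999.5]))  # expected values too low
```

In a 2x2 table, a single cell below 5 already amounts to 25 % of the cells, so the rule effectively requires all four expected frequencies to be at least 5 in that setting.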

Comparisons of AMs Each association measure is appropriate for a different kind of significance evaluation, as is also concluded by Evert (2009, p. 1243: “no definitive recommendation can be made”) and by Pecina (2008, p. 103):

“It is not possible to recommend ‘the best general association measure’ for ranking collocation candidates, as the performance of the measures heavily depend on the data/task”.

Evert (2005a) and the associated website90 give a thorough introduction to (pp. 75ff.) and comparison of (pp. 107ff.) statistical measures with a focus on collocations, concluding the following:

“For practical applications, LL is a convenient and numerically unproblematic alternative that gives very good approximations to the exact p-values.” (p. 114)

A comparison and discussion of all measures mentioned above can also be found in Evert et al. (2000), who conclude that LL in particular is most useful for collocation extraction (both for low- and for high-frequency items) and furthermore that the relative frequency measure taken alone gives a good approximation. Evert & Krenn (2001) and Evert & Krenn (2005) provide such a comparison as well, the latter presenting a collocation extraction evaluation including precision-recall graphs of different AMs.91 According to Kilgarriff (2001) as well as Evert (2005a, p. 113), LL prevails in computational linguistics as a de facto standard measure of association between words.

90 http://www.collocations.de, last accessed 2012-10-19.
91 An interactive calculator for the significance of a word’s occurrence in two corpora is given at http://mmmann.de/Sprache/signifikanz-corpora.htm (last accessed 2012-10-20).

1.2.3.6 Evaluation of corpus processing tools

When formulating queries for linguistic data extraction, a compromise has to be found: if the query is very restrictive, with a precise search pattern and a complex syntax, the results comprise not all, but mostly correctly extracted phenomena (true positives). If the query is rather general, this results in more, but possibly wrong ‘hits’ (false positives), which have to be separated out manually.92 NLP evaluation measures are presented by Paroubek et al. (2007) with a review of the history and terminology, introducing the notion of ‘gold standards’ for reference lists and pointing to the difference between technology evaluation and usage evaluation. Amar et al. (2008) elaborate on the evaluation of both software performance and software usage as well, the latter being an aspect which has not been taken into account in the present study. In Manning & Schütze (1999), the definition of the notions ‘precision’ and ‘recall’ for evaluation can be found (see also Salton & McGill, 1986; McEnery & Wilson, 2001, ch. 3.4). As can be seen in (1.2) to (1.4), precision is the fraction of retrieved93 instances which are relevant, corresponding to true positives. Recall is the fraction of relevant instances which have been retrieved. The F-score is the weighted harmonic mean of precision and recall.

\[
\text{precision} = \frac{\text{number of extracted terms included in gold standard list}}{\text{total number of extracted terms}} \cdot 100 \tag{1.2}
\]

\[
\text{recall} = \frac{\text{number of extracted terms included in gold standard list}}{\text{total number of terms in gold standard list}} \cdot 100 \tag{1.3}
\]

\[
\text{F-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{1.4}
\]

Popescu-Belis et al. (2006) show how to explore the influence of the intended context of an NLP system on its evaluation. They propose six classes of quality characteristics: functionality, reliability, usability, efficiency, maintainability, and portability. With vector-space representations of contexts and of quality characteristics models, such an influence can be computed.
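Returning to the measures in (1.2) to (1.4), their calculation can be sketched in Python as follows. The function name, the optional rank cut-off argument (cf. footnote 93), and the assumption that the ranked candidate list contains no duplicates are illustrative choices and not part of the cited definitions.

```python
def evaluate_extraction(extracted, gold_standard, cutoff=None):
    """Return (precision, recall, F-score) as in (1.2) to (1.4); the first two in per cent.

    `extracted` is a ranked list of candidate terms without duplicates,
    `gold_standard` a collection of reference terms, and `cutoff` an optional
    number of top-ranked candidates to be evaluated.
    """
    candidates = list(extracted) if cutoff is None else list(extracted)[:cutoff]
    gold = set(gold_standard)
    true_positives = sum(1 for term in candidates if term in gold)
    precision = 100.0 * true_positives / len(candidates) if candidates else 0.0
    recall = 100.0 * true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    reference = {"a", "b", "e"}
    print(evaluate_extraction(ranked, reference))
    print(evaluate_extraction(ranked, reference, cutoff=2))
```

Lowering the cut-off typically raises precision at the expense of recall, which is the compromise between restrictive and general queries described at the beginning of this section.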

92 In general, the outcome of evaluations depends considerably on the quality and the appropriateness of the test resources and on the gold standard lists used.
93 Retrieved / extracted generally means extracted up to a certain cut-off according to the rank.

Most use of precision and recall calculations is made on the uni-gram level, since there the gold standard is most clear-cut.94 Also on the bi-gram level, e. g. for co-occurrence extraction evaluation (Evert & Krenn, 2001; Evert, 2005a), precision and recall calculation is possible if appropriate gold standard lists are available. For higher levels of linguistic description, a gold standard is more difficult to define. Dybkjær et al. (2007) present an overview on the evaluation of more complex computational systems.

94 However, Cabré & Estopà (2003) show that e. g. for terminology, the agreement on a gold standard is non-trivial.

2 Related work and research desiderata

In this section, the relevant past and current work on central topics of the project’s research area which has been considered in the feasibility study while developing Vis-À-Vis is presented in detail. Section 2.1 shows related resources such as corpora and dictionaries as well as computational-linguistic studies and systems. Characteristic phenomena of South Tyrolean German that have already been investigated for different levels of linguistic description, including numerous examples, are shown in section 2.2. To conclude this part of the thesis, section 2.3 summarises previous findings and identifies gaps to be filled.

2.1 Resources and methods for corpus comparison

Comparative corpus studies have been conducted by various research groups for a considerable time. The following sections first present resources, i. e. examples of written corpora and variety-related dictionaries resulting from variety comparison. In a second part, studies conducted in the variety-linguistic area will be described, and third, a number of tools for the more or less automated analysis and comparison of corpora will be shown. Several similar and – experimentally – some completely different approaches to the one applied in the present project have been examined for helpful insights from various research directions.

2.1.1 Variety corpora and dictionaries

Corpora developed for regional varieties of pluri-centric languages are an indispensable resource for a detailed and systematic variety comparison and dictionary development based on finding relevant differences in corpora (see e. g. Heid, 2009). Corpus projects and dictionaries for a selection of pluri-centric languages will now be introduced, mentioning mainly electronic resources as well as important printed dictionaries.

English Most of the work on the English1 varieties (after corpora such as the FROWN2 corpus and the FLOB3 corpus) is done in the framework of the International Corpus of English (ICE)4 (see Greenbaum, 1996). This initiative is developing comparable corpora according to specific corpus compilation criteria in order to conduct systematic variety investigations. The website offers an extensive bibliography5 of more and less specific variety studies. Nelson (1996) explains all aspects of the ICE corpora in detail. The Corpus of Contemporary American English (COCA) is available at http://www.americancorpus.org (last accessed 2012-10-19). The International Computer Archive of Modern and Medieval English6 is another international organisation of linguists and information scientists, where digitised resources for English language material are collected and distributed. ICAME additionally provides a comprehensive bibliography7 on their website comprising various single studies which compare specific aspects in the variety corpora.

1 Trudgill & Jean (2002) as well as the Handbook of Varieties of English (Schneider et al., 2004) and Schneider (2007) gather comprehensive material about the English varieties around the world.
2 Freiburg - Brown corpus of American English; http://www.helsinki.fi/varieng/CoRD/corpora/FROWN, last accessed 2012-10-19.
3 Freiburg - LOB Corpus of British English (language change); http://www.helsinki.fi/varieng/CoRD/corpora/FLOB, last accessed 2012-10-19.
4 http://www.ucl.ac.uk/english-usage/ice, last accessed 2012-10-19; the most relevant online resources are listed again in appendix C.
5 http://www.ucl.ac.uk/english-usage/archives/ice-biblio.htm, last accessed 2012-10-19.
6 ICAME; http://icame.uib.no, last accessed 2012-10-19.
7 http://icame.uib.no/bib add.html, last accessed 2012-10-19.

A selection of printed variety dictionaries for English includes the following (see Hausmann, 1985b, pp. 402-404, for a comprehensive bibliography):

• A Dictionary of South African English on Historical Principles (Silva, 1996),
• Dictionary of New Zealand English (Orsman, 1998),
• Canadian Oxford Dictionary (Barber, 2004).

French The varieties of French are mainly studied with corpora in the context of the Trésor de la Langue Française Informatisé8. The project BFQS9 is especially investigating fixed verbal expressions in four varieties of French. Schafroth (2008) discusses the normative approaches in some of the following dictionaries, which are partly mentioned in the bibliography by Hausmann (1985b, pp. 404-406) as well:

• Dictionnaire du français Plus, à l’usage des francophones d’Amérique (DFP; 1988, 62 000 entries, chief editor: Claude Poirier) and Dictionnaire historique du français québécois,10
• Dictionnaire québécois d’aujourd’hui (DQA; Boulanger, 1993),
• Dictionnaire des canadianismes (1998-1989, editor: Gaston Dulong),
• Dictionnaire québécois-français. Pour mieux se comprendre entre francophones (DQF; 1999, 9 000 entries, editor: Lionel Meney),
• Dictionnaire FRANQUS - Français Québécois Usage Standard (online version 2008, printed 2009, editors: Hélène Cajolet-Laganière et al.).

8 TLFi; http://atilf.atilf.fr/tlf.htm, last accessed 2012-10-19.
9 Les expressions verbales figées de Belgique - France - Québec - Suisse (Lamiroy et al., 2003); http://bfqs.fltr.ucl.ac.be/publications.htm, last accessed 2012-10-19.
10 Both dictionaries are described on http://www.salic-slmc.ca, last accessed 2012-10-11.

German To start with the variety in focus here, the Korpus Südtirol11 is a documentation of the written language in South Tyrol contributing to language archiving, to language observation and analysis, to general language awareness, to fostering language culture in a multilingual border region, and furthermore to the historical and cultural heritage of the region. The main aim of the initiative is to provide an empirical basis for the debate on the linguistic situation in South Tyrol (see section 1.2.2). The initiative is preparing a written text corpus of South Tyrolean German, which can be queried together with other German variety corpora with the help of the distributed query engine implemented in the C4 project.12 The corpus C4 is designed as a reference corpus, balanced over text types and decades, including four standard varieties of the written German language throughout the 20th century, comparable with respect to contents and size. It has its roots in the project Digitales Wörterbuch der Deutschen Sprache13. The sub-corpora have been collected and prepared in the framework of the following national initiatives:

• the Austrian Academy Corpus,14
• a corpus of the project Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts,
• the Korpus Südtirol, and
• the Schweizer Text Korpus.15

They share a common structure, metadata set (e. g. bibliographical and regional information), data format, and indexing solution, and can be simultaneously accessed via a single interface, which is implemented in a distributed manner as a virtual corpus (Roth, 2009). Filtering options include structural criteria (sub-corpus, time range, and text type) as well as bibliographic metadata (author, title, and place of publication). The DWDS website furthermore provides a syntactic tree view for each query result to registered users. This tree view is the result of a web service provided by the University of Tübingen via the D-SPIN / WebLicht16 demonstrator in the framework of the CLARIN17 initiative, which aims at the integration, inter-operability, and persistence of tools and resources for language processing and research. Additionally, in the Variantengrammatik18 project, a newspaper and fiction corpus of the four central-European German varieties is being developed to allow for grammatical studies.

Prevailing dictionaries19 for German are – as the first comprehensive distinctive variant dictionary – the Variantenwörterbuch des Deutschen; Variant dictionary of German (VWB) (Ammon et al., 2004) and, with a focus on South Tyrolean German, the book Der Südtiroler Sonderwortschatz aus plurizentrischer Sicht: lexikalisch-semantische Besonderheiten im Standarddeutsch Südtirols (Abfalterer, 2007)20, the latter of which has been elaborated in the framework of the project Datenbank zum Südtiroler Deutsch21. The VWB includes current non-terminological lexical items of standard German which are special to at least one German variety, i. e. they are either exclusively used within one of the German-speaking communities, or they share some peculiarities with one or more other varieties. Abfalterer (2007) refined and added entries concerning the South Tyrolean variety and published a dictionary with a special focus on the German variants used in South Tyrol. Details on the methodology for its development can be seen in section 2.2.1.

11 Corpus South Tyrol; Abel et al. (2009); Anstein et al. (2011); http://www.korpus-suedtirol.it, last accessed 2012-10-19.
12 Dittmann et al. (2012); http://www.korpus-c4.org, last accessed 2012-10-19.
13 Digital Dictionary of the German language, DWDS; Klein (2004); Geyken (2007); Klein & Geyken (2010); http://www.dwds.de, last accessed 2012-10-19.
14 AAC; Biber et al. (2002); http://www.aac.ac.at, last accessed 2012-10-19.
15 Swiss Text Corpus, CHTK; Bickel et al. (2009); http://www.schweizer-textkorpus.ch, last accessed 2012-10-19.
16 https://weblicht.sfs.uni-tuebingen.de, last accessed 2012-10-19.
17 Wittenburg et al. (2010); http://www.clarin.eu, last accessed 2012-10-19.
18 Variant Grammar; Dürscheid et al. (2011); http://www.variantengrammatik.net, last accessed 2012-10-20.
19 Related to lexicography, Ebner (1995) reports on problems e. g. with Austrian German.
20 Abfalterer (2007, p. 70) specifically recommends enhancing the database on South Tyrolean German using a digital corpus, which has now been presented by Korpus Südtirol.
21 Database for South Tyrolean German, http://www.uibk.ac.at/projects/woerterbuch/sued/sued.html; part of the project Wörterbuch der nationalen und regionalen Varianten der deutschen Standardsprache, Dictionary of the national and regional variants of Standard German, http://www.uibk.ac.at/projects/woerterbuch/proj/proj.html; both last accessed 2012-10-20.

More German dictionaries – in addition to the following – can be found in Hausmann (1985b, pp. 401-402).22

• Duden - Deutsches Universalwörterbuch23 (Scholze-Stubenrecht, 2011),
• Österreichisches Wörterbuch (ÖWB; 2006, chief editors: Otto Back & Herbert Fussy),
• Schweizerisches Idiotikon (ca. 150 000 entries, chief editor: Hans-Peter Schifferle).

Portuguese The Centro de Linguística da Universidade de Lisboa24 developed corpora of the European and African varieties of Portuguese, which can be queried on their website: e. g. the Corpus de Referência do Português Contemporâneo (CRPC; Reference Corpus of Contemporary Portuguese). In the Linguateca25 language resource centre, corpora for Brazilian and European Portuguese are compiled and provided; CETEMPúblico, for example, is a free resource with 180 million tokens of European Portuguese. The COMPARA26 parallel corpus of Portuguese and English enables researchers to create sub-corpora of regional varieties of Portuguese and English and offers a complex search facility for comparing and contrasting these two languages, which is possible with respect to single varieties as well. Bacelar do Nascimento et al. (2006) are developing a corpus-derived lexicon of Portuguese in Africa based on a 3 million word corpus of spoken and written language to allow intra- and inter-corpora comparative studies.

22 Wiegand (2007) provides an international bibliography for German lexicography and dictionary research.
23 Wermke (1995), for example, describes the treatment of Austriacisms in German dictionaries.
24 CLUL; Linguistics department of the university of Lisbon; http://www.clul.ul.pt, last accessed 2012-10-20.
25 http://www.linguateca.pt, last accessed 2012-10-20.
26 http://www.linguateca.pt/COMPARA/compara en.html, last accessed 2012-10-20.

The major reference dictionary for the Portuguese language covering African, Brazilian, and European Portuguese is the

• Dicionário Houaiss da Língua Portuguesa (Houaiss Dictionary of the Portuguese Language; 2000, 228 500 entries, editor: Antônio Houaiss).

Spanish The Spanish language in Spain and South America is studied e. g. with the Corpus del Español27, which contains 100 million tokens from the 13th to the 20th century. The Corpus de Referencia del Español Actual (CREA)28 is another important reference corpus, created by the Real Academia Española. Examples of dictionaries are mentioned in the following; in addition see the bibliography of Hausmann (1985b, p. 406):

• Diccionario del Habla de los Argentinos29 by the Academia Argentina de Letras,
• Latin American Dictionary with American Spanish variants for every country.30

2.1.2 Comparative corpus studies

In this section, recent foundational references to the most important and influential past work in the research area of this thesis are presented chronologically, in order to show how researchers have tried to resolve the task of comparing corpora from various perspectives (for the theoretical foundation, see section 1.2.3). Related examples of corpus studies which are especially relevant for the development of Vis-À-Vis are listed. The work discussed in the following is only a sample of a much larger body of corpus studies which, in the present thesis, cannot be summarised in depth.

27 http://www.corpusdelespanol.org, last accessed 2012-10-20.
28 http://corpus.rae.es/creanet.html, last accessed 2012-10-20.
29 http://www.elortiba.org/hablapop.html, last accessed 2012-10-20.
30 http://www.asihablamos.com, last accessed 2012-10-20.

After the presentation of comparability studies, concrete investigations on different kinds of linguistic variation will be shown, also in order to present approaches already applied and established.

2.1.2.1 Studies on corpus comparability

Related to general differences between corpora, Johansson & Hofland (1989) conducted studies on the similarity of genres, e. g. to extract the most common words with the Spearman31 rank correlation statistics. Charniak (1993), coming from statistical language modelling, measured corpus similarity by calculating cross-entropy (defined as ‘insecurity’). Sekine (1997) parsed corpora and compared their sub-tree frequencies to successfully make a statement on corpus similarity. Kilgarriff (2001) suggests comparing corpora via characteristic words, which are words that are systematically used more in one text type than in another. He gives a critical review of the statistical methods used and proposes concrete measures for corpus similarity (keeping the theoretical limitations of the idea in mind) as well as a method for evaluating corpus similarity measures based on word or n-gram frequencies. For his promising approach (presented already in Kilgarriff, 1997b), ‘Known-Similarity Corpora’ are built, where distinct text types are mixed in different proportions. The compared corpus distance measures make clear

i.) that differences between corpora can be shown by highlighting words which are consistently used more in one corpus than in the other with the Mann-Whitney ranks test32 and

ii.) that for corpus similarity, χ² outperformed the Spearman rank correlation statistics and cross-entropy measures.

The measure CBDF (‘Chi By Degrees of Freedom’), used to calculate a normalised χ² for (sub-)corpus pairs, generally outperforms rank-based measures.
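A normalised χ² of this kind, together with a homogeneity score obtained from repeated random halving of a corpus, can be sketched as follows. This is a simplified illustration and not Kilgarriff's implementation: the restriction to the `top_n` most frequent words, the division by the number of words minus one as degrees of freedom, the splitting at document level, and all function names are assumptions made for this example.

```python
import random
from collections import Counter


def chi_by_degrees_of_freedom(counts_a, counts_b, top_n=500):
    """Chi-square over the top_n most frequent words of two frequency lists,
    divided by the degrees of freedom (number of words - 1; needs at least two words)."""
    joint = counts_a + counts_b
    words = [word for word, _ in joint.most_common(top_n)]
    size_a, size_b = sum(counts_a.values()), sum(counts_b.values())
    total = size_a + size_b
    chi2 = 0.0
    for word in words:
        for observed, size in ((counts_a[word], size_a), (counts_b[word], size_b)):
            expected = joint[word] * size / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / (len(words) - 1)


def homogeneity(documents, iterations=10, top_n=500, seed=0):
    """Mean normalised chi-square between random halves of one corpus.

    `documents` is a list of token lists; lower values indicate a more homogeneous corpus.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = list(documents)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        counts_a = Counter(token for doc in shuffled[:half] for token in doc)
        counts_b = Counter(token for doc in shuffled[half:] for token in doc)
        scores.append(chi_by_degrees_of_freedom(counts_a, counts_b, top_n))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy_corpus = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"], ["b", "a"]]
    print(homogeneity(toy_corpus))
```

A similarity score for two corpora can be obtained in the same way by drawing the two halves from different corpora; as a design choice, such a value is best read against the homogeneity scores of the individual corpora.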

31 The Spearman rank correlation is a technique used to test the direction and strength of the relationship between two variables.
32 The Mann-Whitney ranks test assesses whether one of two samples of independent observations tends to have larger values than the other.

In detail, the homogeneity method divides corpora randomly into halves, produces word frequency lists for each of them, calculates χ² for the sub-corpora difference, normalises the values, and iterates this in order to get different random halves. The interpretation is done by comparing the values for different corpora. The similarity method is the same as for homogeneity, with the sub-corpora taken from different corpora; the interpretation has to be done with reference to the homogeneity of each corpus.

Denoual (2006) proposes to quantify the similarity of corpora based on the cross-entropy of statistical n-gram character models. Sharoff (2007) presents an approach to classifying web corpora into domains and genres using automatic feature identification. Comparability is analysed by assessing corpus composition, such as structural criteria (e. g. format and size) and linguistic criteria (e. g. topic, domain, and genre).

According to Crossley & Louwerse (2007), decisions as to which granularity level and which linguistic features to look at have to be considered carefully for register similarity comparison. Their approach is to look at different granularity levels to determine which one is most discriminatory and to measure bi-gram attraction, a data-driven measure of collocational attraction. This is achieved with the ‘lexical gravity’ G (Daudaravičius & Marcinkevičienė, 2004), which takes type frequency into consideration instead of token frequency as in other contingency-table-based measures. Their clustering based on gravity values leads to the favourable conclusion that (sub-)register average G values differ significantly (e. g. for spoken vs. written) and yield a perfect clustering.

In order to improve the homogeneity or to estimate the comparability by feature vectors of texts, Guthrie et al. (2008) used unsupervised statistical outlier detection approaches as they are used for documents in corpora. Surface features such as syllables per word, PoS distributions, or emotional tone are taken into account. Work on corpus comparison has also been done in the Web as Corpus community (e. g. Ferraresi et al., 2008), also integrating more general matters of comparisons between language varieties. Regarding another aspect of similarity, vor der Brueck et al. (2008) compared the readability of different texts by various complexity measures.

Kilgarriff (2009) identifies keywords of one corpus vs. another. Different lists according to the frequency range of interest are produced, where a variable can be changed to focus on higher- or lower-frequency words. The problems reported are i.) that corpora are not comparable and ii.) the question of how to treat words that do not occur in the reference corpus. Because the null hypothesis presuming randomness is claimed to be untrue, simple mathematical methods without statistical tests are applied.

The comparability of web corpora can additionally be measured according to their communicative intention, see also Poulard et al. (2011), who look at the functional dimension of documents, e. g. browsing, emailing, searching, chatting, interacting, shopping, collaborating, etc. In the TTC33 project, multilingual comparable corpora are defined according to distributional similarity (in a matrix of similarities) via feature vectors; the corpora are not parallel. As to comparability, the project members state that no well-established definition of corpus comparability exists, and therefore no standard method of measuring it either. Many of the measures that have been proposed (e. g. by Li & Gaussier, 2010) rely on bilingual lexicons. Corpus ‘compatibility’34 is measured by the correlation between similarity profiles of two corpora. It can be improved by removing or adding texts to corpora according to distributional similarity measures.35

A very recent and sophisticated study has been published by Su & Babych (2012), who present and evaluate three approaches to measuring the comparability of documents in non-parallel corpora (a lexical mapping based approach, a keyword based approach, and an MT-based approach). Furthermore, they develop a task-oriented definition of comparability, based on the performance of automatic extraction of translation equivalents from the documents aligned by the proposed metrics, which formalises intuitive definitions of comparability. They further define a scale to assess comparability qualitatively, as well as metrics to measure comparability quantitatively, focusing on document-level comparability.

33 Terminology Extraction, Translation Tools and Comparable Corpora, http://www.ttc-project.eu (last accessed 2012-10-20); for an overview see Heid & Gojun (2012).
34 They explicitly claim to not measure ‘comparability’.
35 See especially the public deliverable document D 2.3 regarding the composition of corpora within a language and across languages, relating to document dissimilarity and corpus comparability at http://www.ttc-project.eu/releases (last accessed 2012-10-29).

As to the related topic of corpus homogeneity, Rose et al. (1997) present a bottom-up approach based on word frequency lists, and Rayson et al. (1997) suggest determining corpus homogeneity with the AM χ² (see section 1.2.3.5). Kilgarriff (2001, pp. 241-243) presents useful approaches to model and measure the ‘burstiness’ of words. For handling the dispersion of words in a corpus, Savický & Hlaváčová (2002) propose their ‘average reduced frequency’ measure ARF. De Roeck et al. (2004) also ‘defeat the homogeneity assumption’; their investigation of the distribution of very frequent terms is based on three different ways of partitioning corpora: i.) assigning documents randomly to a partition, ii.) assigning halves of documents randomly to a partition, and iii.) using different chunk sizes instead of documents. Sahlgren & Karlgren (2005) apply a similar approach using density as a helpful measure for homogeneity.

Gries (2007) claims that χ² testing for corpus homogeneity is based on wrong presuppositions and recommends effect size measures. His approach is based on methods from exploratory data analysis using summary statistics, data reduction, structuring methods, and graphical methods. Additionally, re-sampling (permutation and bootstrapping) is applied at the specific levels of granularity to be investigated, in order to handle variability and corpus homogeneity solely on the basis of the parameter(s) of interest. He proposes a variety of progressively more complex and comprehensive ways of quantifying and exploring the variability of corpus results:

i.) One recommendation for facilitating output interpretation is to illustrate the metadata together with the results, e. g. the average frequency of present perfects for each sub-register, and the range of z-scores36 in order to show the variability and to look at similarities of sub-corpora. Boxplots can be used to illustrate the central tendency, the dispersion, and the overall distribution of a set of values, to show the variability within sub-registers and on different levels of granularity. The advantage here is that it “allows the researcher to identify what appears to be relevant [...] in an exploratory bottom-up fashion” instead of testing hypotheses about the level of granularity. With this approach, the locations of the largest differences can be identified and more specific fine-grained follow-up analyses can be done there.

36The z-score indicates by how many standard deviations an observation is above or below the mean.
64 2 Related work and research desiderata

ii.) A more precise method – which also addresses the data sparsity problem – is to decide how to split up a corpus in order to get groups with the largest within-group similarity and the smallest between-group similarity with respect to a phenomenon. Two approaches are shown: a.) With the permutation approach, pairwise differences can be found. b.) Bootstrapping on the basis of files, not of registers, is used for clustering. This shows the parts which are most homogeneous internally and most heterogeneous externally on the basis of the parameter of interest. In this way, a data-driven and objective categorisation of the groups of data is guaranteed and a useful heuristic for identifying sources of variability is found.
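The permutation idea in a.) can be illustrated with a minimal sketch. It is not Gries's own procedure: it assumes that each sub-corpus is represented by per-file relative frequencies of one phenomenon and that the statistic of interest is the difference of group means. The function name `permutation_test` and its parameters are hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    group_a, group_b: per-file relative frequencies of one phenomenon
    (assumed input format). Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random re-assignment of files to the two groups
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # small tolerance guards against floating-point rounding
        if abs(mean_a - mean_b) >= observed - 1e-12:
            extreme += 1
    # add-one correction so that the estimate is never exactly zero
    return (extreme + 1) / (n_permutations + 1)
```

A small p-value for a pair of sub-corpora suggests that they differ with respect to the phenomenon; running the test for all pairs gives the pairwise differences mentioned above.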

These methods are not only applicable to relative frequencies, but also to more complex statistical measures, e. g. multi-factorial coefficients or binary logistic regression to predict the outcome of a dependent variable on the basis of a set of independent variables (Gries, 2007, pp. 130-135). Gries also proposes a more general approach to corpus homogeneity by looking at a homogeneity plot showing the amount of variance against the number of principal components, using a data reduction method. File-based analyses yielded highly homogeneous results, (sub-)register-based analyses less homogeneous ones. The approach is confirmed with repeated random-parts-based analyses, with the mean being a single index for the homogeneity of a corpus. He describes corpus homogeneity as follows:

“The homogeneity of a corpus with respect to a particular phenomenon and a particular level of granularity is proportional to the average amount of structure detected by a principal component analysis that exceeds that of a corpus that was divided into the same number of parts randomly.” (Gries, 2007, p. 143)

A range of different very advanced solutions to the challenge of word dispersion and frequency adaptation is offered by Gries (2008). Gries (2009a) takes type frequencies and their distributions into account to measure within-corpus homogeneity with bi-gram gravity, applying hierarchical agglomerative cluster analysis. In summary, all these approaches try to approximate a notion of corpus comparability that is not yet completely intuitively clear or theoretically defined, which is why the present feasibility study chooses a rather basic approach to measure it.

2.1.2.2 General variety studies

Other work that is related to this topic has been done in general comparative corpus linguistics as e. g. described in McEnery et al. (2006), where corpus-based language studies and resources, detailed language variation studies, and contrastive and diachronic studies are presented. A collection of various corpus studies of specific phenomena, such as written vs. spoken language, is e. g. also described in Hornero et al. (2006). In addition, Schmied (2009) presents general contrastive studies. In almost all corpus exploitation scenarios, a more or less explicit comparison between phenomena is conducted. Work on comparing corpora has been done in a number of linguistic areas, and there are numerous single corpus studies on specific phenomena, which is why only the most relevant of them will be mentioned here, chronologically, as examples of how comparative studies have been conducted in the past.

Diachronic comparison The comparison of language over time with a diachronic view has been described e. g. by Janda & Joseph (2004), Dossena & Lass (2004), Facchinetti & Rissanen (2006), Curzan (2009), or Rissanen (2009).37 A relevant study regarding diachronic corpus comparison on the lexical level is Křen & Hlaváčová (2008). They stress that the highest possible corpus comparability is crucial since “corpus composition differences can prove more significant than differences in language” and note that “[p]resumably this can be why there are probably no automated tools of this kind for higher levels of language description”. They use ranked word lists to characterise changes in lexical usage in Czech. After the normalisation with respect to dispersion, they apply statistical significance measures with the conclusion that the “comparability of corpora is shown to play a key role”. Their CBF measure (dividing the χ2 value by the square root of the expected frequency) prefers lower-frequency items with greater frequency differences between the corpora over higher-frequency items with smaller differences, which yields good automated results.

Belica & Steyer (2008) investigate distributional characteristics of linguistic phenomena on all levels of linguistic description (see also section 1.2.3.3) and use very large corpora and statistical methods, e. g. cluster hierarchies, for valuable insights into language usage and change. Neologisms have been successfully selected e. g. by O’Donovan & O’Neill (2008), who implemented a ‘word-tracking’ system; for an overview of studies on more recent change in language with corpus-linguistic methods see Mair (2009). Bürki (2009) semi-automatically investigated diachronic change in MWEs in Swiss German according to frequent tri-grams.38 Related to studies on diachronic change, Belica et al. (2009) discuss the time-dependency of corpora and useful sampling strategies. They developed a ‘Frequency Relevance Decay’ formula to model the relevance of language events from a synchronic perspective.
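The CBF measure described above can be sketched as follows. This is only an illustration under stated assumptions: the χ2 value is computed as a Pearson statistic over a 2×2 table (word vs. all other tokens, corpus A vs. corpus B), and ‘the expected frequency’ is taken to be the expected frequency of the word in corpus A. The text above does not specify either point, and the function names are hypothetical.

```python
import math

def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for one word in two corpora.

    Returns (chi2, expected frequency of the word in corpus A).
    """
    total = size_a + size_b
    word_total = freq_a + freq_b
    observed = [freq_a, size_a - freq_a, freq_b, size_b - freq_b]
    expected = [
        size_a * word_total / total,
        size_a * (total - word_total) / total,
        size_b * word_total / total,
        size_b * (total - word_total) / total,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected[0]

def cbf_like(freq_a, size_a, freq_b, size_b):
    """Chi-square divided by the square root of the expected frequency.

    Assumption: the expected frequency in corpus A is used as the divisor.
    """
    chi2, expected_a = chi_square_2x2(freq_a, size_a, freq_b, size_b)
    return chi2 / math.sqrt(expected_a)
```

Dividing by the square root of the expected frequency reduces the score of frequent words, which matches the preference for lower-frequency items with larger differences reported above.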

37Comparative investigations of diachronic gender peculiarities have been done e. g. by Biber & Burges (2001).
38He concludes that the diachronic change is comparable to variation between genres and that change at a ‘moderate speed’ can be noted. He furthermore points out that corpus size is very important for such studies.

Learner language comparison Native and learner language have been comprehensively compared by Granger (1994, 1997); learner language is generally described in Hinkel (2005). Before the development of learner corpora, manual analyses of small corpora were carried out to investigate learners’ writing skills. Netzel et al. (2003) describe a semi-automatic study of academic English for writers with different mother tongues, where clear usage trends are shown for different phenomena. To compare the use of adjectives vs. participles, Handwerker et al. (2004) carried out studies in the context of German as a foreign language. Lauttamus et al. (2007) investigated the syntactic interference of first and second language by comparing tri-gram PoS patterns in spoken corpora. Various semi-automatic analyses in learner texts of numerous phenomena on different levels of linguistic description have been conducted in the framework of the learner corpus Falko (Lüdeling et al., 2008).39 The project KOMPOST40, for example, is developing computational-linguistic methods to find and investigate indicators for the quality of learner texts for German, in order to use its results in the further development of competence models.

Comparison of specialised and general language Ahmad et al. (1992, see section 1.2.3) are able to identify specialised lexis relying on the relation of relative frequencies of words or word combinations. A review of 12 systems for monolingual corpus analysis and terminology extraction is given by Cabré et al. (2001). They discuss different kinds of pre-processing and – as the core of the extraction process – the identification and collection of occurrences of term-like units with varying degrees of complexity as well as orthographical, morphological, syntactic, and semantic variability, the use of semantic relations and clustering methods, and the statistical and lexical filtering and sorting. With this method, they try to find the most

39For example, an Excel add-in for over-/underuse of phenomena for learner corpora has been implemented (http://korpling.german.hu-berlin.de/~amir/uoaddin.htm, last accessed 2012-10-20).
40http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/kompost, last accessed 2012-10-20.

representative and useful term candidates by applying statistical approaches for word similarity.

Similarly in the field of terminology extraction, in order to enhance dictionaries, contextual analysis is successfully done assuming that words with the same meaning appear in similar lexical contexts (see also section 1.2.3.3). Fung & McKeown (1997), for example, built context vectors for each word and compared these vectors with a general bilingual dictionary. Large corpora are found to yield good results: an accuracy of around 80 % for the top 20 candidates can be reached (see e. g. Chiao & Zweigenbaum, 2002). Improvements to this have been made e. g. by Koehn & Knight (2002), who add cognates and similar contexts by using lexico-syntactic patterns or by a bi-directional approach. These methods originate in MT, where translation lexicons are ‘acquired’ from unrelated monolingual corpora by clues such as identical or similar words, cognates, similar contexts, or word frequency. An alternative would be a graph-based approach with grammatical relations, e. g. the graph clustering to group according to similar behaviour by Dorow (2007). There are more and more statistical approaches to specialised language data extraction from parallel or even from unrelated monolingual corpora, an example for the latter being Nazar (2008). First studies on administrative legal terminology in varieties have been done by Abel et al. (2008) with hybrid (symbolic and statistical) methods.
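The assumption that words with similar meaning appear in similar lexical contexts can be made concrete with a small sketch. It covers only the monolingual step of building context vectors and comparing them; the cross-lingual mapping via a bilingual dictionary used in the cited work is not shown. The window size, the use of raw co-occurrence counts, and the cosine measure are assumptions chosen for illustration.

```python
import math
from collections import Counter

def context_vector(tokens, target, window=2):
    """Count the words occurring within `window` tokens of each `target` hit."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy sentence, invented for illustration only.
tokens = "the cat sat on the mat the dog sat on the rug".split()
similarity = cosine(context_vector(tokens, "cat"), context_vector(tokens, "dog"))
```

In practice, such vectors are usually weighted with an association measure and built from much larger corpora than this toy input.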

Comparison of original texts and translations Comparison of original texts and translations has been conducted e. g. by Laviosa (1998), who worked with the English Comparable Corpus (ECC)41 to find patterns of lexical use in translated vs. original texts. On the same topic, Papineni et al. (2002) compared such varieties in the context of automatic MT evaluation. Baroni & Bernardini (2006) present further studies on ‘translationese’ using machine learning to find differences.

41http://www.llc.manchester.ac.uk/ctis/phd/completed phd/laviosa, last accessed 2012-10-25.

On a higher level of linguistic description, Doherty (2006) describes a very fine-grained comparison of translated and original texts according to structural and discourse properties. She uses typologically determined conditions for discourse-appropriate uses of word order, case, voice, and structural explicitness in simple and complex sentences or sequences of sentences.

Register comparison In order to explore and quantify register42 differences between corpora, various approaches have been applied. Word-frequency-based methods have proven to reveal register differences, as already shown by Biber (1990). Genre differences have also been analysed based on cross-entropy values from partial syntactic trees by Sekine (1997). Rayson & Garside (2000) describe a method called ‘frequency profiling’, which is a keyword comparison for social differentiation studies. Another systematic exploration of genre analysis is described in Xiao & McEnery (2005), who find genre differences by word frequency comparisons. Cross-disciplinary variation in academic English, shown e. g. by the modes of persuasion, is investigated by Hyland & Bondi (2006) in single corpus studies on specific phenomena. A well-known corpus-based variation study of university language comparing spoken and written registers is Biber (2006). Register classification with bi-gram frequencies of spoken vs. written text has been done as well by Crossley & Louwerse (2007), who revealed syntactic and discourse features through a data-driven approach. Geyken et al. (2008) contrast sub-corpora of different genres and styles by generating word profiles based on frequent word co-occurrences. Likewise related to academic genres is the British Academic Written English Corpus (BAWE), whose project website43 collects a number of publications on comparative studies.
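Keyword comparisons of the ‘frequency profiling’ kind are commonly based on the log-likelihood statistic for a word's frequency in two corpora. The following sketch shows the widely used two-cell formulation; the function name and the example counts in the usage note are illustrative assumptions, not taken from the cited study.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of one word in corpus A vs. corpus B.

    freq_*: occurrences of the word; size_*: corpus sizes in tokens.
    """
    word_total = freq_a + freq_b
    expected_a = size_a * word_total / (size_a + size_b)
    expected_b = size_b * word_total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll
```

For example, `log_likelihood(50, 10000, 10, 10000)` exceeds 3.84, the critical χ2 value for p < 0.05 at one degree of freedom, whereas a word with equal relative frequencies in both corpora scores 0.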

42The term ‘register’ is used here in a general sense referring to language varieties defined by their situational characteristics, comprising genre, style, text type, etc.
43http://wwwm.coventry.ac.uk/RESEARCHNET/BAWE/RESEARCH/Pages/Publications.aspx, last accessed 2012-10-20.

Another similar field of research is authorship attribution for texts, which has e. g. been done by Luyckx & Daelemans (2008). Oakes (2009) presents ways to compare author styles by quantitative criteria such as collocations, vocabulary richness, word positions, or syntax as discriminators. In addition, he reports on studies of genre, gender, diachronic change, native vs. learner language, and translations. He introduces alternatives to TTR for measuring vocabulary richness (p. 1073).

Major dimensions of linguistic variation can be identified with factor analysis. Such multi-dimensional approaches are summarised in Biber (2009) for register variation studies and register comparison. Factor analysis for features is conducted with different dimensions of variability. In exploratory multi-variate analysis, vectors for data representation are created and compared. Such a multi-variate approach according to metadata and phenomena (to look e. g. in newspaper corpora for the distribution of phenomena in different parts such as sports, economy, culture, etc.) is described in Moisl (2009), which is a comprehensive summary including a bibliography for multi-variate analysis on all levels of linguistic description.

Additionally, Bell et al. (2009) studied stylometry regarding computationally extractable properties of style. They formulate a stylistic distance function via the weighted ratio of lexical stylometry without sample-size dependency.
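The type-token ratio (TTR) mentioned above is the number of distinct word forms divided by the number of tokens. The sketch below computes it for growing prefixes of a text, which makes its dependency on sample size visible; the function names and the toy input are illustrative assumptions.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def ttr_curve(tokens, step):
    """TTR for prefixes of length step, 2*step, ... of the token list."""
    return [(n, type_token_ratio(tokens[:n]))
            for n in range(step, len(tokens) + 1, step)]

# Toy input, invented for illustration only.
tokens = "a b a c a b a d".split()
curve = ttr_curve(tokens, step=4)
```

Because TTR typically falls as a text gets longer, values from samples of different sizes are not directly comparable, which is one motivation for the alternative measures referred to above.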

Intra-textual comparison of phenomena For the semi-automatic development of thesauri, Kanerva et al. (2000) are able to find TOEFL44 synonyms with Random Indexing methods for Latent Semantic Analysis (see section 1.2.3). Synonym extraction according to context is also done by van der Plas & Tiedemann (2006) on parallel corpora with automatic word alignment, which is able to distinguish between synonyms and otherwise related words, whereas a syntax-based approach is reported to be not as promising for this task. Peirsman et al. (2007) find semantically related words (in Dutch), where they explore co-occurrences vs. syntactic contexts. Heylen et al. (2008) present a comparison of vector-based models for word similarity concluding that all approaches find most synonyms for high-frequency

44Test of English as a Foreign Language, http://www.ets.org/toefl, last accessed 2012-10-25.

and abstract nouns. Peirsman et al. (2008) discuss gold standards for word space models taking automatically calculated semantic similarity as the basis. WordNet- and corpus-based methods are compared, the latter yielding better results. To extend lexicons using similarity measures, Cahill (2008) successfully created “lexical entries for ‘unknown’ words based on entries for words that are known and that are deemed to be distributionally similar” using measures from the BNC (see section 1.2.3.2). Morpho-syntactic selectional preferences are compared by Peirsman & Padó (2010) with a cross-lingual unsupervised approach for resource-poor languages; bilingual vector spaces from a large un-parsed corpus in a resource-poor language are built. Their evaluation with human judgements reports good results even for unrelated languages.

The approach to semantic categorisation by co-occurrence statistics presented in Bullinaria (2008) detects fine-grained semantic differences in large corpora via clustering methods. Semantic vector analysis with morphological decomposition and search for compound bases according to collocation ‘partners’ can e. g. be done with the scalable open source Java package including online use presented by Widdows & Ferraro (2008).

As to idiom investigations, e. g. the idiom database by Fellbaum (2007) has been taken as a basis in various studies.45 Dormeyer et al. (2005) conducted automatic recognition and analyses of metaphors and idioms using lexical entries, grammar rules, and contexts with a context-free backbone attributed with feature structures. Equally in order to identify idiomatic expressions, Villada Moirón & Tiedemann (2006) use word alignment in parallel corpora according to meaning predictability and overlap between the meaning of the whole expression and its parts.

45http://kollokationen.bbaw.de/htm/idb de.html, last accessed 2012-10-20; an experiment on these data in distance-based idiom recognition by looking at syntagmatic fixedness is explained at http://kollokationen.bbaw.de/htm/collconf2 en.html#programme, last accessed 2012-10-20.

Regional variety comparison Comparative regional variety linguistics has first been handled mostly manually with single introspective studies on rather specific phenomena – later as well with the help of the Internet (see e. g. Abfalterer, 2007; Bickel, 2000), also in the development of the VWB (see section 2.1.1).46 By now, more and more projects use specifically developed variety corpora and automated comparison methods, e. g. the ICE initiative for English or the C4 initiative for German (for both see section 2.1.1).47 Hofland & Johansson (1982) present one of the most comprehensive early studies on American vs. British English, using χ2 statistics (see section 1.2.3). A comparative study on comparative clauses in English varieties has been done by Peters (1996), and another exemplary investigation on linguistic variation in Jamaica has been conducted by Sand (1999). A further semi-automatic analysis is described in Wong & Peters (2007), who investigate backchannels in regional varieties of English using the ICE corpora, and Ronan & Schneider (2009) did comparative work on MWEs in Old English and Old Irish. Schneider & Hundt (2009) used a syntactic parser as a heuristic tool to discover features of English varieties by exploring quantitative differences in the use of syntactic patterns. In addition, they evaluated parser break-downs with respect to their correlation to regional peculiarities, where they found that due to robustness, no correlation can be discovered. In-depth qualitative evaluations of the outcomes are still pending. Gries & Mukherjee (2010) investigated lexical co-occurrence preferences, to see if and to what degree n-gram patterns can help to distinguish between different modes and varieties in the same components of the ICE. Bacelar do Nascimento et al. (2006), as an example for another pluri-centric language, carry out lexicographic corpus studies for describing

46For details on studies specifically of South Tyrolean German, see section 2.2.
47As another example, the Quantitative Lexicology and Variational Linguistics research group (QLVL, http://wwwling.arts.kuleuven.ac.be/qlvl/index.htm, last accessed 2012-10-20) presents studies for the comparison of Dutch in Belgium and the Netherlands.

Portuguese in Africa, and Zampieri & Gebre (2012) work on the corpus-driven statistical identification of Portuguese varieties.48

For German, Muhr (1993c) already looked at differences in pragmatics – on the function of linguistic units – in Austria and Germany. Hofer & Schmidlin (2003) elaborate on phrases in variant dictionaries, and Schmidlin (2007) investigated phraseological expressions in German varieties.49 Heid (2011) used dependency-parsed newspaper data of German varieties with morpho-syntactic features in order to look at collocator choices and the stability of selectional preferences. Underspecification markers are applied to exclude ambiguous data, which enhances the precision of the results; the recall could be enhanced by integrating all possible word order patterns. Interesting cases are found if parts of collocations occur in several varieties, but only in one of them in a certain collocational combination, e. g. AT regierender Weltmeister (vs. DE amtierender Weltmeister, ‘reigning world champion’); usually the collocator meaning is variety-specifically extended. The partly automated method consists in comparing the collocations of one base; if there are differences in their collocator choice, further typical bases of one collocator are looked at. With this method, variety-specific meanings of collocators can be identified (an example being DE markant (‘striking’) and CH markant (‘substantial’), which are used with different semantic classes of nouns, p. 551). The general conclusion in Heid (2011) is that simple frequency-based tools can provide a valuable starting point and concrete material for further manual interactive interpretation and investigation by lexicographers and variety linguists.
Such manual work is indispensable to decide if phenomena point to specific variety-typical readings of an identical collocation or not, and also for Frequenz-Regionalismen (‘frequency regionalisms’), which are formally unremarkable since they occur as well – only with less frequency – in texts used as a reference.
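The first step of the partly automated method described above, comparing the collocators of one base across varieties, can be sketched as follows. This is not the implementation of Heid (2011): the input format, the frequency threshold `min_freq`, and all counts in the toy data are invented for illustration, and the output is only a candidate list for the manual interpretation discussed above.

```python
from collections import defaultdict

def variety_specific(collocations, min_freq=5):
    """Find collocator-base pairs that reach min_freq in exactly one variety.

    collocations: iterable of (variety, collocator, base, frequency) tuples
    (assumed input format). Returns sorted (base, collocator, variety) triples.
    """
    table = defaultdict(dict)  # (base, collocator) -> {variety: frequency}
    for variety, collocator, base, freq in collocations:
        table[(base, collocator)][variety] = freq
    result = []
    for (base, collocator), freqs in table.items():
        attested = [v for v, f in freqs.items() if f >= min_freq]
        if len(attested) == 1:
            result.append((base, collocator, attested[0]))
    return sorted(result)

# Toy counts, invented for illustration; they are NOT corpus results.
toy = [
    ("AT", "regierend", "Weltmeister", 12),
    ("DE", "regierend", "Weltmeister", 1),
    ("AT", "amtierend", "Weltmeister", 8),
    ("DE", "amtierend", "Weltmeister", 30),
]
candidates = variety_specific(toy)
```

A fixed threshold cannot capture frequency regionalisms, which occur in all varieties with different frequencies; for those, a relative-frequency comparison would be needed in addition.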

48Extensive bibliographies for comparative studies are furthermore presented online on the websites of the other main variety research centres mentioned in section 2.1.1.
49For example in edition 4/2007 of the Austrian tribüne. zeitschrift für sprache und schreibung, Austrian phraseologisms are described and discussed.

Additionally, Heid (2011) especially points to (pragmatically) equivalent constructions on different levels of linguistic description, which have to be found manually, since there the limits of fully automated tools are reached. A further approach he suggests is to investigate morpho-syntactic preference differences in more detail, where more and better classified data are needed for trustworthy interpretations.

Also in the Variantengrammatik project (Dürscheid et al., 2011, see section 2.1.1), German variety corpora are investigated regarding national and regional differences in their grammar, in order to use this knowledge e. g. for didactics. Bickel (2012) describes the increasingly promising study of German varieties with web corpora, and Roth (2012) also used web corpora for the semi-automatic recognition of regional variation in German collocations.

To sum up, comparative corpus investigations have increasingly and successfully been conducted using electronic corpora – but still often for rather restricted phenomena. This fact suggests that a systematic approach to cover a wide range of linguistic phenomena based on corpus methodology should be tested, as is done in the present feasibility study.

2.1.3 Computational systems for corpus studies

In order to show existing solutions to similar research questions, this section lists related computational-linguistic systems of various kinds which are used for the steps that are necessary for comparative corpus studies. Especially tools for corpus annotation and analysis will be shown, in the order of their development and with a focus on German.

2.1.3.1 Corpus annotation tools

The TreeTagger50 is a fast and high-quality probabilistic decision-tree-based tool (with up to 97.5 % correctness51 for standard German; Schmid, 1995, pp. 6-7) mainly applied for annotating text with PoS and lemma information, but for chunking as well. The tag set STTS52 as described in Schiller et al. (1999) is used. As an alternative, Brants (2000) developed the statistical PoS tagger TnT, which is based on Hidden Markov Models.

To interpret and classify non-lemmatised words, e. g. Klatt (2006) proposes a method using a tokenised corpus and rules for morpho-syntactic and semantic properties of words. Tufiş et al. (2008) extend lexicons automatically by adding PoS-tagged words, estimating their lexical distribution and their morphological analysis. Adolphs (2008) also acquires lexical data from raw texts for this purpose and carries out statistical ranking methods. A successful approach to similarity-based lemma correction is Gojun et al. (2012), who use the Levenshtein distance ratio in addition to a rule-based approach.

The Stuttgart Morphology tools (SMOR; see Schmid et al., 2004) provide the automatic recognition of German word forms, using a stem lexicon and linguistic rules. SMOR is a morphological analyser based on finite-state methodology covering derivation, compounding, and inflection, including productive word formation. This method is similar to the one used by Geyken & Hanneforth (2006), the latter with a recognition rate of up to 99 %.

With YAC (Yet Another Chunker), recursive chunking of unrestricted German text can be done with detailed inflectional information and with lexical-semantic and structural feature annotation (Kermes, 2003). Kermes (2009) presents an overview of symbolic vs. probabilistic chunking and parsing methods and applications for e. g. lexicography and evaluation.
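The Levenshtein distance mentioned above for similarity-based lemma correction can be computed with the standard dynamic-programming algorithm. The sketch below also derives a similarity ratio by normalising with the length of the longer string; this normalisation is an assumption, since the exact ratio used in the cited work is not specified here, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity_ratio(a, b):
    """1 minus the distance divided by the longer length (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest
```

A lemma candidate could then be accepted when its ratio to a known lemma exceeds a threshold, which would have to be tuned on annotated data.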

50Schmid (1994, 1995); http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger, last accessed 2012-10-20.
51For ‘unknown’ words (word forms for which no lexicon entries exist), the accuracy of the PoS tagging decreases dramatically according to Schmid (1995, p. 8).
52Stuttgart-Tübingen tag set; http://www.sfb441.uni-tuebingen.de/a5/codii/info-stts-en.xhtml, last accessed 2012-10-20.

Partial parsing is e. g. possible with a cascaded finite-state approach using explicit underspecification as by Schiehlen (2003) or with a two-stage partial parser for untagged German which combines rule-based with corpus-based decisions and uses topological fields to mark minimal phrases and multi-word units as by Klatt (2004), yielding very high correctness values. Schmid (2004) implemented efficient parallelised parsing for large highly ambiguous grammars. Furthermore, dependency parsing has been done by Bohnet (2009, 2010), and Björkelund et al. (2010) combined parsing and semantic role labelling in a complete semantic analysis pipeline.

The system TypeCraft53 supports morphological word-level annotation in a relational database setting; it is wrapped into a communication system, where it is possible to work with user-defined data. The MMAX254 multi-level annotation tool with a sophisticated graphical user interface (GUI) has been developed by Müller & Strube (2006). WordFreak55, a Java-based linguistic annotation tool, is designed to support human and automatic annotation of linguistic data as well as to employ active learning for human correction of automatically annotated data. Furthermore, the editor Posedit56 was developed to aid the manual correction of automatically PoS-tagged texts.

Equally for corpus annotation, e. g. the UIMA-based (see section 2.1.3.2) Federated Language Resource Architecture UFRA offers an inter-operable infrastructure for resources, technologies, and components (Del Gratta et al., 2008). A collection of various annotation tools of the FreeLing package Open Source Suite of Language Analysers is located at http://nlp.lsi.upc.edu/freeling (last accessed 2012-10-20).

All these annotation tools are very useful and can be integrated with more or less adaptation into newly developed systems.

53http://typecraft.org, last accessed 2012-10-20.
54http://mmax2.sourceforge.net, last accessed 2012-10-20.
55http://wordfreak.sourceforge.net, last accessed 2012-10-20.
56http://elearning.unistrapg.it/corpora/posedit.html, last accessed 2012-10-20.

2.1.3.2 Corpus analysis and comparison tools

There are several types of systems to analyse corpora; this section presents examples for building-block toolkits, for complete, fully automatic systems, and for mixed platforms with the integration of automatic and semi-automatic tools, each in chronological order of their development.

Building-block systems The general functionality of framework and building-block systems for analysing corpora is to provide modules for different applications and users, which need to be plugged together according to the respective requirements. The UIMA57 framework provides such an open and extensible platform for building applications that e. g. process text to find latent meaning and relations. The GATE58 software is an open source system for all sorts of language processing tasks and applications, including web mining, information extraction, or semantic annotation (see Cunningham et al., 2002). Also Heart of Gold59 offers an architecture for the integration of deep and shallow natural language processing components. Uplug60 is a further modular platform with a collection of tools for linguistic corpus processing, word alignment, or term extraction from parallel corpora.

Automated corpus query and analysis Complete and fully automatic corpus analysis can e. g. be done with the following tools. Heid et al. (2001) introduced the system DOT for term candidate extraction based on symbolic and statistical methods. Rayson (2003) developed Matrix, a statistical software tool for linguistic analyses through corpus comparison, which provides a web interface to corpus annotation tools and standard corpus-linguistic methodologies such as (the comparison of) frequency lists, frequency profiles, concordances, n-grams, and collocations. It further extends

57Unstructured Information Management Architecture; http://uima-framework.sourceforge.net, last accessed 2012-10-20.
58General Architecture for Text Engineering; http://gate.ac.uk, last accessed 2012-10-20.
59http://heartofgold.dfki.de, last accessed 2012-10-20.
60http://stp.lingfil.uu.se/~joerg/Uplug, last accessed 2012-10-20.

the keywords method to key grammatical categories and key semantic domains. Wmatrix allows the user to run these tools via a web browser.

The Leipzig Corpus Portal61 (see Quasthoff et al., 2006) comprises around 20 monolingual comparable corpora with free access via a GUI. It offers web newspaper texts, dictionary data, various kinds of statistics, pre-computed co-occurrence data and graphs, visualisation, and semantic class and taxonomy learning. The Leipzig tools can be used via web services.

T-LAB (Lancia, 2007) is a software for co-occurrence analysis and document or word clustering. Statistical ranking of n-grams is applied to find co-occurrences and terms on the basis of their distribution in documents by using similarity coefficients to compare strings of texts as vectors. For language recognition, function word vector comparison with language models is done. For testing corpus representativeness, the vocabulary size curve is taken into consideration: few new words are expected to be encountered when a corpus is representative of a certain field with special terminology. The collocational behaviour of word types is shown with histograms based on several AMs for collocation hypothesis testing, e. g. ‘cubic mutual information’ (independent of frequency, see Daille, 1994). The system includes parametrisation with normal frequency expectations from reference corpora.

Nazar et al. (2008) developed tools for the extraction of specialised web corpora and their modular statistical analysis. They check vocabulary richness and vector similarity, rank co-occurrences, and conduct language recognition by vector comparisons. Furthermore, they offer collocate sorting and a demonstration of term distribution via a GUI. The system is modular in order to facilitate the integration of other systems. They use their tools for corpus compilation and analysis from the web for cases where no corpora exist yet.
The authors also looked at the distribution of terms in documents and highlighted terms which are i.) frequent, ii.) well dispersed, and iii.) less frequent in a reference corpus.

Biemann et al. (2008) give a description of a modular collection of various

61 http://corpora.informatik.uni-leipzig.de, last accessed 2012-10-20.
2.1 Resources and methods for corpus comparison 79

combined linguistic algorithms and Java tools (ASV 62 toolbox) for written corpus exploration, including word-level annotation, classification, clustering, language identification, morphological analysis, term extraction, Levenshtein distance for similar words, named entity recognition (NER), and terminology extraction, to be applied e. g. for MT, information retrieval, or information extraction. The tool can be downloaded (for command line use and use via a GUI), is compatible with sources other than the Leipzig Corpora Collection, and is extensible with new tools.

With the co-occurrence database CCDB63, which uses a clustering algorithm according to semantic proximity, co-occurrence profiles can be calculated (see Belica & Steyer, 2008). Branco et al. (2008) describe the web service LX 64 for the distribution and usage of language technology for Portuguese.

Word sketches (“summaries of a word’s grammatical and collocational behaviour”, see Kilgarriff & Tugwell, 2002, p. 125) are implemented in the Sketch Engine65. It supports ‘keyword’ analyses between sub-corpora and is reported to work well for lexicographic use for English, Slovene, and Japanese (Kilgarriff et al., 2008).66 There are more such commercial tools, e. g. Collocate 1.0 67, which will not be discussed further.

The eHumanities Desktop68 is a corpus-linguistic platform with an easy-to-use interface offering a flexible use of linguistic tools (see Gleim et al.,

62 http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox, last accessed 2012-10-20.
63 http://corpora.ids-mannheim.de/ccdb, last accessed 2012-10-20.
64 http://lxcenter.di.fc.ul.pt/services/online suite/en/lx-suite.html, last accessed 2012-10-26.
65 http://www.sketchengine.co.uk, last accessed 2012-10-20.
66 According to Kilgarriff et al. (2010), this is the case even using a flat grammar because of word order and explicit markers for grammatical relations. A similar approach for German with a focus on sub-categorisation information is described in Schulte im Walde (2003). Word sketches for German are difficult to implement because of structural ambiguities. There are three models of constituent order (verb-first, verb-second, and verb-last) as well as case syncretism (only about 20 % of German NPs in a newspaper text are unambiguous according to Evert (2004)). Since German is a non-configurational language, there is a relatively free constituent order in the middle field (see section 1.2.1) and since ‘German does not fully compensate the lack of configurationality with its morphological case’ (see e. g. Heid & Weller, 2008), a flat approach does not yield convincing results (Ivanova et al., 2008).
67 http://www.athel.com/colloc.html, last accessed 2012-10-20.
68 http://www.hucompute.org/ressourcen/ehumanities-desktop, last accessed 2012-10-20.

2009; Gleim & Mehler, 2010). The WebLicht69 tool chain developed in the D-SPIN project (see section 2.1.1) uses systems from different institutions via web services to annotate corpora up to the level of syntactic parsing; corpus analysis and visualisation tools are planned to be implemented as a next step. Heid et al. (2010) present a web-based scenario in the D-SPIN framework to extract significant word pairs (details in section 2.1.2) as a web service in the WebLicht pipeline. The system WordSurv70 is able to determine linguistic relations e. g. through the comparison of word lists. The LIWC 71 text analysis software programme can e. g. be used to calculate the degree to which people use different categories of words across a wide array of texts. A beta version of the DWDS 72 (see also section 2.1.1) corpus query allows users to generate sophisticated word profiles based on frequent collocations.

Hybrid platforms

Further semi-automatic corpus analysis toolkits with mixed approaches are listed in the following paragraphs. With the Corpus Query Processor (CQP) (see Christ, 1994) on the basis of the Corpus Workbench (CWB)73, queries on annotated corpora can be evaluated, and the WordSmith74 tools also offer software for lexical analysis (see Scott, 2004). Xiao (2006) describes the system XAIRA (XML-Aware Indexing and Retrieval Architecture) for this purpose. An overview of concordancing tools is given by Wiechmann & Fuhs (2006).

To measure statistical association between words, the Ngram Statistics Package NSP (Banerjee & Pedersen, 2003) can be used. Open source Python modules, linguistic data, and documentation for research and development in

69 http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main Page, last accessed 2012-10-20.
70 http://www.sil.org/computing/survey/wordsurv.htm, last accessed 2012-10-20.
71 Linguistic Inquiry and Word Count; http://www.liwc.net, last accessed 2012-10-20.
72 http://beta.dwds.de, last accessed 2012-10-20.
73 http://cwb.sourceforge.net, last accessed 2012-10-20.
74 http://www.lexically.net/wordsmith/index.html, last accessed 2012-10-20.

NLP and text analytics can be accessed at the NLTK 75 webpage (Bird & Loper, 2004). The multi-word expressions community76 also provides software for extracting collocations and MWEs, and the UCS 77 toolkit is a collection of libraries and scripts for the statistical analysis of co-occurrence data.

On the MULINCO corpus platform for education and research described by Maegaard et al. (2006), supported manual linguistic investigations and comparisons can be conducted. ANNIS 78 is a browser-based open source search and visualisation tool for multi-layer annotated corpora (see Zeldes et al., 2009), and the ICE Corpus Utility Program79 provides a corpus exploration platform designed especially for parsed corpora to conduct automated, although not primarily comparative, studies. The NITE XML Toolkit80 is another set of libraries and tools that provide for the representation, manipulation, query and analysis of language data.

As an example of a higher-level analysis tool, Schründer-Lenzen & Henn (2009) developed software to conduct automated high-level language diagnosis in order to support language acquisition. It allows for a standard-measurable analysis of the written story telling competence of learners.

In summary, a wide variety of computational systems for different tasks has been developed. This feasibility study will use their most promising features and will combine them into an integrated system for variety corpus comparison.
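The association measures referred to in this section (as offered by NSP and UCS, and the ‘cubic mutual information’ mentioned for T-LAB) can be illustrated with a minimal sketch. It assumes the commonly cited definitions MI = log2(O / E) and MI3 = log2(O^3 / E), where O is the observed frequency of a word pair and E = f1 * f2 / N is the frequency expected if the two words were independent. The function names and the toy counts are hypothetical and are not taken from any of the tools discussed.

```python
import math


def expected_frequency(f1: int, f2: int, n: int) -> float:
    """Pair frequency expected under independence: E = f1 * f2 / N."""
    return f1 * f2 / n


def mutual_information(observed: int, f1: int, f2: int, n: int) -> float:
    """Pointwise mutual information: log2(O / E)."""
    return math.log2(observed / expected_frequency(f1, f2, n))


def cubic_mutual_information(observed: int, f1: int, f2: int, n: int) -> float:
    """Cubic mutual information: log2(O^3 / E)."""
    return math.log2(observed ** 3 / expected_frequency(f1, f2, n))


# Toy counts (invented for illustration): the pair occurs 8 times,
# its components 16 and 32 times, in a sample of 1024 pairs.
mi = mutual_information(8, 16, 32, 1024)
mi3 = cubic_mutual_information(8, 16, 32, 1024)
print(f"MI = {mi:.2f}, MI3 = {mi3:.2f}")
```

Cubing the observed frequency gives more weight to frequent pairs, which counteracts the tendency of plain mutual information to favour rare combinations.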

75 Natural Language Toolkit; http://www.nltk.org, last accessed 2012-10-20.
76 http://multiword.sf.net, last accessed 2012-10-20.
77 Evert (2005a); http://www.collocations.de/software.html, last accessed 2012-10-20.
78 http://www.sfb632.uni-potsdam.de/d1/annis, last accessed 2012-10-25.
79 ICECUP; http://www.ucl.ac.uk/english-usage/resources/icecup/index.htm, last accessed 2012-10-20.
80 NXT; http://groups.inf.ed.ac.uk/nxt, last accessed 2012-10-20.

2.2 Investigations on South Tyrolean German

To provide a more precise background for the present study, the investigations conducted especially in the field of South Tyrolean variety linguistics will be presented in the following sections. For an overview, collections of special South Tyrolean phenomena will be shown in section 2.2.2.

2.2.1 South Tyrolean German variety linguistics

Main focal points of research on the written standard German variety in South Tyrol will now be elaborated on, first concentrating on the methodology applied, then – sorted chronologically and by contents – on the concrete single studies, and finally on the respective findings and their interpretation.

Methods used for investigating South Tyrolean German

The early studies on South Tyrolean peculiarities mainly relied on manual examination and excerption (e. g. Rizzo-Baur, 1962; Riedmann, 1972; Pernstich, 1982). For the more recent studies on the existing dictionaries on German varieties (VWB and Abfalterer (2007), see sections 2.1.1 and 2.1.2.2), these methods have been enhanced

• by consulting informants (see survey in Abfalterer, 2007, pp. 284-286),
• by the comparison of observed data with existing resources such as relevant secondary literature and dictionaries (Abfalterer, 2007, pp. 227-232), and
• by additionally using cross-checks in the Internet as a resource for additional evidence (e. g. Bickel, 2000, 2006; Abfalterer, 2007, who provides Internet search results for Austriacisms with the percentages of occurrences according to different domain endings on pp. 282-283).

This recent work was conducted with a systematic view from the outside as well: with the help of many experts cross-checking texts from all the relevant varieties (see Abfalterer (2007, pp. 167-220) for a detailed discussion on the classification81 of single phenomena or their origins and situational contexts; the exact procedures are described in Abfalterer (2007, pp. 60-70)).

As has been mentioned, recent developments in computational linguistics (see section 2.1) present a promising basis and an indispensable contribution to this investigation methodology – for extensive research on German varieties as well as to foster language awareness and to support language didactics. Widely accessible corpora of the German language (not only South Tyrolean German), e. g. by the initiatives Korpus Südtirol and C4 (see section 2.1.1), are increasingly used for variety studies. With regard to verifying and enhancing variant dictionary entries, Abfalterer (2007, p. 70) also refers to the continuation and validation of her work by new investigations with the help of a digital corpus of South Tyrolean German to quantitatively study border and higher-level phenomena (especially collocations). With a corpus comparison system such as Vis-À-Vis, quantitative and thus more objective investigations on the basis of such corpora are now possible.

The first cascaded studies (see first approaches in Anstein, 2007) comprised a high amount of manual sorting, and direct corpus querying was involved for the resolution of ambiguous cases – e. g. for the grammatical functions of subject (SUBJ) and object (OBJ), since German has a relatively free word order – or for subtle semantic differences. The ‘manual’ combination of annotation and analysis tools was time-consuming and rather inefficient, which is where the development of Vis-À-Vis originated.

The first specific variety lexicography experiments (Abel & Anstein, 2008) also used annotations of standard tools, where difficult cases (‘unknowns’) or errors could identify a set of candidates for special variety characteristics. The reason is that these tools are usually created for the dominant varieties and therefore variety-specific terms (like other terms or compounds in German) are not included in the lexicon. Since the tagger was designed for and trained

81 Abfalterer (2007) chose a ‘border’ towards the dialect for the entries, where the exclusion criteria are that i.) dialects are mainly spoken (occurring in citations) and that ii.) dialect words are used for the filling of ‘holes’, if there is no corresponding standard word.

on the ‘main’ variety in Germany, the assumption was that new candidates for South Tyrolean variants could be found in the list of ‘unknowns’.82 First, existing word lists with South Tyrolean special vocabulary such as proper names (e. g. from electronically available phone book lists), terminology, or approved lists of Südtirolisms were used to filter and thus reduce this ‘unknowns’ list. The remaining list was then manually checked for possible new dictionary entries (for results, see section 2.2.1).

The first studies used cascaded queries in variety corpora via CQP (see section 2.1.3), with the assumption that those co-occurrences are worth investigating more closely which have a high frequency in the South Tyrolean corpus and a very low frequency in the corpus from Germany. This automatic filtering of a huge amount of data was used to extract only relevant data to be investigated manually. In Abel & Anstein (2011), the semi-automated analyses have been further developed and described; this article additionally provides a detailed discussion of results. Focusing on the co-occurrence level, sophisticated collocation extraction tools including morpho-syntactic annotation have been applied by Heid (2011) for investigating the varieties of German (see section 2.1.2.2).
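The frequency contrast used in these first studies can be sketched as follows. This is a minimal illustration only, not the CQP queries that were actually used: the thresholds, the function names, and all counts and corpus sizes are assumptions made for the example. Frequencies are normalised per million tokens so that corpora of different sizes can be compared.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def contrast_candidates(st_counts, de_counts, st_size, de_size,
                        min_st_pm=5.0, max_de_pm=0.5):
    """Keep co-occurrences that are frequent in the South Tyrolean corpus
    and (almost) absent from the corpus from Germany.
    The two thresholds are hypothetical defaults."""
    candidates = []
    for pair, st_count in st_counts.items():
        st_pm = per_million(st_count, st_size)
        de_pm = per_million(de_counts.get(pair, 0), de_size)
        if st_pm >= min_st_pm and de_pm <= max_de_pm:
            candidates.append((pair, st_pm, de_pm))
    # Most frequent South Tyrolean candidates first, for manual inspection.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Invented toy counts and corpus sizes, for illustration only.
ST_SIZE, DE_SIZE = 2_000_000, 20_000_000
ST_COUNTS = {("provisorisch", "Ausfahrt"): 20, ("Entscheidung", "treffen"): 300}
DE_COUNTS = {("Entscheidung", "treffen"): 3000}

for pair, st_pm, de_pm in contrast_candidates(ST_COUNTS, DE_COUNTS, ST_SIZE, DE_SIZE):
    print(pair, st_pm, de_pm)
```

Only the pairs that pass both thresholds remain for manual checking, which mirrors the idea of reducing a large amount of data to a small candidate list.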

Studies conducted on South Tyrolean German

Linguistic research on South Tyrolean German has been concentrating on language contact phenomena and peculiarities of the variety on the lexical and on the morpho-syntactic level. The investigation of the written German standard in South Tyrol started with single language contact studies in the 1960s with Rizzo-Baur (1962). Riedmann (1972, 1973) wrote the most critical articles on Italian interferences and the impairment of the South Tyrolean German language, a view which was taken over e. g. also by Kramer (1981) or Tyroller (1986). Moser & Putzer (1980, pp. 153-157) described colloquial vocabulary in South Tyrol. Phenomena on the morpho-syntactic level have been investigated by Egger (1979) and by Giacomozzi (1982). Pernstich (1982, pp. 112ff.)

82 Further groups contained in this list are typing and orthographical or tokenisation errors, or also morphologically complex words which cannot be decomposed by the tagger.

studied contact phenomena and interferences in newspaper language, and similarly Forer & Moser (1988) describe South Tyrolean special phenomena on the word level. For an overview until the mid 1990s see Lanthaler (1997). Mall & Plagg (1990), describing di-glossia (see section 1.2.2) and bilingualism in South Tyrol, also summarise published material until 1990.

Bertagnolli (1994) conducted a socio-linguistic language study in Bolzano, where she focused on social identity, standard and norm, dialect vs. standard, and code-switching. It was a questionnaire investigation on language use in private, at work, and in public, on the acceptance by different social strata, and on prestige. A detailed linguistic description of German in South Tyrol is presented by Riehl (2000) on peculiarities, different domains for spoken and written language, education, literary tradition, and culture.

Zambelli (2004) gathered and classified interferences from Italian by interviews in a field study yielding comprehensive tables with examples for classified interferences from Italian to South Tyrolean German (pp. 23-30). Rampold (2005) is a collection of – partly entertaining, partly serious – regular newspaper articles (called the Federfuchser) on peculiarities of South Tyrolean German in the South Tyrolean daily newspaper Dolomiten. In addition, many individual reports on the topic exist; examples are the articles containing concrete detailed translation suggestions for administrative terminology written by Alexander Brenner-Knoll in the Wirtschaftskurier of the Dolomiten newspaper.

Over the years, research has shifted from treating interferences as signs of a loss of language skills towards the description of ‘special vocabularies’ from a variation linguistic perspective83 – see also Egger (2001) and Egger & Lanthaler (2001, p. 58), who describe special vocabulary for South Tyrol with differences to other varieties and with loanwords from Italian (see also Abfalterer, 2007, p. 55).

83 The ‘equalisation’ of the varieties from a scientific point of view does not automatically mean that the language community shares this view: Germany, e. g. does not necessarily have the common awareness of being one of more varieties (Ammon, 1997, p. 9).

Abfalterer (2007) presents the most recent extensive investigation of South Tyrolean German containing a variant dictionary elaboration. This book is the latest review and discussion of linguistic research including the principal related work – she presents specific comparisons of her results with earlier studies on pp. 227-232. It is a study on the whole of the South Tyrolean special vocabulary in the pluri-centric context, which has been collected and classified. Robertazzi (2007) is a master’s thesis studying collocations in original and translated corpora of South Tyrolean German to investigate whether there is an Italian influence. Putzer (2009) goes into detail on regional differences in grammar, focusing on past tense and perfect as well as on the morphological distribution of the auxiliaries haben (to have) and sein (to be) for the perfect tense.

Obrist (2010), which is a booklet directed to the general public, includes a recent collection of newspaper articles84 on the South Tyrolean German language from the Dolomiten newspaper and of photographs from the initiative Abgeblitzt85 about peculiar phenomena of South Tyrolean German. More and more collections of single phenomena are also presented to a broad public, e. g. on the Internet (for South Tyrolean dialects see e. g. http://oschpele.ritten.org, last accessed 2012-10-20). The latest computationally supported study on collocations in the varieties of German is Heid (2011, see above).

Nevertheless, variants have not been systematically investigated and documented – according to Bickel (2000, p. 112) there were no dictionaries at all for a long time; the VWB (Ammon et al., 2004, see section 2.1.1) was supposed to fill this gap. There are almost no extensive investigations on particular features of the South Tyrolean variety on the syntactic (e. g. collocations, idioms; see Abfalterer, 2007, p. 70) or on the textual level, e. g. on discourse features. One single example for the latter is Riehl (2001), who studied text-specific

84 The fact that such articles appear in newspapers illustrates the interest of the media and the public.
85 On http://www.abgeblitzt.it (last accessed 2012-10-20), this collection of photographs with interesting linguistic phenomena found in public can be accessed.

formulations and lexical as well as morpho-syntactic phenomena, differentiating between influences

i.) from the second language (on lexicon and syntax),
ii.) from contact varieties of the mother tongue (on lexicon and morphology), and
iii.) indirectly from language contact (on lexicon, collocations, and idioms).

Recently, the Variantengrammatik project (see section 2.1.1) has also been comparing corpora to study national and regional differences in grammar, e. g. on word structure, phrase structure, or government.

Most of these studies have been conducted on written general German standard language; in-depth examinations of specialised texts are rare. Putzer (1984, who studied bilingualism exams on different levels of linguistic description) and Putzer (2001) looked at peculiarities and interferences in texts translated by non-professionals with respect to the bilingualism exam, and drew conclusions for didactics.86 Similarly concerning specialised language, bistro87 is an information system for legal language developed for the German language area.

Learner language in South Tyrol has been investigated in Kolipsi88, a project which collected L1 and L2 German learner texts by 1800 pupils in their fourth year of high school in South Tyrol. In the project KoKo89 in the framework of the initiative Korpus Südtirol (see section 2.1.1), the latter is enhanced by L1 learner texts from South Tyrol, North Tyrol, and Thuringia (600 texts each) including socio- and psycho-linguistic metadata to analyse and compare writing competence in high-school age on different levels of linguistic description.

86 Putzer (2001, p. 155) for example describes the strategy for bilingualism being realised not by the simultaneous creation of texts in two languages, but by their formulation in one language and the translation into the other.
87 http://dev.eurac.edu:8080/cgi-bin/index/preindex.en, last accessed 2012-10-20.
88 Abel et al. (2010); http://www.eurac.edu/de/research/projects/ProjectDetails.html?pid=1818, last accessed 2012-10-20.
89 Anstein & Glaznieks (2011); http://www.korpus-suedtirol.it/bildungssprache de.htm, last accessed 2012-10-20.

More publications on studies of South Tyrolean German can be found on the websites of the initiative Korpus Südtirol90 and of the project Datenbank zum Südtiroler Deutsch91.

General findings of the studies and their interpretation

From the 1960s to the 1980s, studies of language contact phenomena emphasised that contact leads to an impairment of the language (see Lanthaler, 1997, p. 209); the fear of language decay and the fight against any Italian influence were predominant (esp. in Riedmann, 1972, see also section 1.2.2). ‘Interference’ has been a very negatively connotated keyword for a long time, influenced by language fostering and language preserving efforts.92 Rizzo-Baur (1962) describes deviations of South Tyrolean German from Austrian German (pp. 108-110) and the influence of Italian (pp. 110-113), where she concentrates on a ‘negative interference’. Riedmann (1972, pp. 212ff.) also emphasised that South Tyrolean German is especially influenced by the Italian language in the field of administration and states that the German colloquial language is affected by Italian. According to his investigations, there are furthermore many phonological, orthographical, grammatical, and semantic interferences of Italian in South Tyrolean German, and especially lexical additions. He considers semantic shifts as the ‘worst’ changes (p. 214), since wrong word usage impairs language use for communicational purposes.

According to Moser (1975), however, Riedmann’s approach is methodologically criticisable and the judgements are exaggerated. Similarly, Masser (1982, pp. 72f.) made a strong point, criticising Riedmann with respect to methodological and factual aspects, stating that the phenomenon of contact language influence (e. g. loanwords) is widespread and not outstanding or notably alarming in South Tyrolean German and that no damage is done to the German language in South Tyrol.93

90 http://www.korpus-suedtirol.it/lit varietaetenling de.htm, last accessed 2012-10-20.
91 http://www.uibk.ac.at/projects/woerterbuch/sued/sued.html, see section 2.1.1.
92 Mall & Plagg (1990, pp. 218ff.) describe influencing factors on the occurrence of interferences as the speakers’ profession, education, social habits, age, or place of residence.

In addition, syntactic peculiarities have been discussed as a result of the language contact with Italian. These peculiarities concern word order, the position of the inflected verb, the conjunction of sentences, and constructions with the participle (Egger, 1979, pp. 84-97).

Later, it was also stressed that – apart from obvious transfers of words from Italian into German, especially in the fields of public administration and law – research results on South Tyrolean German suggest that there are fewer peculiarities than assumed, see e. g. Masser (1982, pp. 72-73), Pernstich (1982, pp. 112-114, who certified very few interferences, mainly in official language use), or Ammon (2001, p. 25). Mall & Plagg (1990) describe South Tyrolean German as being not substantially different from Austrian German at the levels of phonology94 and morphology and state that differences in lexis and syntax are restricted to specialised language. According to Lanthaler & Saxalber (1995, pp. 290-292), syntactic errors such as wrong cases and difficulties with prepositions can be detected. They call contact phenomena a disruption of the system (p. 209) and assert that influences move from administrative language into every-day language, to be observed in youth language especially in the cities and in strongly bilingual groups (p. 292). However, they also state that many of these cases are citations and do not only originate from sub-conscious usage and that they mainly refer to things that do not exist in the German language area.

As a general tendency, South Tyrolean German includes words from every-day language from Austria, words referring to media and tourism from Germany, and terms for research on multilingualism from Switzerland (see Abfalterer, 2007, pp. 233-238). Lanthaler (2001, p. 147) verifies that the South Tyrolean German language community generally follows the common German

93 Also with respect to the decay of language, Schrodt (2007) is an entertaining article about the change of the German language with the optimistic conclusion that language change is a natural process.
94 However, systematic studies on the phonological level have not been conducted due to the lack of extensive spoken corpora.

dictionaries and grammar books (p. 148). According to Lanthaler (1997, pp. 366f.), Italian influence can be seen especially in colloquial and administrative language, in names for South Tyrol specific realities, in publicity, and in ironical language. Ammon (2001, p. 25) confirms that mainly administrative terminology influences every-day language as well, but states that other peculiarities of actual South Tyrolean German seem to be sparse and unremarkable. He recommends studying how far norm authorities accept or correct South Tyrolean peculiarities.

Riehl (2001) notices a frequent use of atypical idiomatic expressions (but also insecurities in the control group from Germany, p. 283ff.) and summarises that the contact language causes borrowing of meaning into German (especially for Romance lexemes, where the meaning in German is extended from Italian). She further reports the following influences: lexical and morpho-syntactic interference phenomena are rare, syntactic ones even rarer, and an indirect influence is less familiarity with the language, which results in atypical idioms and combinations.

Starting in the 1980s, the focus in South Tyrolean variety linguistics has shifted from research based on the criticism of Italian interferences as an impairment of the German language towards a less judgemental interpretation and the description of ‘official’ ‘specific vocabularies’95 from a variety linguistic perspective (see above).

Abfalterer (2007, pp. 67-232 and 263ff.) reports the following as the special vocabulary96 or lexical-semantic peculiarities of South Tyrol: 303 central ‘primary Südtirolisms’ have been identified, which are used only in South Tyrol in all their senses (containing 16 % loanwords and less than 30 % loan translations, most of them as expected in the areas of administration, finance, and politics, see p. 54). She ascertains that there is a rather restricted and decreasing use of

95 Already Moser (1982, pp. 20-21) claimed that lexical or pronunciation differences cannot even be called deviations or errors (see also Lanthaler, 2001, p. 147). Pernstich (1982, p. 107) and Putzer (1984, p. 142) noted furthermore that it is important when investigating the influence and peculiarities of South Tyrolean German to differentiate between spoken and written language – the Italian influence is stronger in spoken language, for example.
96 250 of all Südtirolisms got included in the VWB (Ammon et al., 2004, see section 2.1.1) and in the 40th edition of the ÖWB.

Italian loanwords; more than half of the Südtirolisms investigated do not have a direct relation to Italian (p. 219), which also promotes the new variety linguistic perspective moving away from research focused on negative interferences. Examples for primary Südtirolisms are Autobüchlein vs. DE Fahrzeugschein (‘vehicle registration certificate’), Barist vs. DE Barkeeper (‘bartender’), or Saltner (‘vineyard / orchard guard’) without a specific corresponding term in e. g. Germany. In addition, Abfalterer (2007) describes words which are primary Südtirolisms only in one of their senses, where a specific grammatical or semantic characteristic of a lemma is only used in South Tyrol (for examples see section 2.2.2).

Around 250 ‘secondary Südtirolisms’ have been identified by Abfalterer (2007, pp. 266-268), where the meaning is shared completely with at least one other variety. One quarter of these are combinations of ST/CH, one third of ST/DE, and the rest of ST/AT; most of them are used in every-day language (p. 42). An example is Marille (shared with AT, vs. DE Aprikose, ‘apricot’). Abfalterer (2007, pp. 227-232) compares these findings with older studies and concludes that many phenomena are consistent even though the corpora are different, thus deducing a constant special vocabulary without considerable diachronic change.

Heid (2011) in the most recent study on collocations (see above) concludes that the German varieties have different preferences in their collocator choice for certain bases, but that their morpho-syntactic preferences for the same collocations are rather stable. More examples of single phenomena can be found in the following section.

2.2.2 Linguistic characteristics of South Tyrolean German

This section collects particular phenomena97 and distinctive features of the German language in South Tyrol on different levels of linguistic description.

97 Not all of these phenomena are verified as Südtirolisms, but they contain single observations as well and are collected across the text types ‘spoken’ and ‘written’, in order to give a general impression of the occurring phenomena.

It is based on research done in this field documented in books, media articles, and scientific publications (see section 2.2.1). The key for the abbreviations of these sources used in the example tables below can be found in table 2.1.

Phonology

For some details on the phonological level see e. g. Mall & Plagg (1990, p. 228), who describe a relatively small Italian influence on the sounds of South Tyrolean German, e. g. the palatal nasal produced in ignorieren (in simplified phonetic alphabet: [iniori:ren] vs. [ignori:ren], ‘ignore’). The special pronunciation of [qu] (vs. [kv]) can also be noted in South Tyrolean German, e. g. from IT quale (‘which’) in German Qualle (‘jellyfish’).

Morphology

Little morphological influence of the Italian language on South Tyrolean German is reported (e. g. by Mall & Plagg, 1990, p. 228); as to inflection, not even the plurals of Italian loanwords are taken over in some cases, e. g. Targa / Targen (‘number plate/s’) and Vespa / Vespen (‘scooter/s’) vs. IT targhe and vespe. The plural ending -e seems to be less acceptable than -i, the latter of which is e. g. used in Alpini (‘mountain soldiers’). Regarding word formation and more specifically compounding, the linking element between morphemes is generally used more in Southern German dialects (examples e. g. also in Grzega, 2000). No differences in derivation could be identified. Many of the characteristics of South Tyrolean German in inflection and compounding collected in tables 2.2 and 2.3 are similar to the Austrian variety.

Lexis Most peculiarities of South Tyrolean German, as various studies verify, can be noted on the word level (to be seen in tables 2.4 for general lexical entities and 2.5 for loan phenomena). Such differences between the German varieties on the lexical level usually consist in one-to-one equivalents such as Kondominium (‘apartment building’) in ST vs. Mehrfamilienhaus in DE or in many-to-one 2.2 Investigations on South Tyrolean German 93

Table 2.1: Key for source indications in example tables

indication | source
A07 | Abfalterer (2007)
A10 | author’s observations
G00 | Grzega (2000)
H11 | Heid (2011)
MP90 | Mall & Plagg (1990)
O10 | Obrist (2010)
R98 | Riehl (1998)
R05 | Rampold (2005)
R07 | Robertazzi (2007)

equivalents such as ST provisorische Ausfahrt (‘temporary exit’) and DE Behelfsausfahrt, respectively. In table 2.6, one-word expressions and multi-word expressions (MWEs) are contrasted, which also relates to the cross-level functional equivalents in the paragraph on semantics below. Another example for the first category from Obrist (2010, p. 33) are the variants for ‘number plate’: Kenntafel or Targa in ST, Nummerntafel or Kennzeichentafel in AT, Kontrollschild or Nummernschild in CH, and Nummernschild or amtliches Kennzeichen in DE.

Morpho-syntax Especially differences between dialect and standard variety with respect to the inflection of articles and nouns have been listed e. g. by Egger (1979, pp. 72-78) and by Giacomozzi (1982, pp. 79-84). For example, in the dialect98 there is no distinction between the accusative singular and the dative singular forms of the definite article of masculine nouns, whereas the standard makes a distinction (den, accusative, vs. dem, dative; see also Giacomozzi, 1982, pp. 90-97). This is also related to the so-called m-/n-Abrundung (‘rounding off’, Schwienbacher, 1997, pp. 94ff.)99 for datives.

98 Likewise in Italian, no such distinction is made.
99 Another interpretation is that all prepositions are used with the dative, p. 102.

This difference between dialect and standard often causes uncertainties100 in the use of accusative and dative forms that can be observed for example within prepositional phrases (e. g. *Interesse an den (accusative, instead of dative dem) Service; ‘interest in the service’).101 Furthermore, differences in the gender of nouns can be noted, e. g. for AT, ST das vs. DE die Brezel (‘pretzel’; neuter vs. feminine) and der Radio vs. das Radio (‘radio’; masculine vs. neuter), respectively. In South Tyrol, as in other Southern German dialect regions, the present perfect tense is more frequent than the simple past, see e. g. Riehl (1998, p. 181). Also Putzer (2009) finds (in the Tiroler Tageszeitung) that the perfect tense is more frequent there than in more northern newspapers. Another phenomenon showing a difference is the morphological distribution of the auxiliaries haben (‘have’) and sein (‘be’) for the perfect tense (ich habe / bin gesessen for north vs. south, ‘I have sat’; p. 489; see also Giacomozzi (1982)). The Variantengrammatik project (see section 2.1.1) verifies such findings as well. Examples for morpho-syntactic phenomena can be seen in table 2.7.

Co-occurrences Particular co-occurrences for South Tyrolean German of the types adjective (ADJ) + noun (N), (N+) preposition (PREP) + N, verb (V) + PREP, and predicate (PRED) + OBJ are exemplified in table 2.8 to table 2.11. Abfalterer (2007, p. 70) lists several phrasal constructions to be further investigated, and also the mentioned Datenbank zum Südtiroler Deutsch (see section 2.1.1) contains complex phenomena and MWEs for further systematic studies.

100 In contrast, Obrist (2010, p. 82) explains that the German differentiation between das and dass is easy to make by translating it into the dialect, where different lemmas are used (was and dass, respectively).
101 The sentence “Ich werde in August in einer neuen Arbeit eintauchen.” (vs. im / eine; ‘I will immerse myself in a new job in August.’) is a further authentic example which shows two mistakes influenced by the writer’s dialect.

The phenomenon of adjectives or relative clauses referring to non-heads of compounds (see Lapshinova-Koltunski, 2008, who reports this phenomenon for the Austrian variety as well), e. g. feste Stellensuche (‘search for fixed position’), is found also in South Tyrolean German texts (see also Rampold, 2005, pp. 38, 40, 45, 59, and 89).

Syntax Mall & Plagg (1990, p. 228) describe only a minor influence of Italian on South Tyrolean German syntax, except for written administrative language, where they report ‘relatively many’ interferences. One phenomenon reported by Rampold (2005, pp. 21, 36, and 39) is the use of participle clauses in which the inferred subject does not correspond to the subject of the main clause, which he explains by the Romance language influence, an example being

(2.1) Seit zwanzig Jahren von den Schweizer Behörden gesucht, [...] ist die Sozialunterstützung für XY abgelaufen.
Having been sought by the Swiss authorities for 20 years, [...] the social welfare for XY has expired. (p. 36).

Semantics The items shown in table 2.12 have undergone a change or extension of meaning in South Tyrolean German with respect to other German varieties. Abfalterer (2007, p. 168; see also section 2.2.1) refers to such words as ‘primary Südtirolisms in one of their senses’. Such ‘semasiological variants’ (see section 1.2.1) are the same words in several varieties with varying meanings, such as Stundenplan (ST ‘opening hours’ vs. DE ‘timetable’; IT orario) or ausrasten (ST ‘relax’ vs. DE ‘get crazy’). Particular idioms in South Tyrolean German are rare; an example can be seen in table 2.13. Regarding equivalents which go beyond one level of linguistic description, complex constructions are noticeable in South Tyrolean German, such as

Ministerium für auswärtige Angelegenheiten (DE Außenministerium; ‘foreign ministry’; IT ministero per gli affari esteri, see also table 2.6). This phenomenon is explained by a Spiegelbildeffekt (mirror effect) because graphical similarity is important for official bilingual documents. Another example for complex phenomena such as differences in formulations is the case where one concept is expressed with different constructions in two varieties, e. g. ST Informationen können eingeholt werden bei (‘information can be retrieved at’), where there is no corresponding collocation in DE, AT, or CH, but the expression Informationen unter (‘information at’) can be found in DE (see Heid, 2011, p. 553). Also the construction sich Mutschlechner schreiben (‘to be called Mutschlechner by last name’), which would correspond roughly to Mutschlechner mit Nachnamen heißen, shows such a phenomenon.

Table 2.2: Examples for inflection

ST | DE | source | translation; comment
ihr erhält | ihr erhaltet | R05, p. 44 | you receive; also CH
  Example: Ihr erhält / erhaltet beide eine Bescheinigung. ‘You both receive a certificate.’
infinitive + gesollt, gekonnt, gewollt, gedurft, gemusst | infinitive + sollen, können, wollen, dürfen, müssen | O10, p. 72 | should, can, want, be allowed, must
  Example: Sie hätte gestern ankommen gesollt / sollen. ‘She should have arrived yesterday.’
den Bauer, Präsident, Kamerad | den Bauern, Präsidenten, Kameraden | R05, p. 63; O10, pp. 59, 78 | the farmer, president, fellow
  Example: Das wird den Bauer / Bauern interessieren. ‘That will be of interest to the farmer.’

Table 2.3: Examples for word formation – compounding

ST | DE | source | translation; comment
Fabriksverkauf | Fabrikverkauf | A10 | outlet sale; also AT
Geschenksidee | Geschenkidee | A10 | idea for a gift; also AT
Schadensersatz | Schadenersatz | A10 | compensation; also AT
Schweinsschnitzel | Schweineschnitzel | G00 | roast pork; also AT
verfassungsgebend | verfassunggebend | A10 | constituent; also AT
Zugsunglück | Zugunglück | G00 | train accident; also AT

Table 2.4: Examples for lexical entities

ST | DE | source | translation; comment
Aufstiegsanlage | Lift | O10, p. 36 | cable car; IT impianto di risalita
Autobüchlein | Fahrzeugschein | MP90 | vehicle registration document
Dose | Dosis | R05, p. 49 | dosage; IT dose
  Example: Jeden Tag die Dose verringern. ‘Reduce the dosage every day.’
durchaus | ständig / durchgehend | A10 | constantly
  Example: Er hat durchaus geredet. ‘He has been talking constantly.’
Feuerwehrhalle | Feuerwache | O10, p. 38 | fire station
Gesuch | Antrag | H11 | application
innerhalb + date | bis + date | A10 | until; IT entro
Motorisierungsamt | Kraftfahrzeugamt | R05, p. 119 | motor vehicle department
Notspur | Standstreifen | O10, p. 35 | emergency lane; IT corsia di emergenza
oder ... oder | entweder ... oder | R05 | either ... or; IT o ... o
  Example: Er wird oder heute oder morgen kommen. ‘He will come either today or tomorrow.’
Schreibname | Nachname | A10 | last name
Stammrolle | unbefristeter Beamtenvertrag | MP90 | unlimited contract for civil servants

Table 2.5: Examples for loanwords and loan formations

ST | DE | source | translation; comment
Amtsdirektor | Amtsleiter | H11 | head of office; IT direttore
Argument | Thema | O10, p. 42 | topic; IT argomento
Assessor | Minister | A10 | minister; IT assessore
Bankkoordinaten | Bankverbindung | O10, p. 44 | bank details; IT coordinate bancarie
Barist | Barkeeper | MP90 | bartender; IT barista
Baukonzession | Baugenehmigung | O10, p. 47 | building permit; IT concessione edilizia
Carabiniere | Polizist | MP90, p. 229 | policeman; IT carabiniere
drogiert | unter Drogen | R05, p. 73 | drugged; IT drogato
Identitätskarte | Personalausweis | A10 | identity card; IT carta d’identità
Industriezone | Industriegebiet | A10 | industrial zone; IT zona industriale
Kondominium | Mehrfamilienhaus | A10 | apartment building; IT condominio
Modul / Modell | Formular | R05 | form; IT modulo
Monolokal | Einzimmerwohnung | O10, p. 40 | studio apartment; IT monolocale
Permesso | Genehmigung | MP90 | permission; IT permesso
Quästur | Polizeidirektion | A10; MP90, p. 229 | police department; IT questura
referenziert | mit Referenzen | O10, p. 40 | with references; IT referenziato
  Example: Zu vermieten an referenzierte Personen. ‘To be rented to people with references.’
Struktur | Einrichtung | O10, p. 46 | facilities; IT struttura
Superalkoholika | Spirituosen | O10, p. 43 | spirits; IT superalcolici
Urbanistik | Stadtplanung | R05, p. 124; O10, p. 47 | urbanism; IT urbanistica

Table 2.6: Examples for cross-level equivalent constructions

ST | DE | source | translation; comment
Erste Hilfe | Notaufnahme | O10, p. 37 | emergency ward; IT pronto soccorso
in erster Person | persönlich | O10, p. 39 | personally; IT in prima persona
landwirtschaftliches Grün | Anbaufläche | R07 | farming land; IT verde agricolo
Ministerium für Auswärtige Angelegenheiten | Außenministerium | A10 | foreign ministry; IT ministero degli affari esteri
mobile Baustelle | Wanderbaustelle | A10 | mobile road works
namentliches Verzeichnis | Namensverzeichnis | R07 | name list
provisorische Ausfahrt | Behelfsausfahrt | R07 | temporary exit; IT uscita provvisoria
unbewegliches Gut | Immobilie | R07 | real estate; IT bene immobile; also AT
weiters | des Weiteren | A10 | furthermore; also AT

Table 2.7: Examples for morpho-syntactic phenomena

ST | DE | source | translation; comment
mit, bei + accusative | mit, bei + dative | R05, pp. 55, 92, 97, 100, 104, 117, 162 | with, at
  Example: Sie hat sich mit den / dem Lehrer getroffen. ‘She met with the teacher.’
gratulieren + accusative | gratulieren + dative | R05, pp. 41, 42, 122, 125 | congratulate
  Example: Alle haben ihn / ihm gratuliert. ‘Everybody congratulated him.’
kosten + dative | kosten + accusative | A10 | cost
  Example: Es kostete ihm / ihn seinen Ruf. ‘It cost him his reputation.’
sich wenden an + dative | sich wenden an + accusative | O10, p. 59 | refer to
  Example: Wenden Sie sich an der / die Kasse. ‘Refer to the cash desk.’

Table 2.8: Examples for ADJ+N co-occurrences

ST | DE | source | translation; comment
geförderter Wohnbau | sozialer Wohnungsbau | R07; H11 | subsidised housing
Gute Arbeit! | Frohes Schaffen! | R05, p. 23 | Have fun working!; IT Buon lavoro!
weißer Stimmzettel | leerer Stimmzettel | R07 | void ballot; IT scheda bianca
grüne Nummer | gebührenfreie Nummer | A10 | toll-free number

Table 2.9: Examples for (N+)PREP+N co-occurrences

ST | DE | source | translation; comment
Achtung auf ... | Achtung, ... | A10 | attention to ...; IT attenzione a ...
  Example: Achtung auf Stirnfransen beim Passfoto! ‘Attention to bangs in the passport photograph!’
am Bauernhof / Boden / Land / Podium / Rücken | auf dem Bauernhof / Boden / Land / Podium / Rücken | R98, p. 193 | on the farm / floor / in the countryside / on the platform / back
an Bord (eines Autos / Busses) | in (einem Auto / Bus) | R05, p. 24 | in (a car / bus); IT a bordo
beim Fenster (hinauswerfen / sitzen) | zum / am Fenster (hinauswerfen / sitzen) | A10 | (throw out / sit at) the window
im Bedarfsfall | bei Bedarf | A10 | if required; IT in caso di necessità
in einem zweiten Moment | in einem weiteren Schritt | A10 | in a further step; IT in un secondo momento

Table 2.10: Examples for V+PREP co-occurrences

ST | DE | source | translation; comment
am guten Gelingen beitragen | zum guten Gelingen beitragen | A07, p. 70 | contribute to a good success
mit dem Auto / Bus / Fahrrad / Zug gehen | mit dem Auto / Bus / Fahrrad / Zug fahren | A10 | take the car / bus / bike / train; IT andare con ...
mit Hand machen | von Hand machen | A07, p. 70 | do manually
sich interessieren um | sich interessieren für | A07, p. 70 | be interested in
sich um Hilfe an jemanden wenden | sich für Hilfe an jemanden wenden | R05, p. 66 | turn to someone for help

Table 2.11: Examples for PRED+OBJ co-occurrences

ST | DE | source | translation; comment
Arbeit haben mit | brauchen für | A10 | take (time)
  Example: Sie hatte drei Stunden Arbeit nach Brixen. ‘It took her three hours to get to Bressanone.’
Comeback geben | Comeback feiern / haben | H11 | stage a comeback
Information erteilen | Information geben | H11 | give information
Kind einschreiben | Kind anmelden | R07 | enroll child; IT iscrivere
Information / Parkschein einholen | Information beschaffen / Parkschein lösen | H11 | retrieve information / parking ticket
sich eine Idee machen | sich eine Vorstellung machen | R05; 122 | get an idea; IT farsi un’idea

Table 2.12: Examples for semantic differentiations

ST | DE | source | translation; comment
ausrasten | ausruhen | A10 | relax; vs. get crazy
  Example: Morgen kannst du ausrasten. ‘Tomorrow you can relax.’
Konkurs | Wettbewerb | R05, p. 167 | competition; IT concorso
Linie | Leitung | A10 | (telephone) line; IT linea
  Example: Bleiben Sie in der Linie. ‘Hold the line.’
Mobilität | ‘Arbeitslosigkeit’ | A07 | temporary unemployment
Patent | Führerschein | A10 | driving license; IT patente
Professor | Lehrer | R98, p. 194 | teacher; IT professore

Table 2.13: Example for idiomatic expressions

ST | DE | source | translation; comment
ein ‘Wagele’ werden | durchdrehen | A10 | get crazy
  Example: Dort muss man ja ein Wagele werden. ‘One must get crazy there.’

2.3 Research desiderata derived from the state of the art

As has been documented in the previous sections, various aspects of written varieties of pluri-centric languages have been investigated and described in numerous single studies and publications during the last decades. The analysis of these approaches and investigations still shows a clear research gap in the automated comparison and description not only of German varieties (see also Abel et al., 2008, pp. 243f.), but also for varieties of other pluri-centric languages. In the following, concrete desiderata from the fields of variety linguistics and computational linguistics are summarised which will contribute to filling this gap. From the variety-linguistic perspective, systematic and comprehensive as well as objective quantitative and empirical studies on all levels of linguistic description are still outstanding. As can be deduced especially from the current state of research on the South Tyrolean German written standard, large-scale systematic investigations of peculiarities are necessary

i.) on the lexical, phrasal, syntactic, semantic, and textual level,

ii.) by direct comparison of authentic data from different varieties, and

iii.) using state-of-the-art computational-linguistic methods.

It is furthermore crucial for South Tyrol – and for non-dominant varieties (see section 1.2.2) in general – to make use of the knowledge on the variety that has already been elaborated. On the one hand, existing research results on the lexical level have to be verified, supplemented, and enriched (see section 1.2.3 on corpus lexicography and also Abel & Anstein, 2008, with respect to variant lexicography); on the other hand, systematic studies on higher levels of linguistic description (e. g. on recurrent phrasal patterns, on fine-grained semantic differences or on text structure) are necessary. It is crucial to investigate this in an intra-lingual comparison (see also Ammon, 2001, p. 25, who proposes to systematically contrast German varieties to check South Tyrolean specific vocabulary). For these purposes, electronic resources and the corresponding

computational analysis methods – which are both becoming more and more available – have to be exploited. Regarding the latter requirement, no comprehensive sophisticated computational tools tailored for supporting variety linguists are available yet. Even though the challenge of extracting significant differences from corpora102 has been tackled by various approaches and systems (see sections 2.1.2 and 2.1.3), the methods used in the variety research mentioned above consisted mainly of manual work – for various reasons such as the lack of digital variety text corpora and of specific user-friendly tools for automated data processing. While variety corpora are being developed in more and more initiatives, as described in section 2.1.1, an improvement and refinement of existing tools as well as the development of new specific tools are needed to compare varieties on the basis of corpora semi-automatically, in order to facilitate and support the manual work done by variety linguists. The described need for a user-friendly comprehensive system for variety corpus comparison as a basis for systematic studies in variety linguistics is tackled in this work as a feasibility study by prototyping an interactive system that supports variety linguists and by evaluating its benefits.

102 Related to this task, valid measures for the comparability of corpora (see sections 1.2.3.4 and 2.1.2) still have to be developed (see section 5.1).

3 The system Vis-À-Vis

In this section, the system prototype developed in the framework of this project will be described in detail. Following the general system development life cycle, the requirements analysis, the methodological design, and the implementation including the functionalities and the output will be elaborated on.

3.1 Requirements for a corpus comparison system

In the following, both general and specific requirements for a useful computational system for variety linguistics on the basis of corpora according to various aspects are listed.

Functionality The system has to offer corpus preparation from plain text with linguistic annotation (see section 1.2.3.2) and query engine indexing (see section 2.1.3.2). The corpus preparation tools, especially for linguistic annotation, have to be interactively adaptable to varieties. After checking the comparability of the variety and reference1 input corpora (see section 1.2.3.4), the next step which is necessary consists in the systematic extraction of phenomena2 on different levels of linguistic description (see section 1.2.3.3). In the central module, trivial phenomena listed as corpus-external knowledge have to be filtered out, and data

1 The term ‘reference corpus’ is used in a ‘relative’ sense here. It does not refer to a ‘standard’ or ‘dominant’ variety, but can be any other variety (see also section 1.2.3.2).
2 As opposed to the use of ‘phenomenon’ as a general term so far, here, this term refers to concrete instances of linguistic phenomena in the description of the system and of its output, see also section 1.2.1.1.

for each phenomenon has to be quantitatively compared across the two corpora (see section 1.2.3.5). On that basis, the ‘candidates’ should be ranked according to their relevance, where a suitable statistical measure for good reliability of this relevance value has to be chosen (see also section 1.2.3.5). The system has to show the results in a readable and further processable form.

Interfaces The toolkit must be accessible on the command line and via a user-friendly graphical user interface (GUI), which has to be a browser application usable by experts and, after a short familiarisation time, by non-experts as well. Via the interfaces, it has to be possible to upload input (corpora and optional corpus-external data) and to download output (comparison result lists).

Performance The system has to allow for efficient work on variety linguistic issues and it has to be robust with regard to distorted, incomplete, and generally unexpected input. Its results have to be correct and reproducible.

Further general attributes The tool must be modular and easily maintainable. Each module has to be extensible and adaptable, e. g. to different varieties or other pluri-centric languages.

3.2 Methodology and system architecture

After a description of the approaches and the methodology used for developing Vis-À-Vis, the system’s workflow and the contents of its modules will be specified in this section.

3.2.1 Approaches and methods

Vis-À-Vis applies a corpus comparison approach which is based on linguistic patterns. This methodology has been chosen since it makes it possible to extract and count individual occurrences of the phenomena to be analysed, in order to work empirically with quantitative data.3 Applying this method, previous findings regarding regionalisms can be both revised and enhanced, which is necessary according to the desiderata in regional variety linguistics (see section 2.3). The units of comparison are linguistic entities on different levels of description. The ‘tertium comparationis’ (see section 1.2.3.4) is considered as the varieties’ common classification in terms of language typology, which entails the general shared structure and characteristics of the varieties as constant parameters, independent of the peculiarities of individual varieties. This is the common foundation for the comparisons to be carried out.4 As a general assumption and prerequisite, it is expected that regionalisms of one variety do not occur, or occur only very rarely, in reference corpora, which can represent either dominant varieties or other non-dominant varieties. The phenomena which show high frequency in the variety corpus and significantly lower frequency in the reference corpus are considered worth investigating more closely. Phenomena such as i.) single occurrences of regionalisms in reference corpora, ii.) regionalisms in only one of their senses, and iii.) regionalisms shared with other varieties have to be manually investigated with larger sentence contexts. As to the degree of pre-processing and query complexity, which are mutually dependent (see section 1.2.3.2), the decision has been taken to apply word-level annotation for PoS-based querying, since this is a suitable compromise which makes it possible to benefit from the generally high quality of low-level annotation with an acceptable complexity of the needed queries.

3 In section 1.2.3.4, cross-linguistically applicable categories have been introduced as crucial for contrastive studies (between languages). This consideration is not necessary in the present context, since for varieties usually similar categories apply (see also below), thus the condition is trivially met.
4 Heid (2011) claims that corresponding variants have the same or a very similar linguistic structure, e. g. collocation patterns.

Since Internet independence is desired for the toolkit developed in this feasibility study, the scripts are designed as stand-alone software which can be used as an autonomous and adaptable tool. For the GUI, possible user groups and their research demands as well as their usability feedback have been taken into account in order to develop a serviceable, intuitive and user-friendly interface. It can be used without computational expertise, as described in the objectives of this feasibility study (see section 1.1.1), since it has been designed with general principles of human-machine interaction in mind. The concrete implementation of the methodology for the system’s single steps, which are specified in the following, is presented in section 3.3.1.

Comparability check The comparability (see section 1.2.3.4) of the two input corpora is measured in order to judge the reliability of the comparison results. For this, their ‘complexity’5 – as also used e. g. in learner corpus studies – is taken into consideration in order to give a tentative indicator for the comparability of the two corpora. On the one hand, the logarithmic type-token ratio (TTR) (see section 1.2.3.4) of each corpus is calculated as an indicator for vocabulary richness and lexical variability. On the other hand, the proportion of lexical words to the total number of words is shown in order to illustrate the ‘information load’ of the input texts, measured as lexical density (see also section 1.2.3.4). With these indications, also the research question on the comparability of the input affecting the quality of the output (see section 1.1.2) can be tackled as shown in section 4.1.
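The two indicators named above can be illustrated with a minimal Python sketch. It is not the system's implementation: the logarithmic TTR is assumed here to be log(types) / log(tokens), and the set of STTS-style tag prefixes treated as ‘lexical’ is likewise an assumption made for illustration only.

```python
import math

# Assumption: STTS-style tag prefixes counted as lexical (content) words.
LEXICAL_TAG_PREFIXES = ("NN", "NE", "VV", "ADJ", "ADV")


def log_ttr(tokens):
    """Logarithmic type-token ratio, taken here as log(types) / log(tokens)."""
    n_tokens = len(tokens)
    if n_tokens < 2:
        return 0.0  # not defined for fewer than two tokens; 0.0 is a placeholder
    return math.log(len(set(tokens))) / math.log(n_tokens)


def lexical_density(tagged_tokens):
    """Proportion of lexical words among all (word, pos) pairs."""
    if not tagged_tokens:
        return 0.0
    lexical = sum(1 for _, pos in tagged_tokens
                  if pos.startswith(LEXICAL_TAG_PREFIXES))
    return lexical / len(tagged_tokens)


# Hypothetical toy corpora as lists of (word, pos) pairs.
variety = [("die", "ART"), ("Targa", "NN"), ("ist", "VAFIN"), ("neu", "ADJD")]
reference = [("das", "ART"), ("Nummernschild", "NN"), ("ist", "VAFIN"), ("neu", "ADJD")]

for name, corpus in (("variety", variety), ("reference", reference)):
    words = [word.lower() for word, _ in corpus]
    print(name, round(log_ttr(words), 3), round(lexical_density(corpus), 3))
```

Similar values for both corpora would be read as a tentative sign of comparability; how large a difference is still acceptable is not specified here.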

5 Other approaches to and measures for corpus comparability as presented in section 1.2.3.4 (concluding that the notion of ‘corpus comparability’ is still unclear) and in section 2.1.2 can be investigated in further research.

Annotation In order to compare corpora according to linguistic categories, word-level annotation (see section 1.2.3.2) is conducted using both existing (see section 2.1.3) and adapted tools. As has been mentioned above, such low-level annotation is generally very robust and yields correct results, which was the reason for taking it as the basis for the phenomenon extraction – in contrast to higher-level annotation, which is less reliable. The process of tokenising the input texts, based on general punctuation rules and abbreviation lexicon entries, is followed by statistical PoS tagging and lexicon-based lemmatisation carried out for each token. After the complete Vis-À-Vis run – in a bootstrapping process – new findings on variants can be integrated into a further comparison process by providing lexicon entries for non-lemmatised items, according to the lists generated as output. These lists can also be used to adapt annotation tools, which will altogether result in increasingly better output of further corpus analysis and comparison studies.

Analysis levels The three levels of linguistic description covered with Vis-À-Vis – chosen as examples of the most promising possible analyses according to the research question stated in section 1.1.2 – will now be methodologically described.

Uni-gram level The level of lexical items provides the most obvious and also the most frequent cases of differences between a given variety and a reference corpus. To find instances of such differences, all word forms or lemmas of both corpora are extracted and counted using standard tools (see section 2.1.3.2). In order to look for specific lexical phenomena, an extraction restricted to certain open word classes such as nouns or verbs is available as well.
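The counting step described for the uni-gram level can be sketched as follows. This is a plain-Python stand-in rather than the standard tools referred to above; the function name, the (word, pos, lemma) triple format, and the tag prefixes are assumptions for illustration.

```python
from collections import Counter


def count_unigrams(tagged_corpus, pos_prefixes=None, use_lemma=True):
    """Count lemmas (or word forms) in a list of (word, pos, lemma) triples.

    If pos_prefixes is given, only tokens whose tag starts with one of the
    prefixes are counted (e.g. ["NN"] for common nouns).
    """
    counts = Counter()
    for word, pos, lemma in tagged_corpus:
        if pos_prefixes and not pos.startswith(tuple(pos_prefixes)):
            continue
        counts[lemma if use_lemma else word] += 1
    return counts


# Hypothetical annotated snippet.
sample = [
    ("Kondominien", "NN", "Kondominium"),
    ("das", "ART", "d"),
    ("Kondominium", "NN", "Kondominium"),
    ("kauft", "VVFIN", "kaufen"),
]
print(count_unigrams(sample, pos_prefixes=["NN"]).most_common())
```

Running the same function on the variety and the reference corpus yields the two frequency lists that the later comparison step needs.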

Bi-gram level Based on a general n-gram extraction study to look at frequencies of PoS combinations, the two adjacent co-occurrence patterns ADJ+N and adverb (ADV)+ADJ have been chosen as examples for two-word combination studies, in order to investigate differences on a more complex level of linguistic description. These PoS patterns are extracted with standard tools (see section 2.1.3.2).
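A possible way to picture the extraction of adjacent PoS patterns is the following sketch. It works on plain Python lists instead of the standard tools mentioned above, assumes STTS tags (ADJA, ADJD, ADV, NN), and ignores sentence boundaries for brevity; all names are hypothetical.

```python
from collections import Counter


def count_pos_bigrams(tagged_corpus, first_tags, second_tags):
    """Count adjacent lemma pairs whose tags match the given PoS pattern.

    tagged_corpus is a list of (word, pos, lemma) triples.
    """
    counts = Counter()
    for (_, pos1, lemma1), (_, pos2, lemma2) in zip(tagged_corpus, tagged_corpus[1:]):
        if pos1 in first_tags and pos2 in second_tags:
            counts[(lemma1, lemma2)] += 1
    return counts


# Hypothetical annotated snippet.
tokens = [
    ("die", "ART", "d"), ("grüne", "ADJA", "grün"), ("Nummer", "NN", "Nummer"),
    ("ist", "VAFIN", "sein"), ("sehr", "ADV", "sehr"), ("praktisch", "ADJD", "praktisch"),
]
print(count_pos_bigrams(tokens, {"ADJA"}, {"NN"}))           # ADJ+N pattern
print(count_pos_bigrams(tokens, {"ADV"}, {"ADJA", "ADJD"}))  # ADV+ADJ pattern
```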

Syntactic level As an example of a phenomenon from the syntactic level, coordinate and subordinate clauses are counted via corresponding conjunctions assigned by the PoS tagger, in order to compare the complexity of sentences in the corpora. Furthermore, the correctness of the word order in two kinds of subordinate clauses is checked. Here as well, standard tools (see section 2.1.3.2) are used to match the corresponding PoS patterns for the phenomenon of verb- second word order in subordinate clauses (see ‘anacoluthon’, section 1.2.1.1). This phenomenon has been chosen to test if the PoS-based methods also work for non-adjacent phenomena.
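The conjunction-based part of this analysis can be sketched as below; the word-order check is not covered. The sketch assumes STTS tags (KON for coordinating, KOUS and KOUI for subordinating conjunctions) and normalises counts per 1,000 tokens, which is an illustrative choice and not taken from the system.

```python
from collections import Counter

# Assumption: STTS conjunction tags.
COORDINATING = {"KON"}
SUBORDINATING = {"KOUS", "KOUI"}


def clause_profile(pos_tags):
    """Count coordinating and subordinating conjunctions in a tag sequence."""
    counts = Counter(pos_tags)
    n_tokens = len(pos_tags)
    coord = sum(counts[tag] for tag in COORDINATING)
    subord = sum(counts[tag] for tag in SUBORDINATING)
    per_1000 = (lambda c: 1000.0 * c / n_tokens) if n_tokens else (lambda c: 0.0)
    return {
        "coordinating": coord,
        "subordinating": subord,
        "coord_per_1000": per_1000(coord),
        "subord_per_1000": per_1000(subord),
    }


# Hypothetical tag sequence of a short text.
tags = ["PPER", "VVFIN", "KON", "PPER", "VVFIN", "KOUS", "PPER", "VVFIN"]
print(clause_profile(tags))
```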

Linguistic filtering In order to clean the output as far as possible, thorough filtering of the extracted phenomena, e. g. to remove tokenisation errors, special characters, or punctuation, is carried out. Furthermore, presumably non-relevant (e. g. stopwords) as well as ‘expected’ variety-specific phenomena such as regional NEs, approved regionalisms, or ‘trivial’ phenomena (see the classification in section 1.2.3.4) are marked in the output table to exclude them from the candidate lists. This is done to reduce this output to mostly relevant phenomena for minimising manual work load, as is one of the objectives stated in section 1.1.1. This step also relates to the research question on how far such a support is possible with useful output, see section 1.1.2. Another independent filter is applied on the annotated corpora, in the case of plain text input, for selecting the non-lemmatised word forms. This resulting ‘unknowns’ list can additionally be manually checked for possible new variants after removing certain known phenomena (see section 2.2.1).
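The marking of expected or irrelevant items can be pictured with the following sketch. The label names (apart from REG, which appears in the output table of the architecture figure) and the input lists are hypothetical; an empty label means that the candidate is kept for manual inspection.

```python
def mark_candidates(candidates, stopwords, known_regionalisms, named_entities):
    """Attach a filter label to each candidate string."""
    marked = []
    for item in candidates:
        if not any(ch.isalpha() for ch in item):
            label = "NOISE"   # punctuation, special characters, digits only
        elif item.lower() in stopwords:
            label = "STOP"
        elif item in named_entities:
            label = "NE"
        elif item in known_regionalisms:
            label = "REG"     # approved regionalism from corpus-external knowledge
        else:
            label = ""        # remains a candidate
        marked.append((item, label))
    return marked


# Hypothetical corpus-external knowledge and candidate list.
result = mark_candidates(
    ["Kondominium", "und", "Bozen", "--", "Notspur"],
    stopwords={"und", "der"},
    known_regionalisms={"Kondominium"},
    named_entities={"Bozen"},
)
print(result)
```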

Statistical comparison The extracted linguistic phenomena are quantitatively compared and ranked by two statistical AMs to indicate how significantly different the frequency of one phenomenon is in the variety corpus with respect to the reference corpus (see section 1.2.3.5).6 This is done to further reduce manual interpretation work as described in section 1.1.1 and to find answers to the research question on relevance ranking posed in section 1.1.2. The log-likelihood (LL) measure (Dunning, 1993, see section 1.2.3) has been chosen as an AM since, according to Evert (2005a, p. 137), it is prevailing in computational linguistics as a standard for the measure of association between words, and for co-occurrences it is recommended as well (see also Rayson & Garside, 2000). Evert (2005a, p. 114) concludes that

“[f]or practical applications, log-likelihood is a convenient and numerically unproblematic alternative that gives very good approximations to the exact p-values.”

As a second measure, LL * log(frequency) is calculated, since Kilgarriff & Tugwell (2002, p. 130, see also section 1.2.3.5) state that LL values over-emphasise the significance of low-frequency items and thus suggest multiplying these values by the logarithm of their frequency for measuring e. g. lexicographic relevance. These two measures have been chosen and compared to each other as is presented in section 4.1; further measures can be evaluated as described in section 5.1.1. In addition, filtering according to the Cochran rule (see section 1.2.3.5) by expected frequencies gives a further indication of the AMs’ reliability, in order to filter out probably irrelevant phenomena.
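The two measures and the expected-frequency check can be made concrete with a short sketch. Several points are assumptions rather than details confirmed by the text: LL is computed as the G² statistic over the full 2×2 contingency table (item vs. all other items, variety vs. reference corpus), the frequency in LL * log(frequency) is taken to be the variety-corpus frequency, the sign encodes overuse (+) or underuse (−) in the variety corpus, and the Cochran check requires all expected frequencies to reach a threshold of 5.

```python
import math


def _term(observed, expected):
    # Contribution of one cell; 0 * log(0) is treated as 0.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood(f_var, n_var, f_ref, n_ref):
    """Return (G2, expected frequencies) for one item in two corpora."""
    total = n_var + n_ref
    row_item = f_var + f_ref
    row_other = total - row_item
    observed = [f_var, f_ref, n_var - f_var, n_ref - f_ref]
    expected = [row_item * n_var / total, row_item * n_ref / total,
                row_other * n_var / total, row_other * n_ref / total]
    g2 = 2.0 * sum(_term(o, e) for o, e in zip(observed, expected))
    return g2, expected


def score(f_var, n_var, f_ref, n_ref, min_expected=5.0):
    """Signed LL, LL * log(f_var), and a Cochran-style reliability flag."""
    g2, expected = log_likelihood(f_var, n_var, f_ref, n_ref)
    overuse = f_var / n_var >= f_ref / n_ref
    signed = g2 if overuse else -g2
    ll_logf = signed * math.log(f_var) if f_var > 0 else 0.0
    return {"LL": signed, "LL_logf": ll_logf,
            "cochran_ok": min(expected) >= min_expected}


# Hypothetical counts: 50 hits in 10,000 variety tokens vs. 10 in 10,000 reference tokens.
print(score(50, 10_000, 10, 10_000))
```

Sorting candidates by either value gives a ranking; note that with this choice of frequency an item occurring once in the variety corpus receives an LL * log(frequency) of zero.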

6 Since the ‘overuse’ vs. ‘underuse’ is marked by ±, peculiar phenomena of the reference corpus with respect to the variety corpus can be equally investigated on that basis.

Output presentation The general results of the corpus comparison are shown directly in the terminal or in the browser window. The output tables for the analysis levels can be retrieved from specified output directories or are downloadable from the GUI. These tables are provided in the easily readable and further processable tab-separated values (TSV) format (see section 3.4), in order to ensure their practicability and sustainability.
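How such a ranked TSV table might be written is sketched below with Python's standard csv module. The column names follow the labels visible in the architecture figure (RANK, PHEN, LL, f_var, f_ref, FILTER); the row format and the function name are assumptions for illustration.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write candidate rows, ranked by descending LL, as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["RANK", "PHEN", "LL", "f_var", "f_ref", "FILTER"])
    ranked = sorted(rows, key=lambda row: row["LL"], reverse=True)
    for rank, row in enumerate(ranked, start=1):
        writer.writerow([rank, row["PHEN"], f"{row['LL']:.2f}",
                         row["f_var"], row["f_ref"], row.get("FILTER", "")])


# Hypothetical result rows.
rows = [
    {"PHEN": "Notspur", "LL": 12.5, "f_var": 9, "f_ref": 0},
    {"PHEN": "Kondominium", "LL": 29.19, "f_var": 50, "f_ref": 10, "FILTER": "REG"},
]
buffer = io.StringIO()
write_tsv(rows, buffer)
print(buffer.getvalue())
```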

3.2.2 Workflow and modules

In the following, the overall sequence of steps in the Vis-À-Vis workflow is presented. The main script takes as input

i.) written text corpora of two German varieties to be compared and

ii.) if available, lists containing corpus-external knowledge that can consist of approved peculiarities (e. g. NEs or regionalisms) of the varieties to be investigated and of full lexicon entries.

As a first step, the two input corpora are checked with regard to their comparability; then the corpora are annotated, taking corpus-external linguistic knowledge into account. In the extraction step, instances of phenomena from different levels of linguistic description are obtained and stored. These are further linguistically filtered as well as compared by frequency and by statistical AMs to indicate their probable relevance for identifying new variant candidates. The output that is presented to the user for interpretation is composed of

i.) general quantitative information on the two corpora and their comparability and

ii.) filtered phenomenon-specific lists of items extracted from the two corpora on the chosen level of linguistic description.

The overall architecture of Vis-À-Vis can be seen in figure 3.1; the single modules depicted there will be described in the following.

Input In the first Vis-À-Vis module, plain texts are uploaded or pre-indexed corpora are specified by their names. The encoding of the input texts is verified (UTF-8) and the corpora are checked for their comparability.

[Figure 3.1: flow diagram. The variety corpus, the reference corpus, and optional corpus-external knowledge enter the INPUT module (comparability check); PRE-PROCESSING annotates both corpora (tokeniser, lemmatiser, PoS tagger) and indexes them (CWB indexer); EXTRACTION & STORAGE covers uni-gram, bi-gram, and syntactic extraction; SELECTION filters and compares the variety and reference phenomena; OUTPUT is a ranked table with the columns RANK, PHEN, LL, f_var, f_ref, ..., FILTER.]

Figure 3.1: Overall architecture of Vis-À-Vis

Pre-processing For plain text input, the linguistic annotation as well as the corpus indexing is done in the pre-processing module by several combined scripts for tokenisation, PoS tagging, lemmatisation, and query engine indexing.

Extraction & storage In the extraction module, phenomena for the three different analysis levels (i. e. uni-gram, bi-gram, and syntactic level) are retrieved from the corpora and stored together with their frequency information. For text input (vs. CQP corpus input), the ‘unknowns’ extraction (see section 3.3.1.5) is furthermore done on the basis of the annotated corpora.

Selection After cleaning the phenomenon lists, the linguistic filter marks expected items, and the difference of phenomenon occurrences in the two corpora is determined via statistical AMs in order to sort the resulting list according to relevance.

Output In the final module, general corpus comparison results and the output lists for phenomena on each analysis level are presented to the user for interpretation and further processing.

3.3 System functionalities and usage modes

In the following, the implementation of Vis-À-Vis and its specific functional features are elaborated on from the technical perspective, and details of its usage are described and visualised.

3.3.1 Technical and functional specification

In this section, implementation details of all relevant features of Vis-À-Vis are provided, starting from the most technical parts and covering the implementation of the functionalities as well as the requirements covered by the system and its limitations. This description does not cover every single implemented step; the complete source code containing comprehensive documentation can be downloaded from the Korpus Südtirol website as specified in the following.

3.3.1.1 Technical system features Vis-À-Vis is implemented in the programming language Perl7, which is well suited for linguistic data processing. The system is designed in a modular way in order to facilitate straightforward maintenance and adaptation, e. g. to account for its transferability to other pluri-centric languages (see the research questions in section 1.1.2). In order to allow for parallel system runs, an individual temporary directory is created for each programme call; these directories are kept for 10 days before they are cleaned up. The GUI is implemented with the scripting language PHP8.

File structure The main script visavis.perl is located in the basic directory, where the command line process is started (see section 3.3.3). The following further directories are in use:

• ADD_DATA/ containing system-internal additional knowledge lists such as stopwords or acknowledged regionalisms.

• GUI/ containing the files needed for the system’s GUI usage.

• OUT_XXXX/ containing all generated intermediate and final result data with individual directory names for each run to allow for parallel processing.

• SCRIPTS/ containing the necessary system-internal Perl and shell scripts for the annotation, the indexing, the phenomenon extraction and storage, the filtering, and the comparison of the input corpora and their phenomena.

• TESTDATA/ containing test corpora and corpus-external knowledge lists.

• TOOLS/ containing all external utility programmes, i. e. for annotation and corpus querying.

7 v5; http://www.perl.org, last accessed 2012-10-10.
8 v5; http://www.php.net, last accessed 2012-10-10.
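The per-run output directories described above can be illustrated with a minimal sketch. It is written in Python rather than in the system's Perl, and the function names, the `OUT_` prefix handling, and the use of the modification time for the 10-day clean-up are assumptions made for illustration, not the system's actual implementation.

```python
import os
import shutil
import tempfile
import time

# Retention period named in the text: 10 days, expressed in seconds.
MAX_AGE_SECONDS = 10 * 24 * 60 * 60


def create_run_directory(base_dir):
    """Create a uniquely named OUT_xxxx directory for one system run."""
    return tempfile.mkdtemp(prefix="OUT_", dir=base_dir)


def clean_old_runs(base_dir, now=None):
    """Remove OUT_ directories older than the retention period.

    Returns the names of the removed directories. The age is taken from
    the directory's modification time (an assumption of this sketch).
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if name.startswith("OUT_") and os.path.isdir(path):
            if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                shutil.rmtree(path)
                removed.append(name)
    return removed
```

Because every call receives its own directory, two runs started at the same time cannot overwrite each other's intermediate files.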

Runtime Vis-À-Vis’ performance9 in terms of time for selected scenarios, i. e. text vs. corpus input across the different extraction levels, can be seen in table 3.1. The system’s runtime lies in the range of 2 seconds to 2 days, the latter for the most laborious extraction level and for large corpora. This is acceptable for regional varieties, since their corpora are usually rather small.

Table 3.1: Vis-À-Vis runtime (hh:mm:ss)

input                   uni-gram level   bi-gram level   syntactic level
100 000 tokens text     00:02:35         00:01:23        00:00:08
1.5 m tokens text       04:26:03         03:37:45        00:01:10
1.5 m tokens corpus     04:25:31         03:36:32        00:00:02
40 m tokens corpus      48:03:46         27:56:44        00:00:31

Availability The Vis-À-Vis GUI version can be accessed online from the Korpus Südtirol10 website. The system is furthermore released under a free software license for downloading as a stand-alone application, which can be used in two ways: on the command line and via the GUI in Linux environments (for both see section 3.3.3).11

9 The runtimes were measured on the EURAC development server, running the Linux-based operating system CentOS 6.2.
10 http://www.korpus-suedtirol.it, last accessed 2012-10-19; see also section 2.1.1.
11 An adaptation for its use with Windows or Mac operating systems is possible as well.

3.3.1.2 Input verification Before the system run, the uploaded or specified corpora are checked as to their character encoding (UTF-8)12 in order to enhance the system’s robustness by making sure that the tools are working correctly.
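An encoding check of this kind can be sketched as follows. The example is in Python rather than the system's Perl, and the function name `is_valid_utf8` is hypothetical; it only illustrates the idea of warning about input that does not decode as UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    samples = {
        "UTF-8 input": "Südtirol".encode("utf-8"),
        "Latin-1 input": "Südtirol".encode("latin-1"),
    }
    for label, raw in samples.items():
        status = "ok" if is_valid_utf8(raw) else "warning: not UTF-8"
        print(f"{label}: {status}")
```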

3.3.1.3 Comparability check The corpus comparability is calculated via two measures (for both see also section 1.2.3.4), each expressed in %: the logarithmic TTR as shown in (3.1) and the lexical density as shown in (3.2)13.

(3.1) $\mathrm{TTR} = \dfrac{\log(\text{number of types})}{\log(\text{number of tokens})} \times 100$

(3.2) $\text{lexical density} = \dfrac{\text{number of lexical words}}{\text{total number of words}} \times 100$
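The two comparability measures can be computed as in the following sketch (Python, not the system's Perl). The function names are hypothetical, and the set of tag prefixes treated as lexical words is an assumption derived from the open word classes N, V, ADJ, and ADV named in footnote 13; whether auxiliaries or punctuation tokens are counted is not specified in the text.

```python
import math

# STTS tag prefixes treated as open word classes (assumption, see lead-in).
LEXICAL_PREFIXES = ("N", "V", "ADJ", "ADV")


def log_ttr(tokens):
    """Logarithmic type-token ratio in percent, as in (3.1)."""
    n_tokens = len(tokens)
    if n_tokens < 2:
        raise ValueError("need at least two tokens")
    n_types = len(set(tokens))
    return math.log(n_types) / math.log(n_tokens) * 100


def lexical_density(pos_tags):
    """Share of lexical words among all words in percent, as in (3.2)."""
    if not pos_tags:
        raise ValueError("empty corpus")
    lexical = sum(1 for tag in pos_tags if tag.startswith(LEXICAL_PREFIXES))
    return lexical / len(pos_tags) * 100


if __name__ == "__main__":
    words = ["die", "Gemeinde", "und", "die", "Stadt"]
    tags = ["ART", "NN", "KON", "ART", "NN"]
    print(f"TTR: {log_ttr(words):.1f}%")
    print(f"lexical density: {lexical_density(tags):.1f}%")
```

Taking logarithms of both counts makes the ratio less sensitive to corpus size than the plain type-token ratio, which is why values for corpora of different sizes remain roughly comparable.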

3.3.1.4 Annotation The tokenisation of the input texts is done with a script released at EURAC and adapted for Vis-À-Vis. It takes special punctuation as well as a considerable number of abbreviations into account.14 The texts are then PoS-tagged and lemmatised with the TreeTagger (v3; using the STTS tag set, see section 2.1.3), resulting in a vertical file. Here, a system-internal lexicon (containing 1723 regionalisms, NEs, etc.) and further lexicon entries possibly uploaded by the user are taken into account to tag and lemmatise special words. The TreeTagger option to assign the word form as the lemma for ‘unknown’ words is not activated in order to prevent wrong lemmatisation. Instead, a Perl script corrects a considerable number of ‘unknowns’ with a rule-based approach (following parts of the similarity-based lemma correction by Gojun et al., 2012, see also section 2.1.3). This is done via language-specific morphological rules according to word classes and derivational as well as inflectional suffixes; an example for nouns ending in -ungen is shown in (3.3). The derived lemmas are marked with an asterisk (*) to indicate that they have been created by heuristic rules (not retrieved from an acknowledged dictionary) and thus might have to be verified manually.

12 For different encoding, a warning message is shown.
13 Lexical words are taken as content-bearing words from open word classes (N, V, ADJ, and ADV), whereas grammatical words such as function words belong to closed word classes (e. g. PREP).
14 As mentioned, all the details of the scripts can be viewed in the downloadable code.

(3.3) if ($lem eq "") [...]
        if ($pos =~ /NN/) [...]
          if ($wordform =~ /(.+)ungen$/)
            print "$wordform\t$pos\t$1ung*\n";
      [...]
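The rule in (3.3) can be restated as a small, self-contained sketch. It is written in Python rather than Perl; the rule table layout and the function name `correct_lemma` are hypothetical, only the -ungen rule is taken from the text, and treating an empty lemma as "not lemmatised" follows the condition as it appears in (3.3).

```python
import re

# Hypothetical rule table: (PoS pattern, word-form pattern, lemma template).
# Only this first rule comes from (3.3); a real rule set would be larger.
RULES = [
    (re.compile(r"NN"), re.compile(r"^(.+)ungen$"), r"\1ung"),
]


def correct_lemma(wordform, pos, lemma):
    """Return a heuristic lemma marked with '*' if no lemma was assigned."""
    if lemma:
        # The tagger already provided a lemma; keep it unchanged.
        return lemma
    for pos_pattern, form_pattern, template in RULES:
        if pos_pattern.search(pos) and form_pattern.match(wordform):
            # The asterisk flags the lemma as rule-derived, not dictionary-based.
            return form_pattern.sub(template, wordform) + "*"
    return lemma


if __name__ == "__main__":
    for wordform, pos, lemma in [("Abstimmungen", "NN", ""), ("Haus", "NN", "Haus")]:
        print(f"{wordform}\t{pos}\t{correct_lemma(wordform, pos, lemma)}")
```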

3.3.1.5 Analysis levels In the following, details of the phenomenon extraction on the three implemented levels of linguistic description will be presented.

Uni-gram level Either all word forms or lemmas occurring in the corpora or the word forms / lemmas of a certain open word class such as N, V, ADJ, or ADV (according to the choice of the user) are counted with CQP (v3; see section 2.1.2). The query e. g. for nouns is shown in (3.4).

(3.4) [pos="NN"]

Furthermore, the word forms which the TreeTagger could not lemmatise according to its lexicon (‘unknowns’) are extracted via pattern matching in the vertical file created for text input during the Vis-À-Vis process.

Bi-gram level For the analysis of recurrent linguistic patterns, Vis-À-Vis extracts co-occurrences by searching for the adjacent PoS patterns ADJ+N and ADV+ADJ15 via CQP queries as shown in the following:

(3.5) [pos = "ADJA"] [pos = "NN"]
(3.6) [pos = "ADV"] [pos = "ADJ."]

15 More patterns can be integrated into the code as specified in the source code documentation.
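What the two queries select can be mimicked outside CQP on a list of (word form, PoS) pairs. The following Python sketch is only an illustration of the pattern matching, not a replacement for the CQP-based extraction; the dictionary labels and the function name are hypothetical. As in CQP, each regular expression has to match the whole tag.

```python
import re
from collections import Counter

# PoS patterns corresponding to (3.5) and (3.6); the keys are hypothetical labels.
PATTERNS = {
    "ADJ+N": (re.compile(r"ADJA"), re.compile(r"NN")),
    "ADV+ADJ": (re.compile(r"ADV"), re.compile(r"ADJ.")),
}


def count_bigrams(tagged_tokens, pattern_name):
    """Count adjacent word pairs whose PoS tags match the given pattern."""
    first, second = PATTERNS[pattern_name]
    counts = Counter()
    for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:]):
        if first.fullmatch(p1) and second.fullmatch(p2):
            counts[(w1, w2)] += 1
    return counts


if __name__ == "__main__":
    sentence = [
        ("die", "ART"), ("neue", "ADJA"), ("Regierung", "NN"),
        ("ist", "VAFIN"), ("sehr", "ADV"), ("gut", "ADJD"),
    ]
    for name in PATTERNS:
        print(name, dict(count_bigrams(sentence, name)))
```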

Syntactic level Frequencies of coordinate and subordinate clauses for both corpora are determined on this analysis level, and the word order in certain subordinate clauses is checked.16 The latter is done by a CQP query for the specific PoS pattern of the two subordinating conjunctions weil (‘because’) and obwohl (‘although’) with verb-second word order. For the query, which is shown in (3.7), the topological field structure of German sentences (see section 1.2.1.1) was taken into consideration, and a CQP ‘macro’17 for subject noun phrases is used as can be seen in the first line. The combination ‘!=’ expresses a negation, and the addition ‘{2,6}’ specifies that the preceding element can occur between 2 and 6 times.18

(3.7) [lemma = "weil | obwohl"] /subj_np[]
      [pos = "V.FIN"]
      [pos != "(\$,|\$.|\$(|KOUI|KON|KOUS|KOKOM|PTKZU|V.*)"]{2,6}
      [pos = "\$."]

The macro for the subject noun phrase is shown in (3.8); the syntax is the same as for CQP queries, e. g. ‘?’ corresponding to ‘{0,1}’.

(3.8) MACRO subj_np(0)

      ( (
      # determiner ((in)det | ein jeder | meine):
      ([word="der|die|das|ein|eine"] |
       ([word="der|die|das|ein|eine"][pos="PIDAT"]) |
       [pos="PDAT|PIAT|PPOSAT"])?

      # adverbs, adjectives connected by comma or conjunctions:
      ([pos="ADV"]{0,2}[pos="ADJA"]([word="(\,|KON)"]?
       [pos="ADJA"]){0,3})?
      ([pos = "NE"][pos = "NE"] |   # Max Muster
       [pos = "NN"][pos = "NE"] |   # Professor Muster
       [pos = "N."] )               # normal or proper noun

      # optional genitive attribute; no relative clause
      ([word = "des|der|eines|einer"] |
       [word = "des|der|eines|einer"][pos="PIDAT"] |
       [pos="PDAT|PIAT|PPOSAT"])?
      ([pos="ADV"]{0,2}[pos="ADJA"]
       ([word="(\,|KON)"]?[pos="ADJA"]){0,3})?
      [pos = "N."]?
      | [pos = "(PDS|PIS|PPER)"] ) );   # substitutive pronouns

16 Here as well, more syntactic patterns to be analysed can be integrated into the code.
17 CQP macros are externally pre-defined query definitions which can be loaded into the engine to be used in queries (Evert, 2005b).
18 With such a query, the phenomenon ‘ADJ/N, weil ADJ/N’ is not excluded. This decision has been taken in order not to miss true positives, even though it may yield false positives (see also section 3.4).

3.3.1.6 Linguistic filtering For cleaning the result data, the system filters out numbers, tokens with fewer than 3 characters, and tokens containing only special characters to exclude them from the result files. Furthermore, the extracted phenomenon occurrences are case-normalised (for word analysis, not for lemma analysis) to disregard capitalisation differences. To mark ‘expected’ phenomena in the resulting comparison lists and thus improve the candidate ranking, Vis-À-Vis uses various comprehensive word lists as filters.19 These lists contain German regionalisms, person and place names, foreign word dictionaries, and manually compiled lists of ‘reality’ descriptions. The following corpus-external knowledge lists are used system-internally, each modified by extensions or deletions as considered necessary. Regionalism lists have been taken from Abfalterer (2007, pp. 263-281, see also section 2.2.1). Vis-À-Vis includes 612 word forms of the primary Südtirolism lemmas listed there, 742 secondary Südtirolism word forms, and 990 Austriacism word forms; the cases which are a Südtirolism in only one of their senses and the MWEs have been removed. A list of Helvetisms (1 816 items) has been provided by the initiative Schweizer Text Korpus (see section 2.1.1). 2 051 South Tyrolean first names have been extracted from ASTAT (2002), 10 950 South Tyrolean last names from ASTAT (2005). Place names in South Tyrol have been combined from two Wikipedia20 lists, adding up to 1 365 items. The name list for non-South-Tyrolean places (638 items) is a compilation of different collections available at EURAC. In addition, 232 German stopwords21 are used, as well as an English word list22 containing 2 458 items and an Italian word list23 with 1 635 items. The reality descriptions for AT include 65 items, for CH 11 items, for DE 19 items, and for ST 160 items.24 The following file names are expected for corpus-external lexical lists of acknowledged peculiarities provided by the user:

19 Additional lists can be provided by the users, if available.

file name               description
U_REG_VAR.txt           list of regionalisms for variety corpus
U_REG-2_VAR.txt         list of secondary regionalisms for variety corpus
U_REAL_VAR.txt          list of reality descriptions for variety corpus
U_PLACE_VAR.txt         list of place names for variety corpus
U_PERS_FIRST_VAR.txt    list of person first names for variety corpus
U_PERS_LAST_VAR.txt     list of person last names for variety corpus
U_REG_REF.txt           list of regionalisms for reference corpus
                        [analogous for all list types for reference corpus]

U_LEX.txt               list of lexicon entries

In addition, compounds containing acknowledged phenomena (e. g. the primary Südtirolism Waal in Waalweg; ‘path along a water rill’) are marked with a specific filter tag in this step.
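The cleaning and tagging steps of this section can be summarised in a short sketch. It is written in Python, not in the system's Perl; the function names are hypothetical, the tag spellings with underscores are an assumption, and plain substring matching is used as a crude stand-in for the compound detection, whose actual method is not described here.

```python
import re


def is_noise(token):
    """True for numbers, tokens shorter than 3 characters, and tokens
    consisting only of special characters."""
    if len(token) < 3:
        return True
    if re.fullmatch(r"[\d.,]+", token):
        return True
    return not any(ch.isalnum() for ch in token)


def filter_tags(token, word_lists, compound_tags=frozenset()):
    """Return the tags of all lists containing the case-normalised token.

    For tags named in compound_tags, a '<tag>_CMPD' tag is added when a
    listed item occurs inside a longer token (substring match, assumption).
    """
    form = token.lower()
    tags = []
    for tag, items in word_lists.items():
        if form in items:
            tags.append(tag)
        elif tag in compound_tags and any(item in form for item in items):
            tags.append(tag + "_CMPD")
    return tags


if __name__ == "__main__":
    lists = {"S_REG_ST": {"waal"}, "S_STOPWORD": {"und"}}
    for tok in ["Waal", "Waalweg", "und", "Hund", "12.5", "---"]:
        if is_noise(tok):
            print(tok, "-> removed")
        else:
            print(tok, "->", filter_tags(tok, lists, {"S_REG_ST"}))
```

Restricting the compound check to selected lists keeps stopwords such as und from tagging unrelated words like Hund.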

20 http://de.wikipedia.org/wiki/Liste_der_Gemeinden_in_Suedtirol, last accessed 2012-10-05, and http://de.wikipedia.org/wiki/Liste_deutscher_Bezeichnungen_italienischer_Orte, last accessed 2012-10-10.
21 http://snowball.tartarus.org/algorithms/german/stop.txt, last accessed 2012-09-10.
22 http://simple.wikipedia.org/wiki/Wikipedia:List_of_1000_basic_words, last accessed 2012-10-04.
23 http://www.istc.cnr.it/grouppage/colfis, last accessed 2012-10-08.
24 They can also be viewed directly in the downloadable system code.

The tags used in the output tables (with the prefix ‘S_’ for system-internal and the prefix ‘U_’ for user lists) are shown in table 3.2.

3.3.1.7 Statistical comparison The difference of phenomenon occurrences in the two corpora is determined with absolute and relative frequencies, the latter with respect to the corpus sizes and expressed in parts per million (ppm), and with statistical AMs (see section 1.2.3.5). The LL formula used for calculating the significance of the occurrence difference of one phenomenon (PHEN) in two corpora is shown in (3.9)25, where an LL value of >= 10.83 is significant at the 0.1 % level and an LL value of >= 15.13 is significant at the 0.01 % level.

(3.9) $LL = 2 \left( a \cdot \ln\dfrac{a}{E_1} + b \cdot \ln\dfrac{b}{E_2} \right)$ with $E_1 = c \cdot \dfrac{a+b}{c+d}$, $E_2 = d \cdot \dfrac{a+b}{c+d}$ and a = PHEN freq in corpus 1, b = PHEN freq in corpus 2, c = corpus size 1, d = corpus size 2
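Formula (3.9) translates directly into code. The following Python sketch (the system itself uses Perl) also computes the relative frequency in ppm and a signed LL value; the function names are hypothetical, treating a zero frequency as contributing 0 to the sum is the usual convention and an assumption here, and the sign convention (positive for overuse in corpus 1) is likewise assumed.

```python
import math


def log_likelihood(a, b, c, d):
    """Return (LL, E1, E2) for frequencies a, b and corpus sizes c, d."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:  # a * ln(a / E1) is taken as 0 for a = 0
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total, e1, e2


def signed_ll(a, b, c, d):
    """LL with a sign: positive for overuse in corpus 1, negative for underuse."""
    ll, e1, _ = log_likelihood(a, b, c, d)
    return ll if a >= e1 else -ll


def ppm(freq, corpus_size):
    """Relative frequency in parts per million."""
    return freq / corpus_size * 1_000_000


if __name__ == "__main__":
    a, b, c, d = 120, 30, 1_500_000, 1_500_000
    ll = signed_ll(a, b, c, d)
    print(f"ppm: {ppm(a, c):.1f} vs. {ppm(b, d):.1f}")
    print(f"LL: {ll:+.2f} (significant at 0.01 % level: {abs(ll) >= 15.13})")
```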

The second measure, LL * log(f_v), multiplies the LL values by the logarithm of the phenomenon frequency in the variety corpus, in order to prevent over-emphasising the significance of low-frequency items (see section 1.2.3.5). In addition, a statistical filter based on expected frequency values marks unreliable AM values according to the Cochran rule (see also section 1.2.3.5) in order to assign lower ranks to the respective phenomena. On the uni-gram26 level, all expected frequencies < 5 are marked; on the bi-gram level, all expected frequencies < 1 (see the tags EXP<5 and EXP<1 in table 3.2).

25 http://ucrel.lancs.ac.uk/llwizard.html, last accessed 2012-08-10.
26 Rayson et al. (2004) conclude that at the 0.01 % level, this threshold could be lowered to 1; however, the choice here is to be stricter.
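The second measure and the expected-frequency filter can be sketched as follows, again in Python rather than Perl. Several details are assumptions because the text does not fix them: the natural logarithm is used, a frequency of 0 yields a score of 0, and a phenomenon is tagged when the smaller of its two expected frequencies falls below the threshold. The tag strings follow table 3.2; the function names are hypothetical.

```python
import math

# Expected-frequency thresholds per analysis level (uni-gram: 5, bi-gram: 1).
THRESHOLDS = {"uni": 5.0, "bi": 1.0}


def weighted_ll(ll, freq_variety):
    """LL * log(f_v): damp the score of low-frequency items."""
    if freq_variety <= 0:
        return 0.0  # log undefined; treated as 0 in this sketch
    return ll * math.log(freq_variety)


def cochran_tag(exp_variety, exp_reference, level):
    """Return 'EXP<5' or 'EXP<1' if an expected frequency is too low."""
    threshold = THRESHOLDS[level]
    if min(exp_variety, exp_reference) < threshold:
        return f"EXP<{threshold:g}"
    return None


if __name__ == "__main__":
    # A hapax keeps its LL value but drops to 0 under the second measure.
    print(weighted_ll(20.0, 1), weighted_ll(20.0, 100))
    print(cochran_tag(3.2, 4.1, "uni"), cochran_tag(3.2, 4.1, "bi"))
```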

Table 3.2: Linguistic filter tags used by Vis-À-Vis

filter tag                      description
S_STOPWORD                      German stopword
S_ENGL / ITAL                   English / Italian word
S_PLACENAME                     general place name

S_REG_AT / CH / ST              regionalism for AT/CH/ST
S_REG-2_ST                      secondary regionalism for ST
S_REAL_AT / CH / DE / ST        reality description for AT/CH/DE/ST
S_PLACENAME_ST                  place name for ST
S_PERSONFIRSTNAME_ST            person first name for ST
S_PERSONLASTNAME_ST             person last name for ST
U_REG_VAR / REF                 regionalisms for variety / reference corpus
                                [analogous for all list types for user input]

EXP<5                           expected frequency < 5 for uni-grams
EXP<1                           expected frequency < 1 for bi-grams

S_REG_ST_CMPD                   compound containing regionalism for ST
                                [analogous for AT/CH]
S_REAL_ST_CMPD                  compound containing reality description for ST
                                [analogous for AT/CH/DE]
S_PLACENAME_ST_CMPD             compound containing place name for ST
U_REG_VAR_CMPD                  compound containing regionalism for variety corpus
                                [analogous for reality descriptions and place names for variety and reference corpus]

1:S_STOPWORD                    first part of bi-gram being stopword
2:U_PLACE_REF                   second part of bi-gram being place name for reference corpus
                                [analogous for all filter types]

3.3.1.8 Output presentation On the one hand, general information on the corpora and their comparability which has been collected and calculated in the course of the system run is directly printed in the terminal or browser window. On the other hand, files in TSV format are produced on each analysis level containing the phenomena, the corresponding PoS tags, and their respective absolute and relative frequencies as well as statistical AMs and filter information (see also section 3.4). These output tables are sorted and ranked according to the LL values for each phenomenon.27

3.3.2 Coverage and limitations of the system

In the following, the requirements which are covered by Vis-À-Vis as well as relevant system limitations are summarised.

Requirements fulfilled It can be concluded that the system requirements stated in section 3.1 are satisfied as follows. Regarding the functional requirements, the corpus preparation, the variety-adaptable annotation, and the indexing are covered in the first two modules. The comparability check, the phenomenon extraction on three exemplary levels of linguistic description, the linguistic filtering and statistical comparison, as well as the ranking according to a plausible relevance measure are implemented. The output is readable and further processable due to its TSV format. The system’s usability is ensured by the access realised on the command line and through a user-friendly GUI with up- and download functions. Regarding performance issues, Vis-À-Vis is fast enough to be applied efficiently for variety linguistics, where the corpora are usually not overly large. It is robust with regard to distorted, incomplete, or unexpected input, which it captures by warnings or error messages. Vis-À-Vis’ results have been evaluated to be correct and completely reproducible.

27 In order to investigate common features of varieties, the phenomena with AM values around 0 can be especially considered.

The system is available as independent software. Thanks to its modularity, it is easily maintainable, extensible, and transferable. It is adaptable on the one hand to other varieties than South Tyrolean German by allowing for user input of e. g. special word lists (and by disregarding system-internal resources specific to the South Tyrolean variety). On the other hand, it is portable to other pluri-centric languages by adapting certain modules such as the one for corpus annotation.

System limitations Vis-À-Vis is designed to compare two corpora at a time; for comparing more corpora, the process has to be run several times. A filter for metadata is not included in the system – if parts of a corpus are supposed to be compared to each other according to their respective metadata, sub-corpora have to be created externally as separate corpora to be analysed. The computational extraction of phenomena with Vis-À-Vis reaches up to the syntactic level; no higher levels of linguistic description or functional equivalences on different levels have been tackled in this feasibility study. For obtaining best results, the corpora have to be as comparable as possible with respect to contents, and ideally balanced and large, the latter above all for applying statistical measures on higher levels of linguistic description. Additionally, errors of automatic tools, e. g. in tagging, which may lead to skewed corpus query results, could not be captured in the scope of this work.

3.3.3 System usage scenarios

The usage interfaces are implemented both on the command line as a Perl script call and via a user-friendly GUI. A comprehensive and precise system documentation including technical requirements as well as instructions for set-up and application containing all details for both usages is provided in appendix A.

Tool usage on the command line The Perl script visavis.perl is called by the following command:

(3.10) perl visavis.perl -l (uni|uni_NN|uni_V|uni_ADJ|uni_ADV|bi|syn) -f (wrd|lem) -e -r -i (t|c)

It supports several parameters as described in the following: The option -l chooses the level of extraction and comparison, uni referring to the lexical level (uni-grams, which can also be restricted to single open word classes), bi to the bi-gram level for co-occurrences, and syn to exemplary syntactic pattern analysis. With the option -f, the user decides if either word forms or lemmas will be considered in the analysis process. The option -e takes corpus-external knowledge for the annotation and the linguistic filtering as input, i. e.

i.) one-word-per-line external knowledge lists with specified file names, e. g. U_REG_VAR.txt for variety text regionalisms or U_PLACE_REF.txt for reference text place names (see section 3.3.1 for the full list), and

ii.) single-word lexicon entries with PoS28 and lemma information in the format: wordform PoS lemma.
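Reading lexicon entries in the format named in item ii.) can be sketched as follows. The example is in Python rather than the system's Perl; it assumes that the three fields are separated by tabs, and the function name `parse_lexicon` as well as the error handling are hypothetical.

```python
def parse_lexicon(lines):
    """Parse 'wordform<TAB>PoS<TAB>lemma' lines into a dictionary.

    Returns a mapping from word form to a (PoS, lemma) tuple. Blank lines
    are skipped; malformed lines raise a ValueError naming the line number.
    """
    entries = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated fields")
        wordform, pos, lemma = fields
        entries[wordform] = (pos, lemma)
    return entries


if __name__ == "__main__":
    sample = ["Waalweg\tNN\tWaalweg\n", "\n", "Waalwege\tNN\tWaalweg\n"]
    for wordform, (pos, lemma) in parse_lexicon(sample).items():
        print(wordform, pos, lemma)
```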

With the option -r, the directory containing the registry files for CQP corpus input is specified. Finally, the option -i is used to indicate if the two input corpora are in text format (t) or if they are available as corpora indexed for CQP (c). The user receives the following output in the terminal window:

i.) general data, e. g. regarding the comparability of the two corpora, and

ii.) the locations of the comparison files to view or further process the output data.

28 STTS tags, see section 2.1.3, have to be used here.

Tool usage via the graphical user interface To make the access to Vis-À-Vis easier and more user-friendly, a GUI has been implemented so that the software is also accessible for users without computational expertise. Users can upload the data they want to analyse via this GUI and are guided through the options for the comparison process, step by step, up to the download of their analysis results. The GUI also provides a direct link to Korpus Südtirol and to the C4 search interface (for both see section 2.1.1) for the verification of phenomena and for context search. In figures 3.2 to 3.6, the successive steps of the Vis-À-Vis GUI use are shown by means of screenshots. After the corpus choice (figure 3.2), lexical lists of external knowledge can be uploaded, if available (figure 3.3). In the second step, the analysis and comparison level is chosen (figure 3.4), and after the Vis-À-Vis run, the result data can be viewed and downloaded for further processing (figures 3.5 and 3.6).

Figure 3.2: Vis-À-Vis GUI – corpus upload or specification

Figure 3.3: Vis-À-Vis GUI – corpus-external knowledge upload

Figure 3.4: Vis-À-Vis GUI – selection of analysis level

Figure 3.5: Vis-À-Vis GUI – results page

Figure 3.6: Vis-À-Vis GUI – view and download window

3.4 System output

In the following, the output of Vis-À-Vis is described in detail both with respect to its contents and to its format.

3.4.1 General corpus comparison output

When running the script on the command line, intermediate results of single steps for the preparation of the corpora (for text input; vs. CQP input) are shown, see figure 3.7. After that information, for both input types, the two comparability measures for each corpus are given and the locations of the final result files are provided (see figure 3.8). For text input, additionally the paths to the ‘unknown’ lists are shown, as well as the instructions how to include lexicon files for correcting annotation in a further system run (see figure 3.9). In the GUI, the ‘progress report’ column on the left and the output page (see figure 3.5) show this information – except for the intermediate results which are left out for reasons of clarity and space.

3.4.2 Output by analysis levels

In order to allow for easy further processing, the output is provided in the tab-separated values (TSV) format, ranked by the LL values.29 For each analysis level, the result table contains the following columns, each line describing the values for one phenomenon:

column title      column contents
RANK              rank count (for uni- and bi-gram level)
PHEN              phenomenon
PoS               phenomenon PoS (for uni-gram level)
±LL               AM LL with the indication of overuse/underuse by ±
±LL*log(f_v)      further AM against the over-estimation of low-frequency items
freq_v            absolute frequency in the variety corpus
size_v            overall variety corpus size
ppm_v             relative frequency in the variety corpus
EXP_v             expected frequency in the variety corpus
freq_r            absolute frequency in the reference corpus
size_r            overall reference corpus size
ppm_r             relative frequency in the reference corpus
EXP_r             expected frequency in the reference corpus
FILTER            indication of irrelevant phenomena

29 Since the output is not yet filtered manually, it contains both linguistically relevant differences and differences that might reflect the selection of topics in the respective corpora or certain situational peculiarities (see section 1.2.3). The interpretation and more detailed analysis based on these results must be conducted manually.

The phenomena containing a filter tag in the last column are listed at the end of the output table without a rank assigned, since they are considered irrelevant (stopwords, foreign words, acknowledged regionalisms, etc.; for a full list of these filter tags, see section 3.3.1). On the following pages, example command line output and result tables for the three levels of linguistic description covered by Vis-À-Vis will be presented.
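The ranking logic described above (ranked rows first, filter-tagged rows at the end without a rank) can be sketched as follows. The example is in Python rather than Perl, uses a reduced set of columns, and assumes sorting by descending signed LL value; the function name and the row layout are hypothetical.

```python
import csv
import io


def write_ranked_tsv(rows, handle):
    """Write rows (dicts with PHEN, LL, FILTER) as a ranked TSV table.

    Rows with an empty FILTER value are ranked by descending LL; rows
    carrying a filter tag follow at the end with an empty RANK field.
    """
    by_ll = lambda row: row["LL"]
    ranked = sorted((r for r in rows if not r["FILTER"]), key=by_ll, reverse=True)
    tagged = sorted((r for r in rows if r["FILTER"]), key=by_ll, reverse=True)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["RANK", "PHEN", "LL", "FILTER"])
    for rank, row in enumerate(ranked, start=1):
        writer.writerow([rank, row["PHEN"], f"{row['LL']:.2f}", ""])
    for row in tagged:
        writer.writerow(["", row["PHEN"], f"{row['LL']:.2f}", row["FILTER"]])


if __name__ == "__main__":
    data = [
        {"PHEN": "und", "LL": 50.0, "FILTER": "S_STOPWORD"},
        {"PHEN": "obmann", "LL": 30.0, "FILTER": ""},
        {"PHEN": "gemeinde", "LL": 40.0, "FILTER": ""},
    ]
    buffer = io.StringIO()
    write_ranked_tsv(data, buffer)
    print(buffer.getvalue(), end="")
```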

Uni-gram level The general output for the first level on the command line is shown in figure 3.8 (for text input, the output in figure 3.9 is provided in addition). An extract of a TSV-formatted output file for the uni-gram level considering word forms is given in figure 3.10 (for the Dolomiten newspaper corpus (DOLO) vs. the Frankfurter Rundschau newspaper corpus (FR), see section 4.1.2; leaving out the corpus size columns for reasons of space).30

30 For the higher ranks, it can be seen that with negative LL values, peculiarities of the reference corpus with respect to the variety corpus can be investigated as well.

Bi-gram level The command line progress report for the co-occurrence extraction is shown in figure 3.11. An extract of a TSV-formatted output file for the bi-gram analysis level taking word forms into account can be seen in figure 3.12 for the pattern ADJ+N (for web corpora of ST and DE, see section 4.2; leaving out the corpus size columns for reasons of space); the output for ADV+ADJ looks analogous.

Syntactic level The output files for the anacoluthon extraction in DOLO vs. FR (see section 4.1.2) are shown in figures 3.13 to 3.15. The first figure contains the general corpus comparison information, the second displays the overview of the extracted complex phenomena. In figure 3.15, concrete examples for the phenomena31 can be seen in keyword in context (KWIC) format.

31 As has been noted in section 3.3.1, the third example in figure 3.15 is a case of a false positive since this pattern cannot be excluded. The two occurrences in the reference corpus are also of that type.

... creating output directory ... OUT_g8bF ok

... reading TESTDATA/TEST_CORPORA/testkorpus_ST.txt ... ok
... reading TESTDATA/TEST_CORPORA/testkorpus_DE.txt ... ok

... checking encoding of TESTDATA/TEST_CORPORA/testkorpus_ST.txt ... ok
... checking encoding of TESTDATA/TEST_CORPORA/testkorpus_DE.txt ... ok

... annotating TESTDATA/TEST_CORPORA/testkorpus_ST.txt ...
... using lexicon file in TMP ... ok
... annotating TESTDATA/TEST_CORPORA/testkorpus_DE.txt ...
... using lexicon file in TMP ... ok

... extracting unknown lemmas for TESTDATA/TEST_CORPORA/testkorpus_ST.txt ... ok
... extracting unknown lemmas for TESTDATA/TEST_CORPORA/testkorpus_DE.txt ... ok

... cwb-encoding OUT_g8bF/c1_fin.tgd ... [...] ok
... cwb-encoding OUT_g8bF/c2_fin.tgd ... [...] ok

Figure 3.7: Vis-À-Vis progress report on the command line

... counting frequencies ... ok
... calculating log-likelihood ... ok

*** RESULTS ***
type-token ratio:
variety corpus: 78.1%
reference corpus: 78.6%
lexical density:
variety corpus: 54.4%
reference corpus: 55.6%

word-level phenomena result tables sorted by relative frequency difference
[An LL value of >= 10.83 is significant at the 0.1% level and
an LL value of >= 15.13 is significant at the 0.01% level.]:
less OUT_hQS8/word_cmp.txt
less OUT_hQS8/word_cmp_clean_srt_fin_tsv.txt
less OUT_hQS8/word_cmp_clean_srt_fin_rank_tsv.txt

OK

Figure 3.8: Vis-À-Vis command line output for the uni-gram level

Non-lemmatised words can be found in OUT_p0Wf/*_unkn.txt.

You can provide corrected lexicon entries to be used in the annotation process (file called U_LEX.txt in the directory indicated with the -e option in the format: word\tPoS\tlemma\n).

Figure 3.9: Vis-À-Vis command line output for non-lemmatised words

Figure 3.10: Vis-À-Vis result table for the uni-gram level [table; columns: RANK, PHEN, PoS, ±LL, ±LL*log(f_v), freq_v (= ppm_v), EXP_v, freq_r (= ppm_r), EXP_r, FILTER]
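The lexicon format named in figure 3.9 (word\tPoS\tlemma\n) can be read with a few lines of Python. The following is only a minimal sketch of such a reader, not the code used in Vis-À-Vis; the function name `read_user_lexicon` and the two sample entries are assumptions made for illustration.

```python
def read_user_lexicon(lines):
    """Parse lines of the form word<TAB>PoS<TAB>lemma into a dictionary.

    Keys are (word, PoS) pairs, values are lemmas. Empty lines are skipped.
    """
    lexicon = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected word<TAB>PoS<TAB>lemma")
        word, pos, lemma = fields
        lexicon[(word, pos)] = lemma
    return lexicon


# Hypothetical entries; a real file would be opened with open(path, encoding="utf-8").
sample = ["heuer\tADV\theuer\n", "Aussendungen\tNN\tAussendung\n"]
user_lexicon = read_user_lexicon(sample)
print(user_lexicon[("Aussendungen", "NN")])
```

Keying the dictionary on the word form together with its PoS tag keeps homographs with different tags apart; whether the system itself does so is not stated in the text.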

... counting A+N collocations ...
... calculating log-likelihood ... ok

... counting Adv+A collocations ...
... calculating log-likelihood ... ok

*** RESULTS ***
type-token ratio: variety corpus: 81.2% reference corpus: 79.4%
lexical density: variety corpus: 56.5% reference corpus: 53.0%

A+N and Adv+A result tables sorted by relative frequency difference
[An LL value of >= 10.83 is significant at the 0.1% level and an LL value of >= 15.13 is significant at the 0.01% level.]:
less OUT_hQS8/A+N-coll_cmp.txt
less OUT_hQS8/A+N-coll_cmp_clean_srt_fin_tsv.txt
less OUT_hQS8/A+N-coll_cmp_clean_srt_fin_rank_tsv.txt
less OUT_hQS8/Adv+A-coll_cmp.txt
less OUT_hQS8/Adv+A-coll_cmp_clean_srt_fin_tsv.txt
less OUT_hQS8/Adv+A-coll_cmp_clean_srt_fin_rank_tsv.txt

OK

Figure 3.11: Vis-À-Vis command line output for the bi-gram level

Figure 3.12: Vis-À-Vis result table for the bi-gram level (ADJ + N) [table; columns: RANK, PHEN (A+N), ±LL, ±LL*log(f_v), freq_v (= ppm_v), EXP_v, freq_r (= ppm_r), EXP_r, FILTER]

... executing KOUS+V2 query ...
... calculating log-likelihood ... ok

*** RESULTS ***
type-token ratio: variety corpus: 78.1% reference corpus: 78.6%
lexical density: variety corpus: 54.4% reference corpus: 55.6%
variety vs. reference corpus:
coordinating conjunctions: 937010 vs. 1067651
subordinating conjunctions (clause): 274135 vs. 311469
subordinating conjunctions (infinitive): 30743 vs. 34636
weil/obwohl+V2 result table:
[An LL value of >= 10.83 is significant at the 0.1% level and an LL value of >= 15.13 is significant at the 0.01% level.]:
less OUT_hQS8/kous_cmp_srt_fin_tsv.txt

Keyword-In-Context view for weil/obwohl+V2:
less OUT_hQS8/var_c-kous-kwic_u8.txt
less OUT_hQS8/ref_c-kous-kwic_u8.txt

OK

Figure 3.13: Vis-À-Vis command line output for the syntactic level

PHEN      ±LL    ±LL*log(f_v)   freq_v /size_v (= ppm_v)   EXP_v   freq_r /size_r (= ppm_r)   EXP_r
weil+V2   2.58   4.62           6 /35866845 (=0)           3.8     2 /40201004 (=0)           4.2

Figure 3.14: Vis-À-Vis result table for the syntactic level

Figure 3.15: Vis-À-Vis syntactic phenomenon extraction – KWIC results [KWIC listing of PoS-tagged corpus lines]

4 Quantitative and qualitative system evaluation

As the last phase of the general system development life cycle, the prototype’s quantitative and qualitative evaluation will now be presented.

4.1 Quantitative system performance

The following three sections demonstrate – for the quantitative evaluation – the procedures applied, the concrete data used, and the performance output in terms of precision and recall as well as AM rank comparisons.

4.1.1 Evaluation procedures

A quantitative evaluation of the prototype’s performance was possible only on the uni-gram level, since a gold standard is available only there. The measurable objectives against which the outcome can be assessed are the approved lists of primary and secondary Südtirolisms collected in Abfalterer (2007, see section 2.2.1 and appendix B). Following the prevalent evaluation methodology presented in section 1.2.3.6, precision and recall as well as the F-score (as their harmonic mean) have been calculated. The ‘baseline’ to which the performance of Vis-À-Vis has been compared consists of all comparison results sorted by ppm in the variety corpus without using any corpus-external knowledge such as name lists for filtering, since this is how basic approaches would handle the task. The evaluation has been conducted for different rank cut-off values, with all other parameters remaining the same.

The steps of the evaluation procedure comprised the following:

• count (gold standard) Südtirolism types for uni-grams in variety corpus,
• extract top n comparison result phenomena according to LL, and
• determine overlap of gold standard list and extracted results to calculate precision, recall, and F-score values.
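The three steps above can be sketched as follows. This is a minimal illustration under the assumption that the ranked result list contains each type only once; the function names and the toy data are invented here and are not taken from the system.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_at_cutoff(ranked_candidates, gold_standard, cutoff):
    """Return (precision, recall, F-score) for the top `cutoff` candidates."""
    top = set(ranked_candidates[:cutoff])
    hits = len(top & gold_standard)  # overlap of gold standard and results
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold_standard) if gold_standard else 0.0
    return precision, recall, f_score(precision, recall)


# Invented toy data: a ranked candidate list and a four-item gold standard.
ranked = ["heuer", "man", "weiters", "stadt", "obmann", "frage"]
gold = {"heuer", "weiters", "obmann", "jänner"}
print(evaluate_at_cutoff(ranked, gold, cutoff=4))
```

Applied to the noun figures at cut-off 1 000 in tables 4.1 and 4.2, `f_score(10.88, 7.49)` rounds to 8.87, the value given in table 4.3.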

As a further exploratory test, the rankings of the three quantitative comparison values LL, LL*log(f_v), and variety ppm have been contrasted and compared.
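The three comparison values can be illustrated with a short sketch. It assumes the common two-corpus log-likelihood formulation based on observed and expected frequencies and a natural logarithm for the weighting; both are assumptions about the implementation, chosen because they yield values close to the weil+V2 row in figure 3.14. The function names are hypothetical, and the sign (±) that marks over- or under-use is left out.

```python
import math


def expected(freq_v, freq_r, size_v, size_r):
    """Expected frequencies in variety and reference corpus."""
    total = freq_v + freq_r
    e_v = size_v * total / (size_v + size_r)
    e_r = size_r * total / (size_v + size_r)
    return e_v, e_r


def log_likelihood(freq_v, freq_r, size_v, size_r):
    """Unsigned two-corpus log-likelihood (assumed formulation)."""
    e_v, e_r = expected(freq_v, freq_r, size_v, size_r)
    ll = 0.0
    for observed, exp in ((freq_v, e_v), (freq_r, e_r)):
        if observed > 0:  # a zero count contributes nothing
            ll += observed * math.log(observed / exp)
    return 2 * ll


def ll_log_weighted(freq_v, freq_r, size_v, size_r):
    """LL multiplied by the log of the variety frequency f_v."""
    if freq_v == 0:
        return 0.0
    return log_likelihood(freq_v, freq_r, size_v, size_r) * math.log(freq_v)


def ppm(freq, size):
    """Relative frequency in parts per million."""
    return freq / size * 1_000_000


# Counts and corpus sizes from the weil+V2 row in figure 3.14.
row = (6, 2, 35866845, 40201004)
print(expected(*row), log_likelihood(*row), ll_log_weighted(*row))
```

For this row the sketch gives expected frequencies that round to 3.8 and 4.2, and values close to the 2.58 and 4.62 shown in the figure; the LL value lies well below the significance threshold of 10.83.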

4.1.2 Evaluation data and gold standard

The following corpora have been used for the quantitative evaluation:

i.) the Dolomiten newspaper corpus (DOLO), which – as the most widespread South Tyrolean newspaper – represents an important part of the whole of written language in South Tyrol (newspaper editions between 1991 and 2006), and

ii.) the Frankfurter Rundschau newspaper corpus (FR), a daily newspaper from Germany (editions between 1992 and 1993).¹

They contain ca. 40 million tokens each and have been PoS-tagged and lemmatised with the TreeTagger as well as indexed with CWB (see section 2.1.3). The gold standard list consisting of ‘verified’ phenomena comprises

i.) primary Südtirolisms (Abfalterer, 2007, pp. 263-266) and

ii.) secondary Südtirolisms (Abfalterer, 2007, pp. 266-268)

¹ Both corpora are obviously neither representative nor large enough to draw conclusions for general language usage. Especially for newspaper texts, usually general newswire articles from other countries are included as well (see e. g. Heid, 2011). Furthermore, since “regional varieties and national varieties seldom coincide” (Grzega, 2000), the comparison might be influenced by this fact. Nevertheless, the corpora can be used to show tendencies and specific example results.

occurring in DOLO, which amount to 921 word forms² out of the complete list of 1 354 word forms. Appendix B shows all integrated 286 primary Südtirolism lemmas and an excerpt of the 269 secondary Südtirolisms contained in the gold standard list.

4.1.3 Quantitative evaluation results

The comparison using the full PoS set in the two corpora reaches the highest precision value of 7.26 % at the rank cut-off 1 000. Taking only nouns into account, this value increases to 10.88 % (see table 4.1). The highest recall score amounts to 98.15 % for the full PoS set, as can be seen in table 4.2. The top F-score of 8.87 %³ is reached for nouns at the cut-off 1 000 (see table 4.3). This lies clearly above the noun baseline, which reaches its highest value of 2.83 % only at cut-off 10 000. The tables show the three scores for five different rank cut-off values; figures 4.1 and 4.2 illustrate this data in graphs.

In figures 4.3 and 4.4, Vis-À-Vis results for the full PoS set are compared to the output of the Sketch Engine (Kilgarriff et al., 2008, see also section 2.1.3) extracting keywords of two corpora (Kilgarriff, 2009). With three parameters for adding values to non-occurring words in reference corpora (N=1, N=10, and N=100, see section 2.1.2), the results produced by the Sketch Engine score lower than the ones by Vis-À-Vis, since the latter is specifically tailored to regional varieties by its enrichment with linguistic knowledge such as name lists. For N=1, the Sketch Engine reaches a top precision of 5.05 %, a top recall of 67.21 %, and a top F-score of 5.23 %. For N=10, the precision amounts at most to 3.54 %, the recall to 66.12 %, and the F-score to 3.65 %. N=100 yields a top precision of 1.95 %, a top recall of 66.12 % as well, and a top F-score of 2.98 % at cut-off 10 000.
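The effect of the parameter N can be made concrete with a small sketch. It assumes the ‘simple maths’ keyword score described in Kilgarriff (2009), i. e. the ratio of the two per-million frequencies after adding N to each; that this is exactly the score used for the Sketch Engine runs reported here is an assumption. The ppm values are taken from table 4.5, the function name is hypothetical.

```python
def simple_maths_score(ppm_focus, ppm_reference, n):
    """Keyword score (ppm_focus + N) / (ppm_reference + N); N must be > 0."""
    if n <= 0:
        raise ValueError("N must be positive")
    return (ppm_focus + n) / (ppm_reference + n)


# ppm values in DOLO (focus) and FR (reference) from table 4.5.
items = {"weiters": (49.1, 0.0), "Abgeordnetenkammer": (12.2, 1.4)}

for n in (1, 10, 100):
    scores = {
        word: round(simple_maths_score(focus, ref, n), 2)
        for word, (focus, ref) in items.items()
    }
    print(n, scores)
```

Adding N keeps the score finite for words that do not occur in the reference corpus, and larger values of N damp the advantage of rarer words.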

² Since the corpora have only been available in an already indexed format and the vertical files could no longer be accessed in order to influence their lemmatisation by Vis-À-Vis, word form extraction has been evaluated.
³ The relatively low overall values are caused by the fact that the phenomena to be extracted are very rare. Furthermore, wrong cases of tokenisation of the already available evaluation corpora lowered these scores.

The rank comparison by the three quantitative comparison values (two AMs and variety ppm) resulted in the following conclusions: as could be expected, pure ppm ranking puts function words at the top of the result list, whereas the AMs identify more relevant content words. The difference between LL and LL*log(f_v) is not systematic for the top ranks; e. g. the Südtirolism Bauleitplan (‘area development plan’) is still included in the top 100 by LL but not in the top 100 by LL*log(f_v). For the first 20 candidates, the two measures do not differ by more than 3 ranks. The highest rank difference is 22 for the LL rank 88, which suggests that the AMs yield very similar results for the top-ranked phenomena.

An excerpt of the rank comparison table for relevant results can be seen in table 4.4, where values for certain extracted items (column ‘PHEN’) with their PoS are shown: LL ranks are listed in column 3, ranks according to LL*log(f_v) in column 5, and variety ppm ranks in the rightmost column (ranks >100 are marked by --). The rank differences are provided in the columns placed between the ranks, i. e. for LL vs. LL*log(f_v) in column 4 and for LL*log(f_v) vs. ppm in column 6. The results are sorted according to the ranking difference between the two AMs in column 4. As can be seen there, pure LL ranks the listed items at least as high as LL*log(f_v) in all cases. The reason is that the latter has been chosen for treating low-frequency phenomena in the variety corpus adequately, which are expected not to appear in the top ranks. This measure might prove useful for studies with a different research purpose.

Table 4.1: DOLO-FR precision % – baseline vs. Vis-À-Vis

rank cut-off   precision    precision     NN precision   NN precision
               (baseline)   (Vis-À-Vis)   (baseline)     (Vis-À-Vis)
1000           0.96         7.26          2.40           10.88
5000           0.77         3.64          1.63           5.71
10000          0.81         2.68          1.72           4.15
50000          0.65         1.57          1.23           2.34
100000         0.44         0.96          0.93           1.54

Table 4.2: DOLO-FR recall % – baseline vs. Vis-À-Vis

rank cut-off   recall       recall        NN recall      NN recall
               (baseline)   (Vis-À-Vis)   (baseline)     (Vis-À-Vis)
1000           0.98         7.71          0.76           7.49
5000           3.91         19.22         3.58           18.78
10000          8.25         28.01         7.93           27.25
50000          32.57        81.32         31.60          78.50
100000         44.41        98.15         43.21          95.22

Figure 4.1: DOLO-FR precision-recall – baseline vs. Vis-À-Vis [plot: precision % over recall %, points labelled with the rank cut-offs 1 000 to 100 000; series: Vis-À-Vis (NN), Vis-À-Vis (full PoS set), baseline (NN), baseline (full PoS set)]

Table 4.3: DOLO-FR F-score % – baseline vs. Vis-À-Vis

rank cut-off   F-score      F-score       NN F-score     NN F-score
               (baseline)   (Vis-À-Vis)   (baseline)     (Vis-À-Vis)
1000           0.97         7.48          1.15           8.87
5000           1.29         6.12          2.24           8.76
10000          1.48         4.89          2.83           7.20
50000          1.27         3.08          2.37           4.54
100000         0.87         1.90          1.82           3.03

Figure 4.2: DOLO-FR F-score – baseline vs. Vis-À-Vis [plot: F-score % over rank cut-off (1 000 to 100 000); series: Vis-À-Vis (NN), Vis-À-Vis (full PoS set), baseline (NN), baseline (full PoS set)]

Figure 4.3: DOLO-FR precision-recall – Sketch Engine vs. Vis-À-Vis [plot: precision % over recall %, points labelled with the rank cut-offs 1 000 to 100 000; series: Vis-À-Vis, SkE N=1, SkE N=10, SkE N=100, baseline (all full PoS set)]

Figure 4.4: DOLO-FR F-score – Sketch Engine vs. Vis-À-Vis [plot: F-score % over rank cut-off (1 000 to 100 000); series: Vis-À-Vis, SkE N=1, SkE N=10, SkE N=100, baseline (all full PoS set)]

Table 4.4: Comparison of result ranks yielded by AMs and ppm

PHEN                  PoS    LL rank   diff   LL*log(f_v) rank   diff   ppm rank
bauleitplan           NN     100       n/a    --                 n/a    --
heuer                 ADV    4         0      4                  -65    69
landeshauptmann       NE     12        0      12                 n/a    --
aussendung            NN     16        -1     17                 n/a    --
jänner                NN     6         -1     7                  n/a    --
landesrat             NN     8         -1     9                  n/a    --
obmann                NN     30        -1     31                 n/a    --
weiters               NN     24        -1     25                 n/a    --
carabinieri           NN     20        -3     23                 n/a    --
bezirksgemeinschaft   NN     42        -5     47                 n/a    --
heurigen              ADJA   44        -6     50                 n/a    --
landeshauptmann       NN     39        -6     45                 n/a    --
presseaussendung      NN     63        -10    73                 n/a    --
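The rank differences in table 4.4 follow the convention ‘rank by the first measure minus rank by the second measure’ (e. g. 16 − 17 = −1 for aussendung), with n/a where an item falls outside the top 100 of the second ranking. A minimal sketch of this bookkeeping is given below; the function names and the scores are invented for illustration and are not real corpus values.

```python
def rank_by(scores):
    """Map each item to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: rank for rank, item in enumerate(ordered, start=1)}


def rank_differences(scores_a, scores_b, top_n=100):
    """Rank difference (rank_a - rank_b) for the top_n items of ranking a.

    Items outside the top_n of ranking b get None, like 'n/a' in table 4.4.
    """
    ranks_a, ranks_b = rank_by(scores_a), rank_by(scores_b)
    result = {}
    for item, rank_a in ranks_a.items():
        if rank_a > top_n:
            continue
        rank_b = ranks_b.get(item)
        in_top_b = rank_b is not None and rank_b <= top_n
        result[item] = rank_a - rank_b if in_top_b else None
    return result


# Invented scores for three placeholder items under two measures.
ll = {"item_a": 50.0, "item_b": 40.0, "item_c": 30.0}
weighted = {"item_a": 10.0, "item_b": 30.0, "item_c": 20.0}
print(rank_differences(ll, weighted))
```

A negative difference means that the first measure ranks the item higher than the second one does.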

4.2 Qualitative case studies

In the following, qualitative investigations carried out with three kinds of corpora as case studies on different levels of linguistic description will be presented. In addition to newspaper corpora, web corpora and learner corpora have been studied.

4.2.1 Newspaper corpora

In the first case study, the newspaper corpora DOLO and FR (see section 4.1.2) have been used for exemplary comparative variety studies on all implemented levels of linguistic description.

Uni-gram level
On the one hand, dictionary entries for South Tyrolean German could be verified and refined by applying Vis-À-Vis to the available newspaper corpora. For the Südtirolism Abgeordnetenkammer (‘chamber of deputies’; ppms see table 4.5), for example, the usage in South Tyrol can be confirmed by various further result sentences from DOLO such as example (4.1).

(4.1) Die Mitglieder der Abgeordnetenkammer und des Senats bleiben [...] bis Ende Monat im Amt [...] (DOLO-2001-05-14);
      The members of the chamber of deputies and the senate remain in office until the end of the month [...]

Moreover, dictionary entries could be enriched with further information, e. g. special senses of words for the South Tyrolean variety. This is the case for Konsortium (‘consortium’; ppms see table 4.5; example see (4.2)), where a less specific definition than in the VWB is suggested, closely following the definition of the corresponding Italian word without the temporal restriction on being permanent (for the detailed entries see Abel & Anstein, 2011).

(4.2) [...] sowie das Konsortium ” CTM Altromercato ” werden sich und ihre Produkte vorstellen . (DOLO-2005-07-30);
      [...] as well as the consortium “CTM Altromercato” will introduce themselves and their products.

Furthermore, the adverb weiters (‘furthermore’) and the preposition ober (‘above’), both being Austriacisms, have been found to be used in South Tyrol as well (see table 4.5) and should get a regional label for South Tyrol in a new dictionary edition. Example contexts are shown in (4.3) and (4.4).

(4.3) [...] beschloß , daß der Latscher Sportplatz [...] saniert werden soll . (DOLO-1991-01-12);
      [...] furthermore decided that the sports ground of Latsch should be renovated.

(4.4) [...] am Brenner der Finanzkaserne . (DOLO-1996-03-01);
      [...] at the Brenner above the finance barracks.

On the other hand, new dictionary entries for South Tyrolean German can be added, e. g. for the adverb ehestens (‘as soon as possible’), which is not yet included in the VWB, see example (4.5).

(4.5) [...] will die Verwaltung die Erweiterungszone ” Gebach ” verwirklichen . (DOLO-1991-05-06);
      [...] the administration wants to realise the expansion area “Gebach” as soon as possible.

The occurrence data of these refined and new dictionary entries for South Tyrolean German are shown in table 4.5 (see also Abel & Anstein, 2011).

Table 4.5: DOLO-FR frequencies for lexicographically relevant findings

phenomenon           ppm DOLO   ppm FR   translation
Abgeordnetenkammer   12.2       1.4      chamber of deputies
Konsortium           18.6       3.6      consortium
weiters              49.1       0        furthermore
ober                 3.2        0        above
ehestens             2.3        0        as soon as possible

Furthermore, a classification of the top 100 result candidates according to each of the three rankings (AMs and ppm) has been done manually. The results with the counts for each category of the overall 181 output phenomena – corresponding to the categorisation of extraction results listed in section 1.2.3.4 – are shown in table 4.6 and in figure 4.5. The category ‘errors’ refers e. g. to orthographic mistakes or tokenisation errors. The second major category, ‘facts’, denotes trivial peculiarities caused by reality descriptions. The third category, which is the largest, comprises possibly relevant phenomena, which can be further sub-classified into acknowledged regionalisms, temporal expressions, function words, and content words (which are the source of possible new regionalism candidates). On such a basis, a typology of errors can be identified in order to handle them systematically, e. g. by excluding function words (if they are not of interest to a certain study) or by including a recogniser for temporal expressions.

Table 4.6: DOLO-FR uni-gram result classification

class                     frequency   sub-class              frequency
errors                    10
facts                     15
possibly relevant items   156         regionalisms           13
                                      temporal expressions   15
                                      function words         30
                                      content words          98

Figure 4.5: DOLO-FR uni-gram result classification – histogram [bar chart: frequency per class (errors, facts, possibly relevant items); legend: regionalisms, temporal expressions, function words, content words]

Bi-gram level
In the study on co-occurrences, remarkable ADJ+N combinations in the DOLO newspaper corpus have been found, such as allgemeine Klasse (‘general class’; 8.4 vs. 0 ppm in DOLO vs. FR; see also Heid (2011)) for sports or weißer Stimmzettel (‘white ballot’; 1.4 vs. 0 ppm, respectively), which do not occur in the FR corpus and are thus candidates for regional collocation entries. In additional studies beyond the use of Vis-À-Vis, an influence from Italian could be noticed for combinations with prepositions, e. g. the use of the local preposition innerhalb (‘within’) in a temporal context (innerhalb Januar vs. DE bis Januar; ‘until January’), a usage where the degree of ‘correctness’ can be discussed. In a further case study on collocators with Südtirolisms as bases, several typical collocators for the noun Mobilität (which has the particular sense ‘dismissal’ / ‘unemployment’ only within the South Tyrolean variety) could be identified. These are e. g. jemanden in die Mobilität überstellen (‘transfer someone into unemployment’) or sich in Mobilität befinden (‘be unemployed’; see also Abel & Anstein, 2008). For SUBJ-/OBJ-PRED combinations, single examples have been interpreted, which suggested that in general, little variation seems to occur. This has to be further investigated with more fine-grained tools (e. g. along the lines of Heid, 2011).

(Morpho-)Syntactic level
All phenomenon occurrences for subordinate clauses with verb-second word order in DOLO are found to be cited direct speech (see figure 3.15 on page 145). In a further additional study on the dative-accusative confusion in written South Tyrolean German (see section 2.2.2), no salient results could be found, which suggests that this phenomenon occurs mainly in spoken language.

4.2.2 Web corpora

Two corpora compiled from Internet texts have further been contrasted on the bi-gram level. The first corpus contains South Tyrolean written language and has been created at EURAC (ca. 67 million tokens) and the other comprises texts from Germany⁴. Many phenomena of the pattern ADJ+N⁵ point to different contents of the web pages in the two regions – e. g. formulations from tourism (wahres Paradies; ‘true paradise’) or from obituary notices are far more frequent in the South Tyrolean web corpus (see also figure 3.12 on page 142). Furthermore, many ‘realisms’ are extracted such as geographical names or e. g. institution names like Deutsches Schulamt (‘German education authority’). Similarly, the ADV+ADJ results point to content-related differences, e. g. gut präpariert (‘well-groomed [slopes]’) or besonders sehenswert (‘especially worth seeing’).

4.2.3 Learner corpora

Two kinds of written corpora of learner language have been investigated on the syntactic level as far as implemented in Vis-À-Vis. The German L2 corpus from the project Kolipsi (Abel et al., 2010, see section 2.2.1) contains ca. 90 000 tokens, the Kolipsi L1 corpus ca. 8 600 tokens.

⁴ 100 million tokens have been taken from the deWaC (1.7 billion tokens; http://wacky.sslmit.unibo.it/doku.php?id=corpora, last accessed 2012-10-11). The SdeWaC, which consists of 880 million tokens of parseable data from the whole of deWaC, was not used, for reasons of comparability with the South Tyrolean web corpus, which has not been ‘cleaned’ in the same way as SdeWaC.
⁵ The pattern ADV+ADJ does not seem to yield particularly interesting results for regional variety linguistics.

In addition, L1 learner texts from different German varieties as collected in the project KoKo (Anstein & Glaznieks, 2011, see also section 2.2.1), comprising ca. 300 000 tokens each, have been analysed. The comparability measures were found to be lower than for the newspaper corpora and the web corpora (especially the lexical density around 48 %), but still yielded similar⁶ values for the two learner corpora compared in each run, which suggests an adequate corpus similarity.

The results of the analysis on the syntactic level, where certain subordinate clauses with verb-second word order have been extracted, can be seen in figures 4.6 and 4.7⁷. For Kolipsi, 11 such occurrences for the subordinating conjunction weil⁸ (‘because’) have been found in the L2 ‘variety’ corpus (figure 4.6) vs. 0 in the mother-tongue ‘reference’ corpus. Interestingly, for the regional variety learner corpora, no such phenomena of marked word order in subordinate clauses have been found in the South Tyrolean and in the North Tyrolean sub-corpora, whereas 3 occurred in the corpus part from Thuringia (see figure 4.7). Even though these corpora and the extracted phenomenon frequencies are rather small, useful results for exemplary syntactic studies could be obtained.

⁶ The differing sizes of the Kolipsi corpora lead only to a slightly higher difference of their TTR values.
⁷ The results are shown in keyword in context (KWIC) format; the tag set used is STTS, see also section 3.4.
⁸ The other conjunction, obwohl (‘although’), is very rarely found to be used with verb-second word order; this seems to be (still) a purely spoken phenomenon.

Figure 4.7: Vis-À-Vis syntactic phenomena in the KoKo-TH learner corpus [KWIC listing of PoS-tagged corpus lines]

4.3 Discussion of evaluation results

According to the objectives and the research questions stated in section 1.1, both the quantitative and the qualitative evaluation show promising application results of Vis-À-Vis. Large amounts of authentic data can be reduced to the most relevant phenomena to be investigated manually with less time and effort, especially up to the syntactic level. The focus has not been put on complete representativeness and generalisability of the results, since this also highly depends on the input data. The still undefined notion of and measures for corpus comparability are a challenging prerequisite for corpus comparison. The main objective of this study consisted in evaluating approaches, methods, and instruments for data extraction and processing for the pre-selection of relevant data as a starting point for more detailed manual studies.

Quantitative results
Concerning the system’s performance, the relatively low scores (precision < 11 %, F-score < 9 %) are acceptable – especially for a prototype – since variants are very rare events and a great majority of phenomena are shared among varieties.⁹ This fact posed a special challenge and the figures for Vis-À-Vis are therefore not comparable with usual performance scores in NLP. Additionally, the quality of the evaluation corpora’s annotation and their comparability were not perfect (and could not be influenced). Nevertheless, a clear difference to the baseline can be noticed for all results. Especially the comparison with the commercial system Sketch Engine points to the benefits of Vis-À-Vis, which is specifically tailored for the identification of regional differences by means of relevant word filter lists.

Concerning the research question on specific features of the corpus input, the applied comparability measures seem already to be a helpful indicator for the quality and reliability of the results, as could be seen interpreting newspaper / web corpus output vs. learner corpus output. Nevertheless, comprehensive tests on different measures for corpus similarity, taking corpus homogeneity (see section 1.2.3.4) into account as well, will be useful to meet this prerequisite for corpus comparison.

The evaluation has shown that also relatively small corpora can yield useful results; however, the larger the corpora are, the more reliable the results of the statistical AMs can be, especially for phenomena on higher levels of linguistic description (see also e. g. Bürki, 2009). A minimum corpus size for useful results cannot be generally specified (see also section 1.2.3.3). The system’s highest F-score was achieved at a rank cut-off of 1 000, which is an acceptable length for a list to be manually screened.

As to the relevance ranking addressed in the research questions, the comparison of the two AMs yielded similar results for the top output candidates, both significantly better than figures obtained from pure relative frequency. Further AMs should be evaluated and compared in order to find the (combination of) measures which yield the best ranking in terms of candidate relevance for this specific task of regional variety comparison, since according to e. g. Evert (2005a, see also section 1.2.3.5), there is no single best measure for all tasks.

⁹ Furthermore, the gold standard list consisted of all Südtirolisms which occurred at least once in the variety corpus; no higher cut-off has been chosen.

Qualitative results
The application of Vis-À-Vis has proven to be useful in research on variants (see also the investigations e. g. in Abel & Anstein, 2011) and in learner corpus studies (see also Anstein & Glaznieks, 2011). On the one hand, it yielded concrete results for dictionary enhancements on the uni-gram and the bi-gram level; on the other hand, it pointed e. g. to learners’ grammar proficiency with respect to constructing subordinate clauses. The classification of the top 100 uni-gram results according to the three output rankings shows that a high proportion of the top ranks contain possibly relevant phenomena in contrast to errors or trivial differences; this clearly suggests that the approach is worth being followed.

5 Outlook and conclusion

In the following section, possible further lines of research derived from the described feasibility study are presented, both from the technical and from the contents-related perspective. In section 5.2, the research with its scientific findings and contributions will be summarised.

5.1 Potential further research

In order to systematically improve comparative regionalism extraction from corpora, various development directions can be explored. In the following, potential and promising future work in the field of comparative corpus linguistics for regional varieties will be elaborated on, first relating to general matters and in section 5.1.2 according to the specific levels of linguistic description.

5.1.1 General resource and system enhancements

For a comprehensive support of manual variety linguistics, both the sources to obtain and filter variant phenomena, i. e. in this case the corpora and potential lists of available corpus-external knowledge, and the approaches – from the general system design up to the output of results – can be improved.

Resources
The corpora taken as a basis for geographical variety¹ studies are of more value the larger they are (see section 1.2.3.3) – and for reliable results, they should be as comparable as possible (see sections 1.2.3.4 and 2.1.2, where approaches are presented to make corpora more comparable e. g. by changing their composition according to comparability and heterogeneity measures). For German, the development of large and comparable corpora is one aim of the C4 initiative (see section 2.1.1), and also most of the other pluri-centric languages are developing variety corpora for language documentation such as the well-known ICE (see section 2.1.1).² In addition, it is of much value to invest in the compilation of lists of corpus-external knowledge – by pooling and systematising all existing exemplary findings and phenomenon collections³ – containing all peculiarities which are already confirmed (or which are of little interest to a certain study). This is useful to exclude such material from the resulting candidate lists in order to reduce the data which has to be manually screened.

¹ As e. g. Grzega (2000) points out, “regional varieties and national varieties seldom coincide”; therefore, in addition to the national corpora presented in section 2.1.1, more regional corpora should be developed for still better research potentials.

Methods In the following, approaches that could be used in further work on the different steps of corpus comparison will be addressed.

System design A comprehensive modular NLP pipeline approach as de- scribed for several existing systems is recommended to be pursued for enhancing the prototypical semi-automatic approach presented, e. g. using the GATE plat- form (see section 2.1.3). If Internet dependency is accepted, a web-based scenario along the lines of Heid et al. (2010, see sections 1.2.3.3 and 2.1.3.2) is a further possible solution. Following such pipeline approaches for variant ex- traction, a valuable extension can be to offer several tools the user can choose from in each annotation, analysis, comparison, and filtering step. With such an approach, an easy integration of additional tools for each module as to the users’ needs is ensured as well, which will in addition facilitate the portability process to other languages. A direct interface to available variety corpora will

2Corpora of spoken language are a further basis for variety linguistic studies – to be investigated rather for descriptive than for prescriptive purposes (see section 1.2.1). 3Additionally, studies on how far norm authorities (on standard and norm, see sections 1.2.1 and 1.2.2) such as teachers accept variety peculiarities could be conducted. 166 5 Outlook and conclusion

be a convenient feature, which is as well easiest to be implemented in such a web-based scenario; for the case of German, the distributed C4 corpora (see section 2.1.1) could be linked. The integration of the variety comparison system into a larger corpus process- ing architecture is generally a promising development to be followed further.
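The idea of offering several interchangeable tools per processing step can be illustrated with a minimal Python sketch. It is not the architecture of the prototype or of GATE; the registry `REGISTRY`, the step names, and the toy tools are hypothetical placeholders chosen only to show the design.

```python
from typing import Callable, Dict, List

# A tool takes a token list and returns a (possibly changed) token list.
Tool = Callable[[List[str]], List[str]]

# Hypothetical registry: processing step -> {tool name -> tool}.
REGISTRY: Dict[str, Dict[str, Tool]] = {
    "annotation": {
        "lowercase": lambda tokens: [t.lower() for t in tokens],
        "identity": lambda tokens: list(tokens),
    },
    "filtering": {
        "drop_short": lambda tokens: [t for t in tokens if len(t) > 2],
        "identity": lambda tokens: list(tokens),
    },
}

# Fixed order in which the steps are applied.
STEP_ORDER = ["annotation", "filtering"]


def run_pipeline(tokens: List[str], choices: Dict[str, str]) -> List[str]:
    """Apply one user-chosen tool per step; unspecified steps pass data through."""
    data = list(tokens)
    for step in STEP_ORDER:
        tool_name = choices.get(step, "identity")
        data = REGISTRY[step][tool_name](data)
    return data
```

Adding a further tool then only means registering one more entry under the respective step, which is the property the modular design aims at.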

Annotation The quality of automated comparison results using pattern-based methods depends on the quality of the linguistic annotation (see section 1.2.3.2), since it relies on the correctness of frequency data. For pluri-centric languages, this is in some cases a challenge for tools that have been created for ‘dominant’ varieties (see section 1.2.1).⁴ To improve the lemmatisation output of common tools for ‘unknown’ words (see section 2.1.3), available variant, term or NE lists are a valuable resource. In addition to adding such lexicon entries (or even rules for symbolic tools, if necessary), statistical tools could be trained with available corpora for a certain variety. Furthermore, several approaches to automated similarity-based lemmatisation or lemma correction without lexical lists have been developed as presented in section 2.1.3, which can be followed in order to reduce the amount of ‘unknown’ lemmas in corpora to be compared.⁵ It is recommended to use sophisticated MWE and abbreviation recognition as well as NER tools – the latter either for finding variety-typical NEs or for excluding them from the results because they are of little linguistic interest. With more complex annotation methods, approaches e. g. along the lines of Schneider & Hundt (2009, see section 2.1.2) can be applied, who use a syntax parser as a heuristic tool to compare syntactic features of varieties. More higher-level annotation tools ready to be used are described in section 2.1.3; it just has to be taken into account that the error risk is higher on more complex annotation levels. Additionally, the combination of annotation approaches in order to integrate e. g. morphological, sub-categorisation, chunking, or parsing information can reduce ambiguities and yield more reliable extraction output in general. For certain kinds of comparisons, when existing annotated corpora are used, annotation mapping might be necessary, which can e. g. be tackled with the framework for integrating different annotations by Chiarcos et al. (2008). In order to decide about the use of annotation schemes or treebanks, the findings of Kübler et al. (2008) can be considered, who conducted a qualitative comparison and evaluation of the usefulness of different solutions in that area. Applying bootstrapping methods, new findings on variants can iteratively be used to improve the annotation of variety corpora and to adapt variant annotation tools.

⁴ Since annotations might distort possible interpretations if the latter do not conform to the theory the annotation is based on (see e. g. Bergenholtz & Tarp, 1995, p. 34), additional corpus comparison approaches without any prior assumptions have to be explored. Related to this is the question of what the minimal input is with respect to the effort invested and the necessary interactivity – and the challenge to find the optimal balance between this effort and usable output.
⁵ It should furthermore be attempted to find and correct errors of automatic annotation tools, in order to avoid skewed corpus query results resulting from wrong annotations.

Metadata handling To compare sub-corpora and take their different characteristics into account, it is necessary to integrate and exploit all available metadata information. Diachronic variation can e. g. be covered by including publication years of the corpus parts in the corpus comparison. More information such as exact region or author details can be presented in the output as well for its use in interpretation. In addition, the evaluation of the tool can become more fine-grained considering such metadata. Extending the binary comparison, it should be envisaged to process and compare several corpora simultaneously. This can be implemented using a multi-variate analysis approach as described in section 2.1.3 taking available metadata into account.
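As a small illustration of how metadata such as publication years could enter the comparison, the following Python sketch computes the relative frequency of a word per metadata value. The document structure (a dictionary with `meta` and `tokens` keys) and the function name are assumptions made for this example only, not the prototype's data model.

```python
from collections import Counter


def relative_frequency_by(documents, key, target):
    """Relative frequency of `target` for each value of the metadata field `key`.

    `documents` is assumed to be a list of dicts of the (hypothetical) form
    {"meta": {...}, "tokens": [...]}.
    """
    hits = Counter()
    totals = Counter()
    for doc in documents:
        value = doc["meta"][key]
        totals[value] += len(doc["tokens"])
        hits[value] += doc["tokens"].count(target)
    # Skip metadata values whose documents contain no tokens at all.
    return {value: hits[value] / totals[value] for value in totals if totals[value] > 0}
```

Grouping by `"year"` gives a simple diachronic profile; grouping by a region or author field would work the same way.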

Comparability check In order to give a useful and reliable indication of the comparability of the corpora under investigation (see section 1.2.3.4), further approaches can be explored along the lines of the latest research as presented in section 2.1.2. The methods used in the TTC project (see section 2.1.2) could e. g. be followed, where corpus ‘compatibility’ is measured by the correlation between similarity profiles of two corpora. Similarly, statistical outlier detection approaches as described in section 2.1.2 can be used for this purpose.
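A correlation-based comparability indicator can be sketched as follows. The sketch does not reproduce the TTC method; as a deliberately simpler stand-in, it computes the Spearman rank correlation of the relative frequencies of the most frequent words in two corpora. All function names and the `top_n` parameter are assumptions of this example.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values (1 = smallest); tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if one of the series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def frequency_rank_correlation(tokens_a, tokens_b, top_n=500):
    """Spearman correlation over the `top_n` most frequent words of both corpora."""
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    words = [word for word, _ in (freq_a + freq_b).most_common(top_n)]
    rel_a = [freq_a[word] / len(tokens_a) for word in words]
    rel_b = [freq_b[word] / len(tokens_b) for word in words]
    return pearson(average_ranks(rel_a), average_ranks(rel_b))
```

Values close to 1 indicate similar frequency rankings; low or negative values suggest that differences found later may stem from corpus composition rather than from the varieties themselves.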

Linguistic filtering More available PoS tag information (e. g. NE for proper nouns) can be used for the filtering of automatically extracted results. The filtering has to be extended from the word level to MWEs, e. g. to also identify complex names such as Dorf Tirol or Unsere liebe Frau im Walde. Refinements of the filtering method can additionally be reached with more character sequence mappings, e. g. for German ss ↔ ß or ä ↔ ae. Furthermore, measures such as the Levenshtein distance (see section 2.1.3) can be used to mark ‘similar’ phenomena, especially if extracted word forms are compared to lemmatised filter lists. The latter approach, however, has to be implemented carefully since it is rather error-prone, e. g. because of linguistic minimal pairs.
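The two filtering refinements mentioned above, character sequence mappings and the Levenshtein distance, can be combined as in the following Python sketch. The mappings for ö and ü go beyond the examples given in the text and are an assumption, as are the function names and the threshold `max_distance`.

```python
def normalise(word):
    """Map German character sequences onto one canonical spelling."""
    # ß -> ss and ä -> ae follow the text; ö and ü are added by analogy (assumption).
    mappings = [("ß", "ss"), ("ä", "ae"), ("ö", "oe"), ("ü", "ue")]
    result = word.lower()
    for source, target in mappings:
        result = result.replace(source, target)
    return result


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_similar(candidate, filter_entry, max_distance=1):
    """True if the normalised forms differ by at most `max_distance` edits."""
    return levenshtein(normalise(candidate), normalise(filter_entry)) <= max_distance
```

With these definitions, `is_similar("Strasse", "Straße")` holds, but so does `is_similar("Haus", "Maus")`, which illustrates why minimal pairs make the distance-based step error-prone.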

Statistical comparison More statistical AMs (see section 1.2.3.5) have to be implemented to account for their differences in ranking the resulting candidates.⁶ Additionally, a reliable confidence value (such as the one implemented in Heid et al., 2001) for the candidates based on the combination of AMs can be useful. To account for the non-randomness of corpora (see section 1.2.3.5), new statistical methods (as e. g. proposed by Gries, 2005) have to be applied. For handling the dispersion of words in a corpus, approaches to model and measure the ‘burstiness’ of words such as ARF (see section 2.1.2 and Gries (2008) for a range of different solutions to this challenge) can be integrated. As to the adaptation of word frequencies in the course of a text, the fact that “two words with similar frequency [...] can be distinguished on the basis of their adaptation” (Church, 2000, p. 186) could be exploited for variety comparison as well.

⁶ For a comparative judgement of significance measures, combined precision-recall diagrams such as in Evert et al. (2000) and Evert (2005a, which provides specific software as well) can be looked at in order to choose the most adequate measure according to the respective purpose.

In general, it seems promising to additionally apply sophisticated statistical approaches of other NLP areas for the comparison of corpora, as e. g. used in MT for similar languages.
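To make the role of an AM in candidate ranking concrete, the following Python sketch computes the log-likelihood ratio (LL) for a word's frequency in two corpora, in the two-term form commonly used for keyword comparison. It is meant as one example of such a measure, not as a description of the measures implemented in the prototype; the function names are assumptions.

```python
from math import log


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio for one word observed in two corpora.

    freq_a, freq_b: occurrences of the word; size_a, size_b: corpus sizes in tokens.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if freq_a > 0:  # a zero count contributes nothing to the sum
        score += freq_a * log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * log(freq_b / expected_b)
    return 2 * score


def rank_candidates(freqs_a, size_a, freqs_b, size_b):
    """Sort all words by descending LL score; freqs_* map words to counts."""
    words = set(freqs_a) | set(freqs_b)
    scored = {w: log_likelihood(freqs_a.get(w, 0), size_a, freqs_b.get(w, 0), size_b)
              for w in words}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Since different AMs rank the same candidates differently, a combined confidence value as suggested above could be derived from several such scores.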

Data storage and output As an enhancement of the output format, which is so far most usable for traditional linguistic target users, additional information for computational linguists to be semi-automatically processed further could be offered in the resulting data. For the storage of extraction output, a database approach including metadata and various kinds of linguistic annotations with a Perl / CQP interface is encouraged, which can be designed along the lines of Heid & Weller (2010). The output presented to the user including all such information should preferably be in a widespread markup language format such as XML (see section 1.2.3.2) for standardised processability. Furthermore, the GUI can be tested systematically in order to refine it according to comprehensive user feedback.
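An XML output of candidate lists could look roughly like the result of the following Python sketch, which uses the standard library module `xml.etree.ElementTree`. The element and attribute names (`candidates`, `candidate`, `form`, `meta`, `level`, `score`) are hypothetical and do not correspond to an existing schema of the prototype.

```python
import xml.etree.ElementTree as ET


def candidates_to_xml(candidates):
    """Serialise candidate dicts to an XML string (hypothetical element names)."""
    root = ET.Element("candidates")
    for cand in candidates:
        node = ET.SubElement(
            root,
            "candidate",
            attrib={"level": cand["level"], "score": f'{cand["score"]:.2f}'},
        )
        ET.SubElement(node, "form").text = cand["form"]
        # Optional metadata such as region or publication year.
        for name, value in cand.get("meta", {}).items():
            ET.SubElement(node, "meta", attrib={"name": name}).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because the result is plain XML, it can be read back with any standard parser for further semi-automatic processing.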

5.1.2 Refinement of analysis levels

The quantitative and qualitative investigation of variety phenomena has to be extended on all levels of linguistic description in order to give a complete picture of pluri-centric languages in all their forms.⁷ By refining the extraction methods, the correctness of their results should be enhanced to minimise manual effort as far as possible. Since the lexical level is the most consciously utilised one, it is furthermore especially worth investigating the lower and the higher levels of linguistic description, such as morphology or syntax, for interesting findings. In the following, the individual levels are addressed in detail.

⁷ For German, see also Abfalterer (2007, p. 237), who especially refers to still untreated entries in the Datenbank zum Südtiroler Deutsch (see section 2.1.1) to be further investigated.

Phonology Pronunciation peculiarities of regional varieties (see examples in section 2.2.2) can be investigated on the basis of spoken corpora for descriptive purposes (see section 1.2.1). Phonological corpus annotation allows for an automated comparison to other similarly prepared resources.

Morphology As to word structure, the output of morphological analysers covering inflection and word formation (e. g. SMOR, see section 2.1.3) is a promising possible source of peculiarities in varieties. Acknowledged phenomena should be integrated into such a morphological analysis module for best results. A further potential approach is to compare the compositionality of compounds in varieties, following e. g. McCarthy et al. (2003) who used an automatically acquired thesaurus as a basis.

Lexis On the lexical level, even though the phenomenon extraction is quite straightforward, more sophisticated methods to reduce the output to very probable variant candidates can be developed. According to a classification of false positives (see section 1.2.3), such cases can be handled systematically. One more fine-grained approach is to look at the position and distribution of words in documents and to highlight those which are i.) frequent, ii.) well dispersed, and iii.) less frequent in the reference corpus (Nazar et al., 2008, see section 2.1.3). Additional variants could also be found by searching for words in quotation marks such as in

(5.1) [...] gelangte die ‘Lahn’ bis zur Straße [...] (DOLO-2001-12-16); [...] the ‘avalanche’ reached the road [...].
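Searching for single words in quotation marks, as in example (5.1), can be approximated with a regular expression. The following Python sketch is a heuristic under the assumption that the corpus text is available as a plain string; the inventory of quotation characters and the function name are choices made for this example.

```python
import re
from collections import Counter

# Quotation characters that may open or close a quoted word (an assumed inventory).
OPENING = "‘'\"„“‚»«"
CLOSING = "’'\"“”‘«»"

# One opening quote, one token without whitespace or quotes, one closing quote.
QUOTED = re.compile(
    "[{o}]([^\\s{o}{c}]+)[{c}]".format(o=re.escape(OPENING), c=re.escape(CLOSING))
)


def quoted_words(text):
    """Count single words that are enclosed in quotation marks."""
    return Counter(match.group(1) for match in QUOTED.finditer(text))
```

The resulting counts still require manual screening, since quotation marks also signal titles, irony, or citations, and apostrophes can cause false matches.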

Furthermore, the degree of foreign material in texts could be compared, such as Schmidlin (2003, see section 2.2.1) did for anglicisms in German varieties. To look at the use of comparable PoS such as adjectives vs. participles in order to see if there is a difference across varieties, the approach taken by Handwerker et al. (2004, see section 2.1.2) can be followed. On a more fine-grained level, to check if word-level ambiguities differ across varieties, studies along the lines of Ignatova & Abel (2008) could be conducted to see if there are different degrees of ambiguities and if there is a difference across varieties with respect to the effort needed to resolve them. As a further more manual approach, the words marked as ‘unknown’ (by a PoS tagger whose lexicon was created for a different variety), which are likely to contain new variant candidates, could be manually investigated after filtering against corpus-external lexical lists (see sections 2.1.2 and 4.2). Moreover, applying the tools to other kinds of varieties of languages such as learner corpora, it can be even more interesting to look at rare words, but also at the very frequent function words and their contexts, since these belong to the sub-conscious parts of language use.

Morpho-syntax An integrated tool to extract morpho-syntactic features of words, e. g. YAC (see section 2.1.3), will allow for detailed analyses on the morpho-syntactic level, e. g. to comprehensively investigate the dative vs. accusative confusion occurring in South Tyrolean German (see section 1.2.2). Furthermore, morpho-syntactic selectional preferences can be compared, e. g. following Peirsman & Padó (2010, see section 2.1.2), which is a cross-lingual unsupervised approach. Selectional preferences in collocations can be studied along the lines of Evert et al. (2004b, see section 2.1.2), e. g. to find number and case preferences of ADJ+N collocations, or following Heid & Weller (2008, see section 2.1), who extracted N+V collocations with preferences for active vs. passive in chunked corpora. Heid (2011, see section 2.1.2) also concentrated specifically on variety preference differences, an approach to build on. As to general differences with respect to the distribution of features such as tenses (simple vs. complex), cases (dative vs. genitive), voices (actives vs. passives), or e. g. the extent of nominal vs. verbal style, experiments could be conducted in line with the ones described in Clark et al. (2008) for the automatic data-driven discovery of such features to see if they are expressed differently across varieties. Further work on differences between varieties in sub-categorisation, e. g. to check if there are more or fewer cases of non-head valency bearers in one variety with respect to the other, can be done along the lines of Lapshinova-Koltunski & Heid (2008, see section 2.2.2).

Co-occurrences In the field of co-occurrence extraction, not only variety-typical collocations can be considered, but also e. g. specific collocators for certain regionalisms or differing characteristics of collocations such as their degree of fixedness, their idiomaticity, or their lexical and morpho-syntactic preferences and peculiarities. Another aspect could concern different collocations, bases, and collocators occurring in the same context, or a certain collocation which occurs in different contexts and is thus likely to have different readings. Additionally, comparing the collocationality of words in different varieties could be a valuable exploration, an approach which was presented by Kilgarriff (2006, see section 1.2.3.3). A fine-grained analysis of ‘false friends’ in regional collocations could be approached along the lines of Heid & Prinsloo (2008, see section 1.2.3.3). To go into more complex phenomena, Zinsmeister & Heid (2002, see section 2.1.2) could be followed, who extracted collocations of complex words by applying a stochastic grammar. Along the lines of Bürki (2009, see section 2.1.2), MWEs in general could be extracted and compared. The approach by Fritzinger & Heid (2009, see section 1.2.3.3), who take whole paradigms into account by grouping morphologically related collocations and corresponding compounds together, can be integrated into all collocation extraction processes. This approach is related to the mentioned extraction of all possible corresponding constructions in different varieties (see section 2.2.2 and the paragraph on semantics below). On the one hand, the patterns for collocation extraction can be extended, e. g. to other adjacent ((N+)PREP+N) and also to non-adjacent phenomena such as SUBJ/OBJ+PRED. On the other hand, the extraction algorithms themselves have to be refined, especially for non-adjacent phenomena, which are best extracted on the basis of the topological field model (see section 1.2.1.1).
To reduce errors such as the mis-interpretation of optional prepositional phrases in the middle field as objects, it is crucial to use valency information, sub-categorisation frames, and morpho-syntactic annotation (e. g. integrating YAC, see section 2.1.3). Fritzinger et al. (2009) also underline that, for German, it is necessary to extract significant word pairs on the basis of syntactically annotated text because otherwise the structural ambiguities lead to poor results. The comparison of phraseologism compositionality could be done along the lines of McCarthy et al. (2003), who investigated phrasal verbs with an automatically acquired thesaurus. An interactive manual interpretation and investigation of significant collocators of a basis is indispensable to find variety-typical collocations whose parts are also frequent in reference texts (see also Heid, 2011, section 2.1.2). Lexicographers and variety linguists have to decide whether variety-typical collocation preferences indeed point to specific variety-typical readings. Recently, Gries (2012) published his latest recommendations for collocation research. He proposes new statistical measures without assuming any symmetry (likewise to be applied for general corpus linguistic research). Approaches to be followed for computational collostructional analysis (see section 1.2.1) are e. g. presented by Stefanowitsch & Gries (2003).
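The extension of extraction patterns for adjacent co-occurrences mentioned in this paragraph can be sketched generically in Python: a pattern is given as a sequence of PoS tags and matched against a tagged sentence. The input format (lists of lemma and tag pairs), the function names, and the toy data are assumptions; the tags follow the STTS tagset (ADJA, NN, APPR). Non-adjacent phenomena are deliberately not covered, since they require syntactic information.

```python
from collections import Counter


def extract_pattern(tagged_sentence, pattern):
    """Yield lemma tuples whose adjacent PoS tags match `pattern` exactly."""
    n = len(pattern)
    for start in range(len(tagged_sentence) - n + 1):
        window = tagged_sentence[start:start + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            yield tuple(lemma for lemma, _ in window)


def count_pattern(tagged_corpus, pattern):
    """Count all matches of `pattern` in a corpus of tagged sentences."""
    counts = Counter()
    for sentence in tagged_corpus:
        counts.update(extract_pattern(sentence, pattern))
    return counts
```

The counts obtained per variety for a pattern such as `("ADJA", "NN")` or `("NN", "APPR", "NN")` can then be passed to an association measure for comparison.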

Syntax To investigate syntactic differences between varieties (e. g. structural interferences from contact languages, see section 1.2.1.2), an approach such as by Lauttamus et al. (2007, see section 2.1.2) could be envisaged, who studied syntactic interference by comparing PoS tri-grams. For identifying grammatical differences between corpora, the approach by Grouin (2008) on corpus ‘certification’ is a further possible application. Furthermore, as Atterer & Schütze (2008) extracted and compared frequencies of grammatical relations, this could also be done for different regional varieties (this feature is as well implemented in the Sketch Engine and the DWDS, see section 2.1.3). In their register study based on bi-grams, Crossley & Louwerse (2007, see section 2.1.2) revealed syntactic features, an approach which could be transferred to regional varieties.
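A comparison of PoS tri-grams across two corpora can be sketched in Python as follows. The sketch only contrasts relative frequencies and does not reproduce the statistical procedure of Lauttamus et al. (2007); the input format (one list of PoS tags per sentence) and the function names are assumptions.

```python
from collections import Counter


def pos_trigram_profile(tagged_sentences):
    """Relative frequencies of PoS tri-grams; input is one tag list per sentence."""
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    return {trigram: count / total for trigram, count in counts.items()} if total else {}


def largest_differences(profile_a, profile_b, top_n=10):
    """Tri-grams with the largest absolute difference in relative frequency."""
    trigrams = set(profile_a) | set(profile_b)
    diffs = {t: profile_a.get(t, 0.0) - profile_b.get(t, 0.0) for t in trigrams}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:top_n]
```

Tri-grams at the top of the resulting list are candidates for closer manual inspection, e. g. with respect to word order differences.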

Semantics For approaches to cover semantic features in variety corpora, methods based on the ‘distributional hypothesis’ (see section 1.2.3.3) are worth being explored. Such more complex extraction methods are necessary for lexical entities which have differing or additional meanings in one variety (e. g. ausrasten: ST ‘relax’ vs. DE ‘get crazy’; see section 2.2.2). Since these lexical entities occur in reference corpora as well, an automatic distinction of the readings can only be done with sophisticated semantic processing methods such as word space models (see section 1.2.3.3).⁸ Words which are e. g. primary Südtirolisms only in one of their senses have otherwise to be manually checked according to occurrences in corpora from other varieties and their respective contexts. In order to assign the correct corresponding expression in another variety automatically, e. g. an approach such as by Koehn & Knight (2002, see section 2.1.3) can be used, where translation lexicons are ‘acquired’ from unrelated monolingual corpora. Additionally, the approach for terminology acquisition from unrelated corpora by Nazar (2008, see section 2.1.2) could be tested for variant equivalents; there, neither large amounts of data nor a lexicon to start with are needed. Heylen et al. (2008, see section 2.1.2) also presented a comparison of vector-based models for word similarity used in large corpora to be considered for future work. Variants could as well be discovered by the approach to extend lexicons using similarity measures by Cahill (2008, see section 2.1.2). Along the lines of Sahlgren (2005) or Kanerva et al. (2000, for both see section 2.1.2), Random Indexing methods (see section 1.2.3.3) as an efficient alternative to standard word space models using distributional statistics to indicate semantic similarity could be applied.
In addition, the clustering approaches by Lin (1998, see also section 1.2.3.3) to detect similar words via their distributional pattern in large monolingual corpora can be used. It is furthermore worth exploring if fine-grained semantic differences in varieties can be found via clustering methods such as in the approach for semantic categorisation using co-occurrence statistics by Bullinaria (2008, see section 2.1.2). For indications about which similarity functions to use, consult Lee (1999, see section 1.2.1), which is a comparison of measures of distributional similarity. If parallel corpora for varieties exist, e. g. for laws in Austria and Germany, alignment methods can be exploited. For term alignment in particular, the choice of equivalents has however to be treated very carefully if e. g. different legal systems are valid in the countries where the varieties occur, since terms can differ on a very subtle level. Synonym extraction according to context in aligned corpora such as by van der Plas & Tiedemann (2006, see section 2.1.2) can be followed, which is able to distinguish between synonyms and otherwise related words. On a more fine-grained level, with the distributional model it could be checked how many words are used in vs. out of context, see Jabbari et al. (2008), who conducted experiments on word obfuscation using LL. As to measuring the compositionality of MWEs and for identifying idiomatic MWEs using distributional semantics, Latent Semantic Analysis methods (see also Heid, 2008, section 1.2.3.3) can be used as measures of semantic relatedness, where similar contexts point to shared meaning components. An automatic recognition and analysis of metaphors and idioms can also be done e. g. following the approach by Dormeyer et al. (2005, see section 2.1.2), who used lexical entries, grammar rules, and contextual features of words. In their extraction of morphologically fixed collocations, Fazly & Stevenson (2006, see section 2.1) find idiomatic expressions as well, which is an approach to possibly be followed. Furthermore, chunked corpora can be exploited to extract idiomatic expressions according to sub-categorisation and combinatorial preferences (as e. g. done by Kermes & Heid, 2003, see section 2.1.2).
Additionally, word alignment in parallel corpora can be used to identify idiomatic expressions (Villada Moirón & Tiedemann, 2006, see section 2.1). If lexical entities in one variety do not have a corresponding expression on the same level of linguistic description in the other variety, e. g. compounds vs. MWEs (for examples see section 2.2.2), the extraction of equivalent constructions can be attempted with sophisticated semantic methods. Otherwise they have to be identified rather manually, since there the limits of fully automated tools are reached (see also Heid, 2011).

⁸ To this aim, e. g. the open source package for semantic vector analysis by Widdows & Ferraro (2008, see section 2.1.2) can be integrated.
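The distributional idea behind the Semantics paragraph, i. e. that differing readings of a word such as ausrasten show up as differing contexts, can be illustrated with a very small word space sketch in Python. It uses raw co-occurrence counts in a fixed window and cosine similarity; real word space models, Random Indexing, or LSA are considerably more elaborate. The function names, the window size, and the toy sentences in the usage note are assumptions.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=2):
    """Count the words occurring within `window` tokens of each `target` occurrence."""
    vector = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vector.update(left + right)
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if one is empty."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For the invented token lists `"nach der arbeit ausrasten und schlafen".split()` and `"vor wut ausrasten und schreien".split()`, the two context vectors of ausrasten share only one context word, so their cosine similarity is low; a low cross-variety similarity for the same lemma would mark it as a candidate for a differing reading.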

Pragmatics The analysis of differences in pragmatics such as presented by Muhr (1993c, see section 2.1.2) has to be automated as far as possible. The annotation of pragmatic functions of linguistic units can e. g. be done following Archer et al. (2009, who describe a gold standard for hand-coded annotation schemes). Analysis software such as by Schründer-Lenzen & Henn (2009, see section 2.1.3), who conduct automated high-level language diagnosis to support language acquisition, could be adapted and used here.

Discourse level For comparing the structure of variety texts, adverbials at paragraph borders or other stylistic devices and textual information structure can be systematically investigated (see e. g. Virtanen, 2009, who provides an overview on discourse markers and coherence relations presenting the methodology and various studies). Crossley & Louwerse (2007, see section 2.1.2) investigated discourse features based on bi-grams, an approach to be tested with regional varieties. Along the lines of Doherty (2006, see section 2.1.2), a very fine-grained comparison of texts according to structural and discourse properties could be conducted to check the discourse-appropriate use of linguistic features across varieties. As to comparing the readability of different texts, complexity measures such as in vor der Brueck et al. (2008, see section 2.1.2) could be applied.

Further ideas Approaches to the automatic identification of varieties could be developed, as they are used for identifying languages by vocabulary size curve and vector comparison (e. g. by Nazar et al., 2008, see section 2.1.3) or as Zampieri & Gebre (2012, see also section 2.1.3) have implemented for Portuguese varieties comparing n-gram patterns.
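An n-gram-based identification of varieties can be sketched in Python by comparing character n-gram profiles. The sketch is a generic nearest-profile classifier with cosine similarity and does not reproduce the methods of Nazar et al. (2008) or Zampieri & Gebre (2012); the function names, the labels, and the sample texts in the usage note are assumptions.

```python
from collections import Counter
from math import sqrt


def char_ngram_profile(text, n=3):
    """Character n-gram counts of the lower-cased text, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if one is empty."""
    dot = sum(u[gram] * v[gram] for gram in u if gram in v)
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_variety(text, training_texts, n=3):
    """Return the label whose training profile is most similar to `text`."""
    profile = char_ngram_profile(text, n)
    scores = {label: cosine(profile, char_ngram_profile(sample, n))
              for label, sample in training_texts.items()}
    return max(scores, key=scores.get)
```

Given two invented training samples, e. g. `{"A": "die marende und der kondominium verwalter", "B": "die brotzeit und der hausverwalter im mehrfamilienhaus"}`, the text `"eine marende im kondominium"` is assigned to label A. With realistic data, much larger training corpora per variety would be needed.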

It could also be tested if purely statistical methods of novelty detection (Markou & Singh, 2003) using neural networks could help in finding differences between varieties. As soon as gold standard phenomena lists for higher levels of linguistic description are available, the evaluation of the tool can be extended. Furthermore, sophisticated inter-language comparison could provide hints about the position but also about the origins of differences between varieties, both for internal changes and for differences due to contact languages. Finally, visualisation techniques are worth being explored – either for the interpretation of analysis results or for the representation of corpora and their features to carry out explorative investigations – such as gathered in the LInfoVis⁹ project dealing with the visualisation of linguistic information (Culy & Lyding, 2010).

A concluding paragraph will now summarise the last sections, which deliberately presented a wide range of possible developments to show what Vis-À-Vis can be a starting point for. The steps that can directly be taken with the current implementation of the prototype Vis-À-Vis are e. g. the integration of different comparability measures and statistical association measures for the comparison of phenomena. The linguistic filtering can also be straightforwardly extended, or e. g. adapted for studies on German varieties in other semi-centres. Other co-occurrence or syntactic patterns which are of special interest for a certain study can be directly integrated into the code, according to the source code documentation. In a further series of steps, the more indirect enhancements could be tackled, such as the integration of more complex annotation tools, which will need a more complex corpus query processor as well, e. g. for dependency-parsed data. Especially the very low and the very high levels of linguistic description will need more thorough adaptation with different linguistic approaches – on the one hand, for phonological studies, on the other hand, e. g. for semantic or pragmatic investigations.

⁹ http://www.eurac.edu/linfovis, last accessed 2012-09-26.

5.2 Summary

After a general introduction in section 1.1 on the aims and research questions, this thesis presented relevant research areas with respect to its contents in section 1.2. To place this work into an applied research context, section 2 described the important previous work that has been done in these fields. Section 3 provided a detailed presentation of the prototype system developed in the framework of this research, followed by its quantitative and qualitative evaluation in section 4. The outlook (section 5.1) pointed to possible further lines of research to be followed. This concluding section summarises the main findings of the work on the semi-automatic comparison of regional variety corpora, along with the contributions with respect to its objectives and research questions, including the significance and the implications of this study.

5.2.1 Principal findings of this work

As far as the research questions stated in section 1.1 are concerned, the feasibility of automated variety comparison based on corpora can be confirmed to the following extent. Regarding the application of the methods on different levels of linguistic description, it can be concluded that the lower levels (e. g. the uni-gram level) can be addressed relatively straightforwardly. More sophisticated approaches (partly based on more complex annotations) are needed for higher levels of linguistic description such as for semantics or for cross-level investigations, in order to find the full inventory of possible alternative structures of a phenomenon. The special phenomena that occur in South Tyrolean German as described in section 2.2.2 can be addressed computationally with Vis-À-Vis to a certain extent up to the syntactic level. Given appropriate annotation, the transferability of methods across levels of linguistic description is trivial on the lower levels; for more complex levels (e. g. for pragmatics or discourse studies), the applied methods are not directly transferable but need more thorough research. For the implemented levels of linguistic description, the combination of symbolic and statistical methods proved to be a valuable approach. As to the influence of certain features of the input corpora (such as comparability or size), the evaluation of the prototype system suggests that the more similar the input corpora are in every respect, the more reliable and valuable the results. The most crucial factor with respect to the corpus features is the necessity to obtain the highest possible comparability of the corpora, in order to reduce irrelevant findings such as differences resulting from corpus composition and contents (for example, if the corpora focus on very distinct topics). Further work on the notion of corpus comparability and on the ways of measuring it will contribute to increasingly better results in this regard. The usefulness of the output and the reliability of the automatic ‘relevance ranking’ have been confirmed as acceptable with respect to the difficulty of extracting subtle differences between corpora, since variants are rare phenomena and generally the extraction of significantly different phenomena from corpora is not a trivial task. Regarding the possibility of integrating, adapting, and combining available standard tools, the conclusion is that such tools are useful and straightforward to integrate. This has been shown by applying annotation tools for Vis-À-Vis which were originally developed for the ‘dominant’ variety and could easily be adapted. For general corpus processing systems, the extent of necessary adaptation is acceptable as well. As far as the system’s transferability to other varieties and languages is concerned, the development of the prototype for South Tyrolean German suggests that this task is also straightforward, especially for pluri-centric languages with a similar structure. The key to the adaptations are the annotation module and the specific word lists, e. g. for proper names.
As to the research desiderata for regional variety linguistics (section 2.3), on the basis of the findings in this thesis, it can be concluded that the desired comprehensive systematic empirical studies on variety corpora with state-of-the-art methods are practicable. Existing descriptions of regionalisms (e. g. variant dictionary entries) have already been revised, confirmed, and in some cases, enriched. With the proposed semi-automatic approach, a larger amount of the available authentic data can be exploited than could be investigated by manual means alone; it is obvious that manual work is still necessary, for example to carry out thorough lexicographic analyses or for the detailed interpretation of semantic phenomena.

5.2.2 Contributions to the relevant research areas

This work contributes to research in (regional) variety linguistics, in particular for (South Tyrolean) German, as well as to the field of computational comparative corpus linguistics. The study supports the documentation of varieties of pluri-centric languages as well as variant lexicography, didactics, and language awareness initiatives. The methodology of variant research has been enhanced by making empirical studies considerably easier with a freely accessible and user-friendly tool. This is especially important, as there is currently a surge in the amount of electronic authentic language data available in this field.

The task has been tackled by directly comparing corpora from different varieties using state-of-the-art computational-linguistic methods and tools that take acknowledged findings into account, revising and enhancing them. With this method, the research gap with respect to the comprehensive, systematic, empirical, and automated comparison and description of varieties on all levels of linguistic description, as mentioned in section 2.3, can be successfully approached. In contrast to other tools, Vis-À-Vis is simple to use without computational expertise via its intuitive GUI. It is furthermore specifically tailored to regional variety corpus comparison, which is demonstrated by the comparative evaluation with a well-known commercial general corpus analysis and comparison system.

Given such findings, it seems worth following the presented approach and further refining the prototype system Vis-À-Vis in order to provide an even more useful NLP tool for corpus comparison. Since the applied approaches are generic, such a tool can then serve in fields outside of regional variety linguistics as well, wherever corpora are being compared to find significantly different phenomena. Through bootstrapping processes, such systems will yield increasingly better results in further corpus analysis and comparison studies, which will be able to make considerable contributions to general linguistic research.

A System documentation

On the following pages, the practical user documentation of Vis-À-Vis – as provided together with the code – is included.

Vis-À-Vis System Documentation – Comparing Regional Varieties on the Basis of Corpora

Stefanie Anstein Institute for Specialised Communication and Multilingualism European Academy Bozen/Bolzano

Contents

1 Introduction

2 The system Vis-À-Vis
  2.1 Architecture and functionalities
  2.2 Implementation and limitations

3 System usage
  3.1 Availability and prerequisites
  3.2 Input and output
  3.3 Instructions for usage
    3.3.1 Tool usage on the command line
    3.3.2 Tool usage via the graphical user interface

4 Conclusion

References

1 Introduction

In this system documentation, the toolkit Vis-À-Vis is described from a practical perspective; for details on its methodology, see Anstein (2013). Vis-À-Vis is a prototype which has been developed for the comparison of regional varieties of pluri-centric languages (see e. g. Ammon, 1995) on the basis of corpora, in order to evaluate the feasibility of a computational linguistic approach for this task. The development of Vis-À-Vis originated in the projects Korpus Südtirol¹ and C4². The former is preparing a written text corpus of South Tyrolean German³, which can also be queried together with other German variety corpora via the distributed query engine implemented in the C4 project. In addition to interactively running single queries on the C4 corpora, variety linguists can use Vis-À-Vis to empirically analyse and quantitatively compare their corpora on the desired levels of linguistic description. The results will contribute e. g. to variety documentation, variant lexicography, or variety didactics.

2 The system Vis-À-Vis

This section briefly presents the design of the toolkit and its functional features.

2.1 Architecture and functionalities

The overall architecture of Vis-À-Vis can be seen in figure 1; details on the individual modules are given in the following.

¹ Abel et al. (2009); http://www.korpus-suedtirol.it.
² Dittmann et al. (2012); http://www.korpus-c4.org.
³ ‘South Tyrolean German’ is the German variety spoken and written as an official language in South Tyrol in Northern Italy (see e. g. Egger & Lanthaler, 2001).

Input In the first Vis-À-Vis module, plain texts are uploaded or pre-indexed corpora are specified by their names. The encoding of the input texts is verified (UTF-8) and the corpora are checked for their comparability with the two text ‘complexity’ measures ‘type-token ratio’ and ‘lexical density’.
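The two comparability measures named above can be illustrated with a minimal Python sketch. It is not the system's Perl code: the function names are hypothetical, and the set of STTS tag prefixes counted as content words is an assumption that may differ from the definition used in the thesis.

```python
from typing import Sequence, Tuple

# Assumption: STTS tag prefixes treated as content-word classes
# (nouns, proper nouns, full verbs, adjectives, adverbs).
CONTENT_TAG_PREFIXES = ("NN", "NE", "VV", "ADJ", "ADV")


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Number of distinct word forms divided by the number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def lexical_density(tagged: Sequence[Tuple[str, str]]) -> float:
    """Share of (token, PoS tag) pairs whose tag marks a content word."""
    if not tagged:
        return 0.0
    content = sum(1 for _, tag in tagged if tag.startswith(CONTENT_TAG_PREFIXES))
    return content / len(tagged)
```

Since the type-token ratio tends to fall as a text grows longer, the two values are easiest to interpret for corpora of similar size.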

Pre-processing For plain text input, the linguistic annotation on the word level as well as the corpus indexing is done in the pre-processing module by several combined scripts for tokenisation, PoS tagging, lemmatisation, and query engine indexing.

Extraction & storage In the extraction module, phenomena for the three different analysis levels (i. e. uni-gram, bi-gram, and syntactic level) are retrieved from the corpora and stored together with their frequency information. For text input (vs. pre-indexed corpus input), the extraction of non-lemmatised word forms (‘unknowns’) is furthermore done on the basis of the annotated corpora.
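As a rough illustration of the frequency storage for the uni-gram and bi-gram levels, the following Python sketch counts items in an already tokenised corpus. It is a simplification under stated assumptions: the function names are hypothetical, bi-grams are taken to be directly adjacent token pairs, and the query-engine-based retrieval of the actual system is not reproduced.

```python
from collections import Counter
from typing import Sequence, Tuple


def unigram_counts(tokens: Sequence[str]) -> Counter:
    """Frequency of each word form (or lemma) in the token sequence."""
    return Counter(tokens)


def bigram_counts(tokens: Sequence[str]) -> Counter:
    """Frequency of each pair of directly adjacent tokens."""
    pairs: Sequence[Tuple[str, str]] = list(zip(tokens, tokens[1:]))
    return Counter(pairs)
```

Whether word forms or lemmas are passed in corresponds to the choice the user makes for the analysis.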

Selection After cleaning the phenomenon lists, a linguistic filter marks expected cases via system-internal and possibly uploaded lists of corpus-external knowledge. The difference of phenomenon occurrences in the two corpora is determined via statistical association measures, e. g. log-likelihood (LL), in order to rank the resulting list according to relevance.
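The log-likelihood comparison can be sketched as follows in Python. This uses the two-corpus formulation that is common in corpus comparison (expected frequencies derived from the corpus sizes); whether Vis-À-Vis computes LL in exactly this way is an assumption, and all function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def ppm(freq: int, size: int) -> float:
    """Relative frequency per million tokens."""
    return freq / size * 1_000_000


def log_likelihood(freq_v: int, size_v: int, freq_r: int, size_r: int) -> Tuple[float, float, float]:
    """Return (signed LL, expected freq in variety, expected freq in reference).

    The sign is positive for overuse and negative for underuse
    in the variety corpus relative to the reference corpus.
    """
    total = freq_v + freq_r
    exp_v = size_v * total / (size_v + size_r)
    exp_r = size_r * total / (size_v + size_r)
    ll = 0.0
    if freq_v > 0:
        ll += freq_v * math.log(freq_v / exp_v)
    if freq_r > 0:
        ll += freq_r * math.log(freq_r / exp_r)
    ll *= 2
    sign = 1.0 if freq_v >= exp_v else -1.0
    return sign * ll, exp_v, exp_r
```

Sorting phenomena by the absolute LL value then yields a relevance ranking, with the sign indicating in which corpus the item is more frequent than expected.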

Output In the final module, general corpus comparison results and the output lists for phenomena on each analysis level are presented to the user for their interpretation and further processing.

2.2 Implementation and limitations

In the following, relevant implementation details of Vis-À-Vis will be provided and the limitations of the system will be stated.

Figure 1: Overall architecture of Vis-À-Vis
[Diagram not reproduced. It shows five consecutive modules: INPUT (variety corpus and reference corpus with comparability check, plus corpus-external knowledge), PRE-PROCESSING (annotation with tokeniser, lemmatiser, and PoS tagger, followed by the CWB indexer, for each corpus), EXTRACTION & STORAGE (uni-gram, bi-gram, and syntactic extraction), SELECTION (filter and comparison of the variety and reference phenomena), and OUTPUT (a table with the columns RANK, PHEN, LL, f_var, f_ref, ..., FILTER).]

Technical system features
Vis-À-Vis is implemented in the programming language Perl⁴. The system is designed in a modular way in order to allow for straightforward maintenance and adaptation, e. g. to account for its transferability to other pluri-centric languages. In order to allow for parallel system runs, individual temporary directories for each programme call are created, which are stored for 10 days before they get cleaned up. The GUI is implemented with the scripting language PHP⁵.

⁴ v5; http://www.perl.org.

File structure
The main script visavis.perl is located in the source directory, where the command line process can be started (see section 3.3.1). The following further directories are in use:

• ADD_DATA/: containing system-internal additional knowledge lists such as stopwords or acknowledged regionalisms.
• GUI/: containing the files needed for the system’s graphical user interface.
• OUT_XXXX/: containing all generated intermediate and final result data with individual directory names for each run to allow for parallel processing.
• SCRIPTS/: containing the necessary system-internal Perl and shell scripts for the annotation, indexing, phenomenon extraction and storage, filtering, and comparison of the input corpora and their phenomena.
• TESTDATA/: containing test corpora and corpus-external knowledge lists.
• TOOLS/: containing all external utility programmes, i. e. for annotation and corpus querying.
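The per-run output directories and their clean-up after 10 days can be pictured with the following Python sketch. It only illustrates the idea and is not the system's Perl code: the function names, the use of the modification time as age criterion, and the prefix OUT_ (derived from the directory name pattern listed above) are assumptions.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 10 * 24 * 60 * 60  # 10 days, as stated in the documentation


def create_run_dir(base: Path) -> Path:
    """Create a uniquely named output directory for one programme call."""
    base.mkdir(parents=True, exist_ok=True)
    return Path(tempfile.mkdtemp(prefix="OUT_", dir=str(base)))


def clean_old_run_dirs(base: Path, now: Optional[float] = None) -> List[str]:
    """Remove run directories older than the retention period; return their names."""
    now = time.time() if now is None else now
    removed = []
    for entry in base.glob("OUT_*"):
        if entry.is_dir() and now - entry.stat().st_mtime > RETENTION_SECONDS:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

Unique directory names per call are what allows several runs to proceed in parallel without overwriting each other's intermediate files.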

⁵ v5; http://www.php.net.

System limitations
Vis-À-Vis is designed to compare two corpora at a time; for comparing more corpora, the process has to be run several times. A filter for metadata is not included in the system – if parts of a corpus are supposed to be compared to each other according to their respective metadata, sub-corpora have to be created externally as separate corpora to be analysed. The computational extraction of phenomena with Vis-À-Vis reaches up to the syntactic level; no higher levels of linguistic description or functional equivalences on different levels have been tackled in this feasibility study. For obtaining best results, the corpora have to be as comparable as possible with respect to contents and size, and ideally balanced and large, the latter above all for applying statistical measures. Additionally, errors of automatic tools, e. g. in tagging, which may lead to skewed corpus query results, could not be captured in the scope of this work.

3 System usage

This central section presents the details necessary to apply Vis-À-Vis in both of its usage scenarios.

3.1 Availability and prerequisites

The Vis-À-Vis GUI version can be accessed online from the Korpus Südtirol website (see above). The system is furthermore released with a free software license for downloading as a stand-alone application, which comprises two possibilities of usage: command line and GUI use for Linux environments (see section 3.3).⁶ Regarding operating system prerequisites and external software, the code runs in Linux environments and uses the following pre-installed software:

• Perl v5 with CWB⁷ and CQP⁸ modules as well as CQP v5,
• TreeTagger⁹ v3, and optionally
• PHP v5 (for local GUI use).

⁶ An adaptation for its use with Windows or Mac operating systems is possible as well.
⁷ http://cwb.sourceforge.net.
⁸ Evert (2005); http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html.

3.2 Input and output

The main script takes as input

i.) written text corpora of two German varieties to be compared (either in plain text format or indexed for CQP) and
ii.) if available, lists containing corpus-external knowledge that can consist of approved peculiarities (e. g. named entities or regionalisms) of the variety to be investigated and of full lexicon entries.

The output consists of

i.) general quantitative information on the two corpora and their comparability as well as
ii.) filtered phenomenon-specific lists of items extracted from the two corpora on the chosen level of linguistic description.

General corpus comparison output
When running the script on the command line, intermediate results of the individual steps for the preparation of the corpora (for text input) are shown. After that information, for both input types, the two comparability measures for each corpus are given and the locations of the final result files are provided. For text input, the paths to the ‘unknown’ lists are given as well, together with instructions on how to include lexicon files for correcting the annotation in a further system run. In the GUI, the ‘progress report’ column on the left and the output page show this information, except for the intermediate results, which are left out for reasons of clarity and space.

⁹ Schmid (1994); http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger.

Output by analysis levels
In order to allow for easy further processing, the output is provided in tab-separated format, ranked by the LL values. For each analysis level, the result table contains the following columns, each line describing the values for one phenomenon:¹⁰

column title     column contents
RANK             rank count (for uni- and bi-gram level)
PHEN             phenomenon
PoS              phenomenon PoS (for uni-gram level)
±LL              association measure LL with the indication of overuse/underuse by ±
±LL*log(f_v)     further association measure against the over-estimation of low-frequency items
freq_v           absolute frequency in the variety corpus
size_v           overall variety corpus size
ppm_v            relative frequency in the variety corpus
EXP_v            expected frequency in the variety corpus
freq_r           absolute frequency in the reference corpus
size_r           overall reference corpus size
ppm_r            relative frequency in the reference corpus
EXP_r            expected frequency in the reference corpus
FILTER           indication of irrelevant phenomena

The phenomena containing a filter tag in the last column are listed at the end of the output table without a rank assigned, since they are considered irrelevant (stopwords, foreign words, acknowledged regionalisms, etc.). The filter tags used in the output tables (with the prefix ‘S_’ for system-internal and the prefix ‘U_’ for user lists) are shown in table 2.

¹⁰ Since the output is not yet filtered manually, it contains both linguistically relevant differences as well as differences that might reflect the selection of topics in the respective corpora or certain situational peculiarities. The interpretation and more detailed analysis based on these results must be conducted manually.
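For further processing of the tab-separated result tables described above, a short Python sketch may be helpful. It assumes that the file starts with a header line carrying the column titles (RANK, PHEN, LL, FILTER, ...), which is not confirmed by this documentation; the function name and the toy values in the usage example are invented for illustration.

```python
import csv
import io
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_ranked_and_filtered(tsv_text: str) -> Tuple[List[Row], List[Row]]:
    """Separate ranked phenomena from those carrying a filter tag."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    ranked: List[Row] = []
    filtered: List[Row] = []
    for row in reader:
        # A non-empty FILTER cell marks a phenomenon considered irrelevant.
        if (row.get("FILTER") or "").strip():
            filtered.append(row)
        else:
            ranked.append(row)
    return ranked, filtered


# Usage with invented toy values:
sample = "RANK\tPHEN\tLL\tFILTER\n1\tMarende\t42.0\t\n\tund\t3.1\tS_STOPWORD\n"
ranked_rows, filtered_rows = split_ranked_and_filtered(sample)
```

Keeping the filtered rows in a separate list mirrors the output layout, in which tagged phenomena appear unranked at the end of the table.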

Table 2: Linguistic filter tags used by Vis-À-Vis

filter tag                    description
S_STOPWORD                    German stopword
S_ENGL / ITAL                 English / Italian word
S_PLACENAME                   general place name

S_REG_AT / CH / ST            regionalism for AT/CH/ST
S_REG-2_ST                    secondary regionalism for ST
S_REAL_AT / CH / DE / ST      reality description for AT/CH/DE/ST
S_PLACENAME_ST                place name for ST
S_PERSONFIRSTNAME_ST          person first name for ST
S_PERSONLASTNAME_ST           person last name for ST
U_REG_VAR / REF               regionalisms for variety/reference corpus [analogous for all list types for user input]

EXP<5                         expected frequency < 5 for uni-grams
EXP<1                         expected frequency < 1 for bi-grams

S_REG_ST_CMPD                 compound containing regionalism for ST [analogous for AT/CH]
S_REAL_ST_CMPD                compound containing reality description for ST [analogous for AT/CH/DE]
S_PLACENAME_ST_CMPD           compound containing place name for ST
U_REG_VAR_CMPD                compound containing regionalism for variety corpus [analogous for reality descriptions and place names for variety and reference corpus]

1:S_STOPWORD                  first part of bi-gram being stopword
2:U_PLACE_REF                 second part of bi-gram being place name for reference corpus [analogous for all filter types]
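How such filter tags might be assigned to a uni-gram can be sketched in Python as follows. This is an illustration under several assumptions rather than the system's implementation: tag names are written with underscores, compounds are detected by a plain case-insensitive substring test, only the list types named in COMPOUND_PREFIXES produce compound tags, and the function name is hypothetical.

```python
from typing import Dict, List, Set

# Assumption: list types for which a *_CMPD tag is produced.
COMPOUND_PREFIXES = ("S_REG", "S_REAL", "S_PLACENAME", "U_REG", "U_REAL", "U_PLACE")


def filter_tags(phenomenon: str, expected: float,
                lists: Dict[str, Set[str]], min_expected: float = 5.0) -> List[str]:
    """Return the filter tags that apply to one uni-gram phenomenon."""
    tags: List[str] = []
    lowered = phenomenon.lower()
    for tag, words in lists.items():
        if phenomenon in words:
            tags.append(tag)
        elif tag.startswith(COMPOUND_PREFIXES) and any(w.lower() in lowered for w in words):
            # Crude compound check: a listed word occurs inside the phenomenon.
            tags.append(tag + "_CMPD")
    if expected < min_expected:
        tags.append(f"EXP<{min_expected:g}")
    return tags
```

For bi-grams, the threshold would be 1 instead of 5, and each part would be checked separately to obtain the position-prefixed tags listed in table 2.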


3.3 Instructions for usage

The usage interfaces are implemented both on the command line as a Perl script call and via a user-friendly GUI.

3.3.1 Tool usage on the command line

The script visavis.perl is called by the following command:

perl visavis.perl -l (uni|uni_NN|uni_V|uni_ADJ|uni_ADV|bi|syn) -f (wrd|lem) -e -r -i (t|c)

It supports several parameters as described in the following. The option -l chooses the level of extraction and comparison, uni referring to the lexical level (uni-grams, which can also be restricted to certain single word classes), bi to the bi-gram level for co-occurrences, and syn to exemplary syntactic pattern analysis. With the option -f, the user decides if either word forms or lemmas will be considered in the analysis process. The option -e takes optional corpus-external knowledge for the annotation and the linguistic filtering as input, e. g.

i.) one-word-per-line external knowledge lists with specified file names (see below) and
ii.) single-word lexicon entries with PoS¹¹ and lemma information in the format: wordform PoS lemma.

The following file names are expected for corpus-external lexical lists of acknowledged peculiarities provided by the user:

¹¹ STTS tags have to be used here (http://www.sfb441.uni-tuebingen.de/a5/codii/info-stts-en.xhtml).

file name                 description
U_REG_VAR.txt             list of regionalisms for variety corpus
U_REG-2_VAR.txt           list of secondary regionalisms for variety corpus
U_REAL_VAR.txt            list of reality descriptions for variety corpus
U_PLACE_VAR.txt           list of place names for variety corpus
U_PERS_FIRST_VAR.txt      list of person first names for variety corpus
U_PERS_LAST_VAR.txt       list of person last names for variety corpus
U_REG_REF.txt             list of regionalisms for reference corpus [analogous for all list types for reference corpus]
U_LEX.txt                 list of lexicon entries
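The lexicon entry format (wordform PoS lemma) can be validated before a system run with a small Python sketch such as the following. The separator is assumed to be whitespace, which this documentation does not state explicitly, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def parse_lexicon(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each word form to its (PoS tag, lemma) pair.

    Raises ValueError for lines that do not consist of exactly three fields.
    """
    entries: Dict[str, Tuple[str, str]] = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip empty lines
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected 'wordform PoS lemma', got {line!r}")
        wordform, pos, lemma = parts
        entries[wordform] = (pos, lemma)
    return entries
```

Checking the entries in advance helps to avoid annotation corrections being silently ignored because of a malformed line.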

With the option -r, the directory containing the registry files for CQP corpus input is specified. Finally, the option -i is used to indicate if the two input corpora given as arguments are in text format (t; UTF-8 encoding) or if they are available as corpora indexed for CQP (c). The user receives the following output in the terminal window:

i.) general data, e. g. regarding the comparability of the two corpora, and
ii.) the locations of the comparison files to view or further process the output data.
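For scripted use, the documented options can be assembled and checked in a small Python wrapper, sketched below. The option names and their permitted values are taken from the command shown above; the position of the two corpus arguments after the options and their order (variety first, reference second) are assumptions, the function name is hypothetical, and the option -e is left out because the form of its argument is not specified here.

```python
from typing import List, Optional

LEVELS = {"uni", "uni_NN", "uni_V", "uni_ADJ", "uni_ADV", "bi", "syn"}
FORMS = {"wrd", "lem"}
INPUT_TYPES = {"t", "c"}


def build_command(level: str, form: str, input_type: str,
                  variety: str, reference: str,
                  registry: Optional[str] = None) -> List[str]:
    """Assemble the argument list for a visavis.perl call (not executed here)."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if form not in FORMS:
        raise ValueError(f"unknown form: {form}")
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input type: {input_type}")
    cmd = ["perl", "visavis.perl", "-l", level, "-f", form]
    if input_type == "c":
        if registry is None:
            raise ValueError("CQP corpus input needs a registry directory (-r)")
        cmd += ["-r", registry]
    # Assumption: the two corpora follow the options, variety corpus first.
    cmd += ["-i", input_type, variety, reference]
    return cmd
```

The resulting list could be passed to subprocess.run from the source directory once the argument order has been confirmed against the released code.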

3.3.2 Tool usage via the graphical user interface

To make access to Vis-À-Vis easier and more user-friendly, a GUI has been implemented so that the software is also accessible to users without computational expertise. Users can upload the data they want to analyse via this GUI and are guided through the options for the comparison process, step by step, up to the download of their analysis results. The GUI also provides a direct link to Korpus Südtirol and to the C4 search interface for the verification of phenomena and for context search.

In figure 2, the starting page of the Vis-À-Vis GUI is shown. After the corpus choice, lexical lists of external knowledge can be uploaded, if available. In the second step, the analysis and comparison level is chosen, and after the Vis-À-Vis run, the result data can be viewed and downloaded for further processing.

4 Conclusion

In this documentation, the prototype system Vis-À-Vis for the comparison of regional varieties has been presented from a practical perspective. Possible further lines of research will now be mentioned to serve as developer instructions. The steps that can directly be taken with the current implementation of the prototype Vis-À-Vis are e. g. the integration of different corpus comparability measures and association measures for the comparison of phenomena. Also the linguistic filtering can be straightforwardly extended – or e. g. adapted for studies on other German varieties. Other co-occurrence or syntactic patterns which are of special interest for a certain study can be directly integrated into the code, according to the source code documentation. In a further series of steps, the more indirect enhancements could be tackled, such as the integration of more complex annotation tools, which will also need a more complex corpus query processor, e. g. for dependency-parsed data. Also the lower and the higher levels of linguistic description will need more thorough adaptation with different linguistic approaches – on the one hand, e. g. for phonological studies, on the other hand, e. g. for semantic or pragmatic investigations.

Figure 2: Vis-À-Vis GUI corpus upload or specification [screenshot not reproduced]

References

Abel, Andrea; Anstein, Stefanie & Petrakis, Stefanos (2009): ‘Die Initiative Korpus Südtirol’. Linguistik online; vol. 38(2).

Ammon, Ulrich (1995): Die deutsche Sprache in Deutschland, Österreich und der Schweiz. Das Problem der nationalen Varietäten; Berlin / New York: De Gruyter.

Anstein, Stefanie (2013): Computational Approaches to the Comparison of Regional Variety Corpora – Prototyping a Semi-automatic System for German; Ph.D. thesis; Institute for Natural Language Processing (IMS), University of Stuttgart.

Dittmann, Henrik; Ďurčo, Matej; Geyken, Alexander; Roth, Tobias & Zimmer, Kai (2012): ‘Korpus C4 – a distributed corpus of German varieties’. In: Multilingual Corpora and Multilingual Corpus Analysis, ed. by Schmidt, Thomas & Wörner, Kai; vol. 14 of Hamburg Studies in Multilingualism (HSM); Amsterdam: John Benjamins.

Egger, Kurt & Lanthaler, Franz (eds.) (2001): Die deutsche Sprache in Südtirol. Einheitssprache und regionale Vielfalt; Vienna / Bolzano: Folio.

Evert, Stefan (2005): The CQP query language tutorial; Tech. rep.; Institute for Natural Language Processing, University of Stuttgart. www.ims.uni-stuttgart.de/projekte/CorpusWorkbench, last accessed 2012-10-21.

Schmid, Helmut (1994): ‘Probabilistic Part-Of-Speech Tagging Using Decision Trees’. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP); Manchester. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger, last accessed 2012-10-26.


B Gold standard list of Südtirolisms

The following lists are taken from Abfalterer (2007, pp. 263-266 for primary Südtirolisms, pp. 266-268 for secondary Südtirolisms; see also section 2.2.1).

B.1 Primary Südtirolisms

Abgeordnetenkammer
ACI
Ajourierung
Alminteressentschaft
Alpini
Amtsentschädigung
Anas
angereift
Ansässigkeitsgemeinde
Äpfelklauber
Äpfelklauberin
Aranciata
Arbeitsrechtsberater
Arbeitsrechtsberaterin
Assessorat
Arztambulatorium
Aufenthaltsabgabe
Aufenthaltssteuer
aufschenken
Aufstiegsanlage
Ausgeher
Ausspeisungsdienst
Außendienstvergütung
Autobüchlein
Bankkoordinaten
Barist
Baristin
Basisarzt
Basisärztin
Bauarbeiterkasse
Baukonzession
Bauschau
Befähigungsdiplom
Begehrensantrag
Begleitgeld
Begleitzulage
Behindertenerzieher
Behindertenerzieherin
Bereichsvertrag
Berggemeinschaft
Bergträger
Bergträgerin
Berufsalbum
Berufsertüchtigung
Berufsverzeichnis
Betriebsmensa
Bewertungskonferenz
Bezirksgemeinschaft
Breatl
Bürgerkunde
Burggräfler
Burggräflerin
Carabiniere
Collaudo
Dableiber
Dableiberin
Dienstaltersrente
Dienstbewertungskomitee
Diensteinheit
Direktionsauftrag
Direktionssitz
Direktionsverteilungsplan
Direktivrat
Dopolavoro
Dreijahreshaushalt
Dringlichkeitsbesetzung
Durchführungsplan
Einheitstext
Einheitsvordruck
Einvernehmenskomitee
Ersatzerklärung
Erstwohnung
Erweiterungszone
Familienbogen
Familiengeld
Familienstandsbogen
Federkielsticker
Federkielstickerin
Feuernacht
Feuerungsanlagenmonteur
Feuerungsanlagenmonteurin
Feuerwehrhalle
Finanzkaserne
Finanzpolizei
Finanzwache
Forstwache
Frauenhausdienst
Funktionsrang
Fürsorgeinstitut
Garni
Gebäudekataster
Gebietsplan
Geburtengeld
Gedächtnisspende
Gehaltsamt
Gehaltsposition
Gemeindeausschuss
Gemeindeimmobiliensteuer
Gemeindenverband
Gemeindevertrauensmann
Gerstsuppe
Gesetzesanzeiger
Gesetzgebungskommission
Gesuchsmuster
Gewerbeoberschule
Griffelschachtel
Grundbuchsamt
Grundfürsorge
Halbmittag
Handelsoberschule
Handwerkerzone
Hausfrauenrente
Hauspflegedienst
Heimatferne
Heimatpflegeverband
Heimgehilfe
Heimgehilfin
Heizkesselwärter
Heizkesselwärterin
Hochunserfrauentag
Hydrauliker
Hydraulikerin
Immobiliensteuer
INAIL
Industriellenverband
Infermerie
INPS
Integrationslehrperson
Interessentschaft
Interpellanz
Italienmeister
Italienmeisterin
Italienmeisterschaft
Kandl
Karterle
Kaufleuteverband
Kenntafel
Klauber
Klauberin
Kleinsparergesetz
Kontingenzzulage
Kübelmilch
Laborfonds
Landesarbeitskommission
Landesaußenamt
Landesbeirat
Landesforstkorps
Landesgesundheitsdienst
Landesgruppenvorsteher
Landesgruppenvorsteherin
Landesraumordnungskommission
Landeszulage
Lebensminimum
Leps
Littorina
Lohnbuch
Lokalfinanz
Lyzeum
Magenzucker
Mappenauszug
Marende
marenden
Maturadiplom
Mebo
Mehrwertsteueramt
Merkantilmuseum
Mindestkultureinheit
Miniwohnung
Mobilitätsliste
Normalstatut
Notspur
Oberboden
Obstgenossenschaft
Obstmagazin
Optant
Optantin
Ortsaugenschein
Paarl
Partikularsekretär
Partikularsekretärin
Pflichtschuldirektion
Postkontokorrent
Proporz
provinzfremd
Ragioniere
Rechnungsrevisor
Rechnungsrevisorin
Regierungskommissär
Regierungskommissärin
Regierungskommissariat
Regionalassessor
Regionalassessorin
Regionalausschuss
Regionalrat
Regionalratsabgeordnete
Regionalregierung
Registersteuer
Rücksiedler
Rücksiedlerin
Saltner
Saltnerin
Sanitätsausweis
Sanitätsbetrieb
Sanitätseinheit
Schatzamtsdienst
Schlichtungsrichter
Schlichtungsrichterin
Schlussbewertung
Schlutzer
Schriftleiter
Schriftleiterin
Schulamtsleiter
Schulamtsleiterin
Schuldiener
Schuldienerin
Schülercharta
Schulfürsorge
Schüttelbrot
schwarzplenten
Selbstbescheinigung
sequestrieren
Sonderbetrieb
Sonderergänzungszulage
Sonderstatut
Sozialgenossenschaft
Sozialhilfekraft
Spatzlen
Speltenzaun
Spezialisierungsdiplom
Spezialisierungstitel
Sprachgruppenzugehörigkeitserklärung
Sprengelsitz
Spuma
Staatsadvokatur
Staatsbürgerschaftsbescheinigung
Stadtviertelrat
Stammrolle
Stammrollenlehrer
Stammrollenlehrerin
Stellenvorbehalt
stempelfrei
Stempelpapier
Steuerbeistand
Steuerbeistandszentrum
Steuereinbehalt
Steuerrolle
Studientitel
SVP
Superalkoholika
Tirtlen
törggelen
Torggl
Überwachungsgericht
Überwachungsrichter
Überwachungsrichterin
Unterhaltungssteuer
Vertrauensarzt
Vertrauensärztin
Verwaltungspolizei
Verwaltungsrekurs
Verzierungsbildhauer
Vidimierung
Vogelesalat
Volkswohnbau
Vollzugsausschuss
Vormas
Vorschlagbrot
Vorzugstitel
Waal
Wanderhändler
Weinhof
Wertschöpfungssteuer
Wiedergewinnungsplan
Wiere
Wohnbauinstitut
Wohnkubatur
Zampone
Zirkularscheck
Zugbahnhof
Zulassungstitel
Zweisprachigkeitsnachweis
Zweisprachigkeitsprüfung

B.2 Extract of secondary Südtirolisms

Abänderungsantrag
Abspüler
Abspülerin
Advokat
Advokatin
Agronom
Agronomin
Alleinerzieher
Alleinerzieherin
Almhütte
Ambulanz
Amtsdirektor
Amtstafel
ansässig
Anschlagtafel
Ansuchen
Arbeitsberater
Arbeitsberaterin
Arbeitsinspektor
Arbeitsinspektorin
ausbeinen
Ausgezeichnet
Autobahnhof
Autobahnviadukt
Autosteuer
Bancomat
Bancomatkarte
Baugesuch
Bauleitplan
Baulos
Baumfest
Baumschuler
Baustellenleiter
Beistrich
Bereitschaftspolizei
Berglandwirtschaft
Bergzoo
Berufsbefähigung
Berufskategorie
Beschlussantrag
bestäuben
Bezirkspräsident
Bezirkspräsidentin
Bildungsguthaben
Brettljause
Bürgerhaus
Bürgerheim
Bürgerkapelle
Buschwald
Buße
Bußgeld
Coupon
Dentalhygieniker
Dentalhygienikerin
Diätist
Diätistin
Diplom
Dringlichkeitsverfahren
Durchführungsbestimmung
Durchzugsverkehr
Ehrenschutz
Einreichprojekt
[...]

C Online resources

All resources listed in the following have last been accessed on 2012-10-19.

• corpora and initiatives
  – Austrian Academy Corpus (AAC)
    http://www.aac.ac.at
  – C4 initiative
    http://www.korpus-c4.org
  – CLARIN, D-SPIN, and WebLicht initiatives
    http://www.clarin.eu
    https://weblicht.sfs.uni-tuebingen.de
  – Corpus del Español
    http://www.corpusdelespanol.org
  – Corpus de Referência do Português Contemporâneo (CRPC)
    http://www.clul.ul.pt
  – Digitales Wörterbuch der Deutschen Sprache (DWDS)
    http://www.dwds.de
  – International Corpus of English (ICE)
    http://ice-corpora.net/ice
  – Korpus Südtirol
    http://www.korpus-suedtirol.it
  – Schweizer Text Korpus
    http://www.schweizer-textkorpus.ch
  – Sketch Engine
    http://www.sketchengine.co.uk
  – Trésor de la Langue Française informatisé (TLFi)
    http://atilf.atilf.fr/tlf.htm
  – Variantengrammatik project
    http://www.variantengrammatik.net
  – Web As Corpus (WaC) initiative
    http://wacky.sslmit.unibo.it/doku.php?id=corpora

• tools and related resources
  – Corpus Workbench (CWB)
    http://cwb.sourceforge.net
  – Perl programming language
    http://www.perl.org
  – PHP programming language
    http://www.php.net
  – Stuttgart-Tübingen tag set (STTS)
    http://www.sfb441.uni-tuebingen.de/a5/codii/info-stts-en.xhtml
  – TreeTagger
    http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger

Bibliography

Abel, Andrea (2009): ‘Mehrsprachigkeit in Südtirol: Alles bleibt anders?’. Geographische Rundschau; vol. 3: pp. 20–27.

Abel, Andrea & Anstein, Stefanie (2008): ‘Approaches to Computational Lexicography for German Varieties’. In: Proceedings of the 13th EURALEX International Conference; pp. 251–260; Barcelona.

Abel, Andrea & Anstein, Stefanie (2011): ‘Korpus Südtirol - Varietätenlinguistische Untersuchungen’. In: Abel & Zanin (2011); pp. 29–53.

Abel, Andrea; Anstein, Stefanie & Petrakis, Stefanos (2009): ‘Die Initiative Korpus Südtirol’. Linguistik online; vol. 38(2).

Abel, Andrea; Anstein, Stefanie & Ties, Isabella (2008): ‘Ansätze einer intralingualen kontrastiven Korpuslinguistik – aufgezeigt am Beispiel administrativer Rechtstexte aus Deutschland, Österreich und Südtirol’. In: Heller (2008); pp. 243–270.

Abel, Andrea; Stuflesser, Mathias & Putz, Magdalena (eds.) (2006): Multilingualism across Europe: Findings, Needs, Best Practices; Bolzano: EURAC research.

Abel, Andrea; Stuflesser, Mathias & Voltmer, Leonhard (eds.) (2007): Aspects of Multilingualism in European Border Regions: Insights and Views from Alsace, Eastern Macedonia and Thrace, the Lublin Voivodeship and South Tyrol; Bolzano: EURAC research.

Abel, Andrea; Vettori, Chiara & Forer, Doris (2010): ‘Learning the Neighbour’s Language: The Many Challenges in Achieving a Real Multilingual Society. The Case of Second Language Acquisition in the Minority–Majority Context of South Tyrol’. European Yearbook of Minority Issues; vol. 9.

Abel, Andrea & Zanin, Renata (eds.) (2011): Korpusinstrumente in Lehre und Forschung; Bolzano: Bolzano University Press.

Abfalterer, Heidemaria (2007): Der Südtiroler Sonderwortschatz aus plurizentrischer Sicht: lexikalisch-semantische Besonderheiten im Standarddeutsch Südtirols; vol. 72 of Germanistische Reihe; Innsbruck: Innsbruck University Press.

Abney, Steven (1996): ‘Statistical Methods and Linguistics’. In: The Balancing Act: Combining Symbolic and Statistical Approaches to Language, ed. by Klavans, Judith & Resnik, Philip; pp. 1–26; Cambridge: MIT Press.

Adolphs, Peter (2008): ‘Acquiring a Poor Man’s Inflectional Lexicon for German’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Agresti, Alan (1996): An Introduction to Categorical Data Analysis; New York: John Wiley and Sons.

Ahmad, Khurshid; Davies, Andrea; Fulford, Heather & Rogers, Margaret (1992): ‘What is a term? The semi-automatic extraction of terms from text’. In: Translation Studies - An Interdiscipline, ed. by Snell-Hornby, Mary; Pöchhacker, Franz & Kaindl, Klaus; vol. 2 of Benjamins Translation Library; pp. 267–278; Amsterdam / Philadelphia: John Benjamins.

Aijmer, Karin (2009): ‘Parallel and comparable corpora’. In: Lüdeling & Kytö (2009); pp. 275–291.

Amar, Muriel; David, Sophie; Panckhurst, Rachel & Whistlecroft, Lisa (2008): ‘Classification Procedures for Software Evaluation’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Ammon, Ulrich (1986): ‘Explikation der Begriffe ’Standardvarietät’ und ’Standardsprache’ auf normtheoretischer Grundlage’. In: Sprachlicher Substandard, ed. by Holtus, Günter & Radtke, Edgar; vol. 36 of Konzepte der Sprach- und Literaturwissenschaft; pp. 1–63; Tübingen: Niemeyer.

Ammon, Ulrich (1995a): Die deutsche Sprache in Deutschland, Österreich und der Schweiz. Das Problem der nationalen Varietäten; Berlin / New York: De Gruyter.

Ammon, Ulrich (1995b): ‘Vorschläge zur Typologie nationaler Zentren und nationaler Varianten bei plurinationalen Sprachen - am Beispiel des Deutschen’. In: Muhr et al. (1995); pp. 111–120.

Ammon, Ulrich (1997): Nationale Varietäten des Deutschen; vol. 19 of Studienbibliographien Sprachwissenschaft; Heidelberg: Julius Groos.

Ammon, Ulrich (2001): ‘Die Plurizentrizität des Deutschen, oder: Wer sagt, was gutes Deutsch ist?’. In: Egger & Lanthaler (2001); pp. 11–26.

Ammon, Ulrich (2005): ‘Pluricentric and Divided Languages’. In: Sociolinguistics: An International Handbook of the Science of Language and Society, ed. by Ammon, Ulrich; Dittmar, Norbert; Mattheier, Klaus J. & Trudgill, Peter; vol. 2 of Handbücher zur Sprach- und Kommunikationswissenschaft 3; pp. 1536–1543; Berlin / New York: De Gruyter; 2nd ed.

Ammon, Ulrich; Bickel, Hans; Ebner, Jakob; Esterhammer, Ruth; Gasser, Markus; Hofer, Lorenz; Kellermeier-Rehbein, Birte; Löffler, Heinrich; Mangott, Doris; Moser, Hans; Schläpfer, Robert; Schloßmacher, Michael; Schmidlin, Regula & Vallaster, Günter (2004): Variantenwörterbuch des Deutschen. Die Standardsprache in Österreich, der Schweiz und Deutschland sowie in Liechtenstein, Luxemburg, Ostbelgien und Südtirol; Berlin: De Gruyter.

Anderwald, Lieselotte & Szmrecsanyi, Benedikt (2009): ‘Corpus linguistics and dialectology’. In: Lüdeling & Kytö (2009); pp. 1126–1140.

Andor, József (2004): ‘The master and his performance: An Interview with Noam Chomsky’. Intercultural Pragmatics; vol. 1(1): pp. 93–111.

Anstein, Stefanie (2007): ‘Korpuslinguistische Fallstudien zum Südtiroler Standardschriftdeutsch - das Projekt ’Korpus Südtirol’’. Linguistik online; vol. 32. http://www.linguistik-online.org/32_07/anstein.pdf, last accessed 2012-10-14.

Anstein, Stefanie & Glaznieks, Aivars (2011): ‘Comparing Geographical and Learner Varieties on the Basis of Corpora’. In: Comparative Methods and Analysis in the Language Science. Proceedings of the 3rd edition of JéTou; pp. 179–188; Toulouse. http://jetou2011.free.fr/ARTICLES/S4A2.pdf, last accessed 2012-10-17.

Anstein, Stefanie; Oberhammer, Margit & Petrakis, Stefanos (2011): ‘Korpus Südtirol - Aufbau und Abfrage’. In: Abel & Zanin (2011); pp. 15–28.

Archer, Dawn; Culpeper, Jonathan & Davies, Matthew (2009): ‘Pragmatic annotation’. In: Lüdeling & Kytö (2009); pp. 613–642.

ASTAT (ed.) (2002): Vornamen in Südtirol / Nomi di battesimo in provincia di Bolzano 2001; Autonome Provinz Bozen-Südtirol / Provincia Autonoma di Bolzano-Alto Adige: ASTAT – Landesinstitut für Statistik / Istituto provinciale di statistica.

ASTAT (ed.) (2005): Nachnamen in Südtirol / Cognomi in provincia di Bolzano 2004; Autonome Provinz Bozen-Südtirol / Provincia Autonoma di Bolzano-Alto Adige: ASTAT – Landesinstitut für Statistik / Istituto provinciale di statistica.

Aston, Guy (ed.) (2001): Learning with corpora; Houston: Athelstan.

Atkins, Beryl T. & Zampolli, Antonio (eds.) (1994): Computational Approaches to the Lexicon; Oxford: Oxford University Press.

Atterer, Michaela & Schütze, Hinrich (2008): ‘An Inverted Index for Storing and Retrieving Grammatical Dependencies’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Atwell, Eric (2009): ‘Development of tag sets for part-of-speech tagging’. In: Lüdeling & Kytö (2009); pp. 501–526.

Baayen, Harald (2001): Word frequency distributions; vol. 18 of Text, speech and language technology; Dordrecht: Kluwer.

Bacelar do Nascimento, Maria F.; Gonçalves, José B.; Pereira, Luísa; Estrela, Antónia; Pereira, Alfonso; Santos, Rui & Oliveira, Sancho M. (2006): ‘The African Varieties of Portuguese: Compiling Comparable Corpora and Analyzing Data-derived Lexicon’. In: Proceedings of the 5th International Language Resources and Evaluation Conference (LREC); pp. 1791–1794; Genoa. http://www.lrec-conf.org/proceedings/lrec2006, last accessed 2012-10-26.

Baker, Paul (2010): Sociolinguistics and Corpus Linguistics; Edinburgh: Edinburgh University Press.

Banerjee, Satanjeev & Pedersen, Ted (2003): ‘Extended Gloss Overlaps as a Measure of Semantic Relatedness’. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence; pp. 805–810; Acapulco.

Barber, Katherine (ed.) (2004): The Canadian Oxford Dictionary; Don Mills: Oxford University Press; 2nd ed.

Baroni, Marco (2009): ‘Distributions in text’. In: Lüdeling & Kytö (2009); pp. 803–821.

Baroni, Marco & Bernardini, Silvia (2006): ‘A new approach to the study of translationese: Machine-learning the difference between original and translated text’. Literary and Linguistic Computing; vol. 21(3): pp. 259–274.

Baroni, Marco & Evert, Stefan (2009): ‘Statistical methods for corpus exploitation’. In: Lüdeling & Kytö (2009); pp. 777–803.

Baroni, Marco & Kilgarriff, Adam (2006): ‘Large linguistically-processed Web corpora for multiple languages’. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL); Trento. http://aclweb.org/anthology-new/E/E06/E06-2001.pdf, last accessed 2012-10-31.

Bartsch, Sabine (2004): Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence; Tübingen: Narr.

Baur, Siegfried (2006): ‘Über die Schwierigkeit, die Sprache des Nachbarn zu lernen’. In: Abel et al. (2007).

Belica, Cyril; Keibel, Holger; Kupietz, Marc; Perkuhn, Rainer & Vachková, Marie (2009): ‘Putting corpora into perspective: Rethinking synchronicity in corpus linguistics’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-31.

Belica, Cyril & Steyer, Kathrin (2008): ‘Korpusanalytische Zugänge zu sprachlichem Usus’. In: Beiträge zur bilingualen Lexikographie, ed. by Vachková, Marie; pp. 7–24; Prague: Charles University Prague.

Bell, Edward J. L.; Berridge, Damon & Rayson, Paul (2009): ‘Measuring style with the authorship ratio: an invariant metric of lexical similarity’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-25.

Bender, Emily M.; Flickinger, Dan; Good, Jeff & Sag, Ivan A. (2004): ‘Montage: Leveraging Advances in Grammar Engineering, Linguistic Ontologies, and Mark-up for the Documentation of Underdescribed Languages’. In: Proceedings of the LREC Workshop on First Steps for the Documentation of Minority Languages: Computational Linguistic Tools for Morphology, Lexicon and Corpus Compilation; Lisbon. http://www.lrec-conf.org/proceedings/lrec2004, last accessed 2012-10-21.

Bergenholtz, Henning & Tarp, Sven (eds.) (1995): Manual of Specialised Lexicography. The preparation of specialised dictionaries; Amsterdam: John Benjamins.

Bertagnolli, Judith (1994): Das “unfeine” Hochdeutsch in Südtirol. Mit der Auswertung einer soziolinguistischen Spracherhebung in Bozen; Master’s thesis; University of Vienna.

Besch, Werner; Knoop, Ulrich; Putschke, Wolfgang & Wiegand, Herbert E. (eds.) (1983): Dialektologie. Ein Handbuch zur deutschen und allgemeinen Dialektforschung; Berlin / New York: De Gruyter.

Biber, Douglas (1990): ‘Methodological issues regarding corpus-based analyses of linguistic variation’. Literary and Linguistic Computing; vol. 5(4): pp. 257–269.

Biber, Douglas (1993): ‘Representativeness in corpus design’. Literary and Linguistic Computing; vol. 8(4): pp. 243–257.

Biber, Douglas (2006): University Language: A Corpus-based Study of Spoken and Written Registers; Amsterdam: John Benjamins.

Biber, Douglas (2009): ‘Multi-dimensional approaches’. In: Lüdeling & Kytö (2009); pp. 822–855.

Biber, Douglas & Burges, Jená (2001): ‘Historical shifts in the language of women and men: gender differences in dramatic dialogue’. In: Variation in English: Multi-Dimensional Studies, ed. by Biber, Douglas & Conrad, Susan; pp. 157–170; London: Longman.

Biber, Douglas & Jones, James K. (2009): ‘Quantitative methods in corpus linguistics’. In: Lüdeling & Kytö (2009); pp. 1286–1304.

Biber, Hanno; Breiteneder, Evelyn & Mörth, Karlheinz (2002): ‘The Austrian Academy Corpus – Digital Resources and Textual Studies’. In: Proceedings of the 14th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH); pp. 16–17; Tübingen.

Bickel, Hans (2000): ‘Das Internet als Quelle für die Variationslinguistik’. In: Häcki Buhofer (2000); pp. 111–124.

Bickel, Hans (2006): ‘Das Internet als linguistisches Korpus’. Linguistik online; vol. 28(3): pp. 71–83.

Bickel, Hans (2012): ‘Deutsche Varietäten in Internetkorpora – eine kleine Entwicklungsgeschichte’. In: Tranel – Travaux neuchâtelois de linguistique. La linguistique de corpus - de l’analyse quantitative à l’interprétation qualitative, ed. by Elmiger, Daniel & Kamber, Alain; vol. 55; pp. 7–23; Neuchâtel: Université de Neuchâtel.

Bickel, Hans; Gasser, Markus; Häcki Buhofer, Annelies; Hofer, Lorenz & Schön, Christoph (2009): ‘Schweizer Text Korpus – Theoretische Grundlagen, Korpusdesign und Abfragemöglichkeiten’. Linguistik online; vol. 39: pp. 5–31.

Biemann, Chris; Quasthoff, Uwe; Heyer, Gerhard & Holz, Florian (2008): ‘ASV Toolbox: a Modular Collection of Language Exploration Tools’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Bird, Steven & Loper, Edward (2004): ‘NLTK: The Natural Language Toolkit’. In: Proceedings of the ACL Demonstration Session; pp. 214–217; Barcelona.

Björkelund, Anders; Bohnet, Bernd; Hafdell, Love & Nugues, Pierre (2010): ‘A high-performance syntactic and semantic dependency parser’. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING): Demonstration Volume; pp. 33–36; Beijing.

Bloomfield, Leonard (1933): Language; New York: Henry Holt and Co.

Bohnet, Bernd (2009): ‘Efficient Parsing of Syntactic and Semantic Dependency Structures’. In: Proceedings of the Conference on Natural Language Learning (CoNLL); pp. 67–72.

Bohnet, Bernd (2010): ‘Top Accuracy and Fast Dependency Parsing is not a Contradiction.’. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING); Beijing.

Boulanger, Jean-Claude (ed.) (1993): Dictionnaire québécois d’aujourd’hui; Paris: Le Robert; 2nd ed.

Branco, António; Costa, Francisco; Martins, Pedro; Nunes, Filipe; Silva, João & Silveira, Sara (2008): ‘LX-Service: Web Services of Language Technology for Portuguese’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Brants, Thorsten (2000): ‘TnT – A Statistical Part-of-Speech Tagger’. In: Proceedings of the 6th Applied Natural Language Processing Conference (ANLP); pp. 224–231; Seattle.

Brew, Chris & Moens, Marc (2000): ‘Data-Intensive Linguistics’. University of Edinburgh, HCRC Language Technology Group.

vor der Brueck, Tim; Helbig, Hermann & Leveling, Johannes (2008): The readability checker DeLite; Tech. rep.; Faculty for Mathematics and Informatics, FernUniversität Hagen. http://pi7.fernuni-hagen.de/brueck/papers/DeLite_techreport.pdf, last accessed 2012-10-24.

Brunner, Annelen & Steyer, Kathrin (2007): ‘Corpus-driven study of multi-word expressions based on collocations from a very large corpus’. In: Proceedings of the 4th Corpus Linguistics Conference; Birmingham. http://corpus.bham.ac.uk/corplingproceedings07/paper/182_Paper.pdf, last accessed 2012-10-21.

Bullinaria, John A. (2008): ‘Semantic Categorization Using Simple Word Co-occurrence Statistics’. In: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, ed. by Baroni, Marco; Evert, Stefan & Lenci, Alessandro; Hamburg.

Burger, Harald (2007): Phraseologie. Eine Einführung am Beispiel des Deutschen; Grundlagen der Germanistik; Berlin: Erich Schmidt Verlag; 3rd ed.

Bürki, Andreas (2009): ‘Multi-Word Sequences in Motion: on the extent and speed of diachronic change in patterns of German’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-21.

Burnard, Lou & Aston, Guy (1998): The BNC handbook: exploring the British National Corpus; Edinburgh: Edinburgh University Press.

Bussmann, Hadumod (ed.) (1996): Routledge Dictionary of Language and Linguistics; London: Routledge.

Cabré, M. Teresa; Bagot, Rosa Estopà & Platresi, Jordi Vivaldi (2001): ‘Automatic term detection: A review of current systems’. In: Recent Advances in Computational Terminology, ed. by Bourigault, Didier; Jacquemin, Christian & L’Homme, Marie-Claude; vol. 2 of Natural Language Processing; pp. 53–88; Philadelphia: John Benjamins.

Cabré, M. Teresa & Estopà, Rosa (2003): ‘On the units of specialised meaning used in professional communication’. Terminology Science Research; vol. 14: pp. 16–27.

Cahill, Lynne (2008): ‘Using Similarity Measures to Extend the LinGO Lexicon’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Carstensen, Kai-Uwe; Ebert, Christian; Endriss, Cornelia; Jekat, Susanne; Klabunde, Ralf & Langer, Hagen (eds.) (2004): Computerlinguistik und Sprachtechnologie; Berlin: Spektrum Akademischer Verlag; 2nd ed.

Cavagnoli, Stefania & Nardin, Francesca (1999): ‘Second language acquisition in South Tyrol: Difficulties, motivations, expectations’. Multilingua; vol. 18(1): pp. 17–45.

Chambers, J. K.; Trudgill, Peter & Schilling-Estes, Natalie (eds.) (2002): Handbook of Language Variation and Change; Oxford: Blackwell.

Charniak, Eugene (1993): Statistical Language Learning; Cambridge: MIT Press.

Charniak, Eugene & McDermott, Drew (1985): Introduction to artificial intelligence; Reading: Addison-Wesley.

Chiao, Yun-Chuang & Zweigenbaum, Pierre (2002): ‘Looking for candidate translational equivalents in specialized, comparable corpora’. In: Proceedings of the 19th International Conference on Computational Linguistics (COLING); pp. 1208–1212; Taipei.

Chiarcos, Christian; Dipper, Stefanie; Götze, Michael; Leser, Ulf; Lüdeling, Anke; Ritz, Julia & Stede, Manfred (2008): ‘A flexible framework for integrating annotations from different tools and tag sets’. Traitement automatique des langues; vol. 49(2): pp. 217–246.

Chomsky, Noam (1962): ‘Explanatory Models in Linguistics’. In: Logic, Methodology and Philosophy of Science, ed. by Nagel, Ernest; Suppes, Patrick & Tarski, Alfred; pp. 528–550; Stanford: Stanford University Press.

Christ, Oliver (1994): ‘A Modular and Flexible Architecture for an Integrated Corpus Query System’. In: Proceedings of the 3rd Conference on Computational Lexicography and Text Research (COMPLEX); pp. 23–32; Budapest.

Church, Kenneth W. (2000): ‘Empirical estimates of adaptation: the chance of two Noriegas is closer to p/2 than p²’. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING); pp. 180–186; Saarbrücken.

Church, Kenneth W. & Hanks, Patrick (1990): ‘Word association norms, mutual information, and lexicography’. Computational Linguistics; vol. 16(1): pp. 22–29.

Church, Kenneth W. & Mercer, Robert L. (1993): ‘Introduction to the Special Issue on Computational Linguistics Using Large Corpora’. Computational Linguistics; vol. 19(1): pp. 1–24.

de Cillia, Rudolf & Wodak, Ruth (2006): Ist Österreich ein ”deutsches” Land? Sprachenpolitik und Identität in der zweiten Republik; Innsbruck: Studienverlag.

Clark, Jonathan; Frederking, Robert & Levin, Lori (2008): ‘Toward Active Learning in Data Selection: Automatic Discovery of Language Features During Elicitation’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Clyne, Michael G. (1984): Language and society in the German-speaking countries; Melbourne: Cambridge University Press.

Clyne, Michael G. (ed.) (1992): Pluricentric languages: Differing norms in different nations; Berlin / New York: De Gruyter.

Cochran, William G. (1954): ‘Some methods of strengthening the common chi-square tests’. Biometrics; vol. 10: pp. 417–451.

Colleselli, Toni; Lanthaler, Franz & Mazza, Aldo (2009): Schian isch’s gwesn. Nove lezioni per comprendere il tedesco di tutti i giorni in Alto Adige Südtirol; Merano: alpha beta.

Cowie, Anthony P. (1978): ‘The place of illustrative material and collocations in the design of a learner’s dictionary’. In: In Honour of A. S. Hornby, ed. by Strevens, Peter; pp. 127–139; Oxford: Oxford University Press.

Cramer, Irene & Schulte im Walde, Sabine (2006): Computerlinguistik und Sprachtechnologie; vol. 36 of Studienbibliografien Sprachwissenschaft; Tübingen: Julius Groos; 2nd ed. http://www.coli.uni-saarland.de/projects/stud-bib, last accessed 2012-10-31.

Crossley, Scott A. & Louwerse, Max M. (2007): ‘Multi-dimensional register classification using bigrams’. International Journal of Corpus Linguistics; vol. 12: pp. 453–478.

Cruse, D. Alan; Hundsnurscher, Franz; Job, Michael & Lutzeier, Peter R. (eds.) (2005): Lexikologie / Lexicology; vol. 21(1-2) of Handbücher zur Sprach- und Kommunikationswissenschaft; Berlin / New York: De Gruyter.

Culy, Christopher & Lyding, Verena (2010): ‘Visualizations for exploratory corpus and text analysis’. In: Proceedings of the 2nd International Conference on Corpus Linguistics (CILC); pp. 257–268; A Coruña.

Cunningham, Hamish; Maynard, Diana; Bontcheva, Kalina & Tablan, Valentin (2002): ‘GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications’. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL); Philadelphia.

Curzan, Anne (2009): ‘Historical corpus linguistics and evidence of language change’. In: Lüdeling & Kytö (2009); pp. 1091–1109.

Daille, Béatrice (1994): ‘Study and Implementation of combined techniques for Automatic Extraction of Terminology’. In: Proceedings of the ACL Workshop ’The Balancing Act: Combining Symbolic and Statistical Approaches to Language’; Las Cruces.

Dalgaard, Peter (2002): Introductory Statistics with R; New York: Springer.

Daudaravičius, Vidas & Marcinkevičienė, Rūta (2004): ‘Gravity counts for the boundaries of collocations’. International Journal of Corpus Linguistics; vol. 9(2): pp. 321–348.

De Cillia, Rudolf (2006): ‘Varietätenreiches Deutsch. Deutsch als plurizentrische Sprache und DaF-Unterricht’. In: Begegnungssprache Deutsch – Motivation, Herausforderung, Perspektiven, ed. by Krumm, Hans-Jürgen & Portmann-Tselikas, Paul; pp. 51–65; Innsbruck / Vienna / Bolzano: StudienVerlag.

De Roeck, Anne; Sarkar, Avik & Garthwaite, Paul H. (2004): ‘Defeating the Homogeneity Assumption: some findings on the distribution of very ’frequent terms’’. In: Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data (JADT); pp. 282–294; Louvain-la-Neuve. www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2004/pdf/JADT_026.pdf, last accessed 2012-10-21.

Del Gratta, Riccardo; Bartolini, Roberto; Caselli, Tommaso; Monachini, Monica; Soria, Claudia & Calzolari, Nicoletta (2008): ‘UFRA: a UIMA-based Approach to Federated Language Resource Architecture’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-31.

Denoual, Etienne (2006): ‘A method to quantify corpus similarity and its application to quantifying the degree of literality in a document’. International Journal of Technology and Human Interaction; vol. 2(11): pp. 51–66.

Dittmann, Henrik; Ďurčo, Matej; Geyken, Alexander; Roth, Tobias & Zimmer, Kai (2012): ‘Korpus C4 – a distributed corpus of German varieties’. In: Multilingual Corpora and Multilingual Corpus Analysis, ed. by Schmidt, Thomas & Wörner, Kai; vol. 14 of Hamburg Studies in Multilingualism (HSM); Amsterdam: John Benjamins.

Doherty, Monika (2006): Structural Propensities; Amsterdam: John Benjamins.

Dormeyer, Ricarda; Fischer, Ingrid & Weber Russell, Sylvia (2005): ‘A Lexicon for Metaphors and Idioms’. In: Semantik im Lexikon, ed. by Langer, Stefan & Schnorbusch, Daniel; pp. 203–221; Tübingen: Narr.

Dorow, Beate (2007): A Graph Model for Words and their Meanings; Ph.D. thesis; Institute for Natural Language Processing (IMS), University of Stuttgart.

Dossena, Marina & Lass, Roger (eds.) (2004): Methods and Data in English Historical Dialectology; Bern: Peter Lang.

Drach, Erich (1937): Grundgedanken der deutschen Satzlehre; Darmstadt: Wissenschaftliche Buchgesellschaft; 4th ed.

Drosdowski, Günther & Augst, Gerhard (eds.) (1984): Grammatik der deutschen Gegenwartssprache; Mannheim: Dudenverlag; 4th ed.

Duffner, Rolf & Näf, Anton (2006): ‘Digitale Textdatenbanken im Vergleich’. Linguistik online; vol. 28(3): pp. 7–22. www.linguistik-online.de/28_06/duffnerNaef.html, last accessed 2012-10-21.

Dunning, Ted (1993): ‘Accurate methods for the statistics of surprise and coincidence’. Computational Linguistics; vol. 19(1): pp. 61–74.

Dybkjær, Laila; Hemsen, Holmer & Minker, Wolfgang (eds.) (2007): Evaluation of Text and Speech Systems; Dordrecht: Springer.

Dürscheid, Christa; Elspaß, Stephan & Ziegler, Arne (2011): ‘Grammatische Variabilität im Gebrauchsstandard – das Projekt ”Variantengrammatik des Standarddeutschen”’. In: Grammatik und Korpora 2009 / Grammar & Corpora 2009, ed. by Konopka, Marek; Kubczak, Jacqueline; Mair, Christian; Štícha, František & Waßner, Ulrich H.; vol. 1 of Corpus Linguistics and Interdisciplinary Perspectives on Language; pp. 123–140; Tübingen: Narr.

Ebner, Jakob (1980): Wie sagt man in Österreich? Wörterbuch der österreichischen Besonderheiten; Mannheim / Vienna / Zurich: Dudenverlag; 2nd ed.

Ebner, Jakob (1987): ‘Österreichisches Deutsch’. Informationen zur Deutschdidaktik; vol. 1(2): pp. 149–162.

Ebner, Jakob (1995): ‘Vom Beleg zum Wörterbuchartikel - Lexikographische Probleme zum österreichischen Deutsch’. In: Muhr et al. (1995); pp. 179–196.

Egger, Kurt (1979): ‘Morphologische und syntaktische Interferenzen an der deutsch-italienischen Sprachgrenze in Südtirol’. In: Standardsprache und Dialekte in mehrsprachigen Gebieten Europas, ed. by Ureland, P. Sture; pp. 55–104; Tübingen: Niemeyer.

Egger, Kurt (2001): Sprachlandschaft im Wandel. Südtirol auf dem Weg zur Mehrsprachigkeit; soziolinguistische und psycholinguistische Aspekte der Ein- und Mehrsprachigkeit; Bolzano: Athesia.

Egger, Kurt & Heller, Karin (1997): ‘Deutsch - Italienisch’. In: Kontaktlinguistik - ein internationales Handbuch zeitgenössischer Forschung, ed. by Goebl, Hans; Nelde, Peter H.; Starý, Zdeněk & Wölck, Wolfgang; vol. 2; pp. 1350–1357; Berlin / New York: De Gruyter.

Egger, Kurt & Lanthaler, Franz (eds.) (2001): Die deutsche Sprache in Südtirol. Einheitssprache und regionale Vielfalt; Vienna / Bolzano: Folio.

Eichinger, Ludwig M. (1996): ‘Südtirol’. In: Handbuch der mitteleuropäischen Sprachminderheiten, ed. by Hinderling, Robert & Eichinger, Ludwig M.; Tübingen: Narr.

Evert, Stefan (2004): ‘The statistical analysis of morphosyntactic distributions’. In: Proceedings of the 4th International Language Resources and Evaluation Conference (LREC); pp. 1539–1542; Lisbon. http://purl.org/stefan.evert/PUB/Evert2004b.pdf, last accessed 2012-10-25.

Evert, Stefan (2005a): The Statistics of Word Cooccurrences: Word Pairs and Collocations; Ph.D. thesis; Institute for Natural Language Processing (IMS), University of Stuttgart. http://elib.uni-stuttgart.de/opus/volltexte/2005/2371, last accessed 2012-10-21.

Evert, Stefan (2005b): The CQP query language tutorial; Tech. rep.; Institute for Natural Language Processing, University of Stuttgart. www.ims.uni-stuttgart.de/projekte/CorpusWorkbench, last accessed 2012-10-21.

Evert, Stefan (2006): ‘How random is a corpus? The library metaphor’. Zeitschrift für Anglistik und Amerikanistik; vol. 54(2): pp. 177–190.

Evert, Stefan (2009): ‘Corpora and collocations’. In: Lüdeling & Kytö (2009); pp. 1212–1248.

Evert, Stefan; Heid, Ulrich & Lezius, Wolfgang (2000): ‘Methoden zum qualitativen Vergleich von Signifikanzmaßen zur Kollokationsidentifikation’. In: Proceedings of the 5th Conference on Natural Language Processing (KONVENS); pp. 215–220; Ilmenau.

Evert, Stefan; Heid, Ulrich; Säuberlich, Bettina; Debus-Gregor, Esther & Scholze-Stubenrecht, Werner (2004a): ‘Supporting corpus-based dictionary updating’. In: Proceedings of the 11th EURALEX International Conference; pp. 255–264; Lorient.

Evert, Stefan; Heid, Ulrich & Spranger, Kristina (2004b): ‘Identifying Morphosyntactic Preferences in Collocations’. In: Proceedings of the 4th International Language Resources and Evaluation Conference (LREC); pp. 907–910; Lisbon. http://www.lrec-conf.org/proceedings/lrec2004, last accessed 2012-10-31.

Evert, Stefan & Krenn, Brigitte (2001): ‘Methods for the qualitative evaluation of lexical association measures’. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL); pp. 188–195; Toulouse.

Evert, Stefan & Krenn, Brigitte (2005): ‘Using small random samples for the manual evaluation of statistical association measures’. Computer Speech and Language; vol. 19(4): pp. 450–466.

Facchinetti, Roberta & Rissanen, Matti (eds.) (2006): Corpus-based Studies of Diachronic English; Bern: Peter Lang.

Fandrych, Christian & Salverda, Reinier (eds.) (2007): Standard, Variation und Sprachwandel in germanischen Sprachen; Tübingen: Narr.

Fazly, Afsaneh & Stevenson, Suzanne (2006): ‘Automatically constructing a lexicon of verb phrase idiomatic combinations’. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL); pp. 337–344; Trento.

Feiler, Michael (1997): ‘South Tyrol: Model for the Resolution of Minority Conflicts?’. Review of International Affairs; vol. 28: pp. 10–35.

Fellbaum, Christiane (ed.) (2007): Idioms And Collocations: Corpus-based Linguistic And Lexicographic Studies; Research in Corpus and Discourse; London: Continuum International Publishing Group.

Ferraresi, Adriano; Zanchetta, Eros; Baroni, Marco & Bernardini, Silvia (2008): ‘Introducing and evaluating ukWaC, a very large Web-derived corpus of English’. In: Proceedings of the 4th Web as Corpus Workshop (WAC); Marrakech. http://webascorpus.sourceforge.net/download/WAC4_2008_Proceedings.pdf, last accessed 2012-10-26.

Firth, John R. (1957): Papers in Linguistics 1934-1951; London: Oxford University Press.

Fitschen, Arne & Gupta, Piklu (2009): ‘Lemmatising and morphological tagging’. In: Lüdeling & Kytö (2009); pp. 552–563.

Fleischer, Wolfgang (1997): Phraseologie der deutschen Gegenwartssprache; Niemeyer Studienbuch; Tübingen: Niemeyer; 2nd ed.

Földes, Csaba (2005): ‘Die deutsche Sprache und ihre Architektur. Aspekte von Vielfalt, Variabilität und Regionalität: variationstheoretische Überlegungen’. Studia Linguistica XXIV Acta Universitatis Wratislaviensis; pp. 37–59.

Forer, Rosa & Moser, Hans (1988): ‘Beobachtungen zum westösterreichischen Sonderwortschatz’. In: Das österreichische Deutsch, ed. by Wiesinger, Peter; vol. 12 of Schriften zur deutschen Sprache in Österreich; pp. 189–209; Vienna / Cologne / Graz: Böhlau.

Forkl, Yves (2010): Zur digitalen Zukunft der Kollokationslexikographie. Perspektiven der Präsentation von Wissen über usuelle französische und deutsche Wortverbindungen in gedruckten und elektronischen Wörterbüchern; Ph.D. thesis; University of Erlangen-Nuremberg.

Fraas, Claudia (2001): ‘Usuelle Wortverbindungen als sprachliche Manifestation von Bedeutungswissen. Theoretische Begründung, methodischer Ansatz und empirische Befunde’. In: Lexikon und Text, ed. by Nikula, Henrik & Drescher, Robert; pp. 41–66; Vaasa: Saxa. http://www.tu-chemnitz.de/phil/medkom/mk/fraas/bedeutungswissen.pdf, last accessed 2012-10-21.

Frege, Gottlob (1892): ‘Über Sinn und Bedeutung’. Zeitschrift für Philosophie und philosophische Kritik; pp. 25–50.

Fritzinger, Fabienne & Heid, Ulrich (2009): ‘Automatic Grouping of Morphologically Related Collocations’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-25.

Fritzinger, Fabienne; Kisselew, Max; Heid, Ulrich; Madsack, Andreas & Schmid, Helmut (2009): ‘Werkzeuge zur Extraktion von signifikanten Wortpaaren als Webservice’. In: GSCL-Symposium Sprachtechnologie und eHumanities, ed. by Hoeppner, Wolfgang; pp. 32–43.

Fung, Pascale & McKeown, Kathleen (1997): ‘Finding Terminology Translations from Non-parallel Corpora’. In: Proceedings of the 5th Annual Workshop on Very Large Corpora; pp. 192–202; Hong Kong.

Gast, Volker (ed.) (2006): The Scope and Limits of Corpus Linguistics. Empiricism in the Description and Analysis of English; vol. 54(2) of Zeitschrift für Anglistik und Amerikanistik (Special Issue); Würzburg: Königshausen & Neumann.

Geyken, Alexander (2007): ‘The DWDS corpus: a reference corpus for the German language of the twentieth century’. In: Fellbaum (2007); pp. 23–40.

Geyken, Alexander; Didakowski, Jörg & Siebert, Alexander (2008): ‘Generation of Word Profiles on the Basis of a Large and Balanced German Corpus’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Geyken, Alexander & Hanneforth, Thomas (2006): ‘TAGH: A Complete Morphology for German based on Weighted Finite State Automata’. In: Proceedings of the 5th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP); pp. 55–66; Helsinki. http://www.dwds.de/dokumente/Geyken_Hanneforth_fsmnlp_2005.pdf, last accessed 2012-10-26.

Giacomozzi, Laura (1982): ‘Dialektbedingte Schwierigkeiten von Schülern aus dem Südtiroler Unterland. Ergebnisse einer Fehleranalyse’. In: Dialekt und Hochsprache in der Schule, ed. by Egger, Kurt; pp. 75–110; Bolzano: Athesia.

Gleim, Rüdiger & Mehler, Alexander (2010): ‘Computational Linguistics for Mere Mortals. Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities’. In: Proceedings of the 7th International Language Resources and Evaluation Conference (LREC); pp. 204–210; Valletta. http://www.lrec-conf.org/proceedings/lrec2010, last accessed 2012-10-21.

Gleim, Rüdiger; Mehler, Alexander; Waltinger, Ulli & Menke, Peter (2009): ‘eHumanities Desktop – An extensible Online System for Corpus Management and Analysis’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-31.

Goebl, Hans; Nelde, Peter H.; Stary, Zdenek & Wölck, Wolfgang (eds.) (1996): Kontaktlinguistik / Contact Linguistics / Linguistique de contact; vol. 12(1-2) of Handbücher zur Sprach- und Kommunikationswissenschaft; Berlin / New York: De Gruyter.

Gojun, Anita; Heid, Ulrich; Weißbach, Bernd; Loth, Carola & Mingers, Insa (2012): ‘Adapting and evaluating a generic term extraction tool’. In: Proceedings of the 8th International Language Resources and Evaluation Conference (LREC); Istanbul. http://www.lrec-conf.org/proceedings/lrec2012, last accessed 2012-10-24.

Granger, Sylviane (1994): ‘The Learner Corpus: a revolution in applied linguistics’. English Today; vol. 39: pp. 25–29.

Granger, Sylviane (1997): ‘Automated retrieval of passives from native and learner corpora: precision and recall’. Journal of English Linguistics; vol. 25(4): pp. 365–374.

Greenbaum, Sidney (ed.) (1996): Comparing English Worldwide: The International Corpus of English; Oxford: Clarendon Press.

Grefenstette, Gregory & Tapanainen, Pasi (1994): ‘What is a word, what is a sentence? Problems of tokenization’. In: Proceedings of the 3rd Conference on Computational Lexicography and Text Research (COMPLEX); pp. 79–87; Budapest.

Gries, Stefan T. (2005): ‘Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff’. Corpus Linguistics and Linguistic Theory; vol. 1(2): pp. 277–294.

Gries, Stefan T. (2006): ‘Some proposals towards more rigorous corpus linguistics’. Zeitschrift für Anglistik und Amerikanistik; vol. 54(2): pp. 191–202.

Gries, Stefan T. (2007): ‘Exploring variability within and between corpora: some methodological considerations’. Corpora; vol. 1(2): pp. 109–151.

Gries, Stefan T. (2008): ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics; vol. 13(4): pp. 403–437.

Gries, Stefan T. (2009a): ‘Bigrams in registers, domains, and varieties: a bigram gravity approach to the homogeneity of corpora’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-25.

Gries, Stefan T. (2009b): Quantitative corpus linguistics with R: a practical introduction; New York: Routledge.

Gries, Stefan T. (2009c): ‘What is corpus linguistics?’. Language and Linguistics Compass; vol. 3: pp. 1–17.

Gries, Stefan T. (2012): ‘50-something years of work on collocations: what is or should be next ...’. International Journal of Corpus Linguistics; vol. 18(1).

Gries, Stefan T. & Mukherjee, Joybrato (2010): ‘Lexical gravity across varieties of English: an ICE-based study of n-grams in Asian Englishes’. International Journal of Corpus Linguistics; vol. 15(4): pp. 520–548.

Grouin, Cyril (2008): ‘Certification and Cleaning up of a Text Corpus: Towards an Evaluation of the ’Grammatical’ Quality of a Corpus’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Grzega, Joachim (2000): ‘On the Description of National Varieties: Examples from (German and Austrian) German and (English and American) English’. Linguistik online; vol. 7.

Guthrie, David; Guthrie, Louise & Wilks, Yorick (2008): ‘An Unsupervised Probabilistic Approach for the Detection of Outliers in Corpora’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Haas, Walter (2000): ‘Die deutschsprachige Schweiz’. In: Die viersprachige Schweiz, ed. by Bickel, Hans & Schläpfer, Robert; Reihe Sprachlandschaft 25; Aarau: Sauerländer; 2nd ed.

Häcki Buhofer, Annelies (ed.) (2000): Vom Umgang mit sprachlicher Variation. Soziolinguistik, Dialektologie, Methoden und Wissenschaftsgeschichte; vol. 80 of Basler Studien zur deutschen Sprache und Literatur, Festschrift für Heinrich Löffler zum 60. Geburtstag; Tübingen / Basel: Francke.

Hägi, Sara (2006): Nationale Varietäten im Unterricht Deutsch als Fremdsprache; Duisburger Arbeiten zur Sprach- und Kulturwissenschaft; Frankfurt, Main: Lang.

Halliday, Michael A. & Webster, Jonathan (2006): On Language and Linguistics; London: Continuum International Publishing Group.

Halliday, Michael A. K.; Teubert, Wolfgang; Yallop, Colin & Čermáková, Anna (2004): Lexicology and Corpus Linguistics. An Introduction; Open linguistics series; London: Continuum International Publishing Group.

Handwerker, Brigitte; Madlener, Karin & Möller, Max (2004): ‘Wortbedeutung und Konstruktionsbedeutung. Die Adjektiv-Partizip-Opposition aus der Perspektive des Deutschen als Fremdsprache’. In: Linguistik für die Fremdsprache Deutsch, ed. by Lüger, Heinz-Helmut & Rothenäusler, Rainer; vol. 7 of bzf-Sonderheft; pp. 85–120; Landau: Verlag für Empirische Pädagogik.

Harris, Zellig (1954): ‘Distributional structure’. Word; vol. 10(23): pp. 146–162.

Hausmann, Franz J. (1979): ‘Un Dictionnaire des Collocations est-il Possible?’. In: Travaux de Linguistique et de Littérature; vol. 17; pp. 187–195; University of Strasbourg.

Hausmann, Franz J. (1984): ‘Wortschatzlernen ist Kollokationslernen. Zum Lehren und Lernen französischer Wortverbindungen’. Praxis des neusprachlichen Unterrichts; vol. 31: pp. 395–406.

Hausmann, Franz J. (1985a): ‘Kollokationen im deutschen Wörterbuch. Ein Beitrag zur Theorie des lexikographischen Beispiels’. In: Lexikographie und Grammatik. Akten des Essener Kolloquiums zur Grammatik im Wörterbuch, ed. by Bergenholtz, Henning & Mugdan, Joachim; Lexicographica Series Maior 3; pp. 118–129; Tübingen: Niemeyer.

Hausmann, Franz J. (1985b): ‘Lexikographie’. In: Handbuch der Lexikologie, ed. by Schwarze, Christoph & Wunderlich, Dieter; pp. 367–411; Königstein: Athenäum.

Hausmann, Franz J. (2004): ‘Was sind eigentlich Kollokationen?’. In: Steyer (2004b); pp. 309–334.

Hausmann, Franz J.; Reichmann, Oskar; Wiegand, Herbert E. & Zgusta, Ladislav (eds.) (1989-1991): Wörterbücher. Ein internationales Handbuch zur Lexikographie / Dictionaries. An International Encyclopedia of Lexicography / Dictionnaires. Encyclopédie internationale de lexicographie; vol. 5(1-3) of Handbücher zur Sprach- und Kommunikationswissenschaft; Berlin / New York: De Gruyter.

Heid, Ulrich (1998): ‘Towards a corpus-based dictionary of German noun-verb collocations’. In: Proceedings of the 8th EURALEX International Conference; pp. 301–312; Liège.

Heid, Ulrich (2001): ‘Collocations in Sublanguage Texts: Extraction from Corpora’. In: Handbook of Terminology Management, ed. by Wright, Sue Ellen & Budin, Gerhard; pp. 788–808; Amsterdam / Philadelphia: John Benjamins.

Heid, Ulrich (2008): ‘Computational Phraseology: an overview’. In: Phraseology – An interdisciplinary perspective, ed. by Granger, Sylviane & Meunier, Fanny; pp. 337–360; Amsterdam / Philadelphia: John Benjamins.

Heid, Ulrich (2009): ‘Corpus linguistics and lexicography’. In: Lüdeling & Kytö (2009); pp. 131–153.

Heid, Ulrich (2011): ‘Korpusbasierte Beschreibung der Variation bei Kollokationen: Deutschland – Österreich – Schweiz – Südtirol’. In: Sprachliches Wissen zwischen Lexikon und Grammatik, ed. by Engelberg, Stefan; Holler, Anke & Proost, Kristel; Jahrbuch 2010; Institut für Deutsche Sprache, Mannheim: De Gruyter.

Heid, Ulrich; Evert, Stefan; Fitschen, Arne; Freese, Marion & Vögele, Andreas (2001): Term Candidate Extraction in DOT; Tech. rep.; Institute for Natural Language Processing (IMS), University of Stuttgart.

Heid, Ulrich; Fritzinger, Fabienne; Hinrichs, Erhard; Hinrichs, Marie & Zastrow, Thomas (2010): ‘Term and Collocation Extraction by Means of Complex Linguistic Web Services’. In: Proceedings of the 7th International Language Resources and Evaluation Conference (LREC); Valletta. http://www.lrec-conf.org/proceedings/lrec2010, last accessed 2012-10-31.

Heid, Ulrich & Gojun, Anita (2012): ‘Term candidate extraction for terminography and CAT: an overview of TTC’. In: Proceedings of the 15th EURALEX International Conference; Oslo.

Heid, Ulrich & Prinsloo, Daan J. (2008): ‘Collocational False Friends: Description and Treatment in Bilingual Dictionaries’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Heid, Ulrich & Ritz, Julia (2005): ‘Extracting collocations and their contexts from corpora’. In: Proceedings of the 8th Conference on Computational Lexicography and Text Research (COMPLEX); pp. 107–121; Budapest.

Heid, Ulrich; Säuberlich, Bettina; Debus-Gregor, Esther & Scholze-Stubenrecht, Werner (2004): ‘Tools for upgrading printed dictionaries by means of corpus-based lexical acquisition’. In: Proceedings of the 4th International Language Resources and Evaluation Conference (LREC); pp. 419–423; Lisbon. http://www.lrec-conf.org/proceedings/lrec2004, last accessed 2012-10-31.

Heid, Ulrich & Weller, Marion (2008): ‘Tools for Collocation Extraction: Preferences for Active vs. Passive’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Heid, Ulrich & Weller, Marion (2010): ‘Corpus-derived data on German multiword expressions for lexicography’. In: Proceedings of the 14th EURALEX International Conference; Leeuwarden.

Heine, Bernd & Kuteva, Tania (2005): Language contact and grammatical change; vol. 3 of Cambridge Approaches to Language Contact; Cambridge: Cambridge University Press.

Heller, Dorothee (ed.) (2008): Formulierungsmuster in deutscher und italienischer Fachkommunikation. Intra- und interlinguale Perspektiven; Linguistic Insights; Bern: Peter Lang.

Herdan, Gustav (1964): Quantitative linguistics; London: Butterworths.

Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk & Speelman, Dirk (2008): ‘Modelling Word Similarity: an Evaluation of Automatic Synonymy Extraction Algorithms’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Hickey, Raymond (2010): The Handbook of Language Contact; Blackwell handbooks in linguistics; Chichester: Wiley-Blackwell.

Hinkel, Eli (ed.) (2005): Handbook of research in second language teaching and learning; New Jersey: Lawrence Erlbaum Associates.

Hofer, Lorenz & Schmidlin, Regula (2003): ‘Phraseology and Lexicography: fixed expressions in a dictionary of national variants of Standard German’. In: Phraseological Units: basic concepts and their application, ed. by Allerton, David John; Nesselhauf, Nadja & Skandera, Paul; pp. 171–184; Basel: Schwabe.

Hofland, Knut & Johansson, Stig (1982): Word frequencies in British and American English; London: Longman.

Höhle, Tilman N. (1986): ‘Der Begriff ”Mittelfeld”, Anmerkungen über die Theorie der topologischen Felder’. In: Akten des 7. Internationalen Germanisten-Kongresses Göttingen, ed. by Weiss, Walter; Wiegand, Herbert E. & Reis, Marga; pp. 329–340; Tübingen: Niemeyer.

Hornero, Ana M.; Luzón, María J. & Murillo, Silvia (eds.) (2006): Corpus Linguistics. Applications for the Study of English; Bern: Peter Lang.

Huesmann, Anette (1998): Zwischen Dialekt und Standard: Empirische Untersuchung zur Soziolinguistik des Varietätenspektrums im Deutschen; vol. 199 of Reihe Germanistische Linguistik; Tübingen: Niemeyer.

Hundt, Marianne (2009): ‘Text corpora’. In: Lüdeling & Kytö (2009); pp. 168–186.

Hunston, Susan (2009): ‘Collection strategies and design decisions’. In: Lüdeling & Kytö (2009); pp. 154–168.

Hyland, Ken & Bondi, Marina (eds.) (2006): Academic Discourse Across Disciplines; Bern: Peter Lang.

Ide, Nancy (2007): ‘Annotation Science: From Theory to Practice and Use’. In: Data Structures for Linguistics Resources and Applications, ed. by Rehm, Georg; Witt, Andreas & Lemnitzer, Lothar; Proceedings of the Biennial GLDV Conference; Tübingen: Narr.

Ignatova, Kateryna & Abel, Andrea (2008): ‘The Use of Context Vectors for Word Sense Disambiguation within the ELDIT Dictionary’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Ivanova, Kremena; Heid, Ulrich; Schulte im Walde, Sabine; Kilgarriff, Adam & Pomikalek, Jan (2008): ‘Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-25.

Jabbari, Sanaz; Allison, Ben & Guthrie, Louise (2008): ‘Using a Probabilistic Model of Context to Detect Word Obfuscation’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Janda, Richard D. & Joseph, Brian D. (eds.) (2004): The Handbook of Historical Linguistics; Oxford: Blackwell.

Jannidis, Fotis (1999): ‘Was ist Computerphilologie?’. In: Jahrbuch für Computerphilologie, ed. by Eibl, Karl; Deubel, Volker & Jannidis, Fotis; vol. 1; pp. 39–60; Paderborn: mentis.

Johansson, Stig (2009): ‘Some aspects of the development of corpus linguistics in the 1970s and 1980s’. In: Lüdeling & Kytö (2009); pp. 33–52.

Johansson, Stig & Hofland, Knut (eds.) (1989): Frequency Analysis of English Vocabulary and Grammar, Based on the LOB Corpus; Oxford: Clarendon Press.

Jurafsky, Daniel & Martin, James H. (2000): Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition; Upper Saddle River: Prentice Hall.

Käding, Friedrich W. (ed.) (1897): Häufigkeitswörterbuch der deutschen Sprache; Berlin: E. S. Mittler & Sohn.

Kanerva, Pentti; Kristoferson, Jan & Holst, Anders (2000): ‘Random Indexing of Text Samples for Latent Semantic Analysis’. In: Proceedings of the 22nd Annual Conference of the Cognitive Science Society; pp. 103–106; Philadelphia.

Karlsson, Fred (2009): ‘Early generative linguistics and empirical methodology’. In: Lüdeling & Kytö (2009); pp. 14–32.

Keibel, Holger & Belica, Cyril (2007): ‘CCDB: A Corpus-Linguistic Research and Development Workbench’. In: Proceedings of the 4th Corpus Linguistics Conference; Birmingham.

Kermes, Hannah (2003): Off-line (and On-line) Text Analysis for Computational Lexicography; Ph.D. thesis; Institute for Natural Language Processing (IMS), University of Stuttgart.

Kermes, Hannah (2009): ‘Syntactic preprocessing’. In: Lüdeling & Kytö (2009); pp. 598–612.

Kermes, Hannah & Heid, Ulrich (2003): ‘Using chunked corpora for the acquisition of collocations and idiomatic expressions’. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (COMPLEX); Budapest. www.ims.uni-stuttgart.de/~kermes/papers/03 COMPLEX KermesHeid.pdf, last accessed 2012-10-26.

Kettemann, Bernhard & Marko, Georg (2000): ‘Teaching and Learning by Doing Corpus Analysis’. In: Proceedings of the 4th International Conference on Teaching and Language Corpora; Graz.

Kilgarriff, Adam (1997a): ‘Putting Frequencies in the Dictionary’. International Journal of Lexicography; vol. 10(2): pp. 135–155.

Kilgarriff, Adam (1997b): ‘Using word frequency lists to measure corpus homogeneity and similarity between corpora’. In: Proceedings of the ACL SIGDAT Workshop on Very Large Corpora; pp. 231–245; Beijing / Hong Kong.

Kilgarriff, Adam (2001): ‘Comparing corpora’. International Journal of Corpus Linguistics; vol. 6(1): pp. 1–37.

Kilgarriff, Adam (2005): ‘Language is never, ever, ever, random’. Corpus Linguistics and Linguistic Theory; vol. 1(2): pp. 263–276.

Kilgarriff, Adam (2006): ‘Collocationality (and how to measure it)’. In: Proceedings of the 12th EURALEX International Conference; Torino.

Kilgarriff, Adam (2009): ‘Simple Maths for Keywords’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-25.

Kilgarriff, Adam; Kovar, Vojtech; Krek, Simon; Srdanovic, Irena & Tiberius, Carole (2010): ‘A Quantitative Evaluation of Word Sketches’. In: Proceedings of the 14th EURALEX International Conference; Leeuwarden.

Kilgarriff, Adam; Rychlý, Pavel; Smrz, Pavel & Tugwell, David (2008): ‘The Sketch Engine’. In: Practical Lexicography: A Reader; pp. 297–306; Oxford: Oxford University Press. http://www.fit.vutbr.cz/research/view pub.php?id=8557, last accessed 2012-10-26.

Kilgarriff, Adam & Tugwell, David (2002): ‘Sketching words’. In: Lexicography and Natural Language Processing. A Festschrift in Honour of B. T. S. Atkins, ed. by Corréard, Marie-Hélène; pp. 125–137; Manchester: St. Jerome Publishing.

Kirk, John (1996): ‘ICE and Teaching’. In: Greenbaum (1996); pp. 227–238.

Klatt, Stefan (2004): ‘A High Quality Partial Parser for Annotating German Text Corpora’. In: Proceedings of the 4th International Language Resources and Evaluation Conference (LREC); Lisbon. http://www.lrec-conf.org/proceedings/lrec2004, last accessed 2012-10-26.

Klatt, Stefan (2006): ‘A Corpus-based Approach to the Interpretation of Unknown Words with an Application to German’. In: Proceedings of the 5th International Language Resources and Evaluation Conference (LREC); Genoa. http://www.lrec-conf.org/proceedings/lrec2006, last accessed 2012-10-26.

Klein, Julie T. (1990): Interdisciplinarity: History, Theory, and Practice; Detroit: Wayne State University.

Klein, Wolfgang (2004): ‘Das Digitale Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)’. In: Sprachkultur und Lexikographie, ed. by Scharnhorst, Jürgen; pp. 281–311; Berlin / Frankfurt, Main: Peter Lang.

Klein, Wolfgang & Geyken, Alexander (2010): ‘Das Digitale Wörterbuch der deutschen Sprache (DWDS)’. In: Lexicographica: International Annual for Lexicography, ed. by Gouws, Rufus H.; Heid, Ulrich; Schierholz, Stefan J.; Schweickard, Wolfgang; Wiegand, Herbert E. & Wolski, Werner; vol. 26; pp. 79–96; Berlin: De Gruyter.

Koehn, Philipp & Knight, Kevin (2002): ‘Learning a translation lexicon from monolingual corpora’. In: Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition; pp. 9–16; Philadelphia.

Kramer, Johannes (1981): Deutsch und Italienisch in Südtirol; vol. 23 of Reihe Siegen. Beiträge zur Literatur- und Sprachwissenschaft; Heidelberg: Universitätsverlag Winter.

Krenn, Brigitte (2000): The usual suspects: Data-oriented models for the identification and representation of lexical collocations; Ph.D. thesis; DFKI and Saarland University; Saarbrücken.

Krenn, Brigitte & Samuelsson, Christer (1997): ‘The linguist’s guide to statistics. Don’t panic’. http://citeseer.ist.psu.edu/krenn97linguists.html, last accessed 2012-10-26.

Kupietz, Marc; Belica, Cyril; Keibel, Holger & Witt, Andreas (2010): ‘The German Reference Corpus DeReKo: A primordial sample for linguistic research’. In: Proceedings of the 7th International Language Resources and Evaluation Conference (LREC); pp. 1848–1854; Valletta. http://www.lrec-conf.org/proceedings/lrec2010, last accessed 2012-10-24.

Křen, Michal & Hlaváčová, Jaroslava (2008): ‘Corpus as a Means for Study of Lexical Usage Changes’. In: Proceedings of the 13th EURALEX International Conference; pp. 437–447; Barcelona.

Kübler, Sandra; Maier, Wolfgang; Rehbein, Ines & Versley, Yannick (2008): ‘How to Compare Treebanks’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-24.

Lamiroy, Béatrice; Klein, Jean; Leclère, Christian & Labelle, Jacques (2003): ‘Les expressions verbales figées dans quatre variétés de français: le projet BFQS’. Cahiers de lexicologie; vol. 83(2): pp. 153–172.

Lancia, Franco (2007): ‘Word Co-occurrence and Similarity in Meaning’. In: Mind as Infinite Dimensionality, ed. by Soresi, Salvatore & Valsiner, Jaan; Rome: Edizioni Carlo Amore.

Landauer, Thomas K. & Dumais, Susan T. (1997): ‘A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge’. Psychological Review; vol. 104(2): pp. 211–240.

Langer, Alexander (1983): ‘Chancen und Hindernisse für Zweisprachigkeit in Südtirol’. In: Vergleichbarkeit von Sprachkontakten, ed. by Nelde, Peter H.; Bonn: Dümmler.

Langer, Stefan (2009): Funktionsverbgefüge und automatische Sprachverarbeitung; Linguistic Resources for Natural Language Processing; Munich: Lincom GmbH.

Lanthaler, Franz (1997): ‘Varietäten des Deutschen in Südtirol’. In: Varietäten des Deutschen: Regional- und Umgangssprachen, ed. by Stickel, Gerhard; pp. 364–383; Berlin / New York: De Gruyter.

Lanthaler, Franz (2001): ‘Zwischenregister der deutschen Sprache in Südtirol’. In: Egger & Lanthaler (2001); pp. 137–152.

Lanthaler, Franz (2008): ‘Die Deutsche Sprache in Südtirol’. tribüne zeitschrift für sprache und schreibung; vol. 2.

Lanthaler, Franz & Saxalber, Annemarie (1995): ‘Die deutsche Standardsprache in Südtirol’. In: Muhr et al. (1995); pp. 289–305.

Lapshinova-Koltunski, Ekaterina (2008): ‘Non-heads of compounds as valency bearers: extraction from corpora, classification and implication for dictionaries’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Lapshinova-Koltunski, Ekaterina & Heid, Ulrich (2008): ‘Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Lauttamus, Timo; Nerbonne, John & Wiersma, Wybo (2007): ‘Detecting Syntactic Contamination in Emigrants: The English of Finnish Australians’. SKY Journal of Linguistics; vol. 21: pp. 273–307.

Laviosa, Sara (1998): ‘Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose’. Translator’s Journal; vol. 43(4): pp. 557–570.

Lee, Lillian (1999): ‘Measures of Distributional Similarity’. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL); pp. 25–32; College Park.

Lehmberg, Timm & Wörner, Kai (2009): ‘Annotation standards’. In: Lüdeling & Kytö (2009); pp. 484–500.

Lemnitzer, Lothar & Zinsmeister, Heike (2006): Korpuslinguistik; Tübingen: Narr.

Li, Bo & Gaussier, Eric (2010): ‘Improving corpus comparability for bilingual lexicon extraction from comparable corpora.’. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING); Beijing.

Lin, Dekang (1998): ‘Automatic Retrieval and Clustering of Similar Words’. In: Proceedings of the COLING-ACL; pp. 768–774.

Lüdeling, Anke & Kytö, Merja (eds.) (2009): Corpus Linguistics. An International Handbook.; vol. 29 of Handbooks of Linguistics and Communication Science; Berlin: De Gruyter.

Ludewig, Petra (2005): Korpusbasiertes Kollokationslernen; vol. 9 of Sprache, Sprechen und Computer; Frankfurt, Main / Berlin / Bern / Vienna: Lang.

Luyckx, Kim & Daelemans, Walter (2008): ‘Personae: a Corpus for Author and Personality Prediction from Text’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Lyding, Verena (ed.) (2009): LULCL II 2008 - Proceedings of the 2nd Colloquium on Lesser Used Languages and Computer Linguistics; Bolzano; EURAC research. http://www.eurac.edu/Org/LanguageLaw/Multilingualism/Projects/LULCL II proceedings.htm, last accessed 2012-10-16.

Lyding, Verena; Anstein, Stefanie & Petrakis, Stefanos (2008): ‘Aspekte der Interdisziplinarität beim Einsatz von Korpora in der Fachtextanalyse’. In: Heller (2008); pp. 51–74.

Lüdeling, Anke (2008): ‘Mehrdeutigkeiten und Kategorisierung: Probleme bei der Annotation von Lernerkorpora’. In: Fortgeschrittene Lernervarietäten, ed. by Walter, Maik & Grommes, Patrick; pp. 119–140; Tübingen: Niemeyer.

Lüdeling, Anke; Doolittle, Seanna; Hirschmann, Hagen; Schmidt, Karin & Walter, Maik (2008): ‘Das Lernerkorpus Falko’. Deutsch als Fremdsprache; vol. 2: pp. 67–73.

Maegaard, Bente; Offersgaard, Lene; Henriksen, Lina; Jansen, Hanne; Lepetit, Xavier; Navarretta, Costanza & Povlsen, Claus (2006): ‘The MULINCO corpus and corpus platform’. In: Proceedings of the 5th International Language Resources and Evaluation Conference (LREC); pp. 2148–2153; Genoa. http://www.lrec-conf.org/proceedings/lrec2006, last accessed 2012-10-26.

Magenau, Doris (1964): Die Besonderheiten der deutschen Schriftsprache in Luxemburg und in den deutschsprachigen Teilen Belgiens; vol. 15 of Duden-Beiträge; Mannheim: Dudenverlag.

Magliana, Melissa (2000): The Autonomous Province of South Tyrol: A Model of Self-Governance?; Bolzano: EURAC research.

Mair, Christian (2009): ‘Corpora and the study of recent change in language’. In: Lüdeling & Kytö (2009); pp. 1109–1126.

Mall, Josef & Plagg, Waltraud (1990): ‘Versteht der Nordtiroler die Südtirolerin noch? Anmerkungen zum Sprachwandel in der deutschen Alltagssprache Südtirols durch den Einfluß des Italienischen’. Germanistische Linguistik Grenzdialekte; vol. 101-103: pp. 217–239.

Manning, Chris & Schütze, Hinrich (1999): Foundations of Statistical Natural Language Processing; Cambridge: MIT Press.

Markou, Markos & Singh, Sameer (2003): ‘Novelty Detection: A Review, Part I: Statistical Approaches’. Signal Processing; vol. 83: pp. 2481–2497.

Masser, Achim (1982): ‘Italienisches Wortgut im Südtiroler Deutsch – droht eine Überfremdung?’. In: Moser (1982); pp. 63–74.

Mattheier, Klaus J. (ed.) (1997): Norm und Variation; Frankfurt, Main: Lang.

McCarthy, Diana; Keller, Bill & Carroll, John (2003): ‘Detecting a Continuum of Compositionality in Phrasal Verbs’. In: Proceedings of the ACL SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment; Sapporo.

McEnery, Tony & Wilson, Andrew (2001): Corpus linguistics; Edinburgh: Edinburgh University Press; 2nd ed.

McEnery, Tony; Xiao, Richard & Tono, Yukio (2006): Corpus-Based Language Studies - An Advanced Resource Book; London: Routledge.

Meurers, Detmar & Müller, Stefan (2009): ‘Corpora and syntax’. In: Lüdeling & Kytö (2009); pp. 920–932.

Meyer, Charles F. (2009): ‘Pre-electronic corpora’. In: Lüdeling & Kytö (2009); pp. 1–13.

Meyer, Kurt (1989): Wie sagt man in der Schweiz? Wörterbuch der schweizerischen Besonderheiten; Mannheim / Vienna / Zurich: Dudenverlag.

Mitkov, Ruslan (ed.) (2003): The Oxford Handbook of Computational Linguistics; Oxford: Oxford University Press.

Moisl, Hermann (2009): ‘Exploratory multivariate analysis’. In: Lüdeling & Kytö (2009); pp. 874–899.

Moser, Hans (1975): ‘Zur deutschen Schriftsprache in Südtirol’. In: Der Donauraum; vol. 3(4) of Zeitschrift für Donauraum-Forschung; Vienna: Universitätsverlag Wilhelm Braumüller.

Moser, Hans (ed.) (1982): Zur Situation des Deutschen in Südtirol. Sprachwissenschaftliche Beiträge zu den Fragen von Sprachnorm und Sprachkontakt; vol. 13 of Innsbrucker Beiträge zur Kulturwissenschaft, Germanistische Reihe; Innsbruck: Innsbruck University Press.

Moser, Hans & Putzer, Oskar (1980): ‘Zum umgangssprachlichen Wortschatz in Südtirol: Italienische Interferenzen in der Sprache der Städte’. In: Sprache und Name in Österreich. Festschrift für Walter Steinhauser zum 95. Geburtstag, ed. by Wiesinger, Peter; vol. 6 of Schriften zur deutschen Sprache in Österreich; pp. 139–172; Vienna: Universitätsverlag Wilhelm Braumüller.

Muhr, Rudolf (1987): ‘Deutsch in Österreich - Österreichisch: Zur Begriffsbestimmung und Normfestlegung der deutschen Standardsprache in Österreich’. Grazer Arbeiten zu Deutsch als Fremdsprache und Deutsch in Österreich GRADaF; vol. 1: pp. 1–23.

Muhr, Rudolf (ed.) (1993a): Internationale Arbeiten zum Österreichischen Deutsch und seinen nachbarsprachlichen Bezügen; vol. 1 of Materialien und Handbücher zum österreichischen Deutsch und zu Deutsch als Fremdsprache; Vienna: Hölder-Pichler-Tempsky.

Muhr, Rudolf (1993b): ‘Österreichisch - Bundesdeutsch - Schweizerisch. Zur Didaktik des Deutschen als plurizentrische Sprache’. In: Muhr (1993a); pp. 108–124.

Muhr, Rudolf (1993c): ‘Pragmatische Unterschiede in der deutschsprachigen Kommunikation: Österreich – Deutschland’. In: Muhr (1993a); pp. 26–38.

Muhr, Rudolf (1995): ‘Zur Sprachsituation in Österreich und zum Begriff ”Standardsprache” in plurizentrischen Sprachen. Sprache und Identität in Österreich.’. In: Muhr et al. (1995); pp. 75–110.

Muhr, Rudolf & Schrodt, Richard (eds.) (1997): Österreichisches Deutsch und andere nationale Varietäten plurizentrischer Sprachen in Europa; Materialien und Handbücher zum österreichischen Deutsch und zu Deutsch als Fremdsprache; Vienna: Hölder-Pichler-Tempsky.

Muhr, Rudolf; Schrodt, Richard & Wiesinger, Peter (eds.) (1995): Österreichisches Deutsch. Linguistische, sozialpsychologische und sprachpolitische Aspekte einer nationalen Variante des Deutschen; Materialien und Handbücher zum österreichischen Deutsch und zu Deutsch als Fremdsprache; Vienna: Hölder-Pichler-Tempsky.

Müller, Christoph & Strube, Michael (2006): ‘Multi-level annotation of linguistic data with MMAX2’. In: Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, ed. by Braun, Sabine; Kohn, Kurt & Mukherjee, Joybrato; pp. 197–214; Frankfurt, Main: Peter Lang.

Nagy, Anna (1993): ‘Nationale Varianten der deutschen Standardsprache und die Behandlung im Deutschunterricht des Auslandes’. In: Muhr (1993a); pp. 67–75.

Nazar, Rogelio (2008): ‘Bilingual Terminology Acquisition from Unrelated Corpora’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Nazar, Rogelio; Vivaldi, Jorge & Cabré, Teresa (2008): ‘A Suite to Compile and Analyze an LSP Corpus’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Nelson, Gerald (1996): ‘The Design of the Corpus’. In: Greenbaum (1996); pp. 27–35.

Nelson, Gerald (2006): ‘The core and periphery of world Englishes: a corpus-based exploration’. World Englishes; vol. 25(1): pp. 115–129.

Netzel, Rebecca; Perez-Iratxeta, Carolina; Bork, Peer & Andrade, Miguel A. (2003): ‘The way we write’. European Molecular Biology Organization EMBO reports; vol. 4(5): pp. 446–451.

Nivre, Joakim (2009): ‘Treebanks’. In: Lüdeling & Kytö (2009); pp. 225–241.

Oakes, Michael P. (2009): ‘Corpus Linguistics and Stylometry’. In: Lüdeling & Kytö (2009); pp. 1070–1090.

Obrist, Monika (2010): Was kann der Franz dafür, wenn wir uns verfranzen? Hintersinniges und Skurriles zur deutschen Sprache in Südtirol und darüber hinaus; Bolzano: Sprachstelle im Südtiroler Kulturinstitut.

O’Donovan, Ruth & O’Neill, Mary (2008): ‘A Systematic Approach to the Selection of Neologisms for Inclusion in a Large Monolingual Dictionary’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Orsman, Harry (ed.) (1998): Dictionary of New Zealand English; Oxford: Oxford University Press.

Ostler, Nicholas (2009): ‘Corpora of less studied languages’. In: Lüdeling & Kytö (2009); pp. 457–483.

Papineni, Kishore; Roukos, Salim; Ward, Todd & Zhu, Wei-Jing (2002): ‘BLEU: a method for automatic evaluation of machine translation’. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL); pp. 311–318.

Paroubek, Patrick; Chaudiron, Stéphane & Hirschman, Lynette (2007): ‘Principles of Evaluation in Natural Language Processing’. Traitement Automatique des Langues; vol. 48(1): pp. 7–31.

Pecina, Pavel (2008): Lexical Association Measures: Collocation Extraction; Ph.D. thesis; Faculty of Mathematics and Physics, Charles University; Prague.

Peirsman, Yves; De Deyne, Simon; Heylen, Kris & Geeraerts, Dirk (2008): ‘The Construction and Evaluation of Word Space Models’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Peirsman, Yves; Heylen, Kris & Speelman, Dirk (2007): ‘Finding semantically related words in Dutch. Co-occurrences versus syntactic contexts’. In: Proceedings of the Workshop on Contextual Information in Semantic Space Models (CoSMo); pp. 9–16; Roskilde.

Peirsman, Yves & Padó, Sebastian (2010): ‘Cross-lingual Induction of Selectional Preferences with Bilingual Vector Spaces’. In: Proceedings of NAACL-HLT; Los Angeles.

Pernstich, Karin (1982): ‘Deutsch-italienische Interferenzen in der Südtiroler Presse’. In: Moser (1982); pp. 91–182.

Peters, Pam (1996): ‘Comparative insights into comparison’. World Englishes; vol. 15: pp. 57–67.

van der Plas, Lonneke & Tiedemann, Jörg (2006): ‘Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity’. In: Proceedings of the COLING-ACL; pp. 866–873.

von Polenz, Peter (1987): ‘Nationale Varietäten der deutschen Hochsprache’. Zeitschrift für germanistische Linguistik; vol. 15: pp. 101–103.

von Polenz, Peter (1988): ‘”Binnendeutsch” oder Plurizentrische Sprachkultur? Ein Plädoyer für Normalisierung in der Frage der ”nationalen” Varianten’. Zeitschrift für Germanistische Linguistik; vol. 16: pp. 198–218.

von Polenz, Peter (1999): Deutsche Sprachgeschichte vom Spätmittelalter bis zur Gegenwart; vol. 1-3; Berlin / New York: De Gruyter.

Popescu-Belis, Andrei; Estrella, Paula; King, Margaret & Underwood, Nancy (2006): ‘A model for context-based evaluation of language processing systems and its application to machine translation evaluation’. In: Proceedings of the 5th International Language Resources and Evaluation Conference (LREC); pp. 691–696; Genoa. http://www.lrec-conf.org/proceedings/lrec2006, last accessed 2012-10-31.

Poulard, Fabien; Daille, Béatrice; Jacquin, Christine; Monceaux, Laura; Morin, Emmanuel & Blancafort, Helena (2011): ‘Comparability Measurement for Terminology Extraction’. In: Proceedings of the Workshop on Creation, Harmonization and Application of Terminology resources (CHAT); Riga.

Prescher, Detlef & Heid, Ulrich (2000): ‘Probabilistisches Clustering zur Identifikation von Verb-Nomen-Kollokationen’. In: Proceedings of the 22. Jahrestagung der Deutschen Gesellschaft für Sprache; Marburg.

Prochasson, Emmanuel (2010): Alignement multilingue en corpus comparables spécialisés; Ph.D. thesis; University of Nantes.

Pulgram, Ernst (1964): ‘Structural comparison, diasystems, and dialectology’. Linguistics; vol. 2(4): pp. 66–82.

Putzer, Oskar (1984): Interferenz in Übersetzungen: Aspekte der Übersetzungsleistungen bei der Zweisprachigkeitsprüfung (D.P.R. 752/76); Bolzano: Assessorat für Schule und Kultur in italienischer Sprache.

Putzer, Oskar (2001): ‘Kommunizieren oder übersetzen? Methoden und Verfahren bei der Zweisprachigkeitsprüfung in Südtirol’. In: Egger & Lanthaler (2001); pp. 153–165.

Putzer, Oskar (2009): ‘Perfekt und Präteritum im Süddeutschen. Ein Beispiel für standardsprachliche Variation in der Grammatik?’. In: Kulturraum Tirol. Literatur - Sprache - Medien, ed. by Klettenhammer, Sieglinde; vol. 75 of Innsbrucker Beiträge zur Kulturwissenschaft. Germanistische Reihe; pp. 489–502; Innsbruck: Innsbruck University Press.

Quasthoff, Uwe; Richter, Matthias & Biemann, Chris (2006): ‘Corpus Portal for Search in Monolingual Corpora’. In: Proceedings of the 5th International Language Resources and Evaluation Conference (LREC); Genoa. http://www.lrec-conf.org/proceedings/lrec2006, last accessed 2012-10-26.

Rampold, Josef (2005): Das Beste vom Federfuchser. Spitzfindige Randbemerkungen zur Pflege der deutschen Muttersprache in Südtirol. Beiträge aus dem Tagblatt der Südtiroler ’Dolomiten’; Bolzano: Athesia.

Rayson, Paul; Berridge, Damon & Francis, Brian (2004): ‘Extending the Cochran rule for the comparison of word frequencies between corpora’. In: Proceedings of the 7th International Conference on Statistical Analysis of Textual Data (JADT), ed. by Purnelle, Gerald; Fairon, Cédrick & Dister, Ann; pp. 926–936; Louvain-la-Neuve.

Rayson, Paul & Garside, Roger (2000): ‘Comparing Corpora using Frequency Profiling’. In: Proceedings of the Workshop on Comparing Corpora; pp. 1–6.

Rayson, Paul & Stevenson, Mark (2009): ‘Sense and semantic tagging’. In: Lüdeling & Kytö (2009); pp. 564–578.

Rayson, Paul E. (2003): Matrix: a statistical method and software tool for linguistic analysis through corpus comparison; Ph.D. thesis; Computing Department, Lancaster University.

Rayson, Paul E.; Leech, Geoffrey & Hodges, Mary (1997): ‘Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus’. International Journal of Corpus Linguistics; vol. 2(1): pp. 133–152.

Reder, Anna (2006): ‘Kollokationsforschung und Kollokationsdidaktik’. Linguistik online; vol. 28(3): pp. 157–176. http://www.linguistik-online.de/28_06/reder.html, last accessed 2012-10-26.

Riedmann, Gerhard (1972): Die Besonderheiten der deutschen Sprache in Südtirol; vol. 39 of Duden-Beiträge: Die Besonderheiten der deutschen Schriftsprache im Ausland; Mannheim: Dudenverlag.

Riedmann, Gerhard (1973): ‘Bemerkungen zur deutschen Gegenwartssprache in Südtirol’. Skolast; vol. 18(5): pp. 17–20.

Riehl, Claudia M. (1994): ‘Das Problem von Standard und Norm am Beispiel der deutschsprachigen Minderheit in Südtirol’. In: Mehrsprachigkeit in Europa - Hindernis oder Chance?, ed. by Riehl, Claudia M. & Helfrich, Uta; vol. 24 of Pro Lingua; pp. 149–164; Wilhelmsfeld: Egert.

Riehl, Claudia M. (1998): ‘Schriftsprachliche Kompetenz und Zweisprachigkeit: Der Fall Südtirol’. In: Mehrsprachigkeit im Alpenraum, ed. by Werlen, Iwar; vol. 22 of Sprachlandschaft; pp. 175–195; Aarau: Sauerländer.

Riehl, Claudia M. (2000): ‘Deutsch in Südtirol’. Minderheiten und Regionalsprachen in Europa; pp. 237–248.

Riehl, Claudia M. (2001): Schreiben, Text und Mehrsprachigkeit: zur Textproduktion in mehrsprachigen Gesellschaften am Beispiel der deutschsprachigen Minderheiten in Südtirol und Ostbelgien; Tertiärsprachen; Tübingen: Stauffenburg.

Rissanen, Matti (2009): ‘Corpus linguistics and historical linguistics’. In: Lüdeling & Kytö (2009); pp. 53–68.

Ritz, Julia (2006): ‘Collocation extraction: Needs, feeds and results of an extraction system for German’. In: Proceedings of the EACL Workshop on Multi-word-expressions in a multilingual context; pp. 41–48; Trento. http://acl.ldc.upenn.edu/W/W06/W06-2406.pdf, last accessed 2012-10-26.

Ritz, Julia & Heid, Ulrich (2006): ‘Extraction tools for collocations and their morphosyntactic specificities’. In: Proceedings of the 5th International Language Resources and Evaluation Conference (LREC); pp. 1925–1930; Genoa. http://www.lrec-conf.org/proceedings/lrec2006, last accessed 2012-10-26.

Rizzo-Baur, Hildegard (1962): ‘Die Besonderheiten der deutschen Schriftsprache in Österreich und Südtirol’. In: Duden-Beiträge; vol. 5 of Sonderreihe: Die Besonderheiten der deutschen Schriftsprache im Ausland; Mannheim: Dudenverlag.

Robertazzi, Romina (2007): Der Einfluß von Sprachkontakt auf Kollokationsbildungen - eine Korpusanalyse in Texten aus Südtirol; Master’s thesis; Institute for Linguistics, University of Stuttgart.

Romaine, Suzanne (2009): ‘Corpus linguistics and sociolinguistics’. In: Lüdeling & Kytö (2009); pp. 96–111.

Römer, Ute (2009): ‘Corpora and language teaching’. In: Lüdeling & Kytö (2009); pp. 112–130.

Ronan, Patricia & Schneider, Gerold (2009): ‘Multi-verbal expressions of ‘giving’ in Old English and Old Irish’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-21.

Rose, Tony G.; Haddock, Nicholas J. & Tucker, Roger C. (1997): ‘The effects of corpus size and homogeneity on language model quality’. In: Proceedings of the ACL SIGDAT Workshop on Very Large Corpora; pp. 178–191; Beijing / Hong Kong.

Roth, Tobias (2009): ‘Verteilte Korpusabfragesysteme’. Linguistik online; vol. 38: pp. 67–78.

Roth, Tobias (2012): ‘Using web corpora for the recognition of regional variation in standard German collocations’. In: Proceedings of the 7th Web as Corpus Workshop (WAC); Lyon.

Sahlgren, Magnus (2005): ‘An Introduction to Random Indexing’. In: Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering; Copenhagen.

Sahlgren, Magnus & Karlgren, Jussi (2005): ‘Counting Lumps in Word Space: Density as a Measure of Corpus Homogeneity’. In: Proceedings of the 12th edition of the Symposium on String Processing and Information Retrieval (SPIRE); pp. 151–154; Buenos Aires.

Salton, Gerard & McGill, Michael J. (1986): Introduction to modern information retrieval; New York: McGraw-Hill.

Sand, Andrea (1999): Linguistic Variation in Jamaica; Tübingen: Narr.

de Saussure, Ferdinand (1916): Cours de linguistique générale; Paris: Payot.

Savický, Petr & Hlaváčová, Jaroslava (2002): ‘Measures of Word Commonness’. Journal of Quantitative Linguistics; vol. 9(3): pp. 215–231. http://dblp.uni-trier.de/db/journals/jql/jql9.html#SavickyH02, last accessed 2012-10-25.

Saxalber-Tetter, Annemarie (1989): ‘Dialekt in der Schule - ein Problem für das diglossische und bilinguale Südtirol’. In: Bayrisch-österreichische Dialektforschung, ed. by Koller, Erwin; pp. 394–407; Würzburg: Königshausen & Neumann.

Schafroth, Elmar (2008): ‘Les dictionnaires québécois et le problème de la norme linguistique’. In: Proceedings of the 13th EURALEX International Conference; Barcelona.

Scheuringer, Hermann (1996): ‘Das Deutsche als pluriareale Sprache: Ein Beitrag gegen staatlich begrenzte Horizonte in der Diskussion um die deutsche Sprache in Österreich’. Die Unterrichtspraxis Teaching German; vol. 29(2): pp. 147–153. http://www.jstor.org/stable/3531824, last accessed 2012-10-25.

Schiehlen, Michael (2003): ‘A Cascaded Finite-State Parser for German’. In: Proceedings of the EACL Research Note Sessions; Budapest.

Schiller, Anne; Teufel, Simone; Stöckert, Christine & Thielen, Christine (1999): Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset); Tech. rep.; Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart / Seminar für Sprachwissenschaft, Universität Tübingen.

Schmid, Helmut (1994): ‘Probabilistic Part-Of-Speech Tagging Using Decision Trees’. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP); Manchester. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger, last accessed 2012-10-26.

Schmid, Helmut (1995): ‘Improvements in Part-of-Speech Tagging with an Application to German’. In: Proceedings of the ACL SIGDAT Workshop on Very Large Corpora; Boston.

Schmid, Helmut (2004): ‘Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors’. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING); Geneva.

Schmid, Helmut (2009): ‘Tokenizing and part-of-speech tagging’. In: Lüdeling & Kytö (2009); pp. 527–551.

Schmid, Helmut; Fitschen, Arne & Heid, Ulrich (2004): ‘SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection’. In: Proceedings of the 4th International Language Resources and Evaluation Conference (LREC); pp. 1263–1266; Lisbon. http://www.lrec-conf.org/proceedings/lrec2004, last accessed 2012-10-21.

Schmidlin, Regula (2003): ‘Vergleichende Charakteristik der Anglizismen in den standardsprachlichen Varietäten des Deutschen’. In: Gömmer MiGro? Veränderungen und Entwicklungen im heutigen SchweizerDeutschen, ed. by Dittli, Beat; Häcki Buhofer, Annelies & Haas, Walter; Festschrift für Peter Dalcher zum 75. Geburtstag; pp. 141–160; Freiburg, CH: Universitätsverlag.

Schmidlin, Regula (2007): ‘Phraseological expressions in German standard varieties’. In: Phraseologie / Phraseology, ed. by Burger, Harald; vol. 28; pp. 551–563; Berlin: Erich Schmidt Verlag.

Schmied, Josef (2009): ‘Contrastive corpus studies’. In: Lüdeling & Kytö (2009); pp. 1140–1159.

Schneider, Edgar W. (2007): Postcolonial English - Varieties around the world; Cambridge: Cambridge University Press.

Schneider, Edgar W.; Burridge, Kate; Kortmann, Bernd; Mesthrie, Rajend & Upton, Clive (eds.) (2004): A Handbook of Varieties of English; Berlin: De Gruyter.

Schneider, Gerold & Hundt, Marianne (2009): ‘Using a parser as a heuristic tool for the description of New Englishes’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-31.

Scholze-Stubenrecht, Werner (ed.) (2011): Duden - Deutsches Universalwörterbuch; Mannheim: Dudenverlag; 7th ed.

Schrodt, Richard (1995): ‘Der Sprachbegriff zwischen Grammatik und Pragmatik: Was ist das österreichische Deutsch?’. In: Muhr et al. (1995); pp. 52–58.

Schrodt, Richard (2007): ‘Apokalypse neu’. tribüne zeitschrift für sprache und schreibung; vol. 1: pp. 4–16.

Schründer-Lenzen, Agi & Henn, Dominik (2009): ‘Entwicklung eines Computerprogramms zur Analyse der schriftlichen Erzählfähigkeit’. In: Von der Sprachdiagnose zur Sprachförderung, ed. by Lengyel, Drorit; Reich, Hans H.; Roth, Hans-Joachim & Döll, Marion; vol. 5 of FörMig Edition; pp. 91–107; Münster / New York / Munich / Berlin: Waxmann.

Schwienbacher, Brunhild (1997): Über den Ultner Dialekt. Struktur und Aufbau einer Tiroler Mundart; Ulten: Museumsverein Ulten.

Scott, Mike (2004): The WordSmith Tools (v. 4.0); Oxford: Oxford University Press.

Sekine, Satoshi (1997): ‘The domain dependence of parsing’. In: Proceedings of the 5th Applied Natural Language Processing Conference (ANLP); pp. 96–102; Washington, D.C.

Sharoff, Serge (2007): ‘Classifying Web corpora into domain and genre using automatic feature identification’. In: Proceedings of the 3rd Web as Corpus Workshop (WAC); Louvain-la-Neuve.

Silva, Penny (ed.) (1996): A Dictionary of South African English on Historical Principles; Oxford: Oxford University Press.

Sinclair, John (1991): Corpus, concordance, collocation; Describing English language; Oxford: Oxford University Press.

Sinclair, John (1996): ‘EAGLES – Preliminary recommendations on Corpus Typology’. EAGLES Document EAG-TCWG-CTYP/P. http://www.ilc.cnr.it/EAGLES96/corpustyp/node1.html, last accessed 2012-10-24.

Sinclair, John (2004): Trust the Text: Language, Corpus and Discourse; London: Routledge.

Smadja, Frank (1993): ‘Retrieving collocations from text: Xtract’. Computational Linguistics; vol. 19(1): pp. 143–178.

Stefanowitsch, Anatol & Gries, Stefan T. (2003): ‘Collostructions: Investigating the interaction between words and constructions’. International Journal of Corpus Linguistics; vol. 8(2): pp. 209–243.

Stefanowitsch, Anatol & Gries, Stefan T. (2009): ‘Corpora and grammar’. In: Lüdeling & Kytö (2009); pp. 933–952.

Stein, Benno; Potthast, Martin & Trenkmann, Martin (2010): ‘Retrieving Customary Web Language to Assist Writers’. In: Advances in Information Retrieval. Proceedings of the 32nd European Conference on Information Retrieval (ECIR); pp. 631–635.

Steininger, Rolf (1999): Südtirol im 20. Jahrhundert. Vom Leben und Überleben einer Minderheit; Innsbruck / Vienna / Bolzano: Studienverlag.

Steyer, Kathrin (2004a): ‘Kookkurrenz. Korpusmethodik, linguistisches Modell, lexikografische Perspektiven’. In: Steyer (2004b); pp. 87–116.

Steyer, Kathrin (ed.) (2004b): Wortverbindungen - mehr oder weniger fest; Jahrbuch des Instituts für Deutsche Sprache 2003; Berlin: De Gruyter.

Streiter, Oliver & De Luca, Ernesto W. (2003): ‘Example-based NLP for Minority Languages: Tasks, Resources and Tools’. In: Proceedings of the TALN Workshop on Natural Language Processing of Minority Languages with few computational linguistic resources; Batz-sur-Mer.

Streiter, Oliver; Scannell, Kevin P. & Stuflesser, Mathias (2006): ‘Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers’. Machine Translation; vol. 20(4): pp. 267–289.

Stubbs, Michael (1986): ‘Lexical density: A computational technique and some findings’. In: Talking about Text. Discourse Analysis Monograph, ed. by Coulthard, Malcolm; pp. 27–42; Birmingham: English Language Research.

Stubbs, Michael (1996): Text and Corpus Analysis: Computer-assisted studies of language and culture; Oxford: Blackwell.

Su, Fangzhong & Babych, Bogdan (2012): ‘Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-) Parallel Translation Equivalents’. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL); pp. 10–19; Avignon.

Taylor, Charlotte (2008): ‘What is corpus linguistics? What the data says’. ICAME Journal; vol. 32: pp. 143–164. http://icame.uib.no/journal.html, last accessed 2012-10-31.

Teubert, Wolfgang (2004): ‘Corpus Linguistics and Lexicography. The Beginning of a Beautiful Friendship’. Lexicographica International Annual for Lexicography; vol. 20: pp. 1–19.

Thomason, Sarah & Kaufman, Terrence (1988): Language Contact, Creolization, and Genetic Linguistics; University of California Press.

Ties, Isabella (ed.) (2006): LULCL 2005 - Proceedings of the Lesser Used Languages and Computer Linguistics Conference; Bolzano; EURAC research.

Todirascu, Amalia; Tufiş, Dan; Heid, Ulrich; Gledhill, Christopher; Ştefanescu, Dan; Weller, Marion & Rousselot, François (2008): ‘A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Tognini-Bonelli, Elena (2001): Corpus Linguistics at Work; vol. 6 of Studies in Corpus Linguistics; Amsterdam: Benjamins.

Trudgill, Peter & Hannah, Jean (2002): International English: A guide to varieties of standard English; London: Arnold; 4th ed.

Tufiş, Dan; Irimia, Elena; Ion, Radu & Ceausu, Alexandru (2008): ‘Unsupervised Lexical Acquisition for Part of Speech Tagging’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Turney, Peter D. & Pantel, Patrick (2010): ‘From Frequency to Meaning: Vector Space Models of Semantics’. Journal of Artificial Intelligence Research; vol. 37: pp. 141–188.

Tyroller, Hans (1986): ‘Trennung und Integration der Sprachgruppen in Südtirol’. In: Europäische Sprachminderheiten im Vergleich, ed. by Hinderling, Robert; vol. 11 of Deutsche Sprache in Europa und Übersee; pp. 17–36; Stuttgart: Steiner.

Villada Moirón, Begoña & Tiedemann, Jörg (2006): ‘Identifying idiomatic expressions using automatic word-alignment’. In: Proceedings of the EACL Workshop on Multiword Expressions in a Multilingual Context; Trento. http://www.aclweb.org/anthology/W/W06, last accessed 2012-10-21.

Virtanen, Tuija (2009): ‘Corpora and discourse analysis’. In: Lüdeling & Kytö (2009); pp. 1043–1070.

Schulte im Walde, Sabine (2003): ‘A Collocation Database for German Verbs and Nouns’. In: Proceedings of the 7th Conference on Computational Lexicography and Text Research (COMPLEX); Budapest.

Wandmacher, Tonio; Ovchinnikova, Ekaterina & Alexandrov, Theodore (2008): ‘Does Latent Semantic Analysis reflect Human Associations?’. In: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics; Hamburg.

Weinreich, Uriel (1953): Languages in contact: findings and problems; The Hague: De Gruyter.

Weinrich, Harald (1993): Textgrammatik der deutschen Sprache; Mannheim: Dudenverlag.

Weisgerber, Leo (1933): Die Stellung der Sprache im Aufbau der Gesamtkultur; vol. 15; Heidelberg: Universitätsverlag Winter.

Weitzman, Michael (1971): ‘How useful is the logarithmic type/token ratio?’. Journal of Linguistics; vol. 7: pp. 237–243.

Wermke, Mathias (1995): ‘Austriazismen im gemeinsprachlichen Wörterbuch des Deutschen, dargestellt an Duden – Deutsches Universalwörterbuch, 2. Auflage 1989’. In: Muhr et al. (1995); pp. 197–207.

Widdows, Dominic & Ferraro, Kathleen (2008): ‘Semantic Vectors: a Scalable Open Source Package and Online Technology Management Application’. In: Proceedings of the 6th International Language Resources and Evaluation Conference (LREC); Marrakech. http://www.lrec-conf.org/proceedings/lrec2008, last accessed 2012-10-26.

Wiechmann, Daniel & Fuhs, Stefan (2006): ‘Concordancing Software’. Corpus Linguistics and Linguistic Theory; vol. 2(1): pp. 109–130.

Wiegand, Herbert E. (2007): Internationale Bibliographie zur germanistischen Lexikographie und Wörterbuchforschung. Mit Berücksichtigung anglistischer, nordistischer, romanistischer, slavistischer und weiterer metalexikographischer Forschungen; Berlin: De Gruyter.

Wiesmann, Eva (2004): Rechtsübersetzung und Hilfsmittel zur Translation. Wissenschaftliche Grundlagen und computergestützte Umsetzung eines lexikographischen Konzepts; vol. 65; Tübingen: Narr.

Wilcock, Graham (2009): Introduction to Linguistic Annotation and Text Analytics; vol. 2(1) of Synthesis Lectures on Human Language Technologies; Princeton: Morgan and Claypool.

Wittenburg, Peter; Bel, Nuria; Borin, Lars; Budin, Gerhard; Calzolari, Nicoletta; Hajicova, Eva; Koskenniemi, Kimmo; Lemnitzer, Lothar; Maegaard, Bente; Piasecki, Maciej; Pierrel, Jean-Marie; Piperidis, Stelios; Skadina, Inguna; Tufis, Dan; van Veenendaal, Remco; Váradi, Tamas & Wynne, Martin (2010): ‘Resource and Service Centres as the Backbone for a Sustainable Service Infrastructure’. In: Proceedings of the 7th International Language Resources and Evaluation Conference (LREC); pp. 60–63; Valletta. http://www.lrec-conf.org/proceedings/lrec2010, last accessed 2012-10-24.

Wöllstein-Leisten, Angelika; Heilmann, Axel; Stepan, Peter & Vikner, Sten (1997): Deutsche Satzstruktur - Grundlagen der syntaktischen Analyse; Tübingen: Stauffenburg.

Wong, Deanna & Peters, Pam (2007): ‘A study of backchannels in regional varieties of English, using corpus mark-up as the means of identification’. International Journal of Corpus Linguistics; vol. 12(4): pp. 479–509.

Wynne, Martin (ed.) (2005): Developing Linguistic Corpora: a Guide to Good Practice; Oxford: Oxbow Books. http://ahds.ac.uk/linguistic-corpora, last accessed 2012-10-31.

Xiao, Richard (2006): ‘Review of Xaira: an XML Aware Indexing and Retrieval Architecture’. Corpora; vol. 1(1): pp. 99–103.

Xiao, Richard (2009a): ‘Theory-driven corpus research: Using corpora to inform aspect theory’. In: Lüdeling & Kytö (2009); pp. 987–1007.

Xiao, Richard (2009b): ‘Well-known and influential corpora’. In: Lüdeling & Kytö (2009); pp. 383–456.

Xiao, Richard & McEnery, Tony (2005): ‘Two approaches to genre analysis: three genres in modern American English’. Journal of English Linguistics; vol. 33(1): pp. 62–82.

Zambelli, Martina (2004): Interferenze lessicali in situazioni di contatto linguistico: Il caso dell’Alto Adige-Südtirol; Master’s thesis; Università di Venezia.

Zampieri, Marcos & Gebre, Binyam (2012): ‘Automatic identification of language varieties: The case of Portuguese’. In: Proceedings of the 11th Conference on Natural Language Processing (KONVENS); pp. 233–237. http://www.oegai.at/konvens2012/proceedings/33_zampieri12p, last accessed 2012-10-23.

Zeldes, Amir; Ritz, Julia; Lüdeling, Anke & Chiarcos, Christian (2009): ‘ANNIS: A Search Tool for Multi-Layer Annotated Corpora’. In: Proceedings of the 5th Corpus Linguistics Conference; Liverpool. http://ucrel.lancs.ac.uk/publications/CL2009, last accessed 2012-10-31.

Zinsmeister, Heike & Heid, Ulrich (2002): ‘Collocations of Complex Words: Implications for the Acquisition with a Stochastic Grammar’. In: Proceedings of the International Workshop on Computational Approaches to Collocations; Vienna. http://www.ims.uni-stuttgart.de/~zinsmeis/pubs/Coll02.pdf, last accessed 2012-10-26.

Zinsmeister, Heike & Heid, Ulrich (2004): ‘Collocations of Complex Nouns: Evidence for Lexicalisation’. In: Proceedings of the 7th Conference on Natural Language Processing (KONVENS); Vienna. http://www.ims.uni-stuttgart.de/~zinsmeis/pubs/ZinsmeisterHeid_Konv04.pdf, last accessed 2012-10-26.

Zipf, George K. (1932): Selected studies of the principle of relative frequency in language; Cambridge: Harvard University Press.