Computational Linguistics in Practice: WebLicht and GermaNet
Two Projects at the Department of Linguistics at the University of Tübingen

Verena Henrich
University of Tübingen, Department of Linguistics

November 30, 2010

Who I am: Verena Henrich

• 2009: Master in Computer Science at h_da
  - Lecture about Natural Language Processing (NLP)
  - Two semesters in Iceland
  - Master's thesis on an NLP topic
• Since 2009: Researcher at the Department of General and Computational Linguistics at the University of Tübingen
  - First task: development of an editor for the German wordnet (GermaNet)
  - Further project that I will introduce today: WebLicht
  - PhD plans: word sense disambiguation with GermaNet

GermaNet – A German Wordnet

GermaNet: A German Wordnet

• GermaNet is a lexical resource covering the German base vocabulary
• It is a lexical-semantic network
• It belongs to the family of wordnets modeled after the Princeton WordNet for English
• GermaNet is divided into 3 word categories:
  - Adjectives
  - Nouns
  - Verbs
• Words are ordered according to their meaning

GermaNet: Lexical Units

• Word meanings are represented by lexical units
• A lexical unit specifies one form and one meaning (i.e. reading) of a word
• Examples:
  - “Bank” has 2 readings
    - Reading 1: [Bank, {Sitzbank}] (bench)
    - Reading 2: [Bank, {Geldinstitut}] (financial institution)
  - “Leiter” has 3 readings
    - Reading 1: [Leiter, {Steiggerät}] (ladder)
    - Reading 2: [Leiter, {Verantwortlicher, Anführer}] (leader)
    - Reading 3: [Leiter, {stromleitender Stoff}] (electric conductor)
• Lexical units are grouped into semantic concepts according to their meaning

GermaNet: Synsets

• Semantic concepts are represented by synsets
• A synset is a set of (near-)synonymous words

GermaNet: Synset Examples

• Verb examples:
  - [rennen, laufen, sprinten, spurten] (to run)
  - [klingeln, bimmeln, schellen, gongen, läuten] (to ring)
• Adjective examples:
  - [stark, kräftig] (strong/powerful)
  - [eckig, kantig, zackig] (square-shaped/jagged)
  - [ausgeprägt, hervorstechend, markant] (distinctive)
• Noun examples:
  - [Witz, Scherz, Jux, Ulk, Spaß, Schabernack, Gag] (joke)
  - [Substantiv, Hauptwort, Nomen] (noun)
  - [Textil, Gewebe, Webware, Stoff] (cloth/material)

GermaNet: Synsets

• Each lexical unit belongs to exactly one synset
• A literal, however, can belong to many synsets
  - [Chip, Kartoffelchip] (potato crisp)
  - [Chip, Mikrochip] (computer chip)
  - [Kohle, Geld, Kies, Knete, Moneten] (money)
  - [Kohle, Kohlegestein] (coal)
  - [Golf, VW Golf] (car)
  - [Golf] (Küstengebiet) (gulf)
  - [Golf, Golfspiel] (golf)
  - [gehen, laufen] (to walk)
  - [gehen, funktionieren] (to work)
• A synset has an average of 1.37 lexical units
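The relationship between literals, lexical units, and synsets can be sketched in a few lines of Python. This is only an illustration built from the examples on this slide: the class name `LexicalUnit`, the numeric synset ids, and the `readings` index are assumptions made for the sketch and do not reflect the actual GermaNet database schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    literal: str    # the word form, e.g. "Kohle"
    synset_id: int  # the single synset this reading belongs to


# Synsets copied from the slide; the numeric ids are invented for this sketch.
SYNSETS = {
    1: ["Kohle", "Geld", "Kies", "Knete", "Moneten"],  # money
    2: ["Kohle", "Kohlegestein"],                      # coal
    3: ["Golf", "VW Golf"],                            # car
    4: ["Golf"],                                       # gulf
    5: ["Golf", "Golfspiel"],                          # golf
}

# One lexical unit per (literal, synset) pair: each unit has exactly one synset.
lexical_units = [
    LexicalUnit(literal, synset_id)
    for synset_id, literals in SYNSETS.items()
    for literal in literals
]

# A literal may have several readings, i.e. appear in several synsets.
readings = defaultdict(list)
for unit in lexical_units:
    readings[unit.literal].append(unit.synset_id)

print(readings["Golf"])   # [3, 4, 5]
print(readings["Kohle"])  # [1, 2]
```

Modeling the lexical unit as the pair of literal and synset makes the "exactly one synset per lexical unit" rule hold by construction, while ambiguity shows up only at the level of the literal.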

GermaNet: Relations

• In GermaNet, there are two types of semantic relations
  - Lexical relations are established between lexical units
    - Synonymy
    - Antonymy
    - Pertainymy
  - Conceptual relations are established between synsets
    - Hypernymy and hyponymy
    - Part-whole relations (meronymy and holonymy)
    - Entailment
    - Causation
    - Association

GermaNet: Lexical Relations

• Lexical relations hold between two lexical units
  - Synonymy
  - Antonymy
  - Pertainymy

GermaNet: Conceptual Relations

• Conceptual relations hold between two synsets
  - Hypernymy and hyponymy
  - Part-whole relations (meronymy and holonymy)
  - Entailment
  - Causation
  - Association

GermaNet: Conceptual Relations

• GermaNet is hierarchically structured in terms of the hypernymy-hyponymy relation of synsets

GermaNet: Conceptual Relations

• Part-whole relations are conceptual relations

GermaNet: Relations

GermaNet: Readings for “unterhalten”

1. (v) [unterhalten, pflegen] (to cultivate) -- über etwas verfügen
   • [unterhalten] -- NN.AN.Pp -- Sie unterhalten gute Beziehungen zu ihren Nachbarn.
   • [pflegen]
   Hypernyms: [haben, besitzen]

2. (v) [unterhalten] (to keep oneself amused) -- sich auf angenehme Weise die Zeit vertreiben
   • [unterhalten] -- NN.AR.BM -- Sie hat sich blendend unterhalten. (NN.AR.BM)
   Hypernyms: [vergnügen]

3. (v) [unterhalten] (to entertain) -- für Zerstreuung/Zeitvertreib sorgen
   • [unterhalten] -- NN.AN.Bs -- Er unterhielt seine Gäste mit Musik. (NN.AN.Bs)
   Hypernyms: [vergnügen, amüsieren]

4. (v) [unterhalten] (to maintain sth.) -- etw. halten/einrichten/betreiben und dafür aufkommen
   • [unterhalten] -- NN.AN -- Er unterhält einen Reitstall. (NN.AN)
   Hypernyms: [führen]
   Hyponyms: [instandhalten] [bewirtschaften]

5. (v) [unterhalten] (to talk) -- ein Gespräch führen
   • [unterhalten] -- NN.AR.Pp.Bo -- Er unterhielt sich den ganzen Abend über seine Prüfungen. (NN.AR.Pp) -- Er unterhielt sich nur mit mir. (NN.AR.Bo)
   Hypernyms: [austauschen]
   Hyponyms: [klönen] [labern] [palavern] [philosophieren] [plauschen] [plaudern, schwatzen, schnattern]

6. (v) [unterhalten, alimentieren] (to support sb.) -- für jmds. Lebensunterhalt aufkommen
   • [unterhalten] -- NN.AN -- Er unterhält eine sieben-köpfige Familie. (NN.AN)
   • [alimentieren]
   Hypernyms: [ernähren, nähren]
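The hypernym and hyponym lists in readings 4 and 5 are two views of the same edges, so it is enough to store one direction. The following is a rough sketch under simplifying assumptions: synsets are identified by made-up string labels such as `"unterhalten/maintain"`, each synset has at most one hypernym, and only a few edges from the slide are included. None of this mirrors how GermaNet itself stores relations.

```python
# Hypernym edges (synset -> its hypernym), taken from readings 4 and 5.
# The string labels are invented identifiers for this sketch.
HYPERNYM_OF = {
    "unterhalten/maintain": "führen",
    "instandhalten": "unterhalten/maintain",
    "bewirtschaften": "unterhalten/maintain",
    "unterhalten/talk": "austauschen",
    "klönen": "unterhalten/talk",
    "plaudern, schwatzen, schnattern": "unterhalten/talk",
}


def hyponyms(synset, edges=HYPERNYM_OF):
    """Hyponymy is the inverse of hypernymy: collect all direct children."""
    return sorted(child for child, parent in edges.items() if parent == synset)


def hypernym_path(synset, edges=HYPERNYM_OF):
    """Follow hypernym edges upwards until no further hypernym is recorded."""
    path = [synset]
    while path[-1] in edges:
        path.append(edges[path[-1]])
    return path


print(hyponyms("unterhalten/maintain"))
# ['bewirtschaften', 'instandhalten']
print(hypernym_path("instandhalten"))
# ['instandhalten', 'unterhalten/maintain', 'führen']
```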

GermaNet: Purpose

• GermaNet development started in 1997 at the Department of Linguistics at the University of Tübingen
• Developed to serve as an electronic lexicographic reference database for German word senses
• Primarily intended to serve as a resource for word sense disambiguation, which is crucial for natural language applications such as
  - Information retrieval
  - Construction of language technology tools
  - Annotation of corpora
  - Machine translation

GermaNet: Size

• Number of lexical units: 84,600
  - Adjectives: 8,100 lexical units
  - Nouns: 64,100 lexical units
  - Verbs: 12,300 lexical units
• Number of synsets: 61,700
  - Adjectives: 5,600 synsets
  - Nouns: 46,900 synsets
  - Verbs: 9,200 synsets
• 84,600 literals (1.10 readings per literal)
• Lexical relations: 3,500
• Conceptual relations: 73,700
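The average of 1.37 lexical units per synset quoted on an earlier slide follows directly from these totals. A short check, using only the numbers on this slide (the variable names are arbitrary):

```python
lexical_units = {"adjectives": 8_100, "nouns": 64_100, "verbs": 12_300}
synsets = {"adjectives": 5_600, "nouns": 46_900, "verbs": 9_200}

TOTAL_LEXICAL_UNITS = 84_600  # total as stated on the slide
TOTAL_SYNSETS = 61_700        # total as stated on the slide

# The per-category figures look rounded to the nearest hundred, so the sum of
# the lexical-unit counts need not match the stated total exactly.
lu_sum = sum(lexical_units.values())
synset_sum = sum(synsets.values())
average = TOTAL_LEXICAL_UNITS / TOTAL_SYNSETS

print(lu_sum)             # 84500
print(synset_sum)         # 61700
print(round(average, 2))  # 1.37
```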

Tools for GermaNet

• Application Programming Interfaces
  - Java API
  - Perl API
• Web application: http://weblicht.sfs.uni-tuebingen.de:8080/gnet/
• Web service: as part of WebLicht
• GermaNet-Explorer: visualisation tool (developed at the University of Dortmund)
• GernEdiT: GermaNet editing tool

GermaNet: Data Formats

• Former:
  - Lexicographer files: complex legacy format
• Now:
  - Relational database
• Export formats:
  - Proprietary XML format: distribution format
  - Lexical Markup Framework: XML, ISO standard
  - Princeton WordNet format

GermaNet: Lexicographer Files

(*** Nüsse ***)

{Nuss, Nuß*o, Nusskern, ?festes_Nahrungsmittel,@ nomen.Pflanze:Nuss,@ ('der essbare Kern einer Nuss')}

{Haselnuss, Haselnuß*o, Haselnusskern, Haselnußkern*o, Nuss,@ nomen.Pflanze:Haselstrauch,#}

{Kokosnuss, Kokosnuß*o, Nuss,@ nomen.Pflanze:Kokospalme,#}

{Betelnuss, Betelnuß*o, Nuss,@ Genussmittel,@}

{Erdnuss, Erdnuß*o, Erdnusskern, Erdnußkern*o, Nuss,@ nomen.Pflanze:Erdnusspflanze,#}

{Cashewkern, Cashewnuss, Cashewnuß*o, Nuss,@ nomen.Pflanze:Acajubaum,#}

...
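To give a feel for why this format is hard to maintain by hand, here is a minimal parsing sketch for entries like the ones above. It is based only on what is visible in these examples: items followed by `,@` or `,#` are treated as pointers to other entries, a trailing `('...')` as a gloss, and everything else as a word of the synset. The meaning of the marker symbols (including `*o` and `?`) is not explained on the slide, so the sketch keeps them untouched. The function name and the returned dictionary layout are assumptions, not part of any GermaNet tool.

```python
import re

POINTER = re.compile(r"(\S+),([@#])")  # "target,@" or "target,#"
GLOSS = re.compile(r"\('([^']*)'\)")   # optional ('...') gloss


def parse_entry(entry: str) -> dict:
    """Split one {...} entry into words, pointers, and an optional gloss."""
    body = entry.strip().strip("{}")

    gloss_match = GLOSS.search(body)
    gloss = gloss_match.group(1) if gloss_match else None
    body = GLOSS.sub("", body)

    # Pointers are (target, marker) pairs; markers are kept uninterpreted.
    pointers = POINTER.findall(body)
    body = POINTER.sub("", body)

    words = [word.strip() for word in body.split(",") if word.strip()]
    return {"words": words, "pointers": pointers, "gloss": gloss}


example = "{Haselnuss, Haselnuß*o, Haselnusskern, Haselnußkern*o, Nuss,@ nomen.Pflanze:Haselstrauch,#}"
parsed = parse_entry(example)
print(parsed["words"])     # ['Haselnuss', 'Haselnuß*o', 'Haselnusskern', 'Haselnußkern*o']
print(parsed["pointers"])  # [('Nuss', '@'), ('nomen.Pflanze:Haselstrauch', '#')]
```

Even this small sketch has to rely on whitespace and punctuation conventions, which hints at how easily a hand-edited file can end up with syntax errors.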

GermaNet: Lexicographer Files

• Lexicographer files have three main shortcomings:
  1. No visualization → difficult to insert new items
  2. Complex data format → syntax errors and semantic inconsistencies
  3. No versioning → impossible to trace changes

GernEdiT – The GermaNet Editing Tool

• Developed to overcome the shortcomings of the lexicographer files
  1. No visualization → graphical tool (search and browse GermaNet)
  2. Complex data format → user-friendly tool (with internal consistency checks)
  3. No versioning → editing history

GernEdiT – The GermaNet Editing Tool

GermaNet: Links & References

• GermaNet homepage: http://www.sfs.uni-tuebingen.de/GermaNet/
• GermaNet web application: http://weblicht.sfs.uni-tuebingen.de:8080/gnet/
• Verena Henrich and Erhard Hinrichs: GernEdiT - The GermaNet Editing Tool. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010. http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf
• Verena Henrich and Erhard Hinrichs: GernEdiT: A Graphical Tool for GermaNet Development. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 2010. http://www.aclweb.org/anthology/P10-4004
• Fellbaum, C. (ed.): WordNet – An Electronic Lexical Database. The MIT Press, 1998.
• Princeton WordNet homepage: http://wordnet.princeton.edu/
• Princeton WordNet web application: http://wordnetweb.princeton.edu/perl/webwn

WebLicht – Web-Based Linguistic Chaining Tool

WebLicht: Motivation

• Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available
• Most of them are implemented to run on local machines
  - This can be inconvenient, time-consuming, and error-prone because a user has to install all necessary tools
• Requirement: avoid the “download-first” paradigm
  → One possible solution: make tools and resources available on the web

WebLicht: Motivation

• Make linguistic resources and tools available online
  - Some can easily be put online for download or for online usage
  - For others, more effort is necessary, e.g. because access to the resources is restricted or because it is difficult to make them usable online
  → Solution: a service-oriented architecture (SOA)
  - The end user needs just a web browser: no download, installation, or configuration of software is necessary

WebLicht: Purpose

• WebLicht: Web-Based Linguistic Chaining Tool
• A service-oriented architecture for linguistic analysis of text
• Users do not need to download and install any software on their own computers
• Allows the user to run several linguistic tools, implemented as web services (WS), in succession
  → It is possible to build a chain of linguistic web services without having to deal with the technical details involved in building linguistic tool chains

WebLicht: Architecture

• Development started in October 2008
• WebLicht is implemented as a web application
  - Ensures that the tool chains are valid
  - Generates the calls to the web services
  - Displays the results
• WebLicht consists of the following components:
  - Distributed web services: offer functionality (resources and tools) over the internet
  - Registry: stores metadata and technical information about the web services
  - Web user interface: interacts with the user and combines services and information from the registry

WebLicht: Architecture

[Figure: WebLicht architecture. Component labels: standard-conformant text corpus encoding; Web 2.0 application for tool chaining and execution; registry. Locations shown for the components: Stuttgart, Tübingen, Leipzig. Locations shown for the web services: Stuttgart, Tübingen, Berlin, Leipzig, Finland, Romania, Iceland, UK.]

WebLicht: Features

• With REST-style web services, everyone can implement a web service for WebLicht
• Web services can be accessed by any client that can use the HTTP protocol
  - Web browser (user interface)
  - Command-line tools (wget, curl)
  - Scripts/programming languages
  → Anyone can implement their own interface to WebLicht
• The SOA infrastructure is independent of programming languages and operating systems
• The chaining algorithm is independent of the data format used
• From a legal point of view, the web services remain located at the institute where they were created (service providers)

WebLicht: Web Services

• WS make the functionality of existing desktop applications and command-line tools available through the web
• Approximately 120 web services are currently available
  - 3 countries: Germany, Finland, Romania
  - 8 partners
  - 10 languages: German, English, Finnish, Italian, French, Spanish, Czech, Hungarian, Romanian, Slovenian
  - More than 10 different tools: converter, tokenizer, part-of-speech tagger, lemmatizer, parser, etc.
• Web services are implemented in REST style
• The HTTP POST method is used to send data from the user interface to the web services
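Because the services are plain HTTP endpoints that accept POST requests, a script can call them without the web interface. The sketch below only builds such a request with Python's standard library. The URL is a placeholder and the plain-text content type is an assumption: real service addresses come from the WebLicht registry, and the services expect input in the TCF format described later.

```python
import urllib.request

# Placeholder endpoint for illustration only; not a real WebLicht service.
SERVICE_URL = "http://example.org/weblicht/tokenizer"


def build_request(text: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Wrap the input in an HTTP POST request for a REST-style service."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_request("He buys an apple and a drink.")
print(request.get_method())  # POST

# Sending is left commented out because the URL above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     result = response.read().decode("utf-8")
```

The same request could be issued with wget or curl, which is what makes the architecture independent of any particular client.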

WebLicht: Integrating New Services

• Building a web service for WebLicht consists of the following steps:
  1. Create a REST-style web service around the tool as a wrapper
  2. Make input and output compatible with the TCF format
  3. Register the service in the registry

[Figure: input in TCF format → wrapper (REST-style web service) around the tool or resource → output in TCF format]

• There are tutorials for Java and for Perl at: http://weblicht.sfs.uni-tuebingen.de/englisch/weblicht.shtml
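Step 1, the wrapper, can be sketched with Python's built-in `http.server` module. This is a deliberately reduced illustration and not the approach of the Java or Perl tutorials: `run_tool` is a stand-in for the wrapped tool (a naive whitespace tokenizer), the payload is plain text rather than TCF, and registration in the registry is left out.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_tool(text: str) -> str:
    """Stand-in for the wrapped tool: one whitespace-separated token per line."""
    return "\n".join(text.split())


class WrapperHandler(BaseHTTPRequestHandler):
    """Accepts a POST request, passes the body to the tool, returns its output."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length).decode("utf-8")
        result = run_tool(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)


def make_server(port: int = 8000) -> HTTPServer:
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), WrapperHandler)


if __name__ == "__main__":
    print(run_tool("He buys an apple"))
    # To serve requests: make_server().serve_forever()
```

Keeping the tool call in its own function means the HTTP layer stays the same when the stand-in is replaced by a call to an existing command-line tool.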

WebLicht: Registry

• Every WS in WebLicht is registered in a central registry
• Implemented at the University of Leipzig
• Consists of a relational database containing all the information about the web services
  - Metadata (creator, name, address)
  - Processing information (input and output specifications)
• Information about every registered web service can be obtained through RESTful web services
  - Which services are available?
  - How can I combine them?
  - Which input/output format does a service accept/produce?
• Example: a tokenizer has already been applied to a plain text; which services can be used next?

WebLicht: Processing Chains

• The chaining algorithm ensures that the list of possible next web services only contains web services that form a valid next step in the chain
  - For example: a part-of-speech tagger can only be added to a chain after a tokenizer was chosen
• The metadata of each WS contains information about
  - The input requirements
  - The produced output
• Communication between the web services is based on the TCF data format
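The idea behind the chaining check can be sketched as a comparison between what each service requires and what the chain has produced so far. The following is an illustrative reconstruction, not the actual WebLicht algorithm: the `Service` class, the layer names, and the three registry entries are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    requires: frozenset  # annotation layers that must already be present
    produces: frozenset  # annotation layers the service adds


# Hypothetical registry entries; real metadata lives in the WebLicht registry.
REGISTRY = [
    Service("tokenizer", frozenset({"text"}), frozenset({"tokens"})),
    Service("pos-tagger", frozenset({"text", "tokens"}), frozenset({"postags"})),
    Service("parser", frozenset({"tokens", "postags"}), frozenset({"parse"})),
]


def next_services(available, registry=REGISTRY):
    """Services whose requirements are met and that would add a new layer."""
    return [
        service.name
        for service in registry
        if service.requires <= available and not service.produces <= available
    ]


def layers_after(chain, registry=REGISTRY):
    """Validate a chain step by step and return the layers it ends up with."""
    by_name = {service.name: service for service in registry}
    available = frozenset({"text"})  # a chain starts from plain text
    for name in chain:
        if name not in next_services(available, registry):
            raise ValueError(f"{name!r} is not a valid next step")
        available |= by_name[name].produces
    return available


print(next_services(frozenset({"text"})))            # ['tokenizer']
print(next_services(frozenset({"text", "tokens"})))  # ['pos-tagger']
```

With these entries, a chain that starts with the part-of-speech tagger is rejected, which matches the example on the slide.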

WebLicht: Text Corpus Format (TCF)

• Common data format for input and output of web services
• Each WS incrementally adds an annotation layer to TCF
  - Step 1: only plain text is in the TCF document
  - Step 2: tokens are added to the plain text
  - Step 3: a third layer containing part-of-speech tags is added
  - Step 4: a parse tree layer is added

WebLicht: TCF Example (step 1)

Layer 1: Plain text
  He buys an apple and a drink.

WebLicht: TCF Example (step 2)

Layer 1: Plain text
  He buys an apple and a drink.
Layer 2: Tokens
  He buys an apple and a drink .

WebLicht: TCF Example (step 3)

Layer 1: Plain text
  He buys an apple and a drink.
Layer 2: Tokens
  He buys an apple and a drink .
Layer 3: Part-of-speech tags
  PP VBZ DT NN CC DT NN .
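The incremental layering can be imitated with a small XML document that grows by one child element per processing step. The element and attribute names below (`corpus`, `tokens`, `token`, `postags`, `tag`, `tokref`) are simplified placeholders chosen for this sketch; they are not claimed to match the real TCF schema.

```python
import xml.etree.ElementTree as ET

TEXT = "He buys an apple and a drink."
TOKENS = ["He", "buys", "an", "apple", "and", "a", "drink", "."]
TAGS = ["PP", "VBZ", "DT", "NN", "CC", "DT", "NN", "."]


def new_document(text):
    """Step 1: a document that contains only the plain text layer."""
    doc = ET.Element("corpus")
    ET.SubElement(doc, "text").text = text
    return doc


def add_tokens(doc, tokens):
    """Step 2: append a token layer; earlier layers stay unchanged."""
    layer = ET.SubElement(doc, "tokens")
    for index, token in enumerate(tokens, start=1):
        ET.SubElement(layer, "token", id=f"t{index}").text = token
    return doc


def add_pos_tags(doc, tags):
    """Step 3: append a tag layer that refers back to the token ids."""
    tokens = doc.find("tokens")
    if tokens is None:
        raise ValueError("a token layer is required before tagging")
    if len(tags) != len(tokens):
        raise ValueError("need exactly one tag per token")
    layer = ET.SubElement(doc, "postags")
    for token, tag in zip(tokens, tags):
        ET.SubElement(layer, "tag", tokref=token.get("id")).text = tag
    return doc


doc = add_pos_tags(add_tokens(new_document(TEXT), TOKENS), TAGS)
print([layer.tag for layer in doc])  # ['text', 'tokens', 'postags']
```

Since each step only appends a layer, a later service can rely on everything earlier services produced, which is what the chaining check depends on.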

WebLicht: User Interface

• Web 2.0 application for tool chaining and execution
  - Several forms of input:
    - Enter plain text
    - Upload a text (plain text, MS Word, RTF, or PDF files)
    - Select one of the sample texts
  - Build a chain of linguistic tools with information from the registry
  - Execution of the tool chain
  - Presentation of the results
• Implemented at the Department of Linguistics at the University of Tübingen
• Java application using an AJAX-driven toolkit
• Deployed in Apache Tomcat

WebLicht: User Interface

WebLicht: Results

WebLicht: Demonstration

• URL to the WebLicht web application

- Via Shibboleth: https://weblicht.sfs.uni-tuebingen.de/WebLicht1.5s/

- Local user management: http://weblicht.sfs.uni-tuebingen.de:8080/WebLicht1.5/

WebLicht: Links & References

• URL to the WebLicht web application: https://weblicht.sfs.uni-tuebingen.de/WebLicht1.5s/ (via Shibboleth)
• Verena Henrich, Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow: Service-Oriented Architectures: From Desktop Tools to Web Services and Web Applications. To appear in Dan Tufiş, Corina Forăscu (eds.): Multilinguality and Interoperability in Language Processing with Emphasis on Romanian. Romanian Academy Publishing House, 2010.
• Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow: WebLicht: Web-based LRT services for German. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 2010.
• Ulrich Heid, Helmut Schmid, Kerstin Eckart, and Erhard Hinrichs: A Corpus Representation Format for Linguistic Web Services: the D-SPIN Text Corpus Format and its Relationship with ISO Standards. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010.
• The WebLicht slides are adapted from slides by Thomas Zastrow

Thank you.

Verena Henrich Department of Linguistics University of Tübingen

Wilhelmstr. 19 72074 Tübingen Germany

http://www.verenahenrich.de

Phone: +49 7071 29-77313
