Computational Linguistics in Practice: WebLicht and GermaNet
Two Projects at the Department of Linguistics at the University of Tübingen

Verena Henrich
University of Tübingen, Department of Linguistics
November 30, 2010

Who I am: Verena Henrich
• 2009: Master's degree in Computer Science at h_da
  - Lecture about Natural Language Processing (NLP)
  - Two semesters in Iceland
  - Master's thesis on an NLP topic
• Since 2009: Researcher at the Department of General and Computational Linguistics at the University of Tübingen
  - First task: development of an editor for the German wordnet (GermaNet)
  - Further project that I will introduce today: WebLicht
  - PhD plans: word sense disambiguation with GermaNet
GermaNet – A German Wordnet

GermaNet: A German Wordnet
• GermaNet is a lexical resource covering the German base vocabulary
• It is a lexical semantic network
• It belongs to the family of wordnets modeled after the Princeton WordNet for English
• GermaNet is divided into 3 word categories:
  - Adjectives
  - Nouns
  - Verbs
• Words are ordered according to their meaning
GermaNet: Lexical Units
• Word meanings are represented by lexical units
• A lexical unit specifies one form and one meaning (i.e. reading) of a word
• Examples:
  - “Bank” has 2 readings
    . Reading 1: [Bank, {Sitzbank}] (bench)
    . Reading 2: [Bank, {Geldinstitut}] (financial institution)
  - “Leiter” has 3 readings
    . Reading 1: [Leiter, {Steiggerät}] (ladder)
    . Reading 2: [Leiter, {Verantwortlicher, Anführer}] (leader)
    . Reading 3: [Leiter, {stromleitender Stoff}] (electric conductor)
• Lexical units are grouped into semantic concepts according to their meaning
GermaNet: Synsets
• Semantic concepts are represented by synsets
• A synset is a set of (near-)synonymous words
GermaNet: Synset Examples
• Verb examples:
  [rennen, laufen, sprinten, spurten] (to run)
  [klingeln, bimmeln, schellen, gongen, läuten] (to ring)
• Adjective examples:
  [stark, kräftig] (strong/powerful)
  [eckig, kantig, zackig] (square-shaped/jagged)
  [ausgeprägt, hervorstechend, markant] (distinctive)
• Noun examples:
  [Witz, Scherz, Jux, Ulk, Spaß, Schabernack, Gag] (joke)
  [Substantiv, Hauptwort, Nomen] (noun)
  [Textil, Gewebe, Webware, Stoff] (cloth/material)
GermaNet: Synsets
• Each lexical unit belongs to exactly one synset
• A literal, however, can belong to many synsets
  [Chip, Kartoffelchip] (potato crisp)
  [Chip, Mikrochip] (computer chip)
  [Kohle, Geld, Kies, Knete, Moneten] (money)
  [Kohle, Kohlegestein] (coal)
  [Golf, VW Golf] (car)
  [Golf] (Küstengebiet) (gulf)
  [Golf, Golfspiel] (golf)
  [gehen, laufen] (to walk)
  [gehen, funktionieren] (to work)
• A synset has an average of 1.37 lexical units
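The relationship between literals, lexical units, and synsets can be sketched as a small data model. This is only an illustration in Python: the names `LexicalUnit`, `SYNSETS`, and `build_index` are hypothetical and do not reflect the GermaNet Java or Perl APIs, and the synset numbers are made up. The example synsets are taken from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical data model for illustration; not the GermaNet API.
@dataclass(frozen=True)
class LexicalUnit:
    literal: str    # orthographic form, e.g. "Golf"
    synset_id: int  # each lexical unit belongs to exactly one synset


# Example synsets from the slide; the numeric IDs are invented.
SYNSETS = {
    1: ["Golf", "VW Golf"],                            # car
    2: ["Golf"],                                       # gulf
    3: ["Golf", "Golfspiel"],                          # golf (the sport)
    4: ["Kohle", "Geld", "Kies", "Knete", "Moneten"],  # money
    5: ["Kohle", "Kohlegestein"],                      # coal
}


def build_index(synsets):
    """Map each literal to the lexical units (readings) that carry it."""
    index = defaultdict(list)
    for synset_id, literals in synsets.items():
        for literal in literals:
            index[literal].append(LexicalUnit(literal, synset_id))
    return index


index = build_index(SYNSETS)

# One literal, several lexical units, each in its own synset.
golf_synsets = sorted(unit.synset_id for unit in index["Golf"])
print(golf_synsets)  # [1, 2, 3]

# Average number of lexical units per synset in this toy sample.
average = sum(len(literals) for literals in SYNSETS.values()) / len(SYNSETS)
print(average)
```

The literal "Golf" yields three lexical units here, one per synset, which is the distinction the slide draws between a literal and a lexical unit.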
GermaNet: Relations
• In GermaNet, there are two types of semantic relations
  - Lexical relations are established between lexical units
    . Synonymy
    . Antonymy
    . Pertainymy
  - Conceptual relations are established between synsets
    . Hypernymy and hyponymy
    . Part-whole relations (meronymy and holonymy)
    . Entailment
    . Causation
    . Association
GermaNet: Lexical Relations
• Lexical relations hold between two lexical units
  - Synonymy
  - Antonymy
  - Pertainymy
GermaNet: Conceptual Relations
• Conceptual relations hold between two synsets
  - Hypernymy and hyponymy
  - Part-whole relations (meronymy and holonymy)
  - Entailment
  - Causation
  - Association
GermaNet: Conceptual Relations
• GermaNet is hierarchically structured in terms of the hypernymy-hyponymy relation of synsets
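A hierarchy built on hypernymy can be walked upward from any synset. The following Python sketch illustrates the idea with a few hypernym links taken from the “unterhalten” readings shown elsewhere in these slides. The table layout, the string labels standing in for synsets, and the function names are assumptions made for this illustration, and the sketch assumes a single hypernym per synset.

```python
# Toy hypernymy table: synset label -> label of its direct hypernym.
# The labels are informal stand-ins for synsets (hypothetical structure).
HYPERNYM = {
    "klönen": "unterhalten (to talk)",
    "plaudern": "unterhalten (to talk)",
    "unterhalten (to talk)": "austauschen",
    "instandhalten": "unterhalten (to maintain)",
    "unterhalten (to maintain)": "führen",
}


def hypernym_path(synset):
    """Follow hypernym links upward until no further hypernym is recorded."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def hyponyms(synset):
    """Hyponymy is the inverse relation: collect all direct children."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == synset)


print(hypernym_path("klönen"))
print(hyponyms("unterhalten (to talk)"))
```

Storing only the hypernym direction and deriving hyponyms by inversion keeps the two relations consistent by construction.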
GermaNet: Conceptual Relations
• Part-whole relations are conceptual relations
GermaNet: Relations

GermaNet: Readings for “unterhalten”
1. (v) [unterhalten, pflegen] (to cultivate) -- über etwas verfügen
   • [unterhalten] -- NN.AN.Pp -- Sie unterhalten gute Beziehungen zu ihren Nachbarn.
   • [pflegen]
   Hypernyms: [haben, besitzen]
2. (v) [unterhalten] (to keep oneself amused) -- sich auf angenehme Weise die Zeit vertreiben
   • [unterhalten] -- NN.AR.BM -- Sie hat sich blendend unterhalten. (NN.AR.BM)
   Hypernyms: [vergnügen]
3. (v) [unterhalten] (to entertain) -- für Zerstreuung/Zeitvertreib sorgen
   • [unterhalten] -- NN.AN.Bs -- Er unterhielt seine Gäste mit Musik. (NN.AN.Bs)
   Hypernyms: [vergnügen, amüsieren]
4. (v) [unterhalten] (to maintain sth.) -- etw. halten/einrichten/betreiben und dafür aufkommen
   • [unterhalten] -- NN.AN -- Er unterhält einen Reitstall. (NN.AN)
   Hypernyms: [führen]
   Hyponyms: [instandhalten] [bewirtschaften]
5. (v) [unterhalten] (to talk) -- ein Gespräch führen
   • [unterhalten] -- NN.AR.Pp.Bo -- Er unterhielt sich den ganzen Abend über seine Prüfungen. (NN.AR.Pp) -- Er unterhielt sich nur mit mir. (NN.AR.Bo)
   Hypernyms: [austauschen]
   Hyponyms: [klönen] [labern] [palavern] [philosophieren] [plauschen] [plaudern, schwatzen, schnattern]
6. (v) [unterhalten, alimentieren] (to support sb.) -- für jmds. Lebensunterhalt aufkommen
   • [unterhalten] -- NN.AN -- Er unterhält eine sieben-köpfige Familie. (NN.AN)
   • [alimentieren]
   Hypernyms: [ernähren, nähren]
GermaNet: Purpose
• GermaNet development started in 1997 at the Department of Linguistics at the University of Tübingen
• Developed to serve as an electronic lexicographic reference database for German word senses
• Primarily intended as a resource for word sense disambiguation, which is crucial for natural language applications such as
  - Information retrieval
  - Construction of language technology tools
  - Annotation of corpora
  - Machine translation
GermaNet: Size
• Number of lexical units: 84,600
  - Adjectives: 8,100 lexical units
  - Nouns: 64,100 lexical units
  - Verbs: 12,300 lexical units
• Number of synsets: 61,700
  - Adjectives: 5,600 synsets
  - Nouns: 46,900 synsets
  - Verbs: 9,200 synsets
• 84,600 literals (1.10 readings per literal)
• Lexical relations: 3,500
• Conceptual relations: 73,700
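The average of 1.37 lexical units per synset quoted on the earlier synsets slide follows directly from these totals. The short Python calculation below reproduces it and derives the same ratio per word category. The per-category ratios are computed here and are not stated in the slides. The per-category lexical unit counts add up to 84,500 and not 84,600, presumably because the slide figures are rounded.

```python
# Figures as given on the slide.
lexical_units = {"adjectives": 8_100, "nouns": 64_100, "verbs": 12_300}
synsets = {"adjectives": 5_600, "nouns": 46_900, "verbs": 9_200}
total_lexical_units = 84_600
total_synsets = 61_700

# Overall average number of lexical units per synset.
units_per_synset = total_lexical_units / total_synsets
print(f"{units_per_synset:.2f}")  # 1.37

# The same ratio per word category (derived, not from the slides).
for category in lexical_units:
    ratio = lexical_units[category] / synsets[category]
    print(f"{category}: {ratio:.2f} lexical units per synset")
```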
Tools for GermaNet
• Application Programming Interfaces
  - Java API
  - Perl API
• Web application: http://weblicht.sfs.uni-tuebingen.de:8080/gnet/
• Web service: as part of WebLicht
• GermaNet-Explorer: visualisation tool (developed at the University of Dortmund)
• GernEdiT: GermaNet editing tool
GermaNet: Data Formats
• Former:
  - Lexicographer files: complex legacy format
• Now:
  - Relational database
• Export formats:
  - Proprietary XML format: distribution format
  - Lexical Markup Framework: XML, ISO standard
  - Princeton WordNet format
GermaNet: Lexicographer Files
(*** Nüsse ***)
{Nuss, Nuß*o, Nusskern, ?festes_Nahrungsmittel,@ nomen.Pflanze:Nuss,@ ('der essbare Kern einer Nuss')}
{Haselnuss, Haselnuß*o, Haselnusskern, Haselnußkern*o, Nuss,@ nomen.Pflanze:Haselstrauch,#}
{Kokosnuss, Kokosnuß*o, Nuss,@ nomen.Pflanze:Kokospalme,#}
{Betelnuss, Betelnuß*o, Nuss,@ Genussmittel,@}
{Erdnuss, Erdnuß*o, Erdnusskern, Erdnußkern*o, Nuss,@ nomen.Pflanze:Erdnusspflanze,#}
{Cashewkern, Cashewnuss, Cashewnuß*o, Nuss,@ nomen.Pflanze:Acajubaum,#}
...
GermaNet: Lexicographer Files
• Lexicographer files have three main shortcomings:
  1. No visualization → difficult to insert new items
  2. Complex data format → syntax errors and semantic inconsistencies
  3. No versioning → impossible to trace changes
GernEdiT – The GermaNet Editing Tool
• Developed to overcome the shortcomings of the lexicographer files
  1. No visualization → graphical tool (search and browse GermaNet)
  2. Complex data format → user-friendly tool (with internal consistency checks)
  3. No versioning → editing history
GernEdiT – The GermaNet Editing Tool

GermaNet: Links & References
• GermaNet homepage: http://www.sfs.uni-tuebingen.de/GermaNet/
• GermaNet web application: http://weblicht.sfs.uni-tuebingen.de:8080/gnet/
• Verena Henrich and Erhard Hinrichs: GernEdiT - The GermaNet Editing Tool. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010. http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf
• Verena Henrich and Erhard Hinrichs: GernEdiT: A Graphical Tool for GermaNet Development. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 2010. http://www.aclweb.org/anthology/P10-4004
• Christiane Fellbaum (ed.): WordNet – An Electronic Lexical Database. The MIT Press, 1998.
• Princeton WordNet homepage: http://wordnet.princeton.edu/
• Princeton WordNet web application: http://wordnetweb.princeton.edu/perl/webwn
WebLicht – Web-Based Linguistic Chaining Tool

WebLicht: Motivation
• Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available
• Most of them are implemented to run on local machines
  - This can be inconvenient, time-consuming, and error-prone because a user has to install all necessary tools
• Requirement: avoid the “download-first” paradigm
  → One possible solution: make tools and resources available on the web
WebLicht: Motivation
• Make linguistic resources and tools available online
  - Some can easily be put online for download or for online usage
  - For others, more effort is necessary, e.g. because access to the resources has to be limited or because they are difficult to make usable online
  → Solution: a service-oriented architecture (SOA)
  - The end user needs just a web browser: no download, installation, or configuration of software is necessary
WebLicht: Purpose
• WebLicht: Web-Based Linguistic Chaining Tool
• A service-oriented architecture for linguistic analysis of text
• Users do not need to download and install any software on their own computers
• Allows the user to run several linguistic tools, implemented as web services (WS), in succession
  → It is possible to build a chain of linguistic web services without having to deal with the technical details involved in building linguistic tool chains
WebLicht: Architecture
• Development started in October 2008
• WebLicht is implemented as a web application
  - Ensures that the tool chains are valid
  - Generates the calls to the web services
  - Displays the results
• WebLicht consists of the following components:
  - Distributed web services: offer functionality (resources and tools) over the internet
  - Registry: stores metadata and technical information about the web services
  - Web user interface: interacts with the user and combines services and information from the registry
WebLicht: Architecture
[Figure: WebLicht architecture diagram. Components labelled: standard-conformant text corpus encoding; Web 2.0 application for tool chaining and execution; Registry. Sites shown: Stuttgart, Tübingen, Berlin, Leipzig, Finland, Romania, Iceland, UK.]
WebLicht: Features
• With REST-style web services, everyone can implement a web service for WebLicht
• Web services can be accessed by any client that can use the HTTP protocol
  - Web browser (user interface)
  - Command-line tools (wget, curl)
  - Scripts/programming languages
  → Anyone can implement their own interface to WebLicht
• The SOA infrastructure is independent of programming languages and operating systems
• The chaining algorithm is independent of the data format used
• From a legal point of view, the web services remain located at the institute where they were created (service providers)
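Because the services are reachable over plain HTTP, a script can call them with nothing but a standard library. The following Python sketch prepares such a call using `urllib.request`. The endpoint URL, the XML payload, and the `Content-Type` value are placeholders chosen for this illustration and are not taken from WebLicht. A real service address would come from the registry.

```python
import urllib.request

# Hypothetical endpoint, used only for illustration.
SERVICE_URL = "http://example.org/weblicht/tokenizer"


def build_request(url, payload):
    """Prepare an HTTP POST request carrying the document to be processed."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed type
        method="POST",
    )


def call_service(url, payload, timeout=30):
    """Send the request and return the service's response body as text."""
    request = build_request(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network; call_service() would.
request = build_request(SERVICE_URL, "<text>Das ist ein Test.</text>")
print(request.get_method(), request.full_url)
```

The output of one `call_service` invocation could be passed as the payload of the next, which is how a script might chain services by hand.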
WebLicht: Web Services
• WS make the functionality of existing desktop applications and command-line tools available through the web
• Approximately 120 web services are currently available
  - 3 countries: Germany, Finland, Romania
  - 8 partners
  - 10 languages: German, English, Finnish, Italian, French, Spanish, Czech, Hungarian, Romanian, Slovenian
  - More than 10 different tools: converter, tokenizer, part-of-speech tagger, lemmatizer, parser, etc.
• Web services are implemented in REST style
• The HTTP POST method is used to send data from the user interface to the web services
WebLicht: Integrating New Services
• Building a web service for WebLicht consists of the following steps:
  1. Create a REST-style web service around the tool as a wrapper
  2. Make input and output compatible with the TCF format
  3. Register the service in the registry
[Figure: Input in TCF format → Wrapper (REST-style web service) around the tool or resource → Output in TCF format]
• There are tutorials for Java and for Perl at: http://weblicht.sfs.uni-tuebingen.de/englisch/weblicht.shtml
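Step 1, wrapping an existing tool in a web service, can be sketched with Python's built-in WSGI support. This is a minimal illustration and not the approach of the Java or Perl tutorials: `tokenize` is a stand-in for the wrapped tool, the function names and port are made up, and for brevity the sketch exchanges plain text where a real WebLicht service would read and write TCF (step 2).

```python
import io
from wsgiref.simple_server import make_server


def tokenize(text):
    """Stand-in for the wrapped tool; a real wrapper would call the tool."""
    return text.split()


def application(environ, start_response):
    """WSGI app: accept POSTed text, run the tool, return one token per line."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Allow", "POST")])
        return [b""]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    text = environ["wsgi.input"].read(length).decode("utf-8")
    body = "\n".join(tokenize(text)).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="localhost", port=8000):
    """Run the wrapper as a standalone HTTP service (blocks until stopped)."""
    with make_server(host, port, application) as server:
        server.serve_forever()


# Exercise the wrapper directly, without opening a network socket.
payload = "Das ist ein Test".encode("utf-8")
environ = {
    "REQUEST_METHOD": "POST",
    "CONTENT_LENGTH": str(len(payload)),
    "wsgi.input": io.BytesIO(payload),
}
statuses = []
body = b"".join(application(environ, lambda status, headers: statuses.append(status)))
print(statuses[0], body.decode("utf-8").split("\n"))
```

Calling `serve()` would expose the same function over HTTP. Keeping the tool call in its own function means the wrapper stays unchanged when the tool is swapped.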
WebLicht: Registry
• Every WS in WebLicht is registered in a central registry
• Implemented at the University of Leipzig
• Consists of a relational database containing all the information about the web services
  - Metadata (creator, name, address)
  - Processing information (input and output specifications)
• Information about every registered web service can be obtained through RESTful web services
  - Which services are available?
  - How can I combine them?
  - Which input/output format does a service accept/produce?
• Example: a tokenizer has already been applied to a plain text; which services can be used next?
WebLicht: Processing Chains
• The chaining algorithm ensures that the list of possible next web services only contains web services that form a valid next step in the chain
  - For example: a part-of-speech tagger can only be added to a chain after a tokenizer has been chosen
• The metadata of each WS contains information about
  - The input requirements
  - The produced output
• Communication between the web services is based on the TCF data format
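The chaining rule can be sketched as a set comparison: a service is a valid next step when everything it requires has already been produced earlier in the chain. The Python sketch below illustrates this under stated assumptions. The service names, the layer labels, and the extra rule that a service is not offered again once its output exists are all hypothetical and are not taken from the WebLicht registry or its implementation.

```python
# Hypothetical service metadata: input requirements and produced output.
SERVICES = {
    "tokenizer": {"requires": {"text"}, "produces": {"tokens"}},
    "pos-tagger": {"requires": {"text", "tokens"}, "produces": {"postags"}},
    "parser": {"requires": {"tokens", "postags"}, "produces": {"parse"}},
}


def available_layers(chain, initial=frozenset({"text"})):
    """Collect every annotation layer present after running the chain."""
    layers = set(initial)
    for name in chain:
        layers |= SERVICES[name]["produces"]
    return layers


def valid_next_services(chain):
    """Return services whose requirements are met by the current chain.

    Assumption: services whose output already exists are not offered again.
    """
    layers = available_layers(chain)
    return sorted(
        name
        for name, meta in SERVICES.items()
        if meta["requires"] <= layers and not meta["produces"] <= layers
    )


print(valid_next_services([]))                           # ['tokenizer']
print(valid_next_services(["tokenizer"]))                # ['pos-tagger']
print(valid_next_services(["tokenizer", "pos-tagger"]))  # ['parser']
```

Because the check only compares layer names, it stays independent of how the layers are encoded in the exchanged documents.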
WebLicht: Text Corpus Format (TCF)
• Common data format for input and output of web services
• Each WS incrementally adds an annotation layer to TCF
  - Step 1: only plain text is in the TCF document
  - Step 2: tokens are added to the plain text
  - Step 3: a third layer containing part-of-speech tags is added
  - Step 4: a parse tree layer is added
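The incremental layering in steps 1 to 3 can be illustrated with a small XML document that grows as each tool runs. The Python sketch below uses simplified, hypothetical element and attribute names. It is not the actual TCF schema, and `toy_tagger` is a placeholder, not a real part-of-speech tagger.

```python
import xml.etree.ElementTree as ET


def new_document(text):
    """Step 1: a document that contains only the plain text."""
    corpus = ET.Element("TextCorpus")
    ET.SubElement(corpus, "text").text = text
    return corpus


def add_tokens(corpus):
    """Step 2: add a token layer; the text layer is left untouched."""
    tokens = ET.SubElement(corpus, "tokens")
    for i, word in enumerate(corpus.findtext("text").split(), start=1):
        ET.SubElement(tokens, "token", ID=f"t{i}").text = word
    return corpus


def add_pos_tags(corpus, tagger):
    """Step 3: add a tag layer that refers back to tokens by their IDs."""
    tags = ET.SubElement(corpus, "POStags")
    for token in corpus.find("tokens"):
        ET.SubElement(tags, "tag", tokenIDs=token.get("ID")).text = tagger(token.text)
    return corpus


def toy_tagger(word):
    """Placeholder tagger: capitalised words are labelled NOUN."""
    return "NOUN" if word[:1].isupper() else "OTHER"


doc = new_document("Peter isst eine Pizza")
add_tokens(doc)
add_pos_tags(doc, toy_tagger)

layers = [child.tag for child in doc]
print(layers)  # ['text', 'tokens', 'POStags']
print(ET.tostring(doc, encoding="unicode"))
```

Each function only appends a new layer and refers to earlier layers by ID, so the output of one step remains valid input for any later step.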
WebLicht: TCF Example (step 1)
WebLicht: TCF Example (step 2)
WebLicht: TCF Example (step 3)
WebLicht: User Interface
• Web 2.0 application for tool chaining and execution
  - Several forms of input:
    . Enter plain text
    . Upload a text (plain text, MS Word, RTF, or PDF files)
    . Select one of the sample texts
  - Build a chain of linguistic tools with information from the registry
  - Execution of the tool chain
  - Presentation of the results
• Implemented at the Department of Linguistics at the University of Tübingen
• Java application using an AJAX-driven toolkit
• Deployed in Apache Tomcat
WebLicht: User Interface

WebLicht: Results

WebLicht: Demonstration
• URL to the WebLicht web application
  - Via Shibboleth: https://weblicht.sfs.uni-tuebingen.de/WebLicht1.5s/
  - Local user management: http://weblicht.sfs.uni-tuebingen.de:8080/WebLicht1.5/
WebLicht: Links & References
• URL to the WebLicht web application: https://weblicht.sfs.uni-tuebingen.de/WebLicht1.5s/ (via Shibboleth)
• Verena Henrich, Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow: Service-Oriented Architectures: From Desktop Tools to Web Services and Web Applications. To appear in Dan Tufiş, Corina Forăscu (eds.): Multilinguality and Interoperability in Language Processing with Emphasis on Romanian. Romanian Academy Publishing House, 2010.
• Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow: WebLicht: Web-based LRT services for German. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 2010.
• Ulrich Heid, Helmut Schmid, Kerstin Eckart, and Erhard Hinrichs: A Corpus Representation Format for Linguistic Web Services: the D-SPIN Text Corpus Format and its Relationship with ISO Standards. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010.
• The WebLicht slides are copied from Thomas Zastrow and then modified
Thank you.
Verena Henrich Department of Linguistics University of Tübingen
Wilhelmstr. 19 72074 Tübingen Germany
[email protected] http://www.verenahenrich.de
Phone: +49 7071 29-77313