
MASARYK UNIVERSITY, FACULTY OF INFORMATICS

Building FrameNet in Czech

PH.D. THESIS PROPOSAL

Jiří Materna

Brno, September 2010

Supervisor: doc. PhDr. Karel Pala, CSc.

Contents

1 Introduction
2 Ontologies and frame-based approaches to lexical semantics
  2.1 Ontologies
  2.2 Frames
  2.3 WordNet
    2.3.1 EuroWordNet
  2.4 Verb valency lexicons
    2.4.1 PropBank
    2.4.2 VerbNet
    2.4.3 BRIEF
    2.4.4 Vallex
    2.4.5 VerbaLex
3 Frame Semantics and FrameNet
  3.1 The Berkeley FrameNet
    3.1.1 Semantic Frames
    3.1.2 Frame Elements
    3.1.3 FrameNet relations
    3.1.4 Semantic types
  3.2 FrameNets in other languages
    3.2.1 SALSA
    3.2.2 Spanish FrameNet
  3.3 Automatic methods of creating new FrameNets
4 Current results and aims of the thesis
  4.1 Current results
    4.1.1 Linking frames
    4.1.2 Assigning verb arguments
    4.1.3 Ontologies behind FrameNet and VerbaLex
    4.1.4 Exploitation of ontologies in linking FrameNet frame elements with semantic roles from VerbaLex
    4.1.5 Evaluation
    4.1.6 VerbaLex-FrameNet linking tool
  4.2 Case study for the Indicate verb class
    4.2.1 Annotation process
    4.2.2 Statistics and typological divergences
  4.3 Reusing Berkeley FrameNet frames
  4.4 New frames and consistency ensuring
  4.5 Semi-automatic annotation of corpus texts
  4.6 Schedule of further work
  4.7 Publications

Chapter 1 Introduction

Natural language understanding, a part of the natural language processing (NLP) field, has been intensively investigated by researchers for many years. Many NLP applications, such as information retrieval and machine translation, as well as disambiguation tasks on all levels, require techniques that enable, at least partially, semantic parsing and understanding. For example, simple speech-based natural language understanding systems that answer questions about flight arrival times, give directions, report on bank balances, or perform simple financial transactions have been widely deployed. More sophisticated experimental systems generate concise summaries of news articles, answer fact-based questions, and recognize complex semantic and dialogue structure in general. Unfortunately, current information extraction and dialogue understanding systems are often domain dependent.

Nowadays, there is a trend to build large domain independent electronic lexical databases containing as much semantic information as possible. Probably the best known semantic database is Princeton WordNet [8]. It is a large lexical resource of American English, developed under the direction of George A. Miller, where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets). Synsets are interlinked by means of conceptual-semantic and lexical relations (e.g. synonymy, antonymy, hypero/hyponymy, meronymy). Another complex electronic lexical resource of English, based on Frame Semantics [13] proposed by Charles J. Fillmore, is called FrameNet. In Frame Semantics, meaning is described in relation to a semantic frame, which consists of a target lexical unit (a pairing of a word with a sense), frame elements (its semantic dependants) and relations between them.

These lexical resources are not domain dependent but have some disadvantages. First of all, the coverage of FrameNet is insufficient. This is mainly caused by the methodology of its building, which proceeds frame by frame rather than lemma by lemma. On the other hand, Princeton WordNet has relatively high coverage but suffers from inconsistency and errors, which stem from the fact that WordNet has been built manually without any well-defined methodology based on corpus evidence.

The aim of the thesis is to propose a methodology of building a large, frame-based and domain independent lexicon of valency possibilities of the Czech language, with the effort to overcome low coverage and inconsistency as much as possible, and to create a core of such a lexicon exemplified in a sample corpus.

Throughout the world, there are several projects for different languages built on the idea of Frame Semantics. The largest and probably best known one is the original Berkeley FrameNet. The Saarbrücken team has been developing the German frame-based electronic lexicon SALSA [4], a Spanish team has been working on Spanish FrameNet [49], etc. For Czech, some experiments have been carried out by Jan Hajič's group in Prague [1], taking advantage of the lexical database Vallex built in this group. However, a Czech FrameNet as such has not been worked out so far.

In this work I give an overview of existing approaches to building frame-based lexicons and point out their advantages and limitations. The main focus is on FrameNet and FrameNet-like approaches. In the rest of this work I describe current results in building Czech FrameNet and outline the main ideas of the further work.

Chapter 2 Ontologies and frame-based approaches to lexical semantics

2.1 Ontologies

The term ontology has its origin in philosophy, where it denotes the study of the nature of being as well as of the basic categories of being and their relations. It was Aristotle who constructed the first well-defined ontology. In his Metaphysics [44] he analyzed the simplest elements to which the mind reduces the world of reality. In computer science, an ontology is a formal representation of a set of concepts¹ and the relationships between them.

In order to explain how ontologies are constructed and what they are good for, we can refer to logical systems. The existential quantifier in logic is a notation for asserting that something exists, but logic itself has no vocabulary for describing the things that exist. Ontology fills that gap. It is the study of the existence of all kinds of entities, abstract and concrete, that make up the world or a particular domain.

Domain ontologies, sometimes called domain-specific ontologies, model a specific part of the world. Universal, non-domain ontologies are called upper ontologies. An upper ontology is a model of the common world, which describes general concepts that are the same across all domains. There are several upper ontologies, including commercial ones (Cyc [25]) as well as freely available ones (SUMO [35], Dolce [15]).

2.2 Frames

Besides representing standalone concepts, a language understanding system must be able to organize knowledge in high-level structures. In symbolic logic, the basic units are predicates, which are connected by operators to create formulas representing high-level structures. Another possible way of organizing concepts is to use structures like frames. In the field of knowledge representation, the frame is a data structure introduced by Marvin Minsky [34], which is intended to represent complex objects or stereotyped situations. We can think of a frame as a network of nodes and relations, where some nodes are fixed and represent facts that are always true about the frame. The rest of the nodes represent slots, which must be filled by specific instances of data.

Inspired by Minsky, many researchers in different branches of science have adopted the idea of frames. In linguistics, the best known frame-based theory is Charles J. Fillmore's Frame Semantics [13], which will be discussed later².

1. When speaking about concepts we rather mean their labels because as such they are not linguistic entities.

2.3 WordNet

Princeton WordNet (PWN) [8] is a large lexical database of English, developed under the direction of George A. Miller at Princeton University. Its design has been inspired by psycholinguistic research and computational theories of human lexical memory [33]. Entries in PWN are nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. Different senses of polysemous words (literals) are distinguished by their sense IDs (positive integers). Synsets in PWN are interlinked by means of relations, which form the net. Some of the most important relations, each with an example, are listed below.

• Synonymy is the most important relationship in WordNet. According to Leibniz's definition, synonyms are different words that can replace each other in any context without changing the truth value of the proposition. This definition is, however, too strict and is fulfilled only by a small number of very similar words. That is why near synonymy is used in WordNet. According to the definition of near synonymy, two different words are synonyms if one can replace the other in the same context without changing the sense of the sentence³. An example of such a synonymy relation is the pair eat and consume.

2. It should be remarked that the roots of Fillmore's frames date back to his earlier work The Case for Case [9], in which he introduced deep (semantic) cases which are today understood as semantic roles.
3. Even this weaker condition need not always be fulfilled in WordNet.

• Hyponymy/hypernymy is a relationship between two synsets, where one's semantic range is within/outside of the other's. In computer science, this type of relationship is sometimes called IS-A, TYPE-OF, or lexical inheritance relation because all hyponyms must inherit all properties of their hypernyms. For example, the word animal is a hypernym of dog and car is a hyponym of vehicle.

• Meronymy/holonymy is a relationship between a whole and its parts. For example, wheel is a meronym of car and house is a holonym of door. This relationship is, similarly to the hyponymy/hypernymy relation, transitive.

• Synsets with opposite meaning are interlinked by an antonymy relationship. Words may have several different antonyms, depending on their meaning. For instance, both long and tall can be antonyms of short. The definition of antonymy may be a little confusing. For example, antonymy is not the same as negation: saying that someone is not rich does not mean that he or she is poor, although these two words are antonyms.

Whilst at the beginning of the project WordNet was intended to be a psycholinguistic study of the human mind, nowadays it is used in many branches of NLP as a general purpose reference source for the English language. It is freely available on the web⁴, and furthermore, it can be used, copied, modified and distributed as a part of commercial applications.
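The inheritance reading of the hyponymy/hypernymy relation can be illustrated on a toy graph. The following is only a minimal sketch, not the WordNet interface: the table `HYPERNYM` and the helper names `hypernym_chain` and `is_hyponym_of` are hypothetical, and only the links dog → animal and car → vehicle come from the examples above; the link animal → organism is an assumption added to show transitivity.

```python
# Toy hypernym table: each word maps to its direct hypernym.
# Only dog -> animal and car -> vehicle come from the text.
HYPERNYM = {
    "dog": "animal",
    "car": "vehicle",
    "animal": "organism",  # assumed extra link, used to show transitivity
}


def hypernym_chain(word):
    """Return all hypernyms of `word`, nearest first, by following IS-A links."""
    chain = []
    seen = {word}
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word in seen:  # guard against accidental cycles in the toy data
            break
        seen.add(word)
        chain.append(word)
    return chain


def is_hyponym_of(word, candidate):
    """True if `candidate` is a direct or inherited hypernym of `word`."""
    return candidate in hypernym_chain(word)


print(hypernym_chain("dog"))
print(is_hyponym_of("car", "vehicle"))
```

In this sketch each word has a single direct hypernym; a real lexical database allows several, which would turn the chain into a graph traversal.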

2.3.1 EuroWordNet

EuroWordNet [51] is a multilingual database based on the same principles as PWN in terms of synsets with basic semantic relations between them. The project was completed in the summer of 1999 and currently incorporates, apart from the English PWN, WordNets of seven European languages (Czech, Dutch, Estonian, French, German, Italian and Spanish).

Synsets in the WordNets of all listed languages are linked via the Inter-lingual Index (ILI), coming from PWN. The languages are interconnected by means of this index, thus it is possible to traverse from the synsets in one language to similar synsets in any other language. The index also gives access to a shared top-ontology of 63 semantic items. This top-ontology provides a common semantic framework for all the languages, while language specific relations are maintained in the individual WordNets. The database can be used in both monolingual and cross-lingual NLP tasks, but in contrast to PWN, most of the WordNets are not freely available.

The EuroWordNet project has been followed by similar projects for other languages. In 2001, the BalkaNet project started, intended to extend the database of WordNets linked to PWN. It concerns the development of Balkan WordNets (Greek, Turkish, Romanian, Bulgarian and Serbian) as well as the second phase of the development of the Czech WordNet [39].

4. http://wordnet.princeton.edu/wordnet/download/

2.4 Verb valency lexicons

The linguistic notion of valency is derived from the notion of valency in chemistry and was first used by Lucien Tesnière in his work on the syntactic analysis of sentences [50]. Nowadays, there are many models of verb valency, each with its own specifics, but we can roughly say that valencies are the arguments required by a verb in a particular language. The rest of this section mentions several current approaches to verb valency in English and Czech.

2.4.1 PropBank

The Proposition Bank project [42] started in 2000 and aims to extend the Penn Treebank II Wall Street Journal Corpus [30] by adding predicate-argument structure annotation. The current project annotates only verbal predicates, but nouns, adjectives and prepositions are under consideration. For each verb it was necessary to distinguish its individual senses first, and then to store the argument structures of the verb in each sense separately. The arguments of each sense are numbered sequentially from Arg0 to Arg5, where the numbering may have a different meaning in the context of different verbs. The verb itself is marked as Rel, and modifiers are denoted as ArgM. The example below illustrates the PropBank notation:

... the company to offer a 15% stake to the public.
Arg0: the company
Rel: offer
Arg1: a 15% stake
Arg2-to: the public
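The annotation above can be held in a small record, which makes the split between numbered arguments and ArgM modifiers explicit. This is a minimal sketch under stated assumptions: the class name `PropBankInstance` and its fields are hypothetical and do not reflect the PropBank file format.

```python
from dataclasses import dataclass, field


@dataclass
class PropBankInstance:
    """One annotated predicate (hypothetical container, not the PropBank format)."""
    sentence: str
    rel: str                                   # the verb itself (the Rel mark)
    args: dict = field(default_factory=dict)   # label -> text span

    def numbered_args(self):
        """Return the ArgN entries only, leaving out ArgM modifiers."""
        return {label: span for label, span in self.args.items()
                if not label.startswith("ArgM")}


example = PropBankInstance(
    sentence="... the company to offer a 15% stake to the public.",
    rel="offer",
    args={
        "Arg0": "the company",
        "Arg1": "a 15% stake",
        "Arg2-to": "the public",
    },
)

print(example.rel, example.numbered_args())
```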

2.4.2 VerbNet

VerbNet is a lexicon of English verbs developed at the Department of Linguistics, University of Colorado. It is partially based on the PropBank project, whose syntactic frames it uses. The main difference is that VerbNet contains both syntactic and semantic information. The fundamental assumption is that the syntactic frames of a verb directly reflect the underlying semantics. VerbNet associates the semantics of a verb with its syntactic frames by combining them with traditional lexical semantic information such as thematic roles and semantic predicates. Each syntactic frame in VerbNet is assigned to a semantic class based on Beth Levin's verb classification [27]. In her work, Levin recognizes about 80 types of alternations. The alternations which are (or are not) applicable to a given verb then determine its semantic class⁵. Each VerbNet class contains a set of syntactic descriptions, or syntactic frames, depicting the possible surface realizations of the argument structure. The lexicon was recently extended by an additional set of new classes⁶ and is currently a freely available resource⁷, which constitutes the most comprehensive and versatile Levin-style verb classification for English.

2.4.3 BRIEF

BRIEF [41], an electronic lexicon of Czech verb valencies, was developed at the Faculty of Informatics, Masaryk University in 1997. It contains about 15,000 Czech verb lemmata and nearly 50,000 valency frames. The lexicon has been built from several printed dictionaries of Czech as well as from an electronic lexicon of Czech stems [38]. For each verb, BRIEF contains a list of frames separated by commas. A frame is a sequence of elements separated by dashes, where each element is represented as a sequence of attribute-value pairs. Attributes are denoted by lower case letters, and values either by capital letters or by strings delimited by braces.
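The layout just described (frames separated by commas, elements by dashes, attribute-value pairs inside each element) can be parsed with a short routine. This is a minimal sketch that follows only the description above; the function names, the regular expression and the sample entry `hP-hTr{na},hP` are hypothetical and are not taken from the BRIEF lexicon. The sketch also assumes that braced strings contain no commas or dashes.

```python
import re

# One attribute-value pair: a lower case attribute letter followed by
# either a run of capital letters or a string delimited by braces.
PAIR = re.compile(r"([a-z])([A-Z]+|\{[^{}]*\})")


def parse_element(element):
    """Split one frame element into a list of (attribute, value) pairs."""
    pairs = []
    pos = 0
    while pos < len(element):
        match = PAIR.match(element, pos)
        if match is None:
            raise ValueError(f"cannot parse element {element!r} at offset {pos}")
        attribute, value = match.groups()
        pairs.append((attribute, value.strip("{}")))  # drop the braces, if any
        pos = match.end()
    return pairs


def parse_entry(entry):
    """Parse a verb entry: frames split on commas, elements split on dashes."""
    return [[parse_element(element) for element in frame.split("-")]
            for frame in entry.split(",")]


# Hypothetical entry with two frames; the first frame has two elements.
print(parse_entry("hP-hTr{na},hP"))
```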

5. Levin's classes are problematic because they do not take into consideration any corpus data. It can be observed that the individual classes contain verbs that considerably differ in their meanings, thus the classes are sometimes semantically inconsistent.
6. Approximately 200 additional classes.
7. http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

2.4.4 Vallex

Vallex [52], a valency lexicon of Czech verbs, has been developed at the Institute of Formal and Applied Linguistics at Charles University in Prague since 2001. Vallex uses the Functional Generative Description [47] as its theoretical background and is closely related to the Prague Dependency Treebank [19]. The main goal of the project is to describe valencies of Czech verbs on both the syntactic and the semantic layer. The first version of the lexicon, Vallex 1.0, was published in 2003 and contains about 1,400 verb entries. The current version, Vallex 2.5, contains almost 4,300 entries and was published in 2008.

The consistency of verbs in Vallex is improved by grouping them into classes of semantically similar verbs. Vallex classes are based on the so-called alternation model [29]; there are currently 22 of them. The key information on the valency structure of a verb is encoded in the form of valency frames. Valency frames are stored as a sequence of slots, where each slot represents one valency complementation and consists of its type, morphemic realization and obligatoriness. The position of the verb is not marked in the Vallex frame. An example of the frame for the verb vařit (to boil) is shown in Figure 2.1.

Figure 2.1: Example of a Vallex frame.

2.4.5 VerbaLex

VerbaLex [22] is an electronic database of verb valency frames in Czech, which has been developed in the Centre for Natural Language Processing at the Faculty of Informatics, Masaryk University. Basic units (entries) in VerbaLex consist of verb lemmata grouped into synsets together with their sense numbers in standard WordNet notation. Verb valencies are realized on two levels: a deep valency level, which corresponds to semantic roles (semantic values of the verb arguments), and a surface level, which reflects information about syntactic and morphological valencies. The current version of VerbaLex contains more than 6,000 synsets, more than 21,000 verb senses and approximately 10,500 verb lemmata in 19,500 valency frames.

The valency frame represents verb valencies on both the syntactic and the semantic level. In the centre of the frame, there is a mark representing the verb position, surrounded by the left-hand and right-hand arguments in the canonical word order. The type of the valency relation for each constituent element is marked as obligatory or optional. Semantic information about the verbal complement is represented by two-level semantic roles.

The first level is represented by main labels primarily based on the EuroWordNet [51] first-order and second-order top ontology entities arranged in a hierarchical structure. The inventory of these labels is closed and currently contains 33 items (concepts). On the second level there is a collection of selected lexical units (literals) from the set of EuroWordNet and BalkaNet Base Concepts [39] with their respective sense numbers. The list of second level semantic roles is open and currently contains about 1,200 literals.

A valency frame also contains additional information about verbs. Combining all this information we obtain Complex Valency Frames (CVFs). The additional information from CVFs includes:

• definition of verb meaning

• verb ability to create passive form

• number of meanings for homonymous verbs

• semantic class a verb belongs to

• aspect (perfective or imperfective)

• example of verb use

• types of reflexivity for reflexive verbs

An example of a VerbaLex frame for jíst, přijímat (to eat) is shown in Figure 2.2.

Figure 2.2: VerbaLex frame example: to eat

Chapter 3 Frame Semantics and FrameNet

The linguistic basis of Frame Semantics and the FrameNet project can be found in the theory of Case Grammar, beginning with the work by Charles J. Fillmore in 1968 [9]. It was offered as a contribution to the transformational-generative grammar promoted by Noam Chomsky and his followers. The main idea of Case Grammar is the proposal that deep syntactic structures are best expressed as configurations of deep cases. These deep cases are expressed as a fixed set of general semantic role names such as Agent, Patient, Time, etc.

From the start, however, there were questions about the correct set of semantic roles, and about whether a small and fixed set of semantic roles can characterize all predicates of natural languages. Indeed, in his later work [10, 12], Fillmore showed that such a fixed set of semantic roles is not sufficient to characterize all language predicates, and proposed the theory of Frame Semantics [13], where frame elements represent frame specific situation roles. The central idea of Frame Semantics is that word meaning is described in relation to a semantic frame, which consists of a target lexical unit (a pairing of a word with a sense), frame elements (its semantic arguments) and relations between them.

3.1 The Berkeley FrameNet

The computational project FrameNet [43] has been running at the International Computer Science Institute in Berkeley since the end of the 1990s. The main goal of the project is to extract information about the linked semantic and syntactic properties of English words from large electronic text corpora, using both manual and automatic procedures. The name "FrameNet" is inspired by WordNet, reflecting the fact that the project is based on the theory of Frame Semantics and that the semantic frames form a network.

The information about words and their properties is stored in an electronic lexical database. Possible syntactic realizations of the semantic roles associated with a frame are exemplified in the annotated FrameNet corpus. Currently, the Berkeley FrameNet consists of more than 10,000 lexical units across the parts of speech, associated with more than 800 frames exemplified in more than 135,000 annotated sentences.

3.1.1 Semantic Frames

A semantic frame is defined as a coherent structure of related concepts such that without knowledge of all of them, one does not have complete knowledge of any one of them; frames are in that sense types of gestalt [11]. In other words, it is a script-like conceptual structure that describes a particular type of situation, object or event along with its participants and properties [43]. A lexical unit is a pairing of a word with a meaning. Typically, each sense of a polysemous word belongs to a different semantic frame. For example, the Commerce sell frame describes a situation in which a seller sells some goods to a buyer, and is evoked by lexical units such as auction, retail, retailer, sale, sell, etc. The semantic participants are called Frame Elements.

3.1.2 Frame Elements

The semantic valencies of a lexical unit are expressed in terms of the kinds of entities that can participate in frames of the type evoked by the lexical unit. The valencies are called frame elements. Frame elements (FEs) bear some resemblance to the argument variables used in first-order predicate logic, but have important differences derived from the fact that frames are much more complex than logical predicates [14]. In the example above, the frame elements include Seller, Goods, Buyer, etc.

FrameNet distinguishes three types of frame elements: core FEs (the presence of such FEs is necessary to satisfy a semantic valency of a given frame), peripheral FEs (they are not unique for a given frame and can usually occur in any frame, typically expressions of time, place, manner, purpose, attitude, etc.), and extra-thematic FEs (these FEs have no direct relation to the situation identified with the frame, but add new information, often showing how the event represented by one frame is a part of an event involving another frame). An example of a FrameNet frame is shown in Figure 3.1.

Ingestion

Definition: An Ingestor consumes food or drink (Ingestibles), which entails putting the Ingestibles in the mouth for delivery to the digestive system. This may include the use of an Instrument. Sentences that describe the provision of food to others are NOT included in this frame.

FEs:

Core:
  Ingestibles [Ingible]: The Ingestibles are the entities that are being consumed by the Ingestor. Semantic Type: Sentient

Non-core:
  Degree [Degr]: The extent to which the Ingestibles are consumed by the Ingestor. Semantic Type: Degree
  Duration [Dur]: The length of time spent on the ingestion activity.
  Instrument [Ins]: The Instrument with which an intentional act is performed. Semantic Type: Physical_entity
  Manner [Manr]: Manner of performing an action. Semantic Type: Manner
  Means [Mns]: An act performed by the Ingestor that enables them to accomplish the whole act of ingestion. Semantic Type: State_of_affairs
  Place [Place]: Where the event takes place. Semantic Type: Locative_relation
  Purpose [Pur]: The action that the Ingestor hopes to bring about by ingesting. Semantic Type: State_of_affairs
  Source [Src]: Place from which the Ingestor takes the Ingestibles.
  Time [Time]: When the event occurs. Semantic Type: Time

Lexical units: breakfast.v, consume.v, devour.v, dine.v, down.v, drink.v, eat.v, feast.v, feed.v, gobble.v, gulp.n, gulp.v, guzzle.v, have.v, imbibe.v, ingest.v, lap.v, lunch.v, munch.v, nibble.v, nosh.v, nurse.v, put away.v, put back.v, quaff.v, sip.n, sip.v, slurp.n, slurp.v, snack.v, sup.v, swig.n, swig.v, swill.v, tuck.v

Figure 3.1: FrameNet frame example: Ingestion

3.1.3 FrameNet relations

In order to make FrameNet more comprehensive, several frame-to-frame relations are introduced. Each frame relation in FrameNet is an asymmetric relation between two frames, where one frame (the more abstract) can be called the super-frame and the other (the less abstract) can be called the sub-frame. The most important relations are the following:

• Inheritance – the strongest relation between frames, corresponding to the IS-A relation in ontologies. Each sub-frame must inherit all properties of its super-frame and share all FEs.

• Subframe – some frames are complex in the sense that they refer to sequences of states and transitions, each of which can be separately described as an individual frame. The separate frames (sub-frames) are related to the complex frames via the Subframe relation.

• Perspective on – this relation indicates the presence of at least two different points of view. For example, a neutral Commerce goods transfer frame has two points of view: Commerce buy and Commerce sell. The neutral frame is usually non-lexical [43].

• Using – the super-frame constitutes the background for its sub-frames. Not all attributes of the super-frame must be inherited by the sub-frames. For example, Volubility uses the Communication frame, since Volubility describes a quantification of communication events.

An example of relations between frames in FrameNet is shown in figure 3.2.
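The Inheritance condition stated above (a sub-frame shares all FEs of its super-frame) lends itself to a simple consistency check. The sketch below is only an illustration under stated assumptions: the `Frame` class, the function `missing_inherited_elements` and the toy frames are hypothetical, FEs are assumed to be shared under identical names, and nothing here reflects the FrameNet data format.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Toy frame: a name, a set of FE names, and names of its super-frames."""
    name: str
    elements: set
    inherits_from: list = field(default_factory=list)


def missing_inherited_elements(frame, frames):
    """Return FEs of all direct and indirect super-frames that `frame` lacks."""
    missing = set()
    stack = list(frame.inherits_from)
    visited = set()
    while stack:
        parent_name = stack.pop()
        if parent_name in visited:
            continue
        visited.add(parent_name)
        parent = frames[parent_name]
        missing |= parent.elements - frame.elements
        stack.extend(parent.inherits_from)
    return missing


# Hypothetical frames, invented for the example only.
FRAMES = {
    "Super_frame": Frame("Super_frame", {"Agent", "Time"}),
    "Mid_frame": Frame("Mid_frame", {"Agent", "Time", "Place"}, ["Super_frame"]),
    "Sub_frame": Frame("Sub_frame", {"Agent", "Time"}, ["Mid_frame"]),
}

print(missing_inherited_elements(FRAMES["Sub_frame"], FRAMES))
```

An empty result means the frame is consistent with the Inheritance relation in this simplified reading; the Using relation would not require such a check, since not all attributes have to be inherited.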

3.1.4 Semantic types

We can say that FrameNet comprises two independent ontologies. The first is the hierarchy of FrameNet frames arranged according to the frame-to-frame relations, especially the inheritance relation. The second is the hierarchy of semantic types (e.g. Sentient for the Cognizer FE), which are connected with some general frame elements.

The general use of semantic types in the FrameNet project is to record information that is representable neither in frames nor in the frame element hierarchies. Specifically, semantic types are mainly employed for the following functions: indicating the basic typing of fillers of frame elements, marking non-lexical types of frames, and recording important semantic differences between lexical units belonging to the same frame.

Each frame element may be connected with one or more semantic types. Unfortunately, only approximately 20% of all frame elements (usually very general ones) are connected with any of them. Moreover, the hierarchy of semantic types does not correspond to any existing ontology (although it resembles the Top Ontology in WordNet).

Figure 3.2: FrameNet relation example: Crime scenario

3.2 FrameNets in other languages

As the popularity of Berkeley FrameNet has grown, many research teams throughout the world have begun developing FrameNets for other languages. The largest projects include SALSA (the Saarbrücken Lexical Semantics Acquisition Project) [4] for German, Spanish FrameNet [49] and Japanese FrameNet [37]. FrameNets are also being built for Chinese, Italian, French, Romanian and other languages.

3.2.1 SALSA

The aim of the SALSA project is to create a German lexical resource annotated with semantic information following the theoretical background of frame semantics, and to investigate methods for its utilization. The SALSA team has chosen a corpus-based approach, which extends the existing German treebank TIGER [2] with flat trees representing frame semantic information. An example of a simple annotation instance is shown in figure 3.3. The root node, which stands for a single FrameNet frame, is labelled with the frame name and the edges are labelled with the frame element names. The verb antwortet (answers) introduces the frame Communication response, the NP subject die Branche is annotated as Speaker and schlecht as Message.

Figure 3.3: SALSA annotation example: Communication response

The annotation process is fully manual. Two independent annotators proceed one predicate at a time, and in the case of disagreement a third one resolves the conflict. Cross-lingual divergences are handled both by adding new frames and by adding new frame elements to frames. The current SALSA release 1 primarily annotates verbal predicates and contains approximately 500 German predicates in 20,000 annotated instances. It uses 252 original FrameNet frames and 373 new proto-frames; the average number of frames per predicate is 2.8 [4].

3.2.2 Spanish FrameNet

Spanish FrameNet is a research project creating an on-line lexical resource for Spanish, based on frame semantics and supported by corpus evidence, which is being developed at the Autonomous University of Barcelona and the International Computer Science Institute (Berkeley). The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses through human-approved and automatically annotated example sentences.

Annotated sentences are the building blocks of the database. The sentences are marked up in XML and form the basis of the lexical entries. This format supports searching by lemma, frame, frame element, and combinations of these. The Spanish FrameNet Corpus includes both New World and European Spanish and is composed of texts of different genres, primarily newspapers, newswire texts, book reviews, and humanities essays.

The Spanish FrameNet database will act both as a dictionary and a thesaurus. The dictionary features include definitions, tables showing how frame elements are syntactically expressed in sentences containing each word, and human-approved annotated examples from the corpus. As in a thesaurus, words are linked to the semantic frames in which they participate, and frames, in turn, are linked to word lists and to related frames.

The semantic and syntactic annotation is carried out using the system developed by the Berkeley FrameNet Project, whose input consists of files that have been extracted from the corpus, POS tagged, lemmatized, and chunked. Each Spanish FrameNet entry will provide links to other lexical resources, including Spanish WordNet synsets. The first release of the lexicon is available to the public and contains more than 1,000 lexical items (verbs, predicative nouns, adjectives, adverbs, prepositions and entities) representative of a wide range of semantic domains, connected to 325 frames in more than 10,000 sentences.

3.3 Automatic methods of creating new FrameNets

All FrameNet projects mentioned so far have been built manually or at most with the aid of supporting tools. Creating lexical resources in that way, however, is too expensive and time consuming. Nevertheless, there are projects that tackle the problem automatically. We can mention the English-Chinese bilingual FrameNet project [6] and a recently published semi-automatic approach to building an Italian FrameNet [26].

Chapter 4 Current results and aims of the thesis

The aim of the thesis is to propose a methodology of building a domain independent lexicon of valency possibilities of the Czech language based on Frame Semantics, and to create a core of that lexicon exemplified in a corpus. The core should comprise the most frequent frame-bearing Czech words.

Building a new FrameNet from scratch would be unnecessarily difficult and time consuming, which is why I intend to begin the work by linking VerbaLex frames with the Berkeley FrameNet ones. By investigating possible links between these two resources, we can more easily identify cross-linguistic divergences between Czech and English in terms of Frame Semantics.

Within the scope of the Ph.D. thesis, I am going to create approximately 100 Czech frames, each of them exemplified by at least 10 sentences in a corpus. Because the number of annotated sentences will be relatively small, the corpus has to be well-balanced and should reflect general literary language. This condition is fulfilled by a recently created multi-layer annotated corpus [18] built on DESAM [40]. We can even benefit from its manual morphological and syntactic annotation.

4.1 Current results

My current work [32] investigates the relationship between Berkeley FrameNet and VerbaLex. The work mainly aims to link these lexical resources together and to obtain a set of FrameNet frames that can be reused for Czech. In order to reduce manual work, which is expensive and time consuming, I have proposed a semi-automatic approach that requires less human effort. By linking FrameNet with VerbaLex, we are able to find a non-trivial subset of interlingual FrameNet frames (including their frame-to-frame relations), which I am going to use as a core for building the Czech FrameNet.

4.1.1 Linking frames

The process of linking frames is asymmetric: at most one frame from FrameNet is assigned to a frame in VerbaLex. There are several ways of looking for the appropriate frame. The most straightforward solution is to translate the verbs belonging to a given synset in VerbaLex from Czech to English and to find their equivalents among the lexical units in FrameNet. In this approach, each VerbaLex frame is represented by the set of translations of the verb literals from the given synset. The electronic Czech-English dictionary developed at the University of West Bohemia (footnote 1), which can be freely downloaded under the GNU/FDL licence, has been used as the translation dictionary. The translation set is compared with the lexical units of each candidate FrameNet frame using the Jaccard coefficient (footnote 2). The method based on verb translations is shown in figure 4.1.

[Figure 4.1 diagram: a VerbaLex frame is linked to a WN synset with synonyms 1 to n; their CZ-ENG translations are compared by Jaccard similarity with the lexical units of FrameNet frames 1, 2 and 3.]

Figure 4.1: Verb translation approach.
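The translation-based comparison can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the linking system: the function names `jaccard` and `rank_frames`, the translation set and the abridged lexical-unit lists are all assumptions made up for the example.

```python
def jaccard(a, b):
    """Jaccard coefficient: |a ∩ b| / |a ∪ b| (0.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_frames(translations, frame_lexical_units):
    """Rank FrameNet frames by similarity to the translated synset verbs."""
    scores = {
        frame: jaccard(translations, units)
        for frame, units in frame_lexical_units.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical English translations of the verbs in one Czech synset.
translations = {"sell", "vend", "retail"}

# Hypothetical, abridged lexical units of two FrameNet frames.
frame_lexical_units = {
    "Commerce_sell": {"sell", "vend", "retail", "auction"},
    "Commerce_buy": {"buy", "purchase"},
}

ranking = rank_frames(translations, frame_lexical_units)
best_frame, best_score = ranking[0]
```

With these toy sets, the translations share three of the four distinct units with the first frame and none with the second, so the first frame is ranked highest.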

1. http://slovnik.zcu.cz/
2. The Jaccard coefficient measures the similarity between sample sets and is defined as the size of their intersection divided by the size of their union.

However, finding a correct translation is problematic. Another possible solution is to exploit the direct links between VerbaLex and Princeton WordNet [39, 8] and to use WordNet as an inter-language between VerbaLex and FrameNet. To ensure maximum reliability, the linking of PWN verbs onto FrameNet lexical units is based on a WordNet verb–FrameNet dictionary, which has been built manually as a part of the work presented in [48]. This manually built dictionary has high precision but very low recall: it covers 2,651 of 10,500 verbs. This sparseness of the dictionary can be overcome by using an additional automatic method. In this work, a rule-based system for mediating frame assignment by a “Detour via WordNet”, developed by Burchardt et al. [3], has been used. The process of linking VerbaLex frames with FrameNet frames based on the inter-language is shown in figure 4.2. Note that multiple frames can be assigned to a single synset if the English translations do not all share the same FrameNet frame.

[Figure 4.2 diagram: a VerbaLex frame is connected through the synset ID to Czech WordNet and Princeton WordNet, and through the WordNet-FrameNet verb dictionary to FrameNet frames 1 and 2.]

Figure 4.2: Using WordNet as an inter-language between VerbaLex and FrameNet.
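The lookup through the inter-language can be illustrated by the following sketch. It assumes that both mappings are available as plain dictionaries; the function name `frames_for_synset`, the synset ID and the dictionary contents are hypothetical and do not reproduce the actual resources.

```python
def frames_for_synset(synset_id, czech_to_pwn, verb_to_frames):
    """Collect FrameNet frames reachable from a Czech synset via PWN verbs."""
    frames = set()
    for verb in czech_to_pwn.get(synset_id, []):
        # Verbs missing from the verb dictionary contribute nothing,
        # which mirrors the low recall of a manually built dictionary.
        frames.update(verb_to_frames.get(verb, []))
    return frames


# Hypothetical link from a Czech synset ID to PWN verb literals.
czech_to_pwn = {"cz-0001": ["sell", "trade", "deal"]}

# Hypothetical excerpt of a WordNet verb to FrameNet frame dictionary.
verb_to_frames = {"sell": ["Commerce_sell"], "trade": ["Exchange"]}

linked_frames = frames_for_synset("cz-0001", czech_to_pwn, verb_to_frames)
```

Because the English verbs in this toy example do not share a single frame, the synset receives two candidate frames, which is the situation noted above.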

4.1.2 Assigning verb arguments

This task assumes that the investigated VerbaLex frame is already connected with exactly one appropriate FrameNet frame. Once each VerbaLex frame is connected with a semantic frame from FrameNet, a mapping between semantic roles in VerbaLex and frame elements in FrameNet can be easily identified. Because VerbaLex arguments reflect the combinatory possibilities of verbs, only the core frame elements of the FrameNet frame are used. The linking of semantic roles in VerbaLex with frame elements from FrameNet is based on the most probable pairing.

The FrameNet corpus provides a wide range of annotated frame element realisations. Each frame element of the investigated frame is represented by the set of all its realisations in the FrameNet corpus. If a realisation comprises more than one word form, all of them are used. In addition, the constituents are normalized and lemmatized using the TreeTagger [46], developed at the Institute for Natural Language Processing, Stuttgart University.

Semantic roles in VerbaLex are not exemplified sufficiently, but a large enough set of examples can be obtained by taking into account the hyponymy relation in CZWN. Formally, each verb argument of the investigated frame is represented by the associated synset from PWN and all its hyponyms. The similarity of a frame element and a verb argument in VerbaLex is defined as the Jaccard measure between their representing sets. The method of linking VerbaLex semantic roles with FrameNet frame elements is shown in figure 4.3.

Figure 4.3: Linking VerbaLex semantic roles with FrameNet frame elements
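A minimal sketch of the two representing sets follows. It assumes a toy hyponymy hierarchy stored as a dictionary and uses plain lemmas instead of sense-numbered literals; the names `with_hyponyms` and `jaccard` and all data are illustrative assumptions.

```python
def with_hyponyms(synset, hyponyms):
    """Return the synset together with all its direct and indirect hyponyms."""
    seen = set()
    stack = [synset]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hyponyms.get(current, []))
    return seen


def jaccard(a, b):
    """Jaccard measure between two sets (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical fragment of a hyponymy hierarchy (lemmas only).
hyponyms = {"goods": ["merchandise", "food"], "food": ["bread"]}

# Verb argument: the associated synset plus all of its hyponyms.
argument_set = with_hyponyms("goods", hyponyms)

# Frame element: hypothetical lemmatized realisations from a corpus.
frame_element_set = {"goods", "bread", "car"}

similarity = jaccard(frame_element_set, argument_set)
```

Here two lemmas are shared and five distinct lemmas occur overall, so the similarity of this pair is 0.4.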

The goal is then to find the best pairing in terms of the greatest sum of similarities. The key issue of this approach is the definition of the “best pairing”. The following two requirements on the pairing are formulated:

• the sum of the similarities of all participating pairs must be maximal;

• each frame element (verb argument) can be connected to at most one verb argument (frame element), i.e. a particular node may not be shared by more than one pair.

The task can be modelled in graph theory as the most expensive pairing (maximum-weight matching) problem on the bipartite graph $G = (E, V_1, V_2, f)$, where $V_1$ is the set of all frame elements in the FrameNet frame, $V_2$ is the set of all verb arguments in the VerbaLex frame, $E \subseteq V_1 \times V_2$ is the set of edges, $s : V_1 \cup V_2 \to 2^W$ gives the set of lexical units associated with a vertex $v \in V_1 \cup V_2$, $W$ is the set of all possible lexical units, and $f : E \to \mathbb{R}$ is the weight of an edge $(v_1, v_2) \in E$, defined as

$$f((v_1, v_2)) = \frac{|s(v_1) \cap s(v_2)|}{|s(v_1) \cup s(v_2)|}$$

Even though several algorithms with polynomial time complexity exist [7], the simplest brute-force algorithm, which searches all possible pairings, has been used. The cardinality of a frame element set is usually up to 5 and the cardinality of a VerbaLex argument set up to 3. For such small graphs, even the exponential brute-force algorithm can outperform a sophisticated polynomial one.

For a better insight into the problem, an example of linking VerbaLex frames with FrameNet ones is given. Let us consider the following synset from CZWN: prodat:n9, střelit:n2 (to sell something) and a shortened list of simplified VerbaLex frames corresponding to the synset: (a) AG (institution:1) VERB ART (goods:1) REC (institution:1), (b) AG (person:1) VERB ART (goods:1), (c) AG (person:1) VERB ART (goods:1) REC (person:1). All listed frames should be linked to the Commerce_sell frame from FrameNet. The core frame elements of the FrameNet frame include Seller (linked to AG (institution:1) in (a) and to AG (person:1) in (b), (c)), Goods (linked to ART (goods:1) in (a), (b) and (c)) and Buyer (linked to REC (institution:1) in (a) and to REC (person:1) in (c)). In frame (b), there is no equivalent for Buyer.
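The exhaustive search over pairings can be sketched as follows. This is a minimal illustration rather than the implementation used in the system: the function `best_pairing` and the edge weights are assumptions (the weights are invented numbers, not measured similarities), and the sketch assumes that there are no more verb arguments than frame elements. Since all weights are non-negative, pairing every argument does not lower the total compared with leaving some nodes unpaired.

```python
from itertools import permutations


def best_pairing(frame_elements, arguments, weight):
    """Exhaustively search one-to-one pairings for the maximal total weight.

    Every verb argument is paired with a distinct frame element, so the
    sketch assumes len(arguments) <= len(frame_elements).
    """
    best_total, best_pairs = -1.0, []
    for chosen in permutations(frame_elements, len(arguments)):
        pairs = list(zip(chosen, arguments))
        total = sum(weight(fe, arg) for fe, arg in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


# Hypothetical Jaccard weights for frame (c) of the selling example.
weights = {
    ("Seller", "AG"): 0.6,
    ("Buyer", "AG"): 0.5,
    ("Buyer", "REC"): 0.5,
    ("Seller", "REC"): 0.1,
    ("Goods", "ART"): 0.8,
}


def weight(frame_element, argument):
    """Weight of an edge; pairs without shared lexical units weigh 0."""
    return weights.get((frame_element, argument), 0.0)


total, pairs = best_pairing(
    ["Seller", "Goods", "Buyer"], ["AG", "ART", "REC"], weight
)
```

For three arguments and up to five frame elements the loop visits at most 60 candidate pairings, which is why the brute-force search is affordable here.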

4.1.3 Ontologies behind FrameNet and VerbaLex

The FrameNet lexical resource can be said to comprise two independent ontologies. The first is the hierarchy of FrameNet frames arranged according to the frame-to-frame relations, especially the inheritance relation. The second is the hierarchy of semantic types (e.g. Sentient for the Cognizer FE), which are connected with some general frame elements. While the frame-to-frame relations were exploited in the methods identifying links between FrameNet frames and WordNet synsets [48, 3], and subsequently in linking FrameNet frames with VerbaLex frames, the relations between frame elements have so far remained outside our interest.

One of the fundamental differences between FrameNet and VerbaLex is that VerbaLex labels verb arguments with a fixed set of Complex Semantic Roles (CSRs) for all valency frames, whereas FrameNet defines a unique frame element set for each frame. Nevertheless, there are frame elements shared by two or more frames (e.g. Time). On the other hand, some frame elements of the same name can have different meanings in different frames. To extend the information value of the frame elements, a hierarchy of 72 semantic types has been introduced. Each frame element may be connected with one or more semantic types. Unfortunately, only approximately 20 % of all frame elements (usually very general ones) are connected with any of them. Moreover, the hierarchy of semantic types does not correspond to any existing ontology (although it resembles the Top Ontology in EuroWordNet). However, we can mention a project [45] in which the authors linked semantic types and frame elements to the SUMO ontology [35].

[Figure 4.4 diagram, panels (a) and (b): a frame element and a semantic role, each linked to WordNet.]

Figure 4.4: Distance of a VerbaLex semantic role and a FrameNet frame element.

4.1.4 Exploitation of ontologies in linking FrameNet frame elements with semantic roles from VerbaLex

The most problematic part of the process of linking VerbaLex with FrameNet is connecting VerbaLex verb arguments with the appropriate frame elements from FrameNet. The method described above uses a statistical technique based on comparing typical frame element fillers with typical semantic role fillers, but it can be improved by using the links between semantic types (or frame elements) and SUMO concepts. Once a SUMO concept is linked to each frame element (or most of them), we can use the SUMO-WordNet mapping described in [36] and get the corresponding WordNet synset. The distance between the frame element and the semantic role from VerbaLex is then defined in the following way:

• The linked synset is equal to the synset associated with the VerbaLex semantic role (see fig. 4.4.a). In this case, the distance is zero.

• The linked synset is not equal to the synset associated with the VerbaLex semantic role (see fig. 4.4.b). In this case, the distance is defined as the length of the path from one synset to the other using the hypernym relation in WordNet.
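The two cases of the distance definition can be illustrated by the sketch below. It makes simplifying assumptions that are not stated in the text: every synset has at most one hypernym, the hierarchy is acyclic, and the path may pass through a common hypernym of the two synsets. The function names and the toy hierarchy are hypothetical.

```python
def hypernym_chain(synset, hypernym):
    """List the synset followed by its successive hypernyms up to the root."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain


def synset_distance(a, b, hypernym):
    """Number of hypernym edges on the path between a and b, or None."""
    chain_a = hypernym_chain(a, hypernym)
    depth_in_b = {s: i for i, s in enumerate(hypernym_chain(b, hypernym))}
    for steps_a, synset in enumerate(chain_a):
        if synset in depth_in_b:
            return steps_a + depth_in_b[synset]
    return None  # the two synsets share no hypernym


# Hypothetical fragment of a hypernym hierarchy (child -> parent).
hypernym = {
    "seller:1": "person:1",
    "buyer:1": "person:1",
    "person:1": "organism:1",
}

same = synset_distance("person:1", "person:1", hypernym)
direct = synset_distance("seller:1", "person:1", hypernym)
siblings = synset_distance("seller:1", "buyer:1", hypernym)
```

Identical synsets yield distance zero, as in case (a), while a synset and its direct hypernym are one edge apart, as in case (b).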

The ontology-based method of identifying the nearest FrameNet frame element for a semantic role from VerbaLex is a very straightforward solution, but it suffers from many errors in the SUMO-WordNet mapping as well as from the lack of frame elements annotated with a semantic type. That is why the method is used only as a correction of the statistical approach described in section 4.1.2. The bottleneck of the statistical approach is mainly ambiguity. The problem arises when two or more edges in the pairing graph have (nearly) equal costs. In that case (on condition that the ontology-based cost is defined), the ontology-based distance is used instead of the statistical one to decide which edge should be selected. The combination of the statistical and the ontology-based method increases both the coverage and the accuracy of the system.
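The correction step can be sketched as a tie-breaking rule. The threshold `tolerance` for "nearly equal" costs is an assumption, as the text does not give a value; the function name `choose_edge` and the numbers are likewise illustrative only.

```python
def choose_edge(candidates, statistical, ontological, tolerance=0.05):
    """Pick one edge among candidates, using the ontology only to break ties.

    statistical maps edges to similarities (higher is better);
    ontological maps edges to distances (lower is better) and may be partial.
    """
    best = max(statistical[e] for e in candidates)
    tied = [e for e in candidates if best - statistical[e] <= tolerance]
    if len(tied) > 1 and all(e in ontological for e in tied):
        # Nearly equal statistical costs: decide by ontology-based distance.
        return min(tied, key=lambda e: ontological[e])
    return max(tied, key=lambda e: statistical[e])


# Hypothetical competing edges for the same verb argument.
candidates = [("Seller", "AG"), ("Buyer", "AG")]
statistical = {("Seller", "AG"): 0.40, ("Buyer", "AG"): 0.41}
ontological = {("Seller", "AG"): 1, ("Buyer", "AG"): 3}

with_ontology = choose_edge(candidates, statistical, ontological)
without_ontology = choose_edge(candidates, statistical, {})
```

When the ontology-based distance is undefined for any of the tied edges, the sketch falls back to the statistical similarity alone.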

4.1.5 Evaluation

The first phase, the mapping of frames from FrameNet onto VerbaLex, uses two different approaches. Both methods return a list of frames with probabilities. For the purposes of future applications and the evaluation, the methods are combined in the following way. Let $L^A(vf) = (f_1^A, p_1^A), \dots, (f_m^A, p_m^A)$ be the list of potential (frame, probability) pairs for the VerbaLex frame $vf$, acquired by the method based on verb translations (A) and sorted in increasing order according to their probabilities. Similarly, let $L^B(vf) = (f_1^B, p_1^B), \dots, (f_n^B, p_n^B)$ be the output of the method based on WordNet (B). The final list of potential (frame, probability) pairs is computed by merging $L^A$ and $L^B$. If a frame is a member of both lists, the arithmetic mean of its two probabilities is used. The whole output list can be useful in some applications, but for the evaluation only the first, most probable item is used. The method has been evaluated manually by a human annotator on a randomly chosen set of 200 VerbaLex frames.

The main problem of this task is translation errors of two types: translation between Czech WordNet and Princeton WordNet, and translation between PWN and FrameNet. Although the sense of a word can change relatively easily, the acquired precision of the linking between FrameNet frames and VerbaLex ones is about 84 %. The errors include cases where no frame was assigned at all. A frame is not assigned if either the corresponding frame does not exist or the PWN–FrameNet mapping is incomplete.

The second phase deals with the linking between verb arguments in VerbaLex and frame elements in FrameNet. This task is, in principle, more complicated than the previous one. Apart from wrongly assigned frame elements, there is a problem with ambiguity. For instance, if two or more arguments of a verb in VerbaLex represent persons, they are all usually connected with the same literal from WordNet (person:1). The similarities of these arguments with frame elements are then the same and the assignment process is ambiguous (see frames (b) and (c) in the example from section 4.1.2).

The experiments have shown that the error rate of the system is about 24 % and the ambiguity rate about 12 %. The former version of the system, without corrections by the ontology-based algorithm, performed with an error rate of about 26 % and an ambiguity rate of about 19 %. Two main problems were identified: first, some characteristic sets of lexical units are not representative enough, and second, there are verb arguments for which no corresponding frame element exists in FrameNet. The relatively low accuracy is also caused by the high number of combinations for each frame pair: the average number of arguments in VerbaLex is 2.7 and the average number of frame elements in FrameNet is 14.6.
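The merging of the two candidate lists described at the start of this section can be sketched as follows. The sketch assumes that a frame found by only one method keeps its single probability, which the text does not specify; the function name `merge_rankings` and the probabilities are made up for illustration.

```python
def merge_rankings(list_a, list_b):
    """Merge two (frame, probability) lists; average frames present in both."""
    prob_a, prob_b = dict(list_a), dict(list_b)
    merged = {}
    for frame in prob_a.keys() | prob_b.keys():
        if frame in prob_a and frame in prob_b:
            merged[frame] = (prob_a[frame] + prob_b[frame]) / 2
        else:
            # Assumption: a frame from a single method keeps its probability.
            merged[frame] = prob_a.get(frame, prob_b.get(frame))
    # Most probable frame first; ties are broken alphabetically.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs of method A (translations) and method B (WordNet).
list_a = [("Commerce_sell", 0.75), ("Exchange", 0.25)]
list_b = [("Commerce_sell", 0.5), ("Giving", 0.5)]

final = merge_rankings(list_a, list_b)
top_frame = final[0][0]
```

Only `top_frame` would enter an evaluation of the kind described above, while the full list remains available to other applications.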

4.1.6 VerbaLex-FrameNet linking tool

Based on the described methods, a semi-automatic VerbaLex–FrameNet linking tool has been developed. The tool is a client-server application, which enables different users to link FrameNet frames and their frame elements to VerbaLex frames simultaneously. Users are first asked by the system to select the most appropriate FrameNet frame for a given VerbaLex frame. The possible frames are ordered from the most probable to the least probable one using the algorithm from section 4.1.1. After the frame is selected, a graph representing the linking between VerbaLex arguments and FrameNet frame elements is generated, based on the methods described in sections 4.1.2 and 4.1.4. If the user does not agree with the automatically generated pairing, he or she can correct it simply by using the mouse. During the whole process, one can see a detailed description of any object (frame, synset, frame element, etc.) in the bottom panel by hovering the mouse over it.

Figure 4.5: VerbaLex–FrameNet linking tool

A screenshot of the system is shown in figure 4.5. The first column in the screenshot contains frames from VerbaLex, the second one contains frames from FrameNet and the third one shows the linking between VerbaLex verb arguments and FrameNet frame elements. The figure illustrates the annotation of one of the frames belonging to the CZWN synset prodávat:1, obchodovat:2, distribuovat:n3 (to sell something). The VerbaLex frame (placed on the bottom panel) with verb arguments person:1, object:1 and person:1 is linked to the Commerce_sell frame from FrameNet.

4.2 Case study for the Indicate verb class

In order to discover the main typological differences between Berkeley FrameNet frames and Czech valency frames in VerbaLex, I have selected the VerbaLex frames belonging to the Indicate class and carried out their complete linkage to FrameNet frames. The Indicate class is one of 111 semantic classes defined in VerbaLex and consists of 136 verb senses in 27 CZWN synsets evoking 119 valency frames.

4.2.1 Annotation process

The annotator processes one VerbaLex frame at a time and is asked to assign at most one FrameNet frame to it. If the annotator is not able to find an appropriate FrameNet frame, the VerbaLex frame is not annotated and a new Czech FrameNet frame should be defined in future work. There are at least two possible reasons for defining a completely new frame [28]:

1. Inadequacy of frame definitions in the corresponding semantic domain or area.

2. Insufficient coverage of the domain in Berkeley FrameNet (i.e. English lexical units and corresponding frames have not been defined yet).

If it is possible to find an appropriate frame from FrameNet, the process of annotation continues with linking semantic roles from VerbaLex with frame elements from FrameNet. At most one frame element can be chosen for a semantic role and no more than one semantic role can be linked to a FrameNet frame element. If an appropriate FrameNet frame element for a semantic role from the VerbaLex frame does not exist, the semantic role should be connected to a newly defined frame element in the future. Nevertheless, the VerbaLex-specific addition of frame elements to a FrameNet frame results in a different and more restricted frame. This more specific non-English frame could be related to the English one by a cross-lingual Inheritance relation, whereby it would become a cross-lingual Child frame of the English frame [28].

4.2.2 Statistics and typological divergences

In my experience, most VerbaLex frames can be directly linked to a semantic frame from FrameNet; nevertheless, some of the VerbaLex frames require an adaptation or the creation of a completely new FrameNet frame. Table 4.1 lists the most frequently assigned FrameNet frames with the numbers of corresponding VerbaLex frames. During the annotation phase I have identified 15 FrameNet frames belonging to 119 VerbaLex frames. It means that, on average, approximately 8 VerbaLex frames are linked to one FrameNet frame. Among all 119 VerbaLex frames there are 9 frames that cannot be linked to any known frame from FrameNet. For these frames a new FrameNet frame will need to be defined. The coverage of the Indicate class by FrameNet frames is therefore more than 92 % (110 of 119 frames).

FrameNet frame name      VerbaLex frames
Telling                  30
Reasoning                22
Reveal_secret            14
Gesture                  11
Expressing_publicly      7
Forgiveness              5
Communication            4
Sign                     3
Others                   9
None                     9

Table 4.1: Assigned FrameNet frames.

When evaluating the linkage between semantic roles from VerbaLex and frame elements from FrameNet, I have found 19 frames where a new frame element has to be added, two or more frame elements have to be merged, or some frame elements have to be restructured. There are at least three reasons for this:

1. The corresponding frame element is missing.

2. The frame element is too general and has to be divided into more specific ones.

3. The frame element is too specific and has to be replaced by a more general one.

4.3 Reusing Berkeley FrameNet frames

A fundamental assumption of the FrameNet building methodology is that most Berkeley FrameNet frames can be reused for the semantic analysis of Czech. The assumption takes advantage of the nature of frames as coarse-grained semantic classes, which refer to prototypical situations. Nevertheless, the assumption that these situations are applicable across languages should be empirically verified. In general, a sense of a lemma can evoke a FrameNet frame if this sense is able to realize the conceptually necessary components of the frame (its core frame elements). Conversely, FrameNet frames may not be applicable to another language if the subcategorization properties of lemmas in that language differ significantly from those of their English translations [5].

Among the issues that the thesis should address are the analysis of typological differences between Czech and English (in particular with respect to valencies) and the question of how to resolve the cross-linguistic divergences. Several experiments analysing the relation between Berkeley FrameNet and Czech verb valency lexicons have already been performed [1, 31] and have shown that a large subset of Berkeley FrameNet frames can be reused for Czech. Moreover, there are many projects (see [4, 49, 6] for example) in which the authors successfully exploited Berkeley FrameNet frames for building FrameNets in other languages.

4.4 New frames and consistency ensuring

Apart from selecting FrameNet frames that are applicable to Czech, there is a serious problem of ensuring consistency and an issue of creating new frames specific to the Czech language. To deal with both these problems I would like to employ corpus linguistics. As it turns out, even relatively simple statistical techniques over a large set of data can often give better results than complex and sophisticated rule-based systems for some applications. For instance, statistical machine translation is currently the best-performing and most widely studied method of machine translation in comparison with rule-based or example-based machine translation systems.

Interesting information about word behaviour in corpora can be found in so-called Word Sketches [23]. Word sketches are one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour generated by the Sketch Engine, which takes as input a corpus of any language and corresponding grammar patterns, and which generates word sketches for the words of that language [24]. Word sketches built with a simple grammar can even reflect semantic valencies for words with sufficient frequencies.

In 2004, Patrick Hanks and James Pustejovsky began their work on Corpus Pattern Analysis (CPA) [21]. The aim of CPA is to link word use to word meaning in a machine-tractable way using corpus data (mainly concordances and word sketches). According to the Theory of Norms and Exploitations [20], which serves as the theoretical background of the CPA project, words in isolation do not have specific meanings. In CPA, meanings of words are associated with their prototypical sentence contexts. This approach to lexical semantics seems to be similar to Fillmore’s Frame Semantics. I would like to explore the relationship between FrameNet frames and CPA frames in order to develop a corpus-driven methodology of ensuring consistency and high coverage in creating new Czech FrameNet frames.

4.5 Semi-automatic annotation of corpus texts

One of the objectives of the Berkeley FrameNet project is the exemplification of all its semantic frames in a corpus. Therefore, one of the last things I plan to do within the scope of the thesis is to develop a semi-automatic tool for identifying semantic frames and semantic roles in raw texts. A detailed description of such a tool for English and Berkeley FrameNet can be found in [17, 16]. According to the authors, the system utilizes both heuristic and statistical approaches and performs with a precision of about 60 %. Using a similar, newly created tool for Czech, I will annotate a sample of a corpus with Frame Semantic information.

4.6 Schedule of further work

• Autumn 2010

– Finish the comparative study of Berkeley FrameNet frames and Czech verb valencies.
– Carry out complex analysis of cases of typological differences between Czech and English in terms of Frame Semantics.

• Spring, Autumn 2011

– Explore the relationship between FrameNet frames, CPA frames, and Word Sketches.
– Develop a corpus-driven methodology of creating new frames with regard to ensuring consistency.
– Create a set of Czech FrameNet frames, corresponding to the most frequent lexical units in Czech.

• Spring 2012

– Finish the development of a frame and semantic role labelling tool.
– Annotate a sample corpus by the set of Czech FrameNet frames.

• Autumn 2012

– Submit the Ph.D. thesis.

4.7 Publications

• Jiří Materna and Karel Pala. Using Ontologies for Semi-automatic Linking VerbaLex with FrameNet. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, 2010.

• Jiří Materna. Linking Czech Verb Valency Lexicon VerbaLex with FrameNet. In Proceedings of 4th Language and Technology Conference, Poznań, 2009.

• Jiří Materna. Domain Collocation Identification. In Recent Advances in Slavonic Natural Language Processing, Faculty of Informatics, Masaryk University, Brno, 2009.

• Jiří Materna. Czech Verbs in FrameNet Semantics. In Czech in Formal Grammar, Lincom, München, 2009.

• Lubomír Popelínský and Jiří Materna. On key words and key patterns in Shakespeare’s plays. In Znalosti 2009, Nakladatelstvo STU, Bratislava, 2009.

• Jiří Materna. Automatic Web Page Classification. In Recent Advances in Slavonic Natural Language Processing, Faculty of Informatics, Masaryk University, Brno, 2008.

• Lubomír Popelínský and Jiří Materna. Keyness and Relational Learning. In International conference Keyness in Text, University of Siena, Siena, 2007.

• Jiří Materna. Keyness in Shakespeare’s Plays. In Recent Advances in Slavonic Natural Language Processing, Faculty of Informatics, Masaryk University, Brno, 2007.

Bibliography

[1] V. Benešová, M. Lopatková, and K. Hrstková. Enhancing Czech Valency Lexicon with Semantic Information from FrameNet: The Case of Communication Verbs. In Proceedings of The First International Conference on Global Interoperability for Language Resources, pages 18 – 25. Hong Kong, 2008.

[2] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. The TIGER Treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, pages 24 – 42, 2002.

[3] Aljoscha Burchardt, Katrin Erk, and Anette Frank. A WordNet Detour to FrameNet. Technical report, Saarland University and DFKI Saarbrucken, 2006.

[4] Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. The Salsa Corpus: A German Corpus Resource for Lexical Semantics. In The International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, 2006.

[5] Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. FrameNet for the Semantic Analysis of German: Annotation, Representation and Automation (preprint).

[6] Benfeng Chen and Pascale Fung. Automatic Construction of an English-Chinese Bilingual FrameNet. In Proceedings of Human Language Technology conference, pages 29 – 32, 2004.

[7] William J. Cook, William H. Cunningham, William R. Pulleyblank, and Alexander Schrijver. Combinatorial Optimization. Wiley-Interscience, 1997.

[8] Christiane Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998.

[9] Charles J. Fillmore. The Case for Case. In Universals in Linguistic Theory. New York: Holt, Rinehart and Winston, 1968.

[10] Charles J. Fillmore. Topics in Lexical Semantics. In Current Issues in Linguistic Theory. Indiana University Press, 1976.

[11] Charles J. Fillmore. Scenes-and-frames Semantics. In A. Zampolli, ed. Linguistic Structures Processing, pages 55 – 81. Amsterdam, North-Holland, 1977.

[12] Charles J. Fillmore. The Case for Case Reopened. In Syntax and Semantics: Grammatical Relations. Academic Press Inc., 1977.

[13] Charles J. Fillmore. Frame Semantics. In Linguistics in the Morning Calm, pages 111 – 137. Seoul, Hanshin Publishing Co., 1982.

[14] Charles J Fillmore, Christopher R Johnson, and Miriam R L Petruck. Background to FrameNet. In International Journal of Lexicography, pages 235 – 250. Oxford University Press, 2003.

[15] Aldo Gangemi, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. Sweetening Ontologies with DOLCE. pages 166 – 181. Springer, 2002.

[16] Daniel Gildea and Daniel Jurafsky. Automatic Labeling Of Semantic Roles. In Computational Linguistics, volume 28, pages 245 – 288, 2002.

[17] Daniel Joseph Gildea. Statistical Language Understanding Using Frame Semantics. PhD thesis, University of California, Berkeley, 2001.

[18] Marek Grác, Miloš Jakubíček, and Vojtěch Kovář. Through low-cost annotation to reliable parsing evaluation. To appear in PACLIC 24: Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 2010.

[19] Jan Hajič. Complex Corpus Annotation: The Prague Dependency Treebank. In Insight into Slovak and Czech Corpus Linguistics. Veda, Vydavateľstvo SAV, Bratislava, 2005.

[20] Patrick Hanks. The Syntagmatics of Metaphor and Idiom. In International Journal of Lexicography, pages 245 – 274. Oxford University Press, 2004.

[21] Patrick Hanks and James Pustejovsky. A Pattern Dictionary for Natural Language Processing. In Revue Française de Langue Appliquée. Brandeis University, 2005.

[22] Dana Hlaváčková and Aleš Horák. VerbaLex – New Comprehensive Lexicon of Verb Valencies for Czech. In Computer Treatment of Slavic and East European Languages, pages 107 – 115. Bratislava, Slovakia: Slovenský národný korpus, 2006.

[23] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress, pages 105 – 116. Lorient, France, 2004.

[24] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. The Sketch Engine. In Practical Lexicography: A Reader, pages 297 – 306, 2008.

[25] Douglas Lenat. CYC: A Large-Scale Investment in Knowledge Infrastructure. In Communications of the ACM, volume 38, pages 33 – 38, 1995.

[26] Alessandro Lenci, Martina Johnson, and Gabriella Lapesa. Building an Italian FrameNet through Semi-automatic Corpus Analysis. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), 2010.

[27] Beth Levin. English Verb Classes and Alternation: A Preliminary Investigation. The University of Chicago Press, 1993.

[28] Birte Lönneker-Rodman. Multilinguality and FrameNet. Technical report, International Computer Science Institute, Berkeley, 2007.

[29] Markéta Lopatková, Zdeněk Žabokrtský, and Karolína Skwarska. Valency Lexicon of Czech Verbs: Alternation-based Model. In The International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, 2006.

[30] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, pages 313 – 330. MIT Press, Cambridge, 2003.

[31] Jiří Materna. Czech Verbs in FrameNet Semantics. In Czech in Formal Grammar, pages 131 – 139. Lincom Europa, 2009.

[32] Jiří Materna and Karel Pala. Using Ontologies for Semi-automatic Linking VerbaLex with FrameNet. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10). European Language Resources Association (ELRA), 2010.

[33] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. Five Papers on WordNet. Technical report, Princeton University, 1993.

[34] Marvin Minsky. A Framework for Representing Knowledge. In Psychology of Computer Vision. McGraw-Hill, New York, 1975.

[35] I. Niles and A. Pease. Towards a Standard Upper Ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems, pages 2 – 9, 2001.

[36] I. Niles and A. Pease. Linking Lexicons and Ontologies: Mapping WordNet to the Suggested Upper Merged Ontology. In Proceedings of the 2003 International Conference on Information and Knowledge Engineering, pages 23 – 26. Las Vegas, 2003.

[37] Toshio Ohori Ryoko Suzuki Hiroaki Saito Shun Ishizaki Ohara Kyoko Hirose, Seiko Fujii. The Japanese FrameNet Project: An Introduction. In The Fourth international conference on Language Resources and Evaluation (LREC’04), pages 9 – 11, 2004.

[38] Klara´ Osolsobeˇ and Karel Pala. Czech Stem Dictionary. In Sborn´ık prac´ı filozoficke´ fakulty brnenskˇ e´ univerzity, pages 51 – 60. Masarykova Univerzita, Brno, 1993.

[39] Karel Pala and Smrzˇ Pavel. Building Czech WordNet. In Romanian Journal of Information Science and Technology, volume 7, pages 1–2. Romanian Academy, 2004.

[40] Karel Pala, Pavel Rychlý, and Pavel Smrž. DESAM – Annotated Corpus for Czech. In Proceedings of SOFSEM’97, pages 523 – 530. Springer-Verlag, Lecture Notes in Computer Science 1338, 1997.

[41] Karel Pala and Pavel Ševeček. Valence českých sloves. In Sborník prací FFBU, pages 41 – 54. Masarykova Univerzita, Brno, 1997.

[42] Martha Palmer, Dan Gildea, and Paul Kingsbury. The Proposition Bank: A Corpus Annotated with Semantic Roles. In Computational Linguistics Journal, volume 31, pages 71 – 106, 2005.

[43] J. Ruppenhofer, M. Ellsworth, M. R. L. Petruck, C. R. Johnson, and J. Scheffczyk. FrameNet II: Extended Theory and Practice, 2006. http://www.icsi.berkeley.edu/framenet.

[44] Joe Sachs. Aristotle’s Metaphysics. Green Lion Press, Santa Fe, New Mexico, 1999.

[45] Jan Scheffczyk, Adam Pease, and Michael Ellsworth. Linking FrameNet to the Suggested Upper Merged Ontology. In Proceedings of the 2006 conference on Formal Ontology in Information Systems, pages 289 – 300. Baltimore, Maryland, 2006.

[46] Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing, pages 44 – 49. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 1994.

[47] Petr Sgall, Eva Hajičová, and Jarmila Panevová. The Meaning of the Sentence in its Semantic and Pragmatic Aspects. D. Reidel Publishing Company, Dordrecht, 1986.

[48] Lei Shi and Rada Mihalcea. Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Proceedings of CICLing, Mexico City, Mexico, 2005.

[49] Carlos Subirats and Miriam Petruck. Surprise: Spanish Framenet. In International Congress of Linguists. Workshop on Frame Semantics. Prague, Matfyzpress, 2003.

[50] Lucien Tesnière. Éléments de syntaxe structurale. Klincksieck, Paris, 1959.

[51] Piek Vossen and Graeme Hirst. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.

[52] Zdeněk Žabokrtský. Valency Lexicon of Czech Verbs. PhD thesis, Charles University, Prague, 2005.
