Building FrameNet in Czech

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

PH.D. THESIS PROPOSAL

Jiří Materna

Brno, September 2010

Supervisor: doc. PhDr. Karel Pala, CSc.

Contents

1 Introduction
2 Ontologies and frame-based approaches to lexical semantics
  2.1 Ontologies
  2.2 Frames
  2.3 WordNet
    2.3.1 EuroWordNet
  2.4 Verb valency lexicons
    2.4.1 PropBank
    2.4.2 VerbNet
    2.4.3 BRIEF
    2.4.4 Vallex
    2.4.5 VerbaLex
3 Frame Semantics and FrameNet
  3.1 The Berkeley FrameNet
    3.1.1 Semantic Frames
    3.1.2 Frame Elements
    3.1.3 FrameNet relations
    3.1.4 Semantic types
  3.2 FrameNets in other languages
    3.2.1 SALSA
    3.2.2 Spanish FrameNet
  3.3 Automatic methods of creating new FrameNets
4 Current results and aims of the thesis
  4.1 Current results
    4.1.1 Linking frames
    4.1.2 Assigning verb arguments
    4.1.3 Ontologies behind FrameNet and VerbaLex
    4.1.4 Exploitation of ontologies in linking FrameNet frame elements with semantic roles from VerbaLex
    4.1.5 Evaluation
    4.1.6 VerbaLex-FrameNet linking tool
  4.2 Case study for the Indicate verb class
    4.2.1 Annotation process
    4.2.2 Statistics and typological divergences
  4.3 Reusing Berkeley FrameNet frames
  4.4 New frames and consistency ensuring
  4.5 Semi-automatic annotation of corpus texts
  4.6 Schedule of further work
  4.7 Publications

Chapter 1
Introduction

Natural language understanding, which belongs to the field of natural language processing, has been intensively investigated by researchers for many years.
Many natural language processing applications, such as information retrieval and machine translation, as well as disambiguation tasks on all levels, require techniques that enable at least partial semantic parsing and understanding. For example, there has been widespread deployment of simple speech-based natural language understanding systems that answer questions about flight arrival times, give directions, report on bank balances, or perform simple financial transactions. More sophisticated experimental systems generate concise summaries of news articles, answer fact-based questions, and recognize complex semantic and dialogue structure in general. Unfortunately, current information extraction and dialogue understanding systems are often domain dependent.

Nowadays, there is a trend to build large domain-independent electronic lexical databases containing as much semantic information as possible. Probably the best known semantic database is Princeton WordNet [8]. It is a large lexical resource of American English, developed under the direction of George A. Miller, in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets). Synsets are interlinked by means of conceptual-semantic and lexical relations (e.g. synonymy, antonymy, hypero/hyponymy, meronymy and holonymy). Another complex electronic lexical resource of English is FrameNet, which is based on Frame Semantics [13], proposed by Charles J. Fillmore. In Frame Semantics, word meaning is described in relation to a semantic frame, which consists of a target lexical unit (a pairing of a word with a sense), frame elements (its semantic dependants) and relations between them.

These lexical resources are not domain dependent, but they have some disadvantages. First of all, the coverage of FrameNet is insufficient. This is mainly caused by the way it is built, which proceeds frame by frame rather than lemma by lemma.
On the other hand, Princeton WordNet has relatively high coverage but suffers from inconsistency and errors, which stem from the fact that WordNet has been built manually, without any well-defined methodology based on corpus evidence.

The aim of the thesis is to propose a methodology for building a large, frame-based and domain-independent lexicon of the valency possibilities of the Czech language, overcoming low coverage and inconsistency as much as possible, and to create a core of such a lexicon exemplified in a sample corpus.

Throughout the world, there are several projects for different languages built on the idea of Frame Semantics. The largest and probably best known one is the original Berkeley FrameNet. The Saarbrücken team has been developing the German frame-based electronic lexicon SALSA [4], a Spanish team has been working on Spanish FrameNet [49], etc. For Czech, some experiments have been carried out by Jan Hajič's group in Prague [1], taking advantage of the lexical database Vallex built in this group. However, a Czech FrameNet as such has not been worked out so far.

In this work I give an overview of existing approaches to building frame-based lexicons and point out their advantages and limitations. The main focus is on FrameNet and FrameNet-like approaches. In the rest of this work I describe current results in building Czech FrameNet and outline the main ideas of further work.

Chapter 2
Ontologies and frame-based approaches to lexical semantics

2.1 Ontologies

The term ontology has its origin in philosophy, where it denotes the philosophical study of the nature of being as well as of the basic categories of being and their relations. It was Aristotle who constructed the first well-defined ontology. In his Metaphysics [44] he analyzed the simplest elements to which the mind reduces the world of reality. In computer science, an ontology is a formal representation of a set of concepts¹ and the relationships between them.
In order to explain how ontologies are constructed and what they are good for, we can refer to logical systems. The existential quantifier in logic is a notation for asserting that something exists, but logic itself has no vocabulary for describing the things that exist. Ontology fills that gap. It is the study of the existence of all kinds of entities, abstract and concrete, that make up the world or a particular domain. Domain ontologies, sometimes called domain-specific ontologies, model a specific part of the world. Universal, non-domain ontologies are called upper ontologies. An upper ontology is a model of the common world, which describes general concepts that are the same across all domains. There are several upper ontologies, including commercial ones (Cyc [25]) as well as freely available ones (SUMO [35], DOLCE [15]).

¹ When speaking about concepts we rather mean their labels, because concepts as such are not linguistic entities.

2.2 Frames

Besides representing standalone concepts, a language understanding system must be able to organize knowledge in high-level structures. In symbolic logic, the basic units are predicates, which are connected by operators to create formulas representing high-level structures. Another possible way of organizing concepts is to use structures like frames. In the field of knowledge representation, the frame is a data structure introduced by Marvin Minsky [34], which is intended to represent complex objects or stereotyped situations. We can think of a frame as a network of nodes and relations, where some nodes are fixed, representing things that are always true about the frame. The rest of the nodes represent slots, which must be filled by specific instances of data.

Inspired by Minsky, many researchers in different branches of science have adopted the idea of frames. In linguistics, the best known frame-based theory is Charles J. Fillmore's Frame Semantics [13], which will be discussed later².
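The description of a frame as fixed nodes plus slots can be illustrated with a minimal sketch. The class name Frame, its fields, and the birthday party example below are hypothetical and chosen only for illustration; they are not taken from Minsky's paper or from any existing library.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A toy Minsky-style frame (hypothetical): fixed facts plus slots."""
    name: str
    fixed: dict = field(default_factory=dict)  # always true about the frame
    slots: dict = field(default_factory=dict)  # slot name -> filler or None

    def fill(self, slot, value):
        # Only slots declared by the frame may be filled.
        if slot not in self.slots:
            raise KeyError(f"{self.name} has no slot {slot!r}")
        self.slots[slot] = value

    def unfilled(self):
        # Slots that still wait for a specific instance of data.
        return [s for s, v in self.slots.items() if v is None]


# Hypothetical stereotyped situation used only as an example.
party = Frame(
    name="BirthdayParty",
    fixed={"is_a": "Party", "has_cake": True},
    slots={"host": None, "guest_of_honour": None, "location": None},
)
party.fill("host", "Mary")
party.fill("location", "garden")
remaining = party.unfilled()
```

In this sketch the fixed part never changes once the frame is defined, while the slots are filled one by one as a concrete situation is recognized; unfilled() lists what is still missing.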
2.3 WordNet

Princeton WordNet (PWN) [8] is a large lexical database of English, developed under the direction of George A. Miller at Princeton University. Its design has been inspired by psycholinguistic research and computational theories of human lexical memory [33]. Entries in PWN are nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms called synsets, each expressing a distinct concept. Different senses of polysemous words (literals) are distinguished by their sense IDs (positive integers). Synsets in PWN are interlinked by means of relations, which form the net. Some of the most important relations, each with an example, are listed below.

• Synonymy is the most important relationship in WordNet. According to Leibniz's definition, synonyms are different words which can replace each other in any context without changing the truth value of the proposition. This definition is, however, too strict and is fulfilled only by a small number of very similar words. That is why near synonymy is used in WordNet. According to the definition of near synonymy, two different words are synonyms if one can replace the other in the same context without changing the sense of the sentence³. An example of such a synonymy relation is the pair of words eat and consume.

• Hyponymy/hypernymy is a relationship between two synsets in which the semantic range of one lies within that of the other: the narrower synset is the hyponym and the broader one is the hypernym. In computer science, this type of relationship is sometimes called an IS-A, TYPE-OF, or lexical inheritance relation, because all hyponyms must inherit all properties of their hypernyms. For example, animal is a hypernym of dog and car is a hyponym of vehicle.

• Meronymy/holonymy is a relationship between a whole and its parts. For example, wheel is a meronym of car and house is a holonym of door. This relationship is, similarly to the hyponymy/hypernymy relation, transitive.

• Synsets with opposite meanings are interlinked by the antonymy relationship.

² It should be remarked that the roots of Fillmore's frames date back to his earlier work The Case for Case [9], in which he introduced deep (semantic) cases, which are today understood as semantic roles.

³ Even this weaker condition need not always be fulfilled in WordNet.
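The transitivity of the hyponymy/hypernymy and meronymy/holonymy relations described above can be sketched with a few lines of code. The dictionaries and function names below are hypothetical toy data, not the PWN data format or API, and they assume for simplicity that each entry has at most one parent, which a real wordnet does not guarantee. The entries organism and spoke are added only to make the chains longer than one step.

```python
# Toy hypernym links between word labels (hypothetical): hyponym -> hypernym.
HYPERNYM = {
    "dog": "animal",
    "animal": "organism",
    "car": "vehicle",
}

# Toy meronym links (hypothetical): part -> whole.
PART_OF = {
    "spoke": "wheel",
    "wheel": "car",
    "door": "house",
}


def ancestors(node, relation):
    """Follow a child -> parent relation transitively and return the chain."""
    chain = []
    seen = {node}
    while node in relation:
        node = relation[node]
        if node in seen:  # guard against accidental cycles in the toy data
            break
        seen.add(node)
        chain.append(node)
    return chain


def is_hyponym_of(narrow, broad):
    # Transitive IS-A test: broad is reachable from narrow via hypernym links.
    return broad in ancestors(narrow, HYPERNYM)


def is_meronym_of(part, whole):
    # Transitive PART-OF test over the meronym links.
    return whole in ancestors(part, PART_OF)
```

With this toy data, is_hyponym_of("dog", "organism") holds even though no direct link between the two is stored, which is what lexical inheritance relies on: a property attached to a hypernym applies to everything below it in the chain.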
