MASARYK UNIVERSITY FACULTY OF INFORMATICS

Improving NLP Systems with Common Sense Knowledge and Reasoning

PH.D. THESIS PROPOSAL

Zuzana Nevěřilová

Brno, September 2010

Advisor: doc. PhDr. Karel Pala, CSc. Signature: ......

Contents

1 Introduction
  1.1 Common Sense Definitions
  1.2 Cognitive Science Contribution to Automated NLU
  1.3 Knowledge Representation and Inference
  1.4 Common Sense Reasoning and Context Sensitivity
  1.5 Limited Success of Existing Common Sense Applications and Intelligent Agents
  1.6 Motivation
2 Current State-of-art
  2.1 Resources of Common Sense Knowledge
    2.1.1 Encyclopedias and Explanatory Dictionaries
    2.1.2 Ontologies
    2.1.3 Special Collections of Common Sense Knowledge
    2.1.4 Neural Networks
  2.2 Computer Programs that Use Common Sense
3 Present Results
  3.1 Visualization
  3.2 Collecting Common Sense Propositions
  3.3 Czech Verb Valency Lexicon VerbaLex
  3.4 Publication Overview
4 Aims of the Dissertation
  4.1 Evaluation of Resources of Common Sense Propositions
  4.2 Application
  4.3 Evaluation
  4.4 Time Schedule

Chapter 1 Introduction

Within more than 50 years of computational linguistics, different aspects of natural language understanding (NLU) have been studied. From grammar construction, which initially seemed to resolve the problem, computational linguists moved on to complex, multi-level natural language processing (NLP) systems. In the NLP framework, natural language is generally decomposed into smaller units such as phonemes, morphemes, words, phrases, sentences and discourses. According to Frege's principle (also known as the Principle of compositionality [Janssen, 2001]) “the meaning of an expression is a function of the meaning of its parts and of the syntactic rule by which they are combined.” In computational linguistics this principle is widely accepted and it is the basis of generic NLP systems. According to [Johnson-Laird and Miller, 1976] “understanding the meaning of a sentence depends on knowing how to translate it into the information-processing routines it calls for.” The approach of analysing language at each of these levels has been widely adopted by computer scientists. Actually, there are two very different principles known by the name “Frege's principle”. The Principle of compositionality is widely accepted (and implemented), whereas the second (also called the Context principle: “Never ask for the meaning of a word in isolation, but only in the context of a sentence” [Janssen, 2001]) is accepted only by some linguists (e.g. Corpus Pattern Analysis [Pustejovsky et al., 2004]). According to [Allen, 1995] an NLU system has to use considerable knowledge about the language itself, about the context the discourse is held in and about the general world. Miller and Johnson-Laird in [Johnson-Laird and Miller, 1976] describe the need for context as:

Efforts to put some sensible construction on what another person is saying are usually aided by knowledge of the context in which he says it. The context provides a pool of shared information on which both parties to a conversation can draw. The information, both contextual and general, that a speaker believes his listener shares with him constitutes the cognitive background of this utterance.

Researchers in artificial intelligence (AI) such as Lieberman, Lenat or Minsky agree that NLU is conditional on real-world knowledge (in this work called common sense knowledge). Currently, there is a huge effort in developing collections of common sense propositions (for definitions see below) as well as reasoners over these collections. However, in practice not many applications exist and not many applications have been published. It even seems that statistical methods are widespread, while research on common sense collections and common sense reasoning has remained experimental for years. The following sections try to clarify the problem and the incomplete results of the work done so far.

1.1 Common Sense Definitions

First of all, the term common sense has to be defined. It can be defined from different points of view: from the view of linguistics, cognitive science, artificial intelligence or computer science. For this reason, three definitions are provided: “Common sense includes commonsense knowledge – the kinds of facts and concepts that most of us know – but also the commonsense reasoning skills which people use for applying their knowledge. We each use terms like commonsense for the things that we expect other people to know and regard as obvious” [Minsky, 2006]. Common sense is simply shared knowledge. In human communication this shared knowledge is not mentioned because it is expected to be known to all participants of the communication. If this expectation is exaggerated, it leads to communication misunderstandings. This happens very often in human-computer communication. On the other hand, if the expectation is underestimated, the conversation is boring. According to [Minsky, 1986] “common sense is not a simple thing. Instead, it is an immense society of hard-earned practical ideas–multitudes of life-earned rules and exceptions, dispositions and tendencies, balances and checks.” Adults cannot recall their own process of learning the basic facts and rules. That is why it is called common sense. This knowledge, acquired in childhood and improved during the whole life, comprises [Minsky, 2006]:


• Social rules. For example, inanimate objects do not move themselves; they have to be pushed, pulled or carried. Those actions are considered inappropriate if applied to a person.
• Economic rules. Every action leads to questions about how much effort and time one should spend comparing the costs of alternative solutions.
• Conversational skills. People usually know how to keep track of the topic, their conversational goals and their social roles. Everybody has to guess what his/her addressees already know – repeating things one already knows is annoying (see also communication maxims in [Grice, 1989]).
• Sensory and motor skills. These skills are usually not called “commonsensical”, but the (in)ability to physically do something is undoubtedly a part of planning future actions.
• Self-knowledge. A model of one's own abilities is necessary for planning.

Common sense has many relations to the physical world, people's usual abilities as well as emotions. Minsky in [Minsky, 2006] states that “emotions are certain ways to think that we use to increase our resourcefulness”. Moreover, Minsky is convinced that purely logical, rational thinking does not exist because our minds are always affected by our assumptions, intentions and values of life. Barry Smith in [Smith, 1995] describes common sense as “on one hand a certain set of processes of natural cognition–of speaking, reasoning, seeing, and so on. On the other hand common sense is a system of beliefs (or folk physics and folk psychology). Over against both of these is the world of common sense, the world of objects to which the processes of natural cognition and the corresponding belief-contents standardly relate.” Common sense propositions are not always related to scientific or even real-world observations (e.g. propositions such as “natural gas smells”, “an oasis is a calm place”). According to [Smith, 1995], common sense is not considered to be a single, coherent object of scientific observation (similarly to natural language). Its beliefs are context-dependent and this dependency is in principle unlimitedly nuanced. Moreover, there is not a single “world” to which natural cognition can relate.


1.2 Cognitive Science Contribution to Automated NLU

Cognitive science is an interdisciplinary study of mind and intelligence. Its start is probably in psychology, where cognitivism is, in part, a synthesis of earlier forms of psychological analysis. It emphasizes internal mental processes, but it has come to use precise quantitative analysis to study how people learn and think [Sternberg, 2002]. Connectionism is a subfield of cognitive science, neuroscience and artificial intelligence that has attracted the interest of computer scientists since the 1980s. The basic idea of connectionism is that mental models are represented by networks of simple units and “the key to knowledge representation lies in the connections among various nodes, not in each individual node” [Sternberg, 2002]. Cognitive science tries to discover how the human mind and memory work by means of observations of human behavior. Apart from observations of people with impairments (e.g. aphasia or autism), there are several generic experiments that support hypotheses about how the human brain works. Semantic priming is one such external piece of evidence about human memory storage, retrieval and organization. According to [McNamara, 2005] “priming is an improvement in performance in a perceptual or cognitive task, relative to an appropriate baseline, produced by context or prior experience. Semantic priming refers to the improvement in speed or accuracy to respond to a stimulus, such as a word or picture, when it is preceded by a semantically related stimulus (e.g. cat-dog) relative to when it is preceded by a semantically unrelated stimulus (e.g. table-dog).” Semantic priming is used as a tool to investigate some aspects of perception and cognition. Semantic priming has different models where the stimulus and semantic relations are represented by different means. In the distributed network model, concepts are represented as patterns of activation; in spreading activation, concepts are represented by nodes. Closely connected to associative networks (see section 2.1.2), several types of spreading activation models exist: Collins and Loftus' model and the ACT* models are the best known. The spreading activation model is convenient for the purpose of this work, since the data at our disposal are mostly associative networks or related resources. Currently the spreading activation model is used in information retrieval and similar tasks such as recommendation systems (e.g. [Huang et al., 2004]). According to [Crestani, 1997] spreading activation (SA) consists of a sequence of iterations, where one iteration consists of one or more pulses and a termination check.

Each pulse consists of preadjustment, spreading and postadjustment; the preadjustment and postadjustment phases are optional. The algorithm starts with one or more activated nodes (representing a concept) in the network and during the iterations activation is spread through relations between nodes (a small sketch of this iteration is given after the example list below). In [Kleb and Abecker, 2010], spreading activation is used for entity disambiguation in Resource Description Framework (RDF) ontologies. RDF is intended as a standard format for the Semantic Web. This approach does not need any pre-learned knowledge to identify the most likely ontology elements given a particular context. The Bubble Lexicon [Liu, 2003] uses spreading activation to express the meaning of words: “a word only becomes coherently meaningful in a bubble lexicon as a result of simulation (graph traversal) via spreading activation from the origin node, toward some destination. This helps to exclude meaning attachments which are irrelevant in the current context, to hammer down a more coherent meaning”. Liu illustrates the Bubble Lexicon on different meanings of “fast car”. Resulting from network traversal, “fast car” is:

• car – top speed – fast: The car whose top speed is fast.
• car – drive – speed – fast: The car that can be driven at a speed that is fast.
• car – tires – rating – fast: The car whose tires have a rating that is fast.
• car – paint – drying time – fast: The car whose paint has a drying time that is fast.
• etc.
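To make the iteration described above concrete, the following sketch implements a minimal spreading activation pass over a small associative network. It is only an illustration under simplified assumptions (one pulse per iteration, no preadjustment or postadjustment); the node names, edge weights, decay factor and threshold are invented for the example and do not come from any of the cited systems.

from collections import defaultdict

def spread_activation(graph, seeds, iterations=2, decay=0.5, firing_threshold=0.1):
    # graph: node -> list of (neighbour, edge weight); seeds: node -> initial activation
    activation = defaultdict(float, seeds)
    for _ in range(iterations):                        # termination check: fixed number of pulses
        pulse = defaultdict(float)
        for node, value in activation.items():
            if value < firing_threshold:               # only sufficiently activated nodes fire
                continue
            for neighbour, weight in graph.get(node, []):
                pulse[neighbour] += value * weight * decay     # spreading phase
        for node, value in pulse.items():
            activation[node] = max(activation[node], value)    # keep the strongest activation seen
    return dict(activation)

# A toy associative network around "car"; edges and weights are invented.
graph = {
    "car": [("top speed", 0.9), ("tires", 0.6), ("paint", 0.3)],
    "top speed": [("fast", 0.9)],
    "tires": [("rating", 0.8)],
}
print(spread_activation(graph, {"car": 1.0}))

After two pulses the activation reaching “fast” via “top speed” is the strongest of the newly reached nodes, which corresponds to the first (and most natural) reading of “fast car” in Liu's list.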

1.3 Knowledge Representation and Inference

Some automated NLU systems use the knowledge representation (KR) approach. The goal of KR is to represent knowledge and infer new knowledge. KR is a formal description that often uses a special KR language. Some researchers prefer mathematical formalisms such as lambda calculus or first order logic, others such as [Bobrow and Winograd, 1976] propose a complex system that integrates procedural knowledge with a broad base of declarative forms. According to Collins and Quillian in [Collins and Quillian, 1969] it is assumed that people do not store all pieces of knowledge (e.g. dogs have four legs, cats have four legs). It is assumed that the capacity of human memory is finite and not very large and thus people need to save space. From this point of view it is much more efficient to keep in memory only general

knowledge about types (e.g. mammals have four legs, dogs are mammals) and infer new pieces of knowledge on demand (e.g. dogs have four legs). Computer programs with KR usually use this notion. Computer programs store, retrieve and organize knowledge using knowledge bases (KB). Conclusions based on general facts and inference rules are distinguished into two types [Allen, 1995]:

• entailment – a formula that is true given the formulas in the KB
• implication – a conclusion that is drawn from the KB but can be explicitly denied

Generally, inference is distinguished into three types [Allen, 1995]:

• deduction – what can be logically proved
• induction – learning generalities from examples
• abduction – inferring causes from effects

While deduction is the most precise, the other inference techniques (which can sometimes lead to false conclusions) are very common in natural language.
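The economy-of-storage idea from Collins and Quillian can be illustrated with a few lines of code: a property is stored only at the most general applicable type, and a query deduces it by climbing the is-a hierarchy. The concepts and properties below are invented for the example; this is a sketch of the principle, not of any cited system.

IS_A = {"dog": "mammal", "cat": "mammal", "mammal": "animal"}
HAS = {"mammal": {"legs": 4}, "animal": {"alive": True}}

def lookup(concept, prop):
    # Deduce a property stored only once, at the most general applicable type.
    while concept is not None:
        if prop in HAS.get(concept, {}):
            return HAS[concept][prop]
        concept = IS_A.get(concept)   # move to the supertype
    return None                       # nothing stored and nothing deducible

print(lookup("dog", "legs"))   # 4, deduced via dog -> mammal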

1.4 Common Sense Reasoning and Context Sensitivity

In the previous section different inference types were mentioned. However, in practice, existing common sense reasoning systems do not use these types of inference. According to [Liu and Singh, 2004] “There seems to be a divergence between the logical approach to reasoning and what is known about how people reason. . . . Most notably, commonsense knowledge is largely defeasible and context-sensitive.” For the purpose of this dissertation, context sensitivity means the fact that a word or word expression has different meanings and connotations in different contexts. Henry Lieberman [Lieberman, 2002] shows, in the examples below, the perplexity that can originate from inference:

Birds can fly. Tweety is a bird. Thus, Tweety can fly. – OK
Cheap apartments are rare. Rare things are expensive. Thus, cheap apartments are expensive. – OOPS!

Lieberman and Selker in [Lieberman and Selker, 2000] describe the ways of common sense reasoning in a particular context. While typical computer programs are context-independent, the authors explain the necessity

of context-sensitive computing for the evolution of “intelligent” software. In section 2.1.3 different approaches to common sense reasoning are described and most of them are (in some way) context-sensitive. According to [Pala, 1999], the knowledge needed to understand a discourse is of two kinds:

• specific knowledge of the current situation
• general knowledge of the world

Specific knowledge is also known as the communication situation, described in [Pala, 1999] as a vector

$K_s = \langle o_1, o_2, \ldots, o_n, m_i, p_j, l_k, t_m \rangle$, where $o_1, o_2, \ldots, o_n$ are objects of the discourse, $m_i$ is the speaker (writer), $p_j$ is the addressee of the communication, $l_k$ is the location and $t_m$ is the time interval of the communication. General knowledge concerns mostly information about types and inference rules, while specific knowledge concerns mostly information about individuals. Context sensitivity is here described by the vector $K_s$.
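If the communication situation is to be passed around inside a program, the vector can be encoded as a simple record. The field names below are my own reading of Pala's notation, not an established data structure, and the filled-in values are invented.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CommunicationSituation:
    # K_s = <o_1 ... o_n, m_i, p_j, l_k, t_m> from [Pala, 1999]
    objects: List[str] = field(default_factory=list)   # o_1 ... o_n, objects of the discourse
    speaker: str = ""                                   # m_i, the speaker or writer
    addressee: str = ""                                 # p_j
    location: str = ""                                  # l_k
    time_interval: str = ""                             # t_m

ks = CommunicationSituation(objects=["lunch", "beer"], speaker="narrator",
                            addressee="reader", location="olive grove",
                            time_interval="around noon")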

1.5 Limited Success of Existing Common Sense Applications and Intelligent Agents

According to [Lieberman et al., 2004], common sense applications such as question-answering systems were relatively unsuccessful for two reasons. First, the system has to be fast, otherwise users give up on using the system because of its lack of interactivity. Second, collections of common sense propositions are huge, but nonetheless sparse. Therefore, there are situations not covered by the system. If the system tries to reason about such a situation, it leads to serious misunderstandings. Lieberman et al. propose using common sense in “fail-soft” applications such as intelligent agents. The difference is that such an agent helps users only where it is able to. “If a common sense inference turns out wrong, the user is often no worse off than they would be without any assistance” [Lieberman et al., 2004]. According to [Russell et al., 1996] “an agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators”. In the case of software agents, usual computer inputs (e.g. keystrokes, incoming packets) are sensory inputs and usual outputs (e.g. display, printer) are actuators. Intelligent agents are described as

agents that take the best possible action in a situation. Therefore, a performance measure is needed for each agent. Programs that are intelligent agents are context-sensitive (they “act upon an environment”) and do not work in the traditional input–output iteration. Intelligent agents can act on their own as well as not react at all to a user's signal in case they do not have enough knowledge. Ira Rudowsky in [Rudowsky, 2004] summarizes that the key features of an agent are autonomy and the ability to act without human intervention. A simple agent example is a thermostat for a heater. On the other hand, the Remote Agent of NASA's Deep Space 1 is an example of a complex agent:

Remote Agent will formulate its own plans, by combining the high level goals provided by the operations team with its detailed knowledge of both the condition of the spacecraft and how to control it. It then executes that plan, constantly monitoring its progress. [Rudowsky, 2004]

1.6 Motivation

With these ideas in mind, I plan to build an application extending a bigger NLP system with common sense propositions and newly inferred propositions. The application should be “fail-soft” in the sense that it helps only where it can. The application has to deal with the innate vagueness of common sense and its results will be vague as well. Methods used for achieving such a goal comprise those of cognitive science (such as semantic priming), artificial intelligence (AI) and computational linguistics (such as knowledge representation, inference, semantic networks and ontologies). To be more explicit, the application's contribution can range from semantic disambiguation, topic spotting and metonymy resolution1 to affect sensing. It will not be perfect or complete in any of these areas, but an improvement can be achieved. The structure of this thesis proposal follows. In chapter 2 the state of the art is described. It starts with enumerating existing resources and discusses their content with respect to common sense propositions. Section 2.2 describes computer programs with common sense. Chapter 3 summarizes my present results related to the topic. Chapter 4 describes the aims of the dissertation and a time schedule.

1. Metonymy is a figure of speech in which a concept is referred to by the name of something associated with it, e.g. “Prague” in some contexts means “the Czech government”.

Chapter 2 Current State-of-art

Currently, a lot of common sense research concentrates on creating resources of common sense knowledge. However, the results of the second phase of this research – the applications – are not numerous. To be more precise, present applications mainly serve to support or disconfirm a hypothesis of computer science or cognitive psychology and are therefore rather experimental. In the first part of this chapter different types of resources are introduced, while in the second part some representative applications are mentioned.

2.1 Resources of Common Sense Knowledge

One of the basic properties of common sense knowledge is that it is rarely mentioned. As was stated in section 1.1, in human communication there is no need to mention knowledge that is common. Therefore such knowledge is mentioned only in inappropriate communication, in communication with people we do not expect to share the knowledge (mainly children) or, sometimes, in collections of knowledge. The following summary shortly describes some resources of common sense knowledge. The selection below presents the best representatives of their categories. Some of the resources are long-term projects and provide large-scale data, some are suspended but contain remarkable results or ideas.

2.1.1 Encyclopedias and Explanatory Dictionaries

Encyclopedias and explanatory dictionaries (EDs) are useful knowledge resources for humans. Generally, encyclopedias contain information about individuals (people, places, organizations, events, phenomena) and EDs contain information about classes. Usually an encyclopedia or ED entry consists of genus proximum (family) and differentia specifica (distinguishers from siblings).


Both encyclopedias and EDs partially contain common sense knowledge, but there are differences: there should be no folk physics or folk psychology (as described in 1.1). The definitions do not have to be complete from the point of view of a computer program, because some facts are obvious for humans. Moreover, encyclopedias and EDs contain information in natural language, thus it has to be converted to a machine understandable form where possible1. Using encyclopedias and EDs as common sense resources for an NLP system can be criticized on the ground that these resources contain more information than an average human knows. In practice, this is not a disadvantage. The program should not be confused by the huge amount of information; the situation is similar to a scholar reading a popular scientific text and getting bored.

2.1.2 Ontologies

The term ontology currently appears in many domains. In this work, ontology means “. . . a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.” [Gruber, 2009] The term semantic network was introduced by M. Ross Quillian [Collins and Quillian, 1969]. Initially it meant a network of concepts2, their attributes and the hierarchical sub-superclass relationship. Each concept is represented by a node and nodes are connected via “is-a” and “instance-of” links. Currently, the term semantic network is often used in a more general sense: nodes represent information items and links represent sometimes undefined or unlabeled relations. This type of network is sometimes called an associative network [Crestani, 1997]. Semantic (or associative) networks are considered to be one formal representation of ontologies, while frames and scripts are another. Frames were introduced by Marvin Minsky [Minsky, 1986] and according to him are “representations of stereotyped situations. Attached to each frame are several kinds of information.

1. N.B. that there is a significant difference between a machine readable format (e.g. XML) and a machine understandable format (e.g. a knowledge representation language).
2. Note that a concept is not a language expression; word expressions are only labels for concepts.

Some of this information is about how to use the frame3. Some is about what one can expect to happen next. Some is about what to do if these expectations are not confirmed.” The idea of scripts also came in the 1970s, from Roger Schank [Schank and Abelson, 1977]. A script is a representation of a stereotypical sequence of actions that are temporally ordered. Currently, the best-known ontologies include Princeton WordNet, EuroWordNet and FrameNet. Within the Czech context a closely related work is worth mentioning: the verb valency lexicon VerbaLex. Princeton WordNet (PWN) [Fellbaum, 1998] is a lexical database. Its basic units – synsets (synonymic sets) – are organized in hierarchies (using semantic relations such as hypo/hypernymy, holo/meronymy etc.). EuroWordNet (EWN) [Vossen, 1998] is a follow-up project that extended PWN with several languages (including Czech) and an inter-lingual index (ILI) that connects synsets with the same meaning across languages. The basic advantage of the WordNets from the point of view of this dissertation is that the units and relations between them are understandable for humans and ready to use for computer programs. A possible drawback of the WordNets is the relation sparsity of the data: e.g. in the Czech WordNet (CzWN) there are 28 201 synsets, but only 34 267 relations between them4. FrameNet is an on-line lexical resource based on frame semantics and supported by corpus evidence. In 2006 it contained more than 10 000 lexical units (word–meaning pairs) in nearly 800 hierarchically related semantic frames [Ruppenhofer et al., 2006]. Typically the frame-evoking lexical unit is a verb and the frame elements are its dependents. There are several advantages of FrameNet for the purpose of this dissertation: FrameNet is intended for annotating continuous texts. It recognizes named entities (words that do not have to be in a frame, e.g. names) and lexical units. It can find the appropriate frame for these lexical units. The annotation is accessible on-line [Ruppenhofer et al., 2006]. FrameNet is available for the English language only. VerbaLex cannot be considered an ontology, since there are no relations between frames yet. However, it is presently the most complete lexicon of verb valencies for the Czech language. In 2006 [Hlaváčková, 2007] it contained 35 302 verb meanings (of 11 306 verb lemmata) within 28 418 verb frames with slots describing appropriate semantic roles5.

3. Author’s note: this kind of information is often called meta-information 4. The numbers were taken from European Association’s Catalogue, accessed 4-September-2010, on-line from http://catalog.elra.info/product_ info.php?products_id=1089. 5. Also known as thematic roles, semantic roles were introduced in mid-1960’s as a way of classifying the arguments of natural language predicates [Gruber, 1965, Fillmore, 1968].

Semantic roles such as Agent, Location or Patient form an important part of VerbaLex. The lexicon uses two layers of semantic roles; the second layer relates the role to PWN, e.g. Agent is restricted to PWN person:1. VerbaLex thus represents basic knowledge about super-types (e.g. the agent of buying is a person). Together with PWN, VerbaLex provides information about subtypes (e.g. a girl (subclass of PWN “person”) bought new shoes (subclass of PWN “goods”)). VerbaLex is made by linguists and therefore contains much linguistic information (cases, aspects, idioms). Currently VerbaLex is planned to be extended with inter-frame relations such as precondition, effect or decomposition [Nevěřilová, 2009]. With such relations it could be inferred that a particular verb frame has a particular relationship to other frames. For example, by setting the relation “effect” between the verb frames “přijít” (to arrive) and “nacházet se” (to be found somewhere), a program could infer that the Agent arriving at a Location can be found at that Location afterwards. Since there are a lot of frames, this relation setting cannot be done manually. In [Nevěřilová, 2010], verb frames with similar slots or the same verb class (based on Levin's classes [Levin, 1993]) are grouped so that the relations can be set between whole groups of verbs. This is expected to be done semi-automatically, but so far it is a work in progress (see also chapter 3). Materna and Pala in [Materna and Pala, 2010] establish a relationship between FrameNet and VerbaLex. Using their semi-automatic approach, FrameNet frames (including frame-to-frame relations) can be linked to VerbaLex frames. This method could be used as a core for building a FrameNet in Czech. Ontologies have a lot in common with encyclopedias and explanatory dictionaries. All of these resources contain (in some form) ontological categories (categories of possible objects meant [Thomasson, 2009]). Encyclopedias and EDs usually contain a domain assignment of an entry (often in the form of a field label), while in ontologies a relation has-domain is established. Lexical information (such as derivatives) is more frequent in EDs and WordNets. The advantage of encyclopedias and EDs is tradition: they exist in all European languages; in Czech there are three EDs (PSJČ [Havránek et al., 1957], SSJČ [Havránek et al., 1971] and SSČ [Filipec and Daneš, 1995], all available with the DebDict tool [Horák et al., 2006]). On the other hand, ontologies are made to be readable for humans and understandable for computers. If encyclopedias and EDs are used as a resource of common sense propositions, some conversions to machine understandable formats will be needed.
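The planned “effect” relation between verb frames can be thought of as a rewrite rule over role-labelled propositions: when the text instantiates the first frame, a proposition instantiating the second frame may be added. The sketch below only illustrates this idea; the frame encoding and slot names are simplified stand-ins, not the actual VerbaLex format.

# Hypothetical, simplified frames: a verb lemma plus role -> filler slots.
EFFECT_RULES = {
    # "přijít" (to arrive) has the effect "nacházet se" (to be located somewhere)
    "přijít": ("nacházet se", ["AGENT", "LOCATION"]),
}

def infer_effects(proposition):
    # proposition = (verb, {role: filler}); returns the inferred effect propositions
    verb, slots = proposition
    inferred = []
    if verb in EFFECT_RULES:
        effect_verb, needed_roles = EFFECT_RULES[verb]
        if all(role in slots for role in needed_roles):
            inferred.append((effect_verb, {r: slots[r] for r in needed_roles}))
    return inferred

print(infer_effects(("přijít", {"AGENT": "kameník", "LOCATION": "olivovník"})))
# [('nacházet se', {'AGENT': 'kameník', 'LOCATION': 'olivovník'})]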


2.1.3 Special Collections of Common Sense Knowledge

In recent years researchers have concentrated on creating collections of common sense propositions especially for future computer programs to use. They hope that these special collections will contain the knowledge that is not mentioned in encyclopedias, EDs or ontologies. This section contains a short description of representative special collections and a summary.

CyC CyC is a large handcrafted knowledge base and inference machine. In 2007 it contained roughly 3 million propositions (called assertions) [Lieberman, 2007]; thanks to the inference machine, millions more can be inferred. According to [CyC, 2010], several forms of reasoning are used, including general logical deduction (modus ponens, modus tollens, quantifiers) and inference mechanisms typical for AI such as inheritance or automatic classification. CyC puts each of its assertions into one or more contexts (called microtheories). CyC does not use numeric certainty factors unless statistics is available. Instead, it uses assertions about assertions, e.g. assertion A is less likely than assertion B. Several applications based on CyC exist, others are in development. More information about the project is available in [Lenat, 1995]. Cycorp Inc., the owner of CyC, provides two versions of the database for free: OpenCyC and ResearchCyC.

ThoughtTreasure ThoughtTreasure contains over 25 000 concepts organized in a hierarchy and about 100 scripts [Schank and Abelson, 1977] describing common situations such as eating in a restaurant. While inspired by CyC, ThoughtTreasure is different: for representation it uses logic, finite automata, grids and scripts; propositions are represented in a relational database format [Mueller, 2003]. ThoughtTreasure contains both encyclopedic facts and knowledge about the language. It contains near-synonyms in English and French. Beyond the declarative knowledge it also contains agents: emotional agents, understanding agents, planning agents, etc. Thanks to these agents ThoughtTreasure can process a story: it can confirm facts that correlate with expectations (e.g. “A is an enemy of B”, “A hates B”) and it can find oddities such as “A is an enemy of B”, “A helps B”. ThoughtTreasure's development stopped in 2000. However, the idea of multiple agents is very inspiring.


ConceptNet ConceptNet [Arnold et al., 2009] is based on the Open Mind Common Sense project. It is an open-source, multilingual semantic network of common sense knowledge. It allows common sense reasoning thanks to Divisi, a tool for reasoning by analogy. In ConceptNet, three kinds of flexible inference are used: context finding, inference chaining and conceptual analogy [Liu and Singh, 2004]. The “reasoning is associative and thus not as expressive, exact, or certain as logical inferences, but it is much more straightforward to perform, and useful for reasoning practically over text”.

Games for Collecting Common Sense Propositions While the development of specialized collections of common sense propositions is an expensive and long-term task, another approach is coming into fashion: collecting common sense propositions by means of games. Here, the contributors are anonymous Internet users and their intention is not to build a collection of common sense propositions but simply to have fun. Von Ahn calls these games “games with a purpose” (GWAP) [von Ahn, 2006]. Different games with different purposes exist, e.g. 20 Questions [Speer et al., 2009], Verbosity [von Ahn et al., 2006], X-plain [Nevěřilová, 2010b] (see also chapter 3) or JeuxDeMots [Lafourcade and Joubert, 2009]. The common features of these games are:

• the game rules are straightforward so that most people can understand them
• the task can be simple, such as categorization, or difficult, such as explanation of a word
• the game has a time limit (this is one of the features that distinguishes a game from an annotation task [Vickrey et al., 2008])
• the collection contains errors, but is still considered quite reliable thanks to measuring agreement among players

Summary In the previous paragraphs, different approaches to collecting common sense propositions were introduced. Several attributes of these collections can be observed and compared:

• created by experts (CyC, ThoughtTreasure) or by the public (ConceptNet, GWAP)
• contain words (ConceptNet, GWAP) or concepts (CyC, ThoughtTreasure)


• contain inference (CyC, ThoughtTreasure, ConceptNet) or are simply a collection (GWAP)
• language: CyC, ThoughtTreasure, Verbosity and ConceptNet are in English. ConceptNet and ThoughtTreasure are in French as well. JeuxDeMots is in French. X-plain is in Czech
• size: CyC contains millions of propositions, ThoughtTreasure contains tens of thousands, ConceptNet contains 1.6 million relations between more than 300 000 nodes, GWAP contain thousands of propositions. Numeric comparison gives only an approximate notion of the size, since all these resources use different knowledge representations.

2.1.4 Neural Networks

Neural networks are only moderately relevant to the dissertation topic. However, there is one project worth mentioning: a 20 questions game (the computer asks the human at most 20 yes/no questions and tries to guess what object the human has been thinking about). The program at http://www.20q.net is based on a neural network and by 2004 the website had accumulated about 10 million synaptic associations [Kelly, 2007]. It had a 73 % success rate of guessing. In August 2010 the game had been played over 77 million times. The neural network filled with data from players provides a lot of common sense knowledge in several languages including Czech. To my knowledge there is not much scientific information about this project (no reviewed articles published), but the application is interesting.

2.2 Computer Programs that Use Common Sense

The history of adding common sense to computer programs is coeval with that of artificial intelligence. The Loebner Prize for artificial intelligence is the first formal instantiation of a Turing Test [Turing, 1950]. Each year an annual prize of $2 000 and a bronze medal is awarded to the most human-like computer; however, nobody has been awarded the Grand Prize of $100 000 and a gold medal yet. In this section I try to describe the progress of computer programs written for NLP. The following enumeration tries to describe some representatives chronologically. These computer programs are worth mentioning because they test cognitive science, computational linguistics and AI hypotheses in the real world.


SHRDLU SHRDLU [Wasserman, 1985] was a program made by Terry Winograd in 1972. It dealt with a specific domain – the block world – and was concerned both with physical object representation and natural language understanding. The novelty was in using three components: a syntactic parser, a semantic processor and a “logical deductive segment that figured out the users' requests and answer questions” [Wasserman, 1985]. SHRDLU used semantic networks for the representation of associations.

MARGIE One of the first attempts at (general) natural language understanding programs was MARGIE [Schank et al., 1973] – a set of programs that “displays its understanding by paraphrasing the text it reads and by answering questions about them” [Garnham, 1988]. It consisted of three modules: a semantic parser, an inference mechanism that uses knowledge about the world, and a response generator. As a semantic representation it used the theory of conceptual dependency (language independent, primitive based), which was new at that time.

GUS While SHRDLU and MARGIE were experimental and did not have much application to real-world situations, GUS was designed to provide information on airline flight schedules [Wasserman, 1985]. GUS used the frame concept for semantic representation.

IPP IPP uses a frame representation scheme as well. It uses high-level representational structures called Memory Organization Packets (MOPs). MOPs can change dynamically during the analysis. IPP's main contribution is its generalization capability – it recognizes similarities and differences between events stored in MOPs and builds other MOPs that are used as stereotypical knowledge [Wasserman, 1985].

SensiCal SensiCal (a smart calendar) [Mueller, 2000a] is an application that looks for conflicts in calendar entries. It uses common sense propositions such as “a steakhouse is a restaurant”, “steaks are served in steakhouses”, “steak is made from meat”, “vegetarians avoid meat”. Then it can detect conflicts such as “dinner with Lin in Frank's Steakhouse”, “Lin is vegetarian” and warn the user: “You are taking Lin who is vegetarian to a steakhouse?”.
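A toy version of the SensiCal check can be written over a handful of (subject, relation, object) propositions; everything below (the fact list, the relation names, the rule) is invented for illustration and is not SensiCal's actual knowledge base.

FACTS = {
    ("steakhouse", "serves", "steak"),
    ("steak", "made_from", "meat"),
    ("Lin", "is", "vegetarian"),
    ("vegetarian", "avoids", "meat"),
}

def food_conflict(person, venue_type):
    # Warn if the venue serves something made from an ingredient the person avoids.
    avoided = {o for s, r, o in FACTS if r == "avoids" and (person, "is", s) in FACTS}
    served = {o for s, r, o in FACTS if s == venue_type and r == "serves"}
    for dish in served:
        ingredients = {o for s, r, o in FACTS if s == dish and r == "made_from"}
        clash = ingredients & avoided
        if clash:
            return "You are taking %s, who avoids %s, to a %s?" % (person, ", ".join(clash), venue_type)
    return None

print(food_conflict("Lin", "steakhouse"))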

NewsForms NewsForms and NewsExtract [Mueller, 2000b] are another pair of applications that use common sense knowledge to extract information from

newspapers. They can extract information about people, organizations, locations, earnings, injuries, legal events, medical findings, wars, weather etc.

Story Understanding with ThoughtTreasure There is a large effort in building story understanding programs. Mueller [Mueller, 2002] describes story understanding as simulation:

• A simulation is a sequence of states
• A state is a snapshot of the mental world of each story character and the physical world

Mueller’s project ThoughtTreasure maintains this concept by using sev- eral understanding agents that “decide whether a given input concept makes sense given the current context. Each understanding agent returns a make sense rating that indicates how much an input concept makes sense along with reasons why that concept makes or does not make sense” [Mueller, 2003]. ThoughtTreasure can resolve oddities in emotions or inter- personal relations. Eric T. Mueller in [Mueller, 2003] describes an example: Jacques is an enemy of Franc¸ois. [enemy-of Francois Jacques] Jacques succeeded at being elected President of France. [succeeded-goal Jacques [President-of France Jacques]] Franc¸ois is happy for Jacques. [happy-for Francois Jacques] ThoughtTreasure’s understanding agents (UA) will resolve the situation as follows: ----UA_Emotion---- makes sense because [succeeded-goal Jacques [President-of France Jacques]] does not make sense because [enemy-of Francois Jacques] True, he succeeds at being the president of France. But I thought that he was an enemy of Franc¸ois.

Gossip Galore Recent applications are often developed as intelligent agents (see 1.5). The conversation agent Gossip Galore [Cheng et al., 2009] can chat with users about popular musicians, bands, actors etc. It can reveal relationships among these people. It uses a textual interface for simple questions and answers and a visual interface for displaying relationships among celebrities.


Siri Siri’s Personal Assistant for iPhone [Schonfeld, 2010] is able to recog- nize speech and find information about events, transports, weather, restau- rants etc. depending on the context (time, date and place). Apart from ticket reservations, weather forecasts etc. it uses data from Wolfram|Alpha (a question-answer system which aim is to make all systematic knowledge computable and accessible [Wolfram|Alpha, 2010]). The application has a conversational interface.

Other Applications In [Lieberman et al., 2004] several other applications of common sense are described: telling stories with ARIA (an application that semi-automatically annotates photographs), common sense word completion (an application useful namely on mobile devices), and affective classification of text (Empathy Buddy). Most of these applications seem to be prototypes.

Summary Although the evolution of “intelligent” software has quite a long history, experimental applications still predominate over practical ones6. The applications listed above use different approaches, but it seems that some approaches are better than others, namely the use of several different resources of information and the use of several types of semantic representation (frames, networks etc.). Almost all the programs use some type of reasoning. However, it seems that a cautious approach that sometimes infers nothing is better than inferring oddities.

6. However, my current knowledge need not cover all the work done (especially organizations such as DARPA do not publish their results).

Chapter 3 Present Results

This chapter summarizes my work on different topics related to the thesis.

3.1 Visualization

My master's thesis [Nevěřilová, 2005] and the following publications [Nevěřilová, 2005, Gregar et al., 2006, Nevěřilová, 2010a] concern visualization of associative networks. Although the topic seems to be only moderately relevant to the dissertation, the benefits for the dissertation are as follows:

• the work accumulates relevant knowledge about data representation and associative networks
• every visualization is preceded by a deep study of the data structure being visualized
• visual representation may be one of the intended outputs of the dissertation

Further work has been done on the visualization application and currently (2010) it is able to visualize not only lexicons, but any (associative) network. Visualization of semantic (associative) networks has been described in the paper [Nevěřilová, 2005]. In practice, several data structures have been visualized:

• library data and citations (presented in “XML-based flexible visualization of networks: Visual Browser”)
• e-learning data [Gregar et al., 2006]
• search results and related elements of a digital mathematics library [Nevěřilová, 2010a]

3.2 Collecting Common Sense Propositions

The paper [Nevěřilová, 2010b] presents a way of collecting common sense propositions by means of a game. It is intended for one player; the second player is a computer program.

Its “intelligence” is based on the growing collection and on the Word Sketch Engine [Kilgarriff et al., 2004]. The resulting collection is in the form of triples (subject, relation, object) and the agreement among players is considered as well. So far more than 3 000 unique triples have been collected.
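One straightforward way to exploit the agreement among players is to keep the triples as counts and accept only those entered independently more than once. The sketch below shows such an aggregation; it is an illustration of the idea, not the storage format actually used by X-plain, and the example triples are invented.

from collections import Counter

def aggregate(raw_triples, min_agreement=2):
    # raw_triples: (subject, relation, object) tuples as entered by individual players.
    # Keep only triples confirmed by at least min_agreement contributions.
    counts = Counter((s.lower(), r, o.lower()) for s, r, o in raw_triples)
    return {triple: n for triple, n in counts.items() if n >= min_agreement}

raw = [("pivo", "is_a", "nápoj"), ("Pivo", "is_a", "nápoj"), ("pivo", "has_property", "studené")]
print(aggregate(raw))   # {('pivo', 'is_a', 'nápoj'): 2}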

3.3 Czech Verb Valency Lexicon VerbaLex

The paper [Nevěřilová, 2009] presents the usage of two major, linguist-made lexical resources of the Czech language: WordNet and VerbaLex. First, a conversion to RDF was made. Afterwards, a Prolog program was used to analyze Czech language inputs. In the second part of the article an extension to the current VerbaLex is proposed. Possible pitfalls are discussed. In the conclusion, the side effect of this work is emphasized: important feedback for the authors and administrators of both lexical resources. Successive work is presented in the paper [Nevěřilová, 2010] (to be published in the Proceedings of TSD). VerbaLex frames are grouped by two criteria. The former is the pattern made of semantic roles present in a frame, the latter is a verb classification based on Levin's verb classes [Levin, 1993] (adapted for the Czech language). The main purpose of this work is to group appropriate verbs together and apply inference rules to whole groups instead of individual verbs.

3.4 Publication Overview

• Nevěřilová, Z. (2005). Vizuální lexikon. Master's thesis, Masaryk University in Brno.
• Nevěřilová, Z. (2005). Visual Browser: A Tool for Visualizing Ontologies. In Proceedings of I-KNOW'05, pages 453–461, Graz, Austria. Know-Center in coop. with Graz Uni, Joanneum Research and Springer Pub. Co.
• Gregar, T., Nevěřilová, Z., Rambousek, A. and Pitner, T. (2006). Vizualizace znalostí v e-learningu. In SCO 2006, Sharable Content Objects, 3. ročník konference o elektronické podpoře výuky, pages 39–45, Brno. Masarykova univerzita.
• Nevěřilová, Z. (2009). Exploring and Extending Czech WordNet and VerbaLex. In Proceedings of the RASLAN Workshop 2009, Brno. Masaryk University.
• Nevěřilová, Z. (2010). Implementing Dynamic Visualization as an Alternative Interface to a Digital Mathematics Library. In Towards a Digital Mathematics Library, Proceedings, pages 63–68, Brno.
• Nevěřilová, Z. (2010). X-plain – a Game that Collects Common Sense Propositions. In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science NLPCS 2010, pages 47–52, Portugal.
• Nevěřilová, Z. (2010). Semantic Role Patterns and Verb Classes in Verb Valency Lexicon. In Proceedings of the 13th International Conference on Text, Speech and Dialogue TSD 2010, pages 147–153, Brno.

Chapter 4 Aims of the Dissertation

The aim of this work is to build an application for Czech that adds common sense knowledge to text analysis. The application is going to perform story understanding and is intended as a module in a larger NLP application, if one exists. Its purpose is to reveal relations between semantic units of the text (mentioned as well as unmentioned). In the next stage the application is going to use common sense reasoning techniques to infer propositions that are not mentioned in the text, but are expected to be known by the addressee of the discourse. A possible output of a story understanding application is shown on a short story:

Kameník nechal [jejich] jídlo pod nedalekým olivovníkem. Byl parný den, ale pivo bylo naštěstí ještě studené. [Altmann, 2005] (The stonemason had left their lunch under a nearby olive tree. It was a hot day but fortunately the beer was still cold.)

The program has to output (some of) the following relations:

• parný den – dopoledne (hot day – morning)
• olivovník – stín – studené pivo (olive tree – shadow – cold beer)
• jídlo – pivo (lunch – beer)
• studené pivo – kameníkovo štěstí (cold beer – stonemason's happiness)

The program can also infer, for example, that:

• kameník byl venku (the stonemason was outside) – sure
• kameník byl s někým (the stonemason was with someone) – sure
• slunce svítilo (the sun was shining) – probable
• bylo dopoledne (it was morning) – probable
• bylo léto (it happened in summer) – probable


• kameník i jeho druh pracovali (the stonemason and his companion were about to work) – possible
• kameník měl s sebou nářadí (there were stonemason's tools) – possible
• pivo bylo součástí jídla (the beer is a part of the lunch) – probable
• příběh se odehrál v Řecku nebo v Itálii (the scene took place in Greece or Italy) – possible

With the help of CzWN, relations from the hypo/hypernymy hierarchy (such as “beer is a beverage”) can be obtained. With the help of VerbaLex the semantic roles can be established (e.g. “the stonemason is the agent”, while “the lunch is the patient” of the action of “leaving”). With the current state of the art of the NLP tools available for the Czech language we are not yet able to obtain the relations stated in the stonemason example. To my knowledge there is so far no common sense reasoning tool suitable for the Czech language. However, this dissertation is going to establish relations between different semantic units of the text as well as provide common sense reasoning (a rough sketch of how the two lexical resources could be combined is given after the list of steps below). The work is going to include three steps:

• evaluation of existing resources of common sense propositions, possibly proposing new ones (see below)
• creating an application that will support discourse understanding (see section 4.2)
• evaluating the usefulness of the application (see section 4.3)
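As a first, rough sketch of how CzWN and VerbaLex could work together in the planned application, the code below climbs a hypernymy chain and assigns a semantic role whose restriction matches some hypernym. The lexical entries, the frame and the role restrictions are illustrative placeholders, not actual CzWN or VerbaLex data.

# Toy hypernymy chain (stand-in for Czech WordNet) and one valency frame (stand-in for VerbaLex).
HYPERNYMS = {"pivo": "nápoj", "nápoj": "tekutina", "kameník": "osoba"}
FRAME_NECHAT = {"verb": "nechat", "roles": {"AGENT": "osoba", "PATIENT": "jídlo"}}

def hypernym_chain(word):
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

def assign_role(word, frame):
    # Assign the first role whose restriction appears somewhere in the word's hypernym chain.
    chain = set(hypernym_chain(word))
    for role, restriction in frame["roles"].items():
        if restriction in chain:
            return role
    return None   # better no answer than a wrong one

print(hypernym_chain("pivo"))                 # ['pivo', 'nápoj', 'tekutina']
print(assign_role("kameník", FRAME_NECHAT))   # 'AGENT'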

Since the time for writing a dissertation is limited, the application has to be considered a prototype. Therefore, it does not need to cover all general knowledge, but can be limited to a domain, e.g. fairy tales or Czech political commentaries. Since the application is considered to be context-sensitive, a future extension towards a general domain actually means adding domains (and their contexts). Story understanding has a positive impact on several subfields of NLP such as:

• semantic disambiguation – word meanings can be distinguished depending on the particular context
• topic spotting – if the story is (at least partially) understood, it will be possible to detect (different) topics of the text
• metonymy resolution – thanks to relations established between semantic units, metonymy can be detected and resolved
• affect sensing – if the story is understood, emotions of the participants will be detected


4.1 Evaluation of Resources of Common Sense Propositions

In section 2.1, different kinds of resources of common sense propositions were presented. However, it is not certain that using all of them will be useful. Most resources are in English and it is not easy to distinguish what type of knowledge is “common to all people” and what knowledge is “common to all English speakers”. In other words, there are differences in habits, culture, geography, religion etc. and we do not know where the border lies between “common” common sense and “national” common sense. This fact is closely related to hypotheses about linguistic relativity. For the purpose of this dissertation I use the following rule: knowledge about individuals is considered “national” and knowledge about classes is considered to be “common”. There are counterexamples like the class “bread” that denotes a different object in Great Britain and in the Czech Republic. Other counterexamples are individuals like “Ludwig van Beethoven” or “Jupiter” that are probably known with similar connotations in Great Britain as well as in the Czech Republic. The evaluation will concentrate on this aspect of the resources. Building new resources of common sense propositions is a time-consuming task. However, there are ways to collect at least some data purely from Czech language users. X-plain (see section 2.1.3) is the first published work with this purpose related to this dissertation. In future, other collections can be built.

4.2 Application

Traditionally, a generic NLP system contains modules for morphological, syntactic, semantic etc. analyses. An input is processed using these modules sequentially. Madeleine Bates in [Bates, 1995] proposes a more efficient scheme where natural language input is processed as a series of independent processes, each of which contributes to an overall understanding of the input. This approach has some advantages: it can handle failures of some components (e.g. semantic analysis can continue even if there is an unknown word in the input) and it is possible to add a component. Bates calls this process “understanding search” (see figure 4.1). The Bates scheme inspires the application design for two reasons. First, similarly to modules in an NLP system, there are going to be several modules, each working with a different resource of common sense. Second, similarly to the contributions of different analyses to the overall NLP process in the Bates scheme, modules in the application should contribute to the overall result.


Figure 4.1: Traditional scheme vs. Bates “understanding search”

The future application has to deal with different formats of the resources, therefore several conversion tools will be needed. Then, every module can use a different data format and different processing methods. The resulting application will support story understanding by simulation (as explained in 2.2). According to [Mueller, 2002], the purposes of a common sense application are:

• to fill in missing information
• to point out potential problems

Again, according to [Mueller, 2002], each input updates the simulation in such a way that it is possible to read a story, maintain a model of the story and answer questions. The use of intelligent agents (see 1.5) makes it possible to have different agents for understanding different aspects of the story: story structure, space, time, actors, emotions, conflicts or oddities in the story etc., as well as for dealing with linguistic features of the text (morphological derivations, multi-word expressions, proper names). The advantage of using intelligent agents is that an agent does not suggest a relation if it has not found any. In other words, “I do not know” is considered to be a better answer than a false suggestion. Agents act depending on the environment (context sensitivity). Thus the application will be context-sensitive, either by encoding the current context into a vector (see section 1.4), or by spreading activation (see section 1.2). The use of intelligent agents promises high precision accompanied by poor recall (at least at the beginning). In future, self-learning agents can be used to improve the recall.
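The “I do not know is better than a false suggestion” policy maps naturally onto agents that return either a list of relations or nothing at all. The interface below is only a design sketch with invented class and relation names, not a commitment to a particular implementation.

from typing import Optional, List, Tuple

Relation = Tuple[str, str, str]

class UnderstandingAgent:
    # Base class: an agent suggests relations only when it has actually found something.
    def suggest(self, sentence: str, context: dict) -> Optional[List[Relation]]:
        raise NotImplementedError

class TimeAgent(UnderstandingAgent):
    def suggest(self, sentence, context):
        if "parný den" in sentence:                        # "hot day"
            return [("parný den", "suggests", "léto")]     # -> summer
        return None                                        # "I do not know"

relations = []
for agent in [TimeAgent()]:
    result = agent.suggest("Byl parný den.", context={})
    if result is not None:        # silence is acceptable; a wrong guess is not
        relations.extend(result)
print(relations)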


4.3 Evaluation

The evaluation of the usefulness of the application is always a potential pitfall. Evaluation of the application is going to proceed on two levels. First, every agent is going to have an appropriate performance measure (see [Russell et al., 1996]). Second, the overall performance of the application has to be evaluated. The best way to evaluate outputs of the story understanding application is to compare them with a hand-annotated corpus. To my knowledge, there is no such corpus for Czech, but The New York Times Annotated Corpus [Sandhaus, 2008] is a promising work and can be inspiring for a Czech corpus. The evaluation can be done this way if an annotated corpus of Czech exists. If there is no suitable Czech corpus, another way of evaluation will be to compare outputs of the story understanding application with hand-made annotations and measure multiple-annotator agreement.
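If the fallback evaluation with several human annotators is used, agreement can be quantified, for example, with Cohen's kappa. The sketch below computes kappa for two annotators over nominal labels; the labels themselves are invented for the example.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # kappa = (p_o - p_e) / (1 - p_e), observed vs. chance agreement of two annotators
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["probable", "sure", "possible", "probable"]
b = ["probable", "sure", "probable", "probable"]
print(round(cohens_kappa(a, b), 2))   # 0.56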

4.4 Time Schedule

September 2010–January 2011
• Examination of resources of common sense propositions.
• Creating own resources for the Czech language.

January 2011
• Defense of this proposal.
• Doctoral exam.

January 2011–August 2011
• Design and implementation of a story understanding application.
• Building modules for different resources of common sense propositions.
• Creating different intelligent agents.
• Writing the Ph.D. thesis.

September 2011–August 2012
• Evaluation of the story understanding application.
• Evaluation of different intelligent agents.
• Writing the Ph.D. thesis.
• Submitting the thesis.

Bibliography

[Allen, 1995] Allen, J. (1995). Natural Language Understanding (2nd ed.). Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA.

[Altmann, 2005] Altmann, G. (2005). Výstup na babylonskou věž. Triáda.

[Arnold et al., 2009] Arnold, K., Alonso, J., Havasi, C., and Speer, R. (2009). ConceptNet. [Retrieved on-line from http://conceptnet.media.mit.edu/; accessed 29-July-2009].

[Bates, 1995] Bates, M. (1995). Models of natural language understanding. In Proceedings of the National Academy of Sciences, volume 92, pages 9977–9982, USA.

[Bobrow and Winograd, 1976] Bobrow, D. G. and Winograd, T. A. (1976). An overview of KRL, a knowledge representation language. Technical report, Stanford University, Stanford, CA, USA.

[Cheng et al., 2009] Cheng, X., Adolphs, P., Xu, F., Uszkoreit, H., and Li, H. (2009). Gossip Galore: a Self-Learning Agent for Exchanging Pop Trivia. In EACL ’09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, pages 13–16, Morristown, NJ, USA. Association for Computational Linguistics.

[Collins and Quillian, 1969] Collins, A. M. and Quillian, M. R. (1969). Retrieval Time from Semantic Memory. Journal of Verbal Learning and Verbal Behavior, 8(2):240–247.

[Crestani, 1997] Crestani, F. (1997). Application of Spreading Activation Techniques in Information Retrieval. Artif. Intell. Rev., 11(6):453–482.

[CyC, 2010] CyC (2010). How does Cyc reason? [Retrieved on-line from http://www.cyc.com/cyc/technology/technology/whatiscyc_dir/howdoescycreason; accessed 10-August-2010].


[Fellbaum, 1998] Fellbaum, C. (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.

[Filipec and Daneš, 1995] Filipec, J. and Daneš, F. (1995). Slovník spisovné češtiny (Dictionary of Literary Czech, SSČ). Academia, Praha, 1st edition. Electronic version, LEDA, Praha.

[Fillmore, 1968] Fillmore, C. J. (1968). The Case for Case. In Bach, E. and Harms, R., editors, Universals in Linguistic Theory, pages 1–88, New York. Holt, Rinehart, and Winston.

[Garnham, 1988] Garnham, A. (1988). Artificial Intelligence: an Introduction. Taylor & Francis, London, New York.

[Gregar et al., 2006] Gregar, T., Nevěřilová, Z., Rambousek, A., and Pitner, T. (2006). Vizualizace znalostí v e-learningu. In SCO 2006, Sharable Content Objects, 3. ročník konference o elektronické podpoře výuky, pages 39–45, Brno. Masarykova univerzita.

[Grice, 1989] Grice, H. P. (1989). Studies in the Way of Words. Harvard University Press, Cambridge, Massachusetts, London, England.

[Gruber, 1965] Gruber, J. S. (1965). Studies in Lexical Relations. PhD thesis, MIT, Cambridge, MA.

[Gruber, 2009] Gruber, T. (2009). Ontology. In Liu, L. and Özsu, M. T., editors, Encyclopedia of Database Systems, pages 1963–1965. Springer Verlag.

[Havránek et al., 1971] Havránek, B., Bělič, J., Helcl, M., Jedlička, A., Křístek, V., and Trávníček, F. (1960–1971). Slovník spisovného jazyka českého (Dictionary of Written Czech, SSJČ). Academia, Praha, 1st edition. Electronic version, created in the Institute of Czech Language, Czech Academy of Sciences Prague in cooperation with Faculty of Informatics, Masaryk University Brno.

[Havránek et al., 1957] Havránek, B., Hujer, O., Smetánka, E., Weingart, M., Šmilauer, V., and Získal, A. (1935–1957). Příruční slovník jazyka českého (Reference Dictionary of Czech Language, PSJČ). Státní pedagogické nakladatelství/SPN, Praha. Electronic version, created in the Institute of Czech Language, Czech Academy of Sciences Prague in cooperation with Faculty of Informatics, Masaryk University Brno.


[Hlaváčková, 2007] Hlaváčková, D. (2007). Databáze slovesných valenčních rámců VerbaLex. PhD thesis, Masarykova univerzita, Filozofická fakulta, Ústav českého jazyka.

[Horák et al., 2006] Horák, A., Pala, K., Rambousek, A., and Rychlý, P. (2006). New Clients for Dictionary Writing on the DEB Platform. In DWS 2006: Proceedings of the Fourth International Workshop on Dictionary Writing Systems, pages 17–23. Lexical Computing Ltd., Torino, Italy.

[Huang et al., 2004] Huang, Z., Chen, H., and Zeng, D. (2004). Applying Associative Retrieval Techniques to Alleviate the Sparsity Problem in Collaborative Filtering. ACM Trans. Inf. Syst., 22(1):116–142.

[Janssen, 2001] Janssen, T. M. V. (2001). Frege, contextuality and compositionality. J. of Logic, Lang. and Inf., 10(1):115–136.

[Johnson-Laird and Miller, 1976] Johnson-Laird, P. N. and Miller, G. A. (1976). Language and Perception. The Belknap Press of Harvard University Press, Cambridge, Massachusetts.

[Kelly, 2007] Kelly, K. (2007). Kevin Kelly’s cool tools – 20q. [Retrieved on-line from http://www.kk.org/cooltools/archives/000725.php; accessed 30-October-2007].

[Kilgarriff et al., 2004] Kilgarriff, A., Rychlý, P., Smrž, P., and Tugwell, D. (2004). The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress, pages 105–116.

[Kleb and Abecker, 2010] Kleb, J. and Abecker, A. (2010). Entity Reference Resolution via Spreading Activation on RDF-Graphs. In The Semantic Web: Research and Applications, volume 6088/2010, pages 152–166, Berlin, Heidelberg. Springer.

[Lafourcade and Joubert, 2009] Lafourcade, M. and Joubert, A. (2009). Similitude entre les sens d'usage d'un terme dans un réseau lexical. Traitement Automatique des Langues (TAL), 50. [unpublished].

[Lenat, 1995] Lenat, D. B. (1995). CyC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

[Levin, 1993] Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL.


[Lieberman, 2002] Lieberman, H. (2002). MAS.964 Common Sense Reasoning for Interactive Applications. [Retrieved on-line from http://ocw.mit.edu; accessed 11-January-2006]. License: Creative Commons BY-NC-SA.

[Lieberman, 2007] Lieberman, H. (2007). Back into equilibrium: Balancing the ordinary and the extraordinary. Out of Balance: New Frontiers in Science, Art and Thought.

[Lieberman et al., 2004] Lieberman, H., Liu, H., Singh, P., and Barry, B. (2004). Beating Common Sense into Interactive Applications. AI Magazine, 25:63–76.

[Lieberman and Selker, 2000] Lieberman, H. and Selker, T. (2000). Out of Context: Computer Systems that Adapt to, and Learn from, Context. IBM Syst. J., 39(3-4):617–632.

[Liu, 2003] Liu, H. (2003). Unpacking Meaning from Words: A Context-centered Approach to Computational Lexicon Design. In Proc CONTEXT 2003, pages 218–232, Berlin. Heidelberg. Springer-Verlag.

[Liu and Singh, 2004] Liu, H. and Singh, P. (2004). Commonsense Reasoning in and over Natural Language. In Proceedings of the 8th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES-2004), Berlin. Heidelberg. Springer-Verlag.

[Materna and Pala, 2010] Materna, J. and Pala, K. (2010). Using Ontologies for Semi-automatic Linking VerbaLex with FrameNet. In Proceedings of LREC, pages 3331–3337.

[McNamara, 2005] McNamara, T. P. (2005). Semantic Priming: Perspectives from Memory and Word Recognition. Psychology Press, New York. Hove.

[Minsky, 1986] Minsky, M. (1986). The Society of Mind. Simon & Schuster, Inc., New York, NY, USA.

[Minsky, 2006] Minsky, M. (2006). The Emotion Machine. Simon & Schuster, Inc., New York, NY, USA.

[Mueller, 2000a] Mueller, E. T. (2000a). A Calendar with Common Sense. In IUI ’00: Proceedings of the 5th international conference on Intelligent user interfaces, pages 198–201, New York, NY, USA. ACM.


[Mueller, 2000b] Mueller, E. T. (2000b). Making News Understandable to Computers. CoRR, cs.IR/0003001.

[Mueller, 2002] Mueller, E. T. (2002). ThoughtTreasure, the Hard Common Sense Problem, and Applications of Common Sense. [Retrieved on-line from http://ocw.mit.edu; accessed 21-July-2010]. License: Creative Commons BY-NC-SA.

[Mueller, 2003] Mueller, E. T. (2003). ThoughtTreasure: A natural language/commonsense platform. [Retrieved on-line from http://alumni.media.mit.edu/~mueller/papers/tt.html; accessed 9-November-2009].

[Nevěřilová, 2005] Nevěřilová, Z. (2005). Visual Browser: A Tool for Visualising Ontologies. In Proceedings of I-KNOW'05, pages 453–461, Graz, Austria. Know-Center in coop. with Graz Uni, Joanneum Research and Springer Pub. Co.

[Nevěřilová, 2005] Nevěřilová, Z. (2005). Vizuální lexikon. Master's thesis, Masaryk University in Brno.

[Nevěřilová, 2009] Nevěřilová, Z. (2009). Exploring and Extending Czech WordNet and VerbaLex. In Proceedings of the RASLAN Workshop 2009, Brno. Masaryk University.

[Nevěřilová, 2010] Nevěřilová, Z. (2010). Semantic Role Patterns and Verb Classes in Verb Valency Lexicon. In Proceedings of 13th International Conference on Text, Speech and Dialogue TSD 2010, Brno. Masaryk University.

[Nevěřilová, 2010a] Nevěřilová, Z. (2010a). Implementing Dynamic Visualization as an Alternative Interface to a Digital Mathematics Library. In Towards a Digital Mathematics Library, Proceedings, pages 63–68, Brno.

[Nevěřilová, 2010b] Nevěřilová, Z. (2010b). X-plain – a Game that Collects Common Sense Propositions. In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science NLPCS 2010, pages 47–52, Portugal.

[Pala, 1999] Pala, K. (1999). Zásady pozitivní komunikace. In Švandová, B. and Jelínek, M., editors, Argumentace a umění komunikovat, page 75. PedF MU Brno.


[Pustejovsky et al., 2004] Pustejovsky, J., Hanks, P., and Rumshisky, A. (2004). Automated induction of sense in context. In COLING, pages 924–931, Geneva, Switzerland.

[Rudowsky, 2004] Rudowsky, I. (2004). Intelligent Agents – A Tutorial. In Proceedings of the Americas Conference on Information Systems, pages 4588–4595, New York, New York.

[Ruppenhofer et al., 2006] Ruppenhofer, J., Ellsworth, M., Petruck, M. R. L., Johnson, C. R., and Scheffczyk, J. (2006). FrameNet II: Extended Theory and Practice. Technical report, ICSI.

[Russell et al., 1996] Russell, S. J., Norvig, P., Candy, J. F., Malik, J. M., and Edwards, D. D. (1996). Artificial intelligence: a modern approach. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

[Sandhaus, 2008] Sandhaus, E. (2008). The New York Times Annotated Corpus.

[Schank and Abelson, 1977] Schank, R. C. and Abelson, R. P. (1977). Scripts, Plans, Goals, and Understanding: An Inquiry Into Human Knowledge Structures (Artificial Intelligence). Lawrence Erlbaum Associates, 1 edition.

[Schank et al., 1973] Schank, R. C., Goldman, N., Rieger, C. J., and Riesbeck, C. (1973). MARGIE: Memory, Analysis, Response Generation, and Inference on English. In IJCAI'73: Proceedings of the 3rd international joint conference on Artificial intelligence, pages 255–261, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[Schonfeld, 2010] Schonfeld, E. (2010). Siri's iPhone App Puts A Personal Assistant In Your Pocket. TechCrunch. [Retrieved on-line from http://techcrunch.com/2010/02/04/siri-iphone-personal-assistant/; accessed 10-August-2010].

[Smith, 1995] Smith, B. (1995). Formal Ontology, Common Sense and Cognitive Science. International Journal of Human-Computer Studies, pages 641–667.

[Speer et al., 2009] Speer, R., Krishnamurthy, J., Havasi, C., Smith, D., Lieberman, H., and Arnold, K. (2009). An Interface for Targeted Collection of Common Sense Knowledge using a Mixture Model. In IUI '09: Proceedings of the 13th international conference on Intelligent user interfaces, pages 137–146, New York, NY, USA. ACM.


[Sternberg, 2002] Sternberg, R. J. (2002). Kognitivní psychologie. Portál.

[Thomasson, 2009] Thomasson, A. (2009). Categories. In Stanford Encyclopedia of Philosophy. [Retrieved on-line from http://plato.stanford.edu/entries/categories/; accessed 21-August-2010].

[Turing, 1950] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59:433–460.

[Vickrey et al., 2008] Vickrey, D., Bronzan, A., Choi, W., Kumar, A., Turner-Maier, J., Wang, A., and Koller, D. (2008). Online Word Games for Semantic Data Collection. In EMNLP ’08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 533–542, Morristown, NJ, USA. Association for Computational Linguistics.

[von Ahn, 2006] von Ahn, L. (2006). Games with a Purpose. Computer, 39(6):92–94.

[von Ahn et al., 2006] von Ahn, L., Kedia, M., and Blum, M. (2006). Verbosity: a game for collecting common-sense facts. In CHI ’06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 75–78, New York, NY, USA. ACM.

[Vossen, 1998] Vossen, P. (1998). EuroWordNet – A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht.

[Wasserman, 1985] Wasserman, K. (1985). Physical Object Representation and Generalization: A Survey of Programs for Semantic-Based Natural Language Processing. AI Magazine, 5(4):28–42.

[Wolfram|Alpha, 2010] Wolfram|Alpha (2010). Wolfram alpha. [Retrieved on-line from http://www.wolframalpha.com/; accessed 3-August-2010].

Appendix

Publications

1. Nevěřilová, Z. (2009). Exploring and Extending Czech WordNet and VerbaLex. In Proceedings of the RASLAN Workshop 2009, Brno. Masaryk University.

2. Nevěřilová, Z. (2010). Implementing Dynamic Visualization as an Alternative Interface to a Digital Mathematics Library. In Towards a Digital Mathematics Library, Proceedings, pages 63–68, Brno.

3. Nevěřilová, Z. (2010). X-plain – a Game that Collects Common Sense Propositions. In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science NLPCS 2010, pages 47–52, Portugal.

4. Nevěřilová, Z. (2010). Semantic Role Patterns and Verb Classes in Verb Valency Lexicon. In Proceedings of the 13th International Conference on Text, Speech and Dialogue TSD 2010, pages 147–153, Brno.

Exploring and Extending Czech WordNet and VerbaLex

Zuzana Nevěřilová

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic, [email protected]

Abstract. This paper presents the usage of two major linguist-made lexical resources for the Czech language: WordNet and VerbaLex. First, a conversion to RDF was made. Afterwards, a Prolog program was used to analyse Czech language inputs. In the second part of the article an extension to the current VerbaLex is proposed. Possible pitfalls are discussed. In the conclusion, we emphasize the side effect of this work: important feedback for the authors and administrators of both lexical resources.

Key words: VerbaLex, WordNet, semantic analysis, RDF, Prolog

1 Introduction

Since 2005 a database of verb valency frames has been created. This database, VerbaLex [5], has the form of a frame-based lexical resource: it consists of verb valency frames with slots. Each slot contains two levels of semantic information:
– semantic role, such as agent, patient, instrument
– value restriction in the form of the bottommost hypernym, specified by a literal and a sense number in Princeton WordNet [4] (e.g. person:1)
Czech WordNet (CZWN) started as part of the EuroWordNet [10] project in 1998 and it is still being actively developed. VerbaLex and CZWN are two large linguist-made resources for the Czech language. These resources can be and are expected to be used together thanks to the fact that in CZWN the IDs of synsets are linked to their translations in Princeton WordNet. This article shows how these resources can be used for semantic analysis of sentences and proposes an extension that can add background knowledge to these sentences. This background knowledge is considered necessary for semantic discourse analysis [9]. For verb frame identification, semantic role assignment and subsequent inference, SWI-Prolog and RDF were used. In the experiments we deliberately omit syntactic analysis of the sentences and use only the base form of nouns (singular nominative). We expect that syntactic analysis could improve the results notably. In practice, the intersection of our results and those of syntactic analysis will be used.

2 Data Formats and the Program

Both CZWN and VerbaLex are stored in their own formats in the form of XML. For the purpose of their connection and inference, both data sources were converted to RDF [3] (in the form of XML). The conversion was done through XSL templates, since this is portable and easy to maintain (in case of slight changes in the structure of the XMLs). The conversion does not cover all aspects of VerbaLex or CZWN. To keep the size of the data reasonable, some features such as examples, human-readable definitions etc. were omitted. In VerbaLex there is no ID for a frame, so during the conversion one is added for each verb frame. The ID consists of one of the lemmata (where Czech accents are replaced by capitals), the sense number and the frame number (generated during the conversion). The ID has the form of a URI according to the RDF specification [7]. After experiments with RDF reasoners, Prolog with the rdf_db module was chosen for inference. The advantages of this solution are:

– Prolog is able to work with large data. VerbaLex comes with more than 212 000 RDF triples, CZWN with nearly 100 000.
– It is possible to insert inference rules into the program and not into the data. The most resource-consuming relation is hyperonymy, because it is a transitive relation. Since RDF is not able to handle transitivity, it would be necessary to use some kind of OWL [8], accompanied by an enormous increase in the number of RDF triples. Hyperonymy is handled in the Prolog program and thus the number of RDF triples stays fixed (a minimal sketch follows this list).
– With an appropriate Prolog module, a web interface can be made straightforwardly.
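A minimal sketch of this arrangement in SWI-Prolog follows (the file names and the namespace/property name for hyperonymy are assumptions made for illustration); the transitive closure is computed at query time, so the converted data stays small:

:- use_module(library(semweb/rdf_db)).

:- rdf_register_prefix(czwn, 'http://nlp.fi.muni.cz/czwn#').

% load the converted resources (file names are assumptions)
load_resources :-
    rdf_load('verbalex.rdf'),
    rdf_load('czwn.rdf').

% direct hyperonymy is a plain RDF triple ...
hypernym_of(Hyponym, Hypernym) :-
    rdf(Hyponym, czwn:hypernym, Hypernym).
% ... and transitivity is a rule instead of extra triples
hypernym_of(Hyponym, Hypernym) :-
    rdf(Hyponym, czwn:hypernym, Middle),
    hypernym_of(Middle, Hypernym).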

3 Finding Semantics through Verb Frames

Since this work does not concern syntactic analysis, almost no grammatical information is available for the analysis. The input is simple: the verb and a list of nouns in their base form (singular nominative). In our analysis of a sentence, we can identify 3 kinds of bearers of the meaning:

– nouns occurring in the sentence identify hypernyms occurring in the verb frame
– semantic roles that the nouns play
– the verb frame structure, especially the number, semantic role and occupancy of other slots

The output contains the ID of a verb frame and the nouns of the list with their semantic roles assigned:

?- find_roles('přicestovat', ['ministr', 'zastávka'], FrameID, Roles).
FrameID = 'http://nlp.fi.muni.cz/verbalex#pRicestovat_1_2',
Roles = [ (ministr, 'AG', kdo1, obl), (zastávka, 'LOC', čeho2, opt)] ;

The input: verb přicestovat (arrive) and the nouns ministr (minister) and zastávka (station). The resulting role assignment: minister as AG(ent) and nominative animate (kdo1), obl(igatory) value of the slot, station as LOC(ation) inanimate genitive (čeho2), opt(ional).

3.1 Features, Problems and Solutions

The result of the analysis brings the following advantages:

– appropriate verb meaning recognition
– frame identification
– semantic role assignment
– grammatical information (cases)

It is necessary to keep in mind that the result is a set. In the case above, this set has only one element. The following problems can occur during the analysis:
– verb not found in VerbaLex. This is not expected to occur often, since VerbaLex contains 19 360 valency frames for more than 10 000 verbs [6]. But if this case occurs, the analysis brings no result.
– word from the list not found in CZWN. This occurs in almost every sentence, since CZWN is much smaller than Princeton WordNet. Moreover, it does not contain proper names at all. The instant solution is to take subsets of the input set and try to assign as many nouns as possible (sketched below). A long-term solution consists of improving CZWN and using other resources for proper names.
– no suitable frame for the list of words. In VerbaLex, only common use is encoded. In some cases, language users do not follow the common use. This occurs rarely. Most often there are words not related to the verb (e.g. parts of noun phrases) or nouns contained in adverbial phrases. The solution is again to take subsets of the input set.
– no suitable hypernym for a word. This turned out to be the most difficult problem. It seems that there is not much consensus about the bottommost hypernyms in frame slots. For example, the verb koupit (to buy) has the OBJ(ect) slot value goods:1. But the object of buying can be almost any object or even an animal. Thus it seems that the value of the OBJ slot should be object:1. In this case the verb frame will not offer much information.
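A minimal sketch of the subset fallback mentioned above, built on top of the find_roles/4 predicate described earlier (subset_of_length/3 is a helper introduced only for this illustration); larger subsets of the noun list are tried first, so that as many nouns as possible keep a role assignment:

% try progressively smaller subsets of the noun list until the analysis succeeds
find_roles_subset(Verb, Nouns, FrameID, Roles) :-
    length(Nouns, N),
    between(0, N, Dropped),
    Keep is N - Dropped,                 % largest subsets first
    subset_of_length(Nouns, Keep, Subset),
    find_roles(Verb, Subset, FrameID, Roles).

% subsets of a given length, preserving the original order of the nouns
subset_of_length(_, 0, []).
subset_of_length([Noun|Rest], K, [Noun|Subset]) :-
    K > 0, K1 is K - 1,
    subset_of_length(Rest, K1, Subset).
subset_of_length([_|Rest], K, Subset) :-
    K > 0,
    subset_of_length(Rest, K, Subset).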

4 Proposed Extension of VerbaLex

VerbaLex is a frame-based lexical resource. Like other resources, such as FrameNet [2], it contains slots describing typical situations (in this case noun phrases related to the verb), with restrictions on their values (in this case WordNet hypernyms). Contrary to FrameNet, VerbaLex frames are not related to each other; there is no hierarchy among the frames.

According to [1] it makes sense that frame information should be inherited through a type hierarchy. A frame-based representation can also be used to encode additional information not mentioned in the sentences. This underlying knowledge is believed to be very useful in interpreting language. In particular, knowledge about causality is very important. Frame-based knowledge representations consist at least of:
– preconditions
– effects
– decompositions
FrameNet, as a representative of large frame-based resources, contains even more types of relations. The proposed extension rests in introducing these three relations to the frames. The Prolog program was extended so that it supports inference rules. These inference rules have the form of another RDF document (encoded in XML) and are related to VerbaLex through RDF IDs. The only information contained in a rule is: the type of relation (precondition, effect, decomposition), the relation to another frame and a mapping between the slots.

In the inference-rule XML for this example (not reproduced here), the effect of přicestovat (arrive) is nacházet se (be located). The mapping is done from AG(ent) to ENT(ity) and from LOC(ation) to another LOC(ation). Note that in the example above the grammatical change occurs on the basis of VerbaLex information; no other information is needed in the inference rule. With these data the program is able to output:

?- find_effect('přicestovat', ['ministr', 'zastávka'], FrameID, Roles).
FrameID = 'http://nlp.fi.muni.cz/verbalex#nachAzet_se_1_1',
Roles = [ (ministr, 'ENT', kdo1, obl), (zastávka, 'LOC', čem6, opt)] .

The input: verb přicestovat (arrive) and the nouns ministr (minister) and zastávka (station). With the inference rule that přicestovat (arrive) has the effect of nacházet se (be located): minister as ENT(ity) and nominative animate (kdo1), obl(igatory) value of the slot, station as LOC(ation) inanimate locative (čem6), opt(ional).
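For illustration only (the thesis stores the inference rules in RDF/XML; the Prolog representation below is a simplified stand-in), the same effect rule and its application can be sketched on top of find_roles/4:

% one effect rule: source frame, target frame and the mapping between roles
effect_rule('http://nlp.fi.muni.cz/verbalex#pRicestovat_1_2',
            'http://nlp.fi.muni.cz/verbalex#nachAzet_se_1_1',
            ['AG'-'ENT', 'LOC'-'LOC']).

% analyse the sentence, look up an effect rule and rename the roles;
% the grammatical cases of the target frame (e.g. čem6) are omitted in this
% sketch; in the real program they come from the target frame's slots
find_effect(Verb, Nouns, TargetFrame, TargetRoles) :-
    find_roles(Verb, Nouns, SourceFrame, SourceRoles),
    effect_rule(SourceFrame, TargetFrame, Mapping),
    findall(Noun-NewRole,
            ( member((Noun, Role, _, _), SourceRoles),
              member(Role-NewRole, Mapping) ),
            TargetRoles).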

In addition to the features mentioned above, the result of the inference brings the following:

– new frame identification
– change of role assignment (AG → ENT)
– change of grammatical information (čeho2 → čem6)

4.1 Problems and Solutions

The main problem of this extension is how to build the set of inference rules effectively. The proposal is to group verbs according to the structure of their frames and assign rules depending on which group each verb joins. For example, a LOC(ation) slot with genitive indicates that one of the role representants (either AG(ent) or PAT(ient)) changes LOC(ation). In most cases, s/he either starts or stops being placed in that LOC(ation). Verbs fulfilling this structure are the verbs of motion [5] such as dorazit, přicestovat (arrive), dojíždět (commute), the verbs of sending and carrying such as cpát (crowd), and verbs of spatial configuration such as klesat, svažovat se (slope down). This grouping can lead to a semi-automatically created set of inference rules.

4.2 Introducing New Entities and New Roles to the Discourse

According to [1], knowledge about usual situations in which actions occur is useful for language interpretation. Moreover, if these situations are defined, the knowledge reveals new objects that do not have to be mentioned, but exist in the discourse. For example, buying something involves four objects: the buyer, the seller, the object and an amount of money. Even if the money is not mentioned in the discourse, it is contained in it. The decomposition of buying is:

– buyer gives money to seller
– seller gives object to buyer

Moreover, agents in the discourse can play new roles. Every living person can be a buyer or a seller, but during the act of buying, the AG(ent) has the role of buyer (the buyer is not a new entity in the discourse, but a new role of an entity previously mentioned). In future work we will concentrate on encoding these new entities and roles into inference rules so that they can be used in discourse semantic analysis.
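A minimal sketch of how such a decomposition could be written down (the predicate names and the representation are hypothetical, introduced only for this illustration); the rule makes the implicit participants (seller, money) and the new role (buyer) explicit:

% hypothetical decomposition of "buy": the buyer and the object come from the
% analysed sentence, the seller and the money are new entities in the discourse
decomposition(buy(Buyer, Object),
              [ gives(Buyer, Money, Seller),     % buyer gives money to seller
                gives(Seller, Object, Buyer) ],  % seller gives object to buyer
              [ new_role(Buyer, buyer),
                new_entity(Seller, seller),
                new_entity(Money, amount_of_money) ]).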

5 Conclusion

We have introduced a Prolog program that is able to analyse the verb and nouns occurring in a sentence. The analysis acquires the following information:

– valency frame identification
– semantic role assignment
– grammatical information

We have proposed an extension to VerbaLex that can imply new propositions. The main problem is how to build an appropriate set of rules. With this extension we can even introduce new objects into the discourse or assign new roles to the agents previously mentioned. This background knowledge is believed to be useful for language interpretation. A side effect of this analysis is that, when run on corpus sentences, it offers important feedback to the authors and administrators of VerbaLex and CZWN; namely, the choice of the bottommost hypernym in VerbaLex slots can be checked.

References

1. James Allen. Natural Language Understanding (2nd ed.). Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1995.
2. Collin F. Baker and Charles J. Fillmore. FrameNet, 2009. [Online; accessed 30-July-2009].
3. Dave Beckett and Brian McBride. RDF/XML syntax specification, February 2004.
4. Christiane Fellbaum. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, May 1998.
5. Dana Hlaváčková. Databáze slovesných valenčních rámců VerbaLex. Master's thesis, Masarykova univerzita, Filozofická fakulta, Ústav českého jazyka, 2007.
6. Dana Hlaváčková. Počet lemmat v synsetech VerbaLexu. In After Half a Century of Slavonic Natural Language Processing, Brno, Czech Republic, 2009. Tribun EU.
7. Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF) model and syntax specification, 1999.
8. Deborah L. McGuinness and Frank van Harmelen. OWL Web Ontology Language Overview.
9. Teun A. van Dijk. Semantic discourse analysis. In Handbook of Discourse Analysis: Dimensions of Discourse, volume 2, London, 1985. Academic Press.
10. Piek Vossen. EuroWordNet – a multilingual database with lexical semantic networks, 1998.

Implementing Dynamic Visualization as an Alternative Interface to a Digital Mathematics Library

Zuzana Nevěřilová

Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic, [email protected]

Abstract. This paper presents an alternative interface for browsing in the Czech Digital Mathematics Library (DML-CZ) using our Visual Browser web browsing tool. Using dynamic visualization, we have created a tool for browsing the library graphically. Visualization can help users orient themselves in complex data and at the same time reveal sometimes unexpected relationships among units; it at least speeds up browsing. This work follows the metadata processing undertaken on DML-CZ and visualizes all reasonable and useful relationships among journals, issues, articles, authors, classification, keywords, references and similar articles. We converted metadata to RDF and use a Visual Browser Java Applet that runs in a web browser. We describe briefly the metadata nature, then the server and client side of the visualization including data formats and conversions. There follows a description of the interaction between visual and textual interfaces.

1 Introduction

This paper presents a dynamic visual interface for browsing the Czech Digital Mathematics Library (DML-CZ) as an alternative to a textual listing. We are offering the interface to the ongoing EuDML project1. The DML-CZ currently contains more than 28,000 articles in 11 journals, 5 proceedings series and 28 monographs [5]. Users usually do not browse within such a vast amount of data; rather, they search for titles or authors. On the search results page users can see the number of search results and the list of articles. When clicking on an article, the information listed below is shown:

– bibliographic information about the article (author, title, serial, year, Mathematics Subject Classification (MSC) [6], . . . )
– preview of the article and link to the PDF
– link to similar articles
– references with links to articles where possible.

1 The European Digital Mathematics Library – http://www.eudml.eu/

The great strength of the DML-CZ interface is that it finds similar articles in search results. Three methods for calculating similarities are used [11] and the percentages are expressed graphically. This is so far the only information that is visualized. Nevertheless, according to [10] a good visualization helps accelerate the cognitive process, since the eyes can pick up details of the visualization and keep a holistic overview at the same time. Visualization is most suitable for complex and relatively sparse data, and this is precisely the case of library data. Google has started to offer a graphical interface for search results in addition to the standard view: their so-called Wonder wheel has both plain text and timelines. Information seekers who tend to use it are likely to appreciate it beyond Google searches. The structure of the paper is as follows. In Section 2 we describe the server side including the data formats provided by the server. Section 3 briefly describes the Visual Browser and shows the interaction between the Visual Browser and the textual listing on the web page. Section 4 contains both the conclusion and the future development that the dynamic visual interface may undergo.

2 Server Side

Since the amount of data in DML-CZ is very large, a client-server architecture is the most suitable. The server has to store the data, provide a method for its retrieval and quickly return a small amount of the data requested.

2.1 Data Formats

Because the client side uses RDF [1], the server also has to provide this format. We had to convert the existing XML format of metadata to RDF. This conversion required the following steps:
– selecting only the appropriate data for visualization (some information is omitted)
– assigning IDs to articles, issues, journals and authors
– adding short titles for the visualization
– conversion of the lang attribute according to RFC 3066 sec. 2.3 [9]
– adding information about similar articles
– adding MSC labels.

2.2 RDF Server

The Joseki RDF Server2 was used. It offers SPARQL [8] as a query language. Joseki was selected because of the Jena Framework3 used in the client. Nevertheless, the server side can be substituted by any other RDF server if needed. The data is stored in a relational database.

2 http://joseki.sourceforge.net/
3 http://jena.sourceforge.net/
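For illustration only (the prototype client is a Java applet; the dml: vocabulary, the endpoint location and the predicate below are hypothetical), a SPARQL SELECT against such an endpoint can be issued, for instance, from SWI-Prolog's sparql_client library:

:- use_module(library(semweb/sparql_client)).

% titles of articles by a given author; results come back as row/1 terms
article_title_by_author(AuthorName, Title) :-
    format(atom(Query),
           'PREFIX dml: <http://example.org/dml#> \c
            SELECT ?title WHERE { \c
              ?article dml:author ?person . \c
              ?person dml:name "~w" . \c
              ?article dml:title ?title }', [AuthorName]),
    sparql_query(Query, row(Title),
                 [ host(localhost), port(2020), path('/dml/query') ]).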

3 Client Side

On the client side two interfaces are used: a traditional textual interface (a list of authors and articles) and the Visual Browser [7]. The latter is a tool for the dynamic (animated) visualization of RDF graphs. It provides flexible visualization thanks to the two-layer architecture:

– first layer: the data stored in RDF (whether in RDF/XML, N3 [2] or Turtle [4]);
– second layer: perspective of view, an XML description of the graphic representation of nodes and edges of the graph.

The visualization of different types of data is described below. The Visual Browser exists either as a standalone Java application or as a Java applet. The applet can communicate with textual parts of the search results page. The interaction between the Java applet and the web page was made through AJAX4 plus JavaScript to communicate with the applet. Submitting the search field or browsing data in one of the interfaces results in a SPARQL query. The server evaluates the query and returns an RDF graph. An XSLT [3] conversion is made and the result is returned as a list of authors and titles. The communication between the applet and the web page is bi-directional: clicking on a name or title in the list renders a set of nodes and edges in the visual interface, and a set of nodes and edges can be displayed as a list of authors and articles. We expect users to type (part of) a name or title in the search box. Then users can browse either the more familiar textual interface (as they are used to), or the visual one. Conversely, when viewing a particular subgraph in the Visual Browser, users can click to have it appear in the textual interface as seen in Figure 1.

Fig. 1. Visual and textual interfaces to search results within DML-CZ. It allows users to choose how they browse the results. For this purpose, it is necessary that the interfaces are able to communicate.

4 Asynchronous JavaScript And XML

In this visualization, nodes represent units such as authors, articles, issues, journals, keywords and MSC classes. Different classes of units are represented by different colours and shapes. The mapping from logical entities to their visual attributes is fully configurable in Visual Browser. Edges represent authors and their articles, articles in issues, issues in journals, as well as links between similar articles. Some of these relations are structural (e.g. articles in issues), some are semantic (e.g. classification of articles), and some have both aspects (authors of articles). We have to evaluate users' behaviour to decide what types of relations are useful for browsing. Even though we expect semantic relations to be more important than the structural ones, we display both types of relations. Similarly to the visualization of nodes, the appearance of an edge (its colour, shape and length) distinguishes different classes of edges. Thanks to small internal windows that open when the mouse is held over a node, short texts can be displayed. This can be helpful when displaying titles or even abstracts as seen in Figure 2.

Fig. 2. Visualization of texts: when the mouse is held over a node, more information is shown

The current interface for DML-CZ provides information about semantically similar articles. Similarities have been pre-calculated using three different methods [11]. Similar articles are connected with edges of different lengths; the shorter the edge, the more similar two articles are (see Figure 3). Scientific articles usually cite other sources. These citations (references) are related to a topic mentioned in the article and therefore they are useful for users who have already read the article and are looking for further reading. The required state will be that users can browse references to articles regardless of the repository these articles come from. Achieving a high coverage of at least articles' metadata is one of the major goals of the EuDML project.

Fig. 3. Visualization of similarities: the length of the edge is also a bearer of meaning; with edge labels displayed, one can also see similarities expressed by numbers

4 Conclusion and Future Work

We have presented an alternative to the current DML-CZ interface. Visual interfaces are more attractive and can help orientation in complex data such as library records. So far it is experimental, but we plan to include it in the official DML-CZ site and offer it to the EuDML project. Future work comprises monitoring users' preferences regarding the interfaces and their possible feedback. We expect that users will need time to get used to the visual interface, since it is far from the traditional way of browsing. But we hope that users will appreciate the holistic overview of complex information. Our immediate plans include working on the design of the search result interfaces. For this, users' feedback will be necessary. We also have to test the RDF Server on the significant loads that are expected within DML-CZ and EuDML. These conditions seem necessary for usability within EuDML. A working prototype can be seen at

http://nlp.fi.muni.cz/~xpopelk/dml/VBApplet.html.

Acknowledgments

This research has been partially supported by the grant registration no. 1ET200190513 of the Academy of Sciences of the Czech Republic (DML-CZ), and by EU project # 250503 in CIP-ICT-PSP.2009.2.4 (EuDML).

References

1. Beckett, D. and McBride, B.: RDF/XML Syntax Specification (February 2004), http://www.w3.org/TR/rdf-syntax-grammar/

2. Berners-Lee, T.: Notation 3 (2008), http://www.w3.org/DesignIssues/Notation3
3. Clark, J.: XSL Transformations (XSLT) (1999), http://www.w3.org/TR/xslt
4. David Beckett, T. B.-L.: Turtle – Terse RDF Triple Language (2008), http://www.w3.org/TeamSubmission/turtle/
5. DML: Czech digital mathematics library – news (2010), http://dml.cz/news, retrieved April 20, 2010 from http://dml.cz/news
6. Ion, P. and Eilbeck, C.: Mathematics Subject Classification 2010 (2010), http://msc2010.org
7. Nevěřilová, Z.: Visual Browser: A Tool for Visualising Ontologies. In: Proceedings of I-KNOW'05, pp. 453–461. Know-Center in coop. with Graz Uni, Joanneum Research and Springer Pub. Co., Graz, Austria (2005)
8. Prud'hommeaux, E. and Seaborne, A.: SPARQL Query Language for RDF (2008), http://www.w3.org/TR/rdf-sparql-query
9. RFC3066: Tags for the Identification of Languages (January 2001), http://potaroo.net/ieft/idref/rfc3066
10. Tufte, E.: Envisioning Information. Graphics Press (1990)
11. Řehůřek, R. and Sojka, P.: Automated Classification and Categorization of Mathematical Knowledge. In Autexier, S., Campbell, J., Rubio, J., Sorge, V., Suzuki, M., Wiedijk, F. (eds.): Intelligent Computer Mathematics – Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008. Lecture Notes in Computer Science LNCS/LNAI, vol. 5144, pp. 543–557. Springer-Verlag, Berlin, Heidelberg (July 2008)

X-plain – a Game that Collects Common Sense Propositions

Abstract. Common sense knowledge is very important for some NLP tasks, but it is hard to extract from existing linguistic resources. Thus specialized collections of common sense propositions are created. This paper presents one of the ways of making such a collection for the Czech language. We have created a cooperative game where a computer program plays together with a human. The purpose of the game is to describe a word with short sentences to the co-player. While the human player is expected to use his/her common sense, the computer program uses word sketches. The paper describes the game in detail, its background, and discusses the need for motivation and game policy. It also discusses the quality and coverage of the collection.

1 Introduction

Common sense knowledge is considered crucial for some NLP tasks. In principle there are two approaches to collecting common sense data: collections made by experts and collections made by volunteers. Both approaches, and the many variants between them, differ in several aspects such as cost, quality and coverage. We present a project where common sense propositions are collected by means of a game. In this article a Czech version of the game is presented; however, the principle can be used for other languages as well. The game presented in this paper is named X-plain. Players are at the same time contributors to the database of common sense propositions. Section 2 describes common sense and explains the need for collecting it. In section 3 we describe the principle of the game. Section 4 describes in detail how the computer program can play together with a human. In section 5 we discuss the quality of the collected data and the contribution policy. We have to expect that the database will be error prone and that different contributors have different reliability. We propose future work in section 6.

2 Common Sense and How to Collect it

Common sense is often described as a huge set of processes of natural cognition and a system of beliefs that people share. Common sense does not always correspond to scientific or even real-world observation; rather, it is a set of assumptions about the real world [Smith, 1995]. Inherently, common sense propositions are not easy to collect. Therefore specialized collections of common sense exist. Well-known projects include CyC [Lenat, 1995], ThoughtTreasure [Mueller, 2003] (expert-made) or the Open Mind Common Sense Initiative [Stork, 2007] (volunteer-made). The game Verbosity [von Ahn et al., 2006] proposes another way of collecting common sense propositions. All mentioned projects contain mainly data in the English language. This paper reports on a game similar to Verbosity, but with a different engine. Its main purpose is to create a collection of common sense propositions in the Czech language.

3 Game Principle

X-plain has an analogy in board games such as "Taboo™". It is a cooperative game for two players. The principle is that a random word (called the secret word) is displayed to one player (narrator) and s/he has to explain it to the second player (guesser). The guesser has to say (or write down) the exact word. In X-plain the guesser tries to guess the word with an apparently unlimited number of tries. When s/he is successful, the score is increased and next turn the roles swap. When the narrator is not able to describe the secret word or the guesser is not able to reveal it, they can pass on the word. Next turn the roles swap but the score stays unchanged. The game is time limited to 3 minutes. In X-plain there are different relation types that together with the secret word and the object make sentence templates, e.g. X is_kind_of Y. Currently there are the following relation types:
– can_have_property
– has_part
– is_part_of
– is_a_type_of
– is_used_for
– can_be_used_for
– can_be_likely_found
– is_the_opposite_of
– is_similar_to
– is_related_to
– using

At first, relations were selected according to Verbosity. Afterwards the list was adapted to Sketch Engine outputs (see subsection 4.2). These relations are considered easy to understand; however, it seems that players attach significant importance to secret words and objects and not to relations, see section 5. Secret words were selected randomly; one-meaning words are preferred. The list is continuously adapted, as the words are examined by human players and the Sketch Engine ("difficult words are rejected"), see section 5.

4 Game Background

There is a significant difference between X-plain and Verbosity: in Verbosity two human players (chosen randomly from on-line players) play together, whereas in X-plain a human plays with a computer program. The program has to take the role of the second player. The program's "knowledge" is based upon two resources: previous contributions and word sketches. X-plain is a web-based application whose server side is programmed in PHP1. The client side uses JavaScript and AJAX2 for better comfort. Thus players do not have to install special software. Contributions from human narrators are stored in a MySQL3 database in the form of a triple (subject, relation, object) together with its number of occurrences. Hints given by the computer program are not stored because they result from the database itself or from the Sketch Engine (see subsection 4.1).

Fig. 1. Screenshot (part) from X-plain: the narrator (human) has to describe the word "kometa" (comet). On the left s/he has to fill the following sentence templates: se skládá z (has part); je součástí (is part of); je druh (is a type of); je určená pro/k/na (is used for); se nejčastěji nachází blízko/v/na (can be likely found). S/he types: ". . . se skládá z ohonu" (. . . has part tail). On the right the guesser (computer) tries to guess the secret word: "liška" (fox), "kůň" (horse). On the top right of the screen there is a time countdown.

4.1 Word Sketches

A word sketch [Kilgarriff and Rundell, 2002] is made from a corpus using grammar patterns. It groups together words playing the same grammatical role in sentences. The Sketch Engine [Kilgarriff et al., 2004] is supplied with grammatical relations for the requested language. Grammatical relations for Czech include three types: symmetric, dual and trinary (explained in detail in [Ske, 2010]). For the Czech language the words are in grammatical relations such as:

– coord – words in coordination, typically nouns connected by conjunctions "and", "or". This relation is symmetric.
– prec_ – the word followed by and X. This relation is trinary.
– a_modifier – adjective word modifier. This relation is dual to modifies.

1 http://www.php.net
2 Asynchronous JavaScript And XML
3 http://www.mysql.com

4.2 From Grammar to Semantics

In X-plain the relations in sentence templates are semantic, but in word sketches only grammatical relations exist. Therefore, we propose a set of rules that link grammatical and semantic relations. The idea is similar to grammatical relations in the Sketch Engine: the rules are quite straightforward and the results do not tend to be perfect, but plausible. Since grammatical relations are made for each language, grammar-to-semantics rules are also language dependent. Currently there are grammar-to-semantics rules such as:

is_related_to =⇒ ["coord"]
is_part_of =⇒ ["gen_1"]
has_part_of =⇒ ["gen_2"]
can_have_property =⇒ ["a_modifier"]

The first rule is interpreted as "the relation type is_related_to relates the secret word to all words from the word sketch coord (coordination)". Similarly, reverse grammar-to-semantics rules exist. For symmetric grammatical relations these rules are equal, for dual relations the rules use the dual relation. For trinary relations with a preceding/following preposition the rules interchange the post and prec prefixes. Reverse grammar-to-semantics rules look like the following (a sketch of this mapping is given after the list):

is_related_to =⇒ ["coord"]
is_part_of =⇒ ["gen_2"]
has_part_of =⇒ ["gen_1"]
can_have_property =⇒ ["modifies"]
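A minimal sketch of these rules as Prolog facts (an illustration only, not the PHP implementation used in X-plain); the forward direction maps a semantic relation type to word-sketch grammatical relations, and the reverse direction is derived by swapping each grammatical relation for its dual:

% forward grammar-to-semantics rules: semantic relation -> word-sketch gramrels
semrel_gramrels(is_related_to,     [coord]).
semrel_gramrels(is_part_of,        [gen_1]).
semrel_gramrels(has_part_of,       [gen_2]).
semrel_gramrels(can_have_property, [a_modifier]).

% dual grammatical relations; symmetric relations are their own dual
dual(gen_1, gen_2).
dual(gen_2, gen_1).
dual(a_modifier, modifies).
dual(coord, coord).

% reverse rules are derived instead of being listed separately
reverse_semrel_gramrels(SemRel, Reversed) :-
    semrel_gramrels(SemRel, GramRels),
    maplist(dual, GramRels, Reversed).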

4.3 Use of Word Sketches in the Game

If the program plays the role of narrator, it follows the scenario:

– it looks for an object in the database of contributions (same secret word, same relation type) and creates a result set for each relation type
– it creates a word sketch based on the secret word and uses grammar-to-semantics rules to obtain sets of words for each relation type
– it chooses words randomly from these sets and presents them together with appropriate relation types to the guesser

In the guesser role, the program follows the scenario:

– it looks for a subject that fulfils the conditions (same relation type, same object) in the database of contributions and creates a result set
– it creates a word sketch based on the object and uses reverse grammar-to-semantics rules to obtain sets of words for the relation type
– it chooses words randomly from these sets and tries to guess the secret word

Example The secret word is “kometa” (comet)

– narrator (human) can choose among 5 relation types
– narrator (human) fills template: . . . souvisí s vesmírem (. . . is related to space)

relation 1            relation 2      % of occurrences
is_similar_to         is_related_to   1.08
is_similar_to         is_a_type_of    0.51
is_related_to         is_part_of      0.47
is_related_to         is_a_type_of    0.47
can_be_likely_found   is_part_of      0.47

Table 1. Relations that are used with the same subjects and objects: X relation 1 Y and X relation 2 Y

– guesser (computer) gets "hvězda" (star) as a result from the database and a set of words from word sketches: aviation, solar, astronomy, science, galaxy, human, . . . The guesser chooses the following words: astronomie (astronomy), věda (science), hvězda (star)
– narrator (human) fills template: . . . může mít vlastnost Halleyova (. . . can have property Halley's)
– guesser (computer) gets no results from the database, but one result from word sketches: "kometa" (comet). The guesser types the word "kometa" (comet) – success! (players score points)

So far human players score points 1.24 times more often than the computer. To make the computer program more successful, we can rank the results from the database and word sketches and not choose randomly but consider the frequency. On the other hand, the more successful the computer is, the fewer propositions we collect.

5 Game Policy and Quality of Contributions

A simple measure of the quality of contributions is agreement. Since collecting common sense propositions is not a scientific endeavour, we do not need to collect the "truth". All we need is the usage. Where a proposition repeats from different contributors, it means that several players think the same way about the secret word. Players are playing with a time limit, so they often write the first idea that comes to their mind. When collecting common sense propositions, this is an advantage. On the other hand, the time limit can lead to many spelling errors. In the data we have already collected (about 2200 propositions), the relation type is often misused. For example, in the database we can find records such as: X is_similar_to Y and X is_opposite_of Y. This need not be an error in all cases; however, we cannot weight the relation type the same as the secret word or the object. Table 1 shows what types of relations (the most occurring cases) are used with the same subject and object and their occurrence ratio in the whole collection. An important aspect of the collection is the coverage. We can observe that some words are passed very often with no propositions: either they are not understood by players or they are "hard" to explain. Table 2 shows words that are poorly covered and their categorization. The majority of them are abstract words and we can assume that these words are difficult to explain.

word        translation          number of unsuccessful guesses  category
zpronevěra  fraud                5                               abstract words
zkouška     exam/testing         4                               abstract words, polysemes
myslivost   woodcraft            3                               domain specific terms
nemocný     sick/invalid         3                               polysemes
vztah       relation             3                               abstract words
copyright   copyright            2                               abstract words
demokracie  democracy            2                               abstract words
důvod       reason               2                               abstract words
fanoušek    fan                  2                               other
gang        gang                 2                               other
guvernér    governor/proconsul   2                               polysemes
hrana       edge/angle/knell     2                               polysemes
lesák       woodlander           2                               domain specific terms

Table 2. Words difficult to explain for humans and their categorization. The number of unsuccessful guesses takes into account only games where the human player gives at least some clue.

6 Conclusion and Future Work

This paper describes another approach to linguistic data collection. It is designed mainly for collecting common sense propositions in the Czech language. Czech is a minor language, thus we cannot expect millions of propositions within a few months as with GWAP [von Ahn, 2006]. We are strongly interested in players' motivation. The game history is available for each game, so we can identify the words that are hard to explain (many passes, few propositions) or conversely the words that are easy to explain (best scored guesses). Further analysis should answer the question why some words are "easy" and others are not. We have to carefully choose the words for each level so that players stay motivated. As the number of players grows, we have to find automatic or semi-automatic ways to discover "hostile" contributors. Although Czech uses diacritics, some web users are used to writing without them. We have to observe the data in the long term and decide afterwards how to handle this problem. Either we can count the words without diacritics as doublets, or we can try to add diacritics. The major contribution of this work is the method of collecting common sense propositions in Czech. We have to evaluate the reliability of the collection over time. We expect that a plausible number of common sense propositions will be collected over time.

References

[Ske, 2010] Corpus querying and grammar writing for the Sketch Engine. Retrieved March 2, 2010 from http://trac.sketchengine.co.uk/wiki/SkE/CorpusQuerying.
[Kilgarriff and Rundell, 2002] Kilgarriff, A. and Rundell, M. (2002). Lexical profiling software and its lexicographic applications - a case study. In Proceedings of the Tenth EURALEX International Congress, pages 807–818.
[Kilgarriff et al., 2004] Kilgarriff, A., Rychlý, P., Smrž, P., and Tugwell, D. (2004). The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress, pages 105–116.
[Lenat, 1995] Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.
[Mueller, 2003] Mueller, E. T. (2003). ThoughtTreasure: A natural language/commonsense platform. Retrieved November 9, 2009 from http://alumni.media.mit.edu/~mueller/papers/tt.html.
[Smith, 1995] Smith, B. (1995). Formal ontology, common sense and cognitive science. International Journal of Human-Computer Studies, pages 641–667.
[Stork, 2001] Stork, D. G. (2001). Toward a computational theory of data acquisition and truthing. In COLT '01/EuroCOLT '01: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, pages 194–207, London, UK. Springer-Verlag.
[Stork, 2007] Stork, D. G. (2007). Open Mind Initiative – about. Retrieved October 28, 2007 from http://openmind.org.
[von Ahn, 2006] von Ahn, L. (2006). Games with a purpose. Computer, 39(6):92–94.
[von Ahn et al., 2006] von Ahn, L., Kedia, M., and Blum, M. (2006). Verbosity: a game for collecting common-sense facts. In CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 75–78, New York, NY, USA. ACM.

Semantic Role Patterns and Verb Classes in Verb Valency Lexicon

Zuzana Nevěřilová

Faculty of Informatics, Masaryk University, Botanická 68a, Brno 602 00

Abstract. For the Czech language there is a large valency frame lexicon: VerbaLex. It contains verbs, slots related to the verbs and information about the semantic roles each slot plays. This paper discusses observations made on VerbaLex frames related to verb classification. It shows that for particular classes of verbs (e.g. verbs describing weather) some semantic role patterns are typical. It also tries to reveal these patterns in not so obvious cases. Currently, verb frames in VerbaLex are not interconnected. This paper outlines a way we can make such connections. We expect that verb frames of the same class or with the same semantic role patterns are semantically close and therefore propose similar types of interconnection. We expect to create a relatively small set of inference rules that influence a large number of verb frames.

1 Introduction

A valency frame lexicon consists of the following units:

– verb – a word and its synonyms describing an action, event or state
– verb frame – syntactic and semantic description of sentence constituents dependent on the verb
– slot – description of each dependent constituent

A valency frame lexicon serves not only as a syntactic description of verb dependent constituents, but also helps to describe or predict their semantic roles. We consider that the meaning of a sentence is composed of the meanings of its constituents and the syntactic structure the constituents form. Due to this we can use verb frames for semantic disambiguation. Moreover, if we are able to semantically disambiguate sentences in a discourse, we will be able to put relations of known types between constituents in the sentences (e.g. cause–effect). In this paper we study the valency frame lexicon of Czech verbs VerbaLex. We construct the semantic role patterns and compare them with verb classification. We briefly introduce VerbaLex in section 2. Section 3 defines and describes semantic role patterns in detail. In section 4 we describe verb classes and evaluate the VerbaLex data w.r.t. semantic role patterns. In section 5 we describe the generalization as the result of the two approaches. We discuss the possibility of interconnecting VerbaLex frames. Section 6 gives a conclusion and proposes future work.

2 VerbaLex

VerbaLex [Hlaváčková, 2007] is a valency frame lexicon built for the Czech language. Currently it contains 19 360 verb frames for more than 10 000 verbs [Hlaváčková, 2009]. Semantic information is available on two levels:

– semantic role (also known as thematic role or thematic relation) that a sentence constituent plays w.r.t. the action or state. The concept is based on [Gruber, 1965] and is currently widely used (with some changes). VerbaLex contains 33 semantic roles such as agent, patient, location or substance.
– semantic restriction on a hypernym (e.g. person). This second level is related to WordNet's hypernyms [Fellbaum, 1998] (e.g. person:1, where person is a literal and 1 is the sense number).

Moreover, a list of grammatical features such as preposition and grammatical case is present for each slot. VerbaLex was built by lexicographers, independently of corpus information. It differs from VALLEX [Lopatková et al., 2006] mainly in its size and structure. VALLEX is closely related to the Prague Dependency Treebank, while VerbaLex was built independently from it. VALLEX has no relation to WordNet; it contains only thematic roles (called functors in VALLEX). Moreover, the function of these functors is different. Verb frames in the two lexicons are not compatible.

3 Semantic Role Patterns

3.1 First Level Semantic Role Patterns

This paper concerns not single verbs, but groups of verbs that are expected to be semantically close. We define the 1st level semantic role pattern for a particular verb frame as a tuple P = (R1, ..., Rn). One of the elements of P is always the verb (marked as VERB later in this paper); the other Ri are the semantic roles assigned to this verb frame. We made observations on the 1st level semantic role patterns and their frequency in VerbaLex. Table 1 shows the most frequent patterns with example verbs. We can see that in some cases the verbs are semantically close, while in other cases there is no perceivable semantic closeness. This feature depends significantly on the type of semantic role: the more specific it is, the more relatedness among the verbs we can observe. For example, patterns containing communication (COM) group semantically close verbs, while patterns containing abstract object (OBJ) embody different groups of verbs. Note that only the first level (semantic roles) of VerbaLex is considered here and no grammatical features are taken into account.
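The frequencies shown in Table 1 can in principle be reproduced with a few lines of code. The sketch below assumes each frame has already been reduced to a tuple of role labels, with the verb marked as VERB; the three sample frames are invented, not real VerbaLex data.

from collections import Counter

# Each frame is reduced to its 1st level semantic role pattern: a tuple of role
# labels in which the verb itself appears as "VERB".
frames = [
    ("AG", "VERB", "PAT"),
    ("AG", "VERB", "PAT"),
    ("AG", "VERB", "OBJ"),
]

pattern_counts = Counter(frames)
for pattern, count in pattern_counts.most_common():
    print(count, pattern)
# 2 ('AG', 'VERB', 'PAT')
# 1 ('AG', 'VERB', 'OBJ')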

3.2 Second Level Semantic Role Patterns

Similarly to 1st level semantic role patterns, we define the 2nd level semantic role pattern for a particular verb frame as a tuple of pairs P = ((R1, W1), ..., (Rn, Wn)). Here again Ri is a semantic role (one of them being the verb) and Wi is the WordNet hypernym that restricts the sentence constituent. Note that Wi need not be present for every semantic role; in such cases it is marked as ε.

semantic role pattern   # of frames   example verbs                        translation
(AG,VERB,PAT)           1049          bodat, ovládnout, štěkat             sting, dominate, bark
(AG,VERB,OBJ)           866           bouchat, klovat, kácet               knock, (bird) peck, lumber
(AG,VERB,ACT)           788           detekovat, kazit, litovat            detect, destroy, be sorry
(AG,VERB,PAT,ACT)       444           blahopřát, tázat se                  compliment, ask
(AG,VERB,ART)           403           obarvit, kompilovat, koupit          color, compile, buy
(AG,VERB,LOC)           394           uzavřít, chvátat                     close, rush
(AG,VERB,COM)           388           analyzovat, psát, klábosit           analyse, write, chat
(AG,VERB,STATE)         339           adaptovat se, dosáhnout, objasnit    adapt, achieve, clarify
(AG,VERB,KNOW)          297           bádat, konvertovat                   research, convert
(AG,VERB,SUBS)          295           bagrovat, pít, vařit                 dig, drink, cook
(AG,VERB,ENT)           279           krást, podobat se                    steal, resemble
(AG,VERB,OBJ,OBJ)       261           doplnit, rozeznat                    supplement, distinguish

Table 1. 12 most frequent 1st level semantic role patterns with the number of occurrences (of the corresponding verb frames) in VerbaLex.

Table 2 shows the most frequent second level semantic role patterns.
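The 2nd level pattern of a single frame can be read off in much the same way as the 1st level one. The short sketch below pairs each role with its restriction and inserts ε where no restriction is present; the sample slots are invented, not real VerbaLex data.

EPSILON = "ε"

def second_level_pattern(slots):
    """slots: list of (role, restriction-or-None) pairs for one verb frame."""
    return tuple((role, restriction if restriction is not None else EPSILON)
                 for role, restriction in slots)

sample = [("AG", "person:1"), ("VERB", None), ("PAT", "person:1")]
print(second_level_pattern(sample))
# (('AG', 'person:1'), ('VERB', 'ε'), ('PAT', 'person:1'))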

semantic role pattern                                                    # of frames
((AG, person:1), (VERB, ε), (PAT, person:1))                             800
((AG, person:1), (VERB, ε), (OBJ, object:1))                             553
((AG, person:1), (VERB, ε), (ACT, act:2))                                543
((AG, person:1), (VERB, ε), (PAT, person:1), (ACT, act:2))               299
((AG, person:1), (VERB, ε), (DPHR, ε))                                   242
((AG, person:1), (VERB, ε), (STATE, state:4))                            224
((AG, person:1), (VERB, ε), (LOC, location:1))                           176
((AG, person:1), (VERB, ε), (EVEN, event:1))                             171
((AG, person:1), (VERB, ε))                                              170
((AG, person:1), (VERB, ε), (OBJ, object:1), (OBJ, object:1))            147
((AG, person:1), (VERB, ε), (PAT, person:1), (DPHR, ε))                  134
((AG, person:1), (VERB, ε), (INFO, info:1))                              127
((AG, person:1), (VERB, ε), (ART, artifact:1))                           120
((AG, person:1), (VERB, ε), (PAT, person:1), (OBJ, object:1))            116

Table 2. 14 most frequent 2nd level semantic role patterns with the number of occurrences (of the corresponding verb frames) in VerbaLex.

4 Verb Classes

Verb classes, defined “in terms of shared meaning components and similar syntactic behavior of words” [Kipper et al., 2008], are useful because they allow generalizations. Since there are thousands of verbs, we prefer processing whole classes instead of single verbs. So far 5 638 verbs have been classified, which makes up about 25 % of all verbs in the lexicon. The classification is based on [Schuler, 2005] (VerbNet’s classification is based on Levin’s classes of English verbs), but adapted for Czech. Table 3 shows the relation between semantic role patterns and verb classes. Although there are many patterns for each verb class (only those with frequency greater than 5 are shown), we can see some similarities. For example, a pattern is often a subset of another pattern in the same verb class. Or, in other words, a particular semantic relation is contained in (almost) every pattern for a particular verb class. Note that only classified verbs were considered, so the numbers of occurrences relate to about 25 % of the data.
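The comparison behind Table 3 can be sketched as follows: patterns are tabulated per verb class, and a simple containment test captures the observation that one pattern is often a subset of another within the same class. The sample data and the helper names are assumptions for illustration only.

from collections import Counter, defaultdict

# Invented (pattern, verb class) pairs standing in for classified VerbaLex frames.
classified = [
    (("AG", "VERB", "OBJ", "SUBS"), "butter-9"),
    (("AG", "VERB", "SUBS"), "butter-9"),
    (("AG", "VERB", "PAT"), "judgement-33.2"),
    (("AG", "VERB", "PAT", "ACT"), "judgement-33.2"),
]

# Tabulate pattern frequencies per verb class, as in Table 3.
per_class = defaultdict(Counter)
for pattern, verb_class in classified:
    per_class[verb_class][pattern] += 1

def is_subpattern(shorter, longer):
    """True if every role of `shorter` occurs in `longer` at least as often."""
    need, have = Counter(shorter), Counter(longer)
    return all(have[role] >= n for role, n in need.items())

for verb_class, patterns in per_class.items():
    print(verb_class, patterns.most_common())

# The observation that one pattern is often contained in another of the same class:
print(is_subpattern(("AG", "VERB", "SUBS"), ("AG", "VERB", "OBJ", "SUBS")))  # True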

5 Generalization

In the previous sections we have described semantic role patterns and verb classes in detail. The main difference is the manner in which they were acquired. In the case of semantic role patterns, we deduce the semantic closeness only from the semantic roles of the verb-dependent constituents. Therefore some of the verb groups are semantically closer than others. Verb classes were made by linguists to group semantically close verbs together. Although the “behavior” of the verb was observed, the main decision feature is the meaning. We expect that both semantic role patterns and verb classes can serve for the extension of VerbaLex.

In the case of 2nd level semantic role patterns we observe that the semantic closeness is more pronounced than when using only the 1st level. For example, verbs with the pattern ((AG, person:1), (VERB, ε), (PAT, person:1)) describe the interaction of a human with another human (or humans). It is therefore meaningful to ask the following questions:

– was the action positive or negative for the agent (AG)?
– was the action positive or negative for the patient (PAT)?
– does the patient (PAT) have to be at the same place at the same time as the agent (AG)?
– does the patient (PAT) have to be in physical contact with the agent (AG)?

For verbs with the pattern ((PART, bodypart:1), (VERB, ε), (PAT, person:1)), describing the action or state of body parts, there are the following expectations:

– the body part (PART) is part of the patient (PAT)
– there is something unusual with the patient’s (PAT) body part (PART)

The following question is meaningful:

– was the action positive or negative for the patient (PAT)?

Similarly, we can construct sets of expectations and meaningful questions for verb classes.

no. of occurrences in VerbaLex   semantic role pattern       verb class
22                               (AG,VERB,OBJ,SUBS)          butter-9
15                               (AG,VERB,ART)
15                               (AG,VERB,OBJ)
10                               (AG,VERB,PAT,SUBS)
 9                               (AG,VERB,PAT,PART,SUBS)
 9                               (AG,VERB,SUBS)
20                               (AG,VERB,PAT)               judgement-33.2
11                               (AG,VERB,PAT,ACT)
 9                               (AG,VERB,PAT)
 8                               (AG,VERB,ACT)
 6                               (AG,VERB,ATTR)
15                               (AG,VERB,OBJ)               remove-10.1
 9                               (AG,VERB,OBJ,LOC)
 8                               (AG,VERB,OBJ,OBJ)
 7                               (AG,VERB,OBJ,LOC,LOC)
 7                               (AG,VERB,PAT)
15                               (AG,VERB,OBJ)               bodyinternalmotion-49
 9                               (AG,VERB,PART)
15                               (AG,VERB,ACT)               want-32
10                               (AG,VERB,PAT)
 6                               (AG,VERB,ABS)
 6                               (AG,VERB,OBJ)
13                               (AG,VERB,PAT,PART)          spank-18.3
12                               (AG,VERB,PAT)
11                               (AG,VERB,PAT,INS)
13                               (AG,VERB,OBJ)               fill-9
 8                               (AG,VERB,PAT,ART)
 7                               (AG,VERB,OBJ,OBJ)
13                               (AG,VERB,KNOW)              discover-82
11                               (AG,VERB,ACT)
11                               (AG,VERB,STATE)
11                               (ENT,VERB)                  animalsounds-38
 8                               (AG,VERB)
10                               (AG,VERB,PAT)               see-30.1
10                               (AG,VERB,OBJ)
 6                               (AG,VERB,EVEN)

Table 3. 10 most frequent verb classes with their semantic role patterns.

5.1 Extending VerbaLex with Inference Rules

VerbaLex is a frame-based lexical resource. It contains slots describing typical situations (in this case sentence constituents dependent on the verb), with restrictions on their values (in this case WordNet hypernyms). Contrary to other frame-based resources such as FrameNet [Baker and Fillmore, 2009], VerbaLex frames are not related to each other; there is no hierarchy among the frames. Following [Nevěřilová, 2009], we would like to extend VerbaLex with interconnections between frames. At first, three relation types come into consideration: precondition, effect and decomposition. With the described generalization we can create small sets of inference rules that interconnect large sets of verbs. A rule can look as follows:

((PART, bodypart:1), (VERB, ε), (PAT, person:1))
        implies
((AG, person:1), (VERB_feel, ε), (FEEL, feeling:1), (REAS, reason:1))

Fig. 1. A rule stating that someone is feeling something because of the action/state of his/her body part. Arrows show which sentence constituents refer to the same entity.
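To make the intended reading of such a rule concrete, the sketch below encodes it as plain data with explicit coreference links standing in for the arrows. The chosen data layout and the particular slot-to-slot links are our own assumptions, not a formalization.

# An informal encoding of the rule in Fig. 1. The "corefer" links stand for the
# arrows in the figure; which slot maps to which is one plausible reading only.
rule = {
    "if":      (("PART", "bodypart:1"), ("VERB", "ε"), ("PAT", "person:1")),
    "implies": (("AG", "person:1"), ("VERB_feel", "ε"),
                ("FEEL", "feeling:1"), ("REAS", "reason:1")),
    # slot index in "if" -> slot index in "implies"
    "corefer": {2: 0,   # the patient (PAT) is the one who feels (AG)
                0: 3},  # the body part (PART) is the reason (REAS) for the feeling
}

def apply_rule(rule, matched_fillers):
    """Given fillers for the "if" slots, return the fillers implied for the new frame."""
    implied = {}
    for src, dst in rule["corefer"].items():
        implied[rule["implies"][dst][0]] = matched_fillers[src]
    return implied

# The patient "Petr" with the body part "hlava" (head) implies that Petr feels
# something because of his head:
print(apply_rule(rule, {0: "hlava", 2: "Petr"}))   # {'AG': 'Petr', 'REAS': 'hlava'}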

The rules outlined above aim to transform VerbaLex from a collection of separate verb frames into a semantic network with inference, as Chang, Narayanan and Petruck did within FrameNet [Chang et al., 2002]. Obviously, the example above is only intuitive, not formalized. We also have to consider what is lost and gained when processing groups of verbs instead of single verbs. The main advantage is efficiency: with a relatively small number of rules we can interconnect a large number of frames. The disadvantage of this approach is that it cannot handle exceptions to these rules, and it cannot handle cases where a verb is badly annotated (either it has wrong roles in its slots, or a wrong verb class). On the other hand, a side effect of this work is checking VerbaLex. We can spot “suspicious” verb frames (e.g. verbs that are in a particular class but do not follow the patterns of other verbs of the same class) and check manually whether they are annotated appropriately.
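One way this check could be sketched is to flag frames whose pattern is rare within their own verb class. The data format, the threshold and the sample verbs below are illustrative assumptions only.

from collections import Counter, defaultdict

def suspicious_frames(classified, min_share=0.1):
    """Flag frames whose pattern covers less than `min_share` of its verb class.

    `classified` is an iterable of (verb, pattern, verb_class) triples.
    """
    classified = list(classified)
    per_class, totals = defaultdict(Counter), Counter()
    for _, pattern, verb_class in classified:
        per_class[verb_class][pattern] += 1
        totals[verb_class] += 1
    return [(verb, verb_class, pattern)
            for verb, pattern, verb_class in classified
            if per_class[verb_class][pattern] / totals[verb_class] < min_share]

# Invented data: 19 frames follow the dominant pattern of the class, one does not.
data = [("bodat", ("AG", "VERB", "PAT"), "spank-18.3")] * 19 \
     + [("mávat", ("AG", "VERB", "LOC"), "spank-18.3")]
print(suspicious_frames(data))   # the single LOC frame is flagged for manual checking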

6 Conclusion and Future Work

This paper studies the structure of the verb frames found in the Czech valency frame lexicon VerbaLex. Two approaches to grouping verbs are shown: semantic role patterns extracted from VerbaLex and verb classification. We expect that verbs with the same semantic role patterns are semantically close. Therefore, the patterns are compared to verb classes (based on VerbNet’s classification). The purpose of this work is to make useful generalizations about verb groups. Thanks to these generalizations we can apply a small number of inference rules to a large number of verbs. A side effect of this work is the verification of VerbaLex. In the future we have to concentrate on the formal representation of inference rules. Future work also concerns the construction of rules that can interconnect the verb frames in VerbaLex. If this work succeeds, VerbaLex will acquire a new dimension of meaning representation for the Czech language.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the Center of basic research LC536.

References

[Baker and Fillmore, 2009] Baker, C. F. and Fillmore, C. J. (2009). FrameNet. [Online; accessed 30-July-2009].
[Chang et al., 2002] Chang, N., Narayanan, S., and Petruck, M. R. L. (2002). From frames to inference. In Proceedings of the First International Workshop on Scalable Natural Language Understanding.
[Fellbaum, 1998] Fellbaum, C. (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
[Gruber, 1965] Gruber, J. S. (1965). Studies in Lexical Relations. PhD thesis, MIT, Cambridge, MA.
[Hlaváčková, 2007] Hlaváčková, D. (2007). Databáze slovesných valenčních rámců VerbaLex. PhD thesis, Masarykova univerzita, Filozofická fakulta, Ústav českého jazyka.
[Hlaváčková, 2009] Hlaváčková, D. (2009). Počet lemmat v synsetech VerbaLexu. In After Half a Century of Slavonic Natural Language Processing, Brno, Czech Republic. Tribun EU.
[Kipper et al., 2008] Kipper, K., Korhonen, A., Ryant, N., and Palmer, M. (2008). A large-scale classification of English verbs. Language Resources and Evaluation, volume 42, pages 21–40. Springer Netherlands.
[Lopatková et al., 2006] Lopatková, M., Žabokrtský, Z., and Benešová, V. (2006). Valency lexicon of Czech verbs VALLEX 2.0. Technical Report 34, ÚFAL MFF UK.
[Nevěřilová, 2009] Nevěřilová, Z. (2009). Exploring and Extending Czech WordNet and VerbaLex. In Proceedings of the RASLAN Workshop 2009, Brno. Masaryk University.
[Schuler, 2005] Schuler, K. K. (2005). VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania.