Knowledge Media Technologies

First International Core-to-Core Workshop

Klaus P. Jantke & Gunther Kreuzberger (eds.)

Nr. 21 July 2006

Herausgeber: Der Rektor der Technischen Universität Ilmenau Redaktion: Institut für Medien- und Kommunikationswissenschaft, Prof. Dr. Paul Klimsa ISSN 1617-9048 Kontakt: Klaus P. Jantke, Tel.: +49 3677 69 47 35 E-Mail: [email protected]

Table of Contents

Knowledge Federation and Utilization

Yuzuru Tanaka A Formal Model for the Proximity-Based Federation among Smart Objects...... 5

Hidetoshi Nonaka and Masahito Kurihara Protean sensor network for context aware services...... ………………………………….....21

Hosting Team Presentations

Klaus P. Jantke Pattern Concepts for Digital Games Research...... …………………………..……...……..29

Anja Beyer Security Aspects in Online Games Targets, Threats and Mechanisms………………………………….....…41

Jürgen Nützel and Mario Kubek Personal File Sharing in a Personal Bluetooth Environment ...... …………………………………...…..…47

Henrik Tonn-Eichstädt User Interfaces for People with Special Needs...... …………………………………....…..54

Knowledge Clustering and Text Summarization

Jan Poland and Thomas Zeugmann Clustering the Google Distance with Eigenvectors and Semidefinite Programming ...... …...….…61

Tetsuya Yoshida Toward Soft Clustering in the Sequential Information Bottleneck Method………………………...………..70

Masaharu Yoshioka and Makoto Haraguchi Multiple News Articles Summarization based on Event Reference Information with Hierarchical Structure Analysis...... ……………………………………………………………………………...... …79

Core-to-Core Partners I

Roland H. Kaschek, Heinrich Mayr, Klaus Kienzl Using Metaphors for the Pragmatic Construction of Discourse Domains……………………………………88

Gunar Fiedler, Bernhard Thalheim, Dirk Fleischer, Heye Rumohr Data Mining in Biological Data for BiOkIS...... ……………………………………………………...104

Christian Reuschling and Sandra Zilles Personalized information filtering for mobile applications...... ………………………………….……...114

Data and Text Mining

Hiroki Arimura and Takeaki Uno Efficient Algorithms for Mining Maximal Flexible Patterns in Texts and Sequences...... ……………….125

Takuya Kida, Takashi Uemura, and Hiroki Arimura Application of Truncated Suffix Trees to Finding Sentences from the Internet...... ………..136

Shin-ichi Minato Generating Frequent Closed Item Sets Using Zero-suppressed BDDs…………………………………...…145

Kimihito Ito, Manabu Igarashi, Ayato Takada Data Mining in Amino Acid Sequences of H3N2 Influenza Viruses Isolated during 1968 to 2006...... ……154

Core-to-Core Partners II

Nigel Waters The Links Between GIScience and Information Science ..…………………………………...... ……...159

Rainer Knauf and Klaus P. Jantke Storyboarding – An AI Technology to Represent, Process, Evaluate, and Refine Didactic Knowledge...... 170

Jürgen Cleve, Christian Andersch, Stefan Wissuwa Data Mining on Traffic Accident Data...... …………………………………...... ……..180

Core-to-Core Partners III

Torsten Brix, Ulf Döring, Sabine Trott, Rike Brecht, Hendrik Thomas The Digital Mechanism and Gear Library – a Modern Knowledge Space………………………………….188

Donna M. Delparte The Use of GIS in Avalanche Modelling...... …………………………………...... ………195

Jochen Felix Böhm and Jun Fujima A Case Study for Knowledge Externalization using an Interactive Video Environment on E-learning...….199

Preface

Knowledge Media Technologies has been chosen as a headline for the present workshop proceedings to focus the audience's attention to the central problem domain around which a variety of investiga- tions have been grouped.

The Japanese Core-to-Core cooperation programme has been providing the framework for this work- shop. This farsighted programme does encourage and support Japanese centers of research excellence to widen and deepen their scientific connections to other research centers around the world. For this purpose, scientists from Europe, from Northern America, and from other regions such as New Zealand decided to go to Dagstuhl Castle, , to meet a larger group of Japanese scientists headed and coordinated by Dr. Yuzuru Tanaka, professor and director of the Meme Media Laboratory at Hokkai- do University Sapporo, Japan.

Tanaka's research center is strong in and world-wide known for its work on knowledge media tech- nologies.

For almost half a century, the transformation into a knowledge society has been recognized world- wide. Science and technologies as well as politics strive hard to act appropriately.

With the advent of global communication networks, the development and exchange of information is boosted. Combinations of knowledge media sources open unpredictable potentials.

But most sources on the Web are not prepared to be effectively combined with other sources. What is known as the enterprise application integration problem locally shows even more seriously and with higher potential impact globally.

The Japanese programme on Knowledge Media Technologies for the Advanced Federation, Utilization and Distribution of Knowledge Ressources addresses one of the most important scientific and engi- neering issues of the present and the near future. The work belongs to the wide field of information and communication technologies, but is surely relevant to all other disciplines.

The workshop programme is completed by three sessions on Digital Games. This field is a rather new discipline with high economic, scientific, and social potentials. Issues of game knowledge lead to a natural integration of these sessions into the workshop program, although the Digital Games sessions are not reflected in the present proceedings.

Preparing the proceedings of this First International Core-to-Core Workshop has been a great pleas- ure. We are looking forward to invaluable cooperations of an unforeseeable reach and impact.

Klaus P. Jantke Ilmenau, July 2006 Gunther Kreuzberger

3

A Formal Model for the Proximity-Based Federation among Smart Objects

Yuzuru Tanaka Meme Media Laboratory Hokkaido University

N13, W8, Sapporo, 060-8628 Japana  tanaka, fujima, ohigashi @meme.hokudai.ac.jp

Abstract This paper proposes a new formal model of autonomic proximity-based federation among smart objects with wireless network connectivity and services available on the Internet. Federation here denotes the definition and execution of interoperation among smart objects and/or services that are accessible either through the Internet or through peer-to-peer ad hoc communication without previously designed interop- eration interface. This paper proposes a new formal model for autonomic federation among smart objects through peer-to-peer communication, and then extends this model to cope with federation among smart objects through the Internet as well as federation including services over the Web. Each smart object is modeled as a set of ports, each of which represents an I/O interface for a function of this smart object to interoperate with some function of another smart object. Here we consider the matching of service- requesting queries and service-providing capabilities that are represented as service-requesting ports and service-providing ports, instead of the matching of a service requesting message with a service-providing message. Our model extracts the federation mechanism of each smart object as its interoperation interface that is logically represented as a set of ports. Such an abstract description of each smart object from the view point of its federation capability will allow us to discuss both the matching mechanism for federation and complex federation among smart objects in terms of a simple mathematical model. Applications can be described from the view point of their federation structures. This enables us to extract a common sub- structure from applications sharing the same typical federation scenario. Such an extracted substructure works as an application framework for this federation scenario. This paper first proposes our mathemati- cal modeling of smart objects and their federation, then gives the semantics of our federation model, and finally, based on our model, shows several application frameworks using smart object federation. 1. Introduction tion and execution of interoperation among smart objects and/or services that are acces- This paper proposes a new formal model sible either through the Internet or through of autonomic proximity-based federation peer-to-peer ad hoc communication with- among smart objects with wireless network out previously designed interoperation inter- connectivity and services available on the face. Federation is different from integration Internet. Smart objects here denote comput- in which member objects involved are as- ing devices such as RFID tag chips, smart sumed to have previously designed standard chips with sensors and/or actuators that are interoperation interface. embedded in pervasive computing environ- ments such as home, office, and social in- The Web works not only as an open pub- frastructure environments, mobile PDAs, in- lishing repository of documents, but also as telligent electronic appliances, embedded an open repository of services represented as computers, and access points with network Web applications and/or Web services. Per- servers. Federation here denotes the defini- vasive computing denotes an open system of

5 computing resources in which users can dy- matches service-requesting queries with cor- namically select and federate some of these responding service-providing capabilities. computing resources as well as some of The origin of such an idea can be found those computing resources accessible only in the original tuple space model Linda [1] through peer-to-peer ad hoc wireless com- and its extension Lime [4] that copes with munication to perform their jobs satisfying mobile objects by providing each of them their dynamically changing demands. Such a dedicated tuple space. Linda and Lime computing resources include not only ser- are languages that use tuples to register vices on the Internet, but also functions pro- service-providing capabilities and to issue vided by embedded and/or mobile smart ob- service-requesting queries to a repository- jects connected either to the Internet through and-lookup service called a tuple space. Java WiFi communication or directly to the user’s Space [2] and Jini [3] are Java versions of device through peer-to-peer ad hoc wireless Linda and Lime architectures. connection. In these preceding studies, each service- Federation over the Web is attracting the at- requesting message is matched with a tention not only for user-oriented integration service-providing message in a com- of mutually related legacy Web-based busi- mon repository of service requests and ness applications and services, but also for service-providing capabilities. Both each interdisciplinary and international advanced service-requesting message and each reuse and interoperation of heterogeneous service-providing message are represented intellectual resources especially in scientific as tuples, and they are matched with each simulations, digital libraries, and research other based on the tuple matching algo- activities. Federation may be classified into rithm. Each tuple consists of a single or two types: ad hoc federation defined by more than one attribute-value pair in which users and autonomic federation defined by some attributes may take variables. Such programs. We have already proposed our a match of tuples is temporarily used to approach based on meme media technolo- associate two objects, one issuing a service- gies [5] for ad hoc federation over the Web requesting message and the other issuing [8, 7, 9], and its extension to aggregate ad a service-providing message. There may hoc federation over the Web [6]. Here in this be more than one repository-and-lookup paper, we will propose a new formal model service in such a system. Each of these for autonomic federation among smart ob- services can delegate such a query that is not jects through peer-to-peer communication, matched in itself to one or more than one and then extend this model to cope with fed- neighboring repository-and-lookup service, eration among smart objects through the In- and obtain a match result. ternet as well as federation including ser- vices over the Web. Here in this paper, we focus on the inter- face of smart objects for their federation Every preceding studies on federation ba- with each other and with services over the sically proposed two things, i.e., a stan- Web. We will hide any details on how dard communication protocol with a lan- functions of each smart object are imple- guage to use it, and a repository-and-lookup mented, and focus on abstract level model- service that allows each member to reg- ing of its federation interface. Each smart ister its service-providing capabilities, and object is modeled as a set of ports, each to request a service that matches its de- of which represents an I/O interface for a mand. For each request with a speci- function of this smart object to interoperate fied demand, a repository-and-lookup ser- with some function of another smart object. vice searches all the registered service ca- Here, we consider the matching of service- pabilities for those satisfying the specified requesting queries and service-providing ca- demand. A repository-and-lookup service pabilities that are represented as service-

6 requesting ports and service-providing ports, connections, i.e., each of them has a distance

instead of the matching of a service re- range of wireless communication. We model

´Óµ questing message with a service-providing this distance range by a function ×ÓÔ , message. One of these functions requests which denotes a set of smart objects that

the service provided by the other function. are currently accessible by an smart object

×ÓÔ´Óµ Therefore, each port represents an interface Ó. This set may also include other of either a service-requesting function or a smart objects that are accessible through the

service-providing function. Internet if the smart object Ó is currently con- In the preceding research studies, federation nected to the Internet. Therefore, the func- mechanisms were described in the codes that tion ×ÓÔ does not directly correspond to define the behaviors of participating smart the wireless communication range of each objects, and were not separated from these object, but defines its current accessibility to codes to be discussed independently from other smart objects. them. Here in this paper, we will extract Each smart object provides and/or requests the federation mechanism of each smart ob- services. A service executed by a smart ob- ject as its interoperation interface that is log- ject may request another service provided by ically represented as a set of ports. Such another smart object. For a smart object to an abstract description of each smart object request a service running on another smart from the view point of its federation capabil- object, it needs to know the id and the inter- ity will allow us to discuss both the match- face of the service at least. We may assume ing mechanism for federation and complex that each service is uniquely identified by federation among smart objects in terms of its service type in its providing smart object. a simple mathematical model. Applications Therefore, each service can be identified by can be described from the view point of the concatenation of the object id of its pro- their federation structures. This enables us viding smart object and its service type. The to extract a common substructure from ap- interface of a service can be modeled in var- plications sharing the same typical federa- ious different ways. Here we model it as a tion scenario. Such an extracted substruc- set of attribute-value pairs without any du- ture may work as an application framework plicates of the same attribute. We call each for this federation scenario. attribute and its value respectively a signal In the following sections, we will first pro- name and a signal value. pose our mathematical modeling of smart objects and their federation, then give the se- Now we need to consider the pluggability mantics of our federation model, and finally, between smart objects. If a smart object based on our model, show several applica- needs to explicitly specify, in its code, the tion frameworks using smart object federa- identifier of the service it wants to access, tion. then this smart object cannot federate with any other smart object providing the same 2. Smart Object and its Formal type of services. The substitutability of each smart object in an arbitrary federation with Modeling another one providing the same type of ser- Each smart object communicates with an- vice is called the pluggability of smart ob- other smart object through a peer-to-peer jects. Pluggable smart objects cannot spec- communication facility, which is either a di- ify its service request by explicitly specify- rect cable connection or a wireless connec- ing the service id. Instead, they need to spec- tion such as Bluetooth and IrDA wireless ify the service they request by its name, by connection. Some smart objects may have its type, or by its property. The conversion WiFi communication facilities for their In- from each of these three different types of ternet connection. These different types of reference to the service id is called ‘reso- wireless connections are all proximity-based lution’. Service-name resolution converts a

7 given service name to a corresponding ser- we treat services of the same function ac- vice id. Service-type resolution converts a cessible through different protocols as ser- given service type to the service ids of ser- vices of different service types. Now as vices of this type. Service-property resolu- minimalists, we can model each primitive tion converts a given service property to the smart object to have only the following service ids of services satisfying this prop- three functions, i.e., recognition of acces- erty. sible smart objects, object-name resolution, and service-type resolution. The first one is

When a smart object can access the Inter-

´Óµ represented by the function ×ÓÔ , the

net, it can ask a repository-and-lookup ser-

´Ó ÓÒ Ñµ second by ÓÒ ÑÊ × , and the

vice to perform each resolution. The object

´Ó ×ØÝ Ô ÑÓµ third by ×ØÝ ÔÊ × . The id and service type obtained as a resolution last two functions are defined as follows:

result are used to access the target smart ob-

´Ó ÓÒѵ ject and its service. This requires another ÓÒÑÊ ×

resolution service that converts each object =if ÓÒÑ is the name of the smart object Ó id to its global network address, such as url identified by Ó then else nil.

and mobile phone number. This resolution

×ØÝ ÔÊ ×´Ó ×ØÝ Ô ÑÓ µ

is outside of our mathematical model. Here ½ = when ÑÓ  , if the smart object identified

we will not distinguish object ids and urls.

×ØÝ Ô Ó by Ó provides a service of then When a smart object can access others only else nil, otherwise if the smart object

through peer-to-peer network, it must be ×ØÝ Ô identified by Ó s requests a service of able to ask each of them to perform each res- then Ó else nil, olution. In this case, if a target smart object can be accessed only through peer-to-peer When a service-requesting smart object Ó re-

network, then it needs to perform each reso- quests a service, it sends an object name ×ØÝ Ô lution by itself. Therefore, such a service- ÓÒ Ñ or a service type to each smart

object Ó in its proximity represented by

providing smart object that has no Inter-

´Óµ Ó ×ÓÔ . Each recipient smart object ,

net accessibility and might be accessed by

×ØÝ Ô Ó when receiving ÓÒ Ñ or from ,

another without Internet accessibility needs

´Ó ÓÒ Ñµ performs either ÓÒ ÑÊ × or

to have a resolution mechanism for its ser-

´Ó ½µ ×ØÝ ÔÊ × stype . The sender smart vices. Here we assume that every smart object receives either the recipient’s Ó or object has a resolution mechanism in itself. nil depending on the success of the reso- Some smart objects can perform only object- lution. Such resolution is called provid- name resolution and service-type resolution ing resolution since the resolution is per- for their own services. Service-property res- formed by smart objects providing a ser-

olution is too heavy to implement in some vice to discover. When a service-providing ÓÒ Ñ primitive smart objects. smart object Ó with as its name searches for another smart object request- In our modeling of smart objects, we take ing a service of type ×ØÝ Ô that is pro-

a minimalist approach. In other words, we ÓÒ Ñ vided by Ó, it sends its object name ,

will start with a minimum model of prim- or the service type ×ØÝ Ô to each smart

itive smart objects and primitive federation object Ó in its proximity represented by

´Óµ Ó

operations, and then try to describe complex ×ÓÔ . Each recipient smart object ,

×ØÝ Ô Ó

federation mechanisms and a wide range of when receiving ÓÒ Ñ or from ,

´Ó ÓÒ Ñµ

applications by combining primitive smart performs either ÓÒ ÑÊ × or

´Ó ×ØÝ Ô ¼µ objects and primitive federation operations. ×ØÝ ÔÊ × . Such resolution is called requesting resolution since the res- In practical situations, the same smart ob- olution is performed by smart objects re- ject may provide more than one service questing a specific service. Our minimal- of the same function through different ac- ist’s model represents each resolution mech- cess protocols. In order to hide the differ- anism as a port. Each smart object is mod- ence of access protocols from our model, eled as a set of ports. Each port consists of

8

a port type and its polarity, i.e., either a pos- by their shared port type. We define a set

¼

´Ó Ó µ itive polarity ‘+’ or a negative polarity ‘-’.  ÒÒÐ of port types as a set of all Providing resolution is represented by a pos- the channels established by a federation of a

itive port, while requesting resolution is rep- ¼ Ó smart object Ó with . resented by a negative port. Object-identifier resolution and object-name resolution are The same smart object may be involved in more than one different federation, which respectively represented by port types Ó implies that the same port may be involved and ÓÒ Ñ with polarities. A smart ob-

in more than one different channel.

ÓÒ Ñ ·Ó

ject with Ó and has ports ÓÒ Ñ and/or · if it has respectively an object-identifier-resolution function and/or 3. Port Signals an object-name-resolution function. A smart

When a channel is established between two

Ó ·ÓÒ Ñ object has ports · and/or if it requests another smart object identified by smart objects, they can communicate with

each other through this channel. Each of ÓÒ Ñ Ó and/or . Service-type resolution is represented by a port with a port type these smart objects assumes a constraint on a set of I/O signals going back and forth

×ØÝ Ô and a polarity. If it is a providing ×ØÝ Ô

resolution, then it is represented by · , through this channel. Their assumed con- ×ØÝ Ô else by . Each port represents both a straints on signal sets must be compatible request and the capability of corresponding with each other. Otherwise, they cannot resolution. Therefore, if a smart object en- communicate with each other through this ters the proximity of another smart object, channel. The assumed constraint on a set and if they have ports with the same port of signals for a channel, and hence for a type and different polarities, one can send port, in each smart object is called the port a resolution request to another to obtain the signal constraint. Mathematically, a port partner object id successfully. This means signal constraint is represented by a set of that each request for a specific smart object signals, each of which consists of a signal or a specific service is satisfied by the part- mode, a signal name and a signal domain.

ner smart object. This step is called the port

´Ôµ type matching in our model. Ports are math- By × Ò Ð is denoted a set of signals of the port p. Without loss of generality, we ematically defined as follows. Let Ç ,N,and S denote respectively the set of object iden- may assume that each signal is given its tifiers, the set of object names, and the set of value either by a service-providing port or service types. The set P of ports is defined a service-requesting port. Bilateral signals as follows: can be treated as a pair of unilateral signals.

A signal mode is ‘+’ if the signal value is

·  ¢ ´Ç  Æ  ˵ È given by the service-providing port. Such

a signal is called a providing signal. Oth-

ÔÓÖ Ø´Óµ Each smart object Ó has a set of ports erwise, i.e., if the signal value is given by

which is a subset of P. A smart object the service-requesting port, the signal mode

ÓÒ Ñ ·Ó

with Ó and may have and is ‘-’. Such a signal is called a request- ÓÒ Ñ · as its ports, but neither of them ing signal. The domain of a signal is a set

is mandatory. Federation of a smart object of allowed signal values of this signal. For

¼ Ó

Ó with another smart object in its scope a signal s, we denote its mode, name and

´Óµ

×ÓÔ is initiated by a program running

´×µ × Æ Ñ´×µ domain by × Å Ó , , and

on Ó. This program detects either a specific

´×µ

×  ÓÑ Ò .

×ÓÔ´Óµ user operation on Ó, a change of ,or Example 1

some other event on Ó as a trigger to initi- 

ate federation. The initiation of federation PDA1 = -DBaccess × ÒÐ ¼ (-DBaccess) by a smart object o with Ó performs the

= (-query: SQLstandard), (-search:Boolean),

port matching between the ports of Ó and  ¼ (+result:RecorList) the ports of Ó . As its result, every paired ports are connected by a channel identified A pair (-query:SQLstandard) denotes that

9 oid oname range of accessibility of PDA1 PDA1:Jack PDA1 has two requesting ports, requesting ports have one for the PPT projection, and negative polarity the other for the printing. -PPTprojection polarity service type Whenever Room1∈scope(PDA1) holds, +Jack PDA1 PDA1 initiates a federation with Room1. PDA1 -DocPrint oname +Jack providing ports have -PPTprojection Room1∈scope(PDA1) Room1:Mike positive polarity.

+PPTprojection -DocPrint Room1 has two providing ports, polarity service type +Mike one for a PPT projection service, Room1 and the other for a printing service. oname +DocPrint +PPTprojection +Mike channel(PDA1, Room1) ={PPTprojection, DocPrint} +DocPrint

(a) Two smart objects PDA1 and Room1 (b) Room1 enters the scope of PDA1

Figure 1: Fedration of PDA1 with Room1 and the channels established between them. this signal value is given by a service- In example 1, the port -DBaccess and the requesting port, and that it has ‘query’ and port +DBaccess are compatible with each

SQLstandard (the domain of SQL Standard other, i.e., -DBaccess  +DBaccess since it

queries) as its name and domain. This holds that SQLstandard  OracleSQL. service-requesting port -DBaccess issues a query in SQL Standard and a Boolean search 4. Applications Running on a command to obtain a search result as a Smart Object record list.

As shown in Figure 2, each smart object has  Room1 = +DBaccess some ports, internal variables correspond- × ÒÐ (+DBaccess) = ing to internal registers, and HCI variables (-query,: SQL), corresponding to the console input and out- (-search:Boolean), put variables for its user to input commands (+result:RecorList), and to observe its response. Applications (+currentRecord:Record), running on a smart object can input values (-nextRecord, Boolean), from internal variables, HCI input variables, (-previousRecord, Boolean) providing signals of its service-requesting This service-providing port +DBaccess re- ports, and requesting signals of its service- ceives a query in Oracle SQL, a Boolean providing ports. They can output values search command, and a Boolean next-record to internal variables, HCI output variables, or previous-record command, and returns a requesting signals of its service-requesting search result as a record list and a current ports, and providing signals of its service- record pointed to by a record cursor that is providing ports. Figure 2 shows two appli- moved back and forth by the next-record and cations in a smart object and their I/O values. previous-record command. Now, we will Each application program is invoked inde- define the compatibility of ports. A service- pendently from others by the requesting sig- requesting port -p and a service-providing nals of its dedicated service-providing port. port +p are compatible with each other and Application programs in the same smart ob- denoted by -p  +p if the following holds: ject are therefore asynchronously interoper-

ate with each other through the internal vari-

´ Ü Ý µ ¾ × ÒÐ ´ Ôµ

´ ables of the smart object. In general, a smart

´ Ü Þ µ ¾ × ÒÐ ´ Ôµ Ý Þ µ

 object requests values from other smart ob-

´´·Ü Ý µ ¾ × ÒÐ ´ Ôµ

jects through its requesting ports, and pro-

´·Ü Þ µ ¾ × ÒÐ ´·Ôµ Ý Þ µ   vides values computed by its applications to other smart objects though its providing

10 internal variable: x2 HCI variable: x1 (input), y1 (output) 5. Semantics of Federation f1 f2 Smart Object

Let us consider a federation among three

Ç Ç

-p2(-s3, +s4) Ç

½ ¾ -p1(-s1, +s2) smart objects ¼ , , and as shown in

y1=h2(f2(h1(x2,x4)), x2, x4) Ç

h1 Ç ¾ Figure 3. ½ and perform addition and g1 app1 h2

app2 Ç x1 x2 y1 multiplication, while ¼ compose these two

x2=g1(f1(x1), x3)

Ç Ç¾

functions provided by ½ and to calcu- h3 g2 +p4(-s7, +s8) late (s1+s2) ¢ s2 as the value of s3. We +p3(-s5, +s6) y3=h3(f2(h1(x2,x4)), x2, x4) need to define the semantics of smart objects y3 x4 x3 and their federation so that we can calculate y2 y2=g2(f1(x1), x3) the composed function from the functions of participating smart objects. Here we will de- Figure 2: Applications running on a smart scribe the federation semantics using a Pro- object. log like description of the function of each smart object. ports. Two important classes of smart ob- s3=(s1+s2)×s2 +p0(-s1, -s2, +s3) jects are providing-only smart objects called O supplier objects and requesting-only smart 0 objects called consumer objects. Supplier -p1(-s1, -s2, +s3) -p2(-s1, -s2, +s3) objects have providing ports only. For ex- ample, bar codes and RFID tag chips in con- +p1(-s1, -s2, +s3) sumer products, Web applications, or Web +p2(-s1, -s2, +s3) O 1 O services are supplier objects. Consumer ob- 2 + × jects have requesting ports only. For exam- ple, remote controllers of all kinds are con- s3=s1+s2 sumer objects. In general, a smart object s3=s1×s2 cannot be copied. However, there are im-

portant classes of objects where this is vir-

Ç Ç Ç

½ ¾ Figure 3: Federation among ¼ , , and . tually possible by copying their proxy ob- jects. Most notably this is possible for soft- The function of each smart object in Figure 3

ware smart objects we will mention later. is described as follows in our description.

Ç 

A service-requesting port p1 is replaceable ¼ : p0( -s1:x, -s2:y, +s3:z )



with another service-requesting port p2 if p1( -s1:x, -s2:y, +s3:w ), 

the following holds: p2(-s1:w, -s2:y, +s3:z ) 

( p0(-s1:x, -s2:y, +s3:z ) is implied by

´´ Ü Ý µ ¾ × ÒÐ ´ Ô½µ



p1(-s1:x, -s2:y, +s3:w ) and p2( -s1:w, -s2:y,

´ Ü Þ µ ¾ × ÒÐ ´ Ô¾µ Ý Þ µ

+s3:z))

´´·Ü Ý µ ¾ × ÒÐ ´ Ô½µ

Ç 

½ : p1( -s1:x, -s2:y, +s3:z ) [z:=x+y]

´·Ü Þ µ ¾ × ÒÐ ´ Ô¾µ Ý Þ µ   ( p1(-s1:x, -s2:y, +s3:z ) is implied by the eval- uation of z:=x+y. )

A service-providing port p1 is replaceable

Ç  ¢

¾ : p2( -s1:x, -s2:y, +s3:z ) [z:=x y] with another service-providing port p2 if the following holds: The functions of a smart object is described

by a set of rules. Each rule has zero or one

´ Ü Ý µ ¾ × ÒÐ ´·Ô½µ

´ literal on the left-hand side, and zero or an

´ Ü Þ µ ¾ × ÒÐ ´·Ô¾µ Ý Þ µ

 arbitrary number of literals on the right-hand

´´·Ü Ý µ ¾ × ÒÐ ´·Ô½µ

side. Each literal on the left-hand side cor-

´·Ü Þ µ ¾ × ÒÐ ´·Ô¾µ Ý Þ µ   responds to a service-providing port, while each literal on the right-hand side corre- sponds to either a program code to evaluate

11 Ç or a service-requesting port. Each port may and b from outside, ¼ starts to evaluate a have a set of pairs, each of which consists of goal a signed signal name and its value. A sig-

nal value may be represented by a variable.

 Ô¼´ ×½ ×¾  ·×¿ Þ µ Similarly to the logical resolution in Prolog,

we start with the evaluation of a given goal Ç

which is resolved by the rule in ¼ , and is with arbitrary number of literals only on the reduced to

right-hand side:

 Ô½´ ×½ ×¾  ·×¿ Û µ

Ô¾´ ×½ Û ×¾  ·×¿ Þ µ

 Ľ ľ  ÄÒ Then the first literal is resolved by the rule in

Such a rule without any left-hand side literal

Ç Ç ¾ ½ , and the second by the rule in , since

is called a goal. The first literal Li that is not Ç

the port -p1 and -p2 in ¼ are respectively

a program code in this goal is logically re- Ç

connected to the ports +p1 in ½ and +p2 in

solved with some literal with the same port Ç

¾ through channels, i.e.,

type that appears on the left-hand side of

Û   ·  Ô¾´ ×½  Û  ×¾   ·×¿  Þµ

some other rule, and is replaced with the 

Û   ·  Þ  Û ¢  right-hand side of this rule with all the vari-  ables changed to the expressions that unifies these two literals. Let a rule This result has only program codes, and can

be partially evaluated to obtain the follow-

¼ ¼ ¼ ¼

 Ľ ľ  ÄÑ Ä ing:

be such a rule. Each port literal has a set of

Þ  ´ · µ ¢  signal pairs, each of which consist of a sig-  nal name and a value, whereas a predicate in This is the function composed by the feder- Prolog has parameters, the number of which ation of these three smart objects. is fixed for the same predicate name. A

Example 1 revisited Դ˵

right-hand side literal Ä can be uni- ¼

¼ The semantics of PDA1 and Room1 in Example

Դ˵ fied with a left-hand side literal Ä if the following holds: 1 is described as follows:

PDA1

¼ ¼

¼ App(x)

  ´´Ü Ý µ ¾ Ë ´ Ü Þ µ ¾ Ë Ý Þ µ  DBservice(-query:q, -search:true, +result:x ).

¼ Room1 

where  and are substitutions of variables.

¼ DBservice( -query:x, -search:y,  A pair ( , ) is called a unifier. Generally, +result:z, +currentRecord:u, such a unifier is not uniquely determined. In

-nextRecord:v,-previousRecord:w) such a case, there exists the most general

[PDB(x, y, z, u, v, w)].

¼

 

unifier ( ¼ , ). Using these substitutions, ¼ the current goal is replaced with App(x) is an application running on PDA1, while PDB(x, y, z, u, v, w) is a program code exe-

cuted by Room1 to access its database to answer

  ´Ä½µ  ´Ä¾µ   ´Ä´ ½µµ

¼ ¼

¼ a given query. Whenever PDA1 starts to evaluate

¼ ¼ ¼ ¼ ¼ ¼

 ´Ä½ µ  ´Ä¾ µ   ´ÄÑ µ

¼ ¼ ¼ a goal

App(x),

 ´Ä´ ·½µµ   ´ÄÒµ 

¼ ¼

it is reduced as follows:

 This operation is the extension of Prolog res-  DBservice( -query:q, -search:true, +result:x ) olution mechanism to the resolution of our  [PDB(q, true, x, u, v, w)].

rules. When the service-providing port p0 Example 2 Ç of ¼ is accessed with two signal values a Let us consider federations among a PDA PDA2,

12 a video cassette recorder VCR1, and a monitor VideoDisplay(-video:y1, audioL:y2,

display MonitorDisplay1, where PDA2 works as audioR:y3)

a remote controller of VCR1, and the audio and [PMonitor(y1, y2, y3)] video signals of VCR1 are sent to MonitorDis- The literal on the right-hand side represents the

play. video displaying function of this monitor dis-  VCR1=-VideoDisplay, +VideoControl play.

× ÒÐ (-VideoDisplay) PDA2

 =(video: NTSC ),(audioL: Audio),

HCIPDA(-key1:x1, -key2:x2,

(audioR: Audio)

-key3;x3, -key5:x4, -key5:x5)

× ÒÐ (+VideoControl) VideoControl( -play:x1, -ff:x2,

=(play: Boolean), (ff: Boolean),

-rewind:x3, -stop:x5) (rewind: Boolean), (record: Boolean), The literal on the left-hand side represents the (stop: Boolean)

human-computer interaction interface of this  MonitorDisplay1=+VideoDisplay PDA. This rule uses only five keys of PDA2,

× ÒÐ (+VideoDisplay) key1, key2, key3, key4, and key5 for its user

 =(video: NTSC, PAL ), to input the five commands, i.e., ‘play’, ‘ff’, (audioL: Audio), ‘rewind’, ‘record’ and ‘stop’. When one of these

(audioR: Audio) keys is pushed, PDA2 sends a service request  PDA2=-VideoControl with the corresponding command signal through

× ÒÐ (-VideoControl) the port -VideoControl.

=(play: boolean), (ff: boolean), Let us assume that the two federations, i.e., (rewind: boolean),(stop: boolean) PDA2 with VCR1, and VCR1 with MonitorDis- Their semantics is described as follows: play1, are both established. In order to know the VCR1 composed function of each key input to PDA2 in these federations, we can evaluate the following

VideoControl(-play:x1, -ff:x2, -rewind:x3, goal:

-record:x4, -stop:x5)

[PVCR(x1, x2, x3, x4, x5, y1, y2, y3)], HCIPDA( -key1:x1, -key2:x2, VideoDisplay( -video:y1, audioL:y2, -key3;x3, -key5:x4, -key5:x5)

audioR:y3) This is reduced as follows:

When VCR1 receives a service request through VideoControl( -play:x1, -ff:x2, the port +VideoControl, it performs its operation

-rewind:x3, -stop:x5) depending on the input command signals. If the command signal is ‘play’, it plays the loaded [PVCR(x1, x2, x3, x4, x5, y1, y2, y3)], video tape and sends out its audio-video out- VideoDisplay(-video:y1, put signals through the service-requesting port - audioL:y2, audioR:y3)

VideoDisplay. If the command is ‘ff’ i.e., ‘fast [PVCR(x1, x2, x3, x4, x5, y1, y2, y3)], forward’, then it fast-forward the tape until an- [PMonitor(y1, y2, y3)]. other command signal is input, and sends out the fast-forward mark as its video output sig- 6. Service-Property Resolution nal through the port -VideoDisplay. It simi- Service larly treats other command signals, ‘rewind’, and ‘stop’. If the command is ‘record’, it starts As mentioned in Chapter 2, our model does the recording of its input video signal onto the not treat service-property resolution as a loaded tape, and sends out the input video sig- primitive function. Instead, it treats this type nal combined with the recording mark as its of resolution as a service. Therefore, smart audio-video output signals through the port - objects with this type of resolution func- VideoDisplay. The first literal [PVCR(x1, x2, tion are modeled to have +sPropertyReso- x3, x4, x5, y1, y2, y3)] on the right-hand side lution port, whereas those capable to issue of this rule represents the internal mechanism of a query to another smart object for service- VCR1. property resolution are modeled to have - MonitorDisplay1 sPropertyResolution port. These ports may

13 Room Room have arbitrary number of signals specifying new port conditions on different attributes of the ser- +p +p +r vice property since there may be a large va- +sPropertyResolution +q +sPropertyResolution +q riety of different types of service-property -r -r resolution. The service-property resolution PDA PDA -sPropertyResolution -sPropertyResolution

mechanism is defined as follows: 

sPropertyResolution( attr1:x1, , attrk:xk,

stype:y, signal1:y1,  , signalh:yh )

Figure 4: Service-property resolution imple-

  [find( attr1:x1, , attrk:xk,

mented as a service.

stype:y, signal1:y1,  , signalh:yh ), openNewPort(+y)] requesting port may use one of these chan- When a channel is set up between nels or scan these channels to access each -sPropertyResolution and +sProper- of these services, depending on the appli- tyResolution, a smart object with - cation program using this port. Figure 5 sPropertyResolution port sends the service shows a case of scanning all the channels, property information attr1:x1, attr2:x2,

where PDA with a sensor reader simultane-

 , attrk:xk through this channel to ously federates with more than one sensor. another smart object with +sProper- tyResolution port to find out a service PDA satisfying this property, and to create a -Sensor (-ID, -Type, +Value)

new port with y as its service type, and

×Ò Ð  Ò ×Ò Ð

with ×Ò Ð as its

¾  ½ … … signals. This newly created port enables the service-requesting smart object to set up +Sensor(-ID, -Type, +Value) +Sensor(-ID, -Type, +Value)

Sensor1 Sensor

another channel from its preset port -y with n

×Ò Ð  Ò ×Ò Ð

h signals ×Ò Ð

¾  ½ +Sensor(-ID, -Type, +Value) to the newly created port of the service- Sensori providing smart object. A smart object with a service-property res- olution service can search all the services Figure 5: Aggregate federation between a it provides for the services satisfying the PDA and more than one sensor. specified property condition, and provides one of these services through the API pre- The semantics of these ports are described as sumed by the requesting smart object. Fig- follows. The PDA has an application to re- ure 4 shows an implementation of service- trieve values from all the sensors of a spec- property resolution as a service provided by ified type, and to analyze this value set to a smart object Room. PDA requests Room obtain some index value such as the average for a service through a presumed port -r. of them.

Room does not initially provide a compat- PDA

ible service-providing port +r. PDA asks Analysis( type:x, index:y )

Room for service-property resolution, which  [y:=analysis(z), z:=bag u ] makes Room create a new port +r. Sensor( ID:v, type:y, value:u ) Analysis(type:x, index:y) is implied by the

7. Aggregate Federation

evaluation of y:=analysis(z) and z:=bag u ,

When a service-requesting smart object with where Sensor( ID:v, type:y, value:u )is a port -p federates with more than one presumed. The vertical bar denotes that the service-providing smart object with +p port, followings constitute a where clause. more than one channel of the service type p Each sensor, on the other hand, first checks are simultaneously established. The service- if its sensor type is equal to the specified

14 type, and, if ‘yes’, return its current sensor value. +p PC+p PC

Room Sensor Room +p -p

+q +q  Sensor(ID:v, type:y, value:u ) +ResDelegation +sprop

[u:= if y=sensorType(v) then

currentSensorValue(v) else nil] PDA -p PDA

-ResDelegation -p  Sensor(ID:v, type:y, value:u ) is implied by -sprop the evaluation of [u:= if y=sensorType(v) then currentSensorValue(v) else nil]. The composed function of Analysis(type:x, Figure 6: Resolution delegation. index:y) is obtained by evaluating the fol-

lowing goal: ¾

p(S(x)), Ä .

 Analysis( type:x, index:y ). Room

This can be reduced as follows: 

ResDelegation(ptype:x, signals:y )

 [y:=analysis(z), z:=bag u ] [openPortPair(x, y)]

Sensor( ID:v, type:y, value:u ) p(x) p(x).

 [y:=analysis(z), z:=bag u ] ½ An application Appl in PDA invokes Ä [u:= if y=sensorType(v) then and requests ResDelegation for the port type currentSensorValue(v) else nil]. p showing the signal information about p.

This composite function accesses every sen- Using the ResDelegation result, it invokes ¾ sor to check if its type is the specified one, p(S(x)), and Ä . S(x) here denotes the set of gets the values of all the sensors of the spec- signals of the port p which may depend on ified type, and analyzes this value set to ob- a variable x. Room, on the other hand, del- tain the value y. egates a resolution request for a port p with a port signal information about p by creat- 8. Delegation of Access Resolution ing, in itself, a port pair -p and +p having the signals specified by the port signal informa- Some smart object can delegate an access tion about p. The last rule above has p(x) on resolution it is requested to other smart ob- both sides. However, different from Prolog, jects within its scope. The requesting smart p(x) on the left-hand side and the p(x) on the object in such a case needs to explicitly right-hand side are different from each other. specify if it actually request such delega- The former is a service-providing port, while tion. Therefore, we can model such access- the latter is a service-requesting port. resolution delegation mechanism as a ser- Room in this example may also ask another vice. A smart object with a resolution del- smart object for resolution delegation. In egation service is modeled to have a special such a case, its semantics becomes as fol- port +ResDelegation, while one requesting lows:

a resolution delegation service is modeled to have -ResDelegation port. Once a channel ResDelegation( ptype:x, signals:y )

is established between these two ports, the  [openPortPair(x, y)], latter smart object can ask the former smart ResDelegation( ptype:x, signals:y )

object to generate a pair of -p and +p ports p(x)  p(x). in itself for a specified -p port in the latter. Figure 6 shows such an example. 9. Web Services and Software This mechanism is defined as follows. Smart Objects over the Web

PDA Our model treats each WiFi access point as

Ľ Appl(x) , ResDelegation( ptype:p, a smart object. Once a smart object feder-

signals:signal information about p), ates with some access point that provides an

15 Internet access service, it can access what- These registered information is used to cre- ever Web services this access point is given ate a proxy port of the same type and sig- permission to access if it knows their url ad- nals in this repository-and-lookup service to dresses. Now we will extend our model so access this port in the registering smart ob- that smart objects may access Web services ject. When the service is accessed with ‘dis- through an access point. We will model such enroll’ as its operation, it disenrolls the spec-

a function of each access point ÔÓÒØ as fol- ified port and smart object. When the service lows: is accessed with ‘lookup’ as its operation, it

looks up the repository for registered ports  ResDelegation(ptype:x,signals:y )

of the given port type x and with the given ÔÓÒØ isURL(x), permitted( , x), [createProxy(x, y)] signals y, and finds out a software smart ob- ject z. When the service is accessed with The procedure createProxy(x, y) creates a ‘proxy’ as its operation, it creates, in itself, port +x with signals specified by y as a port a proxy port of type x that works as a proxy of the access point. Such a port is called a to access a software smart object z using sig- proxy port since it works as a proxy object nals y. to communicate with a remote Web service. A repository-and-lookup service may ask Another way for an access point to provide another repository-and-lookup service to the accessibility to such a Web service is to find out an appropriate software smart object provide its proxy port from the very begin- and its port. In this case, the semantics of ning. this repository-and-lookup service needs to

Now let us expand our model so that we can add the following rule as the last rule for

´ÓÔ ÖØÓÒ  ÐÓ ÓÙÔ  ÔØÝÔÜ×ÒÐ×  Ý µ deal with distributed software objects over ÙÖÐ

¼ ‘ ’

the Internet as software smart objects. Each

ÙÖÐ ´ ÓÔ Ö ØÓÒ ÐÓ ÓÙÔ

¼ ‘ ’

software smart object has its scope including

ÝÔ  Ü ×Ò Ð× Ý µ

one or more than one repository-and-lookup ÔØ

 ÙÖÐ ´ ÓÔ Ö ØÓÒ ÐÓ ÓÙÔ

½ ‘ ’

service on the Internet. Such a repository-

ÝÔ  Ü ×Ò Ð× Ý µ

and-lookup service available at an address ÔØ ÙÖÐ url0 is denoted by RepositoryLU(url0). The where ½ is the url of another repository- semantics of such a service is described as and-lookup service. If there are more than follows: one such service, we just need to add a sim- ilar rule for each of them. In order to utilize

Ê Ô Ó×ØÓÖÝÄÍ´ÙÖе such a service, the semantics of each soft-

ÙÖÐ

ÙÖм´ÓÔ ÖØÓÒ  Ö ×Ø Ö  ÔØÝÔ  Ü

‘ ’ ware smart object at ¾ is defined as fol-

×ÑÖØÇ !Ø  Þµ

×ÒÐ×  Ý lows.

#Ö ×Ø Ö´Ü Ý Þµ$ ÙÖÐ

Software Smart Object at ¾

%× ÒÖÓÐÐ ÓÔ ÖØÓÒ 

ÙÖм´ ‘ ’

ÔØÝÔ  Ü ×ÑÖØÇ !Ø  Ý µ

Ö ×ØÖØÓÒ´Ü Ýµ

#%× ÒÖÓÐÐ´Ü Ýµ$

Ö ×Ø Ö ÙÖÐ ´ÓÔ ÖØÓÒ 

¼ ‘ ’

ÙÖм´ÓÔ ÖØÓÒ  ÐÓ Ó&ÙÔ ÔØÝÔ  Ü

ÝÔ  Ü ×ÒÐ×  Ý

‘ ’ ÔØ

×ÒÐ×  Ý ×ÑÖØÇ !Ø  Þµ

×ÑÖØÇ !Ø ÙÖÐ µ

¾

#ÐÓ Ó&ÙÔ´Ü Ý Þµ$

ÓÔ ÖØÓÒ  ÔÖÓÜÝ ÔØÝÔ  Ü

ÙÖм´ ‘ ’

'ÔÔдܵ

×ÒÐ×  Ý ×ÑÖØÇ !Ø  Þµ

Ľ

#!Ö Ø ÈÖÓÜÝ´Ü Ý Þµ$

ÙÖÐ ´ÓÔ ÖØÓÒ  ÐÓ Ó&ÙÔ

¼ ‘ ’

ÝÔ Ô ×ÒÐ×  Ë

A smart object with url0 in its scope can ÔØ

directly access this service. When this ser- ×ÑÖØÇ !Ø  Þ

ÙÖÐ ´ÓÔ ÖØÓÒ  ÔÖÓÜÝ

¼ ‘ ’

vice is accessed with ‘register’ as its opera-

ÔØÝÔ Ô ×ÒÐ×  Ë

tion, it registers the requesting smart object

×ÑÖØÇ !Ø  Þ µ

ľ with its specified port type and port signals. Դ˴ܵµ

16 downloading of a software smart object. In The semantics of the operation ‘regis- this figure, the downloaded software smart ter’ is obvious. The application invokes object may work as a driver to access the ser-

vice +p of the smart object Room through a

ÙÖÐ ´ ÓÔÖ ØÓÒ ÐÓÓÙÔ ÔØÝÔ Ô ‘ ’

¼ port -r of the smart object PDA.

ÙÖÐ ´ Ë ×Ñ Ö ØÇ Ø! Þ µ

× Ò Ð × and

¼

¼

ÓÔÖ ØÓÒ "ÔÖ ÓÜÝ ÔØÝÔ Ô × Ò Ð × Room

SO to down-load

Þ µ Ë ×Ñ Ö ØÇ Ø to create a proxy port of type p that communicates with a software +p +SOdownload -p +r smart object z using signals S, where S de- Room notes a set of pairs, each of which consists -r PDA of a signal name and a signal domain. S(x) -SOdownloadl +p +SOdownload is a set of pairs, each of which consists of a -p signal name and its value, and denotes that -r +r PDA it has the same set of signal names as S, and -SOdownloadl their values somehow depends on x.

10. Application Frameworks using Smart Objects Figure 7: Smart object downloading

10.1. The downloading of a software If a downloaded smart object is a proxy ob- smart object ject to a Web service or another smart object, the recipient smart object can access this re- A smart object may presume a standard API mote service through this proxy smart ob- to request a service of another smart object ject. This case is shown in Figure 8, where that does not provide the compatible API PDA is accessing through its port -r a remote but provides a downloadable driver to access service using a downloaded proxy smart ob- this service through the presumed API. Such ject. a mechanism is described as follows. Sup- pose that a smart object o has a service to Room SO to down-load Remote download a software smart object o’. This is service on +p the Internet described as follows: +SOdownloadl +r

Room

ÛÒÐÓ ´ ÕÙÖÝ Ü ËÇ Ý µ ËÇÓ PDA -r +p

-SOdownloadl +SOdownloadl

 ¬Ò´ ÕÙÖÝ Ü ËÇ Ý µ

PDA A requesting smart object o” with the down- -r +r load request facility can ask this smart object -SOdownloadl o to down load a software smart object y that satisfies a query x, and install this software smart object y in itself, i.e., Figure 8: The downloading of a proxy smart

object to access a remote service.

ËÇÐÓ ÁÒ×Ø Ðдܵ

ËÇÓÛÒÐÓ ´ ÕÙÖÝ Ü ËÇ Ý µ

 10.2. Location-transparent service Ò×Ø Ðдݵ continuation The installation of a software smart object y Suppose that PDA1 has federated with Of- by o” adds y in the scope of o”, initiates a fice, an access point with a server. Office federation between o ”and y, and makes the provides services available at the user’s of- scope of y equal to the scope of o ”. Fig- fice. Let Home be another access point with ure 7 shows an example application of the a server providing services available at his or

17 £ her home. The location transparent continu- lay station of +p or -p. A proxy Ô in a smart

ation of services means that he or she can object o forms a channel with each of +p, -p, £ continue the job that was started at the office and Ô in a different smart object. using PDA1 even after he or she goes back A glue smart object is a special smart object home carrying PDA1. This is realized by the with no port other than a service-requesting following mechanism. When he or she car- port -export. A smart object with -p port ries out PDA1 from Office environment, he cannot communicate with another smart ob- or she just needs to make PDA1 download ject with +p if neither of them is located the proxy smart object of Office as shown in within the proximity of the other. Let us as- Figure 9. This operation is called federation sume that each of them has a port +export. In suspension. When arriving at home, PDA1 this case, if we bring in a glue smart object is WiFi connected to Home, then federates with a sufficiently large WiFi access range with Home. At the same time, the proxy to cover these two smart objects, this glue smart object installed in it resumes the ac- object can federate with these two, sets up cess to Office. Therefore, they set up one channels between its -export and the +export additional channel to +Print port in Home port of each of the two objects, and obtain a

smart object as shown in the right-hand side £ port proxy Ô . This proxy works as a relay of this figure. Now the PDA can access station for -p and +p in these two objects, the database service of Office, and the two and sets up channels from itself to each of printing services of Office and Home. When these two ports. Through these channel, the PDA1 requests a print service, it asks its user port -p becomes able to communicate with or the application which of these services to the port +p. This situation is shown in Fig- choose. ure 10.

-url +DB office +Print +export +export glue smart object Home Office +p -q -p +q +SOdownload +export +SOdownload -p +q +Print +Print q* -url +DB PDA1 +DB office PDA1 +Print +export -DB -DB -Print -Print -q -SOdownload -export +p -SOdownload -export glue smart object p* suspend resume

+DB -urloffice +Print PDA1 -DB -Print -SOdownload Figure 10: A glue smart object and its usage.

Their semantics is defined as follows. Each Figure 9: Location-transparent service con- smart object that allows the export of its port tinuation using the downloading of a proxy proxies has the following rule.

smart object.

ÜÔ ÓÖØ´ Ô ÓÖØÈÖÓÜÝÄ×Ø Ý µ

 (Ö ØÈÓÖØÈÖÓÜݴݵ  10.3. Import and export of ports and glue smart objects Each glue smart object has the following rules. Here we consider a special type of port

‘export’. A service-providing port +ex- )Ö ØÓÒ´µ

ÜÔ ÓÖØ´ Ô ÓÖØÈÖÓÜÝÄ×Ø Ý µ

port exports proxies of all the ports ex- 

ÓÖØÈÖÓÜݴݵ cept those of export type to the service- *Ö ØÈ

requesting smart object that accesses this £

port through -export. A proxy of a port +p For each created port proxy Ô , the following £ or -p is represented as Ô , and works as a re- rules are added to the glue smart object.

18

Ô £ ´Üµ  Դܵ

Ø ´  Ü Ú ÐÙ Ý ÛÖØ ØÖÙ µ

Ô £ ´Üµ  Ô £ ´Üµ

 ) Ü× Ñ Ý Ø   Ø-Ò ÛÖشݵ

Ø ´  Ü Ú ÐÙ Ý Ö  ØÖÙ µ

) Ü× Ñ Ý Ø   Ø-Ò Ö ´Ýµ Glue smart objects can be hierarchically  used to make widely spread smart objects to interoperate with each other as shown in Fig- ure 11. It should be noticed that each glue smart object does not have +export port. 11. Concluding Remarks This paper has proposed a new formal : federation model of autonomic proximity-based feder- +export ation among smart objects with wireless net- +export +export -p +q +p -q -p +q work connectivity and services available on the Internet. Smart objects here may in-

-export clude computing devices such as RFID tag p* q* chips, smart chips with sensors and/or actua- p* q* tors that are embedded in pervasive comput- -export glue smart object 2 ing environments such as home, office, and glue smart object 1 social infrastructure environments, mobile PDAs, intelligent electronic appliances, em- bedded computers, and access points with Figure 11: Hierarchical usage of glue smart network servers. Federation here denotes objects. the definition and execution of interoper- ation among smart objects and/or services 10.4. Proxy smart objects for pas- that are accessible either through the Internet

sive tags or through peer-to-peer  Ó communica- tion without previously designed interopera- Passive tags are accessed by a smart object tion interface. with a tag reader. When such a smart object This paper first proposed a basic formal o accesses a passive tag, we consider that a model for federation among primitive smart new software smart object that works as a objects, and then extended this model to proxy of this tag is generated. The scope cope with federation among smart objects of this proxy smart object only includes o, through the Internet as well as federation in- while the object o updates its scope to in- cluding services over the Web. Each smart clude this proxy. This proxy of a passive object is modeled as a set of ports, each of tag has a port +tag with signals including the which represents an I/O interface for a func- tag id, the tag value, the read signal, and the tion of this smart object to interoperate with write signal. The smart object o evaluates some function of another smart object. the following goal Here we focused on the matching of service-

requesting queries and service-providing ca-

Ø ´  Ø   Ú ÐÙ Ú ÛÖØ ØÖÙ µ  pabilities that are represented as service- requesting ports and service-providing ports, to write a new tag value to the tag with Ø   instead of focusing on the matching of a as its tag id. It reads this tag by evaluating service requesting message with a service-

providing message. Our abstract modeling

Ø ´  Ø   Ú ÐÙ Ú Ö  ØÖÙ µ  of each smart object from the view point of its federation capability has allowed us Each proxy of a passive tag has the follow- to discuss both the matching mechanism for ing rule: federation and complex federation among

19 smart objects in terms of a simple mathe- [7] Yuzuru Tanaka, Jun Fujima, and Makoto matical model. Our model has enabled us Ohigashi. Meme media for the knowl- to describe applications from the view point edge federation over the web and perva- of their federation structures. This has en- sive computing environments. In ASIAN abled us to extract a common substructure 2004, Lecture Notes in Computer Sci- from applications sharing the same typical ence, 3321, pages 33Ð47, 2004. federation scenario as an application frame- work for this federation scenario. This pa- [8] Yuzuru Tanaka and Kimihito Ito. Meme per has also given the semantics of our fed- media architecture for the reediting and eration model based on a Prolog like logical redistribution of web resources. In framework. FQAS 2004: Lecture Notes in Computer Science, 3055, pages 1Ð12, 2004. Federation of smart objects also require se- curity mechanism to protect each smart ob- [9] Yuzuru Tanaka, Kimihito Ito, and Jun ject from unexpected or evil federation. Se- Fujima. Meme media for clipping and cure federation requires the integration of combining web resources. World Wide some security mechanism with our federa- Web, 9(2):117Ð142, 2006. tion model, on which we are also currently working. References

[1] David Gelernter. Generative communi- cation in linda. ACM Trans. Program. Lang. Syst., 7(1):80Ð112, 1985.

[2] Sun Microsystems. Javaspaces service specification, version 1.2, 2001.

[3] Sun Microsystems. Jini technology core platform specification, version 1.2, 2001.

[4] Gian Pietro Picco, Amy L. Murphy, and Gruia-Catalin Roman. Lime: Linda meets mobility. In ICSE ’99: Proceed- ings of the 21st international conference on Software engineering, pages 368Ð 377, Los Alamitos, CA, USA, 1999. IEEE Computer Society Press.

[5] Yuzuru Tanaka. Meme Media and Meme Market Architectures: Knowledge Me- dia for Editing, Distributing, and Man- aging Intellectual Resources. Wiley- IEEE Press, 2003.

[6] Yuzuru Tanaka. Knowledge federation over the web based on meme media technologies. In Lecture Notes in Com- puter Science, 3847, pages 159Ð182, 2006.

20 Protean sensor network for context aware services

Nonaka, H. and Kurihara, M. Graduate School of Information Science and Technology Hokkaido University, Sapporo 060 0814, Japan {nonaka, kurihara}@main.ist.hokudai.ac.jp

Abstract We present a wearable sensor network system. In our study, we refer a personal area network as “sensor network.” The main part of our system is eye and head movement tracking module based on visual sensorimotor integration. Eye-head cooperation is considered, especially head gesture accompanied with vestibulo-ocular reflex is used for a cue of person’s intention. Each sensor is responsible for respective roll, which varies according to the modality. And the modality depends on the person’s context. In order to change the function of each node, respective firmware is re- programmed by each other using in-circuit serial programming. Keywords: sensor network, context-aware computing, eye movement tracking, wearable computer, gesture recognition

1. Introduction ter relief, and so on.

Recently, the term “sensor network” has “Sensor network” is also used as that for been often used for an ad-hoc wireless extracting or inferring a person’s context, network or a multi-hopping network, es- especially in the research field of ubiqui- pecially in the research field of wireless tous computing or pervasive computing. communication networks. Huge num- The term “context” is defined, for exam- bers of inexpensive sensor nodes are de- ple, as follows: ployed geographically or attached to mo- “Context: any information that can be bile robots [2] [6] [18]. Sensor nodes used to characterize the situation of enti- acquire only local information from their ties (i.e., whether a person, place, or ob- surroundings. However each sensor node ject) that are considered relevant to the has limited communication and computa- interaction between a user and an appli- tion ability, large scaled sensor network cation , including the user and the ap- can be constructed by organizing them plication themselves. Context is typically into collaborative, without a priori sens- the location, identity, and state of people, ing infrastructure. Now such a network groups, and computational and physical system is commercialized [3] and be- objects.” by Dey et al. [4] coming available for information acquisi- tion including target tracking, area search, In order to acquire and gather informa- environmental monitoring, infrastructure tion available for extracting a person’s maintenance, feature localization, border context, variety of sensors are installed, monitoring, battlefield monitoring, disas- as is often the case, by invisible way.

21 In the case of a room, they are embed- And the modality depends on the person’s ded in furniture, electronic appliances, context, for example, eye-gaze is used for the ceiling, the floor, and so on. The not only seeing or looking, but also sig- range extends to a building, public facili- naling or directing a partner. In order to ties, station premise, airport, city, and so change the function of each node, respec- forth. Various systems and utilities for tive firmware is re-programmed by each context-aware computing have been pro- other using in-circuit serial programming. posed based on such sensor networks [1] [4] [7] [17]. 2. Network configuration In our study, we refer a personal area sen- Heterogeneous sensor nodes are inter- sor network as “sensor network,” how- connected by both wired and wire- ever it is used for context-aware services. less communication links. Neighbor- Context resides intrinsically in each per- ing nodes are mainly linked single-wire son’s mind, therefore, context-aware ser- (two-core) connections overlapped with vices are preferable to be provided by power supply. For inter-PCB connec- personal belongings, rather than by sur- tion, we adopted four-core equilibrium roundings. Some people may feel un- half-duplex transmission that is compli- comfortable to be taken pictures wherever ant with RS-422, with the view of re- they go, even if they take pictures wher- ducing radiated electromagnetic interfer- ever. They should prefer GPS as a posi- ence. Wireless links are used for external tioning service to that with satellite im- communications with surroundings. For ages or surveillance cameras. If an ID-tag simplicity, only one node is allowed to system is available for a context-aware drive the transmission line in any mo- service, they might desire to take belong ment, and the other nodes can only re- a tag reader rather than a tag. ceive data, therefore, carry sense or col- lision detection are not required. The net- From these considerations, we proceeded work performs at a time in one of four to develop a personal area sensor network modes: normal-packet-mode, cascaded- close to a person. The sensor network packet-mode, pulsed-network-mode, and is constructed by heterogeneous sensor programming-mode. nodes, for example, pressure sensors [9] In the normal-packet-mode, each packet [10], acceleration sensors [11], ultrasonic is exchanged between host node and a sensors [12]. The main part of our system target node with corresponding addresses is eye and head movement tracking mod- (Figure 1). The network initially starts in ule based on visual sensori-motor integra- this mode, and it enters the other modes tion. Eye-head cooperation is considered, using packet exchange in this mode. especially head gesture accompanied with vestibulo-ocular reflex is used for a cue In the cascaded-packet-mode, each of person’s intention. We adopted a packet is transmitted from node to node position sensitive device (S7848, Hama- successively in a predefined order with matsu) for eye tracking. Accelerometer the subsequent address (Figure 2). While (ADXL311E, Analog Devices), gyro sen- the period of transmission, every traffic sor (ADXRS300, Analog Devices), and is observed by host node. Broadcasting magnetic sensor (AMI201, Aichi Micro message and gathering sensor data are Systems) are used for head tracking. Each usually achieved in this mode. sensor is responsible for respective roll, In the pulsed-network-mode, the trans- which varies according to the modality. mission line is allowed in a given period

22 Source Packet Traffic Source Packet Traffic

Address 0 Data 0-0 Address 1 Address 0 Data 0-0 Address 1

Address 1 Data 1-0 Address 0 Address 1 Data 1-0 Address 2

Address 0 Data 0-1 Address 2 Address 2 Data 2-0 Address 3

Address 2 Data 2-0 Address 0 Address 3 Data 3-0 Address 0

Address 0 Data 0-2 Address 3 Address 0 Data 0-1 Address 1 F 8 7 0 F 8 7 0 . . . . . MSB . LSB . MSB . LSB . . . .

Figure 1: An example of packet traffic in Figure 2: An example of packet traffic in normal-packet-mode. cascaded-packet-mode. Each packet is exchanged between host node Each packet is transmitted successively in a and a target node. predefined order with the subsequent address.

to use for multi-directional links of pulse or environment, and so on. frequency data. It is a simple represen- tation of pulsed neural network or spik- 3. Configuration of sensor node ing neural network[8]. This mode is used Examples of hardware configurations of especially for extraction of user’s inten- sensor nodes are shown in Figures 3 - tion from multi-modal gesture, such as 6. The common part of each node is eye-head cooperative motion. An exam- mainly constituted of dual-MCU (Micro ple of operations in pulsed-network-mode Controller Unit): main MCU, and com- is presented in section 4. munication MCU. The former is used for In the programming-mode, the processes obtaining sensor data and signal process- of specific nodes or whole of the network ing, and the latter is used for communica- are temporarily stopped and its firmwares tion and in-circuit serial programming, as are reprogrammed by the capability of in- is mentioned in the last section. circuit serial programming. It becomes The first example is the node with 2- possible by using dual-MCU (Micro Con- axis acceleration sensor (Figure 3). By troller Unit) in the constitution of each each analog value of x-y axis, the gradi- sensor node. The one (main MCU) is as- ent from the direction of gravitational ac- signed to gathering sensor data and exe- celeration is measured. It is mainly used cuting signal processing, while the other for measuring tilt of head and a motion (communication MCU) is in charge of of nodding head. The measured data also communication with other nodes and re- involve dynamic acceleration, e.g. hori- programming the main MCU in obedi- zontal acceleration and vibration. Theo- ence to the demand of other node. This retically, it is impossible to eliminate such mode is used for upgrade of system ver- dynamic noises from the static accelera- sion, change of user, modification of the tion derived from gravity, but under the role of the system, addition or removal of condition that the motion is human mo- certain nodes, transition of user’s context tion, it is possible to some extent, by us-

23 ing the method of time-frequency analy- tronic compass (Figure 5). Measuring the sis. The detail of the processing is men- direction of earth magnetism, static di- tioned in Section 4. rection can be obtained. It is used for measuring the orientation of head and the compensation of the data acquired by Main MCU Accelerometer gyro sensor. PIC12F683 ADXL311

Communication StepUp DC/DC Main MCU MagneticSensor MCU Converter PIC12F683 AMI201 PIC10F206 MAX631

Communication StepUp DC/DC MCU Converter Transmission Line PIC10F206 MAX631

Figure 3: Hardware configuration of sensor node with 2-axis acceleration sensor. Transmission Line The second example is the node with single-axis yaw-rate gyro sensor (Figure Figure 5: Hardware configuration of sensor 4). Using the value of angular velocity node with 2-dimensional magnetic sensor. about the axis normal to the top surface, relatively quick response of rotation can be obtained. It is mainly used for measur- The last example is the node with 2- ing turn of head and a motion of shaking dimensional position sensitive device and head. Infrared LED (Figure 6). It is used for eye movement tracking. Various method- ologies for eye movement measurement Main MCU Gyroscope have been proposed since early times, PIC12F683 ADXRS300 for example, electro-oculography (EOG), video-based oculography (VOG), corneal reflection method, combination of corneal Communication StepUp DC/DC reflection and video image, and so on [5]. MCU Converter PIC10F206 MAX631 In our purpose, where it is used with head tracking and mainly used for detecting eye-head cooperative motion in ubiqui- tous environment, the eye tracking system Transmission Line should be lightweight and small enough. Our prototype system is mounted on 5 × × Figure 4: Hardware configuration of sensor mm 8 mm 3 mm flexible printed node with 1-axis yaw-rate gyro sensor. circuit board and the weight is 3.4 g, which is sufficiently compact for attach- ing glasses, however the precision and the The third example is the node with 2- accuracy are meager than those of other dimensional magnetic sensor or elec- elaborate eye tracking systems.

24 . Ir LED SE302A . J−1 Main MCU xJ−1(t) − xJ−1(t − 2 ) Position Sensitive yJ (t) = PIC12F683 Device S7848 2 J−1 x − (t) + x − (t − 2 ) x (t) = J 1 J 1 J 2 Communication StepUp DC/DC (1) MCU Converter PIC10F206 MAX631 This time-frequency decomposition is equivalent to the maximal overlap dis- crete Harr wavelet transform [16], but the expression is modified in order to be cal- Transmission Line culated only by additions, subtractions, and shift operators. Especially xJ (t) rep- Figure 6: Hardware configuration of sen- resents the direct-current component, and sor node with 2-dimensional position sensi- the original time series x0(t) can be re- tive device and infrared LED. constructed as follows.

∑J x0(t) = xJ (t) + yi(t) (2) 4. Detecting of eye-head cooper- j=1 ative gesture Eye-head cooperative gesture is detected Attenuated Refractory Integrators Period autonomously in a decentralized way by τ Generator Π 1 heterogeneous sensor nodes: acceleration y1 τ r sensor, gyro sensor, magnetic sensor, and w1 position sensitive device. τ2 y2 Π w According to the user’s context, the net- 2 −wr Π work changes the mode into the pulsed- network-mode. τ θ In each sensor node, for a time series y Π J J Σ x0(t), wavelet coefficients yi(t) and scal- wJ τ ing coefficients xi(t) are calculated as fol- Π lows. w x (t) − x (t − 1) Transmission Line y (t) = 0 0 1 2 x0(t) + x0(t − 1) x1(t) = Figure 7: The block diagram of spiking neu- 2 ral network. − − x1(t) x1(t 2) Each component of Haar transform yj(t) y2(t) = 2 is multiplied by respective weight wj and x (t) + x (t − 2) accumulated by attenuated integrator, and x (t) = 1 1 2 2 summed up. To inhibit an infinite activa- . tion, refractory period is generated by nega- . J−1 tive feedback. x − (t) − x − (t − 2 ) y (t) = j 1 j 1 j 2 J−1 In the simulated spiking neural network, xj−1(t) + xj−1(t − 2 ) xj(t) = y (t) w 2 each component j with weight j

25 tor of gaze line measured by position sen- int integrator(int value) sitive device, and (φy, φp, φr) denotes the { orientation (yaw, pitch roll) of head mea- int a = attenuation_factor; int temp = value; sured by accelerometer, gyroscope, and temp += y(t)*weight; magnetic sensor, then the vector of gaze value = 0; line is given by (3). for(int i=16;i>0;i--){ temp >>=1;     a <<=1; ξ cos φ 0 − sin φ if(carry is detected)  x   y y    value += temp; ξy  = 0 1 0  } ξz sin φy 0 cos φy   return value; 1 0 0 }   ·  0 cos φp sin φp  0 − sin φp cos φp   Figure 8: A pseudo-code of attenuated inte- cos φ sin φ 0 grator in Figure 7.  r r  ·  − sin φ cos φ 0  weight and attenuation factor are pre- r r 0 0 1 defined and y(t) are input value from the   sensor. − sin θ cos θ  x y  ·  sin θy  . (3) is accumulated by attenuated integrator, cos θx cos θy and summed up. Some of them are in excitatory connections and others are in The function of detecting eye-head co- inhibitory connections, according to the operative gesture in pulsed-network-mode kind of reference signal pattern. Both (Chapter 4) and calculating gaze line in of input and output are connected to the normal- or cascaded-packet-mode (Chap- transmission line, therefore, refractory ter 5) are summarized in Figure 9. period is generated by negative feedback Eye tracking to inhibit an infinite activation. Calculation Figure 7 depicts the block diagram of Head tracking of gaze line spiking neural network. Figure 8 shows a pseudo-code of attenuated integrator in Ref. pattern of Detection of the spiking neural network. shaking head fixation

5. Calculation of gaze line Detection of Identification shaking head of "No" Using the data from the node with po- sition sensitive device, the gaze line is Detection of Identification also roughly calculated with the compen- nodding head of "Yes" sation of head orientation measured by the nodes with accelerometer, gyroscope, Ref. pattern of and magnetic sensor. It is achieved in the nodding head normal-packet-mode or cascaded-packet- mode. The direction vector of gaze line Figure 9: Logical configuration of the sys- tem for detection of eye-head cooperative (ξx, ξy, ξz) is obtained in the same man- ner of our previous work [14]. gesture and calculation of gaze line.

Let (θx, θy) represents the direction vec-

26 6. Conclusions [5] Duchowski, A.T.: Eye Tracking Methodology — Theory and practice, This paper proposed a wearable sensor Springer Verlag, 2003. network system based on a capability of in-circuit serial programming. The main [6] Grocholsky, B., Makarenko, A., part of our system is eye-head movement Kaupp, T., and Durrant-Whyte, H.F.: tracking, and the roll of these movement Scalable Control of Decentralised depends on the modality and user’s con- Sensor Platforms, ISPN2003, LNCS text. In order to cope with such variety of 2634:96-112, 2003. modality and context, we introduced four network modes: normal-packet-mode, [7] Itao, T., Nakamura, T., Matsuo, cascaded-packet-mode, pulsed-network- M., and Aoyama, T.: Context-Aware mode, and programming-mode. We Construction of Ubiquitous Services, presented several applications of these IEICE Transactions on Communica- modes for acquisition of user’s intention. tions, E84-B(12):3181-3188, 2001. At present we have only made a prepa- [8] Jahnke, A., Roth, U., and Schonauer,¨ ration for context-aware services with T.: Digital Simulation of Spik- general-purpose equipment. Now we are ing Neural Networks, “Pulsed Neu- starting on development of context-aware ral Networks”, ed. Maas,W. and system with present network system com- Bishop,C.M., MIT Press, 237-257, bined with nonverbal interaction proto- 1998. cols. In addition, the consideration for in- [9] Nonaka, H., and Kurihara, M.: Sens- dividual difference is needed for further ing pressure for authentication sys- improvement, as a necessary part of our tem using keystroke dynamics, Inter- future work. national Journal of Computational In- References telligence, 1(1):19-22, 2004. [10] Nonaka, H., and Kurihara, M.: An- [1] Benerecetti, M., Bouquet, P., and ticipation of Mouse Pointer Move- Bonifacio, M.: Distributed Context- ment Using Pressure Sensors, Inter- Aware Systems, Human-Computer national Conference on Computing Interaction, 16(3):213-228, 2001. Anticipatory Systems, 7, Liege, 2005 [2] D’Costa, A., Sayeed, A.M.: Col- [11] Nonaka, H., and Kurihara, M.: laborative Signal Processing for Dis- Time-Frequency Decomposition in tributed Classification in Sensor Net- Gesture Recognition System Using works, ISPN2003, LNCS 2634:193- Accelerometer, International Confer- 208, 2003. ence on Knowledge-Based Intelligent Information & Engineering Systems, [3] Mote System, Crossbow Technology LNAI, 3213:1072-1078, Wellington, Inc., http://www.xbow.com. 2004. [4] Dey, A.K., Abowd, G.D., and Salber, [12] Nonaka, H., and Kurihara, M.: D.: A Conceptual Framework and a Pulse-Based Learning for Object Toolkit for Supporting the Rapid Pro- Identification using Ultrasound, totyping of Context-Aware Applica- International Conference on Neural tions, Human-Computer Interaction, Information Processing, 9, Singapore, 16(2):97-166, 2001. 2002.

27 [13] Nonaka, H., and Kurihara, M.: [16] Percival, D.B., and walden, A.T.: Anticipatory Matching Method for Wavelet Methods for Time Series Query-Based Head Gesture Identifi- Analysis, Cambridge Univ. Press, cation, International Journal of Com- 2000. puting Anticipatory Systems, 15:279- 287, 2004. [17] Yamaguchi, A., Ohashi, M., and Murakami, H.: Autonomous Decen- [14] Nonaka, H.: Communication Inter- tralized Control in Ubiquitous Com- face with Eye-gaze and Head Gesture puting, IEICE Transaction s on Com- using Successive DP Matching and munications, E88-B(12):4421-4426, Fuzzy Inference, Journal of Intelli- 2005. gent Information Systems, 21(2):105- 112, 2003. [18] Liu, J., Reich, L.J., Chung, P., and Zhao, F.: Distributed Group Manage- [15] Nonaka, H., and Kurihara, M.: Eye- ment for Track Initiation and Main- Contact Based Communication Proto- tenance in Target Localization Appli- col in Human-Agent Interaction, In- cations, ISPN2003, LNCS 2634:113- ternational Workshop on Intelligent 128, 2003. Virtual Agents, LNAI, 2792:106-110, Irsee, 2003.

28 Pattern Concepts for Digital Games Research

Klaus P. Jantke Technical University of Ilmenau Institute for Media and Communication Science Department of Multimedia Applications Am Eichicht 1, 98693 Ilmenau, Germany

Abstract The digital games industry has grown to the size of the film industry and is going to leave it far behind. Computers with advanced graphics capabilities have contributed to the immersive interactive experience that attracts many to spend more of their leisure time playing digital games than watching television. The available CPU power of current home computers and notebook PCs is setting the stage for game AI; console development shows a similar trend. And the expectations of human players are high. Digital games and their potential social impact are subject to a heated debate worldwide which is fueled by tragic events such as the 1999 deadly shooting at the Columbine Highschool, Littleton, CO, USA, or the 2002 amok run at the Gutenberg Highschool in Erfurt, Germany. This debate is getting even more controversal when games such as the Super Columbine Massacre Role Playing Game enter the stage. There are several good reasonsÐeconomically, socially, politicallyÐto engage in digital games research. But digital games research has a high demand of interdisciplinarity. Digital games are at the same time entertainment media and IT systems. Even more specifically, they are turned more and more into complex AI systems. Digital games research is on its way to establish a digital games science. This discipline will have its language and its methodologies. And patterns are key concepts of understanding game playing, of understanding digital game reception and, thus, the potential social impact of particular digital games. Similarly, pattern concepts are crucial to anticipating the effects of game mechanics and, hence, play a key role in digital games design and development. This paper aims at a contribution to the formation of interdisciplinary pattern concepts for digital games.

1. Patterns in Digital Games: selves against intruding NPC adversaries. The Author’s Initial Approach A certain standard behaviorÐa game patternÐ However to generalize, we need experience, is a saying attributed to George Gr¬atzer, one of the leading scholars in universal algebra. It applies to digital games science as well. Where do we find in digital games instances of what might be a pattern? When playing AGE OF EMPIRES II, e.g., most players experience certain difficulties and find particular ways to overcome them. AGE OF EMPIRES II (figure 1) is a fantasy strategic development game with a rather substantial amount of fighting involved. Figure 1: AGE OF EMPIRES II in progress When being busy with developing their own proves successful: walling up all entrances civilization, players need to defend them- to their own settlement, esp. closing bridges.

29 2. The Underlying Perspective leveled so that you can eventually, after a at Digital Games thousand hours, get to see the cool stuff. You dont take on those games lightly. If you’re Digital games areÐno doubtÐentertainment going to play a traditional MMO, not only media and as such they are subject to media do you have to be willing to commit to it fi- research. This is quite far beyond the limits nancially in terms of an ongoing subscrip- of IT and AI. tion, but also in terms of your life. You dont When digital games are accepted as an art play casually–you’re either completely im- form, they deserve all the rights other arts mersed in it or you quit. Its kind of like peo- enjoy such as free speech, to say it in brief. ple kicking a drug addiction. Immersion or flow [7] is crucial to playing games [10]: Games create a cybernetic sys- tem between you and the machine, with your senses eventually expanding to possess your avatar when you’ve sufficiently mastered the control system. This is the absolute magic of the form, where you stop thinking, ”I need to press X to jump”, and start thinking, ”I’ll jump”. Just look at the language people use to talk about games to show how much their sense of identity has merged with their in-game character. If someones enjoying a Figure 2: Download Page of a Free Game game, it’s, ”It hit me”, never, ”It hit my character”, . . . Videogames are the simula- Does this truly apply to all games including tor which swallows your consciousness alive the SUPER COLUMBINE MASSACRE ROLE and takes you to another place. PLAYING GAME (see figure 2) ...? In this simple point and click shooting game, the In addition to the entertainment media per- player takes the role of one of the two teens spective, digital games are clearlyÐusually who shot a dozen of their school mates and a complexÐIT systems. Going even further, teacher before they finally shot themselves. they should be seen as systems of artificial intelligence (AI). Computer scientists might Digital games reflect reality to some extent. be surprised, but in gamer communities, it In turn, playing digital games might have still is a rather ‘unusual perspective’ to see impact on some players’ behavior in the sur- digital games as IT systems ([16], p. 16). rounding real world (see [4, 17] for some comprehensive approaches and [11] for an To the author, it is fundamental to consider interdisciplinary discussion). digital games at the same time as entertain- Digital games have to be taken seriously. For ment media and as systems of AI [11, 12]. the class of massively multi-player online When being interested in the way in which role playing games (MMORPG, for short), digital games might have impact on some Jeff Strain, co-founder of ArenaNet, in an human brains and, thus, have an impact of interview1, characterized those games as just social relevance, one needs a holistic view polished incarnations of a traditional MMO at digital games: IT systems that affect the that was laid down ten years ago with Ul- human brain by rewarding experiences [14]. tima Online. Its a game mechanic that re- quires you to grind away, in many cases, try- Next to the analytical perspective, digital ing to level up. The whole goal in many of games science deals with principles and those games is to just get higher and higher problems of digital game synthesis: design, development, and implementation of enter- 1www.just-rpg.com, April 28, 2006 taining IT systems.

30 3. Digital Games as Generators by inventing a particular partial operator of of Game Playing Experience key combinations. The alphabet A of actions is supposed to be closed under this operator, After stressing the entertainment media per- i.e. those combinations of keys introduce ex- spective in the preceding section, the present tra actions into A. one will put more emphasis on considering digital games as IT systems. Based on those technicalities, every game G may be seen as generating a formal language Digital games like all other types of games ∗ Π(G) ⊆ A of potential game experiences. are generators of human playing experience. When humans play a game, sequences of So far, the approach laid out lies completely activities are unfolded. What sequence may within IT terminology, what directly leads be potentially generated is determined by the to a clarity outperforming all utterances of game mechanics. But usually only a small media sciences by far. But the limitations amount of those potential sequences unfold are obvious as well. in real game playing. For complex games G and under realistic A more formal terminology helps to clarify conditions of game playing, large parts of the issues under discussion. Π(G) are never played. It remains open whether or not the part of Π(G) really With a game G, there is given a finite set played within the cybernetic system of the of potential activities that may take place. players and the game [10] may be described In formally simple cases such as playing in sufficiently clear terms. Let’s call the set CHESS,EINSTEIN WURFELT¬ NICHT2,or of really played action sequences within A∗ JOSTLE2 , every game play π appears as a Ψ(G) ⊆ Π(G). What about Ψ(G) ⊂ Π(G)? finite sequence of moves of the alphabet of possible activities, i.e. π ∈ A∗. That this is not a scholastic discussion, shall In many games, especially in beat’em up be explained by means of a practical case. games such as SOUL CALIBUR (see figure3) Let us express it somehow formally first: In all cases of play observed by the author, playing the game AGE OF EMPIRES II in- volved playing frequently the subsequence of actions necessary to wall up some area, especial building walls to close bridges that lead to the player’s settlement. (Note that those actions can usually be performed by three subsequent clicks, a structure that can be easily identified in any string π ∈ Π(g).) Let us express the problem secondly in more semantic terminology: Is it necessary that this pattern of players’ behavior occurs in Figure 3: SOUL CALIBUR II Ð Just a Few successfully playing AGE OF EMPIRES II? Keys Forming Large Numbers of Patterns Or is it just easier to play that way? It is the author’s suspicion that very fit players might the necessity is arising to press certain keys be able to play the game without using the simultaneously. This may be formally re- pattern mentioned. Being fast enough, they presented by either introduced extra moves might be able to defeat invading hordes of for those combinations or, more elegantly, NPCs instead of avoiding battles. 2 EINSTEIN WURFELT¬ NICHT is a game developed It remains open whether or not the discussed by Ingo Alth¬ofer, Jena, Germany. JOSTLE has been designed by the present author. Both games have behavioral pattern occurs in all Π(G), in all been used for in-depth discussions of game patterns Ψ(G), but not necessarily in Π(G) \ Ψ(G), (see [12] for more details). or just in some part of Ψ(G).

31 4. Patterns in Game Play – 5. Pattern Concept Case Studies Key to Fun and Immersion Koster [14] who nicely motivates patterns as The author’s model of human game playing key to fun does not mention a single par- behavior was inspired by [9] and has been ticularly interesting case, whereas Bj¬ork and introduced in [11] in much detail. Holopainen [6] list more than twohundred of what they call patterns. They simply wrote down every regularity in games that came to their minds. In the author’s opinion, it does not make sense to call literally everything a pattern; pattern should not become another word for everything. We need to find pattern concepts neither to narrow nor to wide that may serve as fundamentals of an interdisciplinary digi- tal games science. Those concepts have to support a communication between scientists from disciplines as diverse as mathemat- Figure 4: Human Game Playing Behavior ics, computer science, artificial intelligence, Playing a game means to gain control over psychology, sociology, and media and com- the balance of indetermination and self- munication science. determination. When you play a game, you According to Christopher Alexander [1, 2] learn to master certain difficulties. Several who introduced the pattern concept to the regularities of successful playing are learned sciences, to engineering, and to the arts, unknowingly. patterns are something that relates to human A game without regularities is unplayable. behavior. To say it the other way around, every Only when mathematicians adopted and playable game exhibits a lot of regularities. adapted pattern concepts, they stripped them Our human brains have developed over mil- from the essentials. lions of years to become highly qualified The present section, with Gr¬atzer’s initially mechanisms of identifying regularities [17]. cited words in mind, aims at a collection of Identifying regularities wherever occurring, patterns in digital games that are of differ- whether consciously or unconsciously, is ent quality. The understanding of those case pleasant and gives us a good feelingÐthe key studies is deemed a basis for a subsequent to understand fun in game playing [14]. systematization. A crucial point for understanding the fasci- We did already discuss patterns such as the nation of games is to see their potentials of walling up pattern in AGE OF EMPIRES II immersion. Players who are able to iden- which may be expressed as substrings in cer- tify with there avatars may experience deep- tain (or possibly all) game plays π ∈ Ψ(G). est satisfaction. Those are patterns on the level of actions3. At the persona level of immersion, the vir- Conventional mathematical investigations tual world is just another place you might mostly speak about patterns of this type [5]. visit, like Sydney or Rome. Your avatar is simply the clothing you wear when you go Those patterns may have a more complex there. There is no more vehicle, no more sep- appearance due to the richness of the virtual arate character. Its just you, in the world. game world, but may be reduced to their es- (Richard Bartle, cited after [18]) sential structure more or less directly. Such a

This assumes a highly unconscious mastery 3”Moves” is the traditional term preferred in many of patternsÐkey to digital games science. publications; we are open to accept both expressions.

32 case shows in HALF-LIFE 2: EPISODE ONE game states in an attribute-value-way adopt- as illustrated by figure 5. ing terminology from relational databases. In the survival game Half-Life 2 the human There are bottlenecks characterized by some player is accompanied by his fellow NPC finite number of attributes a1,... ,an and named Alyx. When the player solves some particular values c1,... ,cn. Game states s key tasks, Alyx is taking care of the rest of in a bottleneck are defined by the property the work by eliminating all other attacking i=1,...,n s.ai = ci being valid in all states of adversaries. the bottleneck, but in no predecessor state.

Figure 6: Bottlenecks in a Game State Space Bottlenecks decompose the game state space into subspaces as depicted in figure 6.

Figure 5: HALF-LIFE 2: EPISODE ONE Ð The game THE DA VINCI CODE provides Cars for Blocking Exits of Ant Lions’ Nests an extreme case study of a rigid game space structure with narrow bottlenecks. On the During mission 3, the player has to pass game phase from which the screenshot in some underground parking area as depicted figure 7 is taken, the human player is stuck in figure 5. Certain adversary monsters until a certain number of actions has been named ant lions live in nests in the under- performed to set certain attribute values. ground. The player has to avoid or to fight them. A quite useful tactics is to block the nest exits by pushing cars to block the holes. The same pattern of player activity solves the problem repeatedly five or six times. Those are examples of patterns in sequences of actionsÐrepeatedly occurring substrings. In a larger number of point & click criminal story adventures such as BLACK MIRROR, AGATHA CHRISTIE:AND THEN THERE WERE NONE, and THE DA VINCI CODE solving riddles is essential to playing the game successfully. Figure 7: THE DA VINCI CODE Ð Bottle- Riddles (puzzles, ...) in games establish necks in the Game State Space Determining bottlenecks in the state space illustrated in Patterns of Human Game Playing Behavior figure 6. To reach certain sets of game states, The whole game THE DA VINCI CODE con- one needs to pass a comparably small class sists of subspaces like that. The description of other states before. of the pattern identified requires to speak Those bottlenecks may be characterized in about sets of game states and, perhaps, prop- different ways. Usually, the cardinality of erties of states in a set. bottlenecks is a secondary, but characteristic Compared to the pattern found above in AGE property. In many cases, it helps to look at OF EMPIRES, this is a higher level pattern.

33 Games such as BLACK MIRROR and In other games, avatars are turning their AGATHA CHRISTIE:AND THEN THERE heads, e.g., to signal something important. WERE NONE show the same pattern, but not In BLACK MIRROR certain keys are shining. as rigid as in THE DA VINCI CODE. Let us direct our attention to properly more Let us conclude the present case study by complex patterns. CALL OF DUTY is an some speculation. The strength of the player award winning4 ego-shooter distinguished guidance through the game’s state space by several features that make playing more makes those games good candidates for successful and, thus, more fun. implementing something like a dramaturgy [15] and this, in turn, makes them subject to related analyzes [8, 13]. There is a large family of games normally called ego-shooters or first person shooters; those games are usually also quite linearly organized and, therefore, potentially good candidates of employing dramaturgy. The player gets a usually simple mission and has to shoot a number of adversaries on his way. The game play is characterized by the first person perspective of seeing the whole environment ‘through the eyes of the avatar’. Figure 9: CALL OF DUTY Ð Patterns of Higher Level Guidance to Human Players Literally all fantasy adventure games con- tain potions (magic drinks in fancy bottles or In the game CALL OF DUTY, the human anything like that) for recreating an avatar’s player fights within a group of NPC fellows. life energy or, a little bit stronger, to give him The figures 9, 10 and 11 show screenshots extra lives. taken within a few subsequent seconds of game play. From the player’s perspective, In first person shooters, those potions have the experience is as follows: a different appearances and look much less magically, though their effect is as magic as in any fantasy game.

Figure 10: CALL OF DUTY Ð Screenshots from a Playing Experience, the Second Step

Figure 8: CALL OF DUTY Ð Widely Used I am running through a building. My NPC Patterns between Helpful and Annoying fellows pass me. They leave the house through some doorway and I follow them There are simple patterns of presentation to (figure 9). They turn left (figure 9 and 10) attract the player’s attention to the potions and I do so as well. Outside (figure 10) they such a first-aid kits in the ego-shooter CALL OF DUTY that are glowing or even blinking. 4more than 80 awards since its publication in 2003

34 turn left and I am rather hot on their heels. Shapped as the so-called morphing ball, she In this way, we reach the next point where is better prepared for speed runs and able to we are facing our adversaries (figure 11). 35 pass quite narrow passages. The syntactic form of the transformation as a behavioral pattern is extremely simpleÐjust a single key pressed, i.e., a single letter in a sequence π ∈ Ψ(G). Obviously, identifying the change from the avatar Samus Aran into the morphing ball and, vice versa, unfolding Samus Aran from her spherical form does not say much about the experience of game playing. Identifying a pattern syntactically rarely tells much. The crux is to relate syntax and Figure 11: CALL OF DUTY Ð Screenshots semantics. from a Playing Experience, the Third Step The METROID case study has been chosen This guidance by NPC fellows exemplified to illustrate the key problem in especially is very valuable to the human player in some simple terms. respect. Firstly, the NPC guidance helps In other cases, both syntax and semantics are finding the way in the virtual environment. difficult, but nevertheless important patterns Secondly, it keeps the pace high enough to occur. Consider, for instance, the survival cope with the time limits. Thirdly, the forced game CALL OF CTHULHU which did appear speed of movement keeps the tension high in a PC version in Germany in early 2006. and contributes to the overall experience. How is a truly creepy atmosphere provoked Last but not least, when the player’s NPC to arise? fellows run for cover, this indicates some danger and helps the human player also to act appropriately, e.g., to take cover as well. The pattern in the NPC behavior is difficult to describe in terms of states and moves, but easy to identify in game play.

Figure 13: The Creepy CALL OF CTHULHU

The question for the principles of touching human emotions is among the most difficult and deep problems of the arts. There are no universal recipes [15], but several dozens of Figure 12: METROID Ð The Morphing Ball examples we may learn from [8, 13]. In the METROID games series, the human One may call some of the tricks in use in the player acts as the intergalactic bounty hunter film industry patterns. There is some hope Samus Aran. This female NPC can change that the digital games science will also learn her appearance drastically (figure 12). from the arts.

37 6. Hierarchies of Pattern Concepts For this purpose, the digital games science needs its modeling language(s)Ðthe focus of The case studies of the preceding section this paper. have brought to light some central issue of patterns in digital games science: Different The insights one can gain depend very much patterns may require quite different terms to on the language in use. describe them. Let us have a brief look at one of the classics, 5 First of all, do we expect and work for Byron Haskin’s film of the famous book any universal approach that applies to any ”The War of the Worlds”, 1953. genreÐwhatever the term genre might meanÐ The film easily decomposes into 31 scenes of digital games? which may be grouped into 5 main phases. Experienced readers may have recognized that the author deviates from the usage of genre terms that appear currently in the ma- jority of publications about digital games. The current practice results in a partially in- consistent usage of terminology. There is an urgent need for clarification, but not yet any firm basis. Therefore, the author prefers to avoid misleading, partially inconsistent terms and, instead, circumscribes what he wantsÐas seen aboveÐin his own words. The progress of a digital games science will also contribute to a revision of terminology. The author reserves his contribution to some forthcoming publication. So, back to the issues of patterns in digital games without any considerations specific to the one or to the other genre. To be honest, what will be proposed below has been adopted and adapted from media research, to some extent.

Consider systematic film analysis [8, 13] as Figure 14: Formal Tension in the Film ”The a sample source. A standard approach is to War of the Worlds”’, Byron Haskin, 1953 separate a certain film under consideration in scenes and to group scenes into phases. Based on such a structuring of the flow of On this basis, you may go into depth and activity, one might analyze whatever seems attribute properties to scenes for possibly of interest. Following [8], we have simply identifying regularities when looking at the counted the frequency of camera shots per film representation as a whole. minute. In figure 14, the scenes proceed from the top to the bottom. On the hori- Notice that, interestingly, systematic film zontal axis, the frequency of shots is shown. analysis typically does not talk about the What the reader can see might be interpreted film itself, but about the film’s particular as some pattern(s) employed when making representation derived. In other words, we the film. build models and perform reasoning about the models built. 5The H. G. Wells book ”The War of the Worlds” became properly famous by Orson Welles’ radio That’s what we are going to do in the digital feature broadcasted on October 30, 1938, by CBS in games science as well. Northern America. It made Orson Wells famous.

36 When analyzing digital games, let us have a Once again, we take inspiration from the look at game play in a similar fashion. When media sciences. For talking about drama and game play proceeds, one scene follows the film, there does exist a rich terminology other. It depends very much on what you with concepts such as anagnorisis, climax, consider a scene. d«enouement, and peripeteia, to mention a few. We need to develop wordsÐor, better, conceptsÐto talk about playing a game. The language of digital games science does not yet exist. We need to develop a layered language, whereby layers may be visualized like the columns in figure 15; there is, naturally, no limitation to just three layers. Figure 15: Some Schematic Approach to the (not only) Digital Game Play Understanding A few layers are immediately obvious: • On the atomistic level, one might consider game states and moves, every action in A a scene. In such a way, • sets of game states, the sequence of scenes listed (in figure 15 • composite actions that establish some from the top to the bottom) coincides with meaning which may be circumscribed ∈ π Ψ(G). on a higher language level. Every traditional protocol of a CHESS game, Walling up a bridge is of the third type. We for instance, forms such a sequence. The are still able to name the actions and have above-mentioned sequences of actions in a look at every syntactic detail, but it seems AGE OF EMPIRES II fit as well. clearly more convenient to communicate on Even without any assigned particular seman- a semantic level. tics, one may go for string mining to exhibit Other languages constructs may be intro- potential patterns of game play. duced to talk about structural digital games Though this seems trivial because of being features such as much too formal, it yields several results • such as the preferred combinations of hits, embedding of cut scenes, kicks, and movements in a row a particular • integration of puzzles, player of SOUL CALIBUR II performs. • changing camera positions (switching The case study of AGE OF EMPIRES II was between first person and third person intended to point to the fact that several low- views, e.g.). level patterns may have some higher-level The third point comes close to the level of meaning such as walling-up a bridge. formal tension (see figure 14). For understanding digital games, we need to Another higher level refers to informa- decide about the language expressions ad- tion resp. knowledge of human players and missible to describe meaning as indicated in NPCs; figure 15. • information systematically delivered in What might seem difficult at a first glance portions, becomes easier when having a closer look. • Let us just ask what terminology we need to indications given to make a player wor- tell the story6 of a game play. rying (towards ’Eigenaffekt’a ` la [15]). On the highest level according to the present 6This is not to be confused with storytelling as it is used in some communities: The author does not approach, one might talk about the player’s claim that every game tells a story. But we may tell a mental states, her/his goals and intentions story about every game we are playing. and the like. Such a language level would be

37 necessary to address complex patterns such wins the game. The coins can only be seen as those discussed with respect to the game on the map on the players PDAs. CALL OF CTHULHU above (see figure 13). When players come close to a treasure, they How to cause affright? can pick it up (see the related button on the One might go even further and speak about PDA screen). For saving coins to the trea- more than the game itself. What about the sure chest, they have to come into the reach embedding of game play in the surrounding of the server’s wireless network. world? There is a variety of more player interactions Important aspects are not yet covered by the such as picking pockets (by using the button approach sketched above. This fact shall be ‘PickPocket’ on the PDA screen), when one illustrated by means of a single case study. player comes close to another one. Playing TREASURE7 is a digital game developed for the game needs obviously certain awareness academic purposes. The focus of the under- of network connectivity issues. lying research is on communication using Among many other issues, the developers of varying protocols such as GPS or Bluetooth. TREASURE have been interested in the play- ers’ behavior with respect to seams. Such a seam is a break (or gap) between tools or media. Seams appear frequently when dif- ferent digital systems are used in combina- tion, or when they are used along with the other conventional media that make up our everyday environment. The TREASURE game has been intentionally designed as a seamful one. Mastery of the game requires to perceive and master seams such as a missing connection to the server. Patterns that show in the TREASURE game are of a quality quite different from those patterns discussed so far. To express those patterns, one may need concepts from the environment such as locations and relations from urban environments, communication protocols, and, perhaps, combinations of them. For illustration, a certain pattern might con- tain a description saying that a human player Figure 16: TREASURE Ð Patterns of a with his PDA is in reach of some wire- Higher Level Ð Human-System Interaction less network able to connect his PDA to the background server. The weatherÐreally the The aim of the TREASURE game is to collect ‘coins’ scattered over an urban area such as true weather out thereÐmight play a role in a park on the PDA’s display in figure 16. The patterns. players’ goal is to get them in to the treasure The technology is bringing game playing chest (technically, a server). Two teams of back to our natural environment where it two players compete against each other. A belongs8. The digital games science is not clock counts down, and the team with the yet prepared to deal with those phenomena. most coins in their treasure chest at the end 8. . . such as TV brought murder back to home 7The PDA screenshot on display in figure 16 is where it belongs, a well-known saying attributed to published with permission of the authors. Alfred Hitchcock

38 7. A Brief Outlook 8. Acknowledgments A large variety of problems have not even Several of my students delivered substantial been touched in the present short paper; here contributions to our recent research towards is just one quite complex case: a Digital Games Science. In the German manual of BLACK &WHITE, There is not sufficient space to go into detail, a strategic development game, you find a but one paragraph should be used to give a rather irritating statement on page 19: Denke brief impression of each ongoing work. immer daran, dass dir deine Kreatur im Jochen F. Bohm¨ , currently in Saarbr¬ucken, Laufe der Zeit immer ahnlicher¨ wird. Sie gave me comprehensive introductions into verkorpert¨ deine Personlichkeit¨ und deine digital games I did not know much about. Spielweise. Arne Primke, currently in , recently You play with ”god’s hand” and develop a started investigations into patterns which creature such as the cow on display (see possibly show in the two ego-shooter series figute 17). You teach and train your creature CALL OF DUTY and MEDAL OF HONOR. such that it develops a particular behavior. Carsten Lenerz and Sebastian Dittmann The cited German text from the manual says have undertaken a systematic analysis of the that players should be aware of the fact that METROID game series under the perspective their digital creatures’ behavior is not only of occurrence and development of patterns. reflecting the way they play, but the players’ personalities. This sounds disconcerting. Sebastian Gaiser and Andreas Schmidt are investigating the development of classical Indeed, some communication with players jump’n run games such as SUPER MARIO. revealed concern about the extent to which the creature’s behavior is telling about the Monique Scholz and Steffen Rassler became player’s individuality. fascinated by the truly unique atmosphere players are experiencing when playing the survival/horror game CALL OF CTHULHU: DARK CORNERS OF THE EARTH. They started their own search for the patterns that stimulate such an impressive atmosphere. Steffen Kehlert systematized the utilization of commercial modern music in particular digital games such as sports and fun sports games like, for instance, TONY HAWKS and NEED FOR SPEED. Many more of my students in Darmstadt, in Ilmenau, and in Mannheim contributed Figure 17: BLACK &WHITE Ð Teach Your through engaged discussions, useful hints, Creature and Project Your Own Personality interesting suggestions, and by pointing to Patterns in game playing are reflected by necessary corrections. patterns in the NPC’s behavior, a relation- Last but not least, the author’s work towards ship which has not yet been subject to any a digital games science is very much driven systematic investigation. The issue might be by considerations of memetics (see [11, 12] of interest both to the social sciences and to for a much more comprehensive discussion). IT and AIÐthere is exciting work to come. It seems particularly exciting to find out how The topic addressed needs expressions for ideas of game and play evolve over time. qualities of an NPC’s behavior and terms for The present author’s colleagues and friends potentially related features of a player’s per- Susan Blackmore and Yuzuru Tanaka are in- sonality; we arrive at patterns in psychology. valuable partners in related discussions.

39 Appendix A: (Digital) Games that [6] S. Bj¬ork and J. Holopainen. Patterns are Mentioned in this Publication in Game Design. Hingham, MA, USA: Charles River Media, 2004. AGATHA CHRISTIE:AND THEN THERE WERE NONE [7] M. Csikszentmihalyi. Flow: The Psychology of Optimal Experience. AGE OF EMPIRES II Harper Perennial, 1991. BLACK MIRROR BLACK &WHITE [8] W. Faulstich. Grundkurs Filmanalyse. M¬unchen: Wilhelm Fink Verlag, 2002. CALL OF CTHULHU:DARK CORNER OF THE EARTH [9] J. Fritz. Das Spiel verstehen. Eine CALL OF DUTY Einfuhrung¨ in Theorie und Bedeutung. CHESS Weinheim & M¬unchen: Juventa, 2004. EINSTEIN WURFELT¬ NICHT [10] K. Gillen. Culture wargames. The Es- HALF-LIFE 2: EPISODE ONE capist, (1):8Ð13, 2005. JOSTLE [11] K. P. Jantke. Digital game knowl- MEDAL OF HONOR edge media (Invited Keynote). In METROID Y. Tanaka, editor, Proceedings of the 3rd International Symposium on Ubiq- NEED FOR SPEED uitous Knowledge Network Environ- SOUL CALIBUR II ment, February 27 March 1, 2006, SUPER COLUMBINE MASSACRE RPG Sapporo Convention Center, Sapporo, SUPER MARIO Japan, Volume of Keynote Speaker Presentations, pages 53Ð83. Hokkaido THE DA VINCI CODE University of Sapporo, Japan, 2006. TONY HAWKS TREASURE [12] K. P. Jantke. Knowledge evolution in game design Ð just for fun. In CSIT ULTIMA ONLINE 2006, Amman, Jordan, April 5-7, 2006, References 2006. [13] H. Korte. Einfuhrung¨ in die Syste- [1] C. Alexander. The Timeless Way of matische Filmanalyse. Berlin: Erich Building. New York: Oxford Univer- Schmidt Verlag, 2004. sity Press, 1979. [14] R. Koster. A Theory of Fun for [2] C. Alexander, S. Ishikawa, and M. Sil- Game Design. Scottsdale, AZ, USA: verstein. A Pattern Language.New Paraglyph Press, Inc., 2005. York: Oxford University Press, 1977. [15] P. Rabenalt. Filmdramaturgie. VIS- [3] R. Bartle. I was young and I needed the TAS media production, 2004. money. The Escapist, (43):4Ð9, 2006. [16] T. Schreiner. Faszination Videospiele. [4] J. Berndt. Bildschirmspiele: Faszina- PLAYZONE, (08/06, Ausg. 94):16Ð21, tion und Wirkung auf die heutige Ju- 2006. gend.M¬unster: Monsenstein und Van- nerdat, 2005. [17] M. Spitzer. Lernen. Gehirnforschung und die Schule des Lebens. Spektrum [5] J. Bewersdorff. Luck, Logic & White Akademischer Verlag, 2002. Lies. The Mathematics of Games. [18] M. Wallace. In celebration of the inner Wellesley, MA, USA: A K Peters, Rouge. the escapist, (30):4Ð6, 2006. 2005.

40 Security Aspects in Online Games Targets, Threats and Mechanisms

Anja Beyer Technische Universitat¨ Ilmenau Institut fur¨ Medien und Kommunikationswissenschaft Am Eichicht 1, 98693 Ilmenau [email protected]

Abstract This paper presents a summary of possible threats in Online Games based on the fundamentals of IT-Security. The IT security is defined via its targets confidentiality, integrity, availability, privacy and non-repudiation. Possible threats are identified and protection mechanisms are discussed. An outlook on the oncoming project to security in Online Games will conclude the paper. 1. Motivation other increasingly. That means, that playing an Online Game is not only fun but in some Since ”World of Warcraft (WoW)” was pub- cases also real business [10]. For the players lished in the USA in 2004 Online Games ”the virtual worlds are as real as the physi- are truly booming. WoW started with about cal world” [6]. They do not only have fun 200.000 players in the United States. In playing the games, they spend money and the meanwhile there are about six million leisure time in the virtual world and experi- people playing WoW on 40 servers around ence flow [5]. The players get carried away the world in one environment. For games in the games and they attach a certain im- like this, where thousands of players can portance on them. At this point the games play simultanuously in the same environ- become a serious touch and thoughts of se- ment via the Internet, a new term was cre- curity issues are necessary. ated: ”Massive(ly) Multiplayer Online Role- Playing Games (MMORPG)”. Before the In the academic field a lot of work is done Online Games became popular the games on graphics and hardware improvement for were designed for one player against a vir- games but less work is done on security is- tual player simulated by the game software. sues. Smed et al. [15] concentrate on the net- The Online Games allow the people play- working issues, like latency, bandwidth and ing together (with real human beings) from computational power. Kirmse and Kirmse all over the world [16]. That this develop- [9] recognize two security goals for online ment is an important one (not only) for the games: protecting sensitive data (like credit future of games is invigorated through the card numbers) and providing a fair playing statement of Kirmse and Kirmse. They say field. These two goals are of high concern Online Games are the biggest revolution in but do not cover all security needs. Other games since the introduction of home com- publications are only focused to cheating puters [9]. WoW is the most popular and [17, 3, 1]. Of course honest players do not successful MMORPG but not the only one like other players cheating and so they exert as there are lots of others (e.g. Dungeons pressure on the publishers to provide a fair and Dragons online, Guild Wars, Lineage 2). game setting. The games industry has real- ized that they will lose players and money A relatively new effect in this genre is that when they do not act against cheating [15]. the virtual worlds and the reality affect each In this paper the author wants to show that

41 IT-Security is an important topic and more individual persons to decide about the usage than just protection against cheating. How- of their personal data. ever this paper does not provide a full list of all possible threats, so the author focuses 2.2. ... and in Online Games the most valuable assets of the player and the From [11] we have learned that it is impor- provider. tant not to consider IT-Security from only one perspective (multilateral security). That 2. IT-Security Targets is why this chapter shows which security re- From the daily IT news one could get an quirements apply to Online Games from the impression on how important the topic IT- players and from the publishers view. Security already is. The number of ”bad As described above, confidentiality is nec- news” is increasing steadily. With more and essary in interpersonal communication and more applications connected to the insecure in e-commerce transactions. This is espe- network ’Internet’, the problems will not be- cially true in Games and specifically essen- come less and the topic IT-Security will be tial in Online Games. The communication of even more importance in the future. To of two players or a team is done via the In- give a short introduction the author wants to ternet which does not provide any protection give an overview on what IT-Security is. against eavesdropping or trapping informa- tion. When the players in an Online Game 2.1. In General ... arrange a strategy to fight against other play- ers or to win a match, it is unaccepted that In [11], IT-Security is mostly defined via its this information reaches the combatant be- protection targets confidentiality, integrity cause in that case the combatant would have and availibility. In e-commerce systems the an unfair advantage. Also payment data, like goals non-repudiation and privacy are addi- credit card numbers have to be confidential. tionally important. From the players perspective there are two Confidentiality means the protection main concerns that affect integrity: on the against unauthorized access to data and one hand the player is interested in the information. Communication between reached score or level, stored on the server two partners is thought to take place se- of the publisher, not to be unauthorized mod- cretly. That means that no third party is ified. The most valuable asset of the player allowed to acquire knowledge about the is the reached score or level of the game. communication. On the other hand, the player’s system has Integrity refers to protection against unau- to be kept integer, so that no unauthorized thorized modification of data or information: changes are done in the system without the the shown information has to be correct and recognition of the user (as lately happened presented unmodified. with the Sony root kit [14]). As the player pays a monthly fee (e 10,99 for WoW) for Availability indicates the protection against playing Online Games on the publishers unauthorized interference of functionality. server, he or she will not accept an hour or Non-repudiation expresses the unautho- daily long unavailability of the server. So the rized non-commitment, meaning the loss of the publisher is forced to install mechanisms bindingness. Business partners have to be that assure the availability of the servers (see sure that both stay with their proposal to sell chapter 3). or buy. Once agreed to a contract and having paid Privacy, also regulated by the EU Directive for the usage, the player is unwilling to ac- 95/46/EC on the protection of personal data cept a repudiation of the contract from the [4], is the right of an individual person on publisher. But this is exactly what regularly informational self determination. It allows happens when players are proven cheating.

42 Also privacy is a goal that must not be un- 3.1. On Clients Side derestimated. According to the law [4], each An attack on confidentiality on clients side person has the right to decide who may save is the eavesdropping of the communication and to what extent he or she is allowed to between players. If a combatant is able to save the personal data. listen to their communication and finds out The publishers most valuable asset is of about their strategy, he or she would have an course the game itself because this earns the unfair advantage. Considering confidential- money. So the publisher is interested in ity, a player also has to be aware of phishing the integrity of the game. This means that attacks. The term phishing derives from the no unauthorized person should be able to two words password and fishing. It is a crim- change the setting of the game. In combina- inal activity belonging to social engineering tion with that, the availability of the server [12]. An attacker pretends to be a trustwor- is an important target for the publisher. So thy person or company and tries to fraudu- people will only pay for the game if they are lently acquire account names and passwords able to play. Here the targets of the player by asking the players directly using mail. and the publisher are the same. Furthermore on confidentiality, there is the Furthermore, the publisher wants to make risk of malware injection. Malware stands the players satisfied to make them keep on for malicious software and means viruses, paying. They will do if their interests are trojan horses, worms, etc. that are pro- fulfilled. They will not be satisfied if other grammed to spy out account data and pass- players will have an unfair advantage. So words. the publisher has to act against cheating. In Considering integrity, there is the risk of the example of WoW the publisher scans the unauthorized modification of the clients user’s system for cheating tools which is a hardware or software configuration. heavy intervention in the users system. A Denial-of-Service (DOS) attack can make Also, the usability of the system is an impor- the client game software unserviceable so tant target because high support costs have that it cannot be used anymore. immense economic consequences. Privacy attacks are also possible, when an It is obvious that the security targets of the attacker is able to spy out personal data on publishers are not always the same like the the clients side. security targets of the players. They com- If a player does not stay with the promise pete in some cases (system scan, data col- to buy or pay, we speak of the threat of lection). non-repudiation. This is e.g. the case if 3. Security Threats in Online players are displeased with dishonest play- ers cheating. This once happened with the Games MMORPG Call of Duty 2, when players In the following chapters, possible threats on called out a boycott [2]. client side and on servers/publishers side are described. The classification is according to 3.2. On Servers Side where the attack takes place. This means that the damage can be at another place. If A possible threat concerning confidentiality e.g. the publisher does not protect its servers on server side is to spy out secret informa- in an adequate way, there might be the possi- tion about the combatant, e.g. the amount of bility that an attacker can read out the credit gold he or she owns. card payment data of the customers. The at- There are several attacks on integrity on the tack is on the servers side but the user is af- servers side. First of all, there is the threat fected of the damage if the attacker uses his of unauthorized modification of the games or her credit card number for not intended functionality. Since the game is the most purposes. valuable asset for the publisher, it might be

43 the worst case scenario if an attacker suc- cheats into 15 categories: ceeds in changing the games setting, e.g. to deactivate a whole continent of an environ- • Cheating due to misplaced trust (A) ment. • Cheating due to Collusion (B) The games scores of the players are saved on the server. So the publisher has to protect • Cheating by Abusing Game Procedure the scores from unauthorized modification (C) of the games scores of the players. A third attack imaginable is URL-Spoofing. This • Cheating related to Virtual Assets (D) means, if an attacker succeeds in changing • Cheating due to Machine Intelligence the URL’s of the gaming servers, the players (E) are forwarded to a third-party server with the risk of revealing sensitive information to an • Cheating via the Graphics driver (F) untrustworthy party. • Cheating by Denying Service to Peer There is also the risk of Non-Availability of Players (G) the server, e.g. due to unscheduled mainte- nance or DOS attacks. • Timing Cheating (H) Publishing personal data of players either willfully or through inappropriate protection • Cheating by Compromising Passwords of the storage is an attack on Privacy. (I) There is the danger of an attack on Non- • Cheating due to Lack of Secrecy (J) Repudiation on servers side, too. If the pub- lisher decides to eliminate players from the • Cheating due to Lack of Authentication game, e.g. because he or she suspects or (K) prooves them cheating, he steps back from • Cheating by Exploiting a Bug or Loop- the contract although the player has payed hole (L) for the usage. That is why the publishers insert an clause in general terms and condi- • Cheating by Compromising Game tions to leave that option open. Servers (M) Some examples show that it must not always • Cheating Related to Internal Misuse be the bad attackers from the outside who (N) threat the games. The threats can also come from the inside (e.g. unsatisfied or careless • Cheating by Social Engineering (O) employees) or due to act of nature beyond control. The above list covers two aspects why cheat- ing is possible: first because of the tech- 3.3. Cheating nical possibilities (I, J, K, L, M, N): The cheater, if he or she has enough malicious in- As the above threats concern digital games tend might exploit the games procedure and they are not really special for Online Games. bugs to gain an unfair advantage. Second is All other IT application systems are ex- the social aspect (D, G, O): some players are posed to these threats, too. As digital games not aware of the threats and give trust to the are special IT application systems [7] these wrong people (O). threats also apply to them. Not a new but a very special threat to digital games is cheat- 4. Security Mechanisms ing. Yan [8] defines ”any behavior that a player may use to get an unfair advantage, For a wide variety of threats there already or achieve a target that he is not supposed to exist protection mechanisms. A secret com- is cheating.” Yan and Randell [16] classify munication can be realized via encryption

44 and virtual private networks (VPN). Protec- mechanisms are realized to meet the secu- tion against malicious software can be done rity requirements of these games. We will through firewalls, virus scanners, intrusion provide a proposal of which and how further detection systems and an adequate config- mechanisms can be included. uration of hard- and software. The condi- Our aim is to raise the awareness for exist- tion is that they are kept updated. To avoid ing threats of the producers, publishers and a breakdown of the servers due to mainte- players. In our ”Security in Online Games” nance or due to unforeseen incident, back-up project, starting from August 2006, we want systems are recommended. to survey how security can be implemented To prevent players from cheating the one and without the players loosing the fun of gam- only solution at the moment is to scan the ing (e.g. by clicking away security alerts). users system to detect additionally installed References cheating software. The game producers in- tegrate software to scan the users system for [1] H. Banavar. Security issues in cheating software. One famous tool is called multi-player, distributed network PunkBuster [13] which is used in e.g. World games. http://ww2.cs.fsu.edu/ ba- of Warcraft and Call of Duty 2. If once a navar/research/NSPaper.htm, 2000. player is proven cheating, he or she will be [2] Call of duty 2 - community ruft zum eliminated from the game. From the authors boykott auf. Winfuture - Windows view, until now, there is not a good solution Online Magazin, to fight against cheating. Especially the so- http://www.winfuture.de/news, cial aspects of cheating, like social engineer- 23212.html [28/11/2005], June ing or denying service, can not be prevented. 2006. It follows that not all of the threats can be [3] Y.-C Chen, J.-J. Hwang, R. Song, faced with pure technical solutions. For ex- G. Yee, and L. Korba. Online gam- ample phishing is a real awareness problem. ing cheating and security issue. In- It is important to call the attention of the peo- ternational Conference on Information Technology: Coding and Computing ple to reduce such attacks. (ITCC’05), 1:518–523, 2005. Maybe not all forms of cheating can be cir- cumvented easily, but at least a proper im- [4] Directive 95/46/EC of the european parliament and of the council of 24 oc- plementation and configuration is the mini- tober 1995 on the protection of indi- mum. To meet the social aspects of cheating viduals with regard to the processing takes much more effort because the aware- of personal data and on the free move- ness of the people has to grow in their heads. ment of such data, 1995. [5] Moore R.J. Ducheneaut, N. The 5. Conclusions and Further Work social side of gaming: A study of interaction patterns in a massively Online Games are IT application systems multiplayer online game. Proceed- which have to meet security standards as ings of ACM CSCW’04 Conference usual networked IT applications do. Fur- on Computer-Supported Cooperative thermore, the games have an additional emo- Work, 6(3):360–369, 2004. tional factor. As the players identify them- [6] M. Jakobsson. Virtual worlds and so- selves with the avatars and spend much time cial interaction design. PhD thesis, in the games, they will have to be satisfied Umea˚ University, Department of Infor- by the games industry who earn money with matics, 2006. the games. [7] K. P. Jantke. Digitale Spiele In our future work, we want to analyse Forschung, Technologie, Wirkung the security mechanisms in current Online & Markt, Invited Plenary Talk, Games. We expect that not enough security Wismar, Germany, June 8/9, 2006.

45 [8] Yan J.J and Choi H.-J. Security issues in online games. The Electronic Li- brary, 20(2):125–133, 2002. [9] C. Kirmse and A. Kirmse. Security in online games. Gamasutra http://www.gamasutra.com, originally published in Game Developer, July 1997. [10] Dr. A. Lober and O. Weber. Money for nothing? - Handel mit Spiel-Accounts und virtuellen Gegenstanden.¨ c’t, 20:178–180, 2005. [11] G. Muller¨ and A. Pfitzmann. Sicher- heit, insbesondere mehrseitige IT- Sicherheit. In Mehrseitige Sicherheit in der Kommunikationstechnik Ver- fahren, Komponenten, Integration, pages 21–29. Addison-Wesley- Longman, 1997. [12] Phishing. Wikipedia, http://en.wikipedia.org/wiki/Phishing [29/06/2006], June 2006. [13] PunkBuster Anti-Cheating Software. http://www.evenbalance.com, July 2006. [14] M. Russinovich. Sony, rootkits and digital rights management gone too far. http://www.sysinternals.com/blog/ 2005 10 01 archive.html, published 31.10.2005, June 2006. [15] Hakonen Harri Smed J., Kauko- ranta T. Aspects of networking in multiplayer computer games. In Pro- ceedings of International Conference on Application and Development of Computer Games in the 21st Cen- tury, pages 74–81, 2001. available at http://www.tucs.fi/Publications/ pro- ceedings/pSmKaHaa.php. [16] J. Yan and B. Randell. Security in computer games: from pong to online poker. Technical Report Series, Pub- lished by the University of Newcastle upon Tyne, 2005. [17] George Yee, Larry Korba, Ronggong Song, and Ying-Chieh Chen. Towards designing secure online games. In 20th International Conference on Advanced Information Networking and Applica- tions (AINA 2006), 18-20 April 2006, Vienna, Austria, pages 44–48, 2006.

46 Personal File Sharing in a Personal Bluetooth Environment

Jürgen Nützel and Mario Kubek Technische Universität Ilmenau Institut für Medien- und Kommunikationswissenschaft Am Eichicht 1 98684 Ilmenau

Abstract This paper introduces a piconet file sharing application using Bluetooth. The described Java- based application is designed for mobile devices supporting J2ME. The focus of the paper is on the description of the needed client/server architecture, the used message protocols and the final implementation. Therefore common Bluetooth protocols and Bluetooth APIs for J2ME are explained. The paper concludes with a view on future enhancements regarding automatic program activation, the handling of DRM and the integration of user profiles.

1. Introduction The Bluetooth Special Interest Group (SIG) [1] with members like Ericsson, Modern mobile devices like cell phones IBM, Intel and Nokia releases the or PocketPCs offer many different Bluetooth specification. The Bluetooth connectivity solutions. Beside WAN standard 1.x supports data rates of 723,2 protocols based on W-CDMA and GSM, kBit/s. In 2004 version 2.0 was released. short-distance wireless communication This standard enables data rates of 2.1 technologies like IrDA, Bluetooth and MBit/s. A Bluetooth network (Piconet) WLAN are very common. can contain up to 16,7 Mio participants, The usage of these technologies is free whereby only eight devices can be active of charge (WLAN can be an exception) at the same time. Application specific and does not require network operators profiles, that represent the capabilities of or providers for data transfer, which a mobile device, are the structure for makes them attractive for users that are data transfer via Bluetooth. Common currently in the same area and need to profiles among others are the Dial-up exchange data. While IrDA based Networking Profile, the File Transfer connections have a lack of flexibility as Profile, the Headset Profile and the Fax they can only be established between Profile. Their signature is exchanged two devices that point to one another between the devices, so that their during the data transfer, WLAN support capabilities can be matched in order to is only available in premium class use the different services. Bluetooth devices. The Bluetooth technology in even supports security mechanisms as mobile devices, which is the main focus authentication and confidentiality. of this paper, offers more flexibility than Additionally many mobile operating IrDA and is available in nearly every systems like Windows Mobile or business class device. Data transmission Symbian offer Bluetooth APIs for is realized via radio communication.

47 development. Applications built for protocols like RFCOMM (radio these devices will not run on other frequency communication) and SDP devices. Therefore the Java virtual (service discovery protocol), as they are machine has been developed. This way used in the file-sharing application J2ME (the mobile edition of Java) BlueMatch. Therefore it is useful to applications, called MIDlets, can run on introduce the Bluetooth API (JSR-82) many devices with different operating first, because it enables the usage of systems. The J2ME is built on these protocols in mobile Java configurations and profiles. They are the applications. basic classes and methods that J2ME The Bluetooth API (JSR-82) [2] consists programs can use on the devices. The of two packages javax.bluetooth and most common profile is the MIDP, based javax.obex, the optional Object on the Connected Limited Device Exchange API. The classes in Configuration (CLDC). These classes javax.bluetooth offer methods for can be extended by the device vendors discovery of devices and services, the with optional APIs to support device support for connections using L2CAP specific features. Examples are the (logical link and adaptation protocol) Mobile Media API (JSR-135), the and RFCOMM, as well as device and Location API (JSR-179) and the data interfaces. These packages rely on Bluetooth API (JSR-82). JSR stands for the generic connection framework Java specification request. (GCF) in package javax.microedition.io. This paper describes a mobile file We do not want to cover the methods sharing application, called BlueMatch, offered in the API, instead we discuss which can run on devices supporting the general abilities of these classes. MIDP 2.0 and the Bluetooth API (JSR- The following Bluetooth protocols can 82). The special characteristic of be used in JSR-82: BlueMatch lies in its combined client/server architecture, which follows • SDP for device/service discovery and the peer-to-peer model, so users can service registration search and download files, while others • L2CAP for packet-oriented can download from their devices at the connections same time. • RFCOMM for stream-oriented First we want to present the components connections of the Bluetooth API before we go into details of BlueMatch's system • OBEX for transferring objects like architecture and message protocol. files, images and vCards Future enhancements and a view on the The protocol RFCOMM, which offers handling of DRM-encrypted files in such two-way communication over virtual an application conclude the paper. serial ports, resides on top of L2CAP and enables multiple concurrent stream- 2. The Bluetooth API oriented connections between devices, even over one Bluetooth link. L2CAP The Bluetooth stack consists of several segments the data by RFCOMM into protocol layers, that are implemented in packets before they are sent and hardware and software. The low level reassembles received packets into larger layers radio, baseband/link controller messages. L2CAP is an packet-oriented and link manager are not covered in this protocol and is suitable for developers paper. We will focus on higher level that wish to build custom packet-

48 oriented protocols. In this case flow users in passing to share files via control has to be implemented explicitly. Bluetooth. As a MIDlet BlueMatch can RFCOMM supports this already and run on mobile devices that support enables security parameters. Therefore MIDP 2.0 applications and implement the BlueMatch application relies on this the Bluetooth API (JSR-82) and the file protocol. OBEX is implemented on top connection API (JSR-75). It is based on of RFCOMM, but actually only a few a client/server architecture, whereby the devices on the market support this client part runs concurrently with the optional protocol in Java applications. server part in separate threads. That So file transfer has to be implemented means an instance can connect to the using RFCOMM or L2CAP. The service server on other devices, but can even discovery protocol (SDP) is used by process connections initialized by clients to search for Bluetooth devices remote devices. On incoming and services. It is also responsible for connections, the server accepts and registering custom service records in the opens them automatically without asking device's service discovery database the user. (SDDB). After registration a server can Currently these features are already accept and open connections by clients. implemented: • search and display remote devices Security Mechanisms running BlueMatch RFCOMM enables secure Bluetooth • connections. These security parameters download of file lists from remote for service usage can be set: devices • authenticate • download of files from remote devices • encrypt • serving of local file lists and files for remote devices • authorize • leaving notification for other members Authentication means, that two devices when quitting BlueMatch must be paired in order to exchange data. Encryption can be activated to secure a These functions are the use cases of this link between two devices and relies on application, embedded in the general authentication. The keys used for system architecture, which will now be authentication and encryption are described. generated by modified versions of the 128-bit block cipher algorithm SAFER+ The Client Implementation [3]. Authorization is used to demand a Each BlueMatch instance is starting a permission for remote devices to use a client thread. In this thread, remote service. The requested device grants or devices and their BlueMatch services are denies access to a service, for instance searched and called. After a connection after asking the user through a man- to a service has been established, machine interface to grant trust. This BlueMatch specific commands can be trust can be temporary or permanent. transmitted to the server. These commands will be described in the 3. BlueMatch BlueMatch message protocol (chapter 4). In theory the client and the server Now we want to present the prototype together can hold a maximum of seven application BlueMatch, which enables connections to remote devices, for

49 instance for simultaneous downloads. • incoming files from remote devices But if this limit is reached, the client and The lists of found remote devices and the server as well can not establish or BlueMatch services are used to connect open any new connections until a to other devices also running connection has been closed. Actually BlueMatch. To the user only the this limit can be lower on real devices. A community list is visible, which displays minimum of two connections at a time the friendly names of the found devices must be allowed by the device to run with BlueMatch services in their SDDB. BlueMatch properly, as it needs at least To send and receive files and filenames a one connection for the client and one for special class handling files has been an incoming connection on the server. implemented. This class, which uses the file connection API (JSR-75), is used by The Server Implementation the client and the server to access an The threaded server in BlueMatch existing memory card or hard disk in the creates and opens a service with an phone. This is done through the MIDP- universally unique identifier (UUID). property “fileconn.dir.Memorycard”, This service can be searched by clients which points to the necessary drive on remote devices. The connections name. A memory card is usually the best between clients and the server are solution in a phone to store large files, stream-oriented, based on the RFCOMM whereby some phones like the Nokia protocol. For each client that connects to N91 [4] might have an integrated hard the server the server accepts and opens disk. For simplicity of the prototype only the connection, if the maximum number the root of the card is used by of allowed connections on the device has BlueMatch. not been reached already. Every connection is processed in a separate 4. The BlueMatch Message thread. This allows to accept more than one connection at a time. The processing Protocol is done automatically without notifying BlueMatch implements a custom the user. This is possible, because the message protocol. The commands a BlueMatch service does not rely on client sends to a remote service represent authorization and authentication. So the the use cases of BlueMatch. Each MIDlet can run in the background, for command is sent as an UTF-string, using instance on multitasking operating methods in the MIDP-class DataOutput- systems like Symbian and wait for other Stream. Within a command additional users to connect. parameters are transmitted to the service depending on the use case. Now these Data Management commands will be explained. The following application data have to be managed in BlueMatch: The Presentation • list of discovered remote devices After a BlueMatch instance was started, other devices also running the • list of BlueMatch services on remote BlueMatch service will be searched for devices and put into the local community list. To • local files each found BlueMatch service the client introduces himself with the command • list of files on remote devices

50 "Presentation", followed by these data in Retrieving File Lists UTF-Strings: To download a member's file list, the • the member's name (device's friendly user selects an entry in the community name) list and presses the button "Find Files". Now the client sends the command • the URL of the local BlueMatch "GetFileList" to the selected service and service opens an incoming connection to retrieve Receiving this message, the servers on the remote device's file list. remote devices add the new member to their community list. So their clients can connect to the newly found services as well. Only new members send their identification to the BlueMatch services on the remote devices. The clients on these devices do not have to do this on their part, as their friendly names and service URLs have already been retrieved.

Fig. 2: User's File List The server replies with the number of following filenames as UTF-strings. This number tells the client how many entries should be read from the server. Figure 2 shows a screenshot of a file list with further options taken from a Nokia 6680.

Downloading Files

Fig. 1: Joining a BlueMatch Community Fig. 3: Download Process The figure 1 outlines the main steps for After receiving a file list, the user can joining a BlueMatch community. The choose some files of interest to be white activities are performed by the downloaded. Therefore the client joining BlueMatch instance, the grey initializes a file download with the action is executed by instances in the command "GetFile" and the filename as existing community. parameter. Afterwards an incoming

51 connection is opened to receive the file 5. Future Development data. The requested server now returns the filename, the filesize and the file data Now we want to list some options to to the client. The filesize is used on the improve BlueMatch. The MIDP 2.0 client side to calculate the size of the offers the possibility to start MIDlets download puffers and to determine, automatically using timer based whether there is enough space left on the activation or on inbound network phone's memory card. A gauge bar connections. This function is called push visualizes the download process as registry. Therefore an inbound server shown in figure 3. connection must be registered with it. This is even possible with Bluetooth server connections. This way, the Leaving the Community program does not have to be started to be If a user decides to leave the community reached by remote devices. This function and closes the program, the other could be implemented in BlueMatch members should be informed, so they with some modifications. can delete the corresponding entry in the In the presented prototype the type of the local community list. Therefore the transferred files will not be checked. In client sends a "Goodbye" message with future versions it is possible to restrict the local service URL to all found file transfer to encrypted DRM-files BlueMatch services. Afterwards the (Digital Rights Management). This will program quits. be a basic functionality to support legal superdistribution. With the transaction tracking [5] feature from OMA's (Open Mobile Alliance) DRM v2.0 it is possible to reward the content redistribution by the Rights Issuer. As one of the first devices the Nokia N91 [4] implements this new DRM standard. Based on user transactions and file information personal profiles could be modelled to implement a mobile distributed recommendation engine to find suitable files on remote devices. The user profile could be sent as part of the presentation, so that other users have a quick overview of community members matching their taste. Papers regarding this function have been released in [6] and [7]. Furthermore the persistent storage of application data like favoured community members and recent Fig. 4: Leaving the Community transactions is an useful improvement. Figure 4 shows these steps in an activity diagram. The white activities are 6. References performed by the leaving BlueMatch [1] Homepage of Bluetooth SIG; program. www.bluetooth.org

52 [2] Java APIs for Bluetooth Wireless [6] Kubek, M.; “Verteiltes Nutzer- und Technology (JSR-82); Content-Matching in mobilen www.jcp.org/en/jsr/detail?id=82 Kommunikationssystemen im Umfeld [3] Hole, K. J.; Wi-Fi and Bluetooth des PotatoSystems”; Diploma Thesis; Course; University of Bergen; 2006; Technical University Ilmenau; 2005; www.kjhole.com/Standards/BT/BTdown www.4fo.de/de/students#kubek loads.html [7] Nützel, J. and Kubek, M; “A [4] Nokia N91 Device Specification; Mobile Peer-To-Peer Application for 2006; Distributed Recommendation and Re- www.forum.nokia.com/devices/N91 sale of Music”; 2006; AXMEDIS 2006 conference contribution [5] OMA DRM v2.0 Specification; 2006 www.openmobilealliance.org/release_pr ogram/docs/DRM/drm_v2_0.html

53 User Interfaces for People with Special Needs

Henrik Tonn-Eichstadt¨ Technische Universitat¨ Ilmenau Institut fur¨ Medien- und Kommunikationswissenschaft Am Eichicht 1, 98693 Ilmenau [email protected]

Abstract Assistive technologies for people with disabilities transfer information to the appropriate, usable senses and so assists the users’ special abilities. Pure transformation of contents is sometimes not enough, because the transformed information only relies on a textual representation of non-text content. At this point, it seems to be interesting how far assistive technologies can usefully go and if there is an alternative to pure transformation. This paper describes some ideas about (new) interfaces for people with special needs. 1. Introduction 1. in unknown domains without help,

In ’International Classification of Function- 2. without being in danger of being in- ing and Disability’ (ICIDH-1, [6]), the volved in a real world accident and World Health Organization (WHO) defines three steps concerning disabilities. First, an 3. without the danger of buying bean tins impairment as ”loss or abnormality of [...] instead of pineapple tins due to the lack structure or function at the organ level” is the of vision. basis which could result in a disabilitywhich is described as ”restriction or lack of abil- Some people need to have special media to ity to perform an activity in a normal man- understand the presented content. Deaf and ner.” Both, impairment and disability, can - some users hard of hearing need to see text but need not to - result in a handicap which translated into sign language. Many of those means having a disadvantage in social life. people do not understand written language. To mitigate such handicaps, disabled people As blind and a lot of other disabled people are frequently using computers. With com- can not rely on standard I/O-technologies, puters, these people learn, have assistance assistive devices and technology help them in real world communication (e.g. speech getting access to the computer. Foot- synthesis) or are connected to the Internet. mice and mouthsticks are alternative de- The Internet offers an opportunity to take vices for motor impaired persons. Screen- part in social life and to be able to act au- reader, braille displays and speech output tonomously, without the help of others1. For are helpful to the blind; screen magnifier to blind people, the Internet means a broad ac- low vision people. Screenreader transform cess to everyday life as the Internet is the graphical output into a textual representa- first place where blind people can (inter)act tion which then can be rendered for percep- tion through the sense of touch or hearing. 1 In Germany, this is often called ’Daseinsentfal- Media that can not be transformed needs to tung’ in connection with legal aspects. A transla- tion to the English language does not exist. The term be coded with alternative text (e.g. the alt- could be described as ’independentexpansion of a be- attribute for images in HTML). If this alter- ings’ existence’. native text is missing, the content can not be

54 perceived by blind users. they encounter the problem of interfering Though assistive technologies work fairly sound. The authors also ”[...] would like to well, in most cases they only convert graph- more precisely evaluate the suitable type of ical output optimized for sighted users to an information for each sense.” output of minor quality. Thus, users with R¨ober and Masuch ([10]) investigate the special needs - especially blind users - have sonification of objects in a 3D virtual world. to cope with challenges that arise due to in- By distinguishing different sounds, users terfaces based on graphical paradigms. This can decide on moving in this world and paper deals with concepts of interfaces and choosing the appropriate interaction for the interaction strategies which could be devel- presented objects. This approach is demon- oped and optimized with regard to blind peo- strated in a digital game prototype. A very ple. The basic ideas behind these presented similar idea is described in [8]. There, an concepts can be transferred to everyday sit- audible space was presented to blind people uations where sighted people have to inter- who reviewed this new presentation form as act in a situation which is similar to the usable. blind users’ everyday live regarding percep- The 3D approach seems to be interesting. tion prospects. These situations could be Sighted users are arranging object icons on dealing with a 2D iconical desktop to achieve a certain order in their files and programs. Blind peo- 1. audio menus of telephone applications, ple cannot do that. Although screenreaders transform the visual output into a linear text 2. interacting with a navigation system form, spacial - or better surfacial - relations while driving or between the objects get lost. A clue on the 3. operating a machine blindly. distance between or the grouping of objects is not available to these users. In order to compensate the lack of vision, 2. Audio interfaces blind people have evolved distinct skills in Considering auditory output, several prob- hearing. So it seems to be a good idea to lems have to be taken into account. Au- think about an auditory represented room dio is a sequential phenomenon: information where users can place objects. [8] showed has to be presented in a linear way. Paral- that this approach is of interest. lel presentation of audio would lead to babel Looking at the sketch of such a room in Fig- and results in unusable output. Further on, ure 1, it becomes obvious that this interface audio features several degrees of freedom. could be adequate to perceive auditory in- Sound perception is influenced by volume, formation about all existing objects. How pitch, timbre and loudness. A challenge is this information can be coded is analysed in to create different sounds which vary those e.g. [7] for synthetic sound (’earcons’) and aspects and are doubtlessly distinguishable. in [12] for pieces of real sounds (’auditory Some time ago, Edwards ([4]) tried to con- icons’). vey direct manipulation objects from the At least, two main questions quickly arise: graphical world to the audio world. The work ends with the conclusion that much 1. How should the objects be pre- research has to be done to reach an us- sented? As a constant playing sound able interface. In principle, the approach will counterwork usability basics and has proven suitable. Asakawa et al. ([1]) rise ’sound pollution’, alternative op- somehow continue the work and try to trans- tions have to be checked. Some basic port visual effects into audio and tactile pre- research from [11] deals with the ability sentation. They are proposing ’background of human beings to extract information colour music’ and ’foreground sound’. But from simultaneously presented sounds.

55 Figure 1: In a virtual auditory room, the user could place her or his applications freely in ’space’.

Based on such work, pursuing concepts can be developed. 2. By which means can objects be placed or picked up? There is so far no adequate everyday input device. Nowadays, the virtual room has to be turned (like in a lot of 3D-games) or the user has to wear some augmented real- ity device like a helmet, special glasses or a data glove. Where turningthe room Figure 2: John Anderton (Tom Cruise) is is a cheap escape, the described devices scanning files in a 3D-interface. Copyright: are vision-based and not suited for the Twentieth Century Fox und Dreamworks. vision impaired. A data glove as it was shown in the movie ’Minority Report’ (see Figure 2) seems to be a nice idea established tactile in-/output device is the to manipulate in a 3D environment - but braille display which is widely used by blind will such device be affordable by the people. Such devices present a set of braille mean user? characters to the user (40 or 80 braille char- acters are common sizes). Each character is So there is space for research in the psycho- built with 6 or 8 braille pins which are ar- logical (perception research) and technical ranged in a 2x3 or 2x4 matrix (see Figure 3). (development of input devices) field to an- On those displays, users can read small por- swer these questions. tions of text at a time. After the text is read, they have to skip to the next line or element. 3. Tactile Interfaces For reading text, the common braille dis- Another type of presenting information in plays are well usable. Users are able to a non-visual way are tactile interfaces. An read at a high word rate and can perceive

56 perceivable. Finally, the temperature hastobe managedina way thattheuser can proceed working with the computer and fix the error or pay attention to the backgrounded application. Variation of temperature seems to be useful to com- municate warnings - but are there other fields of use?

Figure 3: ’braille’ written in 6-pin-braille 2. Pressure. Changing the pressure of and plaintext. a tactile sensation can also be used to convey information. A firmly and stat- even large amounts of text quickly. But ical presented tactile symbol has a dif- those braille displays are limited to text ferent meaning than a softly, smoothly presentation. Graphical information can presented symbol. If and how these not be displayed. Even semantic informa- variations can be introduced in a use- tion like emphasized text or headings can ful way has not been researched so far. only be presented by adding the appropri- Maybe experience of force feedback ate meta information to the text which then devices can be used to develop new pre- can be displayed: ’Heading level one: Tac- sentation forms based on pressure. tile Interfaces’. So how could these ’ad- 3. Oscillation. Brewster describes ’Tac- vanced’ pieces of information be considered tons’ that transfer structural informa- in braille presentation? tion to oscillating braille pins ([2]). 3.1. Degrees of freedom in tactile Frequency, amplitude, wave form, du- ration and rhythm are varied to en- presentation code structural information like empha- Tactile perception is not limited to the binary sis, highlighting of phrases or citations. information of a pin being there or not be- How these variations can be used in a ing there like in classic braille displays. So sensible way is not clear so far. As how can pressure, temperature and vibration it was the case with varying tempera- - which are the main tactile sensations (see tures, main presentation characteristics e.g. [3]) - be varied? have to be found for oscillation to ob- tain easily distinguishable, non-hurting 1. Temperature. The variation of temper- and non-perturbing presentation of dif- ature of the braille pins or the braille ferent information. field itself could be one means to present information. As long as every- A variety of open questions has to be an- thing is in order, the temperature will be swered concerning the degrees of freedom comfortable (of course, the user has to of presented haptical information. adjust her or his comfortable tempera- ture). If an error occurs or the computer 3.2. Two-dimensional braille fields suddenly changes its state (because a backgrounded application should be fo- The second limitation of classical braille cussed), the temperature could rise (or displays is the character-based presentation. fall) to indicate this change of state to This is optimal for the display of text but the user. Of course, the temperature has fails in case of any graphical output. Some to varied in a ’healthy’ interval that the graphical information could possibly be me- user is not hurt by a sudden change of diated by variation of temperature, pressure temperature. Additionally, the differ- or oscillation as it was described above. But ence has to be large enough to be easily for graphical media, these variations alone

57 are not sufficient as those need more room braille displays have a ’routing key’ assigned to be displayed. to each braille character. Pressing this rout- Some basic research exists in this field. In ing key, the user can activate a link or place [5], graphics and charts are transferred into the cursor in a form field directly. a tactile presentation. Rotard et al. ([9]) One way could be to add an input function have developed a tactile web browser which to the braille pins. The user could then feel can display text, tables and graphics on a the information and just press the focussed 120 by 60 pin display. The author of this area to activate an element. This of course paper knows of a japanese researcher who holds some danger: as the user has to press works on an appropriate display technique at some strength to feel the pins, he or she for a limited display area. could accidentally press and activate an ele- The following list shows open questions in ment. Another problem could be to display this field. different elements in a way that they are sep- arated from each other to prevent parallel ac- tivation. And a third point: How is it pre- 1. Are interaction methods which are de- sented to the user that the focussed-on ele- veloped for size-limited, graphical mo- ment is clickable? bile displays (e.g. panning or the fish- eye-view) transferable to tactile dis- A second idea is to have buttons placed in plays? a spot separated from the output braille area. These buttons would have to be connected to 2. (How) can dynamic graphics (e.g. the displayed information which is a design movies or animations) be presented on challenge. But the buttons forestalls activa- a tactile interface? In visual perception, tion by accident. the user can get a quick overview on an So, concerning input, some research could interface by just looking at it. Perceiv- be done. ing tactically, the user has to explore the interface with for example his or her 3.4. Challenges finger tips as the areas of the skin which are sensitive enough are limited to a few Summarized, what challenges are hidden in sensible points. tactile interfaces?

3. How can the user be lead across a new, two-dimensional interface to make ex- 1. Presentation of information in a (very) ploration of the interface easier? In vi- limited space. sual interfaces, parts of it are coded in 2. Appropriate presentation regarding colour or similar functions are grouped. temperature, pressure and oscillation. Using the laws of (visual) perception, the users’ eye is guided through the in- 3. Finding usable input options. terface. What laws of tactical percep- tion can be used to do the same for tac- 4. Mechanical construction. Concerning tile interfaces? the presentation and input options, the device itself has to be built:

3.3. Input in two-dimensional tac- (a) The device has to connect the de- tile interfaces sired functionality with the braille Talking again about direct manipulation pins. The pins themselves have as an established and usable interac- to be positioned at small distances tion paradigm: How can direct manip- and are tiny size. ulation techniques be transferred to two- (b) The device has to be produced at dimensional tactile displays? Most classic reasonable costs to enable users to

58 afford it (standard braille displays of the SIGCHI conference on Human cost around 5.000 EUR and up at factors in computing systems, pages the moment which makes it dif- 83–88, New York, NY, USA, 1988. ficult for users to afford such de- ACM Press. vices). [5] Richard E. Ladner, Melody Y. Ivory, Rajesh Rao, Sheryl Burgstahler, Dan 4. Summary Comden, Sangyun Hahn, Matthew This paper gave a brief overview on pos- Renzelmann, Satria Krisnandi, Maha- sible research on user interfaces for people lakshmi Ramasamy, Beverly Slabosky, with special needs. What is common to Andrew Martin, Amelia Lacenski, Stu- both scenarios is that users could not rely art Olsen, and Dmitri Groce. Automat- on graphical output. To even this out, alter- ing tactile graphics translation. In As- native senses are used: the senses of hear- sets ’05: Proceedings of the 7th inter- ing and touch. For each of these senses, national ACM SIGACCESS conference an appropriate interface is imaginable. Parts on Computers and accessibility, pages of the presented ideas are already integrated 150–157, New York, NY, USA, 2005. in established products. Nevertheless, these ACM Press. products lack some functions. Adding these [6] Rolf-Gerd Matthesius, Kurt-Alphons functions or building enhanced devices is the Jochheim, and Gerhard S. Barolin. aim of the described ideas. Possibly, find- ICIDH, International Classification of ings from this future research can help to im- Impairements, Dissabilities, and hand- prove everyday devices for all users. icaps. Huber, Bern, Switzerland, 1999. References [7] David K. McGookin and Stephen A. [1] Chieko Asakawa, Hironobu Takagi, Brewster. Space, the final frontearcon: Shuichi Ino, and Tohru Ifukube. Au- The identification of concurrently pre- ditory and tactile interfaces for repre- sented earcons in a synthetic spa- senting the visual effects on the web. tialised auditory environment. In Pro- In Assets ’02: Proceedings of the fifth ceedings of ICAD 04 - Tenth Meet- international ACM conference on As- ing of the International Conference on sistive technologies, pages 65–72, New Auditory Display, Sydney, Australia, York, NY, USA, 2002. ACM Press. 2004. [2] Stephen Brewster. Nonspeech au- [8] Veronika Putz. Spatial auditory user ditory output. In The human- interfaces. Master’s thesis, Institute of computer interaction handbook: fun- Electronic Music, University of Music damentals, evolving technologies and and Dramatic Arts Graz, Graz, Austria, emerging applications, pages 220– October 2004. 239. Lawrence Erlbaum Associates, [9] Martin Rotard, Sven Kn¨odler, and Inc., Mahwah, NJ, USA, 2003. Thomas Ertl. A tactile web browser for [3] Alan Dix, Janet Finlay, Gregory the visually disabled. In HYPERTEXT Abowd, and Russell Beale. Human- ’05: Proceedings of the sixteenth ACM Computer Interaction (2nd edition). conference on Hypertext and hyper- Prentice Hall Europe, Hertfordshire, media, pages 15–22, New York, NY, 1998. USA, 2005. ACM Press. [4] Alistair D. N. Edwards. The design [10] Niklas R¨ober and Maik Masuch. In- of auditory interfaces for visually dis- teracting with sound - an interaction abled users. In CHI ’88: Proceedings paradigm for virtual auditory works.

59 In Proceedings of ICAD 04 - Tenth Meeting of the International Confer- ence on Auditory Display, Sydney, Aus- tralia, 2004.

[11] Barbara Shinn-Cunningham and Antje Ihlefeld. Selective and divided atten- tion: Extracting information from si- multaneaous sound sources. In Pro- ceedings of ICAD 04 - Tenth Meet- ing of the International Conference on Auditory Display, Sydney, Australia, 2004.

[12] Catherine Stevens and David Brennan ans Simon Parker. Simultaneous ma- nipulation of parameters of auditory icons to convey direction, size, and dis- tance: Effects on recognition and in- terpretation. In Proceedings of ICAD 04 - Tenth Meeting of the International Conference on Auditory Display, Syd- ney, Australia, 2004.

60 Clustering the Google Distance with Eigenvectors and Semidefinite Programming

Jan Poland and Thomas Zeugmann Division of Computer Science Hokkaido University N-14, W-9, Sapporo 060-0814, Japan {jan,thomas}@ist.hokudai.ac.jp

Abstract Web mining techniques are becoming increasingly popular and more accurate, as the information body of the World Wide Web grows and reflects a more and more comprehensive picture of the humans’ view of the world. One simple web mining tool is called the Google distance and has been recently suggested by Cilibrasi and Vit´anyi. It is an information distance between two terms in natural language, and can be derived from the “similarity metric”, which is defined in the context of Kolmogorov complexity. The Google distance can be calculated from just counting how often the terms occur in the web (page counts), e.g. using the Google search engine. In this work, we compare two clustering methods for quickly and fully automatically decomposing a list of terms into semantically related groups: Spectral clustering and clustering by semidefinite programming.

1. Introduction used for other complexities, such as those derived from the WWW frequencies. The Google distance has been suggested by Cilibrasi and Vit´anyi [2] as a semanti- Spectral clustering is an increasingly pop- cal distance function on pairs of words or ular method for analyzing and clustering terms. For instance, for most of today’s data by using only the matrix of pairwise people, the terms “Claude Debussy” and similarities. It was invented more than 30 “B´elaBart´ok”are much tighter related years ago for partitioning graphs (see e.g. than “B´elaBart´ok”and “Michael Schu- [11] for a brief history). Formally, spectral macher”. clustering can be related to approximat- ing the normalized min-cut of the graph The World Wide Web represents parts defined by the adjacency matrix of pair- of the world we live in as a huge collec- wise similarities [15]. Finding the exactly tion of documents, mostly written in nat- minimizing cut is an NP-hard problem. ural language. By just counting the rel- ative frequency of a term or a tuple of The Google distances can be transformed terms, we may obtain a probability of this to similarities by means of a suitable ker- term or tuple. From there, one may de- nel. However, as such a transformation fine conditional probabilities and, by tak- potentially introduces errors, since in par- ing logarithms, complexities. Li et al. [7] ticular the kernel has to be chosen appro- have proposed a distance function based priately and the clustering is quite sen- on Kolmogorov complexity, which can be sitive to this choice, it seems natural to 61 work directly on the similarities. Then, alphabet is ASCII or UTF-8, according to the emerging graph-theoretical criterion which character set we are actually using. is that of a maximum cut. Optimizing (For simplicity, we shall always use the this cut is again NP-hard, but can be term ASCII in the following, which is to approximated with semidefinite program- be replaced by UTF-8 if necessary.) Then, ming (SDP). the (prefix) Kolmogorov complexity of a The main aim of this work is to com- character string x is defined as pare spectral clustering and clustering by K(x) = length of the shortest self-de- SDP, both of which are much faster than limiting program generating x. the computationally expensive phyloge- netic trees used by [2]. We will show how By the requirement “self-delimiting” we state-of-the-art techniques can be com- ensure that the programs form a prefix- bined in order to achieve quite accurate free set and therefore the Kraft inequality clustering of natural language terms with holds, i.e., surprisingly little effort. X −K(x) There is a huge amount of related work, 2 ≤ 1 , in computer science, in linguistics, as well x as in other fields. Text mining with spec- where x ranges over all ASCII strings. tral methods has been for instance studied The Kolmogorov complexity is a well- in [3]. A variety of statistical similarity defined quantity regardless of the choice measures for natural language terms has of the universal Turing machine, up to an been listed in [13]. For literature on spec- additive constant. tral clustering, see the References section ∗ ∗ and the references in the cited papers. If x and y are ASCII strings and x and y are their shortest (binary) programs, re- The paper is structured as follows. In the spectively, we can define K(y|x∗), which next section, we introduce the similarity is the length of the shortest self-delimiting metric, the Google distance, and spectral program generating y where x∗, the pro- clustering as well as clustering by SDP, gram for x, is given. K(x|y∗) is computed and we shall state our algorithms. Sec- analogously. Thus, we may follow [7] and tion 3 describes the experiments and their define the universal similarity metric as results. Finally, in Section 4 we discuss the results obtained and give conclusions. max K(y|x∗),K(x|y∗) d(x, y) = (1) max K(x),K(y) 2. Theory We can interprete d(x, y) as an approx- 2.1. Similarity Metric and Google imation of the ratio by which the com- Distance plexity of the more complex string de- We start with a brief introduction to Kol- creases, if we already know how to gen- mogorov complexity (see [8] for a much erate the less complex string. The uni- deeper introduction). Let us fix a uni- versal similarity metric is almost a metric versal Turing machine (which one we fix according to the usual definition, as it sat- isfies the metric (in)equalities up to order is not relevant, since each universal ma-  chine can interpret each other by using a 1/max K(x),K(y) . “compiler” program of constant length). Given a collection of documents like the For concreteness, we assume that its pro- World Wide Web, we define the probabil- gram tape is binary, such that all sub- ity of a term or a tuple of terms by count- sequent logarithms referring to program ing relative frequencies. That is, for a tu- lengths are w.r.t. the base 2. The output ple of terms X = (x1, x2, . . . , xn), where 62 each term xi is an ASCII string, we set are interested in the similarity of “cosine” and “triangle.” They would be relevant if www www p (X) = p (x1, x2, . . . , xn) = (2) we were searching for the corresponding Chinese terms. So we use a different way # web pages containing all x , x , . . . , x to fix the relevant database size. We add 1 2 n . # relevant web pages the search term “the” to each query, if we are dealing with English language. This Conditional probabilities can be defined is one of the most frequent words used in likewise as English and therefore gives a reasonable restriction of the database. The database www www www p (Y |X) = p (Y ∪ X)/p (X) , size is then the number of occurrences of “the” in the Google index. Similarly, for where X and Y are tuples of terms and ∪ our experiments with Japanese, we add denotes the concatenation. Although the the term n (“no” in hiragana) to all probabilities defined in this way do not queries. satisfy the Kraft inequality, we may still define complexities 2.2. Spectral Clustering Consider the block diagonal matrix K www(X) = − log pwww(X) and  1 1 0  1 1 0 www www www   . K (Y |X) = K (Y ∪ X) − K (X) . 0 0 1 Then we use (1) in order to define the web Its top two eigenvectors, i.e., the eigenvec- distance of two ASCII strings x and y, tors associated with the largest two eigen- following [2], as values, are [1 1 0]T and [0 0 1]T . That is, they separate the two perfect clusters rep- www d (x, y) = resented by this similarity matrix. In gen- K www(x ∪ y) − min K www(x),K www(y) eral, there are conditions under which the (3) max K www(x),K www(y) top k eigenvectors of the similarity matrix or its Laplacian result in a good cluster- Since we use Google to query the page ing, even if the similarity matrix is not counts of the pages, we also call dwww the perfectly block diagonal [9, 14]. In partic- Google distance. Since the Kraft inequal- ular, it was observed in [4] that the trans- ity does not hold, the Google distance is formed data points given by the k top quite far from being a metric, unlike the eigenvectors tend to be aligned to lines in universal similarity metric above. the k-dimensional Euclidean space, there- A remark concerning the “number of rele- fore the kLines algorithm is appropriate vant web pages” in (2) is mandatory here. for clustering. In order to get a com- This could be basically the number of all plete clustering algorithm, we therefore pages indexed by Google. But this quan- only need to fix a suitable kernel function tity is not appropriate for two reasons: in order to proceed from a distance matrix First, since some months ago there seems to a similarity matrix. to be no way to directly query this num- For the kernel, we use the Gaussian 1 2 2 ber. Hence, the implementation by [2] k(x, y) = exp − 2 d(x, y) /σ . We may used a heuristic to estimate this value, use a globally fixed kernel width σ, since which however yields inaccurate results. the Google distance (3) is scale invariant. Second, not all web pages are really rele- In the experiments, we compute the mean vant for our search. For example, billions value of the entries o of the distance ma-√ of Chinese web pages are irrelevant if we trix D and then set σ = mean(D)/ 2. 63 In this way, the kernel is most sensitive The max-k-cut problem is the task of around mean(D). finding the partition that maximizes the The final spectral clustering algorithm for weight of the cut. It can be stated as fol- k−2 a known number of clusters k is stated be- lows: Let a1, . . . , ak ∈ S be the vertices low. We shall discuss in the experimental of a regular simplex, where section how to estimate k if the number Sd = {x ∈ d+1 | kxk = 1} of clusters is not known in advance. R 2 is the d-dimensional unit sphere. Then Algorithm Spectral clustering of a word 1 the inner product ai ·aj = − k−1 whenever list i 6= j. Hence, finding the max-k-cut is Input: word list X = (x1, x2, . . . , xn), equivalent to solving the following integer number of clusters k program: Output: clustering c ∈ {1 . . . k}n X IP: maximize k−1 d (1 − y · y ) 1. for x, y ∈ X, compute Google relative k ij i j i

√ k−1 X 4. compute σ = mean(D)/ 2 SDP : maximize k dij(1 − vi · vj) i

2. for x, y ∈ X, compute complexities K www(x), K www(x, y)

3. compute distance matrix

www  D = d (x, y) x,y∈X

4. solve the SDP by using CP (cf. (4a) Fig. 1: Complexities Fig. 2: Distances through (4f))

5. cluster the resulting matrix Y using kernel k-means [10] In the distance matrix (Figure 2), the 65 to the medical terms, in a hierarchical clustering we would first split off the med- ical terms and then divide mathematical and financial terms. The spectral clus- tering correctly groups all terms except for “average”, which is assigned to the financial terms instead of the mathemat- Fig. 3: Similarity Fig. 4: Laplacian ical terms (this is also visible from Fig- ure 5). As average also occurs often in finance, we cannot even count this as mis-

first eigenvector of the Laplacian clustering. We stress that the clustering 0 algorithm is of course invariant to permu- tations of the data, i.e. yields the same

second eigenvector of the Laplacian results if the terms are given in a differ- 0.4 0.2 ent order. It is just convenient for the 0 -0.2 presentation to work with an order corre-

third eigenvector of the Laplacian sponding to the correct grouping. 0.4 0.2 The same clustering result is obtained 0 from the SDP clustering. The kernel ma- trix resulting from solving the SDP clearly Fig. 5: Eigenvector plot displays the block structure, again with the exception of the term “average”. In case that we do not know the num- block structure is visible. After transfor- ber of clusters k in advance, there is a mation to the Similarity matrix (Figure 3) way to estimate this quite reliably from and Laplacian (Figure 4), the block struc- the eigenvalues of the Laplacian, if the ture of the matrix becomes very clear. data is not too noisy. Consider again Figure 5 shows the top three eigenvec- the case of a perfect block diagonal ma- tors of the Laplacian. The first eigen- trix, i.e. all intra-cluster similarities are 1 vector having only negative entries seems and all other entries 0. Then the num- not useful at all for the clustering (but ber of non-zero eigenvalues of this matrix in fact it is useful for the kLines algo- is equal to the number of blocks/clusters. rithm). The second eigenvector separates If the matrix is not perfectly block di- the medical terms (positive entries) from agonal, we may still expect some domi- the union of mathematical and financial nant eigenvalues which are clearly larger terms (negative entries). This indicates than the others. Figure 7 top left shows that the mathematical and financial clus- this for our first example data set. The ters are closer related than each is related top three eigenvalues of the Laplacian are dominant, the fourth and all subsequent ones are clearly smaller. (Observe that the smallest eigenvalues are even negative: This indicates that the distances we used do not stem from a metric. Otherwise all eigenvalues should be nonnegative, since the Gaussian kernel is positive definite.) We propose a simple method for detect- Fig. 6: Kernel matrix after SDP ing the gap between the dominant eigen- values and the rest: Tentatively split 66 The next data set is the “colors-nums” eigenvalues of the Laplacian: math-med-finance 1 data set from [2]: 0.8 0.6 purple, three, blue, red, eight, 0.4 transparent, black, white, small, six, 0.2 0 yellow,seven, fortytwo, five, chartreuse, two, green, one, zero, eigenvalues of the Laplacian: colors-nums orange, four 1 0.8 Although the intended clustering has 0.6 0.4 two groups, colors and numbers (where 0.2 “small” is supposed to be a number and 0 “transparent” a color), the eigenvalues of the Laplacian in Figure 7 bottom left eigenvalues of the Laplacian: people 1 indicate that there are three clusters. 0.8 Indeed, in the final spectral clustering, 0.6 0.4 “fortytwo” forms a singleton group, 0.2 and “white” and “transparent” are 0 misclustered as numbers. Clustering

eigenvalues of the Laplacian: Japanese with SDP gives a slightly different result: 1 Here, best results are obtained with k = 2 0.8 0.6 clusters, in which case only “fortytwo” 0.4 is wrongly assigned to the colors. 0.2 0 The next data set,

Fig. 7: Plots of the eigenvalues of the Takemitsu, Stockhausen, Kagel, Xenakis, Laplacian (bars) and the s.s.e. score for Ligeti, Kurtag, Martinu, Berg, Britten, determining the number of clusters (lines) Crumb,Penderecki, Bartok, Beethoven, for the four data sets Mozart, Debussy, Hindemith, Ravel, Schoenberg, Sibelius,Villa-Lobos, Cage, Boulez, Kodaly, Prokofiev, Schubert, after the second eigenvalue, compute the Rembrandt, Rubens, Beckmann, Botero, Braque, means of the eigenvalues in the two groups Chagall, Duchamp, Escher, Frankenthaler, “dominant” and “non-dominant” (ignore Giacometti, Hotere, Kirchner, Kandinsky, the top eigenvalue, which is always much Kollwitz, Klimt, Malevich, Modigliani, larger), and calculate the sum square er- Munch, Picasso, Rodin, Schlemmer, Tinguely, ror (s.s.e.) of all eigenvalues w.r.t. their Villafuerte, Vasarely, Warhol, Rowling, means. Compute this s.s.e. score also for Brown, Frey, Hosseini, McCullough, Friedman, the split after the third eigenvalue, the Warren, Paolini, Oz, Grisham, Osteen, fourth eigenvalue and so forth. Choose Gladwell, Trudeau, Levitt, Kidd, Haddon, the split with the lowest score. We have Brashares, Guiliano, Maguire, Sparks, depicted the s.s.e. scores in Figure 7 by Roberts, Snicket, Lewis, Patterson, Kostova, solid lines. For the math-med-finance Pythagoras, Archimedes, Euclid, Thales, data set, the minimum score is at the cor- Descartes, Pascal, Newton, Lagrange, Laplace, rect number of k = 3 clusters. Leibniz, Euler, Gauss, Hilbert, Galois, Cauchy, Dedekind, Kantor, Poincare, Godel, Note that this method works only for the Ramanujan, Wiles, Riemann, Erdos, Thomas spectral clustering, there is no obvious Zeugmann, Jan Poland, Rolling, Stones, corresponding algorithm for the SDP clus- Madonna,Elvis, Depeche, Mode, Pink, Floyd, tering. Elton, John, Beatles, Phil, Collins, 67 Toten, Hosen, McLachan, Prinzen, Aguilera, clearly indicate the correct number of Queen, Britney, Spears, Scorpions, k = 2 clusters. However, when using Metallica, Blackmore, Mercy k = 2, only the term “° ƒ” (which consists of five groups of each 25 (more means “environment”) is non-intendedly or less) famous people: composers, grouped with the computer science words artists, last year’s bestseller authors, by the spectral clustering. SDP clustering mathematicians (including the authors of gives the same result here. the present paper), and pop music per- The computational resources required formers. We deliberately did not specify by our clustering algorithms are much the terms very well (except for our own lower than those needed by Cilibrasi and much less popular names), in this way Vit´anyi’s algorithm [2]. Their clustering the algorithm could “decide” itself if tries to optimize a quality function, a task “Oz” meant one of the authors Amos and which is NP hard. The approximation Mehmet Oz or one of the pop music songs costs hours even for small lists of n = 20 with Oz in the title. The eigenvalue plot words. On the other hand, our spectral in Figure 7 top right shows clearly five clustering’s naive time complexity is cubic clusters. From the 125 names, 9 were not O(n3) in the number of words. This can clustered into the intended groups using be improved to O(n2k) by using Lanczos the spectral method. The highest number method for computing the top k eigenvec- of incorrectly clustered names (4 mis- tors. clusterings) occurred in the least popular On our largest “people” data set with group of the mathematicians (but our n = 125, our spectral clustering needs two names were correctly assigned). We about 0.11sec on a 3GHz Pentium4 pro- also observed that the spectral clustering cessor with the ATLAS library. gets disproportionately harder when the Solving the SDP in order to approximate number of clusters increases: Clustering max-cut is more expensive. The respec- only the first 50, 75, and 100 names gives tive complexity is O(n3 + m3) (cf. [1]), 0, 2, and 5 clustering errors, respectively. where m is the number of constraints. If We also tried clustering the same data k = 2, then m = n and the overall com- set w.r.t. the Japanese web sites in the plexity is cubic. However, for k ≥ 3, Google index, this gave 0, 1, 4, and 16 we need m = O(n2) constraints, resulting clustering errors for the first 50, 75, 100, in an overall computational complexity of and 125 names, respectively. O(n6). On our largest “people” data set Clustering with SDP gives better results with n = 125, our SDP clustering needs here: 0, 0, 1, and 4 clustering errors for about 2544sec. the first 50, 75, 100, and 125 names, re- spectively. 4. Discussion and Conclusions The last data set consists of 20 Japanese We have shown that it needs surprisingly terms from finance and 10 Japanese terms little effort, just a few queries to a popular from computer science (taken from glos- search engine together with some state-of- saries): the-art methods in , to automatically separate lists of terms into < ºÿ ¶m †Ø *¡ °ƒ Ñ) o , , , , , , , , clusters which make sense. We have fo- Ç( üe ¡? *¡ =- 8ú Ñ ò , , , , , , , cused on unsupervised learning in this pa- Ø 4» AÕ' ©¡<8 ©ï Þï ;Ïæ , , , , , , per, but for other tasks such as supervised ¦ ;Ï'. ¢p Ñ< Âp b Ö¦ Ÿ , , , , , , , learning, appropriate tools are available  —S , . as well (e.g. SVM). Our methods are theo- The eigenvalue plot in Figure 7 does not retically quite well founded, basing on the 68 theories of Kolmogorov complexity on the [8] M. Li and P. M. B. Vit´anyi. An in- one hand and graph cut criteria and spec- troduction to Kolmogorov complexity tral clustering or semidefinite program- and its applications. Springer, 2nd ming on the other hand. The SDP clus- edition, 1997. tering is the more direct approach, as it needs less steps. Also, it yields slightly [9] A. Ng, M. Jordan, and Y. Weiss. On better results. However, it is compu- spectral clustering: Analysis and an tationally more expensive than spectral algorithm. In Advances in Neural clustering. Information Processing Systems 14, 2001. References [10] B. Sch¨olkopf, A. Smola, and K.-R. [1] B. Borchers and J. G. Young. Imple- M¨uller. Nonlinear component anal- mentation of a primal-dual method ysis as a kernel eigenvalue problem. for sdp on a shared memory parallel Neural Computation, 10(5):1299– architecture. March 27, 2006. 1319, 1998.

[2] R. Cilibrasi and P. M. B. Vit´anyi. [11] D. A. Spielman and S. Teng. Spectral Automatic meaning discovery using partitioning works: Planar graphs Google. Manuscript, CWI, Amster- and finite element meshes. In IEEE dam, 2006. Symposium on Foundations of Com- puter Science, pages 96–105, 1996. [3] I. Dhillon. Co-clustering documents and words using bipartite spectral [12] J. Sturm. Using SeDuMi 1.02, a graph partitioning. In Proceedings MATLAB toolbox for optimization of the 7th ACM SIGKDD Int. Con- over symmetric cones. Optimization ference on Knowledge Discovery and Methods and Software, 11(12):625– Data Mining(KDD), pages 269–274, 653, 1999. 2001. [13] E. L. Terra and C. L. A. Clarke. Fre- [4] I. Fischer and J. Poland. New meth- quency estimates for statistical word ods for spectral clustering. Technical similarity measures. In HLT-NAACL Report IDSIA-12-04, IDSIA / USI- 2003: Main Proceedings, pages 244– SUPSI, Manno, Switzerland, 2004. 251, Edmonton, Alberta, Canada, [5] A. Frieze and M. Jerrum. Improved 2003. algorithms for max k-cut and max [14] U. von Luxburg, O. Bousquet, and bisection. Algorithmica, 18:67–81, M. Belkin. Limits of spectral cluster- 1997. ing. In Advances in Neural Informa- [6] M. X. Goemans and D. P. tion Processing Systems (NIPS) 17. Williamson. .879-approximation MIT Press, 2005. algorithms for MAX CUT and MAX [15] S. X. Yu and J. Shi. Multiclass spec- 2SAT. In STOC ’94: Proceedings tral clustering. In ICCV ’03: Pro- of the twenty-sixth annual ACM ceedings of the Ninth IEEE Interna- symposium on Theory of computing, tional Conference on Computer Vi- pages 422–431. ACM Press, 1994. sion, pages 313–319. IEEE Computer [7] M. Li, X. Chen, X. Li, B. Ma, and Society, 2003. P. M. B. Vit´anyi. The similarity met- ric. IEEE Transactions on Informa- tion Theory, 50(12):3250–3264, 2004. 69 Toward Soft Clustering in the Sequential Information Bottleneck Method

Tetsuya Yoshida Graduate School of Information Science and Technology, Hokkaido University West 9, North 14, Kita-ku, Sapporo, Hokkaido 060-0814, Japan

Abstract This paper briefly explains the Information Bottleneck method which was originally proposed by Tishby [8], and describes our preliminary attempt to extend the Information Bottleneck (IB) method toward soft clustering. Until now four algorithms have been proposed in the framework of IB. Among them, we focus on an algorithm called sequential IB (sIB), and propose an approach for extending it toward soft clustering. We propose an approach for softening the partition in the orig- inal sIB and for cluster merger among the softened clusters. Some properties in our extension are reported, and remaining issues are discussed. 1. Introduction application domains, especially for docu- ment clustering. One of the strength of By rapid progress of network and storage this approach is that it is possible to find technologies, a huge amount of electronic out some “structure” out of the given data data has been accumulated and available itself without imposing extra assumptions these days. Especially, as typified by or constraints. Web pages, the amount of machine read- This paper describes our preliminary at- able documents have been ever increas- tempt to extend the Information Bottle- ing. Since it is almost impossible for hu- neck method toward soft clustering. Until mans to manually classify or categorize now four algorithms are proposed in the the huge amount of documents over the IB method. Among them, we focus on an Web, a lot of research efforts have been algorithm called sequential IB (sIB), and conducted on ext clustering or classifica- proposed an approach toward soft clus- tion [2, 1, 6] tering. Some properties in our extension Recently, a new information theoretic are reported and remaining issues are de- principle, termed the Information Bottle- scribed. neck method (IB), has been proposed and This paper is organized as follows. Sec- investigated [8, 3, 4, 7] It is based on the tion explains the preliminaries in prob- following idea: given the empirical joint abilistic approach in clustering. Sec- distribution of two variables, one vari- tion briefly explains the Information Bot- able is compressed so that the compressed tleneck method based on [8, 3], and representation preserves the information sets up the context of our research. Sec- about the other relevant variable as much tion describes our approach for toward as possible. It has been applied to many soft clustering, and reports our current

70 status. Section gives brief concluding re- preserved as high as possible, while the marks. give data points are compressed. The goal is: minimizing the compression 2. Preliminaries information while preserving the relevant information as high as possible. Here, the Definition 1 (KL divergence) The relevant information refers to the infor- Kullback-Leibler (KL) divergence be- mation about the relevant variable. The tween two probability distributions p1(x) above problem is formulated as: given a and p2(x) is defined as joint statics p(x, y) for a random variable X and the relevant variable Y , find a com- DKL[p1(x)||p2(x)] pressed representation T of the original  p1(x) representation X. This process is illus- = p1(x)log p2(x) trated in Figure 1. To solve this problem, x one observation is that T should be com- − = p1(x)logp2(x) H[p1(x)](1) pletely defined given X, irrelevant to Y , x since T is a compressed representation of X. This observation is formulated as fol- where H[p1(x)] represents the entropy of lows. p1(x).

Definition 3 (IB Markovian relation) Definition 2 (JS divergence) The Jensen-Shannon (JS)divergence between T ↔ X ↔ Y (3) two probability distributions p1(x) and p2(x) is defined as where X is a random variable, Y is the relevant variable, and T is a compressed JSΠ[p1(x),p2(x)] representation of X. = π1DKL[p1(x)||p¯(x)]

+π2DKL[p2(x)||p¯(x)] Under IB Markovian relation, the follow- = H[¯p(x)] ing properties hold: −(π H[p (x)] + π H[p (x)]) 1 1 2 2 p(t|x, y)=p(t|x) (4) (2) p(x, y, t)=p(x, y)p(t|x, y) | where Π={πi,πj}, 0 <π1,π2 < 1, π1 + = p(x, y)p(t x) (5) π2 =1, and p¯(x)=π1p1(x)+π2p2(x). To find the optimal compression of X into T using the method of Lagrange multipli- 3. Information Bottleneck ers, Tishby et. al [8] suggested the fol- Method lowing IB variational principle. 3.1. IB Variational Principle Definition 4 (IB variational principle) Tishby et al. proposed to cope with the precision complexity tradeoff with- L[p(t|x)] = I(X; T ) − βI(T ; Y ) (6) out defining any distortion measure in ad- vance [8]. They formulated the prob- where I(T ; X), I(T ; Y ) are the mutual lem by introducing a relevant variable on information between the variables, β is a which the mutual information should be Lagrange multiplier.

71 p(x,y)~ I( X;Y)

X p(t|x) T p(y|t) Y I( X;T ) I( T ; Y ) Compactness Accuracy

Figure 1: Information Bottleneck Framework.

Intuitively, I(T ; X) represents the com- algorithms have been proposed to find pactness of the new representation T , out exact or approximated solutions of while I(T ; Y ) represents the accuracy of Eq.(6) [5, 4, 3]. There algorithms are T for the relevant variable Y . The pa- termed as: rameter β acts as a controlling parame- ter for the trade-off. The goal is to find • iIB: an iterative optimization algo- a new representation T which minimize rithm L[p(t|x)], denoted as Lmin[p(t|x)]. It is proved that the optimal solution to • dIB: a deterministic annealing-like the IB problem, i.e., the conditional prob- algorithm ability p(t|x) which minimizes L[p(t|x)], can be determined by the following theo- • aIB: an agglomerative algorithm rem [8, 3] • sIB: a sequential optimization algo- Theorem 5 (from [8, 3]) Assume that rithm p(x, y),β are given and that IB Marko- vian relation holds. The conditional prob- The fixed-point equations in Eq.(7) is ap- ability distribution p(t|x) is a stationary plied in the iIB algorithm. It is reported point of Lmin[p(t|x)] if and only if that the sIB algorithm shows the “best” performance in terms of the time com- p(t) p(t|x)= e−βDKL[p(y|x)||p(y|t)] plexity and the quality of the constructed Z(x, β) clusters, if the size of the compressed rep- (7) resentation is rather small (i.e., the num- where Z(x, β) is a normalization func- ber of clusters in T is small) [3]. This tion. condition is enforced by setting the value of β in Eq.(8) sufficiently large. This re- | || | Note that DKL[p(y x) p(y t)], the sults in constructing a new representation Kullback-Leibler (KL) divergence be- T which sustains large mutual informa- tween two probability distributions, tion with the relevant variable Y . Several emerges as the effective distortion mea- research efforts have been conducted by sure from the IB variational principle. applying the sIB to document clustering1. Thus, in the following section, we focus 3.2. IB Algorithms 1Dr.Wang, Yaguchi, and Professor Tanaka To find out a new representation T un- have already utilized the sIB algorithm for doc- der the IB variational principle, four ument clustering in the Meme Laboratory.

72 on the sIB algorithm and consider its ex- t1 tension toward soft clustering. In the fol- T lowing, we explain the aIB and sIB algo- x rithm as the preliminaries for Section . draw {x}: as a t2 singleton 3.3. Cluster Merger in IB cluster For aIB and sIB algorithms, the following merge dual form is considered for the optimiza- tion of p(t|x): t3 −1 Lmax[p(t|x)] = I(T ; Y ) − β I(X; T ) (8) Figure 2: Overview of sIB algorithm. Clearly, minimization of L[p(t|x)] in Eq.(6) corresponds to maximization of L [p(t|x)] in Eq.(8). max Proposition 8 (Merger Cost) (from [3]) The agglomerative procedure is con- Let ti,tj ∈ T be two clusters. Then, the ducted by merging two values (clusters) ti merger cost is defined as: and tj into a single value t¯, i.e., the cluster L ¯ · ¯ merger {ti,tj}⇒t¯. Δ max(ti,tj)=p(t) d(ti,tj) (13) where Definition 6 (Merger Membership) ¯ | | (from [3]) Let {ti,tj}⇒t¯ be some d(tj,tx) = JSΠ[p(y ti),p(y tj)] −1 merger in T. We define the cluster merger −β JSΠ[p(x|ti),p(x|tj)] membership as the union of the events ti (14) and tj as:

Using Eq.(13), two clusters ti,tj ∈ T p(t¯|x)=p(ti|x)+p(tj|x), ∀x ∈ X (9) which minimize ΔLmax(ti,tj) are se- lected and merged iteratively. Proposition 7 (Cluster Merger in IB) { }⇒¯ (from [3]) Let ti,tj t be some 3.4. sIB: a sequential optimiza- merger in T defined through Eq.(6). If IB Markovian relation holds, then tion algorithm In [4], Slonim et al. proposed a frame-

p(t¯)=p(ti)+p(tj) (10) work for casting a known agglomerative algorithm into a sequential K-means like p(y|t¯)=πip(y|ti)+πjp(tj) (11) algorithm, and proposed a sequential op- where timization algorithm called sIB. The se- quential maintains a (flat) partition in T p(ti) p(tj) m t ...t ∈ T Π={π ,π } = { , } with exactly clusters ( 1 m ). i j ¯ ¯ (12) p(t) p(t) Given the initial partition, at each step some data x ∈ X is drawn from To select the merger cluster, it is proved its current cluster t (x was exclusively that the cluster which minimizes the fol- assigned to t), and represented as a lowing merger cost should be selected to new singleton cluster {x}. Then, the maximize Lmax[p(t|x)] in Eq.(8). singleton cluster {x} is merged with

73 new or inserted into tj such that tj = procedure sIBsoft(p(x, y), β,|T |) L { } new  argmintj ∈T Δ max( x ,tj), tj = tj // p(x, y): joint distribution {x}.Iftj = t, this process constitutes // β: trade - off parameter a reassignment of x, which contributes // |T |: cardinality of clusters to increasing the value of Lmax[p(t|x)] in // initialization Eq.(8). The above process is illustrated in 1: For every x ∈ X Figure 2. 2: For every t ∈ T 3: Set random value for p(t|x) ∈ [0, 1] 4. An Extension toward Soft | s.t. t p(t x)=1 Clustering in sIB // Main Loop Done In the original sIB algorithm, hard clus- 4: While not Done ← TRUE tering is assumed, i.e., each data x ∈ X is 5: x ∈ X assigned to only one cluster t ∈ T . Hard 6: For every 7: Select t ∈ T clustering is formulated as:  8: Divide t into {t ,tx}   1 x ∈ t 9: Select tj ∈ T \{t}∪{t }∪{tx} p(t|x)= (15) 0 x ∈ t 10: If tj = t, 11: Done ← FALSE However, the original IB variational prin- 12: Merge tx into tj ciple in Definition 3 does not impose this 13: Return p(t|x) (extra) constraint. Actually, the solution found by the iIB algorithm, which applies Figure 3: Pseudo-code of soft clustering the standard strategy in variational meth- in sIB. ods, does not necessarily obey this con- straint and conducts soft clustering. Difference from the original hard cluster- 4.1. An Approach toward Soft ing in sIB is that, a data x ∈ t is not en- Clustering in sIB tirely drawn from the cluster t. Instead, only some portion of x is drawn and as- To extend sIB algorithm toward soft clus- signed to a new cluster t . tering, we employ the following ap- x proach: Our procedure of soft clustering in sIB is summarized as the procedure sIBsoft in a) “cluster division & merger” strategy Figure fig:sIB-psuedo-code. The exten- As in the original sIB, ∀x ∈ X, ∃t ∈ sion from the original sIB algorithm is the Ts.t.x∈ t, divide the cluster t lines 7,8,9. {  } ⇒  ∪ into t ,tx . Here, t t tx, tx To realize the above soft clustering, the is treated as a singleton cluster but following issues should be resolved. with probabilistic assignment, as ex- plained below. Then, tx is merged • Selection of the cluster t ∈ T for with some tj ∈ T as {tx,tj}⇒t¯j. each x ∈ X s.t. x ∈ t (line 7) b) sequential update for each x ∈ X for  simplicity • Soft partitioning of t into {t ,tx}. One drawback of this approach is (line 8) that, it requires the O(|X|) loop. However, as a initial trial, we employ • Selection of the merger cluster tj sequential update for simplicity. (line 9)

74 In the following sections, we describe our Note that Eq.(18) holds for both hard and current approach for the above issues. soft clustering.

4.2. Soft Partitioning and  Proposition 11 Let t ⇒{tx,t} be some Merger soft partitioning of a cluster t ∈ T defined Our approach for soft partitioning in sIB in Definition 9. If IB Markovian relation is formulated as follows: holds, then

 Definition 9 (Soft Partitioning) For p(t)=p(t )+p(tx) (19) each x ∈ X, ∃t ∈ Ts.t.x∈ t, let  t ⇒{t ,t } be a soft partitioning of the p(y|tx)=p(y|x) (20) x  cluster t with a parameter γ (0 <γ≤ 1).  ∀  ∈ ∀ ∈ | 1 x = x Then, x X, tj T , p(x tx)=    0 x = x γp(t|x) x = x p(t |x)= (16) x 0 x = x Eq.(19) shows that Eq.(10) holds in both   hard and soft clustering.   (1 − γ)p(t|x) x = x p(t |x )=   p(t|x ) x = x 4.4. Cluster Merger Criterion (17)  For the rests, i.e., ∀tj ∈ Ts.t.tj = t ∧   tj = tx, ∀x = x, p(tj|x ) unchanged. Observation 12 Let ti,tj ∈ T be two clusters and consider the cluster merger For the clusters which are defined in Def- {ti,tj}⇒t¯. For soft clustering, if inition9, we follow the cluster merger we set conditional probability in the membership in Definition6. Thus, in merged cluster p(t¯|x) following Eq.(6), { } our soft clustering, if clusters ti,tj are from Proposition12, Eqs.(10) and (11)x ¯ merged into t, the cluster merger member- hold. Thus, Merger cost ΔLmax(ti,tj) in ship is set as Eq.(13) holds both for hard and soft clus- tering. p(t¯|x)=p(ti|x)+p(tj|x), ∀x ∈ X

We need to The value of γ (γ ∈ [0, 1]) From Observation 12, for our defini- tion of soft partitioning in Eqs.(16) and 4.3. Some Properties in Soft (17), we can utilize ΔLmax(tx,tj) as the { } ¯ Clustering merger cost of tx,tj into tj. Thus, we can set line 9 in Figure 3 as “Select tj = In hard clustering, JSΠ[p(x|ti),p(x|tj)] argmin ∈ L {t },t ”. in Eq.(14) can be simplified as t T max x Based on the above observation, we fol- JSΠ[p(x|ti),p(x|tj)] = H[Π] ([3], p.36). However, this does not hold for low the criterion in Eq.(8) and select ∈ soft clustering. Fortunately, similar the cluster tj T which minimizes L relation to Eq.(11) holds. Δ max(tx,tj) in Eq.(13). There are two cases for cluster merger: {t ,t }⇒t¯ Proposition 10 Let i j be some  • Cluster re-merge: t is selected and merger in T defined through Eq.(6). If IB    merged as t ⇒ t ∪ tx = t (as t¯) Markovian relation holds, then  ∀tj ∈ Ts.t.tj = t , p(x|t¯)=π p(x|t )+π p(x|t ) (18)  i i j j ΔLmax({tx},t) ≤ ΔLmax({tx},tj)

75 • cluster merge: ∃tj ∈ Ts.th.tj = utilize the measure for conditional distri-  t ,tj ⇒ tj ∪ tx = t¯j butions.  ∃tj ∈ Ts.t.tj = t ,  ΔLmax({tx},t) > ΔLmax({tx},tj) 4.6. Remaining Issues in Soft Clustering For cluster re-merge,

 We have described our current status for |  | | p(y t)=πt p(y t )+πtx p(y tx) the extension toward soft clustering in (21) sIB. The followings are the remaining is-

{  } sues resolved: Π= πt ,πtx p(t) p(t ) = { , x } p(t) p(t) i). Selection of the most “distant” clus- = {1 − γp(x|t),γp(x|t)} ter t for x (22) ii). Value of γ Here, p(x|t) corresponds to the weight (contribution) of x ∈ t for p(t) For i.), as stated in the previous subsec- tion, one heuristic approach for cluster se-  lection is to select the most distant cluster tj = t corresponds the re-merger to the   w.r.t. the (conditional) probability distri- original cluster t ({tx,t}⇒t¯ = t); tj =  t corresponds to a cluster update. bution for the relevant variable. Among many diversity measure for prob- 4.5. Cluster Selection ability distributions, it would be natural There are two ways to conduct cluster se- to utilize either DKL in Eq.(1) or JSΠ lection. in Eq.(2), because these diversity mea- sures emerged out of the framework of IB. • No a priori selection: At this moment we have not yet decided ∀x ∈ X, ∀t ∈ T s.t. x ∈ t, which divercity measure, DKL in Eq.(1) select the cluster t which minimize or JSΠ in Eq.(2), should be utilized to ΔLmax({tx},tj). measure the distance. Since this approach requires at least For ii.), ideally, it is desirable to show the O(XT) loops, as in aIB, the worst analytic solution for the value of γ.Ifitis case time complexity is at leat difficult or impossible, we plan to utilize O(|X|2). However, if we can assume uniform or adaptive value, i.e., that |T | is fixed and that |T ||X|, 1) uniform for ∀x ∈ X,or, this does not matter in practice. 2) adaptive for each x. • Heuristic selection: Our current plan for determining the Select the most “distant” cluster t for value of γ is as follows. We assume that each x. the cluster t s.t. x ∈ t is already se- lected (hopefully using the diversity mea- Our current plan is to utilize the latter sure as described above). Once the clus-  approach and select a cluster t ∈ T for ter t is fixed, for each tj ∈ T \t ∪{t }∪ each x ∈ X. Since the framework of IB {tx}, since p(tj), p(y|tj), and p(x|tj) are conducts clustering based on the condi- calculated beforehand, ΔLmax({tx},tj) tional distribution with resect to the rel- can be expressed as a function of γ. evant variable Y (see Eq.(7)), we plan to Thus, we can represent ΔLmax({tx},tj)

76 as ΔLmax(γ) for each tj. We plan to uti- References lize the variational method and determine the stationary point of γ for each tj, and [1] R. Bekkerman, R. El-Yaniv, L utilize the value to compute Δ max(γ). N. Tishby, and Y. Winter. Dis- tributional Word Clusters vs.Words 5. Concluding Remarks for Text Categorization. Journal This paper briefly explained the Informa- of Machine Learning Research, tion Bottleneck method which was orig- 3:1183Ð1208, 2003. inally proposed by Tishby [8] and sub- [2] F.C. Pereira, N. Tishby, and L. Lee. stantially extended in [3], and described Distributional clustering of English our preliminary attempt to extend the In- words. In Proceedings of the 30th formation Bottleneck (IB) method toward Annual Meeting of the Association soft clustering. Until now four algorithms for Computational Linguistics, pages have been proposed in the IB method. 183Ð190, 1993. Among them, we focus on an algorithm called sequential IB (sIB), and proposed [3] N. Slonim. The Information Bottle- an approach toward soft clustering. We neck:Theory and Applications. PhD proposed an approach for softening the thesis, School of Computer Science partition in the original sIB and for cluster and Engineering, Hebrew University, merger among the softened clusters. 2002. Our current status is very preliminary and there are many remaining issues, e.g., the [4] N. Slonim, N. Friedman, and selection of the cluster to divide, determi- N. Tishby. Unsupervised Docu- nation of the value of γ. We plan to tackle ment Classification using Sequential these issues in near future. We also plan Information Maximization. In Pro- to implement the described approach and ceedings of the 25th International conducts experiments to evaluate our ap- Conference on Research and De- proach. velopment in Information Retrieval (SIGIR), 2002. Acknowledgments This work is par- tially supported by JSPS Core-to-Core [5] N. Slonim and N. Tishby. Agglom- Program “Center for Research on Knowl- erative Information Bottleneck. In edge Media Technologies for the Ad- Advances in Neural Information Pro- vanced Federation, Utilization and Dis- cessing Systems (NIPS) 12, pages tribution of Knowledge Resrouces” and 617Ð623, 1999. The 21st Century COE Program “Meme- Media Technology Approach to the R&D [6] N. Slonim and N. Tishby. The Power of Next-Generation Information Tech- of Word Clusters for Text Classifica- nologies” The author is grateful to Profes- tion. In Proceedings of the 23rd Euro- sor Yuzuru Tanaka for encouraging this pean Colloquium on Information Re- path of research in this project. trieval Research, 2001.

[7] N. Tishby. Efficient Data Repre- sentations that Preserve Information. In Proceedings of the 6th Interna- tional Conference on Discovery Sci- ence (DS2003), page 45, 2003.

77 [8] N. Tishby, F. Pereira, and W. Bialek. Proof of Eq.(19)  The Information Bottleneck Method.  In Proceedings of the 37th Allerton p(tx)= p(tx,x) Conference on Communication and x | Computation, 1999. = p(tx,x)=p(tx x)p(x) = γp(t|x)p(x) = γp(t, x)   p(t )= p(t ,x) x  = p(x)p(t |x) x   = p(x )p(t|x ) Appendix x= x − | +p(x)((1 γ)p(t x))  − | Proof of Eq.(18) = p(x ,t) p(x)γp(t x) x

= p(t) − p(tx)  ∴ p(t)=p(tx)+p(t )  p(x, t¯)= p(x, y)p(t¯|x, y) Proof of Eq.(20). y ¯| | = p(x, y)p(t x) p(y tx)=p(y,tx)/p(tx)   y  p(x ,y)p(t |x ,y)  = x x | | |   = p(x, y)(p(ti x)+p(tj x)  x p(tx x )p(x ) y  |   x p(x ,y)p(tx x ) =  |  | = p(x, y)p(ti x) x p(tx x)p(x ) y |  p(x, y)p(tx x) = | + p(x, y)p(tj|x) p(tx x)p(x) y p(x, y)  = = p(x, y)p(ti|x, y) p(x) | y = p(y x) + p(x, y)p(tj|x, y) y   Proof of Eq.(21). = p(x, y, ti)+ p(x, y, tj) y y p(tx|x)p(x) p(x|tx)= = p(x, ti)+p(x, tj) p(tx) p(t ,x) = p(ti)p(x|ti)+p(tj)p(x|tj)  x =   p(t ,x) ∴ p(x|t¯)=πip(x|ti)+πjp(x|tj) x x p(t ,x) = x p(tx,x) =1   p(x |tx)=0(x = x) { } { p(ti) p(tj ) } where Π= πi,πj = p(t¯) , p(t¯) ,

78 Multiple News Articles Summarization based on Event Reference Information with Hierarchical Structure Analysis

Masaharu Yoshioka Makoto Haraguchi Graduate School of Information Science and Technology, Hokkaido University N14 W9, Kita-ku, Sapporo-shi, Hokkaido, JAPAN {yoshioka,mh}@ist.hokudai.ac.jp

Abstract In this paper, we describe our method to generate summary from multiple news articles. Since most of news articles report several events and these events are refereed with following articles, we use this event reference information to calculate importance of a sentence in multiple news articles. We also propose a method to delete redundant description by using similarity of events. Finally we discuss its effectiveness based on the evaluation result. 1. Introduction events among different news articles gains high score and sentences that are important Nowadays, we can access a large amount of only in one article are not selected as impor- text data. As a result, even for a simple topic, tant sentences. In order to solve this prob- it becomes difficult to read through all doc- lem, we propose a method to combine im- uments that are related to the topic. There- portant sentence extraction method for each fore, demand for multiple document summa- document and for whole document for re- rization is increasing and a task for multiple ducing this effect in this paper. document summarization that uses news ar- ticles is proposed for tackling this issue in 2. Multiple News Articles Summa- TSC-3[4]. rization based on Event Refer- Most significant characteristic of multiple ence Information document summarization compared with single one is that there are redundant in- 2.1. Extraction of the Event Refer- formation in a document set. For exam- ence Information ple, most of news articles reports events that As McKeown proposed in [7], there are sev- occurred at particular date and these events eral categories for multiple document sum- were refereed in following articles. Based marization. Single-event summarization is a on the assumption that repetitions of the category that aims to summarize a set of arti- same event description represent the rela- cles includes ones that reports occurrence of tionship among different articles, we have events and ones that reports following events already proposed a method to summarize (e.g., real fact of the event and sequel of the multiple news article by using event refer- event). In this type of articles sets, follow- ence information [10]. ing articles refer to the events that were al- By using TSC-3 data, we show the usage of ready described in previous articles and add event reference information is useful. How- another information that were related to pre- ever this method tends to select sentences vious events. Therefore, we assume iden- that include many repetitive events. As a tification of events in different articles and result, sentences that have similar repetitive reference information among these events is

79 useful to make a summary. Depth is a path length between Root of the Lexical cohesion method is one approach to event and root of the sentence in depen- deal with this reference information for sum- dency analysis tree. marizing a single document. However, in Date is a date that characterize the event. order to deal with reference information in This slot is not a required slot to define different documents, we cannot use informa- an event. tion such as distance between two sentences. So we propose a method to extract event in- ArticleDate is a date that the article was formation from news articles and to identify published. event by using similarity measure between two events. In this paper, we define “Event” Chunks represents list of word positions in as follows. a sentence.

Event is information that describes facts In this method, we extract event information and related information on particular from a sentence by using following steps. date. 1. We apply Cabocha[5] to obtain depen- Extraction of Events dency analysis tree.

Event is a unit to represent relationship 2. We select verbs and nouns that have among different articles and it should have modification words as candidates of information that is useful to identify same “Root” for events. events. In order to extract rich event infor- mation from sentences in a document, it is 3. We check whether negative expression better to analyze deep structure of sentences is included in root or not and set “Neg- in a document; e.g., discourse analysis and ative” based on this analysis. anaphoric analysis. However, it is very diffi- 4. We extract “Modifier” information cult to use such deep structural information, from dependency analysis tree. At this we decide to use results of dependency anal- time, we classify types of modifier by ysis and we set a size of a unit simple. In using POS tag and postpositional par- addition, date information is useful for dis- ticle (postpositional particle with “が” crimination of similar events; e.g., press re- and “は” are categorized into “Sub- lease in May is different from press release ject” and other postpositional parti- in April. cle are categorized into “postpositional- Based on this discussion, we select follow- postpositional particle” (e.g., “postpo- ing slots to define an event. sitional-に”). Modifier information in- cludes not only words that directly de- Root is a word that dominates an event pendent on Root word but also modi- (verb that represents action or noun that fiers for modifier words. Modifiers for represents subject or object) a modifier word are categorized into the same category of the modifier words. Modifier is words that modify root word. Words are categorized into several 5. When we can extract date information groups, such as subject and object from the sentence, we set this date as words for verbs and adjective and ad- “Date” for events that has dependency nominal words for nouns. with date words.

Negative represents modality of expres- 6. “Article Date” is obtained from article sion. information.

80 Extracted events:

Root ある (have) Original: Depth 2 市は「通常は5日前までに通告があ Subject 通常 (usually), 通告 (notification) る」と話し、県と基地周辺7市で10 Date 日に抗議する。(the municipal author- ArticleDate 980110 ities said “Usually, we have notifica- Chunks 4,3,2,1 tion five days before,” and claimed with postpositional-に 5 (five), 日 (day), 前 (before) prefectural authorities, and authorities Root 話す (say) from seven municipal around the base Depth 1 on 10th. (Articles from Mainichi news- Subject paper on January 10, 1998) Date 市は——————-D ArticleDate 980110 「通常は—–D | Chunks 5,4,2,1 5日前までに—D | postpositional-と 通常 (usually), 5 (five), 日 (day), 通告が-D | 前 (before), ある (have) ある」と-D | Root 抗議 (claim) 話し、—–D Depth 0 県と—D | Subject 市 (municipal authorities) 基地周辺-D | Date 10日 ((January) 10) 7市で—D ArticleDate 980110 10日に-D Chunks 9,8,7,6,10,0 抗議する。 postpositional-で 県 (prefecture), 基地 (base), 周辺 (around), 7 (seven), 市 (city) postpositional-に 10, 日 (day)

Figure 1: Example of Event Extraction from a sentence

7. “Depth” and “Chunks” are calculated links from ones with lower importance. by comparing event information with We formalize important sentence extraction the dependency analysis tree. algorithm by using following correspon- dence between web link structure and sen- Figure 1 shows a set of original sentence and tence structure. extracted events. A page in PageRank corresponds to a sen- tence. Dealing with Event Reference Informa- tion A link in PageRank corresponds to shar- ing of same words in two different sen- We have already proposed an algorithm to tences (A and B). Since it is difficult to calculate importance of a sentence in a sin- determine the direction of the link, we gle document based on PageRank [1] algo- formalize that there are two links (A to rithm [2]. B and B to A) for one sharing word. PageRank algorithm is the one that can cal- culate importance of WWW pages by us- In addition, all links in a page have same ing link analysis. Basic concept of the al- probability to distribute its importance in gorithm is distribution of page importance PageRank. However, in important sentence through link structure; i.e., page that has extraction, all words do not have same im- many links collects importance from other portance; e.g., sharing of large numbers of page and links from page with higher impor- words has closer relationship than sharing of tance has higher importance compared to the small numbers of words and sharing of rare

81 words has closer relationship than sharing of ticleDate” information. When incon- common words. Therefore, we calculate im- sistency is found, the pair of events is portance of link based on the role of shared removed from candidate similar event word in a sentence and Inverted Document pairs. Frequency (IDF). This important measure affects a transition matrix of PageRank that We generate links between two different is used to calculate distribution of impor- sentences that shares candidate similar event tance. pair(s). We also calculate importance of In this formalization, we merge all docu- link based on the importance of events in ments as a single document and calculate a sentence. Since we assume most impor- importance by using same algorithm for a tant issues discussed in a sentence are lo- single document. cated at the end of the sentence, we calculate importance of event based on the “Depth” Therefore, we employ latter approach to cal- information. Another possible measure to culate importance of all sentences. calculate this importance is a frequency In previous algorithm, links are generated based measure. However, since frequency when two different sentences shares same of events is already considered to calculate word(s). In order to handle event reference importance as a number of links, we do not information, we modify link generation al- use this measure. gorithm. We also set direction of the link as previous Since we can express same events in differ- one; i.e., we formalize that there are bidirec- ent ways, identification of same event is dif- tional two links for one similar event pair. ficult to identification of same word. There- We applied this event identification method fore, we introduce similarity measure be- for test run data and we found some links be- tween two events by using following two cri- tween two related events are missing. One teria. reason of this problem is that we can ex- press same event by using different vocab- 1. Similarity of words ulary. However, we found some words are First, we compare all words in “Root” shared for these event sets. Therefore, we and each category in “Modifiers.” For also use sharing of same word(s) for han- each category, we calculate existence dling these relationships. ratio of same word(s) that is biased by In PageRank algorithm, importance of page IDF. Second, we calculate weighted av- is calculated as a convergent vector of fol- erage of existence ratio for all cate- lowing recurrence formula. gories. “Root” and “Subject” in “Mod- ifiers” has higher weight compared to mij is an element of transition matrix M the other ones. We set threshold value of i-th row and j-th column and repre- to check whether the event pair belongs sents transition probability from j-th sen- to candidate similar event pairs or not. tence to i-th sentence based on link struc- ture. Since {m1j, m2j, ··· , mnj} represents Pn 2. Judgment of consistency of date transition probability, i=1 mij = 1. When When the event has “Date” informa- there is a sentence k that has no relationship tion, we verify consistency of date. with other sentence, we set mik = 0 . When “Date” information lacks spe- cific date such as year and/or month, −→r = M−→r (1) we complement this information by us- ing “ArticleDate” information. When In order to handle both types of links (shar- one article has “Date” information and ing events and sharing words), we make two other one does not have this informa- matrices Me(meij) and Mw(mwij) that cor- tion, we compare them by using “Ar- responds to transition matrix made by shar-

82 ing events and one made by sharing words, In this paper, we use simple formula respectively. In order to satisfy constraint 1/log(n + 1) (n: sentence number in an arti- Pn −→ −→ i=1 mij = 1, overall transition matrix cle) for initial value of v . v is normalized m M(mij) is calculated by using parameter β with Σi=0vi = 1 (m is number of all sen- and following formula. tences and vi is an initial importance value for i-th sentence). By using the algorithm discussed above, mij = β ∗ meij + (1 − β) ∗ mwij since similar sentences have similar links, Xn Xn similar sentences have similar scores. As a when meij 6= 0, mwij 6= 0 result, there is a chance to select redundant i=1 i=1 sentences when we select sentences from mij = meij Xn higher score ones. Therefore, we need a when mwij = 0 mechanism to detect such redundant sen- i=1 tences and it is required to remove such re- m = mw ij ij dundant description[6]. Xn when meij = 0 In this research, we use similarity measure i=1 of two events to calculate redundancy of new description. Since we can describe same in- formation by using different numbers of sen- Since a sentence that has no relationship tences, we do not compare sentences one by with other sentence is meaningless to in- one. We decompose a sentence into a set of clude into an abstract, we remove rows and events and we check redundancy of a sen- columns that corresponds to the sentence in tence by comparing with an extracted event the matrix M. Calculation of a convergent set that is obtained from extracted sentence; vector is conducted by using eigen vector i.e., we calculate weighted average of exis- calculation. Since convergent vector sat- tence ratio of events in the sentence. Since isfies following formula, convergent vector we assume most important issues discussed is an eigen vector of matrix M with eigen in a sentence are located at the end of sen- value = 1. −→ −→ tence, we set higher weight for an event with rc = M rc (2) lower “Depth.” By using this redundancy check mechanism, Usage of Sentence Position sentence extraction algorithm is as follows.

Since, in news articles, important sentences 1. Construction of an initial extracted may be described early part of each arti- event set cle, we use sentence position information A sentence with highest importance is for calculating importance of each sentence. selected as an extracted sentence. An In PageRank, an algorithm to set initial im- initial extracted event set is constructed portance of each page is proposed as Topic- from events in the selected sentence. Sensitive PageRank [3]. This algorithm is proposed to calculate importance of each 2. Redundancy check and addition of new page based on the category that the page be- description longs to. In this algorithm, they modify re- Our system tries to add new descrip- currence formula of PageRank as follows. tion from a sentence with higher im- −→v corresponds to initial importance vector portance. The system checks redun- and α is a parameter to control strength of dancy of the sentence and add it when it the effect by the vector. does not exceed predefined redundancy level. The system also adds events in −→ −→ −→ rc = (1 − α) ∗ M rc + α ∗ v (3) the selected sentence to the extracted

83 event set. This step reiterates to select a Coverage is an evaluation metric for mea- desired number of sentences. suring how close the system output is to the abstract taking into account the redundancy found in the set of the out- 2.2. Experiment and Discussion put. We apply this system for the task of TSC-3 [4]. In TSC-3, there are two subtasks. One is Table 1: Sentence Extraction Results with sentence extraction and the other is abstrac- Different System Setting tion. In this system, we do not use “set of Short Long questions about important information of the Event coverage 0.325 0.313 document sets” given by task organizer. only precision 0.491 0.540 In order to evaluate the the effectiveness of Word and coverage 0.323 0.341 using event sharing information, we conduct Event precision 0.523 0.592 following three different experiments. Since Word coverage 0.313 0.344 LEAD method that selects important sen- only precision 0.521 0.593 tence from the beginning of the articles does not work well, we set α = 0.1 for the evalu- In every cases, value of “Precision” is larger ation. than “Coverage”. It means that our system tends to select redundant sentences. Event only Transition matrix made only from event references is used for cal- 2.3. Discussion culating importance (β = 1.0). Our method has better performance in Word and Event Transition matrix that “Long” compared with “Short.” Since our combines event and word references algorithm does not pay attention to the is used for calculating importance length of a sentence and longer sentence has (β = 0.3). more chance to have more links, a longer sentence tends to have higher importance. Word only Transition matrix made only As a result, for the “Short” abstraction, such from word references is used for calcu- a longer sentence takes larger room and it lating importance (β = 0.0). becomes difficult to add another sentence. However for the “Long” abstraction, re- Table 1 shows a result of these experiments. moval of redundant description may make From this result, calculation of importance new room to add another sentence. For the by using event reference only has poorer per- future work, it is necessary to have a mech- formance than others. Short is a task to se- anism that pays attention to the length of a lect about 5% important sentence and long sentence. is a task to select about 10%. Since multiple document summarization has a similar sen- 3. Hierarchical Structure Analysis tences in different document, a correct an- about News Articles swer for each sentence extraction is defined 3.1. Two Stage Evaluation for Im- as a set of corresponding sentences. In order to take into account the redundant sentence portant Sentence Extraction selection, following two measures are used From this experiment, we found that our al- in TSC-3[4]. Please consult [4] for detail. gorithm tends to select longer sentences that have similar repetitive events among differ- Precision is the ratio of how many sen- ent news articles. As a result, it underesti- tences in the system output are included mate the importance of some sentences that in the set of the corresponding sen- are important only in one article but not dis- tences. cussed in other articles. These sentences are

84 important when we aim to include several the sentence position for i-th sentence sub-episode in the summary. is as follows. In this equation, n is a For solving this problem, we propose to use sentence number in an article and m is two stages evaluation for important sentence a sentence number in a paragraph. extraction; e.g., the first stage is an evalua- tion based on single article, and the second one is an evaluation based on the relation- vpj = 1/log(n+1)+1/log(m+1) (5) ships among different articles. Followings −→0 −→0 are algorithm for this two stage evaluation We also normalize vs and vp that sat- l 0 l 0 isfy Σi=0vsi = 1 and Σi=0vpi = 1. By −→0 1. Calculate single article importance using these values i-th element of v is value of each sentence defined as follows. Equation 3 for each document is used 0 for calculate importance value of each vj = γ ∗ sj + (1 − γ) ∗ pj (6) sentence. a and b is a document num- ber and a sentence number of the i-th At last, the overall importance value of sentence in a whole document respec- each sentences are calculated by using tively. Ma and va correspond to tran- following formula. sitional matrix and initial importance −→ −→r = (1 − α) ∗ M−→r + α ∗ v0 (7) vector based on the sentence position of c c the document a respectively. 3.2. Small Changes −→ −→ −→ rca = (1 − α) ∗ Marca + α ∗ va (4) In previous formalization, transition matrix The importance value of the i-th sen- are constructed based on the number of the tence vsi is a value of b-the sentence words shared in a sentence pair. As a result, value of rca. longer sentence tends to overestimate its im- portance. 2. Calculate overall importance value of In order to reduce this effect, cosine similar- each sentence ity measure is used for calculating transition There are two approaches for combin- possibility. In this formalization, transition ing single article importance value with possibility based on the word reference is as overall evaluation proposed in previous follows. The i-th sentence is represented as research. One approach is just calculat- a word vector −→s whose value for each word ing summation of these two importance i is calculated by TF·IDF( tf/(1+log(df+1)) value. The other one is using single ar- In this formula, tf is a value of term fre- ticle importance value as a part of initial quency of a corresponding word in a sen- importance vector in equation 3. By us- tence. df is a value of a document frequency ing the former approach, it is better to of a corresponding word in a newspaper ar- take the importance of each article into ticle database). account. On the other hand, the latter si · sj approach can propagate the single arti- mwij = (8) cle importance to other articles by us- |si||sj| ing transition matrix and calculate im- For the event we use number of events portance of each sentence as a part of (eventsi, eventsj) and number of simi- a whole set of articles. So we use the lar events shared with a sentence pair latter approach in this paper. similarEventsij are used for the formaliza- In our system, we use following initial tion. −→0 −→ vector v instead of v in previous sec- similarEventsij meij = (9) tion. Initial importance value based on eventsi · eventsj

85 3.3. Experiment Table 3: Best Performance System in TSC In order to evaluate the effect of the new 3[4] algorithm, we also apply it to TSC-3 data. Short Long Since there are several parameters for tun- Best coverage 0.372 0.363 ing, we conduct several experiments with coverage precision 0.591 0.587 different parameter settings. Since “Cover- (Short)[9] age” is a value for representing the close- Best coverage 0.329 0.391 ness the system output and the abstract, pa- coverage precision 0.567 0.680 rameter settings for achieving highest score (Long)[8] in “Coverage” are selected. Table 2 shows some results. First two results achieves an abstract based on event reference infor- highest score in “Coverage” in “Short” and mation with hierarchical structure analysis. “Long” resepctively. From the important sentece extraction ex- We confirm new algorithm improves the periment, we confirm our algorithm produce value of “Coverage” but degrades the value better results compared with the best perfor- of “Precision” compared with previous re- mance system in TSC-3. However, further sult in Table 1. This is because previous analysis is needed for characterizing our al- method tends to select sentences that have gorithm. repetitive event reference and redundant im- portant sentences are selected as a result. Acknowledgment We would like to thank organizer of TSC-3 for their effort to In addition, optimum α value for “Short” construct this test data. In this paper, we is different from that for “Long. ” α is use news articles data of Mainichi newspa- a parameter for controling the effect of ini- per and Yomiuri newspaper on year 1998 tial importance vector. Therefore, higher α and 1999. value tends to select sub-episode sentences in each article and is good for “Long” case. References Table 3 shows the best performance system in terms of “Coverage” in the TSC-3 [4]. [1] Sergey Brin and Lawrence Page. The Our system exceeds the performance of the anatomy of a large-scale hypertex- top system. tual Web search engine. Computer Networks and ISDN Systems, 30(1– 7):107–117, 1998. Table 2: Sentence Extraction Results with Different Parameter Setting [2] Makoto Haraguchi, Masaki Yotsutani, Short Long and Masaharu Yoshioka. Towards α = 0.2, β = 0.9 coverage 0.374 0.401 an organization and access method of γ = 0.4 precision 0.450 0.543 story databases. In 7th World Multi- α = 0.4, β = 0.9 coverage 0.341 0.419 conference on Systematics, Cybernet- γ = 0.4 precision 0.473 0.578 ics and Informatics (SCI2003) Vol. V, α = 0.4, β = 0.9 coverage 0.335 0.408 pages 213–216, 2003. γ = 0.0 precision 0.458 0.565 α = 0.4, β = 0.3 coverage 0.348 0.373 [3] T. Haveliwala. Topic-sensitive pager- γ = 0.4 precision 0.463 0.533 ank. In Proceedings of the Eleventh International World Wide Web Confer- ence, 2002. 4. Conclusion [4] Tsutomu Hirao, Manabu Okumura, In this research, we propose a method to Takahiro Fukusima, and Hidetsugu extract important sentences and to generate Nanba. Text summarization challenge

86 3 - text summarization evaluation at nt- Information Retrieval, Question An- cir workshop 4 -. In Proceedings of the swering and Summarization, 2004. Fourth NTCIR Workshop on Research http://research.nii.ac.jp/ntcir/workshop/ in Information Access Technologies OnlineProceedings4/TSC/NTCIR4- Information Retrieval, Question An- TSC-SekiY-1.0.pdf. swering and Summarization, 2004. http://research.nii.ac.jp/ntcir/workshop/ [10] Masaharu Yoshioka and Makoto OnlineProceedings4/TSC/NTCIR4- Haraguchi. Multiple news ar- OV-TSC-HiraoT.pdf. ticles summarization based on event reference information. In [5] Taku Kudo and Yuji Matsumoto. Proceedings of the Fourth NT- Japanese dependency analysis using CIR Workshop on Research in cascaded chunking. In CoNLL Information Access Technologies 2002: Proceedings of the 6th Confer- Information Retrieval, Question An- ence on Natural Language Learning swering and Summarization, 2004. 2002 (COLING 2002 Post-Conference http://research.nii.ac.jp/ntcir/workshop/ Workshops), pages 63–69, 2002. OnlineProceedings4/TSC/NTCIR4- TSC-YoshiokaM.pdf. [6] Inderjeet Mani. Automatic Summariza- tion. John Benjamin Publishing Com- pany, 2001.

[7] K.R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, M. Yen Kan, B. Schiffman, and S. Teufel. Columbia multi-document summarization: Ap- proach and evaluation. In Proceed- ings of Document Understanding Con- ference 2001, 2001.

[8] Tatsunori Mori, Masanori Nozawa, and Yoshiaki Asada. Multi-document summarization using a question- answering engine - yokohama national university at tsc3. In Proceedings of the Fourth NT- CIR Workshop on Research in Information Access Technologies Information Retrieval, Question An- swering and Summarization, 2004. http://research.nii.ac.jp/ntcir/workshop/ OnlineProceedings4/TSC/NTCIR4- TSC-MoriT.pdf.

[9] Koji Eguchi Yohei Seki and Noriko Kando. User-focused multi-document summarization with paragraph clustering and sentence-type filter- ing. In Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies

87 Using Metaphors for the Pragmatic Construction of Discourse Domains

Roland Kaschek1, Heinrich Mayr2, Klaus Kienzl3 1 Department of Information Systems, Massey University Palmerston North, New Zealand [email protected] 2 IWAS, University of Klagenfurt Klagenfurt, Austria heinrich@ifit.uni-klu.ac.at 3 Hohere¨ Bundeslehr- und Versuchsanstalt Villach Villach, Austria [email protected]

Abstract Conceptual modeling is an important task in systems development. It, however, is more often used as a tool rather than made the subject of research papers. Human thought is considered as essentially metaphorical. The paper therefore proposes to look at the metaphors that might be employed by conceptual modeling. We identify the metaphor INFORMATION IS A NETWORK as a key tool of conceptual modeling. We suppose that focusing on this metaphor has the potential for improving understanding of how conceptual modeling proceeds. The implications of this metaphor and a variant of it are discussed. This discussion includes the metaphor’s impact on the theory of abstraction. The paper contributes in two ways to a theory of conceptual modeling. It firstly provides a unifying view of the pragmatic construction of discourse domains that is based on using the metaphor INFORMATION ISANETWORK. It secondly proposes an approach to abstraction theory that fits this unifying view on conceptual modeling. 1. Introduction [29].

A theory of CM could aid in teaching CM, In a wide sense of the term conceptual mod- practically carry out CM, and impact the re- eling (CM) can be used to mean concep- search on CM. We presuppose that it would tualizing, i.e. constructing a discourse do- be useful to cover automated activities by main in terms of concepts that belong to our theory of CM. Therefore we aim at a the domain. For example CM can be used theory that avoids mentalism and is oriented for analyzing or describing computer ap- at linguistics and semiotics. Obviously we plications or to prescribe their characteris- have to take a look at papers from artificial tics. In other realms of informatics such intelligence that deal with related subjects. as reference modeling, enterprize model- Abstraction must be expected to be of major ing or similar CM is contributing essentially importance for our study since informatics to the achievements. CM thus is under- or computer science occasionally is consid- stood to be of major importance in infor- ered as the discipline of abstraction. matics. While there are major international conferences dedicated to CM, such as the CM has a precursor in philosophy, i.e. Car- Entity-Relationship conference, approaches nap’s logical construction of the world. We towards a theory of CM seem to be rare, see are going to briefly discuss Carnap’s ap-

88 proach and identify the metaphor INFORMA- part could be based on the system concept. TIONISANETWORK as the key method- The decomposition of systems into subsys- ological ingredient of his work. It turns out tems that are related to each other by struc- that this metaphor is omnipresent in concep- tural or communication relationships would tual modeling and that it helps developing a then in this ontological approach resemble semantic abstraction theory for CM. the use of the metaphor INFORMATION IS A NETWORK. In this paper we are not going to Software development is a process of knowl- compare and assess both approaches. Rather edge creation and representation. One of the we just want to develop our pragmatic ap- characteristics of this process is a succes- proach to constructing domains of discourse. sive formalization of the language in which that knowledge is obtained and represented. Paper Outline. In the next section we CM is that step in software processes that briefly discuss Carnap’s approach to a log- can be characterized by a two-fold formal- ical construction of the world. We then, in ization. Models can be given a formal se- section 3, discuss the foundations of mod- mantics and the next step, i.e. design, makes eling in so far as we believe to understand a formal prescription of a computer appli- them. In section 4 we briefly discuss the ex- cation to build. Obviously both of these pressiveness of the model concept that we formalizations are important regarding the have chosen for this paper. We then dis- success of software development. In many cuss metaphor and change of representation cases requirements holders are not able or in sections 5 and 6 respectively. After that willing to understand the formal statements we discuss how domains of discourse are referred to right now. It is, however, im- constructed by CM. We conclude our paper portant that they can learn to effectively and with section 8 regarding abstraction that is efficiently talk about the conceptual mod- followed by an outlook and our references. els. The paper is dedicated to simplifying that task. Our respective idea is that using 2. Carnap’s Logical Construction metaphors can bridge the gap between the of the World formal area and the application area. Note In this paper we focus on the theory of that we consider the need to speak metaphor- model creation. We ground our approach on ically as symmetric in terms of stakehold- Carnap’s attempt to logically construct the ers and developers, i.e. both of these talk world, see [4]. Carnap attempted to logi- about things they don’t understand in terms cally construct the world as a network, i.e. of what they understand. The point is that a graph. His idea was that (1) the nodes of these actors understand different things and the graph should represent the things in the that none of these is superior to the other. world and that (2) the edges of the graph We are modelers. Thus we believe in a cer- should represent the relationships between tain pluralism of views that is similar to the the things of the world. Carnap assumed that one in [27]. We are, however, not relativists. (3) for each two things T1,T2 in the world We presuppose that among competing con- it would be possible to find relationships to ceptualizations there can be defined and jus- other things in the world such that T1 and tified a partial order of preference. The re- T2 could be distinguished from each other in spective justification requires certainly a de- terms of their pattern of relationship partici- tailed and specific consideration of the con- pation. ceptualization’s parameters in the sense of Carnap’s objective was to formally charac- section 3. terize all entities in the world in one given One can consider our approach in this paper language, i.e. the language of graph the- as an epistemological one. Then one would ory. Achieving that objective would turn be interested in identifying and discussing all discourse regarding entities within the its ontological counterpart. Such counter- world into a mathematical discourse. The

89 aspect of that appealing to Carnap might CM follows Carnap in representing informa- have been that finally one of the competing tion as a network. It, however, deviates es- views would have been singled out as the sentially from Carnap in several respects, i.e. true one by means by mathematical reason- CM ing. We identify two major challenges re- garding that objective. Firstly, that which is • uses a material language rather than a counted as an entity in the world as well as formal one, i.e. it provides intensional that which is counted as a relationship be- definitions for the information chunks tween two things in the world is subjective, in the network and the relationships be- i.e. depends on the one who admits a thing tween these. The respective definitions or a relationship. Secondly, it is not likely are typically gathered together in a the- that for each two entities e, e0 in the world saurus or data dictionary, see, for exam- there is exactly one way of identifying sets ple [25]; of relationships r, r0 to other entities in the world respectively that would allow to dis- • focuses on a discourse domain rather tinguish between e and e0. The consequence than the whole world; of each of the two objections is that each • introduces typing of nodes and edges of list of formal characterizations of entities in the network used for representing infor- the world is incomplete or contains subjec- mation. This typing of nodes results in tive cognitive aspects (that aim at justifying information chunks being considered as selections or decisions that were made) and (1) entity type, relationship type etc. in thus deciding about assertions regarding en- the entity relationship model (see [5], tities in the world by means of only mathe- [33]), or as (2) classes, associations etc. matical reasoning becomes impossible. in a class diagram (see [26]), or as (3) Carnap didn’t achieve his goal. His method, state, state transition etc. in a state chart however, did not become obsolete therefore. (see [26]), or similar. Aspects of this Rather it is quite successful. Using the ter- typing can be understood in terms of a minology of metaphors that we are going further metaphor, i.e. INFORMATION IS to introduce below one can say that Carnap A CONTAINER that we are going to dis- has used the metaphor INFORMATION IS A cuss below; NETWORK. His method of constructing a discourse domain consists in creating infor- • relaxes the concept of identifiability of mation that specifies admissible and filters things in terms of their network repre- out inadmissable phenomena and structure. sentation. In fact, the row-id of rela- That metaphor enables humans to handle the tional databases as well as the object- abstract concept of ’information’. It sug- identity of object-oriented modeling gests firstly using information as a filter ap- approaches enables modeling discourse plied to the phenomena of a discourse. Sec- domains in which different phenom- ondly it suggests representing that informa- ena exist that cannot be distinguished tion as a graph the nodes of which represent from each other by limiting to only information chunks and the edges of which their characteristics in this discourse represent relationships between information domain.1 chunks. Carnap’s method nowadays is very frequently used in CM. Independently of the In summing up we say that CM in our view thing discussed (whether for example struc- aims at a pragmatic approach to construct- ture, behavior, causality, modality or qual- ing discourse domains rather than at a for- ity of systems is discussed) and the model- mal one. ing language used (whether for example a 1The latter, however, not necessarily is a problem, graphical, textual, or formal modeling lan- as one always can define handles that enable distin- guage is used) Carnap’s method is applied. guishing the mentioned entities from each other.

90 Note that the quality of information to func- straction regarding reasoning about the tion as a filter does not depend on the physical world; kind of information chunks and relation- ships thereof that are introduced with the • phenomenon constitution, i.e. a num- metaphor INFORMATION IS A NETWORK. ber of phenomena is created employ- What counts is that the metaphor enables ing a semantic model that was chosen making distinctions that make a difference in an ad-hoc manner. Note that the ac- (see [2] for a definition of information that tor’s attention as well as the coarseness is based on this idea) with respect to recog- of the actor’s concepts suggests another nizing and conceptualizing the phenomena kind of abstraction in reasoning about in a discourse domain. the physical world. Note also that we consider the semantic model chosen ad- 3. Modeling hoc for conceptualizing sense irritation as in part depending on cultural and in- CM achieves an understanding of a dis- dividual aspects such as education and course domain such that (1) those phenom- available resources; ena can be distinguished from each other that are important with respect to a given • modeling, i.e. the phenomenon consti- purpose; (2) a state of affairs is introduced tution is reconsidered employing a se- that aids in meeting the objective of model- mantic model that is purposefully cho- ing the discourse domain; and (3) the flux of sen and that leads to presupposing a set state of affairs can be represented by a se- of adjusted phenomena that are tailored quence of model changes. towards the utility for solving a prob- We are going to discuss abstraction below. lem at hand. In contrast to the first We use a preliminary understanding of ab- two levels on the level of modeling we straction as that what one achieves by ab- deal with relatively conscious choices stracting. The everyday-language meaning of abstractions. Note that in coinci- of “to abstract” according to the Oxford dence with [12] we consider a model English Dictionary Online is “ (t)o with- (i.e. the particular phenomenon consti- draw, deduct, remove, or take away (some- tution worked out with the chosen se- thing)” from something else. Before becom- mantic model) as an ontology if it is ing more technical we briefly sketch the con- considered as the (only) true concep- text of what we focus at. Contrary to prac- tualization, i.e. the necessary phenom- tice in artificial intelligence (see, for exam- ena constitution with respect to the dis- ple [36, p. 1296-8]) we do not consider it course domain. essential for an abstraction to preserve de- sirable properties or similar. We rather con- Our approach incorporates only a rather lim- sider quality of abstraction as the thing to ited version of realism. We find ample sup- deal with when it comes to assessing the util- port for our staged approach to conceptu- ity of representations. Regarding how an ac- alization in [27]. We mention in particular tor conceptualizes a discourse domain that I. Kant (p. 259), M. Schlick (p. 239), E. essentially uses input from outside the ac- Cassirer and W. v. Helmholtz (p. 235), K. tor’s own cognitive system we distinguish Buhler¨ (p. 231), L. Boltzmann and H. Hertz from each other three stages, (p. 157), and J. S. Mill (pp. 115 - 117). In this context we consider a semantic • sense irritation, i. e. signals trigger ir- model Σ as a pair (MC,RC) of model- ritation patterns being present in a num- ing concept MC and representation con- ber of the actor’s input interfaces. In- cept RC, for more detail on the latter con- completeness and imperfection of the cepts see [19]. To summarize briefly, a mod- actor’s senses lead to a first kind of ab- eling concept is a set of modeling notions

91 and abstraction concepts together with in- that in case O or M is semiotic then O∗ = O tensional definitions, usage rules, including and M ∗ = M respectively. With these as- concept compatibility constraints, and appli- sumptions one can explicate the model rela- cation hints. Employing these definitions, tionship more precisely and explain the fol- rules and hints is supposed to result in a lowing properties that were introduced by model, i.e. a set of domain concepts and ab- Stachowiak. stractions which are instantiations of model- ing notions and abstraction concepts respec- • mapping property, i.e. A uses M ∗ as tively. A representation concept is a set of a proxy for O∗ and has defined an ico- conventions for representing models. Obvi- morphism, i.e. a pair (τ, γ) of partial ously, communication about models would mappings τ : O∗ → M ∗ and γ : M ∗ → be close to impossible or useless without O∗. We call these mappings truncation the effective use of representation concepts. and grounding respectively. With respect to what follows below we fo- cus on the stage of modeling, i.e. we focus • truncation property, i.e. O∗ \ dom(τ) on semiotic entities. Rather than the term se- 6= ∅; mantic model the term modeling language ∗ is used as well. • pragmatic property, i.e. using M as a proxy for O∗ is only legitimate for With respect to modeling we see at least two specified actors, purposes of these ac- different approaches. The modeling actor tors, admissible tools and techniques of may be human or not. In the latter case investigation, time of use, etc. the artificial intelligence community usually focuses on computerized systems. We first Stachowiak requests τ −1 ⊆ γ as a compat- discuss what we consider approaches to se- ibility constraint between these two partial mantic modeling (i.e. modeling processes as mappings. This implies that τ is injective conducted by humans) and deal with change where it is defined. The constraint implies of representation (i.e. automated modeling τ ◦ γ(p) = p, for all predicates p ∈ im(τ). processes) later. Therefore τ ◦ γ ◦ τ = τ. We are going to use We use the model theory of Stachowiak this equation as the condition required that [30, 31, 32] as a starting point. According relates truncations and groundings. to his theory models are typically used as We point out that additionally to the trun- proxy objects. The model relationship, i.e. cation property a model in general also has the predicate R(O,M,A) means that actor an extension property, i.e. M ∗ \ im(τ) 6= A uses model M as a proxy for model orig- ∅. Stachowiak called the predicates in inal O for specified purposes only, period of M ∗ \ im(τ) abundant and the ones in O∗ \ time and context, and subject to using spec- dom(τ) the preterited ones. We call the ified tools and techniques. Stachowiak ad- former surplus and the latter ignored ones. mits models and originals that are not semi- The surplus predicates contribute essentially otic for a given actor. His theory of how to the use that can be made from the model. model and original are related to each other, For a respective example see [12]. One is however, is based on semiotic objects that a tempted to define abstraction as that kind modeler uses to represent the non semiotic of model relationship R(O,M,A) in which ones. no surplus predicates occur. We choose a Assuming that a modeler A has established a slightly different approach, as that tempta- model relationship R(O,M,A) Stachowiak tion, however, might lead to difficulties with assumes that O∗ and M ∗ are semiotic repre- respect to the syntax of the chosen modeling sentations of O and M respectively, i.e. they language. are sets of predicates with which A concep- For a simplifying description of how mod- tualizes O and M respectively. We assume eling processes can be understood according

92 to Stachowiak (see [30, pp. 317]2) we as- 67]) who considered analogy as a key cog- sume that an actor A wants to solve a prob- nitive tool for understanding that, which was lem PO that is encoded in terms of an ob- not understood before. ject O. For some reason or another A cannot With the term sign relationship we refer to solve the problem and assumes a reformula- the predicate S(T,R,A), which means that tion of PO might help. A therefore replaces actor A uses the token T to refer to the sign ∗ O and M by semiotic representations O and referent R. We do not elaborate more on ∗ M of these respectively and establishes a semiotics or linguistics and refer the inter- ∗ ∗ model relationship R(O ,M ,A). Then A ested audience to [20]. We restrict the class obtains a reformulation PO∗ of PO in terms of things from which a sign referent may be ∗ of O . A observes that (s)he likely can turn chosen. We assume that the referent is a a solution of PO∗ into a candidate solution cultural unit, i.e. something to which one of PO. A furthermore uses the truncation τ in a given culture can meaningfully refer to. ∗ to obtain a problem PM ∗ in terms of M as We have adopted this concept from [6]. Us- an encoding of PO∗ . If A can work out a so- ing the now already obvious metaphor CUL- lution SM ∗ of PM ∗ then A uses the ground- TURALUNITISANETWORK can be ex- ing γ for obtaining a hypothesis SO∗ as an pected to operationalize the concept of cul- ∗ encoding of SM ∗ in terms of O . If A has tural unit. Putting things that way is a con- chosen a successful approach the hypothesis sequence of assuming that the sign referent SO∗ can be turned into a solution proposal can only serve as a referent based on a con- SO regarding PO. If things work out well SO vention of a group of actors. The obvious is an admissible solution of PO. If the latter structural similarity between the model rela- is not the case then a number of feedback tionship and sign relationship can be made steps can be incorporated easily into the de- explicit by assigning to the model the role of scription given right now. Stachowiak notes the token and to the original the role of the that the process of solution generation may referent. The grounding γ can then be used lead to increased knowledge that in turn may for conducting the reference to the sign ref- ∗ result in the models M or M , or the origi- erent. A model can be used for transferring ∗ nals O or O being changed. information, that was requested in terms of Semiotic models are similar to signs, the original, from model to original. In con- metaphors and other concepts such as sim- trast to this with the sign token only the ex- ile and analogy. We are going to discuss istence of the referent is pointed out. We are the first two of these below. Prior to that going to explore this issue in more depth be- we mention that our epistemic pluralism low. suggests that epistemological theories that For us semantic information is the quality do not focus on modeling might have their of a string to function as a filter and to help strengths, as they represent different angles eliminate states of affairs out of a set of eli- on acquiring knowledge about the world. gible such states. Compare, for example [9, Note furthermore that of course the termi- p. 435]. The amount of semantic informa- nology used by the various authors is not tion contained in a string s is proportional to coordinated. In [22, ch. 7] for example the cardinality of the set of states of affairs it is mentioned that the word model some- eliminated by that constraint. times is used as a synonym of analogy. In that source, moreover, analogy is said to be 4. Expressiveness of Stachowiak’s fundamental for how humans understand the Model Concept world. That valuation of analogy is compa- rable to the one of Ch. Fourier (see [27, p. Our view of models as a reference concept similar to signs and related to metaphors 2Note that we use a different terminology and no- suggests that models should be a tool of tation. many scientific disciplines. At least in Logic

93 and Mathematics concepts referred to as utterances. At least the more well-known se- models are actually important. For simpli- mantic models of informatics such as entity- fying our terminology we call our concept relationship models, class diagrams, state of model a Philosophical one. Additionally charts, Petri nets, or similar can be shown we consider a Linguistic concept of Model. to be covered by that language. Recall that We are going to show that our Philosophi- a structure R is called retract of a structure cal model concept in fact can be understood C if there are morphisms ι : R → C, and as a generalization of the Linguistic model π : C → R, such that π ◦ ι = 1R. In concepts. such situation is C called a co-retract of R. Since the identity morphism 1R is the iden- 4.1. Structures tity on R the morphism ι is an injective map- ping and so the pair (ι, π) can be considered Modeling, as considered above, involves as an icomorphism. Since π ◦ ι = 1R im- creating semiotic models, i.e. sets of pred- plies ι ◦ π ◦ ι = ι one can (in the sense of icates. Since predicates reflect someone ac- Stachowiak) consider a structure as a model crediting a notion to an entity we consider of each of its retracts. A characterization of semiotic models as verdictive utterances, i.e. co-retracts of algebraic structures is given in an utterance by means of which an actor ac- [11]. credits to a number of subjects a predicate notion. For more detail regarding verdici- 4.2. Grammars tive utterances see, e.g. [1]. In some cases We are going to show that a formal gram- of verdictive utterances the utterance creates mar can be considered as a model of each a social fact. In these cases the utterance of the words that belong to the language de- in general will only then be successful if fined by the grammar. Recall for this (see the utterer is authorized, performs the utter- for example the chapter on formal languages ance in the right context and in the required in [34]) that a formal grammar is defined as way. These social parameters of verdictive a quadruple Γ = (S, N, T, R), where N,T utterances are not in the focus of our paper. are finite disjoint sets that are called non- However, they cannot be ignored completely terminals and terminals respectively. Fur- because in organizations where CM is ap- ther S ∈ N is called the start symbol and plied a consensus needs to be fixed and a R = {(σ, τ) | σ, τ ∈ (N ∪ T )∗, σ = decision needs to be made that is binding in σ1 . . . σm, {σ1, . . . , σm} ∩ N 6= ∅} a set of or for that organization. Here we abstract production rules. The language Λ(Γ) de- from these parameters. Considering CM in fined by Γ is the intersection of all sets Σ the narrower sense leads to the concept of with the properties judgment coming into the focus. Judgments are forms of verdictive utterances that are • S ∈ Σ, and elementary with respect to the information • s ∈ Σ, (σ, τ) ∈ R, α, ω ∈ (N ∪ T )∗, conveyed. Judgments can be represented as s = ασω imply that ατω ∈ Σ. predicates U(S, P, C, A). Such a predicate means that the actor A accredits to all in- If there is a rule r = (σ, τ) ∈ R in a stances of the subject notion S the predicate grammar Γ = (S, N, T, R) and ασω, ατω ∈ notion P in a way specified by the copula Λ(Γ) then one says that ατω is directly de- notion C. For an elaborated theory of judg- rived from ασω. In that case we write ment see, e.g. [24]. r(ασω → ατω). Direct derivation is a re- It has been shown in [17] that algebraic as lation on Λ(Γ) its transitive closure is called well as relational structures can be speci- derivation δ and for ξ, ψ ∈ Λ(Γ) one says fied as sets of judgments. It thus is not a that ψ is derived from ξ if (ξ, ψ) ∈ δ. severe limitation to (with respect to model- For a grammar Γ = (S, N, T, R) let Γ0 = ing purposes) restrict oneself to verdictive {(1,S)}∪{(2, n) | n ∈ N}∪{(3, t) |

94 t ∈ T }∪{(4, r) | r ∈ R}. Obviously tional theory of metaphor the so-called ’lit- µ : {Γ | Γ is a grammar} → {Γ0 | eral meaning’ of a phrase was supposed to Γ is a grammar},Γ 7→ Γ0, is a bijec- be the standard mode of language use. We tion. For each λ ∈ Λ(Γ) choose a short- do not go into respective detail here. Rather est derivation s1s2 . . . sm. Define Π(λ) = we refer to Rumelhart’s paper in [23] for a {(i, si−1, si, r) | 1 ≤ i ≤ m, s0 = discussion of some of the problems regard- S, sm = λ, r(si−1 → si), for all i}. ing ’literal meaning’. We define now τ : Π(λ) → Γ0, such Examples of further metaphors that are used that (i, si−1, si, r) 7→ (4, r). We further- more define γ :Γ0 → Π(λ), such that with respect to software systems are not only the interface metaphors such as THECOM- (j, x) 7→ (i, si−1, si, x), if j = 4 and i PUTERISADESKTOP or THECOMPUTER is the smallest index with (i, si−1, si, x) ∈ ISACOUPLEOF (DIRECT MANIPULATION) Π(λ), and (1, s0, s1, r) otherwise. Then it OBJECTS. Rather metaphors such as SOFT- follows that τγτ(i, si−1, si, r) = τγ(4, r) = WARE IS A LIVING THING or SOFTWARE τ(k, sk−1, sk, r) = (4, r) = τ(i, si−1, si, r). Therefore τ ◦ γ ◦ τ = τ, which had to be ISABUILDING are used as well. The lat- shown. ter ones respectively enable humans to talk about software systems in terms of these sys- 5. Metaphors tems aging or getting infected by a virus or similar and in terms of their architec- One of the most basic phenomena of lan- ture. It occurs such that the metaphor SOFT- guage use is the capability of speakers to WARE IS AN ACTIVE THING is a precursor use different words for the same purpose of the metaphor SOFTWARE IS A LIVING or thing.3 Metaphor is a particular way of THING. It is presupposed in talking about achieving that effect. According to [18] a programs in terms of the metaphorical ex- metaphor is a partial mapping out of a source pressions as ”the programm calls a subpro- domain in a target domain. These domains, gram”, ”the program is executing”, or sim- if the metaphor is supposed to work, must ilar. Obviously, presupposing current com- be cultural units. Lakoff in [18] has used the puter technology, programs are not active. metaphors LOVEISAJOURNEY, and LIFE Rather the computer uses them for control- ISAJOURNEY to illustrate the concept of ling its own behavior. In [13] the metaphor- metaphor. The notational convention bor- ical expressions assistant, agent, master and rowed from Lakoff puts the metaphor name peer are used for discussing roles that can in small capitals and lists the source domain be played by computers in human computer before the infix ISA and the target domain interaction. These metaphorical expressions after that infix. Metaphors are used for trans- fit the metaphor THECOMPUTERISANIN- ferring information from the target domain TERLOCUTOR. The role of metaphors with to the source domain. If one ever has under- respect to the evolution of knowledge that is stood a metaphor such as LOVEISAJOUR- relevant for computer use was discussed in NEY then it offers no difficulty to understand [15]. a metaphorical expression such as our re- The Oxford English Dictionary Online de- lationship is at a dead end even if one has fines dynamics as “(t)hat branch of any sci- never encountered it before. It is a striking ence in which force or forces are consid- advantage of Lakoff’s theory of metaphor to ered.” The term ’dynamic modeling’ is enable understanding that fact that, by the quite frequently used in systems develop- way, couldn’t be explained satisfactorily by ment when one wants to refer to understand- the elder theory of metaphor (see for ex- ing the changes of states of certain items ample Searle’s paper in [23]). In the tradi- such as the objects in an object model. Ob- 3We thank Tilman Reuther for mentioning this ob- viously the use of that term is a metaphor- servation to us. ical one, as the events that occur and im-

95 pact the state of such objects neither are malism, that hides details and preserves de- forces nor exert forces on the considered ob- sirable properties.” Obviously the terms ’de- jects. Rather the computer is supposed to tail’, and ’desirable property’ need further use the information that the event has oc- clarification. It is our view that these terms curred for adapting the objects’ state ade- only can be explained with respect to the quately. One could use the metaphor IN- purpose of the abstraction performed. To- FORMATION IS A FORCE as the base of us- gether with these terms also the term ’sim- ing the term ’dynamic modeling’. Compara- plicity’ is used in the literature for explain- ble metaphors are in use in software process ing aspects of abstraction. The latter term, modeling. Finally, also talking about static however, as well is notoriously hard to de- modeling when referring to creating entity- fine. relationship diagrams or class diagrams is a [36, p. 1300] states that it is important to metaphoric language use, as static also is a choose abstractions carefully and that “... term used in studying forces and their con- the utility of abstraction is related to the util- sequences. ity of the representation change that is asso- The obvious similarity between metaphors ciated with it.” The drawbacks that might and model relationships can be made ex- occur with respect to the utility of an ab- plicit by assigning to the original and the straction are defined in terms of computa- model the role of the source domain and tional cost and problem complexity. It oc- the target domain respectively. The actual curs as a reasonable assumption that change mapping is then conducted by the trunca- of representation is a particular kind of mod- tion τ. The difference between models and eling approach that is designed for use by metaphors is that metaphors compared to computerized systems. With respect to hu- models achieve a significantly reduced infor- man modeling processes various reasons can mation transfer from model to original be- be mentioned that justify modeling (such cause the only mappings that can be used as as ethical, legal, or practical reasons). Re- a grounding are subsets of τ −1, i.e. the re- garding change of representation, however, lation inverse of the truncation. Metaphors the dominating concern is the computational are important with respect to CM as the cost of problem solving. Abstraction is modern consensus of linguistics is that hu- aimed at because it often, but by no means man thought is essentially metaphorical. For always (see [36, 4.(d)]), achieves its poten- more detail see, e.g. [21]. tial for cost reduction. With respect to the terminology used in CM one would want to 6. Change of representation refer to model quality to capture abstractness or simplicity of representations. According to [8, p. 1197] change of rep- In reviewing abstraction theory, as used in resentation in artificial intelligence means artificial intelligence, [36, pp. 1298-9] uses that an actor changes a problem statement. the metaphors4 The actor is usually presupposed to be a computerized one. The outline of model- 1. ABSTRACTIONISAMAPPING ing processes as described above obviously BETWEEN PREDICATE SETS covers reformulation, since reformulation not necessarily has to explicitly refer to the 2. ABSTRACTIONISAMAPPING non-semiotic entities in the role of original BETWEENFORMALSYSTEMS, and model respectively that were mentioned above. [8, p. 1197] gives a nice example 3. ABSTRACTIONISASEMANTIC showing that in fact mentioning such enti- MAPPING OF INTERPRETATION ties not necessarily is required. In [36, p. MODELS, and 1297] it is then defined that “an abstraction 4Note that the source does not explicitly uses the is a change in representation, in a same for- terminology of metaphors.

96 4. ABSTRACTIONISASEMANTIC stricted to a physical one. For non-physical MAPPINGOFFORMULAE. worlds, however, no hints at all are given as to how the phenomena in P could be deter- Using the first metaphor seems to sug- mined. For another reason, that we are going gest the problem that was already men- to explain now, this point is very important tioned above when discussing abstraction in as well. These two points make us reject that the context of Stachowiak’s model theory. abstraction theory for CM. Therefore we do not accept that metaphor. Given a world W an abstraction is then As [36] mentions, the second metaphor a mapping A from a presentation frame- seems not to be suited for aiding actors work Dg(W ) onto a presentation frame- in finding new abstractions. It also seems work Da(W ) such that (1) Dg(W ) = not to fit to our general view of modeling (Pg,Sg,Lg,Tg), Da(W ) = (Pa,Sa,La,Ta), processes, as it does not refer to the phe- (2) A(Pg) = Pa, and (3) Pa is sim- nomena in a discourse domain. We thus re- pler than Pg. The predicate simpler is de- ject that metaphor as well. The last two ap- fined such that for the Kolmogorov com- proaches comply with our view of modeling plexity K the following condition holds: processes as involving phenomena that con- 5 K(Γg) ≤ K(Γa). In this condition the stitute an object level with respect to which sets Γg, Γa are so-called configurations of on a meta level semantic information can be the perception processes Pg, Pa respectively specified for restricting the state of affairs at that generate Pg,Pa respectively. Not only the object level. We are going to use aspects is the Kolmogorov complexity known to of both of these metaphors for our proposal be a non-computable function and therefore of abstraction in CM. practically computing the values K(Γg) and [36] proposes a new theory of abstraction. K(Γa) might be not feasible. The configu- We reject that theory with respect to CM rations furthermore are assumed to be sets for reasons we explain shortly. That the- of bit strings and the Kolmogorov complex- ory employs the metaphor ABSTRACTION ity of a configuration is the maximum of the IS A MAPPING BETWEEN PRESENTATION Kolmogorov complexities of the bit strings FRAMEWORKS. A presentation framework in that configuration (see [28, p. 766]). For is a 4-tuple D = (P, S, L, T ) where P is a infinite configurations a maximum not nec- set of phenomena that are typed as either ob- essarily exists and thus the complexity com- ject, attribute, function, or relationship; S is parison might be meaningless. Finally, since a storage structure for storing the phenom- there were no process quality aspects spec- ena including their typing; L is a language ified for the processes Pg and Pa it cannot for talking about the phenomena; and T is a be guaranteed that repeated or concurrent theory on which reasoning about the world process execution is invariant against Kol- is based. A presentation framework is sup- mogorov complexity. We consequently cur- posed to have a number of parameters. Two rently see no way of using the abstraction of these are of particular importance for our theory under discussion for CM. purposes. The first one is a world W that is represented by the framework. The sec- 7. Constructing Discourse Do- ond one is a process P that generates the set mains P out of W . For the process P [36] does not provide a specification. Only a reference The main purpose of CM is conceptualizing to [28] is given. In that source, however, in a suitable way a discourse domain at hand. there only is given an example of a digital The phenomena in the discourse domain can camera that is capable of providing bit pat- 5Recall that for a string S the complexity K(S) is terns that function as phenomena. That is an defined as the length of a shortest program that puts important point as [36, p. 1303] says that out S. A very brief introduction to Kolmogorov com- the concept of ’world’ by no means is re- plexity can be found in The Wikipedia.

97 be distinguished from each other by a com- mapped onto node or edge. bination of structural, extensional, and in- tensional parts of the conceptualization. CM A further metaphor is used in constructions may be concerned with concepts (see, e.g. of discourse domains, i.e. INFORMATION [14] for a definition and discussion) in two IS A CONTAINER. In this metaphor infor- ways. First of all, if CM is used by an actor mation is considered as something that can A to conceptualize a discourse domain D∗ contain information in the sense that the lat- then A creates a model M ∗ of D∗, i.e. A ter can be put in or out. For example, when creates a model relationship R(D∗,M ∗,A). modeling with state charts (see, e.g. [25] For this paper D∗,M ∗ are a semiotic en- or [26]) one uses two forms of nesting, i.e. tity and thus involves concepts. Finally, the of including state charts into a given state discourse domain D that has D∗ as a semi- chart. The latter is then considered as be- otic representation in R(D∗,M ∗,A) may al- ing at a higher level of abstraction. These ready be a conceptualization. That, for ex- two ways represent generalization and ag- ample, is the case in requirements elicitation gregation of states respectively. Similarly, and analysis. in class diagrams one uses boxes to repre- sent classes. Inside of these boxes the at- A discourse domain is something that one tributes as well as the methods of objects can sensibly talk about, i.e. something that of the respective class are listed. In class can be a matter of discourse. We assume diagrams the metaphor INFORMATION IS A that the minimum assumption for that to CONTAINER is used for representing data be possible is that one can identify in the encapsulation. Please note that typically in discourse domain a collection of phenom- ER diagrams that metaphor is not employed ena. In our view when conceptualizing a since data encapsulation is not a modeling discourse domain a modeler aims at provid- notion of the ER model. ER modeling as ing semantic information that rules out all well as Petri Net modeling use the metaphor non-sensible phenomena. Our thesis is that INFORMATION IS A CONTAINER in a further in this respect conceptual modelers use the way. An elementary model part such as an metaphor INFORMATION IS A NETWORK. entity, a relationship, or a place, or a transi- This metaphor suggests that the required in- tion is defined as including a whole ER di- formation can be represented as a number agram or Petri Net respectively. The latter of information chunks (nodes of the net- form of applying the metaphor INFORMA- work) and that several of these chunks are TION IS A CONTAINER is called refinement. related to each other (edges in the network). Applying that metaphor enables modelers In refinement the elementary model part is to qualitatively classify discourse domain- supposed to be at a higher level of abstrac- phenomena as thing and thing relationship tion than is the refining model. The pur- respectively. If the modeler wants a more so- pose of refinement is exactly the introduc- phisticated conceptualization then the mod- tion of these levels of abstraction. These lev- eler may chose a target domain for the men- els make complex models more usable be- tioned metaphor that has node labels or edge cause for some use of them it is sufficient labels. These labels can be used to de- to focus on a given level of abstraction and fine types of discourse domain-phenomena. ignore the rest of the model. Thus refine- The semantic information initially required ment has the potential of significantly reduc- most easily can be represented graphically ing the complexity an actor has to deal with such that nodes labeled equally will be rep- at once for solving a problem at hand. Ob- resented by a given shape (such as a rectan- viously, applying the metaphor INFORMA- gle or a diamond etc.) and that the edges TION IS A CONTAINER can be made redun- labeled equally will be represented as a par- dant by explicitly typing nodes and edges ticular edge type. Both edges and nodes will in a graph that represents information ac- be attached an identifier of the phenomena cording to the metaphor INFORMATION IS

98 ANETWORK. dimensional ones. One-dimensional items, Providing evidence for a theory like the one i.e. lines cannot have attributes because that we have outlined in this paper is not a sim- would involve a line connecting a line with ple task. Currently we have no clues of a two-dimensional item, i.e. an information how to empirically confirm our theory. We chunk. can, however, mention three observations Thirdly, our theory complies with a lot of that could be considered as such confirma- modeling practice that we are aware of. We tion. Firstly, in object-relational database mention in particular the use of the metaphor management systems the subtype as well as INFORMATION IS A SPATIAL NETWORK the sub-table relationship is denoted as ”un- that we are going to discuss in detail in sec- der” (see [35]). This seems surprising since tion 8.1. We mention here only the interest- established different terminology is avail- ing point that this metaphor enables making able. The chosen terminology seems to pre- pragmatic distinctions that are usually not suppose the metaphor INFORMATION IS A supported by formal model semantics. SPATIAL NETWORK. Secondly, there is a long tradition of work regarding the ques- 8. Abstraction tion whether relationship types in the ER- model can have attributes or not. There is Abstraction comes in at least two forms, quite some work of Ron Weber and col- i.e. intra model abstraction (as levels of ab- laborators devoted to that question. Weber straction within a model) and inter model has argued in two ways, i.e. empirically abstractions (as different views on a given and theoretically. In his empirical respec- discourse domain). In contrast to that we tive work he has tried to show that modelers suggest understanding the intra model ab- find it more difficult to understand relation- straction as consequence of replacing the ships in the ER-model that have attributes metaphor INFORMATION IS A NETWORK rather than relationships without such at- by the metaphor INFORMATION IS A SPA- tributes. In his theoretical respective work TIALNETWORK. Intra model abstraction Weber has argued that ontological reason- is mainly used for improving model qual- ing suggests that relationships must not have ities such as readability, memorizabilty, or attributes. It is however a severe weak- maintainability. This form of abstraction ness in his latter reasoning that he only uses mainly is concerned with the detail of spec- Bunge’s ontology and does not provide a ification. We suggest understanding inter rationale for choosing an ontology. While model abstraction as omitting information from the view of an ontology’s purpose that represented by one model in comparison might not be particularly worrying it is how- to another model. For providing a frame- ever known that a number of ontologies ex- work that tells us what kind of information ist and that for example Chisholm’s ontol- can be omitted from a model we refer to ogy has been applied to information systems the metaphor INFORMATION IS A SPATIAL related questions. The obvious weakness of NETWORK. We thus suggest that the infor- not providing a justification of the choice mation that can be omitted concerns ruling made for the ontology seems to suggest a out a discourse domain-phenomenon from predisposition in place against attributes of the model, or typing it inadequately, or by relationships. The conventional use of the relating it to undesirable phenomena in an undesirable way. metaphor INFORMATION IS A NETWORK has the potential to explain the existence of It appears to exist a separation of interest in such a predisposition. According to that abstraction with respect to data engineering metaphor information is only represented and artificial intelligence. In [36, p. 1294- by two-dimensional items (such as rectan- 6] five examples of inter model abstraction gle or circle which represent entity type are discussed while intra model abstraction and value type respectively) rather than one- is ignored. In [3, section 2.1] only the tra-

99 ditional examples of intra model abstraction ables using Plato’s idea of heaven of ideas are considered, i.e. classification, general- by associating top with heaven, and bottom ization and aggregation. In [7] a further in- with earth. tra model abstraction is mentioned, i.e. de- A sophisticated concept of intra model ab- contextualization. [16] is among the papers straction was discussed in [10]. In that that have discussed further concepts of intra paper a cohesion C is understood to be a model abstraction, i.e. cooperations. It is an pair C = (A, PA) where A is a set of obvious idea to employ intra model abstrac- roles or perspectives and P is a mapping tion for achieving inter model abstraction. S A PA : P( a∈A ε(a)) → {true, false} that From a view of systematic model develop- defines which entities in the union of the ment one feels tempted to challenge that in- extensions ε(a) of roles a ∈ A are related ter model abstraction is not discussed in data to each other by C.6 Intra model abstrac- engineering. It seems that data engineering tion was understood as provided by abstrac- aims at finding the right abstractions only tion concepts such as aggregation, classifi- within a model while artificial intelligence cation, and generalization. An abstraction aims at finding the right level of abstraction concept was in that source understood as a of the models, i.e. one tries to find minimum mapping α :(A, PA) → (B,PB) where complexity models of a given discourse do- (A, P ) and (B,P ) are cohesion and B ⊆ main. From a holistic point of view both of A B A, as well as PB = PA|B, i.e. PB is the disciplines seem to have chosen an in- what one gets by restricting P to the roles complete approach to abstraction. A in B. The metaphor INFORMATION IS A 8.1. Intra model abstraction NETWORK together with the idea of typing nodes and edges suggests the concept of co- Diagrams as the ones obtained from em- hesion, as one node (i.e. information chunk) ploying the metaphor INFORMATION IS A can be typed such that it represents a cohe- NETWORK may be confusing as no obvi- sion and the edges that connect it to other ous principle to group nodes and edges can nodes can be understood as the function of be applied. For introducing hierarchy into the latter nodes in that cohesion. This con- the diagram the basic metaphor INFORMA- cept of abstraction can be described by the TIONISANETWORK is complemented by metaphor ABSTRACTIONISASEMANTIC the metaphors INFORMATION IS A SPATIAL EXCEPTIONFROMCOHESION. The latter NETWORK and INFORMATION IS A CON- is similar to the metaphor ABSTRACTIONIS TAINER. This allows in the diagram to ASEMANTICMAPPINGOFFORMULAE (see make use of the well-known pairs of spatial [36]) because cohesion in a model can be un- concepts such as (up, down), (left, right), derstood as a formula. (front, rear), and (in, out). These spa- The metaphor INFORMATION IS A SPATIAL tial concepts can be applied to the semantic NETWORK can be extended such that it ex- information as represented by the network ploits not only the vertical spatial dimen- because of the typing of nodes and edges. sion (i.e. up vs down). Rather the depth Some of the nodes are considered as primary space dimension can be included as well (i.e. and others as secondary. The primary ones front vs. back). In that extended version the are considered such that they represent the metaphor enables relating perceived size of information regarding the discourse domain a concept representation as proximity to the and the secondary ones provide the detail or model user. That which appears to be larger data for that. Of two related primary nodes is closer to the model user and thus more rel- the one would then considered as less ab- evant for him or her. The spatial dimension stract that has a more detailed description. of breadth (i.e. left vs. right) can be ex- The more abstract node would in the dia- gram then be placed above the less abstract 6For a set S the set of subsets of S is denoted as one. In doing so the revised metaphor en- P(S).

100 ploited too. For example, in modeling with related the right way to other phenomena state charts it is a very common to arrange included into model X of O∗. We can them such that object creation occurs in the then define the predicate α(M 0,M,P,O∗). top left corner of a diagram and that the se- It means that model M 0 is more abstract quence of object states is arranged in lines than model M with respect to purpose P from left to right and from top to bottom of actor A and original O in state i if 7 0 0 of the diagram. A respective metaphor that |{o ∈ Oi|ci(o, M ) = true or ti(o, M ) = 0 could justify that implicit convention is IN- true or ri(o, M ) = true }| ≤ |{o ∈ FORMATION IS A TEXT. Oi|ci(o, M) = true or ti(o, M) = Proximity of information chunks in a model true or ri(o, M) = true }|. (and thus recognized by the model user) Obviously more research is required for can be understood as meaning relatedness. working out an operational approach to Similarly the metaphor INFORMATION IS A achieving more abstract models that meet CONTAINER can be used to address encap- given requirements such as given queries sulation and protection. can be efficiently processed. Our idea in that respect is to investigate design primi- 8.2. Inter model abstraction tives such as the one in [3] with respect to their impact on inter model abstraction. We understand semantic information as something that can be used for eliminating or reducing uncertainty. That is what makes 9. Outlook us interested in having information about It would be of some interest to have a more many entities of the one or another discourse complete list of the metaphors that are used domain. This understanding is coherent with in CM. the well-known definition of information as As we have introduced two kinds of abstrac- the distinction or difference that makes a dif- tion it is our aim in future work for each ference ([2]). Aiming at applying this view item L in a given list of modeling languages of information to our problem of defining to provide a list of abstraction concepts that inter model abstraction makes us focus at enables for a given model M to create all the ways in which phenomena are included the models that are more abstract than M. into models. The details of the used mod- The top-down primitives proposed in [3] are eling language come to mind. Since we are not complete, i.e. do not enable construct- interested in a pragmatic theory of abstrac- ing all entity-relationship diagrams. Provid- tion we do not, however, focus on technical ing a sufficiently expressive set of abstrac- or formal detail. We rather again make use tion concepts must thus considered to be a of the metaphor INFORMATION IS A NET- non-trivial task. WORK and our explanation of it provided above. The idea proposed in [28] to use the Kol- mogorov complexity K as a simplicity mea- Consider an entity O the history of which sure for bit strings that represent a state of is captured as the sequence {O } . Let i i∈I a world W appears as promising. It would i ∈ I and O∗ be a semiotic representa- be interesting to work out formulae for the tion of O . Consider model relationships i complexity K(p : M) for each model M and R(O∗,M,A), and R(O∗,M 0,A) with M each design primitive p taken from a set P of and M 0 being semiotic entities. Consider design primitives. It could be possible to use the predicates c (o, X), t (o, X), r (o, X). i i i K as a measure of model complexity and to Let them mean that phenomenon o is be- find strategies for complexity reduction that ing included in model X of O∗, that it is preserves information contents, i.e. specifies typed the right way in X and that it is the same discourse domain model. 7We thank Bernhard Rumpe for mentioning this observation to us. References

101 [1] John Longshaw Austin. Zur Theorie 2004: Proceedings zur Tagung. GI der Sprechakte. Philip Reclam jun., Lecture Notes in Informatics, P-45, Stuttgart,1979. 2004. [2] Gregory Bateson. Mind and nature: a [11] Roland Kaschek. A characterization of necessary unity. Dutton, New York. co-retracts of functional structures. In W. Ghler and G. Preuss, editors, Cat- [3] Carlo Batini, Stefano Ceri, and egorical Structures and their Applica- Shamkant Navathe. Conceptual tions. World Scientific Publishers, New database design. The Benjamin / Jersey et al., 2004. Cummings Publishing Company; Inc., Redwood City, California, 1992. [12] Roland Kaschek. Modeling ontol- ogy use for information systems. In [4] Rudolf Carnap. The logical construc- Klaus-Dieter Althoff, Andreas Den- tion of the world, (In German). Fe- gel, Ralph Bergmann, Markus Nick , lix Meiner Verlag, Hamburg, 2nd ed. Thomas Roth-Berghofer Th., editors, 1961. Professional Knowledge Management. [5] Peter Chen. The entity relationship LNCS 3782 Springer Verlag, 2005. model: toward a unified view of data. [13] Roland Kaschek. Preface. In Roland ACM Transactions on Database Sys- Kaschek, editor, Intelligent assistant tems 1(1976), pp. 9 - 37. systems: concepts, techniques, tech- [6] Umberto Eco. Introduction to semi- nologies. To appear IDEA Group Inc., otics, (In German). Wilhelm Fink Ver- 2006. lag & Co. KG, 1994. [14] Wilhelm Kamlah and Paul Lorenzen. [7] Pierre Luigi Ferrari. Abstraction in Logical propaedeutics: preschool of mathemattics. In Lorenza Saitta, edi- sensible sppeech, (In German). Verlag tor, The abstraction paths: from expe- J. B. Metzler, Stuttgart, Weimar, 1996. rience to concept. Philosophical trans- [15] Roland Kaschek and Alexei Tretiakov. actions of The Royal Society, Biolog- Knowledge evolution: the metaphor ical Sciences, vol. 358, no. 1435. July case. In Proceedings of CSIT 2006, 2003, pp. 1225 - 1230. Amman 5 - 7 April 2006. [8] Robert C. Holte and Berthe Y. [16] Roland Kaschek and Claudia Kohl and Choueiry. Abstraction and reformu- Heinrich Mayr: Cooperations- an ab- lation in artificial intelligence. In straction concept suitable for business Lorenza Saitta, editor, The abstraction process reengineering. In J. Gyork¨ os¨ paths: from experience to concept. and M. Krisper and H. Mayr, edi- Philosophical transactions of The tors, Re-Technologies for Information Royal Society, Biological Sciences, Systems, ReTIS’95 Conference Pro- vol. 358, no. 1435. July 2003, pp. 1197 ceedings, Oldenbourg Verlag, Wien, - 1204. Mnchen, 1995. [9] Philip Johnson-Laird. A taxonomy of thinking. In R. J. Sternberg and E. [17] Roland Kaschek and Klaus P. Jan- Smith , editors, The psychology of tke and Tibor-Istvan Nebel. Towards human thought, Cambridge University understanding meme media knowl- Press, 1988, pp. 429 - 457. edge evolution. In Klaus P. Jantke and Aran Lunzer and Nicolas Spyratos [10] Roland Kaschek. A little theory of ab- and Yuzuru Tanaka, editors, Federa- straction. In Bernhard Rumpe, Wolf- tion over the Web. LNAI 3847 Springer gang Hesse, editors, Modellierung Verlag, 2006.

102 [18] George Lakoff. The contemporary the- visual perception. Applied Artificial ory of metaphor. In Andrew Ortony, Intelligence 15,8(2001), pp. 761 - 776. editor, Metaphor and Thought, 4th. printing of second edition of 1992. [29] Reinhard Schuette and Thomas Rot- Cambridge University Press, 1998. thowe. The Guidelines of Modeling - an approach to enhance the quality [19] Peter C. Lockemann and Heinrich C. in information models. Proceedings of Mayr. Computer supported informa- the 1998 Entity-Relationship confer- tion systems, (In German). Springer ence. Verlag Berlin, Heidelberg, 1978. [30] Herbert Stachowiak. General model [20] Angelika Linke, Markus Nussbaumer, theory, (In German). Springer, 1973. and Paul R. Portmann; Transscript lin- guistics, (In German). Max Niemeyer [31] Herbert Stachowiak. Knowledge levels Verlag, Tubingen,¨ 4th. unchanged edi- towards systematic neo-pragmatism tion 2001. and general model theory, (In Ger- man). In Herbert Stachowiak, editor, [21] George Lakoff and Mark Johnson. Modelle: Constructions of reality, (In Metaphors we live by. University of German), pages 87 - 146. Wilhelm Chicago Press, Chicago, 1980. Fink Verlag, Munchen,¨ 1983.

[22] Werner Metzig and Martin Schus- [32] Herbert Stachowiak. Model, (In Ger- ter. Learning to learn, (In German). man). In Helmut Seiffert and Ger- Springer, Berlin, Heidelberg, 7th. edi- ard Radnitzky, editors, Handlexikon tion, 2006. zur Wissenschaftstheorie, pages 219 - 222. Deutscher Taschenbuch Verlag [23] Andrew Ortony (editor). Metaphor and GmbH&Co. KG, Munchen,¨ 1992. Thought, 4th. printing of second edi- tion of 1992. Cambridge University [33] Bernhard Thalheim. Entity- Press, 1998. relationship modeling. Springer- Verlag, Berlin, Heidelberg, 2000. [24] Alexander Pfander.¨ Logic, (In Ger- man). Verlag von Max Niemeyer, Halle [34] Allan Tucker, editor. Computer science a. d. Saale, 1921. handbook, Chapman & Hall / CRC and ACM, 2004. [25] James Rumbaugh et al. Object- oriented modeling and design. [35] Can Turker¨ and Gunther Saake. Prentice-Hall, Englewood Cliffs, NJ, Object-relational datbases (In Ger- 1991. man). dpunkt, 2006

[26] James Rumbaugh and Michael Blaha. [36] Jean-Daniel Zucker. A grounded the- Object-oriented modeling and design, ory of abstraction in artificial intel- 2nd edition. Prentice-Hall, Englewood ligence. In Lorenza Saitta (ed.) The Cliffs, NJ, 2004. abstraction paths: from experience to concept. Philosophical transactions of [27] Hans Jorg¨ Sandkuhler.¨ Nature and The Royal Society, Biological Sci- knowledge cultures: Sorbonne lec- ences, vol. 358, no. 1435. July 2003, tures on pluralism and epistemology, pp. 1203 - 1309. (In German). Verlag J. B. Metzler, Stuttgart, Weimar, 2002.

[28] Lorenza Saitta and Jean-Daniel Zucker. A model of abstraction in

103 Data Mining in Biological Data for BiOkIS

Gunar Fiedler, Bernhard Thalheim Department of Computer Science, Kiel University, Olshausenstr. 40, 24098 Kiel, Germany Email: {fiedler,thalheim}@is.informatik.uni-kiel.de Dirk Fleischer Institute for Polar Ecology, Kiel University, Wischhofstr. 13, 24148 Kiel, Germany Email: dfl[email protected] Heye Rumohr Leibniz-Institute of Marine Sciences, Benthic Ecology, Dusternbrooker¨ Weg 20, 24105 Kiel, Germany Email: [email protected]

Abstract Modern ecological research attempts to measure and explain global dependencies and changes and therefore needs to access and evaluate scientific data just in time. At the same time, the uniqueness of biological data has to be taken into account by ensuring a safe long term data storage with high availability as a base for new biological studies. The bio-ecological information system BiOkIS attempts to provide this platform. It offers a safe data store for biological data and promotes data exchange between researchers or within distributed research groups. In addition to a simple data archive BiOkIS will provide data evaluation methods to participating researches as an inducement for making research data available to others. This paper will introduce the problems of modern biological data management and sketches the possibilities of offering archive and mining facilities to non database experts. 1. Introduction that the existing complex and restrictive data exchange protocols between research The efforts of large scale ecological re- partners only lead to an overall confusion search are nowadays handicaped by the and faulty data transfers. Although col- lack of an information infrastructure that laboration is explicitly promoted, the ex- supports a free and uncomplicated data change rate of data between researches is transfer between researches. The marine not substantially increased. ecological research in Kiel with its long tradition was involved in long term obser- From a technical point of view biologi- vations for more than 20 years and partic- cal data can be separated in different cat- ipated in ecological projects with partners egories, for example data obtained from distributed over several countries. The biological surveys (survey data) and data experiences from these projects showed obtained from biological experiments.

104 Survey data is usually of a descriptive ally, the research group processed numer- character. To obtain biological survey ous historical data sources from the last data an area of interest is defined and sam- 100 years, e.g. diploma theses and PhD ples are taken. These samples are associ- theses. This data is now annotated, doc- ated with their geographic references (lat- umented, and available in an electronic itude and longitude values). The sample’s form. parameters of interest, e.g. number of in- This stock of data seems to be a solid dividuals of a certain species, or the in- ground for formulating hypotheses on bi- dividual’s size, are determined by apply- ological questions. But unfortunately, the ing specific techniques. Where appropri- management of biological data has to face ate, the data is compared with results from several problems on the technical as well former surveys. Projects that describe re- as on the organizational level that pre- gional communities of species or changes vent an efficient evaluation. The technical within a region collect a huge amount of problems are the typical ones known e.g. data over the years, e.g. 2500 records for from distributed systems and data integra- 5 stations per year. tion scenarios ([5]): If hypotheses are made based on survey Heterogeneity of data: data provided by data, there is always the need for an ex- different researchers or research groups periment that verifies the result. A single differ from each other. Due to the orga- biological experiment produces approxi- nization of biological research, data sets mately 1000 records. Due to modern pro- are designed based on different points cedures of taking samples this size is even of view. Depending on the utilization, increased. the personal goals and expectations, each Additional to these traditional invasive researcher chooses its own structures. methods of taking samples out of a habi- There is a common agreement about pro- tat, photographs and video sequences are cedures and the overall set of concepts used for visual scans without removing that are represented in data sets, but it is individuals from the habitats of interest very costly to fully map raw data from and for tracking species hardly repre- different sources. sented by quantitative methods. Discretization of data or conversion of Geographically referenced data can be continous data to discrete data leads to used e.g. for creation of distribution maps different interpretations of data. Dis- for species or other (higher) taxa. Using cretization may be based on time, space, obtained biomass data it is possible to es- or other abstractions which may vary timate the number of individuals of cer- among different research groups. tain species in different regions. The scope of data representation is of- Table 1 gives an impression of the size ten concentrated on the scope of the user. of available biological data. It shows the This leads to the representation of macro amount of survey data and experimen- data in the data sets which are compre- tal data available at the Research Group hensions of micro data. Additionally, for Marine Benthic Ecology at the Leib- data abstractions are often used instead niz Institute of Marine Science and the of basic data. Attribute names are usually Kiel University. The data is geograph- not documented, abbreviations are quite ically referenced and described by meta commonly used. The data sets are full data. Figure 1 shows a map of stations of symbolic values and artifical identifica- where survey data is obtained. Addition- tors without any meaning outside this par-

105 Type Estimated Number of Records survey data (quantitative) ca. 250,000 records survey data (qualitative) ca. 100,000 records experimental data ca. 150,000 records photographs ca. 6500 video sequences ca. 530 hours Table 1: data collected by researches from Kiel

Figure 1: survey stations ticular data set and without any documen- dents write their diploma theses or PhD tation. Type information is often missing, theses. All these students are pressed for e.g. because the data set is represented as limited time only, so local and personal a plain text file. This leads to additional optimizations of obtaining, storing, and problems concerning the quality of data evaluating data naturally occur. Biologi- at the instance level: values are missing, cal researchers are no computer scientists attributes are coded differently (e.g. using and so, they are usually not familar with degrees with decimals for geographic ref- aspects of data modeling, data exchange, erences vs. using degrees with minutes of and long term data storage. Especially arc and decimals) or have wrong values. the description of data by meta data and backup procedures are often not present The organization of biological research in mind. After the thesis is finished also leads to specific problems which pre- the researcher usually leaves the research vent an effective usage of data. Most of group, home directories are deleted, and the work is done during student’s work or raw data is lost. Only comprehensions other time limited projects, e.g. while stu-

106 like reports and publications remain. But vented to store geological data from core even if raw data can be preserved, the drillings. This causes disadvantages: the knowledge about the data structure, the existing data model has to be used to concepts behind the data, the description store biological data which is structurally of the procedures that were applied to ob- and semantically totally different. PAN- tain the data is lost. That’s why biological GAEA is bringing data online right after data sets may end up uninterpretable or publication, but unfortunately, diploma have to be downscaled after a few years theses or student research projects are not or even months. considered as real publications. There- Publication of research results usually fore, they will seldomly become avail- doesn’t contribute to the public outreach, able. The data transfer protocols of PAN- because publications only contain com- GAEA are complex to use for many re- prehensions of data, raw data remains at searches. So many problems addressed the author. The usual peer review proce- above remain unsolved. dure delays publications together with the There is a number of highly specialized reusage of data sets for several months. information systems for research data Additional to these arguments there are as well as popular science. For ex- some further aspects that prevent an ef- ample, FishBase (www.fishbase.org) is a ficient reuse of biological data. Failed ex- database containing facts about fishes. It periments are usually not documented or can be used by taxonomists, fishermen, even published. But especially these fail- teachers, or pupils and provides facts ures may be important for the research about species as well as multimedia con- community because learning from other tent like photographs. FishBase as an failures may speed up the development example shows the possibilities of pro- of functional solutions. Distributed re- viding scientific information over a Web search projects often lack of an efficient based information system. It addresses data transfer between local groups due the needs of the public due to its great to technical difficulties and complex data stock of available photographs. transfer protocols. The interested public ReefBase (www.reefbase.org) is the first is excluded from participating in scien- online information system concerning tific results due to journal publication, so coral reefs. It provides information and research groups have to make additional services for professionals e.g. in the ar- investments for public relationship man- eas of management, science, and pollu- agement and popular science. tion control. ReefBase provides a great variety of geographically referenced and 2. Existing Information Systems annotated photographs. Visualization is supported using a mapping service that The problem of data loss in biologi- produces interactive maps. This enables cal research is known for a couple of a playful handling of information. years. There exist some information systems that try to address this topic. The Ocean Biogeographic Information The world’s data archive for environmen- System OBIS (www.iobis.org) provides tal, marine, and geological data PAN- downloadable data sets from a network GAEA (www.pangaea.de) ensures long of institutions from 45 nations. Unfor- term archiving of geographically refer- tunately, these data sets are restricted to enced data. Researchers are called to observations (presence of individuals, not store their data in PANGAEA. It was in- absence). Distribution maps that reveal

107 distribution pattern for a great number of Which data sets are geographically or tax- species all around the world can be gen- onomically related? Is there any multi- erated online. media material supporting this data set? Many of these associations can be auto- 3. General Goals matically obtained based on the data’s us- The problems mentioned above can be age or meta data. faced by using an information system Searchable data sets: Who is working with a central data storage that is capable on a similar topic? Are there data sets that of providing extensive descriptions and I need for my hypotheses? Did the ex- documentation of arbitrary data sets from periment I am currently planning already the known domain. In detail, the follow- failed in a similar context? ing points are currently missing: Controlled output of biological data: Controlled data input: data transfer into Two points are important: first, the sup- the system should be user-oriented and port for common data formats used in the not backed by complex exchange proto- research community and transformations cols or data formats. Data import should between these formats. Second, there is take place as soon as possible, in the ideal a need for a mechanism that controlls ac- case directly when data is obtained. The cess to data sets such that every provider user should be guided through the import of data sets is notified when data is down- procedure, integrity checks should be ap- loaded or displayed. plied both on syntactical as well as on the semantical level. In the case of detected Virtual working groups: Data sets grow errors, the user should instantly correct over time. To publish a data set is only the data. Common data formats that are useful if major steps of the survey or ex- used by researches like CSV or spread- periment are finished. Additionally, re- sheet formats have to be supported. searchers are usually not interested in an At each step the user has to be forced to uncontrolled distribution of the raw data; annotate the data. Who obtained the data? especially, if own publications are still Which procedures were made? Where the in the queue. On the other hand it is survey / experiment took place? What adequate to share unfinished and locally is the meaning of the record’s attributes? incomplete data sets among project col- How is the data encoded? Is this data leagues or trusted partners. After a pre- set based on special assumptions? Most defined period of time where the data is errors will be detected on the syntacti- reserved for private use it is made avail- cal level, but it is also possible to ap- able to the public, but still under down- ply a basic set of tests whether the data load control. set is plausible according to the common Cooperation with other systems: An- knowledge of the research community for notated and documented data sets can this domain. be automatically transfered to cooperat- Association between data sets: very im- ing information systems like PANGAEA portant is the linking between depend- or OBIS so researchers save time for ing data sets. Which experiments are data publication. For that reason, open based on which surveys? Which evalua- standards and formats have to be sup- tions were already made for a certain data ported like the Darwin Core exchange for- set? Which publications exist? Which re- mat ([6]) developed by the Taxonomic search group used a particular data set? Databases Working Group ([8]).

108 A system that fulfills these requirements its support of linking placemarks with can be used as a library for sophisticated Web resources it is easily possible to pro- data evaluation. In principle, there are duce distribution maps in different styles two possibilities for using this data: In or to display measures or multimedia ele- the simple case data sets are searched, ments according to their coordinates. downloaded, and processed locally by Survey data can be visualized by dia- third-party statistical or data mining soft- grams showing the number of individu- ware. As a surplus value to providers of als or species at a certain place over a pe- data sets the information system can pro- riod of time. This enables researchers to vide a set of evaluation methods, so data choose data sets for comparative studies providers do not need to care about tech- or to make a long term observation of a nical questions, e.g. obtaining and in- geographical region of interest. Figure 3 stalling complex software just to process show an example. The same discussion common evaluation techniques. In bio- can be made for experimental data. logical research, a couple of methods are established, e.g.: Figure 4 shows possibilities for data min- ing approaches. For example, by using • Bray-Curtis similarity matrices (sim- Bray-Curtis similarity matrices it is pos- ilarity between samples according to sible to identify the species of importance found species and individuals, [2]) for a data set. This may lead to a better • Simpson Index (diversity index understanding of the biological system. based on the number of species, [13]) 4. Provided Services within a • Margalef Index bio-ecological Information • Shannon-Weaver Index (diversity in- System dex based on the number of species and individuals, [12]) Looking at different types of users a bio- • Pielou Evenness (analysis of the dis- ecological information system will pro- tribution of individuals, [10]) vide different services. Although the sys- • taxonomic distinctness ([3]) tem will mainly be used by scientists, • AMBI, BQI (diversity indexes for there are also interesting services for the classifying ecological quality, [1, interested public or sponsors of research 11]) projects. • statistical significance tests The group of researchers can be divided Beside these statistical evaluation meth- in two groups: data providers and data ods there are a couple of standard visu- users. Data providers publish annotated alization techniques that can be applied, data sets by using the system. Data users e.g. for creating project summaries or are researchers that will use data sets other kinds of reports. Some techniques stored in the system for further research. may be: Data providers are offered a long term Visualization of geographically refer- archiving of their published data sets, enced data using annotated maps: Fig- so the researcher needs not to deal with ure 2 shows a screenshot of Google Earth backup procedures. If it is wanted, the ([7]) with a data set that is visualized ac- data sets are distributed, e.g. to the world cording to the geographical references of data center. As a surplus value, each data the records. Because of the facilities of provider can use predefined data evalua- navigating in Google Earth together with tion methods. Predefined data visualiza-

109 Figure 2: Data Visualization with Google Earth tion can easily be used for public out- Researchers acting as data users are of- reach. Each data provider is notified fered an archive with semantically anno- if some user accesses the provided data tated, documented, and linked data sets sets, so data providers always have an that are searchable and browseable ac- overview about the impact factor of their cording to numerous parameters. Re- research. Researchers working in dis- ports, visualizations, and publications re- tributed working groups can define pri- lated to data sets are accessible. vate areas for data that is currently under processing, so data transfer between local groups is simplified. Because biology is a popular science the interested public should be explic- The online overview of downloaded or itly included in any thoughts about a displayed data enables research sponsors bio-ecological information system. Es- to participate in data ascertainment and pecially multimedial content that is pro- the project’s progress. Within the sys- duced during non-invasive surveys can be tem a statistical evaluation of queried data easily reused for public relations manage- grouped by projects is possible. This ment as long as they are annotated. But evaluation continues after the project’s there are also other interesting services lifetime and reveals the impact of the possible like visual classification wizards project within the research community that allow classification of individuals ac- or the public. Evaluation results can be cording to the taxonomy of species by it- linked to the sponsor’s homepage to im- eratively presenting photographs of pos- prove the sponsor’s own public outreach. sible species.

110 Figure 3: A diagram showing the number of individuals at a certain station over a longer period of time. Individuals of all species are summarized.

Figure 4: MDS plot of a time series of the seafloor of the Kiel Bay, based on ca. 150 species. The second picture shows the plot for 16 species. The most important species were extracted from the data set.

111 5. A General Architecture collaborating schema fragments (called ‘sunflower’ schemas) that allow the co- From a technical point of view an infor- existance of data that is shared between mation system with the described func- contexts and ‘private’, context dependent tionality has a number of points of con- data. tact with different disciplines in com- The semantics of the data has to be avail- puter science. Data warehouses inte- able in an interpretable form. That’s why grate data from different operative sys- the central data store is enhanced by a rea- tems and offer numerous data evaluation soning engine for managing terminologi- procedures. In difference to classical data cal knowledge about the data as well as warehouses data in a bio-ecological in- complex integrity constraints. This com- formation system will be structured more ponent strongly interacts with a compo- complex. Content management systems nent for the management of user profiles offer functionality to store such data and and portfolios to derive access rights and present it to users in specific ways. obligations based on the state of data sets To ensure data quality within the infor- and defined working groups. mation systems, numerous domain spe- The system’s interface to the user pro- cific integrity checks have to be applied. vides plugable evaluation and visualiza- The (at least partial) explicit representa- tion modules as well as import and ex- tion of these rules within the information port filters for common exchange formats. systems will promote the long term un- The interaction generator uses informa- derstandability and that’s why the long tion from the user management to deliver term usability of data sets. For this point information according to the user’s needs, many work was done in the area of de- wishes, and permissions in a suitable style ductive database systems as well as ac- as explained in [15]. tive database systems. Other technolo- gies as well as languages for describing 6. Conclusion data can be found in the area of the Se- mantic Web. In this paper we discussed the challenges of using data obtained during biological Therefore, a general architecture for these research as a base for data mining. The kinds of information systems, called con- central problems of data availability and tent warehouses can be derived. For a de- data quality have to be faced. By inte- tailed description, see [4]. Typical system grating the functionality of a data archive components are: system with data evaluation, data presen- The central data store is responsible for tation, and workgroup functionality in an representing the raw data. Due to the information system it is possible to invite variety of data structures it has to sup- more and more researchers to participate port schema components that are flexi- in creating a high-quality stock of biolog- ble enough to integrate different points ical data as a ground for new scientific of view on the data but are structured work. enough to allow an efficient handling of Acknowledgements mass data. For that reason, star- and snowflake schemas (see [16]) are intro- We like to thank our students Aylin duced that define mandatory kernel types Aksac¸, Anna Jaworska, Helge Rabsch, and optional types. The modeling ap- and Meimei Xu for implementing the proach defined in [5] introduces partially Google Earth visualization tool.

112 References HMI-Entwicklungsprozess. In VDI, editor, Elektronik im Kraftfahrzeug [1] A. Borja, J. Franco, and V. Perez. A 2005. 12. Internationaler Kongress Marine Biotic Index to Establish the Electronic Systems for Vehicles, Ecological Quality of Soft-Bottom number 1907 in VDI-Berichte. Benthos within European Estuarine VDI, VDI-Verlag, 2005. and Coastal Environments. Marine Pollution Bulletin, 40:1100–1114, [10] E.C. Pielou. The measurement of di- 2000. versity in different types of biologi- cal collections. Journal of Theoreti- [2] J.R. Bray and J.T. Curtis. An ordi- cal Biology, 13:131–144, 1966. nation of upland forest communities of southern Wisconsin. Ecological [11] R. Rosenberg, M. Blomqvist, Monographs, 27:325–349, 1957. H. Nilsson, H. Cederwall, and A. Dimming. Marine quality as- [3] K.R. Clark and R.M. Warwick. A sessment by use of benthic species taxonomical distinctness index and abundance distribution: a proposed its statistical properties. Journal new protocol within the European of Applied Ecology, 35:523–531, Union Water Framework Direc- 1998. tive. Marine Pollution Bulletin, 49:728–739, 2004. [4] G. Fiedler, A. Czerniak, D. Fleis- cher, H. Rumohr, M. Spindler, [12] C. E. Shannon and W. Weaver. The and B. Thalheim. Content Ware- Mathematical Theory of Communi- houses. Preprint 0605, Department cation. University of Illinois Press, of Computer Science, Kiel Univer- 1963. sity, March 2006. [13] E.H. Simpson. Measure of diversity. [5] G. Fiedler, Th. Raak, and B. Thal- Nature, 163:pp 688, 1949. heim. Database Collaboration In- stead of Integration. In Sven Hart- [14] B. Thalheim. Entity-relationship mann and Markus Stumptner, edi- modeling – Foundations of database tors, APCCM, volume 43 of Confer- technology. Springer, Berlin, 2000. ences in Research and Practice in See also http://www.informatik.tu- Information Technology. Australian .de/∼thalheim/HERM.htm. Computer Society Inc., 2005. [15] B. Thalheim. Co-Design of Struc- [6] Darwin Core Exchange Format. turing, Functionality, Distribution, http://darwincore.calacademy.org/. and Interactivity of Large Informa- tion Systems. Technical Report [7] Google, Inc. Google Earth 15/03, Brandenburg University of (http://earth.google.com). Technology at Cottbus, 2003.

[8] Taxonomic Databases Working [16] B. Thalheim. Component de- Group. http://www.tdwg.org/, 2006. velopment and construction for database design. Data Knowl. Eng., [9] T. Kruscha, B. Briel, G. Fiedler, 54(1):77–95, 2005. K. Jannaschk, Th. Raak, and B. Thalheim. Integratives HMI- Warehouse fur¨ einen durchgangigen¨

113 Personalized information filtering for mobile applications

Christian Reuschling and Sandra Zilles DFKI GmbH Erwin-Schrodinger-Str.¨ 57 67663 Kaiserslautern, Germany {christian.reuschling,sandra.zilles}@dfki.de

Abstract Recent technological development has enhanced research in the field of pervasive computing for mobile applications. In particular, topics like ad-hoc community building and personalized contextual product offers are of relevance for cellular radio providers. In this context, we propose a generic approach to personalized information filtering. This concerns first an appropriate representation scheme for the application domain, the user, as well as the information to be filtered. Here we propose to model the domain in an ontology with special weight attributes in RDFS, such that personal interests or resources (e. g., product descriptions) can be represented as RDF instances of this ontology. Using case based reasoning techniques, we second propose an implementation of a similarity measure between such instances. On the one hand, given a special domain model, this similarity measure allows for filtering a list of resources according to a person’s interests in a way immediately suitable for the intended applications; on the other hand, this similarity measure is defined generally enough to allow for the comparison of RDF instances in general, with different specialized similarity measures depending on the intended semantics of similarity. Third, in addition, the problem of how to maintain representations of user interests and resource descriptions in a dynamic domain is addressed briefly. 1. Introduction • a model for representation of avail- able information or resources (e. g., Recent technological development, such as products, entertainment offers, other concerning UMTS and smartphones, has users2), enhanced research in the field of perva- sive computing for mobile applications. In • a scenario for filtering and recommen- particular, topics like socializing, ad-hoc dation of information or resources, community building, entertainment on de- mand, and personalized contextual product • a process for content acquisition (in- offers are of relevance for cellular radio cluding maintenance of the information providers and for the corresponding third represented). party providers. What many products cel- lular radio providers aim at have in common Here the main difference compared to clas- is the fact that they implement sical, non-mobile applications is the focus on the user’s context, in particular, e.g., her • a model for representation of the user, current location, the current daytime, current and/or previous tasks, active applications • a model for representation of a user’s and services, etc., cf. [15, 9, 14]. Roughly context1, 2Note that users themselves can be resources, for 1i.e., additional constraints describing the current instance, in applications focusing on socializing sce- situation of a user’s fluently changing environment narios.

114 speaking, the user’s context is defined by the current status of her fluently changing envi- ronment, see Section 2.3 for more details. However, the actual implementations of these models, scenarios, and processes differ in terms of methods used. The variety of the approaches of course results from different applications and thus different intended se- Figure 1: The general scenario. mantics of user/resource representations and different intended recommendation features. mendation process is to first use context in- State of the art approaches to modeling per- formation for filtering the case base, and sonal taste for recommendation tasks are for then select appropriate ‘cases’ (i. e., profiles) instance discussed in [11], cf. also the ref- depending on a similarity match with the erences therein. Ideas especially aiming at given user profile. The system has to be mobile applications often concern socializ- maintained in a clearly defined workflow, for ing based on, for instance, music recom- instance by (but not restricted to), using ex- menders, cf. [2, 3]. ternal databases. The aim of this paper is to propose a unified The paper is organized as follows: we first framework allowing for a generic implemen- motivate and describe our chosen model for tation of the required models, scenarios, and representing user and resource profiles, i. e., processes – independent from the actual ap- descriptions of their interests or features, plication domain. This framework is based (Section 2), then, in the main section (Sec- on previous work concerning tion 3) we discuss our generic recommender approach, followed by a brief overview over • Semantic Web technologies (here used possible approaches to content acquisition for the representation model); see [6, (Section 4). In the concluding section, we 12] and address some possible issues for future work in this context. • case based reasoning methods (here used for the recommendation model); 2. Representation model: the RDF see [4, 5], as well as [1] for a funda- mental overview. ‘boost factor’ framework 2.1. Basic requirements The fundamental approaches combined are not new; however, the combination of Se- Let us first motivate our choice of the model mantic Web technology with case-based rea- for representing user interests (i. e., user pro- soning methods in a unified framework as files) and resource properties (i. e., resource proposed below hopefully provides a fun- profiles). dament for fruitful application-oriented re- First of all, note that, particularly for mo- search. bile applications, a user should never be de- Figure 1 sketches the underlying scenario in scribed by her profile alone, but also by her the scope of the applications our approach current context. Substracting context from a addresses. Note that the term ‘case base’ user description, it is straightforward to use here represents a database containing all only a single model both for user profiles profiles – the latter being regarded as cases and for resource profiles. Requirements for a in a case-based reasoning approach. We will corresponding representation scheme should return to this point of view in Section 3. Ba- be: sically, the idea is that a user will receive personalized, filtered information via a mo- 1. The representation scheme must be bile device (e.g., cell phone). The recom- general enough to express different

115 kinds of real-world concepts and their modeled in an ontology, using a represen- relations. In particular it must provide tation scheme expressive enough for repre- options for senting taxonomies, whereas user and re- source profiles are instances of this ontology (a) representing taxonomies of con- (respecting some particular features, as will cepts (in order to allow for proper be explained in Subsection 2.2). recommendations of specialized to generalized profiles and vice Currently, there are several standards for on- versa), tology representation schemes, e. g., RDFS, OWL (OWL Full and restrictions OWL DL (b) representing the amount of user and OWL Lite), etc., see also [17, 16]. On interest in each of the real-world the one hand, a language like OWL Full is concepts (in order to allow for very expressive, but on the other hand, this more specially personalized rec- expressiveness entails high costs in the tasks ommendations), resulting from our requirements. There- (c) transferring user or resource pro- fore a first reasonable step towards a generic files from (publicly) available framework for recommender systems is to databases into the predefined rep- choose RDFS/RDF for modeling the domain resentation scheme (in order to al- ontology and the profiles (as instances). low for content acquisition in a Note that an additional benefit from using recommender system), RDF might be that RDF is the language of (d) modifying a given domain model the Semantic Web, where many users ex- as well as the profiles (in order to press their personal interests within a net of allow for reflecting the changes in linked RDF resources. This may entail fur- a dynamic application domain). ther options for extracting contents for a rec- ommender system. 2. The representation scheme must be Moreover, the quite simple structure of specialized enough to be suitable for ef- RDFS may have usability benefits; for real- ficient content acquisition and recom- world applications it would be rather incon- mendation. In particular, it must allow venient to model in OWL Full. for

(a) an efficient extraction and trans- 2.2. From general RDF to interest formation of external data, profiles (b) an efficient computation of simi- So we propose to model the application do- larities between profiles (in order main in an RDFS ontology; profiles are in- to determine recommendations), stances of the latter. However, some detailed (c) efficient profile generation and up- requirements have to be taken into account, date tools. in order to use RDF specifically for repre- senting interest profiles. These detailed re- In particular, the need for feasible content quirements concern acquisition methods suggests using stan- dards in the basic representation scheme. 1. the attribute types considered, Moreover, basing on the requirement for representing taxonomies (profiles express some special interests or features of a col- 2. attributes for user rating in interest pro- lection of – maybe inter-related – concepts files, in the application domain), we chose an ontology-based approach for representation. 3. degrees of relationships between con- That means, concepts of the domain are cepts in the ontology.

116 Attribute types negatively; the absolute value is inter- preted as a weight in this rating, Up to a certain degree of granularity, a simple metadata scheme may be sufficient • a boost factor of 0 for some concept c for describing all relevant information about indicates that the person is indifferent a person or a resource. However, one concerning c. immediately recognizes limitations of such ontology-based metadata schemes. For in- In a resource profile, the same kind of boost stance, music is very hard to describe using factor is reserved. metadata only, state-of-the-art music mining Note that the open-world assumption is re- and retrieval techniques prove much more flected in our approach as follows: If, in a information in an audio file itself than can profile, a concept c is not instantiated, the in- be expressed with simple metadata. Similar tended semantics would not be equal to the statements hold for text documents. case when c is instantiated with a boost fac- Thus including attributes of type mp3 or tor of 0. The open-world assumption forbids c string in the ontology allows for integrat- us to interpret the rating of as ‘indiffer- ing audio files or any kind of textual re- ent’ (for persons) or ‘non-existing’ (for re- views/summaries (of music, literature, any sources). kind of products, etc.) into a profile. Hence profiles can be enriched by additional infor- Degrees of concept relationships mation. By the way, this also has a positive effect Assume a taxonomy of concepts is modeled in easing profile generation, as will be dis- in an ontology. Then the intended seman- cussed in Section 4. tics probably cannot always be reflected in the taxonomic structure only, as the follow- ing example shows: Attributes for user rating Assume the ontology contains concepts c, c1, and c2, where c1 and c2 are subconcepts As already indicated, it should be possible of c, see Figure 2. to instantiate a concept in an ontology such that a person can express liking or disliking the class of objects associated to that con- cept. The motivation behind is that – under an open-world assumption – in general we cannot conclude a person dislikes a concept, if it is not instantiated in the person’s profile. Our simple approach here is to demand that each concept in the ontology can have a spe- Figure 2: A simple example for taxonomies. cial ‘weighting’ attribute, a float attribute with range [−1, +1] reserved for expressing the so-called ‘boost factor’. In a person’s in- Then c1 and c2 are ‘brother’ concepts and terest profile: could be interpreted as related. But the de- gree of their relation is not clear without • a positive boost factor for some concept any additional specification which might de- c indicates that the person rates c posi- pend on the intended application. Focus- tively; the absolute value is interpreted ing on recommender systems, a strong re- as a weight in this rating, lation would mean that profiles instantiating c1 could be recommended to profiles instan- • a negative boost factor for some con- tiating c2. Obviously, the intended recom- cept c indicates that the person rates c mender functionality depends on some kind

117 of degree of relationship you would like to functionality-driven point of view: instead assign to c1 and c2. If the intended semantics of asking first what context is, it is rea- would say that ‘red wine’ is closely related sonable to ask first what context should be to ‘white wine’, but ‘female’ is not (closely) used for in the application. In our current related to ‘male’ (which should definitely model, we define the user’s context by all have an impact on how to compute recom- constraints that are used for pre-filtering in- mendations), this requires some additional formation (in a database containing all pro- representation within the ontology. In par- files) before computing recommendations ticular, the concept c should be annotated via similarities or for weighting the over- with some value expressing the degree of re- all similarity values between profiles, which lationship of its child concepts. are determined for computing recommenda- Thus, in the proposed generic model, each tions. concept in the ontology has a special numer- Note that it may be worth analyzing to what ical attribute (e. g. with range [0, 1]) used for extent a user’s context could also be mod- representing the degree of relationship of its eled as part of her interest profile, using the child concepts. Extreme values then are in- RDFS framework as explained above. terpreted as the maximum or minimum pos- sible degree of relationship, i.e. if the range 3. Recommender model: a generic is [0, 1], then thevalue0 for c means the sons similarity measure of the concept c are not related at all; the From now on, suppose an ontology and a value 1 means that the sons of c are just rep- list of profiles have been defined according resentations of only one subconcept of c. to our special RDFS/RDF representation ap- 2.3. Context in addition to a user proach. Additionally, assume some context information is attached to the profiles, profile Given a profile p, how should recommenda- A generic representation model for user con- tions for p be computed? text is not so easy to define. In general, there Let us first assume the existence of algo- is no universal definition of the term con- text, cf. [7, 8], though this topic has gained rithms for computing a lot of interest in the scientific community, 1. a similarity sim(p, p′) between the pro- see also [9, 14]. In Section 1, we explained file p and any other profile p′ (according the user’s context as the current state of her to some similarity measure), fluently changing environment – but this is far from being a clear definition. Obviously, 2. a context relevance con(p, p′) of any in mobile applications, location and time other profile p′ to profile p. should be part of a user’s context, but it is unclear, how these should in general be eval- Then the recommendation process could be uated by a recommender system. A generic sketched as follows: approach would have to allow for reflecting any particular application-specific definition parameters: thresholds θs, θc, θ, integer k of context and its semantics. input: list L of all profiles, Up to now, we have not yet elaborated a query profile p ∈ L generic context model for mobile applica- output: listofprofilesin L tions, though in several realizations of our (recommendations for p), proposed approach (recommender systems with similarity values in [0, 1] for leisure time, for shopping scenarios, and for media recommendations) the user’s con- 1. (* optional *) Let L′ ⊆ L be the text has been integrated. list of all profiles p′ ∈ L, for which ′ ′ Currently, we define context from a more con(p, p ) < θc; L := L \ L ;

118 2. (* options: a, b, c, d *) Local similarity (a) For all profiles p′ ∈ L ′ Concerning local similarities, there is a need with sim(p, p ) ≥ θs, return ′ ′ for different type-specific similarity mea- (p , sim(p, p ). sures, e.g., for audio files of the types oc- ′ (b) For all profiles p ∈ L with curring in the ontology, for html documents, ′ ′ sim(p, p ) · con(p, p ) ≥ θ, return for general string values, and for numerical ′ ′ ′ (p , sim(p, p ) · con(p, p ). values. Here different similarities must be

(c) Return (p1, sim(p, p1)), ..., defined for different intended semantics of (pk, sim(p, pk)), for profiles certain attributes: ′ p1, . . . , pk ∈ L, such that Two strings s, s can have a similarity de- ′ sim(p, pi) ≥ sim(p, p ) for all fined by i ∈ {1, . . . , k} and all profiles ′ p ∈ L. 1. 0, if s 6= s′ and 1, if s = s′; or (d) Return (p , sim(p, p ) · 1 1 2. 1 − l(s, s′), where l(s, s′) is the Leven- con(p, p )), ..., (pk, sim(p, p ) · 1 k shtein distance of s and s′ (normalized con(p, p )), for profiles k in the interval [0, 1]), cf. [10]; or p1, . . . , pk ∈ L, such that sim(p, pi) · con(p, pi) ≥ 3. ... sim(p, p′) · con(p, p′) for all i ∈ {1, . . . , k} and all profiles Two numerical values r, r′ can have a simi- ′ p ∈ L. larity defined by In other words, one usually recommends ei- 1. 1 −d, where d is their absolute distance ther all resources with a minimum similar- (normalized in the interval [0, 1]); or ity or just k resources with highest similar- ity. The context relevance may be used for 2. 0, if r 6= r′ and 1, if r = r′; or pre-filtering and/or as a weight for the over- ′ ′ all similarity. 3. 0, if r < r and 1, if r ≥ r ; or So a generic framework requires a definition 4. 0, if r > r′ and 1, if r ≤ r′; or of a suitable similarity measure, and thus a generic algorithm for matching RDF files. 5. ... We propose a matching algorithm which should be a fixed component in any con- Similar variants are conceivable for different crete application task, such that, with each types of attributes. new application, it remains only to define a Note that each variant of a local similar- new ontology and a new context model (the ity measure goes along with different in- latter including definitions of how to com- tended semantics of the corresponding at- pute context relevance values). That means, tribute. Consequently, this has to be ex- the matching algorithm proposed is generic; pressed by formally different types of string its parameters are the ontology and the algo- attributes or different types of numerical at- rithm for computing the context relevance. tributes in the ontology. Basically, the desired similarity measure is The need for these different semantics is ob- implemented in this matching algorithm in vious: If, for instance, a numerical attribute the sense that the latter compares two pro- represents a production year of a movie a files and returns a similarity value in [0, 1]. user is interested in, then the local similar- Requirements for such a similarity mea- ity for this attribute should be defined using sure should concern local similarities, which their absolute distance, since the user may must be defined for all relevant attributes, as presumably like movies from that time, but well as the global (overall) similarity. not only from that year. The intended local

119 similarities would of course be different, if special local similarities influence the the attribute should represent some maximal global similarity value. Thus, even if allowed price for recommended products. all local similarities fulfill the triangle inequality, this does not hold for the global similarity. Global similarity So, in what follows, we should concentrate Note that, as for the case of local similarities, ′ on reflexivity and symmetry constraints. we assume sim(p, p ) ∈ [0, 1] for all profiles p, p′. For instance, in a shoppingscenario, if a user has rated her favorite movie with title t, a The overall similarity of two profiles should resource (e.g., a DVD) with the same title take into account the local similarities con- t would maybe not be the top recommen- cerning different attribute values, as well as dation, assuming the user already possesses the degree of relationships between the in- such a product. Hence you would require a stantiated concepts. Moreover, the user’s in- low similarity of a profile to itself, i.e., the terests expressed in the boost factors must corresponding distance measure would be define ‘weights’ in computing a similarity. non-reflexive. On the other hand, retrieval Note that the boost factor is a special kind of scenarios are conceivable, where each pro- attribute, the value of which is not included file has maximum similarity to itself, i.e., into the computation of similarities in the where the global distance measure is reflex- usual way. The boost factor merely deter- ive. Similar scenarios are conceivable disal- mines the amount of how strongly the local lowing or requiring symmetry. similarity will contribute to the global sim- ilarity. Thus it works like a weight, though Now, if our global similarity measure is sup- it is not really a weight, due to the fact that posed to be generic, this means that whether the boost factors can have arbitrary values in or not reflexivity, symmetry, or the triangle [0, 1], without requiring that they sum up to inequality hold, must depend on the local 1, as would be usual for weights. similarities exclusively. Only in this case we can guarantee that the ontology model alone Here different requirements concerning a (and in particular the choice of special at- similarity measure are conceivable:3 tributes for which local similarity measures are defined) determines the properties of the 1. Reflexivity similarity measure. One might or might not require that sim(p, p) = 1 for all profiles p. Consequently, the main requirements con- cerning the global similarity measure can be 2. Symmetry summarized as follows: One might or might not require that sim(p, p′) = sim(p′, p) for all profiles 1. The global similarity of two profiles p, p′. must depend on their local similarities (in particular also concerning the boost 3. Triangle inequality factors) and the relationship degrees de- One should in general not require that fined in the ontology, only. sim(p, p′) ≥ sim(p, p′′)+sim(p′′, p′)− 1 for all profiles p, p′, p′′. The rea- 2. Given a local similarity or a relation- son is, roughly speaking, that a user’s ship degree as a parameter, the global boost factors can determine how much similarity must be monotonically in- creasing in this parameter. 3The requirements reflexivity and triangle in- equality for a distance measure d are usually d(p, p′) = 0 and d(p, p′) ≤ d(p, p′′) + d(p′′, p′). We 3. The global similarity function is reflex- obtain the given formulations regarding sim = 1 − d ive, if and only if all given local simi- for a [0, 1]-valued distance measure d. larity functions are reflexive.

120 4. The global similarity function is sym- assigned to the attribute a in some metric, if and only if all given local sim- profile q.5 ilarity functions are symmetric. 4. Return the global similarity sim(p, p′) as the weighted mean of all values The easiest way to achieve this is to define blsim(p, p′, a) for a A, where the a global similarity by a weighted sum of all ∈ weights are given by the degree of re- local similarities. lationship between the concepts instan- For this purpose we follow the case-based tiated by p and p′. reasoning method proposed in [5]. Here the case base is just the profile database; each profile corresponds to a case. The query pro- 4. Content acquisition model: file p is then understood as a new case, such generating and updating that for each case p′ (existing profile) in the case base, a similarity value for p and p′ has ontologies and profiles to be computed. Concerning the content in a recommender The method explained in [5] can be sketched system in the proposed framework, both the as follows:4 ontology and the profiles have to be dis- cussed. However, we only briefly sketch some well-known basic approaches here. input: query profile p, case profile p′ 4.1. Ontology generation and up- ′ output: similarity value sim(p, p ) ∈ [0, 1] date How can an initial ontology be obtained and 1. Determine the set A of all attributes in- how can it be updated? In most cases, pre- ′ stantiated both in p and in p (except for sumably, an initial ontology must be de- boost factor attributes). signed by an administrator – as usual. How- ever, there are in fact approaches allowing 2. For all attributes a ∈ A compute a ′ for automatically computing suggestions for (local) similarity value lsim(p, p , a) as ontology updates, which can be used in the follows: context of semi-automated ontology mainte- nance. • If a is a complex attribute, then ′ ′ let lsim(p, p , a) = sim(pa, pa), ′ where pa and pa are the parts of Non-automated workflow the instances p and p′ concerning a (* recursion *). Non-automated ontology modification here ′ simply means manual editing. This only re- • Else let lsim(p, p , a) be the local ′ quires the provision of an interface allow- similarity of p and p concerning ing for ontology upload and download, with a. connection to an ontology editing tool.6 3. For all attributes a ∈ A com- pute a boosted local similarity Semi-automated workflow ′ 1 value blsim(p, p , a) = 2 · (1 + ′ ′ lsim(p, p , a)(1−|bf(p, a)−bf(p , a)|)), Though machine learning tools can be used where bf(q, a) denotes the boost factor for inferring new relations for the ontology

4 5 1 Here we deal only with a simplified case, assum- The factor 2 is needed for normalizing onto the ing that no attribute in the ontology is of a list type. interval [0, 1]. The algorithm can be extended to matching profiles 6Note that some ontology editing tools will not in which lists occur as attribute types, but we omit store the ontology in the format required in our pro- the relevant details. posal. Here parsers must be used for transformation.

121 or for detecting superfluous/missing con- factors as given in the resource profile, cepts or attributes, ontology changes should but multiplied by −1). If the user rates definitely always be executed only upon the resources without profiles, e.g., audio decision of an administrator. files or text documents, suitable min- Suggestions for an ontology update can be ing methods may be used for extracting obtained by automatically analyzing all pro- new profile features out of these. files in the database or logging user actions This local approach works for acquisi- – for instance, using association rule min- tion of resource profiles as well, since, ing or clustering techniques, see [18]. Such in general, features of resources can techniques are well-known from market bas- also be described by relevance feed- ket analysis. back.

4.2. Profile generation and update 2. Global approach User or resource profiles can be up- How can profiles be generated and updated? dated by an automatic global analy- Again we only sketch some basic ideas. sis of all profiles in the database (or of all user profiles of certain clus- ters/communities). If association rule Non-automated workflow mining or clustering methods, cf. [18], yield new relations between concepts, Of course, it is possible to provide user inter- these can be used for predicting un- faces for manual editing of profiles. How- known values in single profiles. ever, this may cause problems concerning the usability as well as because of the of- ten observed phenomenon of users incapable Fully automated workflow of describing their own interests in terms of concepts. Still manual editing might be The most obvious approach for a fully au- reasonable for creating resource profiles or tomated workflow, as already mentioned modifying them. before, is profile extraction from external databases – at least resource profiles can be automatically extracted from available Semi-automated workflow databases. For instance, movie profiles could be obtained from the internet movie In semi-automated workflows, a human database (http://www.imdb.com). administrator can be assisted by machine learning systems inferring new knowledge Still note that some administrative tasks may from user and/or resource data. be required concerning a comparison of the vocabulary (the metadata values themselves) Here we distinguish between ‘local’ ap- used in existing profiles and those in the ex- proaches using single user data and ‘global’ ternal resource descriptions. ones using a larger part of the system content for learning. 5. Conclusions

1. Local approach We have proposed and implemented a The interest profile of a user can for in- generic framework for personalized in- stance be obtained/updated by an au- formation filtering, basing on a special tomatic analysis from relevance feed- RDFS/RDF representation model and a uni- back, see [13]. If the user rates re- form algorithm computing a similarity mea- sources, which have profiles, positively sure for RDF instances. (or negatively), they can be merged into The idea is that, for each new application, the user’s profile with boost factors as only a new ontology (and, if required, new given in the resource profile (or boost local similarity measures) has to be defined.

122 The global similarity measure can be com- Content acquisition and maintenance is- puted using a fixed uniform tool following sues our proposal. Furthermore, the methods sketched in Sec- However, there are still several aspects in tion 4 have to be conceptualized in a way which further details in our proposal have to such that our generic framework is comple- be elaborated. mented by a toolkit easing content acquisi- tion and update in concrete applications. We are currently developing a toolkit for merge functions allowing for profile update bas- Efficiency/performance issues ing on relevance feedback. Adapting stan- dard market basket analysis approaches to Up to now we have not analyzed the effi- our framework will be one of the next steps. ciency of the proposed matching algorithm formally. A uniform approach however requires reliable statements concerning the Possible extensions conditions under which this algorithm works Because of its simple structure and the re- well. This should be one point of focus in sulting efficiency benefits, we had decided future work. to use RDFS/RDF for ontology modeling. However, one future task might be to ex- tend our framework by designing and imple- Framework documentation menting a matching algorithm which defines a similarity measure for OWL instances.

Even if a framework proves uniform for a References class of applications, this does not imme- diately imply that it is uniformly usable in [1] A. Aamodt and E. Plaza. Case- practice. Here a representation language is based reasoning: Foundational issues, required which clearly explains the effect of methodological variations, and system all kinds of parameters in the ontology upon approaches. AI Commununications, the resulting global similarity measure. In 7:39–59, 1994. particular, if a designer of some new appli- cation has an intended similarity measure in [2] A. Bassoli and S. Baumann. Blue- mind, then it should be clear how to define tunA: music sharing through mobile an ontology appropriately. This is a further phones. In Proceedings of the Third In- aspect which must definitely be addressed. ternational Workshop on Mobile Music Moreover, this aspect also includes the ques- Technology, 2006. tion of how the generic tools in our frame- [3] S. Baumann. Smartmobs and music: work can be best embedded into a concrete Ad-hoc socializing by portable music application. profiles. In Inspirational Ideas at Inter- national Computer Music Conference (ICMC 2005), 2005.

Context representation [4] R. Bergmann. On the use of tax- onomies for representing case features As already indicated in Section 2.3, a suit- and local similarity measures. In able framework for context representation L. Gierl and M. Lenz, editors, Proceed- has to be defined; in particular when aiming ings of the 6th German Workshop on at mobile applications this is of great impor- CBR. Universit¨at Rostock, 1998. IMIB tance. Series Vol. 7.

123 [5] R. Bergmann and A. Stahl. Similar- [12] F. Manola and E. Miller (Eds.). RDF ity measures for object-oriented case Primer. W3C recommendation, W3C, representations. In B. Smyth and http://www.w3.org/TR/2004/REC-rdf- P. Cunningham, editors, Advances in primer-20040210/, 2004. Case-Based Reasoning, 4th European [13] J. Rocchio. Relevance feedback in in- Workshop, EWCBR-98, Dublin, Ire- formation retrieval. In G. Salton, edi- land, September 1998, Proceedings, tor, The SMART Retrieval System: Ex- volume 1488 of Lecture Notes in Com- periments in Automatic Document Pro- puter Science. Springer, 1998. cessing, pages 313–323. Prentice-Hall, 1971. [6] T. Berners-Lee, J. Hendler, and O. Las- sila. The semantic web. Scientific [14] T. Roth-Berghofer, S. Schulz, and American, 89, 2001. D. Leake, editors. Modeling and [7] Patrick Brezillon. Context in Artificial Retrieval of Context, Second Interna- Intelligence: I. A survey of the litera- tional Workshop, MRC 2005, Edin- ture. Computer and AI, 18(4):321–340, burgh, UK, July 31 - August 1, 2005, 1999. Revised Selected Papers, volume 3946 of Lecture Notes in Computer Science. [8] Patrick Brezillon. Context in Artificial Springer, 2006. Intelligence: II. Key elements of con- texts. Computer and AI, 18(5):425– [15] Sven Schwarz. A context model for 446, 1999. personal knowledge management. In [9] A. Dey, B. Kokinov, D. Leake, and Roth-Berghofer et al. [14]. R. Turner, editors. Modeling and Us- ing Context, 5th International and In- [16] W3C. OWL web ontology language, terdisciplinary Conference, CONTEXT overview. Technical report, 2004. 2005, Paris, France, July 5-8, 2005, D. McGuinness and F. van Harme- Proceedings, volume 3554 of Lecture len (Eds.), W3C Recommendation, Notes in Computer Science. Springer, http://www.w3.org/TR/owl-features/. 2005. [17] W3C. RDF vocabulary descrip- [10] V. I. Levenshtein. Binary codes capa- tion language 1.0: RDF Schema. ble of correcting deletions, insertions, Technical report, 2004. D. Brick- and reversals. Doklady Akademii Nauk ley, R. Guha, and B. McBride SSSR, 163:845–848, 1965. English (Eds.), W3C Recommendation, translation in Soviet Physics Doklady http://www.w3.org/TR/rdf-schema/. 10, 707–710, 1966. [11] H. Liu, P. Maes, and G. Davenport. [18] I. H. Witten and E. Frank. Data Min- Unraveling the taste fabric of social ing: Practical Machine Learning Tools networks. International Journal on Se- and Techniques with Java Implementa- mantic Web and Information Systems, tions. Morgan Kaufmann, 1999. 2:42–71, 2006.

124 Effcient Algorithms for Mining Maximal Flexible Patterns in Texts and Sequences

Hiroki Arimura and Takeaki Uno Graduate School of Information Science and Technology, Hokkaido University Kita 14 Nishi 9, Sapporo 060-0814, Japan National Institute of Informatics 2-1-2, Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

Abstract In this paper, we study the enumeration algorithm for the maximal pattern discovery problem for the class ERP of flexible sequence patterns, also known as erasing regular patterns, in a given seqeunce, with applications to text mining. Out notion of maximality is based on the position occurrences and weaker than the traditional notion of maximality based on the document occurrences. We present a polynomial space and polynomial delay algorithm for enumerating all maximal patterns without duplicates for the class of ERP of flexible sequence patterns based on the framework of reverse search. As a corollary, the enumeration problem for maximal flexible patterns is shown to be solvable in output polynomial time. We also discuss the utility of maximal pattern discovery in document clssification and a heuristic algorithm for discovering document-maximal flexible patterns in a set of input strings.

Keywords: Data mining, Enumeration algorithms, Sequence databases, Maximal pattern, Motif Discov- ery

1. Introduction In this paper, we consider the maximal pat- tern discovery problem [9, 10, 13, 14] in a set of sequences. A pattern is maximal By the rapid growth of the amount of if there is no properly more specific pat- human-readable electronic data on networks tern w.r.t. some generalization ordering over and storages, there are potential demands the class of patterns that has the same oc- for the efficient computational methods to currence, or equivalently, has the same fre- extract useful information and knowledge quency, in a given set of input sequences. from massive amount of electronic data scat- Since the set of all maximal patterns are typ- tered over the network. Some prominent ex- ically much smaller than the set of all pat- amples of such knowledge discovery tasks terns appearing in a data set while the formar are: Automatic classification of natural lan- contains the complete information of the lat- guage texts and web pages, characteristic ter, maximal pattern discovery has merits and descriptive pattern discovery, prediction in efficiency and comprehensiveness. On of trends from market data, detection of ma- the other hands, the computational complex- licious activities from audit data, and clus- ity of maximal pattern discovery is higher tering of documents. Pattern discovery is than that of frequent pattern disocovery. For one of the most basic technology to find a example, there are few patterns classes for class of patterns appearing in a data set satis- which the maximal pattern discovery prob- fying given constraints, and plays an impor- lem have the polynomial output time com- tant role in many knowledge discovery prob- plexity. lems.

125 The class of patterns we consider is the class |s| = n the length of s and by ε the empty ERP of erasing regular patterns of Shino- string. For any indices 1 ≤ i ≤ j ≤ n, hara [12], which is also called flexible pat- we denote by s[i]=ai the i-th letter of s, terns in bioinformatics area [9]. The class and by s[i..j] the substring s[i..j]=ai ···aj ERP is a super class of the class of subse- that starts with i and ends with j. For a set ⊆ ∗ | | quence patterns for which polynomial out- S A , we denote by S the cardinality of || || | | put maximal pattern enumeration algorithm S and by S = s∈S s be the total size of is known [14]. However, there is no output- the strings in S. polynomial algorithm for the maximal pat- For strings u, v, w ∈ A∗, we say that u, ERP tern problem for . A potential prob- v, and w, respectively, are a prefix,asub- ERP lem for the class of flexible patterns string, and a suffix of a string s = uvw. is, unlike classes of itemsets and rigid pat- Then, the substring v occurs in s at position terns [3, 9], there are no unique maximal p = |u|+1. Equivalently, if s = s[1] ···s[n], pattern in each equivalence class of pat- then v occurs in s at position i iff v = terns. To overcome this problem, we intro- s[i] ···s[i + |v|−1]. The left position and duce a weaker notion of maximality, called the right position of v corresponding to this ERP position-maximality, for , where two occurrence of in t are i and i+|v|−1, respec- patterns P and Q are regarded as equivalent tively. The reversal of a string x = a1 ···am if they have the same sets of left-positions is defined by xR = a ···a . in an input string. The position-maximality n 1 is implicitly used in maximal pattern mining 2.1. Text and Patterns for subsequence patterns [14]. In this subsection, we introduce the class As a main result of this paper, un- ERP of erasing regular patterns of Shi- der this definition of maximality, we nohara [12], also known as flexible pat- present a polynomial-space and polynomial- terns [9]. Let Σ={a, b, c, . . .} be a fi- delay algorithm for enumerating all max- nite alphabet of constant symbols. We as- imal patterns appearing in a given string sume a special symbol ∗ ∈ Σ called a vari- without duplicates in terms of postion- able (a string wildcard or variable-length maximality. This result generalizes the don’t cares, VLDC), which represents ar- output-polynomial complexity of [14] for bitrary long possibly-empty finite string in subsequence patterns to the class ERP. The Σ∗. Then, an erasing regular pattern (or polynomial-space and delay property indi- pattern, for short) over Σ is a string P ∈ cates that it can be used as a light-weight (Σ ∪ {∗})∗ consisting of constant symbols and high-throughput algorithm for pattern and variables. An erasing regular pattern P discovery. Finally, we presented a heuris- is said to be in canonical form if it is writ- tics algorithm for enumerating document- ten as P = w ∗ w ∗ ··· ∗ w for some maximal patterns in a collection of strings 0 1 m integer m ≥ 0 and some non-empty strings for the class ERP with non-trivial pruning w ,w ,...,w ∈ Σ+. Each constant string strategies. The motivation of this study is 0 1 m w is called a segment of P . An erasing reg- application of the maximal pattern discovery i ular pattern is also called as a VLDC pattern to the optimal pattern discovery problem in or a flexible pattern. We denote by ERP the machine learning and knowledge discovery class of erasing regular patterns over Σ in with application to text mining. canonical form. We note that Σ∗ ⊆ERP. 2. Preliminaries Let Σ be a fixed alphabet of constants. An input string is a constant string T = Let A be an alphabet of symbols. We denote a1 ···an (n ≥ 0) over Σ. Let P = w0 ∗ w1 ∗ ∗ ∗ by A the set of all finite strings over A and ···∗wm ∈ (Σ∪{∗}) be a pattern with m ≥ + ∗ define Σ =Σ −{ε}. Let s = a1 ···an ∈ 0 variables in canonical form. A substitution ∗ A be a string over A of length n. We denote for P is any m-tuple θ =(u1,...,um) ∈

126 ((Σ ∪ {∗})∗)m of strings. Then, we define (ERP, ) is a partial ordering with the the application of θ to P , denoted by Pθ, smallest element ε. as the string Pθ = w0u1w1u2 ···umwm ∈ (Σ∪{∗})∗, where the i-th occurrence of vari- In machine learning area, the learning prob- able ∗ is replaced with the i-th string ui for lem for the class formal languages de- every i =1,...,m. The string Pθ is said to fined by ERP has been studied exten- be an instance of P by substition θ. sively [12]. For a erasing regular pattern ∗ P ∈ERP, the language defined by P is the For strings P, Q ∈ (Σ ∪ {∗}) , P occurs in set Lang (P )={ s ∈ Σ∗ : P  s }⊆Σ∗. Q at position 1 ≤ i ≤ n if there exists some Σ It is easy to see that for any regular pattern instance I of P that is a substring of Q start- P ∈ (Σ∪{∗})∗, there exists some Q ∈ERP ing at p, that is, Q[i..i+ |Pθ|−1] = Pθfor in canonical form such that Lang (P )= some substitution θ for P . Particularly, the Σ Lang (Q). positions i and i + |Pθ|−1, respectively, Σ correspond to the left and the right ends 2.2. Maximal patterns of the occurrence of instance Pθ, and are called the left and the right positions of P in Let T be a fixed text of length n ≥ 0. The T (See Fig. 1). We denote by LO(P, Q) and frequency of a pattern P in T is |OT (P )|. RO(P, Q) the set of all left-positions and the A minimum support threshold is a nonnega- set of all right-positions of pattern P in text tive integer 0 ≤ σ ≤ n. A pattern P is σ- Q, respectively. We refer to LO(P, Q) and frequent in T if it has the frequency no less RO(P, Q) as the left-location list and the than σ in T , i.e., |OT (P )|≥σ. right-location list of P in T , respectively. Definition 1 A pattern P is maximal in T if Lemma 1 Let P be a pattern and T be there is no proper specialization Q of P that a text. For any position 1 ≤ p ≤ n, has the same location list, i.e., P < Q and p ∈ ROT (P ) if and only if n − p +1 ∈ OT (P )=OT (Q). R LOT R (P ). We see that a pattern P ∈ERPis max-  By the symmetry between the left and imal iff P is a maximal element w.r.t. in the equivalence class [P ]≡ = { Q ∈ the right location lists in Lemma 1, it is T ERP ≡ } sufficient to consider only LO(P, T) than : P T Q under the equivalence re- ≡ ≡ ⇔ RO(P, T). Thus, we only consider the loca- lation T defined by P T Q OT (P )= tion list O(P, T)=LO(P, T) in what fol- OT (Q). lows. Lemma 4 The maximal patterns in each equivalence class [P ]≡ is not unique in Lemma 2 The location list LO(P, T) is T general. computable in O(mn) time, where m = |P | and n = |T |. We denote by Fσ, M, and Mθ = Fσ ∩ M the classes of the σ-frequent patterns, We define a binary relation  over P as fol- the maximal patterns, and the maximal σ- lows. For patterns P, Q ∈ ERP, P is more frequent patterns in T . It is easy see that the specific than Q, denoted by P  Q,iffP number of frequent flexible patterns in an in- occurs in T at some position 1 ≤ p ≤ n − 1. put text S is exponential in the total length n If P  Q and Q < P hold, then we say that of T in the worst case. P is properly more specific than Q and de- note P < Q. We note that if P  Q and Lemma 5 There is an infinite sequence < Q P hold then P = Q holds. (Ti)i≥0 of texts such that the number of max- imal flexible patterns in Ti is exponential in Ω(n) Lemma 3 Let T be any text. Then, n = |Ti|, i.e., |Mσ| =2 .

127 Now, we state our data mining problem con- minimizes within P the empirical error sidered in this paper as follows.  ERRS,F (P )= [P (x) = F (x)]. Maximal Flexible Pattern Enumeration x∈S Problem: Input: An alphabet Σ, a text T of length n, and a minimum support threshold σ. In learning theory, it is known that any algo- Task: To generate all maximal frequent flex- rithm that efficiently solves the above empir- ical error minimization problem can approx- ible patterns in Mσ ⊆ERPin T without duplicates. imate a target concept within a given con- cept class under arbitrary unknown proba- It is easy to see that Mθ = Fσ ∩M. There- bility distributions, and thus can work with fore, we will first present an efficient enu- noisy environments [11]. meration algorithm for M = M1, and then extend it for Mσ for every σ ≥ 1. 3.2. Optimal Pattern Discovery 3. Motivated Applications of Max- The above framework can be extended for more general classes of score functions [4]. imal Pattern Discovery An impurity function is any real-valued In this section, we discuss potential applica- function ψ :[0, 1] → Real such that (i) it tions of maximal pattern discovery consid- takes the maximum value ψ(1/2), (ii) the ered in this paper to predictive mining and minimum value ψ(0) = ψ(1) = 0, and 1 ≥ classification. (iii) ψ(x) is convex, i.e., ψ( 2 (x + y)) 1 ∈ 2 (ψ(x)+ψ(y)) for every x, y [0, 1]. 3.1. Predictive Mining and Classi-

fication • The information entropy: ψ1(x)= − − − − First, we introduce a practical model of x log x (1 x) log(1 x). machine learning from noizy environment, • The Gini index: ψ2(x)=2x(1 − x). known as robust training or agnostic learn- ing according to, e.g.,[5, 6]. Suppose we Given objective function F , and pattern are given a finite collection S = { xi : i = 1,...,m}⊆Σ∗ of strings, called a sam- P , the contingency table is a 4-tuple ple, and a binary labeling function F : S → (M1,M0,N1,N0), where M1 and M0 are the {0, 1}, called an objective function. Each numbers of the matched positive and nega- string s ∈Sis called a document and the tive examples, and N1 and N0 are the num- S value of the funcion F (s) ∈{0, 1} indicates bers of positive and negative examples in . whether the document is, e.g., interesting or Now, we describe the optimal pattern dis- not. Let P be a class of classification rules covery problem, which is parameterized by or patterns, where each pattern P ∈Prep- an impurity function ψ, as follows (See, e.g., resents a binary function P : S →{0, 1}.In Devroy et al. [4]). ∈S our case, for any document s , P (s)=1 ψ-Optimal Pattern Discovery Problem: if P matches s and P (s)=0otherwise. For Input: A sample S and an objective function ∈{ } a predicate π(x), [π(x)] 0, 1 is the indi- F : S→{0, 1}. cator function that returns 1 or 0 depending Task: To find an optimal pattern P ∈Pthat on the truth value of π(x). Now, we state our minimizes within P the cost pattern discovery problem [6]. ψ · M1 · M0 Empirical Error Minimization Problem: GS,F (P )=N1 ψ( )+N0 ψ( ), N1 N0 Input: A sample S and an objective function F : S→{0, 1}. where (M1,M0,N1,N0) is the contingency Task: To find an optimal pattern P ∈Pthat table defined by S, F , and pattern P .

128 012345678 Pattern P AB*CBA*BB m = 9

Text T BCBCABDCBCBABACBBABAB n = 21 012345678901234567890 00 10 20 p = 4 An occurrence position p of P in T

Figure 1: A pattern and its left and right positions in a text.

3.3. A Heuristic Algorithm for Op- Theorem 6 Let S⊆Σ∗ and timal Pattern Discovery F : S→{0, 1}. Let P = ERP and M⊆Pbe the class of maximal pat- In spite of the practical importance of op- terns in S. Even if we restrict the class timal pattern problem, the above optimiza- of patterns to M in S, then this does tion problems are known to be computa- not lose the optimility of the answers for tionally hard even for most pattern classes, the empirical error minimization prob- e.g., half-spaces and conjunctions [5]. In lem and the ψ-optimal pattern discovery ERP particular for the class of flexible problem. That is, min{ ERRS,F (P ):P ∈ patterns, Miyano, Shinohara, and Shino- M} = min{ ERRS,F (P ):P ∈P} hara [8] showed that the consistency prob- { ψ ∈M} and min GS,F (P ):P = lem for ERP is NP-complete. Shimozono, ψ min{ G (P ):P ∈P}hold. Arimura, and Arikawa [11] showed that the S,F empirical error minimization problem for Proof: For any pattern P , the cost GS (P ) ERP is even hard to approximation within ,F is uniquely determined by the contingency arbitrary small ratio.1 Therefore, a class of table (M ,M ,N ,N ) corresponding to a straightfoward generate-and-test algorithms, 1 0 1 0 sample S, an objective function F , and which enumerates frequent patterns with an pattern P . For any patterns P, Q,if adequate threshold, are used to solve the op- LO(P, S)=LO(Q, S) then P and Q re- timized pattern discovery problem in prac- alizes the same classification function P : tice. S→{0, 1}, and thus gives the same contin- In Fig. 2, we show a heuristic algorithm gency table. Hence, the follows. 2 FINDOPTIMAL for discovering top-K op- timal patterns in terms of the score func- From Theorem 6, we know that the algo- ψ tion GS,F (P ) based on a generate-and-test rithm FINDOPTMAX correctly finds top-K strategy using maximal patterns in M in- optimal patterns in S minimizing the cost stead of frequent patterns in F. If we set GS,F (P ). The efficiency of this heuristics K =1then the algorithm solves the above algorithm heavily depends on enumeration ψ-optimal pattern problem. Then, the fol- of maximal patterns of M at Line 2. There- lowing theorem gives a justification of such fore, our goal in the remainder of this paper an approach. is to develop a memory and time efficient

1 enumeration algorithm for maximal patterns Although the original approximation hardness M ERP result in [11] has been shown for the class with prox- in for the class . In what follows, imity constraints, we can obtain the same result for we refer to as the delay of an enumeration al- ERP without proximity constraints from its proof. gorithm A the maximum computation time

129 Algorithm FINDOPTMAX(S,F) input: A sample S, an objective function F : S→{0, 1}, and an integer K ≥ 1. output: The top-K optimal patterns P1,...,PK minimizing the cost GS,F (P ). 1 Let Q := ∅ be an empty priority queue with the cost as the key. 2 foreach maximal pattern P ∈Min S do begin: 3 Let LO(P, S) be the location list of P ; 4 Compute the contingency table τ =(M1,M0,N1,N0) from LO(P, S); Q Q∪{ ψ } 5 := (P, GS,F (P )) ; 6 end 7 output the top K patterns P in Q in terms of GS,F (P );

Figure 2: An algorithm for the optimal pattern discovery via maximal pattern enumeration. of A between consecutive outputs. The following two lemmas are essential for flexible patterns. 4. Tree-shaped Search Route for Position-Maximal Flexible Pat- Lemma 7 Let T ∈ Σ∗ be an input string terns and P, Q ∈ERPbe flexible patterns. Sup- pose that P  Q. Then, if Q is a prefix spe- In this section, we will present our algo- cialization of P then O(P, T) ⊇ O(Q, T ). rithm for finding position-maximal patterns in a given input string for the class ERP of Proof: Let T be an input string of length n. flexible patterns. First, we introduce a tree- Let p ∈ O(P, T)=LO(P, T) be any left- shaped search route T for the space of all position of P in T . Then, it follows from the maximal patterns in M. Then, in the next definition that If p ∈ LO(P, T) then some section, we give a memory efficient algo- substring H of T starting at position p is an rithm for enumerating all maximal patterns instance of Q. On the other hand, since Q is based on the depth-first search over T . Out a prefix specialization of P , some prefix H strategy is explained as follows: First we de- of Q is an instance of P . Therefore, some fine a binary relation between maximal pat- prefix H of H, and thus a substring of T , terns, called the parent function, which indi- is an instance of P . Since H is a substring cates a reverse edge from a child to its par- starting at p in T , the lemma is proved. 2 ent. Next, we reverse the direction of the edges to obtain a spanning tree for M. Lemma 8 Let T ∈ Σ∗ be an input string ∈ We start with technical lemmas. Let P, Q and P, Q ∈ERPbe flexible patterns. Sup- ERP be patterns over Σ. Recall that Q is a pose that P  Q. Then, if Q is a non-  specialization of P , denoted by P Q,iff prefix specialization of P then O(P, T) ⊆ ∈ ∪ {∗} ∗ Q = α(Pθ)β for some α, β (Σ ) O(Q, T ). and for some substitution θ for P . We dis- tinguish two cases whether α = ε. Proof: Let pmax = max LO(P, T) be the largest left-position of P in T . Since Definition 2 A pattern Q is said to be a pre- LO(P, T) = ∅ and T has finite length, there fix specialization of another pattern P if P always exists such a largest left-position occurs in the initial part of Q, i.e., Q = pmax in T . Now we assume to contradict ∗ (Pθ)β for some string β ∈ (Σ ∪ {∗}) and that O(P, T) ⊆ O(Q, T ). Then, pmax is also for some substitution θ for P . If there is no a postition of Q in T . Since Q is a non- such β and θ, Q is said to be a non-prefix prefix specialization of P , P occurs in Q at specialization of P . some position δ>1. Thus, we know that

130 q = pmax + δ − 1 is a position of P in T . X + k and X 1≤ Y are subsets of X. Using If δ>1 then q is strictly larger than pmax. these operators, we can describe the location This contradicts the assumption that pmax is lists of a composite pattern of the form wP the largest position of P in T , and thus we or w ∗ P , where w ∈ Σ+ and P ∈ERPas conclude that O(P, T) ⊆ O(Q, T ). Hence, follows. the result is proved. 2 Lemma 11 Let T ∈ Σ∗ be any input string, + ∗ w ∈ Σ be any non-empty constant string, Corollary 9 Let T ∈ Σ be an input string and P ∈ERPbe any flexible pattern. Then, and P, Q ∈ERPbe flexible patterns. Sup- the following (a) and (b) hold: pose that P  Q. Then, if Q is a non-  prefix specialization of P then O(P, T) = (a) LO(wP)=LO(w, T) ∩ (LO(P, T) − O(Q, T ). |w|).

We define the parent-child relationship be- (b) LO(w ∗ P )=LO(w, T) 1≤ tween two flexible patterns in ERP as fol- (LO(P, T) −|w|). lows. Let T ∈ Σ∗ be any input string of length n ≥ Definition 3 Let Q = w ∗ w ∗···∗w 0. A maximal pattern P is a root pattern in 0 1 m { | |} be a flexible pattern over Σ, where m ≥ 0 T if O(P, T)= 1,..., T . Now, we show + the main result of this section. and w1,...,wm ∈ Σ . Then, we define the parent of Q, denoted by P(Q), is the pattern M satisfying the the followings (i) or (ii): Theorem 12 (reverse search property of ) Let Q ∈Mbe a maximal pattern in T that is not a root pattern. Then, P(Q) is also a (i) If |w0|≥2, that is, w0 = au0 for some letter a ∈ Σ and a non-empty maximal pattern in T . That is, if Q ∈M + then P(Q) ∈Mholds. Furthermore, string u0 ∈ Σ , then then P(Q)= |P(Q)| < |Q| holds. u0 ∗ w1 ∗···∗wm. Proof: Let Q be a maximal pattern that is (ii) If |w0| =1, that is, w0 = a for some not a root pattern, and let P = P(Q) be the letter a ∈ Σ then P(Q)=w1 ∗···∗wm. parent of Q. In what follows, for any pattern R, we write LO(R) for LO(R, T) by omit- In summary, the parent P(Q) is the flexi- ting T for simplicity. Suppose to contradict ble pattern obtained from Q by removing the that P is not maximal in T . Then, there ex- first letter, and then remove the first variable ists some proper specialization P  of P , i.e., ∗ at the starting position if it exists. The re- P < P , such that LO(P )=LO(P ). moval of the initial ∗ ensures the canonicity of the resulting pattern. If P < P  then P occurs in P  at some posi- tion, say, 1 ≤ p ≤|P |. There are two cases Lemma 10 For any non-empty flexible pat- below. ∈ERP  tern Q in cannonical form, its par- (i) The case where p =1 : Then, P is a P ent (Q) is always defined, unique, and non-prefix specialization of P . It immedi- a flexible pattern in canonical form, i.e., a ately follows from Lemma 9 that LO(P ) = ERP  member of . LO(P ). This is a contradiction. p P  Let n ≥ 0 be a positive integer. For a (ii) The case where =1: Then, is a nonnegative integer 0 ≤ k ≤ n and a set prefix specialization of P . By the definition X ⊆{1,...,n}, we define X + k = { x + of the parent, there are the following cases k : x ∈ X }. For sets X, Y ⊆{1,...,n}, for P and Q. we define X 1≤ Y = { p ∈ X : p ≤ (ii.a) The case where Q = aP for some q for some q ∈ Y }. By definition, both of letter a ∈ Σ: Let Q = aP  ∈ERP.

131 Since P  is a proper prefix specialization 5. An Algorithm for Position-  of P , Q is also a proper specialization of Maximal Flexible Pattern <  Q, i.e., Q Q . In this case, we have Enumeration LO(Q)=LO(aP )=LO(a)∩(LO(P )− |a|) by Property (a) of Lemma 11. Since In Fig. 3, we show a polynomial-space LO(P )=LO(P ) by the assumption, it and polynomial-delay enumeration algo- is obvious that LO(a) ∩ (LO(P ) −|a|)= rithm POSMAXFLEXMOTIF for maximal LO(a) ∩ (LO(P ) −|a|). Again by applying flexible patterns. Given an input string T Property (a) of Lemma 11 to the right hand of length n, this algorithm enumerates all side, we have LO(a) ∩ (LO(P ) −|a|)= position-maximal patterns in T without du- LO(aP ). Thus, we have Q < Q and plicates in polynomial time per maximal pat- LO(Q)=LO(Q), which says that Q is not tern using polynomial space in the input size maximal in T . However, this contradicts the n using depth-first search over the search T M assumption. tree over based on Corollary 13, T M (ii.b) The case where Q = a∗P for some let- Recall that the search tree for has re- T ter a ∈ Σ: Let Q = a∗P  ∈ERP. Since P  verse edges only, that is, each edge of is is a proper specialization of P (in this case, directed from a child to its parent. There- P  is not necessarily prefix-specialization), fore, the first step is to compute all children, ∈M we have Q < Q. Furthermore, since given a parent pattern P . This can be LO(P )=LO(P ) by assumption, we can done as follows. also show that LO(Q)=LO(Q) by apply- ing Property (b) of Lemma 11 as in the proof Lemma 14 For any maximal flexible pat- ∈M P for case (ii.a). Therefore, we see that Q is terns P, Q , P = (Q) if and only if ∈ not maximal in T , and thus, the contradic- there exists some constant letter a Σ such ∗ tion is derived. that either (i) Q = aP or (ii) Q = a P holds. By combining cases (i), (ii.a), and (ii.b) above, we conclude by contradiction that Furthermore, since P(Q) is defined also for P is maximal in T . Furthermore, it is non-maximal flexible patterns Q, we know clear from the construction that P is strictly that any flexible pattern can be obtained shorter than Q in length. Hence, the result is from finite applications of the operations in 2 proved. above Lemma 14 to the empty pattern ε. Then, Theorem 12 gives a sound pruning Definition 4 A search graph for M w.r.t. P strategy that once an enumerated pattern P is a directed graph T =(M, P, I) with gets non-maximal then we can immediately roots, where M is the set of nodes, i.e., the prune all the descendants of P . P set of all maximal flexible patterns in T , is Secondly, we discuss how to efficiently test ∈P the set of reverse edge such that (P, Q) the maximality of a given pattern P . The re- P I⊆M iff P = (Q) holds, and is the set finement operator for ERP was introduced of root patterns in T . by Shinohara [12]. The following version is due to [2]. Since each non-root node has the unique par- ent in T from Theorem 12, the search graph Definition 5 A basic refinement of a pattern T is acturally a directed tree with reverse P is any pattern Q obtained from P by ap- edges. Therefore, we have the following plying one of the following operations (r1) corollary. and (r2):

Corollary 13 Let T be any input string. (r1) Q is obtained by replacing some seg- Then, T =(M, E, I) is a spanning forest ment w ∈ Σ+ in P with either aw, wa, for M with the root set I. a ∗ w, w ∗ a for some a ∈ Σ.

132 Algorithm POSMAXFLEXMOTIF(Σ, S, σ) input: An alphabet Σ, an input string S, minimum frequency threshold 0 ≤ σ ≤|S|; output: All maximal patterns in M; 1 Let ⊥ be the maximal pattern in T which equivalent to ε (the root pattern). 2ENUMMAXIMAL(ε, σ);

Procedure ENUMMAXIMAL(P, LO(P ),σ) input: A maximal pattern P and its left-location list LO(P ). output: All maximal patterns that are descendants of P . 1 Compute LO(P ); 2 if |LO(P )| =0then return; 3 if P is not maximal in S then return; 4 Output P ; 5 foreach a ∈ Σ do begin: 6ENUMMAXIMAL(aP, LO(aP ),σ); 7ENUMMAXIMAL(a ∗ P, LO(a ∗ P ),σ); 7 end

Figure 3: An algorithm POSMAXFLEXMOTIF for enumerating all maximal flexible patterns in an input sequence.

(r2) Q is obtained by replacing a pair (ii) The branching of P in T (the number of of consecutive segments v ∗ w ∈ the children for P ) is at most O(|Σ|m). Σ+{∗}Σ+ in P with vw. By combining the above lemmas, we have ⊆ERP For a pattern P , we define ρ(P ) to the main result of this paper. This says be the set of all basic refinements of P . that our algorithm POSMAXFLEXMOTIF is a memory and time efficient algorithm for Lemma 15 A flexible pattern P is maximal discovering maximal flexible patterns. in T if and only if there is no basic refine- ment Q ∈ ρ(P ) such that LO(P, T)= LO(Q, T ). Theorem 18 Let Σ be an alphabet and T ∈ Σ∗ be an input string of length n ≥ Corollary 16 The maximality of a flexible 0. Then, the algorithm POSMAXFLEX- pattern P in an input string T is decidable MOTIF in Fig. 3 enumerates all position- in O(|Σ|m2n) time, where m = |P | and maximal patterns P in T without duplicates | | 2 n = |T |. in O( Σ kmn ) delay per maximal pattern using O(mn) space, where m = |P | and On the shape of T , we have the following k = O(m) are the size and the number of lemma. variables of the pattern P to be enumerated.

Lemma 17 Let P ∈Mbe any maximal Corollary 19 The maximal pattern enu- pattern in T and m = |P |. Then, meration problem for the class ERP of flexible patterns (or erasing regular pat- (i) The depth of P in T (the length of the terns) w.r.t. position-maximality is solvable unique path from the root to P )isat in polynomial-space and polynomial-delay most m. in the total input size.

133 6. A Practical Algorithm for Theorem 22 Let Σ be an alphabet and T ∈ ∗ Discovering document-maximal Σ be an input string of length n ≥ 0. flexible patterns in a set of input Then, the algorithm DOCMAXFLEXMOTIF in Fig. 4 enumerates all document-maximal strings patterns P in T without duplicates using Let Σ be a fixed alphabet of constants. An O(mn) space, where m = |P | is the size input string set is a collection of constant of the pattern P to be enumerated. strings over Σ ∗ S = {s1,...,sd}⊆Σ , 7. Conclusion ∗ where each member si ∈ Σ is called a doc- In this paper, we consider the maximal pat- ument of S (i =1,...,d). For a flexible pat- tern discovery problem for the class ERP tern P ∈ERP, the document-location list of flexible patterns [9], which is also known of P is defined by the set DO(P, S)={ 1 ≤ as erasing regular patterns in machine learn- i ≤ d : P  si } of the indices of the docu- ing. The motivation of this study is ap- ments in which P occurs. The document fre- plications to the optimal pattern discovery quency of P in S is defined by |DO(P, S)|. problem in machine learning and knowledge For 0 ≤ σ ≤|S|, a pattern P is docu- discovery. As a main result we present a ment σ-frequent if |DO(P, S)|≥σ. A pat- polynomial-space and polynomial-delay al- tern P is document-maximal in S if there gorithm for enumerating all maximal pat- is no proper specialization Q of P that has terns appearing in a given string without du- the same document-location list in S, i.e., plicates in terms of postion-maximality de- P < Q and DO(P, S)=DO(Q, S). The fined through the equivalence relation be- following lemma justifies the pruning strat- tween the location lists. As another applica- egy at Line 2 of Algorithm DOCMAXFLEX- tion, we presented a heuristics algorithm for MOTIF in Fig. 4. enumerating document-maximal patterns in a collection of strings for the class ERP Lemma 20 (pruning by monotonicity) If with non-trivial pruning strategies.  S ⊇ S P Q then DO(P, ) DO(Q, ). References ∗ Let S = {s1,...,sk}⊆Σ be an input doc- [1] D. Avis and K. Fukuda, Reverse ument set and let # ∈ Σ) be a new delimi- Search for Enumeration, Discrete Ap- plied Mathematics, Vol. 65, 21Ð46, tor symbol. Then, we define an input string 1996. ··· S S = s1# #sk obtained from by con- [2] H. Arimura, R. Fujino, T. Shino- catenating all documents by the delimitor #. hara, Protein motif discovery from The following lemma ensure the soundness positive examples by minimal multiple generalization over regular patterns, of the pruning strategy at Line 3 of Algo- Proc. GIW’94, 39-48, Dec. 1994. rithm DOCMAXFLEXMOTIF in Fig. 4. [3] H. Arimura, T. Uno, A polynomial space and polynomial delay algorithm Lemma 21 (pruning by position-maximality) for enumeration of maximal motifs in Let P be any flexible pattern over Σ.IfP a sequence, Proc. ISAAC’05, LNCS 3827, Springer, Dec. 2005. is document-maximal in S then P is also a [4] L. Devroy, L. Gyrfi and G. Lugosi, A position-maximal in S. Probabilistic Theory of pattern Recog- nition, Springer Verlag, 1996. Based on the above lemmas, we show the al- [5] M. J. Kearns, R. E. Shapire, L. M. Sel- gorithm DOCMAXFLEXMOTIF for enumer- lie, Toward efficient agnostic learning, ating all document-maximal frequent flexi- Machine Learning, 17(2Ð3), 115Ð141, 1994. ble patterns in a set of strings in Fig. 4. Un- [6] W. Maass, Efficient agnostic PAC- fortunately, this algorithm is not shown to be learning with simple hypothesis, In of output-polynomial time. Proc. COLT94, 67Ð75, 1994.

134 Algorithm DOCMAXFLEXMOTIF(Σ, S, σ) ∗ input: An alphabet Σ, an input set S = {s1,...,sk}⊆Σ , 0 ≤ σ ≤|S|; output: All document-maximal patterns in DM; 1 Let ⊥ be the maximal pattern in T which equivalent to ε (the root pattern). 2 Let S = s1# ···#sk be an input string (# ∈ Σ). 3DOCENUMMAXIMAL(⊥,LO(⊥,S),σ);

Procedure DOCENUMMAXIMAL(P, LO(P ),σ) input: A maximal pattern P and its left-location list LO(P, S). output: All maximal patterns that are descendants of P . 1 Compute DO(P, S) from LO(P, S); 2 if |DO(P, S)| <σthen return; 3 if P is not position-maximal in S w.r.t. LO(P, S) then return; 4 if P is document-maximal in S then output P ; 5 foreach a ∈ Σ do begin: 6DOCENUMMAXIMAL(aP, LO(aP, S),σ); 7DOCENUMMAXIMAL(a∗P, LO(a∗P, S),σ); 7 end

Figure 4: An algorithm POSMAXFLEXMOTIF for enumerating all maximal flexible patterns in an input sequence.

[7] H. Mannila, H. Toivonen, A. large databases, In Proc. SDM 2003, I. Verkamo, Discovery of frequent SIAM, 2003. episodes in event sequences, Data [14] J. Wang and J. Han, BIDE: efficient Min. Knowl. Discov., 1(3), 259Ð289, mining of frequent closed sequences, 1997. In Proc. IEEE ICDE’04, 2004. [8] S. Miyano, A. Shinohara, T. Shino- hara, Polynomial-time learning of el- ementary formal systems, New Gener- ation Comput. 18(3): 217Ð242, 2000. [9] L. Parida, I. Rigoutsos, et al., Pattern discovery on character sets and real- valued data: linear-bound on irredan- dant motifs and efficient polynomial time algorithms, In Proc. SODA’00, 2000. [10] N. Pisanti, M. Crochemore, R. Gross, M.-F. Sagot, A basis of tiling motifs for generating repeated patterns and its complexity of higher quorum, In Proc. MFCS’03, 2003. [11] S. Shimozono, H. Arimura, S. Arikawa, Efficient discovery of optimal word-association pat- terns in large text databases, New Generation Comput. 18(1), 49Ð60, 2000. [12] T. Shinohara, Polynomial time infer- ence of extended regular pattern Lan- guages. Proc. RIMS Symp. on Software Sci. & Eng., 115Ð127, 1982. [13] X. Yan and J. Han, R. Afshar, CloSpan: mining closed sequential patterns in

135 Application of Truncated Suffix Trees to Finding Sentences from the Internet

Takuya Kida Takashi Uemura Hiroki Arimura Hokkaido University Graduate School of Information Science and Technology Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido, 060-0814, Japan

Abstract We propose a space saving index structure based on the suffix tree, called the truncated suffix tree with k-words limitation, which is a tree structure obtained by eliminating the nodes that indicate the substrings longer than k words in a given text. Although such tree may confuse any substrings that occur at the different positions but the same string, it is suitable for counting short substrings. We presented it can be constructed on-line in O(n) time and space. Our experimental result shows that the number of nodes falls broadly when k is small. The reduction ratio is almost half in comparison with the original suffix tree when k =3in practice. We also discuss about the application of it to Kiwi system which can help us to find keywords from the Internet. 1. Introduction phrase examples that match into the inputed context. Though it is based on the same Since information created by people has in- idea of the KWIC (Key Word in Context) creased rapidly after moving into the 21st tool, it can present phrases that match the century, efficient technologies for dealing user’s needs better by using data on the In- with numerous data become more and more ternet rather than a fixed dictionary. There important. There are many documents and are some similar systems like WebCorp[1], databases on the Internet now, and it is ur- Google Fight[2], Google Duel[10], and so gently necessary to develop information re- on, but Kiwi has a superior feature that can trieval technologies that one can pick up the handle any languages and flexible queries. desired information. The current Kiwi system gathers the search Nowadays almost everyone uses search en- results for the temporal dictionary to be used gines like Google or Yahoo to get informa- by touching search engine’s API for each tion which one wants to know. Such search query request. However, such mechanism engines may display search results quickly may take several minutes before replying the when a user throws appropriate queries into final results. Although the response time can them. However the user would not be able to be reduced if the system gathers less search obtain sufficient results if proper keywords results, the precision of the final results will are not come to mind. Therefore, to give drop down. Since users usually can not stand users some clues for proper queries, or to for waiting more than several tens of sec- extract semi-automatically the desired infor- onds, the text data to be used must be in a lo- mation from the Internet, many researchers cal server, and the system must be able to re- have studied about the web data mining. trieve keywords quickly from them, to guar- Kiwi system[13] developed by Tanaka and antee the sufficient precision and response Nakagawa[14] is one of the solutions to help speed. such situation. It is originally developed Suffix tree[5] is a useful data structure that with the aim of presenting the users some can index every substring in a given text.

136 The size of the tree for the text is O(n), structed in O(n) time and space for the input where n is the text length. By using the text of length n. The basic idea of our al- suffix tree, searching a keyword can be gorithm is to close partially the extensions done in O(m) time, where m is the key- of leaves in the suffix tree, which are au- word length[16]. Several well-known opti- tomatically extended in the original Ukko- mal algorithms for constructing suffix trees nen’s algorithm[15], whenever each delim- exist[8, 15], and various extensions have iter is loaded. By applying k-WST to Kiwi been proposed. system, there exists a possibility to make it Admitting that the suffix tree is a compact more flexible and useful. data structure which can deal with all sub- strings in O(n) space as mentioned above, Related works Implementing data com- the required memory space tends to grow to pression methods such as Ziv-Lempel fam- an enormous size for practical purposes be- ily is one of the most important applications cause the size n of the target data itself is of suffix trees. The original suffix tree is often large. Therefore, more compact data not applicable to the LZ77 scheme because structures that have almost the same func- that the string can be referred as a dictionary tion as the suffix tree, is desirable. From the is restricted by the sliding window. Several viewpoint of application of the suffix tree, methods which can overcome this problem there are few cases that very long substrings have been proposed[11, 4, 7]. Na et al.[9] are needed. For the purpose of searching defined the truncated suffix tree which can short sentences, for example, it is sufficient represent all substrings whose length is less to index only substrings that cover several than k, and proposed the constructing algo- words. rithm in proportion to the text length. Al- In order to speed up the response time, Ichii though our algorithm can be seen as an ex- et al.1 improved Kiwi system by using a suf- tension of their idea, closing leaves’ exten- fix tree for n-gram that indexes all n-grams sion is done at different time intervals for of the text data which are gathered from the each leaves since the length of the suffixes Internet in advance and stored locally. Their registered with the tree is not fixed. proposed data structure is fundamentally the For Kiwi system, by adopting the several de- same as that of Na et al.[9]. They succeeded mands of the field of linguistics, the newly to shorten the response time dramatically. system, called Tonguen, has developed by However, it can not retrieve the sentences Tanaka and Ishii[12]. longer than the fixed size n. In this paper, we propose a new data struc- 2. Preliminaries ture which can index any substring shorter Let Σ be a finite alphabet and let Σ∗ be the than k words in a given text, called the set of all strings over Σ. We denote the truncated suffix tree with K-words limita- length of string x ∈ Σ∗ by |x|. The string tion (k-WST for short). We also present it whose length is 0 is denoted by ε, called can be constructed in O(n) time. Assume empty string, that is |ε| =0. We also de- that the input text T is split into the N words fine that Σ+ =Σ∗ −{ε}. The concatenation ∗ sequence with delimiter symbol #, that is, of two strings x1 and x2 ∈ Σ is denoted by ··· T = w1#w2# #wN . Then, for T and a x1 · x2, and also write it simply as x1x2 if no given integer k, k-WST represents all sub- confusion occurs. w #w # ...#w − strings of the string i i+1 i+k 1 Strings x, y, and z are said to be a prefix, i 1 ≤ i ≤ N − k +1 for every ( ). The pro- factor, and suffix of the string w = zyz, re- posed algorithm is based on the online algo- spectively. The ith symbol of a string w is k WST rithm of Ukkonen[15]. - can be con- denoted by w[i], and the factor of u that be- 1This paper was published in the workshop pro- gins at position i and ends at position j is ceeding written in Japanese denoted by w[i...j]. For convenience, let

137 w[i...j]=ε for j

1. Receive a query from a user. (iii) It is succeeded by various kind of words (namely characters). 2. Tokenize the query and then send it to

search engines. Let denote by Xi the prefix x[1 ...i] (i> |P |)ofx, and also denote by C the number 3. Extract all fixed-length strings from the i of kinds of character x[i+1]preceded by X search results, which include the candi- i in the target text. From the condition (iii), dates. the Kiwi system extracts Xi as a candidate 4. Cut these strings to the proper length to when extract candidates. Ci >Ci−1. 5. Rank them before answer. In order to do this, an n-gram trie for the text The clipping positions can be straightfor- is constructed and traversed. That is, the sys- wardly determined when the ‘∗’ is used in tem determines the candidates by cutting at the middle. In the case that a query is pre- the position where the number of branches ceded or succeeded by ‘∗’, it is a problem to in the trie turns from downward to upward determine how long we must take as a can- in the traversing (see Figure 1). Such can- didate string. Cutting at each word separa- didate strings often satisfy the condition (ii). tor for segmented languages, or cutting by fixed-length for non-segmented languages, This method is empirically and do not neces- are in common use for the other KWIC sys- sarily go well at any time. Actually, the ad- tems, while Kiwi system takes another ap- ditional efforts are required because it may proach. fail when the candidates hardly appear in the Now we concentrate in the case that the in- text. However, it has an advantage that it is put query ends with ‘∗’, since the case that it independent of language. ∗ starts with ‘ ’ can be managed as the same In the step of ranking, Kiwi system uses the way by viewing the target texts in reverse. formula We call a string obtained by eliminating ‘∗’ from a query as a clue string. What we want F (X)=freq(X) log(|X| +1)

138 as an evaluation function, where freq(X) is no corresponding node in ST (T ) as a virtual the occurrence rate of the candidate X. This node. To distinguish it, we call each s ∈ V is to think that freq(X) is important for the a real node. condition (i); on the other hand longer can- We define suf : V → V as a function which didates are more important for the condi- returns a node representing the string elimi- tion (ii). Although we can consider more nating the first character from s for s ∈ V . complicated evaluation functions, it will be a For each real node s, t ∈ V and a character trade-off between the precision and the per- a ∈ Σ, suf (s)=t iff s = a · t. Since this formance. can be seen as a edge (s, t) from s to t,we call it the suffix link from s to t. For each real 2.2. Suffix tree node s ∈ V , there exists just one path from root Suffix tree (ST for short) is a data structure s to by traversing suffix links. We call that can represent Fac(T ) for the string T .It the path the suffix path. can be denoted by a quadruplet as ST (T )= (V ∪ {⊥}, root ,E,suf ), where V is a set of 2.3. Ukkonen’s suffix tree con- nodes and ⊥ ∈/ V , and E is a set of edges struction and E ⊆ V 2. Then, the graph (V,E) forms a We will give a brief sketch of Ukkonen’s rooted tree which have a root node root ∈ V suffix tree construction algorithm[15] below. for ST (T ). That is, there exists just one path The algorithm reads characters one by one from root to each node s ∈ V .If(s, t) ∈ E from the given text T = T [1 ...n], and con- for nodes s, t ∈ V , we call s the parent of structs ST (T [1 ...i]) gradually for each step t and t by the child of s. We also call the i(1 ≤ i ≤ n). The construction process is internal node of T if it has a child, and call done by adding every suffix of T [1 ...i] into the leaf of T if it has no children. Moreover, the tree in descending order of the length for every node on the path from root to the node each step i. Note that, however, the nodes s ∈ V is called the ancestor of s. corresponding to suffixes of T [1 ...i] are ST Each edge in E is labeled with a factor of T , not necessarily to be leaves of (T [1 ...i]) represented by a pair of integers (j, k), that since it does not end with the terminal sym- is, the label string is T [j...k]. We denote bol in progress phase. Such a incomplete by label(s) the label string for the edge into suffix tree is called an implicit suffix tree. s ∈ V since any child has just one parent. Consider that we are going to extend the suf- For any s ∈ V , we denote by s the string fix tree at a step i. We denote by φ the po- spelled out every ancestor’s label from root sition on the tree to where a new leaf is go- to s, and call it the string represented by s. ing to be added, and we call it the working That is, assuming that root ,a1,a2,... ,s is node. We regard the node at position φ as the the sequence of the ancestor nodes of s, s = node named φ if no confusion occurs. The label(root )·label(a1)·label(a2) ···label(s). working node can point a virtual node. Let For a given text T , each leaf of ST (T $) φ = root when i =0for convenience. If · ST corresponds one-to-one with each suffix of φ T [i] is not represented by (T [1 ...i]), T ,and any internal node except for root has we add at φ the leaf whose label starts T [i] more than two children, where we assume as a child. If the node φ is a virtual node, that $, called terminal symbol, is not in- we change it a real node and make a suffix cluded in Σ. Then, there are |T | leaves link for it. Although the suffix link for the and less than 2|T |−1 nodes in ST (T $).For changed node is not stored explicitly in this ST any two children x, y of each s ∈ V , x case, we can traverse to the node repre- suf starts a different character from y, that is, senting (φ) by using the suffix link of the x[1] = y[1]. Note that each node s ∈ V nearest neighbor ancestor that is also a real corresponds one-to-one with a factor of T node. from the above. We treat each factor that has When creating a leaf v and its edge, we la-

139 t n bel the edge as label(v)=(i, ∗) by intro- h p i # e e ∗ ∗ h s s n ducing the special symbol , and regard # # i e i e i p # n s # t e as the same value as i for each step. Then, s # i t s p i t h n # p # p each leaf always represents a suffix of T . s h e s h e i e i e n Now we finished the process to add the suf- e s n s n e fix φ · T [i] to the tree. Next we move the working node in order to add the shorter suf- fix and repeat the above process. Thus we Figure 2: 2-WST(T = ”this#is#the#pen”) · create leaves if φ T [i] is not represented by t h i s # i s ST h i s # i s by traversing suffix path one by one until i s # i s s # i s φ·T [i] has already represented. Then we up- # i s i s # t h e date φ by traversing with the character T [i], s # t h e # t h e t h e # p e n where is the starting working node position h e # p e n e # p e n for the next step. Note that, we can stop # p e n p e n the extension of ST when we encounter the e n n node representing φ · T [i], because any suf- · fix Suf(φ) T [i] must appear at the same end Figure 3: Suffixes represented by · position if φ T [i] appears in T and thus any 2-WST (T = ”this#is#the#pen”) node on the suffix link which representing shorter suffix must be represented in the tree. spaces and new line symbols as one delim- For each step i, we call the portion of the suf- iter. Although there is not delimiter sym- fix path working path where the leaves are bols between words in phrases written in added newly in the above creation. some oriental languages like Japanese, we We will omit the detail of the proof, the ex- can consider that it is separated into words tension process can be done totally in time with some NLP tools in advance. Then, in proportion to the number of added nodes. for a given text T = w1#w2# ···#wN Finally, we can obtain the complete suf- and a integer k (1 ≤ k ≤ N), we de- fix tree by adding the terminal symbol $ to fine the suffix tree with k words limitation the implicit suffix tree ST (T [1 ...n]) con- (k-WST for short) as the rooted tree by do- structed as above. ing path compression of the trie which repre- sents Fac(wi#wi+1# ...#wi+k−1) for any 3. Suffix tree with k words limita- 1 ≤ i ≤ N −k +1(Figure 2, Figure 3). For- tion mally, k-WST (T ) is a rooted tree that satis- fies the following conditions. Let # ∈/ Σ be a symbol, called the delim- ∈ ∪{ } ∗ iter symbol. Any string x (Σ # ) 1. Each edge of k-WST (T ) is labeled ··· can be written as x = w1#w2# #wN with non-empty factor of T . for the sequence of strings w1,w2,... ,wN ∀ ∗ WST ( i, wi ∈ Σ ). We call each wi a word, and 2. Each internal node of k- (T ) ex- also call w the string of N-words divided by cept for the root node has at least two #. children, and the edges from v to each children are labeled with strings that Assume that the given text is a string of start with different characters mutually. N-words divided by #. For convenience, ∈ + we assume that wi Σ for any i, and 3. For each leaf v of k-WST (T ), v cor- that # never appears continuously. This responds one-to-one to a string s ∈ means that we regard the consecutive #s Suf (wi#wi+1# ...#wi+k−1) (1 ≤ as one delimiter symbol. The assumption i ≤ N − k +1). is natural even in a practical use. For ex- ample, it is natural for English (or Euro- We will discuss about the algorithm for con- pean language) texts to regard consecutive structing k-WST (T ), and present it can be

140 constructed in O(n) time and space. There- fore, our claim is as follows. procedure update(φ, numφ, lposφ,i); method: Theorem 1. Given a N-words text T = c = T [i]; oldr = root ; { w1#w2# ···#wN whose length is n, as- while (φc has not represented yet) suming # ∈/ Σ and ∀i, w ∈ Σ+, and a in- if (c =# or i { teger k (1 ≤ k ≤ N), the suffix tree with k there isn’t a closed leaf representing φ) if (φ is a virtual node) words limitation of T can be constructed in Change the node φ into a real node; O(n) time and space. Create a child q, label(q)=(i, ∗) of φ; if (oldr = root ) suf (oldr)=φ; 3.1. Constructing algorithm oldr = φ; } The most important thing is not to process φ = suf (φ); for redundant leaves and to close the exten- if (T [lposφ]=#) numφ = numφ − 1; sion of appropriate leaves just before they lposφ = lposφ +1; extend over k +1words when the tree is up- } dated for each step. To close the extension if (oldr = root ) of the leaf is done by replacing the label’s suf (oldr)=φ; end mark ∗ into i − 1. We call this process φ = φc; if (c =#) closing a leaf. Here, we consider that the numφ = numφ +1; label for each closed leaf is added virtually return (φ, numφ, lposφ); a terminal symbol /∈ Σ. From the above end. procedure, we can prevent from entering the suffixes longer than k+1 words into the tree. procedure close(ψ, numψ, lposψ,i); method: Figure 4 is the detail of the algorithm. if (T [i]=#) { while (numψ = k − 1) { Close the leaf ψ with i − 1; Procedure for updating the tree Con- if (T [lposψ]=#) − sider that we have now k-WST (T [1 ...i− numψ = numψ 1; lposψ = lposψ +1; ψ = suf(ψ); 1]) and process the updating the tree for } T [i]. Let numφ be the number of # which numψ = numψ +1; the suffix φ includes, and also let lposφ be } the start position of φ in T , that is, φ = return (ψ, numψ, lposψ); end. T [lposφ ...i− 1]. We increase numφ1 by one when traversing the tree with T [i]= Algorithm constructing k-WST ; #. On the other hand, we decrease numφ input: k,T[1 ...n]; by one when traversing the suffix path and output: k-WST (T ); T [lpos ]=#. Then we increase lpos by method: φ φ ⊥ one. We can manage correctly the number Create the root root and the auxiliary node , and then add an edge (⊥, root); of words for φ in this way. foreach c ∈ Σ ∪{#, $} do The procedure for adding leaves on the Add an edge (⊥, root ) labeled with c; φ = numφ =0 lposφ =1 working path is basically the same as that root; ; ; ψ = root ; numψ =0; lposψ =1; of Ukkonen’s algorithm. However, we need for (i =1...n) { additional effort if the k-words string con- (φ, numφ, lposφ)=update(φ, numφ, lposφ,i); tains another k-words string as a prefix like (ψ, numψ, lposψ)=close(ψ, numψ, lposψ,i); the case in Figure 5. Then, we must create a } Report k-WST (T ). leaf which represents the end of the word for the shorter one, too. Although the label of it Figure 4: Algorithm for constructing is ε, we assume that it represents a virtual k-WST terminal symbol /∈ Σ. On the other hand,

141 root T [i]. Thus it moves n times totally. To up- date numψ and lposψ for each movement can be done in constant time. On the other t hand, the position ψ traverse the suffix link h only when num = k − 1 and T [i]=#, e ψ that is, when closing leaves. To close a leaf s i s can be done in constant time. Therefore, the procedure for closing leaves can be done in O(n) time totally since the number of the Figure 5: Virtual terminal symbol leaves is O(n). we do not create a new leaf if there exists a closed leaf at the position φ. 3.3. Managing the occurrence rate We can accumulate the occurrence rate of Procedure for closing leaves Consider any factor in T by using the original suf- that we have now k-WST (T [1 ...i−1]) and fix tree ST (T $),where the occurrence rate process the updating the tree for T [i]=#. of each suffix is exactly 1. A factor u repre- Then, we need to close each leaf which rep- sented by an internal node u is a prefix of the resents the suffix that includes just k − 1#s. strings represented by their children. Thus,   To do this, we close all relevant leaves from for any children u of u,ifu occurs p times the leaf representing the longest suffix at this in T , then u occurs p times at the same posi- point in descending order of length. tion. Therefore, the occurrence rate of u can be calculated as the sum of the occurrence Let ψ be a position of the leaf now we fo- rate of every u’s children. cused on. Also let numψ be the number of WST # within ψ, and lposψ ψ be the start posi- However, for k- (T $),not all the occur- tion of ψ in T in a similar way of the updat- rence rates for closed leaves are 1, because ing procedure. We increase numψ by one that it is not necessarily the case that each when T [i]=#, and close ψ with i − 1 leaf v does not correspond to a suffix and if numψ = k, and then repeat this after thus v may occur in T several times. To traversing the suffix link until numψ

142 References

[1] Webcorp home page, 1999. http://www.webcorp.org.uk/.

[2] Googlefight : Make a fight with googlefight, 2000. http://www.googlefight.com/.

Figure 6: Change in the number of nodes [3] Arne Andersson, N. Jesper Larsson, and Kurt Swanson. Suffix trees on is small, while it is almost the same when words. Algorithmica, 23(3):246Ð260, k =6. 1999. 3.5. Application to Kiwi [4] E. R. Fiala and D. H. Greene. Data compression with finite windows. As mentioned in [13], the processing for Comm. ACM, 32(4):490Ð505, 1989. clipping candidates which denoted in Sec- tion can be realized by using suffix trees [5] Dan Gusfield. Algorithms on strings, but n-gram trie. For the case of Kiwi sys- trees, and sequences - Computer sci- tem, however, we do not need substrings ence and computational biology. Cam- longer than several words. Their positions in bridge University Press, 1997. the text are also unnecessary. Moreover the longer candidates are not unnaturally bro- [6] Shunsuke Inenaga and Masayuki ken when we use k-WST as an index struc- Takeda. On-line linear-time construc- ture but n-gram trie, since the lengths of tion of word suffix trees. In In proc. the substrings which we can refer to are not of 17th Ann. Symp. on Combinatorial fixed. From the above observation we con- Pattern Matching, 2006. (to appear). clude that our data structure k-WST is more suitable for Kiwi system rather than the orig- [7] N. J. Larsson. Extended application inal suffix tree or the n-gram trie. of suffix trees to data compression. In In Proc. of Data Compression Confer- 4. Conclusion ence, pages 190Ð199, 1996. In this paper, we proposed the truncated suf- [8] E. M. McCreight. A space-economical fix tree with k-words limitation, which can suffix tree construction algorithm. J. represent any factor less than k-words in a ACM, 23:262Ð272, 1976. given text T = T [1 ...n], and also presented it can be constructed in O(n) time. In our [9] J. C. Na, A. Apostolico, C. S. Iliopou- implementation of the data structure, the ex- los, and K. Park. Truncated suffix trees perimental results showed that the number and their application to data compres- of nodes falls broadly when k is small, and it sion. Theoretical Computer Science, becomes about half when k =3comparing 304:87Ð101, 2003. with the original suffix tree. We also men- tioned about the possibility of application to [10] G. Peters. Geoff’s googleduel!, 2002. Kiwi system. http://www.googleduel.com/original.php. Several researchers have proposed the tech- niques to enter the suffixes for a word-based [11] M. Rodeh, V. R. Pratt, and S. Even. sequence rather than character-based[3, 6]. Linear algorithm for data compression It is our future work to combine the tech- via string matching. J. ACM, 28(1):16Ð nique into our idea. 24, 1981.

143 [12] Kumiko Tanaka-Ishii and Yuichiro Ishii. Tonguen - search your tongue’s end, 2006. http://cantor.ish.ci.i.u-tokyo.ac.jp/ ku- miko/tonguen/tonguen.cgi.

[13] Kumiko Tanaka-Ishii and Hi- roshi Nakagawa. Kiwi site, 2004. http://kiwi.r.dl.itc.u-tokyo.ac.jp/.

[14] Kumiko Tanaka-Ishii and Hiroshi Nak- agawa. A multilingual usage consul- tation tool based on internet searching: more than a search engine, less than qa. In In Proc. of the 14th internat. WWW Conference, pages 363Ð371, 2005.

[15] E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14(3):249Ð 260, 1995.

[16] P. Weiner. Linear pattern matching al- gorithms. In In Proc. of the 14th IEEE Symp. on Switching and Automata The- ory, pages 1Ð11, 1973.

144 Generating Frequent Closed Item Sets Using Zero-suppressed BDDs

Shin-ichi Minato Graduate School of Information Science and Technology, Hokkaido University Sapporo, 060-0814 Japan.

Abstract Frequent item set mining is one of the fundamental techniques for knowledge discovery and data mining. In the last decade, a number of efficient algorithms for frequent item set mining have been presented, but most of them focused on just enumerating the item set patterns which satisfy the given conditions, and it was a different matter how to store and index the result of patterns for efficient data analysis. Recently, we proposed a fast algorithm of extracting all frequent item set patterns from transaction databases and simultaneously indexing the result of huge patterns using Zero-suppressed BDDs (ZBDDs). That method, ZBDD-growth, is not only enumerating/listing the patterns efficiently, but also indexing the output data compactly on the memory to be analyzed with various algebraic operations. In this paper, we present a variation of ZBDD-growth algorithm to generate frequent closed item sets. This is a quite simple modification of ZBDD-growth, and additional computation cost is relatively small compared with the original algorithm for generating all patterns. Our method can conveniently be utilized in the environment of ZBDD-based pattern indexing.

1. Introduction Zero-suppressed BDDs. That method, called ZBDD-growth, does not only enu- Frequent item set mining is one of the fun- merate/list the patterns efficiently, but damental techniques for knowledge dis- also indexes the output data compactly covery and data mining. Since the intro- on the memory. After mining, the result duction by Agrawal et al.[1], the frequent of patterns can efficiently be analyzed by item set mining and association rule anal- using algebraic operations. ysis have been received much attentions from many researchers, and a number of The key of the method is to use BDD- papers have been published about the new based data structure for representing sets algorithms or improvements for solving of patterns. BDDs[2] are graph-based such mining problems[4, 6, 11]. However, representation of Boolean functions, now most of such item set mining algorithms widely used in VLSI logic design and ver- focused on just enumerating or listing the ification area. For the data mining ap- item set patterns which satisfy the given plications, it is important to use Zero- conditions and it was a different matter suppressed BDDs (ZBDDs)[7], a special how to store and index the result of pat- type of BDDs, which are suitable for han- terns for efficient data analysis. dling large-scale sets of combinations. Us- ing ZBDDs, we can implicitly enumerate Recently, we proposed a fast algorithm[8] combinatorial item set data and efficiently of extracting all frequent item set pat- compute set operations over the ZBDDs. terns from transaction databases, and si- multaneously indexing the result of huge In this paper, we present an interesting patterns on the computer memory using variation of ZBDD-growth algorithm to

145 generate frequent closed item sets. Closed item sets are the subset of item set pat- terns each of which is the unique represen- tative for a group of sub-patterns relevant to the same set of transaction records. Our method is a quite simple modifica- tion of ZBDD-growth. We inserted sev- eral operations in the recursive procedure of ZBDD-growth, to filter the closed pat- terns from all frequent patterns. The Figure 1: A Boolean function and a com- experimental result shows that the addi- binatorial item set. tional computation cost is relatively small compared with the original algorithm for generating all patterns. Our method can conveniently be utilized in the environ- ment of ZBDD-based data mining and knowledge indexing.

2. ZBDD-based item set rep- resentation As the preliminary section, we describe the methods for efficiently indexing item Figure 2: An example of ZBDD. set data based on Zero-suppressed BDDs.

2.1. Combinatorial item set can enjoy more efficient manipulation us- and ZBDDs ing “Zero-suppressed BDDs” (ZBDD)[7], A combinatorial item set consists of the which are special type of BDDs optimized elements each of which is a combination for handling combinatorial item sets. An of a number of items. There are 2n com- example of ZBDD is shown in Fig. 2. binations chosen from n items, so we have The detailed techniques of ZBDD manip- n 22 variations of combinatorial item sets. ulation are described in the articles[7]. A For example, for a domain of five items typical ZBDD package supports cofactor- a, b, c, d, and e, we can show examples of ing operations to traverse 0-edge or 1- combinatorial item sets as: edge, and binary operations between two {ab, e}, {abc, cde, bd, acde, e}, {1,cd},0. combinatorial item sets, such as union, in- Here “1” denotes a combination of null tersection, and difference. The computa- items, and 0 means an empty set. Com- tion time for each operation is almost lin- binatorial item sets are one of the ba- ear to the number of ZBDD nodes related sic data structure for various problems in to the operation. computer science, including data mining. A combinatorial item set can be mapped 2.2. Tuple-Histograms and into Boolean space of n input variables. ZBDD vectors For example, Fig. 1 shows a truth table A Tuple-histogram is the table for count- of Boolean function: F =(abc) ∨ (bc), ing the number of appearance of each tu- but also represents a combinatorial item ple in the given database. An example of set S = {ab, ac, c}. Using BDDs for tuple-histogram is shown in Fig. 3. This is the corresponding Boolean functions, we just a compressed table of the database to can implicitly represent and manipulate combine the same tuples appearing more combinatorial item set. In addition, we than once into one line with the frequency.

146 by a simple ZBDD. The three ZBDDs are shared their sub-graphs each other. Now we explain the procedure for con- structing a ZBDD-based tuple-histogram from original database. We read a tuple data one by one from the database, and accumulate the single tuple data to the histogram. More concretely, we generate a ZBDD of T for a single tuple picked up from the database, and accumulate it to Figure 3: Example of tuple-histogram. the ZBDD vector. The ZBDD of T can be obtained by starting from “1” (a null- combination), and applying “Change” op- erations several times to join the items in the tuple. Next, we compare T and F0, and if they have no common parts, we just add T to F0.IfF0 already contains T ,we eliminate T from F0 and carry up T to F1. This ripple carry procedure contin- ues until T and Fk have no common part. After finishing accumulations for all data Figure 4: ZBDD vector for tuple- records, the tuple-histogram is completed. histogram. Using the notation F .add(T ) for addition of a tuple T to the ZBDD vector F , we de- scribe the procedure of generating tuple- Our item set mining algorithm manipu- histogram H for given database D. lates ZBDD-based tuple-histogram repre- H = 0 sentation as the internal data structure. forall T ∈ D do Here we describe how to represent tuple- H = H.add(T ) histograms using ZBDDs. Since ZBDDs return H are representation of sets of combina- When we construct a ZBDD vector of tions, a simple ZBDD distinguishes only tuple-histogram, the number of ZBDD existence of each tuple in the database. nodes in each digit is bounded by total In order to represent the numbers of appearance of items in all tuples. If there tuple’s appearances, we decompose the are many partially similar tuples in the number into m-digits of ZBDD vector database, the sub-graphs of ZBDDs are {F ,F ,...,F − } to represent integers 0 1 m 1 shared very well, and compact representa- up to (2m−1), as shown in Fig. 4. Namely, tion is obtained. The bit-width of ZBDD we encode the appearance numbers into vector is bounded by log S , where S binary digital code, as F represents a set max max 0 is the appearance of most frequent items. of tuples appearing odd times (LSB = 1), F1 represents a set of tuples whose ap- Once we have generated a ZBDD vec- pearance number’s second lowest bit is 1, tor for the tuple-histogram, various op- and similar way we define the set of each erations can be executed efficiently. Here digit up to Fm−1. are the instances of operations used in our pattern mining algorithm. In the example of Fig. 4, The tuple frequencies are decomposed as: F0 = • H.factor0(v): Extracts sub- {abc, ab, c}, F1 = {ab, bc}, F2 = {abc}, histogram of tuples without item and then each digit can be represented v.

147 • H.factor1(v): Extracts sub- ZBDDgrowth(H, α) { histogram of tuples including if(H has only one item v) item v and then delete v from the if(v appears more than α ) tuple combinations. (also considered return v ; as the quotient of H/v) else return “0” ; F ← Cache(H); • v · H: Attaches an item v on each if(F exists) return F ; tuple combinations in the histogram v ← H.top ; /* Top item in H */ ← F . H1 H.factor1(v); H0 ← H.factor0(v); ← • H + H : Generates a new tuple- F1 ZBDDgrowth(H1,α); 1 2 ← histogram with sum of the frequen- F0 ZBDDgrowth(H0 + H1,α); F ← (v · F1) ∪ F0 ; cies of corresponding tuples. Cache(H) ← F ; return F ; • H.tuplecount: The number of tuples } appearing at least once. Figure 5: ZBDD-growth algorithm. These operations can be composed as a sequence of ZBDD operations. The result is also compactly represented by a ZBDD This procedure may require an exponen- vector. The computation time is bounded tial number of recursive calls, however, by roughly linear to total ZBDD sizes. we prepare a hash-based cache to store the result of each recursive call. Each en- 3. ZBDD-growth Algorithm try in the cache is formed as pair (H, F), where H is the pointer to the ZBDD vec- Recently, we developed a ZBDD-based F algorithm[8], ZBDD-growth, to generate tor for a given tuple-histogram, and is the pointer to the result of ZBDD. On “all” frequent item set patterns. Here we each recursive call, we check the cache describe this algorithm as the basis of our to see whether the same histogram H method for “closed” item set mining. has already appeared, and if so, we can ZBDD-growth is based on a recursive avoid duplicate processing and return the depth-first search over the ZBDD-based pointer to F directly. By using this tech- tuple-histogram representation. The ba- nique, the computation time becomes al- sic algorithm is shown in Fig. 5. most linear to the total ZBDD sizes. In this algorithm, we choose an item v In our implementation, we use some sim- used in the tuple-histogram H, and com- ple techniques to prune the search space. pute the two sub-histograms H1 and H0. For example, if H1 and H0 are equiva- (Namely, H =(v · H1) ∪ H0.) As v is lent, we may skip to compute F0.For the top item in the ZBDD vector, H1 and another case, we can stop the recursive H0 can be obtained just by referring the calls if total frequencies in H is no more 1-edge and 0-edge of the highest ZBDD- than α. There are some other elaborate node, so the computation time is constant pruning techniques, but they needs addi- for each digit of ZBDD. tional computation cost for checking the The algorithm consists of the two recur- conditions, so sometimes effective but not sive calls, one of which computes the sub- always. set of patterns including v, and the other computes the patterns excluding v. The 4. Frequent closed item set two subsets of patterns can be obtained mining as a pair of pointers to ZBDDs, and then the final result of ZBDD is computed. In frequent item set mining, we sometimes

148 faced with the problem that a huge num- 4.2. Eliminating non-closed ber of frequent patterns are extracted and patterns hard to find useful information. Closed item set mining is one of the techniques Our method is a quite simple modifica- to filter important subset of patterns. In tion of ZBDD-growth shown in Fig. 5. this section, we present a variation of We inserted several operations in the re- ZBDD-growth algorithm to generate fre- cursive procedure of ZBDD-growth, to fil- quent closed item sets. ter the closed patterns from all frequent patterns. The ZBDD-growth algorithm 4.1. Closed item sets is staring from the given tuple-histogram H, and compute the two sub-histograms Closed item sets are the subset of item H1 and H0, such that H =(v · H1) ∪ set patterns each of which is the unique H0. Then ZBDD-growth(H1) and ZBDD- representative for a group of sub-patterns growth(H1 + H0) is recursively executed. relevant to the same set of tuples. For Here, we consider the way to elimi- more clear definition, we first define the nate non-closed patterns in this algo- common item set Com(ST ) for the given rithm. We call the new algorithm ZBDD- set of tuples ST , such that Com(ST ) is the growthC(H). It is obvious that (v· set of items commonly included in every ZBDD-growthC(H ) ) generates (a part ∈ occurence 1 tuple T ST . Next, we define of) closed patterns for H each of which Occ(D, X) for the given database D and includes v, because the occurrence of any item set X, such that Occ(D, X) is the closed pattern with v is limited in (v · subset of tuples in D, each of which in- H1), thus we may search only for H1. cludes X. Using these notations, if an Next, we consider the second recursive item set X satisfies Com(Occ(D, X)) = call ZBDD-growthC(H1 + H0) to gener- X, we call X is a closed item set in D. ate the closed patterns without v. Impor- For example, let us consider the database tant point is that some of patterns gen- D as shown in Fig. 3. Here, all erated by ZBDD-growthC(H1 + H0)may item set patterns with threshold α = have the same occurrence as one of the 1 is: {abc, ab, ac, a, bc, b, c}, but closed pattern with v already found in H1. The item sets are: {abc, ab, bc, b, c}.In condition of such duplicate pattern is that this example, “ac” is eliminated from a it appears only in H1 but irrelevant to H0. closed pattern because Occ(D,“ac”) = In other words, we eliminate the patterns Occ(D,“abc”). from ZBDD-growthC(H1 + H0) such that the patterns are already found in ZBDD- In recent years, many researchers discuss growthC(H ) but not included in any tu- the efficient algorithms for closed item set 1 ples in H . mining. One of the remarkable result 0 is LCM algorithm[10] presented by Uno For checking the condition for closed pat- et. al. LCM is a depth-first search al- terns, we can use a ZBDD-based opera- gorithm to extract closed item sets. It tion, called permit operation by Okuno features that the computation time is et al.[9].1 P .permit(Q) returns a set of bounded by linear to the output data combinations in P each of which is a length. Our ZBDD-based algorithm is subset of some combinations in Q.For also based on a depth-first search man- example, when P = {ab, abc, bcd} and ner, so, it has similar properties as LCM. Q = {abc, bc}, then P .permit(Q) returns The major difference is in the data struc- {ab, abc}. The permit operation is effi- ture of output data. Our method gener- ciently implemented as a recursive proce- ates ZBDDs for the set of closed patterns, 1Permit operation is basically same as SubSet ready to go for more flexible analysis us- operation by Coudert et al.[3], defined for ordi- ing ZBDD operations. nary BDDs.

149 P .permit(Q) ZBDDgrowthC(H, α) { { if(P =“0” or Q =“0”) return “0” ; if(H has only one item v) if(P = Q) return F ; if(v appears more than α ) if(P =“1”) return “1” ; return v ; if(Q =“1”) else return “0” ; if(P include “1” ) return “1” ; F ← Cache(H); else return “0” ; if(F exists) return F ; R ← Cache(P,Q); v ← H.top ; /* Top item in H */ if(R exists) return R ; H1 ← H.factor1(v); v ←TopItem(P,Q) ; /* Top item in P,Q */ H0 ← H.factor0(v); ← (P0,P1) ←factors of P by v ; F1 ZBDDgrowthC(H1,α); ← (Q0,Q1) ←factors of Q by v ; F0 ZBDDgrowthC(H0 + H1,α); R ← (v · P1.permit(Q1)) F ← (v · F1) ∪ ∪ (P0.permit(Q0 ∪ Q1)) ; (F0 − (F1 − F1.permit(H0))) ; Cache(P,Q) ← R ; Cache(H) ← F ; return R ; return F ; } }

Figure 6: Permit operation. Figure 7: ZBDD-growthC algorithm.

dure of ZBDD manipulation, as shown in histograms can be constructed for all in- Fig. 6. The computation time of permit stances in a feasible time and space. The operation is almost linear to the ZBDD ZBDD sizes are almost same or less than size. total|T |. Finally, we describe the ZBDD-growthC After generating ZBDD vectors for the algorithm using the permit operation, as tuple-histograms, we applied ZBDD- shown in Fig. 7. The difference from the growth algorithm to generate frequent original algorithm is only one line, written patterns. Table 2 show the results of in the frame box. the original ZBDD-growth algorithm[8] for the selected benchmark examples, 5. Experimental Results “mushroom,” “T10I4D100K,” and Here we show the experimental results to “BMS-WebView-1.” The execution time evaluate our new method. We used a includes the time for generating the initial Pentium-4 PC, 800MHz, 1.5GB of main ZBDD vectors for tuple-histograms. memory, with SuSE Linux 9. We can deal The results shows that the ZBDD size is with up to 40,000,000 nodes of ZBDDs in exponentially smaller than the number of this machine. patterns for “mushroom.” This is a sig- Table 1 shows the time and space nificant effect of using the ZBDD data for generating ZBDD vectors of tuple- structure. On the other hand, no remark- histograms for the FIMI2003 benchmark able reduction is seen in ”T10I4D100K.” databases[5]. This table shows the com- ”T10I4D100K” is known as an artificial putation time and space for providing in- database, consists of randomly generated put data for ZBDD-growth algorithm. In combinations, so there are almost no re- this table, #T shows the number of tu- lationship between the tuples. In such ples, total|T | is the total of tuple sizes cases, ZBDD nodes cannot be shared well, (total appearances of items), and |ZBDD| and only the overhead factor is revealed. is the number of ZBDD nodes for the For the third example, “BMS-WebView- tuple-histograms. We can see that tuple- 1,” the ZBDD size is almost linear to the

150 Table 1: Generation of tuple-histograms[8]. Data name #T total|T | |ZBDD Vector| Time(s) T10I4D100K 100,000 1,010,228 552,429 43.2 T40I10D100K 100,000 3,960,507 3,396,395 150.2 chess 3,196 118,252 40,028 1.4 connect 67,557 2,904,951 309,075 58.1 mushroom 8,124 186,852 8,006 1.2 pumsb 49,046 3,629,404 1,750,883 188.5 pumsb star 49,046 2,475,947 1,324,502 123.6 BMS-POS 515,597 3,367,020 1,350,970 895.0 BMS-WebView-1 59,602 149,639 46,148 18.3 BMS-WebView-2 77,512 358,278 198,471 138.0 accidents 340,183 11,500,870 3,877,333 107.0

Table 2: Result of the original ZBDD-growth[8]. Data name: #Frequent (output) Time(sec) Min. freq. α patterns |ZBDD| mushroom: 5,000 41 11 1.2 1,000 123,277 1,417 3.7 200 18,094,821 12,340 9.7 50 198,169,865 36,652 10.2 16 1,176,182,553 53,804 7.7 4 3,786,792,695 59,970 4.3 1 5,574,930,437 40,557 1.8 T10I4D100K: 5,000 10 10 81.3 1,000 385 382 135.5 200 13,255 4,288 279.4 50 53,385 20,364 408.7 16 175,915 89,423 543.3 4 3,159,067 1,108,723 646.0 BMS-WebView1: 1,000 31 31 27.8 200 372 309 31.3 50 8,191 3,753 49.0 40 48,543 12,176 46.6 36 461,521 34,790 102.4 35 1,177,607 47,457 111.4 34 4,849,465 64,601 120.8 33 69,417,073 80,604 130.0 32 1,531,980,297 97,692 133.7 31 8,796,564,756,112 117,101 138.1 30 35,349,566,550,691 152,431 143.9 number of patterns when the output size but some additional factor is observed for is small, however, an exponential factor “T10I4D100K.” Anyway, filtering closed of reduction is observed for the cases of item sets has been regarded as not a generating huge patterns. easy task. We can say that the ZBDD- growthC algorithm can generate closed Next, we show the experimental results item sets with a relatively small ad- of frequent closed pattern mining us- ditional cost from the original ZBDD- ing ZBDD-growthC algorithm. In Ta- growth. ble 3, we show the results for the same examples as used in the experiment of 6. Conclusion the original ZBDD-growth. The last column T ime(closed)/T ime(all) shows the In this paper, we presented an interest- ratio of computation time between the ing variation of ZBDD-growth algorithm ZBDD-growthC and the original ZBDD- to generate frequent closed item sets. Our growth algorithm. We can observe method is a quite simple modification that the computation time is almost the of ZBDD-growth. We inserted several same order as the original algorithms operations in the recursive procedure of for “mushroom” and “BMS-WebView-1,” ZBDD-growth, to filter the closed pat-

151 Table 3: Results of ZBDD-based closed pattern mining. Data name: #Freq. (output) ZBDD- T ime(closed) Min. freq. α closed |ZBDD| growthC /T ime(all) patterns Time(s) mushroom: 5,000 16 16 1.2 1.00 1,000 3,427 1,660 3.8 1.02 200 26,968 9,826 9.9 1.02 50 68,468 19,054 13.0 1.27 16 124,411 24,841 13.3 1.73 4 203,882 26,325 13.2 3.06 1 238,709 20,392 12.9 7.19 T10I4D100K: 5,000 10 10 104.8 1.29 1,000 385 382 208.1 1.54 200 13,108 4,312 2713.6 9.71 50 46,993 20,581 4600.1 11.25 16 142,520 89,185 5798.5 10.67 4 1,023,614 691,154 18573.0 28.75 BMS-WebView-1: 1,000 31 31 30.1 1.08 200 372 309 36.8 1.18 50 7,811 3,796 71.9 1.47 40 29,489 11,748 111.4 2.39 36 64,762 25,117 153.7 1.50 35 76,260 30,011 169.2 1.52 34 87,982 35,392 186.5 1.54 33 99,696 40,915 207.7 1.60 32 110,800 46,424 221.7 1.66 31 120,190 51,369 247.7 1.79 30 127,131 55,407 271.5 1.89

terns from all frequent patterns. The mization, in Proc. of 30th ACM/IEEE experimental result shows that the addi- Design Automation Conference, pp. 625-630, 1993. tional computation cost is relatively small compared with the original algorithm for [4] B. Goethals, “Survey on Frequent Pattern Mining”, Manuscript, generating all patterns. 2003. http://www.cs.helsinki.fi/ A ZBDD can be regarded as a compressed u/goethals/publications/survey.ps trie for representing a set of patterns. [5] B. Goethals, M. Javeed Zaki (Eds.), Frequent Itemset Mining Dataset ZBDD-based method will be useful as a Repository, Frequent Itemset Mining fundamental technique for database anal- Implementations (FIMI’03), 2003. ysis and knowledge indexing, and will be http://fimi.cs.helsinki.fi/data/ utilized for various data mining applica- [6] J. Han, J. Pei, Y. Yin, R. Mao, Mining tions. Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree References Approach, Data Mining and Knowledge Discovery, 8(1), 53–87, 2004. [1] R. Agrawal, T. Imielinski, and [7] S. Minato: Zero-suppressed BDDs for A. N. Swami, Mining Association set manipulation in combinatorial prob- rules between sets of items in large lems, In Proc. 30th ACM/IEEE Design databases, In P. Buneman and S. Ja- Automation Conf. (DAC-93), (1993), jodia, edtors, Proc. of the 1993 ACM 272–277. SIGMOD International Conference on Management of Data, Vol. 22(2) of [8] S. Minato, H. Arimura: ZBDD-growth: SIGMOD Record, pp. 207–216, ACM An Efficient Method for Frequent Press, 1993. Pattern Mining and Knowledge In- dexing, Hokkaido University, Division [2] Bryant, R. E., Graph-based algo- of Computer Science, TCS Technical rithms for Boolean function manipula- Reports, TCS-TR-A-06-12, Apr. 2006. tion, IEEE Trans. Comput., C-35, 8 http://www-alg.ist.hokudai.ac.jp/ (1986), 677–691. tra.html [3] O. Coudert, J. C. Madre, H. Fraisse, A [9] H. Okuno, S. Minato, and H. Isozaki: new viewpoint on two-level logic mini- On the Properties of Combination Set

152 Operations, Information Procssing Let- ters, Elsevier, 66 (1998), pp. 195-199, 1998. [10] T. Uno, Y. Uchida, T. Asai, and H. Arimura: “LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets,” Proc. Workshop on Frequent Itemset Mining Implementa- tions (FIMI’03), Mohammed J. Zaki and Bart Goethals (eds.), 2003. http://fimi.cs.helsinki.fi/fimi03/ [11] M. J. Zaki, Scalable Algorithms for As- sociation Mining, IEEE Trans. Knowl. Data Eng. 12(2), 372–390, 2000.

153 Data Mining in Amino Acid Sequences of H3N2 Influenza Viruses Isolated during 1968 to 2006

Kimihito Ito, Manabu Igarashi, Ayato Takada Hokkaido University Research Center for Zoonosis Control Sapporo 060-0818, Japan

Abstract The hemagglutinin (HA) of influenza viruses undergoes antigenic drift to escape from antibody-mediated immune pressure. In order to predict possible structural changes of the HA molecules in future, it is important to understand the patterns of amino acid mutations and structural changes in the past. We performed a retrospective and comprehensive analysis of structural changes in H3 hemagglutinins of human influenza viruses isolated during 1968 to 2006. Amino acid sequence data of more than 2000 strains have been collected from NCBI Influenza virus resources. Information theoretic analysis of the collected sequences revealed a number of simultaneous mutations of amino acids at two or more different positions (correlated mutations). We also calculated the net charge of the HA1 subunit, based on the number of charged amino acid residues, and found that the net charge increased linearly from 1968 to 1984 and, after then, has been saturated. This level of the net charge may be an upper limit for H3 HA to be functional. It is noted that ”correlated mutations” with the conversion of acidic and basic amino acid residues between two different positions were frequently found after 1984, suggesting that these mutations contributed to counterbalancing effect to keep the net charge of the HA . These approaches may open the way to find the direction of future antigenic drift of influenza viruses. 1. Introduction ods that include information theoretic anal- ysis to find patterns of amino acids substi- Influenza A virus causes highly contagious, tutions, molecular modeling to reveal the acute respiratory illness, and continues to be changes in 3D structure of antigenically dif- a major cause of morbidity and mortality. ferent HAs, and molecular simulation to in- The viral strains mutates from year to year, vestigate the antigen-antibody interaction. causing the annual epidemic world wide. The hemagglutinin (HA) is the major sur- We performed a retrospective and compre- face glycoprotein of influenza viruses and hensive analysis of structural changes in plays an important role in virus entry into H3 HAs of human influenza viruses iso- host cells. The HA undergoes antigenic lated during 1968 to 2006. Amino acid se- drifts which occur by accumulation of a se- quence data of more than 2000 strains have ries of amino acid substitutions. Influenza been collected from NCBI Influenza virus viruses that escape from antibody-mediated resources and used for the analysis. These immune pressure of human population ac- approaches may open the way to find the di- quire a new antigenic structure and contin- rection of future antigenic drift of influenza uously cause epidemic in the world. viruses. In order to predict possible structural 2. Information Theoretic Analysis changes of the HA molecules in future, it is important to understand the patterns of anti- of the Past HA Sequences genic drift of influenza viruses in the past. The entropy and mutual information are We have been studying computational meth- used to find simultaneous mutations of

154 amino acids at two or more different posi- position 1 QQQQQQ tions (correlated mutations). The entropy position 2 QQQIIV of an amino acid position represents the un- position 3 DDDNNK certainty of the amino acids in the position. position 4 QIQQIQ The amino acid positions with frequent mu- The Mutual Information of among position tations are expected to have higher entropy. 1, 2, 3 include The mutual information between two amino acid positions represents the average reduc- tion in uncertainty about one amino acid po- I(X1; X2) = 0.00, sition that results from learning the amino I(X2; X3) = 1.45, acid of another position. The pairs of amino I(X ; X ) = 0.12. acid positions that tend to be involved in 3 4 correlated mutations are expected to have higher mutual information values. 3. Correlated Mutations Found by the Analysis Definition 1 Entropy: The entropy of X is defined to be the average Shannon informa- X Y I(X; Y ) tion of an outcome. 144 156 1.0687064407 ∑ 1 156 276 1.0606584462 H(X) = P (x)log 62 158 1.0568431741 x∈Ax P (x) 135 262 1.0462450293 156 158 0.9904144166 Example 1 Suppose we have 6 protein se- 196 276 0.9548333157 quences in which the amino acid of positions 62 276 0.9544171044 1, 2, 3 are the following. 62 156 0.9440218619 position 1 DDDDDD 158 276 0.9313234350 position 2 QQQQQI 156 196 0.9269716745 position 3 DDIINV The entropy of position 1,2, and 3 are Table 1: The top 10 pairs of amino acid po- H(X ) = 0.00, sitions in HA1 subunit that have high mutual 1 information values. H(X2) = 0.65, 2183 amino acid sequences of HA1s of hu- H(X3) = 1.91, man influenza viruses have been collected respectively. from NCBI Influenza virus resources. Mu- tual information for every pair of two amino acid positions in the HA1 subunit is calcu- Definition 2 Mutual Information: The mu- lated. Table 1 shows the top 10 pairs that tual information between X and Y is defined have high mutual information values. The to be the average reduction in uncertainty analysis founds a number of the correlated about x by knowing the value of y. mutations in amino acid positions of the I(X; Y ) = H(X) − H(X|Y ) HA1, and three of them are shown in Figure 1. ∑ | 1 The 3D structure model in Figure 2 depicts H(X Y ) = P (x, y)log | xy∈AxAy P (x y) the location of amino acid positions that ap- pear in the pairs in Table 1. Example 2 Suppose we have 6 protein se- The correlated mutation between amino acid quences in which the aminos acid of posi- position 163 and 248, which have mutual in- tions 1, 2, 3 are the following. formation 0.61 during 1968 to 1989, resulted

155 Figure 1: Correlated mutations found by156 the analysis. Series of amino acid changes in the positions 144-156(left), 62-158(middle), and 163-248(right) are shown.

¤ ¦

¨ § ¨ ¦

¡ ¢ £

¤ ¥ ¦ §

 

¤ ¥ § ¨

    

   ¢    

¤ ¥ § ©

 

   ¡

¤ ¥ § ¥

 

Figure 3: The correlated mutation among amino acids at 163, 246, and 248.(a) The 3D structure model of an HA. The locations of Figure 2: The 3D structure model of an HA the amino acids at 163, 246, and 248 are in- molecule. The locations of amino acid posi- dicated. (b)The amino acids in the glycosy- tions that are involved in the correlated mu- lation site during 1968 to 1989. tations in Table 1 are highlighted. The com- monly used definitions of the five antigenic since 1963. This result suggests that the sites (A-E) are shown in different colors. mutations that increased the charge of HA molecule were required for the adaptation from the addition of an N-linked glycosyla- of avian virus to human population. Af- tion site. The acquisition of new oligosac- ter 1984, the increase in the net charge charide chains is known to be significant in human virus HA has been saturated. for the antigenic drift of HA. Figure 3(a)(b) This level of the net charge might be an shows that the oligosaccharide chain that is upper limit for H3 HA to be functional. added at position 246 produced a steric hin- Correlated mutations that exchange the drance which likely required the substitution charges among two or more different of valine at position 163 with the smaller amino acid positions were frequently found amino acid alanine. after 1984. These co-mutations include E82K/K83E(1989), N145K/G135E(1991), 4. Retrospective Net Charge Anal- E135K&E156K/S133D&K145N&R189S- ysis of Hemagglutinins (1993), D124G/G172D&R197Q(1995), N145K/K135T(1996), E158K&N276K/- To study the change of the net charge of K62E&K156Q(1998), D271N/K92T(2000), HAs, which may be associated with aniti- T92K/N271D(2001), E83K&D144N&- genic properties, we have also analyzed the W222R/R50G&N126D&G225D(2003), HA1 subunits of influenza viruses isolated and D126N/K145N(2004). It suggests that from human, swine, and avian hosts. these co-mutations contributed to counter The net charge at neutral pH are calculated, balancing effect to keep the net charge of assuming that glutamic and aspartic acid the HA. have -1 charge, and arginine and lysin For further investigation, we performed ho- have +1 charge at this pH. Figure 4 shows mology modeling to build 3D structure mod- that the net charge of HA1s increased els of HAs of antigenically different in- linearly from 1968 to 1984, while avian fluenza viruses (Figure 5). We found that virus HAs have retained their net charge antigenic drifts with frequent substitutions

157 been made by co-mutations since 1984 and these co-mutations might contribute to keep the net charge constant in HA molecules. References

[1] Atchley, W.R., Wollenberg, K.R., Fitch, W.M., Terhalle, W., Dress, A.W.: Correlations Among Amino Acid Sites in bHLH Protein Domains: An Information Theoretic Analysis, Molecular Biology and Evolution 17,pp.164-178 (2000)

Figure 4: The net charges of H3 HA1 of in- [2] Bush, R.M., Bender ,C.A., Sub- fluenza viruses isolated from human, swine, barao, K., Cox, N.J., Fitch, W.M.: and avian. Predicting the evolution of human influenza A, Science, 286(5446),pp.1921-1925.(1999)

[3] Ghedin, E., Sengamalay, N.A., Shumway, M., Zaborsky, J., Feld- blyum, T., Subbu, V., Spiro, D.J., Sitz, J., Koo, H., Bolotov, P., Der- novoy, D., Tatusova, T., Bao, Y., St, George, K., Taylor, J., Lip- man, D.J., Fraser, C.M., Tauben- berger, J.K., Salzberg, S.L.: Large- scale sequencing of human in- fluenza reveals the dynamic nature of viral genome evolution, Nature, Figure 5: 3D structure model of three anti- 437(7062),pp.1162-1166 (2005) genically different HAs of human H3N2 viruses. The electrostatic potential on the [4] Martin LC, Gloor GB, Dunn SD, surface of HAs are shown. Wahl LM. Using information theory to search for co-evolving residues in pro- of charged amino acids resulted in the teins. Bioinformatics, 21(22),pp.4116- change of the charge distribution on the HA 4124, (2005) surface.

5. Conclusion A number of correlated mutations in HAs are found by analyzing large scale sequence data of influenza viruses. A retrospective net charge analysis shows the increase of the net charge in HAs of human H3N2 viruses. It might be required for avian influenza viruses to increase the positive charge in order to adapt to human population. There seems to be an upper bound on possible net charge of hemagglutinin. Charge compensations have

158 The Links Between GIScience and Information Science

By Nigel Waters Department of Geography University of Calgary 2500 University Dr. NW Calgary, Alberta, Canada T2N 1N4 e-mail; [email protected] Tel: 403-220-5367 Fax: 403-282-6561

Introduction This paper explores the links between GIScience and Information Science and in so doing sug- gests areas where future collaboration may be both warranted and productive.

1. The Rise of a Geographic An argument was made for a new disci- Information Science pline of Geographic Information Sci- ence, or GIScience as it was soon re- Geographic Information Systems (GIS) named. This new discipline was promul- are a relatively new technology. The gated in a paper by Goodchild [6] and term was first used in 1968 by Roger leading GIS academic journals (includ- Tomlinson [3]. Although the market for ing the one in which the paper was pub- GIS technology has been growing stead- lished) began renaming themselves with ily estimates of the size of the industry Science replacing Systems. It was ar- vary greatly. One recent report from in- gued by Goodchild that information sci- dustry assessment specialists, Daratech ence studies issues relating to the crea- (http://www.daratech.com/press/2004/- tion, handling, storage, analysis and use 041019/ ), suggested that the core reve- of information. Similarly Geographic nue of the GIS industry in 2003 was Information Science, Goodchild sug- USD $1.84 billion and that it had grown gested, should study those issues that at a rate of 5.1% over 2002. At the end arise from the creation, handling, stor- of 2004 the core revenue was estimated age, analysis and use of geographic or to be USD $2.02 with a corresponding spatial information. annual growth rate of 9.7%. Core busi- ness revenue was stated to include soft- So the links are there, to a certain extent. ware (64%), hardware (4% and declin- With respect to Geographic Information ing), services (24%) and data products Systems we may ask: how were the links (8%). By any measure, in the early years exploited? With respect to Geographic of this decade, the GIS industry was do- Information Science let us ask: how will ing well. the links be exploited? Yet in the early 1990s many academics had been less than satisfied with the cur- rent state-of-affairs in GIS research.

159 2. The Links Between Geo- 6. Define the spatial properties of your datasets. graphic Information Sys- 7. Propose a geodatabase design. tems and Information Science “Physical Design”: 8. Implement, prototype, review and The early approach to GISystems im- refine your design. plemented by Environmental Systems 9. Design workflows for building and Research Institute (ESRI) in their pio- maintaining each layer. neering ArcInfo software was to have 10. Document your design using appro- one solution for handling the spatial data priate methods.” [1] (Arc) and a second solution for handling the attribute data (the proprietary Info Under step 7 Arctur and Zeiler suggest database) in their GISystem software. studying existing design for examples The Info part used traditional database and ESRI provides a website that in- technology that was imported directly cludes a collection of downloadable case from Information Science. Originally studies that provide detailed geodatabase ESRI used their own database technol- models (http://support.esri.com/index.- ogy but links with commercial relational cfm?fa=downloads.dataModels.caseStud database technology soon followed. ies ). Towards the end of the 1990s the rela- Most areas of application have their own tional database approach was supple- distinct challenges and issues. Two of mented by object-oriented solutions particular note are the transportation and made available in the ArcGIS generation atmospheric data models. Problems with of products. The current state of the art is the transportation data model arose even described in Designing Geodatabases: in the era of the use of relational data- Case Studies in GIS Data Modeling by base models. Two issues of concern Arctur and Zeiler [1]. They identify ten were the need to develop linear referenc- steps in designing and building an ex- ing systems and the need for the ability emplary geodatabase: to change data attributes, for example speed limits, along a single link in the “Conceptual Design”: network. The latter problem was re- 1. Identify the information products solved with the introduction of dynamic that will be produced with your GIS. segmentation approaches ([17, 18], see 2. Identify the key thematic layers also the ArcGIS Transportation Data based on your information require- Model for New York State available at ments. http://support.esri.com/index.cfm?fa=do 3. Specify the scale ranges and spatial wnloads.dataModels.caseStudies ). representation for each thematic One particular challenge remains and layer. this is the handling of continuously vary- 4. Group representation into datasets. ing data. This is a problem along trans- portation networks and makes for diffi- “Logical design”: culties in calculating the spatial autocor- 5. Define the tabular database structure relation indices of, for example, accident and behavior for descriptive attrib- data [2]. It is a problem for data that it is utes. distributed over surfaces and even more so for continuously varying data in three

160 or even four dimensions, for example resources using web service technologies atmospheric data. This is a problem that (the autonomic approach) or they can be is still not handled well as a data model defined by users (the ad hoc approach). in GIS (although the analytical tools are Noting that there is extensive literature there in abundance). These concerns exploring the former, the autonomic ap- were noted by Mike Goodchild in a Key- proach, he focuses on the latter, ad hoc note Address to the Annual GEOIDE methodologies. He thus advocates the Conference held in Banff June 1st, 2006. use of meme-media technologies to ex- The problem is addressed from the per- tract intelligent resources from web spective of scientific analysis of atmos- documents and applications. The ap- pheric data by Nativi et al. [10]). They proach is similar to a structured query to note the difficulty in integrating atmos- a relational database but in this case the pheric data within a GIS framework and query is defined by HTML pathway ex- suggest interim remedies in their article. pressions to Web documents and/or ap- plications. 3. Selected Issues in Infor- It will be seen below that there are many mation Science broad areas of application in a GIS envi- Recently Tanaka [15] has described ronment where Professor Tanaka’s ap- some of the challenges facing Informa- proach will provide new research oppor- tion Science with respect to developing a tunities. knowledge federation over the Web. He Spyratos [14] has presented a functional begins by noting that the Web may be model for the analysis of large volumes seen as a depository of distributed intel- of transactional data. The model pre- ligent resources. These resources include sented used a data schema with an traditional server-client applications as acyclic graph and a single root. An ex- well as mobile applications. Tanaka con- ample was provided for product pur- tinues by defining pervasive computing chases from a store. Since the store is as an open system of intelligent re- located in geographical space the exam- sources that a user can interact with in a ple has obvious implications for GIS seamless interoperable environment. analysis in the field of Business Pervasive computing is assumed to link Geographics [9]. Geographers refer to a wide variety of intelligent resources this as panel data. that are distributed in both virtual envi- ronments (the Web) and the geographi- Professor Tanaka and his colleagues at cal environment of the real world. the Meme Media Laboratory have de- signed an interface to the Digital Li- The ad hoc definition and implementa- brary. The metaphor that has been used tion of pervasive computer in either or is that of a City. So that the user can both environments is defined as a federa- move through a network of streets and tion. The intelligent resources that form then into a building, a room and then, part of this definition are assumed to be finally, remove a book from a virtual in the form of documents. Suggested shelf. Additional challenges have in- uses include scientific simulations, digi- cluded providing a cell phone interface tal libraries and research activities. Ta- to the library. naka [15] continues by noting that a fed- eration of intelligent resources may be Geographers have considered similar defined by programs that access such problems in trying to build digital librar-

161 ies of maps (http://www.alexandria.- 4. The Nexus of GIS and IS ucsb.edu/ ). The digital map library pro- The use of Tanaka’s approach to the ac- vides additional challenges in storing, quisition of documents over the Web has referencing and accessing the spatial in- obvious implications for our Mapping formation. Tanaka [15] addressed prob- the Media in the Americas (MMA) Pro- lems with which geographers are also ject ([18] and www.mediamap.info for a concerned. For example, he showed a complete description). The MMA Pro- movie in which fish were linked to a ject is a collaboration between the Uni- map. We have similarly experimented versity of Calgary, the Carter center in with the use of case-based reasoning Atlanta and the Canadian Foundation for software to link the resources of indige- the Americas (FOCAL) in Ottawa. The nous peoples in northern Canada with a three-year project began in September GIS and map based information [4]. 2004 and will conclude in August, 2007. Similarly we have developed Visual Ba- The Project seeks to portray and explain sic templates to input indigenous knowl- the impact of the media (TV, radio and edge into a GIS. newspapers) on the electoral process in Algorithms for searching the web for 12 countries in the Americas. To-date we documents are a major area of research have completed research on Canada and in information science. Several papers at three Latin American countries (Peru, the Third International Symposium con- Guatemala, and the Dominican Repub- centrated on these technologies. These lic). procedures are of interest to GIScientists The research process involves scouring because they too need faster, more effi- the internet for spatial and aspatial data cient web search algorithms and because on socio-economic variables for the they use similar algorithms for the countries question. We also need to map analysis of spatial and socio-economic the locations of TV and radio antenna data. Examples of new Web Document and the distribution points for newspa- search algorithms are provided by La- pers. Spatial data on the results of the monova and Tanaka [8] and Poland and most recent national or federal elections Zeugmann [11]. Both papers describe are obtained. Although the research approaches that are of great interest to process usually begins with a visit to the GIScientists. For example, similar fuzzy capital city of each country in question, clustering algorithms to those developed in addition, we search the Web inten- by Lamonova and Tanaka are now being sively for appropriate resources. This in used to classify commuters in trans- research process would obviously be portation planning models and the spec- more efficient if the resources that we tral approaches used by Poland and require were intelligent resources, as en- Zeugmann are of great interest to spatial visaged by Tanaka, that could be analysts as are the neural nets and wave- searched with more intelligent algo- let approaches mentioned by Lamonova rithms. In one sense our research aims to in her presentation. bring together the information in one location thus removing the need for such searches. Moreover, Tanaka’s [15] C3W framework that allows the clipping of arbitrary HTML elements from web pages and then the subsequent pasting of

162 these clips on to a special C3Wsheet Pad on a single website rather than allowing would appear to have enormous potential the user to access the data on-the-fly for increasing the efficiency of our re- from a federated knowledge network. search process. Our website does fulfill the third of Ta- The data, once acquired, is assembled naka’s goals, listed above, namely that into a GIS. This GIS is then migrated to of allowing for enhanced research activi- an interactive, GIS-based web server. ties. These may be carried out either by For our work the GIS is built using those campaigning for political office or ESRI’s ArcGIS 9.1 (www.esri.com ). by their party strategists. It is also a re- Likewise we use an ESRI product, Ar- source for statistical analysis for aca- cIMS (Internet Map Server), to produce demics. Thus we have built Ordinary the web-based GIS. The final product Least Squares regression models to pre- may be found at www.mediamap.info . dict electoral outcomes based on the The splash screen for the website pro- socio-economic variables included in the vides an introduction to the project and, website. As might be expected such like the rest of the site is available in two models have met with limited success languages: Spanish and English for the because political support and the inde- Latin American countries and French pendent variables that can be used to and English for Canada. predict that support vary geographically. Geographers have proposed various so- One goal of the MMA Project is to show lutions to this problem. These include the impact of the media on the electoral the use of Spatial Regressions and Geo- process. This can be achieved , in part, graphically Weighted Regression (GWR; through the use of visualization tech- [5]). niques. Thus we can map the results of the elections and we can map socio- economic data, and the locations of the media. We can visualize the results as individual layers or we can map two or more layers together. This is most effec- tive where just one of the layers is a polygon layer and the additional layers are point files such as the locations of two or more different types of media. We have experimented with different types of shading for two different poly- gon layers e.g. solid colors that are over- lain by hatched shading for the second variable. This, however, has met with limited success. Essentially our work has brought to- gether disparate data sets in an interac- tive, web-based GIS. We have achieved the goal suggested by Tanaka ([15], and see discussion above) but we have done this by physically locating the resources

163 Figure 1: Splash Screen for the Mapping the Media in the Americas Project GIS-Based Interactive Website: www.mediamap.info

The original OLS model, using five in- Since in the MMA Project we are con- dependent variables that represented cerned with the reach of the broadcast socio-economic status, education, liter- media we have been building models of acy, and languages spoken, achieved wave propagation from radio antenna goodness-of-fit statistics (R-squared) of towers. Based on the height of the tow- only about 35%. When we used the ers and by incorporating their power GWR model, with a spatially varying these broadcast models, in conjunction kernel of about 100 points, the R- with digital elevation models, provide an squared values ranged from 27% to 83%. estimate of the reach of the radio station Our work here is reminiscent of the work in mountainous countries such as Peru. of Taniguchi and Haraguchi [16]. They We combine these models with data on noted that correlations in their global population distribution and language database were lower than in a local data- spoken from the census data incorpo- base. The result appeared to be due to rated into our web-based GIS and this temporal autocorrelation in the US Cen- then provides an estimate of the reach of sus data with which they experimented. the radio station. In turn this can be combined with information from past Essentially, our results are the spatial elections to determine if it is worthwhile equivalent of their temporal autocorrela- for a political candidate to invest adver- tions. It would appear that a collabora- tising resources in radio or TV broad- tion in this area would be mutually bene- casts. Figure 3 shows one such propaga- ficial to both Information Scientists and tion model from ReMartinez [13]. GIScientists.

164

Figure 2: A Query to the TV Antenna Database for the Peruvian Website Showing How One National Television Network Does Note Reach an Area Where the Indigenous Quechuan Speaking Population Predominate.

165

Figure 3: Antenna Viewshed [13] to other markup languages. Specifically concern over the difference in perform- GIScience practitioners would want to ance in line-of-sight and nonline-of-sight extend Tanaka’s approach to include the environments was the subject of Geography Markup Language (GML). Ogawa’s [12] presentation at the Third This is even more important now that on International Symposium on Ubiquitous June 5th, 2006, the Open Geospatial Knowledge: Network Environment. The Consortium has approved and released a possibility for synergy between work in simple features profile specification for GIScience and the work of Ogawa GML (http://www.opengeospatial.org/- should be apparent. Spyratos’ proposed press/?page=pressrelease&year=0&prid data model, discussed above, produces =260 ). panel data. Such data, while difficult to One area of future research in GIScience acquire, has long been of interest to re- appears to be in the use of interactive tail geographers and those developing web-based GIS for decision support. In business applications [19, 20]). 2005 we began a GEOIDE supported research project for Promoting Sustain- 5. Future Directions able Communities Through Participa- One of the next steps in the integration tory Spatial Decision Support (http://- of Information and Geographical Sci- www.geoide.ulaval.ca/files/17_E.jpg ). ence, especially as the former was repre- Our research involves the Town of sented at Professor Tanaka’s Third In- Canmore in Alberta, Canada. We are ternational Symposium on Ubiquitous Knowledge: Network Environment, building a GIS that will include key would be to extend his C3W framework variables that of interest to both the

166 planners and the citizens in the town. the Institute for Media and Information These GIS maps will then be entered Science at Ilmenau in Germany and at into an interactive GIS based website the Meme Media Lab at the University using one of two technologies: MapChat of Hokkaido in Sapporo. or ArguMap. The latter is already being used for a similar research experiment at University College London References (http://ernie.ge.ucl.ac.uk:8080/Web_Syst em/faces/2_Results.jsp?argumap=yes ). [1] D. Arctur and M. Zeiler. Design- One of the components of any such sys- ing Geodatabases: Case Studies tem might be the ability to bring in new in GIS Data Modeling. ESRI material from the Web and to use in sub- Press, Redlands, CA., 2004. sequent analysis. Meme media extended to geographical objects would be ex- [2] W. Black. and I. Thomas. Acci- tremely useful, as would fast internet dents on the Belgium Motorway search algorithms. System: A Network Autocorrela- The research that Jantke [7] presented at tion Analysis. Journal of Trans- the Meme Media workshop in Sapporo port Geography, 6, 23-31, 1998. in early 2006 included a discussion of how a movie such as The Kingdom of [3] N. Chrisman. Charting the Un- Heaven could have digital tags attached known: How Computer Mapping to items in the screen shot. These tags at Harvard became GIS. ESRI allow the viewer to access additional in- Press, Redlands, CA, 2006. formation relation to the tagged object. This type of technology could be used [4] D. Clayton and N. Waters. Dis- effectively within an environment such tributed Knowledge, Distributed as MapChat or ArguMap. Thus the plan- Processing, Distributed Users: ner or citizen working within the interac- Integrating Case Based Reason- tive environment could query an object ing and GIS for Multicriteria De- in the GIS or perhaps on a Google Earth cision Making. Chapter in Mul- or orthophoto backdrop and be provided ticriteria Decision-Making and with additional information that would Analysis: A Geographic Informa- lead to a more informed planning choice. tion Sciences Approach, edited by Jean-Claude Thill, Ashgate, 6. Conclusion Brookfield, USA, 275-308, 1999.

This paper has attempted to show the [5] A. S. Fotheringham, C. Brunsdon links between GIScience and Informa- and M. Charlton. Geographically tion Science. It began with a short intro- Weighted Regression: The duction to Geographic Information Sys- Analysis of Spatially Varying tems and then discussed the rise of GIS- Relationships. Cience as a distinct discipline. It then Wiley, Chichester, provided a small subset of examples that UK, 2002. show some of the more intriguing links between the GIScience research that is [6] M. F. Goodchild. Geographical Interna- being conducted at the University of Information Science. tional Journal of Geographical Calgary and the Information Science at

167 Information Systems, 6, 31-45, ated with the 21st Center of Ex- 1992. cellence Program in Information, Electrics and Electronic at Hok- [7] K. Jantke. Media Integration for kaido University, Sapporo, Feb- Film in the Digital Bibliotheque ruary 28 – March 1, 2006. City. Paper presented at the Meme Media Workshop, associ- [12] Y. Ogawa. Multiple Antenna ated with the 21st Center of Ex- Systems and Their Applications. cellence Program in Information, In Proceedings of the Third In- Electrics and Electronic at Hok- ternational Symposium on Ubiq- kaido University, Sapporo, Feb- uitous Knowledge: Network En- ruary 28 – March 1, 2006. vironment, February 27 – March 1, 2006, Sapporo Convention [8] N. Lamonova and Y. Tanaka. A Center, Sapporo, Japan, pp. 143- Robust Probabilistic Fuzzy Clus- 151, 2006. tering Algorithm with Applica- tion to Web Document. Paper [13] C. ReMartinez. A GIS Coverage presented at the Meme Media Broadcast Prediction Model for Workshop, associated with the Transmitted FM Radio Waves in 21st Center of Excellence Pro- Peru. MGIS Research Project, gram in Information, Electrics Department of Geography, Uni- and Electronic at Hokkaido Uni- versity of Calgary, Calgary, Al- versity, Sapporo, February 28 – berta, Canada, 2006. March 1, 2006. [14] N. Spyratos. A Functional Model [9] P. A. Longley and G. Clarke. for Data Analysis. In Proceedings GIS for Business and Service of the Third International Sym- Planning. Wiley, New York, posium on Ubiquitous Knowl- 1996 edge: Network Environment, February 27 – March 1, 2006, [10] S. Nativi, M. B. Blementhal, J. Sapporo Convention Center, Caron, B. Domenico, T. Haber- Sapporo, Japan, pp. 37-48, 2006. mann, D. Hertzmann, Y. Ho, R. Raskin and J. Weber. Differences [15] Y. Tanaka. Knowledge Federa- Among the Data Models Used by tion Over the Web Based on the Geographic Information Sys- Meme Media Technologies. In tems and Atmospheric Science Proceedings of the Third Interna- Communities, 2003. Retrieved tional Symposium on Ubiquitous from Knowledge: Network Environ- http://support.esri.com/index.cfm ment, February 27 – March 1, ?fa=downloads.dataModels.case 2006, Sapporo Convention Cen- Studies ter, Sapporo, Japan, pp. 13-36, 2006. [11] J. Poland and T. Zeugmann. Spectral Clustering of the Google [16] T. Taniguchi and M. Haraguchi. Distance. Paper presented at the Discovery of Implicit Correla- Meme Media Workshop, associ- tions Based on Differences of

168 Correlations. Paper presented at the Meme Media Workshop, as- sociated with the 21st Center of Excellence Program in Informa- tion, Electrics and Electronic at Hokkaido University, Sapporo, February 28 – March 1, 2006.

[17] N. Waters. Transportation GIS: GIS-T. In Geographical Informa- tion Systems: Principles, Tech- niques, Applications and Man- agement, 2nd Edition, Edited by Paul Longley, Michael Good- child, David Maguire, and David Rhind, John Wiley and Sons, pp. 827-844, 1999.

[18] N. M. Waters. Transportation GIS: GIS-T. In Geographical In- formation Systems, Second Edi- tion, Abridged, edited by Long- ley, P.A., Goodchild, M.F., Maguire, D.J. and Rhind, D.W.; Wiley, New York, 2005.

[19] N. Wrigley. Categorical data Analysis for Geographers and Environmental Scientists. Long- man, London, 1985.

[20] N. Wrigley and M. S. Lowe. Reading Retail: A Geographical Perspective on Retailing and Consumption Spaces. Arnold, New York, 2002.

169 Storyboarding – An AI Technology to Represent, Process, Evaluate, and Refine Didactic Knowledge

Rainer Knauf1 and Klaus P. Jantke2 Technical University Ilmenau 1Faculty of Computer Science and Automation Chair of Artificial Intelligence 2Institute of Media and Communication Sciences PO Box 100565 98684 Ilmenau

Abstract The current state of affair in learning systems in general and in e-learning in particular suffers from a lack of an explicit and adaptable didactic design. Students complain about the insufficient adaptability of e-learning to the learners’ needs. Learning content and services need to reach their audience properly. That is, according to their different prerequisites, needs, and different learning conditions. After a short introduction to the storyboard concept, which is a way to address these concerns, we present an example of using storyboards for the didactic design of a university course on Intelligent Systems. In particular, we show the way to express didactic variants and didactic intentions within storyboards. Finally, we sketch ideas to for a machine supported knowledge processing, knowledge evaluation, knowledge refinement, and knowledge engineering with storyboards.

1. Introduction use today’s presentation equipment, didactic skills to effectively present the topical con- Successful university instructors are often tent, plus skills in fields like social sciences, not those with the very best scientific back- psychology and ergonomics. ground or outstanding research results. The most successful ones are typically those that University instruction, however, often suf- successfully utilize didactical experiences as fers from a lack of didactic knowledge. well as ‘soft skills’ in dealing with other ac- Since universities are also research insti- tors in the teaching process. Besides the tutions, their professors are usually hired students and colleagues, such actors include based on their topical skills. Didactic skills e-learning systems as well ass the large are often underestimated in the recruiting amount of active (desirable and undesirable, process. At German universities, e.g., there conscious and unconscious) ‘content presen- is no didactic education required as a pre- ters’ that include news, web sources and ad- requisite for a professorship. Such skills are vertisements. only checked by asking (usually just two) students about their impression on this issue The design of learning activities in colle- after watching a single talk given by the ap- giate instruction is a very interdisciplinary plicant. process. Besides deep, topical knowledge in the subject being thought, an instructor Here, we refrain from further discussing rea- needs knowledge and skills in many other sons for that, but focus the issue of involving subjects. This includes IT-related skills to it a little more. We propose an approach to

170 improve the didactic content within the large boards on shows, plays, or movies. The ma- variety of skills needed for successful teach- terial of the storyboarded learning activities ing by discussing questions like (e.g., text books, scripts, slides, models) is • How to include didactic variants of pre- something comparable to the requisites of a senting materials? show. • What determines the variants? Basic differences of our storyboards to those • How to choose an appropriate variant used to ‘specify’ a show are: according to particular circumstances? 1. the primary purpose (learning vs. enter- Our approach to facing problems like these tainment)1, is a modeling concept for didactic knowl- 2. the degree of formalization, and, as a edge called Storyboarding. A storyboard, as consequence of being semi–formal, and the name implies, provides a roadmap for a 3. the opportunity to formally represent, course, including possible detours if certain process, evaluate, and refine our story- concepts to be learned need reinforcement. boards. Using modern media technology, a story- Our storyboard concept [1] is built upon board also plays the role of a server that pro- standard concepts which enjoy (1) clarity by vides the appropriate content material when providing a high-level modeling approach, deemed required. Our suggestion to ensure a (2) simplicity, which enables everybody to wide dissemination of this concept is to use become a storyboard author, and (3) visual a standard tool to develop and process this appearance as graphs. model, which is MicrosoftTM Visio. Here, A storyboard is a nested hierarchy of di- we present an application of this approach rected graphs with annotated nodes and an- for a particular university course. notated edges. Nodes are scenes or episodes; The paper is organized as follows. Section scenes denote leaves and episodes denote 2 is a short introduction to the storyboard a sub–graph. Edges specify transitions be- concept as introduced in [1]. Section 3 in- tween nodes. Nodes and edges have (pre– troduces an exemplary storyboard. Because defined) key attributes and may have free at- we used ourselves as the very first ‘experi- tributes. These terms are described below. mental subjects’ we selected an AI course Figure 1 shows an example storyboard on at the University of Central Florida as our the present paper. The representation as a example. In this section, we show the op- graph (instead of a linear sequence of sec- portunities of storyboards to express didactic tions) reflects the fact that different read- intentions and variants. Section 4 outlines ers trace the paper in different manners ac- some conceptual refinements towards a ma- cording to their particular interests, prereq- chine supported knowledge processing with uisites, a current situation (like being under storyboards. In section 5, we summarize the time pressure), and other circumstances. For research undertaken so far and outline our example, future research in three time horizons, (1) • members of our research group will the short term objective of storyboard dis- skip the sections 1 and 2, because they semination, (2) the medium term objective are already motivated and familiar with of storyboard evaluation, and (3) the vision the concept, of identifying and finally utilizing successful • reviewers of this paper, on the other didactic patterns. hand, will (hopefully) read the com- plete paper, but they might skip the 2. Storyboarding - the concept Acknowledgement section, because its content doesn’t matter for their work. The basic approach behind storyboarding is 1There is no ambivalence between these purposes. that teaching consists of further structured To include some entertainment into learning is one of episodes and atomic scenes (with no mean- the keys of successful learning and thus, also an ulti- ingful structure) - just like traditional story- mate objective of storyboarding learning processes.

171 of a teaching subject and thus, a firm base for processing, evaluating and refining this knowledge. Not all the following supplements are re- ally necessary. In fact, the following list is a proposal of further supplements. Con- versely, some of the features might not be applicable to particular storyboards. Many of them are implicit in the general concept. We discuss those details only for the read- ers’ convenience, to become a little more fa- miliar with our ideas, aims and intuition. In italics we provide details about our current technological representation of storyboards in MicrosoftTM Visio [3]. Readers may use Fig. 1: An exemplary storyboard any other appropriate tool at hand. • Because those nodes called episodes So a storyboard can be traversed in dif- may be expanded by sub-graphs, sto- ferent manners according to (1) users’ in- ryboards are hierarchically structured terests, objectives, and desires, (2) didac- graphs by their nature. Double clicking tic preferences, (3) the sequence of nodes on an episode opens the corresponding (and other storyboards) visited before (i.e. sub-graph on a separate sheet. according to the educational history), (4) • Comments to nodes and edges are in- available resources (like time, money, equip- tended to carry information about di- ment to present material, and so on) and (5) dactics. Goals are expressed and vari- other application driven circumstances. ants are sketched. Clicking to a com- A storyboard is interpreted as follows: ment opens a window with the text, in- • Scenes denote a non-decomposable cluding author information and date. learning activity, which can be imple- • As far as it applies to a node, educa- mented in any way: It can be the pre- tional meta–data, such as a degree of sentation of a (media) document, or difficulty (e.g., basic or advanced) or an informal description of the activity. a style of presentation (e.g., theory– There is no formalism at and below the based or illustrated) may be added as scene level. key attributes. Visio built–in object • Episodes denote a sub-graph. properties are used to represent general • Edges denote transitions between information and meta–data. nodes. • Edges are colored to carry informa- • Key attributes specify actors and loca- tion about activation constraints, con- tions. Depending on the application, ditions, or recommendations to follow more key attributes may be defined. them. Certain colors may have some • Free attributes can specify whatever fixed meaning like usage for certain ed- the storyboard author wants the user to ucational difficulties. Clicking on edges know: didactic intentions, useful meth- opens didactic comments and meta– ods, necessary equipment, and so on. data for adaptive behavior. • (Key and free) attributes to nodes are • Actors and locations, including those in inherited by each node of the related the real world, are assigned to scenes super–graph. only. Through programming, actor In fact, the storyboard is a semi–formal and location information may be prop- knowledge representation for the didactics agated automatically.

172 • Certain scenes represent documents of is only one requirement to meet: the col- different media types like pictures, ors must mean something. Wherever a new videos, PDF files, Power Point slides, color is introduced for several edges going Excel Tables, and so on. Double click- out of one node, it must be noted as the ‘key ing on a scene opens the media object attribute’ which color of the upcoming path in a viewer, e.g., plays the film. is recommended under which conditions.

Clearly, the sophistication of storyboards The storyboarding practice of the authors in- can go very far. The concept allows for dicated that introducing new colors is useful, deeply nested structures involving different if the ‘fork-situation’ continues for a path of forms of learning, getting many actors in- several nodes. In case both ways merge back volved and permitting a large variety of al- to one after visiting one node, we did only ternatives. Though this is possible, in prin- mark the condition as a key attribute, but did ciple, the emphasis of this concept – driven not use a new color and did not see a lack by the goal of dissemination – is on simple of visual overview and clarity. We feel the storyboards designed quickly by almost any- opposite is true; too many colors cause lack one. of overview. For the intended purpose of storyboarding 3. A complete storyboard for a a university course, we developed story- boards that contain (besides the Scenes and course on Intelligent Systems Episodes) additionally Here, we outline a storyboard that we devel- • To Do –s, which define anything to do oped for a course on Intelligent Systems at for the final grade, and the University of Central Florida. • Off Page References, which are points to jump back and forth between sub– 3.1. Resources initially available and related super–graphs. At the starting point of the storyboard de- Each node type has different meanings and velopment, we had the course material and behaviors. In MicrosoftTM Visio, so called ”experience sources” as used so far. This in- hyperlinks can be defined on any graph ob- cluded: ject to open either a local file of any me- • a course syllabus including references, dia type with the appropriate tool or to open grading rules, class policy, and a ten- the standard browser with a specified URL tative schedule as a linear sequence of or mail tool if it is an e-mail address. We topics over the semester, made use of this opportunity for the Scenes, • Episodes and the To Do –s. the books referenced in this syllabus, • a huge amount of slights (MicrosoftTM In particular, the nodes, their behavior and Power Point files) for each part, and their key attributes are as specified in Tables • two individuals with some topical back- 1, 2, 4, and 3. ground: the lecturer and author of this 2 For edges, it is not meaningful to define paper. double click actions or hyperlinks. In our storyboard, they have exactly one key at- In fact, we intentionally storyboarded a tribute: a field, where the author can spec- course on our own (teaching and research) ify a condition to follow this edge. Edges subject to gain some experience with the may have different colors. A storyboard au- use, evaluation and refinement of particular thor is free in the choice of colors. There courses as well as with the concept itself. 2Also the edges are not intended to carry topical 3.2. Course structure subject content, but didactics of a (mandatory, condi- tioned, or recommended) switch between the nodes Figure 2 shows the top level storyboard of of the graph. the course.

173 Symbol Behavior when double • opening a material document clicked • nothing, if just verbally described scene Behavior when following a • opening a material document hyperlink • visiting a website with the standard browser, if URL • opening the standard mail tool, if it is an e–mail address Key annotations Scene scene name: free text Key words key words: free text Educational difficulty degree of complexity: any | basic | advanced Educational presentation style: any | theory based | illustrated Media media documents needed: free text Unspecified media media documents to be specified: free text Actors human actors: instructor | students | both # slides # of slides: integer Estimated time consumption approx. time needed: [h]:[min] Location location and equipment requirements: free text Table 1: Scenes

Symbol Behavior when double opening the sub-graph that specifies the episode clicked Behavior when following a • opening a material document hyperlink • visiting a website with the standard browser, if URL • opening the standard mail tool, if it is an e–mail address Key annotations Episode episode name: free text Key words key words: free text Educational difficulty degree of complexity: any | basic | advanced Slides slide file in the conventional course: [file name].ppt Equipment equipment needed for the episode: free text Media media documents needed: free text Unspecified media media documents to be specified: free text Table 2: Episodes

Symbols End @@ @@ Behavior when double jumping back and forth clicked Behavior when following a • opening a material document hyperlink • visiting a website with the standard browser, if URL • opening the standard mail tool, if it is an e–mail address Key annotation: name of sub– or super/graph: free text Table 3: Off–Page References

174 ¨ Symbol © Behavior when double • opening a material document clicked • nothing, if just verbally described scene Behavior when following a • opening a material document hyperlink • visiting a website with the standard browser, if URL • opening the standard mail tool, if it is an e–mail address Key annotations Type type of activity (homework #, midterm exam, final exam, ...): free text Topics topic of homework or exam: free text Material allowed for students things allowed to use (calculator, dictionary, . . . ): free text Material needed by instructor things needed by instructor (docs with tasks, . . . ): free text Relative worth % of worth w.r.t. the final grade: free text Date scheduled date and time for activity: free text Location location of activity (home, seminar room, . . . ): free text Maximal time consumption max. time given for activity: [h]:[min] Table 4: ToDo-s

At this level, there are not many alternative paths. In our experience in building story- boards, the top level storyboard of a sub- ject to be taught is usually of this kind. The storyboards below this level contain alterna- tive paths according to some ‘tactical vari- ants’ of teaching (like a recommended inclu- sion of an example). The storyboards above this level contain alternative paths accord- ing to some strategy of teaching (like rec- ommended subject combinations). To show the opportunities of including di- dactic variants and intentions, we next have a closer look at the 2nd level storyboard on Machine Learning respectively selected sto- ryboards below this level.

3.3. Examples from the Machine Learning chapter Two sub-graphs of this chapter are dedicated to Case Based Reasoning (CBR, for short) and Inductive Inference. The first one (CBR, see Figure 3), shows a little more structure than the top level. Here, different paths are defined depending on a tactical decision: If the instructor realizes that the CBR process is well understood, he can continue with search technologies in Fig. 2: Top level storyboard of the course Case Libraries. Otherwise he needs to in- clude the example before doing that. Both

175 Fig. 4: Storyboard on Inductive Inference Fig. 3: Storyboard on Case Based Reasoning tertaining) bridge towards the next topic is paths are distinguished by different colors appropriate. Moreover, the open discussion and by a description of conditions to fol- on requirements the Example Set in Induc- low the one or the other as an annotation tive Inference should meet for CLS and ID3 to the respective edge at the ‘fork point’ of and the impact of these sets on the quality these different ways to continue. Moreover, of inference results, if they are not met prop- for the ‘advanced students’ who follow the erly, leads directly to a concept to address green path, an additional optional node is these drawbacks: the k–NN Method. Since visited. Since the k–NN Method (see last k–NN is not a method of Inductive Inference, scene in Figure 4) is some sort of CBR, but closely related to CBR, an Off–Page Ref- which overcomes some drawbacks in Induc- erence to the related node in the CBR story- tive Inference (which is another 3rd level board is placed here. A scene that hasn’t been a part of the Ma- sub–graph besides CBR), there is a reference chine Learning episode in the course before back and forth. Depending on whether this its storyboarding, is Inductive Inference with scene is visited as a supplement to Inductive FOPC expressions (see Figure 4), which is a Inference or as a regular part of CBR, the re- technology to compute a so–called best in- lated Off-Page Reference needs to be clicked ductive conclusion on a set of expressions of to jump back. the First Order Predicate Calculus (FOPC). A nice example for including didactic inten- This is a technology that is not presented in tions can be seen in the storyboard of Induc- any AI textbook, but included in a related tive Inference (see Figure 4). Here, the sim- course of the first author. By including it ple play ‘guess my number’ is intended to as an optional scene in the second author’s give the students an idea of what the term In- course on Intelligent Systems, we utilized formation Entropy means, because this con- the storyboard as a medium of collabora- cept is utilized in the subsequent scene. An tion in teaching Artificial Intelligence. As instructor might decide whether such a (en- the double click action, we implemented the

176 opening of a MicrosoftTM Power Point pre- topics are related to each other. Provid- sentation to this topic as used in the first ing this information in conventional course author’s course3. Furthermore, clicking the material exclusively (e.g., books, scripts, www–symbol in the lower left corner of the slides), which is sequentially organized, scene opens the webpage of the first author, bags the risk, that the user doesn’t recognize at which all slides and scripts of the first that this information is some sort of meta– author for teaching AI can be downloaded knowledge about these three technologies. by interested users (instructors or students). The nesting of storyboards supports imple- Additionally, there is an information symbol menting such knowledge, effectively. i in the lower left corner, which indicates a source of further information on this learn- 4. Current efforts towards formal- ing content. What it does is open the stan- izing the concept dard mail tool of the PC under use with the With the objectives of recipient address of somebody who is able to answer questions on this learning content: 1. a software realization of knowledge the first author. processing (i.e. deductive inference on storyboards), A similar use of the storyboard for sharing 2. evaluating storyboards by case–based teaching material has been done with the validation technologies such as de- scene on the CLS Algorithm. Here, two hy- scribed in [2], perlinks are defined, one that opens the orig- 3. machine learning (i.e. the identification inal slides of the course on Intelligent Sys- of didactic patterns by some sort of data tems and another one that opens (an English mining and inductive inference), translation of) the slides for the same topic in 4. knowledge refinement by improving the first authors’ course on Artificial Intelli- particular storyboards due to didactic gence. Generally, the number of hyperlinks insights that became explicit by story- is unlimited, so that a scene can carry many boarding, and finally different material to provide its content, in- 5. knowledge engineering by supporting cluding web addresses. the storyboard development with a tool- Another didactic measure that helps to real- box of didactic patterns that are proven ize the nature of the various Machine Learn- to be appropriate in former use, ing technologies on the very first view is the the overall concept is currently under revi- labeling of the scenes (1) CLS Algorithm, (2) sion. ID3 Algorithm, and (3) Inductive Inference In opposition to classical knowledge with FOPC expressions. By the comments processing in Artificial Intelligence, de- above these scenes are titled duction on the knowledge modelled by (1) A straightforward induction of a rule- storyboards is not 100 % performed by based classifier in a n–dimensional machines. Albeit the traversing can be space of attributes software-guided, there are some decisions (2) Induction of an optimal rule- left to humans respectively depend from based classifier for objects in a n– a input parameter that might be provided dimensional space of attributes driven by humans. Furthermore, the knowledge by the information gain, respectively within (and below) the Scenes is usually (3) Computing a best inductive conclu- informal and thus, absconds itself from sion for a set of examples by Anti- formal processing. Unification However, we made some conceptual refine- The storyboard user (instructor and student) ments that improve the opportunities for a receives some knowledge on how these three software supported (deductive) knowledge processing as a very first step towards the 3Of course, this is a version translated in English. first one of the visionary objectives above.

177 4.1. Conceptual refinements of tinguished: (1) they are alternatives that ex- nodes clude each other, (2) they are edges that need A view at figure 3 reveals that jumping back to be traversed concurrently, and (3) there is into a related super–graph is not uniquely a rule that determines something in–between defined, formally. For better opportunities to both (‘follow at least three of 10 edges’ e.g.). support the storyboard traversing by an ap- These cases are syntactically distinguished propriate software, we additionally limited in the way these edges branch and merge af- the expressiveness by requesting, that the ter finishing the the alternative or concurrent graph hierarchy has to be a tree of graphs. paths: (1) alternative paths start and end with In other words, a graph that implements an different edges (2) concurrent paths start and episode can only be a sub–graph of exactly end with one edge that spreads at the branch one super–graph. This implies, that, when- point and re–unites at the merge point in the ever a sub–graph has been completely tra- graph. (3) The third case is treated as a spe- versed by reaching its final node, the is a cial case of the second one by expressing the unique super–graph into which the story- rules at the branch and merge point of the board interpreter has to jump back. paths. Also, the definition of exactly one starting point and exactly one end point is necessary Edge coloration Additionally, rules of to interpret graphs without any elbowroom coloring edges have been established. So for human interpretation respectively mis- far, a color specifies a path to follow under understandings. Other references between a certain condition. This raises a problem, graphs than start– and end–nodes have to if there are several edges of different col- be forbidden to enable a software supported ors towards a node and also several edges storyboard traversing. of different colors leaving a node. So the color concept so far was not able to express 4.2. Conceptual refinements of the rules how to determine possible outgoing edges colors depending on the incoming color. Since humans are still an essential part of By introducing the concept of bi–colored a storyboard interpreter, the modelling lan- edges that change their color within the guage must be easy to read and to under- edge, these rules can be expressed in a very stand. So we shifted the representation of simple and formally decidable way: All conditions to follow an edge from the (rather edges having the same color as the incoming hidden) key annotation of it to an obvious edge, are the only possible outgoing edges. annotation within the arrow that represents the edge. 5. Summary and outlook To ensure decidable conditions for edges, we Storyboards in general, and the one intro- limit these conditions to duced in this paper in particular, are an ap- • dynamic conditions regarding the tra- proach to make the didactic design of univer- versing history (visited nodes, e.g.) and sity courses explicit. Since their scenes are • input parameter with case–specific val- not limited to the presentation of electronic ues that don’t change after their evalua- material, but may represent any learning ac- tion. tivity, the application of this concept goes far Furthermore, if node–outgoing edges carry beyond the IT approaches to support learn- a condition, there have to be at least two of ing so far. those to avoid dead ends of traversing. The idea to represent knowledge at a high level with a modeling concept that is ap- Fork situations If there are several edges propriate to be used by topical experts (uni- that leave a node, three cases have to be dis- versity instructors, in this case) without the

178 need of an IT– or even software techno- cessful paths within storyboards in par- logical background is very much AI–driven. ticular. Through the use of Machine Here, the term ‘topical knowledge’ is not re- Learning methods, we finally might be lated to the learning content, but to its di- able to find out what these successful dactics instead. In particular, didactical in- storyboards respectively paths have in tentions and variants can be specified as a common and in which properties they nested graph–structure. differ from the less successful ones. To validate the usefulness in practice, we de- Thus, we might be able to identify suc- veloped a storyboard on an AI related course cessful didactic patterns. at the second author’s university. After its The latter is, in fact, the vision of knowl- development, we presented the result to the edge discovery in didactics. By utilizing the current instructor of this course (as well as didactic insights acquired by this approach other university instructors). There was, in for the upcoming storyboards, we intend to fact, no need to convince him in using this close the loop of the never ending storyboard storyboard; he was quite happy to have such development spiral. a useful tool to support his teaching. One essential property of this concept and its Acknowledgements The authors are implementation is its simplicity in terms of grateful for the generous support from the both the concept itself and the tool we used Laboratory of Interdisciplinary Information 2 to implement it. Everybody, also university Science and Technology (I -Lab) at the instructors of subjects that are far removed School of Electrical Engineering and Com- from information technology, are able to de- puter Science at the University of Central velop storyboards. Florida. Thanks to this support we could build the exemplary storyboard, which is (1) Of course, we won’t stop this research after an important source of experience for the au- having a high-level modeling concept for di- thors, (2) a starting point for future research dactic design and asking university instruc- to refine the concept and, most importantly tors to perform their ‘knowledge processing’ (3) a launch pad of a campaign that aims at with it. the wide dissemination of this concept. Our upcoming research on this issue is as follows: References 1. A short term objective is, of course, [1] K.P. Jantke and R. Knauf. Didactic promoting the development and use of design though storyboarding: Standard this concept. concepts for standard tools. In Proc. of the 4th International Symposium on 2. After that, as a medium term objective, Information and Communication Tech- we plan to develop an evaluation con- nologies, Workshop on Dissemination of cept for storyboards based on the learn- e-Learning Technologies and Applica- ing results of the students as acquired tions, Cape Town, South Africa, 2005, from the final grade they achieve for the pages 20–25, 2005. storyboarded courses as well as the stu- [2] R. Knauf, A.J. Gonzalez, and T. Abel. A dents’ specific comments in a question- framework for validation of rule-based naire. systems. IEEE Transactions of Systems, 3. Our long term objective is to identify Man and Cybernetics - Part B: Cyber- netics, 32(3):281–295, June 2002. typical didactic patterns of successful storyboards. Since the learning result [3] M.H. Walker and N. Eaton. icrosoft Of- of a particular student is associated to a fice Visio 2003 Inside Out. Redmond, particular path through the storyboard, Washington: Microsoft Press, 2004. we should be able to identify success- ful storyboards in general, but also suc-

179 Data Mining on Traffic Accident Data

J¨urgen Cleve†, Christian Andersch?, Stefan Wissuwa†

†University of Wismar, Dept. of Economics Philipp-M¨uller-Str.,D-23952 Wismar, Germany Phone: +49 3841 753-527 E-mail: {j.cleve,s.wissuwa}@wi.hs-wismar.de http://www.wi.hs-wismar.de/kiwi ?Current affiliation: MIT – Management Intelligenter Technologien GmbH Pascalstr. 69, D-52076 Aachen, Germany E-mail: [email protected]

Abstract

A large number of people is killed or injured in traffic accidents every year. We describe an approach to determine the seriousness of potential accidents. Therefore, we use data mining techniques to analyze data from traffic accidents and to build models for prediction. We use freely available data mining tools and self-developed software for data preprocessing and automation.

Keywords: data analysis, data mining, machine learning, traffic accident data

1 Traffic Accident Data We have access to official statistics cov- ering several years of the Rostock area. The number of people killed in traffic There are a total of 10,813 anonymized accidents per year is about 1.2 million records available. Each is marked with worldwide. Even more get injured: 50 a ‘score’ that indicates the seriousness million people [6]. It is important to of the accident [2]. The score is an in- lower these numbers. teger number on a scale from 1 to 9 with 9 as maximum. It was originally Analyzing the accidents may help calculated from the number of injured to understand and isolate facts with or killed people and the material dam- significant influence. In this paper we age as shown in Table 1. It is worth describe the analysis of traffic accident to notice that the rules for the score data from the Rostock1 area. The were created by physicians. Since the question is: If there is an accident, material damage is not included in the how serious will it be?2 data provided, the mapping from mate- rial damage and personal injury to the 1City located in the northern part of Ger- score cannot be reversed. many; important Baltic seaport Each record contains 52 attributes de- 2 See also: “Improving road safety with scribing the circumstances of the acci- data mining”, http://soleunet.ijs.si/ website/html/cocasesolutions.html dent. The attributes include weather

180 Table 1: Score definition

Material damage (e) Personal injury Score [0; 250) none 1 [250; 500) none 2 [500; 2,000) none 3 [2,000; 5,000) barely injured 4 [5,000; 10,000) ≤ 2 badly injured 5 [10,000; 25,000) > 2 badly injured 6 [25,000; 50,000) 1 dead 7 ≥ 50,000 several dead and badly injured 8 massive accident with ≥ 10 vehicles and/or ≥ 5 badly injured 9

conditions like rain or darkness, vio- environment ‘Eddie’, described in Sec- lations of traffic regulations like speed tion 3, for data preprocessing. limits or alcohol, character of the sur- rounding area like trees or sharp turns, 1.1 Preliminary considerations type and number of vehicles and people involved, type and number of injured The main goal was to predict the score people, and the age and gender of the of (un)known data. Therefore, to iden- person who caused the accident. tify over-fitting, the comparison of al- From these attributes are 36 binary gorithms was based on three different sparse data. Eight of them, the at- methods for training and test data. tributes that indicate violation of traf- fic regulations, are mutually exclusive, • Test on training data: This so they might be interpreted as one at- method shows how good the train- tribute. ing data are learned. It is typically The values for age class and gender are not suited for prediction. If not missing in 194 records (1.79%) because stated otherwise, test on training of hit-and-run drivers. data in this paper means training and testing on all 10,813 records. The data are ambiguous, which means 2 that several records with identical at- • The 3 -rule means a random split 2 tribute values can have different scores. of the data into 3 for training and 1 This is a major problem for machine 3 for testing. learning because contradictions cause diametrical learning effects. It also • Cross validation means multiple raises questions about data quality or stratified splitting into test and missing attributes. On the other hand, training data, where the errors are this may be the result of highly chaotic averaged. We based our main cri- behavior in complex dynamic systems teria to judge the result’s qual- as traffic accidents usually are. ity on 10-times cross validation (CV10). We used a relational DBMS to hold the data and to perform early analy- The algorithms used for classification sis, WEKA [9] for data mining because include decision trees such as ID3 or of its high number of included algo- Random Forest [3], rule based algo- rithms, and our self-developed software rithms, and SMO [5] as representative

181 for Support Vector Machines (SVM). m We used the standard parameters for R(x) = x,x Pn m all algorithms.3 As additional step, the j=1 x,j boosting algorithms Vote, AdaBoost, Precision P defines the accuracy of the and Bagging were applied for further predicted class: optimizations. mx,x Since ID3 ‘stores’ all distinguishable in- P (x) = Pn i=1 mi,x put data, it will give an upper limit on the training data and can therefore Both, precision and recall are equal to be used as reference for how good the the proportion of the main diagonal el- training data can be learned. ZeroR ement versus the column or row sum. always predicts the mean/mode value F-measure F is the ‘average’ of R and and therefore gives the lower limit on P : CV10, which should be outperformed 1 1 1 1 P · R = ( + ) → F = 2 · by all other algorithms. F 2 P R P + R Many data mining algorithms try to 1.2 Methods and Procedures minimize F . This is especially useful if no cost matrix C is used. In case a Classification predicts a target at- cost matrix is used, its elements c are tribute from a set of non-target at- i,j multiplied with the corresponding ele- tributes. This equals the function ments of the confusion matrix mi,j. In our experiments we used the following f(x , . . . , x ) = x 1 n−1 n cost matrix: where xn is the target attribute. To evaluate the results of different 0 1 2 3 4 5 6 7 8 experiments, a quality indicator is 1 0 1 2 3 4 5 6 7   2 1 0 1 2 3 4 5 6 needed. The simplest indicator is the   3 2 1 0 1 2 3 4 5 total recognition rate:   C9 = 4 3 2 1 0 1 2 3 4   5 4 3 2 1 0 1 2 3 # correctly classified records   R = 6 5 4 3 2 1 0 1 2 total # records   7 6 5 4 3 2 1 0 1 8 7 6 5 4 3 2 1 0 The classification results for test data are often represented as square confu- sion matrix M that shows the num- The main diagonal elements ci,i repre- ber of records from each class (score) sent the correctly classified records. As and how they are classified. This al- costs for incorrectly classified records lows us to define recall, precision, and we chose the difference between real f-measure for each score. and predicted score. The cost function Let i, j be the row and column index is the sum of all singular costs, which of the confusion matrix M. Recall R needs to be minimized: is the normalized amount of correctly classified records for each class: n n X X 3Except for SMO, where we used radial ba- Costs = mi,j · ci,j → min sis function (RBF) with c = 1, γ = 0.3. i=1 j=1

182 Table 2: Identical records shows the distribution of score differ- ences over the entire dataset. For ex- Identical records ample, there are 14 identical records Score difference Absolute # % with a score difference of 5. 0 3,729 34.5 1 2,795 25.8 Based on this we split the data into 2 1,953 18.1 test and training data so that the train- 3 537 5.0 ing set contains only unique records. 4 108 1.0 5 14 0.1 The remaining records were used as test 6–8 — — data. Incomplete records were also re- moved from the training set, although this may be argued. 2 First Results We used the same cost function as be- fore. From the records with different 2.1 Classification using original scores we chose the one which was near- data est to the average score. In case of doubt we chose the higher score be- The experiments were performed using cause we considered higher scores as the original data and score. For all val- more important. ues either a nominal or binary encod- ing was used. Missing values were set The training set contained 5,891 to 0. Further, we removed all attributes (54.48%) unique records and the test- that used information ascertained after ing set contained the remaining 4,992 the accident already happened, such as (45.52%). All data were nominal en- number of injured people. coded. Missing values for age and gen- der were set to NULL (non-existing). The results are shown in Table 3. ID3 Violations of traffic rules were grouped and Random Forest have similar recog- into one attribute. nition rate and costs, but differ in cross validation. The best recognition of the We got the best result using Random training data with 77.5% by ID3 gen- Forest. All results are shown in Table 4. erally indicates that it may be hard to learn the data. 2.3 Classification using new scores

2.2 Classification using unique Previous results showed problems in data predicting the score. In the last exper- iment, unique data was used for train- Because of the relatively bad results in ing, but not for testing. For the cur- our previous experiments we did fur- rent experiment, we defined new scores ther investigation of the data. We fig- to minimize ambiguousness. These new ured out that the data are partially scores were directly calculated from the ambiguous, which means that there personal injury, ignoring the material are identical records but with different damage, which was not provided in scores. the original data. In several steps the Ignoring the score attribute, there are personal injury was aggregated from a total of 5,993 (55.42%) different ScoreB with 6 score classes (ScoreB = 1: records. With score, there are 7,084 only material damage; ScoreB = 2 to (65.51%) different records. Table 2 6: personal injury according to the

183 Table 3: Original data—total recognition rate and costs per algorithm

Total recognition rate (%) Costs 2 2 Algorithm Training CV10 3 -rule Training CV10 3 -rule Random Forest 77.3 46.9 44.4 3,697 9,005 3,208 J48 60.7 44.5 44.2 6,507 9,303 3,100 REPTree 49.2 41.6 41.1 8,352 9,496 3,291 ZeroR 38.4 38.4 39.4 9,776 9,776 3,266 ID3 77.5 43.8 40.8 3,678 10,426 3,858

Table 4: Unique data—total recognition rate and costs per algorithm

Total recognition rate (%) Costs Algorithm Train. Test Total CV10∗ Train. Test Total CV10∗ Rand. Forest 93.2 48.4 72.8 35.5 686 3,439 4,125 6,109 J48 60.1 43.8 52.7 40.7 3,715 3,804 7,519 5,405 REPTree 47.0 42.1 44.8 38.6 4,808 3,898 8,706 5,483 ZeroR 34.4 43.1 38.4 34.4 5,830 3,946 9,776 5,830 ID3 93.4 47.3 72.4 28.6 680 4,370 5,050 8,108 ∗CV10 on unique data, therefore reduced costs compared to Table 3, CV10

original score) to ScoreF (ScoreF = 1: Most algorithms did not predict the only material damage; ScoreF = 2: also higher scores at all. Furthermore, the personal injury). Results for ScoreB recognition rate of a score is nearly pro- as the most complex and ScoreF as portional to the number of appearances the most simple differentiation are de- in the original data. This cannot be scribed here, for other scores see [1]. generalized for score 4 to 6 because We used nominal encoding and set of their very low appearance. Ran- missing values to a special value. At- dom Forest and SMO have the low- tributes for violations of traffic rules est costs and highest total recognition were not grouped into one attribute rate, but only Random Forest predicts as before. All attributes collected af- higher scores. ter the accident were removed. Alto- Directly learning score variants with gether, we used 44 attributes including fewer classes shows slightly better re- the score. sults than learning ScoreB and aggre- The results for total recognition shows gating to ScoreF afterwards [1]. Details Table 5. The total recognition rate is for ScoreF and one algorithm shows Ta- in inverse proportion to the number of ble 6. classes used. ID3 on training data rec- ScoreF only distinguishes the binary ognizes exactly the number of distin- decision problem material damage vs. guishable records, as verified by analyz- personal injury. Therefore, the cost ing the database. Using CV10 as crite- matrix leads to costs that equal the ria, SMO gives the best results. number of wrongly classified records: Table 7 shows a detailed overview of single and total recognition as well as 0 1 C = costs when using CV10. 2 1 0

184 Table 5: Total recognition rate (%) 3 Data Mining Environ- ment for Preprocessing Alg. Test Score ScoreB ScoreF and Automation ID3 Train. 77.5 97.0 97.4 CV10 45.2 87.0 89.4 2 We are developing a data mining en- 3 -rule 42.7 85.7 88.8 J48 Train. 59.9 90.7 91.9 vironment for data preprocessing, au- CV10 44.4 88.0 90.8 tomation and integration of data min- 2 3 -rule 43.6 87.9 90.6 ing software, called Eddie: Extensi- SMO Train. 58.3 91.6 93.4 ble Dynamic Data Interchange Envi- CV10 46.5 88.7 91.1 ronment. The system was used in part 2 3 -rule 45.5 88.5 91.1 for data preprocessing and table oper- ations. It is designed to support scien- tific experiments, which need a flexible and in-depth capability to adjust pa- Table 6: Details of ScoreF for J48, rameters, to handle very large amounts CV10; R is recall, P is precision (%) of data, and to perform multiple exper- iments automatically [7], [8], [4]. S 1 2 R 1 9,047 203 97.81 2 797 766 49.01 3.1 Motivation P 91.90 79.05 90.75 When running data mining experi- ments, we often have to use several soft- ware applications. Reasons are that As new optimization criteria, f-measure not every algorithm is supported by for ScoreF = 2 (personal injury) was every software, non-transparent algo- used, since it shows the quality and rithms complicate the interpretation precision of the classification of serious of results or models (especially true accidents. The reduced complexity al- for Neural Networks), and visualization lowed us to further use several boosting capabilities are not always appropri- algorithms. The best results are shown ate. It is necessary to exchange data in Table 8 including recall and precision between different applications and to for ScoreF = 2, total recognition rate, manually perform various formatting and costs. steps because of proprietary data for- The best result with the highest f- mats. While for a single experiment measure of 0.667, almost 60% recog- this is not a limiting factor, it becomes nition and little more than 21% false- very time consuming and error prone positive was achieved with voting using for larger series of experiments. Random Forest and the rather simple To solve these issues we decided to algorithms Decision Stump and ID3. develop a set of data transformation Replacing ID3 with PART resulted in tools that convert data from propri- slightly lower f-measure, but higher to- etary formats into a single exchange tal recognition rate of 92.0% and lower format and back, all using standard in- costs. For J48, Table 8 compared to put and output streams so they can Table 5 also shows that boosting algo- easily be appended. We chose XML rithms do not always lead to a better as standard exchange format since it total recognition rate. is easy to parse, to edit, and human

185 Table 7: New score—recognition rate (%) and costs for ScoreB, CV10

ScoreB % ID3 Dec.Table PART J48 NBTree SMO C.Rule Rand.For. 1 85.5 94.8 98.2 96.4 98.1 97.3 98.6 98.1 97.3 2 9.5 44.4 39.3 36.9 33.1 41.9 40.1 42.3 43.5 3 4.2 38.9 13.1 26.0 21.4 8.1 12.0 — 38.4 4 0.3 18.2 9.1 — — — — — 18.2 5 0.3 36.1 — — — — — — 39.5 6 0.1 50.0 8.3 — — 5.6 — — 50.0 uncl. — 0.2 — — — — — — — Total: 100.0 87.0 88.3 87.0 88.0 87.5 88.7 87.9 89.2 Costs: 1,878 1,720 1,890 1,772 1,857 1,671 1,764 1,545

Table 8: Best boosted ScoreF, CV10; based on f-measure for ScoreF = 2 (F-M2)

Algorithm F-M2 R2 (%) P2 (%) Total (%) Costs Vote: Rand.For., Dec.Stump, ID3 0.677 59.4 78.6 91.8 887 Vote: Rand.For., Dec.Stump, PART 0.663 54.4 84.6 92.0 867 Random Forest: Bagging 0.646 56.0 76.1 91.1 962 PART: AdaBoost 0.645 58.3 72.0 90.7 1,005 J48: AdaBoost 0.640 57.6 72.0 90.6 1,013 ··· SMO∗ 0.621 50.6 80.3 91.1 966 ··· ∗not boosted because of much higher CPU consumption readable. Because this set of programs as specialized software for Neural Net- was already very flexible, we found that works. using the underlying architecture could Each module reads and writes SXML be used to describe and automate gen- data, which contains the data to be eral data flows and preprocessing steps. processed as well as configuration infor- mation. Therefore, a workflow can be 3.2 Architecture described as configuration for a special module, which then reads its configura- Its main concept is to split function- tion, executes programs, and manages ality into several stand-alone programs the data flow between them. Because for well-defined tasks, which commu- the workflow module itself reads and nicate with each other using a flexi- writes SXML code, a workflow can eas- ble XML-based protocol, which we call ily be integrated within another work- SXML. These modules can be con- flow. Data import and export as well as nected via input and output streams to integration of existing applications are build a workflow or to exchange data handled this way using special modules. with other applications, as shown in Fig. 1. This allows us to build com- plex data flows, to store intermediate 4 Conclusion results at any time for further investi- gation, and to integrate the function- A large number of correctly predicted, ality of existing applications, such as but less serious accidents resulted in a WEKA for data mining and SNNS [10] relatively high total recognition rate.

186 Verkehrssicherheit GmbH Schwerin, University of Rostock, July 2004.

[3] L. Breiman. Random forests. Ma- chine Learning, 45(1):5–32, 2001.

[4] J. Cleve, U. L¨ammel, and S. Wis- suwa. Data Mining on Transaction Data. In Nejdet Delener and Chiang- Nan Chiao, editors, Global Markets Fig. 1: System architecture of Eddie in Dynamic Environments, Lisboa, 2005. GBATA.

[5] J. Platt. Sequential Minimal Op- The best result for the separation of timization: A Fast Algorithm for high and low score achieved a voting al- Training Support Vector Machines, gorithm with an f-measure of 0.677 for 1998. Microsoft Research Technical the high score. Such higher scores are Report MSR-TR-98-14. more important as they indicate situa- tions with matter of life and dead. [6] WHO. World Health Day: Road safety is no accident!, Better results may be possible by de- 2004. Press release for the creasing the ambiguity of the data com- World Health Day 2004, http: bined with a more specific subset of //www.who.int/mediacentre/ training data. Especially the score def- news/releases/2004/pr24/en/. inition is not mathematically derived and very subjective. Further, an asym- [7] S. Wissuwa. Data Mining und XML metric cost matrix with higher penalty – Modularisierung und Automa- for too low classified records should tisierung von Verarbeitungsschritten. produce better models. University of Wismar, 2003. Wis- mar Discussion Paper 12/2003, Our experiments show that many diffi- http://www.wi.hs-wismar.de/ culties in building data mining models fbw/aktuelles/wdp/. are caused by inadequate data, which had been raised and partially pre- [8] S. Wissuwa, U. L¨ammel, and processed without taking information- J. Cleve. XML-basierte Beschrei- theoretical aspects into account. bung von Data-Mining-Prozessen. In J. Biethahn, A. Lackner, and V. Nissen, editors, Information- References Mining und Wissensmanagement in Wissenschaft und Wirtschaft, Proc. AFN-Jahrestagung, pages 37–48, [1] C. Andersch and J. Cleve. Data G¨ottingen, June 2004. University of Mining auf Unfalldaten. University G¨ottingen. of Wismar, January 2006. Wis- mar Discussion Paper 01/2006, [9] I. H. Witten and E. Frank. Data http://www.wi.hs-wismar.de/ Mining: Practical Machine Learning fbw/aktuelles/wdp/. Tools and Techniques. Morgan Kauf- mann, San Francisco, 2nd edition, [2] D. Bastian and R. Stoll. Risikopoten- 2005. ziale und Risikomanagement im Straßenverkehr. Technical [10] Andreas Zell. Simulation Neuronaler report, Forschungsinstitut f¨ur Netze. Oldenbourg, M¨unchen, 2003.

187 The Digital Mechanism and Gear Library – a Modern Knowledge Space

Torsten Brix, Ulf Doring,¨ Sabine Trott, Rike Brecht, Hendrik Thomas Technical University of Ilmenau, PO Box 10 05 65 98684 Ilmenau, Germany [email protected]

Abstract Mechanisms and gears are an essential part of technical products in industry. However, the worldwide existing knowledge about mechanisms in theory and practice is mostly scattered and only fragmentarily accessible for users like students, engineers and scientist. It does not comply with today’s requirements concerning a rapid information retrieval. This paper presents the “Digital Mechanism and Gear Library” (DMG-Lib). In this interdisciplinary project of the Technical Universities of Ilmenau, Dresden and the RWTH Aachen a new digital, internet-based library (www.dmg-lib.de) is built to collect, preserve and present the knowledge of mechanism and gear science on a new level of quality. The DMG-Lib contains a wide range of digitalized information resources in very heterogeneous media types. The resources are enriched with various additional information like animations and simulations. Combined with innovative multimedia applications and a semantic information retrieval environment, the DMG-Lib provides an efficient access to this knowledge space of mechanism and gear science. 1. Introduction vide such access are promising (e. g. [3]) but by far insufficiently. Today in Germany In the middle of the 19th century in Ger- only 12 university institutes with focus on many the systematic research on mecha- mechanism and gear science are left. More nisms and gears started as a result of the and more didactical experiences and valu- fast growing engine building industry in this able training material are lost because ex- time [5]. Especially the theoretical reflec- perts on this field of application retire or tions and practical works of the German en- through economy measures specialized in- gineer F. Reuleaux [15] became important. stitutes are closed. Also old and unique liter- Mechanism and gear technology is today ature with only a few numbers left are quite still essential for industry and it will become difficult to access like the publications of even more important due to the introduc- Reuleaux. They have to be digitized and tion of new technologies like nanotechnol- online presented so that this still important ogy and corresponding new fields of appli- knowledge becomes accessible for the pub- cation. lic again. The existing knowledge about mechanisms A solution of these problems is the collec- in theory and practice is worldwide scattered tion and presentation of all relevant informa- in hand- and textbooks, photographs, solid tion resources for mechanism and gear sci- functional models, engineering drawings, ence in a centralized worldwide accessible etc. It is only limited and very fragmentarily platform [5, 8]. The research and education accessible and does not comply with today’s in various ingenious disciplines would cer- requirements concerning a rapid information tainly benefit from such a comprehensive li- retrieval [5]. However, industrial compa- brary of knowledge. nies and research institutes demand an ef- ficient access to the whole mechanism and In 2004 the development of the worldwide gear theory [7]. Existing activities to pro- accessible “Digital Mechanism and Gear Li-

188 brary” (DMG-Lib) was started to prevent 2. Concept of the DMG-Lib this sneaking lose of knowledge. The DMG- The DMG-Lib contains a vast amount of Lib is an interdisciplinary project of differ- very heterogeneous information resources ent departments of the Technical Universi- (see Fig. 1) like books, publications, func- ties of Ilmenau, Dresden and the RWTH tional models, gear catalogues, videos, im- Aachen. It is financed by the “German Re- ages, technical reports, etc. The original search Foundation” in the program “Scien- sources are procured, digitized and con- tific Library Services and Information Sys- verted to suitable data formats. tems” (project number: LIS 4-554975). The aim of this project is the collection, in- tegration, preservation, systematization and adequate presentation of the worldwide knowledge about mechanisms and gears. The gained results and experiences of this project will hopefully help in future other digital libraries in different application do- mains as well. The digital library is designed to satisfy the requirements of different user groups like engineers, scientists, teachers, students, li- Fig. 1: Examples of information resources brarians, historians and others. To offer in the DMG-Lib users a wide variety of opportunities for re- trieval and utilization the digitized resources The information resources can be accessed are extensively post-processed and enriched worldwide on the DMG-Lib internet por- with various information like animations, tal. This simplifies the access and distri- meta-data, references and constraint based bution of these information resources, but models. The focus is not only on textual does not directly enhance a goal-oriented us- documents, images and animations. Also age and retrieval for solutions of technical functional models are digitalized, which ex- tasks in research and industry. Furthermore ists in thousands of unique models with no the common storage method for knowledge, or only very limited access for the public. mainly in static texts and images, does not This huge amount of available heteroge- comply with requirements concerning an ef- neous information resources in the DMG- ficient and fast information retrieval. The Lib implies a key challenge of this project: advantages of functional models for a bet- the implementation of an efficient, uniform ter understanding of complex construction and user-satisfying information retrieval [8, and function principles are well known. To- 14]. day the necessary techniques are available In the following section the concept of the to provide an easy access to such helpful DMG-Lib project is introduced. Afterwards demonstration models for a broad public. the implementation of the DMG-Lib is pre- Computer based methods enable the gener- sented. Thereby the digitalization and en- ation of multimedia documents which de- richment of the information resources and scribe the function and other relevant at- the DMG-Lib online portal are discussed. tributes of mechanisms and gears. These can Also developed multimedia applications and easily be distributed and enriched with ex- a semantic information retrieval environ- tensive additional information [7]. ment for innovative ways of presentation Therefore in contrast to other digital li- and retrieval in the knowledge space are de- braries projects, which often provide only scribed. Finally, the paper concludes with a access to the digital raw data [4], in the summary and an evaluation of the project. DMG-Lib project the digitized resources

189 are extensively post-processed and enriched • Cross-platform presentation in the in- with various information like animations, ternet for different user-groups and dif- constraint-based models or various verbal ferent use-cases like research, product descriptions. Also further simulations and development or self-study analyses are possible, because constraint- based models can be used in external anal- • Development of information retrieval ysis, synthesis and optimization systems. systems, which allow a structural selec- Such approaches are necessary to move from tion and type syntheses of mechanisms a static to a dynamic problem oriented sup- and gears ply of knowledge for a wide rage of applica- • Support of automated access options tion domains and user requirements. for the library content using various An overview of the complex production applied descriptors or meta-data (e. g. workflow for the identification, digitaliza- OAI-PMH service) tion, enrichment, storage and presentation • Support for researchers and developers of information resources in the DMG-Lib is during the development of solutions for displayed in the following figure (see Fig. 2). special synthesis or optimizing prob- lems

3. Implementation of the DMG- Lib For the implementation of this ambitious concept a consequent cooperation of infor- mation, computer and usability scientists as well as engineers, librarians and experts of mechanism and gear science is necessary. This is the only way to collect, enrich and Fig. 2: Production workflow in the DMG- present the complex domain specific hetero- Lib geneous information resources according to user requirements. Based on the vast amount of available het- 3.1. Enrichment of the information erogeneous information resources in the li- brary and the extensive enrichment, the resources DMG-Lib is able to provide an efficient The following information sources are digi- retrieval as well as various utilization op- tized and integrated in the digital library: tions for users. Following these considera- tions several additional aims of the DMG- • Literature relevant for mechanism and Lib project can be derived: gear technology (monographs, journal articles, etc.) from different libraries and private collections • Constraint based modeling of mecha- nisms and gears as base for generation • Solid mechanism and gear models of of further description forms [7] the TU-Ilmenau, the TU-Dresden and the RWTH Aachen • Supply of descriptions of mechanism • Images and slides of gears available in and gear knowledge in various forms the project partners archives to ensure a flexible, adaptive and long term usability (verbal, images, • Technical drawings (outlines, technical constraint-based descriptions, 2D and blueprints, technical principles and cal- 3D animations) culation instructions)

190 • Training materials of the departments involved in the DMG-Lib project

The literature sources are usually scanned with 300 dpi resolution and 256 grayscales and are saved as TIFF files. For the scanned resources meta-data according to the Dublin Core standard are stored in the production database [1]. In addition the documents are classified according to technical aspects. Fig. 3: Enhancement of videos by an over- For further processing of the digital raw laid simulation-based animation data a layout and text analysis is neces- sary. For the identification of the phys- tions provide rich information for various ical structure (text blocks, images etc.) search criteria for example the number of el- as well as the individual characters in ements of the gear. The analysis of the sim- the documents the commercial software ulation results provides further information ABBYY-Finereader is used. The software describing the function of the gear like the is embedded in a self developed applica- transmission behavior. This functional in- tion framework called AnAnAS (Analyse- formation is important for the implementa- Anreichungs-Aufbereitungs-Software). tion of a problem oriented information re- Other applications developed in the DMG- trieval. Lib project identify the logical structure To the individual models, animations, im- (headlines, labels of figures etc.) more ages and literature resources experts can and more automatically. The storage of attach further meta-data like detailed de- the meta-data in AnAnAS is based on the scriptions, cross-links and other annotations. METS-Standard [2]. Different meta-data This information will be edited either in the are added to the documents like administra- AnAnAS system during the processing of tive (who scanned the document, document the digital raw data or in special designed in- source), descriptive (e. g. Dublin Core) and terfaces directly in the production database. structural (connection between the content A first version of the production database and other meta-data like figure references). was developed using MySQL and content is The result of the structural and layout anal- now continually added. In June 2006 the ysis is the identified logical structure of the DMG-Lib portal included about 30 books, document. This information can be used in 700 demonstration models, 45 bibliographic further processing steps like the automated entries and more than 40 enhanced images generation of links and tables of contents as and videos. However in the production well as in the ranking of full text search re- database over 900 documents and 400 per- sults. sons relevant to the DMG-Lib are listed. In For the enrichment of the scanned docu- the next years thousands of resources will be ments an animation generator was devel- provided in the portal. oped which allows the simulation and the variation of drawings, images and models in 3.2. DMG-Lib Online Portal an easy and fast way (see Fig. 3). The portal is the internet based communica- An export to CAD and special analysis soft- tion and presentation interface between the ware systems will be available as well. Base user and the DMG-Lib (see Fig. 4). For for the export and the animation genera- an user adequate design and implementation tion is a special XML based file format in an evaluation of the usability was performed which the description of the displayed gear which is oriented on the Usability Engi- is stored [7]. These abstract model descrip- neering Lifecycle developed by Deborah J.

191 Mayhew [11]. According to this method a The timeline application gives users a mul- requirement analysis and expert interviews timedial overview of important persons, in- have been carried out to develop a concep- ventions and publications in the historical tual model of the DMG-Lib portal. development of mechanism and gear sci- In March 2006 the prototypic online portal ence. Users will be able to directly ac- on www.dmg-lib.org was activated. It cur- cess corresponding information resources, rently serves as a platform for usability tests. for example available books of selected per- Beside the interactive search option in the sons in the library. web portal the content of the DMG-Lib can Beside traditional browsing and retrieval be accessed with an OAI-PMH web service methods these applications provide alterna- as well. tive ways to access the knowledge stored in the library. Prototypes of these applications are integrated in the DMG-Lib portal and are currently tested by the user community.

3.4. Semantic Information Re- trieval A further field of research is the retrieval in heterogeneous information resources us- ing different mechanism and gear hierar- chies like the structural system of Reuleaux [15] or other classification systems of well- known publications (e.g. [6]). Fig. 4: DMG-Lib portal A visualization and an efficient navigation over these different categories of gears could 3.3. Multimedia Applications of help users to get a systematic overview over the huge amount of existing mechanism and the DMG-Lib gear constructions. However, the identifi- Parallel to the internet portal interface other cation and modeling of these classifications interactive multimedia applications are de- and relations between the different technical veloped like the multimedia timeline (see terms are quite complicated, because differ- Fig. 5) and the virtual mechanism and gear ent opinions of experts and authors have to museum. be considered. To solve this problem semantic web tech- nologies can be used. With the help of Topic Maps, as a special kind of seman- tic networks, the knowledge of mechanism and gear science can be generalized and explicit modeled in a semantic meta-layer [12, 13]. With the extensive descriptive power of Topic Maps, all relevant concepts and relations between the concepts of this application domain can be modeled. Ad- ditionally, valid contexts, alternative names and other relevant semantic information can be included. Furthermore each concept in the semantic meta-layer is linked to all rel- Fig. 5: Timeline of mechanism and gear de- evant information resources available in the velopment library (see Fig. 6).

192 Based on the semantic information and with the help of SIREN the structuring and the retrieval in the available heterogeneous in- formation resources of the DMG-Lib can be enhanced. 4. Conclusion In this paper the DMG-Lib project is pre- sented, a digital and interactive library for mechanism and gear science. Aim of this project is the collection, preservation and Fig. 6: Semantic meta-layer suitable presentation of the worldwide exist- ing knowledge about mechanisms and gears. Outstanding features of the digital library With the help of this semantic meta-layer the are the powerful and user-oriented internet different hierarchies can modeled and visu- portal and the integration of a high amount alized. This enables a user to decide which of very heterogeneous information resources structuring system he wants to use for navi- relevant for this field of application. The ex- gation. tensively post-processing and enrichment of Currently a Topic Map based “Semantic the digital data with various additional infor- Information Retrieval ENvironment for dig- mation like animations or constraint-based ital libraries” (SIREN) is developed to sup- models is also important. Combined with port the complex development process of the the development of new interactive multime- semantic meta-layer and the information re- dia applications and a semantic information trieval process. SIREN consists of three pro- retrieval environment, the DMG-Lib pro- totypical systems, which are developed as vides users with an innovative access to the part of the DMG-Lib: stored knowledge in the library. The DMG-Lib project is an example for a • TMwiki (Topic Map Wiki) [10, 16] en- modern knowledge space, which tries to sat- ables a collaborative development of isfy the users’ needs for an efficient access to semantic meta-layers in a wiki environ- required information as one of the key tasks ment. in our today’s information society.

• TMV (Topic Map Visualizer) [10] pro- References vides an user-friendly interface for vi- sualization, presentation and navigation [1] The Dublin Core Metadata Initiative. in the semantic meta-layer, a graphi- http://dublincore.org (2006- cal topic-based definition of informa- 20-06), 2006. tion needs and the presentation of the [2] Metadata Encoding and Transmis- search results in the semantic context. sion Standard. Library of Congress. • MERLINO (Method for extraction and http://www.loc.gov/stand retrieval of links for occurrences) [9, ards/mets (2006-06-20), 2006. 16] is able to identify relevant infor- [3] Web Resource of the Kinematic mation resources for a defined infor- Models for Design Digital Library. mation need automatically. The proto- http://kmoddl.library. type identifies relevant information re- cornell.edu (2006-06-20), 2006. sources by querying the database of the digital library based on the knowledge [4] Christine Borgman and L´aszl´oc Kovcs stored in the semantic meta-layer. Ingeborg Sølvberg and editors.

193 Proceedings of the Fourth DELOS 3Aht%3Amerlinoabstract_ Workshop Evaluation of Digital Li- eswc2005_greece.pdf (2006- braries: Testbeds, Measurements, 20-06), 2005. and Metrics. Budapest, Hungary, June 6-7, 2002. http://www. [10] Bernd Markscheffel, Hendrik Thomas, sztaki.hu/conferences/ and Dirk Stelzer. TMwiki – a Col- deval/presentations.html laborative Environment for Topic (2006-20-06), 2002. Map Development. In Poster and Demos of the 3rd European Se- [5] Torsten Brix, Ulf D¨oring, and Sabine mantic Web Conference (ESWC Trott. DMG-Lib – ein moderner Wis- 2006). Budv, Montenegro,June 10- sensraum f¨ur die Getriebetechnik. In 14, 2006. http://topic-maps. Knowledge extended: die Koopera- org/lib/exe/fetch.php? tion von Wissenschaftlern und Biblio- cache=cache&media=member% thekaren und IT-Spezialisten, 3. Kon- 3Aht%3Atmwiki_abstract_ ferenz der Zentralbibliothek (KX’05), eswc2006_budva.pdf (2006-20- November 2-4, 2005, Julich,¨ Germany, 06), 2006. pages 251–262. J¨ulich:Schriften des Forschungszentrums J¨ulich, 2005. [11] Deborah J. Mayhew. The Usability En- gineering Lifecycle – A Practitioner’s [6] Kammer der Technik. Begriffe Handbook for User Interface Design. und Darstellungsmittel der Mechanis- San Francisco:Morgan Kaufmann Pub- mentechnik. Suhl: Kammer der Tech- lishers Inc., 1999. nik, 1978. [12] Jack Park and Sam Hunting. XML [7] Ulf D¨oring, Torsten Brix, and Topic Maps: Creating and using topic Michael Reeßing. Application of maps for the web. New Jersey: Pearson Computational Kinematics in the Education Inc., 2003. Digital Mechanism and Gear Library DMG-Lib. Special issue on CK2005, [13] Steve Pepper. The TAO of Topic Maps. International Workshop on Compu- http://www.ontopia.net/ tational Kinematics. Mechanism and topicmaps/materials/tao. Machine Theory, 41(8):1003–1015, html (2006-20-06), 2000. August 2006. [14] Edie Rasmussen. Information Re- [8] George A. Goodall. A Time for Digital trieval Challenges for Digital Li- Libraries. http://www.deregu braries. In Proceedings of the 7th In- lo.com/facetationpdfs/ ternational Conference on Asian Dig- aTimeForDigitalLibraries. ital Libraries (ICADL’04), December pdf (2006-20-06), 2003. 13–17, 2004, Shanghai, China, pages 93–103. New York: Springer, 2005. [9] Bernd Markscheffel, Hendrik Thomas, and Dirk Stelzer. Merlino – a Proto- [15] Franz Reuleaux. Lehrbuch der Kine- type for semi automated Generation matik. Braunschweig:Vieweg, 1875. of Occurrences in Topic Maps using [16] Alexander Sigel. Report on the Open Internet Search Engines. In Poster and Space Session. In Lutz Maicher Demos of the 3rd European Semantic and Jack Park, editors, Charting the Web Conference (ESWC2005). Her- Topic Maps Research and Application aklion, Greece, 29. Mai - 01. June, Landscape, pages 270–280. Berlin: 2005. http://topic-maps. Springer, 2005. org/lib/exe/fetch.php? cache=cache&media=member%

194 The Use of GIS in Avalanche Modeling

Donna M. Delparte, PhD Candidate University of Calgary, 2500 University Drive NW, Calgary, Alberta, T2N 1N4, Canada, [email protected]

Abstract Geographic Information Systems have the potential to aid in modeling snow avalanche terrain in order to identify areas of varying hazard. At the heart of terrain modeling is the digital elevation model (DEM). For this research a higher resolution DEM has been generated for selected areas within Glacier National Park in the Rogers Pass area of British Columbia. An avalanche database of the Rogers Pass highway corridor has been input into GIS using 3D mapping techniques and contains detailed information based upon expert knowledge. Terrain parameters have been extracted from the known avalanche paths to identify similar terrain in lesser known areas in the backcountry. Start zone and runout zone models aid in the process of identifying avalanche terrain. Visualization of the initial results has been integrated into Google Earth with the goal to allow potential backcountry users to recognize the snow avalanche hazard and reduce their level of risk.

1. Introduction

Start Zone: Snow avalanches are a significant natural 30E- 50E hazard that impact roads, structures and threaten human lives in mountainous Track: terrain. Modeling of terrain in a GIS is 15E- 30E typically done by utilizing a digital elevation model (DEM). To evaluate what Runout Zone: terrain parameters are most likely to 0E- 15E contribute to high frequency, known avalanche paths are typically documented and evaluated for these key factors [2-5]. Figure 1. Anatomy of an Avalanche Path [1] An avalanche path describes terrain boundaries of known or potential avalanches [1]. An avalanche path is 2. Methodology characterized by: a start zone, track and runout zone (Figure 1). The starting zone A high resolution DEM for the study area of an avalanche path is where avalanches was created using a procedure of digital begin with a slope ranging from 30˚ - 50˚, stereo photogrammetry. This technology the track is where avalanches achieve allows a GIS operator wearing stereo maximum velocity and mass with slopes goggles to resample surface heights using between 15˚ - 30˚ and the runout zone is stereo-airphoto pairs. The horizontal where avalanches begin to decelerate and resolution of the DEM obtained from the the deposition occurs. 1:30,000 stereo-airphotos was 9 m. Vertical accuracy is within 1 m. Topographic This paper highlights the process of parameters such as slope, surface area, building a DEM and the methods for aspect, runout length, and profile shape avalanche terrain modeling using a GIS. were derived from the DEM to evaluate Initial results are featured along with a start zone, track and runout characteristics discussion of visualization for the general that are most likely to contribute to backcountry recreationist. avalanche frequency as determined by

195 known records from over a 130 avalanche Equation 2 represents a simplified two paths along the Rogers Pass highway. The parameter ($) model. results are useful in conducting and improving avalanche hazard mapping as To apply terrain parameters for risk well as a tool for risk assessment. Terrain assessment, the information derived from parameters from avalanche paths in the the GIS analysis has been mapped database along the highway corridor can be according to the Parks Canada, Avalanche used to identify comparable areas in the Terrain Exposure Scale (ATES) introduced backcountry. in 2004 to reduce risk of backcountry users in National Parks and act as guidelines for The DEM facilitated analysis of start zone backcountry use by custodial groups. The terrain and ground characteristics as well as terrain based guidelines from ATES were runout parameters based on the alpha-beta used in the GIS to develop maps displaying statistical approach (Figure 2) first used by backcountry areas based upon simple, the Norwegian Geotechnical Institute [6] . challenging and complex terrain.

Top of Start Zone 2 3. Initial Results

100 m ( Actual Path Profile $ point H (Slope Angle = 10˚) Figure 3 highlights the information Best fitted $ ) " ) Parabola y = ax2 + bx + c digitized in stereo based upon avalanche y” = 2a expert confirmation. The thick bounding

Figure 2. Alpha-Beta Statistical Model for line indicates the boundaries of the snow Avalanche Runout avalanche revised to accurately represent the path, the dashed line represents the The Norwegian approach involves four typical centerline of avalanche travel down specific terrain parameters that are the path, the shaded grey area indicates the represented in the regression equation start zone for the path and the dotted line below (1). indicates the typical start of avalanche runout or area where the avalanche begins " = 0.92$ - 7.9 x 10-4 H + 2.4 x 10-2 Hy’’2 to slow and deposit its snow load. + 0.04 (1)

" = 0.92$ - 1.4˚ (2)

Equation 1 represents a four parameter model: " – represents the angle between the maximum runout and the top of the slide H – is the vertical distance from the top of slide to the base as estimated by the best fitted parabola $ – represents the angle from a line of sight where the slope is 10˚ to the top of the slide y’’ – is the curvature of the slope based on a Figure 3. Avalanche Path 7 second derivative of a second degree polynomial Geographic Information Systems allow for 2 – is the average inclination of the starting the extraction of terrain parameters from the zone as measured within the first vertical DEM. Table 1 shows the extraction of 100 m of the path terrain parameters from the start zone of

196 avalanche path 7. For each avalanche path fracture line near the top of the slope. The along the highway corridor, parameters are thick bounding line indicates the path extracted for the entire path, the start zone, perimeter and the dashed line represents the the track and the runout. In the profile main gulley through which the avalanche graph (Figure 4) the line segment to the path traveled. triangle highlights the start zone portion of the entire path. The table reveals the start zone terrain parameters with measurements in degrees and meters. It is interesting to note that the mean slope for the start zone falls close to the commonly recognized 38E as a slope that is highly typical for avalanche activity.

Avalanche Path 7 Profile 2300 2100 End of Start Zone Figure 5. Start Zone of Avalanche Path 1900

1700 7 ) m 1500 n (

io 1300 evat l E Start of Runout 1100 4. Discussion- 900 GIS Visualization 700

500 0 500 1000 1500 2000 2500 3000 Distance (m) The capacity for Geographic Information Figure 4. Profile of Avalanche Systems (GIS) to produce maps visualizing Path 7 terrain features and improvements in web- based mapping software has lead to an Table 1. Terrain Parameters from interest in providing avalanche risk based Avalanche Path 7 maps for the public, particularly in the form of on-line solutions. Google Earth is Terrain Parameter Measure becoming a ubiquitous product that is Surface Length 756.8 m expanding its reach to the general Internet Flat Length 572.8 m user. Google Earth has the capacity to Length Ratio 1.3 m allow GIS overlays. It is online solutions Mean Slope 37.6 ˚ such as Google Earth ( Min Slope 0.2 ˚ Figure 6) that have the capacity to allow Max Slope 79.9 ˚ backcountry enthusiasts to view avalanche Low Elevation 1730.2 m terrain and aid in their decision making. High Elevation 2151.9 m Average Elevation 1947.7 m Range 421.7 m Cum Z 442.4 m Surface Area 230024.5 m2 Flat Area 149535.2 m2 Surf ratio 1.5 Aspect NE

In the winter of 2005-2006, Avalanche Path 7 experienced a significant avalanche event. A photo was taken subsequent to the event (Figure 5). The photo reveals the crown Figure 6. Google Earth

197

5. References

1. Mears, A.I., Snow Avalanche Hazard Analysis for Land-Use Planning and Engineering. Vol. Bulletin 49. 1992, Denver, Colorado: Colorado Geological Survey, Geological Survey, Dept. of Natural Resources. 54. 2. Maggioni, M., U. Gruber, and A. Stoffel. Definition and characterisation of potential avalanche release areas. in Proceedings of the 2001 ESRI International User Conference. 2001. San Diego. 3. Maggioni, M. and U. Gruber, The influence of topographic parameters on avalanche release dimension and frequency. Cold Regions Science and Technology, 2003. 37: p. 407-419. 4. Gruber, U. Using GIS for avalanche hazard mapping in Switzerland. in Proceedings of the 2001 ESRI International User Conference. 2001. San Diego. 5. Gruber, U. and S. Sardemann. High frequency avalanches: Release area characteristics and runout distances. in International Snow Science Workshop (ISSW). 2002. Penticton, BC, Canada. 6. Lied, K. and S. Bakkehoi, Empirical calculation of snow-avalanche run-out distance based on topographic parameters. Journal of Glaciology, 1980. 26(94): p. 165-177.

198 A Case Study for Knowledge Externalization using an Interactive Video Environment on E-learning

Jochen Felix Boehm Jun Fujima Saarland University Hokkaido University Information sciences Meme Media Laboratory Saarbruecken, Germany Sapporo, Japan [email protected] [email protected]

Abstract E-learning technology today should provide different means for the user to interact with the presented knowledge. In this case study an electronic instruction manual based on an instructional movie is presented and enhanced with interactive components. The integration of novel technologies – IntelligentPad and DVDconnector – is presented in this paper and the new possibilities for E-learning by these technologies is shown. The instruction manual application takes advantage of dynamic high quality videos and the possibility to interact with the learning environment. Also a new dimension of interaction will be shown. In standard E-learning scenarios the user normally acts with the information system and the environment seperately. Using a special kind of pad, the system is now able to receive information via a sensor from the environment and can store this information as knowledge based items in form of IntelligentPads within the E-learning application. This paper shows a case study for externalization of knowledge in an interactive video environment based on an E-learning application designed as an electronic instruction manual for putting strings on a violin and tuning-up the violin. The user obtains all necessary information to correctly perform each step and can verify the correctness via a sensor interface showing the frequency of the violin. 1. Introduction

Nowadays E-learning is still used in ways of mediating knowledge from the teacher to Information System the student, often only in textual form and Environmental sometimes supported with audio-visual me- generated Pad dia. The main forms of interaction given to the user are means of testing sequences User created Pads System generated Pad to verify if the content has been success- Sensor fully mediated and means of choosing his Generator own learning path. However, in standard E- learning scenarios the user has to carry out Environment these tasks separately, since the information system and the external environment are in- dependent. Figure 1: Model for representing knowledge In order to provide a flexible and interactive objects in information systems E-Learning environment, various kinds of knowledge should be externalized as knowl- edge objects which can be accessed or ma- 199 nipulated by the user. If the learning system of the video and place it in the TunerPad is more advanced, new knowledge objects window. This pad resembles a knowledge can also be created dynamically through object. The exact frequency of the sound some types of processes. Figure 1 shows the of the corresponding string is contained in model for representing knowledge as knowl- it. This object is provided by the system. edge objects in information systems. Each The user also will be able to play his violin knowledge object can be created through and record the played sounds with the use of three different processes. One can be gen- a microphone to verify that the tuning pro- erated automatically by the system. Another cess has been correct. The frequency of the can be created by the user himself through played sound will be displayed on the Tuner- operations with the system. The third pos- Pad, using the microphone as a sensor input sibility is the generation of knowledge ob- device to pass that information from the en- jects by processing input data from the exter- vironment into the system. nal environment through some kind of sen- First, the basic concept of this project shall sor devices. be described, while drawing a parallel be- This paper shows a case study for represen- tween complex instructions and E-learning tation of knowledge in an E-learning appli- applications. Subsequently, we will focus cation. The user can obtain necessary in- on the design and the technical execution of formation through knowledge objects in dif- the project, as well as a description of the ferent processes, such as interactive video, used technology. In a conclusion, we dis- online/offline contents described by HTML cuss other possibilities of using the modules or interactive object connected to sensor de- of content and future work. vices. 2. Project scenario In this project, we use DVDconnector tech- nology for providing an interactive video Basic idea of the project is to create a mul- environment with dynamical contents and timedia manual for fastening the strings of Meme Media [6] technology for represent- a violin and tuning the instrument. While ing interactive knowledge objects in in- the instruction should appeal to the user, it formation systems. DVDconnector pro- should also offer the possibility of verifying vides DVD-based high quality videos which the learning success, and designed in such a have the ability to access online/offline con- way as to allow its use for other applications, tents. On the other hand, Meme Media is thereby ensuring its sustainability. a media technology for externalizing var- Conventional manuals describe in longer or ious kinds of intellectual information re- shorter text sequences the steps necessary sources and distributing them among com- to achieve a certain result. Illustrations fa- puter users. The integration of DVDconnec- cilitate understanding of complex manuals tor and Meme Media technology provides and, in addition, offer the possibility of an flexible and interactive environments for E- overview of the respective step. Thus sec- learning. tioned, the user is always able to again re- Following the basic approach of learning by turn to the manual from any point and, with imitation [2], the authors developed an E- the help of the illustrations, he is able to at Learning module about putting on strings on least visually verify whether the steps taken a violin and tuning-up the instrument. In have been performed correctly. In complex addition to providing a step-by-step instruc- manuals, the purely textual form is mostly tion via audio-visual and textual informa- supplemented by graphs. This can help the tion, the user is also able to interact with user in mentally simulating real processes the video employing so called Hotspots in [4]. Mental simulation, however, has its lim- the video. These Hotspots enable the user its. Complex processes require the iterative to draw the sound of a tuned-up string out completion of approximation cycles [5]. 200 Often, if the application level is more com- applied in earlier projects, thereby ensur- plex, a training film is applied, e.g., in dis- ing that the technical prerequisites for this playing an instruction for using a certain project can be met [2]. According to software. Usually, this is realized with the the authors, three aspects were elementary help of Flash,- Director,- Authorware-films in preparing this project. First, a high- or other software tools, which enable the se- resolution video should depict the fastening quential procedure being visualized as true of the strings and the tuning of the instru- to original as possible. ment. This video should then be enriched The authors therefore want to explore in this with additional information. Second, the en- case study, in which way known technol- vironment of the application should give the ogy can be applied in order to make manu- user the possibility to interact with the dif- als more appealing for the user, while at the ferent levels, in particular with the video. same time creating added value compared to Third, the user should be provided with the traditional formats for manuals. Videos are possibility to verify his success in reality. helpful in depicting complex chains of ac- 2.1. Design of the application envi- tions and convey information in a compact format. Furthermore, the interaction with a ronment certain medium furthers the interest of users. Since the instruction video constitutes the While conventional manuals are usually not main component of this manual, the video used at all, the authors want to rouse the and its place in the instruction environment interest of users with the help of appealing had to be designed first. Detailed informa- videos and possibilities for interaction. tion on the design and the creation of the video can be found in chapter 2.2. For this purpose, the authors apply high- resolution film material on a DVD, which is equipped with hyper-video elements (see further [3]) provided by the DVDconnec- tor Technology by the firm micronomics. Moreover, so-called IntelligentPads [6], de- veloped by the Meme Media Laboratory of Hokkaido University, offer the possibility of practical interaction with the instructional film. The mix of different intertwined media and applications in this project make parallels to E-learning scenarios apparent. Not only is an instruction offered for each step, but it is Figure 2: Instruction environment provided in modules and enriched with fur- ther information. As can be seen in figure 2, the video win- The use of instruction videos basically cor- dow (1) has a central position in the instruc- responds to the use of teaching videos. By tion environment. It is supplemented by a imitating the action depicted in the video, navigation window (2), as well as a window the user experiences learning success, due for additional information (3), a TunerPad- to his being able to both analogously per- window (4), which serves as work station for form the actions depicted therein and – in transferring the sounds for tuning the violin, this project – also acoustically and techni- an optional window for e-business possibil- cally verify the correct performance of each ities (5) and a glossary window (6), which step. shows specific terminology while the video The DVDconnector and the IntelligentPad is running. The DVDconnector by micro- Technology have already been successfully nomics comes into operation when creating 201 this instruction environment. It offers to is termed Pad. The user can combine two seamlessly link different media and applica- Pads to form one new Pad and thus equip tions and represents a basis for the essential it with complex functions. Each Pad has hyper-video component (see also [1]). an interface, which displays its status data and is termed Slot. If one Pad is inserted 2.2. Creating the instruction video into another Pad, the user can define a ”Par- ent - Child Relation” between both Pads Generally, a manual follows sequential and link their Slots, thereby establishing a steps, which create a final product. Conse- functional connection between the Pads. In quently, the video is designed likewise, i.e., this way, Pads can be connected to one an- the single steps from fastening the strings to other to form different multimedia docu- tuning the instrument are sequentially docu- ments and applications. Unless otherwise mented. defined, combined Pads can always be sep- The video is subdivided into 2 units. The arated and changed again. Recently, the first part comprises the fastening of the Meme Media Technology was enhanced for strings, while the second part consists of the the reuse of web-based resources, such as tuning of the instrument. Both units con- web documents and web applications. Web stitute modules which can be used indepen- resources can thus be modified and their dently from each other in other projects. functions can be combined by ”Copy and In addition to the audio commentary, which Paste”. comments the different steps, the sound is a significant component for tuning the instru- 3.2. Tuning function ment. The authors therefore deem the high For tuning the violin, two Pads were im- quality of the audiovisual media to be essen- plemented, the SoundPad and the TunerPad. tial. We hence picked a DVD as medium, in The SoundPad represents a sound bearing a order to be able to provide for sound and vi- certain frequency, which is determined as a sion in best quality. To generate the medium, value in the Frequency Slot. If the user now we used a Mini DV camera and converted clicks on a SoundPad, a defined sound is the film data files into DVD format with the played-back (e.g. the sound of the A-string). help of a WinAVI video converter. The TunerPad recognizes the pitch recorded by the microphone and visualizes informa- 3. Applied technologies tion on the sound, such as volume, basic fre- In order to allow for the interaction of the quency and the deviation between the origi- user with the manual, the Meme Media nal frequency and the desired frequency. A Technology and the DVDconnector Tech- TunerPad transfers these values to the Input, nology was applied. The DVDconnector Frequency and Difference Slots. If the user constitutes the framework and the basis for combines a SoundPad with a TunerPad, the the hypervideo function, while the Intelli- SoundPad transfers its frequency to the Tar- gentPad Technology allows for linking dif- getFrequency Slot of the TunerPad and thus ferent applications with the DVDconnector. defines the desired frequency for the tuning procedure. If the user clicks on a string of 3.1. Meme Media Technology the violin in the video, a SoundPad appears, which contains as information the sound fre- The Meme Media Technology [6] allows quency of the tuned violin. The User can for editing, distributing and linking different now drag this Pad to his work station in the program resources with one another. TunerPad window. Upon clicking on it, this The IntelligentPad Technology [6], [7] is SoundPad produces the sound of the tuned part of the Meme Media application. In- string and the user can tune the respective telligentPad presents every functional com- violin string with the help of this sound. In ponent as a two-dimensional object, which order to tune the string precisely, he can 202 now copy the SoundPad into the TunerPad. Additional information on the instrument There, the difference in frequency between can be selected by the user at any time, ei- the predefined sound of the SoundPad and ther via the glossary promptly displayed or the sound recorded via microphone is dis- via so-called Hotspots in the video. Hotspots played. In this way, the user can adjust the are links in the film, which make it possi- frequency of the sound played by him/her to ble to design the film as a hyper video (for a the preset sound of the SoundPad. detailed description see [1]). Via an online- component, the user can always be supplied with new information. Further, the producer can integrate the por- tal for customer support, as well as a dy- namic and continually updated FAQ-page for the customer into the manual (see (5) in Fig.2). Further, while appearing in the video, a glossary window displays terminol- ogy, which can be selected in the form of a popup-window explaining the respective terms in more detail. The user can choose between different view Figure 3: TunerPad and SoundPad options for the instruction video, so the tex- tual information can also be hidden and the video displayed in full view. If the user 3.3. DVDconnector Technology prefers to have the instructions in textual The DVDconnector Technology by micro- form and the video as background, the view nomics enables seamless linking of web ap- options allows to exchange the video win- plications and media. An elementary part dow with the information window. If the of this project is the possibility to link In- strings are fastened on the violin, the video telligentPad Technology in the video via demonstrates the correct tuning of the vi- Hotspots. Further, offline and online content olin. The user now has the possibility to can be combined and thus enable the user draw Pads from the played strings and place to access current information without expe- them onto his ”work station” in the Tuner- riencing a loss of quality of the supplied me- Pad window. These Pads play-back the pre- dia due to compression during data transfer. cise sound of the string tuned in the video. A detailed account of the applied DVDcon- nector functions is given in: ”Highlights of Algorithmic Learning: Technischer Report.” [1]. 4. Project implementation This project shows the fastening of a violin with new strings with the help of a high- resolution video on a DVD medium. The user can thus exactly follow the necessary steps and, if need be, immediately reproduce them. Furthermore, the order of the differ- ent steps is supplied in textual form. The user can therefore gain a fast overview of the steps to be taken and select a video sequence Figure 4: Drawing SoundPad from a string analogous to the textual instruction. 203 Thus, the user has all four basic sounds of Acknowledgement the violin on his work station (g, d’, a’ und e”) and can tune the strings of his own vi- The authors want to thank Anne-Kathrin olin with the help of a microphone and a Feld, violin teacher and player for many pad-interface, as well as the four tune Pads years, whose expert knowledge in the field in the TunerPad window. Aside from just of violins was authorative for providing receiving information, the user can interact needed information for the creation of this with the manual and, what is more, ver- E-learning scenario about violins. ify whether he has performed the described steps correctly. References [1] Jochen F. Bohm.¨ Highlights of algo- 5. Outlook rithmic learning: Technischer report. In www.http://www.fit-leipzig.de/ Publika- This case study in whole or parts of it can tionen/Berichte neueSerie/Report FIT easily be placed in other projects or E- 2005-05.pdf, 2005. learning modules due to the modular design [2] Jochen F. Bohm,¨ Jun Fujima, Klaus P. of the content. The topic of this project al- Jantke, Aran Lunzer, and Yuzuru lows wide reuse of the content, since neither Tanaka. Novel technology integration technology nor material of the described ob- for learning by inmitation. In SITE jects will change. This also guarantees the 2006, Orlando, FL, USA, March 20-24, sustainability of the manual. The authors 2006. are convinced that this type of manual moti- [3] P. Grieser, G. und Griegoriev. Erstel- vates the user and guarantees greater success lung und integration eines hypervideos when applied. Since the approach is generic, in das e-learning system damit ein er- it can be transferred to other areas of appli- fahrungsbericht. In In Jantke, K.P. Wit- cation of operation manuals. tig, W. und Herrmann, J. (Hrsg.) Von E-learning bis e-Payment 2003. Das In- This small case study also serves as a first ternet als sicherer Marktplatz. Tagungs- step to researching sensor input devices in band LIT03, 24.26. Leipzig. Akademis- E-learning scenarios using IntelligentPad che Verlagsgesellschaft AKA, S.269277, components. In this project the source for 2003. inputing information from the environment into the system was a well defined single [4] Kriz S. Cate C. Hegarty, M. The role of mental animations and external an- source. With the use of a microphone it imations in understanding mechanical was possible to represent the frequency of systems. In Cognition and Instruction the played violin strings by the user in the 21(4). S.325360, 2003. system, thus providing the possibility to verify if the instrument has been tuned-up [5] N. Miyake. Constructive interaction and the iterative process of understand- correctly according to the instructions ing. In Cognitive Science 10. S.151177, given in the E-Learning application. In 1986. further projects the authors may concentrate on implementing further means to input [6] Y. Tanaka. Meme Media and Meme environmental information into a learning Market Architectures. IEEE Press and system and in doing so taking a step to Wiley-Interscience., 2003. creating a basis for a virtual laboratory [7] T. Tanaka Y. und Imataki. Intelligent- using the DVDconnector and IntelligentPad pad: A hypermedia system allowing technologies. functional composition of active media objects through direct manipulations. In Proc. IFIP89, San Francisco, USA, S.541546, 1989.

204

01 Rüdiger Grimm, „Vertrauen im Internet – Wie sicher soll E-Commerce sein?“, April 2001, 22 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

02 Martin Löffelholz, „Von Weber zum Web – Journalismusforschung im 21. Jahrhundert: theoretische Konzepte und empirische Befunde im systematischen Überblick“, Juli 2001, 25 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

03 Alfred Kirpal, „Beiträge zur Mediengeschichte – Basteln, Konstruieren und Erfinden in der Radioentwicklung“, Oktober 2001, 28 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

04 Gerhard Vowe, „Medienpolitik: Regulierung der medialen öffentlichen Kommunikation“, November 2001, 68 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

05 Christiane Hänseroth, Angelika Zobel, Rüdiger Grimm, „Sicheres Homebanking in Deutschland – Ein Vergleich mit 1998 aus organisatorisch-technischer Sicht“, November 2001, 54 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

06 Paul Klimsa, Anja Richter, „Psychologische und didaktische Grundlagen des Einsatzes von Bildungsmedien“, Dezember 2001, 53 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

07 Martin Löffelholz, „Von ‚neuen Medien’ zu ‚dynamischen Systemen’, Eine Bestandsaufnahme zentraler Metaphern zur Beschreibung der Emergenz öffentlicher Kommunikation“, Juli 2002, 29 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

08 Gerhard Vowe, „Politische Kommunikation. Ein historischer und systematischer Überblick der Forschung“, September 2002, 43 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

09 Rüdiger Grimm (Ed.), „E-Learning: Beherrschbarkeit und Sicherheit“, November 2003, 90 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

10 Gerhard Vowe, „Der Informationsbegriff in der Politikwissenschaft“, Januar 2004, 25 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

11 Martin Löffelholz, David H. Weaver, Thorsten Quandt, Thomas Hanitzsch, Klaus-Dieter Altmeppen, „American and German online journalists at the beginning of the 21st century: A bi-national survey“, Januar 2004, 15 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

12 Rüdiger Grimm, Barbara Schulz-Brünken, Konrad Herrmann, „Integration elektronischer Zahlung und Zugangskontrolle in ein elektronisches Lernsystem“, Mai 2004, 23 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

TU Ilmenau - Institut für Medien- und Kommunikationswissenschaft Am Eichicht 1, 98693 Ilmenau 13 Alfred Kirpal, Andreas Ilsmann, „Die DDR als Wissenschaftsland? Themen und Inhalte von Wissenschaftsmagazinen im DDR-Fernsehen“, August 2004, 21 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

14 Paul Klimsa, Torsten Konnopasch, „Der Einfluss von XML auf die Redaktionsarbeit von Tageszeitungen“, September 2004, 30 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

15 Rüdiger Grimm, „Shannon verstehen. Eine Erläuterung von C. Shannons mathematischer Theorie der Kommunikation“, Dezember 2004, 51 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

16 Gerhard Vowe, „Mehr als öffentlicher Druck und politischer Einfluss: Das Spannungsfeld von Verbänden und Medien“, Februar 2005, 51 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

17 Alfred Kirpal, Marcel Norbey, „Technikkommunikation bei Hochtechnologien: Situationsbeschreibung und inhaltsanalytische Untersuchung zu den Anfängen der Transistorelektronik unter besonderer Berücksichtigung der deutschen Fachzeitschriften“, September 2005, 121 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

18 Sven Jöckel, „Digitale Spiele und Event-Movie im Phänomen Star Wars. Deskriptive Ergebnisse zur cross-medialen Verwertung von Filmen und digitalen Spielen der Star Wars Reihe“, November 2005, 31 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

19 Sven Jöckel, Andreas Will, „Die Bedeutung von Marketing und Zuschauerbewertungen für den Erfolg von Kinospielfilmen. Eine empirische Untersuchung der Auswertung erfolgreicher Kinospielfilme“, Januar 2006, 29 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

20 Paul Klimsa, Carla Colona G., Lukas Ispandriarno, Teresa Sasinska-Klas, Nicola Döring, Katharina Hellwig, „Generation „SMS“. An empirical, 4-country study carried out in Germany, Poland, Peru, and Indonesia“, Februar 2006, 21 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

21 Klaus P. Jantke & Gunther Kreuzberger (eds.), „Knowledge Media Technologies. First International Core-to-Core Workshop“, July 2006, 204 S. TU Ilmenau, Institut für Medien- und Kommunikationswissenschaft, [email protected]

TU Ilmenau - Institut für Medien- und Kommunikationswissenschaft Am Eichicht 1, 98693 Ilmenau