Lecture Notes in Computer Science s6

Lecture Notes in Computer Science 1

Subject-Oriented Work: Lessons Learned from an Interdisciplinary Content Management Project

Joachim W. Schmidt, Hans-Werner Sehring, Michael Skusa, and Axel Wienberg

Technical University TUHH Hamburg, Software Systems Institute, Harburger Schloßstraße 20, D-21073 Hamburg, Germany

{j.w.schmidt, hw.sehring, skusa, ax.wienberg}@tuhh.de

Abstract. The two broad cases, data- and content-based applications, differ substantially in the fact that data case applications are abstracted first before they cross any system boundary while for content cases it is the system itself which has to map application content into some data-based technology. Through application analysis and software design we are aware of the difficulties of such mappings. In an interdisciplinary project with our Art History colleagues working in the subject area of “Political Iconography” we are gaining substantial insight into their Subject-Oriented Working (SOWing) needs and into initial requirements for a generic SOWing platform. In this paper we outline the project, its basic models, their generalization as well as our initial experiences with prototypical SOWing implementations. We compare our work with various XML-related activities. In this paper we emphasizes the conceptual and terminological aspects of our approach and sketch some of the technical requirements of SOWing platforms.

1 Introduction

As a result of advanced and extensible database technology now being available as of- f-the-shelf products, a substantial part of database research and development work has generalized into work on models and systems for multimedia content management. R&D in content management includes a range of models and systems concentrating on services for the following three lines of work: - content production and publication work using multimedia documents; - classification and retrieval work based on document content; - management and control of such work for communities of users differentiat- ed by their roles and rights, interests and profiles, etc.

The work reported in this paper is based on an interdisciplinary project with a partner from the humanities with strong semantic and weak formal commitment. Our project partner specializes on work in icon- and text-based content form the subject Lecture Notes in Computer Science 2

area of “Political Iconography”. This content is organized as a paper- and drawer- based subject index (PI-“Bildindex”, BPI) and is used for art history research and education. Iconographic work has a long tradition in art history, dating back into 19th century “Christian Iconography”, and is based on integrated experience from three sources: - Art: multimedia content; - Art History: process knowledge. - Library Sciences: subject-oriented content classification and retrieval. In our context we use the notion of subject-orientation very much in the sense of library science, as, for example, stated by Elaine Svenonius: “In a subject language the extension of a term is the class of all documents about what the term denotes, such as all documents about butterflies.” This understanding differs substantially from natural language, where “the extension, or extensional meaning, of a word is the class of entities denoted by that word, such as the class consisting of all butterflies” [Elaine, “The Intellectual Foundation of Information Organization”, MIT Press, 2000]. And both understandings are in clear contrast to the semantics of identifiers in programming languages and database models.  The interdisciplinary project “Warburg Electronic Library (WEL)” models and computerizes BPI content and services, and the WEL prototype allows interdisciplinary experiments and insights into multimedia content management and applications. The overall goal of the WEL project is - the generalization of our subject-oriented working experience, - a work plan for R&D in subject-oriented content management, and - a generic Subject-Oriented Working environment (SOWing environment). Currently, many contributions to such R&D are based on XML as a syntactic framework which provides a structural basis as well as some form of implementation platform. The main reasons for XML’s leading position are its strong structural commitment and its semantic neutrality. Unfortunately, much of the XML-based R&D contributes to the above three lines of work only individually. Examples include [ ], [ ], …[ ]. Successful content management requires, however, that the above services are not provided in isolation along individual lines of work but in a coherent and cooperative working environment covering the entire space spanned by all three dimensions of work. As a result, the main reason why XML-based work on content management falls short of the goal can be explained as a corollary of XML’s strength (see above): its weak semantic and exclusively structural commitment. The paper is structured as follows. Section two introduces the two projects involved, the “Bildindex Political Iconography (BPI)” and the “Warburg Electronic Library (WEL)”. In section three the WEL model is generalized towards a generic “Work Ex- plication Language”. A systems view of the WEL prototype is described in section four and the first contours of a generic Subject-Oriented Working environment (SOWing platform) are outlined. Related work, in particular work in the XML context, is discussed in section six. The paper concludes with a short summary and a presentation of future work. Lecture Notes in Computer Science 3

2 Warburg Electronic Library: An Interdisciplinary Content Management Project

The development of the currently predominant data management models was heavily influenced by application requirements from the business and banking world and their bookkeeping experience: the concepts of record, tabular view, transaction etc. are obvious examples. Data model development had to go through several generations – record-based file management, hierarchical databases and network models – until the relational data model reached a widely accepted level of abstraction for database structuring and content-based data operation. For traditional relational data management we basically assume that content is “values of quantified variables” from business domains operated by transactions and laid-out as tables with rows and columns. The question now is: - how can we generalize from data management to the area of content management where content domains are not ”just dates and dollars”, content operation goes beyond “debit-credit transactions” and content layout means multimedia documents? - what are key application areas beyond bookkeeping which help us understand, conceptualizing and finally implementing the core set of requirements for multimedia content management in terms of domain modelling, content work as well as content (re-) presentation?

Figure 1: Working with the Index for Political Iconography in the Warburg house, Hamburg (old Abb. 2.1)

The work presented here is based on an interdisciplinary r&d-project between computer science and art history, the “Warburg Electronic Library” project. The application area was chosen because of art history’s long-term management experience with content of various media. The project itself is grounded on extensive material and user experience from the area of “Political Iconography”.

2.1 Subject-oriented Work in Political Iconography

Political iconography basically intends to capture the semantics of key concepts of the political realm under the assumption that political goals, roles, values, means etc. requires mass communication which is implemented by the iconographic use of images. Our partner project in art history, the “Bildindex zur Politischen Iconographie (BPI)”, was initiated in 1982 by the art historian Martin Warnke [ ] and consists of a bout 1.500 named political concepts (subject terms, “Schlagworte”) and more than 300.000 records on iconographic works relevant to the BPI. In 1990 Warnke’s work was awarded by the Leibniz-Preis, one of the most prestigious research grants in Ger- many. Lecture Notes in Computer Science 4

Iconography has a long tradition, at least back into the 19th century (“Christian Iconography”), and is based on contributions and experience from three sources: - from art: handling and presentation of multimedia content; - from the library sciences: content classification and retrieval; - from art history research and education: process knowledge and methodolog- ical experience.

Figure 2: BPI “Bildkarte” St. Moritz (media card) describing art work by attribute aggregation

Starting from this experience, BPI-work essentially relies on an art historian’s knowledge on (documents which refer to) political acts in which images play an active role. Art historians interpret “acts” as encompassing aspects of - “projects” (who initiated and contributed to an act? the when and where of an act? etc.); - “ products” (what piece of art did the project produce? on what medium? place of current residence etc.); and finally, the - “concepts” behind the act (what political goals, roles, institutions etc. to be addressed? what iconographic means to be used by the artist? etc.). On this knowledge basis, BPI work isolates political concepts and names them individually by subject terms – e.g., by “rulers”, “princes”, “popes”, “rulers on horseback”. Subject term semantics is methodologically captured and systematically represented in the BPI by the following steps 1. designing a conceptual, prototypical (mostly mental) model for each subject term, e.g., a prototypical equestrian statute; 2. giving value to the relevant variables or facets of such prototypes by reference to the art historian’s knowledge of “good cases”, i.e., political acts with an iconographic dimension. Each such variable or facet is represented by a BPI-entry (“Bildkarte”, “Textkarte”, Videokarte” – “media cards” etc.) which holds a description of a “good case” for that facet, see for example, St. Moritz, see fig. 2. 3. collecting all BPI entries describing some prototype into a single extent (“Bildkartenstapel”, …, “pile of media cards”, see fig. 3) thus defining the semantics of a subject term. Additional fine structure may be imposed on subject term extents (order, “neighborhood”, named subextents, general association/navigation etc.); 4. maintaining a (completion) process aiming at a “best possible” representation of the subject area at hand, i.e., - “representative” subject terms covering the subject area at hand; - “qualifying” prototypes for each subject term; - “complete” sets of facets for prototype description; - “good” cases for facet substantiation. This makes it quite clear that the BPI is by no means just an index for accessing an image repository. The BPI uses images only in their rather specific role as icons and for the specific purpose of contributing to the description of cases and thus to the se- Lecture Notes in Computer Science 5 mantics of subject terms. In this sense, images represent the iconographic vocabulary of BPI-documents just as keywords contribute to the linguistic vocabulary of text documents.

Figure 3: BPI subject term semantics (e.g. equestrian statue) by media cards classification (image, text, video etc.)

The BPI has essentially two groups of users: - a few highly experienced BPI-editors for content maintenance and - various wide user communities which access BPI-content for research and education purposes. Being implemented on paper technology, the traditional BPI shows severe shortcomings. - conceptually: the above attributes “representative”, “good”, “complete”, etc. are highly subjective and, therefore, “completion semantics” is hard to meet even within a “single-person-owned” subject index; - technically: severe representational limitations are obvious and range from single subsumption of BPI elements to online and networked BPI access. In the subsequent section we give an outline of two aspects of the “Warburg Electron- ic Library” project which approach these conceptual and technical shortcomings by an advanced digital library project which, as a prime application, is now hosting the “In- dex for Political Iconography”.

2.2 A Subject-oriented Working Environment: Warburg Electronic Library

Viewed from our Computer Science perspective which shifted in recent years from basic research in “persistent database programming” towards r&d in “software systems for content management (online, multi media, …)”, the WEL project addresses a range of highly relevant and interrelated content application issues: - content representation by multiple media: images, texts, data, …; - content structuring, navigation and querying, content presentation; - content work based on subjects and ontologies: classification, indexing, …; - utilization of different referencing mechanisms: icon, index, symbol; - cooperative processes and projects on multi media content in research and education. The WEL is an interdisciplinary joint project between the art history department of Hamburg University (research group on Political Iconography, Warburg House) and the software systems institute of the Technical University, TUHH, Hamburg. It started in 1996 as a 5-years project and will be extended into an interdisciplinary r&d- framework between several Hamburg-based institutions. For a short WEL overview we will concentrate on two project contributions: - semantic modelling principles for WEL-design; - personalized digital WEL libraries based on project-specific prototypes and their use in art history education. Lecture Notes in Computer Science 6

2.2.1 WEL Semantic Modelling Principles

Basically, the WEL design is based – and so is already the BPI design – on the classical semantic data modelling principles [ ], [ ]: aggregation, classification, generalization / specialization and association/navigation (see figs. 2 and 3).

Figure 4: Media card associated with (multiple) subject terms and with information on classification work

However, it is important to note, that the semantics of subject classes and their entries originate form different semantic sources: - object semantics: seen from a data modelling point of view, subject class entries are also entities of some object classes in the sense of object-oriented modelling. However, since a subject class extent may be heterogeneous because its entries may describe documents of different media – texts, images, videos etc. Therefore, subject class entries viewed as objects may belong to different object classes – text, image, video classes etc. - content semantics: furthermore, all content described by the entries of the same subject class shares some semantic key elements. All BPI documents referring, for example, to the subject class “ruler” make use of key icons such as swords, crowns, scepters, horses etc.; similarly, the text documents associated with a certain subject class contain overlapping sets of subject-related key words. Such sets of key elements capture essential parts of content and, thus, of subject term semantics. Note that specialization of subject classes goes along with extension and union of subject-related key element sets while generalization relies on reduction and in- tersection. - completion semantics: in section 2.1 we referred to the soft constraint of achieving best possible subject extents as “completion semantics”. Although this may be considered more as an issue of class pragmatics than class semantics it implies formal constraints on subject class extents. Since users of subject definitions act under the assumption that the subject owner established a subject extent which represents all relevant aspects of the owners subject prototype, any change of that extent is pri- marily monotonic, i.e., may improve a subject definition by adding or replacing subject class elements. Therefore, references to subject class entities should never become invalid.

Fig. 4 shows a media card for St. Moritz together with the (multiple) subject terms to which St. Moritz contributes. The example also links St. Moritz to details of the classification process by which this card entered the WEL. This information is essential for the realization of project-oriented views on subject terms, or reference libraries: - thematic views: projects usually concentrate on sub areas of the all encompassing “Index for the Political Iconography”; - personalized views: they cope with the conceptual problems with “completion semantics” mentioned above. Initial experiences with both of these aspects are outlined in the subsequent section. Lecture Notes in Computer Science 7

2.2.2 Subject-oriented Work in Art History Education

A key experience of the WEL project relates to the two dimensions of subject-oriented work: subject-orientation as a thematic view (customization) and as an individual- ized view (personalization). Speaking in terms of digital libraries, both dimensions are approached by the WEL concept of “reference libraries” (“Handbibliothek”), which are essentially SOWing environments which are customized according to the requirements of individual projects or persons.

Figure 5: Thematic and personalized subject views for the “Mantua” seminar

Fig. 5 outlines the use of customized SOWing environments in a art history seminar on “Mantua and the Gonzaga” [Köln]. The general BPI (“Warnke-owned”) is first customized into a subject index for the “Mantua and the Gonzaga” seminar project. The main objective of the individual student projects in that seminar is to further cus- tomize the seminar index, structurally and content wise, and to produce, for example, a project-specific subject index for topics such as “Studiolo” or “Camera picta”. Pub- licizing the final subject content by some form of media document - traditional report or interactive web side - constitutes another seminar objective (see fig.6).

Figure 6: The “Studiolo” subject index publicized as a multimedia document

3 Towards a Generalized WEL-Model for Subject-oriented Work

The prime experience gained from our interdisciplinary WEL project is a deeper insights into the intertwining of essentially four groups of entities and into their working relationships: - subjects: their definition and exploitation; - documents: their evaluation and production; - records for descriptive document abstraction; - work with and as documents and subjects. Since all four notions are quite generic, there is ample space for generalizing the models and systems for subject-oriented work way beyond our initial WEL approach. In the subsequent sections we will outline an extended SOWing model and platform and re-interpret, in a sense, the acronym WEL from “Warburg Electronic Li- brary” to “Work Explication Language”, very much in the sense of the definition of ontology as “a theory regarding entities, especially abstract entities to be admit- ted into a language of description” [Webster 1996].

Lecture Notes in Computer Science 8

3.1 An Overview on SOWing Entities: Works, Documents, Records and Subjects

Conceptually, the services of our SOWing platform are essentially based on four kinds of work-related entities (“SOWing entities”) which are straight-forward gen- eralizations of the corresponding WEL entities: - “work cases” as the basic abstraction of the acts and entities of interest in a domain; - “case documents” which are abstract or physical entities which report on work; - “case records” which capture the essence of work case documents; and - “subject terms” which , based on such case records, structure the domain, capture its semantics, and give access to its documents and works. SOWing entities – work cases, documents, case records, and subjects - may vary widely by their degree of formalization ranging from rather informal entities (to be mediated to humans) to fully formalized structures (to be read and processed by machines). Furthermore, SOWing entities may be consumed and produced by external tools thus requiring a general interfacing technology to the world outside the SOWing platform. Subsequently, we discuss the four kinds of SOWing entities in more detail.

3.1.1 Towards a generic Notion of “Work”

Central to the content management support provided by our SOWing platform is a generic notion of “work”. Our work concept is based on the WEL experience and is characterized and modeled by three groups of properties (see also fig. 2): - work as “project”, i.e., the circumstances under which work is performed; - work as “product”, i.e., a work’s result, and - work as “concept”, i.e., the conceptual idea behind a work. Fig. 7 relates these three work characteristics by a “work triangle” diagram. The generalization of WEL work examples such as the one given by fig. 2 is obvious.

Figure 7: WEL work structure (work triangle)

Examples of WEL work are presented in fig. 8. The lower part (fig. 8.1) models a specific work case by which a member of the Gonzaga family residing in Mantua during the 15th century asked the artist Mantegna for a painting addressing the issue of virtues and sins. Mantegna chose the goddess Minerva as the central motive and presented her while expelling the sin out of the garden of virtue. This 15th century work case may be reported by some published work document, most probably, however, the case is just part of an art historians body of knowledge about the Italian renaissance.

fig. 8.2 Description work: Mr. B. describing …. Lecture Notes in Computer Science 9

fig. 8.1 Production work: Mantegna painting …

Figure 8: WEL work graph: production work (fig. 8.1) and description work (fig. 8.2)

The upper part of fig. 8 depicts a second work case, similarly structured but rather different in nature. Fig. 8.2 shows some WEL work case done by a Mr. B. when produc- ing a work document description of the Gonzaga-Mantegna-Minerva work case and entering it into the WEL. While there may be only partial knowledge on the renaissance work case – essential only Mantegna’s picture survived – the WEL work case being supported and observed by the SOWing platform can get an arbitrary extensive SOWing coverage. This reflective capability is probably the most powerful and unique aspect of our SOWing approach. Reflection provides the basis for customization and personalization, for self-description and profiling and for all kinds of guiding and tracing services.

======

3.1.2 On Work Documents Documents: narration of work. They will contain information of the three aspects of work: abstraction, project, and product. Furthermore, documents are created in the SOWing platform (s.b.).

Range of Document Use Today, both “binary” files meant to be viewed by human beings as well as semi-structured content is called document. We intend to allow this whole range, dealing with documents which are just byte sequences, have a known format, …, elements which have a known meaning, …, are input for a software component. Apart from this technical range, the actual representation depends on - the source from which the document comes - the receiver for which the document is created

======

3.1.3 On Document Description

Subject term semantics is extensionally defined by document descriptions. Intention- ally their semantics is partially captured by key icon shared by such extents. The subject term ”ruler”, for example, is iconographically characterized by the base icon set {crown, sword, scepter etc.} or textually by a corresponding set of key words. Both, extensional and intensional semantics are related by the fact that each entry into the subject term’s extent shows a characteristic profile over the key icon set of its subject term. A specific image document contributing to the semantics of “ruler” will not Lecture Notes in Computer Science 10 display all the base icons, i.e., the entire set {crown, sword, scepter etc.} but a characteristic subset of it, probably in a prominent position within the image.

3.1.4 Subject Definition Work

Fig. 9 gives an overview over the subject-oriented work and its support by a SOWing platform. It relates production and description work to subject work.

Figure 9: An Overview of Subject-Oriented Work

Most of the production work on which the BPI is based typically took place outside any computerized environment – usually the referenced iconographic work dates back several centuries. Production work may, however, also mean the production of documents about the original iconographic work and such document work may well contribute to or profit from a SOWing environment. Description work, being definitely a matter of the SOWing platform, is, for example, supported by services which provide reference to other subject terms covered by the index. For the Gonzaga/Mantegna/Minerva case it is, for example, quite enlightening to capture the reason why Gonzaga ordered that picture by referring to some subject term “virtue” or its generalization “political objectives” to which Machiavelli’s work “Il Principe”, a successful handbook for renaissance rulers, has contributed. As already mentioned above, a major group of SOWing services is based on the fact that “work is a first class citizen” in the SOWing world. This implies that work while being supported by the system is automatically identified, described and associated with work-related subject terms (project time and participants, product media and ar- chives etc.). Evaluating such terms provides the basis for substantially improved work support, e.g., work session management, work distribution, protection, personalization etc. ======

3.1.3.2 Basic Subject Structure and Semantics Specialization relationship on subjects, forming hierarchy (subject index). Instantiation relationship between subject and document descriptions. Multiple subsumption of descriptions (see[ ]): - in different indices of same type - in indices of different types Semantics of instantiation made up of different aspects: - object structure: technically, descriptions have a type (in the sense of data type); all aspects of the type theory of PLs / DBs apply here - icon sets: dual to the object structure, icon sets also determine a semantic type; the more specific a subject term becomes the more there icon sets extend. Example: the icon set of “ruler as god” extends that of “ruler” Lecture Notes in Computer Science 11

- completion semantics: the extent of a subject is supposed to be complete (at the current time, for the user who created it); thus, the extent is a prototype for the subject; adding / removing descriptions from the extent adjusts the subject’s semantics

3.1.3.3 Specific Semantic Issues Subject indices are not only used to capture domain knowledge on the community’s (research) topic, but also “secondary knowledge”, especially - organizational structure of the community, including knowledge on users, rights granted to them … - information on documents not related to their content, e.g., (kind of ) origin, document types, quality, … - … Each subject index is assigned a semantics (how to interpret specialization and instantiation). index types: - content-related - community-related - action-related

3.1.3.4 Range of Subject Use Dimension, like for documents: from informal (ontological, only understandable by humans) to typed (machine-readable, e.g. class hierarchy).

======

3.2 Predominant SOWing Relationships

3.2.1 Subject Index <-> Subject Index

3.3.1.1 Inter-Subject Relationships Two (or more) subjects can be related - directly - by sharing descriptions - via a relationships on the description layer The relation of subjects can have a special semantics, e.g., if a description which has been placed under a subject describing a user group and a subject for a class of access rights might be interpreted as: the group may use functionality which requires to have that right on the description. Q: co- / contra-variant use of subjects, e.g., in the above example: right co-variant (you have right r1 if you have right r2 with r2 <: r1), user group contra-variant (have right r if one of the groups you belong to has right r) Lecture Notes in Computer Science 12

3.3.1.2 Personalization Coexistence / collaboration within in the community: personalization of subject indices. Take existing index as starting point, refine, contribute. Within sub-communities / along organizational structures: direct exchange of subjects and descriptions. - personalization: “copy” subjects and descriptions into private view -> ability to change data and structure - contribution: offer personalized items to next level in the organization; delivery of results; decision about inclusion of the changes (small organizational unit: vote; large organizational unit: librarian); if changes are rejected eventually birth of a new community Communication across communities / to other systems: documents (see next sub- section).

3.2.2 Document <-> Subject Relationship established via description. Distinction of closed world of a community and the open public. - documents for communication between community and “public” (other communities or unknown / unsupported tools) - subjects to support community work (work? only within community, or also between communities?)

3.3.2.1 Subject Definition Subject is defined by documents. - Completion semantics; ever changing semantics of subject. - Extensional semantics. - Find prototypes for subjects.

3.3.2.2 Document Classification Document is classified by subjects. - To share knowledge, to find documents, to cluster documents, … - Subject term first. - Each index reflects one dimension.

3.3.2.3 Document Creation From Subjects Documents are “narrative”: they add enough context information to subjective knowledge structure so that they can be communicated. To communicate with other communities, subject indices are finally published as documents by - adding “narration” - adding layout The amount of additional information to be inserted into the document depends on the receiver. The greater the “organizational distance” between sender and receiver is, the less can be assumed about the shared knowledge, and the more has to be put into the document. Lecture Notes in Computer Science 13

Usual (?) way: stepwise transformation of subject index into document structure adding properties to extents (linear order, spatial ordering, …). Final step: layout; usually outside the scope of subject-oriented work (except cases where the layout is part of the content). For publication, the work support enables the SOWing platform to provide additional meta-data from - community, users (individuals with known interests and expertise) - work support (knowledge about steps carried out for structure from which document is created) In the sense of - “semantic web” (machine-readable descriptions for software agents) - work-document-relationships (see Svenonius; definition which documents refer to the same work)

3.2.3 Work <-> Document Interpretation and publication of documents is work. Work is communicated using documents? Again, these documents can be directed at human beings (“Handlungsanweisungen”) or software (e.g., it can be a workflow plan to be enacted by a workflow engine). Like for any document, it is necessary to know the organizational context of the receiver to decide how much information to put into the document.

3.2.4 Work <-> Subject Work on the subject level recorded/described by work description. Work is classified/organized by subjects (by viewing it as a document and provid- ing a description for it?). Relationships of domain, organization, and work reflected by relationships between subjects / descriptions of the respective indices. Document creation (“programming”) like above?

3.3 SOWing Entities and External Tool Support

- on the external use of SOWing entities - XML as gluing language and interface technology: see below

4 The SOWing System

4.1 WEL Prototype Experience

The WEL prototypes developed so far are the major source of experience on which the SOWing approach is based. The initial version was based on the Tycoon-2 persistent programming environment, an object-oriented, higher-order programming language with orthogonal persistence [], []. Using a Tycoon-based acquisition and cata- Lecture Notes in Computer Science 14 loging tool STS researchers, art historians and many students from both departments digitized ten thousands of icons and transformed file cards from their physical representation into a digital one. This work is still in progress. The descriptive data in the card catalog were entered and revised by art historians via a web-based editor. With the evolution of Java from a small object-oriented programming language for embedded devices to a mainstream programming language for networked applications the WEL system was moved to Java-related technology. The current WEL version is based on a commercial content management system [ ] which is entirely Java-based and resides on top of a relational database management system (CMS). In this second prototyping phase we were especially interested in understanding those features of commercial CMS relevant to the subject-oriented management of multimedia content. It turned out that although the CMS provided several abstractions useful for our WEL data model important relationships between and inside the subject classes could not be treated as first class objects. Furthermore, the modeling facilities of the CMS are still not rich enough to meet the needs of the art historians. On top of this basic storage layer the application logic of the WEL was implemented. Graphical web-based user interfaces were developed to enable users to browse through the subject index, create personal indices and collect references to documents accessible via the global index. In parallel the generic system was adopted to maintain an index created for the management of concept relevant in advertising. This was and still is carried out in cooperation with a commercial agency from the advertisement industry.

4.1.1 Subject Terms and Indices Commercial CMS technology concentrates on editorial processes for more or less isolated units of work (e.g., news articles) with only little association to other entities. The WEL requires, however, in addition to storing and retrieving the document itself an extended functionality for the embedding of documents into the context of the subject index with all its relationships. CMS and the underlying database support only a specific set of navigation methods predominant in the news industry, and our prototype ran into performance problems as soon as users left those default navigation paths. As a consequence, we met the domain specific access requirements by an additional application layer on top of the CMS. A main purpose of this layer was to provide a caching infrastructure for access procedures that did not match the navigation restrictions of the CMS. Another draw- back of contemporary CMS is their lack of data model extensibility which would allow schema changes while the system is running. For these reasons the CMS data model is used only as a physical store from which the WEL data model of the is strictly separated. The mapping is performed by an additional application layer.

4.1.2 Personalization The personalization capabilities of the system include the integration of personal content, the establishment of relationships between public subjects and personal descriptions as well as between public and private ones. Lecture Notes in Computer Science 15

Future work in this field will focus on how to enable cooperative index maintenance and how to give concepts like subject relationships first class status. Important for the user is how changes made by other users are communicated and how conflicting changes can be resolved. As already mentioned in section 3.1. subjects are also used for the management and validity checking of users and their rights and roles. Access rights are realized as descriptions and classified by subjects. The uniform handling of users and rights as ordinary SOWing entities introduces a high degree of flexibility for the development of rights management which is not fully exploited yet.

4.2 Requirements for a Generic SOWing Platform

Based on the experience with our prototypical WEL we will now derive requirements for a generalized SOWing platform.

4.2.1 Work Descriptions One main task identified in the WEL project is the creation of descriptions about existing work. Each description can consist of different parts, e.g.: - data about the product (publication date, publication method, current location etc.), - data about the project behind the product (production history and reasons etc.), - data about the concepts shown in the product. These descriptions are inherently personalized. The work is described using a terminology which the author of the description is used to. The context in which a work is put into by a description usually depends on the background knowledge of the author not on the “knowledge” of the system. A user reading a description of another user can not be expected to interpret the description in the same way as it was intended by the author of the description. The system can assist in such a scenario by supporting a common vocabulary which is understood by all users. If a description is created using this controlled vocabulary, authors can ensure at least a certain common sense on what the work being described is about. As such a controlled vocabulary never can model all possible facets of a concept that come to an individual author’s mind authors need the possibility to either extend the vocabulary on their own or to add descriptions which do not fit into the common terminology. If the vocabulary is allowed to grow, community processes as described below will be used.

4.2.2 Subject Indices As the subject index plays an important role for the description work as described above the SOWing systems needs facilities supporting the maintenance and usage of the indices. Important questions and requirements in this context are: - How to organize the subject terms of a domain ? (e.g. tree, semantic net) - How to visualize such an index ? (e.g. list, tree, hyperbolic tree, map) - Support for annotating subject terms, adding new subject terms, modifying or deleting existing subject terms. - Encoding the differences between the common index and a personal index. - Merging of different personal indices with each other or with the global index. Lecture Notes in Computer Science 16

4.2.3 Subject Terms and Icon Sets While the system can store and represent subject indices it can not be expected to decide what a subject term is. So the SOWing system has to enable administrative users to identify such candidates for new subject terms and to integrate them into the existing common index. One way to do this is to acquire statistics about the annotations and extensions made to the index or in descriptions which use the index. If the same icons are used again and again, the system could suggest this icon to become part of the global classification system. A domain expert has to decide then, whether the icon detected should become a subject term in that sense that it represents a concept relevant to the community as a whole or whether it is just an indicator for the presence of another subject term. In the former case functionality is needed to update the global index in a controlled way, while in the latter case the parameterization of a classification engine has to be modified. Such a classification engine may serve as a component which sug- gests a conceptual classification of new descriptions based on the current “knowledge” of the system. An important task in this context is the elimination of syntactical differences (typ- ing errors) and semantic ambiguities (synonyms). So in addition to the subject index itself a thesaurus and the corresponding management facilities are needed to remove semantic ambiguity from descriptions. A thesaurus provides an important functionality to map the terminology of a single user onto a common terminology. Those terms where no such mapping can be found can be treated as candidates for new subject terms then or at least as part of the extension of an existing subject term.

4.2.4 Personalization The preceding sections have shown already that personalization is crucial for the SOWing system in order to control the evolution of the common subject index and to restrict the freedom in creating new descriptions or modifications of the index only as little as possible. The forms of personalization can be classified into four levels of system support: - New content or descriptions: a common form of user-created are annotations. In the WEL we also allow to add new documents and descriptions as well as new subjects (top-level or as refinements of other personal subjects). - Structural amendments: The user can insert additional instantiation relationships, between a subject and a description, where each can be either public or private. - Modification of subjects or descriptions: the user can create copies from descriptions and subjects and modify those copies (implemented as copy-on-write). When owning copies a user receives notifications about changes of the source object so that she can decide whether to merge the changes into her copy. - Schema personalization: when creating descriptions, user do not want to adhere to a rigid schema; they want to invent new icon names and to assign values to them. A possible solution is to make descriptions semi-structured. They have a structured and an unstructured part, where the structured part is defined in a schema and an unstructured part is addible by the user. Lecture Notes in Computer Science 17

4.3 System Design Issues

From both the requirements (4.2) and the experience with our special purpose system mentioned above (4.1) the following design can be derived.

[ 4.3.1 Organization, Ownership, Relationship of Objects and Community  evtl. streichen oder entsprechende Bemerkungen aus anderen Abschnitten hierher verschieben ? ] see “new content or descriptions”, section 4.2.3 Registry must know ownership to formulate queries etc. w.r.t. owner.

4.3.2 Relationships as First-Class Objects As the evolution of a complex subject index can not be predicted the SOWing system needs a flexible system not only to show semantic relationships but also to define new kinds of semantic relationships. So as a base document like an image or the description of an image can be used later in many different contexts the system needs to separate the base data from the relationships which combine the base data to express concepts on a different abstraction level. Relationships have to be treated as first-class objects the structure of which can be defined separately from any concrete instance of a relationship. If users are prevented from defining new kinds of relationships the system is in danger of getting domain dependent again. So during development special care has to be taken for separating the data which is related from the data describing the relationship between other data.

[4.3.3 Variants  schon von 4.3.4 hinreichend abgedeckt ?] for personalization (modification of subjects or descriptions, see 4.2.3) Subjects, descriptions, and relationships stored in registry which supports variants: at least one per user. Personalized variant hides public one? Notification about changes of public variant.

4.3.4 Flexible Data Model In order to avoid forcing authors to use the restricted vocabulary provided by the system for work descriptions, the SOWing system allows users to enter descriptions in a free-form mode, like in standard word processing or publishing software. But additionally the users can mark semantic entities using the common vocabulary provided by the system. Markups which do not exist in the common vocabulary are stored as parts of a personal terminology which can be merged with the common vocabulary later on in a cooperative process. In short the SOWing system delivers a “partial schema” of the application domain which can be used to refer to well-known concepts and which can be extended by individual users in a personalized way and which can be incrementally extended based on the description works of the individual users. The system should allow users to store and manipulate all their individual changes made in the system, share and combine their knowledge with other users and evolve the common subject index. Additionally the data model must take care for changes in Lecture Notes in Computer Science 18 semantics, e.g. interpreting a character string no longer as a title but as a reference to a document and how they affect other users. Especially changes in the global index and its global data model should be automatically communicated to other users if the changes affect the meaning of their personal collections of data.

4.3.5 Revisions Especially for structural amendments and modifications (see above) it is important to support revisions (as opposed to update in-place). The reason for this requirement is the that additions and modifications need not be right for updated objects. For example, if a user adds an annotation to correct a mistake in description and that description is corrected at a later point time, it would be wrong if the annotation still referred to that description. Therefore, revisions are kept for all entities, and relationships refer to specific revisions. Owners of entities referring to ones for which new revisions are created need to re- ceive notifications on the change to decide whether to change the reference point to the newest revision. Additionally, revisions are a tool to reconstruct the context which was valid while a certain work was created. This is important for work descriptions.

4.3.6 Work Support Besides the support in using a common vocabulary and evolving the subject index used in a certain context as described before some benefits for the individual worker have not been mentioned yet: work identification and tracking. These apply to ad-hoc work (someone just starts doing things) or planned work (with a defined goal, one which is predefined by a work modeler).

4.3.6.1 Work Identification The system can not only respond to queries of the user or enables the user just to evolve her terminology it can be used to mark and identify the possible contributions of an author while her work is still in progress. E.g. the system can present associations between the description work done so far and existing other works. So an author can recognize problems early like leaving the intended context of her work. On the other hand the system can derive associations, that were not intended by the author but prove to be useful later on. So an author can estimate the impact her work would have much better than before.

4.3.6.2 Tracking A description worker describes works found in the system. While doing this the system can observe him and generate data about the description work. This observation again can be viewed as a work in its own right. Processes logged in this way can be passed to process modeling software for visualization and can be used for later work optimization. Another application of these data will be the documentation of schema changes in order to control the evolution process. Lecture Notes in Computer Science 19

4.3.6 SOWing Environment for the above requirements

4.3.6.1 Personalized Descriptions Using XML XML because well-suited and because of integration of existing XML-based technology. We will motivate the architecture of the SOWing environment by an example taken from the Warburg Electronic Library. First we look at a the personalization of a description work. Imagine the user downloading a description as an XML document of the form shown in fig. 4.1. Bonaparte Crossing ... David, Jacques-Louis

Figure 4.1: A description from the WEL in XML format

Once the user copied the description into her personal environment, she is free to modify it at will. E.g., the user might end up with a document like the one shown in fig. 4.2. Here, the user performed three changes of different kind: - a value change: from “Bonaparte” to “Napoleon” - a type change: “artist” now is a reference, no longer a value - an additional feature has been added: “medium” The latter two changes exemplify the semi-structured nature of the descriptions: part of the personalized descriptions conforms to the schema the community chose for descriptions. Another part was added by the user, freely choosing a tag name, the elements content format, and the position where to integrate the new element. This is the kind of liberty the workers using a subject-oriented system expect to have – at least this demand was articulated by the art historians in the WEL project. Napoleon Crossing ... /artistdb/artist4368.xml Painting

Figure 4.2: A personalized description in XML format

When the community decides to integrate the user’s contribution into the community’s registry, it has to perform the data (and eventually schema) update. To find out about type changes, the server gets the description work description which is collected by work tracking as describing above. The list features introduced by users grows without control; the system just stores the additional features. Eventually, the community decides to change its description schema to reflect the commonly used additional features or to create new subject terms (see section 4.2.3). Lecture Notes in Computer Science 20

4.3.6.2 Semi-Automatic Derivation of Descriptions from XML Documents E.g., user submits the XML document report_on_napoleon.xml: In the Painting Napoleon Crossing ... the artist David, Jacques-Louis depicts Napoleon riding ...

Figure 4.3: A XML Document to be Described

Then the SOWing system will start analyzing the document and generate a document description: Napoleon Crossing ... David, Jacques-Louis Painting

Figure 4.4: Description Derived from Document

This description serves as the basis for the description workers task.

[4.3.6.3 Generative Approach – deleted because of space limitation] Data model for machine <-> conceptual model for user …

[4.3.6.4 XML-based Middleware – deleted because of space limitation] XML-based communication / integration …

5 XML for Subject-Oriented Work

In the previous chapters we have tried to develop a scenario for a software system which supports people in understanding concepts which are considered to be relevant inside a certain community and in completing the knowledge about these concepts while working with the system. If the knowledge maintained by such a system has to be communicated to other people or should even be used together with different systems, the question arises, how to represent the knowledge in a system independent manner. We expect that the Extensible Markup Language and standards based on this language could help improving the process of knowledge representation and knowledge interchange. Parallel to our work many different groups are working on standards based on XML for various purposes, e.g. for document description or knowledge representa- Lecture Notes in Computer Science 21 tion. But currently these standardization efforts lack a common application scenario, which could prove that not only one of these isolated standards is useful, but that especially the combination of two or more of them is worth being exploited. Our SOW- ing system can serve as such a common application scenario which could motivate these standardization groups to join their forces and to enable a new kind of applications.

5.1 XML as a common means of representing and communicating knowledge

The Warburg Electronic Library enables art historians to - Collect iconographic evidence for a certain political concept, - Classify existing artifacts according to the subject terms of the index, - Define or detect new concepts not covered by the current version of the index. This results in an incrementally improving understanding of the subject terms, i.e. the ontology, which makes up the vocabulary of this special research area, here art history. Within such a well-defined community vocabularies will not be identical but at least similar across different research groups. So communication between those groups should be rather easy. But as soon as cooperations become interdisciplinary, the problem arises, that people from different communities have to work together. Those communities may have completely different ways of working, thinking and speaking about things. Even if different research groups have basically the same background knowledge, their “working sets” of subject terms may differ. When cooperation becomes important, standards are needed not only to represent the knowledge of one user group, but to communicate knowledge to other user groups. If the two (or more) groups do not share a common vocabulary, additional methods are needed to either - Map one group’s subject index onto another one or - Find a unified common vocabulary for knowledge exchange or - Transport both the works of one group and those parts of the group’s knowledge that are necessary to understand the works. We will take a look at current research activities that could enable members of a special community to “leave” the scope of their closed community, using the Extensi- ble Markup Language as a “transport layer” for communication. As a side effect such a standardized representation can serve not only as a tool for data exchange between different working environments, but as an intermediate representation inside the same working environment, which may already consist of different applications, too. The following sections show, in which way different standardization efforts from the XML community could support a SOWing system and how these standardization efforts could profit from the SOWing concepts. Lecture Notes in Computer Science 22

5.2 XML and Documents

One application of XML in the scenarios mentioned in the previous sections is its usage as a platform independent language for the exchange of data. XML can be read and understood by humans and it can be processed by machines. Processing in this context means, that machines are able to store the XML data and to perform certain checks, e.g. whether the XML data is well-formed or valid according to a certain document type definition. With pure XML, as stated in [1], only syntactical interoperability can be achieved. A target machine can decide whether the data is well-formed and whether it corresponds to a certain document type. If the target machine should reason about the content of the XML data, domain knowledge has to be hardwired into the machine which processes the data and this knowledge has to be synchronized with the knowledge assumed by the author of the data. Pure XML then can be used to facilitate communication between partners who share a common knowledge about the nature of the data being exchanged.

5.3 XML and Semantics

Up to a certain degree XML can be used to represent ontologies, but as [2] shows important ontological relationships (e.g. “subclass-of” or “instance-of”) can not be modeled directly in a XML document type definition. Another problem is, that XML of- fers different ways to model the same relationship. For instance a property of a certain concept can be modeled as XML element of its own or as an attribute of another element. Supposed a document type description exists, which represents an ontology, the receiver of a document based on this type description has to “know” that this document type represents an ontology and how this ontology can be derived from the document type definition. So instead of trying to agree on a formalism for a common ontology [1] and [2] suggest to agree on a common formalism for representing ontologies themselves. The Resource Description Framework [3] provides primitives which facilitate the representation of ontologies in a much more natural way compared to (pure) XML. RDF can be implemented on top of XML and is used as a basic framework for the definition of common ontology interchange languages like OIL [4]. We expect that works can be published from inside a SOWing environment using the above standards in order to simplify the reuse of such works. The understanding for humans outside the community for which a work was originally published can be improved and even machine reasoning about the works becomes feasible, as soon as both the SOWing system and the processing machine stick to the same ontology representation standard.

5.4 XML and Activities As mentioned in previous sections “work” is not only a product that materializes concepts or ideas in a specific manner, but the process which led to the creation of the product is part of the work, too. From the observation how a certain product was generated we expect additional insights about facts like - In which sequence were the production steps carried out ? Lecture Notes in Computer Science 23

- How long did each production step take ? - What are or what may have been the influences leading to a concrete object production sequence ? For existing works this information usually is not available or has to be derived from other sources. But for description works like those carried out by the art historians in our WEL example, data about the production process of such a description work can be collected almost automatically by the SOW environment. If we remem- ber then, that an object description work can become a value of its own, which itself can become a subject to later research, we need not revert to “guessing” what the circumstances of the production may have been, but we are able to know what these circumstances were. So if we consider the data about the production of a work as being as valuable as the original work itself, its quite natural to apply the same representations used for communicating knowledge (XML) about works to communicate knowledge about the production history of works. This uniform handling would assist users in both understanding works already finished and completing work that is still in progress. The SOW environment can in this sense document the process which led to a certain document and enhance the understanding of the “work” behind the document. On the other hand this process description can be used to derive recommendations for future production processes in the sense of instructions that have to be carried out for tasks similar to a successfully completed process. So a well-documented production process can serve as a template for future production processes and contributes in this way to a standardization of the methods used in the application domain. One possibility to represent such processes can be XML based languages like XRL (eXchangeable Routing Language) described in [5].

5.5 SOWing as a process integrating previously disjoint standardization activities

As we have seen in section 5.2 to 5.4 different standardization efforts exist, whose results can be used to address some of the representation and communication problems occurring in a SOWing scenario. While SOW can be taken as a certain kind of work which has to be represented somehow, the XML community delivers recommendations for such representations, which in turn have to be embedded into a concrete application scenario. A SOWing system can contribute to the vision of these disjoint standardization efforts to turn the web into a “semantic web”, in which data is not only transported by machines, but also processed in the sense, that machines could reason about the contents of documents in a more sophisticated manner as this is possible today. Web authors normally do not necessarily provide data about their documents. So even if systems are built which can reason about the contents of a document based on such metadata the system will not be able to reason about documents that do not contain any metadata. Automatic generation of metadata on the other hand requires a certain “initial knowledge” about the facts that could be useful for other users or machines. But such a common knowledge which could be used to describe everything in the web can be hardly defined. Lecture Notes in Computer Science 24

The SOWing environment can be helpful in this scenario, because many metadata about documents can be collected automatically by the SOWing system. The kind of data that can be collected is usually restricted to the concrete application domain of the SOWing environment, but even this restricted kind of metadata is already much more detailed than the metadata offered by normal web documents. If a web application has no rules for the handling of such specialized metadata it can not make much use of it. But if a web application has a “basic knowledge” about the domain which the metadata was generated for, then this application can help humans effectively in accessing data they are really interested in, which brings us back to the communication and interchange questions of the section 5.1. The SOWing environment can deliver lots of the metadata while the semantic web activities deliver the formalisms to transport such data.

6 Summary and Outlook

Our SOWing platform and experiments aim at relating, organizing and defining subjects and documents as well as the work behind it. In an interdisciplinary project with our Art History colleagues who are working in the subject area of “Political Iconogra- phy” we gained substantial insight into their Subject-Oriented Working (SOWing) needs and into initial requirements for a generic SOWing platform. In this paper we outlined the project, its basic models, their generalization as well as our initial experiences with prototypical SOWing implementations and compared our work with various XML-related activities. On the modeling level we improved our understanding of - the basic SOWing entities and their relationships; - the notion of work, i.e., the production context of content; - the role of key icons and key words for content-based subject definition. On the system level we are currently investigating the requirements for - a generator-based architecture for SOWing entities and relationships; - reflective system technology and its use for advanced SOWing services; - customized and personalized SOWing indices; - XML-based tool interoperability. Future large scale content-oriented project work will have to interact with a substantial number of SOWing indices and, therefore, requires a technology for “plugable SOWing arrays”. SOWing indices have to deliver their content through a wide variety of document types ranging from media documents laid out for human interaction to structured (and typed) documents for machine consumption. Finally, SOWing support for the modeling, management and enactment of content-oriented work has to be further improved. Lecture Notes in Computer Science 25

The two broad cases, data- and content-based applications, differ substantially in the fact that data case applications are abstracted first before they cross any system boundary while for content cases it is the system itself which has to map application content into some data-based technology. Through application analysis and software design we recognize the difficulties of such mappings. In this paper.

References

1. Decker, S., van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., Klein, M., and Melnik, S.: The semantic web: The roles of xml and rdf. IEEE Expert, 15(3), Octo- ber 2000 2. Fensel, D.: Relating Ontology Languages and Web Standards. Modelle und Modellierungssprachen. In Informatik und Wirtschaftsinformatik. Modellierung 2000. Foelbach Verlag. 2000. 3. Lassila, O. and Swick, R. R.: Resource Description Framework (RDF) Model and Syntax Specification. Recommendation. W3C. February 2000. http://www.w3.org/TR/REC-rd- f-syntax/ 4. Cover, R. : Ontology Interchange Language. In: The XML Cover Pages. OASIS. March 2001. http://xml.coverpages.org/oil.html 5. van der Aalst, W. M. P., and Kumar, A.: XML Based Schema Definition for Support of . In- ter-organizational Workflow. http://tmitwww.tm.tue.nl/staff/wvdaalst/Workflow/xrl/is- r01-5.pdf

3.1 The Warburg Electronic Library (WEL) Project

Viewed from the perspective of our Computer Science interests which shifted in recent years clearly from basic research in “databases and persistent programming” towards r&d in “software systems for content management (online, multi media, …)”, the WEL project addresses a range of highly relevant and interrelated content application issues: - content representation by multiple media: images, texts, data, …; - content structuring, navigation and querying, content presentation; - content work based on subjects and ontologies: classification, indexing, …; - utilization of different referencing mechanisms: icon, index, symbol; - cooperative processes and projects on multi media content in research and education. The WEL is an interdisciplinary project between the art history department of Ham- burg University (research group on Political Iconography, Warburg House) and the software systems institute of the Technical University, TUHH, Hamburg. It started in 1996 as a 5-years project and will be extended into an interdisciplinary r&d-framework between several Hamburg-based institutions. Lecture Notes in Computer Science 26

For a short WEL overview we will concentrate on two project contributions: - application of the classical semantic database modelling principles for WEL-design; development of personalized and cooperating digital WEL libraries based on project-specific prototypes and their use in art history education.