Converting a Corpus into a Hypertext: An Approach Using XML Topic Maps and XSLT

Eva Anna Lenz, Angelika Storrer
Universität Dortmund, Institut für deutsche Sprache und Literatur
Emil-Figge-Str. 50, D-44227 Dortmund, Germany
{lenz, storrer}@hytex.info

Abstract

In the context of the HyTex project, our goal is to convert a corpus into a hypertext, basing conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments. Domain-specific knowledge is represented in the form of a knowledge net, using topic maps. We use XML as an interchange format. In this paper, we focus on a declarative rule language designed to express conversion strategies in terms of text-grammatical structures and hypertext results. The strategies can be formulated in a concise formal syntax which is independent of the markup and which can be transformed automatically into executable program code.

1. Project Framework

We illustrate a model for converting a text corpus into a hypertext. The approach is being developed in the context of the HyTex project (Hypertext conversion based on textgrammatical annotation, http://www.hytex.info/), funded by the Deutsche Forschungsgemeinschaft (DFG), which investigates methods and strategies for text-to-hypertext conversion. The central idea of the project is to base conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments, e.g. co-reference relations, semantics of connectives, text-deictic expressions, and expressions indicating topic handling (Storrer, to appear). The project will develop a methodology which (semi-)automatically constructs hypertext layers and views, using the text-grammatical annotations.

By storing the hypertext as additional document layers, this methodology preserves the structure and content of the original text documents, thus ensuring their recoverability. The multiple-layer approach facilitates publishing the same content in different media (cross-media publishing) and, in addition, provides the reader with the choice between sequential and selective reading modes. Selective hypertext readers will be supported in finding coherent pathways through the document network.

Feasibility and performance of the methodology will be tested and evaluated using a German text corpus. The corpus documents will deal with the same global subject domain, namely "text technology". The corpus comprises various genres (including papers, project reports, textbooks, FAQs, technical specifications, glossaries/lexicons, and newsgroup postings). These genres differ with respect to parameters which are crucial for hypertext conversion: length, topic handling, and linearity. The corpus will be transformed into a hyperbase by a stepwise approach, the steps being:

• corpus collection and automated annotation of the hierarchical document structure,
• (semi-)automatic text-grammatical annotation, e.g. segmentation of paragraphs into smaller functional units,
• annotation of co-reference relations between text units, in particular expressions used in topic handling and expressions indicating the rhetorical function of smaller segments,
• annotation of definitions for technical terms of the subject domain and of term instances,
• modelling terminological knowledge of the global subject domain, using topic maps (Pepper and Moore, 2001),
• implementation and evaluation of procedures for segmentation, reorganisation, and linking. These operate on the text-grammatical annotations and the topic map model.

In this paper, we concentrate on the technical realisation of the last step. In particular, we focus on a declarative rule language for describing the transformation process. In section 2, we outline this approach. We then give a brief overview of the model's components in section 3 before we focus on the design of the rule language and its implementation in section 4. Conclusions and an outlook are presented in section 5.

2. Simplifying the Transformation Process

The transformation process can be viewed as a mapping from the corpus to the hypertext. It can be described by rules consisting of two parts: a condition (on the documents) and an action that is to be executed when the condition holds. In particular, on the basis of various coherence relations modelled by annotations (conditions), a linking structure of the hypertext is to be generated (actions).

Traditionally, transformations are implemented using programming languages in which these rules remain implicit. In the project context – as in other contexts – many rules are similarly structured, leading to either redundant program code or complex programming expressions. Furthermore, a programming language includes many constructs not needed in our context.

The goal of the approach presented here is to design a declarative rule language optimised for expressing conditions on text-grammatical structures and hypertext-generating actions. This approach makes the rules more compact, easier to understand, and easier to maintain than program code. We will further discuss the advantages and drawbacks of the approach in section 5.
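To make the condition–action pattern concrete, the following fragment sketches how one such rule could be written directly in XSLT, the kind of program code the rule language is intended to abstract from. The element name termInstance, the attributes prerequisite and termRef, and the HTML link output are invented for this illustration and do not reproduce the project's actual annotation scheme or rule language.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Default rule: copy every node that no specific rule applies to. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Condition, encoded in the match pattern: a term instance annotated
       with an intratextual knowledge prerequisite (hypothetical markup). -->
  <xsl:template match="termInstance[@prerequisite='intratextual']">
    <!-- Action: turn the instance into a link pointing to the definition
         given earlier in the same document. -->
    <a href="#def-{@termRef}">
      <xsl:value-of select="."/>
    </a>
  </xsl:template>

</xsl:stylesheet>

In the declarative rule language, the same condition–action pair would be stated independently of the concrete markup and then transformed automatically into executable code of this kind.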
3. Model Overview

Our architecture contains three components (see figure 1):

• the corpus of annotated documents,
• the topic map modelling the technical terms of the subject domain in terms of WordNet relations, and
• a set of rules which declaratively describe how the corpus and the topic map are to be transformed into a hypertext.

All components are expressed in or transformed into XML syntax. This allows us to apply available tools (parsers, generators, browsers, etc.). For example, the syntax of every genre in the corpus, the topic map, and the rule language can be defined and checked by document grammars (e.g., Document Type Definitions (DTDs)). We can easily reuse documents available in XML or HTML, and produce documents for the web.

[Figure 1: Model overview.]

3.1. The Corpus

The German text corpus dealing with the subject domain "text technology" originally contains texts in various formats, including Word and HTML. In a first step, these formats are converted to XML in order to obtain a consistent format.

The corpus is then annotated on different levels in subsequent steps in order to prepare it for hypertext conversion. We annotate the hierarchical document structure for each genre, the text-grammatical structure, co-reference relations between text units, and definitions and instances of technical terms. We obtain these annotations semi-automatically, applying several steps of markup generation and filtering, in which intermediate levels of annotation (e.g., part-of-speech tagging, lemmatisation, chunking) eventually disappear.

Annotations of technical term definitions and instances will serve as examples throughout this paper. They enable us to implement a linking strategy based on knowledge prerequisites. This linking strategy is suitable in an application scenario where readers are not fully familiar with the subject domain, for example in the contexts of interdisciplinary research, studies and education, journalism, or lexicography.

Application of the linking strategy provides the reader with a hypertext that supports a selective and problem-oriented reading mode. For example, when a reader comes across a term (the term instance), we can indicate that it is a technical term in the domain and provide him with the possibility to learn more about this term, e.g. by following a link to the term definition.

We differentiate between three kinds of knowledge prerequisites:

intratextual knowledge prerequisite: the author of a text uses a term for which he has given a definition in the same text,

extratextual knowledge prerequisite: the author indicates that he uses a term whose definition is given in another corpus document, or

domain-specific knowledge prerequisite: the author uses a term well known in the community of the subject domain without indicating that it is a technical term.

We semi-automatically annotate term definitions as well as term instances together with the type of knowledge prerequisite. For example, we assume that the knowledge prerequisite is domain-specific when no term definition can be found prior to its use.
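A fragment like the following sketches what such an annotation might look like; the element names (termDef, termInstance) and attribute names (termRef, prerequisite) are hypothetical and chosen only to illustrate the principle, not to reproduce the project's actual tag set.

<!-- Hypothetical markup: a definition of a technical term and a later
     instance of the same term, annotated with the type of knowledge
     prerequisite assigned during semi-automatic annotation. -->
<termDef id="def-hypertext" term="hypertext">
  A hypertext is a non-linearly organised network of text units
  connected by links ...
</termDef>

<!-- ... later in the same document ... -->
<p>
  Converting the corpus into a
  <termInstance termRef="hypertext" prerequisite="intratextual">hypertext</termInstance>
  requires ...
</p>

Conversion rules can then test the prerequisite attribute to decide whether and how a link to the corresponding definition is generated.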
3.2. The Topic Map

In order to implement the linking strategy based on knowledge prerequisites, we need to model terminological knowledge of the subject domain. This knowledge comprises:

• The terms and their relations to other terms, e.g. superordinates and subordinates, synonyms, etc. We organise this knowledge according to the principles of the lexical database WordNet (Fellbaum, 1998), i.e. in the form of a lexical network, and add domain-specific relations.
• The relationships of terms to documents, e.g. to standard definitions, or to occurrences of a term in specific genres, e.g. in an FAQ ("Frequently Asked Questions").
• Attributions of terms to usages and communities. For example, we can express that one term is employed by the web community rather than the hypertext community.

Topic Maps (ISO, 2000) provide a standardised syntax for such networks. The main unit of organisation is called a "topic". Topics can be typed, related to other topics, and related to documents in a user-definable and flexible way. We model attributions of terms to usages using the topic map construct of scopes. A topic map can be expressed in XML syntax (Pepper and Moore, 2001); topic maps thus meet all our requirements.

Parts of the topic map are eventually transformed into
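To illustrate how these three kinds of information can be captured, the following fragment sketches a possible XTM 1.0 encoding of a single term; the topic identifiers, association and role types, scope, and document URI are invented for this sketch and do not reproduce the project's actual topic map.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical XTM 1.0 fragment for the term "hypertext": a name scoped
     by a community, an occurrence pointing to a definition in a corpus
     document, and a WordNet-style hyperonymy association. -->
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">

  <topic id="hypertext">
    <baseName>
      <!-- Scope: this name is used in the hypertext community. -->
      <scope><topicRef xlink:href="#hypertext-community"/></scope>
      <baseNameString>Hypertext</baseNameString>
    </baseName>
    <occurrence>
      <!-- Relationship of the term to a definition in a corpus document. -->
      <instanceOf><topicRef xlink:href="#definition"/></instanceOf>
      <resourceRef xlink:href="corpus/intro.xml#def-hypertext"/>
    </occurrence>
  </topic>

  <!-- WordNet-style relation: "text" is a superordinate of "hypertext". -->
  <association>
    <instanceOf><topicRef xlink:href="#hyperonymy"/></instanceOf>
    <member>
      <roleSpec><topicRef xlink:href="#superordinate"/></roleSpec>
      <topicRef xlink:href="#text"/>
    </member>
    <member>
      <roleSpec><topicRef xlink:href="#subordinate"/></roleSpec>
      <topicRef xlink:href="#hypertext"/>
    </member>
  </association>

</topicMap>

The scope on the base name corresponds to the attribution of a term to a community, and the occurrence relates the topic to a definition in a corpus document.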