Using SGML As a Basis for Data-Intensive NLP


David McKelvie, Chris Brew & Henry Thompson
Language Technology Group, Human Communication Research Centre,
University of Edinburgh, Edinburgh, Scotland
David.McKelvie@ed.ac.uk & Chris.Brew@ed.ac.uk & H.Thompson@ed.ac.uk

Abstract

This paper describes the LT NSL system (McKelvie et al, 1996), an architecture for writing corpus processing tools. This system is then compared with two other systems which address similar issues, the GATE system (Cunningham et al, 1995) and the IMS Corpus Workbench (Christ, 1994). In particular we address the advantages and disadvantages of an SGML approach compared with a non-SGML database approach.

1 Introduction

The theme of this paper is the design of software and data architectures for natural language processing using corpora. Two major issues in corpus-based NLP are: how best to deal with medium to large scale corpora, often with complex linguistic annotations, and what system architecture best supports the reuse of software components in a modular and interchangeable fashion.

In this paper we describe the LT NSL system (McKelvie et al, 1996), an architecture for writing corpus processing tools, which we have developed in an attempt to address these issues. This system is then compared with two other systems which address some of the same issues, the GATE system (Cunningham et al, 1995) and the IMS Corpus Workbench (Christ, 1994). In particular we address the advantages and disadvantages of an SGML approach compared with a non-SGML database approach. Finally, in order to back up our claims about the merits of SGML-based corpus processing, we present a number of case studies of the use of the LT NSL system for corpus preparation and linguistic analysis.

2 The LT NSL system

LT NSL is a tool architecture for SGML-based processing of (primarily) text corpora. It generalises the UNIX pipe architecture, making it possible to use pipelines of general-purpose tools to process annotated corpora. The original UNIX architecture allows the rapid construction of efficient pipelines of conceptually simple processes to carry out relatively complex tasks, but is restricted to a simple model of streams as sequences of bytes, lines or fields. LT NSL lifts this restriction, allowing tools access to streams which are sequences of tree-structured text (a representation of SGML marked-up text).

The use of SGML as an I/O stream format between programs has the advantage that SGML is a well-defined standard for representing structured text. Its value is precisely that it closes off the option of a proliferation of ad-hoc notations and the associated software needed to read and write them. The most important reason why we use SGML for all corpus linguistic annotation is that it forces us to formally describe the markup we will be using and provides software for checking that these markup invariants hold in an annotated corpus. In practice this is extremely useful. SGML is human readable, so that intermediate results can be inspected and understood. It also means that it is easy for programs to access the information which is relevant to them, while ignoring additional markup. A further advantage is that many text corpora are available in SGML, for example, the British National Corpus (Burnage & Dunlop, 1992).

The LT NSL system is released as C source code. The software consists of a C-language Application Program Interface (API) of function calls, and a number of stand-alone programs which use this API. The current release is known to work on UNIX (SunOS 4.1.3, Solaris 2.4 and Linux), and a Windows-NT version will be released during 1997. There is also an API for the Python programming language.

One question which arises in respect to using SGML as an I/O format is: what about the cost of parsing SGML? Surely that makes pipelines too inefficient? Parsing SGML in its full generality, and providing validation and adequate error detection, is indeed rather hard. For efficiency reasons, you wouldn't want to use long pipelines of tools if each tool had to reparse the SGML and deal with the full language. Fortunately, LT NSL doesn't require this. The first stage of processing normalises the input, producing a simplified but informationally equivalent form of the document. Subsequent tools can and often will use the LT NSL API, which parses normalised SGML (henceforth NSGML) approximately ten times more efficiently than the best parsers for full SGML. The API then returns this parsed SGML to the calling program as data-structures.

NSGML is a fully expanded text form of SGML, informationally equivalent to the ESIS output of SGML parsers. This means that all markup minimisation is expanded to its full form, SGML entities are expanded into their value (except for SDATA entities), and all SGML names (of elements, attributes, etc.) are normalised. The result is a format easily readable by humans and programs.

The LT NSL programs consist of mknsg (based on James Clark's SP parser (Clark, 1996)), a program for converting arbitrary valid SGML into normalised SGML and the first stage in a pipeline of LT NSL tools; and a number of programs for manipulating normalised SGML files, such as sggrep, which finds SGML elements that match some query. Other software packages of ours, such as LT POS (a part of speech tagger) and LT WB (Mikheev & Finch, 1997), also use the LT NSL library. In addition to the normalised SGML, the mknsg program writes a file containing a compiled form of the Document Type Definition (DTD, SGML's way of describing the structure or grammar of the allowed markup in a document), which LT NSL programs read in order to know what the structure of their NSGML input or output is.

How fast is it? Processes requiring sequential access to large text corpora are well supported. It is unlikely that LT NSL will prove the rate-limiting step in sequential corpus processing. The kinds of repeated search required by lexicographers are more of a problem, since the system was not designed for that purpose. The standard distribution is fast enough for use as a search engine with files of up to several million words. Searching 1% of the British National Corpus (a total of 700,000 words (18 Mb)) is currently only 6 times slower using LT NSL sggrep than using fgrep, and sggrep allows more complex structure-sensitive queries. A prototype indexing mechanism (Mikheev & McKelvie, 1997), not yet in the distribution, improves the performance of LT NSL to acceptable levels for much larger datasets.

Why did we say "primarily for text corpora"? Because much of the technology is directly applicable to multimedia corpora such as the Edinburgh Map Task corpus (Anderson et al, 1991). There are tools which interpret SGML elements in the corpus text as offsets into files of audio-data, allowing very flexible retrieval and output of audio information using queries defined over the corpus text and its annotations. The same could be done for video clips, etc.

2.1 Hyperlinking

We are inclined to steer a middle course between a monolithic comprehensive view of corpus data, in which all possible views, annotations, structurings etc. of a corpus component are combined in a single heavily structured document, and a massively decentralised view in which a corpus component is organised as a hyper-document, with all its information stored in separate documents, utilising inter-document pointers. Aspects of the LT NSL library are aimed at supporting this approach. It is necessary to distinguish between files, which are storage units; (SGML) documents, which may be composed of a number of files by means of external entity references; and hyper-documents, which are linked ensembles of documents, using e.g. HyTime or TEI (Sperberg-McQueen & Burnard, 1994) link notation.

The implication of this is that corpus components can be hyper-documents, with low-density (i.e. above the token level) annotation being expressed indirectly in terms of links. In the first instance, this is constrained to situations where element content at one level of one document is entirely composed of elements from another document. Suppose, for example, we had already segmented a file, resulting in a single document marked up with SGML headers and paragraphs, and with the word segmentation marked with <w> tags:

  <p id=p4>
  <w id=p4.w1>Time</w>
  <w id=p4.w2>flies</w>
  <w id=p4.w3>.</w>
  </p>

The output of a phrase-level segmentation might then be stored as follows:

  <p id=p4>
  <phr id=p4.ph1 type=n doc=file1 from='id p4.w1'>
  <phr id=p4.ph2 type=v from='id p4.w2'>
  </p>

Linking is specified using one of the available TEI mechanisms. Details are not relevant here; suffice it to say that doc=file1 resolves to the word-level file and establishes a default for subsequent links. At a minimum, links are able to target single elements or sequences of contiguous elements. LT NSL implements a textual inclusion semantics for such links, inserting the referenced material as the content of the element bearing the linking attributes. For example, a document recording a spelling correction can leave the unchanged text in place by reference, materialising only the corrected token:

  <p id=p325>
  <repl from='id p325.t1' to='id p325.t15'>
  <!-- the correction itself -->
  <corr sic='procede' resp='ispell'>
  <token id=p325.t16>proceed</token>
  </corr>
  <!-- more unchanged text -->
  <repl from='id p325.t17' to='id p325.t96'>
  <!-- the rest of the unchanged text -->
  </p>
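The textual inclusion semantics for links is straightforward to prototype. The sketch below is illustrative only, not the LT NSL API: it recasts the word-level and phrase-level examples as well-formed XML (plain id values instead of TEI extended pointers, and an explicit to attribute added to show a range of contiguous elements) and uses Python's xml.etree to copy the referenced tokens into the linking elements.

```python
import xml.etree.ElementTree as ET

# Word-level document (the stored tokens), recast as well-formed XML
# for illustration; the real LT NSL tools operate on normalised SGML.
WORDS = """<p id="p4">
  <w id="p4.w1">Time</w>
  <w id="p4.w2">flies</w>
  <w id="p4.w3">.</w>
</p>"""

# Phrase-level document: empty <phr> elements that point into the
# word-level document instead of duplicating its content.  The `to`
# attribute is a hypothetical addition to illustrate element ranges.
PHRASES = """<p id="p4">
  <phr id="p4.ph1" type="n" doc="file1" from="p4.w1"/>
  <phr id="p4.ph2" type="v" from="p4.w2" to="p4.w3"/>
</p>"""

def resolve(phrase_doc, word_doc):
    """Textual inclusion semantics: copy the referenced word elements,
    in document order, into each element bearing linking attributes."""
    words = ET.fromstring(word_doc)
    order = [w for w in words]                       # children in document order
    index = {w.get("id"): i for i, w in enumerate(order)}
    merged = ET.fromstring(phrase_doc)
    for phr in merged.iter("phr"):
        start = index[phr.get("from")]
        end = index[phr.get("to", phr.get("from"))]  # default: a single element
        for w in order[start:end + 1]:
            phr.append(w)
    return merged

if __name__ == "__main__":
    merged = resolve(PHRASES, WORDS)
    for phr in merged.iter("phr"):
        print(phr.get("id"), "".join(w.text for w in phr))
    # prints: p4.ph1 Time
    #         p4.ph2 flies.
```

The point of the exercise is that the phrase-level file stays low-density: it stores only ids and pointers, and a consuming tool sees the fully included content after resolution.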