Appendix A: Introduction to SGML
Total Page:16
File Type:pdf, Size:1020Kb
Appendix A: Introduction to SGML This appendix gives a very brief introduction to SGML. This should be enough to carry a reader through the first part of this book. If you are not already familiar with SGML, you will benefit from some additional reading. The bibliography lists several books on SGML; van Herwijnen 1994 is a popular and accessible introduction. What is SGML, really? SGML is short for Standard Generalized Markup Language, and it is defined by ISO Standard 8879 (1986). It provides a flexible and portable way of representing documents on computers, where you can choose what kinds of components occur in each particular type of document, and then clearly label those parts as they occur. The foundation of SGML, and to my thinking the primary reason it has gained such wide acceptance, is very simple: It lets you describe document structures directly, rather than describing something temporary like formatting, that depends on the structure. Simply put, SGML lets you tell the truth about your documents. This is reflected in five key characteristics that form the basis of its wide acceptance: • Descriptive markup: An SGML document consists of objects of various classes (chapters, titles, references, repair procedures, graphic objects, poems, etc.), not sequences of formatting instructions. These objects are called elements, and each one has a generic element type. SGML marks Ehe boundaries of elements using start- and end- tags . For example, a quotation could be stored as: I<QUOTATION>Full speed ahead</QUOTATION> I 19 6 DeRose By labeling elements themselves rather than saying what to do with them, SGML avoids locking information into one program or e,)en one purpose. Processing for various purposes is specified separately, often by style sheets or similar mechanisms. • Hierarchical structure: Elements can contain one another in a hierarchy (a chapter can contain a title and several sections which in turn contain other elements). Therefore one can create, manipulate and manage not only "chapter titles," as in word processors, but also chapters as complete units. Hierarchies are also called "trees" because they branch repeatedly, and every sub-branch comes from only one "parent" branch. For example, a chapter might be stored as: I<CHAPTER><TITLE>Get t ing Started</TITLE> <SECTION><TITLE>Overview of SGML</TITLE> <P> SGML is_</P> </SECTION></CHAPTER> A tree is a significantly different and more powerful model than the flat list or slightly hierarchical structures used in many word processor and database products. • Flexibility: SGML does not dictate what types of components may appear in a document or how they relate. Individuals or groups can define types appropriate for their kinds of documents. For example, manuals need parts lists and procedures, catalogs needs prices and descriptions, while literary texts need a quite different variety of structures. Nevertheless, a single SGML implementation can work with all of these. There is no need for new software for each new set of element types. • Formal specification: The specific elements contained in an SGML document are declared before the document begins in the DTD, or Document Type Declaration. After reading a DTD, programs called validating parsers can check SGML documents and detect many kinds of errors. Of course they cannot detect all errors: If you create a poem and tell the computer it is just a plain paragraph (or a repair procedure), validity checking won't help much. • Human-readable representation: Finally, SGML provides a plain-text format for all this information. This means that one can read and write SGML documents not only with sophisticated SGML-aware tools, but even with no tools beyond any computer's everyday text handling commands (such as "type" or "edit"). Even without knowing SGML it isn't hard to figure out that this represents an emphasized word: I<EMPH>very</EMPH> Simple yet formal representations based on readable embedded markup within the data have another important advantage: they are machine-independent. The internal file formats most programs use seldom survive transfer across networks, from one make of computer to another, or across national borders. SGML does. This is a vast improvement, especially over the proprietary binary representations programs typically use for optimizing internal operations. SGML's key characteristics have little to do with specific kinds of processing (such as formatting or retrieval) or specific interfaces. By leaving these issues undefined (while ensuring that the structural information needed to determine them is defined), SGML encourages building systems that use the best processing and interface technologies available in any situation. Because of this SGML data lasts far longer than other The SGML FAQ Book 197 representations, and a single SGML source document can be "re-purposed" for a wide variety of uses, often with no changes at all. The key characteristics also have very little to do with syntax, or the formal grammar rules that govern just how SGML represents structured information. Syntax is important because only after agreeing on some syntax can programs exchange data. However, the greatest value of SGML stems from its use of descriptive markup methodology, not from the details of its rules. The syntax could have been standardized in a number of other ways without diminishing that value or the others described above. SGML documents can be created, edited and otherwise processed with generic editors or with WYSIWYG editors that look like typical word processors. Cooperating authors often use different methods and programs on the same documents. In the same way, the result of authoring can be processed unchanged for many different purposes, such as local printing, high-quality typesetting, display in Braille, information retrieval, linguistic analysis, or hypertext browsing. No representation that is based mainly on formatting can do that. Formatting and structure Strong separation of formatting from structure is a hallmark of good SGML use. However, just using the standard cannot ensure such good practice. It is possible, even easy, to use "legal" SGML syntax with tags that represent only formatting information, but doing this loses many benefits SGML is meant to provide. For example, it is easy to make a set of SGML element types that looks like a raw formatting language. A block quote might then be tagged as follows: <SK><DS><IN L="+I0" R="-10">The quotation... I<SK><IL L="+10">The following paragraph_. Even worse, someone could just put "<DOC>...</DOC>" around a document in any other representation and call it SGML: <DOC> .sk;.ds;.in +i0 -i0; The quotation... .sk;.il +i0; The following paragraph.. <DOC> Given corresponding DTDs, both these approaches yield perfectly legal, syntactically correct SGML for which a validating SGML parser would report no errors. However, the information is no more portable or communicative than if it were not using SGML: SGML syntax is merely being a veneer over non-descriptive, non-hierarchical markup, whose meaning is not accessible to the SGML processor for validation or use. In contrast, the following SGML coding is clearer, more concise, easier to maintain, more efficient and more useful for a wide range of purposes: <QUO>The quotation...</QUO> I<P>The following paragrap~..</P> 19 8 DeRose Parts of an SGML document An SGML document is divided into three major parts: • An SGML declaration gives general information about SGML syntax options. It says what character set is used, how long element type names can be, what delimiter strings (such as "<" and "</") separate tags from content, what optional SGML features are used, and so on. Many SGML parsers allow the declaration to be omitted. Then standard default settings are used, called the reference concrete syntax. This book assumes these settings except that element and other names may exceed the default eight-character limit. • A Document Type Declaration or DTD says what element types and other components are permitted in the document, using detailed markup declarations. The declarations also say which element types can contain which others. For example, footnotes might not be allowed to contain chapters. • A document instance includes the content of a document as well as the tags and other markup that indicate where elements of each type start and end. A document instance can be stored in more than one file by dividing it up into parts called entities, and may include data in other media and formats by reference as described below. Parts of an SGML document instance This section gives a brief review of how documents are tagged using SGML and how tagging schemes are declared in DTDs, then shows the complete SGML markup for a short example document. Conceptually, an SGML document instance includes content and structure, such as graphics, video, text, indications of structural components, and links. In terms of syntax or representation, it has four major kinds of objects each of which has a corresponding type of declaration that can appear in the DTD. Once declared, actual instances of these objects may be used in the document instance. Hements divide the document into meaningful conceptual components, each of which has some name, such as chapter, part-number, or task. The name is called an element type or generic identifier ("GI"). Elements are marked by tags at their start and end, that in turn are marked off by special strings called delimiters. Attributes may be specified for elements and give information about a particular instance of the element type such as a security level or a unique identifier. Each attribute has a name, a type, and a rule about default values. Attributes appear within start-tags. Entity references are a way to include data by reference, and are resolved by the SGML parser and replaced by their values: either externally stored information or strings defined in the DTD.