



BRACED LANGUAGES AND A MODEL OF TRANSLATION FOR CONTEXT-FREE STRINGS: THEORY AND PRACTICE

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the Graduate

School of the Ohio State University

by

Michael John Kaelbling, B.S., M.S.

The Ohio State University

1987

Dissertation Committee:

Sandra A. Mamrak

Thomas C. Bylander

Karsten Schwan

Approved by:

Adviser
Department of Computer and Information Science

in memoriam

Rudolf Kaelbling, M.Sc., M.D.

1928-1976

ACKNOWLEDGMENTS

First and foremost, I must thank Ann E. Kelley Sobel for her love and support. She showed remarkable patience and endurance while I worked to complete this dissertation. Having closed this chapter of my life, I dedicate the remainder to her, its most important person.

I would like to thank my adviser, Sandra Mamrak. I am grateful for the time and guidance she has given me throughout my studies. I would also like to thank the other members of my reading committee: Karsten Schwan and Tom Bylander. They all generously contributed their time and effort to improve this dissertation.

I thank my family and friends for their support. Among the graduate students who helped me and brightened my time are, in no particular order: Dave Ogle, Prabha Gopinath, Ben Blake, Con O'Connell, Klaus Buchenrieder, Gregor Taulbee, Mike Weintraub and the members of the Desperanto and Chameleon research projects, especially Charles Nicholas.

My research was supported in part by a research grant from the Applied Information Technologies Research Center based in Columbus, Ohio, and by an equipment grant from the Xerox Corporation.

VITA

21 Dec. 1957 ...... Born, Columbus, Ohio

1979 .............. B.S., The Ohio State University, Columbus, Ohio

1980-1981 ......... Universitaet Ulm, Ulm, West Germany

1981-1982 ......... Non-teaching faculty (Programmer/Analyst) in the Dept. of Microbiology at the Ohio State University, Columbus, Ohio

1982-1983 ......... Full-time graduate student: Graduate Teaching Associate

1983 .............. M.S., The Ohio State University, Columbus, Ohio

1983-present ...... Full-time graduate student: Graduate Teaching Associate and Graduate Research Associate

Publications

Mamrak, S.A., Kaelbling, M.J., Nicholas, C.K., Share, M. A Software Architecture for Supporting the Exchange of Electronic Manuscripts. Communications of the ACM 30(5):408-414, May 1987.

Schwan, K., Kaelbling, M.J., Ramnath, R. A Testbed for High-Performance Parallel Software. Jan. 1985. 34pp. (OSU-CISRC-TR-85-5).

Mamrak, S.A., Kaelbling, M.J., Nicholas, C.K. An Approach to the Solution of Data Conversion Problems in Heterogeneous Networks. Aug. 1983. 35pp. (OSU-CISRC-TR-83-4).

Fields of Study

Major Field: Computer and Information Sciences

Studies in: Heterogeneous Distributed Systems, with Prof. S. Mamrak; Homogeneous Distributed Systems, with Prof. K. Schwan; Software Environments, with Prof. J. Ramanathan

TABLE OF CONTENTS

ACKNOWLEDGMENTS ...... iii

VITA ...... v

LIST OF FIGURES ...... x

LIST OF SYMBOLS ...... xi

LIST OF ABBREVIATIONS ...... xii

LIST OF NOMENCLATURE ...... xiii

CHAPTER PAGE

I. INTRODUCTION ...... 1

   1. The Problem ...... 2
   2. Our Specific Solution ...... 3
   3. Related Work ...... 4
   4. Main Contributions ...... 6
   5. Overview and Organization ...... 8

II. CONTEXT AND MOTIVATION OF THE WORK ...... 12

   1. The Chameleon Project ...... 14
      1. The Observed Problem ...... 14
      2. The Intuitive Notion of Translation ...... 16
      3. The Standard-Form Approach ...... 17
      4. The Chameleon Goals ...... 19
         1. Domain Independence ...... 19
         2. Support of the Build Phase ...... 20
         3. Support of the Use Phase ...... 21
         4. Hiding Formalism ...... 23
      5. Prototype Systems ...... 23
   2. The Current Chameleon System ...... 24
      1. The Basic Tool Set ...... 24
         1. The SGML Scanner/Grammar Generator ...... 25
         2. The Translator Writer ...... 25
         3. The Translator Inverter ...... 26
         4. The Translator Generator ...... 27
         5. The Translation-up User Interface ...... 28
      2. Two Auxiliary Tool Developments ...... 29
         1. A Generator for Oracular Parsers ...... 29
         2. A Generator for Powerful Scanners ...... 30
   3. Related Work ...... 31

III. THE STANDARD-FORM MODEL ...... 34

   1. Preliminaries ...... 35
      1. Notation ...... 35
      2. Regular Sets, Languages, and Expressions ...... 36
      3. Context-free Grammars and Languages ...... 37
   2. Generalized Parenthesis Grammars and Languages ...... 38
   3. Braced Languages ...... 39
      1. The Standard Definition ...... 40
      2. An Alternate Definition ...... 41
   4. The Definition of Our Standard-Form Model ...... 46
   5. Related Work ...... 47
      1. The Use of Standard Forms ...... 47
      2. Basis Languages ...... 49

IV. TRANSLATION ...... 51

   1. Substitutions, Homomorphisms, and Their Inverses ...... 52
   2. Translation Defined ...... 53
   3. Information Loss and Functionality Mismatch ...... 56
   4. Consequences of the Model of Translation ...... 57
   5. Related Work ...... 61

V. SGML AND OUR MODEL ...... 63

   1. A Critique of SGML ...... 64
      1. The Model ...... 64
      2. The Standard Itself ...... 68
         1. Processing with Existing Tools ...... 69
         2. A Case of Inadequate Syntax ...... 72
   2. The Expressiveness of SGML ...... 75
      1. Restrictions Placed on SGML ...... 75
      2. The Expressiveness of Restricted-SGML ...... 77
   3. Tools to Simplify the Processing of SGML Documents ...... 80
      1. The Beginnings of a Design ...... 81
      2. The State of the Implementation ...... 83

VI. THE IDENTIFICATION OF REDUNDANT TAGS ...... 85

   1. Reduction Techniques in SGML ...... 86
      1. Shortening Tags ...... 86
      2. Omitting Tags ...... 87
   2. Identifying Redundant Tags ...... 89
      1. Redundancy in a Standard-Form Specification: Omitting Productions ...... 90
      2. Redundancy in a Standard-Form String: Omitting Tags ...... 91

VII. SUMMARY AND CONCLUSION ...... 96

   1. Summary ...... 96
   2. Main Contributions ...... 98
   3. Future Work ...... 99

BIBLIOGRAPHY ...... 103

LIST OF FIGURES

1. Two Formatters' Descriptions of the Same Object ...... 13
2. An Example of a Tagged Object ...... 17
3. An Overview of the Build Phase ...... 20
4. An Overview of the Use Phase ...... 22
5. The SGML Scanner/Grammar Generator ...... 26
6. The Translator Writer ...... 27
7. A Sample User-interface Query ...... 29
8. A Sample Standard-form Model ...... 46
9. A Homomorphism for Translation Down to Troff ...... 55
10. An Inverse Homomorphism for Translation Up from Troff ...... 55
11. A Start-Tag with Two Attributes ...... 67
12. A Sample SGML Rule Containing Separators ...... 71
13. A Simplified Version of the SGML Standard's Grammar for "Attribute" Lists ...... 72
14. An Improved Grammar for "Attribute" Lists ...... 74

LIST OF SYMBOLS

∅ ...... the empty set

ε ...... the empty string, the string of zero characters

ƀ ...... the blank, or space, character

* ...... (raised star) indicates zero or more occurrences (Kleene closure)

+ ...... (raised plus) indicates one or more occurrences (positive closure)

• ...... (centered dot) indicates concatenation

+ ...... indicates alternation in regular expressions

| ...... indicates alternation in grammars

L(G) ... denotes the language generated by grammar G

BR ..... denotes a braced language: L(B_k) ∩ R

B'R .... denotes a braced language: L(B'_k) ∩ R

LIST OF ABBREVIATIONS

BNF ........ Backus-Naur form

BR ......... braced language

cfg ........ context-free grammar

cfl ........ context-free language

dcfl ....... deterministic context-free language

gpg ........ generalized parenthesis grammar

gpl ........ generalized parenthesis language

LALR(1) .... Look-Ahead Left-to-right Rightmost-derivation technique using one look-ahead symbol

LL(1) ...... Left-to-right Leftmost-derivation parsing technique using one look-ahead symbol

SGML ....... Standard Generalized Markup Language

WYSIWYG .... What you see is what you get

LIST OF NOMENCLATURE

B_k grammar ................................. def. 8, p. 40
B'_k grammar ................................ def. 10, p. 42
braced language ............................. section III.3
context-free grammar ........................ def. 3, p. 37
context-free language ....................... def. 4, p. 37
deterministic context-free language ......... def. 5, p. 37
Dyck language ............................... def. 18, p. 57
generalized parenthesis grammar ............. def. 6, p. 38
generalized parenthesis language ............ def. 7, p. 39
omissible end-tag ........................... def. 32, p. 93
omissible start-tag ......................... see def. 30, p. 92
omissibility without look-ahead ............. def. 29, p. 91

CHAPTER I

Introduction

Diversity is one of the characteristics of the computer industry. From Silicon Valley in the West to Route 128 in the East, there are hundreds upon hundreds of hardware and software producers. Most of these suppliers furnish products different from those of their competitors. The result is that there is a large number of different systems that accomplish the same task. There are dozens of text editors, computer-aided design and manufacturing packages, and database systems, among others. It seems that each one of the tools belonging to a given domain does things a little differently. Each offers a different set of options and functionality. Each uses its own notation and vocabulary. Product differentiation and the "founder effect" rule the day.

Variety may indeed be the spice of life, but it can have a fragmentary effect as well. Language differences isolate people. Metric system tools do not fit Imperial bolts. Electricity is sometimes 110 volts, sometimes 220 volts, sometimes 60 hertz, sometimes 50 hertz. Cars come with left-hand or right-hand drive. In the midst of all this heterogeneity, one sometimes wishes for homogeneity, or failing that, for the illusion of homogeneity provided by a deus ex machina.

1.1. The Problem

In this dissertation we address some of the problems of dealing with the diversity found in computerized environments. We have observed, in the computer world at large, that many incompatible representations exist for describing the same information. Furthermore, as time goes on, one will often encounter a situation in which the existing representation of some information is not as desirable as it once was. If a piece of software is superseded by a more powerful package that uses data in a different representation, then one would benefit from being able to translate the existing information into the new form. If the environment changes and the old tools are no longer available or too expensive to use, then the problem is even more critical. If one wishes to merge into one the information expressed in several different notations, then, again, translation is needed.

Translation can be approached in an ad hoc way or a systematic, generalized fashion. It can be done manually or automatically. It can be done with an arcane process or an accessible one.

The general problem is translating among different representations; the specific problem addressed in this dissertation is the determination, through formal analysis, of the limits of one specific model of translation.

1.2. Our Specific Solution

In this dissertation, we will present a formal model of translation based on the use of standard forms. This model can be, and has been, used in the development of a computerized system for the production and use of automatic and semi-automatic translators. In this way we will have contributed to the state of the art in performing the translation of high-level data strings.

Standard forms, used as an intermediate language for exchange among different formats, offer great resource savings over approaches that develop pairwise translators. Additionally, standard forms serve as a well-defined starting point for the semi-automatic generation of translators. Tools can be, and have been, developed to work from this fixed starting point. We will describe some of these tools.

The formal model we will present is useful in several ways. It makes concrete a model of translation. It is a basis for theoretic analysis: generating results that describe the limits of the model, and results that justify the existence of certain tools. The main result is that the model we describe applies to the context-free languages.

1.3. Related Work

An inspiration for this work is the result, due to Chomsky, that the context-free languages can be characterized by a regular expression, a Dyck language, and a homomorphism [10]. Thus a more complicated class of languages, the context-free languages, can be represented by 1) a simple context-free language, in this case a Dyck language, 2) a regular set, which is simpler than a general context-free language, and 3) a homomorphism, which is a simple string substitution. That is: a language may be described in terms of simpler components.
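Stated compactly, in its standard formulation (often called the Chomsky-Schützenberger theorem; the symbols below are generic and not the notation introduced later in this dissertation): for every context-free language $L$ there exist a Dyck language $D$, a regular set $R$, and a homomorphism $h$ such that

\[ L \;=\; h(D \cap R). \]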

Since the purpose of translation is to represent information in a different form, it is natural to look for a simple, easily comprehensible representation to serve as the standard representation on which to base automatic translation.

There have been other results, similar to Chomsky's, for the representation of different language classes by simpler components. Parchmann has done it for indexed languages [40], and Hirose et al. have done it for the recursively-enumerable languages [19]. Unfortunately, the simpler components of these other methods are not as simple as we would like. These results, like Chomsky's, have been explored and developed without regard to the problem of translation.

Chomsky and the others presented their grammars for grammars' sake and not as a basis for the translation of high-level data strings. This dissertation describes a synthesis between a result similar to theirs and the problem of translation. So, although their work predates the work of developing the formal model for translation presented here, it does not anticipate it.

Another group of related work is in the area of standard forms themselves. There are many projects or specifications that seek to standardize the representation of information. For example, the Initial Graphics Exchange Specification (IGES) [23, 24] seeks to standardize the representation of graphical objects.

But it is not enough to have a standard form. Standard forms by themselves are most useful when new systems and objects are to be created; they do little to address the problems of existing systems and objects. Furthermore, the standards that one sees proposed tend to be domain dependent and inapplicable to data representations in general.

Some efforts attempt to generalize. The Standard Generalized Markup Language [43] provides a meta-language for the description of standard forms, but it deals with translation in only a superficial way. The Interface Description Language [26] system applies to only a handful of popular programming languages; although the number of data types can be extended, the new data types are composites of existing ones and do not carry a new semantics with them. The ALOEGEN [31, 32] system is general and effective in enforcing the creation of legal representations, but does not itself help in the translation of existing, alternate representations.

Related work can thus be broken into two categories: work on the theory of languages, and work on standard forms. This dissertation presents new work in both those areas motivated by the problem of translation; and it presents new work in combining both of these areas.

1.4. Main Contributions

The main aim of this dissertation is to develop a formal model of translation and to explore the theoretic properties of that model. First, we specify a formal model for standard forms: the braced languages. The braced languages have some nice properties. They are easy to parse, since they are, at most, LALR(1) languages; and they are easy to specify since they can be fundamentally described by a regular language, the simplest of the commonly described language classes studied by computer scientists.

Then we specify a formal model of translation: the application of homomorphisms and/or inverse substitutions to representations of information. We are then able to define two problems in terms of the formal model: information loss and functionality mismatch. A main advantage of this formal model is that it makes clear what we consider translation to be and allows a logical analysis of the potentials and limitations of our conception of translation.

The major result of our analysis is the conclusion that our model of translation is exactly limited to context-free languages. When we say "limited," we do not mean to imply that this class of languages is trivial. On the contrary, the context-free languages are a significant class. A large number of data representations are context-free languages. Other theoretic results allow us to prove that certain processing tools will be required to accomplish translation in the general case. The necessity of an oracular parser, for example, is established. Our theoretic results can then be applied to real-world systems. We are able to prove that the Chameleon system, a software architecture to implement translation, has as a lower bound the class of context-free languages. Furthermore, when Chameleon develops an oracular parser it is doing so for sound reasons based in theory: it is necessary to complete the functionality of a translation system.

Another contribution of this dissertation is to begin an analysis of the international standard SGML with our discussions based on the theoretic results we have established for our model. It is important to study SGML since it is very similar in spirit to our model and is, at present, better known. There are some results shared by both models. We were able, for example, to prove that SGML, too, is capable of representing the context-free languages. Where the SGML model diverged from our model, we were able to justify our choices.

The last major contribution is an extension of a tag-omission functionality of SGML. We were able to describe this functionality in terms of our theoretic model of a standard form, and in so doing extended it beyond what was previously allowed. Our extension was possible without sacrificing the main concern of its proponents: that redundant tags be identified without using lookahead. Our model then is the source of solutions to real-world concerns and has been used to extend the bounds of functionality beyond unnecessary limitations.

1.5. Overview and Organization

Throughout this dissertation, examples will be presented to illustrate points being made in the text. In many cases these examples will be simplifications motivated by the need to represent concepts in simple, understandable dimensions. The reader is urged to extrapolate from these simple examples to the general (or even worst-case) nature of the problems.

In Chapter II, we discuss the context of the theoretic work. We introduce the Chameleon project, which addresses the problem of aiding the automatic translation of data objects. The Chameleon project, established in 1985, involves the work of about a dozen people, and its description here is meant to be informative and not exhaustive or territorial. This background information is important because it describes the environment in which the theory of translation was developed and in which it is being tested. It was the development of Chameleon's prototypes that prompted the question: "What are the possibilities and limitations of the architecture?" In addition to describing the environment, Chapter II discusses several key concepts: the general nature and permanence of the translation problem, the intuitive model of translation from which the theoretic model sprang, the standard-form approach which offers tremendous savings over pairwise approaches, and desirable properties of any general translation system.

In Chapter III, we formalize the standard-form model. A proper subset of a previously described class of languages is defined. This subset (the B_k languages), along with regular expressions, defines the braced languages. The braced languages are our standard forms. It will be shown that the braced languages can be recognized by common parsing techniques; an alternate definition is given to widen the class of parsing techniques that can be applied. The formal definition helps to make our intuitive notion of translation concrete, and it provides the theoretic model needed for analysis.

In Chapter IV, we complete our process of formalization by defining translation in terms of our standard-form model. The formal model of translation is then considered, and some of its implications are made apparent. The problems of information loss and functionality mismatch are discussed, and certain closure properties show us that some translations can not be done deterministically. Most important, we are now able to answer the question: "What are the possibilities and limitations of the architecture?" We prove that our model, which is similar to the Chameleon architecture, is applicable only to the context-free languages. That last result implies that the context-free languages are a lower bound on the class of languages to which the Chameleon architecture applies.

In Chapter V, we examine the Standard Generalized Markup Language (SGML), the closest, real-world counterpart to our theoretic model. As an existing and increasingly popular international standard, SGML may be the closest competition to our newly developed model. It, therefore, requires some consideration. First, we critique SGML from theoretic and technical standpoints, and point out some weaknesses. SGML's model seems somewhat confused, and the syntax adopted introduces unnecessary complications. Second, we analyze the expressive power of SGML by placing some simplifying restrictions on it. The expressive power of the simplified version is the context-free languages. Finally, we discuss what steps have been taken by the Chameleon project to automatically process SGML applications. Several tools are mentioned, including some that I have developed.

In Chapter VI, we turn from the examination of SGML to the application of automata theory to a functionality proposed by SGML: the removal of redundant information from a standard-form string. The benefit of removing redundant information hinges primarily on the belief that humans responsible for creating standard-form strings will find it more palatable to be allowed to type as little as possible. We first describe the string-length reduction techniques of SGML, and then restate the problem in terms of our theoretic model. It is interesting to note how the model serves as a valid construct for improving the real-world counterpart. We are able to propose, based on the theoretic model, a more encompassing solution to the identification of redundant tags.

In Chapter VII, we conclude and briefly summarize the dissertation. We also discuss some possible future work based on discoveries or developments encountered in the process of pursuing the research reported here.

CHAPTER II

Context and Motivation of the Work

The translation of high-level data objects is a problem found in diverse or distributed systems. A prototype to support translation via a standard-form approach is being developed by the Chameleon project at the Ohio State University [29]. The author has been working as part of this project to explore the work presented in this dissertation.

Therefore, to provide some background on the environment and describe the context within which our work was done, this chapter will present the Chameleon model of translation informally, and sketch the system architecture. The research described in this chapter was done by the Chameleon project group under the direction of Prof. Sandra Mamrak.

The discussion and examples relate to text formatters, but the translation architecture is not limited to that domain. The translation problem is illustrated in Figure 1, where two different representations are given for a poem. A possible translation of the Scribe [41] version, on the left of the figure, is the Troff [38] version, on the right. A translator for text formatters would, for example, take a Scribe version of a document and produce a Troff version, or vice versa. Chameleon seeks to provide a system for producing translators. A major goal is to shield the user from the underlying formalisms and tools of the translation architecture.

A Scribe version:              A Troff version:

@title(A Poem)                 .ps +4
@begin(verse)                  A Poem
Roses are red,                 .ps -4
Violets are blue,              .sp
Sugar is sweet,                .ce 1
And so are you.                Roses are red,
@end(verse)                    .ce 1
                               Violets are blue,
                               .ce 1
                               Sugar is sweet,
                               .ce 1
                               And so are you.
                               .br

Key for the Troff commands:
  .br    is the line break command
  .ce 1  centers the next line of text
  .ps    changes the point size
  .sp    is a vertical space

Figure 1: Two Formatters’ Descriptions of the Same Object

The theory of the translation architecture is based on a subclass of the class of context-free languages called the braced languages (see Chapter III). The implementation of the prototype is based on the Standard Generalized Markup Language (SGML) [43] (see Chapter V). The International Organization for Standardization (ISO) has adopted SGML as a means for specifying the standard forms of textual documents. However, SGML may be used in other domains to specify standard forms.

II.1. The Chameleon Project

This section describes the ongoing Chameleon project and its architecture. It also provides a background against which to view the theory presented later in Chapters III and IV.

II.1.1. The Observed Problem

Translation is often wanted in systems that support users from different backgrounds. These systems may be loosely coupled networks, like the Arpanet and Usenet communities, or may share a fundamental trait, like the Decus, Unix, or microcomputer users' groups. In such systems many programs and application packages are available for the same task: spreadsheet packages, interpreters and compilers, text-processing applications, and so on. Translation becomes necessary when a data file in the input format of one program is needed as input to a program with a different input format.

For example, an editor familiar only with Scribe may want to modify an author's document prepared with Troff. Without translation tools, the editor must learn Troff or retype the document with Scribe commands. As another example, a manufacturer may design a part with a computer-aided design facility that has a storage format chosen for efficiency in designing parts. The manufacturer may then want to process the part design with programs that determine the tool path movements necessary to machine the part, or want to compute physical properties of the part such as volume, deformation behavior, thermal conductivity, etc. Many existing computer-aided design and computer-aided manufacturing systems do not share a common design storage format. Translators, therefore, would offer significant savings to such systems. In yet another case, an institution may insist that a large library of programs written in Pascal be made available in Ada; here, again, translators would save development costs.

In a system where users have a large degree of autonomy, one cannot expect there to be a common format for the description of objects within a domain. One cannot expect the ultimate, universal programming language to arrive to the unanimous approval of all users. One cannot expect that all authors are interested in learning a new, unfamiliar document preparation system. One cannot expect that companies will decide to abandon their own systems in favor of one that is the same as those of their competitors. Indeed, an examination of the two areas of text processing [13] and graphics [4, 6] reveals that existing forms continue to be popular and new forms are continually added.

Furthermore, even if the day were to come when all new developments were done in the same representation, translation would still be required for the many existing, useful objects. Just as handwritten records were typeset when printing became commonplace, existing objects will be translated into new formats when new practices are developed. To help software environments keep pace with future developments and to ease the sharing of existing objects, Chameleon seeks to partially automate the process of translating data.

II.1.2. The Intuitive Notion of Translation

Chameleon started with an intuitive notion of what translation means: translation is the process of taking the information present in one form of an object and representing it in another form without losing any information. To successfully translate an object, all the information it contains must be identified. For example, the string ".bp" can mean many things, but only after recognizing that it represents a Troff command, can it be understood and translated into the Scribe equivalent: "@newpage".
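Once a token has been recognized, the mapping itself can be as simple as a table lookup. The following is a minimal sketch of our own in C (not the Chameleon implementation; the table contents beyond the ".bp" pairing mentioned above, and the function names, are assumptions made for the illustration):

#include <stdio.h>
#include <string.h>

/* Illustrative mapping from recognized Troff commands to Scribe
 * equivalents.  Only the pairing mentioned in the text is shown;
 * a real translator would carry many more entries.               */
struct mapping {
    const char *troff;
    const char *scribe;
};

static const struct mapping table[] = {
    { ".bp", "@newpage" },
    { NULL,  NULL       }
};

/* Return the Scribe equivalent of a recognized Troff command,
 * or NULL if the table has no translation for it.                */
static const char *translate_token(const char *token)
{
    for (const struct mapping *m = table; m->troff != NULL; m++)
        if (strcmp(token, m->troff) == 0)
            return m->scribe;
    return NULL;
}

int main(void)
{
    const char *scribe = translate_token(".bp");
    printf(".bp -> %s\n", scribe != NULL ? scribe : "(unknown)");
    return 0;
}

The point of the sketch is only that the substitution step is trivial once recognition has been done; the hard part, as the text explains, is the recognition itself.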

The information may either be explicitly or implicitly represented, but so long as it is identifiable, it can be processed. What is implicit can always be made explicit and, in doing so, often easier to process and understand. Therefore, an explicit representation of information serves as the basis for the Chameleon architecture.

In this explicit representation, the starts, ends, and types of all component objects are marked, unambiguously, by special tokens. This model corresponds to a tagging, or markup approach, where start- and end-tags mark the starts and ends of objects. Figure 2 is an example of a tagged object.

A Poem
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.

Figure 2: An Example of a Tagged Object

II.1.3. The Standard-Form Approach

The standard-form approach to translation is to develop translators from each variant representation to the standard form and vice versa. This approach differs from the ad hoc approach, which is to write translators from each representation to every other representation. Where n is the number of representations, the effort involved in following the ad hoc approach is on the order of n^2, while the effort for the standard-form approach is on the order of n, plus the time to define the standard form. So, for domains with more than three representations, the standard-form approach offers considerable savings.¹ Another advantage is that the standard form serves as a formal definition of the domain and a well-specified starting point for automatic processes that generate translators.
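To make the comparison concrete (a rough count of our own that ignores the one-time cost of defining the standard form): the ad hoc approach needs one translator per ordered pair of representations, while the standard-form approach needs one up-translator and one down-translator per variant, so

\[ n(n-1)\ \text{(ad hoc)} \qquad \text{versus} \qquad 2n\ \text{(standard form)}; \]

for n = 10 representations this is 90 translators versus 20.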

An added cost of the standard-form approach is defining the standard form used; also, standards evolve with time, and dialects arise. If one thinks of programming languages, which may be considered standard forms for algorithms, one realizes that Pascal, Fortran, Cobol, etc. are names for classes of standards. Some languages go through official versions (e.g. Fortran-IV, Fortran-77) as well as dialects provided by various manufacturers (e.g. MacPascal, Turbo Pascal). Writing on the conversion of computer software, Wolberg states that manufacturers do not adhere to software standards (page 7 in [49]). He was writing about programming language standards, but it is safe to conclude that his remarks are indicative of a more general trend. In a competitive market where service and product differentiation are thought important, suppliers seek "better ideas" to implement. The design of an architecture dealing with standard forms should cope with these problems: the evolution of standards, and the existence of dialects. An implication of the first problem is that standard forms should be easy to understand and specify. An implication of the second problem is that it will be necessary to translate among standard-form dialects. Chameleon deals with these problems.

¹Bylander points out that another approach would be to "write n-1 translators, L1 to L2, L2 to L3, ..., Ln-1 to Ln, automatically create inverse translators and thus be able to translate between any two languages." While this approach would indeed require effort on the order of n to build, the process of translation would need time proportional to the number of languages, as opposed to the other two approaches in which the time of translation is not dependent on the number of representations.

II.1.4. The Chameleon Goals

The Chameleon project is developing a system that creates translators based on the standard-form approach. Chameleon does not define standard forms for any domain, but takes a user-supplied description of the standard form as the starting point from which to derive the translators.

II.1.4.1. Domain Independence

To be broadly useful in producing translators, the Chameleon system is domain independent. By adopting the standard-form approach Chameleon avoids making its tools dependent on a particular representation in a domain. True, the standard form chosen may be exactly that of a particular representation, but it need not be. To avoid dependence on a particular standard form, Chameleon requires that the standard form of a domain be expressed in terms of a language capable of describing standard forms (see Chapter V). Thus, the standard form ensures independence from particular representations in the domain, and the (meta-)language for describing standard forms ensures independence from a particular standard form or domain (whether it be text, graphics, or other).

II.1.4.2. Support of the Build Phase

Chameleon seeks to automate as much of the translation process as possible. The process begins with the selection, by the user, of the standard form. Then the translators are built. The build phase (see Figure 3) produces translators.

Figure 3: An Overview of the Build Phase (components shown: nonstandard-form expert, Translator Writer, Inverter, Translator Generator, up-translator, down-translator)

In the build phase a user with a thorough understanding of the nonstandard (or variant) form specifies translation from the standard form to a nonstandard form. The translation routine is not written by the user; it is generated from the user-supplied specification. A tool, called the Inverter, then automatically rewrites the specification to make it describe the translation from the variant form to the standard form. Thus, the build phase results in two translators: one, called the down-translator, for conversion from the standard form, and one, called the up-translator, for conversion to the standard form. For a given document type, the build phase must be completed once for each variant form.

II.1.4.3. Support of the Use Phase

The translation process continues with the use of the translators. The use phase (see Figure 4) produces translations.

In the use phase the translators developed in the build phase are invoked repeatedly. In the Chameleon model, translation-down (i.e., from a standard form to a variant form) is accomplished without user intervention. Translation-up (from a variant form to a standard form), however, cannot always be done automatically. Sometimes the up-translator needs user-assistance to complete the task. The system does all it can automatically, and queries the user for the appropriate action only when necessary.

Figure 4: An Overview of the Use Phase (translation-down passes a standard-form representation through the down-translator to a variant representation; translation-up passes a variant representation through the up-translator to a standard-form representation)

II.1.4.4. Hiding Formalism

The Chameleon system makes use of many formalisms, including substitutions, homomorphisms, LALR(1) parsing, and context-free and attribute grammars. To make the system available to many users, the underlying formalisms are hidden.

II.1.5. Prototype Systems

There are two Chameleon prototype systems. Both were developed in the Berkeley Software Development (BSD) Unix 4.2 environment. Existing tools, such as Yacc [25] and Lex [28], were used wherever possible, while new code was written in the C programming language [5].

The first prototype system does translation down from a standard-form object to a proprietary format for typesetting journals. The standard was defined by Chemical Abstracts Service in accordance with guidelines from the Association of American Publishers. In the build phase, computer scientists wrote a context-free grammar for the standard form, and specified the translation based on their knowledge of that grammar and of the variant form. This prototype is being used in a production environment [9].

The second prototype system, described in Section II.2, is still in development. The underlying formalism for specifying the translation is now based on attribute grammars. Additionally, both translation-up and translation-down are being supported. Also, tools are being built that automate processes previously done by hand, for example, the derivation of a context-free grammar from the standard-form model.

II.2. The Current Chameleon System

A main goal of the Chameleon project is to hide from the users as much of the underlying processing and formalism as possible; nevertheless some awareness of the processes and formalisms at work is necessary for an appreciation of the project. The following subsection briefly describes the main components of the system in terms of their functions and formalisms. The subsequent subsection describes two tools of special interest.

II.2.1. The Basic Tool Set

This subsection describes only some of the tools being used in the prototype. Given the changing nature of the system, and the relative unimportance or mundane nature of some of the tools, we have chosen to present only the more important, stable and original components. We do not want to suggest that the following list is complete; just that it covers the major aspects of the system.

II.2.1.1. The SGML Scanner/Grammar Generator

The SGML Scanner/Grammar Generator (see Figure 5) processes the SGML description of the standard form and produces a specification of the grammar in a form that can be processed by the Translator Generator. It also produces a scanner to be used by the Down-translator to tokenize² a standard-form representation of an object. Thus, the developer need have no knowledge of how to specify the grammar used by the Translator Generator and Translator Writer.

II.2.1.2. The Translator Writer

During the build phase the developer of the Up- and Down-translators interacts with a tool called the Translator Writer³ (see Figure 6). This tool takes a specification of the standard-form grammar, supplied by the SGML Scanner/Grammar Generator, and collects information from the developer on how to accomplish the translation down. Internally, an attribute grammar is used to specify the translation. The developer need have no understanding of attribute grammars or the specification of the action routines associated with the production. The Translator Writer ensures that the translation attribute grammar is well-defined and correct with respect to the user's specification. The attribute grammar specifying the translation is then used by the Translator Inverter and Translator Generator.

²To tokenize is to group together certain terminal characters into single syntactic units, i.e. tokens.

³For a detailed description of the Translator Writer, see the dissertation of Nicholas [35].

Figure 5: The SGML Scanner/Grammar Generator (input: the SGML standard-form definition; outputs: a standard-form grammar and a standard-form scanner)

II.2.1.3. The Translator Inverter

The Translator Inverter takes the specification of translation-down and automatically generates from it an attribute grammar that specifies translation-up. Thus, the developer does not have to specify the up-translator, or understand the process by which it can be derived. The algorithm used to do the inversion is essentially that described by Yellin and Mueckstein [51, 52, 53]. The output is passed to the Translator Generator.

Figure 6: The Translator Writer (inputs: the nonstandard-form expert and the standard-form grammar; outputs: a specification for the down-translator and a specification for the nonstandard-form scanner)

II.2.1.4. The Translator Generator

Conceptually there are two translator generators: one for up-translators and another for down-translators; in practice, however, one generator is sufficient. The Translator Generator takes a procedural description of a translation and generates the code to do the translation.

This tool resembles a compiler-compiler. It generates a program (an up- or down-translator) that will parse an input string and produce its translation. The developer need not have knowledge of compiler-compilers, and the user of the translator need not understand the processes by which the input string is parsed.

II.2.1.5. The Translation-up User Interface

The User Interface assists in the use phase of translation up. As will be shown in Chapter IV, it is known that while the translation-down phase is always deterministic, the translation-up phase is not. To handle the nondeterminism, the User Interface questions the user about the correct choice among alternatives. The user must, therefore, understand the object being translated up, but need not understand the reasons for a choice being made, or the processing implications contingent on a choice.

Figure 7 is an example of interaction to resolve an ambiguity.

After all ambiguity has been resolved, the translation-up process is completed (see Figure 4). The result is a potentially incomplete standard-form representation of an object. All the information in the variant form will be represented in the supplied standard-form representation, but information that was missing from the variant form would have to be edited in.

Text-file window:
  : Sugar is sweet,
  : .ce 1
  : And so are you.
  : A   Oracular choice required here.
  : .br
  : End of file.

Is A
  a. After the last line in the body of the poem
  b. After the name of the author of the poem
?

Figure 7: A Sample User-interface Query

II.2.2. Two Auxiliary Tool Developments

This section describes two auxiliary tools being developed by the Chameleon project group.

II.2.2.1. A Generator for Oracular Parsers

The User Interface (described above) requires a parser that does its work deterministically whenever possible and then consults an oracle when a user's decision is required. In the User Interface the user acts as the oracle. Currently, parsers making use of an oracle are not available, so a generator for such parsers is being developed. The parsers will know when and how to interact with an oracle, and they will be able to proceed after a legitimate choice has been made.

A parser of this sort will be useful where much automation is already possible, but where occasionally oracular choices are needed. It may be too costly, or even impossible, to write a program that is capable of understanding the intent of data as effectively as a human can. The Chameleon project is currently investigating the possibility of incorporating learning into the oracular parser, and of generating confidence levels to quantify the expected degree of success of the automatic translation.
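A minimal sketch of the oracle interaction, in C, is given below (our own illustration, not the Chameleon interface; the prompt format and function names are assumptions). The parser would call such a routine only when more than one parse action is legal, and the routine re-prompts until a legitimate choice is made:

#include <stdio.h>

/* Ask the oracle (here, the user at a terminal) to choose one of
 * `count` legal alternatives.  Returns the index of the choice,
 * or -1 if input ends before a legitimate choice is made.         */
static int consult_oracle(const char *question,
                          const char *alternatives[], int count)
{
    printf("Oracular choice required here.\n%s\n", question);
    for (int i = 0; i < count; i++)
        printf("  %c. %s\n", 'a' + i, alternatives[i]);

    for (;;) {
        int c, choice;

        printf("? ");
        fflush(stdout);
        c = getchar();
        if (c == EOF)
            return -1;
        choice = c - 'a';
        while (c != '\n' && c != EOF)        /* discard rest of line */
            c = getchar();
        if (choice >= 0 && choice < count)   /* legitimate choice    */
            return choice;
    }
}

int main(void)
{
    const char *alts[] = {
        "After the last line in the body of the poem",
        "After the name of the author of the poem"
    };
    int k = consult_oracle("Where does the marked text belong?", alts, 2);
    if (k >= 0)
        printf("Oracle chose: %s\n", alts[k]);
    return 0;
}

The alternatives shown are the ones from Figure 7; any learning or confidence-level machinery would sit behind the same narrow interface.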

II.2.2.2. A Generator for Powerful Scanners

The decision to support SGML as the basis for the implementation introduced several additional functionality requirements. For example, start and end tags that can be inferred unambiguously need not be present; this feature saves space and clutter in the standard-form representation. Also, SGML allows for a change in lexical analysis rules within the standard-form representation; this feature allows user-specific keying conventions to be processed automatically.

Currently, lexical analysis is done by routines that simulate finite state transducers. We believe that down-translators could be made smaller if the lexical analyzer were equivalent to a pushdown automaton.

Therefore, a scanner generator is being developed that will yield scanners that maintain a stack and, additionally, handle some of the peculiarities of the SGML standard-form representation.

Since the stack in the scanner contains information present in the subsequent parser's stack, a merging of scanning and parsing steps is possible that will allow for smaller programs, more convenient notation for strings in a language, and more efficient processing.
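A minimal sketch of the idea, in C, is shown below (our own illustration, using single characters in place of real SGML tags: the lowercase letters 'a'..'k' stand for start tags and the corresponding uppercase letters for end tags). The scanner keeps its own stack of open tags, so nesting information is available before the parser ever runs:

#include <stdio.h>

#define MAX_DEPTH 128

/* Scan a simplified standard-form string in which 'a'..'k' are
 * start tags and 'A'..'K' are the matching end tags.  The scanner
 * maintains a stack of open tags; the stack depth is the nesting
 * level of the current nonbracket text.  Returns 0 if the tags
 * nest properly and -1 otherwise.                                 */
static int scan(const char *input)
{
    char stack[MAX_DEPTH];
    int depth = 0;

    for (const char *p = input; *p != '\0'; p++) {
        if (*p >= 'a' && *p <= 'k') {            /* start tag       */
            if (depth == MAX_DEPTH)
                return -1;                       /* nested too deep  */
            stack[depth++] = *p;
        } else if (*p >= 'A' && *p <= 'K') {     /* end tag         */
            if (depth == 0 || stack[depth - 1] != *p - 'A' + 'a')
                return -1;                       /* mismatched tag   */
            depth--;
        }
        /* Any other character is nonbracket text; its nesting
         * level is simply the current value of `depth`.            */
    }
    return depth == 0 ? 0 : -1;
}

int main(void)
{
    printf("%d\n", scan("a b xyz B c uvw C A"));  /*  0: well nested */
    printf("%d\n", scan("a b A B"));              /* -1: crossed     */
    return 0;
}

Real SGML tags are multi-character and admit omission and minimization, which is exactly why the generated scanners need more machinery than this sketch; the stack discipline, however, is the same.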

II.3. Related Work

The design of the Chameleon system reflects common techniques of top-down design and modularity. Naturally, we have made use of common, existing tools where possible, e.g. Yacc [25] and Lex [28]. There is, however, one major component that is directly attributable to new, outside work.

We have adopted Yellin’s scheme for the automatic inversion of attribute grammars [53]. We can thereby produce a large portion of an up-translator automatically. Yellin’s contribution is important to the design of the Chameleon system, but does not relate directly to the main contributions of this dissertation.

Another piece of interesting work, though it too does not relate directly to this dissertation, is the dissertation of Nicholas [35]. His work is in making functionality accessible to users who do not need to understand the actual technology involved. To this end, Nicholas was involved in the design of the Translator Writer.

One of the system tools, the oracular parser, can be related to outside work. We chose to consult an oracle (in our case the user) to make critical, but uncomputable, decisions. A sophisticated oracular parser could make this process simpler.

Yellin [53] uses an attribute grammar to provide needed, disambiguating decisions. This technique is useful and effective in some cases, as in the handling of mixed-mode arithmetic expressions. The technique may, however, be impracticable or cumbersome in general.

Consider, for example, the difficulty in specifying all the appropriate attributes necessary to make the correct decision in Figure 7 on page 29.

Zaslaw [55] attempts to use an artificial-intelligence, rule-based approach to identify parts of a document. Unfortunately, he reports, his program can take longer to mark up a document than a human editor needs. We do not suggest that this one prototype invalidates all AI approaches. We do, however, note that this approach uses a simple model of a document and that its performance will degrade as the model grows.

There is unreported work going on at the OCLC⁴ corporation to develop a program to mark up (identify) parts of a title page. Hopefully, the insight gained there will aid in more general solutions.

Based on these efforts to aid or provide oracles, and on the tremendous value of automatic processing, we can conclude that a good deal of research and development will occur. It is too early to tell whether learning or statistical or other systems will be most effective.

⁴Online Computer Library Center, Inc., 6565 Frantz Rd., Dublin, Ohio.

One final area of indirectly related work is the manipulation of standard-form objects. Our work deals with translation to and from standard forms, but we believe that the standard-form representations will, at some point, have to be edited directly. For example, a copy editor may wish to make revisions in a standard-form representation, or it may become a matter of policy to create new objects in their standard-form representation rather than translate them from an old format. Therefore, we note, for the reader's information, that a good deal of effort has already been expended on enforcing the creation of "correct" objects, i.e., those that conform to an imposed standard. An example of work done specifically for documents is [54]; work of a more general nature is the Aloe system [31, 32], which, when given a description of valid objects, will create editors that enforce the construction of valid objects.

CHAPTER III

The Standard-Form Model

We started with an intuitive notion of translation and worked from that notion to a model for standard forms: translation is the re-representation of existing information, and standard-form representations unambiguously mark the beginning, end, and type of all significant information in an object. This standard-form model is restated below in terms of Braced languages.

By "significant information" we mean information upon which translation decisions can be made. In some sense all information is significant; if it weren't, it would have no meaning. For our purposes, however, we consider significance at a different level. The fundamental units in our information representation are the characters in the alphabet. We assume that these characters can be unambiguously identified, and are not marked by start- and end-tags. We then attach significance to those strings of characters that are treated as a unit for the purposes of translation. Naturally, these groupings of characters depend on the domain in which translation occurs: on the standard form that describes the information capable of being conveyed. For example, if a standard form defines a postal address as a character string, then the distinction between name line, street line, and city line is not significant for the translator (though it may be to the post office).

By using formalisms to define the translation process, we will discover the scope of the translation model, and will be able to analyze the problems that can be encountered.

III.1. Preliminaries

The three subsections that follow present notation and formalisms well known in computer science. The material briefly given below is drawn from Hopcroft and Ullman [21, 22]; should readers need a deeper understanding of the topics, they are directed there.

III.1.1. Notation

• ∅ is the empty set.

• ε is the empty string, the string of zero characters.

• * indicates zero or more occurrences (Kleene closure).

• + indicates one or more occurrences (positive closure).

• • (centered dot) or juxtaposition indicates concatenation.

• + indicates alternation in regular expressions.

• | indicates alternation in grammars.

• In grammars, uppercase letters represent nonterminals and sets; lowercase letters represent terminals.

• L(G) denotes the language generated by grammar G.

III.1.2. Regular Sets, Languages, and Expressions

Definition 1: A regular set, or regular language, is the language accepted by some finite state automaton.

The regular languages are precisely those languages describable by regular expressions. Thus, we may use a regular expression to identify a regular set.

Definition 2: [From Hopcroft & Ullman] Let Σ be an alphabet. The regular expressions over Σ and the sets that they denote are defined recursively as follows.

• ∅ is a regular expression and denotes the empty set.

• ε is a regular expression and denotes the set {ε}.

• For each a in Σ, a is a regular expression and denotes the set {a}.

• If r and s are regular expressions denoting the languages R and S, respectively, then (r + s), (rs), and (r*) are regular expressions that denote the sets R ∪ S, RS, and R*, respectively.

By giving * and + (the closures) higher precedence than concatenation, which we in turn give higher precedence than + (alternation), we omit many parentheses from regular expressions.
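For example (an illustration of our own of these precedence conventions):

\[ ab^* + c \;=\; \bigl(a(b^*)\bigr) + c , \]

that is, the star binds to b alone, concatenation then groups a with b*, and the alternation applies last.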

III.1.3. Context-free Grammars and Languages

Definition 3: A context-free grammar (cfg) is specified by a grammar

G = (N, T, P, S),

where

• N is a finite set of nonterminals (AKA variables).

• T is a finite set of terminals, where T ∩ N = ∅.

• P is a finite set of productions, each of the form A → ξ, where A ∈ N and ξ is a string over (N ∪ T)*.

• S is the start symbol, and S ∈ N.

Definition 4: A context-free language (cfl) is a language generated by a context-free grammar, i.e. L(G) where G is a cfg.

The context-free languages are closed under intersection with a regular set and under substitution.

Definition 5: A deterministic context-free language (dcfl) is a language accepted by a deterministic push-down automaton (a machine with a finite-state control section and a stack).

The deterministic context-free languages contain the regular sets, and are contained by the context-free languages. The dcfl's are closed under intersection with a regular set, but they are not closed under substitution or homomorphism [22].

III.2. Generalized Parenthesis Grammars and Languages

The generalized parenthesis grammars and languages were defined by Takahashi [46]. The generalized parenthesis languages contain Ginsburg and Harrison's Bracketed Context-Free Languages [15], McNaughton's Parenthesis languages, and the Dyck Languages (defined below).

Definition 6: A generalized parenthesis grammar (gpg) is a context-free grammar

G = (N, A, P, I)

where

• N is a finite set of nonterminals.

• A is an alphabet of terminals such that A ⊇ B, where B = {a_i, z_i | 1 ≤ i ≤ n}; B may be considered the set of parentheses or brackets.

• P is the set of productions, in which the rules (productions) are only of the following forms:

  X → a_i ξ z_j
  X → a_i ξ z_j Y
  X → c
  X → c Y
  X → ε

  where a_i, z_j ∈ B, ξ ∈ (N ∪ A−B)*, X, Y ∈ N, and c ∈ A−B.

• I is the initial symbol.

Definition 7: A generalized parenthesis language (gpl) is a context-free language generated by a generalized parenthesis grammar, i.e. L(G) where G is a gpg.

III.3. Braced Languages

We now define the Braced languages. The Braced languages are a subset of the generalized parenthesis languages. The following two restrictions define that subset:

1. The strings in a Braced language begin and end with brackets that do not appear inside the string.

2. The longest substrings over the nonbracket characters are immediately bracketed.

III.3.1. The Standard Definition

Definition 8: A Braced language is the intersection of a regular set and a context-free language generated by a grammar

B_k = (N, T, P, S),

where

• N is the set of nonterminals {S, B, C}.

• T is the set of terminals {a_i, z_i, c_j | 1 ≤ i ≤ k, 1 ≤ j ≤ l}. The a_i's are opening brackets; the z_i's are closing brackets, and the c_j's are nonbracket characters.

• P is the set of productions containing only

  S → a_1 B z_1,
  S → a_1 C z_1,
  B → a_i B z_i | a_i B z_i B   (∀i, 1 ≤ i ≤ k),
  B → a_i C z_i | a_i C z_i B   (∀i, 1 ≤ i ≤ k),
  C → c_j C   (∀j, 1 ≤ j ≤ l),
  C → ε.

• S is the start symbol.

We will denote a braced language by BR = (B_k, R) = L(B_k) ∩ R, where R is a regular set.

L(BJ, the language generated by B^ can be thought of as the balanced strings containing k types of parentheses that bracket strings over the alphabet of c .’s. 41 Theorem 9: The Braced languages are deterministic and

context-free.

Proof: L(B^ is a gpl (see definition 6, p.38 and definition 8,

p.40). Greibach et al. [16] proved that gpl’s are

superdeterministic and therefore deterministic context-free

languages. Since the deterministic context-free languages are

closed under intersection with regular sets [22], a Braced

language, L(Bk)nR (where R is a regular set), is a deterministic

context-free language.

III.3.2. An Alternate Definition

In subsection III.3.1, we presented the standard definition of the

Braced languages; in this subsection we will present an alternate definition which may aid in the parsing of Braced languages. The grammar rules we used to define Bk are sometimes right-recursive, i.e. the nonterminal on the left-hand side of a production sometimes appears at the rightmost end of the right-hand side. In a left-recursive rule the nonterminal on the left-hand side of the production appears at the leftmost end of the right- hand side. The technical report describing Yacc [25], a commonly available compiler-compiler, has the following to say on page .19 about recursive rules:

The algorithm used by the Yacc parser encourages so called “ left recursive” grammar rules: rules of the form 42 name : name rest_of__ru!e ; These rules frequently arise when writing specifications of sequences and lists: list : item | list Y item

and seq : item | seq item

In each of these cases, the first rule will be reduced for the first item only, and the second rule will be reduced for the second and all succeeding items. With right recursive rules, such as seq : item | item seq

J the parser would be a bit bigger, and the items would be seen, and reduced, from right to left. More seriously, an internal stack in the parser would be in danger of overflowing if a very long sequence were read. Thus, the user should use left recursion wherever reasonable.

Therefore, to comply with the request to use left recursion wherever reasonable, we redefine the Braced languages in terms of a left-recursive grammar, and prove that the two definitions describe the same languages.

Definition 10: A Braced language is the intersection of a

regular set and a context-free language generated by a grammar

B'k = (Ar/, T, P', S' ),

where

• N' is the set of nonterminals {S', B' , C' } 43

• T is the set of terminals {a I., z., I c J. | 1 < : < k, 1 < j < /}. The a. s are opening brackets; the z.’s are closing x t brackets, and the c ’s are nonbracket characters

• P' is the set of productions containing only

S' -> Oj B' zv

S' -> ax C' zv

B' -> a.B'i z. i i! B' ia B' i z. ' (V*, ’1 < i <— k),n

B' — a. C' 2 . | B' a. C' z. (Vi, 1 < i < k),

C' -* C‘ Cj (Vj, 1 < j< I),

C — e

• 5' is the start symbol

We will denote a braced language by = (B'^R) = L(B'k)C',R, where

is a regular set.

We now state and prove a theorem on the equivalence of the classes of BR and B' R languages.

Theorem 1 1 . The B'R languages are the BR languages.

Proof. Since B'R = L(B'^)r'Rand BR = L(B^(~\R, it will be

sufficient to show that L(B^) = L(B'k).

First we note that the productions for S and S' (the start

productions) generate the same terminals in the same positions

and differ only in having B and C versus B' and C'.

Therefore we need only show that L(C) — L(C') and

L(B)= L(B'). be a string for n > 0. Then f has a derivation in C of

C —► c • C, C —> c ■ C, . . . , C —> c . C, C —► e, ll \ ln and £ has a derivation in C' of

C' -> C'c. , C' -► C'c. , . . . , C' -» C"c- , C" -> e. * * n 1 n-1 i 1 By inspection of the rules for C and C', we can see that the only terminals that can be generated are c.’ s. Thus,

L(C) = L{C’) = {c.| 1 < i < /}*.

Finally we show that L(B ) = L(B'). Let

f = a. X. z. a. X z. !1 ll ll Xn ln ln be a string for n > 1 and X £ {B,C}. Let

£' = a- Y ■z. • • • a. Y- z. ^ i, i, i, iii 1 1 1 n n n be a string for n> 1 and Y. = if X- = B then B' else C" V V (vy. 1 < y < n). We now introduce the convention of writing the trailing B in production rules as B, and the leading B' as

The overbar merely serves to distinguish between the two fl’s. and does not affect derivations. For every £ there is a derivation in B of

1 R —>n Y y R —> C . „ . „ . B a. X Z.~E, B — a. X. z. ~E, . . . , B a I X I I *1 !1 *1 l2 l2 l2 ln l„. For every £ there is exactly one £', and for every £' there is 45

derivation in B' of

B' — B'a X . z. , . . . , B' -> a. X . z. . i i i t, i, i, n n n 1 1 1

We have just shown that the same balanced, bracketed strings

can be derived using either B or B' By induction on the level

of nesting (the number of bracket pairs enclosing a string) at

each position we see that the same set of nested, balanced,

bracketed strings can be derived by both grammars. By an

examination of the production rules for B and B', we note that

this set is the only set derivable, i.e. there are no other strings

that can be generated by B or B ' . Thus L(B) = L(B') is the

set of nested, balanced, bracketed strings over C. Therefore the

languages B'k and B are equivalent, and the B'R languages are

the BR languages.

In this subsection we have presented an alternate definition of the braced languages. This alternate definition may be useful in parsing the languages. In the rest of this dissertation, however, we shall present braced languages by using the standard definition found in the previous subsection. 46 III.'4. The Definition of Our Standard-Form Model

Our standard form is a braced language. A braced language is the language of a Bk grammar intersected with a regular set. By such an intersection one gets a context-free language in which all parentheses are correctly balanced (as ensured by the B^ grammar) and occur only in a prescribed order (as ensured by the regular set). The a?s and z ’s correspond to the start- and end-tags; the c?s are the data characters, and the regular set is specified by a regular expression.

Oj = = a2 = z2 = a3 = = a. = z, = 4 4 a. = 2g = o

The c?s are the standard alphabet and punctuation marks

The regular expression is

- C /) * Z2 T e)a;J(a4(c1 - C2 -t- ct)* «5 + 0 * !

F ig u re 8: A Sample Standard-form Model

Figure 8 illustrates a standard-form model. 47 III.5. Related Work

We divide this related-work section into two parts: the use of standard forms for translation, and alternative languages that might serve as the basis for a standard form. It should be noted that the related- work section of Chapter IV, the chapter on translation, also explores issues related to the choice of a standard-form model. Those issues are not introduced here since they require an understanding of translation and this section does not.

III.5.1. The Use of Standard Forms

This dissertation does not deal with the automatic translation of natural languages. Indeed, the model presented here is inapplicable to natural language translation. However, we would like to make one observation in passing: in the domain of natural language translation, the notion of using a standard form during translation is not new. For years systems concerned with the machine translation of natural languages have used just such an approach [36]. In the interlingua approach, translation is divided into two stages: first, the translation of the source language into a standard form, and second, the translation from the standard form into the target language.

In other areas of computer science, many standards have been designed for the representation of objects. These standards usually address the problem of information exchange, but do not deal with the 48 translation of information to and from the standard. IGES quite explicitly states that the user is responsible for getting information into the standard form without guidance from the proposers of the standard [23, 24].

Furthermore, some of the standards are not readily extensible to other domains: those for text may be ill suited for graphics, and so on.

Other work, such as IDL [26], deals with the specification of a representation for shared objects, but does not help in dealing with arbitrary existing representations. IDL makes a fixed set of representations available from which a representation for a standard form must be built; it does not allow for the description of new semantics, but rather their simulation. It is also limited to a set of programming languages for which interface routines have been built, and is thus somewhat system dependent.

Many translators exist, but these are typically specific to a domain and of limited functionality. Some, like those in [2i, use a standard form.

These standard forms, however, serve as the starting point of a problem specific solution, and are not part of a larger meta-language approach, i.e., one that works from a neutral description of a standard form.

A noteworthy recent effort at automatic translation that uses an intermediate standard form is the work of Yellin [51, 53]. The focus of that work was the automatic generation of an inverse translation, given information about the forward translation. 49

III.5.2. Basis Languages

Chomsky used Dyck languages as the basis of his characterization of context-free languages. The Dyck languages consist of balanced strings of n types of parentheses. These languages thus embody the nesting that is a trademark of the context-free languages. However, they do not deal gracefully with non-balanced aspects of cfls. For that reason, we chose to define the B^-grammars. and to use their languages as the basis for our characterization.

We did examine other, similar, language classes with an eye towards using them as a basis. McNaughton identified the parenthesis grammars and languages [30]. Parenthesis grammars have only one type of parenthesis pair, and every production in the grammar starts with the open parenthesis, ends with the close parenthesis, and does not contain a parenthesis at any other position. These languages did not seem suitable since the parentheses do not carry tagging information, which we require and which would have to be included via added measures.

We also examined the bracketed languages. Ginsburg and Harrison identified the bracketed context-free languages which contain one type of parenthesis pair for each nonterminal in the language .15]. The rules begin and end with brackets. Thus, the bracketed languages would seem to satisfy our tagging requirements. Unfortunately the writing rules for productions allowed for usages we did not think suitable for our standard forms: to wit, while open-brackets were unique, closing-brackets needed not 50 be. Furthermore, strings of nonbracket characters were not possible. This feature made the Bracketed languages unacceptable (as it did the Dyck languages).

Takahashi identified the generalized parenthesis languages [46], and has reported many of their properties [47]. He has shown, also, that the parenthesis languages and the bracketed languages are contained in the gpl’s. Our i^-grammars are a select subset of the generalized parenthesis grammars chosen to enforce what we consider desirable properties in a standard form (see section III.3). CHAPTER IV Translation

In Chapter III we formalized the model of standard forms; in this chapter we will formalize what we mean by translation. First we present definitions of substitutions, homomorphisms and their inverses, as we use them. Then we define translation. Thereafter we discuss the consequences of using our model of standard forms with our definition of translation. Specifically, we prove that our model applies only to the class of context-free languages; and that this bound is a lower bound on the applicability of the Chameleon model of translation. We also discuss why user intervention will never be needed during translation-down, but may be needed during translation-up.

51 52 IV.1. Substitutions, Homomorphisms, and Their Inverses

The definitions presented in this section are based on those presented in Hopcroft and Ullman [22].

Definition 12: A substitution s is a mapping of a finite set A

onto subsets of B*: s(a)CB*, Va G A. The mapping s is

extended to strings in A* by defining it as a function of

substitutions of substrings of the original string. In the trivial

case it is defined as follows: s(e) = e, s(ia) = s(i)s(a), xGA,

a G A*. The mapping -s is extended to languages by defining

s(L) = u s(x) | Vx G L.

For example, let s(0) = a and s(l) = b*. Then s(010) is the regular set ab'-a.

Definition 13: A homomorphism h is a substitution such that

h(a) contains a single string for each a: where A and B are

finite sets, h(a) G B*, Va G A. Often h(a) is taken to be the

string itself, not the set containing the string. The mapping h

is extended to strings and languages as the mapping s was

extended in Definition 12.

For example, if A = (0, 1, 2}, B= {x, yy}, and h(0) = {x},

/i(l) = {yy}, and h(2) = {x}, then /i(012) = xyyx.

Definition 14: The inverse homomorphic image of a string w

is the set of strings that map to w: 53

h~l{w) = {z | h(x) = w},

and for a language L,

K~^(L) = {z | h(x) G L}.

Although this image is called an inverse homomorphism, it is not a true inverse because h(h~l(L)) might not equal hTl(h(L)) or L. However, in the case where h is a mapping from A onto itself, it can be shown that h(h~l(L)) C L Z. h~1(h{L)) [22j. Continuing the example from

Definition 13, h'1(x) = {0, 2}, and /i-1(yy) = {l}.

D efinition 15: The inverse substitutional image of a string is

s"1^) = {y I *<= s(y)}>

and for a language L,

s_1(L) = {y • Ex, x e L n s(y)}.

Again, this inverse image is not a true inverse.

IV.2. Translation Defined

The definitions of translation-up and -down presented in this section are used by the Chameleon project, but may not be applicable to what is generally thought of as translation.

D efinition 16: Translation-down is defined as the process of

applying a homomorphism to a standard-form string.

The specification of the homomorphism is done in the build phase. First, 54 a substitution is defined for each a. and z .. Then, a string is selected to specify the homomorphism. A substitution must be defined since many strings have synonyms? and these synonyms must be recognized during translation-up. For example, s(a5) = . ce/i+l says that one or more blanks in the string are synonymous. During translation-down a single string is used, but during translation-up the set of synonymous strings must be recognized. For simplicity, the example in Figure 9 uses substitutions that are homomorphisms. Figure 9 contains a set of homomorphisms that specify the translation-down to Troff form of the standard-form poem in Figure 2 (which conforms to the standard-form model given in Figure 8).

An additional simplification that we make is to treat the c.s as being equivalent to their homomorphic image. We consider strings over C to be low-level objects. Should this data need to be translated, we assume the existence of mappings to the other representations, and mappings from the other representations back to the standard-form string.

Definition 17: Translation-up is defined as the process of

mapping the strings of a variant form to a string in its inverse

substitutional image.

Figure 10 contains the set of inverse homomorphic images for translation up from Troff.

'^A synonym is one of two or more words in the same language that have the same meaning. 55

/i(a1) — e /i(-21) = .b r

h(a2) = .p s +4 h(z2) = .p s -4

/i(a3) = .s p A(z3) = e

/t(a4) = ,c e 1 /i(z4) = e

/i(a.) = ,ce 1 M ^ ) = e

/((c^ = c;., Vi, 1 < i < I

Figure 9: A Homomorphism for Translation Down to Troff

h'1.br) = {}

A"1 ,ce 1) = {, }

A'1 • ps +4) = {}

h~l .ps -4) = {}

A'1 . sp)-{< start body>}

h'x ' i ) = f°r * — * — ^

Figure 10: An Inverse Homomorphism for Translation Up from Troff 56 IV.3. Information Loss and Functionality Mismatch

Information loss is the result of a translation that fails to preserve meaning. Information loss can occur whenever the homomorphic image of a non-empty string is the empty string, or whenever the homomorphic images of two or more strings are the same. For example, if the homomorphic image of c2 were e, the information conveyed by c0 would be lost after translation. Not all mappings to e, however, result in information loss. In Figure 9 h(z_) = e does not lose information since the homomorphism of the regular expression of Figure 8 unambiguously indicates which e-strings imply z-. In another vein, the homomorphic images of h(a^) and h(a.) are both .ce 1. This mapping has lost the distinction between the strings for line and for author.

Functionality mismatch occurs when two languages do not have the same expressive power. A homomorphism that can cause information loss is said to exhibit functionality mismatch. It is possible to detect functionality mismatch before attempting translations and to determine the expected degree of information loss. Detection is possible by enumerating the homomorphisms that map to the same string or e and can not be unambiguously mapped back. The degree of information loss is the probability of encountering strings that have homomorphic images that lead to information loss. While numerical estimates may be hard to come by, rule-of-thumb estimates are useful for determining the feasibility or usefulness of the mappings. 57 IV.4. Consequences of the Model of Translation

To make some claims about the domain and range of our model of translation, we first introduce the Dyck languages and an important theorem due to Chomsky. There are many possible descriptions of the

Dyck languages: they are the languages whose strings reduce to the empty string by canceling consecutive pairs of matching open and close brackets

[11]. The formal definition that follows is similar to that given by

Takahashi 45|.

Definition 18: A Dyck language is a context-free language

generated by the grammar

Dk = ( V', T, P, S ),

where

• V is the set of nonterminals (5 }

• T is the set of terminals {a^ | 1 < i < /c}. The a ?s are

opening parentheses, and the z?s are closing parentheses

• P is the set of productions containing only

S - r e,

S —<■ cr S zS (Vf, 1 < i < k)

• S is the start symbol

L(Dj.) can be thought of as containing the balanced strings over k types of parentheses. The following theorem, due to Chomsky [10, 11], has also been proved by Stanley 4T and Ginsburg ill]. Theorem 19: Every context-free language can be expressed as

a homomorphism of the intersection of a Dyck language and a

regular set.

The class of languages to which the our model of translation can be applied is the context-free languages. We prove that this is the domain by using the following two lemmas.

Lemma 20: Every context-free language can be expressed as a

homomorphism of a braced language, i.e. for an arbitrary cfl

there exists at least one BR and homomorphism h such that

h(BR) is that cfl.

Proof: Let L(G) be an arbitrary cfl. by Theorem 19 we know

that L(G) = h(L(D^CiR). for some homomorphism h, some Dyck

language Dk, and some regular set R. Referring back to the

definition of the Bk grammars, we see that when / = 0. then

B ^— and L(B^) = L(D^. Therefore, by combining Theorem

19 with a B^ grammar where / = 0, we have

L(G) — h{L{B^'^R). Since BR = L(B^C\R, we have

L(G) = h(BR), for some homomorphism h and some braced

language BR.

Although the preceding proof is correct, it does not highlight the advantage of the BR notation. First let us note that there are many braced languages and homomorphisms that will yield the same context-free language. We do not need to insist that / = 0, as long as h(cj) = e, 59 Vj, 1 < y < /. Similarly we can choose a B , x> k, as long as

h{a ) = h{z ) = e, Vj, k < j < x. More possibilities exist to choose braced languages and homomorphisms such that they yield a given context-free language. Next let us note the rationale behind including c^.’s a s they seem to play little role. If one examines proofs of Theorem 19 (e.g. [46]), one notes that nonbracket characters are often rewritten as a start and end bracket pair that may not include intervening characters and that the homomorphism of one is the original character and the homomorphism of the other is the empty string. It is to avoid this awkward construction that c .’s are introduced. Thus whenever one detects such a ] construct/homomorphism situation, then it can be replaced by the simple and straightforward c . equivalent.

Lemma 21: Only context-free languages can be expressed as

homomorphisms of a Braced language.

Proof: The Braced languages are context-free (Theorem 9).

Therefore, the homomorphism of a Braced language must be a

context-free language.

Theorem 22: Our model of translation applies only to the _

class of context-free languages.

Proof: Immediate from Lemma 20 and Lemma 21.

Theorem 23: The class of context-free languages is a lower

bound on the power of the Chameleon model of translation. 60 Proof: The Chameleon model allows translation to be specified

exactly as we have described, but also allows for the use of

context-sensitive information in specifications. By ignoring the

extra liberty, we have the same model to which Theorem 22

applies, and thus we have established a lower bound.

Since the Braced languages are deterministic context-free languages

(Theorem 9, p. 41), and the process of applying homomorphisms is deterministic, it is guaranteed that translation-down can always be done automatically. However, since the class of deterministic context-free languages is not closed under homomorphism, translation-up will not always be deterministic. For example, if the translation-up of the Troff version of the poem in Figure 1 were attempted, how should the “ .ce 1” before “And s o a r e y o u ” be mapped? The choice is between “ < s t a r t l i n e > ” and “ < s t a r t a u th o r > ” . A user may know which choice is correct, but this information is not available to do the inverse mapping.

To accurately accomplish translation up, a knowledgable user must be consulted. Thus the requirement for an oracular parser as described on page 29 is established. 61 IV.5. Related Work

The theoretic basis of the model of translation presented in this dissertation is the theorem due to Chomsky that every context-free language can be expressed as a homomorphism of the intersection of a

Dyck language and a regular set [10. 11]. This theorem has also been proved by Stanley [44j and Ginsburg i 14]. Subsequent pertinent work has been built along the lines of Chomsky’s theorem. We mention these because, though Chomsky’s characterization seems to us a simple and powerful method for specifying our standard forms, the other methods that could have served as a starting point for the model need to be considered in terms of what they offer.

In this chapter we have shown that our model of translation is restricted to the class of context-free languages. As a basis for the following discussion, let us recall the following hierarchy of language classes: unrestricted contains recursively-enumerable, which contains context-sensitive, which contains indexed, which contains context-free.

Hirose and Yoneda have shown that every context-free language can be expressed as the homomorphic image of the intersection of a Dyck language and a minimal-linear-and-regular language 20]. We think that this approach to characterizing cfl’s is more difficult and less natural than the one we use. Furthermore it is no more expressive than our approach.

Parchmann has shown that homomorphic characterizations of regular- indexed languages can be made based on the linear and semi-Dyck 62 languages over the same alphabet. He further shows that a similar characterization can be made for indexed languages using the semi-Dyck and balanced context-free languages [39, 40]. This characterization would increase the class of languages that the model could handle. We believe, however, that this would introduce difficult formalisms which conflict with our desire for readily understandable standard forms. It is also unclear how many more everyday problems can be characterized by this approach.

Hirose, Okawa, and Yoneda have shown that every recursively- enumerable language can be expressed as the homomorphic image of the intersection of a Dyck language and a minimal-linear language [19;. This result holds the potential for describing an extremely broad class of standard forms. The formalism is again difficult, and the class of languages described is atypical of the languages usually encountered in everyday, computer usage.

An important limitation to the characterization described by Chomsky

(that of representing a class of languages as a homomorphic image of the intersection of a Dyck language and another language) was proved by

Okawa. Hirose. and Yoneda. They showed that no such characterization exists for the context-sensitive languages 37 . This result is important and disappointing because the class of context-sensitive languages is of great interest. Many desirable data-objects are context-sensitive, e.g. programming languages. Okawa et al. pose the question: “ is it possible to characterize icsPs| by imposing any restrictions on homomorphisms?” Any answers to this open question should be of interest. CHAPTER V SGML and Our Model

The Chameleon project has chosen the Standard Generalized Markup

Language (SGML) as its language for describing standard forms. SGML is an international standard (ISO 8879) (43) used in industry [3, 8, 12], and it bears a marked similarity to the standard-form model we developed theoretically. We will not give a detailed description here of SGML; the interested reader is referred to the standard. In this chapter we will examine SGML and its suitability as a language for describing standard forms. First, we will compare SGML’s model with our own, and discuss some technical features of the standard itself, including specifications that make it difficult to process with existing tools or that unnecessarily restrict a user's options. Second, we will investigate SGML’s expressive power, and find that it, too, is capable of modeling all context-free languages. Third, we will discuss the design and implementation of tools for processing SGML.

63 64 V.l. A Critique of SGML

In the final analysis, it may be impossible to separate the model that

SGML uses from the specification of the standard. Not having been involved in the drafting of the standard, we can not know whether some decisions were taken for pragmatic, theoretic, bureaucratic or political reasons. Therefore when we speak of the model below, we will address it at the highest level. Then we will make further observations on the standard’s specifications without regard to their motivations.

V.1.1. The Model

As originally proposed, SGML was intended to describe standard forms for textual documents. The standard forms were to be human- readable representations intended for publishing systems. To quote from page one of the standard itself, revealingly entitled Information Processing

— Text and Office Systems — Standard Generalized Markup Language

(SGML):

This International Standard specifies a language for document representation referred to as the “Standard Generalized Markup Language” (SGML). SGML can be used for publishing in its broadest definition, ranging from single medium conventional publishing to multi-media data base publishing. SGML can also be used in office document processing when the benefits of human readability and interchange with publishing systems are required.

An examination of the standard and its annexes will lead the reader to conclude that the “broadest definition” of publishing was limited to 65 those things that are normally published. While not explicitly excluded, no indication was given that the standard can be used to describe objects that are not usually published: for example, compiler symbol tables or formatting information. Though limited, the area to which SGML addressed itself is an important problem domain in itself.

Such a limitation, however, does an injustice to the basic power of the tagging approach. Given that the tagging, or markup, approach is roughly equivalent to our model of standard forms (see Chapter III)® we would expect that it would be applied to more than just commonly

published material. Indeed, progress has been made in extending the

application of SGML to domains beyond textual documents, e.g.,

blueprints and wiring diagrams [7j.

A basic difference between our standard forms and SGM L’s is that

SGML does not allow the reuse of tag names. Within a document type,

SGML requires that all tag names be unique regardless of their position.

Thus if an author tag in the bibliography has a different content model

than an author tag in the title page, the two tag names would have to be

different. A user would then have to remember the distinctions between

all tags that have counterparts in other structures during the creation of a

document. We feel this burden interferes with human processing of

standard forms. Our model allows for the reuse of tag names.

®To review the essentials, our model ensures that- the start, end. and type of all significant information in a string is clearly indicated. We make use of braced languages to ensure that the standard-form representations are properly ordered. 66 Theoretically, however, there is no difference in power between the two approaches. One can certainly use our model without reusing names, if one so chooses, and one can identify all reused names and assign them corresponding unique names. In the special case where the name space is finite, the approaches do differ; the SGML approach could then describe only N objects, while the approach we have developed could describe more.

As we will show in Section V.2.2, SGML is at least as powerful as the Braced languages are for describing standard-form objects. This potential promises much for the future of SGML, as it implies that SGML can be used to describe many document types in particular, and many data objects in general.

It is because of this power that we see it as a mistake to confuse the issue by including, in the specification of SGML, features th at relate directly to text processing. There is, for example, a provision to allow for the insertion of processing commands. There are also keying conventions to reduce the number of characters that must be typed: “short references’’. “NET-enabling start-tags”, “unclosed start-tags”. and

“unclosed end-tags’\ As valuable as these features are, they should not be incorporated in the standard, but rather layered on top of it.

Furthermore, it should be stressed that SGML need have no inherent, domain-specific semantics associated with it; SGML should be made available to a wide variety of domains, and not thought of as being just for text and office systems. It would seem, then, that SGML should be 67 specified as a generic system, and a document markup system should be specified in terms of it. The two should not be interwoven.

The cloaking of SGML’s power aside, it is in the contemplation of

“attributes"7 that we see the failed vision of the SGML concept of content markup. SGML defines an attribute as: “ a characteristic quality [of an element], other than type or content.” These attributes are, more or less, named parameters inside of start-tags. In figure 11, we reproduce an example from the ISO standard.

Figure 11: A Start-Tag with Two Attributes

The attributes of the start-tag “memo” are “security” and “sender”.

It is not clear to us why the sender of a memo and the security classification of a memo are not considered sufficiently important to be part of the content of the memo. Here is the problem: in a content- tagging approach all logically distinct information should be marked, why then introduce a second notation for markup when an adequate notation exists?

7 As we will see, SGML “attributes” should not be confused with the attributes of attribute grammars. 68

In SGML, the values of attributes are treated differently than content strings are. Attribute values are subjected to type checking, and default values may be supplied for missing attributes. If these functions are desirable, why not supply them for content strings as well?

It is our belief that SGML attributes are a mistaken idea and should be abandoned. Any attribute can be represented by a tagged string. One simply has to define the appropriate start- and end-tags, and rewrite the content model to contain what were formerly attributes as marked-up information. The functions previously reserved for attributes should then be made available for all strings or none.

V.1.2. The Standard Itself

We feel that the SGML standard does not reflect the state of the art in the specification of grammars intended for automatic processing. It is not within the purview of this dissertation to painstakingly examine and report on the entire SGML specification, but a discussion of several major concerns should give the interested reader a feeling for the whole. We will, therefore, present some observations on the ease with which the grammar can be processed by existing tools and a brief discussion on how to improve a part of the specification that seems needlessly inadequate. 69 V.1.2.1. Processing with Existing Tools

The processing of a string begins with lexical analysis. The current convention is to enable the lexical analyzer (scanner) to return tokens8 without concern for left- or right-context (i.e., what has gone before or will come after). SGML has characteristics that make this direct9 approach difficult. For example, SGML has the following token classes

(among others): “name-tokens”, “number-tokens”, and “numbers”; and the string “ 113 0 ” can be in any of the listed token classes depending on the context. Some scanner generators (like Lex [28,) can, with added effort, produce indirect lexical analyzers; others (like GLA [18)) can not.

Processing continues with syntactic analysis. There are several general purpose tools that generate syntactic analyzers (parsers). Yacc, perhaps the most commonly available generator, and others (like Xerox’s PGS) generate LALR(l) analyzers. The other major class of generators produces

LL(l) analyzers. SGML, as currently specified, cannot be parsed by LL(l) or LALR(l) techniques.

SGML was specified in a type of extended Backus-Naur form that can readily be converted to standard Backus-Naur form (BNF) [34). After

o We will use the word token with its standard computer-science meaning, not with its SGML meaning.

9Using the terminology of [ 1 ■. a direct lexical analyzer is told where to look for the next token; an indirect lexical analyzer must also be told what type of token to look for next. 70 conversion to BNF it was apparent that the grammar was not LL(1) since some of the productions lack unique handles (prefixes). When we attempted to use Yacc to generate an LALR(l) analyzer for SGML, we were warned of over 500 shift/reduce10 conflicts, and over 600 reduce/reduce11 conflicts for a grammar with 555 rules.

We are imprecise about the number of conflicts reported since the specification given in the standard is incomplete, and we can expect more conflicts to be reported for the complete grammar. We know that the grammar given in the standard is incomplete because not all the productions can be reached from the start symbol (e.g. “link set use declaration”), and not all nonterminals are defined (e.g. “link type use declaration”, “ISO text description”, and “character set description”).

Although the standard specifies a grammar that is not LALR(l), we believe that it is possible to specify SGML as an LALR(l) grammar. We took the incomplete grammar, and rewrote it. Our version has 19 shift/reduce conflicts (for which the Yacc default of shifting is the correct choice) and zero reduce/reduce conflicts. The main problem with SGML’s current specification of the grammar is a misconception about the role and placement of separators. The separators, which the standard says are to

shift/reduce conflict occurs when the analyzer generator must arbitrarily decide whether to make an analyzer continue with a production, or to have it accept a different production.

^A reduce/reduce conflict occurs when the analyzer generator must arbitrarily decide which of several equally acceptable productions an analyzer will select. 71 be completely ignored, are written into the very productions that are to ignore them.

name group = grpo, ts*, name, (ts*, connector, ts*, name)*, ts*, g rp c

Figure 12: A Sample SGML Rule Containing Separators

In Figure 12, we see a sample SGML rule containing separators. The

“ts” are the separators. Their presence in the rule causes a problem for single-look-ahead parsers. Consider the following string (we explicitly indicate the separators): “grpo namel ts name2 ts grpc”. When the parser considers “name2” it does not know if it should continue accepting names. So it asks for a look-ahead, but unfortunately this look-ahead does not resolve the dilemma. The ts it gets as its single look-ahead could either precede another name or the grpc. The parser would require an additional, a second, look-ahead to correctly continue. Thus we see

that the inclusion of separators in this manner makes it impossible to compute the look-ahead tables used by LALR(l) and LL(l) parsers. 72

V.1.2.2. A Case of Inadequate Syntax

In this section we will see that, in addition to being difficult to process, the grammar of the SGML specification is ambiguous.

Coincidentally, it was SGML attributes that we pointed to as weakening the theoretic model, and it is SGML attributes that we will point to as weakening the syntactic model.

la ::= b | empty

2a ::= = b |

3a ::= b |

F igure 13: A Simplified Version of the SGML Standard’s Grammar for “Attribute” Lists

In Figure 13 we present a simplified model of the SGML syntax for an attribute list. Our simplifications serve to highlight the problem, and in no way contribute to it. A brief examination of this syntax will make the problem apparent: in a list, of names how does one distinguish between the attribute names and the attribute values?

The - SGML standard attempts to offer relief from this dilemma by requiring that production 2b can only be used when the name is one of 73 an enumerated type12. A further prohibition insures th a t the name is valid for only one of the attributes for a given tag13. For example, if

“memo"’ is a tag with an attribute “sta tu s” that can be either “d ra ft” or “ f i n a l ”, then a second attribute, say “n o t i c e ”, can not use either of those words. One could not say that “notice” can be either

“firs t”, “second”, or “final” since “fin al” is now reserved for

“statu s”; one would have to substitute a word like “la s t”. We find that such a restriction on vocabulary conflicts with human readability since it is possible that unusual or unnatural words will have to be substituted for familiar words where there are duplications.

There is a yet more serious problem of ambiguity. Let us say that the “memo” tag has a third attribute, “ k e y w o rd s” of type list-of-names, that is intended to facilitate automatic retrieval of memos relating to selected topics. Now, consider the string “ ”. Is “firs t” a keyword, or is it implying “notice = firs t”? Is “d raft” a keyword, or is it implying “status = d ra ft”?

We maintain that ambiguities of this type are not addressed by the standard. One solution would be to forbid the use of reserved words in the “ k ey w o rd s” list, but we find such a solution awkward and counter to human usability.

12See clause 7.9.1.2 of the standard

13See clause 11.3.3 of the standard 7 4

la ::= ; b | empty

2a ::= = b |

3a ::= b |

Figure 14: An Improved Grammar for “Attribute” Lists

We propose a solution to these two ambiguity problems. We do so to demonstrate how simple modifications in the syntactic and semantic specification of SGML can greatly enhance automatic processing and human usability. First, we would allow production 2b only when it is unambiguous which attribute takes that value. This would allow for the reuse of natural words, for example, one would not have to remember that a final notice was specified by the reserved word “ l a s t ” . Second, we would respecify the syntax for attribute lists along the lines of the grammar shown in Figure 14. The slight change in production la makes a huge difference in removing the ambiguity from the language for attributes, and it allows lists to contain whatever words one chooses. For example, the string “” is easily parsed. 75 V.2. The Expressiveness of SGML

In Chapter IV we examined the expressive power of the braced languages under homomorphism. We did so in order to understand the braced languages’ usefulness as a standard form for the representation of information. In this section we examine the expressive power of SGML.

There are many features of SGML not related to its expressiveness.

To clarify what we are studying, we will ignore those extraneous features.

In section V.2.1, we will list features that we are ignoring, and in section

V.2.2 we will discover the underlying expressiveness of SGML. Since none of the features that we ignore decreases the power of SGML, the expressive power of the simplified language is a lower bound on the expressiveness of the whole language.

V.2.1. Restrictions Placed on SGML

This section is likely to be of interest only to those readers familiar with SGML: we recommend that other readers skip this section.

We ignore nesting depth constraints, since they are implementation- based restrictions.

We ignore entity references, which are string substitutions, as we can always deal with the string after the substitutions have been made.

We ignore character references, which are substitutions of a string by a single character. They are just a special type of entity reference. 76

We ignore data tags, which are strings that imply specific tags. We can always insert the implied tags, and then not recognize the data tags.

We ignore short tags, which are just tags from which contextually redundant characters have been removed. They express the same information as the verbose tags.

We ignore empty tags, which indicate the placement of a tag and imply its content. They can be replaced by the full tag.

We ignore the omittag facility. This facility allows tags to be omitted where they can be unambiguously inferred. We can just include the omitted tags.

We ignore short references, which are a form of entity reference.

We ignore rank, a feature that allows for some synonyms in a standard form. We can always provide the unique tag name instead of the synonym.

We ignore the link features, which specify string replacements for tags. This feature, though it resembles the homomorphisms of Chameleon, lacks Chameleon’s power since order can not be changed and the parse of the string can not be taken into consideration.

We ignore concurrent markup, which is a feature th at allows more than one standard-form markup of a given document. Since only one markup can be valid at a given time, we can ignore the inactive markup and we have reverted to simple markup.

We ignore nested subdocuments as this feature is a processing consideration and does not effect expressive power. 77

We ignore processing instructions as we are interested in the

representation of information, not in inline computation.

We ignore empty content, which allows only the start-tag to be

present for a string that must always be empty. If we require the corresponding end-tag to be present (as in the usual case) and specify that

its homomorphism is always the empty string, then we have the same functionality.

We ignore non-sgml data, which is basically ignored by SGML itself.

We ignore attributes since the information that they carry can be

modeled by marked-up strings.

We ignore the operator in content models since the rules containing it can always be rewritten without it.

W'e ignore "exceptions’’ (“inclusions” and “exclusions”), since they are notational conveniences that can be avoided.

V.2.2. The Expressiveness of Restricted-SGML

For purposes of analysis, we consider a subset of the SGML

languages: the Restricted-SGML languages.

Definition 24: A Restricted-SGML language is a context-free

language generated by the grammar

R = (V, T, P. V^),

where 78 • V is the set of nonterminals {C. V U -1 1 < i < k, v v j * — —

1 < j < m}

• T is the set of terminals {a^ z^, | 1 < i < k, 1 < j < /} .•

The a.s are start tags; the z h are end tags, and the c?s

are data characters

• P is the set of productions containing

V. - a. M. z, (Vi, 1

U{ - Nv (Vi, 1 < i < m. N{e{V- C)*),

C -£ ,

c - Cj a (vy, i

• is the start symbol

We will prove that the Restricted-SGML languages can serve, without loss of generality, as the basis for an implementation of the Chameleon model. We make use of Theorem 25, which has been proved elsewhere

(e.g. in [22;).

Theorem 25: (Chomsky normal form) Any context-free

language without e is generated by a grammar in which all

productions are of the form A —* B C or .4 —> a. Here, A, B

and C are variables, and a is a terminal.

The following two lemmas are used to prove the range of the

homomorphisms of Restricted-SGML languages.

Lemma 26: Every context-free language can be expressed as a

homomorphism of a Restricted-SGML language. Proof: For an arbitrary context-free language L , define L' as the concatenation of L with e, a terminal not in the terminal set of L. The language L' is context-free and without e: the context-free languages are closed under concatenation, and e is not in the context-free language e.

1. Represent the grammar for L' in Chomsky normal form.

2. Define h(e) = e.

3. Rewrite the productions of the form A —* BC as

A —► aJBCz^, and define h{a ) = e and h{zj = c.

4. Rewrite the productions of the form A —> a as A —* a z , v y and define h{a^j = a and h(z^) — e.

The SGML language specified by the grammar created above maps under h to the original arbitrary Context-free language.

L em m a 27: Only context-free languages can be expressed as homomorphisms of a Restricted-SGML language intersected with a regular set.

Proof: The Restricted-SGML languages are context-free.

Context-free languages are closed under substitution. Therefore, the homomorphism of a Restricted-SGML language must be a context-free language.

Theorem 28: Homomorphisms of Restricted-SGML languages can express exactly those languages that can be expressed by homomorphisms of a Braced language intersected with a regular 80

set.

Proof: Immediate from Lemma 26 and Lemma 27.

Therefore, basing the implementation on Restricted-SGML languages does not lose any of the power of the model. A minor difference is that strings of a Braced language can always be recognized without ambiguity and the same can not be said for SGML languages. This ambiguity is not serious because it happens only where two (or more) alternatives can be mapped to the empty string. In those cases the information content of the string does not depend on the actual parse selected. For example, if the productions include S —*■ aAz, S —> aBz. A —> e and B —> e, then the parse for the string az can be S —> aAz, A —► e or S —> aBz, B —> e. Either parse is suitable, so ambiguities of this kind are not serious.

V.3. Tools to Simplify the Processing of SGML Documents

Given that we expect the number of SGML applications to increase and since the Chameleon project seeks to automate the process of translation, it will be useful to develop tools that aid in the processing of

SGML documents. There are two phases to the Chameleon process: the build phase and the use phase. The SGML standard specifies one monolithic grammar, but it facilitates processing to decompose the specification into separate grammars based on where the information is needed. As Williams [48j says: "By breaking [the grammar] into [more 81 manageable] modules, the size of the parsing table in reduced, [and] each module is simpler to generate . . . In the remainder of this section, we will sketch a design for the processing SGML documents, and discuss the current state of its implementation.

V.3.1. The Beginnings of a Design

The distinction between build and use phases suggests a division in the grammar between the SGML declaration and the “document instance set”. The SGML declaration is used in the build phase and is further subdivided. The “document instance set” is the set of documents that represent an object written in standard form notation.

Whether SGML is processed in whole or in part, it is useful to have a validator for the SGML declaration. Such a tool may be separate or may be incorporated in the tools that process components of the declaration. An advantage of a separate validator is that the entire burden of error detection and handling can be isolated in one tool, and by being assured that the declaration is valid, the other tools can be made more manageable.

A major portion of an SGML declaration is a grammar specifying the language of the document (standard form). This grammar is written in a syntax close to an extended BNF. Rather than develop a compiler- compiler (syntax-analyzer generator), a large project in itself, it make sense to use one that is available. Yacc was chosen for this purpose. 8 2 Unfortunately Yacc’s syntax for a grammar specification is nearly BNF without extensions. Therefore, an automatic rewriting system from SGML notation to Yacc notation would be useful.

Another major portion of an SGML declaration is the information pertaining to the attributes of start-tags. This information could be put to good use by a scanner generator. The resultant scanner could then collect and order the attributes, check their types, and supply their implied or default values as needed. There are examples in the literature of special purpose scanner generators (e.g. [18]). The advantages offered by a custom generator over a more general one are usually related to size and performance.

The remainder of the SGML declaration deals with entity declarations.

Entity declarations are similar to macro calls in programming languages.

This information will also be useful to a scanner for SGML documents. It seems, given the expected frequency of entity references, that a tool to condition the input string before processing would be worthwhile. This is currently the approach taken in some compilers where there is a separate pass to resolve file, string and character substitutions.

In the use phase it would be convenient to have a scanner that is capable of cleaning up the input file. By cleaning up we mean handling the optional keying conventions and alternate typographic methods without involving the syntactic analyzer. In SGML there are shorthand notations permitted in start and end tags. If these methods were made part of the grammar, the state table of the parser would become much larger than if 8 3

only the uniform, rigorous, clean form of the input were to be processed.

Also, start tags can contain “attributes”; for purposes of translation it is

convenient to have the tags delivered by the scanner in a normalized

form. Such a form will have validated all the attributes and will present

them in a useful order to speed processing.

V.3.2. The State of the Implementation

I have developed a prototype version of an SGML validator.14 The

main effort here was in rewriting the specification’s grammar so that is

was LALR(l), in section V.1.2.I we discussed some of the obstacles. Our

version is not complete; there is still functionality to be added. The syntax is checked, but not all of the clauses are enforced. There are, for example, checks to be made of name length. We believe that these clauses can be handled by action routines inserted into the Yacc skeleton

that specifies our validator. The National Bureau of Standards is one of several other groups that are working on SGML validators.

Carol Hoover, while working for the Chameleon project, has been developing an SGML-grammar notation to Yacc-grammar notation translator. The translator is written as a Yacc/Lex program, and is not being built following Chameleon methodology. It would be an interesting

14The reader should bear in mind that this work is subject to the restriction of not having a complete grammar from which to work. We mentioned earlier that the SGML ISO Standard has flaws. 8 4 problem to specify a standard form for grammar notations. Indeed, some work in this direction has been done, but no consensus has been

reached [17]. Chameleon does not seek to specify standard forms Itself,

therefore it has initially opted for an ad hoc approach to this translation problem. The grammar rewriter that is being developed does not yet support the full functionality of the SGML-grammar notation.

There is currently no effort to construct the type of special-purpose scanner generator described above. After experience has been gained with the construction of a special-purpose scanner, it is contemplated that work will proceed with a scanner generator.

I have been constructing a special-purpose scanner. The scanner can currently handle many of the markup minimization features proposed by

SGML. The major work still to be done is to incorporate the checking of attributes into the scanner. One of the interesting features of this scanner is that it is a Yacc parser, thus the parsing of an input SGML document is done by a hierarchical composition of LALR(l) syntactical analyzers.

The power of a syntactical analyzer over the typical lexical analyzers, which deal with regular languages, is required to efficiently process attributes. In order to use the syntactical analyzers hierarchically, it was necessary to modify Yacc’s code to produce parsers that could be used repeatedly in a one-production-at-a-time mode and not in a file-at-a-time mode. Error recovery routines had to be written into the Yacc input descriptions to complete the conversion to one-production-at-a-time processing. CHAPTER VI The Identification of Redundant Tags

Given our model for standard forms, we ask: what factor inherent in a standard form would tend to dispose people to reject its use? The answer to this question is: its verbosity. Users can be expected to want to type as few characters as possible.10 Depending on the specific markup and data, the ratio of markup keystrokes to data keystrokes can vary. In

Figure 2 (p. 17) there are more than twice as many markup characters as there are data characters. Such a ratio, if it were the rule, would certainly inhibit the acceptance of tagging.

There are several techniques that can be employed to lessen the number of characters that are required to mark up a data object. One technique is to reduce the tag length; another is to reduce the number of tags. By reducing the tag length we do not mean choosing short and cryptic tag names. By reducing the number of tags we do not mean

1®There is, of course, a trade-off between character set size and string length. Using Chinese one has to type very few characters, but one must know a huge character set. We will not. explore this trade-off, and will assume a standard ASCII character set.

85 86 restricting the expressive power of the standard form.

In this chapter we will present a brief description and discussion of some of the reduction techniques used in SGML; and then we specifically address the general problem of detecting redundant (or omissible) tags in the strings of a braced language. We do not develop a theory for tag- length reduction techniques (methods for shortening explicit tags) since we consider general macro and string substitution techniques adequate.

VI.1. Reduction Techniques in SGML

SGML employs methods to accomplish both kinds of reduction: methods to reduce the number of characters, and methods to reduce the number of tags.

VI.1.1. Shortening Tags

SGML allows a number of techniques for reducing the tag length. It uses just one character to indicate a start tag, e.g. “<” in “” is the equivalent of ‘‘< sta rt” in “< sta rt poem>”. Similarly, it uses two characters to indicate an end tag, e.g. “” is the equivalent of “”. There are other permitted features that allow for the closing of the most recently opened tagged string, e.g. “A Poem</>” is equivalent to “<sfcart title>A </p><p>Poem<end title > ". There are any number of similar conventions, both 87 in SGML and in the imagination. The examples we have given give a flavor for this class of length reduction techniques.</p><p>Another way to reduce the length of tags is the use of keying conventions. SGML uses what it calls “shortrefs” to define keying conventions. The shortrefs are a set of character sequences, usually involving a single special character, or a tab, or a record start, or a record end, or a sequence of blanks. When a shortref is encountered in an input string it is replaced a predefined string, which usually contains a start- or end-tag. Shortrefs are similar to the conventions used in </p><p>WYSIWYG1*5 editors where two newlines might imply the start of a paragraph, and so on.</p><p>V I.1.2. O m itting Tags</p><p>The techniques we have just discussed reduce the length of tags, but in some cases it is possible to omit a tag altogether.1' Where a tag can be unmistakably inferred it is not necessary to include it. For example, given the definition of the standard form in Figure 8 (p. 46), the string</p><p>^WYSIWYG (“wizzy-wig”) is an acronym for “what you see is what you get”. In a W YSIWYG editor tiie user manipulates a representation of an object th at closely resembles the output version of the object. This approach differs from a m arkup system, in which the user’s descriptive version is processed to produce a markedly different final- output form.</p><p>17 Note that in our WYSIWYG example the two newlines were the tag—the tag was indeed present. 88 . . <end titlexstart line> . . is equivalent to . .</p><p><end titlexstart bodyxstart line> . . since the</p><p>“< start line>” tag can not follow the “<end title > ” tag without </p><p> implying the “< start body>” tag.</p><p>SGML has two methods for allowing the omission of tags. One method is the use of “data-tags”, which are the definition of data strings that imply the insertion of predefined strings (usually containing tags). </p><p>The other method is to allow the specifier of the standard form to permit certain tags to be omitted under certain circumstances.</p><p>From clause 7.3.1.1 of the ISO standard [43], we learn that in SGML</p><p>The start-tag can be omitted [with certain exceptions] if the element is a contextually required element and any other elements that could occur are contextually optional.</p><p>From clause 7.3.1.2. we learn that</p><p> the end-tag can be omitted . . . for an element that is followed either a) by the end-tag of another open element', or b) by an element or SGML character that is not allowed in its content.</p><p>The document designer, the specifier of the SGML standard form, indicates in the description of a tag pair if the start- and/or end-tag are omissible subject to the conditions given above.</p><p>The conditions under which an SGML start-tag is omissible are overly restrictive if one merely seeks to, in the words of the standard, “[prohibit] models that . . . require ‘look-ahead’ [p. 152]”. For example, let us assume that a phone number is an optional part of an address. 
VI.1.2. Omitting Tags

The techniques we have just discussed reduce the length of tags, but in some cases it is possible to omit a tag altogether.17 Where a tag can be unmistakably inferred it is not necessary to include it. For example, given the definition of the standard form in Figure 8 (p. 46), the string ". . . <end title><start line> . . ." is equivalent to ". . . <end title><start body><start line> . . .", since the "<start line>" tag can not follow the "<end title>" tag without implying the "<start body>" tag.

17 Note that in our WYSIWYG example the two newlines were the tag; the tag was indeed present.

SGML has two methods for allowing the omission of tags. One method is the use of "data-tags", which are the definition of data strings that imply the insertion of predefined strings (usually containing tags). The other method is to allow the specifier of the standard form to permit certain tags to be omitted under certain circumstances.

From clause 7.3.1.1 of the ISO standard [43], we learn that in SGML

The start-tag can be omitted [with certain exceptions] if the element is a contextually required element and any other elements that could occur are contextually optional.

From clause 7.3.1.2 we learn that

the end-tag can be omitted . . . for an element that is followed either a) by the end-tag of another open element, or b) by an element or SGML character that is not allowed in its content.

The document designer, the specifier of the SGML standard form, indicates in the description of a tag pair whether the start- and/or end-tag are omissible subject to the conditions given above.

The conditions under which an SGML start-tag is omissible are overly restrictive if one merely seeks to, in the words of the standard, "[prohibit] models that . . . require 'look-ahead' [p. 152]". For example, let us assume that a phone number is an optional part of an address. Further assume that a phone number must have an area code, and that area codes occur only in phone numbers. Then, if we were to process an address and find an area code, there would be no problem in concluding that a start-phone-number tag had been omitted. Unfortunately, SGML does not allow for this scenario, since the phone number was not contextually required.
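As a preview of the kind of inference that Section VI.2 formalizes, the Python sketch below recovers the omitted start-phone-number tag when an area code appears directly inside an address. The element names and the one-level "allowed children" table are hypothetical and stand in for the full standard-form machinery; this is not the algorithm of Section VI.2.

```python
# A rough sketch: the element names and the one-level "allowed children"
# table below are hypothetical, not taken from Figure 8.
ALLOWED = {                      # element -> start-tags allowed directly inside it
    "address":      {"street", "city", "phone-number"},
    "phone-number": {"area-code", "local-number"},
}

def infer_start_tags(open_elem: str, seen: str) -> list[str]:
    """Return the start-tags to open (including any implied one) when `seen` is read."""
    if seen in ALLOWED.get(open_elem, set()):
        return [seen]                         # allowed directly: nothing to infer
    hosts = [child for child in ALLOWED.get(open_elem, set())
             if seen in ALLOWED.get(child, set())]
    if len(hosts) == 1:                       # exactly one possibility: the tag is implied
        return [hosts[0], seen]
    raise ValueError(f"cannot place <start {seen}> inside <start {open_elem}>")

print(infer_start_tags("address", "area-code"))   # ['phone-number', 'area-code']
```

This is precisely the inference that SGML's "contextually required" condition forbids, even though no look-ahead is involved.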
VI.2. Identifying Redundant Tags

In this section we present a general method for allowing the omission of redundant tags: tags whose presence can unerringly be inferred without look-ahead. We chose not to consider look-ahead for two main reasons. One, general look-ahead processing would require the use of resource-intensive parsing techniques; and two, its potential complexity would conflict with the goal of human readability adopted by SGML and Chameleon. SGML's method for identifying omissible tags is overly restrictive from a language-theoretic point of view. In the first subsection, we will look at analyzing a braced-language specification to determine if there is inherent redundancy in its specification. In the second subsection, we will restate SGML's criteria for tag omissibility to allow for the omission of all tags that can be inferred without using look-ahead.

VI.2.1. Redundancy in a Standard-Form Specification: Omitting Productions

We have seen in Chapter III that the braced languages are a subclass of the generalized parenthesis languages. We use the braced languages as standard forms for data objects, and our discussion of omissible tags in standard-form representations can be couched in terms of the generalized parenthesis languages.

Yamasaki et al. [50] describe a method for minimizing the "parenthesis parts" of gpls, i.e., a method for finding "a minimal set of parentheses that can express the nesting property of the language". Such a method may be useful for the specifier of a standard form. It would be a way to detect a redundant production. There is, however, another concern that would invalidate some of the reductions that Yamasaki's technique would make: namely, in a standard form the parenthesis parts identify logical units. For example, in Figure 8 (p. 46), the sample standard form for poems, we can see that the contents of three different sets of parentheses are the same; the strings derived from the title-, line-, and author-rules are all data-character strings. Structurally, there is no need to have three types of parenthesis from which to derive data strings, but semantically, from the standard-form designer's standpoint, there is a distinction to be made between titles, lines, and authors. So, even though some parenthesis parts could be eliminated for syntactic reasons, the standard form might still require the parts for semantic reasons. A truly redundant rule would be removed since it adds no new semantics to the standard form.

Although a method exists for finding syntactic redundancy in a standard form, the standard-form designer must finally decide what is required for semantic reasons. Therefore, Yamasaki's technique is expected to be of little use for standard forms.

VI.2.2. Redundancy in a Standard-Form String: Omitting Tags

Given a standard-form grammar, it is possible to discover which tags in a string are redundant by examining the regular-set portion of a braced language. We know that every regular set can be represented by a finite state automaton [22] in which the edges (transitions) are labeled by start- and end-tags and data characters. Therefore our discussion of redundant tags can be couched in terms of the states and edges of a finite state automaton.

First we give our definition of omissibility without look-ahead.

Definition 29: (Omissibility Without Look-Ahead) Let σ = α·u·v·β, where

• σ is a complete standard-form string,
• α is the portion of the string before the tag in question,
• u is the tag for which omissibility is to be determined,
• v is the tag or character immediately following the tag in question,
• β is the remainder of the string.

We say that u is omissible without look-ahead if the decision algorithm does not depend on β.

We now give our definition of an omissible start-tag.

Definition 30: (Omissible Start-Tag) Let σ = α·x·b·β, where

• σ is a complete standard-form string,
• α is the portion of the string before the start-tag in question,
• x is the start-tag for which omissibility is to be determined,
• b is the tag or character immediately following the start-tag in question,
• β is the remainder of the string.

Let s be the state reached on input α by the minimal, deterministic finite state automaton for the regular set defining the standard form. Let s′ be the state reached from s by following transition x. Let T = {the set of states that can be reached from s in one transition} − {s′}.

A start-tag is considered redundant, and therefore omissible, only when there is no state t (t ∈ T) such that there is an edge out of t with the same label as an edge out of s′.

Theorem 31: Definition 30 identifies exactly those start-tags that are inferrable without look-ahead.

Proof: Using the notation of the preceding definition, assume x is a start-tag inferrable without look-ahead, but does not meet the criteria of the definition. If x labels an edge from state s back to s, then an infinite number of strings from the language α·x·x*·b map to α·x·b if x can be omitted. Since it is then impossible to always correctly infer an x, we have a contradiction, and x can not label an edge from s back to s. If state s has a b-edge, then α·x·b and α·b are both prefixes of strings in the standard-form language. Since it is then impossible to always correctly infer an x, we have a contradiction, and s can not have a b-edge. Therefore x can not be inferrable and yet fail to meet the criteria of the definition.

Clearly, a start-tag x that meets the criteria of the definition is inferrable, since it labels the bridge between the portion of the machine that accepts α and the portion that accepts b·β.
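To make Definition 30 concrete, here is a minimal Python sketch. It assumes the minimal deterministic automaton for the regular part is available as an explicit transition table; the state names and the simplified poem-like fragment are hypothetical, not the automaton derived from Figure 8.

```python
# Hypothetical transition table {state: {label: next_state}} for a simplified
# poem-like regular part; not the automaton derived from Figure 8.
DFA = {
    "q0": {"<start title>": "q1"},
    "q1": {"data": "q1", "<end title>": "q2"},
    "q2": {"<start body>": "q3"},
    "q3": {"<start line>": "q4"},
    "q4": {"data": "q4", "<end line>": "q5"},
    "q5": {"<start line>": "q4", "<end body>": "q6"},
    "q6": {},
}

def start_tag_omissible(dfa: dict, s: str, x: str) -> bool:
    """Decide Definition 30 for start-tag x at state s (the state reached on the prefix)."""
    if x not in dfa[s]:
        return False                               # x cannot occur here at all
    s_prime = dfa[s][x]                            # state reached by following x
    T = set(dfa[s].values()) - {s_prime}           # other states reachable from s in one step
    labels_after_x = set(dfa[s_prime])
    # Redundant (omissible) only if no state in T shares an outgoing label with s',
    # i.e. the symbol that follows pins down that x must have been present.
    return all(labels_after_x.isdisjoint(dfa[t]) for t in T)

print(start_tag_omissible(DFA, "q2", "<start body>"))   # True: after <end title>,
# a <start line> edge can only be reached through <start body>
```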
We now give our definition of an omissible end-tag.

Definition 32: (Omissible End-Tag) Let σ = α·y·b·β, where

• σ is a complete standard-form string,
• α is the portion of the string before the end-tag in question,
• y is the end-tag for which omissibility is to be determined,
• b is the tag or character immediately following the end-tag in question,
• β is the remainder of the string.

Let s be the state reached on input α by the minimal, deterministic finite state automaton for the regular set defining the standard form. Let t be the state reached from s by taking transition b.

An end-tag is considered redundant, and therefore omissible, only in the following two situations:

1. if b is an end-tag (matching a start-tag opened either explicitly or implicitly), or
2. if there is no b-edge out from s.

Theorem 33: Definition 32 identifies exactly those end-tags that are inferrable without look-ahead.

Proof: Using the notation of the preceding definition, assume y is an end-tag inferrable without look-ahead, but does not meet the criteria of the definition. So, b is not an end-tag, and s has a b-edge out. In this case, if y is omitted, it can not always be correctly inferred, since both paths yield prefixes of strings in the standard-form language; and we have a contradiction. Therefore, y can not be inferrable and yet fail to meet a criterion of the definition.

Clearly, an end-tag y that meets a criterion of the definition is inferrable, since it either labels the bridge between the portion of the machine that accepts α and the portion that accepts b·β, or it is followed by an end-tag. If y is followed by an end-tag then it is completely inferrable, since the tags of a braced language are balanced and nested, and a stack can keep track of the start-tags explicitly or implicitly encountered in α.
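A companion Python sketch for Definition 32 follows, under the same assumptions as before: a hypothetical transition-table fragment (again not the automaton of Figure 8), with a stack of open element names standing in for the start-tags, explicit or implied, seen while reading α.

```python
# Hypothetical two-state fragment (the states inside a poem line); open_elems is
# the stack of element names opened, explicitly or by inference, on the prefix.
DFA_FRAGMENT = {
    "q4": {"data": "q4", "<end line>": "q5"},
    "q5": {"<start line>": "q4", "<end body>": "q6"},
}

def end_tag_omissible(dfa: dict, s: str, open_elems: list[str], b: str) -> bool:
    """Is the end-tag closing open_elems[-1] omissible at state s, given that b follows?"""
    if b.startswith("<end ") and b[len("<end "):-1] in open_elems[:-1]:
        return True          # condition 1: b is the end-tag of another, still-open element
    return b not in dfa[s]   # condition 2: no b-edge out of s, so this close is forced

# Inside a line of the body (state q4), with "<end body>" coming next,
# the "<end line>" tag may be omitted by condition 1:
print(end_tag_omissible(DFA_FRAGMENT, "q4", ["body", "line"], "<end body>"))   # True
print(end_tag_omissible(DFA_FRAGMENT, "q4", ["body", "line"], "data"))         # False: more
# data may still belong to the open line
```

The stack of open elements is exactly the device appealed to in the proof of Theorem 33.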
We note that the definition for omissible end-tags is the same as the definition used by SGML. It is the definition of omissible start-tags that differs. Recall our phone-number example (Section VI.1.2) of a situation where SGML's definition failed to allow for a tag inferrable without look-ahead. Definition 30 permits the start-phone-number tag to be optional.

SGML's definition for omissible start-tags depends only on the specification of the standard form. Definition 30 takes into account information inferrable from a specific standard-form string. It thus allows for shorter standard-form representations, and more flexibility is possible in a user-produced standard-form string.

CHAPTER VII

Summary and Conclusion

In this concluding chapter, we present a brief summary of what has been presented, a list of the main contributions, and a short review of the areas for possible future work.

VII.1. Summary

In this dissertation we have addressed the problem of dealing with the diversity found in computerized environments by developing a formal model of translation based on the use of standard forms. Standard forms offer great savings over approaches that develop pairwise translators.

First, after the introduction, we described the Chameleon project, which was the context for much of the work developed in this dissertation. The testbed offered by the Chameleon project was a system in which to try out some of the theory that we developed, and in which to design and test solutions to some of the problems we identified: the need for oracular parsers, and the handling of omissible tags.

Second, we defined our standard-form model: the braced languages. We gave two definitions of the braced languages and showed that the definitions were equivalent. The significance of the definitions is apparent when parsers are constructed to recognize the braced languages. These languages can be efficiently parsed by LALR(1) techniques, which are supplied by common parser generators like Yacc.

Third, we defined the model of translation, information loss, and functionality mismatch; and we proved some theoretic results based on that model. Of major importance is that the model of translation applies only to the context-free languages, that translation-down is deterministic, and that translation-up requires oracular intervention in the general case.

Fourth, we presented a critique of the Standard Generalized Markup Language, and discussed it in relation to our model of standard forms and translation. The critique focused on technical, as well as theoretic, aspects. We were able to propose syntactic improvements and identify SGML's expressive power. Additionally, we introduced some tools for the processing of SGML.

Fifth, and finally, we developed a theory for the identification of redundant, and therefore omissible, tags. Some of the techniques used by SGML for tag-length and tag-number reduction were discussed, and their restrictiveness was demonstrated. Then, a general and complete method for identifying omissible tags was given.

VII.2. Main Contributions

The main contributions of this dissertation are:

• The development of a formal model of translation and an exploration of the theoretic properties of that model, specifically:
  o the model applies only to context-free languages
  o translation-down can be done deterministically
  o translation-up requires oracular assistance
• A proof that the context-free languages are a lower bound on the applicability of the Chameleon architecture
• An initial analysis of the Standard Generalized Markup Language reporting that:
  o SGML can serve as a basis for the context-free languages
  o the conceptual model is not minimal
  o the specification is sometimes ambiguous or overly restrictive
  o the definition of tag omissibility is overly restrictive
• An extended definition of tag omissibility without look-ahead

VII.3. Future Work

There are several directions for future work to take, including human factors, software environments, theory, and algorithm design.

Possibly the most interesting future work is in extending the model to the context-sensitive languages. In Okawa et al. [37], it was proved that a formal model of translation like the one presented in this dissertation is inadequate for the context-sensitive languages. Okawa suggests that imposing restrictions on the homomorphisms might be a way to characterize this class of languages. The approach used by the Chameleon project has been to allow the use of functions instead of homomorphisms, but exactly which class of languages is reached by this technique is currently unknown. Yellin, also, has proposed the use of invertible functions in his work. Again the exact class to which this technique applies is unknown. Therefore, we can conclude that there is interesting work to be done in analyzing these and other techniques in an effort to reach a new class of languages while keeping the specification of standard forms accessible and human usable.

In the area of user environments, support tools for standard-form string environments need to be developed. We have anecdotal reports that some users find standard-form strings tiresome to write and read. Naturally, such reports are disturbing and should be investigated, but to be fair to the approach certain steps should be taken. Standard-form editors and viewers should be developed, along with other tools to make the use of standard-form objects simpler and quicker. Just as programming-language based environments have been developed to improve the productivity of programmers, tools should be built to further aid in the task of dealing with standard-form strings.

Another facet related to human acceptance of standard-form strings is the omission of tags. The ability to omit redundant tags lessens the amount of typing that must be done and makes the process of producing a standard-form string more palatable. In this dissertation we have explored the case where omissibility does not depend on look-ahead. Future work might generalize omissibility to include look-ahead; it would be interesting to know what level of omissibility users want or would be comfortable with. The most general case would be to allow unlimited look-ahead, but more practical approaches might allow one or two symbols of look-ahead. The trade-off is between users' expectations and processing costs.

The actual processing costs for braced languages (with and without tag omissibility) should be investigated. Mehlhorn [33] showed that bracketed languages could be recognized in log n space. Similar results should be pursued for the braced languages. The feature of tag omissibility (with and without look-ahead) may best be implemented with new and different parsing techniques.

Since Standard Generalized Markup Language strings are regular due to imposed nesting-depth limitations, there would be little trouble converting an SGML standard form into a braced language. If the nesting-depth limitation is removed the problem becomes more difficult. It may be interesting to explore this case. Leiss's work on star equations [27] would seem the most applicable to this problem. The potential benefit is a generic system based on the braced-language model that can accept SGML-like standard forms that do not limit nesting.

In a technical vein, it seems desirable to respecify the Standard Generalized Markup Language to allow processing by commonly available parser construction tools and to remove some of the unnecessary restrictions and ambiguities introduced by the syntax. Furthermore, we feel that the semantics of tag attributes should be moved to the content model. Also, a layered approach should be used to separate the description of standard forms from a fixed set of tag-length and tag-omission techniques. Such a respecification would yield a more easily implementable, and therefore readily adoptable, standard.

In a strictly theoretic vein, it might be interesting to pursue the observation that the braced languages are among a class of languages for which equivalence is decidable [42]. It is an open question whether it is possible to decide if two arbitrary deterministic context-free languages are equivalent; and it is known that it is impossible to decide for two arbitrary context-free languages. We know by proof that all context-free languages have a braced-language based representation. Thus we know that it is impossible to find an algorithm to map an arbitrary context-free language to its braced-language version.
There may, however, be an interesting subclass of the context-free languages (or deterministic context-free languages) that can be mapped automatically. Such a result might extend our knowledge of equivalent languages.

BIBLIOGRAPHY

[1] Aho, A.V. and Ullman, J.D. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1972.

[2] Albrecht, P.F., Garrison, P.E., Graham, S.L., Hyerle, R.H., Ip, P., and Krieg-Brueckner, B. Source-to-Source Translation: Ada to Pascal and Pascal to Ada. Sigplan Notices 15(11):183-193, November, 1980.

[3] Standard for Electronic Manuscript Preparation and Markup. Association of American Publishers, Washington, DC, 1986.

[4] Bono, P.R. A Survey of Graphics Standards and Their Role in Information Interchange. IEEE Computer :63-75, October, 1985.

[5] Kernighan, B.W. and Ritchie, D.M. The C Programming Language. Prentice-Hall, 1978.

[6] Carson, G.S. and McGinnis, E. The Reference Model for Computer Graphics. IEEE Computer Graphics and Applications :17-23, August, 1986.

[7] Chamberlin, D.D. and Goldfarb, C.F. Graphic Applications of the Standard Generalized Markup Language (SGML). Technical Report RJ 5540 (55569), IBM Almaden Research Center, San Jose, CA, December, 1986.

[8] Chamberlin, D.D., Hasselmeier, H.F., and Paris, D.P. Defining Document Styles for WYSIWYG Processing. Research Report RJ 5812 (58542), IBM Almaden Research Center, San Jose, CA 95120-6099, August, 1987.

[9] (uncredited). Ethical Guidelines to Publication of Chemical Research. Analytical Chemistry 58(1):264-266, January, 1986.

[10] Chomsky, N. Context-free grammars and pushdown storage. Quart. Prog. Rept. No. 65:187-194, 1962. MIT Res. Lab. Elect.

[11] Chomsky, N. and Schuetzenberger, M.P. The Algebraic Theory of Context-free Languages. Studies in Logic and the Foundations of Mathematics: Computer Programming and Formal Systems. North-Holland, Amsterdam, 1963, pages 118-161.

[12] (various). Special Issue devoted to the Association of American Publishers' Electronic Manuscript Standard. Electronic Publishing Business 4(8):1-32, September, 1986.

[13] Furuta, R., Scofield, J., and Shaw, A. Document Formatting Systems: Survey, Concepts, and Issues. Computing Surveys 14(3):417-472, September, 1982.

[14] Ginsburg, S. The Mathematical Theory of Context-free Languages. McGraw-Hill, 1966.

[15] Ginsburg, S. and Harrison, M.A. Bracketed Context-free Languages. Journal of Computer and System Sciences 1(1):1-23, 1967.

[16] Greibach, S.A. and Friedman, E.P. Superdeterministic PDAs: A Subcase with a Decidable Inclusion Problem. Journal of the ACM 27(4):675-700, October, 1980.

[17] Heilbrunner, S. and Wegner, L. Formal description of languages - Is there hope for standardisation. Angewandte Informatik 25(3):93-98, March, 1983. In German.

[18] Heuring, V.P. Compiler Construction: The Automatic Generation of Fast Lexical Analyzers. Technical Report SEG-85-1, University of Colorado, Boulder, CO, 1985.

[19] Hirose, S., Okawa, S., and Yoneda, M. A homomorphic characterization of recursively enumerable languages. Theoretical Computer Science 35:261-9, 1985.

[20] Hirose, S. and Yoneda, M. On the Chomsky and Stanley's homomorphic characterization of context-free languages. Theoretical Computer Science 36:109-12, 1985.
[21] Hopcroft, J.E. and Ullman, J.D. Formal Languages and Their Relation to Automata. Addison-Wesley Publishing Co., 1969.

[22] Hopcroft, J.E. and Ullman, J.D. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.

[23] National Bureau of Standards. Initial Graphics Exchange Specification. U.S. Dept. of Commerce, National Technical Information Service, 1978. NBSIR 80-1978R.

[24] National Bureau of Standards. Initial Graphics Exchange Specification (IGES) Version 2.0. U.S. Dept. of Commerce, National Technical Information Service, 1983. PB83-137448.

[25] Johnson, S.C. Yacc: Yet Another Compiler Compiler. Technical Report CSTR#32, Bell Laboratories, Murray Hill, NJ, 1975.

[26] Lamb, D.A. Sharing Intermediate Representations: The Interface Description Language. Technical Report CMU-CS-83-129, Carnegie-Mellon University, May, 1983.

[27] Leiss, E.L. On solving star equations. Theoretical Computer Science 39:327-332, 1985.

[28] Lesk, M.E. and Schmidt, E. Lex: A Lexical Analyzer Generator. Technical Report CSTR#39, Bell Laboratories, Murray Hill, NJ, October, 1975.

[29] Mamrak, S.A., Kaelbling, M.J., Nicholas, C.K., and Share, M. A Software Architecture for Supporting the Exchange of Electronic Manuscripts. Communications of the ACM 30(6):408-414, May, 1987.

[30] McNaughton, R. Parenthesis Grammars. J. of the Association for Computing Machinery 14(3):490-500, July, 1967.

[31] Medina-Mora, R. Syntax-directed editing: Towards integrated programming environments. PhD thesis, Carnegie-Mellon University, 1982.

[32] Medina-Mora, R. Aloe users' and implementors' guide. Second Compendium of Gandalf Documentation. Dept. of Computer Sci., Carnegie-Mellon Univ., 1982.

[33] Mehlhorn, K. Bracket-languages are recognizable in logarithmic space. Information Processing Letters 5(6):168-170, December, 1976.

[34] Naur, P. (Ed.). Report on the algorithmic language ALGOL 60. Communications of the ACM 3(5):299-314, May, 1960.

[35] Nicholas, C.K. Assuring Accessibility of Complex Software Systems. PhD thesis, The Ohio State University, Forthcoming.

[36] Nirenburg, S., Raskin, V., and Tucker, A.B. Interlingua design for TRANSLATOR. In Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, pages 224-244. Colgate University, Hamilton, NY, USA, August, 1985.

[37] Okawa, S., Hirose, S., and Yoneda, M. On the impossibility of the homomorphic characterization of context-sensitive languages. Theoretical Computer Science 44:225-228, 1986.

[38] Ossanna, J.F. NROFF/TROFF User's Manual. Bell Laboratories, Murray Hill, NJ, 1976.

[39] Parchmann, R., Duske, J., and Specht, J. Closure Properties of Deterministic Indexed Languages. Information and Control 46:200-218, 1980.

[40] Parchmann, R. Balanced context-free languages and indexed languages. Elektron. Informationsverarb. Kybern. 20(10-11):543-56, 1984.

[41] Scribe Document Production System User Manual. Fourth edition, Unilogic, Ltd., 160 North Craig St., Pittsburgh, PA, 1984.

[42] Senizergues, G. The Equivalence Problem for N.T.S. Languages Is Decidable. In Theoretical Computer Science: 6th GI Conference, Dortmund BRD, pages 313-323. Springer-Verlag, January, 1983.

[43] Information processing: Text and office systems: Standard Generalized Markup Language (SGML). First edition, International Organization for Standardization, 1986. ISO 8879-1986 (E).
[44] Stanley, R.J. Finite state representations of context-free languages. Quart. Prog. Rept. 76:276-279, 1965. MIT Res. Lab. Elect.

[45] Takahashi, M. Regular sets of strings, trees, and ω-structures. PhD thesis, U. of Pennsylvania, 1972.

[46] Takahashi, M. Generalizations of Regular Sets and Their Application to a Study of Context-free Languages. Information and Control 27:1-36, 1975.

[47] Takahashi, M. Nest Sets and Relativized Closure Properties. Theoretical Computer Science 22:253-264, 1983.

[48] Williams, T.B. Hierarchical Parsing. In Second Annual Phoenix Conference on Computers and Communications, pages 427-430. IEEE, New York, March, 1983.

[49] Wolberg, J.R. Conversion of Computer Software. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1983.

[50] Yamasaki, H. and Takahashi, M. Generalized parenthesis languages and minimization of their parenthesis parts. Theoretical Computer Science 31:1-11, 1984.

[51] Yellin, D.M. and Mueckstein, E.-M.M. Two-Way Translators Based on Attribute Grammar Inversion. IEEE :36-42, 1985.

[52] Yellin, D.M. and Mueckstein, E.-M.M. The Automatic Inversion of Attribute Grammars. IEEE Transactions on Software Engineering SE-12(5):590-599, May, 1986.

[53] Yellin, D.M. Attribute Grammar Inversion and Source-to-source Translation. PhD thesis, Columbia University, 1987.

[54] Zaslaw, S.J. Manuscript preparation on Digital Equipment Corporation word processors for automatic generic coding and subsequent page composition on Texet Corporation document machines. Technical Report, DEC, 200 Baker Ave., Concord, MA 01742, 617-264-1688, May, 1985.

[55] Zaslaw, S.J. An OPS5 program to analyze formatting of word-processor documents. Technical Report, DEC, 200 Baker Ave., Concord, MA 01742, 617-264-1688, 1986.