A Unified Format for Language Documents
Total Page:16
File Type:pdf, Size:1020Kb
A Unified Format for Language Documents Vadim Zaytsev and Ralf L¨ammel Software Languages Team Universit¨atKoblenz-Landau Universit¨atsstraße 1 56072 Koblenz Germany Abstract. We have analyzed a substantial number of language doc- umentation artifacts, including language standards, language specifica- tions, language reference manuals, as well as internal documents of stan- dardization bodies. We have reverse-engineered their intended internal structure, and compared the results. The Language Document Format (LDF), was developed to specifically support the documentation domain. We have also integrated LDF into an engineering discipline for language documents including tool support, for example, for rendering language documents, extracting grammars and samples, and migrating existing documents into LDF. The definition of LDF, tool support for LDF, and LDF applications are freely available through SourceForge. Keywords: language documentation, language document engineering, grammar engineering, software language engineering 1 Introduction Language documents form an important basis for software language engineering activities because they are primary references for the development of grammar- based tools. These documents are often viewed as static, read-only artifacts. We contend that this view is outdated. Language documents contain formalized elements of knowledge such as grammars and code examples. These elements should be checked and made available for the development of grammarware. Also, language documents may contain other formal statements, e.g., assertions about backward compatibility or the applicability of parsing technology. Again, such assertions should be validated in an automated fashion. Furthermore, the maintenance of language documents should be supported by designated tools for the benefit of improved consistency and traceability. In an earlier publication, a note for ISO [KZ05], we have explained why a language standardization body needs grammar engineering (or document engineering). This paper presents a data model (say, metamodel or grammar) for develop- ing language documents. Upon analyzing and reverse-engineering a wide range of language documents, which included international ISO-approved standards and vendor-specific 4GL manuals, we have designed a general format for lan- guage documents, the Language Document Format (LDF), which supports the 2 Vadim Zaytsev and Ralf L¨ammel documentation of languages in a domain-specific manner. We have integrated LDF with a formalism for syntax definition that we designed and successfully utilized in previous work [LZ09,LZ10,Zay10b]. We have integrated LDF also with existing tools and methods for grammar engineering from our previous work; see the grammar life cycle at the top of Figure 1 for an illustration. Furthermore, we have added LDF-specific tools, and begun working towards a discipline of language document engineering. There is support for creating, rendering, testing, and transforming language documents; see the document life cycle at the bottom of Figure 1 for an illustration. Given the new format LDF, it is particularly important that there are document extractors so that one can construct consistent LDF documents from existing language documents. In this paper, we will be mainly interested in LDF as the format for language documents, and the survey that supports the synthesis of the LDF format. The broader discussion of language document engineering is only sketched here. For instance, most aspects of rich tool support are deferred to substantial future work efforts. LDF can be seen as an application of literate programming [Knu84] ideology to the domain of language documentation: we aim to have one artifact that is both readable and executable. By \readable" we mean its readability, under- standability and information retrievability qualities. By \executable" we assume a proper environment such as a compiler compiler (for parser definitions) or a web browser (for hyperlinked grammars). LDF provides us with a data model narrowly tailored to the domain; it allows us to focus on one baseline artifact which is meant for both understanding and formal specification. Other artifacts such as grammars, test sets, web pages, language manuals and change documents are considered secondary in that they are to be generated or programmed. The full realization of this approach relies on a transformation language for language documents that we will briefly discuss. Summary of contributions { We have analyzed a substantial number of language documentation artifacts, including language standards, specifications and manuals of languages such as BNF dialects, C, C++, C#, Cobol dialects, Fortran, 4GLs, Haskell, Jovial, Python, SDF, XML, and other data modeling languages. Company-specific internal documents and software engineering books that document a software language (e.g., [GHJV95] with the well-known design patterns) were also researched. The objective of the analysis was to identify domain concepts and structuring principles of language documentation. { We have designed the Language Document Format (LDF) to specifically sup- port the documentation domain, and to make available language documents to language document engineering. A Unified Format for Language Documents 3 Grammar focus evolution restructuring XML Schema Prolog abstraction mapping correction specialization XML (W3C, Ecore, ...) HTML XSLT XSLT XBGF TXL TXL + XSLT XSLT BNF Extracted Transformed XBGF robust grammar grammar XSLT parser (BGF) (BGF) BNF in HTML XSLT? SDF ASF SDF ? ... ... Document focus evolution improvement ... HTML XML Schema Prolog XSLT mapping XLDF TeX XSLT W3C XML XSLT Extracted Transformed LaTeX XLDF robust parser document document (LDF) (LDF) PDF HTML XSLT DocBook XSLT? ... DocBook ... Fig. 1. Megamodels related to language document engineering. At the top, we see the life cycle of grammar extraction, recovery, and deployment. Grammars are extracted from existing software artifacts on the left, and represented in the unified format BGF. Grammars may then be subject to transformation using the XBGF trans- formation language. Parsers, browsable grammars, and other \executable" artifacts are delivered on the right. Such grammar engineering feeds into language document engi- neering. At the bottom, we see the life cycle of language document extraction, language (document) evolution, generation of end-user documents, extraction of grammars and test suites. Non-LDF documents can be converted to LDF through the extraction shown on the left. Document transformation may be needed for very different reasons, e.g., structure recovery or language evolution; see the reference to XLDF, which is the transformation language for LDF. 4 Vadim Zaytsev and Ralf L¨ammel Validation We have applied LDF to a number of language documentation problems, but a detailed discussion of such problems is not feasible in this paper for space reasons. For instance, we have applied language document engineering system- atically to the documentation of XBGF|the transformation language for BGF grammars which is used extensively in our work on grammar convergence; the outcome of this case study is available online [Zay09]. In the current paper, we briefly consider mapping W3C XML to LDF, specifically W3C's XPath stan- dard; this case study is available online, too [W3C]. In the former case, we rely on a document extractor that processes XSD schemata in a specific manner. In the latter case, the extractor maps W3C's XML Spec Schema to LDF. More generally, the SourceForge project \Software Processing Language Suite" (SLPS)1 hosts the abovementioned two case studies, the LDF definition, tool support for LDF, other LDF applications, and all other grammars and tools mentioned in this paper and our referenced, previous work. For instance, we refer to the SLPS Zoo2, which contains a collection of grammars that we ex- tracted from diverse language documents. The next step would be to properly LDF-enable all these documents. Road-map The rest of the paper is organized as follows. x2 discusses the state of the art in language documentation as far as it affects our focus on a format for language documents and its role in language document engineering. x3 identifies the con- cepts of language documentation as they are to be supported by a unified format for language documents, and as they can be inferred, to some extent, from exist- ing language documents. x4 describes the Language Document Format (LDF) in terms of the definitional grammar for LDF. It also provides a small scenario for language document transformation. x5 discusses related work (beyond the state of the art section). x6 concludes the paper. 2 State of the art in language documentation As a means of motivation for our research on a unified format for language documents, let us study the state of the art in this area. The bottom line of this discussion is that real-world language documents are engineered at a relatively low level of support for the language documentation domain. 2.1 Background on language standardization In practice, all mainstream languages are somehow standardized; the standard of a mainstream language would need to be considered the primary language doc- ument. For instance, the typical standard for a programming language entails 1 SLPS project, slps.sf.net. 2 SLPS Zoo, slps.sf.net/zoo. A Unified Format for Language Documents 5 grammar knowledge and substantial textual parts for the benefit of understand- ing the language. Let us provide some background on language standardization.