Dimensions of Formality: A Case Study for MKM in Software Engineering

Andrea Kohlhase¹ and Michael Kohlhase² and Christoph Lange²

1 German Research Center for Artificial Intelligence (DFKI), [email protected]
2 Computer Science, Jacobs University Bremen, {m.kohlhase,ch.lange}@jacobs-university.de

Abstract. We study the formalization process of a collection of documents created for a Software Engineering project from an MKM perspective. We analyze how document and collection markup formats can cope with an open-ended, multi-dimensional space of primary and secondary classifications and relations. We show that RDFa-based extensions of MKM formats by flexible “metadata” relations referencing specific vocabularies are well-suited to encode and operationalize this. This explicated knowledge can be used for enriching interactive document browsing, for enabling multi-dimensional metadata queries over documents and collections, and for exporting to the Semantic Web, thus enabling further reuse.

1 Introduction

The field of Mathematical Knowledge Management (MKM) tries to model mathematical objects and their relations, their creation and publication processes, and their management requirements. In [CF09, 237 ff.] Carette and Farmer analyzed “six major lenses through which researchers view MKM”: the document, library, formal, digital, interactive, and the process lens. Quite obviously, there is a tension between the formal aspects “library”, “formal”, “digital” – related to machine use of mathematical knowledge – and the informal ones “document”, “interactive”, “process” – related to human use. We encountered and dealt with all of these aspects in an extended case study in Software Engineering. The goal of this study was to create a document-oriented formalized process for Software Engineering, where MKM techniques are used to bridge the gap between informally stated user requirements and formal verification. The object of the study was a safety component for autonomous mobile service robots developed and certified as SIL-3 standard compliant (see [FHL+08]) in the course of the 3-year project “Sicherungskomponente für Autonome Mobile Systeme (SAMS)” at the German Research Center for Artificial Intelligence (DFKI). Certification required the software development to follow the V-Model (figure 1) and to be based on a verification of certain safety properties in the proof checker Isabelle [NPW02]. The V-Model mandates e.g. that relevant document fragments (“objects”) get justified and linked to the

corresponding objects in other members of the document collection in a successive refinement process (along the arms of the ‘V’ from the upper left over the bottom to the upper right, and between the arms, in figure 1). System development under this regime results in a highly interconnected collection of design documents, certification documents, code, formal specifications, and formal proofs. This collection of documents (we call it “SAMSDocs” [SAM09]) makes up the basis of a case study in the context of the FormalSafe project [For08] at DFKI Bremen, where the documents serve as a basis for research on machine-supported change management, information retrieval, and document interaction.

Fig. 1. The V-Model of Software Engineering

The idea for using SAMSDocs as a FormalSafe case study was based on the assumption that a collection created under strong formalization pressure would be easier to semantify than a regular collection. In this paper we report on — and draw conclusions from — the formalization of the LATEX documents of SAMSDocs, particularly the inherent multi-dimensionality of the explicated structures (see section 2). We explore the consequences for possible markup approaches in section 3. Section 4 discusses added-value services enabled by multi-dimensional structured representations, and section 5 concludes the paper.

2 Dimensions of Formality in SAMSDocs

We use the SAMSDocs corpus as a case study for what kind of implicit knowledge can be formalized. The structure of a collection – induced e.g. by the V-Model process – has a strong influence on the formality of and within the documents. At the same time, the structure itself is an object of formalization, which makes the structural cues inscribed in the documents explicit. We used the STEX system [Koh08], a semantic extension of LATEX, in order to publish the documents both as high-quality human-readable PDF and as formal, machine-processable OMDoc [Koh10], an XML format, via LATEXML [SKG+10]. Mathematical, structural relations have a privileged status in STEX, and its command sequence/environment syntax is analogous to the native element and attribute names in OMDoc. STEX supports the collection view, since it allows marking up a theory/imports structure as a collection-level structure. Unfortunately, it soon became apparent that this logic-inspired structure was too rigid for the intended stepwise knowledge explication. For example, STEX’s generated, convenient symbol macros could not be used without chunking document fragments into theories first. Fortunately, STEX not only offers OMDoc 1.2 core features, it also allows for the OMDoc 1.3 scheme of metadata via RDFa [ABMP08]


annotations (see [LK09,Koh10]). This enables formalization on a much broader basis, since we can add pre-formal markup in the formalization process; we speak of (semantic) preloading. Such STEX extension packages can thus serve as a development platform for metadata vocabularies for specific structured representations, mirroring the metadata extensibility of OMDoc. This is additionally and particularly useful as complex RDFa annotations, which involve long and unintuitive URIs, can be hidden under simple command sequences for preloading³.
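To give an impression of how such hiding works, consider the following minimal sketch. The macro name \SDresponsible and its argument convention are hypothetical; the property vm:responsible and the resource ../employees#Alice are borrowed from listing 1.1 below; the actual bindings of such an extension package are implemented via LATEXML and are not shown here.

% Minimal sketch (hypothetical macro): the author writes a simple command
% sequence in the STeX source ...
\newcommand{\SDresponsible}[1]{\textsf{#1}}  % PDF output: just typeset the name
% ... while the corresponding LaTeXML binding (not shown) would expand the same
% macro in the OMDoc/XHTML output to an element carrying RDFa attributes along
% the lines of rel="vm:responsible" resource="../employees#Alice", so the long,
% unintuitive URIs never clutter the LaTeX source.

In the document source, an author would then simply write \SDresponsible{Alice}.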

Fig. 2. The Formalization Workflow with STEX-SD

In figure 2 we can see an example of a concrete formalization workflow, where the original tabular TEX environment contains a list of symbols for document states together with their definitions, e.g. “i.B.” for “in Bearbeitung [in progress]”. STEX enables the person doing the structural explication to create collection-specific commands that independently control the layout of the PDF and the OMDoc output; in our case we call this extension package STEX-SD. For instance, we replaced the use of the tabular environment in figure 2 by employing a project-specific SDTab-def environment together with a list of SDdef commands.
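The following hedged sketch shows what the preloaded source for the table in figure 2 might look like. The environment and command names follow the paper, but the argument conventions, the second table entry, and the placeholder definitions are our assumptions rather than the actual STEX-SD interface.

% Placeholder definitions so that the fragment typesets standalone; the real
% STeX-SD macros additionally emit OMDoc via their LaTeXML bindings.
\newenvironment{SDTab-def}[1]{\par\textbf{#1}\par}{\par}
\newcommand{\SDdef}[3]{\par #2: #3 \quad (ID: #1)}

% Preloaded definition table (assumed argument order: ID, symbol, definition)
\begin{SDTab-def}{Document states}
  \SDdef{zustand-doc-iB}{i.B.}{in Bearbeitung [in progress]}
  \SDdef{zustand-doc-freigegeben}{freig.}{freigegeben [released]}  % hypothetical second entry
\end{SDTab-def}

In the PDF this renders as the original table, while in the OMDoc output each SDdef contributes a symbol and a corresponding definition element, and the environment groups them into a theory (cf. “Preloading the Project Structure” below).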

3 We are currently developing an STEX-based markup infrastructure for defining metadata vocabularies together with a module-based inheritance mechanism for custom metadata macros. This would alleviate the need for package extensions and allow metadata extensibility inside the STEX format. This extension was directly inspired by [KKL10].


As we were interested in the kind of explicated, previously implicit knowledge for future (re)use, we analyzed the semantic preloading and found distinct dimensions of formality, which we present in the following.

Preloading the Project Structure: A specific constraint of the SAMS project was that the PDF form of the preloaded STEX documents was not supposed to differ from the originals. Here, we extended STEX-SD by syntactic sugar for project-wide layout structures, such as the definition table in figure 2, as this construct was used throughout SAMSDocs. In more detail, we introduced the SDdef command, which translates to an OMDoc symbol element named “zustand-doc-iB” and a corresponding definition element. Moreover, the new environment SDTab-def groups all respective SDdef entries into a PDF table and into an OMDoc theory. Such macro support eased the project-specific markup process and made it much more efficient.

Preloading the Document Structure: Spotting objects, i.e., identifying document fragments as autonomous objects, was a major part of the formalization process. Once spotted, they were first preloaded with a referenceable ID, turning them into objects, and then classified and interrelated. For example, in the PDF screenshot in figure 2 we can easily spot the symbol “i.B.” and its definition “in Bearbeitung” in the row of the definition table. The row as a document fragment turns into an object in the STEX screenshot, as it is attributed the ID “zustand-doc-iB”, categorized as “definition”, and implicitly specified via STEX-SD to be a symbol together with its definition. Once all relevant objects for a concept were gathered, the process of relating and chunking them started. Relating was attempted with light markup using commands like \SDreferences created for STEX-SD or with STEX’s symbol macros; the latter could only be used if the respective module/theory structure for the collection was in place, that is, if chunking had already been partially done. Explicating the document structure is not always obvious, since many of the documents contain recaps or previews or material that is normatively introduced in other documents/parts, in order to make documents self-contained or to enhance the narrative structure. Consider for example figures 3 and 4, which are actually clippings from the detailed specification “Konzept-Bremsmodell.pdf”. Note the use of s and sG: in fig. 3 both point to the braking distance function for straight-ahead driving (which is obvious from the local context), whereas in fig. 4 s represents the general arc length function of a circle, which is principally different from the braking distance, but coincides with it here.

Fig. 3. s is Braking Distance?
Fig. 4. Yet another Braking Distance s?

Preloading the Collection Structure: Many concepts have occurrences in several of the documents in the collection SAMSDocs. Such occurrences are related, but they are not occurrences of the same object. For example, an object was introduced as a high-level concept in the contract, then it was specified in another document, refined in a detailed specification, implemented in the code, reviewed at some stage, and so on until it was finally described in the manual. To mark up the connectivity of these occurrences of one concept, we preloaded the collection structure, which is given by the development process model, the V-Model shown in figure 1. Here, we extended STEX-SD by our own semantic V-Model macros, e.g. SemVMrefines, SemVMimplements (to be used in the C-code documents), and SemVMdescribesUse.

Preloading the Organizational Structure: Besides the project structure we also found some organizational structure that is probably not only used for one specific project but for all projects (e.g. in a department). In the SAMS project such organizational structure consisted for example in a document version management and a document review history. Therefore we built another LATEX package for preloading support of versioning (with semantic macros like VMchangelist and VMchange) and of reviewing (with VMcertification and VMcertified).

Preloading the Document Layout Structure: A typical document layout is structured into established parts like sections or modules. If we want to keep this grouping information in the formal XML document, we might use STEX’s DCM package. In the STEX box in figure 2 we find for example the command DCMsubsection with attributes containing the title of the subsection and an ID that can be used in the usual LATEX referencing scheme.

Finally, we would like to remark that the STEX-SD preloading process was executed as “in-place formalization” [SIM99] and frequently considered several of the above structures for the object at hand at the same time. Therefore, the often applied metaphor of “formalization steps” does not mirror the formalization process in our case study. We found that the important aspect of the formalization was not its sequence, which we consider particular to the SAMSDocs collection, but the fact that the distinct ‘steps’ addressed distinct kinds of knowledge. Note that these knowledge kinds interact only relatively lightly, so that we can consider them as independent dimensions of a multi-dimensional space of knowledge that is explicated in the formalization process of the document collection.
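As a concrete illustration of the collection-level, organizational, and layout preloading just described, the following hedged sketch shows how such macros might appear in a preloaded detailed-specification source. The macro names follow the paper (DCMsubsection, SemVMrefines, VMchange), but the argument conventions, the IDs, the version data, and the placeholder definitions are our assumptions, not the actual STEX-SD or DCM interfaces.

% Placeholder definitions so that the fragment typesets standalone; the real
% macros additionally emit OMDoc/RDFa via their LaTeXML bindings.
\newcommand{\DCMsubsection}[2][]{\par\textbf{#2}\par}
\newcommand{\SemVMrefines}[1]{\emph{(refines #1)}}
\newcommand{\VMchange}[4]{\emph{(v#1, #2, #3: #4)}}

% Document layout structure: a subsection with a referenceable ID
\DCMsubsection[id=sec-bremsweg]{Berechnung des Bremswegs [braking distance computation]}
% Collection structure: this object refines an object from the concept
% specification (the target ID is hypothetical)
\SemVMrefines{konzeptspec-bremsweg}
% Organizational structure: one entry of the document's change history
% (assumed argument order: version, date, author, summary)
\VMchange{1.3}{2009-06-15}{A. Author}{Bremswegdefinition praezisiert [braking distance definition refined]}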

3 Multi-Dimensional Markup

Structured representations are usually realized as files marked up in formats that reflect the primary communicative intent and markup preferences of the author. The range of markup formats currently used for structured representation documents varies from PDF over office document formats like the Open Document Format or Office Open XML to scientific publishing formats like LATEX on the informal side and includes a plethora of system-specific logics on the formal side. Collections of informal documents are usually structured by application-specific metadata like the Math Subject Classification [Soc09] or bibliographic metadata schemes [Dub08]. Formal systems increasingly contain custom modularization infrastructures, ranging from simple facilities for inputting external files to elaborate multi-logic theory graphs [MML07].


In the evaluation of document formats it is important to realize that every representation language concentrates on a subset of possible relations, which it treats with specific language constructs. No given format can natively capture all aspects of the domain to be represented with special-purpose markup primitives, but has to relegate some of them to secondary mechanisms. In our case study the V-Model structure was a good example: it was quite natural to use in hindsight, but not from the beginning. In representation formats that support fragment identifiers — e.g. XML-based ones — these relations can be expressed as standoff markup in RDF (Resource Description Framework [RDF04]), i.e., as subject-predicate-object triples, where subject and object are URI references to a fragment and the predicate is a reference to a relation specified in an external vocabulary or ontology⁴. As we have XML-based formats for informal documents (e.g. XHTML+MathML+SVG) and formal specifications (OpenMath or Content MathML), we can in principle already encode structured representations, if we only supply “metadata vocabularies” for their structural relations. Indeed this is the basic architecture of the “Semantic Web approach” to eScience, and much of the work of the MKM community can be seen as attempts to come up with good “metadata vocabularies” for the mathematical/scientific domain. But this view disregards many practical workflow issues in the MKM domain:

W1) XML/RDF-based markup becomes huge and difficult to edit, unless supported by customized editing facilities.
W2) RDF stand-off markup is notoriously difficult to keep up to date.
W3) Authors, publishers, and readers have invested heavily in certain workflows and technology stacks and find it difficult to change.

In response to W2), RDFa [ABMP08] has been developed: a set of attributes for embedding RDF annotations into XHTML. We see RDFa as a markup technology for making arbitrary XML-based languages extensible by inter- and intra-document relations. Similarly, RDFa serves as a vehicle for document format interoperability: all relations from a format D that cannot be natively represented in a format D′ can be represented as RDFa triples, where the predicate is from an appropriately designed “metadata vocabulary” that describes the format D. For instance, an OMDoc-specific element that has no native counterpart in XHTML can be represented there by a generic element carrying RDFa attributes whose predicate is taken from an OMDoc metadata vocabulary. This interoperability allows us to accept as structured representation formats all XML-based formats that F1) support RDFa and F2) fine-grained text structuring with div/span-like elements everywhere. We have detailed the necessary extensions for our OMDoc format in [LK09]; a mature integration is part of OMDoc 1.3 [Koh10]; see also section 4 for a discussion of the services this affords. Analogous extensions for any of the XML-based formats used in the MKM community

4 The difference between “vocabulary” and “ontology” is not sharply defined. Vocabularies are often developed in a bottom-up community effort and tend to have a low degree of formality, whereas ontologies are often designed by a central group of experts and have a higher degree of formality. Here, we use “vocabulary” in its general sense of a set of terms from a particular domain of interest. This subsumes the term “ontology”, which we will reserve for cases that require a more formal domain model.


should be rather simple following this example. Note that the pragmatic restriction to XML-based representation formats is not a loss of generality. The three classes of non-XML languages in the MKM sphere are: i) computational logics, ii) TEX/LATEX, and iii) PostScript/PDF. We see the computational logics as compact front-end formats that are optimized for manual input of formal structured representations; it is our experience that these can be transformed into the XML-based OpenMath, MathML, or OMDoc without loss of information (but with a severe loss of notational conciseness). We consider TEX/LATEX as analogous for informal structured representations; they can be transformed to XHTML+MathML by the LATEXML system [SKG+10]; cf. the STEX extension for more formal structured representations discussed in section 2. The last category comprises presentation/print-oriented output and archival formats, where the situation is more problematic: PostScript (PS) is largely superseded by PDF, which allows standard document-level RDF annotations via XMP and the finer-granular annotations we need for structured representations via extensions as in [GMH+07] or [Eri07]. But PS/PDF are usually generated from other formats (mostly office formats or LATEX), so that alternative generation into XML-based formats like XHTML or OMDoc can be used. Given this discussion we will use OMDoc as the format for structured representations, as it has the largest coverage of MKM formats and we are most familiar with it. Given the necessary extensions and translators, any other XML-based MKM format (e.g. XHTML+MathML+SVG) would work as well.

4 Multi-Dimensional Added-Value Services

We have shown that the explication of knowledge results in an open-ended, multi-dimensional space of primary and secondary classifications and relations, and how MKM formats can cope with this. In this section, we carry this forward to the application domain: Are these document formalizations — formal and informal ones — beneficial for services supporting real users? Concretely, we envision three use scenarios in the SAMS context and present added-value services that retrieve and display multi-dimensional information.

Engineers, tasked with implementing a certain piece of software, start by gathering information about the specification for the code to be written. In particular, they look up the specific detailed specification document and will (probably) do so again and again until the task is accomplished. Concrete questions that arise might be: (i) How much of this specification has already been implemented? (ii) What is the definition of a certain (mathematical) symbol?⁵ (iii) In what state is the proof of a specific equation; has it already been formally verified so that it is safe to ground my implementation on it? (iv) Whom can I ask for further details?

5 See figures 3 and 4 for two symbols having the same appearance but different meanings.
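To make this clash concrete, the following contrast is our reconstruction of the two readings (the dependence of the braking distance on the robot's velocity v is our assumption; the arc length formula is standard):

% fig. 3: s (also written s_G) is the braking distance for straight-ahead
% driving, read here as a function of the robot's velocity v (assumed)
\[ s = s_G(v) \]
% fig. 4: s is the arc length of a circular arc with radius r and angle \varphi
\[ s = r\,\varphi \]
% The two values coincide in the specific geometric situation of the detailed
% specification, but they are different functions; in the formalized collection
% the two occurrences of "s" can be kept apart, e.g. by distinct identifiers.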


Assuming multi-dimensional markup like the one in SAMSDocs, an RDF-based information retrieval system can supply useful answers. For example, it can answer (ii) when technical terms in natural language are linked to the respective formal mathematical symbols they represent. For replies to (i) and (iii) we note that if all collection links are merged into a common RDF graph, then their original placement and direction no longer makes a difference. So if we have links from the Isabelle formalization to the respective C-code and links from this C-code to a specification fragment, as realized in the V-Model structure of SAMSDocs, then we can follow the graph from the specification through to the state of the corresponding proof. Drawing on the V-Model links combined with the semantic version management or the review logs, the system can deduce the answer for (iv): the code in question connects to a specification document, which has authors and reviewers. This service can be as fine-grained as one is willing to formalize the granularity of the version and review management. Now, if we admit further dimensions of Linked Data into the picture, then the system might find persons with similar interests in terms of the FOAF vocabulary (Friend of a Friend [BM07]), as has been elaborated in the use case descriptions of the ExpertFinder initiative [Exp07].

Project Managers need to have an overview of many aspects of the project and might therefore be interested in the following issues: (i) Software Engineering Process: How much code has been implemented to satisfy a particular requirement from the contract? Has the formal code structure passed a certain static analysis and verification? A manager would not want to inspect that manually by running Isabelle, but needs high-level figures of, e.g., the number of mathematical statements without a formally verified proof. (ii) Certification: What parts of the specification, e.g. requirements, have changed since the last certification? What other parts does that affect, and thus what subset of the whole specification has to be re-certified? (iii) Human Capital: Who is in charge of a document? How could she be replaced if necessary, taking into account colleagues working on the same or on related documents – such as previous revisions of the same document, or its predecessor in terms of the V-Model, i.e. the document that is refined by the current one? Exploiting the multi-dimensionality of the formalized knowledge, it becomes obvious how these issues can be tackled.

Certifiers have to understand a system in order to approve it. They need lookup services similar to those of the engineers above, but on a higher level. For inspection, a certifier might first be interested in an overview, such as a list of all relevant concepts in the contract document; then she would like to follow the links to the detailed specification and further on to the actual implementation. For more information, she might contact the project investigator instead of the particular author of a code snippet. Moreover, the certifier might not be familiar with the mathematical or physical background theories the specification is grounded on and would thus follow dependency links to external


resources. The certifier also needs to understand what parts of the whole specification are subject to a requested re-certification. Finally, a certifier’s rejection of a certain part of a document affects all elements in the collection that depend on it. Again, a system can support a certifier’s efficiency by combining the explicated information of distinct knowledge dimensions.

We will now present in more detail how concrete services utilize the structural dimensions found in SAMSDocs. The relations of interest, which have been identified and made explicit in the preloading process described in section 2, using language constructs reviewed in section 3, now have to be operationalized. Our services draw on representations of the semantic structures as RDF Linked Data – for a set of best practices for publishing RDF; see [BCH07] – using different vocabularies for different dimensions, so they can be uniformly queried and browsed. Here, we will focus on how queries against the multi-dimensional knowledge structures enable services for Software Engineering. We have developed an integrated work environment that enables these ser- vices. The technical foundation, which is independent of the number of dimen- sions of the knowledge space, has been described in detail in [DKL+10]. The cen- tral component is the versioned TNTBase [ZK09] with special support for XML that supports fine-grained access to document fragments and on-the- fly generation of dynamic XHTML+MathML+RDFa renderings. We envision that the engineers, managers, and certifiers in our scenarios interact with these renderings. Our setup supports embedded information lookup using the JOBAD architecture [GLR09], turning the rendered documents into command centers for executing queries and displaying their results without forcing the user to switch to a different context: While reading about a concept C, such as the braking distance s in figures 3 and 4, the user may want to look up information about another concept C0, such as the velocity of the robot, which is related to C via some relation R (here: a generic reference in a pre-formal stage of formalization), or she performs a query for all available C0 that are related to C via R. As the OMDoc transformation of an STEX document preserves all expli- cated structures, we can equally speak of the formalized SAMSDocs collection as STEX or OMDoc documents. Semantic annotations in such OMDoc files con- sist of native XML markup for mathematical structures and now additionally of (syntactically different) RDFa for the other dimensions of knowledge. The semantic annotations will be traversed (for lookup) and they will be queried – usually in the direction of one or more specific dimensions. When speaking about relations from the point of view of their operationalization by services, we will now use the term link. Note that we have two possible ways of answer- ing the respective queries: in the client and on the server. For the latter, the OMDoc markup and the all RDFa annotations are translated into uniform RDF (see [Lan09,Lan08,Lan10] for more information) and stored in an RDF database (“triple store”) integrated into the TNTBase server for efficient collection-wide querying of multi-dimensional structures (see examples below). For client-side query answering we make use of the RDFa annotations of the renderings the user interacts with. Note that now all RDF data available in that triple store it can


be combined with external information about people and organizational structures expressed using FOAF and stored in RDF/XML files by existing FOAF tools, or with legacy data from a relational database that is exposed as RDF Linked Data, as in the “Engineers” use case (iv).

Listing 1.1. Finding a Substitute for an Employee via the V-Model

SELECT ?potentialSubstituteName WHERE {
  # for each document Alice is responsible for, get all of its parts,
  # i.e. _any_ kind of semantic (sub)object in the document
  ?document vm:responsible <../employees#Alice> ;
            omdoc:hasPart ?partOfDocument .

  # for those parts (= concepts) that are linked to their
  # V-Model predecessors that they refine, ...
  ?partOfDocument semVM:refines ?vModelPredecessor .

  # ... the document containing that predecessor, the person
  # responsible for that document, ...
  ?otherDocument omdoc:hasPart ?vModelPredecessor ;
                 vm:responsible ?potentialSubstitute .

  # ... and her name
  ?potentialSubstitute :name ?potentialSubstituteName .
}

Concretely, we showcase a potential query in the SPARQL RDF query language [PS08] in listing 1.1 for finding a substitute for employee Alice via the V-Model relations, as envisioned in the “Project Managers” use case above. Let us assume that Alice can be identified by a URI. The query first retrieves all documents in the collection for which Alice is known to be the responsible person. Then we are interested in the objects of these documents that refine ‘previous’ objects, and for the latter we finally determine the responsible persons. For example, if Alice was responsible for the detailed specification of the braking distance function for straight-ahead driving sG, and if sG were a refinement of the general braking distance s introduced in a concept specification with Pierre as the assigned responsible person, then Pierre might be considered as a substitute for Alice.

Listing 1.2. Finding Objects that are Subject to Recertification

SELECT ?subjectToReCertification WHERE {
  # We assume that the "last certification" is available as a semantic object
  :lastCertification dc:date ?lastCertDate .

  # Find all requirements in the collection, and their last change
  ?req rdf:type semVM:Requirement ;
       vm:lastChange ?lastReqChange .
  FILTER (?lastReqChange > ?lastCertDate)

  # Three sets of objects are subject to re-certification:
  # 1. dependent objects (transitive) in terms of the V-Model
  { ?subjectToReCertification semVM:dependsOn ?req }
  UNION
  # 2. dependent objects (transitive) in terms of OMDoc-formalized math. structures
  { ?subjectToReCertification omdoc:dependsOn ?req }
  UNION
  # 3. the requirements themselves
  # this construct binds the requirements found previously to the
  # ?subjectToReCertification variable
  { ?subjectToReCertification rdf:type semVM:Requirement .
    FILTER (?subjectToReCertification = ?req) }
}

As an example that draws on the multi-dimensionality of the formalization, consider a certifier who has to re-certify a system because of an unavoidable change in an object O (listing 1.2). As changes of this object are documented in the document version management, we can use this information as a first filter. But we are only interested in those documents that contain links from or to O via the V-Model structure (simplified in the shown query to semVM:dependsOn). It is easy to imagine that the mathematical structure adds another filter, e.g. since the certifier might only be interested in changes in content and not in form.

Our JOBAD architecture wraps such queries into service modules and gives access to them from a user interface embedded into the rendered documents. Services are available wherever there are suitable RDFa or MathML annotations. The query in listing 1.1 depends on a particular employee and is therefore made available wherever [links to] employees occur in a document – for example when there is an RDFa annotation. Another common service is the lookup service, which works on mathematical symbols (looking up their definition) and on RDFa links (looking up the linked object). As future work, we also envision equipping the context menu of certification documents with menu entries for committing an approval or rejection to the server, which would only be displayed to the certifier. The server could then trigger further actions, such as marking the document that contains a rejected object and all dependencies of that object as rejected, too.

In conclusion, we have shown that multi-dimensional queries are very natural in Software Engineering scenarios and that multi-dimensional markup affords multi-dimensional services. Note that if we interpret our dimensions as distinct contexts, then our services become context-sensitive, as dimensions can be filtered in and out. The more dimensions are explicated in a document, the more context-sensitive services become available. Note as well that each of our dimensions corresponds to a vocabulary.

In the course of the SAMSDocs case study, most vocabularies have initially been implemented from scratch in a project-specific, ad hoc way during the preloading phase described in section 2. But STEX allows for elaborating vocabularies towards ontologies, which can be translated to RDF-based formats that reasoners understand [KKL10]. An alternative is reusing existing ontologies. FOAF for basic properties of persons and organizations has already been mentioned; the widely known Dublin Core element set that occurred in listing 1.2 (dc:date) is also available as an ontology. DOAP (Description of a Project [Dub10]) describes software projects – focusing on the top-level structure of public open source projects. DCMI Terms [DCM], a modernized and extended version of the Dublin Core element set, offers a basic vocabulary for revision histories – but not for reviewing and certification. Lin et al. have developed an ontology for the requirements-related parts of the V-Model (cf. [LFB96]). Happel and Seedorf briefly review further ontologies about Software Engineering [HS06]. Besides often being more formal than vocabularies developed ad hoc, the advantages of such existing


ontologies are that reusable services may already have been implemented for them, and that they are more widely used, or more likely to be adopted by other projects, which facilitates the creation of Linked Data sets. As the SAMSDocs vocabularies can be integrated with existing ontologies by declaring appropriate subclass or equivalence relationships, services can make use of the best of both worlds.

5 Conclusion and Further Work

In this paper we have studied the applicability of MKM technologies in Software Engineering beyond “Formal Methods” (based on the concrete SAMSDocs document collection and its formalization). The initial hypothesis here is that contract documents, design specifications, user manuals, and integration reports can be partially formalized and integrated into a computer-supported software development process. To test this hypothesis we have studied a collection of documents created for the development of a safety zone computation, the formal verification that the braking trajectory always lies in the safety zone, and the SIL-3 certification of this fact by a public certification agency. As the project documents contain a wealth of (informal) mathematical content, MKM formats (in this case our OMDoc format) are well-suited for this task. During the formalization of the LATEX part of the collection, we realized that the documents contain an open-ended, multi-dimensional space of implicit knowledge that can be used for supporting projects — if explicated.

We have shown that RDFa-based extensions of OMDoc by flexible “metadata” relations referencing specific vocabularies can be used to encode and operationalize this knowledge space. We have pointed out that the “dimensions” of this space can be seen to correspond to different “metadata vocabularies”. Note that the distinction between data and metadata blurs here, as the OMDoc data model realized by native markup in the OMDoc format can also be seen as OMDoc metadata and could be realized by RDFa annotations to some text markup format, where the meaning of the annotations is given by the OMDoc ontology [Lan08,Lan10]. This “metadata view” is applicable to all MKM formats that mark up informal mathematical texts (e.g. MathDox [CCB06] and MathLang [KWZ08]) as long as they explicate their data model in an ontology. This observation makes decisions about which parts of the knowledge space to support with native markup a purely pragmatic choice and opens up new possibilities in the design of representation formats. It seems plausible that all MKM formats use native markup for mathematical knowledge structures (we think of them as primary knowledge structures for MKM) and differ mostly in the secondary knowledge structures they internalize. XHTML+MathML+RDFa might even serve as a baseline interchange format for MKM applications⁶, since it is minimally committed. Note that if the metadata ontologies are represented in modular formats that admit theory morphisms, then these can be used as

6 Indeed a similar proposal has been made for Semantic Wikis [VO06], which have related concerns but do not involve mathematics.


crosswalks between secondary metadata for higher levels of interoperability. We leave their development to future work.

The explicated secondary knowledge structures can be used for enriching interactive document browsing and for enabling multi-dimensional metadata queries over documents and collections. We have shown a set of exemplary added-value services based on the RDFa-encoded metadata, mostly centered around Linked Data approaches based on RDF queries. More services can be obtained by exporting Linked Data to the Semantic Web and thus enabling further reuse. In particular, the multi-dimensionality observed in this paper and its realization via flexible metadata regimes in representation formats allows knowledge engineers to tailor the level of formality to the intended applications. In our case study, the metadata vocabularies ranged from project-specific ones that had to be developed (e.g. definition tables) to general ones like the V-Model vocabulary, for which external ontologies could be reused later on. We expect that such a range is generally the case for Software Engineering projects, and that the project-specific vocabularies may stabilize and become standardized in communities and companies, lowering the formalization effort entailed by each individual project. In fact we anticipate that such metadata vocabularies and the corresponding software development support services will become part of the strategic knowledge of technical organizations.

In [CF09, 241] Carette and Farmer challenge MKM researchers by assessing some of their technologies: “A lack of requirements analysis very often leads to interesting solutions to problems which did not need solving.” With this paper we hope to have shown that MKM technologies can be extended to cope with “real world concerns” (in Software Engineering). Indeed, industry is becoming more and more aware of and interested in Linked Data (see e.g. [Ser08] and [LDF, Question 14]), so that the methods reported on in this paper can be integrated.

References

[ABMP08] Ben Adida, Mark Birbeck, Shane McCarron, and Steven Pemberton. RDFa in XHTML: Syntax and processing. W3C Recommendation, World Wide Web Consortium (W3C), October 2008.
[BCH07] Chris Bizer, Richard Cyganiak, and Tom Heath. How to publish linked data on the web, July 2007.
[BM07] Dan Brickley and Libby Miller. FOAF vocabulary specification 0.91. Technical report, ILRT Bristol, November 2007.
[CCB06] A. M. Cohen, H. Cuypers, and E. Reinaldo Barreiro. MathDox: Mathematical documents on the web. In OMDoc – An open markup format for mathematical documents [Version 1.2], number 4180 in LNAI, chapter 26.7, pages 278–282. Springer Verlag, August 2006.
[CDSCW09] Jacques Carette, Lucas Dixon, Claudio Sacerdoti Coen, and Stephen M. Watt, editors. MKM/Calculemus 2009 Proceedings, number 5625 in LNAI. Springer Verlag, July 2009.
[CF09] Jacques Carette and William Farmer. A review of mathematical knowledge management. In Carette et al. [CDSCW09], pages 233–246.


[DCM] DCMI Usage Board. DCMI metadata terms. DCMI recommendation, Dublin Core Metadata Initiative.
[DKL+10] Catalin David, Michael Kohlhase, Christoph Lange, Florian Rabe, Nikita Zhiltsov, and Vyacheslav Zholudev. Publishing math lecture notes as linked data. In Lora Aroyo, Grigoris Antoniou, and Eero Hyvönen, editors, ESWC, Lecture Notes in Computer Science. Springer, June 2010. In press.
[Dub08] Dublin Core metadata element set. DCMI recommendation, Dublin Core Metadata Initiative, 2008.
[Dub10] Edd Dumbill. DOAP – description of a project. http://trac.usefulinc.com/doap, seen Mar. 2010.
[Eri07] Henrik Eriksson. The semantic-document approach to combining documents and ontologies. International Journal of Human-Computer Studies, 65(7):624–639, 2007.
[Exp07] The ExpertFinder initiative, 2007.
[FHL+08] Udo Frese, Daniel Hausmann, Christoph Lüth, Holger Täubig, and Dennis Walter. The importance of being formal. In Hardi Hungar, editor, International Workshop on the Certification of Safety-Critical Software Controlled Systems SafeCert’08, volume 238 of Electronic Notes in Theoretical Computer Science, pages 57–70, September 2008.
[For08] FormalSafe. http://www.dfki.de/sks/formalsafe/, seen Dec. 2008.
[GLR09] Jana Giceva, Christoph Lange, and Florian Rabe. Integrating web services into active mathematical documents. In Carette et al. [CDSCW09], pages 279–293.
[GMH+07] Tudor Groza, Knud Möller, Siegfried Handschuh, Diana Trif, and Stefan Decker. SALT: Weaving the claim web. In Karl Aberer, Key-Sun Choi, Natasha Fridman Noy, Dean Allemang, Kyung-Il Lee, Lyndon J. B. Nixon, Jennifer Golbeck, Peter Mika, Diana Maynard, Riichiro Mizoguchi, Guus Schreiber, and Philippe Cudré-Mauroux, editors, ISWC/ASWC, number 4825 in Lecture Notes in Computer Science, pages 197–210. Springer, 2007.
[HS06] Hans-Jörg Happel and Stefan Seedorf. Applications of ontologies in software engineering. In Proc. 2nd International Workshop on Semantic Web Enabled Software Engineering (SWESE ’06), 2006.
[KKL10] Andrea Kohlhase, Michael Kohlhase, and Christoph Lange. sTeX – a system for flexible formalization of linked data. Submitted to I-SEMANTICS 2010, 2010.
[Koh08] Michael Kohlhase. Using LATEX as a semantic markup format. Mathematics in Computer Science, 2(2):279–304, 2008.
[Koh10] Michael Kohlhase. An open markup format for mathematical documents OMDoc [version 1.3]. Draft specification, 2010.
[KWZ08] Fairouz Kamareddine, J. B. Wells, and Christoph Zengler. Computerising mathematical text with MathLang. Electron. Notes Theor. Comput. Sci., 205:5–30, 2008.
[Lan08] Christoph Lange. The OMDoc document ontology. Web page at http://kwarc.info/projects/docOnto/omdoc, seen August 2008.
[Lan09] Christoph Lange. Krextor – an extensible XML→RDF extraction framework. In Chris Bizer, Sören Auer, and Gunnar Aastrand Grimnes, editors, Scripting and Development for the Semantic Web (SFSW 2009), May 2009.
[Lan10] Christoph Lange. Semantic Web Collaboration on Semiformal Mathematical Knowledge. PhD thesis, Jacobs University Bremen, 2010. Submission expected in spring 2010.


[LDF] Linked data FAQ. http://structureddynamics.com/linked_data.html.
[LFB96] Jinxin Lin, Mark S. Fox, and Taner Bilgic. A requirement ontology for engineering design. In Proceedings of the 3rd International Conference on Concurrent Engineering, pages 343–351. Technomic Publishing Company, Inc., August 1996.
[LK09] Christoph Lange and Michael Kohlhase. A mathematical approach to ontology authoring and documentation. In Carette et al. [CDSCW09], pages 389–404.
[MML07] Till Mossakowski, Christian Maeder, and Klaus Lüttich. The heterogeneous tool set. In Orna Grumberg and Michael Huth, editors, Proceedings of the 13th International Conference on Tools and Algorithms for the Construction and Analysis of Systems TACAS-2007, number 4424 in LNCS, pages 519–522, Berlin, Germany, 2007. Springer Verlag.
[NPW02] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL — A Proof Assistant for Higher-Order Logic. Number 2283 in LNCS. Springer, 2002.
[PS08] Eric Prud’hommeaux and Andy Seaborne. SPARQL query language for RDF. W3C Recommendation, World Wide Web Consortium (W3C), January 2008.
[RDF04] Resource description framework (RDF). http://www.w3.org/RDF/, 2004.
[SAM09] SAMS. SAMSDocs: The document collection of the SAMS project, 2009. http://www.sams-projekt.de.
[Ser08] François-Paul Servant. Linking enterprise data. In Christian Bizer, Tom Heath, Kingsley Idehen, and Tim Berners-Lee, editors, Linked Data on the Web (LDOW 2008), number 369 in CEUR Workshop Proceedings, April 2008.
[SIM99] Frank M. Shipman III and Raymond J. McCall. Incremental formalization with the hyper-object substrate. ACM Trans. Inf. Syst., 17(2):199–227, 1999.
[SKG+10] Heinrich Stamerjohanns, Michael Kohlhase, Deyan Ginev, Catalin David, and Bruce Miller. Transforming large collections of scientific publications to XML. Mathematics in Computer Science, 2010. In press.
[Soc09] American Mathematical Society. Mathematics Subject Classification MSC2010. http://www.ams.org/mathscinet/msc/, 2009.
[VO06] Max Völkel and Eyal Oren. Towards a Wiki Interchange Format (WIF). In Max Völkel, Sebastian Schaffert, and Stefan Decker, editors, Proceedings of the 1st Workshop on Semantic Wikis, European Semantic Web Conference 2006, number 206 in CEUR Workshop Proceedings, Budva, Montenegro, June 2006.
[ZK09] Vyacheslav Zholudev and Michael Kohlhase. TNTBase: a versioned storage for XML. In Proceedings of Balisage: The Markup Conference 2009, Balisage Series on Markup Technologies. Mulberry Technologies, Inc., 2009. Available at http://kwarc.info/vzholudev/pubs/balisage.pdf.

