Type Checking XML Transformations Thomas Perst

Institut für Informatik der Technischen Universität München
Lehrstuhl Informatik II

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. Alfons Kemper, Ph.D.
Examiners of the dissertation:
1. Univ.-Prof. Dr. Helmut Seidl
2. Ass. Prof. Dr. Frank Neven, Univ. of Limburg, Belgium

The dissertation was submitted to the Technische Universität München on 04.09.2006 and accepted by the Fakultät für Informatik on 13.02.2007.

Acknowledgments

I wish to thank Helmut Seidl for inspiring me to conduct my research in the field of XML transformations, and I feel deep gratitude towards him for guiding me like Virgil through the circles of tree transducers. He employed me on a project position providing ideal conditions for conducting my research. It was funded by the Deutsche Forschungsgemeinschaft under the project title “MttCheck”.

I am grateful to many colleagues who in one way or another contributed to this thesis. Special thanks to Thomas Gawlitza, Ingo Scholtes, and Peter Ziewer for careful reading of the manuscript and for their helpful comments and suggestions.

Last but not least, I want to thank Jennifer for her love and constant support on the stony path of writing this work.

Abstract

XML documents are often generated by some application in order to be processed by another program. For a correct exchange of information it has to be guaranteed that for correct inputs only correct outputs are produced. The shape of correct documents, i.e., their type, is usually specified by means of schema languages. This thesis is concerned with methods for statically guaranteeing that transformations are correct with respect to pre-defined types.
To this end, we consider the XML transformation language TL, which abstracts the key features of common XML transformation languages. We show that each TL program can be decomposed into at most three stay macro tree transducers. By means of classical results for stay macro tree transducers, this decomposition allows us to formulate a type checking algorithm for TL. This method, however, has an exorbitant run time even for small programs. Therefore, we develop an alternative approach, which type checks at least a large class of transformations in polynomial time.

The developed algorithms have been implemented and tested on practical examples.

Summary (Zusammenfassung)

XML documents are often produced by applications in order to be consumed by other applications. For a correct exchange of information it must be guaranteed that only correct outputs are produced for correct inputs. This thesis investigates methods for statically guaranteeing that a transformation produces only correct outputs. To this end, we consider the language TL, which abstracts the features of common XML transformation languages. We show that every TL program can be decomposed into at most three stay macro tree transducers. This decomposition allows a type checking procedure to be constructed. The procedure, however, exhibits exorbitant run times even for small programs. We therefore develop an alternative approach that allows type checking in polynomial time at least for a large class of practical transformations. Both approaches have been implemented and tested on practical examples.

Contents

1 Introduction
2 Preliminaries
  2.1 Trees and Forests
  2.2 Tree Substitution
  2.3 Finite-state Tree Automata
  2.4 Monadic Second-Order Logic
  2.5 Notes and References
3 The Transformation Language TL
  3.1 Definition
  3.2 Example
  3.3 Induced Transformation
    3.3.1 Denotational Semantics
    3.3.2 Operational Semantics
  3.4 Notes and References
4 Stay Macro Tree Transducers
  4.1 Definition
  4.2 Induced Tree Transformation
    4.2.1 Denotational Semantics
    4.2.2 Operational Semantics
  4.3 Basic Properties
  4.4 Stay Macro Forest Transducers
    4.4.1 Induced Transformation
    4.4.2 Characterization
  4.5 Stay-Mtts with Regular Look-ahead
  4.6 Notes and References
5 Type Checking TL Programs
  5.1 Annotating the Input
  5.2 Removing Global Selection
  5.3 Removing Up Moves
  5.4 The Decomposition by Example
  5.5 Notes and References
6 Type Checking by Forward Type Inference
  6.1 Intersection with Regular Languages
  6.2 Shortening of Right-hand Sides
  6.3 Linear Stay-Mtts
  6.4 b-bounded Stay-Mtts
  6.5 Stay Macro Forest Transducers
  6.6 Forward Type Checking by Example
  6.7 Notes and References
7 Implementation of a Type Checker
  7.1 A Type Checker Tutorial
  7.2 The Implementation Details
    7.2.1 Representation of Macro Tree Transducers
    7.2.2 Pre-Image Computation
    7.2.3 Implementation of Forward Type Inference
    7.2.4 Realization of Emptiness Check
    7.2.5 Tuning the Macro Tree Transducer
  7.3 Dealing with Infinite Alphabets
  7.4 Dealing with Attributes
  7.5 Notes and References
8 Conclusion
A Proofs
  A.1 Proof of Theorem 3.7
    A.1.1 Outside-in Evaluation
    A.1.2 Inside-out Evaluation
References

1 Introduction

When Axel Thue published his article about logical problems [Thu10; ST00] in 1910, he could hardly have expected that his ideas about trees and their rewriting would become as popular as they did some 90 years later with the introduction of the Extensible Markup Language (XML). The W3C¹, an international consortium that develops Web standards, describes this language as “a simple, very flexible text format derived from SGML.
Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere” [Qui06]. Moreover, by proposing XML the W3C fulfils its mission “to lead the World Wide Web to its full potential by developing protocols and guidelines that ensure long-term growth of the Web,” because since the publication of the standard in 1998, XML has enjoyed increasing popularity in nearly all fields of information technology. XHTML is an XML format for designing Web pages [Pem02], XSL-FO is a language for outlining general layouts [Adl01], and MathML provides elements for typesetting mathematical formulas [CIMP03]. With SMIL it is possible to create interactive audio-visual presentations [Hos98]. The bibliographic records of the Computer Science bibliography DBLP² are organized with XML [Ley02]. The office suite of OpenOffice.org [Man06] uses the Open Document Format, an XML-based file format for office applications [DBO06]. Even this small selection of XML applications shows that Vianu does not exaggerate when he calls XML the lingua franca of the Web [Via01].

¹ World Wide Web Consortium
² Digital Bibliography & Library Project

Technically, XML is a meta-language for storing structured data in a sequential format. To this end, data are grouped into elements, and each element is enclosed between an opening and a closing tag indicating the start and the end of such a logical component. Since elements can be nested, every XML document can naturally be considered as a tree whose nodes are the tags and whose leaves are the data. Once a document has been structured with XML elements, the XML part of the document itself is called markup.

One of XML's design goals and advantages over other data formats is that XML is typed. This means that it is possible to specify where each element may occur in a document.
For this purpose the W3C recommendation suggests document type definitions (DTDs), which provide rules for how the XML elements, attributes, and other data are defined and logically related in a document [BPSM+04]. A document adhering to the XML syntax specifications and the rules outlined by its DTD is considered to be valid. More expressive typing mechanisms are XML Schema [FW04] and RELAX NG [Jam01]. They provide more base types for data values like integers, decimals or reals; moreover, they allow users to define their own types by combining base types. Validity can be checked by a type-aware parser, which tests for every element whether the sequence of its children conforms to the specifications made in the type definition.

Another advantage of XML certainly is its extensibility, which allows users to create and format their own document markup. The Web publishing language HTML, for example, consists of a fixed set of elements with a more or less predefined meaning. XML, on the other hand, allows one to create new markup at will and, moreover, to specify in a DTD or other schema how this markup is related inside a document. Since “XML data exchange is always done in the context of a fixed type” [AMF+01; AMF+03], a common scenario is that a community (or industry) agrees on a certain type, and all members have to create documents that are of that type. One possibility to ensure validity is to parse every created document and to check whether it conforms to the specified type. This method, however, is very time consuming, because every document has to be reread, which takes some time for large documents. Moreover, the same document is checked again every time it is created. Since XML documents are often generated dynamically by some program [Suc02], e.g. as the result of the transformation induced by a stylesheet processor, it would be more convenient to check whether all XML documents generated by a program are valid with respect to a specified type.
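
The per-document validity check described above can be illustrated with a minimal sketch. It is not taken from the thesis: the `library`/`book` schema, the `CONTENT_MODELS` table, and the functions `is_valid` and `validate` are hypothetical, and the sketch assumes a DTD-like type in which each element name is mapped to a regular expression over the names of its children. Text content, attributes, and the choice of root element are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models in the spirit of a DTD: each element name maps
# to a regular expression over the space-terminated names of its children.
CONTENT_MODELS = {
    "library": r"(book )*",       # roughly <!ELEMENT library (book*)>
    "book": r"title (author )+",  # roughly <!ELEMENT book (title, author+)>
    "title": r"",                 # no child elements allowed
    "author": r"",                # no child elements allowed
}


def is_valid(element):
    """Check an element and, recursively, all of its descendants."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # element name is not declared in the type
    # Encode the sequence of children as a string such as "title author ".
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False  # children do not conform to the content model
    return all(is_valid(child) for child in element)


def validate(document):
    """Parse the document text into a tree and check it against the type."""
    return is_valid(ET.fromstring(document))


GOOD = "<library><book><title>T</title><author>A</author></book></library>"
BAD = "<library><book><author>A</author><title>T</title></book></library>"
```

Calling `validate(GOOD)` accepts the first document, while `validate(BAD)` rejects the second because `author` precedes `title`. The check has to reparse and traverse every document it is given, which is the cost that a static check of the generating program would avoid.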
