BRICS: Basic Research in Computer Science
Giuseppe Castagna & Mukund Raghavachari (eds.): PLAN-X 2006 Informal Proceedings
BRICS Notes Series NS-05-6

PLAN-X 2006 Informal Proceedings

Charleston, South Carolina, January 14, 2006

Giuseppe Castagna Mukund Raghavachari (editors)

BRICS Notes Series NS-05-6. ISSN 0909-3206. December 2005. Copyright © 2005, Giuseppe Castagna & Mukund Raghavachari (editors). BRICS, Department of Computer Science, University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Notes Series publications. Copies may be obtained by contacting:

BRICS
Department of Computer Science
University of Aarhus
Ny Munkegade, building 540
DK–8000 Aarhus C
Denmark
Telephone: +45 8942 3360
Telefax: +45 8942 3255
Internet: [email protected]

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk
ftp://ftp.brics.dk
This document in subdirectory NS/05/6/

PLAN-X 2006 Informal Proceedings
Charleston, South Carolina, 14 January 2006

Invited Talk

• Service Interaction Patterns 1 John Evdemon

Papers

• Statically Typed Document Transformation: An Xtatic Experience 2 Vladimir Gapeyev, François Garillot and Benjamin Pierce

• Type Checking with XML Schema in XACT 14 Christian Kirkegaard and Anders Møller

• PADX: Querying Large-scale Ad Hoc Data with XQuery 24 Mary Fernandez, Kathleen Fisher, Robert Gruber and Yitzhak Mandelbaum

• OCaml + XDuce 36 Alain Frisch

• Polymorphism and XDuce-style patterns 49 Jérôme Vouillon

• Composing Monadic Queries in Trees 61 Emmanuel Filiot, Joachim Niehren, Jean-Marc Talbot and Sophie Tison

• Type Checking For Functional XML Programming Without Type Annotation 71 Akihiko Tozawa

Demos

• Accelerating XPath Evaluation against XML Streams 82 Dan Olteanu

• Imperative Programming Languages with Database Optimizers 83 Daniela Florescu and Anguel Novoselsky

• Xcerpt and visXcerpt: Integrating Web Querying 84 Sacha Berger, François Bry and Tim Furche

• XJ: Integration of XML Processing into Java 85 Rajesh Bordawekar, Michael Burke, Igor Peshansky and Mukund Raghavachari

• XML Support in Visual Basic 9 86 Erik Meijer and Brian Beckman

• XACT – XML Transformations in Java 87 Christian Kirkegaard and Anders Møller

• XTATIC 88 Vladimir Gapeyev, Michael Levin, Benjamin Pierce and Alan Schmitt

• OCamlDuce 89 Alain Frisch

• LAUNCHPADS: A System for Processing Ad Hoc Data 90 Mark Daly, Mary Fernández, Kathleen Fisher, Yitzhak Mandelbaum and David Walker

• XHaskell 92 Martin Sulzmann and Kenny Zhuo Ming Lu

Program Committee

Gavin Bierman (Microsoft Research)
Giuseppe Castagna (CNRS, École Normale Supérieure de Paris), chair
Alain Frisch (INRIA Rocquencourt)
Giorgio Ghelli (University of Pisa)
Tova Milo (Tel Aviv University)
Makoto Murata (IBM)
Dan Olteanu (Saarland University)
Benjamin Pierce (University of Pennsylvania)
Mukund Raghavachari (IBM T.J. Watson Research Center), demo chair
Helmut Seidl (Technische Universität München)

General Chair
Anders Møller (BRICS, University of Aarhus)

Service Interaction Patterns (Invited Talk)

John Evdemon [email protected]

Abstract

The traditional method for building a service requires a developer to ensure that business logic is not hosted directly within the service itself. While this approach helps make the service more flexible, it does not address the biggest architectural gap facing web services today: service interaction patterns (SIPs). A SIP occurs when services engage in concurrent and interrelated interactions with other services. Traditional web service architectures are designed to accommodate simple point-to-point interactions: there is no concept of a logical flow or series of steps from one service to another. Standards such as WS-BPEL are being developed to address this gap. In this session we will discuss a "manifesto" for workflow-enabled solutions, review emerging standards (BPEL, others) and address possible misconceptions regarding these standards.

Statically Typed Document Transformation: An XTATIC Experience

Vladimir Gapeyev (University of Pennsylvania), François Garillot (École Normale Supérieure), Benjamin C. Pierce (University of Pennsylvania)

Abstract

XTATIC is a lightweight extension of C# with native support for statically typed XML processing. It features XML trees as built-in values, a refined type system based on regular types à la XDUCE, and regular patterns for investigating and manipulating XML. We describe our experiences using XTATIC in a real-world application: a program for transforming XMLSPEC, a format used for authoring W3C technical reports, into HTML. Our implementation closely follows an existing one written in XSLT, facilitating comparison of the two languages and analysis of the costs and benefits—both significant—of rich static typing for XML-intensive code.

1 Introduction

A profusion of recent language designs, including XDUCE [17, 18, 19], CDUCE [11, 2], XACT [25, 8], XQUERY [4, 10], XJ [15], XOBE [23], and XTATIC [14, 26, 12, 13, 27], are founded on the belief that rich static type systems based on regular tree languages can offer significant benefits for XML-intensive programming. Though attractive, this belief can be questioned on a number of counts. Are familiar XML processing idioms from untyped settings easy to enrich with types, or are there important idioms for which static typing is awkward or unworkable? Is it feasible to reimplement untyped applications in a statically typed language in a "bug-for-bug compatible" fashion? Does the need to please the typechecker lead to too much repetitive boilerplate or too many type annotations? Our aim is to put these questions to the test by a detailed comparison of a non-trivial application originally written in XSLT 1.0 [9] and a faithful reimplementation of the same application in XTATIC.

For this experiment, we chose a task that has also been used as a case study in the standard XSLT reference [20, 21]: translation of structured documents from a high-level document description language, XMLSPEC, into XHTML. XMLSPEC is the format used for authoring official W3C recommendations and drafts. This example is non-trivial but of manageable size: the DTD for XMLSPEC defines 102 elements and 57 type-like entities, while the XHTML DTD defines 89 elements and 65 entities; the XSLT stylesheet implementing the transformation is 770 lines long. Besides styling XMLSPEC elements as HTML, its functions include formatting BNF grammars, section numbering, setting up cross-references, and generating the table of contents.

A useful effect of emulating a finished untyped application is that both costs and benefits are visible all at once, rather than arising and being dealt with incrementally, throughout the design and development process. To maximize the opportunities for comparison, our XTATIC implementation closely follows not only the behavior, but also, as far as possible, the structure of the original XSLT implementation.

The contributions of the paper are as follows. First, we draw attention to the XMLSPEC problem itself. This problem offers a good balance of size, complexity, and familiarity, and we hope that it can be re-used by others as a common benchmark for XML processing languages. Second, we present a detailed analysis of the costs and benefits of expressive static types for XML manipulation, both of which were substantial in this application. The main cost is the difficulty of inferring appropriate types for multiple, mutually recursive transformations. The main benefit is the expected one: design flaws in the XMLSPEC DTD—which show up in the XSLT stylesheet as behavioral bugs—are instead exposed as type inconsistencies. Third, we demonstrate that the type system and processing primitives of XTATIC are sufficiently powerful and flexible to fix (or gracefully work around) these bugs without modifying the XMLSPEC DTD. Fixing some of them in the XSLT stylesheet appears more difficult. Finally, reimplementing an existing stylesheet gives us many opportunities for head-to-head comparisons of XSLT and XTATIC, highlighting areas where each shines. In particular, we observe that XTATIC-style regular pattern matching is more natural than XSLT's style—structural recursion augmented with "context probing"—when processing structures, such as the BNF grammar descriptions found in XMLSPEC, where ordering is important. Conversely, XSLT is very convenient for straightforward structural traversals with local transformations, where XTATIC requires a heavier explicit-dispatch control flow. Also, XSLT's data model, which treats the original document as a resource for the computation, is more natural for certain tasks, though we can mimic some of its uses with generic libraries in XTATIC.

Section 2 summarizes XMLSPEC and gives a high-level explanation of the transformation task. Section 3 describes the main challenges of expressing the core XSLT processing model in XTATIC. Section 4 compares the processing of structured data such as BNF grammars in XSLT

2 and XTATIC. Section 5 describes the auxiliary data struc- resented explicitly in XMLSPEC by nested sectional ele- tures that our application uses in place of the global doc- ments div1,...,div4, while in HTML it is implied by ument access primitives offered by XSLT. We close with heading elements h1,...,h6 that interrupt the flow of an overview of other evaluations of XML processing lan- paragraph-level markup. Both formats include generic guages in Section 6 and some concluding thoughts in Sec- paragraph-level elements—for example, enumerated and tion 7. The paper is intended to be self-contained, but it bulleted lists (ol and ul in HTML vs olist and ulist in does not present the motivations or technical details of XMLSPEC)andparagraphs(p in both). But XMLSPEC the XTATIC design in depth; for these, the reader is re- also defines special-purpose variants like blist,whichis ferred to our earlier papers, especially [14, 13], and to a list containing only bibliographic entries, and vcnote— Gapeyev’s forthcoming PhD dissertation. a special kind of paragraph for technical snippets called validity constraint notes. Finally, at the phrase level, 2 The Problem where HTML elements like em, i, b decorate the flow of character text with visual emphasis and the anchor ele- a The history of both XSLT and XMLSPEC goes back to ment provides simple linking points and links for exter- 1998, when the standards for XML and XSLT themselves nal resources or locations in the document, their XML- were still under development. Newer versions of the DTD SPEC counterparts play more semantically-loaded roles. termdef and the stylesheet (available from the XMLSPEC web For example, a element encloses a phrase that page, http://www.w3.org/2002/xmlspec/)continueto defines the meaning of a term (whose occurence in the be used for developing W3C specifications. definition is marked by a term element) and can be linked to from other parts of the document by element termref. Our development is based on the 1998 version of There are many more elements for specific roles, such as kw specref XMLSPEC—the one used for the original XML Recom- language keywords ( ), references ( )toother mendation. Our primary reason for using this some- parts of the specification, etc. This semantic specializa- what dated version was the public availability of the tion of elements allows one to vary independently not XML sources for the Recommendation, which we used only their visual representation, but also additional pro- as testing data; more recent W3C specifications, devel- cessing such as creation of indexes and glossaries. oped with newer versions of XMLSPEC,wereonlyavail- able only in formatted (HTML, PS, PDF) form when we Another category is XMLSPEC elements containing began the project. (Starting in September 2005, XML “structured data” of various kinds. The most interest- sources of some newer W3C specification drafts—e.g., ing example is the scrap element, which encapsulates XPATH,XSLT,andXQUERY—have again become avail- BNF rules for grammar productions; its formatting is dis- able.) The XMLSPEC Guide [29] is a useful resource for cussedindetailinSection4. understanding the XMLSPEC DTD (although it describes a later, slightly more feature-rich version). The original Most of the task of an XMLSPEC to HTML transformer is 1998 XSLT stylesheet is described in detail in the 2nd edi- thus a straightforward (often literally one-to-one) map- tion of Michael Kay’s XSLT reference [20]. Both the DTD ping from XMLSPEC to HTML element tags. 
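For instance, a list markup fragment (hypothetical, not taken from the Recommendation sources) such as

    <olist>
      <item><p>XML shall be usable over the Internet.</p></item>
    </olist>

maps directly to the HTML

    <ol>
      <li><p>XML shall be usable over the Internet.</p></li>
    </ol>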
But there and the stylesheet are available from the book’s web page are several aspects that are more interesting, including (http://www.wrox.com). displaying structured data in a readable form, computing section numbers based on the hierarchical positioning of div XMLSPEC is similar to more elaborate XML-based docu- elements, creating a table of contents, with entries ment schemas, such as DOCBOOK (http://www.docbook. hyperlinked to the corresponding sections, and format- org/) and the Text Encoding Initiative (http://www. ting the cross-references occurring in the document so tei-c.org/), in that it encodes the “logical” structure that they properly mention features of the referent, such of a document so that the same information can be pre- as title or its computed section number. sented in different styles and media. Here, we consider only the task of transforming an XMLSPEC document into 3 Structural Recursion a single HTML page, as shown in Figure 1. The processing model of XSLT is rather different from Since both XMLSPEC and XHTML are used for document the explicit control flow of traditional programming lan- markup, there are many similarities between their DTDs. guages, including XTATIC,beingbasedonanimplicit re- In both, a valid document file has distinct sections for cursive traversal of the input document. After introduc- meta-data and for the content proper. The content has ing the XSLT processing model and sketching how an three kinds of markup: top-level, or sectional, for hier- XTATIC application can simulate it explicitly, this section archical document organization; medium-, or paragraph- discusses the main challenges of making this implemen- level, for chunks of actual content; and low-, or phrase- tation strictly typed: (1) structuring its code to accomo- level, for the content flow itself. More interesting for date the constraints of typing and (2) fixing the typing our task, though, are the differences, which stem from bugs inherited from the stylesheet. differences in purpose between logical and presentation markup: XMLSPEC uses markup to indicate the role of a piece of text in the discussion of a subject matter, while 3.1 Implicit Structural Recursion in HTML uses markup to instruct a browser how a piece of XSLT text should be visually presented to the reader. An XSLT stylesheet is a collection of templates,eachspeci- For example, the hierarchical document structure is rep- fying a computation to be performed on document nodes

that satisfy a specified test condition. The execution of a stylesheet proceeds in a single recursive pass over the input document, in document order. For each node encountered during the traversal, the run-time system selects the most specific template whose test is satisfied by the node and executes it. Consider, for example, the following template:¹ The test (match="olist") says that the template is applicable to XMLSPEC olist elements; for each such element, the template produces an HTML ol element. The contents of the ol are the result of a further recursive traversal of the input: the instruction <xsl:apply-templates/> designates the location receiving the result of applying the same procedure of selecting and executing an appropriate template, to each child node of the olist, in order. Since, according to the XMLSPEC DTD, the only possible children of olist are item elements, the template that gets invoked on them is

Figure 1. A sample XMLSPEC document fragment and its rendering via HTML. (The sample shows an "Introduction" section—"XML is an application profile or restricted form of SGML..."—with design goals, subsections "Origin and Goals" and "Terminology", and a "References" entry for ISO 8879:1986, the SGML standard.)

In general, the test condition in a template's match attribute is specified by an XSLT pattern, which is written in the downward subset of XPATH. A template is applicable to an element when its pattern matches it, i.e., there is an ancestor node, starting from which the pattern (as a path) would select the element. More than one template can be applicable to a document node, but there is always at least one, since XSLT predefines a default template applicable to any element, whose action is to proceed with the traversal without producing any output. In the case of multiple applicable templates, only one of them gets selected for execution according to a set of priority rules whose particulars are not important for this discussion.

The bulk of the XMLSPEC stylesheet consists of templates similar to these, performing simple tag-to-tag transformations—sometimes augmented with other output whose generation depends only on the current element. This processing style, known as structural recursion [1, 5, 6], is the backbone of the XSLT processing model. However, since a simple one-pass structural recursion alone would not be sufficient for many applications, XSLT augments it with more features, some of which we will see later.
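In standard XSLT 1.0 syntax, the two templates discussed in this subsection presumably read roughly as follows (a sketch; the stylesheet's actual markup may differ in detail):

    <!-- XMLSPEC olist becomes an HTML ol; children are processed recursively -->
    <xsl:template match="olist">
      <ol>
        <xsl:apply-templates/>
      </ol>
    </xsl:template>

    <!-- XMLSPEC item becomes an HTML li -->
    <xsl:template match="item">
      <li>
        <xsl:apply-templates/>
      </li>
    </xsl:template>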
  • 3.2 Types and Patterns in XTATIC which similarly constructs an HTML li element from each XMLSPEC item element. The recursive descent of Before describing our implementation of the formatter, the traversal terminates either on the document’s text let us pause, briefly, to review the XML types and patterns nodes, which get copied into the output, or on templates found in XTATIC. that do not call others via or a similar instruction. XTATIC’s types are composed from XML element tags using the familiar regular expression operators of con- catenation (“,”), alternation (“|”), repetition (“*”), and 1The XML-based syntax is a controversial aspect of non-empty repetition (“+”). They can also contain type XSLT. Readers unfamiliar with the language only need names, which are bound to their definitions by top-level to know that elements starting with the xsl: prefix are regtype declarations. For example, here is a fragment of XSLT instructions, while others are literal elements con- the XTATIC type declarations corresponding to the XML- structing the output. SPEC DTD:

    4 regtype s_olist [[ s_item+ ]] The Dispatch method uses a combination of C while regtype s_item [[ s_obj_mix+ ]] and XTATIC match statements to consume the input se- regtype s_obj_mix [[ s_p | s_olist | s_ulist | quence from the variable seq and produce the output se- ... ]] quence in the variable res.XTATIC’s match statement is similar to C ’s switch,butitscase tests are patterns and s_ We use the prefix for type names coming from XML- therefore can assign fragments of the input to variables h_ SPEC,and,later, for names coming from XHTML. for use in the clause’s body. The full code of Dispatch The double square brackets are used to separate regu- contains a case for each XMLSPEC element, except for lar types, patterns, and XML values from surrounding C elements involved in presenting structured data, which code. (The ellipsis ... is not part of XTATIC syntax; it just are not covered by the dispatching framework (see Sec- indicates that the definition of s_obj_mix is larger than tion 4). shown.)

    The semantics of XML types is similar to that of regu- 3.4 Typing the Recursion lar expressions on strings: a type is the set of values described by the type’s definition, except that the val- Our goal, however, is to implement a well-typed ues are XML document fragments—i.e. sequences of formatter—i.e., one whose output is, by construction, trees built from XML element tags and characters. For valid HTML for any valid XMLSPEC input. Therefore, we example, the values of type s_item are single XML el- need to give more precise output types to our methods. ements of the form ... whose contents are non-empty sequences of elements described by the Almost every template method returns a sequence of one union type s_obj_mix. The predefined type describes or more HTML elements that it creates itself; in these all well-formed XML values. The brackets with no con- cases, the precise output type for the method can be in- tent, [[]], denote the type containing only the empty ferred from its code alone. For example, TemplateItem is sequence (when used where a type is expected), as well intended to return values of type h_li. Precise template as the empty sequence value itself (when used where a methodtypesinduceapreciseresulttypeforDispatch, value is expected). which, instead of xml, now yields the union of the result types of all the templates it invokes. A regular pattern is a type annotated with variables. For example, This type, however, is too large. For example, in order for TemplateOlist to return a valid ol element, the static re- [[ s_item first, s_item+ rest ]] sult type of the recursive call to Dispatch at this point must contain only li elements. Thus, instead of a single is a pattern with variables first and rest that will be Dispatch method, we need to define several dispatchers, bound to values of types s_item and s_item+ after a suc- each invoking only the subset of template methods suit- cessful match. An XML value matches a pattern when able for a particular context and therefore ensuring ap- the value belongs to the type obtained by erasing the propriate input and output types. For example, the typed bound variables. These patterns are the main construct version of TemplateItem becomes: that XTATIC programs use to analyze XML. static [[h_li]] TemplateItem ([[s_item]] elt){ [[s_obj_mix+ cont]] = elt; return [[

  • DispatchInItem(cont)]]; 3.3 Explicit Structural Recursion in } XTATIC Besides the precise return type and the call to the cus- Implementing an untyped equivalent of the XMLSPEC tom dispatcher DispatchInItem, it also analyzes input stylesheet’s behavior in XTATIC is straightforward: it can elt by a pattern that strictly follows the definition of type be written as a collection of mutually recursive static class s_item and therefore gives the variable cont amorepre- methods, one per template, plus a dispatcher method that cise type, on which DispatchInItem can rely. simulates the role of the XSLT run-time system. Figure 2 shows the fragments of this implementation correspond- In general, the dispatcher used by a template must be ing to the two XSLT templates discussed above. prepared to handle any input that the template can pass to it, and its output must be acceptable for the use the The two template methods have a similar structure. template has for it. Any collection of dispatchers that TemplateItem, for example, declares s_item,thetypeof satisfy these constraints for all templates would give a XMLSPEC elements on which it can operate, as its in- type-correct formatter. For a few of the templates, how- put type; then, relying on the fact that the argument ever, it is not possible to compose a well-typed dispatcher elt can only contain an item element, it uses a pat- from the template methods that would faithfully repro- tern assignment to extract the element’s content into the duce the operation of the stylesheet’s templates. These variable cont; finally, it builds and returns the result- are instances of genuine processing bugs in the original ing HTML li element. The contents of li come from a XSLT application, which can only be fixed by modifying call to the method Dispatch, which plays the role of the existing or writing additional template code. instruction. Note that the pat- tern in the assignment follows the definition of the type In a few cases, the bugs are caused by subtle incompat- s_item. ibilities between XMLSPEC and HTML that are possible

    5 static [[xml]] TemplateOlist ([[s_olist]] elt){ static [[xml]] Dispatch ([[xml]] seq) { [[xml cont]] = elt; [[xml]] res = [[]]; return [[

      Dispatch(cont)]]; while (!seq.Equals([[]])) { } match (seq) { case [[s_olist elt, xml rest]]: static [[xml]] TemplateItem ([[s_item]] elt){ res = [[res, TemplateOlist(elt)]]; [[xml cont]] = elt; seq = rest; return [[
    1. Dispatch(cont)]]; case [[s_item elt, xml rest]]: } res = [[res, TemplateItem(elt)]]; seq = rest; //...... }} return res; }

      Figure 2. A fragment of the untyped structural recursion code in XTATIC.
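Laid out one method per block, the fragment in Figure 2 reads roughly as follows. The explicit element tags in the pattern assignments and return expressions (the <olist>, <item>, <ol>, and <li> parts) are assumptions filled in from the surrounding prose, so treat the concrete syntax as a sketch rather than verbatim XTATIC code.

    // Untyped template method for XMLSPEC olist: wrap the recursively
    // processed contents in an HTML ol element.
    static [[xml]] TemplateOlist ([[s_olist]] elt) {
      [[ <olist>xml cont</olist> ]] = elt;   // pattern assignment: bind the element's contents
      return [[ <ol>Dispatch(cont)</ol> ]];
    }

    // Untyped template method for XMLSPEC item.
    static [[xml]] TemplateItem ([[s_item]] elt) {
      [[ <item>xml cont</item> ]] = elt;
      return [[ <li>Dispatch(cont)</li> ]];
    }

    // Dispatcher playing the role of xsl:apply-templates: consume the input
    // sequence and invoke the template method matching each element.
    static [[xml]] Dispatch ([[xml]] seq) {
      [[xml]] res = [[]];
      while (!seq.Equals([[]])) {
        match (seq) {
          case [[ s_olist elt, xml rest ]]:
            res = [[ res, TemplateOlist(elt) ]]; seq = rest;
          case [[ s_item elt, xml rest ]]:
            res = [[ res, TemplateItem(elt) ]]; seq = rest;
          // ... one case per remaining XMLSPEC element ...
        }
      }
      return res;
    }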

      (though a bit tricky) to smooth out in XTATIC, but appar- voke would be more difficult. (None of the several pos- ently not in XSLT, so it is instructive to discuss them and sibilities we can see is completely satisfactory. Using our solutions in some detail. more complex path patterns in match attributes, such as div1/ednote and head/ednote, which test for the par- ent element, would require writing as many templates 3.5 Bugs and Fixes for ednote as there are possible parents—each such tem- plate’s body duplicating one of the only two handlers. We XMLSPEC defines an element ednote for recording ed- can write a single template for ednote that accesses the itorial remarks. The DTD allows ednote to appear in parent node and determines its type via a both paragraph- and phrase-level contexts, but the XSLT or a chain of instructions, which again have to stylesheet contains only one template for ednote,which list all the possibilities. Other options include use of tem- formats it as blockquote, a paragraph-level HTML ele- plate modes and template parameters, but these are also ment presented in browsers as an indented paragraph. quite heavy.) Clearly, appearances of ednote in phrase-level contexts (e.g., inside head elements of section titles) should be The typing bug that required the most sophisticated fix formatted differently. To handle this, we implement a in our reimplementation is caused by one of the most second template method for ednote, with a phrase-level- straightforward-looking templates in the stylesheet: friendly return type. A dispatcher that has the ednote element in its input type processes it with whichever of the two template methods that is compatible with the dis-
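Judging from the discussion that follows, the template in question is essentially (XSLT 1.0 syntax; a sketch):

    <xsl:template match="p">
      <p>
        <xsl:apply-templates/>
      </p>
    </xsl:template>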

      patcher’s return type.

      A similar, but trickier, problem arises in the formatting This template transforms the XMLSPEC paragraph ele- p of another phrase-level XMLSPEC element, quote.This ment into an HTML element of the same name. The element is different from most others: rather than cre- trouble is, an HTML p can contain only character data ating a new HTML element or two, the corresponding and phrase-level elements, while an XMLSPEC p can also template just surrounds the result of recursively format- contain select paragraph-level elements. Consequently, ting the quote’s contents with quotation mark characters. this template can produce an HTML p with paragraph- The content type of quote is such that it gets transformed level elements, such as lists (ol,etc.),aschildren. into output belonging to the most general HTML phrase- level type, h_Inline. One of the elements that can occur The sources of the XML Recommendation actually con- inside h_Inline is the anchor element a, and the content tain quite a few instances of p elements that tickle this of the latter is described by the subtype h_a_content of bug. Since it affects validity of the generated HTML, the h_Inline, which disallows a elements, prohibiting nested bug was addressed in the later versions of the stylesheet anchors. The quote element itself, however, can occur in by a hack: when an element like ol appears inside a an XMLSPEC context that ends up formatted inside an a paragraph, the stylesheet adds to the output tree a text element, possibly producing a nested anchor. The resolu- node whose content is “

      ”, then formats the ol,and

      tion in XTATIC is similar to the one for ednote:wewrite then generates another text node whose content is “ ”. two template methods for quote, both just adding quota- This does not restore the validity of the in-memory tree tion marks, but to the results coming from two different produced by the stylesheet, but only of its textual seri- dispatchers. alization, implying that the stylesheet cannot be used in pipelining scenarios without re-parsing and re-validation The solutions for these two problems work because, by of its output. We do not see any natural way to fix this explicitly implementing the recursive traversal as a com- bug in XSLT without changing the XMLSPEC DTD. bination of calls to several distinct dispatcher methods, our algorithm tracks (static) information about its cur- Our method TemplateP implements the above fix in a rent context in the input document. In principle, an fully typed way. It uses a dispatcher that transforms XSLT stylesheet could also implement processing alter- the contents of XMLSPEC p into a sequence of text and natives for ednote and quote elements, but making the phrase- and paragraph-level HTML elements, and then context-dependent decision of which one of them to in- processes it to find (with the use of XTATIC patterns)

      6 static [[h_block*]] FlowIntoBlocks ([[h_Flow]] flow) { [[h_block*]] res = [[]]; while (!flow.Equals([[]])) { match (flow) { case [[(pcchar | h_inline | h_misc_inline)+ inl, h_Flow rest]]: res = [[res,

      inl]]; flow = rest; case [[h_block+ blocks, h_Flow rest]]: res = [[res, blocks]]; flow = rest; case [[(h_form | h_noscript) unexpected, h_Flow rest ]]: Error("Unexpected input in FlowIntoBlocks"); flow = rest; case [[]]:Error("empty case"); }} return res; } Figure 3. The method performing an HTML processing pass to detect implicit paragraphs. longest subsequences of text and phrase-level elements and wrap them as HTML p elements. Figure 3 shows the method that performs the HTML processing pass. The final result of TemplateP is paragraph-level content.
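Re-laid out, the method of Figure 3 reads roughly as follows. The <p> constructor in the first case is an assumption based on the accompanying description (runs of phrase-level content are wrapped as HTML p elements); the rest follows the figure.

    // Split a flow of HTML content into block-level pieces, wrapping
    // maximal runs of character data and inline content in p elements.
    static [[h_block*]] FlowIntoBlocks ([[h_Flow]] flow) {
      [[h_block*]] res = [[]];
      while (!flow.Equals([[]])) {
        match (flow) {
          case [[ (pcchar | h_inline | h_misc_inline)+ inl, h_Flow rest ]]:
            res = [[ res, <p>inl</p> ]];    // new implicit paragraph
            flow = rest;
          case [[ h_block+ blocks, h_Flow rest ]]:
            res = [[ res, blocks ]];        // block-level content passes through
            flow = rest;
          case [[ (h_form | h_noscript) unexpected, h_Flow rest ]]:
            Error("Unexpected input in FlowIntoBlocks");
            flow = rest;
          case [[]]:
            Error("empty case");
        }
      }
      return res;
    }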

      From what we have said so far, it might appear that there is another way to implement TemplateP,notin- volving HTML post-processing: we could use patterns to find longest subsequences of XMLSPEC elements and text to be transformed into phrase-level HTML, apply an ap- propriate dispatcher to them, and wrap the results as p Figure 4. An HTML table generated from an XMLSPEC elements. In fact, the approach we sketched above is grammar the only one that works, because of another problem— this one caused by XMLSPEC termdef elements occur- ring in the content of p. These elements are used to XSLT handle the challenges of rendering structured data designate boundaries of formal definitions in a specifica- for visual presentation. tion. As with quote, the processing of a termdef does not create an HTML element—it just returns an anchor ele- ment a followed by the sequence resulting from process- 4.1 BNF Productions ing the contents. This sequence can contain both phrase- and paragraph-level elements. If termdef elements only A grammar fragment is represented in XMLSPEC as a occurred surrounded by paragraph-level elements, we sequence of production elements prod, each having the could implement TemplateTermdef like TemplateP.How- structure described by the following DTD declaration: ever, when an occurrence of termdef in p is directly pre- ceded by phrase-producing content and the result pro- duced by the termdef also starts with phrase-level con- tent, the two must be joined into a single HTML para- That is, a production consists of a left-hand side con- lhs graph. Therefore, to avoid creating spurious paragraph taining exactly one element, which introduces the breaks, we define TemplateTermdef to just return the non-terminal defined by the production, and a right-hand result of recursive processing of its contents. The lat- side, which defines the unfolding of the non-terminal and ter joins the surrounding HTML and gets processed in consists of a sequence of one or more element groups. rhs TemplateP to detect the paragraphs. Each group contains exactly one element, which rep- resents a fragment of the unfolding (usually, an alter- native BNF clause), possibly accompanied by side con- Along with these significant typing difficulties, XTATIC’s com typechecker uncovered several more minor bugs in the ditions in the form of a comment ( ), or a reference wfc vc stylesheet that also affected validity, but that were easy to a well-formedness ( )orvalidity( )constraint.It to fix by small changes to the output. is not important to know about the internals of the ele- ments inside prod. Each of them gets formatted in the usual way as an HTML fragment to be placed inside a 4 Structured Data table cell; the layout of this table is our present concern.
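The content model being described corresponds, approximately, to a DTD declaration of the following shape (a paraphrase, not a verbatim quote of the XMLSPEC DTD):

    <!ELEMENT prod (lhs, (rhs, (com | wfc | vc)*)+)>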

      XMLSPEC defines several collections of elements for Figure 4 shows an example. The generated table has five structured data. This section employs the most so- columns containing, respectively, an automatically gen- phisticated of these—elements for representing BNF erated sequence number for the production, the name of grammars—as an example showing how XTATIC and the non-terminal being defined, the symbol ::=, the frag-

      7 ments of the non-terminal’s definition, and the comments Now, the order of template execution on an instance of and constraints. prod is as follows. First, the template for prod detects all the starter elements among the prod’s children and The challenge here is assigning appropriate contents to invokes a type-appropriate starter template on each. The the table’s cells based on the relative positioning of var- starter template pads the row with empty cells (or, in case ious elements in the flat sequence of prod’s children, of lhs, starts a new row, and makes cells with a running rather than by simply reflecting a nested structure that sequence number and the ::= symbol), calls an appro- is already present in the input. priate cell template on the current element to format its own cell, and finally formats any remaining cells in the 4.2 XTATIC Solution row by applying cell templates to the appropriate follow- ing siblings of the current element. XTATIC’s patterns address this challenge naturally. Note that, in each production, the element lhs contributes This algorithm requires features of XSLT that go beyond only to the starting of the first table row corresponding structural recursion—the ability to control selection of to the production, while the rest of the first row, as well both templates and nodes during traversal (to invoke ei- as each of the remaining rows, is generated from a small ther starter or cell templates as appropriate) and to ob- “chunk” of prod’s children containing at most one rhs ele- tain information about the surroundings of the current mentandatmostonecom, wfc or vc element. This chunk node. The next few paragraphs review these XSLT fea- can be described by the type tures. regtype xs_rhschunk The selection of templates to be considered for applica- [[(s_rhs, xs_constr_mix?) | xs_constr_mix]] tion when executing xsl:apply-templates can be con- regtype xs_constr_mix trolled in XSLT by template modes. A template’s definition [[s_com | s_wfc | s_vc]] cancontain(inthestarttagofxsl:template element) an attribute mode specifying the mode of this template. E.g., and, using this type, we can easily write patterns that split the “cell” templates in our stylesheet are headed by tags the sequence of prod children into the chunks necessary like for creating the table row-by-row; the full code appears in Figure 5. The method TemplateProd starts by extracting from the production the name (lhs) of its non-terminal xsl:apply-templates and the first chunk of the definition. It uses these to con- Then, an instruction that also mode struct the first table row corresponding to the production mentions the attribute, e.g. in the newly created variable res. The number placed in the first table cell is extracted, based on the production’s identifier (prodid), from an index data structure cre- considers only the templates marked by the same mode. ated before processing the document (this process is de- scribed in Section 5). The contents of chunk is processed To control the selection of nodes to be processed by fur- by a separate method, MkRhsChunk, which performs a ther traversal, the XSLT xsl:apply-templates instruc- straightforward match on the two alternatives in the defi- tion can be augmented with the attribute select speci- nition of xs_rhschunk type and invokes TemplateRhs and fying the sequence of nodes to be processed next, instead DispatchFlow toprocessthechunk’selements.Thesec- of the default children sequence of the current element. 
ond part of TemplateProd is a foreach statement that For example, the prod template restricts further process- iterates over the rest of the production by cutting con- ing to starter elements only by executing the instruction secutive chunks off it with the [[xs_rhschunk chunk]] pattern, while adding to res a new table row for each A child element of an instance of prod can be classi- fied as a “starter” element if it provides data for the first The contents of select is an XPATH path expression that, non-empty cell in the HTML table’s row; otherwise as a when applied to a node, produces a sequence (possibly “follow-up” element. Accordingly, the stylesheet defines empty) of nodes from the document that are related to two templates for each child element type of prod:a the original node as specified by the path. “cell” template that just performs formatting inside the HTML table’s cell (in other words, a cell template is an or- For our current purposes, we can think of an XPATH path dinary structural recursion template in the sense of Sec- as an expression of the form2 a::n[q ]...[q ] where a tion 3), and a “starter” template that is supposed to be 1 k executed only on starter elements, performing, among other things, row padding with empty cells. 2More precisely, the construction described here is a

      8 static [[h_tr+]] TemplateProd ([[s_prod]] markup) { [[ pcdata lhs, (s_rhs, xs_constr_mix?) chunk, xs_rhschunk* rest ]] = markup; [[h_tr+]] res = [[ , ‘[‘,prodindex.Number(prodid),‘]‘, lhs, ‘::=‘, MkRhsChunk(chunk) ]]; foreach ([[xs_rhschunk chunk]] in rest) { res = [[ res, , , , MkRhsChunk(chunk) ]]; } return res; } static [[h_td, h_td]] MkRhsChunk ([[xs_rhschunk]] chunk) { match (chunk) { case [[ s_rhs rhs, xs_constr_mix? constrOPT ]]: return [[ TemplateRhs(rhs), DispatchFlow(constrOPT)]]; case [[ xs_constr_mix constr ]]: return [[ , DispatchFlow(constr)]]; } } Figure 5. BNF production formatting in XTATIC.
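In concrete XSLT 1.0 terms, a moded cell template and the starter-only selection described in this subsection look roughly like this (a sketch; the mode name and the exact expressions in the stylesheet may differ):

    <!-- A "cell" template for rhs, distinguished by its mode -->
    <xsl:template match="rhs" mode="cell">
      <td>
        <xsl:apply-templates/>
      </td>
    </xsl:template>

    <!-- Inside the prod template: process only the "starter" children, i.e.
         lhs, an rhs not immediately preceded by lhs, or a side-condition
         element not immediately preceded by an rhs -->
    <xsl:apply-templates select="lhs
        | rhs[not(preceding-sibling::*[1][self::lhs])]
        | *[(self::com or self::wfc or self::vc)
            and not(preceding-sibling::*[1][self::rhs])]"/>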

      is an axis, n is a node test,andqi are predicates.The responding to the same elements, the cell templates are execution of a path consists of taking the sequence of only invoked by instructions at the end of starter tem- nodes specified by the axis a and successively pruning plates, like this one in the starter template for rhs: it to contain only the nodes satisfying both the node sisting of the current node, child, that gives the chil- dren of the current node, and preceding-sibling and More detailed explanation of BNF formatting in the following-sibling that give the corresponding sibling stylesheet can be found in [20, 21]. elements of the current node. The preceding-sibling axis produces the nodes in reverse document order, i.e. the closest sibling comes first. A node test n is either an 4.4 Observations element name (as in, e.g., self::lhs), which leaves the node in the result only if the node’s name is the same The path expressions from the BNF formatting task * child::* as the test’s, or a wildcard (as in ), which is shown above are quite complicated—expressions of q satisfied by any node. A predicate can be numeric or such complexity rarely appear in document-oriented 1 boolean. A numeric predicate specifies a -based index stylesheets and their occurrences seem to indicate pro- of the node to be selected from the current sequence. cessing of the islands of structured data embedded in- E.g., the path preceding-sibling::*[1] selects the clos- side documents. The XPATH fragment needed for han- est sibling preceding the current node in the document dling structured data is more complicated and difficult (or the empty sequence if the current node is the first to master, we believe, than regular patterns, but it can child of its parent). A boolean predicate is built, using be learned. But even knowing this fragment, the major traditional boolean connectives and, or and not,fromel- difficulty for someone trying to understand how BNF for- ementary predicates, which coincide with path expres- matting works in the XSLT stylesheet comes from the fact sions. When interpreted as a predicate, a path expression that processing of a contiguous piece of data has to be is false when it returns the empty sequence, and is true distributed across several non-contiguous pieces of code, otherwise. connections between which are only loosely indicated. By contrast, the ability of XTATIC’s match statement to Taking these explanations into account, one can see keep together inspection and transformation of a piece why the above select expression restricts operation of of data constituting a logical unit allowed us to write xsl:apply-templates to elements that would start a new processing methods (Figure 5) whose responsibilities can row in the HTML table. Technically, the path selects (by be clearly specified in terms of their input-output behav- child::*)allchildrenofprod that are (according to the ior and whose code explicitly indicates dependencies as following predicate) either the lhs element, or an rhs method calls. element not immediately preceded by the lhs,oraside condition element not immediately preceded by an rhs. Another small convenience available with regular pat- The templates that get invoked on the elements so se- terns but not with XPATH pathsistheabilitytonametype lected are starter templates, since they, as well as the fragments and later reference them in patterns. 
For ex- xsl:apply-templates instruction, do not specify a mode ample, our definition of the type xs_constr_mix could attribute. Since mode is specified by cell templates cor- have improved clarity of the later patterns, where it is used multiple times, while no similar XPATH shortcut is step expression s, and a general path expression p is ei- available for [self::vc or self::wfc or self::com], ther a step s, or an expression of the form p/s. which is also used several times in the stylesheet.

5 Gathering Global Information

      The data model of XSLT is more complex than the one Here, @def is the value of the wfc’s def attribute, and of XTATIC, supporting the notion of a document as a id() is a built-in XSLT function that, given a token, re- container of interconnected nodes and a correspond- turns the node of the current element that carries a so- ing assortment of basic operations that take advantage called ID attribute with the token as its value. In this of the richer data model. Several parts of the XML- example, id() returns the above wfcnote element, and SPEC stylesheet rely on these additional XSLT features. the following XPATH expression extracts the contents of This section explains how we handled these tasks in its head child. XTATIC, sometimes finding a generic reusable solution, other times relying on properties specific to XMLSPEC. To replicate the functionality of id() in XTATIC,ourfor- matter explicitly builds an index datastructure that maps In XTATIC, XML values are lightweight, immutable, share- IDs to elements. Fortunately, this indexing procedure is able trees, which must be inspected in a top-down fash- completely generic: our implementation is encapsulated ion. By contrase, given a node in XSLT, one can re- in an IdIndex class that can be reused in other appli- trieve the root of the document it belongs to, explore the cations requiring similar ID support. Its usage consists of document in any direction—including towards ancestors creating an IdIndex object, say idindex, at the beginning and siblings—and randomly access nodes that have been of processing by passing the document’s root element to marked by special ID attributes, which are specified to the IdIndex constructor and then using method calls like be globally unique within a valid document. Supporting idindex.Id(x) to retrieve elements from the internally all this structure makes run-time representations of XSLT maintained index. values more heavyweight, but it also provides behind- the-scenes infrastructure for several common document- A synthesized reference is yet more complicated: its processing tasks that require information about the doc- HTML formatting contains computed data not directly ument as a whole. These include generation of section present in the source document. For example, the el- numbers, creating the table of contents, and formatting ement uses the cross-references. An XTATIC version of the XMLSPEC for- ID mechanism discussed above to point to the sectional matter has to handle these tasks by explicitly computing element that starts as follows: a good deal of information that is automatically provided to a stylesheet by the XSLT run-time system. Predefined Entities

      The XMLSPEC cross-referencing elements can be classi- The HTML formatting of this reference, fied into three groups, depending on the computational needs of their formatting: “hard-wired,” “fetched,” and “synthesized.” [4.6 Predefined Entities]

      The XMLSPEC element for a hard-wired reference includes a computed section number. like XML documents contains all the data that needs to ap- The XSLT stylesheet computes the section number by pear in its HTML representation, which is XML documents . Such references andtheninvokingonittheinstruction are straightforward to process both in XSLT and XTATIC. put document to which the reference points. The (which appears to have been specially designed for this wfc vc elements and (which appeared in Section 4) purpose!). This instruction uses the specifications in its points with its def attribute to the element To approximate the behavior of this instruction, we create another index, encapsulated by the class NumberIndex, No External Entity References that, for each sectional element in the document, maps

      Attribute values cannot contain the element’s ID to the section’s number. Again, our im- external entity references.

      plementation is re-usable: the index’s creation is param- eterized by a boolean function that recognizes section- formingelements,andbyanobjectoftheCountkeeper andshouldbeformattedasanHTMLanchor class that provides an ADT for keeping track of hierar- chical section numbers and formatting them as strings. No External These parameters roughly mimic the above three param- Entity References eter attributes of xsl:number. We use another instance of NumberIndex to keep track of the sequence numbers of whose contents are a heading fetched from the wfcnote BNF grammar productions. element. The stylesheet obtains the heading with the XSLT instruction

      10 The correct operation of NumberIndex depends on the ex- parisons between XTATIC and a number of related lan- istence of a unique identity for each sectional element— guage designs; we refer the interested reader to these something that comes for free from the data model in the existing discussions, rather than repeating them here. XSLT version. We use the id attribute, when one is pro- vided, for this purpose. Otherwise (in XMLSPEC, id is Most of the recent crop of statically typed XML processing optional on div elements), we create the identity by con- languages have been tested on non-trivial applications, catenating the words from the (mandatory) head element but only rarely have these experiences been recorded in located under the div. In general, this does not guaran- print. A notable exception is the XQUERY Use Cases[7]— tee uniqueness. However, the stylesheet uses the same a collection of small examples specifically designed to il- trick to generate hyperlinks from the table of contents to lustrate typical tasks for which XQUERY is expected to be titles in the main body, so we assume it is sufficient for the used. Although they were created to illustrate the ca- present application. With more effort, it should be pos- pabilities of particular features of XQUERY rather than sible to implement the interface of NumberIndex so that to address a particular large application, they reflect the it mimics the xsl:number instruction with better fidelity, practical experience of the XQUERY editors and cover a but we did not yet pursue this direction. usefully diverse set of small transformation and extrac- tion tasks. In the absence of more practical benchmarks, Besides reference formatting, computed sequence num- the XQUERY use cases have been used to demonstrate ca- bers stored in NumberIndex objects are used when creat- pabilities of competing technologies, such as XSLT and ing section titles and grammar production entries while CQL [3]. formatting the body of the document, as well as dur- ing creation of the table of contents. This differs from Kay’s book [20, 21], which suggested the XMLSPEC ap- the stylesheet, which invokes the xsl:number instruc- plication for our project, contains two more case studies tion anew whenever a number needs to be generated— of substantial XSLT applications: HTML-based browsing indeed, a naive XSLT implementation could end up re- of structured genealogical data and an XSLT solution to peating the same computation many times. the classic problem of a knight’s tour of the chessboard. A brief overview of typing errors from more than a dozen The table of contents itself is created by a separate doc- real-life XSLT stylesheets appears in [30]. ument traversal, after the creation of the indexes but before formatting the document. It is implemented by 7Conclusions three nested foreach loops, one for each of the three sec- div1 div2 div3 tional levels ( , , ) that need to be reflected XSLT and XTATIC are quite different animals. XSLT is in the table of contents. This almost literally repeats the a high-level language with processing model founded code in the stylesheet, which also uses explicit traversal in structural recursion, path-based XML manipulation xsl:for-each ( instructions) for this task. primitives, and no static type system. 
XTATIC extends a general-purpose programming language with a more fa- Each of the tasks discussed in this section (creating the miliar imperative processing model and processes XML table of contents and the indexes for IDs, section num- using regular patterns, which are tightly coupled to its bers, and production numbers) requires a pass over the static XML type system. The experience described in this document in addition to the main formatting pass. We paper shows that, like XSLT, XTATIC is well suited for im- experimented with combining some of these traversals plementing at least some document-processing applica- (e.g., formatting the table of contents concurrently with tions, and that, unlike XSLT, its flexible static XML type the body of the specification, or creating all the indexes system is capable of exposing a range of design and im- together), but concluded that increased complexity of the plementation errors and facilitating fixes. We have also application code did not justify the minor efficiency gains. observed that, even in this single application, there are some practical programming tasks that are much better Overall, we have been satisfied, in this application, with served by XSLT than by XTATIC and some where XTATIC is the ability of the general-purpose facilities of XTATIC significantly better than XSLT. (those it inherits from C )tosimulatethewhole- document features of XSLT, even without direct support Designing a strictly typed structural recursion to match from XTATIC’s XML data model. the behavior of an existing untyped implementation turned out to be a surprisingly labor-intensive process. The discussion in Section 3 demonstrates that mimicking 6 Related Work the implicit structural recursion of XSLT by explicit re- cursive code is possible, even while ensuring strict static Further details about XTATIC can be found in several typing, smoothing architectural incompatibilities of the earlier papers. The core language design is presented input and output DTDs, and maintaining a clean pro- in [14], which shows how to integrate the object and tree gram organization that bears a close resemblance to the data models and establishes basic soundness results. A original XSLT code. Unfortunately, the details of the so- technique for compiling regular patterns based on match- lution described there required significant effort to dis- ing automata is described in [26] and extended to include cover, over multiple cycles of trial and error. A major type-based optimization in [28]. The run-time system difficulty that one faces during type debugging is finding of XTATIC is described in [12]. A critical evaluation of answers to lots of questions about relationships between the main language design choices can be found in [13]. types from a large collection that a DTD like XMLSPEC These papers, particularly [13], also offer detailed com- or XHTML constitutes. We hope that reading about our

We hope that reading about our experience could help programmers facing similar typing tasks to find their solutions faster. It is likely, however, that the amount and difficulty of work needed to figure out a correct solution can be intimidating and prohibitive for a typical XML-literate C# programmer in a typical project. If worst comes to worst, Xtatic lets one escape typing quandaries (and postpone discovery of typing problems till run time) by using the generic xml type and unsafe casts. Taking these difficulties into account, the refined typing of Xtatic might be considered overkill for one-time-use scripts, where it may be easier to just fix the bugs upon running into their manifestations. On the other hand, the benefits of early error discovery and safety guarantees of well-typed code—compared to the current mainstream technologies where only testing is available—can outweigh the development costs in projects aiming to create reusable document processing tools.

Another conclusion from our experience is that XSLT templates—especially in their simplest form, unburdened by other XSLT features like non-downward paths—are a very convenient approach to programming structural recursion. A template-like construct for implementing local structural recursion (i.e., traversals that can be explicitly applied to chosen document fragments as opposed to being a carrier of the whole program's computation) would be a very useful addition to XML processing languages with explicit control flow. It would be necessary, however, for this construct to be accompanied by expressive and flexible typing rules that minimally burden the programmer.

However, having discussed these difficulties with Xtatic, we should also emphasize that XSLT, for its part, turned out to be convoluted (or worse) when faced with the need to deviate from straightforward structural recursion. For the most significant typing bugs discussed in Section 3, we do not see how they could be eliminated from the XSLT stylesheet in a natural and type-safe way without revising the XMLSPEC DTD; also, processing of structured data (Section 4) is much trickier in XSLT.

It would also be interesting to see how the original XMLSPEC stylesheet might be adapted to a statically typed variant of XSLT. Even though XSLT 1.0 [9] is officially untyped, there is now a proposal [30] and implementation of data-flow based typechecking of XSLT stylesheets. The draft of XSLT 2.0 [22] describes only a dynamic type system, leaving possible static variants to the discretion of implementations.

Some of the observations made in this paper in the context of Xtatic and XSLT may be applicable to other XML programming languages. The advantages of regular patterns over paths for processing structured data (Section 4) would also hold in XDuce and CDuce, which also use patterns as the primary data inspection mechanism, compared to XQuery, XJ, XACT, or Cω, which use paths. Any language in the XDuce family will have to mimic the implicit structural recursion of XSLT by explicit traversal via mutually recursive functions, as we did in Section 3. And since obtaining statically typed transformations in these languages requires specifying types of functions, they are likely to experience similar difficulties statically typing this recursion. It is possible, however, that other features found in current descendants of the original XDuce, such as Hosoya's regular pattern filters [16] and CDuce's overloaded functions with dynamic dispatch, could mitigate some of the difficulties. On the other hand, we suspect that the nominal character of the type systems of XQuery, XJ, and Cω might complicate the task. XACT, by contrast, might have an easier time with typing structural recursion, thanks to its typechecking via data flow analysis, which requires typing specifications only for inputs and outputs of whole programs. (A new paper on XACT, presented at this workshop, introduces optional type annotations [24] with the goal to improve modularity of typechecking; these might interfere with the ease of the task in question.) Programs in all these languages, except XJ, XQuery, and, possibly, Cω, would need to maintain auxiliary data structures for global document information similar to ours in Section 5, since all of them have chosen light-weight shared tree representations for XML data. These are, of course, only speculations on our part—real observations can only come from implementations of similar applications in these languages.

Acknowledgments

We are grateful to Michael Levin and Alan Schmitt, our collaborators on the Xtatic design and implementation, for numerous discussions about practical programming in Xtatic, and in particular to Alan Schmitt for comments on an earlier draft of the paper. Remarks from the PLAN-X reviewers were very helpful in revising the paper.

Work on Xtatic has been supported by the National Science Foundation under Career grant CCR-9701826 and ITR CCR-0219945, and by gifts from Microsoft.

8 References

[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.

[2] V. Benzaken, G. Castagna, and A. Frisch. CDuce: An XML-centric general-purpose language. In ACM SIGPLAN International Conference on Functional Programming (ICFP), Uppsala, Sweden, pages 51–63, 2003.

[3] V. Benzaken, G. Castagna, and C. Miachon. A full pattern-based paradigm for XML query processing. In Practical Aspects of Declarative Languages (PADL), Long Beach, CA, volume 3350 of LNCS, pages 235–252. Springer, Jan. 2005.

[4] S. Boag, D. Chamberlin, M. F. Fernández, D. Florescu, J. Robie, and J. Siméon. XQuery 1.0: An XML Query Language. Working draft, W3C, Sept. 2005. http://www.w3.org/TR/xquery/.

[5] P. Buneman, M. Fernandez, and D. Suciu. UnQL: A Query Language and Algebra for Semistructured Data Based on Structural Recursion. VLDB Journal, 9(1):76–110, 2000.

[6] P. Buneman, S. Naqvi, V. Tannen, and L. Wong. Principles of programming with complex objects and collection types. Theoretical Computer Science, 149(1):3–48, September 1995.

[7] D. Chamberlin, P. Fankhauser, D. Florescu, M. Marchiori, and J. Robie. XML Query use cases. Working draft, W3C, Sept. 2005. http://www.w3.org/TR/xquery-use-cases/.

[8] A. S. Christensen, C. Kirkegaard, and A. Møller. A runtime system for XML transformations in Java. In Z. Bellahsène, T. Milo, Michael Rys, et al., editors, Database and XML Technologies: International XML Database Symposium (XSym), volume 3186 of Lecture Notes in Computer Science, pages 143–157. Springer, Aug. 2004.

[9] J. Clark. XSL Transformations (XSLT) Version 1.0. Recommendation, W3C, Nov. 1999. http://www.w3.org/TR/xslt.

[10] D. Draper, P. Fankhauser, M. F. Fernández, A. Malhotra, K. Rose, M. Rys, J. Siméon, and P. Wadler. XQuery 1.0 and XPath 2.0 Formal Semantics. Working draft, W3C, Sept. 2005. http://www.w3.org/TR/xquery-semantics/.

[11] A. Frisch, G. Castagna, and V. Benzaken. Semantic subtyping. In IEEE Symposium on Logic in Computer Science (LICS), 2002.

[12] V. Gapeyev, M. Y. Levin, B. C. Pierce, and A. Schmitt. XML goes native: Run-time representations for Xtatic. In 14th International Conference on Compiler Construction, Apr. 2005.

[13] V. Gapeyev, M. Y. Levin, B. C. Pierce, and A. Schmitt. The Xtatic experience. In Workshop on Programming Language Technologies for XML (PLAN-X), Jan. 2005. University of Pennsylvania Technical Report MS-CIS-04-24, Oct. 2004.

[14] V. Gapeyev and B. C. Pierce. Regular object types. In European Conference on Object-Oriented Programming (ECOOP), Darmstadt, Germany, 2003. A preliminary version was presented at FOOL '03.

[15] M. Harren, M. Raghavachari, O. Shmueli, M. G. Burke, R. Bordawekar, I. Pechtchanski, and V. Sarkar. XJ: facilitating XML processing in Java. In International World Wide Web Conference, pages 278–287, 2005.

[16] H. Hosoya. Regular expression filters for XML. In Workshop on Programming Language Technologies for XML (PLAN-X), 2004.

[17] H. Hosoya and B. C. Pierce. Regular expression pattern matching. In ACM SIGPLAN–SIGACT Symposium on Principles of Programming Languages (POPL), London, England, 2001. Full version in Journal of Functional Programming, 13(6), Nov. 2003, pp. 961–1004.

[18] H. Hosoya and B. C. Pierce. XDuce: A statically typed XML processing language. ACM Transactions on Internet Technology, 3(2):117–148, May 2003.

[19] H. Hosoya, J. Vouillon, and B. C. Pierce. Regular expression types for XML. ACM Transactions on Programming Languages and Systems (TOPLAS), 27(1):46–90, Jan. 2005. Preliminary version in ICFP 2000.

[20] M. Kay. XSLT Programmer's Reference. Wrox, 2nd edition, 2003.

[21] M. Kay. XSLT 2.0 Programmer's Reference. Wrox, 3rd edition, 2004.

[22] M. Kay. XSL Transformations (XSLT) Version 2.0. Working draft, W3C, Sept. 2005. http://www.w3.org/TR/xslt20.

[23] M. Kempa and V. Linnemann. On XML objects. In Workshop on Programming Language Technologies for XML (PLAN-X), 2003.

[24] C. Kirkegaard and A. Møller. Type checking with XML Schema in Xact. Technical Report RS-05-31, BRICS, Sept. 2005. Presented at PLAN-X 2006.

[25] C. Kirkegaard, A. Møller, and M. I. Schwartzbach. Static analysis of XML transformations in Java. IEEE Transactions on Software Engineering, 30(3):181–192, Mar. 2004.

[26] M. Y. Levin. Compiling regular patterns. In ACM SIGPLAN International Conference on Functional Programming (ICFP), Uppsala, Sweden, 2003.

[27] M. Y. Levin. Run, Xtatic, Run: Efficient Implementation of an Object-Oriented Language with Regular Pattern Matching. PhD thesis, University of Pennsylvania, 2005.

[28] M. Y. Levin and B. C. Pierce. Type-based optimization for regular patterns. In First International Workshop on High Performance XML Processing, 2004.

[29] E. Maler. Guide to the W3C XML specification ("XMLspec") DTD, version 2.1. Technical report, W3C Consortium, 1998. http://www.w3.org/XML/1998/06/xmlspec-report-v21.htm.

[30] A. Møller, M. O. Olesen, and M. I. Schwartzbach. Static validation of XSL Transformations. Technical Report RS-05-32, BRICS, Oct. 2005.

Type Checking with XML Schema in XACT

Christian Kirkegaard∗ and Anders Møller
BRICS†, Department of Computer Science
University of Aarhus, Denmark
{ck,amoeller}@brics.dk

∗ Supported by the Carlsberg Foundation contract number 04-0080.
† Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation.

Abstract

XACT is an extension of Java for making type-safe XML transformations. Unlike other approaches, XACT provides a programming model based on XML templates and XPath together with a type checker based on data-flow analysis.

We show how to extend the data-flow analysis technique used in the XACT system to support XML Schema as type formalism. The technique is able to model advanced features, such as type derivations and overloaded local element declarations, and also datatypes of attribute values and character data. Moreover, we introduce optional type annotations to improve modularity of the type checking. The resulting system supports a flexible style of programming XML transformations and provides static guarantees of validity of the generated XML data.

Categories and Subject Descriptors

D.3.3 [Programming Languages]: Language Constructs and Features; D.2.4 [Software Verification]: Validation; I.7.2 [Document and Text Processing]: Computing Methodologies

General Terms

Languages, Design, Verification

Keywords

XML, XML Schema, Java, language design, static analysis

1 Introduction

The overall goal of the XACT project is to integrate XML into general-purpose programming languages, in particular Java, such that programming of XML transformations can become easier and safer than with the existing approaches. Specifically, we aim for a system that supports a high-level and flexible programming style, permits an efficient runtime model, and has the ability to statically guarantee validity of generated XML data.

In previous papers, see [15, 14], we have presented the first steps of our proposal for a system that fulfills these requirements. Our language, XACT, is an extension of Java where XML fragments can be manipulated through a notion of XML templates using XPath for navigation. Static guarantees of validity are provided by a special data-flow analysis that builds on a lattice structure of summary graphs.

The existing XACT system has two significant weaknesses: first, it only supports DTD as schema language, and it is generally agreed that DTD has insufficient expressiveness for modern XML applications; second, the data-flow analysis is a whole-program analysis that has poor modularity properties and hence does not scale well to larger programs. In this paper, we present an approach for attacking these issues.

Contributions

We have previously shown a connection between summary graphs and regular expression types [4, 10]. Also, it is known how regular expression types are related to RELAX NG schemas [7] and how schemas written in XML Schema [22, 2] can be translated into equivalent RELAX NG schemas [12]. We exploit these connections in this paper. Our main contributions are the following:

• We present a translation from XML Schema to summary graphs and an algorithm for validating summary graphs relative to schemas written in XML Schema, all via RELAX NG. This provides the foundation for using XML Schema as type formalism in XACT.

• We introduce optional typing in XACT so that XML template variables can be optionally typed with schema constructs (element names and simple or complex types). We show how this can lead to a validity analysis which is more modular, in the sense that it avoids iterating over the whole program.

Together, these improvements effectively remedy the weaknesses mentioned earlier. Furthermore, the results can be seen as indications of the strength of summary graphs and the use of data-flow analysis for validating XML transformations.

As an additional contribution, we identify a subset of RELAX NG that is sufficient for translation from XML Schema and where language inclusion checking is tractable.

Example

The resulting XACT language can be illustrated by a small toy program that uses the new features. This program converts a list of business cards represented in a special XML language into XHTML, considering only the cards where a phone number is present:

      14 import dk.brics.xact.*; through a list of card elements that have phone children and builds import java.io.*; an XHTML list. The method main loads in an XML document containing a list of business card, invokes the setWrapper method, public class PhoneList { then constructs a complete XHTML document by plugging values static { into the TITLE and MAIN gaps using the makeList method, and fi- String[] ns = {"b", "http://businesscard.org", nally outputs this document. "h", "http://www.w3.org/1999/xhtml", As an example, the program transforms the input "s", "http://www.w3.org/2001/XMLSchema"}; XML.setNamespaceMap(ns); } John Doe XML wrapper; [email protected] (202) 555-1414 void setWrapper(String color) { wrapper = [[ Zacharias Doe [email protected] <[s:string TITLE]> Jack Doe <[s:string TITLE]> [email protected] <[h:Flow MAIN]> [email protected] (202) 456-1414 ]]; }

      XML makeList(XML x) { into an XHTML document that looks as follows: XML r = [[<[CARDS]>]]; XMLIterator i = x.select("//b:card[b:phone]").iterator(); while (i.hasNext()) { XML c = i.next(); r = r.plug("CARDS", [[ <{c.select("b:name/text()")}>, Note that some XML variables in the program are declared by phone: <{c.select("b:phone/text()")}> the type XML, which represents all possible XML templates, and oth- <[CARDS]>]]); ers use a more constrained type, such as, the declaration of wrapper } or the signature of makeList. XACT now allows the programmer to return r; combine these two approaches. The static type checker uses data- } flow analysis to reason about variables that are declared using the former approach, and it conservatively checks that the annotated XML transform(String url) { types are preserved by execution of the program. For this program, XML cardlist = XML.get(url, "b:cardlist"); one consequence is that the makeList method, whose signature is setWrapper("white"); fully annotated, can be type checked separately, and invocations of return wrapper.plug("TITLE", "My Phone List") this method can be type checked without considering its body. (We .plug("MAIN", makeList(cardlist)); } discuss fields and side-effects in Section 7.) Also note that the type checker can now reason about XML Schema types rather that being public static void main(String[] args) { limited to DTD. XML x = new PhoneList().transform(args[0]); System.out.println(x); } Related Work } There are numerous other projects having similar goals as XACT; the paper [19] contains a general survey of different approaches. The general syntax for XML template constants and the meaning of The ones that are most closely related to ours are XJ [9], Cω [1], the methods select, plug, get, and various others are described and XDuce and its descendants [10]. XACT is notably differ- further in Section 2. ent in two ways: first, although variants of XML templates are In the first part of the program, some global namespace decla- widely used in Web application development frameworks, this rations are made. Schemas for these namespaces are supplied ex- paradigm is not supported by other type-safe XML transformation ternally (the schema for the business card XML language is shown languages, which typically allow only bottom-up XML tree con- in Section 3). Then a field wrapper is defined, holding an XML struction; second, the annotation overhead is minimal since schema template that must be an html tree, potentially with TITLE gaps types are only required at input and output, whereas the others re- and MAIN gaps, which may occur in place of fragments of type quire schema type annotations at all XML variable declarations. We string and Flow, respectively (all of appropriate namespaces). believe that both aspects in many cases makes the XACT program- The method setWrapper assigns such an XML template to the ming style more flexible. Furthermore, our data-flow analysis also wrapper field. This template has two gaps named TITLE and one tracks all Java string operations via a separate analysis [6], which named MAIN. Additionally, it has one code gap where the value of enables XACT to reason about validity of attribute values and char- the color parameter is inserted. The method makeList iterates acter data. (In fact, an additional consequence of the extensions

      15 described here is that our static analyzer can also model computed NG, called Restricted RELAX NG, that we will use as an intermedi- names of elements and attributes.) ate language in the program analysis. Then, in Section 4 we intro- With the extensions proposed in this paper, XACT becomes duce a variant of summary graphs. In Sections 5 and 6 we explain closer to XJ [9], which also uses XML Schema as type formalism how schemas written in XML Schema can be converted into sum- and XPath for navigation. Still, our use of optional type annotations mary graphs via Restricted RELAX NG, how to check validity of avoids a problem that can make the XJ type checker too rigid: with summary graphs relative to Restricted RELAX NG schemas, and mandatory type annotations at all variable declarations in XJ it is how these results can be used in XACT to provide static guarantees impossible to type check a sequence of operations that temporarily of XML transformations. In Section 7 we introduce optional typing invalidates data. The types that are involved in XML transforma- using XML Schema constructs and discuss the resulting language tions are often exceedingly complicated and difficult to write down, design. Finally, we present our conclusions in Section 8. and types for intermediate results often do not correspond to named Note that we here report on work in progress, and not all of what constructs in preexisting schemas. The benefits of type annotations we present has yet been implemented and tested in practice so we are that they can serve as documentation in the programs and they cannot at this stage present experimental results. Also, the limited can lead to faster type checking. By now supporting optional anno- space prevents us from going into details of our algorithms and of tations, XACT gets the best from the two worlds. the systems we build upon—instead, this paper aims to present an Moreover, XJ represents XML data as mutable trees, which in- informal overview of our ideas. curs a need for expensive runtime checks to preserve data validity. In XJ, subtyping is nominal, whereas our approach gives semantic (or structural) subtyping. A discussion of subtyping can be found 2 The XACT Programming Language in [8]. Note that although XML Schema does contain mechanisms We begin with a brief overview of the XACT language as it looks be- for declaring subtyping relationships nominally, the choice of sup- fore adding our new extensions. In XACT, XML data is represented porting XML Schema as type formalism in XACT does not force as templates, which are well-formed XML fragments that may con- us to use nominal subtyping. We use schemas only as notation for tain gaps in place of elements or attribute values. A gap is either a defining sets of XML values—the internal structure of the notation name or a piece of code that evaluates to a string or an XML tem- being used is irrelevant. plate. As an example, the following XML template contains four The XDuce language family is based on the notion of regular ex- gaps: two named TITLE, one named MAIN, and one containing the pression types. As mentioned earlier, a connection between regular expression color: expression types and a variant of the summary graphs used in our program analysis is shown in [4]. Also, the formal expressiveness of regular expression types and RELAX NG both correspond to that of regular tree languages. We return to these relations in Sections 5 <[TITLE]> and 6. 
As XACT, the XTATIC language [8], which is one of the descendants of XDuce, incorporates XML into an object-oriented language in an immutable style. <[TITLE]> The Cω language adds XML support to C] by combining struc- <[MAIN]> tural sequences, unions, and products with objects and simple val- ues. The basic features of XML Schema may be encoded in the type system, however little documentation of this is available. Rather The special immutable class XML corresponds to the set of all pos- than use full XPath for navigation in XML trees as in XACT, Cω sible XML templates. The central operations on this class are the uses a reminiscent notion of generalized member access that is following: closer to ordinary programming notation. The paper [18] describes a validity analysis for XSLT transfor- constant: a static method that creates a template from mations, which is also based on summary graphs. The techniques a constant string (the syntax [[foo]] is sugar for we present here for handling XML Schema as type formalism can XML.constant("foo") where quotes, whitespace, and gaps be transferred seamlessly to that analysis. have been transformed); The type annotations we introduce are reminiscent of the notion of programmer–designer contracts proposed in [3]. In both cases, plug: inserts a given string or template into all gaps of a given static declarations constrain how XML templates may be combined name in this template; in the programs. select: returns the sub-templates of this template that are se- The paper [20] contains a useful classification of schema lan- lected by a given XPath expression; guages in terms of categories of tree grammars: DTD corresponds get: a static method that creates a template from a non-constant to local tree grammars where the content model of an element can string and checks (at runtime) that it is valid relative to a given only depend on the name of the element; XML Schema corresponds constant schema type; to the larger category of single-type tree grammars where elements that are siblings and have the same name must have identical con- cast: performs a runtime check of validity of this template rela- tent models; and RELAX NG corresponds to the even more gen- tive to a given constant schema type; eral category of regular tree grammars, which is equivalent to tree analyze: instructs the static type checker to verify that this tem- automata. With our new results, XACT supports single-type tree plate will always be valid relative to a given schema type when grammars as type formalism. the program runs; and toString: converts this template to its textual representation. Overview A schema type is the name of an element (or, with our extension In Sections 2 and 3 we begin by briefly recapitulating the design of from DTD to XML Schema, a simple type or a complex type) that XACT and RELAX NG, and we characterize a subset of RELAX is declared in a schema. The language of a schema type is defined

      16 as the set of XML documents or document fragments that are valid A Restricted RELAX NG schema satisfies the following syntac- relative to the schema type. Note that in this version of XACT, be- tic requirements: fore incorporating the extensions suggested in this paper, schema types appear only at get, cast, and analyze operations. In partic- [single-type grammar] For every element pattern p, any two ular, declarations use the general type XML. element patterns that are top-level-contained by the child of p The primary job of the static type checker is to verify that only and have non-disjoint name classes must have the same (iden- valid XML data can occur at program locations marked by analyze tical) content. (This requirement limits the notation to single- operations, under the assumption that get and cast operations al- type grammars.) ways succeed. (It also checks properties of plug and select oper- ations, which is less relevant here.) [attribute context insensitivity] No attribute list pattern can be a choice pattern. Also, every optional attribute list pattern must have an attribute pattern as child. (This requirement 3 Defining a Subset of RELAX NG prohibits context sensitive attribute patterns.) [interleaved content] Every pattern that has a child that top-level A RELAX NG schema [7] is essentially a top-down tree automa- contains an interleave content pattern must be a group or ton that accepts a set of valid XML trees. It is described by element pattern. Also, a group pattern that top-level con- a grammar consisting of recursively defined patterns of various tains an interleave content pattern must have only one con- kinds, including the following: element matches one element tent pattern child. (This requirement makes it easier to check with a given name and with contents and attributes described by inclusion of interleave patterns, as explained in Section 6.) a sub-pattern; attribute similarly matches an attribute; text matches any character data or attribute value; group, optional, We here consider ref patterns as abbreviations of the patterns be- zeroOrMore, oneOrMore, and choice correspond to concatena- ing referred to. For every element and optional pattern that has tion, zero or one occurrence, zero or more occurrences, one or more more than one child pattern, we treat the children as implicitly en- occurrences, and union, respectively; empty matches the empty se- closed by a group pattern. (Also, all mixed patterns are implicitly quence of nodes; and notAllowed corresponds to the empty lan- desugared to interleave patterns in the usual way.) interleave guage. In addition, the pattern matches all possible Restricted RELAX NG has two important properties: first, it mergings of the sequences that match its sub-patterns. is sufficient for making an exact and simple embedding of XML Note that attributes are described in the same expressions as the Schema; second, it makes the summary graph validation in Sec- content models. Still, attributes are considered unordered, as al- tion 6 more tractable than using XML Schema directly or support- ways in XML, and syntactic restrictions prevent an attribute name ing full RELAX NG. from occurring more than once in any element. Mixing attributes The following schema written in XML Schema may be used to and contents in this way is useful for describing attribute–element describe the input to the example program shown in Section 1: constraints. 
To ensure regularity, there is an important restriction on recursive which can consist of lists of possible names and wildcards that match all names, potentially restricted to a certain namespace or excluding specific names. To describe datatypes more precisely than with the text pattern, part of XML Schema. Using the data pattern, such datatypes can be referred to, and datatype facets can be constrained by a parameter mechanism. Furthermore, RELAX NG contains various modularization mechanisms, which we can ignore here. As all other type-safe XML transformation languages, we also ignore ID and IDREF at- tributes from DTD and the equivalent compatibility features in RE- LAX NG. As mentioned in the introduction, we handle XML Schema via a intermediate language that avoids the many complicated technical NG, which we call Restricted RELAX NG, being characterized as follows. First, we define some terminology that we need. We say that a pattern p top-level-contains a pattern q if p and q are identical or p contains q (as a child or further descendant) where contents of Assuming cardlist as root element name, this can be translated element and attribute patterns are ignored. A content pattern into the following Restricted RELAX NG schema (here using the is a pattern that top-level contains one or more element, data, or compact RELAX NG syntax): text patterns (or list or value patterns, which we otherwise ig- nore here for simplicity). An attribute list pattern is a pattern that default namespace = "http://businesscard.org" top-level contains one or more attribute patterns.

      17 start = element cardlist { card* } XML templates, much like a schema but tailor-made for use in the card = element card { card_type } program analysis [15]. Second, when the fixed point has been com- card_type = element name { xsd:string }, puted, we check that the sets of templates represented by the result- element email { xsd:string }+, ing summary graphs are valid relative to the respective schemas. element phone { xsd:string }? To allow a smooth integration of XML Schema as a replacement for DTD, we slightly modify the definition of summary graphs as The translation from XML Schema to Restricted RELAX NG is explained below and change the summary graph validation algo- exact and the size of the output schema is proportional to the size rithm accordingly and to work with Restricted RELAX NG (the of the input schema. Most XML Schema constructs map directly to old algorithm supported DTD via an embedding into DSD2 [17]). RELAX NG, and we will not here explain the details of the trans- A summary graph, as it is defined in [15], has two parts: one that lation. However, a few points are worth mentioning. is set up for the given program and remains fixed during the iterative First, the all construct maps to the interleave pattern. Be- data-flow analysis, and one that changes monotonically during the cause of the limitations on the use of all in XML Schema, this analysis. does not violate the [interleaved content] requirement. The fixed part contains finite sets of nodes of various kinds: el- Second, we can ignore default declarations since we only care ement nodes (N ), attribute nodes (N ), chardata nodes (N ), and about validation and not of normalization of the input—except that E A C template nodes (N ). These node sets are determined by the use we treat an attribute or content model as optionally absent if a de- T of schemas, template constants, and XML operations in the program. fault is declared. The former three sets represent the possible elements, attributes, Third, wildcards can be converted into name classes. If and chardata sequences that may arise when running the program. processContents of an element wildcard is set to skip, then we The template nodes represent sequences of template gaps, which make a recursive pattern that matches any XML tree. either occur explicitly in template constants or implicitly due to Fourth, the most tricky parts of the translation involve type XML operations or schemas. Additionally, the fixed part speci- derivations and substitution groups. Assume that an element e has fies a number of maps: name assigns a name to each element node type t and there exists a type t0 that is derived by extension from t. and attribute node; attr : N → 2NA associates attribute nodes with In this case, an occurrence of e must match either t or t0, and in the E element nodes; contents : N → N connects element nodes with latter case e must have a special attribute xsi:type with the value E T descriptions of their contents; and gaps : N → G∗ associates a se- t0 (in the former case, the attribute is permitted but not required). T quence of gap names from a finite set G with each template node. We handle this situation by encoding the xsi:type information in The changing part of a summary graph consist of: the element name. More precisely, we create a new element pattern 0 whose name is the name of e followed by the string %t and whose • a set of root nodes R ⊆ N ∪ N ; content corresponds to the definition of t0. 
Each reference to e is E T then replaced by a choice between e and the variants with extended • template edges T ⊆ N × G × (N ∪ NE ∪ NC ); types. The xsi:nil feature is handled similarly. Now assume that T T 0 another element f has type t and is declared as in the substitution • string edges S : NC ∪ NA → REG where REG are all regular group of e. This means that f elements are permitted in place of string languages over the Unicode alphabet; and e elements. In Restricted RELAX NG, this is expressed simply by replacing all references to e elements by choices of e and f ele- • a gap presence map P : G → 2NA ∪NT ×2NA ∪NT ×Γ×Γ where ments. Again, because of limitations on the all construct and the Γ = 2{OPEN,CLOSED}. substitution group mechanism in XML Schema, this cannot lead to violations of the [single-type grammar] requirement, nor of the The language of a summary graph is intuitively the set of XML tem- general RELAX NG requirement that interleave branches must plates that can be obtained by unfolding it, starting from a root node be disjoint. and plugging elements, templates, and strings into gaps according By the translation to Restricted RELAX NG, a schema type cor- to the edges. A template edge (n1,g,n2) ∈ T informally means that responds to a pattern definition: n2 may be plugged into the g gaps in n1, and a string edge S(n) = L means that every string in L may be plugged into the gap in n. The • a simple type corresponds to a pattern, which we call a gap presence map, which we will not explain in further detail here, simple-type pattern, that can only contain the constructs data, is needed during the data-flow analysis to determine where template choice, list, and value; gaps and attribute gaps occur. (For the curious reader, this is all for- malized in [15].) We also define the language of an individual node • a complex type corresponds to a pattern, which we call a n in a summary graph: this is simply the language of the modified complex-type pattern, that consists of a group of two sub- summary graph where R is set to {n}. patterns—one describing a content model and one describing As an example (borrowed from [15]), we can define a sum- attributes; and mary graph whose language is the set of ul lists with zero or more • an element declaration corresponds to an element pattern that li items that each contain a string from some language L. As- , contains a simple-type pattern or a complex-type pattern. sume that the fixed structure is given by NE = {1 4}, NA = 0/, NT = {2,3,5} (where all three are sequence nodes), NC = {6}, We use these observations in Section 6. contents(1) = 2, contents(4) = 5, attr(1) = attr(4) = 0/, name(1) = {ul}, name(4) = {li}, gaps(2) = items, gaps(3) = g · items, and gaps(5) = text. The remaining components are as follows: 4 Summary Graphs in Validity Analysis R = {1} The static type checker in XACT works in two steps. First, a data- T = {(2,items,3),(3,items,3),(3,g,4),(5,text,6)} flow analysis of the whole program is performed, using the standard S(6) = L data-flow analysis framework [11] but with a highly specialized lat- tice structure where abstract values are summary graphs. A sum- (For simplicity, we ignore the gap presence map.) This can be illus- mary graph is a finite representation of a potentially infinite set of trated as follows:

      18 items summary graph that has the same language. In [15], it is shown 1 2 3 4 5 6 items g text how this can be done for DTD schemas; we now present a modified ul items g items li text L algorithm that supports Restricted RELAX NG and then rely on the items items translation from XML Schema to Restricted RELAX NG to map from schema types to patterns. The boxes represent element nodes, rounded boxes are template Intuitively, this translation is straightforward: we may simply nodes, the circle is a chardata node, and the dots represent poten- view summary graphs as a graphical representation of Restricted tially open template gaps. RELAX NG patterns, provided that we ignore the gap presence For a given program, the family of summary graphs forms a component of the summary graphs and the regularity requirement finite-height lattice, which is used in the data-flow analysis. To de- in Restricted RELAX NG. Due to the connection between RELAX termine the regular string languages used in the string edges, we NG and regular expression types, this translation can also be seen use a separate program analysis that provides conservative approx- as a variant of the translation between regular expression types and imations of the possible values of all string expression in the given summary graphs shown in [4]. program [6]. Given a Restricted RELAX NG pattern, we construct a summary We now introduce two small modifications to the definition of graph fragment as follows: summary graphs: • First, we observe that name classes and simple-type patterns 1. We let the name function return a regular set of names, rather all define regular string languages2. Namespaces are han- than a single name. This will be used to more easily model dled by expanding qualified names according to the applicable name classes in Restricted RELAX NG. The definition of un- namespace declarations. folding is generalized accordingly: unfolding an element node n yields an element whose name can be any string in name(n), • For an element pattern, we exploit the syntactic restrictions and similarly for attribute nodes. In case an unfolding leads to described in Section 4. An element pattern generally consists an element with two attributes of the same name, one of them of a name class, a content model, and a collection of attribute is chosen arbitrarily and overrides the other. declarations. Thus, we convert it to an element node e and a To accommodate attribute declarations that have infinite name template node t with contents(e) = t. We define name(e) as classes and are repeated using zeroOrMore or oneOrMore, the regular string language corresponding to the name class. we define the unfolding of an attribute node n where name(n) The attribute declarations are converted recursively into at- is infinite such that it may produce more than one attribute. tribute nodes (as explained below), and attr(e) is set accord- ingly. The content models is converted recursively into a sum- 2. We distinguish between two kinds of template nodes: se- mary graph fragment rooted by t. quence nodes and interleave nodes. The former have the meaning of the old template nodes; the latter will be used to • An attribute pattern is converted into an attribute node a. model interleave patterns. We define the unfolding of an We define name(a) in the same way as for element patterns, interleave node as all possible interleavings of the unfoldings and S(a) is set to the regular string language corresponding to of its gaps. 
the sub-pattern describing the attribute values. If the attribute is declared as optional using the optional pattern, the gap The data-flow transfer functions for operations remain as ex- presence map is set to record this (as in [15]). plained in [15] with only negligible changes as consequence of the modifications of the summary graph definition, the only exceptions • For patterns describing content models of elements, the pat- being the ones we address in the following section. terns text, group, optional, zeroOrMore, oneOrMore, Reflecting the [interleaved content] requirement in Restricted choice, and empty are handled exactly as the equivalent con- RELAX NG, interleave nodes never appear nested within content structs in DTD content model definitions in the way explained 1 model descriptions . The translation from Restricted RELAX NG in [15]. Intuitively, each pattern corresponds to a tiny sum- to summary graphs presented in the next section and the transfer mary graph fragment that unfolds to the same language. A functions maintain this property of interleave nodes as an invariant. data pattern becomes a chardata node s where S(s) is the cor- With the generalization of the name function, we can in fact now responding regular string language. The interleave pattern easily model computed names of elements and attributes—provided is translated in the same way as group, except that an inter- that we add operations for this in the XML class, of course, and we leave node is used instead of a sequence node. leave that to future work. • Finally, the notAllowed pattern can be modeled as a template node t where gaps(t) = g for some gap name g and t has no 5 A Translation from Restricted RELAX NG outgoing template edges. to Summary Graphs The set of root nodes R contains the single node that corresponds to To define the transfer functions for the operations get and cast, the whole pattern being translated. Recursion in pattern definitions we need an algorithm for translating the given schema type into a simply results in loops in the summary graph. The constructs from RELAX NG that we have omitted in the description in Section 3 1 To state this more precisely, we first define that a node A top- can be handled in a similar way as those mentioned here. Note that level-contains a node B if A and B are identical or B is reachable the translation is exact: the language of the pattern is the same as from A where contents of element nodes and attribute nodes are ig- the language of the resulting summary graph. nored, and a content node is a node that top-level-contains at least As an example, translating the pattern one element node or chardata node. We now require the following: every node that has a child that top-level contains an interleave con- 2We here ignore a few constraining facets that may be used on tent node must be a sequence or element node, and a sequence node the datatypes float and double. These are uncommon cases that that top-level contains an interleave content node must have only can be accommodated for without losing precision by slightly aug- one content node child. menting the definition of string edges.

      19 element ul { element li { xsd:integer }* } in Ln also occur in one of the sub-patterns. If n is an interleave node, we use a generalized product construction to check inclusion results in the summary graph shown in Section 4, assuming that L (specifically, the shuffleSubsetOf operation in [21]). is the language of strings that match xsd:integer. To avoid redundant computations (and to ensure termination, in case of loops in the summary graph or recursive definitions in the schema) we apply memoization such that a given pair (n, p) is only 6 Validating Summary Graphs processed once. If a loop is detected, we can coinductively assume that the inclusion holds. When the data-flow analysis has computed a summary graph for With this algorithm, we check for each root node n ∈ R that its each XML expression in the XACT program, we check for each language is included in the language of the pattern corresponding analyze operation that the language of its summary graph is in- to the given schema type. cluded in the language of the specified schema type. If the check As an example of the case with an element node and an element fails, appropriate validity warnings are emitted. The entire analysis pattern, let n be element node 1 in the summary graph from Sec- is sound: if no validity warnings show up, the programmer can be tion 4 and let p be the pattern shown in Section 5: sure that, at runtime, the XML values that appear at the program points marked by analyze operations will be valid relative to the p = element ul { element li { xsd:integer }* } given schema types. The old summary graph analyzer used in XACT is described in The context-free grammar for the contents of Ln has the following [5]. That algorithm, which supports DTD through an embedding productions (where N2 is the start nonterminal and N4 is the only into DSD2, as mentioned earlier, has proven successful in practice. terminal): We here describe a variant that works with Restricted RELAX NG items instead of DSD2. N2 → N2 items Given a summary graph node n ∈ N ∪ N and a Restricted RE- N2 → N3 | ε E T g items LAX NG pattern p where p is an element pattern, a simple-type N3 → N N g 3 3 pattern, or a complex-type pattern (as defined in Section 3), we wish N3 → N4 to determine whether the language of n is included in the language items N3 → N3 | ε of p. We begin by considering the case where n is not an interleave This grammar is linear, so the regular approximation is not applied. node and p is not an interleave pattern. First, a context-free The pattern p contains a single sub-pattern grammar C is constructed from the part of the summary graph that is top-level contained by n, considering element and chardata nodes p0 = element li { xsd:integer } as terminals, template nodes as nonterminals, and ignoring attribute nodes. Each chardata node terminal c is then replaced by a regular and by recursively comparing node 4 and p0 we find out that the grammar equivalent to S(c). If C is not linear, we apply a regu- language of node 4 is included in the language of p0. We now see lar over-approximation [16] (which we also use in [6]). Thus, we that Ln ⊆ Lp, so we conclude that the language of element node 1 have a regular string language Ln over element nodes and Unicode is in fact included in the language of the pattern. characters that describes the possible unfoldings of n (ignoring at- With the exception of the regular approximation of the context- tributes). 
Similarly, p defines a regular string language Lp over free grammars mentioned above, the inclusion check is exact. Also, element patterns and Unicode characters. To obtain a common vo- since the schemas already define only regular languages, the ap- 0 0 cabulary, we now replace each element node n in Ln by hname(n )i proximation can only cause a loss of precision if the XML transfor- (where h and i are some otherwise unused characters), and similarly mation defined by the XACT program introduces non-regularity in for the element patterns in Lp. Then, we check that Ln is included the summary graphs, and our experience from [15] and [5] indicate in Lp with standard techniques for regular string languages. (This that this rarely results in false errors. In particular, the trivial iden- works because of the restriction to single-type grammars.) If this tity function, which inputs XML data using get with some schema check fails, a suitable validity error message is generated. Other- type and immediately after applies analyze with the same schema 0 0 wise, for each pair (n , p ) of an element node in Ln and an element type, is guaranteed to type check without warnings for any schema 0 0 pattern in Lp where name(n ) and name(p ) are non-disjoint, we type. Moreover, we could replace the approximation by an algo- perform two checks. First, we check recursively that the language rithm that checks inclusion of a context-free language in a regular of contents(n0) is included in the language of the content model of language, if full precision is considered more important than per- p0. Second, we check that the attributes of n0 match those of p0: formance. for each attribute node a ∈ attr(n0), each name x ∈ name(a), and An obvious alternative approach to the algorithm explained each value y ∈ S(a), a corresponding attribute pattern must oc- above would be to exploit the connection with regular expression cur in p0—that is, one where x is in the language of its name class types and apply the results from the XDuce project for checking and y is in the language of its sub-pattern; also, attribute pat- subtyping between general regular expression types [10] or to build terns occurring in p0 that are not enclosed by optional patterns on Antimirov’s algorithm as in the XOBE project [13]. Our main must correspond to one of the non-optional attribute nodes. Again, argument for choosing the algorithm explained above is that it has a suitable validity error message is generated if the check fails. been shown earlier that this approach is efficient for XACT pro- For interleave nodes and interleave patterns, we exploit the grams. Also, unlike [23], our algorithm behaves much like existing restriction on these constructs: they cannot appear nested within XML Schema validators, but validating summary graphs instead of content model descriptions. Additionally, in RELAX NG, the sub- individual XML documents. Still, the relation between these differ- patterns of an interleave pattern must be disjoint (that is, no el- ent inclusion checking algorithms is worth a further investigation. ement name or text pattern occurs in more than one sub-pattern). 
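Continuing in the same illustrative style, and reusing the node classes from the sketch above, the following skeleton shows the shape of the memoized, coinductive inclusion check described in this section; RngPattern and the helper methods marked as assumed stand in for the Restricted RELAX NG representation and the automata-based language checks, and none of this is real XACT API.

import java.util.*;

// Skeleton of the inclusion check: is the language of summary-graph node n
// included in the language of Restricted RELAX NG pattern p? The regular
// approximation, name-class tests, and attribute checks are hidden behind
// assumed helper methods with trivial placeholder bodies.
class RngPattern {
    int id;
    RngPattern contentModel() { return this; }  // placeholder accessor
}

class InclusionChecker {
    private final Set<String> inProgress = new HashSet<>();
    private final Map<String, Boolean> memo = new HashMap<>();

    boolean included(SGNode n, RngPattern p) {
        String key = n.id + "#" + p.id;
        Boolean cached = memo.get(key);
        if (cached != null) return cached;
        if (!inProgress.add(key)) return true;         // loop detected: coinductively assume inclusion
        boolean ok = contentLanguageSubset(n, p);       // Ln included in Lp over element-name symbols (assumed)
        if (ok) {
            for (ElementNode child : elementNodesIn(n))       // element nodes top-level-contained by n (assumed)
                for (RngPattern q : elementPatternsIn(p))     // element patterns occurring in Lp (assumed)
                    if (namesNonDisjoint(child, q))           // name-class overlap test (assumed)
                        ok = ok && included(child.contents, q.contentModel())
                                && attributesMatch(child, q); // attribute check (assumed)
        }
        inProgress.remove(key);
        memo.put(key, ok);
        return ok;
    }

    // Placeholders for the automata-based machinery described in the text.
    boolean contentLanguageSubset(SGNode n, RngPattern p) { return true; }
    boolean namesNonDisjoint(ElementNode e, RngPattern q) { return false; }
    boolean attributesMatch(ElementNode e, RngPattern q)  { return true; }
    List<ElementNode> elementNodesIn(SGNode n)            { return Collections.emptyList(); }
    List<RngPattern> elementPatternsIn(RngPattern p)      { return Collections.emptyList(); }
}

With the check over the root nodes in R added on top, this is the step that produces the validity warnings at analyze operations.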
As an interesting side-effect of our approach, we get an inclu- Thus, if p is an interleave pattern, we simply test each sub- sion checker for Restricted RELAX NG and hence also for XML pattern in turn, projecting Ln onto the element names occurring in Schema and DTD: given two schemas, S1 and S2, convert S1 to a the sub-pattern, and then check that all element names occurring summary graph SG using the algorithm described in Section 5 and

      20 then apply the algorithm presented above on SG and S2. (Alter- with the declared return type, and for method invocations every ac- natively, the algorithm presented above could be modified to work tual parameter value is compared with the corresponding declared directly with Restricted RELAX NG schemas instead of summary parameter type. Moreover, every plug operation must respect gap graphs.) Preliminary results indicate that our approach is efficient: tags, that is, the value being plugged in to a gap g must belong to on a standard PC, our implementation finds in a few seconds the the language of the tag of g. elements in XHTML 1.0 Transitional that are invalid according to The following describes a modification of our existing static pro- XHTML 1.0 Strict (and conversely, it reveals that Strict does not gram analysis to support checking of the extra constraints intro- imply Transitional, to our surprise). For schemas that go beyond lo- duced by annotations. cal tree grammars and use type derivations and all model groups, First, the abstract representation of sets of XML templates is we observe a similarly acceptable performance. Moreover, the val- extended to also keep track of the declared schema types of gap idator provides precise error messages in case validation fails. names. For a given XACT program, we let T denote the finite set of As an interesting bonus feature, our validator can trivially be ex- all types mentioned by gap annotations in template constants, and tended to precisely check element prohibitions (for example, that we introduce a new summary graph component D : G → T map- form elements must not contain form elements in XHTML): in ping gap names to their declared type. The language of a summary XACT, we already have a technique for evaluating XPath loca- graph is not affected by this change. tion paths on summary graphs, and element prohibitions can be This leads to extending the data-flow transfer function for the expressed as (simple) XPath location paths. constant operation to generate a summary graph with mappings D(g) = T for every gap occurring in the given XML tem- plate constant. (A simple syntactical check ensures that in each 7 Optional Type Annotations template constant all gaps of the same name are declared with iden- tical schema types.) The transfer function for the plug operation We will now extend XACT with optional type annotations such simply unions the D mappings of its arguments. (Conflicts are that programmers may declare the intended schema types for XML avoided by a check mentioned below.) All other transfer functions template variables, method parameters, and return values. Besides act as the identity on the new D component. being useful as in-lined documentation of programmer intentions, To ensure type consistency of variables declared with annotated type annotations can lead to better modularity properties of the va- XML types, we must validate all assignments to such variables. We lidity analysis. check, using the validation algorithm described in Section 6, that Every XML type may now optionally be annotated in the follow- the language of the inferred summary graph for the right-hand ,..., ,..., ing way where S and T1 Tn are schema types and g1 gn are side of an assignment is a subset of the language permitted by the gap names: schema type annotation. However, this inclusion check is modi- fied to treat gaps as if they were plugged with values correspond- XML ing to their declared types. 
More precisely, for every gap g in the inferred summary graph we apply the algorithm described in Sec- The semantics of an annotated type is the language described by tion 5 to construct a summary graph fragment SGg corresponding S under the assumption that every occurrence of gap gi has been to the schema type D(g) and then add template edges from all oc- plugged with a value in the language of schema type T . i currences of g to the roots of SGg. In XML template constants, every template gap must now have To ensure type consistency of template gaps, we perform an ad- the form <[T g]>, where T is a schema type and g is the gap name. ditional check of every x.plug(g,y) operation using the summary This allows us to, at runtime, tag each gap g in an XML template graphs SGx and SGy inferred by the data-flow analysis for x and y, with a schema type. respectively. First, we check that the language of SGy is a subset of In gap annotations in XML declarations and template constants, the language of Dx(g) declared for g in SGx using the inclusion al- we permit Kleene star of a schema type, T*, meaning that the gap gorithm presented in Section 6. Then, we check that all gap names can be filled with a sequence of values from the language of T. h occurring in both SGx and SGy are declared with identical types, Kleene star annotations are occasionally needed because we cannot that is, Dx(h) = Dy(h). always find existing schema types for sequences of values. As an As a product of the guaranteed type consistency of variables de- example, the XML Schema description of XHTML has no named clared with annotated XML types, reading from a variable can now content type describing a sequence of li elements. Theoretically, use the declared type instead of the inferred one. More precisely, we could permit type annotations to be arbitrary regular expressions for every read from an XML typed variable x we normally use an over schema types or even small inlined XML Schema fragments, inferred summary graph to describe the set of possible template but we have not yet observed the need for this. values at that program point, but now, since all assignments to x Every assignment of an XML template v to a variable x whose have already been checked for validity with respect to the declared type annotation is t = S[T1 g1,...,Tn gn] must, at runtime, satisfy schema type for x, we can instead apply the algorithm from Sec- three constraints: tion 5 to obtain the summary graph corresponding to the declared schema type. • All gaps occurring in v must be declared in t. Note that the support for type annotations leads to a program- • For every gap g occurring in v, the language of its type tag ming style where the explicit analyze operation is rarely needed— must be included in the language of the schema type for g as instead, one may request a static type check by assigning to an an- declared in t. notated variable. This is the style required in other XML transfor- mation languages. • The value v must, under the assumption that all gaps were It is well-known that type annotations in programming languages plugged according to their type tags, belong to the language enable more modular type checking. A component, whose interface of S. is fully annotated, can be type checked independently of its context, and type checking the context can be performed without consider- We put similar constraints on return statements and method invoca- ing the body of the component. 
In our setting, this, for example, tions, except that for return statements the return value is compared corresponds to methods where all XML typed parameters and return

types are annotated, and further, every non-local assignment and read within the method body involves fields declared with annotated types (the latter to constrain side-effects through field variables). As discussed in Section 1, annotations also have drawbacks; in XACT, however, type annotations are optional. This allows the programmer to mix annotated and unannotated XML types to get the best from both worlds.

8 Conclusion

We have presented an approach for generalizing the XACT system to support XML Schema as type formalism and permit optional type annotations. Compared with other programming languages for type-safe XML transformations, type annotations are permitted but not mandatory, which allows the programmer to balance between the pros and cons of type annotations.

The extension to XML Schema takes advantage of connections between XML Schema, RELAX NG, and summary graphs. In particular, it involves a tractable subset of RELAX NG that we use as an intermediate language in the static analysis.

The ideas presented in this paper will become available in the next version of the XACT implementation.

References

[1] Gavin Bierman, Erik Meijer, and Wolfram Schulte. The essence of data access in Cω. In Proc. 19th European Conference on Object-Oriented Programming, ECOOP '05, volume 3586 of LNCS. Springer-Verlag, July 2005.

[2] Paul V. Biron and Ashok Malhotra. XML Schema part 2: Datatypes second edition, October 2004. W3C Recommendation. http://www.w3.org/TR/xmlschema-2/.

[3] Henning Böttger, Anders Møller, and Michael I. Schwartzbach. Contracts for cooperation between Web service programmers and HTML designers. Journal of Web Engineering, 5(1), 2006.

[4] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Static analysis for dynamic XML. Technical Report RS-02-24, BRICS, May 2002. Presented at Programming Language Technologies for XML, PLAN-X '02.

[5] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Extending Java for high-level Web service construction. ACM Transactions on Programming Languages and Systems, 25(6):814–875, 2003.

[6] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Precise analysis of string expressions. In Proc. 10th International Static Analysis Symposium, SAS '03, volume 2694 of LNCS, pages 1–18. Springer-Verlag, June 2003.

[7] James Clark and Makoto Murata. RELAX NG specification, December 2001. OASIS. http://www.oasis-open.org/committees/relax-ng/.

[8] Vladimir Gapeyev, Michael Y. Levin, Benjamin C. Pierce, and Alan Schmitt. The Xtatic experience. Technical Report MS-CIS-04-24, University of Pennsylvania, October 2004. Presented at Programming Language Technologies for XML, PLAN-X '05.

[9] Matthew Harren, Mukund Raghavachari, Oded Shmueli, Michael G. Burke, Rajesh Bordawekar, Igor Pechtchanski, and Vivek Sarkar. XJ: Facilitating XML processing in Java. In Proc. 14th International Conference on World Wide Web, WWW '05, pages 278–287. ACM, May 2005.

[10] Haruo Hosoya and Benjamin C. Pierce. XDuce: A statically typed XML processing language. ACM Transactions on Internet Technology, 3(2):117–148, 2003.

[11] John B. Kam and Jeffrey D. Ullman. Monotone data flow analysis frameworks. Acta Informatica, 7:305–317, 1977. Springer-Verlag.

[12] Kohsuke Kawaguchi. Sun RELAX NG Converter, April 2003. http://www.sun.com/software/xml/developers/relaxngconverter/.

[13] Martin Kempa and Volker Linnemann. Type checking in XOBE. In Proc. Datenbanksysteme für Business, Technologie und Web, BTW '03, volume 26 of LNI, February 2003.

[14] Christian Kirkegaard, Aske Simon Christensen, and Anders Møller. A runtime system for XML transformations in Java. In Proc. Second International XML Database Symposium, XSym '04, volume 3186 of LNCS. Springer-Verlag, August 2004.

[15] Christian Kirkegaard, Anders Møller, and Michael I. Schwartzbach. Static analysis of XML transformations in Java. IEEE Transactions on Software Engineering, 30(3):181–192, March 2004.

[16] Mehryar Mohri and Mark-Jan Nederhof. Robustness in Language and Speech Technology, chapter 9: Regular Approximation of Context-Free Grammars through Transformation. Kluwer Academic Publishers, 2001.

[17] Anders Møller. Document Structure Description 2.0, December 2002. BRICS, Department of Computer Science, University of Aarhus, Notes Series NS-02-7. Available from http://www.brics.dk/DSD/.

[18] Anders Møller, Mads Østerby Olesen, and Michael I. Schwartzbach. Static validation of XSL Transformations. Technical Report RS-05-32, BRICS, 2005.

[19] Anders Møller and Michael I. Schwartzbach. The design space of type checkers for XML transformation languages. In Proc. Tenth International Conference on Database Theory, ICDT '05, volume 3363 of LNCS, pages 17–36. Springer-Verlag, January 2005.

[20] Makoto Murata, Dongwon Lee, and Murali Mani. Taxonomy of XML schema languages using formal language theory. In Proc. Extreme Markup Languages, August 2001.

[21] Anders Møller. dk.brics.automaton – finite-state automata and regular expressions for Java, 2005. http://www.brics.dk/automaton/.

[22] Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. XML Schema part 1: Structures second edition, October 2004. W3C Recommendation. http://www.w3.org/TR/xmlschema-1/.

      22 [23] Akihiko Tozawa and Masami Hagiya. XML Schema contain- ment checking based on semi-implicit techniques. In Proc. 8th International Conference on Implementation and Appli- cation of Automata, CIAA ’03, volume 2759 of LNCS, July 2003.

      23 PADX : Querying Large-scale Ad Hoc Data with XQuery

      ∗ Mary Fernandez´ Robert Gruber Yitzhak Mandelbaum Kathleen Fisher Google Princeton University AT&T Labs Research [email protected] [email protected] {mff,kfisher}@research.att.com

      Name :Use Representation to roll their own tools, leading to scenarios such as the following. Web server logs (CLF): Fixed-column ASCII records An analyst receives a new ad hoc data source containing poten- Measure web workloads tially interesting information and a list of pressing questions about AT&T provisioning data: Variable-width ASCII records that data. Could she please provide the answers to the questions Monitor service activation as quickly as possible, preferably last week? The accompanying Call detail: Fraud detection Fixed-width binary records documentation is outdated and missing important information, so AT&T billing data: Various Cobol data formats she first has to experiment with the data to discover its structure. Monitor billing process Eventually, she understands the data well enough to hand-code a Netflow: Data-dependent number of parser, usually in C or PERL. Pressed for time, she interleaves code Monitor network performance fixed-width binary records to compute the answers to the supplied questions with the parser. Newick: Immune Fixed-width ASCII records As soon as the answers are computed, she gets a new data source system response simulation in tree-shaped hierarchy and a new set of questions to answer. Gene Ontology: Variable-width ASCII records Gene-gene correlations in DAG-shaped hierarchy Through her heroic efforts, the data analyst answered the neces- CPT codes: Medical diagnoses Floating point numbers sary questions, but the approach is deficient in many respects. The SnowMed: Medical clinic notes Keyword tags analyst’s hard-won understanding of the data ended up embedded Figure 1. Selected ad hoc data sources. in a hand-written parser, where it is difficult for others to benefit from her understanding. The parser is likely to be brittle with re- spect to changes in the input sources. Consider, for example, how Abstract tricky it is to figure out which $3’s should be $4’s in a PERL parser when a new column appears in the data. Errors in the data also pose This paper describes our experience designing and implementing a significant challenge in hand-coded parsers. If the data analyst thoroughly checks for errors, then the error checking code dom- PADX, a system for querying large-scale ad hoc data sources with inates the parser, making it even more difficult to understand the XQuery. PADX is the synthesis and extension of two existing sys- semantics of the data format. If she is not thorough, then erroneous tems: PADS and Galax. With PADX, an analyst writes a declarative data description of the physical layout of her ad hoc data, and the data can escape undetected, potentially (silently!) corrupting down- stream processing. Finally, during the initial data exploration and PADS compiler produces customizable libraries for parsing the data and for viewing it as XML. The resulting library is linked with an in answering the specified questions, the analyst had to code how to XQuery engine, permitting the analyst to view and query her ad hoc compute the questions rather than being able to express the queries data sources using XQuery. in a declarative fashion. Of course, many of these pitfalls can be avoided with careful design and sufficient time, but such luxuries are not available to the analyst. However, with the appropriate tool 1 Introduction support, many aspects of this process can be greatly simplified.

      Although enormous amounts of data exist in “well-behaved” for- We have two tools, PADS [2, 8] and Galax [1, 7], each of which mats such as XML and relational databases, massive amounts also addresses aspects of the analyst’s problem in isolation. The PADS exist in non-standard or ad hoc data formats. Figure 1 gives some system allows analysts to describe ad hoc data sources declara- sense of the range and pervasiveness of such data. Ad hoc data tively and then generates error-aware parsers and tools for ma- comes in many forms: ASCII, binary, EBCDIC, and mixed for- nipulating the sources, including statistical profiling tools. Such mats. It can be fixed-width, fixed-column, variable-width, or even support allows the analyst to produce a robust, error-aware parser tree-structured. It is often quite large, including some data sources quickly. The Galax system supports declarative querying of XML that generate over a gigabit per second [6]. It frequently comes with via XQuery. If Galax could be applied to ad hoc data, it would al- incomplete and/or out-of-date documentation, and there are almost low the analyst first to explore the data and then to produce answers always errors in the data. Sometimes these errors are the most in- to her questions. teresting aspect of the data, e.g., in log files where errors indicate that something is going wrong in the associated system. In this work, we strive to integrate PADS and Galax to solve the an- alyst’s data-management problems for the large ad hoc data sources The lack of standard tools for processing ad hoc data forces analysts that we have seen in practice. One approach would be to have PADS ∗ produce a tool for converting ad hoc data to XML and then ap- Work carried out while at AT&T Labs Research.

      24 Ad Hoc Data Data Description can contain more than 2.2GB of data per week. Description Compiler The summaries store the processing date and one record per order. Each order record contains a header followed by a nested sequence Generated of events. The header has 13 pipe separated fields: the order num- Data Parsing & ber, AT&T’s internal order number, the order version, four different XML Viewing telephone numbers associated with the order, the zip code, a billing Ad Hoc Data Libraries identifier, the order type, a measure of the complexity of the order, Queries PADX Query Corral an unused field, and the source of the order data. Many of these Query Results fields are optional, in which case nothing appears between the pipe Figure 2. Data analyst’s view of PADX characters. The billing identifier may not be available at the time of processing, in which case the system generates a unique identi- fier, and prefixes this value with the string “no ii” to indicate the number was generated. The event sequence represents the various ply Galax to the resulting document. (In fact, PADS provides this states a service order goes through; it is represented as a new-line ability.) However, the typical factor of eight space blow up in this terminated, pipe separated list of state, timestamp pairs. There are conversion yields an unacceptable slowdown in performance. Con- over 400 distinct states that an order may go through during provi- sequently, we chose to design and implement PADX1, a synthesis sioning. It may be apparent from this description that English is a and extension of PADS and Galax. Figure 2 depicts PADX from the poor language for describing data formats! analyst’s perspective. The analyst provides a PADS description of her ad hoc source, which is compiled into a library of components The analyst’s first task is to write a parser for the Sirius data format. for parsing her data and for viewing and querying it as XML. The Like many ad hoc data sources, Sirius data can contain unexpected resulting libraries are linked together with the PADS and Galax run- or corrupted values, so the parser must handle errors robustly to time systems into one PADX query executable, called a “query cor- ral.2” At query time, the analyst provides her ad hoc data sources avoid corrupting the results of analyses. With PADS, the analyst writes a declarative data description of the physical layout of her and her query written in XQuery, and PADX produces the query’s results. data. The language also permits the analyst to describe expected semantic properties of her data so that deviations can be flagged as errors. The intent is to allow an analyst to capture in a PADS Building PADX presented several problems. The first was semantic: We had to decide how to view ad hoc data as XML and how to description all that she knows about a given data source. express this view as a mapping from the PADS type system to XML Schema, the basis of XQuery’s type system. A second problem Figure 4 gives the PADS description for the Sirius data format. In PADS descriptions, types are declared before they are used, so the involved systems design and engineering. Building PADX required type that describes the entire data source, summary_t, appears at evolving PADS and Galax in parallel, modifying the implementation of Galax to support an abstract data model so that Galax could view the bottom of the description (Line 42). In the next section, we use this example to give an overview of the PADS language. 
Here, non-XML sources as XML, and augmenting PADS with the ability to generate concrete instances of this data model. Our solutions to we simply note that the data analyst writes this description, and these problems, which were necessary to build a working system, the PADS compiler produces customizable C libraries and tools for are described in Sections 3 and 4. A third problem involves the scale parsing, manipulating, and summarizing the data. The fact that use- of data and efficiency of queries, in particular, how to efficiently ful software artifacts are generated from PADS descriptions provides strong incentive for keeping the descriptions current, allowing them evaluate complex queries over large sources. Section 5 describes to serve as living documentation. how PADX currently handles large sources and the problems that we face with respect to data scale and query performance. Analysts working with ad hoc data often want to query their data. Questions posed by the Sirius analyst include “Select all orders We begin with a more detailed account of a scenario that illustrates starting within a certain time window,” “Count the number of orders the data management tasks faced by AT&T data analysts and how going through a particular state,” and “What is the average time re- PADX simplifies these tasks. We then crack open the PADX architec- quired to go from a particular event state to another particular event ture, first describing PADS and Galax in isolation, and then describ- state”. Such queries are useful for rapid information discovery and ing our solutions to the problems described above. We conclude for vetting errors and anomalies in data before that data proceeds to with related work and a discussion of open problems. a down-stream process or is loaded into a database.

      1.1 Data-management scenario With PADX, the analyst writes declarative XQuery expressions to query her ad hoc data source. Because XQuery is designed to ma- In the telecommunications industry, the term provisioning refers to nipulate semi-structured data, its expressiveness matches ad hoc the process of converting an order for phone service into the ac- data sources well. As a Turing-complete language, XQuery is pow- tual service. This process is complex, involving many interactions erful enough to express all the questions above. For example, Fig- with other companies. To discover potential problems proactively, ure 5 contains an XQuery expression that produces all orders that the Sirius project tracks AT&T’s provisioning process by compil- started in October, 2004. In Section 4, we discuss in more detail ing weekly summaries of the state of certain types of phone service why XQuery is an appropriate query language for ad hoc data. One orders. These summaries, which are stored in flat ASCII text files, benefit is that XQuery queries may be statically typed, which helps detect common errors at compile time. For example, static typing 1Pronounced “paddocks”, an enclosed area for exercising race would raise an error if the path expression in Figure 5 referred to horses. ordesr instead of orders, or if the analyst erroneously compared 2The equestrian metaphor is intentional: Getting these systems the timestamp value in tstamp to a string. to work together is like corralling race horses!

      25 0|15/Oct/2004:18:46:51 9152|9152|1|9735551212|0||9085551212|07988|no_ii152272|EDTF_6|0|APRL1|DUO|10|16/Oct/2004:10:02:10 9153|9153|1|0|0|0|0||152268|LOC_6|0|FRDW1|DUO|LOC_CRTE|1001476800|LOC_OS_10|17/Oct/2004:08:14:21 Figure 3. Tiny example of Sirius provisioning data.

      1. Precord Pstruct summary_header_t { 2UsingPADS to Access Ad Hoc Data 2. "0|"; 3. Punixtime tstamp; In this section, we give a brief overview of PADS, focusing on its 4. }; data description language and the portions of the libraries it gen- erates that are relevant to PADX. More information about PADS is 5. Pstruct no_ramp_t { available [2, 8]. 6. "no_ii"; 7. Puint64 id; 8. }; 2.1 PADS: The language

      9. Punion dib_ramp_t { A PADS specification describes the physical layout and semantic 10. Pint64 ramp; properties of an ad hoc data source. The language provides a 11. no_ramp_t genRamp; type-based model: basic types specify atomic data such as inte- 12. }; gers, strings, dates, etc., while structured types describe compound data built from simpler pieces. The PADS library provides a collec- 13. Pstruct order_header_t { 14. Puint32 order_num; tion of useful base types. Examples include 8-bit signed integers 15. ’|’; Puint32 att_order_num; (Pint8), 32-bit unsigned integers (Puint32), IP addresses (Pip), 16. ’|’; Puint32 ord_version; dates (Pdate), and strings (Pstring). By themselves, these base 17. ’|’; Popt pn_t service_tn; types do not provide sufficient information for parsing because they 18. ’|’; Popt pn_t billing_tn; do not indicate how the data is coded, i.e., in ASCII, EBCDIC, or 19. ’|’; Popt pn_t nlp_service_tn; binary. To resolve this ambiguity, PADS uses the ambient coding. 20. ’|’; Popt pn_t nlp_billing_tn; By default, the ambient coding is ASCII, but programmers can cus- 21. ’|’; Popt Pzip zip_code; tomize it as appropriate. 22. ’|’; dib_ramp_t ramp; 23. ’|’; Pstring(:’|’:) order_type; 24. ’|’; Puint32 order_details; To describe more complex data, PADS provides a collection of struc- 25. ’|’; Pstring(:’|’:) unused; tured types loosely based on C’s type structure. In particular, PADS 26. ’|’; Pstring(:’|’:) stream; has Pstructs, Punions, and Parrays to describe record-like 27. }; structures, alternatives, and sequences, respectively. Penumsde- scribe a fixed collection of literals, while Popts provide convenient 28. Pstruct event_t { syntax for optional data. A type may have an associated predicate 29. Pstring(:’|’:) state; that determines whether a parsed value is indeed a legal value for 30. ’|’; Punixtime tstamp; the type. For example, a predicate might require that one field of 31. }; a Pstruct is bigger than another or that the elements of a se- 32. Parray event_seq_t { quence are sorted. Programmers can specify such predicates using 33. event_t[] : Psep(’|’) && Pterm(Peor); PADS expressions and functions, written in a C-like syntax. Finally, 34. }; PADS Ptypedefs allow programmers to define new types that add further constraints to existing types. 35. Precord Pstruct order_t { 36. order_header_t order_header; PADS types can be parameterized by values. This mechanism re- 37. ’|’; event_seq_t events; duces the number of base types and permits the format and proper- 38. }; ties of later portions of the data to depend upon earlier portions. For example, the base type Puint16_FW(:3:) specifies an unsigned 39. Parray orders_t { 40. order_t[]; two byte integer physically represented by exactly three characters, 41. }; while the type Pstring(:’|’:) (e.g., Line 29) describes a string terminated by a vertical bar. Parameters can be used with compound 42. Psource Pstruct summary_t{ types to specify the size of an array or the appropriate branch of a 43. summary_header_t summary_header; union. 44. orders_t orders; 45. }; Pstructs describe ordered sequences of data with unrelated types. In Figure 4, the type declaration for the Pstruct order_t Figure 4. PADS description for Sirius provisioning data. (Lines 35Ð38) contains an order header (order_header_t)fol- lowed by the literal character ’|’, followed by an event sequence (event_seq_t). PADS supports character, string, and regular ex- (: Return orders started in October 2004 :) pression literals. 
$pads/Psource/orders/elt[events/elt[1] [tstamp/rep >= xs:dateTime("2004-10-01:00:00:00") Punions describe alternatives in the data format. For example, and tstamp/rep < xs:dateTime("2004-11-01:00:00:00")]] the dib_ramp_t type (Lines 9Ð12) indicates that the ramp field in a Sirius record can be either a Puint_64 or a string "no_ii" followed Figure 5. Query applied to Sirius provisioning data. by a Puint_64. During parsing, the branches of a Punion are tried in order; the first branch that parses without error is taken.

      26 The order_header_t type (Lines 13Ð27) contains several anony- resentation it should fill in. This control allows the description- mous uses of the Popt type. This type is syntactic sugar for a writer to specify all known constraints about the data without wor- stylized use of a Punion with two branches: the first with the in- rying about the run-time cost of verifying potentially expensive dicated type, and the second with the “void” type, which always constraints for time-critical applications. matches but never consumes any input. Appropriate error-handling is as important as processing error-free PADS provides Parrays to describe varying-length sequences of data. The parse descriptor marks which portions of the data con- data all with the same type. The event_seq_t type (Lines 32Ð tain errors and specifies the detected errors. Depending upon the 34) uses a Parray to characterize the sequence of events an or- nature of the errors and the desired application, programmers can der goes through during processing. This declaration indicates that take the appropriate action: halt the program, discard parts of the each element in the sequence has type event_t. It also specifies data, or repair the errors. If the mask requests that a data item be that the elements will be separated by vertical bars, and that the verified and set, and if the parse descriptor indicates no error, then sequence will be terminated by an end-of-record marker (Peor). the in-memory representation satisfies the semantic constraints on In general, PADS provides a rich collection of array-termination the data. conditions: reaching a maximum size, finding a terminating literal (including end-of-record and end-of-source), or satisfying a user- Because we generate a parsing function for each type in a PADS supplied predicate over the already-parsed portion of the Parray. description, we support multiple-entry point parsing, which accom- modates larger-scale data. For a small file, a programmer can call Finally, the Precord (Line 35) and Psource (Line 42) annota- the parsing function for the PADS type that describes the entire file tions deserve comment. The first indicates that the annotated type (e.g. summary_t_read) to read the whole file with one call. For constitutes a record, while the second means that the type consti- larger-scale data, programmers can sequence calls to parsing func- tutes the totality of a data source. The notion of a record varies tions that read manageable portions of the file, e.g., reading one depending upon the data encoding. ASCII data typically uses new- record at a time in a loop. The parsing code generated for Parrays line characters to delimit records, binary sources tend to have fixed- allows users to choose between reading the entire array at once or width records, while COBOL sources usually store the length of reading it one element at a time, again to support parsing and pro- each record before the actual data. PADS supports each of these en- cessing very large data sources. We return to the use of multiple- codings of records and allows users to define their own encodings. entry point parsing functions in Section 5.

      2.2 PADS: The generated library 3 Using XQuery and Galax

      From a description, the PADS compiler generates a C library for In this section, we give a brief overview of XML, XQuery, and parsing and manipulating the associated data source. From each Galax, focusing on Galax’s data-model support for viewing non- type in a PADS description, the compiler generates XML data as XML. Given the subject of this workshop, we as- • sume the reader is already familiar with XML, XQuery, and XML an in-memory representation, Schema. • parsing and printing functions, XML [18] is a flexible format that can represent many classes of • a mask, which allows customization of generated functions, data: structured documents with large fragments of marked-up text; and homogeneous records such as those in relational databases; and het- • a parse descriptor, which describes syntactic and semantic er- erogeneous records with varied structure and content such as those rors detected during parsing. in ad hoc data sources. XML makes it possible for applications to handle all these classes of data simultaneously and to exchange To give a feeling for the library that PADS generates, Figure 6 in- such data in a standard format. This flexibility has made XML the cludes a fragment of the generated library for the Sirius event_t “lingua franca” of data integration and exchange. declaration. XQuery [20] is a typed, functional query language for XML that The C declarations for the in-memory representation (Line 1Ð4), supports user-defined functions and modules for structuring large the mask (Line 5Ð9), and the parse descriptor (Line 10Ð17) all share queries. Its type system is based on XML Schema [21]. XQuery the structure of the PADS type declaration. The mapping to C for contains XPath 2.0 [19] as a proper sub-language, which supports each is straightforward: Pstructs map to C structs with appro- navigation, selection, and extraction of fragments of XML docu- priately mapped fields, Punions map to tagged unions coded as ments. XQuery also includes expressions to construct new XML C structs with a tag field and an embedded union, Parraysmap values and to integrate or join values from multiple documents. to a C struct with a length field and a dynamically allocated se- quence, Penums map to C enumerations, Popts to tagged unions, XQuery is a natural choice for querying ad hoc data. Like XML and Ptypedefs to C typedefs. Masks include auxiliary fields to data, ad hoc data is semi-structured, and XQuery is tailored to such control behavior at the level of a structured type, and parse descrip- data. XQuery’s static type system detects type errors at com- tors include fields to record the state of the parse, the number of pile time, which is valuable when querying ad hoc sources: Long- detected errors, the error code of the first detected error, and the running queries on large ad hoc sources do not raise dynamic type location of that error. errors, and queries made obsolete by schema evolution are identi- fied at compile time. XQuery is also ideal for specifying integrated The parsing functions, e.g. event_t_read on Line 19, take a mask views of multiple sources. Although here we focus on querying as an argument and returns an in-memory representation and a parse one ad hoc source at a time, XQuery supports simultaneous query- descriptor. The mask allows the user to specify which constraints ing of multiple sources. 
Lastly, XQuery is practical: It will soon the parser should check and which portions of the in-memory rep- be a standard; numerous manuals already exist [5]; and it is widely

      27 1. typedef struct { // In-memory representation 2. order_header_t order_header; 3. event_seq_t events; 4. } event_t;

      5. typedef struct { // Mask 6. Pbase_m compoundLevel; // Struct-level controls 7. order_header_t_m order_header; 8. event_seq_t_m events; 9. } event_t_m;

      10. typedef struct { // Parse descriptor 11. Pflags_t pstate; // Normal, partial, or panicking 12. Puint32 nerr; // Number of detected errors 13. PerrCode_t errCode; // Error code of first detected error 14. Ploc_t loc; // Location of first error 15. order_header_t_pd order_header; // Nested header information 16. event_seq_t_pd events; // Nested event sequence information 17. } event_t_pd;

      18. /* Parsing and printing functions */ 19. Perror_t event_t_read (P_t *pads, event_t_m *m, event_t_pd *pd, event_t *rep); 20. ssize_t event_t_write2io (P_t *pads, Sfio_t *io, event_t_pd *pd, event_t *rep); Figure 6. Fragment of the library generated for the event t declaration from Sirius data description. implemented in commercial databases. representation of a document. Therefore, a concrete instance of the data model must provide their implementations. Galax provides Galax is a complete, extensible, and efficient implementation of default implementations for the four descendant and ancestor axes XQuery 1.0 that supports XML 1.0 and XML Schema 1.0 and that (Lines 13Ð16), which are defined recursively in terms of the child was designed with database systems research in mind. Its archi- and parent methods. These defaults may be overridden in concrete tecture is modular and documented [15], which makes it possible data models that can provide more efficient implementations than for other researchers to experiment with a complete XQuery imple- the defaults. For example, some representations permit axes to be mentation. Its compiler produces evaluation plans in the first com- implemented by range queries over relational tables [11]. plete algebra for XQuery [13], which permits experimental compar- ison of query-compilation techniques. Lastly, its query optimizer All the axis methods take an optional node-test argument, which is produces efficient physical plans that employ traditional and novel a boolean predicate on the names or types of nodes in the given axis. join algorithms [13], which makes it possible to apply non-trivial For example, the XQuery expression descendant::order returns queries to large XML sources. Lastly, its abstract data model per- nodes in the descendant axis with name order. Galax compiles mits experimenting with various physical representations of XML this expression into a single axis/node-test operator that invokes the and non-XML data sources. Galax’s abstract data model is the fo- corresponding methods in the abstract data model, delegating eval- cus of the the rest of this section. uation of node tests to the concrete data model. Some implemen- tations, like PADX, can provide fast access to nodes by their name. 3.1 Galax’s Abstract Data Model We describe PADX’s concrete data model in Section 4. One other important feature of Galax’s abstract data model is that Galax’s abstract data model is an object-oriented realization of the sequences are represented by cursors (also known as streams), non- XQuery Data Model. The XQuery Data Model [17] contains tree functional lists that yield items lazily. Accessing the first item in nodes, atomic values, and sequences of nodes and atomic values. a sequence does not require that the entire sequence be material- A tree node corresponds to an entire XML document or to an indi- ized, i.e., evaluated eagerly. Galax’s algebraic operators produce vidual element, attribute, comment, or processing-instruction. Al- and consume cursors of values, which permits pipelined and short- gebraic operators in a query-evaluation plan produced by Galax’s circuited evaluation of query plans. query compiler access documents by applying methods in the data model’s object-oriented interface. 
In addition to the concrete data model for PADX,which we describe in the next section, Galax has three other concrete data models: Figure 7 contains part of Galax’s data model interface3 for a node a DOM-like representation in main memory and two “shredded” in the XQuery Data Model. Node accessors return information representations, one in main memory and one in secondary storage such as a node’s name (node_name), the XML Schema type against for very large documents (e.g. > 100MB). The shredded data model which the node was validated (type), and the node’s atomic- partitions a document into tables of elements, attributes, and values valued data if it was validated against an XML Schema simple that can be indexed on node names and values [16]. type (typed_value). The parent, child,andattribute meth- ods navigate the document and return a node sequence containing the respective parent, child, or attribute nodes of the given node. 4UsingPADX to Query Ad Hoc Data

      The first six methods in Figure 7 (Lines 5Ð11) access the physical Figure 8 depicts an internal view of the PADX architecture first shown in Figure 2. Pre-existing components (in grey boxes) include 3Galax is implemented in O’Caml, so these signatures are in the PADS compiler, the Galax query engine, and the PADS runtime O’Caml. system. In this section, we focus on the new components (in white

      28 1. type sequence = cursor 2. class virtual node : 3. object 4. (* Selected XQuery Data Model accessors *) 5. method virtual node_name : unit -> atomicQName option 6. method virtual type : unit -> (schema * atomicQName) 7. method virtual typed_value : unit -> atomicValue sequence

      8. (* Required axes *) 9. method virtual parent : node_test option -> node option 10. method virtual child : node_test option -> node sequence 11. method virtual attribute : node_test option -> node sequence

      12. (* Other axes *) 13. method descendant_or_self : node_test option -> node sequence 14. method descendant : node_test option -> node sequence 15. method ancestor_or_self : node_test option -> node sequence 16. method ancestor : node_test option -> node sequence

      ... Other accessors in XQuery Data Model ...

      Figure 7. Signatures for methods in Galax’s abstract node interface

      1. XML Document 2. PADX Query Corral 3. PADS Data 4. XQuery Program 5. Galax Query Engine XML Schema 6. Query Prolog 7. Galax Abstract Data Model 8. 9. PADX Concrete Data Model 10. PADS Data PADS Compiler PADX Node 11. Description Representation 12. PADS Runtime System 13. 14. 15. Figure 8. Internal view of PADX Architecture 16.

      Figure 9. Fragment of XML Schema for PADS base types. boxes) and describe the compiler and run-time support for viewing PADS data as XML. From a PADS description, the compiler gen- erates an XML Schema description that specifies the virtual XML PADS values that are error free and those containing errors are ac- view of the corresponding PADS data, an XQuery prolog that im- cessible in the XML view. We begin with the mapping of PADS ports the generated schema and that associates the input data with base types. the correct schema type, and a type-specific library that provides the virtual XML view of PADS values necessary to implement PADX’s A default XML Schema, pads.xsd, contains the schema types that concrete data model. represent the PADS base types shared by all PADS descriptions. Fig- ure 9 contains a fragment of this schema. Every PADS base type is Note that a query corral is customized for a particular PADS descrip- mapped to the schema simple type that most closely subsumes the tion, in particular, its concrete data model only supports views of value space of the given PADS base type. For example, the Puint32 data sources that match the PADS description. To maintain the cor- base type maps to the schema type xs:unsignedInt (Lines 1Ð3). rect correspondence between a description, XML Schema, queries, Recall that all parsed PADS values have an in-memory representa- and data, the query corral explicitly contains the generated query tion and a parse descriptor, which records the state of the parse, the prolog, which imports the XML Schema that corresponds to the error code for detected errors, and the location of those errors. The underlying type-specific library. This guarantees that the user’s XML view of a parsed value is a choice of the in-memory represen- XQuery program is statically typed, compiled, and optimized with tation (rep), if no error occurred, or of the parse descriptor (pd), if respect to the correct XML Schema and that the underlying data an error occurred (Lines 4Ð8). This light-weight view exposes the model is an instance of this XML Schema. At runtime, the query parse descriptor only when an error occurs. The parse-descriptor corral takes an XQuery program and a PADS data source and pro- type for all base types is represented by the schema type Pbase_pd duces the query result in XML. We discuss the problem of produc- (Line 10Ð14). ing native PADS values in Section 6. The fragment of the XML Schema in Figure 10 corresponds to the 4.1 Viewing PADS data as XML description in Figure 4. Note that the schema imports the schema for PADS base types (Line 5). Each compound type is mapped to a The mapping from a PADS description to an XML Schema is complex schema type with a particular content model. A Pstruct straight-forward. The interesting aspect of this mapping is that both is mapped to a complex type that contains a sequence of local el-

      29 1. 5. 6. ... 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. ... 24. 25.

      Figure 10. Fragment of XML Schema for Sirius PADS description. ements, each of which corresponds to one field in the Pstruct. At run time, the user can specify the input data as a command- For example, the Pstruct order_header_t is mapped to the line argument or by calling the XQuery fn:doc functiononaPADS complex type order_header_t (Lines 7Ð15), which contains an source, e.g. pads:/example/sirius.data. element declaration for the field order_num, among others. A Punion is mapped to a complex type that contains a choice of elements, each of which corresponds to one field in the Punion. 4.2 PADX Concrete Data Model

      Each complex type also includes an optional pd element that corre- In Figure 8, the interface between Galax and PADS consists of two sponds to the type’s parse descriptor (Lines 13 and 21). All parse- modules: the generic PADX concrete data model, which implements descriptor types contain the parse state, error code, and location. the Galax abstract data model, and a compiler-generated module, in The parse-descriptor for compound types contain additional infor- which each PADS type has a corresponding, type-specific node rep- mation, e.g., Pstruct_pd contains the number of nested errors and resentation providing the XML view of values of that type. We note Parray_pd contains the index of the array item in which the first that the generic concrete data model is implemented in O’Caml and error occurred. The pd element is absent if no errors occurred dur- the compiler-generated module is implemented in C, but to simplify ing parsing, but if present, permits an analyst to easily identify the exposition, we present the compiler-generated module in O’Caml kind and location of errors in the source data. For example, the fol- syntax. lowing XQuery expression returns the locations of all orders that contain at least one error: $pads/Psource/orders/elt/pd/loc. Figure 13 contains a fragment of the PADX concrete data model for a node. This object provides a thin wrapper around the type- The schema types for some compound types contain additional specific node representation, padx_node_rep, whose interface is fields from the PADS in-memory representation, e.g., arrays have in Figure 12. A node representation contains references to a PADS a length (Line 20). Note that Parray types do not associate a name value’s in-memory representation and parse descriptor. The node with each individual array item, so in the corresponding schema representation interface returns the XML view of the PADS value, type, the default element elt encapsulates each array item. including the value’s element name, its typed value, and parent. The kth_child and kth_child_by_name methods return all of the The PADS compiler generates a query prolog that specifies the en- PADS value’s children in order and those with a given name in order, vironment in which all XQuery programs are typed and evaluated. respectively. Figure 11 contains the query prolog for the schema in Figure 10. The import schema declaration on Line 1 imports the schema in For some methods in Figure 13 (Lines 4Ð5), the concrete data model Figure 10. This declaration puts all global element and type decla- simply invokes the corresponding type-specific methods. One ex- rations in scope for the query. The variable declaration on Line 2 ception is the child axis method (Lines 7Ð17), which we describe specifies that the value of the variable $pads is provided externally in detail as it illustrates how the XML view of a PADS source is and that its type is a document whose top-level element is of type materialized lazily. The child method takes an optional name- Psource, defined on Line 24 in Figure 10. This declaration guar- test argument. We describe the case when the name-test is ab- antees that the query is statically typed with respect to the correct sent, which corresponds to the common expression child::*.The input type. child method creates a mutable counter k (Line 8), which contains the index of the last child accessed, and a continuation function lazy_child (Lines 11Ð16), which is invoked each time the child

      30 1. import schema default element namespace "file:/example/sirius.p"; 2. declare variable $pads as document-node(Psource) external;

      Figure 11. PADX generated query prolog

      class virtual padx_node_rep : object (* Private data includes parsed value’s rep & pd *) method node_name : unit -> string method typed_value : unit -> item method parent : unit -> padx_node_rep option method kth_child : int -> padx_node_rep option method kth_child_by_name : int -> string -> padx_node_rep option end

      Figure 12. The PADX node representation

      1. class pads_node (nr : padx_node_rep)= 2. object 3. inherit Galax.node 4. method node_name () = nr#node_name() 5. method typed_value () = nr#typed_value() 6. (* ... Other data model accessors ... *) 7. method child name_test = 8. let k = ref 0 in 9. match name_test with 10. | None -> 11. let lazy_child () = 12. (incr k; 13. match nr#kth_child !k with 14. | Some cnr -> Some(new pads_node(cnr)) 15. | None -> None) 16. in Cursor.cursor_of_function lazy_child 17. | Some (NameTest name) -> (* Same as above, but call nr#kth_child_named *) 18. (* ... Other axes ... *) 19. end

      Figure 13. Fragment of the PADX concrete data model

      31 cursor is poked. On each invocation, lazy_child increments the record array(s) and then loads each array item on demand, expelling counter and delegates to the kth_child method of the type-specific old records when the buffer is filled. A small amount of meta-data node representation. For some PADS types, accessing the virtual kth is preserved for each expelled record, so that the virtual node con- child does not require reading or parsing data, e.g., if the virtual taining that data can be reconstructed on subsequent accesses. child is part of a complete PADS record. For other PADS types, e.g., Parrays that contain file records, accessing the virtual kth child The on-demand, sequential strategy is a restriction of the on- may require reading and parsing data. The kth_child method pro- demand, random-access strategy. It loads data on demand, but its vides a uniform interface to all types and delegates the problem of fixed-size buffer stores only one record at a time, and it supports when to read and parse data to the underlying type-specific node strictly sequential access to records, i.e., accessing records out of representation. order is prohibited. Given that the Galax abstract data model re- quires random access, it is not obvious when this strategy can be To illustrate type-specific compilation, we give the compiler- used, even though it has the smallest memory footprint of all three generated node representation of an order_header_t value in and therefore could scale to very large sources. It turns out that Figure 14. The object takes the name of the field that contains many common XQuery queries can be evaluated in one sequential the order_header_t value, which corresponds to the XML node scan over the input document, and in these cases, the sequential name, and the in-memory representation and parse descriptor of the strategy is both semantically correct and time and space efficient. value. The kth_child method (Lines 9Ð15) takes an index and re- We give examples of “one-scan” queries and their performance in turns the node representation of the field at that index. For example, Section 5. the first child (Line 11) corresponds to the field order_num,which contains a Puint32 value. The kth_child_by_name method PADX (Lines 16Ð21) provides constant-time lookup of a child with a par- 4.4 Ways to use ticular name: It looks up the index of the name in the associative Our focus so far has been on describing PADX’s internal architecture map name_map and then delegates to kth_child. Note that this to demonstrate the feasibility of viewing and querying ad hoc data XML view of an order_header_t value corresponds to the schema sources as though they were XML. We expect this use of PADX to be type order_header_t in Figure 10. convenient, because it supports rapid querying of transient data and does not require an analyst to convert the data into another format To summarize, the PADX concrete data model completely imple- or load it into a database before being able to ask simple queries. ments the Galax data model, making it possible to evaluate any PADX can be used in other ways. For example, an analyst might XQuery program over a PADS data source. Due to limited space, prefer to materialize a PADS source in XML and query her data we have omitted some details, such as how PADX guarantees that using a high-performance, commercial XML query engine. 
To do each virtual node has a unique, immutable identity, as is required this, the analyst simply runs the query “$pads”, which returns the by the Galax abstract data model. The data model’s most important entire source materialized in XML, and then provides the result- features are that it provides lazy access to virtual XML nodes in ing XML document to the query engine. Another use is to trans- the PADS source, it delegates navigation to type-specific node rep- form the PADX view of a PADS source into the XML view required resentations, and it separates navigation of the virtual nodes from by a database by some down-stream application. Such transforma- data loading, which is discussed next. tions can be easily expressed in XQuery and can be statically type checked against the PADX and target XML schemata. PADS 4.3 Loading data We note, however, that the size of an ad hoc data source is signifi- cantly smaller than its representation in XML. For our two example The PADX abstract data model provides Galax with a random- PADS sources, the ratio of the size of the original PADS data to its access view of a PADS data source. In particular, any virtual node size in XML using the mapping described in Section 4.1 ranges may be accessed in any order at any time during query evaluation from 1:7 to 1:8. Of course, this size increase depends on the PADS regardless of its physical location in the PADS data. This abstraction types and field names in the PADS description, but even a reasonable permits the PADX concrete data model to decide when and how to choice of names like those in Figure 4 results in a significant size in- read and parse, or load, a data source. crease. We mention this size increase to give the reader some sense of the relative scale of data sources that PADX can query compared PADX has three strategies for loading data, each of which use the to those supported by native XML query engines. multiple-entry parsing functions generated by the PADS compiler. The bulk strategy loads a complete PADS source before query eval- uation begins, populating all the in-memory representations and 5 Performance parse descriptors. With all data pre-fetched, bulk loading is the sim- plest strategy to implement random access. However, because each Query performance in PADX depends on the efficiency of the under- PADS value has a lot of associated meta-data, bulk loading incurs a lying concrete data model; therefore its performance must be well high memory cost and is only feasible for smaller data sources. understood before we can understand the performance of particu- lar query plans. We focus on the performance of the concrete data The on-demand, random-access strategy loads PADS data when model and measure the cost of accessing data via the PADS type- Galax accesses virtual nodes via the abstract data model. The strat- specific parsing functions, the PADX type-specific node representa- egy maintains a fixed size buffer for loaded values and when the tions, and the generic PADX concrete data model. At the end of this buffer is filled, expels values in LIFO order. The default units section, we give preliminary measurements on query performance. loaded are any PADS types annotated with Precord,whichin- dicates that the type denotes an atomic physical unit in the am- We measured data model and query performance for two PADS bient coding. 
This default works well in practice, because many sources, Sirius and the Web server logs in Figure 1, on data sources PADS sources contain a header, one (or more) very large array(s) of 1 to 50MB in size. Our measurements were taken on an 1.67GHz of records, and a trailer. This strategy loads all the data before the Intel Pentium M with 500MB real memory running Linux Redhat

      32 1. class order_header_t_node_rep 2. (field_name : string) 3. (rep : order_header_t) 4. (pd : order_header_t_pd) = 5. object 6. inherit padx_node_rep 7. method name() = field_name 8. ... 9. method kth_child idx = 10. match idx with 11. | 1 -> Some(new val_Puint32_node_rep("order_num", rep.order_num, pd.order_num_pd)) 12. | 2 -> Some(new val_Puint32_node_rep("att_order_num", rep.att_order_num, pd.att_order_num_pd)) 13. | ... 14. | 14 -> Some(new Pstruct_pd_node_rep("pd", pd)) 15. | _ -> None

      16. (* Chidren’s name map *) 17. let name_map = Associative_array.create [("order_num", 1); ("att_order_num", 2); ...; ("pd", 14)] 18. method kth_child_by_name child_name = 19. match Associative_array.lookup name_map child_name with 20. | None -> Cursor.empty_cursor() 21. | Some idx -> kth_child idx 22. end

      Figure 14. Fragment of compiler-generated node representation for order header t

      Data size (MB) PADX PADX Source 1 5 10 20 50 Data PADS node generic Sirius 0.25 0.23 0.23 0.22 10.64 Source size read rep DM Web server 0.70 0.67 0.67 1.18 6.14 5MB 0.07 0.27 0.61 Table 1. Bulk strategy: load time per byte in µs Sirius 10MB 0.06 0.26 0.56 50MB 0.06 0.25 0.56 5MB 0.54 0.78 1.63 9.0. Each test was run five times, the high and low times were Web server 10MB 0.53 0.74 1.61 dropped, and the mean of the remaining three times is the reported 50MB 0.53 0.74 1.58 time. Table 2. Sequential strategy: load time per byte in µs

      5.1 Concrete Data Model

      We first measured the time to bulk load data sources of 5, 10, 20, and 50MB by calling the PADS parsing functions, i.e.,thelowest level in the PADX data model. Table 1 gives the load time per byte in microseconds. For smaller sources, load time is constant, but cost compared to the lower levels. For the Sirius source, the PADX eventually increases. For Sirius, the increasing load time is ob- node-representation is four times slower than the native PADS pars- served at 50MB and for the Web server data at 20MB. We note that ing functions, but for the Web-server source, the PADX node rep- for a PADS source, the memory overhead of a PADS parsed value resentation is only 44% slower. Understanding the source of this can be four to sixteen times the size of the raw data, depending on difference requires further experiments with other sources. the value’s type. In the cases where non-linear load time occurs, the processes’ physical memory usage is close to or exceeds real mem- For both sources, the generic concrete data model (in O’Caml) ory, CPU utilization plummets, and the process begins to thrash. is twice as slow as the node representation (in C). The interface These measurements indicate that the bulk strategy is only feasible from the generic data model to the node representation crosses the for smaller data sources. O’Caml-C boundary and uses data marshalling functions gener- ated by the O’Caml IDL tool. We have noticed similar per-byte Next, we measured load time using the on-demand sequential stat- read costs in the Galax secondary storage system [16], whose data- egy on sources of 5, 10, and 50 MB. We were particularly interested model architecture is similar to that of PADX. in the overhead introduced at each level in the concrete data model. Table 2 gives the load time per byte in microseconds (µs) for three We also measured the time to load using the on-demand, random- levels: reading the source by calling the PADS parsing functions di- access strategy. In general, it was 10Ð15% slower than the on- rectly, a depth-first walk of the virtual XML document by calling demand, sequential strategy. the PADX node-representation functions, and a depth-first walk of the virtual XML document by calling the PADX generic data model. These measurements indicate that the on-demand, sequential strat- Recall that the node-rep functions are in C and the generic data egy scales with increasing data size, and that there is a constant model is in O’Caml. overhead incurred at each level in the data model. Ideally, we would like the cost of accessing data via the generic concrete data model We observe that the load time per byte at each level is near constant to be close to the PADS read cost, but this will require more engi- for increasing source size, but that each level incurs a substantial neering effort.

5.2 Querying

Ultimately, PADX's query performance depends on Galax, because the Galax compiler produces and executes the query plans. Currently, Galax's query compiler includes a variety of logical optimizations for detecting joins and re-grouping constructs in XQuery expressions. Another important optimization is detecting when a query can be evaluated in one scan over the input document. Path expressions that contain only descendant axes and no branches are one example of the kind of queries that can be evaluated in one scan. For example, the following query, which returns the locations of all records containing some error in a Sirius source, can be evaluated in one scan:

$pads/Psource/orders/elt/pd/loc

Detecting and evaluating one-scan queries (also known as streamable queries) is necessary in XML environments in which the XML data is an infinite or bursty stream. Several query processors already exist in which streamable queries are evaluated directly over a stream of tokens produced by SAX-style parsers [9, 14].

Streamable queries are important for PADX, because the resulting plans can be evaluated on large PADS sources that are loaded on demand and sequentially. Table 3 contains the time in seconds to evaluate the query above when applied to PADS data sources into which we injected errors randomly (12 errors per 1MB). The query plan produced by Galax is not perfectly pipelined, so the execution time is super-linear.

Data size (MB)    1      5      10     20     50
Time (seconds)    1.0    4.8    10.7   24.0   90.0

Table 3. PADX query evaluation time in seconds

To understand the costs and benefits of other evaluation strategies, we materialized the 1MB PADS source in Table 3, which yielded a 7.4MB XML document. We then used Galax to execute the above query, using the same query execution plan, applied to the 7.4MB XML document loaded into the main-memory data model. The execution time was 13.1s, of which 12.9s was spent in document parsing. To amortize the cost of document parsing, we often store documents in Galax's secondary storage system. To compare with this strategy, we stored the 7.4MB XML document in Galax's secondary storage system, which required 166MB of disk space. We then ran the above query on the stored document. The execution time was 2.9s, almost three times slower than PADX applied to the PADS data directly. For comparison with an independent query processor, we evaluated the above query using Saxon [12], a popular XSLT and XQuery engine, applied to the 7.4MB document; it executed in 6.3s.

In summary, our initial impressions are that evaluating streamable XQuery expressions directly on a PADS source is feasible, efficient, and convenient.

6 Discussion

The PADX system solves important data-management tasks: it supports declarative description of ad hoc data formats, its descriptions serve as living documentation, and it permits exploration of ad hoc data and vetting of erroneous data using a standard query language. The resulting PADS descriptions and queries are robust to changes that may occur in the data format, making it possible for more than one person to profitably use and understand a PADX description and related queries.

A PADX query corral is an example of a partially compiled query engine, because its concrete data model is customized for a particular data format, but its queries are interpreted over an abstract data model that delegates to the concrete model. This architecture places PADX on the continuum between query architectures that provide fully interpreted query plans applied to generic data models and architectures that provide fully compiled query plans applied to customized data model instances [10]. The latter architectures provide very high performance on large-scale data. PADX has some of the benefits of such architectures but does not have the overhead of a complete database system.

Others share our interest in declarative descriptions of ad hoc data formats. Currently, the Global Grid Forum is working on a standard data-format description language for describing ad hoc data formats, called DFDL [3, 4]. Like PADS, DFDL has a rich collection of base types and supports a variety of ambient codings. Unlike PADS, DFDL does not support semantic constraints on types or dependent types, e.g., it is not possible to specify that the length of an array is determined by some field in the data. DFDL is an annotated subset of XML Schema, which means that the XML view of the ad hoc data is implicit in a DFDL description. DFDL is still being specified, so no DFDL-aware parsers or data analyzers exist yet. We expect bi-directional translation between PADS and DFDL to be straightforward. Such a translation would make it possible for DFDL users to use PADX to query their ad hoc data sources.

The steps in a data-management workflow that PADX addresses typically precede the steps that require a high-performance database system, e.g., asking complex OLAP queries applied to long-lived, archived data. Commercial database products do provide support for parsing data in external formats so the data can be imported into their database systems, but they typically support a limited number of formats, e.g., COBOL copybooks; no declarative description of the original format is exposed to the user for their own use; and they have fixed methods for coping with erroneous data. For these reasons, PADX is complementary to database systems.

We continue to focus on improving the usability and scalability of PADX. Currently, PADX is not compositional, because the result of evaluating a query is in native XML, not in a PADS format. Given an arbitrary XQuery expression over a PADX source, an open problem is being able to infer a reasonable PADS format for the result and produce the results in this format. We have already mentioned the important problem of detecting when a query can be evaluated in a single scan over an input document and of producing a fully pipelined execution plan. Interestingly, this problem is important in XML environments in which the XML data is an infinite or bursty stream. We are working on improving Galax's ability to detect one-scan queries and to produce query plans that are indeed fully pipelined and that use limited memory.

7 References

[1] Galax user manual. http://www.galaxquery.org.
[2] PADS user manual. http://www.padsproj.org/.
[3] Data Format Description Language (DFDL): a proposal. Working Draft, Global Grid Forum, Aug. 2005. https://forge.gridforum.org/projects/dfdl-wg/document/DFDL_Proposal/en/%2.
[4] M. Beckerle and M. Westhead. GGF DFDL primer. http://www.ggf.org/Meetings/GGF11/Documents/DFDL_Primer_v2.pdf, May 2004. Global Grid Forum.
[5] M. Brundage. XQuery: The XML Query Language. Addison-Wesley, 2004.
[6] C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: High performance network monitoring with an SQL interface. In SIGMOD. ACM, 2002.
[7] M. Fernández, J. Siméon, B. Choi, A. Marian, and G. Sur. Implementing XQuery 1.0: The Galax experience. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 1077–1080, Berlin, Germany, Sept. 2003.
[8] K. Fisher and R. Gruber. PADS: A domain-specific language for processing ad hoc data. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, June 2005.
[9] D. Florescu, C. Hillery, D. Kossmann, P. Lucas, F. Riccardi, T. Westmann, M. J. Carey, and A. Sundararajan. The BEA streaming XQuery processor. VLDB Journal, 13(3):294–315, 2004.
[10] R. Greer. Daytona and the fourth-generation language Cymbal. In Proceedings of the ACM Conference on Management of Data (SIGMOD), 1999.
[11] T. Grust, M. van Keulen, and J. Teubner. Staircase join: Teach a relational DBMS to watch its axis steps. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 524–535, Berlin, Germany, Sept. 2003.
[12] M. Kay. SAXON 8.0. Saxonica. http://www.saxonica.com/.
[13] C. Ré, J. Siméon, and M. Fernández. A complete and efficient algebraic compiler for XQuery. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), April 2006.
[14] K. Rose and L. Villard. Phantom XML. In XML Conference and Exhibition, 2005.
[15] J. Siméon and M. F. Fernández. Build your own XQuery processor. EDBT Summer School, tutorial on Galax architecture, Sept. 2004. http://www.galaxquery.org/slides/edbt-summer-school2004.pdf.
[16] A. Vyas, M. F. Fernández, and J. Siméon. The simplest XML storage manager ever. In XIME-P 2004, pages 37–42, Paris, France, June 2004.
[17] W3C. XQuery 1.0 and XPath 2.0 data model, Oct. 2005. http://www.w3.org/TR/query-datamodel/.
[18] Extensible Markup Language (XML) 1.0. W3C Recommendation, Feb. 2004. http://www.w3.org/TR/2004/REC-xml-20040204/.
[19] XPath 2.0. W3C Working Draft, Oct. 2005. http://www.w3.org/TR/xpath20.
[20] XQuery 1.0: An XML query language. W3C Working Draft, Oct. 2005. http://www.w3.org/TR/xquery/.
[21] XML Schema Part 1: Structures. W3C Recommendation, Oct. 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/.

OCaml + XDuce

      Alain Frisch INRIA Rocquencourt [email protected]

      January 2006

Abstract

This paper presents the core type system and type inference algorithm of OCamlDuce, a merger between OCaml and XDuce. The challenge was to combine two type checkers of very different natures while preserving the best properties of both (principality and automatic type reconstruction on one side; very precise types and implicit subtyping on the other side). Type inference can be described by two successive passes: the first one is an ML-like unification-based algorithm which also extracts data flow constraints about XML values; the second one is an XDuce-like algorithm which computes XML types in a direct way. An optional preprocessing pass, called strengthening, can be added to allow more implicit use of XML subtyping. This pass is also very similar to an ML type checker.

1 Introduction

This paper presents the core type system of OCamlDuce, a merger between OCaml [L+01] and XDuce [Hos00, HP00, HP03, HVP00]. OCamlDuce source code, documentation and sample applications are available at http://www.cduce.org/ocaml.

OCaml is a widely-used general-purpose multi-paradigm programming language with automatic type reconstruction based on unification techniques. XDuce is a domain-specific and type-safe functional language adapted to writing transformations of XML documents. It comes with a very precise and flexible type system based on regular expression types and a natural notion of subtyping. The basic type-checking primitives for XDuce constructions are rather involved, but the structure of the type checker is simple: types are computed in a bottom-up way along the abstract syntax tree; the input and output types of functions are explicitly provided by the programmer. The high-level objective of the OCamlDuce project is to enrich OCaml with XDuce features in order to provide a robust development platform for applications that need to deal with XML but which are not necessarily focused on XML.

The challenge was to combine two type checkers of very different natures while preserving as much as possible the best properties of both (principality and automatic type reconstruction on one side; very precise types and implicit subtyping on the other side).

Our main guideline was to design a type system which can be implemented by reusing existing implementations of OCaml and CDuce [BCF03, Fri04]. (CDuce can be seen as a dialect of XDuce with first-class and overloaded functions; for the merger with OCaml, we don't consider these extra features.) Because of the complexity of OCaml's type system, it was out of the question to reimplement it. The typing algorithm we describe in this paper has been successfully implemented simply by combining a slightly modified OCaml type checker with the CDuce type checker, and by adding some glue code. As a result, OCamlDuce is a strict extension of OCaml: programs which don't use the new features will be treated exactly the same by OCaml and OCamlDuce. It is thus possible to compile any existing OCaml library with OCamlDuce. Also, we believe our modifications to the OCaml compiler are small enough to make it easy to maintain OCamlDuce in sync with future evolutions of OCaml. Our experience so far confirms that.

Another guideline in the design of OCamlDuce was that XDuce programs should be easily translatable to OCamlDuce in a mechanical way. In XDuce, all the functions are defined at the toplevel and come with an explicit signature. We can obtain an OCamlDuce program by some minor syntactic modifications (the new constructions in the language are delimited to avoid grammatical overloading of notations). Explicit function signatures are simply translated to type annotations.

The design goals pushed us in the direction of simplicity. We chose to segregate XDuce values from regular ML values. Of course, a constructed ML value can contain nested XDuce values, but from the point of view of ML, XDuce values are black boxes, and similarly for types.
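To fix intuitions, here is a minimal plain-OCaml sketch of this black-box discipline. The module Xml below, its dummy string representation, and the helper functions are hypothetical stand-ins invented only for illustration; in OCamlDuce the XML side consists of real XDuce values with their own syntax and types.

(* Hypothetical stand-in for the XML side: an opaque type of XML
   values and a couple of foreign operations on them.  The string
   representation is a dummy chosen only to make the sketch run. *)
module Xml : sig
  type t                      (* opaque XML value *)
  val empty : t               (* empty sequence *)
  val concat : t -> t -> t    (* sequence concatenation *)
end = struct
  type t = string
  let empty = ""
  let concat = ( ^ )
end

(* ML values freely contain XML values (here, inside a list and pairs),
   but ML code cannot inspect them except through Xml's operations. *)
let tag_with_index (xs : Xml.t list) : (int * Xml.t) list =
  List.mapi (fun i x -> (i, x)) xs

let join (xs : Xml.t list) : Xml.t =
  List.fold_left Xml.concat Xml.empty xs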

Also, we decided not to have parametric polymorphism on XDuce types. A type variable can of course be instantiated to an XDuce type (or to a type which contains a nested XDuce type), but it is not possible to force a generalized variable to be instantiated only to XDuce types or to use a type variable within an XDuce type. The technical presentation introduces a notion of foreign type variables, but they are nothing more than a technical device for inferring ground XDuce types.

Overview In Section 2, we give some intuitions about the behavior of OCamlDuce's type-checker. The formalization of the type system is developed by abstracting away from details about XDuce. In Section 3, we introduce an abstract notion of extension (foreign types and foreign operators) and show how XDuce can be seen as an extension. In Section 4, we present the type system and type inference algorithm for a calculus made of ML [Mil78, Dam85] plus an arbitrary extension. The basic idea is to rely on standard techniques for ML type inference. Indeed, we start from a type system which is an instance of ML where foreign types are considered as atomic types and foreign operators are explicitly annotated with their input and output types. Then we present an algorithm to infer these annotations. This algorithm is described as two successive passes: the first one is a slightly modified version of an ML type-checker, and the second one is a simple forward computation on foreign types. In Section 5, we present a preprocessing pass, called strengthening, whose purpose is to make more programs accepted by the type system by allowing implicit use of subtyping. In Section 6, we present other details of the concrete integration in OCaml. In Section 7, we compare our approach to related works.

2 An example

In this section, we illustrate the behavior of OCamlDuce's type-checker on the following code snippet:

let f x = match x with
  {{ [ (y::_ | _)* ] }} -> {{y@y}}
let z1 =
  f {{ [ [] [] [[]] ] }}
let z2 =
  List.map f
    [ {{ [ [[]] ] }};
      {{ [ [[]] ] }} ]

The example is intended to illustrate the use of the OCaml type checker to perform a data-flow analysis of XML values, and also how OCaml features (here, higher-order functions and data structures) interact with XDuce features.

Double curly braces {{...}} are used in OCamlDuce only to avoid ambiguities in the grammar; they carry no typing information. For instance, the symbol @ used for list concatenation in OCaml is re-used to denote XML sequence concatenation. Similarly, the square brackets [...] are used both to denote OCaml list literals (whose elements are separated by semi-colons) and XML sequence literals when used within double curly braces (their elements are separated by whitespace). XML element literals are written in the form <tag>[ content ].

The first line of the program above declares a function f which consists of an XML pattern matching on its argument, with a single branch. The XML pattern p = [ (y::_ | _)* ] extracts from an XML sequence all the elements with a given tag and puts them (in order) in the capture variable y. The function is then used twice, including once indirectly through a call to the function List.map (from the OCaml standard library) of type ∀α, β.(α → β) → α list → β list. For the purpose of explaining type-checking, we will rewrite the body of the function f as:

let f x =
  let y = match[y;p](x) in {{y@y}}

The y and p parameters of the match operator represent the capture variable under consideration and the pattern itself.

In OCamlDuce, XML values (elements, sequences, ...) and regular OCaml values are kept apart. An XML value can of course appear as part of an OCaml value (e.g. the XML elements which are put into an OCaml list), but an OCaml value cannot appear within an XML value. The same applies to types: an XML type can appear as part of a complex OCaml type expression, but the converse is impossible. XML operators can be applied to XML values and return new XML values. In the example, we can see three kinds of XML operators: XML literals (no argument), XML concatenation (two arguments), and XML pattern matching (one argument).

The basic idea of the OCamlDuce type system is to infer XML types for the inputs and outputs of XML operators. This is done by introducing internally a new kind of type variables, called XML type variables. Before proper type-checking starts, each XML operator used in the program is annotated with fresh XML type variables (in subscript position for the inputs, and in superscript position for the outputs):

      37 let f x = The set of constraints generates dependencies between vari- ι2 let y = match[y;p]ι1 (x) in ables. We say that a variable on a left-hand side of a con- ι5 {{y@ι3,ι4 y}} straint depends on variables of the right-hand side. In our let z1 = example, the graph of dependencies between variables is f {{ [ [] [] [[]] ]ι6 }} acyclic. In this case, we can topologically order the vari- let z2 = ables and find the least possible ground XML type for each List.map f of them: we assign to a variable the union of all its lower [ {{ [ [[]] ]ι7 }}; bounds. In the example, we will thus compute the following {{ [ [[]] ]ι8 }} ] instantiation:

      The regular OCaml type-checker is then applied. It gives to each XML operator an arrow type following the annotations and then proceeds as usual (generalizes types of let-bound ι1 = [R1] identifiers, instantiates ML type-schemes when an identifier ι2 = match[y;p]([ R1 ]) = [R2] is used, and performs unifications to make type compatible). ι5 = ι2@ι2 = [R2R2]

      For instance, the concatenation operator in our example is where R1 is the regular epxression given the type ι3 → ι4 → ι5, and the type-checker performs ([][][[]])|[[]|[]]] the following unifications: ι2 = ι3 = ι4 (the type for y), and R2 is the regular expression ι1 = ι6 = ι7 = ι8 (the type for the argument of f). It also ([][[]])|[[]|[]]. produces the following types for the top-level identifiers: Type-checking is over: we have found an instantiation for val f : ι1 → ι5 XML type variables which satisfies all the constraints. In val z1 : ι5 essence, the type-checker has collected all the XML types val z2 : ι5 list that can flow to the input of the function, and then type- checked the body of the function with the union of all these Of course, we must still instantiate the XML type variables types. In general, the OCaml type-checker is used to infer the with ground XML types. Each occurence of an XML op- data flow of XML values in the programs. The way to solve erator in the program gives one constraint on the instantia- the resulting set of constraints by forward computation cor- tion. Indeed, we can interpret each n-ary operator as as n- responds roughly to the structure of the XDuce type-checker. ary function from XML types to XML types. If we choose ι1 and ι2 as representatives for their classes of equivalence modulo unification, the program is: Implicit subtyping Let’s see what happens if we add an z1 let f x = explicit type constraint for : ι2 let y = match[y;p]ι1 (x) in let z1 : {{[ _* ]}} = ι5 {{y@ι2,ι2 y}} f {{ [ [] [] [[]] ] }} let z1 = f {{ [ [] [] [[]] ]ι1 }} The algorithm described above will infer a much less precise let z2 = type for z2 as well, which is unfortunate. The reason is that List.map f the OCaml type-checker unifies ι5 with [ _* ].Ba- [ {{ [ [[]] ]ι1 }}; sically, the unification-based type system forgets about the {{ [ [[]] ]ι1 }} ] direction of the data flow. There is some dose of implicit subtyping in the algorithm, but only for the result of XML from which we read the following constraints: operators (because of the way we interpet them as subtyping

      ι2 ≥ match[y;p](ι1) - not equality - constraints). ι5 ≥ ι2@ι2 In order to address this lack of implicit subtyping, we use ι1 ≥ [ [] [] [[]] ] a preprocessing pass whose purpose is to detect which sub- ι1 ≥ [ [[]] ] expressions are of kind XML and to introduce around them a ι1 ≥ [ [[]] ] special unary XML operator id which behaves semantically In this system, we consider match[y;p] as a function as the identity, but allows subtyping. This preprocessing pass from XML types to XML types, given by XDuce’s type in- would rewrite the definition for z1 as: ference algorithm for pattern matching. Similarly, the oper- ator @ is now intrepreted as a function from pair of types to let z1 : {{[ _* ]}} = ι10 ι1 types. idι9 (f {{[[] [] [[]]] }})

      38 The variable ι9 will then be unified with ι5 and ι10 with 3 Abstract extension of ML [ _* ]. The additional constraint corresponding to the id operator is thus simply: The previous section explained the behavior of OCaml- Duce’s type checker on a example. It should be clear from [ _* ] ≥ ι5 this example that the type system is largely independent of the actual definitions of values, types, patterns and opera- tors from XDuce and could be applied to other extensions which is satisfied by the same instantiation for ι5 as in the original example. As a consequence, the type for z2 is not of OCaml as well. In this section, we will thus introduce an changed. abstract notion of extension and show how XDuce fits into this notion. This more abstract presentation should help the The preprocessing pass is quite simple. It consists of an- reader to understand the structure of the type checker, with- other run of the OCaml type-checker, where all the XML out having to care about the details of XDuce type system. types are considered equal. This allows to identify which Definition 1. An extension X is defined by: sub-expressions are of kind XML. Section 5 describes for- mally this pass. • a set of ground foreign types T ; • a subtyping relation ≤ on T , which is a partial order with a finite least-upper bound operator ; Breaking cycles The key condition which allowed us to • a set of foreign operators O; compute an instantiation for XML type variables in the ex- ample was the acyclicity of the constraints. This property • for each operator o ∈O: an arity n ≥ 0 and an ab- does not always hold. For instance, let’s extend the original stract semantics oˆ : T n →T which is monotone with example with the following definition: respect to ≤ on each of its argument. letz3=fz1 We use the meta-variable τ to range over ground foreign types. The foreign operators are used to model both foreign Without the preprocessing pass mentionned above, this line value constructors and operations on these foreign values. would force the OCaml type-checker to unify ι1 and ι5. The Since we are not going to formalize dynamic semantics, we preprocessing pass actually replaces this definition by: don’t need to distinguish between these two kinds of opera- tors. ι12 let z3 = f idι11 (z1) The monotonicity requirement on the abstract semantics en- The type-checker then unifies ι11 with ι5 and ι12 with ι1; the sures that our resolution strategy (taking the union of lower resulting constraint for id is thus: bounds for each variables) for constraints is complete.

      ι1 ≥ ι5 We don’t formalize in this paper the operational semantics of operators. Instead, we assume informally that it is given and compatible with the abstract semantics. which corresponds to the fact that the output of f can flow back to its input. We observe that the set of constraints has now a cycle between variabls ι1,ι5 and ι2. XDuce as an extension We now show how XDuce fea- tures can be seen as an extension. We consider here a simple Our type-system cannot deal with such a situation. It would version of XDuce, with the following kind of expressions: issue an error explaining that the inferred data flow on XML element constructor a[e] (seen as a unary operator), empty values has a cycle. The programmer is then required to break sequence (), concatenation e1, e2, and pattern matching explicitly this cycle by providing more type annotations. For match e with p → e | ... | p → e. OCamlDuce is actually instance, the programmer could use the same annotation as build on CDuce, which considers for instance XML element above on z1: constructors as ternary operators (the tag and a specification let z1 : {{[ _* ]}} = for XML attributes are also considered as arguments). f {{ [ [] [] [[]] ] }} The meta-variable p ranges over XDuce patterns. We don’t or maybe he will prefer to annotate the input or output type need to recall here what they are. We just need to know of f. that for any pattern p we can define an accepted type p,

      39 a finite set of capture variable Var(p), and for any type τ built-in ML constant dispatch[τ1,...,τn] of type scheme: and any variable x in Var(p), a type match[x; p](τ) (which ∀α.(τ1  ... τn) → α → (α → β) → ...→ (α → β) → represents the set of values possibly bound to x when the β, which we assume to be present in the initial typing envi- input value is in τ and the pattern succeed) ronment. Its intuitive semantics is to drop the first argument (it is used only to force the type-checker to verify that x has Here is the formal definition of an extension X for XDuce. type τ1  ... τn, which corresponds to the XDuce’s pattern We take for foreign types the XDuce types quotiented by matching exhaustivity condition), and to call the kth func- the equivalence induced by the subtyping relation (that is: tional argument (1 ≤ k ≤ n) on the second argument when types with the same set-theoretic interpretation are consid- k is the smallest integer such that this argument has type τk. ered equal). The least-upper bound operator  corresponds to XDuce’s union type constructor (usually written |). We In principle, the technique described in this paper could use the following families of foreign operators: be used to integrate many of the existing extensions to the original XDuce design (such as attribute-element con- • a unary operator for each XML label a, a unary opera- straints [HM03] or XML filters [Hos04]) without any addi- tor; tional theoretical complexity. In its current form, however, OCamlDuce integrates all the CDuce extensions except over- • a binary operator corresponding to the concatenation; loaded functions: XML attributes as extensible records, se- quence and tree pattern-based iterators, strings as sequences • a constant operator corresponding to the empty se- of characters (hence string regular expression types and pat- quence; terns), etc. • for any pattern p and variable x in Var(p), a unary op- erator written match[x; p] (its semantics is to return the value bound to x when matching its argument against the pattern p). 4 Type system

      The abstract semantics for all these operators follows directly In this section, we present a type system and a type inference from XDuce’s theory. algorithm for a fixed extension X. Our language will be the kernel of an ML-like type system, enriched with types and Element constructor, concatenation and the empty sequence operators from the extension X. expressions can directly be seen as foreign operators. This is not the case for a pattern matching match e with p1 → | | → e1 ... pn en. We are going to present an encoding Types and expressions The syntax of types and expres- of pattern-matching in terms of operators and normal ML sions is given in Figure 1. We use a vector notation to repre- expressions. This encoding is rather heavy; in practice, the sent tuples. E.g. t stands for an n-tuple (t1,...,tn). implementation deals with pattern matching directly. We assume a set of ML type constructors, ranged over by → First, we define the translation p e of a single branch the meta-variable P. Each ML type constructor comes with { } where Var(p)= x1,...,xn as the expression: a fixed arity and we assume all the types to be well-formed with respect to these arities. The arrow → is considered as λx. a distinguished binary type constructor for which we use an let x1 = match[x1; p]xin infix and right-associative syntax. ... let xn = match[xn; p]xin We assume given two infinite families of type variables and e foreign type variables, respectively ranged over by the meta- variables α and ι. In an expression ∃α.e, the type variable Then, the translation of match e with p1 → e1 | ... | pn → α is bound in e. Expressions are considered modulo α- ∃ en is defined as: conversion of bound type variables. The construction α.e thus serves to introduce a fresh type variable α to be used in let x = ein a type annotation somewhere in e. dispatch[τ1,...,τn] xx(p1 → e1) ... (pn → en) Foreign operators are annotated with the type of their argu- where τi = pi and pi = pi\(τ1  ... τi−1) (the restric- ments (in subscript position) and of their result (in super- tion of pi to values which do not match any pattern form an script); the number of type arguments is assumed to be co- preceding branch). We have used in this translation a new herent with the arity of the foreign operator. However, in

      40 practice, the source language does not include the annota- A typing problem is a tuple (Γ, e, t). (Usually, t is a fresh tions: they are automatically filled with fresh foreign type type variable.) A solution to this problem is a substitution φ variables by the compiler (we also use this convention in this such that Γφ  eφ : tφ is a valid judgment in ML(X).We paper for some examples). Putting the annotations in the syn- will now rephrase this definition in terms of a typing judg- tax is just a way of simplifying the presentation. The main ment on the full calculus. This judgment Γ X e : t is technical contribution of the paper is an algorithm to infer defined by the same rules as in Figure 2, except for foreign ground foreign types for the foreign type variables. operators, for which we take:

      ε Γ X oε : ε1 → ...→ εn → ε The ML(X) fragment We call ML(X) the fragment of our calculus where all the foreign types are restricted to be Typing environment and type schemes that are used in the ground. Figure 2 defines a typing judgment Γ  e : t judgment X are allowed to contain foreign type variables. for ML(X). It is exactly an instance of the ML type sys- We say that φ is a pre-solution to the typing problem tem [Mil78, Dam85] if we see ground foreign types as (Γ, e, t) if the assertion Γφ X eφ : tφ holds. Of course, atomic ML types and ground-annotated foreign type oper- the new rule for foreign operators forgets the constraints that τ ators oτ as built-in ML constants or constructors (we also relates the input and output types of foreign operators. In introduce explicit type annotation and type variable intro- order to ensure type soundness, we must also enforce these duction). We recall classical notions of type scheme, typing constraints. environment and generalization. A type scheme is a pair of a finite set α¯ of type variables and of a type t, written ∀α.¯ t. Formally, we define a constraint C as a finite set of annotated ε  The variables in α¯ are considered bound in this scheme. We foreign operators oε. We write C if all the elements of C τ ≤ write σ t if the type t is an instance of the type scheme are of the form oτ with oˆ(τ) τ. For an expression e, we collect in a constraint C(e) all the instances of foreign σ.Atyping environment is a finite mapping from program ε variables to type schemes. The generalization of a type t operators oε that appear in e. Note that for any substitution φ,wehaveC(e)φ = C(eφ). with respect to a typing environment Γ, written GenΓ(t) is the type scheme ∀α.¯ t where α¯ is the set of variables which We are ready to rephrase the notion of solution. are free in t, but not in Γ. Lemma 1. A substitution φ is a solution to the typing prob- lem (Γ, e, t) if and only if the following three assertions Type-soundness of the ML(X) fragment We assume that hold: a sound operational semantics is given for the ML(X) cal- τ culus. This amounts to defining δ-reduction rules for the oτ • Γφ, eφ and tφ do not contain foreign type variables; operators which are coherent with the abstract semantics for the foreign operators. Well-typed expressions in ML(X) (in • φ is a pre-solution to the typing problem; an empty typing environment, or an environment which con- •  tains built-in ML operators) cannot go wrong. We also as- C(eφ). τ sume that the operational semantics for an oτ operator de- pends only on o, not on the annotations τ, τ. This allows us to lift the semantics of ML(X) to the full calculus. Type soundness Type soundness for our calculus is a triv- ial consequence of the type soundness assumption for the ML(X) fragment. Indeed, we can see a solution φ to a typ- ing problem (Γ, e, t) as an elaboration into a well-typed pro- Typing problems A substitution φ is an idempotent func- gram in this fragment. tion from types to types that maps type variables to types, foreign type variables to foreign types, ground foreign types to themselves, and that commutes with ML type construc- Type inference Let us consider a fixed typing problem tors. We use a post-fix notation to denote a capture-avoiding (Γ, e, t). We want to find solutions to this problem. Thanks application of this substitution to typing environments, ex- to Lemma 1, we will split this task into two different steps: pressions, types or constraints.

      A substitution φ1 is more general than a substitution φ2 if • find a most-general pre-solution φ0; φ2 = φ2 ◦ φ1. (Or equivalently, because substitutions are idempotent: there exists a substitution φ such that φ2 = φ ◦ • instantiate the remaining foreign type variables so as to φ1.) satisfy the resulting constraint.6

ε ::=                       Foreign types:
    τ                         ground foreign type
    ι                         foreign type variable

t ::=                       Types:
    P t̄                       constructed
    α                         type variable
    ε                         foreign type

e ::=                       Expressions:
    x                         program variable
    λx.e                      abstraction
    e e                       application
    let x = e in e            local definition
    (e : t)                   annotation
    ∃α.e                      existential variable
    o_ε̄^ε                     foreign operator

      Figure 1: Types and expressions
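As a companion to Figure 1, the following OCaml sketch gives one possible rendering of the same grammar as datatypes, together with the collection of the constraint C(e) described earlier (all annotated foreign operators occurring in an expression). The constructor names, the use of strings for variables, and the string stand-in for ground foreign types are assumptions made only for this sketch.

(* Foreign types ε: a ground foreign type τ (from the extension, kept
   abstract here as a string) or a foreign type variable ι. *)
type foreign_ty =
  | Ground of string
  | FVar of string

(* Types t: an ML type constructor applied to types, a type variable α,
   or a foreign type ε. *)
type ty =
  | Constr of string * ty list
  | TVar of string
  | Foreign of foreign_ty

(* Expressions e, with foreign operators carried as constants annotated
   with their input foreign types (subscripts) and output foreign type
   (superscript). *)
type expr =
  | Var of string
  | Lam of string * expr
  | App of expr * expr
  | Let of string * expr * expr
  | Annot of expr * ty
  | Exists of string * expr
  | ForeignOp of string * foreign_ty list * foreign_ty

(* C(e): collect every annotated foreign operator occurring in e. *)
let rec constraints (e : expr) : (string * foreign_ty list * foreign_ty) list =
  match e with
  | Var _ -> []
  | Lam (_, e) | Exists (_, e) | Annot (e, _) -> constraints e
  | App (e1, e2) | Let (_, e1, e2) -> constraints e1 @ constraints e2
  | ForeignOp (name, inputs, output) -> [ (name, inputs, output) ]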

Γ(x) ⪰ t
──────────
Γ ⊢ x : t

Γ, x : t1 ⊢ e : t2
────────────────────
Γ ⊢ λx.e : t1 → t2

Γ ⊢ e1 : t1 → t2    Γ ⊢ e2 : t1
─────────────────────────────────
Γ ⊢ e1 e2 : t2

Γ ⊢ e1 : t1    Γ, x : GenΓ(t1) ⊢ e2 : t2
──────────────────────────────────────────
Γ ⊢ let x = e1 in e2 : t2

Γ ⊢ e : t
───────────────
Γ ⊢ (e : t) : t

Γ ⊢ e[t0/α] : t
────────────────
Γ ⊢ ∃α.e : t

ô(τ̄) ≤ τ
────────────────────────────────
Γ ⊢ o_τ̄^τ : τ1 → ... → τn → τ

      Figure 2: Type system for the ML(X) fragment

      It is almost straightforward to adapt unification-based exist- so we prefer to stick to our design guideline that type infer- ing algorithms for ML type inference (and their implementa- ence shouldn’t be significantly more complicated than both tions) to compute a most general pre-solution if there exists ML type inference and XDuce-like type inference. XDuce a pre-solution, or to report a type error otherwise. Indeed, computes in a bottom-up way, for each sub-expression, a the typing judgment X is very close to a normal ML type type which over approximates all the possible outcomes of system. In particular, it satisfies a substitution lemma: if this sub-expression. The basic operations and their typing Γ X e : t, then Γφ X eφ : tφ for any substitution φ. discipline corresponds respectively to our foreign operators and their static semantics. XDuce’s type system uses sub- Of course, if the typing problem has no pre-solution, it has sumption only when necessary (e.g. to join the types of no solution as well. For the remaining of the discussion, the branches of a pattern matching, or when calling a func- we assume given a most general pre-solution φ0. Let us tion). So we can say that XDuce tries to compute a min- write V for the set of foreign type variables that appear in imal type for each sub-expression, by applying basic type- (Γφ0, eφ0, tφ0) and C0 for the constraint C(eφ0). checking primitives. We will do the same, and to make it work, we need some acyclicity property, which corresponds A solution to the typing problem is in particular a pre- to the bottom-up structure of XDuce’s type checker. solution. As a consequence, a substitution φ is a solution ◦ C if and only if φ = φ φ0 and if it maps foreign type vari- Definition 2. Let C be a constraint. We write ι1  ι2 if C  0 ε ables in V to ground foreign types in such a way that C φ. contains an element oε such that ι2 = ε and ι1 appears in ε. The “minimal” modification we need to bring to φ0 to get a We say that C is acyclic if the directed graph defined by this solution is to instantiate variables in V so as to validate C0. relation is acyclic. Formally, we define a valuation as a function ρ : V →T such that  C0ρ. To any valuation ρ, we can associate a solu- Our type inference algorithm only deals with the case of an tion φ defined by tφ = tφ0ρ and any solution is less general acyclic constraint C0 (this condition does not depend on the than the solution obtained this way from some valuation. In particular choice of the most general pre-solution). If the particular, a solution exists if and only if a valuation exists. condition is not met, we issue an error message. It is not So we are now looking for a valuation. a type error with respect to the type system, but a situation where the algorithm is incomplete. We won’t give a complete algorithm to check for the exis- tence of a valuation. This would lead to difficult constraint Remark. The acyclicity criterion is of course syntactical (it solving problems which might be undecidable (this of course does not depend on the semantics of constraints but on their depends on the extension X). Even if they are decidable for syntax), but it is not defined in terms of a specific inference a given extension, they might be intractable in practice and algorithm. Instead, it is defined in terms of the most-general pre-solution of an ML-like type system. In particular, it does

      42 not depend on implementation details such as the order in We can simulate partial operators by adding a new top ele- which sub-expression are type-checked. ment to the set of ground foreign types T , and by requiring the abstract semantics of operators to be such that whenever Below we furthermore assume that C0 is acyclic. We define an argument is , the result is also . Since the typing algo- →T the function ρ0 : V by the following equation: rithm infers the smallest valuation for foreign type variables,  ι we can simply look at it and check that no foreign type vari- ∀ι ∈ V. ιρ0 = {oˆ(ερ0) | o ∈ C0} ε able is mapped to . The acyclicity condition ensures that this definition is well- founded and yields a unique function ρ. Furthermore, this function is a valuation if and only if the typing problem has 5 Strengthening a solution. To check this property, only constraints whose right-hand side is a ground foreign type need to be consid- ered: As we mentioned above, we can see the type system for the τ (1) ∀oε ∈ C0. oˆ(ερ0) ≤ τ calculus as an elaboration into its ML(X) fragment, which Also, any other valuation ρ is such that: immediatly gives type soundness.
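As a concrete illustration of the forward computation of the least valuation described above, here is an OCaml sketch parameterized by a signature approximating the abstract notion of extension from Definition 1. The signature itself, the explicit bottom element, and the string-named variables are simplifying assumptions of the sketch; the real implementation works on CDuce types and on the constraint set extracted by the ML pass.

(* Simplified interface for an extension X (Definition 1): ground
   foreign types with subtyping and least upper bounds, and foreign
   operators with a monotone abstract semantics. *)
module type EXTENSION = sig
  type t
  val leq : t -> t -> bool       (* subtyping on ground foreign types *)
  val lub : t -> t -> t          (* least upper bound *)
  val bottom : t                 (* assumed least element, e.g. the empty type *)
  type op
  val apply : op -> t list -> t  (* abstract semantics ô, monotone in each argument *)
end

module ForwardComputation (X : EXTENSION) = struct
  module M = Map.Make (String)

  type arg = Ground of X.t | Var of string
  type constr = { lhs : string; op : X.op; args : arg list }
    (* represents the constraint  lhs >= op(args) *)

  let eval env = function Ground t -> t | Var v -> M.find v env

  (* Least valuation: assumes [vars] lists the foreign type variables in
     a topological order of the dependency graph, i.e. the constraint
     set is acyclic.  Each variable gets the lub of its lower bounds. *)
  let least_valuation (vars : string list) (cs : constr list) : X.t M.t =
    List.fold_left
      (fun env v ->
         let lower_bounds =
           List.filter_map
             (fun c ->
                if c.lhs = v then Some (X.apply c.op (List.map (eval env) c.args))
                else None)
             cs
         in
         M.add v (List.fold_left X.lub X.bottom lower_bounds) env)
      M.empty vars

  (* Check the remaining constraints, whose left-hand sides are ground:
     ô(ε ρ0) ≤ τ for every annotated operator with a ground output. *)
  let check (rho : X.t M.t) (ground_cs : (X.t * X.op * arg list) list) : bool =
    List.for_all
      (fun (tau, op, args) -> X.leq (X.apply op (List.map (eval rho) args)) tau)
      ground_cs
end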

      ∀ι ∈ V. ιρ0 ≤ ιρ In this section, we consider another elaboration from the cal- culus into itself. Namely, this elaboration is intended to be In other words, under the acyclicity condition, we can check used as a preprocessing pass (rewriting expressions into ex- in a very simple way whether a given typing problem has a pressions) in order to make the type system accept more pro- solution, and if this is the case, we can compute the smallest grams. We call this elaboration procedure strengthening. valuation (for the point-wise extension of the subtyping rela- tion). This computation only involves one call to the abstract The issue addressed by strengthening is a lack of implicit semantics for each application of a foreign operator in the subsumption in our calculus. We already hinted at this issue expression to be type-checked. in Section 2. We will now give more examples. Remark. In some cases, it is possible to find manifest type errors even when the constraint is not acyclic. In practice, Subsumption missing in action We consider the typing the computation of ρ0, the verification of (1), and the check problem (Γ1, e1,β) where Γ1 = {x : τ1, y : τ2, f : ∀α. α → for acyclicity can be done in parallel, e.g. with a deep-first α → α} and e1 = fxy. It admits a solution if and only graph traversal algorithm. It can detect some violation of (1) if τ1 = τ2. In a system with implicit subtyping, we might before a cycle. In this case, we know that the typing problem expect to give type τ = τ1  τ2 to both x and y, so that the has no solution, and thus a proper type error can be issued. application succeeds and the result type is τ.

      Similarly, the expression (λx.x : τ1 → τ2) is not well-typed Manually working around the incompleteness When the even if τ1 ≤ τ2 (unless τ1 = τ2). algorithm described above infers a cyclic constraint, it can- not detect whether the typing problem (Γ, e, t) has a solution or not. However, we have the following property. If a solu- A naive solution Let us see how to implement the amount tion φ exists, then we can always produce an expression e of implicit subtyping we need to make these examples type- by adding annotations to e such that the algorithm succeeds check. The following rule could be a reasonable candidate for the typing problem (Γ, e, t) and that φ is equivalent (for as an addition to the type system (we write ≤ for the new the equivalence induced by the more-general ordering) to the typing judgment): solution φ0 computed by the algorithm. Γ ≤ e : ττ≤ τ In other words, even if the algorithm is not complete (be- Γ ≤ e : τ cause of the acyclicity condition) and makes a choice be- tween most-general solutions (the smallest one for the sub- A concrete way to see this rule is that any subexpres- typing relation), for any solution to a typing problem, the sion e can be magically transformed to the application ι2 programmer can always add annotations so that the algo- idι1 e , where id is a distinguished foreign operator such rithm infers this very solution (or an equivalent one). that idˆ (τ)=τ and ι1,ι2 are fresh foreign type variables.

      The type system extended with this rule would accept the Partial operators The foreign operators were assumed to examples given above to illustrate the lack of implicit sub- be total. This means they should apply to any foreign value. sumption. However, this rule as it stands would add a lot

      43 of complexity to the type inference algorithm. As a matter (τ1  τ3) → t) → t) where e3 is from the example above. of fact, the type system would not admit most-general pre- Here, the preprocessing pass succeeds but does not change solutions anymore. We can see this on a very simple exam- the expression because no sub-expression has a foreign type ple with the typing problem ({x : τ}, x,α). We could argue in the principal type derivation. The type-scheme inferred for that a more liberal definition of being more-general should h is a pure ML type-schema, which makes the type-system allow some dose of subtyping. So let us consider the more subsequently fail on the expression. complex example Γ3 = {f : ∀α. α → α → α} and e3 = λx.λy.λz.λg.g (fxy)(fxz). In ML, the inferred type We believe that this restriction of the ≤ system is rea- scheme would be ∀α, β.α → α → α → (α → α → β) → sonnable. It can be implemented very simply by reusing β which forces the first three arguments to have the same the same type-checker as in Section 4 in a different mode type. But if the arguments turn out to be of a foreign-type, (where all the foreign types can be unified). The simple ex- another family of types for the function is possible, namely amples at the beginning of this section are now accepted. ∀β.τ1 → τ2 → τ3 → ((τ1  τ2) → (τ1  τ3) → β) → β, Indeed, the preprocessing pass transforms the expressions to and these types cannot be obtained as instances of the ML f (id x)(id y) and ((λx.id x):τ1 → τ2) respectively. This  type scheme above (we can obtain ∀β.τ1 → τ2 → τ3 → allows the type system to use subtyping where needed. ((τ1  τ2  τ3) → (τ1  τ2  τ3) → β) → β but this is less precise). Properties The strenghtening pass cannot transform a well-typed program into an ill-typed one. Note, however, that it might break the acyclicity condition if it was already A practical solution We will now describe a practical so- met. See below for a way to relax the acyclicity condition. lution. Instead of modifying the type system by adding a new subsumption rule, we will formulate the extension as a Also, if strenghtening fails, the typing problem has no pre- rewriting preprocessing pass. The rewriting consists in in- solution (for the typing judgment ), and thus no solution. serting applications of the identity foreign operator id. The However, it is not true that if it succeeds, a pre-solution nec- challenge is then to choose which sub-expressions e should essarily exists (for the new program where applications of be rewritten to id e . If we had an oracle to tell us so, the the id operators have been added). As an example, let us composition of the rewriting pass and the type system of Sec- consider the situation where Γ={x : τ1 → τ1, y : τ2 → tion 4 would be equivalent to the type system ≤. Unfortu- τ2, f : ∀α. α → α → α} and e = fxy. The preprocessing nately, we don’t have such an oracle. We could try all the succeeds, because all the foreign types are considered equal possible choices of sub-expressions, and this would give a but does not touch the expression (because no sub-expression  complete type-checking algorithm for the type system ≤. has a foreign type in a principal typing derivation). Still, the next pass of the type inference algorithm attempts to unifiy We prefer to use a computationally simpler solution. We the types τ1 and τ2 and thus fails. also expect it to be simpler to understand by the program- mer. The idea is to use an incomplete oracle. 
The oracle first runs a special version of an ML type-checker on the ex- pression to be type-checked. This type-checker identifies all Relaxing the acyclicity condition Inserting applications the foreign types together. The effect is to find out which of the id operator can break the acyclicity condition. We sub-expressions have a foreign type in a principal derivation, can actually relax this condition to deal with the id oper- that is, which sub-expression have necessarily a foreign type ator more carefully. Let us consider a constraint C with a C C in all the possible derivations. The preprocessing pass con- cycle ι1  ...  ι1, such that all the edges in this cycle ι sists in adding an application of the identity operator above come from elements of the form idι . Clearly, any valu- all these sub-expressions and only those. ation ρ such that  Cρ will map all the ιi in the cycle to the same ground foreign type. So instead of considering the The important point here is that the oracle may be overly most-general pre-solution and then face a cyclic constraint, conservative. Let us consider a type variable which has been we may as well unify all these ιi first: all the solutions can generalized in the principal derivation. In a non-principal still be obtained from this less-general pre-solution. derivation, it could have been instead instantiated to a for- eign type. If this derivation had been considered instead of The relaxed condition is: There must be no cycle in the con- the principal one, the preprocessing pass would have added straint except maybe cycles whose edges are all produced by more applications of the identity operator. Maybe this would the id operator. have been necessary in order to make the resulting expres- sion type-check. An example is given by the expression To illustrate the usefulness of the relaxed condition, let us let h = e3 in (h : τ1 → τ2 → τ3 → ((τ1  τ2) → consider the expression e = fix(λg.λx.fc(gx)) with Γ =

      44 {fix : ∀α.(α → α) → α, f : ∀α.α → α → α, c : τ}. The from the compilation unit interface (the .cmi file) is in- strengthening pass builds a principal typing derivation for e tegrated before checking the acyclicity condition. Indeed, in a type algebra where all the foreign types are identified. this information acts as additional type annotations on the Here is such a derivation, where we write for foreign types values exported by the compilation unit and can thus con- and t = α → , Γ =Γ, g : t, x : α (we collapse rules for tribute to obtaining this condition. Also, in addition to type multiple abstraction and application): annotations on expressions, OCaml provides several ways to introduce explicit type informations (and thus obtain the acyclicity condition): datatype definitions (explicit types for constructor and exception arguments, record fields), module Γ  fix :(t → t) → t signatures, type annotations on ML pattern variables. Γ  f : → → Γ  g : t Γ  x : α OCaml subtyping OCaml comes with a structural subtyp-   Γ c : Γ gx: ing relation (generated by object types and polymorphic vari- Γ  fc(gx): ants subtyping and extended structurally by considering the Γ  λg.λx.fc(gx)) : t → t variance of ML type constructors). The use of this subtyping relation in programs is explicit. The syntax is (e : t1 :> t2) Γ  e : α → (sometimes, the type t1 can be inferred) and it simply checks that t1 is a subtype of t2. Of course, the OCaml subtyping On this principal derivation, we observe three sub- relation has been extended in OCamlDuce to take XDuce expressions of a foreign type. Accordingly, strengthening in- subtyping into account. For instance, if τ1 is a XDuce sub- troduces three instances of the id operator and thus rewrites type of τ2 and e has type τ1list, then it is possible to coerce the expression to: it to type τ2 list: (e :>τ2 list). ι ι ι = (λ .λ . 2 ( ( 4 )( 6 ( )))) e fix g x idι1 f idι3 c idι5 gx The type-checker which is then applied performs some uni- Crossing the boundary In our system, XDuce values fications: ι1 = ι4 = ι6, ι2 = ι5, ι3 = τ. We can for instance are opaque from the point of view of ML and XDuce assume that the computed most-general pre-solution maps ι4 types cannot be identified with other ML type construc- and ι6 to ι1 and ι5 to ι2. The first and third instances of the tors. Sometimes, we need to convert values between the C0 two worlds. For instance, we have a foreign type String id operator in e thus generate the dependencies ι1  ι2 and C0 which is different from OCaml string. This foreign ι2  ι1. Strictly speaking, the constraint is cyclic, but we type conceptually represents immutable sequences of arbi- can break the cycle simply by unifying ι1 and ι2. The small- trary Unicode characters, whereas the OCaml type should est valuation is then given by ι1ρ = τ. We would have ob- be thought as representing mutable buffers of bytes. As tained the same solution if we had applied the type-checker a consequence, we don’t even try to collapse these two directly on e without the strengthening pass. In this exam- types into a single one. Instead, OCamlDuce comes with ple, strengthening is useless and the relaxed acyclicity condi- a runtime library which exports conversion functions such tion is just a way to break a cycle introduced by strenghten- as Utf8.make: string -> String, Utf8.get: ing. 
We can easily imagine more complex examples where String -> string, Latin1.make: string -> strenghtening is really necessary but introduces cycles that Latin1, Utf8.get: Latin1 -> string. The can be broken by the relaxed condition. type Latin1 is a subtype of String: it represents all the strings which are only made of latin-1 characters (latin- 1 is a subset of the Unicode character set). The function 6 Integration in OCaml Utf8.make checks at runtime that the OCaml string is a valid representation of a Unicode string encoded in utf-8.

      We have described a type system for basic ML expressions. Similarly, we often need to translate between XDuce’s se- Of course, OCaml is much more than an ML kernel. We quences and OCaml’s lists. For any XDuce type τ, we can found no problem to extend it to deal with the whole OCaml easily write two functions of types [τ∗] → τ list and type system, including recursive types, modules, classes, and τ list → [τ∗] (the star between square brackets denotes other fancy features. The two ML-like typing passes (the Kleene-star). Similarly, we can imagine a natural XDuce one used during strengthening and the one using for the real counterpart of an OCaml product type τ1 × τ2, namely type-checking) are done on whole compilation units (in the [τ1 τ2], and coercion functions. However, writing this kind toplevel, they are done on each phrase). The information of coercions by hand is tedious. OCamlDuce comes with

      45 built-in support to generate them automatically. This au- specification of a type system and do not study the interac- tomatic system relies on a structural translation of some tion between XDuce type-checking and ML type inference OCaml types into XDuce types: lists and array are translated (XDuce code can call ML functions but their type must be to Kleene-star types, tuples are translated to finite-length fully known). These last points are precisely the issues tack- XDuce sequences, variant types are translated to union types, led by our contribution. For instance, our system makes etc. Some OCaml types such as polymorphic or functional it possible to avoid some type annotation on non-recursive types cannot be translated. OCamlDuce comes with two XDuce functions. Another difference is that in our approach, magic unary operators to_ml, from_ml (both written {: the XDuce/CDuce type checker and back-end (compilation ...:} in the concrete syntax). The first one takes an XDuce of pattern matching) can be re-used without any modifica- value and applies a structural coercion to it in order to obtain tion whereas their approach requires a complete reengineer- an OCaml value; this coercion is thus driven by the output ing of the XDuce part (because subtyping and pattern match- type of the operator. The type-checker requires this type to ing relations must be enriched to produce ML code) and it is be fully known (polymorphism is not allowed). Similarly, not clear how some XDuce features such as the Any type the operator from_ml takes an OCaml value and apply a can be supported in a scenario of modular compilation. We structural coercion in order to obtain an XDuce value. Since believe our approach is more robust with respect to exten- the type of its input drives its behavior, the type-checker re- sions of XDuce and that the XDuce-to-ML translation can be quires this type to be fully known. seen as an alternative implementation technique for XDuce which allows some interaction between XDuce and ML (the This system can be used to obtain coercions from complex same kind of interaction as what can be achieved with the OCaml types (e.g. obtained from big mutually recursive def- CDuce/OCaml interface described above). initions of concrete types) to XDuce types, whose values can be seen as XML documents. This gives parsing from XML The Xtatic project [GP03] is another example of the inte- and pretty-printing to XML for free. gration of XDuce types into a general purpose language, namely C#. Since both C#’s and XDuce’s type checkers op- erate with bottom-up propagation (explicit types for func- tions/methods, no type inference), the structure of Xtatic 7 Related work type-checker is quite simple. The real theoretical contribu- tion is in the definition of a subtyping relation which com- bines C# named subtyping (inheritance) and XDuce set- The CDuce language itself comes with a typed interface with theoretic subtyping. Since the resulting type algebra does OCaml. The interface was designed to: (i) let the CDuce pro- not have least-upper bounds, the nice locally-complete type grammers use existing OCaml libraries; (ii) develop hybrid inference algorithm for XDuce patterns [HP02] cannot be projects where some modules are implemented in OCaml transferred to Xtatic. In Xtatic, XDuce types and C# types and other in CDuce. 
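As a toy illustration of the structural translation of OCaml types into XDuce types sketched above, the following OCaml fragment maps a small, invented fragment of type expressions: lists become Kleene-star types, tuples become finite sequences, and variants become unions of tagged elements. Both miniature ASTs and the set of handled cases are assumptions of the sketch; the real to_ml/from_ml machinery is driven by the compiler's own type representation and covers many more cases.

(* Invented miniature ASTs for the two type worlds. *)
type ml_ty =
  | MLString
  | MLList of ml_ty
  | MLTuple of ml_ty list
  | MLVariant of (string * ml_ty option) list  (* constructor name, optional argument *)

type xduce_ty =
  | XString
  | XStar of xduce_ty              (* Kleene star, [ t* ] *)
  | XSeq of xduce_ty list          (* finite sequence, [ t1 t2 ... ] *)
  | XElem of string * xduce_ty     (* element, tag[ t ] *)
  | XUnion of xduce_ty list        (* union type, t1 | t2 | ... *)
  | XEmpty                         (* empty sequence *)

(* Structural translation: lists to Kleene-star types, tuples to
   finite-length sequences, variants to unions of tagged elements. *)
let rec translate : ml_ty -> xduce_ty = function
  | MLString -> XString
  | MLList t -> XStar (translate t)
  | MLTuple ts -> XSeq (List.map translate ts)
  | MLVariant cases ->
      XUnion
        (List.map
           (fun (tag, arg) ->
              XElem (tag, match arg with None -> XEmpty | Some t -> translate t))
           cases)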
The interface is actually quite simple: are stratified, but the two algebras are mutually recursive: each monomorphic OCaml type t is mapped in a structural XDuce types can appear in class definitions and C# classes ˆ way to a CDuce type t. A value defined in an OCaml mod- can be used as basic items in XDuce regular expression ule can be used from CDuce (the compiler introduces a nat- types. This does not really introduce any difficulty because → ˆ ural translation t t). Similarly, it is possible to provide C# types are not structural. The equivalent in OCamlDuce an ML interface for a CDuce module: the CDuce compiler would be to allow OCaml abstract types as part of XDuce checks that the values exported by the module are compati- types, which would not be difficult, except for scoping rea- ble with the ML-to-CDuce translation of these types and pro- sons (abstract types are scoped by the module system). duces stub code to apply a natural translation tˆ → t to these values. This CDuce/OCaml interface is used by many CDuce In the last ten years, a lot of research effort has been put into users and served as a basis to the to_ml and from_ml oper- developping type inference techniques for extensions of ML ators described in Section 6. with subtyping and other kinds of constraints. For instance, the HM(X) framework [MOW99] could serve as a basis to Sulzmann and Zhuo Ming Lu [SL05] pursue the same ob- express the type system presented here. The main modifi- jective of combining XDuce and ML. However, their contri- cation to bring to HM(X) would be to make foreign-type bution is orthogonal to ours. Indeed, they propose a compi- variables global. Another way to express it is to disallow lation scheme from XDuce into ML such that the ML rep- constraints in type-schemes (which is what we do in the cur- resentation of XDuce values is driven by their static XDuce rent presentation). We have chosen to present our system type (implicit use of subtyping are translated to explicit coer- in a setting closer to ML so as to make our message more cions). Their type system supports in addition used-defined explicit: our system can be easily implemented on top of ex- coercions from XDuce types to ML types. However, they isting ML implementations. do not describe a type inference algorithm for their abstract

8 Conclusion and future work

We have presented a simple way to integrate XDuce into OCaml. The modification to the ML type-system is small enough to make it possible to easily extend existing ML type-checkers.

Realistic-sized examples of code have been written in OCamlDuce, such as an application that parses XML Schema documents into an internal OCaml form and produces an XHTML summary of their content. Compared to a pure OCaml solution, this OCamlDuce application was easier to write and to get right: XDuce's type system ensures that all possible cases in XML Schema are treated by pattern-matching and that no invalid XHTML output can be produced. We refer the reader to OCamlDuce's website for the source code of this application.

The main limitation of our approach is that it doesn't allow parametric polymorphism on XDuce types. Adding polymorphism to XDuce is an active research area. In previous work with Hosoya and Castagna [HFC05], we presented a solution where polymorphic functions must be explicitly instantiated. Integrating this kind of polymorphism into the same mechanism as ML polymorphism is challenging and left for future work. The theory recently developed by Vouillon [Vou06] could be a relevant starting point for such a task.

Another direction for improvement is to further relax the acyclicity conditions, that is, to accept more programs without requiring extra type annotations. Once the set of constraints representing XML data flow and operations has been extracted by the ML type-checker, we could use techniques which are more involved than simple forward computation over types. The static analysis algorithm used in Xact [KMS04] could serve as a starting point in this direction.

Acknowledgments. The author would like to thank Didier Rémy and François Pottier for fruitful discussions about the design and formalization of type systems.

References

[BCF03] V. Benzaken, G. Castagna, and A. Frisch. CDuce: an XML-friendly general purpose language. In ICFP '03, 8th ACM International Conference on Functional Programming, pages 51–63, Uppsala, Sweden, 2003. ACM Press.
[Dam85] Luis Manuel Martins Damas. Type assignment in programming languages. PhD thesis, University of Edinburgh, Scotland, April 1985.
[Fri04] Alain Frisch. Théorie, conception et réalisation d'un langage de programmation fonctionnel adapté à XML. PhD thesis, Université Paris 7, December 2004.
[GP03] Vladimir Gapeyev and Benjamin C. Pierce. Regular object types. In European Conference on Object-Oriented Programming (ECOOP), Darmstadt, Germany, 2003.
[HFC05] Haruo Hosoya, Alain Frisch, and Giuseppe Castagna. Parametric polymorphism for XML. In POPL, 2005.
[HM03] Haruo Hosoya and Makoto Murata. Boolean operations and inclusion test for attribute-element constraints. In Eighth International Conference on Implementation and Application of Automata, 2003.
[Hos00] Haruo Hosoya. Regular Expression Types for XML. PhD thesis, The University of Tokyo, Japan, December 2000.
[Hos04] Haruo Hosoya. Regular expression filters for XML. In Programming Languages Technologies for XML (PLAN-X), 2004.
[HP00] Haruo Hosoya and Benjamin C. Pierce. XDuce: A typed XML processing language. In Proceedings of the Third International Workshop on the Web and Databases (WebDB 2000), 2000.
[HP02] Haruo Hosoya and Benjamin C. Pierce. Regular expression pattern matching for XML. Journal of Functional Programming, 13(4), 2002.
[HP03] Haruo Hosoya and Benjamin C. Pierce. A typed XML processing language. ACM Transactions on Internet Technology, 3(2):117–148, 2003.
[HVP00] Haruo Hosoya, Jérôme Vouillon, and Benjamin C. Pierce. Regular expression types for XML. In ICFP '00, volume 35(9) of SIGPLAN Notices, 2000.
[KMS04] Christian Kirkegaard, Anders Møller, and Michael I. Schwartzbach. Static analysis of XML transformations in Java. IEEE Transactions on Software Engineering, 30(3):181–192, March 2004.
[L+01] Xavier Leroy et al. The Objective Caml system release 3.08: Documentation and user's manual, 2001.
[Mil78] Robin Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences, 1978.
[MOW99] Martin Odersky, Martin Sulzmann, and Martin Wehr. Type inference with constrained types. TAPOS, 5(1), 1999.
[SL05] Martin Sulzmann and Kenny Zhuo Ming Lu. A type-safe embedding of XDuce into ML. In The 2005 ACM SIGPLAN Workshop on ML, 2005.
[Vou06] Jérôme Vouillon. Polymorphic regular tree types and patterns. In POPL, 2006. To appear.

Polymorphism and XDuce-style patterns

Jérôme Vouillon, CNRS and Université Paris 7, [email protected]

Abstract

We present an extension of XDuce, a programming language dedicated to the processing of XML documents, with polymorphism and abstract types, two crucial features for programming in the large. We show that this extension makes it possible to deal with first class functions and eases the interoperability with other languages. A key mechanism of XDuce is its powerful pattern matching construction, and we mainly focus on this construction and its interaction with abstract types. Additionally, we present a novel type inference algorithm for XDuce patterns, which works directly on the syntax of patterns.

1. Introduction

XDuce [14] is a programming language dedicated to the processing of XML documents. It features a very powerful type system: types are regular tree expressions [15] which correspond closely to the schema languages used to specify the structure of XML documents. The subtyping relation is extremely flexible as it corresponds to the inclusion of tree automata. Another key feature is a pattern matching construction which extends the algebraic patterns popularized by functional languages by using regular tree expressions as patterns [13].

In this paper, we aim at extending in a seamless way the XDuce type system and pattern construction with ML-style prenex polymorphism and abstract types. These are indeed crucial features for programming in the large in a strongly typed programming language. In our extension, patterns are not allowed to break abstraction. This crucial property makes it possible to embed first class functions and foreign values in a natural way into XDuce values.

In another paper [21], we present a whole calculus dealing with polymorphism for regular tree types. Though most of the results in that paper (in particular, the results related to subtyping) can be fairly easily adapted for an extension of XDuce, a better treatment of patterns is necessary. Indeed, a straightforward application of the results would impose severe restrictions on patterns. For instance, binders and wildcards would be required to occur only in tail position. The present paper is therefore mostly focused on patterns and overcomes these limitations.

Additionally, we present a novel type inference algorithm for XDuce patterns, which works directly on the syntax of patterns, rather than relying on a prior translation to tree automata. This way, better type error messages can be provided, as the reported types are closer to the types written by the programmer. In particular, type abbreviations can be preserved, while they would be expanded by the translation into tree automata.

The paper is organized as follows. We introduce the XDuce type system (section 2) and present the extension (section 3). Then, we formalize patterns (section 4) and provide algorithms for checking patterns and performing type inference (section 5). Related works are presented in section 6.
2. A Taste of XDuce

XDuce values are sequences of elements, where an element is characterized by a name and a contents. (Elements may also contain attributes, both in XDuce and XML. We omit attributes here for the sake of simplicity.) This contents is itself a sequence of elements. These values correspond closely to XML documents, such as this address book example.

<addrbook>
  <person><name>Haruo Hosoya</name><email>hosoya</email></person>
  <person><name>Jerome Vouillon</name><tel>123</tel></person>
</addrbook>

XDuce actually uses a more compact syntax, which we also adopt in this paper:

addrbook[
  person[name["Haruo Hosoya"], email["hosoya"]],
  person[name["Jerome Vouillon"], tel["123"]]]

The shape of values can be specified using regular expression types. A sequence of elements is described using a regular expression. Mutually recursive type definitions make it possible to deal with the nested nature of values. Here are the type definitions for address books.

type Addrbook = addrbook[Person*]
type Person = person[Name,Email*,Tel?]
type Name = name[String]
type Email = email[String]
type Tel = tel[String]

These type definitions can be read as follows. An Addrbook value is an element with name addrbook containing a sequence of any number of Person values. A Person value is an element with name person containing a Name value followed by a sequence of Email values and optionally a Tel value. Values of type Name, Email, and Tel are all composed of a single element containing a string of characters.

There is a close correspondence between regular expression types and tree automata [5]. As the inclusion problem between tree automata is decidable, the subtyping relation can be simply defined as language inclusion [15]. This subtyping relation is extremely powerful. It includes associativity of concatenation (type A,(B,C) is equivalent to type (A,B),C) and distributivity rules (type A,(B|C) is equivalent to type (A,B)|(A,C)).

In order to present the next examples, we find it convenient to use the following parametric type definition for lists:

type List{X} = element[X]*

Parametric definitions are not currently implemented in XDuce, but are a natural extension and can be viewed as just syntactic sugar: all occurrences of List{T} (for any type T) can simply be replaced by the type element[T]* everywhere in the source code.

Another key feature of XDuce is regular expression patterns, a generalization of the algebraic patterns popularized by functional languages such as ML. These patterns are simply types annotated with binders. Consider for instance this function which extracts the names of a list of persons.

fun names (lst : Person*) : List{String} =
  match lst with
    () --->
      ()
  | person [name [nm : String], Email*, Tel?],
    rem : Person* --->
      element [nm], names (rem)

The function names takes an argument lst of type Person* and returns a value of type List{String}. The body of the function is a pattern matching construction. The value of the argument lst is matched against two patterns. If it is the empty sequence, then it will match the first pattern () (the type () is the type of the empty sequence ()), and the function returns the empty sequence. Otherwise, the value must be a non-empty sequence of type Person*. Thus, it is an element of name person followed by a sequence of type Person*, and matches the second pattern. This second pattern contains two binders nm and rem which are bound to the corresponding parts of the value.

Some type inference is performed on patterns: the type of the expression being matched is used to infer the type of the values that may be bound to a binder. By taking advantage of this, the function names can be rewritten more concisely using wildcard patterns (XDuce actually uses the pattern Any as a wildcard pattern) as follows. The types of the binders nm and rem are inferred to be respectively String and Person* by the compiler.

fun names (l : Person*) : List{String} =
  match l with
    () --->
      ()
  | person [name [nm : _], _], rem : _ --->
      element [nm], names (rem)

3. Basic Ideas

We want to extend regular expression types and patterns with ML-style polymorphism (with explicit type instantiation) and abstract types. Such an extension is interesting for numerous reasons. First, it makes it possible to describe XML documents in which arbitrary subdocuments can be plugged. A typical example is the SOAP envelope. Here is the type of SOAP messages and of a function that extracts the body of a SOAP message.

type Soap_message{X} =
  envelope[header[...], body[X]]
fun extract_body :
  forall{X}. Soap_message{X} ---> X

A more important reason is that polymorphism is crucial for programming in the large. It is intensively used for collection data structures. As an example, we present a generic map function over lists. This function has two type parameters X and Y.

fun map{X}{Y}
    (f : X ---> Y)(l : List{X}) : List{Y} =
  match l with
    () --->
      ()
  | element[x : _], rem : _ --->
      element[f(x)], map{X}{Y}(f)(rem)

When using a polymorphic function, type arguments may have to be explicitly given, as shown in the following expression where the function map is applied to the identity function on integers and to the empty list:

map{Int}{Int} (fun (x : Int) ---> x)().

Indeed, it is possible to infer type arguments in simple cases, using an algorithm proposed by Hosoya, Frisch and Castagna [12], but not in general, as a best type argument does not necessarily exist: the problem is harder in our case due to function types which are contravariant on the left.
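For comparison, here is the ordinary ML-style prenex-polymorphic map over OCaml lists that the XDuce map{X}{Y} above mirrors; this sketch is ours and is only meant to highlight the correspondence with ML polymorphism.

    (* ML-style prenex polymorphism: the type parameters 'a and 'b play the role
       of the explicit XDuce type parameters {X}{Y}, but are inferred by ML. *)
    let rec map (f : 'a -> 'b) (l : 'a list) : 'b list =
      match l with
      | [] -> []
      | x :: rest -> f x :: map f rest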
Abstract types facilitate interoperability with other languages. Indeed, we can consider any type from the foreign language as an abstract type as far as XDuce is concerned. (We consider here ML as the foreign language, as XDuce is currently implemented in OCaml; but this would apply equally well to other languages.) For instance, the ML type int can correspond to some XDuce type Int. This generalizes to parametric abstract types: to the ML type int array would correspond the polymorphic XDuce type Array{Int}. Furthermore, if the two languages share the same representation of functions, ML function types can be mapped to XDuce function types (and conversely). Thus, for instance, a function of type int--->int can be written in either language and used directly in the other language without cumbersome conversion.

In order to preserve abstraction and to deal with foreign values that may not support any introspection, some patterns should be disallowed. For instance, this function should be rejected by the type checker as it tries to test whether a value x of some abstract type Blob is the empty sequence.

fun f (x : Blob) : Bool =
  match x with
    () ---> true
  | _ ---> false

Another restriction is that abstract types cannot be put directly in sequences. Indeed, it does not make sense to concatenate two values of the foreign language (two ML functions, for instance). In order to be put into a sequence, they must be wrapped in an element. As a type variable may be instantiated to an abstract type, and as we want to preserve abstraction for type variables too, the same restrictions apply to them: a pattern a[],X,b[] implicitly asserts that the variable X stands for a sequence, and thus would limit its polymorphism.

There are different ways to deal with type variables and abstract types occurring syntactically in patterns. The simplest possibility is not to allow them. Instead, one can use wildcards and rely on type inference to assign polymorphic types to binders. This approach is taken in the related work by Hosoya, Frisch and Castagna [12]. Another possibility is to consider that type variables should behave as the actual types they are instantiated to at runtime. This is a natural approach, but it implies that patterns do not preserve abstraction. It is also challenging to implement this efficiently, though it may be possible to get good results by performing pattern compilation (and optimization) at run-time. Finally, it is not clear in this case how abstract types should behave in patterns. We propose a middle ground, by restricting patterns so that their behaviors do not depend on what type variables are instantiated to, and on what abstract

      50 types stand for. In other words, patterns are not allowed to break () ---> ... abstraction. As a consequence, type variables can be compiled as | element[x : _], rem : List{X} ---> ... wildcards. In other words, type variables and abstract types occur- { } ring in patterns can be considered as annotations which are checked where l has type List X . at compile time but have no effect at run-time. We indeed feel it is Such a grammar-based syntax of patterns is convenient for writ- interesting to allow type variables and abstract types in patterns. A ing patterns but typically does not reflect their internal represen- first reason is that it is natural to use patterns to specify the parame- tation in a compiler. For instance, it assumes a notion of pattern { } ters of a functions. And we definitively want to put full types there. names (such as List X or Name) which may be expanded away For instance, we should be able to write such a function: at an early stage by the compiler. Binders may also be represented in a different way. Finally, this notation is not precise about sub- fun apply{X}{Y} pattern identity: for instance, in the pattern a[ ]|b[ ], it in not (funct[f : X ---> Y], arg[x : X]) : Y=f(x) clear whether one should consider the two occurrences of the wild- Another reason is that one may want to reuse a large type definition card pattern as two different subpatterns, or as a single subpattern. containing abstract types in a pattern, and it would be inconvenient The distinction matters as a compiler usually does not identify ex- to have to duplicate this definition, replacing abstract types with pressions which are structurally equal. In particular, one should be wildcards. Finally, the check can be implemented easily: the type careful not to use any termination argument that relies on struc- inference algorithm can be used to find the type of the values that tural equality. Another reason is that we need to be able to specify may be matched against any of the type variables occurring in the precisely how a value is matched by a pattern. This is especially pattern, so one just has to check that this type is a subtype of the important for type inference (section 4.9), where we get a different type variable (this usually means that the type is either empty or result depending on whether we infer a single type for both occur- equal to the type variable, but some more complex relations are rences of the wildcard pattern or a distinct type for each occurrence. possible, as we will see in section 4.3). Thus, we define a more abstract representation of patterns which provides more latitude for actual implementations. A pattern is a rooted labeled directed graph. Intuitively, this graph can be under- 4. Specifications stood as an in-memory representation of a pattern: nodes stands for We now specify our pattern matching construction, starting from memory locations and edges specify the contents of each memory the data model, continuing with types and patterns, before finally location. To be more accurate, a pattern is actually a hypergraph, dealing with the whole construction. 
as edges may connect a node to zero, one or several nodes: for in- stance, for a pattern (), there is an (hyper)edge with source the 4.1 Values location of the whole pattern and with no target, while for a pattern We assume given a set of names l and a set of foreign values e.A P,Q, there is an (hyper)edge connecting the location of the whole value v is either a foreign value or a sequence f of elements l[v] pattern to the location of subpatterns P and Q. (with name l and contents v). We assume given a family of name sets L,asetoftype vari- ables X and a set X of binders x. Formally, a pattern is a quadruple v ::= e foreign value (Π,φ,π0, B) of f sequence f ::= l[v],...,l[v] • afinitesetΠ of pattern locations π; • φ :Π→C(Π) We write for the empty sequence,andf,f for the concatenation a mapping from pattern locations to pattern p ∈C(Π) of two sequences f and f . components , defined below; Note that strings of characters can be embedded in this syntax • a root pattern location π0 ∈ Π. by representing each character c as an element whose name is this • a relation B⊆X×Π between binders and pattern locations. very character and whose contents is empty: c[]. This encoding was introduced by Gapeyev and Pierce [9]. Pattern components C(Π) aredefinedbythefollowinggrammar, parameterized over the set Π of pattern locations. 4.2 Patterns p ::= L[π] element pattern We start by two comments clarifying the specification of patterns. empty sequence pattern First, in all the examples given up to now, in a pattern element π, π pattern concatenation L[T], the construction L stands for a single name. It actually corre- π ∪ ...∪ π pattern union sponds in general to a set of names. This turns out to be extremely π∗ pattern repetition convenient in practice. For instance, this can be used to define char-  wildcard acter sets (remember that characters are encoded as names). Sec- X type variable ond, abstract types and type variables are very close notions. Es- sentially, the distinction is a difference of scope: an abstract type Binders do not appear directly in patterns. Instead they are specified stands for a type which is unknown to the whole piece of code by a relation between binder names and pattern locations. This considered, while a type variable has a local scope (typically, the allows us to simplify significantly the presentation of the different scope of a function). Thus, for patterns, we can unify both no- algorithms on patterns. Indeed, most of them simply ignore binders. tions. Parametric abstract types can be handled by considering each As an example, the two patterns: of their instances as a distinct type variable. Thus, the two types { } Array{Int} and Array{Bool} correspond each to a distinct type () and element[x:_],rem:List X variable in our formalization of patterns. Similarly, each function can be formally specified respectively as: type T 2 --->T1 corresponds to a distinct type variable. We explain in section 4.3 how subtyping can be expressed for these types. (Π,φ,1, ∅) and (Π,φ,2, {(x, 4), (rem, 5)}) As a running example, we consider the pattern matching code in function map: where the set of pattern locations is: match l with Π={1, 2, 3, 4, 5, 6, 7}
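As a concrete reading of this definition, here is a minimal OCaml sketch (our own type names, not the paper's) of the quadruple (Π, φ, π0, B) and of the pattern components C(Π), instantiated on the running example above.

    (* A pattern is a quadruple: a finite set of locations, a mapping from
       locations to components, a root location, and a binder relation.
       We simply use integers for locations and an association list for phi. *)
    type loc = int
    type name_set = string list              (* a name set L, kept naive here *)
    type component =
      | Element of name_set * loc            (* L[pi]            *)
      | Eps                                  (* empty sequence   *)
      | Concat of loc * loc                  (* pi, pi'          *)
      | Union of loc list                    (* pi1 | ... | pin  *)
      | Star of loc                          (* pi*              *)
      | Wildcard                             (* _                *)
      | Var of string                        (* type variable X  *)
    type pattern = {
      phi : (loc * component) list;          (* phi : Pi -> C(Pi) *)
      root : loc;                            (* pi0               *)
      binders : (string * loc) list;         (* the relation B    *)
    }

    (* The pattern  element[x : _], rem : List{X}  of the running example,
       with locations 2..7 as in the paper. *)
    let example : pattern = {
      phi = [ (2, Concat (3, 5)); (3, Element (["element"], 4)); (4, Wildcard);
              (5, Star 6); (6, Element (["element"], 7)); (7, Var "X") ];
      root = 2;
      binders = [ ("x", 4); ("rem", 5) ];
    }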

      51 the patterns a[],X and X*,whereX is a pattern variable, are re- 1 jected. Indeed, the semantics of a pattern variable may contain for- eign values, which cannot be concatenated. These two restrictions , element[ ]  2 3 4 are formally specified using a well-formedness condition. First, we define when a pattern location is in a sequence (figure 2). Then, ∗ element[ ] we define the well-formedness condition for pattern locations (fig- 5 6 7 X ure 3). There is one rule per pattern component. For all rules but one, in order to deduce that a pattern location is well-formed, one Figure 1. Graphical Depiction of Two Patterns must first show that its subpatterns are themselves well-formed. This ensures that there is no cycle. The exception is the rule for element patterns L[π], hence cycles going through elements are allowed. The rule for type variables X additionally requires that φ(π)=π,π φ(π)=π,π φ(π)=π∗ the pattern location is not in a sequence. Finally, a pattern is well- π seq π seq π seq formed if all its locations are. These restrictions could also have been enforced syntactically [11, 17], but we prefer to keep the syn- π seq φ(π)=π1 ∪ ...∪ πn tax as simple and uniform as possible. In the remainder of this pa- per, all patterns are implicitly assumed to be well-formed. πi seq 4.3 Typing Environments Figure 2. Locations in a Sequence π seq In order to provide a semantics to patterns, we assume given a class of binary relations between values and types variables, which we call typing environments. Equivalently, we can consider a typing π wf π wf environment as a function from type variables to their semantics φ(π)=L[π ] φ(π)= φ(π)=π ,π which is a set of values). We have two motivations for restricting π wf π wf π wf ourselves to a class of such relations rather than allowing all rela- tions. First, some type variables may have a fixed semantics, iden- π1 wf ... πn wf tical in all typing environments. This makes it possible to define φ(π)=π1 ∪ ...∪ πn φ(π)= the type of a function T 2 --->T1 (assuming that T1 and T2 are pure π wf π wf regular expression types, without type variables). The semantics of some type variables may also be correlated to the semantics of other type variables. For instance, the semantics of the type Array{X} ¬(π seq) φ(π)=X π wf φ(π)=π ∗ depends on the semantics of the type variable X. Second, for se- π wf π wf mantic reasons, the semantics of any type, and thus the semantics of type variables, may be required to satisfy some closure proper- Figure 3. Location Well-Formedness π wf ties. This is the case for instance in the ideal model [16].

      4.4 Pattern Matching and the mapping from pattern locations to pattern components is In order for the algorithms presented in this paper to be imple- the function φ defined by: mentable, the family of name sets L should be chosen so that the φ(1) = following predicates are decidable: φ(2) = 3, 5 φ(3) = [4] φ(4) =  element • lL φ(5) = 6∗ φ(6) = element[7] φ(7) = X the inclusion of a name in a name set: ; • the non-emptiness of a name set: L(that is, there exists a name (We write element for the name set containing only the name l such that lL); element.) A graphical depiction of the formal representation of the • the non-disjointness of two name sets: L1 1 L2 (that is, there two patterns is given in figure 1. The two root locations 1 and 2 are l lL lL circled. Edges are labeled with the corresponding component. One exists a name such that 1 and 2). can see three kind of edges on this picture: the edges with labels , Furthermore, for technical reasons (see section 4.7), there must be  and X have no target; one edge with label , has two targets 3 anameset containing all names. and 5 and corresponds to the component 3, 5; some edges with The semantics of a pattern (Π,φ,π0, B) is given in figure 4 label *orelement[ ] has a single target. Note that the locations 5 using inductive rules. It it parameterized over a typing environment, to 7 correspond to the expansion of type List{X}. that is a relation vXwhich provides a semantics to each Not all patterns are correct. The most important restriction is type variable. We define simultaneously the relation vπ(the that cycles are not allowed except when going through an ele- value v matches the pattern location π) and a relation f∗ π (the ment L[π ]: for instance, the pattern sequence f matches a repetition of the pattern location π). Then, a v Balanced = a[], Balanced, b[] value matches a whole pattern if it matches its root location, that is, vπ0. should be rejected, while the pattern A match of a value v against a location π is a derivation of vπ Tree = leaf[] | node[Tree, Tree] . Given such a match, we define the submatches as the set of assertions v π which occur in the derivation. These submatches is accepted. This restriction ensures that the set of values matching indicate precisely which parts of the value is associated to each a given pattern is a regular tree language3. The other restriction is that pattern variables should not occur in sequences. For instance, in two steps. The first step would be a semantics in which values contain variables matching the variables in the pattern. With this initial semantics, 3 Actually this is not quite accurate due to type variables. In order to state the denotation of a pattern would indeed be a regular tree language. The the regularity property precisely, the semantics of patterns should be defined second step would correspond in substituting values for the type variables.

[Figure 4: Matching a Value against a Pattern — rules MATCH-ELEMENT, MATCH-EPS, MATCH-CONCAT, MATCH-UNION and MATCH-WILD. Figure 5: Value Decomposition. Figure 6: Automaton Semantics — rule LABEL-TRANS.]
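To make the declarative rules of Figure 4 concrete, here is a naive OCaml sketch (ours, not the paper's implementation) that decides whether a value matches a pattern location by trying every decomposition. It restates the representation sketched earlier in compact form, takes the semantics of type variables as a predicate env, and assumes the well-formedness condition of Section 4.2 (no cycles except through elements), which guarantees termination.

    (* Values: either a foreign value or a sequence of named elements. *)
    type value = Foreign of string | Seq of (string * value) list
    type loc = int
    type component =
      | Element of string list * loc | Eps | Concat of loc * loc
      | Union of loc list | Star of loc | Wildcard | Var of string
    type pattern = { phi : (loc * component) list; root : loc }

    (* All ways to split a sequence into two consecutive parts (MATCH-CONCAT
       is non-deterministic, so we simply try them all). *)
    let splits f =
      let rec go acc = function
        | [] -> [ (List.rev acc, []) ]
        | x :: r -> (List.rev acc, x :: r) :: go (x :: acc) r
      in
      go [] f

    let rec matches p env v pi =
      match List.assoc pi p.phi, v with
      | Element (names, pi'), Seq [ (l, v') ] ->
          List.mem l names && matches p env v' pi'             (* MATCH-ELEMENT *)
      | Eps, Seq [] -> true                                    (* MATCH-EPS *)
      | Concat (p1, p2), Seq f ->                              (* MATCH-CONCAT *)
          List.exists (fun (f1, f2) ->
              matches p env (Seq f1) p1 && matches p env (Seq f2) p2)
            (splits f)
      | Union pis, _ -> List.exists (matches p env v) pis      (* MATCH-UNION *)
      | Star pi', Seq f -> matches_rep p env f pi'             (* MATCH-REP *)
      | Wildcard, _ -> true                                    (* MATCH-WILD *)
      | Var x, _ -> env x v                                    (* MATCH-ABSTRACT *)
      | _ -> false
    and matches_rep p env f pi' =
      match f with
      | [] -> true                                             (* MATCH-REP-EPS *)
      | _ ->
          (* peel off a non-empty prefix matching pi', then repeat on the rest *)
          List.exists (fun (f1, f2) ->
              f1 <> [] && matches p env (Seq f1) pi' && matches_rep p env f2 pi')
            (splits f)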

      EPS-TRANS CCEPT A MATCH-ABSTRACT MATCH-REP σ ; σ v ∈ σ σ ↓ v vX φ(π)=X f π φ(π)=π∗ ∗ v ∈ σ v ∈ σ vπ fπ Figure 6. Automaton Semantics v ∈ σ ATCH EP NCE MATCH-REP-CONCAT M -R -O MATCH-REP-EPS fπ f∗ πf∗ π ∗ π f π f,f  π ∗ ∗ of patterns, we now define a notion of tree automata, which we call bidirectional automata. These automata are used in particular Figure 4. Matching a Value against a Pattern vπ to specify which patterns should be rejected. They capture the idea that a value is matched from its root and that a sequence is matched one element at a time from its extremities. Still, some location in the pattern. They can thus be used to define which value freedom is left over the implementation. In particular, the automata to associate to each binder during pattern matching. do not mandate any specific strategy (such as left to right) for We choose to use a non-deterministic semantics: there may the traversal of sequences. This is achieved thanks to an original be several ways to match a value against a given pattern. The feature of the automata: at each step of their execution, the matched reasons are twofold. First, this yields much simpler specifications sequence may be consumed from either side. This symmetry in the and algorithms. Second, we don’t want to commit ourselves to a definition of automata results in symmetric restrictions on patterns: particular semantics. Indeed, we may imagine that the programmer if a pattern is disallowed, then the pattern obtained by reversing is allowed to choose between different semantics, such as a first- the order of all elements in all sequences is also disallowed. We match policy (Perl style) or a longest match policy (Posix style). believe this is easier to understand for a programmer. Additionally, Our algorithms will be sound in both cases, without any adaptation this feature is a key ingredient for our type inference algorithm. needed. Formally, a bidirectional automaton is composed of 4.5 Types • a finite set Σ of states σ; A pattern specifies a set of values: the set of values which matches • an initial state σ0 ∈ Σ; this pattern. So, patterns can be used as types. More precisely, we • asetoflabeled transitions σ −→ δ L,σ,σ; define a type as a pattern (Π,φ,π0, B) with no wildcard  (that is, • asetofepsilon transitions σ ; σ; φ(π) is different from the wildcard  for all locations π ∈ Π)and • σ ↓ v no binder (the relation B is empty). an immediate acceptance relation . The wildcard has a somewhat ambiguous status: it stands for The transitions are annotated by a tag δ ∈{l, r} which indicates any value when not in a sequence, but only stands for sequence on which side of the matched sequence they take place: either on values when it occurs inside a sequence. For instance, the values the left (tag l) or on the right (tag r). The semantics of automata accepted by a pattern ,P are not the concatenations of the values is given in figure 6: the relation v ∈ σ specifies when a value v accepted by pattern and pattern P, as some values in pattern can- is accepted by a state σ of the automaton. A value is accepted not be concatenated. Due to this ambiguous status, type inference by a whole automaton if it is accepted by its initial state σ0.The would be more complicated if the wildcard pattern was allowed in rule LABEL-TRANS states that, starting from a goal v ∈ σ,a δ types. 
labeled transition σ −→ L,σ1,σ2 may be performed provided that v δ l 4.6 Subtyping the value decomposes itself on side into a element with name and contents v1 followedbyavaluev2 (value decomposition is We define a subtyping relation <: in a semantic way on the loca- l 0 specified in figure 5). The name must furthermore be included tions π1 ∈ Π1 and π2 ∈ Π2 of two patterns P1 =(Π1,φ1,π1, B1) L v1 ∈ σ1 0 in the name set . One then gets two subgoals and and P2 =(Π2,φ2,π2, B2) by π1 <: π2 if and only if, for all typ- v2 ∈ σ2.TheruleEPS-TRANS moves to another state of the ing environments and for all values v, the assertion vπ1 implies automaton while remaining on the same part of the value. Usually, the assertion vπ2. Two patterns are in a subtype relation, writ- automata have a set of accepting states, which all accept the empty ten P1 <: P2, if their root locations are. The actual algorithmic sequence . Here, we use an accepting relation, so that a state subtyping relation used for type checking does not have to be as may accept whole values at once (rule ACCEPT). This is necessary precise as this semantics subtyping relation. This will simply result to deal with type variables X that match a possibly non-regular in a loss of precision. set of values and with foreign values e which are not sequences. The use of an epsilon transition relation simplify the translation 4.7 Bidirectional Automata and Disallowed Matchings from patterns to automata. It also keeps the automata smaller. In the previous section, the semantics of patterns is specified in Indeed, eliminating epsilon transitions may make an automaton a declarative way. In order to clarify the operational semantics quadratically larger. Note that our automata are non-deterministic.
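A possible OCaml rendering of these automaton states (our own names; the paper leaves the representation abstract) is sketched below: a state is a finite sequence of pattern locations, possibly wrapped as non-binding repetitions or non-binding wildcards, and every transition carries the side tag δ ∈ {l, r}.

    type loc = int
    type side = L | R                        (* the tag delta: left or right end *)

    (* One item s of a state: a pattern location pi, a non-binding repetition
       *pi, or the non-binding wildcard. *)
    type item = Single of loc | Rep of loc | Wild

    (* A state sigma is a finite sequence [s; ...; s]; the initial state is
       [Single root]. *)
    type state = item list

    (* Labeled transition: consume, on side delta, an element whose name lies in
       the name set; match its contents against one state and the remaining
       sequence against the other. *)
    type labeled_transition = {
      source : state;
      tag : side;
      names : string list;
      contents : state;
      rest : state;
    }

    (* Epsilon transitions rewrite a state on one side without consuming input;
       the immediate-acceptance relation (a state accepts a whole value at once)
       is left abstract here, since it depends on the typing environment. *)
    type epsilon_transition = { eps_source : state; eps_tag : side; eps_target : state }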

[Figure 7 (excerpt): example of bidirectional automaton. Figure 8 (excerpt): epsilon-transition rules DEC-EPS, DEC-CONCAT, DEC-UNION, DEC-REP and DEC-WILDCARD.]

[Figure 7: Example of Bidirectional Automaton. Figure 8 (excerpt): rules DEC-REP-LEFT, DEC-REP-RIGHT and DEC-REP-EPS.]

      Not all patterns could be translated into deterministic automata as DEC-WILDCARD-EPS top down deterministic tree automata are strictly less powerful than [3] ;δ [] non-deterministic ones [5]. An example of bidirectional automaton is depicted in figure 7. DEC-LEFT DEC-RIGHT This automaton recognizes the sequence a[],b[]. It has four states σ ;l σ σ ;r σ Σ={ab,a,b,} ab . The initial state is circled. The labeled l r δ σ ; σ ; σ ; σ transitions are all of the form σ −→ L,,σ and are represented σ; σ ; σ ; σ by an arrow from state σ to state σ with label the pair δ, L.There is no epsilon transition. The acceptance relation, not represented, is Figure 8. Epsilon Transitions σ ;δ σ reduced to ↓ . In order to recognize the sequence a[],b[], one can first consume a[] from the left and then the remaining part b[] from either side, or consume the part b[] before the part a[]. φ(π)=L[π] The automata we build below satisfy some commutation prop- [3] −→δ  ,[3],[3] erties, which ensure that the strategy used to match a value is not [π] −→δ L,[π],[] important. For instance, one can choose to consume values only from the left, or only from the right, or any combination of these l r two strategies. In all cases, the set of accepted values remain the σ −→ L,σ1,σ2 σ −→ L,σ1,σ2 l r same. We do not state these properties. σ; σ −→ L,σ1,(σ2; σ ) σ ; σ −→ L,σ1,(σ ; σ2) We now specify the translation of a pattern (Π,φ,π0, B) into an automaton. This translation is inspired by some algorithms by δ Hopkins [10] and Antimirov [2] for building a non-deterministic Figure 9. Labeled Transitions σ −→ L,σ,σ automaton using partial derivatives of a regular expression. The way we apply the same operations symmetrically on both sides of a pattern is inspired by Conway’s factors [6, 18]. At the root vX φ(π)=X []↓ [3] ↓ e of all these works is Brzozowski’s notion of regular expression [π] ↓ v derivatives [3]. The key idea for the translation is that each state corresponds to a regular expression that exactly matches what is accepted by the Figure 10. Immediate Acceptance Relation σ ↓ v state, and a transition corresponds to a syntactic transformation of a regular expression into the regular expression of the next state. defined in figure 8, 9, and 10. The assertion σ ; σ holds when In our case, one may have expected pattern locations to take the l r role of regular expression. As they are not flexible enough, we either assertion σ ; σ or σ ; σ holds. Note that the definition actually use finite sequences of pattern locations (plus some non- of the immediate acceptance relation depends on the typing envi- binding variants). Thus, a state σ of the automaton is defined by the ronment. following grammar. As there is an infinite number of sequences σ,wedefinethe finite set of states Σ of the automaton as the set of sequences s ::= π single pattern reachable from the initial state [π0] of the automaton through the ∗π non-binding pattern repetition 3 transitions. The following lemma states that we define this way a non-binding wildcard finite automaton. σ ::= [s; ...; s] pattern sequence LEMMA 1 (Finite Automaton). The number of states σ reachable We write []for the empty pattern sequence, and σ; σ for the con- from the initial state [π0] through the transition relations is finite. catenation of the pattern sequences σ and σ. The intuition behind non-binding variants is the following. Suppose we match a value The number of states can however be exponential in the size of the a[],a[],a[] against a pattern A*. 
As we will see, this pattern re- pattern due to sharing. A typical example is the type definitions duces to something akin to A,A* by epsilon transitions. According below. to the semantics of patterns, the beginning of the value a[] is in- type T=a[],a[] and U=T,T and V=U,U deed bound to the location of subpattern A, but the remaining part a[],a[] is not bound to any location. Thus, the subpattern A* does We expect the state of the automata to be reasonable in practice. not correspond to a pattern location, but rather to a repetition of the Indeed, for patterns without sharing of locations, the bound is much location of the subpattern A. better: it is quadratic in the size of the pattern. The initial state of the automaton is the sequence [π0] containing An example of translation is given in figure 11. The pattern only the root π0 of the pattern. The epsilon transitions, the labeled a[],b[] is represented using the same notation as in figure 1. For transitions and the immediate acceptance relation are respectively the sake of simplicity, we do not represent the part of the automa-

[Figure 11 (excerpt): the pattern a[],b[] and its translation. Figure 13 (excerpt): rules MATCH-SEQ-SINGLE, MATCH-SEQ-STAR, MATCH-SEQ-EPS, MATCH-SEQ-CONCAT and MATCH-SEQ-WILDCARD for matching a value against a pattern sequence.]

[Figure 13: Matching a Value against a Pattern Sequence. Automaton states [1], [2; 4] and [] from the translation of Figure 11.]

      r,b r,a gorithmic. Still, we are confident it can be understood intuitively [2] by a programmer. We now relate the semantics of a pattern to the semantics of its translation into an automaton. It is convenient to first extend the Figure 11. Pattern and its Translation (Simplified) semantics of pattern locations to pattern sequences (figure 13). We then have the following result.

      LEMMA 2. A value is matched by a pattern if and only if it is φ(π) is a test σ ; σ v  σ matched by the corresponding automaton, as long as the matching is allowed: if v  [π0] does not hold, then v ∈ [π0] if and only if e  [π] v  σ vπ0. More generally, for any value v and any state σ such that v  σ does not hold, we have v ∈ σ if and only if vσ. δ δ σ −→ L,σ1,σ2 v −→ l,v1,v2 The restriction to allowed matchings is important. Indeed, consider lL v1  σ1 the pattern (),_. It matches only sequences but it is translated into v  σ an automaton that matches everything, as the empty sequence is eliminated by epsilon transitions (rule DEC-CONCAT followed by δ δ σ −→ L,σ1,σ2 v −→ l,v1,v2 rule DEC-EPS together with rule DEC-LEFT). lL v2  σ2 Our automata are actually designed for analyzing patterns rather v  σ than for being executed. They make it possible to focus on a partic- ular part of a pattern by consuming subpatterns from both sides. For instance, if we have a pattern A,B,C, we can focus on B by consum- v  σ Figure 12. Disallowed Matching ing A on the left and C on the right. Thus, type inference can be per- formed by consuming a type and a pattern in a synchronized way in order to find out which parts of the type corresponds to which parts ton corresponding to the element contents. The initial state of the of the pattern. For instance, if we have a type a[],T,b[] and a pat- automaton is the sequence [1]. An epsilon transition yields from tern a[],(x : ),b[], we can compute that the type of the vari- this state to the state [2; 4]. Then, values can either be consumed able x is T by simultaneously consuming the elements a[] and b[] from the left or from the right. The first case correspond to a tran- of the type and the pattern. For this to work, it must be possible to sition [2; 4] −→ l a,[3],[4], depicted as an arrow from state [2; 4] to associate a state to each part of a value matched by a pattern. As state [4] with label l,a. a consequence, there is a slight mismatch between our definition Some matchings of a value against a pattern should not be and what should be an actual implementation of patterns. First, the allowed, either because they are not implementable, or because rule for type variables in the definition of the acceptance relation is they would break abstraction. As the automata describe the opera- important for analyzing patterns but would not be used in an actual tional semantics of patterns, they are the right tool to specify which implementation, where matching against a type variable should al- [3] matchings should be rejected. This disallowed matching relation ways succeed. Second, when in state , only foreign values are v  σ is defined in figure 12. Automaton matching can be viewed immediately accepted while sequence values are progressively de- as a dynamic process: for matching a value v against a pattern se- composed. Thus is crucial for type inference but cannot be imple- quence σ, we start from the assertion v ∈ σ and try to consume mented: foreign values cannot be tested and thus an implementation the whole value by applying repeatedly the rules in figure 6. We cannot adopt a different behavior depending on whether a value is a should never arrive in a position where a test needs to be performed sequence or a foreign value. A simple change is sufficient to adapt [3] on an external value. 
Therefore, in the definition of the disallowed the automaton: make the state accept any value and remove any matching relation, there is one rule corresponding to epsilon transi- transition from this state. Note that this change does not affect the tions and two rules corresponding to labeled transitions, depending disallowed matching relation. on whether the failure occurs in the element contents or in the se- quence but outside this element. The last rule corresponds to an 4.8 Pattern Matching Construction immediate failure, where a test is performed on an external value. We can now complete our specification of pattern matching. We The following pattern components are tests: L[π], , (π, π),andπ∗. are only interested in how a value is matched in a pattern matching Basically, a test is a pattern component that only accepts sequences. construction: which branch is selected and which values are associ- For this last rule, we only need to consider the case when the pat- ated to the binders in this branch. We do not consider what happens tern sequence contains a single pattern location. Indeed, one can afterwards. Thus, we can ignore the body of each branch of the easily show that the only way to arrive to a sequence which is not construction and can formalize a pattern matching construction as of this form is through epsilon transitions, starting from a sequence a list of patterns. It turns out to be convenient to share between all of this form. This specification of disallowed matchings is quite al- patterns a set of pattern locations and a mapping from pattern lo-

      55 cation to pattern components. Therefore, a pattern construction is • preservation of the semantics: for all typing environments and characterized by: for all values v in T ,thevaluev is matched the same way by each pattern Pi and its erasure; • a set of pattern locations Π; • a mapping φ :Π→C(Π); By “matched the same way”, we mean that, if there is a derivation of vπi in one of the patterns Pi, then there must be an identical • afamily(πi) of root pattern locations (πi ∈ Π); derivation in the erasure of pattern Pi (except for applications of • (B ) B ⊆X×Π a family i of binder relations ( i ) rule MATCH-ABSTRACT which should be replaced by applications of rule MATCH-WILD), and conversely. Algorithms for perform- The i-th pattern is defined as Pi =(Π,φ,πi, Bi). For instance, the pattern construction in the body of the function map presented ing all these checks are presented section 5. The linearity check in section 4.2 can be specified by reusing the corresponding def- algorithm is actually omitted as it is standard and its presentation is long. initions of the set Π and mapping φ and defining (πi) and (Bi) by 4.9 Type Inference π1 =1 B1 = ∅ π =2 B = {( , 4), ( , 5)} An additional operation we are interested in is type inference: 2 2 x rem we want to compute for each binder a type which approximates In order to type-check a pattern construction, the type T of the the set of values that may be bound to it. From the semantics values that may be matched by the pattern must be known. In of the pattern matching construction above, we can derive the our example, this type is List{X}, which can be represented as following characterization of this set of values. Consider an input 0 0 apattern(R, ψ, 1, ∅) with type (Π1,φ1,π1, ∅) and a pattern (Π2,φ2,π2, B2). Then, a value v x v R = {1, 2, 3} may be bound to a binder if there exists a value 0 and a location π ∈ Π2 such that: 0 and • v0 π1 (the value v0 belongs to the input type); • 0 ψ(1) = 2∗ ψ(2) = element[3] ψ(3) = X. v0 π2 (the value is matched by the pattern); • (x, π) ∈B2 (the binder x is at location π in the pattern); The semantics of pattern matching is as follows. Given a 0 • there exists a derivation of v0 π2 containing an occurrence of value v0 belonging to the input type T ,apatternPi is chosen such the assertion vπ(the assertion vπis a submatch). that the value v0 matches the root location πi of the pattern, that is, so that there exists a derivation of v0 πi. We then consider all Several algorithms for precise type inference have been pro- submatches, that is, all assertions vπwhich occur in this deriva- posed [7, 13, 20]. These algorithms are tuned to a particular match- tion. This defines a relation M between locations and values. The ing policy (such as the first-match policy). With these algorithms, composition M◦B= {(x, v) |∃π.(x, π) ∈B∧(π, v) ∈M}of the semantics of the type computed for a binder is exactly the set of this relation with the binder relation B is then expected to be a total values that may be bound to it. (As binders are considered indepen- function from the set of binders of the pattern to parts of value v. dently, any correlation between them is lost, though.) For instance, This function indicates which part of the value v is bound to each let us consider the following function. binder x. In order to ensure that this matching process succeeds for any fun f (x : (a[] | b[] | c[])) = value of the input type T , the following checks must be performed: match x with b[] ---> ... 
• exhaustiveness: for all typing environments and for all values v | y : (a[] | b[] | d[]) ---> ... in the input type T , there must exists a pattern Pi such that the |_ --->... value v matches the root location πi of this pattern; A precise type algorithm infers the type a[] for the binder y.In- • v linearity: for all typing environments, for all values in the deed, values of type b[] are matched by the first line of the pat- T vπ π input type and for all derivations i where i is the tern. Therefore, only values of type the difference between type P M◦X root location of one of the patterns i, the composition a[]|b[]|c[] and type b[], that is, type a[]|c[] may be matched defined above must be a function. by the second pattern. Finally, the values matching the second pat- These two checks are standard [13]. In our case, two additional tern must also have type a[]|b[]|d[], hence their type is the in- checks must be performed. Indeed, some matchings are not allowed tersection of a[]|c[] and a[]|b[]|d[],thatis,a[]. Such a type in order to preserve abstraction and for the patterns to be imple- algorithm is implemented in CDuce and was initially implemented mentable. Furthermore, patterns are not implemented directly but in XDuce. Difference is costly to implement. Besides, though this is not only after erasure. We define the erasure of a pattern (Π,φ,π0, B) apparent in the example above, difference operations may need as the pattern (Π,φ ,π0, B) where: ( to be performed at many places in the pattern, especially when  if φ(π) is a type variable X binders are deeply nested. Hosoya proposed a simpler design [11], φ (π)= remarking that with a non-deterministic semantics (in other words, φ(π) otherwise. when the matching policy is left unspecified) no difference oper- An erased pattern is a pattern containing no pattern variable (that ation needs to be performed. An intersection operation still needs is, φ(π) is different from any variable X for all locations π in the to be performed, but only once per occurrence of a binder. So, in pattern). The semantics remain the one given above, but applied to our example, the second line of the pattern still matches values of the erased patterns. We thus have these two additional checks: type b[]. Therefore, the type of y is the intersection of the initial type a[]|b[]|c[] and the type a[]|b[],thatis,a[]|b[]. • allowed patterns: the erasure of each pattern Pi should be In our case, even the intersection operations must be avoided. allowed with respect to the input type T , that is, we must not Indeed, our types are not closed under intersection: for instance, have v  πi for any value v in the input type T , any erasure of there is no type that corresponds to the intersection of two type pattern Pi, and any typing environment. variables. Xtatic has the same issue [8, section 5.3]. The current

      56 implementation of Xtatic thus computes an approximation of the ROOT intersection. Another reason to avoid intersection is that it is not a 0 0 [π1]  [π2] syntactic operation on types in XDuce. Thus, in order to compute an intersection, types must first be translated to automata and the DEC-LEFT DEC-RIGHT intersection must be translated back from an automaton to a type. σ1 σ2 σ1 ; σ1 σ1 σ2 σ2 ; σ2 In the process, the type may become more complex. In the worst σ σ σ σ case, the size of the intersection of two automata is quadratic in the 1 2 1 2 size of these automata. Also, some type abbreviations may be lost ENTER during the successive translations. σ1 σ2 What we propose is to infer types not for binders but for wild- δ L 1 L σ −→ L ,σ ,σ cards _ and compute the type of binders by substitutions. The key 1 2 1 1 1 1 δ idea is that the intersection of a type with a wildcard is the type it- σ1 ↔ σ2 σ2 −→ L2,σ2,σ2 self. Thus, no intersection is actually needed. Consider for instance σ1 σ2 the function below. fun g (x : (a[],b[])) = SHIFT σ σ match x with 1 2 δ y:(_,(b[]|c[])) ---> ... L1 1 L2 σ1 −→ L1,σ1,σ1 δ The type inferred for the wildcard is a[]. Thus, by substitution, the σ1 ↔ σ2 σ2 −→ L2,σ2,σ2 type inferred for the binder y is a[],(b[]|c[]). We deliberately σ1 σ2 gave an example for which the inferred type is not precise, in order to emphasize the difference with other specifications of type infer- Figure 14. Type Propagation σσ ence. We expect this weaker form of type inference to perform well in practice. In particular, type inference is still precise for wild- cards (assuming a non-deterministic semantics). When needed, the programmer can provide explicitly a more precise type. We exper- imented with the examples provided with the XDuce distribution. 5.1 Exhaustiveness Only some small changes were necessary to get them to compile. The input of the algorithm is the input type T =(R, ψ, π, ∅) What we actually had to do was to replace by wildcards some ex- and the different patterns Pi =(Π,φ,πi, Bi) of the pattern plicit types which were not precise enough. construction. We define the union of the patterns Pi by P = More formally, we define the semantics of a location π of the (Π ∪{ },φ, ,∅) where the location is assumed not to be in Π pattern as the set of values v such that there exists a value v0 such 0 0 and the mapping φ is such that φ ( )=π1 ∪ ...∪ πn (the union that the assertions v0 π1 and v0 π2 holds and the assertion 0 of all root locations) and φ (π)=φ(π) for π ∈ Π One can easily vπis a submatch of a derivation of v0 π2. The type inference prove that the semantics of the pattern P is the union of the seman- algorithm then consists in computing for each location correspond- tics of the patterns Pi. Then, the pattern is exhaustive if and only if ing to a wildcard a type whose semantics is the semantics of this T<: P . location and substituting this type in place of the wildcard. The sub- Note that the union construction above can be applied to any fi- stitution may not preserve pattern well-formedness. In this case, the nite set of patterns sharing a common mapping φ. This construction type checking fails. But we believe this is unlikely to occur in prac- is also used for type inference (section 5.6). tice, as this can only happen when a wildcard location is shared in two different contexts. For instance, consider the type a[X] and the pattern a[Q],Q where Q = . 
The type inferred for the wildcard is 5.2 Type Propagation X|() and substituting this type does not preserve well-formedness. If the resulting pattern is well-formed, then it is a type: it does not This algorithm propagates type information in a pattern. It is used contain any wildcard. The type of a binder is the type correspond- both for checking whether a pattern is allowed (section 5.4) and for type inference (section 5.6). The input of the algorithm is composed ing to the union of the locations the binder is associated to. 0 0 of two patterns P1 =(Π1,φ1,π1, B1) and P2 =(Π2,φ2,π2, B2) σ ↔ σ σ σ 5. Algorithms on Patterns and a relation 1 2 (where the sequences 1 and 2 range over the states of the automata associated respectively to P1 and P2). We define a number of algorithms for type checking and type The relation controls when the type information is propagated inference for patterns. Each of these algorithms is specified in an across an element. The algorithm is defined in figure 14 as a relation abstract way, by defining a relation over a finite domain using σ1 σ2. The roots of the two patterns are related (rule ROOT). inductive definitions. Actually implementing them is a constraint The relation is preserved by epsilon transition (rules DEC-LEFT solving issue. Standard techniques can be used, such as search and DEC-RIGHT). The rules ENTER and SHIFT specify how the pruning (when an assertion is either obviously true or obviously relation is propagated to the contents of an element and aside an false), memoization (so as not to perform the same computation element. several times), and lazy computation (in order not to compute Though the rules are symmetric, the algorithm is used in an unused parts of the relation). asymmetric way. One of the pattern is actually always a type and The size of the finite domain provides a bound on the complex- the algorithm can be read as propagating type information derived ity of the algorithm. We don’t study the precise complexity of these from this type into the other pattern. Besides, we are not interested algorithms, as we believe this would not be really meaningful. In in computing the whole relation σ1 σ2. Rather, for some given particular, the complexity of all these algorithms is polynomial in sequences σ1, the set of pattern sequences σ2 such that σ1 σ2 the sizes of the automata associated to the patterns it operates on, must be computed. but these sizes can be exponential in the size of the patterns. Our As the algorithm is defined as a binary relation σ1 σ2 over experience on the subject leads us to believe that the algorithms the states of the automata associated to the patterns P1 and P2,itis should perform well in practice. quadratic in the size of these automata.
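The union construction used by the exhaustiveness check can be pictured with a small OCaml sketch (our own helper names, with the other pattern components elided): a fresh location is added and mapped to the union of all the root locations, while the shared mapping φ is kept unchanged.

    type loc = int
    type component = Union of loc list | Other   (* other components elided *)
    type pattern = { phi : (loc * component) list; root : loc }

    (* Build the union P of patterns P1 ... Pn sharing the same mapping phi:
       one fresh location, mapped to the union of their root locations. *)
    let union_pattern (ps : pattern list) : pattern =
      let shared_phi = match ps with p :: _ -> p.phi | [] -> [] in
      let fresh = 1 + List.fold_left (fun m (l, _) -> max m l) 0 shared_phi in
      { phi = (fresh, Union (List.map (fun p -> p.root) ps)) :: shared_phi;
        root = fresh }

    (* The pattern construction is then exhaustive iff  T <: union_pattern ps,
       for the subtyping relation of Section 4.6 (not implemented here). *)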

[Figures 15 and 16 (excerpts): inference rules for the non-emptiness of pattern locations and pattern sequences, and for the non-disjointness relation between automaton states.]

      l φ(π)=π1,π2 π1 π2 σ1 1 σ2 σ1 −→ L1,σ1,σ1 l π σ1 1 σ2 σ2 −→ L2,σ2,σ2 L1 1 L2 φ(π)=π1 ∪ ...∪ πn πi φ(π)=π ∗ σ1 1 σ2 π π φ(π)=X []1 [] φ(π)= φ(π)=X [π] 1 [3] π π σ 1 σ Pattern sequence Figure 16. Non-Disjointness π σ σ  []  [∗π]  [3] 1 2  [π] σ1; σ2 THEOREM 4. If the pattern P is disallowed with respect to the type T , then there exists two locations π1 ∈ Π1 and π2 ∈ Π2 φ(π ) X π Figure 15. Non-Emptiness πand σ such that 1 is a type variable , the location 2 is a test (as defined in section 4.7), and [π1]  [π2]. The converse holds in any typing environment such that for all type variables X there exists a foreign value e such that eX. 5.3 Type Non-Emptiness The restriction in the converse case ensures that the relation σ In order to check whether a pattern is allowed (section 5.4), it turns really coincide with the non-emptiness of the sequence σ.Italso out that we need an algorithm to decide whether, given a pattern ensures that if φ(π1)=X and π2 is a test, then there exists a P =(Π,φ,π , B) π 0 , the semantics of a pattern location or a foreign value e such that e  π2. From this and [π1]  [π2], one pattern sequence σ (belonging to the set of states of the automaton can then show that the whole pattern is disallowed. associated to pattern P ) is empty, that is, whether there exists a Ana¬õve implementation of the algorithm would compute all v vπ vσ value such that or . These algorithms are defined pairs of locations π1 and π2 such that [π1]  [π2] and then π σ in figure 15 as two relations and . Their properties can be checks whether there exists a pair for which both φ(π1) is a type stated as follows. variable X and the location π2 is a test. This implementation would have the same complexity as the type propagation algorithm. An LEMMA 3. Let π be a location in pattern P and σ be a state of the immediate optimization is to stop propagating type information P v automaton associated to pattern . If there exists a value such whenever a type location π1 with no type variable below it is that vπ,thenπ. Likewise, if there exists a value v such that reached. vσ,thenσ. The converse holds in any typing environment such that for all type variables X there exists at least a value v such that 5.5 Non-Disjointness vX. The type inference algorithm relies on an algorithm that, given a type The proof of the lemma is straightforward. The reason for the 0 restriction in the converse case can be seen on the last rule in T =(Π1,φ1,π1, B1) figure 15: if φ(π)=X,thenwehaveπ. We thus need to ensure andanerasedpattern v vX that there exists a value such that . 0 The inference rules define a relation π over the finite set of P =(Π2,φ2,π2, B2), Π P pattern locations in pattern . Each rule can be implemented in decides the non-disjointness of the semantics of two pattern se- constant time. Hence, computing the relation for all locations in quences σ1 and σ2 belonging to the states of the automata asso- P a pattern can be done in linear time in the size of the pattern . ciated respectively to the type T and the pattern P . The pattern P σ Likewise, the relation can be computed in linear time in the is assumed to be allowed with respect to type T . The algorithm is P size of the automaton associated to the pattern . defined in figure 16 as a relation σ1 1 σ2. It is based on a standard algorithm for checking non-disjointness of tree automata, with a 5.4 Disallowed Pattern special case for type variables X. Note that only transitions with 0 l The algorithm checking whether a pattern P =(Π2,φ2,π2, B2) tag are used. 
The relation would remain unchanged if this restric- 0 is allowed with respect to an input type T =(Π1,φ1,π1, B1) is tion was removed. σ 1 σ based on an instance of the type propagation algorithm (section 5.2) The intended semantics of the algorithm is that 1 2 if v vσ vσ applied to the type T and the pattern P . For this instance, we take and only if there exists a value such that 1 and 2. σ σ σ1 ↔ σ2 iff σ1. Intuitively, if we have a type L[T1],T2 and However this does not hold for arbitrary sequences 1 and 2.The apatternL’[P1],P2 such that the sets L and L’ are not disjoint, completeness of the algorithm can be stated as follows. the type information T1 should be propagated in the pattern P1,but LEMMA 5 (Completeness). Let σ1 and σ2 be two states of the only if there is indeed a value of type L[T1],T2, thus in particular automata associated respectively to the type T and the pattern P . only if the semantics of type T2 is not empty. On the other hand, We assume that: as the implementation of the automaton may try to match a value • against the subpattern P1 before considering the subpattern P2, the pattern P is allowed with respect to type T; nothing should be assumed about P2. The algorithm relies on the • σ1 σ2 (where this relation is the instance defined in sec- following theorem. tion 5.4 for checking for disallowed patterns);

      58 • the typing environment is such that for all type variables X σ1 σ2 can be improved by stopping the propagation of type there exists at least a value v such that vX. information whenever one reach a pattern sequence σ2 with no location whose type needs to be inferred below it. Then, if σ1 1 σ2, there exists a value v such that vσ1 and vσ 2. 5.7 Preservation of the Semantics The condition σ1 σ2 together with the allowed pattern condition The input of the type inference algorithm is a type [π] ensures that, for instance, one never considers the type sequence 0 with φ(π)=X against the pattern sequence [3; 3] for which the T =(Π1,φ1,π1, B1) algorithm may give a wrong answer: one has [π] 1 [3; 3] as the [3; 3] [3] and a pattern sequence reduces by epsilon transition to the sequence , 0 but there is not reason for the sequences [π] and [3; 3] to share P =(Π2,φ2,π2, B2). a common value. The constraint on typing environments can be The algorithm is as follows. For each location π in the pattern P easily understood by looking at the rule concerning type variables. corresponding to a type variable X (that is, φ1(π)=X), a type The soundness property is hard to state. Rather than defining T is computed using the type inference algorithm on the era- precisely when it holds, which would involve defining an additional sure of pattern P . We then compare this type with the part of the complex relation, we use the following somewhat imprecise state- pattern P corresponding to location π, that is, the pattern P = ment. (Π2,φ2,π,B2). The semantics is preserved if for all such loca- tions π we have T <: P . The soundness of the algorithm relies LEMMA 6 (Soundness). Let σ1 and σ2 be two sequences of the automaton respectively associated to the input type and the input on the following theorem. σ 1 σ pattern. In any position where the relation 1 2 is used in THEOREM 8. The semantics of the pattern is preserved by erasure the type inference algorithm below (section 5.6), if there exists a if and only if for any pattern location π2 such that φ(π2)=X (for v vσ vσ σ 1 σ value such that 1 and 2,then 1 2. any type variable X) and any type location σ1 such that σ1 π2 P σ <:[π ] The algorithm is quadratic in the size of the automata associated to (by type inference on the erasure of pattern ), we have 1 2 . the type T and the pattern P . 6. Related Works 5.6 Type Inference As mentioned in the introduction, we presented in a previous pa- The input of the type inference algorithm is a type per [21] a calculus dealing with polymorphism for regular tree 0 T =(Π1,φ1,π1, B1) types. Values are binary trees rather than sequences of elements. It is straightforward to translate sequences into binary trees by rep- andanerasedpattern resenting an element contents as a node whose first child is its con- 0 tents and second child is its right sibling. A similar translation can P =(Π2,φ2,π2, B2). be defined to some extent for patterns. But there is a number of The pattern P is assumed to be allowed with respect to the type T . restrictions. In particular, wildcards and binders should only occur The algorithm is based on an instance of the type propagation in tail position. The present paper deals directly with sequences, algorithm (section 5.2) applied to the type T and the pattern P .For which makes it possible to avoid these restrictions. Additionally, this instance, we take σ1 ↔ σ2 iff σ1 1 σ2. 
Intuitively, if we have we specify patterns in a more precise way: we believe very few ex- a type L[T1],T2 and a pattern L’[P1],P2, the type information T1 tensions to patterns, besides support for XML attributes, would be should be propagated in the pattern P1, but only if there is indeed a necessary for a realistic implementation. value of the whole type matched by the whole pattern. In particular, Hosoya, Frisch and Castagna have also proposed an extension there should be a value shared by the type T2 and the pattern P2. of XDuce with polymorphism [12], now implemented in the latest We rely on the following result. release. In their work, type variables range over sets of basic XDuce values rather than over sets of arbitrary values. This results in THEOREM 7. If a value v is included in the semantics of a loca- design decisions which are drastically different from ours. For tion π2 of the pattern, then there exists a sequence σ1 such that instance, they consider that pattern matching on values whose type σ1  [π2]. The converse holds in any typing environment such is a type variable is possible (as the structure of all such values can that for all type variables X there exists at least a value v such that be explored by pattern matching), while we consider that this would vX. break abstraction. They can deal with bounded quantification. On The type inferred for a subpattern π2 should thus corresponds to the other hand, it is not clear how to extend their work to deal with the union of the sequences σ1 such that σ1  [π2]. One can show foreign types and higher-order functions. that for any such sequence σ1, as it belongs to the set of states of Sulzmann and Lu propose to use a structured representation of an automaton associated to a type, one can build a type T1 with XDuce values [19] and interpret subtyping as a runtime coercion. the same semantics. Then, the type inferred for the subpattern π2 As types reflect the structure of values, they do not have the issue can be build by taking the union of these types, as defined in of concatenating foreign values: the values of type A,B are pairs, section 5.1. rather than concatenations of values of type A and type B.However, The algorithm is complete only when all type variables have a they may need to use algorithms similar to ours in order to ensure non-empty semantics. It may be possible to get a stronger result that pattern matching interact well with polymorphism. by extending our type system with conditional types [1]. But we Several type inference algorithms have been proposed for reg- believe this would unnecessarily complicate the type system. ular expression types. The first one [13], by Hosoya and Pierce, is An interesting feature of this algorithm is that it works directly precise (assuming a first-match policy) but can infer a type only for on the syntax of patterns and types. In particular, the inferred binders in tail position in the pattern. Hosoya later proposed a sim- type is build from the input type using only simple operations pler design [11], corresponding to a non-deterministic semantics (concatenation and union). for patterns, where this restriction was removed. Both algorithms As in the case of the “disallowed pattern” check (section 5.4), use a translation of types and patterns into tree automata. Several ana¬õve implementation which would compute the whole relation algorithms for precise type inference for different match policies

      59 have also been presented by Vansummeren [20]. They work di- [21] J. Vouillon. Polymorphic regular tree types and patterns. In Pro- rectly on the syntax of patterns but require complex operations on ceedings of the 33th ACM Conference on Principles of Programming types such as intersection and difference. Languages, Charleston, USA, Jan. 2006. To appear. Available from CDuce [4] has some extensive support for importing functions http://www.pps.jussieu.fr/~vouillon/publi/. from OCaml. Contrary to what we propose in section 3, their extension relies on a runtime translation of ML values into CDuce values according to their types.
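The relation of Figure 16 is, as stated in Section 5.5, based on the standard non-disjointness test for tree automata. The following OCaml sketch illustrates only that standard test, over an assumed and deliberately simplified bottom-up automaton representation (leaf rules a → q and node rules a(q1, q2) → q); it is not the algorithm on pattern and type locations developed in the paper, and every name in it is hypothetical.

    (* Assumed, simplified bottom-up tree automata over binary trees. *)
    type 'q automaton = {
      leaf_rules : (string * 'q) list;             (* label, target state *)
      node_rules : (string * 'q * 'q * 'q) list;   (* label, left, right, target *)
    }

    (* Non-disjointness of state q1 of a1 and state q2 of a2: is there a tree
       accepted from both?  Least fixpoint over pairs of states, in the style
       of the product construction. *)
    let non_disjoint a1 a2 q1 q2 =
      let inhabited = Hashtbl.create 64 in
      let changed = ref true in
      while !changed do
        changed := false;
        (* leaf case: both automata accept the same one-node tree *)
        List.iter (fun (l1, p1) ->
            List.iter (fun (l2, p2) ->
                if l1 = l2 && not (Hashtbl.mem inhabited (p1, p2)) then begin
                  Hashtbl.add inhabited (p1, p2) (); changed := true
                end)
              a2.leaf_rules)
          a1.leaf_rules;
        (* node case: combine already inhabited pairs under a common label *)
        List.iter (fun (l1, a, b, p1) ->
            List.iter (fun (l2, c, d, p2) ->
                if l1 = l2
                   && Hashtbl.mem inhabited (a, c)
                   && Hashtbl.mem inhabited (b, d)
                   && not (Hashtbl.mem inhabited (p1, p2)) then begin
                  Hashtbl.add inhabited (p1, p2) (); changed := true
                end)
              a2.node_rules)
          a1.node_rules
      done;
      Hashtbl.mem inhabited (q1, q2)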

      References [1] A. Aiken, E. L. Wimmers, and T. K. Lakshman. Soft typing with conditional types. In POPL ’94: Proceedings of the 21st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 163Ð173, New York, NY, USA, 1994. ACM Press. [2] V. Antimirov. Partial derivatives of regular expressions and finite automaton constructions. Theor. Comput. Sci., 155(2):291Ð319, 1996. [3] J. A. Brzozowski. Derivatives of regular expressions. J. ACM, 11(4):481Ð494, 1964. [4] CDuce Development Team. CDuce Programming Language User’s Manual. Available from http://www.cduce.org/ documentation.html. [5] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Ti- son, and M. Tommasi. Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata, 1997. release October, 1rst 2002. [6]J.H.Conway.Regular Algebra and Finite Machines. William Clowes and Sons, 1971. [7] A. Frisch, G. Castagna, and V. Benzaken. Semantic subtyping. In 17th IEEE Symposium on Logic in Computer Science, pages 137Ð146. IEEE Computer Society Press, 2002. [8] V. Gapeyev, M. Y. Levin, B. C. Pierce, and A. Schmitt. The Xtatic experience. In Workshop on Programming Language Technologies for XML (PLAN-X), Jan. 2005. University of Pennsylvania Technical Report MS-CIS-04-24, Oct 2004. [9] V. Gapeyev and B. C. Pierce. Regular object types. In European Conference on Object-Oriented Programming (ECOOP), Darmstadt, Germany, 2003. A preliminary version was presented at FOOL ’03. [10] M. W. Hopkins. Converting regular expressions to non-deterministic finite automata, May 1992. Newsgroup message on comp.theory. [11] H. Hosoya. Regular expression pattern matching — a simpler design. Technical Report 1397, RIMS, Kyoto University, 2003. [12] H. Hosoya, A. Frisch, and G. Castagna. Parametric polymorphism for XML. In POPL ’05: Proceedings of the 32nd ACM SIGPLAN- SIGACT sysposium on Principles of programming languages, pages 50Ð62. ACM Press, 2005. [13] H. Hosoya and B. C. Pierce. Regular expression pattern matching. In ACM SIGPLANÐSIGACT Symposium on Principles of Programming Languages (POPL), London, England, 2001. Full version in Journal of Functional Programming, 13(6), Nov. 2003, pp. 961Ð1004. [14] H. Hosoya and B. C. Pierce. XDuce: A statically typed XML processing language. ACM Transactions on Internet Technology, 3(2):117Ð148, May 2003. [15] H. Hosoya, J. Vouillon, and B. C. Pierce. Regular expression types for XML. In Proceedings of the International Conference on Functional Programming (ICFP), 2000. [16] D. MacQueen, G. Plotkin, and R. Sethi. An ideal model for recursive polymorphic types. Information and Control, 71(1-2):95Ð130, 1986. [17] M. Murata. Hedge automata: a formal model for XML schemata. http://www.xml.gr.jp/relax/hedge_nice.html, 2000. [18] G. Sittampalam, O. de Moor, and K. F. Larsen. Incremental execution of transformation specifications. SIGPLAN Not., 39(1):26Ð38, 2004. [19] M. Sulzmann and K. Z. M. Lu. A type-safe embedding of xduce into ml. In ACM SIGPLAN Workshop on ML, informal proceedings, Sept. 2005. [20] S. Vansummeren. Type inference for unique pattern matching. ACM Transactions on Programming Languages and Systems (TOPLAS), 2003. To appear.

Composing Monadic Queries in Trees

Emmanuel Filiot (INRIA Futurs, Lille), Joachim Niehren (INRIA Futurs, Lille), Jean-Marc Talbot (LIFL, Lille), Sophie Tison (LIFL, Lille)

Abstract

Node selection in trees is a fundamental operation to XML databases, programming languages, and information extraction. We propose a new class of querying languages to define n-ary node selection queries as compositions of monadic queries. The choice of the underlying monadic querying language is parametric. We show that compositions of monadic MSO-definable queries capture n-ary MSO-definable queries, and distinguish an MSO-complete n-ary query language that enjoys an efficient query answering algorithm. Moreover, our language allows composing different monadic query languages, for instance a monadic query defined by a Datalog program with another monadic query defined by an XPath node formula.

Compositions of monadic MSO-definable queries are relevant to information extraction. They might be useful to approach an open question in the context of the Lixto system [6], of how to enhance such a system by machine learning techniques. Given a composition formula, and examples of n-tuples that are to be selected, one can use existing learning algorithms for monadic MSO-definable queries [3] in order to infer n-ary MSO-definable queries.

1 Introduction

Node selection in trees [12] is a fundamental operation to XML databases, programming languages, and information extraction. Node selection captures the matching aspect of tree transformations. Iterated node selection can be used to navigate through input trees while producing output data structures.

From the database perspective, node selection is usually viewed as a querying problem [13, 15, 10]. The W3C standard querying language XPath provides descriptions of monadic queries, i.e. queries that define sets of nodes in trees. XPath queries are used by the W3C standard languages XQuery and XSLT for defining XML document transformations.

Modern programming languages support node selection in trees via pattern matching, for instance all functional programming languages of the ML family (Caml, SML, Haskell). Tree patterns with n capture variables define n-ary queries, i.e. queries that select sets of n-tuples of nodes. The XML programming languages XDuce [11] and CDuce [4, 2] support more expressive recursive tree patterns.

Information extraction tasks for the Web can frequently be reduced to defining n-ary queries in HTML or XML trees. Gottlob et al. [9] proposed monadic Datalog as a querying language for this purpose, and showed that it captures monadic MSO-definable queries [8]. Their Lixto system [6, 1] provides a visual interface by which to specify monadic queries in monadic Datalog, and to compose them into n-ary queries. Composed monadic queries are defined in Elog, a binary Datalog language.

The paper is organized as follows. We recall the definitions of n-ary MSO-definable queries in trees in Section 2, introduce languages of compositions of monadic queries in Section 3, and discuss some instances in Section 4. We study their expressiveness in Section 5 and their algorithmic complexity in Section 6, including algorithms for the model-checking and the query answering problems, and prove the satisfiability problem to be NP-hard. Finally, in Section 7 we propose a fragment of the language, study its expressiveness, and give an efficient algorithm for the query answering problem.

2 Node Selection Queries in Trees

We recall the definition of n-ary MSO-definable queries in trees. We develop our theory for binary trees. This will be sufficient to deal with unranked trees, since unranked trees can be viewed as binary trees via a firstchild-nextsibling encoding [12].

We consider binary trees as acyclic digraphs with labeled nodes and ordered children. We start with a finite set Σ of node labels. A binary Σ-tree t ∈ TΣ is a finite rooted acyclic directed graph, whose nodes are labeled in Σ. Every node is connected to the root by a

      61 unique path. All nodes of binary trees either have 0 or 2 Tarskian manner, and write t,α |= φ in this case. The children. Nodes without children are called leaves.All first-order logic FO is obtained from MSO by omitting other nodes have a distinguished first and second child. the set quantification. Actually the notations FO and MSO stands for FO[T]andMSO[T] respectively, i.e. We write root(t) for the root of tree t and nodes(t) formulae over the vocabulary T. for the set of nodes of tree t and edges(t) ⊆ nodes(t)2 for the set of edges of t. For all labels a ∈ Σ, we write We view MSO as a query language. The names of n- φ ,..., laba(t) ⊆ nodes(t) for the subset of nodes of t labeled ary queries are MSO formulas (x1 xn) with n free ,..., by a. Given two nodes v1,v2 ∈ nodes(t) we call v2 a first-order variables x1 xn. These define the follow- child of v1 and write v1 v2 iff there exists an edge from ing queries: v1 to v2, i.e., if (v1,v2) ∈ edges(t). φ(x1,...,xn)(t)={(α(x1),...,α(xn)) | t,α |= φ} The descendant relation ∗ on nodes is the reflexive Definition 3. An n-ary query is MSO definable if it is φ ,...,  transitive closure of the child relation . equal to some query (x1 xn) . Unfortunately [5] shows that the satisfiability problem The subtree of tree t rooted by node v ∈ nodes(t) is the is not fixed-parameter tractable, i.e. there exists no tree denoted by t| that satisfies: v polynomial p and elementary function f such that we  ∗  nodes(t|v)={v ∈ nodes(t) | v  v } can decide in time O( f (|φ|) p(|t|)) whether an monadic 2 φ edges(t|v)=edges(t) ∩ nodes(t|v) MSO-formula is satisfiable in the tree t.However, root(t|v)=v there exists query languages that can express all MSO- laba(t|v)=laba(t) ∩ nodes(t|v) ∀a ∈ Σ definable queries, which have polynomial-time com- bined complexity: e.g. monadic queries defined by suc- Definition 1. Let n ∈ N.Ann-ary query in binary cessful runs of tree automata have exactly the power of trees over Σ is a function q that maps trees t ∈ TΣ to MSO in defining monadic queries but deciding the non- set of n-tuples of nodes, such that ∀t ∈ TΣ : q(t) ⊆ emptyness of a monadic query is in polynomial-time nodes(t)n. Moreover, we require q to be closed under w.r.t. combined complexity [14]. tree-isomorphism, i.e. h(q(t)) = q(h(t)) for a tree iso- morphism h. Let us define some algorithmic tasks for query lan- ,. Simple examples for monadic queries in binary trees guages (N ) that are common to database theory: Σ over are the functions laba that map trees t to the sets • ∈ Σ model-checking: given a query name c,atree of nodes of t that are labeled by a for a . The binary k t and an n-tuple (v1,...,vk) ∈ nodes(t) , does query descendant relates nodes v to their descendants, ,..., ∈   i.e. descendant(t)={(v,v) ∈ nodes(t)2 | v ∗ v}. (v1 vk) c (t) hold ? • query answering: given a query name c and a tree Σ Definition 2. A query language L over alphabet is a t,returnc(t). An expected complexity might be ,. . pair L =(N ) where N is a set of names and an polynomial in the number of solutions. interpretation function mapping names c ∈ N to queries • c in Σ-trees. satisfiability (over a fixed tree): given a query name c and a tree t, does c(t) = 0/ hold ? The monadic second-order logic (MSO) in trees is a query language that is widely accepted as the yardstick Unranked trees are like binary trees, except that all for comparing the expressiveness of XML-query lan- nodes may have arbitrarily many ordered children. The guages [9, 15]. 
This is because of the close correspon- next-sibling of a node is the successor of the same parent dence between MSO, tree automata, and regular tree in the sibling ordering. languages [17]. Every MSO formula with n free node variables defines an n-ary query. Unranked trees can be encoded as binary trees by only using edges for the first-child and next-sibling relations. In MSO, binary trees t ∈ TΣ are seen as logical struc- Fig. 1 gives a DTD, an unranked tree matching this tures with domain nodes(t). The signature T of DTD and its first-child next-sibling encoding t.Asim- this structure contains symbols for the binary relations ple binary query on that tree is to select all pairs of name and title of the same book. It can be expressed with re- child1 and child2 and the unary relations laba for all a ∈ Σ. spect to the binary encoding by the following MSO for- mula with two free variables y,z: Let x,y,z range over a countable set of first-order vari- ∃x (labauthor(x) ∧ child1(x,y) ∧ child2(x,z)) ables and X over a countable set of monadic second- order variables. Formulas φ of MSO have the following abstract syntax, where a ∈ Σ: 3 Composing Monadic Queries φ ::= p(x) | child1(x,y) | child2(x,y) | lab (x) |¬φ | φ ∧ φ |∀xφ |∀Xφ a 1 2 Query languages for monadic queries in trees have been A variable assignment α into a tree t maps first-order widely studied by the database community in the last variables nodes(t) and second-order variables to sub- few years. See [12] for a comprehensive overview. sets of nodes(t). We define the validity of formulas Languages for n-ary queries are less frequent but have φ in trees t under variable assignments α in the usual started to arise with the XML programming languages

[Figure 1 comprises three panels: a DTD; an unranked tree matching the DTD, with nodes labelled bib, book, author, name, title; and its firstchild-nextsibling encoding t, in which missing children are marked ⊥.]

Figure 1. A DTD, an unranked tree matching the DTD and its firstchild-nextsibling encoding t.
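A minimal OCaml sketch of the firstchild-nextsibling encoding may help to read Figure 1; the constructor-based representation, the function names, and the bib value below (with text contents omitted) are illustrative assumptions, not the paper's implementation.

    (* Binary Σ-trees in the sense of Section 2; Bot plays the role of the ⊥
       leaves produced by the firstchild-nextsibling encoding of Figure 1.   *)
    type binary_tree =
      | Bot
      | Node of string * binary_tree * binary_tree   (* label, 1st and 2nd child *)

    (* Unranked trees: a label and an ordered list of children. *)
    type unranked_tree = U of string * unranked_tree list

    (* Encoding of a forest: the first child of a node becomes its first
       (left) child, the next sibling becomes its second (right) child.      *)
    let rec fcns_forest = function
      | [] -> Bot
      | U (lab, children) :: siblings ->
          Node (lab, fcns_forest children, fcns_forest siblings)

    let fcns t = fcns_forest [ t ]

    (* A document shaped like the one in Figure 1: a book with one author
       (carrying a name) followed by a title.                                *)
    let bib =
      U ("bib", [ U ("book", [ U ("author", [ U ("name", []) ]);
                               U ("title", []) ]) ])

    let bib_encoded = fcns bib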

      XDuce and CDuce [11, 2, 4] as well as with information the set of free variables of φ. We will often write c(x) extraction tools such as Lixto [8, 6]. instead of c(x). . The set of subformulas of φ is de- noted by Sub(φ).Thecomposition size |φ| of a formula In this paper, we propose a new class of languages φ is inductively as follows: for defining n-ary queries by composition of monadic | | = 0, |φ ∧ φ| = |φ| + |φ| + 1 queries. We leave the choice of the underlying monadic |c(x).φ| = + |φ|, |φ ∨ φ| = |φ| + |φ| + querying language parametric, so the reader may choose 1 1 |∃x φ| = + |φ| his prefered monadic querying language, and extend it 1 to an n-ary query language by query composition. The composition operator is motivated by Lixto’s way of Note that this definition implies that query names are of defining n-ary queries [6]. size one.

      The principle of composition is quite simple: a com- For all trees t, all valuations ν : Var → nodes(t) rang- position of two monadic queries first selects a node an- ing over the nodes of t, and all composition formula swering the first sub-query and then launches the second φ ∈ C (L) we define the satisfaction relation t,ν |= φ as sub-query at that node. All nodes seen meanwhile can follows: be memoized and returned in an output tuple. (i) range(ν) ⊆ nodes(t) We start from a language L of monadic queries c and , , ∈ Var an infinite set x y z of variables. We then define (ii) compositions of monadic queries on basis of the com- t,ν |=  position operator that we write as the dot ’.’ . Informally ∈   . ,ν / | .φ u c (t)(1) a composition query c1(x1) c2(x2) on a tree t will first t [x u] = c(x) iff | ,ν | φ ∈   t u = (2) bind x1 nondeterministically to some node v1 c1 (t), ,ν | φ ∧ φ ,ν | φ ,ν | φ and then launch query c in the subtree t| rooted at v t = 1 2 iff t = 1 and t = 2 2 v1 1 ,ν | φ ∨ φ ,ν | φ ,ν | φ in order to bind x to some node v ∈ c (t| ). t = 1 2 iff t = 1 or t = 2 2 2 2 v1 t,ν |= ∃x φ iff there exists u ∈ nodes(t), ,ν / | φ For expressivity reasons Ð that is to capture MSO as s.t. t [x u] = soon as the monadic query language capture MSO Ð we Let us consider the satisfiability of c(x).φ. Condition add conjunction, disjunction and projection to our com- (1) implies that c selects the node u in t, condition (2) 1 position language . implies that the interpretation of φ is relativized to the subtree of t rooted at u. Given a monadic query language L =(N,.), compo- sition formulae φ ∈ C (L) are defined by the following Valuations define possible values for free variables in abstract syntax: composition formulae. A formula can define an n-ary φ query by sorting its free variables. Formally, a formula ::= composition formula φ ∈ { ,..., } φ C (L) with free variables x1 xn = FV( ) de- fines the n-ary query φ(x ,...,x ) such that for all | c(x).φ composition, c ∈ N,x∈ Var 1 n trees t: | φ ∧ φ conjunction | φ ∨ φ disjunction φ(x1,...,xn)(t)={(ν(x1),...,ν(xn)) | t,ν |= φ} |∃x φ projection
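The abstract syntax above transcribes directly into an algebraic datatype. The following OCaml sketch is illustrative only (the paper's prototype is written in OCaml, but its actual data structures are not shown here); it keeps the monadic query language parametric in the type of query names and implements FV(φ) and the composition size |φ| as defined in this section.

    type var = string

    (* Composition formulae of C(L), parametric in the type 'n of query names. *)
    type 'n formula =
      | True                              (* ⊤ *)
      | Comp of 'n * var * 'n formula     (* c(x).φ : composition *)
      | And of 'n formula * 'n formula    (* conjunction *)
      | Or of 'n formula * 'n formula     (* disjunction *)
      | Exists of var * 'n formula        (* projection ∃x φ *)

    module VarSet = Set.Make (String)

    (* Free variables FV(φ): c(x).φ leaves x free, ∃x removes it. *)
    let rec fv = function
      | True -> VarSet.empty
      | Comp (_, x, phi) -> VarSet.add x (fv phi)
      | And (p, q) | Or (p, q) -> VarSet.union (fv p) (fv q)
      | Exists (x, phi) -> VarSet.remove x (fv phi)

    (* Composition size |φ|: query names count one. *)
    let rec size = function
      | True -> 0
      | Comp (_, _, phi) | Exists (_, phi) -> 1 + size phi
      | And (p, q) | Or (p, q) -> size p + size q + 1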

      Given a composition formula φ, we denotes by FV(φ) 4 Examples of Composition Lan- guages 1We proved that conjunctions, disjunctions and pro- jections are required to express all MSO-queries by We now discuss some instances of query languages composition of monadic MSO-definable queries C (L) by instantiating the parameter L to some concrete

      63 monadic query language. algorithm for answering compositions of monadic queries, defined either in MSO, XPath, or by tree au- As first instance, we let L be the monadic query lan- tomata. Further monadic query languages are be eas- guage containing all monadic MSO formulas. For illus- ily added by new modules called query machines. Each tration, we consider XML documents defining collec- monadic query can be expressed by different formalism tions of books, which satisfy the DTD in Fig. 1. Our within the same composition formula. target is to select all pairs of author names and titles of the same book by composition. Our concrete syntax for expressing composition queries is given in Fig. 3. A typical input consists of an XML We define the binary query on firstchild- document and a composition query. The output is an nextsibling encodings. Names and titles of a XML document representing the set of all answers. The book are contained in siblings of author-labeled nodes. implementation is done in OCaml. To select the pairs, we first select all author nodes by   the monadic query c1 defined by the monadic MSO 5 MSO Completeness formula c1 = labauthor(x). We then compose it with     two independant monadic queries c2 and c3 ,for We call an n-ary query language MSO-complete if it ∃ ∧ , selecting name by c2 = y root(y) child1(y x) and can express all MSO-definable n-ary queries. For in- ∃ ∧ , title by c3 = y root(y) child2(y x). The modeling stance, monadic Datalog is known to be a MSO com- composition formula is: plete monadic query language. In this paragraph, we φ = ∃zc1(z).(c2(x) ∧ c3(y)) study the expressiveness of composition languages over MSO-complete monadic query languages.
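Using the datatype sketched after the grammar of Section 3, the composition formula above can be built as an ordinary value, with the monadic MSO queries ⟦c1⟧, ⟦c2⟧, ⟦c3⟧ supplied as node-set functions; tree and node representations are left abstract, and all names here are illustrative.

    (* φ = ∃z c1(z).(c2(x) ∧ c3(y)); its free variables are x and y, so it
       defines the binary query φ(x, y) of author names and titles.
       c1 selects author-labelled nodes; c2 and c3 select the first and
       second child of the root of the current subtree, as in the text.    *)
    let phi (c1, c2, c3) =
      Exists ("z", Comp (c1, "z", And (Comp (c2, "x", True),
                                       Comp (c3, "y", True))))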

      Note that according to the DTD and the semantic of We show that the composition operator can be ex- composition, the first query can select nodes labeled pressed in first-order logic, so that n-ary-compositions by author, and then, in each subtrees induced by the of monadic MSO definable queries are MSO-definable previous selected nodes, one can select the node la- too. beled by name, and the node labeled by title,by  two independant monadic queries c2 = labname(x) and Let L =(N,.) be a monadic query language. For every c = lab (x) respectively. The modeling composi- 3 title name c ∈ N we introduce a binary predicate symbol Bc tion formula is then: t ⊆ 2 that we interpret as a binary relation on Bc nodes(t)   φ ∃ . ∧   = zc1(z) (c2(x) c3(y)) t { , | ∈   | } Bc = (v v ) v c (t v) We now consider the first-order logic over the signature A second instance is obtained by composing monadic (B ) ∈ ∪ T. Datalog queries [8] which are well known to capture all c c N monadic MSO. Indeed, our idea of compositions is very Proposition 1. Every composition formula φ(x) ∈ C (L) much inspired by the way in which n-ary queries are de- is equivalent to some first-order formula γ(x) over the ∗ fined from monadic Datalog queries by the Lixto system signature (Bc)c∈N ∪ T ∪{ }. for visual Web information extraction [1, 6].
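The binary predicates B_c used in Proposition 1 can be computed naively by re-running the monadic query on every subtree. The snippet below is an OCaml illustration of the definition only (quadratic, with assumed helper functions and node identities shared between a tree and its subtrees), not of how such predicates would be evaluated in practice.

    (* B_c^t = { (v, v') | v' ∈ ⟦c⟧(t|v) }.  Assumed helpers:
       nodes t     : all nodes of t
       subtree t v : the subtree t|v rooted at v
       eval_c tr   : the node set ⟦c⟧(tr)                                    *)
    let bc ~nodes ~subtree ~eval_c t =
      List.concat_map
        (fun v -> List.map (fun v' -> (v, v')) (eval_c (subtree t v)))
        (nodes t)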

      We illustrate the correspondence at the example of se- Proof. We define a function .x encoding composi- lecting pairs of author names and titles of the same tion formulas into first-order formulas over the signature ∗ books. Such a query is expressed in Lixto by a Monadic (Bc)c∈N ∪ T ∪{ } inductively: Datalog program P and an additional information about  x = the predicate hierarchy, which we model by a tree. We  .φ , ∧φ firstchild−nextsibling c(y) x = Bc(x y) y express this query in the φ ∧ φ  φ  ∧φ  encodings. The monadic Datalog program P and the 1 2 x = 1 x 2 x φ1 ∨ φ2x = φ1x ∨φ2x predicate hierarchy are given on Figure 4. ∗ ∃y φx = ∃yx y ∧φx We can express the same query by composing the Let γ(x1,...,xn) ≡∃FV(φ)\{x1,...,xn)(∃y root(y) ∧ following three monadic Datalog queries by φy),wherey ∈ FV(φ). Finally note that root(x) is FO[T]-definable. φ(y,z)=∃xP1(x).( P2(y) ∧ P3(z)) . If the monadic query language captures MSO then the binary predicates Bc are MSO-definable. The first im- P1 : Pauthor(x) :- labauthor(x) portant technical contribution of this paper is that the with the goal Pauthor converse holds too. P2 : Pname(x) :- root(y), child1(y,x) with the goal Pname Theorem 1. The class of n-ary queries defined by com- P3 : Ptitle(x) :- root(y), child2(y,x) position of MSO-definable monadic queries is exactly with the goal Ptitle. the class of n-ary MSO definable queries. To prove the first direction it suffices to show that each Implementation We have implemented a rather naive predicate Bc is MSO-definable whenever c is. The

(a) program P
    Pauthor(x) :- labauthor(x)
    Pname(x)  :- Pauthor(y), child1(y,x)
    Ptitle(x) :- Pauthor(y), child2(y,x)

(b) hierarchy: Pauthor above Pname and Ptitle

Figure 2. A set of monadic Datalog rules and its predicate hierarchy

query    ::= SELECT vars FROM formula
formula  ::= atom | formula AND formula | formula OR formula
atom     ::= machine(var)
vars     ::= var | var, vars
var      ::= identifier
machine  ::= XPATH[xpath specif] | AUTOMATON[automaton specif] | MSO[mso specif]

Figure 3. Concrete syntax for composition queries
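For illustration, a query of the shape admitted by this grammar is shown below; the machine-specific specifications inside the brackets are left elided since their format is not given in the figure, and the query itself is a hypothetical example rather than one taken from the implementation.

    SELECT x, y
    FROM MSO[...](x) AND XPATH[...](y)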

      γ , Π binary MSO formula Bc (x y) defining Bc is exactly the jection of r w.r.t. J,definedby J(r)=(ri)i∈J.Inpar- formula γc(y) defining c where each quantification is ticular, Π0/ (r)=().Givenatreet,ann-tuple of nodes n relativized to x. v =(v1,...,vn) ∈ nodes(t) ,aNSA(A,q) with a se- n lection tuple q =(q1,...,qn) ∈ Q , and a state q ∈ Q,a The rest of this section prove the other direction, i.e. q-run of (A,q) over t selecting v is a run of A over t such the composition of monadic MSO-definable queries is that the root is labeled by q,andvi is labeled by qi for complete for n-ary MSO-definable queries. The proof each i ∈{1,...,n}. In particular, when n = 0, a q-run of is based on the equivalence between MSO-definable (A,()) over t selecting the empty sequence is a run of A queries and node selection automata as defined in [16], over t labeling the root by q. which is a consequence of the seminal theorem of Thatcher and Wright [17]. Lemma 1. Let n ≥ 2 be a natural. Let t ∈ TΣ be a binary tree. Let v =(v1,...,vn) be a tuple of length n of nodes from t, such that there exists at least two different nodes. 1 We recall that a node selection automaton (NSA) is a Let va be the least common ancestor of v. Let va be the 2 , , pair (A,S) where A =(Σ,Q,F,∆) is a tree automata and first child of va, and va its second child. Define I J Kas S is a set of selection tuples q. We write (A,q) instead follows: of (A,{q}).Arun of a tree automata A over a tree t is I = {i | v = v } atreer isomorphic to t via an isomorphism Φ,where a i = { | 1 ∗ } each node is labeled in Q, and such that the following J j va v j { | 2 ∗ } holds: K = k va vk • if v ∈ nodes(t) is a leaf labeled by a ∈ Σ,then ( , ) a → lab (Φ(v)) is in ∆, Let A q beaNSA,andqastate,thenthereexistsa r q-run r of A over t selecting viff • if v ∈ nodes(t) is an inner node labeled by ∃ ,  ∈ f ∈ Σ,andv ,v ∈ nodes(t) are its first child q q Qs.t. 1 2 ,Π and its second child respectively, then the rule - there exists a q-run of (A I(q)) over t Π 1, 2 f (labr(Φ(v1)), labr(Φ(v2))) → labr(Φ(v)) is in selecting I(v) and labeling va va ∆ by q,q respectively .  - there exists a q -run of (A,Π (q)) over t| 1 J va Arunr of A over t is successful iff its root is labeled selecting ΠJ(v) by an accepting state from F.ANSA(A,S) selects a  ,Π | - there exists a q -run of (A K(q)) over t v2 tuple of nodes (v ,...,v ) of a tree t iff there exists a 1 n selecting ΠK(v) a successful run r over t (isomorphic to t via Φ), and a selection tuple (q1,...,qn) ∈ S, such that for each i ∈{1,...,n}, the node Φ(vi) is labeled by qi in r. Proof. The proof is not difficult and left to the reader. When it is clear from the context we will omit the isomorphism Φ. Finally, the class of MSO-definable n-ary queries is exactly the class of n-ary queries Lemma 2. Let n be a natural. Given a node selec- , defined by node selection automata over binary trees tion automaton (A q) where q is an n-tuple of states, ∈ [17, 16]. given a state q Q, there exists a composition formula φA,q,q(x1,...,xn) over MSO-definable monadic queries such that for all Σ-tree t, for all v ∈ nodes(t)n, the fol- In order to prove Theorem 1 we introduce some nota- lowing are equivalent: ,..., ∈ n tions. Given a set R,ann-tuple r =(r1 rn) R , (i) there exists a q-run of A over t selecting v and a set J ⊆{1,...,n},wedenotebyΠJ(r) the pro- (ii) v ∈ φA,q,q(x1,...,xn)(t)

      65 Proof. We construct the formula inductively on n.The Corollary 1. For each MSO formula γ(x),thereex- construction mimics the decomposition given by lemma ists an equivalent composition formula φ(x) over MSO- 1. definable monadic queries. If n = 0wetakeφA,q,q = ∃xc0(x) where c0(t) is equal to nodes(t) if and only if there exists a q-run of A over t. By Thatcher and Wright’s theorem, this monadic query Proof. By [17, 16], there existsW a NSAW (A,S) equivalent γ φ φ φ is MSO-definable. to , and we define by: = q∈S q∈F A,q,q,where If n = 1, then q =(p) for some p ∈ Q,andwetake φA,q,q has been defined in the previous lemma. φ   A,(p),q(x)=c1(x) where c1 is defined by the NSA (A,(p)). Again by Thatcher and Wright’s theorem, this query is MSO-definable. 6 Algorithmic Complexity If n > 1, we consider two cases depending on whether In this paragraph L =(N,.) is a monadic query the variables x1,...,xn will be instantiated by the same language, and we suppose that there exists an algorithm node or not. So φA,q,q(x1,...,xn) will be written as a eq neq for the model-checking problem in time-complexity disjunction φ , , ∨ φ , , : A q q A q q mc(c,t),wherec ∈ N and t ∈ TΣ, and an algorithm for the query answering problem in time-complexity , • case 1 (variables will be instantiated by the same qa(c t). γ node). Let (A,q)(x) be an MSO formula such that for a tree t and an n-tuple v of nodes Fig. 4 represents a simple algorithm for the model- of t, it holds that v ∈ γ , (t) iff there ex- (A q) checking problem of a formula φ,atreet and a ists a q-run of (A,q) over t selecting v.It valuation ν. It is written in a pseudo ML-like code. It is easy to show that this formula exists, by | φ | |φ| | |maxφ∈Sub(φ)( FV( ) ) |φ|2 Thatcher and Wright’s theorem. Then we take runs in time O( M t + ) where V , φeq ,..., ∃ . M = maxc(x)∈Sub(φ)mc(c t). A,q,q(x1 xn)= xc1(x) ( i cr(xi)),where c1 is the query definedV by the monadic MSO for- mula ∃y ,...,y − (y = y ) ∧ γ (y ,...,y ) 1 n 1 i n i (A,q) 1 n This gives a naive algorithm for the query answering   and cr (t) selects the root of t,foranytreet. problem: generate all the valuations of free variables of a formula φ in a tree t, and apply the model-checking • case 2 (variables will be instantiated by at least algorithm on them. This leads to an exponential ,..., two different nodes). Let x denotes (x1 xn) grow up, but it is not clear how to avoid it since the and let P n be the sets of partitions (with possibly satisfiability problem of monadic query composition is { ,..., } empty parts) of 1 n such that for each NP-hard. partition, there exists at most one empty part. We φneq define A,q,q(x) by: Proposition 2. Let Σ = {0,1,◦} be an alphabet and W W { , },. Σ   L =( c0 c1 ) a monadic query language over {I,J,K}∈P n q ,q ∈Q q where cb selects all the nodes labeled by b ∈{0,1} ∃x∃y∃zc   (x). V q ,q in Σ-trees. Let t the binary whose roots is labeled by ◦, ( i∈I cr(xi) its first child by 0, and its second child by 1. Given a ∧c (y).φ ,Π ,  (Π (x)) 1 A J (q) q J composition formula φ over L, the satisfiability problem ∧c (z).φ  (Π (x))) 2 A,ΠK (q),q K of φ over t is NP-hard.

      where c1(t) selects the first child of the root of t,andc2(t) its second child. For any tree t,the q Proof. To prove that it is NP-hard we give a polyno-     ∈ nodes query cq ,q (t) selects a node v (t) iff mial reduction of CNF satisfiability into our problem. ,Π there exists a q-run of (A I(q)) over t selecting The ideaV is to associate with a given CNFV formula (v,v,...,v) (of length |I|), such that its first child Ψ = φ = φ   1≤i≤p Ci a composition formula 1≤i≤p i is labeled by q , and its second child by q .This over L. Each φi is a composition formula associated query is MSO-definable, again by Thatcher and to the i-th clause Ci. Itisdefinedbyassociatingto Wright’s theorem. Remark that subformulae each litteral x j the atomic formula c1(x j) and to ¬x j the φ  (Π ( )) φ  (Π ( )) A,ΠJ (q),q J x and A,ΠK (q),q K x are formula c0(x j), and to a disjunction of litterals a dis- recursively well defined, since |K|,|J| < n. junction of atomic formulae. For example, if we con- sider Ψ =(x1 ∨¬x2) ∧ (x2 ∨¬x3),thenφ =(c1(x1) ∨ c0(x2)) ∧ (c1(x2) ∨ c0(x3)). The rest of the proof is a direct application of Lemma 1. Composition and conjunctive queries Conjunctive queries over finite relational structures have been widely studied by the database community since it is the most common database query in practice. The particular case of conjunctive queries over unranked To conclude the proof of Theorem 1, we state the fol- trees have been studied in [7] over particular binary lowing corollary: XPath axis A = { Child, Child+, Child∗, NextSibling,

let check(φ, t, ν) = match φ with
  | ⊤          → true
  | c(x).φ'    → ν(x) ∈ ⟦c⟧(t) ∧ check(φ', t|ν(x), ν) ∧ (∀y ∈ FV(φ'): ν(y) is a descendant of ν(x))
  | φ1 ∨ φ2    → check(φ1, t, ν) ∨ check(φ2, t, ν)
  | φ1 ∧ φ2    → check(φ1, t, ν) ∧ check(φ2, t, ν)
  | ∃x φ'      → ⋁_{u ∈ nodes(t)} check(φ', t, ν[x/u])

Figure 4. Model-checking algorithm for a formula φ ∈ C(L), a tree t and a valuation ν
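The pseudo-code of Figure 4 translates almost directly into OCaml. The sketch below is a hypothetical rendering, not the authors' implementation: it reuses the formula datatype and fv function sketched in Section 3, represents valuations as association lists, and assumes a small record of tree operations whose names are mine.

    (* Assumed operations on trees and monadic queries. *)
    type ('tree, 'node, 'name) ops = {
      nodes         : 'tree -> 'node list;           (* nodes(t)                 *)
      subtree       : 'tree -> 'node -> 'tree;       (* t|v                      *)
      is_descendant : 'node -> 'node -> bool;        (* second node below first  *)
      eval          : 'name -> 'tree -> 'node list;  (* ⟦c⟧(t)                   *)
    }

    (* check ops φ t ν, following Figure 4; ν is expected to bind FV(φ). *)
    let rec check ops phi t nu =
      match phi with
      | True -> true
      | Comp (c, x, phi') ->
          let vx = List.assoc x nu in
          List.mem vx (ops.eval c t)
          && check ops phi' (ops.subtree t vx) nu
          && VarSet.for_all (fun y -> ops.is_descendant vx (List.assoc y nu)) (fv phi')
      | And (p, q) -> check ops p t nu && check ops q t nu
      | Or  (p, q) -> check ops p t nu || check ops q t nu
      | Exists (x, phi') ->
          List.exists (fun u -> check ops phi' t ((x, u) :: nu)) (ops.nodes t)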

      NextSibling+, NextSibling∗, Following }. Surprisingly monadic queries. The class of n-ary queries defined the complexity of these queries quickly fall into by E (L)-formulae is exactly the class of n-ary MSO- NP-hardness. Since each conjunctive queries over these definable queries. axis in an unranked tree is expressible by a composition query over a particular monadic query language in binary trees, all the complexity lower bounds from [7] apply to our formalism. For example, the satisfiability Proof. The proof is the same than those of Theorem 1. problem of a composition query over the monadic query It suffices to remark that the constrution of an equivalent composition formula given in Theorem 1 respects the languge ({c1∗ ,c2},.) is NP-hard w.r.t. combined ∗  ∗  { | , } required restrictions on variable sharing. complexity, where c1 (t)= v Child1(root(t) v) and c2(t)={v | Child2(root(t),v)}. 7.2 Answering algorithm In the next section we propose a composition fragment for which the satisfiability problem is in PTIME when- In this section we give an algorithm for answering a ever this holds for the underlying monadic query lan- composition query q on a tree t, so that the complexity guage, and give an efficient algorithm for query answer- may depend on the size of the output. Since the answer- ing. In addition we prove that this fragment can express ing complexity depends on the maximal number of free all MSO-definable n-ary queries whenever the underly- variables of the subformulae of the formula defining the ing monadic query language captures MSO. query, we first show that each composition formula φ is equivalent to a composition formula where there is a 7 An MSO-Complete and Tractable most 1 free variable different from the free variables of Fragment φ in its subformulae (wlog we assume that the quantified variables of φ are different from the free variables of φ). In this section, we introduce a “tractable” syntactic frag- Moreover, in order to avoid the problem of non-valued ∨ ment of composition formulae E (L), that leads to an variables Ð for example in the formula c(x) c(y) Ð, we n-ary MSO-complete query language (as soon as the complete each formula so that each part of disjunctions monadic query language L is), while enjoying efficient has the same free variable sets. For instance the for- ∨ query answering algorithms. mula c(x) c(y) is rewriting into the equivalent formula (c(x)∧true(y))∨(true(x)∧c(y)). The size of the output Let L be a language of MSO-definable monadic queries. formula can be at most quadratic in the size of the input In this fragment, variable sharing between conjunctions formula. and composition are not permitted, more precisely, if ,. φ ∧ φ and c(x).φ are (L)-formula, then FV(φ) ∩ Let L =(N ) be a monadic query language. Let E ∈ φ ∈ FV(φ)=0/,andx ∈ FV(φ). CDuce patterns for in- t TΣ beatreeandlet E (L) a composition formula. stance are built under this restriction for conjunctions We suppose to have an algorithm to answer monadic [4]. queries. The query answering algorithm processes in four steps: If the satisfiability problem for the underlying query lan- 1. rewrite φ into an equivalent formula φ in which guage is PTIME, then it holds for the composition frag- there is at most one free variable different from the ment too. The algorithm is based on dynamic program- free variables of φ, in its subformulae, and such ming Ð a satisfiability table defined inductively is com- that for each γ ∨ γ ∈ Sub(φ),FV(γ)=FV(γ); puted with memoization Ð. 
Then the query answering algorithm processes the formula inductively under the 2. compute two data structures Qa : N × nodes(t) → assumption that it is satisfied in the current tree. nodes(t) and Qc : N × nodes(t) × nodes(t) → {0,1} such that given a query name c ∈ N appear-   ing in φ , and two nodes v,v ∈ nodes(t), Qa(c,v) 7.1 MSO-completeness   returns the set {v : v ∈ c(t|v)} in linear time  in the size of the output, and Qc(c,v,v ) checks in We start by a theorem on expressiveness of the fragment  constant time whether v ∈ c(t|v); E (L), over MSO-definable monadic queries. 3. compute a data structure Sat : Sub(φ) × Theorem 2. Let L be a language of MSO-definable nodes(t) →{0,1} such that Sat(φ,v) checks in

      67 constant time whether a formula φ ∈ Sub(φ) is Step 4 The last phase is given on Fig. 5. Moreover, we satisfied in t|v; use memoization to avoid exponential grow-up. Valu- 4. answer the query by processing the formula φ re- ations are represented by sequences of pairs (variable, cursively with satisfiability tests, doubles elimina- node). We assume union and projection operations to tion, and memoization. eliminate doubles, so that their time complexities are linear in the input sets. This can be done by storing Step 1 Let φ be a composition formula. Wlog assume tuples in hash tables. that quantified variables of φ are different from its free variables. We define the width w(φ) of φ as the maxi- 7.3 Answering Complexity mal number, over the subformulae of φ, of free variables different from the free variables of φ. More formally In this section we study the complexity of the previous  w(φ)=maxφ∈Sub(φ)|FV(φ )\FV(φ)|.Aswesaidwe algorithm. Let L =(N,.) be a monadic query lan- transform φ into an equivalent formula φ with w(φ) ≤ 1. guage. Inputs of the algorithm are a tree t and a com- The transformation is simple by pushing down the quan- position formula φ ∈ E (L). Moreover, we suppose to tifiers. We sum up it in the following lemma: have of an algorithm to answer c on a tree t, for each c ∈ N, in time complexity qa(n,t). We write M(φ,t) for , | Lemma 3. Each query q defined by a composition for- maxv∈nodes(t),c(x)∈Sub(φ)qa(c t v). We sum-up the com- mula φ ∈ E (L) is equal to some query defined by a com- plexity by the following proposition: position formula φ ∈ E (L) such that w(φ) ≤ 1. Proposition 3. Answering a query q defined by a com- position formula φ ∈ E (L) is in time O(M(φ,t)|t||φ| + Proof. We define the translation of φ into φ by the fol- |φ|2|t|2|φ(t)|),where|φ(t)| is the output size. lowing rewriting rules:  ∃x (γ ∨ γ) → (∃x γ) ∨ (∃x γ) Proof. The first step produces a formula φ such  ∃x (γ ∧ γ) → (∃x γ) ∧ (∃x γ) that |φ | = O(|φ|2). The second step is in time ∃xc(y).φ → c(y).(∃x φ) with y = x O(M(φ,t)|t||φ|), and the computation of the satisfiabil- ∃x γ → γ if x ∈ FV(γ) ity table is in time O(|φ||t|2)=O(|φ|2|t|2). It remains to show the time complexity of algorithm de- We can show this rewriting system to terminate, and to picted in figure 5 to be O(|φ||t|2nK),whereK is the be confluent. The normal form is a formula where each number of solutions and n the arity of the query Ð we occurence of a quantified variable in an atomic formula consider that |φ(t)| = Kn Ð. We are going to show that ∃ c(x) is preceded by an existential quantification xc(x). each recursive call returns at most |t|K valuations, and Hence, normal forms are of width at most 1. Now we performs at most O(|t| + nK|t|) operations. Each call φ φ show that the normal form of a formula is equiva- to ans begins by a satisfiability test, so that the follow- φ ∃ γ ∧ γ → lent to . The only difficulties come from x ( ) ing property holds: if ans(γ,t,v) is a recursive call oc- ∃ γ ∧ ∃ γ ∃ .φ → . ∃ φ  ( x ) ( x ) and ( xc(y) ) (c(y) ( y ).The curing during the processing of φ , then the projection γ ∩ γ 0/ first case holds since FV( ) FV( )= , and the fol- of each valuation returned by ans(γ,t,v) on the vari- lowing proves the second case: ables from FV(φ) can be extended to a valuation ν such ,ν /  | ∃ .φ  t [y v ] =( xc(y) ) that t,ν, root(v) |= φ . Hence, the number of valua- iff there exists v ∈ nodes(t) s.t. t,ν[y/v ][x/v] |= c(y).φ |FV(γ)\FV(φ)|  tions returned by ans(γ,t,v) is at most |t| K. 
iff there exists v ∈ nodes(t|  ), v ∈ c(t) and   v Moreover, since w(φ )=1, we get |FV(γ)\FV(φ )|≤1. t|  ,ν[x/ ] |= φ v v It is clear that for conjunctions, disjunctions, and pro- iff t,ν[y/  ] |= c(y).(∃x φ). v jections, each recursive call performs at most O(|t|nK) We conclude by induction on the reduction length.   operations. If γ is of the form c(x).γ ,thenFV(γ )= FV(γ) ∩ FV(φ),sincew(γ)=1. Hence, any recursive call to ans(γ,t,v) for v ∈ c(t) returns at most K val- Remark that the size of the resulting formula is linear uations. Moreover, there are at most |t| nodes satisfying Ð multiply by two Ð in the size of the input formula, c(t), so that the recursive call ans(γ,t,v) performs at since each occurence of free variable is preceded by its most |t| + nK|t| operations. φ quantification. Then we transform so that each part Finally, since we use memoization, there are at most of a disjunction shares the same free variable sets, and |t||φ| recursive call to ans, so that the whole complexity such that each quantified variable is different from each   of ans on input, φ , t and root(t) is O(|φ ||t|2nK)). free variable of φ.
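Steps 2 to 4 of the answering procedure (the table Qa of Step 2, the table Sat of Step 3 below, and the ans function of Figure 5) can be combined into a short memoized sketch. The OCaml below is an illustration under assumed interfaces, not the authors' implementation: Qa is taken as a given function, query names are assumed to be plain first-order data so that memoizing on (subformula, node) pairs with structural equality is safe, and duplicate elimination is omitted.

    (* Step 2 is assumed given as  qa : name -> node -> node list,
       returning Qa(c, u), the nodes selected by query c in t|u.             *)

    (* Step 3: a memoized satisfiability table following equations (1)-(5).  *)
    let make_sat qa =
      let memo = Hashtbl.create 97 in
      let rec sat phi u =
        match Hashtbl.find_opt memo (phi, u) with
        | Some b -> b
        | None ->
            let b = match phi with
              | True -> true
              | Comp (c, _, phi') -> List.exists (sat phi') (qa c u)
              | And (p, q) -> sat p u && sat q u
              | Or  (p, q) -> sat p u || sat q u
              | Exists (_, phi') -> sat phi' u
            in
            Hashtbl.add memo (phi, u) b; b
      in
      sat

    (* Step 4, in the style of Figure 5: answers are valuations represented
       as association lists.  The conjunction case merges answer sets by
       concatenation, which is valid in the fragment E(L) because the two
       conjuncts share no free variables.                                    *)
    let make_ans qa sat =
      let rec ans phi u =
        if not (sat phi u) then []
        else match phi with
          | True -> [ [] ]
          | Comp (c, x, phi') ->
              List.concat_map
                (fun u' -> List.map (fun nu -> (x, u') :: nu) (ans phi' u'))
                (qa c u)
          | And (p, q) ->
              List.concat_map
                (fun nu1 -> List.map (fun nu2 -> nu1 @ nu2) (ans q u))
                (ans p u)
          | Or (p, q) -> ans p u @ ans q u
          | Exists (x, phi') ->
              List.map (List.filter (fun (y, _) -> y <> x)) (ans phi' u)
      in
      ans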

Step 2 It is quite obvious, by using hash tables.

Step 3 We compute, using memoization, a table Sat[.,.] defined inductively by:

    Sat[⊤, u]        = 1                                   (1)
    Sat[c(x).φ, u]   = ⋁_{u' ∈ Qa(c,u)} Sat[φ, u']         (2)
    Sat[φ ∧ φ', u]   = Sat[φ, u] ∧ Sat[φ', u]              (3)
    Sat[φ ∨ φ', u]   = Sat[φ, u] ∨ Sat[φ', u]              (4)
    Sat[∃x φ, u]     = Sat[φ, u]                           (5)

8 Conclusion

8.1 Summary.

We proposed and investigated an n-ary query language C(L) in which queries are specified as compositions of monadic queries. The choice of the underlying monadic query language L is parametric, so that we can express a wide variety of n-ary query specification languages, for

      68 1 let ans(φ,t,u)=if Sat[φ,u] then 2 match φ with 3 | → {ε}  S    | ( ).φ →  {( , ) · ν | ν ∈ ans(φ , , )} 4 c x u ∈Qa(c,u) x u t u 5 | φ ∧ φ → ans(φ,t,u) × ans(φ,t,u) 6 | φ ∨ φ → ans(φ,t,u) ∪ ans(φ,t,u) |∃ φ →{ν ν ν \ ,ν ν| ,ν ∈ φ, , } 7 x : dom( )=dom( ) x = dom(ν)\x ans( t u) 8 else 0/ 9 in 10 ans(φ,t, root(t)) Figure 5. Answering algorithm with implicit memoization instance composition of XPath formula, Monadic Dat- purpose language. ACM SIGPLAN Notices, alog programs or node selection automata. We proved 38(9):51Ð63, 2003. our language to capture MSO as soon as the underlying [3] Julien Carme, Aurlien Lemay, and Joachim monadic query language capture MSO too. We proved Niehren. Learning node selecting tree transducer the satisfiability problem to be NP-hard and proposed from completely annotated examples. In 7th Inter- (L) an efficient fragment E of the composition language national Colloquium on Grammatical Inference, L which remains MSO-complete as soon as captures volume 3264 of Lecture Notes in Artificial Intel- MSO. We gave an algorithm for the query answering ligence, pages 91Ð102. Springer Verlag, 2004. problem in time O(M(φ,t)|t||φ| + |φ|2|t|2|φ(t)|),where |φ(t)| is the output size and M(φ,t) is the maximal com- [4] Giuseppe Castagna. Patterns and types for query- plexity of the query answering problem over subtrees of ing XML. In 10th International Symposium t, of the monadic queries appearing in φ. on Database Programming Languages, Lecture Notes in Computer Science. Springer Verlag, Au- 8.2 Future Work. gust 2005. [5] Markus Frick and Martin Grohe. The complex- A more practical aspect is the extension of the exist- ity of first-order and monadic second-order logic ing implementation of query composition to the algo- revisited. In Proc. LICS ’02: Proceedings of the rithms in Section 7 and the comparison of their query 17th Annual IEEE Symposium on Logic in Com- answering efficiencies with other querying languages, puter Science, pages 215Ð224, Washington, DC, such as implementations of XQuery, and programming USA, 2002. IEEE Computer Society. C languages such as Duce . [6] G. Gottlob, C. Koch, R. Baumgartner, M. Her- zog, and S. Flesca. The Lixto data extraction We would like to investigate the correspondence Ð men- project - back and forth between theory and prac- tioned in Section 4 between the underlying query for- tice. In 23rd ACM SIGPLAN-SIGACT Symposium malism of Lixto and our query composition language on Principles of Database Systems, pages 1Ð12. over Monadic Datalog programs. In particular, we think ACM-Press, 2004. that there exists a systematic translation between the two formalisms. [7] G. Gottlob, C. Koch, and K. Schulz. Conjunctive queries over trees, 2004. Finally, in some cases it seems to be more efficient to [8] Georg Gottlob and Christoph Koch. Monadic have the possibility to navigate everywhere in the tree, datalog and the expressive power of languages without restriction on subtrees. The binary query exam- for web information extraction. In 21rd ACM ple given in Section 3, on the tree of figure 1 seems to SIGMOD-SIGACT-SIGART Symposium on Prin- be more natural when one first selects a node labeled by ciples of Database Systems, pages 17Ð28. ACM- name, and then its sibling. In this way it is interesting Press, 2002. to investigate the more general problem of binary query composition. [9] Georg Gottlob and Christoph Koch. Monadic queries over tree-structured data. 
In 17th Annual We would like to thank Manuel Loth who worked on the IEEE Symposium on Logic in Computer Science, implementation of monadic query composition. pages 189Ð202, Copenhagen, 2002. [10] Georg Gottlob, Christoph Koch, and Reinhard 9 References Pichler. Efficient algorithms for processing xpath queries. ACM Transactions on Database Systems, [1] Robert Baumgartner, Sergio Flesca, and Georg 30(2):444Ð491, 2005. Gottlob. Visual web information extraction with [11] Haruo Hosoya and Benjamin Pierce. Regular ex- lixto. In 28th International Conference on Very pression pattern matching for XML. Journal of Large Data Bases, pages 119Ð128, 2001. Functional Programming, 6(13):961Ð1004, 2003. [2] Veronique« Benzaken, Giuseppe Castagna, and [12] Leonid Libkin. Logics over unranked trees: an Alain Frisch. Cduce: an XML-centric general-

      69 overview. In Automata, Languages and Program- ming: 32nd International Colloquium, number 3580 in Lecture Notes in Computer Science, pages 35Ð50. Springer Verlag, 2005. [13] Frank Neven and Jan Van Den Bussche. Expres- siveness of query languages based on attribute grammars. Journal of the ACM, 49(1):56Ð100, 2002. [14] Frank Neven and Thomas Schwentick. Query au- tomata. In Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems, pages 205Ð214, 1999. [15] Frank Neven and Thomas Schwentick. Query au- tomata over finite trees. Theoretical Computer Sci- ence, 275(1-2):633Ð674, 2002. [16] Joachim Niehren, Laurent Planque, Jean-Marc Talbot, and Sophie Tison. N-ary queries by tree automata. In 10th International Symposium on Database Programming Languages, volume 3774 of Lecture Notes in Computer Science, pages 217Ð 231. Springer Verlag, September 2005. [17] J. W. Thatcher and J. B. Wright. Generalized finite automata with an application to a decision prob- lem of second-order logic. Mathematical System Theory, 2:57Ð82, 1968.

Type Checking For Functional XML Programming Without Type Annotation (Extended Abstract)

Akihiko Tozawa
IBM Tokyo Research Lab
1623-14, Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan
[email protected]

      1 Introduction 2 The Language and Problem

      We discuss the type checking for XML programming with higher- An ML-like Language for XML Programming We first intro- order functions. Our type checking does not require type an- duce an ML-like yet simply-typed functional language with higher notations on programs. This is beneficial for programmers. In order functions. This language supports XML programming. In XDuce [HP03] and CDuce [FCB02], programmers always need to particular, this language manipulates two XML values, input XML figure out, for all functions in the program, what type annotations values and output XML values. Input XML values are only pro- are necessary. This task sometimes becomes very tedious, in par- cessed. We however cannot create input XML values in the lan- ticular, when structures of target XML documents are complex. guage, so that such values are always supplied from the outer world. On the other hand, output XML values, or we can say, non- To achieve the type checking without type annotation, we use observable XML values, are only constructed. We do not have any the tree transducer type-checking technique. In particular, we method to inspect their structures. employ the high-level tree transducer, first introduced by Engel- friet [EV88]. We can enjoy much benefits of functional program- Let us explain the language step by step. As an example, we use the ming with this transducer, because we can use higher order func- following program representing the identity tree transformation. tions. Given input trees, the high-level tree transducer emits func- (i→o) = tional values. letrec id x : ∗x[if x |= 1 then id x·1 else ()], |= · Our method has two steps. The first step is a conversion from func- (if x 2 then id x 2 else ()) tional programs to high-level tree transducers. The second step is in the inverse type inference, which receives an output XML type and id a high-level tree transducer and creates an input XML type. First, we have sorts. In the program, we see a superscript i → o appearing on id. This superscript indicates a sort, i.e., simple type, • The conversion in the first step is made possible by imposing of the function variable id. Let B = {i,o,B,L}. This B is the set restrictions on functional programs. These restrictions ensure of base sorts. Sort i corresponds to input XML values. Input XML that (1) a program is not allowed to examine what it creates, values indicate some nodes in the input XML document. Sort o cor- (2) a program does not receive more than one input tree, (3) responds to output XML values. Output XML values are sequences the number of internal states a program can reach is finite. of XML trees. Sorts B and L are sorts for boolean values and la- These restrictions are obviously necessary, and are even suffi- bels, respectively. We use b to range over base sorts. We extend cient for the conversion to tree transducers. We impose these base sorts B to sorts S(B) for functional values as restrictions by using simple types called sorts. S B  = | → • The key idea for the second step is the abstract interpretation ( ) s :: b s s of values emitted by transducers. For this interpretation, we where b ∈B. We use s to range over sorts. Here → associates to the start from the finite algebras called binoids. Any XML type right as usual. Note that in sorts, their constructors i,o,B,L and →, can be captured by some binoid and homomorphism from are just syntactical objects. XML values to this binoid. Such homomorphism can be ex- tended to functional values. 
As far as type-checking is con- In the rest of the paper, we often make sorts of function variables cerned, we always consider functional values under abstract explicit for readability. In practice, sorts of variables as well as interpretation by this homomorphism. Our inverse type in- those of expressions can be, though not uniquely, inferred by known ference is done by combining Maneth’s algorithm and such unification-based algorithms such as the one in a textbook [Mit96]. abstract interpretation. Note that sorts are not types in this paper Ð we will introduce types themselves later. Let us outline the rest of the paper. Section 2 discusses the problem we deal with in a ML-style functional language. Section 3 gives a Next, this language allows the set of specific constant primitives. formal discussion and k-level tree transducers. Section 4 gives the They operate on input and output XML values of sorts i and o. type checking algorithm. Section 5 summarizes the related work. Section 6 discusses the future work. XML instances given as inputs, are seen as binary trees which are navigated by using the set of primitives. For input XML values of

      71 sort i , we define operators ·0, ·1, and ·2 as follows. The function dept in Figure 2 implements the transformation in Fig- ure 1. The function recursively applies itself to a set of nodes se- ·1a ·2 lected by children, root, etc. Readers familiar with XSLT should b ·0 c be aware that the program is written in a style similar to XSLT pro- grams.

      This figure illustrates an XML instance a[b[]] , c[] seen as a binary Not only map functions, but we can also provide library functions tree. Assume that a node x(i) is the root of the above XML instance. for testing the document structure. For example, a function which From x, we reach a node labelled by b using x·1, and a node la- tests the existence of a child node of x(i) with a certain label l(L) belled by c using x·2, and from these nodes we can move back to i.e., corresponding to the XPath predicate [x/l], can be written in a the root node by x·1·0orx·2·0. Predicates x |= 0, x |= 1 and x |= 2 manner similar to the function children. represent tests whether it is allowed to move to that direction. If there are no nodes in that direction the test fails. We can also obtain More generally, we can even implement a deterministic (binary) the label of the node by ∗x. For instance, we have ∗x = a on the root tree automaton which tests the substructure of x through the fol- node x. lowing trick. This function autom takes a transition function trans and initial state ini as arguments. For construction of output XML values, we have a constant ()(o) (i→L→(L→L→L→L)→L) which represents a null sequence, and two operators; [ ](L→o→o) autom x ini trans := and ( , )(o→o→o). The operator [ ](L→o→o) creates a node [t] from trans the label and an output XML value t. The operator ( , )(o→o→o) (∗x) concatenates two output XML values. (if x |= 1 then autom x·1 ini trans else ini) (if x |= 2 then autom x·2 ini trans else ini) if Furthermore, the language has the -construct and equality test on The transition function trans l q q takes a label l, two successor the finite set of labels. We also have letrec for defining mutually 1 2 states q1 and q2, and returns the result of the transition. Here, we recursive functions. assume that we encode the finite set of states of the automaton as a subset of the finite label set of sort L. We do not show the de- Finally, we emphasize what this language does not have. Although tail but this technique can be also extended to implement pattern we can convert an XML value x of sort i into the same value of (i→o) match constructs in the style of XDuce based on regular expression sort o using id x, the conversion in the reverse direction is not patterns [HP03]. expressed in the language. Namely, the language does not have a → primitive constant of sort o i representing this reverse conversion. Another interesting application of higher order functions is to use Neither we can define a program performing such a conversion. In them for representing XML values containing holes. Such holes general, our language can neither create input XML values, e.g., (i) are also called gap in the language JWIG [CMS02]. For instance, x := a[], nor inspect the information of some output XML values, (o→o) ∗ a value p represents a (first-order) gapped value whose gaps e.g., t for an output XML value t. can be filled at once by a value v(o) by the application (pv)(o). E.g., p(o→o) v := dept[v] is a gapped value dept[] where  is the position of a gap. This gap can be filled by emp[] as p emp[] = dept[emp[]]. Using Higher-order Functions In XML programming, the use of higher order functions have a number of advantages. Here we We can implement a set of gap operators using higher order func- look through several use cases of higher order functions through ex- tions. amples. 
Note that we later translate functional programs into trans- (o→o) = ducers, and we here only discuss functions to which such translation (i) gap v : v (o→o→o) can be applied. (ii) nogap vw := v (iii) concgap((o→o)→(o→o)→o→o) pqv := pv, qv L→ → → → A typical higher order function is the map function, which applies (iv) nodegap( (o o) o o) lpv := l[pv] a function given as an argument to a set of elements at once. In (v) pluggap((o→o)→(o→o)→o→o) pqv := p (qv) XML programming, it is particularly useful to have map functions which apply argument functions to nodes selected by a certain crite- The gap operators implement the following operations. ria, e.g., children, following siblings, etc. The following functions Example 2. Some examples on the use of the gap operators. (i) chilren and siblings take argument functions of sort i → o, and re- gap = . (ii) nogap dept[] is a gapped value dept[] with no turn the concatenation of the results of applications. gaps. (iii) concgap dept[]  = dept[] , . (iv) nodegap comp dept[] = comp[dept[]]. (v) pluggap dept[] dept[] = dept[ (i→(i→o)→o) = |= · children xf: if x 1 then siblings x 1 f else () dept[]]. siblings(i→(i→o)→o) xf:= fx, if x |= 2 then siblings x·2 f else () The function dept in Figure 2 traverses the input tree many times Example 1. For example, when applied to the root node x of an due to the call to the root function. Interestingly, a similar function input tree a[b[],c[],d[]], children x f returns f (x·1), f (x·1·2), f can be written by the function using gapped values, shown in Fig- (x·1·2·2). ure 3.1 This function computes an answer by the single traversal.

      The earlier examples of higher order functions are just useful in Note that functions such as chilren and siblings are usually supplied as library functions, rather than being a part of the user program. 1In this program we use sort B → s to implement pairs. It is not With such library functions, programmers do not have to deal with difficult to extend the language with pairs and projections. We do primitive navigation operators such as x·m (x·0, x·1 and x·2). not do this here for simplicity.

      72 writing concise programs. Their use is however not essential. The with sort (i → s) → i → s and g with sort i → s. In our language, last example using gapped values, essentially uses higher order we can create fg, f ( fg) = f 2g, f ( f ( fg)) = f 3g, and so on. In par- functions. As Engelfriet observed [EV88], by raising the order of ticular, we can enumerate such functional values up to f ng, where, sorts for output values, i.e., o, o → o,(o → o) → o → o and so on, for example, n is the size of the input to the program. This makes we can arbitrarily increase the expressive power of the language. the translation into tree transducers impossible, since the number of states should not be related to the size of any input. The use of nested let and letrec for functions of sort i → s also causes the Type-Checking Problem Types for XML values, i.e., instances same problem. of sorts i or o, are described by tree regular expressions, such as τ = (a[b[]∗] ∪ c[])∗. For example, id transforms any XML value Let us give another explanation from a different point of view. As into itself, hence a value of type (a[b[]∗]∪c[])∗ into the value of the we discussed in the introduction, the decidability results for the tree same type. This observation is denoted by id :(a[b[]∗] ∪ c[])∗ → transducer type checking come from the fact that the inverse image ∗ ∗ −1 τ τ (a[b[] ] ∪ c[]) . For a function with sort i → o, the type checking fI ( I) of a transformation with respect to a regular language I, (i→o) υ → τ is always regular. Since the subsumption for regular languages is problem fI : I I can be stated as follows. decidable, the type checking amounts to check whether υI is con- tained by f −1(τ ). roblem (i→o) I I P 1. (Type checking) Given a program fI , an input type υ τ (i→o) υ → τ Here, consider the following program which has a sort i → i → o. I and output type I, the type checking problem fI : I I is to test whether or not the transformation of any XML value of type (i→i→o) = υ produces an XML value of type τ . letrec cmp xy: I I if not(x |= 2) && not(y |= 2) then ok[] else cmp x·2 y·2 In understanding Problem 1, we need to clarify the case when the in cmp transformation does not terminate. For example, the program given This program checks whether two sequences starting from nodes x in Figure 2 does not terminate if there are occurrences of dept- and y have the same length (not and && here can be defined using emp nodes inside -nodes in the input tree. if). Clearly the inverse image of such a program for τI = ok[] does comp[ not have a regular property. For example, we cannot test by means of tree automata, if a tree l[t1] , t2 has the width of t1 is equal to the comp[ dept[ ffi , , width of t2. This is the source of di culty with programs having dept[] emp[akihiko[]] → → emp[akihiko[]], ⇒ emp[ sorts such as i i o. emp[dept[]] dept[ ] emp[akihiko[]], 3 Values, Tree Automata and High-level tree ··· transducers We can use the input type υ to guarantee that dept-nodes never I We introduce XML values, and then tree automata which are the occurs inside emp-nodes, so that the function terminates for any model of XML types. The high-level tree transducer is the model input of type υ . There is a choice whether or not we include the I of XML transformations as given, using a functional language, in non-termination in type errors. In this paper, we chose to include it. examples in the last section. We discuss its syntax and semantics in the latter half of this section.

      Restrictions on the Functional Language We can solve the type Here are some notations used throughout. We consistently use bold checking problem for the language introduced so far, when the pro- font, e.g., a, to emphasize meta-variables denoting words or tuples. gram of interest can be translated into a high-level tree transducer. We use  ∈ A∗ for an empty word, and an associative operator (·) for Unfortunately, not all programs can be translated into such tree word concatenation. We let B = {true,false} be the set of boolean transducers. We here explain sufficient restrictions on programs values. This B appears in the text, so that there should be no confu- which make this translation possible. sion with the symbol B appearing in sorts. • Any functions or variable f (s) declared in let or letrec as (s) f x := e, either has their sort s = b or s = s1 →···→ sn−1 → ,..., , XML Values An XML value is a sequence of unranked ordered b, such that none of s2 sn−1 b are i. Namely, only the first L argument can be of sort i. Note that we do not restrict sorts trees over the finite set of labels. The set of XML values is defined in the form i → s to appear other argument positions, e.g., as follows. → → → children(i (i o) o). V  t ::= () | [t] | t , t • → Any function of sort i s must be declared in the top-level where ∈ L. We omit () if it is directly enclosed in [ ]. We assume letrec of the program. In other words, they must not be that , is associative and () is an identity. As explained earlier, defined in a letrec within another letrec. an XML value can also be seen as a binary tree, since each t is represented either as [t1] , t2 or (). For each t, its domain dom(t) ⊆ These two restrictions correspond to the fact that the tree transducer ∗ {1,2} is the set of locations, when seen as a binary tree, of that tree. only have a single input parameter (= first restriction) and a finite = We define the set of tree nodes U by set of states ( first and second restrictions). Obviously, we cannot have a function definition of sort i → i → s, because it means that U = {t}×dom(t) . this function has multiple input parameters (we underline the erro- ∈ neous part). In the translation, functions of sort i → s are seen as the t V finite set of states. This is guaranteed only if there are finitely many That is, U is the set of all nodes in all trees. The label of a node possibilities for such functions. Assume that there is a function f u ∈ U is denoted by ∗u ∈ L. We can move inside XML trees by the

      73 comp[ comp[ dept[ dept[], emp[akihiko[]], emp[akihiko[]], ⇒ emp[yoshinori[]] emp[yoshinori[]] ] ] ] Figure 1. The dept-transformation. Namely, we collect all nodes labelled emp, as well as subtrees of such emp-nodes, and put them into all dept-nodes in the document.

      letrec (∗ libraries ∗) children(i→(i→o)→o) xf:= if x |= 1 then siblings x·1 f else () siblings(i→(i→o)→o) xf:= fx, if x |= 2 then siblings x·2 f else () root(i→(i→o)→o) xf:= if x |= 0 then root x·0 f else fx (∗ user program ∗) dept(i→o) x := if ∗x = dept then dept[root x emp] else if ∗x = emp then () else ∗x[children x dept] emp(i→o) x := if ∗x = emp then ∗x[children x dept] else children x emp in dept Figure 2. Function dept x first looks at the label of the node x.Ifitisdept, the function creates a copy of this dept-node, in which it puts the result of call to the function emp at the root. If the label is emp the function dept skips this node, and otherwise it creates the copy of the given node. The function emp collects and copies all emp nodes inside the tree. This function emp has some problem, because it again calls dept to copy its substructure. See the Type-Checking Problem paragraph.

      letrec ... (∗ user program ∗) deptgap(i→B→o→o) x := let n(B→o→o) b := nogap () in (B→o→o) = |= · let p1 : if x 1 then deptgap x 1 else n in (B→o→o) = |= · let p2 : if x 2 then deptgap x 2 else n in let l(L) := ∗x in (B→o→o) = let p0 b : if b then if l = dept then concgap (nodegap dept gap)(p2 true) else if l = emp then p2 true else concgap (nodegap l (p1 true)) (p2 true) else if l = emp then pluggap (concgap (nogap (p1 true ())) gap)(p2 false) else pluggap (p1 false)(p2 false) in p0 dept(i→o) x := let p := deptgap x in p true (p false ()) in dept

      Figure 3. A similar function as the one in Figure 2, using first order gapped values. The function deptgap returns p(B→o→o) which represents two gap values, namely p true and p false. For example, for the left hand side document of Figure 1, deptgap returns a pair of gapped values p true = comp [dept[]] and p false = emp[akihiko[]] , emp[yoshinori[]] , . They are finally plugged by p true (p false ()) and create the resulting document on the right hand side of Figure 1.

      74 ◦ operator (·m)(m = 0,1,2). For k ∈ dom(t) known. For binoids V(τI) with homomorphism , in what follows, we often use ()◦ and τ ◦ instead of • and F above, respectively. (t, k)·m = (t, k·m)ifm = 1,2 and k·m ∈ dom(t) I , · · = , = = , (t k k) m (t k)ifm 0 and k 1 2 Example 4. Consider the XML type comp[ds] in Example 3. Here, · = ⊥ u m otherwise we give one binoid corresponding to this XML type. We here take The value ⊥ here represents a non-existing node such that ⊥  U. a certain set of tree regular expressions as the domain of the binoid V τ Finally, the set of root nodes Λ(U) ⊆ U is a set {(t,) | t ∈ V}. (comp[ds]). In the following, assume that E represents a type for values that do not belong to other elements in the domain of V(comp[ds]).

      XML Types and Tree Automata We introduce XML types. + + Each XML type represents a certain set of XML values. We use V = {(),dept[es] ,emp[] ,comp[ds],τE} τ, υ ()◦ = () metavariables for ranging over XML types throughout the pa- ◦ per. As a candidate of models of XML types, we have tree regular comp[ds] = {comp[ds]} ∈ L α ⎧ ⎫ expressions as defined by Hosoya et al. For , and ranging ⎪ , + → + ⎪ over a set of type variables, tree regular expressions use the syntax ⎪ dept emp[] dept[es] ⎪ ⎨⎪ dept,() → dept[es]+ ⎬⎪ such as [ ] = ⎪ + ⎪ ⎪ comp,emp[] → τE ⎪ ∗ ⎩⎪ ⎭⎪ T XML  τ, υ ::= () | [τ] | τ , τ | τ ∪ τ | τ | letrec α := τ;... in τ | α ⎧ ... ⎫ ⎪ , + → + ⎪ ⎨⎪ () dept[es] dept[es] ⎬⎪ Example 3. This is an example of XML type. By constraining input , = +, + → + ⎪ dept[es] dept[es] dept[es] ⎪ XML values, this type guarantees the termination of the program ⎩ ... ⎭ dept in Figure 2. ∗ ∗ We can confirm that this V(comp[ds]) satisfies Definition 2 by using letrec ds := dept[es] ; es := emp[] in comp[ds] ◦ ◦ ( ) ∈ V →V(comp[ds]) such that t ∈ t . For example, take t = ◦ + + comp dept[] , dept[]. We have (t , t) = dept[es] = dept[es] , dept[ It says that the root node is always and in which we have a + ◦ ◦ sequence of dept nodes. Inside depts, we have emp nodes, and so es] = t , t . on. We lastly give yet another form of tree automaton, which can be In this paper, we do not directly discuss the semantics τ ⊆ V of the efficiently converted into a non-deterministic tree automaton. This above syntax. Instead, we introduce tree automata which is well- automaton provides a trick which will be used at the last step of the known as a canonical model of XML types τ. Here we actually type inference algorithm as the model of inferred input XML types. introduce three forms of them. The first one is the most standard. A short explanation of the automaton is that (1) it is a variant of deterministic 2-way tree automaton; (2) it allows cyclic runs; and Definition 1. A (total) non-deterministic tree automaton M = (3) the transition function can look at a set of locations in the tree (Q,L,∆, F,•) is a tuple where ∆ ⊆ L ×Q×Q×Q is a set of bounded by the finite set Mov. transitions, F ⊆Qis a set of final states and •∈Qis an ini- tial state. A mapping µ ∈ U →Q is called a run of M if We first explain the transition function δ of the look-around tree (∗u,µ(u),µ(u·1),µ(u·2)) ∈ ∆ for any u ∈ U, where we define µ(⊥) = automaton. This δ takes as an argument, a set of information (the • ∈ Λ . An XML value with root node u (U) is accepted if there is a state-label pair) for each node u· mi at the relative position mi in run µ such that µ(u) ∈ F. Mov = {m1, m2,...mn}, and returns the state for the node u. τ We can assume for each XML type that we have a tree Mov = {m , m , ... , m } automaton M(τ) which defines the semantics τ = {t ∈ V | 1 2 n t accepted by M(τ)}. This is a standard assumption in the study of typed XML programming. See Hosoya et al. [HVP00], for this u δ detail. m i state : µ(u) The second model of XML types has a form of algebra whose do- label : main is finite. This algebra is called binoid [PQ68] in the literature, *uámi state : µ and is similar to syntactic monoid [Per90] for word languages. We (uámi) employ this representation as a canonical model of output XML types in the type inference algorithm in Section 4. 
As we can see For each node u, this set of information is given as a look-around from the definition, this algebra classifies a set V of XML values function, say h ∈ Mov → (L ×Q)⊥ (= (L ×Q) {⊥}). This h takes into a certain set of finite equivalence classes. The equivalence an argument m representing a relative position, and returns the pair classes are still diverse enough to check whether or not, an arbitrary · τ  of the label of, and the state assigned to, the node u m. If there is XML construction creates the result inside I . In other words, bi- no node at m i.e., u· m = ⊥, this h returns ⊥. noids provide the means of abstract interpretation of XML values. efinition M = Q,L, , Definition 2. A binoid for τ is an algebra V(τ ) = (V,•, F, D 3. A look-around tree automaton is ( Mov I I δ, F) such that Mov ⊂{0,1,2}∗ is a finite set of moves, δ ∈ (Mov → ( [ ]),( , )) such that (1) V is a finite set, •∈Vand F ⊆V, and ⊥ , ,τI, , , V τ (L ×Q) ) →Qis a transition function. A mapping µ ∈ U →Qis (2) (V () ( [ ]) ( )) is homomorphic to ( I). That is, we M µ = δ ∈ have a mapping ( ◦) ∈ V →Vsatisfying (i) ()◦ = •, (ii) v ∈ τI iff called a run of ,if (u) (h) for all u U, where the look- ◦ ∈ → L ×Q ⊥ µ v◦ ∈ F, (iii) ([t])◦ = [t◦], and (iv) (t , t )◦ = t◦ , t . around function h Mov ( ) for this u, is defined from as An algorithm, given a non-deterministic tree automaton represent- h(m) = (∗u· m,µ(u· m)) if u· m ∈ U τ ing I, that constructs one binoid satisfying the above definition is h(m) = ⊥ otherwise

      75 Term(C, X)  e ::= exactly when q is assigned to u. On the other hand, tree transducers | c (c ∈ C, constants) associate each such pair with an output value. For example, the | x (x ∈ X, variables) identity function id given earlier, can be seen as a very simple tree | ee (application) transducer. This tree transducer has one state, say id, and for each | if e then e else e (conditional) node u in the input tree, id is associated with an output tree identical | letrec f x := e;··· in e (recursive def.) to the subtree of u. High-level tree transducers provide an extension of tree transducers. L  = Con( ) c :: The distinction is that each evaluation step of the transducer creates (B) (B) | true ,false (boolean constants) a functional value rather than a tree value. In this sense, high-level L | ( ) ( ∈ L, label constants) tree transducers are closer to functional programs. | ( = )(L→L→B) (label equality) | ()(o) (empty tree) A rule of high-level tree transducer is of the form f : y  e where | ( [ ])(L→o→o) (node constructor) f is a state, y is a sequence of parameter variables, and e is called | ( , )(o→o→o) (tree concatenation) a term. Here is an example of the rule, which corresponds to a function given in Section 2. Figure 4. Definition of Term(C, X) and Con(L). autom :(i→L→(L→L→L→L)→L) ini trans  trans (∗) The automaton accepts u iff µ(u) ∈ F for some µ. (if (|= 1) then autom1 ini trans else ini) (if (|= 2) then autom2 ini trans else ini) Example 5. A deterministic tree automaton uses a transition func- δ ∈ L ×Q×Q→ Q ∆ ∈ tion instead of the transition relation In this example, autom is a state, ini and trans are parameter vari- L ×Q×Q×Q of non-deterministic tree automaton. A deterministic ables, and trans( ... ini) is a term. As we can see from above, a Q,L,δ,•, tree automaton ( F) is an instance of look-around automa- term is almost an expression of the functional language. Terms also δ ton. For this, define the transition function of the look-around should be well-sorted, cf., the definition of sorts S(B) in Section 2. Q,{, , },L,δ , automaton ( 1 2 F),as The only difference is that sorts for terms do not have any occur-

      δ (h) = δ(lab(h()),st(h(1)),st(h(2))) rences of i. Parameter variables y are also the same as those in let and letrec. They just abbreviate λ-abstractions, i.e., f : y  e is where lab(,q) = ,st(,q) = q and st(⊥) = •. Note here that we equivalent to f : λy : e or f : let g y := e in g. always have h()  ⊥ because h is computed for each u ∈ U, The meaning of autom1,(∗),(|= 2), etc. in terms are sup- Look-around tree automata can be efficiently converted into non- plied by looking into neighbor nodes. For example, the meaning of deterministic tree automata. This is formalized by the following automm is supplied by evaluating the state autom at relative posi- proposition. tion m. Similarly, the meaning of (∗)m is the label of the node at relative position m. And, the meaning of (|= 2)m is whether or not Proposition 1. Look-around tree automata accept exactly regular |= 2 holds at relative position m. Recall that the meaning of the tree tree languages. In particular, they can be efficiently converted into transducer is given at each node u ∈ U. Therefore, when this node non-deterministic tree automata. u is supplied, such relative positions 1 and  are interpreted by u·1 and u· = u, respectively. Proof. Given M = (Q,L,Mov,δ,F). Without loss of generality, we assume that Mov is prefix-closed, i.e., m·m ∈ Mov ⇒ m ∈ Mov.We We call an arbitrary set X whose each element is associated with a create a non-deterministic tree automaton M = ((Mov → (L×Q) sort, as sorted set, ⊥) {•}(= Q ),L,∆, F ,•) which accepts the same language as M. We define ∆ so that (,h0,h1,h2) ∈ ∆ iff (i) h0() = (,δ(h0)), (ii) • Figure 4 defines the sorted set Term(C, X) of terms over sorted h0(k) = ⊥ iff hk = •, (iii) hk(0)  ⊥ if hk  •, and (iv) sets C and X of constants and variables, respectively. We re- quire that each term to be well-sorted in the usual sense for h (m·k) = h (m)(m·k ∈ Mov) 0 k simple types. hk(m·0) = h0(m)(m·0 ∈ Mov) • Figure 4 also defines a sorted set of basic constants Con(L) = , • = ⊥ where k 1 2 and (m) for all conditions (i-iv). We define over a set of labels L. F = {h | st(h()) ∈ F} where st(,q) = q (always h()  ⊥). Note here that if µ ∈ U →Q is a run of M then µ ∈ U →Qdefined by Let N be a set of states, C be a set of constants which may include

      µ(u) = snd(µ (u)()) is a run of M.Ifµ ∈ U →Qis a run of M then Con(L), and Mov ⊆{0,1,2}∗ be a set of moves. We call (|= m) and define µ by µ (u)() = (∗u,µ(u)) and µ (u)(m) = µ (u· m)(). (∗) predicates, whose set is denoted by P. We define (N P)Mov to be a set of pairs in the form nm such that n ∈ N P and m ∈ Mov. Each term e appearing in the rule f : y  e, is an element of High-level Tree Transducer We introduce the high-level tree Term(C, y (N P)Mov). transducer as a model of XML transformation. Type checking for XML transformations in high-level tree transducers is decidable. Let us define the high-level tree transducer. Note that our high-level tree transducers are not exactly equivalent to transducers by Engel- Tree transducers are tree automata with outputs. Recall that tree friet [EV88]. An essential difference is that our tree transducer is a automata assign states to nodes. Another way to look at this is tree-walking transducer with upward moves inside the input tree us- that tree automata associates state-node pairs with boolean values. ing ( ·0)-operator. Also our transducer allows the recursive inspec- That is, a state-node pair (q,u) is associated with the truth value tion of the input tree. For example, the function autom in Section 2

      76 cannot be captured by the Engelfriet’s definition of the deterministic B = ⎧{o,L,B,N} ⎫ high-level tree transducer which is a top-down tree transducer. This ⎪ N→ N→ ⎪ ⎨⎪ children( o),siblings( o), ⎬⎪ is comparable to the transducer with regular look-ahead [Eng77], = , N ⎪ N→o (o) o ⎪ P which in our case, is regular look-around. ⎩ root( ),dept ,emp( ) ⎭ P = {(|= 0),(|= 1),(|= 2),(∗)}, In the following, for each sorted set X, we denote by X(s), a subpart C = Con(L) N {( = )(N→N→B)} of X whose elements are associated with sort s. Var = { f (N)}, = {, , , }, efinition Mov 0 1 2 D 4. A (look-around deterministic) high-level tree-trans- f = dept ducer H over a finite set of labels L is a tuple H = (B, N,C, P,Var, I ⎧ ⎫ ⎪ (N→o)  ⎪ Mov, f ,R) where ⎪ children : f ⎪ I ⎪ |=    ⎪ ⎪ if ( 1) then siblings 1 f else () ⎪ ⎪ N→ ⎪ • B is a set of base sorts. We have o,L,B ∈B, but not i ∈B.In ⎪ siblings :( o) f  ⎪ ⎪ ⎪ the following, all elements of sorted sets have sorts in S(B). ⎪ (if f = dept then dept else emp), ⎪ ⎪ ⎪ ⎪ if (|= 2) then siblings2 f else () ⎪ • ⎪ ⎪ N is a sorted set of states. ⎪ (N→o)  ⎪ ⎪ root : f ⎪ ⎪ if (|= 0) then root0 f else ⎪ • C is a sorted set of constants. We have Con(L) ⊆ C. ⎨⎪ ⎬⎪ R = ⎪ if f = dept then dept else emp ⎪ ⎪ ⎪ • ⊆{|= | ∈{ , , }} { ∗ } ⎪ dept :(o)  ⎪ P ( m) m 0 1 2 ( ) is a set of predicates. Predi- ⎪ ⎪ |= ∗ B L ⎪ if ∗ = emp then () ⎪ cates ( m) and ( ) are associated with sort and respec- ⎪ ⎪ ⎪ else if ∗ = dept then dept[root emp] ⎪ tively. ⎪ ⎪ ⎪ else ∗[children dept] ⎪ ⎪ ⎪ • Var is a sorted set of variables. ⎪ emp :(o)  ⎪ ⎪ ⎪ ⎪ if ∗ = emp then ∗[children dept] ⎪ • ⊆{ , , }∗ ⎪ ⎪ Mov 0 1 2 is a finite set of moves. ⎩ else children emp ⎭ • f ∈ N is an initial state. Figure 5. I A high-level tree transducer corresponding to the func- tion dept in Figure 2 • R is a finite set of rules in the form f : y  e. Ð For each f ∈ N, we have exactly one rule in R. ÐIff∈ N(s1 →···→ sn) and y = y1,...,yn−1, we have (i) y j ∈ ∈ .. − ∈ ,   is given by means of the rewrite system, which corresponds to the Var(s j) for j 1 n 1, (ii) e Term(C y (N P) Mov )(sn). operational semantics. In this paper, we give a denotational seman- We do not fix the set of base sorts B, so that we can add a new sort. tics. This gives a clear meaning to functional values emitted by the However we always require that such sorts are associated with finite high-level tree transducer. domains. See Figure 6. In the denotational semantics, a transducer H has a meaning on ρ ∈ · →D· Functional programs introduced and satisfying the restriction in each node u in U, which is an assignment (N P)( ) such that each element in subset N(s) of states, as well as P(s)of Section 2, can be translated into high-level tree transducers. Recall D  D  that those programs are already in the similar shape to the trans- predicates, is associated with an element in s . Here s is → the cpo-based semantic domain given in Figure 6. In other words, ducer, i.e., functions of sort i s only occur at top-level letrec. D  Therefore, the translation is straightforward. We here just give s is the set of functional values of sort s. See the end of this ideas. See [Toz05] for the detailed steps. paragraph. The above meaning to each node is given by the follow- ing function semH : U → (N P)(·) →D·. Essentially, what we need is to remove the occurrence of expres- Definition 5. Given H = (B, N,C, P,Var,Mov, f ,R). The mean- sions of sort i, i → i, and i → s. 
Functional variables of sorts i → s I defined in the top-level letrec correspond to the finite set of states ing function semH : U → (N P)(·) →D· is defined as the least N. Their definitions are easily translated into rules of the tree trans- solution satisfying the following equations. For any f ∈ N such that ducer. However, variables of sort i → s may also occur as parameter ( f : y  e) ∈ R, and (∗),(|= m) ∈ P, → → → variables, e.g., an argument of children(i (i o) o). In this case, we interpret such variables as variables of a new base sort N. We then semH (u)(∗) = ∗u prepare a finite set of constants of sort N, which has one-to-one semH (u)(|= m) = (u·m) ∈ U = Dλ    → · correspondence to the state set N. We also prepare the equality semH (u)( f ) y : e [n m semH (u m)(n)]n∈N P,m∈Mov operator ( = )(N→N→B) over N. where Deρ is a semantics of term e under ρ given in Figure 7, λ = As a result, we translate programs into transducers with base sorts in which y : e abbreviates letrec g y : e in g. In particular, ∈ Λ B = {o,L,B,N} and constants C = Con(L) N {( = )(N→N→B)}. for the root node u (U), semH (u)( fI) defines the output of the A program given in Figure 2 is translated into the high-level tree transducer. transducer (B, N,C, P,Var,Mov, fI,R) in Figure 5. The definition of Deρ is the standard cpo semantics of simply- typed call-by-value languages [Mit96].

      Semantics of High-Level Tree Transducers In the original def- Let us briefly recall this semantics. A cpo (X,) is a poset whose inition by Engelfriet, the semantics of high-level tree transducers any directed subset has the lub. Starting from flat cpos Db for

      77 τ  Do = V⊥ tomaton, representing I, on the term e of the rule f : y e. In his Db = b⊥ where b  o case, this term defines a tree value, while in our case, it defines a D →  = D  → D  ⊥ functional value. To interpret e in our case, we extend the homo- s s ( s ⊥ s ) ◦ morphism between the set of XML values V and the binoid V(τI). Figure 6. Semantic domains

      D  ∈ Term(C, X)(·) → (X(·) →D·) →D· Extending Homomorphism ◦ to Functional Space Given a type τI, we can obtain a finite binoid V(τI) with the homomorphism D ρ = ρ ◦ x (x) from V to V(τI). This homomorphism is seen as an abstraction Dcρ = c function from infinite values to finite elements. Dee ρ = ⎧Deρ(De ρ) ⎪D ρ D ρ = ◦ D  ⎪ e if e true Here let us extend this definition of to domains s where s is ⎨ ◦ Dif e then e else e ρ = ⎪De ρ if Deρ = false other than o. We define the domain of images of as in Figure 8, ⎩⎪ ∈D  ◦ ∈A  ⊥ otherwise so that for v s we have v s . Then, the idea is to define ◦ ∈D· →A· so that it further satisfies Dletrec θ in eρ = Delfp(ζθ,ρ) ◦ ◦ ◦ where v (v ) = (v(v )) ζθ,ρ (∈ (X(·) →D·) → (X(·) →D·)) = λρ : v ∈Ds → s v ∈Ds  ρ → λ ∈ D ,...,  D ρ → →···→ for and . [ f v s1 sn−1 : e [x v]]( f (s1 sn) x:=e(sn))∈θ xample V Figure 7. Semantics of terms in Term(C, X) E 6. We assume the binoid (comp[ds]) in Example 4. Let us consider gapped values of sort o → o. For example, p = comp[dept[]] (∈Do → o). What we need here is to define ◦ ⊥ ⊥ ⊥ p ∈ (V(comp[ds]) →⊥ V(comp[ds]) ) (= Ao → o) as a func- Ao = V(τ )⊥ tion. I ⎧ Ab = Db where b  o ⎪ = + = ⊥ ⎨⎪ comp[ds] if a emp[] or a () As → s = (As  →⊥ As) ◦ = τ  ⊥ p (a) ⎪ E otherwise, a ⎩ ⊥ = ⊥ Figure 8. Abstract semantic domain a This function indeed satisfies p◦(v ◦) = (p(v ))◦ for v ∈ V. For ex- ◦ ◦ ◦ + ample, assume v = dept[]. We have p (v ) = p (dept[es] ) = τE, ◦ ◦ and (p(v )) = (comp[dept[dept[]]]) = τE. base sorts, we can obtain cpos for function spaces Ds → s. Par- tial orders for functions are defined as f  g iff ∀x : f (x)  g(x). In Note that the homomorphic images As of Ds are finite sets. ⊥ ⊥ ◦ the case of call-by-value, we use a strict function space A →⊥ B The function gives a way to interpret each values as abstract val- ⊥ ⊥ ⊥ ( A → B ) such that f ∈ A →⊥ B satisfies f (⊥) = ⊥, i.e., the ap- ues. Such abstract values are suitable for analysis, because they are plication of a function to an error value results in an error value, i.e., finitely enumerable. non-termination. In the cpo-based semantics, the meaning of recur- ◦ sive functions is the least fixpoint of some equations. The above Definition 6. The abstraction function ∈D· →A· is a par- semH is indeed such a least fixpoint. tial function defined as follows. We use the induction of the size of s in extending ◦ to Ds →As. 4 Type Checking • v◦ = v(∈Ab ), if v ∈Db for b ∈B\{o}. ◦ So far, we have introduced three tools, namely • A value v ∈Ds → s is in the domain of written v ∈ Dom(◦), if for any v and v (∈ Dom(◦) ∩Ds ) such that ◦ • Binoids with homomorphism , v ◦ = v ◦, this v satisfies (v(v ))◦ = (v(v ))◦. • Look-around tree automata, and • For v ∈ Dom(◦), we define v◦ (∈As → s) to be the function • Tree-transducer and its semantics. defined as v◦(v ◦) = (v(v ))◦. Here we connect these tools and derive our type inference algo- The above definition, however, seems incomplete, since it just says rithm. In particular, the key idea is the extension of the homomor-  ◦ ◦ that we ignore values v Dom( ). That is, values do not satisfy the phism for binoids to functional spaces Ds → s. desired property. What is more interesting is the following result. As we discussed, a common technique to the tree transducer type Lemma 1. [Toz05] For any ◦, all outputs v (∈Ds) of transducers checking is based on the inverse type inference. In the case of are in Dom(◦). tree transducers or macro tree transducers (mtts), the inverse image −1 f (τI) is regular. 
The expressiveness of high-level tree transducers From this lemma, we can show that any output of transducers of sort I o → B is a constant function. We take a singleton set V = {•} as the is the same as k-composition of mtts [EV88], where k is the height ◦ ◦ −1 τ homomorphic image of V. Then for any t, t ,wehavet = t = •, of sorts. Therefore the inverse image fI ( I) should be a regular ◦ ◦ language also for high-level tree transducers. However, as far as we so that f (t) = ( f (t)) = ( f (t )) = f (t ). We belive that this is the know, there is no direct construction algorithm of the inverse image meaning of non-observability of output values of sort o. of high-level tree transducers. We give one such construction here.

      Maneth [Man04] gave a simple algorithm for inferring regular in- Negative Inverse Type Inference The remaining steps of the verse images for deterministic mtts. His idea was to run the au- type inference algorithm are as follows.

      78 ◦ ◦ Definition 7. The abstract meaning function sem : U → (N • Interpreting the meaning function semH by , and obtain H ◦ P)(·) →A· is the least solution of the following equations. For semH . any f ∈ N such that ( f : y  e) ∈ R, and (∗),(|= m) ∈ P, • M ◦ Defining the look-around tree automaton capturing semH . ◦ ∗ = ∗ This M gives the result of type inference. semH (u)( ) u ◦ |= = · ∈ semH (u)( m) (u m) U M ◦ ◦ Accurately speaking, what the above represents, is a negative sem (u)( f ) = Aλy : e[nm → sem (u· m)(n)] ∈ , ∈ −1 \τ H H n N P m Mov inverse image fI (V I). This is fortunate. After we inferred such an image, what we need is the emptiness check, known to be effi- where Aeρ is an abstract semantics of term e under ρ given simi- cient. larly to Figure 7, except that it uses operators on abstract semantic domains. υ  ∩  −1 \ τ  = ∅ I fI (V I) ◦ Figure 9. The abstract meaning function semH M −1 τ If this holds, the type checking succeeds. If was fI ( I), the above emptiness check turns to the containment test, which is not Definition 8. Given a transducer H, and a binoid V(τI), we de- always efficient. We later explain why our construction creates an fine a look-around automaton M = (Q,L,Mov,δ,F) as follows. automaton for such a negative image. • Mov,L are the same as H, •Q= (N P)(·) →A·, ◦ ⊥ First, we interpret the semantic function semH by means of just • δ ∈ (Mov → (L ×Q) ) →Qis defined as ◦ δ |= =  ⊥ introduced. Indeed, this can be done. This gives a function semH (h)( m) (h(m) ) (∈ U → (N P)(·) →A·) which satisfies, for all u ∈ U and n ∈ δ(h)(∗) = lab(h()) N P δ(h)( f ) = Aλy : e[nm → st(h(m))(n)]n∈N P,m∈Mov for all f : y  e ∈ R ◦ = ◦ (semH (u)(n)) semH (u)(n) where lab(,q) = ,st(,q) = q and lab(⊥) = st(⊥) = ⊥. ◦ ◦ • F = {ρ ∈Q|ρ( fI) τI }. cf., Lemma 2(a). The definition of semH in Figure 9 is exactly the M copy of Definition 5 while it uses operators on V(τI). Figure 10. The definition of the inferred automaton

      What remains is to define the look-around automaton that captures ◦ this semH . The resulting automaton is given in Figure 10. This automaton has its state set Q = (N P)(·) →A·. This set Q clas- Running Example We apply the algorithm explained so far to sifies the nodes of the input XML tree according to the (abstract) a small example. We use the following type checking problem id : output value of the transducer at its each state. A run of M gives υI → τI. Let L = {a,b}. one such classification of nodes in the input XML tree. Now, recall ◦ • The function id is translated into the following transducer H = that the same information was given by semH , which defines the ab- (B, N,C, P,Var,Mov,f ,R) with one rule stract semantics of the transducer for each state-node pair. Indeed, I ◦ ◦ this automaton captures semH in the sense that semH is always a M µ ∈ →Q ◦ (N→o)  run of . Confirm that the run of automaton U and semH id : (∈ U → (N P)(·) →A·) has the same signature. Also notice the ((∗)[if (|= 1) then id1 else ()), similarity between the transition function δ of M and the definition if (|= 2) then id2 else () ◦ δ ◦ of semH . This is defined so that it simulates semH . We have N = {id}, P = {(∗),(|= 1),(|= 2)}, Mov = {,1,2} and M f = id. As readers may expect, this exactly defines the negative inverse I image we want. • υI = b[a[]], and Lemma 2. [Toz05] (a) For all u ∈ U and n ∈ N P, we have • τ = letrec α := a[α]∗ in α. ◦ = ◦ ◦ M I (semH (u)(n)) semH (u)(n). (b) semH is the least run of . (c) V τ = { , α +,τ } The automaton M in Definition 8 accepts u iff semH (u)( fI) τI. We have ( I) () a[ ] E . The detailed proof is omitted here. Here we just note why we need For this problem, we compute the look-around automaton in Defi- to infer the negative inverse image. This is related to our treatment nition 8 whose transition function δ is shown below. The state set Q of non-termination as error, cf., Section 2. of M is a set of mappings (N P)(·) →A·, so that the transition function δ ∈ ({,1,2}→(L×Q)⊥) →Q,givenh ∈{,1,2}→(L×Q)⊥ In Definition 8, we define the final states F of M negatively, i.e., again returns functions. acceptance means type error. Note that if the program is correct, ∈ τ  ◦ i.e., semH (u)( fI) I , then semH should also give the correct re- δ(h)(|= 1) = (h(1)  ⊥) τ ◦ ◦ δ |= =  ⊥ sult in I . From the above lemma (b), if semH gives the correct (h)( 2) (h(2) ) result, i.e., is a non-accepting run of M, then “any run” is also non δ(h)(∗) = lab(h()) accepting. This shows the only-if direction of Lemma (c) (the other δ(h)(id) = + + direction is easy). a[α] st(h())(∗) = a and st(h(k))(id) ∈{⊥,(),a[α] } (k = 1,2) τE otherwise Now, assume that we include ⊥ (non-termination) to the correct result. In this case, we cannot say more than “some run is correct” M ◦ We can convert this automaton to a non-deterministic tree au- from the fact that semH gives the correct result. In fact, in this case, tomaton using the construction in Proposition 1. We here instead we must have defined the set of final states of M positively. just test the input type υI using M.

      79 In this case, since υI just defines a single tree with two nodes, the programming, it is not so usual to use functions whose order type checking problem amounts to check whether or not M accepts is more than second. So it is too early to conclude that the ap- the tree b[a[]] with two nodes, u0 = (b[a[]],) and u1 = (b[a[]],1). proach is infeasible. Our implementation naively implements In this example, we only have one run µ shown below. the enumeration of states of automata M in Section 4. We are currently seeking a different algorithm for the practical use, µ |= = µ |= = (u0)( 1) true (u1)( 1) false i.e., in XML programming languages. µ(u )(|= 2) = false µ(u )(|= 2) = false 0 1 • µ(u0)(∗) = b µ(u1)(∗) = a Connection to the type theory µ = τ µ = α + (u0)(id) E (u1)(id) a[ ] Type-checking is the central issue of functional programming. ◦ There are many approaches to type-check programs based on Now we can see that µ(u )(id) τ  . So this run is an accepting 0 I type systems. However, as far as we know, there are no such run of M. Thus the type checking id : υ → τ in this case is not I I type systems which capture the tree transducer type checking successful. as shown here. In particular, the restrictions as we gave in Section 2, do not seem to be natural assumptions in the study 5 Related Work of type systems. We are seeking their meaning.

      Milo et al. [MSV00] first propose a solution, based on inverse type inference, to the type checking for XML programming mod- 7 Acknowledgment eled by tree transducers. Milo et al. solve this problem for k- pebble transducers. The k-pebble transducers are in theory k + 1- I thank to anonymous referees for detailed reading and suggestive fold composition of mtts [Man04], and it is comparable to high- comments to the earlier draft of this paper. I also thank to Makoto level tree transducers, which is also represented by k-composition Murata for proof-reading this version of the paper. of mtts where k is the height of sorts [EV88]. Similar ap- proaches have been studied for different kinds of tree transducers [Toz01][AMN+01][MN02][MBPS05]. 8 References

      XDuce is a pioneering work [HVP00, HP03] on typed functional [AMN+01] N. Alon, T. Milo, F. Neven, D. Suciu, and V. Vianu. XML programming, which employs type checking with type- XML with data values: Typechecking revisited. In annotation. XDuce is a first order language. Its approach has Proceedings of the 20th ACM SIGACT-SIGMOD- also been employed in a number of typed XML processing lan- SIGART Symposium on Principles of Database Sys- guages, including an industrial language such as XQuery. Frisch tems, pages 138Ð149, 2001. et al. [FCB02] extended tree regular expression types in XDuce to higher order functional types. Their language is called CDuce. [CMS02] Aske Simon Christensen, Anders Moller,¬ and Michael I. Schwartzbach. Static analysis for dynamic XDuce and CDuce require type annotations. In general, they cannot XML. In Proceedings of 1st Workshop on Program- solve the type checking problem such as id : a[b[]] → a[b[]] as it is, ming Languages Technology for XML (PLAN-X 2002), by the following reasons. 2002. [Eng77] Joost Engelfriet. Top-down tree transducer with • When using XDuce, we can annotate id only by trivial types, regular look-ahead. Mathematical Systems Theory, → e.g., Any Any. For example, when we type-check id against 9(3):289Ð303, 1977. a[b[]] → a[b[]], we have to check id also against b[] → b[]. This is not possible in XDuce which associates a single arrow [EV88] Joost Engelfriet and Heiko Vogler. High level tree type with each recursive function. transducers and iterated pushdown tree transducers. Acta Informatica, 26(2):131Ð192, 1988. • CDuce has intersection types. By giving a type annotation a[b[]] → a[b[]]∩b[] → b[], the function id passes the type [FCB02] Alain Frisch, Giuseppe Castagna, and Veronique Ben- check. It is even possible to prove that id : a[b[]] → a[b[]] zaken. Semantic Subtyping. In Proceedings, Sev- holds. This is based on their subtyping algorithm. enteenth Annual IEEE Symposium on Logic in Com- puter Science, pages 137Ð146. IEEE Computer Soci- a[b[]] → a[b[]] ∩ b[] → b[] <: a[b[]] → a[b[]] ety Press, 2002. However this process is still not automatic. Users need to [HP03] Haruo Hosoya and Benjamin C. Pierce. Regular ex- figure out what type annotation is necessary in beforehand. pression pattern matching for XML. J. Funct. Pro- gram., 13(6):961Ð1004, 2003. 6 Future Work [HVP00] Haruo Hosoya, Jer« omeˆ Vouillon, and Benjamin C. Pierce. Regular expression types for XML. In Pro- As a concluding remark, we note several future directions of this ceedings of the International Conference on Func- work. tional Programming (ICFP), pages 11Ð22, Sep., 2000. [Man04] Sebastian Maneth. Models of Tree Translation. PhD • Practical use with XML programming thesis, Proefschrift Universiteit Leiden, 2004. We implemented a prototype type-checker, and tried several [MBPS05] Sebastian Maneth, Alexandru Berlea, Thomas Perst, experiments. Our implementation works well for simple pro- and Helmut Seidl. Xml type checking with macro tree grams using small sorts, such as i → o. Unfortunately, for transducers. In PODS 2005, to appear, 2005. programs with larger sorts, the initial result was not promis- ing. This reflects the time complexity of the algorithm, which [Mit96] John C. Mitchell. Foundations of programming lan- is k-exponential to the height of sorts. However, in practical guages. MIT Press, 1996.

      80 [MN02] Wim Martens and Frank Neven. Typechecking top- down uniform unranked tree transducers. In ICDT 2002, pages 64Ð78, 2002. [MSV00] Tova Milo, Dan Suciu, and Victor Vianu. Type- checking for XML transformers. In Proceedings of the 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 11Ð22, 2000. [Per90] Dominique Perrin. Finite automata. In Handbook of Theoretical Computer Science, volume B, pages 1Ð57. 1990. [PQ68] C. Pair and A. Quere. Definition et etude des bilan- gages reguliers. Information and Control, (6):565Ð 593, Dec 1968. [Toz01] Akihiko Tozawa. Towards static type checking for XSLT. In Proceedings of the 1st ACM Symposium on Document Engineering. ACM Press, 2001. [Toz05] Akihiko Tozawa. Type checking for functional XML programming using high-level tree transducer, 2005. full paper, in prepation, http://www.trl.ibm.com/ people/akihiko/pub/curry-full.pdf.

      81 Accelerating XPathEvaluationagainst XML Streams

      Dan Olteanu Database Group, Saarland University, Germany [email protected]

      Streamsare anemerging technology for data dissemination in cases StructuralFilters. We exemplify structural filters on a (DBLP- wherethedata throughput orsize make it unfeasible to rely on the like) stream containing information about articles possibly followed conventional approach based on storing the data beforeprocessing only atthevery end of the streambyinformation about books. Con- it. Querying XML streams without storing and without decreasing sider the SPEX network foraquery asking forauthors of books considerably the data throughput is especially challenging because with given pricesand publishers. Incasethetransducer instructed XML streams canconvey tree structured data with unbounded size to find books-nodes, saythebooks-transducer, encounters such a and depth. We demonstrate a novel compile-time optimization of node, then it sends it further to itssuccessors, with an additional SPEX [1], anXMLstream queryprocessor with polynomialcom- non-empty annotation signaling a match. Incase it encounters other bined complexity. This optimization isachieved by stream filters nodes, e.g., article-nodes,thenitstill sends it further, but with an that exploit the structural relationships between XML fragments en- empty annotation, signaling a non-match. Either way, all nodes countered along the stream at various processing states in order to from the stream reach all transducers from the network, although skip large streamfragments irrelevant to the query answer.Theeffi- this is not necessary. We can reduce the streamtraffic between ciency of optimized SPEX is positively confirmed by experiments. transducers in (atleast) two ways.

      Querying XMLStreams with SPEX. SPEX compiles XPath 1. Because all transducers following the books-transducer in the queries into networks of deterministic transducers, afterrewriting network lookalways for nodes in the stream following the books- them to forward equivalents.Anetwork foragiven forward query nodes, the queryevaluation is not altered, if the books-transducer consists of two connected parts. The upper parthas the shape of the sends further only the nodesstarting with the first books-node and query, i.e., it isasequence if the queryisasimple path, a tree if ending together with the stream, and the other transducers do the the queryhas predicates, and a directed acyclic graph, if the query same for the nodes they areinstructed to find relative to nodes found has set operators.Each step in the query inducesacorresponding by their previous transducers. transducer, and each predicate inducesabegin-scope transducer in the network. The upper partis extended with astream-delivering 2. Assume the transducers receiving (directly or indirectly) nodes in transduceratits beginning, and with an answer transduceraf- from the books-transducer look for nodes to be found only in- ter the transducer corresponding to the last step outside the query side the fragments corresponding to books-nodes (liketheir descen- predicates.Thelower partisan answer-collecting funnel, i.e., a dants,orsiblings of their descendants). Then, the books-transducer subnetwork of auxiliarytransducers serving to collect the potential can safely send further only such streamfragments corresponding answers.This funnel mirrors in in out transducers, and begin-scope to books-nodes. in end-scope transducers while preserving their nesting. Both aforementioned approaches to streamtraffic minimization can Processing anXMLstreamcorresponds to a depth-first, left-to- be easily supported by SPEX extended with special-purposepush- right, preorder traversalofits (implicit) tree. Exploiting the affinity down transducers,called structural filters.For example, in the sec- between preorder traversal and stack management, the transducers ond case above, a vertical filter,placed immediately after the books- usetheirstacks forremembering the depth of the nodes in the tree. transducer, sends further only streamfragments corresponding to This way, binary relations expressed as axes, e.g., child and descen- books-nodes.Also, in the firstcase above, this filter canbea di- dant,can be computed in asingle pass.Thetransducer network pro- agonal filterand send further only streamfragmentsstarting with cesses the stream annotatedbyits firsttransducer in, and generates an opening tag books.Diagonal filters are not alwayssuperseded progressively the output streamconveying the answers to the orig- by vertical filters, as for the aboveexamples. Itis enough to con- inal query. The other transducers in the network process stepwise sideraslightly modified example, where the query asks for Web the received annotated stream and send it with changed annotations links following books,thusafollowing-transducer gets the stream to theirsuccessor transducers. E.g., a transducer child moves the processed by the books-transducer.Furthermore, if the query con- annotation of each node to all children of that node. 
The answers strains the Web links to appear atthesame depth with books in the computed by a transducer network are among the nodesannotated tree conveyed by the stream, then the filter,here a horizontal one, by the answer transducer.These nodesare potential answers, as would send further only the streamfragments corresponding to the they may depend on a downstream satisfaction of predicates.The following siblings of the books-nodes. information on predicate satisfaction is conveyed by annotations to 1 References the stream. Until the predicate satisfaction is decided, the potential [1] F. Bry,F.Coskun,S.Durmaz, T.Furche,D.Olteanu, and M.Span- answers arebuffered by the out transducer. nagel. The XML stream query processor SPEX. In Proc. of Int. Conf. on Data Engineering (ICDE), 2005.

      82 declarative XQuery style of An XQuery-based imperative XML programming. The demo will show that despite of the fact that the expression programming language with a part of the language is less rich then database optimizer XQuery the language has the same opportunities for intelligent Anguel Novoselsky optimizations, provided that one admit to Zhen Liu pay the price of a more complex Daniela Florescu optimizer. Oracle Corporation The demo will show the compiler of such a language, and its virtual machine, and will exemplify the optimization XML data programming will become an opportunities on a couple of application increasing important problem of the programs. The virtual machine is years to come. The currently proposed common to all three languages: XSLT solutions fall in one of the two major 2.0, XQuery and XScript and uses categories: extensions of existing major extensively Oracle’s XML infrastructure programming languages with native (e.g. parsers, type system, runtime). The XML type and native processing compiler uses extensive data flow capabilities (e.g. Xlinq) or extensions of analysis to recover from the imperative existing XML processing languages like style of programming the opportunities XQuery (e.g. XL). The language we for rewriting and optimization traditional propose (temporarily called XScript) is in declarative languages. Due to this another variant of an XML scripting type of optimization Xscript programs language based on XQuery. can effectively scale to manipulate large volumes of data, similar to the way Two major avenues were investigated in databases scale to process large amounts the past to extend XQuery to a full of data. programming language. The first approach added the notion of statements We will show four different execution to XQuery, and duplicating the iterators scenarios for XScript programs: (FLWOR expressions) and conditionals (a) standalone execution in the both as expression constructors as well middle-tier as statement constructors. The second (b) standalone execution in the approach is a pure compositional database server approach: the side-effect operators (c) execution in the middle tier become normal expressions, and are with query shipping to the composed with the rest of the language. database server (d) execution in the database We investigate here a third stylistic server exploiting the exiting approach. We added the notion of relational optimizer and statements to XQuery and kept the runtime expressions side-effect free. Statements include update operations, variable The goal of this work is to create a assignments and error handling. They bridge between the imperative style of also include the iteration and programming, natural to programmers, conditionals. However, the iteration and and the performance advantages of a conditions are eliminated the expression declarative compiler and optimizer, part of the language. The main goal in hence obtaining the best of both worlds. the design of this particular approach is simplicity and ease of use for a large number of developers who might not be familiar or comfortable with a

      83 Xcerpt and visXcerpt: Integrating Web Querying

      Sacha Berger Franc¸ois Bry Tim Furche University of Munich, Institute for Informatics, http://www.ifi.lmu.de/

Xcerpt [2] and visXcerpt [1], cf. http://xcerpt.org/, are Web query languages related to each other in an unusual way: Xcerpt is a textual query language, visXcerpt is a visual query language obtained by rendering Xcerpt query programs. Furthermore, Xcerpt and visXcerpt, (vis)Xcerpt for short, have been conceived for querying both standard Web data such as XML and HTML and Semantic Web data such as RDF and Topic Maps.

This paper describes a demonstration focusing on three aspects of (vis)Xcerpt. First, its core features, especially the pattern-oriented queries and answer-constructors, its rules or views, and its specific language constructs for incomplete specifications; incomplete specifications are essential for retrieving semi-structured data. Second, the integrated querying of standard Web and Semantic Web data, which eases access to the two kinds of data in the same query program. Third, the complementary and integrated nature of the two languages.

Setting of the Demonstration. In the demonstration, prototypes of both the textual query language Xcerpt and its visual rendering visXcerpt are demonstrated in parallel on the same examples. Both prototypes rely on the same runtime system for evaluating queries, but differ in rendering: visXcerpt provides a two-dimensional rendering of textual Xcerpt programs implemented using mostly HTML and CSS. Additionally, the visual prototype provides an interactive environment for editing visXcerpt queries, as well as for data, query, and answer browsing.

Excerpts from DBLP (http://www.informatik.uni-trier.de/~ley/db/) and from a computer science taxonomy form the base for the scenario considered in the demonstration. DBLP is a collection of bibliographic entries for articles, books, etc. in the field of Computer Science. DBLP data are representative of standard Web data, using a mixture of rather regular XML content combined with free-form, HTML-like information. A small Computer Science taxonomy has been built for the purpose of this demonstration. Very much in the spirit of SKOS [3], this is a lightweight ontology based on RDF and RDFS. Combining such an ontology as metadata with the XML data of DBLP is a foundation for applications such as community-based classification and analysis of bibliographic information using interrelations between researchers and research fields. Realizing such applications is eased by using the integrated Web and Semantic Web query language (vis)Xcerpt, which also allows reasoning using rules.

Technical Content of the Demonstration. The use of query and construction patterns in (vis)Xcerpt is presented, both for binding variables in query terms and for reassembling the variables in so-called construct terms. The variable binding paradigm is that of Datalog, i.e. the programmer specifies patterns (or terms) including variables. Special interactive behavior of variables in visXcerpt highlights the relation between variables in query and construct terms. Arguably, pattern-based querying and constructing, together with the variable binding paradigm, make complex queries easier to specify and read. This is demonstrated by online query authoring and refactoring.

To cope with the semistructured nature of Web data, (vis)Xcerpt query patterns use a notion of incomplete term specifications with optional or unordered content specification. This feature distinguishes (vis)Xcerpt from query languages like Datalog and query interfaces like "Query By Example" [4]. Simple, yet powerful textual and visual constructs of incompleteness are presented in the demonstration.

An important characteristic of (vis)Xcerpt is its rule-based nature: (vis)Xcerpt provides rules very similar to SQL views. Arguably, rules or views are convenient for a logical structuring of complex queries. Thus, in specifying a complex query, it might ease the programming and improve the program readability to specify (abstract) rules as intermediate steps, very much like procedures in conventional programming. Another aspect of rules is the ability to solve simple reasoning tasks. Both aspects of rules are needed for the demonstration scenario.

Referential transparency and answer-closedness are essential properties of Xcerpt and visXcerpt, surfacing in various parts of the demonstration. They are two precisely defined traits of the rather vague notion of "declarativity". Referential transparency means that within a definition scope, all occurrences of an expression have the same value, i.e., denote the same data. Answer-closedness means that replacing a sub-query in a compound query by a possible single answer always yields a syntactically valid query. Referentially transparent and answer-closed programs are easy to understand (and therefore easy to develop and to maintain), as the unavoidable shift in syntax from the data sought for to the query specifying this data is minimized.

A novelty of the visual language visXcerpt is how it has been derived from the textual language: as a rendering, without changing the language constructs and the runtime system for query evaluation. This rendering is mainly achieved via CSS styling of the constructs of the textual language Xcerpt. The authors believe that this approach to twin textual and visual languages is promising, as it makes those languages easy to learn and easy to develop. The first advantage is highlighted in the demonstration by presenting both languages side by side.

References

[1] S. Berger, F. Bry, S. Schaffert, and C. Wieser. Xcerpt and visXcerpt: From Pattern-Based to Visual Querying of XML and Semistructured Data. In 29th Intl. Conf. on Very Large Data Bases, 2003.
[2] S. Schaffert and F. Bry. Querying the Web Reconsidered: A Practical Introduction to Xcerpt. In Extreme Markup Languages, 2004.
[3] W3C. Simple Knowledge Organisation System (SKOS), 2004.
[4] Moshé M. Zloof. Query-by-Example: A Data Base Language. IBM Systems Journal, 16(4):324–343, 1977.

This research has been funded by the European Commission and by the Swiss Federal Office for Education and Science within the 6th Framework Programme project REWERSE, number 506779 (cf. http://www.rewerse.net/).

XJ: Integration of XML Processing into Java

      Rajesh Bordawekar Michael Burke Igor Peshansky Mukund Raghavachari

      IBM T.J. Watson Research Center {bordaw, mgburke, igorp, raghavac}@us.ibm.com

1 Introduction

XML has emerged as the de facto standard for data interchange. One reason for its popularity is that it defines a standard mechanism for structuring data as ordered, labeled trees. The utility of XML as an application integration mechanism is enhanced when interacting applications agree on the structure and vocabulary of labels of the XML data interchanged. This requirement has led to the development of the XML Schema standard: an XML Schema specifies a set of XML documents whose vocabulary and structure satisfy the constraints in the XML Schema.

Despite the increased importance of XML, the available facilities for processing XML in current programming languages are primitive. Programmers often use runtime APIs such as DOM [6], which builds an in-memory tree from an XML document, or SAX [5], where an XML document parser raises events that are handled by an application. None of the benefits associated with high-level programming languages, such as static type checking of operations on XML data, are available. The responsibility of ensuring that operations on XML data respect the XML Schema associated with it falls entirely on the programmer.

The alternative approach to using standard interfaces to process XML data is to embed support for XML within the programming language. Support for query languages such as XPath in the programming language provides a natural, succinct and flexible construct for accessing XML data. Extending current programming languages with awareness of XML, XML Schema, and XPath through a careful integration of the XML Schema type system and XPath expression syntax can simplify programming and enable useful services such as static type checking and compiler optimizations.

The subject of this demonstration is XJ, a research language that integrates XML as a first-class construct into Java. The design goals of XJ distinguish it from other projects that integrate XML into programming languages. The goal of introducing XML as a type into an object-oriented imperative language is not new: Cω [1], Xtatic [3], and Xact [4] have studied the integration of XML into C# and Java. What sets XJ apart from these and other languages is its consistency with XML standards such as XML Schema and XPath, and its support for in-place updates of XML data, thereby keeping with the imperative nature of general-purpose languages like Java.

2 Demonstration Overview

This demonstration will introduce XJ and its language features using an Eclipse-integrated development environment [2], and demonstrate how the XJ compiler converts XJ code into Java code that uses DOM [6] to perform accesses to XML data. We will also discuss optimizations, such as common sub-expression elimination, which are applicable broadly to any XML processing language, including XQuery.

References

[1] G. Bierman, E. Meijer, and W. Schulte. The essence of data access in Cω. In Proceedings of the European Conference on Object-Oriented Programming, 2005.
[2] Eclipse project. XML Schema infoset model. http://www.eclipse.org/xsd/.
[3] V. Gapeyev and B. C. Pierce. Regular object types. In Proceedings of the European Conference on Object-Oriented Programming, pages 151–175, 2003.
[4] C. Kirkegaard, A. Møller, and M. I. Schwartzbach. Static analysis of XML transformations in Java. IEEE Transactions on Software Engineering, 30(3):181–192, March 2004.
[5] Simple API for XML. http://www.saxproject.org.
[6] World Wide Web Consortium. Document Object Model Level 2 Core, November 2000.
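To make the contrast drawn in Section 1 concrete, the following is a minimal sketch in plain Java (not XJ, and not taken from the demonstration): XML is accessed through the standard DOM and javax.xml.xpath APIs, the query is an untyped string, and the result must be cast by hand. The input file and the element and attribute names are invented for illustration.

  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.xpath.XPath;
  import javax.xml.xpath.XPathConstants;
  import javax.xml.xpath.XPathFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  public class UntypedXPathAccess {
      public static void main(String[] args) throws Exception {
          // Build an in-memory DOM tree; nothing ties it to an XML Schema.
          Document doc = DocumentBuilderFactory.newInstance()
                  .newDocumentBuilder().parse("orders.xml");   // hypothetical input
          XPath xpath = XPathFactory.newInstance().newXPath();
          // The query is a plain string: a typo or a schema violation is
          // discovered only at run time, and the result type is asserted by a cast.
          NodeList items = (NodeList) xpath.evaluate(
                  "/order/item[@quantity > 1]", doc, XPathConstants.NODESET);
          for (int i = 0; i < items.getLength(); i++) {
              Element item = (Element) items.item(i);
              System.out.println(item.getAttribute("name"));
          }
      }
  }

Nothing in this fragment is checked against a schema at compile time; it is exactly the kind of access that XJ's integration of XML Schema types and XPath syntax is meant to check statically.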

XML Support in Visual Basic 9

Erik Meijer ([email protected])   Brian Beckman ([email protected])

XML Programming Using DOM

Programming against XML using the DOM API today is a bitch. The accidental complexity of working with the DOM is so high that many programmers are giving up on using XML altogether, cursing the hype that XML makes dealing with data simple, which no one who has actually written DOM code could claim. The W3C DOM was not designed with ease of programming in mind, but rather evolved as a design by committee from the existing DHTML object model originally created by Netscape.

The DOM implementation as surfaced in the .NET frameworks as the System.Xml.XmlDocument API is extremely imperative, irregular, and complex. Nodes are not first-class citizens and have to be created and exist in the context of a given document. The access patterns for attributes and elements are gratuitously different, and the handling of namespaces is confusing at best. Finally, even pretty-printing an XML document takes several lines of arcane and complex code, since the .ToString() method is not properly overridden.

XML Programming Using XLinq

To address the complexity of working with XML, we designed XLinq, a new, modern, lightweight XML API that is designed from the ground up with simplicity and ease of programming in mind. Moreover, XLinq integrates smoothly with the language-integrated queries of the LINQ framework. The XLinq object model contains a handful of types. The abstract class XNode is the base for element nodes; the abstract class XContainer is the base for element nodes that have children. The XElement class represents proper XML elements, and the XAttribute class represents attributes and is stand-alone; it does not derive from XNode. The XName class represents fully expanded XML names.

In XLinq, nodes are truly first-class citizens that can be passed around freely, independent of an enclosing document context. Nested elements are constructed in an expression-oriented fashion, but XLinq also supports imperative updates in case programmers need them. Elements and attributes are accessed uniformly using familiar XPath axis-style methods, while namespace handling is simplified using the notion of universal names throughout the API. Last but not least, .ToString() actually works, so it is trivial to pretty-print XML documents using a single method call.

XML Programming Using VB

On top of the base XLinq API, Visual Basic adds XML literals with full namespace support, and late-bound axis members for attribute, child, and descendant access. Programming against XML now actually is easy, as it was originally intended.

With XML literals, we can directly embed XML fragments in a Visual Basic program. Inside XML literals we can leave holes for attributes, attribute names, or attribute values, for element names by using (expression), or for child elements using the ASP.Net style syntax <%= expression %>, or <% statement %> for blocks. The Visual Basic compiler takes XML literals and translates them into constructor calls to the underlying XLinq API. As a result, XML produced by Visual Basic can be freely passed to any other component that accepts XLinq values, and similarly, Visual Basic code can accept XLinq XML produced by external components.

Visual Basic's XML literals also simplify handling of namespaces. We support normal namespace declarations, default namespace declarations, and no namespace declarations, as well as qualified names for elements and attributes. The compiler generates the correct XLinq calls to ensure that prefixes are preserved when the XML is serialized.

Whereas XML literals make constructing XML easy in Visual Basic, the concept of axis members makes accessing XML easy. The essence of the idea is to delay the binding of identifiers to actual XML attributes and elements until run time. When the compiler cannot find a binding for a variable, it emits code to call a helper function at run time. This tactic will be familiar to many under the rubric "late binding", and, indeed, it is a form of ordinary Visual Basic late binding. But it has the advantage that the names of element tags and attributes can be used directly in Visual Basic code without quoting. As such, it relieves the programmer of the significant cognitive burden of switching between object space and XML-data space. The programmer can treat the spaces the same: as hierarchies accessed through ".".

More Information

More information on LINQ, XLinq and Visual Basic 9 can be found at http://msdn.microsoft.com/netframework/future/linq/
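The complaints about the W3C DOM above are easy to reproduce in code. The following hedged sketch uses Java's org.w3c.dom binding of the same W3C DOM (rather than System.Xml.XmlDocument); the element names are invented. Nodes can only be created through a document, attributes and child elements are handled through different patterns, and indented printing needs a separate Transformer incantation.

  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.transform.OutputKeys;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.dom.DOMSource;
  import javax.xml.transform.stream.StreamResult;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;

  public class DomAccidentalComplexity {
      public static void main(String[] args) throws Exception {
          // Nodes are not free-standing values: every node is created via a document.
          Document doc = DocumentBuilderFactory.newInstance()
                  .newDocumentBuilder().newDocument();
          Element book = doc.createElement("book");        // invented vocabulary
          book.setAttribute("year", "2006");               // attributes: one access pattern
          Element title = doc.createElement("title");      // child elements: another pattern
          title.setTextContent("PLAN-X 2006 Informal Proceedings");
          book.appendChild(title);
          doc.appendChild(book);

          // Pretty-printing requires several lines of unrelated machinery.
          Transformer t = TransformerFactory.newInstance().newTransformer();
          t.setOutputProperty(OutputKeys.INDENT, "yes");
          t.transform(new DOMSource(doc), new StreamResult(System.out));
      }
  }

XLinq's expression-oriented construction and uniform axis-style access, as described above, are designed to remove precisely this kind of ceremony.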

XACT – XML Transformations in Java

      Christian Kirkegaard and Anders Møller BRICS Department of Computer Science University of Aarhus, Denmark {ck,amoeller}@brics.dk

Introduction

XACT is a framework for programming XML transformations in Java. Among the key features of this approach are

• a notion of immutable XML templates for manipulating XML fragments, using XPath for navigation; and

• static guarantees of validity of the generated XML data, based on data-flow analysis of XACT programs using a lattice structure of summary graphs.

An early version of the language design and the program analysis is described in [3]. In [1], we present an efficient runtime representation. The paper [2] shows how the analysis technique can be extended to support XML Schema as type formalism and permit optional type annotations for improving modularity of the type checking.

Demonstration

We demonstrate the capabilities of XACT by stepping through an example, showing how the program analyzer works "under the hood". This involves

      1. desugaring special syntactic constructs to Java code;

      2. construction of summary graphs from XML templates and schemas;

3. data-flow analysis (based on Soot), including transfer functions for XML operations (a generic sketch follows this list); and

      4. validation of summary graphs.
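Step 3 deserves a word of explanation. The sketch below is a generic worklist solver in plain Java, not XACT's actual Soot-based analysis, and the Lattice and Transfer interfaces are invented for illustration (XACT's abstract values are summary graphs). It only shows the shape of the computation: transfer functions are applied to each node of a control-flow graph and results are joined in a lattice until nothing changes, i.e., until a fixed point is reached.

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // A minimal forward data-flow skeleton over a control-flow graph.
  final class Worklist {
      interface Lattice<V> {
          V bottom();
          V join(V a, V b);
      }
      interface Transfer<V> {
          V apply(int node, V in);   // abstract effect of one statement
      }

      static <V> Map<Integer, V> solve(List<int[]> successors,
                                       Lattice<V> lattice, Transfer<V> transfer) {
          Map<Integer, V> out = new HashMap<>();
          Deque<Integer> work = new ArrayDeque<>();
          for (int n = 0; n < successors.size(); n++) {
              out.put(n, lattice.bottom());
              work.add(n);
          }
          while (!work.isEmpty()) {                   // iterate until nothing changes
              int n = work.poll();
              V in = lattice.bottom();                // join the results of all predecessors
              for (int m = 0; m < successors.size(); m++)
                  for (int s : successors.get(m))
                      if (s == n) in = lattice.join(in, out.get(m));
              V result = transfer.apply(n, in);
              if (!result.equals(out.get(n))) {       // the value grew: propagate further
                  out.put(n, result);
                  for (int s : successors.get(n)) work.add(s);
              }
          }
          return out;                                 // the fixed point
      }
  }

In XACT the corresponding fixed point is a conservative approximation of the XML values that may occur at runtime, which is what the validation in step 4 then checks against the schema annotations.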

Specifically, we focus on the novel features: the support for XML Schema and optional type annotations. Schemas are converted, without loss of precision (ignoring keys and references), to a convenient subset of RELAX NG, and then further to summary graphs, which are then used in the data-flow analysis. When this analysis reaches a fixed point (which represents a conservative approximation of the XML values that may appear at runtime), the resulting summary graphs are validated relative to the schema annotations. By allowing type annotations, XACT permits a modular validity analysis where components can be analyzed individually. At the same time, type annotations are optional – they can be omitted for intermediate results that do not conform to named schema constructs, thereby supporting a flexible style of programming.

Implementation

Our implementation of the XACT analyzer and runtime system is available at http://www.brics.dk/Xact/

References

[1] Christian Kirkegaard, Aske Simon Christensen, and Anders Møller. A runtime system for XML transformations in Java. In Proc. Second International XML Database Symposium, XSym '04, volume 3186 of LNCS. Springer-Verlag, August 2004.
[2] Christian Kirkegaard and Anders Møller. Type checking with XML Schema in Xact. Technical Report RS-05-31, BRICS, 2005. Presented at Programming Language Technologies for XML, PLAN-X '06.
[3] Christian Kirkegaard, Anders Møller, and Michael I. Schwartzbach. Static analysis of XML transformations in Java. IEEE Transactions on Software Engineering, 30(3):181–192, March 2004.

XTATIC

      PLAN-X 2006 Demo

      Vladimir Gapeyev Michael Levin∗ Benjamin Pierce Alan Schmitt†

University of Pennsylvania
∗ Currently at Microsoft.   † Currently at INRIA Rhône-Alpes.

XTATIC integrates with a mainstream object-oriented language, C#, the key features of statically typed XML processing previously developed in XDUCE, a domain-specific XML processing language. These features include XML trees as built-in values, a type system based on regular types (closely related to schema languages such as DTD and their successors) for static typechecking of computations involving XML, and a powerful form of pattern matching called regular patterns.

By being an extension of C#, XTATIC receives, for free, abstraction, modularization, and control flow mechanisms of an established programming language, as well as access to its extensive libraries. The extension made by XTATIC to the core of C# is minimal: it consists of enriching the universe of C# values and types by constructs for trees and sequences that generalize those of XDUCE, and adding the pattern matching primitive for their processing. The key observation for the integration is that the semantics of trees in XDUCE easily generalizes to permit using, in place of XML tags, other kinds of values and types as tree labels – for example, objects and classes of C#. Then the integration of trees with the object-oriented data model of C# is accomplished by grafting the subtyping relation of the so generalized XDUCE regular types into the C# class hierarchy under a special class Xtatic.Seq, therefore making all regular types be subtypes of Seq. This allows trees and sequences to be passed to generic library facilities such as collection classes, stored in fields of objects, etc. Finally, this general extension encodes XML by trees that use objects from a special class Xtatic.Tag as tree labels. This approach is similar to the way arrays – which, like trees, are a form of structural types – are integrated in C# as subtypes of the special class System.Array.

Subtyping in XTATIC subsumes both the declarative object-oriented subclass relation and the richer, extensionally defined subtyping relation of regular types: it turns out that the traditional definition of subclassing can be reformulated, without changing the relation itself, to mimic XDUCE's definition of subtyping as inclusion between sets of values inhabiting the types. Likewise, XTATIC's pattern matching incorporates a natural form of type-based pattern matching on objects. This provides a safe alternative to casts as a mechanism for determination of an object's run-time type.

XTATIC does not support any form of destructive update of the sequence and tree structure of existing values. Instead, the language promotes a declarative style of processing, in which values and subtrees are extracted from existing trees and used to construct entirely new trees. This approach agrees with the treatment of trees in XSLT and XQUERY, and has a precedent in C# provided by strings, which are also decomposable, but immutable, values.

Due to the lightweight extension approach to the design of XTATIC, the feel of XML programming in the resulting language fits between programming with XML APIs and programming in high-level XML-specific languages. On one hand, XTATIC offers – as the high-level languages do – native and concise XML processing primitives and types instead of untyped low-level API manipulations. On the other hand, these primitives are used within the control flow and abstraction framework of an object-oriented language, which is more familiar to the majority of programmers than the more esoteric frameworks of XSLT and XQUERY. Psychological and educational considerations aside, this poses XTATIC as an attractive alternative to API-based programming in applications where efficiency is of immediate concern. Currently, such projects tend to avoid using XSLT, XQUERY, or even XPATH, due to uncertainty over the presence of optimization for high-level control flows in a given implementation of these languages, as well as lack of control over decisions of the optimizer. This control (indeed, full responsibility for implementing the high-level control flow) is in the hands of an XTATIC programmer to the same degree as for an API programmer.

These benefits are shared by XTATIC with other current proposals for integrating XML processing into object-oriented languages, e.g., XOBE, XJ, XACT, and Cω. XTATIC differs from these in other respects: more flexible integration of trees into the object-oriented data model, and use of regular patterns, rather than paths, as the main XML inspection mechanism. Used in conjunction with regular types, patterns support the full spectrum of processing styles, from dynamic investigation of documents of unknown or partially known types to fully checked processing of documents for which complete type information is known – all without changing the underlying data representation.

XTATIC is implemented as a translator into pure C# code, which can be compiled to the .NET CLR and executed in conjunction with a small library that implements tree sequences and elementary operations on them.
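To give a rough picture of the data model described above, the following is a hedged Java sketch with invented names; it is not XTATIC's Xtatic.Seq or Xtatic.Tag classes, it is not C#, and it omits the regular types and regular patterns that carry the real weight in XTATIC. It only illustrates the idea of trees whose labels are ordinary objects and whose processing builds new trees instead of mutating existing ones.

  import java.util.List;

  // Invented illustration: an immutable tree whose label is an arbitrary object.
  record Tree(Object label, List<Tree> children) {

      // "Rename" every node labeled oldTag by constructing a brand-new tree;
      // the input tree is never modified (no destructive update).
      static Tree rename(Tree t, Object oldTag, Object newTag) {
          Object label = t.label().equals(oldTag) ? newTag : t.label();
          List<Tree> kids = t.children().stream()
                  .map(c -> rename(c, oldTag, newTag))
                  .toList();
          return new Tree(label, kids);
      }

      public static void main(String[] args) {
          Tree doc = new Tree("addrbook",
                  List.of(new Tree("person", List.of()),
                          new Tree("person", List.of())));
          System.out.println(rename(doc, "person", "entry"));
      }
  }

What the sketch cannot show is the interesting part: in XTATIC the type of such a sequence is a regular type, subtyping is inclusion of value sets, and regular patterns replace the instanceof-and-cast style of inspection.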

OCamlDuce

      Alain Frisch INRIA Rocquencourt [email protected]

Context. Over the last few years, the programming language research community has identified issues raised by the support of XML documents in applications and has proposed new linguistic features to deal with them. The work by Hosoya, Pierce and Vouillon on the XDuce project has had a big influence. Amongst its main contributions are the design of regular expression types (to express structural constraints on documents) and regular expression patterns (to express complex information extraction from documents), which together contribute to a sound and expressive language for developing XML-oriented applications such as transformations.

XDuce encouraged the vision of XML manipulation as a value-based process in the spirit of functional languages. As a matter of fact, XDuce has striking similarities with the family of ML languages. Since XDuce and ML languages are good for different but related kinds of problems, and because of their apparent similarity, it is natural to try to combine them.

However, despite the similarity, XDuce is missing important features from ML languages such as first-class functions, polymorphism, automatic type reconstruction, and support for programming in the large. There are two natural responses to address this lack of features: either extend the XDuce underlying theory to deal with them, or integrate XDuce features in an existing full-blown ML language. Examples of the former include existing extensions of XDuce with first-class functions or with parametric polymorphism. However, it is not clear how these extensions could be combined, and a lot of work is still necessary to integrate other missing features. Also, it seems pointless to design and implement a full-blown language only to add support for XML. The idea of integrating XDuce features into an existing full-blown general-purpose language has been explored for instance in the Xtatic project, which adds XDuce types and patterns into the C# programming language. The part of Xtatic programs that deals with XML inherits the functional flavor from XDuce. This might indicate that a functional language could be a good target for integrating XDuce.

OCamlDuce. OCamlDuce is an experimental merger between the Objective Caml (OCaml) and CDuce languages. The language was designed so as to make it easier to develop possibly large applications which need to deal with XML documents without necessarily being focused primarily on XML (unlike, say, pure XML-to-XML transformation). Typical use cases would be to add support for custom XML configuration files, for XHTML report generators, or for web-service interfaces, ... to an existing OCaml application.

OCaml is a powerful general-purpose multi-paradigm/functional-oriented programming language from the ML family with a robust, efficient and popular implementation. CDuce is a small programming language adapted to the development of safe and efficient XML-oriented applications. CDuce supports XML literals, Unicode, XML Namespaces, XML types, XML pattern matching together with a precise type inference and an efficient automata-based and type-driven compilation strategy, and XML iterators. Part of the theory behind CDuce relies on the one developed in the XDuce project.

From the programmer's point of view, OCamlDuce comes as drop-in replacements for the OCaml tools: bytecode and native compilers, and toplevel. All OCaml features are available, and it is possible to reuse standard and third-party OCaml libraries without even recompiling them. OCamlDuce also integrates all of the features from CDuce except overloaded functions. It is thus mostly straightforward to translate CDuce programs to OCamlDuce.

Integrating OCaml and CDuce. OCamlDuce has been implemented by merging together the OCaml and CDuce source trees and adding a relatively small piece of glue code. The only technically challenging part of the OCaml / CDuce integration was the combination of two radically different type systems: OCaml relies on Hindley-Milner-like type inference, and CDuce relies on forward propagation and on tree-automata techniques. The theoretical foundation of the type system is described in a paper to be presented at PLAN-X 2006. The key idea to obtain a clean and simple type system was to keep the XML values and types self-contained: they can appear within regular OCaml values and types, but the converse is not possible. However, bridges between the worlds of XML and ML values are provided in OCamlDuce. They rely on an automatic structural translation of ML types into XML types, which allows values to be moved between the two worlds.

Because of the way CDuce types and values are dealt with in OCamlDuce, it is not possible, e.g., to have first-class functions or arbitrary OCaml values within XML values. We don't see it as a problem, because CDuce values are intended to represent XML fragments in OCamlDuce, not arbitrary data containers, which OCaml supports already pretty well. More problematic is the lack of interaction between OCaml parametric polymorphism and XML types (OCaml type variables cannot appear within CDuce types). We leave this challenging point for future work.

The demonstration. The demonstration will illustrate how the features added to OCaml can be used to write idiomatic, expressive and safe code that manipulates complex XML structures. The examples will be taken from a medium-sized application developed in OCamlDuce, which parses an XML Schema definition into an OCaml graph-like data structure, extracts some information from it, and produces an XHTML report.

LAUNCHPADS: A System for Processing Ad Hoc Data

Mark Daly (Princeton University, [email protected])
Mary Fernández and Kathleen Fisher (AT&T Labs Research, {mff,kfisher}@research.att.com)
Yitzhak Mandelbaum and David Walker (Princeton University, {yitzhakm,dpw}@cs.princeton.edu)

An Introduction to PADS. Ideally, any data we ever encounter will be presented to us in standardized formats, such as XML. Why? Because for formats like XML, there are a whole host of software libraries, query engines, visualization tools and even programming languages specially designed to help users process their data. However, we do not live in an ideal world, and in reality, vast amounts of data are produced and communicated in ad hoc formats, those formats for which no data processing tools are readily available. Figure 1 presents a small selection of ad hoc data sources. As one can see, ad hoc data exists in a very wide variety of fields, and the users range from network administrators to computational biologists and genomics researchers to physicists, financial analysts and everyday programmers.

Figure 1. Selected ad hoc data sources.

  Name : Use                                      Representation
  Web server logs (CLF): Measure web workloads    Fixed-column ASCII records
  CoMon data: Monitor PlanetLab Machines          ASCII records
  Call detail: Fraud detection                    Fixed-width binary records
  AT&T billing data: Monitor billing process      Various Cobol data formats
  Netflow: Monitor network performance            Data-dependent number of fixed-width binary records
  Newick: Immune system response simulation       Fixed-width ASCII records in tree-shaped hierarchy
  Gene Ontology: Gene-gene correlations           Variable-width ASCII records in DAG-shaped hierarchy
  CPT codes: Medical diagnoses                    Floating point numbers

Programmers often deal with this data by whipping up one-time Perl scripts or C programs to parse and analyze their data. Unfortunately, this strategy is slow and tedious, and often produces code that is difficult to understand, lacks adequate error checking, and is brittle to format change over time. To expedite and improve this process, we developed the PADS data description language and system [2, 3]. Using the PADS language, one may write a declarative description of the structure of almost any ad hoc data source. The descriptions take the form of types, drawn from a dependent type theory. For instance, PADS base types describe simple objects including strings, integers, floating-point numbers, dates, times, and IP addresses. Records and arrays specify sequences of elements in a data source, and unions, switched unions and enums specify alternatives. Any of these structured types may be parameterized, and users may write arbitrary semantic constraints over their data as well.

Once a programmer has written a description in the PADS language, the PADS compiler can generate a collection of format-specific libraries in C, including a parser, printer, and verifier. In addition, the compiler can compose these libraries with generic templates to create value-added tools such as an ad hoc-to-XML format conversion tool, a histogram generator, and a statistical analysis and error summary tool. Finally, PADS has been composed with the GALAX query engine [6, 4, 5] for XQuery to create PADX [1], a new system that allows users to query and transform any ad hoc data source as if it were XML, without incurring the performance penalty that usually results when one converts ad hoc data into a much more verbose XML representation.

While the PADS language provides an extremely versatile means of creating tools for processing ad hoc data, it is nevertheless a new language, and learning a new language is time-consuming for anyone, especially for computational biologists or other scientists for whom programming is not their primary area of expertise. To ease the way for novice PADS users, we developed LAUNCHPADS, a new tool that provides access to the PADS system without requiring foreknowledge of the PADS language itself. Hence, the LAUNCHPADS graphic interface will also help more experienced PADS users to shorten their development cycle, and it provides a convenient way for experts to quickly create any of the data processing tools they need.
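To make the hand-written baseline mentioned above concrete, here is a hedged sketch of a one-off parser for a single Common Log Format line, written in Java rather than the Perl or C the text mentions (and rather than the C libraries that the PADS compiler generates). The regular expression, the sample line apart from the IP address shown in Figure 2, and the error handling are all invented, and they exhibit exactly the brittleness and weak error checking that a declarative PADS description is meant to avoid.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class AdHocClfParser {
      // One-off pattern for a Common Log Format record; any drift in the format
      // makes lines silently unparseable.
      private static final Pattern CLF = Pattern.compile(
          "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

      public static void main(String[] args) {
          String line = "207.136.97.49 - - [15/Oct/2005:18:46:51 -0700] "
                  + "\"GET /index.html HTTP/1.0\" 200 1043";   // invented sample record
          Matcher m = CLF.matcher(line);
          if (!m.matches()) {
              System.err.println("unparsed line: " + line);    // minimal error handling
              return;
          }
          String host = m.group(1);
          int status = Integer.parseInt(m.group(6));
          String bytes = m.group(7);                           // may be "-" rather than a number
          System.out.println(host + " -> " + status + " (" + bytes + " bytes)");
      }
  }

A PADS description of the same format would instead state the record structure as types with semantic constraints and let the compiler generate the parser, printer, and verifier.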
LaunchPads. LAUNCHPADS combines mechanisms for graphically defining structure and semantic properties of ad hoc data, for translation of this definition into PADS code, and for compilation/execution of the generic tools that operate over ad hoc data. More specifically, LAUNCHPADS breaks definition of an ad hoc data format and generation of data processing tools into the following steps. Figure 2 presents a screenshot of LAUNCHPADS being used to construct a data description for a web-server log format.

1. Selection of sample data from which to build the description. Creation of a definition within LAUNCHPADS begins when a user loads sample data into the graphical interface. In Figure 2, web log data (beginning with the IP address 207.136.97.49 ...) appears in the top right-hand corner of the picture. A user then selects a row of data to work on in the LAUNCHPADS gridview immediately below.

2. Iterative refinement in the gridview. Once in the gridview, users may specify descriptions for regions of text using a highlighting scheme. The color assigned to a region represents the description class (base or composite) and region boundaries. Structure within a definition is represented through a series of refinement steps: composite regions are broken down level after level, thereby allowing for nested elements (Figure 2 shows four nesting levels). The refinement process bottoms out when one reaches an atomic description such as a character string, IP address or date. Once all regions have been given a base type in the gridview, LAUNCHPADS will generate a treeview of the definition for further processing.

Figure 2. LaunchPads Interface.

3. Customization in the treeview. The treeview is a graphical representation of the abstract syntax of a PADS description. In this view, programmers can manipulate definitions with a high degree of precision: definition elements may be created, destroyed, and renamed; type associations for existing elements may be changed (within limitations); element ordering may be altered; user-defined types may be added to the definition and applied to elements; content-aware error constraints may be imposed. Indeed, from within the treeview it is possible to access the "expert" functions of PADS directly if one so chooses, or to completely avoid them in lieu of a simpler definition and/or faster development time.

4. PADS code generation, tool compilation and use. When the user is satisfied with their PADS definition in the treeview, they may generate PADS code. Any such generated code is guaranteed to be syntactically correct, so the user need not worry about fussing with concrete PADS syntax if they do not want to. Figure 2 shows the generated code in the window at the bottom of the interface. By using the pulldown menus at the top and a set of "wizards," the user may now issue commands to compile the generated code and create data processing tools, including the XML converter and statistical analyzer. As development of LAUNCHPADS continues, we will add further tools and corresponding wizards to the interface.

Conclusions. In summary, in this demonstration, we will explain the many challenges that ad hoc data pose and how the PADS language is structured to meet these challenges. In addition, we will explain how LAUNCHPADS provides further support for processing ad hoc data by demonstrating both features for helping users construct data descriptions and features for creating and invoking tools that operate over data. We believe that both expert programmers and novices alike can benefit from this simple system for manipulating ad hoc data.

References

[1] M. Fernández, K. Fisher, and Y. Mandelbaum. PADX: Querying large-scale ad hoc data with XQuery. Submitted to PLAN-X 2006.
[2] K. Fisher and R. Gruber. PADS: A domain-specific language for processing ad hoc data. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, June 2005.
[3] K. Fisher, Y. Mandelbaum, and D. Walker. The next 700 data description languages. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Jan. 2006. To appear.
[4] Galax user manual. http://www.galaxquery.org/doc.html#manual.
[5] C. Ré, J. Siméon, and M. Fernández. A complete and efficient algebraic compiler for XQuery. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), April 2006.
[6] J. Siméon and M. F. Fernández. Build your own XQuery processor. EDBT Summer School, Tutorial on Galax architecture, Sept. 2004. http://www.galaxquery.org/slides/edbt-summer-school2004.pdf.

XHaskell

      Martin Sulzmann and Kenny Zhuo Ming Lu School of Computing, National University of Singapore S16 Level 5, 3 Science Drive 2, Singapore 117543 {sulzmann,luzm}@comp.nus.edu.sg

We demonstrate the current programming capabilities of XHaskell – an extension of Haskell with XDuce style regular expression types and regular expression pattern matching. For example, the following classic XDuce program to extract telephone entries out of an address book

  regtype P  = P[N,T?,E*]   -- Person
  regtype N  = N[String]    -- Name
  regtype T  = T[String]    -- Tel
  regtype E  = E[String]    -- Email
  regtype En = En[N,T]      -- Phonebook Entry

  addrbook :: P* -> En*
  addrbook (P[n as N, t as T, E*], xs as P*)
      = (En[n,t], (addrbook xs))
  addrbook (P[N,E*], xs as P*) = addrbook xs
  addrbook () = ()

can be rewritten in XHaskell as follows.

  module Addrbook where

  data P  = P N (T?) ((E)*)   -- Person
  data N  = N [Char]          -- Name
  data T  = T [Char]          -- Tel
  data E  = E [Char]          -- Email
  data En = En N T            -- Entry

  addrbook :: ((P)*) -> ((En)*)
  addrbook (x :: ((P)*)) = (map for_each_p) x   -- (1)

  for_each_p :: P -> (En?)
  for_each_p (P (n :: N) (t :: (T?)) (es :: ((E)*)))
      = for_each_p2 (n,(t,es))

  for_each_p2 :: (N,((T?),((E)*))) -> (En?)
  for_each_p2 ((n :: N),((t :: T),(es :: ((E)*))))
      = En n t
  for_each_p2 ((n :: N),(es :: ((E)*)))
      = ()

The interesting point to note is that in XHaskell we can call Haskell Prelude functions such as map :: (a->b) -> [a] -> [b] (see location (1)). Thus, we only need to define the transformation from Person to Entry. Our current implementation does not support regular hedges. Therefore, we need the auxiliary function for_each_p2.

XHaskell is compiled to Haskell. Hence, we can easily take advantage of existing XML tools written in Haskell. E.g., we can use the DtdToHaskell command provided by the HaXML tool to generate the AddrbookDTD module which describes the DTD structure of the address book example in terms of some Haskell data types. Currently, the XML document representation provided by HaXML is slightly different from XHaskell's. Hence, the programmer must provide an extra interface module HaXMLInterface for marshalling values between the two representations. This intermediate step could, though, be easily automated.

Here is the code integrating XHaskell with HaXML.

  module App where
  import Addrbook ( addrbook )
  import AddrbookDTD
  import HaXMLInterface ( haxml2xhaskell, xhaskell2haxml )

  main =
      fix2Args >>= \(infile,outfile) ->
      do value <- fReadXml infile
         let result = xhaskell2haxml (addrbook (haxml2xhaskell value))
         fWriteXml result outfile

The main function parses an XML document specified by argument infile, and applies function addrbook to the parsed value. Note that addrbook has type [P] -> [En] in the translation to Haskell. Finally, it prints the result into the output file specified by argument outfile.

The implementation and further background material can be found here: http://www.comp.nus.edu.sg/~luzm/xhaskell/

Recent BRICS Notes Series Publications

NS-05-6 Giuseppe Castagna and Mukund Raghavachari, editors. PLAN-X 2006 Informal Proceedings, (Charleston, South Carolina, January 14, 2006), December 2005. ii+92 pp.

NS-05-5 Patrick Cousot, Lisbeth Fajstrup, Eric Goubault, Maurice Herlihy, Kim G. Larsen, and Martin Raußen, editors. Preliminary Proceedings of the Workshop on Geometry and Topology in Concurrency, GETCO '05, (San Francisco, California, USA, August 21, 2005), August 2005. vi+44 pp.

NS-05-4 Scott A. Smolka and Jiří Srba, editors. Preliminary Proceedings of the 7th International Workshop on Verification of Infinite-State Systems, INFINITY '05, (San Francisco, USA, August 27, 2005), June 2005. vi+64 pp.

NS-05-3 Luca Aceto and Andrew D. Gordon, editors. Short Contributions from the Workshop on Algebraic Process Calculi: The First Twenty Five Years and Beyond, PA '05, (Bertinoro, Forlì, Italy, August 1–5, 2005), June 2005. vi+239 pp.

NS-05-2 Luca Aceto and Willem Jan Fokkink. The Quest for Equational Axiomatizations of Parallel Composition: Status and Open Problems. May 2005. 7 pp. To appear in a volume of the BRICS Notes Series devoted to the workshop "Algebraic Process Calculi: The First Twenty Five Years and Beyond", August 1–5, 2005, University of Bologna Residential Center Bertinoro (Forlì), Italy.

NS-05-1 Luca Aceto, Magnús Már Halldórsson, and Anna Ingólfsdóttir. What is Theoretical Computer Science? April 2005. 13 pp.

NS-04-2 Patrick Cousot, Lisbeth Fajstrup, Eric Goubault, Maurice Herlihy, Martin Raußen, and Vladimiro Sassone, editors. Preliminary Proceedings of the Workshop on Geometry and Topology in Concurrency and Distributed Computing, GETCO '04, (Amsterdam, The Netherlands, October 4, 2004), September 2004. vi+80 pp.

NS-04-1 Luca Aceto, Willem Jan Fokkink, and Irek Ulidowski, editors. Preliminary Proceedings of the Workshop on Structural Operational Semantics, SOS '04, (London, United Kingdom, August 30, 2004), August 2004. vi+56 pp.