Fast XML/HTML for Haskell: XML TypeLift
Michał J. Gajda and Dmitry Krylov
Migamake Pte Ltd
[email protected]
Abstract

The paper presents and compares a range of parsers, with and without data mapping, for conversion between XML and Haskell. The best performing parser competes favorably with the fastest tools available in other languages and is, thus, suitable for use in large-scale data analysis. The best performing parser also allows software developers of intermediate-level Haskell programming skill to start processing large numbers of XML documents soon after finding the relevant XML Schema with a simple internet search, without the need for specialist prior knowledge or skills. We hope that this unique combination of parser performance and usability will provide a new standard for XML mapping to high-level languages.

1 Introduction

XML is the current dominant format for document interchange and historical storage, being optimized for long-term schema evolution and extensibility[10, 16, 27, 28], without encountering hardware, software or platform compatibility problems. Importantly, XML offers operational compatibility through its common syntax, enabling transfer of messages across communication systems of differing formats. XML is also the basis of a number of different data formats, most of which are of practical and commercial interest, because they allow system evolution over a long time span (in the case of PDB[16] since 1976, in the case of FixML since 1992[11]). The generation of significant volumes of XML data has initiated a wealth of active research into rapid parsing[13, 14, 26].

This report reviews the latest developments in the generation of efficient parsers in Haskell. Haskell is a high-level language, yet our parser processes plain XML of unknown format at a speed that is comparable with leading parsers already available in low-level languages. It also automatically generates high-level representations of document content from XML Schema, the current dominant schema definition language.

1.2 XML Schema

The overall XML domain is vast, with individual XML file formats tailored to specific needs and applications. While XML's self-descriptive structure and file format allow information to be easily ingested, most application areas require strict validation rules to ensure the logical correctness of XML documents. In 2001, the W3C formally recommended the XML Schema[37] for the description and validation of XML document structure and content by defining document elements, attributes and data types. As extensive type descriptions, XSDs are helpful for standardizing the large number of XML formats in current use. They do this not only by defining document structure and elements, but also by setting out formal requirements for document validation. Most W3C XML formats have XML Schema definitions that have been developed and standardized by ISO/IEC; the same applies for all ECMA standards, including Microsoft Open-XML and the LibreOffice OpenDocument format.

For newcomers to the field, we now summarize key features of XML Schema. It provides a self-descriptive representation of XML documents, encoded with features that include the following.
• Namespaces for element identifiers and references (ref="id"), so that the same element can be used in different document locations.
• Cardinality constraints on elements (e.g. minOccurs="0").

Finally, XML Schema possesses typical programming-language facilities for organizing many definitions, including namespaces, comments, and cross-references for data modeling.

Among existing Haskell XML parsing libraries:
• hxt uses a generic data model to represent XML documents, including the DTD subset and the document subset, in Haskell.
• hexml[25] is a fast, DOM-style XML parser that parses only a subset of XML; this parser skips entities.

This method of processing induces, among other things, an inversion of control.
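To make the callback-driven processing style concrete, here is a toy, self-contained sketch of SAX-style parsing with inversion of control: the parser owns the traversal loop and calls user-supplied functions for each tag. The Callbacks record and process function below are our illustrative stand-ins, not Xeno's or hexml's actual API, and the scanner handles only plain open/close tags (no attributes, entities, or self-closing elements).

```haskell
-- Toy illustration of the callback (SAX) style: the parser drives
-- the control flow and calls user code back for each token, instead
-- of returning a document tree. NOT a real library interface.
import qualified Data.ByteString.Char8 as BS

data Callbacks = Callbacks
  { onOpen  :: BS.ByteString -> IO ()  -- called for "<tag>"
  , onClose :: BS.ByteString -> IO ()  -- called for "</tag>"
  }

-- The parser owns the loop; user state must live in mutable
-- references, which is exactly the inversion of control.
process :: Callbacks -> BS.ByteString -> IO ()
process cbs = go
  where
    go s = case BS.uncons s of
      Nothing -> pure ()
      Just ('<', rest) ->
        let (tag, rest') = BS.break (== '>') rest
        in case BS.uncons tag of
             Just ('/', name) -> onClose cbs name >> go (BS.drop 1 rest')
             _                -> onOpen  cbs tag  >> go (BS.drop 1 rest')
      Just _ -> go (BS.dropWhile (/= '<') s)
```

Because the library, not the caller, drives the loop, any accumulated result has to be threaded through mutable state or a monad, which is the price paid for the streaming-friendly interface.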
The Xeno parser is invoked with a record of callbacks:

    Xeno.process (def { openF, closeF, textF })

Reading results. After calling the parser, we read the final result using the function readSTRef.

2.3 Obtaining fully typed outputs using parser generators

It is easy to obtain fully typed Haskell representations from XML Schema. The user must first find the XML Schema that describes the input format, either in the relevant internet standard or in an XML schema repository. Next, the XML Schema is handed to XML Typelift, which generates a Haskell module that describes both the parser and the mapping to a Haskell ADT:

    $ xml-typelift-cli --schema InputSchema.xsd \
                       --output InputSchema.hs

We then immediately test the generated parser and examine the output structure:

    $ runghc InputSchema.hs input.xml

The obtained result is then used by the programmer to determine how to access the parsed data structure in a specific program. Notably, XML TypeLift uses a name generation monad [2] to ensure that the resulting code is readable and comprehensible for Haskell programmers of intermediate skill level.

It is easier to process an example generated ADT than the original XML Schema:

    data Users = Users { users :: [User] }
    data User  = User  { uid  :: Int
                       , name :: String
                       , bday :: Maybe Date }

This ADT is easily read by the standard parser interface from the generated module:

    parse :: ByteString -> Either String Users

We can, therefore, use it to extract the data in an example user program:

    import UsersSchema (parse, User(..), Users(..))

    processFile filename = do
      Right db <- parse <$> BS.readFile filename
      print $ map name (users db)

This interface is preferable, and we will now explain how to ensure that its speed is comparable to the DOM and SAX approaches.

Fast XML/HTML for Haskell: XML TypeLift. Conference'17, July 2017, Washington, DC, USA

3 Performance techniques

3.1 Code generation as a scalable method of fast parsing

Code generation has an undeserved reputation for producing arcane code that is difficult to read, due to its algorithmic origin; it also raises concerns regarding the malicious use of code generation [36]. In this context, we make a specific effort to ensure that our parser generator is easy to read and, thus, properly accessible and customizable. To do this, we first use an abstract interface to the input string; this allows future parser enhancement by vectorization using the VectorizedString class. Since XML TypeLift originated from Xeno, we now discuss performance techniques for both libraries together; we generate independent code using the same performance techniques as before, before backporting them to Xeno.

3.1.1 Splitting and representing tokens

Conventionally, parsers start by distinguishing tokens and keywords in the input, and then selecting an efficient representation for them. This can be achieved with an ADT data structure or a hash table that maps each token type to an integer code. Since, in XML, the key tokens are identifiers and text fragments, we choose to take advantage of the ByteString library to represent their content as two offsets (the beginning and the end) into the input string. Capitalizing on the tradition of fast XML parsing in the SAX [24] model, we implicitly represent tokens by distinguishing callback functions that process each token; we expect these functions to be duly inlined by the Haskell compiler.

3.1.2 Performant string representation

Haskell programs use the Bytestring library (alongside 'text', its spiritual successor), which examines large input strings efficiently by reducing each object to: (a) a foreign pointer to the input; (b) an offset to the beginning of the string in the input; and (c) an offset to the end of the string in the input. When using this data structure for fast parsing of a known input, we add a null byte to the end, so that it is not necessary to check that offsets do not transcend the input end. Additionally, by factoring out a class of string representations that allow efficient scanning, interested programmers can find even more efficient representations that offer a safe interface. For our specific purpose of parsing source XML documents, we wish to parse both UTF-16 and UTF-8 string representations without needing to write a large variant of the Bytestring library.

An additional benefit of this input representation is its time-saving nature compared to mapping from a file. Whenever this read-only structure is unnecessary, the operating system purges the relevant pages from memory, since the runtime system simply interprets those pages as a large foreign object that has no pointers and is, therefore, not worth collecting. This turns out to be important, since our benchmarks show that most of our parser prototypes were limited by garbage collector performance.

Figure 1: the input string ("<?xml ...><users><user>...") annotated with token offsets.

Vectorized string interface class. Careful examination reveals that only five operations are necessary for fast XML parsing, so that it compiles well to yield fast Low Level Virtual Machine (LLVM) code:

    class VectorizedString where
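The class declaration above is cut off in this excerpt, and the five operations are not listed here. Purely as an illustration of the shape such an interface could take, here is a hypothetical five-operation class with a plain ByteString instance; the operation names (sLength, sIndex, sDropWhile, sFindFrom, sSlice) are our assumptions, not XML TypeLift's actual API.

```haskell
-- Hypothetical sketch of a vectorized-string interface; the names
-- below are placeholders, not the real XML TypeLift operations.
import qualified Data.ByteString.Char8 as BS

class VectorizedString s where
  sLength    :: s -> Int                      -- total input length
  sIndex     :: s -> Int -> Char              -- read one byte at an offset
  sDropWhile :: (Char -> Bool) -> Int -> s -> Int
                                              -- advance past matching bytes
  sFindFrom  :: Char -> Int -> s -> Maybe Int -- offset of next occurrence
  sSlice     :: Int -> Int -> s -> s          -- substring by start/end offsets

instance VectorizedString BS.ByteString where
  sLength            = BS.length
  sIndex             = BS.index
  sDropWhile p ofs s = ofs + BS.length (BS.takeWhile p (BS.drop ofs s))
  sFindFrom c ofs s  = (ofs +) <$> BS.elemIndex c (BS.drop ofs s)
  sSlice from to     = BS.take (to - from) . BS.drop from
```

Keeping each operation expressible as pointer arithmetic over a contiguous buffer is what gives LLVM a chance to compile instances of such a class into tight scanning loops, and it leaves room for a future SIMD-backed instance.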
3.2.1 Modeling XML Schema in Haskell

Since we aim to ingest data quickly without validating the document, the current version of XML Typelift ignores the following features: (I) comments, (II) syntactic type distinctions, (III) value restrictions that are not enumerations, and (IV) uniqueness constraints.

We reduce the type representations to those that can be modeled within Haskell abstract data types, via the following features: (a) xs:choice is naturally modeled by types with alternative type constructors; (b) xs:sequence and the set of xs:attributes are both modeled by record fields within a single constructor; (c) references are modeled as record fields to facilitate code re-use; (d) cardinality constraints are simplified to distinguish between a single value, a, an optional value, Maybe a, and a list, [a]; (e) built-in types are translated into Haskell types by predefinition; (f) type restrictions are modeled depending on the constraint: (I) enumerations are modeled as a set of nullary constructors, and (II) other constraints are reduced to type aliases.

This is essentially an XML↔Haskell mapping, similar to the Object↔Relational and Object↔XML mappings used for SQL and XML databases, without constraining the realm of XML types on the input (except for mixed types, which could be modeled as a list of values with a type that has an added constructor TextValue String; the authors are not currently able to support this).

Automated type generation not only allows us to significantly decrease complexity when writing efficient parsers for existing formats, but also results in a more convenient API that can be used from Haskell. We propose using the same code generator for other target languages in our future work.

However, since XML Schema's own schema is very complex, we use an intermediate representation that describes XML document types in sufficient detail to allow parsing, while at the same time omitting most of the details of schema representation. (Indeed, our approach is mirrored by the existence of other schema languages that aim to simplify XML Schema, for example RelaxNG [17], which aims to describe regular trees inside a document.)

3.2.2 Schema representation

The entire XML schema can be represented as an environment that holds both simple and complex declared types. In addition to this environment, we also list the element types that are allowed to occur in the document root; this uniform representation simplifies a multitude of distinctions that XML Schema makes between types that are allowed in attributes and those that are allowed only as elements. Thus, further analysis and code generation are both greatly simplified.

    data Schema = Schema { types :: Map String Type
                         , tops  :: [Element] }

Describing element types. For each element type, we keep track of its possible multiplicity, name, attributes and content type; note that we ignore namespaces due to limited implementation resources.

    data Element = Element { eType     :: Type
                           , minOccurs :: Int
                           , maxOccurs :: Maybe Int -- Nothing for no limit
                           , eName     :: String
                           , targetNamespace :: NamespaceName }

Handling of type extensions. Although simple and complex types with either simple or complex content are one of the most complicated aspects of the specification, they can be easily modeled by either a ComplexType entry, a reference, or a modification of another ComplexType.

    data Type = Ref String
              | Restriction { base       :: String
                            , restricted :: Restriction }
              | Extension   { base  :: String
                            , mixin :: Type }
              | Complex     { mixed :: Bool
                            , attrs :: [Attr]
                            , inner :: TyPart }

Consistent with the principles of data mapping, whereby we seek a common type for shared parts of the data structure, we avoid duplicating fields defined in the base type when generating declarations of extending types.

Type restrictions. Although it is easy to expand restrictions, for the time being we consider only those that need special mapping to a Haskell ADT:

    data Restriction = Enum [String] | ...

For simplicity, we omit the treatment of element references and namespaces, which can simply be added to a reference dictionary. Furthermore, complex identifier types can be dealt with in post-parsing analysis, to find the namespace of each identifier.

Representing element children. Types are described by a regular expression language in which letters stand for child elements. Here again, we use Group as a reference to another TyPart:

    data TyPart = Seq [TyPart] | Choice [TyPart]
                | All [TyPart] | Group String
                | Elt Element

Note that, since XML Schema allows Choice, All, and Seq (written xs:choice, xs:all, and xs:sequence) to be nested, the XML data declaration must be broken down into smaller sections.

XML data structure flattening. It is necessary to flatten the XML data structure by breaking the types down into sections that can be translated directly into single Haskell data declarations. This is achieved by grouping the types into the following categories, according to complexity:
1. those that can be implemented as type aliases (newtypes);
2. those that can be implemented as a single Haskell ADT declaration; and
3. those that must be split into multiple types.
As an example of a type that must be split into multiple types, consider an element that contains a fixed group of children and one variant child; for simplicity, element types are omitted.

3.3 Searching for the best parser

Here, we describe and analyze the performance of a variety of generated code types. We keep the running prototypes in the bench/proto directory of the repository. (Interested readers are referred to the source repository, so that they can compare with their own prototypes and preferred optimizations.)

Parser 1. We first generate an Xeno.DOM structure, and we convert the resulting tree fragments into Haskell records using a lazy conversion.

Parser 2. A continuation passing monad (ContT) is used. The processor is given as a continuation that receives individual SAX events and continues processing. Next, a control flow separates out the different generated types, with unfinished elements held in a stack. Data is then passed from the input using an asynchronous queue, to allow asynchronous (and possibly chunked) input processing.

Parser 3. A vector pointer structure, similar to Xeno.DOM, is created. In order to reduce memory requirements, this structure is customized to have a fixed layout.

Parser 4. Tag names are matched with inlined, per-byte string comparisons:

    parseUser userStart =
      if s_index bs  userStart      == '<'
      && s_index bs (userStart + 1) == 'u'
      && s_index bs (userStart + 2) == 's'
      && s_index bs (userStart + 3) == 'e'
      && s_index bs (userStart + 4) == 'r'
      && s_index bs (userStart + 5) == '>'
      ...

Parser 5. Parser 5 retains a 'working memory' of one IORef for each attribute or element type that is held in the output structure. In order to conserve memory, it constructs a record only upon element closure.

    parser :: ByteString -> IO TopLevel
    parser str = do
      levelRef <- newIORef (0 :: Int)
      -- an IORef holding each field in the schema
      usersRef <- newIORef Nothing
      userRef  <- newIORef Nothing
      -- ...similar code for each field...
      flip Xeno.process str (Process
        { openF = \tagName -> do
            level <- readIORef levelRef
            case (level, tagName) of
              -- Here we dive into the structure
              (0, "users") -> writeIORef levelRef 1
              (1, "user")  -> writeIORef levelRef 2
              -- Store nested values
              (2, "uid")   -> writeTextToRef uidRef
              -- ...similar code skipped...
        , closeF = \tagName -> do
            -- Construct the whole record
            level <- readIORef levelRef
            case (level, tagName) of
              (2, "user") -> do
                uid  <- readIORef uidRef
                name <- readIORef nameRef
                bday <- readIORef bdayRef
                -- ...similar code skipped...
                modifyIORef' usersRef (User {..} :)
                -- back to the previous level
                writeIORef levelRef 1
              -- ...similar code skipped...
        })
      Just users <- readIORef usersRef
      return users

Parser 6. Parser 6 uses data reification of event streams (the SAXEvent type) and passes the resulting tokens to Parsec-style ([6]) parser combinators.

    data SAXEvent = OpenElt XMLString
                  | CloseElt XMLString
                  | Attr XMLString XMLString
                  | ...

    newtype SAXStream = SAXStream [SAXEvent]

This, in theory, allows for easier streaming through Conduit[33], Pipe[12] or Streamly's Stream, as well as incremental input processing.

    type SaxParser a = Parsec Void [SAXEvent] a

    parser :: ByteString -> Either String TopLevel
    parser = first errorBundlePretty
           . parse parseUsers ""

    parseUsers :: SaxParser Users
    parseUsers = withTag "users" (Users <$> parseUserList)

    parseUserList :: SaxParser [User]
    parseUserList = many parseUser

    parseUser :: SaxParser User
    parseUser = withTag "user" $
      User <$> tagValue "uid"
           <*> tagValue "name"
           <*> tagValue "bday"

    -- A collection of useful combinators
    withTag :: ByteString -> SaxParser a -> SaxParser a
    withTag tagName sp =
      openTag tagName *> sp <* closeTag

    tagValue :: (F.MonadFail m, MonadParsec e [SAXEvent] m)
             => ByteString -> m ByteString
    tagValue tagName = do
      void $ satisfy (== OpenElt tagName)
      -- ...auxiliary code skipped...

Parser 7. Parser 7 uses a single mutable vector, similar to that used for Xeno.DOM offsets (and Parser 3), and adds the inlined string comparisons of Parser 4. This parser allows for lazy output extraction.

    parser :: ByteString -> Either String TopLevel
    parser = fmap extractTopLevel . parseToArray7

    parseToArray7 :: ByteString -> Either String TopLevelInternal
    parseToArray7 bs =
      Right $ TopLevelInternal bs $ UV.create $ do
        vec <- UMV.unsafeNew $ BS.length bs
        _   <- parseUsers vec
        return vec
      where
        -- Some parsers just parse tags and call parsers for nested
        -- tags; leaf parsers store offset/length pairs (e.g.
        -- arrayUidStart + 1 and uidStrEnd - uidStrStart), return the
        -- next string offset (uidStrEnd + 6), or fail with failExp
        -- on an unexpected tag.
        parseUsers vec = ...

The generated code uses parser combinators:

    parseUsersContent arrStart strStart = do
      (arrOfs1, strOfs1) <-
        inManyTags "user" arrStart strStart
          $ parseUserTypeContent
      return (arrOfs1, strOfs1)

    parseUserTypeContent arrStart strStart = do
      (arrOfs1, strOfs1) <-
        inOneTag "uid" strStart $ parseInt arrStart
      (arrOfs2, strOfs2) <-
        inOneTag "name" strOfs1 $ parseString arrOfs1
      (arrOfs3, strOfs3) <-
        inMaybeTag "bday" arrOfs2 strOfs2 parseDay
      return (arrOfs3, strOfs3)

    -- | Extractor of Haskell data from the internal array
    extractTopLevel :: TopLevelInternal -> TopLevel
    extractTopLevel (TopLevelInternal bs arr) =
      fst $ extractUsersContent 0
      where
        -- extractUserTypeContent yields (Users user, ofs1)
        -- ...auxiliary code skipped...

This code does not need to store field offsets within the output records.

We now consider how the extraction of the Haskell datatype (introduced in sec. 2.3) works. As illustrated in Figure 1, the extractor first reads 2 from the offset array and therefore reads two User objects. In order to read each User object, the extractor first reads the offset to a piece of string, as well as the string length, and it then parses this information into the appropriate type: the field uid of the first User is formed from the 2-byte long string extracted from the 50th byte; the field name is then formed from the 4-byte long string extracted from the 71st byte.

Table 2. Allocation comparison for generated parser and its prototypes. Columns represent allocations (in Gb) required to process input files of the respective size (in Mb).

    Parser   32    64    128    256    512    1024
    1        2.71  5.43  10.88  21.83  43.76  87.63
    3        2.88  5.77  11.56  23.20  46.50  93.10
    4        2.13  4.26  8.55   17.18  34.46  69.02
    5        2.25  4.51  9.04   18.15  36.41  72.92
    6        2.94  5.89  11.80  23.68  47.46  95.02
    7        2.23  4.46  8.95   17.98  36.05  72.21
    gen.     2.42  4.85  9.74   19.54  39.18  78.47

Table 3. Memory consumption comparison for generated parser and its prototypes. Columns represent the memory (in Gb) needed to process input files of the respective size (in Mb).

    Parser   32    64    128    256    512    1024
    1        0.26  0.52  1.31   2.62   5.23   10.45
    3        0.42  0.82  1.63   3.24   6.48   12.95
    4        0.21  0.39  0.77   1.57   3.27   6.60
    5        0.21  0.43  0.88   1.78   3.57   7.13
    6        0.51  1.00  1.97   3.90   7.78   15.56
    7        0.12  0.24  0.47   0.94   1.87   3.74
    gen.     0.13  0.25  0.50   0.99   1.98   3.96

Table 4. Count of garbage collections (in thousands) for generated parser and prototypes.

    Parser   32    64    128    256    512    1024
    1        2.66  5.32  10.67  21.41  42.93  85.96
    3        2.87  5.75  11.53  23.14  46.40  92.93
    4        2.20  4.40  8.82   17.72  35.54  71.18
    5        2.29  4.59  9.20   18.47  37.04  74.18
    6        3.01  6.02  12.07  24.22  48.54  97.18
    7        2.23  4.47  8.96   17.99  36.11  72.35
    gen.     2.45  4.90  9.81   19.66  39.47  79.09

Table 7. Total memory consumption comparison for generated parser and other tools. Columns represent the memory (in Gb) required to process input files of the respective size (in Mb).

    Tool          32    64    128    256    512    1024
    xeno          0.06  0.13  0.25   0.50   1.00   1.99
    pugixml       0.14  0.29  0.58   1.15   2.31   4.61
    hexml         0.09  0.19  0.37   0.74   1.47   2.94
    hexpat        1.74  3.45  6.88   13.59  27.07  52.79
    py-lxml       0.28  0.28  0.56   1.13   2.25   4.50
    xml-typelift  0.13  0.25  0.50   0.99   1.98   3.96
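To illustrate the offset-array extraction described above, here is a simplified, self-contained sketch (using the vector package, as the paper's own code does): the parser is assumed to have left behind a flat vector of integers laid out as a leading record count followed by offset/length pairs for each field, and the extractor slices the input ByteString accordingly. This layout and the helper names are our simplification, not the real TopLevelInternal encoding.

```haskell
-- Simplified sketch of offset-array extraction: fields are never
-- copied during parsing; the extractor slices the original input
-- on demand. Assumed layout: [count, uidOfs, uidLen, nameOfs, nameLen, ...]
import qualified Data.ByteString.Char8 as BS
import qualified Data.Vector.Unboxed as UV

data User = User { uid :: Int, name :: BS.ByteString }
  deriving (Eq, Show)

-- Take 'len' bytes starting at byte offset 'ofs' of the input.
slice :: BS.ByteString -> Int -> Int -> BS.ByteString
slice bs ofs len = BS.take len (BS.drop ofs bs)

extractUsers :: BS.ByteString -> UV.Vector Int -> [User]
extractUsers bs arr =
  [ User (readInt (slice bs (arr UV.! (1 + 4 * i)) (arr UV.! (2 + 4 * i))))
         (slice bs (arr UV.! (3 + 4 * i)) (arr UV.! (4 + 4 * i)))
  | i <- [0 .. (arr UV.! 0) - 1] ]
  where
    readInt s = maybe (error "malformed uid") fst (BS.readInt s)
```

Because the list comprehension is lazy, a consumer that inspects only one field of one record forces only that slice, which is what makes the lazy output extraction of Parser 7 cheap.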
4.1 Main alternatives

For benchmarking purposes, we developed a short program that simply generates XML files of similar structure, but with different file sizes and containing random data. Six files of the following sizes (in Mb) were generated: 32, 64, 128, 256, 512, and 1024.

Table 5. Speed comparison for generated parser and other tools. Columns represent the time (in seconds) required to process input files of the respective size (in Mb).

    Tool          32    64    128    256    512    1024
    xeno          0.03  0.07  0.13   0.26   0.53   1.08
    pugixml       0.05  0.09  0.20   0.40   0.81   1.63
    hexml         0.03  0.06  0.11   0.28   0.56   1.26
    hexpat        2.71  5.45  10.94  21.65  43.40  91.73
    lxml          0.35  0.70  1.40   2.81   5.64   11.30
    xml-typelift  0.04  0.08  0.17   0.33   0.66   1.33

Table 6. Allocation comparison for generated parser and other tools. Columns represent allocations (in Gb) required to process input files of the respective size (in Mb).

    Tool          32    64     128    256    512    1024
    xeno          0.03  0.07   0.13   0.27   0.54   1.08
    pugixml       0.00  0.00   0.00   0.00   0.00   0.00
    hexml         0.03  0.06   0.12   0.25   0.50   1.00
    hexpat        6.16  12.33  24.66  49.36  98.76  197.56
    xml-typelift  2.42  4.85   9.74   19.54  39.18  78.47

Our tools show clear linear scaling with input size (see tbl. 5). Note that the C-based hexpat tool performs more slowly than any of the other tools tested, especially for large files. In particular, hexpat took 91.7 seconds to process the 1 Gb file, while every other tool except lxml processed the same file in under two seconds. This result reinforces our point that clever implementation is more important than language choice when optimizing tool performance; thus, developer skill and experience are essential to producing high-performance solutions.

All tested tools demonstrated linear growth of total allocated memory and GC count with file size (see tbls. 6, 7, 8); this growth correlated directly with performance.

Figure 2. Memory consumption comparison for generated parser and prototypes.

5 Discussion

This parser contributes to the evidence [1] that advanced compilers for high-level languages can reach performance (speed) similar to that of low-level languages, provided that the software developer is sufficiently skilled.

Although parser speed can be improved through innovative implementation and optimization, neither approach is currently exploited by the Glasgow Haskell Compiler (GHC) or its LLVM backend. Given the importance of string processing, it is disappointing that neither GHC nor LLVM implements inline keyword matching, and we recommend that this be pursued in future work.

This paper presents techniques for high-level processing of XML, without compromising memory use or processing speed. Our results confirm that effective use of high-level language libraries enables performance that matches that of low-level languages, while at the same time maintaining safety and retaining interface convenience. We expect that implementations of these techniques will be expanded into high-performance parsers in the future. Offset-based encoding of pointer structures suggests a number of interesting future research directions in GC; region-based GCs would possibly help in this case.

6 Future work

While our goal was to develop a tool suitable for a wide range of uses, implementation of the entire XML Schema is beyond the scope of this paper; not all features are currently supported (xs:element references, xs:import), and isolated cases remain untested.

Our future work includes adding new functionality, designed so as not to hamper performance in the most common circumstances. In particular, we plan to facilitate incremental data extraction, without building intermediate data structures, by implementing XPath. Provided we secure sufficient funding, we might also implement a monoidal parallelization scheme for the parser, in the spirit of the hPDB parser [1].

Although we have not yet modeled mixed content, it would be relatively easy to map mixed content as a list of sum types (ADTs) that contain all allowed elements, plus a text-node constructor. We also plan to add mmap and null-termination pre-allocations to XML Typelift, just as we did with xeno. (More accurately, we will add 'mmapped' input, since this allows us to take advantage of the significant virtual memory available on a 64-bit processor and, thus, map the entire input directly to memory; the operating system then implicitly manages the paging-in of the input, as required.)

Since the VectorizedString operations are only exposed in Xeno.DOM and its generated parser, it would be interesting to formally demonstrate their correctness.

7 Conclusion

XML documents are validated against XML Schema definitions; however, in order to analyze these formats in a typed programming language, XML data types must be converted from XML Schema into meaningful language data type declarations, and parsers must be generated for these XML documents.

We presented a Haskell tool that converts XML standard data types into Haskell data types, thus making this data available for analysis by Haskell programs. A key advantage of our tool is that it allows easy handling of large XML Schemas, such as the Office Open XML format used by Microsoft Office applications.

XML TypeLift avoids the complexities of XML parsing and validation by directly mapping valid input documents into correct Haskell datatypes that represent the intended domain. In a sense, XML mapping to Algebraic Datatypes (XML-ADT mapping) behaves just like the object-relational mappings used for interfacing OOP programs to relational databases; this prevents type mismatches (detected at compilation), allowing effective utilization with XML document parsing. Programmers can, thus, use the full power of the type system for conversion of XML Schema data types to Haskell data types, since type validation is performed prior to compilation. Invalid documents that cannot be parsed to Haskell data types yield an error message that identifies the error location.

Even though similar parsing solutions exist within other Haskell packages, such as HXT, XML TypeLift offers the following key advantages: (i) it performs online input processing without exhausting system memory; (ii) it offers fast, low-memory, event-based parsing using the pure-Haskell xeno parser; (iii) it runs fast for both small and large documents, like the data processing libraries of other languages [3, 17, 23].

References

[1] Anonymous 2013. hPDB - Haskell library for processing atomic biomolecular structures in Protein Data Bank format. BMC Research Notes 6, 1 (2013), 483.
[2] Art of industrial code generation. 2019. https://migamake.com/presi/art-of-industrial-code-generation-mar-6-2019-uog-singapore.pdf
[3] Clark, J. 1998. The Expat XML Parser.
[4] Clark, J. and DeRose, S. (eds.) 1999. XML Path Language (XPath) Version 1.0. W3C.
[5] Consortium 2019. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research 47, 12 (2019), 520–528.
[6] Leijen, D. and Meijer, E. 2007. Direct Style Monadic Parser Combinators for the Real World. User Modeling 2007, 11th International Conference, Corfu, Greece.
[7] Dai, Z. et al. 2010. A 1 Cycle-per-byte XML Parsing Accelerator. Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2010), 199–208.
[8] Done, C. 2017. Fast Haskell: Competing with C at parsing XML. Blog post.
[9] Dyck, M. et al. 2017. XQuery 3.1: An XML Query Language. W3C.
[10] ECMA 2016. Office Open XML File Formats, 5th edition, ECMA-376-1:2016.
[11] FIX Trading Community 2013. The Financial Information eXchange Protocol, FIXML 5.0 Schema Specification, Service Pack 2 with 20131209 errata.
[12] Gonzalez, G. 2019. pipes - Compositional pipelines. GitHub repository. https://github.com/Gabriel439/Haskell-Pipes-Library
[13] Haw, S.C. and Rao, G.S.V.R.K. 2007. A Comparative Study and Benchmarking on XML Parsers. The 9th International Conference on Advanced Communication Technology (Feb. 2007), 321–325.
[14] Head, M.R. et al. 2006. Benchmarking XML Processors for Applications in Grid Web Services. SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (Nov. 2006), 30–30.
[15] Berman, H.M., Henrick, K. and Nakamura, H. 2007. The Worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Research 35 (Database issue) (2007), D301–303.
[16] Berman, H.M., Henrick, K. and Nakamura, H. 2003. Announcing the worldwide Protein Data Bank. Nature Structural Biology 10, 12 (2003), 980.
[17] Kapoulkine, A. 2013. Parsing XML at the speed of light. In: The Performance of Open Source Applications, T. Armstrong, ed.
[18] Khan, A. and Jardon, F. 2017. Data Serialization Comparison. CriteoLabs.
[19] Kostoulas, M.G. et al. 2006. XML Screamer: An Integrated Approach to High Performance XML Parsing, Validation and Deserialization. Proceedings of the 15th International Conference on World Wide Web (New York, NY, USA, 2006), 93–102.
[20] Lam, T.C. et al. 2008. XML document parsing: Operational and performance characteristics. Computer 41, 9 (2008), 30–37.
[21] Launchbury, J. and Peyton Jones, S.L. 1994. Lazy functional state threads. SIGPLAN Notices 29, 6 (Jun. 1994), 24–35.
[22] Lelli, F. et al. 2006. Improving the performance of XML based technologies by caching and reusing information. 2006 IEEE International Conference on Web Services (ICWS'06) (2006), 689–700.
[23] Faassen, M., Behnel, S. and contributors 2005. lxml - XML and HTML with Python.
[24] Megginson, D. 2001. SAX project home page.
[25] Mitchell, N. 2016. Neil Mitchell's Haskell Blog. Blogspot.
[26] Noga, M.L. et al. 2002. Lazy XML Processing. Proceedings of the 2002 ACM Symposium on Document Engineering (New York, NY, USA, 2002), 88–94.
[27] OASIS 2006. Open Document Format for Office Applications (OpenDocument) v1.0 (Second Edition), ISO/IEC 26300:2006.
[28] Paoli, J. et al. 2008. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C.
[29] Pike, R. and Thompson, K. 1993. Hello world or ... Proceedings of the Winter 1993 USENIX Conference (San Diego, 1993), 43–50.
[30] Schmidt, A. et al. 2002. XMark: A Benchmark for XML Data Management. VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, P.A. Bernstein et al., eds. Morgan Kaufmann, 974–985.
[31] Schmidt, M. 1999. Design and Implementation of a validating XML parser in Haskell.
[32] Snoyman, M. 2012. Appendix F: xml-conduit. In: Developing Web Applications with Haskell and Yesod. O'Reilly Media.
[33] Snoyman, M. 2018. Conduit 1.3 - A streaming library. GitHub repository. https://github.com/snoyberg/conduit
[34] Sonar, R.P. and Ali, M.S. 2016. iXML - generation of an efficient XML parser for embedded systems. 2016 International Conference on Computing Communication Control and Automation (ICCUBEA) (Aug. 2016), 1–5.
[35] The Unicode Consortium 2011. The Unicode Standard, Version 6.0.0. Unicode Consortium.
[36] Thompson, K. 1984. Reflections on Trusting Trust. Communications of the ACM 27, 8 (Aug. 1984), 761–763.
[37] van der Vlist, E. 2002. XML Schema: The W3C's Object-Oriented Descriptions for XML. O'Reilly.
[38] Wallace, M. and Runciman, C. 1999. Haskell and XML: Generic Combinators or Type-Based Translation? Proceedings of the Fourth ACM SIGPLAN International Conference on Functional Programming (New York, NY, USA, 1999), 148–159.
[39] Wallace, M. and Runciman, C. 1999. Haskell and XML: Generic Combinators or Type-Based Translation? SIGPLAN Notices 34, 9 (Sep. 1999), 148–159.

Appendix

Acknowledgments

This work was partially supported by IntelliShore Corp. The author thanks everyone for the tap-on-the-back donations to his past projects.
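As a supplementary sketch of the benchmark setup from sec. 4.1, the XML test-file generator can be approximated as follows. Unlike the real generator, this stand-in is deterministic rather than random, so that the output size is predictable; all names here are ours, not the actual benchmark tooling.

```haskell
-- Minimal, deterministic stand-in for the sec. 4.1 benchmark-input
-- generator: emits a <users> document of roughly the requested size,
-- using the record shape from sec. 2.3.
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Builder as B

-- One <user> record; each is roughly 48 bytes long.
userRecord :: Int -> B.Builder
userRecord i =
     B.byteString (BS.pack "<user><uid>")
  <> B.intDec i
  <> B.byteString (BS.pack "</uid><name>user-")
  <> B.intDec i
  <> B.byteString (BS.pack "</name></user>")

-- | Generate a document of approximately the target size in bytes.
genUsers :: Int -> BS.ByteString
genUsers target = BL.toStrict . B.toLazyByteString $
     B.byteString (BS.pack "<?xml version=\"1.0\"?><users>")
  <> mconcat [userRecord i | i <- [1 .. max 1 (target `div` 48)]]
  <> B.byteString (BS.pack "</users>")
```

Using a Builder keeps generation linear in the output size; scaling the record count by the target (32 Mb to 1024 Mb) reproduces the six benchmark inputs.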
Figure 3. Memory consumption comparison for generated parser and other tools.

Figure 4. Count of GCs for generated parser and prototypes.

Figure 5. Count of GCs for generated parser and other tools.

Figure 6. Allocations comparison for generated parser and other tools.

Figure 7. Allocations comparison for generated parser and prototypes.

Figure 8. Speed comparison for generated parser and its prototypes.

Figure 9. Speed comparison for generated parser and other tools.