XML Prague 2018 – Conference Proceedings Copyright © 2018 Jiří Kosek

XML Prague 2018
Conference Proceedings

University of Economics, Prague
Prague, Czech Republic
February 8–10, 2018

ISBN 978-80-906259-4-5 (pdf)
ISBN 978-80-906259-5-2 (ePub)

Table of Contents

General Information
Sponsors
Preface
Assisted Structured Authoring using Conditional Random Fields – Bert Willems
XML Success Story: Creating and Integrating Collaboration Solutions to Improve the Documentation Process – Steven Higgs
xqerl: XQuery 3.1 Implementation in Erlang – Zachary N. Dean
XML Tree Models for Efficient Copy Operations – Michael Kay
Using Maven with XML development projects – Christophe Marchand and Matthieu Ricaud-Dussarget
Varieties of XML Merge: Concurrent versus Sequential – Tejas Pradip Barhate and Nigel Whitaker
Including XML Markup in the Automated Collation of Literary Text – Elli Bleeker, Bram Buitendijk, Ronald Haentjens Dekker, and Astrid Kulsdom
Multi-Layer Content Modelling to the Rescue – Erik Siegel
Combining graph and tree – Hans-Juergen Rennau
SML – A simpler and shorter representation of XML – Jean-François Larvoire
Can we create a real world rich Internet application using Saxon-JS? – Pieter Masereeuw
Implementing XForms using interactive XSLT 3.0 – O'Neil Delpratt and Debbie Lockett
Life, the Universe, and CSS Tests – Tony Graham
Form, and Content – Steven Pemberton
tokenized-to-tree – Gerrit Imsieke

General Information

Date
February 8th, 9th and 10th, 2018

Location
University of Economics, Prague (UEP)
nám. W. Churchilla 4, 130 67 Prague 3, Czech Republic

Organizing Committee
Petr Cimprich, XML Prague, z.s.
Vít Janota, Xyleme & XML Prague, z.s.
Káťa Kabrhelová, XML Prague, z.s.
Jirka Kosek, xmlguru.cz & XML Prague, z.s. & University of Economics, Prague
Martin Svárovský, Memsource & XML Prague, z.s.
Mohamed Zergaoui, ShareXML.com & Innovimax

Program Committee
Robin Berjon, The New York Times
Petr Cimprich, Xyleme
Jim Fuller, MarkLogic
Michael Kay, Saxonica
Jirka Kosek (chair), University of Economics, Prague
Ari Nordström, SGMLGuru.org
Uche Ogbuji, Zepheira LLC
Adam Retter, Evolved Binary
Andrew Sales, Bloomsbury Publishing plc
Felix Sasaki, Cornelsen GmbH
John Snelson, MarkLogic
Jeni Tennison, Open Data Institute
Eric van der Vlist, Dyomedea
Priscilla Walmsley, Datypic
Norman Walsh, MarkLogic
Mohamed Zergaoui, Innovimax

Produced By
XML Prague, z.s. (http://xmlprague.cz/about)
Faculty of Informatics and Statistics, UEP (http://fis.vse.cz)

Sponsors
oXygen (http://www.oxygenxml.com)
le-tex publishing services (http://www.le-tex.de/en/)
Antenna House (http://www.antennahouse.com/)
Saxonica (http://www.saxonica.com/)
speedata (http://www.speedata.de/)
Syntea (http://www.syntea.cz/)

Preface

This publication contains papers presented during the XML Prague 2018 conference.

In its thirteenth year, XML Prague is a conference on XML for developers, markup geeks, information managers, and students. XML Prague focuses on markup and semantics on the Web, publishing and digital books, XML technologies for Big Data, and recent advances in XML technologies. The conference provides an overview of successful technologies, with a focus on real-world application rather than theoretical exposition.

The conference takes place 8–10 February 2018 at the campus of the University of Economics in Prague. XML Prague 2018 is jointly organized by the non-profit organization XML Prague, z.s. and by the Faculty of Informatics and Statistics, University of Economics in Prague.

The full program of the conference is broadcast over the Internet (see http://xmlprague.cz), allowing XML fans from around the world to participate online. Thursday runs in an unconference style, which provides space for various XML community meetings in parallel tracks.
Friday and Saturday are devoted to the classical single-track format, and papers from these days are published in the proceedings. Additionally, we coordinate, support and provide space for the XProc working group meeting co-located with XML Prague.

We hope that you enjoy XML Prague 2018, especially as this is a very special edition: the last day of the conference is the 20th anniversary of the publication of the XML Recommendation.

Petr Cimprich & Jirka Kosek & Mohamed Zergaoui
XML Prague Organizing Committee

Assisted Structured Authoring using Conditional Random Fields

Bert Willems
FontoXML
<[email protected]>

Abstract

Authoring structured content with rich semantic markup is repetitive, time-consuming and error-prone. Many Subject Matter Experts (SMEs) struggle with the task of applying the correct markup. This paper proposes a mechanism to partially automate this task using Conditional Random Fields (CRF), a machine learning algorithm. It also proposes an architecture for continuously improving the CRF model in production using a feedback loop.

Keywords: XML, Conditional Random Fields, Structured Authoring, Machine Learning

1. Introduction

With the increasing adoption of structured XML content, the amount of work required from Subject Matter Experts (SMEs) increases. Not only are they required to capture their knowledge as information for others, they are increasingly asked, and sometimes even required, to mark up the information with the appropriate semantic and structural metadata in the form of XML tags and attributes. Examples of those markup tasks include:

• Structuring bibliographic references to tag authors, journal name, publisher, etc.
• Marking up tasks, not with ordered lists but with steps.
• Marking up interactive questions, like multiple choice questions.

Although WYSIWYG XML editors help to make this task as easy as possible, the fact remains that there is additional work to be done that is often repetitive and error-prone.
FontoXML conducted multiple studies to determine whether the effort of manual tagging affected adoption. The results showed a consistent negative effect: SMEs and their editorial colleagues are hesitant to adopt structured authoring. In some cases this meant reverting to their unstructured content processes, leading to unrealized potential.

Prior implementations, like GROBID [3], apply markup automatically. This paper proposes to introduce Machine Learning (ML) to the authoring process instead. The reason for this is the inaccuracy of state-of-the-art ML algorithms: like humans, they make mistakes. Allowing SMEs to (correct and) accept a machine-provided suggestion will result in more accurate markup. Furthermore, this approach allows for the creation of a feedback loop, allowing the machine to improve over time.

This paper focuses on the task of structuring bibliographic citations, although the proposed architecture scales to many of the tasks required for properly structured content.

2. Model

This section describes the model used for recognizing bibliographic citations and extracting the relevant labels from them. The model follows a divide-and-conquer strategy and is made up of two separate models: the Citation Model and the Name Model. Partial results from the Citation Model cascade into the Name Model to produce more detailed results.

2.1. Citation Model

The goal of the Citation Model is to classify a sequence of text with tags that make up the parts of the citation. The tags are derived from the TEI P5 vocabulary [12] and are encoded using the IOB tagging scheme [8]. The following tags are distinguished:

• author
• orgName
• editor
• publisher
• pubPlace
• date
• idno (bibliographic identifier)
• analytic (articles, poems, etc.)
• monographic (books, single & multi volumes, etc.)
• journal
• series
• unpublished
• volume
• issue
• pages
• chapter

For example, the sequence Erickson, T. & Kellogg, W. A. "Social
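
As a rough illustration of the IOB tagging scheme named in Section 2.1, the following Python sketch flattens labelled token spans into token/tag pairs. It is not the paper's implementation: the `to_iob` helper, the tokenization, the `B-`/`I-`/`O` tag spelling (the common IOB2 convention), and the way the author names from the example citation are split into two spans are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# A span is (label, tokens); a label of None marks tokens outside any tagged part.
Span = Tuple[Optional[str], List[str]]


def to_iob(spans: List[Span]) -> List[Tuple[str, str]]:
    """Flatten labelled spans into (token, IOB tag) pairs.

    The first token of a labelled span gets "B-<label>", later tokens of the
    same span get "I-<label>", and unlabelled tokens get "O".
    """
    tagged: List[Tuple[str, str]] = []
    for label, tokens in spans:
        for position, token in enumerate(tokens):
            if label is None:
                tag = "O"
            elif position == 0:
                tag = "B-" + label
            else:
                tag = "I-" + label
            tagged.append((token, tag))
    return tagged


# Hypothetical segmentation of the author part of the citation quoted above.
# Whether "&" falls outside the author spans is an assumption, not the paper's rule.
spans: List[Span] = [
    ("author", ["Erickson", ",", "T."]),
    (None, ["&"]),
    ("author", ["Kellogg", ",", "W.", "A."]),
]

for token, tag in to_iob(spans):
    print(f"{token}\t{tag}")
```

Under these assumptions, a sequence labeller such as a CRF would be trained to predict the tag column from features of the token column.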
