4/12/17

2 Announcements (Wed., Apr. 12)

• Homework #4 due Monday, April 24, 11:55 pm • 4.1, 4.2, 4.3, X1 is posted XML- • Please start early XPath and XQuery • There may be another extra credit problem Introduction to • Projects CompSci 316 Spring 2017 • keep working on them and write your final report • Demo in the week of April 24

• Google Cloud use? • If anyone is planning to use or using google cloud for HW/project, please send me an email

3 4 Approaches to XML processing

• Text files/messages • Specialized XML DBMS • Tamino (Software AG), BaseX, eXist, Sedna, … Relational Mapping • Not as mature as relational DBMS • Relational (and object-relational) DBMS • Middleware and/or extensions • IBM DB2’s pureXML, PostgreSQL’s XML type/functions…

5 6 Mapping XML to relational 1. /edge-based: schema

• Store XML in a column • Element(eid, tag) • Simple, compact • Attribute(eid, attrName, attrValue) Key: (eid, attrName) • CLOB (Character Large OBject) type + full-text indexing, or better, special XML type + functions • Attribute order does not matter • Poor integration with relational query processing • ElementChild(eid, pos, child) Keys: (eid, pos), (child) • Updates are expensive • pos specifies the ordering of children • Alternatives? • child references either Element(eid) or Text(tid) • Schema-oblivious mapping: ← Focus of this lecture well-formed XML → generic relational schema • Text(tid, value) 1. Node/edge-based mapping for graphs • tid cannot be the same as any eid 2. Interval-based mapping for trees F 3. Path-based mapping for trees Need to “invent” lots of id’s • Schema-aware mapping: FNeed indexes for efficiency, e.g., Element(tag), valid XML → special relational schema based on DTD Text(value)

1 4/12/17

7 8 Node/edge-based: example Node/edge-based: simple paths

ElementChild eid pos child Foundations of Databases Element • //title Abiteboul eid tag e0 1 e1 Hull • SELECT eid FROM Element WHERE tag = 'title'; e0 bibliography e1 1 e2 Vianu Addison Wesley e1 book e1 2 e3 • //section/title 1995 e1 3 e4 … e2 title • SELECT e2.eid e3 author e1 4 e5 e1 5 e6 FROM Element e1, ElementChild , Element e2 e4 author WHERE e1.tag = 'section' e5 author e1 6 e7 eid attrName attrValue AND e2.tag = 'title' Attribute e6 publisher e2 1 t0 e1 ISBN ISBN-10 e7 year e3 1 t1 AND e1.eid = c.eid e1 price 80 e4 1 t2 AND c.child = e2.eid; e5 1 t3 tid value Text e6 1 t4 FPath expression becomes joins! t0 Foundations of Databases e7 1 t5 t1 Abiteboul • Number of joins is proportional to the length of the path t2 Hull expression t3 Vianu t4 Addison Wesley t5 1995

9 10 Node/edge-based: complex paths Node/edge-based: descendent-or-self

• //bibliography/book[author="Abiteboul"]/@price • //book//title • SELECT a.attrValue FROM Element e1, ElementChild c1, • Requires SQL3 recursion Element e2, Attribute a • WITH RECURSIVE ReachableFromBook(id) AS WHERE e1.tag = 'bibliography' ((SELECT eid FROM Element WHERE tag = 'book') AND e1.eid = c1.eid AND c1.child = e2.eid UNION AND e2.tag = 'book' (SELECT c.child AND EXISTS (SELECT * FROM ElementChild c2, some author of e2 FROM ReachableFromBook r, ElementChild c Element e3, ElementChild c3, Text t WHERE r.eid = c.eid)) WHERE e2.eid = c2.eid AND c2.child = e3.eid is ’Abiteboul’ AND e3.tag = 'author' SELECT eid AND e3.eid = c3.eid AND c3.child = t.tid FROM Element AND t.value = 'Abiteboul') WHERE eid IN (SELECT * FROM ReachableFromBook) AND tag = 'title'; AND e2.eid = a.eid AND a.attrName = 'price';

11 12 2. Interval-based: schema Interval-based: example

1 2 bibliography 1,999,1 • Element(left, right, level, tag) 34Foundations of Databases5 67Abiteboul8 • left is the start position of the element 910Hull11 book 2,21,2 1213Vianu14 1516Addison Wesley17 • right is the end position of the element 1819199520 21… • level is the nesting depth of the element 999 • Key is left title author author author publisher year • Text(left, right, level, value) 3,5,3 6,8,3 9,11,3 12,14,3 15,17,3 18,20,3 • Key is left FWhere did ElementChild go? • Attribute(left, attrName, attrValue) • � is the parent of � iff: • Key is (left, attrName) [�.left, �.right] ⊃ [�.left, �.right], and �.level = �.level − 1

2 4/12/17

13 14 Interval-based: queries Summary so far

• //section/title Node/edge-based vs. interval-based mapping • SELECT e2.left FROM Element e1, Element e2 WHERE e1.tag = 'section' AND e2.tag = 'title' AND e1.left < e2.left AND e2.right < e1.right • Path expression steps AND e1.level = e2.level-1; • Equality vs. containment join FPath expression becomes “containment” joins! • Descendent-or-self • Number of joins is proportional to path expression length • Recursion required vs. not required • //book//title • SELECT e2.left FROM Element e1, Element e2 WHERE e1.tag = 'book' AND e2.tag = 'title' AND e1.left < e2.left AND e2.right < e1.right; FNo recursion!

15 16 3. Path-based mapping Label-path encoding: queries

Label-path encoding: paths as strings of labels • Simple path expressions with no conditions • Element(pathid, left, right, …), Path(pathid, path), //book//title … • Perform string matching on Path • Join qualified pathid’s with Element • path is a string containing the sequence of labels on a path starting from the root • //book[publisher='Prentice Hall']/title • Why are left and right still needed? • Evaluate //book/title • Evaluate //book/publisher[text()='Prentice Hall'] Element Path • Must then ensure title and publisher belong to the same pathid left right … pathid path 1 1 999 … 1 /bibliography book (how?) 2 2 21 … 2 /bibliography/book FPath expression with attached conditions needs to be 3 3 5 … 3 /bibliography/book/title broken down, processed separately, and joined back 4 6 8 … 4 /bibliography/book/author 4 9 11 … … … 4 12 14 … … … … …

17 18 Another Path-based mapping Dewey-order encoding: queries

Dewey-order encoding • Examples: • //title Each component of the id represents the order of //section/title the child within its parent //book//title //book[publisher='Prentice Hall']/title bibliography 1 • Works similarly as interval-based mapping • Except parent/child and ancestor/descendant relationship are book 1.1 1.2 1.3 1.4 checked by prefix matching

1.4.1 1.4.2

Element(dewey_pid, tag) title author author author publisher year Text(dewey_pid, value) 1.1.1 1.1.2 1.1.3 1.1.4 1.1.5 1.1.6 Attribute(dewey_pid, attrName, attrValue)

3 4/12/17

19 20 XSLT

• XML-to-XML rule-based transformation language • Used most frequently as a stylesheet language • An XSLT program is an XML document itself An overview of

XSLT, SAX, and DOM XSLT program

XSLT processor

Input XML Output XML Actually, output does not need to be in XML in general

21 22 XSLT program XSLT elements

• An XSLT program is an XML document containing • Element describing transformation rules • Elements in the namespace • • Elements in user namespace • Elements describing rule execution control • Roughly, result of evaluating an XSLT program on an input XML document = the XSLT document • where each element is replaced with the • result of its evaluation • Elements describing instructions • Basic ideas • , , , etc. • Templates specify how to transform matching • Elements generating output input nodes • • Structural recursion applies templates to , , , input trees recursively , , etc. • Uses XPath as a sub-language

23 24 XSLT example

Find titles of books authored by “Abiteboul” Standard header of an XSLT document xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"> is the basic XSLT construct describing a transformation rule • match_expr is an XPath-like expression specifying which nodes this rule applies to evaluates xpath_expr • Not quite; we will see why later within the context of the node matching the template, and converts the result sequence to a string • and simply get copied to the output for each node matched

4 4/12/17

25 26 Template in action Removing the extra output

• Add the following template: • This template matches all text and attributes • Example XML fragment Template applies • XPath features Foundations of Databases • text() is a node test that matches any text node Abiteboul Foundations of Databases Hull • @* Vianu matches any attribute Addison Wesley Template does not apply; • | 1995 means “or” in XPath

… default behavior is to process the node recursively and print • Body of the rule is empty, so all text and attributes A First Course in Databases all text nodes become empty string Ullman A First Course in Databases Ullman Widom Widom • This rule effectively filters out things not matched by the Prentice-Hall Prentice-Hall 2002 2002 other rule
… … …

27 28 Other features of XSLT SAX & DOM

• Loop and condition Both are API’s for XML processing • White space control, insertion of newline • SAX (Simple API for XML) • Calling templates with parameters • Started out as a Java API, but now exists for other languages too • Debugging and exiting the program • DOM () • , • Language-neutral API with implementations in Java, C++, • Defining variables, keys, functions python, etc.

29 30 SAX processing model A simple SAX example

• Serial access • Print out text contents of title elements • XML document is processed as a stream • Only one look at the import sys • Cannot go back to an early portion of the document import .sax from StringIO import StringIO • Event-driven • A parser generates events as it goes through the class PathHandler(xml.sax.ContentHandler): document (e.g., start of the document, end of an def startDocument(self): element, etc.) …… • def startElement(self, name, attrs): Application defines event handlers that get invoked …… when events are generated ……

xml.sax.parse(sys.stdin, PathHandler())

5 4/12/17

31 32 SAX events A simple SAX example (cont’d)

Most frequently used events: def startDocument(self): self.outBuffer = None • startDocument startDocument startElement def startElement(self, name, attrs): • endDocument startElement Foundations of Databases if name == 'title': Tag name • startElement … startElement self.outBuffer = StringIO() A map from endElement attribute names • endElement … to values endElement def endElement(self, name): • characters endDocument if name == 'title': • Whenever the parser has print self.outBuffer.getvalue() processed a chunk of character Whitespace may come up as characters self.outBuffer = None Characters read data (without generating other or ignorableWhitespace, depending on kinds of events) whether a DTD is present def characters(self, content): • Warning: The parser may if self.outBuffer is not None: generate multiple characters self.outBuffer.write(content) events for one piece of text

33 34 DOM processing model Summary

• XML is parsed by a parser and converted into an in- • XML data can be “shredded” into rows in a memory DOM relational • DOM API allows an application to • XQueries can be translated into SQL queries • Construct a DOM tree from an XML document • Queries can then benefit from smart relational indexing, • Traverse and read a DOM tree optimization, and execution • Construct a new, empty DOM tree from scratch • With schema-oblivious approaches, comprehensive • Modify an existing DOM tree XQuery-SQL translation can be easily automated • Copy subtrees from one DOM tree to another • Different data mapping techniques lead to different etc. styles of queries • Schema-aware translation is also possible and potentially more efficient, but automation is more complex

6