4/12/17
2 Announcements (Wed., Apr. 12)
• Homework #4 due Monday, April 24, 11:55 pm • 4.1, 4.2, 4.3, X1 is posted XML- • Please start early XPath and XQuery • There may be another extra credit problem Introduction to Databases • Projects CompSci 316 Spring 2017 • keep working on them and write your final report • Demo in the week of April 24
• Google Cloud use? • If anyone is planning to use or using google cloud for HW/project, please send me an email
3 4 Approaches to XML processing
• Text files/messages • Specialized XML DBMS • Tamino (Software AG), BaseX, eXist, Sedna, … Relational Mapping • Not as mature as relational DBMS • Relational (and object-relational) DBMS • Middleware and/or extensions • IBM DB2’s pureXML, PostgreSQL’s XML type/functions…
5 6 Mapping XML to relational 1. Node/edge-based: schema
• Store XML in a column • Element(eid, tag) • Simple, compact • Attribute(eid, attrName, attrValue) Key: (eid, attrName) • CLOB (Character Large OBject) type + full-text indexing, or better, special XML type + functions • Attribute order does not matter • Poor integration with relational query processing • ElementChild(eid, pos, child) Keys: (eid, pos), (child) • Updates are expensive • pos specifies the ordering of children • Alternatives? • child references either Element(eid) or Text(tid) • Schema-oblivious mapping: ← Focus of this lecture well-formed XML → generic relational schema • Text(tid, value) 1. Node/edge-based mapping for graphs • tid cannot be the same as any eid 2. Interval-based mapping for trees F 3. Path-based mapping for trees Need to “invent” lots of id’s • Schema-aware mapping: FNeed indexes for efficiency, e.g., Element(tag), valid XML → special relational schema based on DTD Text(value)
1 4/12/17
7 8 Node/edge-based: example Node/edge-based: simple paths
9 10 Node/edge-based: complex paths Node/edge-based: descendent-or-self
• //bibliography/book[author="Abiteboul"]/@price • //book//title • SELECT a.attrValue FROM Element e1, ElementChild c1, • Requires SQL3 recursion Element e2, Attribute a • WITH RECURSIVE ReachableFromBook(id) AS WHERE e1.tag = 'bibliography' ((SELECT eid FROM Element WHERE tag = 'book') AND e1.eid = c1.eid AND c1.child = e2.eid UNION AND e2.tag = 'book' (SELECT c.child AND EXISTS (SELECT * FROM ElementChild c2, some author of e2 FROM ReachableFromBook r, ElementChild c Element e3, ElementChild c3, Text t WHERE r.eid = c.eid)) WHERE e2.eid = c2.eid AND c2.child = e3.eid is ’Abiteboul’ AND e3.tag = 'author' SELECT eid AND e3.eid = c3.eid AND c3.child = t.tid FROM Element AND t.value = 'Abiteboul') WHERE eid IN (SELECT * FROM ReachableFromBook) AND tag = 'title'; AND e2.eid = a.eid AND a.attrName = 'price';
11 12 2. Interval-based: schema Interval-based: example
1
2 4/12/17
13 14 Interval-based: queries Summary so far
• //section/title Node/edge-based vs. interval-based mapping • SELECT e2.left FROM Element e1, Element e2 WHERE e1.tag = 'section' AND e2.tag = 'title' AND e1.left < e2.left AND e2.right < e1.right • Path expression steps AND e1.level = e2.level-1; • Equality vs. containment join FPath expression becomes “containment” joins! • Descendent-or-self • Number of joins is proportional to path expression length • Recursion required vs. not required • //book//title • SELECT e2.left FROM Element e1, Element e2 WHERE e1.tag = 'book' AND e2.tag = 'title' AND e1.left < e2.left AND e2.right < e1.right; FNo recursion!
15 16 3. Path-based mapping Label-path encoding: queries
Label-path encoding: paths as strings of labels • Simple path expressions with no conditions • Element(pathid, left, right, …), Path(pathid, path), //book//title … • Perform string matching on Path • Join qualified pathid’s with Element • path is a string containing the sequence of labels on a path starting from the root • //book[publisher='Prentice Hall']/title • Why are left and right still needed? • Evaluate //book/title • Evaluate //book/publisher[text()='Prentice Hall'] Element Path • Must then ensure title and publisher belong to the same pathid left right … pathid path 1 1 999 … 1 /bibliography book (how?) 2 2 21 … 2 /bibliography/book FPath expression with attached conditions needs to be 3 3 5 … 3 /bibliography/book/title broken down, processed separately, and joined back 4 6 8 … 4 /bibliography/book/author 4 9 11 … … … 4 12 14 … … … … …
17 18 Another Path-based mapping Dewey-order encoding: queries
Dewey-order encoding • Examples: • //title Each component of the id represents the order of //section/title the child within its parent //book//title //book[publisher='Prentice Hall']/title bibliography 1 • Works similarly as interval-based mapping • Except parent/child and ancestor/descendant relationship are book 1.1 1.2 1.3 1.4 checked by prefix matching
1.4.1 1.4.2
Element(dewey_pid, tag) title author author author publisher year Text(dewey_pid, value) 1.1.1 1.1.2 1.1.3 1.1.4 1.1.5 1.1.6 Attribute(dewey_pid, attrName, attrValue)
3 4/12/17
19 20 XSLT
• XML-to-XML rule-based transformation language • Used most frequently as a stylesheet language • An XSLT program is an XML document itself An overview of
XSLT, SAX, and DOM XSLT program
XSLT processor
Input XML Output XML Actually, output does not need to be in XML in general
21 22 XSLT program XSLT elements
• An XSLT program is an XML document containing • Element describing transformation rules • Elements in the
23 24 XSLT example
•
4 4/12/17
25 26 Template in action Removing the extra output
27 28 Other features of XSLT SAX & DOM
• Loop and condition Both are API’s for XML processing • White space control, insertion of newline • SAX (Simple API for XML) • Calling templates with parameters • Started out as a Java API, but now exists for other languages too • Debugging and exiting the program • DOM (Document Object Model) •
29 30 SAX processing model A simple SAX example
• Serial access • Print out text contents of title elements • XML document is processed as a stream • Only one look at the data import sys • Cannot go back to an early portion of the document import xml.sax from StringIO import StringIO • Event-driven • A parser generates events as it goes through the class PathHandler(xml.sax.ContentHandler): document (e.g., start of the document, end of an def startDocument(self): element, etc.) …… • def startElement(self, name, attrs): Application defines event handlers that get invoked …… when events are generated ……
xml.sax.parse(sys.stdin, PathHandler())
5 4/12/17
31 32 SAX events A simple SAX example (cont’d)
Most frequently used events: def startDocument(self): self.outBuffer = None • startDocument startDocument
33 34 DOM processing model Summary
• XML is parsed by a parser and converted into an in- • XML data can be “shredded” into rows in a memory DOM tree relational database • DOM API allows an application to • XQueries can be translated into SQL queries • Construct a DOM tree from an XML document • Queries can then benefit from smart relational indexing, • Traverse and read a DOM tree optimization, and execution • Construct a new, empty DOM tree from scratch • With schema-oblivious approaches, comprehensive • Modify an existing DOM tree XQuery-SQL translation can be easily automated • Copy subtrees from one DOM tree to another • Different data mapping techniques lead to different etc. styles of queries • Schema-aware translation is also possible and potentially more efficient, but automation is more complex
6