Key information

Querying and • Instructor: James Cheney, [email protected] • Lectures: 11:10-12:00, Tuesday/Friday Storing XML • LT4, 7 Bristo Square Week 1 • Office Hours: (IF 5.29) Introduction & course overview, XML basics • TA: Clare Llewellyn, [email protected] January 15-18, 2013 • Webpage: • http://www.inf.ed.ac.uk/teaching/courses/qsx/ • Weekly readings, project ideas/suggested readings

QSX January 15-18, 2013

What is XML? In a nutshell:

• eXtensible Markup Language [W3C 1998] • A (meta)language for semi-structured • Ask five different people, get five different data (trees) answers... books • a self-describing data format? ... • a generalization of HTML? book book the future/past? • year title ... title author author author • best thing since sliced bread/clunky and evil? Data on a metalanguage? Abiteboul 2000 • the Web Suciu ... • http://en.wikipedia.org/wiki/List_of_XML_markup_languages Buneman

QSX January 15-18, 2013 QSX January 15-18, 2013 XML for markup/ In a nutshell: documents DataA (meta)language on the Webfor semi-structured SGML •Abiteboul • data (trees) Buneman • HTML - hypertext markup language Suciubooks 2000 • TEI - Text markup, language technology book book ... DocBook - documents -> , pdf, ... ... title year title ... SMIL - Multimedia ... author author author • SVG - Vector graphics Data on Abiteboul 2000 • the Web... Suciu ... MathML - Mathematical formulas Buneman •

QSX January 15-18, 2013 QSX January 15-18, 2013

XML for XML as an abstract (semi-)structured data syntax • MusicXML • Many systems now use XML as a general-purpose syntax for other programming languages or configuration files... NewsML • • Java servlet config (web.) • iTunes • Apache Tomcat, Google App Engine, ... • DBLP http://dblp.uni-trier.de • Web Services - WSDL, SOAP, XML-RPC • XUL - XML User Interface Language (Mozilla/Firefox) CIA World Factbook • • BPEL - Business process execution language • IMDB http://www.imdb.com/ • Other Web standards: • XBEL - bookmark files (in your browser) • XSLT, XML Schema, XQueryX • RDF/XML KML - geographical annotation (Google Maps) • • OWL - • XACML - XML Access Control Markup Language • MMI - Multimodal interaction (phone + car + PC)

QSX January 15-18, 2013 QSX January 15-18, 2013 Native XML XML tools databases • Standalone: • Offer native support for XML data & query • xsltproc, mxquery, calabash (XProc) languages (not building on existing RDBMS) • Most PLs have XML parsers; many have XSLT engines/ • Galax libraries also • MarkLogic SAX (streaming), DOM (in-memory) interfaces • • eXist • libxml2, expat, libxslt (C) • BaseX • Xerces, Xalan (Java) • among others... • XPath (path expressions) used in many languages Suitable for new or lightweight applications • JavaScript/JQuery • • XSLT, XQuery • but some lack features like transactions, views, updates

QSX January 15-18, 2013 QSX January 15-18, 2013

XML support in The wonderful thing industry about standards... • Most commercial RDBMSs now provide some XML • There are so many to choose from! support • XPath 1.0, 2.0, XSLT 1.0, 2.0, XQuery, XProc • Oracle 11g - XML DB • RDF, RDFS, OWL 1.0, 2.0, SPARQL 1.0, 1.1, ... • IBM DB2 pureXML • W3C process moves quickly • Microsoft SQL Server - XML support since 2005 • and is hit-or-miss • Language Integrated Query (LINQ) targets SQL & XML • often driven by nontechnical/industrial issues in .NET programs • Standards reflect compromises between needs of • Data publishing, exchange, integration problems are different communities very important • XML standards often compromise between “data” and “document” • big 3 have products for all of these views of world • SQL/XML standard for defining XML views of relational data • and it shows!

QSX January 15-18, 2013 QSX January 15-18, 2013 Beyond the hype (Google “I hate XML”) • XML “wave” (1999-200?): may have crested • XML can be and has been (justifiably) criticized • bloat, ad hoc features, too many standards • wheel reinvention: why not LISP S-Expressions [McCarthy 1960] • New semistructured formats/syntaxes now in vogue Where is XML used? • RDF, JSON, YAML, Google Protocol Buffers • many XML vendors re-branding as “NoSQL” • Nevertheless, the basic issues are pretty much the same • and XML is definitely not going away • Our goal: rise above fray, understand essential issues in CS terms

QSX January 15-18, 2013 QSX January 15-18, 2013

Static Web site (XML Static Web site + XSLT)

get get Browser Server Browser Server

site. site. site.css foo. site.xslt foo.xml foo.xhtml foo.xml

allows better factoring (X)HTML + CSS into data + presentation (+ SVG + MathML)

QSX January 15-18, 2013 QSX January 15-18, 2013 Web 2.0: Asynchronous Dynamic Web site Java and XML

get get Browser Server Middleware Browser Server ... XMLHttpRequest

XMLHttpRequest SQL? Q + JavaScript XQuery? Database Database maintains state QSX January 15-18, 2013 QSX January 15-18, 2013

Data integration Data exchange (warehousing) • Massive demand • across platforms/DBs • across enterprises

RDB OODB

XML has become the prime standard for • Database Database Database data interchange on the Web Schema S1 Schema S2 Schema S3

QSX January 15-18, 2013 QSX January 15-18, 2013 Data integration Data integration (warehousing) (warehousing)

Q Warehouse Warehouse Schema T Schema T

V1 V2 V3 V1 V2 V3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3

QSX January 15-18, 2013 QSX January 15-18, 2013

Data integration Data integration (warehousing) (mediation)

Q Warehouse Mediator Schema T Schema T

V1 V2 V3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3

QSX January 15-18, 2013 QSX January 15-18, 2013 Data integration Data integration (mediation) (mediation)

Q Q Mediator Mediator Schema T Schema T

Q1 Q2 Q3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3

QSX January 15-18, 2013 QSX January 15-18, 2013

Data integration Data integration (mediation) (mediation)

Q Q Mediator Mediator Schema T Schema T

Q1 Q2 Q3 Q1 Q2 Q3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3

QSX January 15-18, 2013 QSX January 15-18, 2013 What you need for this course • Prerequisites: • Language Semantics & Implementation (PL) • or Computers & Intractability (Algorithms/complexity) Course overview • or permission • Database Systems or ADBS wouldn’t hurt • good if you have at least heard of SQL • Data Integration & Exchange, Extreme Computing or Natural Language Technologies may complement

QSX January 15-18, 2013 QSX January 15-18, 2013

Evaluation Reviews

35%: Reviews (7 assignments) • Hand in reviews on Monday at 4pm before of the • week in which the papers are to be discussed. Due each Monday at 4pm until week 8 • • Online handin, details TBA soon • Week 2-3: read/review all 3 tutorials • Each review should be about half a page • Week 4-8: read/review 2 out of 3 papers • summary, key ideas, questions you have, flaws you have found, and suggestions for improvement. • 50%: Course project report • Marked on “+/✓/-” scale. due Monday, March 25, 4pm • • + = 2 pts (outstanding) • 15%: Project presentation • ✓ = 1 pt (pass) • during week 10-11 (after report due!) • - = 0 pts (incomplete)

QSX January 15-18, 2013 QSX January 15-18, 2013 Projects Suggested timetable

• The project is the main assessed component of the Week 3: select project, form groups course. • • project ideas: http://www.inf.ed.ac.uk/teaching/courses/qsx/ • Projects can take two forms: project.html • Design and development projects • contact me or use [email protected] list • (2-3 students) implement or improve upon an algorithm or system. • Week 5: literature review/identify related work • Survey/benchmarking projects • Week 7: Draft report/preliminary results • (1 student) surveying 5-10 research papers on a particular topic, or using several systems for the same task and Week 10-11: Final report & presentation comparing them • • Project report: 15-30 pages (typically longer for group • It is up to you to ensure that your project projects) stays on track!

QSX January 15-18, 2013 QSX January 15-18, 2013

Presentations Introductions

• Each group member must participate in • Who you are presentation • What you want to get out of the course • ~5 min per group member (e.g. 3- person group gets 15 minutes) ...Start thinking about project groups Summarize background • • early Present research question • • Use mailing list [email protected] • Summarize progress and next steps

QSX January 15-18, 2013 QSX January 15-18, 2013 What you should get Syllabus/topics from this course • Skills/knowledge in demand in industry ($$) • Weeks 1-3: Foundations • NOT: Specific language/tool/system/protocol/API • Weeks 4-5: Storage & publishing • you should be able to pick it up on your own. • Weeks 6-8: Additional topics (updates, • Research / team / project experience static analysis, provenance) • Background useful for working with other Week 9: Break XML or Web standards/tools • Week 10-11: Project presentations • Chance to put diverse CS concepts into • practice

QSX January 15-18, 2013 QSX January 15-18, 2013

Next time Foundations of XML

• XML background • XML itself • Query and transformation languages • XPath: a path-based language for navigating / selecting parts of XML trees • XQuery: a “SQL for XML” query language • XSLT: a pattern-matching, recursive transformation language • Schemas (DTDs, XML Schema) • validation, automata • Constraints (keys)

QSX January 15-18, 2013 QSX January 15-18, 2013 XML and databases XML updates

• XML “shredding”: Storing/querying XML in • Beyond DOM: Updating XML stored as relational databases relations • Shredding XML trees into relations (with or without schema) • XQuery Update Facility • Translating XML queries/updates to SQL over extending XQuery to support updates shredded representation • Updating XML views of relations • XML “publishing”: Providing XML views of • legacy relational data • how to translate updates to XML views to SQL • Translating XML queries over views to SQL • Schema-directed publishing

QSX January 15-18, 2013 QSX January 15-18, 2013

Typechecking and Provenance for semi- static analysis structured data • Typed programming with XML • Provenance: Understanding where data came from, how it has been produced • regular expression types and inference • Predicting structure of result of queries • Increasingly important for scientific data • application: finding “path errors” in queries • Why and where provenance • Using this to improve performance • Annotation and XML query languages • e.g. detecting when a query and update do (or • Provenance for curated (evolving) data don’t) “overlap” • can save time recomputing views QSX January 15-18, 2013 QSX January 15-18, 2013 Reading

• Web Data Management (Abiteboul et al. 2011) • ch. 1-5 provide excellent overview of XML, Schemas, XPath, XQuery and shredding • also good coverage of RDF/OWL/SPARQL and cloud computing/ MapReduce • webdam.inria.fr/Jorge/ XML background • Introduction to XML and Web Technologies (Moller, Schwartzach 2006) • now a bit dated but good coverage of XML and Java-based Web programming • slides, ch. 4 online • Research papers - listed week-by-week on course web page

QSX January 15-18, 2013 QSX January 15-18, 2013

Some history HTML: limitations

• SGML - (Charles Goldfarb, ISO 8879, 1986) • HTML was intended as a declarative markup language • widely used for document management • emphasizing structure over presentation • but complex & hard to implement • But with success of Web, intense pressure for • HTML - (Tim Berners-Lee, 1991) more presentation features • most successful application of SGML • CSS helps separate content from structure, a little XML - (W3C, 1998) • nevertheless, while great for human consumption, • HTML is not suitable for representing general data • simplify SGML, for Web data/content • fixed set of tags • still pretty complicated, though • describe display format, not structure of data

QSX January 15-18, 2013 QSX January 15-18, 2013 Good things about Scraping data from XML HTML • Tags can be defined for specific applications other than

Some data

HTML • The structure of the data can be defined more precisely • DTDs, XML Schemas • Structures can be arbitrarily nested • even including recursion
AB
12
34 • XML standard does not define how data should be displayed
• Style sheets (XSLT) can transform XML to HTML or other forms

QSX January 15-18, 2013 QSX January 15-18, 2013

Scraping data from Data as XML HTML Optional

Some data

closing tags - hard to parse
AB
12
12
34 34

QSX January 15-18, 2013 QSX January 15-18, 2013 Data as XML Data as XML Header Attribute

12123434

QSX January 15-18, 2013 QSX January 15-18, 2013

Data as XML Data as XML

Text 12123434

QSX January 15-18, 2013 QSX January 15-18, 2013 Data as XML XML Data as trees

title root NOTE: Tag names "Some data" row row

have NO pre- 12defined meaning! A B A B 34 1 2 3 4

QSX January 15-18, 2013 QSX January 15-18, 2013

XML Data as trees Basics

An XML document consists of title root Child order matters! • • elements - ... "Some data" row row • tags come in pairs • tags must be properly nested (can't skip closing tag!) A B A B • attributes - 1 2 3 4 • key-value pairs associated with elements • text values - foo123 • unquoted text inside elements

QSX January 15-18, 2013 QSX January 15-18, 2013 Elements Nested structure

• Element: the segment between an start and its corresponding • nested tags can be used to express various structures, e.g., “records”: end tag • Unique root element !! James Cheney • subelement: the relation between an element and its !! 0131 651 5658 component elements. !! [email protected] !! [email protected] !! James Cheney ! !! 0131 651 5658 • a list: represented by using the same tags repeatedly: !! [email protected] !! [email protected] ! ! !...

QSX January 15-18, 2013 QSX January 15-18, 2013

Ordering Attributes

• XML elements are ordered! • A start tag may contain attributes describing certain “properties” of the element (e.g., dimension or type) How to represent sets in XML? • ! How to represent an unordered pair (a, b) in XML? • !! 2400 • Can one directly represent the following in a relational database? !! 96 !! M05-+C$ … … ! ! • References (meaningful only when a DTD is present): ! !! James Cheney !! George Bush !! 0131 651 5658 ! !! [email protected] ! !! [email protected] !! Saddam Hussein ! !

QSX January 15-18, 2013 QSX January 15-18, 2013 Attribute structure Extras

• XML attributes cannot be nested -- flat • entity references: & " > • the names of XML attributes of an element must be unique. textual substitution; allows escaping special characters • one can’t write ... • • XML attributes are not ordered: • you can define your own if you want ! • processing instructions: !! George Bush can be used to pass information to processors ! • is the same as • comments: • CDATA sections: - !! George Bush allows including raw text (<, >, &, etc. uninterpreted) ! • • Attributes vs. subelements: unordered vs. ordered, and • Luckily, these are mostly irrelevant to use of XML for data • attributes cannot be nested (flat structure) • but you need to know about them when writing reading/writing XML • subelements cannot represent references as text

QSX January 15-18, 2013 QSX January 15-18, 2013

Quiz Quiz

• Groups of 2-3: Find as many errors in this XML document as you can • Groups of 2-3: Find as many errors in this XML document as you can Data on the Web Data on the Web Abiteboul Abiteboul Buneman & Buneman & Suciu Suciu 2000/year> 2000 Addison-Wesley Addison-Wesley bar bar

QSX January 15-18, 2013 QSX January 15-18, 2013 Processing XML SAX: Basic idea

• Most programming languages have XML • SAX: Streaming API for XML (de facto libraries standard) • parsers (SAX, DOM) • reads XML document incrementally • in-memory manipulation (DOM) • generates calls to event handlers • validation (schemas) • you write code that handles these events • Thus, usually don’t need to worry about all the fiddly details of character sets • XML supports UNICODE, many other encodings

QSX January 15-18, 2013 QSX January 15-18, 2013

SAX: Example SAX: Example

beginElt(“root”,[title=”foo”])

title root title root

"foo" row row "foo" row row

A B A B A B A B

1 2 3 4 1 2 3 4

QSX January 15-18, 2013 QSX January 15-18, 2013 SAX: Example SAX: Example

beginElt(“root”,[title=”foo”]) beginElt(“root”,[title=”foo”]) beginElt(“row”) beginElt(“row”) beginElt(“A”) title root title root

"foo" row row "foo" row row

A B A B A B A B

1 2 3 4 1 2 3 4

QSX January 15-18, 2013 QSX January 15-18, 2013

SAX: Example SAX: Example

beginElt(“root”,[title=”foo”]) beginElt(“root”,[title=”foo”]) beginElt(“row”) beginElt(“row”) root beginElt(“A”) root beginElt(“A”) title text(“1”) title text(“1”) endElt(“A”) "foo" row row "foo" row row

A B A B A B A B

1 2 3 4 1 2 3 4

QSX January 15-18, 2013 QSX January 15-18, 2013 SAX: Example SAX: Example

beginElt(“root”,[title=”foo”]) beginElt(“root”,[title=”foo”]) beginElt(“row”) beginElt(“row”) root beginElt(“A”) root beginElt(“A”) title text(“1”) title text(“1”) endElt(“A”) endElt(“A”) beginElt(“B”) beginElt(“B”) "foo" row row text(“2”) "foo" row row text(“2”) endElt(“B”) endElt(“B”) endElt(“row”) endElt(“row”) beginElt(“row”) A B A B A B A B beginElt(“A”) text(“1”) endElt(“A”) beginElt(“B”) 1 2 3 4 1 2 3 4 text(“2”) endElt(“B”) endElt(“row”) endElt(“root”)

QSX January 15-18, 2013 QSX January 15-18, 2013

SAX: Advantages SAX: Disadvantages

• Very widely supported • Non-starter if random access needed • Can be very efficient • Not suitable for transformations to • for operations that are streaming-friendly persistent or large data • only realistic option for documents too large for e.g. in-browser or database updates memory • Can be tricky to program • Can easily ignore parts of the document • • comments, etc. • due to need to handle atomic events • these are still processed, however (SAX reads in • need to figure out how to incrementalize whole XML file whether or not it is all needed) processing & maintain state

QSX January 15-18, 2013 QSX January 15-18, 2013 DOM: Basic idea DOM: Example

• DOM: (W3C) root = document.body title root • reads XML document all at once c1 = root.getFirstChild() • allocates tree structure in memory "foo" row row c2 = root.getLastChild() • provides standard methods for traversing, modifying doc A B A B • widely used in JavaScript to dynamically update HTML page 1 2 3 4 • parses/loads document into memory

QSX January 15-18, 2013 QSX January 15-18, 2013

DOM: Example DOM: Example

root = document.body root = document.body title root title root c1 = root.getFirstChild() c1 = root.getFirstChild() "foo" row "foo" row c2 = root.getLastChild() c2 = root.getLastChild() row A B A B root.deleteChild(c2) root.deleteChild(c2) A B 1 2 1 2 c1.appendChild(c2) 3 4

QSX January 15-18, 2013 QSX January 15-18, 2013 DOM: Advantages DOM: Disadvantages

• Much easier to program • Memory footprint can be several times that of XML text • Offers random access & dynamic updates • which is already bloated! • Best if scalability to large data not a concern • Thus, cannot be used for data > size of memory (gigabytes) Library support for path queries can be • At a programming level, side-effecting very convenient (e.g. JQuery) • updates can be tricky to get right though naive implementations of queries can • • but no realistic alternatives yet to JavaScript for be very slow browser interactivity

QSX January 15-18, 2013 QSX January 15-18, 2013

Next week

• XPath: Navigating through XML trees • XQuery: SQL-like queries for constructing new XML trees from existing data

QSX January 15-18, 2013