Key information
Querying and • Instructor: James Cheney, [email protected] • Lectures: 11:10-12:00, Tuesday/Friday Storing XML • LT4, 7 Bristo Square Week 1 • Office Hours: (IF 5.29) Introduction & course overview, XML basics • TA: Clare Llewellyn, [email protected] January 15-18, 2013 • Webpage: • http://www.inf.ed.ac.uk/teaching/courses/qsx/ • Weekly readings, project ideas/suggested readings
QSX January 15-18, 2013
What is XML? In a nutshell:
• eXtensible Markup Language [W3C 1998] • A (meta)language for semi-structured • Ask five different people, get five different data (trees) answers... books • a self-describing data format? ... • a generalization of HTML? book book the future/past? • year title ... title author author author • best thing since sliced bread/clunky and evil? Data on a metalanguage? Abiteboul 2000 • the Web Suciu ... • http://en.wikipedia.org/wiki/List_of_XML_markup_languages Buneman
QSX January 15-18, 2013 QSX January 15-18, 2013 XML for markup/
QSX January 15-18, 2013 QSX January 15-18, 2013
XML for XML as an abstract (semi-)structured data syntax • MusicXML • Many systems now use XML as a general-purpose syntax for other programming languages or configuration files... NewsML • • Java servlet config (web.xml) • iTunes • Apache Tomcat, Google App Engine, ... • DBLP http://dblp.uni-trier.de • Web Services - WSDL, SOAP, XML-RPC • XUL - XML User Interface Language (Mozilla/Firefox) CIA World Factbook • • BPEL - Business process execution language • IMDB http://www.imdb.com/ • Other Web standards: • XBEL - bookmark files (in your browser) • XSLT, XML Schema, XQueryX • RDF/XML KML - geographical annotation (Google Maps) • • OWL - Web Ontology Language • XACML - XML Access Control Markup Language • MMI - Multimodal interaction (phone + car + PC)
QSX January 15-18, 2013 QSX January 15-18, 2013 Native XML XML tools databases • Standalone: • Offer native support for XML data & query • xsltproc, mxquery, calabash (XProc) languages (not building on existing RDBMS) • Most PLs have XML parsers; many have XSLT engines/ • Galax libraries also • MarkLogic SAX (streaming), DOM (in-memory) interfaces • • eXist • libxml2, expat, libxslt (C) • BaseX • Xerces, Xalan (Java) • among others... • XPath (path expressions) used in many languages Suitable for new or lightweight applications • JavaScript/JQuery • • XSLT, XQuery • but some lack features like transactions, views, updates
QSX January 15-18, 2013 QSX January 15-18, 2013
XML support in The wonderful thing industry about standards... • Most commercial RDBMSs now provide some XML • There are so many to choose from! support • XPath 1.0, 2.0, XSLT 1.0, 2.0, XQuery, XProc • Oracle 11g - XML DB • RDF, RDFS, OWL 1.0, 2.0, SPARQL 1.0, 1.1, ... • IBM DB2 pureXML • W3C process moves quickly • Microsoft SQL Server - XML support since 2005 • and is hit-or-miss • Language Integrated Query (LINQ) targets SQL & XML • often driven by nontechnical/industrial issues in .NET programs • Standards reflect compromises between needs of • Data publishing, exchange, integration problems are different communities very important • XML standards often compromise between “data” and “document” • big 3 have products for all of these views of world • SQL/XML standard for defining XML views of relational data • and it shows!
QSX January 15-18, 2013 QSX January 15-18, 2013 Beyond the hype (Google “I hate XML”) • XML “wave” (1999-200?): may have crested • XML can be and has been (justifiably) criticized • bloat, ad hoc features, too many standards • wheel reinvention: why not LISP S-Expressions [McCarthy 1960] • New semistructured formats/syntaxes now in vogue Where is XML used? • RDF, JSON, YAML, Google Protocol Buffers • many XML vendors re-branding as “NoSQL” • Nevertheless, the basic issues are pretty much the same • and XML is definitely not going away • Our goal: rise above fray, understand essential issues in CS terms
QSX January 15-18, 2013 QSX January 15-18, 2013
Static Web site (XML Static Web site + XSLT)
get get Browser Server Browser Server
site.css site.xslt site.css foo.xhtml site.xslt foo.xml foo.xhtml foo.xml
allows better factoring (X)HTML + CSS into data + presentation (+ SVG + MathML)
QSX January 15-18, 2013 QSX January 15-18, 2013 Web 2.0: Asynchronous Dynamic Web site Java and XML
get get Browser Server Middleware Browser Server ... XMLHttpRequest
XMLHttpRequest SQL? Q + JavaScript XQuery? Database Database maintains state QSX January 15-18, 2013 QSX January 15-18, 2013
Data integration Data exchange (warehousing) • Massive demand • across platforms/DBs • across enterprises
RDB OODB
XML has become the prime standard for • Database Database Database data interchange on the Web Schema S1 Schema S2 Schema S3
QSX January 15-18, 2013 QSX January 15-18, 2013 Data integration Data integration (warehousing) (warehousing)
Q Warehouse Warehouse Schema T Schema T
V1 V2 V3 V1 V2 V3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3
QSX January 15-18, 2013 QSX January 15-18, 2013
Data integration Data integration (warehousing) (mediation)
Q Warehouse Mediator Schema T Schema T
V1 V2 V3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3
QSX January 15-18, 2013 QSX January 15-18, 2013 Data integration Data integration (mediation) (mediation)
Q Q Mediator Mediator Schema T Schema T
Q1 Q2 Q3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3
QSX January 15-18, 2013 QSX January 15-18, 2013
Data integration Data integration (mediation) (mediation)
Q Q Mediator Mediator Schema T Schema T
Q1 Q2 Q3 Q1 Q2 Q3 Database Database Database Database Database Database Schema S1 Schema S2 Schema S3 Schema S1 Schema S2 Schema S3
QSX January 15-18, 2013 QSX January 15-18, 2013 What you need for this course • Prerequisites: • Language Semantics & Implementation (PL) • or Computers & Intractability (Algorithms/complexity) Course overview • or permission • Database Systems or ADBS wouldn’t hurt • good if you have at least heard of SQL • Data Integration & Exchange, Extreme Computing or Natural Language Technologies may complement
QSX January 15-18, 2013 QSX January 15-18, 2013
Evaluation Reviews
35%: Reviews (7 assignments) • Hand in reviews on Monday at 4pm before of the • week in which the papers are to be discussed. Due each Monday at 4pm until week 8 • • Online handin, details TBA soon • Week 2-3: read/review all 3 tutorials • Each review should be about half a page • Week 4-8: read/review 2 out of 3 papers • summary, key ideas, questions you have, flaws you have found, and suggestions for improvement. • 50%: Course project report • Marked on “+/✓/-” scale. due Monday, March 25, 4pm • • + = 2 pts (outstanding) • 15%: Project presentation • ✓ = 1 pt (pass) • during week 10-11 (after report due!) • - = 0 pts (incomplete)
QSX January 15-18, 2013 QSX January 15-18, 2013 Projects Suggested timetable
• The project is the main assessed component of the Week 3: select project, form groups course. • • project ideas: http://www.inf.ed.ac.uk/teaching/courses/qsx/ • Projects can take two forms: project.html • Design and development projects • contact me or use [email protected] list • (2-3 students) implement or improve upon an algorithm or system. • Week 5: literature review/identify related work • Survey/benchmarking projects • Week 7: Draft report/preliminary results • (1 student) surveying 5-10 research papers on a particular topic, or using several systems for the same task and Week 10-11: Final report & presentation comparing them • • Project report: 15-30 pages (typically longer for group • It is up to you to ensure that your project projects) stays on track!
QSX January 15-18, 2013 QSX January 15-18, 2013
Presentations Introductions
• Each group member must participate in • Who you are presentation • What you want to get out of the course • ~5 min per group member (e.g. 3- person group gets 15 minutes) ...Start thinking about project groups Summarize background • • early Present research question • • Use mailing list [email protected] • Summarize progress and next steps
QSX January 15-18, 2013 QSX January 15-18, 2013 What you should get Syllabus/topics from this course • Skills/knowledge in demand in industry ($$) • Weeks 1-3: Foundations • NOT: Specific language/tool/system/protocol/API • Weeks 4-5: Storage & publishing • you should be able to pick it up on your own. • Weeks 6-8: Additional topics (updates, • Research / team / project experience static analysis, provenance) • Background useful for working with other Week 9: Break XML or Web standards/tools • Week 10-11: Project presentations • Chance to put diverse CS concepts into • practice
QSX January 15-18, 2013 QSX January 15-18, 2013
Next time Foundations of XML
• XML background • XML itself • Query and transformation languages • XPath: a path-based language for navigating / selecting parts of XML trees • XQuery: a “SQL for XML” query language • XSLT: a pattern-matching, recursive transformation language • Schemas (DTDs, XML Schema) • validation, automata • Constraints (keys)
QSX January 15-18, 2013 QSX January 15-18, 2013 XML and databases XML updates
• XML “shredding”: Storing/querying XML in • Beyond DOM: Updating XML stored as relational databases relations • Shredding XML trees into relations (with or without schema) • XQuery Update Facility • Translating XML queries/updates to SQL over extending XQuery to support updates shredded representation • Updating XML views of relations • XML “publishing”: Providing XML views of • legacy relational data • how to translate updates to XML views to SQL • Translating XML queries over views to SQL • Schema-directed publishing
QSX January 15-18, 2013 QSX January 15-18, 2013
Typechecking and Provenance for semi- static analysis structured data • Typed programming with XML • Provenance: Understanding where data came from, how it has been produced • regular expression types and inference • Predicting structure of result of queries • Increasingly important for scientific data • application: finding “path errors” in queries • Why and where provenance • Using this to improve performance • Annotation and XML query languages • e.g. detecting when a query and update do (or • Provenance for curated (evolving) data don’t) “overlap” • can save time recomputing views QSX January 15-18, 2013 QSX January 15-18, 2013 Reading
• Web Data Management (Abiteboul et al. 2011) • ch. 1-5 provide excellent overview of XML, Schemas, XPath, XQuery and shredding • also good coverage of RDF/OWL/SPARQL and cloud computing/ MapReduce • webdam.inria.fr/Jorge/ XML background • Introduction to XML and Web Technologies (Moller, Schwartzach 2006) • now a bit dated but good coverage of XML and Java-based Web programming • slides, ch. 4 online • Research papers - listed week-by-week on course web page
QSX January 15-18, 2013 QSX January 15-18, 2013
Some history HTML: limitations
• SGML - (Charles Goldfarb, ISO 8879, 1986) • HTML was intended as a declarative markup language • widely used for document management • emphasizing structure over presentation • but complex & hard to implement • But with success of Web, intense pressure for • HTML - (Tim Berners-Lee, 1991) more presentation features • most successful application of SGML • CSS helps separate content from structure, a little XML - (W3C, 1998) • nevertheless, while great for human consumption, • HTML is not suitable for representing general data • simplify SGML, for Web data/content • fixed set of tags • still pretty complicated, though • describe display format, not structure of data
QSX January 15-18, 2013 QSX January 15-18, 2013 Good things about Scraping data from XML HTML • Tags can be defined for specific applications other than
Some data
HTML • The structure of the data can be defined moreA | B | • DTDs, XML Schemas • Structures can be arbitrarily nested
---|---|
1 | 2 | • even including recursion
3 | 4 • XML standard does not define how data should be displayed |
QSX January 15-18, 2013 QSX January 15-18, 2013
Scraping data from Data as XML HTML Optional
Some data
closing tags - hardA | B |
---|
1 | 2 |
3 | 4 |
QSX January 15-18, 2013 QSX January 15-18, 2013 Data as XML Data as XML Header Attribute
QSX January 15-18, 2013 QSX January 15-18, 2013
Data as XML Data as XML
QSX January 15-18, 2013 QSX January 15-18, 2013 Data as XML XML Data as trees
title root have NO pre-
QSX January 15-18, 2013 QSX January 15-18, 2013
XML Data as trees Basics
An XML document consists of title root Child order matters! • • elements - ... "Some data" row row • tags come in pairs • tags must be properly nested (can't skip closing tag!) A B A B • attributes - 1 2 3 4 • key-value pairs associated with elements • text values - foo123 • unquoted text inside elements
QSX January 15-18, 2013 QSX January 15-18, 2013 Elements Nested structure
• Element: the segment between an start and its corresponding • nested tags can be used to express various structures, e.g., “records”: end tag
QSX January 15-18, 2013 QSX January 15-18, 2013
Ordering Attributes
• XML elements are ordered! • A start tag may contain attributes describing certain “properties” of the element (e.g., dimension or type) How to represent sets in XML? • !
QSX January 15-18, 2013 QSX January 15-18, 2013 Attribute structure Extras
• XML attributes cannot be nested -- flat • entity references: & " > • the names of XML attributes of an element must be unique. textual substitution; allows escaping special characters • one can’t write
QSX January 15-18, 2013 QSX January 15-18, 2013
Quiz Quiz
• Groups of 2-3: Find as many errors in this XML document as you can • Groups of 2-3: Find as many errors in this XML document as you can
QSX January 15-18, 2013 QSX January 15-18, 2013 Processing XML SAX: Basic idea
• Most programming languages have XML • SAX: Streaming API for XML (de facto libraries standard) • parsers (SAX, DOM) • reads XML document incrementally • in-memory manipulation (DOM) • generates calls to event handlers • validation (schemas) • you write code that handles these events • Thus, usually don’t need to worry about all the fiddly details of character sets • XML supports UNICODE, many other encodings
QSX January 15-18, 2013 QSX January 15-18, 2013
SAX: Example SAX: Example
beginElt(“root”,[title=”foo”])
title root title root
"foo" row row "foo" row row
A B A B A B A B
1 2 3 4 1 2 3 4
QSX January 15-18, 2013 QSX January 15-18, 2013 SAX: Example SAX: Example
beginElt(“root”,[title=”foo”]) beginElt(“root”,[title=”foo”]) beginElt(“row”) beginElt(“row”) beginElt(“A”) title root title root
"foo" row row "foo" row row
A B A B A B A B
1 2 3 4 1 2 3 4
QSX January 15-18, 2013 QSX January 15-18, 2013
SAX: Example SAX: Example
beginElt(“root”,[title=”foo”]) beginElt(“root”,[title=”foo”]) beginElt(“row”) beginElt(“row”) root beginElt(“A”) root beginElt(“A”) title text(“1”) title text(“1”) endElt(“A”) "foo" row row "foo" row row
A B A B A B A B
1 2 3 4 1 2 3 4
QSX January 15-18, 2013 QSX January 15-18, 2013 SAX: Example SAX: Example
beginElt(“root”,[title=”foo”]) beginElt(“root”,[title=”foo”]) beginElt(“row”) beginElt(“row”) root beginElt(“A”) root beginElt(“A”) title text(“1”) title text(“1”) endElt(“A”) endElt(“A”) beginElt(“B”) beginElt(“B”) "foo" row row text(“2”) "foo" row row text(“2”) endElt(“B”) endElt(“B”) endElt(“row”) endElt(“row”) beginElt(“row”) A B A B A B A B beginElt(“A”) text(“1”) endElt(“A”) beginElt(“B”) 1 2 3 4 1 2 3 4 text(“2”) endElt(“B”) endElt(“row”) endElt(“root”)
QSX January 15-18, 2013 QSX January 15-18, 2013
SAX: Advantages SAX: Disadvantages
• Very widely supported • Non-starter if random access needed • Can be very efficient • Not suitable for transformations to • for operations that are streaming-friendly persistent or large data • only realistic option for documents too large for e.g. in-browser or database updates memory • Can be tricky to program • Can easily ignore parts of the document • • comments, etc. • due to need to handle atomic events • these are still processed, however (SAX reads in • need to figure out how to incrementalize whole XML file whether or not it is all needed) processing & maintain state
QSX January 15-18, 2013 QSX January 15-18, 2013 DOM: Basic idea DOM: Example
• DOM: Document Object Model (W3C) root = document.body title root • reads XML document all at once c1 = root.getFirstChild() • allocates tree structure in memory "foo" row row c2 = root.getLastChild() • provides standard methods for traversing, modifying doc A B A B • widely used in JavaScript to dynamically update HTML page 1 2 3 4 • parses/loads document into memory
QSX January 15-18, 2013 QSX January 15-18, 2013
DOM: Example DOM: Example
root = document.body root = document.body title root title root c1 = root.getFirstChild() c1 = root.getFirstChild() "foo" row "foo" row c2 = root.getLastChild() c2 = root.getLastChild() row A B A B root.deleteChild(c2) root.deleteChild(c2) A B 1 2 1 2 c1.appendChild(c2) 3 4
QSX January 15-18, 2013 QSX January 15-18, 2013 DOM: Advantages DOM: Disadvantages
• Much easier to program • Memory footprint can be several times that of XML text • Offers random access & dynamic updates • which is already bloated! • Best if scalability to large data not a concern • Thus, cannot be used for data > size of memory (gigabytes) Library support for path queries can be • At a programming level, side-effecting very convenient (e.g. JQuery) • updates can be tricky to get right though naive implementations of queries can • • but no realistic alternatives yet to JavaScript for be very slow browser interactivity
QSX January 15-18, 2013 QSX January 15-18, 2013
Next week
• XPath: Navigating through XML trees • XQuery: SQL-like queries for constructing new XML trees from existing data
QSX January 15-18, 2013