
XML
CS 186 Fall 2002

“The reason that so many people are excited about XML is that so many people are excited about XML.”
ANON

Background
• eXtensible Markup Language
• Roots are HTML and SGML
  – HTML mixes formatting and content
  – SGML is cumbersome
• XML is focused on content
  – Designers (or others) can create their own sets of tags.
  – These tag definitions can be exchanged and shared among various groups (DTDs, XSchema).
  – XSL is a companion language to specify presentation.
• XML is ugly
  – Intended to be generated and consumed by applications --- not people!

From HTML to XML

HTML

Bibliography

Foundations of Databases Abiteboul, Hull, Vianu
Addison Wesley, 1995

Data on the Web Abiteboul, Buneman, Suciu
Morgan Kaufmann, 1999

HTML describes the presentation

Example in XML
  Foundations…
  Abiteboul
  Hull
  Vianu
  Addison Wesley
  1995
XML describes the content

XML as a Wire Format
• People quickly figured out that XML is a convenient way to exchange data among applications.
  – E.g. Ford’s purchasing app generates a purchase order in XML format, e-mails it to a billing app at Firestone.
  – Firestone’s billing app ingests the email, generates a bill in XML format, and e-mails it to Ford’s bank.
• Emerging standards to get the “e-” out of the picture: SOAP, WSDL, UDDI…
• The basis of “Web Services” --- potential impact is tremendous.
• Why is it catching on? … It’s just text, so…
  – Platform, Language, Vendor agnostic
  – Easy to understand, manipulate and extend.
  – Compare this to data trapped in an RDBMS.
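The purchase-order exchange described above can be sketched in a few lines of Python. This is only an illustration using the standard library's ElementTree module; the tag names (PurchaseOrder, Buyer, Seller, ItemList, Item, Name, Quantity) and the sample values are assumptions made for the example, not taken from any real Ford or Firestone format.

```python
import xml.etree.ElementTree as ET

def build_purchase_order(buyer, seller, items):
    """Serialize a purchase order to an XML string (hypothetical tag names)."""
    order = ET.Element("PurchaseOrder")
    ET.SubElement(order, "Buyer").text = buyer
    ET.SubElement(order, "Seller").text = seller
    item_list = ET.SubElement(order, "ItemList")
    for name, quantity in items:
        item = ET.SubElement(item_list, "Item")
        ET.SubElement(item, "Name").text = name
        ET.SubElement(item, "Quantity").text = str(quantity)
    return ET.tostring(order, encoding="unicode")

def read_purchase_order(xml_text):
    """Parse the XML text back into plain Python values on the receiving side."""
    order = ET.fromstring(xml_text)
    items = [(item.findtext("Name"), int(item.findtext("Quantity")))
             for item in order.iter("Item")]
    return order.findtext("Buyer"), order.findtext("Seller"), items

# The sender produces text; the receiver needs only that text to recover the data.
message = build_purchase_order("Ford", "Firestone", [("tire", 400)])
buyer, seller, items = read_purchase_order(message)
```

Because the message is plain text, the two functions could run on different platforms and in different languages, which is the "agnostic" point made above.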

What’s this got to do with Databases?
• Given that apps will communicate by exchanging XML, databases need to:
  – Ingest XML formatted data
  – Publish their own data in XML format
• Thinking a bit harder:
  – XML is kind of a data model.
  – Why convert to/from relational if everyone wants XML?
• More cosmically:
  – Like evolution from spoken language to written language!
• The (multi-) Billion Dollar Question:
  – Will people really want to store XML data directly?
  – Current opinion: ORACLE, IBM, INFORMIX say no, other DB vendors say Yes, or at least, “Maybe”

XML – Basic Structure
[Example document: an article record for E.F. Codd, “A Relational Model of Data for Large Shared Data Banks”, CACM 13(6), pages 377-387, 1970; db/journals/cacm/cacm13.#Codd70, db/journals/cacm/Codd70.html, CACMs1/CACM13/P377.pdf]
• Preamble has XML declaration, root element, ref to “DTD”
• Elements have start and end tags
• Well Formed: has root, proper nesting, …
• Valid: conforms to DTD
• Note that order matters (i.e. no sets, only lists)
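A small Python sketch of the "well formed" and "order matters" points above. It assumes the element names article, author and year purely for illustration. Note that the standard-library parser used here checks well-formedness only; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as a single, properly nested root element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<article><author>E.F. Codd</author><year>1970</year></article>"
bad_nesting = "<article><author>E.F. Codd</article></author>"  # tags overlap
two_roots = "<author>E.F. Codd</author><year>1970</year>"      # no single root

# Order matters: children come back as a list, in document order.
child_tags = [child.tag for child in ET.fromstring(good)]
```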

Another (partial) Example
  ABC Corp.
  123 ABC Way
  Goods Inc.
  17 Main St.
  widget
  thingy
  jobber

Can View XML Document as a Tree
  Invoice
    Buyer
      Name: ABC Corp.
      Address: 123 ABC Way
    Seller
      Name: Goods Inc.
      Address: 17 Main St.
    Itemlist
      Item: widget
      Item: thingy
      Item: jobber
Question: What Normal Form is this in?
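To make the tree view concrete, here is a hedged Python sketch that parses an invoice and prints it as an indented outline. The element names come from the tree on the slide; the exact markup of the original invoice was not preserved, so the document string below is a reconstruction and should be read as an assumption.

```python
import xml.etree.ElementTree as ET

# Reconstructed markup (assumption): element names taken from the tree diagram.
INVOICE = """
<Invoice>
  <Buyer><Name>ABC Corp.</Name><Address>123 ABC Way</Address></Buyer>
  <Seller><Name>Goods Inc.</Name><Address>17 Main St.</Address></Seller>
  <Itemlist><Item>widget</Item><Item>thingy</Item><Item>jobber</Item></Itemlist>
</Invoice>
"""

def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (": " + text if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(INVOICE))
print("\n".join(tree_lines))
```

The repeated Item children under Itemlist are what the normal-form question is pointing at: a single Invoice carries a nested list rather than atomic values.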

Mapping to Relational
• Relational systems handle highly structured data

New splinters from XML
• Very expensive to store variable types
• Difficult to search trees that are broken into tables

Mapping to Relational I
• Question: What is a relational schema for storing XML data?
• Answer – Depends on how “Structured” it is…
• If unstructured – use an “Edge Map”

  [Figure: article (node 0) with children author, pages, year, journal, number, … (nodes 1, 2, 3, 4, 5, …)]

  Parent  Label      ID
  NULL    article    0
  0       author     1
  1       E.F. Codd  NULL
  0       pages      2
  2       377-387    NULL
  …       …          …

Mapping to Relational II
• Can leverage Schema (or DTD) information to create relational schema.
• Sometimes called “shredding”
• For semi-structured data use hybrid with edge map for overflow.

  STORED table (author, year, journal, …) with overflow buckets:

  author     pages    year  journal  number
  E.F. Codd  377-387  1970  CACM     6

  author     pages    year  journal  cdrom
  E.F. Codd  377-387  1970  CACM     P377.pdf
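The "Edge Map" idea can be sketched as follows. This is a minimal illustration, not a production shredder: it assumes that element nodes are numbered in document order and that text content is stored as a leaf row with a NULL (None) id, which is how the sample rows on the slide appear to be laid out. The function name shred_to_edges is made up for this example.

```python
import xml.etree.ElementTree as ET

def shred_to_edges(root):
    """Flatten an element tree into (parent_id, label, node_id) rows.

    Elements get consecutive ids in document order; text content becomes a
    leaf row whose label is the text and whose id is None (NULL).
    """
    rows = []
    next_id = 0

    def visit(element, parent_id):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        rows.append((parent_id, element.tag, node_id))
        text = (element.text or "").strip()
        if text:
            rows.append((node_id, text, None))
        for child in element:
            visit(child, node_id)

    visit(root, None)
    return rows

article = ET.fromstring(
    "<article><author>E.F. Codd</author><pages>377-387</pages></article>"
)
edges = shred_to_edges(article)
```

Every document fits this one three-column table, which is why it suits unstructured data, but reassembling an article means joining the table with itself once per level of nesting.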

Other XML features
• Elements can have “attributes” (not clear why).
  [example element, content: 1.50]
• XML docs can have IDs and IDREFs, URIs
  – reference to another document or document element
• Two APIs for interacting with XML Docs:
  – Document Object Model (DOM)
    • A tree “object” API for traversing an XML doc
    • Typically for …
  – SAX
    • Event-Driven: Fire an event for each tag encountered during parse.
    • May not need to parse the entire document.

Document Type Definitions (DTDs)
• Grammar for describing the allowed structure of XML Documents.
• Specify what elements can appear and in what order, nesting, etc.
• DTDs are optional (!)
• Many “standard” DTDs have been developed for all sorts of industries, groups, etc.
  – e.g. NITF for news article dissemination.
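The difference between the two APIs listed under "Other XML features" can be seen in a short Python sketch using the standard library's minidom (DOM) and xml.sax (SAX) modules. The Itemlist document and the StopParsing exception are assumptions introduced for the example; raising an exception from a handler is just one simple way to show that a SAX consumer can stop before the end of the document.

```python
import xml.sax
from xml.dom import minidom

DOC = "<Itemlist><Item>widget</Item><Item>thingy</Item><Item>jobber</Item></Itemlist>"

# DOM: build the whole tree in memory, then traverse it.
dom = minidom.parseString(DOC)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName("Item")]

class StopParsing(Exception):
    """Raised by the handler to abandon the parse early (hypothetical helper)."""

class FirstItemHandler(xml.sax.ContentHandler):
    """SAX: react to events as tags stream past; stop after the first Item."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.first_item = ""
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("start " + name)
        self.in_item = (name == "Item")

    def characters(self, content):
        if self.in_item:
            self.first_item += content

    def endElement(self, name):
        self.events.append("end " + name)
        if name == "Item":
            raise StopParsing()

handler = FirstItemHandler()
try:
    xml.sax.parseString(DOC.encode("utf-8"), handler)
except StopParsing:
    pass  # the rest of the document is never processed
```

The DOM version holds all three items at once; the SAX version sees only the events up to the first closing Item tag.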

DTD Example (partial)
Here’s a DTD for a Contract
  … segmentKey %string; #IMPLIED >
  ? = 0 or 1
  * = 0 or more

Beyond DTDs - XML Schemas, etc.
• XML Schema is a proposal to replace/augment DTDs
  – Has a notion of types and typechecking
  – Quite complicated, controversial ... not really adopted yet
• Namespaces
  – Can import tag names from others
  – Disambiguate by prefixing the name
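As a rough illustration of "disambiguate by prefixing the name", the Python sketch below parses a document in which two vocabularies both define a price tag. The namespace URIs, the prefixes a and b, and the values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies, each with its own "price" element.
DOC = """
<order xmlns:a="http://example.com/seller-a" xmlns:b="http://example.com/seller-b">
  <a:price>10</a:price>
  <b:price>1200</b:price>
</order>
"""

root = ET.fromstring(DOC)

# After parsing, each tag carries its full namespace URI, so the two
# price elements have different names even though the local part matches.
tags = [child.tag for child in root]

# Lookups use a prefix-to-URI mapping supplied by the caller.
ns = {"a": "http://example.com/seller-a", "b": "http://example.com/seller-b"}
price_a = root.findtext("a:price", namespaces=ns)
price_b = root.findtext("b:price", namespaces=ns)
```

Namespaces keep the names apart; they say nothing about what each price means or which currency it is in.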

Querying XML
• XPath
  – A single-document language for “path expressions”
  – Used by other standards: XSLT, XQuery, XPointer, XLink
• XSLT
  – XPath plus a language for formatting output
• XQuery
  – An SQL-like proposal with XPath as a sub-language
  – Supports aggregates, duplicates, …
  – Data model is lists, not sets
  – “reference implementations” have appeared, but language is still not widely accepted.
• SQL/XML
  – the SQL standards community fights back

XPath
• Syntax for tree navigation and selection
  – Navigation is defined by “paths”
• / = root node or separator between steps in path
• * matches any one element name
• @ references attributes of the current node
• // references any descendant of the current node
• [] allows specification of a filter (predicate) at a step
• [n] picks the nth occurrence from a list of elements.
• The fun part: Filters can themselves contain paths

XQuery
  FOR $x in /bib/book
  WHERE $x/year > 1995
  RETURN $x/title

XPath Examples
• Parent/Child (‘/’) and Ancestor/Descendant (‘//’):
    /catalog/product//msrp
• Wildcards (match any single element):
    /catalog/*/msrp
• Element Node Filters to further refine the nodes:
  – Filters can contain nested path expressions
    //product[price/msrp < 300]/name
    //product[price/msrp < /dept/@budget]/name
  – Note, this last one is a kind of “join”
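Some of the XPath examples above can be tried with Python's ElementTree, which implements only a subset of XPath. The catalog document below is made up for the example. Paths are written relative to the root element (`./…`), and the numeric filter is applied in Python because ElementTree does not support comparison predicates such as `< 300`.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; only the element names come from the slide's paths.
CATALOG = ET.fromstring("""
<catalog>
  <product><name>widget</name><price><msrp>250</msrp></price></product>
  <product><name>thingy</name><price><msrp>400</msrp></price></product>
  <service><name>repair</name><msrp>75</msrp></service>
</catalog>
""")

# /catalog/product//msrp : msrp at any depth below a product child of the root.
product_msrps = [e.text for e in CATALOG.findall("./product//msrp")]

# /catalog/*/msrp : msrp that is a direct child of any child of the root.
direct_msrps = [e.text for e in CATALOG.findall("./*/msrp")]

# [n] : the second product (positions are 1-based).
second_name = CATALOG.findtext("./product[2]/name")

# //product[price/msrp < 300]/name : select the products, then filter in Python.
cheap_names = [
    p.findtext("name")
    for p in CATALOG.iter("product")
    if float(p.findtext("price/msrp")) < 300
]
```

A full XPath engine would evaluate the last expression directly, including the join-style variant that compares against /dept/@budget.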

XQuery
Main Construct (replaces SELECT-FROM-WHERE):
• FLWR Expression: FOR-LET-WHERE-RETURN

  [Figure: XML data (instance of XQuery data model) → FOR/LET Clauses → ordered list of tuples → WHERE Clause → filtered list of tuples → RETURN Clause → XML data (instance of XQuery data model)]

XQuery
• FOR $x in expr -- binds $x to each value in the list expr
• LET $x := expr -- binds $x to the entire list expr
  – Useful for common subexpressions and for aggregations

XQuery
  FOR $p IN distinct(document("bib.xml")//publisher)
  LET $b := document("bib.xml")/book[publisher = $p]
  WHERE count($b) > 100
  RETURN $p

  distinct = a function that eliminates duplicates
  count = a (aggregate) function that returns the number of elms

Advantages of XML vs. Relational
• ASCII makes things easy
  – Easy to parse
  – Easy to ship (e.g. across firewall, via email, etc.)
• Self-documenting
  – Metadata (tag names) come with the data
• Nested
  – Can bundle lots of related data into one message
  – (Note: object-relational allows this)
• Can be sloppy
  – don’t have to define a schema in advance
• Standard
  – Lots of free Java tools for parsing and munging XML
    • Expect lots of tools (C#) for same
• Tremendous Momentum!
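The publisher query on the XQuery slide above can be emulated step by step in Python, which may help show what FOR, LET, WHERE and RETURN each contribute. This is a sketch, not an XQuery engine. It assumes a bib root element with book children, uses a tiny made-up document (the third book is a placeholder), and takes the threshold as a parameter instead of the slide's 100.

```python
import xml.etree.ElementTree as ET

def publishers_with_many_books(bib_root, threshold):
    """Emulate the FOR-LET-WHERE-RETURN query over a parsed bib document."""
    # FOR $p IN distinct(...//publisher): each publisher once, first-seen order.
    publishers = list(dict.fromkeys(p.text for p in bib_root.iter("publisher")))
    result = []
    for p in publishers:
        # LET $b := .../book[publisher = $p]: the whole list of matching books.
        books = [b for b in bib_root.findall("book") if b.findtext("publisher") == p]
        # WHERE count($b) > threshold
        if len(books) > threshold:
            # RETURN $p
            result.append(p)
    return result

# Hypothetical data for illustration only.
BIB = ET.fromstring("""
<bib>
  <book><title>Foundations of Databases</title><publisher>Addison Wesley</publisher></book>
  <book><title>Data on the Web</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>A second title</title><publisher>Addison Wesley</publisher></book>
</bib>
""")
frequent = publishers_with_many_books(BIB, threshold=1)
```

FOR iterates one publisher at a time, while LET binds the entire list of that publisher's books so that count can aggregate over it.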

What XML does not solve
• XML doesn’t standardize metadata
  – It only standardizes the metadata language
    • Not that much better than agreeing on an alphabet
  – E.g. my tag vs. your tag
    • Mine includes shipping and federal tax, and is in $US
    • Yours is manufacturer’s list price in ¥Japan
  – XML Schema is a proposal to help with some of this
• XML doesn’t help with data modeling
  – No notions of IC’s, FD’s, etc.
  – In fact, encourages non-first-normal form!
• You will probably have to translate to/from XML (at least in the short term)
  – Relational vendors will help with this ASAP
  – XML “features” (nesting, ordering, etc.) make this a pain
  – Flatten the XML if you want data independence (?)

Reminder: Benefits of Relational
• Data independence buys you:
  – Evolution of storage -- vs. XML?
  – Evolution of schema (via views) -- vs. XML?
• Database design theory
  – IC’s, dependency theory, lots of nice tools for ER
• Remember, databases are long-lived and reused
  – Today’s “nesting” might need to be inverted tomorrow!
• Issues:
  – XML is good for transient data (e.g. messages)
  – XML is fine for data that will not get reused in a different way (e.g. Shakespeare, database output like reports)
  – Relational is far cleaner for persistent data (we learned this with OODBs)
• Will benefits of XML outweigh these issues?????

More on XML

• 100s of books published
  – Each seems to be 1000 pages
• Try some websites
  – xml.org provides a business view of XML
  – xml.apache.org has lots of useful shareware for XML
  – www.ibm.com/developerworks/xml/ has shareware, tutorials, reference info
  – xml.com is the O’Reilly resource site
  – www.w3.org/XML/ is the official XML standard site
  – the most standardized XML dialects are:
    • Ariba’s Commerce XML (“cxml”, see cxml.org)
    • RosettaNet (see rosettanet.org)
    • Microsoft trying to enter this (BizTalk, now .NET)
