
XML Background
• eXtensible Markup Language
• Roots are HTML and SGML
  – HTML mixes formatting and semantics
  – SGML is cumbersome
• XML is focused on content
  – Designers (or others) can create their own sets of tags.
  – These tag definitions can be exchanged and shared among various groups (DTDs, XSchema).
  – XSL is a companion language to specify presentation.
• <Opinion> XML is ugly </Opinion>
  – Intended to be generated and consumed by applications, not people!

    <Course>
      <Title> CS 186 </Title>
      <Semester> Fall 2002 </Semester>
      <Lecture Number="12">
        <Topic> XML </Topic>
        <Topic> Databases </Topic>
      </Lecture>
    </Course>

“The reason that so many people are excited about XML is that so many people are excited about XML.” ANON

From HTML to XML
HTML:

    <h1> Bibliography </h1>
    <p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
        <br> Addison Wesley, 1995
    <p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
        <br> Morgan Kaufmann, 1999

HTML describes the presentation

Example in XML

    <bibliography>
      <book>
        <title> Foundations… </title>
        <author> Abiteboul </author>
        <author> Hull </author>
        <author> Vianu </author>
        <publisher> Addison Wesley </publisher>
        <year> 1995 </year>
      </book>
      …
    </bibliography>

XML describes the content

XML as a Wire Format
• People quickly figured out that XML is a convenient way to exchange data among applications.
  – E.g. Ford’s purchasing app generates a purchase order in XML format, e-mails it to a billing app at Firestone.
  – Firestone’s billing app ingests the email, generates a bill in XML format, and e-mails it to Ford’s bank.
• Emerging standards to get the “e-mail” out of the picture: SOAP, WSDL, UDDI…
• The basis of “Web Services”: potential impact is tremendous.
• Why is it catching on? It’s just text, so…
  – Platform, Language, Vendor agnostic
  – Easy to understand, manipulate and extend.
  – Compare this to data trapped in an RDBMS.

What’s this got to do with Databases?
• Given that apps will communicate by exchanging XML data, databases must at least be able to:
  – Ingest XML formatted data
  – Publish their own data in XML format
• Thinking a bit harder:
  – XML is kind of a data model.
  – Why convert to/from relational if everyone wants XML?
• More cosmically:
  – Like evolution from spoken language to written language!
• The (multi-) Billion Dollar Question:
  – Will people really want to store XML data directly?
  – Current opinion: ORACLE, IBM, INFORMIX say no, other DB vendors say Yes, or at least, “Maybe”

XML – Basic Structure

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE article SYSTEM
      "http://xml.cXML.org/schemas/cXML/1.1.010/cXML.dtd">
    <article key="Codd70">
      <author>E.F.Codd</author>
      <title>A Relational Model of Data for Large Shared Data Banks.</title>
      <pages>377-387</pages>
      <year>1970</year>
      <volume>13</volume>
      <journal>CACM</journal>
      <number>6</number>
      <url>db/journals/cacm/cacm13.html#Codd70</url>
      <ee>db/journals/cacm/Codd70.html</ee>
      <cdrom>CACMs1/CACM13/P377.pdf</cdrom>
    </article>

• Preamble has XML declaration, root element, ref to “DTD”
• Elements have start and end tags
• Well Formed: has root, proper nesting, …
• Valid: Conforms to DTD
• Note that order matters (i.e. no sets, only lists)

Another (partial) Example

    <Invoice>
      <Buyer>
        <Name> ABC Corp. </Name>
        <Address> 123 ABC Way </Address>
      </Buyer>
      <Seller>
        <Name> Goods Inc. </Name>
        <Address> 17 Main St. </Address>
      </Seller>
      <ItemList>
        <Item> widget </Item>
        <Item> thingy </Item>
        <Item> jobber </Item>
      </ItemList>
    </Invoice>

Can View XML Document as a Tree
Invoice as a tree:

    Invoice
      Buyer
        Name: ABC Corp.
        Address: 123 ABC Way
      Seller
        Name: Goods Inc.
        Address: 17 Main St.
      ItemList
        Item: widget
        Item: thingy
        Item: jobber

Question: What Normal Form is this in?

Mapping to Relational
• Relational systems handle highly structured data
• New splinters from XML
• Very expensive to store variable document types
• Difficult to search trees that are broken into tables

Mapping to Relational I
• Question: What is a relational schema for storing XML data?
• Answer: depends on how “structured” it is…
• If unstructured, use an “Edge Map”

    Parent   Label       ID
    NULL     article     0
    0        author      1
    1        E.F. Codd   NULL
    0        pages       2
    2        377-387     NULL
    …        …           …

(Figure: the article tree, root node 0 with children 1, 2, 3, 4, 5, … labeled author, pages, year, journal, number, …)

Mapping to Relational II
• Can leverage Schema (or DTD) information to create relational schema.
• Sometimes called “shredding”
• For semi-structured data use hybrid with edge map for overflow.

(Figure: article mapped to a STORED table (author, year, journal, …) plus overflow buckets.)

    author      pages     year   journal   number   …
    E.F. Codd   377-387   1970   CACM      6

    author      pages     year   journal   cdrom    …
    E.F. Codd   377-387   1970   CACM      P377.pdf

Other XML features
• Elements can have “attributes” (not clear why).

    <Price currency="USD">1.50</Price>

• XML docs can have IDs and IDREFs, URIs
  – reference to another document or document element
• Two APIs for interacting with/parsing XML Docs:
  – Document Object Model (DOM)
    • A tree “object” API for traversing an XML doc
    • Typically for Java
  – SAX
    • Event-Driven: Fire an event for each tag encountered during parse.
    • May not need to parse the entire document.

Document Type Definitions (DTDs)
• Grammar for describing the allowed structure of XML Documents.
• Specify what elements can appear and in what order, nesting, etc.
• DTDs are optional (!)
• Many “standard” DTDs have been developed for all sorts of industries, groups, etc.
  – e.g. NITF for news article dissemination.

DTD Example (partial)
Here’s a DTD for a Contract:

    <?xml version="1.0" encoding="UTF-8"?>
    <!ENTITY % datetime.tz "CDATA">
    <!ENTITY % string "CDATA">
    <!ENTITY % nmtoken "CDATA">
    <!ENTITY % xmlLangCode "%nmtoken;">
    <!ELEMENT SupplierID (#PCDATA)>
    <!ATTLIST SupplierID
        domain %string; #REQUIRED
    >
    <!ELEMENT Comments (#PCDATA)>
    <!ELEMENT ItemSegment (ContractItem+)>
    <!ATTLIST ItemSegment
        segmentKey %string; #IMPLIED
    >
    <!ELEMENT Contract (SupplierID+, Comments?, ItemSegment+)>
    <!ATTLIST Contract
        effectiveDate %datetime.tz; #REQUIRED
        expirationDate %datetime.tz; #REQUIRED
    >

Elements contain others:
    ? = 0 or 1
    * = 0 or more
    + = 1 or more

Beyond DTDs - XML Schemas, etc.
• XML Schema is a proposal to replace/augment DTDs
  – Has a notion of types and typechecking
  – May introduce some notions of IC’s
  – Quite complicated, controversial ... not really adopted yet
• XML Namespaces
  – Can import tag names from others
  – Disambiguate by prefixing the namespace name
    • E.g. berkeley-eecs:gpa is different from uphoenix:gpa

Querying XML
• XPath
  – A single-document language for “path expressions”
• XSLT
  – XPath plus a language for formatting output
• XQuery
  – An SQL-like proposal with XPath as a sub-language
  – Supports aggregates, duplicates, …
  – Data model is lists, not sets
  – “reference implementations” have appeared, but language is still not widely accepted.
• SQL/XML
  – the SQL standards community fights back

XPath
• Syntax for tree navigation and node selection
  – Navigation is defined by “paths”
  – Used by other standards: XSLT, XQuery, XPointer, XLink
• / = root node or separator between steps in path
• * matches any one element name
• @ references attributes of the current node
• // references any descendant of the current node
• [] allows specification of a filter (predicate) at a step
• [n] picks the nth occurrence from a list of elements.
• The fun part: Filters can themselves contain paths

XPath Examples
• Parent/Child (‘/’) and Ancestor/Descendant (‘//’):

    /catalog/product//msrp

• Wildcards (match any single element):

    /catalog/*/msrp

• Element Node Filters to further refine the nodes:
  – Filters can contain nested path expressions

    //product[price/msrp < 300]/name
    //product[price/msrp < /dept/@budget]/name

  – Note, this last one is a kind of “join”

XQuery

    <result>
      FOR $x in /bib/book
      WHERE $x/year > 1995
      RETURN <newtitle>
               $x/title
             </newtitle>
    </result>

XQuery
Main Construct (replaces SELECT-FROM-WHERE):
• FLWR Expression: FOR-LET-WHERE-RETURN

    FOR/LET clauses   ->  ordered list of tuples
    WHERE clause      ->  filtered list of tuples
    RETURN clause     ->  XML data: instance of XQuery data model

XQuery
• FOR $x in expr: binds $x to each value in the list expr
• LET $x := expr: binds $x to the entire list expr
  – Useful for common subexpressions and for aggregations

XQuery

    <big_publishers>
      FOR $p IN distinct(document("bib.xml")//publisher)
      LET $b := document("bib.xml")/book[publisher = $p]
      WHERE count($b) > 100
      RETURN $p
    </big_publishers>

distinct = a function that eliminates duplicates
count = a (aggregate) function that returns the number of elements

Advantages of XML vs. Relational
• ASCII makes things easy
  – Easy to parse
  – Easy to ship (e.g. across firewall, via email, etc.)
• Self-documenting
  – Metadata (tag names) come with the data
• Nested
  – Can bundle lots of related data into one message
  – (Note: object-relational allows this)
• Can be sloppy
  – don’t have to define a schema in advance
• Standard
  – Lots of free Java tools for parsing and munging XML
    • Expect lots of Microsoft tools (C#) for same
• Tremendous Momentum!

What XML does not solve
• XML doesn’t standardize metadata
  – It only standardizes the metadata language
    • Not that much better than agreeing on an alphabet
  – E.g. my <price> tag vs. your <price> tag
    • Mine includes shipping and federal tax, and is in $US
    • Yours is manufacturer’s list price in ¥Japan
  – XML Schema is a proposal to help with some of this
• XML doesn’t help with

Reminder: Benefits of Relational
• Data independence buys you:
  – Evolution of storage (vs. XML?)
  – Evolution of schema via views (vs. XML?)
• Database design theory
  – IC’s, dependency theory, lots of nice tools for ER
• Remember, databases are long-lived and reused
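Going back to the “Mapping to Relational I” slide, the edge map can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the course’s implementation: the function name shred_to_edge_map is hypothetical, the sample document is a cut-down version of the Codd70 article, element ids are assigned in document order, and attributes such as key are ignored.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the Codd70 article from the "Basic Structure" slide.
ARTICLE_XML = """
<article key="Codd70">
  <author>E.F. Codd</author>
  <pages>377-387</pages>
  <year>1970</year>
</article>
"""


def shred_to_edge_map(xml_text):
    """Return edge-map rows (parent, label, id) for an XML document.

    Each element gets the next integer id in document order. Text inside
    an element becomes a leaf row whose label is the text and whose id is
    None, matching the NULL entries in the slide's table.
    Attributes are ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    rows = []
    next_id = 0

    def visit(elem, parent_id):
        nonlocal next_id
        my_id = next_id
        next_id += 1
        rows.append((parent_id, elem.tag, my_id))
        text = (elem.text or "").strip()
        if text:
            rows.append((my_id, text, None))
        for child in elem:
            visit(child, my_id)

    visit(root, None)
    return rows


edge_rows = shred_to_edge_map(ARTICLE_XML)
for row in edge_rows:
    print(row)
```

The first five rows reproduce the table on the slide: (None, 'article', 0), (0, 'author', 1), (1, 'E.F. Codd', None), (0, 'pages', 2), (2, '377-387', None). Every document fits this one table regardless of its tags, which is why the slide suggests it for unstructured data; the price is that reassembling a single article takes one self-join per level of nesting.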

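The path expressions on the “XPath Examples” slide can be tried out with Python’s standard xml.etree.ElementTree module, which implements only a subset of XPath. The sketch below makes several assumptions: the catalog document and its values are made up, paths are written relative to the root element (ElementTree does not accept absolute paths on an element), and the numeric filter msrp < 300 is applied in Python because that subset has no comparison operators.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element names follow the slide's paths,
# the values are made up for illustration.
CATALOG_XML = """
<catalog>
  <product>
    <name>widget</name>
    <price><msrp>250</msrp></price>
  </product>
  <product>
    <name>thingy</name>
    <price><msrp>400</msrp></price>
  </product>
  <service>
    <msrp>75</msrp>
  </service>
</catalog>
"""

catalog = ET.fromstring(CATALOG_XML)  # the <catalog> root element

# /catalog/product//msrp : msrp at any depth below a product child
product_msrps = [m.text for m in catalog.findall("./product//msrp")]

# /catalog/*/msrp : msrp directly under any child of the root
wildcard_msrps = [m.text for m in catalog.findall("./*/msrp")]

# [n] : the second product in document order (positions are 1-based)
second_name = catalog.findtext("./product[2]/name")

# //product[price/msrp < 300]/name : the nested path price/msrp is
# evaluated per product, and the comparison is done in Python
cheap_names = [
    p.findtext("name")
    for p in catalog.iter("product")
    if float(p.findtext("price/msrp")) < 300
]

print(product_msrps, wildcard_msrps, second_name, cheap_names)
```

With this sample data the wildcard path returns only the service price, because the product prices sit one level deeper under price; the descendant path returns both product prices.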
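The FOR-LET-WHERE-RETURN pipeline of the big_publishers query can be imitated step by step in Python. This is only an illustration of the semantics described on the XQuery slides, not an XQuery engine: the function big_publishers and its threshold parameter are hypothetical, the third book title is made up, and the threshold is lowered from 100 so that a three-book sample produces a result.

```python
import xml.etree.ElementTree as ET

# Small stand-in for bib.xml; "Another Book" is a made-up entry.
BIB_XML = """
<bib>
  <book><title>Foundations of Databases</title><publisher>Addison Wesley</publisher></book>
  <book><title>Data on the Web</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>Another Book</title><publisher>Morgan Kaufmann</publisher></book>
</bib>
"""


def big_publishers(bib_root, threshold):
    """Return publishers with more than `threshold` books."""
    # FOR $p IN distinct(//publisher): one iteration per distinct name,
    # keeping first-seen order (the data model is lists, not sets)
    names = list(dict.fromkeys(p.text for p in bib_root.iter("publisher")))
    result = []
    for p in names:
        # LET $b := /book[publisher = $p]: bind the entire list at once
        b = [book for book in bib_root.findall("book")
             if book.findtext("publisher") == p]
        # WHERE count($b) > threshold: filter the tuples
        if len(b) > threshold:
            # RETURN $p
            result.append(p)
    return result


bib = ET.fromstring(BIB_XML)
publishers = big_publishers(bib, 1)
print(publishers)
```

FOR drives the loop over individual values while LET binds a whole list, which is what makes the aggregate count($b) possible inside a single iteration.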