<<

Lecture 9 XML/Xpath/XQuery

Tuesday, May 26, 2009

1 XML Outline • XML – Syntax – Semistructured data – DTDs • Xpath • XQuery

2 Additional Readings on XML

• http://www.w3.org/XML/ – Main source on XML, but hard to read

• http://www.w3.org/TR/xquery/ – Authority on Xquery

• http://www.galaxquery.org/ – An easy to use, complete XQuery implementation

Note: XML/XQuery is NOT covered in the textbook 3 XML

• A flexible syntax for data • Used in: – Configuration files, e.g. Web.Config – Replacement for binary formats (MS Word) – Document markup: e.g. XHTML – Data: data exchange, semistructured data • Roots: SGML - a very nasty language

We will study only XML as data 4 XML as Semistructured Data

• Relational have rigid schema – Schema evolution is costly • XML is flexible: semistructured data – Store data in XML • Warning: not normal form ! Not even 1NF

5 From HTML to XML

HTML describes the presentation 6 HTML

Bibliography

Foundations of Databases Abiteboul, Hull, Vianu
Addison Wesley, 1995

Data on the Web Abiteoul, Buneman, Suciu
Morgan Kaufmann, 1999

7 XML Syntax Foundations… Abiteboul Hull Vianu Addison Wesley 1995 XML describes the content 8 XML Terminology

• tags: book, title, author, … • start tag: , end tag: • elements: , • elements are nested • empty element: abbrv. • an XML document: single root element

well formed XML document: if it has matching tags 9 More XML: Attributes Foundations of Databases Abiteboul 1995

10 Attributes v.s. Elements

Foundations of DBs Foundations of DBs Abiteboul Abiteboul … … 1995 1995 55 USD attributes are alternative ways to represent data

11 Comparison

Elements Attributes

Ordered Unordered

May be repeated Must be unique

May be nested Must be atomic

12 XML v.s. HTML

• What are the differences between XML and HTML ?

In class

13 That’s All !

• That’s all you ever need to know about XML syntax – Optional type information can be given in the DTD or XSchema (later) – We’ll discuss some additional syntax in the next few slides, but that’s not essential • What is important for you to know: XML’s semantics

14 More Syntax: Oids and References

Jane Are just keys/ foreign keys design by someone who didn’t take 444 Don’t use them: use your own Mary foreign keys instead.

oids and references in XML are just syntax15 More Syntax: CDATA Section

• Syntax:

• Example:

<>]]>

16 More Syntax: Entity References • Syntax: &entityname; • Example: this is less than < < < • Some entities: > > & & &apos ‘ ; " “ & char 17 More Syntax: Comments

• Syntax

• Yes, they are part of the data model !!!

18 XML Namespaces just a unique • name ::= [prefix:]localpart name

. . .

19 XML Semantics: a !

Element Attribute node data Mary person

person Maple 345 id Seattle name address name address
phone o555 street no city John Mary Thai John
Thailand 23456
Maple 345 Seattle Text 23456 node
Order matters !!! 20 XML as Data

• XML is self-describing • Schema elements become part of the data – Reational schema: persons(name,phone) – In XML , , are part of the data, and are repeated many times • Consequence: XML is much more flexible • XML = semistructured data

21 Mapping Relational Data to XML

The canonical mapping: XML: persons

row row row

Persons name phone name phone name phone “John” 3634 “Sue” 6343 “Dick” 6363 Name Phone John 3634 John 3634 Sue 6343 Sue Dick 6363 6343 Dick 6363

22 Mapping Relational Data to XML XML Natural mapping Persons John 3634 Name Phone 2002 Gizmo John 3634 2004 Sue 6343 Gadget Orders Sue PersonName Date Product 6343 John 2002 Gizmo 2004 Gadget John 2004 Gadget Sue 2002 Gadget 23 XML is Semi-structured Data

• Missing attributes: John 1234 Joe no phone ! • Could represent in a table with nulls name phone John 1234 Joe - 24 XML is Semi-structured Data

• Repeated attributes Mary 2345 3456 Two phones ! • Impossible in tables: name phone Mary 2345 3456 ???

25 XML is Semi-structured Data

• Attributes with different types in different objects John Smith 1234 Structured name ! • Nested collections (no 1NF) • Heterogeneous collections: – contains both s and s

26 Document Type Definitions DTD • part of the original XML specification • an XML document may have a DTD • XML document: Well-formed = if tags are correctly closed Valid = if it has a DTD and conforms to it • validation is useful in data exchange

27 DTD

Goals: • Define what tags and attributes are allowed • Define how they are nested • Define how they are ordered

Superseded by XML Schema

• Very complex: DTDs still used widely 28 Very Simple DTD

]>

29 Very Simple DTD Example of valid XML document: 123456789 John B432 1234 987654321 Jim B123 ... ... 30 DTD: The Content Model

content • Content model: model – Complex = a over other elements – Text-only = #PCDATA – Empty = EMPTY – Any = ANY – Mixed content = (#PCDATA | A | B | C)*

31 DTD: Regular Expressions DTD XML sequence . . . . . (firstName, lastName))> . . . . . optional . . . . . Kleene star . . . . . . . . . . . . . . . ...... alternation

32 SKIPPED MATERIAL: XSchema • Generalizes DTDs

• Uses XML syntax

• Two parts: structure and datatypes

• Very complex – criticized – alternative proposals: Relax NG

33 DTD v.s. XML Schemas DTD:

XML Schema: < xs:element name=“journal”/>

34 Example

A valid XML Document:

The Essence of XML Simeon Wadler 2003 POPL

35 Elements v.s. Types

type=“ttt”> type=“xs:string”/> type=“xs:string”/> type=“xs:string”/>

Both say the same thing; in DTD:

36 • Types: – Simple types (integers, strings, ...) – Complex types (regular expressions, like in DTDs)

• Element-type Alternation: – An element has a type – A type is a regular expression of elements

37 Local v.s. Global Types

• Local type: [define locally the person’s type] • Global type:

[define here the type ttt] Global types: can be reused in other elements 38 Local v.s. Global Elements

• Local element: ... • Global element:

...

Global elements: like in DTDs39 Regular Expressions

Recall the element-type-element alternation: [regular expression on elements] Regular expressions: • A B C = A B C • A B C = A | B | C • A B C = (A B C) • .. = (...)* • .. = (...)?

40 Local Names

name has . . . . . different meanings in person and in product . . . .

. . . . .

41 Subtle Use of Local Names

Arbitrary deep binary tree with A elements, and a single B element

Note: this example is not legal in XML Schema (why ?)

Hence they cannot express all regular tree languages 42 Attributes in XML Schema

......

Attributes are associated to the type, not to the element Only to complex types; more trouble if we want to add attributes to simple types. 43 “Mixed” Content, “Any” Type

. . . . • Better than in DTDs: can still enforce the type, but now may have text between any elements

. . . . • Means anything is permitted there

44 “All” Group

• A restricted form of & in SGML • Restrictions: – Only at top level – Has only elements – Each element occurs at most once • E.g. “comment” occurs 0 or 1 times

45 Derived Types by Extensions

Corresponds to inheritance 46 Derived Types by Restrictions

20003 15037 95977 95945 • Union types • Restriction types

END OF SKIPPED MATERIAL 51 Discussion 1

What kinds of applications might use XML ?

52 Discussion 1

What kinds of applications might use XML ? • Data exchange – Take the data, don’t worry about schema • Property lists – Many attributes, most are NULL • Evolving schema – Add quickly a new attribute 53 Discussion 2

How is XML processed ?

54 Discussion 2

How is XML processed ? • Via API – Called DOM – Navigate, update the XML arbitrarily – BUT: memory bound • Via some : – Xpath or Xquery – Stand-alone processor OR embedded in SQL 55 Querying XML Data Will discuss next:

• XPath = simple navigation on the tree

• XQuery = “the SQL of XML”

56 Sample Data for Queries Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of <a href="/tags/Database/" rel="tag">Database</a> and Knowledge Base Systems 1998 57 Data Model for XPath

The root

bib The root element

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul 58 XPath: Simple Expressions /bib/book/year

Result: 1995 1998

/bib/paper/year

Result: empty (there were no papers)

/bib What’s the difference ? / 59 XPath: Restricted Kleene Closure //author Result: Serge Abiteboul Rick Hull Victor Vianu Jeffrey D. Ullman

Result: Rick /bib//first-name

60 Xpath: Attribute Nodes

/bib/book/@price Result: “55”

@price means that price is has to be an attribute

61 Xpath: Wildcard

//author/*

Result: Rick Hull

* Matches any element @* Matches any attribute

62 Xpath: Text Nodes

/bib/book/author/text() Result: Serge Abiteboul Victor Vianu Jeffrey D. Ullman

Rick Hull doesn’t appear because he has firstname, lastname

Functions in XPath: – text() = matches the text value – node() = matches any node (= * or @* or text()) – name() = returns the name of the current tag

63 Xpath: Predicates

/bib/book/author[first-name] Result: Rick Hull

64 Xpath: More Predicates

/bib/book/author[first-name][address[.//zip][city]]/last-name

Explain how this is evaluated !

65 Xpath: More Predicates

/bib/book/author[first-name][address[.//zip][city]]/last-name

Result:

How do we read this ? First remove all qualifiers (predicates): /bib/book/author/last-name

Then add them one by one:

/bib/book/author[first-name][address]/last-name 66 Xpath: More Predicates

/bib/book[@price < 60]

/bib/book[author/@age < 25]

/bib/book[author/text()]

67 Xpath: More Axes

. means current node /bib/book[.//review]

/bib/book[./review] Same as /bib/book[review]

/bib/book/. /author Same as /bib/book/author

68 Xpath: More Axes

.. means parent node /bib/book/author/../author Same as /bib/book/author

/bib/book[.//first-name/../last-name] Same as /bib/book[.//*[first-name][last-name]]

69 Xpath: Brief Summary bib matches a bib element * matches any element / matches the root element /bib matches a bib element under root bib/paper matches a paper in bib bib//paper matches a paper in bib, at any depth //paper matches a paper at any depth paper|book matches a paper or a book @price matches a price attribute bib/book/@price matches price attribute in book, in bib bib/book[@price<“55”]/author/lastname matches…

70 XQuery

• Based on Quilt, which is based on XML-QL

• Uses XPath to express more complex queries

71 FLWR (“Flower”) Expressions

FOR ... LET... WHERE... RETURN...

72 FOR-WHERE-RETURN

Find all book titles published after 1995: for $x in document("bib.")/bib/book where $x/year/text() > 1995 return $x/title

Result: abc def ghi 73 FOR-WHERE-RETURN

Equivalently (perhaps more geekish) for $x in document("bib.xml")/bib/book[year/text() > 1995] /title return $x

And even shorter: document("bib.xml")/bib/book[year/text() > 1995] /title

74 FOR-WHERE-RETURN

• Find all book titles and the year when they were published: for $x in document("bib.xml")/ bib/book return { $x/title/text() } { $x/year/text() }

Result: abc 1995 def 2002 ghk 1980 FOR-WHERE-RETURN

• Notice the use of “{“ and “}” • What is the result without them ? for $x in document("bib.xml")/ bib/book return $x/title/text() $x/year/text()

76 FOR-WHERE-RETURN

• Notice the use of “{“ and “}” • What is the result without them ? for $x in document("bib.xml")/bib/book return $x/title/text() $x/year/text()

$x/title/text() $x/year/text() $x/title/text() $x/year/text()

$x/title/text() $x/year/text() Nesting For each author of a book published in 1995, list all books she published: for $b in document(“bib.xml”)/bib, $a in $b/book[year/text()=1995]/author return { $a, for $t in $b/book[author/text()=$a/text()]/title return $t }

In the RETURN clause comma concatenates XML fragments78 Result

Jones abc def Smith ghi

79 Aggregates

Find all books with more than 3 authors: for $x in document("bib.xml")/bib/book where count($x/author)>3 return $x

count = a function that counts avg = computes the average sum = computes the sum distinct-values = eliminates duplicates 80 Aggregates

Same thing: for $x in document("bib.xml")/bib/book[count(author)>3] return $x

81 Aggregates

Print all authors who published more than 3 books

for $b in document("bib.xml")/bib, $a in distinct-values($b/book/author/text()) where count($b/book[author/text()=$a])>3 return { $a }

82 Flattening

• “Flatten” the authors, i.e. return a list of (author, title) pairs for $b in document("bib.xml")/bib/book, Result: $x in $b/title/text(), $y in $b/author/text() abc return efg { $x } { $y } abc hkj 83 Re-grouping

• For each author, return all titles of her/ his books Result: for $b in document("bib.xml")/bib efg let $a:=distinct-values($b/book/author/text()) abc for $x in $a klm return . . . . { $x } { for $y in $b/book[author/text()=$x]/title return $y } 84 Re-grouping

• Same thing: for $b in document("bib.xml")/bib, $x in distinct-values($b/book/author/text()) return { $x } { for $y in $b/book[author/text()=$x]/title return $y }

85 SQL and XQuery Side-by-side Find all product names, prices, Product(pid, name, maker, price) sort by price

SELECT x.name, for $x in document(“db.xml”)/db/product/row x.price order by $x/price/text() FROM Product x return ORDER BY x.price { $x/name, $x/price }

SQL XQuery

86 Xquery’s Answer

abc 7 def 23 . . . .

87 SQL and XQuery Side-by-side Product(pid, name, maker, price) Company(cid, name, city, revenues)Find all products made in Seattle for $r in document(“db.xml”)/db, $x in $r/product/row, SELECT x.name $y in $r/company/row FROM Product x, Company y where WHERE x.maker=y.cid $x/maker/text()=$y/cid/text() and y.city=“Seattle” and $y/city/text() = “seattle” return { $x/name } SQL XQuery for $y in /db/company/row[city/text()=“seattle”], Cool $x in /db/product/row[maker/text()=$y/cid/text()] XQuery return { $x/name } 88 123 abc efg …. . . . . . . .

89 SQL and XQuery Side-by-side For each company with revenues < 1M, count how many products with price > $100 they make SELECT y.name, count(*) FROM Product x, Company y WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000 GROUP BY y.cid, y.name for $r in document(“db.xml”)/db, $y in $r/company/row[revenue/text()<1000000] return { $y/name/text() } {count($r/product/row[maker/text()=$y/cid/text()][price/text()>100])} 90 SQL and XQuery Side-by-side Find companies with at least 30 products, and their average price SELECT y.name, avg(x.price) FROM Product x, Company y $r=element WHERE x.maker=y.cid GROUP BY y.cid, y.name HAVING count(*) > 30 for $r in document(“db.xml”)/db, $y in $r/company/row let $p := $r/product/row[maker/text()=$y/cid/text()] $y=collection where count($p) > 30 return { $y/name/text() } avg($p/price/text()) 91 FOR v.s. LET

FOR • Binds node variables  iteration

LET • Binds collection variables  one value

92 FOR v.s. LET

Returns: for $x in /bib/book ... return { $x } ... ... ... let $x := /bib/book Returns: ... return { $x } ... ... ... 93 XQuery

Summary: • FOR-LET-WHERE-RETURN = FLWR

FOR/LET Clauses

List of tuples

WHERE Clause

List of tuples

RETURN Clause

Instance of Xquery data model94 XML in SQL Server 2005

• Create tables with attributes of type XML

• Use Xquery in SQL queries

• Rest of the slides are from: Shankar Pal et al., Indexing XML data stored in a relational database, VLDB’2004

95 CREATE TABLE DOCS ( ID int primary key, XDOC xml)

SELECT ID, XDOC.query(’ for $s in /BOOK[@ISBN= “1-55860-438-3”]//SECTION return {data($s/TITLE)} ') FROM DOCS

96 XML Methods in SQL

• Query() = returns XML data type • Value() = extracts scalar values • Exist() = checks conditions on XML nodes • Nodes() = returns a rowset of XML nodes that the Xquery expression evaluates to

97 Examples

• From here: http://msdn.microsoft.com/library/ default.asp?url=/library/en-us/dnsql90/ /sql2k5xml.asp

98 XML Type

CREATE TABLE docs ( pk INT PRIMARY KEY, xCol XML not null )

99 Inserting an XML Value

INSERT INTO docs VALUES (2, '

XML Schema
Benefits
Features
')

100 Query( )

SELECT pk, xCol.query('/doc[@id = 123]//section') FROM docs

101 Exists( )

SELECT xCol.query('/doc[@id = 123]//section') FROM docs WHERE xCol. ('/doc[@id = 123]') = 1

102 Value( )

SELECT xCol.value( 'data((/doc//section[@num = 3]/title)[1])', 'nvarchar(max)') FROM docs

103 Nodes( )

SELECT nref.value('first-name[1]', 'nvarchar(50)') AS FirstName, nref.value('last-name[1]', 'nvarchar(50)') AS LastName FROM @xVar.nodes('//author') AS R(nref) WHERE nref.exist('.[first-name != "David"]') = 1

104 Nodes( )

SELECT nref.value('@genre', 'varchar(max)') LastName FROM docs CROSS APPLY xCol.nodes('//book') AS R(nref)

105 Internal Storage

• XML is “shredded” as a table • A few important ideas: – Dewey decimal numbering of nodes; store in clustered B- tree indes – Use only odd numbers to allow insertions – Reverse PATH-ID encoding, for efficient processing of postfix expressions like //a/b/c – Add more indexes, e.g. on data values

106

Bad Bugs Nobody loves bad bugs.
Tree Frogs All right-thinking people love tree frogs.

107 108 109 Infoset Table /BOOK[@ISBN = “1-55860-438-3”]/SECTION

SELECT SerializeXML (N2.ID, N2.ORDPATH) FROM infosettab N1 JOIN infosettab N2 ON (N1.ID = N2.ID) WHERE N1.PATH_ID = PATH_ID(/BOOK/@ISBN) AND N1.VALUE = '1-55860-438-3' AND N2.PATH_ID = PATH_ID(BOOK/SECTION) AND Parent (N1.ORDPATH) = Parent (N2.ORDPATH)

110